Study shows the best vision language models fail at very basic visual identification tests

Cal Jeffrey

Bottom line: Recent advancements in AI systems have significantly improved their ability to recognize and analyze complex images. However, a new paper reveals that many state-of-the-art vision language models struggle with simple visual tasks that humans find easy, like counting the rows and columns in a grid or the number of times two lines intersect.

Researchers from Auburn University and the University of Alberta recently published a paper titled "Vision language models are blind." The study used eight straightforward visual acuity tests to highlight deficiencies in vision language models (VLMs). The tasks included counting intersecting lines, identifying circled letters, counting nested shapes, and others. These tests have objectively correct answers and require minimal knowledge beyond basic 2D shapes.

To prevent the models from solving these tasks through memorization, the researchers generated the tests with custom code rather than using pre-existing images. They evaluated four VLMs: GPT-4o, Gemini-1.5 Pro, Sonnet-3, and Sonnet-3.5. None of the models achieved perfect accuracy, and performance varied significantly depending on the task.
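The paper does not reproduce its generation code here, but the idea is straightforward: draw simple shapes programmatically so the ground-truth answer is known exactly and the exact image cannot appear in any training set. The sketch below is a hypothetical illustration of that approach (not the researchers' code), assuming matplotlib is available; it draws a few random line segments and computes how many times they cross.

```python
# Hypothetical sketch (not the paper's actual generator) of a procedurally
# created "how many times do the lines cross?" task with a known answer.
import random
import matplotlib.pyplot as plt

def segments_cross(p1, p2, p3, p4):
    """True if segment p1-p2 properly intersects segment p3-p4."""
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    return (cross(p3, p4, p1) * cross(p3, p4, p2) < 0 and
            cross(p1, p2, p3) * cross(p1, p2, p4) < 0)

def make_task(n_segments=3, seed=0, path="task.png"):
    rng = random.Random(seed)
    segs = [((rng.random(), rng.random()), (rng.random(), rng.random()))
            for _ in range(n_segments)]
    # Ground truth comes from geometry, not from hand-labeling.
    answer = sum(segments_cross(*segs[i], *segs[j])
                 for i in range(n_segments)
                 for j in range(i + 1, n_segments))
    fig, ax = plt.subplots(figsize=(3, 3))
    for (x1, y1), (x2, y2) in segs:
        ax.plot([x1, x2], [y1, y2], linewidth=2)
    ax.set_axis_off()
    fig.savefig(path, dpi=150)
    plt.close(fig)
    return answer  # compare this number to the VLM's reply about the image

print(make_task())  # number of crossings in the generated image
```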

For example, the best-performing model could only count the rows and columns in a blank grid with less than 60 percent accuracy. Conversely, Gemini-1.5 Pro approached human-level performance by correctly identifying circled letters 93 percent of the time.

Furthermore, even minor modifications to the tasks resulted in significant performance changes. While all of the models could correctly count five overlapping circles, accuracy dropped below 50 percent when the number of circles increased to six or more. The researchers theorize that the drop might be due to a bias toward the five interlocking rings of the Olympic logo. Some models even gave nonsensical answers, such as "9," "n," or "©," when asked which letter in "Subdermatoglyphic" was circled.

These findings underscore a significant limitation in the ability of VLMs to handle low-level abstract visual tasks. The behavior is reminiscent of similar capability gaps in large language models, which can generate coherent text summaries but fail basic math and spelling questions. The researchers hypothesized that these gaps might stem from the models' inability to generalize beyond their training data. However, fine-tuning a model with specific images from one of the tasks (the two circles touching test) only modestly improved accuracy from 17 to 37 percent, indicating that the model overfits the training set but fails to generalize.

The researchers propose that these capability gaps in VLMs might be due to the "late fusion" approach of integrating vision encoders onto pre-trained language models. They suggest that an "early fusion" method, combining visual and language training from the beginning, could improve performance on low-level visual tasks. However, they did not provide an analysis to support this suggestion.
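For readers unfamiliar with the term, "late fusion" describes the common pattern of attaching a separately trained vision encoder to an existing language model through a small learned adapter. The sketch below is a generic, hypothetical illustration of that pattern; the module names are placeholders, not the architecture of GPT-4o, Gemini, or Claude.

```python
# Generic late-fusion pattern (hypothetical placeholder modules): image
# features are projected into the token space of a language model that was
# pre-trained on text only.
import torch
import torch.nn as nn

class LateFusionVLM(nn.Module):
    def __init__(self, vision_encoder, language_model, vision_dim, text_dim):
        super().__init__()
        self.vision_encoder = vision_encoder   # pre-trained, often frozen
        self.language_model = language_model   # pre-trained on text alone
        # The only new, jointly trained piece is this small adapter.
        self.projection = nn.Linear(vision_dim, text_dim)

    def forward(self, image, text_embeddings):
        image_features = self.vision_encoder(image)      # (B, N, vision_dim)
        image_tokens = self.projection(image_features)   # (B, N, text_dim)
        # Image tokens are simply prepended to the text sequence; the language
        # model itself never learned to "see" during pre-training, which is
        # the gap an early-fusion approach would try to close.
        fused = torch.cat([image_tokens, text_embeddings], dim=1)
        return self.language_model(fused)
```

In an early-fusion design, by contrast, image and text tokens would be fed to a single model trained on both modalities from the start, rather than bolted together after the fact.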

You can view the results and other examples on the team's website.


 
"The researchers hypothesized that these gaps might stem from the models' inability to generalize beyond their training data".

You cannot generalize your way to understanding or logical deduction.
Training an enormous model on the entirety of Wikipedia is not going to make an AI smart.
Sure, you get a lot of copycat 'knowledge', but an AI doesn't understand ****.
It's just guessing (by having looked at an unfathomable number of examples) the most probable sequence of words. When it has to come up with something 'out of the box', or based on rules rather than knowledge, it fails spectacularly.
It just reacts, immediately spitting out an output for a given input; there is no thought process, no reflection, no meaning.

So, AI thinking like humans in a few years? Forget about it. We'll be stuck in this state, with ever-increasing model complexity but missing the core essence of what it is to be human, until we've had at least a few more revolutionary breakthroughs.
 
I feel that, fundamentally, this idea of just using ever more data to train models is never going to lead to true AI. Absurd levels of data and energy have already been used, and we are still getting braindead results showing these models are not learning anything. The AI models are not working like a human brain at all; IMO, neuromorphic computing is the only way to truly evolve this field. It is also the only way we will slash the stupendously large amount of energy that the current AI approach championed by Nvidia is using.
 
The AI we have today is missing exactly the 'AI' part of the equation. It's more like an interactive database that can do certain mathematical deductions based on input, and for that it requires a lot of energy, time, and processing power. This path, even if it looks amazing on paper, will not be the final product; it needs a breakthrough, otherwise it will not get there within our lifetimes.

Let's take an example: once a human baby starts to walk and inspect the world, you show it a traffic light once or twice and it will be able to recognize that traffic light even in a forest, or even if only part of it is visible. That's the whole training; all it needs is to see it once or twice. That is not something you can code yet, because as we speak we don't have the slightest clue how the brain works and how it evolves over time, not to mention that the people with actual knowledge in that field are not AI programmers. Admittedly, we don't know **** yet about how even the tiniest of bugs are able to do what they do, like ants, which have a brain about 15% the size of their body. If we were able to reprogram an ant's brain, it would probably be more powerful than all the so-called AIs that we have right now.
 