Study shows the best vision language models fail at very basic visual identification tests

Cal Jeffrey

Bottom line: Recent advancements in AI systems have significantly improved their ability to recognize and analyze complex images. However, a new paper reveals that many state-of-the-art vision language models struggle with simple visual tasks that humans find easy, like counting the rows and columns in a grid or the number of times two lines intersect.

Researchers from Auburn University and the University of Alberta recently published a paper titled "Vision language models are blind." The study used eight straightforward visual acuity tests to highlight deficiencies in vision language models (VLMs). The tasks included counting intersecting lines, identifying circled letters, counting nested shapes, and others. These tests have objectively correct answers and require minimal knowledge beyond basic 2D shapes.

To prevent the models from solving these tasks through memorization, the researchers generated the tests with custom code rather than using pre-existing images. They evaluated four VLMs: GPT-4o, Gemini-1.5 Pro, Sonnet-3, and Sonnet-3.5. None of the models achieved perfect accuracy, and performance varied significantly depending on the task.
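The paper does not reproduce its generation code here, but the idea is straightforward: draw simple shapes programmatically so the ground-truth answer is known exactly and the exact image cannot appear in any training set. The sketch below is a hypothetical illustration of that approach (not the researchers' code), assuming matplotlib is available; it draws a few random line segments and computes how many times they cross.

```python
# Hypothetical sketch (not the paper's actual generator) of a procedurally
# created "how many times do the lines cross?" task with a known answer.
import random
import matplotlib.pyplot as plt

def segments_cross(p1, p2, p3, p4):
    """True if segment p1-p2 properly intersects segment p3-p4."""
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    return (cross(p3, p4, p1) * cross(p3, p4, p2) < 0 and
            cross(p1, p2, p3) * cross(p1, p2, p4) < 0)

def make_task(n_segments=3, seed=0, path="task.png"):
    rng = random.Random(seed)
    segs = [((rng.random(), rng.random()), (rng.random(), rng.random()))
            for _ in range(n_segments)]
    # Ground truth comes from geometry, not from hand-labeling.
    answer = sum(segments_cross(*segs[i], *segs[j])
                 for i in range(n_segments)
                 for j in range(i + 1, n_segments))
    fig, ax = plt.subplots(figsize=(3, 3))
    for (x1, y1), (x2, y2) in segs:
        ax.plot([x1, x2], [y1, y2], linewidth=2)
    ax.set_axis_off()
    fig.savefig(path, dpi=150)
    plt.close(fig)
    return answer  # compare this number to the VLM's reply about the image

print(make_task())  # number of crossings in the generated image
```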

For example, the best-performing model could only count the rows and columns in a blank grid with less than 60 percent accuracy. Conversely, Gemini-1.5 Pro approached human-level performance by correctly identifying circled letters 93 percent of the time.

Furthermore, even minor modifications to the tasks resulted in significant performance changes. While all of the models could correctly count five overlapping circles, accuracy dropped below 50 percent when the number of circles increased to six or more. The researchers theorize that the drop might be due to a bias toward the five interlocking rings of the Olympic logo. Some models even gave nonsensical answers, such as "9," "n," or "©," when asked which letter in "Subdermatoglyphic" was circled.

These findings underscore a significant limitation in the ability of VLMs to handle low-level abstract visual tasks. The behavior is reminiscent of similar capability gaps in large language models, which can generate coherent text summaries but fail basic math and spelling questions. The researchers hypothesized that these gaps might stem from the models' inability to generalize beyond their training data. However, fine-tuning a model with specific images from one of the tasks (the two circles touching test) only modestly improved accuracy from 17 to 37 percent, indicating that the model overfits the training set but fails to generalize.

The researchers propose that these capability gaps in VLMs might be due to the "late fusion" approach of integrating vision encoders onto pre-trained language models. They suggest that an "early fusion" method, combining visual and language training from the beginning, could improve performance on low-level visual tasks. However, they did not provide an analysis to support this suggestion.
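For readers unfamiliar with the term, "late fusion" describes the common pattern of attaching a separately trained vision encoder to an existing language model through a small learned adapter. The sketch below is a generic, hypothetical illustration of that pattern; the module names are placeholders, not the architecture of GPT-4o, Gemini, or Claude.

```python
# Generic late-fusion pattern (hypothetical placeholder modules): image
# features are projected into the token space of a language model that was
# pre-trained on text only.
import torch
import torch.nn as nn

class LateFusionVLM(nn.Module):
    def __init__(self, vision_encoder, language_model, vision_dim, text_dim):
        super().__init__()
        self.vision_encoder = vision_encoder   # pre-trained, often frozen
        self.language_model = language_model   # pre-trained on text alone
        # The only new, jointly trained piece is this small adapter.
        self.projection = nn.Linear(vision_dim, text_dim)

    def forward(self, image, text_embeddings):
        image_features = self.vision_encoder(image)      # (B, N, vision_dim)
        image_tokens = self.projection(image_features)   # (B, N, text_dim)
        # Image tokens are simply prepended to the text sequence; the language
        # model itself never learned to "see" during pre-training, which is
        # the gap an early-fusion approach would try to close.
        fused = torch.cat([image_tokens, text_embeddings], dim=1)
        return self.language_model(fused)
```

In an early-fusion design, by contrast, image and text tokens would be fed to a single model trained on both modalities from the start, rather than bolted together after the fact.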

You can view the results and other examples on the team's website.


 
"The researchers hypothesized that these gaps might stem from the models' inability to generalize beyond their training data".

You cannot generalize your way to understanding or logical deduction.
Training an enormous model on the entirety of Wikipedia is not going to make an AI smart.
Sure, you get a lot of copycat 'knowledge', but an AI doesn't understand ****.
It's just guessing (by having looked at an unfathomable number of examples) the most probable sequence of words. When it has to come up with something 'out of the box', or based on rules rather than knowledge, it fails spectacularly.
It just reacts, immediately spitting out an output for a given input; there is no thought process, no reflection, no meaning.

So, AI thinking like humans in a few years? Forget about it. We'll be stuck in this state, with ever-increasing model complexity but missing the core essence of what it is to be human, until we've had at least a few more revolutionary breakthroughs.
 
I feel that, fundamentally, this idea of just using ever more data to train models is never going to lead to true AI. Absurd levels of data and energy have already been used, and we are still getting braindead results showing these models are not learning anything. The AI models are not working like a human brain at all; IMO, neuromorphic computing is the only way to truly evolve this field. It is also the only way we will slash the stupendously large amount of energy that the current AI approach championed by Nvidia is using.
 
The AI we have today is missing exactly the 'AI' part of the equation. It's more like an interactive database that can do certain mathematical deductions based on input, and for that it requires a lot of energy, time, and processing power. This path, even if it looks amazing on paper, will not be the final product; it needs a breakthrough, otherwise it will not get there within our lifetimes.

Let's take an example: once a human baby starts to walk and inspect the world, you show it a traffic light once or twice and it will be able to recognize that traffic light even in a forest, or even if only part of it is visible. That's the whole training; all it needs is to see it once or twice. That is not something you can code yet, because as we speak we don't have the slightest clue how the brain works and how it evolves over time, not to mention that the people with actual knowledge in that field are not AI programmers. Admittedly, we don't know **** yet about how even the tiniest of bugs are able to do what they do, like ants, which have a brain about 15% the size of their body. If we were able to reprogram an ant's brain, it would probably be more powerful than all the so-called AIs that we have right now.
 