'Visual' AI models might not see anything at all

Devin Coldewey

Updated 11 July 2024 at 5:47 pm·6-min read

The latest round of language models, like GPT-4o and Gemini 1.5 Pro, are touted as "multimodal," able to understand images and audio as well as text. But a new study makes clear that they don't really see the way you might expect. In fact, they may not see at all.

To be clear at the outset, no one has made claims like "This AI can see like people do!" (Well, perhaps some have.) But the marketing and benchmarks used to promote these models use phrases like "vision capabilities," "visual understanding," and so on. They talk about how the model sees and analyzes images and video, so it can do anything from homework problems to watching the game for you.

So although these companies' claims are artfully couched, it's clear that they want to express that the model sees in some sense of the word. And it does — but kind of the same way it does math or writes stories: matching patterns in the input data to patterns in its training data. This leads to the models failing in the same way they do on certain other tasks that seem trivial, like picking a random number.

A study — informal in some ways, but systematic — of current AI models' visual understanding was undertaken by researchers at Auburn University and the University of Alberta. They tested the biggest multimodal models on a series of very simple visual tasks, like asking whether two shapes overlap, or how many pentagons are in a picture, or which letter in a word is circled. (A summary micropage can be perused here.)

They're the kind of thing that even a first-grader would get right, yet they gave the AI models great difficulty.

"Our seven tasks are extremely simple, where humans would perform at 100% accuracy. We expect AIs to do the same, but they are currently NOT," wrote co-author Anh Nguyen in an email to TechCrunch. "Our message is, 'Look, these best models are STILL failing.'"

The overlapping shapes test is one of the simplest conceivable visual reasoning tasks. Presented with two circles either slightly overlapping, just touching or with some distance between them, the models couldn't consistently get it right. Sure, GPT-4o got it right more than 95% of the time when the circles were far apart, but at zero or small distances, it got it right only 18% of the time. Gemini 1.5 Pro did the best, but still only got 7/10 at close distances.

(The illustrations do not show the exact performance of the models but are meant to show the inconsistency of the models across the conditions. The statistics for each model are in the paper.)
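
For a sense of how simple the overlap question is, the ground truth comes down to comparing one distance with one sum. What follows is a minimal sketch, not the researchers' code: it assumes each circle is described by a center point and a radius, and the function name and tolerance are illustrative choices.

```python
import math

def circle_relation(center_a, radius_a, center_b, radius_b, tol=1e-9):
    """Classify two circles as 'overlapping', 'touching' or 'separate'.

    Hypothetical helper for illustration only. Circles are treated as
    filled discs, so one circle sitting inside another counts as overlapping.
    """
    # Distance between the two centers.
    distance = math.dist(center_a, center_b)
    # Gap between the edges: negative means the discs intersect.
    gap = distance - (radius_a + radius_b)
    if abs(gap) <= tol:
        return "touching"
    return "overlapping" if gap < 0 else "separate"

print(circle_relation((0, 0), 1, (1.5, 0), 1))  # overlapping
print(circle_relation((0, 0), 1, (2, 0), 1))    # touching
print(circle_relation((0, 0), 1, (3, 0), 1))    # separate
```

The three calls mirror the three conditions described above: slightly overlapping, just touching and some distance apart.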

Or how about counting the number of interlocking circles in an image? I bet an above-average horse could do this.

They all get it right 100% of the time when there are five rings, but then adding one ring completely devastates the results. Gemini is lost, unable to get it right a single time. Sonnet-3.5 answers six … a third of the time, and GPT-4o a little under half the time. Adding a seventh ring makes it even harder, but adding an eighth makes it easier for some.

The point of this experiment is simply to show that, whatever these models are doing, it doesn't really correspond with what we think of as seeing. After all, even if they saw poorly, we wouldn't expect six-, seven-, eight- and nine-ring images to vary so widely in success.

The other tasks tested showed similar patterns; it wasn't that they were seeing or reasoning well or poorly, but there seemed to be some other reason why they were capable of counting in one case but not in another.

One potential answer, of course, is staring us right in the face: Why should they be so good at getting a five-circle image correct, but fail so miserably on the rest, or when it's five pentagons? (To be fair, Sonnet-3.5 did pretty well on that.) Because they all have a five-circle image prominently featured in their training data: the Olympic Rings.

This logo is not just repeated over and over in the training data but likely described in detail in alt text, usage guidelines and articles about it. But where in their training data would you find six interlocking rings? Or seven? If their responses are any indication: nowhere! They have no idea what they're "looking" at, and no actual visual understanding of what rings, overlaps or any of these concepts are.

I asked what the researchers think of this "blindness" they accuse the models of having. Like other terms we use, it has an anthropomorphic quality that is not quite accurate but hard to do without.

"I agree, 'blind' has many definitions even for humans and there is not yet a word for this type of blindness/insensitivity of AIs to the images we are showing," wrote Nguyen. "Currently, there is no technology to visualize exactly what a model is seeing. And their behavior is a complex function of the input text prompt, input image and many billions of weights."

He speculated that the models aren't exactly blind but that the visual information they extract from an image is approximate and abstract, something like "there's a circle on the left side." But the models have no means of making visual judgments, making their responses like those of someone who is informed about an image but can't actually see it.

As a last example, Nguyen sent this, which supports the above hypothesis:

When a blue circle and a green circle overlap (as the question prompts the model to take as fact), there is often a resulting cyan-shaded area, as in a Venn diagram. If someone asked you this question, you or any smart person might well give the same answer, because it's totally plausible … if your eyes are closed! But no one with their eyes open would respond that way.

Does this all mean that these "visual" AI models are useless? Far from it. Not being able to do elementary reasoning about certain images speaks to their fundamental capabilities, but not their specific ones. Each of these models is likely going to be highly accurate on things like human actions and expressions, photos of everyday objects and situations, and the like. And indeed that is what they are intended to interpret.

If we relied on the AI companies' marketing to tell us everything these models can do, we'd think they had 20/20 vision. Research like this is needed to show that, no matter how accurate a model may be in saying whether a person is sitting or walking or running, it does so without "seeing" in the sense (if you will) we tend to mean.


