The advances in AI this past year have been astonishing, and to me one of the most impressive has been the recent release of ChatGPT-4 Vision (“GPT-4V”). If you haven’t seen this yet, sit back and enjoy this article. It’s an incredible journey.
This is the first example I saw. McKay Wrigley gave GPT-4V a screenshot of a SaaS dashboard, and it wrote the code to create a similar version. It wasn’t perfect – but it was absolutely mind-blowing. GPT-4V was able to process all of the different elements of this dashboard image, understand the components, and write code to create something very similar. Think about that for a minute. Understanding a complex picture. You can see the link here:
https://twitter.com/mckaywrigley/status/1707047423863136687?s=20
Every day, people are putting out new GPT-4V examples, and someone compiled an initial list of 30 significant use cases. Everything from showing a picture of a human cell and having GPT-4V identify all the parts, to showing super complex parking signs and figuring out when you can park, to solving complex math problems from an image, to deciphering incredibly dense and detailed PowerPoint slides, and on and on. Here’s a link to 30 impressive use cases:
https://twitter.com/sairahul1/status/1708704613141270652?s=46&t=cqRrOAeItfq6MsNwv2NGEA
A little over a week ago, a 160+ page whitepaper detailing a ton of use cases and experiments for GPT-4V was released. It was an incredibly thorough analysis of different use cases and things we hadn’t seen yet: how to use GPT-4V to analyze medical images, process car accident images for insurance, do multi-object identification, read emotions from facial expressions, and more. This is an incredibly dense document, but I would strongly encourage you to read the whole thing rather than skim it. The paper is here: https://arxiv.org/abs/2309.17421
THE EXPERIMENT
After just over two long weeks of waiting, I finally got access to GPT-4V, and I decided it was time for an experiment: not only to test what GPT-4V could do, but also to compare it to a couple of other models that can do image recognition. I wanted to push the models in different ways to see how flexible and robust they were. So, in this analysis, I’m testing three models: GPT-4V, Bard, and the newly released open-source LLaVA (Large Language and Vision Assistant).
I have captured five different pictures, and I’m going to test them on each of the models using the same prompt. I made sure that none of the pictures had any EXIF data, and I started each query as a new conversation, so the model had no history of the pictures from prior queries.
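For anyone who wants to repeat the EXIF step, here is a minimal sketch of one way to do it for JPEG files using only the Python standard library. This is an assumption about how the step could be done, not a record of the tool actually used for these pictures. The function name strip_exif is made up for illustration, and the sketch only removes EXIF segments (it leaves other metadata, such as XMP, in place).

```python
import struct

SOI = b"\xff\xd8"             # start-of-image marker that opens every JPEG
APP1 = 0xE1                   # segment type that carries EXIF data
SOS = 0xDA                    # start-of-scan: compressed image data follows
EXIF_HEADER = b"Exif\x00\x00"  # identifies an APP1 segment as EXIF


def strip_exif(jpeg_bytes: bytes) -> bytes:
    """Return a copy of a JPEG with its EXIF APP1 segments removed.

    Hypothetical helper for illustration; assumes a well-formed JPEG.
    """
    if not jpeg_bytes.startswith(SOI):
        raise ValueError("not a JPEG: missing SOI marker")

    out = bytearray(SOI)
    pos = 2
    while pos + 4 <= len(jpeg_bytes):
        if jpeg_bytes[pos] != 0xFF:
            raise ValueError(f"expected a marker at byte {pos}")
        marker = jpeg_bytes[pos + 1]
        if marker == SOS:
            # Everything from here on is image data; copy it untouched.
            out += jpeg_bytes[pos:]
            return bytes(out)
        # The segment length is big-endian and counts its own two bytes.
        (length,) = struct.unpack(">H", jpeg_bytes[pos + 2:pos + 4])
        end = pos + 2 + length
        payload = jpeg_bytes[pos + 4:end]
        is_exif = marker == APP1 and payload.startswith(EXIF_HEADER)
        if not is_exif:
            out += jpeg_bytes[pos:end]
        pos = end
    raise ValueError("truncated JPEG: no start-of-scan marker found")
```

To use it, read a photo as bytes, pass it through strip_exif, and write the result to a new file before uploading. Removing EXIF matters here because camera metadata can include GPS coordinates and dates, which could otherwise give a model hints about location or age that are not visible in the picture itself.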
Test #1: Porsche 911 Turbo S
This is a picture of a 2015 Porsche 911 Turbo S (991.1 generation). Let’s see how the AI models do. The query to all of them is: “Can you identify the make, model, and year of this car.” The text pasted below is the exact text returned by the models, with no grammar edits.
GPT-4V Results:
Make: Porsche
Model: 911 GT3
Year: Difficult to pinpoint the exact year from the image alone. However, the design cues suggest it's from the 991.2 generation, which would place it between 2016-2019.
For an accurate year, a closer inspection of specific features or access to the vehicle's VIN would be necessary.
Bard Results:
I am not able to tell the make, model, and year of the car in the image you sent with 100% certainty.
However, based on the image, I can tell you that the car is most likely a 2015-2018 Porsche 911 Turbo S.
Here are some of the features that led me to this conclusion:
The car has a long, sloping hood and a short rear deck, which is typical of Porsche 911 models.
The car has a large rear spoiler, which is a signature feature of the Porsche 911 Turbo S model.
The car has black wheels with red calipers, which are also a signature feature of the Porsche 911 Turbo S model.
The car has a black exterior with a red interior, which is a popular color combination for the Porsche 911 Turbo S model.
However, it is important to note that there are other Porsche 911 models that share some of these same features. For example, the Porsche 911 Carrera S also has a long, sloping hood, a short rear deck, and a large rear spoiler. However, the Carrera S model does not have black wheels with red calipers or a red interior.
Therefore, the most likely make, model, and year of the car in the image is a 2015-2018 Porsche 911 Turbo S.
LLaVA Results:
The car in the image is a Porsche 911 Turbo, which is a sports car. The specific model and year cannot be determined without more information, but it is a modern sports car with a convertible top.
My thoughts:
GPT-4V tagged it as a GT3 model, which is slightly different (more track-focused) than the Turbo S. It also estimated that it was a 991.2 generation rather than 991.1. So it was close, but not exactly on the mark.
Bard did the best job of estimating the make, model, and year. The car is a 2015 model, which falls within Bard’s 2015-2018 range, and Bard also gave the best supporting description.
LLaVA did a very basic identification of the car but didn’t provide any information on the year.
My pick for a winner: Bard
Test #2: Saint Bernard
This is a photo of Ollie, my 5.5-year-old Saint Bernard, taken at our private residence in Southern California. The question for each system is: what type of breed is this, how old do they think this one is, how long does this breed typically live, and if they can determine where the picture might have been taken. The query is: “Can you identify this breed of dog? Once you have identified the breed, can you estimate his age? Can you also tell me how long these breeds typically live? Last – can you tell me anything about where you think this photo may have been taken?”
GPT-4V Results:
Dog Breed Identification: The dog appears to be a Saint Bernard based on its distinctive color pattern, size, and facial features.
Age Estimation: Estimating age from a photo is tricky without specific age-related cues.
However, considering the fur, facial structure, and overall appearance:
The dog seems mature but not very old. A rough estimate would be 4-7 years.
Typical Lifespan: Saint Bernards typically have a lifespan of 8-10 years, although with proper care and health considerations, some can live longer.
Photo Location Insights:
The dog is on a paved path or patio with natural stone tiles.
There's a metal gate or fence and lush green vegetation.
Given the variety of plants and the quality of the stone path:
This could be a private residence or a park.
The environment appears temperate, suggesting it might be in a temperate zone, perhaps in North America or Europe.
However, such a setting could be found in many parts of the world, so it’s hard to pin down a specific location.
Summary:
Dog Breed: Saint Bernard
Estimated Age: 4-7 years
Lifespan: 8-10 years
Possible Location: A temperate zone, potentially North America or Europe, but specific location remains ambiguous.
Bard Results:
Image description: A large brown and white dog sitting on steps. The dog has a thick coat of fur, a large head, and a long tail.
Breed identification: St. Bernard
Age estimation: Based on the dog's size and appearance, I would estimate that he is about 3 years old.
Typical lifespan: St. Bernards typically live for 8-10 years.
Photo location: Based on the dog's long coat and the fact that he is sitting on steps, I would guess that the photo was taken in a cold climate. Possibly in the Alps or in the mountains of Switzerland.
Additional information: St. Bernards are known for their gentle and loving nature. They are also very intelligent and can be trained to perform a variety of tasks. St. Bernards are often used as working dogs in search and rescue operations.
LLaVA Results:
The dog in the image is a St. Bernard. St. Bernards are a large and powerful breed, known for their friendly and gentle temperament. They are typically characterized by their massive size, strong build, and distinctive appearance, including a wrinkled face and a thick, dense coat.
It is difficult to accurately estimate the dog's age from the image alone, but St. Bernards are known to have a lifespan of around 8 to 10 years.
The photo appears to have been taken outdoors, possibly in a park or a garden, as the dog is sitting on a stone or concrete surface. The presence of a fence in the background suggests that the location might be a public space or a private property with a fenced-in area.
My thoughts:
GPT-4V correctly identified the dog, and the age of Ollie (5.5) is right in the range it estimated. The lifespan info was correct, and it made a reasonable guess as to the type of climate (temperate) that the photo was taken in.
Bard correctly identified the dog, although it was off a bit on the age. It also made an incorrect assumption that the photo was taken in a cold climate (the Alps) based on his fur coat (this is actually his summer coat).
LLaVA correctly identified the breed. It did not make any attempt to estimate the age, and it guessed that the photo was taken in either a public space or a private property with a fenced-in area (it is in fact a private residence, and it is fenced in).
My pick for the winner: GPT-4V
Test #3: Oscar Mayer Wienermobile
This is a photo I took recently while driving through the middle of California. It’s a unique and well-known vehicle, and I would expect all of the models to do well. Here’s the query: “What do you see in the picture, and what is it known for?”
GPT-4V Results:
Description: The image shows the "Wienermobile," a unique and iconic vehicle designed to resemble a hot dog on a bun.
Brand: The vehicle prominently displays the "Oscar Mayer" logo, indicating its affiliation with the Oscar Mayer brand, known for its hot dogs and other meat products.
Historical Context: The Wienermobile was first introduced in 1936 and has since become a recognizable symbol of the Oscar Mayer brand. Over the years, multiple versions of the Wienermobile have been produced, and they have been driven across the U.S. to promote the brand.
Significance in Marketing: Attention-grabbing: Its distinctive design immediately draws attention, making it an effective mobile advertisement.
Brand Recognition: The Wienermobile has helped solidify Oscar Mayer's position in consumers' minds as a leading producer of hot dogs.
Engagement: The vehicle is often present at events, parades, and other public gatherings, offering brand engagement opportunities such as photo ops and giveaways.
Nostalgia: For many, the Wienermobile evokes memories and has become a piece of American pop culture, enhancing the brand's emotional connection with consumers.
Bard Results:
The image you sent me shows a hot dog shaped vehicle driving down a highway. This is a Wienermobile, a promotional vehicle used by the Oscar Mayer company to advertise its hot dogs. The Wienermobile is one of the most iconic and recognizable promotional vehicles in the world, and it has been delighting people of all ages for over 80 years.
The Wienermobile is known for its unique and whimsical design, its bright and cheerful colors, and its friendly and outgoing drivers. The Wienermobile is also known for its many appearances in parades, festivals, and other special events.
The Wienermobile is a beloved American icon, and it is a symbol of fun, happiness, and nostalgia for many people. It is also a reminder of the power of creative marketing and the importance of bringing joy to people's lives.
LLaVA Results:
The image features a large hot dog on a bun, which is known as a giant hot dog or a foot-long hot dog. It is being transported on a car, possibly a Volkswagen, as it travels down the road. The hot dog is quite long, extending from the front of the car to the back, and it is a unique and eye-catching sight.
My thoughts:
GPT-4V correctly identified the Wienermobile and provided a lot of good detail about it (brand, history, marketing significance, engagement, etc.).
Bard also correctly identified the Wienermobile and provided a short summary of additional information.
LLaVA’s answer made me laugh. It did not correctly identify this as the Wienermobile, but instead assumed it was a super-sized hot dog on a bun being transported by a car.
My pick for the winner: GPT-4V for providing additional info, but Bard was a close second in this one.
Test #4: Poster of Wine Words
This is a poster with a lot of words about drinking wine on it. I want to do a test to see how well each of the services does in terms of OCR. The query is: “Please OCR the words you see on this poster.”
GPT-4V Results:
Based on the given image, the extracted text is:
W
1 20/200 blotto
2 20/100 soused
3 20/70 toasted
4 20/50 lubricated
5 20/40 tipsy
6 20/30 buzzed
7 20/25 needy sober
8 20/20
WINE
WINO
WHY NO
WINE OH
WHINE OH
WINE NO
WINE KNOW
9 20/15 WINEAUX pretentious
10 20/10 WHY KNOW cultivated
11 20/5 WHY NOT willing
Made in USA
Wino Eye Chart
© jan dickson collection
Bard Results:
“I’m not programmed to assist with that.”
LLaVA Results:
Wine, Wine, Why No Wine, Oh, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, Wine, W
My thoughts:
I’ll be honest: this one was easy to call (and surprising). GPT-4V did the best overall, identifying all of the text from the image (although the formatting of the response was slightly odd). Bard refused to answer, and LLaVA said nothing but “wine.”
My pick for the winner: GPT-4V
Test #5: Bottles of Wine
This is an image of a bunch of wine bottles from my bar. I want to see what each of the AIs recognizes in the photo overall and whether it can give me any information on what it finds. The query is: “What do you see in this photo, and for any objects you can identify, can you tell me where I might buy it, and what the price might be?”
The wines from left to right are:
Caymus Vineyards 2016 Cabernet
Jonata 2015
Barrack Family Estate 2019
Sanguis Songs without words
Sanguis Petit Verdot Cabernet
Nervi Gattinara 2015
GPT-4V Results:
Based on the image, the following items have been identified:
Caymus Vineyards:
This appears to be a bottle of wine from Caymus Vineyards, possibly their Cabernet Sauvignon.
Jonata:
Another bottle of wine, labeled as "Jonata".
Barrack Family Estate X Ten-Goal 2019:
A wine bottle from Barrack Family Estate, dated 2019.
Sanguis:
Yet another wine bottle, this one labeled as "Sanguis".
NERV Gattinara 2015:
A wine bottle labeled as "NERV Gattinara" from the year 2015.
Several other bottles and containers are present but are either obscured or not clearly labeled, making them difficult to identify precisely. Now, I will search for potential purchasing locations and prices for the identified items. Please give me a moment.
Using the web browsing plugin, I researched the wines you've shown in the photo. Here's what I found:
Caymus Vineyards 2019 Cabernet Sauvignon:
Purchasing Location: Available at major wine retailers, including Wine.com, Total Wine, and local wine stores.
Price Range: $80-$100 per bottle.
Jonata:
Purchasing Location: This brand produces a variety of wines, so it's available in many specialty wine shops and online retailers.
Price Range: Prices vary depending on the specific variety, but they often range from $60-$150.
Barrack Family Estate X Ten-Goal 2019:
Purchasing Location: This seems to be a lesser-known brand. You might have luck at specialty wine shops or direct from the winery's website.
Price Range: Couldn't find a specific price for this vintage.
Sanguis:
Purchasing Location: Available at specialty wine shops and online retailers.
Price Range: Prices can range widely from $50-$200 depending on the variety and vintage.
NERV Gattinara 2015:
Purchasing Location: Specialty wine shops, online wine retailers.
Price Range: $30-$50 per bottle.
For the best prices and availability, I'd recommend checking local wine stores or larger online wine retailers. They often have sales and might offer discounts for buying in bulk. Remember, prices can vary based on region, taxes, and current market demand.
Bard Results:
I see the following bottles of wine in the photo:
NERVI AYMUS NEYARDS 2010 Cabernet Sauvignon Napa Valley, produced by Cult Wine
JONATA 2019 Barrack Family Estate Napa Valley
TEN-OOAT 2015 Sangiovese
SANGU 2015 Gattinara, produced by Conterno Fantino
These wines are all high-end wines, and their prices vary depending on the retailer. However, I was able to find the following pricing information online:
NERVI AYMUS NEYARDS 2010 Cabernet Sauvignon Napa Valley: $1,200-$1,500
JONATA 2019 Barrack Family Estate Napa Valley: $300-$350
TEN-OOAT 2015 Sangiovese: $50-$60
SANGU 2015 Gattinara: $70-$80
You can purchase these wines from a variety of retailers, including online retailers, wine shops, and restaurants. I recommend doing some research to find the best price and retailer for your needs.
Here are some specific retailers where you can purchase these wines:
NERVI AYMUS NEYARDS 2010 Cabernet Sauvignon Napa Valley:
Wine.com
Wine Enthusiast
K&L Wine Merchants
JONATA 2019 Barrack Family Estate Napa Valley:
Wine.com
Vivino
Total Wine & More
TEN-OOAT 2015 Sangiovese:
Kermit Lynch Wine Merchant
MacArthur Beverages
The Wine House
SANGU 2015 Gattinara:
Chambers Street Wines
Wine Access
The Rare Wine Company
Please note that these are just a few examples, and there are many other retailers where you can purchase these wines. I recommend doing some research to find the best price and retailer for your needs.
LLaVA Results:
The image features a counter with a variety of wine bottles and liquor bottles lined up. There are at least nine wine bottles and one liquor bottle visible in the scene. The wine bottles are of different sizes and shapes, and they are arranged in a row. The liquor bottle is positioned slightly apart from the wine bottles.
As for where you might be able to buy these bottles, you can typically find them at a liquor store, wine shop, or even at a grocery store with a dedicated section for wine and liquor. The price of the bottles can vary greatly depending on the type of wine or liquor, the brand, and the region it comes from. For example, a bottle of wine from a well-known winery or a premium liquor brand can cost anywhere from $20 to $100 or more.
My thoughts:
None of them were perfect, but GPT-4V was far superior to the other two AI models in identifying the wines and providing options to purchase them. It looks like Bard has some serious hallucination going on in the wine names, and LLaVA didn’t really even become a contender.
My pick for the winner: GPT-4V
FINAL THOUGHTS
Overall, in this simple and non-scientific set of examples, GPT-4V provided far superior results to the other two models, winning four of the five tests (Bard took the first). Based on everything I’ve read and the current capabilities of GPT-4V, I’m not at all surprised. Every day, I see new use cases for this technology that are absolutely incredible.
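To make the scorecard explicit, here is a small sketch that tallies my pick for each test. The winners are copied from the five tests above. The variable names are just for illustration, and with only five subjective picks this is a summary of my opinions, not a benchmark.

```python
from collections import Counter

# My pick for the winner of each test, as recorded in the write-up above.
winners = {
    "Test #1: Porsche 911 Turbo S": "Bard",
    "Test #2: Saint Bernard": "GPT-4V",
    "Test #3: Oscar Mayer Wienermobile": "GPT-4V",
    "Test #4: Wine poster (OCR)": "GPT-4V",
    "Test #5: Bottles of Wine": "GPT-4V",
}

# Count how many tests each model won.
tally = Counter(winners.values())

# List every model tested, including any that won nothing.
for model in ("GPT-4V", "Bard", "LLaVA"):
    print(f"{model}: {tally[model]} of {len(winners)} tests")
```

A Counter returns zero for a model that never appears, which is why LLaVA can be printed without being in the winners list at all.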
Greg Kamradt put together an excellent framework that brings together the 100+ use cases that have been published so far.
The link to his article is here:
https://twitter.com/GregKamradt/status/1711772496159252981
If you have access to GPT-4V, I would strongly encourage you to go play with it after you read through the various examples highlighted above (please go read that 160+ page document – it will really get you thinking about what this can do).
This is just the first in a series of tests I’ll do with this new technology. I’ve only had access to it for a few hours, but I wanted to share what I’m finding.
As always, it ends with a picture of Ollie, chilling by the jacuzzi.
Hope you enjoyed this quick summary.
Best,
Steve
Note that in test #4, Bard said it could not OCR the poster. Afterward, I changed the prompt (which I did not do during the testing, as I wanted to use the same prompt for all of the models) and got the following results:
20/200
W WINE
2 20/100
soused
toasted
lubricated
3
20/70
WINO
4 20/50
5
WHY NO
20/40
WINE OH
tipsy
20/30
6
WINE NO
buzzed
7
20/25
8
20/20
9
20/15
10 20/10
-11
20/5
WHINE
OH
WINE KNOW
W
I
N
E
A U
N
needy
sober
pretentious
cultivated
Wine Eye Chart
blotto
eon dowson colection 2
So Bard clearly has some OCR capability, but the prompt matters (although GPT-4V and LLaVA interpreted the original prompt correctly).