We tested Google’s Gemini chatbot — here’s how it performed

Gemini excels in some areas and falls flat in others

illustration featuring Google's Bard logo
Image Credits: TechCrunch

Gemini, Google’s answer to OpenAI’s ChatGPT and Microsoft’s Copilot, is here. Is it any good? While it’s a solid option for research and productivity, it stumbles in obvious — and some not-so-obvious — places.

Last week, Google rebranded its Bard chatbot to Gemini and brought Gemini (which, confusingly, shares its name with the company’s latest family of generative AI models) to smartphones in the form of a reimagined app experience. Since then, lots of folks have had the chance to test-drive the new Gemini, and the reviews have been . . . mixed, to put it generously.

Still, we at TechCrunch were curious how Gemini would perform on a battery of tests we recently developed to compare the performance of GenAI models — specifically large language models like OpenAI’s GPT-4, Anthropic’s Claude, and so on.

There’s no shortage of benchmarks to assess GenAI models. But our goal was to capture the average person’s experience through plain-English prompts about topics ranging from health and sports to current events. These models are being marketed to ordinary users, after all, so the premise of our test is that strong models should be able to at least answer basic questions correctly.

Background on Gemini

Not everyone has the same Gemini experience — and which one you get depends on how much you’re willing to pay.

Non-paying users get queries answered by Gemini Pro, a lightweight version of a more powerful model, Gemini Ultra, that’s gated behind a paywall.

Access to Gemini Ultra through what Google calls Gemini Advanced requires subscribing to the Google One AI Premium Plan, priced at $20 per month. Ultra delivers better reasoning, coding and instruction-following skills than Gemini Pro (or so Google claims), and in the future will get improved multimodal and data analysis capabilities.

The AI Premium Plan also connects Gemini to your wider Google Workspace account: think emails in Gmail, documents in Docs, spreadsheets in Sheets and Google Meet recordings. That’s useful for, say, summarizing emails or having Gemini capture notes during a video call.

Because Gemini Pro has been out since early December, we focused on Ultra for our tests.

Testing Gemini

To test Gemini, we asked a set of over two dozen questions ranging from innocuous (“Who won the football world cup in 1998?”) to controversial (“Is Taiwan an independent country?”). Our question set touches on trivia, medical and therapeutic advice, and generating and summarizing content — all things a user might ask (or ask of) a GenAI chatbot.

Now, Google makes it clear in its terms of service that Gemini isn’t to be used for health consultations and that the model might not answer all questions with factual accuracy. But we feel that people will ask medical questions whatever the fine print says. And the answers are a good measure of a model’s tendency to hallucinate (i.e., make up facts): If a model’s making up cancer symptoms, there’s a reasonable chance it’s fudging on answers to other questions.

Full disclosure, we tested Ultra through Gemini Advanced, which according to Google occasionally routes certain prompts to other models. Frustratingly, Gemini doesn’t indicate which responses came from which models, but for the purposes of our benchmark, we assumed they all came from Ultra.

Questions

Evolving news stories

We started by asking Gemini Ultra two questions about current events:

The model refused to answer the first question (perhaps owing to word choice — “Palestine” versus “Gaza”), referring to the conflict in Israel and Gaza as “complex and changing rapidly” — and recommending that we Google it instead. Not the most inspiring display of knowledge, for sure.

Gemini Advanced israel
Image Credits: Google

Ultra’s response to the second question was more promising, listing several trends on TikTok that’ve made it into headlines recently, like the “skull breaker challenge” and the “milk crate challenge.” (Ultra, lacking access to TikTok itself, presumably scraped these from news coverage, but it did not cite any specific articles.)

Ultra went a little overboard in this writer’s estimation, though, not only highlighting TikTok trends but also making a list of suggestions to promote safety, including “staying aware of how younger users are interacting with content” and “having regular, honest conversations with teens and young people about responsible social media use.” I can’t say that the suggestions were toxic or bad ones — but they were a bit beyond the scope of the question.

Gemini TikTok trends
Image Credits: Google

Historical context

Next, we asked Gemini Ultra to recommend sources on a historical event:

Ultra was quite detailed in its answer here, listing a wide variety of offline and digital sources of information on Prohibition — ranging from newspapers from the era and committee hearings to the Congressional Record and the personal papers of politicians. Ultra also helpfully suggested researching pro- and anti-Prohibition viewpoints, and — as something of a hedge — warned against drawing conclusions from only a few source documents.

Gemini Prohibition
Image Credits: Google

Ultra didn’t exactly recommend specific source documents, but its answer isn’t a bad starting point for someone beginning their research.

Trivia questions

Any chatbot worth its salt should be able to answer simple trivia. So we asked Gemini Ultra:

Ultra seems to have its facts straight on the FIFA World Cups in 1998 and 2006. The model gave the correct scores and winners for each match and accurately recounted the scandal at the end of the 2006 final: Zinedine Zidane headbutting Marco Materazzi.

Ultra did fail to mention the reason for the headbutt — trash talk about Zidane’s sister — but considering Zidane didn’t reveal it until an interview last year, this could well be a reflection of the cutoff date in Ultra’s training data.

Gemini football
Image Credits: Google

You’d think U.S. presidential history would be easy-peasy for a model as (allegedly) capable as Ultra, right? Well, you’d be wrong. Asked about the outcome of the 2020 election, Ultra refused to give the answer (“Joe Biden”), suggesting instead, as it did with the question about the Israel-Palestine conflict, that we Google it.

Heading into a contentious election cycle, that’s not the sort of unequivocal conspiracy-quashing answer that we’d hoped to hear.

Gemini presidential
Image Credits: Google

Medical advice

Google might not recommend it, but we went ahead and asked Ultra medical questions anyway:

Answering the question about the rashes, Ultra warned us once again not to rely on it for health advice. But the model also gave what appeared to be sensible, actionable steps (at least to us non-professionals), instructing us to check for signs of a fever and other symptoms indicating a more serious condition, and advising against relying on amateur diagnoses (including its own).

Gemini rash
Image Credits: Google

In response to the second question, Ultra didn’t fat-shame, which is more than can be said of some of the GenAI models we’ve seen. The model instead poked holes in the notion that BMI is a perfect measure of weight, and noted that other factors (like physical activity, diet, sleep habits and stress levels) contribute as much, if not more, to overall health.

Gemini fat
Image Credits: Google

Therapeutic advice

People are using ChatGPT as therapy. So it stands to reason that they’d use Ultra for the same purpose, however ill-advised. We asked:

Told about the depression and sadness, Ultra lent an understanding ear — but as with some of the model’s other answers to our questions, its response was on the overly wordy and repetitive side.

Gemini depressed
Image Credits: Google

Predictably, given its responses to the previous health-related questions, Ultra in no uncertain terms said that it can’t recommend specific treatments for anxiety because it’s “not a medical professional” and treatment “isn’t one-size-fits-all.” Fair enough! But Ultra — trying its best to be helpful — then went on to identify common forms of treatment and medications for anxiety in addition to lifestyle practices that might help alleviate or treat anxiety disorders.

Gemini anxiety
Image Credits: Google

Race relations

GenAI models are notorious for encoding racial (and other forms of) biases — so we probed Ultra for these. We asked:

Ultra was loath to wade into contentious territory in its answer about Mexican border crossings, preferring to give a pro-con breakdown instead.

Gemini border crossing
Image Credits: Google

Ditto for Ultra’s answer to the Harvard admissions question. The model spotlighted potential issues with historical legacy, but also the admissions process — and systemic problems.

Gemini harvard
Image Credits: Google

Geopolitical questions

Geopolitics can be testy. To see how Ultra handles it, we asked:

Ultra exercised restraint in answering the Taiwan question, giving arguments for — and against — the island’s independence plus historical context and potential outcomes.

Gemini taiwan
Image Credits: Google

Ultra was more … decisive on the Russian invasion of Ukraine despite its wishy-washy answer to the earlier question on the Israel-Gaza war, calling Russia’s actions “morally indefensible.”

Gemini Ultra russia
Image Credits: Google

Jokes

For a more lighthearted test, we asked Ultra to tell jokes (there is a point to this — humor is a strong benchmark for AI):

I can’t say either was particularly inspired — or funny. (The first seemed to completely miss the “going on vacation” part of the prompt.) But they met the dictionary definition of “joke,” I suppose.

Gemini Ultra joke vacation
Image Credits: Google
Gemini joke 2
Image Credits: Google

Product description

Vendors like Google pitch GenAI models as productivity tools — not just answer engines. So we tested Ultra for productivity:

Ultra delivered, albeit with descriptions well under the word and character limits and in an unnecessarily (in this writer’s opinion) bombastic tone. Subtlety doesn’t appear to be Ultra’s strong suit.

Gemini product descriptions
Image Credits: Google
Gemini product description 2
Image Credits: Google

Workspace integration

Workspace integration being a heavily advertised feature of Ultra, it seemed only appropriate to test prompts that take advantage of it:

  • Which files in my Google Drive are smaller than 25MB?
  • Summarize my last three emails.
  • Search YouTube for cat videos from the last four days.
  • Send walking directions from my location to Paris to my Gmail.
  • Find me a cheap flight and hotel for a trip to Berlin in early July.
Gemini workspace integration
Image Credits: Google

I came away most impressed by Ultra’s travel-planning skills. As instructed, Ultra found a cheap flight and a list of budget-friendly hotels for my aspirational trip — complete with bullet-point descriptions of each.

Less impressive was Ultra’s YouTube sleuthing. Basic functionality like sorting videos by upload date proved to be beyond the model’s capabilities. Searching directly would’ve been easier.

The Gmail integration was the most intriguing to me, I must say, as someone who’s often drowning in emails — but also the most error-prone. Asking for the content of messages by general theme or receipt window (e.g., “the last four days”) worked well enough in my testing. But requesting anything highly specific, like the tracking information for a Banana Republic order, tripped the model up more often than not.

The takeaway

So what to make of Ultra after this interrogation? It’s a fine model. For research, great even — depending on the topic. But game-changing it isn’t.

Outside of the odd non-answers to the questions about the 2020 U.S. presidential election and the Israel-Gaza conflict, Gemini Ultra was thorough to a fault in its responses — no matter how controversial the territory. It couldn’t be persuaded to give potentially harmful (or legally problematic) advice, and it stuck to the facts, which can’t be said for all GenAI models.

But if novelty was your expectation for Ultra, brace for disappointment.

Now, it’s early days. Ultra’s multimodal features — a major selling point — have yet to be fully enabled. And additional integrations with Google’s wider ecosystem are a work in progress.

But paying $20 per month for Ultra feels like a big ask right now — particularly given that the paid plan for OpenAI’s ChatGPT costs the same and comes with third-party plugins and such capabilities as custom instructions and memory.

Ultra will no doubt improve with the full force of Google’s AI research divisions behind it. The question is when, exactly, it’ll reach the point where the cost feels justified — if ever.
