Big Data Doesn’t Exist

12:00 PM PDT • September 10, 2015

**Image Credits:** Maksim Kabakou / Shutterstock

Slater Victoroff

Contributor

Slater Victoroff is CEO of Indico Data Solutions.

My customers always lie to me. They don’t lie about what they can afford. They don’t lie about how much (or how little) customer service they’ll need. They don’t lie about how quickly they can pay us.

They lie about how much data they have.

At first, I thought it was just a weird one-off. A client told us they needed to handle several billion calls each month, a “massive data stream.” That much analysis comes with a huge price tag. Once I made this clear, the truth came out: they hoped to ramp up to a million calls a day, or about 30 million a month, in the next several months. Even if they reached this optimistic goal, they’d have less than one one-hundredth of the data they’d originally claimed.

It’s not just this client, either. I’ve found it’s a good rule of thumb to assume a company has one one-thousandth of the data they say they do.

“Big Data” Isn’t Big

Companies brag about the size of their datasets the way fishermen brag about the size of their fish. They claim access to endless terabytes of information. The advantages seem obvious: the more you know, the better.

Based on their marketing materials, it would seem that this data makes companies almost clairvoyant. They claim deep insights about everything from the performance of employees to the preferences of their customer base. More data means more understanding about how people make decisions, what people buy, what motivates them — right?

But marketing materials, like fishermen, exaggerate. Most companies only have a fraction of the data they claim. And typically, only a small fraction of that fraction is useful for generating any non-trivial insight.

Most “Big Data” Isn’t Actually Useful

Why do companies lie about the size of their data? Because they want to feel like one of the big dogs. They’ve heard about the enormous reserves of data collected by the likes of Amazon, Facebook and Google. And even though they don’t have the reach to collect that much data — or the money to buy it — they want to feel (and have outsiders think) they are in on the trend. As data analyst Cathy O’Neil noted in a recent blog post, many believe that “when you take a normal tech company and sprinkle on data, you get the next Google.”

But even big companies only use a tiny fraction of the data they collect.

Twitter processes around 8 terabytes of data per day. That sounds intimidating to a small company trying to extract consumer insights from tweets. But how much of that data is the actual content of tweets? Twitter users create 500 million tweets per day, and the average tweet is 60 characters. If we do the simple math, that’s just 30 gigabytes of actual text content per day, or roughly 0.4 percent of 8 terabytes.
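The arithmetic behind that estimate can be checked in a few lines. This is only a back-of-the-envelope sketch: the tweet count, average length and 8-terabyte total come from the paragraph above, while the one-byte-per-character figure and the decimal units (1 TB = 10^12 bytes) are assumptions made here for illustration.

```python
# Figures quoted in the article
TWEETS_PER_DAY = 500_000_000
AVG_CHARS_PER_TWEET = 60
TOTAL_BYTES_PER_DAY = 8 * 10**12   # 8 TB, assuming decimal terabytes

# Assumption: one byte per character (plain ASCII-like text)
BYTES_PER_CHAR = 1

text_bytes = TWEETS_PER_DAY * AVG_CHARS_PER_TWEET * BYTES_PER_CHAR
text_gb = text_bytes / 10**9
share = text_bytes / TOTAL_BYTES_PER_DAY

print(f"Tweet text per day: {text_gb:.0f} GB")
print(f"Share of daily data: {share:.3%}")
```

Under these assumptions the tweet text comes to 30 GB, well under one percent of the daily total. Multi-byte characters would raise the figure somewhat, but not by orders of magnitude.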

The pattern continues. Wikipedia is one of the largest repositories of text on the Internet, but all its text data could fit on a single USB drive. All the music in the world could fit on a $600 disk drive. I could go on, but the point is this: big data isn’t big, and good data is even smaller.

Making The Most Of Small Data

If most large datasets are useless, why talk about them at all? Because they aren’t useless for everyone. Deep-learning models can separate signal from noise, finding patterns that would typically take experts months to codify. But typical deep-learning models only work on massive amounts of labeled data. And labeling a large dataset takes hundreds of thousands of dollars and months of time. That’s a job for a corporate behemoth like Facebook or Google. Too many smaller companies don’t realize this and acquire massive data stores that they can’t afford to use.

These companies have a better option. They can get more value out of the data they already have.

True, most deep-learning algorithms need large datasets. But we can also design them to make inferences from small data, just like humans do. Using transfer learning, we can train an algorithm on a large dataset before sending it to work on a small one. This makes the learning process 100 to 1,000 times more effective.
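To make the idea concrete, here is a deliberately tiny sketch of transfer learning. It is not a deep-learning model and not how any of the products mentioned in this article work. It is a toy linear classifier on synthetic data, and every name and number in it (`DIM`, `SIGNAL`, the dataset sizes, the helper functions) is hypothetical. A projection is learned from a large "source" dataset, frozen, and then only a single threshold is fitted on ten labeled "target" examples. For comparison, the same model is also trained from scratch on those ten examples alone.

```python
import random

DIM = 20       # number of input features (hypothetical)
SIGNAL = 1.5   # class means sit at +/- SIGNAL along feature 0 (hypothetical)

def make_data(n, offset, rng):
    """Build n (vector, label) pairs; only feature 0 carries signal."""
    data = []
    for i in range(n):
        label = i % 2  # alternate labels so classes are balanced
        x = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
        x[0] += offset + (SIGNAL if label else -SIGNAL)
        data.append((x, label))
    return data

def mean_vector(vectors):
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(DIM)]

def fit_direction(data):
    """Learn a projection: the difference between the two class means."""
    pos = mean_vector([x for x, y in data if y == 1])
    neg = mean_vector([x for x, y in data if y == 0])
    return [p - q for p, q in zip(pos, neg)]

def project(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def fit_threshold(w, data):
    """Place the decision threshold midway between projected class means."""
    pos = [project(w, x) for x, y in data if y == 1]
    neg = [project(w, x) for x, y in data if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def accuracy(w, t, data):
    hits = sum((project(w, x) > t) == (y == 1) for x, y in data)
    return hits / len(data)

rng = random.Random(0)
source = make_data(5000, 0.0, rng)        # large, related dataset
target_train = make_data(10, 3.0, rng)    # small dataset we care about
target_test = make_data(2000, 3.0, rng)   # held-out target examples

# Transfer: reuse the projection learned on the source data,
# then fit only the threshold on the ten target examples.
w_pre = fit_direction(source)
t_transfer = fit_threshold(w_pre, target_train)
transfer_acc = accuracy(w_pre, t_transfer, target_test)

# Baseline: learn both projection and threshold from the ten examples.
w_scratch = fit_direction(target_train)
t_scratch = fit_threshold(w_scratch, target_train)
scratch_acc = accuracy(w_scratch, t_scratch, target_test)

print(f"transfer: {transfer_acc:.3f}  from scratch: {scratch_acc:.3f}")
```

The point of the sketch is the division of labor: the expensive part (finding which direction in the data matters) is learned once from plentiful data, and the small dataset only has to settle one number. How much that helps in practice depends on how closely the two tasks are related.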

Here are just a few examples of how startups put transfer learning to business use:

Dato’s GraphLab Create platform can be used to identify and classify huge numbers of images in fractions of a second. Users can apply existing features from previously trained deep-learning models — or train their own model on a dataset, like ImageNet.
Clarifai’s image recognition API tags images with descriptive text, making photo archives easily searchable. Its deep-learning algorithm also works on streaming video, which allows advertisers to drop in an ad that’s relevant to the content the user has just seen.
MetaMind’s AI platform can judge whether the content of an individual tweet about a brand is positive or negative, and also determine the main theme of a Twitter discussion surrounding it. For a company looking for insight into their customers’ opinions, that’s much more useful than simply scraping age, sex and location data from many more thousands of accounts.

You don’t even have to be a programmer to take advantage of these services. Blockspring lets users mash up APIs in Excel spreadsheets without writing a line of code.

With all of these options available, it makes even less sense to purchase big data by the terabyte — much less to brag about it.

It’s clear the future of data isn’t big. It’s small.

With contribution by Artem Kaznatcheev and Ada Kulesza of Hippo Reads.
