
How to ensure data quality in the era of big data


Unknown data failures are a big problem in the big data age

Patrik Liu Tran

Contributor

Patrik Liu Tran is the co-founder and CEO of Validio, an automated real-time data validation and quality monitoring platform. He holds a Ph.D. in Business Administration (as well as an M.Sc. and B.Sc.) from Stockholm School of Economics, and a Civil Engineering degree in Engineering Physics with an M.Sc. in AI and machine learning from KTH Royal Institute of Technology. Patrik is also the chairman of Stockholm AI.

A little over a decade has passed since The Economist warned us that we would soon be drowning in data. The modern data stack has emerged as a proposed life-jacket for this data flood — spearheaded by Silicon Valley startups such as Snowflake, Databricks and Confluent.

Today, any entrepreneur can sign up for BigQuery or Snowflake and have a data solution that can scale with their business in a matter of hours. The emergence of cheap, flexible and scalable data storage solutions was largely a response to changing needs spurred by the massive explosion of data.

Currently, the world produces 2.5 quintillion bytes of data daily (there are 18 zeros in a quintillion). The explosion of data continues in the roaring ‘20s, both in terms of generation and storage — the amount of stored data is expected to continue to double at least every four years. However, one integral part of modern data infrastructure still lacks solutions suitable for the big data era and its challenges: data quality monitoring and data validation.

Let me go through how we got here and the challenges ahead for data quality.

The value vs. volume dilemma of big data

In 2005, Tim O’Reilly published his groundbreaking article “What is Web 2.0?”, truly setting off the big data race. The same year, Roger Mougalas from O’Reilly introduced the term “big data” in its modern context, referring to a large set of data that is virtually impossible to manage and process using traditional BI tools.

Back in 2005, one of the biggest challenges with data was managing large volumes of it, as data infrastructure tooling was expensive and inflexible, and the cloud market was still in its infancy (AWS didn’t publicly launch until 2006). The other was speed: As Tristan Handy from Fishtown Analytics (the company behind dbt) notes, before Redshift launched in 2012, performing relatively straightforward analyses could be incredibly time-consuming even with medium-sized data sets. An entire data tooling ecosystem has since been created to mitigate these two problems.

The emergence of the modern data stack (example logos and categories). Image Credits: Validio

Scaling relational databases and data warehouse appliances used to be a real challenge. Only 10 years ago, a company that wanted to understand customer behavior had to buy and rack servers before its engineers and data scientists could work on generating insights. Data and its surrounding infrastructure were expensive, so only the biggest companies could afford large-scale data ingestion and storage.

Then came a (Red)shift. In October 2012, AWS presented the first viable solution to the scale challenge with Redshift — a cloud-native, massively parallel processing (MPP) database that anyone could use for the monthly price of a pair of sneakers ($100), about 1,000x cheaper than the previous “local-server” setup. With a price drop of this magnitude, the floodgates opened and every company, big or small, could now store and process massive amounts of data and unlock new opportunities.

As Jamin Ball from Altimeter Capital summarizes, Redshift was a big deal because it was the first cloud-native OLAP warehouse and reduced the cost of owning an OLAP database by orders of magnitude. The speed of processing analytical queries also increased dramatically. Later on, cloud data warehouses (Snowflake pioneered this) separated compute and storage, which, in simplified terms, meant customers could scale their storage and compute resources independently.

What did this all mean? An explosion of data collection and storage.

Shift from on-premise to cloud data warehouses. Image Credits: Validio

After 2016, I started to see widespread adoption of scalable modern cloud data warehouses. This was when BigQuery and Snowflake truly entered the race (BigQuery didn’t release standard SQL until 2016 and so wasn’t widely adopted before then, and Snowflake’s product wasn’t truly mature before 2017). Not only did these newcomers solve the volume issue, they were built on a new and different logic that decouples cost for volume and cost for compute.

This meant that a company could store large volumes of data even more cheaply, as it only paid for the data it processed (i.e., consumption-based compute costs and consumption-based storage costs, rather than consumption-based costs for “storage with compute power” for all data).

Before the price drop in storage, companies collected data more selectively (than today) and naturally focused on high-value data. Today, with data storage being dirt cheap and the ongoing move to ELT from ETL following the advent of modern cloud data warehouses and lakes, it’s viable to store massive amounts of data “just in case.”

With the challenge of speed and volume solved, the challenge before us now is to ensure that the large volumes of big data are of sufficiently high quality before they’re used.

The value/volume graph of big data. Image Credits: Validio

Data failures are common, but when data explodes, so does its impact

The data quality space clearly has an issue with definitions. Just like with artificial intelligence, there are almost as many definitions of “data quality” as there are opinions on the subject. In the era of big data, data quality is about preventing and battling data failures.

Data failure is a broad concept that describes situations when the data in a data stream or data set does not behave as expected. Some examples include significant fluctuations in data ingestion rate (for data streams), the number of rows in a data table being greater or smaller than expected, changing data schema, actual data values shifting significantly, or shifts in the relationships between different features in a data set.

Note that this definition of data failures relates to expectations. Data failures can occur even if the newly collected data actually and truthfully represents reality and therefore is free from errors, as long as it differs from the expectations on which its use is based.

In connection with the COVID-19 pandemic, I saw many examples of data failures without underlying data errors. Numerous companies had machine learning models in production, which were trained on historical data. When training machine learning models on historical data, the implicit expectation is that historical data is representative of the new data that the model is supposed to work with. Hence, if new data significantly differs from the historical (training) data, we have a data failure, even if the new data truthfully represents reality.

This blog post by DoorDash describes why they had to retrain their machine learning algorithms in the wake of the COVID-19 pandemic. With lockdowns and all sorts of restrictions on social behavior, more people started ordering in, and more restaurants signed up to DoorDash to be able to cater to the changes in demand.

The new consumption patterns were not captured by the historical data on which DoorDash’s ML-based demand prediction models were trained, so they were quickly rendered useless in the new reality. “ML models rely on patterns in historical data to make predictions, but life-as-usual data can’t project to once-in-a-lifetime events like a pandemic … the COVID-19 pandemic brought demand patterns higher and more volatile than ever before, making it necessary to retrain our prediction models to maintain performance,” DoorDash wrote.
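A shift like the one DoorDash describes can be caught before model performance collapses by comparing incoming feature values against the data the model was trained on. Below is a minimal sketch of that idea using a two-sample Kolmogorov-Smirnov test; the feature, numbers and threshold are illustrative assumptions, not DoorDash’s actual setup.

```python
import numpy as np
from scipy import stats


def has_drifted(train_values: np.ndarray, new_values: np.ndarray, alpha: float = 0.01) -> bool:
    """Flag a data failure if new data is unlikely to come from the training distribution."""
    # Two-sample Kolmogorov-Smirnov test compares the two empirical distributions.
    _, p_value = stats.ks_2samp(train_values, new_values)
    return p_value < alpha


# Illustrative numbers: daily orders per restaurant before and after lockdowns.
rng = np.random.default_rng(42)
historical = rng.poisson(lam=20, size=10_000)  # data the model was trained on
incoming = rng.poisson(lam=45, size=2_000)     # data arriving after behavior shifted

if has_drifted(historical, incoming):
    print("Data failure: inputs no longer match the training data, consider retraining.")
```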

Inspired by former U.S. Secretary of Defense Donald Rumsfeld, I find it useful to divide data failures into four categories based on how aware organizations are of them and how well they understand them. If you plot the categories in a 2×2 matrix, with awareness on the vertical and understanding on the horizontal axis, you get the below diagram.

The Rumsfeld matrix for data quality. Image Credits: Validio

Note that what is known for one company might be unknown for another, both in terms of awareness and understanding. So, while it is a tempting simplification, it is not meaningful to classify particular data failures as knowns or unknowns in general, because the classification differs between organizations.

That said, here are a few examples of what the categories in the matrix actually mean in practice:

Known knowns: These are data failures that an organization is aware of and understands, so it’s fairly straightforward to have people implement rules to check for them. These people could be data engineers, data scientists, or business people with domain knowledge about the data in question. For example, rules could govern the format or value range that values in a certain column should have for the data to be valid.

Known unknowns: These data failures are issues that a company is worried about, but lacks the understanding (or resources) to deal with. A couple of years ago, I was advising data teams of big Nordic enterprises. Common complaints from data owners involved constant worries about data from certain sources, because the values that came in were sometimes unreasonable and erroneous. They said, “We do not know how to properly monitor and be proactive, since the errors look different from time to time.”

Unknown knowns: These are data failures that a company is unaware of, but tacitly understands, which may let it handle them once it’s aware of the failures. However, as an organization isn’t aware of a failure before it happens, it is impossible for it to be proactive and define rules and tests to identify them.

For companies with analytical data use cases, in which there is some time between data generation and consumption, a human is usually involved who can hopefully analyze the data before it is consumed and learn of potential data failures (turning the unknown known into a known known). However, as data volumes and complexity grow, data failures are often overlooked even by the most knowledgeable people, as humans are not well-suited for high-dimensional pattern recognition or high-volume data processing.

For organizations with operational data use cases, in which data is both generated and consumed in near real time, there is little to no opportunity to have a human in the loop. The knowledge of a potential data failure is of little help, because data failures need to be addressed before the data is consumed, which happens almost right after it is generated.

Statistics Sweden, the organization responsible for producing official statistics for decision-making, debate and research in Sweden, faced the challenge of an unknown known data failure in 2019. Failures were identified in the data collection for its workforce survey, which is used for, among other things, calculating Sweden’s GDP and unemployment rate, figures that the Swedish government, the Central Bank of Sweden and commercial banks rely on to make crucial decisions.

Notably, there are few organizations in the world that have a higher proportion, or a higher absolute number, of statisticians and data experts than Statistics Sweden, whose raison d’être is to provide high-quality data. In September 2019, Statistics Sweden reported an unemployment rate of 7.1%, but in reality, the true number was closer to 6%.

The reported numbers were so astounding that the Swedish Minister of Finance even reached out to Statistics Sweden to discuss their correctness, but this questioning was not taken seriously enough. The discrepancy was discovered months later and had severe consequences, including the Swedish government having to recalculate Sweden’s GDP for several years back.

In addition, both the central bank and commercial banks in Sweden had based their interest rate decisions on the false sudden increase in the unemployment rate. Interpreting it as a sign of an economic downturn, leading commercial banks assumed that the central bank would keep its rates, and therefore their lending costs, low. Subsequently, they lowered their interest rates aggressively.

So what led Statistics Sweden to this enormous mistake? The organization had outsourced parts of the workforce data collection to a consultancy firm that did not follow the proper data collection procedures (including faking data, since a doubled response rate resulted in a 24x higher fee, but that’s another ludicrous story), which resulted in erroneous data being collected.

Employees at Statistics Sweden discovered these improper data collection practices by identifying significant discrepancies between internally collected data and the data from the consultancy firm. In addition, control calls were made to survey respondents, who denied ever being involved.

While employees of Statistics Sweden are unarguably knowledgeable enough to understand that data must be properly collected, they were unaware of the incentives of their outsourcing contract and so didn’t have proper checks and balances in place to monitor the quality of the data. As soon as they were made aware, they took corrective actions to try to prevent this from happening again.

Unknown unknowns: These are the data failures that organizations are neither aware of nor understand, which makes them the most difficult category to discover.

For instance, an unknown-unknown-driven data failure stemming from a software update recently struck travel and tourism giant TUI. TUI uses average weights to estimate total passenger weight before takeoff, with different standard weights for men, women and children. Total passenger weight is a key input for calculating important flight data such as takeoff thrust and fuel requirements.

In this case, a software update caused the system to mislabel passengers titled “Miss” as “children,” resulting in a 1,200 kg underestimation of total passenger weight for one of their flights. The pilots noted that the estimated weight differed substantially from the weight on the load sheet, and that the number of children shown on the load sheet was unusually high.

However, they discarded their suspicions and took off with lower thrust than required based on the real weight. Luckily, despite several TUI planes making the same error before the data failure was discovered and resolved, none of the planes were above their weight safety margin and no serious harm was caused.

This is the risk of being confronted with an unknown unknown data failure: Even when you are faced with one directly, it might slip under your radar.

Another example is that of the peer-reviewed medical journal “The Lancet,” which published a paper in March 2021 claiming that Spain’s child mortality rate due to COVID-19 was up to four times as high as that of countries like the U.S., U.K., Italy, Germany, France and South Korea. The paper caused a public outcry in Spain, and the data behind the results was scrutinized further, revealing that the paper exaggerated the number of children who died from COVID-19 by almost a factor of eight.


The source of the data failure was traced back to the IT system that the Spanish government used to store the data. It could not handle three-digit numbers, resulting in three-digit ages being cropped — so, for instance, a 102-year-old was registered as a two-year-old in the system. Note that highly trained researchers were using this highly erroneous data and let it slip through.

Another example of an unknown-unknown-driven data failure during real-time data consumption concerns Instacart’s product availability predictions across stores. Predictions are used to inform customers which products in their shopping baskets are in stock and can be delivered upon an order. Prior to the pandemic, the accuracy of Instacart’s inventory prediction model was 93%, but when the pandemic hit, that number dropped to 61%, as people started to drastically change their shopping behavior.

The machine-learning-based prediction models were trained on weeks of historical data, but the system could not deal with the unexpected hoarding of toilet paper, hand sanitizers and staple foods. Only when the models’ performance had deteriorated significantly (and cost a lot of money in terms of lost sales and unhappy customers) did engineers at Instacart begin to sense that something was wrong. The machine learning team at Instacart was neither aware of the potential for a data failure caused by a significant shift in behavior, nor did it have sufficient knowledge to recognize one when it happened (i.e., to recognize the significant shift in the input data as a failure before the data was consumed).

The above examples of known and unknown data failures are just the tip of the iceberg. Data failures happen all the time in data-driven organizations, and I have seen many with more severe consequences than the cases discussed above. However, most companies are embarrassed about them and, so, rarely discuss them in public.

Observability is necessary, but not sufficient

With the evident need for solutions to ensure high data quality in the big data era, the space has received a lot of attention — especially during the past year. If you’ve paid attention to the data quality space lately, you’re likely familiar with the term observability. I’m seeing an increasing number of data quality monitoring startups from different backgrounds using the term to describe their solutions, many positioning their observability tool as the factor that will take businesses to the next level.

“Observability” has been adopted from the world of DevOps — I’ve seen plenty of observability startups proclaiming they will be the “Datadog of data.” This is largely due to the notion that it will smooth sales processes and product adoption, as tech teams are familiar with the term from the world of application performance and cloud software monitoring.

In the context of data quality, the observability approach essentially means monitoring data pipelines’ metadata to ensure that data is fresh and complete — ensuring that data is flowing through the pipeline as expected. These solutions may also provide some high-level information about data set schema and error distributions (like missing values), paired with some data lineage solution.
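In practice, metadata-level monitoring of this kind often boils down to checks like the following sketch. The staleness limit and expected row range are hypothetical, and the point to notice is that nothing here ever inspects the values themselves.

```python
from datetime import datetime, timedelta, timezone


def metadata_checks(last_loaded_at: datetime, row_count: int,
                    max_staleness: timedelta = timedelta(hours=1),
                    min_rows: int = 90_000, max_rows: int = 110_000) -> list[str]:
    """Observability-style checks: only freshness and volume, never the values."""
    issues = []
    if datetime.now(timezone.utc) - last_loaded_at > max_staleness:
        issues.append("freshness: table has not been updated within the expected window")
    if not min_rows <= row_count <= max_rows:
        issues.append("volume: row count is outside the expected range")
    return issues


# In practice these metrics would be read from the warehouse's metadata tables.
print(metadata_checks(
    last_loaded_at=datetime.now(timezone.utc) - timedelta(hours=3),
    row_count=40_000,
))
```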

Data observability is a good first step to address an underserved need among data-driven companies, but it is far from sufficient, as it pays little or no attention to issues in the actual data within the data pipelines. There are countless examples of data failures that are not evident from the metadata level and can wreak total havoc in the pipelines while everything looks fine and dandy from a data observability standpoint.

None of the previously described data failures of Statistics Sweden, TUI, the Spanish COVID-19 data or Instacart could be discovered with a data observability tool. This is because the metadata was normal in all the cases: Data was complete, fresh and with normal schema and error distributions. The failures could only have been discovered by looking at the actual data and its values.

In other words, observability is a subset of data quality monitoring, but it’s far from the whole story.

Why traditional ways to identify data failures don’t work

Before I explain how to go about identifying data failures in the big data era, let’s revisit how data failures were identified when volumes were low and values were high (per data point). There were three common approaches:

Reactive: Data failures were dealt with reactively when they grew painful — when errors in dashboards, BI reports or other data products were picked up by the consumers of data products. Given that the damage is already done when the data failures are identified in this way, this approach is not suitable in organizations where data is used for critical purposes.

Manual: Data scientists (note that this was before the era of data engineers) were employed to manually ensure the quality of collected data by analyzing new data (often column by column) and comparing it to historical data to identify potential significant deviations and unreasonable values. Compared to the first approach, this is more proactive, but not at all scalable.

Rule-based: More data-ready organizations often had real-time operational use cases for collected data. When data is consumed almost immediately after it is collected, it is virtually impossible to manually ensure its quality. These companies got internal data consumers with domain knowledge about the data to add a plethora of rules to their data pipelines based on reasonable thresholds.

The rules were often part of big master data management systems. However, these inflexible behemoth systems were a solid recipe for inertia, as rules need to be redefined when their surroundings change (e.g., when migrating to the cloud).
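To make the rule-based approach concrete, here is a minimal sketch of the kind of hand-written checks a domain expert might define, assuming a pandas DataFrame with hypothetical email and age columns. In a real setup these rules would live in the pipeline or a master data management system rather than in a standalone script.

```python
import pandas as pd


def known_known_violations(df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows that break hand-written rules on format and value range."""
    email_ok = df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)
    age_ok = df["age"].between(0, 120)
    return df[~(email_ok & age_ok)]


customers = pd.DataFrame({
    "email": ["anna@example.com", "not-an-email", "bo@example.com"],
    "age": [34, 29, 140],
})
print(known_known_violations(customers))
```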

A reactive approach to data quality has always been out of the question when data is applied to use cases where the value (and therefore, the cost of mistakes) is high, which is increasingly the case in the big data era. The alternatives of having a human in the loop or defining hard-coded rules are impractical for several reasons.

Firstly, the vast amount of data makes manual and rule-based approaches unviable. The volume and velocity of big data make it difficult to have humans in the loop, and the variety of data makes rules too inflexible (the “three V’s of big data”). Our limited capacity to identify patterns in high-dimensional data leaves us inept at understanding and recognizing the different kinds of data failures, and our limited awareness means we cannot foresee them and proactively create rules to codify and capture them.

The three Vs of big data. Image Credits: Validio

As data quality has created several new contextual issues, a fourth “V” has been introduced: veracity. In general, veracity refers to the degree of correctness or trustworthiness of the data. It’s fair to say that data veracity, even though always present in data science, has historically dwelt under the shadow of the other V’s (volume, velocity and variety).

This is changing as only correct and trustworthy data can add value to analytics and machine learning models today. The emphasis on veracity will be accelerated by growing data volumes, variety and sources. Upcoming regulatory initiatives, such as the EU AI Regulation, are also directing the focus toward veracity.

Secondly, modern data infrastructures and accompanying development of modern data pipelines have increased data pipeline complexity significantly, which results in additional potential sources of data failures.

Thirdly, given the vast amount of data sources, manually defining rules to monitor each source has become a huge task. It takes a lot of time to understand data sources well enough to define proper rules, and the rules must also be updated continuously as data changes over time.

Lastly, and maybe most importantly, the traditional approaches of manual and/or rule-based data quality monitoring are unable to capture the unknown data failures that an organization either doesn’t understand or is unaware of. The unknown data failures are often not only the most difficult to identify, they are also the most costly to miss.

How to identify (unknown) data failures in the era of big data

Effective approaches to data quality monitoring will need to take into account a handful of secular trends that impact how data is used and how quality can be managed proactively and scalably.

How to identify data failures. Image Credits: Validio

Modern data quality solutions need to use statistical models and machine learning algorithms to complement rule-based approaches to identify unknown data failures in a scalable and adaptable way as data changes over time. Solutions also should be able to operate on data that flows through data pipelines in real time to cater to operational use cases in which data is produced and consumed in near real time.
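One simple version of that idea, sketched here rather than prescribed, is to learn a baseline from each pipeline metric’s own history and flag values that deviate strongly from it, instead of hand-writing a threshold per metric. The window size and cutoff below are illustrative assumptions.

```python
import numpy as np


def anomalous_days(daily_metric: np.ndarray, window: int = 30, z_cutoff: float = 4.0) -> list[int]:
    """Flag days where a pipeline metric deviates strongly from its own recent history."""
    flagged = []
    for day in range(window, len(daily_metric)):
        history = daily_metric[day - window:day]
        mean, std = history.mean(), history.std()
        if std > 0 and abs(daily_metric[day] - mean) / std > z_cutoff:
            flagged.append(day)
    return flagged


# Example: the average basket value shifts abruptly without any rule having been written for it.
rng = np.random.default_rng(0)
metric = np.concatenate([rng.normal(100, 5, 60), rng.normal(160, 5, 5)])
print("Days with suspected data failures:", anomalous_days(metric))
```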

Data engineers should be able to use these solutions, which means they should be easy to deploy without going through a cumbersome procurement process, and be customizable. These solutions should also include data lineage functionality so that information about data failures upstream will reach the affected data teams downstream proactively.

And lastly, they should have non-intrusive deployment alternatives so that security is never an issue for customers.

Moving along the data quality maturity scale

After speaking with hundreds of data teams over the last few years, I have found that modern data-driven companies usually go through different stages of maturity when it comes to ensuring high data quality.

The data quality maturity scale. Image Credits: Validio

Level 1 — Basic: No processes or technology to monitor or identify data quality issues. Data quality problems usually take weeks or months to resolve and are identified by downstream users/customers when problems have already occurred.

Level 2 — Starter: Dashboards are used to monitor data patterns to be able to react quickly as data quality issues arise. It can still take weeks to find and resolve the root cause, though, especially in complex data pipelines.

Level 3 — Medium: Rule-based systems are used to identify data quality problems, which can be addressed as they occur. But there aren’t any advanced systems in place to detect unknown data failures.

Level 4 — Advanced: Rule-based systems are complemented by ML-and-statistical-test-based systems that identify both known and unknown data failures. Data lineage is used to trace data failures back to their origin, making it easy to identify the error source. Furthermore, lineage makes it possible to proactively notify all downstream data teams that depend on the failed data (see the sketch just after this list).

Level 5 — Expert: In addition to ML algorithms, statistical tests and rules to keep track of data quality and data lineage, there are tools to operate on and filter data in real time to rectify data failures until their root causes have been resolved, so that failures are kept from contaminating the data pipeline downstream.
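To illustrate the lineage-based notification mentioned at Level 4, here is a minimal sketch over a hypothetical lineage graph of table names. In a real system the graph would be extracted from query logs or pipeline metadata rather than hard-coded.

```python
from collections import deque

# Hypothetical lineage graph: each table maps to the downstream tables that consume it.
LINEAGE = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["marts.revenue", "ml.demand_features"],
    "marts.revenue": ["dashboards.finance"],
}


def affected_downstream(table: str) -> set[str]:
    """Everything (and, by extension, every team) that should be alerted if `table` fails."""
    affected, queue = set(), deque([table])
    while queue:
        for child in LINEAGE.get(queue.popleft(), []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected


# A failure detected in raw.orders triggers proactive alerts for all dependents.
print(affected_downstream("raw.orders"))
```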

Companies at Levels 1-3 employ approaches from the time before the big data era. These approaches are easy to get started with, but data-driven companies tend to realize quickly that they neither scale well nor identify unknown data failures.

Companies at Level 4 identify unknown data failures in a scalable way. Alerts based on input data failures are more proactive than finding failures based on errors, but they are not sufficient for operational use cases since the identified data failures are not mitigated in real time.

At Level 5, proactivity becomes tangible as companies pair monitoring with interventions. Combining Level 4 functionality with automated operations on the data prevents failures from propagating: bad data is filtered out of the main pipeline, or automated fixes tidy up the data points as they flow through it.
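As a rough sketch of that Level 5 pattern: records that fail validation are routed to a quarantine sink instead of flowing downstream. The validation rule and record fields below are hypothetical, and in production the sinks would be streaming topics or warehouse tables rather than in-memory lists.

```python
from typing import Callable, Iterable


def split_stream(records: Iterable[dict], is_valid: Callable[[dict], bool]):
    """Send valid records downstream; quarantine the rest until the root cause is fixed."""
    downstream, quarantine = [], []
    for record in records:
        (downstream if is_valid(record) else quarantine).append(record)
    return downstream, quarantine


# Hypothetical rule: a price must be positive and below a sanity ceiling.
def valid_price(record: dict) -> bool:
    return 0 < record.get("price", -1) < 10_000


good, bad = split_stream(
    [{"price": 49.0}, {"price": -3.0}, {"price": 1_000_000.0}],
    is_valid=valid_price,
)
print(f"{len(good)} records sent downstream, {len(bad)} quarantined for inspection")
```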

The million-dollar question: Buy or build?

The answer to the buy or build question depends on the level of maturity a company is aiming for and the resources at its disposal. Generally, a higher level of maturity will cost more to build and maintain.

Many companies can build solutions for Levels 1-3 from scratch or by leveraging tools like Grafana, dbt, Great Expectations and Deequ. However, building proprietary solutions for Level 4 requires using statistical tests and machine learning algorithms in highly generalizable ways to identify unknown data failures.

As monitoring relies less on domain knowledge and more on comprehensive, generalizable pattern recognition, data quality monitoring companies have an edge simply because for most companies, domain knowledge is core business, but generalizable pattern recognition is not.

At Level 5, system performance requirements are a completely different ballgame. Levels 1-4 concern a range of sophisticated ways to identify data failures so that humans can prioritize how and when to react. To build increasingly generalizable and dynamic pattern recognition and rules, you need a solid understanding of the data, ML and statistics (and software development, of course). These skills are also needed for Level 5, but fixing data in real time for operational use cases virtually requires the systems to be as performant as one’s ETL/ELT pipeline.

What I’ve seen is that companies usually keep DIY solutions at Levels 1-3, where their quality monitoring primarily relies on domain knowledge of the data. When they want to scale the monitoring (often because they want to scale their data pipelines) and move toward Levels 4 or 5, they begin looking for external best-of-breed solutions.

Some of the biggest and most data-driven companies out there have managed to build DIY solutions at Level 4, but they are few. Less than a year ago, for example, Uber’s engineering team shared some details about their proprietary solution for cross-infrastructure data quality monitoring. The solution took five data scientists and engineers a year to build and will require ongoing investment to maintain.

This is a pretty lofty investment for most companies, especially as good data scientists and engineers are a scarce resource, to say the least. At Uber’s scale, there may be enough ROI to justify a highly customized in-house solution, but that’s probably an anomaly.
