What is the Value of Data?
From ancient grain records to modern databases: what actually drives the value of data?
Sumerian tablets from around 5,000 years ago show us that one of the first things we did, once we'd established cognitive tools like writing and mathematics, was to use them to gather data.
This data-gathering instinct almost certainly predates writing systems. Early humans tracked celestial cycles, seasonal changes, and resource patterns through memory, ritual, and primitive recording methods. Why are we so obsessed with data?
One possibility is that data collection may be evolutionarily advantageous. Those who could track patterns (animal migration, seasonal changes, resource locations) had survival advantages. This behavioral tendency now manifests as an almost compulsive need to quantify and track everything from steps to sleep patterns.
In this sense our data obsession is similar to our biological tendency to hoard excess calories. But unlike surplus calories, data has no universal utility; it is only contextually useful. Data assets themselves possess zero intrinsic value; their worth emerges entirely from the actions they enable. The same grain inventory records that satisfied ancient tax collectors could also reveal trade routes, predict famines, or optimize agricultural yields across an empire.
Unlike physical assets, data exhibits network effects - its value often increases superlinearly rather than linearly as more data points are added. A single customer transaction tells you little; millions of transactions reveal purchasing patterns, market trends, and predictive insights.
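The superlinear claim can be made concrete with a toy calculation: records grow linearly, but the number of pairwise relationships among them grows quadratically, so each new record adds more potential signal than the last. This is an illustrative sketch, not a valuation formula.

```python
# Toy illustration: n records yield C(n, 2) = n * (n - 1) / 2
# pairwise relationships, so the pattern space grows quadratically
# while the record count grows only linearly.
from math import comb

for n in [10, 100, 1_000, 1_000_000]:
    pairs = comb(n, 2)  # number of distinct record pairs
    print(f"{n:>9} records -> {pairs:>15,} pairwise relationships")
```

A single transaction (n = 1) yields zero relationships; a million transactions yield roughly half a trillion, which is the intuition behind "millions of transactions reveal patterns."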
Over the last 50 years, generating and gathering data has become more passive and pervasive; it often happens as a by-product of other processes. This represents a fundamental shift from a scarcity paradigm to an abundance paradigm. Where data was once deliberately collected for specific purposes, we now generate it continuously and unconsciously. Every digital interaction - clicks, location pings, sensor readings - creates a data exhaust that may have a narrow intended purpose but enormous latent potential.
Because of its intrinsic lack of value, data is always undervalued until someone finds an opportunity to put it to work through the process of entrepreneurial discovery. Sometimes the value created is immense. The owner of the data asset may have an inkling of the potential value of their data, but it will always be possible—and likely—that another actor will find an opportunity of greater utility and seize it. Reddit is a good example of this.
Some other examples: Netflix's viewing data became more valuable for content creation than just recommendation algorithms. Credit card companies realized transaction data was worth more for fraud detection and market insights than just processing payments. Tesla's driving data became a competitive moat for autonomous vehicle development. The pattern repeats: data collected for one purpose finds exponentially greater value in unforeseen applications.
If you own a data asset, you can sell or license it to different actors, and they may each derive wildly different value from it. This creates an interesting asymmetry in data markets. The same dataset might be worth $1M to one buyer and $100M to another, depending on their use case, competitive position, and ability to extract value. This valuation difficulty makes data assets fundamentally different from traditional commodities with more standardized pricing.
The framing of "data is the new oil" is a poor analogy at this point. The best you can say is that "data is liquid". Some may be oil, but some is plain old pond water. And data is really neither of those things, or both: if someone builds a highly efficient engine that runs on a cup of pond water, then oil is shit out of luck.
The oil analogy breaks down in crucial ways: oil is rivalrous (if I use it, you can't), data is non-rivalrous (infinite copies at zero marginal cost). Oil has relatively predictable applications; data has combinatorial possibilities that multiply when merged with other datasets. Oil depletes with use; data can improve with use through feedback loops and network effects.
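The combinatorial point above can also be sketched numerically: with n distinct datasets, there are 2^n - 1 non-empty combinations you could merge, so possibilities multiply rather than add as datasets accumulate. The dataset names below are arbitrary placeholders.

```python
# Toy illustration: merging datasets multiplies possibilities.
# n datasets admit 2**n - 1 non-empty combinations to join.
datasets = ["transactions", "locations", "weather", "social"]
n = len(datasets)

print(f"{n} datasets -> {2**n - 1} possible merged combinations")
# Each additional dataset roughly doubles the combination space:
print(f"{n + 1} datasets -> {2**(n + 1) - 1} possible combinations")
```

Oil barrels add; dataset combinations compound. That asymmetry is why the analogy fails.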
Broadly speaking, the value of a data asset is primarily driven by the difficulty of gathering and maintaining it, the roundaboutness of its production, and the known possibilities for utilization.
After 2020—and arguably this process is still unfolding—transformer models expanded the horizon for possible utilization and derived value. Data assets that sat latent suddenly became orders of magnitude more valuable. A new game had started.
Specific data types that exploded in value post-2020: conversational data (Reddit, Discord, forums), code repositories (GitHub), multimodal datasets (image-text pairs), and real-time interaction data. Data that was previously considered "exhaust" - like chat logs, comment threads, and user-generated content - became premium training material worth millions per year in licensing deals.
The drivers of data value are the same: how hard is it to create and maintain, and what are the opportunities for exploiting it? But the consensus as to what constitutes each of these is gone.
On one hand, certain types of data are easier to gather, clean, and verify than ever before. Data assets that took immense effort to build over time can be replicated quickly. Human data labelers can be replaced by AI labelers, for example. If you spend capital and time building a data asset using human labor, and it can be replicated in a fraction of the time at a fraction of the labor cost by using AI, your data asset is objectively worth less than it was before.
The counter is that the possibilities for entrepreneurial opportunity are also more abundant: how those data assets can be utilized is an ever-expanding horizon as foundational AI research progresses.
What does this mean for data asset valuation? We have a deflationary force and an inflationary force. Are they keeping each other in check?
The answer may partly depend on timing and defensibility. First-mover advantages matter - companies that establish data collection early in a category can build compounding advantages that become insurmountable. The question becomes: can defensive moats (proprietary data sources, exclusive partnerships, network effects) grow faster than offensive capabilities (AI-powered data generation, synthetic data, scraping technologies)?
Arguably the most valuable type of dataset in this new world is one that cannot be easily replicated and has high known utility, as well as high unknown potential, or optionality. Datasets that require high capital expenditure (and we can use time as a proxy for capital here) are by definition harder to replicate. Datasets that have high combinatorial potential, or potential for non-linear value growth, are also more valuable.
Plot defensibility against known utility and you get four quadrants. Businesses should either exit the "Indefensible Asset" quadrant quickly or move toward "Defensible Asset" by increasing switching costs, exclusivity, or network effects. For investors, "Free Optionality" represents the highest risk-adjusted returns: low downside with potentially explosive upside if utility is discovered.
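One way to read the quadrant framing is as a 2x2 over defensibility (how hard the asset is to replicate) and known utility. A minimal sketch under that assumption follows; "Free Optionality" is mapped to defensible-but-unproven assets, and "Commodity Exhaust" is my placeholder label for the fourth quadrant, which the text does not name.

```python
def classify(defensible: bool, known_utility: bool) -> str:
    """Toy 2x2 classifier. 'Indefensible Asset', 'Defensible Asset',
    and 'Free Optionality' come from the text; the axis mapping and
    the 'Commodity Exhaust' label are assumptions for illustration."""
    if defensible and known_utility:
        return "Defensible Asset"      # hard to replicate, proven use
    if defensible:
        return "Free Optionality"      # hard to replicate, use unknown
    if known_utility:
        return "Indefensible Asset"    # proven use, easily replicated
    return "Commodity Exhaust"         # easily replicated, no known use

print(classify(defensible=True, known_utility=False))  # Free Optionality
```

The "exit quickly or build moats" advice is then a move from the bottom row (not defensible) to the top row.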
Going back to our original question — what exactly is the value of data in this new world?
It depends. Data is potential — akin to what Austrian economists call 'higher-order capital goods.' Data's value isn't inherent; it emerges from the robustness and sophistication of the processes that both generate and extract value from it.
The most valuable data assets result from complex, indirect production processes. They are roundabout in their creation and utilization. The longer and more complex the chain from data generation to value extraction, the greater the defensibility of the asset's value, and the greater the potential returns.
This is why Google and Amazon have built such durable advantages—not because they have more data, but because they've invested in the most robust and roundabout processes for both generating and monetizing it. Their data flywheel is a sophisticated capital structure that compounds over time.
My conclusion is that data value is fundamentally tied to roundaboutness. The winners in the AI economy won't necessarily be those with the largest data deposits, but those who've built the most sophisticated processes and infrastructure. Those with data flywheels that create defensibility, and capital structures that do the same.
The quick and dirty mental model is something like: if AI models get 10x better tomorrow, is my infrastructure and data still valuable? Or, put another way, how much more or less valuable is it?