In the last few decades, data got big. Digitisation produced huge reams of information. Leaders and technologists marvelled at its predictive potential but faced an equally large obstacle – sifting through it all. The big data race wasn’t just about hoarding; it was about analysing. In the last year or so, a new situation has arisen. Where the problem with data was once profusion, the biggest issue now is scarcity. Data might be big – but soon, it won’t be big enough.
Artificial intelligence is the most transformative technology of the 2020s. Looking beyond this decade, it’s becoming clearer that the next phase of human advancement will be about augmenting our own intelligence with a digital one. The competition to develop large language models (LLMs) is intensifying, but refining, training and iterating such programs takes data, and lots of it. As ever, when a raw material is in short supply, humans respond by creating a man-made version. Synthetic data is emerging as a solution to the great data dearth. Of all the definitions, IBM might have the clearest:
“Synthetic data is information that’s been generated on a computer to augment or replace real data to improve AI models, protect sensitive data, and mitigate bias.”
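To make the definition concrete, here is a minimal, hypothetical sketch of the simplest form of the idea: fit a statistical model to a small real dataset, then sample new records from it. The dataset, variable names and figures below are all illustrative, and real synthetic-data tools use far more sophisticated generative models.

```python
import random
import statistics

# A tiny "real" dataset (illustrative values only):
# ages of respondents in a hypothetical survey.
real_ages = [23, 35, 41, 29, 52, 47, 31, 38, 44, 27]

# Fit a simple model of the data: its mean and standard deviation.
mu = statistics.mean(real_ages)
sigma = statistics.stdev(real_ages)

# Sample synthetic records that mimic the real distribution
# without exposing any individual's actual age.
random.seed(42)
synthetic_ages = [round(random.gauss(mu, sigma)) for _ in range(1000)]

print(len(synthetic_ages))  # 1000 synthetic records
print(round(statistics.mean(synthetic_ages), 1))  # close to the real mean
```

The synthetic records preserve the aggregate statistics an AI model might learn from, while no single record corresponds to a real person – which is precisely the privacy appeal.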
Synthesise it (don’t criticise it)
Computer-generated data is about to have a moment. Gartner predicts that by 2024, 60% of the data used for developing AI and analytics will be artificially produced. But there are other applications, too. One of the first areas where synthetic data proved useful was training autonomous vehicles. In recent years, Waymo, a self-driving car company, has been sending virtual cars on simulated voyages intended to teach their real-world equivalents to manage all manner of situations on the road. In a blog post, the company said: “Each day, as many as 25,000 virtual Waymo self-driving cars drive up to 8 million miles in simulation, testing out new skills and refining old ones.” The example illustrates two of synthetic data’s strengths: it maximises the speed of iteration while avoiding real-world accidents.
The advantages of synthetic data are myriad. It’s inexpensive to obtain, and with privacy rules such as GDPR ever tightening, fabricated data is free from the privacy and ethical constraints that gum up the process. Not everyone, however, is convinced that synthetic data is a good idea. A data set created by an algorithm flies in the face of all we know about empiricism – specifically, that substantiated facts form the core of science, decision-making and logic itself. There is huge scope for misuse:
“Make me a data set that proves cigarettes prolong life for those over 60 years of age.”
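The request above is not far-fetched: a few lines of code can manufacture a dataset with the desired conclusion baked in. The sketch below is a deliberately crude, entirely fictional illustration – the cohorts, figures and effect are invented, and no observation backs them.

```python
import random
import statistics

random.seed(0)

# Fabricate lifespans for a fictional cohort over 60, assigning
# "smokers" a higher mean by construction. The data "proves"
# whatever the author wanted it to prove.
non_smokers = [random.gauss(78, 5) for _ in range(500)]
smokers = [random.gauss(84, 5) for _ in range(500)]

gap = statistics.mean(smokers) - statistics.mean(non_smokers)
print(f"Smokers live {gap:.1f} years longer")  # a manufactured 'finding'
```

The point is that a fabricated dataset can pass every statistical sniff test while bearing no relation to reality – which is why provenance matters as much as the numbers themselves.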
We foresee a new debate emerging between those who favour primary sources and others who are happy to trust the machine.
Substituting synthetic data for the real thing will become more of a temptation. But as a greater proportion of information is computer generated, there is a risk that more decisions will be shaped by an algorithm’s view of the world. Mikkel Krenchel and Maria Cury, partners at consulting firm ReD Associates, describe this risk well in an essay for IAI, saying:
“The growing availability of synthetic data might make firms or organisations disinclined to do original research and data collection. And that’s dangerous because even the best synthetic dataset will never be a representation of our constantly changing reality…”
Data is a commodity necessary to power civilisation, and demand will only continue to soar. But synthesising it carries unforeseeable dangers. Yesterday’s data was too big. Today, it is too scarce. Without the right regulation and academic enquiry, tomorrow, it could confirm our biases, be used to fabricate lies or – perhaps – lose its value altogether.
Researchers will need to exercise caution and restraint when it comes to synthetic data. The technology might be a convenience, but facts are still sacred.