Synthetic Data – Catalyst for AI innovation
As the world and the people in it become more connected, new data is being created at an unprecedented rate. IoT, digitization, and cloud have driven the generation and storage of zettabytes (ZB) of data. Data has become the new oil, but with a caveat: the tap is controlled by a handful of organizations globally, which makes this asset scarce and expensive. Yet enterprises pursuing digital transformation need this data to generate insights for better decision-making.
Shortcut to access data
The next logical question is how enterprises can get hold of this data, which has the power to transform them if used to its full potential. This is where synthetic data comes into play. Synthetic data is created artificially rather than generated through actual interactions or events. It is usually produced by studying the characteristics of, and the relationships between, the variables in real data. There are three types of synthetic data, shown below.
Exhibit 1: Types of synthetic data
Why is it required now?
With the cultural shift from gut-based to insights-based decision-making and the onset of data literacy initiatives, enterprises need reliable insights, which in turn require huge amounts of data. The instances below make a strong case for synthetic data.
- GDPR imposes stringent rules on data access, stipulating whether a company can use the data, typically only with user consent. This makes sharing data extremely difficult and creates hurdles to solving business problems
- AI models and algorithms require extensive labeled data for training. A self-driving car, for example, needs to clock millions of miles to test its computer vision algorithms, which delays the go-to-market for such products
- New product development usually requires extensive testing with data before a product is introduced to the market. Innovation suffers if quality data from the field is not available
Techniques to generate synthetic data
There are usually three strategies to generate synthetic data. They range from simple statistical techniques to methods that rely heavily on AI.
Exhibit 2: Techniques to generate synthetic data
Sampling from a distribution means drawing a large number of random values from a statistical distribution, such as a normal distribution, typically one whose parameters are chosen to match the real data.
Agent-based modeling starts by studying the behavior of the original data. Once those characteristics are defined, the model creates new data while keeping the behavioral constraints in place.
Generative Adversarial Networks (GANs) are usually used to create image data. A GAN consists of two deep learning (DL) models: a generator and a discriminator. The generator takes random noise as its input and produces output images, while the discriminator tries to determine whether each image is real or fake. The closer a generated image is to a real one, the more likely the discriminator is to classify it as real.
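The first of these techniques, sampling from a distribution, can be sketched in a few lines of Python. This is a minimal illustration, not a production method: it assumes the real data is roughly normally distributed, and the sample values and the function names `fit_normal` and `sample_synthetic` are made up for this example.

```python
import random
import statistics


def fit_normal(real_values):
    """Estimate the mean and standard deviation of the observed values."""
    return statistics.mean(real_values), statistics.stdev(real_values)


def sample_synthetic(real_values, n, seed=None):
    """Draw n synthetic values from a normal distribution fitted to real_values."""
    mu, sigma = fit_normal(real_values)
    rng = random.Random(seed)  # a seeded generator keeps the output reproducible
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Hypothetical "real" measurements, invented purely for illustration.
real_order_values = [42.0, 55.5, 38.2, 61.0, 47.3, 52.8, 44.1, 58.6]

# Eight real observations become a thousand synthetic ones.
synthetic_order_values = sample_synthetic(real_order_values, n=1000, seed=7)

print("real mean:     ", round(statistics.mean(real_order_values), 2))
print("synthetic mean:", round(statistics.mean(synthetic_order_values), 2))
```

The synthetic values share the mean and spread of the original sample but correspond to no real record. Anything the fitted distribution does not capture, such as skew or correlations with other variables, is lost, which is why the more elaborate techniques exist.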
Applications across enterprises
A virtually unlimited source of data that mimics the real dataset opens up innumerable opportunities to create test scenarios during development.
Synthetic data benefits enterprises across domains and industries, as the examples below show.
- “Customer is king”: a tagline commonly heard today, as organizations strive to offer hyper-personalization to improve customer retention and create upsell and cross-sell opportunities. Synthetic data lets enterprises run detailed customer analysis without relying on the personal data that requires consent under GDPR. Such data retains the properties of real data and can be used for simulations
- Agile development and DevOps: Software testing and quality assurance often involve a long wait for access to ‘real’ data. Artificially generated data can eliminate this waiting period, reducing testing time and increasing agility during development
- Research and product development: Synthetic data can be used to understand the format of real data that does not exist yet, and to build algorithms and preliminary models on top of it. It can also serve as a baseline for product development and reduce time to market
- Robotics: Companies often struggle to obtain quality real-life data sets for testing. Synthetic data makes it possible to run thousands of simulations, improving the robots and complementing expensive real-life testing
- Financial services: Fraud protection and detection methods are essential for any financial services enterprise, and their effectiveness can be tested and evaluated using synthetic data
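To make the financial services example concrete, here is a minimal sketch of how a fraud detection rule might be evaluated against synthetic transactions. Everything in it is an assumption made for illustration: the 5% fraud rate, the idea that fraudulent amounts tend to be larger, the 500 threshold, and the function names are hypothetical and not taken from any real system.

```python
import random


def make_transactions(n, fraud_rate=0.05, seed=0):
    """Generate n synthetic transactions as (amount, is_fraud) pairs.

    Assumption for illustration: fraudulent amounts tend to be larger.
    """
    rng = random.Random(seed)
    transactions = []
    for _ in range(n):
        is_fraud = rng.random() < fraud_rate
        mean_amount = 900.0 if is_fraud else 80.0
        # expovariate takes a rate, so 1 / mean gives the desired average amount
        amount = rng.expovariate(1.0 / mean_amount)
        transactions.append((amount, is_fraud))
    return transactions


def flag_large_amount(amount, threshold=500.0):
    """Toy detection rule: flag any transaction above the threshold."""
    return amount > threshold


def evaluate(transactions, rule):
    """Return (precision, recall) of a rule against the synthetic labels."""
    tp = sum(1 for amount, fraud in transactions if rule(amount) and fraud)
    fp = sum(1 for amount, fraud in transactions if rule(amount) and not fraud)
    fn = sum(1 for amount, fraud in transactions if not rule(amount) and fraud)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


transactions = make_transactions(10_000, seed=42)
precision, recall = evaluate(transactions, flag_large_amount)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Because every synthetic transaction carries a known label, the rule can be scored without touching customer records. The scores are only as meaningful as the assumptions built into the generator.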
Limitations of synthetic data
However, the use of synthetic data does not come without its own set of limitations.
- At best, synthetic data imitates real-life data sets; it is not an exact replica. It can therefore miss or misrepresent data points that are deviations or exceptions to the overall set, leading to skewed modeling outputs
- Assessing the quality of a generated synthetic data set is not easy either, because it often depends on the complexity of the original data. The assessment parameters must change with the variation in the original data, so no single standard framework applies to every synthetic data set
- Business users can find it difficult to trust synthetic data because they lack an understanding of the technology behind it, which slows uptake. This is especially true in industries such as healthcare and food, where there are direct repercussions for human life
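Although no single framework fits every data set, a basic first check on quality is to compare the distribution of a synthetic column with its real counterpart. The sketch below is one possible starting point, not a standard: it compares means and standard deviations and computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical distribution functions). The data is randomly generated stand-in data, and the function names are hypothetical.

```python
import bisect
import random
import statistics


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest gap between the two empirical CDFs:
    0.0 means identical distributions, 1.0 means no overlap.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    for x in a + b:
        # fraction of each sample that is less than or equal to x
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


def compare(real, synthetic):
    """Summarize how far a synthetic sample drifts from the real one."""
    return {
        "mean_gap": abs(statistics.mean(real) - statistics.mean(synthetic)),
        "stdev_gap": abs(statistics.stdev(real) - statistics.stdev(synthetic)),
        "ks": ks_statistic(real, synthetic),
    }


rng = random.Random(1)
real = [rng.gauss(50, 10) for _ in range(500)]            # stand-in for real data
good_synthetic = [rng.gauss(50, 10) for _ in range(500)]  # same distribution
poor_synthetic = [rng.gauss(65, 10) for _ in range(500)]  # shifted mean

print("good:", compare(real, good_synthetic))
print("poor:", compare(real, poor_synthetic))
```

A check like this only covers one variable at a time. Relationships between variables, rare events, and domain-specific constraints need their own tests, which is part of why the assessment has to be tailored to each data set.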
Way ahead
Despite these limitations, enterprises should be keen to adopt synthetic data, as it gives them an opportunity to disrupt the business landscape by using data and its benefits to their full potential. It could prove to be the push that AI/ML needs to penetrate enterprises and gain more traction.
If you’ve used synthetic data in your enterprise or know of more areas where synthetic data can be advantageous or disadvantageous, please write to us at [email protected] and [email protected]. We’d love to hear your experiences and ideas!