How to Create Great Machine Learning Models: The Importance of Synthetic ML Data
Machine learning engineers have discovered the importance of synthetic data in building machine learning applications at scale. Businesses that rely on machine learning to grow and scale their products and services need to be aware of its benefits, implications, and challenges. As stated in a previous post on MLOps, many businesses struggle to implement, manage, and deploy ML models at scale. Equally challenging is the task of finding, validating, and even generating data for machine learning. Every successful machine learning project requires large datasets; the larger and more diverse the dataset, the better the ML model's performance will be. Since acquiring large datasets is a challenge, a growing alternative to “real” datasets is synthetic data.
Gartner defines synthetic data as “data generated by applying a sampling technique to real-world data or by creating simulation scenarios where models and processes interact to create completely new data not directly taken from the real-world.” In more practical terms, synthetic data is simply data created and controlled by data scientists, developers, and engineers that mimics the balance, characteristics, and patterns of the real data needed to fuel ML models. This blog post will briefly cover the advantages of synthetic data, the real-world challenges it resolves, how it is generated, and the current objections to using it.
Synthetic Data Advantages
Synthetic data can minimize bias in datasets. Inaccuracies in a dataset can severely degrade the quality of an ML model, and so can data with implicit biases. Synthetic data can address bias, a common problem when training ML models.
According to Gartner, “85% of the algorithms currently in use are error-prone due largely to bias,” and as a result, datasets often underrepresent women, people of color, and other minority groups. With synthetic data, data scientists can artificially boost the number of underrepresented samples within a dataset by generating new synthetic records whose characteristics align with those of the minority groups.
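As a concrete illustration, one widely used technique for rebalancing an imbalanced dataset is SMOTE (Synthetic Minority Over-sampling Technique) from the open-source imbalanced-learn library, which interpolates between existing minority samples to create new synthetic ones. The sketch below uses a small, made-up toy dataset purely to show the mechanics:

```python
# A minimal sketch of rebalancing a dataset with SMOTE from imbalanced-learn.
# The toy data below is hypothetical and exists only for demonstration.
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(42)

# 950 majority samples (label 0) and only 50 minority samples (label 1).
X = np.vstack([rng.normal(0.0, 1.0, size=(950, 4)),
               rng.normal(1.5, 1.0, size=(50, 4))])
y = np.array([0] * 950 + [1] * 50)
print("Before:", Counter(y))   # Counter({0: 950, 1: 50})

# SMOTE synthesizes new minority rows until the classes are balanced.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("After:", Counter(y_res))  # Counter({0: 950, 1: 950})
```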
Synthetic data is cheaper. Real-world data is expensive to source because it takes data scientists considerable time to collect and process, and businesses must also account for labor costs.
Data is the new world currency, so it is no surprise that businesses are spending millions of dollars to generate datasets. According to Datagen, a training image may cost upwards of $5 per image if it is sourced from a data labeling service but may cost as little as $0.05 if generated artificially. Thus, synthetic data can significantly bring down the cost of generating model training data.
Synthetic data is scalable. Data scientists rarely have access to the volume of data needed to train and test powerful predictive models at scale.
For big technology companies like Google and Amazon, collecting data is not much of an issue. For smaller businesses, however, access to real datasets is either limited, expensive, or simply non-existent. While there is no golden rule for how many training samples are needed to build a reliable model, the number largely depends on the type of machine learning problem the business is trying to solve. For example, a typical image classification problem could require tens of thousands of images or more to create a classifier. So, a small business may need to rely on thousands of synthetically generated images to train its ML models.
Synthetic data is free from privacy regulations. Data privacy regulations and laws often restrict data sharing between businesses or even within a single business.
For instance, due to HIPAA, the medical privacy law in the United States, and similar laws in other countries, a hospital that wants to use ML to build a service that better diagnoses diseases cannot simply use its own patient data. Synthetic data can sidestep these concerns by removing any trace of identity from the real data and creating a new, valid dataset. Unlike anonymization, it does not attempt to obscure, modify, or encrypt the underlying data; instead, it generates entirely new data using various methods and techniques.
Synthetic data offers simplicity and control. Synthetic data makes labeling easy because it can be directly controlled and adjusted by data scientists.
Data annotation, or data labeling, is a continuous process that often requires human annotators to label the training data fed to ML models. Synthetic data generation ensures higher data quality, balance, and variety, and it can automatically fill in missing values with contextual data labeling. Thus, synthetic data has major advantages, including reduced costs and higher accuracy in data labeling, since the labels in synthetic data are already known.
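To illustrate the "labels are already known" point, scikit-learn's built-in synthetic dataset generators return features and labels together, because the generator itself controls how each sample is produced. A minimal sketch, with purely illustrative parameter values:

```python
# A minimal sketch: synthetic data generators produce features and labels
# together, so no separate human annotation step is required.
from sklearn.datasets import make_classification

# 1,000 labeled samples with 10 features (2 informative) and a
# deliberate 9:1 class imbalance; every parameter here is illustrative.
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=2,
    weights=[0.9, 0.1],
    random_state=0,
)
print(X.shape, y.shape)  # (1000, 10) (1000,)
```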
Synthetic Data Generation Methods
Businesses use a variety of techniques to generate synthetic data. Below is an overview of the most popular synthetic data generation methods:
1. Statistical distribution approach:
Probability is the basic building block of machine learning. In this approach, data scientists observe the statistical distributions present in real data and then draw random samples that follow those distributions. This requires data scientists to have a good understanding of what the data distribution should look like so they can produce similar synthetic data. Generally, they use well-known probability distributions such as the normal, exponential, chi-square, and lognormal distributions, among others. The normal distribution, also known as the Gaussian or bell-curve distribution, for example, occurs in many natural situations, such as estimating the probability of car accidents. Thus, data scientists can produce similar data using this method.
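As a minimal sketch of this approach, the following uses NumPy to draw synthetic samples from several of the distributions mentioned above; the parameter values are purely illustrative, not derived from any real dataset:

```python
# A minimal sketch of the statistical distribution approach: draw synthetic
# samples directly from known probability distributions with NumPy.
import numpy as np

rng = np.random.default_rng(seed=0)
n = 10_000

heights_cm = rng.normal(loc=170, scale=8, size=n)        # normal (Gaussian)
wait_times = rng.exponential(scale=30, size=n)           # exponential
residuals  = rng.chisquare(df=3, size=n)                 # chi-square
incomes    = rng.lognormal(mean=10, sigma=0.5, size=n)   # lognormal

# Each array is a synthetic sample that follows the chosen distribution
# and can be combined into a larger synthetic dataset.
print(heights_cm.mean(), wait_times.mean(), residuals.mean(), incomes.mean())
```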
2. Fitted distribution approach:
In cases where real data exists, synthetic data can be generated by using best fit distributions, or fitting real data into a known distribution. In particular, if data scientists know the distribution parameters, they can use the Monte Carlo simulation method to generate synthetic data. The initial idea behind Monte Carlo simulations is to utilize a random number generation process to create repeated samples of synthetic data from a set of known statistical distributions. As the name suggests, this method was originally created to explore the impact of repeatedly played games (like casino games); now, Monte Carlo simulations have allowed data scientists to explore the impact of complicated real-life scenarios. Thus, it is primarily used to create variations on an initial dataset that are sufficiently random to be realistic.
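A minimal sketch of this approach, assuming SciPy is available: fit a known distribution to a (here, simulated) "real" sample, then draw repeated Monte Carlo samples from the fitted distribution:

```python
# A minimal sketch of the fitted distribution approach with a simple
# Monte Carlo resampling step. The "real" data below is itself simulated,
# purely to keep the example self-contained.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
real_data = rng.lognormal(mean=3.0, sigma=0.4, size=500)  # stand-in for real data

# Fit a lognormal distribution to the observed sample.
shape, loc, scale = stats.lognorm.fit(real_data, floc=0)

# Monte Carlo: repeatedly draw synthetic samples from the fitted distribution.
synthetic_runs = [
    stats.lognorm.rvs(shape, loc=loc, scale=scale,
                      size=len(real_data), random_state=i)
    for i in range(100)
]
print(real_data.mean(), synthetic_runs[0].mean())
```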
3. Decision trees:
The Monte Carlo simulation may not be appropriate for a business’ synthetic data needs. Thus, many businesses have turned to ML-based models, such as decision trees, to fit the distributions. Decision trees are flowchart-like tree structures in which the branches represent decisions and their possible consequences, including chance event outcomes. Decision trees are commonly used in data mining and machine learning because they allow businesses to model non-classical distributions “that can be multi-modal, which does not contain common characteristics of known distributions.” Using decision trees, data scientists can generate synthetic data that is highly correlated with the original data, although there is a risk of overfitting: producing a fit so specific to the original data (contrary to the heuristic that the simplest categorization is best) that the model fails to fit new data or predict future observations reliably.
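One rough sketch of a tree-based scheme (there are several) is to fit a shallow decision tree to the real data and then use it to label bootstrapped, slightly perturbed feature vectors, so the synthetic labels follow the structure the tree has learned; limiting tree depth is one simple guard against the overfitting risk noted above. The dataset here is a toy stand-in:

```python
# A rough sketch of tree-based synthetic data generation: fit a shallow
# decision tree to (toy) real data, create synthetic features by jittering
# bootstrapped real rows, and let the tree assign the synthetic labels.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy stand-in for real data with a non-classical, multi-modal shape.
X_real, y_real = make_moons(n_samples=500, noise=0.2, random_state=0)

# max_depth caps model complexity, a simple guard against overfitting.
tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_real, y_real)

# Synthetic features: bootstrap real rows and add small random jitter.
idx = rng.integers(0, len(X_real), size=2000)
X_syn = X_real[idx] + rng.normal(scale=0.05, size=(2000, 2))

# Synthetic labels come from the fitted tree, keeping them highly
# correlated with the structure of the original data.
y_syn = tree.predict(X_syn)
print(X_syn.shape, np.bincount(y_syn))
```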
4. Generative Adversarial Network (GAN):
Artificial neural networks are considered a more advanced method for generating synthetic data, as they can handle much richer data distributions than traditional algorithms (e.g., decision trees). Generative Adversarial Networks (GANs) are one of the more promising data generation solutions and have already seen successful attempts at generating synthetic data for energy consumption. A GAN is an algorithm that pits two neural networks against each other: one model (the generator) attempts to produce “synthetic” or fake data based on real data, while the other (the discriminator, or adversarial network) is fed the synthesized data and learns to differentiate fake samples from real ones, the two competing in a feedback loop that makes each better at its task. Because of the highly detailed, realistic synthetic data they can generate, GANs are widely used to produce unstructured data such as photorealistic images and videos.
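To make the adversarial loop concrete, below is a deliberately tiny, illustrative GAN in PyTorch that learns to mimic a one-dimensional Gaussian; real image-generating GANs are far larger, and every architecture and hyperparameter choice here is an assumption made only for the sketch:

```python
# A tiny, illustrative GAN: a generator learns to mimic a 1-D Gaussian
# "real" distribution while a discriminator learns to tell real from fake.
import torch
import torch.nn as nn

torch.manual_seed(0)
latent_dim, batch = 8, 128

G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

def real_samples(n):
    # Stand-in for real data: a Gaussian with mean 4.0 and std 1.25.
    return torch.randn(n, 1) * 1.25 + 4.0

for step in range(2000):
    # Train the discriminator on real vs. generated samples.
    fake = G(torch.randn(batch, latent_dim)).detach()
    real = real_samples(batch)
    loss_d = bce(D(real), torch.ones(batch, 1)) + bce(D(fake), torch.zeros(batch, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Train the generator to fool the discriminator.
    loss_g = bce(D(G(torch.randn(batch, latent_dim))), torch.ones(batch, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

# After training, the generator emits synthetic samples resembling real data.
with torch.no_grad():
    synthetic = G(torch.randn(1000, latent_dim))
print(synthetic.mean().item(), synthetic.std().item())
```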
5. VAE (Variational Autoencoders):
Alongside GANs, the Variational Autoencoder (VAE) is another well-known deep learning-based generative model; recently, a hybrid of the two called the VAE-GAN has also emerged. Autoencoder neural networks consist of two deep learning modules, an encoder and a decoder, which produce synthetic data (images, text) by learning latent representations of the training data. Together, GANs and VAEs are used to generate the high-quality, realistic synthetic data essential to machine learning algorithms, as they play a critical role in various classification problems. In one real-world example, 304 rows of heart disease data were used to create a robust model for predicting the presence of an ailment in a patient. Initially, identification of heart disease was not effective because of the small amount of available training data, so a GAN and a VAE were used to generate additional data and augment the original dataset, which helped increase the accuracy of the models trained on the new dataset.
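Below is a compact, illustrative VAE sketch in PyTorch on toy two-dimensional data: the encoder maps inputs to a latent Gaussian, the decoder reconstructs them, and new synthetic points are produced by decoding samples drawn from the latent prior. The architecture, data, and hyperparameters are all assumptions made for the sketch, not a reproduction of the heart disease study:

```python
# A compact, illustrative VAE: encode toy 2-D data into a latent Gaussian,
# train with reconstruction + KL loss, then decode prior samples to get
# new synthetic points.
import torch
import torch.nn as nn

torch.manual_seed(0)
latent_dim = 2

class VAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(2, 32), nn.ReLU())
        self.mu = nn.Linear(32, latent_dim)
        self.logvar = nn.Linear(32, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, 2))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.dec(z), mu, logvar

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy stand-in for real data: two Gaussian clusters.
data = torch.cat([torch.randn(500, 2) + 3.0, torch.randn(500, 2) - 3.0])

for epoch in range(500):
    recon, mu, logvar = model(data)
    recon_loss = ((recon - data) ** 2).sum()                       # reconstruction term
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())   # KL divergence term
    loss = recon_loss + kl
    opt.zero_grad(); loss.backward(); opt.step()

# Synthetic data: decode samples drawn from the standard normal prior.
with torch.no_grad():
    synthetic = model.dec(torch.randn(1000, latent_dim))
print(synthetic.mean(dim=0))
```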
Challenges
Despite its many benefits, synthetic data still carries some risks. As previously mentioned, the techniques used to generate “realistic-looking” data can also exacerbate harmful biases. In the real world, this can translate into amplified racism, sexism, and other biases “in high-impact areas like facial recognition, criminality prediction, and health care decision-making.” If a business does not adjust its ML models to account for bias, the reproduced synthetic data will carry all of the same biases.
Generating incorrect or erroneous synthetic data can also land a business in trouble with regulators. Consider, for example, the possibility that using incorrect data leads to a compliance or legal issue when a product or service harms someone or does not work as advertised; this can result in financial penalties as well. Because synthetic data generation is still an emerging field, regulators are only just beginning to assess how synthetic data is created and measured, but in the future they are likely to play a critical role in its governance.
Conclusion
In the future, it is very probable that synthetic data will overshadow real data and have a transformative impact across industries such as automotive, healthcare, finance, transportation, and more. In fact, Gartner predicts that “20% of all test data for consumer-facing use cases will be synthetically generated by 2025.” The rise of synthetic data will undoubtedly empower a new wave of AI projects by lowering the data barriers to building AI-first products, as its many benefits have already demonstrated.
For example, advancements in AI have paved the way for successful computer vision (CV) and image recognition applications. However, the challenge with neural networks and their associated applications in CV lies in the fact that these algorithms require large, correctly labeled datasets for better accuracy; collecting and annotating significant amounts of high-quality photos and videos to train a deep learning model is oftentimes time-consuming and expensive. This has given rise to the use of synthetic data in the form of high-quality, realistic, and diverse computer-generated images (read more about data labeling in computer vision applications).
The ability of synthetic data to support machine learning model development is just the beginning. Make sure to follow the SDT blog for future updates on machine learning in heavy industries!
Other resources on data labeling and computer vision can be found on SDT Naver blog or our SDT LinkedIn.
About the Author: Karen is a passionate B2B technology blogger. While studying at Georgia Tech, Karen first grew interested in cybersecurity and has since worked for several security and cloud companies as a global marketer. When she’s not freelance writing, Karen loves to explore new food trends.