You Don’t Have Good AI Training Data? Take a Look.

Synthetic Data — The Road to Data-Centric AI

3 min read · Apr 11, 2022

Stop optimizing the AI model and start optimizing the data (source)

A lot of AI projects do not have enough good data to train meaningful models. The more complicated the application is, the harder it becomes to get your hands on good data. Even if you have a large amount of data, the preparation (cleaning, annotating, labeling, …) takes far too much time and drives up the cost of your project.

So the question is, is there an alternative? As you might have guessed, the answer is yes:

Data-centric AI in combination with synthetic data can lead you to the promised land!

The Role of Data-Centric AI

Data-centric AI is the latest buzz in town these days. What does that mean and how does it compare to things we have been doing for a while?

The term was coined by Andrew Ng and means that we need to focus on engineering our training data so that it is actually useful for the problem at hand.

When designing an AI application, the goal isn’t to use AI but to solve an actual (business) problem, and the data is one of the main contributors to the quality of the predictions.

Andrew Ng explained the core of data-centric AI so concisely that it would be pointless to compete with it:

Data-centric AI is the discipline of systematically engineering the data needed to successfully build an AI system.

So now you might wonder how in the world you are going to optimize your data. The answer: synthetic data by simulation.

Synthetic Data by Simulation

When done right, simulating synthetic data can solve many of your data quality issues.

Simulations are based on input models (e.g. 3D models) and some simulation code which describes the behavior of a sensor (e.g. LIDAR, camera, radar) in the 3D model. The output of the simulation is synthetic sensor data analogous to the data a real sensor would record. Even better, you can run the simulation with hundreds of thousands of different parameter sets, creating a highly diverse data set.
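To make the idea concrete, here is a minimal sketch of such a pipeline in Python. It is a toy, not a real simulator: the input model is a flat 2D scene of circles instead of a 3D model, the sensor is a simplified LIDAR-like range scanner placed at the origin, and the names Circle, ray_distance and simulate_scan are made up for this illustration. The sensor is assumed to sit outside every obstacle.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Circle:
    """Hypothetical stand-in for one obstacle in the input model."""
    x: float
    y: float
    radius: float


def ray_distance(angle, circle, max_range):
    """Distance from the origin along `angle` to the circle, or max_range on a miss."""
    dx, dy = math.cos(angle), math.sin(angle)
    # Solve |t * d - center|^2 = radius^2 for the nearest t >= 0.
    b = -(dx * circle.x + dy * circle.y)
    c = circle.x ** 2 + circle.y ** 2 - circle.radius ** 2
    disc = b * b - c
    if disc < 0:
        return max_range  # the ray misses the circle
    t = -b - math.sqrt(disc)
    return t if 0.0 <= t <= max_range else max_range


def simulate_scan(scene, n_rays=36, max_range=10.0):
    """One simulated sensor sweep: the closest hit for each ray direction."""
    scan = []
    for i in range(n_rays):
        angle = 2.0 * math.pi * i / n_rays
        hits = (ray_distance(angle, obstacle, max_range) for obstacle in scene)
        scan.append(min(hits, default=max_range))
    return scan


# Parameter sweep: every parameter combination yields one synthetic sample.
dataset = []
for distance in (2.0, 4.0, 6.0, 8.0):
    for radius in (0.5, 1.0):
        params = {"distance": distance, "radius": radius}
        scan = simulate_scan([Circle(distance, 0.0, radius)])
        dataset.append((params, scan))
```

The sweep here covers only eight parameter combinations. Scaling it up is a matter of widening the loops, which is where the diversity of a simulated data set comes from.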

The main difference from real data is that with synthetic data you know exactly what is in the data and where it is. The underlying 3D model tells you what you can see in the data. The combination of the model, the parameter set and their relation to the simulated data allows for automatic annotation, which is great news, since annotation has been a major pain for all AI projects.
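A small sketch shows why the annotation comes for free. The example below is an assumption-laden toy: the "model" is a list of labeled rectangles, the "sensor" is a grayscale renderer with a bit of noise, and Box and render_with_annotations are hypothetical names. Objects are assumed to lie fully inside the image.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical scene object: a labeled rectangle in pixel coordinates."""
    label: str
    left: int
    top: int
    width: int
    height: int


def render_with_annotations(objects, width, height, seed=0):
    """Render a noisy grayscale image and derive its labels from the scene model."""
    rng = random.Random(seed)
    # Dark, slightly noisy background.
    image = [[rng.uniform(0.0, 0.1) for _ in range(width)] for _ in range(height)]
    mask = [["background"] * width for _ in range(height)]
    for obj in objects:
        for y in range(max(0, obj.top), min(height, obj.top + obj.height)):
            for x in range(max(0, obj.left), min(width, obj.left + obj.width)):
                image[y][x] = 1.0 - rng.uniform(0.0, 0.1)  # bright object pixel
                mask[y][x] = obj.label  # per-pixel label, known from the model
    # Bounding boxes are read straight from the scene description.
    boxes = [
        {"label": o.label, "bbox": (o.left, o.top, o.width, o.height)}
        for o in objects
    ]
    return image, mask, boxes


scene = [Box("car", 2, 1, 3, 2), Box("sign", 6, 4, 1, 1)]
image, mask, boxes = render_with_annotations(scene, width=8, height=6)
```

Nobody had to look at the image to produce the mask or the bounding boxes. Both are read off the same scene description that produced the pixels.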

Usually simulations are deterministic as well, so synthetic data is also reproducible. That’s great news for everybody who works in any kind of regulated industry (automotive, medical, …).
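One way to picture reproducibility is sketched below. It assumes a simulation whose only source of randomness is a seeded noise generator. The simulate and fingerprint functions and their parameters are hypothetical and stand in for a real sensor simulation. Results are expected to repeat on the same Python version. Across versions, the random number stream is not guaranteed to be identical.

```python
import hashlib
import json
import random


def simulate(params, seed):
    """Toy sensor simulation: a true distance plus seeded Gaussian noise."""
    rng = random.Random(seed)  # all randomness flows from this seed
    return [
        round(params["distance"] + rng.gauss(0.0, params["noise_std"]), 6)
        for _ in range(params["n_samples"])
    ]


def fingerprint(params, seed, data):
    """Hash of the inputs and outputs, usable as an audit record for a data set."""
    payload = json.dumps({"params": params, "seed": seed, "data": data}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


params = {"distance": 12.5, "noise_std": 0.05, "n_samples": 100}
first = simulate(params, seed=42)
second = simulate(params, seed=42)

# Same parameters and same seed give the same data and the same fingerprint.
same_data = first == second
same_hash = fingerprint(params, 42, first) == fingerprint(params, 42, second)
```

Storing the parameters, the seed and the fingerprint next to a data set is one possible way to show later that the data can be regenerated exactly.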

What is more, for some problems you might not be able to get real data at all, or at least not in the amount you need. Medical data is hard to access for industrial use because of privacy concerns. Driving millions of miles with different cars and drivers in all kinds of road and weather conditions is very expensive and time-consuming. Simulating these scenarios, however, is possible.

Key Advantages of Simulated Synthetic Data

It’s exactly the data that you need for your application.
Simulations are faster than real-life data acquisition.
Simulations are possible when data acquisition is not possible.
The data is already annotated, saving you even more time.
You have full control over the data creation process.

Where to get synthetic data?

There are many companies offering tools to generate synthetic data. The most popular space right now is the automotive sector, because the data problem there is big and the solution is relatively straightforward: use game engines.

Other vendors focus on more general computer vision and enable you to generate synthetic data for faces, rooms, retail environments and much more.

Most companies currently offer software-as-a-service products that let you generate data for very specific use cases. In the future, you will see more and more platform-as-a-service offerings that allow you to tailor the data generation to your needs. This will open up even greater opportunities for AI companies.

TL;DR

A lot of AI applications lack good training data.
Data-centric AI is about engineering the right data for your problem.
The simulation of synthetic data allows you to do just that.
Simulated data has further advantages, such as saving time and cost.