Glossary · Term

Synthetic data

Data generated by AI and reused to train AI. It is seen as a remedy for the dwindling supply of real data, and it is debated because of the risk of quality degradation.

Synthetic data is data that is not collected from the real world but is artificially created by AI or simulation and used for training. For example, researchers can work with virtual patient records that share the statistical characteristics of actual patient records, or math problems written by one AI can be used to train another.
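The idea of "virtual records with the same statistical characteristics" can be illustrated with a minimal sketch. It assumes a single numeric column that is roughly Gaussian. The values in `real_systolic` are made-up placeholders rather than actual patient data, and the helper names `fit_gaussian` and `sample_synthetic` are hypothetical. Practical generators model the joint distribution of many columns and add privacy safeguards, which this toy does not attempt.

```python
import random
import statistics

def fit_gaussian(values):
    """Estimate the mean and sample standard deviation of a numeric column."""
    return statistics.mean(values), statistics.stdev(values)

def sample_synthetic(values, n, seed=0):
    """Draw n synthetic values from a Gaussian fitted to the real column."""
    mu, sigma = fit_gaussian(values)
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [rng.gauss(mu, sigma) for _ in range(n)]

# Hypothetical "real" measurements (illustrative numbers only).
real_systolic = [118, 125, 132, 121, 140, 129, 135, 117, 126, 131]

# Synthetic column: no value is copied from a real record,
# but the mean and spread should be close to the real ones.
synthetic_systolic = sample_synthetic(real_systolic, n=5000)

print("real      mean/stdev:", fit_gaussian(real_systolic))
print("synthetic mean/stdev:", fit_gaussian(synthetic_systolic))
```

The synthetic column can then stand in for the real one in analyses that depend only on those summary statistics.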

Synthetic data has gained attention for two reasons: concern that high-quality training data on the Internet is running out, and privacy regulations that make real data hard to use. It is already widely used in training the latest models, and it also lets systems learn dangerous situations safely, as in driving simulations for autonomous vehicles.

However, when AI is repeatedly trained on AI-generated data, errors and biases can accumulate and quality can deteriorate, a failure mode known as model collapse. Appropriate mixing with real data and quality verification are therefore considered key.
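The accumulation effect can be seen in a toy simulation. In this sketch the "model" is just a Gaussian fitted to its training data, and each generation is trained on samples drawn from the previous generation's fit. This is a simplified analogy under stated assumptions, not how large models are trained; the function `run_generations` and its parameters are hypothetical names chosen for illustration. With a small sample per generation, estimation error compounds and the spread of the data tends to shrink, while mixing in fresh real samples keeps the spread anchored.

```python
import random
import statistics

def run_generations(generations, n, real_fraction, seed=0):
    """Refit a Gaussian each generation on a mix of synthetic and fresh real data.

    real_fraction = 0.0 means every generation trains only on the previous
    generation's synthetic output. The "real" distribution is assumed to be
    a standard normal, N(0, 1).
    """
    rng = random.Random(seed)
    data = [rng.gauss(0.0, 1.0) for _ in range(n)]  # generation 0: all real
    n_real = int(n * real_fraction)
    for _ in range(generations):
        # "Train" the model: estimate mean and spread from the current data.
        mu, sigma = statistics.mean(data), statistics.stdev(data)
        # Generate the next training set from the fitted model...
        synthetic = [rng.gauss(mu, sigma) for _ in range(n - n_real)]
        # ...optionally mixed with fresh samples from the real distribution.
        fresh_real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        data = synthetic + fresh_real
    return statistics.stdev(data)

pure = run_generations(generations=500, n=10, real_fraction=0.0)
mixed = run_generations(generations=500, n=10, real_fraction=0.5)

print("spread after 500 generations, synthetic only:", pure)
print("spread after 500 generations, half real data:", mixed)
```

In the synthetic-only run the spread typically shrinks toward zero, meaning the diversity of the original data is lost, whereas the mixed run stays on the order of the real spread.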

✅ Why it matters

It is a realistic answer to the depletion and shortage of real data
It helps with regulatory compliance, since training data can be created without personal information
Data for rare or dangerous situations can be mass-produced safely

⚠️ Limits and debates

Repeated training on AI-generated data risks model collapse and degraded quality
Errors and biases in the generating model are replicated in the synthetic data
It may not fully reproduce the subtle diversity of the real world