
Synthetic data

Data generated by AI that is reused to train AI. It is seen as a remedy for the dwindling supply of real data, and it is also the subject of a debate over quality degradation.

Synthetic data is data that is not collected from the real world but is generated artificially by an AI model or a simulation and then used for training. For example, researchers can work with virtual patient records that share the statistical characteristics of real ones instead of the actual records, and math problems written by one AI can be used to train another.
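The patient-record example can be sketched in a few lines of Python. This is a minimal illustration under strong assumptions, not how production synthetic-data tools work: it treats a single numeric column as normally distributed, and the function names (`fit_gaussian`, `synthesize`) and the sample measurements are hypothetical, invented for this sketch.

```python
import random
import statistics

def fit_gaussian(values):
    """Summarize a numeric column by its mean and sample standard deviation."""
    return statistics.mean(values), statistics.stdev(values)

def synthesize(values, n, seed=0):
    """Draw n synthetic values from a normal distribution fitted to `values`."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    mean, std = fit_gaussian(values)
    return [rng.gauss(mean, std) for _ in range(n)]

# Hypothetical "real" measurements (made-up numbers, for illustration only).
real_systolic = [118, 125, 132, 110, 140, 127, 121, 135, 129, 116]

# The synthetic column is sampled from the fitted distribution rather than
# copied from the real records.
synthetic_systolic = synthesize(real_systolic, n=5000)

print("real mean/std:     ", fit_gaussian(real_systolic))
print("synthetic mean/std:", fit_gaussian(synthetic_systolic))
```

The synthetic column should track the mean and spread of the original without reproducing any individual record. Matching two summary statistics of one column is far simpler than generating realistic records, and it does not by itself guarantee privacy.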

Two concerns have driven the rise of synthetic data: that high-quality training data on the Internet is running out, and that privacy regulations make real data difficult to use. Synthetic data is now widely used in training the latest models, and it also lets systems learn dangerous situations safely, for example through driving simulations for autonomous vehicles.

However, when AI is repeatedly trained on data created by AI, errors and biases can accumulate and quality can deteriorate, a concern known as model collapse. Appropriate mixing with real data and quality verification are considered key to avoiding it.
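A toy simulation can make the accumulation effect concrete. The sketch below is only an analogy, assuming a "model" that is nothing more than a Gaussian refitted each generation to a small sample; the function name `run_generations`, the sample size, and the 50% mixing ratio are hypothetical choices for illustration.

```python
import random
import statistics

TRUE_MEAN, TRUE_STD = 0.0, 1.0  # stand-in for the "real" data distribution

def run_generations(generations, sample_size, real_fraction, seed=0):
    """Refit a Gaussian each generation on a mix of its own samples and real samples.

    Returns the fitted standard deviation after each generation.
    """
    rng = random.Random(seed)
    mean, std = TRUE_MEAN, TRUE_STD
    n_real = int(sample_size * real_fraction)
    history = []
    for _ in range(generations):
        # Synthetic samples come from the current model ...
        data = [rng.gauss(mean, std) for _ in range(sample_size - n_real)]
        # ... and real samples come from the true source.
        data += [rng.gauss(TRUE_MEAN, TRUE_STD) for _ in range(n_real)]
        # The next "model" is simply the Gaussian fitted to this data.
        mean, std = statistics.mean(data), statistics.stdev(data)
        history.append(std)
    return history

synthetic_only = run_generations(500, sample_size=10, real_fraction=0.0)
half_real = run_generations(500, sample_size=10, real_fraction=0.5)

print(f"synthetic only, final std: {synthetic_only[-1]:.3g}")
print(f"half real data, final std: {half_real[-1]:.3g}")
```

In this setup the synthetic-only chain tends to lose its spread as small estimation errors compound across generations, while the chain that keeps mixing in real samples stays anchored near the original distribution. Real models are far more complex, so this shows the mechanism rather than predicting its size.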

