Dataset

Also known as: training data

A dataset is a collection of data used to train AI. Its quality and scale determine AI performance, and it sits at the center of copyright debates.

If we compare AI to a student, a dataset is the textbook and workbook: what the student reads, and how much, determines the student's skills.

Even when model architectures are similar, performance varies greatly with the quality and size of the dataset, so securing good datasets has become key to AI competitiveness. Very large text datasets scraped from across the web have become the foundation of LLMs, and standard datasets released for research have accelerated technological progress.

Meanwhile, datasets collected from the web tend to contain copyrighted works, personal information, and biased content, which puts them at the center of copyright lawsuits and ethical debates. Amid concerns that high-quality data on the Internet is running out, training on synthetic data generated by AI is also being discussed.

