Dataset
A dataset is a collection of data used to train AI. Its quality and scale determine AI performance, and it sits at the center of copyright debates.
A dataset is a collection of data used to train an AI model. If AI is a student, the dataset is its textbook and workbook: what the student reads, and how much, determines the student's skill.
Even when model architectures are similar, performance varies greatly with the quality and size of the dataset, so securing good datasets has become key to AI competitiveness. Very large text datasets scraped from across the web have become the foundation of large language models (LLMs), and standard datasets released for research have accelerated technical progress.
Meanwhile, datasets collected from the web tend to contain copyrighted works, personal information, and biased content, which places them at the center of copyright lawsuits and ethical debates. As concerns grow that high-quality data on the Internet is running out, training on synthetic data generated by AI is also being discussed.
✅ Why it matters
- It is the most basic concept for understanding where AI performance comes from
- It is key to interpreting news about competition for data and copyright disputes
- It explains why a company's own data is an asset in the AI era
⚠️ Limits and debates
- Legal disputes are ongoing because web-collected data contains copyrighted works and personal information
- Bias contained in the data carries over directly into bias in the AI
- Concerns are being raised that high-quality data is running out