
Crawling

Also known as: scraping

A technique in which a program automatically collects data from the web. It is a key issue in copyright disputes over the collection of AI training data.

Crawling is the process by which a program automatically navigates websites and collects data. Instead of people opening and copying pages one by one, automated bots scan and collect millions of pages. It is also how search engines such as Google index the web.
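The process described above can be sketched as a breadth-first traversal: start from one page, extract its links, and visit each new link once. The example below is a minimal illustration, not how any particular search engine or AI company crawls. The `FAKE_WEB` dictionary, the `LinkExtractor` class, and the `crawl` function are hypothetical names introduced here, and pages are served from memory rather than fetched over the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web" (URL -> HTML). A real crawler would fetch over HTTP.
FAKE_WEB = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/b">B</a> <a href="/">home</a>',
    "https://example.com/b": "<p>No links here.</p>",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl. `fetch` maps a URL to HTML, or None if unavailable."""
    seen = {start_url}          # URLs already queued, so each is visited once
    queue = deque([start_url])  # URLs waiting to be visited
    pages = {}                  # collected data: URL -> HTML
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


pages = crawl("https://example.com/", FAKE_WEB.get)
```

Replacing `FAKE_WEB.get` with a function that performs real HTTP requests would turn this into a working crawler, at which point rate limiting and site policies become relevant.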

The technology has long been used for search engines, price comparison, market research, and similar purposes, but it has become more prominent in the AI era. Because training LLMs requires large amounts of text, AI companies crawled the web at scale and used the results as training data, which became a key issue in copyright disputes. Media companies and creators have filed lawsuits claiming that AI companies built their models by taking content without permission, and whether crawlers respect robots.txt, the standard for blocking crawlers, is also disputed. Discussions are underway to establish new rules for the ownership and use of web data.
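As an illustration of how robots.txt works, the sketch below uses Python's standard `urllib.robotparser` module to check whether a given crawler may fetch a URL. The robots.txt content, the `ExampleAIBot` user agent, and the `is_allowed` helper are assumptions made up for this example. The file only states a site's wishes: the check happens on the crawler's side, so nothing technically prevents a crawler from skipping it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one (made-up) AI crawler entirely,
# and blocks /private/ for every other crawler.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse from text, no network access
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "ExampleAIBot", "https://example.com/article"))
print(is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/article"))
print(is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/private/report"))
```

A well-behaved crawler would run a check like this before each request and skip any URL for which it returns `False`.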

✅ Why it matters

⚠️ Limits and debates
