
Crawling

Also known as: scraping

The automated collection of data from the web by a program. It is a key issue in copyright disputes over how AI training data is collected.

Crawling is the process in which a program automatically moves from page to page across websites and collects data. Instead of people opening and copying pages one by one, bots scan and collect millions of pages. Crawling is also how search engines like Google index the web.
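The visit-collect-follow loop described above can be sketched in Python. This is a minimal illustration rather than a production crawler: the `PAGES` dictionary, the `fetch` stand-in, and the `example.com` URLs are all hypothetical, chosen so the sketch runs without network access. A real crawler would replace `fetch` with an HTTP request.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "website" so the sketch runs without network access.
PAGES = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/b">B</a> <a href="/">home</a>',
    "https://example.com/b": "<p>No links here.</p>",
}


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Stand-in for an HTTP request; returns None for unknown pages."""
    return PAGES.get(url)


def crawl(start_url, max_pages=100):
    """Breadth-first crawl: visit a page, store it, queue its unseen links."""
    seen = {start_url}
    queue = deque([start_url])
    collected = {}
    while queue and len(collected) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        collected[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return collected


if __name__ == "__main__":
    for url in crawl("https://example.com/"):
        print(url)
```

The `seen` set keeps the crawler from revisiting pages that link back to each other, and `max_pages` caps how much it collects.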

The technique has long been used for search engines, price comparison, and market research, but its importance has grown in the AI era. Because training LLMs requires large amounts of text, AI companies crawled the web at large scale and used the results as training data, which has become a key issue in copyright disputes. Media companies and creators have filed lawsuits claiming that AI companies built their models by taking content without permission, and whether crawlers respect robots.txt, the standard for telling crawlers which parts of a site to stay out of, is also disputed. Discussions are underway to establish new rules for the ownership and use of web data.
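As a rough illustration of how a crawler can consult robots.txt before fetching a page, the sketch below uses Python's standard `urllib.robotparser` module. The rules in `ROBOTS_TXT`, the bot names `ExampleAIBot` and `OtherBot`, and the URLs are hypothetical. The file is parsed from a string here, whereas a real crawler would download it from the site's `/robots.txt` path.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named bot entirely,
# and keep every other bot out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, user_agent, url):
    """Report whether the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for agent in ("ExampleAIBot", "OtherBot"):
        for path in ("/articles/1", "/private/x"):
            url = "https://example.com" + path
            print(agent, path, is_allowed(rules, agent, url))
```

The check only reports what the site operator has asked for. Nothing in the mechanism stops a crawler that skips the check, which is why compliance is part of the dispute.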

✅ Why it matters

It is key to understanding news about copyright disputes over AI training data
It helps explain the basic operating principles of search engines and the data industry
For website operators, it is the starting point of a content protection strategy

⚠️ Limits and debates

Legal battles are underway over whether collecting content without the copyright holder's permission is lawful
Excessive crawling places a heavy load on website servers
The cat-and-mouse game between blocking and circumvention technologies continues