Big Tech's onslaught of bots forces publishers to play an impossible game of Whac-A-Mole
In the battle to protect their valuable work from content-thirsty AI scraping bots, publishers have to rely on a single text file for their defense.
For thirty years, the humble "robots.txt" file has been used by website owners to tell automated scrapers what content they will allow to be indexed and what they want to keep out of search engines.
But as tech companies race to ingest as much content as possible to train their AI models, the robots.txt file has also become the only place where content publishers can refuse to be scraped and potentially used for AI training, if they know exactly which scrapers to block. Scrapers identify themselves using names like Google's "googlebot," Meta's "facebookbot," or OpenAI's "gptbot," which appear in the "user agent" description of a web page request.
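To make the mechanism concrete, here is a minimal sketch of how a well-behaved scraper would read those rules, using the robots.txt parser in Python's standard library. The file contents, the example.com URL and the helper name is_allowed are invented for illustration and do not come from any real publisher.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, invented for illustration only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


STORY_URL = "https://example.com/news/story.html"  # placeholder URL

# The bot named in the file is refused everywhere on the site.
gptbot_ok = is_allowed(ROBOTS_TXT, "GPTBot", STORY_URL)

# Any other bot falls through to the wildcard rules, which only
# close off the /private/ directory.
googlebot_ok = is_allowed(ROBOTS_TXT, "Googlebot", STORY_URL)

print("GPTBot allowed:", gptbot_ok)
print("Googlebot allowed:", googlebot_ok)
```

In this sketch the check runs on the scraper's side: the file states the publisher's wishes, and it is up to the bot to look them up and honor them.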
Publishers must now increasingly play a game of Whac-A-Mole, adding new scrapers (like the one Meta recently let loose) to their robots.txt files as the bots pop up. Once a site has been scraped for AI training without permission, content owners have little recourse other than the courts.
Data journalist Ben Welsh's homepages.news project collects automated snapshots of top news websites, as well as the contents of their robots.txt files. In a recent sample of Welsh's data from Aug. 16-17, about 40% of top news sites blocked all scrapers. The most blocked scraper was OpenAI's "gptbot," with about 24% of the news sites blocking it. Meta's new "Meta-ExternalAgent" bot, which appeared in July, was blocked by only around 17% of sites.
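A tally like this can be approximated with a few lines of code. The sketch below is not Welsh's method and does not use his data: the four robots.txt files and site names are toy examples made up for illustration, and a site is counted as blocking a bot only if the bot is barred from the site's root path, which is a simplifying assumption.

```python
from urllib.robotparser import RobotFileParser

# Toy robots.txt files, invented for illustration. These are NOT drawn
# from the homepages.news data, and the site names are hypothetical.
ROBOTS_BY_SITE = {
    "site-a.example": "User-agent: GPTBot\nDisallow: /\n",
    "site-b.example": "User-agent: *\nDisallow: /\n",
    "site-c.example": "User-agent: *\nDisallow:\n",
    "site-d.example": "User-agent: FacebookBot\nDisallow: /\n",
}


def blocks_bot(robots_txt: str, user_agent: str) -> bool:
    """Treat a site as blocking a bot if the bot may not fetch the root path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


def share_blocking(robots_by_site: dict, user_agent: str) -> float:
    """Percentage of sites whose robots.txt blocks the given bot."""
    blocked = sum(blocks_bot(text, user_agent) for text in robots_by_site.values())
    return 100 * blocked / len(robots_by_site)


gptbot_share = share_blocking(ROBOTS_BY_SITE, "GPTBot")
meta_share = share_blocking(ROBOTS_BY_SITE, "Meta-ExternalAgent")

print(f"GPTBot blocked by {gptbot_share:.0f}% of toy sites")
print(f"Meta-ExternalAgent blocked by {meta_share:.0f}% of toy sites")
```

In the toy sample, the newer bot is stopped only by the one site that shuts out every scraper with a wildcard rule, while the older bot is also caught by a site that names it directly.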
Earlier this year, the Reuters Institute published a report finding that, by the end of 2023, 79% of US-based news websites were blocking OpenAI's bot.
The entire mechanism of the robots.txt file is voluntary, and many companies have been caught ignoring it altogether. If a company changes the name of its bot, or releases a new one that does not carry the company's name, publishers may not know to block it.
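The gap left by a new or renamed bot is easy to demonstrate. In this sketch, a hypothetical robots.txt names two older bots and has no wildcard rule; the file contents and the example.com URL are assumptions made for illustration, and the check again uses the parser from Python's standard library.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt written before a new bot appeared.
# It names two known scrapers and has no catch-all "*" rule.
STALE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: FacebookBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(STALE_ROBOTS_TXT.splitlines())

STORY_URL = "https://example.com/news/story.html"  # placeholder URL

# The bot the publisher knew about is refused.
old_bot_allowed = parser.can_fetch("FacebookBot", STORY_URL)

# A bot with a name the file never mentions matches no rule,
# so the parser reports that it may fetch the page.
new_bot_allowed = parser.can_fetch("Meta-ExternalAgent", STORY_URL)

print("FacebookBot allowed:", old_bot_allowed)
print("Meta-ExternalAgent allowed:", new_bot_allowed)
```

Even the refusal in the first check only describes what the file asks for. Nothing in the mechanism stops a scraper that never consults the file.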