ai.robots.txt

This is an open list of web crawlers associated with AI companies and the training of LLMs to block. We encourage you to contribute to and implement this list on your own site. See information about the listed crawlers and the FAQ.
A number of these crawlers have been sourced from Dark Visitors and we appreciate the ongoing effort they put in to track these crawlers.
If you'd like to add information about a crawler to the list, please make a pull request with the bot name added to robots.txt, ai.txt, and any relevant details in table-of-bot-metrics.md to help people understand what's crawling.
Usage
Many visitors will find these files from this repository most useful:

- robots.txt
- .htaccess

robots.txt implements the Robots Exclusion Protocol (RFC 9309).
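As a minimal illustration of the protocol, a robots.txt group that asks crawlers to stay away from the whole site looks like the sketch below. GPTBot and CCBot are shown only as examples; the generated file in this repository lists many more user agents.

```
# Ask the named AI crawlers not to fetch any path on the site.
User-agent: GPTBot
User-agent: CCBot
Disallow: /
```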
The second one tells your own web server to return an error page when one of the listed AI crawlers requests a page from your website. A .htaccess file does not work on every web server, but it works correctly on most common, inexpensive shared hosting providers. The majority of AI crawlers set a "User Agent" string in every request they send, by which they are identifiable; this string is used to filter the requests. Instead of simply hoping the crawler pledges to respect our intention, this solution actively sends back a bad response (an error or an empty page). Note that this solution isn't bulletproof either, as anyone can fake the User Agent they send.
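As a rough sketch of that idea using Apache's mod_rewrite: the .htaccess file in this repository is generated and covers the full list, so treat the user agents and the response below as illustrative assumptions rather than the exact generated rules.

```apache
# Refuse requests whose User Agent matches one of the listed AI crawlers.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|ClaudeBot) [NC]
RewriteRule .* - [F,L]
```

The [F] flag makes Apache answer with "403 Forbidden" instead of serving the page, and [NC] makes the match case-insensitive.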
Note that, as stated in the httpd documentation, more performant methods than an .htaccess file exist. Nevertheless, most shared hosting providers only allow .htaccess configuration.
We suggest adding both files, as some crawlers may respect robots.txt while not sending an identifiable User Agent; on the other hand, other crawlers may not respect robots.txt but do provide an identifiable User Agent by which we can filter them out.
Contributing
A note about contributing: updates should be made to robots.json. A GitHub Action, courtesy of Adam, will then generate the updated robots.txt and table-of-bot-metrics.md.
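For orientation, each entry in robots.json is keyed by the bot's name and carries a few descriptive fields; the sketch below is only an assumption about the shape of an entry, so check the existing entries for the exact field names and values before opening a pull request.

```json
{
  "ExampleBot": {
    "operator": "Example AI Co.",
    "respect": "Unclear",
    "function": "Scrapes pages to train generative AI models.",
    "frequency": "No information provided.",
    "description": "Hypothetical crawler entry shown only to illustrate the expected structure."
  }
}
```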
Subscribe to updates
You can subscribe to list updates via RSS/Atom with the releases feed:
https://github.com/ai-robots-txt/ai.robots.txt/releases.atom
You can subscribe with Feedly, Inoreader, The Old Reader, Feedbin, or any other reader app.
Alternatively, you can subscribe to new releases with your GitHub account by clicking the ⬇️ next to the "Watch" button at the top of this page, then choosing "Custom" and selecting "Releases".
Report abusive crawlers
If you use Cloudflare's hard block alongside this list, you can report abusive crawlers that don't respect robots.txt here.
Additional resources
- Blocking Bots with Nginx by Robb Knight
- Blockin' bots. by Ethan Marcotte
- Blocking Bots With 11ty And Apache by fLaMEd fury
- Blockin' bots on Netlify by Jeremia Kimelman
- Blocking AI web crawlers by Glyn Normington
- Block AI Bots from Crawling Websites Using Robots.txt by Jonathan Gillham, Originality.AI