mirror of
https://github.com/ai-robots-txt/ai.robots.txt.git
synced 2025-04-05 03:17:46 +00:00
Name | Operator | Respects robots.txt | Data use | Visit regularity | Description |
---|---|---|---|---|---|
Amazonbot | Amazon | Yes | Service improvement and enabling answers for Alexa users. | No information provided. | Includes references to crawled website when surfacing answers via Alexa; does not clearly outline other uses. |
anthropic-ai | Anthropic | Unclear at this time. | Scrapes data to train Anthropic's AI products. | No information provided. | Scrapes data to train LLMs and AI products offered by Anthropic. |
Applebot-Extended | Apple | Yes | Powers features in Siri, Spotlight, Safari, Apple Intelligence, and others. | Unclear at this time. | Apple has a secondary user agent, Applebot-Extended ... [that is] used to train Apple's foundation models powering generative AI features across Apple products, including Apple Intelligence, Services, and Developer Tools. |
Bytespider | ByteDance | No | LLM training. | Unclear at this time. | Downloads data to train LLMs, including ChatGPT competitors. |
CCBot | Common Crawl | Yes | Provides crawl data for an open source repository that has been used to train LLMs. | Unclear at this time. | Sources data that is made openly available and is used to train AI models. |
ChatGPT-User | OpenAI | Yes | Takes action based on user prompts. | Only when prompted by a user. | Used by plugins in ChatGPT to answer queries based on user input. |
ClaudeBot | Anthropic | Unclear at this time. | Scrapes data to train Anthropic's AI products. | No information provided. | Scrapes data to train LLMs and AI products offered by Anthropic. |
Claude-Web | Anthropic | Unclear at this time. | Scrapes data to train Anthropic's AI products. | No information provided. | Scrapes data to train LLMs and AI products offered by Anthropic. |
cohere-ai | Cohere | Unclear at this time. | Retrieves data to provide responses to user-initiated prompts. | Takes action based on user prompts. | Retrieves data based on user prompts. |
Diffbot | Diffbot | At the discretion of Diffbot users. | Aggregates structured web data for monitoring and AI model training. | Unclear at this time. | Diffbot is an application used to parse web pages into structured data; this data is used for monitoring or AI model training. |
FacebookBot | Meta/Facebook | Yes | Training language models. | Up to 1 page per second. | Officially used for training Meta "speech recognition technology"; unknown if used to train Meta AI specifically. |
facebookexternalhit | Meta/Facebook | Yes | No information. | Unclear at this time. | Unclear at this time. |
Google-Extended | Google | Yes | LLM training. | No information. | Used to train Gemini and Vertex AI generative APIs. Does not impact a site's inclusion or ranking in Google Search. |
GoogleOther | Google | Yes | Scrapes data. | No information. | "Used by various product teams for fetching publicly accessible content from sites. For example, it may be used for one-off crawls for internal research and development." |
GoogleOther-Image | Google | Yes | Scrapes data. | No information. | "Used by various product teams for fetching publicly accessible content from sites. For example, it may be used for one-off crawls for internal research and development." |
GoogleOther-Video | Google | Yes | Scrapes data. | No information. | "Used by various product teams for fetching publicly accessible content from sites. For example, it may be used for one-off crawls for internal research and development." |
GPTBot | OpenAI | Yes | Scrapes data to train OpenAI's products. | No information. | Data is used to train current and future models; paywalled data, PII, and data that violates the company's policies are removed. |
ICC-Crawler | NICT | Yes | Scrapes data to train and support AI technologies. | No information. | Use the collected data for artificial intelligence technologies; provide data to third parties, including commercial companies; those companies can use the data for their own business. |
img2dataset | img2dataset | Unclear at this time. | Scrapes images for use in LLMs. | At the discretion of img2dataset users. | Downloads large sets of images into datasets for LLM training or other purposes. |
Meta-ExternalAgent | Meta | Yes | Used to train models and improve products. | No information. | "The Meta-ExternalAgent crawler crawls the web for use cases such as training AI models or improving products by indexing content directly." |
OAI-SearchBot | OpenAI | Yes | Search result generation. | No information. | Crawls sites to surface as results in SearchGPT. |
omgili | Webz.io | Yes | Data is sold. | No information. | Crawls sites for APIs used by Hootsuite, Sprinklr, NetBase, and other companies. Data also sold for research purposes or LLM training. |
omgilibot | Webz.io | Yes | Data is sold. | No information. | Legacy user agent initially used for the Omgili search engine. Unknown if still used; the omgili agent is still used by Webz.io. |
PerplexityBot | Perplexity | No | Used to answer queries at the request of users. | Takes action based on user prompts. | Operated by Perplexity to obtain results in response to user queries. |
PetalBot | Huawei | Yes | Used to provide recommendations in Huawei assistant and AI search services. | No explicit frequency provided. | Operated by Huawei to provide search and AI assistant services. |
Scrapy | Zyte | Unclear at this time. | Scrapes data for a variety of uses, including training AI. | No information. | "AI and machine learning applications often need large amounts of quality data, and web data extraction is a fast, efficient way to build structured data sets." |
Timpibot | Timpi | Unclear at this time. | Scrapes data for use in training LLMs. | No information. | Makes data available for training AI models. |
VelenPublicWebCrawler | Velen Crawler | Yes | Scrapes data for business data sets and machine learning models. | No information. | "Our goal with this crawler is to build business datasets and machine learning models to better understand the web." |
YouBot | You | Yes | Scrapes data for search engine and LLMs. | No information. | Retrieves data used for You.com web search engine and LLMs. |
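
The user-agent names in the table can be turned into `robots.txt` rules. The following is a minimal sketch, not the repository's own tooling: the helper names `build_robots_txt` and `is_blocked`, the short `AI_AGENTS` list, and the `example.com` URL are assumptions made for illustration. It builds a single rule group that disallows every path for the listed agents, then checks the result with Python's standard `urllib.robotparser`. Such rules only affect crawlers that honor `robots.txt`; agents marked "No" or "Unclear at this time." in the table may ignore them.

```python
from urllib.robotparser import RobotFileParser

# A few user-agent names taken from the table (illustrative subset, not the full list).
AI_AGENTS = ["GPTBot", "CCBot", "ClaudeBot", "Google-Extended"]


def build_robots_txt(agents):
    """Return robots.txt text with one group that disallows all paths for `agents`."""
    lines = [f"User-agent: {name}" for name in agents]
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


def is_blocked(robots_txt, agent, url="https://example.com/page"):
    """Return True if `robots_txt` forbids `agent` from fetching `url`.

    The URL is a placeholder; any path is covered by "Disallow: /".
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


robots_txt = build_robots_txt(AI_AGENTS)
print(robots_txt)
print(is_blocked(robots_txt, "GPTBot"))        # listed agent
print(is_blocked(robots_txt, "SomeOtherBot"))  # agent not in the list
```

Note that `urllib.robotparser` matches agent names by case-insensitive substring, so a rule for a short name such as `omgili` would also apply to `omgilibot` under this parser; real crawlers may match differently.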