Add information for Google, Meta, and img2dataset bots

2025-05-21 01:33:11 +00:00 · 2024-06-21 20:49:25 -04:00 · ec4610a118
commit ec4610a118
parent 4163ca92a5
1 changed file with 5 additions and 5 deletions
--- a/table-of-bot-metrics.md
+++ b/table-of-bot-metrics.md
@@ -3,7 +3,7 @@
 | AdsBot-Google   | Google  | Yes (Exceptions for Dynamic Search Ads) | Analyzes website content for ad relevancy, improves ad serving for Google Ads. Data anonymized according to [Google's Privacy Policy](https://policies.google.com/privacy). Unclear on data retention or use by other products. | Varies depending on campaign activity and website updates. Crawls optimized to minimize impact, specific frequency not public. | Web crawler by Google Ads to analyze websites for ad effectiveness and ensure ad relevancy to webpage content. |
 |Amazonbot      | Amazon | Yes | Service improvement and enabling answers for Alexa users. | No information provided. | Includes references to crawled website when surfacing answers via Alexa; does not clearly outline other uses. |
 |anthropic-ai  | [Anthropic](https://www.anthropic.com) | Unclear at this time. | Scrapes data to train Anthropic's AI products. | No information provided. | Scrapes data to train LLMs and AI products offered by Anthropic. |
-|Applebot-Extended   |    [Apple](https://support.apple.com/en-us/119829#datausage)    |          Yes           |          |                  |      Apple has a secondary user agent, Applebot-Extended ... [that is] used to train Apple's foundation models powering generative AI features across Apple products, including Apple Intelligence, Services, and Developer Tools.       |
+|Applebot-Extended   | [Apple](https://support.apple.com/en-us/119829#datausage) | Yes           | Powers features in Siri, Spotlight, Safari, Apple Intelligence, and others. | Unclear at this time. | Apple has a secondary user agent, Applebot-Extended ... [that is] used to train Apple's foundation models powering generative AI features across Apple products, including Apple Intelligence, Services, and Developer Tools.       |
 |Bytespider    | ByteDance | No | LLM training. | Unclear at this time. | Downloads data to train LLMs, including ChatGPT competitors. |
 |CCBot         | [Common Crawl](https://commoncrawl.org) | [Yes](https://commoncrawl.org/ccbot) | Provides crawl data for an open source repository that has been used to train LLMs. | Unclear at this time. | Sources data that is made openly available and is used to train AI models. |
 |ChatGPT-User   | [OpenAI](https://openai.com) | Yes | Takes action based on user prompts. | Only when prompted by a user. | Used by plugins in ChatGPT to answer queries based on user input. |
@@ -11,11 +11,11 @@
 |Claude-Web | [Anthropic](https://www.anthropic.com) | Unclear at this time. | Scrapes data to train Anthropic's AI products. | No information provided. | Scrapes data to train LLMs and AI products offered by Anthropic. |
 |cohere-ai | [Cohere](https://cohere.com) | Unclear at this time. | Retrieves data to provide responses to user-initiated prompts. | Takes action based on user prompts. | Retrieves data based on user prompts. |
 |Diffbot | [Diffbot](https://www.diffbot.com/) | At the discretion of Diffbot users. | Aggregates structured web data for monitoring and AI model training. | Unclear at this time. | Diffbot is an application used to parse web pages into structured data; this data is used for monitoring or AI model training. |
-|FacebookBot    |         |                       |          |                  |             |
-|Google-Extended|         |                       |          |                  |             |
-|GoogleOther    |         |                       |          |                  |             |
+|FacebookBot    | Meta/Facebook | [Yes](https://developers.facebook.com/docs/sharing/bot/) | Training language models. | Up to 1 page per second. | Officially used for training Meta "speech recognition technology"; unknown whether it is used to train Meta AI specifically. |
+|Google-Extended| Google | [Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) | LLM training. | No information provided. | Used to train Gemini and Vertex AI generative APIs. Does not impact a site's inclusion or ranking in Google Search. |
+|GoogleOther    | Google | [Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) | Scrapes data. | No information provided. | "Used by various product teams for fetching publicly accessible content from sites. For example, it may be used for one-off crawls for internal research and development." |
 |GPTBot        | [OpenAI](https://openai.com) | Yes | Scrapes data to train OpenAI's products. | No information provided. | Data is used to train current and future models; paywalled data, PII, and data that violates the company's policies are removed. |
-| img2dataset |         |                       |          |                  |             |
+| img2dataset | [img2dataset](https://github.com/rom1504/img2dataset) | At the discretion of img2dataset users. | Scrapes images for use in LLMs. | At the discretion of img2dataset users. | Downloads large sets of images into datasets for LLM training or other purposes. |
 |omgili        |         |                       |          |                  |             |
 |omgilibot     |         |                       |          |                  |             |
 |peer39_crawler|         |                       |          |                  |             |