chore: add Meta-ExternalAgent

2025-04-05 19:37:45 +00:00 · 2024-07-29 08:27:31 -07:00 · 6e323554c6
commit 6e323554c6
parent 2972926532
2 changed files with 3 additions and 1 deletion
--- a/robots.txt
+++ b/robots.txt
@@ -17,6 +17,7 @@ User-agent: GoogleOther-Video
 User-agent: GPTBot
 User-agent: ImagesiftBot
 User-agent: img2dataset
+User-agent: Meta-ExternalAgent
 User-agent: OAI-SearchBot
 User-agent: omgili
 User-agent: omgilibot
--- a/table-of-bot-metrics.md
+++ b/table-of-bot-metrics.md
@@ -16,7 +16,8 @@
 |GoogleOther-Image    | Google | [Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) | Scrapes data. | No information | "Used by various product teams for fetching publicly accessible content from sites. For example, it may be used for one-off crawls for internal research and development." |
 |GoogleOther-Video    | Google | [Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) | Scrapes data. | No information | "Used by various product teams for fetching publicly accessible content from sites. For example, it may be used for one-off crawls for internal research and development." |
 |GPTBot        | [OpenAI](https://openai.com) | Yes | Scrapes data to train OpenAI's products. | No information | Data is used to train current and future models, removed paywalled data, PII and data that violates the company's policies. |
-| img2dataset | [img2dataset](https://github.com/rom1504/img2dataset) | At the discretion of img2dataset users. | Scrapes images for use in LLMs. | At the discretion of img2dataset users. | Downloads large sets of images into datasets for LLM training or other purposes. |
+| img2dataset | [img2dataset](https://github.com/rom1504/img2dataset) | Unclear at this time. | Scrapes images for use in LLMs. | At the discretion of img2dataset users. | Downloads large sets of images into datasets for LLM training or other purposes. |
+| Meta-ExternalAgent | [Meta](https://developers.facebook.com/docs/sharing/webmasters/web-crawlers) | Yes. | Used to train models and improve products. | No information | "The Meta-ExternalAgent crawler crawls the web for use cases such as training AI models or improving products by indexing content directly." |
 |OAI-SearchBot        | [OpenAI](https://openai.com) | [Yes](https://platform.openai.com/docs/bots) | Search result generation. | No information | Crawls sites to surface as results in SearchGPT. |
 |omgili        | [Webz.io](https://webz.io/) | [Yes](https://webz.io/blog/web-data/what-is-the-omgili-bot-and-why-is-it-crawling-your-website/) | Data is sold. | No information | Crawls sites for APIs used by Hootsuite, Sprinklr, NetBase, and other companies. Data also sold for research purposes or LLM training. |
 |omgilibot     | [Webz.io](https://webz.io/) | [Yes](https://web.archive.org/web/20170704003301/http://omgili.com/Crawler.html) | Data is sold. | No information | Legacy user agent initially used for Omgili search engine. Unknown if still used, `omgili` agent still used by Webz.io. |
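
One way to sanity-check a change like this locally is to feed the grouped `User-agent` lines to Python's standard `urllib.robotparser` and ask whether the new agent is blocked. This is only a minimal sketch: the `User-agent` lines are taken from the hunk above, but the closing `Disallow: /` rule is an assumption (the hunk does not show it), and `ROBOTS_SNIPPET`, `is_blocked`, and the `example.com` URL are hypothetical names used for illustration.

```python
from urllib.robotparser import RobotFileParser

# User-agent lines copied from the robots.txt hunk above.
# ASSUMPTION: the group ends with "Disallow: /"; the hunk does not show the rule.
ROBOTS_SNIPPET = """\
User-agent: GPTBot
User-agent: ImagesiftBot
User-agent: img2dataset
User-agent: Meta-ExternalAgent
User-agent: OAI-SearchBot
Disallow: /
"""


def is_blocked(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may not fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


if __name__ == "__main__":
    # "SomeOtherBot" is a hypothetical agent that is not listed in the snippet.
    for agent in ("Meta-ExternalAgent", "GPTBot", "SomeOtherBot"):
        blocked = is_blocked(ROBOTS_SNIPPET, agent, "https://example.com/page")
        print(f"{agent}: blocked={blocked}")
```

Under the assumed `Disallow: /` rule, agents listed in the group should be reported as blocked, while an unlisted agent is allowed because the snippet has no `User-agent: *` fallback. Real crawlers may interpret robots.txt differently from this parser, so treat the result as a rough check only.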