Add bot info for omgili, peer39, and youbot

2025-05-23 01:54:57 +00:00 · 2024-06-21 21:43:41 -04:00 · 2024-06-21 21:43:41 -04:00 · 149a72be0c
commit 149a72be0c
parent ec4610a118
1 changed files with 8 additions and 8 deletions
--- a/table-of-bot-metrics.md
+++ b/table-of-bot-metrics.md
@ -12,13 +12,13 @@
 |cohere-ai | [Cohere](https://cohere.com) | Unclear at this time. | Retrieves data to provide responses to user-initiated prompts. | Takes action based on user prompts. | Retrieves data based on user prompts. |
 |Diffbot | [Diffbot](https://www.diffbot.com/) | At the discretion of Diffbot users. | Aggregates structured web data for monitoring and AI model training. | Unclear at this time. | Diffbot is an application used to parse web pages into structured data; this data is used for monitoring or AI model training. |
 |FacebookBot    | Meta/Facebook | [Yes](https://developers.facebook.com/docs/sharing/bot/) | Training language models | Up to 1 page per second | Officially used for training Meta "speech recognition technology," unknown if used to train Meta AI specifically. |
-|Google-Extended| Google | [Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) | LLM training. | No information provided. | Used to train Gemini and Vertex AI generative APIs. Does not impact a site's inclusion or ranking in Google Search. |
-|GoogleOther    | Google | [Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) | Scrapes data. | No information provided. | "Used by various product teams for fetching publicly accessible content from sites. For example, it may be used for one-off crawls for internal research and development." |
-|GPTBot        | [OpenAI](https://openai.com) | Yes | Scrapes data to train OpenAI's products. | No information provided. | Data is used to train current and future models, removed paywalled data, PII and data that violates the company's policies. |
+|Google-Extended| Google | [Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) | LLM training. | No information | Used to train Gemini and Vertex AI generative APIs. Does not impact a site's inclusion or ranking in Google Search. |
+|GoogleOther    | Google | [Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) | Scrapes data. | No information | "Used by various product teams for fetching publicly accessible content from sites. For example, it may be used for one-off crawls for internal research and development." |
+|GPTBot        | [OpenAI](https://openai.com) | Yes | Scrapes data to train OpenAI's products. | No information | Data is used to train current and future models, removed paywalled data, PII and data that violates the company's policies. |
 | img2dataset | [img2dataset](https://github.com/rom1504/img2dataset) | At the discretion of img2dataset users. | Scrapes images for use in LLMs. | At the discretion of img2dataset users. | Downloads large sets of images into datasets for LLM training or other purposes. |
-|omgili        |         |                       |          |                  |             |
-|omgilibot     |         |                       |          |                  |             |
-|peer39_crawler|         |                       |          |                  |             |
-|peer39_crawler/1.0|         |                       |          |                  |             |
+|omgili        | [Webz.io](https://webz.io/) | [Yes](https://webz.io/blog/web-data/what-is-the-omgili-bot-and-why-is-it-crawling-your-website/) | Data is sold. | No information | Crawls sites for APIs used by Hootsuite, Sprinklr, NetBase, and other companies. Data also sold for research purposes or LLM training. |
+|omgilibot     | [Webz.io](https://webz.io/) | [Yes](https://web.archive.org/web/20170704003301/http://omgili.com/Crawler.html) | Data is sold. | No information | Legacy user agent initially used for Omgili search engine. Unknown if still used, `omgili` agent still used by Webz.io. |
+|peer39_crawler| [Peer39](https://www.peer39.com/) | [Yes](https://www.peer39.com/crawler-notice) | Targeted advertising. | No information | Web crawler used to "enhance the visibility of your site to advertisers who value and seek out such quality content." |
+|peer39_crawler| [Peer39](https://www.peer39.com/) | [Yes](https://www.peer39.com/crawler-notice) | Targeted advertising. | No information | Web crawler used to "enhance the visibility of your site to advertisers who value and seek out such quality content." |
 |PerplexityBot | [Perplexity](https://www.perplexity.ai/) | [Yes](https://docs.perplexity.ai/docs/perplexitybot) | Used to answer queries at the request of users. | Takes action based on user prompts.  | Operated by Perplexity to obtain results in response to user queries. |
-|YouBot        |         |                       |          |                  |             |
+|YouBot        | [You](https://about.you.com/youchat/) | [Yes](https://about.you.com/youbot/) | Scrapes data for search engine and LLMs. | No information | Retrieves data used for You.com web search engine and LLMs. |