Update from Dark Visitors

Merge pull request #129 from ai-robots-txt/google-cloudvertexbot
chore(robots.json): adds Google-CloudVertexBot
2025-05-19 08:43:11 +00:00 · 2025-05-17 00:57:28 +00:00 · 2025-05-16 11:35:15 +00:00 · 2025-05-16 12:35:04 +01:00 · 2025-05-15 21:16:49 -07:00 · 2025-05-16 00:59:08 +00:00
14 changed files with 262 additions and 7 deletions
--- a/.github/workflows/run-tests.yml
+++ b/.github/workflows/run-tests.yml
@ -19,3 +19,10 @@ jobs:
      - name: Run tests
        run: |
          code/tests.py
+  lint-json:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Check out repository
+        uses: actions/checkout@v4
+      - name: JQ Json Lint
+        run: jq . robots.json
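For checking `robots.json` locally without `jq`, a rough Python equivalent of this lint step could look like the sketch below. The helper name `lint_json` is hypothetical and not part of this repository; it only checks that the text parses as JSON, which is the property the `jq .` step relies on.

```python
import json


def lint_json(text):
    """Return None if text is valid JSON, else a short error description."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        # lineno/colno point at the position where parsing failed
        return f"line {exc.lineno}, column {exc.colno}: {exc.msg}"
    return None


if __name__ == "__main__":
    # A trailing comma is a typical hand-editing mistake that this catches.
    print(lint_json('{"GPTBot": {"operator": "OpenAI"},}'))
```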
--- a/.htaccess
+++ b/.htaccess
@ -1,3 +1,3 @@
 RewriteEngine On
-RewriteCond %{HTTP_USER_AGENT} (AI2Bot|Ai2Bot\-Dolma|aiHitBot|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Brightbot\ 1\.0|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|cohere\-training\-data\-crawler|Cotoyogi|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|Factset_spyderbot|FirecrawlAgent|FriendlyCrawler|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|imgproxy|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta\-ExternalAgent|Meta\-ExternalFetcher|NovaAct|OAI\-SearchBot|omgili|omgilibot|Operator|PanguBot|Perplexity\-User|PerplexityBot|PetalBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Sidetrade\ indexer\ bot|TikTokSpider|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot) [NC]
+RewriteCond %{HTTP_USER_AGENT} (AI2Bot|Ai2Bot\-Dolma|aiHitBot|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Brightbot\ 1\.0|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|cohere\-training\-data\-crawler|Cotoyogi|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|Factset_spyderbot|FirecrawlAgent|FriendlyCrawler|Google\-CloudVertexBot|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|imgproxy|ISSCyberRiskCrawler|Kangaroo\ Bot|meta\-externalagent|Meta\-ExternalAgent|meta\-externalfetcher|Meta\-ExternalFetcher|NovaAct|OAI\-SearchBot|omgili|omgilibot|Operator|PanguBot|Perplexity\-User|PerplexityBot|PetalBot|QualifiedBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Sidetrade\ indexer\ bot|TikTokSpider|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot) [NC]
 RewriteRule !^/?robots\.txt$ - [F,L]
--- a/3
+++ b/3
@ -0,0 +1,3 @@
+@aibots {
+        header_regexp User-Agent "(AI2Bot|Ai2Bot\-Dolma|aiHitBot|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Brightbot\ 1\.0|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|cohere\-training\-data\-crawler|Cotoyogi|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|Factset_spyderbot|FirecrawlAgent|FriendlyCrawler|Google\-CloudVertexBot|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|imgproxy|ISSCyberRiskCrawler|Kangaroo\ Bot|meta\-externalagent|Meta\-ExternalAgent|meta\-externalfetcher|Meta\-ExternalFetcher|NovaAct|OAI\-SearchBot|omgili|omgilibot|Operator|PanguBot|Perplexity\-User|PerplexityBot|PetalBot|QualifiedBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Sidetrade\ indexer\ bot|TikTokSpider|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot)"
+}
--- a/README.md
+++ b/README.md
@ -14,6 +14,8 @@ This repository provides the following files:
 - `robots.txt`
 - `.htaccess`
 - `nginx-block-ai-bots.conf`
+- `Caddyfile`
+- `haproxy-block-ai-bots.txt`

 `robots.txt` implements the Robots Exclusion Protocol ([RFC 9309](https://www.rfc-editor.org/rfc/rfc9309.html)).

@ -22,6 +24,25 @@ Note that, as stated in the [httpd documentation](https://httpd.apache.org/docs/

 `nginx-block-ai-bots.conf` implements a Nginx configuration snippet that can be included in any virtual host `server {}` block via the `include` directive.

+`Caddyfile` includes a Header Regex matcher group you can copy or import into your Caddyfile. The rejection can then be handled as follows: `abort @aibots`
+
+`haproxy-block-ai-bots.txt` may be used to configure HAProxy to block AI bots. To implement it:
+1. Add the file to the config directory of HAProxy
+2. Add the following lines in the `frontend` section:
+   ```
+   acl ai_robot hdr_sub(user-agent) -i -f /etc/haproxy/haproxy-block-ai-bots.txt
+   http-request deny if ai_robot
+   ```
+   (Note that the path of the `haproxy-block-ai-bots.txt` may be different in your environment.)
+
+
+[Bing uses the data it crawls for AI and training; you may opt out by adding a `meta` tag to the `head` of your site.](./docs/additional-steps/bing.md)
+
+### Related
+
+- [Robots.txt Traefik plugin](https://plugins.traefik.io/plugins/681b2f3fba3486128fc34fae/robots-txt-plugin):
+middleware plugin for [Traefik](https://traefik.io/traefik/) that automatically adds the rules from [robots.txt](./robots.txt)
+on the fly.

 ## Contributing

--- a/code/robots.py
+++ b/code/robots.py
@ -179,6 +179,19 @@ def json_to_nginx(robot_json):
    return config


+def json_to_caddy(robot_json):
+    caddyfile = "@aibots {\n    "
+    caddyfile += f'    header_regexp User-Agent "{list_to_pcre(robot_json.keys())}"'
+    caddyfile += "\n}"
+    return caddyfile
+
+def json_to_haproxy(robots_json):
+    # Creates a source file for HAProxy. Follow instructions in the README to implement it.
+    txt = "\n".join(f"{k}" for k in robots_json.keys())
+    return txt
+
+
+
 def update_file_if_changed(file_name, converter):
    """Update files if newer content is available and log the (in)actions."""
    new_content = converter(load_robots_json())
@ -208,6 +221,15 @@ def conversions():
        file_name="./nginx-block-ai-bots.conf",
        converter=json_to_nginx,
    )
+    update_file_if_changed(
+        file_name="./Caddyfile",
+        converter=json_to_caddy,
+    )
+      
+    update_file_if_changed(
+        file_name="./haproxy-block-ai-bots.txt",
+        converter=json_to_haproxy,
+    )


 if __name__ == "__main__":
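To see what the two new converters emit, the sketch below runs them on a three-entry sample. `json_to_caddy` and `json_to_haproxy` follow the hunk above; `list_to_pcre` is not shown in this diff, so the version here is an assumed stand-in based on `re.escape`, chosen because its escaping looks consistent with the generated files. The `sample` dictionary is made up for illustration.

```python
import re


def list_to_pcre(names):
    # Assumption: stand-in for the repository helper, which this diff does not show.
    return "(" + "|".join(re.escape(name) for name in names) + ")"


def json_to_caddy(robot_json):
    # Builds a named matcher that can be rejected with `abort @aibots`.
    caddyfile = "@aibots {\n    "
    caddyfile += f'    header_regexp User-Agent "{list_to_pcre(robot_json.keys())}"'
    caddyfile += "\n}"
    return caddyfile


def json_to_haproxy(robots_json):
    # One raw user-agent name per line, no escaping.
    return "\n".join(f"{k}" for k in robots_json.keys())


# Hypothetical sample input; the real data comes from robots.json.
sample = {"GPTBot": {}, "Kangaroo Bot": {}, "iaskspider/2.0": {}}

if __name__ == "__main__":
    print(json_to_caddy(sample))
    print(json_to_haproxy(sample))
```

The Caddy output escapes regex metacharacters because it is used as a pattern, while the HAProxy file keeps the names verbatim because it is read as a list of literal strings.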
--- a/code/test_files/Caddyfile
+++ b/code/test_files/Caddyfile
@ -0,0 +1,3 @@
+@aibots {
+        header_regexp User-Agent "(AI2Bot|Ai2Bot\-Dolma|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|Diffbot|FacebookBot|facebookexternalhit|FriendlyCrawler|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta\-ExternalAgent|Meta\-ExternalFetcher|OAI\-SearchBot|omgili|omgilibot|Perplexity\-User|PerplexityBot|PetalBot|Scrapy|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot|crawler\.with\.dots|star\*\*\*crawler|Is\ this\ a\ crawler\?|a\[mazing\]\{42\}\(robot\)|2\^32\$|curl\|sudo\ bash)"
+}
--- a/code/test_files/haproxy-block-ai-bots.txt
+++ b/code/test_files/haproxy-block-ai-bots.txt
@ -0,0 +1,47 @@
+AI2Bot
+Ai2Bot-Dolma
+Amazonbot
+anthropic-ai
+Applebot
+Applebot-Extended
+Bytespider
+CCBot
+ChatGPT-User
+Claude-Web
+ClaudeBot
+cohere-ai
+Diffbot
+FacebookBot
+facebookexternalhit
+FriendlyCrawler
+Google-Extended
+GoogleOther
+GoogleOther-Image
+GoogleOther-Video
+GPTBot
+iaskspider/2.0
+ICC-Crawler
+ImagesiftBot
+img2dataset
+ISSCyberRiskCrawler
+Kangaroo Bot
+Meta-ExternalAgent
+Meta-ExternalFetcher
+OAI-SearchBot
+omgili
+omgilibot
+Perplexity-User
+PerplexityBot
+PetalBot
+Scrapy
+Sidetrade indexer bot
+Timpibot
+VelenPublicWebCrawler
+Webzio-Extended
+YouBot
+crawler.with.dots
+star***crawler
+Is this a crawler?
+a[mazing]{42}(robot)
+2^32$
+curl|sudo bash
--- a/code/tests.py
+++ b/code/tests.py
@ -4,7 +4,7 @@
 import json
 import unittest

-from robots import json_to_txt, json_to_table, json_to_htaccess, json_to_nginx
+from robots import json_to_txt, json_to_table, json_to_htaccess, json_to_nginx, json_to_haproxy, json_to_caddy

 class RobotsUnittestExtensions:
    def loadJson(self, pathname):
@ -60,12 +60,33 @@ class TestNginxConfigGeneration(unittest.TestCase, RobotsUnittestExtensions):
        robots_nginx = json_to_nginx(self.robots_dict)
        self.assertEqualsFile("test_files/nginx-block-ai-bots.conf", robots_nginx)

+class TestHaproxyConfigGeneration(unittest.TestCase, RobotsUnittestExtensions):
+    maxDiff = 8192
+
+    def setUp(self):
+        self.robots_dict = self.loadJson("test_files/robots.json")
+
+    def test_haproxy_generation(self):
+        robots_haproxy = json_to_haproxy(self.robots_dict)
+        self.assertEqualsFile("test_files/haproxy-block-ai-bots.txt", robots_haproxy)
+
 class TestRobotsNameCleaning(unittest.TestCase):
    def test_clean_name(self):
        from robots import clean_robot_name

        self.assertEqual(clean_robot_name("Perplexity‑User"), "Perplexity-User")

+class TestCaddyfileGeneration(unittest.TestCase, RobotsUnittestExtensions):
+    maxDiff = 8192
+
+    def setUp(self):
+        self.robots_dict = self.loadJson("test_files/robots.json")
+
+    def test_caddyfile_generation(self):
+        robots_caddyfile = json_to_caddy(self.robots_dict)
+        self.assertEqualsFile("test_files/Caddyfile", robots_caddyfile)
+
+
 if __name__ == "__main__":
    import os
    os.chdir(os.path.dirname(__file__))
--- a/docs/additional-steps/bing.md
+++ b/docs/additional-steps/bing.md
@ -0,0 +1,36 @@
+# Bing (bingbot)
+
+It's not well publicised, but Bing uses the data it crawls for AI and training.
+
+However, blocking a search engine of this size with `robots.txt` is a drastic step: Bing is second only to Google, and blocking it could significantly reduce your website's visibility in search results.
+
+Additionally, Bing powers a number of search engines such as Yahoo and AOL, and its search results are also used in DuckDuckGo, amongst others.
+
+Fortunately, Bing supports a relatively simple opt-out method that requires one additional step.
+
+## How to opt out of AI training
+
+You must add a meta tag to the `<head>` of your webpage. It needs to be added to every page on your website.
+
+The line you need to add is:
+
+```plaintext
+<meta name="robots" content="noarchive">
+```
+
+By adding this line, you are signifying to Bing: "Do not use the content for training Microsoft's generative AI foundation models."
+
+## Will my site be negatively affected?
+
+Simple answer: no.
+The original use of "noarchive" has been retired by all search engines. Google retired its use in 2024.
+
+Adding this meta tag to your page(s) will not impact your site in search engines or in any other meaningful way.
+
+It is now used solely by a handful of crawlers, such as Bingbot and Amazonbot, as a signal not to use your data for AI/training.
+
+## Resources
+
+Bing Blog AI opt-out announcement: https://blogs.bing.com/webmaster/september-2023/Announcing-new-options-for-webmasters-to-control-usage-of-their-content-in-Bing-Chat
+
+Bing metatag information, including AI opt-out: https://www.bing.com/webmasters/help/which-robots-metatags-does-bing-support-5198d240
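Because the tag has to be present on every page, it may help to check pages automatically. The sketch below uses only the Python standard library to test whether an HTML document carries a `robots` meta tag that includes `noarchive`. The names `RobotsMetaParser` and `has_noarchive` are hypothetical and not part of this repository.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma-separated and compared case-insensitively here.
        for directive in content.split(","):
            if directive.strip():
                self.directives.add(directive.strip().lower())


def has_noarchive(html):
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noarchive" in parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noarchive"></head></html>'
    print(has_noarchive(page))
```

This only inspects the markup it is given; fetching each page of a site is left out of the sketch.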
--- a/haproxy-block-ai-bots.txt
+++ b/haproxy-block-ai-bots.txt
@ -0,0 +1,59 @@
+AI2Bot
+Ai2Bot-Dolma
+aiHitBot
+Amazonbot
+anthropic-ai
+Applebot
+Applebot-Extended
+Brightbot 1.0
+Bytespider
+CCBot
+ChatGPT-User
+Claude-Web
+ClaudeBot
+cohere-ai
+cohere-training-data-crawler
+Cotoyogi
+Crawlspace
+Diffbot
+DuckAssistBot
+FacebookBot
+Factset_spyderbot
+FirecrawlAgent
+FriendlyCrawler
+Google-CloudVertexBot
+Google-Extended
+GoogleOther
+GoogleOther-Image
+GoogleOther-Video
+GPTBot
+iaskspider/2.0
+ICC-Crawler
+ImagesiftBot
+img2dataset
+imgproxy
+ISSCyberRiskCrawler
+Kangaroo Bot
+meta-externalagent
+Meta-ExternalAgent
+meta-externalfetcher
+Meta-ExternalFetcher
+NovaAct
+OAI-SearchBot
+omgili
+omgilibot
+Operator
+PanguBot
+Perplexity-User
+PerplexityBot
+PetalBot
+QualifiedBot
+Scrapy
+SemrushBot-OCOB
+SemrushBot-SWA
+Sidetrade indexer bot
+TikTokSpider
+Timpibot
+VelenPublicWebCrawler
+Webzio-Extended
+YouBot
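The README's ACL, `hdr_sub(user-agent) -i -f`, loads this file as a list of patterns and performs a case-insensitive substring match against the User-Agent header. A minimal Python sketch of that matching rule follows; the function name `is_blocked` and the sample headers are illustrative assumptions, and the sketch does not model anything else HAProxy does.

```python
def load_patterns(text):
    # One pattern per line, mirroring the generated file; blank lines are skipped.
    return [line for line in text.splitlines() if line.strip()]


def is_blocked(user_agent, patterns):
    # Case-insensitive substring match, approximating `hdr_sub(...) -i`.
    ua = user_agent.lower()
    return any(pattern.lower() in ua for pattern in patterns)


if __name__ == "__main__":
    patterns = load_patterns("GPTBot\nKangaroo Bot\nmeta-externalagent\n")
    print(is_blocked("Mozilla/5.0 (compatible; GPTBot/1.0)", patterns))
    print(is_blocked("Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0", patterns))
```

Since the match is a substring test, short names in the list can also match longer, unrelated User-Agent strings that happen to contain them.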
--- a/nginx-block-ai-bots.conf
+++ b/nginx-block-ai-bots.conf
@ -1,3 +1,3 @@
-if ($http_user_agent ~* "(AI2Bot|Ai2Bot\-Dolma|aiHitBot|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Brightbot\ 1\.0|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|cohere\-training\-data\-crawler|Cotoyogi|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|Factset_spyderbot|FirecrawlAgent|FriendlyCrawler|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|imgproxy|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta\-ExternalAgent|Meta\-ExternalFetcher|NovaAct|OAI\-SearchBot|omgili|omgilibot|Operator|PanguBot|Perplexity\-User|PerplexityBot|PetalBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Sidetrade\ indexer\ bot|TikTokSpider|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot)") {
+if ($http_user_agent ~* "(AI2Bot|Ai2Bot\-Dolma|aiHitBot|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Brightbot\ 1\.0|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|cohere\-training\-data\-crawler|Cotoyogi|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|Factset_spyderbot|FirecrawlAgent|FriendlyCrawler|Google\-CloudVertexBot|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|imgproxy|ISSCyberRiskCrawler|Kangaroo\ Bot|meta\-externalagent|Meta\-ExternalAgent|meta\-externalfetcher|Meta\-ExternalFetcher|NovaAct|OAI\-SearchBot|omgili|omgilibot|Operator|PanguBot|Perplexity\-User|PerplexityBot|PetalBot|QualifiedBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Sidetrade\ indexer\ bot|TikTokSpider|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot)") {
    return 403;
 }
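Both the `.htaccess` rule (`[NC]`) and this nginx rule (`~*`) match case-insensitively, so the lowercase `meta\-externalagent` alternative and its mixed-case counterpart appear to cover the same User-Agent strings. The sketch below illustrates this with Python's `re` module, which is not the engine Apache or nginx uses; the shortened pattern and the function name `would_block` are assumptions made for the example.

```python
import re

# Shortened pattern for illustration; the generated rule lists many more names.
WITH_DUPLICATE = re.compile(r"(GPTBot|meta\-externalagent|Meta\-ExternalAgent)", re.IGNORECASE)
WITHOUT_DUPLICATE = re.compile(r"(GPTBot|Meta\-ExternalAgent)", re.IGNORECASE)


def would_block(pattern, user_agent):
    # search() mirrors the unanchored match used by both server rules.
    return pattern.search(user_agent) is not None


if __name__ == "__main__":
    for ua in ("meta-externalagent/1.1", "Meta-ExternalAgent/1.1", "curl/8.0"):
        print(ua, would_block(WITH_DUPLICATE, ua), would_block(WITHOUT_DUPLICATE, ua))
```

The duplication is harmless for matching. It comes from the generators emitting one alternative per key in `robots.json`.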
--- a/robots.json
+++ b/robots.json
@ -160,6 +160,13 @@
        "operator": "Unknown",
        "respect": "[Yes](https://imho.alex-kunz.com/2024/01/25/an-update-on-friendly-crawler)"
    },
+    "Google-CloudVertexBot": {
+        "operator": "Google",
+        "respect": "[Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers)",
+        "function": "Build and manage AI models for businesses employing Vertex AI",
+        "frequency": "No information.",
+        "description": "Google-CloudVertexBot crawls sites on the site owners' request when building Vertex AI Agents."
+    },
    "Google-Extended": {
        "operator": "Google",
        "respect": "[Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers)",
@ -244,13 +251,27 @@
        "frequency": "Unclear at this time.",
        "description": "Kangaroo Bot is used by the company Kangaroo LLM to download data to train AI models tailored to Australian language and culture. More info can be found at https://darkvisitors.com/agents/agents/kangaroo-bot"
    },
-    "Meta-ExternalAgent": {
+    "meta-externalagent": {
        "operator": "[Meta](https://developers.facebook.com/docs/sharing/webmasters/web-crawlers)",
-        "respect": "Yes.",
+        "respect": "Yes",
        "function": "Used to train models and improve products.",
        "frequency": "No information.",
        "description": "\"The Meta-ExternalAgent crawler crawls the web for use cases such as training AI models or improving products by indexing content directly.\""
    },
+    "Meta-ExternalAgent": {
+        "operator": "Unclear at this time.",
+        "respect": "Unclear at this time.",
+        "function": "AI Data Scrapers",
+        "frequency": "Unclear at this time.",
+        "description": "Meta-ExternalAgent is a web crawler used by Meta to download training data for its AI models and improve its products by indexing content directly. More info can be found at https://darkvisitors.com/agents/agents/meta-externalagent"
+    },
+    "meta-externalfetcher": {
+        "operator": "Unclear at this time.",
+        "respect": "Unclear at this time.",
+        "function": "AI Assistants",
+        "frequency": "Unclear at this time.",
+        "description": "Meta-ExternalFetcher is dispatched by Meta AI products in response to user prompts, when they need to fetch an individual links. More info can be found at https://darkvisitors.com/agents/agents/meta-externalfetcher"
+    },
    "Meta-ExternalFetcher": {
        "operator": "Unclear at this time.",
        "respect": "Unclear at this time.",
@ -321,6 +342,13 @@
        "operator": "[Huawei](https://huawei.com/)",
        "respect": "Yes"
    },
+    "QualifiedBot": {
+        "description": "Operated by Qualified as part of their suite of AI product offerings.",
+        "frequency": "No explicit frequency provided.",
+        "function": "Company offers AI agents and other related products; usage can be assumed to support said products.",
+        "operator": "[Qualified](https://www.qualified.com)",
+        "respect": "Unclear at this time."
+    },
    "Scrapy": {
        "description": "\"AI and machine learning applications often need large amounts of quality data, and web data extraction is a fast, efficient way to build structured data sets.\"",
        "frequency": "No information.",
@ -384,4 +412,4 @@
        "frequency": "No information.",
        "description": "Retrieves data used for You.com web search engine and LLMs."
    }
-}
+}
--- a/robots.txt
+++ b/robots.txt
@ -21,6 +21,7 @@ User-agent: FacebookBot
 User-agent: Factset_spyderbot
 User-agent: FirecrawlAgent
 User-agent: FriendlyCrawler
+User-agent: Google-CloudVertexBot
 User-agent: Google-Extended
 User-agent: GoogleOther
 User-agent: GoogleOther-Image
@ -33,7 +34,9 @@ User-agent: img2dataset
 User-agent: imgproxy
 User-agent: ISSCyberRiskCrawler
 User-agent: Kangaroo Bot
+User-agent: meta-externalagent
 User-agent: Meta-ExternalAgent
+User-agent: meta-externalfetcher
 User-agent: Meta-ExternalFetcher
 User-agent: NovaAct
 User-agent: OAI-SearchBot
@ -44,6 +47,7 @@ User-agent: PanguBot
 User-agent: Perplexity-User
 User-agent: PerplexityBot
 User-agent: PetalBot
+User-agent: QualifiedBot
 User-agent: Scrapy
 User-agent: SemrushBot-OCOB
 User-agent: SemrushBot-SWA
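The effect of adding a name to this `User-agent` group can be checked with Python's standard `urllib.robotparser`. This hunk does not show the rule that closes the group, so the `Disallow: /` line below is an assumption, as are the shortened group and the sample agent names.

```python
from urllib.robotparser import RobotFileParser

# Shortened group; the trailing "Disallow: /" is assumed, not shown in this diff.
ROBOTS_TXT = """\
User-agent: Google-CloudVertexBot
User-agent: meta-externalagent
User-agent: QualifiedBot
Disallow: /
"""


def is_allowed(user_agent, path="/"):
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    print(is_allowed("Google-CloudVertexBot", "/page"))
    print(is_allowed("SomeBrowser", "/page"))
```

This models only a crawler that chooses to honour `robots.txt`; the server-side files in this change set exist for crawlers that do not.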
--- a/table-of-bot-metrics.md
+++ b/table-of-bot-metrics.md
@ -23,6 +23,7 @@
 | Factset\_spyderbot | [Factset](https://www.factset.com/ai) | Unclear at this time. | AI model training. | No information provided. | Scrapes data for AI training. |
 | FirecrawlAgent | [Firecrawl](https://www.firecrawl.dev/) | Yes | AI scraper and LLM training | No information provided. | Scrapes data for AI systems and LLM training. |
 | FriendlyCrawler | Unknown | [Yes](https://imho.alex-kunz.com/2024/01/25/an-update-on-friendly-crawler) | We are using the data from the crawler to build datasets for machine learning experiments. | Unclear at this time. | Unclear who the operator is; but data is used for training/machine learning. |
+| Google\-CloudVertexBot | Google | [Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) | Build and manage AI models for businesses employing Vertex AI | No information. | Google-CloudVertexBot crawls sites on the site owners' request when building Vertex AI Agents. |
 | Google\-Extended | Google | [Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) | LLM training. | No information. | Used to train Gemini and Vertex AI generative APIs. Does not impact a site's inclusion or ranking in Google Search. |
 | GoogleOther | Google | [Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) | Scrapes data. | No information. | "Used by various product teams for fetching publicly accessible content from sites. For example, it may be used for one-off crawls for internal research and development." |
 | GoogleOther\-Image | Google | [Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) | Scrapes data. | No information. | "Used by various product teams for fetching publicly accessible content from sites. For example, it may be used for one-off crawls for internal research and development." |
@ -35,7 +36,9 @@
 | imgproxy | [imgproxy](https://imgproxy.net) | Unclear at this time. | Not documented or explained on operator's site. | No information. | AI-powered image processing. |
 | ISSCyberRiskCrawler | [ISS-Corporate](https://iss-cyber.com) | No | Scrapes data to train machine learning models. | No information. | Used to train machine learning based models to quantify cyber risk. |
 | Kangaroo Bot | Unclear at this time. | Unclear at this time. | AI Data Scrapers | Unclear at this time. | Kangaroo Bot is used by the company Kangaroo LLM to download data to train AI models tailored to Australian language and culture. More info can be found at https://darkvisitors.com/agents/agents/kangaroo-bot |
-| Meta\-ExternalAgent | [Meta](https://developers.facebook.com/docs/sharing/webmasters/web-crawlers) | Yes. | Used to train models and improve products. | No information. | "The Meta-ExternalAgent crawler crawls the web for use cases such as training AI models or improving products by indexing content directly." |
+| meta\-externalagent | [Meta](https://developers.facebook.com/docs/sharing/webmasters/web-crawlers) | Yes | Used to train models and improve products. | No information. | "The Meta-ExternalAgent crawler crawls the web for use cases such as training AI models or improving products by indexing content directly." |
+| Meta\-ExternalAgent | Unclear at this time. | Unclear at this time. | AI Data Scrapers | Unclear at this time. | Meta-ExternalAgent is a web crawler used by Meta to download training data for its AI models and improve its products by indexing content directly. More info can be found at https://darkvisitors.com/agents/agents/meta-externalagent |
+| meta\-externalfetcher | Unclear at this time. | Unclear at this time. | AI Assistants | Unclear at this time. | Meta-ExternalFetcher is dispatched by Meta AI products in response to user prompts, when they need to fetch an individual links. More info can be found at https://darkvisitors.com/agents/agents/meta-externalfetcher |
 | Meta\-ExternalFetcher | Unclear at this time. | Unclear at this time. | AI Assistants | Unclear at this time. | Meta-ExternalFetcher is dispatched by Meta AI products in response to user prompts, when they need to fetch an individual links. More info can be found at https://darkvisitors.com/agents/agents/meta-externalfetcher |
 | NovaAct | Unclear at this time. | Unclear at this time. | AI Agents | Unclear at this time. | Nova Act is an AI agent created by Amazon that can use a web browser. It can intelligently navigate and interact with websites to complete multi-step tasks on behalf of a human user. More info can be found at https://darkvisitors.com/agents/agents/novaact |
 | OAI\-SearchBot | [OpenAI](https://openai.com) | [Yes](https://platform.openai.com/docs/bots) | Search result generation. | No information. | Crawls sites to surface as results in SearchGPT. |
@ -46,6 +49,7 @@
 | Perplexity\-User | [Perplexity](https://www.perplexity.ai/) | [No](https://docs.perplexity.ai/guides/bots) | Used to answer queries at the request of users. | Only when prompted by a user. | Visit web pages to help provide an accurate answer and include links to the page in Perplexity response. |
 | PerplexityBot | [Perplexity](https://www.perplexity.ai/) | [Yes](https://docs.perplexity.ai/guides/bots) | Search result generation. | No information. | Crawls sites to surface as results in Perplexity. |
 | PetalBot | [Huawei](https://huawei.com/) | Yes | Used to provide recommendations in Hauwei assistant and AI search services. | No explicit frequency provided. | Operated by Huawei to provide search and AI assistant services. |
+| QualifiedBot | [Qualified](https://www.qualified.com) | Unclear at this time. | Company offers AI agents and other related products; usage can be assumed to support said products. | No explicit frequency provided. | Operated by Qualified as part of their suite of AI product offerings. |
 | Scrapy | [Zyte](https://www.zyte.com) | Unclear at this time. | Scrapes data for a variety of uses including training AI. | No information. | "AI and machine learning applications often need large amounts of quality data, and web data extraction is a fast, efficient way to build structured data sets." |
 | SemrushBot\-OCOB | [Semrush](https://www.semrush.com/) | [Yes](https://www.semrush.com/bot/) | Crawls your site for ContentShake AI tool. | Roughly once every 10 seconds. | You enter one text (on-demand) and we will make suggestions on it (the tool uses AI but we are not actively crawling the web, you need to manually enter one text/URL). |
 | SemrushBot\-SWA | [Semrush](https://www.semrush.com/) | [Yes](https://www.semrush.com/bot/) | Checks URLs on your site for SWA tool. | Roughly once every 10 seconds. | You enter one text (on-demand) and we will make suggestions on it (the tool uses AI but we are not actively crawling the web, you need to manually enter one text/URL). |
Author	SHA1	Message	Date
dark-visitors	7a2e6cba52	Update from Dark Visitors	2025-05-17 00:57:28 +00:00
ai.robots.txt	dd1ed174b7	Merge pull request #129 from ai-robots-txt/google-cloudvertexbot chore(robots.json): adds Google-CloudVertexBot	2025-05-16 11:35:15 +00:00
Glyn Normington	89c0fbaf86	Merge pull request #129 from ai-robots-txt/google-cloudvertexbot chore(robots.json): adds Google-CloudVertexBot	2025-05-16 12:35:04 +01:00
Cory Dransfeldt	ca918a963f	chore(robots.json): adds Google-CloudVertexBot	2025-05-15 21:16:49 -07:00
dark-visitors	16d1de7094	Update from Dark Visitors	2025-05-16 00:59:08 +00:00
Glyn Normington	73f6f67adf	Merge pull request #125 from holysoles/lint_robots_json lint robots.json during pull requests	2025-05-15 17:26:15 +01:00
Patrick Evans	498aa50760	lint robots.json during pull requests	2025-05-15 11:15:25 -05:00
ai.robots.txt	1c470babbe	Merge pull request #123 from joehoyle/patch-1 Fix JSON syntax error	2025-05-15 16:12:30 +00:00
Adam Newbold	84d63916d2	Merge pull request #123 from joehoyle/patch-1 Fix JSON syntax error	2025-05-15 12:12:21 -04:00
Joe Hoyle	0c56b96fd9	Fix JSON syntax error	2025-05-15 11:26:47 -04:00
Cory Dransfeldt	28e69e631b	Merge pull request #122 from ai-robots-txt/qualified-bot chore(robots.json): adds QualifiedBot crawler	2025-05-15 07:17:51 -07:00
Cory Dransfeldt	9539256cb3	chore(robots.json): adds QualifiedBot crawler	2025-05-15 07:16:07 -07:00
Cory Dransfeldt	9659c88b0c	Merge pull request #121 from solution-libre/add-traefik-plugin Add Traefik plugin to the README.md file	2025-05-14 16:45:34 -07:00
Florent Poinsaut	c66d180295	Merge branch 'main' into add-traefik-plugin	2025-05-14 22:06:56 +02:00
Glyn Normington	9a9b1b41c0	Merge pull request #119 from ai-robots-txt/bing-ai-opt-out-instructions Bing AI opt-out instructions	2025-05-14 19:18:20 +01:00
Florent Poinsaut	b4610a725c	Add Traefik plugin	2025-05-14 14:11:56 +02:00
Cory Dransfeldt	36a52a88d8	Bing AI opt-out instructions	2025-05-12 20:20:18 -07:00
ai.robots.txt	678380727e	Merge pull request #115 from glyn/syntax Fix Python syntax error	2025-05-01 10:29:06 +00:00
Glyn Normington	fb8188c49d	Merge pull request #115 from glyn/syntax Fix Python syntax error	2025-05-01 11:28:54 +01:00
Glyn Normington	ec995cd686	Fix Python syntax error	2025-05-01 11:27:40 +01:00
Crazyroostereye	1310dbae46	Added a Caddyfile converter (#110) Co-authored-by: Julian Beittel <julian@beittel.net> Co-authored-by: Glyn Normington <work@underlap.org>	2025-05-01 11:21:32 +01:00
Glyn Normington	91a88e2fa8	Merge pull request #113 from rwijnen-um/feature/haproxy HAProxy converter added.	2025-04-28 09:00:16 +01:00
Rik Wijnen	a4a9f2ac2b	Tests for HAProxy file added.	2025-04-28 09:30:26 +02:00
Rik Wijnen	66da70905f	Fixed incorrect English sentence.	2025-04-28 09:09:40 +02:00
Rik Wijnen	50e739dd73	HAProxy converter added.	2025-04-28 08:51:02 +02:00
ai.robots.txt	c6c7f1748f	Update from Dark Visitors	2025-04-26 00:55:12 +00:00
dark-visitors	934ac7b318	Update from Dark Visitors	2025-04-25 00:56:57 +00:00
ai.robots.txt	4654e14e9c	Merge pull request #112 from maiavixen/main Fixed meta-external* being titlecase, and removed period for consistency	2025-04-24 07:00:34 +00:00
Glyn Normington	9bf31fbca8	Merge pull request #112 from maiavixen/main Fixed meta-external* being titlecase, and removed period for consistency	2025-04-24 08:00:24 +01:00
maia	9d846ced45	Update robots.json Lowercase meta-external* as that was not technically the UA for the bots, also removed a period in the "respect" for consistency	2025-04-24 04:08:20 +02:00
dark-visitors	8d25a424d9	Update from Dark Visitors	2025-04-23 00:56:52 +00:00