From 3a43714908dd7df42a9ecf35c107e609bc2f9120 Mon Sep 17 00:00:00 2001
From: Glyn Normington <glyn@underlap.org>
Date: Sat, 4 Jan 2025 04:55:34 +0000
Subject: [PATCH 01/63] Rename Python code

The name dark_visitors.py suggests that all of the code relates to the
Dark Visitors website. In fact only the update command does; the convert
command is unrelated to Dark Visitors.
---
 .github/workflows/ai_robots_update.yml | 2 +-
 .github/workflows/main.yml             | 2 +-
 code/{dark_visitors.py => robots.py}   | 0
 3 files changed, 2 insertions(+), 2 deletions(-)
 rename code/{dark_visitors.py => robots.py} (100%)

diff --git a/.github/workflows/ai_robots_update.yml b/.github/workflows/ai_robots_update.yml
index 654b0b5..59e785d 100644
--- a/.github/workflows/ai_robots_update.yml
+++ b/.github/workflows/ai_robots_update.yml
@@ -16,7 +16,7 @@ jobs:
           git config --global user.name "dark-visitors"
           git config --global user.email "dark-visitors@users.noreply.github.com"
           echo "Updating robots.json with data from darkvisitor.com ..."
-          python code/dark_visitors.py --update
+          python code/robots.py --update
           echo "... done."
           git --no-pager diff
           git add -A
diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
index a4c47d6..40ac9ab 100644
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -25,7 +25,7 @@ jobs:
           git log -1
           git status
           echo "Updating robots.txt and table-of-bot-metrics.md if necessary ..."
-          python code/dark_visitors.py --convert
+          python code/robots.py --convert
           echo "... done."
           git --no-pager diff
           git add -A
diff --git a/code/dark_visitors.py b/code/robots.py
similarity index 100%
rename from code/dark_visitors.py
rename to code/robots.py

From e4c12ee2f84e2cb6643f7eeb7dd6eb50c6e91df8 Mon Sep 17 00:00:00 2001
From: Glyn Normington <glyn@underlap.org>
Date: Sat, 4 Jan 2025 05:03:48 +0000
Subject: [PATCH 02/63] Rename in test code

---
 code/tests.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/code/tests.py b/code/tests.py
index ffa7574..9cf35fe 100644
--- a/code/tests.py
+++ b/code/tests.py
@@ -6,7 +6,7 @@ cd to the `code` directory and run `pytest`
 import json
 from pathlib import Path
 
-from dark_visitors import json_to_txt, json_to_table
+from robots import json_to_txt, json_to_table
 
 
 def test_robots_txt_creation():

From 996b9c678cbdd90dea414006cc14027b29118d5c Mon Sep 17 00:00:00 2001
From: Glyn Normington <glyn@underlap.org>
Date: Sat, 4 Jan 2025 05:28:41 +0000
Subject: [PATCH 03/63] Improve job name

The purpose of the job is to convert the JSON file
to the other files.
---
 .github/workflows/ai_robots_update.yml | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/.github/workflows/ai_robots_update.yml b/.github/workflows/ai_robots_update.yml
index 59e785d..7e11ce8 100644
--- a/.github/workflows/ai_robots_update.yml
+++ b/.github/workflows/ai_robots_update.yml
@@ -22,7 +22,8 @@ jobs:
           git add -A
           git diff --quiet && git diff --staged --quiet || (git commit -m "Update from Dark Visitors" && git push)
         shell: bash
-  call-main:
+  convert:
+    name: convert
     needs: dark-visitors
     uses: ./.github/workflows/main.yml
     secrets: inherit

From 9e372d069625f2a2939c19fb8bfc703548a2ae42 Mon Sep 17 00:00:00 2001
From: Glyn Normington <glyn@underlap.org>
Date: Sun, 5 Jan 2025 01:45:33 +0000
Subject: [PATCH 04/63] Ensure dependency installed

Ref: https://github.com/ai-robots-txt/ai.robots.txt/issues/60#issuecomment-2571437913
Ref: https://stackoverflow.com/questions/11783875/importerror-no-module-named-bs4-beautifulsoup
---
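For context on the added `pip install` line: the distribution is published as
`beautifulsoup4` but imported as `bs4`, so the install name and the import name
differ. A minimal sketch of a pre-flight check, assuming a hypothetical helper
`is_importable` that is not part of this repository:

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "bs4" is the import name provided by the beautifulsoup4 distribution.
    if not is_importable("bs4"):
        print("bs4 is missing; run: pip install beautifulsoup4")
```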
 .github/workflows/main.yml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
index a4c47d6..cb5fefc 100644
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -20,6 +20,7 @@ jobs:
         with:
           fetch-depth: 2
       - run: |
+          pip install beautifulsoup4
           git config --global user.name "ai.robots.txt"
           git config --global user.email "ai.robots.txt@users.noreply.github.com"
           git log -1

From c01a68403687f44ef3235ee726ff70b9d6a133f4 Mon Sep 17 00:00:00 2001
From: Glyn Normington <glyn@underlap.org>
Date: Sun, 5 Jan 2025 05:03:50 +0000
Subject: [PATCH 05/63] Convert robots.json more frequently

Specifically, also run the conversion when GitHub workflows or code
change, since either can affect the conversion results.

Ref: https://github.com/ai-robots-txt/ai.robots.txt/issues/60
---
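A rough sketch of which changed paths the extended filter is meant to catch.
Python's `fnmatch` stands in for GitHub's own path-filter matching here, which
is an assumption: the two follow different rules in general and are only used
for these simple cases. `PUSH_PATHS` and `triggers_conversion` are illustrative
names, not part of the workflow.

```python
from fnmatch import fnmatchcase

# The three patterns listed under `on.push.paths` after this change.
PUSH_PATHS = ["robots.json", ".github/workflows/**", "code/**"]


def triggers_conversion(changed_path):
    """Return True if a changed file matches any of the push path patterns."""
    return any(fnmatchcase(changed_path, pattern) for pattern in PUSH_PATHS)


if __name__ == "__main__":
    for path in ["robots.json", "code/robots.py", "README.md"]:
        print(path, triggers_conversion(path))
```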
 .github/workflows/main.yml | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
index cb5fefc..4abbe2b 100644
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -8,6 +8,8 @@ on:
   push:
     paths:
       - 'robots.json'
+      - '.github/workflows/**'
+      - 'code/**'
     branches:
       - "main"
 

From ca8620e28b8b3baddc34852e3cb2ece2bf89d18d Mon Sep 17 00:00:00 2001
From: "ai.robots.txt" <ai.robots.txt@users.noreply.github.com>
Date: Sun, 5 Jan 2025 05:05:20 +0000
Subject: [PATCH 06/63] Merge pull request #63 from glyn/push-paths

Convert robots.json more frequently
---
 robots.txt              | 1 +
 table-of-bot-metrics.md | 1 +
 2 files changed, 2 insertions(+)

diff --git a/robots.txt b/robots.txt
index c41ed6d..1ae5558 100644
--- a/robots.txt
+++ b/robots.txt
@@ -10,6 +10,7 @@ User-agent: ChatGPT-User
 User-agent: Claude-Web
 User-agent: ClaudeBot
 User-agent: cohere-ai
+User-agent: cohere-training-data-crawler
 User-agent: Diffbot
 User-agent: DuckAssistBot
 User-agent: FacebookBot
diff --git a/table-of-bot-metrics.md b/table-of-bot-metrics.md
index e905d2f..1106d0f 100644
--- a/table-of-bot-metrics.md
+++ b/table-of-bot-metrics.md
@@ -12,6 +12,7 @@
 | Claude-Web | [Anthropic](https://www.anthropic.com) | Unclear at this time. | Scrapes data to train Anthropic's AI products. | No information provided. | Scrapes data to train LLMs and AI products offered by Anthropic. |
 | ClaudeBot | [Anthropic](https://www.anthropic.com) | [Yes](https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler) | Scrapes data to train Anthropic's AI products. | No information provided. | Scrapes data to train LLMs and AI products offered by Anthropic. |
 | cohere-ai | [Cohere](https://cohere.com) | Unclear at this time. | Retrieves data to provide responses to user-initiated prompts. | Takes action based on user prompts. | Retrieves data based on user prompts. |
+| cohere-training-data-crawler | Cohere to download training data for its LLMs (Large Language Models) that power its enterprise AI products | Unclear at this time. | AI Data Scrapers | Unclear at this time. | cohere-training-data-crawler is a web crawler operated by Cohere to download training data for its LLMs (Large Language Models) that power its enterprise AI products. More info can be found at https://darkvisitors.com/agents/agents/cohere-training-data-crawler |
 | Diffbot | [Diffbot](https://www.diffbot.com/) | At the discretion of Diffbot users. | Aggregates structured web data for monitoring and AI model training. | Unclear at this time. | Diffbot is an application used to parse web pages into structured data; this data is used for monitoring or AI model training. |
 | DuckAssistBot | Unclear at this time. | Unclear at this time. | AI Assistants | Unclear at this time. | DuckAssistBot is used by DuckDuckGo's DuckAssist feature to fetch content and generate realtime AI answers to user searches. More info can be found at https://darkvisitors.com/agents/agents/duckassistbot |
 | FacebookBot | Meta/Facebook | [Yes](https://developers.facebook.com/docs/sharing/bot/) | Training language models | Up to 1 page per second | Officially used for training Meta "speech recognition technology," unknown if used to train Meta AI specifically. |

From 83cd54647015829bbf241931e3d602c6081d2a1c Mon Sep 17 00:00:00 2001
From: Fabian Egli <fabianegli@users.noreply.github.com>
Date: Mon, 6 Jan 2025 11:39:41 +0100
Subject: [PATCH 07/63] allow Action to succeed even if no changes were made

Previously, the Action failed when the converter made no changes to any files.
---
 .github/workflows/main.yml | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
index 4abbe2b..d26a5a0 100644
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -32,6 +32,13 @@ jobs:
           echo "... done."
           git --no-pager diff
           git add -A
+          if [ "$(git diff --staged)" ]; then
+            # To have the action run successfully, if no changes are staged, we
+            # manually skip the later commits because they fail with exit code 1
+            # and this would then display as a failure for the Action.
+            echo "No staged changes to commit. Skipping commit and push."
+            exit 0
+          fi
           if [ -n "${{ inputs.message }}" ]; then
             git commit -m "${{ inputs.message }}"
           else

From 30ee95701162ac8f67cf6183641b2a140fcde721 Mon Sep 17 00:00:00 2001
From: Fabian Egli <fabianegli@users.noreply.github.com>
Date: Mon, 6 Jan 2025 12:05:42 +0100
Subject: [PATCH 08/63] bail when NO changes are staged

---
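The corrected test reads: exit early when the staged diff is empty. A small
Python sketch of that decision, assuming the diff text has already been
captured; the hypothetical `should_skip_commit` is not part of the workflow.
Trailing newlines are stripped first because shell command substitution does
the same before `-z` sees the value.

```python
def should_skip_commit(staged_diff):
    """Mirror `[ -z "$(git diff --staged)" ]`: True when nothing is staged."""
    # Command substitution drops trailing newlines, so do the same here.
    return staged_diff.rstrip("\n") == ""


if __name__ == "__main__":
    print(should_skip_commit(""))
    print(should_skip_commit("some staged change"))
```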
 .github/workflows/main.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
index d26a5a0..ac20d99 100644
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -32,7 +32,7 @@ jobs:
           echo "... done."
           git --no-pager diff
           git add -A
-          if [ "$(git diff --staged)" ]; then
+          if [ -z "$(git diff --staged)" ]; then
             # To have the action run successfully, if no changes are staged, we
             # manually skip the later commits because they fail with exit code 1
             # and this would then display as a failure for the Action.

From 143f8f228588b1f66bc1435fc21457f610807d5f Mon Sep 17 00:00:00 2001
From: Jordan Atwood <nightfirecat@nightfirec.at>
Date: Mon, 6 Jan 2025 12:34:38 -0800
Subject: [PATCH 09/63] Block SemrushBot

---
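The `robots.json` entries visible in this series carry the same five fields.
A minimal sketch of a check a contributor could run before opening a pull
request; `REQUIRED_FIELDS` and `missing_fields` are hypothetical and not part
of `code/robots.py`.

```python
import json

# Field names taken from the SemrushBot entry added in this patch.
REQUIRED_FIELDS = ("operator", "respect", "function", "frequency", "description")


def missing_fields(entry):
    """Return the required field names that an entry lacks, in a stable order."""
    return [field for field in REQUIRED_FIELDS if field not in entry]


if __name__ == "__main__":
    sample = json.loads('{"SemrushBot": {"operator": "[Semrush](https://www.semrush.com/)"}}')
    for name, entry in sample.items():
        print(name, missing_fields(entry))
```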
 robots.json | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/robots.json b/robots.json
index 1c00b63..c444cb4 100644
--- a/robots.json
+++ b/robots.json
@@ -258,6 +258,13 @@
         "operator": "[Zyte](https://www.zyte.com)",
         "respect": "Unclear at this time."
     },
+    "SemrushBot": {
+        "operator": "[Semrush](https://www.semrush.com/)",
+        "respect": "[Yes](https://www.semrush.com/bot/)",
+        "function": "Scrapes data for use in LLM article-writing tool.",
+        "frequency": "Roughly once every 10 seconds.",
+        "description": "SemrushBot is a bot which, among other functions, scrapes data for use in ContentShake AI tool reports."
+    },
     "Sidetrade indexer bot": {
         "description": "AI product training.",
         "frequency": "No information.",

From ec454b71d3984e58f323bb71631847dfe6b51b78 Mon Sep 17 00:00:00 2001
From: "ai.robots.txt" <ai.robots.txt@users.noreply.github.com>
Date: Mon, 6 Jan 2025 20:51:56 +0000
Subject: [PATCH 10/63] Merge pull request #67 from Nightfirecat/semrushbot

Block SemrushBot
---
 robots.txt              | 1 +
 table-of-bot-metrics.md | 1 +
 2 files changed, 2 insertions(+)

diff --git a/robots.txt b/robots.txt
index 1ae5558..5c32c96 100644
--- a/robots.txt
+++ b/robots.txt
@@ -35,6 +35,7 @@ User-agent: PanguBot
 User-agent: PerplexityBot
 User-agent: PetalBot
 User-agent: Scrapy
+User-agent: SemrushBot
 User-agent: Sidetrade indexer bot
 User-agent: Timpibot
 User-agent: VelenPublicWebCrawler
diff --git a/table-of-bot-metrics.md b/table-of-bot-metrics.md
index 1106d0f..31c9367 100644
--- a/table-of-bot-metrics.md
+++ b/table-of-bot-metrics.md
@@ -37,6 +37,7 @@
 | PerplexityBot | [Perplexity](https://www.perplexity.ai/) | [No](https://www.macstories.net/stories/wired-confirms-perplexity-is-bypassing-efforts-by-websites-to-block-its-web-crawler/) | Used to answer queries at the request of users. | Takes action based on user prompts. | Operated by Perplexity to obtain results in response to user queries. |
 | PetalBot | [Huawei](https://huawei.com/) | Yes | Used to provide recommendations in Hauwei assistant and AI search services. | No explicit frequency provided. | Operated by Huawei to provide search and AI assistant services. |
 | Scrapy | [Zyte](https://www.zyte.com) | Unclear at this time. | Scrapes data for a variety of uses including training AI. | No information. | "AI and machine learning applications often need large amounts of quality data, and web data extraction is a fast, efficient way to build structured data sets." |
+| SemrushBot | [Semrush](https://www.semrush.com/) | [Yes](https://www.semrush.com/bot/) | Scrapes data for use in LLM article-writing tool. | Roughly once every 10 seconds. | SemrushBot is a bot which, among other functions, scrapes data for use in ContentShake AI tool reports. |
 | Sidetrade indexer bot | [Sidetrade](https://www.sidetrade.com) | Unclear at this time. | Extracts data for a variety of uses including training AI. | No information. | AI product training. |
 | Timpibot | [Timpi](https://timpi.io) | Unclear at this time. | Scrapes data for use in training LLMs. | No information. | Makes data available for training AI models. |
 | VelenPublicWebCrawler | [Velen Crawler](https://velen.io) | [Yes](https://velen.io) | Scrapes data for business data sets and machine learning models. | No information. | "Our goal with this crawler is to build business datasets and machine learning models to better understand the web." |

From 933aa6159da9dbe7025f6294e98a6d3e326b43a3 Mon Sep 17 00:00:00 2001
From: Massimo Gismondi <omino.gis@gmail.com>
Date: Tue, 7 Jan 2025 11:02:29 +0100
Subject: [PATCH 11/63] Implementing htaccess generation

---
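A rough way to sanity-check the generated pattern outside Apache. The sketch
rebuilds the alternation the way `json_to_htaccess` does and evaluates it with
Python's `re` module, which only approximates Apache's regex engine;
`re.IGNORECASE` stands in for the `[NC]` flag. `build_pattern` and `is_blocked`
are hypothetical helpers, and the sample names are a small subset of the list.

```python
import re


def build_pattern(names):
    """Join bot names into one alternation, escaping spaces as json_to_htaccess does."""
    escaped = (name.replace(" ", "\\ ") for name in names)
    return "^.*(" + "|".join(escaped) + ").*$"


def is_blocked(user_agent, names):
    """Return True if the user agent contains any listed name, ignoring case."""
    return re.search(build_pattern(names), user_agent, re.IGNORECASE) is not None


if __name__ == "__main__":
    names = ["GPTBot", "Kangaroo Bot", "iaskspider/2.0"]
    print(is_blocked("Mozilla/5.0 (compatible; gptbot/1.1)", names))
    print(is_blocked("Mozilla/5.0 (X11; Linux x86_64) Firefox/133.0", names))
```

Names are inserted unescaped apart from spaces, so a `.` in a name such as
`iaskspider/2.0` acts as a wildcard in the pattern.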
 .htaccess                 |  3 +++
 code/robots.py            | 22 +++++++++++++++++++++-
 code/test_files/.htaccess |  3 +++
 code/tests.py             |  8 +++++++-
 4 files changed, 34 insertions(+), 2 deletions(-)
 create mode 100644 .htaccess
 create mode 100644 code/test_files/.htaccess

diff --git a/.htaccess b/.htaccess
new file mode 100644
index 0000000..31ba5f7
--- /dev/null
+++ b/.htaccess
@@ -0,0 +1,3 @@
+RewriteEngine On
+RewriteCond %{HTTP_USER_AGENT} ^.*(AI2Bot|Ai2Bot-Dolma|Amazonbot|anthropic-ai|Applebot|Applebot-Extended|Bytespider|CCBot|ChatGPT-User|Claude-Web|ClaudeBot|cohere-ai|cohere-training-data-crawler|Diffbot|DuckAssistBot|FacebookBot|FriendlyCrawler|Google-Extended|GoogleOther|GoogleOther-Image|GoogleOther-Video|GPTBot|iaskspider/2.0|ICC-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta-ExternalAgent|Meta-ExternalFetcher|OAI-SearchBot|omgili|omgilibot|PanguBot|PerplexityBot|PetalBot|Scrapy|SemrushBot|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio-Extended|YouBot).*$ [NC]
+RewriteRule .* - [F,L]
\ No newline at end of file
diff --git a/code/robots.py b/code/robots.py
index cf44e8e..d35d74b 100644
--- a/code/robots.py
+++ b/code/robots.py
@@ -132,10 +132,26 @@ def json_to_table(robots_json):
     return table
 
 
+def json_to_htaccess(robot_json):
+    htaccess = "RewriteEngine On\n"
+    htaccess += "RewriteCond %{HTTP_USER_AGENT} ^.*("
+
+    robots = map(lambda el: el.replace(" ", "\\ "), robot_json.keys())
+    htaccess += "|".join(robots)
+    htaccess += ").*$ [NC]\n"
+    htaccess += "RewriteRule .* - [F,L]"
+    return htaccess
+
+
 def update_file_if_changed(file_name, converter):
     """Update files if newer content is available and log the (in)actions."""
     new_content = converter(load_robots_json())
-    old_content = Path(file_name).read_text(encoding="utf-8")
+    filepath = Path(file_name)
+    if not filepath.exists():
+        filepath.write_text(new_content, encoding="utf-8")
+        print(f"{file_name} has been created.")
+        return
+    old_content = filepath.read_text(encoding="utf-8")
     if old_content == new_content:
         print(f"{file_name} is already up to date.")
     else:
@@ -150,6 +166,10 @@ def conversions():
         file_name="./table-of-bot-metrics.md",
         converter=json_to_table,
     )
+    update_file_if_changed(
+        file_name="./.htaccess",
+        converter=json_to_htaccess,
+    )
 
 
 if __name__ == "__main__":
diff --git a/code/test_files/.htaccess b/code/test_files/.htaccess
new file mode 100644
index 0000000..a34bf55
--- /dev/null
+++ b/code/test_files/.htaccess
@@ -0,0 +1,3 @@
+RewriteEngine On
+RewriteCond %{HTTP_USER_AGENT} ^.*(AI2Bot|Ai2Bot-Dolma|Amazonbot|anthropic-ai|Applebot|Applebot-Extended|Bytespider|CCBot|ChatGPT-User|Claude-Web|ClaudeBot|cohere-ai|Diffbot|FacebookBot|facebookexternalhit|FriendlyCrawler|Google-Extended|GoogleOther|GoogleOther-Image|GoogleOther-Video|GPTBot|iaskspider/2.0|ICC-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta-ExternalAgent|Meta-ExternalFetcher|OAI-SearchBot|omgili|omgilibot|PerplexityBot|PetalBot|Scrapy|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio-Extended|YouBot).*$ [NC]
+RewriteRule .* - [F,L]
\ No newline at end of file
diff --git a/code/tests.py b/code/tests.py
index 9cf35fe..6f778c3 100644
--- a/code/tests.py
+++ b/code/tests.py
@@ -6,7 +6,7 @@ cd to the `code` directory and run `pytest`
 import json
 from pathlib import Path
 
-from robots import json_to_txt, json_to_table
+from robots import json_to_txt, json_to_table, json_to_htaccess
 
 
 def test_robots_txt_creation():
@@ -19,3 +19,9 @@ def test_table_of_bot_metrices_md():
     robots_json = json.loads(Path("test_files/robots.json").read_text())
     robots_table = json_to_table(robots_json)
     assert Path("test_files/table-of-bot-metrics.md").read_text() == robots_table
+
+
+def test_htaccess_creation():
+    robots_json = json.loads(Path("test_files/robots.json").read_text())
+    robots_htaccess = json_to_htaccess(robots_json)
+    assert Path("test_files/.htaccess").read_text() == robots_htaccess

From 189e75bbfd06715a5d30972d3aa4c23974aecee0 Mon Sep 17 00:00:00 2001
From: Massimo Gismondi <omino.gis@gmail.com>
Date: Fri, 17 Jan 2025 21:25:23 +0100
Subject: [PATCH 12/63] Adding usage instructions

---
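To illustrate the `robots.txt` half of the usage text, here is a sketch using
Python's standard `urllib.robotparser`. The sample content is an assumption
modelled on the grouped `User-agent` lines in this repository's `robots.txt`
followed by a blanket `Disallow`; the URL and the `OrdinaryBrowser` agent are
placeholders, and `may_fetch` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

# Assumed shape: several agents sharing one blanket rule.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /
"""


def may_fetch(user_agent, url, robots_txt=SAMPLE_ROBOTS_TXT):
    """Return True if a compliant crawler with this user agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(may_fetch("GPTBot", "https://example.com/page"))
    print(may_fetch("OrdinaryBrowser", "https://example.com/page"))
```

This only models a crawler that chooses to comply; nothing in the file itself
stops a crawler that ignores it.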
 README.md | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/README.md b/README.md
index b3c2e7c..45c8f3a 100644
--- a/README.md
+++ b/README.md
@@ -8,6 +8,19 @@ A number of these crawlers have been sourced from [Dark Visitors](https://darkvi
 
 If you'd like to add information about a crawler to the list, please make a pull request with the bot name added to `robots.txt`, `ai.txt`, and any relevant details in `table-of-bot-metrics.md` to help people understand what's crawling.
 
+## Usage
+
+Many visitors will find these files from this repository most useful:
+- `robots.txt`
+- `.htaccess`
+
+The first one tells search engine and AI crawlers which parts of your website should be scanned or avoided. The webpages of your server are returned anyway, but the crawler "pledges" not to use them. By default, the provided `robots.txt` tells every AI crawler not to scan any page in your website. This is not bulletproof, as an evil crawler could simply ignore the `robots.txt` content.
+
+The second one tells your own webserver to return an error page when one of the listed AI crawlers tries to request a page from your website. A `.htaccess` file does not work on every webserver, but works correctly on most common and cheap shared hosting providers. The majority of AI crawlers set a "User Agent" string in every request they send, by which they are identifiable: this string is used to filter the request. Instead of simply hoping the crawler pledges to respect our intention, this solution actively sends back a bad webpage (an error or an empty page). Note that this solution isn't bulletproof either, as anyone can fake the sent User Agent.
+
+We suggest adding both files, as some crawlers may respect `robots.txt` while not having an identifiable User Agent; on the other hand, other crawlers may not respect the `robots.txt`, but they provide a identifiable User Agent by which we can filter them out.
+
+
 ## Contributing
 
 A note about contributing: updates should be added/made to `robots.json`. A GitHub action, courtesy of [Adam](https://github.com/newbold), will then generate the updated `robots.txt` and `table-of-bot-metrics.md`.

From b455af66e7903e76162d43f3e8f0900084fb9539 Mon Sep 17 00:00:00 2001
From: Massimo Gismondi <omino.gis@gmail.com>
Date: Fri, 17 Jan 2025 21:42:08 +0100
Subject: [PATCH 13/63] Adding clarification about performance and code comment

---
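One thing to check in the `code/robots.py` hunk: the first statement of
`json_to_htaccess` changes from `htaccess = ...` to `htaccess += ...`, so the
name is read before it has been assigned. The standalone sketch below
reproduces that shape with a hypothetical function to show what Python does at
call time.

```python
def build_without_initial_assignment():
    # Same shape as the changed line: `+=` on a local name that was never assigned.
    text += "RewriteEngine On\n"
    return text


if __name__ == "__main__":
    try:
        build_without_initial_assignment()
    except UnboundLocalError as error:
        print(type(error).__name__)
```

Keeping a plain `=` on the first statement avoids the error.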
 README.md      | 3 ++-
 code/robots.py | 4 +++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 45c8f3a..dd84a16 100644
--- a/README.md
+++ b/README.md
@@ -18,8 +18,9 @@ The first one tells search engine and AI crawlers which parts of your website sh
 
 The second one tells your own webserver to return an error page when one of the listed AI crawlers tries to request a page from your website. A `.htaccess` file does not work on every webserver, but works correctly on most common and cheap shared hosting providers. The majority of AI crawlers set a "User Agent" string in every request they send, by which they are identifiable: this string is used to filter the request. Instead of simply hoping the crawler pledges to respect our intention, this solution actively sends back a bad webpage (an error or an empty page). Note that this solution isn't bulletproof either, as anyone can fake the sent User Agent.
 
-We suggest adding both files, as some crawlers may respect `robots.txt` while not having an identifiable User Agent; on the other hand, other crawlers may not respect the `robots.txt`, but they provide a identifiable User Agent by which we can filter them out.
+Note that, as stated in the [httpd documentation](https://httpd.apache.org/docs/current/howto/htaccess.html), more performant methods than an `.htaccess` file exist. Nevertheless, most shared hosting providers only allow `.htaccess` configuration.
 
+We suggest adding both files, as some crawlers may respect `robots.txt` while not having an identifiable User Agent; on the other hand, other crawlers may not respect the `robots.txt`, but they provide a identifiable User Agent by which we can filter them out.
 
 ## Contributing
 
diff --git a/code/robots.py b/code/robots.py
index d35d74b..f2ddbb8 100644
--- a/code/robots.py
+++ b/code/robots.py
@@ -133,7 +133,9 @@ def json_to_table(robots_json):
 
 
 def json_to_htaccess(robot_json):
-    htaccess = "RewriteEngine On\n"
+    # Creates a .htaccess filter file. It uses a regular expression to filter out
+    #User agents that contain any of the blocked values.
+    htaccess += "RewriteEngine On\n"
     htaccess += "RewriteCond %{HTTP_USER_AGENT} ^.*("
 
     robots = map(lambda el: el.replace(" ", "\\ "), robot_json.keys())

From 8aee2f24bb03a8d91a2fb17c3a98628411239d40 Mon Sep 17 00:00:00 2001
From: Massimo Gismondi <24638827+MassiminoilTrace@users.noreply.github.com>
Date: Sat, 18 Jan 2025 12:39:07 +0100
Subject: [PATCH 14/63] Fixed space in comment

Co-authored-by: Glyn Normington <work@underlap.org>
---
 code/robots.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/code/robots.py b/code/robots.py
index f2ddbb8..0172330 100644
--- a/code/robots.py
+++ b/code/robots.py
@@ -134,7 +134,7 @@ def json_to_table(robots_json):
 
 def json_to_htaccess(robot_json):
     # Creates a .htaccess filter file. It uses a regular expression to filter out
-    #User agents that contain any of the blocked values.
+    # User agents that contain any of the blocked values.
     htaccess += "RewriteEngine On\n"
     htaccess += "RewriteCond %{HTTP_USER_AGENT} ^.*("
 

From 1cc4b59dfc4acd5666478efea658b1adf1af8aee Mon Sep 17 00:00:00 2001
From: Massimo Gismondi <24638827+MassiminoilTrace@users.noreply.github.com>
Date: Sat, 18 Jan 2025 12:40:03 +0100
Subject: [PATCH 15/63] Shortened htaccess instructions

Co-authored-by: Glyn Normington <work@underlap.org>
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index dd84a16..badd23b 100644
--- a/README.md
+++ b/README.md
@@ -14,7 +14,7 @@ Many visitors will find these files from this repository most useful:
 - `robots.txt`
 - `.htaccess`
 
-The first one tells search engine and AI crawlers which parts of your website should be scanned or avoided. The webpages of your server are returned anyway, but the crawler "pledges" not to use them. By default, the provided `robots.txt` tells every AI crawler not to scan any page in your website. This is not bulletproof, as an evil crawler could simply ignore the `robots.txt` content.
+`robots.txt` implements the Robots Exclusion Protocol ([RFC 9309](https://www.rfc-editor.org/rfc/rfc9309.html)).
 
 The second one tells your own webserver to return an error page when one of the listed AI crawlers tries to request a page from your website. A `.htaccess` file does not work on every webserver, but works correctly on most common and cheap shared hosting providers. The majority of AI crawlers set a "User Agent" string in every request they send, by which they are identifiable: this string is used to filter the request. Instead of simply hoping the crawler pledges to respect our intention, this solution actively sends back a bad webpage (an error or an empty page). Note that this solution isn't bulletproof either, as anyone can fake the sent User Agent.
 

From d65128d10acfd14b714488170b3a261912cc3729 Mon Sep 17 00:00:00 2001
From: Massimo Gismondi <24638827+MassiminoilTrace@users.noreply.github.com>
Date: Sat, 18 Jan 2025 12:41:09 +0100
Subject: [PATCH 16/63] Removed paragraph in favour of future FAQ.md

Co-authored-by: Glyn Normington <work@underlap.org>
---
 README.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/README.md b/README.md
index badd23b..505a8dd 100644
--- a/README.md
+++ b/README.md
@@ -20,7 +20,6 @@ The second one tells your own webserver to return an error page when one of the
 
 Note that, as stated in the [httpd documentation](https://httpd.apache.org/docs/current/howto/htaccess.html), more performant methods than an `.htaccess` file exist. Nevertheless, most shared hosting providers only allow `.htaccess` configuration.
 
-We suggest adding both files, as some crawlers may respect `robots.txt` while not having an identifiable User Agent; on the other hand, other crawlers may not respect the `robots.txt`, but they provide a identifiable User Agent by which we can filter them out.
 
 ## Contributing
 

From 5aa08bc0022e8e9960e4cf52359ca2d910f795bf Mon Sep 17 00:00:00 2001
From: Joshua Sheard <mail@jsheard.com>
Date: Sun, 19 Jan 2025 22:03:50 +0000
Subject: [PATCH 17/63] Add Crawlspace

---
 robots.json | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/robots.json b/robots.json
index c444cb4..d71c80b 100644
--- a/robots.json
+++ b/robots.json
@@ -90,6 +90,13 @@
         "frequency": "Unclear at this time.",
         "description": "cohere-training-data-crawler is a web crawler operated by Cohere to download training data for its LLMs (Large Language Models) that power its enterprise AI products. More info can be found at https://darkvisitors.com/agents/agents/cohere-training-data-crawler"
     },
+    "Crawlspace": {
+        "operator": "[Crawlspace](https://crawlspace.dev)",
+        "respect": "[Yes](https://news.ycombinator.com/item?id=42756654)",
+        "function": "Scrapes data",
+        "frequency": "Unclear at this time.",
+        "description": "Provides crawling services for any purpose, but most likely to be used for AI model training."
+    },
     "Diffbot": {
         "operator": "[Diffbot](https://www.diffbot.com/)",
         "respect": "At the discretion of Diffbot users.",
@@ -300,4 +307,4 @@
         "frequency": "No information.",
         "description": "Retrieves data used for You.com web search engine and LLMs."
     }
-}
\ No newline at end of file
+}

From 70fd6c0fb13cdf4f0525bf061556e8e50ca7b8d9 Mon Sep 17 00:00:00 2001
From: Massimo Gismondi <24638827+MassiminoilTrace@users.noreply.github.com>
Date: Mon, 20 Jan 2025 06:25:07 +0100
Subject: [PATCH 18/63] Add mention of htaccess in readme

Co-authored-by: Glyn Normington <work@underlap.org>
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 505a8dd..cd8d467 100644
--- a/README.md
+++ b/README.md
@@ -23,7 +23,7 @@ Note that, as stated in the [httpd documentation](https://httpd.apache.org/docs/
 
 ## Contributing
 
-A note about contributing: updates should be added/made to `robots.json`. A GitHub action, courtesy of [Adam](https://github.com/newbold), will then generate the updated `robots.txt` and `table-of-bot-metrics.md`.
+A note about contributing: updates should be added/made to `robots.json`. A GitHub action will then generate the updated `robots.txt`, `table-of-bot-metrics.md`, and `.htaccess`.
 
 ## Subscribe to updates
 

From 013b7abfa1f2126e9320ddbab90ff87af54b092c Mon Sep 17 00:00:00 2001
From: Massimo Gismondi <24638827+MassiminoilTrace@users.noreply.github.com>
Date: Mon, 20 Jan 2025 06:27:02 +0100
Subject: [PATCH 19/63] Update README.md

Co-authored-by: Glyn Normington <work@underlap.org>
---
 README.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index cd8d467..1417a85 100644
--- a/README.md
+++ b/README.md
@@ -16,7 +16,9 @@ Many visitors will find these files from this repository most useful:
 
 `robots.txt` implements the Robots Exclusion Protocol ([RFC 9309](https://www.rfc-editor.org/rfc/rfc9309.html)).
 
-The second one tells your own webserver to return an error page when one of the listed AI crawlers tries to request a page from your website. A `.htaccess` file does not work on every webserver, but works correctly on most common and cheap shared hosting providers. The majority of AI crawlers set a "User Agent" string in every request they send, by which they are identifiable: this string is used to filter the request. Instead of simply hoping the crawler pledges to respect our intention, this solution actively sends back a bad webpage (an error or an empty page). Note that this solution isn't bulletproof either, as anyone can fake the sent User Agent.
+### `.htaccess`
+
+`.htaccess` may be used to configure web servers such as [Apache httpd](https://httpd.apache.org/) to return an error page when one of the listed AI crawlers sends a request to the web server.
 
 Note that, as stated in the [httpd documentation](https://httpd.apache.org/docs/current/howto/htaccess.html), more performant methods than an `.htaccess` file exist. Nevertheless, most shared hosting providers only allow `.htaccess` configuration.
 

From 52241bdca6c9930f7b225264cd862b5f98a2d68f Mon Sep 17 00:00:00 2001
From: Massimo Gismondi <24638827+MassiminoilTrace@users.noreply.github.com>
Date: Mon, 20 Jan 2025 06:27:56 +0100
Subject: [PATCH 20/63] Update README.md

Co-authored-by: Glyn Normington <work@underlap.org>
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 1417a85..bb6558c 100644
--- a/README.md
+++ b/README.md
@@ -20,7 +20,7 @@ Many visitors will find these files from this repository most useful:
 
 `.htaccess` may be used to configure web servers such as [Apache httpd](https://httpd.apache.org/) to return an error page when one of the listed AI crawlers sends a request to the web server.
 
-Note that, as stated in the [httpd documentation](https://httpd.apache.org/docs/current/howto/htaccess.html), more performant methods than an `.htaccess` file exist. Nevertheless, most shared hosting providers only allow `.htaccess` configuration.
+Note that, as stated in the [httpd documentation](https://httpd.apache.org/docs/current/howto/htaccess.html), more performant methods than an `.htaccess` file exist.
 
 
 ## Contributing

From 33c38ee70b3a45343ddb360ae79e743e42bc8f76 Mon Sep 17 00:00:00 2001
From: Massimo Gismondi <24638827+MassiminoilTrace@users.noreply.github.com>
Date: Mon, 20 Jan 2025 06:28:32 +0100
Subject: [PATCH 21/63] Update README.md

Co-authored-by: Glyn Normington <work@underlap.org>
---
 README.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index bb6558c..648f5ed 100644
--- a/README.md
+++ b/README.md
@@ -10,10 +10,12 @@ If you'd like to add information about a crawler to the list, please make a pull
 
 ## Usage
 
-Many visitors will find these files from this repository most useful:
+This repository provides the following files:
 - `robots.txt`
 - `.htaccess`
 
+### `robots.txt`
+
 `robots.txt` implements the Robots Exclusion Protocol ([RFC 9309](https://www.rfc-editor.org/rfc/rfc9309.html)).
 
 ### `.htaccess`

From a9956f7825080467adbbda6e41d7dfbaee47210b Mon Sep 17 00:00:00 2001
From: Massimo Gismondi <omino.gis@gmail.com>
Date: Mon, 20 Jan 2025 06:50:48 +0100
Subject: [PATCH 22/63] Removed additional sections

---
 README.md | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/README.md b/README.md
index 648f5ed..065b0b7 100644
--- a/README.md
+++ b/README.md
@@ -14,14 +14,9 @@ This repository provides the following files:
 - `robots.txt`
 - `.htaccess`
 
-### `robots.txt`
-
 `robots.txt` implements the Robots Exclusion Protocol ([RFC 9309](https://www.rfc-editor.org/rfc/rfc9309.html)).
 
-### `.htaccess`
-
 `.htaccess` may be used to configure web servers such as [Apache httpd](https://httpd.apache.org/) to return an error page when one of the listed AI crawlers sends a request to the web server.
-
 Note that, as stated in the [httpd documentation](https://httpd.apache.org/docs/current/howto/htaccess.html), more performant methods than an `.htaccess` file exist.
 
 

From 4f03818280e7979697250ac5d59da12290db2e9f Mon Sep 17 00:00:00 2001
From: Massimo Gismondi <omino.gis@gmail.com>
Date: Mon, 20 Jan 2025 06:51:06 +0100
Subject: [PATCH 23/63] Removed if condition and added a few comments

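The simplified flow can be sketched roughly as follows. This is only an
illustration: `write_if_changed` and the temporary path are hypothetical
names, and the write step after the comparison is assumed rather than
taken from this patch.

```python
from pathlib import Path
import tempfile


def write_if_changed(filepath: Path, new_content: str) -> bool:
    """Return True if the file was (re)written, False if already up to date."""
    # touch() creates an empty file when none exists, so read_text() below
    # cannot fail with FileNotFoundError. An existing file keeps its content.
    filepath.touch()
    old_content = filepath.read_text(encoding="utf-8")
    if old_content == new_content:
        return False
    filepath.write_text(new_content, encoding="utf-8")
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "robots.txt"
    first = write_if_changed(target, "User-agent: GPTBot\n")   # file is created
    second = write_if_changed(target, "User-agent: GPTBot\n")  # nothing to do
```

One consequence of this design: a missing file is treated as an empty
one, so the "created" and "updated" cases share a single code path.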
---
 code/robots.py | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/code/robots.py b/code/robots.py
index 0172330..087b00b 100644
--- a/code/robots.py
+++ b/code/robots.py
@@ -135,9 +135,10 @@ def json_to_table(robots_json):
 def json_to_htaccess(robot_json):
     # Creates a .htaccess filter file. It uses a regular expression to filter out
     # User agents that contain any of the blocked values.
-    htaccess += "RewriteEngine On\n"
+    htaccess = "RewriteEngine On\n"
     htaccess += "RewriteCond %{HTTP_USER_AGENT} ^.*("
 
+    # Escape spaces in each User Agent to build the regular expression
     robots = map(lambda el: el.replace(" ", "\\ "), robot_json.keys())
     htaccess += "|".join(robots)
     htaccess += ").*$ [NC]\n"
@@ -149,10 +150,8 @@ def update_file_if_changed(file_name, converter):
     """Update files if newer content is available and log the (in)actions."""
     new_content = converter(load_robots_json())
     filepath = Path(file_name)
-    if not filepath.exists():
-        filepath.write_text(new_content, encoding="utf-8")
-        print(f"{file_name} has been created.")
-        return
+    # "touch" will create the file if it doesn't exist yet
+    filepath.touch()
     old_content = filepath.read_text(encoding="utf-8")
     if old_content == new_content:
         print(f"{file_name} is already up to date.")

From 7427d96bac08d59276292ca7a66d77365f7d26b9 Mon Sep 17 00:00:00 2001
From: Joshua Sheard <mail@jsheard.com>
Date: Mon, 20 Jan 2025 10:59:02 +0000
Subject: [PATCH 24/63] Update robots.json

Co-authored-by: Glyn Normington <work@underlap.org>
---
 robots.json | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/robots.json b/robots.json
index d71c80b..465a61c 100644
--- a/robots.json
+++ b/robots.json
@@ -95,7 +95,7 @@
         "respect": "[Yes](https://news.ycombinator.com/item?id=42756654)",
         "function": "Scrapes data",
         "frequency": "Unclear at this time.",
-        "description": "Provides crawling services for any purpose, but most likely to be used for AI model training."
+        "description": "Provides crawling services for any purpose, probably including AI model training."
     },
     "Diffbot": {
         "operator": "[Diffbot](https://www.diffbot.com/)",

From 6c552a3daa591f47a81936ebc41c822dc35b9fa2 Mon Sep 17 00:00:00 2001
From: "ai.robots.txt" <ai.robots.txt@users.noreply.github.com>
Date: Mon, 20 Jan 2025 17:45:42 +0000
Subject: [PATCH 25/63] Merge pull request #71 from jsheard/patch-1

Add Crawlspace
---
 .htaccess               | 2 +-
 robots.txt              | 1 +
 table-of-bot-metrics.md | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/.htaccess b/.htaccess
index 31ba5f7..beaddc3 100644
--- a/.htaccess
+++ b/.htaccess
@@ -1,3 +1,3 @@
 RewriteEngine On
-RewriteCond %{HTTP_USER_AGENT} ^.*(AI2Bot|Ai2Bot-Dolma|Amazonbot|anthropic-ai|Applebot|Applebot-Extended|Bytespider|CCBot|ChatGPT-User|Claude-Web|ClaudeBot|cohere-ai|cohere-training-data-crawler|Diffbot|DuckAssistBot|FacebookBot|FriendlyCrawler|Google-Extended|GoogleOther|GoogleOther-Image|GoogleOther-Video|GPTBot|iaskspider/2.0|ICC-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta-ExternalAgent|Meta-ExternalFetcher|OAI-SearchBot|omgili|omgilibot|PanguBot|PerplexityBot|PetalBot|Scrapy|SemrushBot|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio-Extended|YouBot).*$ [NC]
+RewriteCond %{HTTP_USER_AGENT} ^.*(AI2Bot|Ai2Bot-Dolma|Amazonbot|anthropic-ai|Applebot|Applebot-Extended|Bytespider|CCBot|ChatGPT-User|Claude-Web|ClaudeBot|cohere-ai|cohere-training-data-crawler|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|FriendlyCrawler|Google-Extended|GoogleOther|GoogleOther-Image|GoogleOther-Video|GPTBot|iaskspider/2.0|ICC-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta-ExternalAgent|Meta-ExternalFetcher|OAI-SearchBot|omgili|omgilibot|PanguBot|PerplexityBot|PetalBot|Scrapy|SemrushBot|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio-Extended|YouBot).*$ [NC]
 RewriteRule .* - [F,L]
\ No newline at end of file
diff --git a/robots.txt b/robots.txt
index 5c32c96..fd388fd 100644
--- a/robots.txt
+++ b/robots.txt
@@ -11,6 +11,7 @@ User-agent: Claude-Web
 User-agent: ClaudeBot
 User-agent: cohere-ai
 User-agent: cohere-training-data-crawler
+User-agent: Crawlspace
 User-agent: Diffbot
 User-agent: DuckAssistBot
 User-agent: FacebookBot
diff --git a/table-of-bot-metrics.md b/table-of-bot-metrics.md
index 31c9367..f44c585 100644
--- a/table-of-bot-metrics.md
+++ b/table-of-bot-metrics.md
@@ -13,6 +13,7 @@
 | ClaudeBot | [Anthropic](https://www.anthropic.com) | [Yes](https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler) | Scrapes data to train Anthropic's AI products. | No information provided. | Scrapes data to train LLMs and AI products offered by Anthropic. |
 | cohere-ai | [Cohere](https://cohere.com) | Unclear at this time. | Retrieves data to provide responses to user-initiated prompts. | Takes action based on user prompts. | Retrieves data based on user prompts. |
 | cohere-training-data-crawler | Cohere to download training data for its LLMs (Large Language Models) that power its enterprise AI products | Unclear at this time. | AI Data Scrapers | Unclear at this time. | cohere-training-data-crawler is a web crawler operated by Cohere to download training data for its LLMs (Large Language Models) that power its enterprise AI products. More info can be found at https://darkvisitors.com/agents/agents/cohere-training-data-crawler |
+| Crawlspace | [Crawlspace](https://crawlspace.dev) | [Yes](https://news.ycombinator.com/item?id=42756654) | Scrapes data | Unclear at this time. | Provides crawling services for any purpose, probably including AI model training. |
 | Diffbot | [Diffbot](https://www.diffbot.com/) | At the discretion of Diffbot users. | Aggregates structured web data for monitoring and AI model training. | Unclear at this time. | Diffbot is an application used to parse web pages into structured data; this data is used for monitoring or AI model training. |
 | DuckAssistBot | Unclear at this time. | Unclear at this time. | AI Assistants | Unclear at this time. | DuckAssistBot is used by DuckDuckGo's DuckAssist feature to fetch content and generate realtime AI answers to user searches. More info can be found at https://darkvisitors.com/agents/agents/duckassistbot |
 | FacebookBot | Meta/Facebook | [Yes](https://developers.facebook.com/docs/sharing/bot/) | Training language models | Up to 1 page per second | Officially used for training Meta "speech recognition technology," unknown if used to train Meta AI specifically. |

From 9c060dee1c9cead8a3cb1092bdf8615cf33f3656 Mon Sep 17 00:00:00 2001
From: dark-visitors <dark-visitors@users.noreply.github.com>
Date: Tue, 21 Jan 2025 00:49:22 +0000
Subject: [PATCH 26/63] Update from Dark Visitors

---
 robots.json | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/robots.json b/robots.json
index 465a61c..4d7d582 100644
--- a/robots.json
+++ b/robots.json
@@ -307,4 +307,4 @@
         "frequency": "No information.",
         "description": "Retrieves data used for You.com web search engine and LLMs."
     }
-}
+}
\ No newline at end of file

From 05b79b8a5886983c818eaad107fcf6c7de5fad3a Mon Sep 17 00:00:00 2001
From: nisbet-hubbard <87453615+nisbet-hubbard@users.noreply.github.com>
Date: Mon, 27 Jan 2025 19:41:03 +0800
Subject: [PATCH 27/63] Update robots.json

---
 robots.json | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/robots.json b/robots.json
index 4d7d582..7f3cba3 100644
--- a/robots.json
+++ b/robots.json
@@ -265,12 +265,19 @@
         "operator": "[Zyte](https://www.zyte.com)",
         "respect": "Unclear at this time."
     },
-    "SemrushBot": {
+    "SemrushBot-OCOB": {
         "operator": "[Semrush](https://www.semrush.com/)",
         "respect": "[Yes](https://www.semrush.com/bot/)",
-        "function": "Scrapes data for use in LLM article-writing tool.",
+        "function": "Crawls your site for ContentShake AI tool.",
         "frequency": "Roughly once every 10 seconds.",
-        "description": "SemrushBot is a bot which, among other functions, scrapes data for use in ContentShake AI tool reports."
+        "description": "You enter one text (on-demand) and we will make suggestions on it (the tool uses AI but we are not actively crawling the web, you need to manually enter one text/URL)."
+    },
+    "SemrushBot-SWA": {
+        "operator": "[Semrush](https://www.semrush.com/)",
+        "respect": "[Yes](https://www.semrush.com/bot/)",
+        "function": "Checks URLs on your site for SWA tool.",
+        "frequency": "Roughly once every 10 seconds.",
+        "description": "You enter one text (on-demand) and we will make suggestions on it (the tool uses AI but we are not actively crawling the web, you need to manually enter one text/URL)."
     },
     "Sidetrade indexer bot": {
         "description": "AI product training.",
@@ -307,4 +314,4 @@
         "frequency": "No information.",
         "description": "Retrieves data used for You.com web search engine and LLMs."
     }
-}
\ No newline at end of file
+}

From 89d4c6e5ca03f0aedec09b9191e2aece6f2efec3 Mon Sep 17 00:00:00 2001
From: "ai.robots.txt" <ai.robots.txt@users.noreply.github.com>
Date: Sat, 1 Feb 2025 10:51:01 +0000
Subject: [PATCH 28/63] Merge pull request #73 from nisbet-hubbard/patch-8
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Actually block Semrush’s AI tools
---
 .htaccess               | 2 +-
 robots.txt              | 3 ++-
 table-of-bot-metrics.md | 3 ++-
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/.htaccess b/.htaccess
index beaddc3..97482e2 100644
--- a/.htaccess
+++ b/.htaccess
@@ -1,3 +1,3 @@
 RewriteEngine On
-RewriteCond %{HTTP_USER_AGENT} ^.*(AI2Bot|Ai2Bot-Dolma|Amazonbot|anthropic-ai|Applebot|Applebot-Extended|Bytespider|CCBot|ChatGPT-User|Claude-Web|ClaudeBot|cohere-ai|cohere-training-data-crawler|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|FriendlyCrawler|Google-Extended|GoogleOther|GoogleOther-Image|GoogleOther-Video|GPTBot|iaskspider/2.0|ICC-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta-ExternalAgent|Meta-ExternalFetcher|OAI-SearchBot|omgili|omgilibot|PanguBot|PerplexityBot|PetalBot|Scrapy|SemrushBot|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio-Extended|YouBot).*$ [NC]
+RewriteCond %{HTTP_USER_AGENT} ^.*(AI2Bot|Ai2Bot-Dolma|Amazonbot|anthropic-ai|Applebot|Applebot-Extended|Bytespider|CCBot|ChatGPT-User|Claude-Web|ClaudeBot|cohere-ai|cohere-training-data-crawler|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|FriendlyCrawler|Google-Extended|GoogleOther|GoogleOther-Image|GoogleOther-Video|GPTBot|iaskspider/2.0|ICC-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta-ExternalAgent|Meta-ExternalFetcher|OAI-SearchBot|omgili|omgilibot|PanguBot|PerplexityBot|PetalBot|Scrapy|SemrushBot-OCOB|SemrushBot-SWA|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio-Extended|YouBot).*$ [NC]
 RewriteRule .* - [F,L]
\ No newline at end of file
diff --git a/robots.txt b/robots.txt
index fd388fd..3839e55 100644
--- a/robots.txt
+++ b/robots.txt
@@ -36,7 +36,8 @@ User-agent: PanguBot
 User-agent: PerplexityBot
 User-agent: PetalBot
 User-agent: Scrapy
-User-agent: SemrushBot
+User-agent: SemrushBot-OCOB
+User-agent: SemrushBot-SWA
 User-agent: Sidetrade indexer bot
 User-agent: Timpibot
 User-agent: VelenPublicWebCrawler
diff --git a/table-of-bot-metrics.md b/table-of-bot-metrics.md
index f44c585..b51bbae 100644
--- a/table-of-bot-metrics.md
+++ b/table-of-bot-metrics.md
@@ -38,7 +38,8 @@
 | PerplexityBot | [Perplexity](https://www.perplexity.ai/) | [No](https://www.macstories.net/stories/wired-confirms-perplexity-is-bypassing-efforts-by-websites-to-block-its-web-crawler/) | Used to answer queries at the request of users. | Takes action based on user prompts. | Operated by Perplexity to obtain results in response to user queries. |
 | PetalBot | [Huawei](https://huawei.com/) | Yes | Used to provide recommendations in Hauwei assistant and AI search services. | No explicit frequency provided. | Operated by Huawei to provide search and AI assistant services. |
 | Scrapy | [Zyte](https://www.zyte.com) | Unclear at this time. | Scrapes data for a variety of uses including training AI. | No information. | "AI and machine learning applications often need large amounts of quality data, and web data extraction is a fast, efficient way to build structured data sets." |
-| SemrushBot | [Semrush](https://www.semrush.com/) | [Yes](https://www.semrush.com/bot/) | Scrapes data for use in LLM article-writing tool. | Roughly once every 10 seconds. | SemrushBot is a bot which, among other functions, scrapes data for use in ContentShake AI tool reports. |
+| SemrushBot-OCOB | [Semrush](https://www.semrush.com/) | [Yes](https://www.semrush.com/bot/) | Crawls your site for ContentShake AI tool. | Roughly once every 10 seconds. | You enter one text (on-demand) and we will make suggestions on it (the tool uses AI but we are not actively crawling the web, you need to manually enter one text/URL). |
+| SemrushBot-SWA | [Semrush](https://www.semrush.com/) | [Yes](https://www.semrush.com/bot/) | Checks URLs on your site for SWA tool. | Roughly once every 10 seconds. | You enter one text (on-demand) and we will make suggestions on it (the tool uses AI but we are not actively crawling the web, you need to manually enter one text/URL). |
 | Sidetrade indexer bot | [Sidetrade](https://www.sidetrade.com) | Unclear at this time. | Extracts data for a variety of uses including training AI. | No information. | AI product training. |
 | Timpibot | [Timpi](https://timpi.io) | Unclear at this time. | Scrapes data for use in training LLMs. | No information. | Makes data available for training AI models. |
 | VelenPublicWebCrawler | [Velen Crawler](https://velen.io) | [Yes](https://velen.io) | Scrapes data for business data sets and machine learning models. | No information. | "Our goal with this crawler is to build business datasets and machine learning models to better understand the web." |

From bebffccc0ced8c420276c93f3109c2e71cd5ca0c Mon Sep 17 00:00:00 2001
From: dark-visitors <dark-visitors@users.noreply.github.com>
Date: Sun, 2 Feb 2025 00:52:50 +0000
Subject: [PATCH 29/63] Update from Dark Visitors

---
 robots.json | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/robots.json b/robots.json
index 7f3cba3..79762a0 100644
--- a/robots.json
+++ b/robots.json
@@ -314,4 +314,4 @@
         "frequency": "No information.",
         "description": "Retrieves data used for You.com web search engine and LLMs."
     }
-}
+}
\ No newline at end of file

From 261a2b83b90fe89f1d842066709c019fd1dba30f Mon Sep 17 00:00:00 2001
From: always-be-testing <warptank@protonmail.com>
Date: Fri, 14 Feb 2025 12:26:19 -0500
Subject: [PATCH 30/63] update README to include list of ai bots Cloudflare
 considers verified

---
 README.md | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/README.md b/README.md
index 065b0b7..6758570 100644
--- a/README.md
+++ b/README.md
@@ -40,6 +40,19 @@ Alternatively, you can also subscribe to new releases with your GitHub account b
 
 If you use [Cloudflare's hard block](https://blog.cloudflare.com/declaring-your-aindependence-block-ai-bots-scrapers-and-crawlers-with-a-single-click) alongside this list, you can report abusive crawlers that don't respect `robots.txt` [here](https://docs.google.com/forms/d/e/1FAIpQLScbUZ2vlNSdcsb8LyTeSF7uLzQI96s0BKGoJ6wQ6ocUFNOKEg/viewform).
 
+
+If you are unable to make use of [Cloudflare's hard block](https://blog.cloudflare.com/declaring-your-aindependence-block-ai-bots-scrapers-and-crawlers-with-a-single-click) and/or have WAF rules that make use of  [Cloudflare's Verified Bots](https://radar.cloudflare.com/traffic/verified-bots) conditions, please note that the following AI web crawlers are considered verified bots by Cloudflare: 
+- Amazonbot
+- Applebot
+- CCBot
+- ChatGPT-User
+- DuckAssistBot
+- GoogleOther
+- GPTBot
+- OAI-SearchBot
+- PerplexityBot
+- PetalBot
+
 ## Additional resources
 
 - [Blocking Bots with Nginx](https://rknight.me/blog/blocking-bots-with-nginx/) by Robb Knight

From e396a2ec781095c5e2659eefb99c46ab7715a664 Mon Sep 17 00:00:00 2001
From: always-be-testing <warptank@protonmail.com>
Date: Fri, 14 Feb 2025 12:31:20 -0500
Subject: [PATCH 31/63] forgot to include heading

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 6758570..e70d283 100644
--- a/README.md
+++ b/README.md
@@ -40,7 +40,7 @@ Alternatively, you can also subscribe to new releases with your GitHub account b
 
 If you use [Cloudflare's hard block](https://blog.cloudflare.com/declaring-your-aindependence-block-ai-bots-scrapers-and-crawlers-with-a-single-click) alongside this list, you can report abusive crawlers that don't respect `robots.txt` [here](https://docs.google.com/forms/d/e/1FAIpQLScbUZ2vlNSdcsb8LyTeSF7uLzQI96s0BKGoJ6wQ6ocUFNOKEg/viewform).
 
-
+## Cloudflare Verified Bots
 If you are unable to make use of [Cloudflare's hard block](https://blog.cloudflare.com/declaring-your-aindependence-block-ai-bots-scrapers-and-crawlers-with-a-single-click) and/or have WAF rules that make use of  [Cloudflare's Verified Bots](https://radar.cloudflare.com/traffic/verified-bots) conditions, please note that the following AI web crawlers are considered verified bots by Cloudflare: 
 - Amazonbot
 - Applebot

From f99339922fa9afdbb00e18bb99105e81cd3f8e88 Mon Sep 17 00:00:00 2001
From: always-be-testing <warptank@protonmail.com>
Date: Fri, 14 Feb 2025 12:36:33 -0500
Subject: [PATCH 32/63] grammar update and include syntax for verified bot
 condition

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index e70d283..f471ede 100644
--- a/README.md
+++ b/README.md
@@ -41,7 +41,7 @@ Alternatively, you can also subscribe to new releases with your GitHub account b
 If you use [Cloudflare's hard block](https://blog.cloudflare.com/declaring-your-aindependence-block-ai-bots-scrapers-and-crawlers-with-a-single-click) alongside this list, you can report abusive crawlers that don't respect `robots.txt` [here](https://docs.google.com/forms/d/e/1FAIpQLScbUZ2vlNSdcsb8LyTeSF7uLzQI96s0BKGoJ6wQ6ocUFNOKEg/viewform).
 
 ## Cloudflare Verified Bots
-If you are unable to make use of [Cloudflare's hard block](https://blog.cloudflare.com/declaring-your-aindependence-block-ai-bots-scrapers-and-crawlers-with-a-single-click) and/or have WAF rules that make use of  [Cloudflare's Verified Bots](https://radar.cloudflare.com/traffic/verified-bots) conditions, please note that the following AI web crawlers are considered verified bots by Cloudflare: 
+If you are unable to make use of [Cloudflare's hard block](https://blog.cloudflare.com/declaring-your-aindependence-block-ai-bots-scrapers-and-crawlers-with-a-single-click) and/or have WAF rules that use the `cf.bot_management.verified_bot` condition based on [Cloudflare's Verified Bots](https://radar.cloudflare.com/traffic/verified-bots), please note that the following AI web crawlers are considered verified bots by Cloudflare:
 - Amazonbot
 - Applebot
 - CCBot

From af87b85d7f00bc285cb414280e02d2f42284a9d8 Mon Sep 17 00:00:00 2001
From: always-be-testing <warptank@protonmail.com>
Date: Fri, 14 Feb 2025 12:39:08 -0500
Subject: [PATCH 33/63] include return after heading

---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index f471ede..303f009 100644
--- a/README.md
+++ b/README.md
@@ -41,6 +41,7 @@ Alternatively, you can also subscribe to new releases with your GitHub account b
 If you use [Cloudflare's hard block](https://blog.cloudflare.com/declaring-your-aindependence-block-ai-bots-scrapers-and-crawlers-with-a-single-click) alongside this list, you can report abusive crawlers that don't respect `robots.txt` [here](https://docs.google.com/forms/d/e/1FAIpQLScbUZ2vlNSdcsb8LyTeSF7uLzQI96s0BKGoJ6wQ6ocUFNOKEg/viewform).
 
 ## Cloudflare Verified Bots
+
 If you are unable to make use of [Cloudflare's hard block](https://blog.cloudflare.com/declaring-your-aindependence-block-ai-bots-scrapers-and-crawlers-with-a-single-click) and/or have WAF rules that use the `cf.bot_management.verified_bot` condition based on [Cloudflare's Verified Bots](https://radar.cloudflare.com/traffic/verified-bots), please note that the following AI web crawlers are considered verified bots by Cloudflare:
 - Amazonbot
 - Applebot

From 5b13c2e504c843c2a95981cee1c2655d9f21c8f4 Mon Sep 17 00:00:00 2001
From: always-be-testing <warptank@protonmail.com>
Date: Sat, 15 Feb 2025 11:22:10 -0500
Subject: [PATCH 34/63] add more concise message about verified bots

Co-authored-by: Glyn Normington <work@underlap.org>
---
 README.md | 16 +---------------
 1 file changed, 1 insertion(+), 15 deletions(-)

diff --git a/README.md b/README.md
index 303f009..a206c83 100644
--- a/README.md
+++ b/README.md
@@ -39,21 +39,7 @@ Alternatively, you can also subscribe to new releases with your GitHub account b
 ## Report abusive crawlers
 
 If you use [Cloudflare's hard block](https://blog.cloudflare.com/declaring-your-aindependence-block-ai-bots-scrapers-and-crawlers-with-a-single-click) alongside this list, you can report abusive crawlers that don't respect `robots.txt` [here](https://docs.google.com/forms/d/e/1FAIpQLScbUZ2vlNSdcsb8LyTeSF7uLzQI96s0BKGoJ6wQ6ocUFNOKEg/viewform).
-
-## Cloudflare Verified Bots
-
-If you are unable to make use of [Cloudflare's hard block](https://blog.cloudflare.com/declaring-your-aindependence-block-ai-bots-scrapers-and-crawlers-with-a-single-click) and/or have WAF rules that use the `cf.bot_management.verified_bot` condition based on [Cloudflare's Verified Bots](https://radar.cloudflare.com/traffic/verified-bots), please note that the following AI web crawlers are considered verified bots by Cloudflare:
-- Amazonbot
-- Applebot
-- CCBot
-- ChatGPT-User
-- DuckAssistBot
-- GoogleOther
-- GPTBot
-- OAI-SearchBot
-- PerplexityBot
-- PetalBot
-
+But even if you don't use Cloudflare's hard block, their list of [verified bots](https://radar.cloudflare.com/traffic/verified-bots) may come in handy.
 ## Additional resources
 
 - [Blocking Bots with Nginx](https://rknight.me/blog/blocking-bots-with-nginx/) by Robb Knight

From a9ec4ffa6fd1816ee6c1c146fa75983abc0b2edc Mon Sep 17 00:00:00 2001
From: Cory Dransfeldt <hi@coryd.dev>
Date: Sun, 16 Feb 2025 13:36:39 -0800
Subject: [PATCH 35/63] chore: add Brightbot 1.0

---
 robots.json | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/robots.json b/robots.json
index 79762a0..a634634 100644
--- a/robots.json
+++ b/robots.json
@@ -41,6 +41,13 @@
         "frequency": "Unclear at this time.",
         "description": "Apple has a secondary user agent, Applebot-Extended ... [that is] used to train Apple's foundation models powering generative AI features across Apple products, including Apple Intelligence, Services, and Developer Tools."
     },
+    "Brightbot 1.0": {
+        "operator": "Browsing.ai",
+        "respect": "Unclear at this time.",
+        "function": "LLM/AI training.",
+        "frequency": "Unclear at this time.",
+        "description": "Scrapes data to train LLMs and AI products focused on website customer support."
+    },
     "Bytespider": {
         "operator": "ByteDance",
         "respect": "No",
@@ -314,4 +321,4 @@
         "frequency": "No information.",
         "description": "Retrieves data used for You.com web search engine and LLMs."
     }
-}
\ No newline at end of file
+}

From 693289bb29c42b7a526d8210d1f743ca3608690d Mon Sep 17 00:00:00 2001
From: "ai.robots.txt" <ai.robots.txt@users.noreply.github.com>
Date: Sun, 16 Feb 2025 21:37:52 +0000
Subject: [PATCH 36/63] chore: add Brightbot 1.0

---
 .htaccess               | 2 +-
 robots.txt              | 1 +
 table-of-bot-metrics.md | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/.htaccess b/.htaccess
index 97482e2..512c274 100644
--- a/.htaccess
+++ b/.htaccess
@@ -1,3 +1,3 @@
 RewriteEngine On
-RewriteCond %{HTTP_USER_AGENT} ^.*(AI2Bot|Ai2Bot-Dolma|Amazonbot|anthropic-ai|Applebot|Applebot-Extended|Bytespider|CCBot|ChatGPT-User|Claude-Web|ClaudeBot|cohere-ai|cohere-training-data-crawler|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|FriendlyCrawler|Google-Extended|GoogleOther|GoogleOther-Image|GoogleOther-Video|GPTBot|iaskspider/2.0|ICC-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta-ExternalAgent|Meta-ExternalFetcher|OAI-SearchBot|omgili|omgilibot|PanguBot|PerplexityBot|PetalBot|Scrapy|SemrushBot-OCOB|SemrushBot-SWA|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio-Extended|YouBot).*$ [NC]
+RewriteCond %{HTTP_USER_AGENT} ^.*(AI2Bot|Ai2Bot-Dolma|Amazonbot|anthropic-ai|Applebot|Applebot-Extended|Brightbot\ 1.0|Bytespider|CCBot|ChatGPT-User|Claude-Web|ClaudeBot|cohere-ai|cohere-training-data-crawler|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|FriendlyCrawler|Google-Extended|GoogleOther|GoogleOther-Image|GoogleOther-Video|GPTBot|iaskspider/2.0|ICC-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta-ExternalAgent|Meta-ExternalFetcher|OAI-SearchBot|omgili|omgilibot|PanguBot|PerplexityBot|PetalBot|Scrapy|SemrushBot-OCOB|SemrushBot-SWA|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio-Extended|YouBot).*$ [NC]
 RewriteRule .* - [F,L]
\ No newline at end of file
diff --git a/robots.txt b/robots.txt
index 3839e55..80c40e8 100644
--- a/robots.txt
+++ b/robots.txt
@@ -4,6 +4,7 @@ User-agent: Amazonbot
 User-agent: anthropic-ai
 User-agent: Applebot
 User-agent: Applebot-Extended
+User-agent: Brightbot 1.0
 User-agent: Bytespider
 User-agent: CCBot
 User-agent: ChatGPT-User
diff --git a/table-of-bot-metrics.md b/table-of-bot-metrics.md
index b51bbae..af32bf2 100644
--- a/table-of-bot-metrics.md
+++ b/table-of-bot-metrics.md
@@ -6,6 +6,7 @@
 | anthropic-ai | [Anthropic](https://www.anthropic.com) | Unclear at this time. | Scrapes data to train Anthropic's AI products. | No information provided. | Scrapes data to train LLMs and AI products offered by Anthropic. |
 | Applebot | Unclear at this time. | Unclear at this time. | AI Search Crawlers | Unclear at this time. | Applebot is a web crawler used by Apple to index search results that allow the Siri AI Assistant to answer user questions. Siri's answers normally contain references to the website. More info can be found at https://darkvisitors.com/agents/agents/applebot |
 | Applebot-Extended | [Apple](https://support.apple.com/en-us/119829#datausage) | Yes | Powers features in Siri, Spotlight, Safari, Apple Intelligence, and others. | Unclear at this time. | Apple has a secondary user agent, Applebot-Extended ... [that is] used to train Apple's foundation models powering generative AI features across Apple products, including Apple Intelligence, Services, and Developer Tools. |
+| Brightbot 1.0 | Browsing.ai | Unclear at this time. | LLM/AI training. | Unclear at this time. | Scrapes data to train LLMs and AI products focused on website customer support. |
 | Bytespider | ByteDance | No | LLM training. | Unclear at this time. | Downloads data to train LLMS, including ChatGPT competitors. |
 | CCBot | [Common Crawl Foundation](https://commoncrawl.org) | [Yes](https://commoncrawl.org/ccbot) | Provides open crawl dataset, used for many purposes, including Machine Learning/AI. | Monthly at present. | Web archive going back to 2008. [Cited in thousands of research papers per year](https://commoncrawl.org/research-papers). |
 | ChatGPT-User | [OpenAI](https://openai.com) | Yes | Takes action based on user prompts. | Only when prompted by a user. | Used by plugins in ChatGPT to answer queries based on user input. |

From abfd6dfcd15267ed03b5fda4cd3eac2512604ed2 Mon Sep 17 00:00:00 2001
From: dark-visitors <dark-visitors@users.noreply.github.com>
Date: Mon, 17 Feb 2025 00:53:32 +0000
Subject: [PATCH 37/63] Update from Dark Visitors

---
 robots.json | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/robots.json b/robots.json
index a634634..cdc7bb5 100644
--- a/robots.json
+++ b/robots.json
@@ -321,4 +321,4 @@
         "frequency": "No information.",
         "description": "Retrieves data used for You.com web search engine and LLMs."
     }
-}
+}
\ No newline at end of file

From c0d418cd875b432fd4558be57ad3c009326b631e Mon Sep 17 00:00:00 2001
From: Dennis Camera <dennis.camera@riiengineering.ch>
Date: Mon, 17 Feb 2025 21:00:57 +0100
Subject: [PATCH 38/63] .htaccess: Allow robots access to /robots.txt

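The negated pattern means the forbid rule applies to every request path
except robots.txt, so blocked crawlers can still read the exclusion
rules. The optional leading slash covers both per-directory context
(where the path is matched without it) and server context. A minimal
Python sketch of the pattern alone, assuming Python's `re` behaves like
Apache's regex engine for this expression; `is_forbidden` is a
hypothetical helper and the RewriteCond on the user agent is not
modelled:

```python
import re

# Same expression as in the RewriteRule, without the leading "!" negation.
ROBOTS_TXT = re.compile(r"^/?robots\.txt$")


def is_forbidden(path: str) -> bool:
    """Mimic the negated pattern: forbid every path except robots.txt."""
    return ROBOTS_TXT.search(path) is None
```

Here `is_forbidden("robots.txt")` and `is_forbidden("/robots.txt")` are
false, while a nested path such as "blog/robots.txt" is still forbidden.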
---
 .htaccess                 | 2 +-
 code/robots.py            | 2 +-
 code/test_files/.htaccess | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/.htaccess b/.htaccess
index 512c274..c42f99e 100644
--- a/.htaccess
+++ b/.htaccess
@@ -1,3 +1,3 @@
 RewriteEngine On
 RewriteCond %{HTTP_USER_AGENT} ^.*(AI2Bot|Ai2Bot-Dolma|Amazonbot|anthropic-ai|Applebot|Applebot-Extended|Brightbot\ 1.0|Bytespider|CCBot|ChatGPT-User|Claude-Web|ClaudeBot|cohere-ai|cohere-training-data-crawler|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|FriendlyCrawler|Google-Extended|GoogleOther|GoogleOther-Image|GoogleOther-Video|GPTBot|iaskspider/2.0|ICC-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta-ExternalAgent|Meta-ExternalFetcher|OAI-SearchBot|omgili|omgilibot|PanguBot|PerplexityBot|PetalBot|Scrapy|SemrushBot-OCOB|SemrushBot-SWA|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio-Extended|YouBot).*$ [NC]
-RewriteRule .* - [F,L]
\ No newline at end of file
+RewriteRule !^/?robots\.txt$ - [F,L]
diff --git a/code/robots.py b/code/robots.py
index 087b00b..bb18e70 100644
--- a/code/robots.py
+++ b/code/robots.py
@@ -142,7 +142,7 @@ def json_to_htaccess(robot_json):
     robots = map(lambda el: el.replace(" ", "\\ "), robot_json.keys())
     htaccess += "|".join(robots)
     htaccess += ").*$ [NC]\n"
-    htaccess += "RewriteRule .* - [F,L]"
+    htaccess += "RewriteRule !^/?robots\\.txt$ - [F,L]\n"
     return htaccess
 
 
diff --git a/code/test_files/.htaccess b/code/test_files/.htaccess
index a34bf55..2e78674 100644
--- a/code/test_files/.htaccess
+++ b/code/test_files/.htaccess
@@ -1,3 +1,3 @@
 RewriteEngine On
 RewriteCond %{HTTP_USER_AGENT} ^.*(AI2Bot|Ai2Bot-Dolma|Amazonbot|anthropic-ai|Applebot|Applebot-Extended|Bytespider|CCBot|ChatGPT-User|Claude-Web|ClaudeBot|cohere-ai|Diffbot|FacebookBot|facebookexternalhit|FriendlyCrawler|Google-Extended|GoogleOther|GoogleOther-Image|GoogleOther-Video|GPTBot|iaskspider/2.0|ICC-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta-ExternalAgent|Meta-ExternalFetcher|OAI-SearchBot|omgili|omgilibot|PerplexityBot|PetalBot|Scrapy|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio-Extended|YouBot).*$ [NC]
-RewriteRule .* - [F,L]
\ No newline at end of file
+RewriteRule !^/?robots\.txt$ - [F,L]

From a884a2afb9dbc7338b0faa24b3c10308adbc48e4 Mon Sep 17 00:00:00 2001
From: Dennis Camera <dennis.camera@riiengineering.ch>
Date: Mon, 17 Feb 2025 21:00:57 +0100
Subject: [PATCH 39/63] .htaccess: Make regex in RewriteCond safe

Improve the regular expression by removing unneeded anchors and
escaping special characters (not just space) to prevent false positives
or a misbehaving rewrite rule.
---
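Reviewer note (not part of the commit message): a minimal sketch of the
escaping approach, assuming only Python's standard re module. The function
mirrors list_to_pcre from the diff, but builds the alternation outside the
f-string so that it also runs on Python versions older than 3.12, which do
not accept the same quote character nested inside an f-string. The user
agent strings are illustrative samples, not the full list.

```python
import re


def list_to_pcre(lst):
    # Escape each name so regex metacharacters match literally, then join
    # the escaped names into one alternation group.
    escaped = "|".join(map(re.escape, lst))
    return f"({escaped})"


pattern = list_to_pcre(["GPTBot", "iaskspider/2.0", "Kangaroo Bot"])

# Without anchors, the group matches anywhere in the User-Agent header.
hit = re.search(pattern, "Mozilla/5.0 (compatible; GPTBot/1.1)", re.IGNORECASE)

# With the dot escaped, "2x0" no longer matches "2.0" (a false positive
# that the unescaped pattern would have allowed).
miss = re.search(pattern, "iaskspider/2x0", re.IGNORECASE)
```
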
 .htaccess                 |  2 +-
 code/robots.py            | 19 ++++++++++---------
 code/test_files/.htaccess |  2 +-
 3 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/.htaccess b/.htaccess
index c42f99e..2313293 100644
--- a/.htaccess
+++ b/.htaccess
@@ -1,3 +1,3 @@
 RewriteEngine On
-RewriteCond %{HTTP_USER_AGENT} ^.*(AI2Bot|Ai2Bot-Dolma|Amazonbot|anthropic-ai|Applebot|Applebot-Extended|Brightbot\ 1.0|Bytespider|CCBot|ChatGPT-User|Claude-Web|ClaudeBot|cohere-ai|cohere-training-data-crawler|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|FriendlyCrawler|Google-Extended|GoogleOther|GoogleOther-Image|GoogleOther-Video|GPTBot|iaskspider/2.0|ICC-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta-ExternalAgent|Meta-ExternalFetcher|OAI-SearchBot|omgili|omgilibot|PanguBot|PerplexityBot|PetalBot|Scrapy|SemrushBot-OCOB|SemrushBot-SWA|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio-Extended|YouBot).*$ [NC]
+RewriteCond %{HTTP_USER_AGENT} (AI2Bot|Ai2Bot\-Dolma|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Brightbot\ 1\.0|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|cohere\-training\-data\-crawler|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|FriendlyCrawler|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta\-ExternalAgent|Meta\-ExternalFetcher|OAI\-SearchBot|omgili|omgilibot|PanguBot|PerplexityBot|PetalBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot) [NC]
 RewriteRule !^/?robots\.txt$ - [F,L]
diff --git a/code/robots.py b/code/robots.py
index bb18e70..a8a674d 100644
--- a/code/robots.py
+++ b/code/robots.py
@@ -1,8 +1,9 @@
 import json
-from pathlib import Path
-
+import re
 import requests
+
 from bs4 import BeautifulSoup
+from pathlib import Path
 
 
 def load_robots_json():
@@ -99,7 +100,6 @@ def updated_robots_json(soup):
 
 
 def ingest_darkvisitors():
-
     old_robots_json = load_robots_json()
     soup = get_agent_soup()
     if soup:
@@ -132,16 +132,17 @@ def json_to_table(robots_json):
     return table
 
 
+def list_to_pcre(lst):
+    # Python re is not 100% identical to PCRE which is used by Apache, but it
+    # should probably be close enough in the real world for re.escape to work.
+    return f"({"|".join(map(re.escape, lst))})"
+
+
 def json_to_htaccess(robot_json):
     # Creates a .htaccess filter file. It uses a regular expression to filter out
     # User agents that contain any of the blocked values.
     htaccess = "RewriteEngine On\n"
-    htaccess += "RewriteCond %{HTTP_USER_AGENT} ^.*("
-
-    # Escape spaces in each User Agent to build the regular expression
-    robots = map(lambda el: el.replace(" ", "\\ "), robot_json.keys())
-    htaccess += "|".join(robots)
-    htaccess += ").*$ [NC]\n"
+    htaccess += f"RewriteCond %{{HTTP_USER_AGENT}} {list_to_pcre(robot_json.keys())} [NC]\n"
     htaccess += "RewriteRule !^/?robots\\.txt$ - [F,L]\n"
     return htaccess
 
diff --git a/code/test_files/.htaccess b/code/test_files/.htaccess
index 2e78674..90ddcf2 100644
--- a/code/test_files/.htaccess
+++ b/code/test_files/.htaccess
@@ -1,3 +1,3 @@
 RewriteEngine On
-RewriteCond %{HTTP_USER_AGENT} ^.*(AI2Bot|Ai2Bot-Dolma|Amazonbot|anthropic-ai|Applebot|Applebot-Extended|Bytespider|CCBot|ChatGPT-User|Claude-Web|ClaudeBot|cohere-ai|Diffbot|FacebookBot|facebookexternalhit|FriendlyCrawler|Google-Extended|GoogleOther|GoogleOther-Image|GoogleOther-Video|GPTBot|iaskspider/2.0|ICC-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta-ExternalAgent|Meta-ExternalFetcher|OAI-SearchBot|omgili|omgilibot|PerplexityBot|PetalBot|Scrapy|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio-Extended|YouBot).*$ [NC]
+RewriteCond %{HTTP_USER_AGENT} (AI2Bot|Ai2Bot\-Dolma|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|Diffbot|FacebookBot|facebookexternalhit|FriendlyCrawler|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta\-ExternalAgent|Meta\-ExternalFetcher|OAI\-SearchBot|omgili|omgilibot|PerplexityBot|PetalBot|Scrapy|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot) [NC]
 RewriteRule !^/?robots\.txt$ - [F,L]

From 0bd3fa63b832ffd8fa908675656c7007021f6654 Mon Sep 17 00:00:00 2001
From: Dennis Camera <dennis.camera@riiengineering.ch>
Date: Tue, 18 Feb 2025 10:12:04 +0100
Subject: [PATCH 40/63] table-of-bot-metrics.md: Escape robot names for
 Markdown table

Some characters which could occur in a crawler's name have a special meaning in
Markdown. They are escaped to prevent them from having unintended side effects.

The escaping is applied only to the first (Name) column of the table. The
remaining columns are expected to be Markdown-encoded already in robots.json.
---
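Reviewer note (not part of the commit message): a small sketch of the
Markdown escaping idea, assuming only Python's standard re module. It is a
variant, not the exact expression from the diff: the hyphen is placed last
in the character class so that it is read literally. In the diff's class the
sequence "+-." is read as a range from "+" to ".", which also covers the
comma, so names containing commas would get a backslash as well.

```python
import re


def escape_md(s):
    # Prefix each Markdown-special character with a backslash. The hyphen
    # is last in the class, so it is a literal hyphen and not a range.
    return re.sub(r"([\\`*_{}\[\]()<>#+.!|-])", r"\\\1", s)


# Sample names in the style of robots.json keys.
for name in ("Ai2Bot-Dolma", "Brightbot 1.0", "curl|sudo bash"):
    print(escape_md(name))
```
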
 code/robots.py          |  8 ++++++--
 table-of-bot-metrics.md | 40 ++++++++++++++++++++--------------------
 2 files changed, 26 insertions(+), 22 deletions(-)

diff --git a/code/robots.py b/code/robots.py
index a8a674d..62fb061 100644
--- a/code/robots.py
+++ b/code/robots.py
@@ -121,13 +121,17 @@ def json_to_txt(robots_json):
     return robots_txt
 
 
+def escape_md(s):
+    return re.sub(r"([]*\\|`(){}<>#+-.!_[])", r"\\\1", s)
+
+
 def json_to_table(robots_json):
     """Compose a markdown table with the information in robots.json"""
     table = "| Name | Operator | Respects `robots.txt` | Data use | Visit regularity | Description |\n"
-    table += "|-----|----------|-----------------------|----------|------------------|-------------|\n"
+    table += "|------|----------|-----------------------|----------|------------------|-------------|\n"
 
     for name, robot in robots_json.items():
-        table += f'| {name} | {robot["operator"]} | {robot["respect"]} | {robot["function"]} | {robot["frequency"]} | {robot["description"]} |\n'
+        table += f'| {escape_md(name)} | {robot["operator"]} | {robot["respect"]} | {robot["function"]} | {robot["frequency"]} | {robot["description"]} |\n'
 
     return table
 
diff --git a/table-of-bot-metrics.md b/table-of-bot-metrics.md
index af32bf2..ce82047 100644
--- a/table-of-bot-metrics.md
+++ b/table-of-bot-metrics.md
@@ -1,48 +1,48 @@
 | Name | Operator | Respects `robots.txt` | Data use | Visit regularity | Description |
-|-----|----------|-----------------------|----------|------------------|-------------|
+|------|----------|-----------------------|----------|------------------|-------------|
 | AI2Bot | [Ai2](https://allenai.org/crawler) | Yes | Content is used to train open language models. | No information provided. | Explores 'certain domains' to find web content. |
-| Ai2Bot-Dolma | [Ai2](https://allenai.org/crawler) | Yes | Content is used to train open language models. | No information provided. | Explores 'certain domains' to find web content. |
+| Ai2Bot\-Dolma | [Ai2](https://allenai.org/crawler) | Yes | Content is used to train open language models. | No information provided. | Explores 'certain domains' to find web content. |
 | Amazonbot | Amazon | Yes | Service improvement and enabling answers for Alexa users. | No information provided. | Includes references to crawled website when surfacing answers via Alexa; does not clearly outline other uses. |
-| anthropic-ai | [Anthropic](https://www.anthropic.com) | Unclear at this time. | Scrapes data to train Anthropic's AI products. | No information provided. | Scrapes data to train LLMs and AI products offered by Anthropic. |
+| anthropic\-ai | [Anthropic](https://www.anthropic.com) | Unclear at this time. | Scrapes data to train Anthropic's AI products. | No information provided. | Scrapes data to train LLMs and AI products offered by Anthropic. |
 | Applebot | Unclear at this time. | Unclear at this time. | AI Search Crawlers | Unclear at this time. | Applebot is a web crawler used by Apple to index search results that allow the Siri AI Assistant to answer user questions. Siri's answers normally contain references to the website. More info can be found at https://darkvisitors.com/agents/agents/applebot |
-| Applebot-Extended | [Apple](https://support.apple.com/en-us/119829#datausage) | Yes | Powers features in Siri, Spotlight, Safari, Apple Intelligence, and others. | Unclear at this time. | Apple has a secondary user agent, Applebot-Extended ... [that is] used to train Apple's foundation models powering generative AI features across Apple products, including Apple Intelligence, Services, and Developer Tools. |
-| Brightbot 1.0 | Browsing.ai | Unclear at this time. | LLM/AI training. | Unclear at this time. | Scrapes data to train LLMs and AI products focused on website customer support. |
+| Applebot\-Extended | [Apple](https://support.apple.com/en-us/119829#datausage) | Yes | Powers features in Siri, Spotlight, Safari, Apple Intelligence, and others. | Unclear at this time. | Apple has a secondary user agent, Applebot-Extended ... [that is] used to train Apple's foundation models powering generative AI features across Apple products, including Apple Intelligence, Services, and Developer Tools. |
+| Brightbot 1\.0 | Browsing.ai | Unclear at this time. | LLM/AI training. | Unclear at this time. | Scrapes data to train LLMs and AI products focused on website customer support. |
 | Bytespider | ByteDance | No | LLM training. | Unclear at this time. | Downloads data to train LLMS, including ChatGPT competitors. |
 | CCBot | [Common Crawl Foundation](https://commoncrawl.org) | [Yes](https://commoncrawl.org/ccbot) | Provides open crawl dataset, used for many purposes, including Machine Learning/AI. | Monthly at present. | Web archive going back to 2008. [Cited in thousands of research papers per year](https://commoncrawl.org/research-papers). |
-| ChatGPT-User | [OpenAI](https://openai.com) | Yes | Takes action based on user prompts. | Only when prompted by a user. | Used by plugins in ChatGPT to answer queries based on user input. |
-| Claude-Web | [Anthropic](https://www.anthropic.com) | Unclear at this time. | Scrapes data to train Anthropic's AI products. | No information provided. | Scrapes data to train LLMs and AI products offered by Anthropic. |
+| ChatGPT\-User | [OpenAI](https://openai.com) | Yes | Takes action based on user prompts. | Only when prompted by a user. | Used by plugins in ChatGPT to answer queries based on user input. |
+| Claude\-Web | [Anthropic](https://www.anthropic.com) | Unclear at this time. | Scrapes data to train Anthropic's AI products. | No information provided. | Scrapes data to train LLMs and AI products offered by Anthropic. |
 | ClaudeBot | [Anthropic](https://www.anthropic.com) | [Yes](https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler) | Scrapes data to train Anthropic's AI products. | No information provided. | Scrapes data to train LLMs and AI products offered by Anthropic. |
-| cohere-ai | [Cohere](https://cohere.com) | Unclear at this time. | Retrieves data to provide responses to user-initiated prompts. | Takes action based on user prompts. | Retrieves data based on user prompts. |
-| cohere-training-data-crawler | Cohere to download training data for its LLMs (Large Language Models) that power its enterprise AI products | Unclear at this time. | AI Data Scrapers | Unclear at this time. | cohere-training-data-crawler is a web crawler operated by Cohere to download training data for its LLMs (Large Language Models) that power its enterprise AI products. More info can be found at https://darkvisitors.com/agents/agents/cohere-training-data-crawler |
+| cohere\-ai | [Cohere](https://cohere.com) | Unclear at this time. | Retrieves data to provide responses to user-initiated prompts. | Takes action based on user prompts. | Retrieves data based on user prompts. |
+| cohere\-training\-data\-crawler | Cohere to download training data for its LLMs (Large Language Models) that power its enterprise AI products | Unclear at this time. | AI Data Scrapers | Unclear at this time. | cohere-training-data-crawler is a web crawler operated by Cohere to download training data for its LLMs (Large Language Models) that power its enterprise AI products. More info can be found at https://darkvisitors.com/agents/agents/cohere-training-data-crawler |
 | Crawlspace | [Crawlspace](https://crawlspace.dev) | [Yes](https://news.ycombinator.com/item?id=42756654) | Scrapes data | Unclear at this time. | Provides crawling services for any purpose, probably including AI model training. |
 | Diffbot | [Diffbot](https://www.diffbot.com/) | At the discretion of Diffbot users. | Aggregates structured web data for monitoring and AI model training. | Unclear at this time. | Diffbot is an application used to parse web pages into structured data; this data is used for monitoring or AI model training. |
 | DuckAssistBot | Unclear at this time. | Unclear at this time. | AI Assistants | Unclear at this time. | DuckAssistBot is used by DuckDuckGo's DuckAssist feature to fetch content and generate realtime AI answers to user searches. More info can be found at https://darkvisitors.com/agents/agents/duckassistbot |
 | FacebookBot | Meta/Facebook | [Yes](https://developers.facebook.com/docs/sharing/bot/) | Training language models | Up to 1 page per second | Officially used for training Meta "speech recognition technology," unknown if used to train Meta AI specifically. |
 | FriendlyCrawler | Unknown | [Yes](https://imho.alex-kunz.com/2024/01/25/an-update-on-friendly-crawler) | We are using the data from the crawler to build datasets for machine learning experiments. | Unclear at this time. | Unclear who the operator is; but data is used for training/machine learning. |
-| Google-Extended | Google | [Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) | LLM training. | No information. | Used to train Gemini and Vertex AI generative APIs. Does not impact a site's inclusion or ranking in Google Search. |
+| Google\-Extended | Google | [Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) | LLM training. | No information. | Used to train Gemini and Vertex AI generative APIs. Does not impact a site's inclusion or ranking in Google Search. |
 | GoogleOther | Google | [Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) | Scrapes data. | No information. | "Used by various product teams for fetching publicly accessible content from sites. For example, it may be used for one-off crawls for internal research and development." |
-| GoogleOther-Image | Google | [Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) | Scrapes data. | No information. | "Used by various product teams for fetching publicly accessible content from sites. For example, it may be used for one-off crawls for internal research and development." |
-| GoogleOther-Video | Google | [Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) | Scrapes data. | No information. | "Used by various product teams for fetching publicly accessible content from sites. For example, it may be used for one-off crawls for internal research and development." |
+| GoogleOther\-Image | Google | [Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) | Scrapes data. | No information. | "Used by various product teams for fetching publicly accessible content from sites. For example, it may be used for one-off crawls for internal research and development." |
+| GoogleOther\-Video | Google | [Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) | Scrapes data. | No information. | "Used by various product teams for fetching publicly accessible content from sites. For example, it may be used for one-off crawls for internal research and development." |
 | GPTBot | [OpenAI](https://openai.com) | Yes | Scrapes data to train OpenAI's products. | No information. | Data is used to train current and future models, removed paywalled data, PII and data that violates the company's policies. |
-| iaskspider/2.0 | iAsk | No | Crawls sites to provide answers to user queries. | Unclear at this time. | Used to provide answers to user queries. |
-| ICC-Crawler | [NICT](https://nict.go.jp) | Yes | Scrapes data to train and support AI technologies. | No information. | Use the collected data for artificial intelligence technologies; provide data to third parties, including commercial companies; those companies can use the data for their own business. |
+| iaskspider/2\.0 | iAsk | No | Crawls sites to provide answers to user queries. | Unclear at this time. | Used to provide answers to user queries. |
+| ICC\-Crawler | [NICT](https://nict.go.jp) | Yes | Scrapes data to train and support AI technologies. | No information. | Use the collected data for artificial intelligence technologies; provide data to third parties, including commercial companies; those companies can use the data for their own business. |
 | ImagesiftBot | [ImageSift](https://imagesift.com) | [Yes](https://imagesift.com/about) | ImageSiftBot is a web crawler that scrapes the internet for publicly available images to support our suite of web intelligence products | No information. | Once images and text are downloaded from a webpage, ImageSift analyzes this data from the page and stores the information in an index. Our web intelligence products use this index to enable search and retrieval of similar images. |
 | img2dataset | [img2dataset](https://github.com/rom1504/img2dataset) | Unclear at this time. | Scrapes images for use in LLMs. | At the discretion of img2dataset users. | Downloads large sets of images into datasets for LLM training or other purposes. |
 | ISSCyberRiskCrawler | [ISS-Corporate](https://iss-cyber.com) | No | Scrapes data to train machine learning models. | No information. | Used to train machine learning based models to quantify cyber risk. |
 | Kangaroo Bot | Unclear at this time. | Unclear at this time. | AI Data Scrapers | Unclear at this time. | Kangaroo Bot is used by the company Kangaroo LLM to download data to train AI models tailored to Australian language and culture. More info can be found at https://darkvisitors.com/agents/agents/kangaroo-bot |
-| Meta-ExternalAgent | [Meta](https://developers.facebook.com/docs/sharing/webmasters/web-crawlers) | Yes. | Used to train models and improve products. | No information. | "The Meta-ExternalAgent crawler crawls the web for use cases such as training AI models or improving products by indexing content directly." |
-| Meta-ExternalFetcher | Unclear at this time. | Unclear at this time. | AI Assistants | Unclear at this time. | Meta-ExternalFetcher is dispatched by Meta AI products in response to user prompts, when they need to fetch an individual links. More info can be found at https://darkvisitors.com/agents/agents/meta-externalfetcher |
-| OAI-SearchBot | [OpenAI](https://openai.com) | [Yes](https://platform.openai.com/docs/bots) | Search result generation. | No information. | Crawls sites to surface as results in SearchGPT. |
+| Meta\-ExternalAgent | [Meta](https://developers.facebook.com/docs/sharing/webmasters/web-crawlers) | Yes. | Used to train models and improve products. | No information. | "The Meta-ExternalAgent crawler crawls the web for use cases such as training AI models or improving products by indexing content directly." |
+| Meta\-ExternalFetcher | Unclear at this time. | Unclear at this time. | AI Assistants | Unclear at this time. | Meta-ExternalFetcher is dispatched by Meta AI products in response to user prompts, when they need to fetch an individual links. More info can be found at https://darkvisitors.com/agents/agents/meta-externalfetcher |
+| OAI\-SearchBot | [OpenAI](https://openai.com) | [Yes](https://platform.openai.com/docs/bots) | Search result generation. | No information. | Crawls sites to surface as results in SearchGPT. |
 | omgili | [Webz.io](https://webz.io/) | [Yes](https://webz.io/blog/web-data/what-is-the-omgili-bot-and-why-is-it-crawling-your-website/) | Data is sold. | No information. | Crawls sites for APIs used by Hootsuite, Sprinklr, NetBase, and other companies. Data also sold for research purposes or LLM training. |
 | omgilibot | [Webz.io](https://webz.io/) | [Yes](https://web.archive.org/web/20170704003301/http://omgili.com/Crawler.html) | Data is sold. | No information. | Legacy user agent initially used for Omgili search engine. Unknown if still used, `omgili` agent still used by Webz.io. |
 | PanguBot | the Chinese company Huawei | Unclear at this time. | AI Data Scrapers | Unclear at this time. | PanguBot is a web crawler operated by the Chinese company Huawei. It's used to download training data for its multimodal LLM (Large Language Model) called PanGu. More info can be found at https://darkvisitors.com/agents/agents/pangubot |
 | PerplexityBot | [Perplexity](https://www.perplexity.ai/) | [No](https://www.macstories.net/stories/wired-confirms-perplexity-is-bypassing-efforts-by-websites-to-block-its-web-crawler/) | Used to answer queries at the request of users. | Takes action based on user prompts. | Operated by Perplexity to obtain results in response to user queries. |
 | PetalBot | [Huawei](https://huawei.com/) | Yes | Used to provide recommendations in Hauwei assistant and AI search services. | No explicit frequency provided. | Operated by Huawei to provide search and AI assistant services. |
 | Scrapy | [Zyte](https://www.zyte.com) | Unclear at this time. | Scrapes data for a variety of uses including training AI. | No information. | "AI and machine learning applications often need large amounts of quality data, and web data extraction is a fast, efficient way to build structured data sets." |
-| SemrushBot-OCOB | [Semrush](https://www.semrush.com/) | [Yes](https://www.semrush.com/bot/) | Crawls your site for ContentShake AI tool. | Roughly once every 10 seconds. | You enter one text (on-demand) and we will make suggestions on it (the tool uses AI but we are not actively crawling the web, you need to manually enter one text/URL). |
-| SemrushBot-SWA | [Semrush](https://www.semrush.com/) | [Yes](https://www.semrush.com/bot/) | Checks URLs on your site for SWA tool. | Roughly once every 10 seconds. | You enter one text (on-demand) and we will make suggestions on it (the tool uses AI but we are not actively crawling the web, you need to manually enter one text/URL). |
+| SemrushBot\-OCOB | [Semrush](https://www.semrush.com/) | [Yes](https://www.semrush.com/bot/) | Crawls your site for ContentShake AI tool. | Roughly once every 10 seconds. | You enter one text (on-demand) and we will make suggestions on it (the tool uses AI but we are not actively crawling the web, you need to manually enter one text/URL). |
+| SemrushBot\-SWA | [Semrush](https://www.semrush.com/) | [Yes](https://www.semrush.com/bot/) | Checks URLs on your site for SWA tool. | Roughly once every 10 seconds. | You enter one text (on-demand) and we will make suggestions on it (the tool uses AI but we are not actively crawling the web, you need to manually enter one text/URL). |
 | Sidetrade indexer bot | [Sidetrade](https://www.sidetrade.com) | Unclear at this time. | Extracts data for a variety of uses including training AI. | No information. | AI product training. |
 | Timpibot | [Timpi](https://timpi.io) | Unclear at this time. | Scrapes data for use in training LLMs. | No information. | Makes data available for training AI models. |
 | VelenPublicWebCrawler | [Velen Crawler](https://velen.io) | [Yes](https://velen.io) | Scrapes data for business data sets and machine learning models. | No information. | "Our goal with this crawler is to build business datasets and machine learning models to better understand the web." |
-| Webzio-Extended | Unclear at this time. | Unclear at this time. | AI Data Scrapers | Unclear at this time. | Webzio-Extended is a web crawler used by Webz.io to maintain a repository of web crawl data that it sells to other companies, including those using it to train AI models. More info can be found at https://darkvisitors.com/agents/agents/webzio-extended |
+| Webzio\-Extended | Unclear at this time. | Unclear at this time. | AI Data Scrapers | Unclear at this time. | Webzio-Extended is a web crawler used by Webz.io to maintain a repository of web crawl data that it sells to other companies, including those using it to train AI models. More info can be found at https://darkvisitors.com/agents/agents/webzio-extended |
 | YouBot | [You](https://about.you.com/youchat/) | [Yes](https://about.you.com/youbot/) | Scrapes data for search engine and LLMs. | No information. | Retrieves data used for You.com web search engine and LLMs. |

From 17b826a6d3868cf87fb52adf95f52872ac5c4437 Mon Sep 17 00:00:00 2001
From: Dennis Camera <dennis.camera@riiengineering.ch>
Date: Tue, 18 Feb 2025 10:13:27 +0100
Subject: [PATCH 41/63] Update tests and convert to stock unittest

For these simple tests Python's built-in unittest framework is more than enough.
No additional dependencies are required.

Add more test cases with "special" characters to exercise the escaping code
more thoroughly.
---
 code/test_files/.htaccess               |  2 +-
 code/test_files/robots.json             | 44 ++++++++++++++++-
 code/test_files/robots.txt              |  6 +++
 code/test_files/table-of-bot-metrics.md | 38 +++++++++------
 code/tests.py                           | 65 ++++++++++++++++++-------
 5 files changed, 120 insertions(+), 35 deletions(-)
 mode change 100644 => 100755 code/tests.py

diff --git a/code/test_files/.htaccess b/code/test_files/.htaccess
index 90ddcf2..7e39092 100644
--- a/code/test_files/.htaccess
+++ b/code/test_files/.htaccess
@@ -1,3 +1,3 @@
 RewriteEngine On
-RewriteCond %{HTTP_USER_AGENT} (AI2Bot|Ai2Bot\-Dolma|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|Diffbot|FacebookBot|facebookexternalhit|FriendlyCrawler|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta\-ExternalAgent|Meta\-ExternalFetcher|OAI\-SearchBot|omgili|omgilibot|PerplexityBot|PetalBot|Scrapy|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot) [NC]
+RewriteCond %{HTTP_USER_AGENT} (AI2Bot|Ai2Bot\-Dolma|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|Diffbot|FacebookBot|facebookexternalhit|FriendlyCrawler|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta\-ExternalAgent|Meta\-ExternalFetcher|OAI\-SearchBot|omgili|omgilibot|PerplexityBot|PetalBot|Scrapy|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot|crawler\.with\.dots|star\*\*\*crawler|Is\ this\ a\ crawler\?|a\[mazing\]\{42\}\(robot\)|2\^32\$|curl\|sudo\ bash) [NC]
 RewriteRule !^/?robots\.txt$ - [F,L]
diff --git a/code/test_files/robots.json b/code/test_files/robots.json
index c50d63c..b0cbfbb 100644
--- a/code/test_files/robots.json
+++ b/code/test_files/robots.json
@@ -278,5 +278,47 @@
         "function": "Scrapes data for search engine and LLMs.",
         "frequency": "No information.",
         "description": "Retrieves data used for You.com web search engine and LLMs."
+    },
+    "crawler.with.dots": {
+        "operator": "Test suite",
+        "respect": "No",
+        "function": "To ensure the code works correctly.",
+        "frequency": "No information.",
+        "description": "When used in the .htaccess regular expression dots need to be escaped."
+    },
+    "star***crawler": {
+        "operator": "Test suite",
+        "respect": "No",
+        "function": "To ensure the code works correctly.",
+        "frequency": "No information.",
+        "description": "When used in the .htaccess regular expression stars need to be escaped."
+    },
+    "Is this a crawler?": {
+        "operator": "Test suite",
+        "respect": "No",
+        "function": "To ensure the code works correctly.",
+        "frequency": "No information.",
+        "description": "When used in the .htaccess regular expression spaces and question marks need to be escaped."
+    },
+    "a[mazing]{42}(robot)": {
+        "operator": "Test suite",
+        "respect": "No",
+        "function": "To ensure the code works correctly.",
+        "frequency": "No information.",
+        "description": "When used in the .htaccess regular expression parantheses, braces, etc. need to be escaped."
+    },
+    "2^32$": {
+        "operator": "Test suite",
+        "respect": "No",
+        "function": "To ensure the code works correctly.",
+        "frequency": "No information.",
+        "description": "When used in the .htaccess regular expression RE anchor characters need to be escaped."
+    },
+    "curl|sudo bash": {
+        "operator": "Test suite",
+        "respect": "No",
+        "function": "To ensure the code works correctly.",
+        "frequency": "No information.",
+        "description": "When used in the .htaccess regular expression pipes need to be escaped."
     }
-}
\ No newline at end of file
+}
diff --git a/code/test_files/robots.txt b/code/test_files/robots.txt
index 927f6f4..03c3c25 100644
--- a/code/test_files/robots.txt
+++ b/code/test_files/robots.txt
@@ -38,4 +38,10 @@ User-agent: Timpibot
 User-agent: VelenPublicWebCrawler
 User-agent: Webzio-Extended
 User-agent: YouBot
+User-agent: crawler.with.dots
+User-agent: star***crawler
+User-agent: Is this a crawler?
+User-agent: a[mazing]{42}(robot)
+User-agent: 2^32$
+User-agent: curl|sudo bash
 Disallow: /
diff --git a/code/test_files/table-of-bot-metrics.md b/code/test_files/table-of-bot-metrics.md
index 257ba99..88af6c0 100644
--- a/code/test_files/table-of-bot-metrics.md
+++ b/code/test_files/table-of-bot-metrics.md
@@ -1,35 +1,35 @@
 | Name | Operator | Respects `robots.txt` | Data use | Visit regularity | Description |
-|-----|----------|-----------------------|----------|------------------|-------------|
+|------|----------|-----------------------|----------|------------------|-------------|
 | AI2Bot | [Ai2](https://allenai.org/crawler) | Yes | Content is used to train open language models. | No information provided. | Explores 'certain domains' to find web content. |
-| Ai2Bot-Dolma | [Ai2](https://allenai.org/crawler) | Yes | Content is used to train open language models. | No information provided. | Explores 'certain domains' to find web content. |
+| Ai2Bot\-Dolma | [Ai2](https://allenai.org/crawler) | Yes | Content is used to train open language models. | No information provided. | Explores 'certain domains' to find web content. |
 | Amazonbot | Amazon | Yes | Service improvement and enabling answers for Alexa users. | No information provided. | Includes references to crawled website when surfacing answers via Alexa; does not clearly outline other uses. |
-| anthropic-ai | [Anthropic](https://www.anthropic.com) | Unclear at this time. | Scrapes data to train Anthropic's AI products. | No information provided. | Scrapes data to train LLMs and AI products offered by Anthropic. |
+| anthropic\-ai | [Anthropic](https://www.anthropic.com) | Unclear at this time. | Scrapes data to train Anthropic's AI products. | No information provided. | Scrapes data to train LLMs and AI products offered by Anthropic. |
 | Applebot | Unclear at this time. | Unclear at this time. | AI Search Crawlers | Unclear at this time. | Applebot is a web crawler used by Apple to index search results that allow the Siri AI Assistant to answer user questions. Siri's answers normally contain references to the website. More info can be found at https://darkvisitors.com/agents/agents/applebot |
-| Applebot-Extended | [Apple](https://support.apple.com/en-us/119829#datausage) | Yes | Powers features in Siri, Spotlight, Safari, Apple Intelligence, and others. | Unclear at this time. | Apple has a secondary user agent, Applebot-Extended ... [that is] used to train Apple's foundation models powering generative AI features across Apple products, including Apple Intelligence, Services, and Developer Tools. |
+| Applebot\-Extended | [Apple](https://support.apple.com/en-us/119829#datausage) | Yes | Powers features in Siri, Spotlight, Safari, Apple Intelligence, and others. | Unclear at this time. | Apple has a secondary user agent, Applebot-Extended ... [that is] used to train Apple's foundation models powering generative AI features across Apple products, including Apple Intelligence, Services, and Developer Tools. |
 | Bytespider | ByteDance | No | LLM training. | Unclear at this time. | Downloads data to train LLMS, including ChatGPT competitors. |
 | CCBot | [Common Crawl Foundation](https://commoncrawl.org) | [Yes](https://commoncrawl.org/ccbot) | Provides open crawl dataset, used for many purposes, including Machine Learning/AI. | Monthly at present. | Web archive going back to 2008. [Cited in thousands of research papers per year](https://commoncrawl.org/research-papers). |
-| ChatGPT-User | [OpenAI](https://openai.com) | Yes | Takes action based on user prompts. | Only when prompted by a user. | Used by plugins in ChatGPT to answer queries based on user input. |
-| Claude-Web | [Anthropic](https://www.anthropic.com) | Unclear at this time. | Scrapes data to train Anthropic's AI products. | No information provided. | Scrapes data to train LLMs and AI products offered by Anthropic. |
+| ChatGPT\-User | [OpenAI](https://openai.com) | Yes | Takes action based on user prompts. | Only when prompted by a user. | Used by plugins in ChatGPT to answer queries based on user input. |
+| Claude\-Web | [Anthropic](https://www.anthropic.com) | Unclear at this time. | Scrapes data to train Anthropic's AI products. | No information provided. | Scrapes data to train LLMs and AI products offered by Anthropic. |
 | ClaudeBot | [Anthropic](https://www.anthropic.com) | [Yes](https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler) | Scrapes data to train Anthropic's AI products. | No information provided. | Scrapes data to train LLMs and AI products offered by Anthropic. |
-| cohere-ai | [Cohere](https://cohere.com) | Unclear at this time. | Retrieves data to provide responses to user-initiated prompts. | Takes action based on user prompts. | Retrieves data based on user prompts. |
+| cohere\-ai | [Cohere](https://cohere.com) | Unclear at this time. | Retrieves data to provide responses to user-initiated prompts. | Takes action based on user prompts. | Retrieves data based on user prompts. |
 | Diffbot | [Diffbot](https://www.diffbot.com/) | At the discretion of Diffbot users. | Aggregates structured web data for monitoring and AI model training. | Unclear at this time. | Diffbot is an application used to parse web pages into structured data; this data is used for monitoring or AI model training. |
 | FacebookBot | Meta/Facebook | [Yes](https://developers.facebook.com/docs/sharing/bot/) | Training language models | Up to 1 page per second | Officially used for training Meta "speech recognition technology," unknown if used to train Meta AI specifically. |
 | facebookexternalhit | Meta/Facebook | [Yes](https://developers.facebook.com/docs/sharing/bot/) | No information. | Unclear at this time. | Unclear at this time. |
 | FriendlyCrawler | Unknown | [Yes](https://imho.alex-kunz.com/2024/01/25/an-update-on-friendly-crawler) | We are using the data from the crawler to build datasets for machine learning experiments. | Unclear at this time. | Unclear who the operator is; but data is used for training/machine learning. |
-| Google-Extended | Google | [Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) | LLM training. | No information. | Used to train Gemini and Vertex AI generative APIs. Does not impact a site's inclusion or ranking in Google Search. |
+| Google\-Extended | Google | [Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) | LLM training. | No information. | Used to train Gemini and Vertex AI generative APIs. Does not impact a site's inclusion or ranking in Google Search. |
 | GoogleOther | Google | [Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) | Scrapes data. | No information. | "Used by various product teams for fetching publicly accessible content from sites. For example, it may be used for one-off crawls for internal research and development." |
-| GoogleOther-Image | Google | [Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) | Scrapes data. | No information. | "Used by various product teams for fetching publicly accessible content from sites. For example, it may be used for one-off crawls for internal research and development." |
-| GoogleOther-Video | Google | [Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) | Scrapes data. | No information. | "Used by various product teams for fetching publicly accessible content from sites. For example, it may be used for one-off crawls for internal research and development." |
+| GoogleOther\-Image | Google | [Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) | Scrapes data. | No information. | "Used by various product teams for fetching publicly accessible content from sites. For example, it may be used for one-off crawls for internal research and development." |
+| GoogleOther\-Video | Google | [Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) | Scrapes data. | No information. | "Used by various product teams for fetching publicly accessible content from sites. For example, it may be used for one-off crawls for internal research and development." |
 | GPTBot | [OpenAI](https://openai.com) | Yes | Scrapes data to train OpenAI's products. | No information. | Data is used to train current and future models, removed paywalled data, PII and data that violates the company's policies. |
-| iaskspider/2.0 | iAsk | No | Crawls sites to provide answers to user queries. | Unclear at this time. | Used to provide answers to user queries. |
-| ICC-Crawler | [NICT](https://nict.go.jp) | Yes | Scrapes data to train and support AI technologies. | No information. | Use the collected data for artificial intelligence technologies; provide data to third parties, including commercial companies; those companies can use the data for their own business. |
+| iaskspider/2\.0 | iAsk | No | Crawls sites to provide answers to user queries. | Unclear at this time. | Used to provide answers to user queries. |
+| ICC\-Crawler | [NICT](https://nict.go.jp) | Yes | Scrapes data to train and support AI technologies. | No information. | Use the collected data for artificial intelligence technologies; provide data to third parties, including commercial companies; those companies can use the data for their own business. |
 | ImagesiftBot | [ImageSift](https://imagesift.com) | [Yes](https://imagesift.com/about) | ImageSiftBot is a web crawler that scrapes the internet for publicly available images to support our suite of web intelligence products | No information. | Once images and text are downloaded from a webpage, ImageSift analyzes this data from the page and stores the information in an index. Our web intelligence products use this index to enable search and retrieval of similar images. |
 | img2dataset | [img2dataset](https://github.com/rom1504/img2dataset) | Unclear at this time. | Scrapes images for use in LLMs. | At the discretion of img2dataset users. | Downloads large sets of images into datasets for LLM training or other purposes. |
 | ISSCyberRiskCrawler | [ISS-Corporate](https://iss-cyber.com) | No | Scrapes data to train machine learning models. | No information. | Used to train machine learning based models to quantify cyber risk. |
 | Kangaroo Bot | Unclear at this time. | Unclear at this time. | AI Data Scrapers | Unclear at this time. | Kangaroo Bot is used by the company Kangaroo LLM to download data to train AI models tailored to Australian language and culture. More info can be found at https://darkvisitors.com/agents/agents/kangaroo-bot |
-| Meta-ExternalAgent | [Meta](https://developers.facebook.com/docs/sharing/webmasters/web-crawlers) | Yes. | Used to train models and improve products. | No information. | "The Meta-ExternalAgent crawler crawls the web for use cases such as training AI models or improving products by indexing content directly." |
-| Meta-ExternalFetcher | Unclear at this time. | Unclear at this time. | AI Assistants | Unclear at this time. | Meta-ExternalFetcher is dispatched by Meta AI products in response to user prompts, when they need to fetch an individual links. More info can be found at https://darkvisitors.com/agents/agents/meta-externalfetcher |
-| OAI-SearchBot | [OpenAI](https://openai.com) | [Yes](https://platform.openai.com/docs/bots) | Search result generation. | No information. | Crawls sites to surface as results in SearchGPT. |
+| Meta\-ExternalAgent | [Meta](https://developers.facebook.com/docs/sharing/webmasters/web-crawlers) | Yes. | Used to train models and improve products. | No information. | "The Meta-ExternalAgent crawler crawls the web for use cases such as training AI models or improving products by indexing content directly." |
+| Meta\-ExternalFetcher | Unclear at this time. | Unclear at this time. | AI Assistants | Unclear at this time. | Meta-ExternalFetcher is dispatched by Meta AI products in response to user prompts, when they need to fetch an individual links. More info can be found at https://darkvisitors.com/agents/agents/meta-externalfetcher |
+| OAI\-SearchBot | [OpenAI](https://openai.com) | [Yes](https://platform.openai.com/docs/bots) | Search result generation. | No information. | Crawls sites to surface as results in SearchGPT. |
 | omgili | [Webz.io](https://webz.io/) | [Yes](https://webz.io/blog/web-data/what-is-the-omgili-bot-and-why-is-it-crawling-your-website/) | Data is sold. | No information. | Crawls sites for APIs used by Hootsuite, Sprinklr, NetBase, and other companies. Data also sold for research purposes or LLM training. |
 | omgilibot | [Webz.io](https://webz.io/) | [Yes](https://web.archive.org/web/20170704003301/http://omgili.com/Crawler.html) | Data is sold. | No information. | Legacy user agent initially used for Omgili search engine. Unknown if still used, `omgili` agent still used by Webz.io. |
 | PerplexityBot | [Perplexity](https://www.perplexity.ai/) | [No](https://www.macstories.net/stories/wired-confirms-perplexity-is-bypassing-efforts-by-websites-to-block-its-web-crawler/) | Used to answer queries at the request of users. | Takes action based on user prompts. | Operated by Perplexity to obtain results in response to user queries. |
@@ -38,5 +38,11 @@
 | Sidetrade indexer bot | [Sidetrade](https://www.sidetrade.com) | Unclear at this time. | Extracts data for a variety of uses including training AI. | No information. | AI product training. |
 | Timpibot | [Timpi](https://timpi.io) | Unclear at this time. | Scrapes data for use in training LLMs. | No information. | Makes data available for training AI models. |
 | VelenPublicWebCrawler | [Velen Crawler](https://velen.io) | [Yes](https://velen.io) | Scrapes data for business data sets and machine learning models. | No information. | "Our goal with this crawler is to build business datasets and machine learning models to better understand the web." |
-| Webzio-Extended | Unclear at this time. | Unclear at this time. | AI Data Scrapers | Unclear at this time. | Webzio-Extended is a web crawler used by Webz.io to maintain a repository of web crawl data that it sells to other companies, including those using it to train AI models. More info can be found at https://darkvisitors.com/agents/agents/webzio-extended |
+| Webzio\-Extended | Unclear at this time. | Unclear at this time. | AI Data Scrapers | Unclear at this time. | Webzio-Extended is a web crawler used by Webz.io to maintain a repository of web crawl data that it sells to other companies, including those using it to train AI models. More info can be found at https://darkvisitors.com/agents/agents/webzio-extended |
 | YouBot | [You](https://about.you.com/youchat/) | [Yes](https://about.you.com/youbot/) | Scrapes data for search engine and LLMs. | No information. | Retrieves data used for You.com web search engine and LLMs. |
+| crawler\.with\.dots | Test suite | No | To ensure the code works correctly. | No information. | When used in the .htaccess regular expression dots need to be escaped. |
+| star\*\*\*crawler | Test suite | No | To ensure the code works correctly. | No information. | When used in the .htaccess regular expression stars need to be escaped. |
+| Is this a crawler? | Test suite | No | To ensure the code works correctly. | No information. | When used in the .htaccess regular expression spaces and question marks need to be escaped. |
+| a\[mazing\]\{42\}\(robot\) | Test suite | No | To ensure the code works correctly. | No information. | When used in the .htaccess regular expression parantheses, braces, etc. need to be escaped. |
+| 2^32$ | Test suite | No | To ensure the code works correctly. | No information. | When used in the .htaccess regular expression RE anchor characters need to be escaped. |
+| curl\|sudo bash | Test suite | No | To ensure the code works correctly. | No information. | When used in the .htaccess regular expression pipes need to be escaped. |
diff --git a/code/tests.py b/code/tests.py
old mode 100644
new mode 100755
index 6f778c3..94cbb47
--- a/code/tests.py
+++ b/code/tests.py
@@ -1,27 +1,58 @@
-"""These tests can be run with pytest.
-This requires pytest: pip install pytest
-cd to the `code` directory and run `pytest`
-"""
+#!/usr/bin/env python3
+"""To run these tests just execute this script."""
 
 import json
-from pathlib import Path
+import unittest
 
 from robots import json_to_txt, json_to_table, json_to_htaccess
 
+class RobotsUnittestExtensions:
+    def loadJson(self, pathname):
+        with open(pathname, "rt") as f:
+            return json.load(f)
 
-def test_robots_txt_creation():
-    robots_json = json.loads(Path("test_files/robots.json").read_text())
-    robots_txt = json_to_txt(robots_json)
-    assert Path("test_files/robots.txt").read_text() == robots_txt
+    def assertEqualsFile(self, pathname, s):
+        with open(pathname, "rt") as f:
+            f_contents = f.read()
+
+        return self.assertMultiLineEqual(f_contents, s)
 
 
-def test_table_of_bot_metrices_md():
-    robots_json = json.loads(Path("test_files/robots.json").read_text())
-    robots_table = json_to_table(robots_json)
-    assert Path("test_files/table-of-bot-metrics.md").read_text() == robots_table
+class TestRobotsTXTGeneration(unittest.TestCase, RobotsUnittestExtensions):
+    maxDiff = 8192
+
+    def setUp(self):
+        self.robots_dict = self.loadJson("test_files/robots.json")
+
+    def test_robots_txt_generation(self):
+        robots_txt = json_to_txt(self.robots_dict)
+        self.assertEqualsFile("test_files/robots.txt", robots_txt)
 
 
-def test_htaccess_creation():
-    robots_json = json.loads(Path("test_files/robots.json").read_text())
-    robots_htaccess = json_to_htaccess(robots_json)
-    assert Path("test_files/.htaccess").read_text() == robots_htaccess
+class TestTableMetricsGeneration(unittest.TestCase, RobotsUnittestExtensions):
+    maxDiff = 32768
+
+    def setUp(self):
+        self.robots_dict = self.loadJson("test_files/robots.json")
+
+    def test_table_generation(self):
+        robots_table = json_to_table(self.robots_dict)
+        self.assertEqualsFile("test_files/table-of-bot-metrics.md", robots_table)
+
+
+class TestHtaccessGeneration(unittest.TestCase, RobotsUnittestExtensions):
+    maxDiff = 8192
+
+    def setUp(self):
+        self.robots_dict = self.loadJson("test_files/robots.json")
+
+    def test_htaccess_generation(self):
+        robots_htaccess = json_to_htaccess(self.robots_dict)
+        self.assertEqualsFile("test_files/.htaccess", robots_htaccess)
+
+
+if __name__ == "__main__":
+    import os
+    os.chdir(os.path.dirname(__file__))
+
+    unittest.main(verbosity=2)

From c7c1e7b96fe74f90590f4d375c1bab4be53a4044 Mon Sep 17 00:00:00 2001
From: Dennis Camera <dennis.camera@riiengineering.ch>
Date: Tue, 18 Feb 2025 10:15:10 +0100
Subject: [PATCH 42/63] robots.py: Make executable

---
 code/robots.py | 2 ++
 1 file changed, 2 insertions(+)
 mode change 100644 => 100755 code/robots.py

diff --git a/code/robots.py b/code/robots.py
old mode 100644
new mode 100755
index 62fb061..6bf7920
--- a/code/robots.py
+++ b/code/robots.py
@@ -1,3 +1,5 @@
+#!/usr/bin/env python3
+
 import json
 import re
 import requests

From 1d55a205e4c8447829abdd34098ef9b0fedefee1 Mon Sep 17 00:00:00 2001
From: Glyn Normington <glyn.normington@gmail.com>
Date: Tue, 18 Feb 2025 05:08:28 +0000
Subject: [PATCH 43/63] Document testing in README

Fixes: https://github.com/ai-robots-txt/ai.robots.txt/issues/81
---
 README.md | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/README.md b/README.md
index a206c83..30a85da 100644
--- a/README.md
+++ b/README.md
@@ -24,6 +24,11 @@ Note that, as stated in the [httpd documentation](https://httpd.apache.org/docs/
 
 A note about contributing: updates should be added/made to `robots.json`. A GitHub action will then generate the updated `robots.txt`, `table-of-bot-metrics.md`, and `.htaccess`.
 
+You can run the tests by [installing](https://www.python.org/about/gettingstarted/) Python 3 and issuing:
+```console
+code/tests.py
+```
+
 ## Subscribe to updates
 
 You can subscribe to list updates via RSS/Atom with the releases feed:

From 8a7489633326465fd7e83fecece6740440d38eb6 Mon Sep 17 00:00:00 2001
From: Dennis Camera <dennis.camera@riiengineering.ch>
Date: Tue, 18 Feb 2025 10:23:40 +0100
Subject: [PATCH 44/63] Add workflow to run tests on pull request or push to
 main

---
 .github/workflows/run-tests.yml | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)
 create mode 100644 .github/workflows/run-tests.yml

diff --git a/.github/workflows/run-tests.yml b/.github/workflows/run-tests.yml
new file mode 100644
index 0000000..c98861f
--- /dev/null
+++ b/.github/workflows/run-tests.yml
@@ -0,0 +1,21 @@
+on:
+  pull_request:
+    branches:
+      - main
+  push:
+    branches:
+      - main
+jobs:
+  run-tests:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Check out repository
+        uses: actions/checkout@v4
+        with:
+          fetch-depth: 2
+      - name: Install dependencies
+        run: |
+          pip install -U requests beautifulsoup4
+      - name: Run tests
+        run: |
+          code/tests.py

From 6ecfcdfcbfd1bd36da1982b7a4f9f95cbeb8101a Mon Sep 17 00:00:00 2001
From: deyigifts <daijiahao@deyigifts.com>
Date: Mon, 24 Mar 2025 14:16:57 +0800
Subject: [PATCH 45/63] Update perplexity bot

Update based on perplexity bot docs
---
 robots.json | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)
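
Note: the new `Perplexity‑User` key in this patch appears to use U+2011 (non-breaking hyphen) rather than an ASCII hyphen. The generated `.htaccess` pattern later in the series shows it without the `\-` escape that every other hyphenated name receives, which is consistent with that reading. A minimal sketch of a check that would flag such names follows. The helper names `non_ascii_names` and `describe_non_ascii` are hypothetical and not part of `code/robots.py`, and the sketch assumes Python 3.7 or later for `str.isascii`.

```python
import unicodedata


def non_ascii_names(names):
    """Return the names that contain at least one non-ASCII character."""
    return [name for name in names if not name.isascii()]


def describe_non_ascii(name):
    """List each non-ASCII character in a name with its code point and Unicode name."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in name
        if ord(ch) > 127
    ]


# Hypothetical sample of keys; the second uses U+2011 instead of "-".
sample = ["PerplexityBot", "Perplexity\u2011User", "ChatGPT-User"]
for suspicious in non_ascii_names(sample):
    print(suspicious, describe_non_ascii(suspicious))
```

A check along these lines could be run over the keys of `robots.json` before the generated files are rebuilt.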

diff --git a/robots.json b/robots.json
index cdc7bb5..eaac816 100644
--- a/robots.json
+++ b/robots.json
@@ -253,10 +253,17 @@
     },
     "PerplexityBot": {
         "operator": "[Perplexity](https://www.perplexity.ai/)",
-        "respect": "[No](https://www.macstories.net/stories/wired-confirms-perplexity-is-bypassing-efforts-by-websites-to-block-its-web-crawler/)",
+        "respect": "[Yes](https://docs.perplexity.ai/guides/bots)",
+        "function": "Search result generation.",
+        "frequency": "No information.",
+        "description": "Crawls sites to surface as results in Perplexity."
+    },
+    "Perplexity‑User": {
+        "operator": "[Perplexity](https://www.perplexity.ai/)",
+        "respect": "[No](https://docs.perplexity.ai/guides/bots)",
         "function": "Used to answer queries at the request of users.",
-        "frequency": "Takes action based on user prompts.",
-        "description": "Operated by Perplexity to obtain results in response to user queries."
+        "frequency": "Only when prompted by a user.",
+        "description": "Visit web pages to help provide an accurate answer and include links to the page in Perplexity response."
     },
     "PetalBot": {
         "description": "Operated by Huawei to provide search and AI assistant services.",
@@ -321,4 +328,4 @@
         "frequency": "No information.",
         "description": "Retrieves data used for You.com web search engine and LLMs."
     }
-}
\ No newline at end of file
+}

From da85207314724c02d151a7bdfcdca3ef3fd056a1 Mon Sep 17 00:00:00 2001
From: Thomas Leister <thomas.leister@mailbox.org>
Date: Thu, 27 Mar 2025 12:27:09 +0100
Subject: [PATCH 46/63] Implement new function "json_to_nginx" which outputs an
 Nginx configuration snippet

---
 code/robots.py | 10 ++++++++++
 1 file changed, 10 insertions(+)
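
For readers who do not have `list_to_pcre` in front of them (its body is not part of this patch), the idea can be sketched as follows. This is an illustration under assumptions, not the repository's implementation: the helper names `names_to_alternation` and `nginx_block_snippet` are hypothetical, and the sketch assumes escaping is done with `re.escape` on Python 3.7 or later, which matches the `\-` and `\ ` escapes seen in the generated files.

```python
import re


def names_to_alternation(names):
    """Escape each name and join them into one regex alternation group."""
    return "(" + "|".join(re.escape(name) for name in names) + ")"


def nginx_block_snippet(names):
    """Wrap the alternation in an Nginx `if` block that returns 403."""
    pattern = names_to_alternation(names)
    return f'if ($http_user_agent ~* "{pattern}") {{\n    return 403;\n}}'


# Hypothetical subset of user-agent names, chosen to show the escaping.
print(nginx_block_snippet(["GPTBot", "Kangaroo Bot", "curl|sudo bash"]))
```

Escaping matters because names such as `curl|sudo bash` would otherwise be read as two alternatives rather than one literal string.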

diff --git a/code/robots.py b/code/robots.py
index 6bf7920..f58f2b8 100755
--- a/code/robots.py
+++ b/code/robots.py
@@ -152,6 +152,12 @@ def json_to_htaccess(robot_json):
     htaccess += "RewriteRule !^/?robots\\.txt$ - [F,L]\n"
     return htaccess
 
+def json_to_nginx(robot_json):
+    # Creates an Nginx config snippet. This snippet can be included in
+    # nginx server{} blocks to block AI bots.
+    config = f"if ($http_user_agent ~* \"{list_to_pcre(robot_json.keys())}\") {{\n    return 403;\n}}"
+    return config
+
 
 def update_file_if_changed(file_name, converter):
     """Update files if newer content is available and log the (in)actions."""
@@ -178,6 +184,10 @@ def conversions():
         file_name="./.htaccess",
         converter=json_to_htaccess,
     )
+    update_file_if_changed(
+        file_name="./nginx-block-ai-bots.conf",
+        converter=json_to_nginx,
+    )
 
 
 if __name__ == "__main__":

From 5a312c5f4d1fcd89c17f4d6cb360ad7230857402 Mon Sep 17 00:00:00 2001
From: Thomas Leister <thomas.leister@mailbox.org>
Date: Thu, 27 Mar 2025 12:28:11 +0100
Subject: [PATCH 47/63] Mention Nginx config feature in README

---
 README.md | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 30a85da..b984672 100644
--- a/README.md
+++ b/README.md
@@ -13,16 +13,19 @@ If you'd like to add information about a crawler to the list, please make a pull
 This repository provides the following files:
 - `robots.txt`
 - `.htaccess`
+- `nginx-block-ai-bots.conf`
 
 `robots.txt` implements the Robots Exclusion Protocol ([RFC 9309](https://www.rfc-editor.org/rfc/rfc9309.html)).
 
 `.htaccess` may be used to configure web servers such as [Apache httpd](https://httpd.apache.org/) to return an error page when one of the listed AI crawlers sends a request to the web server.
 Note that, as stated in the [httpd documentation](https://httpd.apache.org/docs/current/howto/htaccess.html), more performant methods than an `.htaccess` file exist.
 
+`nginx-block-ai-bots.conf` implements an Nginx configuration snippet that can be included in any virtual host `server {}` block via the `include` directive.
+
 
 ## Contributing
 
-A note about contributing: updates should be added/made to `robots.json`. A GitHub action will then generate the updated `robots.txt`, `table-of-bot-metrics.md`, and `.htaccess`.
+A note about contributing: updates should be added/made to `robots.json`. A GitHub action will then generate the updated `robots.txt`, `table-of-bot-metrics.md`, `.htaccess` and `nginx-block-ai-bots.conf`.
 
 You can run the tests by [installing](https://www.python.org/about/gettingstarted/) Python 3 and issuing:
 ```console

From 4f3f4cd0dd0f421c2787b1336d37b8da06998882 Mon Sep 17 00:00:00 2001
From: Thomas Leister <thomas.leister@mailbox.org>
Date: Thu, 27 Mar 2025 12:28:50 +0100
Subject: [PATCH 48/63] Add assembled version of nginx-block-ai-bots.conf file

---
 nginx-block-ai-bots.conf | 3 +++
 1 file changed, 3 insertions(+)
 create mode 100644 nginx-block-ai-bots.conf

diff --git a/nginx-block-ai-bots.conf b/nginx-block-ai-bots.conf
new file mode 100644
index 0000000..ce30520
--- /dev/null
+++ b/nginx-block-ai-bots.conf
@@ -0,0 +1,3 @@
+if ($http_user_agent ~* "(AI2Bot|Ai2Bot\-Dolma|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Brightbot\ 1\.0|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|cohere\-training\-data\-crawler|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|FriendlyCrawler|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta\-ExternalAgent|Meta\-ExternalFetcher|OAI\-SearchBot|omgili|omgilibot|PanguBot|PerplexityBot|PetalBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot)") {
+    return 403;
+}
\ No newline at end of file

From 7c3b5a2cb21f5404cf4e2af1acf8689ba77d7b06 Mon Sep 17 00:00:00 2001
From: Thomas Leister <thomas.leister@mailbox.org>
Date: Thu, 27 Mar 2025 16:12:18 +0100
Subject: [PATCH 49/63] Add tests for Nginx config generator

---
 code/test_files/nginx-block-ai-bots.conf |  3 +++
 code/tests.py                            | 12 +++++++++++-
 2 files changed, 14 insertions(+), 1 deletion(-)
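
The synthetic user agents in the test fixture exist to show that regex metacharacters are matched literally once escaped. A rough sketch of that behaviour is below. It uses Python's `re` module as a stand-in for the regex engine Nginx uses, which is an assumption made for illustration only, and `is_blocked` is a hypothetical helper rather than anything in `code/tests.py`.

```python
import re

# A few of the fixture names that contain regex metacharacters.
names = ["crawler.with.dots", "star***crawler", "Is this a crawler?"]

# Escaped alternation, matched case-insensitively like the `~*` operator.
escaped = re.compile(
    "(" + "|".join(re.escape(name) for name in names) + ")",
    re.IGNORECASE,
)


def is_blocked(user_agent):
    """Return True if any escaped name occurs in the user-agent string."""
    return escaped.search(user_agent) is not None


# Literal names match regardless of case.
print(is_blocked("Mozilla/5.0 (compatible; STAR***CRAWLER/1.0)"))
# With escaping, "." no longer matches arbitrary characters.
print(is_blocked("crawlerXwithXdots"))
# Without escaping, the same dotted name would match that string.
print(re.search("crawler.with.dots", "crawlerXwithXdots") is not None)
```

The last line shows the false positive that escaping is meant to prevent.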
 create mode 100644 code/test_files/nginx-block-ai-bots.conf

diff --git a/code/test_files/nginx-block-ai-bots.conf b/code/test_files/nginx-block-ai-bots.conf
new file mode 100644
index 0000000..d1b559e
--- /dev/null
+++ b/code/test_files/nginx-block-ai-bots.conf
@@ -0,0 +1,3 @@
+if ($http_user_agent ~* "(AI2Bot|Ai2Bot\-Dolma|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|Diffbot|FacebookBot|facebookexternalhit|FriendlyCrawler|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta\-ExternalAgent|Meta\-ExternalFetcher|OAI\-SearchBot|omgili|omgilibot|PerplexityBot|PetalBot|Scrapy|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot|crawler\.with\.dots|star\*\*\*crawler|Is\ this\ a\ crawler\?|a\[mazing\]\{42\}\(robot\)|2\^32\$|curl\|sudo\ bash)") {
+    return 403;
+}
\ No newline at end of file
diff --git a/code/tests.py b/code/tests.py
index 94cbb47..61d69b4 100755
--- a/code/tests.py
+++ b/code/tests.py
@@ -4,7 +4,7 @@
 import json
 import unittest
 
-from robots import json_to_txt, json_to_table, json_to_htaccess
+from robots import json_to_txt, json_to_table, json_to_htaccess, json_to_nginx
 
 class RobotsUnittestExtensions:
     def loadJson(self, pathname):
@@ -50,6 +50,16 @@ class TestHtaccessGeneration(unittest.TestCase, RobotsUnittestExtensions):
         robots_htaccess = json_to_htaccess(self.robots_dict)
         self.assertEqualsFile("test_files/.htaccess", robots_htaccess)
 
+class TestNginxConfigGeneration(unittest.TestCase, RobotsUnittestExtensions):
+    maxDiff = 8192
+
+    def setUp(self):
+        self.robots_dict = self.loadJson("test_files/robots.json")
+
+    def test_nginx_generation(self):
+        robots_nginx = json_to_nginx(self.robots_dict)
+        self.assertEqualsFile("test_files/nginx-block-ai-bots.conf", robots_nginx)
+
 
 if __name__ == "__main__":
     import os

From 68d1d93714bbe4931811f301c7030ca979d95b39 Mon Sep 17 00:00:00 2001
From: "ai.robots.txt" <ai.robots.txt@users.noreply.github.com>
Date: Thu, 27 Mar 2025 19:29:30 +0000
Subject: [PATCH 50/63] Merge pull request #91 from deyigifts/perplexity-user

Update perplexity bots
---
 .htaccess               | 2 +-
 robots.txt              | 1 +
 table-of-bot-metrics.md | 3 ++-
 3 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/.htaccess b/.htaccess
index 2313293..2f5d0e4 100644
--- a/.htaccess
+++ b/.htaccess
@@ -1,3 +1,3 @@
 RewriteEngine On
-RewriteCond %{HTTP_USER_AGENT} (AI2Bot|Ai2Bot\-Dolma|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Brightbot\ 1\.0|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|cohere\-training\-data\-crawler|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|FriendlyCrawler|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta\-ExternalAgent|Meta\-ExternalFetcher|OAI\-SearchBot|omgili|omgilibot|PanguBot|PerplexityBot|PetalBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot) [NC]
+RewriteCond %{HTTP_USER_AGENT} (AI2Bot|Ai2Bot\-Dolma|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Brightbot\ 1\.0|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|cohere\-training\-data\-crawler|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|FriendlyCrawler|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta\-ExternalAgent|Meta\-ExternalFetcher|OAI\-SearchBot|omgili|omgilibot|PanguBot|PerplexityBot|Perplexity‑User|PetalBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot) [NC]
 RewriteRule !^/?robots\.txt$ - [F,L]
diff --git a/robots.txt b/robots.txt
index 80c40e8..8c79fc2 100644
--- a/robots.txt
+++ b/robots.txt
@@ -35,6 +35,7 @@ User-agent: omgili
 User-agent: omgilibot
 User-agent: PanguBot
 User-agent: PerplexityBot
+User-agent: Perplexity‑User
 User-agent: PetalBot
 User-agent: Scrapy
 User-agent: SemrushBot-OCOB
diff --git a/table-of-bot-metrics.md b/table-of-bot-metrics.md
index ce82047..0cc2264 100644
--- a/table-of-bot-metrics.md
+++ b/table-of-bot-metrics.md
@@ -36,7 +36,8 @@
 | omgili | [Webz.io](https://webz.io/) | [Yes](https://webz.io/blog/web-data/what-is-the-omgili-bot-and-why-is-it-crawling-your-website/) | Data is sold. | No information. | Crawls sites for APIs used by Hootsuite, Sprinklr, NetBase, and other companies. Data also sold for research purposes or LLM training. |
 | omgilibot | [Webz.io](https://webz.io/) | [Yes](https://web.archive.org/web/20170704003301/http://omgili.com/Crawler.html) | Data is sold. | No information. | Legacy user agent initially used for Omgili search engine. Unknown if still used, `omgili` agent still used by Webz.io. |
 | PanguBot | the Chinese company Huawei | Unclear at this time. | AI Data Scrapers | Unclear at this time. | PanguBot is a web crawler operated by the Chinese company Huawei. It's used to download training data for its multimodal LLM (Large Language Model) called PanGu. More info can be found at https://darkvisitors.com/agents/agents/pangubot |
-| PerplexityBot | [Perplexity](https://www.perplexity.ai/) | [No](https://www.macstories.net/stories/wired-confirms-perplexity-is-bypassing-efforts-by-websites-to-block-its-web-crawler/) | Used to answer queries at the request of users. | Takes action based on user prompts. | Operated by Perplexity to obtain results in response to user queries. |
+| PerplexityBot | [Perplexity](https://www.perplexity.ai/) | [Yes](https://docs.perplexity.ai/guides/bots) | Search result generation. | No information. | Crawls sites to surface as results in Perplexity. |
+| Perplexity‑User | [Perplexity](https://www.perplexity.ai/) | [No](https://docs.perplexity.ai/guides/bots) | Used to answer queries at the request of users. | Only when prompted by a user. | Visit web pages to help provide an accurate answer and include links to the page in Perplexity response. |
 | PetalBot | [Huawei](https://huawei.com/) | Yes | Used to provide recommendations in Hauwei assistant and AI search services. | No explicit frequency provided. | Operated by Huawei to provide search and AI assistant services. |
 | Scrapy | [Zyte](https://www.zyte.com) | Unclear at this time. | Scrapes data for a variety of uses including training AI. | No information. | "AI and machine learning applications often need large amounts of quality data, and web data extraction is a fast, efficient way to build structured data sets." |
 | SemrushBot\-OCOB | [Semrush](https://www.semrush.com/) | [Yes](https://www.semrush.com/bot/) | Crawls your site for ContentShake AI tool. | Roughly once every 10 seconds. | You enter one text (on-demand) and we will make suggestions on it (the tool uses AI but we are not actively crawling the web, you need to manually enter one text/URL). |

From 6851413c52b91b9729bbbfd75f84af364b490bde Mon Sep 17 00:00:00 2001
From: "ai.robots.txt" <ai.robots.txt@users.noreply.github.com>
Date: Thu, 27 Mar 2025 19:49:15 +0000
Subject: [PATCH 51/63] Merge pull request #94 from
 ThomasLeister/feature/implement-nginx-configuration-snippet-export

Implement Nginx configuration snippet export
---
 nginx-block-ai-bots.conf | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/nginx-block-ai-bots.conf b/nginx-block-ai-bots.conf
index ce30520..72d65ec 100644
--- a/nginx-block-ai-bots.conf
+++ b/nginx-block-ai-bots.conf
@@ -1,3 +1,3 @@
-if ($http_user_agent ~* "(AI2Bot|Ai2Bot\-Dolma|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Brightbot\ 1\.0|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|cohere\-training\-data\-crawler|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|FriendlyCrawler|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta\-ExternalAgent|Meta\-ExternalFetcher|OAI\-SearchBot|omgili|omgilibot|PanguBot|PerplexityBot|PetalBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot)") {
+if ($http_user_agent ~* "(AI2Bot|Ai2Bot\-Dolma|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Brightbot\ 1\.0|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|cohere\-training\-data\-crawler|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|FriendlyCrawler|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta\-ExternalAgent|Meta\-ExternalFetcher|OAI\-SearchBot|omgili|omgilibot|PanguBot|PerplexityBot|Perplexity‑User|PetalBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot)") {
     return 403;
 }
\ No newline at end of file

From ec18af76242c1b62bbbfc7e1df72098b423402a6 Mon Sep 17 00:00:00 2001
From: Cory Dransfeldt <hi@coryd.dev>
Date: Thu, 27 Mar 2025 12:51:22 -0700
Subject: [PATCH 52/63] Revert "Merge pull request #91 from
 deyigifts/perplexity-user"

This reverts commit 68d1d93714bbe4931811f301c7030ca979d95b39.
---
 .htaccess               | 2 +-
 robots.txt              | 1 -
 table-of-bot-metrics.md | 3 +--
 3 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/.htaccess b/.htaccess
index 2f5d0e4..2313293 100644
--- a/.htaccess
+++ b/.htaccess
@@ -1,3 +1,3 @@
 RewriteEngine On
-RewriteCond %{HTTP_USER_AGENT} (AI2Bot|Ai2Bot\-Dolma|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Brightbot\ 1\.0|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|cohere\-training\-data\-crawler|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|FriendlyCrawler|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta\-ExternalAgent|Meta\-ExternalFetcher|OAI\-SearchBot|omgili|omgilibot|PanguBot|PerplexityBot|Perplexity‑User|PetalBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot) [NC]
+RewriteCond %{HTTP_USER_AGENT} (AI2Bot|Ai2Bot\-Dolma|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Brightbot\ 1\.0|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|cohere\-training\-data\-crawler|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|FriendlyCrawler|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta\-ExternalAgent|Meta\-ExternalFetcher|OAI\-SearchBot|omgili|omgilibot|PanguBot|PerplexityBot|PetalBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot) [NC]
 RewriteRule !^/?robots\.txt$ - [F,L]
diff --git a/robots.txt b/robots.txt
index 8c79fc2..80c40e8 100644
--- a/robots.txt
+++ b/robots.txt
@@ -35,7 +35,6 @@ User-agent: omgili
 User-agent: omgilibot
 User-agent: PanguBot
 User-agent: PerplexityBot
-User-agent: Perplexity‑User
 User-agent: PetalBot
 User-agent: Scrapy
 User-agent: SemrushBot-OCOB
diff --git a/table-of-bot-metrics.md b/table-of-bot-metrics.md
index 0cc2264..ce82047 100644
--- a/table-of-bot-metrics.md
+++ b/table-of-bot-metrics.md
@@ -36,8 +36,7 @@
 | omgili | [Webz.io](https://webz.io/) | [Yes](https://webz.io/blog/web-data/what-is-the-omgili-bot-and-why-is-it-crawling-your-website/) | Data is sold. | No information. | Crawls sites for APIs used by Hootsuite, Sprinklr, NetBase, and other companies. Data also sold for research purposes or LLM training. |
 | omgilibot | [Webz.io](https://webz.io/) | [Yes](https://web.archive.org/web/20170704003301/http://omgili.com/Crawler.html) | Data is sold. | No information. | Legacy user agent initially used for Omgili search engine. Unknown if still used, `omgili` agent still used by Webz.io. |
 | PanguBot | the Chinese company Huawei | Unclear at this time. | AI Data Scrapers | Unclear at this time. | PanguBot is a web crawler operated by the Chinese company Huawei. It's used to download training data for its multimodal LLM (Large Language Model) called PanGu. More info can be found at https://darkvisitors.com/agents/agents/pangubot |
-| PerplexityBot | [Perplexity](https://www.perplexity.ai/) | [Yes](https://docs.perplexity.ai/guides/bots) | Search result generation. | No information. | Crawls sites to surface as results in Perplexity. |
-| Perplexity‑User | [Perplexity](https://www.perplexity.ai/) | [No](https://docs.perplexity.ai/guides/bots) | Used to answer queries at the request of users. | Only when prompted by a user. | Visit web pages to help provide an accurate answer and include links to the page in Perplexity response. |
+| PerplexityBot | [Perplexity](https://www.perplexity.ai/) | [No](https://www.macstories.net/stories/wired-confirms-perplexity-is-bypassing-efforts-by-websites-to-block-its-web-crawler/) | Used to answer queries at the request of users. | Takes action based on user prompts. | Operated by Perplexity to obtain results in response to user queries. |
 | PetalBot | [Huawei](https://huawei.com/) | Yes | Used to provide recommendations in Hauwei assistant and AI search services. | No explicit frequency provided. | Operated by Huawei to provide search and AI assistant services. |
 | Scrapy | [Zyte](https://www.zyte.com) | Unclear at this time. | Scrapes data for a variety of uses including training AI. | No information. | "AI and machine learning applications often need large amounts of quality data, and web data extraction is a fast, efficient way to build structured data sets." |
 | SemrushBot\-OCOB | [Semrush](https://www.semrush.com/) | [Yes](https://www.semrush.com/bot/) | Crawls your site for ContentShake AI tool. | Roughly once every 10 seconds. | You enter one text (on-demand) and we will make suggestions on it (the tool uses AI but we are not actively crawling the web, you need to manually enter one text/URL). |

From c249de99a317b54e8891f1682dbf514e7763986e Mon Sep 17 00:00:00 2001
From: dark-visitors <dark-visitors@users.noreply.github.com>
Date: Fri, 28 Mar 2025 00:54:28 +0000
Subject: [PATCH 53/63] Update from Dark Visitors

---
 robots.json | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/robots.json b/robots.json
index eaac816..e907c8b 100644
--- a/robots.json
+++ b/robots.json
@@ -258,7 +258,7 @@
         "frequency": "No information.",
         "description": "Crawls sites to surface as results in Perplexity."
     },
-    "Perplexity‑User": {
+    "Perplexity\u2011User": {
         "operator": "[Perplexity](https://www.perplexity.ai/)",
         "respect": "[No](https://docs.perplexity.ai/guides/bots)",
         "function": "Used to answer queries at the request of users.",
@@ -328,4 +328,4 @@
         "frequency": "No information.",
         "description": "Retrieves data used for You.com web search engine and LLMs."
     }
-}
+}
\ No newline at end of file

From 5b8650b99b35ff2aa1aa9ae26183b312edc48d45 Mon Sep 17 00:00:00 2001
From: "ai.robots.txt" <ai.robots.txt@users.noreply.github.com>
Date: Sat, 29 Mar 2025 00:54:10 +0000
Subject: [PATCH 54/63] Update from Dark Visitors

---
 .htaccess               | 2 +-
 robots.txt              | 1 +
 table-of-bot-metrics.md | 3 ++-
 3 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/.htaccess b/.htaccess
index 2313293..2f5d0e4 100644
--- a/.htaccess
+++ b/.htaccess
@@ -1,3 +1,3 @@
 RewriteEngine On
-RewriteCond %{HTTP_USER_AGENT} (AI2Bot|Ai2Bot\-Dolma|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Brightbot\ 1\.0|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|cohere\-training\-data\-crawler|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|FriendlyCrawler|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta\-ExternalAgent|Meta\-ExternalFetcher|OAI\-SearchBot|omgili|omgilibot|PanguBot|PerplexityBot|PetalBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot) [NC]
+RewriteCond %{HTTP_USER_AGENT} (AI2Bot|Ai2Bot\-Dolma|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Brightbot\ 1\.0|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|cohere\-training\-data\-crawler|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|FriendlyCrawler|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta\-ExternalAgent|Meta\-ExternalFetcher|OAI\-SearchBot|omgili|omgilibot|PanguBot|PerplexityBot|Perplexity‑User|PetalBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot) [NC]
 RewriteRule !^/?robots\.txt$ - [F,L]
diff --git a/robots.txt b/robots.txt
index 80c40e8..8c79fc2 100644
--- a/robots.txt
+++ b/robots.txt
@@ -35,6 +35,7 @@ User-agent: omgili
 User-agent: omgilibot
 User-agent: PanguBot
 User-agent: PerplexityBot
+User-agent: Perplexity‑User
 User-agent: PetalBot
 User-agent: Scrapy
 User-agent: SemrushBot-OCOB
diff --git a/table-of-bot-metrics.md b/table-of-bot-metrics.md
index ce82047..0cc2264 100644
--- a/table-of-bot-metrics.md
+++ b/table-of-bot-metrics.md
@@ -36,7 +36,8 @@
 | omgili | [Webz.io](https://webz.io/) | [Yes](https://webz.io/blog/web-data/what-is-the-omgili-bot-and-why-is-it-crawling-your-website/) | Data is sold. | No information. | Crawls sites for APIs used by Hootsuite, Sprinklr, NetBase, and other companies. Data also sold for research purposes or LLM training. |
 | omgilibot | [Webz.io](https://webz.io/) | [Yes](https://web.archive.org/web/20170704003301/http://omgili.com/Crawler.html) | Data is sold. | No information. | Legacy user agent initially used for Omgili search engine. Unknown if still used, `omgili` agent still used by Webz.io. |
 | PanguBot | the Chinese company Huawei | Unclear at this time. | AI Data Scrapers | Unclear at this time. | PanguBot is a web crawler operated by the Chinese company Huawei. It's used to download training data for its multimodal LLM (Large Language Model) called PanGu. More info can be found at https://darkvisitors.com/agents/agents/pangubot |
-| PerplexityBot | [Perplexity](https://www.perplexity.ai/) | [No](https://www.macstories.net/stories/wired-confirms-perplexity-is-bypassing-efforts-by-websites-to-block-its-web-crawler/) | Used to answer queries at the request of users. | Takes action based on user prompts. | Operated by Perplexity to obtain results in response to user queries. |
+| PerplexityBot | [Perplexity](https://www.perplexity.ai/) | [Yes](https://docs.perplexity.ai/guides/bots) | Search result generation. | No information. | Crawls sites to surface as results in Perplexity. |
+| Perplexity‑User | [Perplexity](https://www.perplexity.ai/) | [No](https://docs.perplexity.ai/guides/bots) | Used to answer queries at the request of users. | Only when prompted by a user. | Visit web pages to help provide an accurate answer and include links to the page in Perplexity response. |
 | PetalBot | [Huawei](https://huawei.com/) | Yes | Used to provide recommendations in Hauwei assistant and AI search services. | No explicit frequency provided. | Operated by Huawei to provide search and AI assistant services. |
 | Scrapy | [Zyte](https://www.zyte.com) | Unclear at this time. | Scrapes data for a variety of uses including training AI. | No information. | "AI and machine learning applications often need large amounts of quality data, and web data extraction is a fast, efficient way to build structured data sets." |
 | SemrushBot\-OCOB | [Semrush](https://www.semrush.com/) | [Yes](https://www.semrush.com/bot/) | Crawls your site for ContentShake AI tool. | Roughly once every 10 seconds. | You enter one text (on-demand) and we will make suggestions on it (the tool uses AI but we are not actively crawling the web, you need to manually enter one text/URL). |

From 6b0349f37ddf69ef9ec0e09a884b351f4a0e4b43 Mon Sep 17 00:00:00 2001
From: Frederic Barthelemy <git@fbartho.com>
Date: Fri, 4 Apr 2025 15:20:30 -0700
Subject: [PATCH 55/63] fix python complaining about f-string syntax

```
python code/tests.py
Traceback (most recent call last):
  File "/Users/fbarthelemy/Code/ai.robots.txt/code/tests.py", line 7, in <module>
    from robots import json_to_txt, json_to_table, json_to_htaccess, json_to_nginx
  File "/Users/fbarthelemy/Code/ai.robots.txt/code/robots.py", line 144
    return f"({"|".join(map(re.escape, lst))})"
                ^
SyntaxError: f-string: expecting '}'
```
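
For context: reusing the enclosing quote character inside an f-string
replacement field is only accepted from Python 3.12 onward (PEP 701), so
older interpreters raise the SyntaxError shown above. A minimal sketch of the
portable form follows; the sample list is a hypothetical input for
illustration, not the real contents of robots.json.

```python
import re


def list_to_pcre(lst):
    # Build the alternation first so the f-string contains no nested quotes.
    formatted = "|".join(map(re.escape, lst))
    return f"({formatted})"


# Hypothetical sample input, chosen only to show the escaping.
sample = ["GPTBot", "iaskspider/2.0"]
pattern = list_to_pcre(sample)
```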
---
 code/robots.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/code/robots.py b/code/robots.py
index f58f2b8..90c0e8c 100755
--- a/code/robots.py
+++ b/code/robots.py
@@ -141,7 +141,8 @@ def json_to_table(robots_json):
 def list_to_pcre(lst):
     # Python re is not 100% identical to PCRE which is used by Apache, but it
     # should probably be close enough in the real world for re.escape to work.
-    return f"({"|".join(map(re.escape, lst))})"
+    formatted = "|".join(map(re.escape, lst))
+    return f"({formatted})"
 
 
 def json_to_htaccess(robot_json):

From 5f5a89c38c27b676c3212f6ea3895d31f315f37e Mon Sep 17 00:00:00 2001
From: Frederic Barthelemy <git@fbartho.com>
Date: Fri, 4 Apr 2025 17:34:14 -0700
Subject: [PATCH 56/63] Fix HTML-mangled hyphen in Perplexity-User

Fixes: #99
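
The mangled character is U+2011 (NON-BREAKING HYPHEN), which looks the same
as an ASCII hyphen in most fonts. A rough sketch of how such look-alike
characters could be spotted in a name; `describe_non_ascii` is a hypothetical
helper for illustration and is not part of code/robots.py.

```python
import unicodedata


def describe_non_ascii(name):
    # Return (character, Unicode name) pairs for every non-ASCII character.
    return [
        (ch, unicodedata.name(ch, "UNKNOWN"))
        for ch in name
        if ord(ch) > 127
    ]


mangled = "Perplexity\u2011User"
# Same substitution the fix applies: swap U+2011 for an ASCII hyphen.
cleaned = mangled.replace("\u2011", "-")
```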
---
 .htaccess                                |  2 +-
 code/robots.py                           | 15 +++++++++++++++
 code/test_files/.htaccess                |  2 +-
 code/test_files/nginx-block-ai-bots.conf |  2 +-
 code/test_files/robots.json              |  7 +++++++
 code/test_files/robots.txt               |  1 +
 code/test_files/table-of-bot-metrics.md  |  1 +
 code/tests.py                            |  5 +++++
 nginx-block-ai-bots.conf                 |  2 +-
 robots.json                              | 14 +++++++-------
 robots.txt                               |  2 +-
 table-of-bot-metrics.md                  |  2 +-
 12 files changed, 42 insertions(+), 13 deletions(-)

diff --git a/.htaccess b/.htaccess
index 2f5d0e4..27a7e11 100644
--- a/.htaccess
+++ b/.htaccess
@@ -1,3 +1,3 @@
 RewriteEngine On
-RewriteCond %{HTTP_USER_AGENT} (AI2Bot|Ai2Bot\-Dolma|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Brightbot\ 1\.0|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|cohere\-training\-data\-crawler|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|FriendlyCrawler|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta\-ExternalAgent|Meta\-ExternalFetcher|OAI\-SearchBot|omgili|omgilibot|PanguBot|PerplexityBot|Perplexity‑User|PetalBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot) [NC]
+RewriteCond %{HTTP_USER_AGENT} (AI2Bot|Ai2Bot\-Dolma|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Brightbot\ 1\.0|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|cohere\-training\-data\-crawler|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|FriendlyCrawler|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta\-ExternalAgent|Meta\-ExternalFetcher|OAI\-SearchBot|omgili|omgilibot|PanguBot|Perplexity\-User|PerplexityBot|PetalBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot) [NC]
 RewriteRule !^/?robots\.txt$ - [F,L]
diff --git a/code/robots.py b/code/robots.py
index 90c0e8c..d158b36 100755
--- a/code/robots.py
+++ b/code/robots.py
@@ -50,6 +50,7 @@ def updated_robots_json(soup):
             continue
         for agent in section.find_all("a", href=True):
             name = agent.find("div", {"class": "agent-name"}).get_text().strip()
+            name = clean_robot_name(name)
             desc = agent.find("p").get_text().strip()
 
             default_values = {
@@ -101,6 +102,20 @@ def updated_robots_json(soup):
     return sorted_robots
 
 
+def clean_robot_name(name):
+    """Clean the robot name by replacing characters that were mangled by HTML rendering software."""
+    # This was specifically spotted in "Perplexity-User".
+    # It looks like a non-breaking hyphen (U+2011) introduced by the HTML rendering software.
+    # Reading the source page for Perplexity: https://docs.perplexity.ai/guides/bots
+    # you can see the bot is listed several times as "Perplexity-User" with a normal hyphen,
+    # and only the row heading uses the special hyphen.
+    #
+    # Technically, there's no reason there couldn't someday be a bot that
+    # actually uses a non-breaking hyphen, but that seems unlikely,
+    # so this solution should be fine for now.
+    return re.sub(r"\u2011", "-", name)
+
+
 def ingest_darkvisitors():
     old_robots_json = load_robots_json()
     soup = get_agent_soup()
diff --git a/code/test_files/.htaccess b/code/test_files/.htaccess
index 7e39092..f0d6783 100644
--- a/code/test_files/.htaccess
+++ b/code/test_files/.htaccess
@@ -1,3 +1,3 @@
 RewriteEngine On
-RewriteCond %{HTTP_USER_AGENT} (AI2Bot|Ai2Bot\-Dolma|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|Diffbot|FacebookBot|facebookexternalhit|FriendlyCrawler|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta\-ExternalAgent|Meta\-ExternalFetcher|OAI\-SearchBot|omgili|omgilibot|PerplexityBot|PetalBot|Scrapy|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot|crawler\.with\.dots|star\*\*\*crawler|Is\ this\ a\ crawler\?|a\[mazing\]\{42\}\(robot\)|2\^32\$|curl\|sudo\ bash) [NC]
+RewriteCond %{HTTP_USER_AGENT} (AI2Bot|Ai2Bot\-Dolma|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|Diffbot|FacebookBot|facebookexternalhit|FriendlyCrawler|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta\-ExternalAgent|Meta\-ExternalFetcher|OAI\-SearchBot|omgili|omgilibot|Perplexity\-User|PerplexityBot|PetalBot|Scrapy|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot|crawler\.with\.dots|star\*\*\*crawler|Is\ this\ a\ crawler\?|a\[mazing\]\{42\}\(robot\)|2\^32\$|curl\|sudo\ bash) [NC]
 RewriteRule !^/?robots\.txt$ - [F,L]
diff --git a/code/test_files/nginx-block-ai-bots.conf b/code/test_files/nginx-block-ai-bots.conf
index d1b559e..c569b15 100644
--- a/code/test_files/nginx-block-ai-bots.conf
+++ b/code/test_files/nginx-block-ai-bots.conf
@@ -1,3 +1,3 @@
-if ($http_user_agent ~* "(AI2Bot|Ai2Bot\-Dolma|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|Diffbot|FacebookBot|facebookexternalhit|FriendlyCrawler|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta\-ExternalAgent|Meta\-ExternalFetcher|OAI\-SearchBot|omgili|omgilibot|PerplexityBot|PetalBot|Scrapy|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot|crawler\.with\.dots|star\*\*\*crawler|Is\ this\ a\ crawler\?|a\[mazing\]\{42\}\(robot\)|2\^32\$|curl\|sudo\ bash)") {
+if ($http_user_agent ~* "(AI2Bot|Ai2Bot\-Dolma|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|Diffbot|FacebookBot|facebookexternalhit|FriendlyCrawler|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta\-ExternalAgent|Meta\-ExternalFetcher|OAI\-SearchBot|omgili|omgilibot|Perplexity\-User|PerplexityBot|PetalBot|Scrapy|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot|crawler\.with\.dots|star\*\*\*crawler|Is\ this\ a\ crawler\?|a\[mazing\]\{42\}\(robot\)|2\^32\$|curl\|sudo\ bash)") {
     return 403;
 }
\ No newline at end of file
diff --git a/code/test_files/robots.json b/code/test_files/robots.json
index b0cbfbb..385f284 100644
--- a/code/test_files/robots.json
+++ b/code/test_files/robots.json
@@ -223,6 +223,13 @@
         "operator": "[Webz.io](https://webz.io/)",
         "respect": "[Yes](https://web.archive.org/web/20170704003301/http://omgili.com/Crawler.html)"
     },
+    "Perplexity-User": {
+        "operator": "[Perplexity](https://www.perplexity.ai/)",
+        "respect": "[No](https://docs.perplexity.ai/guides/bots)",
+        "function": "Used to answer queries at the request of users.",
+        "frequency": "Only when prompted by a user.",
+        "description": "Visit web pages to help provide an accurate answer and include links to the page in Perplexity response."
+    },
     "PerplexityBot": {
         "operator": "[Perplexity](https://www.perplexity.ai/)",
         "respect": "[No](https://www.macstories.net/stories/wired-confirms-perplexity-is-bypassing-efforts-by-websites-to-block-its-web-crawler/)",
diff --git a/code/test_files/robots.txt b/code/test_files/robots.txt
index 03c3c25..ee201f8 100644
--- a/code/test_files/robots.txt
+++ b/code/test_files/robots.txt
@@ -30,6 +30,7 @@ User-agent: Meta-ExternalFetcher
 User-agent: OAI-SearchBot
 User-agent: omgili
 User-agent: omgilibot
+User-agent: Perplexity-User
 User-agent: PerplexityBot
 User-agent: PetalBot
 User-agent: Scrapy
diff --git a/code/test_files/table-of-bot-metrics.md b/code/test_files/table-of-bot-metrics.md
index 88af6c0..9b280aa 100644
--- a/code/test_files/table-of-bot-metrics.md
+++ b/code/test_files/table-of-bot-metrics.md
@@ -32,6 +32,7 @@
 | OAI\-SearchBot | [OpenAI](https://openai.com) | [Yes](https://platform.openai.com/docs/bots) | Search result generation. | No information. | Crawls sites to surface as results in SearchGPT. |
 | omgili | [Webz.io](https://webz.io/) | [Yes](https://webz.io/blog/web-data/what-is-the-omgili-bot-and-why-is-it-crawling-your-website/) | Data is sold. | No information. | Crawls sites for APIs used by Hootsuite, Sprinklr, NetBase, and other companies. Data also sold for research purposes or LLM training. |
 | omgilibot | [Webz.io](https://webz.io/) | [Yes](https://web.archive.org/web/20170704003301/http://omgili.com/Crawler.html) | Data is sold. | No information. | Legacy user agent initially used for Omgili search engine. Unknown if still used, `omgili` agent still used by Webz.io. |
+| Perplexity\-User | [Perplexity](https://www.perplexity.ai/) | [No](https://docs.perplexity.ai/guides/bots) | Used to answer queries at the request of users. | Only when prompted by a user. | Visit web pages to help provide an accurate answer and include links to the page in Perplexity response. |
 | PerplexityBot | [Perplexity](https://www.perplexity.ai/) | [No](https://www.macstories.net/stories/wired-confirms-perplexity-is-bypassing-efforts-by-websites-to-block-its-web-crawler/) | Used to answer queries at the request of users. | Takes action based on user prompts. | Operated by Perplexity to obtain results in response to user queries. |
 | PetalBot | [Huawei](https://huawei.com/) | Yes | Used to provide recommendations in Hauwei assistant and AI search services. | No explicit frequency provided. | Operated by Huawei to provide search and AI assistant services. |
 | Scrapy | [Zyte](https://www.zyte.com) | Unclear at this time. | Scrapes data for a variety of uses including training AI. | No information. | "AI and machine learning applications often need large amounts of quality data, and web data extraction is a fast, efficient way to build structured data sets." |
diff --git a/code/tests.py b/code/tests.py
index 61d69b4..f58b445 100755
--- a/code/tests.py
+++ b/code/tests.py
@@ -60,6 +60,11 @@ class TestNginxConfigGeneration(unittest.TestCase, RobotsUnittestExtensions):
         robots_nginx = json_to_nginx(self.robots_dict)
         self.assertEqualsFile("test_files/nginx-block-ai-bots.conf", robots_nginx)
 
+class TestRobotsNameCleaning(unittest.TestCase):
+    def test_clean_name(self):
+        from robots import clean_robot_name
+
+        self.assertEqual(clean_robot_name("Perplexity‑User"), "Perplexity-User")
 
 if __name__ == "__main__":
     import os
diff --git a/nginx-block-ai-bots.conf b/nginx-block-ai-bots.conf
index 72d65ec..0577bd9 100644
--- a/nginx-block-ai-bots.conf
+++ b/nginx-block-ai-bots.conf
@@ -1,3 +1,3 @@
-if ($http_user_agent ~* "(AI2Bot|Ai2Bot\-Dolma|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Brightbot\ 1\.0|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|cohere\-training\-data\-crawler|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|FriendlyCrawler|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta\-ExternalAgent|Meta\-ExternalFetcher|OAI\-SearchBot|omgili|omgilibot|PanguBot|PerplexityBot|Perplexity‑User|PetalBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot)") {
+if ($http_user_agent ~* "(AI2Bot|Ai2Bot\-Dolma|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Brightbot\ 1\.0|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|cohere\-training\-data\-crawler|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|FriendlyCrawler|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta\-ExternalAgent|Meta\-ExternalFetcher|OAI\-SearchBot|omgili|omgilibot|PanguBot|Perplexity\-User|PerplexityBot|PetalBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot)") {
     return 403;
 }
\ No newline at end of file
diff --git a/robots.json b/robots.json
index e907c8b..8fd7572 100644
--- a/robots.json
+++ b/robots.json
@@ -251,6 +251,13 @@
         "frequency": "Unclear at this time.",
         "description": "PanguBot is a web crawler operated by the Chinese company Huawei. It's used to download training data for its multimodal LLM (Large Language Model) called PanGu. More info can be found at https://darkvisitors.com/agents/agents/pangubot"
     },
+    "Perplexity-User": {
+        "operator": "[Perplexity](https://www.perplexity.ai/)",
+        "respect": "[No](https://docs.perplexity.ai/guides/bots)",
+        "function": "Used to answer queries at the request of users.",
+        "frequency": "Only when prompted by a user.",
+        "description": "Visit web pages to help provide an accurate answer and include links to the page in Perplexity response."
+    },
     "PerplexityBot": {
         "operator": "[Perplexity](https://www.perplexity.ai/)",
         "respect": "[Yes](https://docs.perplexity.ai/guides/bots)",
@@ -258,13 +265,6 @@
         "frequency": "No information.",
         "description": "Crawls sites to surface as results in Perplexity."
     },
-    "Perplexity\u2011User": {
-        "operator": "[Perplexity](https://www.perplexity.ai/)",
-        "respect": "[No](https://docs.perplexity.ai/guides/bots)",
-        "function": "Used to answer queries at the request of users.",
-        "frequency": "Only when prompted by a user.",
-        "description": "Visit web pages to help provide an accurate answer and include links to the page in Perplexity response."
-    },
     "PetalBot": {
         "description": "Operated by Huawei to provide search and AI assistant services.",
         "frequency": "No explicit frequency provided.",
diff --git a/robots.txt b/robots.txt
index 8c79fc2..c531918 100644
--- a/robots.txt
+++ b/robots.txt
@@ -34,8 +34,8 @@ User-agent: OAI-SearchBot
 User-agent: omgili
 User-agent: omgilibot
 User-agent: PanguBot
+User-agent: Perplexity-User
 User-agent: PerplexityBot
-User-agent: Perplexity‑User
 User-agent: PetalBot
 User-agent: Scrapy
 User-agent: SemrushBot-OCOB
diff --git a/table-of-bot-metrics.md b/table-of-bot-metrics.md
index 0cc2264..d92df34 100644
--- a/table-of-bot-metrics.md
+++ b/table-of-bot-metrics.md
@@ -36,8 +36,8 @@
 | omgili | [Webz.io](https://webz.io/) | [Yes](https://webz.io/blog/web-data/what-is-the-omgili-bot-and-why-is-it-crawling-your-website/) | Data is sold. | No information. | Crawls sites for APIs used by Hootsuite, Sprinklr, NetBase, and other companies. Data also sold for research purposes or LLM training. |
 | omgilibot | [Webz.io](https://webz.io/) | [Yes](https://web.archive.org/web/20170704003301/http://omgili.com/Crawler.html) | Data is sold. | No information. | Legacy user agent initially used for Omgili search engine. Unknown if still used, `omgili` agent still used by Webz.io. |
 | PanguBot | the Chinese company Huawei | Unclear at this time. | AI Data Scrapers | Unclear at this time. | PanguBot is a web crawler operated by the Chinese company Huawei. It's used to download training data for its multimodal LLM (Large Language Model) called PanGu. More info can be found at https://darkvisitors.com/agents/agents/pangubot |
+| Perplexity\-User | [Perplexity](https://www.perplexity.ai/) | [No](https://docs.perplexity.ai/guides/bots) | Used to answer queries at the request of users. | Only when prompted by a user. | Visit web pages to help provide an accurate answer and include links to the page in Perplexity response. |
 | PerplexityBot | [Perplexity](https://www.perplexity.ai/) | [Yes](https://docs.perplexity.ai/guides/bots) | Search result generation. | No information. | Crawls sites to surface as results in Perplexity. |
-| Perplexity‑User | [Perplexity](https://www.perplexity.ai/) | [No](https://docs.perplexity.ai/guides/bots) | Used to answer queries at the request of users. | Only when prompted by a user. | Visit web pages to help provide an accurate answer and include links to the page in Perplexity response. |
 | PetalBot | [Huawei](https://huawei.com/) | Yes | Used to provide recommendations in Hauwei assistant and AI search services. | No explicit frequency provided. | Operated by Huawei to provide search and AI assistant services. |
 | Scrapy | [Zyte](https://www.zyte.com) | Unclear at this time. | Scrapes data for a variety of uses including training AI. | No information. | "AI and machine learning applications often need large amounts of quality data, and web data extraction is a fast, efficient way to build structured data sets." |
 | SemrushBot\-OCOB | [Semrush](https://www.semrush.com/) | [Yes](https://www.semrush.com/bot/) | Crawls your site for ContentShake AI tool. | Roughly once every 10 seconds. | You enter one text (on-demand) and we will make suggestions on it (the tool uses AI but we are not actively crawling the web, you need to manually enter one text/URL). |

From c6f308cbd0a00166f5085fa4adc98630c767e11e Mon Sep 17 00:00:00 2001
From: Frederic Barthelemy <git@fbartho.com>
Date: Sat, 5 Apr 2025 09:01:52 -0700
Subject: [PATCH 57/63] PR Feedback: log special-case, comment consistency

---
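
Reviewer note (not part of the commit): a minimal sketch of the
normalization this patch logs, and of why the earlier patch moved
"Perplexity-User" above "PerplexityBot" in the generated files. The
case-insensitive sort and the sample list are assumptions made for
illustration; only clean_robot_name mirrors the code in the diff below.

```python
import re


def clean_robot_name(name):
    """Replace U+2011 (non-breaking hyphen) with an ASCII hyphen, logging any change."""
    result = re.sub(r"\u2011", "-", name)
    if result != name:
        print(f"\tCleaned '{name}' to '{result}' - unicode/html mangled chars normalized.")
    return result


# Hypothetical sample: one name still carries the non-breaking hyphen.
raw_names = ["PetalBot", "PerplexityBot", "Perplexity\u2011User"]

# Assumption: generated files list names in case-insensitive order.
before = sorted(raw_names, key=str.lower)
after = sorted((clean_robot_name(n) for n in raw_names), key=str.lower)
```

U+2011 sorts after every ASCII letter, so the uncleaned name lands after
"PerplexityBot"; once cleaned, "-" sorts before "B" and the entry moves up.
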
 code/robots.py | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/code/robots.py b/code/robots.py
index d158b36..86ea413 100755
--- a/code/robots.py
+++ b/code/robots.py
@@ -107,13 +107,16 @@ def clean_robot_name(name):
     # This was specifically spotted in "Perplexity-User"
     # Looks like a non-breaking hyphen introduced by the HTML rendering software
     # Reading the source page for Perplexity: https://docs.perplexity.ai/guides/bots
-    # You can see the bot is listed several times as "Perplexity‑User" with a normal hyphen, 
+    # You can see the bot is listed several times as "Perplexity-User" with a normal hyphen, 
     # and it's only the Row-Heading that has the special hyphen
     # 
     # Technically, there's no reason there wouldn't someday be a bot that 
     # actually uses a non-breaking hyphen, but that seems unlikely,
     # so this solution should be fine for now.
-    return re.sub(r"\u2011", "-", name)
+    result = re.sub(r"\u2011", "-", name)
+    if result != name:
+        print(f"\tCleaned '{name}' to '{result}' - unicode/html mangled chars normalized.")
+    return result
 
 
 def ingest_darkvisitors():

From b65f45e408461560a32f44f05860f80655737467 Mon Sep 17 00:00:00 2001
From: Cory Dransfeldt <hi@coryd.dev>
Date: Thu, 10 Apr 2025 10:12:51 -0700
Subject: [PATCH 58/63] chore(robots.json): adds imgproxy crawler

---
 robots.json | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/robots.json b/robots.json
index 8fd7572..4c9f7d7 100644
--- a/robots.json
+++ b/robots.json
@@ -195,6 +195,13 @@
         "operator": "[img2dataset](https://github.com/rom1504/img2dataset)",
         "respect": "Unclear at this time."
     },
+    "imgproxy": {
+        "frequency": "No information.",
+        "function": "Not documented or explained on operator's site.",
+        "operator": "[imgproxy](https://imgproxy.net)",
+        "respect": "Unclear at this time.",
+        "description": "AI-powered image processing."
+    },
     "ISSCyberRiskCrawler": {
         "description": "Used to train machine learning based models to quantify cyber risk.",
         "frequency": "No information.",
@@ -328,4 +335,4 @@
         "frequency": "No information.",
         "description": "Retrieves data used for You.com web search engine and LLMs."
     }
-}
\ No newline at end of file
+}

From 4a764bba18f10167cb5f7107c8721e5dc208100f Mon Sep 17 00:00:00 2001
From: "ai.robots.txt" <ai.robots.txt@users.noreply.github.com>
Date: Thu, 10 Apr 2025 19:22:34 +0000
Subject: [PATCH 59/63] Merge pull request #102 from ai-robots-txt/imgproxy-bot

chore(robots.json): adds imgproxy crawler
---
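
Reviewer note (not part of the commit): the .htaccess and nginx lines
below differ only by the added "imgproxy" alternative. A minimal sketch
of how such an alternation could be built from a list of names, assuming
Python 3.7+ and re.escape; build_user_agent_pattern and the sample
user-agent string are hypothetical and may not match what code/robots.py
actually does.

```python
import re


def build_user_agent_pattern(names):
    """Join escaped bot names into one regex alternation group."""
    return "(" + "|".join(re.escape(name) for name in names) + ")"


# Hypothetical subset of the names that appear in the generated configs.
names = ["img2dataset", "imgproxy", "Brightbot 1.0", "iaskspider/2.0", "Perplexity-User"]
pattern = build_user_agent_pattern(names)

# Case-insensitive search, analogous to [NC] in .htaccess and ~* in nginx.
blocked = re.search(pattern, "Mozilla/5.0 (compatible; imgproxy/3.0)", re.IGNORECASE)
```

re.escape on Python 3.7+ escapes spaces, dots and hyphens but not "/",
which is consistent with the escaping visible in the lines below.
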
 .htaccess                | 2 +-
 nginx-block-ai-bots.conf | 2 +-
 robots.txt               | 1 +
 table-of-bot-metrics.md  | 1 +
 4 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/.htaccess b/.htaccess
index 27a7e11..c0e5fbb 100644
--- a/.htaccess
+++ b/.htaccess
@@ -1,3 +1,3 @@
 RewriteEngine On
-RewriteCond %{HTTP_USER_AGENT} (AI2Bot|Ai2Bot\-Dolma|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Brightbot\ 1\.0|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|cohere\-training\-data\-crawler|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|FriendlyCrawler|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta\-ExternalAgent|Meta\-ExternalFetcher|OAI\-SearchBot|omgili|omgilibot|PanguBot|Perplexity\-User|PerplexityBot|PetalBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot) [NC]
+RewriteCond %{HTTP_USER_AGENT} (AI2Bot|Ai2Bot\-Dolma|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Brightbot\ 1\.0|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|cohere\-training\-data\-crawler|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|FriendlyCrawler|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|imgproxy|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta\-ExternalAgent|Meta\-ExternalFetcher|OAI\-SearchBot|omgili|omgilibot|PanguBot|Perplexity\-User|PerplexityBot|PetalBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot) [NC]
 RewriteRule !^/?robots\.txt$ - [F,L]
diff --git a/nginx-block-ai-bots.conf b/nginx-block-ai-bots.conf
index 0577bd9..a6bbfa2 100644
--- a/nginx-block-ai-bots.conf
+++ b/nginx-block-ai-bots.conf
@@ -1,3 +1,3 @@
-if ($http_user_agent ~* "(AI2Bot|Ai2Bot\-Dolma|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Brightbot\ 1\.0|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|cohere\-training\-data\-crawler|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|FriendlyCrawler|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta\-ExternalAgent|Meta\-ExternalFetcher|OAI\-SearchBot|omgili|omgilibot|PanguBot|Perplexity\-User|PerplexityBot|PetalBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot)") {
+if ($http_user_agent ~* "(AI2Bot|Ai2Bot\-Dolma|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Brightbot\ 1\.0|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|cohere\-training\-data\-crawler|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|FriendlyCrawler|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|imgproxy|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta\-ExternalAgent|Meta\-ExternalFetcher|OAI\-SearchBot|omgili|omgilibot|PanguBot|Perplexity\-User|PerplexityBot|PetalBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot)") {
     return 403;
 }
\ No newline at end of file
diff --git a/robots.txt b/robots.txt
index c531918..de25a56 100644
--- a/robots.txt
+++ b/robots.txt
@@ -26,6 +26,7 @@ User-agent: iaskspider/2.0
 User-agent: ICC-Crawler
 User-agent: ImagesiftBot
 User-agent: img2dataset
+User-agent: imgproxy
 User-agent: ISSCyberRiskCrawler
 User-agent: Kangaroo Bot
 User-agent: Meta-ExternalAgent
diff --git a/table-of-bot-metrics.md b/table-of-bot-metrics.md
index d92df34..b3e51fe 100644
--- a/table-of-bot-metrics.md
+++ b/table-of-bot-metrics.md
@@ -28,6 +28,7 @@
 | ICC\-Crawler | [NICT](https://nict.go.jp) | Yes | Scrapes data to train and support AI technologies. | No information. | Use the collected data for artificial intelligence technologies; provide data to third parties, including commercial companies; those companies can use the data for their own business. |
 | ImagesiftBot | [ImageSift](https://imagesift.com) | [Yes](https://imagesift.com/about) | ImageSiftBot is a web crawler that scrapes the internet for publicly available images to support our suite of web intelligence products | No information. | Once images and text are downloaded from a webpage, ImageSift analyzes this data from the page and stores the information in an index. Our web intelligence products use this index to enable search and retrieval of similar images. |
 | img2dataset | [img2dataset](https://github.com/rom1504/img2dataset) | Unclear at this time. | Scrapes images for use in LLMs. | At the discretion of img2dataset users. | Downloads large sets of images into datasets for LLM training or other purposes. |
+| imgproxy | [imgproxy](https://imgproxy.net) | Unclear at this time. | Not documented or explained on operator's site. | No information. | AI-powered image processing. |
 | ISSCyberRiskCrawler | [ISS-Corporate](https://iss-cyber.com) | No | Scrapes data to train machine learning models. | No information. | Used to train machine learning based models to quantify cyber risk. |
 | Kangaroo Bot | Unclear at this time. | Unclear at this time. | AI Data Scrapers | Unclear at this time. | Kangaroo Bot is used by the company Kangaroo LLM to download data to train AI models tailored to Australian language and culture. More info can be found at https://darkvisitors.com/agents/agents/kangaroo-bot |
 | Meta\-ExternalAgent | [Meta](https://developers.facebook.com/docs/sharing/webmasters/web-crawlers) | Yes. | Used to train models and improve products. | No information. | "The Meta-ExternalAgent crawler crawls the web for use cases such as training AI models or improving products by indexing content directly." |

From 305188b2e78855d4e7193f29a3e7205f96fa86f6 Mon Sep 17 00:00:00 2001
From: dark-visitors <dark-visitors@users.noreply.github.com>
Date: Fri, 11 Apr 2025 00:55:52 +0000
Subject: [PATCH 60/63] Update from Dark Visitors

---
 robots.json | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/robots.json b/robots.json
index 4c9f7d7..eff38ac 100644
--- a/robots.json
+++ b/robots.json
@@ -335,4 +335,4 @@
         "frequency": "No information.",
         "description": "Retrieves data used for You.com web search engine and LLMs."
     }
-}
+}
\ No newline at end of file

From d9f882a9b21170754c4b37ff1bbc237171876684 Mon Sep 17 00:00:00 2001
From: Joshua Sheard <mail@jsheard.com>
Date: Mon, 14 Apr 2025 15:46:01 +0100
Subject: [PATCH 61/63] Include "AI Agents" from Dark Visitors

---
 code/robots.py | 1 +
 1 file changed, 1 insertion(+)

diff --git a/code/robots.py b/code/robots.py
index 86ea413..8a06b55 100755
--- a/code/robots.py
+++ b/code/robots.py
@@ -30,6 +30,7 @@ def updated_robots_json(soup):
     """Update AI scraper information with data from darkvisitors."""
     existing_content = load_robots_json()
     to_include = [
+        "AI Agents",
         "AI Assistants",
         "AI Data Scrapers",
         "AI Search Crawlers",

From a96e33098975edf1c05c8d9684b36b9fa31f7ef2 Mon Sep 17 00:00:00 2001
From: dark-visitors <dark-visitors@users.noreply.github.com>
Date: Tue, 15 Apr 2025 00:57:01 +0000
Subject: [PATCH 62/63] Update from Dark Visitors

---
 robots.json | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/robots.json b/robots.json
index eff38ac..8bba6b2 100644
--- a/robots.json
+++ b/robots.json
@@ -230,6 +230,13 @@
         "frequency": "Unclear at this time.",
         "description": "Meta-ExternalFetcher is dispatched by Meta AI products in response to user prompts, when they need to fetch an individual links. More info can be found at https://darkvisitors.com/agents/agents/meta-externalfetcher"
     },
+    "NovaAct": {
+        "operator": "Unclear at this time.",
+        "respect": "Unclear at this time.",
+        "function": "AI Agents",
+        "frequency": "Unclear at this time.",
+        "description": "Nova Act is an AI agent created by Amazon that can use a web browser. It can intelligently navigate and interact with websites to complete multi-step tasks on behalf of a human user. More info can be found at https://darkvisitors.com/agents/agents/novaact"
+    },
     "OAI-SearchBot": {
         "operator": "[OpenAI](https://openai.com)",
         "respect": "[Yes](https://platform.openai.com/docs/bots)",
@@ -251,6 +258,13 @@
         "operator": "[Webz.io](https://webz.io/)",
         "respect": "[Yes](https://web.archive.org/web/20170704003301/http://omgili.com/Crawler.html)"
     },
+    "Operator": {
+        "operator": "Unclear at this time.",
+        "respect": "Unclear at this time.",
+        "function": "AI Agents",
+        "frequency": "Unclear at this time.",
+        "description": "Operator is an AI agent created by OpenAI that can use a web browser. It can intelligently navigate and interact with websites to complete multi-step tasks on behalf of a human user. More info can be found at https://darkvisitors.com/agents/agents/operator"
+    },
     "PanguBot": {
         "operator": "the Chinese company Huawei",
         "respect": "Unclear at this time.",

From e0cdb278fbd243f554579fe5050850f124b286a8 Mon Sep 17 00:00:00 2001
From: "ai.robots.txt" <ai.robots.txt@users.noreply.github.com>
Date: Wed, 16 Apr 2025 00:57:11 +0000
Subject: [PATCH 63/63] Update from Dark Visitors

---
 .htaccess                | 2 +-
 nginx-block-ai-bots.conf | 2 +-
 robots.txt               | 2 ++
 table-of-bot-metrics.md  | 2 ++
 4 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/.htaccess b/.htaccess
index c0e5fbb..d10e796 100644
--- a/.htaccess
+++ b/.htaccess
@@ -1,3 +1,3 @@
 RewriteEngine On
-RewriteCond %{HTTP_USER_AGENT} (AI2Bot|Ai2Bot\-Dolma|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Brightbot\ 1\.0|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|cohere\-training\-data\-crawler|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|FriendlyCrawler|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|imgproxy|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta\-ExternalAgent|Meta\-ExternalFetcher|OAI\-SearchBot|omgili|omgilibot|PanguBot|Perplexity\-User|PerplexityBot|PetalBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot) [NC]
+RewriteCond %{HTTP_USER_AGENT} (AI2Bot|Ai2Bot\-Dolma|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Brightbot\ 1\.0|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|cohere\-training\-data\-crawler|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|FriendlyCrawler|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|imgproxy|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta\-ExternalAgent|Meta\-ExternalFetcher|NovaAct|OAI\-SearchBot|omgili|omgilibot|Operator|PanguBot|Perplexity\-User|PerplexityBot|PetalBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot) [NC]
 RewriteRule !^/?robots\.txt$ - [F,L]
diff --git a/nginx-block-ai-bots.conf b/nginx-block-ai-bots.conf
index a6bbfa2..c37cef5 100644
--- a/nginx-block-ai-bots.conf
+++ b/nginx-block-ai-bots.conf
@@ -1,3 +1,3 @@
-if ($http_user_agent ~* "(AI2Bot|Ai2Bot\-Dolma|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Brightbot\ 1\.0|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|cohere\-training\-data\-crawler|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|FriendlyCrawler|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|imgproxy|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta\-ExternalAgent|Meta\-ExternalFetcher|OAI\-SearchBot|omgili|omgilibot|PanguBot|Perplexity\-User|PerplexityBot|PetalBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot)") {
+if ($http_user_agent ~* "(AI2Bot|Ai2Bot\-Dolma|Amazonbot|anthropic\-ai|Applebot|Applebot\-Extended|Brightbot\ 1\.0|Bytespider|CCBot|ChatGPT\-User|Claude\-Web|ClaudeBot|cohere\-ai|cohere\-training\-data\-crawler|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|FriendlyCrawler|Google\-Extended|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|imgproxy|ISSCyberRiskCrawler|Kangaroo\ Bot|Meta\-ExternalAgent|Meta\-ExternalFetcher|NovaAct|OAI\-SearchBot|omgili|omgilibot|Operator|PanguBot|Perplexity\-User|PerplexityBot|PetalBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Sidetrade\ indexer\ bot|Timpibot|VelenPublicWebCrawler|Webzio\-Extended|YouBot)") {
     return 403;
 }
\ No newline at end of file
diff --git a/robots.txt b/robots.txt
index de25a56..1e3aa80 100644
--- a/robots.txt
+++ b/robots.txt
@@ -31,9 +31,11 @@ User-agent: ISSCyberRiskCrawler
 User-agent: Kangaroo Bot
 User-agent: Meta-ExternalAgent
 User-agent: Meta-ExternalFetcher
+User-agent: NovaAct
 User-agent: OAI-SearchBot
 User-agent: omgili
 User-agent: omgilibot
+User-agent: Operator
 User-agent: PanguBot
 User-agent: Perplexity-User
 User-agent: PerplexityBot
diff --git a/table-of-bot-metrics.md b/table-of-bot-metrics.md
index b3e51fe..4c87b41 100644
--- a/table-of-bot-metrics.md
+++ b/table-of-bot-metrics.md
@@ -33,9 +33,11 @@
 | Kangaroo Bot | Unclear at this time. | Unclear at this time. | AI Data Scrapers | Unclear at this time. | Kangaroo Bot is used by the company Kangaroo LLM to download data to train AI models tailored to Australian language and culture. More info can be found at https://darkvisitors.com/agents/agents/kangaroo-bot |
 | Meta\-ExternalAgent | [Meta](https://developers.facebook.com/docs/sharing/webmasters/web-crawlers) | Yes. | Used to train models and improve products. | No information. | "The Meta-ExternalAgent crawler crawls the web for use cases such as training AI models or improving products by indexing content directly." |
 | Meta\-ExternalFetcher | Unclear at this time. | Unclear at this time. | AI Assistants | Unclear at this time. | Meta-ExternalFetcher is dispatched by Meta AI products in response to user prompts, when they need to fetch an individual links. More info can be found at https://darkvisitors.com/agents/agents/meta-externalfetcher |
+| NovaAct | Unclear at this time. | Unclear at this time. | AI Agents | Unclear at this time. | Nova Act is an AI agent created by Amazon that can use a web browser. It can intelligently navigate and interact with websites to complete multi-step tasks on behalf of a human user. More info can be found at https://darkvisitors.com/agents/agents/novaact |
 | OAI\-SearchBot | [OpenAI](https://openai.com) | [Yes](https://platform.openai.com/docs/bots) | Search result generation. | No information. | Crawls sites to surface as results in SearchGPT. |
 | omgili | [Webz.io](https://webz.io/) | [Yes](https://webz.io/blog/web-data/what-is-the-omgili-bot-and-why-is-it-crawling-your-website/) | Data is sold. | No information. | Crawls sites for APIs used by Hootsuite, Sprinklr, NetBase, and other companies. Data also sold for research purposes or LLM training. |
 | omgilibot | [Webz.io](https://webz.io/) | [Yes](https://web.archive.org/web/20170704003301/http://omgili.com/Crawler.html) | Data is sold. | No information. | Legacy user agent initially used for Omgili search engine. Unknown if still used, `omgili` agent still used by Webz.io. |
+| Operator | Unclear at this time. | Unclear at this time. | AI Agents | Unclear at this time. | Operator is an AI agent created by OpenAI that can use a web browser. It can intelligently navigate and interact with websites to complete multi-step tasks on behalf of a human user. More info can be found at https://darkvisitors.com/agents/agents/operator |
 | PanguBot | the Chinese company Huawei | Unclear at this time. | AI Data Scrapers | Unclear at this time. | PanguBot is a web crawler operated by the Chinese company Huawei. It's used to download training data for its multimodal LLM (Large Language Model) called PanGu. More info can be found at https://darkvisitors.com/agents/agents/pangubot |
 | Perplexity\-User | [Perplexity](https://www.perplexity.ai/) | [No](https://docs.perplexity.ai/guides/bots) | Used to answer queries at the request of users. | Only when prompted by a user. | Visit web pages to help provide an accurate answer and include links to the page in Perplexity response. |
 | PerplexityBot | [Perplexity](https://www.perplexity.ai/) | [Yes](https://docs.perplexity.ai/guides/bots) | Search result generation. | No information. | Crawls sites to surface as results in Perplexity. |