From 74b15028394f2617d7bc805417b85c38554bb7d8 Mon Sep 17 00:00:00 2001
From: nisbet-hubbard <87453615+nisbet-hubbard@users.noreply.github.com>
Date: Sat, 3 Aug 2024 14:04:58 +0800
Subject: [PATCH 1/3] Update FAQ.md

---
 FAQ.md | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/FAQ.md b/FAQ.md
index 0bb1ac9..c0ec16f 100644
--- a/FAQ.md
+++ b/FAQ.md
@@ -8,6 +8,21 @@ The short answer is that we don't. `robots.txt` is a well-established standard b
 
 Yes, provided the crawlers identify themselves and your application/hosting supports doing so.
 
+## What can we do if a bot doesn't respect `robots.txt`?
+
+That depends on your stack.
+
+- Nginx
+  - [Blocking Bots with Nginx](https://rknight.me/blog/blocking-bots-with-nginx/) by Robb Knight
+  - [Blocking AI web crawlers](https://underlap.org/blocking-ai-web-crawlers) by Glyn Normington
+- Apache httpd
+  - [Blockin' bots.](https://ethanmarcotte.com/wrote/blockin-bots/) by Ethan Marcotte
+  - [Blocking Bots With 11ty And Apache](https://flamedfury.com/posts/blocking-bots-with-11ty-and-apache/) by fLaMEd fury
+> [!TIP]
+> The snippets in these articles all use `mod_rewrite`, which [should be considered a last resort](https://httpd.apache.org/docs/trunk/rewrite/avoid.html). A good alternative that's less resource-intensive is `mod_setenvif`; see [httpd docs](https://httpd.apache.org/docs/trunk/rewrite/access.html#blocking-of-robots) for an example.
+- Netlify
+  - [Blockin' bots on Netlify](https://www.jeremiak.com/blog/block-bots-netlify-edge-functions/) by Jeremia Kimelman
+
 ## Why should we block these crawlers?
 
 They're extractive, confer no benefit to the creators of data they're ingesting and also have wide-ranging negative externalities.
From b24e5cb3bb4e799f1856c22dc77439ddf22e9518 Mon Sep 17 00:00:00 2001
From: nisbet-hubbard <87453615+nisbet-hubbard@users.noreply.github.com>
Date: Sat, 3 Aug 2024 14:12:50 +0800
Subject: [PATCH 2/3] Update FAQ.md

---
 FAQ.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/FAQ.md b/FAQ.md
index c0ec16f..06fb2ef 100644
--- a/FAQ.md
+++ b/FAQ.md
@@ -22,6 +22,8 @@ That depends on your stack.
 > The snippets in these articles all use `mod_rewrite`, which [should be considered a last resort](https://httpd.apache.org/docs/trunk/rewrite/avoid.html). A good alternative that's less resource-intensive is `mod_setenvif`; see [httpd docs](https://httpd.apache.org/docs/trunk/rewrite/access.html#blocking-of-robots) for an example.
 - Netlify
   - [Blockin' bots on Netlify](https://www.jeremiak.com/blog/block-bots-netlify-edge-functions/) by Jeremia Kimelman
+- Cloudflare
+  - [I’m blocking AI crawlers](https://roelant.net/en/2024/im-blocking-ai-crawlers-part-2/) by Roelant
 
 ## Why should we block these crawlers?
 

From 2b56c72bacce5a4285e083d60fd1d4a20c033036 Mon Sep 17 00:00:00 2001
From: nisbet-hubbard <87453615+nisbet-hubbard@users.noreply.github.com>
Date: Sat, 3 Aug 2024 14:27:25 +0800
Subject: [PATCH 3/3] Update FAQ.md

---
 FAQ.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/FAQ.md b/FAQ.md
index 06fb2ef..c26a936 100644
--- a/FAQ.md
+++ b/FAQ.md
@@ -19,7 +19,7 @@ That depends on your stack.
   - [Blockin' bots.](https://ethanmarcotte.com/wrote/blockin-bots/) by Ethan Marcotte
   - [Blocking Bots With 11ty And Apache](https://flamedfury.com/posts/blocking-bots-with-11ty-and-apache/) by fLaMEd fury
 > [!TIP]
-> The snippets in these articles all use `mod_rewrite`, which [should be considered a last resort](https://httpd.apache.org/docs/trunk/rewrite/avoid.html). A good alternative that's less resource-intensive is `mod_setenvif`; see [httpd docs](https://httpd.apache.org/docs/trunk/rewrite/access.html#blocking-of-robots) for an example.
+> The snippets in these articles all use `mod_rewrite`, which [should be considered a last resort](https://httpd.apache.org/docs/trunk/rewrite/avoid.html). A good alternative that's less resource-intensive is `mod_setenvif`; see [httpd docs](https://httpd.apache.org/docs/trunk/rewrite/access.html#blocking-of-robots) for an example. You should also consider [setting this up in `httpd.conf` instead of `.htaccess`](https://httpd.apache.org/docs/trunk/howto/htaccess.html#when) if it's available to you.
 - Netlify
   - [Blockin' bots on Netlify](https://www.jeremiak.com/blog/block-bots-netlify-edge-functions/) by Jeremia Kimelman
 - Cloudflare
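
For the `mod_setenvif` alternative mentioned in the tip, a minimal sketch might look like the following. It assumes Apache 2.4 with `mod_setenvif` and `mod_authz_core` loaded, and it is meant for the server config (`httpd.conf`) rather than `.htaccess`. The user-agent names and the `ai_bot` variable name are illustrative placeholders, not something taken from this patch.

```apache
# Sketch only: flag requests whose User-Agent matches an illustrative pattern.
# "GPTBot" and "CCBot" are example names; substitute the crawlers you want to block.
SetEnvIfNoCase User-Agent "(GPTBot|CCBot)" ai_bot

# Refuse flagged requests site-wide (Apache 2.4 authorization syntax).
<Location "/">
    <RequireAll>
        Require all granted
        Require not env ai_bot
    </RequireAll>
</Location>
```

Requests that set `ai_bot` should be answered with a 403, while everything else is served as usual. This avoids loading the rewrite engine for what is only a header match.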