Mirror of https://github.com/ai-robots-txt/ai.robots.txt.git, synced 2025-04-04 11:03:59 +00:00
Adding clarification about performance and code comment
This commit is contained in:
parent 189e75bbfd
commit b455af66e7

2 changed files with 5 additions and 2 deletions
@@ -18,8 +18,9 @@ The first one tells search engine and AI crawlers which parts of your website sh
The second one tells your own webserver to return an error page when one of the listed AI crawlers tries to request a page from your website. A `.htaccess` file does not work on every webserver, but it does work on most common, inexpensive shared hosting providers. The majority of AI crawlers set a "User Agent" string in every request they send, by which they are identifiable; this string is what the filter matches against. Instead of simply hoping the crawler pledges to respect our intention, this solution actively sends back an unhelpful response (an error or an empty page). Note that this solution isn't bulletproof either, as anyone can fake the User Agent they send.
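For illustration, the rules in such a `.htaccess` file boil down to a few mod_rewrite directives like the ones below; the crawler names shown here are placeholders, and the exact pattern and flags emitted by the generator may differ:

```apache
# Block any request whose User Agent contains one of the listed names
# (case-insensitive) by answering 403 Forbidden.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^.*(GPTBot|CCBot).*$ [NC]
RewriteRule .* - [F]
```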
We suggest adding both files, as some crawlers may respect `robots.txt` while not having an identifiable User Agent; conversely, other crawlers may not respect `robots.txt` but do provide an identifiable User Agent by which we can filter them out.
Note that, as stated in the [httpd documentation](https://httpd.apache.org/docs/current/howto/htaccess.html), more performant methods than an `.htaccess` file exist. Nevertheless, most shared hosting providers only allow `.htaccess` configuration.
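If you do control the main server configuration, the same rules can live there instead, which avoids the per-request `.htaccess` lookups the httpd documentation warns about. A minimal sketch, assuming a site rooted at the hypothetical `/var/www/example`:

```apache
# In the main server configuration (e.g., httpd.conf or the site's vhost file):
<Directory "/var/www/example">
    # Disabling .htaccess lookups is part of what makes this faster.
    AllowOverride None
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} ^.*(GPTBot|CCBot).*$ [NC]
    RewriteRule .* - [F]
</Directory>
```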
## Contributing
@@ -133,7 +133,9 @@ def json_to_table(robots_json):
def json_to_htaccess(robot_json):
    # Creates a .htaccess filter file. It uses a regular expression to filter out
    # User agents that contain any of the blocked values.
    htaccess = "RewriteEngine On\n"
    htaccess += "RewriteCond %{HTTP_USER_AGENT} ^.*("

    robots = map(lambda el: el.replace(" ", "\\ "), robot_json.keys())
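The hunk ends mid-expression, right after the opening of the `RewriteCond` pattern. As context, here is a minimal, self-contained sketch of how a generator along these lines could finish the pattern and block matching requests; the joining with `|`, the closing `).*$ [NC]`, the final `RewriteRule`, and the example crawler names are assumptions, not the repository's exact code:

```python
def json_to_htaccess_sketch(robot_json):
    # Hypothetical completion: build the full .htaccess contents from the
    # crawler names, escaping spaces so multi-word names survive in the regex.
    htaccess = "RewriteEngine On\n"
    htaccess += "RewriteCond %{HTTP_USER_AGENT} ^.*("
    robots = map(lambda el: el.replace(" ", "\\ "), robot_json.keys())
    htaccess += "|".join(robots)
    htaccess += ").*$ [NC]\n"
    # Answer 403 Forbidden to any request whose User Agent matched above.
    htaccess += "RewriteRule .* - [F]\n"
    return htaccess


# Example crawler names, for illustration only:
print(json_to_htaccess_sketch({"GPTBot": {}, "Example Bot": {}}))
```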