diff --git a/FAQ.md b/FAQ.md
new file mode 100644
index 0000000..0bb1ac9
--- /dev/null
+++ b/FAQ.md
@@ -0,0 +1,23 @@
+# Frequently asked questions
+
+## How do we know AI companies/bots respect `robots.txt`?
+
+The short answer is that we don't. `robots.txt` is a well-established standard but compliance is voluntary. There is no enforcement mechanism.
+
+## Can we block crawlers based on user agent strings?
+
+Yes, provided the crawlers identify themselves and your application/hosting supports doing so.
+
+## Why should we block these crawlers?
+
+They're extractive, confer no benefit to the creators of data they're ingesting and also have wide-ranging negative externalities.
+
+**[How Tech Giants Cut Corners to Harvest Data for A.I.](https://www.nytimes.com/2024/04/06/technology/tech-giants-harvest-data-artificial-intelligence.html?unlocked_article_code=1.ik0.Ofja.L21c1wyW-0xj&ugrp=m)**
+> OpenAI, Google and Meta ignored corporate policies, altered their own rules and discussed skirting copyright law as they sought online information to train their newest artificial intelligence systems.
+
+**[How AI copyright lawsuits could make the whole industry go extinct](https://www.theverge.com/24062159/ai-copyright-fair-use-lawsuits-new-york-times-openai-chatgpt-decoder-podcast)**
+> The New York Times' lawsuit against OpenAI is part of a broader, industry-shaking copyright challenge that could define the future of AI.
+
+## How can I contribute?
+
+Open a pull request. It will be reviewed and acted upon appropriately. **We really appreciate contributions** — this is a community effort.