OpenAI recently announced the launch of GPTBot, a web crawler that will collect text data from the public web to improve the accuracy and safety of its large language models.
This announcement has raised concerns that GPTBot is hoovering up data from websites without the consent of website owners. Not everyone is comfortable with their data being used this way: news organisations such as the Guardian, the New York Times and CNN have been open about opting out. Unfortunately, it is not possible to retroactively remove content already scraped from a site from ChatGPT's training data, but there are now some steps you can take to manage this going forward.
OpenAI has stated that GPTBot will not collect personally identifiable information (PII) or text that violates its policies. But that’s a vague disclaimer and you would be right to be wary.
And there are other reasons to opt out too, not least to protect the intellectual property of your information products and services, or otherwise to manage how the information you share is used. Or maybe you just don’t want to be part of the murky future plans of a commercial platform until the picture is clearer.
Alongside the GPTBot announcement, OpenAI also shared details of how to opt out of this data collection. The main way is by updating your robots.txt file.
Robots.txt?
So, what exactly is a robots.txt file?
This is a file placed on the server that hosts your website to control which web crawlers (such as search engine robots) may or may not access your site.
It usually sits at the root of your site. So, for example, if your library website lives at mylibrary.ac.uk, then the robots.txt file should be at mylibrary.ac.uk/robots.txt.
Please note that if you place this file in a subdirectory of your site rather than the root directory, search engines will likely ignore it.
This file contains ‘rules’ that tell user agents (web-crawling software) whether they may or may not crawl your site.
This is still the main way to block this kind of access to your site, though new approaches are being developed in the wake of the new wave of AI tools trawling for content.
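For example, a simple robots.txt might tell every crawler to stay out of one (hypothetical) directory while leaving the rest of the site open:

User-agent: *
Disallow: /private/

Here the asterisk means the rule applies to all user agents, and the Disallow line names the path they should not crawl.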
How to opt out of GPTBot
GPTBot is identified by the user agent token GPTBot. You can add a ‘disallow’ rule to your robots.txt file to block this specific user agent. For example, add the following lines to your website’s robots.txt file:
User-agent: GPTBot
Disallow: /
This will prevent GPTBot from crawling any pages on your website.
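Once the rule is live, you can sanity-check it with a short script. Here is a minimal sketch using Python’s built-in urllib.robotparser module and the example domain from earlier (swap in your own site):

from urllib.robotparser import RobotFileParser

# Point the parser at your live robots.txt (example domain from above).
rp = RobotFileParser()
rp.set_url("https://mylibrary.ac.uk/robots.txt")
rp.read()

# False means a well-behaved crawler identifying itself as GPTBot
# would be refused access to the whole site.
print(rp.can_fetch("GPTBot", "https://mylibrary.ac.uk/"))

Bear in mind that this only tells you what your rules say; like any robots.txt directive, it relies on the crawler choosing to respect them.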
If you want to block specific sections of your website, then you can specify the directory in your robots.txt rule:
User-agent: GPTBot
Allow: /this-directory/
Disallow: /that-directory/
Note that Disallow rules match the start of a URL path, so you can block an individual page by giving its full path. In practice, though, it is often simpler to block the relevant directory, for example a staff section under: /about-us/
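The same urllib.robotparser approach can check directory-level rules before you rely on them. This sketch reuses the placeholder directory names from the example above:

from urllib.robotparser import RobotFileParser

# The directory-level rules from the example above (placeholder paths).
rules = [
    "User-agent: GPTBot",
    "Allow: /this-directory/",
    "Disallow: /that-directory/",
]

rp = RobotFileParser()
rp.parse(rules)

# GPTBot may fetch the allowed directory...
print(rp.can_fetch("GPTBot", "https://mylibrary.ac.uk/this-directory/page"))  # True
# ...but not the disallowed one.
print(rp.can_fetch("GPTBot", "https://mylibrary.ac.uk/that-directory/page"))  # False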
You can see an example of a robots.txt file blocking AI agents here:
https://github.com/healsdata/ai-training-opt-out/blob/main/robots.txt
IP-based blocking
Another approach is to block GPTBot’s IP address(es). OpenAI has published a list of IP address ranges used by GPTBot (though these may change, so it is important to check back regularly).
https://openai.com/gptbot-ranges.txt
You can block these IP addresses in your firewall or web server configuration.
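As an illustration of what this could look like, the sketch below fetches the published list and checks whether a given client IP falls inside one of the ranges. It assumes the file contains one CIDR block per line, which may not match the current published format, so check the file yourself before relying on it:

import ipaddress
import urllib.request

RANGES_URL = "https://openai.com/gptbot-ranges.txt"

def load_gptbot_networks():
    # Download the published ranges; assumes one CIDR block per line.
    with urllib.request.urlopen(RANGES_URL) as response:
        lines = response.read().decode().splitlines()
    return [ipaddress.ip_network(line.strip(), strict=False)
            for line in lines if line.strip()]

def is_gptbot_ip(client_ip, networks):
    # True if the client address sits inside any published GPTBot range.
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in networks)

networks = load_gptbot_networks()
# "203.0.113.7" is a documentation-only address, not a real GPTBot IP.
if is_gptbot_ip("203.0.113.7", networks):
    print("Request came from a published GPTBot range: block or log it.")

In a real deployment you would apply this check in your firewall or web server configuration rather than in application code.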
You can check out the full documentation at: https://platform.openai.com/docs/gptbot
These solutions require both access to the server and some experience editing these files, so if you have any questions about implementing this, please feel free to get in touch with us.