OpenAI has a couple of autonomous agents that fetch data from the internet, and I've noticed a lot of
websites are updating their robots.txt to block them:
GPTBot: A web crawler that crawls all publicly accessible websites/pages.
It's used to build a dataset of the entire internet to improve OpenAI models.
ChatGPT-User: An agent that scrapes a defined, small subset of websites/pages.
It's used by ChatGPT Plugins to fetch data from specific sources.
Out of the top 1000 most visited websites...
229 of 1000 explicitly block GPTBot:
nbcnews.com
amazon.it
francetvinfo.fr
vulture.com
vogue.com
chicagotribune.com
nytimes.com
welt.de
kompas.com
metro.co.uk
france24.com
usmagazine.com
superuser.com
nextdoor.com
archiveofourown.org
mercadolibre.com.ar
popularmechanics.com
pinterest.fr
scribd.com
countryliving.com
washingtonpost.com
pinterest.com
verywellfit.com
foursquare.com
123rf.com
bizjournals.com
esquire.com
livemint.com
dictionary.com
corriere.it
wattpad.com
seriouseats.com
nj.com
thrillist.com
stern.de
spiegel.de
pbs.org
byrdie.com
opentable.com
disney.com
newyorker.com
housebeautiful.com
mashable.com
bhphotovideo.com
vimeo.com
radiotimes.com
thoughtco.com
forestparkgolfcourse.com
dnb.com
marthastewart.com
livescience.com
alamyimages.fr
womansday.com
seventeen.com
lifewire.com
amazon.in
distractify.com
shutterstock.com
news18.com
medicalnewstoday.com
thespruce.com
vanityfair.com
francebleu.fr
architecturaldigest.com
cosmopolitan.com
20minutes.fr
slideshare.net
quora.com
eatingwell.com
investopedia.com
tabelog.com
cbsnews.com
hbr.org
amarujala.com
womenshealthmag.com
hellomagazine.com
popsugar.com
rateyourmusic.com
amazon.de
verywellmind.com
theglobeandmail.com
gq.com
city-data.com
justanswer.com
trulia.com
apartments.com
techradar.com
indiamart.com
eater.com
health.com
thespruceeats.com
pcmag.com
insider.com
businessinsider.com
aarp.org
weather.com
spanishdict.com
pinterest.co.uk
amazon.ca
homes.com
mercadolibre.com.mx
thesaurus.com
allrecipes.com
elle.com
variety.com
instyle.com
pinterest.de
cntraveler.com
brides.com
stylecaster.com
livehindustan.com
glamour.com
bhg.com
amazon.co.uk
axios.com
ingles.com
androidauthority.com
harpersbazaar.com
stackexchange.com
pinterest.es
amazon.co.jp
webmd.com
slate.com
cookpad.com
jagran.com
today.com
pcgamer.com
mercadolivre.com.br
geeksforgeeks.org
wired.com
deadline.com
teenvogue.com
usatoday.com
stackoverflow.com
sheknows.com
sueddeutsche.de
amazon.com
jagranjosh.com
glassdoor.com
abc.net.au
teacherspayteachers.com
kbb.com
medicinenet.com
cnn.com
caranddriver.com
allure.com
bonappetit.com
theguardian.com
delish.com
glamourmagazine.co.uk
coursera.org
theatlantic.com
dotesports.com
bloomberg.com
ew.com
reuters.com
inverse.com
npr.org
scientificamerican.com
self.com
amazon.com.au
tumblr.com
picclick.com
digitalspy.com
verywellhealth.com
travelandleisure.com
hindustantimes.com
latimes.com
ndtv.com
faz.net
gamesradar.com
rollingstone.com
bustle.com
cnbc.com
theathletic.com
amazon.fr
realsimple.com
healthline.com
amazon.es
polygon.com
autotrader.com
oprahdaily.com
thehindu.com
alamy.com
coursehero.com
healthgrades.com
alamy.es
goodhousekeeping.com
seattletimes.com
timesofindia.com
fortune.com
uol.com.br
tripsavvy.com
airbnb.com
edmunds.com
townandcountrymag.com
mercadolibre.com.co
nymag.com
people.com
amazon.com.br
marketwatch.com
masterclass.com
cinemablend.com
eonline.com
hollywoodreporter.com
vox.com
prevention.com
thepioneerwoman.com
vocabulary.com
thebalancemoney.com
nationalgeographic.com
billboard.com
prnewswire.com
menshealth.com
ikea.com
repubblica.it
medium.com
foodandwine.com
southernliving.com
espn.com
bbcgoodfood.com
theverge.com
tomsguide.com
lonelyplanet.com
economictimes.com
verywellfamily.com
liveabout.com
actu.fr
wikihow.com
51 of 1000 explicitly block ChatGPT-User:
francetvinfo.fr
vogue.com
myfitnesspal.com
france24.com
archiveofourown.org
mercadolibre.com.ar
washingtonpost.com
foursquare.com
stern.de
opentable.com
newyorker.com
vimeo.com
news18.com
vanityfair.com
francebleu.fr
architecturaldigest.com
20minutes.fr
yourdictionary.com
gq.com
weather.com
mercadolibre.com.mx
variety.com
cntraveler.com
stylecaster.com
glamour.com
pikiran-rakyat.com
slate.com
mercadolivre.com.br
geeksforgeeks.org
wired.com
deadline.com
teenvogue.com
sheknows.com
allure.com
bonappetit.com
glamourmagazine.co.uk
theatlantic.com
reuters.com
npr.org
self.com
rollingstone.com
alltrails.com
fortune.com
detik.com
mercadolibre.com.co
liputan6.com
hollywoodreporter.com
billboard.com
fool.com
etymonline.com
actu.fr
Q&A
What is a robots.txt?
A robots.txt is a file that websites use to tell web crawling and scraping bots which pages or files
they should or shouldn't request from the site. It's essentially a set of rules that guide search
engines and other crawlers on which parts of a website they are allowed to access and index.
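As a rough illustration of how such rules are read, here is a minimal sketch using Python's standard `urllib.robotparser`. The example file contents, the `allowed` helper, and the example.com URLs are made up for illustration; they are not taken from any site listed above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks GPTBot everywhere, allows every other bot.
EXAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(robots_txt, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


print(allowed(EXAMPLE_ROBOTS_TXT, "GPTBot", "https://example.com/article"))     # False
print(allowed(EXAMPLE_ROBOTS_TXT, "Googlebot", "https://example.com/article"))  # True
```

Note that robots.txt is only a request: a bot has to choose to honor it.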
How does this work?
I have a script that runs once per day and scrapes the robots.txt files of the 1000 most popular
websites.
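The actual script isn't shown here, so the following is only a sketch of how a daily check like this could be written in Python. Everything in it is an assumption: the function names, the `robots-survey-sketch` User-Agent string, and in particular the definition of "explicitly block", which is taken here to mean that a `User-agent` line names the bot itself (not just `*`) and the bot is not allowed to fetch the site root.

```python
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser


def fetch_robots_txt(domain, timeout=10):
    """Download https://<domain>/robots.txt and return it as text."""
    request = Request(
        f"https://{domain}/robots.txt",
        headers={"User-Agent": "robots-survey-sketch"},  # hypothetical name
    )
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def names_agent(robots_txt, agent):
    """Return True if a User-agent line names `agent` itself, not just '*'."""
    for line in robots_txt.splitlines():
        # Drop trailing comments, then split "Field: value".
        field, _, value = line.split("#", 1)[0].partition(":")
        if field.strip().lower() == "user-agent" and value.strip().lower() == agent.lower():
            return True
    return False


def explicitly_blocks(robots_txt, agent):
    """Assumed definition: the agent is named and may not fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return names_agent(robots_txt, agent) and not parser.can_fetch(agent, "/")


def blocking_domains(domains, agent, fetch=fetch_robots_txt):
    """Return the domains whose robots.txt explicitly blocks `agent`."""
    blocked = []
    for domain in domains:
        try:
            robots_txt = fetch(domain)
        except OSError:  # URLError, HTTPError and timeouts are all OSError subclasses
            continue  # skip sites that could not be fetched this run
        if explicitly_blocks(robots_txt, agent):
            blocked.append(domain)
    return blocked
```

Calling `blocking_domains(top_sites, "GPTBot")` with a list of domains would then give the kind of list shown above. The `fetch` parameter is there so the logic can be tried against canned robots.txt text without touching the network.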
How accurate is this data?
The scraping script is able to fetch the robots.txt for most sites without any issue. However,
Cloudflare sometimes blocks it temporarily, so it might miss a few websites per scrape.
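One possible way to reduce misses from temporary blocks is to retry a failed fetch a few times with a growing pause. This is a sketch of that idea, not a description of what the script does; `fetch_with_retries`, its defaults, and the `robots-survey-sketch` User-Agent string are all hypothetical.

```python
import time
from urllib.request import Request, urlopen


def fetch_robots_txt(domain, timeout=10):
    """Download https://<domain>/robots.txt and return it as text."""
    request = Request(
        f"https://{domain}/robots.txt",
        headers={"User-Agent": "robots-survey-sketch"},  # hypothetical name
    )
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_with_retries(domain, fetch=fetch_robots_txt, attempts=3, delay=5.0, sleep=time.sleep):
    """Try `fetch(domain)` up to `attempts` times; return None if every try fails."""
    for attempt in range(attempts):
        try:
            return fetch(domain)
        except OSError:  # covers URLError, HTTPError and timeouts
            if attempt < attempts - 1:
                sleep(delay * (attempt + 1))  # wait a little longer after each failure
    return None  # the site is counted as missed for this scrape
```

A site that still returns `None` after all attempts would simply be left out of that day's numbers.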
Made by Wayde