Websites blocking OpenAI

OpenAI has a couple of autonomous agents that fetch data from the internet, and I've noticed a lot of websites are updating their robots.txt to block them:

GPTBot

A web crawler that crawls all publicly accessible websites/pages.

It's used to build a dataset of the entire internet to improve OpenAI models.

ChatGPT-User

An agent that scrapes a defined, small subset of websites/pages.

It's used by ChatGPT Plugins to fetch data from specific sources.

Out of the top 1000 most visited websites...

229 of 1000 explicitly block GPTBot:

nbcnews.com

amazon.it

francetvinfo.fr

vulture.com

vogue.com

chicagotribune.com

nytimes.com

welt.de

kompas.com

metro.co.uk

france24.com

usmagazine.com

superuser.com

nextdoor.com

archiveofourown.org

mercadolibre.com.ar

popularmechanics.com

pinterest.fr

scribd.com

countryliving.com

washingtonpost.com

pinterest.com

verywellfit.com

foursquare.com

123rf.com

bizjournals.com

esquire.com

livemint.com

dictionary.com

corriere.it

wattpad.com

seriouseats.com

nj.com

thrillist.com

stern.de

spiegel.de

pbs.org

byrdie.com

opentable.com

disney.com

newyorker.com

housebeautiful.com

mashable.com

bhphotovideo.com

vimeo.com

radiotimes.com

thoughtco.com

forestparkgolfcourse.com

dnb.com

marthastewart.com

livescience.com

alamyimages.fr

womansday.com

seventeen.com

lifewire.com

amazon.in

distractify.com

shutterstock.com

news18.com

medicalnewstoday.com

thespruce.com

vanityfair.com

francebleu.fr

architecturaldigest.com

cosmopolitan.com

20minutes.fr

slideshare.net

quora.com

eatingwell.com

investopedia.com

tabelog.com

cbsnews.com

hbr.org

amarujala.com

womenshealthmag.com

hellomagazine.com

popsugar.com

rateyourmusic.com

amazon.de

verywellmind.com

theglobeandmail.com

gq.com

city-data.com

justanswer.com

trulia.com

apartments.com

techradar.com

indiamart.com

eater.com

health.com

thespruceeats.com

pcmag.com

insider.com

businessinsider.com

aarp.org

weather.com

spanishdict.com

pinterest.co.uk

amazon.ca

homes.com

mercadolibre.com.mx

thesaurus.com

allrecipes.com

elle.com

variety.com

instyle.com

pinterest.de

cntraveler.com

brides.com

stylecaster.com

livehindustan.com

glamour.com

bhg.com

amazon.co.uk

axios.com

ingles.com

androidauthority.com

harpersbazaar.com

stackexchange.com

pinterest.es

amazon.co.jp

webmd.com

slate.com

cookpad.com

jagran.com

today.com

pcgamer.com

mercadolivre.com.br

geeksforgeeks.org

wired.com

deadline.com

teenvogue.com

usatoday.com

stackoverflow.com

sheknows.com

sueddeutsche.de

amazon.com

jagranjosh.com

glassdoor.com

abc.net.au

teacherspayteachers.com

kbb.com

medicinenet.com

cnn.com

caranddriver.com

allure.com

bonappetit.com

theguardian.com

delish.com

glamourmagazine.co.uk

coursera.org

theatlantic.com

dotesports.com

bloomberg.com

ew.com

reuters.com

inverse.com

npr.org

scientificamerican.com

self.com

amazon.com.au

tumblr.com

picclick.com

digitalspy.com

verywellhealth.com

travelandleisure.com

hindustantimes.com

latimes.com

ndtv.com

faz.net

gamesradar.com

rollingstone.com

bustle.com

cnbc.com

theathletic.com

amazon.fr

realsimple.com

healthline.com

amazon.es

polygon.com

autotrader.com

oprahdaily.com

thehindu.com

alamy.com

coursehero.com

healthgrades.com

alamy.es

goodhousekeeping.com

seattletimes.com

timesofindia.com

fortune.com

uol.com.br

tripsavvy.com

airbnb.com

edmunds.com

townandcountrymag.com

mercadolibre.com.co

nymag.com

people.com

amazon.com.br

marketwatch.com

masterclass.com

cinemablend.com

eonline.com

hollywoodreporter.com

vox.com

prevention.com

thepioneerwoman.com

vocabulary.com

thebalancemoney.com

nationalgeographic.com

billboard.com

prnewswire.com

menshealth.com

ikea.com

repubblica.it

medium.com

foodandwine.com

southernliving.com

espn.com

bbcgoodfood.com

theverge.com

tomsguide.com

lonelyplanet.com

economictimes.com

verywellfamily.com

liveabout.com

actu.fr

wikihow.com

51 of 1000 explicitly block ChatGPT-User:

francetvinfo.fr

vogue.com

myfitnesspal.com

france24.com

archiveofourown.org

mercadolibre.com.ar

washingtonpost.com

foursquare.com

stern.de

opentable.com

newyorker.com

vimeo.com

news18.com

vanityfair.com

francebleu.fr

architecturaldigest.com

20minutes.fr

yourdictionary.com

gq.com

weather.com

mercadolibre.com.mx

variety.com

cntraveler.com

stylecaster.com

glamour.com

pikiran-rakyat.com

slate.com

mercadolivre.com.br

geeksforgeeks.org

wired.com

deadline.com

teenvogue.com

sheknows.com

allure.com

bonappetit.com

glamourmagazine.co.uk

theatlantic.com

reuters.com

npr.org

self.com

rollingstone.com

alltrails.com

fortune.com

detik.com

mercadolibre.com.co

liputan6.com

hollywoodreporter.com

billboard.com

fool.com

etymonline.com

actu.fr

Q&A

What is a robots.txt?

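A robots.txt is a plain-text file, served from the root of a website, that tells web crawlers and scraping bots which pages or files they should or shouldn't request from the site. It's essentially a set of rules, grouped by user agent, that tells bots (search engine crawlers included) which parts of a website they are allowed to access. Compliance is voluntary: the file states the site's wishes, but it doesn't technically prevent a bot from making a request.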
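
To make the rules concrete, here is a minimal sketch of how a robots.txt that blocks GPTBot is interpreted, using Python's standard urllib.robotparser module. The sample file contents and the example.com URL are made up for illustration; they are not taken from any site in the lists above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks GPTBot everywhere, allows everyone else.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/some/page"  # placeholder URL
    for agent in ("GPTBot", "ChatGPT-User"):
        print(agent, is_allowed(SAMPLE_ROBOTS_TXT, agent, page))
```

With this sample file, GPTBot is refused for every path, while ChatGPT-User falls through to the wildcard group and is allowed.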

How does this work?

I have a script that runs once per day that scrapes the robots.txt files of the 1000 most popular websites.
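
The script itself isn't published here, so the following is only a rough sketch of what such a check could look like, not the actual implementation. It assumes that "explicitly block" means the robots.txt has a group that names the agent and contains "Disallow: /"; the function names, the request User-Agent string, and the timeout are all hypothetical.

```python
import urllib.request
from typing import Optional

AGENTS = ("GPTBot", "ChatGPT-User")


def fetch_robots_txt(domain: str, timeout: float = 10.0) -> Optional[str]:
    """Download https://<domain>/robots.txt, or return None on any failure."""
    request = urllib.request.Request(
        f"https://{domain}/robots.txt",
        headers={"User-Agent": "robots-survey-sketch"},  # hypothetical name
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except OSError:  # covers URLError, HTTPError and timeouts
        return None


def explicitly_blocks(robots_txt: str, agent: str) -> bool:
    """Assumed definition: a group naming `agent` contains 'Disallow: /'."""
    group_agents = []  # user agents named by the current group
    in_rules = False   # True once the current group has listed a rule
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                group_agents = []
                in_rules = False
            group_agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and agent.lower() in group_agents:
                return True
    return False


def survey(domains):
    """Map each agent to the domains whose robots.txt explicitly blocks it."""
    blocked = {agent: [] for agent in AGENTS}
    for domain in domains:
        robots_txt = fetch_robots_txt(domain)
        if robots_txt is None:
            continue  # a failed fetch is skipped, not counted as a block
        for agent in AGENTS:
            if explicitly_blocks(robots_txt, agent):
                blocked[agent].append(domain)
    return blocked
```

Under this definition, a site that only has "User-agent: *" with "Disallow: /" is not counted, because it doesn't name the agent explicitly.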

How accurate is this data?

The scraping script fetches the robots.txt for most sites without any issue. However, Cloudflare sometimes blocks it temporarily, so it might miss a few websites per scrape.

Sources?

Made by Wayde