News Sites Are Blocking Internet Archive Over AI Scraping Fears

June 8, 2026

In this era of the Internet especially, the Internet Archive’s Wayback Machine has become increasingly essential as more and more web content vanishes into the ether or is surreptitiously altered to hide salient details. More recently a new worry has cropped up in the form of data scraping for so-called AI systems, or at least that is one of the excuses being offered for blocking the Wayback Machine’s web crawlers, with [Andrew Deck] and [Hanaa’ Tameez] of [Nieman Lab] detailing the impact and the reasons provided.

Some news outlets like The Baltimore Banner insist that they’re only blocking the Wayback Machine crawlers because they worry that LLM chatbots would otherwise ‘improperly cite’ the source of content, while outlets like The Atlantic have put a blanket anti-scraping policy in place. Meanwhile, news outlets are generally happy to let paid commercial news archiving services like ProQuest and LexisNexis index their content, which suggests a potential financial incentive.

Whatever the reasons, the direct effect is that as content is modified or vanishes during, for example, a system migration, buy-out or bankruptcy, researchers who would otherwise use the Wayback Machine are pretty much forced to rely on paid offerings by ProQuest and kin, which lack the pure archiving focus and free access to information. It will also leave big holes in what the Wayback Machine can cover in its archives, with news coverage especially becoming very spotty.

Incidentally there’s an ongoing petition over at SaveTheArchive.com which people can sign.

29 thoughts on “News Sites Are Blocking Internet Archive Over AI Scraping Fears”

zeiche says:

June 8, 2026 at 1:04 pm

they’re blocking AI to prevent scraping but firing producers to replace them with AI. make it make sense.

1. syc4p3cM says:
  
  June 8, 2026 at 1:22 pm
  
  The “AI fears” thing is an alibi. That makes it make sense.
  
2. D says:
  
  June 8, 2026 at 1:35 pm
  
  Correction, they are blocking scrapers that do not pay them, and allowing scrapers that do pay them.
  Doesn’t matter who owns the scraper.
  
  All this means is they are selling their articles to AI companies and don’t want them getting around the bill by getting the same data elsewhere.
  
3. Jeff Wright says:
  
  June 8, 2026 at 6:12 pm
  
  They don’t want you to remember false reporting.
  
  Repeat after me:
  
  Information wants to be free
  
  1. Somehuman says:
    
    June 9, 2026 at 6:59 am
    
    Exactly!
    You win the interweb today.
    
    In truth, AI could probably create more accurate (and unbiased) articles than the reporters.
    
  2. abjq says:
    
    June 9, 2026 at 7:38 am
    
    Or else, they want to be able to rewrite history without some pesky archive people can check against (for free).
    
  3. Jon Mayo says:
    
    June 9, 2026 at 10:31 am
    
    why isn’t there a ministry-of-truth.com ? seems like a big oversight in the Internet.
    
    1. Somehuman says:
      
      June 9, 2026 at 6:46 pm
      
      I’m sure some government in Europe has/is trying to build this.
      They only want the people to know their “truth”…
      
4. Aknup says:
  
  June 11, 2026 at 3:06 am
  
  It’s not about AI-period, it’s about ‘the AI competition’.
  They all want to have the trillions they see shimmering on the horizon and they will drown each other and anybody that comes near for it.
  
  What is odd though is that they all want us to use AI agents and have them do everything for us on the web, and yet they also constantly come with the ‘we need to check if you are a bot’ nonsense, so they block bots but they want to replace us with bots at the same time..
  They need to get some AI to figure out how that is supposed to work..
  
Anonymous says:

June 8, 2026 at 3:38 pm

Makes me think that every country should have government funded web archives that are illegal to block.
The Internet Archive is a bastion of preservation but they shouldn’t have to go it alone. Especially with the recent spate of companies telling lies and then taking down the archives that prove it.

1. M says:
  
  June 8, 2026 at 5:32 pm
  
  I’m fuzzy on the details, but I do believe Britain and possibly the USA have government libraries with a copy of every physical work someone has sought registered(?) copyright on. Seems like something that could be expanded upon.
  
  1. Shannon says:
    
    June 9, 2026 at 9:09 am
    
    In the UK that’s the British Library, it keeps a copy of “everything published and distributed” in the UK by “legal deposit” law.
    
2. Dude says:
  
  June 9, 2026 at 7:14 am
  
  The only reason a government needs a publicly accessible and free internet archive is to turn it into the Ministry of Truth.
  
  Even in relatively functional democracies, political parties are relying on the short memory of their voters about who to blame over bad policy and who actually promised what. They don’t want a centralized archive detailing their tweet from ten years ago where they said the exact opposite of what they’re claiming today.
  
Christian says:

June 8, 2026 at 5:46 pm

Perhaps the archive needs to honor a publication delay? Unless I misunderstand what the value loss to AI is?

1. Zangar the Pangarian says:
  
  June 9, 2026 at 6:20 am
  
  I am assuming they honor robots.txt and/or headers. I was going to suggest that the news sites adopt a 2-part cycle. Articles get published with robots denied. After X amount of time robots flip to allowed. Then after Y time they can feel free to delete if they want.
  
  Yes, I know. This relies on the crawlers being respectful. But what alternative is there really that doesn’t? A constant and forever-doomed to lose cat and mouse game with the content maker blocking IPs and the crawler changing networks?
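
The two-part cycle described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the durations standing in for X and Y are made up, the function names are hypothetical, and it assumes crawlers honor the `noindex` and `noarchive` values of the `X-Robots-Tag` response header, which not every crawler does.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical values for the "X" and "Y" periods in the comment above.
EMBARGO = timedelta(days=30)       # X: robots denied after publication
OPEN_WINDOW = timedelta(days=365)  # Y: robots allowed before deletion is OK


def robots_header(published_at, now, embargo=EMBARGO):
    """Return the X-Robots-Tag value to serve with an article right now."""
    if now - published_at < embargo:
        # Still embargoed: ask crawlers not to index or archive the page.
        return "noindex, noarchive"
    # Embargo over: no restrictions for crawlers.
    return "all"


def may_delete(published_at, now, embargo=EMBARGO, open_window=OPEN_WINDOW):
    """True once the article has been open to crawlers for the full window."""
    return now - published_at >= embargo + open_window


published = datetime(2026, 6, 1, tzinfo=timezone.utc)
one_week_later = published + timedelta(days=7)
two_months_later = published + timedelta(days=60)
```

Here `robots_header(published, one_week_later)` yields the restrictive value and `robots_header(published, two_months_later)` the permissive one. As the comment says, none of this binds a crawler that chooses to ignore it.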
  
JohnU says:

June 9, 2026 at 12:58 am

Per Louis Rossmann’s latest video, dodgy companies are blocking the Internet Archive / robots so they can retcon their technical specs, datasheets & guarantee pages to defraud customers who bought defective products.

Hackaday even got mentioned in the Battleborn lawsuit against Will Prowse; I’ve not seen that mentioned here.

1. Elliot Williams says:
  
  June 9, 2026 at 11:21 pm
  
  Did we? Citation, please!
  
Jan-Willem says:

June 9, 2026 at 5:28 am

While I very much dislike the paywall internet we have today, I understand why there is one. I would like to see the websites archived, but the archive doesn’t have to be instantly accessible. News is worth scraping because it is news. After a week, a month, a year, its value has changed, but it is still interesting for people trawling the archive. So maybe they can simply allow archiving of the website, but put an embargo on the content. Even if it is several years, it’s better than nothing.

As for the paywall, I would like to pay a flat fee to access a couple dozen articles a month from a variety of sources. Currently, it’s either €1 per article and an account for each website, or a subscription for each outlet. I dislike how we are currently using archiving sites to share articles, as it feels just as scummy as putting a paywall on it in the first place.

NIK282000 says:

June 9, 2026 at 6:21 am

News sites are blocking Internet Archive because they are afraid of being caught modifying articles.

1. kolotxoz says:
  
  June 9, 2026 at 8:25 am
  
  This is the only right answer. Look at what happened with Microsoft and its Office 2019 page: they changed the text of the page to say that on-premise Office (not cloud or 365) will stop working this year, but before the change the page said that you could use it forever.
  
  Corporations want to rewrite history, just like in 1984.
  
2. HaHa says:
  
  June 9, 2026 at 9:55 am
  
  None of the ‘news’ sites cited are credible.
  The Atlantic? They’ve got to be kidding.
  
  A good first step to regaining some credibility would be to allow the archive to scrape them.
  
  Alternatively the archive could just publish a browser add-on.
  Their supporters could allow the archive access to their web cache.
  
  Only downside, the archive becomes the world’s biggest porn collection.
  Even bigger than the Vatican’s.
  
  1. Jonathan says:
    
    June 9, 2026 at 1:49 pm
    
    Don’t be daft, the Vatican’s been collecting porn for centuries :)
    
    1. David says:
      
      June 9, 2026 at 7:32 pm
      
      The total amount of pre-photography porn created was probably surpassed by photographic porn within a few years of the birth of color photography. The total amount of non-digital porn ever created was probably surpassed by digitally-photographed porn by the year 2000. The total amount of AI-slop-porn will surpass the total amount of non-AI-generated porn soon if it hasn’t done so already.
      
3. Ostracus says:
  
  June 9, 2026 at 11:13 am
  
  Fortunately there are tools one can self-host to capture web pages, so that’s harder to do. To paraphrase a famous saying, with a thousand eyes, hiding things will have a shallow effect.
  
Zangar the Pangarian says:

June 9, 2026 at 6:24 am

I think it’s time to extend the robots.txt and robot html header standards. We need a way to separate reasons to crawl.

For example:
search engines
archives
AI training

I want to be able to tell the bots you are allowed if your purpose is X but not if it’s Y.

Yes, I know, you can get some of that by specifying the bot’s user agent. But that means you have to constantly keep up to date on what bots are out there, what they do and what their user agents are.

And yes, I also know this relies on the bot respecting it. But it’s always at best going to be a cat and mouse game of blocking IPs and the bot switching networks when it comes to bots that do not respect robots.txt/metadata.
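
As a rough sketch of the idea, a site could keep one per-purpose policy and generate ordinary per-user-agent robots.txt rules from it. Everything named here is an assumption made for illustration: the bot names are invented, the purpose labels are not part of any robots.txt standard, and the registry mapping bots to purposes is exactly the list someone would still have to maintain. Python's standard `urllib.robotparser` is used only to check the result the way a well-behaved crawler would.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical registry: crawler user-agent token -> declared purpose.
# These bot names are invented for illustration, not real crawlers.
BOT_PURPOSES = {
    "ExampleSearchBot": "search",
    "ExampleArchiveBot": "archive",
    "ExampleTrainingBot": "ai-training",
}


def build_robots_txt(policy, bot_purposes=BOT_PURPOSES):
    """Render a conventional robots.txt from a per-purpose allow/deny policy."""
    lines = []
    for agent, purpose in sorted(bot_purposes.items()):
        lines.append(f"User-agent: {agent}")
        # Purposes missing from the policy are denied.
        lines.append("Allow: /" if policy.get(purpose, False) else "Disallow: /")
        lines.append("")
    # Bots that are not in the registry are denied by default.
    lines += ["User-agent: *", "Disallow: /"]
    return "\n".join(lines) + "\n"


def is_allowed(robots_txt, user_agent, url):
    """Check a URL the way a well-behaved crawler would."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Allow search engines and archives, deny AI training.
policy = {"search": True, "archive": True, "ai-training": False}
robots_txt = build_robots_txt(policy)
```

With this policy, `is_allowed(robots_txt, "ExampleArchiveBot", "https://news.example/story")` is true while the same call for `"ExampleTrainingBot"` is false. The check runs on the crawler's side, so it only constrains bots that choose to perform it.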

Ray says:

June 9, 2026 at 9:32 am

“government funded web archives that are illegal to block.”

… and who watches the watchers?
Would Russia do this? Would Iran? Would any of the many corrupt governments in Africa?

Information has value, but the value varies with the noise of falsehoods.
Giving governments control of civilian web archives just invites a rewrite of history.

As AI already has a large role in education, does anyone really want youth learning faux history?

1. HaHa says:
  
  June 9, 2026 at 10:05 am
  
  Every nation teaches kids (up through undergrad) a constantly revised comic book version of history.
  
  Having feet in multiple nations forces you to see this.
  
  Your nation is no exception, if that thought jumped forward, that’s your conditioning.
  It might share a comic book (at least pages) with neighbors, looking at you EU.
  
  Historians/activists/commies are conscious of this, put forward alternative comics (e.g. ‘Peoples History of the USA’).
  Which they teach to gullible undergrads, completing the cycle.
  
  1. Aknup says:
    
    June 11, 2026 at 3:12 am
    
    Still though, due to the various interest groups you have some variety, but if a government has full control there is only one story, and thus less questioning and less realization that there might be lies and obfuscations.
    There are already things in our west where the BS is unified, and very few people indeed question it – or even notice it to start questioning it, even amongst the cynical.
    
Michael O says:

June 11, 2026 at 2:58 pm

This is really dumb because AI by itself can already get around these methods of blocking scrapers.
When I program, I often point my AI agent at API documentation by giving it the URL. On many sites it trips over bot-blocking techniques and seamlessly bypasses them with a few extra steps. Normally it’s as simple as curl plus a spoofed user agent, but even with more complex techniques I have yet to see one that manages to block it. Now, my agent could be doing this because it sees that I am present and human, so it is acting on my behalf and is me in spirit, and the AI companies may have their scrapers respecting those settings, or they may not be using their own AI tech with the scrapers, but I somehow doubt that. Considering they have no problem ingesting copyrighted work and academic journals that normally charge, bypassing this wouldn’t seem to violate whatever ethical guidelines they are operating under.

So the digital archive is effectively being punished for respecting the robots.txt, etc while the AI bots continue to work.


Hackaday
