News Sites Are Blocking Internet Archive Over AI Scraping Fears


Especially in this era of the Internet, the role of the Internet Archive’s Wayback Machine has become increasingly essential as more and more web content vanishes into the ether or is surreptitiously altered to hide salient details. More recently a new worry has cropped up: the scraping of data for so-called AI systems. At least, that is one of the excuses being offered for blocking the Wayback Machine’s web crawlers, with [Andrew Deck] and [Hanaa’ Tameez] of [Nieman Lab] detailing the impact and the reasons provided.

Some news outlets like The Baltimore Banner insist that they’re only blocking the Wayback Machine crawlers because they are worried that LLM chatbots would otherwise ‘improperly cite’ the source of content, while outlets like The Atlantic have put a blanket anti-scraping policy in place. Meanwhile, news outlets are generally happy to let paid commercial news archiving services like ProQuest and LexisNexis index their content, which points to a potential financial incentive.

Whatever the reasons, the direct effect is that as content is modified or vanishes during, for example, a system migration, buy-out or bankruptcy, researchers who rely on the Wayback Machine are pretty much forced to turn to paid offerings by ProQuest and kin, which lack the pure archiving focus and free access to information. It will also leave big holes in what the Wayback Machine can cover in its archives, with news coverage especially becoming very spotty.

Incidentally there’s an ongoing petition over at SaveTheArchive.com which people can sign.

29 thoughts on “News Sites Are Blocking Internet Archive Over AI Scraping Fears”

    1. Correction, they are blocking scrapers that do not pay them, and allowing scrapers that do pay them.
      Doesn’t matter who owns the scraper.

      All this means is they are selling their articles to AI companies and don’t want them getting around the bill by getting the same data elsewhere.

    2. It’s not about AI-period, it’s about ‘the AI competition’.
      They all want to have the trillions they see shimmering on the horizon and they will drown each other and anybody that comes near for it.

      What is odd though is that they all want us to use AI agents and have them do everything for us on the web, and yet they also constantly come with the ‘we need to check if you are a bot’ nonsense, so they block bots but they want to replace us with bots at the same time..
      They need to get some AI to figure out how that is supposed to work..

  1. Makes me think that every country should have government funded web archives that are illegal to block.
    The Internet Archive is a bastion of preservation but they shouldn’t have to go it alone. Especially with the recent spate of companies telling lies and then taking down the archives that prove it.

    1. I’m fuzzy on the details, but I do believe Britain and possibly the USA have government libraries with a copy of every physical work someone has sought registered(?) copyright on. Seems like something that could be expanded upon.

    2. The only reason a government needs a publicly accessible and free internet archive is to turn it into the Ministry of Truth.

      Even in relatively functional democracies, political parties are relying on the short memory of their voters about who to blame over bad policy and who actually promised what. They don’t want a centralized archive detailing their tweet from ten years ago where they said the exact opposite of what they’re claiming today.

    1. I am assuming they honor robots.txt and/or headers. I was going to suggest that the news sites adopt a 2-part cycle. Articles get published with robots denied. After X amount of time robots flip to allowed. Then after Y time they can feel free to delete if they want.

      Yes, I know. This relies on the crawlers being respectful. But what alternative is there really that doesn’t? A constant and forever-doomed to lose cat and mouse game with the content maker blocking IPs and the crawler changing networks?
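
The two-part cycle suggested in this comment could be sketched roughly as follows. This is only an illustration, assuming the site regenerates its robots.txt rules from each article's publication date. The function names, the phase labels, and the concrete durations standing in for X and Y are all hypothetical, and the scheme only works for crawlers that honor robots.txt.

```python
from datetime import datetime, timedelta, timezone


def crawl_phase(published, now, embargo, retention):
    """Return the phase of an article in the proposed cycle.

    embargo   -- the "X" period during which robots are denied
    retention -- the "Y" period after the flip during which robots are allowed
    """
    age = now - published
    if age < embargo:
        return "embargoed"   # freshly published: robots denied
    if age < embargo + retention:
        return "open"        # robots allowed, archives can capture it
    return "deletable"       # publisher may remove the article if it wants


def robots_rule(path, phase):
    """Build the robots.txt rule line for one article path."""
    directive = "Disallow" if phase == "embargoed" else "Allow"
    return f"{directive}: {path}"


# Hypothetical values: 30-day embargo (X), one year of open access (Y).
published = datetime(2025, 1, 1, tzinfo=timezone.utc)
now = datetime(2025, 1, 10, tzinfo=timezone.utc)
phase = crawl_phase(published, now, timedelta(days=30), timedelta(days=365))
print(robots_rule("/news/example-story", phase))
```

Nothing here enforces anything on the crawler side, which is the limitation the comment itself points out.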

  2. Per Louis Rossmann’s latest video, dodgy companies are blocking the Internet Archive / robots so they can retcon their technical specs, datasheets & guarantee pages to defraud customers who bought defective products.

    Hackaday even got mentioned in the Battleborn lawsuit against Will Prowse, I’ve not seen it mentioned here.

  3. While I very much dislike the paywall internet we have today, I understand why there is one. I would like to see the websites archived, but it doesn’t have to be instantly accessible. News is worth scraping while it is news. In a week, month, or year the value has changed, but it is still interesting for people trawling the archive. So maybe they can simply allow archiving the website, but put an embargo on the content. Even if it is several years, it’s better than nothing.

    As for the paywall, I would like to pay a flat fee to access a couple dozen articles a month of a variety of sources. Currently, it’s either €1 per article and an account for each website, or a subscription for each outlet. I dislike how we are currently using archiving sites to share articles, as it feels just as scummy as putting a paywall on it in the first place.

    1. This is the only right answer, look what happened with Microsoft and an Office 2019 page: they changed the text of the page to reflect that on-premise Office (not cloud or 365) will stop working this year, but before the change the page said that you could use it forever.

      Corporations want to rewrite history, just like in 1984.

    2. None of the ‘news’ sites cited are credible.
      The Atlantic? They’ve got to be kidding.

      A good first step to regaining some credibility would be to allow the archive to scrape them.

      Alternatively the archive could just publish a browser add-on.
      Their supporters could allow the archive access to their web cache.

      Only downside, the archive becomes the world’s biggest porn collection.
      Even bigger than the Vatican’s.

        1. The total amount of pre-photography porn created was probably surpassed by photographic porn within a few years of the birth of color photography. The total amount of non-digital porn ever created was probably surpassed by digitally-photographed porn by the year 2000. The total amount of AI-slop-porn will surpass the total amount of non-AI-generated porn soon if it hasn’t done so already.

    3. Fortunately there are tools one can self-host to capture web pages so that’s harder to do. To paraphrase a famous saying, with a thousand eyes, hiding things will have a shallow effect.

  4. I think it’s time to extend the robots.txt and robot html header standards. We need a way to separate reasons to crawl.

    For example:
    search engines
    archives
    AI training

    I want to be able to tell the bots you are allowed if your purpose is X but not if it’s Y.

    Yes, I know, you can get some of that by specifying the bot’s user agent. But that means you have to constantly keep up to date on what bots are out there, what they do and what their user agents are.

    And yes, I also know this relies on the bot respecting it. But it’s always at best going to be a cat and mouse game of blocking IPs and the bot switching networks when it comes to bots that do not respect robots.txt/metadata.
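
A rough sketch of how the purpose-based idea can be approximated today with per-user-agent rules follows. It assumes a hand-maintained table mapping each crawl purpose to user-agent tokens. That table and the `build_robots_txt` helper are hypothetical, the tokens listed are illustrative and should be checked against each operator's documentation, and keeping such a table current is exactly the maintenance burden the comment describes. Python's standard `urllib.robotparser` is used to check the result.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical, hand-maintained mapping from crawl purpose to user-agent
# tokens. Robots.txt itself has no notion of "purpose".
PURPOSE_AGENTS = {
    "search": ["Googlebot", "Bingbot"],
    "archive": ["archive.org_bot"],
    "ai-training": ["GPTBot", "CCBot"],
}


def build_robots_txt(allowed_purposes):
    """Emit robots.txt text allowing only bots whose purpose is allowed."""
    lines = []
    for purpose, agents in PURPOSE_AGENTS.items():
        rule = "Allow: /" if purpose in allowed_purposes else "Disallow: /"
        for agent in agents:
            lines += [f"User-agent: {agent}", rule, ""]
    # Bots not listed in the table are denied by default.
    lines += ["User-agent: *", "Disallow: /"]
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt({"search", "archive"})

# Check the generated rules the way a well-behaved crawler would.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
url = "https://example.com/news/story"
print(parser.can_fetch("archive.org_bot", url))  # True
print(parser.can_fetch("GPTBot", url))           # False
```

A standardized purpose field would remove the need for the mapping table, but it would still depend on bots declaring their purpose honestly.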

  5. “government funded web archives that are illegal to block.”

    … and who watches the watchers?
    Would Russia do this? Would Iran? Would any of the many corrupt governments in Africa?

    Information has value, but the value varies with the noise of falsehoods.
    Giving governments control of civilian web archives just invites a rewrite of history.

    As AI already has a large role in education, does anyone really want youth learning faux history?

    1. Every nation teaches kids (up through undergrad) a constantly revised comic book version of history.

      Having feet in multiple nations forces you to see this.

      Your nation is no exception, if that thought jumped forward, that’s your conditioning.
      It might share a comic book (at least pages) with neighbors, looking at you EU.

      Historians/activists/commies are conscious of this, put forward alternative comics (e.g. ‘Peoples History of the USA’).
      Which they teach to gullible undergrads, completing the cycle.

      1. Still though, due to the various interest groups you have some variety, but if a government has full control there is only one story, and thus less questioning and less realization that there might be lies and obfuscations.
        There are already things in our west where the BS is unified, and very few people indeed question it – or even notice it to start questioning it, even amongst the cynical.

  6. This is really dumb because AI by itself can already get around these methods of blocking scrapers.
    When I program, I often point my AI agent at documentation on API usage, etc. by giving it the URL of the documentation. On many sites it will trip over bot-blocking techniques and seamlessly bypass them with a few extra steps. Normally it’s as simple as curl plus a spoofed user agent, but even with more complex techniques I have yet to see one that works to block it. Now, my agent could be doing this because it sees that I am present and am a human, so it is acting on my behalf and is me in spirit, and the AI companies may have their scrapers respecting those settings, or they aren’t using their own AI tech with the scrapers, but I somehow doubt that. Considering they have no problem ingesting copyrighted work and academic journals that normally charge, bypassing this wouldn’t seem to violate whatever ethical guidelines they are operating under.

    So the digital archive is effectively being punished for respecting robots.txt, etc., while the AI bots continue to work.
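
The weakness this comment describes comes from the User-Agent header being self-declared by the client. A minimal sketch, assuming a site that filters requests by substring-matching that header against a blocklist: the `is_blocked` function, the blocklist contents, and the browser-like string are all made up for illustration and do not describe any particular site's setup.

```python
# Illustrative blocklist of lowercase user-agent tokens (an assumption).
BLOCKED_TOKENS = ("gptbot", "ccbot", "archive.org_bot")


def is_blocked(user_agent: str) -> bool:
    """Server-side check: block if the declared agent matches the list."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_TOKENS)


# A crawler that identifies itself honestly is caught by the filter.
honest = "archive.org_bot"
# The same client with a browser-like string passes, because the server
# only ever sees what the client chooses to send.
spoofed = "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"

print(is_blocked(honest))   # True
print(is_blocked(spoofed))  # False
```

In this model only the crawlers that announce themselves truthfully are affected by the filter.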
