Why the Internet Archive is More Relevant Than Ever (npr.org)
- Reference: 0176798141
- News link: https://tech.slashdot.org/story/25/03/23/1742225/why-the-internet-archive-is-more-relevant-than-ever
- Source link: https://www.npr.org/2025/03/23/nx-s1-5326573/internet-archive-wayback-machine-trump
NPR described the 29-year-old nonprofit Internet Archive as "more relevant than ever."
> Every day, about 100 terabytes of material are uploaded to the Internet Archive, or about a billion URLs, with the assistance of automated crawlers. Most of that ends up in the Wayback Machine, while the rest is digitized analog media — books, television, radio, academic papers — scanned and stored on servers. As one of the few large-scale archivists to back up the web, the Internet Archive finds itself in a particularly [2]unique position right now... Thousands of [U.S. government] [3]datasets were wiped — mostly at agencies focused on science and the environment — in the days following Trump's return to the White House...
>
> The Internet Archive is among the few efforts that exist to catch the stuff that [4]falls [5]through the [6]digital cracks, while also making that information accessible to the public. Six weeks into the new administration, Wayback Machine director [Mark] Graham said, the Internet Archive had cataloged some 73,000 web pages that had existed on U.S. government websites that were expunged after Trump's inauguration...
>
> According to Graham, based on the big jump in page views he's observed over the past two months, the Internet Archive is drawing many more visitors than usual to its services — journalists, researchers and other inquiring minds. Some want to consult the archive for information lost or changed in the purge, while others aim to contribute to the archival process.... "People are coming and rallying behind us," said Brewster Kahle, [the founder and current director of the Internet Archive], "by using it, by pointing at things, helping organize things, by submitting content to be archived — data sets that are under threat or have been taken down...."
>
> A behemoth of link rot repair, the Internet Archive rescues a daily average of 10,000 dead links that appear on Wikipedia pages. In total, it's fixed more than 23 million rotten links on Wikipedia alone, according to the organization.
Though it receives some money for its preservation work for libraries, museums, and other organizations, it's also funded by donations. "From the beginning, it was important for the Internet Archive to be a nonprofit, because it was working for the people," explains founder Brewster Kahle [7]on its donations page:
> Its motives had to be transparent; it had to last a long time. That's why we don't charge for access, sell user data, or run ads, even while we offer free resources to citizens everywhere. We rely on the generosity of individuals like you to pay for servers, staff, and preservation projects. If you can't imagine a future without the Internet Archive, please consider supporting our work. We promise to put your donation to good use as we continue to store over 99 petabytes of data, including 625 billion webpages, 38 million texts, and 14 million audio recordings.
Two interesting statistics from NPR's article:
"A [8]Pew Research Center study published last year found that roughly 38% of web pages on the internet that existed in 2013 were no longer accessible as of 2023."
"According to a [9]Harvard Law Review study published in 2014 , about half of all links cited in U.S. Supreme Court opinions no longer led to the original source material."
Thanks to long-time Slashdot reader [10]jtotheh for sharing the news.
[1] https://www.npr.org/2025/03/23/nx-s1-5326573/internet-archive-wayback-machine-trump
[2] https://www.npr.org/sections/shots-health-news/2025/01/31/nx-s1-5282274/trump-administration-purges-health-websites
[3] https://www.404media.co/archivists-work-to-identify-and-save-the-thousands-of-datasets-disappearing-from-data-gov/
[4] https://nsarchive.gwu.edu/briefing-book/climate-change-transparency-project-foia/2025-02-06/disappearing-data-trump
[5] https://www.npr.org/sections/shots-health-news/2025/01/21/nx-s1-5269875/trump-abortion-hhs-reproductive-rights
[6] https://www.npr.org/2025/03/19/nx-s1-5317567/federal-websites-lgbtq-diversity-erased
[7] https://archive.org/donate
[8] https://www.pewresearch.org/data-labs/2024/05/17/when-online-content-disappears/
[9] https://clp.law.harvard.edu/knowledge-hub/magazine/issues/the-evolution-of-law-libraries/pausing-the-internet/
[10] https://slashdot.org/~jtotheh
Generated content (Score:2)
Not much point or value in blindly archiving everything with so much boilerplate content generated to get advertising clicks.
Search on "how to install custom rom on android" and you'll get pages of general information in a recognizable template... same information reformatted slightly emanating from different URLs. Volumes of unfocused garbage basically.
Re: (Score:3)
> Not much point or value in blindly archiving everything with so much boilerplate content generated to get advertising clicks.
I know what you mean, but sometimes there's value in dot connecting later.
What is important of course, is that this is a bit like the Svalbard global seed vault, but for US science data. [1]https://www.seedvault.no./ [www.seedvault.no]
When a government of dodgy politicians demands to eliminate science data, it must be very powerful science data. If it was about eliminating bullshit, they'd go after flat earthers, HAARP freaks, alien conspiracies, anti-moon landing and other BS.
So if a few weeds get in the mix, it's still a
[1] https://www.seedvault.no./
Re: (Score:3)
Libertarian ideas. I get it. I'm not in complete disagreement, nor am I a card carrying Leftist, but I don't think unrestricted Libertarianism leads to a just world.
By your logic, you might as well close the armed forces and incentivize self funded mercenaries, then reward them when they achieve published goals.
So I start a private army with the mercs out of business when Trump solves the war in the Ukraine on his first day in office. We invade Greenland and take it over in about 24 hours because ... there
Re: (Score:2)
I would suggest the dead internet theory is where we are now, have been for several years. Most content is generated now.
i.e. it's all weeds now, no genuine seeds to store.
I'm not against the Internet Archive, I just think they are wasting money on an obsolete model.
Every CEO talks up his company, so no different with Brewster, imho.
They could sit pat with what they've got and it's still a valuable public service, methinks.
Re:Generated content (Score:4, Insightful)
The point is that what's important and what's not isn't known until much later. You might think it's useless information today because it doesn't help you, but it may have uses tomorrow.
It's basically like how we learn what life was like in the past not because of the records left behind, but because of the garbage that was thrown away.
Re: (Score:1)
Indeed. The justification and value of the Internet Archive comes from maybe 1% of what it stores, maybe less. In the YouTube videos they archive, it may well be less, but nobody knows what it may turn out to be before it becomes important.
Re: (Score:2)
I'm pretty sure we can do without archiving those generated pages of boilerplate on any subject. Best case scenario would be to analyze every page, compare to every other page and only save 1 copy, ignoring the countless copies.
This is like the argument that all data is equal. That Microsoft and OpenAI want to ingest the entirety of the internet. It's literally noise now... I don't see how that makes sense, anymore.
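For what it's worth, the "compare to every other page and only save 1 copy" idea can be sketched in a few lines. This is only an illustration, not anything the Internet Archive is known to run: the function names, the 3-word shingle size, and the 0.8 similarity threshold are all assumptions picked for the example, and the all-pairs comparison is quadratic, so at web scale something like MinHash with locality-sensitive hashing would be needed instead.

```python
from __future__ import annotations

import re


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-word shingles (overlapping word runs) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Very short pages become a single shingle (or nothing at all).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of intersection divided by size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedupe(pages: dict[str, str], threshold: float = 0.8) -> list[str]:
    """Keep the first page of each near-duplicate group.

    `pages` maps a URL to its extracted text. `threshold` is a hypothetical
    cut-off: pages at or above it are treated as copies of a kept page.
    """
    kept: list[tuple[str, set]] = []
    for url, text in pages.items():
        sig = shingles(text)
        if all(jaccard(sig, other) < threshold for _, other in kept):
            kept.append((url, sig))
    return [url for url, _ in kept]
```

Note that the hard part is the one the replies raise: a threshold like this decides today which copy is "the" copy, and the discarded variants are exactly what a later researcher might have wanted.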
Re: (Score:3, Insightful)
As others have pointed out, though, without knowing what questions future people want to answer, you don't know what is interesting or why.
For example an economist of the future might be very interested in the rate of content duplication, clones, and likely copyright infringement of article content.
Think about it like masons' marks. They basically were just there to do supply chain management and invoicing. Nobody thought they'd be interesting after the wall was up, so to speak, but future archeologists have used
Re: (Score:2)
We are discussing trying to save vital information that may soon be totally wiped out by the copyright parasites.
You are arguing that much of that information is redundant.
That is premature optimisation.
Decent people need archive.org to survive... (Score:5, Insightful)
...but AFAICT, the court judgement in favour of the copyright parasites has now doomed it?
Re: (Score:3)
I do not think so. There are enough countries on this planet where archiving in this form is legal. It just has to be non-profit and free access. It can be limited to non-commercial use only and that can be done via the TOU.
Re:Decent people need archive.org to survive... (Score:4, Interesting)
The problem is that the IA team won't move it outside of the US, and won't accept outside help to mirror it. They won't accept any help to develop the backend software either. Of course that's up to them, it's their archive, but users must consider their position and the likely impact on the archive's future.
Realistically the best option here will be to create an open source version of the IA code, and try to organize libraries set up around the world. Each one wouldn't have to be a full mirror. You could even decouple the storage part so that data can be stored where it is not going to run into legal issues, and be accessed seamlessly from a central web interface. There are lots of options and it really needs a team of international copyright lawyers to look at it. Then mirror IA and accept uploads, and do it all completely separate from the current IA, both for legal reasons and because they don't want to be part of it.
It's no small thing, but otherwise it's just a matter of time before we lose it all.
Re: (Score:2)
I meant that the current, US-based, instance of archive.org is doomed.
re. Backing up the contents of archive.org somewhere outside the US: I do not think the code is the problem, but the sheer amount of data that has to be exported to the backup(s), before archive.org goes titsup.
This means the Wayback Machine, the books, the texts, the audio, the video.
Is there an effort underway?
Doesn't some billionaire think this is worth doing?
Re: (Score:2)
It would need some very significant up-front investment, and probably cooperation with the IA team to at least some extent.
Nobody seems to be doing it at the moment.
Re: (Score:3)
The idiots at the Archive invited this outcome by saying "copyright is no longer a thing because Covid" which pretty much forced the publishers to sue, and to do so with a set of facts that was almost tailor made for their purposes to kill format shifting.
Re: (Score:2)
YES! It is almost as if somebody powerful hired a mole to go work at the archive and promote BAD decisions! (This is in fact one of the things assets and spies do.) Any Russians work there?
They could have simply archived materials for a....century.... before publishing them.
Re: (Score:1)
> They could have simply archived materials for a....century.... before publishing them.
With what funds? The current model makes it useful to people currently, which induces them to donate and generally support the effort.
Attempting to just archive might have let them run under the radar, or not. We really can't say. Some copyright troll organization (excuse me, I meant to say collaborative industry group) could have still spotted their crawler and gone after them with BS SLAPP type suits.
At least by being open about what they are doing and making the archive searchable and useful they
Re: (Score:2)
"*forced* the publishers to sue"
Yes, until then the copyright parasites were working to promote the creativity copyright exists for, and were encouraging archive.org and everyone using stuff within a reasonable copyright term of five years.
Their evil parasitical actions are totally archive.org's fault.
Deliberate disappearing information (Score:4, Interesting)
As we've seen in the last month or so, the current administration is hellbent on destroying any information it doesn't like for whatever reason. Climate change data used by farmers, research data used by scientists, epidemiological research used by health professionals, you name it, it's gone.
Internet Archive is the last refuge of this information before history is rewritten.
Re: (Score:2, Informative)
Once they realize that it exists, they will go after archive.org as well. The administration will misinterpret some law to give themselves power to shut it down, and SCOTUS will uphold it because it's been packed with Trump lackeys. This is the current playbook, and it appears to be working.
Re: (Score:2)
Nobody tell Musk that all the photos of his balding head are still up on archive.org, or about that article he bullied the Wall Street Journal into removing from its archive about how his wealth began with his parents' blood diamond mine (different jewel but whatever). He'll break in and fire everybody, burn their documents, and begin trying to sell their building.
Re: (Score:2)
"Oceania had always been at war with Eastasia". It's like Trump has taken 1984 as an example to leave his historical mark.
Spread the risk (Score:4, Interesting)
So where are the physical servers located?
I hope that there are mirrors spread out in various places around the globe because of the way things are moving in the US.
Same reason that I now would like to have more information offline: books as well as hard drives with scientific reports, Wikipedia and all the 'data' that I fear might go missing. It will take some time and money and I might well team up with others in the neighbourhood.
GPTs are the new search anyways (Score:2)
there is value in preserving the past intact, but that's going to be less and less practical for many use cases as we're moving closer and closer to the internet of the infinite automated monkeys (https://en.wikipedia.org/wiki/Infinite_monkey_theorem)
so I would disagree about relevant. important, sure
Re: (Score:2)
Facts are political? Trump ordered mass deletions of anything "DEI" related and it's so ham fisted that mention of the Enola Gay was caught up. [1]https://www.pbs.org/newshour/p... [pbs.org]
Not even Jackie Robinson's military service was safe. [2]https://www.espn.com/mlb/story... [espn.com]
And this is supposedly the party that loves the military?
[1] https://www.pbs.org/newshour/politics/war-heroes-military-firsts-and-the-enola-gay-are-among-26000-pentagon-images-flagged-for-removal-in-dei-purge
[2] https://www.espn.com/mlb/story/_/id/44316899/defense-department-removes-story-robinson-military-service
Re: (Score:2)
Yes.
But this story should not be about that.
ALL of archive.org is under threat, not just the bits that the Trumpists don't like.
Re: (Score:2)
I concur.
This affects everyone from the hateful woke to the demented MAGAs and all the decent people in between.
Are these journalists not capable of seeing that their story is better without the sideswipes, no matter how true they are?
I guess they learnt their craft from CNN.
Google Cached Web Pages (Score:5, Insightful)
Yes, especially since Google removed one of the most useful search features it ever had, which was the ability to view the cached page from the last time Google crawled it.
Re:Google Cached Web Pages (Score:5, Interesting)
All part of the enshittification of Google. Also remember that archive.org keeps things forever. I have started turning most links in my lectures into archive.org links, it is just too much effort checking every time whether the pages are still there.
And, yes, I donate to them.
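For anyone wanting to do the same to their own link lists, here is a minimal sketch of building Wayback Machine links. It relies on the public URL pattern `https://web.archive.org/web/<timestamp>/<original URL>`, where the timestamp is `YYYYMMDDhhmmss`. The helper name `wayback_link` is made up for this example, and the code only builds the string: it does not check that a capture actually exists. As far as I know the Wayback Machine redirects to the capture closest to the requested timestamp, but treat that as an assumption and test your links.

```python
from __future__ import annotations

from datetime import datetime
from typing import Optional

WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_link(url: str, when: Optional[datetime] = None) -> str:
    """Build a Wayback Machine link for url (hypothetical helper).

    With `when`, the link asks for the capture nearest that moment;
    without it, the link carries no timestamp at all.
    """
    if when is None:
        return f"{WAYBACK_PREFIX}/{url}"
    return f"{WAYBACK_PREFIX}/{when.strftime('%Y%m%d%H%M%S')}/{url}"
```

For example, `wayback_link("https://archive.org/donate", datetime(2025, 3, 23))` gives `https://web.archive.org/web/20250323000000/https://archive.org/donate`. Pinning a date is useful for lecture material, since it points at the version of the page you actually cited.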
Moderation needs new word: enshittification (Score:2)
So many posts even before the decline are best classified as being about enshittification. It should get added to the list.
Re: (Score:2)
> Also remember that archive.org keeps things forever.
No they don't. Dubious legal complaints frequently result in URLs being excluded from the Wayback Machine. It's extremely irritating.
Re: (Score:2)
> Yes, especially since Google removed one of the most useful search features it ever had, which was the ability to view the cached page from the last time Google crawled it.
The reason is that you can detect when the historical record has been altered. It's going to get worse when all our information is being filtered through ClippyAI:
“Every record has been destroyed or falsified, every book has been rewritten, every picture has been repainted, every statue and street and building has been