Data Hoarding - Archives

On this page you will find links to data archives from various countries. These archives contain data that was gathered and saved for the public good.

Science

Academic Torrents

Description:
Making over 127.15TB of research data available, this site provides a distributed system for sharing enormous datasets for researchers, by researchers. The result is a scalable, secure, and fault-tolerant repository for data, with blazing fast download speeds.

Links:

Science

Anna's Archive

Description:
Anna's Archive is described as the largest truly open library in human history. The site mirrors Sci-Hub and LibGen, and also scrapes and open-sources Z-Lib, DuXiu, and more. It currently hosts over 42 million books and 98 million papers, preserved forever. All of its code and data are completely open source.

Links:

World

Archive.today

Description:
Archive.today is a time capsule for web pages! It takes a 'snapshot' of a webpage that will always be online even if the original page disappears. It saves a text and a graphical copy of the page for better accuracy and provides a short and reliable link to an unalterable record of any web page.

Links:

Science

arXiv

Description:
arXiv is a free distribution service and an open-access archive for nearly 2.4 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. arXiv is a community of volunteer authors, readers, moderators, advisory board members, supporting members, donors, and third-party collaborators that are supported by the staff at Cornell University.

Links:

Technology, World

AWS Data Exchange

Description:
AWS Data Exchange makes it easy to find datasets made publicly available through AWS services. Browse available data and learn how to register your own datasets.

Links:
  • aws.amazon.com - List of all Data Exchange applications.
  • Common Crawl - A corpus of web crawl data composed of over 50 billion web pages.
  • Earth on AWS - Registry of Earth related datasets.
  • archives.gov - National Archives Catalog on the AWS Registry of Open Data.

Government, Health, Climate, Science

CAFE

Description:
The Convene-Accelerate-Foster-Expand (CAFE) site is an open collection designed to support and enhance global research initiatives focused on understanding and mitigating the health impacts of climate change. It is hosted by Harvard University and Boston University and contains hundreds of datasets, mostly from US government websites.

Links:

Government, Law

Caselaw Access Project

Description:
The Caselaw Access Project (CAP) scanned the entirety of the Harvard Law School Library's physical collection of American case law and made it available online in a consistent, machine-readable format. The Library Innovation Lab (LIL) maintained the case.law website as the primary access point for the data. CAP includes all official, book-published state and federal United States case law through 2020, that is, every volume or case designated as an official report of decisions by a court within the United States.

Links:
  • case.law - List of law volumes per state.

Government, Climate

Climate Mirror Project

Description:
The Climate Mirror Project aims to mirror and safely archive US government websites and datasets related to climate, climate change, and global warming. It provides mirrors of official NOAA and other government websites.

Links:

Government

Data Liberation Project

Description:
The Data Liberation Project is an initiative to identify, obtain, reformat, clean, document, publish, and disseminate US government datasets of public interest.

Links:

Government

DataLumos

Description:
DataLumos is an ICPSR archive for valuable government data resources. ICPSR has a long commitment to safekeeping and disseminating US government and other social science data. DataLumos accepts deposits of public data resources from the community and recommendations of public data resources that ICPSR itself might add to DataLumos. The site is hosted by the University of Michigan.

Links:

Science

Dryad

Description:
Dryad is an open data publishing platform and a community committed to the open availability and routine re-use of all research data. Their multi-stakeholder community of academic and research institutions, research funders, scholarly societies and publishers is committed to leading in best practices for open data sharing and reuse.

Links:

Government

End of Term Web Archive

Description:
The End of Term Web Archive (EOT) captures and saves U.S. Government websites at the end of presidential administrations. The EOT has thus far preserved websites from administration changes in 2008, 2012, 2016, and 2020. It contains federal government websites (.gov, .mil, etc.) in the Legislative, Executive, or Judicial branches of the government.

Links:

Technology

Files dot Dog

Description:
This site contains a large collection of Microsoft Developer Network (MSDN) files, along with random other files.

Links:

Technology

Games Database

Description:
Games Database is one of the biggest sources of manuals, videos, music and artwork. The site provides over 32k videos, 8k music files, 14k manuals, 5k game adverts, and 822 TV commercials for 126 systems.

Links:

Technology

Hugging Face

Description:
Hugging Face is the platform where the machine learning community collaborates on models, datasets, and applications. It contains the largest collection of open source AI models and focuses on machine learning tasks.

Links:

Technology

Ibiblio

Description:
Ibiblio (then called SunSITE) began mirroring open source software in 1992, and was one of only three such repositories available on the internet. Now, almost 30 years later, mirroring and open source software have evolved.

Links:

Government

ICPSR

Description:
ICPSR provides research data and resources on topics such as social media, politics, economics, social sciences, government, GIS, and more. ICPSR is part of the Institute for Social Research at the University of Michigan.

Links:

Science

INSDC

Description:
The International Nucleotide Sequence Database Collaboration (INSDC) archives nucleotide sequence data, from raw to assembled and annotated sequences, from around the world.

Links:

Technology, World

Internet Archive

Description:
The Internet Archive is an American non-profit organization founded in 1996 by Brewster Kahle that runs a digital library website, archive.org. It provides free access to collections of digitized media including websites, software applications, music, audiovisual, and print materials.

Links:

Science, Health, World

IPUMS

Description:
IPUMS provides census and survey data from around the world integrated across time and space. IPUMS integration and documentation make it easy to study change, conduct comparative research, merge information across data types, and analyze individuals within family and community contexts. Data and services are available free of charge.

Links:

Technology, Science, World

Kaggle

Description:
Kaggle is one of the largest collections of datasets, mostly focusing on statistics, science, world affairs and technology. It contains 430K high-quality public datasets, covering everything from avocado prices to video game sales.

Links:

Technology, Science, World

Kiwix

Description:
Three billion people have little or no access to the internet. This can be because of costs, lack of infrastructure, or outright censorship. Kiwix provides offline versions of popular websites like Wikipedia, Wikibooks and Project Gutenberg.

Links:

Science

LibreTexts

Description:
LibreTexts is an adaptable, user-friendly open education resource platform that educators use for creating, customizing, and sharing accessible, interactive textbooks, adaptive homework, and ancillary materials. LibreTexts collaborates with individuals and organizations to champion open education initiatives, support institutional publishing programs, drive curriculum development projects, and more. The LibreTexts Commons hosts curated Open Educational Resources from all 16 libraries in the LibreVerse in one convenient location.

Links:

World

Mirror Service

Description:
The UK Mirror Service provides a collection of mirrors of FTP, web and rsync sites of interest to academic users. The service is provided by the University of Kent's School of Computing.

Links:

Technology

My Abandonware

Description:
On My Abandonware you can download old video games released from 1965 to 2012 for free. You can play Pacman, Arkanoid, Tetris, Galaxian, Alter Ego, Blackthorne, Civilization, Sim City, Prince of Persia, Xenon 2, King's Quest, Ultima, Kyrandia, The Incredible Machine, Another World, Test Drive, Flashback, Lemmings and more. For each game, the site offers all related information, including publication year, publisher, developer, size of the game, language, a review of the game, instructions to play, the game manual and, of course, the game archive that you can download for free.

Links:

Science, Climate

OpenEI

Description:
The Open Energy Data Initiative (OEDI) enables research, collaboration, and transparency by providing open access to energy data and information. The OpenEI Data Lake is a centralized repository of datasets aggregated from the U.S. Department of Energy’s Programs, Offices, and National Laboratories. It provides links to over 4.19 PB of data.

Links:

World

Project Gutenberg

Description:
Project Gutenberg is a library of over 75,000 free eBooks. Everything from Project Gutenberg is gratis, libre, and completely without cost to readers. Michael Hart, founder of Project Gutenberg, invented eBooks in 1971 and his memory continues to inspire the creation of eBooks and related content today. The Project Gutenberg Literary Archive Foundation (PGLAF) is the non-profit corporation that oversees operation of the project.

Links:

Government, Science, Climate

Public Environmental Data Project

Description:
The Public Environmental Data Project is committed to preserving and providing public access to federal environmental data. They are a volunteer coalition of several environmental, justice, and policy organizations, researchers across several universities, archivists, and students who rely on federal datasets and tools to support critical research, advocacy, policy, and litigation work. Several datasets are available on their site.

Links:

Science

Sci-Hub

Description:
Sci-Hub started as a tool for providing quick access to articles from scientific journals, which are the main medium for communicating scientific knowledge today. Sci-Hub has since grown into a database of over 88 million research articles and books, freely accessible for anyone to read and download.

Links:

Technology

Software Heritage

Description:
The long-term goal of the Software Heritage initiative is to collect all publicly available software in source code form together with its development history, replicate it massively to ensure its preservation, and share it with everyone who needs it. The Software Heritage archive grows over time as new source code is crawled from software projects and development forges.

Links:

Science, Climate, Government

Source COOP

Description:
Source Cooperative is a data publishing utility that allows trusted organizations and individuals to share data using standard HTTP methods. It contains large data collections and mirrors of various sites, mostly centered around science, government and climate.

Links:

Technology

TextFiles

Description:
TEXTFILES.COM has been online for nearly 25 years providing text files, focusing mostly on the years 1980-1995.

Links:

Technology

The Old Computer

Description:
The Old Computer describes itself as home to the largest collection of ROMs and emulators anywhere on the web, with over 500,000 ROMs and emulators for every major computer, console, arcade machine, pinball table and mobile device. It also offers box scans, manuals, magazines and a user community of more than 179,000 members.

Links:

Technology

The Unix Heritage

Description:
The Unix Heritage Society's aims include the preservation and maintenance of historical and non-mainstream UNIX systems; the further development of existing UNIX systems; and the continual fostering of the Unix community spirit. They host historical Unix distributions and packages available for download.

Links:

World

Wikipedia

Description:
Wikipedia is a free-content online encyclopedia written and maintained by a community of volunteers, known as Wikipedians, through open collaboration and the wiki software MediaWiki. It is the largest and most-read reference work in history. Wikipedia is hosted by the Wikimedia Foundation, a non-profit organization that also hosts a range of other projects.

Links:

Technology

WinWorld

Description:
WinWorld is an online museum created in 2003 dedicated to the preservation and sharing of vintage, abandoned, and pre-release software. It offers information, media and downloads for a wide variety of computers and operating systems. Get classic operating systems, applications, games and betas for every platform from PC to Mac to Amiga, right here from the software library on WinWorld.

Links: