Data Hoarding - Archives

On this page you will find links to data archives from various countries. These archives contain data that was gathered and saved for the public good.

Filters: Science World Government Technology Climate Health Law Gaming History Books

Health

101 Cookbooks

Description:
101 Cookbooks is a food blog from California that archived thousands of healthy recipes, made available for free.

Links:

Science

Academic Torrents

Description:
Making over 127.15TB of research data available, this site provides a distributed system for sharing enormous datasets for researchers, by researchers. The result is a scalable, secure, and fault-tolerant repository for data, with blazing fast download speeds.

Links:

Books Science History

Anna's Archive

Description:
Anna's Archive is described as the largest truly open library in human history. The site mirrors Sci-Hub and LibGen, and also scrapes and open-sources Z-Lib, DuXiu, and more. It currently hosts over 42 million books and 98 million papers, with the aim of preserving them forever. All of its code and data are completely open source.

Links:

Science

Archaeology Data Service

Description:
ADS is the leading accredited repository in the UK for archaeology and historic environment data, with over 25 years of experience supporting research, learning and teaching with free, high quality and dependable digital resources.

Links:

World

Archive.today

Description:
Archive.today is a time capsule for web pages! It takes a 'snapshot' of a webpage that will always be online even if the original page disappears. It saves a text and a graphical copy of the page for better accuracy and provides a short and reliable link to an unalterable record of any web page.

Links:

Science

arXiv

Description:
arXiv is a free distribution service and an open-access archive for nearly 2.4 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. arXiv is a community of volunteer authors, readers, moderators, advisory board members, supporting members, donors, and third-party collaborators that are supported by the staff at Cornell University.

Links:

World Technology

AWS Data Exchange

Description:
AWS Data Exchange makes it easy to find datasets made publicly available through AWS services. Browse available data and learn how to register your own datasets.

Links:
  • amazon.com - List of all Data Exchange applications.
  • amazon.com - A corpus of web crawl data composed of over 50 billion web pages.
  • amazon.com - Registry of Earth related datasets.
  • archives.gov - National Archives Catalog on the AWS Registry of Open Data.

Government Health Climate Science

CAFE

Description:
The Convene-Accelerate-Foster-Expand (CAFE) site is an open collection designed to support and enhance global research initiatives focused on understanding and mitigating the health impacts of climate change. It is hosted by Harvard University and Boston University and contains hundreds of datasets, mostly from US Gov web sites.

Links:

Government Law

Caselaw Access Project

Description:
The Caselaw Access Project (CAP) scanned the entirety of the Harvard Law School Library's physical collection of American case law and made it available online in a consistent, machine-readable format. The Library Innovation Lab (LIL) maintained the case.law website as the primary access point for the data. CAP includes all official, book-published state and federal United States case law through 2020: every volume or case designated as an official report of decisions by a court within the United States.

Links:
  • case.law - List of law volumes per state.

History

Chartlann Mhileata Military Archives

Description:
The Military Archives offers a diverse range of collections documenting Ireland's military history, including pensions and historical documents.

Links:

Technology

CivitAI

Description:
CivitAI is an online platform and marketplace for generative AI content, primarily focused on AI-generated images and models.

Links:

Government Climate

Climate Mirror Project

Description:
The Climate Mirror Project is trying to mirror and safely archive US Gov websites and datasets related to climate, climate change, and global warming. It provides mirrors of official NOAA and other government web sites.

Links:

World

Common Crawl

Description:
Common Crawl maintains a free, open repository of web crawl data that can be used by anyone. They believe that everyone should have the opportunity to indulge their curiosities, analyze the world, and pursue brilliant ideas. The latest crawl contains over 2.74 billion web pages.

Links:

Government Technology

Common Vulnerabilities and Exposures (CVE)

Description:
The CVE program identifies, defines, and catalogs publicly disclosed cybersecurity vulnerabilities. There are currently over 274,000 CVE Records accessible through the program. While it depends on US Government funding, there are several alternative databases also available.

Links:

Gaming

Console Mods

Description:
This wiki contains information on game console modding and game dumping tools.

Links:

World

Cross-National Time-Series Data

Description:
CNTS provides more than 200 years of annual data from 1815 onward, including 196 demographic, political, legislative, economic and social science variables.

Links:

Government

Data Liberation Project

Description:
The Data Liberation Project is an initiative to identify, obtain, reformat, clean, document, publish, and disseminate US Gov datasets of public interest.

Links:

Government

DataLumos

Description:
DataLumos is an ICPSR archive for valuable government data resources. ICPSR has a long commitment to safekeeping and disseminating US government and other social science data. DataLumos accepts deposits of public data resources from the community and recommendations of public data resources that ICPSR itself might add to DataLumos. The site is hosted by the University of Michigan.

Links:

Government

Data Rescue Project

Description:
The Data Rescue Project is a coordinated effort among a group of data organizations focusing on rescue-related efforts and data access points for public US governmental data that are currently at risk. It provides resources, collections of datasets and news updates.

Links:

World History Books

Digital Public Library of America

Description:
The DPLA highlights millions of items from libraries, archives and museums across the United States, organized into easy-to-navigate topics through a single catalog.

Links:
  • dp.la - Home page.
  • dp.la - The banned books club.

Technology

Drivers Collection

Description:
Drivers Collection is one of the largest free web libraries of device drivers for computer hardware. It contains over 6 million drivers from various hardware vendors.

Links:

Science

Dryad

Description:
Dryad is an open data publishing platform and a community committed to the open availability and routine re-use of all research data. Their multi-stakeholder community of academic and research institutions, research funders, scholarly societies and publishers is committed to leading in best practices for open data sharing and reuse.

Links:

Government

End of Term Web Archive

Description:
The End of Term Web Archive captures and saves U.S. Government websites at the end of presidential administrations. The EOT has thus far preserved websites from administration changes in 2008, 2012, 2016, and 2020. The End of Term Web Archive contains federal government websites (.gov, .mil, etc) in the Legislative, Executive, or Judicial branches of the government.

Links:

Government

European Data

Description:
European Data is the official portal for European data, collected from governments around the EU and made available in one central place.

Links:

Technology

Files dot Dog

Description:
This site contains a large collection of Microsoft Developer Network (MSDN) files, along with random other files.

Links:

Science

Free GIS Data

Description:
This page contains a categorized list of links to over 500 sites providing freely available geographic datasets, all ready for loading into a Geographic Information System (GIS).

Links:

Gaming

Games Database

Description:
Games Database is one of the biggest sources for game manuals, videos, music and artwork. The site provides over 32k videos, 8k music files, 14k manuals, 5k game adverts, and 822 TV commercials for 126 systems.

Links:

Science

Global Biodiversity Information Facility

Description:
GBIF (the Global Biodiversity Information Facility) is an international network and data infrastructure funded by the world's governments and aimed at providing anyone, anywhere, open access to data about all types of life on Earth. It provides access to over 110,000 datasets.

Links:

Technology

Hugging Face

Description:
Hugging Face is the platform where the machine learning community collaborates on models, datasets, and applications. It contains the largest collection of open source AI models and focuses on machine learning tasks.

Links:

Technology

Ibiblio

Description:
Ibiblio (then called SunSITE) began mirroring open source software in 1992, and was one of only three such repositories available on the internet. Now, almost 30 years later, mirroring and open source software have evolved.

Links:

Government

ICPSR

Description:
ICPSR provides research data and resources on topics like social media, politics, economics, social sciences, government, GIS, and more. ICPSR is part of the Institute for Social Research at the University of Michigan.

Links:

Science

INSDC

Description:
The International Nucleotide Sequence Database Collaboration (INSDC) archives nucleotide sequence data, from raw to assembled and annotated sequences, from around the world.

Links:

Technology World

Internet Archive

Description:
The Internet Archive is an American non-profit organization founded in 1996 by Brewster Kahle that runs a digital library website, archive.org. It provides free access to collections of digitized media including websites, software applications, music, audiovisual, and print materials.

Links:

Science Health World

IPUMS

Description:
IPUMS provides census and survey data from around the world integrated across time and space. IPUMS integration and documentation make it easy to study change, conduct comparative research, merge information across data types, and analyze individuals within family and community contexts. Data and services are available free of charge.

Links:

Technology World Science

Kaggle

Description:
Kaggle is one of the largest collections of datasets, mostly focusing on statistics, science, world affairs and technology. It contains 430K high-quality public datasets, covering everything from avocado prices to video game sales.

Links:

Gaming

Keitai Game Preservation

Description:
This wiki is dedicated to cataloging games from Japanese feature phones (keitai), the pre-Android/iPhone mobile devices released in Japan (e.g. i-Mode, i-Appli, EZweb, and S!Appli games). They also provide information and support for preserving Japanese feature phone games.

Links:

Technology World Science Books

Kiwix

Description:
Three billion people have little or no access to the internet. This can be because of costs, lack of infrastructure, or outright censorship. Kiwix provides offline versions of popular web sites like Wikipedia, Wikibooks and Project Gutenberg.

Links:

Science Books

LibreTexts

Description:
LibreTexts is the adaptable, user-friendly open education resource platform that educators trust for creating, customizing, and sharing accessible, interactive textbooks, adaptive homework, and ancillary materials. They collaborate with individuals and organizations to champion open education initiatives, support institutional publishing programs, drive curriculum development projects, and more. The LibreTexts Commons hosts curated Open Educational Resources from all 16 libraries in the LibreVerse in one convenient location.

Links:

Books

MangaDex

Description:
MangaDex is one of many websites dedicated to archiving scanned manga and other Asian comic books. These sites provide thousands of titles to read for free, compiled by volunteers.

Links:

World

Mirror Service

Description:
The UK Mirror Service provides a collection of mirrors of FTP, web and rsync sites of interest to academic users. The service is provided by the University of Kent's School Of Computing.

Links:

Technology History

Museum of Obsolete Media

Description:
A unique online museum of physical media formats showcasing developments in audio, video, film and data storage, the Museum preserves the memory of those objects that held our memories, and every format listed in the Museum is represented by at least one example in the collection.

Links:

Gaming

My Abandonware

Description:
On My Abandonware you can download old video games from 1965 to 2012 for free. You can play Pacman, Arkanoid, Tetris, Galaxian, Alter Ego, Blackthorne, Civilization, Sim City, Prince of Persia, Xenon 2, King's Quest, Ultima, Kyrandia, The Incredible Machine, Another World, Test Drive, Flashback, Lemmings and more. For each game, they offer all related information, including publication year, publisher, developer, size of the game, language, a review of the game, instructions to play, the game manual and, of course, the game archive that you can download for free.

Links:

World Government

National Archives

Description:
The National Archives is a common term to designate a government funded archival institution focused on cataloging and making available historically significant works from the country in question.

Links:

World

National Film Board of Canada

Description:
In addition to being a public producer and distributor of Canadian content, the National Film Board of Canada (NFB) is the caretaker of over 7,000 productions available for free for personal use.

Links:

Government

National Security Archive

Description:
Founded in 1985 by journalists and scholars to check rising government secrecy, the National Security Archive combines a unique range of functions: investigative journalism center, research institute on international affairs, library and archive of declassified U.S. documents.

Links:

Gaming

Nexus Mods

Description:
Nexus Mods is one of several gaming mods archives, hosting over 300,000 mods for over 3,500 PC games.

Links:

Gaming

Old Games

Description:
Old-Games.com provides 10,000+ old PC games free to download, along with screenshots and descriptions.

Links:

Books

Open Library

Description:
Open Library is an initiative of the Internet Archive and provides access to thousands of books, out of print and otherwise. It provides an open, editable library catalog, building towards a web page for every book ever published.

Links:

Climate Science

OpenEI

Description:
The Open Energy Data Initiative (OEDI) enables research, collaboration, and transparency by providing open access to energy data and information. The OpenEI Data Lake is a centralized repository of datasets aggregated from the U.S. Department of Energy’s Programs, Offices, and National Laboratories. It provides links to over 4.19 PB of data.

Links:

Technology

OpenML

Description:
OpenML is an open platform for sharing datasets, algorithms, and experiments. It contains thousands of datasets and machine learning tasks running openly.

Links:

World

OSINT Ukraine

Description:
This is a public repository of tools, resources and an archive of Telegram messages related to the war in Ukraine. Note that some of the media on the site are very graphic.

Links:

World

Our World in Data

Description:
Our World in Data is a project of the Global Change Data Lab, a non-profit organization providing analysis from thousands of researchers around the world about poverty, disease, hunger, climate change, war, existential risks, and inequality.

Links:

Climate Science

PANGAEA

Description:
The information system PANGAEA is operated as an Open Access library aimed at archiving, publishing and distributing georeferenced data from earth system research. PANGAEA is open to any project, institution, or individual scientist to use or to archive and publish data.

Links:

Books

Project Gutenberg

Description:
Project Gutenberg is a library of over 75,000 free eBooks. Everything from Project Gutenberg is gratis, libre, and completely without cost to readers. Michael Hart, founder of Project Gutenberg, invented eBooks in 1971 and his memory continues to inspire the creation of eBooks and related content today. The Project Gutenberg Literary Archive Foundation (PGLAF) is the non-profit corporation that oversees operation of the project.

Links:

Climate Science Government

Public Environmental Data Project

Description:
The Public Environmental Data Project is committed to preserving and providing public access to federal environmental data. They are a volunteer coalition of several environmental, justice, and policy organizations, researchers across several universities, archivists, and students who rely on federal datasets and tools to support critical research, advocacy, policy, and litigation work. Several datasets are available on their site.

Links:

Technology History

Radio Museum

Description:
The radio museum contains a vast library of data about radio devices. It contains over 350K radio models, 2.8M pictures including 1M schematics, and 79K tubes/semiconductors.

Links:

Books Gaming

RetroMags

Description:
This site indexes and makes available for free download thousands of retro gaming magazines and strategy guides from 10 years ago and earlier.

Links:

Science Books

Sci-Hub

Description:
Sci-Hub started as a tool for providing quick access to articles from scientific journals, which are the main medium for communicating scientific knowledge today. Sci-Hub has since grown into a database of over 88 million research articles and books, freely accessible for anyone to read and download.

Links:

Technology

Sigma AI

Description:
Sigma AI provides a list of open AI-related datasets from various other sites.

Links:

Technology

Software Heritage

Description:
The long term goal of the Software Heritage initiative is to collect all publicly available software in source code form together with its development history, replicate it massively to ensure its preservation, and share it with everyone who needs it. The Software Heritage archive is growing over time as they crawl new source code from software projects and development forges.

Links:

Climate Science Government

Source COOP

Description:
Source Cooperative is a data publishing utility that allows trusted organizations and individuals to share data using standard HTTP methods. It contains large data collections and mirrors of various sites, mostly centered around science, government and climate.

Links:

Technology

TextFiles

Description:
TEXTFILES.COM has been online for nearly 25 years providing text files, focusing mostly on the years 1980-1995.

Links:

Gaming

The Cutting Room Floor

Description:
The Cutting Room Floor is a site dedicated to unearthing and researching unused and cut content from video games, from debug menus to unused music, graphics, enemies, and levels.

Links:

Technology

The Eye

Description:
The Eye is a very large archive of files of all types covering decades. It provides archives of various subreddits, Telegram channels, AI models, books, website crawls, 3D models, images and more.

Links:

Gaming

The Old Computer

Description:
The Old Computer is home to the largest collection of ROMs and emulators anywhere on the web, with over 500,000 ROMs and emulators for every major computer, console, arcade machine, pinball table and mobile device. It also offers box scans, manuals, magazines and a 179,000+ strong user community.

Links:

Technology

The Unix Heritage

Description:
The Unix Heritage Society's aims include the preservation and maintenance of historical and non-mainstream UNIX systems; the further development of existing UNIX systems; and the continual fostering of the Unix community spirit. They host historical Unix distributions and packages available for download.

Links:

World

Uppsala Conflict Data Program

Description:
The Uppsala Conflict Data Program (UCDP) is the world's largest collection of wartime and organized violence data, covering over 40 years of conflicts. It is based at Uppsala University in Sweden and works in collaboration with the Peace Research Institute in Oslo.

Links:

Climate

US Drought Monitor

Description:
The U.S. Drought Monitor has provided weekly climate maps since 1999. It is produced jointly by the NDMC, NOAA and USDA.

Links:

Technology

Web Design Museum

Description:
The Web Design Museum exhibits thousands of screenshots and videos of old websites, mobile apps and software from the 1990s to the mid-2000s.

Links:

World

Wikipedia

Description:
Wikipedia is a free-content online encyclopedia written and maintained by a community of volunteers, known as Wikipedians, through open collaboration and the wiki software MediaWiki. It is the largest and most-read reference work in history. Wikipedia is hosted by the Wikimedia Foundation, a non-profit organization that also hosts a range of other projects.

Links:

Gaming Technology

WinWorld

Description:
WinWorld is an online museum created in 2003 dedicated to the preservation and sharing of vintage, abandoned, and pre-release software. It offers information, media and downloads for a wide variety of computers and operating systems. Get classic operating systems, applications, games and betas for every platform from PC to Mac to Amiga, right here from the software library on WinWorld.

Links:

World Government

World Bank Open Data

Description:
The World Bank Open Data portal provides free and open access to global development data, mostly focusing on economic datasets.

Links:

Technology

Your.Org

Description:
Your.Org is a hosting company that provides hundreds of terabytes of data for various sites. They also host mirrors of various open source software, including Linux distributions and FreeBSD, as well as Wikipedia database dumps and content from other websites such as Microsoft, Corel, IBM and many more.

Links: