Data Hoarding - Resources

This is a list of data hoarding resources for anyone who wants to get started helping archival teams, or simply back up web content for personal use.

Tools

ArchiveBox

Description:
ArchiveBox is an open source tool that lets organizations and individuals archive public or private web content while retaining control over their data. It can be used to save copies of bookmarks, preserve evidence for legal cases, back up photos from FB/Insta/Flickr or media from YT/Soundcloud/etc., save research papers, and more.

Links:

Tools

ArchivesSpace

Description:
ArchivesSpace is an open-source archives information management application for managing and providing access to archives, manuscripts and digital objects. It supports a range of archival functions.

Links:

Tools

ArchiveTeam Warrior

Description:
Archive Team is a loose collective of rogue archivists, programmers, writers and loudmouths dedicated to saving our digital heritage. Since 2009 this variant force of nature has caught wind of shutdowns, shutoffs, mergers, and plain old deletions and done their best to save history before it's lost forever. They provide the ArchiveTeam Warrior, a virtual archiving appliance to help with the ArchiveTeam archiving efforts, along with other tools.

Links:

Knowledge

Automate the Boring Stuff

Description:
If you've ever spent hours renaming files or updating hundreds of spreadsheet cells, you know how tedious tasks like these can be. But what if you could have your computer do them for you? In Automate the Boring Stuff with Python, you'll learn how to use Python to write programs that do in minutes what would take you hours to do by hand, no prior programming experience required.

Links:

Tools

Awesome AI

Description:
Awesome AI is a curated list of AI tools, frameworks, APIs, software and resources related to machine learning.

Links:

Tools

Awesome Datahoarding

Description:
These tools are aimed at those wishing to get started with data hoarding. The list includes applications that you can run locally to gather data, parse data and index it.

Links:

Tools

Awesome Selfhosted

Description:
Self-hosting is the practice of hosting and managing applications on your own server(s) instead of consuming from SaaSS providers. This is a list of Free Software network services and web applications which can be hosted on your own server(s). Non-Free software is listed on the Non-Free page.

Links:

Tools

Browsertrix

Description:
Browsertrix is an open source web archiving system created by Webrecorder. It provides a web interface for starting crawl jobs on web sites, and is available both as a SaaS app and as a self-hosted deployment on Kubernetes. It also supports proxy servers.

Links:

Services

Canadian Technology Resources

Description:
Canadian alternatives for digital products. This site helps you find Canadian alternatives for digital services and products, like cloud services and SaaS products. These can be useful should you want to set up your own web site, email, or other cloud services but want to avoid big tech companies.

Links:

Knowledge

CensorTrace

Description:
Following the 2025 U.S. presidential inauguration, this automated tool monitors changes to major government websites by identifying and tracking removed pages, using publicly available data from the Internet Archive.

Links:

Knowledge

Cybersecurity Mastery Roadmap

Description:
A comprehensive, step-by-step guide to mastering cybersecurity from beginner to expert level with curated resources, tools, and career guidance.

Links:

Knowledge

Digital Preservation

Description:
This site is hosted by the US Library of Congress and presents information about the National Digital Information Infrastructure and Preservation Program (NDIIPP) and its initiatives.

Links:

Communities

Digital Preservation Coalition

Description:
The Digital Preservation Coalition (DPC) is a charity building a welcoming and inclusive global community, working together to bring about a sustainable future for our digital assets. It was established in 2002 as a collaboration between a number of agencies operating in the UK and Ireland.

Links:

Tools

DOSBox

Description:
DOSBox is a free and open-source emulator which runs MS-DOS applications and games on modern PCs, supporting thousands of programs.

Links:

Services

European Alternatives

Description:
European alternatives for digital products. This site helps you find European alternatives for digital services and products, like cloud services and SaaS products. These can be useful should you want to set up your own web site, email, or other cloud services but want to avoid big tech companies.

Links:

Services

Filecoin

Description:
Filecoin is a peer-to-peer network that enables reliable, decentralized file storage through built-in economic incentives and cryptographic proofs. Users pay storage providers—computers that store and continuously prove file integrity—to securely store their files over time. Anyone can join Filecoin as a user seeking storage or as a provider offering storage services. Storage availability and pricing aren't controlled by any single entity; instead, Filecoin fosters an open market for file storage and retrieval accessible to all.

Links:

Tools

Gallery-DL

Description:
Gallery-DL is a program to download image galleries and collections from several image hosting sites, similar to how yt-dlp can download videos.

Links:

Services

Git-annex

Description:
Git-annex allows managing large files with git, without storing the file contents in git. It can sync, back up, and archive your data, offline and online. Checksums and encryption keep your data safe and secure. Bring the power and distributed nature of git to bear on your large files with git-annex.

Links:

Communities

International Council on Archives

Description:
The International Council on Archives (ICA) promotes the efficient and effective management and use of records, archives and data in all their formats and their preservation as the cultural and evidentiary heritage of humanity.

Links:

Communities

International Internet Preservation Consortium

Description:
The International Internet Preservation Consortium (IIPC) identifies and develops best practices for selecting, harvesting, collecting, preserving and providing access to Internet content.

Links:

Tools

Internet Archive API

Description:
The Internet Archive is one of the largest online archival sources, and as such many data hoarders need to deal with its content programmatically. It offers a Python module that lets you script and automate commands against its public API.
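
As a rough illustration of what working with this content programmatically can look like, the sketch below uses only the Python standard library rather than the module mentioned above. It assumes the public `https://archive.org/metadata/<identifier>` and `https://archive.org/download/<identifier>/<file>` endpoints; the function names and the item identifier in the usage note are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed public endpoints of archive.org (not taken from this page).
METADATA_URL = "https://archive.org/metadata/{identifier}"
DOWNLOAD_URL = "https://archive.org/download/{identifier}/{name}"


def fetch_metadata(identifier: str) -> dict:
    """Fetch an item's metadata record as a dict (performs a network request)."""
    with urlopen(METADATA_URL.format(identifier=quote(identifier))) as response:
        return json.load(response)


def download_urls(identifier: str, metadata: dict, suffix: str = "") -> list:
    """List download URLs for the item's files, optionally filtered by suffix."""
    return [
        DOWNLOAD_URL.format(identifier=quote(identifier), name=quote(entry["name"]))
        for entry in metadata.get("files", [])
        if entry["name"].endswith(suffix)
    ]
```

For example, `download_urls("some-item", fetch_metadata("some-item"), ".pdf")` would list the PDF files of a (hypothetical) item named `some-item`. Keeping the URL building separate from the network call makes the logic easy to check offline.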

Links:

Tools

Interoperable Europe

Description:
The Interoperable Europe Portal is the European Union's platform for promoting and supporting interoperability, collaboration, and knowledge sharing across public administrations, businesses, and citizens. It acts as a one-stop shop for discovering, sharing, and reusing IT solutions and good practices.

Links:

Tools

IPFS

Description:
The InterPlanetary File System (IPFS) is a protocol, hypermedia and file sharing peer-to-peer network for storing and sharing data in a distributed hash table. This content delivery network is built around the innovation of content addressing: store, retrieve, and locate data based on the fingerprint of its actual content rather than its name or location.
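
To make the idea of content addressing concrete, here is a toy sketch, not IPFS itself. It assumes a plain SHA-256 hex digest as the "fingerprint" and an in-memory dictionary as the store; real IPFS uses its own content identifier (CID) format and splits data into linked blocks. The class name `ContentStore` is hypothetical.

```python
import hashlib


class ContentStore:
    """Toy in-memory store that addresses blobs by the SHA-256 of their bytes."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        # The address is derived from the content itself, not from a name or path.
        address = hashlib.sha256(data).hexdigest()
        self._blobs[address] = data
        return address

    def get(self, address: str) -> bytes:
        data = self._blobs[address]
        # Re-hash on retrieval: a mismatch would mean the stored bytes were altered.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("content does not match its address")
        return data
```

Storing the same bytes twice yields the same address, so duplicates collapse naturally, and anyone holding an address can verify that the bytes they received are the ones that were requested.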

Links:

Services

Knowledge Commons

Description:
Knowledge Commons is an open, adaptable collection of tools that support the human work of education and research and make that work more visible and impactful. Projects of the Commons are funded by the National Science Foundation and the National Endowment for the Humanities.

Links:

Tools

Libre Self-hosted

Description:
This is a curated list of free (libre) self-hosted projects.

Links:

Tools

Memento Protocol

Description:
Memento is a project aimed at making Web-archived content more readily discoverable and accessible to the public. It's a protocol that allows clients to find archived web content at specific timestamps.
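
In practice, a client asks a "TimeGate" for a page as it existed around a given time by sending an `Accept-Datetime` request header. The sketch below only builds such a request without sending it. It assumes the Wayback Machine's `/web/` endpoint acts as the TimeGate; the function name is hypothetical, and any Memento-compliant TimeGate could be substituted.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

# Assumption: the Wayback Machine's /web/ endpoint is used as the TimeGate here.
TIMEGATE_PREFIX = "https://web.archive.org/web/"


def build_timegate_request(url: str, when: datetime) -> Request:
    """Build (but do not send) a TimeGate request for `url` near `when`."""
    # Accept-Datetime carries an HTTP date expressed in GMT.
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return Request(TIMEGATE_PREFIX + url, headers={"Accept-Datetime": http_date})
```

Passing the returned object to `urllib.request.urlopen` should follow the TimeGate's redirect to the capture closest to the requested time; a compliant archive is expected to report the capture's actual timestamp in a `Memento-Datetime` response header.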

Links:

Tools

Metadata Editor

Description:
The World Bank Metadata Editor is an open-source web-based application designed to assist data curators in documenting data of various types according to specialized metadata standards. It supports many standards, including DDI CodeBook 2.5, Dublin Core, ISO 19139 and IPTC.

Links:

Services

ODCrawler

Description:
A search engine for open directories. Find millions of publicly available files.

Links:

Services

Perma

Description:
Websites change, go away, and get taken down. When linked citations lead to broken, blank, altered, or even malicious pages, that’s called link rot. Perma.cc helps scholars, journals, courts, and others create permanent records of the web sources they cite. The site is developed and maintained by the Harvard Library Innovation Lab at the Harvard Law School Library and administered by a consortium of libraries, with each library assisting its local users.

Links:

Services

Permanent

Description:
The Permanent Legacy Foundation is a non-profit organization offering cloud storage services at a low cost for building a digital archive for families, organizations and historians. Their mission is to preserve and provide perpetual access to the digital legacy of all people for the historical and educational benefit of future generations.

Links:

Communities

Reddit

Description:
Reddit is an American social news aggregation, content rating, and forum social network. There are several useful subreddits around data hoarding, self-hosting and archiving. If you have questions about these subjects or just want to chat, these communities are very active.

Links:

Communities

Safeguarding Research

Description:
Safeguarding Research is a group of individuals organizing to safeguard as much publicly available research, GLAM (galleries, libraries, archives and museums) collections, and similar material as possible.

Links:

Tools

Servarr

Description:
Servarr includes Lidarr, Prowlarr, Radarr, Readarr, Sonarr, and Whisparr. Collectively they are referred to as "*Arr", "*Arrs", "Starr", or "Starrs". Lidarr, Radarr, Readarr, Sonarr, and Whisparr are designed to automatically grab, sort, organize, and monitor your Music, Movie, E-Book, or TV Show collections, while Prowlarr manages your indexers and keeps them in sync with the other apps.

Links:

Knowledge

Sustainable Heritage Network

Description:
The Sustainable Heritage Network (SHN) is an answer to the pressing need for comprehensive workshops, online tutorials, and web resources dedicated to the lifecycle of digital stewardship. The SHN is a collaborative project that complements the work of Indigenous peoples globally to preserve, share, and manage cultural heritage and knowledge.

Links:

Communities

Video Game History Foundation

Description:
The Video Game History Foundation is one of several communities dedicated to the preservation of video games, both in physical and digital form. They provide a community forum and resources for gaming enthusiasts.

Links:

Tools

YT-DLP

Description:
yt-dlp is a feature-rich command-line audio/video downloader with support for thousands of sites. The project is a fork of youtube-dl based on the now inactive youtube-dlc. It is one of the most popular ways to download videos from YouTube and many other sites, allowing you to download a single video, a full playlist or a complete channel with a single command.
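
For scripted use, one option is to assemble the command line in Python and hand it to `subprocess`. The sketch below is a minimal example under stated assumptions: the `yt-dlp` executable is installed and on `PATH`, and the helper names, output folder and archive file name are hypothetical. It relies on the `-o` output template and `--download-archive` options.

```python
import subprocess


def build_command(url: str, output_dir: str, archive_file: str) -> list:
    """Assemble a yt-dlp invocation for a video, playlist or channel URL."""
    return [
        "yt-dlp",
        # Skip anything already listed in the archive file; record new downloads there.
        "--download-archive", archive_file,
        # Output template: one file per video, named after its title and ID.
        "-o", f"{output_dir}/%(title)s [%(id)s].%(ext)s",
        url,
    ]


def download(url: str, output_dir: str = "videos", archive_file: str = "downloaded.txt") -> None:
    # Requires the yt-dlp executable to be installed and on PATH.
    subprocess.run(build_command(url, output_dir, archive_file), check=True)
```

Because the archive file remembers what has already been fetched, re-running `download` on the same playlist or channel URL should only pick up new entries, which suits periodic archiving jobs.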

Links: