This page covers frequently asked questions.
- 1. What is this site about?
- 2. Why is digital preservation so important?
- 3. How do I get started with digital archiving?
- 4. How can I archive...
- 5. How do I safeguard my data?
- 6. How can I bypass censorship?
- 7. How do I prepare to go off grid?
- 8. What are the criteria for inclusion?
- 9. How can I contribute?
What is this site about?
DataHoarding.org is an index of resources and archives related to data hoarding, web archival and digital preservation. It is run as a non-profit, volunteer effort and was inspired by the recent purge of online information by government agencies, corporations and others. It aims to provide easier access to tools and information. The goal is not only to hoard data, but to curate and index it as well.
On the technical side, the site runs on a custom 3-node Proxmox cluster with a QNAP NAS appliance for storage, and uses Cloudflare for content delivery. It runs in high-availability mode with 99.99% uptime and serves around 500 unique visitors each day.

[Figure: Technical deployment pipeline]
The site has 2 main indexes:
- Tools and resources - This is a list of data hoarding resources if you want to get started and help archival teams, or simply back up web content for your own personal use.
- Web archives - On this page you will find links to data archives from various countries. These archives contain data that was gathered and saved for the public good.
New archives are added on a weekly basis and RSS feeds are available.
Why is digital preservation so important?
Data archival matters because our history, culture, governance and science increasingly exist in digital form. Without intervention, this data disappears. A study from Pew Research showed that 38% of web pages from 2013 had disappeared within 10 years. Over 50% of Wikipedia articles have references to sites that no longer exist. Older data increasingly lives on obsolete physical media, from Zip disks to tapes. And governments are increasingly rewriting the past, removing datasets and defunding institutions focused on topics they disagree with. It's up to individuals and organizations to pick up the slack.
How do I get started with digital archiving?
If your group or organization is interested in data preservation, the first thing to do is run an assessment of your current situation. Two popular frameworks exist for this:
- NDSA Levels of Digital Preservation - The Levels of Digital Preservation is a resource to help digital preservation practitioners build or assess their digital preservation program. It covers 4 levels: Know your content, Protect your content, Monitor your content and Sustain your content.
- DPC Rapid Assessment Model - The DPC's Rapid Assessment Model (DPC RAM) is a digital preservation maturity modelling tool that has been designed to enable rapid benchmarking of an organization's digital preservation capability and facilitate continuous improvement over time. It's a spreadsheet that you can use to identify goals, shortcomings and more.
Once your assessment is done, you can start prioritizing what you want to focus on, such as at-risk data, topics you care about, time-sensitive data, and so on. You can create a triage list or a list of sources that you want to use. This should follow the policy you've set up in your assessment phase.
Once you start collecting data (see How can I archive...), make sure you include useful metadata such as creation date, source, checksum, size and document type. The BagIt packaging format is useful for this task.
Storage and redundancy solutions should be adopted early on, including backups. You can use the resources page to find relevant software and services. Ongoing maintenance is crucial for the long-term preservation of your data.
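As a rough illustration of recording that metadata, here is a minimal sketch of a shell function that writes a small manifest for a folder of collected files. The function name write_manifest, the tab-separated layout and the example URL are assumptions made for this example (this is not the BagIt layout itself), and it assumes GNU coreutils on Linux.

```shell
#!/bin/bash
# Hypothetical helper: write_manifest <folder> <source-url> <manifest-file>
# Writes a header with the source and capture time, then one tab-separated
# line per file: SHA-256 checksum, size in bytes, path.
write_manifest() {
    local dir="$1" source_url="$2" manifest="$3" file sum size
    {
        printf '# source=%s archived=%s\n' "$source_url" "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
        # Sorting keeps the manifest stable between runs. File names that
        # contain newlines are not handled by this simple loop.
        find "$dir" -type f | sort | while IFS= read -r file; do
            sum=$(sha256sum < "$file" | cut -d ' ' -f 1)
            size=$(wc -c < "$file" | tr -d ' ')
            printf '%s\t%s\t%s\n' "$sum" "$size" "$file"
        done
    } > "$manifest"
}
```

You could call it as write_manifest ./collection https://example.org/ ./manifest.tsv. Keep the manifest file outside the folder being scanned so that it doesn't end up listing itself.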
Finally, consider making your archives available online. This ensures accessibility and makes your work shine for everyone.

[Figure: Stages of Data Preservation]
How can I archive...
...a web site?
Most browsers allow you to save a single HTML page along with all of its assets locally, or to Print to PDF. If you want to archive an entire site, several tools exist:
- Browsertrix, a web archival tool available as a monthly subscription.
- Zim-It, a free online service to produce offline versions of sites meant for the Kiwix viewer.
- ArchiveBox, a self-hosted, web-based app that can save a list of pages in PDF, WARC and other formats.
- Web Crawler, a self-hosted, web-based site crawler and WARC viewer currently under development.
- HTTrack, a popular Windows-based crawler.
- Zeno, a command line web crawler from the Internet Archive.
For more information on archiving web sites, see this handy guide. You can also participate in large scale web archival efforts here to help save sites to the Internet Archive.
...a YouTube video?
The best tool for video archival (from YouTube or any other video sharing site) is called yt-dlp, a free and open source tool that allows you to specify a large number of options on the command line, and is supported by a large community that keeps updating it. For example, you can download a YouTube video as a .MP4 video file like this:
yt-dlp --no-mtime -f 'bestvideo[ext=mp4][vcodec^=avc]+bestaudio[ext=m4a]/best[ext=mp4]/best' --embed-subs --no-playlist --merge-output-format mp4 https://www.youtube.com/watch?v=dQw4w9WgXcQ -o never_give_you_up.mp4
You can also export only the audio and create a .MP3 file like this:
yt-dlp --no-mtime --embed-thumbnail --no-playlist --extract-audio --audio-format mp3 https://www.youtube.com/watch?v=dQw4w9WgXcQ -o never_give_you_up.mp3
...an FTP site?
Many FTP clients have a bulk download option, but the wget utility on Linux makes it easy to do as well. Use this command to download all files and subfolders from this example site:
wget -m -np -c ftp://ftp.fu-berlin.de/pc/games/idgames/
...a CD or DVD disc?
On Windows, the following popular tools can be used:
- HandBrake can be used to convert a DVD disc or ISO file to a video.
- ImgBurn can be used to burn an image file to a disc.
- Rufus can be used to burn images to a USB device.
On Linux, you can use the dd utility like this:
sudo dd if=/dev/cdrom of=mydvd.iso bs=1M
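Other tools can be used to convert other disc archival formats to ISO. For CUE/BIN images, use bchunk file.bin file.cue file (bchunk treats the last argument as a base name and appends the track number and extension, producing file01.iso). For MDF images, use mdf2iso file.mdf file.iso.
Keep in mind that some DVD discs are encrypted, and they cannot be copied this way without additional steps.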
...a Git repository?
Most open-source software these days depends on GitHub, or possibly a private Git instance. That's where the source code, along with issues, comments and releases, is kept. This code, along with the various releases, is crucial for the open-source ecosystem. Yet each repository depends on a single person or organization to maintain it, and relies on the hosting site to be online. If you want to mirror an existing repository, either to have a local cache or for archival purposes, the easiest way is to deploy your own Git software.
Gitea is an option that you can host on a Linux system for free, or you can pay for the cloud option. It allows you to easily mirror any other Git repo with a single button.
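If you only need the Git data itself and don't want to run a full Git server, a plain mirror clone is a lighter alternative. The sketch below shows one way to do it, assuming git is installed. The name mirror_repo is a hypothetical helper, and keep in mind that it only captures what lives inside Git (commits, branches, tags), not issues, comments or release downloads, which are stored by the hosting site.

```shell
#!/bin/bash
# Hypothetical helper: mirror_repo <source-url-or-path> <local-mirror-folder>
# Creates a bare mirror on the first run and refreshes it on later runs.
mirror_repo() {
    local source="$1" dest="$2"
    if [ -d "$dest" ]; then
        # Fetch new commits, branches and tags; drop refs deleted upstream.
        git -C "$dest" remote update --prune
    else
        # --mirror copies every ref into a bare repository (no working tree).
        git clone --mirror "$source" "$dest"
    fi
}
```

Running the same command again later, for example from a scheduled job, keeps the mirror in step with the original repository.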
...social media content?
Social media platforms are among the least stable places to store information. Accounts get suspended, platforms shut down, and content is routinely deleted without warning, as seen with the closures of Vine, Google+, and large portions of early Twitter. Unfortunately, most platforms have worked hard to block bulk downloading and archiving, and some have even shut down their API access, making this process much harder than it used to be.
To archive entire Twitter/X profiles or threads, Thread-Safe can capture full threads whereas Twitter Downloader can download media from X, Instagram, TikTok and Facebook. X also has its own data export option.
Facebook has its own export option in the Settings page, whereas for Reddit, Arctic Shift provides bulk access to historical Reddit data that was preserved before API changes made direct scraping difficult.
...different video formats?
If you need to convert audio or video formats, the most popular tool for Windows, Linux and macOS is a command-line utility called FFmpeg. For example, you can convert a .MKV file to .MP4 using this command:
ffmpeg -i input.mkv -c:v libx264 -c:a copy -c:s mov_text -movflags +faststart -fflags +genpts output.mp4
...between different computers?
Transferring files between different computer systems, whether on the same network or around the globe, is a very useful task when you want to do offsite backups or move files from an old platform to a new one. The best tools for that are Robocopy on Windows and rsync on macOS and Linux. While they may seem more complex to use, they include powerful features like synchronization, resuming of transfers and algorithms to compare file sizes, dates and checksums.
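As a starting point on Linux or macOS, here is a minimal rsync sketch. The helper name sync_folder and the choice of flags are assumptions for this example rather than a recommended standard, so adjust them to your situation. The destination can be a local folder or a remote one in the form user@host:/path.

```shell
#!/bin/bash
# Hypothetical helper: sync_folder <source-folder> <destination>
# -a keeps permissions and timestamps, --partial keeps partially transferred
# files so an interrupted run can pick up where it stopped, and --checksum
# compares file contents instead of relying on size and date alone.
sync_folder() {
    local src="$1" dest="$2"
    # The trailing slash copies the contents of the source folder rather
    # than the folder itself.
    rsync -a --partial --checksum "${src%/}/" "$dest"
}
```

This sketch never deletes anything at the destination. rsync has a --delete option for exact mirroring, but it is worth testing with --dry-run first.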
How do I safeguard my data?
No data is safe unless you have a backup of it. Even RAID or a file system like ZFS isn't enough to survive many events that can cause data loss, such as ransomware, malware, accidental erasure or physical events like flood and fire. In the backup world, the golden rule is called the 3-2-1 strategy:
- 3 copies of your data: The original + two backups.
- 2 different media: Use different technologies (HDDs, SSDs, DVDs, LTO tapes, cloud, etc).
- 1 off-site copy: At least one copy should be in a different location.
As an example, a proper backup plan could involve having a NAS using ZFS for redundancy, a removable hard drive for a monthly offline backup, and an automated script that updates an offsite copy to the cloud nightly. Another option could be having a second NAS in a different location such as a workplace or a friend's house. Regardless of the actual method, this kind of redundancy greatly reduces the risk of losing your data.
The best backup system is an automated one. It can be as simple as a bash script run on a set schedule. The following example uses 7-Zip from the Linux terminal to create an encrypted 7z archive of the folder important_documents, then uses the AWS CLI to upload it to a cloud bucket named my-bucket:
#!/bin/bash
# Stop at the first failed command so a broken archive is never uploaded.
set -e
# Create a 7z archive with encrypted contents and file names (-mhe=on).
7zz a -bb0 -t7z /tmp/important_documents.7z /share/important_documents -pMY_SECRET_PASSWORD -mhe=on
# Upload the archive to the bucket, then remove the local temporary copy.
aws s3 cp /tmp/important_documents.7z s3://my-bucket/backups/important_documents.7z
rm -f /tmp/important_documents.7z
Other key concerns to keep in mind are bit rot and device degradation. Over time, various media degrade in different ways: HDDs can lose their magnetic field, flash storage can lose its electrical charge, and DVDs can suffer chemical oxidation. The way to combat bit rot and other types of degradation is to power on your backup media on a regular basis and use checksum software to verify integrity.
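A simple way to run such an integrity check is sketched below, assuming GNU coreutils on Linux. The helper names record_checksums and verify_checksums, and the SHA256SUMS file name, are choices made for this example. You would record the checksums once after writing a backup, then run the verification each time you power the media back on.

```shell
#!/bin/bash
# Hypothetical helper: record_checksums <folder>
# Stores a SHA-256 checksum for every file in a SHA256SUMS file kept at the
# top of the folder.
record_checksums() {
    local dir="$1"
    ( cd "$dir" && find . -type f ! -name SHA256SUMS -print0 \
        | sort -z | xargs -0 -r sha256sum > SHA256SUMS )
}

# Hypothetical helper: verify_checksums <folder>
# Re-reads every listed file and compares it to the recorded checksum.
# Returns non-zero if any file is missing or no longer matches.
verify_checksums() {
    local dir="$1"
    ( cd "$dir" && sha256sum --quiet -c SHA256SUMS )
}
```

A failed verification tells you that a file changed or became unreadable. It cannot repair the file, which is why the other copies from your 3-2-1 strategy matter.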
Lastly, consider media longevity. Here are common figures, although this can vary based on the quality of the material used:
| Media Type | Estimated Life | Unpowered Window |
|---|---|---|
| Standard HDD | 5-7 Years | 3-5 Years |
| SSD / NVMe | 5-10 Years | 6 Months |
| LTO Tapes | 15-30 Years | 15-30 Years |
| CDs / DVDs | 2-15 Years | Indefinite |
| M-Disc (Optical) | 100-1,000 Years | Indefinite |
| Cloud Storage | Indefinite | Indefinite |
Keep in mind that not all data has the same importance. Many organizations have a tier list, where critical data is kept in 2, 3 or even more locations, whereas non-critical data may only have a single backup copy. Determining your risk tolerance is part of the process.
How can I bypass censorship?
Censorship comes in many forms, some more understandable than others. Content creators may restrict access to their work because they want to be paid for it, which is a widely accepted form of censorship where you're required to pay a fee for access. Authoritarian regimes may censor information in order to silence and oppress opposition parties and minorities. Groups or organizations aligned with a specific religious, political or traditional ideology may try to censor data because it doesn't align with what they believe in. Regardless of the type of censorship you deal with, it's up to you to decide what is legally and morally acceptable for your situation, and whether you want to use any of these methods.
Using an ad blocker
Ad blockers can bypass web-based filters such as JavaScript libraries, popups and other methods that web sites use to restrict your access to information, and they can also block trackers, spyware and malware. A popular option available for all major browsers is uBlock Origin. It's a browser extension that you can install and configure to block or remove ads from any site you go to. You can also check the extensions store of your browser for alternatives, or use a specialized browser like Camoufox to bypass fingerprinting measures.
Changing your DNS servers
The Domain Name System (DNS) is how hostnames like www.example.com are automatically translated into IP addresses like 23.215.0.136. By default, your devices are configured to use whatever DNS servers your Internet Service Provider (ISP) has configured, but you can change these. Usually, it's as simple as going into your Internet connection settings, but there are step-by-step guides here. We suggest using the following addresses:
- 9.9.9.9
- 1.1.1.1
Quad9 and Cloudflare (the providers of those 2 servers) are well known, global organizations that provide DNS services for free, without censorship. A lot of the corporate and state level censorship is done at the DNS level, and this can bypass a lot (but not all) of it.
Using a VPN
When you access an online resource such as a web server, your traffic goes from your device, through your ISP, various backbone networks in your country, then to the destination's country, and finally to the server you requested. Spying, malware attacks and censorship can happen at any point in that chain. A VPN creates an encrypted tunnel between your device and a remote endpoint somewhere on the Internet, bypassing any bad actor along the way and making it seem as if you live in another region or even country. This helps with anonymity as well, although keep in mind that many VPN providers may still comply with law enforcement requests to reveal who is behind a specific connection. There are many VPN providers for you to choose from.

[Figure: How a VPN protects you]
One thing of note is that a lot of public VPN providers are easily identified and blacklisted already. You may want to opt for providers that specifically use protocols or methods that make it harder to identify the VPN tunnel, such as Obscura or Amnezia.
Tor Project
If all else fails, the Tor Project provides as much privacy as possible through the dark web, by using a technology called Onion Routing. Your traffic goes through a number of random Tor relays all around the world, anonymizing your traffic. It's used worldwide by activists, freedom fighters and criminals, among others. Keep in mind that using the Tor Browser can be technically challenging and will slow down your connection significantly, but for some use cases, it's worth the sacrifices.
Want more details on how to protect yourself online? Check out the EFF Surveillance Self-Defense site for great guides about online privacy.
How do I prepare to go off grid?
Whether you're planning to go live off the grid, worried about government censorship cutting off your Internet access, or preparing for a natural disaster, having an offline-capable setup is increasingly important. The core idea is simple: download the content and services you rely on before you need them, and host them locally.
The way to do that is by first identifying what you actually use daily: reference material, maps, media, communication tools, etc. Then, go and find offline equivalents. Here are some places to get started:
- Kiwix: Kiwix is an app (available for desktop and mobile) that lets you download and browse entire offline copies of Wikipedia, Stack Overflow, Project Gutenberg, and dozens of other resources as compact ZIM files. You can even use Zim-It to create your own offline snapshots of specific websites.
- Nextcloud: For services, consider self-hosting replacements for the cloud tools you depend on. Nextcloud can replace Google Drive and Calendar, and it provides a place to share your files.
- Jellyfin: Jellyfin is a media application that can stream locally hosted videos, music and more.
- Proxmox: While you can store all your media and files on an external hard disk, or even a local NAS, you may want a more robust solution to host your services. Proxmox is a popular free hypervisor you can use to run various virtual machines and containers on a dedicated server.
Once you have a self-hosted setup, remember to test it. Unplug your Internet cord for a day and see what breaks. The first thing that tends to go out is DNS, which is something you can fix by hosting a Pi-hole server as a local DNS resolver. Similarly, any software that requires online activation or license verification will stop working, so make sure critical tools are set up in offline mode or use fully open-source alternatives that don't phone home.
Finally, don't neglect communication. Mesh networking tools like Meshtastic can provide basic messaging over long-range radio without any internet infrastructure, while Briar can work over local Wi-Fi or Bluetooth between nearby devices. Store important contacts, documents, and emergency information in plain text or PDF format so they're readable on any device without any special software. The goal is resilience through redundancy: the more you can do without depending on a remote server, the better prepared you'll be when the connection disappears.
For more on surviving Internet shutdown events, check out the EFF's Guide to Circumventing Internet Shutdowns.
What are the criteria for inclusion?
The archives listed on this site are curated using 2 criteria: the site must have a significant collection of items, and these items must be available to the public without having to jump through significant hoops (e.g. a requirement to have a local library card) or pay a subscription fee. We also use custom scripts that ensure these sites are up and running, so all links should usually work. Note that while we strive to provide safe and accurate information, we can't guarantee the safety of these sites, so use your own discretion.
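The scripts used by this site aren't shown here. If you maintain your own list of archives, the following is only a minimal sketch of how such an availability check could work, assuming curl is installed. The names check_url and report_dead_links, the 15 second timeout and the one-URL-per-line input file are all assumptions for this example.

```shell
#!/bin/bash
# Hypothetical helper: check_url <url>
# Succeeds if the URL can be fetched. --fail makes curl return an error on
# HTTP 4xx/5xx responses, --location follows redirects, and --max-time
# stops a hung server from stalling the whole run.
check_url() {
    curl --silent --fail --location --max-time 15 --output /dev/null "$1"
}

# Hypothetical helper: report_dead_links <file-with-one-url-per-line>
# Prints a DOWN line for every URL that could not be fetched.
report_dead_links() {
    local list="$1" url
    while IFS= read -r url; do
        [ -z "$url" ] && continue
        check_url "$url" || printf 'DOWN\t%s\n' "$url"
    done < "$list"
}
```

A single failed check doesn't mean a site is gone for good, so it makes sense to retry over a few days before removing a link.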
How can I contribute?
If you know any resources or archival sites, or if you have legal concerns about any existing data, contact us at contact@datahoarding.org. We do not run ads, accept sponsorships or donations, and are not looking for additional volunteers at this time.