Data Hoarding - FAQ

This page covers frequently asked questions.

What is this site about?

DataHoarding.org is an index of resources and archives related to data hoarding, web archival and digital preservation. It is run as a non-profit, volunteer effort. It was inspired by the recent purge of online information by government agencies, corporations and others, and aims to provide easier access to tools and information. The goal is not only to hoard data, but to curate and index it as well.

On the technical side, the site runs on a custom 3-node Proxmox cluster with a QNAP NAS appliance for storage, and uses Cloudflare for content delivery. The cluster runs in high-availability mode with 99.99% uptime. We serve around 500 unique visitors each day.

Figure: Technical deployment pipeline

The site has 2 main indexes:

  • Tools and resources - A list of data hoarding resources for getting started, helping archival teams, or simply backing up web content for your own personal use.
  • Web archives - On this page you will find links to data archives from various countries. These archives contain data that was gathered and saved for the public good.

New archives are added on a weekly basis.

Why is digital preservation so important?

Data archival matters because our history, culture, governance and science increasingly exist in digital form. Without intervention, this data disappears. A study from Pew Research showed that 38% of web pages from 2013 had disappeared within 10 years. Over 50% of Wikipedia articles have at least one reference pointing to a page that no longer exists. Older data often sits on obsolete physical media, from Zip disks to tapes. And governments are increasingly rewriting the past, removing datasets and defunding institutions focused on topics they disagree with. It's up to individuals and organizations to pick up the slack.

How do I get started with digital archiving?

If your group or organization is interested in data preservation, the first thing to do is run an assessment of your current situation. Two popular frameworks exist for this:

  • NDSA Levels of Digital Preservation - A resource to help digital preservation practitioners build or assess their digital preservation program. It defines 4 levels: Know your content, Protect your content, Monitor your content and Sustain your content.
  • DPC Rapid Assessment Model - The DPC's Rapid Assessment Model (DPC RAM) is a digital preservation maturity modelling tool that has been designed to enable rapid benchmarking of an organization's digital preservation capability and facilitate continuous improvement over time. It's a spreadsheet that you can use to identify goals, shortcomings and more.

Once your assessment is done, you can start prioritizing what you want to focus on, such as at-risk data, topics you care about, time sensitive data, and so on. You can create a triage list or a list of sources that you want to use. This should follow the policy you've set up in your assessment phase.

Once you start collecting data (see How can I archive...), make sure you record useful metadata for each item, such as creation date, source, checksum, size and document type.
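One way to capture this kind of metadata is a small sidecar file stored next to each archived item. The following is a minimal sketch, assuming a Linux system with GNU coreutils. The function name write_sidecar, the .meta suffix and the key names are illustrative choices, not a standard or something this site prescribes.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of any tool mentioned on this page).
# Writes "<file>.meta" with basic metadata as key=value lines.
# Assumes GNU coreutils (sha256sum, date -r); the "file" utility is optional.
write_sidecar() {
  local file="$1" source_url="$2"
  local checksum size modified doctype

  checksum="$(sha256sum -- "$file" | cut -d' ' -f1)"
  size="$(wc -c < "$file" | tr -d '[:space:]')"
  # Last-modified time of the local copy, in UTC (GNU date syntax).
  modified="$(date -u -r "$file" +%Y-%m-%dT%H:%M:%SZ)"
  # Detect the document type if the "file" utility is installed.
  if command -v file >/dev/null 2>&1; then
    doctype="$(file --brief --mime-type -- "$file")"
  else
    doctype="unknown"
  fi

  {
    printf 'name=%s\n' "$(basename -- "$file")"
    printf 'source=%s\n' "$source_url"
    printf 'sha256=%s\n' "$checksum"
    printf 'size_bytes=%s\n' "$size"
    printf 'modified_utc=%s\n' "$modified"
    printf 'type=%s\n' "$doctype"
  } > "$file.meta"
}
```

For example, calling write_sidecar report.pdf "https://example.org/report.pdf" would create report.pdf.meta. Note that this records the modification time of the local copy, which may differ from the original creation date of the document.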

Storage and redundancy solutions should be adopted early on, including backups. You can use the resources page to find relevant software and services. Ongoing maintenance is crucial for the long term preservation of your data.

Finally, consider making your archives available online. This ensures accessibility and makes your work shine for everyone.

Figure: Stages of Data Preservation

How can I archive...

...a web site?

If your goal is to back up external sites, see this handy guide on how to scrape web sites. You can also participate in large scale web archival efforts here.
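As one possible starting point (the linked guide may recommend different tooling), GNU wget can create a browsable local copy of a site. This is a sketch under assumptions: the function name mirror_site and the default ./mirror directory are hypothetical, and the one-second wait is just a polite default to avoid hammering the server.

```bash
# Hypothetical wrapper around GNU wget for mirroring a single site.
mirror_site() {
  local url="$1" dest="${2:-./mirror}"
  # --mirror            recursive download with timestamping
  # --page-requisites   also fetch images, CSS and scripts needed to render pages
  # --convert-links     rewrite links so the copy works offline
  # --adjust-extension  save HTML/CSS files with matching extensions
  # --no-parent         never climb above the starting directory
  # --wait=1            pause one second between requests
  wget --mirror --page-requisites --convert-links --adjust-extension \
       --no-parent --wait=1 --directory-prefix="$dest" "$url"
}
```

For example, mirror_site "https://example.org/docs/" ./example-docs would save the pages under ./example-docs. Check the site's terms of use and robots.txt before mirroring anything large.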

...a YouTube video?

The best tool for video archival (from YouTube or any other video sharing site) is yt-dlp, a free and open source command-line tool that offers a large number of options and is kept up to date by a large community. For example, you can download a YouTube video as an .mp4 video file like this:

yt-dlp --no-mtime -f 'bestvideo[ext=mp4][vcodec^=avc]+bestaudio[ext=m4a]/best[ext=mp4]/best' --embed-subs --no-playlist --merge-output-format mp4 'https://www.youtube.com/watch?v=dQw4w9WgXcQ' -o never_give_you_up.mp4

You can also export only the audio and create a .MP3 file like this:

yt-dlp --no-mtime --embed-thumbnail --no-playlist --extract-audio --audio-format mp3 'https://www.youtube.com/watch?v=dQw4w9WgXcQ' -o never_give_you_up.mp3

...a CD or DVD disc?

On Windows, the most popular tool to convert a physical CD or DVD into an .iso file is ImgBurn. On Linux, you can use the dd utility like this:

sudo dd if=/dev/cdrom of=mydvd.iso bs=1M

Keep in mind that some DVDs are encrypted, and dd alone does not bypass that encryption; a usable copy of those discs requires additional steps.
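Once an image such as mydvd.iso exists, recording its checksum makes it possible to detect silent corruption later. A minimal sketch, assuming GNU coreutils; the helper names record_checksum and verify_checksum and the .sha256 suffix are illustrative only.

```bash
# Hypothetical helpers for recording and re-checking an image checksum.
# Assumes GNU coreutils (sha256sum).
record_checksum() {
  local image="$1" dir name
  dir="$(dirname -- "$image")"
  name="$(basename -- "$image")"
  # Run inside the image's directory so the stored path stays relative.
  ( cd -- "$dir" && sha256sum -- "$name" > "$name.sha256" )
}

verify_checksum() {
  local image="$1" dir name
  dir="$(dirname -- "$image")"
  name="$(basename -- "$image")"
  # --status prints nothing; the exit code is 0 only if the file still matches.
  ( cd -- "$dir" && sha256sum --check --status -- "$name.sha256" )
}
```

Run record_checksum mydvd.iso right after creating the image, then run verify_checksum mydvd.iso periodically (for example after copying the file to new storage). A non-zero exit code means the image no longer matches the recorded checksum.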

How can I bypass censorship?

Censorship comes in many forms, some more understandable than others. Content creators may restrict access to their work because they want to be paid for it. This is a widely accepted form of censorship, where you're required to pay a fee for access. Authoritarian regimes may censor information in order to silence and oppress opposition parties and minorities. Groups or organizations aligned with a specific religious, political or traditional ideology may try to censor data because it doesn't align with what they believe in. Regardless of the type of censorship you deal with, it's up to you to decide what is legally and morally acceptable for your situation, and whether you want to use any of these methods.

Using an ad blocker

Ad blockers can bypass web-based censorship filters, such as JavaScript libraries, popups and other methods that web sites use to restrict your access to information. They can also block trackers, spyware and malware. A popular option available for all major browsers is uBlock Origin. It's a browser extension that you can install and configure to block or remove ads from any site you visit. You can also check your browser's extension store for alternatives.

Changing your DNS servers

The Domain Name System (DNS) is how hostnames like www.example.com are automatically translated into IP addresses like 23.215.0.136. By default, your devices use whatever DNS servers your Internet Service Provider (ISP) has configured, but you can change these. Usually, it's as simple as going into your Internet connection settings, but there are step-by-step guides here. We suggest using the following addresses:

  • 9.9.9.9
  • 1.1.1.1

Quad9 and Cloudflare (the providers of those 2 servers) are well known, global organizations that provide DNS services for free, without censorship. A lot of the corporate and state level censorship is done at the DNS level, and this can bypass a lot (but not all) of it.
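To get a rough idea of whether your current DNS server answers differently from these public resolvers, you can query each one for the same hostname. This sketch assumes the dig utility is installed; the function name compare_resolvers is hypothetical. Different answers are not proof of censorship on their own, since large sites often return different addresses depending on the resolver's location, but an empty or obviously wrong answer from your default resolver is a useful hint.

```bash
# Hypothetical helper: print the A records that several resolvers return.
# Assumes "dig" is installed.
compare_resolvers() {
  local host="$1" resolver
  # First line: whatever resolver the system is currently configured to use.
  printf 'system default: %s\n' \
    "$(dig +short "$host" A | sort | paste -sd' ' -)"
  # Then the two public resolvers suggested above (Quad9 and Cloudflare).
  for resolver in 9.9.9.9 1.1.1.1; do
    printf '%s: %s\n' "$resolver" \
      "$(dig +short "@$resolver" "$host" A | sort | paste -sd' ' -)"
  done
}
```

For example, compare_resolvers www.example.com prints one line per resolver, each followed by the addresses it returned.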

Using a VPN

When you access an online resource such as a web server, your traffic goes from your device, through your ISP, various backbone networks in your country, then to the destination's country, and finally to the server you requested. Spying, malware attacks and censorship can happen at any point in that chain. A VPN creates an encrypted tunnel between your device and a remote endpoint somewhere on the Internet, bypassing any bad actor along the way and making it seem as if you live in another region or even country. This helps with anonymity as well, although keep in mind that many VPN providers may still comply with law enforcement requests to reveal who is behind a specific connection. There are many VPN providers for you to choose from.

Tor Project

If all else fails, the Tor Project provides as much privacy as possible through the Tor network (often associated with the dark web), using a technology called onion routing. Your traffic passes through several randomly chosen Tor relays around the world, which anonymizes it. It's used worldwide by activists, freedom fighters and criminals, among others. Keep in mind that using the Tor Browser can be technically challenging and will slow down your connection significantly, but for some use cases, it's worth the sacrifices.

What are the criteria for inclusion?

The archives listed on this site are curated using 2 criteria: the site must have a significant collection of items, and these items must be available to the public without significant hoops to jump through (e.g. needing a local library card) and without a subscription fee. We also use custom scripts that check that these sites are up and running, so all links should usually work. Note that while we strive to provide safe and accurate information, we can't guarantee the safety of these sites, so use your own discretion.
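The site's own scripts are not shown here. As a rough illustration of what such an availability check could look like, the following sketch uses curl to request a URL and reports whether it answered with a 2xx status. The function name check_link, the 15-second timeout and the output format are all assumptions.

```bash
# Hypothetical availability check, not the script this site actually uses.
# Assumes "curl" is installed.
check_link() {
  local url="$1" code
  # Follow redirects, discard the body and print only the final HTTP status.
  # curl prints 000 when no HTTP response was received (DNS failure, timeout...).
  code="$(curl --silent --location --max-time 15 \
               --output /dev/null --write-out '%{http_code}' "$url")" || code="000"
  case "$code" in
    2??) printf 'UP %s (%s)\n' "$url" "$code" ;;
    *)   printf 'DOWN %s (%s)\n' "$url" "$code"; return 1 ;;
  esac
}
```

A list of archive URLs could then be checked in a loop, for example: while read -r url; do check_link "$url"; done < links.txt (where links.txt is a hypothetical file with one URL per line).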

How can I contribute?

If you know of any resources or archival sites, or if you have legal concerns about any existing data, contact us at contact@datahoarding.org. We do not run ads, accept sponsorships or donations, and are not looking for additional volunteers at this time.