Britain’s GOV.UK portal has been online since Netscape Navigator 2.0 was state of the art. Now, a National Archives project to make this trove of historical content more accessible has moved 22 years’ worth of government websites to the cloud, where they have been re-indexed and made searchable through its updated UK Government Web Archive.
The archive consists entirely of historic, publicly available web content, so you’re not likely to turn up any unexpected state secrets. However, it provides valuable historical insight into the changing policies and attitudes of Britain’s official government communications, and there’s a trove of information to be found for anyone with an interest in the finer detail of government publications.
For example, a search for Brexit reveals 19,043 results, the first of which is a 2014 upload of a higher education funding presentation originally produced in April 2013, three months after then-Prime Minister David Cameron announced that the government would hold a referendum on EU membership.
There’s plenty to read on the subject of climate change, starting in 1996, when the GOV.UK records begin, with a single Environment Agency press release on water management, which notes that “the results of research into the impacts of climate change are being studied by the Agency to assess their impact on water resource yields.” In 2016, by comparison, the term got 1,141,844 exact matches in archived documents.
The archive is particularly valuable when it comes to seeing how modern historical events were communicated at the time, with materials including the September 2002 publication of the Iraq Dossier, which spurred the 2003 invasion of the country with claims – later proved false – that it possessed weapons of mass destruction.
Making this new archive wasn’t easy. Over a period of two weeks, 120TB of the British government’s archived GOV.UK web data was transferred from 72 individual two-terabyte hard disks to a pair of physical AWS Snowball transfer devices before being dispatched to one of Amazon’s UK cloud storage facilities, where The National Archives’ websites and content are hosted.
The operation, carried out by Manchester-based archiving firm MirrorWeb, involved a pair of specially built PCs, each of which could have eight drives connected simultaneously. That allowed data from 16 drives at a time to be decrypted and then re-encrypted for transit on the Snowballs, which were then shipped to Amazon’s UK data centre.
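To put those transfer figures in perspective, here is a rough back-of-the-envelope sketch using only the numbers quoted above. It assumes decimal terabytes and treats the two weeks as continuous, round-the-clock copying, neither of which MirrorWeb has confirmed, so the resulting average rate is illustrative rather than a measured figure. The function and constant names are our own.

```python
import math

# Figures quoted in the article
TOTAL_DRIVES = 72
DRIVE_CAPACITY_TB = 2
DATA_TB = 120
DRIVES_PER_PC = 8
PCS = 2
TRANSFER_DAYS = 14  # "a period of two weeks"


def transfer_summary():
    """Derive rough transfer figures from the quoted numbers."""
    drives_at_once = DRIVES_PER_PC * PCS
    # Number of 16-drive loads needed to get through all 72 disks
    batches = math.ceil(TOTAL_DRIVES / drives_at_once)
    raw_capacity_tb = TOTAL_DRIVES * DRIVE_CAPACITY_TB
    # Share of the raw disk capacity actually occupied by archive data
    fill_ratio = DATA_TB / raw_capacity_tb
    # Assumption: copying ran non-stop for the full two weeks
    seconds = TRANSFER_DAYS * 24 * 60 * 60
    avg_mb_per_s = DATA_TB * 1_000_000 / seconds  # decimal TB -> MB
    return {
        "drives_at_once": drives_at_once,
        "batches": batches,
        "raw_capacity_tb": raw_capacity_tb,
        "fill_ratio": round(fill_ratio, 3),
        "avg_mb_per_s": round(avg_mb_per_s, 1),
    }


if __name__ == "__main__":
    for key, value in transfer_summary().items():
        print(f"{key}: {value}")
```

On those assumptions, the 72 disks would need five loads of up to 16 drives, and the average sustained rate works out at roughly 99MB per second across both machines.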
The next step was to build a brand new search index and interface for the huge cache of data – a total of 1.4 billion documents ranging from PDFs to social media posts and web pages with ageing embedded multimedia elements. Everything had to be indexed and fully text searchable, and that meant that MirrorWeb had to develop new tools.
“We attempted to use traditional Hadoop tooling but found it to be impractical for big data sets stored in the cloud,” explains MirrorWeb CTO Philip Clegg. “We decided to develop our own cloud native solution that scales linearly and enabled us to index over 147,000,000 documents per hour.”
It did the trick. “To index the entire 120TB collection they were able to spin up 1,000 node plus cluster of computers to process the entirety of that collection, and in just a couple of days,” adds John Sheridan, digital director of The National Archives.
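The indexing numbers can be sanity-checked in the same way. The sketch below takes Clegg’s 147,000,000 documents per hour as a sustained rate and assumes exactly 1,000 nodes, where Sheridan says “1,000 node plus”. Both are simplifications, and the two quotes may not be measuring the same stage of the job, so treat the output as a lower bound on pure indexing time rather than a contradiction of the “couple of days” figure. Names are hypothetical.

```python
# Figures quoted in the article
TOTAL_DOCUMENTS = 1_400_000_000
DOCS_PER_HOUR = 147_000_000
COLLECTION_TB = 120
NODES = 1_000  # assumption: the quote says "1,000 node plus"


def indexing_estimate():
    """Estimate indexing time and per-node load from the quoted figures."""
    # Hours needed if the quoted rate were sustained for the whole job
    hours_at_quoted_rate = TOTAL_DOCUMENTS / DOCS_PER_HOUR
    # Documents each node would handle per second at that rate
    docs_per_node_per_second = DOCS_PER_HOUR / NODES / 3600
    # Average document size, using decimal terabytes and kilobytes
    avg_doc_kb = COLLECTION_TB * 1e12 / TOTAL_DOCUMENTS / 1000
    return {
        "hours_at_quoted_rate": round(hours_at_quoted_rate, 1),
        "docs_per_node_per_second": round(docs_per_node_per_second, 1),
        "avg_doc_kb": round(avg_doc_kb, 1),
    }


if __name__ == "__main__":
    for key, value in indexing_estimate().items():
        print(f"{key}: {value}")
```

At the quoted rate, a single pass over all 1.4 billion documents would take about nine and a half hours, with each node handling around 41 documents a second and the average document weighing in at roughly 86KB.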
And the archive is set to keep growing. MirrorWeb is currently developing new crawlers to spider government content, using machine learning and AI to handle automatic content discovery and the patching of problematic site content.
The new archive isn’t quite as comprehensive as the curious might hope. For example, although the government’s official social media channels have been archived, the Twitter archives only go up to March 8, 2016, which means we couldn’t look for the Foreign Office’s hastily deleted claim that Porton Down had identified the use of a “military-grade Novichok nerve agent” in the poisoning of Sergei and Yulia Skripal in March of this year.
By comparison, archives of official government sites such as Your Vote Matters date from as recently as 2018.
MirrorWeb’s Philip Clegg says that the discrepancy arises because “all archives have to go through Quality Assurance (QA) prior approval by the government before release to the public facing archive.”
Given that GOV.UK’s official deletion policy means that content can be removed if it “was published by mistake” or “if it could result in a risk to health, finances or reputation”, it’s probably safe to bet that we won’t be seeing any official return of that compromising tweet.
Interactive content proved to be somewhat hit and miss, an issue which The National Archives acknowledges. While examples of early Macromedia Shockwave games, such as those on the Environment Agency’s 2002 edutainment site, tried to load, the content was either missing or incompatible with recent versions of Adobe’s Shockwave player.
That’s going to change, though. Clegg says that the long-term plan is to “achieve ultimate fidelity, including legacy plug-ins and software not supported by modern browsers,” which is a vital and often-ignored aspect of archiving the ephemeral and vanishing history of the web.