The British government’s entire online presence, comprising billions of web pages, has been indexed and digitally archived to the cloud for the first time.
The National Archives’ gigantic 120TB web archive encompasses billions of web pages – from every government department website and social media account – from 1996 to the present.
It took MirrorWeb – named among our 101 Rising Stars of the UK Start-up Scene last year – just two weeks to transfer the data from 72 hard drives at The National Archives to internal hard drives, before moving more than two decades of government internet history to the cloud and digitally archiving it there.
As part of a four-year contract, MirrorWeb was tasked with both moving the data to the cloud using Amazon Web Services and indexing it. Indexing the data meant that MirrorWeb had to write a complete replacement for the UK Government Web Archive’s previous search functionality.
As a result, 1.4bn documents were indexed and are now accessible to and searchable by researchers, students and members of the public, who can view websites and social media content in their original form as well as search for content on specific topics.
John Sheridan, digital director of The National Archives, said: “We are preserving 1,000 years of British history and a big part of that is preserving the digital record of government today.
“MirrorWeb has brought some outstanding technical capabilities, in particular data migration, cloud computing, search, new ways of harvesting and crawling content and new ways of presenting content and making it available.”
To carry out the indexing, MirrorWeb built its own software, WarpPipe, which is designed to index a large number of small files. It indexed all of the National Archives’ documents in just ten hours.
Philip Clegg, chief technical officer at MirrorWeb, explained: “The files within the National Archives are relatively small but in terms of numbers the volume is huge.
“This posed a problem for the big data processing tools already on the market, which were quoting us a timeframe of six to eight weeks.
“This is why we built WarpPipe, enabling the documents to be indexed in ten hours.”
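To put those figures in perspective, the short Python sketch below works out the average indexing rate and the speed-up they imply. It uses only numbers quoted in this article, and it assumes that the ten-hour run covered all 1.4bn documents and that the six-to-eight-week quotes meant round-the-clock processing. The function names are illustrative and have nothing to do with WarpPipe itself.

```python
# Figures quoted in the article.
DOCUMENTS_INDEXED = 1_400_000_000   # "1.4bn documents"
WARPPIPE_HOURS = 10                 # "ten hours"
QUOTED_WEEKS = (6, 8)               # "six to eight weeks"

# Assumption: the quoted weeks mean continuous, 24/7 processing.
HOURS_PER_WEEK = 7 * 24


def docs_per_second(documents: int, hours: float) -> float:
    """Average rate needed to index `documents` within `hours`."""
    return documents / (hours * 3600)


def speedup(weeks: float, hours: float) -> float:
    """How many times shorter `hours` is than `weeks` of continuous running."""
    return (weeks * HOURS_PER_WEEK) / hours


if __name__ == "__main__":
    rate = docs_per_second(DOCUMENTS_INDEXED, WARPPIPE_HOURS)
    print(f"Average rate: about {rate:,.0f} documents per second")
    for weeks in QUOTED_WEEKS:
        factor = speedup(weeks, WARPPIPE_HOURS)
        print(f"{weeks} weeks vs {WARPPIPE_HOURS} hours: about {factor:.0f}x faster")
```

On those assumptions, the ten-hour run averages roughly 38,900 documents a second, between about 100 and 134 times faster than the timeframes MirrorWeb says it was quoted.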
The search functionality is provided by Elasticsearch, which was chosen because it improves on the National Archives’ previous search engine in terms of speed, flexibility and reliability. The index is now updated monthly rather than quarterly, giving the end user more up-to-date archive content.
Clegg explained: “In under a second the public-facing website can bring up results from every UK government website which has been preserved and can be viewed just as it was for any chosen date.
“In this information age, it is vital that our digital history is preserved and this resource will help educate future generations to come.”