Compliance guidelines for web archiving

These requirements are designed to assist website developers in optimising their sites for more comprehensive and high-quality web archive captures. They are based on the criteria established by the National Archives’ UK Government Web Archive.

HTML versions and HTTP protocols

All versions of HTML to date can be archived and replayed. Ensure that all content on your website is presented using either the HTTP or HTTPS protocol.

Video, infographic, audio and multimedia content

Streaming video or audio cannot be captured. Such content should also be made accessible via progressive download over HTTP or HTTPS, using absolute URLs, with the source URL not obscured.

Link to audio-visual (AV) material with absolute URLs instead of relative URLs.
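
For example, an HTML5 video element can reference its source with an absolute, unobscured URL (the domain and file name below are placeholders):

<video controls>
  <source src="https://www.mywebsite.com/media/annual-review.mp4" type="video/mp4">
</video>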

Provide transcripts for all audio and video content.

Provide alternative methods of accessing information held in infographics, videos and animations.

Content protected by a cross-domain file or within a cross-domain iframe cannot usually be captured. This most often applies to multimedia content embedded in web pages but hosted on another domain. If any content falls into this category, please ensure it is accessible to the crawler through an alternative method.
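
For example, where a video is embedded from another domain in an iframe, a direct, absolute link to the same material can be provided alongside it (the URLs below are placeholders):

<iframe src="https://player.example.com/embed/abc123" title="Annual review video"></iframe>
<p><a href="https://www.mywebsite.com/media/annual-review.mp4">Watch the annual review video (MP4)</a></p>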

Documents and file sharing

File content hosted on web-based collaborative platforms and file-sharing services such as SharePoint, Google Docs, and Box cannot be captured. To ensure accessibility for the crawler, these files should be made available in alternative ways, such as downloadable files hosted on the root domain.

Site structure and sitemaps

Include a human-readable HTML sitemap on your website. It makes content more accessible, especially for users of the archived version, as it provides an alternative to interactive functionality.
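
A human-readable sitemap can be as simple as a page of static links, for example (the sections shown are placeholders):

<ul>
  <li><a href="/about/">About us</a></li>
  <li><a href="/news/">News</a></li>
  <li><a href="/publications/">Publications</a></li>
  <li><a href="/contact/">Contact us</a></li>
</ul>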

Provide an XML sitemap. This significantly speeds up capturing and quality assuring the website archive. Please see https://www.sitemaps.org/ for further details. Link to the sitemap from your robots.txt file (RFC 9309).
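
A minimal XML sitemap, and the corresponding reference in robots.txt, might look like this (the URLs are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.mywebsite.com/</loc>
  </url>
  <url>
    <loc>https://www.mywebsite.com/news/new-report-launch</loc>
  </url>
</urlset>

In robots.txt:

Sitemap: https://www.mywebsite.com/sitemap.xml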

Where possible, keep all content under one root URL. Any content hosted under root URLs other than the target domain, sub-domain or microsite is unlikely to be captured. Typical examples include documents hosted in the cloud (for example, on amazonaws.com), newsletters hosted by services such as Mailchimp, and services that deliver content through external domains.

If using pagination (../page1, ../page2 and so on), you will also need to include all URLs from that pagination structure in your HTML or XML sitemap, as the crawler can sometimes misinterpret recurrences of a similar pattern as a crawler trap and may therefore only crawl to a limited depth.
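
For example, a paginated news listing would have every page listed explicitly in the XML sitemap (placeholder URLs):

<url><loc>https://www.mywebsite.com/news/page1</loc></url>
<url><loc>https://www.mywebsite.com/news/page2</loc></url>
<url><loc>https://www.mywebsite.com/news/page3</loc></url>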

Links and URLs

“Orphaned” content (content that is not linked to from within your website) will not be captured. You will need to provide a list of orphan links as an XML sitemap or supplementary URL list before the crawl is launched.

Links in binary files attached to websites (links included in .pdf, .doc, .docx, .xls, .xlsx and .csv documents) cannot be followed by the crawler. All resources linked to in these files must also be linked to from simple web pages, or you will need to provide a list of these links as an XML sitemap or supplementary URL list before the crawl is launched.

Where possible, use meaningful URLs such as https://mywebsite.com/news/new-report-launch rather than https://mywebsite.com/5lt35hwl. As well as being good practice, this can help when you need to redirect users to the web archive.

Avoid using dynamically-generated URLs.

Dynamically-generated content and scripts

Client-side scripts should only be used where they are the most appropriate tool for their intended purpose.

Make sure any client-side scripting is publicly viewable over the internet – do not use encryption to hide the script.

Wherever possible, keep your code in readily accessible, separate script files (for example, files with a .js extension) rather than coding it directly into content pages, as this helps with diagnosing and fixing problems.
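
For example, reference the behaviour from a separate file rather than embedding it in the page (the file path is a placeholder):

<script src="/scripts/navigation.js"></script>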

Avoid using dynamically-generated date functions. Use the server-generated date rather than the client-side date. Any dynamically-generated date shown in an archived website will always display the date on which the page is viewed, rather than the date of capture.
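
For example, a date written by client-side script changes every time the archived page is viewed, whereas a date rendered into the HTML by the server is preserved as captured (the date shown is a placeholder):

<!-- Avoid: always displays today's date, even in the archive -->
<p id="date"></p>
<script>document.getElementById('date').textContent = new Date().toDateString();</script>

<!-- Preferred: the date is fixed in the HTML sent by the server -->
<p>Published on 14 March 2024</p>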

Avoid using dynamically-generated URLs.

Dynamically-generated page content using client-side scripting cannot be captured. This may affect the archiving of websites constructed in this way.

Wherever possible, the page design should ensure that content is still readable and links can still be followed when scripts are not available, for example by using the <noscript> element.
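
For example, a script-driven listing can fall back to plain links inside a <noscript> element (the URLs are placeholders):

<div id="news-listing"></div>
<noscript>
  <ul>
    <li><a href="/news/new-report-launch">New report launch</a></li>
    <li><a href="/news/annual-review">Annual review</a></li>
  </ul>
</noscript>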

When using JavaScript to design and build, follow a “progressive enhancement” approach. This works by building your website in layers:

  1. Code semantic, standards-compliant (X)HTML or HTML5
  2. Add a presentation layer using CSS
  3. Add rich user interactions with JavaScript

This is an example of a complex combination of JavaScript, which will cause problems for archive crawlers, search engines, and some users:

javascript:__doPostBack('ctl00$ContentPlaceHolder1$gvSectionItems','Page$1')

This is a preferred example of a well designed URL scheme with simple links:

<a href="content/page1.htm" onclick="javascript:__doPostBack('ctl00$ContentPlaceHolder1$gvSectionItems','Page$1')">1</a>

Always design for browsers that don’t support JavaScript or have disabled JavaScript.

Provide alternative methods of access to content, such as plain HTML.

Interactive graphs, maps and charts

Interactive content should be avoided where possible, as it is often difficult to archive while retaining full functionality.

For essential interactive elements such as graphs, maps or charts, alternative “crawler-friendly” access methods should be provided. If visualisations are used, the underlying data should always be available in a simple format, such as a .txt or .csv file. In some cases, experimental technology may be used to capture interactive content – please get in touch if this applies to your materials.
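
For example, a page containing an interactive chart could link directly to the data behind it (the file name is a placeholder):

<div id="interactive-chart"></div>
<p><a href="/data/quarterly-figures.csv">Download the data behind this chart (CSV)</a></p>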

Menus, search and forms

Use static links, link lists and basic page anchors for menus and navigation elements, rather than using JavaScript and dynamically generated URLs.
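
For example, a navigation menu built from static links can be followed by the crawler without any script support (the sections shown are placeholders):

<nav>
  <ul>
    <li><a href="/about/">About us</a></li>
    <li><a href="/publications/">Publications</a></li>
    <li><a href="/news/">News</a></li>
  </ul>
</nav>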

Any function that requires a “submit” operation, such as dropdown menus, forms, search and checkboxes, will not archive well. Always provide an alternative method to access this content wherever possible, and make sure you provide a list of links that are difficult to reach as an XML sitemap or supplementary URL list before the crawl is launched.

Database and lookup functions

If a site uses databases to support its functions, these can only be captured in a limited fashion. Snapshots of database-driven pages can be captured if they can be retrieved via a query string, but the underlying database used to power the pages cannot be captured.

For example, content generated at https://www.mywebsite.lu/mypage.aspx?id=12345&d=true should be capturable, since the page is dynamically generated when requested by the web crawler, just as it would be for a standard user request. This is possible when the data is retrieved using an HTTP GET request, as in the example above.
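
For example, a lookup form that uses the GET method produces a query-string URL that the crawler can request directly (the page and parameter names are placeholders):

<form action="/mypage.aspx" method="get">
  <label for="id">Record number</label>
  <input type="text" id="id" name="id">
  <button type="submit">Look up</button>
</form>

Submitting this form requests a URL such as /mypage.aspx?id=12345, which can be crawled and captured like any other page.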

POST requests and Ajax

Content that relies on HTTP POST requests cannot be archived, as no query string is generated. While POST parameters may be suitable for certain situations, such as search queries, it is essential to ensure that the content is also accessible via a query string URL visible to the crawler; otherwise, it will not be captured.
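
For example, where results are normally returned by a POST request, the same content can also be exposed at a query-string URL that the crawler can reach (the action and parameter names are placeholders):

<!-- Results reachable only via POST cannot be captured -->
<form action="/search" method="post">
  <input type="text" name="q">
  <button type="submit">Search</button>
</form>

<!-- The same results exposed at a crawlable query-string URL -->
<a href="/search?q=annual-report">Search results for "annual report"</a>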

It is unlikely that content using HTTP POST requests, Ajax, or similar technologies can be successfully captured and replayed.

Wherever possible, an alternative method of accessing this content should be provided. A list of links that are difficult to reach should also be supplied, either as an XML sitemap or a supplementary URL list, prior to the launch of the crawl.

W3C compliance

In most cases, a website designed in accordance with W3C Web Accessibility guidelines should also be straightforward to archive.

Simple, standard web techniques should always be used when building a website. The World Wide Web Consortium (W3C) recommendations offer considerable flexibility for creativity without compromising functionality. Overly complex or non-standard website design increases the likelihood of issues for users, web archiving, and search engine indexing.

Website backups (as files)

“Dumps” or “backups” of websites from content management systems, databases, hard drives, CDs, DVDs, or any other external media cannot be accepted into the archive. Only snapshots captured directly by the web crawling system are eligible for inclusion.

Intranet and secured content areas

Content that is protected behind a login cannot be archived, even if login details are provided.

If content is hosted behind a login because it is not suitable for public access, it should remain there until any sensitivity has lapsed and it can be published on the open web. Alternatively, it may be appropriate to consult with the information management team to explore other preservation methods as part of the public record.
