Web Archiving at the UW-Madison Archives

About

The University Archives began web archiving efforts in 2007 to further support our mission by capturing select websites that meet our collection development policy. These could include websites developed and maintained by University departments, personal faculty websites that relate to their research or work for the University, student organization websites, and more. Our primary web archiving method is Archive-It, a subscription service from the Internet Archive described in more detail below. Donors also have the option of archiving a website by transferring raw digital web files (such as HTML, CSS, JS, and media files) and/or WARC (Web ARChive) files to us. Please review our Transfer Guidelines for more information on transferring digital records.
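
For donors who want to verify a WARC file’s contents before transferring it, the short Python sketch below shows one way to list the archived URLs it contains, using the open-source warcio library; the filename is a hypothetical example, not a required naming convention.

    from warcio.archiveiterator import ArchiveIterator

    # Open a (hypothetical) WARC file and print the URL of each archived
    # HTTP response it contains.
    with open('example-site.warc.gz', 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':
                print(record.rec_headers.get_header('WARC-Target-URI'))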

Our current priorities include sites slated for retirement or redesign, as well as sites for university leadership, communications, and major events. We cannot guarantee website capture for every request that falls outside these priority collecting areas.

How often we capture a website depends on how frequently it is updated and on its current relevance. Websites that change frequently, like news sites, are captured more often. For example, we capture the UW-Madison News site weekly and the Commencement site twice a year: in December and May.

Requesting a website capture

The University Archives is not currently crawling every university webpage, but anyone can request a capture of their website as long as the content meets our collection development policy. Staff will work with you to determine whether a site should and can be captured. Reasons to request a capture may include:

  • The site is not currently captured in our web archive collection
  • The site has been captured before but may need a more recent capture
  • The site will be redesigned or retired

Keep in mind that there may be better ways to capture content on a webpage depending on how the content has been shared. For example, a webpage that simply has a list of links to PDF newsletters would be better archived by transferring the PDFs directly to us.

To request a capture, please contact us with the subject line “Request to Archive Website.”

Please include the following information about the site:

  • The URL and a brief description of the scope of the website, organization, or cause
  • How content on the site fits into our collection development policy
  • How frequently portions of the site that meet our collection development policy are updated 
  • Date of retirement or redesign (if applicable)

Archive-It

Currently, our archived web pages are harvested, stored, and accessed through Archive-It, a subscription service from the Internet Archive. The University Archives selects websites to be crawled with Heritrix, a web crawler developed by the Internet Archive. The crawler captures web domains or individual web pages, taking a snapshot of each page at a point in time and storing a copy in the Internet Archive, where it can be “played back” through Archive-It and the Wayback Machine. The University Archives service plan with Archive-It allows crawls to be scheduled weekly, quarterly, semiannually, yearly, or as one-time captures.

Access to archived websites

Archive-It also provides access to our captured content via the University of Wisconsin-Madison collection page. The content is searchable by keyword and by facets such as subject, creator, date, and type. The sites are viewable through the Wayback Machine, the Internet Archive’s access tool.
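
As an illustration of how archived snapshots can also be looked up programmatically, the Python sketch below queries the Wayback Machine’s public availability API for the capture closest to a given date; the example URL and timestamp are assumptions chosen for illustration.

    import requests

    # Ask the Wayback Machine availability API for the capture of a page
    # closest to January 2015 (URL and timestamp are illustrative).
    resp = requests.get(
        'https://archive.org/wayback/available',
        params={'url': 'news.wisc.edu', 'timestamp': '20150101'},
    )
    closest = resp.json().get('archived_snapshots', {}).get('closest')
    if closest:
        print(closest['url'], closest['timestamp'])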

We do not currently create catalog or finding aid records for individual websites or their content, so discovery is limited to searching through Archive-It or the Wayback Machine.

Limitations of the tool

For a variety of reasons, the Archive-It tool may not capture every document on a website, and the archived version may not provide the same user experience when a site is “played back.” The Internet Archive strives to replicate the look and feel of websites by also collecting CSS and other supporting files and displaying the sites as they existed. While the University Archives and the Internet Archive work to preserve the authentic experience of a website, it is not always possible.

File formats and types typically captured through the tool include HTML, JavaScript, PDFs, images, and videos. Content that may not be captured includes database-driven or form-driven pages, forums, calendars, streaming media, password-protected sites, and sites that are protected by robots.txt files or that otherwise block web crawlers. The crawler also does not follow external links from UW pages, so clicking one of these links in an archived site results in a “Not Found in Archive” message.
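
Site owners who are unsure whether their robots.txt file would block an archival crawler can test it themselves; the Python sketch below uses the standard-library robot parser, with an illustrative domain and the ‘archive.org_bot’ user agent commonly associated with the Internet Archive’s crawler as assumptions.

    from urllib.robotparser import RobotFileParser

    # Fetch a (hypothetical) site's robots.txt and check whether a crawler
    # identifying itself as 'archive.org_bot' may fetch a given page.
    rp = RobotFileParser('https://www.example.edu/robots.txt')
    rp.read()
    print(rp.can_fetch('archive.org_bot', 'https://www.example.edu/news/'))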

Our service agreement with Archive-It limits how much data we can capture per year. When we reach that cap, we wait until the start of the next year to capture new content.

Copyright and Permissions

As with all of our collection materials, the University does not necessarily hold copyright for materials captured by our web archiving processes. Please review our Duplication and Use Policy for more information on copyright and the use of materials in our collections.