Bots, Seeds and People

Web Archives as Infrastructure


Ed Summers
University of Maryland
@edsu / ehs@pobox.com

Ricardo Punzalan
University of Maryland
@archivalflip / punzalan@umd.edu

slides: http://bit.ly/bots-seeds-people
paper: https://arxiv.org/abs/1611.02493v1

Overview

  • Appraisal in Web Archives
  • Research Question
  • Methodology
  • Findings
  • Future Work

How much of the web is in the Internet Archive?

273,000,000,000 [1] / 1,000,000,000,000 [2] = .273 ???

1. Goel, V. (2016). Defining Web pages, Web sites and Web captures. Internet Archive.
2. Alpert, J. and Hajaj, N. (2008). We knew the web was big... Google.

NYTimes
Archival coverage of the NYTimes homepage in 2016.
VK
Archival coverage of Igor Strelkov's VKontakte profile in 2014.

Appraisal

The process of identifying materials offered to an archives that have sufficient value to be accessioned.


Appraisal in A Glossary of Archival and Records Terminology. Society of American Archivists.

RQ: How is appraisal being enacted in web archives?

  • Selection strategies in web archiving
  • Socio-technical factors that influence selection practices

Kitchin, R. (2016). Thinking critically about and researching algorithms. Information, Communication & Society, 1–16.

  1. Source Code
  2. Reflexively producing code
  3. Reverse engineering
  4. Design & designers
  5. Socio-technical assemblage
  6. The world

Methodology

  • 39 contacted (email)
  • 33 responded
  • 28 interviewed
  • F (13) / M (15)
  • university, non-profit, library/museum
  • archivists, developers, researchers
  • semi-structured interviews
  • memoing + field notes
  • coding / thematic analysis

Findings

Technical

Crawl Modalities
domains, websites, topics, events, documents

Information Structures
hierarchies, networks, streams

Tools
services, storage systems, open source utilities, spreadsheets, forms, email, issue trackers

Social

People
teams, lone arrangers, developers, collaborations

Time
limits, scheduling, always-on, reading, reviewing

Money
grants, subscriptions, infrastructure

Breakdown & Repair

Future Work

Kitchin, R. (2016). Thinking critically about and researching algorithms. Information, Communication & Society, 1–16.

  1. Source Code
  2. Reflexively producing code
  3. Reverse engineering
  4. Design & designers
  5. Socio-technical assemblage
  6. The world

Thanks!

fix-it by Derek Bridges

Extras

An example of a seed list from Archive-It

Star, S. L. (1999). The ethnography of infrastructure. American Behavioral Scientist, 43(3), 377–391.

  • Embeddedness: infrastructure is part of other structures, arrangements, and technologies
  • Transparency: infrastructure is transparent to use
  • Reach/Scope: infrastructure has reach beyond a particular site or event
  • Learned as part of membership: new participants acquire a naturalized familiarity with the objects of infrastructure
  • Links with practice: infrastructure is shaped by and also shapes communities of practice
  • Standardization: infrastructure achieves scope through approaches to standardization
  • Built on installed base: infrastructures are built upon layers of older base systems
  • Becomes visible on breakdown: the workings of infrastructure become visible when it breaks
  • Is fixed incrementally: changes and modifications are accreted over time, and not globally changed in one go

It was part of that same sort of ecosystem of networks. It became clear to me through that process how important that network is becoming in collecting social movements moving forward. It was interesting watching people who had been doing collecting for decades in activist networks that they were a part of, and then these new activist networks... there wasn’t a whole lot of overlap between them, and where there was overlap there was often tension. Unions really wanted in on Occupy and young people were a little bit wary of that. So social media networks became really important.

I definitely remember there was a lot of trial and error. Because there’s kind of two parts. One of them is blocking all those extraneous URLs, and there were also a lot of URLs that are on the example.edu domain that are basically junk.
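
The two-part scoping described in this quote (blocking extraneous URLs, then excluding junk URLs on the example.edu domain itself) could be sketched roughly as follows. This is an illustration only, not the tool the interviewee used: the function name `in_scope`, the junk patterns, and the sample URLs are all hypothetical.

```python
import re
from urllib.parse import urlparse

# Hypothetical scope: only hosts on this domain are wanted.
SEED_DOMAIN = "example.edu"

# Hypothetical patterns for on-domain URLs that are "basically junk",
# e.g. endless calendar pages and session-id query strings.
JUNK_PATTERNS = [
    re.compile(r"/calendar/"),
    re.compile(r"[?&]sessionid="),
]


def in_scope(url: str) -> bool:
    """Return True if the URL should be crawled under the assumed rules."""
    host = urlparse(url).hostname or ""
    # Part one: block extraneous URLs that are off the seed domain.
    on_domain = host == SEED_DOMAIN or host.endswith("." + SEED_DOMAIN)
    if not on_domain:
        return False
    # Part two: drop on-domain URLs that match a junk pattern.
    return not any(p.search(url) for p in JUNK_PATTERNS)


candidates = [
    "http://www.example.edu/news/story.html",
    "http://ads.example.com/banner.js",
    "http://www.example.edu/calendar/2016/11/",
    "http://www.example.edu/page?sessionid=abc123",
]
kept = [u for u in candidates if in_scope(u)]
```

In practice the "trial and error" the interviewee mentions would correspond to repeatedly adjusting a pattern list like `JUNK_PATTERNS` after reviewing crawl results.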

The archiving is by the minute. So if I post something, and then edit it in five minutes then it is archived again. If someone comments on something and then another person comments it is archived again. You don’t miss anything. A lot of the other archiving companies that we’ve talked to say they archive a certain number of times a day: maybe they archive at noon, and at 5, and at midnight, and there’s an opportunity to miss things that people deleted or hid.
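
The contrast drawn in this quote, between by-the-minute capture and capture at a few fixed times a day, can be made concrete with a small sketch. It assumes a post is only captured if some capture happens while the post is visible; the function name `captured`, the schedules, and the example post times are hypothetical and do not describe any vendor's actual system.

```python
from typing import Sequence

MINUTES_PER_DAY = 24 * 60


def captured(created: int, removed: int, capture_times: Sequence[int]) -> bool:
    """True if any capture falls while the post is visible.

    Times are minutes since midnight; the post is visible from
    `created` (inclusive) until `removed` (exclusive).
    """
    return any(created <= t < removed for t in capture_times)


# Two assumed schedules: one capture per minute, versus captures at
# midnight, noon, and 5 pm as in the quote.
every_minute = list(range(MINUTES_PER_DAY))
three_times_daily = [0, 12 * 60, 17 * 60]

# Hypothetical post: published at 14:00 and deleted 20 minutes later.
created, removed = 14 * 60, 14 * 60 + 20

seen_by_minute = captured(created, removed, every_minute)
seen_by_schedule = captured(created, removed, three_times_daily)
```

Under these assumptions the by-the-minute schedule records the post, while the three-times-a-day schedule misses it because no capture falls inside the 20-minute window.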

I went back to the developer and asked: could you give me a tally of how many videos have had 10 views, how many videos have had 100 views and how many videos have had a 1000 views? It turned out that the amount of videos that had 10 views or more was like 50-75 TB. And he told me that 50% of the videos, that is to say 500 TB had never been viewed. They had been absorbed and then never watched. A small amount had been watched when they were broadcast and never seen again.
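
The kind of tally requested in this quote (how many videos, and how much storage, fall at or above each view threshold, and how much was never viewed) might look something like the sketch below. The `Video` record, the `tally` function, and every number in the sample data are made up for illustration; they are not the figures from the interview.

```python
from dataclasses import dataclass


@dataclass
class Video:
    views: int      # number of times the video has been viewed
    size_tb: float  # storage used by the video, in terabytes


def tally(videos, thresholds=(10, 100, 1000)):
    """Map each threshold to (count, total TB) of videos with at least
    that many views, plus a "never viewed" entry for zero-view videos."""
    report = {}
    for n in thresholds:
        matching = [v for v in videos if v.views >= n]
        report[n] = (len(matching), sum(v.size_tb for v in matching))
    unviewed = [v for v in videos if v.views == 0]
    report["never viewed"] = (len(unviewed), sum(v.size_tb for v in unviewed))
    return report


# Hypothetical sample data, chosen only to exercise the function.
videos = [
    Video(views=0, size_tb=2.0),
    Video(views=0, size_tb=1.0),
    Video(views=12, size_tb=0.5),
    Video(views=150, size_tb=0.25),
    Video(views=2000, size_tb=0.25),
]
report = tally(videos)
```

A report of this shape is what would let an archivist weigh storage cost against use when deciding what to keep.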