Overview
Appraisal in Web Archives
Research Question
Methodology
Findings
Future Work
How much of the web is in the Internet Archive?
273,000,000,000 [1] / 1,000,000,000,000 [2] = 0.273 ???
1. Alpert, J. and Hajaj, N. (2008). We knew the web was big... Google.
2. Goel, V. (2016). Defining Web pages, Web sites and Web captures. Internet Archive.
So I'm curious, just by a show of hands: how many people have
heard of the Internet Archive before? How many of you have had
occasion to use it?
510 billion objects or web captures.
361 million websites
15 petabytes of storage
David Rosenthal of Stanford University has predicted that by
2018 it will cost more than the total GWP (Gross World Product)
to store that year's data on Amazon S3.
Archival coverage of the NYTimes homepage in 2016.
The Internet Archive uses automated agents, spiders or
bots to automatically crawl the Internet and archive material.
You can see this in their coverage of the NYTimes homepage
for 2016: on December 13th alone there were 103 snapshots of
just the homepage itself.
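As an aside, this kind of coverage can be checked programmatically. Here is a minimal sketch, not part of the study itself, that uses the Wayback Machine's public CDX API to count captures of the NYTimes homepage on that day; exact counts will vary as new captures are added.

    import requests

    # Count Wayback Machine captures of the NYTimes homepage for a single day
    # using the public CDX API (output=json returns a header row first).
    resp = requests.get(
        "https://web.archive.org/cdx/search/cdx",
        params={
            "url": "nytimes.com",
            "from": "20161213",
            "to": "20161213",
            "output": "json",
            "fl": "timestamp,statuscode",
        },
    )
    captures = resp.json()[1:]  # drop the header row
    print(f"{len(captures)} captures of nytimes.com on 2016-12-13")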
Archival coverage of Igor Strelkov's VKontakte profile in 2014.
This is the coverage for Igor Strelkov's profile page on VKontakte,
a popular Russian social media site. Strelkov is
currently being sued by the families of 18 passengers on MH17
because he gave the order to shoot down the civilian aircraft
thinking it was a Ukrainian military plane. We only know
of his role because he posted on VKontakte about shooting down the
aircraft, but deleted his post and his account on learning that it
was in fact a civilian aircraft.
Anatol Shmelev is a research fellow, curator of the Russia and
Eurasia Collection, and the project archivist for the Radio Free
Europe/Radio Liberty Collection, at the Hoover Institution.
In mid-July you can see that Anatol added this webpage to a list of
URLs to be routinely crawled by Archive-It, to become part of a
Ukrainian Conflict collection.
Archive-It: 400 partner organizations in 48 U.S. states
and 16 countries worldwide.
Appraisal
The process of identifying materials offered to an archives that
have sufficient value to be accessioned.
Appraisal in A Glossary of Archival and Records Terminology. Society of American Archivists.
For over a hundred years, archivists have developed appraisal theory,
which provides practical guidance on the selection of material for
archives. Appraisal theories have often been pragmatic in nature,
because they are anchored in the contingencies of the printing and
publishing technologies available at the time.
As we just saw in the previous examples, these theories are now being
adapted to the web environment. Some appraisal decisions are enacted
by archivists making selection decisions, and others are being
inscribed in software (as in the case of the NYTimes) and attached
to external data flows.
RQ: How is appraisal being enacted in web archives?
Selection strategies in web archiving
Socio-technical factors that influence
selection practices
This leads me to our research question of how appraisal is
being enacted in web archives. Currently the process is a
black box. The collaboration between human and automated agents
forms a socio-technical system. We wanted to unpack this dynamic
in order to inform selection strategies in web archiving, and
also to better understand the socio-technical environment of
web archiving, to hopefully inform the design and implementation
of such systems.
Kitchin, R. (2016). Thinking critically about and researching algorithms. Information, Communication & Society, 1–16.
Source Code
Reflexively producing code
Reverse engineering
Design & designers
Socio-technical assemblage
The world
One important touchstone in our research is Rob Kitchin's
framework for thinking critically about algorithms. He
synthesizes much of the work being done in algorithm studies by
folks like Nick Diakopoulos, Stuart Geiger, Nick Seaver,
Tarleton Gillespie, Alexander Galloway and others to identify
various levels at which you can study algorithmic systems, and
methodologies for studying them. In our study we were particularly
interested in the socio-technical assemblage of web archiving, and
how it is being used by archivists, for which Kitchin suggests
ethnography and interviews as useful methods.
Methodology
39 contacted (email)
33 responded
28 interviewed
F (13) / M (15)
university, non-profit, library/museum
archivists, developers, researchers
semi-structured interviews
memoing + field notes
coding / thematic analysis
We started with people from the Web Archives 2015: Capture, Curate
and Analyze conference at the University of Michigan, and did
snowball sampling from there.
The process of inductive thematic analysis performed in this
study relied on the use of field notes and personal memos.
The analysis began by reading all of the field notes together,
and then returning to do line-by-line coding. While
coding was done without reference to an explicit theoretical
framework, it was guided by our own interest in appraisal
theory as a socio-technical system that entangles the archivist
with the material of the web and automated agents.
I will now describe these themes:
Findings
During coding and analysis of our field notes we were able to
extract six general themes. The themes were interconnected and
interdependent, and fell into roughly two groups, the social and
the technical, which you see pictured here in different colors.
They gave us a picture of the socio-technical environment of
appraisal in web archives.
The picture is a bit of a tangled
hairball, but this speaks to what Kitchin calls the data
assemblage, which "encompasses all of the technological,
political, social and economic apparatuses and elements that
constitutes and frames the generation, circulation and
deployment of data"; in this case, web archive data.
Technical
Crawl Modalities
domains, websites, topics, events, documents
Information Structures
hierarchies, networks, streams
Tools
services, storage systems, open source utilities, spreadsheets,
forms, email, issue trackers
First I will take a look at the technical themes.
We see them
grouped here. Crawl modalities refers to the selection
strategies implemented in web archiving tools and chosen by
archivists in their work.
Information structures refers to specific formations of web
content that archivists interacted with using their tools.
Deepwater Horizon Oil Spill
And tools refers to the specific tools that they used to carry
out the process of selection.
Social
People
teams, lone-arrangers, developers, collaborations
Time
limits, scheduling, always-on, reading, reviewing
Money
grants, subscriptions, infrastructure
Our interviews also uncovered some expected groupings
of social entities that were involved in the selection
of web content for archiving.
The first obvious one is people: sometimes teams
(managers, field archivists and technicians), other times
individuals working by themselves (lone arrangers), developers
contributing to open source projects, and volunteer organizations
such as ArchiveTeam. There were also organizational collaborations
such as the IIPC that provided a setting for web archiving to take
place.
In our paper we lumped time and money together, but I've
separated them here in the diagram because they really are
different. They were interesting because of the way they abstract,
commensurate, and make appraisal decisions legible.
Time was made manifest in schedules for how often to
archive particular sites, how long to run a particular crawl
operation, and how quickly archival captures needed to be performed.
Money or financial resources were often mentioned in terms
of grant money from foundations and agencies for building
collections, subscription fees for services, and the relationship
between money and storage.
Breakdown & Repair
Infrastructural inversion (Bowker) -- foregrounding background
elements of work practice.
Breakdown and repair is another theme that emerged during our
analysis, which can be seen in the many tangled connections in
the diagram. I'm thinking of work in infrastructure studies where
breakdown is used as a way to identify relationships in otherwise
boring, mundane and consequently invisible systems (Star). And by
repair I'm invoking some of the recent work by Steven Jackson,
who identifies sites of breakdown as opportunities for repair and,
more importantly, design and innovation.
Breakdowns between People and Tools could be seen in the use of
external tools such as email, spreadsheets, and forms to provide
features missing from web archiving tools, such as documenting
provenance, communication, and collaboration.
Breakdowns in the relationship between crawling modalities and
information structures could be seen in topical collections.
For example, a collaborative focus on fracking by
archivists in NY and PA was problematized when corporations
extended across state boundaries and internationally. Where should
the scope of the collection end? The archivists improvised
communication tools to track selection.
There were also breakdowns in legibility, where archivists were not
able to determine the extent of websites beforehand, and used test
crawls in Archive-It to try to determine the size of websites by
crawling for a period of time and examining how many URLs were
left uncrawled using host reports. They adjusted scoping rules
in order to eliminate redundant content or deep wells of data that
they weren't adequately funded to collect.
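To give a flavour of what that looks like in practice, here is a small illustrative sketch; the file name and column names are my own assumptions, not Archive-It's actual report format. It shows the kind of check an archivist might script over a test-crawl host report to spot hosts that need new scoping rules.

    import csv

    # Hypothetical host report from a test crawl: one row per host, with
    # counts of URLs already crawled and URLs still queued (left uncrawled).
    with open("host-report.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    # Flag hosts that look like deep wells of data: far more queued than crawled.
    for row in sorted(rows, key=lambda r: int(r["queued"]), reverse=True):
        crawled, queued = int(row["crawled"]), int(row["queued"])
        if queued > 10 * max(crawled, 1):
            print(f"{row['host']}: {queued} queued vs {crawled} crawled; "
                  "consider a scoping rule to limit or exclude this host")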
Breakdown and repair is fundamental to the archival enterprise,
which is ultimately an ongoing process of repairing documentary
breakdowns in order to shape our knowledge of the past and
social memory.
Future Work
Kitchin, R. (2016). Thinking critically about and researching algorithms. Information, Communication & Society, 1–16.
Source Code
Reflexively producing code
Reverse engineering
Design & designers
Socio-technical assemblage
The world
Trace ethnography work with Internet Archive's
save-page-now functionality
Take a closer look at the ways in which tools are used to
repair web archiving systems
Case studies with some of my informants to take a much
deeper dive into actual practices that are used by
web archivists.
Interviews with designers of web archiving systems,
specifically ones who implemented algorithms or data flows
for appraisal
Look at the use of the Internet Archive and other web archives
by journalists, lawyers and historians
An example of a seed list from Archive-It
The seed list is a singular technical artifact in the appraisal
of web content since it provides a list of resources to be
archived by software agents responsible for fetching and saving
the content. Sometimes archivists put URLs on these lists, and
sometimes other software puts URLs on them.
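The seeds themselves are often nothing more than a flat list of URLs. Purely as an illustration (the URLs and the normalization rules here are hypothetical, not taken from any partner's collection), here is a short sketch of tidying such a list before handing it to a crawler:

    # Hypothetical seed list: one URL per line, possibly with duplicates,
    # fragments, and trailing slashes added by different people and tools.
    seeds = [
        "http://example.edu/activism/",
        "http://example.edu/activism",
        "https://example.org/news#top",
    ]

    def normalize(url: str) -> str:
        # Drop fragments and trailing slashes so equivalent seeds collapse.
        return url.split("#", 1)[0].rstrip("/")

    unique_seeds = sorted({normalize(u) for u in seeds})
    for seed in unique_seeds:
        print(seed)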
Star, S. L. (1999). The ethnography of infrastructure. American
Behavioral Scientist, 43(3): 377–391.
Embeddedness : infrastructure is part of other structures,
arrangements, and technologies
Transparency : infrastructure is transparent to use
Reach/Scope : infrastructure has reach beyond a particular
site or event
Learned as part of membership : new participants acquire a
naturalized familiarity with the objects of infrastructure
Links with practice : infrastructure is shaped by and also
shapes communities of practice
Standardization : infrastructure achieves scope through
approaches to standardization
Built on installed base : infrastructures are built upon
layers of older base systems
Becomes visible on breakdown : the workings of
infrastructure become visible when it breaks
Is fixed incrementally : changes and modifications are
accreted over time, and not globally changed in one go
It was part of that same sort of ecosystem of networks.
It became clear to me through that process how impor-
tant that network is becoming in collecting social move-
ments moving forward. It was interesting watching peo-
ple who had been doing collecting for decades in activist
networks that they were a part of, and then these new
activist networks. . . there wasn’t a whole lot of overlap
between them, and where there was overlap there was
often tension. Unions really wanted in on Occupy and
young people were a little bit wary of that. So social
media networks became really important.
I definitely remember there was a lot of trial and error.
Because there’s kind of two parts. One of them is block-
ing all those extraneous URLs, and there were also a lot
of URLs that are on the example.edu domain that are ba-
sically junk.
The archiving is by the minute. So if I post something,
and then edit it in five minutes then it is archived again.
If someone comments on something and then another
person comments it is archived again. You don’t miss
anything. A lot of the other archiving companies that
we’ve talked to say they archive a certain number of
times a day: maybe they archive at noon, and at 5, and at
midnight, and there’s an opportunity to miss things that
people deleted or hid.
I went back to the developer and asked: could you give
me a tally of how many videos have had 10 views, how
many videos have had 100 views and how many videos
have had a 1000 views? It turned out that the amount
of videos that had 10 views or more was like 50-75 TB.
And he told me that 50% of the videos, that is to say 500
TB had never been viewed. They had been absorbed and
then never watched. A small amount had been watched
when they were broadcast and never seen again.