Concerns for Data Scholarship

>>From the Library of
Congress in Washington, DC. [ Silence ]>>Bergis Jules: Thank
you for having me. It’s right after lunch;
people are all energized. And now I’m going to depress
you with all my concerns about collecting social media data. I’m actually really
excited that, you know, this session is part
of the program today. Because you know, we don’t do
enough thinking around, you know, the effects of doing this stuff
and how to protect people, so I’m really glad to see that
this panel is part of the program. And I’m glad to be part of it. So where is the — okay. Alright, so at a basic level,
Documenting the Now is a project to build free and open source tools
that are easy to use for collecting, analyzing, and sharing Twitter data. It’s a collaborative project between
UC-Riverside, the Maryland Institute for Technology in the Humanities
at the University of Maryland, and Washington University
in St. Louis. The DocNow team, our development
team, has been hard at work over the past ten months, you know, really trying to build
something great that everyone will be able to use. And I’m really excited
for what’s to come. But today I’m going to focus less
on the technical part of our work. What’s been really exciting —
what’s been the most exciting part of the project, in my opinion,
is how much people from all kinds of backgrounds have engaged
with some of the ideas that we’ve been addressing. And I think those ideas
can help us address some of the more serious implications
for building collections of data, especially as they relate
to social media data. [ Silence ] So because our work was inspired
by the activism and protest that followed the police killing
of Michael Brown in Ferguson, Missouri in 2014, I
think from the beginning of the project we felt we
had a responsibility, really, not to forget that there are in
fact people behind all this data. That’s why DocNow has such a focus
on the ethics of collecting this type of content for long-term
preservation. We’re really interested
in how our building of these collections might
affect people’s lives. It’s also why we’re being really
transparent with our work, while at the same time trying to
help build a community of people who also value these ideas. So really, you know,
at the higher level, DocNow is about a couple
of things, in my mind. It’s about valuing
people enough to care about how we collect
and store their data. And it’s about helping to build a
community of archives professionals and other folks committed to
engaging with content owners and creators in equitable and safe
ways as we collect their data. These two things are priorities above our technical
work on the project. And a lot of credit really
has to go to Ed Summers for the project being
framed this way. Ed has been my partner in crime over the past couple
years doing this work. And he’s also a principal
investigator on the DocNow Project. [ Silence ] So we all agree that there’s an
immense value in social media data, especially as it relates to our
work in archives and libraries. I’m primarily interested in
collecting that type of data, especially Twitter, because I think
it presents tremendous opportunities to document some aspects — some aspects of African-American
history and culture. For example, this is a Pew — a graph from the Pew
research that talked about how young African-Americans
are among the highest users of Twitter especially, right? So a large number of
African-Americans have found a space where they feel free
to share and engage in issues that matter to them. And especially considering
how little information we hold in our traditional collections
about African-Americans. You know, I think this is a good
opportunity to at least learn about some of those issues, if
not collect data about them. [ Silence ] So we’ve also seen the value that
platforms such as Twitter have had in amplifying voices, right? In the current movement
for black lives, we’ve seen it in the Arab Spring, and
several other social justice events that have played out online. This is a screen shot from the
brilliant work of Deen Freelon, Meredith Clark, and
Charlton McIlwain, “Ferguson, Black Lives Matter, and the Online
Struggle for Offline Justice.” I highly suggest you check
it out if you get a chance; it’s a 90-something-page report
about the impact that Twitter and other social media had on
spreading the message in Ferguson. So all of this is good. But we should also acknowledge
the significant responsibility and embrace the challenges that
come with collecting, preserving, and making that kind
of data accessible. And we should be prepared
to do this work in ways that don’t compromise people’s
safety, disregard their rights as content owners and creators,
or present their data in ways that distort people’s
original intent. Because we know we’re not
the only ones interested in collecting this
kind of data, right? So this is a screen
shot of an email from — the ACLU-California last
week published a report about the growing use of
social media collection tools by law enforcement, especially
local law enforcement. And, you know, they were able to get
their hands on a lot of documents for this, and this is a screen
shot from an email that a company, Geofeedia, which is really
popular in this space right now, is sending out to its
clients about Ferguson. It’s saying, you know, hey,
we have a lot of good stuff on Ferguson protestors here,
click here to view the collection. [ Silence ] And, you know, here I think the ACLU
articulates really well, you know, the danger to people of color
and other minority groups when police start using
these kinds of tools, right? And for those in the back,
I’ll go ahead and read this. “The racist implications of social
media surveillance technology are not surprising. We know that when law
enforcement gets to conceal the use of surveillance technology, they
also get to conceal its misuse. Discriminatory policing
that targets communities of color is unacceptable,
and secretive, sophisticated surveillance
technologies supersize the impact of racial profiling
and abuse.” Right? So there are some real issues
we have to consider here. How will our collections of
social media data be different than those built by law enforcement
or private security firms? I think that’s a question
we all need to think about. Here’s an image of two
prominent activists who became well known during
the Ferguson protests. Here they’re being labeled as threat
actors by a private security firm. This is DeRay Mckesson on
the left, and Johnetta Elzie. I think this is in relation to the
Baltimore uprising when they found out that this company
especially, ZeroFox, was doing a lot of data collection for
local law enforcement. So it’s a scary situation. Because these companies are
increasingly interested in this data as a way to punish people
for being active citizens. When Ed Summers — and I’m
dropping Ed’s name a lot here today. You’re going to owe
me a beer after this. When Ed Summers first
published a blog post about the Ferguson Twitter data set
we collected during the first month of that event, a private security
firm was one of the first groups to reach out to him asking if they
could access that data, right? It’s a real concern for people
in our profession to be aware of. So for example, how do we make sure that the massive Twitter data
archive being built right at the Library of Congress
doesn’t become a tool that these groups can use against
already marginalized people, right? Whose only request is that
the police stop killing them. So how will the Library respond to
requests from private security firms and law enforcement for that data? Part of the answer is that
we have to engage directly with people generating social media
data to understand how our work in collecting this type of
data might affect their lives. I think that will be an
important way for us to come up with some policies around offering public access
to some of this content. It’s a difficult task,
but I think it’s possible. And, you know, it’s especially tough
because of the number of people who can engage with an issue
on a social media platform at any given time, right? That number can be daunting. I’m not sure how large the
LC Twitter data set is now, but I’m sure it’s significant. And just in the past couple weeks, Ed has collected almost
two million tweets related to the police killings of
Keith Scott in Charlotte, and Terence Crutcher in Tulsa. So it’s a big job, but
I think it’s possible. There are ways to engage if
we’re willing to drop some of our traditional models
of building collections that prioritize our ideas about
professionalism and the myth of neutrality over the wishes
of people and communities. [ Silence ] So last month the DocNow Project
hosted our first advisory board meeting in St. Louis. And this was an opportunity to get our awesome
advisory board members — I’m sure some of these names or
these faces are familiar to you — to get together for
a deep dive into many of the issues we’ve been raising
around social media, web archiving, ethics, and technology
over the past year. And, you know, this was a really — we had six really great
panels of insightful and challenging discussions. And I really hope you’ll
check them out on our website, DocNow.io when you get a chance. Because I don’t think
this type of — this group of people really have
been brought together before in the context of sort of archiving
digital media or web archives. So definitely check it
out if you get a chance, it’s in the meetings
page on our website. But I want to focus on
just one of those panels for the last few minutes, because
I think it’s a great example of the type of community work
we’re going to need to engage in in the future if we want to
continue building these types of data sets as archives. The panel was made up of
four activists from Ferguson who were some of the many
organizers in Ferguson after Michael Brown’s killing. It included Alexis
Templeton, Rasheen Aldridge, Kayla Reed, and Reuben Riggs. And it was really expertly
moderated by Dr. Jonathan Fenderson, who is a faculty member
at Washington University. And I was really thankful for
the activists for joining us. Because, you know these are people who have nine-to-fives;
they’re students. And they took time out of
their day to be with us for a good chunk of the day. And so we really appreciated that. And, you know, they really added a
richness and a realness to the event that we wouldn’t have had otherwise. You know, and they talked about
their lives before Ferguson, they talked about their lives during
Ferguson, and their lives now. And there were some really difficult
conversations that happened on that panel; it’s about
an hour and a half long, our longest panel of the day. And, you know, I think that
there’s some real good stuff in there, so check it out. I’m seeing my five minutes cue here,
so I’ll go ahead and stop here. But I want to share a short
clip from that panel with you; it’s about six minutes long, so
— and I’ll just leave it open and then we can get into
a conversation that way. So let’s see if this will
work; it worked during testing. [ Silence ] So you all are in a roomful of
people who are really interested in documenting the now, right? To try to be able to
capture what’s happening, particularly with social
movements as it’s happening, right? And so one of the things that’s
really interesting is that — sorry, you’re going to
hear from Alexis first. And Kayla second. This is a story that
you all helped create, helped make into a global story. But at the same time it’s a story — and in many ways attempted to
control via Twitter, right? But at the same time it’s a
story that’s become so big that it’s also beyond any
of your control, right? And so one of the questions I
really wanted to get at was, how do you all want the
movement to be remembered? And what are some of the things that
you would like people doing research around your lives, your work? What are some of the things you
want them to be conscious of? You can’t stop them, right? But what are some of the things
you would want them to keep in mind as they’re doing that work?>>Can I go first?>>Yes.>>[Inaudible], alright, cool. So like, really iffy about — I
guess I’m more so just talking to like the black folks in the room. When documenting the movement, like
internally, like, don’t wash out, like, the internal politics of it. And like the argument, like,
don’t wash that stuff out. Don’t wash away the homophobia,
don’t wash away the sexism, like don’t wash away the misogyny, like don’t wash away
those front line stories about what was happening
to people like internally, and how people felt unsafe like with
people they were fighting next to, like don’t leave that out. Because we have, in all our
movements and it leaves people so scorned and so hurt
and so bitter. Which is why a lot of the
arguments happen online. Which is why like, you know,
they don’t really bother me. Because it also humanizes
our movement. And I don’t think we’ve
got that with a lot of — I don’t think we’ve gotten that with
any black movement that we’ve had. They’ve never been humanized. They always felt they had to be
perfect when fighting the system, and we just fucking aren’t, right? We just aren’t; we’re depressed. We like to drink and
smoke to cope, like. We curse, like we have kids, we
have jobs, like we’re married, not, like we’re scorned, like whatever. Like we’re human, you know? And like I don’t want
people to forget that. Like actually humanize
the people out there. These everyday people, because
they’re literally like just like you all in this room. And like we can’t — we just
can’t wash them out of the story.>>Bergis Jules: Real quick,
I saw the five-minute sign, so we’ve got to wrap it tight.>>Yes, so for me,
that’s really important. And the other thing I would add to
that, the other thing I would add to that is just the idea of
what Reuben was hinting to, that social media at first was
this super democratized space, but it somewhat perpetuates elitism. That for those of us who have
a — like I have like 15,000, Alexis has whatever, it changes. But it’s a lot. And so people, some like
mistakenly put your value at how many people
choose to follow you. And leadership isn’t based on how
many followers you have online, leadership is based on how
many people you’re invested in in developing on the ground. And so some of my — the most
important things that I feel like I accomplished
in the last few years, maybe weren’t told on
social media, right? Because they didn’t need
to be told on social media. And my development into
self and our relationships, like those stories aren’t
told on social media. And so just being conscious of
the fact that for every face that you see that has been
lifted up, that we are standing on the shoulders of
hundreds of giants that we’ll never even see again. And Ferguson wasn’t because the four
of us just came together in a huddle and said we’re going to
fight for black liberation. It’s because thousands of
people in St. Louis said, yo, this is messed up; and I’m
not going to let this stand. And we see those people
every time we go to Target, hopefully when you
don’t go to Walmart. When you go into your local fast
food spot, you see those people who for a moment say, I’m going
to give all of myself to a space, and then go back to life. And so we’ve been ethically
blessed; like we worked for it, but also part of it is just this
crazy equation that happened that put us in an elevated space. And so our job in that elevated
space is to always fight for those who aren’t in those spaces
and represent them well. And I think we really try to
do that, but that’s your job, it’s to tell those stories,
to find those people. And if you can’t find those
people, at least create space for that narrative to be
built that like for the rest of my life I’ll owe
people I’ll never know. And I take that shit
so seriously, you know? Like in every space I represent,
it’s like my mom and my grandma are like behind some window I can’t see;
that’s the way I need to behave. Because people literally
sacrifice their lives and comfort for one person. And we have somewhat
benefited from it, right? And somewhat been super
traumatized by it. But — and in the benefit
has come more trauma. But like accept that and
like tell our whole selves. And so I made this
commitment in therapy — because I didn’t go to
therapy before Ferguson. But I made this commitment
that said like I will love — I commit to my community and
myself to love us in our entirety. And that’s the non-romantic story; that’s the “we were hungry,
we didn’t have food.” That’s the Subway lady
giving us a sandwich when we didn’t have no money. That’s then her getting cancer and
putting up a GoFundMe and reaching out to Ferguson activists
who have become — have platforms now to say,
you all, I have cancer. When she would come out every
single night and ask us, did we need to use the restroom
again before the Subway closed? That’s a story that a
lot of people don’t know, but was so important
to just how we sustain. That that kind of community existed.>>Bergis Jules: Alright. Okay, where is — alright,
so that’s all I have. And look forward to the discussion. [ Applause ]>>Nicole Saylor: Oh, hello. It’s an extremely tough act
to follow; thanks for that. I want to thank NDI, Kate for this
wonderful opportunity to participate in such a terrific event. So my talk is a rehash of a
paper that we put together for the upcoming iPRES conference. And so it’s less about the ethical
considerations of this work, and more about the
technical side of the work. But I’m certainly happy to
talk about ethics in the Q&A. So I’d like to give a shout out to
the co-authors, and I invite them to chime in to correct and
expand as needed; I welcome that. I also want to shout out to
Julia Kim and Melissa Lindberg in the American Folklife Center,
who are the ones in receipt of these big piles of ones
and zeros, and they push them through our system for preservation. And we couldn’t do it without them. I also want to take a
minute at a forum like this to acknowledge the AFC
Director, Betsy Peterson, and other senior leaders in
Library Services and in Web Services for sort of recognizing the
importance of these projects and creating conditions where
this kind of experimentation and innovation can exist. So today I’ll be talking about a collaboration among
my library colleagues to pull down web content for
preservation and access that moves us beyond classic
web archiving methods. And I’ll focus on two community
collecting projects, one to solicit and harvest community-generated
photos off of Flickr. And then another is to gather
oral narratives generated through a relatively new
StoryCorps mobile application. So both projects involve
a public call to action, and they prompt public engagement
in building library collections. They leverage our partnerships or — by leveraging or partnering with
third party software platforms, these efforts allow us to focus on
preservation and long-term access of records while still
supporting immediate and dynamic engagement
with communities. Kate got into it a little bit,
but just a brief background on the American Folklife Center. We are a large ethnographic
archive with more than five million items
— closer to six. And these collections include
extensive audio-visual documentation of traditional arts, cultural
expressions, oral histories. And they offer researchers
access to songs, stories, and other creative expressions of
people from diverse communities. And as mentioned they range from
wax cylinders created in 1890 that we talked about
yesterday at the conference in the other building, to the born-digital StoryCorps
collection, among many others. And we have pretty
much every technology, AV technology in between
represented in our collection. I’ll mention too that
we’re perhaps best known for our collection
of field recordings. These performances of
songs, instrumental music by little-known grassroots
musicians. But also by famous musicians such as
Elizabeth Cotten and Molly Jackson, Woody Guthrie, Lead
Belly, Jelly Roll Morton, Burl Ives, Johnny Cash, and so on. Some of whom did very
famous performances right on this very stage that
are in our archives. Alright, so moving ahead — maybe — The American Folklife
Center was created in 1976 by an act of Congress. And we have a wide
ranging mandate to preserve and to present folk life. And so I’m showing you just
a snippet of the verbiage from the legislation, which
defines what folk life is, and sort of sets our tone
for collecting scope, which I would say is broad if
not wildly ambitious to try to get all of this documented. So I want to point out that the
digital collecting initiatives that I’m going to talk about
are part of a long history of the archives of
soliciting public collaboration in documenting the cultural record. Even with the founding of the
archives, Robert Winslow Gordon went out and recorded singers to
document American folk song. Of course, Alan Lomax,
right after Pearl Harbor — who was then the assistant in charge
of the archive, sent a telegram to field workers in various
locations throughout the United States telling them to please
do man-on-the-street interviews to get people’s reactions
to what had happened. We have since built on that
tradition, with doing the same for 9/11, for sermons and orations that were commemorating
the inauguration of the first African-American
President, and so on. [ Silence ] So of course with the proliferation
of smart phones, tablets, and wireless internet connections, networked communication
is increasingly where the cultural record
is being documented. And so AFC staff have
long recognized this need to preserve folk expression on the
web, and until now have had nothing but paper tools to do it. So you’ll see, we have a
very respectable collection of scam emails that are
in boxes, printed out. But anyway, we’ve since evolved
and in June 2014 we worked with Abbie Grotke’s team in
the web archiving program to create a web cultures collection. And we co-curate this with
scholars who study digital culture. And we’re capturing a set of sites that document the digital
vernaculars. And so we’re up to about 49 sites
ranging from “Know Your Meme” — there’s a pepper spray everything
meme, that’s one on that site. And to creepypasta.com; and
feel free to google that later. Alright, so our — just briefly
— yes, I love that one so much. Anyway, so we’re looking at
sites that document, you know, communities that are engaging
with each other, DIY communities, SPAN communities, people talking
about urban legends and lore. Again, these are just extensions
of the kinds of documentation that we have in the analog. Alright, so by way of background, the Library of Congress has been
collecting web archives since 2000, and has a well-established
practice for acquiring and processing the collections. Most website collections are
harvested via a crawler tool, Heritrix. And they are saved into
a format called WARC. The crawler works from a list of
known seeds that are starting URLs. And then it — you’re
doing a great job — and it crawls to a specific depth from each one following
internal links on websites. There’s no verification
against the original source; that’s not possible in this context. But the results are
subjectively reviewed. And these methods are appropriate for collecting documentation
of everyday life. But they’re different and
distinctive from the kinds of abilities that were offered in
these two projects I’m going to talk about in ten minutes or less. Okay. We first started working
with library developers to collect user submitted
photographs via Flickr that document Halloween, Day
of the Dead, and the other sort of constellation of holidays that surround late
October, early November. So participants were asked to post
photos to Flickr with that hashtag; we did it again last year. And then have a Creative Commons
license accompanying them. We’ve since expanded that; this is
our 40th anniversary of the center. And so we’ve asked people
to document their traditions and do the same thing, so at the end of the calendar year we’ll be
working again with developers to pull that content
into our archives. So the Flickr harvesting projects, like the StoryCorps project I’m
going to talk about in a minute, you know, have these
contours that are well suited for a different approach to
acquiring and processing. There’s no external links in
the data, no easy list of seeds that link to the collections
externally. And most of the data was created — well, in the case of
StoryCorps, in an app. And so the data sets, they also need to be treated differently
in terms of access. But in the case of StoryCorps — well, and Flickr, access is being
handled by these third parties. So that lets us off
the hook for now. Okay, so anyway, at the Library we
seem to have settled into a rhythm where we treat born-digital
data sets and web harvesting as two independent but
related practices. Data sets are generally collected in
time- or tag-bound chunks by querying an external API; they’re
downloaded onto LC storage in BagIt structure,
and then continue through the usual workflow, depending on whether they’re
slated for preservation or online access, or both. Okay, so let me just move quickly
into the StoryCorps example. So this venture is an extension of a longstanding relationship
we’ve had with StoryCorps. StoryCorps started in 2003; they’re now among the largest oral
narrative projects of their kind. And interviews have been
collected at mobile booths and permanent story booths
located in New York, Chicago, San Francisco, and Atlanta. The traditional interview is two
people who know each other talking with a facilitator for
about 40 minutes in a booth. And then each recording is
preserved at the Library of Congress American
Folklife Center. And then these — snippets
of this stuff gets broadcast on National Public
Radio’s Morning Edition. And so since the inception of the
project in 2003, the signature, what we’re calling, you know,
face-to-face StoryCorps is — they’ve garnered more
than 67,000 interviews. And so then in 2015, founder
Dave Isay won the TED prize; which when you win the TED prize, you get a million dollars,
which is great. So Dave wanted to create an app to take StoryCorps
global; and so he did. And so we were like,
whoa, okay, let’s do this. So we worked with StoryCorps
developers to put something together. And that’s what I’ll talk about. So as StoryCorps developers
were designing their app, they worked with the software
developers here at the Library to identify, design, and
construct this transfer mechanism. The team agreed that an API
would benefit both parties; the Library could receive or fetch. Oh boy, five — and then
it would allow StoryCorps to expose their collection
for harvesting instead of spending staff time prepping
it and pushing it out to us. So one of the main requirements for
the API was the need for fixity. While StoryCorps developers
considered their content fluid, Library developers were adamant that
we need things to be fixed, right? And so explaining this to an
external partner was beneficial, of course, because it’s a
technical facet of the library world that exposes archival
preservation practices more broadly. This was where my cool
graphic was going to go to explain that, but
I couldn’t do it. Anyway, so each interview
package consists of metadata files in JSON format. And there’s an optional
upload of a photo, a selfie with your
interview partner in JPEG. And then an MP3 audio file. And those are discrete
packets that are served up. And StoryCorps was happy to
add the checksum feature, which is like a digital
fingerprint of sorts, against which later comparisons
can be made for errors. And so the MP3 was
hosted at SoundCloud. And StoryCorps wasn’t sure
if SoundCloud would be cool with adding that; but they
were, after a little bit of back-and-forth, happy to do that
for us so that made it a lot better. So then decisions had to
be made like, you know, what about the associated metadata? Do we care about likes? Do we care about descriptive tags? You know, we decided likes
were too fluid, and a snapshot at an arbitrary point in time
seemed a little bit meaningless. But we certainly thought tags
were incredibly important across metadata, rocking and all. So anyway, once developers were
happy with the general design of the transfer mechanism, then
both parties began to iterate. And StoryCorps on the API side, and
the Library on the fetching side. And I will blow through
a few technical details. Anyway, there were two
— so in doing this work, just to jump to the two big classes
of errors that we were trying to protect against
were one, a single file or a tape gets destroyed, in which
case the checksum will reveal this. And then the second class of
error would be losing access to a collection through correlated
errors, which is a class of mistakes where somebody follows a bad
practice or relied upon bad code; in that case an inventory can help
identify those kinds of things. And then they can make a copy. Okay, so this is how the
development is going. You — the — I’ll just move ahead. Alright, well, so in
closing, you know, these projects have enabled
AFC to engage a broad public to collaboratively build its
collections as it’s always done. And we’re able to do so in
an automated way and in a way that bakes in sound preservation
practices, which we love. And if there’s time to
play a 1-1/2 minute clip? Okay. I’d like to close with this
clip from a StoryCorps.me interview. And I want to say that, you know,
this is a tremendous corpus, a research corpus, these interviews. And they do get dinged sometimes
for being overtly sentimental, the stuff you hear on Friday. And this is no exception, so enjoy. But I’ll leave you with this. [ Silence ]>>Well, this last question
basically just says, take time to tell your interview
partner what they mean to you. And so I’m going to say that to you. I absolutely adore you. You’re easily the best thing
that’s ever happened to me in any way, shape, or form. While, like you said earlier,
there’s so many things that we may still be unsure about
with our future, I can at least say that I know I want you in it. So, in the home that
we’ve built together, in the life that we’ve made together
with each other, and in the family that we’ve started with our little
furry prince here, Asia, Colleen, Annette, Adcock, will you do me the
great honor, and will you marry me?>>Yes. Of course, I will, stupid. [Laughter].>>What better way than to talk
about all the wonderful things that we’ve done in our life so far. And then just think
about the great things, what’s going to happen
in the future.>>Oh, gosh, you’re so amazing.>>And I don’t know if you —
I didn’t tell you this yet, but this could also be saved in
the Library of Congress’s archives. Forever. So that everyone
can listen to this.>>So happy.>>I love you so much.>>I love you too.>>This has been my
interview/surprise proposal to my wonderful girlfriend,
and now fiancee, Asia. I’m Rory T. Miller, now the
happiest man in the world.>>Thank you.>>Nicole Saylor: Thank you. [ Applause ]>>Maciej Ceglowski: I have never had
a simultaneous translator before. So let me begin by saying to all
of you: pumpernickel, flapjack, chandelier, riboflavin, hockey
puck, calzone, moose antlers. I owe you a drink. My name is Maciej Ceglowski. I run a little web archive
for about 20,000 people. So to be invited to the Library
of Congress is like being a kid who glued some fins to a
cardboard tube and was invited to tell NASA about
rocket propulsion. Like every speaker
has correctly said, it is a signal honor to be here. It feels especially strange to
me because I’m used to talking about the U.S. Government
as a big, scary adversary. But here I am in a Government
institution that, despite the fact that we all went through a
metal detector to sit here, champions not just freedom but
the fundamental right to privacy, and the dignity that that entails. During the panic that followed
September 11th, Carla Hayden, who was then Head of the
American Library Association, took a principled stand against
provisions in the Patriot Act that required libraries to divulge
what their patrons were reading. She did this in the face of ridicule from the Attorney General
and the Administration. And of course now she is
the Librarian of Congress; so what a wonderful institution. So it’s particularly sad — feel
free to applaud about that — that [Applause] —
yes, what a woman! So for me it is depressing to think
that those provisions that seemed so threatening and un-American look
almost quaint now when compared to what we have done in the commercial internet
to destroy privacy. You know, libraries have a
commitment to protect information about what their patrons
are reading. But Amazon knows every
book that you’ve read; if it’s electronic, they
know it down to the page. Google has your correspondence,
your web history. Your phone company knows where you
are and where you’ve been based on a device you voluntarily
carry with you. We all know the depressing litany. And in some ways the commercial
internet is the opposite of a library. You know, the libraries
exist to inform. The commercial internet
is there to extract as much information as possible. There are no ulterior motives in
a library that I know about. But when you’re online, there’s an
ulterior motive behind every link; every click is there so
that you share something or click something else, or stay on
the page, pay attention to an app. But these un-librarians that I unfortunately represent
have made enormous advances in technology. And I’m here today to talk to
you about machine learning. I want you to hear it
from me rather than out on the street or from your friends. Because I’ve come to think that
these machine learning techniques that are amazing are kind
of like a deep fat fryer. If you have never deep fried
something in your life, it’s life-changing; you
taste it, it’s amazing. And then you start to think,
this would work with anything. And you’re kind of right; it does. Like in my college days I had
friends who worked at the snack bar, and they conducted lots of
research along these lines. They would deep fry cheese,
candy, pens, name tags. And it all came out
tasting pretty good. And then as you know, if you subsist
on it like you do in college, it kind of gives you a
little bit of a hangover. In our case the deep fryer is this
tool box of statistical techniques. The names keep changing. It used to be unsupervised
learning; now it’s called big data or deep learning, AI,
machine learning. Next year there’s going
to be a new buzz word. But the core ideas don’t change. You feed your computer lots of data, and it learns how to
recognize structure. Now, in any deep frying situation
you want to ask yourself, what is the stuff being fried in? In Poland in the ’70s, where I’m
from, there was a crafty person who bought a high-pressure
deep fryer from Italy, and they brought it to a
resort town in the mountains. They called it the
frytkopollo [Assumed]. And people would stand
in line for blocks; you had to bring your own chicken
because this was under Communism. But if you brought your own chicken, in three minutes it would be high
pressure deep fried and returned to you as the most delicious, hot,
tasty thing you had ever eaten. And this was a huge fad until
the Health Department came and closed it down. At which point it was revealed that the frytkopollo machine
had never been cleaned. And the stuff inside
it was basically black tar, and that was the flavor
secret of the frytkopollo. So what is your data
getting fried in, right? These algorithms train
on very large collections that you don’t know anything about. And sites like Google operate on
scales hundreds of times bigger than what we’re familiar
with in the humanities. For this reason, I’ve
referred to machine learning as money laundering for bias. Because you can smuggle in a lot of preconceptions based
on how you train. So for example, if you go to
Google Translate and you plug in an Arabic article about
something horrible in Syria, or an opinion piece on terrorism,
you get something in English that looks like it’s
written by a native speaker. But if you put in Arabic a kid’s
letter home from camp, say, or a passage from a novel,
it looks like it was written by the Frankenstein monster. The algorithm just can’t
handle it and doesn’t know. And it’s not because the
algorithm’s obsessed with war and things military, that’s just
what it was trained on. I’m sure other languages
have other idiosyncrasies. It’s not always a problem; some uses of machine learning are
benign and wonderful. If you do optical character
recognition, you benefit from a
lot of these advances. But others are problematic. I would be very wary of
using sentiment analysis, for example, on any collection. Or anything to do with social
networks without careful thought as to where it had trained. I find it helpful to think
of algorithms as a dim-witted but extremely industrious graduate
student who you don’t fully trust. So if you want a concordance or
an index or you want them to go through ten million
photos and tell you where there are horses,
then that’s perfect. But if you want them to draw
conclusions on gender based on word use patterns, or if you
want to do social network analysis on census data, then you
want some adult supervision in the room when they’re active. But besides these issues of bias, it’s really the opportunity
cost that irks me. This love affair with computational
techniques removes a lot of the potential for
surprise that comes when you deal with actual people. Because if you go searching
for patterns in your data, you’re going to find
patterns in your data. You know, woo-hoo, right? What gets lost in that
data is what’s fresh and really interesting
and different. And we’ve seen entire
fields disappear down the numerical rabbit hole. Economics came first, Sociology and Political Science are
still trying to get out. Bioinformatics is down
there somewhere and has not been heard
from in a while. In my eyes, the excitement is not
in the computational possibilities. The computers are a tool, and
they eliminate some drudgery and it’s fun. And deep fry some things;
it’s tasty. But the real excitement
is in human potential. Because for the first time
we can make things available to anyone on the planet. And I don’t think we’ve internalized
the enormity of that stat. Now, as you all know, just throwing
the data online is not the answer. When I was a — I’m laughing because the Andrew Mellon
Foundation once hired me as a program officer;
that was crazy of them. But when I was a program
officer at the Mellon Foundation, I remember working with early JSTOR. And finding out that 50% of
JSTOR records had never appeared in a search results page. So not only had no one read
them, no one had even seen them; they might as well not
have been digitized. Part of that was because of
extremely restrictive agreements with publishers. But part of it was a
failure of imagination. Here you have every journal article
under the sun, and no attempt to connect it to people
outside of the academy. For the same reason, we completely
flubbed Wikipedia at Mellon. Nobody could imagine that
the effort would succeed. I remember my boss went as far as
exploring whether to print a copy of it and have it published that
way, but that’s where we stalled. And then later, after leaving
Mellon, I saw librarians fail to engage with some vibrant
communities at Flickr and Delicious in early days. Services that they
later grew to love. Basically because of a lack of
trust and openness to an experiment around unstructured tagging. When we talk about data, a lot
of our language is extractive. You talk about data flows,
data mining, data crunching; it’s like a rocky substance that you
have to smash to get the jewels out. I like that. But in cultivating communities, I prefer to think of
a gardening metaphor. You need the right conditions,
a propitious climate, and a judicious sprinkling
of bullshit. But it also requires some
patience and weeding and tending. And the ability to accept that you
don’t know exactly what is going to grow. If we take seriously the idea that digitizing collections
makes them more accessible, then we have to accept that the
kinds of people and activities that those collections attract
are going to seem odd to us. We have to give up that control. It should make perfect sense, because human cultures
are famously diverse. It’s normal that there’s different
kinds of music and food and dance; we enjoy these differences. Unless you’re a Silicon Valley
nerd, you delight in the fact that there are hundreds of
different kinds of cuisine rather than a single beige slurry that
gives you all your nutrition. But when we go online,
our horizons narrow. We expect that domain experts and programmers can meet
everyone’s needs anywhere. We think it’s normal to
build a social network for seven billion people. I call this the Mormon
bartender problem. When you try to make things for people while lacking the
visceral experience of their lives, imagining that you can just
think your way through it. I’ll give you an example
from my own website. I run a very vanilla
bookmark archive, where people save URLs for later. So it’s a search engine
for scholars, journalists. I even have a priest who
researches his sermons on it, which I’m happy about. But one of the biggest
groups of users on my site are writers
of FanFiction. I’ll explain what FanFiction
is because you’re all going to pretend you don’t know — even though I know all the
librarians here are avid writers of it. So this is a vibrant subculture
of people who write stories, often highly erotic, that are set
in various fictional universes. If you always thought that
there were sparks between Holmes and Watson, this is
the subculture for you, and I encourage you
to come explore it. So FanFic authors came to my
site, adopted the tagging system, and to them it’s a search
engine and publication tool; it’s not an archive at all in
the sense that I intended it. And they do a lot of
additional technical work to make it suit their needs. For me, it was like watching bees
arrive in your garden and set up a hive and, you know,
make honeycombs and honey. You just observe and wonder
and hope you don’t get stung. It was really a wonderful
experience. But it made me think,
the internet has to get a lot, lot weirder
than it is. People out there are also tired of this deep-fried data
flavor, and they want substance. They do interesting things, but
you have to trust them first. In an institutional setting,
this can be frightening. It takes courage to ask for a
grant to bring a collection online when you have no measurable
outcomes other than the hope that it will attract
something interesting. It takes even more courage
to award such a grant. It takes courage for a young faculty
member to devote time and energy into projects that someone
might use to make a cat video. I realize that that is not,
you know, the royal road to tenure in most institutions. And it takes courage to commit
to maintaining those collections for a long time, and
to staying engaged with the people who use them. But the search for intelligent
life on the internet means you have to put away some preconceptions about who your communities
of use are. Now, I thought the FanFic authors on my site were just pursuing
a harmless quirky hobby. What I didn’t realize is that
online the frivolous blends easily with the serious. FanFic authors tend to be women. Britta Gustafson has called Fandom
a secret seminar on feminism. Young Fans use stories to
explore issues of gender identity, and in some cases they
use Fandom to discover that there’s something called
gender identity that they’re allowed to have and think about
and question. And they do it through the medium
of something they genuinely enjoy. They learn to deconstruct plot
elements in writing in a way that would make a Russian
structuralist critic blush. They coach each other
in writing better; they coach each other in technology. Because serious people enjoy
frivolous things and they mix and mingle with young people at
the same time, it becomes a sort of secret educational system
that works in surprising ways. My friend, Sasha Judd, who in
an upcoming talk will describe something similar with fans of
the boy band, One Direction. The band has an obsessive
following, again of young women. And in chronicling the lives
of these beloved band members, they reach heights of
technical achievement that media companies can’t get to. They are de facto professional
archivists, developers, video editors, and journalists. But they don’t think
of themselves that way. And to “real technologists,” these interests aren’t
serious, so they don’t matter. So these women would never
dream of applying for jobs that they’re already
effectively doing. I think of this as like a dark
matter of talented, motivated, and interested people online
who just aren’t connected. And I’m convinced our time is much
better spent trying to reach them and engage them than playing
with the same algorithmic toys. I’ve talked a bit about
bringing collections online and making them accessible. There’s also the question of how
to deal with the flood of stuff that is already on the
internet and coming from it. One approach is to go to people
who have control of the data, the big companies, and
partner with them to study it. I think that this is awkward,
though, because the very thing that the Librarian of Congress
objected to in the Patriot Act, this intrusive constant
surveillance, is the bread and butter of online services. Much of the valuable
information is collected in ways that would never pass
ethical standards in academia, in ways that even the NSA would
be constrained by law from using. But the data is there. There are no laws that
constrict us — constrain us from collecting
it however we want. And I know that you can hear
that data calling to you. It’s saying, study me. Preserve me. These fly-by-night companies, they’re going to lose
me. Analyze me. You can try to add layers of ethical
paint to this conundrum and say that you’re really helping. But I worry about legitimizing
a culture of universal surveillance like this. I’m very uneasy when I see
social scientists working with Facebook, for example. You know, people are pragmatic. In the absence of meaningful
privacy protections, their approach to privacy has become
click okay and pray. Every once in a while there’s a
big hack that shakes everybody up. But since we have yet to see
a really big tragic misuse of personal information, we’re just
kind of hoping it doesn’t happen. But remember, we live in a time
when this spiritual successor to fascism is on the ascendant
in a lot of Western democracies. Having large unregulated collections
of private information harvested from people — often without their
knowledge — is just dangerous. Don’t legitimize it. People face social pressures
to abandon their privacy. Being on social media has become
an expected part of getting a job or an apartment in places. Now the border patrol wants to
look at your social media data. So when you work with
online behemoths, realize that the behavioral
information is not consensual, and there can’t be
consent to this kind of mass surveillance;
it’s an illusion. I want to talk momentarily
about my bread and butter, which is actually collecting
websites. I hope you’ll forgive me for being
technical in this part of the talk, but the average web page right now
is a giant pile of steaming garbage. I don’t mean the content, but
I mean the way it’s structured and fits together. So everything is a Frankenstein
monster of dependencies and JavaScript and things
pulled in at display time. And what does it mean
to archive this? These steaming piles of garbage? You can save the rendered
image in the browser, but then you forego all the
dynamic behaviors it might have. You know, is a dynamic ad
on a page just an annoyance? Or is it a valuable
window on life in 2016? And if it is, then which of the
ads are we supposed to archive? And, you know, how do we pull it in?
where you have to build a simulator of the entire computing
environment of today. And I don’t think future generations
will forgive us for doing that; it’s bad enough to
live through it now. Game developers have already
wrestled with this problem. They have stuff that we can
learn from, because they’ve tried to emulate video games
including classic ones that turn out to be surprisingly hard; they
were implemented in hardware. But the reason these game
developers are able to do it is because they’re also passionate
about playing those games and continuing to have them and
teaching them to their kids. So once again communities pop
up as kind of a life saver. One problem that an
institution like the Library of Congress faces is you
can’t just kind of, you know, roll up your sleeves and
go in there willy-nilly and say we’ll archive
however we can, because what you do would
be considered normative. And then everybody
will follow the lead. So I understand it’s
a dicey problem. But communities can help us. An example from LiveJournal
might help. So LiveJournal is an early blog
site that was very popular. One of the features it had was for every post you could have
a little thumbnail-type image. And what began happening is, people
would use those images as a sort of gloss on the post;
they would kind of riff against what they had written
with their choice of thumbnails. So some people would have hundreds
or thousands of these images. And it became a sort of art for how
to do commentary the silent way. LiveJournal, totally unaware of
this, imposed a limit of 16 images, and it was a retroactive limit. So they destroyed a lot of
the value in their archive without even knowing that
they had done anything. It’s easy to say that,
well, you know, you talk to your users,
you find out about this. But I think a lot of stuff
falls in between the cracks when people use multiple sites. People’s online presence
is complicated; it’s not limited to
one or two places. So the only way to know how to
preserve is to talk to them. The problem isn’t that
we can’t store the stuff; we have more than enough storage. It’s just that we can’t recreate
these environments in a faithful way without bringing in all of the
operating systems and browsers and ad networks that exist now,
which is an impossible task. [ Silence ] Focusing on communities, like I
said, means relinquishing control. It’s scary. But there are some key
steps we can follow. So making materials available in
open formats, without restrictions, with a serious commitment to having
a URL be a permanent promise goes a long way. Publishing texts as text, API’s like
we’ve seen and talked about today, which I find very encouraging. And a commitment to fair use that
the Copyright Office has also shown, so that people can freely create
and stitch this stuff together without fear of automated
robot emails from lawyers. I want to give my wistful
closing, closing memory of the — a place I never thought
I’d be wistful about, the Glenview Public Library. I grew up in the suburbs of Chicago, and it was a suburban
library like any other. But like a lot of nerdy
kids of my generation, I spent half of my adolescence
there or at home reading books that I checked out from there. It was my window on the
world before the internet. And I never reflected why this
unremarkable suburban library existed, or who funded it, or
where its values came from, or how long it would be around. It was just a part of the
landscape to me, like Lake Michigan. But it taught me that like everyone
else I had a right to learn. And that I was welcome. That I could ask questions. That I could learn how to
find my way to the answers. And it taught me the importance
of being quiet in public spaces. For the generation that’s
growing up today, the internet is that window on the world. And to them it just exists;
they take it for granted. And it’s only us who have seen
it take shape and are aware of all the ways it could be
different, and how contingent it is, that need to worry about it. The coming years are going to decide to what extent the internet
is a medium for consumption, to what extent it’s going to lift
people up, and how much it’s going to be a tool of social control. As it exists today, the internet
is kind of a shopping mall. There are two big anchor stores
— there’s Facebook and Google. And of course there’s an
Apple store in the middle. There’s a food court
with a couple of punks, and lots of surveillance cameras
everywhere watching everything that happens. And outside in the parking
lot there’s the bookmobile, which is you guys. There just to add a veneer
of classiness to the place. This isn’t my dream of the web. My dream of the web is for
it to feel like the big city. It’s a place where you rub elbows
with people who are not like you. Some place that’s a little
bit scary, a little chaotic, full of everything you can imagine,
and many things that you can’t. And a place where there’s
room for chain stores, for entertainment complexes. But also for people
to be themselves, for people to create their own
spaces, to learn from one another. And of course room
for big, beautiful, huge, tremendous libraries. Thank you very much. [ Applause ]>>This has been a presentation
of the Library of Congress. Visit us at loc.gov.
