i want to welcome mark hi mark welcome hi mark is the director of the wayback machine at the internet archive super cool stuff and his talk is going
to be great today i think turn all references blue about linking and backlinking what do you call i've heard it called a few things mark what are you oh linking adding links to
yeah yeah just adding links to okay all right so good i like it 28 people in here i see chris doug dave among others so i'm gonna turn it over
to mark now and um i'll monitor the chat please ask your questions either in the chat or the q a and if we have time like i said we only have half an hour but if we have time we're going to try like heck to get some questions answered
so all right mark over to you excellent thank you very much hi everybody uh i just wanted to note that i can see my presentation but i can't see you so uh it's just like is this stuff i'm
talking to my computer here um my name is mark graham as noted i'm going to talk a little bit today about a project to help turn more things into links basically the idea is turn all
references blue and this is a project of the internet archive the internet archive is a 25 year old nonprofit we're actually celebrating our 25th anniversary this year
our mission is universal access to all knowledge and one of the ways we do that is by being the best library that we can be and so that's kind of on a daily
basis we work toward that goal once again the title here is turn all references blue and just a little bit of visioning in the star trek future everything that any human has ever
written or spoken will be available through a command a gesture a thought or what have you we're not there yet but along the way we think that we can
add value to things that are digital by connecting them together and so here we just have to make the requisite nod to the memex and vannevar bush and
project xanadu i know other people in other presentations in this conference have spoken about the influences of ted nelson vannevar bush doug engelbart and many others at length so i'm
just going to skip over that except to say uh yeah they influenced me greatly and continue to inspire our work i'm going to step back a little bit and say what are we talking about when we're
talking about adding links to things there are really two different ways to add a link to something the first one is you can you can edit the actual object itself change the underlying primary uh document
and the second is you can add metadata to to that object uh which we often refer to as annotating i think of all of this as as annotating um but i'm just gonna use the phrase you know adding links to uh things
uh generically um and why do we do this well you know one of the reasons we uh one of the challenges and reasons that we work with with links that are added to things is because the
links and the web themselves are somewhat ephemeral uh links go uh go bad when the underlying thing they link to is changed or deleted in some fashion here's an example
of of a url that on the live web returns a 404 a page not found and then an archive version of it on the right is much more satisfying there's another issue which is content drift
meaning that the underlying thing that something links to can change over time so we have a web of addresses but not necessarily addresses that
are tied to specific um information that's fixed in time because it can change so what we've been doing about this is we've been working closely with wikipedia
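A minimal sketch of the idea of an address that is fixed in time, assuming the publicly documented Wayback Machine availability endpoint and the usual `web.archive.org/web/<timestamp>/<url>` playback form. The function names are hypothetical, nothing here is the internet archive's own code, and the sample response is a hand-written illustration of the documented JSON shape rather than real captured data.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint: the public Wayback Machine availability API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url, timestamp=None):
    """Build the query URL asking for the capture of `url` closest to `timestamp`."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp  # YYYYMMDD or YYYYMMDDhhmmss
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response_text):
    """Return the archived URL from an availability response, or None if absent."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def wayback_url(url, timestamp):
    """Build a playback URL that pins `url` to a point in time."""
    return f"https://web.archive.org/web/{timestamp}/{url}"


# Hand-written sample in the documented response shape (not real data).
SAMPLE_RESPONSE = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20060101064348/http://www.example.com:80/",
            "timestamp": "20060101064348",
            "status": "200",
        }
    }
})
```

The point of the timestamp in the address is that the same url can be cited as it looked on a given day, which is what a plain live-web link cannot promise.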
we have a piece of software called internet archive bot that's now running on 78 of the 321 wikipedia language editions and we've been going in and we've been looking for broken links
and looking at how we can improve links and we've been adding a lot of links here's an example of a wikipedia article about open firmware and you can see that many of the
external links down at the bottom connect to archives from the wayback machine so they may have been born this way the editor may have used a
wayback machine url or our software might have found a broken link and edited it to point to an archived version on the wayback machine and to date our software has edited more than
25 million urls on wikipedia sites pointing about 23 million of them back to the wayback machine this year alone we've added about 1.5
million links and the proof is in the pudding here on the effect of this this is a look at clicks from english wikipedia
to third party sites and you can see here that for this particular time period web.archive.org which is the wayback machine was by far linked through more than any other external link
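The repair step described here, finding a broken link and pointing it at an archived version, can be sketched roughly as follows. This is an illustration only and not how internet archive bot is actually implemented: `repair_links`, `is_dead` and `find_snapshot` are hypothetical names, and the two callbacks stand in for whatever liveness check and archive lookup a real tool would use.

```python
import re

# Rough URL matcher for plain text; a real tool would parse the wiki markup.
# Note: trailing punctuation right after a URL is captured as part of it.
URL_PATTERN = re.compile(r"https?://[^\s\]|}<>\"]+")

ARCHIVE_PREFIXES = ("http://web.archive.org/", "https://web.archive.org/")


def repair_links(text, is_dead, find_snapshot):
    """Replace dead URLs in `text` with archived copies where one is known.

    is_dead(url) -> bool           hypothetical liveness check
    find_snapshot(url) -> str|None hypothetical archive lookup
    """
    def replace(match):
        url = match.group(0)
        if url.startswith(ARCHIVE_PREFIXES):
            return url  # already points at an archive, leave it alone
        if not is_dead(url):
            return url  # live links are left untouched
        return find_snapshot(url) or url  # keep the original if no copy exists

    return URL_PATTERN.sub(replace, text)
```

For example, calling `repair_links(article_text, is_dead, find_snapshot)` twice gives the same result as calling it once, because links that already point at web.archive.org are skipped.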
i just want to make a little note here um to show a little bit about web archiving the basis for for everything i just said uh is that we've been archiving a lot of wikipedia and a lot of the web for
um a long time and uh you might not think about this but for cnn.com on june 17th 2021 you start with www.cnn.com that's the seed page
there are 174 outlinks on that page those are links to other pages and just one of those pages had 338 embeds for a total of 30 000 embeds so this is an example
of how you can start with one url and try to archive one url and all of the pages linked from that url and all of the page resources and you get a number like 31 000 so um
it's a lot of heavy lifting on the back end we archive more than a billion urls a day into the wayback machine in addition to working with links to web pages we've
been working to add links to citations in wikipedia articles where the citation is to a book so here's an example of the martin luther king jr page and down at the bottom you see the
citation to a book and we added a link directly to a digital version a preview of a digital version of that page at archive.org we do that through a process of identifying the resources in wikipedia
articles trying to understand the semantics of what's there and then finding those documents either at archive.org or other sites and then editing or annotating the the document to accelerate this
process the internet archive helped buy another organization called better world books and we turned it into a non-profit this is a used bookstore
so it's a sister non-profit to the internet archive and we were able to pull books off of the conveyor belts there at better world books and put them on pallets on the right hand side of the screen here you can see
a pallet and you can't quite see that label but it says internet archive wish list so i think we pulled something like a million books off the conveyor belt last year and sent them to be digitized we also are working
with a number of other organizations around the world to try to source non-english language books and i can say that we've added more than a million links pointing to more than 250 000
books from archive.org across 50 wikipedia language editions and that work was done in the last year upcoming we want to add the
functionality such that when you're looking at a wikipedia article you can do a rollover and you can get a preview of the book that's referenced on a wikipedia article
as as a pop-up and we also want to add further reading sections at scale to um to wikipedia articles today wikipedia articles contain links to things that are cited
but they don't contain a lot of links to things that are not yet cited and we think it'd be helpful to populate wikipedia articles with links to additional resources that people might want to follow if they want
to go deeper and learn more about the topic in all these cases we define success as doing something at scale which means numbers like hundreds of thousands or or millions i'm gonna shift a little bit away from
wikipedia now and talk about books in general you know um the bottom line is that while more than 100 million books have been published very few of them have been digitized let alone have have links for example all the books that
were born as paper books um i don't know i just happen to have one here this book was born as a paper book uh and it's never going to be be republished by the publisher as a digital book
we digitized it at the internet archive there are no links in it and then even born digital books like books that come out today on a kindle often they don't have
links as well here's an example a couple of new books active measures by thomas rid even though it's got 75 urls to archive.org none of the links in the kindle
version of the book are clickable silicon values a book that came out a few months ago once again tons of urls in the book but none of them are clickable as well i could
talk about the why about that it has to do with the pipeline of the publishing industry and um and frankly a lot of influence that that amazon has and so this is not so much about
technology as it's about policy but it points to a lack of appreciation and understanding for the value of links in general and so we're working on trying to influence this process
and getting more born digital books with links in them available here's an example of a book by tim harford the data detective
that just came out recently and uh it has links in it which which is great and the the links are are clickable um and uh in at least one case it ended up going to
a dead link however so that points to the importance of the persistence of links especially for a book if you're gonna put a url in it you want it to be alive for a long time so i'm gonna encourage that we use
archive urls internet archive wayback machine or other archive urls like perma.cc so that they're persistent a little bit of information about a project we did a couple of years ago
with the digital public library of america when the mueller report came out we noted that it had more than 2 000 footnotes in it but that something like seven of them were clickable
and so we saw a lot of opportunity to help make the mueller report more accessible to people by adding links to it so we did a lot of research and we added 747 links to the
mueller report in this case we republished it as a new pdf object with the digital public library of america so remember the example earlier you can edit the document or you can annotate
the document you can add metadata so here's an example where we actually edited a new version of the pdf the epub but in addition we also produced a version of
the mueller report that we annotated and so for this we had to do a little bit of uh engineering work we ended up using a custom version of pdf.js and also a hypothesis client to produce
this view that you're looking at here which anyone can go to a url on the net and see an annotated version of the mueller report
without using any kind of software you don't have to have the hypothesis client installed on your browser and i could talk a little bit more about how we did that we think this is a very interesting opportunity to annotate
existing pdfs and add links to them i should note that we work in a lot of media at the internet archive television news for example and so here's an example where we
annotated or added metadata to a a video archive linking to a book that was referenced and this is where donald trump was talking about how someone wrote a book about what a great
environmentalist he was and so i thought to myself really that's interesting so i got the book um i donated it to the internet archive it was digitized and then i was able to add a link to this video
archive pointing directly to the book if anyone wants to read the book and learn more about what a great environmentalist donald trump uh is you can now do it here's an interesting one um you might have noticed a
few weeks ago there was a lot of emails from dr fauci that were released um as per a foia by buzzfeed i think and in the the emails that were released was an email from someone
saying dr fauci you have to read this article that's so important and it included um the url you can see on the screen a medium.com url and uh if you put that url um into the
live web today you come up with an error message but if you put it into the wayback machine today you get what you can see on the screen here including that yellow bit at the top the yellow bit at the top was some context
that we added the internet archive added this context programmatically to this playback url uh noting that the uh that the underlying live web uh web page was
deleted by medium because it violated their content policy and you can read more about that here so here's an example where we've we've added some annotation to a web archive
such that in the future when people are trying to understand what was going on with this particular article they have more context for it overall and in this case it just worked you
know when i i got the email um that was released under the foia and i put the url into my browser and i was able to see the context here that we had added a few months ago
we are working now to add links to books we've digitized at the internet archive i i noted earlier about um you know helping to ensure that books that are born digital uh like say for example on
kindle or apple or google have links in them but what about all the books that we're digitizing they never were digital we're digitizing them for the first time they have no links in them obviously so we're working on a project
to add to identify opportunities to add links to um to newly digitized paper books we're also working on a project to to help
ensure that software that's cited in academic papers is linked you know obviously academic papers have links in them to other academic papers
but often they use underlying software and methods but they don't necessarily link to that underlying software or those methods which is important for
reproducibility data reproducibility so you have access to the data that's one thing but if you don't know what version of the software or method that was used to analyze that data then there's a missing piece there
also we're working with open syllabus they have analyzed 7.3 million syllabi uh and uh and and from that they've been able to extract out um excuse me they've identified 7.3
million syllabi 7.3 million objects that are syllabi for college university courses that then contain metadata that can in many cases
link out to our books so we want to ensure that everything that the syllabi reference and link to is preserved and increase the links in the syllabi
to improve the access to those underlying resources the reading lists etc that the professors might assign you might know we recently at the internet archive launched scholar.archive.org
which contains about 28 million pdfs of of completely open academic papers those are all accessible through the wayback machine and so we're now starting a project to
to add more links to wikipedia articles that point to the underlying academic papers that is not so much of a of a of a green field opportunity as adding links to books was because in
many cases academic papers are born digital and links were already added to them in wikipedia but not in all cases and especially not in older publications so for example we're
working with um some microfilm that we have and uh and digitizing material that had not yet been digitized and we see an opportunity to add more links to those resources and it's just fun i want to end here
with a note about annotation and the web and this is a point that brewster kahle our founder actually had made at an i annotate conference a couple or three years ago
and and what he said was he said the web already has an annotation system it's called twitter and um and i think quantitatively if not qualitatively he's right um you know if you think
about what what tweets are about in many cases tweets are about something that's accessible via the web uh you're you're tweeting a comment about an article um
or a book or a paper or something like that it's an annotation but we don't necessarily think of it in in that way um for a variety of reasons one is because it goes in the direction of the
annotation if you will the comment that's the primary object as opposed to the thing that that is being commented about so here's a little hack this is a an experimental new version of the wayback machine browser extension
and it's got a little button there that says tweets so what one can do is uh one can be on any page on the web and then click this button tweets and then it will submit a query
into twitter and you'll be shown the tweets that are about that web page so here i was on the i
annotate 2021 web page i looked to see who had annotated it if you will who had tweeted about it and there uh filled taijin um had had written that tweet about it
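As a rough sketch of what a button like that could do, one option is to turn the address of the current page into a twitter search url. This is an assumption for illustration: the talk does not say what query the experimental extension actually submits, and `tweets_about` is a hypothetical name.

```python
from urllib.parse import quote


def tweets_about(page_url):
    """Build a twitter search URL whose query is the page's own address.

    Assumption: searching for the URL itself surfaces tweets that link to it.
    """
    # safe="" percent-encodes ':' and '/' too, so the whole address is one query value.
    return "https://twitter.com/search?q=" + quote(page_url, safe="")
```

Opening `tweets_about(current_page)` in a new tab treats the tweets as annotations and the page as the primary object, which reverses the usual direction described above.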
so there's i just wanna i i titled this talk about like a progress report and a call for how people can help and i what i what i hope is that i've inspired some of you to think
maybe more broadly um about annotation and and what it can mean and what some of the opportunities are i think i i don't know how to quantify this but my my my hypothesis is that the vast
majority of opportunity to add links to things that are already digitized is huge and that we've only begun we've only scratched the surface and whether that be on wikipedia wikipedia
articles are probably the lowest hanging fruit because you can edit them right you can go and you can change them but anything almost almost everything else you can't go in and edit the underlying document you have to add metadata to the document
in a more traditional annotation process so whether it be the web or books or academic papers government documents software um citations in in general there's a lot of opportunity
and i personally and people on my team there's a whole team working on this turn all references blue project that was my 20 minute timer i want to leave 10 minutes for some conversation
um my entire team is is available uh to help you and there's lots of people and organizations that i want to thank here and i will leave it at that thank you i'm coming back on to pipe up
a little bit here and i just want to say great thank you really great and man the way you just went through those slides at a an amazing pace but still with
keeping us connected to what you were saying um i did see one question in the chat oh and somebody's raising their hand so i'm going to allow this person on if we have time to talk to them what do
you think you want to have someone on stage mark yeah how do i do that i'll do it okay great by the way i can see you now i got rid of the presentation so i feel but it feels much more comfortable yeah i also unhid myself so
okay i'm going to shut up and let jeremy ask a question oh um i was typing this in there i had no idea hey jeremy you're dennis jennifer's i'm up here in half moon bay
yo right on yeah so i've had a favorite book of my own that i was just searching your archive for and i'm looking for where can i upload
the images i had it professionally digitized so go to archive.org and top right hand side press the upload button goodness gracious and if you need any help just mark at
archive.org okay if it was a snake it would have bit me so thank you you're welcome oh thank you jeremy
um i have a question for you mark which is do you ever get into good trouble with this you know you put something on the wayback machine and someone's like no i don't want anyone to ever
see that again oh yeah yeah and i slap you with a slapp lawsuit i mean how do you deal with that respectfully i would say but no seriously respectfully and responsibly
um and um that we uh we we generally respond to requests from legitimate rights holders um uh or or just people who might be embarrassed you know i mean there's
it's the web right so yeah i'm not going to go into details but yeah okay that's good there's a whole team to deal with patron services and that's an important dimension of our work at the internet archive i gave out my
email address but i also want to just say info at archive.org we have a whole team of people that handle you know in inbound email we monitor twitter we're experimenting with real-time chat
just a variety of ways that we're trying to make ourselves more accessible um to our patrons because and i want to keep coming back to this idea that the goal is to help us be the best library that
we can be and what does that mean that means first of all it means that we have stuff right but but more than that it means that um that that people are able to discover this material and that they're able to access and use this material
um and so that's this is our you know guiding uh value of almost everything that that we do and whether it be archiving television news or or or books we digitize something like 3
000 books a day or archiving much of the public web and something just went away okay a question how do you think about security and links i mean that's a kind of a big
question that's like how do you think about security on the internet i'm reading that book you know they tell me this is how the world will end about um about zero day exploits for a variety of things i would just say
as a practical matter we do have software that runs in the background that processes files as they're coming in looking for malware there are automatic processes there are human processes
there's a variety of methods in place okay but it's the web so you know you gotta be careful out there okay i magically put another question on the stage i see that
i'm curious if there's an interface for the public to help add annotations to digitized books hi chris you know we're working on some stuff chris i'm going to get you in contact with that team that is working on
that uh or the open library team open library um is our kind of a catalog interface to a lot of the books that we have um digitized the more than four million books but also
metadata on many many more millions of books and so um this is something there and i know there's some experimental um projects um going on around this and maybe you can help us with it so yes yes
yes thank you chris we have two more minutes did somebody want to raise their hand and come up on stage i mean mark is like an efficient machine here so we do have two one whole minute so
come on up if you want come on up if you want but you know i'll just also say at the internet archive before covid we used to have this lunch every friday and the team would go around we'd
have between 50 and 100 people there every friday and people would share about what they were working on that week it was really great well in the time of covid we don't do that anymore so what we do now is we have a friday lunch
um all on zoom and between 50 and 100 people come to it um and we we've been practicing this thing for the last many many months where we we kind of do a deep dive on some dimension of the internet archive and
the team will present about that we've got a catalog of those available now the videos of those are up and accessible and i
can put a link somewhere to it so you can watch and you can go deep dive into there's an incredible presentation about open library and annotation and things like that so
so you can drop a link in the chat that would be great um you might also yeah that's probably the place to put it this has been really great mark i want to thank you and i'm still just my head spinning with like how efficient this
whole thing was and how you took a half an hour and really build it like so great thank you so much right on cheers
thank you everybody all right go team