Video: Universal Access to All Knowledge: Decentralization Experiments at the Internet Archive (DocDrop)

00:00:00

[Music] and welcome to my very humbly named talk about decentralization at the internet archive so a couple quick caveats this is not

00:00:22

strictly official views of the archive it's sort of my own Executives this was submitted as a 30-minute talk turned into a lightning talk and now is 25 minutes somehow so it's going to be a little

00:00:33

weird but that's okay and thirdly it does not contain Ethereum currently this may change uh hopefully as an outcome of this session we'll see okay let's get started

00:00:45

so why decentralize the archive from a spiritual or ideological perspective so since you're here you probably know what the archive is but just as a quick recap uh probably best known feature is

00:00:58

the Wayback Machine which has been archiving the known web since 1996 it has trillions of captures any given URL you can punch in and travel back in time uh quite far of course that's not all

00:01:12

okay so we have uh thousands and thousands of other collections everything from books to film to wax cylinders and vinyl we

00:01:24

digitize it ourselves we partner with other institutions to preserve their Collections and so on uh it's a lot of stuff yeah okay uh and we are also perhaps known

00:01:38

for our beautiful headquarters which is here seen in its previous Incarnation as a literal Church and inside uh is our own Terracotta Army which by the way is uh I think the best

00:01:50

employee perk ever after three years you get a half size statue of yourself made uh and it's placed uh somewhat creepily on the side of the the Great Hall

00:02:02

and within the ver the very same room is uh actually some of our servers we also have you know more conventional data centers but I think for the purposes of this metaphor um this is a very good image it's these

00:02:14

are uh PetaBoxes at one point they held about a petabyte of data each now it's much more so yeah so uh you know we've got four

00:02:27

walls we've got a roof we've got our hardware serving you uh you know your Grateful Dead uh live set from 1972 or the White House homepage from the 90s

00:02:39

and that is how we accomplish our mission of universal access to all knowledge right well I think actually for a long time this has been true it might be somewhat true today but before

00:02:52

I try to answer this question I just want to play this clip of our founder from the very very beginning you may have seen him earlier in the week and here's a much younger version

00:03:06

and there's no audio but there there are subtitles and he's basically talking about um his thinking in starting the archive uh which at that time uh was just a web

00:03:21

uh repository and I want you to take two things away from this clip first is that this is really a web native organization we do have all this other content uh and we value it very much but the web has

00:03:33

largely eaten the world at this point so um it is in our DNA to essentially be a the missing memory layer of the web and up to this point the best way to

00:03:46

accomplish that was put a bunch of stuff in servers in a you know box and the second thing is that underlying our top line mission of universal access to all

00:03:59

knowledge is this technological imperative to periodically assess the tools that are now newly available whether they can help us further that Top Line and apply them accordingly

00:04:12

so I think for the first time since our founding we actually have something that's not just you know better servers bigger hard drives better scanning equipment we have possibility of

00:04:26

potentially fixing this problem in a way that is universal and not just embodied in our servers okay so

00:04:39

to zoom back in a little bit why decentralize the archive from a practical perspective well one is physical location risk so this is a map of seismic risk zones in

00:04:51

the Bay Area and the little logos there uh correspond to our facilities and that's not great we're actually building a new data center in British Columbia currently unfortunately that's also on a fault Zone

00:05:05

and you might say well why don't you stop building your data centers in seismically active regions and for reasons that are outside the scope of this talk this is actually surprisingly difficult uh secondly political location risk so

00:05:19

uh we're addressing this a little bit with uh the Canadian expansion but uh as you can see on this map this is essentially a sort of a weighted index of the democratic quality of various

00:05:32

countries and we're not doing great at it and it's really trending downward so um this is a real problem for us because we have a lot of stuff that people want silenced

00:05:44

um and clearly we have pretty significant Network bottlenecks so this is the user sign ups uh over time uh you can see there was 16 in 1996 that's very cute and during the pandemic it just

00:05:58

went crazy and has never really let up so in practice this means our bandwidth is just totally cooked we're putting fiber in as fast as you know the

00:06:10

city authorities will allow us and it's really just not enough so you might ask why not just put all your stuff on S3 and use CloudFront and you know not worry about any of this stuff so there are multiple reasons some

00:06:23

of them are ideological but there are also practical ones like cost so um our modeled cost and actually kind of real cost is around 650 bucks per

00:06:35

terabyte to store forever which in the model means 100 plus years on S3 that's about 160 bucks per year so that multiplied by forever that's not a very nice number
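As a rough worked version of the comparison above, here is a small sketch that uses only the two figures quoted in the talk (about 650 dollars per terabyte as a one-time cost in the archive's model, about 160 dollars per terabyte per year on S3) and takes "forever" to be the model's 100-year horizon. The function names and the flat-price assumption are illustrative and are not the archive's actual cost model.

```python
# Back-of-the-envelope comparison using the figures quoted in the talk.
# Assumption: prices stay flat over the whole horizon (no discounting and
# no price decline), which is a simplification for illustration only.

ARCHIVE_COST_PER_TB = 650    # one-time "store forever" cost, in dollars
S3_COST_PER_TB_YEAR = 160    # recurring cost, in dollars per year
HORIZON_YEARS = 100          # "forever" in the model means 100+ years


def recurring_total(cost_per_year: float, years: int) -> float:
    """Total paid for a recurring per-year cost over the given horizon."""
    return cost_per_year * years


def break_even_years(one_time: float, cost_per_year: float) -> float:
    """Years after which the recurring cost exceeds the one-time cost."""
    return one_time / cost_per_year


s3_total = recurring_total(S3_COST_PER_TB_YEAR, HORIZON_YEARS)
print("S3 over the horizon, per TB:", s3_total)
print("Ratio to the one-time model:", s3_total / ARCHIVE_COST_PER_TB)
print("Break-even in years:", break_even_years(ARCHIVE_COST_PER_TB, S3_COST_PER_TB_YEAR))
```

Under these assumptions the recurring option passes the one-time figure after roughly four years, which is the point the speaker is making about multiplying a yearly price by forever.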

00:06:49

um with Filecoin which is sort of presaging but we'll get into it a little bit later uh the costs are vanishingly low uh and might in fact be negative for us because there's a network subsidy for

00:07:01

culturally valuable data uh how sustainable that is we'll we'll find out uh Storj is another decentralized storage network uh they're sort of a little bit more here and now thinking so

00:07:15

they're competing with S3 and they have a slightly better cost uh there's also Arweave which you might be familiar with as the forever storage solution and it is expensive it is quite pricey

00:07:29

compared to our model and I'm a little bit unconvinced about their data availability model for stuff that's unimportant until it is right so things that are sitting idle for a very long

00:07:42

time and then because of a political moment or something like that they suddenly become relevant so if you have other thoughts please talk to me and convince me I'd love that and there are others I haven't mentioned

00:07:53

here I'd love to talk to Swarm folks for example I don't know what their pricing model is so please if you're with that team find me later uh and aside from all the you

00:08:06

know dealing with negatives there are also positive new opportunities with decentralized storage and decentralized networks in general like content addressing which a lot of them support

00:08:17

which for example can allow for transparent link preservation right now when a link breaks we may have a backup copy if it's on Wikipedia we have a bot that will go in and edit uh the broken

00:08:31

links and the citations with our backups or maybe if you're an Enthusiast you'll be using our extension but for the most part an HTTP link when it breaks it it just breaks and of course there is uh

00:08:45

web3 forward compatibility as a nice bonus when the stuff is in a decentralized network and it's content addressable you can reference it using you know standard IPFS libraries that

00:08:57

I'm sure you're uh familiar with and so on and so on so I'm going to quickly jump into a couple of things that we've we have

00:09:09

working today uh as steps in this direction and then maybe come back to the musings uh about uh how a actual decentralized internet archive might

00:09:22

look like okay so the first one is the simpler one uh and this mostly addresses uh the bandwidth concerns and it's streaming media from Storj uh

00:09:34

and if you'd like to try this yourself you can do it we don't currently have a real front end so you'll need to grab this little bookmarklet from the presentation and just add it to your

00:09:47

toolbar and you'll be able to follow my example all right so here we have a video uh from uh from NASA uh and we have a whole collection

00:10:00

uh of their their videos and in the metadata here you see some uh you know internal identifiers uh this archival identifier Ark and uh then this

00:10:13

Storj string so any archive item that you see this uh string on uh is actually mirrored on Storj so you can just hit this bookmarklet and hopefully we do not

00:10:26

get demo fail here due to network issues no it's working great so here is a pretty big video you know full resolution MP4 four gigs to do

00:10:41

and so as as oh actually yeah that loaded that's that's fantastic so here we have some wacky astronaut things and it's uh as you see on the right here uh it's

00:10:54

being served from uh almost uh 3000 nodes worldwide the way Storj works is uh sort of like an incentivized BitTorrent swarm uh it's not quite the same but it's the the basic principle

00:11:06

applies uh they use uh erasure coding uh and basically pick 88 of the fastest nodes to give you a nice robust stream without relying on something like

00:11:18

CloudFront or expensive edge caching so let's close that all right so so what do we need for this to actually be

00:11:33

useful uh to y'all is well one is obviously an actual front end and that's uh that's assigned to me we need browser support because in this case it was sort of cheating a little bit because I was actually using an HTTP

00:11:47

terminator where my browser was talking to a Storj uh service that was then talking to all these thousands of nodes uh it's actually an HTTPS connection to

00:11:59

the nodes so uh all you need to make this work in the browser in a truly peer-to-peer fashion is uh self-signed certificate support in a particular context so if you're with Brave or Opera

00:12:12

or another browser that wants to make a play in the space please talk to me and lastly as with many decentralized systems today there is actually a centralized point of failure which is the metadata uh that

00:12:26

essentially uh tells you where to find all the pieces of the file so we're currently using a Storj-hosted uh what they call satellite metadata service uh we'll probably be running one of our own soon but the real solution is of course

00:12:40

not to have the single point of failure so that is on the to-do list okay so the second demo is a little bit more complicated and for it you might need to have a crash course in web

00:12:52

archiving uh so the canonical format for web archives uh is called WARC and it's an uh ISO uh standard uh that uh the the archive has

00:13:06

developed back in the 2000s and it's basically just a dump of web uh of HTTP traffic between your browser and a given server and uh the

00:13:21

uh the actual file structure is basically a tarball of concatenated dumps of all these different crawls and this is how the entirety of the Wayback Machine works is just a bunch of these

00:13:33

giant files that are HTTP dumps and uh some index data that tells you uh what offset in the file to read and uh out comes the web page so let's see oh
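The idea described above, large files of concatenated captures plus an index that says which offset to read, can be sketched roughly as follows. This is a toy illustration and not the real WARC record format or the Wayback Machine's actual index: the record layout, the `(url, timestamp)` key, and the function names are all assumptions made for the example.

```python
import io

# Toy model: captures are appended back to back into one big byte stream,
# and a separate index remembers where each capture starts and how long it is.
# Assumption: real WARC files carry per-record headers and are usually
# compressed; none of that is modelled here.


def append_record(archive: io.BytesIO, index: dict, url: str,
                  timestamp: str, payload: bytes) -> None:
    """Append one captured response and record its offset and length."""
    archive.seek(0, io.SEEK_END)
    offset = archive.tell()
    archive.write(payload)
    index[(url, timestamp)] = (offset, len(payload))


def read_record(archive: io.BytesIO, index: dict, url: str,
                timestamp: str) -> bytes:
    """Look up the offset in the index, seek there, and read one capture."""
    offset, length = index[(url, timestamp)]
    archive.seek(offset)
    return archive.read(length)


archive = io.BytesIO()
index = {}
append_record(archive, index, "http://example.org/", "1996",
              b"HTTP/1.0 200 OK\r\n\r\n<html>old page</html>")
append_record(archive, index, "http://example.org/", "2022",
              b"HTTP/1.1 200 OK\r\n\r\n<html>new page</html>")

print(read_record(archive, index, "http://example.org/", "1996"))
```

The point of the sketch is that replaying a page needs only the index entry and a ranged read of the big file, which is why the index becomes the hard part once the bulk data lives somewhere else.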

00:13:49

all right so I'm going to go back to the demo here so here's an example of such a WARC file and it doesn't look like much here because this is sort of internal

00:14:03

use if you open this up in a different front end you'll actually see the web pages but this is kind of what's you know how the sausage is made internally uh just these files and we have again in

00:14:18

the metadata fields we have these identifiers uh I I did very CID and CommP and these are identifiers within the Filecoin network so we've been storing a lot of these web crawls this

00:14:29

particular data set is our uh inauguration uh pre-inauguration crawl so every U.S presidential election we capture the entirety of the dot gov domain and Associated things uh before

00:14:43

and after uh the administration change to see you know how uh how the politics reflect uh in the reality of the government web uh so

00:14:55

uh we're using these data sets as uh as a test bed because it is in public interest and it is also not copyright encumbered uh as a U.S government data generally is so

00:15:08

we can pull up oops actually this one just grab this identifier here and uh go to a filecoin network indexer uh So to avoid demo fail I've actually

00:15:22

pre-filled this so I'm not going to breathe on it so it doesn't fall over uh so here's here's that content ID and it's found at a couple of peers where

00:15:33

we stored it so we can map this peer ID to a miner ID here and let's see so we'll grab it from that provider
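The lookup in this part of the demo, going from a content identifier to the peers that hold it and then fetching from one of them, can be sketched in miniature like this. It is a simplified illustration of content addressing in general and not the Filecoin indexer's API: a plain SHA-256 hex digest stands in for a real CID, and `ToyIndexer`, `retrieve`, and the provider names are hypothetical.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Stand-in for a CID: the identifier is derived from the bytes themselves."""
    return hashlib.sha256(data).hexdigest()


class ToyIndexer:
    """Maps content identifiers to the providers that announced them."""

    def __init__(self) -> None:
        self._providers: dict[str, set[str]] = {}

    def announce(self, cid: str, provider: str) -> None:
        self._providers.setdefault(cid, set()).add(provider)

    def find_providers(self, cid: str) -> list[str]:
        return sorted(self._providers.get(cid, ()))


def retrieve(cid: str, provider_store: dict[str, bytes]) -> bytes:
    """Fetch by identifier and verify that the bytes hash back to it."""
    data = provider_store[cid]
    if content_id(data) != cid:
        raise ValueError("content does not match its identifier")
    return data


# One blob stored with two hypothetical providers.
blob = b"bulk package of web captures"
cid = content_id(blob)
stores = {"provider-a": {cid: blob}, "provider-b": {cid: blob}}

indexer = ToyIndexer()
for name in stores:
    indexer.announce(cid, name)

providers = indexer.find_providers(cid)
print(providers)
print(retrieve(cid, stores[providers[0]]) == blob)
```

Because the identifier is computed from the content, it does not matter which provider answers: any copy can be checked against the identifier, which is what lets a link keep working when its original host goes away.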

00:15:55

and we'll grab it by oops no I'm by the identifier and for demo purposes I'm just going to discard it but

00:16:22

all right fantastic so here we are retrieving the uh kind of bulk uh package from the Filecoin network so

00:16:36

so at the moment we're sort of treating this as just a dumb blob so it's it's being stored on Filecoin as essentially another copy we have a couple of internal copies and this is a a third sort of Cold Storage thing but

00:16:50

things can get a lot more interesting so uh we can take a look at some other tools in this space okay it looks a little weird at this resolution but

00:17:01

here we have a capture of the devcon.org site uh and uh this Tool uh it's it's not an uh internet archive tool but it's it's

00:17:16

sort of Affiliated uh and uh here we have made a capture of the site and we can store it on ipfs and filecoin through web3 storage

00:17:29

all right and now this bundle is a static um self-hosted application that provides a view into this web archive the same

00:17:43

way that you would get on the Wayback Machine but there's no server right this is just loaded from IPFS and it has all the nice archiving stuff built right in you have this time stamp you could

00:17:55

travel back in time if I had earlier copies uh you can have the links reference and so on so the the next steps in the filecoin work for

00:18:06

us are to try to make our captures uh structured and compatible with something like this so you don't actually need our servers to interact with this data

00:18:21

so all right and beyond that step there are a few things we need to actually make this work for real so one is actually encryption and

00:18:34

ACLs you might think of uh sort of public good information sets such as our collections as being essentially open and that's generally true but because we

00:18:46

capture from the open web and some of our collections do come from sources where just so much stuff goes in that it's very hard to do diligence at ingestion time there are opportunities

00:18:58

when where things do require a certain degree of Access Control uh in in case of legal action or something like that and uh there's also kind of a broad

00:19:10

concern about data mining of these sets so we we seek to primarily support users and good faith researchers but there is a broad spectrum of use cases that go

00:19:22

from kind of white to gray to Black and the most important okay two most important things is indexing and metadata so right now we're able to store this bulk data fairly well we've

00:19:36

spent about a year on this and it seems like kind of a simple thing but at the end of the day it it it it turns out not to be um but uh the thing that we don't have a

00:19:47

solution for right now is the index because we have as I mentioned trillions of captures in the Wayback machine we have billions of other objects and you

00:20:00

can't put that stuff on chain right and you it has to be discoverable somehow so I'm very open to suggestions from the audience after my talk on how we can

00:20:12

attack this and the second most important thing is scale so we have hundreds of petabytes in our collections at the moment and

00:20:23

that's a lot of data I think uh most folks here who work on on-chain things you're probably dealing with things that are at most a few megabytes right it's it's really not that much

00:20:37

data uh that we're used to dealing with uh in the uh the blockchain world and we've got tons and tons and tons so that's something that the Filecoin team

00:20:49

has been very supportive of but we are just getting started all right and I think that's it um I guess the

00:21:02

question I leave you with that goes back to the beginning is can can the web have a memory not just web 3

00:21:14

which is already somewhat set up for that but all the web content because we we value culture that is not just this narrow domain that we inhabit we value

00:21:26

all culture right so how do we preserve that for forever thank you thank you I have a couple of minutes if people have questions are there any questions

00:21:52

right here [Music] so when you're archiving the web uh very often web servers differentiate what content they serve depending on the IP

00:22:05

address from which you ask and it seems to me that there's an opportunity for decentralization here as well so that your crawlers actually you know pick from which region they

00:22:17

download the who I've already how are you dealing with this problem now yeah that that's a great point so we we definitely run into these issues as you're just being straight up blocked or a page that has meaningfully different

00:22:29

appearance depending on uh the requesting region right now we have a few semi-decentralized tools for this uh there is um an organization called Archive Team which is basically a volunteer uh group

00:22:43

uh that runs crawlers that receive these tasks uh and there is also academic research in this where different archiving organizations would enter into uh sort of Consortium where they would

00:22:57

synchronize their crawls uh from different regions but I agree that those are not uh super scalable Solutions so uh I think that my short answer is yes

00:23:10

yeah thanks for the talk how do you deal with takedown requests like DMCA takedowns so unfortunately we are required to comply with that to continue existing as an

00:23:24

organization uh so we have a legal team that processes them and if they are uh they have merits and are made in good faith uh we will generally block the item from being served

00:23:39

so unfortunately time is up okay um thank you so much akadi and I'm pretty sure they can find you afterwards right okay fantastic thank you very much