00:00:00
okay um so this is our talk about Perkeep, your personal storage system for life. I'm Brad Fitzpatrick, and I'm Mathieu Lonjaret. So, who saw our talk here two years ago,
00:00:13
by chance? About a quarter of the people. So, two years ago we gave a talk about Camlistore. We have since renamed the project to Perkeep, and this little derpy bird, a parakeet, is
00:00:30
our logo. It's the same project; it just does more stuff now, and it has a cuter mascot. We're going to divide this talk into four main parts. First we're going to give you a longer overview of
00:00:43
what Perkeep is; then we're going to explain why it exists, why we created it; then we're going to drill down into the details of how it actually works; and finally we're going to talk a
00:00:56
bit more about the project, the community, these kinds of things. And yeah, if it works... the monitor works. Okay, so first of all:
00:01:06
what is Perkeep? Perkeep is a whole set of components that all work together really nicely, but that sounds very scary by itself, so we'll start with the web UI, the things that users actually see. So yeah, we're going to
00:01:21
first describe what you see most commonly as a user. We have a web UI with this top menu that allows you to navigate to different parts of the UI, and you have a
00:01:33
search bar on top. Then, of course, you have all the objects that are in your store, laid out in this space, and when you select some of these objects you get this menu on the right, which is a
00:01:46
contextual menu for different kinds of actions. You can see that you can select them all, and create a new container, like a set, for your objects. You can also share several of your objects at once, like you would do with one of them; then,
00:01:58
if you select several of them, you get a zip with all of them. This search query is matching on file names, just to show that yes, we support files, but we kind of consider files boring and increasingly antiquated:
00:02:10
nowadays people don't create as many files as they used to. They create a lot of their content online, where there isn't necessarily a corresponding file on their local machine. So of course we also support photos, showing you kind of
00:02:23
an infinite stream of your photos. But these aren't really much more interesting than files, because they are, you know, JPEGs, or files. Here, for instance, is a search for just my panoramic photos in the
00:02:35
search panel. More interestingly, we support importing, storing, indexing, and searching your tweets, and rendering your tweets, so even if Twitter goes away we still have a copy of them all. And
00:02:48
some of them contain media, so we slurp that in too. And this tweet, for instance, is now like a file on my POSIX file system that I can back up with some backup software. Likewise, we support importing, indexing, and
00:03:00
searching things like check-ins, from sites like Foursquare and Swarm. And by default, if you go to your UI (this is my personal instance), you see an infinite scrolling list of all your content, whether that's photos or
00:03:11
tweets or check-ins or whatever; files too, if you happen to use files. And for all of this, you can see that we also have a map interface. All of these, the Foursquare (well, it was
00:03:25
Foursquare, now it's Swarm) check-ins are displayed, the tweets, the images: anything that has a location, really a latitude and longitude, is kind of easy to deal with, and we can display them all. And you can have another kind of search
00:03:39
query at the top, which shows a few check-ins in a specific location, or at a specific... yeah, these are all my check-ins in 2018, and this is
00:03:53
zoomed in on the Seattle area. If you zoom in you get more detail. So here are my check-ins around Seattle in 2018, and if I click on one of these markers, there's a photo I took on a ferry, on the
00:04:05
water. You click that again, and you see my wife and kid. These are all new features from the last two years, new
00:04:16
development. Here's a search query for a location, Moscow, and this searches your photos and tweets and check-ins in Moscow. And of course, as we are an
00:04:31
open-source project, we have a command-line interface. It allows you to do all sorts of crazy stuff, from the low-level commands, which allow you to access the blob storage directly, to the very high-level stuff that allows
00:04:43
you to do searches; and then you can combine all kinds of commands to automate your whole workflow if you need to. There's also a FUSE interface, so if you
00:04:56
want to, if you're old school and you want your file system, you can mount your whole store and access it as if it's on the local machine. And there are all these magic directories: there's a magic directory called 'recent' that does a search query for things you've
00:05:09
interacted with recently, so you can upload a photo from your phone and then find it in your recent folder. There's also versioning support: if I echo 'hello linux fest northwest' into a file in
00:05:20
this beautiful directory foo, I can cat it back and see it. But if I look in the 'at' directory, I can put in an arbitrary date string (it supports dozens of formats) and see how this file looked two years ago, when it just
00:05:32
said 'hello linux fest'. So you never lose anything. And yeah, we also have a mobile client. It's only Android for now, but it's better than nothing. You can
00:05:45
upload your files, mainly your photos. You can do that manually, or, you know, there are the usual directories where photos are stored, and it can continuously watch them and upload them automatically for you
00:05:57
whenever they arrive on your phone. And we also have importers, as I kind of showed earlier, for things like tweets and things from third-party web providers. These run as little agents within your Perkeep
00:06:11
server process, or they could be a separate process elsewhere. They speak the various APIs, convert things into Perkeep's formats, and write the blobs. We currently have various importers, and we have
00:06:23
TODOs for a bunch more; there are probably millions that we don't have, but this is kind of the best place to contribute to the project, to write new importers, and most of these have come from third-party contributors. This UI is not super sexy; it's where you
00:06:36
configure your importers and add your accounts and stuff, but you only ever go in here once, so we don't really care. Finally, we also have a bunch
00:06:50
of third-party apps that anyone can build on top of Perkeep, because we have this API. You can use Perkeep as your storage layer and use its search capabilities. For example, we
00:07:03
have this one, which we call the publisher, which, as the name says, allows you to publish some of your objects directly to the whole world. And this is already a bunch of photos from a trip I took, on my
00:07:15
publisher instance. So this is world-accessible, without authentication, if you just want to put your photos online; but the database behind it is your Perkeep instance, which is all private. So, we showed you the high-
00:07:29
level stuff, and that was the main overview of everything you could use. But now: why did we build all this stuff? The real question is: do you care about your data, and are you still
00:07:42
going to care about your data in five years, or twenty, or sixty years? I kind of want to have all my stuff when I'm 100, and I want my descendants to have my photos and stuff, because, you know, we pass down photo galleries:
00:07:55
my parents still have all their books of photos and all their negatives and stuff, which they've been digitizing. So I'm increasingly concerned that, in the current world, people put all their stuff elsewhere on the web,
00:08:07
and those sites shut down. I also want unified search across all these silos. I want to do search queries like: show me all tweets that occurred two hours after I had checked into a bar, or something like that. These are search queries that no single
00:08:21
silo can provide, because Twitter has your tweets but Foursquare, or Swarm, has your check-ins, so you can't join these data sources and do fun queries, like: show me all my photos at sporting arenas, or whatever.
00:08:33
So yeah, let's not forget that sites die, for any number of reasons, and when they die, they usually die with your data. Here's a bunch of reasons a site could die: they ran out of money, or they got destroyed by a
00:08:47
competitor, or they became evil, or got bought by someone evil who uses your data in ways it shouldn't be used, and you can't access it, or it just dies. As one maybe-relevant example: I created LiveJournal
00:09:00
many years ago, in 1999, and it has changed hands several times now, and I can't personally vouch for any future owners, because they keep reselling the company, and with it they keep reselling the data. It has an API, so
00:09:13
you can get your data out of LiveJournal, and hopefully every user who used it did, because who knows what's going to happen in the future. So what can you do about it? Well, our advice is that you should own your data, so that you can stay
00:09:26
in control of it. And this is where Perkeep comes in: that is our goal. It is a project that keeps your content and your online memories forever. So, two years ago I told a story
00:09:40
here about this crappy table I had just made. We had moved into a new house right before Thanksgiving, and everyone was coming over to play games (I guess in this case Boggle), but we didn't have a table. So we went to
00:09:52
the store, we got some lumber, and we bought a saw, and we used these tools and we created a table, and everyone was happy. And at this point I didn't care if the original stores went out of business, because I had my table. And this is
00:10:04
how things used to work: you would buy a computer, and it's a tool; or WordPerfect, or open-source software, was a tool; and you had your data on your little disk. Nowadays you give your content to these guys, and they have the
00:10:15
disks. And that is very sad, because you don't have the disks, and they may not remember you in 60 or 80 years. So, like we said, sites die. This is an ongoing list of all the sites that have died over time, and
00:10:27
it just keeps growing. This is why we have the Perkeep importers: the idea is that you publish to those sites and then you suck it back into your instance. This is one model; it's commonly called PESOS: Publish Elsewhere, Syndicate
00:10:41
to your Own Site. There are various other models, where you publish first to your own site and then syndicate out, but this is better than not having your data at all. And then it
00:10:53
doesn't matter who goes out of business, because you still have it all. So yeah, that was what it is and why we built it, why it exists, and now we're going to show you, as promised, how it all works.
00:11:05
So, we showed you the top layers, and now we're going to start from the bottom and work our way up, from the blob storage. So yeah, blob storage: what it is, is simply that we store
00:11:17
the data as blobs, which are chunks of data with a maximum size of 16 megabytes. There is absolutely no file name involved, no metadata, no MIME type, no versions: it's just an immutable blob.
00:11:29
At this point a lot of people scoff and say: well, surely there must be file names and metadata; how do you represent anything? All of that is handled at the upper layers. At the bottom layer you just have some bytes, and they're content-addressable. We call these identifiers
00:11:43
blobrefs. The blobref starts with the hash function's identifier. We used to use SHA-1, but SHA-1 is showing signs of being busted, so now the default is no longer SHA-1. We still support your old data; we just don't let you write new
00:11:55
SHA-1 blobs. So this is how you get back one of these blobs, and everything is built up out of blobrefs and blobs. Content addressability has the advantages that it's super simple,
00:12:09
that you get content deduplication for free, and that it's super cacheable: because there are no versions, you don't need expiration times, so you can set your cache headers to, like, infinity. And you get integrity checking for free, if there's any disk corruption.
00:12:22
Here's an example of using the command-line interface. You can run pk list on a new instance and see that there are no blobs. Then you can echo 'hello' into pk put blob, and you get back a
00:12:34
blobref, truncated here for legibility. If you run the same command again, you get the exact same blobref, because the bytes of 'hello' are still the bytes of 'hello': there's the content deduplication. You echo 'world', you get a different
00:12:47
blobref. And pk list, and you see your blobrefs and their sizes (maybe you call that metadata, but whatever: we store blobrefs and sizes), and you get them back in lexical order. Then you can get a
00:12:59
blob back with pk get. On my personal instance I currently have about 6.6 million blobs; obviously, I didn't use pk put to put them all there.
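To make that concrete, here is a minimal Go sketch of how a blobref is formed (it assumes SHA-224 as the default hash and is illustrative only, not Perkeep's actual code):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// blobRef content-addresses a blob: the hash function's name, a dash,
// then the hex digest. SHA-224 is assumed here for illustration; the
// point is that the hash name is spelled out in the identifier.
func blobRef(data []byte) string {
	return fmt.Sprintf("sha224-%x", sha256.Sum224(data))
}

func main() {
	fmt.Println(blobRef([]byte("hello\n"))) // same bytes, same ref:
	fmt.Println(blobRef([]byte("hello\n"))) // deduplication for free
	fmt.Println(blobRef([]byte("world\n"))) // different bytes, different ref
}
```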
00:13:11
The blob server operations that the low-level interface can do: you put a blob, you get a blob by hash, or you can enumerate all the blobs in sorted order. Notably, there's no delete. It's much harder to lose your
00:13:24
data when we don't let you delete it. The only way to delete it, or to lose it, is if there are hardware failures and things like that, but we also deal with that. There is kind of a way to delete things in emergencies, if you really, really need to, but we make it
00:13:37
very hard, and you have to restart the server in special modes and things like that. Now, about Go and our main server implementation: I mean, Perkeep is defined as a
00:13:48
specification of APIs and file formats and so on, but we have a kind of de-facto popular implementation, which is in Go, and internally we have a Go interface that supports these
00:14:00
four operations. And because it's a very easy interface to implement, there are tons of implementations of the blob storage.
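Roughly, that interface looks like this in Go (a hedged sketch: the names and signatures here are illustrative, not the exact interface from the Perkeep codebase):

```go
package blobserver

// SizedRef pairs a blobref with its size; blobs cap out at 16 MB,
// so a uint32 is plenty.
type SizedRef struct {
	Ref  string
	Size uint32
}

// Storage is a sketch of the four blob-server operations.
// Note what is missing: there is no Delete.
type Storage interface {
	// ReceiveBlob stores the bytes and returns their content address.
	ReceiveBlob(data []byte) (SizedRef, error)
	// Fetch returns the bytes for a blobref.
	Fetch(ref string) ([]byte, error)
	// Stat reports which of the given refs exist, and their sizes.
	Stat(refs []string) ([]SizedRef, error)
	// Enumerate calls fn for every blob, in sorted (lexical) order,
	// starting after the given ref.
	Enumerate(after string, fn func(SizedRef) error) error
}
```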
00:14:13
These green ones actually store your bytes: you can store them in local files, or in a kind of packed way where there's good data locality, so chunks of your data that are accessed together are stored contiguously on disk. These ones in white are cloud
00:14:26
storage providers. All these ones in orangey brown are ones that let you compose your blob storage in different ways, or talk to a remote Perkeep instance, so you can run multiple instances and have them sync. And
00:14:40
these orange ones let you do weird things like control the read path of blobs, the write path of blobs, the replication policies, the synchronous and asynchronous replication of how your stuff flows. And you might look at
00:14:52
all that and think: oh my god, this is really complicated, I'm going to screw it up. But you can't screw anything up in Perkeep: one of the design goals is that you can't really make a mistake. There are no versions, there's no deleting, it's all immutable, so it's always safe
00:15:04
to take this Perkeep server of yours and that other Perkeep server and just copy all the files any which way; since there are no versions, you're not going to overwrite anything. Whenever I have two machines and I'm using rsync, I'm like:
00:15:15
well, I think that's the same home directory; we're going to rsync from this machine to that machine; wait, which one is newer, and where did I edit the file last? So I'm always terrified of using rsync and screwing something up. But with Perkeep you can't screw anything up, because all
00:15:27
blobs are immutable: you can always add things, and you can't replace things. And if you're paranoid that something is out of sync, there's a little sync-validation feature, so you can verify that all your copies have
00:15:40
what you want. You can do this from the command-line tool, or there's a thing in the web UI. You can configure as many sync pairs as you want, one-way sync pairs, or you can set up two one-way sync pairs to make a pair, so you can
00:15:52
write to either one and they get eventually consistent. Here I'm validating that all my blobs are on S3 as I thought, and so far it's found nothing that's screwed up. It's 11.3% through; it takes a couple of
00:16:04
minutes. All right, so that was the blob layer, and we'll build up from here. But what actually is in a blob? You can put anything you want in a blob. You could put some middle chunk of a JPEG in a blob,
00:16:17
which is probably what most of my 6.6 million blobs are: some random region of a JPEG. However, some blobs are kind of special, to us and to
00:16:29
Perkeep, and those are what we call the Perkeep schema: a bunch of blobs which seem to be formatted in a certain way, and indeed they are, because they're actually JSON objects. Each such blob is a
00:16:42
JSON object with a bunch of different fields that are well defined, by convention. And you're going to ask: wait, wait, JSON? Why not some binary format that would be more efficient, that you
00:16:53
could transfer faster, or whatever? The reason is that we care a lot about data archaeology, and about people understanding what all this stuff is in 60 or 80 years. By picking JSON
00:17:05
we're picking something we assume archaeologists in 80 years will understand, because the web is pretty popular, and ASCII and UTF-8 are very well known. If we invented our own binary serialization
00:17:18
format, or picked one of Avro or Thrift or Protobuf, and we picked the one that doesn't win in 80 years, then we'd have this weirdo historical format that archaeologists can't understand. But if we pick something that is
00:17:31
already in popular use, like JSON with UTF-8, and we're very verbose in the fields, then it's readable. We also make sure we're very verbose in our blob identifiers: earlier I showed that we can
00:17:42
have SHA-1 and SHA-2 in our refs; we explicitly put the hash function's name in there, so archaeologists in the future will know what all that hex stuff is. It also lets us upgrade over time: when we upgraded from SHA-1 to SHA-256,
00:17:55
well, SHA-224, we could do that gracefully, without having to re-index the world, like Git is struggling to do right now. So, we talked about how we support files, even though files are
00:18:07
boring. We support modeling files. For instance, if I echo 'hello world' into hello.txt and pk put hello.txt, I get some blobref back, but that blobref is not the bytes 'hello world'. If I fetch
00:18:19
it back, I get some JSON blob: this is a Perkeep schema blob, and you can see the parts in there. There's one part of data, which is 15 bytes, and if we fetch that identifier down
00:18:31
here, you can see the actual bytes, 'hello world'. But the schema blob also shows the original file name, the permissions, and the owner.
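Modeled loosely as Go structs, a file schema blob looks something like this (the field names follow the camli-prefixed convention from the docs, but treat the exact set as approximate):

```go
package schema

// BytesPart is one chunk of a file's contents: a reference to a data
// blob and how many bytes of it belong to the file.
type BytesPart struct {
	BlobRef string `json:"blobRef"`
	Size    uint64 `json:"size"`
}

// File is a loose model of a "file" schema blob; the real schema has
// more optional fields (permissions, owner, modification time, ...).
type File struct {
	CamliVersion int         `json:"camliVersion"` // always 1
	CamliType    string      `json:"camliType"`    // "file"
	FileName     string      `json:"fileName"`
	Parts        []BytesPart `json:"parts"` // concatenated, these are the file's bytes
}
```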
00:18:43
So we really do support files, even gigantic ones, even though we only store blobs of at most 16 MB: you can put in a 5-terabyte video file or VM image, and it's a big Merkle tree of chunks with all your data at the leaves. And we do the whole rolling-checksum thing, where we don't just cut the
00:18:56
file exactly every 16 MB; instead we compute this rolling checksum over the file, and when the checksum has a certain number of bits set, that decides the strength of
00:19:08
the cut point, and then we kind of balance that into a Merkle tree. So even if your file shifts around a little bit, or gets insertions in the middle, we still deduplicate without re-uploading the whole thing. Like,
00:19:22
if you had a 16 MB MP3 file with the ID3 header at the front, and you modified the ID3 header to change the song title by one byte, your whole 16 MB MP3 file is shifted by one byte, but you're not re-uploading
00:19:35
the whole 16 MB, because the cut points are based on the contents, not on length boundaries. So we can very efficiently store gigantic files, like VM images. This also lets us do efficient seeks to any point in the file, and arbitrary reads, like pread() system calls, through FUSE.
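The cut-point idea can be sketched in a few lines of Go. This toy version uses a simple additive rolling sum over a sliding window, not the actual rolling checksum Perkeep uses, but it shows why chunk boundaries depend on content rather than on offsets:

```go
package chunker

const (
	window  = 64        // bytes of context the rolling sum looks at
	cutMask = 1<<13 - 1 // low bits that must be zero: ~8 KB average chunks
)

// cutPoints returns offsets where the rolling sum says "cut here".
// Inserting a byte near the start only moves the cuts close to the
// insertion; later chunks, and their blobrefs, are unchanged.
func cutPoints(data []byte) []int {
	var cuts []int
	var sum uint32
	for i, b := range data {
		sum += uint32(b)
		if i >= window {
			sum -= uint32(data[i-window]) // slide the window forward
		}
		if sum&cutMask == 0 {
			cuts = append(cuts, i+1)
		}
	}
	return cuts
}
```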
00:19:47
We also support directories. Here we make a directory foo, we put a couple of files into it, and we get the
00:20:00
blobref for the directory back out. You can see that its type is directory, but we don't see the entries inline there: there's a reference to an entries blob, and that entries blob is a static set of members, and those members themselves
00:20:11
are schema blobs that reference the files inside. So you could pk put your home directory, and it'll only upload the things that have changed, and then it recomputes the kind of Merkle-tree-ish
00:20:24
directory structure and re-links it. This also lets you go back and look at your home directory at any point in time, if you back it up, say, once an hour or whatever. So, some
00:20:37
schema blobs are super special, and we sign those, kind of like you sign Git tags. We use OpenPGP; every user has a key pair. We deal with this behind the scenes if you're not a
00:20:50
crypto nerd: by default you just start the server, and if you don't have a key it will create one for you. So, I mentioned that I have 6.6 million blobs. The question here is: how do I know which
00:21:02
of these are tweets, and which of these are the middle of some JPEG, which is super boring? That's where we come to the indexing layer, which is, like, the second most important thing in Perkeep, because it allows you
00:21:15
to organize your stuff. Basically, the indexer is like another blob storage: it has almost the same interface, so it can receive blobs, you can replicate blobs to it, and you can sync it with another blob storage. It
00:21:29
takes a blob, which is a bunch of information, and transforms it into sorted rows in a sorted key-value store: so, more organized
00:21:40
information already. And why a sorted key-value interface? As we saw before with the blob storage, we have this interface that we mainly implement in Go, but the language is irrelevant. These are basically
00:21:54
the three operations that you need to implement to have a sorted key-value store that you can use as an index: the usual Set, if you want to store a value under a key; then Get, if you want to retrieve that value by its key;
00:22:07
and then you have this Find operation that gives you an iterator, so you can iterate through all your sorted rows.
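In Go terms, the index store needs roughly this much (a hedged sketch; Perkeep's real sorted key-value interface differs in names and details):

```go
package index

// Iterator walks rows in key order.
type Iterator interface {
	Next() bool // advance; returns false when exhausted
	Key() string
	Value() string
}

// SortedKV is the whole contract an index store has to meet, which is
// why LevelDB, SQLite, MySQL and friends all fit behind it.
type SortedKV interface {
	Set(key, value string) error
	Get(key string) (string, error)
	// Find returns an iterator over keys in [start, end), sorted.
	Find(start, end string) Iterator
}
```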
00:22:19
Those are the implementations we already have, and it's very easy to add new ones. By default we use LevelDB, because you get no extra dependency with it and it works pretty well, but we also have the SQL ones: SQLite, MySQL,
00:22:30
etc. Also, it's not really important where your index is, or how it's implemented, because it can be deleted, it can be corrupted: it doesn't matter. It's easy to rebuild it just
00:22:45
from your blobs: you just go through all your blobs and write all these rows again. The index just makes everything faster and organized. Notably, you don't have to index your blobs in order: if you've got blobs A, B, C the
00:22:57
first time you index them, and later you index C, A, B, you'll get to the same index in the end. So now we're going to see what really happens when Perkeep receives a blob: what does
00:23:10
the indexer do with it? The first step: we have this HTTP handler in front of everything; it receives the blob, and the first thing it does is write it to what we'll
00:23:24
call the primary blob storage; that's the default configuration. Then this blob gets synced, or replicated, or whatever you want to call it, to the index blob storage. Then, if it
00:23:37
is just a bunch of data from a file or whatever, if it's not a JSON schema blob, a Perkeep schema, there's nothing to do, because for now it's not interesting. Otherwise, if it's a JSON
00:23:49
schema blob, one of the Perkeep schemas, the indexer goes through it, analyzes it, and writes some rows. The first thing it can be is a file schema, as I
00:24:01
showed you before. If it's a file schema, you want to look at the data of the file again, so we get it from your blob storage; then you record any interesting stuff that you'd want to record about it: its modification time; if it
00:24:13
has some EXIF tags, its location; the dimensions of the image, and so on; or some audio properties. Then, if it's a directory schema, as we showed before, you
00:24:26
drill down into the schema and you get all the children, and you record all the children, so you can later go from the parent to the children fast. And finally, if it's a mutation claim on an
00:24:38
object (we will see later in detail what our objects and mutations are, but it's something that modifies an object), you first verify that it's a valid mutation: that it has
00:24:49
a valid GPG signature. And you record all the attributes that are asserted by this mutation. It can be a tag, a title, a location, anything that we record in the index. Finally, one last
00:25:04
piece of the index, which is kind of an implementation detail, but we totally rely on it: the corpus, which is an optimized in-memory version of the index. As soon as the index rows are
00:25:16
written, the corpus gets the blob too and also does something with it. And so, you can imagine, it's an even more organized structure: we have these rows, and we also have this
00:25:28
structure in memory, which is a bunch of maps and lists that are lazily reorganized. So when the corpus gets the blob, after the index rows have been written, it updates all
00:25:40
these maps and those lists, which are typically: the GPG signers, so the owners of the data; the objects that we have; the mutations on these objects; all the files and their attributes; image attributes. All this
00:25:53
stuff that we're interested in is now very well organized for fast access. And this is where we come to the search layer. So, that corpus basically exists as an optimization for
00:26:05
the search layer. Perkeep has its own built-in search interface. There are basically two ways to search, which I kind of call the easy way and the hard way. The easy way is kind of like you would
00:26:17
search in Gmail: the 'is:image'-style operators that you saw in the screenshots earlier. You type these, and this is how we primarily search, and it compiles into the hard way for you. So here's
00:26:31
a search for images in a certain time range, in Hawaii. We have a bunch of different operators; there's documentation on the website, and in the UI, about how these all work. We also support things like and,
00:26:43
or, not, and groups. Just the other day I was adding support for iOS HEIC files, which are replacing JPEGs, and I found that, like, 70 of my images weren't parsed properly, and they didn't have thumbnails,
00:26:56
and I had a bug in my parser. So I did a search query to find my corrupt HEIC images: I took all the file names that look like HEIC files, minus the set of HEIC files that have a width between 1 and, you
00:27:10
know, some arbitrary number. So basically: show me HEIC files that don't have a width, and that basically means the HEIC files I screwed up on. So we fixed the bug and re-indexed those, and it was good. It's really nice to be able to
00:27:22
do, you know, sets and negations and so on. The hard way, if you want to do something custom: we have this JSON schema where you can write this whole kind of structured, nested search query. The
00:27:35
top level is a search query, where you can either do an expression, which is kind of the easy way, or a constraint. A constraint is basically a matching policy that goes over your blobs, and you also specify a limit and a
00:27:47
sort. The constraint lets you look at properties of a file, or a directory, or claims, or objects, or logical constraints, which lead to ANDs, ORs, XORs, and then nested
00:28:01
constraints: sub-queries within sub-queries, and joins, for each one. When you get down to the leaves of things, like strings and integers, you can match integers on ranges, or prefixes, or suffixes, or byte lengths of strings. So
00:28:15
basically, any policy that you can think of for matching or joining your data, you can express with some JSON. So for instance, if you do a search for file name 'hello*', so any
00:28:27
file name that contains 'hello', that compiles down to a constraint like this: it's a permanode (that means it's an object, which we'll talk about) whose content points to a file where the file name contains 'hello'.
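Reconstructed from memory of the search API, the JSON for that constraint is shaped roughly like this (embedded as a Go raw string; treat the exact field names as approximate):

```go
package search

// Approximate structured query: permanodes whose camliContent points
// at a file whose name contains "hello".
const query = `{
  "limit": 50,
  "constraint": {
    "permanode": {
      "attr": "camliContent",
      "valueInSet": {
        "file": {
          "fileName": {"contains": "hello"}
        }
      }
    }
  }
}`
```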
00:28:39
These constraints can be a page long if you have something complicated, but then you can take these search queries and save them as saved searches, and then use them later in the easy
00:28:52
query language, to reference a complicated search. But generally, if you find yourself writing these low-level ones too often, there is probably a missing operator in the easy search
00:29:05
mode, and then you just modify the search system. And here, yeah, are the results of a search, returning blobs. So, I kept teasing objects and permanodes, and
00:29:16
this is addressing the question that Perkeep only stores immutable blobs. The question is: how do you model something mutable in an immutable world? And also things like: how do we store a tweet, or something that
00:29:29
isn't a file? So we have this concept called a permanode, and every time you run pk put permanode, or use the API or the web UI or an importer, it creates you a random blobref.
00:29:43
Every time you run it (as opposed to before, when we echoed 'hello' and got the same blobref back every time), pk put permanode gives you a new blobref. If you look at one of these blobrefs (this should have been in red),
00:29:55
you get back this thing. What it is, basically: there's a random string in here, some random bytes, base64-encoded; but then it's signed, so there's a signer here, and
00:30:08
this is me as a user (it represents my public key), and then this is a signature over the previous part of the blob. So all this does is establish that there is an object in
00:30:20
the world, that the object has this identifier, and that I am the owner of this object, so I'm the only one who can make mutations on it, until I give other people access
00:30:31
to make mutations on it. If you fetch that signer thing there, it's just a public key blob, so it's verifiable by other people in the world. So you do things like create an object with pk put permanode,
00:30:43
and then you put attributes on it, like title = 'fancy title', and then we can replace the title and say no, it's 'better title'. And then, if you describe it (this is asking the search and indexing layers what they think of all this),
00:30:56
it says: oh, this thing, 'better title'. But you can also look at the history of this object: you can ask for all the claims, which are basically anything that's signed, and say, show me all the
00:31:13
mutations that have happened to this object over time, and you can see that first it had 'fancy title', and then it had 'better title'.
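Each of those mutations is itself a small signed schema blob; modeled loosely in Go (field set approximate):

```go
package schema

// Claim is a loose model of a set-attribute claim: a signed statement
// that, at ClaimDate, the signer set Attribute to Value on PermaNode.
// Rebuilding an object's state is just replaying its claims in
// ClaimDate order; "as of" queries stop the replay early.
type Claim struct {
	CamliVersion int    `json:"camliVersion"` // always 1
	CamliType    string `json:"camliType"`    // "claim"
	CamliSigner  string `json:"camliSigner"`  // blobref of the signer's public key
	ClaimDate    string `json:"claimDate"`    // RFC 3339 timestamp
	ClaimType    string `json:"claimType"`    // e.g. "set-attribute"
	PermaNode    string `json:"permaNode"`    // blobref of the object being mutated
	Attribute    string `json:"attribute"`    // e.g. "title" or "tag"
	Value        string `json:"value"`
}
```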
00:31:29
And here you can look at a point in time: you can describe this object as it was back then. This is how you look into the past: basically, you just ignore all mutations that happened past
00:31:40
that time. Now, if I look at my list of blobs, there's all this other stuff, and a lot of these are an object, or mutations on an object. They're quite small; they're stored
00:31:53
efficiently. So here's doing a search query (I had created a permanode): all my funny things, all things tagged 'funny'. I get nothing. But then I modify that first object and give it the tag
00:32:06
'funny', and I do the search again, and there it is: I got my funny tweet, or whatever. And that's the same search done the complicated way. And this is another example:
00:32:19
a case-insensitive search for the tag 'funny'. So, we have now worked our way all the way back up, from the blob storage to the search layer, and we're going to go back to what you would use
00:32:37
as a user, now that you understand how it works, and this is the web UI again. The web UI can also do the nerdy stuff, if you want to drill down into the low-level guts: you can see all the permanode
00:32:50
attributes that we were talking about earlier directly from the web UI. We have these aspects at the top of the web UI, which are all the interesting views you can have on an object. So, as we showed
00:33:03
a moment ago, you can show this permanode's attributes and see all the details of the permanode itself; but you can also go even further and look at the blob itself, what it contains. You get all the contents of the
00:33:15
blob, and what the indexer has indexed about that blob: everything it has found interesting about it, all the rows it wrote. And the same for all of the
00:33:27
mutations that have happened on that permanode that the indexer knows about: you can see all of them in the web UI. Okay, and now we want to talk about the project. The project
00:33:41
is about eight years old, and it started just about the time that Go became open source, which was November of 2009. I was commuting at the time between San Francisco and Mountain View for
00:33:52
Google, and it's a very painful, long bus ride. There was Wi-Fi, but the Wi-Fi was terrible, and I needed something to hack on. At the time I was working on Android, and the Android build system is terrible and really, really slow, and requires this monster of a seven-
00:34:06
thousand-dollar machine to compile, so I didn't have that on the shuttle, and I didn't really want to write Java. And I had this idea for a storage system, for Perkeep, and I wanted to play with it, but I couldn't imagine writing it
00:34:19
in any of the other languages: I used to write a lot of Perl and Python, but I was kind of done with scripting languages, with compile errors showing up as errors at runtime; and I didn't want to write C++, because all the build-system
00:34:31
tooling is terrible; and I didn't want to write Java, because I'd had enough Java with Android. Go had just become open source. I didn't really know much about it, so I just decided to start hacking on it for fun, because it worked well on a laptop and I could do it while
00:34:43
sitting on a bus, and I kind of fell in love with it. So I kept working on this, and I kept sending patches to the Go team to add stuff to the standard library and fix bugs and flesh out the HTTP implementation. So actually, now,
00:34:56
much of the Go standard library came from Perkeep originally: database stuff, HTTP stuff, the JSON stuff, and the subprocess handling. So, I work on
00:35:10
the Go team now, and I kind of manage the open-source side of Go; but anyway, Perkeep has kind of gone hand in hand with Go's development, and it's probably one of the oldest Go projects. So
00:35:21
yeah, I work full time, I guess, on Go, and this is kind of my side project. And yeah, I kind of work full-time on Perkeep now, to make
00:35:35
sure it is at least maintained and moves forward when Brad doesn't have enough time to dedicate to it. So he's generally responding to all the bug reports, while I'm filing bugs and copying him. So, we're now using Open
00:35:48
Collective for funding. This is kind of a site like all these other funding sites, like Patreon. Patreon is more for artists; it's not as geared towards open source, or it
00:36:00
doesn't work as well with open source. But Open Collective works much better with open source, in that organizations fund organizations. So if your users don't know that you're using some library, but they really like your project and they want to fund your
00:36:12
project, and your project really depends on that library, your project can become a sponsor of that other project. The cool thing with Open Collective is that all your funding is public, so all your bills are
00:36:26
public: everyone sees your balance sheet and where your money is going, and people can vote on bills, or say that things are approved. Anyway, yeah, our plan is to do something for
00:36:45
the people who are giving us money. We've just been using this for a short period of time (Open Collective is pretty new), but they have an API, so we want to, I don't know, give them a thumbs-up or something in the UI if they've paid, or recognize them
00:36:57
on the website. But in the meantime: thank you to the people who are paying for this. Anyone here pay for Perkeep? I also want to thank the developers: there are about 120 people who have committed to
00:37:11
the project overall, 23 of whom have done a significant number of commits. So yeah, open source. Here are some of our notable contributors, who have done a lot of
00:37:21
commits or major parts. So, if you want to get started and help us, you can of course file bugs and so on, but we've also kind of demonstrated that it's easy to build on top
00:37:35
of Perkeep, and also to run it in the cloud or wherever; you don't need a whole infrastructure. We made this thing which is a deployer for Google Cloud, so if you go there, it will use your Google account to deploy
00:37:48
a Perkeep instance on Google Cloud that you own. And of course you pay for the CPU time and the storage, but there's a free-tier option. It kind of just works,
00:37:59
and we have to give kudos to Let's Encrypt, because thanks to them we can offer you some sort of DNS, and thanks to that DNS, and to Let's Encrypt, you get everything
00:38:13
over HTTPS on that instance, without any effort on your part. You can of course use your own DNS afterwards (and you should), but yeah, thanks to them.
00:38:26
Of course, if you like doing things your own way (and eventually you probably should, if you want to hack on it or whatever), you can just use the usual Go tools to first get the server,
00:38:38
which is perkeepd, with this command, and all the tools to interact with the server. I guess this is if you're following tip, but we also do periodic releases, and you can just download binaries. And
00:38:50
we run on just about everything, including Windows, which we sometimes test, thanks to our developers, who usually test first whether it works on Windows or not. There are a bunch of them,
00:39:02
I'm telling you, and if you broke Windows, they'll want you to fix it. Okay, we just found that there's a service to run Windows CI systems; but this is the Linux conference, no? So yeah, this is
00:39:17
how you can help us. Of course you can file bugs: try it, and anything you find, you can file a bug about it; well, search whether the bug exists already, of course. And yeah, our docs are not the best: we usually just use the
00:39:30
generated godoc, because godoc is already very good, so we rely on that; but there's a lot of doc writing to be done. And then, if you want to contribute code or whatever, writing an importer is one
00:39:42
of the best ways, because we've done it a lot of times and we have a good code base for it, so it's mainly replicating what already works for other sites: understanding what the API is, and hoping that it won't change in, like, two months
00:39:55
in a way that breaks your importer. So that's one thing you can do easily. And then you can also write an app: if there's anything you're interested in, and you want to store it in Perkeep, you can write one, like we
00:40:07
did with the publisher; and we've done another one that relies on Perkeep to store things and search things. As an aside: when APIs change all the time and have bugs, it's really terrible.
00:40:18
Even Google Photos: Google Photos doesn't have an API; they support the Picasa Web API, which has a different data model, and you can only have 10,000 photos in a gallery, but
00:40:30
Google Photos doesn't have galleries, so there's one kind of implicit default gallery. So once you have ten thousand photos in Google Photos, you can't get them out using the Picasa API, which is a bunch of XML and cruft. So then you
00:40:41
have to use the Google Drive API too, but first you have to go into Google Drive and say 'make my Google Photos available in Google Drive', and then you can get at them. But the metadata is kind of different, so we have to do all this matching up, to find which identifiers in
00:40:54
the Picasa Web API are the same identifiers in the Google Drive API, so as not to store two permanodes per photo. So, I work at Google and I file bugs, but,
00:41:04
you know, the machine moves slowly. So yeah, these are the resources you can use to get started or to ask questions. We now have a fairly stable presence on IRC, if
00:41:19
you want, and yeah, our new mailing list, and of course we're hosted on GitHub as well. We do all the code review on Gerrit, though, because it's just much better for code review; GitHub is
00:41:32
getting better, but we still find it a little painful. But all the code is mirrored to GitHub, and we use GitHub for the issue tracker and things like that. Okay: questions? Right, yeah, there you
00:41:47
go. I'll repeat the question for the
00:42:01
live recording. The question is: how do we actually, personally, use all this stuff; what do we do once we've got the tweet mirror and so on? So personally, I do all my backups into this, so traditional
00:42:13
files and stuff like that, so I have them and I can go back if I lose things. I use it for off-site backup too: I mirror to Google Cloud, so I have a local copy in my house, but I also
00:42:25
have it all mirrored in the cloud, and it has built-in sync. (Does it get your tweets?) It gets all your tweets, all the time, continuously. (I think he was asking about blob replication.) Oh yeah, you can use rsync if you want: if you
00:42:40
use the plain file blob storage, then you can use rsync, but it's kind of pointless; you can just have Perkeep manage it for you, and then it'll validate things and stuff. Also, one of the blob storages is an encryption wrapper, so if you don't trust
00:42:52
storing your blobs in the cloud, you can encrypt them before you put them in the cloud, and it will deal with all the encryption, and with transparent decryption when it accesses the cloud. And you can have your local instance hold a subset of your data, and
00:43:04
have the majority of your data in the cloud, lazily faulted in as needed, decrypted from the cloud, and then cached locally. So if your laptop doesn't have the terabytes you need, you can
00:43:16
pk mount it and use the FUSE filesystem on your laptop, and just lazily fault in the data as you need it, and it'll have a cache of your recent things. I also use it for searching a lot: Twitter search only
00:43:29
lets you search back, like, so many months, and it doesn't let you search replies. So when there's a tweet where I was explaining something or linking to something, I use Perkeep search to search my tweets, because it's a much
00:43:42
more powerful search API than Twitter's. I also search my old check-ins, because the Swarm app doesn't really have good search. So if I was in a city a couple of years ago and I went to a restaurant, and I kind of
00:43:53
remember what the restaurant or the bar was like, I can go find that restaurant or bar really quickly, or I can zoom in on the map and go: yeah, that's the bar. So yeah, I use it to look at my old data. And also, when I'm bored, I'll just look at the map
00:44:05
view: I zoom in and look at photos from some place I went on vacation. Like, I was in New Zealand a few years ago, and there are all these beautiful photos, and I had kind of forgotten about them, and I was looking at the map view and saw them. (We could do
00:44:18
demos, actually.) Yeah, it is like a photo book that you never think of opening, but the map view forces you to open it, you know? I have all these great photos.
00:44:30
(That's cool. Yeah, a question over there.) The question is about the IndieWeb. It's sort
00:44:51
of like: hey, so this is Publish Elsewhere, Syndicate to your Own Site, and that's kind of what Perkeep is focused on mostly; do you folks see that as transitional, compared
00:45:04
with POSSE, where you're putting your data into your own site first, and just sharing it on Twitter? Yeah, and this is the thing we struggle with in Perkeep: I really
00:45:17
want to go the POSSE way. Yep, let me summarize the question. So, earlier I said PESOS, which is where you publish on some third-party site like Twitter and you suck it back into Perkeep; but that
00:45:30
model is really busted: you should own the data first and then give it to them. That's the POSSE model, right? We want to get there. We actually want to go further and say that you have only one hard drive, your hard drive is in the cloud, and sites use it
00:45:43
as a tool: they talk to you, and they use your hard drive. So, one of our storage plugins is called namespace, and it lets you give a blob-storage interface to somebody else and lets them use a subset of it, so
00:45:56
they can read, write, and delete in their own sandbox, but you have access to the whole thing; they're basically in a container. That way they can build an app on top of reading and writing to your blob storage directly, and this is kind of the model we
00:46:08
would like to get to; but good luck convincing Twitter to support Perkeep as a backend. So we're pragmatic in the meantime and say, you know, most of these sites don't care about the IndieWeb, so we'll just suck in stuff; but we want developers to build on
00:46:22
top of Perkeep as the direct backend, using the API and the app interface and so on. With photos, you know,
00:46:36
you can go that way with photos already: your own photos, you can directly store them in Perkeep and then share them wherever you want. Anyway, so here's zooming into New
00:46:50
Zealand and seeing... hey, zoomed in too far; this Leaflet library we use... there. So now you can see check-ins, and
00:47:00
you can see, here's an image: things in New Zealand. (Is there a limit?) Oh yeah, the limit: you know, all the search and indexing stuff should scale;
00:47:20
we try to avoid anything that's O(n). But this Leaflet thing kind of sucks at a high number of markers, so we've recently restricted it to, like, 250 markers or something, which seems to make
00:47:32
it perform a little bit better. And we have a JavaScript bug right now (we kind of suck at front-end stuff) where sometimes we reset the map: you saw it zoom all the way out there; it was finishing another XHR or something,
00:47:44
and that had different location bounds, and it zoomed me back out for a sub-query that was happening. But yeah, I can see tweets and check-ins and stuff here. Past a certain size, it's basically about how well we scale the algorithms and don't make
00:47:57
things too slow when you have too much data; we just have to be good at that. Sometimes, like when we were playing with this the other day, we found that a search query was taking, like, three seconds, and we found that...
00:48:09
there's a query optimizer in the search layer that looks at that big, complicated JSON thing and decides which indexes and which data structures in the corpus to use, and we found that it was not using a good one. So we fixed the query planner to recognize a
00:48:21
certain pattern and use the right index, and then it was fast. And then we added tests to make sure that it keeps using that index in the future, and we try to optimize all the common queries that the UI and such do. And if it's using the wrong one, you
00:48:33
can give it hints about which indexes to use; but worst case, you can get into cases where it's a linear scan, because the index doesn't exist in the
00:48:40
corpus. (When is the next release?) We tried to do it yesterday. Yeah, the plan is... yeah, he's here for another week, and then we're going to try to
00:48:57
get a 0.10 release out in the next couple of days (it was supposed to be out, like, yesterday, but we decided to do slides instead), and then we want to do a 1.0 within, like, a month or something. I don't know if you've noticed, but we already made a new Android release a few
00:49:10
days ago. So, like, here's, you know, check-ins, and here's tweets, and... here, let me get out of all my stuff.
00:49:26
We've been lazy about it, because it's so easy to install with Go, so it's easy to install from source; but yeah, we should do binary releases. There are my
00:49:37
babies; they're sleeping. Yeah, so the question
00:50:00
is: what's the user experience for dealing with multiple blob storage providers? So again, there's the easy way and the hard way. The config file for the server,
00:50:11
the easy way, supports the common configurations: if you want to use S3, you put in your S3 key for your bucket or whatever, or for Google Cloud you put an access token and a bucket thing in there, and then we wire it up with what we assume you want.
00:50:25
Like, if you specify a local disk path, say /home/brad/blobs, we're like: okay, you probably want your blobs there, and you probably want them replicated to S3, and we configure all
00:50:35
the sync pairs and stuff. Yeah, by default it synchronously writes: it doesn't acknowledge the write to the client until it has both indexed it locally and written it to the
00:50:50
one blob store; but then it sets up an asynchronous replication pair to your remote, cloudy ones, and that runs in the background and validates in the background and stuff. But if you want to do something crazy, weird, or just custom,
00:51:02
you can wire up whatever you want: basically, you list all your storage providers, you specify which one is the default, and then some of them can be a conditional one, or a routing one, or a synchronous-replication one, or an async-replication one. You can wire up the whole graph however you want, specifying the path for reads and the path for writes.
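For the easy way, the whole thing is one small JSON config file; a hedged sketch of the common case (embedded as a Go raw string; the key names are from memory of the docs and may not match the current format exactly):

```go
package config

// Hypothetical "easy" server config: local blobs, replicated to S3,
// with the identity used for signing claims.
const serverConfig = `{
  "auth": "userpass:alice:secret",
  "listen": ":3179",
  "blobPath": "/home/alice/var/perkeep/blobs",
  "s3": "ACCESS_KEY:SECRET_KEY:bucket-name",
  "identity": "XXXXXXXX",
  "identitySecretRing": "/home/alice/.gnupg/identity-secring.gpg"
}`
```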
00:51:14
Yeah. And one thing we want to support is being able to discover other people's instances by their
00:51:34
public key, like a big distributed-hash-table thing where you can find their instance, and they can host your blobs. We support an encryption target already, and we support this kind of subset thing, so your friends could... and we support a
00:51:46
cache layer that has an upper bound, so you can say: I will store, like, 20 gigs of your encrypted blobs; and you're just cross-replicating each other's stuff. And we didn't talk about sharing, but you can already
00:51:58
share parts of your blobs: you get this one blob, which is like an authorization, and you give this one blob to your friend, and they can use it to access everything that you've decided to share, recursively down. So it's usually used for directories:
00:52:12
you just say, hey, I want to share this whole directory; they get one blob; and they use the command-line tool, or the web UI, or they just FUSE-mount it. So, the last time he was here, I had all the
00:52:25
photos of him from his trip, so I gave him a share URL. Basically, that thing is a JSON schema blob that says it's a share, a kind of access token, and it points to
00:52:37
this other thing, which was a directory, and the top one says it's transitive: anything that's reachable from the directory, all the data bytes, he has access to. Then he presents that to my blob server, and my server will serve those
00:52:51
blobs, since he presented the access-token blob for them. And then he can just pk mount that directly and navigate it: he clicks around and sees his photos, and behind the scenes that faults in all the blobs and caches them on his instance.
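The share blob itself is just another schema blob; loosely (field names approximate):

```go
package schema

// Share is a loose model of a transitive share: a signed token whose
// possession ("haveref") grants read access to Target and, because
// Transitive is set, to everything reachable from it.
type Share struct {
	CamliVersion int    `json:"camliVersion"` // always 1
	CamliType    string `json:"camliType"`    // "claim"
	ClaimType    string `json:"claimType"`    // "share"
	AuthType     string `json:"authType"`     // "haveref"
	Target       string `json:"target"`       // blobref of the shared directory
	Transitive   bool   `json:"transitive"`
}
```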
00:53:02
We want to make this a little bit more transparent, and have it all happen within the web UI: we want kind of a friends model, or a followers model, eventually. You know, I kind of want to do the content first in
00:53:16
here, and do my posts (you know, replace Facebook and Twitter) and do my social networking and my photo sharing all within the app, and then people that I'm friends with can have different
00:53:28
access and stuff, and be in different groups. (The providers are Google and Amazon; it sounds like it's possible, but I was wondering, do you have any development or discussions around decentralized storage, like File-
00:53:44
coin?) So, we don't support Storj or Filecoin, but there's no reason you couldn't: it's really, really easy to write blob storage layers, and we kind of
00:53:58
have to ask people to stop writing new ones, because a lot of people just want to write one for funsies, and they've basically all been written; but not Filecoin. So if you want to add one of
00:54:09
those, that's cool. Oh yeah, and there's a bunch that aren't listed on this slide, because they just wouldn't fit on one
00:54:26
slide. (One of the slides was saying that all the blobs are stored by their hash; can you detect corruption?) Yes, that's correct. So we know the hash really well, and
00:54:39
presumably you have more than one copy, so if you read it and you find it's wrong, you know, you log big warnings. I mean, hopefully your cloud provider isn't totally screwed, but you
00:54:51
don't use that blob, because it's corrupt, and you have big warnings, and you can read from somewhere else. (You mean automatically heal from one store to the other, when you detect it?)
00:55:06
Which one was it... yeah, so one of the storage providers is 'replica', so you can specify things, and on reads you can specify an order, or you can race them: you can say these two come at the same
00:55:19
time. So if you read from, like, this disk and it comes back with a corrupted checksum, you won't use that one, because that's an error; it'll use this other one. So you can have, like, remote-1 and remote-2 replicas, or you could have two replicas
00:55:31
that are remotes talking to your own servers inside your house, and if one of those comes back with a corrupted result, the client will treat it as an error and use the other replicas, right? So, I mean, you could wire this up however you want. I
00:55:42
don't use that for the cloud providers because, like, if they're returning an error, they have bigger problems.
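A sketch of that read path, reusing the Storage and Ref types from the sketch above and assuming sha224-style refs: try each replica in order, verify the bytes against the content address, and fall back on corruption. (The racing variant he mentions would fire these fetches concurrently instead of in order.)

```go
package blobstore

import (
	"context"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"io"
	"log"
)

// fetchVerified reads a blob from the first replica whose bytes match
// the content address, treating corrupt copies as errors, exactly as
// described: a bad checksum means "don't use that blob, log big
// warnings, read from somewhere else".
func fetchVerified(ctx context.Context, replicas []Storage, ref Ref) ([]byte, error) {
	var lastErr error
	for _, s := range replicas {
		rc, _, err := s.Fetch(ctx, ref)
		if err != nil {
			lastErr = err
			continue // replica down or missing this blob; try the next one
		}
		data, err := io.ReadAll(rc)
		rc.Close()
		if err != nil {
			lastErr = err
			continue
		}
		sum := sha256.Sum224(data)
		if Ref("sha224-"+hex.EncodeToString(sum[:])) != ref {
			log.Printf("corrupt copy of %s; trying next replica", ref)
			lastErr = errors.New("blob corrupt")
			continue
		}
		return data, nil
	}
	return nil, lastErr
}
```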
00:56:00
Is there something where, when you do, like, a sync validate, it'll validate the checksums of all your blobs and tell you if any are corrupt? Yeah, I mean, you could; there are
00:56:12
command-line pk sync and pk index, and those do, like, sync validations, or kick them off, and you can say from a range, and you can put pk sync or pk index in a cron. Or the web UI has limited support for that: like, I go up here and
00:56:25
say, uh, so there's server status, and I can go down to my sync-to-s3 and see, like, the state of my copies. Then I can start a validation; now the validation is running, and you can see it,
00:56:38
validating the source blobs right now, and now it's validating all my stuff on S3. So you can schedule that to run periodically, but you're paying for those S3 operations then, so you may not want to; we don't have that on
00:56:50
by default. Well, it depends what you want to check: if you want to deal with the file I/O locally and pay for that activity on your local disk, that's fine; you're not really paying much for it beyond impacting the performance of other things. Here, on S3, you're paying for operations.
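Conceptually, a validation pass is just this, in the same sketch vocabulary as the earlier blocks (this is a paraphrase of the behavior described, not the real pk sync/pk index code). Each blob check is a full read, which on S3 means a GET per blob; that's the cost being discussed.

```go
// validateAll walks every blob in a store and verifies it still hashes
// to its ref, logging anything corrupt. By this sketch's convention the
// caller closes dest after EnumerateBlobs returns, and limit -1 means
// "no limit".
func validateAll(ctx context.Context, s Storage) error {
	refs := make(chan SizedRef, 64)
	errc := make(chan error, 1)
	go func() {
		errc <- s.EnumerateBlobs(ctx, refs, "", -1)
		close(refs)
	}()
	var bad int
	for sr := range refs {
		if _, err := fetchVerified(ctx, []Storage{s}, sr.Ref); err != nil {
			log.Printf("blob %s failed validation: %v", sr.Ref, err)
			bad++
		}
	}
	if bad > 0 {
		log.Printf("%d corrupt blobs found", bad)
	}
	return <-errc
}
```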
00:57:03
So, I mean, yeah, you could put an NFS server on top of the FUSE layer if you want, and you can export this as, you know, a Samba or NFS server, but it's not the most
00:57:29
efficient thing, but it works. You mean having people in the office use this as storage directly? Yeah, well, hopefully offices aren't, you
00:57:49
know, using files. I mean, like, passing documents around with versions, and renaming it to, like, "locked version two", and "I have the lock on that file", is kind of a bad way to collaborate. Yeah, that's what the FUSE layer lets you
00:58:21
do. So, you know, same story: yeah, you can make a virtual file system that looks like a real file system to your host operating system, but the backing is
00:58:30
actually Perkeep. We want to do some demos, though. Um, but no, we don't
00:58:47
support any protocol like DLNA or, um… You wanna send a photo? Oh, sure. So this is our demo instance running at demo.perkeep.org, which you can't really see. And, well, does anyone
00:59:02
not want to be on the internet, or on our demo Perkeep, which is private? So I'm going to tweet as the Perkeep.org Twitter account, and I'm just, like, "hi
00:59:21
from Bellingham". Yeah, so he's not putting that on the internet; that's just going to this… This one will go on our Twitter
00:59:39
account, so cover up your face if you don't want to be on the internet… wave! Okay, so that tweet came in in real time; there's,
00:59:52
like, you know, a WebSocket thing. That came from his Android app. This one, my tweet, is going through from the LinuxFest Northwest Twitter account… so much for real time. Maybe I didn't enable
01:00:11
real time, actually. So you go in here, into the importers in the old UI: tweets, accounts… ha, import it. There's a little toggle,
01:00:22
auto-start. Cool, huh? Yeah, as many clients as you want, as
01:00:41
many accounts as you want. So we're supposed to… we do the Twitter long-poll subscribe thingy, so a tweet should come in within a second; so that demo kind of failed by now.
01:00:53
We just created this demo.perkeep.org instance yesterday, so it's not really totally configured yet. But yeah, so in here I can go into this photo you just saw uploaded from his phone. You can see the perma-
01:01:06
node, you can see, like, the content is this, then I can go over to… yeah, the raw blobs. So I can see: here's the content, here's the signer,
01:01:20
the content is this, that's the image file, and the raw blobs: this is the file name, these here are the chunks of the JPEG, and here is the indexer metadata, like the width of the image, the latitude and longitude.
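Spelled out, the blob layers in that walkthrough look roughly like this, as approximate JSON held in Go constants (field names are from memory of Perkeep's schema docs; all refs are hypothetical). Note that the width and location aren't stored in any of these blobs; the indexer derives them from the image bytes.

```go
// Approximate shapes of the blobs in the walkthrough above.
package schema

const (
	// The permanode: a stable, signed anchor whose only content is a
	// random nonce (so two permanodes never collide).
	permanode = `{"camliVersion": 1, "camliType": "permanode", "random": "NONCE"}`

	// A signed claim attaching the file to the permanode as its content.
	contentClaim = `{"camliType": "claim", "claimType": "set-attribute",
	  "permaNode": "sha224-PERMANODE",
	  "attribute": "camliContent", "value": "sha224-FILE"}`

	// The file schema blob: name plus the ordered list of raw chunk
	// refs that reassemble into the JPEG.
	fileBlob = `{"camliVersion": 1, "camliType": "file", "fileName": "photo.jpg",
	  "parts": [{"blobRef": "sha224-CHUNK1", "size": 65536},
	            {"blobRef": "sha224-CHUNK2", "size": 21340}]}`
)
```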
01:01:33
So now I can go in and do, like, a search query for the location "bellingham"… oh, we don't have the
01:01:47
geocoding key. Yeah, so we use the Google geocoding API, and they recently locked it down, it's no longer free, so you have to, like, go get an API key. Um, but I could do it. Oh, that's how
01:02:02
we translate a name from the search into real coordinates: we ask Google where it is. We want to move to using some OpenStreetMap-y thing;
01:02:09
we do use OpenStreetMap for the map display, yeah. Oh, so here's two years ago, the demo… anyone see yourself in that photo from
01:02:27
last year? Well, you should see it straight on the map now, since it has a location… hey, there it goes, it came in on the
01:02:39
WebSocket. Now you can see all the tweets and photos and check-ins from Bellingham. So, ah, we need help
01:02:58
with the web UI; we kind of suck at front-end stuff. This is kind of a weird mix of technologies: we use, um, React, and we use Closure, and we use GopherJS to compile Go to
01:03:10
JavaScript. This map is using Leaflet and the Google libraries. Can you close your web browser? Yeah. Um, so it's kind of like, we want to
01:03:22
do a lot more with the UI, but it's not really our forte, so if you're interested in, like, rewriting the UI using something new and fancy, we hear good things about TypeScript and Polymer and web
01:03:33
components, blah blah blah, but these are things we suck at. So, um, we also need an iOS client, maybe; our Android client kind of sucks, and we don't really like writing Java.
01:03:45
Maybe you like Java or Kotlin; we're not really good at either of those. …Blobs,
01:04:00
which are, you know, little parts of objects? I think I'm around 400 gigs total. Um, yeah, I mean, will it keep fitting on one drive?
01:04:13
Probably not, I mean, our drives keep getting bigger, and the amount of data you care about is also getting bigger. But, like, it's not really a problem: you don't need
01:04:26
to have it on your laptop; you can have it on any kind of server, at home or wherever, and your laptop can just have a view of it thanks to pk-mount or whatever. Yeah, if I run, like, the FUSE filesystem on my laptop, it will lazily
01:04:37
fault in from the server, and you can keep a cache, so you can say, like, oh, only keep a hundred gigs on my laptop, and it's, you know, whatever hundred gigs were accessed most recently.
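That lazy fault-in is essentially a read-through cache in front of the remote store. A sketch in the same vocabulary as the earlier blocks; the LRU size-cap eviction (the "hundred gig" knob) is left out.

```go
package blobstore

import (
	"context"
	"os"
	"path/filepath"
)

// fetchCached serves a blob from a local disk cache when present and
// only faults in from the remote store on a miss. Because blobs are
// immutable and content-addressed, a cached copy never needs
// invalidation; eviction would just delete least-recently-used files
// until the cache is back under its size cap.
func fetchCached(ctx context.Context, remote Storage, cacheDir string, ref Ref) ([]byte, error) {
	path := filepath.Join(cacheDir, string(ref))
	if data, err := os.ReadFile(path); err == nil {
		return data, nil // cache hit: no network at all
	}
	data, err := fetchVerified(ctx, []Storage{remote}, ref)
	if err != nil {
		return nil, err
	}
	_ = os.WriteFile(path, data, 0o600) // best-effort cache fill
	return data, nil
}
```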
01:04:50
So yeah, you can even have, like, a union blob storage, where, like, you have three servers and none of them has enough storage: you can say, use whatever's available, and do reads from all of them, and writes go to, like, one of them randomly. So you can shard things out however you want.
01:05:05
Yep, you can make them all, like, consistently hash and shard the same way, and you can, like, DNS-balance between them all.
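"Hash and shard the same way" just means every node maps a ref to a shard with the same deterministic function, so no coordination is needed. A minimal sketch; a real deployment might prefer a proper consistent-hash ring so that adding a shard moves fewer blobs.

```go
package blobstore

import "hash/fnv"

// shardFor picks the shard responsible for a ref. Any client or
// DNS-balanced frontend running this same function agrees on the
// placement without talking to the others.
func shardFor(ref Ref, shards []Storage) Storage {
	h := fnv.New32a()
	h.Write([]byte(ref))
	return shards[h.Sum32()%uint32(len(shards))]
}
```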
01:05:22
Yeah, so it's always eventually consistent between them if you sync them all every which way. Right, I think we should let you get to lunch; listen, you
01:05:34
guys' questions were cool. Thank you!