Video: Terminators on Tech - Cloudy with a Chance of Documents (DocDrop)

00:00:02

hey we are live but it's not time yet so we just basically hang around and let people gather see if anybody's coming in nobody's in

00:00:14

yet but uh yeah so oh it shows that there are six people great welcome hello there's uh mathias hi hello hello

00:00:29

so yeah we we're just like basically hang around until uh maybe more people gathered and um welcome to terminators on tech for those of you who haven't been it's just a podcast that we chat with

00:00:41

everybody in the company try to be everybody in the company and uh hey pablo is a guest smith is here yeah so this time we have gavin with us

00:00:53

so gavin is our cto our bdfl that's how we call him because yeah and then uh because it has been a while since we have any uh major updates so uh so actually we have

00:01:07

a lot of things going on in the hq here we have as uh always we are not just having a big holiday right now that's why you can't hear anything it's because we are planning something big

00:01:19

right and uh so yeah that's why we have gavin here today to talk about it and um yeah so maybe we can you can start uh start talking about it so it's a the title of this uh

00:01:32

this podcast is like this episode is uh cloudy with a chance of documents so gavin do you want to explain what does that mean i know lou come up with the name but we all know what it means though that's right so so uh cloudy with a

00:01:45

chance of documents uh because we're moving to the cloud so we're going to have a cloud service where you can actually use terminusdb directly on the cloud from an end point you'll be able to spin up your own endpoint and then instead of having to do any of

00:01:58

the installation or any of the management of resources etc you just use that endpoint and you'll be able to use the interface you can do you can enter data into it you can have different data products so we'll have

00:02:10

multiple different uh what we used to call databases we're going to start calling them data products and those each of those databases will then be in the cloud and you can mix and match them and do queries across them etc

00:02:22

so we're sort of grandfathering in we'll have the same interface in both the open source version and the cloud version so it'll be familiar to people who are using the open source version so that's that's sort of the cloud part of the architecture

00:02:35

and what i'd like to talk about more today is really the document aspect so the document aspect we we've always had this vision of having a schema well-defined schema that would be associated

00:02:48

with documents so you'd be able to have a document view where you could view something as a json document but you'd also be able to put it into a graph and you could traverse it like a graph and it would be strongly checked so

00:03:00

that's something that we've always had in terminus but from conversations with our users and projects that we've done ourselves we found that it's it's relatively difficult to utilize the schema so our schema

00:03:12

language initially was owl and uh owl is a really powerful um it's basically a first order logic or a very large subset of first order logic that we implement

00:03:24

uh and that that gives a lot of flexibility but it also means that you you have a lot of complexity and users have told us that they find it really hard to navigate so

00:03:36

what uh so we've produced a prototype uh recently and the prototype is of a much simpler schema language the schema language is designed to facilitate writing what people would

00:03:50

often write for a sort of standard json document that they were writing themselves just for if they had an api interface or they just wanted to represent some kind of data as a dictionary they might write it that

00:04:02

way so we want to have we want to have schemas that represent the kinds of things you might want to write anyhow and to make it really easy to do that for documents that you have and then to be able to store documents

00:04:14

just using a document interface just using these json documents you stick them into the database and then you can interlink them it's still a knowledge graph but you'll have the ability to interface

00:04:26

with it in a much more document-centric way so this that's that's basically the idea of the the cloudy with the chance of documents so come back to the scheme of things i just want to uh promote

00:04:38

something here just like slide in some advertisement so uh we are building a workshop that uh basically is teaching people how to model with uh our kind of a graph kind of schema that basically is something totally different

00:04:52

from what you know like the sql kind of relational database so yeah we will have our kind of a trial workshop with europython so if you're interested then uh you know pay attention to all the

00:05:03

announcements uh we will have it on our website and all this stuff uh but yeah like go back to the cloudy with a chance of documents so uh what's our you know what's the major difference like advantage that you see

00:05:14

like from what you know for example for users who are using us right now like what would be the biggest change like do they have to do a lot of things to adapt to the new kind of schema or you know they can keep on using what they're using but

00:05:27

get the benefit of it yeah yeah so i guess what we're going to try to do is um make a migration uh for for uh simple enough uh owl um

00:05:41

ontologies if the owl ontology is exceptionally complex there's lots of things like value restrictions and stuff like that that i don't believe that any of our users are using that we actually implemented um

00:05:53

but they're it's hard enough to figure out how to use them properly that nobody's actually using them so um we're going to try to most normal things you'll be able to lift fairly directly but we'll we're just going to

00:06:06

make it a lot easier to write uh down documents uh so that you can you can write down a schema you'll be able to write it in um so you cheuk have written uh like a prototype

00:06:19

of uh class definitions in python where you just write down the class definitions in as you might do for something in python and then you can use them and and throw them into the to

00:06:32

the document store yeah what i found funny is that while we are building that python kind of schema building it's kind of like an orm just like any common ones like django orm and all this so it would be very

00:06:45

streamlined for people who have experience using other products you know and also what i can see from this change is that it will be easier for people to just convert a um a data frame

00:06:57

or like a csv like tabular format data because now still a lot of data is stored in csvs even though we hate them but it's still still like that and so i think the new new tool will be easier

00:07:08

for people to convert that isn't it that's right so it should be easier to represent those sorts of formats it won't be as it'll be a lot more straightforward to represent things as you might i mean there's some like

00:07:21

conversion utilities online you can use that just to go from csv to json and we want to be able to go from that json that you might get from one of those and represent it as a schema in terminusdb fairly straightforwardly
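The csv to json step mentioned here can be sketched in a few lines of python. This is only an illustration under stated assumptions: the function name `csv_to_documents`, the `converters` parameter and the sample columns are hypothetical, and the `@type` key simply follows the json ld convention of naming the top-level type. It is not terminusdb's own import tooling.

```python
import csv
import io
import json


def csv_to_documents(csv_text, doc_type, converters=None):
    """Turn CSV rows into JSON-style documents, one dictionary per row.

    `converters` maps a column name to a function such as float; columns
    without a converter stay as the strings that the csv module returns.
    """
    converters = converters or {}
    documents = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        doc = {"@type": doc_type}  # assumed JSON-LD style type tag
        for column, raw in row.items():
            convert = converters.get(column, str)
            doc[column] = convert(raw)
        documents.append(doc)
    return documents


# Hypothetical sample data: two coordinate rows.
sample = "x,y\n1.5,2.5\n3.0,4.0\n"
docs = csv_to_documents(sample, "Coordinate", {"x": float, "y": float})
print(json.dumps(docs, indent=2))
```

Each row becomes one flat document, which is the kind of shape a simple class-per-document schema could then describe.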

00:07:34

the other thing that is changed is so in the past in the query language we have done a lot of what we call id generation so the id generation how you represent an object's identity

00:07:48

is done with uh for terminus with idgen in woql which is our query language and and that's relatively complicated yes you correctly i don't like it to be

00:08:00

honest like yes for me it's like what what yeah yeah you're not the only one it's confused quite a lot of people it's even confused me and i wrote it so there you go so

00:08:12

we're going to um try to make it so that most like you'll still be able to to define your own uh keys for things or for your own identity or identifiers that um that describe

00:08:24

the identity of an object but we're gonna have a bunch of built-ins that you can just tag your class with and once you tag your class with that that sort of key generation approach then you won't have to worry about the

00:08:36

id or how it's constructed from then on and it'll just automagically come up with the appropriate id you could pass that id around you can point to it from other documents and that sort of thing so

00:08:49

and i think this will also will be able to improve the document interface a lot we're going to add sort of what they call framing this is like being able to search deep into a document by giving the path

00:09:02

into the document and then we want to be able to have pattern matching on on documents you can pattern match parts of a document and then you can construct a new document just by placing in those

00:09:14

patterns using the variables in woql so that we'll we'll have a sort of we'll be able to take the benefits of datalog unification but we'll also have the benefits of this

00:09:26

this sort of document orientated interface sort of like with a document queries you can do a document query you can ask for all the documents that are shaped like this that have this kind of shape

00:09:37

and it can be more convenient than having to write all the triples out by hand especially when you don't care about intermediate identifiers and things like that so yeah sounds really cool but roadmap wise like

00:09:50

where are we right now because like we have been you know kind of doing this work for around a month now like where is it going or where are we now so yeah so uh we have a completed prototype

00:10:05

uh and just uh yesterday we started moving the prototype into terminusdb itself um so that work was going on today i imagine it'll take um a few more weeks so we're shooting

00:10:18

for mid-june so for mid-june we're hoping to have um a dev branch that works uh with this with this new document interface that's

00:10:29

that's the idea and then mid july you know we should have a released product that has the new document interface uh and which you can spin up on the cloud so you can do it either with the open

00:10:42

source version or you can spin it up in the correct cloud and try it out there yeah so i would like to emphasize like if you are interested in what gavin's saying today like if you want to join in the

00:10:54

discussion about you know if you have suggestions or anything this is a really good time because we are developing it we still have a lot of things that we want to figure it out and get opinions from you so uh the link to our you know

00:11:06

discord server is down there so please join and if you're interested in kind of testing it when we have a dev you know have pre-release and also contact us we're happy to chat with you and then

00:11:18

let you play around with it if you want to so yeah absolutely absolutely yeah so uh do you have uh anything that you could show us today or

00:11:29

uh yeah so i mean i i can give you a a quick sort of view of what this might look like so you get maybe a better handle on how much simpler things are going to be when we move to this new document interface

00:11:43

so here let me just can i share things you can of course you can [Music] yeah we would love to see how it works yeah and also for me it's just before

00:11:57

this podcast i was uh updating the uh the the python interface because uh yeah we we we're still changing how you know the the interface work uh and so

00:12:08

like i said you know if you want to add in some opinions and anything feel free to do so uh we're happy to hear from you so uh yeah that's right so yeah so i guess uh this is sort of an example can you

00:12:21

see my screen here with the um yes i can see it i can make it bigger if that helps yeah great so yeah so this is an example of what you might write down

00:12:33

in terms of python classes so we want to be able to you can write down the class of a coordinate you can say oh there's an x there's a y each of them are floats uh you have a country it has a name it has a perimeter

00:12:45

it's a perimeter is defined as a list of coordinates so you can you can just define it in that way so you could imagine writing down a simple schema very quickly in python and then being in what what it will

00:12:59

produce is some json ld that describes the schema so the schema will be compiled essentially to uh a list of json ld documents that

00:13:11

then you can send to the server and you can update the schema and with this new schema then you can start uh entering in data so you could have a coordinate and it might look like this

00:13:23

so a coordinate will be of the coordinate type it has an x uh and a y property and they're both floats so you'll get much simpler uh instance documents out of this thing so
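The coordinate and country classes described here, and the idea of compiling them to schema documents, might look roughly like the following. This is a sketch using plain python dataclasses, not the actual terminusdb client: the helper names `class_to_schema` and `to_document`, the `xsd:` type names, and the exact keys of the schema dictionaries are assumptions made for illustration.

```python
from dataclasses import asdict, dataclass, fields
from typing import List, get_args, get_origin


@dataclass
class Coordinate:
    x: float
    y: float


@dataclass
class Country:
    name: str
    perimeter: List[Coordinate]


# Assumed mapping from Python base types to schema type names.
_BASE_TYPES = {float: "xsd:float", str: "xsd:string", int: "xsd:integer"}


def type_name(tp):
    """Describe a field type as a base type name, a list, or a class name."""
    if tp in _BASE_TYPES:
        return _BASE_TYPES[tp]
    if get_origin(tp) is list:
        (inner,) = get_args(tp)
        return {"@type": "List", "@class": type_name(inner)}
    return tp.__name__


def class_to_schema(cls):
    """Compile one dataclass into a JSON-LD style schema dictionary."""
    schema = {"@type": "Class", "@id": cls.__name__}
    for field in fields(cls):
        schema[field.name] = type_name(field.type)
    return schema


def to_document(obj):
    """Build a flat instance document tagged with its class name."""
    doc = {"@type": type(obj).__name__}
    doc.update(asdict(obj))
    return doc


schema_documents = [class_to_schema(Coordinate), class_to_schema(Country)]
instance = to_document(Coordinate(x=1.0, y=2.0))
print(schema_documents)
print(instance)
```

The list of schema dictionaries stands in for what would be sent to the server, and the instance document only needs its type plus the two float properties.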

00:13:34

here's an example of a person who has a name and an age uh and they have a friend of uh property as well and the friend of is a set of persons potentially so

00:13:49

uh here you have dave dave is a person dave's identifier is person underscore dave um his name dave age 13 and his friend of is just a set of uh different

00:14:02

persons and here we we just represent them by their identifier because you don't want to unfold the graph um infinitely but you can imagine like you could just ask for the person jim

00:14:15

document back if you need information about them but that gives you a way to represent as documents stuff that's in the graph and you have nice simple identifiers and relatively straightforward definitions for

00:14:28

the documents themselves um we also want to have like a nice pythonic interface you can just create these guys in python manipulate them they'll they'll print out what their fields are

00:14:41

and you can update their fields and you can commit them to the database when you make changes to them so uh in terms of the other features that we've added to

00:14:54

simplify things so you can represent all of these things in owl but the representation of the json that corresponds to it is relatively complicated it's somewhat baroque you might say

00:15:06

uh and what we would like is a very simple representation so here's an example of an enumerated type so an enumeration it's just going to be represented as a list of the enumerates uh and

00:15:19

it's um representative will just be one of the elements of that enumeration so you just like you would get in a normal programming language so as you said before it's like this

00:15:31

will feel more like an orm uh but it's gonna be an orm building a knowledge graph you know that's that is versioned so and you'll have nice clean identifiers

00:15:43

so that you can mix and match these between databases you'll be able to refer to the same object multiple times so here's another example of something that we've added to simplify id

00:15:54

generation so here we have a key and the key is defined as being formed from the street the region and the postal code so it's similar to in sql where you define a key in terms of

00:16:08

uh the representative columns that you think make the row unique so they give you some uniqueness and object identity if you

00:16:20

will and this is saying that we're going to create the identifier as a hash key um and here uh the base of the hash key is going to be address it's just a string that represents

00:16:33

how the um thing is how the representative will actually be formed so the json ld for the schema portion of this is like this you have you say it's an address it's a

00:16:46

of type class it has a comment about it its base is address underscore the key is a hash key of these columns or these properties

00:16:58

um and then we have street region and postal code as its has its um properties then the instance of this thing you see here that it's address underscore so it's formed from

00:17:10

the base name that we gave here so it's formed automatically from the base name and the hash of the other street region and postal code and that's how you know that the identity is captured

00:17:23

correctly this so this will allow you to have fields that maybe you have a lot of fields that create the identity of the object or you want to hide

00:17:35

the actual information that generates the key so this is also important you might have the address represented somewhere else but you don't want to disclose what the actual constituent

00:17:49

parts of the address are even though you want to be able to refer to the address so this is particularly important potentially with users so you might have users or

00:18:01

subscribers and certain parts of the business should be able to understand things about that user in terms of their uh how they should be marketed to etc but they don't need to know

00:18:13

where they live or who they are and they shouldn't know it because there's sort of privacy concerns about that so you could form it in one place where there is a restricted data set that does

00:18:25

actually contain the personal identifiable information but then you can refer to that individual externally without disclosing any of that personally identifying information
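The hash key idea described here can be illustrated with a short python sketch. The transcript does not say which hash function or field encoding is used, so sha-256 and the separator character below are assumptions, as are the function name `hash_key_id`, the field spellings and the sample address.

```python
import hashlib


def hash_key_id(base, document, key_fields):
    """Form an identifier from a base string plus a hash of the key fields.

    The identifier can be shared freely because the field values cannot be
    read back out of it, yet the same values always give the same identifier.
    """
    # Assumed encoding: join the values with a separator unlikely to occur in data.
    joined = "\x1f".join(str(document[name]) for name in key_fields)
    digest = hashlib.sha256(joined.encode("utf-8")).hexdigest()
    return base + digest


# Hypothetical address document; only the three key fields feed the hash.
address = {
    "@type": "Address",
    "street": "1 Example Street",
    "region": "Example Region",
    "postal_code": "A00 0000",
}
address_id = hash_key_id("Address_", address, ["street", "region", "postal_code"])
print(address_id)
```

A restricted data product could hold the full address, while other data products refer only to `address_id`.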

00:18:38

and this is really important for building uh because you want to be able to refer to these things all over the place without leaking uh the personally identifying information all all over the place so that for gdpr concerns i think this would be

00:18:50

uh very useful but it's also convenient a convenient way to form an identifier that won't overlap so if it's not important what the uh the that you're disclosing this

00:19:03

information then you can form a key lexically without without using uh a hash so here we have an example of a person and actually person's probably a bad example of something you might use a

00:19:16

lexical key for because you will be disclosing information but here we uh we say you know person is dick 23 it's because his name is dick and his age is 23 and we formed it from that

00:19:29

lexical identifier this way so no more working with id gen you don't have to worry about that anymore and you can refer to these things and you can see like person kim 35 is who he's dating and then

00:19:42

the address is here even though you might not know what that address is you know that it exists somewhere as an identifier so that's an example of that um
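A lexical key, by contrast, keeps the field values readable in the identifier. The sketch below is one plausible way to build it: the underscore separator, the percent-encoding of each part and the function name `lexical_key_id` are assumptions for illustration rather than the implementation in terminusdb.

```python
from urllib.parse import quote


def lexical_key_id(base, document, key_fields):
    """Form an identifier by joining the key field values after the base."""
    # Percent-encode each part so the result is safe to embed in a URI.
    parts = [quote(str(document[name]), safe="") for name in key_fields]
    return base + "_".join(parts)


person = {"@type": "Person", "name": "Dick", "age": 23}
print(lexical_key_id("Person_", person, ["name", "age"]))  # Person_Dick_23
```

Because the name and age are visible in the identifier, this style suits data where disclosing the key fields is not a concern.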

00:19:54

and then we've also added tagged union so we found that for modeling certain things you want to have one of a thing or the other of a thing and this tagged union just gives us

00:20:07

a property from the list of the tag properties and then some object at the end so you can have that it can be a quite um elaborate object it doesn't have to be a base type like this it could be any type of class

00:20:21

but it's it's a nice way of specifying sort of a a disjoint union of of different things so when you have a clear choice of different possible objects uh and that that this is also possible to

00:20:34

model in owl but uh it's quite uh also quite difficult and whereas both the representation of the instance data and the schema definition are quite simple
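The tagged union can be thought of as a document that carries exactly one of the listed tag properties. A minimal python sketch of checking that rule follows; the tag names `integer` and `string`, the type name `IntOrString` and the function `check_tagged_union` are hypothetical and only illustrate the disjoint choice described here.

```python
# Hypothetical tagged union: a value is either an integer or a string.
TAGS = {"integer": int, "string": str}


def check_tagged_union(document, tags):
    """Return (tag, value) if exactly one tag property is present and well typed."""
    present = [key for key in document if key in tags]
    if len(present) != 1:
        raise ValueError(
            f"expected exactly one of {sorted(tags)}, found {present}"
        )
    tag = present[0]
    value = document[tag]
    if not isinstance(value, tags[tag]):
        raise TypeError(f"{tag!r} should hold a {tags[tag].__name__}")
    return tag, value


print(check_tagged_union({"@type": "IntOrString", "integer": 5}, TAGS))
```

The value type for a tag could just as well be another class, in which case the check on the value would recurse into that class's own schema.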

00:20:46

yeah right in the uh json yeah in this new schema yeah so look yeah look like a lot of things are like you know hidden behind the scenes but like i think uh yeah we have uh viewers asking like will any of the owl

00:21:00

constructs be deprecated or it's just like we have kind of something on top of it that people can bypass it but owl will still be uh available so you know we're actually going to

00:21:12

deprecate owl because so i rewrote the um the schema and instance checker for this new thing and it's much faster uh much faster because it's a restricted um it's a

00:21:25

restricted language and the owl uh checker is really explosive in terms of its computational complexity so we've found that almost all like all of the things that i

00:21:37

have done so far in modeling for terminusdb um can be modeled in this other schema language so i'm hoping that there are very few things used in owl that

00:21:47

that um that will not be possible to represent here in some way the the the few things that are definitely not possible to represent i think need to be done slightly differently

00:22:00

anyhow so there are some global constraints that can be expressed in owl that can't be expressed easily in this language and the way that i would like them to be

00:22:12

realized is using a sort of ci/cd approach so where you have a once you've checked this in then you run some kind of queries against it to check to see if some global properties

00:22:24

hold and then if they do then you uh then you're allowed to commit the schema so externalizing transactions is the way that i would like to go forward with those very complex or global checks rather than rather than

00:22:37

try to fit it all into the schema language and i think this will give people a lot more um possibilities so you not only will you be able to have owl like constructs but you can since you're

00:22:49

going to be restricted to a very specific space of of what you want to check you'll be able to write something that's much faster uh so that'll improve the performance you won't necessarily have to have

00:23:02

uh you know hyper exponential complexity which is one of the problems that i ran into with some of the um double subsumption type things that were going on because you have subsumption in the class hierarchy and

00:23:15

you have subsumption in the property hierarchy uh and it it ends up being very computationally difficult to use as a schema checking language yeah and actually mathias asked can you

00:23:28

talk about conversion to this new schema so for already like user who already have something built like how can they convert or um any transition effort

00:23:40

yeah so so the first stage is going to be uh i'll write a guide uh for how to transform this by hand and then the second stage is we we're gonna see like if there's if there are people if there are a

00:23:53

number of people who have complex ontologies that have already been constructed and want them mapped uh um they should talk to us and we can see how much trouble it would be just to compile

00:24:05

the owl directly now we have some owl already as examples but i'm not sure exactly what constructs people are using so there are potentially constructs that would be difficult to compile into the new language but if you built anything in the model

00:24:19

builder it'll be very easy to transform it uh and if you yeah i think in a lot of cases what people have built will be not too difficult to transform yeah and i know that a lot of people you

00:24:32

know there are some people who are like uh you know they love owl they want to keep on using owl but i i would say that like give it a try like i mean you may find it also very similar to owl and you can also construct things

00:24:44

maybe even faster but if you have any concerns or anything again like now we're at the stage that we are very happy to hear from all these different kind of opinions so maybe we overlook some difficulties that you may

00:24:56

encounter that you worry about so please talk to us and you can also follow us on twitter i mean like if you don't use discord you can also you know just send us message via twitter or anything so

00:25:08

yeah right um sorry i was interrupting your presentation so do you still have things that you want to show we can go back to that if let's let's see um

00:25:20

so i was um just remarking i think one of the things that i find better about this this new approach is that you can reuse property names all

00:25:35

over the place without having to somehow stick them into the hierarchy into the class hierarchy so they're they're distinct each of these property names is basically

00:25:45

distinct per class you can inherit from other classes and obtain their properties but there's no need to worry about using the same name in multiple different places so that's

00:25:58

very much more convenient in writing a lot of these things because name for instance comes up in a lot of different places where they they won't be shared in the hierarchy in any sort of way in owl you can represent this by using a

00:26:12

mixin class you could also do that with this new variety because we do support full um multiple inheritance uh for this system but

00:26:24

so you know it's but it's more convenient to to just write down the properties that you have right next to the class as you write it so in addition to the i mentioned before we

00:26:38

had lexical keys uh the lexical keys are just formed from various property names um we also have a hash key that's formed from the fields you can choose a list of fields so in

00:26:51

this case it's just the name but it could you know you could imagine it it was name and comment or something we also have a value hash the value hash takes the transitive downward closure of all of

00:27:03

the values of the thing including um yeah so sub sub objects or whatever and it creates a hash out of that and so that

00:27:14

that includes everything and so this it's sort of unique according to the value that it has so certain things that might be distinct uh there's only one of them uh or you want to imagine that the the

00:27:27

name of that thing is equal to its value sort of like where its pointer is uh it's called interning i guess or in in lisp or in prolog it's called

00:27:40

interning where you take the value you make it assigned a pointer that's unique for that value so that's that's what that does and then random so for instance with events uh when an

00:27:52

event comes in you want to represent that it exists and its identity is new every time uh so just choosing a random large random identifier is a good way

00:28:04

to ensure uniqueness there and then i just have some examples of like what these keys would look like so if you use the lexical key it's a it's going to produce a uri or a valid

00:28:17

um uri utf-8 representation so any like spaces are changed into percent 20s but then it uh it combines like james t kirk

00:28:30

with employee and that's how or in this case it also has this birth date at the end which is apparently in 2233 which i didn't know um here's an example of the hash this is

00:28:43

what the hash identifier might look like this is the value hash identifier for this is formed out of latitude and longitude uh and then we have the random and the random is just a large

00:28:56

random string at the end so it's like some kind of uuid unique identifier as well and that's that that basically sums up what we've done so far in terms of future directions and things

00:29:09

that we would like to add in the next iteration uh there's some things so we want to make it easy to represent sub-documents and so we've had some thinking about that and i think we're going to add a

00:29:21

tag that says like basically when you pull this document out of the if you pull a document that has sub documents out of the database it'll just unfold the sub documents so you'll get a larger

00:29:34

json document that has sub objects inside of it those sub-objects they kind of have to be owned in a way you don't want multiple different independent documents pointing to the

00:29:46

same thing necessarily so we're going to try to disallow that by generating the id uniquely for each document that has a sub document and then they'll delete

00:29:58

automatically or update automatically when you do a deletion update and so those that's sort of one of the future directions that we've talked about yeah so and luke asked about like uh

00:30:12

how can like you know the comparison with uh mongodb like yeah so we want to be very close to like what i would like yes and that's another future thing so part of what we've done so far is trying

00:30:25

to make it so you can almost feel like you're using mongodb here you just put a document into the database as a simple json ld um and you only have to specify what the type of the top level object is and

00:30:37

everything else sort of naturally gets created automatically so compared to mongodb we hope to be almost as simple to use but we want to also have all the schema

00:30:50

checking we already have acid properties so it's going to be a very it's going to be a nice database uh with like strong structural integrity constraints on the information that's in there you'll be able to process these

00:31:02

documents in code much more easily because you'll know for sure what kinds of edges have to be there you can interrogate the schema so you can even you know you'll be able to ask from your program what the schema looks

00:31:14

like and the schema is very simple so it should be possible to interpret what you should be processing in code much more easily by looking at that so i think that's a big win so one of the places though that mongodb

00:31:27

is still a little bit better than us is that you can just throw anything you like in there that's also a downside right so it's both a plus and a minus and what what i'd like to see is that we

00:31:38

we have a mode where you can like throw a bunch of documents in and you say please infer me a schema and it generates an automatic schema that it thinks will represent those documents that you put

00:31:50

in there and then uh you can maybe modify or hone the schema or and then maybe lock it down you say okay from now on that's my schema and then then when you put documents in

00:32:02

they have to match the schema but so it gives you a sort of testing phase that would make it as easy to use as mongodb but you could migrate to being as solid you know as as terminus is designed to be
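The infer-then-lock-down workflow described here is a future direction rather than a shipped feature, but the idea can be sketched in plain python. Everything in this example is an assumption made for illustration: the function names `infer_schema` and `conforms`, the shape of the inferred schema, and the use of python type names in place of real schema types.

```python
def infer_schema(class_name, documents):
    """Guess a flat schema from sample documents: field types and optionality."""
    seen_types = {}  # field name -> set of Python type names observed
    seen_count = {}  # field name -> number of documents containing the field
    for doc in documents:
        for field, value in doc.items():
            seen_types.setdefault(field, set()).add(type(value).__name__)
            seen_count[field] = seen_count.get(field, 0) + 1

    schema = {"@id": class_name, "fields": {}}
    for field, type_names in seen_types.items():
        if len(type_names) > 1:
            raise ValueError(f"field {field!r} has mixed types: {sorted(type_names)}")
        schema["fields"][field] = {
            "type": next(iter(type_names)),
            # A field missing from some samples is treated as optional.
            "optional": seen_count[field] < len(documents),
        }
    return schema


def conforms(schema, doc):
    """Once the schema is locked down, check a new document against it."""
    spec = schema["fields"]
    if any(field not in spec for field in doc):
        return False  # unknown field
    for field, rule in spec.items():
        if field not in doc:
            if not rule["optional"]:
                return False  # required field missing
        elif type(doc[field]).__name__ != rule["type"]:
            return False  # wrong type
    return True


samples = [{"name": "Dave", "age": 13}, {"name": "Jim"}]
schema = infer_schema("Person", samples)
print(schema)
print(conforms(schema, {"name": "Kim", "age": 35}))
```

In the testing phase documents go in freely and the schema is regenerated or edited by hand; after lock-down, `conforms` plays the role of the check that rejects documents that do not match.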

00:32:15

yeah so i always you know if if someone who is not technical asked me like oh what what is this database about i'll be like oh it's imagine a data lake with schema with some structure in there that you can govern like how things

00:32:28

are flowing in and out yeah so uh we we're still like i think we're still thinking about the naming of of of the thing that we we're having like there's a chat in our company about like

00:32:42

how we should name things and it's super funny um someone say like uh terminus cordon though [Laughter] exactly so if anybody has good ideas

00:32:55

yeah don't uh definitely tell us we'd be happy to know yeah so we already have uh brad is interested in testing it out which is great which is uh good so anyone have any questions please type in

00:33:08

a comment and since gavin is here we will have an answer and um also tomorrow is the office hour right so if you want to drop in

00:33:21

yeah so if you want to have a chat a voice chat with gavin so tomorrow at uh 5 p.m uh uk time which is uh you know whatever you convert to your own time zone and

00:33:32

then gavin will be in the uh in the voice channel of the community server so yeah so please join uh and chat or you can ask questions now since we

00:33:44

are streaming that's right yeah i think it's a really really good direction to go like as someone who is not you know i think the previous approach is good but i think it's more

00:33:57

suitable for people who are already in the semantic web kind of space that they're familiar with owl familiar with how it works but uh for laymen like me or or other like data people that i

00:34:10

think some something that is easier to understand and more similar to other uh products that we are familiar with would be a big advantage yeah so i mean that's what we found we

00:34:22

we did a few big projects using owl ourselves and so we were able to use it it's fairly complicated even for me to and i wrote a lot of the schema checker and instance checking myself

00:34:35

and even even then i find it somewhat diff difficult to write large complex schemas now so i wanted something that you know took from that experience and tried to make something simpler that would that

00:34:48

would be closer to what people would like like to write when they're writing down a schema yeah also like there's less confusion about when you get back a kind of an error it's like okay which part of my data doesn't match the schema

00:35:00

is it's so hard to tell previously so hopefully now it's much easier um yes i mean owl is sort of like a first order logic so where

00:35:12

the error witness comes from it can often be very divorced from where you think your object is right so the errors tend to be very difficult to interpret

00:35:26

yeah so i can't see any more questions flowing in so it's almost time to end our conversation here i guess um so yeah for people who want to you know follow

00:35:38

the development news and anything we have like the workshops that i advertised previously like follow us on twitter all the exciting things will be on twitter or you can just follow for jokes i mean like luke posts really

00:35:51

good jokes there so yeah and follow us or if you want to join the discussion we have the server also available down there you can join our discord server again like gavin will be there usually

00:36:04

on thursday uh to be in the voice channel you can jump in and have a direct chat with him uh you know voice channel but uh yeah so i think that's that's it for

00:36:16

uh this month uh we try to keep it monthly but you know sometimes you know uh we we're just like a small team still we are expanding but we're still a small team and so we try to get in touch with everybody as much as we can

00:36:29

and um so yeah i think that's it for this one thank you gavin for taking the time during this busy busy time to talk to us and yeah so uh see you hopefully next month

00:36:42

okay bye see you bye bye