00:00:04
Hello, welcome everyone. My name is Juanpe. I am an independent software engineer and consultant currently based in Berlin. And, I am going to talk about immutable data structures. And, I mean the whole point of immutable data structures is kind of making value semantics scale. So who in this room thinks value semantics are awesome?
00:00:27
Okay almost everyone. That's great. Who here works mainly in a code base where most of the types are value types and most of the functions take things and return them by value? (laughing) Oh two people that's awesome. Are you using C++ or Haskell? (laughing) So that's the thing.
00:00:51
It's actually complicated to make value semantics scale. So that's why the first part of this talk is called the tragedy of the value based architecture. And, this tragedy starts with a lot of good intentions. So in the center of our architecture, we ideally have a data model that is composed of simple value types like structs, std::vectors, tuples, simple things.
00:01:16
Now the application logic in the software will be pure functions. Functions that take a document as a value and then produce another document as a value, right? And, one of the very often advertised advantages of such an architecture is that then for example implementing undo becomes trivial. We just keep the previous states in a vector or in some collection, and we're done.
00:01:40
We can now undo. Also we get concurrency very simply because if we let's say want to have a render thread to render the UI in a separate thread, we can just loop in the separate thread with some magical message passing mechanism, pass the value of our document to this other thread, and they can do the thing separately, independent from where all the logic
00:02:06
is happening. Same thing for let's say persistence. I want to save to disk which might be slow, so I just you know launch an asynchronous task. I capture the document by value in a lambda and operate with it. Likewise, I can have let's say other representations of my documents like I worked a lot in the music software industry where you often have like a sonic representation of your documents,
00:02:30
something that makes audio. And, we can follow the same patterns throughout this kind of alternative UIs or representations. But, there is a problem here. We're doing copies all over the place, right? Like, this is not gonna scale unless we're really working with trivial documents. So what a smart C++ engineer will do is to say okay well let's not take the document by value
00:02:54
and return a new value which is potentially creating copies and just you know take it by reference and do the update in place. But, once we do that, we start thinking of our data model not as a value, we're thinking of it as an object, a location. Where it is stored in memory becomes important for the correctness of this function. And, of course now there's other people accessing the document. They're gonna need to synchronize accesses
00:03:20
with mutexes and so forth, so we're probably gonna add some stateful intermediate representation of our UI with our widget tree, and of course we need to keep this widget tree updated, so now we need signals and slots in our document, so we can call back and update when things change in the documents. And, of course you know we're not gonna keep copies anymore in our undo history, so we need to implement this command pattern that is well known to be hard to implement.
00:03:46
Doing persistence asynchronously is going to be very hard as well because we cannot simply copy the document, especially after we added the signals. Anyways our users are already used to having these spinning clocks on the UI, but nothing is gonna save us from the flying spaghetti monster that has taken over our code. (audience laughing) So okay, what was our problem here? The problem actually was copying.
00:04:11
Like that's why we went down this rabbit hole, and we're hard wired to think that when you pass by value it implies that you're gonna need to make copies sooner or later. But, this doesn't really need to be the case, right? You could think of it like this: I store my value somewhere in the heap, and I just pass a pointer, right? When I need to pass it to someone else.
00:04:36
The reason I cannot do that, at least in a world where things are mutable, is because what happens if my original, like the one that passed it to me, now changes this object that is in the heap? Now you lose the value. Like the value goes away under your feet, you could say. So actually it's mutability that implies that now passing
00:05:04
by value means now you need to copy. But, if we remove mutability from the equation, we could actually also remove the implication. We could simply pass by value by copying a pointer. So this is the fundamental idea behind immutable data structures which basically work towards tackling this problem. And, this is a data structure where all the methods are const.
00:05:31
Now when you want to update the data structure, you can still do it, but you're gonna have let's say a push back method that, instead of doing the update in place, is gonna return, at least from a conceptual point of view, a new copy of the vector, which is our example here. We can do this multiple times. We can also do random access updates like in a typical vector.
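The interface being described can be sketched like this. It's a toy, not Immer's implementation: every update here copies the whole buffer (no sharing yet), but it shows the shape of the API, where all methods are const and "mutations" return new values while old versions survive. The name `ivector` and the `int` element type are illustrative only.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Toy immutable vector: all methods const; updates return new values.
// Old versions are preserved (persistence). Real implementations share
// structure instead of copying the whole buffer like this one does.
class ivector {
    std::shared_ptr<const std::vector<int>> data_;
    explicit ivector(std::vector<int> v)
        : data_{std::make_shared<const std::vector<int>>(std::move(v))} {}
public:
    ivector() : ivector(std::vector<int>{}) {}
    std::size_t size() const { return data_->size(); }
    int operator[](std::size_t i) const { return (*data_)[i]; }
    // "Mutation" as a const method: returns a new version.
    ivector push_back(int x) const {
        auto copy = *data_;
        copy.push_back(x);
        return ivector{std::move(copy)};
    }
    // Random access update, also returning a new version.
    ivector set(std::size_t i, int x) const {
        auto copy = *data_;
        copy[i] = x;
        return ivector{std::move(copy)};
    }
};
```

Every intermediate value stays valid, which is exactly the persistence property discussed next.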
00:05:57
And, we have preserved all the history of operations in the data structure. So this is a property that in the data structures world is called persistence because the old values don't go away. And, this means that we can compare previous values from the past which is very useful in interactive systems actually to know what's changed. And, this also allows us to have a property called structural sharing which is basically the idea
00:06:23
behind this pointer copying that I mentioned before. These different versions, even though they are logically copies, they don't have to be full copies under the hood. They can share data between these versions. So I really wanted these data structures to exist for a very long time, so I ended up implementing a library myself called Immer that is open source, and you can access from that website.
00:06:47
And, my goal with this library was to first make a library that is idiomatic for C++ developers. There are lots of attempts at bringing functional programming to C++ that in the end pretend that C++ is Haskell. I think if we really want to make these techniques mainstream, we really need to make this idiomatic. And, also it has to be performant because this is in the end the reason why we're using C++
00:07:10
and not any of these other higher level languages. And, finally also, and this is tied to performance, it has to be customizable. And, we have a myriad of techniques like policy based design and so on that we can use in C++ to make these data structures customizable. And, this is particularly useful for example to use this data structure, let's say, plugging in the allocator from Python and its garbage
00:07:35
collection strategy, and now you have like a fast implementation of this data structure for Python as well. So okay how do these magical vectors work? And, this is gonna be the second part of the talk. First let's try to understand this idea of structural sharing. And, the typical persistent data structure that functional programmers use is a linked list, a persistent linked list.
00:08:00
Where let's say I have a linked list with one element, now I can create another version by pushing two elements in it, and I get these new nodes with these links. Now in the original value, I can still call the push front operation, oh sorry in the new one here, and get a new version. But, I can still call push front on the old version
00:08:27
and kind of get a divergent history, right? Now I have two possible futures you could say, two alternative variations on operations that we did with it. It's like forking the data structure. And, I just create a new node, and most of the data structure is shared between all these values. And, even though they're conceptually copies, they are not full copies under the hood.
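The forking just described can be sketched in a few lines. This is a toy persistent singly linked list, with names made up for illustration: `push_front` allocates one node and shares the entire tail, so two divergent futures built from the same list literally point at the same nodes.

```cpp
#include <cassert>
#include <memory>

// Toy persistent linked list: push_front shares the whole tail.
struct node {
    int value;
    std::shared_ptr<const node> next;
};
using list = std::shared_ptr<const node>;

list push_front(int v, list tail) {
    // One new node; the tail is shared, never copied.
    return std::make_shared<const node>(node{v, std::move(tail)});
}
```

Pointer equality of the tails is the structural sharing: both forks reuse the original list unchanged.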
00:08:51
Now C++ developers, we know that node based data structures like these with these small nodes, they're not very efficient. So we're gonna try to get away from them. But, this idea, this underlying intuition is gonna be very helpful down the road. And, actually I point here to a couple of references if you're interested. There is this book Purely Functional Data Structures that really takes these small node based data structures
00:09:15
in combination with lazy evaluation and other things to the extreme and builds really interesting data structures. But, the data structures that I want to implement are the ones invented by Phil Bagwell and popularized by Rich Hickey, who is the inventor of the language Clojure, which I think is a very bold language because it's a Lisp, and traditionally Lisps are based
00:09:40
on lists, and he decided okay let's make a Lisp, but it's gonna be based on vectors, on immutable vectors, right? Because vectors are more cache efficient, right? So this is the data structure that we're gonna try to build. Now let's say we want to build a vector. Ideally we want to have a contiguous block of memory as our underlying data structure.
00:10:04
Of course, this is not gonna work well in a persistent setting because if I want to update one of these elements, get a new version with one of the elements updated, I'm gonna need to copy the whole block of memory and all the elements in it. But, I can make a compromise and say I'm gonna split it into blocks and this picture into blocks of four elements. Now these blocks are flying somewhere.
00:10:29
I need to assemble them somehow. I could use a linked list. No, that's not a good idea. Let's put them in another array. I have the same problem again, and recursively we split it, and eventually we end up with this perfectly regular tree where every node has a fixed number of blocks. This data structure is called a radix balanced tree that is characterized by a branching factor
00:10:54
that we denote normally as M which has to be a power of two. We'll see soon why that is the case. And, now you're gonna tell me okay but this data structure it's still a tree. Trees are bad. I mean that's why std::map is slower than an unordered map. But, you have to understand that these blocks, even though in the pictures they have four elements just so they fit in the slides nicely, in practice they have 32 elements.
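A quick back-of-the-envelope check of the depth claim: with branching factor M, a tree of depth d holds up to M^d elements, so the depth needed for n elements is the smallest d with M^d ≥ n. The function name here is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Smallest depth d such that branching^d >= n: how many levels a
// radix balanced tree needs to hold n elements.
int depth_for(std::uint64_t n, std::uint64_t branching) {
    int d = 0;
    std::uint64_t capacity = 1;
    while (capacity < n) {
        capacity *= branching;
        ++d;
    }
    return d;
}
```

With M = 32 and every index a 32-bit integer can address, this comes out to seven levels, matching the claim in the talk.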
00:11:20
This means that actually you can have a vector that contains as many elements as you can address with a 32 bit integer, and this is only gonna be seven levels deep. This means that the depth of the tree actually grows slower than other constants that are proportional to your data, like the working set size versus the cache size. That's gonna be more important when dealing
00:11:44
with a large vector than the depth of the tree itself. So how does this data structure work? Let's say we want to find an element in this tree. Let's say we want to find an element at index 17. First thing we do is we split the index, so we represent it in binary, and we split it in groups of B bits. Right, we had a power of two, remember our constant M
00:12:17
was a power of two. This is why, right? So we have B bits, now we can simply take the first group. We use that to index, or to access, our root. Then we go down the link at that index. We do that recursively with every block of bits, and we finally get to our element. Now let's say we want to make a new version of the vector.
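The digit splitting used in that lookup can be sketched directly: extract B-bit groups of the index, most significant first, and each group is the child slot at one level. The function name is hypothetical; real implementations do the shifts inline while walking the tree.

```cpp
#include <cassert>
#include <vector>

// Split an index into B-bit digits, most significant first. Each digit
// selects a child at one level of the radix balanced tree. With the
// slides' M = 4 (so B = 2), index 17 = 0b010001 splits into 01 00 01.
std::vector<unsigned> radix_digits(unsigned index, unsigned bits,
                                   unsigned levels) {
    std::vector<unsigned> digits;
    unsigned mask = (1u << bits) - 1;
    for (unsigned level = levels; level-- > 0;)
        digits.push_back((index >> (level * bits)) & mask);
    return digits;
}
```

This is why M must be a power of two: the digits fall out of plain shifts and masks, with no division needed.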
00:12:44
We want to update the vector and change this element to a new value. Well we will first need to copy this block where the element is located. Then we will need to copy the parent. We copy the parent here, and of course it still has links to some of the other nodes in the data structure but also to our updated node. We do this recursively until we get to the root.
00:13:11
So we actually had to copy one path from the root down to our element. But, everything else in the data structure is shared, right? This is how we achieve this balance, in this case, between trying not to copy too much data. We have to copy some data but still get nice update performance. So this data structure is already quite interesting, right?
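The path copying just described can be shown on a tiny fixed-depth version of the tree. This sketch (names invented, depth hard-wired to two levels with M = 4) copies only the leaf holding the element plus the root's array of pointers; the other three leaves stay shared between the old and new version.

```cpp
#include <array>
#include <cassert>
#include <memory>

// Toy two-level tree, branching factor 4: a root of 4 leaves, each leaf
// holding 4 ints. set() copies only the path to the element.
using leaf = std::array<int, 4>;

struct tree {
    std::array<std::shared_ptr<const leaf>, 4> children;

    int get(unsigned i) const { return (*children[i / 4])[i % 4]; }

    tree set(unsigned i, int x) const {
        tree copy = *this;                    // copies 4 pointers, not data
        leaf new_leaf = *children[i / 4];     // copy the one touched leaf...
        new_leaf[i % 4] = x;
        copy.children[i / 4] = std::make_shared<const leaf>(new_leaf);
        return copy;                          // ...the other 3 stay shared
    }
};
```

Pointer equality on the untouched children is the structural sharing; the old version still reads its original value.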
00:13:39
It has nice properties, and it has similar properties to a vector from an algorithmic point of view even though the constants obviously are different. But, can we improve? Like with a vector, concatenating two of these data structures for example is going to be O(n) in the second vector, because of this little slot here that needs to be filled somehow at the right most leaf.
00:14:04
We will need to pull in an element from the second vector that we're concatenating, and eventually we will need to reconstruct the right hand side of the concatenation. But, Phil Bagwell published in a paper an alternative in which we allow some nodes to be relaxed. By relaxed, it basically means we allow this node, this inner node to have less children
00:14:32
than it would normally have if it was perfectly balanced. For this, some of these nodes, they need to of course store information about how many children are located in these sub trees. But, once we allow the structure to be relaxed in this way, he showed that we can define a concatenation algorithm that works in logarithmic time and of course with structural sharing.
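Once concatenation is cheap (and slicing, which the relaxed structure also gives), insertion at any point falls out of combining them: take the prefix, append the new piece, append the suffix. A sketch of that shape, shown eagerly on std::vector for clarity — the function name is made up, and on a relaxed radix balanced tree each step is logarithmic and shares structure instead of copying like this does.

```cpp
#include <cassert>
#include <vector>

// Insert-by-slicing: prefix + piece + suffix. Eager and O(n) here;
// the relaxed radix balanced tree does each step in O(log n).
std::vector<int> insert_at(const std::vector<int>& v, std::size_t pos,
                           const std::vector<int>& piece) {
    std::vector<int> result(v.begin(), v.begin() + pos);      // slice: prefix
    result.insert(result.end(), piece.begin(), piece.end());  // concat: piece
    result.insert(result.end(), v.begin() + pos, v.end());    // concat: suffix
    return result;
}
```

The original sequence is untouched, which is exactly what makes the same trick safe on the persistent version.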
00:14:55
You only copy a logarithmic amount of data. With this of course then we can also get insertion and erasure at any point, yes? - [Audience Member] Do we lose the ability to address it as efficiently as we did before? The question was do you lose the ability to address it as efficiently? Yes. Still, this concatenation algorithm bounds the amount of extra steps that you need to do.
00:15:20
So it's gonna be a constant factor that, if you measure it in practice, it's like 2x. So it's still reasonable considering that if you have a large amount of data, now you can insert an element at a random position. Yeah, so once we do this, the data structure is the relaxed radix balanced tree, and you get this map of operations where you have effectively constant random access, update, push back,
00:15:50
and slice. This effectively constant means it's actually logarithmic, but as we argued before, you can treat it as a constant, even though maybe a big constant. But, you have logarithmic concatenation, insertion, and you know push front, which is great, also erasure at any point. I've been showing this picture here,
00:16:15
but of course this data structure was invented in languages like Clojure that have, you know, a managed runtime, and that box the values when they're stored somewhere. So the real picture is this one where every element has its own pointer. As C++ developers, we want a little bit more cache locality, right, so we want to actually store it directly in these nice blocks that I already have in the data structure. The problem is then what happens if the type T
00:16:43
that I'm storing in the vector is big? Now my leaf nodes become big, and this nice property that we had of a reasonable balance between update performance and node size, it doesn't hold anymore, right? So I propose a solution for this where of course if the T you're storing is as big as a pointer, you just use the same block size.
00:17:09
But, if you have bigger elements, you simply store less of them in the leaf, but you still store a few of them. So in this case, they are twice as big as a pointer, so they store half as many. Likewise, this also allows us to say if we have small elements like let's say chars, I can pack more of them in the leaf. This is what I call the embedding radix balanced tree which is characterized by instead of having
00:17:34
just one constant, as I said, it has two different branching factors, one for the inner nodes, another one for the leaves. Now what is gonna be the branching factor at the leaves? We just compute it to be as many elements as would otherwise fit in an inner node. So with this we achieve already nice performance in general. There are still cases where things get a little bit tricky. Let's look at this function here.
00:18:02
So this function here takes an immutable vector. So I'm using the Immer namespace everywhere here. None of this is std::vector. This is an immutable vector where we push in a loop, we populate it with every integer between first and last, and then return the new updated vector. Can anyone see what's the problem with this function?
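The function on the slide is roughly this shape. This is a reconstruction from the description: the name `myiota` and the `ivec` stand-in are illustrative, and the stand-in copies the whole buffer on every push_back, which exaggerates but mirrors the "fresh nodes per update" cost being discussed.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Stand-in for an immutable vector: push_back returns a fresh value.
// (A real implementation copies only a path of nodes; here the whole
// buffer is copied, but the shape of the problem is the same.)
struct ivec {
    std::shared_ptr<const std::vector<int>> data =
        std::make_shared<const std::vector<int>>();
    ivec push_back(int x) const {
        auto copy = *data;
        copy.push_back(x);
        return {std::make_shared<const std::vector<int>>(std::move(copy))};
    }
};

// The slide's function, roughly: every iteration allocates fresh nodes
// that the next iteration immediately discards.
ivec myiota(ivec v, int first, int last) {
    for (int i = first; i < last; ++i)
        v = v.push_back(i);
    return v;
}
```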
00:18:26
- [Audience Member] Adding one at a time. Sorry. - [Audience Member] Adding one at a time. Adding one at a time. And, what's the problem of adding one at a time? - [Audience Member] We're more spread out. We need more copies. Exactly, so the answer is there are unnecessary copies. We saw that when we update the data structure, we need to update a path down to the root. So with this push back operation, in the loop we're gonna
00:18:52
be creating new nodes that the next iteration of the loop is very probably gonna just discard, right? So this is less performant than it could be. So a solution that the Clojure people proposed, and we implement in our C++ alternative, is to say okay this function is like a transaction, you could say, where I'm interested in the vector
00:19:18
that comes in as an argument and the vector that is returned at the end. Everything that happens in between, I don't care if it's doing mutation, right? So we can say okay can you give me a mutable view on this vector that is gonna use a copy on write strategy more or less to mutate the vector in place whenever possible?
00:19:43
Yes? - [Audience Member] It should preserve the immutability of P, so everyone else can use it still as a C-- Yeah so the question was do you still preserve the immutability of P, the input value, and the answer is yes exactly. That's why it's copy on write. It still needs to check okay is this node that I'm looking at owned by some immutable thing, right? But, you know as it goes through a couple of iterations,
00:20:09
then it will have new nodes that it just created locally, and it will update them in place. When you're done with it, then you say persistent, and you get again a frozen persistent value. Yes? - [Audience Member] How do you do traversals? Do you have to lookup each of them, or does it try to follow the contiguous array? So the question was how do you do traversals? Do you look up each element, or do you go through contiguous arrays.
00:20:36
And, it depends on which API you use, right? So if you use accessing by index, then you're looking them up each time. If you're using iterators, then it's gonna run through one of the buffers. When it's done with the buffer, it's gonna go and look up the next one. There is a third alternative that the library provides which is using what I call internal iteration, where basically you have a for each function,
00:21:00
and you just pass a lambda to the for each function. In that case, it's gonna just run through the buffers as it goes. And, it's really efficient. I will talk more about that later. Now another advantage of this transient API is that now you get also compatibility with the STL, because the STL assumes mutable APIs; if you want to reuse an STL algorithm, now you can.
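The transient round-trip being described can be sketched like this. It's a toy, not the library's implementation, and the names `pvec`, `tvec`, and `populate` are made up: a frozen value hands out a mutable builder via `transient()`, the builder mutates privately, and `persistent()` freezes it again. The real thing additionally uses copy on write per node so even the initial copy is mostly avoided.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Toy transient pattern: the frozen value stays untouched while a
// privately owned builder mutates in place.
struct pvec {
    std::shared_ptr<const std::vector<int>> data =
        std::make_shared<const std::vector<int>>();

    struct tvec {
        std::vector<int> scratch;                        // privately owned
        void push_back(int x) { scratch.push_back(x); }  // in-place, cheap
        pvec persistent() && {                           // freeze again
            return {std::make_shared<const std::vector<int>>(
                std::move(scratch))};
        }
    };

    tvec transient() const { return {*data}; }  // the source stays frozen
};

// The earlier loop, rewritten as a "transaction": one mutable view in
// the middle, immutable values at both ends.
pvec populate(pvec v, int first, int last) {
    auto t = v.transient();
    for (int i = first; i < last; ++i)
        t.push_back(i);                         // no per-step copies
    return std::move(t).persistent();
}
```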
00:21:27
But, there is something more you can do actually with transients. Let's look at this function. This function takes a vector and then calls push back on the vector, and does so a couple more times. Many of you will have noticed, the first value is a named value, or l-value in C++ jargon. And, the second two operations are happening on r-values
00:21:55
which are anonymous. They have no one else looking at them. So if you combine this vector with a policy that does reference counting under the hood for garbage collection, you can say okay if I have an r-value, and the nodes I'm looking at from the root only have one reference, no one else is looking at them. I can update them in place. So that's why basically in this library if you have
00:22:24
reference counting which is one of the things you can choose actually. You can choose other garbage collection strategies. R-values are considered transients, and furthermore, this is very nice actually because now we can go one step further and move v there, the input value there, because we know it's the last time it's gonna be used. And, once we do this, we're really using linear types here.
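The r-value trick under reference counting can be sketched like this (a toy, with invented names): push_back on an r-value checks whether it is the sole owner of its buffer via `use_count() == 1`, and if so mutates in place; the l-value overload always preserves the original.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Toy reference-counted vector with ref-qualified overloads: r-values
// whose buffer has a single owner are updated in place.
struct rcvec {
    std::shared_ptr<std::vector<int>> data =
        std::make_shared<std::vector<int>>();

    rcvec push_back(int x) const& {           // named value: copying path
        auto copy = *data;
        copy.push_back(x);
        return {std::make_shared<std::vector<int>>(std::move(copy))};
    }
    rcvec push_back(int x) && {               // temporary: maybe in place
        if (data.use_count() == 1) {          // no one else is looking
            data->push_back(x);
            return {std::move(data)};
        }
        return static_cast<const rcvec&>(*this).push_back(x);
    }
};
```

Moving the input value into such a call is what lets chains of these operations compose with no superfluous copies when there is no sharing at runtime.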
00:22:51
If you know about Rust and other like advanced type systems. And, now basically transients start to compose. So I can call this function twice, pass the return of this function to the next one, and there are going to be no superfluous copies whatsoever in this whole call, right? Which means that we achieve a nice property which is that we're working with this functional looking,
00:23:18
very value oriented API where the performance of the system is gonna actually depend on the actual amount of persistence, the actual amount of sharing, at runtime. Which is really nice actually. So another consequence of this is also that the loop that we had before, if you don't like to use the transient looking API, you can just move
00:23:49
the value, and it's gonna have exactly the same performance if you're using reference counting. This is great, but the question everyone has here probably is is this fast? Or, how fast is this? So I have a bunch of benchmarks that I'm gonna skip through to the conclusion, just say yes. If you really want to look at the benchmark data,
00:24:18
I actually wrote a paper about this adaptation of immutable vectors for C++ that was published in ICFP this year, this summer. But, I would say like the key aspects here are that of course if you do a lot of persistent updates as we saw, you're doing copies. So that's slower. With regards to iteration, and you asked before okay how can you iterate over the data structure?
00:24:56
There are these three mechanisms. If you do internal iteration, which most of the time you can, actually the performance is like only 1.5 times slower than an std::vector. It's really close. If you're doing anything interesting with the values, it's almost nothing. So that's great. Then the interesting thing is that this data structure allows you to organize your architecture in a different way.
00:25:26
So you get concurrency much more easily. You get undo much more easily. So in these ways, basically even though the update performance is slower, in practice, for example, the latency of the operations in your system will probably get much, much lower without adding complexity to the code.
00:25:54
But, I don't really want you to take my word for it, so I want to show you something else. I'm gonna show you an example of an application built with this data structure. So I built a text editor, probably inspired by the fact that lately, I don't know, there have been on Hacker News quite a few text editors. There is one from antirez, the guy that wrote Redis.
00:26:22
That's done in C, and I wanted to show okay let's do it in C++ and with a purely value based architecture. So let's see how this text editor works. I have to look at the screen now, okay. Okay so I have here a file called Esperanto Wiki. It contains the contents of the whole Wikipedia in Esperanto; the English one is too big
00:26:54
which is one gigabyte. I don't know which text editor you people use, but I can assure you that most text editors really struggle with this file. If you use like Atom or something like that, it's gonna crash immediately. If you use Vim, some things might be slow. Let's see what happens here. So I can open the file. Oh it's immediate. The thing is loading asynchronously,
00:27:23
still responds very fast. Of course, I mean this is something we could expect because I can load in a separate thread, load a little bit, and then send a value to the other thread, no mutex or anything, and get nice asynchronous loading. There are other things here that are nicer than in the typical text editor. There's one thing that's a little bit stupid, but I really like to comment on it which is
00:27:49
okay this thing here, these two dashes on the lower left part of the screen. That is the dirty marker which almost every text editor has. So now if I type, I get an asterisk. Now what almost no text editor does is that if I delete the characters by pressing backspace, as opposed to doing undo, and I restore the document
00:28:13
to its original state, the dirty marker disappears. Why? Because I don't have to check the marker in a stateful way anymore. Since my values are now also easy to compare, I didn't mention that before, but because of this structure you know you can see what changed easily between two versions. I never keep a dirty flag in my data model. I just keep my original contents of the file,
00:28:42
the current one, and compare them. Is it different or not? Now another nice thing I can do here is, let's say, I'm gonna select all the text, press copy. Oh I think many text editors will have crashed by now. Now I'm gonna go, I don't know, somewhere, some random position, I'm gonna paste it a few times. No problem. I have now like, I don't know, an 82,000,000 line file
00:29:08
here. It was immediate, right? And, why is this? Well this is because of the logarithmic concatenation I mentioned. So when you insert something in this document, it's gonna be a logarithmic operation, blah blah. So it supports this, and it's quite fast. I can undo of course. So we can go back. And, it supports Emacs style redo also,
00:29:35
so I can edit and then redo, all the typical things. Yeah? - [Audience Member] Can you explain that one more time how the (muttering)? Can I explain what, sorry? - [Audience Member] How you were able to insert that so quickly and even managed to add (muttering). The question was how was I able to insert so quickly? So we will look at that in a second. I mean this text document that's represented
00:30:01
in my data model as a vector of vectors, a vector of lines basically, right? So when I insert text, I just have to insert something in the middle of this vector. We saw we had that concatenation algorithm that is logarithmic in time, and also slicing, which is maybe the other part that's needed for this to work, right? So I can just slice at the point I need to insert,
00:30:32
and then reconcatenate with what I had copied, which is also coming from the immutable data structure. So actually the interesting thing now is that the file might now be bigger than what I could fit in my memory if I saved and loaded it again, right? Because this is now relying on structural sharing. A lot of the document is actually the same objects, the same buffers that are appearing in different positions
00:31:04
of this otherwise linear sequence. So this is how this works. So let me go back to the slides. Oh I can also, just for completeness, I can save the document. Asynchronously; I can even edit while it's saving, right? - [Audience Member] Wait a minute. Yes. - [Audience Member] So from out of memory what it's trying to save, but if it's previous saving, then you're going-- Yeah sorry, trying to save is okay.
00:31:31
What is not okay is to then load it again, because then of course I mean the file doesn't include any information about the sharing, right? - [Audience Member] What you could do is, on loading, you can find things that you already have. Like you could sort of find correlations, you could use those to find. Yeah of course. The remark was you could maybe do a diffing while loading such that you find
00:31:55
the correlations in the document and try to reconstruct the structural sharing. And, yes you could do that. But, I wanted to keep the code simple in this case. So let's go to the slides. That's my view of it. All right so how does this work? Well first let's look at how it doesn't work. This is the typical way you build an interactive application
00:32:31
which is a model, view, controller architecture. This is what we were trying to get away from in the first part of the talk, right? And, the reason is simple. We have this nice separation between model, view, and controllers, but we also have arrows between them. And, in this picture, the solid arrows, they represent references, pointers between these stateful objects, right?
00:32:58
And, even worse, we also have dotted lines which represent indirect references, which in this case there is gonna be probably a signal slot mechanism in the data model so the views can get updated. And, you're gonna have a callback that closes over the view to be able to notify, same thing with a controller and so on. And, this is nice. We've been writing software like this for a long time,
00:33:26
but what many of us have found when we get to make very large software is that these stateful object trees, they don't compose. Like if you build one thing, and then you want to put it inside something bigger, then you cannot get rid of the notifications that are happening under the hood and all this stuff. So basically when you have lots of things,
00:33:49
then you have lots of objects and lots of arrows, and you have to think about all of them. You cannot abstract them away. So you get again the flying spaghetti monster that takes over your code. So there is an alternative way of doing things that I haven't invented myself. I think people especially in the web technology space, especially in the frontend web technologies,
00:34:14
are really exploring these ideas. It has many names. So if you do Clojure, it's gonna be called a single atom architecture. If you have read about Facebook's Flux, then it's called a unidirectional flow architecture. But, the basic idea is to say okay I have actions, which are some stimuli that come from the world.
00:34:42
I have the model which is a representation in computer space of the state of the world, and we have views which are other representations, you could say more suited for human consumption. And, in this architecture, the nice thing is that these things are not objects in this stateful sense. These things are values.
00:35:09
So we have an action value type. I have a model value type, and a view value type. And, the arrows between them, they are not pointers or references between objects like it was in the MVC world. They are functions. So you're gonna have a function called update that takes the current state of the world, represented as a value. You take an action, something that happened
00:35:33
outside of the world, represented as a value, and you're gonna produce as a result a new data model, right? Likewise, when we get from the model to the view, you're probably gonna have something like a render function where you take the model as a value, and then you return a new view as a value. This is how the React JS world works, which is a very popular framework on the web.
00:36:03
Now the problem here is that in C++ we don't have React JS. So what do we do? Well alternatively, you can use an immediate mode API and just, you know, traverse the document and print it to the screen. And, in practice, this works very well. Because what you're producing, this view, you could consider it to be pixels on the screen. My value is indirectly represented by my immediate
00:36:28
mode API. One last aspect that some APIs have is then of course you need a way to dispatch actions, so there's gonna be some framework in which you can have asynchronous things dispatching actions, but I don't want to get too much into that 'cause I want to show you this picture. That was the architecture from, you know, a boxes and arrows point of view. This is the architecture from a code point of view.
00:36:57
And, this is actually not accurate anymore, since I added the asynchronous saving and loading, now there is an event loop. But, this was real code copy-pasted from the application before that. And, it's an exercise for the reader to see what changes are needed to make asynchronous saving and loading work. So this is basically, you could say, the main of the application, right?
00:37:26
This is the main loop. And, in this case, we have a terminal object. Terminal, in this case, you know, it's a representation of the terminal as in the text terminal that I interact with. It's basically a wrapper around ncurses. Then I have the state. This is my data model. So my data model, the root type is called application.
00:37:53
That is the root of my data model, and I initialize it, in this case, with the initial contents of the buffer that I load. And, I also give it which key map I want to use. I'm an Emacs user, so I have Emacs-like bindings by default. Then I draw the state, and then I loop. And, I loop basically calling the update function,
00:38:19
and as we just said, as its first argument, we pass it the current state of the world; as the second argument, we pass it the action, which in this case we get from the terminal. So the terminal has, in this case, a blocking API that allows you to say okay, what's the next thing that happened? Whenever something happens, we need to update the state, so update our current view
00:38:43
of the model, and then redraw, right? The nice thing here is that basically this state variable here is the only single mutable variable that really lives through the whole application. And, even further, it's hidden within this function, right? So no one is taking a reference to this variable.
00:39:06
So all mutations that happen, you can understand them by reading this function. There is also maybe one slight API change from the previous slide, which is that here instead of returning a new application, it returns an optional. It's a way that I used before to signal that the application is done. So you return null when you want to exit. The API changed a little later when I added asynchronicity.
00:39:31
I don't think it's very important to go into the details there. Now the second thing you need to do to understand this application is to understand its data. And, basically here, following a more functional approach, instead of modeling objects and relations, we start by modeling the data, which is the data that our data model has.
00:39:54
So we have some index that we're gonna use for coordinates. And, then here we have, this is what I mentioned before, the text, right? This is the meat of what makes this application work efficiently. So we have a flex vector. So flex vector is the name in my library of the vector that supports logarithmic concatenation. You can opt out of that if you just use vector.
00:40:20
And, we have lines which are vectors of chars. We have text which are vectors of lines. Then we have a file. What is a file? It's a name and its contents. (The question was: what's a box?) So this is a single element container. It's a way to avoid copying the string all the time.
00:40:48
So it's a box that basically puts const on the API. So you can only get a const reference out of it, and under the hood, it's gonna share the copies of it. So it's gonna just copy a pointer. Then the other thing that I have is a snapshot. We're gonna use this for undo, right? What information do I need when I undo? I need to know the contents of my editing buffer
00:41:19
when I did the operation that I want to undo and then the cursor position, because when you undo, you also want to send the cursor back to where the user was before the operation. So we just store that data. Now we can build the buffer type. The term buffer here comes from, I think, Vim and Emacs. This kind of, you could say, hacker text editor uses the term buffer for one of the documents
00:41:47
that you're editing. The first thing that we do is keep the file that we loaded. We saw it's the name and the content. This is how we can see whether something changed, for the dirty marker, right? We have the current contents. We have the cursor. We have the scrolling in the screen. We also optionally have a selection because you can select text.
00:42:16
And, then we have a vector of snapshots as my undo history which in itself, by the way, is again also another immutable collection so I could also keep a history of histories if you want to go crazy. And, then I have the position in the history. So when we're gonna do undo, we're gonna need to know how much did we undo. Finally, recursively we get to the application
00:42:45
which is the root of our data model which has the current buffer that we're editing and some information about the key map that we're using, the current input sequence because you can do like Control + C + X, so you need to keep what the sequence of keys you're pressing is, a clipboard, and some messages that we have. And, finally we have the action which in this case
00:43:08
is this type that indicates what happened in the world. In this case, because it's a terminal application, it's simple. The only thing that can happen is a resize or a key press, so that's what we store. And, I think just by reading this data, most of you already know how to straightforwardly implement the application. Right, if you want to insert a character, you're gonna return a new buffer with the character
00:43:34
in there and update the cursor position, and so on and so forth. Yes? - [Audience Member] I see you're mixing flex vector and vector, or, not mixing, but you're calling both. What's the difference? So the question is: what's the difference between flex vector and vector? Flex vector supports insertion and concatenation in logarithmic time. So it's very handy for the text. Normal vector does not support that.
00:44:00
But, that's okay for the history because the only thing that you're gonna do in the history is push back and potentially maybe pop back. - [Audience Member] Which you're not doing because of your history function. Exactly, I have Emacs-like undo which I'm gonna explain in a second. - [Audience Member] Why not use a stack for this? Sorry, can you repeat the question? - [Audience Member] Stack, (muttering). A stack? - [Audience Member] For history. Yes, so the question was can you use a stack
00:44:28
for history? Yes, this is in practice a stack. We're not gonna use the rest of the API. Well actually, yeah, we're gonna use it a little bit. We're gonna see in a second how undo is implemented. But, conceptually a vector is a stack also if you only use push back and pop back. - [Audience Member] It simplifies this. Yeah, there's another question. - [Audience Member] How is the box and the string different from a shared pointer to a const string?
00:44:55
The question was how is the box different from a shared pointer to a const string. And, it's almost the same actually. So the difference is that box has maybe a nicer API; it has, for example, a method update. You can pass it a function and it returns a new value that changes the thing inside.
00:45:23
Also it supports memory policies. So this library has this notion of memory policies in case you don't want to use reference counting, for example. Or, if you want to use single threaded reference counting and all these things. It's like having the shared pointer but with an interface that gives you a value based API as opposed to a pointer API. Okay so just to show you one last thing,
00:45:52
let's look at how I implement a feature. And, I want to show you how to implement undo because it's one of the things that in the typical mutable world is quite complicated. And, I said before that I implement Emacs-like undo, so for those of you that don't use Emacs, I'm gonna explain to you how undo works in Emacs. Because the goal in Emacs undo is to basically be able
00:46:20
to undo the undo. So you can always redo and never lose history. So basically there is only an undo operation; there is no redo. So I do an edit. This means that I have a state that is produced by this edit, and I have this cursor here which is the history pos. This is this variable we saw before, the history position: where in the history am I looking at.
00:46:46
Now when I do more edits, I push more states, and of course the history position is always at the end. If you remember, if you were paying attention, the history position was an optional. When the optional is null, that means it's at the end. That's how we indicate this. You can do this a few times, more times. Now let's say I want to undo. If I undo, I'm gonna replace the state by what was there before.
00:47:15
So I'm gonna replace the current contents of the buffer by S3, and I'm gonna move the cursor back. Now in Emacs what happens is when you undo something, the undo action itself gets pushed as a new state in the undo history. I can undo a few times, as I undo, these states here get pushed back again there.
00:47:44
And, now when I do a new edit, of course the new edited text is pushed at the end, but I also move the cursor to the end of the undo history. This means that basically in Emacs there is no redo. If you want to redo, you do an edit, and then you undo, and you will get the undo actions as something that you will also undo.
00:48:11
Do I make sense? So how much code is this? This is all the code for undo, right? And, more importantly, this code is totally independent of the actual edits that you do, the actual operations. You don't have to know how to insert a character to un-insert a character. Let's look at what these two functions do. The first function is called record.
00:48:40
And, it gets the old buffer, the buffer before my edit operation, and the new buffer, the buffer after the edit operation. This function is gonna be called whenever I try to do an edit. And, basically what it's gonna do first is it's gonna look at whether the content changed, because there might be operations that don't change the buffer, even operations that sometimes change the buffer but in this case didn't because, I don't know,
00:49:05
you're deleting the first character or something. Then if it did change something, we need to record it. So we just push it at the end of the history vector. And, then we also compare okay, did the position of the history change? Sorry, is the position of the history the same? If it was not the same, I just did an undo,
00:49:31
so that's okay. The undo probably has already managed the history position. But, if it's the same, we want to bring it back to the end. This is the thing that we saw happen there. And, then finally, well, we return this new buffer value that has recorded the undo. Then when we want to undo, simply we take a buffer. Then we say okay, give me the index in the history
00:50:02
at which we are. If the index is greater than zero, then we take the last state from the history, and we set it as the new state. So the snapshot has the content and the cursor position. We set it, and of course the cursor in the history has moved back. We also store that, and we return it. We also see here, you know, this code
00:50:30
is also not weird to the C++ developer in the sense that I'm also using mutation of structs. I'm using normal assignments. The nice thing is C++ already gives me the guarantee that, because I'm taking this by value and returning this by value, these updates are not gonna be visible outside. So I could say this is kind of idiomatic. It works nicely from an architectural point of view. So this is pretty cool I think.
00:50:56
I hope you find it cool too. Just for the record, as a, let's say, CppCon present, I merged two days ago a new data structure in the library, which is persistent hash maps and sets, which work in a similar way with regard to this balancing, between, you know, this wide-nodes-in-the-tree approach to data structures.
00:51:23
There is actually a talk tomorrow by Phil Nash, who I think I saw somewhere over there. There he is. He's gonna talk tomorrow actually about how this magical hash table that is also persistent works. So that's that. That's also another thing that appeared in the title of this talk that I didn't talk about. So I said the data structures were postmodern.
00:51:49
Hopefully I will do a lightning talk this evening where I will maybe clarify or obscure some of these topics. Thank you very much. Here are the links to the text editor and the library. I hope you enjoyed it. Thank you very much. (audience clapping) I think we also have some time for questions. - [Audience Member] Hi. Hi.
00:52:28
- [Audience Member] So my question is, first of all, thank you for the talk. It was very amusing and also very useful. If you want to be concurrently editing the same model, how would you deal with that? You don't want to miss updates. So that is one of the tricky things that the data structure doesn't solve for you, right?
00:52:59
So this data structure, you could say, has a forking model. So I can fork it, create two histories. What happens if I want to reconcile these two histories? And, that's up to you. In whatever meaning the things you store in the data structure have, you have to define some merging operation for it. The data structure is not doing it for you.
00:53:25
Maybe there could be some things the data structure could do for you. Like, I'm considering adding in the future, for example, a way to diff two structures, which could be done efficiently, such that you see everything that happened between these two forks. And, with this information, for example, you can do the merging in a better way. Or, I could even try to have a merge operation that, given a diff,
00:54:14
you could use to do the merge. But, I don't think there is a general solution that would work for everyone anyways. - [Audience Member] Thank you. - [Audience Member] Okay, you mentioned Ralf Hinze's finger trees at the beginning of your talk, have you ever given thought to actually implementing those? Because they look very marvelous on paper, but the only implementation I ever stumbled upon
00:54:14
was very inefficient. And, maybe if we implement it in C++ with modern features and pay attention to cache locality and stuff, this may be a cool data structure. Yeah, so you were talking about the finger trees data structure. I had a link to the paper, but I didn't even mention it, I think. But, yeah, that's a very interesting data structure, and, as you say, it doesn't have very good cache locality and all this stuff, so I haven't implemented it.
00:54:42
There is though an attempt to make them more efficient already. There is a paper called Chunked Sequences, I think it's called Theory of Chunked Sequences, which is basically applying the same principles, but instead of only five pointers in the node, it uses a buffer also. The only thing with that paper is that it produces
00:55:05
only a mutable data structure. So I guess it's open work to apply that idea of using buffers and combining it with persistence. I've run benchmarks actually against that mutable implementation, and this radix balanced tree data structure was anyways faster, but it might be also because that implementation had other defects, or maybe I also did
00:55:32
something wrong in the evaluation because there are some things you could customize. But, yeah, if you want to give it a try then... Thank you. You're welcome, yes? - [Audience Member] Hi, great talk. I was wondering how much you've looked into memory growth of mutable versus immutable, fragmentation, like any scenarios where this would not be
00:55:57
favorable? So I haven't done measurements in that regard. You mention fragmentation and this stuff of the heap, for example, because you're doing more allocations and all this. I guess that depends very much also on how you're using it, like how rare are your updates versus your reads. I provide some tools to try to alleviate that actually.
00:56:27
So if you're in the default configuration, I also use a pool of buffers to try to not reallocate if there is already a buffer there and reuse buffers more. But, I think that it's very application dependent. In the particular application of the text editor, I haven't measured anything, but it's an interesting topic. - [Audience Member] All right, thanks. Yes?
00:56:56
- [Audience Member] Does the container support custom allocators? Like, for example, you want to store it in shared memory, and, for example, use smart pointers instead of raw pointers. So it does support custom allocators. It doesn't support, at the moment, the standard C++ allocator API. So it only supports allocators that have no internal state at the moment, because there are some questions
00:57:21
about what to do. Like, you're always producing new values, so you have to carry this pointer to the allocator. With a confluent data structure, which is the technical term for the concatenation, you even get the question of okay, what happens if I concatenate two vectors and they use two different allocators? I think you can find compromises and answers to these questions.
00:57:45
I was too lazy, so I didn't answer them. I think, for example, for the shared memory example, you can do that right now because that's a global property that you can encode in the type. And, there are a few things you can customize. For example, I also provide policies that already enable using LibGC. It's Boehm's conservative garbage collector. And, in that case, there is no reference counting. It just relies on that for the garbage collection.
00:58:12
So yeah, that's the answer. - [Audience Member] But, does it rely on pointers being raw pointers? For example, if you store it in shared memory, you might want to use the Boost offset pointer for example. Uh huh, interesting. So at the moment, it's using raw pointers, yes. I haven't looked at that Boost library enough to know how to make that work actually. - [Audience Member] You can allocate with pointer traits.
00:58:39
Yeah, so the answer here was to use allocator traits or pointer traits, which at the moment I'm not honoring. So I guess that will be something you would need to change in the library. I imagine it's probably possible, but yeah, it's not implemented. - [Audience Member] Okay thanks. Yes? - [Audience Member] Hi, in the past, I've used the builder pattern: if you want to have, like, a person class that has a first name and a last name, and you want to have that immutable,
00:59:05
so you can say with first name on an object, and it'll create a copy of that. Do you have anything in your library to kind of help with creating like user defined immutable data types or any recommended patterns for that? Not really, so as you saw in the text editor, I'm using plain structs, and I'm just relying on the fact that you can copy them or pass them by value.
00:59:31
And, in my personal experience, this kind of has worked for me. Maybe also because I'm used to working in other languages like Clojure or Haskell where you work with open data more often. I would imagine that if you wanted to use encapsulation, I guess, that's what you want to do there, right? You don't want to expose the members directly, and encapsulate it with const setters. You could build something like that.
00:59:57
I haven't really thought about it much. It's interesting. If you have some ideas actually, it would be nice to talk about them afterwards. Anything else? Well thank you very much. Oh there is one last one. (laughing) - [Audience Member] You mentioned reference counting as a default policy for memory management. And, there is also a demo in which you would concurrently save the file while you're editing it. What kind of counter incrementation technique do you use,
01:00:27
like atomic increments or? So by default, it's atomic, and that's why by default it works in a concurrent setting. You can also change it and say I want non atomic. So for example, there are already Python bindings. They're not complete, but just enough to prove that you can do it. And, in Python for example, that's not a problem. I use the single threaded non atomic reference counts.
01:00:58
- [Audience Member] Doesn't it involve paying the atomic cost on the usual operations for the possibility of saving concurrently? Can you repeat that? - [Audience Member] Do you incur the cost of atomics on every update, which goes in the main thread, for the very possibility of doing saving concurrently? Yes, because you cannot really know when you can do it,
01:01:24
right? That's the point of the reference count. It's what has the information about whether you can do it or not, right? So yeah, this is the case here. It's still fast enough. Of course, if you benchmark it, and this is something you can see in the paper, reference counts have a measurable cost. And, when you're copying these nodes, you have to touch a few reference counts if you have quite a few children. It's still a constant factor.
01:01:49
In this case, I'm inserting, pressing a key, and, you know, it's maybe gonna change it from one microsecond to five microseconds. It's good enough. - [Audience Member] Is there actually a way to do non atomic updates in some cases, like with transients, or, like, for some reason when you're not saving, you would not incur that cost. So I think yes, with transients, you could do that.
01:02:21
If you combine the two transient strategies, right? So there is the default transient strategy, which is to just look at the reference count that you have for garbage collection. There is another one which is, okay, let's say you're using libGC, you don't want to have reference counts at all. So instead you annotate the nodes with an owner which is associated to this transient object.
01:02:46
That's why I mentioned before that if you want to use move semantics, or if you want to optimize the move-from operations, you need reference counting. You could combine the two somehow and say okay, in my transient, annotate things to not do the reference count increments, but outside I'm gonna do the atomic reference counts.
01:03:15
I'm not sure how much you will save though, because in the persistent updates you will still need to do atomic reference counts, and also when you pass by value. But, it might depend on your use case. Like, if you find that this is, for example, a problem because you have, you know, big push backs. - [Audience Member] Just not a problem in a text editor. It's a problem when you use a lot of threads
01:03:43
working on the same set of immutable data structures sharing a sizeable common chunk. Yes, yes, it's a potential problem. You also have actually false sharing. So you could say all my data is immutable, right? It's very nice for the caches and everything. But, then you have this atomic reference counter you're touching, and you're limiting your ability to, - [Audience Member] Parallelize. to parallelize, right?
01:04:07
But, that is going to depend very much on your workload. So when you are doing the updates, how do these interleave with the reads from the many threads? So it's very hard for me to give you general advice in that regard. - [Audience Member] But, in that very case, you have a trade off. You can just clone the data structure; from the point of view of the user of the library, there would be no difference. You will pay more memory for less overhead
01:04:33
in reference counting for example. Yeah, exactly that. - [Audience Member] I have a separate working set for each thread and do not pay for atomics for sizeable stuff. Does your library provide this? Sorry. - [Audience Member] Does your library provide this somehow? I haven't thought about it till now. (laughing) If you want to contribute it. I prefer to be use case driven. So my first goal is to have a complete enough set
01:05:00
of data structures to have something like a standard library of data structures. So now we have hash maps and sets. I haven't implemented transients for those yet, so that's the next step. Then probably ordered collections are gonna be useful. Then afterwards, I'm thinking of internalized strings. Like, doing something about strings might be nice. - [Audience Member] Ropes you mean? Sorry? - [Audience Member] You mean ropes? Ropes, no, not ropes.
01:05:26
I mean strings like the ones in Python where basically there is a global hash table with all the strings in your application. - [Audience Member] Interned strings. Interned strings, right, that's the English word. But, this is speculating a little bit. Also the roadmap might change depending on the users, and if someone else wants to hire me to implement a specific data structure or use case. You know, I'm available as a contractor.
01:05:51
- [Audience Member] Very interesting stuff, thank you. Yeah thank you. (audience clapping)
End of transcript