The article is missing one of the important but commonly overlooked structures: the "piece chain" [1] (aka "piece table"). It is not easy to find editors using it, especially among the popular ones, but:
In the same vein as the article, the code for insertion can be found at [0]. The interesting thing about the piece chain is that it is a persistent data structure.
This makes it relatively easy to implement non-chronological undo/redo operations, i.e. an undo tree.
Because old versions of the text remain available, one could for example link compiler error messages back to the state when the file was last saved. Thus, after a long build, only those errors which are still relevant (i.e. not already fixed in the meantime) would be displayed.
Piece chains also map well to POSIX systems, where the initial file can be mmap(2)-ed. That is the reason why vis supports editing of large files efficiently. All editing operations are independent of the file size, but linear in the number of non-consecutive changes. Of course, things like go to line nn will have to read the file content; there is no way around that. But, for example, navigating using the nn% command is possible in constant time. The way syntax highlighting is implemented, it should also (mostly) work for all types of files.
A problem with this data structure is that it's not self-optimizing. Suppose you insert a character after every other character (not common, but easy to do with a macro). You will end up with a list node per character. On the one hand, the size expansion is probably fine because the undo records of other data structures would equal it (if they support unlimited undo). On the other, all data access now involves pointer chasing for each character. This could be bad for screen updates, search, etc.
The doubly-linked list of gap buffers method is somewhat similar, but is self-optimizing because adjacent small buffers can be merged.
The piece chain approach does indeed have some cons too, not only advantages. E.g. retrieving a byte at a specific absolute offset requires iterating sequentially over all preceding "pieces" of the chain in the basic variant of the structure (i.e. if used without additional helpers/caches/...). But which method doesn't bring its own disadvantages along with its advantages? It would probably be interesting if someone managed to write a comprehensive comparison/analysis of all the various known data structures for editors...
My intention was to highlight the piece-chain method as it seems to me surprisingly often overlooked (or simply not known), while having some noteworthy features. That said, I'm starting to think that its non-triviality could itself be counted as a disadvantage (i.e. it may make the program more complex than the other, "dumber" methods would). I'm not sure about that, though: adding undo on top of the other ("simpler") approaches may well make them just as complex anyway.
Of course all data structures have different trade-offs. I just think the piece chain has some of the most favorable ones. The fragmentation issue you mention is true.
Its effects can be mitigated by:
1) storing the pieces in a balanced search tree instead of a linked list. This would make random access logarithmic in the number of non-consecutive changes.
2) implementing some kind of compaction algorithm which merges adjacent changes. This would increase memory usage because not only the changed text but also some unrelated data in between would have to be stored in a contiguous memory region. Furthermore, this duplication of "unchanged data" would harm some other nice properties of the data structure. For example, marks could no longer be represented as simple pointers into the underlying buffers because their content would not be unique (i.e. a character could be part of multiple buffers).
The pointer chasing you mention is an issue for all non-trivial (i.e. not stored in a single contiguous memory region) data structures. Screen updates are a non-issue since the output region is of fixed size (a terminal). For syntax highlighting using LPeg, the relevant text region is temporarily copied to a contiguous memory region. This is simplified by the fact that vis currently always soft-wraps lines and does not support folds.
Gap buffers have other issues, for example you only have one gap. What happens if you support multiple cursors/selections which might be on totally different regions of the file? You will have to move the gap.
This is not an issue for a doubly-linked list of gap buffers unless the cursors are on the same line. However, what happens if you have a single-line file (like minified JavaScript)?
Anyway, I find the piece chain and its support for (non-chronological) undo/redo as well as mark management rather elegant.
In practice it seems to work quite well for vis. I haven't encountered performance issues in daily usage. On the contrary, users have reported that it is "blazing fast".
I interviewed there earlier this year and they asked me a question on the piece table. I didn't know what it was called until now, but based on that question, they might still use it.
- from what I've read somewhere, MS Word was/is (?) using it internally [http://1017.songtrellisopml.com/whatsBeenWroughtUsingPieceTa...]
- AbiWord [1]
- [1]: http://www.catch22.net/tuts/piece-chains
- https://github.com/martanne/vis
- the text editor of the Oberon OS used it [1]
One of its most interesting advantages is that it trivially supports fast, unlimited, and persistent undo/redo.