Why Rewriting Emacs Is Hard
There have been quite a few attempts to re-implement (part of) Emacs’ functionality in languages other than C, such as JEmacs, remacs, and lem. And we are seeing new efforts from EmacsConf 2024: rune 1, gypsum 2, and the revived Guilemacs 3. (Strictly speaking, Guilemacs is more a fork than a total rewrite, but anyway.)
However, a complete rewrite of (GNU) Emacs has always seemed like an insurmountable task, and not just because writing editors is hard. This article looks into some of these difficulties and the Emacs design decisions (inevitably exposed through the Emacs Lisp API) that lead to them.
The rationale for this piece is that I too have been pipe-dreaming about yet another rewrite. :-/
1. Rolling your own string types
What is the highest code point supported in your favorite programming language? If it’s #x10FFFF, as per the current Unicode standard, then good luck rolling your own string implementation to stay compatible with Emacs, which supports characters up to #x3FFFFF.
This design decision has a bit of history behind it, which led us to the following two nice features of GNU Emacs.
1.1. Lossless file editing
In an era dominated by Unicode, it is becoming rarer to have to work with multiple character sets. But if you do (or did) need to deal with some other encodings, you might have the experience of opening a file and the editor getting the encoding wrong. Now, for some editors, if you accidentally save the wrongly-decoded file back, poof, your file is doomed.
The main reason for such disasters lies in how string types handle invalid byte sequences. For instance, in Java, when decoding bytes into a String with the built-in libraries, you can ask the decoder to either:
- Replace invalid bytes with replacement characters like #xFFFD, or
- Just throw an error.
That is to say, you either give up decoding, or lossily convert them to Unicode. The same applies to most other languages/libraries whose string types are aware of Unicode: the string type has no place for invalid Unicode characters, rightfully so, because it was designed (or re-designed) with only Unicode in mind.
However, Emacs Lisp, as a language dedicated to a text editor, decides to reserve space for even invalid bytes. Try to construct a string containing a single #x3FFFFF code point in Emacs, and you will see that it is treated as a single raw byte \377 (or #xFF) - the code point range #x3FFF80 - #x3FFFFF is for representing raw bytes:
(string #x3FFF80 #x3FFFFF)
"\200\377"
This is how Emacs ensures lossless file editing. Regardless of the encoding, it transforms invalid bytes into “raw bytes” that both the ELisp runtime and encoders recognize. When saving, these raw bytes are preserved exactly as they were.
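To get a feel for what an Emacs-compatible string type has to do, here is a minimal Python sketch of the scheme (the function names are mine, not Emacs API; code points are kept in a plain list of ints, since Python’s own str type tops out at #x10FFFF and cannot hold them):

```python
RAW_BYTE_BASE = 0x3FFF80  # Emacs maps raw bytes #x80..#xFF to #x3FFF80..#x3FFFFF

def decode_lossless(data: bytes) -> list[int]:
    """Decode UTF-8, keeping undecodable bytes as raw-byte code points.

    Returns a list of ints rather than a str, because Python's str
    cannot represent code points above 0x10FFFF."""
    chars, i = [], 0
    while i < len(data):
        for n in (1, 2, 3, 4):  # try to decode a valid sequence starting at i
            try:
                chars.append(ord(data[i:i + n].decode("utf-8")))
                i += n
                break
            except UnicodeDecodeError:
                continue
        else:
            # Invalid byte: keep it losslessly in the raw-byte range.
            chars.append(RAW_BYTE_BASE + (data[i] - 0x80))
            i += 1
    return chars

def encode_lossless(chars: list[int]) -> bytes:
    """Encode back to UTF-8, restoring raw bytes exactly as they were."""
    out = bytearray()
    for c in chars:
        if RAW_BYTE_BASE <= c <= 0x3FFFFF:
            out.append(0x80 + (c - RAW_BYTE_BASE))
        else:
            out += chr(c).encode("utf-8")
    return bytes(out)
```

With this, a file containing stray invalid bytes survives a decode/encode round trip untouched, which is exactly the property Emacs guarantees.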
Actually, many C-based editors may have a similar property: unedited bytes are preserved as-is because a string is just a sequence of bytes. But I am afraid their scripting languages, if any, may not be prepared for this.
1.2. More universal than Unicode
Another reason for the 0 - #x3FFFFF character range is that Emacs was born before Unicode and, quite luckily, has been one of the few applications that do not simply assume Unicode is actually universal.
Before Unicode, there were a lot of encodings for various languages in various territories. While many of them have since been incorporated into Unicode and their code points can be mapped back and forth, some of them are yet to be fully mapped. An example is the Tibetan ISO-2022 encoding; you may see for yourself in tibetan.el (where Emacs defines its lossless Tibetan coding system) how many of the characters are only representable by replacement characters. 4
Emacs’ solution to this is similar to that for invalid bytes: Emacs reserves code point space (#x110000 - #x3FFF7F) for characters that are not yet unified (meaning, not yet mapped to Unicode), and any ELisp program may simply treat them as normal characters, with all text operations still applicable.
Personally, I still wonder how these un-unified characters are supposed to be displayed. On my Emacs setup they are just tofu blocks with hexadecimal codes on top, and I don’t have a clue whether there are any fonts for them. It seems the look-up tables in many font formats nowadays also assume Unicode code points (or at least searching in English tells me so). So maybe one will need special font formats/terminals for that?
These characters are encoded in Emacs’ extended UTF-8-like encoding. Also, it’s possible that tibetan.el is out of date and Unicode coverage may be better these days.
2. Bringing your own string libraries
If you are to roll your own string types, then you will definitely need to re-implement all those string libraries. Even if you decide to ignore all that and just use the language’s built-in Unicode strings for everything, you will still need to brace yourself for most of the string primitives. No, you cannot simply wrap your favorite regexp/string libraries, and here is why.
2.1. Case tables
Most languages provide ways to convert a string to uppercase/lowercase, and most of them use look-up tables under the hood because of the vast number of conversions and the irregularities of Unicode case conversion. Emacs does this too, except that it fully exposes those look-up tables as mutable to ELisp code (although some of them are only accessible through unicode-property-table-internal). The result is that one can use upcase/downcase for arbitrary string transformations, in addition to Unicode case conversions. And of course, some ELisp code has been doing so:
(defconst erc--casemapping-rfc1459-strict
  (let ((tbl (copy-sequence ascii-case-table))
        (cup (copy-sequence (char-table-extra-slot ascii-case-table 0))))
    (set-char-table-extra-slot tbl 0 cup)
    (set-char-table-extra-slot tbl 1 nil)
    (set-char-table-extra-slot tbl 2 nil)
    (pcase-dolist (`(,uc . ,lc) '((?\[ . ?\{) (?\] . ?\}) (?\\ . ?\|)))
      (aset tbl uc lc)
      (aset tbl lc lc)
      (aset cup uc uc))
    tbl))
The code snippet is from erc.el, one of the built-in IRC clients in Emacs. What it does here is create a case table for the Scandinavian case conversions between []\ and {}|, as specified in RFC 1459:
Because of IRC’s scandanavian origin, the characters {}| are considered to be the lower case equivalents of the characters []\, respectively. This is a critical issue when determining the equivalence of two nicknames.
Emacs also provides a with-case-table macro for this, mostly used with ascii-case-table to prevent locale-aware case conversions from messing with some ASCII protocols.
2.2. Regexp
How can an editor do without regular expressions? Unfortunately, the regexp dialect in Emacs is too specialized to be compatible with any other regexp implementation, backtracking or not. The following is a regexp example taken from cc-fonts.el (font lock support for CC Mode, a mode for C/C++):
"\\(\\=\\|\\(\\=\\|[^\\]\\)[\n\r]\\)" ; noncontinued-line-end
No, it is not the extraneous backslashes that make it incompatible. For people who prefer “normal” regexp syntax, here it is: (\=|(\=|[^\\])[\n\r]), with \= basically meaning “where the user cursor is”. The regexp just means “let’s treat <user_cursor>, or <non-backslash> and then <end_of_line>, or <user_cursor> and then <end_of_line>, as a non-continued line end”. Simple, right? Only if the common regexp libraries supported asserting <user_cursor>.
And it’s more than that. Of course, Emacs regexps consult case tables for [:lower:], [:upper:], and case-insensitive matching via case-fold-search. And unsurprisingly, the same applies to the word assertion \w, the word boundary assertion \b, and more, which involve syntax tables and possibly char category tables. All these also make a slowish backtracking regexp engine mandatory (although one may offload simple regexps to a non-backtracking one).
Emacs allows ELisp to attach syntax-table properties to specific text in buffers or strings, thus overriding what the syntax table says about a particular character. It seems mainly used with subroutines for parsing, like parse-partial-sexp, but thank goodness it does not affect regexp matching. (It is not that uncommon, by the way.)
2.3. Any encoding you want
Did I forget to mention that Emacs employs another language (CCL, the Code Conversion Language) for encoding conversion, and that you can implement literally any encoding converter in ELisp with CCL? It’s not that hard. All you need to do is include yet another bytecode interpreter in your implementation.
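To illustrate the shape of the problem (not CCL itself - its real instruction set is much richer), here is a toy conversion-bytecode interpreter in Python, with an invented instruction set that runs a small program once per input byte:

```python
# A toy "conversion bytecode" interpreter, far simpler than Emacs' CCL.
# Programs are lists of (op, arg) pairs run once per input byte,
# with the current byte held in a single register.
def run_converter(program, data: bytes) -> bytes:
    out = bytearray()
    for byte in data:
        reg = byte
        for op, arg in program:
            if op == "add":
                reg = (reg + arg) & 0xFF
            elif op == "xor":
                reg ^= arg
            elif op == "write":
                out.append(reg)
    return bytes(out)

# A trivial "converter": flip the ASCII case bit of every byte.
swap_case_ascii = [("xor", 0x20), ("write", None)]
```

The point is that an Emacs rewrite does not just need decoders for a fixed set of encodings; it needs to run user-supplied conversion programs, i.e., yet another interpreter.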
An alternative is to pretend that everything is Unicode.
3. Buffers are more than buffers
Every editor must have something to represent editable text, whether it is gap buffers, or ropes, or piece tables/trees, or plain strings. However, Emacs buffers are much more than text, and multiple concepts are tightly coupled to buffers:
- Text properties: Attach fonts, colors, or any objects to specific spans of text. One may also use them to highlight or hide text.
- Overlays: Similar to text properties, but instead of replacing existing properties, they “overlay” them with newer ones so that one doesn’t need to worry about having to restore them later.
- Markers: Want to go back to your previous cursor position? C-u C-SPC will pop and take you to the last marker, which likely marks where you were in the buffer.
- Indirect buffers: Indirect buffers share their text and text properties with their base buffers, but not markers, overlays, or narrowing (i.e., a range specified to make the buffer operate on only a sub-string of its text).
All of these stay in sync with the text, wherever and however you insert or delete text. And they need to be considered when you design buffers for your Emacs, especially if you are hoping for concurrent redisplay in the first place.
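The core obligation is easy to state but pervasive: every edit must shift every piece of positional metadata. A minimal Python sketch with markers only (0-based positions; real Emacs markers are 1-based objects, and text properties, overlays, and line caches all need the same treatment):

```python
class Buffer:
    """Minimal text-plus-markers sketch: every edit shifts all markers."""
    def __init__(self, text=""):
        self.text = text
        self.markers = []  # integer positions, 0-based

    def make_marker(self, pos: int) -> int:
        self.markers.append(pos)
        return len(self.markers) - 1  # handle to the marker

    def insert(self, pos: int, s: str):
        self.text = self.text[:pos] + s + self.text[pos:]
        # Markers at or after the insertion point move right.
        self.markers = [m + len(s) if m >= pos else m for m in self.markers]

    def delete(self, start: int, end: int):
        self.text = self.text[:start] + self.text[end:]
        n = end - start
        # Markers after the deleted span shift left; markers inside collapse.
        self.markers = [m - n if m >= end else (start if m > start else m)
                        for m in self.markers]
```

With thousands of markers per buffer (a common situation in Emacs), this linear adjustment on every keystroke is exactly why implementations end up storing positions in trees.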
Also, at least in Emacs, users index strings and buffers mostly by code points. So if you are storing strings as UTF-8-like bytes, you will need to come up with a way to map between byte offsets and char offsets. Or, if you are using some auto-compacting UTF-32 encoding, you will need to map char offsets back to byte offsets, as required by the Emacs position-bytes/string-bytes subroutines.
3.1. Trees are everywhere
For an editor, it is critical to offer decent text editing performance. Let’s first take a look at the buffer implementations in some editors here:
- GNU Emacs: Gap buffers, with intervals stored in a tree, with a line-number cache, and with markers that are just integer wrappers needing constant adjustment.
  - GNU Emacs uses a list of markers that track both char offsets and byte offsets to convert between the two systems of offsets, which it admits can sometimes be slow.
- Rune (“experimental Emacs core written in Rust”): Also gap buffers, except that it stores char offsets in a tree.
- VS Code: “Piece trees”, or what-actually-look-like-ropes. It seems they are just ropes with the string fragments turned into sub-string references to a huge append-only string buffer.
- Neovim: Line-based ropes, with marks stored in a separate tree. (“Mom, can we get ropes?” “We have ropes at home.”)
- IntelliJ: Rope-like strings, with range markers stored in RangeMarkerTree and line-number mappings stored in LineSet.
- Fleet: Ropes, with “updatable interval trees” for widgets, highlighting, and more. (As it is not (yet) open source, I cannot look into what they mean by “updatable interval trees”, but it sounds a lot like Emacs text properties.)
- Zed: Ropes, implemented as a SumTree.
I won’t go into what all these implementations are about, since there have been quite a few articles about them already (e.g., on gap buffers, the piece tree, and ropes). Different as they might sound, when you think of ropes as strings with attached statistics, the distinction actually becomes vague (in a single-threaded context):
| | Ropes | Gap Buffers + Tree | Piece Tree |
|---|---|---|---|
| Metadata keeping | In-tree | In-tree | In-tree |
| New strings | Allocated | In gaps | In string buffer |
| String deletion | De-allocation | Gap update | No-op |
Notice the “in-tree metadata keeping” part? Buffers are not just the text: they are text with all the attached metadata, be it text properties or code point offsets or line numbers. And trees are usually the way to go if you want these to stay in sync with the text. (And I really like how Zed treats all of these as a SumTree with Rust’s generics and traits.)
3.2. Strings into buffers into strings
Text properties are not unique to Emacs buffers, and you probably have guessed it: ELisp strings also carry properties, and it is only natural that one can pass them between strings and buffers:
(with-temp-buffer
;; `insert' also inserts string properties into buffers.
(insert #("hello world" 0 11 (str-prop 1)))
(put-text-property (point-min) (point) 'buf-prop 2)
;; Buffer text properties are extracted into substrings.
(buffer-substring 1 6))
#("hello" 0 5 (buf-prop 2 str-prop 1))
The weird #("hello world" 0 11 (str-prop 1)) thing is the read syntax for strings with text properties.
Inserting strings into buffers has some corner cases, like the conversion between 0/1-based offsets, interval coalescing, and stickiness. But since these cases are quite straightforward, let’s look at some more interesting ones:
(with-temp-buffer
(insert #("🤗" 0 1 (a 1)))
(set-buffer-multibyte nil)
(buffer-string))
#("\360\237\244\227" 0 4 (a 1))
Basically, a property spanning 1 - 2 is automatically expanded to 1 - 5 when converting a multi-byte buffer into a single-byte buffer (probably because intervals track byte positions under the hood):
1 2 (chars)
|----> Property: (a 1) <----|
+------+------+------+------+
| 🤗 (four bytes in UTF-8) |
+------+------+------+------+
| (set-buffer-multibyte nil)
\|/
+------+------+------+------+
| \360 | \237 | \244 | \227 |
+------+------+------+------+
|----> Property: (a 1) <----|
1 2 3 4 5 (chars)
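The interval expansion itself is mechanical once you can map char boundaries to byte boundaries. A small Python sketch of the conversion shown in the diagram (my own helper, 0-based offsets for brevity where Emacs uses 1-based positions):

```python
def unibyte_intervals(text: str, intervals):
    """Re-express property intervals over chars as intervals over bytes.

    `intervals` is a list of (start_char, end_char, props); returns the
    raw UTF-8 bytes and the intervals rewritten to byte offsets."""
    # Byte offset of every char boundary, computed in one pass.
    boundaries = [0]
    for c in text:
        boundaries.append(boundaries[-1] + len(c.encode("utf-8")))
    raw = text.encode("utf-8")
    return raw, [(boundaries[s], boundaries[e], p) for s, e, p in intervals]
```

For the 🤗 example above, the single-char interval becomes a four-byte one, matching what set-buffer-multibyte does to the buffer’s text properties.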
This also makes one wonder: what happens if we convert a multi-byte string into a single-byte string? Well, normally you won’t be able to do that while preserving string properties, but we can work around that with clear-string, since Emacs strings are mutable:
(let ((s #("🤗" 0 1 (a 1))))
(clear-string s)
(prin1 s))
Fatal error 11: Segmentation fault
Backtrace:
emacs(emacs_backtrace+0x51) [0x57ed0155c5d1]
emacs(terminate_due_to_signal+0xa1) [0x57ed014408f4]
emacs(+0x80260) [0x57ed01441260]
emacs(+0x80267) [0x57ed01441267]
emacs(+0x19969e) [0x57ed0155a69e]
/usr/lib/libc.so.6(+0x3d1d0) [0x755cd06dc1d0]
emacs(copy_intervals+0x188) [0x57ed0164fa58]
emacs(Fcopy_sequence+0x84) [0x57ed015dafc4]
emacs(+0x23a86d) [0x57ed015fb86d]
emacs(Fprin1+0x6b) [0x57ed015fcfdb]
emacs(eval_sub+0x933) [0x57ed015d2c43]
emacs(Flet+0x250) [0x57ed015d6080]
emacs(eval_sub+0x7fa) [0x57ed015d2b0a]
...
(as of Emacs 30.0.93)
That’s … unexpected. Ahem. The moral is that you definitely need to be careful with corner cases, especially since Emacs texts (both buffers and strings) are separated into single-byte (raw-byte) variants and multi-byte ones. 🤗
3.3. More on gap buffers
I sometimes see an article (by Troy Hinckley, the creator of Rune) quoted in discussions of gap buffers and ropes: Text showdown: Gap Buffers vs Ropes. But I don’t think some of its benchmarks are actually fair: ropey and crop track line numbers, while the gap buffer implementation does not. (Or maybe I am missing something here: although the post says its “metrics include things like char and line position”, it actually does not (yet).) And scanning for newlines, as well as special-casing CRLF (e.g., what about inserting CR and only afterwards LF?), certainly takes time… or maybe not. But anyway, to quote from one of my favorite benchmarking libraries (JMH):
REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial experiments, perform baseline and negative tests that provide experimental control, make sure the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts. Do not assume the numbers tell you what you want them to tell.
Also, as discussed above, I really don’t think the performance of ropes and gap buffers differs that much when:
- There is also a bunch of other metadata to maintain.
- Most of the time (statistics needed), you have to use a backtracking regexp engine and do char table lookup for many regexp assertions.
- Emacs spends most of its time in all kinds of hooks (citation needed).
However, they do differ in other aspects that may determine how other components of Emacs can be designed. Here is a comment from Eli on Reddit:
… a few aspects of buffer text storage … are very important for Emacs: display and file I/O. The former is especially important, because a careful analysis of the cases where redisplay in Emacs is slow (like very long lines etc.) inevitably leads to the conclusion that one basic problem is the completely unstructured, unidimensional representation of buffer text. Therefore, any study of alternative data structures for buffer text that doesn’t consider its effect on display, and doesn’t examine alternatives that could solve at least some of our current redisplay problems, is from my POV woefully incomplete.
So, because buffers are involved in Emacs redisplay and tons of other Emacs internals, we need to first talk about redisplay to understand some of the challenges with buffers. But… well, that will be for a future post, and this post will just be “woefully incomplete” because of the sheer amount of code to read.
Fun fact: the xdisp.c file in Emacs is around 1.2 MB, containing 38k lines of code mostly dedicated to “display generation”. I really doubt whether I can come to understand what it does and finish the second article, but we will see.
4. Summarizing
In this post we covered some of the text processing features of Emacs and the challenges they pose for reimplementations. Personally, I have yet to see another language design that embeds raw bytes in its built-in string type, and it is indeed a good-to-have feature for me. So kudos again to the Emacs maintainers and contributors! In a future post or two (which I hope I can eventually finish), we will discuss redisplay in Emacs and look into why Emacs can be hard to parallelize.