
Why Rewriting Emacs Is Hard

There have been quite a few attempts to re-implement (part of) Emacs’ functionality in languages other than C, like JEmacs, remacs, and lem. And we are seeing new efforts in EmacsConf 2024: rune 1, gypsum 2, and the revived Guilemacs 3. (Strictly speaking, Guilemacs is more a fork than a total rewrite, but anyway.)

However, a complete rewrite of (GNU) Emacs has always seemed like an insurmountable task, and not just because writing editors is hard. This article looks into some of these difficulties and the Emacs design decisions (which, of course, have been exposed through the Emacs Lisp API) that lead to them.

The rationale for this piece is that I too have been pipe-dreaming about yet another rewrite. :-/

1. Rolling your own string types

What is the highest code point supported in your favorite programming language? If it’s #x10FFFF, as per the current Unicode standard, then good luck rolling your own string implementation to stay compatible with Emacs, because Emacs supports characters up to #x3FFFFF.
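
You can check the extended range yourself in a scratch buffer; a small sanity check, assuming a recent GNU Emacs:

```elisp
;; `max-char' reports the largest valid ELisp character code,
;; which is well beyond the Unicode maximum of #x10FFFF.
(max-char)              ; ⇒ 4194303, i.e. #x3FFFFF
(characterp #x3FFFFF)   ; ⇒ t
(characterp #x400000)   ; ⇒ nil, one past the maximum
```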

This design decision has a bit of history behind it, which led us to the following two nice features of GNU Emacs.

1.1. Lossless file editing

In an era dominated by Unicode, it is becoming rarer to have to work with multiple character sets. But if you do (or did) need to deal with some other encodings, you might have the experience of opening a file and the editor getting the encoding wrong. Now, for some editors, if you accidentally save the wrongly-decoded file back, poof, your file is doomed.

The main reason for such disasters lies in how string types handle invalid byte sequences. For instance, in Java, when decoding bytes into a String with built-in libraries, you can ask the decoder to either:

  1. Replace invalid bytes with replacement characters like #xFFFD,
  2. Or just throw an error.

That is to say, you either give up decoding, or lossily convert them to Unicode. The same applies to most other languages/libraries whose string types are aware of Unicode: the string type has no place for invalid Unicode characters, rightfully so, because it was designed (or re-designed) with only Unicode in mind.

However, Emacs Lisp, as a language dedicated to a text editor, decided to reserve space even for invalid bytes. Try constructing a string containing the #x3FFFFF code point in Emacs, and you will see that it is treated as a single raw byte \377 (or #xFF); the code point range #x3FFF80 - #x3FFFFF represents the raw bytes #x80 - #xFF:

(string #x3FFF80 #x3FFFFF)
"\200\377"

This is how Emacs ensures lossless file editing. Regardless of the encoding, it transforms invalid bytes into “raw bytes” that both the ELisp runtime and encoders recognize. When saving, these raw bytes are preserved exactly as they were.
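
A quick way to see this round trip in action; a sketch, assuming a recent GNU Emacs:

```elisp
;; The byte #xFF is invalid UTF-8. Decoding turns it into the raw-byte
;; character #x3FFFFF, and encoding writes the original byte back out.
(let* ((bytes "\377abc")                             ; unibyte input
       (decoded (decode-coding-string bytes 'utf-8)))
  (list (aref decoded 0)                             ; ⇒ 4194303 (#x3FFFFF)
        (equal (encode-coding-string decoded 'utf-8) bytes)))  ; ⇒ t
```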

Actually, many C-based editors may have a similar property: unedited bytes are preserved as-is because a string is just a sequence of bytes. But I am afraid their scripting languages, if any, may not be prepared for this.

1.2. More universal than Unicode

Another reason for the 0 - #x3FFFFF character range is that Emacs was born before Unicode and, quite luckily, has been one of the few applications that does not simply assume Unicode is actually universal.

Before Unicode, there were a lot of encodings for various languages in various territories. While many of them have since been incorporated into Unicode and their code points can be mapped back and forth, some of them are yet to be fully mapped. One example is the Tibetan ISO-2022 encoding: you may see for yourself in tibetan.el (where Emacs defines its lossless Tibetan coding system) how many of the characters are only representable in Unicode by replacement characters. 4

Emacs’ solution to this is similar to that for invalid bytes: it reserves code point space (#x110000 - #x3FFF7F) for characters that are not yet unified (meaning, not yet mapped to Unicode), and any ELisp program may simply treat them as normal characters, with all text operations still applicable.
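
These reserved code points behave like any other character as far as ELisp is concerned; a small sanity check, assuming a recent GNU Emacs:

```elisp
;; #x110000 is one past the Unicode maximum, yet it is still an
;; ordinary ELisp character that string operations accept.
(characterp #x110000)          ; ⇒ t
(length (string #x110000 ?a))  ; ⇒ 2: one un-unified char plus ?a
```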

Personally I still wonder how these un-unified characters are supposed to be displayed. On my Emacs setup they are just tofu blocks with hexadecimal codes on top, and I don’t have a clue whether there are any fonts for them. It seems the look-up tables in many font formats nowadays also assume Unicode code points (or at least that is what searching in English tells me). So maybe one will need special font formats/terminals for that?

4

These characters are encoded in Emacs’ extended UTF-like encoding. Also, it’s possible that tibetan.el is out of date and Unicode coverage may be better these days.

2. Bringing your own string libraries

If you are to roll your own string types, then you definitely will need to re-implement all those string libraries. Even if you decide to ignore all that and just use the language’s built-in Unicode strings for everything, you will still need to brace yourself for most of the string primitives. No, you cannot simply wrap around your favorite regexp/string libraries, and here is why.

2.1. Case tables

Most languages provide ways to convert a string to uppercase or lowercase, and most of them use look-up tables under the hood because of the vast number of conversions and irregularities in Unicode case mapping. Emacs does this too, except that it exposes those look-up tables to ELisp code as fully mutable objects (although some of them are only accessible through unicode-property-table-internal). The result is that one can use upcase/downcase for arbitrary string transformations, in addition to Unicode case conversions. And of course, some ELisp code has been doing so:

(defconst erc--casemapping-rfc1459-strict
  (let ((tbl (copy-sequence ascii-case-table))
        (cup (copy-sequence (char-table-extra-slot ascii-case-table 0))))
    (set-char-table-extra-slot tbl 0 cup)
    (set-char-table-extra-slot tbl 1 nil)
    (set-char-table-extra-slot tbl 2 nil)
    (pcase-dolist (`(,uc . ,lc) '((?\[ . ?\{) (?\] . ?\}) (?\\ . ?\|)))
      (aset tbl uc lc)
      (aset tbl lc lc)
      (aset cup uc uc))
    tbl))

The code snippet is from erc.el, one of the built-in IRC clients in Emacs. What it is doing here is creating case tables for Scandinavian case conversions between []\ and {}|, as specified in RFC 1459:

Because of IRC’s scandanavian origin, the characters {}| are considered to be the lower case equivalents of the characters []\, respectively. This is a critical issue when determining the equivalence of two nicknames.

Emacs also provides a with-case-table macro for this, mostly used with ascii-case-table to prevent locale-aware case conversions from messing with some ASCII protocols.
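
Here is a minimal sketch of the same idea (not ERC’s actual table — just enough to show that downcase consults whatever case table is current):

```elisp
;; Copy a case table, clear the derived slots so Emacs recomputes them,
;; and declare ?\{ to be the lower-case form of ?\[ per RFC 1459.
(let ((tbl (copy-sequence ascii-case-table)))
  (set-char-table-extra-slot tbl 0 nil)   ; up-case table, recomputed
  (set-char-table-extra-slot tbl 1 nil)   ; canonicalize table
  (set-char-table-extra-slot tbl 2 nil)   ; equivalence table
  (aset tbl ?\[ ?\{)                      ; down-case ?\[ to ?\{
  (with-case-table tbl
    (downcase "NICK[42]")))               ; ⇒ "nick{42]"
```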

2.2. Regexp

How can an editor do without regular expressions? Unfortunately, the regexp in Emacs is too specialized to be compatible with any other regexp implementations, backtracking or not. The following is a regexp example I took from cc-fonts.el (font lock support for CC Mode (a mode for C/C++)):

"\\(\\=\\|\\(\\=\\|[^\\]\\)[\n\r]\\)" ; noncontinued-line-end

No, it is not the extraneous backslashes that make it incompatible. For people who prefer “normal” regexp syntax, here it is: (\=|(\=|[^\\])[\n\r]), with \= basically meaning “where the user cursor is”. The regexp just means “let’s treat <user_cursor>, or an <end_of_line> preceded by <user_cursor> or by a <non-backslash> character, as a non-continued line end”. Simple, right? Only if the common regexp libraries support asserting <user_cursor>.
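
To make the \= assertion concrete, here is a small sketch; the search can succeed only if the match begins exactly at point:

```elisp
;; `\\=' is a zero-width assertion that succeeds only at point,
;; so this forward search can match only right where point is.
(with-temp-buffer
  (insert "foo bar")
  (goto-char 5)                        ; point sits right before "bar"
  (re-search-forward "\\=bar" nil t))  ; ⇒ 8: matched starting at point
```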

And it’s more than that. Of course Emacs regexps will use case tables for [:lower:], [:upper:] or case-insensitive matching via case-fold-search. And unsurprisingly, the same applies to the word assertion \w, word boundary assertion \b and more, which involve syntax tables and maybe char category tables. All these also make a slowish backtracking regexp engine mandatory (although one may offload simple regexps to a non-backtracking one).
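
A small demonstration of \w consulting the syntax table (a sketch; the buffer gets its own syntax table so the shared standard table is left alone):

```elisp
;; By default `-' is not a word constituent, so `\\w+' stops before it;
;; after `modify-syntax-entry' the same regexp matches across it.
(with-temp-buffer
  (set-syntax-table (make-syntax-table))  ; private table for this buffer
  (insert "foo-bar")
  (goto-char (point-min))
  (looking-at "\\w+")
  (let ((before (match-string-no-properties 0)))
    (modify-syntax-entry ?- "w")          ; make `-' a word constituent
    (looking-at "\\w+")
    (list before (match-string-no-properties 0))))  ; ⇒ ("foo" "foo-bar")
```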

Emacs allows ELisp to attach syntax-table properties to specific texts in buffers or strings, thus overriding “what the syntax table says about this particular character”. It seems to be mainly used by subroutines for parsing, like parse-partial-sexp, but thank goodness it does not affect regexp matching. (It is not that uncommon, by the way.)

2.3. Any encoding you want

Did I forget to mention that Emacs employs another language (CCL, the Code Conversion Language) for encoding conversion and you can implement literally any encoding converters from ELisp with CCL? It’s not that hard. All you need to do is to include yet another bytecode interpreter in your implementation.

An alternative is to pretend that everything is Unicode.

3. Buffers are more than buffers

Every editor must have something to represent editable texts, whether it is gap buffers, or ropes, or piece tables/trees or plain strings. However, Emacs buffers are much more than texts and multiple concepts couple tightly to buffers:

  • Text properties: Attach fonts, colors or any objects to specific texts. One may also use it to highlight texts or hide texts.
  • Overlays: Similar to text properties, but instead of replacing existing properties, it “overlays” them with newer ones so that one doesn’t need to worry about having to restore them later.
  • Markers: Want to go back to your previous cursor position? C-u C-SPC will pop and take you to the last marker, which likely marks where you were in the buffer.

  • Indirect buffers: Indirect buffers share their texts and text properties with base buffers, but not markers, overlays, nor narrowing (i.e., a range specified to make the buffer operate on only a sub-string of its text).

All of these stay in sync with the text, however or wherever you insert or delete texts. And they need to be considered when you design buffers for your Emacs, especially if you are hoping for concurrent redisplay in the first place.
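
For example, markers follow the text they mark through edits; a quick sketch:

```elisp
;; Insert text before a marker: the marker's position advances so it
;; keeps pointing at the same text.
(with-temp-buffer
  (insert "hello world")
  (let ((m (copy-marker 7)))    ; marker right before "world"
    (goto-char (point-min))
    (insert ">>> ")             ; 4 chars inserted before the marker
    (marker-position m)))       ; ⇒ 11, still right before "world"
```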

Also, at least in Emacs, users index strings and buffers mostly by code points. So if you are storing strings as some UTF-8-like bytes, you will need to come up with a way to map between byte offsets and char offsets. Or, if you are using some auto-compacting UTF-32 encoding, you will need to map char offsets back to byte offsets, as is required by the position-bytes/string-bytes subroutines.
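
The divergence between the two indexing schemes is easy to observe in a multibyte buffer:

```elisp
;; `position-bytes' maps a char position to its byte position.
(with-temp-buffer
  (insert "a🤗b")
  (list (position-bytes 2)     ; ⇒ 2: "🤗" starts at byte 2
        (position-bytes 3)))   ; ⇒ 6: "🤗" takes 4 bytes in UTF-8
```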

3.1. Trees are everywhere

For an editor, it is critical to offer decent text editing performance. Let’s first take a look at the buffer implementations used by some editors: gap buffers in Emacs, a piece tree in VS Code, and ropes in editors like Zed.

I won’t go into what all these implementations are about since there have been quite some articles about them already (e.g., gap buffers, piece tree, and ropes). Different as they might sound, when you think of ropes as strings with attached statistics, the distinction actually turns vague (under a single-threaded context):

|                  | Ropes         | Gap Buffers + Tree | Piece Tree       |
|------------------|---------------|--------------------|------------------|
| Metadata Keeping | In-tree       | In-tree            | In-tree          |
| New Strings      | Allocated     | In gaps            | In string buffer |
| String Deletion  | De-allocation | Gap update         | No-op            |

Notice the “in-tree metadata keeping” part? Buffers are not just the texts: they are texts with all the attached metadata, be it text properties or code point offsets or line numbers. And trees are usually the way to go if you want these to be in sync with the text. (And I really like how Zed treats all these as SumTree with Rust’s generics and traits.)

3.2. Strings into buffers into strings

Text properties are not unique to Emacs buffers, and you probably have guessed it: ELisp strings also carry properties, and it is only natural that one can pass them between strings and buffers:

(with-temp-buffer
  ;; `insert' also inserts string properties into buffers.
  (insert #("hello world" 0 11 (str-prop 1)))
  (put-text-property (point-min) (point) 'buf-prop 2)
  ;; Buffer text properties are extracted into substrings.
  (buffer-substring 1 6))
#("hello" 0 5 (buf-prop 2 str-prop 1))

The weird #("hello world" 0 11 (str-prop 1)) thing is the read syntax for strings with text properties.

Inserting strings into buffers has some corner cases, like conversion between 0/1-based offsets, interval coalescing and stickiness. But since these cases are quite straightforward, let’s look at some more interesting ones:

(with-temp-buffer
  (insert #("🤗" 0 1 (a 1)))
  (set-buffer-multibyte nil)
  (buffer-string))
#("\360\237\244\227" 0 4 (a 1))

Basically, a property spanning 1 - 2 is automatically expanded to 1 - 5 when converting a multi-byte buffer into a single byte buffer (probably because intervals track byte positions under the hood):

1                           2 (chars)
|----> Property: (a 1) <----|
+------+------+------+------+
| 🤗 (four bytes in UTF-8)  |
+------+------+------+------+
              | (set-buffer-multibyte nil)
             \|/
+------+------+------+------+
| \360 | \237 | \244 | \227 |
+------+------+------+------+
|----> Property: (a 1) <----|

1      2      3      4      5 (chars)

This also makes one wonder: what happens if we convert a multi-byte string into a single-byte string? Well, normally you won’t be able to do that while preserving string properties, but we can work around that with clear-string since Emacs strings are mutable:

(let ((s #("🤗" 0 1 (a 1))))
  (clear-string s)
  (prin1 s))
Fatal error 11: Segmentation fault
Backtrace:
emacs(emacs_backtrace+0x51) [0x57ed0155c5d1]
emacs(terminate_due_to_signal+0xa1) [0x57ed014408f4]
emacs(+0x80260) [0x57ed01441260]
emacs(+0x80267) [0x57ed01441267]
emacs(+0x19969e) [0x57ed0155a69e]
/usr/lib/libc.so.6(+0x3d1d0) [0x755cd06dc1d0]
emacs(copy_intervals+0x188) [0x57ed0164fa58]
emacs(Fcopy_sequence+0x84) [0x57ed015dafc4]
emacs(+0x23a86d) [0x57ed015fb86d]
emacs(Fprin1+0x6b) [0x57ed015fcfdb]
emacs(eval_sub+0x933) [0x57ed015d2c43]
emacs(Flet+0x250) [0x57ed015d6080]
emacs(eval_sub+0x7fa) [0x57ed015d2b0a]
... (as of Emacs 30.0.93)

That’s … unexpected. Ahem, and the moral is that you definitely need to be careful working with corner cases, especially when Emacs texts (both buffers and strings) are separated into single-byte (raw-byte) variants and multi-byte ones. 🤗

3.3. More on gap buffers

I sometimes see an article (by Troy Hinckley, the creator of Rune) quoted in discussions on gap buffers and ropes: Text showdown: Gap Buffers vs Ropes. But I don’t think some of its benchmarks are actually fair: ropey and crop track line numbers, while the gap buffer implementation does not. (Or maybe I am missing something here: although the post says its metrics “include things like char and line position”, it actually does not track them (yet).) And scanning for newlines, as well as special-casing CRLF (e.g., what about inserting CR first and only afterwards LF?), certainly takes time… or maybe not. But anyway, to quote from one of my favorite benchmarking libraries (JMH):

REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial experiments, perform baseline and negative tests that provide experimental control, make sure the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts. Do not assume the numbers tell you what you want them to tell.

Also, as is discussed above, I really don’t think the performance of ropes and gap buffers differs that much when:

  • There is also a bunch of other metadata to maintain.
  • Most of the time (statistics needed), you have to use a backtracking regexp engine and do char table lookup for many regexp assertions.
  • Emacs spends most of its time in all kinds of hooks (citation needed).

However, they do differ in other aspects that may determine how other components of Emacs can be designed. Here is a comment from Eli on Reddit:

… a few aspects of buffer text storage … are very important for Emacs: display and file I/O. The former is especially important, because a careful analysis of the cases where redisplay in Emacs is slow (like very long lines etc.) inevitably leads to the conclusion that one basic problem is the completely unstructured, unidimensional representation of buffer text. Therefore, any study of alternative data structures for buffer text that doesn’t consider its effect on display, and doesn’t examine alternatives that could solve at least some of our current redisplay problems, is from my POV woefully incomplete.

So, because buffers are involved in Emacs redisplay and tons of other Emacs internals, we need to first talk about redisplay to understand some of the challenges with buffers. But… well, that will be left for a future post, and this post will just be “woefully incomplete” because of the sheer amount of code to read.

Fun fact: the xdisp.c file in Emacs is around 1.2 MB, containing 38k lines of code mostly dedicated to “display generation”. I really doubt whether I can come to understand what it does and finish the second article, but we will see.

4. Summarizing

In this post we covered some of the text processing features of Emacs and the challenges they pose for implementations. Personally, I have yet to see another language design that embeds raw bytes in its built-in string type, and it is indeed a good-to-have feature for me. So kudos again to Emacs maintainers and contributors! In a future post or two (that I hope I can eventually finish), we will discuss redisplay in Emacs and look into why Emacs can be hard to parallelize.
