
Exploring Speculative JIT Compilation for Emacs Lisp with Java

This Org-mode file was used for an org-present presentation at EmacsConf 2025 – Juicemacs. This blog post is adapted from that presentation, with added transcript and explanations for a bunch of things I didn’t dig into during the presentation.

emacsconf-pre.jpg

emacsconf-logo1-256.png For EmacsConf 2025

Project: https://github.com/gudzpoz/Juicemacs

Contact: See the navigation bar (or join the Zulip chat)

More interested in the JIT part? Skip the first few Juicemacs sections.

1. Juicemacs: Yet Another Emacs Clone (WIP)

Juicemacs is my work-in-progress toy project re-implementing Emacs in Java. At its centre sits an Emacs Lisp Just-In-Time (JIT) compilation runtime powered by Graal Truffle, a JIT interpreter framework based on partial evaluation and Futamura projections.

1.1. Exploration-centric

Currently, Juicemacs aims to explore a few things I’ve been wondering about for a while:

Transparent Concurrency
I hope to use a JavaScript-like concurrency model, but make it transparent by using Java’s virtual threads 1 and sending each input event to a separate virtual thread – mounted on the same physical thread to reduce parallelism issues.
Concurrent GUI
I have yet to fully grasp the whole picture. But, for example, to display the mode-line in Emacs, the redisplay routine needs to evaluate Lisp expressions in mode-line :eval clauses or some display text properties, which can block redisplay and thus the whole Emacs. A straightforward approach we might explore is allowing out-of-date mode-lines: we only update them asynchronously when we’re “free” (with an empty event loop).
Just-In-Time (JIT) Compilation
That will be the topic of this post. :)
1

Java, as of Java 25, does not provide a public scheduler API. However, JDK developers seem open to feedback on their experimental custom scheduler design, which we actually need to implement our JavaScript-like model. So I think, for our explorations, we will just access these JDK internals with reflection while waiting for a stabilized scheduler API.

1.2. Problem: Exploration from scratch?

andromeda.jpg
Figure 1: A picture of the Andromeda Galaxy

However, a main problem with explorations in Emacs clones is that Emacs is a whole universe. And that means, to make these explorations meaningful for Emacs users, we need to cover a lot of Emacs features before we can even begin.

For example, I can already see some issues in my concurrency design: what if the execution of different events gets out of order? This is very likely, especially when the events involve blocking hooks. But to see how this would affect real-world editing (and how we could fix this), we need a fully-functional “Emacs”.

2. Why Rewriting Emacs Is Hard?

But to get a “fully-functional” Emacs clone, well, you need to take care of a lot of details.

Alternatively, maybe you can actually ignore these details if you consider them unimportant to your explorations. But personally, I think in (GNU) Emacs, everything influences everything, and to avoid making wrong decisions that might have long-term implications, I’ll be more conservative with Juicemacs.

The following contents are actually taken from my previous blog post: Why Rewriting Emacs Is Hard.

2.1. Extended UTF-8

For example, one of Emacs’s features is that it supports many encodings. Let’s look at this string 洲﨑神社: it can be encoded in both Unicode and Shift-JIS, a Japanese encoding system.

(let* ((jis-encoded-bytes  "\x1b$B='yu?@<R\x1b(B") ; 洲﨑神社
       (emacs (decode-coding-string jis-encoded-bytes 'iso-2022-jp))
       (utf8 (encode-coding-string emacs 'utf-8-unix))
       (utf8 (decode-coding-string utf8 'utf-8-unix)))
  utf8)
emacs: 洲▯神社 (#x1420A4 > #x10FFFF (max unicode))
other: 洲�神社 (#xFFFD replacement)

But currently, Unicode does not have an official mapping for this “ki” (﨑) character. When we map from Shift-JIS-encoded bytes to Unicode, in most programming languages, you end up with something like �, a replacement character. But Emacs actually extends the Unicode range by threefold and uses the extra range to losslessly support characters like this. (In this case, the “﨑” character is transformed into the #x1420A4 code-point, which is outside the Unicode range 0 - #x10FFFF.)
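Most language runtimes flatly reject code points beyond the Unicode range. A quick check in Java (the host language of Juicemacs) shows why off-the-shelf string types can't represent Emacs' extended characters as code points:

```java
public class ExtendedCodePoint {
    public static void main(String[] args) {
        int emacsKi = 0x1420A4; // Emacs' internal code point for the 﨑 character
        // Java, like most languages, only accepts code points up to U+10FFFF:
        System.out.println(Character.isValidCodePoint(emacsKi)); // false
        System.out.println(emacsKi > 0x10FFFF);                  // true
    }
}
```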

Unfortunately, if I put that #x1420A4 character as is, encoded in Emacs’ utf-8-emacs coding, Nikola, my blog generator, seems to break. Watch the EmacsConf video (or run the code in your Emacs) if you want to see how Emacs handles and displays that.

By the way, the 洲﨑神社 example is from Unicodeにマッピングされていない文字 - 清水川のScrapbox (“Characters that are not yet mapped to Unicode” in English).

So if you want to support this feature, that basically rules out all string libraries with Unicode assumptions.

2.2. Irregular “regular” expressions

Another challenge is supporting Emacs’s regular expressions, which are quite … irregular. :P

For example, it supports asserting about the user’s cursor position, using the \= predicate. It also uses character tables that can be modified from Lisp code to determine case mappings in [[:lower:]] or [[:upper:]].

All these make it really hard, or even impossible, to use any existing regex libraries.

2.3. Others

Also, you need a functional garbage collector (GC). You need threading primitives, because Emacs already has some threading support. You might also want the performance of your clone to match Emacs, even with its native compilation enabled. Not to mention, you also need a GUI for an editor.

When I was writing the “Why Rewriting Emacs Is Hard” post, I actually planned (and I’m still planning!) to write a “Why Emacs Redisplay is Hard” post next. And, I’m now fully stuck on this – GUI is much, much harder than the things I mentioned above.

That certainly depends on what feature sets you want the GUI to have, but I guess it just can't be easy when you're aiming for anything close to what Emacs provides.

For Juicemacs, building on Java and a compiler framework called Truffle helps achieve better performance; and by choosing a language with a good GC, maybe we can focus more on the challenges above.

3. Juicemacs Status

Currently, Juicemacs has implemented three (out of at least four) of the interpreters in Emacs, all JIT-capable:

| AST         | Bytecode    | Regexp      | CCL     | Built-in Functions             |
|-------------|-------------|-------------|---------|--------------------------------|
| 🤗          | 🤗          | Complete 🤔 | 💤      | ~420 out of ~1800              |
| JIT-capable | JIT-capable | Slowish JIT | Not yet | Enough for bootstrap/pdump/ERT |

Other than these, Emacs also has around two thousand built-in functions in C code, which we’ll need to implement eventually. Juicemacs has around four hundred of them implemented right now. It’s not that many, but it’s, surprisingly, enough to bootstrap Emacs and run the portable dumper, or pdump for short.

3.1. Portable Dumper

Quoting from Emacs docs:

Because it takes some time to load the standard Lisp files, … --temacs tells temacs how to record all the standard preloaded Lisp functions and variables, so that when you subsequently run Emacs, it will start much faster.

During pdump, we let Emacs (or Juicemacs) load all the files needed for bootstrapping. Then it dumps the memory to an emacs.pdmp file to be loaded later, giving us fast startup.

$ build/native/nativeCompile/elisp-repl --dump=pdump
load: ../elisp/emacs/lisp/loadup.el
Dump mode: pdump
Using load-path ("../elisp/emacs/lisp")
load: ../elisp/emacs/lisp/emacs-lisp/debug-early.elc
load: ../elisp/emacs/lisp/emacs-lisp/byte-run.elc
load: ../elisp/emacs/lisp/emacs-lisp/backquote.elc
load: ../elisp/emacs/lisp/subr.elc
...
Loading leim/leim-list.el (source)...
Loading leim/leim-list.el (source)...done
Waiting for git...
Waiting for git...
Finding pointers to doc strings...
Snarf-documentation not implemented
Finding pointers to doc strings...done
Pure-hashed: 0 strings, 0 vectors, 0 conses, 0 bytecodes, 0 others
Dumping under the name emacs.pdmp

When we try to load that dump back, currently Juicemacs throws some frame errors because it doesn’t have an editor UI or functional frames yet. Otherwise, it can already run quite a bit of Lisp code.

$ build/native/nativeCompile/elisp-repl --dump-file=./emacs.pdmp
Loading ../elisp/emacs/lisp/subdirs.el (source)...
java.lang.UnsupportedOperationException
party.iroiro.juicemacs.elisp.forms.BuiltInFrame$FFrameParameters.frameParameters(BuiltInFrame.java:684)
...
<elisp> frame-parameters(Unknown)
<elisp> frame-notice-user-settings(frame:171:0)
...

Juicemacs’ pdump directly builds on Fory, a (de)serialization framework. Of course it’s not as fast as Emacs, which mostly dumps and reloads its heap memory as is, but going from tens of seconds to less than one second is still quite significant.
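To illustrate the dump-and-reload idea (not the actual Juicemacs code, which builds on Fory), here is a sketch using plain Java serialization, with illustrative variable names and values:

```java
import java.io.*;
import java.util.ArrayList;
import java.util.HashMap;

public class MiniPdump {
    public static void main(String[] args) throws Exception {
        // "Bootstrap": build up the global state that loading loadup.el would create.
        HashMap<String, Object> globals = new HashMap<>();
        ArrayList<String> features = new ArrayList<>();
        features.add("subr");
        features.add("backquote");
        globals.put("features", features);

        // Dump: serialize the bootstrapped state into a .pdmp-like file.
        File dump = File.createTempFile("mini", ".pdmp");
        dump.deleteOnExit();
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(dump))) {
            out.writeObject(globals);
        }

        // Reload: a fresh session reads the dump back instead of re-running the bootstrap.
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(dump))) {
            @SuppressWarnings("unchecked")
            HashMap<String, Object> reloaded = (HashMap<String, Object>) in.readObject();
            System.out.println(reloaded.get("features"));
        }
    }
}
```

The real thing is more involved (interned symbols, circular structures, lazily-loaded chunks), but the shape is the same: serialize once after bootstrap, deserialize on every startup.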

3.2. ELisp Fibonacci

For example, this code uses the benchmark library to measure the performance of a Fibonacci function.

(defun fib (x)
  (if (<= x 1) x
    (+ (fib (1- x)) (fib (- x 2)))))
(cl-loop for i from 0 to 10 do
         (funcall #'message "(fib 37): %S seconds"
                  (car (benchmark-run 1 (fib 37)))))
Loading benchmark...
Loading cl-lib...
Loading cl-loaddefs...
Loading cl-macs...
Loading gv...
Loading time-date...
Loading subr-x...
Local variables list is not properly terminated
(fib 37): 0.7413439900000001 seconds
(fib 37): 0.453432999 seconds
(fib 37): 0.397605726 seconds
...
fib-running.gif
Figure 2: The Fibonacci benchmark running

As we can see above, the JIT engine is already kicking in, making the execution faster.

3.3. ERT (Emacs Lisp Regression Testing)

In addition, with a few workarounds, Juicemacs can also run some of the ERT test suites that come with Emacs.

(require 'pp)
(defalias 'pp 'princ)
;; ert asserts about newlines produced by pp
(defun cl--assertion-failed (&rest _) t)

(require 'ert)
(load "../test/src/data-tests")
(load "../test/src/floatfns-tests")
(null (ert-run-tests-batch)) ; don't print the huge info object
...
   passed  86/87  logb-extreme-fixnum (0.000496 sec)
Test special-round backtrace:
  backtrace-to-string(#f(compiled-function (&optional arg1) #<bytecode
  #f(compiled-function (arg1 &rest rest) #<bytecode 0xffffffff8e4e542a
  #f(compiled-function () #<bytecode 0xffffffffa66d753b>)(#f(compiled-
  ert-run-or-rerun-test(#f(compiled-function (arg1 arg2 arg3) #<byteco
  ert-run-tests(#f(compiled-function (arg1 arg2 &optional arg3) #<byte
  ert-run-tests-batch(#f(compiled-function (&optional arg1) #<bytecode
  nil()
Test special-round condition:
    (ert-test-failed ((should-error (funcall f n)) :form (funcall ceiling -1.0e+INF) :value -9223372036854775808 :fail-reason "did not signal an error")
   FAILED  87/87  special-round (0.000754 sec)

Ran 87 tests, 61 results as expected, 26 unexpected (2025-10-30T05:29:18.077226393Z, 5.447236 sec)

26 unexpected results:
   FAILED  bignum-expt
   FAILED  bignum-round
   FAILED  bignum-to-float
...

There are a bunch of test failures, which means we’re not fully compatible with Emacs yet and need more work. But the whole testing procedure runs fine with proper stack traces, which are quite useful for debugging Juicemacs.

So, with that rather functional JIT runtime in place, let's get to today's topic: JIT compilation for ELisp.

4. What We Talk About When We Talk About JIT (1/2)

4.1. Emacs nativecomp (native-compile)

You probably know that Emacs has supported native-compilation, or nativecomp for short, for some time now. 2 It mainly uses GCC to compile Lisp code into native code ahead of time. During runtime, Emacs loads those compiled files and gets the performance of native code.

gcc-emacs-aot-jit.svg
Figure 3: How AOT compilation and JIT compilation work in Emacs

For installed packages, however, we might want to compile them when we actually use them instead of ahead of time. For that, you can use the native-comp-jit-compilation flag provided by Emacs.

With this flag enabled, Emacs will send loaded files to external Emacs worker processes during runtime, which will then compile those files asynchronously. When the compilation is done, the current Emacs session will load the compiled code back and improve its performance on the fly.

When you look at this procedure, however, it is ahead-of-time compilation, done at runtime. And this is what current Emacs calls JIT compilation. But if you look at some other JIT engines, you’ll see much more complex architectures.

2

Andrea talked about nativecomp on EmacsConf 2021 if you’re interested: Emacs Lisp native compiler, current status and future developments.

5. What We Talk About When We Talk About JIT (2/2)

Take LuaJIT as an example. In addition to the red lines in the following graph, which lead from an interpreted state to a compiled native state (which is also what Emacs does), LuaJIT also supports going from a compiled state back to its interpreter. And this process is called “deoptimization”.

elisp-jit-arch-vs-luajit.svg
Figure 4: Comparison between two JIT compilation approaches

Despite its name, deoptimization here actually enables a huge category of JIT optimizations – speculative compilation, or “speculation”. 3

Basically, with speculation, the compiler can use runtime statistics to speculate and make bolder assumptions in the compiled code. When the assumptions are invalidated, the runtime deoptimizes the code, updates statistics, and recompiles the code based on new assumptions, making the code more performant.

| Compilers      | Ahead-Of-Time | Emacs JIT | LuaJIT  |
|----------------|---------------|-----------|---------|
| When?          | ~Build time   | Runtime   | Runtime |
| Runtime-info   |               |           | 🟢      |
| Deoptimization |               |           | 🟢      |
| Speculation    |               |           | 🟢      |

Let’s look at an example.

3

Filip Pizlo has written a very good blog post on Speculation in JavaScriptCore, the JavaScript engine behind WebKit. By the way, you might have heard about Fil-C, a memory-safe C/C++ implementation, also by Filip. (Fil-C runs Emacs?!)

6. Speculative Compilation – An Example

(defun plus-one (x)
  (1+ x))

So here is a really simple function that adds one to the input number. But in Emacs (and many other dynamic programming languages), it’s not that simple, because Emacs has three categories of numbers: fixnums (machine-word-sized integers), floating numbers, and big integers. When we compile this, we need to handle all three cases.

We can use the disassemble command in Emacs to analyze native-compiled functions, like this:

00000000000010f0 <F706c75732d6f6e65_plus_one_0>:
    10f0:	push   %rbx
        ;; ... Preamble ...
    110b:	lea    -0x2(%rax),%edx ; pointer tagging
    110e:	and    $0x3,%edx
    1111:	jne    1138
    1113:	movabs $0x1fffffffffffffff,%rcx
    111d:	mov    %rax,%rdx
    1120:	sar    $0x2,%rdx
    1124:	cmp    %rcx,%rdx ; fastpath: fixnum
    1127:	je     1138
    1129:	lea    0x6(,%rdx,4),%rax
    1131:	pop    %rbx
    1132:	ret
    1138:	mov    (%rbx),%rdx ; slowpath: others
    113b:	mov    %rax,%rdi
    113e:	pop    %rbx
    113f:	mov    0x2870(%rdx),%rdx
    1146:	jmp    *%rdx
    1148:	nopl   0x0(%rax,%rax,1)

The next graph visualizes the control flow of this disassembly.

7. Speculation vs. Slow Path

If we analyze the code produced by Emacs, as is shown by the gray graph on the left, we can see that it has two paths: one fast path that does fast fixnum addition, and one slow path that calls out to an external plus-one function to handle floating numbers and big integers.

elisp-vs-speculation-plusone.svg
Figure 5: Comparison between AOT compilation and speculative JIT compilation

Now, if we pass integers into this function, it’s pretty fast because it’s on the fast path. However, if we pass in a floating number, it has to go through the slow path, doing an extra function call, which is slow.

Where speculation helps here is that it enables flexible fast paths. When we pass a floating number into this function, which currently has only fixnums on its fast path, it also has to go through the slow path (green graph in the middle). But the difference is that a speculative runtime can deoptimize and recompile the code to adapt: when it recompiles, it can look at the statistics and decide to add floating numbers to the fast path, making floating number operations fast as well (green graph on the right). This kind of adaptivity is why speculative runtimes can be really fast.

|           | nativecomp (AOT) | Speculation-based JIT                      |
|-----------|------------------|--------------------------------------------|
| Fast path | Fast 🚀          | ~on par                                    |
| Slow path | Slow             | Fast 🚀 (only after slow 🐌 recompilation) |

Since recompilation is not cheap, most JIT runtimes start out interpreted to gather statistics, hoping to produce “suitable” fast paths within the first few compilations. Some also employ “tiered compilation”, with multiple compilers making different trade-offs.

8. How Juicemacs Performs

Let’s look at some benchmarks obtained with the elisp-benchmarks library on ELPA, with some modifications.

bench-all.svg

Results from elisp-benchmarks visualized with ECharts. (Blue areas mean that nativecomp is slower, and green areas mean that Juicemacs is slower.)

At a glance, the two (or four) seem somewhat on par. But let’s take a closer look at some of them.

Unlike typical Java Microbenchmark Harness benchmarks, these benchmarks were not fully warmed up. An editor, unlike Java server-side applications that spend 99% of their time fully warmed up, needs to consider how warming up affects initial editing latency. I was hoping that running elisp-benchmarks as is, without warming up, could provide a more balanced view. (elisp-benchmarks does run each benchmark three times, which seemed enough judging from the per-run results.)

But, microbenchmarks are microbenchmarks after all, and eventually we will need a full editor to test things out.

9. Fibonacci Series

The Fibonacci series
fib(n) = fib(n - 2) + fib(n - 1), fib(1) = 1, fib(0) = 0

The first few benchmarks are the classic Fibonacci benchmarks. We know that the series is formed by adding up the previous two numbers. Looking at the formula above, Fibonacci benchmarks are quite intensive in number additions (f_{n-1} + f_{n-2}), subtractions (n - 2, n - 1), and function calls if you use recursion. And this is exactly why the Fibonacci series is a good (micro-)benchmark: it benchmarks only a few things, which, however, are important for a language runtime; its results can be trivially validated; and it’s simple enough to be implemented across runtimes for comparison.
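Since it's simple enough to implement across runtimes, here are both the iterative and the recursive variants in Java (the host language of Juicemacs), just to make the two workloads concrete:

```java
public class Fib {
    // Iterative Fibonacci: stresses fixnum arithmetic only.
    static long fibIter(int n) {
        long a = 0, b = 1;
        for (int i = 0; i < n; i++) {
            long t = a + b;
            a = b;
            b = t;
        }
        return a;
    }

    // Recursive Fibonacci: stresses function calls plus a bit of arithmetic.
    static long fibRec(int n) {
        return n <= 1 ? n : fibRec(n - 1) + fibRec(n - 2);
    }

    public static void main(String[] args) {
        System.out.println(fibIter(10)); // 55
        System.out.println(fibRec(10));  // 55
    }
}
```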

9.1. Initial Runs

bench-fibn-first.svg
Figure 6: (tc: tail call, rec: recursion)

Looking at the results here… In the first three entries (fibn, fibn-tc, fibn-rec) Emacs nativecomp executes instantaneously. It’s a total defeat for Juicemacs, seemingly. But if you’re into benchmarks, you know something is wrong: we are comparing different things.

So let’s look under the hood, again with the disassemble command.

10. Know What You Are Benchmarking!

(defun elb-fibn-entry ()
  (elb-fibn 1000000 80))
0000000000001340 <F656c622d6669626e2d656e747279_elb_fibn_entry_0>:
    1340:	movabs $0x14cc58fbc0c8796,%rax
    134a:	ret

And these two lines of assembly are what we got. So we can already see what’s going on here:

  • nativecomp analyzes the functions called by the Fibonacci function (elb-fibn), which are all pure. (Done in the comp--ipa-pure pass.)
  • This implies that elb-fibn is also pure.
  • Later, in the comp--fwprop pass, it inline-folds all pure functions with constant arguments (in comp--function-call-maybe-fold), pre-calculating the result (that is, the $0x14cc58fbc0c8796 value above).

During the presentation I assumed GCC was the one doing the inlining here, but I was wrong: it is all done by nativecomp's own passes.

10.1. What do we want to benchmark?

This is actually great! It shows that nativecomp knows about pure functions and can do all kinds of things like removing or constant-folding them. Juicemacs just does not do that.

|                       | nativecomp (AOT)     | Juicemacs (JIT) |
|-----------------------|----------------------|-----------------|
| Pure function removal | 🤗 Perfect (speed=3) | 🥀 Not at all   |

However, we are also concerned about the things we mentioned earlier: the performance of number additions and function calls.

  • Iterative fibonacci: Fixnum arithmetic
  • Recursive fibonacci: Intensive function calls + a bit of arithmetic
|                            | nativecomp (AOT) | Juicemacs (JIT) |
|----------------------------|------------------|-----------------|
| Fixnum arithmetic          |                  |                 |
| Direct function call costs |                  |                 |

So, in order to let the benchmarks show some extra things, we need to modify them a bit by simply making things non-constant. 4

4

It's a trick frequently used in micro-benchmarks.

11. Fibonacci, Again

(defvar elb-fibn-n 80)
(defun elb-fibn-non-const-entry ()
  (setq elb-blackhole (elb-fibn 1000000 elb-fibn-n)))
bench-fibn-all.svg
Figure 7: non-const: no constant-folding

With that, Emacs gets much slower. And again, let’s see what’s happening behind these numbers.

12. Looking Behind The Numbers

12.1. Emacs nativecomp itself

  • Not all functions (+, -, < …) have fast paths?

    1199:	call   *0x2900(%r13)    ; calls the "<" function
    11a0:	test   %rax,%rax
    11a3:	je     12d0
    

Looking at the assembly, it seems Emacs nativecomp does not have fast paths for operations like additions, subtractions, or comparisons, which is exactly what Fibonacci benchmarks are measuring. In the compiled code, Emacs has to call some generic, external functions for them, and this is slow.

In fairness to nativecomp, I also ran the same benchmark in Common Lisp 5, with SBCL:

  • Comparison with SBCL:

    Fibonacci Speed:
    
    - 🥝 Juicemacs > SBCL (untyped) ≈ 🟣 nativecomp

    - SBCL (safety 0) >> SBCL (typed) ≈ 🥝 Juicemacs
    

Nativecomp is already fast compared to untyped SBCL, because it turns out that SBCL also emits call instructions when it has no type info. However, once we declare the types of the arguments, SBCL is able to compile a fast path for fixnums, which makes its performance on par with speculative JIT engines (that is, Juicemacs), because now both of us are on fast paths.

Additionally, if we are bold enough to pass this (safety 0) flag to SBCL, it will remove all the slow paths and type checks, and its performance is close to what you get with C.

Well, probably we don’t want (safety 0) most of the time. But even then, if nativecomp were to get fast paths for more constructs, there’s certainly room for performance improvement.

13. Some More Interesting Ones

Let’s look at some more benchmarks.

bench-advice-inclist.svg
Figure 8: Results of two more benchmarks

For example, for the inclist/inclist-type-hints benchmarks (the green columns above, namely, increment-elements-in-list), Juicemacs is really slow. This comes partly from the cost of Java “boxing” integers – it has to allocate for every incremented integer, while Emacs nativecomp, with tagged values, does not.

On the other hand, for this particular benchmark, Emacs nativecomp actually has fast paths for all of the operations 6. That’s why it can be so fast, and that also proves that, compared to the Fibonacci benchmarks, nativecomp has a lot of potential for improvement if it gets more fast paths.

6

The involved operations are: consp, cdr for list iteration, and car, 1+, setcar for incrementation.

|         | nativecomp                  | Juicemacs                 |
|---------|-----------------------------|---------------------------|
| advice  | Glue not native-compiled    | JIT-capable               |
| inclist | All operations on fast path | Java integer boxing costs |

There is another benchmark here that uses advices—Emacs Lisp supports using advices to override functions by wrapping the original function and an advice function inside a glue function:

;; The "glue" functions defined in nadvice.el, where
;; "car" is the advice and "cdr" is the original.
(defvar advice--how-alist
  (advice--make-how-alist
   (:around (apply car cdr r))
   (:before (apply car r) (apply cdr r))
   (:after (prog1 (apply cdr r) (apply car r)))
   (:override (apply car r))
   (:after-until (or (apply cdr r) (apply car r)))
   (:after-while (and (apply cdr r) (apply car r)))
   (:before-until (or (apply car r) (apply cdr r)))
   (:before-while (and (apply car r) (apply cdr r)))
   (:filter-args (apply cdr (funcall car r)))
   (:filter-return (funcall car (apply cdr r)))))

In this benchmark, we advise the Fibonacci function to cache the first ten entries to speed up computation, as can be seen in the speed-up in the Juicemacs results. However, it seems that nativecomp does not yet compile glue functions (as of GNU Emacs 30), which makes advices slower.

(define-advice elb-fibn-rec-adviced (:before-until (n) elb-fibn-lookup-table-advice)
  (and
   (< n 10)
   (aref [0 1 1 2 3 5 8 13 21 34] n)))

Every (interactive ...) function in Emacs comes with an interactive form, which injects interaction info into function arguments when run interactively:

(interactive-form 'evil-goto-first-line)
(interactive (byte-code "\10\205\7\0\301\10!C\207" [current-prefix-arg prefix-numeric-value] 2))

However, similar to advices, it seems to me that these interactive forms are also not native-compiled. Of course, we can’t know whether or not this affects user experience without more benchmarking.

14. Can GNU Emacs Get Speculative JIT?

With these benchmarks, let’s discuss this big question: should GNU Emacs adopt speculative JIT compilation?

14.1. Is it worth it? Maybe not.

The hidden question is actually: is it worth it? And my personal answer is: maybe not.

14.1.1. Are slow paths frequent?

The first reason is that slow paths, like floating numbers, are actually not that frequent in Emacs.[citation-needed] Optimizing for fast paths like fixnums can already get us very good performance.

14.1.2. What does it take to implement a speculative JIT?

The second or main reason is that speculative JIT is very hard. LuaJIT, for example, took a genius to build 7. Even with the help of GCC, we need to hand-write all those fast paths and slow paths. We need to find a way to deoptimize, which requires mapping machine registers back to the interpreter stack. And speculation also needs runtime info, which costs extra memory.

LLVM has an experimental IR for deoptimization: @llvm.experimental.deoptimize(...). It seems to be used by Azul JVM.

14.2. (Relatively-)low-hanging fruits in nativecomp

Moreover, as shown by some benchmarks above, there are some low-hanging fruits in nativecomp that might get us better performance with relatively lower effort. (For example, fast paths for +, -, =, <, >, ..., in combination with declare type hints?) Compared to this, a JIT engine is a huge, huge undertaking.

And there are some JIT techniques that do not require a speculative runtime. For example, you can do many things already with inline caches.
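As a sketch of that last point (plain Java with hypothetical names, not code from Emacs or Juicemacs), a one-entry “monomorphic” inline cache remembers the last function resolved at a call site and only falls back to the generic lookup on a miss:

```java
import java.util.Map;
import java.util.function.LongUnaryOperator;

public class InlineCacheDemo {
    // A stand-in for the global function table.
    static final Map<String, LongUnaryOperator> GLOBALS = Map.of(
            "1+", x -> x + 1,
            "1-", x -> x - 1);

    // One-entry ("monomorphic") inline cache attached to a single call site.
    static String cachedName;
    static LongUnaryOperator cachedFn;

    static long call(String name, long arg) {
        if (name.equals(cachedName)) {
            return cachedFn.applyAsLong(arg); // fast path: cache hit
        }
        // Slow path: generic lookup, then fill the cache for next time.
        cachedName = name;
        cachedFn = GLOBALS.get(name);
        return cachedFn.applyAsLong(arg);
    }

    public static void main(String[] args) {
        System.out.println(call("1+", 41)); // miss, generic lookup
        System.out.println(call("1+", 42)); // hit, fast path
    }
}
```

In a compiler, the cached callee can additionally be inlined into the caller, which is where most of the win comes from.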

15. How Juicemacs Builds JIT Runtimes

But for Juicemacs, the JIT engine comes a lot cheaper because we are building on an existing compiler framework called Truffle.

I have another blog post for this: Writing a Lisp JIT Interpreter with GraalVM Truffle, which explains some more implementation details.

15.1. The shortcut: the Truffle meta-compiler framework

Truffle is a meta-compiler framework that lets you write an interpreter and automatically get a JIT runtime.

15.1.1. Write an interpreter

For example, here is a typical bytecode interpreter.

for (int insn : bytecode) {
    switch (insn) {
    case ADD1: stack[top] += 1; break;
    // ...other bytecodes
    case RETURN: return stack[top];
    }
}

15.1.2. Add Truffle annotations

After you add the required annotations, Truffle will know the bytecode annotated as @CompilationFinal is constant, and it should unroll the loops annotated as @ExplodeLoop to inline all those bytecodes.

@CompilationFinal(dimension = 1)
int[] bytecode = { ADD1, RETURN };
@ExplodeLoop
for (int insn : bytecode) {
    switch (insn) {
    case ADD1: stack[top] += 1; break;
    case RETURN: return stack[top];
    }
}

15.1.3. Get a JIT compiler: loop unrolled when compiled

Then, when Truffle compiles the code, it knows the loop ends after two iterations: the first does x + 1, and the second does the return. It will then compile all of that into return x + 1, which is exactly what we would expect when compiling this pseudocode.

// Loop unrolled:
stack[top] += 1;
return stack[top];

// Variables inlined:
return (x += 1);

16. How Truffle Builds JIT Runtimes

Building on that, we can also easily implement speculation by using the transferToInterpreterAndInvalidate function provided by Truffle. Truffle will automatically turn that into deoptimization.

truffle-binary-add.svg
Figure 9: Adding speculation with Truffle

Now, for example, when the add function node above is supplied with two floating numbers, it will go through the slow path because it’s not on the fast paths (green in the graph above). This might lead to a compiled slow path (that calls out to external generic functions), or deoptimization.

When the runtime decides to deoptimize, it can then update runtime statistics (in this case, the fastPathSelector variable, marked as @CompilationFinal). Then, when the code is compiled again, Truffle will know from these statistics that we have floating numbers. This floating point addition branch will then be incorporated into the fast path.

  • That is to say, @CompilationFinal combined with transferToInterpreterAndInvalidate() more or less gets you a speculative JIT compiler.
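Stripped of the Truffle machinery, the mechanism can be sketched in plain Java (illustrative names only): the @CompilationFinal profile becomes a plain boolean, and a comment marks where transferToInterpreterAndInvalidate() would throw away the compiled code:

```java
public class SpeculativeAdd {
    // In real Truffle code this would be @CompilationFinal: the compiler treats
    // it as a constant and compiles only the currently-enabled fast paths.
    static boolean floatSeen = false;

    static Object add1(Object x) {
        if (x instanceof Long l) {
            return l + 1;              // fixnum fast path
        }
        if (floatSeen && x instanceof Double d) {
            return d + 1;              // float fast path (exists only after recompilation)
        }
        // Slow path. Real code would call
        // CompilerDirectives.transferToInterpreterAndInvalidate() here;
        // updating the flag changes what the next compilation speculates on.
        if (x instanceof Double d) {
            floatSeen = true;
            return d + 1;
        }
        throw new IllegalArgumentException("unsupported: " + x);
    }

    public static void main(String[] args) {
        System.out.println(add1(41L)); // fixnum fast path
        System.out.println(add1(1.5)); // slow path once, "deoptimizes"
        System.out.println(add1(2.5)); // now on the float fast path
    }
}
```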

17. How Truffle DSL Builds JIT Runtimes

17.1. Example

In Java code, most operations are just as simple as this:

@ELispBuiltIn(name = "1+", minArgs = 1, maxArgs = 1)
@GenerateNodeFactory
public abstract static class FAdd1 extends ELispBuiltInBaseNode {
    @Specialization(guards = {"isSafeLong(number)"})
    public static long add1LongSafe(long number) {
        return number + 1;
    }
    @Specialization
    public static Object add1Long(long number) {
        return ELispBigNum.forceWrap(number).add1();
    }
    @Specialization
    public static double add1Double(double number) {
        return number + 1;
    }
    @Specialization
    public static Object add1BigNum(ELispBigNum number) {
        return number.add1();
    }
}

And it will then have fast path support for integers, floating numbers, and big integers.

18. JIT Technique Explorations

This simplicity not only saves us work but also enables Juicemacs to explore more things rapidly. I have actually done some silly explorations:

18.1. Constant-folding global settings

For example, I tried to constant-fold more things. (Inspired by how TruffleRuby handles some of their variables.)

Many of us have an Emacs config that stays largely unchanged, at least during one Emacs session. That means many of the global variables in ELisp are constant. With speculation, we can speculate about the stable ones and try to inline them as constants.
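A minimal sketch of that speculation (plain Java with illustrative names; real Truffle code would pair an Assumption with a @CompilationFinal cache): reads treat the variable as a constant until a setq invalidates the speculation:

```java
import java.util.HashMap;
import java.util.Map;

public class StableGlobalDemo {
    static final Map<String, Object> GLOBALS = new HashMap<>();
    // Speculation state for one variable, observed during warm-up.
    static Object cachedTabWidth;
    static boolean tabWidthStable = true;

    static void setq(String name, Object value) {
        GLOBALS.put(name, value);
        if (name.equals("tab-width")) {
            tabWidthStable = false; // deoptimize: the "it never changes" bet lost
        }
    }

    static Object readTabWidth() {
        if (tabWidthStable && cachedTabWidth != null) {
            return cachedTabWidth;  // compiled code would inline this as a constant
        }
        return GLOBALS.get("tab-width"); // generic lookup after invalidation
    }

    public static void main(String[] args) {
        GLOBALS.put("tab-width", 8L); // setup: the value seen during warm-up
        cachedTabWidth = 8L;
        System.out.println(readTabWidth()); // speculated constant
        setq("tab-width", 4L);              // user changes their config
        System.out.println(readTabWidth()); // falls back to the real lookup
    }
}
```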

🤔

Real-world implications?

However, for now, we have no idea whether this will improve performance or not, since we still need a full editor to get real-world data, in real-world Emacs sessions.

18.2. Cons list backed by unrolled linked list

I also tried changing cons lists to be backed by arrays, because maybe arrays are faster, I guess?

It is actually inspired by the many element kinds of arrays in V8. I mentioned above that Juicemacs suffers from Java having to box integers (allocate Integer objects) to fit them into a cons list, and this is one way to improve on that.

But in the end, setcdr requires some kind of indirection, which actually makes the performance worse.

18.3. JIT-compiled regexp engine

And for regular expressions, I also tried borrowing the “static backtracking” technique from PCRE JIT, which is quite fast in PCRE, but it is unfortunately unsupported by the Java Truffle runtime.

So, looking at these, explorations can certainly fail. But with Truffle and Java, these are not that hard to implement, and very often, they teach us something in return, whether or not they fail.

And explorations are fun! Just for Fun. No, Really.

19. Future Explorations

Finally, let’s talk about some explorations that we might get into in the future.

  • Can we reuse the LIMPLE pipeline from nativecomp?

    Emacs nativecomp (comp.el) compiles Lisp code into an intermediate representation called LIMPLE, to which a bunch of optimizations are then applied. For Juicemacs, I’m currently looking into the implementation of nativecomp to maybe reuse some of its LIMPLE optimizations. This might enable us, for example, to recognize pure functions like nativecomp does.

  • A GUI protocol? A client in Rust? What if we can inline buffers/xwidgets in buffers?

    For the GUI, I’m thinking of a model based on an IPC protocol and am very very slowly working on one. If it ever completes, I have one thing I’m really looking forward to implementing. That is, inlining widgets, or even other buffers, directly into a buffer, like this:

    line 1
    line 2 ################### line 2 ###########
           # buffer 2 line 1 #        # xwidget #
           # buffer 2 line 2 #        # buffer  #
           ###################        ###########
    line 3
    

    Well, people sometimes complain about Emacs’s GUI capabilities, but I personally think that supporting inlining, like embedding a whole buffer inside another buffer as a rectangle, could get us very far in terms of layout. This approach should also be compatible with terminals. And I really want to see how this idea plays out with Juicemacs.

    (And probably a proper canvas overlay/API is also good to have? But all these will need to wait until we have a proper GUI with a certain degree of feature parity.)

  • Can we have transparent, JavaScript-like, single-threaded concurrency? How incompatible can it be?

    And of course, there’s Lisp concurrency. I’m currently thinking of a JavaScript-like, transparent, single-thread model, using Java’s virtual threads.

But anyway, if you’re interested in Emacs, JIT compilation, GUI, or anything above, or maybe you have your own ideas, you’re very welcome to reach out! Besides the issue tracker, I also keep a devlog on a Zulip chat server (currently experimenting with an IPC protocol for GUIs). Please feel free to leave comments and share your thoughts!

20. End!

21. Things that I wish I could also talk about

21.1. More benchmarks

The benchmarks above mainly compare nativecomp and Juicemacs. But, from our analysis, we know that nativecomp is not that fast yet, which means those benchmarks tell us very little about Juicemacs itself!

To see where we can improve Juicemacs, we need to compare with fast languages, and I did perform a few non-rigorous tests running JavaScript (Node.js) and Java, which might be of interest to you. :)

  • With Java (floating point operations): In the other blog post (Writing a Lisp JIT Interpreter with GraalVM Truffle), I ran a mandelbrot benchmark in Emacs nativecomp, Juicemacs (AST & bytecode) and Java. Juicemacs AST JIT is only slightly slower than Java. However, the bytecode JIT is 3x slower, which I assume comes from the integer boxing costs.

    mandelbrot-benchmark.svg
  • With JavaScript (function call costs): I also ran the recursive fibonacci benchmark with Node.js. Here are the results:

    |          | Node.js | Juicemacs AST | bytecode | Native image | Native image bytecode |
    | Time     | ~840 ms | ~630 ms       | ~1350 ms | ~870 ms      | ~1700 ms              |
    | Relative | 1x      | 0.75x         | 1.6x     | 1.04x        | 2.0x                  |

    I still need to make sense of the performance difference here… But in order to do fast pdump, we will most likely stick with “native image bytecode”, so think of Juicemacs as 2x slower 8 than JavaScript (V8).

    Another interesting thing is that “mutual recursion” makes the code execute faster in both Node.js and Juicemacs (actually Truffle/Graal):

    const fib1 = (x) => (x <= 1 ? x : fib2(x - 1) + fib2(x - 2));
    const fib2 = (x) => (x <= 1 ? x : fib1(x - 1) + fib1(x - 2));
    
    (defun fib1 (x) (if (<= x 1) x (+ (fib2 (1- x)) (fib2 (- x 2)))))
    (defun fib2 (x) (if (<= x 1) x (+ (fib1 (1- x)) (fib1 (- x 2)))))
    

    Notice how fib1 calls fib2, and fib2 calls fib1. Truffle offers us some options to look into how things are compiled, like the engine.TraceInlining=true option:

    sh -c "cd app && $(./gradlew -q :app:jvmCmd) --extra-option=engine.TraceInlining=true --dump-file=../emacs.pdmp"
    

    With this, we can look into the inlining depth of the fibonacci function:

    [engine] Cutoff fib   |Recursion Depth      3|...|Depth      3
    [engine] Cutoff fib1  |Recursion Depth      3|...|Depth      6
    

    So it seems that when inlining functions, Truffle imposes a self-recursion depth limit, and using mutual recursion roughly doubles the effective inlining depth, potentially making the compiled code faster. I am curious whether V8 does the same.

8

…Wait. Is it “1x slower” or “2x slower”?

22. Unused diagrams

These were made before I realized that Emacs does not have fast paths for many benchmarked operations. This means the benchmarks don’t have a simple explanation: too many factors might have slowed them down.

22.1. “Diamond speculation”

In complex programs, instead of emitting deoptimization instructions for every operation, the compiler can very often remove many of these slow paths: it can deduce, for example, that “a deoptimization check above already guarantees this object is an integer”. That makes the code more compact.
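As a rough plain-Java analogy (not actual compiler output), the pattern looks like this: one guard up front dominates everything below it, so later operations need no re-check:

```java
// Illustrative sketch of "diamond speculation": the first type guard
// (a deoptimization exit in real compiled code) dominates the rest of
// the function, so later integer operations skip their own checks.
final class DiamondDemo {
    static long incTwice(Object x) {
        if (!(x instanceof Long)) {
            // In compiled code this branch would be a deoptimization exit.
            throw new IllegalStateException("deoptimize: not an integer");
        }
        long v = (Long) x;
        // No further instanceof checks: the guard above already proved
        // that x is an integer for everything below it.
        return (v + 1) + 1;
    }
}
```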

elisp-vs-diamond-speculation.svg

22.2. Inlined + Type Hints + Unsafe

Here are some more benchmarks on cons lists:

bench-cons.svg
  • bubble: bubble sorting; pcase: simple cons pcase; inclist: increment list members by one
  • type-hints: nativecomp with (safety 0) (where the signaling path is removed); unsupported by Juicemacs (the signaling pathway (Java exceptions) is preserved)

22.2.1. What JVM is bad at

So, for example, the bubble sorting benchmarks involve comparisons, which slow down Emacs nativecomp (which has no fast path for them) but not Juicemacs. That Juicemacs is only slightly faster than nativecomp suggests it also has bottlenecks elsewhere.

  |         | nativecomp       | Juicemacs                             |
  | bubble  | cons allocation  | allocation + boxed fixnum indirection |
  | inclist | inlined 1+ calls | inlining + boxed fixnum allocation    |

Personally, I suspect that it comes from all the pointer indirections in Java: we have to box integers, and to inspect the type of an object (with instanceof), we also need to go through the pointer. In GNU Emacs, with tagged pointers, you can tell an object is a cons simply by looking at the pointer itself.
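A toy illustration of the difference, with made-up tag values (not GNU Emacs's actual Lisp_Object layout): the type test is a bit mask on the word itself, with no memory access, whereas instanceof must dereference the object to read its class header:

```java
// Illustrative sketch of pointer tagging with hypothetical tag values.
final class TaggedWord {
    static final long TAG_MASK = 0b111; // low 3 bits hold the type tag
    static final long TAG_CONS = 0b011; // hypothetical cons tag

    static long makeCons(long address) {
        // Objects are 8-byte aligned, so the low bits are free for the tag.
        return address | TAG_CONS;
    }

    static boolean isCons(long word) {
        return (word & TAG_MASK) == TAG_CONS; // no pointer dereference needed
    }
}
```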

22.3. Floating Point / Bignums

Here are some more benchmarks on floats and bignums:

bench-arith-other.svg
  • Floating point intensive: mandelbrot, nbody, inclist-th-float
  • Bignum intensive: inclist-th-bignum, pidigits (GMP vs. JDK BigInteger)
  • Are slow paths frequent?

22.4. Function Call Costs

Here are some more benchmarks on closures and function calls:

bench-funcall.svg

The following is what I hoped these benchmarks would provide insight into. But because Emacs nativecomp might be bottlenecked by other operations, the analysis does not hold.

Direct calls
fibn-rec
Indirect calls via funcall/apply
flet, fibn-rec-advice
  • Advices are not (yet?) native-compiled.
Indirect calls via mapcar/...
map-closure, inclist-...

But the principle is that if Juicemacs is not much faster than nativecomp, then there is a bottleneck to be solved. (Let’s hope boxing also explains the slowdown in these inclist-mapcar benchmarks.)

22.5. Benchmark Summary

bench-all.svg
| (On GNU Emacs 30.1)   | nativecomp (AOT)        | Juicemacs (JIT)     | Benchmarks       |
| Pure function removal | 🤗 Perfect (speed=3)    | 🥀 Not at all       | fibn-*           |
| Arithmetic inlining   | 🤔 Partial (1+, etc.)   | 🤗 Floats & +, -, = | fibn-non-const-* |
| Bignums               | 🤗 GMP & Emacs GC       | 🥀 JDK & JDK GC     | pidigits         |
| Advices               | 🤔 Bytecode glue funcs? | 🤔                  | fibn-rec-advice  |
| Function direct calls | 🤔                      | 🤔 (unfair)         | fibn-rec         |
| Closures              | 🤗                      | 🤔 🤗               | map-closure      |

By the way, if you want to run the benchmarks yourself with Juicemacs, you will need some workarounds to avoid calling Org-mode and unimplemented functions. Refer to ELispBenchmarksTest.java for ELisp code and comments.
