Mandelbrot benchmark: Emacs (native-comp) runs 10x slower than the Java or Truffle implementations
So I’ve been working on an interpreter for Emacs Lisp using the GraalVM Truffle framework for a while now. As I’ve just completed its AST interpreter and a bytecode interpreter, I guess it’s time to give it a little write-up (or rather, some random babbling about various tricks).
Although the Truffle interpreter covered here is for Emacs Lisp only, most of the following should also be applicable to other Lisp dialects and maybe other Truffle-based language implementations.
The Mandelbrot benchmark above involves quite a lot of floating-point operations, which Emacs seems bad at. So you could say this benchmark is cheating, and I guess Emacs can perform much better in some fixnum benchmarks that involve only integer or list operations.
Graal is a Java Just-In-Time compiler (written in Java),
and Truffle is a language implementation framework that lets you build a JIT compiler by just writing an interpreter.
How? One may think of Truffle as a way to instruct Graal to perform super-aggressive inlining (called "partial evaluation", or "Futamura projections"). For example, (+ 1 2) in a Lispy language may parse into an AST as follows:
```java
Node literal1 = new Node() { int execute() { return 1; } };
Node literal2 = new Node() { int execute() { return 2; } };
Node addNode = new Node() {
    @Child Node lhs = literal1;
    @Child Node rhs = literal2;
    int execute() { return lhs.execute() + rhs.execute(); }
};
// execute the program
addNode.execute(); // 1 + 2 => 3
```
Upon compilation, the compiler might attempt to inline, like, everything:
Figure 1: Inlining steps: from “addNode.execute()” to “lhs.execute()+rhs.execute()” to “1+2”
That is, having Graal JIT-compile your AST interpreter is essentially JIT-compiling the interpreted program: we go from an AST to a compiled, constant-folded form.
Let’s have a look at a lengthier example:
```java
class BytecodeInterpreter extends Node {
    // opcode constants (values arbitrary, implied by the original snippet)
    static final byte REF_ARG = 0, ADD_CONST = 1, RETURN = 2;

    @CompilationFinal(dimensions = 1)
    final byte[] instructions = {REF_ARG, ADD_CONST, -100, RETURN};
    // the bytecode is more or less: (lambda (x) (+ x -100))

    @BytecodeInterpreterSwitch
    @ExplodeLoop(kind = LoopExplosionKind.MERGE_EXPLODE)
    int execute(VirtualFrame frame) {
        int pc = 0;
        while (true) {
            byte instruction = instructions[pc++];
            switch (instruction) {
                case REF_ARG:
                    frame.setInt(0, (Integer) frame.getArguments()[0]);
                    break;
                case ADD_CONST:
                    frame.setInt(0, frame.getInt(0) + instructions[pc++]);
                    break;
                case RETURN:
                    return frame.getInt(0);
            }
        }
    }
}

// execute the program
VirtualFrame frame = createTheFrameSomehow();
new BytecodeInterpreter().execute(frame);
```
The code above is a bare-bones bytecode interpreter implemented with Truffle. @CompilationFinal tells Graal to treat instructions as constants when compiling, and @ExplodeLoop tells Graal to unroll loops aggressively. With these, we now have a simplistic bytecode JIT compiler, and the bytecode above partial-evaluates to straight-line code, roughly:
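```java
{
    frame.setInt(0, (Integer) frame.getArguments()[0]); // REF_ARG
    frame.setInt(0, frame.getInt(0) + (-100));          // ADD_CONST -100
    return frame.getInt(0);                             // RETURN
}
```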
Depending on the stack frame type, Graal might also be able to recognize frame.getInt(0) as a local variable and optimize further:
```java
{ return (Integer) frame.getArguments()[0] + (-100); }
```
Graal offers a tool called Ideal Graph Visualizer (IGV) to visualize compilation graphs. The following graph was produced by Graal running said bytecode:
Figure 2: A compilation graph produced by IGV
Up until the 127 Unbox node, the program reads in a function parameter.
The 149 + node adds a constant -100 to the unboxed result.
The 200 BoxNode and 195 Return nodes box the sum and return.
Partial evaluation alone is not enough for an efficient JIT implementation. For example, we expect (+ 1 2) and (+ 1.0 2.0) to compile to different instructions (maybe ADD and FADD respectively), but for dynamic languages, we don’t always know the argument types at compile time. And this is exactly what speculative compilation (or speculation) can help with.
Basically, without speculation, the compiler might compile (+ x y) into:
```java
if (x instanceof Integer xi && y instanceof Integer yi) {
    return xi + yi;
}
if (x instanceof Float xf && y instanceof Float yf) {
    return xf + yf;
}
// maybe tons of other cases:
// - float + int
// - int + float
// - int/float + bigint
// - bigint + int/float/bigint
throw new UnsupportedOperationException();
```
With speculation, the compiler can somehow guess from previous calls what types x and y are likely to have, and move the irrelevant paths into a separate interpreted slow path, something like:
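```java
// only the observed int + int case stays on the compiled fast path
if (x instanceof Integer xi && y instanceof Integer yi) {
    return xi + yi;
}
return interpretedSlowPath(x, y);
```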
The interpretedSlowPath usually involves invalidating the current compilation and re-specializing the AST node. Truffle languages typically achieve this with a mutable @CompilationFinal state field and a special compiler directive, transferToInterpreterAndInvalidate():
```java
class SpeculativeAdd extends Node {
    static final int STATE_INT_INT = 1;
    static final int STATE_FLOAT_FLOAT = 2;

    @CompilationFinal int state = 0;

    Object execute(Object lhs, Object rhs) {
        if (state == STATE_INT_INT) {
            if (lhs instanceof Integer l && rhs instanceof Integer r) {
                return l + r;
            }
        }
        if (state == STATE_FLOAT_FLOAT) {
            if (lhs instanceof Float l && rhs instanceof Float r) {
                return l + r;
            }
        }
        // other specializations: int + float, float + int, bigints, etc.
        return specialize(lhs, rhs);
    }

    Object specialize(Object lhs, Object rhs) {
        CompilerDirectives.transferToInterpreterAndInvalidate();
        if (lhs instanceof Integer l && rhs instanceof Integer r) {
            state = STATE_INT_INT;
            return l + r;
        }
        if (lhs instanceof Float l && rhs instanceof Float r) {
            state = STATE_FLOAT_FLOAT;
            return l + r;
        }
        // other specializations: int + float, float + int, bigints, etc.
        throw new UnsupportedOperationException();
    }
}
```
Whenever there is an active state, the compiler treats state as @CompilationFinal and compiles only the active path; whenever the state assumptions are invalidated, the specialize method is responsible for invalidating the current compilation and adjusting the state so that the node can be recompiled correctly.
This pattern allows us to realize tons of advanced JIT techniques, including:
Inline caching: For costly operations, if the inputs are known and vary infrequently, it might be best to cache the results or intermediate products. For example, global variable lookup can really use this technique: instead of consulting a hash table every single time you access a variable, cache the value container (see the sketch after this list).
Polymorphism: In (+ 1 x), when x can sometimes be an integer and sometimes a floating-point number, instead of re-compiling every time the type changes, it is better to compile both paths (out of many possible ones) and switch between them at runtime.
Splitting: A highly polymorphic AST graph, however, can impact performance. One way to handle this is to deep-copy the polymorphic sub-tree so that each copy can re-specialize monomorphically for its own call site.
Figure 3: Splitting a simple_exp function depending on argument type
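As promised above, here is a sketch of inline caching for global variable lookup. ValueContainer and globalLookup are hypothetical stand-ins for however the runtime stores symbol values; the point is that the hash-table lookup happens only once, at specialization time:

```java
class GlobalReadNode extends Node {
    final String name;
    @CompilationFinal ValueContainer cachedContainer;

    GlobalReadNode(String name) {
        this.name = name;
    }

    Object execute() {
        if (cachedContainer == null) {
            CompilerDirectives.transferToInterpreterAndInvalidate();
            cachedContainer = globalLookup(name); // the costly hash-table lookup
        }
        return cachedContainer.getValue(); // fast path: a plain field read
    }
}
```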
However, it is definitely not fun having to hand-roll all of this ourselves. Instead, Truffle provides a powerful DSL (read: Java annotations) that generates all the boilerplate for us. Using the DSL, the SpeculativeAdd node above is as simple as a few @Specialization methods:
```java
abstract class MyLanguageExpressionNode extends Node {
    // An abstract "executeXXX" method to be implemented by the DSL
    public abstract Object execute(VirtualFrame frame);
}

@NodeChild(value = "lhs", type = MyLanguageExpressionNode.class)
@NodeChild(value = "rhs", type = MyLanguageExpressionNode.class)
abstract class SpeculativeAdd extends MyLanguageExpressionNode {
    @Specialization
    public static int addInts(int l, int r) {
        return l + r;
    }

    @Specialization
    public static float addFloats(float l, float r) {
        return l + r;
    }

    @Specialization
    public static float addFloatInt(float l, int r) {
        return l + r;
    }

    @Specialization
    public static float addIntFloat(int l, float r) {
        return l + r;
    }
}
```
The code above asks the Truffle annotation processor to generate a SpeculativeAddNodeGen class implementing the specialization and polymorphism logic.
The above specializes the AST nodes. But to supply values to the nodes, you will need variables. It turns out that Truffle offers VirtualFrame exactly for this purpose. By storing local variables on a Frame, you can specialize the slots to Object or primitive types so as to remove boxing/unboxing costs. See SLWriteLocalVariableNode.java and SLReadLocalVariableNode.java for how easily one can speculate about unboxed storage of local variables.
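A write node, sketched loosely after those SimpleLanguage nodes (the DSL generates the execute methods; the slot index is assumed to be pre-allocated):

```java
@NodeChild(value = "value", type = MyLanguageExpressionNode.class)
abstract static class WriteLocalNode extends MyLanguageExpressionNode {
    final int slot;

    WriteLocalNode(int slot) {
        this.slot = slot;
    }

    @Specialization(guards = "isIntOrIllegal(frame)")
    int writeInt(VirtualFrame frame, int value) {
        // speculate that this variable only ever holds ints: no boxing
        frame.getFrameDescriptor().setSlotKind(slot, FrameSlotKind.Int);
        frame.setInt(slot, value);
        return value;
    }

    @Specialization(replaces = "writeInt")
    Object writeGeneric(VirtualFrame frame, Object value) {
        // speculation failed: fall back to boxed Object storage
        frame.getFrameDescriptor().setSlotKind(slot, FrameSlotKind.Object);
        frame.setObject(slot, value);
        return value;
    }

    boolean isIntOrIllegal(VirtualFrame frame) {
        FrameSlotKind kind = frame.getFrameDescriptor().getSlotKind(slot);
        return kind == FrameSlotKind.Int || kind == FrameSlotKind.Illegal;
    }
}
```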
Emacs Lisp has over a thousand built-in functions, each requiring a dedicated Node implementation. Even if Truffle DSL generates Plus/Minus/Times/...NodeGen for us, we really don’t want to initialize all those functions ourselves:
global.put("+",createFunction(AddNodeGen::create));global.put("-",createFunction(SubNodeGen::create));global.put("1+",createFunction(Add1NodeGen::create));global.put("eq",createFunction(EqNodeGen::create));// repeat a thousand times...
Fortunately, Truffle provides @GenerateNodeFactory. If you annotate your AddNode with it, Truffle generates an AddNodeFactory for you, and the code above becomes:
global.put("+",createFunction(AddNodeFactory.getInstance()));// still, repeat it a thousand times...
(gasp)
No, @GenerateNodeFactory is not about creating more FactoryFactoryFactories, and nice things happen when you put nodes in inner classes:
```java
class BuiltInFns {
    @MyAnnotation(name = "+", minArgs = 0, maxArgs = -1, varArgs = true)
    @GenerateNodeFactory
    abstract static class AddNode extends MyLanguageExpressionNode {
        // specializations
    }

    @MyAnnotation(name = "-", minArgs = 0, maxArgs = -1, varArgs = true)
    @GenerateNodeFactory
    abstract static class SubNode extends MyLanguageExpressionNode {
        // specializations
    }

    // a hundred other nodes
}

// now, this gives you all the inner factories
var factories = BuiltInFnsFactory.getFactories();
extractAnnotationsAndCreateRootNodes(factories);
```
I found this usage in the GraalPy codebase and didn’t see it documented anywhere. But it is definitely worth mentioning here in case anyone also has thousands of built-in functions to implement.
Now, with runtime macro expansion, we cannot know for certain (at declaration time) whether (something a b) is a call node or a macro node. The only way out is to have a call-or-macro node that dynamically switches to a call node or a macro node at runtime.
Static analysis is doomed.
Truffle frees us from doing register allocation, but we still need to allocate a frame slot for every local variable ourselves:
```elisp
(lambda (x)          ; x -> frame slot #0
  (let ((y (* x 2))  ; y -> frame slot #1
        (z (* x x))) ; z -> frame slot #2
    (+ x y z)))
```
For some languages, slot allocation can be done at parse time, but not with runtime macros:
pcase-let is a macro that introduces new let forms, declaring new variables. Since macros expand at runtime, this means we must also assign slot numbers to those variables at runtime.
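A minimal illustration: the backquote pattern below only turns into plain let bindings for x and y when the form is actually expanded and run:

```elisp
(pcase-let ((`(,x ,y) '(1 2))) ; expands into `let' forms at run time
  (+ x y))
;; => 3
```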
This is fine: we can always introduce more state into our AST, and keeping a slot count is not a big deal. The only problem is that Truffle requires knowing the frame size (the total slot count) at declaration time, which I don’t think is possible with Emacs Lisp. So we need another trick to handle this.
For a Truffle frame, we need to know its size beforehand. So how do we handle a dynamic number of variables? Do we implement variable spilling ourselves? Well, I did, only to find out weeks later that there is a better way.
I discovered this technique in the GraalJS codebase. A suggestion for anyone trying to use Truffle for a real-world language: when implementing a feature, look into GraalJS, GraalPy, Espresso, or TruffleRuby to see how they implement it.
So Truffle offers two types of frames: VirtualFrame and MaterializedFrame:
VirtualFrame is more lightweight, and variables stored in it will be stored on a real stack frame when the function is compiled. However, for this to work, it should only be passed around in function arguments and must not be stored in any fields.
MaterializedFrame is usually obtained by calling materialize on a VirtualFrame, so that a frame can be stored (for example, for closures that require references to the parent frame). I am not very sure how well it gets inlined, but in the following case, it does.
The technique is quite simple: when introducing new variables, just allocate a new VirtualFrame for them and store the current frame in it. Let’s step through how a program like the following executes with this technique (a reconstruction along the lines of the vector-length function from Figure 4):
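```elisp
(defun vector-length (x y)
  (let ((xx (* x x)))
    (let ((yy (* y y)))
      (sqrt (+ xx yy)))))
```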
When Truffle calls this function, it allocates a VirtualFrame according to the RootNode of the function. Typically, one wants one slot for each argument, so the frame should probably be:
```
(virtual-frame@0
 nil          ; slot #0 (read on for its usage)
 slot-for-x   ; slot #1
 slot-for-y   ; slot #2)
```
Entering the first let scope, we allocate a new virtual-frame@1 for the xx variable. However, since we need to be able to access variables from outer scopes, we need to chain these frames up. The simplest way is to store the upper frame in one of the slots of the new frame:
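```
(virtual-frame@1
 virtual-frame@0 ; slot #0: the upper frame
 slot-for-xx)    ; slot #1
```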
A variable-accessing node should store the relative frame level of the variable in addition to the slot number.
If implemented correctly (that is, if materialized frames are only ever stored within the “frame chain”), Truffle will recognize the frames as part of the current frame and virtualize them. And, hooray, now we have a dynamically-sized frame as good as a static one:
Figure 4: IGV graph for the vector-length function. Notice that there are no references to materialized frames.
In some Lisps, special forms are more or less identified by their corresponding symbols, in that you cannot define your own my-if by aliasing my-if to if. Well, this isn’t the case with Emacs:
```elisp
(defalias 'my-if #'if)
;; now you can use 'my-if as a special form
(my-if (special-form-p 'my-if) "special" "not special")
;; => "special"
```
And it is (or was?) a feature that actually saw some use:
```elisp
;; org-compat.el:
(define-obsolete-function-alias 'org-without-partial-completion 'progn "9.6")

;; byte-run.el:
;; This was undocumented and unused for decades.
(defalias 'inline 'progn
  "Like `progn', but when compiled inline top-level function calls in body.
You don't need this.  (See bytecomp.el commentary for more details.)
\(fn BODY...)")
```
The solution is quite simple: since we already have runtime macro-expansion, it is quite straightforward to just also put our special form logic there.
Now we have talked about macros and special forms. Let’s go on to the last list form: function calls.
In most Lisps, there will be some “primitive” functions that have to be provided by the runtime, like arithmetic operations ((+ 1 x) and (- 1 y)) or list/cons-related functions ((car cons) and (cdr cons)). Semantically, +, -, car, and cdr are all functions: one can use funcall to indirect-call them, or pass them to mapcar as a mapping function. However, we really want some special processing for them: function calls are too costly for common constructs like these:
Figure 5: Function calls incur boxing and vararg packing costs
Instead, we want to inline the addition instructions into the AST graph so that Truffle can specialize each node, hopefully turning each into a simple integer addition operation:
Figure 6: The “(+ 1 2 3)” function call inlined into two binary add nodes, allowing specialization into integer additions
The sketch below summarizes how we handle list forms like (maybe-func arg1 arg2): we create a placeholder node and defer any analysis until evaluation. When the node is first executed, it replaces itself with the correct node implementation, possibly inlining some built-in functions.
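A sketch (resolve() stands in for the actual dispatch logic, which must consult the current binding of maybe-func):

```java
class CallOrMacroNode extends MyLanguageExpressionNode {
    final Object function; // the head symbol of (maybe-func arg1 arg2)
    final Object[] args;   // the raw, unanalyzed argument forms

    CallOrMacroNode(Object function, Object[] args) {
        this.function = function;
        this.args = args;
    }

    @Override
    public Object execute(VirtualFrame frame) {
        // first execution: we are still in the interpreter, so rewriting is safe
        CompilerDirectives.transferToInterpreterAndInvalidate();
        // special form? macro? inlinable built-in? a plain function call?
        MyLanguageExpressionNode resolved = resolve(function, args);
        return replace(resolved).execute(frame);
    }
}
```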
The actual implementation is left as an exercise to the reader.
There’s one thing still missing in the sketch above: it does not handle function redefinition. Once the nodes are inlined (e.g., (+ 1 2 3) inlined into (int-add (int-add 1 2) 3)), the nodes won’t be aware of any function changes, and we won’t be able to get back. This is problematic, because when we redefine a function, we expect the change to take effect immediately:
```elisp
(define-advice + (:around (oldfun &rest numbers))
  (message "INFO: %S" (cons '+ numbers))
  (apply oldfun numbers))
;; Now the `+' function is an advised function
;; that logs the call before calling the original.
```
An intuitive way to implement this is to introduce a check on every call, something like this (with made-up field names):
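```java
Object execute(VirtualFrame frame) {
    if (symbol.getFunction() != cachedFunction) {
        // the function was redefined: go back to the interpreter and re-inline
        CompilerDirectives.transferToInterpreterAndInvalidate();
        return respecialize(frame);
    }
    return inlinedNode.execute(frame);
}
```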
This works, and can be quite efficient when you cache the function value container for the variables/symbols (so that you don’t need to look up the symbol every time). But there’s a more idiomatic way to do this: using Truffle’s Assumption and CyclicAssumption (again a sketch with made-up names; the call site caches the CyclicAssumption’s current Assumption):
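```java
class InlinedCallNode extends MyLanguageExpressionNode {
    @CompilationFinal Assumption currentFunctionStable; // cached from the function cell
    @Child MyLanguageExpressionNode inlinedNode;

    @Override
    public Object execute(VirtualFrame frame) {
        if (!currentFunctionStable.isValid()) {
            CompilerDirectives.transferToInterpreterAndInvalidate();
            return respecialize(frame); // re-inline and re-cache the assumption
        }
        return inlinedNode.execute(frame);
    }
}

// In the function cell: invalidate on every redefinition (defalias, advice, ...)
final CyclicAssumption functionStable = new CyclicAssumption("function unchanged");

void setFunction(Object newFunction) {
    this.function = newFunction;
    functionStable.invalidate(); // pushes invalidation to dependent compilations
}
```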
The advantage of assumptions is that Truffle is aware of them and can turn pull-based stability checks into push-based invalidations. It tracks the compilation units that depend on a particular assumption, and automatically invalidates them when the assumption no longer holds. That means the currentFunctionStable.isValid() check above is essentially free: the compiler simply assumes it to be true and can completely remove the other branch:
Figure 7: With assumptions, the stability check can be compiled away.
One can also use assumptions to constant-fold global variables (which I learned from how TruffleRuby handles some of its variables): assume a constant variable by default, and fall back to a dynamic read when a CyclicAssumption is invalidated too many times.
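A sketch of that fallback logic (the names and the threshold are made up):

```java
Object execute(VirtualFrame frame) {
    if (invalidationCount < 3) { // arbitrary threshold
        if (!unchangedAssumption.isValid()) {
            CompilerDirectives.transferToInterpreterAndInvalidate();
            invalidationCount++;
            recacheValueAndAssumption(); // re-read the value, get a fresh Assumption
        }
        return cachedValue; // @CompilationFinal: folds into the compiled code
    }
    return container.getValue(); // too unstable: plain dynamic read
}
```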
Internally, GNU Emacs keeps local variables in a list (Vinternal_interpreter_environment) and implements variable capture by storing that very list in the closure object.
Truffle languages, as is shown above, usually use VirtualFrame objects for local variables, and they have very good reasons to do so: Truffle is aware of VirtualFrame and can inline the costs away. (Still remember the BytecodeInterpreter example above?)
To add to this, Emacs makes quite heavy use of the environment list in its oclosures (closures with associated metadata) as of Emacs 30:
```elisp
;; an advice oclosure object (with up to 6 slots)
#[;; [0]: arguments
  (&rest r)
  ;; [1]: lisp code
  ((if t nil car cdr how props) ; prevent environment trimming by cconv
   (apply car cdr r))
  ;; [2]: environment
  ((car . test-advice) (cdr . test-main) (how . :around) (props))
  ;; [3]: nil
  nil
  ;; [4]: docstring
  advice]
```
An advice oclosure object has four fields: car, cdr, how, and props, and we can clearly see that these fields are stored right in the environment slot of the closure.
I don’t have a solution for this. Currently, whenever oclosure requests it, my implementation produces a list of captured variables from the parent scope (stored as a MaterializedFrame), but it never propagates the changes backwards. Let’s hope this works. Otherwise, I don’t know, we might have to ditch MaterializedFrame/VirtualFrame altogether.
Multiple closures created in a function might only partially share the same MaterializedFrame. Consider the following program:
```elisp
;; -*- lexical-binding: t -*-
(let (closures)
  (dolist (i '(1 2 3))
    (push (lambda () i) closures))
  (mapcar #'funcall closures))
;; => (3 2 1)
```
The three closures created in the loop return 3, 2, and 1 respectively, meaning that each closure captures a different i. However, for a variable outside the loop:
```elisp
;; -*- lexical-binding: t -*-
(let (j closures)
  (dolist (temp '(1 2 3))
    (setq j temp)
    (push (lambda () j) closures))
  (mapcar #'funcall closures))
;; => (3 3 3)
```
Now they all return 3, with j shared between all three closures.
Figure 8: “j” is shared by closures; “i”, inside the loop, is different for each closure.
Since we already split a stack frame into a series of chained frames (3.1), we don’t need to do anything special: (j . 1) and (i . X) are stored on different Truffle frames, and we allocate a new frame on every iteration. By doing so, the closures naturally get to share j while keeping their own i:
Figure 9: Different frames hold different i values.
For languages whose for loops have per-iteration scopes, this can get more complex, though. Consider this JavaScript snippet:
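```js
const closures = [];
for (let i = 0; i < 3; i++) {
  closures.push(() => i);
}
closures.map((f) => f()); // => [0, 1, 2]
```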
See how each JS closure above gets a different i even though the let i = 0 declaration is technically outside the loop body? (GraalJS handles this by copying frames every iteration, when needed.)
GraalJS does a bit more to prune the captured frames. Since Truffle frames hold all local variables, they can prevent unused values from being garbage-collected when closures keep references to the frame chain. So GraalJS does static analysis to decide which frame slots are safe to clear, keeping only the ones used by closures.
Actually, Emacs offers this too: after bootstrapping, Emacs calls cconv-make-interpreted-closure when creating closures, which then prunes unused conses off the environment list. But, as always, it assumes the captured environment to be a list, and I am not sure how to deal with that.
There is still one last thing we need to check to ensure efficient closure usage: AST (or Truffle node) sharing between lambda instances.
Figure 10: Closures differing only in environment should share code AST nodes.
Why? It is quite intuitive: in our JIT runtime, a new code snippet means a new compilation unit, and sharing compiled code between closures is the way to go if you don’t want to overwhelm your compiler with unnecessary deoptimizations.
However, since we store the lexical environment in the closure object, the shared code will need a way to access the per-closure environment: it requires some “calling convention”, similar to how languages specify how their this/self pointers are passed around. One might simply pass the closure object as the first argument and adapt the variable-access logic accordingly.
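For example (a sketch; Closure and getCallTarget() are stand-ins for whatever the runtime actually uses):

```java
static Object callClosure(Closure closure, Object[] args) {
    // prepend the closure itself so the shared code can find its environment
    Object[] callArgs = new Object[args.length + 1];
    callArgs[0] = closure;
    System.arraycopy(args, 0, callArgs, 1, args.length);
    return closure.getCallTarget().call(callArgs);
}

// inside the shared function body, environment access goes through:
// Closure self = (Closure) frame.getArguments()[0];
```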
ELisp has two kinds of interpreted functions: interpreted-closure and byte-code-function, each differing in how they handle closure creation.
For interpreted-closure, the runtime captures the environment whenever the function special form is evaluated. For byte-code-function, the bytecode compiler emits code to explicitly capture the required variables into the CLOSURE_CONSTANTS field of the created function, freeing us of some of the trouble.
No! That’s enough rambling for today. The above has covered most of the things I’ve tried and done in the little AST interpreter, and should get you an interpreter that can potentially run on par with Java. However, since Emacs nowadays also runs bytecode and native-comp artifacts (compiled with libgccjit), there will probably be another post dedicated to writing bytecode interpreters (specifically, a hand-rolled Emacs Lisp bytecode interpreter and a native-comp LIMPLE interpreter (to be) implemented with the Truffle Bytecode DSL).
By the way, you can find the interpreter here: https://codeberg.org/gudzpoz/Juicemacs/ (it requires Java 23 and probably some patience to get it running, though ;-)