What is threaded code? What is it good for? What are the differences between the various threading techniques? How do I implement threaded code portably? How fast are various threading techniques?
Threaded code, in its original meaning [bell73], is one of the techniques for implementing virtual machine interpreters. Nowadays, at least the Forth community uses the term threading for almost any technique used for implementing Forth's virtual machine.
Consider a virtual machine program consisting of the instructions A, B and C; a straightforward implementation calls the routines implementing them in sequence:

call A
call B
call C

This is known as subroutine-threaded code, although it is not threaded code in the original sense; in fact, subroutine threading is not even an interpretive technique.
Now, let's eliminate the calls:

A
B
C

i.e., a sequence of code addresses that represent the virtual machine instructions.
As a consequence, we cannot execute this piece of code by jumping to its start. We also have to keep track of the pointer to the current instruction in a separate register (instead of using the processor's program counter register and return address stack/register).
How do we execute the next instruction? Let's assume that the instruction pointer register (ip) always points to the word in the code sequence that directly follows the current instruction word. Then we just have to load the word, jump to the routine pointed to by it, and increment the instruction pointer. E.g., in MIPS assembly language:
lw   $2,0($4)  #get next instruction, $4=inst.ptr.
addu $4,$4,4   #advance instruction pointer
j    $2        #execute next instruction
#nop           #branch delay slot

This routine is the interpreter (aka inner interpreter in the Forth community); it is also known as the NEXT routine. Either every virtual machine instruction routine has a copy of NEXT at the end, or they share one copy of NEXT and jump to it. With modern processors, the shared NEXT not only costs a jump, but also dramatically increases the misprediction rate of the indirect jump on CPUs with BTBs and similar indirect branch predictors [ertl&gregg03jilp]. Therefore, I recommend not sharing NEXT.
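To make the difference concrete, here is a sketch of mine (not from [ertl&gregg03jilp]), using the GNU C notation introduced further below; inst1 and inst2 are made-up primitives, ip is the instruction pointer, and the fragments are assumed to live inside one engine function:

/* Variant 1: shared NEXT; all primitives jump to one common
   dispatch point, so a single indirect branch (and BTB entry)
   serves every primitive. */
next:   goto **ip++;
inst1:  /* ... work of inst1 ... */ goto next;
inst2:  /* ... work of inst2 ... */ goto next;

/* Variant 2: separate NEXTs; each primitive ends with its own
   copy of the dispatch, giving each its own indirect branch and
   BTB entry.  (The two variants are alternatives, not meant to
   coexist in one function.) */
inst1:  /* ... work of inst1 ... */ goto **ip++;
inst2:  /* ... work of inst2 ... */ goto **ip++;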
The method described above is the one described in [bell73] as threaded code and has later been called direct threaded code.
Note that, contrary to popular myth, subroutine threading is usually slower than direct threading. But it is a starting point for native code generation.
A constant can be represented in the virtual machine code by the instruction lit (for literal), followed by the value of the constant. If the constant occurs frequently, it is more space-efficient to have a dedicated virtual machine instruction for this constant. If we have several such constants, the code for their virtual machine instructions will look very similar. So we may want to share the code, while having separate data.
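As a hypothetical illustration (in the GNU C style of the implementation section below; lit, four, five, sp and NEXT are my names, not the article's), the two representations might look like this; note how every dedicated constant instruction repeats the same routine with a different value:

/* Generic literal: the value follows the instruction in the code. */
lit:  *--sp = (long)*ip++;  NEXT;   /* two code cells per occurrence */

/* Dedicated constant instructions: one code cell per occurrence,
   but each constant duplicates the same routine body. */
four: *--sp = 4;  NEXT;
five: *--sp = 5;  NEXT;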
We can do this by adding a level of indirection in NEXT (resulting in indirect threaded code). Each word (a generalization of the virtual machine instruction) has a code field and a parameter field. E.g., in the case of the constant:
code-field:      docon  #code address of shared CONstant code
parameter-field: value  #value of constant

The virtual machine code is now represented by a sequence of code field addresses, not code addresses. Simple virtual machine instructions (primitives) are typically represented like this:
code-field2:      parameter-field2
parameter-field2: code  #machine code of the primitive

The NEXT of indirect threaded code looks like this in MIPS assembly language:
lw   $2,0($4)  #get the code-field address of the next word, $4=inst.ptr.
#nop           #load delay slot
lw   $3,0($2)  #get the code address of the word
addu $4,$4,4   #advance instruction pointer
j    $3        #execute code for the word
#nop           #branch delay slot

The code for the next instruction can compute the parameter-field address from the code-field address in $2.
(Note: On the Pentium, the K5 and the K6(-2) it is extremely expensive to mix code and data, so this form of direct threading is much slower on these processors than indirect threading.)
Here I present variants corresponding to direct threading. Add an indirection in NEXT to get the indirect threaded version.
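For example, using the GNU C constructs shown below, the extra indirection might look like this (a sketch under my own naming; Word, w, docon and sp are assumptions, not the article's):

typedef void **Word;   /* a word starts with its code field           */
Word *ip;              /* the VM code is a sequence of word addresses */
Word w;                /* current word, so primitives can find their  */
                       /* parameter field at w+1                      */

#define NEXT do { w = *ip++; goto **w; } while (0)

/* The shared constant routine of the example above would then be:
     docon: *--sp = (long)w[1]; NEXT;
   i.e., one shared routine, with the per-constant value in each
   word's parameter field. */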
Threaded code cannot be implemented directly in ANSI C, because ANSI C provides no way to take the address of a label or to jump to a computed code address (the closest substitute is a switch statement).
In GNU C, a direct threaded NEXT looks like this:
typedef void *Inst;
Inst *ip; /* you should use a local variable for this */

#define NEXT goto **ip++
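Putting the pieces together, here is a minimal, self-contained sketch of a direct threaded engine (my example, not from the article): each primitive ends with its own copy of NEXT, as recommended above, and the instruction names lit, four, add and halt are made up. It assumes GNU C and that long is at least pointer-sized; compiled with gcc, it prints 8.

#include <stdio.h>

typedef void *Inst;

int main(void)
{
  long stack[16];
  long *sp = stack + 16;       /* empty stack, grows downward */
  Inst program[5];
  Inst *ip = program;

  /* VM code: push 4 via lit, push 4 via a dedicated primitive,
     add them, halt */
  program[0] = &&lit;  program[1] = (Inst)4L;
  program[2] = &&four;
  program[3] = &&add;
  program[4] = &&halt;

#define NEXT goto **ip++
  NEXT;                             /* dispatch the first instruction */

 lit:  *--sp = (long)*ip++; NEXT;   /* inline operand; own NEXT copy */
 four: *--sp = 4;           NEXT;   /* dedicated constant */
 add:  sp[1] += sp[0]; sp++; NEXT;
 halt: printf("%ld\n", sp[0]);      /* prints 8 */
  return 0;
}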
Direct threading can also be expressed with function calls in continuation-passing style: every virtual machine instruction is a function that, at its end, calls the next instruction; this is efficient only if the compiler optimizes these tail calls into jumps:

typedef void (* Inst)();

void inst1(Inst *ip, /* other regs */)
{
  ...
  (*ip)(ip+1, /* other registers */);
}
A completely portable alternative is to encode the virtual machine instructions as tokens (e.g., small integers) and to dispatch on them with a switch statement (or similar statements in other languages, e.g., Pascal's CASE). The result has the same advantages as token threaded code; the disadvantage over token threaded code is the lower speed, due to the range check generated by most (all?) C compilers for the switch; moreover, with the right encoding, token threaded code can avoid scaling, while the C compilers I have seen do not avoid it.
The main performance drawback of switch threading on modern CPUs is that it uses one shared indirect branch; this leads to close to 100% mispredictions in the BTB (branch target buffer) used for indirect branch prediction on many modern CPUs (e.g., Athlon 64, Pentium 4, PPC 970), whereas threaded code with separate NEXTs has only about 50% mispredictions [ertl&gregg03jilp].
Switch threading in C looks like this:
typedef enum {
  inst1 /* ... */
} Inst;

void engine()
{
  static Inst program[] = { inst1 /* ... */ };
  Inst *ip = program;

  for (;;)
    switch (*ip++) {
    case inst1:
      ...
      break;
    ...
    }
}

Separate NEXTs, and the resulting reduction in BTB mispredictions and increased performance, can be implemented with switch, too, by replicating the switch, as follows:

typedef enum {
  inst1 /* ... */
} Inst;

void engine()
{
  static Inst program[] = { inst1 /* ... */ };
  Inst *ip = program;

  switch (*ip++) {
  case inst1: goto label1;
  ...
  }

 label1:
  ...
  switch (*ip++) {
  case inst1: goto label1;
  ...
  }
  ...
}

While this technique gives the same branch prediction advantages as the separate NEXTs in threaded code, it does not eliminate the other performance disadvantages of switch threading.
Moreover, each virtual machine instruction is represented by a function (procedure). This is very elegant to look at, and allows separate compilation, but it means that global variables have to be used for the virtual machine registers (e.g., the instruction pointer ip), which most (all?) compilers allocate into memory; in contrast, with switch threading you can use local variables, which are (hopefully) allocated into registers.
Call threading looks like this:
typedef void (* Inst)();
Inst *ip;

void inst1()
{
  ...
}

void engine()
{
  for (;;)
    (*ip++)();
}

Note: Call threading is a term I have used for this technique. There seems to be no established term yet. I have seen subroutine threading used for this technique, but that term has a well-established and different meaning (see above).
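For completeness, here is a runnable sketch along these lines (my example; push4, add and halt are made-up instructions, and halt stops the otherwise endless dispatch loop by calling exit). Note that the VM registers are globals, as discussed above:

#include <stdio.h>
#include <stdlib.h>

typedef void (* Inst)();

static Inst *ip;                    /* VM registers as globals */
static long stack[16], *sp = stack + 16;

static void push4() { *--sp = 4; }
static void add()   { sp[1] += sp[0]; sp++; }
static void halt()  { printf("%ld\n", sp[0]); exit(0); }  /* prints 8 */

int main()
{
  static Inst program[] = { push4, push4, add, halt };
  ip = program;
  for (;;)
    (*ip++)();                      /* call the next instruction */
}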
Interestingly, the system described in [moore&leach70] (before Moore worked at NRAO) does not use threaded code, but string interpretation. However, it uses a code field, i.e., it has the seed for indirect threading. Then again, code fields were folklore at that time [kay96].
@Article{bell73,
  author =   {James R. Bell},
  title =    {Threaded Code},
  journal =  cacm,
  year =     {1973},
  volume =   {16},
  number =   {6},
  pages =    {370--372}
}

@Article{dewar75,
  author =   {Robert B.K. Dewar},
  title =    {Indirect Threaded Code},
  journal =  cacm,
  year =     {1975},
  volume =   {18},
  number =   {6},
  month =    jun,
  pages =    {330--331}
}

@string{ieeecomputer="Computer"}

@Article{kogge82,
  author =   {Peter M. Kogge},
  title =    {An Architectural Trail to Threaded-Code Systems},
  journal =  ieeecomputer,
  year =     {1982},
  month =    mar,
  pages =    {22--32},
  annote =   {Explains the design of (a classical implementation of) Forth, starting with threaded code, then adding the parameter stack, constants, variables, control structures, dictionary, outer interpreter and compiler.}
}

@InProceedings{ertl93,
  author =   {M. Anton Ertl},
  title =    {A Portable {Forth} Engine},
  booktitle = {EuroFORTH '93 conference proceedings},
  year =     {1993},
  address =  {Mari\'ansk\'e L\'azn\`e (Marienbad)},
  url =      {http://www.complang.tuwien.ac.at/papers/ertl93.ps.Z},
  abstract = {The Forth engine discussed in this paper is written in GNU C, which provides several extensions that are important for Forth implementation. The indirect threaded Forth engine is completely machine-independent, direct threading requires a few machine-specific lines for each machine. Using a portable language like GNU C encourages producing an engine with many primitives. In order to make the development of primitives easier and less error-prone, an automatic tool generates most of the code for a Forth primitive from the stack effect notation, even if the TOS is kept in a register. The engine is combined with the parts of the system written in Forth by loading a machine-independent image file that contains the executable Forth code in relocatable form.}
}

@InProceedings{ertl95pldi,
  author =   {M. Anton Ertl},
  title =    {Stack Caching for Interpreters},
  crossref = {sigplan95},
  pages =    {315--327},
  url =      {http://www.complang.tuwien.ac.at/papers/ertl95pldi.ps.gz},
  abstract = {An interpreter can spend a significant part of its execution time on arguments of virtual machine instructions. This paper explores two methods to reduce this overhead for virtual stack machines by caching top-of-stack values in (real machine) registers. The {\em dynamic method} is based on having, for every possible state of the cache, one specialized version of the whole interpreter; the execution of an instruction usually changes the state of the cache and the next instruction is executed in the version corresponding to the new state. In the {\em static method} a state machine that keeps track of the cache state is added to the compiler. Common instructions exist in specialized versions for several states, but it is not necessary to have a version of every instruction for every cache state. Stack manipulation instructions are optimized away.}
}

@Proceedings{sigplan95,
  booktitle = {SIGPLAN '95 Conference on Programming Language Design and Implementation},
  title =    {SIGPLAN '95 Conference on Programming Language Design and Implementation},
  year =     {1995},
  key =      {SIGPLAN '95}
}

@Article{ertl&gregg03jilp,
  author =   {M. Anton Ertl and David Gregg},
  title =    {The Structure and Performance of \emph{Efficient} Interpreters},
  journal =  {The Journal of Instruction-Level Parallelism},
  year =     {2003},
  volume =   {5},
  month =    nov,
  url =      {http://www.complang.tuwien.ac.at/papers/ertl%26gregg03jilp.ps.gz},
  url2 =     {http://www.jilp.org/vol5/v5paper12.pdf},
  note =     {http://www.jilp.org/vol5/},
  abstract = {Interpreters designed for high general-purpose performance typically perform a large number of indirect branches (3.2\%--13\% of all executed instructions in our benchmarks). These branches consume more than half of the run-time in a number of configurations we simulated. We evaluate how accurate various existing and proposed branch prediction schemes are on a number of interpreters, how the mispredictions affect the performance of the interpreters and how two different interpreter implementation techniques perform with various branch predictors. We also suggest various ways in which hardware designers, C compiler writers, and interpreter writers can improve the performance of interpreters.}
}

@TechReport{moore&leach70,
  author =   {Charles H. Moore and Geoffrey C. Leach},
  title =    {FORTH -- A Language for Interactive Computing},
  institution = {Mohasco Industries, Inc.},
  year =     {1970},
  address =  {Amsterdam, NY},
  url =      {http://www.dnai.com/~jfox/F70POST.ZIP}
}

@Article{moore87,
  author =   {Charles Moore},
  title =    {Forth -- eine persönliche Sprache},
  journal =  {Vierte Dimension},
  year =     {1987},
  volume =   {3},
  number =   {3},
  month =    oct,
  pages =    {11--13},
  note =     {Translated into German by Klaus Schleisiek, the original is probably in More on NC4000}
}

@InProceedings{kay96,
  author =   {Alan C. Kay},
  title =    {The Early History of Smalltalk},
  crossref = {hopl2},
  pages =    {511--579}
}

@Proceedings{hopl2,
  title =    {History of Programming Languages},
  booktitle = {History of Programming Languages},
  year =     {1996},
  key =      {HOPL-II},
  publisher = {ACM Press/Addison-Wesley}
}