Threaded Code

What is threaded code? What is it good for? What are the differences between the various threading techniques? How do I implement threaded code portably? How fast are various threading techniques?

What is Threaded Code Good for?

Threaded code is a technique for implementing virtual machine interpreters. There are various ways to implement interpreters; some of the more popular are direct string interpretation, compilation into a tree and interpreting that tree, and compilation into virtual machine code and interpreting that code. If you are interested in performance, the virtual machine approach is the way to go (because fetching and decoding are simpler, and therefore faster). If you are not interested in performance (yet), you may still want to consider the virtual machine approach, because it is often as simple as the others.

Threaded code, in its original meaning [bell73], is one of the techniques for implementing virtual machine interpreters. Nowadays, at least the Forth community uses the term threading for almost any technique used for implementing Forth's virtual machine.

What is Threaded Code?

Let's look at a piece of straight-line code consisting of the virtual machine instructions A, B and C. We could write machine-level subroutines Ar, Br and Cr for performing the action of the virtual machine instructions. Then we could write the following machine code to execute these virtual machine instructions:
call Ar
call Br
call Cr
This is known as subroutine-threaded code, although it is not threaded code in the original sense; in fact, subroutine threading is not even an interpretive technique.

Now, let's eliminate the calls:

Ar
Br
Cr
i.e., a sequence of code addresses that represent the virtual machine instructions.

As a consequence, we cannot execute this piece of code by jumping to its start. We also have to keep track of the pointer to the current instruction in a separate register (instead of using the processor's program counter register and return address stack/register).

How do we execute the next instruction? Let's assume that the instruction pointer register (ip) always points to the word in the code sequence that directly follows the current instruction word. Then we just have to load the word, jump to the routine pointed to by it, and increment the instruction pointer. E.g., in MIPS assembly language:

lw   $2,0($4) #get next instruction, $4=inst.ptr.
addu $4,$4,4  #advance instruction pointer
j    $2       #execute next instruction
#nop          #branch delay slot
This routine is the interpreter (aka inner interpreter in the Forth community). It is also known as the NEXT routine. Either every virtual machine instruction routine has a copy of NEXT at the end, or they all share one copy of NEXT and jump to it. On modern processors, the shared NEXT not only costs a jump, but also dramatically increases the misprediction rate of the indirect jump on CPUs with BTBs and similar indirect branch predictors [ertl&gregg03jilp]. Therefore, I recommend not sharing NEXT.

The method described above is the one described in [bell73] as threaded code and has later been called direct threaded code.

Note that, in contrast to popular myths, subroutine threading is usually slower than direct threading. But it is a starting point for native code generation.

Threading Techniques

Several variations on the scheme were developed:

Indirect Threaded Code

Let's consider constants: They can be represented by the virtual machine instruction lit (for literal), followed by the value of the constant. If the constant occurs frequently, it is more space-efficient to have a virtual machine instruction for this constant. If we have several constants, the code for their virtual machine instructions will look very similar. So, we may want to share the code, while having separate data.
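
To illustrate the duplication, here is a sketch of both options under direct threading, using the GNU C labels-as-values notation introduced later in this article (the data-stack pointer sp and the instruction names are my assumptions, not from the original):

/* inside the engine function; GNU C only */
typedef void *Inst;
Inst *ip;              /* instruction pointer */
long *sp;              /* assumed VM data-stack pointer */
#define NEXT goto **ip++

lit:                   /* generic: ... &&lit, (Inst)5, ... */
  *--sp = (long)*ip++; /* fetch the inline operand and skip over it */
  NEXT;
five:                  /* dedicated: ... &&five, ... */
  *--sp = 5;           /* this code is repeated for every such constant */
  NEXT;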

We can do this by adding a level of indirection in NEXT (resulting in indirect threaded code). Each word (a generalization of the virtual machine instruction) has a code field and a parameter field. E.g., in the case of the constant:

code-field:      docon #code address of shared CONstant code
parameter-field: value #value of constant
The virtual machine code is now represented by a sequence of code field addresses, not code addresses. Simple virtual machine instructions (primitives) are typically represented like this:
code-field2:      parameter-field2
parameter-field2: code #machine code of the primitive
The NEXT of indirect threaded code looks like this in MIPS assembly language:
lw   $2,0($4) #get the code-field address of the next word, $4=inst.ptr.
#nop          #load delay slot
lw   $3,0($2) #get the code address of the word
addu $4,$4,4  #advance instruction pointer
j    $3       #execute code for the word
#nop          #branch delay slot
The code for the next instruction can compute the parameter-field address from the code-field address in $2.

Forth and Direct Threading

Traditionally Forth is implemented using indirect threading. Therefore, direct threaded Forth implementations have much in common with indirect threaded implementations: Non-primitives have a code field, but it now contains a jump to the code instead of its address. On most processors this jump costs more time than the additional load of indirect threading, so direct threading only pays off when primitives are executed. The resulting speedup is 2%-8% on the 486.
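
In the notation used above for the constant, this looks like:

code-field:      jump docon #jump to the shared CONstant code
parameter-field: value      #value of constant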

(Note: On the Pentium, the K5 and the K6(-2) it is extremely expensive to mix code and data, so this form of direct threading is much slower on these processors than indirect threading.)

Token Threaded Code

Direct threaded code cannot simply be transported from one machine to another, because it contains code addresses, which vary from machine to machine. Token threaded code uses a fixed virtual machine instruction encoding, allowing code portability at the price of a table lookup (mapping the instruction token to its code address) in each NEXT. Indirect threading and token threading are orthogonal, so they can be combined into indirect token threading (with an even slower NEXT).
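
Here is a sketch of a token-threaded NEXT in GNU C (labels as values, introduced below); the dispatch table and the instruction names are my assumptions. With one-byte tokens like these, this is also the byte code variant mentioned below:

typedef unsigned char Token; /* fixed, machine-independent encoding */
Token *ip;                   /* instruction pointer into the token code */
/* inside the engine function: */
static void *dispatch[] = { &&add, &&lit /* ... */ };
#define NEXT goto *dispatch[*ip++] /* table lookup in every NEXT */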

Other Terms

Switch Threading
A method used for implementing virtual machine interpreters in languages like C. See below.
Call Threading
Another method used in languages like C. See below.
Segment Threading
A technique used for the 8086 architecture. The code consists of a sequence of segments instead of a sequence of code addresses. This allows using the whole address space of the 8086 for threaded code with sixteen-bit "pointers", but requires 16-byte alignment for all words.
Native Code
This is not a threading technique. This is the term used in the Forth community and elsewhere for implementations that generate machine code instead of some interpreted code. Simple native code systems start with subroutine-threaded code, then inline and peephole-optimize some subroutines. Related terms are true compiler, just-in-time compiler etc.
Byte code
Each virtual machine instruction is represented by one byte. This can be seen as a variant of token threading.

How do I Implement Threaded Code Portably?

Many languages don't offer a way to produce indirect jumps, so they cannot be used for implementing direct or indirect threaded code. This section presents the available options. You can find more on this topic in [ertl95pldi] and [ertl93].

Here I present variants corresponding to direct threading. Add an indirection in NEXT to get the indirect threaded version.

GNU C's Labels as Values

This is one of GNU C's extensions to standard C, and it is currently the most portable method for implementing threaded code. A similar feature is FORTRAN's assigned goto (not the computed goto, which is more similar to C's switch statement).

In GNU C, a direct threaded NEXT looks like this:

typedef void *Inst;
Inst *ip; /* you should use a local variable for this */
#define NEXT goto **ip++
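
A complete direct threaded engine in this style might look as follows. This is a runnable sketch (GNU C only); the toy virtual machine, its single register acc, and the instruction names are my assumptions, not part of the original:

#include <stdio.h>

typedef void *Inst;

static long engine(void)
{
  long acc = 0;                /* toy VM state: a single register */
  static Inst program[] = { &&inc, &&inc, &&halt };
  Inst *ip = program;          /* local variable, hopefully in a register */
#define NEXT goto **ip++

  NEXT;                        /* start executing the threaded code */
inc:
  acc++;
  NEXT;                        /* every instruction has its own NEXT */
halt:
  return acc;
}

int main(void)
{
  printf("%ld\n", engine());   /* prints 2 */
  return 0;
}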

Continuation-passing style

In continuation-passing style, all calls are tail-calls, which can be optimized into jumps. Thus, we could implement the indirect jumps of threaded code as indirect tail-calls by writing the interpreter in continuation-passing style and using a language or compiler that guarantees tail-call optimization (Scheme? ML?). In C (which, AFAIK, has no compilers that guarantee tail-call optimization), direct threading with this technique would look like this:
typedef void (* Inst)();

void inst1(Inst *ip /* , other regs */)
{
  ...
  (*ip)(ip+1 /* , other registers */);
}
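
C cannot express the self-referential function-pointer type directly, so a struct wrapper is a common workaround. Here is a runnable sketch along these lines (the toy VM and all names are my assumptions); whether the tail-calls actually become jumps depends on the compiler: e.g., gcc -O2 usually performs sibling-call optimization, but no C compiler guarantees it:

#include <stdio.h>

/* wrap the function pointer in a struct to make the type self-referential */
typedef struct inst { void (*fn)(const struct inst *ip, long acc); } Inst;

/* NEXT: tail-call the routine of the next instruction,
   passing the VM registers (here just acc) as arguments */
static void next(const Inst *ip, long acc) { ip->fn(ip + 1, acc); }

static void inc (const Inst *ip, long acc) { next(ip, acc + 1); }
static void halt(const Inst *ip, long acc) { printf("%ld\n", acc); }

int main(void)
{
  static const Inst program[] = { {inc}, {inc}, {halt} };
  next(program, 0);            /* prints 2 */
  return 0;
}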

Switch Threading

Switch threading (aka switch dispatch) uses the C switch statement (or similar statements in other languages, e.g., Pascal's CASE). The result has the same advantages as token threaded code; the disadvantage over token threaded code is the lower speed due to the range check generated by most (all?) C compilers for the switch; moreover, with the right encoding token threaded code can avoid scaling, while the C compilers I have seen do not avoid it.

The main performance drawback of switch threading on modern CPUs is that it uses one shared indirect branch; this leads to close to 100% mispredictions in the BTB (branch target buffer) used for indirect branch prediction on many modern CPUs (e.g., Athlon 64, Pentium 4, PPC 970), whereas threaded code with separate NEXTs has only about 50% mispredictions [ertl&gregg03jilp].

Switch threading in C looks like this:

typedef enum {
  inst1 /* ... */
} Inst;

void engine()
{
  static Inst program[] = { inst1 /* ... */ };

  Inst *ip = program;

  for (;;)
    switch (*ip++) {
    case inst1:
      ...
      break;
    ...
    }
}

Separate NEXTs, and the resulting reduction in BTB mispredictions and increased performance, can be implemented with switch, too, by replicating the switch, as follows:

typedef enum {
  inst1 /* ... */
} Inst;

void engine()
{
  static Inst program[] = { inst1 /* ... */ };

  Inst *ip = program;

  switch (*ip++) {
  case inst1: goto label1;
    ...
  }

  label1:
    ...
    switch (*ip++) {
    case inst1: goto label1;
      ...
    }
  ...
}
While this technique gives the same branch prediction advantages as the separate NEXTs in threaded code, it does not eliminate the other performance disadvantages of switch threading.

Call Threading

Call threading uses indirect calls (which most programming languages have) instead of indirect jumps. For every call, there must also be a return (except for optimized tail-calls, which don't occur here). So the cost of this method is that of using a call-return sequence instead of a simple jump.

Moreover, each virtual machine instruction is represented by a function (procedure). This is very elegant to look at, and allows separate compilation, but it means that global variables have to be used for the virtual machine registers (e.g., the instruction pointer ip), which most (all?) compilers allocate into memory; in contrast, with switch threading you can use local variables, which are (hopefully) allocated into registers.

Call threading looks like this:

typedef void (* Inst)();

Inst *ip;

void inst1()
{
  ...
}

void engine()
{
  for (;;)
    (*ip++)();
}
Note: Call threading is a term I have used for this technique. There seems to be no established term yet. I have seen subroutine threading used for this technique, but that term has a well-established and different meaning (see above).

History

Charles Moore invented (indirect) threaded code in 1970, while working at the National Radio Astronomy Observatory (NRAO), according to [moore87]. The first publication is [bell73], which was received by CACM in June 1971; this paper does not contain any historical information.

Interestingly, the system described in [moore&leach70] (before Moore worked at NRAO) does not use threaded code, but string interpretation. However, it uses a code field, i.e., it has the seed for indirect threading. Then again, code fields were folklore at that time [kay96].

References

@string{cacm="Communications of the ACM"}

@Article{bell73,
  author = 	 "James R. Bell",
  title = 	 "Threaded Code",
  journal =	 cacm,
  year =	 1973,
  volume =	 16,
  number =	 6,
  pages =	 "370--372"
}

@Article{dewar75,
  author = 	 {Robert B.K. Dewar},
  title = 	 {Indirect Threaded Code},
  journal = 	 cacm,
  year = 	 {1975},
  volume =	 {18},
  number =	 {6},
  month =	 jun,
  pages =	 {330--331}
}

@string{ieeecomputer="Computer"}

@Article{kogge82,
  author = 	 "Peter M. Kogge",
  title = 	 "An Architectural Trail to Threaded-Code Systems",
  journal =	 ieeecomputer,
  year =	 "1982",
  pages =	 "22--32",
  month =	 mar,
  annote =	 "Explains the design of (a classical
		  implementation of) Forth, starting with threaded
		  code, then adding the parameter stack, constants,
		  variables, control structures, dictionary, outer
		  interpreter and compiler."
}

@InProceedings{ertl93,
  author = 	 "M. Anton Ertl",
  title = 	 "A Portable {Forth} Engine",
  booktitle =	 "EuroFORTH '93 conference proceedings",
  year =	 "1993",
  address =	 "Mari\'ansk\'e L\'azn\`e (Marienbad)",
  url =		 "http://www.complang.tuwien.ac.at/papers/ertl93.ps.Z",
  abstract = 	 "The Forth engine discussed in this paper is written
		  in GNU C, which provides several extensions that are
		  important for Forth implementation. The indirect
		  threaded Forth engine is completely
		  machine-independent, direct threading requires a few
		  machine-specific lines for each machine. Using
		  a portable language like GNU C encourages producing
		  an engine with many primitives. In order to make the
		  development of primitives easier and less
		  error-prone, an automatic tool generates most of the
		  code for a Forth primitive from the stack effect
		  notation, even if the TOS is kept in a register. The
		  engine is combined with the parts of the system
		  written in Forth by loading a machine-independent
		  image file that contains the executable Forth code
		  in relocatable form."
}

@InProceedings{ertl95pldi,
  author =       "M. Anton Ertl",
  title =        "Stack Caching for Interpreters",
  crossref =     "sigplan95",
  pages =	 "315--327",
  url =		 "http://www.complang.tuwien.ac.at/papers/ertl95pldi.ps.gz",
  abstract =     "An interpreter can spend a significant part of its
                  execution time on arguments of virtual machine
                  instructions. This paper explores two methods to
                  reduce this overhead for virtual stack machines by
                  caching top-of-stack values in (real machine)
                  registers. The {\em dynamic method} is based on
                  having, for every possible state of the cache, one
                  specialized version of the whole interpreter; the
                  execution of an instruction usually changes the
                  state of the cache and the next instruction is
                  executed in the version corresponding to the new
                  state. In the {\em static method} a state machine
                  that keeps track of the cache state is added to the
                  compiler. Common instructions exist in specialized
                  versions for several states, but it is not necessary
                  to have a version of every instruction for every
                  cache state. Stack manipulation instructions are
                  optimized away."
}

@Proceedings{sigplan95,
  booktitle =    "SIGPLAN '95 Conference on Programming Language
                  Design and Implementation",
  title =        "SIGPLAN '95 Conference on Programming Language
                  Design and Implementation",
  year =         "1995",
  key =          "SIGPLAN '95"
}

@Article{ertl&gregg03jilp,
  author =	 {M. Anton Ertl and David Gregg},
  title =	 {The Structure and Performance of \emph{Efficient}
                  Interpreters},
  journal =	 {The Journal of Instruction-Level Parallelism},
  year =	 {2003},
  volume =	 {5},
  month =	 nov,
  url =         {http://www.complang.tuwien.ac.at/papers/ertl%26gregg03jilp.ps.gz},
  url2 =	 {http://www.jilp.org/vol5/v5paper12.pdf},
  note =	 {http://www.jilp.org/vol5/},
  abstract =	 {Interpreters designed for high general-purpose
                  performance typically perform a large number of
                  indirect branches (3.2\%--13\% of all executed
                  instructions in our benchmarks). These branches
                  consume more than half of the run-time in a number
                  of configurations we simulated. We evaluate how
                  accurate various existing and proposed branch
                  prediction schemes are on a number of interpreters,
                  how the mispredictions affect the performance of the
                  interpreters and how two different interpreter
                  implementation techniques perform with various
                  branch predictors. We also suggest various ways in
                  which hardware designers, C compiler writers, and
                  interpreter writers can improve the performance of
                  interpreters.}
}

@TechReport{moore&leach70,
  author = 	 "Charles H. Moore and Geoffrey C. Leach",
  title = 	 "FORTH -- A Language for Interactive Computing",
  institution =  "Mohasco Industries, Inc.",
  year = 	 "1970",
  address =	 "Amsterdam, NY",
  url = 	 "http://www.dnai.com/~jfox/F70POST.ZIP"
}

@Article{moore87,
  author = 	 {Charles Moore},
  title = 	 {Forth -- eine persönliche Sprache},
  journal = 	 {Vierte Dimension},
  year = 	 {1987},
  volume =	 {3},
  number =	 {3},
  month =	 oct,
  pages =	 {11--13},
  note =	 {Translated into German by Klaus Schleisiek, the 
		  original is probably in More on NC4000},
}

@InProceedings{kay96,
  author = 	 "Alan C. Kay",
  title = 	 "The Early History of Smalltalk",
  crossref =	 "hopl2",
  pages = 	 "511--579",
}

@Proceedings{hopl2,
  title = 	 {History of Programming Languages},
  booktitle = 	 {History of Programming Languages},
  year = 	 1996,
  key =		 {HOPL-II},
  publisher =	 {ACM Press/Addison-Wesley}
}

Anton Ertl