Dynamic Superinstructions
The engines gforth and gforth-fast use another optimization: dynamic
superinstructions with replication. As an example, consider the
following colon definition:
: squared ( n1 -- n2 ) dup * ;
Gforth compiles this into the threaded code sequence
dup * ;s
In normal direct threaded code there is a code address occupying one
cell for each of these primitives. Each code address points to a
machine code routine, and the interpreter jumps to this machine code in
order to execute the primitive. The routines for these three
primitives are (in gforth-fast on the 386):
Code dup
( $804B950 )  add     esi , # -4  \ $83 $C6 $FC
( $804B953 )  add     ebx , # 4  \ $83 $C3 $4
( $804B956 )  mov     dword ptr 4 [esi] , ecx  \ $89 $4E $4
( $804B959 )  jmp     dword ptr FC [ebx]  \ $FF $63 $FC
end-code

Code *
( $804ACC4 )  mov     eax , dword ptr 4 [esi]  \ $8B $46 $4
( $804ACC7 )  add     esi , # 4  \ $83 $C6 $4
( $804ACCA )  add     ebx , # 4  \ $83 $C3 $4
( $804ACCD )  imul    ecx , eax  \ $F $AF $C8
( $804ACD0 )  jmp     dword ptr FC [ebx]  \ $FF $63 $FC
end-code

Code ;s
( $804A693 )  mov     eax , dword ptr [edi]  \ $8B $7
( $804A695 )  add     edi , # 4  \ $83 $C7 $4
( $804A698 )  lea     ebx , dword ptr 4 [eax]  \ $8D $58 $4
( $804A69B )  jmp     dword ptr FC [ebx]  \ $FF $63 $FC
end-code
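Listings of this kind can usually be reproduced interactively. The
following is a minimal sketch, assuming an engine and platform for
which see is able to disassemble primitives; the addresses and the
exact output format will differ between versions and machines.

```forth
\ Define the example word, then inspect it and one of its primitives.
: squared ( n1 -- n2 ) dup * ;

see squared   \ decompiles the colon definition
see dup       \ for a primitive: shows its machine code, where a
              \ disassembler is available for the platform
```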
With dynamic superinstructions and replication, the compiler does not
just lay down the threaded code, but also copies the machine code
fragments, usually without the jump at the end:
( $4057D27D )  add     esi , # -4  \ $83 $C6 $FC
( $4057D280 )  add     ebx , # 4  \ $83 $C3 $4
( $4057D283 )  mov     dword ptr 4 [esi] , ecx  \ $89 $4E $4
( $4057D286 )  mov     eax , dword ptr 4 [esi]  \ $8B $46 $4
( $4057D289 )  add     esi , # 4  \ $83 $C6 $4
( $4057D28C )  add     ebx , # 4  \ $83 $C3 $4
( $4057D28F )  imul    ecx , eax  \ $F $AF $C8
( $4057D292 )  mov     eax , dword ptr [edi]  \ $8B $7
( $4057D294 )  add     edi , # 4  \ $83 $C7 $4
( $4057D297 )  lea     ebx , dword ptr 4 [eax]  \ $8D $58 $4
( $4057D29A )  jmp     dword ptr FC [ebx]  \ $FF $63 $FC
The jump is appended only when a threaded-code control-flow change
happens (e.g., in ;s). This optimization eliminates many of the
dispatch jumps and makes the remaining ones much more predictable. The
speedup depends on the processor and the application; on the Athlon and
Pentium III this optimization typically produces a speedup by a factor
of 2.
The code addresses in the direct-threaded code are set to point to the
appropriate points in the copied machine code, in this example like
this:
primitive  code address
dup        $4057D27D
*          $4057D286
;s         $4057D292
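These addresses follow directly from the sizes of the copied fragments.
As a worked check (a sketch only: the sizes are counted from the byte
listings above, and the word names are made up for this example), each
code address is the previous one plus the size of the previous fragment
with its final jmp dropped:

```forth
\ Derive the code addresses in the table from the fragment sizes.
\ All names here are hypothetical; the numbers come from the listings.
$4057D27D constant copy-start  \ where the copied code for squared begins
$9 constant dup-size           \ dup fragment: 3+3+3 bytes, jmp dropped
$C constant mul-size           \ * fragment: 3+3+3+3 bytes, jmp dropped

copy-start           constant dup-addr    \ code address for dup
dup-addr dup-size +  constant mul-addr    \ code address for *
mul-addr mul-size +  constant semis-addr  \ code address for ;s

\ hex. is a Gforth word that prints a number in hex with a $ prefix.
dup-addr hex. mul-addr hex. semis-addr hex. cr
\ prints: $4057D27D $4057D286 $4057D292
```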
Thus there can be threaded-code jumps to any place in this piece of code. This also simplifies decompilation quite a bit.
You can disable this optimization with --no-dynamic. You can use the
copying without eliminating the jumps (i.e., dynamic replication, but
without superinstructions) with --no-super; this gives the branch
prediction benefit alone. The effect on performance depends on the CPU;
on the Athlon and Pentium III the speedup is a little less than for
dynamic superinstructions with replication.
One use of these options is patching the threaded code. With superinstructions, many of the dispatch jumps are eliminated, so patching often has no effect; these options preserve all the dispatch jumps.
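Since the performance effect of --no-dynamic and --no-super depends on
the CPU, it can be worth measuring it on your own machine. The
following is a rough sketch, not a careful benchmark: the file name
bench.fs, the words bench and time-bench, and the loop count are
assumptions made for this example; utime is a Gforth word that returns
the time in microseconds as a double-cell number.

```forth
\ bench.fs (hypothetical file name): time a loop that calls squared.
\ Run it once per configuration and compare the printed times, e.g.:
\   gforth-fast bench.fs -e bye
\   gforth-fast --no-super bench.fs -e bye
\   gforth-fast --no-dynamic bench.fs -e bye

: squared ( n1 -- n2 ) dup * ;

\ Call squared ten million times and discard the results.
: bench ( -- ) 10000000 0 ?do i squared drop loop ;

\ Print the elapsed time of one bench run in microseconds.
: time-bench ( -- )
  utime bench utime      \ start and end time as doubles
  2swap d- d. ." microseconds" cr ;

time-bench
```

The loop overhead is included in the measured time, so only the
differences between the runs are meaningful, not the absolute numbers.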
On some machines dynamic superinstructions are disabled by default,
because they are unsafe on these machines. However, if you feel
adventurous, you can enable them with --dynamic.