This page is derived from Section 4.5 of [Ert96].
The benchmarks we used were the ubiquitous Sieve (counting the primes <16384 a thousand times), bubble sort (6000 integers), and matrix multiplication (200*200 matrices). The latter two come from the Stanford integer benchmarks (originally in Pascal, but also available in C; the C version and Martin Fraeman's original translations to Forth can be found at ftp.complang.tuwien.ac.at/pub/forth/stanford-benchmarks.tar.gz); they were translated into Forth by Martin Fraeman and are included in the TILE Forth package. These three benchmarks share one disadvantage: they perform unusually few calls. To benchmark calling performance, we also computed the 34th Fibonacci number using a recursive algorithm with exponential run-time complexity. You can find the benchmarks in Forth, C and forth2c-generated C at http://www.complang.tuwien.ac.at/forth/bench.tar.gz.
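For illustration, a naively recursive Fibonacci of the kind used in such benchmarks looks roughly like this in C (a sketch only; the actual benchmark sources in Forth and C are in bench.tar.gz):

/* Sketch of a naively recursive Fibonacci: exponential run time and
   almost nothing but calls.  Illustrative only; the actual benchmark
   sources are in bench.tar.gz. */
long fib(long n)
{
    if (n < 2)
        return 1;                      /* conventional benchmark base case */
    return fib(n - 1) + fib(n - 2);    /* two recursive calls per level */
}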
The measured systems fall into several classes: Forth-to-C translators, native-code compilers, Forth interpreters written in assembly language, and Forth interpreters written in C.
Win32Forth, NT Forth and its NCC were benchmarked under Windows NT, bigForth and iForth under DOS/GO32, all other systems under Linux. The results for Win32Forth, NT Forth and its NCC were provided by Kenneth O'Heskin, those for iForth and eForth (with and without peephole optimization of the intermediate code) by Marcel Hendrix. All measurements were performed on PCs with an Intel 486DX2/66 CPU, a 256K secondary cache, and similar memory performance. The times given are the median of three measurements of the user time (the system time is negligible anyway).
relative time         sieve   bubble   matmul      fib
f2c opt.               1.00     1.00     1.00     1.00
Timbre                    -        -        -     3.14
f2c no opt             7.03     8.28     9.35     4.92
hand-coded             0.86     0.87     1.10     1.00
bigForth               1.87     2.34     3.02     1.37
iForth                 2.16     2.32     2.19     1.32
mxForth                2.31     2.20     2.31     0.95
NT Forth NCC           1.27     7.12     4.14     1.47
FLK                    1.47     1.61     1.99     0.83
Gforth                 5.05     6.25     5.92     3.79
Win32Forth             7.99     9.69     9.87     6.62
NT Forth               6.56    10.41     9.09     5.81
eforth                 8.00    10.94     9.80     5.31
eforth +opt            4.87     6.49     4.95     3.76
PFE                    9.09    11.11    10.59     7.56
ThisForth             18.32        -        -    12.99
TILE                  49.42    28.67    27.41    18.68
abs. time f2c opt.     5.19s    4.79s    4.02s    7.96s
The table above (and a picture of the same data) shows the time that the systems need for the benchmarks, relative to the time of f2c opt (or, in other words, the speedup factor that our translator (and GCC) achieves over the other systems). Entries marked "-" indicate that I did not succeed in running the benchmark on the system.
The code produced by Timbre's Forth-to-C translator is slow, as expected (since Timbre does not support DO..LOOP and friends, I could only measure fib, but I think this result is representative). Combining our translator with a non-optimizing GCC results in code that is even slower than the interpretive Gforth system. Hand-coded C is between 14% faster and 10% slower than the output of our Forth-to-C translator. I was a little surprised by the matrix multiplication result, where the code translated to Forth and back was faster than the original. Closer inspection showed that, in translating from C to Forth, an optimization had been performed that the C compiler does not perform (it is an interprocedural optimization that requires interprocedural alias analysis) and that reduced the number of memory accesses; this is probably responsible for the speedup, perhaps combined with vagaries such as instruction cache alignment.
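To illustrate the kind of optimization involved (a hypothetical sketch, not the actual Stanford code): if the C compiler cannot prove that the result matrix passed to the multiplication routine does not alias its inputs, it has to keep the running sum in memory throughout the inner loop; accumulating in a local variable, which is roughly what keeping the intermediate result on the Forth data stack amounts to, eliminates those memory accesses.

/* Hypothetical illustration of why alias information matters here;
   this is not the actual benchmark code. */
#define N 200

/* If c[i][j] might alias elements of a or b, the compiler must update
   the sum through memory on every iteration of the inner loop. */
void matmul_mem(long a[N][N], long b[N][N], long c[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            c[i][j] = 0;
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
}

/* Accumulating in a local variable (roughly what keeping the sum on
   the Forth data stack achieves) keeps the inner loop free of stores. */
void matmul_local(long a[N][N], long b[N][N], long c[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            long sum = 0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}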
BigForth and iForth achieve a speedup of about 3 over the fastest interpreters, but there is still a lot of room for improvement (at least a factor of 1.3-3). Even on the fib benchmark, which should be the strong point of Forth compilers, f2c opt was better. The results of NT Forth NCC are a bit worse on average and show a large variance (a speedup of 1-4 over Gforth, 1.2-7.2 times slower than f2c opt). These results show that research into better native code generation techniques is not a waste of time and that there is still a lot to be gained in this area. They also show that the following statement has not yet become outdated: "The resulting machine code is a factor of two or three slower than the equivalent code written in machine language, due mainly to the use of the stack rather than registers." [Ros86]
The interpretive systems written in assembly language (except eforth opt) are, surprisingly, slower than Gforth. One important reason for the disappointing performance of these systems is probably that they are not written optimally for the 486 (e.g., they use the lods instruction). eforth opt demonstrates that peephole optimization of the intermediate code offers substantial gains: eforth opt is 3.5-6.5 times slower than f2c opt. Even better results can be expected when such optimization is applied to a more efficient base interpreter (e.g., Gforth).
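A minimal sketch of what peephole optimization of intermediate code means (hypothetical, not eforth's actual optimizer): scan the compiled token sequence and replace frequent pairs of primitives with a single combined primitive, so that each replaced pair costs only one dispatch at run time.

/* Hypothetical peephole pass over intermediate code (not eforth's
   actual optimizer): "dup +" is replaced by a combined primitive that
   doubles the top of stack.  Tokens with inline operands are omitted
   for brevity. */
#include <stddef.h>

enum token { DUP, ADD, SWAP, TWO_STAR, END };    /* TWO_STAR == "dup +" */

static size_t peephole(enum token *code, size_t len)
{
    size_t in = 0, out = 0;
    while (in < len) {
        if (in + 1 < len && code[in] == DUP && code[in + 1] == ADD) {
            code[out++] = TWO_STAR;              /* one dispatch instead of two */
            in += 2;
        } else {
            code[out++] = code[in++];
        }
    }
    return out;                                  /* new, shorter code length */
}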
Gforth is 3.5-6.5 times slower than f2c opt, PFE 7.5-11 times. The slowdown of PFE with respect to Gforth can be explained by its self-imposed restriction to standard C (although the measured configuration of PFE uses one GNU C extension: global register variables), which makes efficient threading impossible; PFE uses indirect call threading (see the sketch below). ThisForth and TILE were obviously written with a certain disregard for efficiency issues and for the limited optimization abilities of state-of-the-art C compilers, resulting in a slowdown factor of more than 49 for TILE on the Sieve.
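To illustrate the threading issue (a minimal, hypothetical sketch in GNU C, not code taken from Gforth or PFE): with GNU C's labels-as-values extension, an interpreter can jump directly from the end of one primitive to the code of the next, whereas an interpreter restricted to standard C must dispatch each primitive through an indirect function call (or a switch), paying for a call and a return per executed primitive.

/* Hypothetical sketch of the two dispatch techniques; not taken from
   Gforth or PFE. */
#include <stdio.h>

static long stack[16];
static int sp;                      /* number of items on the stack */

/* Indirect call threading (pure standard C, roughly PFE's situation):
   every primitive is a C function, so each executed primitive costs an
   indirect call plus a return in the dispatch loop. */
typedef void (*prim)(void);
static void lit1(void) { stack[sp++] = 1; }
static void add(void)  { sp--; stack[sp-1] += stack[sp]; }

static void run_calls(prim *ip)
{
    while (*ip != NULL)
        (*ip++)();                  /* fetch, call, return, repeat */
}

/* Direct threading with GNU C's labels-as-values extension (the kind
   of efficient threading that standard C cannot express): each
   primitive ends by jumping straight to the code of the next one. */
static void run_threaded(void)
{
    static void *code[] = { &&do_lit1, &&do_lit1, &&do_add, &&done };
    void **ip = code;
    goto *(*ip++);
do_lit1: stack[sp++] = 1;                   goto *(*ip++);
do_add:  sp--; stack[sp-1] += stack[sp];    goto *(*ip++);
done:    return;
}

int main(void)
{
    prim program[] = { lit1, lit1, add, NULL };
    sp = 0; run_calls(program);  printf("%ld\n", stack[0]);  /* 2 */
    sp = 0; run_threaded();      printf("%ld\n", stack[0]);  /* 2 */
    return 0;
}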
Code size and compile-time measurements:

             interp.      .o     size   compile   source        C
                size    size    ratio      time    lines    lines
sieve            418     272     1.54     1.10s       25      482
bubble          1020     748     1.36     1.60s       72     1100
matmul           784     412     1.90     1.40s       55      793
fib              140     140     1.00     0.90s       10      169
The code size measurements dispel another popular myth: that of the inherent size advantage of stack-architecture code and of the bloat produced by optimizing C compilers. While a comparison of a header-stripping 16-bit Forth with a RISC processor (whose code is about 50% bigger than CISC code) would give a somewhat different result, the reported size differences of more than an order of magnitude need a different explanation; differences in the functionality of the software and different software engineering practices come to mind.
For the compile time measurements, only the user time needed by GCC to compile and link the program is displayed; the system time was constant at 0.6s. The compilation to Gforth's interpreted code took a negligible amount of time; the translation to C also vanished in the measurement noise, although it was not written for speed and although the present implementation should be much slower than normal Forth compilation. The compile time data indicate that, after a startup time of about 1.4s (user+system), GCC compiles about 90 lines of Forth code (1500 lines of translator output) per second (e.g., for bubble, 1.60s user time plus 0.6s system time minus the 1.4s startup leaves about 0.8s for 72 lines of Forth, i.e., about 90 lines per second). Interestingly, less than one byte of machine code is generated per line of C code.
Martin Fraeman provided the C versions of the Stanford benchmarks. Kenneth O'Heskin kindly produced the benchmark results for Win32Forth, NT Forth, and NT Forth NCC. Marcel Hendrix provided the iForth and eForth results.