The main difference is where the steps between the latencies occur: under Linux the first step is at 8KB, whereas under DU it is at 16KB. Furthermore, under Linux there is a gradual, somewhat jagged rise from L2 latency to main memory latency, starting at around 128KB, whereas under DU there is a well-defined step at 2MB.
How do I explain this?
The explanation for the difference in the 8KB-16KB range is that Linux (as we ran it) used the 21064A in a mode where only 8KB of the D-cache is used. In the meantime I have developed a way to switch the D-cache to 16KB. You can see that this is the cause of the difference by looking at the lmbench results with 8KB and 16KB D-cache for another machine (with 512KB L2 cache).
It looks to me as if the differences at the larger array sizes result from the combination of low-associativity caches and different page allocation policies for virtual memory; if the array were contiguous in physical memory, the Linux graphs would look like the DU graphs. However, with paged memory any two pages may get physical addresses that map to the same page-sized group of cache sets, so they evict each other even though the array as a whole would fit in the cache. Apparently DU has a special page allocation policy (e.g., page colouring) to achieve more predictable behaviour. AFAIK Linux allocates pages arbitrarily.
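As an illustration of the page-colouring argument, here is a minimal Python sketch. It assumes 8KB pages and a direct-mapped 2MB L2 cache (the size is inferred from the DU step, and the associativity is an assumption); the function names and the 64MB memory size are made up for this example and do not correspond to any kernel code. It only counts how many pages of an array collide in the cache when page frames are consecutive versus picked at random.

```python
import random

PAGE_SIZE = 8 * 1024                 # Alpha page size
L2_SIZE = 2 * 1024 * 1024            # assumption: 2MB direct-mapped L2
NUM_COLOURS = L2_SIZE // PAGE_SIZE   # number of page-sized slots in the cache


def page_colour(frame):
    """Cache slot (colour) that a physical page frame number maps to."""
    return frame % NUM_COLOURS


def conflicting_pages(frames):
    """Count pages whose colour is already taken by an earlier page."""
    seen = set()
    conflicts = 0
    for frame in frames:
        colour = page_colour(frame)
        if colour in seen:
            conflicts += 1
        else:
            seen.add(colour)
    return conflicts


def contiguous_allocation(n_pages, first_frame=0):
    """Consecutive frames: the result an ideal page-colouring policy gives."""
    return list(range(first_frame, first_frame + n_pages))


def arbitrary_allocation(n_pages, total_frames, seed=0):
    """Distinct frames picked at random: a stand-in for an arbitrary policy."""
    return random.Random(seed).sample(range(total_frames), n_pages)


if __name__ == "__main__":
    total_frames = 64 * 1024 * 1024 // PAGE_SIZE   # hypothetical 64MB machine
    for kb in (64, 128, 512, 2048):
        n = kb * 1024 // PAGE_SIZE
        print(kb, "KB:",
              conflicting_pages(contiguous_allocation(n)), "conflicts (contiguous),",
              conflicting_pages(arbitrary_allocation(n, total_frames)), "(arbitrary)")
```

With consecutive frames there are no conflicts until the array exceeds the cache size, which corresponds to a sharp step. With random frames some conflicts tend to appear well before that point, which would be consistent with a gradual rise; the model ignores everything else that influences the measured latencies.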
Does this affect application performance? Hard to tell. I measured a speedup of 5% in our LaTeX benchmark when switching from 8KB to 16KB of D-cache. The effects of page allocation policies are probably visible mainly for programs that use large, regular data structures and are optimized for page colouring. The main part of the difference between Linux and DU in the LaTeX benchmark is probably due to using different TeX and LaTeX versions, different compilers, and different assemblers (DU's assembler, as, contains an instruction scheduler).