Memory Dependency Microbenchmark

This microbenchmark tests some aspects of how hardware deals with dependences through memory. It performs the following loop:
  for (i=0; i<250000000; i++) {
    *b = *a+1;
    *d = *c+1;
    *f = *e+1;
    *h = *g+1;
  }
The eight parameters of the binary let you determine which memory locations the pointers a b c d e f g h refer to, in particular whether some of them refer to the same memory location or not. Each parameter corresponds to one pointer, and two pointers refer to the same memory location iff the corresponding parameters are equal.
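To make the parameter-to-pointer mapping concrete, here is a minimal sketch of a driver with that behaviour. This is only an illustration, not the actual main.c/memdep.c; the buffer size and the spacing between distinct locations are arbitrary assumptions.

  #include <stdlib.h>

  long buf[8*16];   /* backing store; size and spacing chosen arbitrarily */

  /* the measured loop from above */
  static void memdep(long *a, long *b, long *c, long *d,
                     long *e, long *f, long *g, long *h)
  {
    long i;
    for (i=0; i<250000000; i++) {
      *b = *a+1;
      *d = *c+1;
      *f = *e+1;
      *h = *g+1;
    }
  }

  int main(int argc, char **argv)
  {
    long *p[8];
    int i;
    if (argc < 9)
      return 1;
    /* equal parameters yield identical pointers; different parameters
       yield pointers to different, 64-byte-spaced longs */
    for (i=0; i<8; i++)
      p[i] = buf + (atol(argv[i+1]) % 16) * 8;
    memdep(p[0], p[1], p[2], p[3], p[4], p[5], p[6], p[7]);
    return 0;
  }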

This microbenchmark uses pointers that have been in their registers for a long time, so it does not test speculative alias detection by the hardware.

I have measured a number of parameter combinations, each assigning values to the parameters in the order a b c d e f g h; they are labeled X, Y, Z, A, B, B1, B2, C, D, E, F, G, and H in the tables below.

The results for different CPU cores are shown below. The numbers are cycles per computation (one statement in the loop body); to get cycles per iteration, multiply by 4.
  A     B     C     D     E     F     G     H   microarchitecture CPU
 1.98 14.42  6.04  6.07  6.07  4.78  4.73  8.02 21264B (Alpha)
  X     Y     Z     A     B     B1    B2    C     D     E     F     G     H   microarchitecture CPU
 3.01  3.01  3.01  3.01  4.52  4.52  4.01  3.01  4.02  3.01  4.01  4.01  5.02 U74 JH7100 (RISC-V)
 3.01  3.02  3.02  3.01  6.02  6.01  5.27  3.01  5.27  5.01  5.38  5.39  7.01 PPC7447A iBook G4 (make TIME="/usr/bin/time -f %U 2>&1" MHZ=1066)
 3.25  3.25  3.25  3.25  4.00  4.00  3.75  3.25  3.75  3.25  3.75  3.50  4.00 Cortex-A53 Amlogic S905
 3.25  3.25  3.25  3.25  4.00  4.00  3.75  3.25  3.75  3.25  3.75  3.50  4.00 Cortex-A55 RK3588
 1.25  1.25  1.25  1.25  2.27  5.87  3.71  2.22  3.96  4.00  5.85  5.67  7.69 Cortex-A72 BCM2711
 1.75  1.75  1.75  1.75  3.24  4.25  3.49  1.75  2.74  2.99  3.74  3.74  5.00 Cortex-A73 Amlogic S922 (taskset -c 5 make TIME="/usr/bin/time -f %U 2>&1" MHZ=1800)
 1.72  1.50  1.50  1.50  1.77  1.89  1.16  1.63  2.98  3.02  4.11  4.10  5.51 Cortex-A76 RK3588
 1.06  1.03  1.03  1.04  1.67  2.18  1.44  1.50  3.00  3.00  4.50  4.51  6.00 IceStorm Apple M1 (variations from 6.0-8.75 for H)
 0.55  0.55  0.55  0.52  0.84  0.84  0.61  1.68  3.32  3.29  5.00  4.94  6.36 FireStorm Apple M1 (make TIME="/usr/bin/time -f %U 2>&1" MHZ=3228)
 3.00  3.00  3.00  3.00  3.00  3.00  3.00  3.00  3.00  3.00  3.00  3.00  3.00 Bonnell Atom 330
 2.00  2.00  2.05  2.00  3.07  4.25  3.50  2.00  3.52  2.96  5.09  5.46  6.70 Silvermont Celeron J1900
 1.25  1.25  1.25  1.25  1.82  1.82  1.51  2.01  3.46  3.48  4.79  4.75  6.42 Goldmont Celeron J3455
 1.25  1.25  1.25  1.25  1.68  1.68  1.50  1.67  3.00  3.00  4.50  4.50  6.00 Goldmont+ Celeron J4105
 1.21  1.00  1.00  1.00  1.53  1.55  1.19  1.49  3.00  3.00  4.50  4.50  6.11 Tremont Celeron N4500
 1.00  1.00  1.00  0.75  0.75  0.70  0.70  0.75  0.70  1.00  0.75  0.75  1.00 Gracemont Core i3-1315U (taskset -c 5 make CYCLES=cpu_atom/cycles:u/)
 2.00  1.17  1.17  1.17  2.05  2.84  2.07  1.25  3.30  3.68  4.50  4.50  6.93 K8 Athlon 64 X2 4600+
 1.17  1.17  1.27  1.23  2.47  2.91  2.18  1.20  3.28  3.07  4.63  4.73  6.48 K10 Phenom II X2 560
 1.25  1.25  1.25  1.33  1.76  1.63  1.34  1.64  4.34  4.16  6.24  6.36  8.96 Excavator Athlon X4 845
 1.00  1.00  1.00  1.00  1.26  1.26  1.04  2.00  4.00  4.00  6.01  6.00  8.00 Zen Ryzen 1600X
 1.00  1.00  1.00  1.00  1.19  1.19  1.00  2.00  4.00  4.00  6.00  6.00  8.00 Zen2 Ryzen 3900X
 1.00  1.00  1.00  0.75  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00 Zen3 Ryzen 5800X
 2.19  1.38  1.38  1.38  1.78  3.37  2.50  1.38  3.02  3.02  4.55  4.50  6.00 Penryn Xeon E5450 
 1.26  1.24  1.26  1.25  1.72  1.72  1.50  1.46  3.01  3.01  4.51  4.51  6.03 Nehalem Xeon X3460
 1.06  1.00  1.01  1.00  1.52  1.52  1.25  1.99  3.69  3.68  4.65  4.63  7.45 Sandy Bridge Xeon E3-1220
 1.00  1.00  1.00  1.00  1.26  1.26  1.05  1.57  3.03  3.00  4.50  4.50  6.05 Haswell Core i7-4790K
 1.00  1.00  1.00  1.00  1.12  1.21  1.04  1.43  2.90  2.84  4.03  4.05  5.53 Skylake Core i5-6600K
 1.00  1.00  1.00  0.78  0.94  0.91  0.85  1.01  1.81  1.13  2.00  2.00  2.25 Rocket Lake Xeon W-1370P
 1.00  1.00  1.00  0.81  0.81  0.81  0.81  0.75  0.81  0.84  0.82  0.86  1.00 Tiger Lake Core i5-1135G7
 1.00  1.00  1.00  0.55  0.75  0.65  0.65  0.78  0.68  0.67  0.66  0.65  0.65 Golden Cove Core i3-1315U (taskset -c 2 make CYCLES=cpu_core/cycles:u/)
CPU cores can be at different levels of sophistication, both in how they deal with aliases and in how they forward from stores to loads. And of course there is the difference between in-order execution and out-of-order execution.

Aliases

The simplest approach would be to assume that all memory accesses may refer to the same memory location. In that case we would expect that the core performs all parameter combinations as badly as H. Apart from Bonnell, none of the measured cores exhibits this behaviour, so obviously they are more sophisticated than that.

On the plus side for Bonnell, A-H takes as long (in cycles) as A does on other in-order cores (e.g., the U74); and looking at the Bonnell pipeline, I really wonder how they do that: Given a three-cycle data cache access followed by a 1-cycle ALU operation, the results require a -1-cycle store-to-load latency, whether the addresses are the same or not. Maybe there is a bypass from the store to one of the later load-unit stages if the addresses match (and if they don't, starting the load early does not hurt).
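Spelling that arithmetic out, with L denoting the effective store-to-load latency in the dependence chain:

  3 cycles (load) + 1 cycle (add) + L (store-to-load) = 3 cycles/computation, hence L = -1 cycle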

The next approach is to allow architecturally later loads to be performed at the same time as, or, with OoO, earlier than, stores to a different address. All tested cores seem to do this.

Finally, there is the issue of what to do when the load is to the same address as an architecturally preceding store. This brings us to the next section:

Store to load forwarding

The simplest approach is to wait until the data arrives in the cache and then load it from there. I dimly remember seeing 15 cycles per iteration for a loop that incremented a memory location in every iteration, but none of the cores measured above takes that long (although for Zen and Zen2 one might wonder).

The next approach is to let the load read the data from the store buffer if it is there (waiting for it to arrive if necessary). In this case the whole store-load sequence has a total latency that is not much higher than the load latency. Most cores measured here seem to do that. We see this best in the H results; e.g., Skylake has 4 cycles of load latency and 1 cycle of add latency, and we see 5.53 cycles of latency for store-load-add, meaning that the store contributes an average latency of 0.53 cycles. There are some complications when the load only partially overlaps the store (I should put an appropriate Chips and Cheese link here).

Finally, the core might detect that the loaded data comes from a store that has not yet retired, so the physical register used by the store can be directly renamed into the register of the load result, and the load does not need to access the store buffer or cache at all (what is the name of this feature?). As a result, in the ideal case we see only the 1-cycle latency of the add in case H. Among the measured cores, Zen3 and Tiger Lake exhibit this behaviour fully; Rocket Lake probably also does it, but either does not succeed in all cases (the different results for D and E speak for this theory), or there is an additional source of latency.

I expected that Firestorm (the performance core of the Apple M1) would also have this feature, but, looking at the results, it does not.

Memory Renaming

The test B1 is, as far as true dependences are concerned, the same as B. However, B uses different memory locations all the time, while B1 uses the same memory location most of the time. Can the next iteration overlap the accesses of this iteration, just as register renaming allows overlapping the execution of instructions that use the same architectural register name? If not, B1 will show worse results than B.
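The following generic sketch shows the kind of reuse pattern involved (this is not the actual B1 parameter combination, just an illustration): one location *t is written and read in every iteration, so overlapping iteration i+1 with iteration i requires the hardware to keep several in-flight versions of *t, much like register renaming keeps several versions of an architectural register.

  /* generic illustration, not the actual B1 combination */
  void reuse_one_location(long *t, long *x, long *y, long n)
  {
    long i;
    for (i=0; i<n; i++) {
      *t = x[i]+1;    /* a new version of *t in every iteration */
      y[i] = *t+1;    /* must read this iteration's version of *t */
    }
  }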

B2 is similar to B1, but there are two 2-computation dependency chains instead of one 4-computation chain.

Intel added memory renaming in Goldmont and Nehalem; AMD added it between K10 and Excavator (probably already with Bulldozer). Among the ARM microarchitectures I measured, only Firestorm (the Apple M1 P-core) has memory renaming.

On the PPC7447A, B1 is as fast as B, but both are so close to the H case that it seems more likely that the PPC7447A has very limited out-of-order capabilities and does not reorder enough to allow more overlapping for B than for B1, even without memory renaming.

In-order vs. out-of-order execution

With in-order execution (on the U74, Cortex-A53, Cortex-A55, and Bonnell), the loads cannot be executed before architecturally earlier stores, even if both refer to different memory locations. So even A is relatively slow on in-order cores. In-order execution also means that B is almost as slow as H, while with out-of-order execution B can theoretically be executed as fast as A (in practice it is slightly slower, but still much faster than H).

With OoO, we see much better performance in cases where there are independent computation chains. Given the size of the buffers in the various OoO microarchitectures (hundreds of instructions in the reorder buffer, dozens in the schedulers) and the small size of each iteration (~15 instructions), it is surprising that B is slower than A; even D is slower than E on some microarchitectures, most notably Rocket Lake.

Measuring your own hardware

You can download the contents of this directory and run the benchmark with
  wget http://www.complang.tuwien.ac.at/anton/memdep/memdep.zip
  unzip memdep.zip
  cd memdep
  make
If you are working on a machine without working perf, you can use the following instead of the make in the last line:

  make TIME="/bin/time -f %U 2>&1" MHZ=...
where you have to replace the ... with the actual (not base) clock rate of the processor you are measuring. If you want to do your own parameter combinations, you can run the binary with
  ./memdep-`uname -m` a b c d e f g h
where a b c d e f g h are integers and correspond to the pointers in the loop. If you want to get results like in the table above, run it like this:
  perf stat --log-fd 3 -x, -e $(CYCLES) ./memdep-$(ARCH) a b c d e f g h 3>&1 | awk -F, '{printf("%5.2f\n",$$1/1000000000)}'
This line is in Makefile syntax: when running it directly in a shell, replace $(CYCLES) with the cycles event of your CPU (e.g., cycles), replace $(ARCH) with the output of uname -m, and write $1 instead of $$1 in the awk program. The division by 1000000000 yields cycles per computation, because the loop performs 250000000 iterations with 4 computations each.
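For example, on an x86-64 machine you could compare a setup in which no two pointers alias with one in which all eight pointers refer to the same location (the parameter values themselves are arbitrary; only their equality matters):
  ./memdep-x86_64 1 2 3 4 5 6 7 8
  ./memdep-x86_64 1 1 1 1 1 1 1 1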

Future work

Make a good microbenchmark that produces the addresses so late that the core either has to wait for the addresses or has to speculate on whether the load accesses the same address as the store. Do it for predictable aliasing and for unpredictable aliasing. I remember a good article about this kind of predictor (one that actually looked at Intel's patent), but I don't remember the URL; Intel calls the technique, as implemented in Ice Lake and later P-cores, Fast Store Forwarding Predictor (FSFP) (but my memory says that the article I read looked at older microarchitectures that have a similar feature). AMD describes such a hardware feature under the name predictive store forwarding (PSF), which they added in Zen3.
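The following is a rough sketch of what such a kernel could look like (hypothetical code, not part of memdep; names and sizes are made up): the store address is loaded from a table immediately before the store, so it becomes known very late, and whether the subsequent load aliases the store is controlled by how the table is initialized (always aliasing, never aliasing, or a random mix), which allows testing predictable vs. unpredictable aliasing.

  /* hypothetical sketch, not part of memdep */
  #define N 1024

  long data[2];        /* data[0] may alias src, data[1] never does */
  long *addrtable[N];  /* filled by the caller with &data[0] / &data[1] */

  long late_store_address(long *src, long iters)
  {
    long i, sum = 0;
    for (i=0; i<iters; i++) {
      long *dst = addrtable[i % N];  /* store address known only here */
      *dst = i;                      /* store to the late-computed address */
      sum += *src;                   /* load that may or may not alias *dst */
    }
    return sum;
  }

Typically src would be &data[0], so the load aliases the store exactly in those iterations where addrtable[i % N] is &data[0].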

Related

Henry Wong has measured Store-to-Load Forwarding and Memory Disambiguation in x86 Processors, and his approach (and probably his tools) has since been used by sites such as AnandTech and Chips and Cheese.

I have done a microbenchmark that was intended to measure predictive store forwarding, but it (also?) measured the forward-at-register-level technique described above.


Anton Ertl