Memory Dependency Microbenchmark

This microbenchmark tests some aspects of how hardware deals with dependences through memory. It performs the following loop:
  for (i=0; i<250000000; i++) {
    *b = *a+1;
    *d = *c+1;
    *f = *e+1;
    *h = *g+1;
  }
The eight parameters of the binary let you determine which memory locations the pointers a b c d e f g h refer to, in particular whether some of them refer to the same memory location or not. Each parameter corresponds to one pointer, and two pointers refer to the same memory location iff the corresponding parameters are equal.
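To make the parameter-to-pointer mapping concrete, here is a minimal sketch of a driver with that behaviour. This is only an illustration, not the actual main.c/memdep.c; the buffer size and the spacing between distinct locations are arbitrary assumptions.

  #include <stdlib.h>

  long buf[8*16];   /* backing store; size and spacing chosen arbitrarily */

  /* the measured loop from above */
  static void memdep(long *a, long *b, long *c, long *d,
                     long *e, long *f, long *g, long *h)
  {
    long i;
    for (i=0; i<250000000; i++) {
      *b = *a+1;
      *d = *c+1;
      *f = *e+1;
      *h = *g+1;
    }
  }

  int main(int argc, char **argv)
  {
    long *p[8];
    int i;
    if (argc < 9)
      return 1;
    /* equal parameters yield identical pointers; different parameters
       yield pointers to different, 64-byte-spaced longs */
    for (i=0; i<8; i++)
      p[i] = buf + (atol(argv[i+1]) % 16) * 8;
    memdep(p[0], p[1], p[2], p[3], p[4], p[5], p[6], p[7]);
    return 0;
  }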

This microbenchmark uses pointers that have been in their registers for a long time, so it does not test speculative alias detection by the hardware.

I have measured a number of parameter combinations, each assigning values to the parameters in the order a b c d e f g h; they are labeled X, Y, Z, A, B, B1, B2, C, D, E, F, G, and H in the tables below.

The results for different CPU cores are shown below. The numbers are cycles per computation (one statement in the loop body); to get cycles per iteration, multiply by 4.
  A     B     C     D     E     F     G     H   microarchitecture CPU
 1.98 14.42  6.04  6.07  6.07  4.78  4.73  8.02 21264B (Alpha)
  X     Y     Z     A     B     B1    B2    C     D     E     F     G     H   microarchitecture CPU
 3.01  3.01  3.01  3.01  4.52  4.52  4.01  3.01  4.02  3.01  4.01  4.01  5.02 U74 JH7100 (RISC-V)
 3.01  3.02  3.02  3.01  6.02  6.01  5.27  3.01  5.27  5.01  5.38  5.39  7.01 PPC7447A iBook G4 (make TIME="/usr/bin/time -f %U 2>&1" MHZ=1066)
 3.25  3.25  3.25  3.25  4.00  4.00  3.75  3.25  3.75  3.25  3.75  3.50  4.00 Cortex-A53 Amlogic S905
 3.25  3.25  3.25  3.25  4.00  4.00  3.75  3.25  3.75  3.25  3.75  3.50  4.00 Cortex-A55 RK3588
 1.25  1.25  1.25  1.25  2.27  5.87  3.71  2.22  3.96  4.00  5.85  5.67  7.69 Cortex-A72 BCM2711
 1.75  1.75  1.75  1.75  3.24  4.25  3.49  1.75  2.74  2.99  3.74  3.74  5.00 Cortex-A73 Amlogic S922 (taskset -c 5 make TIME="/usr/bin/time -f %U 2>&1" MHZ=1800)
 1.72  1.50  1.50  1.50  1.77  1.89  1.16  1.63  2.98  3.02  4.11  4.10  5.51 Cortex-A76 RK3588
 1.06  1.03  1.03  1.04  1.67  2.18  1.44  1.50  3.00  3.00  4.50  4.51  6.00 IceStorm Apple M1 (variations from 6.0-8.75 for H)
 0.55  0.55  0.55  0.52  0.84  0.84  0.61  1.68  3.32  3.29  5.00  4.94  6.36 FireStorm Apple M1 (make TIME="/usr/bin/time -f %U 2>&1" MHZ=3228)
 3.00  3.00  3.00  3.00  3.00  3.00  3.00  3.00  3.00  3.00  3.00  3.00  3.00 Bonnell Atom 330
 2.00  2.00  2.05  2.00  3.07  4.25  3.50  2.00  3.52  2.96  5.09  5.46  6.70 Silvermont Celeron J1900
 1.25  1.25  1.25  1.25  1.82  1.82  1.51  2.01  3.46  3.48  4.79  4.75  6.42 Goldmont Celeron J3455
 1.25  1.25  1.25  1.25  1.68  1.68  1.50  1.67  3.00  3.00  4.50  4.50  6.00 Goldmont+ Celeron J4105
 1.21  1.00  1.00  1.00  1.53  1.55  1.19  1.49  3.00  3.00  4.50  4.50  6.11 Tremont Celeron N4500
 1.00  1.00  1.00  0.75  0.75  0.70  0.70  0.75  0.70  1.00  0.75  0.75  1.00 Gracemont Core i3-1315U (taskset -c 5 make CYCLES=cpu_atom/cycles:u/)
 2.00  1.17  1.17  1.17  2.05  2.84  2.07  1.25  3.30  3.68  4.50  4.50  6.93 K8 Athlon 64 X2 4600+
 1.17  1.17  1.27  1.23  2.47  2.91  2.18  1.20  3.28  3.07  4.63  4.73  6.48 K10 Phenom II X2 560
 1.25  1.25  1.25  1.33  1.76  1.63  1.34  1.64  4.34  4.16  6.24  6.36  8.96 Excavator Athlon X4 845
 1.00  1.00  1.00  1.00  1.26  1.26  1.04  2.00  4.00  4.00  6.01  6.00  8.00 Zen Ryzen 1600X
 1.00  1.00  1.00  1.00  1.19  1.19  1.00  2.00  4.00  4.00  6.00  6.00  8.00 Zen2 Ryzen 3900X
 1.00  1.00  1.00  0.75  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00 Zen3 Ryzen 5800X
 2.19  1.38  1.38  1.38  1.78  3.37  2.50  1.38  3.02  3.02  4.55  4.50  6.00 Penryn Xeon E5450 
 1.26  1.24  1.26  1.25  1.72  1.72  1.50  1.46  3.01  3.01  4.51  4.51  6.03 Nehalem Xeon X3460
 1.06  1.00  1.01  1.00  1.52  1.52  1.25  1.99  3.69  3.68  4.65  4.63  7.45 Sandy Bridge Xeon E3-1220
 1.00  1.00  1.00  1.00  1.26  1.26  1.05  1.57  3.03  3.00  4.50  4.50  6.05 Haswell Core i7-4790K
 1.00  1.00  1.00  1.00  1.12  1.21  1.04  1.43  2.90  2.84  4.03  4.05  5.53 Skylake Core i5-6600K
 1.00  1.00  1.00  0.78  0.94  0.91  0.85  1.01  1.81  1.13  2.00  2.00  2.25 Rocket Lake Xeon W-1370P
 1.00  1.00  1.00  0.81  0.81  0.81  0.81  0.75  0.81  0.84  0.82  0.86  1.00 Tiger Lake Core i5-1135G7
 1.00  1.00  1.00  0.55  0.75  0.65  0.65  0.78  0.68  0.67  0.66  0.65  0.65 Golden Cove Core i3-1315U (taskset -c 2 make CYCLES=cpu_core/cycles:u/)
CPU cores can be at different levels of sophistication, both in how they deal with aliases and in how they forward from stores to loads. And of course there is the difference between in-order execution and out-of-order execution.

Aliases

The simplest approach would be to assume that all memory accesses may refer to the same memory location. In that case we would expect that the core performs all parameter combinations as badly as H. Apart from Bonnell, none of the measured cores exhibits this behaviour, so obviously they are more sophisticated than that.

On the plus side for Bonnell, A-H takes as long (in cycles) as A does on other in-order cores (e.g., the U74); and looking at the Bonnell pipeline, I really wonder how they do that: Given a three-cycle data cache access followed by a 1-cycle ALU operation, the results require a -1-cycle store-to-load latency, whether the addresses are the same or not. Maybe there is a bypass from the store to one of the later load-unit stages if the addresses match (and if they don't, starting the load early does not hurt).
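Spelling that arithmetic out, with L denoting the effective store-to-load latency in the dependence chain:

  3 cycles (load) + 1 cycle (add) + L (store-to-load) = 3 cycles/computation, hence L = -1 cycle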

The next approach is to allow architecturally later loads to be performed at the same time as, or, with OoO, earlier than, stores to a different address. All tested cores seem to do this.

Finally, there is the issue of what to do when the load is to the same address as an architecturally preceding store. This brings us to the next section:

Store to load forwarding

The simplest approach is to wait until the data arrives in the cache and then load it from there. I dimly remember seeing 15 cycles per iteration for a loop that incremented a memory location in every iteration, but none of the cores measured above takes that long (although for Zen and Zen2 one might wonder).

The next approach is to let the load read the data from the store buffer if it is there (waiting for it to arrive if necessary). In this case the whole store-load sequence has a total latency that is not much higher than the load latency. Most cores measured here seem to do that. We see this best in the H results; e.g., Skylake has 4 cycles of load latency and 1 cycle of add latency, and we see 5.53 cycles of latency for store-load-add, meaning that the store contributes an average latency of 0.53 cycles. There are some complications when the load only partially overlaps the store (I should put an appropriate Chips and Cheese link here).

Finally, the core might detect that the loaded data comes from a store that has not yet retired, so the physical register used by the store can be directly renamed into the register of the load result, and the load does not need to access the store buffer or cache at all (what is the name of this feature?). As a result, in the ideal case we see only the 1-cycle latency of the add in case H. Among the measured cores, Zen3 and Tiger Lake exhibit this behaviour fully; Rocket Lake probably also does it, but either does not succeed in all cases (the different results for D and E speak for this theory), or there is an additional source of latency.

I expected that Firestorm (the performance core of the Apple M1) would also have this feature, but, looking at the results, it does not.

Memory Renaming

The test B1 is, as far as true dependences are concerned, the same as B. However, B uses different memory locations all the time, while B1 uses the same memory location most of the time. Can the next iteration overlap the accesses of this iteration, just as register renaming allows overlapping the execution of instructions that use the same architectural register name? If not, B1 will show worse results than B.
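The following generic sketch shows the kind of reuse pattern involved (this is not the actual B1 parameter combination, just an illustration): one location *t is written and read in every iteration, so overlapping iteration i+1 with iteration i requires the hardware to keep several in-flight versions of *t, much like register renaming keeps several versions of an architectural register.

  /* generic illustration, not the actual B1 combination */
  void reuse_one_location(long *t, long *x, long *y, long n)
  {
    long i;
    for (i=0; i<n; i++) {
      *t = x[i]+1;    /* a new version of *t in every iteration */
      y[i] = *t+1;    /* must read this iteration's version of *t */
    }
  }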

B2 is similar to B1, but there are two 2-computation dependency chains instead of one 4-computation chain.

Intel added memory renaming in Goldmont and Nehalem; AMD added it between K10 and Excavator (probably already with Bulldozer). Among the ARM microarchitectures I measured, only Firestorm (the Apple M1 P-core) has memory renaming.

On the PPC7447A, B1 is as fast as B, but both are so close to the H case that it seems more likely that the PPC7447A has very limited out-of-order capabilities and does not reorder enough to allow more overlapping for B than for B1, even without memory renaming.

In-order vs. out-of-order execution

With in-order execution (on the U74, Cortex-A53, Cortex-A55, and Bonnell), the loads cannot be executed before architecturally earlier stores, even if both refer to different memory locations. So even A is relatively slow on in-order cores. In-order execution also means that B is almost as slow as H, while with out-of-order execution B can theoretically be executed as fast as A (in practice it is slightly slower, but still much faster than H).

With OoO, we see much better performance in cases where there are independent computation chains. Given the size of the buffers in the various OoO microarchitectures (hundreds of instructions in the reorder buffer, dozens in the schedulers) and the small size of each iteration (~15 instructions), it is surprising that B is slower than A; even D is slower than E on some microarchitectures, most notably Rocket Lake.

Measuring your own hardware

You can download the contents of this directory and run the benchmark with
  wget http://www.complang.tuwien.ac.at/anton/memdep/memdep.zip
  unzip memdep.zip
  cd memdep
  make
If you are working on a machine without working perf, you can use the following instead of the make in the last line:

  make TIME="/bin/time -f %U 2>&1" MHZ=...
where you have to replace the ... with the actual (not base) clock rate of the processor you are measuring. If you want to do your own parameter combinations, you can run the binary with
  ./memdep-`uname -m` a b c d e f g h
where a b c d e f g h are integers and correspond to the pointers in the loop. If you want to get results like in the table above, run it like this:
  perf stat --log-fd 3 -x, -e $(CYCLES) ./memdep-$(ARCH) a b c d e f g h 3>&1 | awk -F, '{printf("%5.2f\n",$$1/1000000000)}'
This line is in Makefile syntax: when running it directly in a shell, replace $(CYCLES) with the cycles event of your CPU (e.g., cycles), replace $(ARCH) with the output of uname -m, and write $1 instead of $$1 in the awk program. The division by 1000000000 yields cycles per computation, because the loop performs 250000000 iterations with 4 computations each.
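For example, on an x86-64 machine you could compare a setup in which no two pointers alias with one in which all eight pointers refer to the same location (the parameter values themselves are arbitrary; only their equality matters):
  ./memdep-x86_64 1 2 3 4 5 6 7 8
  ./memdep-x86_64 1 1 1 1 1 1 1 1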

Future work

Make a good microbenchmark that produces the addresses so late that the core either has to wait for the addresses or has to speculate on whether the load accesses the same address as the store. Do it for predictable aliasing and for unpredictable aliasing. I remember a good article about this kind of predictor (one that actually looked at Intel's patent), but I don't remember the URL; Intel calls the technique, as implemented in Ice Lake and later P-cores, Fast Store Forwarding Predictor (FSFP) (but my memory says that the article I read looked at older microarchitectures that have a similar feature). AMD describes such a hardware feature under the name predictive store forwarding (PSF), which they added in Zen3.
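The following is a rough sketch of what such a kernel could look like (hypothetical code, not part of memdep; names and sizes are made up): the store address is loaded from a table immediately before the store, so it becomes known very late, and whether the subsequent load aliases the store is controlled by how the table is initialized (always aliasing, never aliasing, or a random mix), which allows testing predictable vs. unpredictable aliasing.

  /* hypothetical sketch, not part of memdep */
  #define N 1024

  long data[2];        /* data[0] may alias src, data[1] never does */
  long *addrtable[N];  /* filled by the caller with &data[0] / &data[1] */

  long late_store_address(long *src, long iters)
  {
    long i, sum = 0;
    for (i=0; i<iters; i++) {
      long *dst = addrtable[i % N];  /* store address known only here */
      *dst = i;                      /* store to the late-computed address */
      sum += *src;                   /* load that may or may not alias *dst */
    }
    return sum;
  }

Typically src would be &data[0], so the load aliases the store exactly in those iterations where addrtable[i % N] is &data[0].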

Related

Henry Wong has measured Store-to-Load Forwarding and Memory Disambiguation in x86 Processors, and his approach (and probably his tools) has since been used by sites such as AnandTech and Chips and Cheese.

I have done a microbenchmark that was intended to measure predictive store forwarding, but it (also?) measured the forward-at-register-level technique described above.


Anton Ertl