Zero-cycle constant adds

Most CPU cores have a latency of one cycle for an integer addition, so a sequence like
	addq	$1, %rax
	addq	$1, %rax
	addq	$1, %rax
	addq	$1, %rax
	addq	$1, %rax
usually has 5 cycles of total latency. The P-Core of the Core i3-1315U (a Raptor Cove with 1.25MB L2 cache, or does that L2 size mean it is actually a Golden Cove?), however, executes this sequence much faster. This microbenchmark explores the performance characteristics of this optimization. There are 14 loops (a-n; the table below also lists the variants o-t discussed under Limit), most containing sequences like the one shown above, but there are others (a stand-alone sketch of such a loop follows the listing):
      a                 i                  j               n              k-m
addq %r8, %rax   leaq 1(%rax),%r8   addq $1, %rax   imul %rsi, %rax  imul %rsi, %rax
addq %rcx, %rax  leaq 1(%r8),%rcx   addq %rax,%r8   addq %rsi, %rax  addq $1, %rax
addq %rdx, %rax  leaq 1(%rcx),%rdx  addq $1, %rax                    addq $1, %rax
addq %rsi, %rax  leaq 1(%rdx),%rsi  addq %rax,%rcx                   ...
addq %rdi, %rax  leaq 1(%rsi),%rax  addq $1, %rax
cmpq %rax, %r9   cmpq %rax, %rdi    addq %rax,%rdx
jg   2b          jg   2b            addq $1, %rax
                                    addq %rax,%rsi
                                    addq $1, %rax
                                    addq %rax,%r9
                                    addq $1, %rax
                                    cmpq %rax, %rdi
                                    jg   2b
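
For reference, a stand-alone version of such a loop can look like the following C program with inline assembly. This is only my minimal sketch of the measurement (iteration count and register handling are chosen to mimic loop b), not the main.c/loop1.s used for the numbers below:

/* Sketch of a loop-b-style measurement: 1G iterations of five dependent
   constant adds.  Run the binary under "perf stat" and divide the reported
   cycles by 1G to get cycles/iteration. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int64_t x = 0;
    int64_t limit = 5LL * 1000000000LL;   /* 5 adds per iteration, 1G iterations */
    __asm__ __volatile__(
        "1:\n\t"
        "addq $1, %[x]\n\t"               /* five adds on the same register */
        "addq $1, %[x]\n\t"               /* form a 5-deep dependence chain */
        "addq $1, %[x]\n\t"
        "addq $1, %[x]\n\t"
        "addq $1, %[x]\n\t"
        "cmpq %[x], %[n]\n\t"             /* loop while limit > x           */
        "jg 1b\n\t"
        : [x] "+r" (x)
        : [n] "r" (limit)
        : "cc");
    printf("%lld\n", (long long)x);       /* keep the result live           */
    return 0;
}

As with the benchmark directory, you may have to pin the run to a P-core (e.g., with taskset) to see P-core rather than E-core numbers.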

The results we see are:

loop cyc/it Mops dep.adds remarks
 a    5.00    6     5     addq reg, %rax
 b    1.01    6     5     addq $1, %rax
 f    1.23    7     6     addq $1, %rax
 c    1.98   11    10     addq $1, %rax
 d    2.00   12    11     addq $1, %rax
 e    2.21   13    12     addq $1, %rax
 g    3.01   18    17     addq $1, %rax
 h    3.25   19    18     addq $1, %rax
 i    1.00    6     5     leaq 1(reg1),reg2
 j    2.00   12     6     addq $1, %rax; addq %rax, reg
 n    4.00    3     1     imul %rsi, %rax; addq %rsi, %rax
 k    3.00    3     1     imul %rsi, %rax; addq $1, %rax
 l    3.01   12    10     imul %rsi, %rax; addq $1, %rax
 m    3.20   17    15     imul %rsi, %rax; addq $1, %rax
 o    3.00   12    10     imul ...; addq $0x333, %rax; ...; addq $-0x333+2, %rax
 p    7.39   12    10     imul ...; addq $0x334, %rax; ...; addq $-0x334+2, %rax
 q    3.00   12    10     imul ...; addq $-0x333, %rax; ...; addq $0x333+2, %rax
 r    7.72   12    10     imul ...; addq $-0x334, %rax; ...; addq $0x334+2, %rax
 s    3.00   12    10     imul; +1023+1023+1023+1023-1024-1024-1024-1024-1009+1023
 t    3.74   12    10     imul; +1023+1023+1023+1023-1024-1024-1024-1024-1024+1038
The Mops column counts macro-ops: each addq, leaq, and imul counts as one, and the cmpq+jg pair (which macro-fuses) counts as one. What is happening here? I can only guess:

I believe that the hardware optimizes sequences of constant adds at the renaming stage of the microarchitecture. Not later, because according to Chips and Cheese, Golden Cove (and, by extension, Raptor Cove) has only 5 ALUs, yet we see more than 5 addqs per cycle. Not earlier, because the optimization works across zero-cycle store-to-load forwarding (which probably happens no earlier than the renaming stage). The renamer already performs move elimination; constant-add elimination could be a generalized form of that.
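
As a thought experiment, the guessed mechanism could look roughly like the following C model. This is purely speculative; the data structure and function names are invented for illustration and are not documented Intel behaviour:

/* Speculative renamer model: each architectural register maps to a physical
   register plus a pending constant offset.  A constant add only updates the
   offset at rename time (no ALU uop); any other use of the register forces
   the pending offset to be materialized. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    int     phys;    /* physical register currently mapped              */
    int64_t offset;  /* constant adds not yet applied to that register  */
} rename_entry;

/* Rename "addq $imm, reg": fold the immediate into the mapping;
   no ALU uop is issued, so the add has zero latency. */
void rename_const_add(rename_entry *reg, int64_t imm)
{
    reg->offset += imm;
}

/* Any other ALU operation reading the register: the pending offset has to
   be turned into a real value first.  This is where I would expect one
   extra cycle of latency, which loops k and l do not show. */
int materialize(rename_entry *reg, int *next_phys)
{
    if (reg->offset != 0) {
        /* issue one ALU uop: new_phys = old_phys + offset */
        reg->phys   = (*next_phys)++;
        reg->offset = 0;
    }
    return reg->phys;
}

int main(void)
{
    rename_entry rax = { .phys = 0, .offset = 0 };
    int next_phys = 1;
    for (int i = 0; i < 5; i++)
        rename_const_add(&rax, 1);          /* the five addq $1, %rax    */
    printf("phys %d\n", materialize(&rax, &next_phys)); /* reified once  */
    return 0;
}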

I don't have a good explanation for the apparent latency of 0 for the adds following the imul in loops k and l. I would think that the additions accumulated in some state in the renamer have to be reified in a physical register when that register is used in a proper ALU operation, and I would expect that to cost one cycle of latency. One theory I considered is that the microarchitecture internally has a multiply-add unit; however, one would then expect loop n to benefit as well, yet there we see an additional cycle of latency from the addition.

Limit

I played around with variants of loop l to find out where the numeric limits of this optimization are: the last loop that had the same performance as l was o:
2:	imul	%rsi, %rax
	addq	$0x333, %rax
	addq	$0x333, %rax
	addq	$0x333, %rax
	addq	$0x333, %rax
	addq	$0x333, %rax
	addq   $-0x333+2, %rax
	addq   $-0x333+2, %rax
	addq   $-0x333+2, %rax
	addq   $-0x333+2, %rax
	addq   $-0x333+2, %rax
	cmpq	%rax, %rdi
	jg	2b
Here the largest intermediate sum is 4095 (0xfff). If we replace 0x333 with 0x334 (loop p, intermediate sum 4100 (0x1004)), the loop runs much slower: 7.39 cycles/iteration on average, with quite a bit of variation between runs (although perf stat -r 100 reports only 0.1% variation). Similarly, first adding -0x333 five times and then adding 0x333+2 five times (loop q) is fast, while adding -0x334 and 0x334+2 respectively (loop r) is slow. So apparently one limit of this optimization is that the intermediate sum must stay in the 13-bit range -4096..4095.

The individual addends apparently have to be in the range -1024..1023 (loops s and t).
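
In terms of the model sketched above, the observed limits would amount to something like the following check at rename time. Again, this is my speculation; the bit widths are inferred from the measurements, not taken from any documentation:

/* Fold a constant add only if the addend fits in 11 signed bits and the
   accumulated offset stays within 13 signed bits; otherwise execute it as
   a normal ALU add (speculative, ranges inferred from loops o-t). */
#include <stdbool.h>
#include <stdint.h>

bool try_fold_const_add(int64_t *pending_offset, int64_t imm)
{
    int64_t sum = *pending_offset + imm;
    if (imm < -1024 || imm > 1023)    /* addend outside -1024..1023      */
        return false;
    if (sum < -4096 || sum > 4095)    /* running sum outside -4096..4095 */
        return false;
    *pending_offset = sum;            /* folded: zero-cycle add          */
    return true;
}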

How to run this yourself

Download this directory and do make, or possibly something like taskset -c 2 make, on a Linux system with perf. Each loop runs for 1G iterations (which you can verify from the number of branches shown in the perf output), so dividing the number of cycles by 1G gives the cycles/iteration.

Previous work

After performing my measurements and writing the above, I found that Chips and Cheese had published about this two years earlier, interestingly (and in very short form) in their article on Gracemont, not in the one on Golden Cove.
Anton Ertl