Article: 40127 of comp.arch Path: news.tuwien.ac.at!a0.complang.tuwien.ac.at!anton From: anton@a0.complang.tuwien.ac.at (Anton Ertl) Newsgroups: comp.sys.intel,comp.arch Subject: Why "switch" is slow on the Pentium Date: 30 Nov 1995 19:56:01 GMT Organization: Institut fuer Computersprachen, Technische Universitaet Wien Lines: 112 Message-ID: <49l28h$6gb@news.tuwien.ac.at> NNTP-Posting-Host: a0.complang.tuwien.ac.at Xref: news.tuwien.ac.at comp.sys.intel:67261 comp.arch:40127 Yesterday I ran a few microbenchmarks on my Pentium. In one benchmark that exercises mainly the C "switch" statement, my Pentium was surprisingly slow. This was all the more surprising because the 486 did relatively well on this benchmark. Here's the benchmark in C: -------------------------------- #define NEXT break #define I_NEXT 0 #define I_LOOP 1 main() { static int prog[] = {I_NEXT,I_NEXT,I_NEXT,I_NEXT,I_NEXT,I_NEXT,I_NEXT,I_NEXT,I_NEXT,I_LOOP}; int *ip=prog; int count = 10000000; for (;;) { switch (*ip++) { case I_NEXT: NEXT; case I_LOOP: if (count>0) { count--; ip=prog; NEXT; } else return 0; /* the rest is to get gcc to make a realistic switch statement */ case 12: case 8: case 4: case 5: count=25; NEXT; case 10: case 2: count--; NEXT; } } } ------------------------------- and here are the relevant parts of the assembly code produced by gcc-2.6.3 for it: ------------------------------ L19: movl (%ecx),%eax addl $4,%ecx cmpl $12,%eax ja L19 jmp *L33(,%eax,4) .align 4,0x90 L33: .long L19 .long L24 .... ------------------------------ The result I got was that my Pentium-133 takes 13.44s for this benchmark, i.e. about 17 cycles per switch. I could not make out what takes so many cycles in the code above (the 486 took about 10, and I count 5 for the Pentium). So, I experimented: Changing the code (e.g., removing the overflow check) had little effect. Then I tried putting the table that starts at L33 into the data segment. And voila, the benchmark came in at 3.85s (a little over 5 cycles per switch). I then experimented with keeping the table in the text segment, but putting some distance between the table and the code with the .align directive. By changing the ".align 4" before L33 into ".align 7" (alignment to 128-byte-boundaries), I succeeded (I also did this to the .align following the table). I then formed my hypothesis about the reason for this effect (see below), and to check it, I turned off the secondary cache on my machine. Just as I expected, the original benchmark slowed down even further (to 39s). The Pentium has two first-level primary caches, one for code and one for data. My hypothesis is that reading from one cache invalidates the same data in the other cache, causing a cache miss when the invalidated data is accessed again (I think only one cache has this problem; otherwise we would see 24 cycles slowdown (a refill takes 6 bus cyles=12 processor cycles)). If I understand cache consistency theory (MESI etc.) right, this should not happen. In theory, both caches could have the same data (in shared state) and allow any read access without invalidating the data in the other cache. Could anyone shed some light on why the Pentium does not follow the theory? I also wonder why the granularity here seems to be 128 bytes. I thought the Pentium has 32-byte cache lines. This effect seems to be little known. Gcc does not avoid it; not even gcc-i2.6.3 (a version originally tuned by Intel engineers for the Pentium, then maintained by others in parallel with the official gcc) avoids it. Neither the Pentium manual nor the Application note AN-500 on "Optimization for the Intel 32-bit processors" mentions this. My machine is based on the Asus P55TP4XE motherboard, has 256K Pipelined burst secondary cache, and was running under Linux. The program was linked without the -N option, so the text segment was read-only. - anton -- M. Anton Ertl Some things have to be seen to be believed anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen http://www.complang.tuwien.ac.at/anton/home.html Article: 40136 of comp.arch Path: news.tuwien.ac.at!newsfeed.ACO.net!swidir.switch.ch!nntp.coast.net!news.dacom.co.kr!news.kreonet.re.kr!usenet.kornet.nm.kr!ames!lll-winken.llnl.gov!enews.sgi.com!fido.asd.sgi.com!arkham.boston.sgi.com!omar From: omar@arkham.boston.sgi.com (Omar G. Stradella) Newsgroups: comp.sys.intel,comp.arch Subject: Re: Why "switch" is slow on the Pentium Date: 30 Nov 1995 21:53:35 GMT Organization: Silicon Graphics, Inc. Lines: 37 Distribution: world Message-ID: <49l94v$g2h@fido.asd.sgi.com> References: <49l28h$6gb@news.tuwien.ac.at> NNTP-Posting-Host: arkham.boston.sgi.com Xref: news.tuwien.ac.at comp.sys.intel:67294 comp.arch:40136 In article <49l28h$6gb@news.tuwien.ac.at>, anton@a0.complang.tuwien.ac.at (Anton Ertl) writes: |> I then formed my hypothesis about the reason for this effect (see |> below), and to check it, I turned off the secondary cache on my |> machine. Just as I expected, the original benchmark slowed down even |> further (to 39s). |> |> The Pentium has two first-level primary caches, one for code and one |> for data. My hypothesis is that reading from one cache invalidates the |> same data in the other cache, causing a cache miss when the |> invalidated data is accessed again (I think only one cache has this |> problem; otherwise we would see 24 cycles slowdown (a refill takes 6 |> bus cyles=12 processor cycles)). If I understand cache consistency |> theory (MESI etc.) right, this should not happen. In theory, both |> caches could have the same data (in shared state) and allow any read |> access without invalidating the data in the other cache. |> |> Could anyone shed some light on why the Pentium does not follow the |> theory? I also wonder why the granularity here seems to be 128 |> bytes. I thought the Pentium has 32-byte cache lines. There are two policies for dealing with "duplicate lines" in I and D caches, either supported or prohibited. When they're prohibited a miss in the I cache loads a line in the I cache and checks the D cache invalidating the line if present, and viceversa. Apparently, the no-duplicate policy is easier to implement. Check: "Computer Architecture", by Michael Flynn, p.295-296. Omar. -- +-------------------------------------------------------------------+ Omar G. Stradella, Ph.D. Silicon Graphics, Inc. E-mail: omar@boston.sgi.com Computational Chemistry Phone: +1-(508) 567-2258 http://www.sgi.com/ChemBio FAX: +1-(508) 562-4755 http://reality.sgi.com/employees/omar +------ Ph-nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn -------+ Article: 40363 of comp.arch Path: news.tuwien.ac.at!a0.complang.tuwien.ac.at!anton From: anton@a0.complang.tuwien.ac.at (Anton Ertl) Newsgroups: comp.sys.intel,comp.arch Subject: Re: Why "switch" is slow on the Pentium Date: 8 Dec 1995 17:22:38 GMT Organization: Institut fuer Computersprachen, Technische Universitaet Wien Lines: 58 Message-ID: <4a9s8v$32j@news.tuwien.ac.at> References: <49l28h$6gb@news.tuwien.ac.at> NNTP-Posting-Host: a0.complang.tuwien.ac.at Xref: news.tuwien.ac.at comp.sys.intel:68138 comp.arch:40363 In article <49l28h$6gb@news.tuwien.ac.at>, anton@a0.complang.tuwien.ac.at (Anton Ertl) writes: .... |> ------------------------------ |> L19: |> movl (%ecx),%eax |> addl $4,%ecx |> cmpl $12,%eax |> ja L19 |> jmp *L33(,%eax,4) |> .align 4,0x90 |> L33: |> .long L19 |> .long L24 |> .... |> ------------------------------ ... |> So, I experimented: Changing the code (e.g., removing the overflow |> check) had little effect. Then I tried putting the table that starts |> at L33 into the data segment. And voila, the benchmark came in at |> 3.85s (a little over 5 cycles per switch). I then experimented with |> keeping the table in the text segment, but putting some distance |> between the table and the code with the .align directive. By changing |> the ".align 4" before L33 into ".align 7" (alignment to |> 128-byte-boundaries), I succeeded (I also did this to the .align |> following the table). Joern Rennecke pointed out to me that the linker (for a.out format) does not preserve the meaning of the .align directive. So, I replaced the .aligns with .fills and experimented with different amounts of spacing between code and jump table. I looked at the program after linking to see the actual placement of things. I kept the start of the table (i.e., L33) at (hex) 10e0 (i.e., at a 32-byte-boundary), and varied the end of the code (i.e., the first fill byte): end of code time -10cb 3.71s 10cc-10d0 11.62s 10d1-10dc 13.14s 10dd-10de 14.52s 10df-10e0 16.02s My conclusions from these results are that the granularity of this phenomenon is not larger than 32 bytes (other measurements indicate that it is not less than 32 bytes), and that the conflict is not (only) between necessary I-cahe reads and D-cache reads but between I-prefetches and D-cache reads. I cannot explain why the times for the conflicts cases vary between 11.62 and 16.02s, however. - anton -- M. Anton Ertl Some things have to be seen to be believed anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen http://www.complang.tuwien.ac.at/anton/home.html