Article: 40127 of comp.arch
Path: news.tuwien.ac.at!a0.complang.tuwien.ac.at!anton
From: anton@a0.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.sys.intel,comp.arch
Subject: Why "switch" is slow on the Pentium
Date: 30 Nov 1995 19:56:01 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 112
Message-ID: <49l28h$6gb@news.tuwien.ac.at>
NNTP-Posting-Host: a0.complang.tuwien.ac.at
Xref: news.tuwien.ac.at comp.sys.intel:67261 comp.arch:40127

Yesterday I ran a few microbenchmarks on my Pentium. In one benchmark
that exercises mainly the C "switch" statement, my Pentium was
surprisingly slow. This was all the more surprising because the 486
did relatively well on this benchmark.

Here's the benchmark in C:
--------------------------------
#define NEXT break
#define I_NEXT 0
#define I_LOOP 1

main()
{
  static int prog[] = {I_NEXT,I_NEXT,I_NEXT,I_NEXT,I_NEXT,I_NEXT,I_NEXT,I_NEXT,I_NEXT,I_LOOP};
  int *ip=prog;
  int count = 10000000;

  for (;;) {
    switch (*ip++) {
    case I_NEXT:
      NEXT;
    case I_LOOP:
      if (count>0) {
        count--;
        ip=prog;
        NEXT;
      }
      else
        return 0;
/* the rest is to get gcc to make a realistic switch statement */
    case 12:
    case 8:
    case 4:
    case 5:
      count=25;
      NEXT;
    case 10:
    case 2:
      count--;
      NEXT;
    }
  }
}
-------------------------------

and here are the relevant parts of the assembly code produced by
gcc-2.6.3 for it:

------------------------------
L19:
        movl (%ecx),%eax
        addl $4,%ecx
        cmpl $12,%eax
        ja L19
        jmp *L33(,%eax,4)
        .align 4,0x90
L33:
        .long L19
        .long L24
....
------------------------------

The result I got was that my Pentium-133 takes 13.44s for this
benchmark, i.e. about 17 cycles per switch. I could not make out what
takes so many cycles in the code above (the 486 took about 10, and I
count 5 for the Pentium).

So, I experimented: Changing the code (e.g., removing the overflow
check) had little effect. Then I tried putting the table that starts
at L33 into the data segment. And voila, the benchmark came in at
3.85s (a little over 5 cycles per switch). I then experimented with
keeping the table in the text segment, but putting some distance
between the table and the code with the .align directive. By changing
the ".align 4" before L33 into ".align 7" (alignment to
128-byte-boundaries), I succeeded (I also did this to the .align
following the table).

I then formed my hypothesis about the reason for this effect (see
below), and to check it, I turned off the secondary cache on my
machine. Just as I expected, the original benchmark slowed down even
further (to 39s).

The Pentium has two first-level primary caches, one for code and one
for data. My hypothesis is that reading from one cache invalidates the
same data in the other cache, causing a cache miss when the
invalidated data is accessed again (I think only one cache has this
problem; otherwise we would see 24 cycles slowdown (a refill takes 6
bus cyles=12 processor cycles)). If I understand cache consistency
theory (MESI etc.)  right, this should not happen. In theory, both
caches could have the same data (in shared state) and allow any read
access without invalidating the data in the other cache.

Could anyone shed some light on why the Pentium does not follow the
theory? I also wonder why the granularity here seems to be 128
bytes. I thought the Pentium has 32-byte cache lines.

This effect seems to be little known. Gcc does not avoid it; not even
gcc-i2.6.3 (a version originally tuned by Intel engineers for the
Pentium, then maintained by others in parallel with the official gcc)
avoids it. Neither the Pentium manual nor the Application note AN-500
on "Optimization for the Intel 32-bit processors" mentions this.

My machine is based on the Asus P55TP4XE motherboard, has 256K
Pipelined burst secondary cache, and was running under Linux. The
program was linked without the -N option, so the text segment was
read-only.

- anton
-- 
M. Anton Ertl                    Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html


Article: 40136 of comp.arch
Path: news.tuwien.ac.at!newsfeed.ACO.net!swidir.switch.ch!nntp.coast.net!news.dacom.co.kr!news.kreonet.re.kr!usenet.kornet.nm.kr!ames!lll-winken.llnl.gov!enews.sgi.com!fido.asd.sgi.com!arkham.boston.sgi.com!omar
From: omar@arkham.boston.sgi.com (Omar G. Stradella)
Newsgroups: comp.sys.intel,comp.arch
Subject: Re: Why "switch" is slow on the Pentium
Date: 30 Nov 1995 21:53:35 GMT
Organization: Silicon Graphics, Inc.
Lines: 37
Distribution: world
Message-ID: <49l94v$g2h@fido.asd.sgi.com>
References: <49l28h$6gb@news.tuwien.ac.at>
NNTP-Posting-Host: arkham.boston.sgi.com
Xref: news.tuwien.ac.at comp.sys.intel:67294 comp.arch:40136

In article <49l28h$6gb@news.tuwien.ac.at>, anton@a0.complang.tuwien.ac.at (Anton Ertl) writes:
|> I then formed my hypothesis about the reason for this effect (see
|> below), and to check it, I turned off the secondary cache on my
|> machine. Just as I expected, the original benchmark slowed down even
|> further (to 39s).
|> 
|> The Pentium has two first-level primary caches, one for code and one
|> for data. My hypothesis is that reading from one cache invalidates the
|> same data in the other cache, causing a cache miss when the
|> invalidated data is accessed again (I think only one cache has this
|> problem; otherwise we would see 24 cycles slowdown (a refill takes 6
|> bus cyles=12 processor cycles)). If I understand cache consistency
|> theory (MESI etc.)  right, this should not happen. In theory, both
|> caches could have the same data (in shared state) and allow any read
|> access without invalidating the data in the other cache.
|> 
|> Could anyone shed some light on why the Pentium does not follow the
|> theory? I also wonder why the granularity here seems to be 128
|> bytes. I thought the Pentium has 32-byte cache lines.

There are two policies for dealing with "duplicate lines" in I and D
caches, either supported or prohibited. When they're prohibited a miss
in the I cache loads a line in the I cache and checks the D cache
invalidating the line if present, and viceversa. Apparently, the
no-duplicate policy is easier to implement.

Check: "Computer Architecture", by Michael Flynn, p.295-296.

Omar.

-- 
+-------------------------------------------------------------------+
 Omar G. Stradella, Ph.D.                     Silicon Graphics, Inc. 
 E-mail: omar@boston.sgi.com                 Computational Chemistry 
 Phone:  +1-(508) 567-2258                http://www.sgi.com/ChemBio 
 FAX:    +1-(508) 562-4755     http://reality.sgi.com/employees/omar 
+------  Ph-nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn  -------+


Article: 40363 of comp.arch
Path: news.tuwien.ac.at!a0.complang.tuwien.ac.at!anton
From: anton@a0.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.sys.intel,comp.arch
Subject: Re: Why "switch" is slow on the Pentium
Date: 8 Dec 1995 17:22:38 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 58
Message-ID: <4a9s8v$32j@news.tuwien.ac.at>
References: <49l28h$6gb@news.tuwien.ac.at>
NNTP-Posting-Host: a0.complang.tuwien.ac.at
Xref: news.tuwien.ac.at comp.sys.intel:68138 comp.arch:40363

In article <49l28h$6gb@news.tuwien.ac.at>, anton@a0.complang.tuwien.ac.at (Anton Ertl) writes:
....
|> ------------------------------
|> L19:
|>         movl (%ecx),%eax
|>         addl $4,%ecx
|>         cmpl $12,%eax
|>         ja L19
|>         jmp *L33(,%eax,4)
|>         .align 4,0x90
|> L33:
|>         .long L19
|>         .long L24
|> ....
|> ------------------------------
...
|> So, I experimented: Changing the code (e.g., removing the overflow
|> check) had little effect. Then I tried putting the table that starts
|> at L33 into the data segment. And voila, the benchmark came in at
|> 3.85s (a little over 5 cycles per switch). I then experimented with
|> keeping the table in the text segment, but putting some distance
|> between the table and the code with the .align directive. By changing
|> the ".align 4" before L33 into ".align 7" (alignment to
|> 128-byte-boundaries), I succeeded (I also did this to the .align
|> following the table).

Joern Rennecke pointed out to me that the linker (for a.out format)
does not preserve the meaning of the .align directive.

So, I replaced the .aligns with .fills and experimented with different
amounts of spacing between code and jump table. I looked at the
program after linking to see the actual placement of things.

I kept the start of the table (i.e., L33) at (hex) 10e0 (i.e., at a
32-byte-boundary), and varied the end of the code (i.e., the first fill
byte):

end of code	time
    -10cb	 3.71s
10cc-10d0	11.62s
10d1-10dc	13.14s
10dd-10de	14.52s
10df-10e0	16.02s

My conclusions from these results are that the granularity of this
phenomenon is not larger than 32 bytes (other measurements indicate
that it is not less than 32 bytes), and that the conflict is not
(only) between necessary I-cahe reads and D-cache reads but between
I-prefetches and D-cache reads.

I cannot explain why the times for the conflicts cases vary between
11.62 and 16.02s, however.

- anton 
-- 
M. Anton Ertl                    Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html