Index of /anton/lvas/effizienz/tspv

Name	Last modified	Size

Parent Directory		-
tspv.cc	2026-05-18 00:26	4.3K
tspv	2026-05-15 18:04	18K
tspu.c	2026-05-15 13:55	3.8K
tspu	2026-05-15 13:55	18K
tspr.c	2026-05-13 16:56	4.1K
tspr	2026-05-13 16:57	18K
tspq.c	2026-05-11 11:31	3.1K
tspq	2026-05-17 12:51	18K
tspp.c	2026-05-11 11:31	3.0K
tspp	2026-05-11 11:33	18K
tspo.optinfo	2026-05-18 15:57	1.2K
tspo-smarter.c	2026-05-21 19:36	2.8K
tspo-smarter-noinline.c	2026-05-21 23:26	3.1K
tspo-smarter-noinline	2026-05-21 23:26	17K
tspo-smarter	2026-05-21 19:38	17K
tspo-harder.c	2026-05-18 00:26	2.7K
tspo-harder-noinline.c	2026-05-21 23:31	3.1K
tspo-harder-noinline	2026-05-21 23:31	17K
tspo-harder	2026-05-22 10:13	17K
tspn.c	2026-05-11 11:31	2.7K
tspn	2026-05-17 19:53	18K
perf.data.old	2026-05-21 23:33	35K
perf.data	2026-05-21 23:35	35K
Makefile	2026-05-18 00:23	244

Vectorization of Jon Bentley's TSP example

In his 1982 Book "Writing Efficient Programs" Jon Bentley used a simple program for the traveling salesman problem as example. This algorithm actually turns out to be vectorizable and here I use this program to exercise the auto-vectorization capabilities of gcc and clang (various versions), and the explicit vector extensions of these compilers.

Jon Bentley's program for the traveling salesman problem (TSP)

The program uses a simple (and simple-minded) greedy algorithm: From the starting point, find the closest unvisited city, and travel there. Repeat until all cities have been visited. This algorithm does not give the optimal solution (shortest journey) and it has quadratic complexity. Bentley implemented a number of manual optimizations for this program. I translated Bentley's original Pascal code to C around 1999, and most of Bentley's optimizations still provide speedups even with the compilers of the 2020s.

The unoptimized code (tsp1.c) has the following inner loop:

// typedef struct {
//   double x;
//   double y;
// } point;
// double CloseDist; int j, ncities;
// char *visited; point cities[];
// int ThisPt, ClosePt;
CloseDist = DBL_MAX;
for (j=0; j

After applying a number of optimizations, the program became tsp9.c and its inner loop:

// point tour[];
// int i,j;
// int ClosePt;
// double CloseDist;
double ThisX = tour[i-1].x;
double ThisY = tour[i-1].y;
CloseDist = DBL_MAX;
for (j=ncities-1; ;j--) {
  double ThisDist = sqr(tour[j].x-ThisX);
  if (ThisDist <= CloseDist) {
    ThisDist += sqr(tour[j].y-ThisY);
    if (ThisDist <= CloseDist) {
      if (j < i)
        break;
      CloseDist = ThisDist;
      ClosePt = j;
    }
  }


Actually, compilers tend to see that there is another loop inside this
one, which can be made explicit as follows:

double ThisX = tour[i-1].x;
double ThisY = tour[i-1].y;
CloseDist = DBL_MAX;
for (j=ncities; ;) {
  double ThisDist;
  do {
    j--;
    ThisDist = sqr(tour[j].x-ThisX);
  } while (ThisDist > CloseDist);
  ThisDist += sqr(tour[j].y-ThisY);
  if (ThisDist <= CloseDist) {
    if (j < i)
      break;
    CloseDist = ThisDist;
    ClosePt = j;
  }
}


Vectorization of Jon Bentley's TSP program

What Bentley did not do in his book is to vectorize the code; vector
computers were not widespread in 1982, but nowadays every
general-purpose CPU has SIMD capabilities.

In 2016 I vectorized this code using (alternatively) SSE2 or AVX2 instructions
using inline assembly language, and discussed it in the thread
starting at
<2016Nov14.164726@mips.complang.tuwien.ac.at>; in
the discussion already5chosen (aka Michael S) suggested and
implemented some improvements, in particular the "work harder, not
smarter" and the "branchless" approaches discussed below.

In the meantime I have access to machines with AVX-512 (Rocket Lake
(Xeon E-2388G), and Zen 4 (Ryzen 7 8700G)), so I eventually set out to
make use of that.  This time I wanted to do it as much without
assembly language/intrinsics as possible, and to exercise the
machine-independent features of gcc and clang.  Before I go into that,
let's look at the vectorizability and vectorization approaches.

Why is the inner loop vectorizable

Data-parallel problems can be vectorized, and the inner loop is
data-parallel: The distances to all the unvisited cities
(tour[i..ncities-1] in the optimized code) can be
computed in parallel, and the determining the index of the closest
city is a reduction
that is also highly parallel.

Rearranging the data

Vectorization also needs the parallel data to be arranged adjacently.
The original arrangement as an array of structures can be vectorized:
Perform the subtraction and the squaring in parallel, then use vhaddpd, there are
the following reasons for rearranging the x coordinates as one array
and the y coordinates as a second array (the structure-of-arrays
approach):



The optimized inner loop shown above only accesses the X
coordinates.  The "work harder" approach and the branchless approach
access both coordinates, however, so they would 

vhaddpd does not exist in a 512-bit-wide version, so
one would have to split the results of 512-bit computations into
256-bit pieces to make use of vhaddpd.  The
structure-of-arrays approach allows to make use of 512-bit wide
addition.

The GNU C/C++ vector extensions have no horizontal addition
operation.

Given the state of auto-vectorization (see below), I doubt that
auto-vectorization will be able to vectorize the array-of-structures
data arrangement anytime soon.



While I was at rearranging the data, I also arranged the arrays to be
64-byte-aligned at the end.  This allows the inner loop to go
backwards from the end with aligned accesses (which the code generated
for the GNU C vector extensions requires).  There is also padding at
the start such that the backwards loop can load extra values from
there without having to worry about out-of-bounds situations.  The
aligned accesses reduce the cost of one of the vectorized versions on
Rocket Lake for 10000 cities from 54M cycles to 50M cycles
(independently of whether the loads were performed with
vmovupd or vmovapd, where the latter traps
on an access that is not 64-byte-aligned).

Another alternative would be to reverse the arrays that contain the X
and Y coordinates, and align it at the start and pad it at the end.
This would mean that the inner loop has a forward stride.  I have not
tried this out to see if the auto-vectorization of gcc and clang works
better for that.

I rearranged the data only in the function tsp().  An
alternative would be to perform the rearrangement in the whole program
(which would improve the performance a little), which is not that much
effort for this program, but in more realistic applications may
require too much effort or may conflict with other optimizations, so
limiting the rearrangement only to a part of the program is a
realistic technique.

Check X distance first

The inner loop of the optimized approach shown above can be vectorized
by computing the square of the X distance for N elements in parallel
(N=8 for double-precision on AVX-512), and comparing them to CloseDist
in parallel.  If any of the comparisons shows an X distance that is
less-than-or-equal to CloseDist, the loop can be left.  The AVX-512
code for that loop can look as follows:

L1: add    $-14,%rsi
    vmovupd (%rsi),%zmm5
    vsubpd %zmm1,%zmm5,%zmm5
    vmulpd %zmm5,%zmm5,%zmm5
    vcmpltpd %zmm4,%zmm5,%k0
    kortestb %k0,%k0
    je     L1


Note that only %rsi is loop-carried, so the latency of
the other instructions does not prevent the next iteration of the loop
from starting in the next cycle.  So, if everything works well, this
loop is resource-limited.  If there are enough resources to perform
more than one iteration per cycle (not on Rocket Lake or Zen 4), one
can unroll this loop.

One can compute the absolute value instead of the square to compare
the X distances.  This may save power, but I have not tested this
approach.

Work harder, not smarter

Checking the X distance first results in the inner loop being left
1_466_378 times and a totel of 1_464_182 branch mispredictions on
Rocket Lake for 10000 cities (this was measured with a scalar program,
but I don't expect vectorization to change that).  Given the cost of
branch mispredictions (my usual estimate is ~20 cycles per
misprediction), it may be better to compute the complete distance at
once and compare only that.  The AVX-512 code for this approach is:

L1: vmovupd -0x40(%rax,%rsi,8),%zmm5
    vmovupd -0x40(%r13,%rsi,8),%zmm6
    add    $-8,%rsi
    vsubpd %zmm1,%zmm5,%zmm5
    vmulpd %zmm5,%zmm5,%zmm7
    vsubpd %zmm2,%zmm6,%zmm5
    vfmadd213pd %zmm7,%zmm5,%zmm5
    vcmpltpd %zmm4,%zmm5,%k0
    kortestb %k0,%k0
    je     L1


This reduces the inner-loop exits to 100_611 and the branch
mispredictions to 158_550, but it increases the number of instructions
in the inner loop and the total executed instructions (from 362M to
550M for 10000 cities for scalar versions compiled with gcc-14 -O3).

Working smarter is one John Bentley's optimizations, and in the
scalar case, it does pay off (on Rocket Lake 171M cycles for work
harder, 151M cycles for work smarter).

In the AVX2 vectorization work from 2016, however, working harder
proved to be superior on Haswell (Core i7-4790K): smarter: 137M cycles; harder: 95M cycles.  On the Rocket Lake, the
speed difference is even bigger: smarter: 123M cycles, harder: 69M
cycles.

Branchless

Instead of leaving the inner loop if a closer city is found, the loop
can go across all unvisited cities and update the index with
branchless conditional code if a closer city is found.  The resulting
AVX-512 code for the inner loop body looks like this:

L1: vmovdqu8      (%rcx,%rax,8),%zmm1
    vmovdqu8      (%rbx,%rax,8),%zmm0
    vsubpd        %zmm4,%zmm1,%zmm1
    vsubpd        %zmm5,%zmm0,%zmm0
    vmulpd        %zmm1,%zmm1,%zmm1
    vfmadd132pd   %zmm0,%zmm1,%zmm0
    vpbroadcastq  %rax,%zmm1
    sub           $0x8,%rax
    vcmpltpd      %zmm2,%zmm0,%k1
    vminpd        %zmm0,%zmm2,%zmm2
    vpaddq        %zmm8,%zmm1,%zmm3{%k1}
    cmp           %rdx,%rax
    jge           L1


This code computes 8 CloseDist values (in zmm2) and 8 ClosePt (in
zmm3) of 8 slices of the array in parallel; once the loop exits, the
results of the 8 slices can be combined.

The final combination offers an interesting choice: If there are
several closest cities (all with the same distance to the current
city), the original scalar code chooses the one with the lowest index.
Do we want to keep this?  Or, given the heuristic nature of the
algorithm, where no choice is obviously better than the other, do we
just choose any of them?  I decided for the latter in the branchless
variants below (and in 2016).  Choosing the lowest-indexed one would
complicate the code and would result in a very small slowdown (the
more complicated code does not affect the body of the inner loop).

Choosing any of the closest cities is justified by the fact that we
do not know whether which of the closest cities leads to a better or
worse tour for the rest of the algorithm, so we can just choose any.
Sticking with the lowest-indexed one has the nice property of
resulting in the same test outputs as the scalar version under all
circumstances.  In the testing I did, the results were always the
same (several closest cities are rare).

The loop-carried values in this code are %rax, %zmm2, and %zmm3,
With the recurrences (dependence loops) only involving the
instructions sub, vminpd and a predicated
vaddpd.  It may be possible that one of these
instructions has such a long latency that it pays off to split the
computation across more lanes with loop unrolling, but I have not
investigated that.

The best AVX2 branchless version from 2016 reduces the branch mispredictions on Rocket Lake to 93k (from
158k for "work harder"), and takes 69M cycles, like "work harder".


Combiation

The number of inner-loop exits for the smarter and harder variants
tends to decrease as CloseDist becomes smaller, and the advantage of
fewer instructions per iteration may overcome the disadvantage of more
branch mispredictions at some point.

This inspires the idea of performing the branchless variant for the
first N1 cities, then switching to the work-harder variant until
CloseDist is small enough (or for another N2 cities, which is agnostic
of the average city distance), and finally switching to the
work-smarter variant.

I have tried this with a combination of work-harder and
work-smarter, but unfortunatly, the shortcomings of the vectorization
support by gcc and clang (see below) meant that the lower-instruction
benefit did not materialize.



Manual vectorization

The
GNU C/C++ Vector Extensions provide fixed-size vectors and a
number of operations useful for vectorizing the TSP program.  Clang
supports the same extensions as well as others; for now I have
limited myself to the GNU C or C++ extensions.

Unfortunately, the GNU C Vector extensions do not provide a way to
check the result of a vector comparison, so the smarter and harder
approaches need assembly language or intrinsic support, and I leave
that for another day.

This leaves us with the branchless approach.  The ? :
operator is useful for that, but only supported in C++, not in C.
There is also a minor hurdle that C++ (unlike C) does not support
assigning (broadcasting) a scalar to a vector (while vector operations
with scalars are supported).

This results in a program with the following
inner loop:

// typedef double vd __attribute__ ((vector_size (SIMD_SIZE*sizeof(double))));
// typedef   long vl __attribute__ ((vector_size (SIMD_SIZE*sizeof(long))));
// long ncities, j,i;
// double *x,y;
// vd dist, closedist;
// vl closeidx, baseidx;
for (j = ncities-SIMD_SIZE; j >= i; j -= SIMD_SIZE) {
  vd vx=*(vd *)&x[j];
  vd vy=*(vd *)&y[j];
  dist = vsqr(vx-ThisX)+vsqr(vy-ThisY);
  closeidx =  dist

Here baseidx contains {0, 1,...,7}, such
that baseidx+j is the vector or element indexes.

The resulting native code is shown in Section
Branchless.




Auto-Vectorization

What's the promise of auto-vectorization?  You throw your program into
the compiler as-is, and out comes a vectorized program, if the code is
vectorizable.  For our running example, we have established that the
code is vectorizable.  How does auto-vectorization fare?

In the following I show the run-time for 10000 cities for different
versions of the program, compiled with gcc-16 and clang-19 with

autovectorize: -O3 -ffast-math -march=rocketlake -mprefer-vector-width=512
 force scalar: -O3 -ffast-math -march=rocketlake -fno-tree-vectorize


The results in Mcycles on Rocket Lake are:



  gcc-16   clang-19
vect scal vect scal
659M 699M 366M 365M unoptimized
168M 167M 168M 167M optimized
164M 164M 167M 167M optimized, struct of arrays
146M 146M 171M 170M struct of arrays, work harder


Except for gcc-16 for the unoptimized code, the autovectorization has
little, if any effect (the number of executed instructions confirms
this), and the one case where auto-vectorization provides an
improvement is beaten by the competition by more than a factor of 2,
with and without vectorization.

Even for the two cases where I manually rearranged the data
structures to be friendly to auto-vectorization, auto-vectorization
gave no benefit.



Manual "Auto"-Vectorization

Ok, so can we not still make use of auto-vectorization by first
thinking about how to vectorize the code (we already have done so),
and then writing the code in a way that will result in the code
actually using SIMD instructions?

Of course, that poses the question if it would not be better if the
programming language provided ways to express the functionality
directly in a vectorized way (e.g., being able to reduce a vector of
comparsion results to an int, such that it can be used to control loop
termination.

Where auto-vectorization seems to be pretty reliable is counted
loops that iterate over one or more arrays, with each array just being
read, or declared restrict.  I have seen successful
auto-vectorization for simple data-parallel masking based on
conditions on array elements.  Reduction loops are also vectorized, as
are broadcasts.

But data-dependent loops tend to be beyond the capabilities of
auto-vectorization, except that the recent gcc-16 claims it as a new
feature.