Index of /anton/lvas/effizienz/tspv

[ICO]NameLast modifiedSizeDescription

[PARENTDIR]Parent Directory  -  
[TXT]tspv.cc2026-05-18 00:26 4.3K 
[   ]tspv2026-05-15 18:04 18K 
[TXT]tspu.c2026-05-15 13:55 3.8K 
[   ]tspu2026-05-15 13:55 18K 
[TXT]tspr.c2026-05-13 16:56 4.1K 
[   ]tspr2026-05-13 16:57 18K 
[TXT]tspq.c2026-05-11 11:31 3.1K 
[   ]tspq2026-05-17 12:51 18K 
[TXT]tspp.c2026-05-11 11:31 3.0K 
[   ]tspp2026-05-11 11:33 18K 
[   ]tspo.optinfo2026-05-18 15:57 1.2K 
[TXT]tspo-smarter.c2026-05-21 19:36 2.8K 
[TXT]tspo-smarter-noinline.c2026-05-21 23:26 3.1K 
[   ]tspo-smarter-noinline2026-05-21 23:26 17K 
[   ]tspo-smarter2026-05-21 19:38 17K 
[TXT]tspo-harder.c2026-05-18 00:26 2.7K 
[TXT]tspo-harder-noinline.c2026-05-21 23:31 3.1K 
[   ]tspo-harder-noinline2026-05-21 23:31 17K 
[   ]tspo-harder2026-05-22 10:13 17K 
[TXT]tspn.c2026-05-11 11:31 2.7K 
[   ]tspn2026-05-17 19:53 18K 
[   ]perf.data.old2026-05-21 23:33 35K 
[   ]perf.data2026-05-21 23:35 35K 
[   ]Makefile2026-05-18 00:23 244  

Vectorization of Jon Bentley's TSP example

Vectorization of Jon Bentley's TSP example

In his 1982 Book "Writing Efficient Programs" Jon Bentley used a simple program for the traveling salesman problem as example. This algorithm actually turns out to be vectorizable and here I use this program to exercise the auto-vectorization capabilities of gcc and clang (various versions), and the explicit vector extensions of these compilers.

Jon Bentley's program for the traveling salesman problem (TSP)

The program uses a simple (and simple-minded) greedy algorithm: From the starting point, find the closest unvisited city, and travel there. Repeat until all cities have been visited. This algorithm does not give the optimal solution (shortest journey) and it has quadratic complexity. Bentley implemented a number of manual optimizations for this program. I translated Bentley's original Pascal code to C around 1999, and most of Bentley's optimizations still provide speedups even with the compilers of the 2020s.

The unoptimized code (tsp1.c) has the following inner loop:

// typedef struct {
//   double x;
//   double y;
// } point;
// double CloseDist; int j, ncities;
// char *visited; point cities[];
// int ThisPt, ClosePt;
CloseDist = DBL_MAX;
for (j=0; j

After applying a number of optimizations, the program became tsp9.c and its inner loop:

// point tour[];
// int i,j;
// int ClosePt;
// double CloseDist;
double ThisX = tour[i-1].x;
double ThisY = tour[i-1].y;
CloseDist = DBL_MAX;
for (j=ncities-1; ;j--) {
  double ThisDist = sqr(tour[j].x-ThisX);
  if (ThisDist <= CloseDist) {
    ThisDist += sqr(tour[j].y-ThisY);
    if (ThisDist <= CloseDist) {
      if (j < i)
        break;
      CloseDist = ThisDist;
      ClosePt = j;
    }
  }
Actually, compilers tend to see that there is another loop inside this one, which can be made explicit as follows:
double ThisX = tour[i-1].x;
double ThisY = tour[i-1].y;
CloseDist = DBL_MAX;
for (j=ncities; ;) {
  double ThisDist;
  do {
    j--;
    ThisDist = sqr(tour[j].x-ThisX);
  } while (ThisDist > CloseDist);
  ThisDist += sqr(tour[j].y-ThisY);
  if (ThisDist <= CloseDist) {
    if (j < i)
      break;
    CloseDist = ThisDist;
    ClosePt = j;
  }
}

Vectorization of Jon Bentley's TSP program

What Bentley did not do in his book is to vectorize the code; vector computers were not widespread in 1982, but nowadays every general-purpose CPU has SIMD capabilities.

In 2016 I vectorized this code using (alternatively) SSE2 or AVX2 instructions using inline assembly language, and discussed it in the thread starting at <2016Nov14.164726@mips.complang.tuwien.ac.at>; in the discussion already5chosen (aka Michael S) suggested and implemented some improvements, in particular the "work harder, not smarter" and the "branchless" approaches discussed below.

In the meantime I have access to machines with AVX-512 (Rocket Lake (Xeon E-2388G), and Zen 4 (Ryzen 7 8700G)), so I eventually set out to make use of that. This time I wanted to do it as much without assembly language/intrinsics as possible, and to exercise the machine-independent features of gcc and clang. Before I go into that, let's look at the vectorizability and vectorization approaches.

Why is the inner loop vectorizable

Data-parallel problems can be vectorized, and the inner loop is data-parallel: The distances to all the unvisited cities (tour[i..ncities-1] in the optimized code) can be computed in parallel, and the determining the index of the closest city is a reduction that is also highly parallel.

Rearranging the data

Vectorization also needs the parallel data to be arranged adjacently. The original arrangement as an array of structures can be vectorized: Perform the subtraction and the squaring in parallel, then use vhaddpd, there are the following reasons for rearranging the x coordinates as one array and the y coordinates as a second array (the structure-of-arrays approach): While I was at rearranging the data, I also arranged the arrays to be 64-byte-aligned at the end. This allows the inner loop to go backwards from the end with aligned accesses (which the code generated for the GNU C vector extensions requires). There is also padding at the start such that the backwards loop can load extra values from there without having to worry about out-of-bounds situations. The aligned accesses reduce the cost of one of the vectorized versions on Rocket Lake for 10000 cities from 54M cycles to 50M cycles (independently of whether the loads were performed with vmovupd or vmovapd, where the latter traps on an access that is not 64-byte-aligned). Another alternative would be to reverse the arrays that contain the X and Y coordinates, and align it at the start and pad it at the end. This would mean that the inner loop has a forward stride. I have not tried this out to see if the auto-vectorization of gcc and clang works better for that. I rearranged the data only in the function tsp(). An alternative would be to perform the rearrangement in the whole program (which would improve the performance a little), which is not that much effort for this program, but in more realistic applications may require too much effort or may conflict with other optimizations, so limiting the rearrangement only to a part of the program is a realistic technique.

Check X distance first

The inner loop of the optimized approach shown above can be vectorized by computing the square of the X distance for N elements in parallel (N=8 for double-precision on AVX-512), and comparing them to CloseDist in parallel. If any of the comparisons shows an X distance that is less-than-or-equal to CloseDist, the loop can be left. The AVX-512 code for that loop can look as follows:
L1: add    $-14,%rsi
    vmovupd (%rsi),%zmm5
    vsubpd %zmm1,%zmm5,%zmm5
    vmulpd %zmm5,%zmm5,%zmm5
    vcmpltpd %zmm4,%zmm5,%k0
    kortestb %k0,%k0
    je     L1
Note that only %rsi is loop-carried, so the latency of the other instructions does not prevent the next iteration of the loop from starting in the next cycle. So, if everything works well, this loop is resource-limited. If there are enough resources to perform more than one iteration per cycle (not on Rocket Lake or Zen 4), one can unroll this loop.

One can compute the absolute value instead of the square to compare the X distances. This may save power, but I have not tested this approach.

Work harder, not smarter

Checking the X distance first results in the inner loop being left 1_466_378 times and a totel of 1_464_182 branch mispredictions on Rocket Lake for 10000 cities (this was measured with a scalar program, but I don't expect vectorization to change that). Given the cost of branch mispredictions (my usual estimate is ~20 cycles per misprediction), it may be better to compute the complete distance at once and compare only that. The AVX-512 code for this approach is:
L1: vmovupd -0x40(%rax,%rsi,8),%zmm5
    vmovupd -0x40(%r13,%rsi,8),%zmm6
    add    $-8,%rsi
    vsubpd %zmm1,%zmm5,%zmm5
    vmulpd %zmm5,%zmm5,%zmm7
    vsubpd %zmm2,%zmm6,%zmm5
    vfmadd213pd %zmm7,%zmm5,%zmm5
    vcmpltpd %zmm4,%zmm5,%k0
    kortestb %k0,%k0
    je     L1

This reduces the inner-loop exits to 100_611 and the branch mispredictions to 158_550, but it increases the number of instructions in the inner loop and the total executed instructions (from 362M to 550M for 10000 cities for scalar versions compiled with gcc-14 -O3).

Working smarter is one John Bentley's optimizations, and in the scalar case, it does pay off (on Rocket Lake 171M cycles for work harder, 151M cycles for work smarter).

In the AVX2 vectorization work from 2016, however, working harder proved to be superior on Haswell (Core i7-4790K): smarter: 137M cycles; harder: 95M cycles. On the Rocket Lake, the speed difference is even bigger: smarter: 123M cycles, harder: 69M cycles.

Branchless

Instead of leaving the inner loop if a closer city is found, the loop can go across all unvisited cities and update the index with branchless conditional code if a closer city is found. The resulting AVX-512 code for the inner loop body looks like this:
L1: vmovdqu8      (%rcx,%rax,8),%zmm1
    vmovdqu8      (%rbx,%rax,8),%zmm0
    vsubpd        %zmm4,%zmm1,%zmm1
    vsubpd        %zmm5,%zmm0,%zmm0
    vmulpd        %zmm1,%zmm1,%zmm1
    vfmadd132pd   %zmm0,%zmm1,%zmm0
    vpbroadcastq  %rax,%zmm1
    sub           $0x8,%rax
    vcmpltpd      %zmm2,%zmm0,%k1
    vminpd        %zmm0,%zmm2,%zmm2
    vpaddq        %zmm8,%zmm1,%zmm3{%k1}
    cmp           %rdx,%rax
    jge           L1
This code computes 8 CloseDist values (in zmm2) and 8 ClosePt (in zmm3) of 8 slices of the array in parallel; once the loop exits, the results of the 8 slices can be combined.

The final combination offers an interesting choice: If there are several closest cities (all with the same distance to the current city), the original scalar code chooses the one with the lowest index. Do we want to keep this? Or, given the heuristic nature of the algorithm, where no choice is obviously better than the other, do we just choose any of them? I decided for the latter in the branchless variants below (and in 2016). Choosing the lowest-indexed one would complicate the code and would result in a very small slowdown (the more complicated code does not affect the body of the inner loop).

Choosing any of the closest cities is justified by the fact that we do not know whether which of the closest cities leads to a better or worse tour for the rest of the algorithm, so we can just choose any. Sticking with the lowest-indexed one has the nice property of resulting in the same test outputs as the scalar version under all circumstances. In the testing I did, the results were always the same (several closest cities are rare).

The loop-carried values in this code are %rax, %zmm2, and %zmm3, With the recurrences (dependence loops) only involving the instructions sub, vminpd and a predicated vaddpd. It may be possible that one of these instructions has such a long latency that it pays off to split the computation across more lanes with loop unrolling, but I have not investigated that.

The best AVX2 branchless version from 2016 reduces the branch mispredictions on Rocket Lake to 93k (from 158k for "work harder"), and takes 69M cycles, like "work harder".

Combiation

The number of inner-loop exits for the smarter and harder variants tends to decrease as CloseDist becomes smaller, and the advantage of fewer instructions per iteration may overcome the disadvantage of more branch mispredictions at some point.

This inspires the idea of performing the branchless variant for the first N1 cities, then switching to the work-harder variant until CloseDist is small enough (or for another N2 cities, which is agnostic of the average city distance), and finally switching to the work-smarter variant.

I have tried this with a combination of work-harder and work-smarter, but unfortunatly, the shortcomings of the vectorization support by gcc and clang (see below) meant that the lower-instruction benefit did not materialize.

Manual vectorization

The GNU C/C++ Vector Extensions provide fixed-size vectors and a number of operations useful for vectorizing the TSP program. Clang supports the same extensions as well as others; for now I have limited myself to the GNU C or C++ extensions.

Unfortunately, the GNU C Vector extensions do not provide a way to check the result of a vector comparison, so the smarter and harder approaches need assembly language or intrinsic support, and I leave that for another day.

This leaves us with the branchless approach. The ? : operator is useful for that, but only supported in C++, not in C. There is also a minor hurdle that C++ (unlike C) does not support assigning (broadcasting) a scalar to a vector (while vector operations with scalars are supported).

This results in a program with the following inner loop:

// typedef double vd __attribute__ ((vector_size (SIMD_SIZE*sizeof(double))));
// typedef   long vl __attribute__ ((vector_size (SIMD_SIZE*sizeof(long))));
// long ncities, j,i;
// double *x,y;
// vd dist, closedist;
// vl closeidx, baseidx;
for (j = ncities-SIMD_SIZE; j >= i; j -= SIMD_SIZE) {
  vd vx=*(vd *)&x[j];
  vd vy=*(vd *)&y[j];
  dist = vsqr(vx-ThisX)+vsqr(vy-ThisY);
  closeidx =  dist

Here baseidx contains {0, 1,...,7}, such
that baseidx+j is the vector or element indexes.

The resulting native code is shown in Section Branchless.

Auto-Vectorization

What's the promise of auto-vectorization? You throw your program into the compiler as-is, and out comes a vectorized program, if the code is vectorizable. For our running example, we have established that the code is vectorizable. How does auto-vectorization fare? In the following I show the run-time for 10000 cities for different versions of the program, compiled with gcc-16 and clang-19 with
autovectorize: -O3 -ffast-math -march=rocketlake -mprefer-vector-width=512
 force scalar: -O3 -ffast-math -march=rocketlake -fno-tree-vectorize
The results in Mcycles on Rocket Lake are:
  gcc-16   clang-19
vect scal vect scal
659M 699M 366M 365M unoptimized
168M 167M 168M 167M optimized
164M 164M 167M 167M optimized, struct of arrays
146M 146M 171M 170M struct of arrays, work harder
Except for gcc-16 for the unoptimized code, the autovectorization has little, if any effect (the number of executed instructions confirms this), and the one case where auto-vectorization provides an improvement is beaten by the competition by more than a factor of 2, with and without vectorization.

Even for the two cases where I manually rearranged the data structures to be friendly to auto-vectorization, auto-vectorization gave no benefit.

Manual "Auto"-Vectorization

Ok, so can we not still make use of auto-vectorization by first thinking about how to vectorize the code (we already have done so), and then writing the code in a way that will result in the code actually using SIMD instructions?

Of course, that poses the question if it would not be better if the programming language provided ways to express the functionality directly in a vectorized way (e.g., being able to reduce a vector of comparsion results to an int, such that it can be used to control loop termination.

Where auto-vectorization seems to be pretty reliable is counted loops that iterate over one or more arrays, with each array just being read, or declared restrict. I have seen successful auto-vectorization for simple data-parallel masking based on conditions on array elements. Reduction loops are also vectorized, as are broadcasts.

But data-dependent loops tend to be beyond the capabilities of auto-vectorization, except that the recent gcc-16 claims it as a new feature.