Bubble sort slower with -O3 than -O2 with GCC

I made a bubble sort implementation in C, and was testing its performance when I noticed that the -O3 flag made it run even slower than no flags at all! Meanwhile -O2 was making it run a lot faster as expected.

Without optimisations:

time ./sort 30000./sort 30000  1.82s user 0.00s system 99% cpu 1.816 total

-O2:

time ./sort 30000./sort 30000  1.00s user 0.00s system 99% cpu 1.005 total

-O3:

time ./sort 30000./sort 30000  2.01s user 0.00s system 99% cpu 2.007 total

The code:

#include <stdio.h>#include <stdlib.h>#include <stdbool.h>#include <time.h>int n;void bubblesort(int *buf){    bool changed = true;    for (int i = n; changed == true; i--) { /* will always move at least one element to its rightful place at the end, so can shorten the search by 1 each iteration */        changed = false;        for (int x = 0; x < i-1; x++) {            if (buf[x] > buf[x+1]) {                /* swap */                int tmp = buf[x+1];                buf[x+1] = buf[x];                buf[x] = tmp;                changed = true;            }        }    }}int main(int argc, char *argv[]){    if (argc != 2) {        fprintf(stderr, "Usage: %s <arraysize>\n", argv[0]);        return EXIT_FAILURE;    }    n = atoi(argv[1]);    if (n < 1) {        fprintf(stderr, "Invalid array size.\n");        return EXIT_FAILURE;    }    int *buf = malloc(sizeof(int) * n);    /* init buffer with random values */    srand(time(NULL));    for (int i = 0; i < n; i++)        buf[i] = rand() % n + 1;    bubblesort(buf);    return EXIT_SUCCESS;}

The assembly language generated for -O2 (from godbolt.org):

bubblesort:        mov     r9d, DWORD PTR n[rip]        xor     edx, edx        xor     r10d, r10d.L2:        lea     r8d, [r9-1]        cmp     r8d, edx        jle     .L13.L5:        movsx   rax, edx        lea     rax, [rdi+rax*4].L4:        mov     esi, DWORD PTR [rax]        mov     ecx, DWORD PTR [rax+4]        add     edx, 1        cmp     esi, ecx        jle     .L2        mov     DWORD PTR [rax+4], esi        mov     r10d, 1        add     rax, 4        mov     DWORD PTR [rax-4], ecx        cmp     r8d, edx        jg      .L4        mov     r9d, r8d        xor     edx, edx        xor     r10d, r10d        lea     r8d, [r9-1]        cmp     r8d, edx        jg      .L5.L13:        test    r10b, r10b        jne     .L14.L1:        ret.L14:        lea     eax, [r9-2]        cmp     r9d, 2        jle     .L1        mov     r9d, r8d        xor     edx, edx        mov     r8d, eax        xor     r10d, r10d        jmp     .L5

And the same for -O3:

bubblesort:        mov     r9d, DWORD PTR n[rip]        xor     edx, edx        xor     r10d, r10d.L2:        lea     r8d, [r9-1]        cmp     r8d, edx        jle     .L13.L5:        movsx   rax, edx        lea     rcx, [rdi+rax*4].L4:        movq    xmm0, QWORD PTR [rcx]        add     edx, 1        pshufd  xmm2, xmm0, 0xe5        movd    esi, xmm0        movd    eax, xmm2        pshufd  xmm1, xmm0, 225        cmp     esi, eax        jle     .L2        movq    QWORD PTR [rcx], xmm1        mov     r10d, 1        add     rcx, 4        cmp     r8d, edx        jg      .L4        mov     r9d, r8d        xor     edx, edx        xor     r10d, r10d        lea     r8d, [r9-1]        cmp     r8d, edx        jg      .L5.L13:        test    r10b, r10b        jne     .L14.L1:        ret.L14:        lea     eax, [r9-2]        cmp     r9d, 2        jle     .L1        mov     r9d, r8d        xor     edx, edx        mov     r8d, eax        xor     r10d, r10d        jmp     .L5

It seems like the only significant difference to me is the apparent attempt to use SIMD, which seems like it should be a large improvement, but I also can't tell what on earth it's attempting with those pshufd instructions... is this just a failed attempt at SIMD? Or maybe the couple of extra instructions is just about edging out my instruction cache?

Timings were done on an AMD Ryzen 5 3600.

Latest Images

Trending Articles

Latest Images