Instruction Patterns

Common instruction patterns one might see with vectorized code generation

This page collects architecture-dependent gcc-14 expansions, where simple C sequences are translated into optimized code.

Our baseline is a gcc-14 compiler with -O2 optimization and a base machine architecture of -march=rv64gc. That's a basic 64-bit RISC-V processor (or a single hart of such a processor) with support for compressed instructions.

Variant machine architectures considered here are:

march             description
rv64gc            baseline
rv64gcv           baseline + vector extension (dynamic vector length)
rv64gcv_zvl128b   baseline + vector (minimum 128 bit vectors)
rv64gcv_zvl512b   baseline + vector (minimum 512 bit vectors)
rv64gcv_zvl1024b  baseline + vector (minimum 1024 bit vectors)
rv64gc_xtheadbb   baseline + THead bit manipulation extension (no vector)
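
The zvl* variants only raise the guaranteed minimum vector register length (VLEN); they do not fix it. A minimal compile-time sketch of how source code can test that guarantee, assuming gcc-14 defines the __riscv_vector and __riscv_v_min_vlen macros from the RISC-V C API (an assumption, not something shown in the listings below):

#if defined(__riscv_vector)
#  if __riscv_v_min_vlen >= 512
     /* every vector register is guaranteed to hold at least 512 bits */
#  endif
#else
     /* built without the V extension - scalar code only */
#endif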

Memory copy operations

Note: memory copy operations require non-overlapping source and destination. Memory move operations allow overlap but are much more complicated and are not currently optimized.
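
As a minimal illustration of the difference, shifting a buffer in place is only well defined with move semantics. The sketch below uses the standard library memmove (not the opaque memmov declaration in the reference file further down), since memcpy on overlapping regions is undefined behavior:

#include <stdio.h>
#include <string.h>

int main(void)
{
  char buf[8] = "abcdefg";
  /* destination overlaps source: memcpy() here would be undefined behavior,
     memmove() must copy as if through a temporary buffer */
  memmove(buf + 1, buf, 6);
  printf("%s\n", buf);   /* prints "aabcdef" */
  return 0;
}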

Optimizing compilers are good at turning simple memory copy operations into confusing - but fast - instruction sequences. GCC can recognize memory copy operations either as calls to memcpy or as structure assignments like *a = *b.

The current reference C file is:

extern void *memcpy(void *__restrict dest, const void *__restrict src, __SIZE_TYPE__ n);
extern void *memmov(void *dest, const void *src, __SIZE_TYPE__ n);

/* invoke memcpy with dynamic size */
void cpymem_1 (void *a, void *b, __SIZE_TYPE__ l)
{
  memcpy (a, b, l);
}

/* invoke memcpy with known size and aligned pointers */
extern struct { __INT32_TYPE__ a[16]; } a_a, a_b;

void cpymem_2 ()
{
  memcpy (&a_a, &a_b, sizeof a_a);
}

typedef struct { char c[16]; } c16;
typedef struct { char c[32]; } c32;
typedef struct { short s; char c[30]; } s16;

/* copy fixed 128 bits of memory */
void cpymem_3 (c16 *a, c16* b)
{
  *a = *b;
}

/* copy fixed 256 bits of memory */
void cpymem_4 (c32 *a, c32* b)
{
  *a = *b;
}

/* copy fixed 256 bits of memory */
void cpymem_5 (s16 *a, s16* b)
{
  *a = *b;
}

/* memmov allows overlap - don't vectorize or inline */
void movmem_1(void *a, void *b, __SIZE_TYPE__ l)
{
  memmov (a, b, l);
}

Baseline (no vector)

Ghidra 11 with the isa_ext branch decompiler gives us something simple after fixing the signature of the memcpy thunk.

void cpymem_1(void *param_1,void *param_2,size_t param_3)
{
  memcpy(param_1,param_2,param_3);
  return;
}
void cpymem_2(void)
{
  memcpy(&a_a,&a_b,0x40);
  return;
}
void cpymem_3(void *param_1,void *param_2)
{
  memcpy(param_1,param_2,0x10);
  return;
}
void cpymem_4(void *param_1,void *param_2)
{
  memcpy(param_1,param_2,0x20);
  return;
}
void cpymem_5(void *param_1,void *param_2)
{
  memcpy(param_1,param_2,0x20);
  return;
}

rv64gcv - vector extensions

If the compiler knows the target hart supports the vector extension, but is not told the exact size of each vector register, it expands all of these calls inline. Ghidra 11 gives us the following, with binutils' objdump instruction listings added as comments:

long cpymem_1(long param_1,long param_2,long param_3)
{
  long lVar1;
  undefined auVar2 [256];
  do {
    lVar1 = vsetvli_e8m8tama(param_3);  // vsetvli a5,a2,e8,m8,ta,ma
    auVar2 = vle8_v(param_2);           // vle8.v  v8,(a1)
    param_3 = param_3 - lVar1;          // sub     a2,a2,a5
    vse8_v(auVar2,param_1);             // vse8.v  v8,(a0)
    param_2 = param_2 + lVar1;          // add     a1,a1,a5
    param_1 = param_1 + lVar1;          // add     a0,a0,a5
  } while (param_3 != 0);               // bnez    a2,8a8 <cpymem_1>
  return param_1;
}
void cpymem_2(void)
{
                                        // ld      a4,1922(a4) # 2040 <a_b@Base>
                                        // ld      a5,1938(a5) # 2058 <a_a@Base>
  undefined auVar1 [256];
  vsetivli(0x10,0xd3);                  // vsetivli        zero,16,e32,m8,ta,ma
  auVar1 = vle32_v(&a_b);               // vle32.v v8,(a4)
  vse32_v(auVar1,&a_a);                 // vse32.v v8,(a5)
  return;
}
void cpymem_3(undefined8 param_1,undefined8 param_2)
{
  undefined auVar1 [256];
  vsetivli(0x10,0xc0);                   // vsetivli        zero,16,e8,m1,ta,ma
  auVar1 = vle8_v(param_2);              // vle8.v  v1,(a1)
  vse8_v(auVar1,param_1);                // vse8.v  v1,(a0)
  return;
}
void cpymem_4(undefined8 param_1,undefined8 param_2)
{
  undefined auVar1 [256];                // li      a5,32
  vsetvli_e8m8tama(0x20);                // vsetvli        zero,a5,e8,m8,ta,ma
  auVar1 = vle8_v(param_2);              // vle8.v  v8,(a1)
  vse8_v(auVar1,param_1);                // vse8.v  v8,(a0)
  return;
}
void cpymem_5(undefined8 param_1,undefined8 param_2)
{
  undefined auVar1 [256];
  vsetivli(0x10,0xcb);                   // vsetivli        zero,16,e16,m8,ta,ma
  auVar1 = vle16_v(param_2);             // vle16.v v8,(a1)
  vse16_v(auVar1,param_1);               // vse16.v v8,(a0)
  return;
}
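
For comparison, the stripmined loop in cpymem_1 can be written by hand with the RVV C intrinsics. A minimal sketch, assuming the <riscv_vector.h> intrinsics (__riscv_vsetvl_e8m8, __riscv_vle8_v_u8m8, __riscv_vse8_v_u8m8) shipped with gcc-14; this is an illustrative equivalent, not the source gcc actually compiled:

#include <riscv_vector.h>
#include <stdint.h>
#include <stddef.h>

void cpymem_1_sketch(uint8_t *dst, const uint8_t *src, size_t n)
{
  while (n > 0) {
    size_t vl = __riscv_vsetvl_e8m8(n);            /* vsetvli a5,a2,e8,m8,ta,ma */
    vuint8m8_t v = __riscv_vle8_v_u8m8(src, vl);   /* vle8.v  v8,(a1)           */
    __riscv_vse8_v_u8m8(dst, v, vl);               /* vse8.v  v8,(a0)           */
    n -= vl;                                       /* sub     a2,a2,a5          */
    src += vl;                                     /* add     a1,a1,a5          */
    dst += vl;                                     /* add     a0,a0,a5          */
  }
}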

The variation in the vset* instructions is a bit puzzling at first. The element width tracks the alignment of the copied type: e32 for the 32 bit integer array in cpymem_2, e16 for the short-aligned s16 struct in cpymem_5, and e8 for the char-only structs in cpymem_3 and cpymem_4. (cpymem_4 also needs the register form vsetvli rather than vsetivli because its element count of 32 does not fit in vsetivli's 5 bit immediate.) This may be an alignment consideration - trying to copy a short int to a misaligned odd address generates an exception at the store instruction, so perhaps the vectorized copy is supposed to throw an exception there too.
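
Ghidra renders the vset instructions as pseudo-ops whose second argument is the raw vtype immediate (0xc0, 0xcb, 0xd3 above). A small decoder sketch for checking those values against the objdump comments, assuming the vtype layout from the RVV 1.0 spec (vlmul in bits 0-2, vsew in bits 3-5, vta in bit 6, vma in bit 7); the helper name is made up for illustration:

#include <stdio.h>

/* decode a vtype CSR image as written by vsetvli/vsetivli (RVV 1.0 layout) */
static void decode_vtype(unsigned v)
{
  static const char *lmul[] = {"m1","m2","m4","m8","?","mf8","mf4","mf2"};
  unsigned sew = (v >> 3) & 0x7;              /* vsew[2:0]  */
  unsigned mul = v & 0x7;                     /* vlmul[2:0] */
  printf("0x%02x: e%u,%s,%s,%s\n", v,
         8u << sew, lmul[mul],
         (v & 0x40) ? "ta" : "tu",            /* vta */
         (v & 0x80) ? "ma" : "mu");           /* vma */
}

int main(void)
{
  decode_vtype(0xc0);   /* e8,m1,ta,ma  - cpymem_3 */
  decode_vtype(0xcb);   /* e16,m8,ta,ma - cpymem_5 */
  decode_vtype(0xd3);   /* e32,m8,ta,ma - cpymem_2 */
  return 0;
}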