1 - Ghidra Advisor Examples

1.1 - Whisper.cpp Example

Let’s inspect whisper.cpp binaries as compiled for a RISCV-64 processor with vector extensions. We’ll start with this snippet from the function whisper_full_with_state.

000cc166 57 77 80 0d     vsetvli                        a4,zero,e64,m1,ta,ma
000cc16a d7 a1 08 52     vid.v                          v3
000cc16e d7 30 00 5e     vmv.v.i                        v1,0x0
000cc172 93 06 80 03     li                             a3,0x38
000cc176 d7 e1 36 96     vmul.vx                        v3,v3,a3
000cc17a d6 86           c.mv                           a3,s5
000cc17c ce 97           c.add                          a5,s3
000cc17e 03 b6 07 18     ld                             a2,0x180(a5)
000cc182 31 06           c.addi                         a2,0xc
LAB_000cc184                                    XREF[1]:     000cc1a0(j)
000cc184 d7 f7 76 09     vsetvli                        a5,a3,e32,mf2,tu,ma
000cc188 07 71 36 06     vluxei64.v                     v2,(a2),v3
000cc18c 57 32 10 9e     vmv1r.v                        v4,v1
000cc190 13 97 37 00     slli                           a4,a5,0x3
000cc194 1d 8f           c.sub                          a4,a5
000cc196 0e 07           c.slli                         a4,0x3
000cc198 9d 8e           c.sub                          a3,a5
000cc19a d7 10 41 d2     vfwadd.wv                      v1,v4,v2
000cc19e 3a 96           c.add                          a2,a4
000cc1a0 f5 f2           c.bnez                         a3,LAB_000cc184
000cc1a2 d3 07 00 f2     fmv.d.x                        fa5,zero
000cc1a6 57 77 80 0d     vsetvli                        a4,zero,e64,m1,ta,ma
000cc1aa 57 d1 07 42     vfmv.s.f                       v2,fa5
000cc1ae d7 10 11 06     vfredusum.vs                   v1,v1,v2
000cc1b2 d7 14 10 42     vfmv.f.s                       fs1,v1

Advisor Output

Signatures:

  • Element width is = 64 bits
  • Vector element index: vid.v
    • often used in generating a vector of offsets for indexing
  • Element width is = 32 bits
    • Vector multiplier is fractional, MUL = f2
  • Vector unordered indexed load: vluxei64.v
  • Vector register move/copy: vmv1r.v
  • Element width is = 64 bits
  • FP register to vector register element 0
  • Vector floating point reduction: vfredusum.vs
  • Vector register element 0 to FP register
  • Significant opcodes, in the order they appear:
    • vsetvli,vid.v,vmv.v.i,vmul.vx,vsetvli,vluxei64.v,vmv1r.v,vfwadd.wv,c.bnez,vsetvli,vfmv.s.f,vfredusum.vs,vfmv.f.s
  • Significant opcodes, in alphanumeric order:
    • c.bnez,vfmv.f.s,vfmv.s.f,vfredusum.vs,vfwadd.wv,vid.v,vluxei64.v,vmul.vx,vmv.v.i,vmv1r.v,vsetvli,vsetvli,vsetvli
  • At least one loop exists

Similarity Analysis

Compare the clipped example to the database of vectorized examples.

The best match is id=588 [0.687]= bnez,vfadd.vv,vle8.v,vlseg2e64.v,vmsne.vi,vmv.v.i,vmv1r.v,vse64.v,vsetvli,vsetvli,vsetvli

The clip is similar to the reference example data/gcc_riscv_testsuite/rvv/autovec/struct/mask_struct_load-1_rv64gcv:test_f64_f64_i8_2

Reference C Source

void test_f64_f64_i8_2(double *__restrict dest, double *__restrict src, int8_t *__restrict cond, intptr_t n)
{
  for (intptr_t i = 0; i < n; ++i)
    if (cond[i])
      dest[i] = src[i * 2] + src[i * 2 + 1];
}

Reference Compiled Assembly Code

2932	blez	a3,2970 <test_f64_f64_i8_2+0x3e>
2936	vsetvli	a5,zero,e64,m1,ta,ma
293a	vmv.v.i	v4,0
293e	vsetvli	a5,a3,e8,mf8,ta,ma
2942	vle8.v	v0,(a2)
2946	vmv1r.v	v1,v4
294a	slli	a6,a5,0x4
294e	slli	a4,a5,0x3
2952	sub	a3,a3,a5
2954	add	a2,a2,a5
2956	vmsne.vi	v0,v0,0
295a	vlseg2e64.v	v2,(a1),v0.t
295e	vsetvli	zero,zero,e64,m1,tu,mu
2962	add	a1,a1,a6
2964	vfadd.vv	v1,v3,v2,v0.t
2968	vse64.v	v1,(a0),v0.t
296c	add	a0,a0,a4
296e	bnez	a3,293e <test_f64_f64_i8_2+0xc>
2970	ret	

Manual analysis

The matched C code is not a particularly good match. The Advisor suggests that this is a floating point add reduction loop summing 32 bit floats into a 64 bit double result, with a relatively complex indexing calculation to access the addends.

The path forward is to add custom training examples until we converge on a decent match. We can do that by adding a new training function to the database:

#include <stdint.h>
double reduce_floats(float *p, uint32_t count)
{
    double result = 0.0;
    for (int i = 0; i < count; i++)
    {
        result += p[i * 14];
    }
    return result;
}

Rebuild the database with generator.py, ingest.py, and sample_analytics.py. We can then rerun the Advisor - but we don’t get a better match! In fact, compiling this training example with GCC 15.0, -O3, and -march=rv64gcv doesn’t produce any vector instructions at all. Because reordering floating point additions can change the result, we need to add one more compilation option, -ffast-math, to convince the compiler that vectorization is desired.

After adding -ffast-math to the relevant BUILD file, we get a much better Advisor output:

Advisor Output - with fast-math

Similarity Analysis

Compare the clipped example to the database of vectorized examples.

The best match is id=1532 [0.957]= bnez,vfmv.f.s,vfmv.s.f,vfredusum.vs,vfwadd.wv,vid.v,vluxei64.v,vmul.vx,vmv.v.i,vmv1r.v,vsetvli,vsetvli,vsetvli,vsll.vi

The clip is similar to the reference example data/custom_testsuite/structs/reduction_rv64gcv:reduce_floats

Notes

  • The whisper.cpp binary was compiled with -O3 and -ffast-math. This is an example of iteratively reconstructing the toolchain until we can generate something close to the observed code.

  • The actual C source likely does not contain anything like p[i * 14]. It is much more likely to be iterating over an array of structs, each 56 bytes long, and summing a 32 bit float field within that structure.

  • There likely isn’t much of a gain to be had with vectorizing this loop if VLEN is 128 bits. Only two elements can be processed per iteration, with 10 instructions per iteration. The scalar code handles one element per iteration in five instructions, so any possible gain is due to a tradeoff between code size and branch handling.

Reconciling with Ghidra

This reduction example decomposes the loop into three segments:

  1. A setup section before the loop initializing:
    • a double result vector v1 to zero
    • a pointer to the first addend in a2
    • a vector of relative byte offsets v3 = (0, 1, ...) * 0x38
  2. The loop body:
    • a5 = number of addends that fit into a vector register
    • v2 = an indexed load of addends
    • v1 += v2
    • a4 = 0x38 * a5
    • a2 += a4 ; adjust the address of the next first addend
  3. The loop post-processing:
    • v2 = 0.0
    • v1[0] = v2[0] +/ v1 ; where +/ sums the elements of a single vector into the first element of the result vector
    • fs1 = v1[0]

The original C code might look like this:

struct astruct {
  uint32_t field0x00[3];
  float    field0x0c;
  uint32_t field0x10[10];
};
// sizeof astruct = 0x38
double reduce_floats(struct astruct* p, uint32_t count)
{
    double result = 0.0;
    for (int i = 0; i < count; i++)
    {
        result += p[i].field0x0c;
    }
    return result;
}

1.2 - DPDK Network Example

Are network appliances improved with autovectorization? We can examine code examples taken from the DPDK network project to find out. We’ll start with code that uses vector gather and slide operations in what might be a key datapath routine. The binary dpdk_l3fwd is an example of a layer 3 forwarding engine. It includes the function rx_recv_pkts, which appears to come from the DPDK source file drivers/net/iavf/iavf_rxtx.c.

This function includes several vector stanzas that use vrgather and vslide1down.

We will make things a bit easier by including the preprocessed source file drivers/net/iavf/iavf_rxtx.c in our custom dataset under data/custom_testsuite/dpdk. This brings all of the relevant DPDK header files inline.

The first vector stanza to examine is:

LAB_0053c12e             XREF[2]:     0053be34(j), 0053be42(j)  
0053c12e 73 2e 20 c2     csrrs        t3,vlenb,zero
0053c132 57 73 80 0d     vsetvli      t1,zero,e64,m1,ta,ma 
0053c136 b3 0e c0 41     neg          t4,t3
0053c13a 93 58 3e 00     srli         a7,t3,0x3
0053c13e 3e 97           c.add        a4,a5
0053c140 d7 a1 08 52     vid.v        v3
0053c144 93 87 0e 04     addi         a5,t4,0x40
0053c148 93 86 f8 ff     addi         a3,a7,-0x1
0053c14c 33 67 f7 20     sh3add       a4,a4,a5
0053c150 d7 c1 36 0e     vrsub.vx     v3,v3,a3
0053c154 5a 97           c.add        a4,s6
0053c156 3b 8f 1b 41     subw         t5,s7,a7
0053c15a a2 86           c.mv         a3,s0
0053c15c 81 47           c.li         a5,0x0
LAB_0053c15e             XREF[1]:     0053c172(j)  
0053c15e 07 71 87 02     vl1re64.v    v2,(a4)
0053c162 bb 87 17 01     addw         a5,a5,a7
0053c166 76 97           c.add        a4,t4
0053c168 d7 80 21 32     vrgather.vv  v1,v2,v3
0053c16c a7 80 86 02     vs1r.v       v1,(a3)
0053c170 f2 96           c.add        a3,t3
0053c172 e3 76 ff fe     bgeu         t5,a5,LAB_0053c15e

This sequence suggests a load/permute/store operation on 64 bit elements. There are lots of complications:

  • The function rx_recv_pkts has static inline attributes, so it has likely been merged with other functions.
  • There are no obvious candidates for the gcc C source code that generates this vrgather pattern.
  • This may be an artifact of using snapshots of gcc and dpdk.
  • The vs1r.v instruction is a whole register store variant that ignores vsetvli settings.

Analysis of this snippet is easier with emulation:

  • The assembly fragment is copied into a new source file, emulations/test_1.S
  • The assembly source is converted into a subroutine
  • Instrumentation is added to store intermediate values into global data variables.
  • A main.c driver routine is added, to provide input and output vectors and instrumentation printouts.
  • A RISCV-64 executable is built
  • A RISCV-64 QEMU userspace emulator is built and configured
  • The RISCV-64 executable is run to see what permutations are generated
  • The RISCV-64 executable is run with alternate values for VLEN, to show that the results are stable regardless of the vector hardware implementation.

Process this Ghidra assembly fragment through the current Advisor:

Signatures:

  • Element width is = 64 bits
  • Vector element index: vid.v
    • often used in generating a vector of offsets for indexing
  • Vector whole register load: vl1re64.v
  • Vector gather: vrgather.vv
  • Vector whole register(s) store: vs1r.v
  • Significant opcodes, in the order they appear:
    • vsetvli,vid.v,vrsub.vx,vl1re64.v,vrgather.vv,vs1r.v,bgeu
  • Significant opcodes, in alphanumeric order:
    • bgeu,vid.v,vl1re64.v,vrgather.vv,vrsub.vx,vs1r.v,vsetvli
  • At least one loop exists

Similarity Analysis

Compare the clipped example to the database of vectorized examples.

The best match is id=1689 [0.860]= vid.v,vle16.v,vrgather.vv,vrsub.vi,vse16.v,vsetivli

The clip is similar to the reference example `data/gcc_riscv_testsuite/rvv/autovec/vls-vlmax/perm-4_rv64gcv:permute_vnx2hi`

Reference C Source

__attribute__((noipa))
void permute_vnx2hi(vnx2hi values1, vnx2hi values2, vnx2hi *out)
{
  vnx2hi v = __builtin_shufflevector (values1, values2, (2) - 1 - (0), (2) - 2 - (0));
  *(vnx2hi *) out = v;
}

GCC’s __builtin_shufflevector with the given mask reverses the order of a vector of two 16 bit values. The signature match of 0.86 suggests this is just a hint of what is going on here.

After rewriting the Ghidra sample as a riscv64 assembly routine, we get:

    .section .data.test_1
    .globl cntr, vl, src_incr, dst_incr, src_start, dst_start
    .globl cnt_limit
dst_incr:
    .dword  0
vl:
    .dword  0
src_incr:
    .dword  0
cntr:
    .dword  0
src_start:
    .dword  0
dst_start:
    .dword  0
cnt_limit:
    .word  0

    .section        .text.test_1,"ax",@progbits
    .globl test_1
    .type  test_1, @function

test_1:
    csrrs          t3,vlenb,zero
    lui            t6,%hi(dst_incr)
    addi           t6,t6,%lo(dst_incr)
    sd             t3,(t6)                # ->dst_incr
    vsetvli        t1,zero,e64,m1,ta,ma
    sd             t1,8(t6)               # ->vl
    neg            t4,t3
    sd             t4,16(t6)              # ->src_incr
    srli           a7,t3,0x3
    sd             a7,24(t6)              # ->cntr
    vid.v          v3
    addi           a5,t4,0x40
    addi           a3,a7,-0x1
    add            a0,a5,a0
    sd             a0,32(t6)              # ->src_start
    vrsub.vx       v3,v3,a3
    c.li           t5,0x8
    subw           t5,t5,a7
    sw             t5,48(t6)              # ->cnt_limit
    vs1r.v         v3,(a2)
    c.li           a5,0x0
    sd             a1,40(t6)              # ->dst_start
LAB_0053c15e:
    vl1re64.v      v2,(a0)
    addw           a5,a5,a7
    c.add          a0,t4
    vrgather.vv    v1,v2,v3
    vs1r.v         v1,(a1)
    add            a1,a1,t3
    bgeu           t5,a5,LAB_0053c15e
    ret

.globl get_vlenb
.type  get_vlenb, @function

get_vlenb:
    csrrs          a0,vlenb,zero
    srli           a2,a0,0x3
    ret

The main routine to exercise this is:

#include <stdio.h>
extern void test_1(unsigned long long *, unsigned long long *, unsigned long long *);
extern void test_1_ref(unsigned long long *in, unsigned long long *out, unsigned int size);
extern long long cntr, vl, src_incr, dst_incr, src_start, dst_start;
extern long cnt_limit;
int main(void)
{
    unsigned long long in_vector[8] = {90,91,92,93,94,95,96,97};
    unsigned long long out_vector[8] = {0,0,0,0,0,0,0,0};
    unsigned long long shuffle_vector[4] = {0,0,0,0};
    printf("Initializing\n");
    printf("in_vector_addr: %16p\n", (void*)in_vector);
    printf("out_vector_addr: %16p\n", (void*)out_vector);
    test_1(in_vector, out_vector, shuffle_vector);
    printf("cntr: %lld\n", cntr);
    printf("vl: %lld\n", vl);
    printf("src_incr: 0x%llx\n", src_incr);
    printf("dst_incr: 0x%llx\n", dst_incr);
    printf("src_start: %p\n", (void*)src_start);
    printf("dst_start: %p\n", (void*)dst_start);
    printf("cnt_limit: 0x%lx\n", cnt_limit);
    printf("shuffle vector: %lld, %lld, %lld, %lld\n", shuffle_vector[0], shuffle_vector[1], shuffle_vector[2], shuffle_vector[3]);
    printf("out vector: %lld, %lld, %lld, %lld, ", out_vector[0], out_vector[1], out_vector[2], out_vector[3]);
    printf("%lld, %lld, %lld, %lld\n", out_vector[4], out_vector[5], out_vector[6], out_vector[7]);
    printf("Assembly version Complete, starting reference version\n");
    /* Once the assembly mockup returns a good value, try a C equivalent
    test_1_ref(in_vector, out_vector, 8);
    printf("out vector: %lld, %lld, %lld, %lld, ", out_vector[0], out_vector[1], out_vector[2], out_vector[3]);
    printf("%lld, %lld, %lld, %lld\n", out_vector[4], out_vector[5], out_vector[6], out_vector[7]);
    */
}

The QEMU VLEN variable and microarchitecture extensions are configured on the command line. The following requests VLEN=128, the minimum vector register size.

$ export QEMU_CPU=rv64,zba=true,zbb=true,v=true,vlen=128,vext_spec=v1.0,rvv_ta_all_1s=true,rvv_ma_all_1s=true

Build and run with:

training_set$ bazel build --platforms=//platforms:riscv_gcc emulations:test_1
INFO: Analyzed target //emulations:test_1 (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
Target //emulations:test_1 up-to-date:
  bazel-bin/emulations/test_1

training_set$ qemu-riscv64-static -L /opt/riscvx/sysroot -E LD_LIBRARY_PATH=/opt/riscvx/sysroot/riscv64-unknown-linux-gnu/lib/ bazel-bin/emulations/test_1
Initializing
in_vector_addr:   0x2aaaab2aaab0
out_vector_addr:   0x2aaaab2aaaf0
cntr: 2
vl: 2
src_incr: 0xfffffffffffffff0
dst_incr: 0x10
src_start: 0x2aaaab2aaae0
dst_start: 0x2aaaab2aaaf0
cnt_limit: 0x6
shuffle vector: 1, 0, 0, 0
out vector: 97, 96, 95, 94, 93, 92, 91, 90
Complete!

Repeat with VLEN=256 and the same executable binary, to see if the results are stable:

$ export QEMU_CPU=rv64,zba=true,zbb=true,v=true,vlen=256,vext_spec=v1.0,rvv_ta_all_1s=true,rvv_ma_all_1s=true

training_set$ qemu-riscv64-static -L /opt/riscvx/sysroot -E LD_LIBRARY_PATH=/opt/riscvx/sysroot/riscv64-unknown-linux-gnu/lib/ bazel-bin/emulations/test_1
Initializing
in_vector_addr:   0x2aaaab2aaab0
out_vector_addr:   0x2aaaab2aaaf0
cntr: 4
vl: 4
src_incr: 0xffffffffffffffe0
dst_incr: 0x20
src_start: 0x2aaaab2aaad0
dst_start: 0x2aaaab2aaaf0
cnt_limit: 0x4
shuffle vector: 3, 2, 1, 0
out vector: 97, 96, 95, 94, 93, 92, 91, 90
Complete!

So now the permutation is clear - this snippet copies a vector of 64 bit values - likely pointers - from one location to another in the reverse order. The source pointer advances backwards, starting vl elements before the end, while the destination pointer advances forwards, starting at the first element. The reverse-copy operation is stable with VLEN of 128 or 256 bits.

Next we can add test_1_ref.c, a C routine which should compile to something much closer to the sample binary:

void test_1_ref(unsigned long long *in, unsigned long long *out, unsigned int size)
{
    int i;
    int upper_index = size - 1;
    for (i=0; i < size; i++) {
        out[i] = in[upper_index - i];
    }
}

Record the new example

We want to add test_1_ref to the training set database, so copy it into custom_testsuite/structs, update the BUILD file there, then rerun the analysis:

./generator.py 
./ingest.py 
./sample_analytics.py 
./advisor.py data_test/advisor_tests/gather2.ghidra
...
The best match is id=1864 [0.926]= bgeu,bltu,bne,vid.v,vl1re64.v,vrgather.vv,vrsub.vx,vs1r.v,vsetvli

The clip is similar to the reference example `data/custom_testsuite/structs/test_1_ref_rv64gcv:test_1_ref`

The extra branch instructions clutter the signature match result somewhat.

Feature extraction

What generalized features does this example provide?

  1. There is a single loop
  2. The loop includes one vector load, one vector store, and one vector gather operation.
  3. The loop exit condition depends on a scalar counter advancing to equal or exceed a fixed limit value, without a dependency on vector element values.
  4. Source and destination pointers increment with different values
  5. The loop setup identifies the SEW as 64 bits. This is unchanged throughout the sample code. This implies that this vs1r.v instruction is equivalent to a vse64.v instruction.
  6. The bgeu instruction terminating the loop could be encoded as bleu with reversed operands. More generally, the loop termination could have been implemented in several different ways.

Register assignments appear to be:

    Assignment                     Register   Initialization
    Source address                 a4
    Source address increment       t4         -vlenb
    Destination address            a3
    Destination address increment  t3         vlenb
    Loop counter                   a5         0
    Loop counter increment         a7         vlenb >> 3
    Loop counter limit             t5
    Source vector register         v2
    Destination vector register    v1
    Gather index vector register   v3         depends on vlenb

The vector gather index register v3 depends on vl, the number of 64 bit elements that fit within a vector register.

  • if vlen = 128 - or vlenb = 16 - then v3 = (1,0)
  • if vlen = 256 - or vlenb = 32 - then v3 = (3,2,1,0)
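That dependence is simple to model in scalar C. This sketch (function name ours) reproduces the vid.v/vrsub.vx index computation for 64 bit elements:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the gather index setup: vid.v fills v3 with (0,1,...,vl-1),
   then vrsub.vx v3,v3,vl-1 reverses it. For 64 bit elements,
   vl = vlenb >> 3. Returns vl. */
static size_t gather_indices_e64(size_t vlenb, uint64_t v3[])
{
    size_t vl = vlenb >> 3;
    for (size_t i = 0; i < vl; i++)
        v3[i] = (vl - 1) - i;
    return vl;
}
```

With vlenb = 16 this yields (1, 0); with vlenb = 32 it yields (3, 2, 1, 0), matching the shuffle vectors printed by the emulation runs above.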

It is not at all clear whether the compiler knows the size of the source vector, or whether the code works as intended when the source vector size is odd or smaller than the size of a vector register.

We can resolve that by looking at Ghidra’s decompiler window. The vector instruction stanza actually begins earlier, executing the vector instructions only after a test of vlenb.

Examination of test_1_ref

GCC 15 compiles test_1_ref.c into something more complex than the assembly code stanza we analyzed. The single-line loop becomes three loops - a scalar loop to handle relatively small vector sizes, then a vector loop to handle an integer number of VLEN strips, and finally a second scalar loop to handle any remaining elements. For the code stanza passed in for analysis, the input and output vectors appear to have a fixed size known to the compiler - 64 bytes. This implies that no scalar loops are needed for VLEN=128 bits or VLEN=256 bits. The code stanza would generate a buffer overflow for VLEN>512 bits.

void test_1_ref(ulonglong *in,ulonglong *out,ulong size)
{
  ulonglong *pIn;
  ulonglong *puVar1;
  long lVar2;
  ulonglong uVar3;
  ulong uVar4;
  ulong uVar5;
  int iVar6;
  undefined auVar7 [256];
  undefined auVar8 [256];
  ulong in_vlenb;
  ulonglong *pOut;
  
  gp = &__global_pointer$;
  if (size != 0) {
    uVar5 = (ulong)((int)size + -1);
    if (((uVar5 < 0xd) || (uVar5 < (ulong)(long)((int)(in_vlenb >> 3) + -1))) ||
       ((lVar2 = uVar5 + 1, in + (lVar2 - (size & 0xffffffff)) < out + (size & 0xffffffff) &&
        (out < in + lVar2)))) {
      pIn = in + uVar5;
      pOut = out;
      do {
        uVar3 = *pIn;
        puVar1 = pOut + 1;
        pIn = pIn + -1;
        *pOut = uVar3;
        pOut = puVar1;
      } while (puVar1 != out + (size & 0xffffffff));
    }
    else {
      vsetvli_e64m1tama(0);
      auVar8 = vid_v();
      auVar8 = vrsub_vx(auVar8,(in_vlenb >> 3) - 1);
      lVar2 = lVar2 * 8 + -in_vlenb + (long)in;
      iVar6 = (int)(in_vlenb >> 3);
      uVar4 = 0;
      pOut = out;
      do {
        auVar7 = vl1re64_v(lVar2);
        uVar4 = (ulong)((int)uVar4 + iVar6);
        lVar2 = lVar2 + -in_vlenb;
        auVar7 = vrgather_vv(auVar7,auVar8);
        vs1r_v(auVar7,pOut);
        pOut = (ulonglong *)((long)pOut + in_vlenb);
      } while (uVar4 <= (ulong)(long)((int)size - iVar6));
      if (size != uVar4) {
        pOut = in + (uVar5 - uVar4);
        pIn = out + uVar4;
        do {
          uVar3 = *pOut;
          uVar4 = (ulong)((int)uVar4 + 1);
          pOut = pOut + -1;
          *pIn = uVar3;
          pIn = pIn + 1;
        } while (uVar4 < size);
        return;
      }
    }
  }
  return;
}
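The three-part structure is easier to follow as a scalar sketch. Names are ours, the vector strip loop is modeled element-wise, and the aliasing test is omitted:

```c
#include <stddef.h>

typedef unsigned long long ull;

/* Sketch of GCC's decomposition of the reverse copy: a scalar fallback
   for small inputs, a vectorized strip loop (modeled element-wise), and
   a scalar epilogue for the remainder. vl models the number of 64 bit
   elements per vector register. */
static void reverse_copy_sketch(const ull *in, ull *out, size_t n, size_t vl)
{
    size_t i = 0;
    if (n < vl) {                         /* scalar fallback path */
        for (; i < n; i++)
            out[i] = in[n - 1 - i];
        return;
    }
    for (; i + vl <= n; i += vl)          /* vector strip loop */
        for (size_t j = 0; j < vl; j++)
            out[i + j] = in[n - 1 - (i + j)];
    for (; i < n; i++)                    /* scalar epilogue */
        out[i] = in[n - 1 - i];
}
```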

The scalar loop consists of about 22 bytes, or 7 instructions. The full vector function takes up 180 bytes. There is little reason to believe that vectorization actually improves this code, especially for vector harts holding only two 64 bit objects per vector register.

Summary

There is no visible reason to enable full autovectorization for datapath network routing code like DPDK. It might make sense to vectorize lockable critical regions or specific critical path code. Vector replacement of utility functions like memcpy or strncmp would still make sense. If someone created an AI-adaptive network appliance, the math inference engine portions would likely be improved with vectorization.

1.3 - Whisper Exploration Example

How do we move forward when the Advisor doesn’t provide a lot of help? We’ll start with an example taken from Whisper.cpp’s main routine.

pFVar26 = local_460;
vsetivli_e64m1tama(2);
local_700 = (FILE *)local_c0._40_8_;
local_6f8 = pFVar34->_IO_read_ptr;
auVar45 = vle64_v(avStack_470);
vmv_v_i(in_v4,0);
auVar46 = vle64_v(&local_700);
vse64_v(in_v4,avStack_470);
auVar47 = vslidedown_vi(auVar45,1);
auVar46 = vslidedown_vi(auVar46,1);
local_4e0 = (FILE *)vmv_x_s(auVar46);
auVar46 = vle64_v(&local_700);
pcVar20 = (char *)vmv_x_s(auVar47);
local_4d8 = local_88;
local_c0._40_8_ = vmv_x_s(auVar45);
pFVar34->_IO_read_ptr = pcVar20;
local_4e8 = (FILE *)vmv_x_s(auVar46);
local_460 = (FILE *)0x0;
local_88 = pFVar26;
std::vector<>::~vector((vector<> *)&local_4e8);
std::vector<>::~vector(avStack_470);
std::_Rb_tree<>::_M_erase((_Rb_tree_node *)local_490);

This clearly isn’t a loop. Instead, it is some sort of initialization sequence in which vector instructions slightly optimize otherwise scalar code. The Advisor results aren’t very helpful:

Signatures:

    Vector length set to = 0x2
    Element width is = 64 bits
    Vector load: vle64.v
    Vector load: vle64.v
    Vector store: vse64.v
    Vector integer slidedown: vslidedown.vi
    Vector integer slidedown: vslidedown.vi
    Vector load: vle64.v
    Significant operations, in the order they appear:
        vsetivli,vle64.v,vmv.v.i,vle64.v,vse64.v,vslidedown.vi,vslidedown.vi,vmv.x.s,vle64.v,vmv.x.s,vmv.x.s,vmv.x.s
    Significant operations, in alphanumeric order:
        vle64.v,vle64.v,vle64.v,vmv.v.i,vmv.x.s,vmv.x.s,vmv.x.s,vmv.x.s,vse64.v,vsetivli,vslidedown.vi,vslidedown.vi

Similarity Analysis

Compare the clipped example to the database of vectorized examples.

The best match is id=1889 [0.652]= vmv.v.i,vmv.x.s,vmv.x.s,vmv.x.s,vse8.v,vsetivli,vsetivli,vsetivli,vsetvli

The clip is similar to the reference example data/custom_testsuite/builtins/string_rv64gcv:bzero_15

This suggests several Advisor improvements:

  • explicitly report that no loops are found, and that the stanza is likely a vector optimization of scalar instruction transforms.
  • add a quick explanation of what vslidedown.vi does
  • the vmv instructions need annotation, especially any that load constants into registers.

A manual analysis suggests that the vector instructions manipulate pairs of 64 bit pointers, variously copying them, zeroing them, or copying first or second elements of the pair into scalar registers. That probably means we want simple C++ vector manipulation functions in our set of custom patterns.
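A hypothetical training candidate for that pattern set might be as simple as this. The function and its semantics are our guess at std::vector move bookkeeping; whether GCC actually compiles it to a vsetivli/vle64.v/vse64.v/vmv.v.i stanza would need to be verified:

```c
#include <stdint.h>

/* Hypothetical training candidate: move a pair of 64 bit pointer-sized
   values from one slot to another and zero the source, the kind of
   bookkeeping a std::vector move assignment performs. */
void move_ptr_pair(uint64_t dst[2], uint64_t src[2])
{
    dst[0] = src[0];
    dst[1] = src[1];
    src[0] = 0;
    src[1] = 0;
}
```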

1.4 - Whisper Memcpy

Is it easy to recognize vector expansions of libc functions like memcpy?

Let’s locate some explicit invocations of memcpy within Whisper and see what the Advisor has to say.

struct whisper_context * whisper_init_from_buffer_with_params_no_state(void * buffer, size_t buffer_size, struct whisper_context_params params) {
    struct buf_context {
        uint8_t* buffer;
        size_t size;
        size_t current_offset;
    };
    loader.read = [](void * ctx, void * output, size_t read_size) {
        buf_context * buf = reinterpret_cast<buf_context *>(ctx);

        size_t size_to_copy = buf->current_offset + read_size < buf->size ? read_size : buf->size - buf->current_offset;

        memcpy(output, buf->buffer + buf->current_offset, size_to_copy);
        buf->current_offset += size_to_copy;

        return size_to_copy;
    };
};

This source example shows a few traits:

  • the number of bytes to copy is not in general known at compile time
  • the buffer type is uint8_t*
  • there are no alignment guarantees

GCC 15 compiles the lambda stored in loader.read as whisper_init_from_buffer_with_params_no_state::{lambda(void*,void*,unsigned_long)#1}::_FUN. The relevant Ghidra listing window instruction sequence (trimmed of address and whitespace) is:

LAB_000b0be2
    vsetvli  a3,param_3,e8,m8,ta,ma  
    vle8.v   v8,(a4)
    c.sub    param_3,a3
    c.add    a4,a3
    vse8.v   v8,(param_2)
    c.add    param_2,a3
    c.bnez   param_3,LAB_000b0be2
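In scalar terms this loop strip-mines the copy, with each pass handling up to one vector group’s worth of bytes. A sketch, where vlmax models whatever length vsetvli grants:

```c
#include <stddef.h>
#include <string.h>

/* Scalar sketch of the strip-mined vector copy. vlmax models the
   maximum vl that vsetvli can grant (vlenb * LMUL at SEW=8). */
static void memcpy_stripmine(unsigned char *dst, const unsigned char *src,
                             size_t n, size_t vlmax)
{
    while (n != 0) {
        size_t vl = n < vlmax ? n : vlmax;  /* vsetvli a3,param_3,... */
        memcpy(dst, src, vl);               /* vle8.v + vse8.v */
        n   -= vl;                          /* c.sub param_3,a3 */
        src += vl;                          /* c.add a4,a3 */
        dst += vl;                          /* c.add param_2,a3 */
    }
}
```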

Copying the Ghidra listing to the clipboard and running the Advisor gives us:

Clipboard Contents to Analyze

LAB_000b0be2                                    XREF[1]:     000b0bf4(j)
000b0be2 d7 76 36 0c     vsetvli                        a3,param_3,e8,m8,ta,ma
000b0be6 07 04 07 02     vle8.v                         v8,(a4)
000b0bea 15 8e           c.sub                          param_3,a3
000b0bec 36 97           c.add                          a4,a3
000b0bee 27 84 05 02     vse8.v                         v8,(param_2)
000b0bf2 b6 95           c.add                          param_2,a3
000b0bf4 7d f6           c.bnez                         param_3,LAB_000b0be2

Signatures:

    Element width is = 8 bits
    Vector registers are grouped with MUL = 8
    Vector load: vle8.v
        Vector load is to multiple registers
    Vector store: vse8.v
        Vector store is from multiple registers
    At least one loop exists
    Significant operations, in the order they appear:
        vsetvli,vle8.v,vse8.v,_loop
    Significant operations, in alphanumeric order:
        _loop,vle8.v,vse8.v,vsetvli

Similarity Analysis

Compare the clipped example to the database of vectorized examples.

The best match is id=1873 [1.000]= _loop,vle8.v,vse8.v,vsetvli

The clip is similar to the reference example data/custom_testsuite/builtins/memcpy_rv64gcv:memcpy_255

Reference C Source

void memcpy_255()
{
  __builtin_memcpy (to, from, 255);
};

Reference Compiled Assembly Code

65e	auipc	a3,0x2
662	ld	a3,-1678(a3)
666	auipc	a2,0x0
66a	addi	a2,a2,82
66e	li	a4,255
672	vsetvli	a5,a4,e8,m8,ta,ma
676	vle8.v	v8,(a2)
67a	sub	a4,a4,a5
67c	add	a2,a2,a5
67e	vse8.v	v8,(a3)
682	add	a3,a3,a5

The Advisor has matched the vector instruction loop to the GCC __builtin_memcpy test case where the number of bytes to transfer is large (255). The individual scalar instructions are not the same.

This example shows something important that we probably want to add to the Advisor’s report:

The vsetvli instruction includes the m8 multiplier option, which means vector operations cover groups of 8 registers. The vle8.v only references vector register v8, but the loads and stores affect the 8 registers v8 through v15. If the __builtin_memcpy appeared in an inline code fragment, where there may be more pressure on vector register availability, we might have seen very similar code with multipliers of m4, m2, or m1.

What does the Ghidra decompiler show for this instruction sequence?

  do {
    lVar3 = vsetvli_e8m8tama(uVar1);
    auVar4 = vle8_v(lVar2);
    uVar1 = uVar1 - lVar3;
    lVar2 = lVar2 + lVar3;
    vse8_v(auVar4,param_2);
    param_2 = (void *)((long)param_2 + lVar3);
  } while (uVar1 != 0);

What would we like Ghidra’s decompiler to show instead? Something like:

__builtin_memcpy(param_2, lVar2, uVar1);

That’s not quite correct, as __builtin_memcpy doesn’t mutate the values param_2 or lVar2.

1.5 - Whisper Output Forensics

You might expect Whisper to use a lot of vector instructions in its inference engine, and it definitely does. Are vector instructions common enough to complicate forensic analysis of Whisper when examining the functions an adversary is most likely to target? For this example we will make a deep dive into the function output_txt, since malicious code might want to review and alter dictated text.

This code also lets us examine how RISCV vector instructions are used to implement the libstdc++ vector library functions.

const char * whisper_full_get_segment_text(struct whisper_context * ctx, int i_segment) {
    return ctx->state->result_all[i_segment].text.c_str();
}

static bool output_txt(struct whisper_context * ctx, const char * fname, const whisper_params & params, std::vector<std::vector<float>> pcmf32s) {
    std::ofstream fout(fname);
    if (!fout.is_open()) {
        fprintf(stderr, "%s: failed to open '%s' for writing\n", __func__, fname);
        return false;
    }
    fprintf(stderr, "%s: saving output to '%s'\n", __func__, fname);
    const int n_segments = whisper_full_n_segments(ctx);
    for (int i = 0; i < n_segments; ++i) {
        const char * text = whisper_full_get_segment_text(ctx, i);
        std::string speaker = "";
        if (params.diarize && pcmf32s.size() == 2)
        {
            const int64_t t0 = whisper_full_get_segment_t0(ctx, i);
            const int64_t t1 = whisper_full_get_segment_t1(ctx, i);
            speaker = estimate_diarization_speaker(pcmf32s, t0, t1);
        }
        fout << speaker << text << "\n";
    }
    return true;
}

The key elements of this function are:

  • text is collected in segments and stored in the context variable ctx
  • text segments are retrieved with the function whisper_full_get_segment_text
  • text is copied into an output stream fout

The params.diarize code block matters only if voice is collected in stereo and Whisper has been asked to differentiate between speakers.

The Ghidra decompiler shows four vector instruction sets starting with a vset* instruction. The first of these is a simple initialization:

vsetivli_e64m1tama(2);
vmv_v_i(in_v1,0);
vse64_v(in_v1,auStack_90);
vse64_v(in_v1,auStack_80);

These instructions initialize two adjacent 16 byte blocks of memory to zero. These are likely four 64 bit pointers or counters embedded within structures.
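A scalar equivalent is just two 16 byte clears. This sketch (names ours) captures what the stanza appears to do:

```c
#include <stdint.h>
#include <string.h>

/* The stanza builds a zero vector with vmv.v.i v1,0, then writes it to
   two adjacent 16 byte blocks with two vse64.v stores. */
void zero_two_blocks(uint64_t a[2], uint64_t b[2])
{
    memset(a, 0, 16);
    memset(b, 0, 16);
}
```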

The next vector stanza is:

vsetivli_e64m1tama(2);
lStack_2e0 = lStack_288;
uStack_2d8 = local_280[0];
auVar24 = vle64_v(&lStack_2e0);
auVar25 = vle64_v(&lStack_2e0);
auVar24 = vslidedown_vi(auVar24,1);
lStack_2a8 = vmv_x_s(auVar25);
local_2a0[0] = vmv_x_s(auVar24);

This one is puzzling, as it appears to load two 64 bit values twice, then store them into separate scalar registers.

The next stanza looks like a simple memcpy expansion:

do {
    lVar18 = vsetvli_e8m8tama(lStack_288);
    auVar24 = vle8_v(puVar16);
    lStack_288 = lStack_288 - lVar18;
    puVar16 = (ulong *)((long)puVar16 + lVar18);
    vse8_v(auVar24,puVar20);
    puVar20 = (ulong *)((long)puVar20 + lVar18);
} while (lStack_288 != 0);
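The strip-mining pattern in the stanza above can be modeled in a few lines of Python. Here a hypothetical vsetvli() stand-in caps each chunk at VLMAX bytes, assuming VLEN=128 so that e8m8 (8-register grouping) yields 128 bytes per iteration:

```python
# Sketch of the strip-mined memcpy stanza above; vsetvli() is a stand-in
# for the hardware instruction, assuming VLEN=128 so e8m8 processes
# 8 registers * 16 bytes = 128 bytes per iteration.
VLMAX_E8M8 = 8 * (128 // 8)

def vsetvli(avl: int) -> int:
    """Return the number of elements granted for this iteration."""
    return min(avl, VLMAX_E8M8)

def strip_mined_memcpy(dst: bytearray, src: bytes) -> None:
    remaining, s, d = len(src), 0, 0
    while remaining != 0:
        vl = vsetvli(remaining)         # lVar18 = vsetvli_e8m8tama(lStack_288)
        dst[d:d + vl] = src[s:s + vl]   # vle8_v followed by vse8_v
        remaining -= vl                 # lStack_288 = lStack_288 - lVar18
        s += vl                         # puVar16 advances by lVar18
        d += vl                         # puVar20 advances by lVar18
```

With 300 bytes of input this runs three iterations of 128, 128, and 44 bytes.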

The final stanza is the interesting one:

pcVar17 = text;
do {
    vsetvli_e8m1tama(0);
    pcVar17 = pcVar17 + lVar18;
    auVar24 = vle8ff_v(pcVar17);
    auVar24 = vmseq_vi(auVar24,0);
    lVar19 = vfirst_m(auVar24);
    lVar18 = in_vl;
} while (lVar19 < 0);
std::__ostream_insert<>(pbVar12,text,(long)(pcVar17 + (lVar19 - (long)text)));

This appears to be a vector implementation of strlen(text) requested by std::__ostream_insert<>.
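The loop can be modeled in Python. The key instruction is vle8ff_v, a "fault-only-first" load that reads up to VLMAX bytes and shrinks vl instead of trapping if the load crosses an unreadable page; vmseq_vi flags the zero bytes and vfirst_m returns the index of the first flag, or -1 if none. The chunk size of 16 below is an assumption (VLEN=128, e8m1):

```python
CHUNK = 16  # assumed VLMAX for e8m1 with VLEN=128

def vector_strlen(mem: bytes, p: int) -> int:
    """Model of the vle8ff_v / vmseq_vi / vfirst_m strlen loop above."""
    base = p
    while True:
        chunk = mem[p:p + CHUNK]            # vle8ff_v: load up to CHUNK bytes
        assert chunk, "ran off the end of mem without finding a NUL"
        first_zero = chunk.find(0)          # vmseq_vi + vfirst_m (-1 if no match)
        if first_zero >= 0:
            return (p + first_zero) - base  # pcVar17 + (lVar19 - text)
        p += len(chunk)                     # advance by in_vl and retry
```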

Our hypothetical adversary would want to inspect *text here and reset the text pointer to a maliciously altered output string.

The current Advisor classifies these four stanzas as:

  • some sort of initializer
  • some sort of shuffle
  • memcpy
  • strlen

A Ghidra user would likely ignore the initializer and the shuffle as doing something benign and obscure within the I/O subsystem, recognize the memcpy and strlen for what they are, then concentrate on any unexpected manipulations of the *text string.

2 - Ghidra Advisor Reference

2.1 - Advisor Dependencies

Ghidra

  • Ghidra 11.3_DEV with the RISCV-64 isa_ext branch. Without this branch Ghidra is stuck with a never-ratified older version of RISCV support.

Bazel

Bazel builds in this workspace generate output in the temporary directory /run/user/1000/bazel, as specified in .bazelrc. This override can be changed or removed.

This project should work with Bazel 7.x as well, after adjusting some toolchain path names. Bazel 8 uses ‘+’ instead of ‘~’ as an external repo naming suffix and ‘@@’ instead of ‘@’ to identify standard Bazel repositories.

Toolchain

The toolchain is packaged locally as a Bazel module named gcc_riscv_suite, version 15.0.0.1. (Note that this is the first patch to the Bazel module based on the unreleased GCC-15.0.0). This module depends on a second module, fedora_syslibs version 41.0.0. These are served out of a local Bazel module repository. The gcc_riscv_suite and fedora_syslibs modules wrap 42 MB and 4.0 MB tarballs, respectively.

Emulators

Two qemu emulators are used, both built from source shortly after the 9.0.50 release.

  • qemu-riscv64 provides user space emulation, which is very useful for exploring the behavior of particularly confusing assembly code sequences.
  • qemu-system-riscv64 provides full RISCV-64 VM hosting. This is more narrowly useful when testing binaries like DPDK which require non-standard kernel options or kernel modules.
    • The RISCV-64 VM used here is based on an Ubuntu 24.04 disk image and the u-boot.bin boot loader. This boot loader is critical for RISCV VMs, since the emulated BIOS firmware provides the kernel with the definitive set of RISCV extensions available to the hardware threads (aka harts).

Jupyter

  • jupyterlab 4.1.1

System

  • Fedora 41 with wayland graphics.
  • Python 3.13

2.2 - Populating the Database

The training set consists of matched C source code and RISCV-64 disassembly code. The C source is processed through the C preprocessor cpp and indent. That code is then compiled with GCC for at least two different machine architectures, then saved under ./data.

Populating the Training Set Database

The initial C source code is selected from the GCC riscv autovector testsuite. We can add custom examples of code to fill gaps or represent code patterns we might find in a Ghidra binary under review. Autovectored loops over structure arrays can be especially confusing to interpret, so we will likely want extra samples of that type.

The C sources for these two test suites appear in ./gcc_riscv_testsuite and ./custom_testsuite. The script generator.py processes these into cpp output (*.i), compiled libraries (*.so), and objdump assembly listings (*_objdump) for each requested machine architecture.

For example, gcc_riscv_testsuite/rvv/autovec/reduc/reduc-1.c is processed by generator.py into:

  • data/gcc_riscv_testsuite/rvv/autovec/reduc/reduc-1_rv64gc.i
  • data/gcc_riscv_testsuite/rvv/autovec/reduc/reduc-1_rv64gc.so
  • data/gcc_riscv_testsuite/rvv/autovec/reduc/reduc-1_rv64gc_objdump
  • data/gcc_riscv_testsuite/rvv/autovec/reduc/reduc-1_rv64gcv.i
  • data/gcc_riscv_testsuite/rvv/autovec/reduc/reduc-1_rv64gcv.so
  • data/gcc_riscv_testsuite/rvv/autovec/reduc/reduc-1_rv64gcv_objdump

The ingest.py script reads everything under ./data to populate the sample table in the SQLite3 database:

.schema sample
CREATE TABLE sample(id INTEGER PRIMARY KEY AUTOINCREMENT, namespace TEXT, arch TEXT, name TEXT, source TEXT, assembly TEXT);
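A minimal sketch of what ingest.py plausibly does with this schema; the file-walking and parsing details are omitted here, and the inserted row content is illustrative, not taken from the actual training set:

```python
import sqlite3

# ingest.py would open the on-disk database; use in-memory here for illustration
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sample(
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    namespace TEXT, arch TEXT, name TEXT, source TEXT, assembly TEXT)""")

# One (source, objdump) pair per machine architecture, as produced by generator.py
conn.execute(
    "INSERT INTO sample(namespace, arch, name, source, assembly) VALUES (?,?,?,?,?)",
    ("gcc_riscv_testsuite/rvv/autovec/reduc", "rv64gcv", "reduc-1",
     "/* contents of reduc-1_rv64gcv.i */",
     "/* contents of reduc-1_rv64gcv_objdump */"))
conn.commit()

rows = conn.execute("SELECT arch, name FROM sample").fetchall()
```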

Next we need to generate signatures from this table with the sample_analytics.py script. At present, signatures are simple strings, and there are three signature types:

.schema signatures
CREATE TABLE signatures(id INTEGER PRIMARY KEY AUTOINCREMENT, sample_id INTEGER, signature_type TEXT, signature_value TEXT);
select distinct signature_type from signatures;
Traits
Opcodes, sorted
Opcodes, ordered
  • The Traits signature holds simple facts from the disassembly code, such as hasLoop if at least one backwards branch exists.
  • The Opcodes, ordered signature is a simple list of vector and branch opcodes, concatenated in the same order as they are found in the disassembly.
  • The Opcodes, sorted signature is similar to Opcodes, ordered, but sorted into alphanumeric order. This may be useful if the compiler reorders instructions.
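The three signature types might be derived from a disassembly listing along these lines; the opcode-selection heuristics below are assumptions, not the actual sample_analytics.py logic:

```python
def make_signatures(assembly: str) -> dict:
    """Derive Traits and the two Opcodes signatures from a disassembly listing."""
    opcodes = [line.split()[0] for line in assembly.splitlines() if line.split()]
    # Keep vector and branch opcodes (crude heuristic: v* and b* mnemonics)
    kept = [op for op in opcodes if op.startswith(("v", "b"))]
    traits = []
    # hasLoop if a branch exists (branch direction is not checked in this sketch)
    if any(op.startswith("b") for op in opcodes):
        traits.append("hasLoop")
    return {
        "Traits": " ".join(traits),
        "Opcodes, ordered": " ".join(kept),
        "Opcodes, sorted": " ".join(sorted(kept)),
    }
```

Applied to a four-instruction loop body, the ordered signature preserves instruction sequence while the sorted signature is stable under compiler reordering.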

Querying the Database in Advisor

Workflow Generation

Users can select assembly code from Ghidra’s listing window, then run analysis cells in Advisor.ipynb to generate reports on the types of C code that may match the listing. Users will likely want to iterate complex selections by adding custom examples and repeating the match, to see if they can reproduce the C code that might have generated the vectorized assembly.

2.3 - Feature Analysis

At a very high level the Advisor tries to translate between two languages - C or C++ source code that a human might write and the sequence of assembly language instructions a compiler generates. Compilers are very good at the forward translation from C to assembly. We want something that works in the reverse direction - suggesting C or other source code that might have been compiled into specific assembly sequences.

The Advisor tries to brute-force this reverse translation by compiling a reference set of C sources into binaries, extracting the instructions with objdump into a database, then looking for the best match to the instructions copied into the clipboard. The GCC compiler test suite gives us thousands of reference C functions to start with.

Some features are easy to recognize.

  • if the assembly listing includes a backwards branch instruction and branch target, then the source code likely contains a vectorized loop.
  • if the assembly listing includes an instruction matching vred*, vfred*, or vwred*, then the source code likely contains a vectorized reduction loop reading a vector and emitting a scalar.

Other features are mostly distractions, adding entropy that we would like to ignore:

  • The local choice of registers to hold intermediate values
  • The specific loop termination branch condition - a test against a counter, an input pointer, or an output pointer is equally valid, but only one will be implemented.
  • Instruction ordering is often arbitrary inside a loop, as counters and pointers are incremented/decremented.
  • The compiler may reorder instructions to minimize the impact of memory latency.
  • The compiler will change the emitted instructions for inlined functions depending on what it knows at compile time. This is especially true when the compiler knows the exact number of loop iterations, the alignment of operands, and the minimum size of vector registers.
  • The compiler will change the emitted instructions based on the local ‘register pressure’ - whether or not there are lots of free vector registers.
  • The compiler (or inline header macros) will translate a simple loop into multiple code blocks evaluated at run time. If the count is small, a scalar implementation is used. If the count is large one or more vector blocks are used.
  • The compiler writers sometimes have to guess whether to optimize for instruction count, minimal branches, or memory accesses.

And some features are harder to recognize but useful for the Ghidra user:

  • Operand type is sometimes set at runtime, not encoded into the instruction opcode.
  • Compilers can emit completely different code if the machine architecture indicates a vector length of greater than 128 bits.
  • Vector registers may be grouped based on runtime context, so that the number of registers read or written must be inferred from instruction flows.
  • The compiler will accept intrinsic vector functions - not all vector loops have a C counterpart.

2.4 - Frequently Asked Questions

Why Bazel?
Bazel does a good job of managing cross-compiler builds and build caches together, where the cross-compiler toolchain can be switched easily.
How do I compile with support for RISCV Instruction Set Architecture extensions?
The binutils and gcc base code need to support those extensions first. The gcc compiler uses the -march= command line option to identify which extensions to apply for a given compilation. For example -march=rv32gcv says vector instructions are supported, while -march=rv32gc excludes vector instructions.
What machine architectures are currently implemented?
The variables.bzl file sets MARCH_SET = ("rv64gc", "rv64gcv"). Most sources are then compiled with and without vector support.
Are all RISCV vector binaries runnable on all vector hardware threads?
Not always. By default GCC will build for a minimum vector register length (VLEN) of 128 bits, which should be portable across all general purpose RISCV harts. If _zvl512b is added to the -march setting, GCC knows that vector registers are bigger and can unroll loops more aggressively - generating code that will fail on 128 bit vector harts. This can get complicated when processors have both 128 bit and 512 bit cores, like the sg2380.
Aren’t vector extensions unlikely to be used in programs that don’t do vector math?
No. Vector extensions are very likely to be found in inlined utilities like memcpy and strncmp. Most simple loops over arrays of structs can be optimized with vector instructions.
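The portability question above comes down to VLMAX, the number of elements a single vector instruction can process: VLMAX = VLEN / SEW * LMUL. A quick calculation shows why code scheduled for a 512 bit minimum VLEN mis-executes on a 128 bit hart:

```python
def vlmax(vlen_bits: int, sew_bits: int, lmul: int) -> int:
    """Elements per vector instruction: VLMAX = VLEN / SEW * LMUL."""
    return vlen_bits // sew_bits * lmul

# GCC's default assumption (zvl128b): e64,m1 gives 2 doubles per instruction
assert vlmax(128, 64, 1) == 2
# With _zvl512b in -march, GCC may schedule for 8 doubles per instruction,
# a layout a VLEN=128 hart cannot honor.
assert vlmax(512, 64, 1) == 8
```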

3 - Whisper Deep Dive

This exercise mocks up a forensic analysis of a hypothetical voice-to-text application believed to be based on Whisper.cpp. The previous examples applied Advisor to fairly random vector instruction sequences found in a Whisper.cpp compilation without much identifying metadata. This time we will do things more methodically, using a specific Whisper.cpp release built with specific build instructions, analyzed in Ghidra in both stripped and unstripped binary formats. Dependent libraries libc, libm, and libstdc++ will be imported into Ghidra from the toolchain used to construct the whisper executable. Once we have trained Advisor to help with the known-source whisper application, we might be better able to use it in analyzing potentially malicious whisper-like applications.

This is an iterative process, in which we make some initial guesses about how the application-under-test (AUT) was built, rebuild our whisper reference model the same way, then adjust either the reference model or the build parameters until we see similar key patterns in our AUT and reference models.

The initial guesses are:

  • Similar to Whisper.cpp release 1.7.1
  • Built for RISCV64 cores with the rva22 profile plus vector extensions, something like the SiFive P670 cores within a SG2380 processor.
  • Built with GCC 15.0.0 using the -march=rv64gcv, -ffast-math, and -O3 options for a Linux-like OS.
  • dynamically linked with libc, libm, libstdc++ as of mid 2024.

It’s worthwhile establishing key structures used by Whisper - and likely by any malicious code forked from Whisper.

Inspecting the reference source code suggests these structures:

  • whisper_context
  • whisper_state - created by whisper_init_state(struct whisper_context * ctx)
  • whisper_context_params

3.1 - Building the Reference Model

We start by importing Whisper.cpp into our Bazel build environment. We may eventually want a full fork of the code, but not until we are sure of the base release and what directions that fork will need to take.

This import starts with a simple addition to our MODULE.bazel workspace file:

# MODULE.bazel
http_archive = use_repo_rule("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")
# whisper.cpp is an open source voice-to-text inference app built on OpenAI's Whisper model.
# It is a useful exemplar of autovectorization of ML code with some examples of hand-coded
# riscv intrinsics.
http_archive(
    name = "whisper_cpp",
    urls = ["https://github.com/ggerganov/whisper.cpp/archive/refs/tags/v1.7.1.tar.gz"],
    strip_prefix = "whisper.cpp-1.7.1/",
    build_file = "//:whisper-cpp.BUILD",
    sha256 = "97f19a32212f2f215e538ee37a16ff547aaebc54817bd8072034e02466ce6d55"
)

Next we add whisper-cpp.BUILD to show how to build libraries and binaries. The instructions for the whisper library include these stanzas:

cc_library(
    name = "whisper",
    srcs = [
        "ggml/src/ggml.c",
        "ggml/src/ggml-aarch64.c",
        "ggml/src/ggml-alloc.c",
        "ggml/src/ggml-backend.cpp",
        "ggml/src/ggml-backend-impl.h",
        "ggml/src/ggml-impl.h",
        "ggml/src/ggml-quants.c",
        "src/whisper.cpp",
    ],
    copts = [
        "-I%s/include" % EXTERNAL_PATH,
        "-I%s/ggml/include" % EXTERNAL_PATH,
        "-I%s/ggml/src" % EXTERNAL_PATH,
        "-pthread",
        "-O3",
        "-ffast-math",
    ],
    ...
    defines = [
        "NDEBUG",
        "_XOPEN_SOURCE=600",
        "_GNU_SOURCE",
        "__FINITE_MATH_ONLY__=0",
        "__riscv_v_intrinsic=0",
    ],
    ...
)
cc_binary(
    name = "main",
    srcs = [
        "examples/common.cpp",
        "examples/common.h",
        "examples/common-ggml.cpp",
        "examples/common-ggml.h",
        "examples/dr_wav.h",
        "examples/grammar-parser.cpp",
        "examples/grammar-parser.h",
        "examples/main/main.cpp",
    ],
    ...
    deps = [
        "whisper",
    ],
)

Now we can build the reference app using our existing RISCV-64 toolchain:

$ bazel build --platforms=//platforms:riscv_gcc --copt='-march=rv64gcv' @whisper_cpp//:main
...
$ file bazel-bin/external/+_repo_rules+whisper_cpp/main
bazel-bin/external/+_repo_rules+whisper_cpp/main: ELF 64-bit LSB executable, UCB RISC-V, RVC, double-float ABI, version 1 (GNU/Linux), dynamically linked, interpreter /lib/ld-linux-riscv64-lp64d.so.1, for GNU/Linux 4.15.0, not stripped

$ readelf -A bazel-bin/external/+_repo_rules+whisper_cpp/main
Attribute Section: riscv
File Attributes
  Tag_RISCV_stack_align: 16-bytes
  Tag_RISCV_arch: "rv64i2p1_m2p0_a2p1_f2p2_d2p2_c2p0_v1p0_zicsr2p0_zifencei2p0_zmmul1p0_zaamo1p0_zalrsc1p0_zca1p0_zcd1p0_zve32f1p0_zve32x1p0_zve64d1p0_zve64f1p0_zve64x1p0_zvl128b1p0_zvl32b1p0_zvl64b1p0"
  Tag_RISCV_priv_spec: 1
  Tag_RISCV_priv_spec_minor: 11

The final step is to locate the toolchain libraries used in this build, so that we can load them into Ghidra. They are usually cached in a per-user location. We’ll search for the RISCV libstdc++ toolchain library:

$ bazel info
...
output_base: /run/user/1000/bazel
output_path: /run/user/1000/bazel/execroot/_main/bazel-out
package_path: %workspace%
release: release 7.4.0
...
$ find /run/user/1000 -name libstdc++\*
...
/run/user/1000/bazel/external/gcc_riscv_suite+/riscv64-unknown-linux-gnu/lib/libstdc++.so.6.0.33
...
/run/user/1000/bazel/external/gcc_riscv_suite+/riscv64-unknown-linux-gnu/lib/libstdc++.so.6
/run/user/1000/bazel/external/gcc_riscv_suite+/riscv64-unknown-linux-gnu/lib/libstdc++.so

We will want to load libstdc++.so.6 into Ghidra before we load the reference app.

3.2 - Loading the Reference Model into Ghidra

Create a new Ghidra project and load the whisper dependencies and then whisper itself, in both stripped and unstripped forms.

/run/user/1000/bazel/external/gcc_riscv_suite+/lib/libc.so.6
/run/user/1000/bazel/external/gcc_riscv_suite+/lib/libm.so.6
/run/user/1000/bazel/external/gcc_riscv_suite+/riscv64-unknown-linux-gnu/lib/libstdc++.so.6
/run/user/1000/bazel/execroot/_main/bazel-out/k8-fastbuild/bin/external/_main~_repo_rules~whisper_cpp/main

We can check the machine architecture for which these libraries were built with readelf -A:

$ readelf -A /run/user/1000/bazel/external/gcc_riscv_suite+/lib/libc.so.6
Attribute Section: riscv
File Attributes
  Tag_RISCV_stack_align: 16-bytes
  Tag_RISCV_arch: "rv64i2p1_m2p0_a2p1_f2p2_d2p2_c2p0_zicsr2p0_zifencei2p0_zmmul1p0_zaamo1p0_zalrsc1p0"
  Tag_RISCV_unaligned_access: Unaligned access
  Tag_RISCV_priv_spec: 1
  Tag_RISCV_priv_spec_minor: 11

These libraries were most likely built with the non-vector rv64gc machine architecture.

The stripped version of main is generated by copying the non-stripped binary to /tmp and then stripping it with the riscv toolchain strip before importing it into Ghidra:

/run/user/1000/bazel/external/gcc_riscv_suite+/bin/riscv64-unknown-linux-gnu-strip /tmp/main

Now we can use the non-stripped version to orient ourselves and find reference points, then visit the stripped version to use those reference points and start to recover symbol names and structures.

3.3 - Ghidra Examination

We want to know how we can use Advisor to untangle vectorized instruction sequences. We’ve seen that Advisor can help with simple loops and builtin functions like memcpy. Now we want to tackle vectorized ‘shuffle’ code, where GCC turns a sequence of simple assignments or initializations into a much more obscure sequence of vector instructions.

We’ll assume the Ghidra user wishes to search for malicious behavior adjacent to the output_txt function, called by main after the voice-to-text inference engine has crunched the numbers.

The first step is to locate the main routine in our stripped binary. There is no main symbol left after stripping, so we need to find a path from the entry point to main in the unstripped binary first. The entry point is the symbol _start or entry.

In the unstripped binary:

void _start(void)
{
...
  gp = &__global_pointer$;
  uVar1 = _PREINIT_0();
  __libc_start_main(0x2649e,in_stack_00000000,&stack0x00000008,0,0,uVar1,&stack0x00000000);
  ebreak();
  main();
  return;
}

While the stripped binary is:

void entry(void)
{
  undefined8 uVar1;
  undefined8 in_stack_00000000;
  
  gp = &__global_pointer$;
  uVar1 = _PREINIT_0();
  __libc_start_main(0x2649e,in_stack_00000000,&stack0x00000008,0,0,uVar1,&stack0x00000000);
  ebreak();
  FUN_0001e758();
  return;
}

So we know that main is FUN_0001e758.

In the source code, main calls output_txt. There is no output_txt symbol in either stripped or non-stripped binaries, so this function has apparently been converted into inline code deep within main.

There are several different paths forward for examination. Sometimes the best approach is to explore a short distance along each likely path, backtracking or switching paths when we get stuck. For this exercise we want to know how to make Advisor useful in at least some of those cases where we get stuck.

Paths available:

  • Search for C strings as literals, then find where they are used. This will often give printf or logging utility functions.
  • Search for C++ string constructors given literals as input. Start to identify standard library string objects in the code.
  • Look for initialization or print functions recognizable either from symbol names or printf formatting strings.
  • Start to identify recurring structures passed by pointers. This can include context, state, and parameter structures

Let’s start with a routine that touches several of those paths. This is the basic decompiled output from the stripped binary, for a function that gets called a lot by our identified main routine. Peeking into the unstripped binary, we see that its signature is void __thiscall std::string::string<>(string *this,char *cStr,allocator *param_2) - a C++ basic_string constructor given a literal C string as input. Note that there is actually no allocator passed into the function.

void FUN_000542e0(undefined8 *param_1,undefined *param_2)

{
  undefined *puVar1;
  long lVar2;
  undefined *puVar3;
  long lVar4;
  undefined8 *puVar5;
  undefined auVar6 [256];
  long in_vl;
  
  gp = &__global_pointer$;
  puVar5 = param_1 + 2;
  *param_1 = puVar5;
  if (param_2 == (undefined *)0x0) {
                    /* WARNING: Subroutine does not return */
    std::__throw_logic_error("basic_string: construction from null is not valid");
  }
  puVar1 = param_2;
  lVar4 = 0;
  do {
    vsetvli_e8m1tama(0);
    puVar1 = puVar1 + lVar4;
    auVar6 = vle8ff_v(puVar1);
    auVar6 = vmseq_vi(auVar6,0);
    lVar2 = vfirst_m(auVar6);
    lVar4 = in_vl;
  } while (lVar2 < 0);
  puVar1 = puVar1 + (lVar2 - (long)param_2);
  puVar3 = puVar1;
  if (puVar1 < (undefined *)0x10) {
    if (puVar1 == (undefined *)0x1) {
      *(undefined *)(param_1 + 2) = *param_2;
      goto LAB_00054326;
    }
    if (puVar1 == (undefined *)0x0) goto LAB_00054326;
  }
  else {
    puVar5 = (undefined8 *)operator.new((ulong)(puVar1 + 1));
    param_1[2] = puVar1;
    *param_1 = puVar5;
  }
  do {
    lVar4 = vsetvli_e8m8tama(puVar3);
    auVar6 = vle8_v(param_2);
    puVar3 = puVar3 + -lVar4;
    param_2 = param_2 + lVar4;
    vse8_v(auVar6,puVar5);
    puVar5 = (undefined8 *)((long)puVar5 + lVar4);
  } while (puVar3 != (undefined *)0x0);
  puVar5 = (undefined8 *)*param_1;
LAB_00054326:
  param_1[1] = puVar1;
  *(undefined *)((long)puVar5 + (long)puVar1) = 0;
  return;
}

The exercise here is to recover the basic_string internal structure and identify the two vector stanzas.

The Advisor identifies the first vector stanza as a builtin_strlen and the second as a builtin_memcpy. The std::string structure is 0x20 bytes and consists of a char*, a 64 bit string length, and a 16 byte union field. If the string with null termination is less than 16 bytes in length, it is stored directly in the 16 byte union field. Otherwise, new memory is allocated for the copy and a pointer is stored in the first 8 bytes of the union.

The next step is simple enough to make the Advisor unnecessary: a std::vector copy constructor involves two vector instruction stanzas.

The new vector has three 64 bit pointer fields, all of which need to be zeroed. GCC 15 does that with:

  vsetivli_e64m1tama(2);
  vmv_v_i(in_v1,0);
  ...
  vse64_v(in_v1,this);
  *(undefined8 *)&this->field_0x10 = 0;

That’s a little bit odd, since it uses three vector instructions to replace two scalar instructions, followed by a separate scalar store instruction. It could alternatively have used three scalar store instructions, or three vector instructions with an m2 LMUL multiplier option. Perhaps this is an example of incomplete or over-eager optimization, or an optimization from a RISC-V vendor who knows that vector instructions can execute in parallel with scalar instructions.

A little later in the copy constructor a builtin_memcpy vector stanza occurs, to copy the contents of the original vector into the newly initialized vector.

This suggests:

  • vector stanzas like builtin_memcpy, builtin_strlen, and vector instructions to zero 16 bytes are common and fairly easy to recognize, either by eye or Advisor. Adding more builtin functions to the exemplar directory makes good sense.
  • vector stanzas often occur in initialization sequences, where they can be difficult to untangle from associated C++ object initializations. If we want to tackle this, we also need examples of libstdc++ initializations, especially for vectors, maps, and iostreams.
  • we need more examples of less common vector stanzas, including gather and slide operations.

3.4 - Dealing with C++

GCC 15 uses RISCV vector instruction sequences in many initialization sequences - even when there is no need for a loop. If we want to understand that code, we need a decent understanding of what is getting initialized. One way to move forward is with a small program that uses some of the same libstdc++ classes, to help us understand their memory layout and especially the fields that may need initialization.

The first iteration of this is a toy program using std::string, std::vector, std::pair, std::map, and std::ofstream library code. We generally want to know object sizes, pointers, and key internal fields. If something like ofstream is constructed on the stack, we can probably ignore any interior objects initialized within it.

#include <string>
#include <vector>
#include <map>
#include <iostream>
#include <fstream>
#include <cstdint>

void dumpHex(const uint64_t* p, int numwords)  {
    std::cout << "\tRaw: ";
    for (int i = 0; i < numwords; i++) {
        std::cout << "0x" << std::hex << p[i];
        if (i < numwords) {
            std::cout << ", ";
        }
    }
    std::cout << std::endl;
}

void showString(const std::string* s, const char* label) {
    std::cout << "std::string " << label << " = " << *s << std::endl;
    std::cout << "\tRaw Size = 0x" << std::hex << sizeof(*s) << std::endl;
    std::cout << "\tLength = 0x" << std::hex << s->length() << std::endl;
    dumpHex((const uint64_t*)s, 4);
}

void showVector(const std::vector<std::string>* v, const char* label) {
    std::cout << "std::vector<std::string> " << label << ":" << std::endl;
    std::cout << "\tRaw Size = 0x" << std::hex << sizeof(*v) << std::endl;
    std::cout << "\tInternal Size = " << v->size() << std::endl;
    dumpHex((const uint64_t*)v, 3);
}

void showPair(const std::pair<std::string,std::string>* p, const char* label) {
    std::cout << "std::pair<std::string,std::string> " << label << ":" << std::endl;
    std::cout << "\tRaw Size = 0x" << std::hex << sizeof(*p) << std::endl;
    dumpHex((const uint64_t*)p, 8);
}

void showMap(const std::map<std::string,std::string>* token_map, const char* label) {
    std::cout << "std::map<std::string,std::string> " << label << ":" << std::endl;
    std::cout << "\tRaw Size = 0x" << std::hex << sizeof(*token_map) << std::endl;
    std::cout << "\tInternal Size = " << token_map->size() << std::endl;
    dumpHex((const uint64_t*)token_map, 6);
}

void showOfstream(const std::ofstream* fout, const char* label) {
    std::cout << "std::ofstream " << label << ":" << std::endl;
    std::cout << "\tRaw Size = 0x" << std::hex << sizeof(*fout) << std::endl;
    dumpHex((const uint64_t*)fout, 8);
}

int main() {
    std::cout << "Initializing\n";
    std::ofstream fout("/tmp/scratch");
    showOfstream(&fout, "initialized ofstream");
    std::string xs("short string");
    showString(&xs, "short string");
    std::string x("This is a sample long string");
    showString(&x, "long string");
    std::vector<std::string> vx;
    showVector(&vx, "empty vector");
    vx.push_back(x);
    showVector(&vx, "singleton vector");

    fout << "something to fill the file" << std::endl;

    std::pair<std::string,std::string> map_element("key", "value");
    showPair(&map_element, "map_element");
    std::map<std::string,std::string> token_map;
    showMap(&token_map, "token_map, empty");
    token_map.insert(map_element);
    showMap(&token_map, "token_map, one insertion");
    fout.close();
    showOfstream(&fout, "closed ofstream");
    return 0;
}

Build with:

$ bazel build --platforms=//platforms:riscv_gcc --copt='-march=rv64gcv' other_src:stdlibc++_exploration

Run this under qemu emulation:

$ qemu-riscv64-static -L /opt/riscvx/sysroot -E LD_LIBRARY_PATH=/opt/riscvx/sysroot/riscv64-unknown-linux-gnu/lib/ bazel-bin/other_src/stdlibc++_exploration
Initializing
std::ofstream initialized ofstream:
	Raw Size = 0x200
	Raw: 0x7fbff0608d70, 0x7fbff0608bb8, 0x288a0, 0x288a0, 0x288a0, 0x0, 0x0, 0x0, 
std::string short string = short string
	Raw Size = 0x20
	Length = 0xc
	Raw: 0x7fbfe3ffe820, 0xc, 0x74732074726f6873, 0x7f00676e6972, 
std::string long string = This is a sample long string
	Raw Size = 0x20
	Length = 0x1c
	Raw: 0x2a8b0, 0x1c, 0x1c, 0x7fbfe3ffe8f0, 
std::vector<std::string> empty vector:
	Raw Size = 0x18
	Internal Size = 0
	Raw: 0x0, 0x0, 0x0, 
std::vector<std::string> singleton vector:
	Raw Size = 0x18
	Internal Size = 1
	Raw: 0x2a8e0, 0x2a900, 0x2a900, 
std::pair<std::string,std::string> map_element:
	Raw Size = 0x40
	Raw: 0x7fbfe3ffe890, 0x3, 0x79656b, 0x7fbff0411528, 0x7fbfe3ffe8b0, 0x5, 0x65756c6176, 0x7fbff14d751e, 
std::map<std::string,std::string> token_map, empty:
	Raw Size = 0x30
	Internal Size = 0
	Raw: 0x1, 0x7fbf00000000, 0x0, 0x7fbfe3ffe858, 0x7fbfe3ffe858, 0x0, 
std::map<std::string,std::string> token_map, one insertion:
	Raw Size = 0x30
	Internal Size = 1
	Raw: 0x1, 0x7fbf00000000, 0x2a940, 0x2a940, 0x2a940, 0x1, 
std::ofstream closed ofstream:
	Raw Size = 0x200
	Raw: 0x7fbff0608d70, 0x7fbff0608bb8, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 

Now we can load bazel-bin/other_src/stdlibc++_exploration into Ghidra and get some insight into how these standard library classes are initialized.

std::string has a length of 0x20 bytes and a structure similar to:

struct string { /* stdlib basic string */
    char *cstr;
    uint64_t length;
    char data[16];
};

If the string is 15 bytes or less, it is stored directly in data. Otherwise new memory is allocated and data holds information needed to manage that allocation.
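The short-string case can be checked directly against the raw dump above: the two union words hold the characters in little-endian byte order. A small Python check, using the two union words printed for the "short string" run above:

```python
import struct

# Union words from the "short string" raw dump above. The 0x7f byte in the
# second word is leftover stack garbage past the NUL terminator.
words = (0x74732074726F6873, 0x00007F00676E6972)
raw = b"".join(struct.pack("<Q", w) for w in words)  # little-endian uint64s
text = raw.split(b"\x00")[0].decode()                # stop at the NUL terminator
```

The recovered string is "short string" with length 0xc, matching the Length field in the dump and confirming the small-string layout.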

std::vector<T> has a length of 0x18 bytes, regardless of the component type T.

struct vector {
    T *start;        // points to the first element in the vector
    T *end;          // points just past the last element in the vector
    T *alloc_end;    // points just past the last empty element allocated for the vector.
};

std::pair<T1,T2> is simply a concatenation of elements of type T1 and T2, so a pair of strings is 0x40 bytes and a pair of string pointers is 0x10 bytes.

std::ofstream is a large object of 0x200 bytes. We’ll ignore its internal structure for now - and especially any whisper.cpp code initializing elements of that structure.

Ghidra examination of stdlibc++_exploration provides some insight and more than a few red herrings.

stdlibc++_exploration

std::vector

The three 64 bit pointers that make up a std::vector<std::string> are initialized inline to zero with a stanza of five vector instructions - even though that does not look optimal.

  vsetivli_e8mf2tama(8);
  vmv_v_i(in_v1,0);
  vse8_v(in_v1,&vx.end);
  vse8_v(in_v1,&vx.alloc_end);
  vse8_v(in_v1,&vx);
  showVector(&vx,"empty vector");

vx.push_back(x) makes a copy of the string x and allocates space for both the vector element and the string copy with operator.new.

std::vectors are most easily recognized by their destructors:

std::vector<>::~vector((vector<> *)&vx);

std::ofstream

File structures like std::ofstream are easy to recognize through their constructors and destructors. They can be confusing though, since they are likely to be reused if multiple files are opened and closed in the same function.