Ghidra Advisor Reference
- 1: Ghidra Advisor Examples
- 1.1: Whisper.cpp Example
- 1.2: DPDK Network Example
- 1.3: Whisper Exploration Example
- 1.4: Whisper Memcpy
- 1.5: Whisper Output Forensics
- 2: Ghidra Advisor Reference
- 2.1: Advisor Dependencies
- 2.2: Populating the Database
- 2.3: Feature Analysis
- 2.4: Frequently Asked Questions
- 3: Whisper Deep Dive
1 - Ghidra Advisor Examples
1.1 - Whisper.cpp Example
Let’s inspect whisper.cpp binaries as compiled for a RISCV-64 processor with vector extensions.
We’ll start with this snippet from the function whisper_full_with_state.
000cc166 57 77 80 0d vsetvli a4,zero,e64,m1,ta,ma
000cc16a d7 a1 08 52 vid.v v3
000cc16e d7 30 00 5e vmv.v.i v1,0x0
000cc172 93 06 80 03 li a3,0x38
000cc176 d7 e1 36 96 vmul.vx v3,v3,a3
000cc17a d6 86 c.mv a3,s5
000cc17c ce 97 c.add a5,s3
000cc17e 03 b6 07 18 ld a2,0x180(a5)
000cc182 31 06 c.addi a2,0xc
LAB_000cc184 XREF[1]: 000cc1a0(j)
000cc184 d7 f7 76 09 vsetvli a5,a3,e32,mf2,tu,ma
000cc188 07 71 36 06 vluxei64.v v2,(a2),v3
000cc18c 57 32 10 9e vmv1r.v v4,v1
000cc190 13 97 37 00 slli a4,a5,0x3
000cc194 1d 8f c.sub a4,a5
000cc196 0e 07 c.slli a4,0x3
000cc198 9d 8e c.sub a3,a5
000cc19a d7 10 41 d2 vfwadd.wv v1,v4,v2
000cc19e 3a 96 c.add a2,a4
000cc1a0 f5 f2 c.bnez a3,LAB_000cc184
000cc1a2 d3 07 00 f2 fmv.d.x fa5,zero
000cc1a6 57 77 80 0d vsetvli a4,zero,e64,m1,ta,ma
000cc1aa 57 d1 07 42 vfmv.s.f v2,fa5
000cc1ae d7 10 11 06 vfredusum.vs v1,v1,v2
000cc1b2 d7 14 10 42 vfmv.f.s fs1,v1
Advisor Output
Signatures:
- Element width is = 64 bits
- Vector element index: vid.v
- often used in generating a vector of offsets for indexing
- Element width is = 32 bits
- Vector multiplier is fractional, MUL = f2
- Vector unordered indexed load: vluxei64.v
- Vector register move/copy: vmv1r.v
- Element width is = 64 bits
- FP register to vector register element 0
- Vector floating point reduction: vfredusum.vs
- Vector register element 0 to FP register
- Significant opcodes, in the order they appear:
- vsetvli,vid.v,vmv.v.i,vmul.vx,vsetvli,vluxei64.v,vmv1r.v,vfwadd.wv,c.bnez,vsetvli,vfmv.s.f,vfredusum.vs,vfmv.f.s
- Significant opcodes, in alphanumeric order:
- c.bnez,vfmv.f.s,vfmv.s.f,vfredusum.vs,vfwadd.wv,vid.v,vluxei64.v,vmul.vx,vmv.v.i,vmv1r.v,vsetvli,vsetvli,vsetvli
- At least one loop exists
Similarity Analysis
Compare the clipped example to the database of vectorized examples.
The best match is id=588 [0.687]= bnez,vfadd.vv,vle8.v,vlseg2e64.v,vmsne.vi,vmv.v.i,vmv1r.v,vse64.v,vsetvli,vsetvli,vsetvli
The clip is similar to the reference example data/gcc_riscv_testsuite/rvv/autovec/struct/mask_struct_load-1_rv64gcv:test_f64_f64_i8_2
Reference C Source
void test_f64_f64_i8_2(double *__restrict dest, double *__restrict src, int8_t *__restrict cond, intptr_t n)
{
for (intptr_t i = 0; i < n; ++i)
if (cond[i])
dest[i] = src[i * 2] + src[i * 2 + 1];
}
Reference Compiled Assembly Code
2932 blez a3,2970 <test_f64_f64_i8_2+0x3e>
2936 vsetvli a5,zero,e64,m1,ta,ma
293a vmv.v.i v4,0
293e vsetvli a5,a3,e8,mf8,ta,ma
2942 vle8.v v0,(a2)
2946 vmv1r.v v1,v4
294a slli a6,a5,0x4
294e slli a4,a5,0x3
2952 sub a3,a3,a5
2954 add a2,a2,a5
2956 vmsne.vi v0,v0,0
295a vlseg2e64.v v2,(a1),v0.t
295e vsetvli zero,zero,e64,m1,tu,mu
2962 add a1,a1,a6
2964 vfadd.vv v1,v3,v2,v0.t
2968 vse64.v v1,(a0),v0.t
296c add a0,a0,a4
296e bnez a3,293e <test_f64_f64_i8_2+0xc>
2970 ret
Manual analysis
The matched C code is not a particularly good match. The Advisor suggests that this is a floating point add reduction loop over 32-bit floats into a 64-bit double result, with a relatively complex indexing calculation to access the addends.
The path forward is to add custom training examples until we converge on a decent match. We can do that by adding a new training function to the database:
#include <stdint.h>
double reduce_floats(float *p, uint32_t count)
{
double result = 0.0;
for (int i = 0; i < count; i++)
{
result += p[i * 14];
}
return result;
}
Rebuild the database with generator.py, ingest.py, and sample_analytics.py. We can then rerun the Advisor - but we don’t get a better match!
In fact, compiling this training example with GCC 15.0, -O3, and -march=rv64gcv doesn’t produce any vector instructions at all.
We need to add one more compilation option, -ffast-math, to convince the compiler that vectorization is desired.
After adding -ffast-math to the relevant BUILD file, we get a much better Advisor output:
Advisor Output - with fast-math
Similarity Analysis
Compare the clipped example to the database of vectorized examples.
The best match is id=1532 [0.957]= bnez,vfmv.f.s,vfmv.s.f,vfredusum.vs,vfwadd.wv,vid.v,vluxei64.v,vmul.vx,vmv.v.i,vmv1r.v,vsetvli,vsetvli,vsetvli,vsll.vi
The clip is similar to the reference example data/custom_testsuite/structs/reduction_rv64gcv:reduce_floats
Notes
- The `whisper.cpp` binary was compiled with `-O3` and `-ffast-math`. This is an example of iteratively reconstructing the toolchain until we can generate something close to the observed code.
- The actual C source likely does not contain anything like `p[i * 14]`. It is much more likely to be iterating over an array of structs, each 56 bytes long, and summing a 32-bit float field within that structure.
- There likely isn’t much of a gain to be had from vectorizing this loop if VLEN is 128 bits. Only two elements can be processed per iteration, with 10 instructions per iteration. The scalar code handles one element per iteration in five instructions, so any possible gain is due to a tradeoff between code size and branch handling.
Reconciling with Ghidra
This reduction example decomposes the loop into three segments:
- A setup section before the loop, initializing:
  - the double result vector `v1` to zero
  - a pointer to the first addend in `a2`
  - a vector of relative offsets `v3 = (0, 1, ...) * 0x38`
- The loop body:
  - `a5` = number of addends that fit into a vector register
  - `v2` = an indexed load of addends
  - `v1 += v2`
  - `a4 = 0x38 * a5`
  - `a2 += a4`; adjust the address of the next first addend
- The loop post-processing:
  - `v2 = 0.0`
  - `v1[0] = v2[0] +/ v1`; where `+/` sums the elements of a single vector into the first element of the result vector
  - `fs1 = v1[0]`
The original C code might look like this:
struct astruct {
uint32_t field0x00[3];
float field0x0c;
uint32_t field0x10[10];
};
// sizeof astruct = 0x38
double reduce_floats(struct astruct* p, uint32_t count)
{
double result = 0.0;
for (int i = 0; i < count; i++)
{
result += p[i].field0x0c;
}
return result;
}
1.2 - DPDK Network Example
Are network appliances improved with autovectorization? We can examine code examples taken from the
DPDK network project to find out.
We’ll start with code that uses vector gather and slide operations in what might be a key datapath routine.
The binary dpdk_l3fwd is an example of a layer 3 forwarding engine. It includes the function rx_recv_pkts,
which appears to be selected from the DPDK source file drivers/net/iavf/iavf_rxtx.c.
This function includes several vector stanzas that use vrgather and vslide1down.
We will make things a bit easier by including the preprocessed source file drivers/net/iavf/iavf_rxtx.c
in our custom dataset under data/custom_testsuite/dpdk. This brings all of the relevant DPDK header files
inline.
The first vector stanza to examine is:
LAB_0053c12e XREF[2]: 0053be34(j), 0053be42(j)
0053c12e 73 2e 20 c2 csrrs t3,vlenb,zero
0053c132 57 73 80 0d vsetvli t1,zero,e64,m1,ta,ma
0053c136 b3 0e c0 41 neg t4,t3
0053c13a 93 58 3e 00 srli a7,t3,0x3
0053c13e 3e 97 c.add a4,a5
0053c140 d7 a1 08 52 vid.v v3
0053c144 93 87 0e 04 addi a5,t4,0x40
0053c148 93 86 f8 ff addi a3,a7,-0x1
0053c14c 33 67 f7 20 sh3add a4,a4,a5
0053c150 d7 c1 36 0e vrsub.vx v3,v3,a3
0053c154 5a 97 c.add a4,s6
0053c156 3b 8f 1b 41 subw t5,s7,a7
0053c15a a2 86 c.mv a3,s0
0053c15c 81 47 c.li a5,0x0
LAB_0053c15e XREF[1]: 0053c172(j)
0053c15e 07 71 87 02 vl1re64.v v2,(a4)
0053c162 bb 87 17 01 addw a5,a5,a7
0053c166 76 97 c.add a4,t4
0053c168 d7 80 21 32 vrgather.vv v1,v2,v3
0053c16c a7 80 86 02 vs1r.v v1,(a3)
0053c170 f2 96 c.add a3,t3
0053c172 e3 76 ff fe bgeu t5,a5,LAB_0053c15e
This sequence suggests a load/permute/store operation on 64 bit elements. There are lots of complications:
- The function `rx_recv_pkts` has static inline attributes, so it has likely been merged with other functions.
- There are no obvious candidates for the gcc C source code that generates this `vrgather` pattern.
- This may be an artifact of using snapshots of gcc and dpdk.
- The `vs1r.v` instruction is a whole register store variant that ignores `vsetvli` settings.
Analysis of this snippet is easier with emulation:
- The assembly fragment is copied into a new source file, `emulations/test_1.S`.
- The assembly source is converted into a subroutine.
- Instrumentation is added to store intermediate values into global data variables.
- A `main.c` driver routine is added, to provide input and output vectors and instrumentation printouts.
- A RISCV-64 executable is built.
- A RISCV-64 QEMU userspace emulator is built and configured.
- The RISCV-64 executable is run to see what permutations are generated.
- The RISCV-64 executable is run with alternate values for VLEN, to show that the results are stable regardless of the vector hardware implementation.
Process this Ghidra assembly fragment through the current Advisor:
Signatures:
* Element width is = 64 bits
* Vector element index: vid.v
* often used in generating a vector of offsets for indexing
* Vector whole register load: vl1re64.v
* Vector gather: vrgather.vv
* Vector whole register(s) store: vs1r.v
* Significant opcodes, in the order they appear:
* vsetvli,vid.v,vrsub.vx,vl1re64.v,vrgather.vv,vs1r.v,bgeu
* Significant opcodes, in alphanumeric order:
* bgeu,vid.v,vl1re64.v,vrgather.vv,vrsub.vx,vs1r.v,vsetvli
* At least one loop exists
## Similarity Analysis
Compare the clipped example to the database of vectorized examples.
The best match is id=1689 [0.860]= vid.v,vle16.v,vrgather.vv,vrsub.vi,vse16.v,vsetivli
The clip is similar to the reference example `data/gcc_riscv_testsuite/rvv/autovec/vls-vlmax/perm-4_rv64gcv:permute_vnx2hi`
### Reference C Source
__attribute__ ((noipa))
void permute_vnx2hi(vnx2hi values1, vnx2hi values2, vnx2hi *out)
{
vnx2hi v = __builtin_shufflevector (values1, values2, (2) - 1 - (0), (2) - 2 - (0));
*(vnx2hi *) out = v;
}
Gcc’s __builtin_shufflevector with the given mask reverses the order of a vector of two 16 bit values.
The signature match of 0.86 suggests this is just a hint of what is going on here.
After rewriting the Ghidra sample as a riscv64 assembly routine, we get:
.section .data.test_1
.globl cntr, vl, src_incr, dst_incr, src_start, dst_start
.globl cnt_limit
dst_incr:
.dword 0
vl:
.dword 0
src_incr:
.dword 0
cntr:
.dword 0
src_start:
.dword 0
dst_start:
.dword 0
cnt_limit:
.word 0
.section .text.test_1,"ax",@progbits
.globl test_1
.type test_1, @function
test_1:
csrrs t3,vlenb,zero
lui t6,%hi(dst_incr)
addi t6,t6,%lo(dst_incr)
sd t3,(t6) # ->dst_incr
vsetvli t1,zero,e64,m1,ta,ma
sd t1,8(t6) # ->vl
neg t4,t3
sd t4,16(t6) # ->src_incr
srli a7,t3,0x3
sd a7,24(t6) # ->cntr
vid.v v3
addi a5,t4,0x40
addi a3,a7,-0x1
add a0,a5,a0
sd a0,32(t6) # ->src_start
vrsub.vx v3,v3,a3
c.li t5,0x8
subw t5,t5,a7
sw t5,48(t6) # ->cnt_limit
vs1r.v v3,(a2)
c.li a5,0x0
sd a1,40(t6) # ->dst_start
LAB_0053c15e:
vl1re64.v v2,(a0)
addw a5,a5,a7
c.add a0,t4
vrgather.vv v1,v2,v3
vs1r.v v1,(a1)
add a1,a1,t3
bgeu t5,a5,LAB_0053c15e
ret
.globl get_vlenb
.type get_vlenb, @function
get_vlenb:
csrrs a0,vlenb,zero
srli a2,a0,0x3
ret
The main routine to exercise this is:
#include <stdio.h>
extern void test_1(unsigned long long *, unsigned long long *, unsigned long long *);
extern void test_1_ref(unsigned long long *in, unsigned long long *out, unsigned int size);
extern long long cntr, vl, src_incr, dst_incr, src_start, dst_start;
extern long cnt_limit;
int main(void)
{
unsigned long long in_vector[8] = {90,91,92,93,94,95,96,97};
unsigned long long out_vector[8] = {0,0,0,0,0,0,0,0};
unsigned long long shuffle_vector[4] = {0,0,0,0};
printf("Initializing\n");
printf("in_vector_addr: %16p\n", (void*)in_vector);
printf("out_vector_addr: %16p\n", (void*)out_vector);
test_1(in_vector, out_vector, shuffle_vector);
printf("cntr: %lld\n", cntr);
printf("vl: %lld\n", vl);
printf("src_incr: 0x%llx\n", src_incr);
printf("dst_incr: 0x%llx\n", dst_incr);
printf("src_start: %p\n", (void*)src_start);
printf("dst_start: %p\n", (void*)dst_start);
printf("cnt_limit: 0x%lx\n", cnt_limit);
printf("shuffle vector: %lld, %lld, %lld, %lld\n", shuffle_vector[0], shuffle_vector[1], shuffle_vector[2], shuffle_vector[3]);
printf("out vector: %lld, %lld, %lld, %lld, ", out_vector[0], out_vector[1], out_vector[2], out_vector[3]);
printf("%lld, %lld, %lld, %lld\n", out_vector[4], out_vector[5], out_vector[6], out_vector[7]);
printf("Assembly version Complete, starting reference version\n");
/* Once the assembly mockup returns a good value, try a C equivalent
test_1_ref(in_vector, out_vector, 8);
printf("out vector: %lld, %lld, %lld, %lld, ", out_vector[0], out_vector[1], out_vector[2], out_vector[3]);
printf("%lld, %lld, %lld, %lld\n", out_vector[4], out_vector[5], out_vector[6], out_vector[7]);
*/
}
The QEMU VLEN variable and microarchitecture extensions are configured on the command line. The following requests VLEN=128, the minimum vector register size.
$ export QEMU_CPU=rv64,zba=true,zbb=true,v=true,vlen=128,vext_spec=v1.0,rvv_ta_all_1s=true,rvv_ma_all_1s=true
Build and run with:
training_set$ bazel build --platforms=//platforms:riscv_gcc emulations:test_1
INFO: Analyzed target //emulations:test_1 (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
Target //emulations:test_1 up-to-date:
bazel-bin/emulations/test_1
training_set$ qemu-riscv64-static -L /opt/riscvx/sysroot -E LD_LIBRARY_PATH=/opt/riscvx/sysroot/riscv64-unknown-linux-gnu/lib/ bazel-bin/emulations/test_1
Initializing
in_vector_addr: 0x2aaaab2aaab0
out_vector_addr: 0x2aaaab2aaaf0
cntr: 2
vl: 2
src_incr: 0xfffffffffffffff0
dst_incr: 0x10
src_start: 0x2aaaab2aaae0
dst_start: 0x2aaaab2aaaf0
cnt_limit: 0x6
shuffle vector: 1, 0, 0, 0
out vector: 97, 96, 95, 94, 93, 92, 91, 90
Complete!
Repeat with VLEN=256 and the same executable binary, to see if the results are stable:
$ export QEMU_CPU=rv64,zba=true,zbb=true,v=true,vlen=256,vext_spec=v1.0,rvv_ta_all_1s=true,rvv_ma_all_1s=true
training_set$ qemu-riscv64-static -L /opt/riscvx/sysroot -E LD_LIBRARY_PATH=/opt/riscvx/sysroot/riscv64-unknown-linux-gnu/lib/ bazel-bin/emulations/test_1
Initializing
in_vector_addr: 0x2aaaab2aaab0
out_vector_addr: 0x2aaaab2aaaf0
cntr: 4
vl: 4
src_incr: 0xffffffffffffffe0
dst_incr: 0x20
src_start: 0x2aaaab2aaad0
dst_start: 0x2aaaab2aaaf0
cnt_limit: 0x4
shuffle vector: 3, 2, 1, 0
out vector: 97, 96, 95, 94, 93, 92, 91, 90
Complete!
So now the permutation is clear - this snippet copies a vector of 64 bit values - likely pointers - from one location to another in the reverse order.
The source pointer advances backwards, starting vl elements before the end, while the
destination pointer advances forwards, starting at the first element. The reverse-copy
operation is stable with VLEN of 128 or 256 bits.
Next we can add test_1_ref.c, a C routine which should compile to something much closer to the
sample binary:
void test_1_ref(unsigned long long *in, unsigned long long *out, unsigned int size)
{
int i;
int upper_index = size - 1;
for (i=0; i < size; i++) {
out[i] = in[upper_index - i];
}
}
Record the new example
We want to add test_1_ref to the training set database, so copy it into custom_testsuite/structs, update the
BUILD file there, then rerun the analysis:
./generator.py
./ingest.py
./sample_analytics.py
./advisor.py data_test/advisor_tests/gather2.ghidra
...
The best match is id=1864 [0.926]= bgeu,bltu,bne,vid.v,vl1re64.v,vrgather.vv,vrsub.vx,vs1r.v,vsetvli
The clip is similar to the reference example `data/custom_testsuite/structs/test_1_ref_rv64gcv:test_1_ref`
The extra branch instructions clutter the signature match result somewhat.
Feature extraction
What generalized features does this example provide?
- There is a single loop
- The loop includes one vector load, one vector store, and one vector gather operation.
- The loop exit condition depends on a scalar counter advancing to equal or exceed a fixed limit value, without a dependency on vector element values.
- Source and destination pointers increment with different values
- The loop setup identifies the SEW as 64 bits. This is unchanged throughout the sample code, which implies that this `vs1r.v` instruction is equivalent to a `vse64.v` instruction.
- The `bgeu` instruction terminating the loop could be encoded as `bleu` with reversed operands. More generally, the loop termination could have been implemented in several different ways.
Register assignments appear to be:
| Assignment | Register | Initializations |
|---|---|---|
| Source address | a4 | |
| Source address increment | t4 | = (-vlenb) |
| Destination address | a3 | |
| Destination address increment | t3 | = vlenb |
| Loop counter | a5 | =0 |
| Loop counter increment | a7 | = vlenb >> 3 |
| Loop counter limit | t5 | |
| Source vector register | v2 | |
| Destination vector register | v1 | |
| Gather index vector register | v3 | depends on vlenb |
The vector gather index register v3 depends on vl, the number of 64 bit elements that fit within a vector register.
- if vlen = 128 - or vlenb = 16 - then v3 = (1,0)
- if vlen = 256 - or vlenb = 32 - then v3 = (3,2,1,0)
It is not at all clear whether the compiler knows the size of the source vector, or whether the code works as intended when the source vector size is odd or smaller than the size of a vector register.
We can resolve that by looking at Ghidra’s decompiler window. The vector instruction stanza actually begins earlier, executing the vector instructions only after a test of vlenb.
Examination of test_1_ref
GCC 15 compiles test_1_ref.c into something more complex than the assembly code stanza
we analyzed. The single-line loop becomes three loops - a scalar loop to handle
relatively small vector sizes, then a vector loop to handle an integer number of VLEN strips, finally a second
scalar loop to handle any remaining elements. For the code stanza passed in for analysis the
input and output vectors appear to have a fixed size known to the compiler - 64 bytes. This implies
no scalar loops are needed for VLEN=128 bits or VLEN=256 bits. The code stanza would generate
a buffer overflow error for VLEN>512 bits.
void test_1_ref(ulonglong *in,ulonglong *out,ulong size)
{
ulonglong *pIn;
ulonglong *puVar1;
long lVar2;
ulonglong uVar3;
ulong uVar4;
ulong uVar5;
int iVar6;
undefined auVar7 [256];
undefined auVar8 [256];
ulong in_vlenb;
ulonglong *pOut;
gp = &__global_pointer$;
if (size != 0) {
uVar5 = (ulong)((int)size + -1);
if (((uVar5 < 0xd) || (uVar5 < (ulong)(long)((int)(in_vlenb >> 3) + -1))) ||
((lVar2 = uVar5 + 1, in + (lVar2 - (size & 0xffffffff)) < out + (size & 0xffffffff) &&
(out < in + lVar2)))) {
pIn = in + uVar5;
pOut = out;
do {
uVar3 = *pIn;
puVar1 = pOut + 1;
pIn = pIn + -1;
*pOut = uVar3;
pOut = puVar1;
} while (puVar1 != out + (size & 0xffffffff));
}
else {
vsetvli_e64m1tama(0);
auVar8 = vid_v();
auVar8 = vrsub_vx(auVar8,(in_vlenb >> 3) - 1);
lVar2 = lVar2 * 8 + -in_vlenb + (long)in;
iVar6 = (int)(in_vlenb >> 3);
uVar4 = 0;
pOut = out;
do {
auVar7 = vl1re64_v(lVar2);
uVar4 = (ulong)((int)uVar4 + iVar6);
lVar2 = lVar2 + -in_vlenb;
auVar7 = vrgather_vv(auVar7,auVar8);
vs1r_v(auVar7,pOut);
pOut = (ulonglong *)((long)pOut + in_vlenb);
} while (uVar4 <= (ulong)(long)((int)size - iVar6));
if (size != uVar4) {
pOut = in + (uVar5 - uVar4);
pIn = out + uVar4;
do {
uVar3 = *pOut;
uVar4 = (ulong)((int)uVar4 + 1);
pOut = pOut + -1;
*pIn = uVar3;
pIn = pIn + 1;
} while (uVar4 < size);
return;
}
}
}
return;
}
The scalar loop consists of about 22 bytes, or 7 instructions. The full vector function takes up 180 bytes. There is little reason to believe that vectorization actually improves this code, especially for vector harts holding only two 64 bit objects per vector register.
Summary
There is no visible reason to enable full autovectorization for datapath network routing code like DPDK. It might make sense to vectorize lockable critical regions or specific critical path code. Vector replacement of utility functions like memcpy or strncmp would still make sense. If someone created an AI-adaptive network appliance, the math inference engine portions would likely benefit from vectorization.
1.3 - Whisper Exploration Example
How do we move forward when the Advisor doesn’t provide a lot of help? We’ll start with an example taken from Whisper.cpp’s main routine.
pFVar26 = local_460;
vsetivli_e64m1tama(2);
local_700 = (FILE *)local_c0._40_8_;
local_6f8 = pFVar34->_IO_read_ptr;
auVar45 = vle64_v(avStack_470);
vmv_v_i(in_v4,0);
auVar46 = vle64_v(&local_700);
vse64_v(in_v4,avStack_470);
auVar47 = vslidedown_vi(auVar45,1);
auVar46 = vslidedown_vi(auVar46,1);
local_4e0 = (FILE *)vmv_x_s(auVar46);
auVar46 = vle64_v(&local_700);
pcVar20 = (char *)vmv_x_s(auVar47);
local_4d8 = local_88;
local_c0._40_8_ = vmv_x_s(auVar45);
pFVar34->_IO_read_ptr = pcVar20;
local_4e8 = (FILE *)vmv_x_s(auVar46);
local_460 = (FILE *)0x0;
local_88 = pFVar26;
std::vector<>::~vector((vector<> *)&local_4e8);
std::vector<>::~vector(avStack_470);
std::_Rb_tree<>::_M_erase((_Rb_tree_node *)local_490);
This clearly isn’t a loop. Instead it is some sort of initialization sequence that allows vector instructions to slightly optimize the code. The advisor results aren’t very helpful:
Signatures:
Vector length set to = 0x2
Element width is = 64 bits
Vector load: vle64.v
Vector load: vle64.v
Vector store: vse64.v
Vector integer slidedown: vslidedown.vi
Vector integer slidedown: vslidedown.vi
Vector load: vle64.v
Significant operations, in the order they appear:
vsetivli,vle64.v,vmv.v.i,vle64.v,vse64.v,vslidedown.vi,vslidedown.vi,vmv.x.s,vle64.v,vmv.x.s,vmv.x.s,vmv.x.s
Significant operations, in alphanumeric order:
vle64.v,vle64.v,vle64.v,vmv.v.i,vmv.x.s,vmv.x.s,vmv.x.s,vmv.x.s,vse64.v,vsetivli,vslidedown.vi,vslidedown.vi
Similarity Analysis
Compare the clipped example to the database of vectorized examples.
The best match is id=1889 [0.652]= vmv.v.i,vmv.x.s,vmv.x.s,vmv.x.s,vse8.v,vsetivli,vsetivli,vsetivli,vsetvli
The clip is similar to the reference example data/custom_testsuite/builtins/string_rv64gcv:bzero_15
This suggests several Advisor improvements:
- explicitly report that no loops are found, and that the stanza is likely a vector optimization of scalar instruction transforms.
- add a quick explanation of what vslidedown.vi does
- the vmv instructions need annotation, especially any that load constants into registers.
A manual analysis suggests that the vector instructions manipulate pairs of 64 bit pointers, variously copying them, zeroing them, or copying first or second elements of the pair into scalar registers. That probably means we want simple C++ vector manipulation functions in our set of custom patterns.
1.4 - Whisper Memcpy
Is it easy to recognize vector expansions of libc functions like memcpy?
Let’s locate some explicit invocations of memcpy within Whisper and
see what the Advisor has to say.
struct whisper_context * whisper_init_from_buffer_with_params_no_state(void * buffer, size_t buffer_size, struct whisper_context_params params) {
struct buf_context {
uint8_t* buffer;
size_t size;
size_t current_offset;
};
loader.read = [](void * ctx, void * output, size_t read_size) {
buf_context * buf = reinterpret_cast<buf_context *>(ctx);
size_t size_to_copy = buf->current_offset + read_size < buf->size ? read_size : buf->size - buf->current_offset;
memcpy(output, buf->buffer + buf->current_offset, size_to_copy);
buf->current_offset += size_to_copy;
return size_to_copy;
};
};
This source example shows a few traits:
- the number of bytes to copy is not in general known at compile time
- the buffer type is `uint8_t *`
- there are no alignment guarantees
GCC 15 compiles the lambda stored in loader.read as
whisper_init_from_buffer_with_params_no_state::{lambda(void*,void*,unsigned_long)#1}::_FUN.
The relevant Ghidra listing window instruction sequence (trimmed of address and whitespace) is:
LAB_000b0be2
vsetvli a3,param_3,e8,m8,ta,ma
vle8.v v8,(a4)
c.sub param_3,a3
c.add a4,a3
vse8.v v8,(param_2)
c.add param_2,a3
c.bnez param_3,LAB_000b0be2
Copying the Ghidra listing to the clipboard and running the Advisor gives us:
Clipboard Contents to Analyze
LAB_000b0be2 XREF[1]: 000b0bf4(j)
000b0be2 d7 76 36 0c vsetvli a3,param_3,e8,m8,ta,ma
000b0be6 07 04 07 02 vle8.v v8,(a4)
000b0bea 15 8e c.sub param_3,a3
000b0bec 36 97 c.add a4,a3
000b0bee 27 84 05 02 vse8.v v8,(param_2)
000b0bf2 b6 95 c.add param_2,a3
000b0bf4 7d f6 c.bnez param_3,LAB_000b0be2
Signatures:
Element width is = 8 bits
Vector registers are grouped with MUL = 8
Vector load: vle8.v
Vector load is to multiple registers
Vector store: vse8.v
Vector store is from multiple registers
At least one loop exists
Significant operations, in the order they appear:
vsetvli,vle8.v,vse8.v,_loop
Significant operations, in alphanumeric order:
_loop,vle8.v,vse8.v,vsetvli
Similarity Analysis
Compare the clipped example to the database of vectorized examples.
The best match is id=1873 [1.000]= _loop,vle8.v,vse8.v,vsetvli
The clip is similar to the reference example data/custom_testsuite/builtins/memcpy_rv64gcv:memcpy_255
Reference C Source
void memcpy_255()
{
__builtin_memcpy (to, from, 255);
};
Reference Compiled Assembly Code
65e auipc a3,0x2
662 ld a3,-1678(a3)
666 auipc a2,0x0
66a addi a2,a2,82
66e li a4,255
672 vsetvli a5,a4,e8,m8,ta,ma
676 vle8.v v8,(a2)
67a sub a4,a4,a5
67c add a2,a2,a5
67e vse8.v v8,(a3)
682 add a3,a3,a5
The Advisor has matched the vector instruction loop to the GCC __builtin_memcpy test case where the
number of bytes to transfer is large (255). The individual scalar instructions are not the same.
This example shows something important that we probably want to add to the Advisor’s report:
The vsetvli instruction includes the m8 multiplier option, which means vector operations cover groups of 8
registers. The vle8.v only references vector register v8, but the loads and stores affect the 8
registers v8 through v15. If the __builtin_memcpy appeared in an inline code fragment, where
there may be more pressure on vector register availability, we might have seen very similar code
with multipliers of m4, m2, or m1.
What does the Ghidra decompiler show for this instruction sequence?
do {
lVar3 = vsetvli_e8m8tama(uVar1);
auVar4 = vle8_v(lVar2);
uVar1 = uVar1 - lVar3;
lVar2 = lVar2 + lVar3;
vse8_v(auVar4,param_2);
param_2 = (void *)((long)param_2 + lVar3);
} while (uVar1 != 0);
What would we like Ghidra’s decompiler to show instead? Something like:
__builtin_memcpy(param_2, lVar2, uVar1);
That’s not quite correct, as `__builtin_memcpy` doesn’t mutate the values `param_2` or `lVar2`.
1.5 - Whisper Output Forensics
You might expect Whisper to use a lot of vector instructions in its inference engine, and it definitely does.
Are vector instructions common enough to complicate Whisper forensic analysis, looking at functions an
adversary is likeliest to target? For this example we will make a deep dive into the function output_txt,
since malicious code might want to review and alter dictated text.
This code also lets us examine how RISCV vector instructions are used to implement the libstdc++ vector library functions.
const char * whisper_full_get_segment_text(struct whisper_context * ctx, int i_segment) {
return ctx->state->result_all[i_segment].text.c_str();
}
static bool output_txt(struct whisper_context * ctx, const char * fname, const whisper_params & params, std::vector<std::vector<float>> pcmf32s) {
std::ofstream fout(fname);
if (!fout.is_open()) {
fprintf(stderr, "%s: failed to open '%s' for writing\n", __func__, fname);
return false;
}
fprintf(stderr, "%s: saving output to '%s'\n", __func__, fname);
const int n_segments = whisper_full_n_segments(ctx);
for (int i = 0; i < n_segments; ++i) {
const char * text = whisper_full_get_segment_text(ctx, i);
std::string speaker = "";
if (params.diarize && pcmf32s.size() == 2)
{
const int64_t t0 = whisper_full_get_segment_t0(ctx, i);
const int64_t t1 = whisper_full_get_segment_t1(ctx, i);
speaker = estimate_diarization_speaker(pcmf32s, t0, t1);
}
fout << speaker << text << "\n";
}
return true;
}
The key elements of this function are:
- text is collected in segments and stored in the context variable `ctx`
- text segments are retrieved with the function `whisper_full_get_segment_text`
- text is copied into an output stream `fout`
The params.diarize code block matters only if voice is collected
in stereo and Whisper has been asked to differentiate between speakers.
The Ghidra decompiler shows four vector instruction sets starting with a vset* instruction.
The first of these is a simple initialization:
vsetivli_e64m1tama(2);
vmv_v_i(in_v1,0);
vse64_v(in_v1,auStack_90);
vse64_v(in_v1,auStack_80);
These instructions initialize two adjacent 16 byte blocks of memory to zero. These are likely four 64 bit pointers or counters embedded within structures.
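In scalar terms the stanza reduces to zeroing 32 bytes. A sketch, with placeholder globals standing in for Ghidra's `auStack_90` and `auStack_80` locals:

```c
#include <string.h>
#include <stdint.h>

uint64_t stack_90[2];  /* stands in for auStack_90 */
uint64_t stack_80[2];  /* stands in for auStack_80 */

/* vmv.v.i v1,0 followed by two vse64.v stores of two 64-bit
   elements each: four zeroed 64-bit slots in all. */
void init_blocks(void) {
    memset(stack_90, 0, sizeof stack_90);
    memset(stack_80, 0, sizeof stack_80);
}
```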
The next vector stanza is:
vsetivli_e64m1tama(2);
lStack_2e0 = lStack_288;
uStack_2d8 = local_280[0];
auVar24 = vle64_v(&lStack_2e0);
auVar25 = vle64_v(&lStack_2e0);
auVar24 = vslidedown_vi(auVar24,1);
lStack_2a8 = vmv_x_s(auVar25);
local_2a0[0] = vmv_x_s(auVar24);
This one is puzzling, as it appears to load two 64 bit values twice, then store them into separate scalar registers.
The next stanza looks like a simple memcpy expansion:
do {
lVar18 = vsetvli_e8m8tama(lStack_288);
auVar24 = vle8_v(puVar16);
lStack_288 = lStack_288 - lVar18;
puVar16 = (ulong *)((long)puVar16 + lVar18);
vse8_v(auVar24,puVar20);
puVar20 = (ulong *)((long)puVar20 + lVar18);
} while (lStack_288 != 0);
The final stanza is the interesting one:
pcVar17 = text;
do {
vsetvli_e8m1tama(0);
pcVar17 = pcVar17 + lVar18;
auVar24 = vle8ff_v(pcVar17);
auVar24 = vmseq_vi(auVar24,0);
lVar19 = vfirst_m(auVar24);
lVar18 = in_vl;
} while (lVar19 < 0);
std::__ostream_insert<>(pbVar12,text,(long)(pcVar17 + (lVar19 - (long)text)));
This appears to be a vector implementation of strlen(text) requested by std::__ostream_insert<>.
Our hypothetical adversary would want to evaluate *text and reset the text pointer to the maliciously altered output string.
The current Advisor classifies these four stanzas as:
- some sort of initializer
- some sort of shuffle
- `memcpy`
- `strlen`
A Ghidra user would likely ignore the initializer and the shuffle as doing something benign and obscure within the I/O subsystem,
recognize the memcpy and strlen for what they are, then concentrate on any unexpected manipulations of the *text string.
2 - Ghidra Advisor Reference
2.1 - Advisor Dependencies
Ghidra
- Ghidra 11.3_DEV with the RISCV-64 isa_ext branch. Without this branch Ghidra is stuck with a never-ratified older version of RISCV support.
Bazel
Bazel builds in this workspace generate output in the temporary directory /run/user/1000/bazel, as specified in .bazelrc.
This override can be changed or removed.
This project should work with Bazel 7.x as well, after adjusting some toolchain path names. Bazel 8 uses ‘+’ instead of ‘~’ as an external repo naming suffix and ‘@@’ instead of ‘@’ to identify standard bazel repositories.
Toolchain
- binutils 2.42.50
- gcc 15.0
- glibc developmental version 2.39.9000
- sysroot - a stripped down linux sysroot derived from the sysroot bootstrap in riscv-gnu-toolchain
The toolchain is packaged locally as a Bazel module named gcc_riscv_suite, version 15.0.0.1.
(Note that this is the first patch to the Bazel module based on the unreleased GCC-15.0.0).
This module depends on a second module, fedora_syslibs version 41.0.0. These are served out of a local
Bazel module repository.
The gcc_riscv_suite and fedora_syslibs modules wrap a 42 MB and 4.0 MB tarball, respectively.
Emulators
Two qemu emulators are used, both built from source shortly after the 9.0.50 release.
- `qemu-riscv64` provides user space emulation, which is very useful for exploring the behavior of particularly confusing assembly code sequences.
- `qemu-system-riscv64` provides full RISCV-64 VM hosting. This is more narrowly useful when testing binaries like DPDK which require non-standard kernel options or kernel modules.
  - The RISCV-64 VM used here is based on an Ubuntu 24.04 disk image and the `u-boot.bin` boot loader. This boot loader is critical for RISCV VMs, since the emulated BIOS firmware provides the kernel with the definitive set of RISCV extensions available to the hardware threads (aka harts).
Jupyter
- jupyterlab 4.1.1
System
- Fedora 41 with wayland graphics.
- Python 3.13
2.2 - Populating the Database
The training set consists of matched C source code and RISCV-64 disassembly code.
The C source is processed through the C preprocessor cpp and indent. That code is then
compiled with GCC for at least two different machine architectures, and the results are saved under ./data.
Populating the Training Set Database
The initial C source code is selected from the GCC riscv autovector testsuite. We can add custom examples of code to fill gaps or represent code patterns we might find in a Ghidra binary under review. Autovectored loops over structure arrays can be especially confusing to interpret, so we will likely want extra samples of that type.
The C sources for these two test suites appear in ./gcc_riscv_testsuite and ./custom_testsuite.
The script generator.py processes these into cpp output (*.i), compiled libraries (*.so), and objdump
assembly listings (*_objdump) for each requested machine architecture.
For example, gcc_riscv_testsuite/rvv/autovec/reduc/reduc-1.c is processed by generator.py into:
- data/gcc_riscv_testsuite/rvv/autovec/reduc/reduc-1_rv64gc.i
- data/gcc_riscv_testsuite/rvv/autovec/reduc/reduc-1_rv64gc.so
- data/gcc_riscv_testsuite/rvv/autovec/reduc/reduc-1_rv64gc_objdump
- data/gcc_riscv_testsuite/rvv/autovec/reduc/reduc-1_rv64gcv.i
- data/gcc_riscv_testsuite/rvv/autovec/reduc/reduc-1_rv64gcv.so
- data/gcc_riscv_testsuite/rvv/autovec/reduc/reduc-1_rv64gcv_objdump
The ingest.py script reads everything under ./data to populate the sample table in the Sqlite3 database:
.schema sample
CREATE TABLE sample(id INTEGER PRIMARY KEY AUTOINCREMENT, namespace TEXT, arch TEXT, name TEXT, source TEXT, assembly TEXT);
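A minimal sketch of what one ingest step can look like, using Python's sqlite3 module against the schema above. The in-memory database and the example row are illustrative only; see ingest.py for the real logic and database path.

```python
import sqlite3

# In-memory database for illustration; ingest.py writes to a real file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sample(id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "namespace TEXT, arch TEXT, name TEXT, source TEXT, assembly TEXT)"
)

# One matched pair of C source and objdump disassembly, per architecture.
conn.execute(
    "INSERT INTO sample(namespace, arch, name, source, assembly) "
    "VALUES (?, ?, ?, ?, ?)",
    ("gcc_riscv_testsuite", "rv64gcv", "reduc-1",
     "int reduc(int *a, int n) { /* ... */ }",
     "vsetvli a5,a1,e32,m1,ta,ma\nvle32.v v1,(a0)\n"),
)
conn.commit()

rows = conn.execute("SELECT arch, name FROM sample").fetchall()
print(rows)  # → [('rv64gcv', 'reduc-1')]
```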
Next we need to generate signatures from this table with the sample_analytics.py script. At present, signatures are simple
strings. We have three signature types at the moment.
.schema signatures
CREATE TABLE signatures(id INTEGER PRIMARY KEY AUTOINCREMENT, sample_id INTEGER, signature_type TEXT, signature_value TEXT);
select distinct signature_type from signatures;
Traits
Opcodes, sorted
Opcodes, ordered
- The `Traits` signature holds simple facts from the disassembly code, such as `hasLoop` if at least one backwards branch exists.
- The `Opcodes, ordered` signature is a simple list of vector and branch opcodes, concatenated in the same order as they are found in the disassembly.
- The `Opcodes, sorted` signature is similar to `Opcodes, ordered`, but sorted into alphanumeric order. This may be useful if the compiler reorders instructions.
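Generating the two opcode signatures can be as simple as filtering a disassembly listing for vector and branch mnemonics. The sketch below illustrates the idea; the regex and the branch-mnemonic set are assumptions, not sample_analytics.py's actual rules.

```python
import re

# Assumed heuristic: vector mnemonics start with 'v'; these conditional
# branch mnemonics count as loop evidence.
BRANCHES = {"beq", "bne", "blt", "bge", "bltu", "bgeu", "c.bnez", "c.beqz"}

def opcode_signatures(assembly: str):
    """Return ('Opcodes, ordered', 'Opcodes, sorted') for a listing."""
    ops = []
    for line in assembly.splitlines():
        m = re.match(r"\s*([a-z][a-z0-9_.]*)", line)
        if not m:
            continue
        op = m.group(1)
        if op.startswith("v") or op in BRANCHES:
            ops.append(op)
    return " ".join(ops), " ".join(sorted(ops))

listing = """vsetvli a5,a3,e32,mf2,tu,ma
vluxei64.v v2,(a2),v3
slli a4,a5,0x3
vfwadd.wv v1,v4,v2
c.bnez a3,LAB_000cc184"""
ordered, sorted_sig = opcode_signatures(listing)
print(ordered)  # → vsetvli vluxei64.v vfwadd.wv c.bnez
```

Scalar instructions like `slli` drop out, which is part of the point: the vector and branch skeleton survives register allocation and most instruction scheduling.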
Querying the Database in Advisor
Users can select assembly code from Ghidra’s listing window, then run analysis cells in Advisor.ipynb to generate reports
on the types of C code that may match the listing. Users will likely want to iterate complex selections by adding custom
examples and repeating the match, to see if they can reproduce the C code that might have generated the vectorized assembly.
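The match itself can then be a simple set-similarity ranking over the sorted-opcode signatures. This is a hypothetical sketch of the idea; Advisor.ipynb's actual scoring may differ.

```python
def jaccard(sig_a: str, sig_b: str) -> float:
    """Set similarity between two 'Opcodes, sorted' signatures."""
    a, b = set(sig_a.split()), set(sig_b.split())
    if not (a or b):
        return 1.0
    return len(a & b) / len(a | b)

def best_matches(query: str, samples: dict, top: int = 3):
    """Rank stored samples by similarity to a signature from the clipboard."""
    scored = [(jaccard(query, sig), name) for name, sig in samples.items()]
    return sorted(scored, reverse=True)[:top]

samples = {  # hypothetical sample names and signatures
    "reduc-1_rv64gcv": "c.bnez vfredusum.vs vle32.v vsetvli",
    "memcpy_rv64gcv": "c.bnez vle8.v vse8.v vsetvli",
}
query = "c.bnez vfmv.s.f vfredusum.vs vle32.v vsetvli"
print(best_matches(query, samples))
```

Here the reduction exemplar scores highest because it shares the `vfredusum.vs` opcode that dominates the query's character.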
2.3 - Feature Analysis
At a very high level the Advisor tries to translate between two languages - C or C++ source code that a human might write and the sequence of assembly language instructions a compiler generates. Compilers are very good at the forward translation from C to assembly. We want something that works in the reverse direction - suggesting C or other source code that might have been compiled into specific assembly sequences.
The Advisor tries to brute-force this reverse translation by compiling a reference set of C sources into binaries, extracting
the instructions with objdump into a database, then looking for the best match to the instructions copied into the clipboard.
The GCC compiler test suite gives us thousands of reference C functions to start with.
Some features are easy to recognize.
- if the assembly listing includes a backwards branch instruction and branch target, then the source code likely contains a vectorized loop.
- if the assembly listing includes an instruction matching `vred*`, `vfred*`, or `vwred*`, then the source code likely contains a vectorized reduction loop reading a vector and emitting a scalar.
Other features are mostly distractions, adding entropy that we would like to ignore:
- The local choice of registers to hold intermediate values
- The specific loop termination branch condition - a test against a counter, an input pointer, or an output pointer are all equally valid but only one will be implemented.
- Instruction ordering is often arbitrary inside a loop, as counters and pointers are incremented/decremented.
- The compiler may reorder instructions to minimize the impact of memory latency.
- The compiler will change the emitted instructions for inlined functions depending on what it knows at compile time. This is especially true when the compiler knows the exact number of loop iterations, the alignment of operands, and the minimum size of vector registers.
- The compiler will change the emitted instructions based on the local ‘register pressure’ - whether or not there are lots of free vector registers.
- The compiler (or inline header macros) will translate a simple loop into multiple code blocks evaluated at run time. If the count is small, a scalar implementation is used. If the count is large one or more vector blocks are used.
- The compiler writers sometimes have to guess whether to optimize for instruction count, minimal branches, or memory accesses.
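One way to discard the register-choice entropy called out above is to normalize operands before comparing signatures. A hypothetical sketch, assuming a simple register-name pattern (the real Advisor normalization, if any, may differ):

```python
import re

# Assumed normalization: collapse concrete register names to a single
# placeholder so register allocation differences don't affect matching.
REG = re.compile(r"\b(?:v\d+|x\d+|a[0-7]|s\d+|t\d+|f[ast]\d+)\b")

def normalize(line: str) -> str:
    """Replace vector, scalar, and FP register operands with 'R'."""
    return REG.sub("R", line)

print(normalize("vfwadd.wv v1,v4,v2"))  # → vfwadd.wv R,R,R
print(normalize("slli a4,a5,0x3"))      # → slli R,R,0x3
```

Immediates and branch targets survive normalization, which is usually what we want: a shift amount of 0x3 is meaningful, while the choice of a4 over a6 is not.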
And some features are harder to recognize but useful for the Ghidra user:
- Operand type is sometimes set at runtime, not encoded into the instruction opcode.
- Compilers can emit completely different code if the machine architecture indicates a vector length of greater than 128 bits.
- Vector registers may be grouped based on runtime context, so that the number of registers read or written must be inferred from instruction flows.
- The compiler will accept intrinsic vector functions - not all vector loops have a C counterpart.
2.4 - Frequently Asked Questions
- Why Bazel?
- Bazel does a good job of managing cross-compiler builds and build caches together, where the cross-compiler toolchain can be switched easily.
- How do I compile with support for RISCV Instruction Set Architecture extensions?
- The binutils and gcc base code need to support those extensions first.
The gcc compiler uses the `-march=` command line option to identify which extensions to apply for a given compilation. For example, `-march=rv32gcv` says vector instructions are supported, while `-march=rv32gc` excludes vector instructions.
- What machine architectures are currently implemented?
- The `variables.bzl` file sets `MARCH_SET = ("rv64gc", "rv64gcv")`. Most sources are then compiled with and without vector support.
- Are all RISCV vector binaries runnable on all vector hardware threads?
- Not always. By default GCC will build for a minimum vector register length (`VLEN`) of 128 bits, which should be portable across all general purpose RISCV harts. If `_zvl512b` were added to the `-march` setting, GCC would know that vector registers are bigger and could unroll loops more aggressively, generating code that will fail on 128 bit vector harts. This can get complicated when processors have both 128 bit and 512 bit cores, like the sg2380.
- Aren't vector extensions unlikely to be used in programs that don't do vector math?
- No. Vector extensions are very likely to be found in inlined utilities like memcpy and strncmp. Most simple loops over arrays of structs can be optimized with vector instructions.
3 - Whisper Deep Dive
This exercise mocks up a forensic analysis of a hypothetical voice to text application believed to be
based on Whisper.cpp. The previous examples applied Advisor to fairly random vector instruction sequences
found in a Whisper.cpp compilation without much identifying metadata. This time we will do things more
methodically, using a specific Whisper.cpp release built with specific build instructions, analyzed in Ghidra in both stripped
and unstripped binary formats. Dependent libraries libc, libm, and libstdc++ will be imported into Ghidra from the toolchain
used to construct the whisper executable. Once we have trained Advisor to help with the known-source whisper application, we might
be better able to use it in analyzing potentially malicious whisper-like applications.
This is an iterative process: we take some initial guesses about how the application-under-test (AUT) was built, rebuild our whisper reference model the same way, then adjust either the reference model or the build parameters until we see similar key patterns in our AUT and reference models.
The initial guesses are:
- Similar to Whisper.cpp release 1.7.1
- Built for RISCV64 cores with the rva22 profile plus vector extensions, something like the SiFive P670 cores within a SG2380 processor.
- Built with gcc 15.0.0 with march=rv64gcv, fast-math, and O3 options for a linux-like OS.
- dynamically linked with libc, libm, libstdc++ as of mid 2024.
It’s worthwhile establishing key structures used by Whisper - and likely by any malicious code forked from Whisper.
Inspecting the reference source code suggests these structures:
- whisper_context
- whisper_state - created by `whisper_init_state(struct whisper_context * ctx)`
- whisper_context_params
3.1 - Building the Reference Model
We start by importing Whisper.cpp into our Bazel build environment. We may eventually want a full fork of the
code, but not until we are sure of the base release and what directions that fork will need to take.
This import starts with a simple addition to our MODULE.bazel workspace file:
# MODULE.bazel
http_archive = use_repo_rule("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")
# whisper.cpp is an open source voice-to-text inference app built on OpenAI's Whisper model.
# It is a useful exemplar of autovectorization of ML code with some examples of hand-coded
# riscv intrinsics.
http_archive(
name = "whisper_cpp",
urls = ["https://github.com/ggerganov/whisper.cpp/archive/refs/tags/v1.7.1.tar.gz"],
strip_prefix = "whisper.cpp-1.7.1/",
build_file = "//:whisper-cpp.BUILD",
sha256 = "97f19a32212f2f215e538ee37a16ff547aaebc54817bd8072034e02466ce6d55"
)
Next we add whisper-cpp.BUILD to show how to build libraries and binaries. The instructions
for the whisper library include these stanzas:
cc_library(
name = "whisper",
srcs = [
"ggml/src/ggml.c",
"ggml/src/ggml-aarch64.c",
"ggml/src/ggml-alloc.c",
"ggml/src/ggml-backend.cpp",
"ggml/src/ggml-backend-impl.h",
"ggml/src/ggml-impl.h",
"ggml/src/ggml-quants.c",
"src/whisper.cpp",
],
copts = [
"-I%s/include" % EXTERNAL_PATH,
"-I%s/ggml/include" % EXTERNAL_PATH,
"-I%s/ggml/src" % EXTERNAL_PATH,
"-pthread",
"-O3",
"-ffast-math",
],
...
defines = [
"NDEBUG",
"_XOPEN_SOURCE=600",
"_GNU_SOURCE",
"__FINITE_MATH_ONLY__=0",
"__riscv_v_intrinsic=0",
],
...
)
cc_binary(
name = "main",
srcs = [
"examples/common.cpp",
"examples/common.h",
"examples/common-ggml.cpp",
"examples/common-ggml.h",
"examples/dr_wav.h",
"examples/grammar-parser.cpp",
"examples/grammar-parser.h",
"examples/main/main.cpp",
],
...
deps = [
"whisper",
],
)
Now we can build the reference app using our existing RISCV-64 toolchain:
$ bazel build --platforms=//platforms:riscv_gcc --copt='-march=rv64gcv' @whisper_cpp//:main
...
$ file bazel-bin/external/+_repo_rules+whisper_cpp/main
bazel-bin/external/+_repo_rules+whisper_cpp/main: ELF 64-bit LSB executable, UCB RISC-V, RVC, double-float ABI, version 1 (GNU/Linux), dynamically linked, interpreter /lib/ld-linux-riscv64-lp64d.so.1, for GNU/Linux 4.15.0, not stripped
$ readelf -A bazel-bin/external/+_repo_rules+whisper_cpp/main
Attribute Section: riscv
File Attributes
Tag_RISCV_stack_align: 16-bytes
Tag_RISCV_arch: "rv64i2p1_m2p0_a2p1_f2p2_d2p2_c2p0_v1p0_zicsr2p0_zifencei2p0_zmmul1p0_zaamo1p0_zalrsc1p0_zca1p0_zcd1p0_zve32f1p0_zve32x1p0_zve64d1p0_zve64f1p0_zve64x1p0_zvl128b1p0_zvl32b1p0_zvl64b1p0"
Tag_RISCV_priv_spec: 1
Tag_RISCV_priv_spec_minor: 11
The final step is to locate the toolchain libraries used in this build, so that we can load them into Ghidra. They are usually cached in a per-user location. We’ll search for the RISCV libstdc++ toolchain library:
$ bazel info
...
output_base: /run/user/1000/bazel
output_path: /run/user/1000/bazel/execroot/_main/bazel-out
package_path: %workspace%
release: release 7.4.0
...
$ find /run/user/1000 -name libstdc++\*
...
/run/user/1000/bazel/external/gcc_riscv_suite+/riscv64-unknown-linux-gnu/lib/libstdc++.so.6.0.33
...
/run/user/1000/bazel/external/gcc_riscv_suite+/riscv64-unknown-linux-gnu/lib/libstdc++.so.6
/run/user/1000/bazel/external/gcc_riscv_suite+/riscv64-unknown-linux-gnu/lib/libstdc++.so
We will want to load libstdc++.so.6 into Ghidra before we load the reference app.
3.2 - Loading the Reference Model into Ghidra
Create a new Ghidra project and load the whisper dependencies and then whisper itself, in both stripped and unstripped forms.
/run/user/1000/bazel/external/gcc_riscv_suite+/lib/libc.so.6
/run/user/1000/bazel/external/gcc_riscv_suite+/lib/libm.so.6
/run/user/1000/bazel/external/gcc_riscv_suite+/riscv64-unknown-linux-gnu/lib/libstdc++.so.6
/run/user/1000/bazel/execroot/_main/bazel-out/k8-fastbuild/bin/external/_main~_repo_rules~whisper_cpp/main
We can check the machine architecture for which these libraries were built with readelf -A:
$ readelf -A /run/user/1000/bazel/external/gcc_riscv_suite+/lib/libc.so.6
Attribute Section: riscv
File Attributes
Tag_RISCV_stack_align: 16-bytes
Tag_RISCV_arch: "rv64i2p1_m2p0_a2p1_f2p2_d2p2_c2p0_zicsr2p0_zifencei2p0_zmmul1p0_zaamo1p0_zalrsc1p0"
Tag_RISCV_unaligned_access: Unaligned access
Tag_RISCV_priv_spec: 1
Tag_RISCV_priv_spec_minor: 11
These libraries were most likely built with the non-vector rv64gc machine architecture.
The stripped version of main is generated by copying the non-stripped binary to /tmp and then stripping it with the riscv toolchain strip before importing it into Ghidra:
/run/user/1000/bazel/external/gcc_riscv_suite+/bin/riscv64-unknown-linux-gnu-strip /tmp/main
Now we can use the non-stripped version to orient ourselves and find reference points, then visit the stripped version to use those reference points and start to recover symbol names and structures.
3.3 - Ghidra Examination
We want to know how we can use Advisor to untangle vectorized instruction sequences. We’ve seen that
Advisor can help with simple loops and builtin functions like memcpy. Now we want to tackle vectorized
‘shuffle’ code, where GCC turns a sequence of simple assignments or initializations into a much more obscure
sequence of vector instructions.
We’ll assume the Ghidra user wishes to search for malicious behavior adjacent to the output_txt function,
called by main after the voice-to-text inference engine has crunched the numbers.
The first step is to locate the main routine in our stripped binary. There is no main symbol left after
stripping, so we need to find a path from the entry point to main in the unstripped binary first. The entry point is
the symbol _start or entry.
In the unstripped binary:
void _start(void)
{
...
gp = &__global_pointer$;
uVar1 = _PREINIT_0();
__libc_start_main(0x2649e,in_stack_00000000,&stack0x00000008,0,0,uVar1,&stack0x00000000);
ebreak();
main();
return;
}
While the stripped binary is:
void entry(void)
{
undefined8 uVar1;
undefined8 in_stack_00000000;
gp = &__global_pointer$;
uVar1 = _PREINIT_0();
__libc_start_main(0x2649e,in_stack_00000000,&stack0x00000008,0,0,uVar1,&stack0x00000000);
ebreak();
FUN_0001e758();
return;
}
So we know that main is FUN_0001e758.
In the source code, main calls output_txt. There is no output_txt symbol in either stripped or non-stripped binaries, so
this function has apparently been converted into inline code deep within main.
There are several different paths forward for examination. Sometimes the best approach is to explore a short distance along each likely path,
backtracking or switching paths when we get stuck. For this exercise we want to know how to make Advisor useful in at least some of those
cases where we get stuck.
Paths available:
- Search for C strings as literals, then find where they are used. This will often give printf or logging utility functions.
- Search for C++ string constructors given literals as input. Start to identify standard library string objects in the code.
- Look for initialization or print functions recognizable either from symbol names or printf formatting strings.
- Start to identify recurring structures passed by pointers. This can include context, state, and parameter structures.
Let’s start with a routine that touches several of those paths. This is the basic decompiled output from the stripped binary,
for a function that gets called a lot by our identified main routine. Peeking into the unstripped binary, we see that its signature
is void __thiscall std::string::string<>(string *this,char *cStr,allocator *param_2) - a C++ basic_string constructor given a literal
C string as input. Note that there is actually no allocator passed into the function.
void FUN_000542e0(undefined8 *param_1,undefined *param_2)
{
undefined *puVar1;
long lVar2;
undefined *puVar3;
long lVar4;
undefined8 *puVar5;
undefined auVar6 [256];
long in_vl;
gp = &__global_pointer$;
puVar5 = param_1 + 2;
*param_1 = puVar5;
if (param_2 == (undefined *)0x0) {
/* WARNING: Subroutine does not return */
std::__throw_logic_error("basic_string: construction from null is not valid");
}
puVar1 = param_2;
lVar4 = 0;
do {
vsetvli_e8m1tama(0);
puVar1 = puVar1 + lVar4;
auVar6 = vle8ff_v(puVar1);
auVar6 = vmseq_vi(auVar6,0);
lVar2 = vfirst_m(auVar6);
lVar4 = in_vl;
} while (lVar2 < 0);
puVar1 = puVar1 + (lVar2 - (long)param_2);
puVar3 = puVar1;
if (puVar1 < (undefined *)0x10) {
if (puVar1 == (undefined *)0x1) {
*(undefined *)(param_1 + 2) = *param_2;
goto LAB_00054326;
}
if (puVar1 == (undefined *)0x0) goto LAB_00054326;
}
else {
puVar5 = (undefined8 *)operator.new((ulong)(puVar1 + 1));
param_1[2] = puVar1;
*param_1 = puVar5;
}
do {
lVar4 = vsetvli_e8m8tama(puVar3);
auVar6 = vle8_v(param_2);
puVar3 = puVar3 + -lVar4;
param_2 = param_2 + lVar4;
vse8_v(auVar6,puVar5);
puVar5 = (undefined8 *)((long)puVar5 + lVar4);
} while (puVar3 != (undefined *)0x0);
puVar5 = (undefined8 *)*param_1;
LAB_00054326:
param_1[1] = puVar1;
*(undefined *)((long)puVar5 + (long)puVar1) = 0;
return;
}
The exercise here is to recover the basic_string internal structure and identify the two vector stanzas.
The Advisor identifies the first vector stanza as a builtin_strlen and the second as a builtin_memcpy.
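The second stanza, the strip-mined `vle8.v`/`vse8.v` copy loop, can likewise be modeled in scalar terms. A sketch, assuming `vsetvli` grants min(remaining, VLMAX) bytes per pass (a simplification of the real rule):

```python
def model_memcpy(dst: bytearray, src: bytes, n: int, vlmax: int = 64) -> None:
    """Scalar model of the vsetvli / vle8.v / vse8.v copy loop.

    vlmax stands in for VLEN/SEW * LMUL under the e8m8 setting; we assume
    vsetvli returns min(remaining, vlmax) on each iteration.
    """
    d = s = 0
    remaining = n
    while remaining != 0:              # while (puVar3 != 0)
        vl = min(remaining, vlmax)     # lVar4 = vsetvli_e8m8tama(puVar3)
        dst[d:d + vl] = src[s:s + vl]  # vle8.v then vse8.v
        remaining -= vl                # puVar3 = puVar3 + -lVar4
        s += vl                        # param_2 = param_2 + lVar4
        d += vl                        # puVar5 = puVar5 + lVar4

buf = bytearray(32)
model_memcpy(buf, b"This is a sample long string\x00", 29, vlmax=16)
print(bytes(buf[:29]))
```

Note that the loop needs no tail-handling code: the final iteration simply gets a shorter vector length, which is the main structural clue that distinguishes a strip-mined vector memcpy from an unrolled scalar one.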
The std::string structure is 0x20 bytes and consists of a char*, a 64 bit string length, and a 16 byte union field. If the string
with null termination is less than 16 bytes in length, it is stored directly in the 16 byte union field. Otherwise, new memory is allocated
for the copy and a pointer is stored in the first 8 bytes of the union.
The next step is simple enough that the Advisor is unnecessary. A std::vector copy constructor involves two vector instruction stanzas.
The new vector has three 64 bit pointer fields, all of which need to be zeroed. GCC 15 does that with:
vsetivli_e64m1tama(2);
vmv_v_i(in_v1,0);
...
vse64_v(in_v1,this);
*(undefined8 *)&this->field_0x10 = 0;
That’s a little bit odd, since it is using three vector instructions to replace two scalar instructions, followed by a separate scalar store instruction.
It could alternatively have used three scalar store instructions, or three vector instructions with an m2 LMUL multiplier option. Perhaps this is an example of
incomplete or over-eager optimization, or an optimization from a RISC-V vendor who knows that vector instructions can be executed in parallel with scalar instructions.
A little later in the copy constructor a builtin_memcpy vector stanza occurs, to copy the contents of the original vector into the newly initialized vector.
This suggests:
- vector stanzas like `builtin_memcpy`, `builtin_strlen`, and vector instructions to zero 16 bytes are common and fairly easy to recognize, either by eye or with Advisor. Adding more builtin functions to the exemplar directory makes good sense.
- vector stanzas often occur in initialization sequences, where they can be difficult to untangle from associated C++ object initializations. If we want to tackle this, we also need examples of stdlibc++ initializations, especially for vectors, maps, and iostreams.
- we need more examples of less common vector stanzas, including gather and slide operations.
3.4 - Dealing with C++
GCC 15 uses RISCV vector instruction sequences in many initialization sequences - even when there is no need for a loop. If we want to understand that code, we need a decent understanding of what is getting initialized. One way to move forward is with a small program that uses some of the same libstdc++ classes, to help us understand their memory layout and especially the fields that may need initialization.
The first iteration of this is a toy program using std::string, std::vector, std::pair, std::map, and std::ofstream library code.
We generally want to know object sizes, pointers, and key internal fields. If something like ofstream is constructed on the stack,
we can probably ignore any interior objects initialized within it.
#include <string>
#include <vector>
#include <map>
#include <iostream>
#include <fstream>
#include <cstdint>
void dumpHex(const uint64_t* p, int numwords) {
std::cout << "\tRaw: ";
for (int i = 0; i < numwords; i++) {
std::cout << "0x" << std::hex << p[i];
if (i < numwords) {
std::cout << ", ";
}
}
std::cout << std::endl;
}
void showString(const std::string* s, const char* label) {
std::cout << "std::string " << label << " = " << *s << std::endl;
std::cout << "\tRaw Size = 0x" << std::hex << sizeof(*s) << std::endl;
std::cout << "\tLength = 0x" << std::hex << s->length() << std::endl;
dumpHex((const uint64_t*)s, 4);
}
void showVector(const std::vector<std::string>* v, const char* label) {
std::cout << "std::vector<std::string> " << label << ":" << std::endl;
std::cout << "\tRaw Size = 0x" << std::hex << sizeof(*v) << std::endl;
std::cout << "\tInternal Size = " << v->size() << std::endl;
dumpHex((const uint64_t*)v, 3);
}
void showPair(const std::pair<std::string,std::string>* p, const char* label) {
std::cout << "std::pair<std::string,std::string> " << label << ":" << std::endl;
std::cout << "\tRaw Size = 0x" << std::hex << sizeof(*p) << std::endl;
dumpHex((const uint64_t*)p, 8);
}
void showMap(const std::map<std::string,std::string>* token_map, const char* label) {
std::cout << "std::map<std::string,std::string> " << label << ":" << std::endl;
std::cout << "\tRaw Size = 0x" << std::hex << sizeof(*token_map) << std::endl;
std::cout << "\tInternal Size = " << token_map->size() << std::endl;
dumpHex((const uint64_t*)token_map, 6);
}
void showOfstream(const std::ofstream* fout, const char* label) {
std::cout << "std::ofstream " << label << ":" << std::endl;
std::cout << "\tRaw Size = 0x" << std::hex << sizeof(*fout) << std::endl;
dumpHex((const uint64_t*)fout, 8);
}
int main() {
std::cout << "Initializing\n";
std::ofstream fout("/tmp/scratch");
showOfstream(&fout, "initialized ofstream");
std::string xs("short string");
showString(&xs, "short string");
std::string x("This is a sample long string");
showString(&x, "long string");
std::vector<std::string> vx;
showVector(&vx, "empty vector");
vx.push_back(x);
showVector(&vx, "singleton vector");
fout << "something to fill the file" << std::endl;
std::pair<std::string,std::string> map_element("key", "value");
showPair(&map_element, "map_element");
std::map<std::string,std::string> token_map;
showMap(&token_map, "token_map, empty");
token_map.insert(map_element);
showMap(&token_map, "token_map, one insertion");
fout.close();
showOfstream(&fout, "closed ofstream");
}
Build with:
$ bazel build --platforms=//platforms:riscv_gcc --copt='-march=rv64gcv' other_src:stdlibc++_exploration
Run this under qemu emulation:
$ qemu-riscv64-static -L /opt/riscvx/sysroot -E LD_LIBRARY_PATH=/opt/riscvx/sysroot/riscv64-unknown-linux-gnu/lib/ bazel-bin/other_src/stdlibc++_exploration
Initializing
std::ofstream initialized ofstream:
Raw Size = 0x200
Raw: 0x7fbff0608d70, 0x7fbff0608bb8, 0x288a0, 0x288a0, 0x288a0, 0x0, 0x0, 0x0,
std::string short string = short string
Raw Size = 0x20
Length = 0xc
Raw: 0x7fbfe3ffe820, 0xc, 0x74732074726f6873, 0x7f00676e6972,
std::string long string = This is a sample long string
Raw Size = 0x20
Length = 0x1c
Raw: 0x2a8b0, 0x1c, 0x1c, 0x7fbfe3ffe8f0,
std::vector<std::string> empty vector:
Raw Size = 0x18
Internal Size = 0
Raw: 0x0, 0x0, 0x0,
std::vector<std::string> singleton vector:
Raw Size = 0x18
Internal Size = 1
Raw: 0x2a8e0, 0x2a900, 0x2a900,
std::pair<std::string,std::string> map_element:
Raw Size = 0x40
Raw: 0x7fbfe3ffe890, 0x3, 0x79656b, 0x7fbff0411528, 0x7fbfe3ffe8b0, 0x5, 0x65756c6176, 0x7fbff14d751e,
std::map<std::string,std::string> token_map, empty:
Raw Size = 0x30
Internal Size = 0
Raw: 0x1, 0x7fbf00000000, 0x0, 0x7fbfe3ffe858, 0x7fbfe3ffe858, 0x0,
std::map<std::string,std::string> token_map, one insertion:
Raw Size = 0x30
Internal Size = 1
Raw: 0x1, 0x7fbf00000000, 0x2a940, 0x2a940, 0x2a940, 0x1,
std::ofstream closed ofstream:
Raw Size = 0x200
Raw: 0x7fbff0608d70, 0x7fbff0608bb8, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
Now we can load bazel-bin/other_src/stdlibc++_exploration into Ghidra and get some insight into how these standard library classes are
initialized.
std::string has a length of 0x20 bytes and a structure similar to:
struct string { /* stdlib basic string */
char *cstr;
uint64_t length;
char data[16];
};
If the string is 15 bytes or less, it is stored directly in data. Otherwise new memory is allocated and data holds information needed to manage that allocation.
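Given that layout, the Raw dumps from the toy program can be decoded mechanically. A sketch assuming little-endian 64 bit words as printed by dumpHex; the base address below is hypothetical, chosen so that cstr points at the object's own data field as it would for a short string.

```python
import struct

def decode_std_string(words, base_addr):
    """Decode the four 64-bit words of a libstdc++ std::string dump.

    base_addr is where the object itself lives; if cstr points into the
    object's own 16-byte data field, short-string optimization is active.
    """
    cstr, length = words[0], words[1]
    sso = base_addr + 0x10 <= cstr < base_addr + 0x20
    text = None
    if sso:
        raw = struct.pack("<QQ", words[2], words[3])  # data[16], little-endian
        text = raw[:length].decode("ascii", errors="replace")
    return {"length": length, "sso": sso, "text": text}

# Words from the 'short string' dump; the object base 0x7fbfe3ffe810 is a
# hypothetical address consistent with cstr == base + 0x10.
info = decode_std_string(
    [0x7FBFE3FFE820, 0xC, 0x74732074726F6873, 0x7F00676E6972],
    0x7FBFE3FFE810,
)
print(info["sso"], info["text"])  # → True short string
```

The long-string dump fails the SSO test (its cstr of 0x2a8b0 points into the heap), which matches the structure description above.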
std::vector<T> has a length of 0x18 bytes, regardless of the component type T.
struct vector {
T *start; // points to the first element in the vector
T *end; // points just past the last element in the vector
T *alloc_end; // points just past the last empty element allocated for the vector.
};
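The size() and capacity() reported by std::vector follow directly from those three pointers. For the singleton vector dump above, where each element is a 0x20-byte std::string:

```python
def vector_counts(start, end, alloc_end, elem_size):
    """Derive size() and capacity() from the three std::vector pointers."""
    return (end - start) // elem_size, (alloc_end - start) // elem_size

# Pointers from the 'singleton vector' Raw dump; std::string is 0x20 bytes.
size, capacity = vector_counts(0x2A8E0, 0x2A900, 0x2A900, 0x20)
print(size, capacity)  # → 1 1
```

This is worth internalizing, because Ghidra's decompiler shows only raw pointer arithmetic: an expression like `(end - start) >> 5` in decompiled output is a strong hint of a `std::vector<std::string>::size()` call.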
std::pair<T1,T2> is simply a concatenation of elements of type T1 and T2, so a pair of strings is 0x40 bytes and a pair of string pointers is 0x10 bytes.
std::ofstream is a large object of 0x200 bytes. We’ll ignore its internal structure for now - and especially any whisper.cpp code initializing elements of that structure.
Ghidra examination of stdlibc++_exploration provides some insight and more than a few red herrings.
stdlibc++_exploration
std::vector
The three 64 bit pointers that make up a std::vector<std::string> are initialized inline to zero with a stanza of five vector instructions - even though that does not look optimal.
vsetivli_e8mf2tama(8);
vmv_v_i(in_v1,0);
vse8_v(in_v1,&vx.end);
vse8_v(in_v1,&vx.alloc_end);
vse8_v(in_v1,&vx);
showVector(&vx,"empty vector");
vx.push_back(x) makes a copy of the string x and allocates space for both the vector element and the string copy with operator.new.
std::vectors are most easily recognized by their destructors:
std::vector<>::~vector((vector<> *)&vx);
std::ofstream
File structures like std::ofstream are easy to recognize through their constructors and destructors. They can be confusing though, since they are likely to be reused if multiple files are opened and closed in the same function.