Application Survey

Survey a voice-to-text app for common vector instruction patterns

Take an exemplar RISCV-64 binary like whisper.cpp, with its many vector instructions. Which vector patterns are easy to recognize, either for a human Ghidra user or for a hypothetical Ghidra plugin?

Some of the most common patterns correspond to memcpy or memset invocations where the number of bytes is known at compile time as is the alignment of operands.

ML apps like whisper.cpp often work with parameters of less than 8 bits, so there can be a lot of demarshalling, unpacking, and repacking operations. That means lots of vector bit manipulation and width conversion operations.

ML apps also do a lot of vector, matrix, and tensor arithmetic, so we can expect to find vectorized arithmetic operations mixed in with vector parameter conversion operations.

Note: This page is likely to change rapidly as we get a better handle on the problem and develop better analytic tools to guide the process.

Survey for vector instruction blocks

Most vector instructions come in groups started with a vsetvli or vsetivli instruction to set up the vector context. If the number of vector elements is known at compile time and less than 32, then the vsetivli instruction is often used. Otherwise the vsetvli instruction is used.

Scanning for these instructions showed 673 vsetvli and 888 vsetivli instructions within whisper.cpp.

The most common vsetvli instruction (343 out of 673) is type 0xc3 or e8,m8,ta,ma. That expands to:

  • element width = 8 bits - no alignment checks are needed, 16 elements per vector register if VLEN=128
  • multiplier = 8 - up to 8 vector registers are processed in parallel
  • tail agnostic - we don’t care about preserving unassigned vector register bits
  • mask agnostic - we don’t care about preserving unmasked vector register bits

The most common vsetivli instruction (565 out of 888) is type 0xd8 or e64,m1,ta,ma. That expands to:

  • element width = 64 bits - all memory operations should be 64 bit aligned, 2 elements per vector register if VLEN=128
  • multiplier = 1 - only the named vector register is used
  • tail agnostic - we don’t care about preserving unassigned vector register bits
  • mask agnostic - we don’t care about preserving unmasked vector register bits

A similar common vsetivli instruction (102 out of 888) is type 0xdb or e64,m8,ta,ma. That expands to:

  • element width = 64 bits - all memory operations should be 64 bit aligned, 2 elements per vector register if VLEN=128
  • multiplier = 8 - up to 8 vector registers are processed in parallel, or 16 64 bit elements if VLEN=128
  • tail agnostic - we don’t care about preserving unassigned vector register bits
  • mask agnostic - we don’t care about preserving unmasked vector register bits

The second most common vsetivli instruction (107 out of 888) is type 0xc7 or e8,mf2,ta,ma. That expands to:

  • element width = 8 bits
  • multiplier = 1/2 - vector registers are only half used, perhaps to allow element widening to 16 bits
  • tail agnostic - we don’t care about preserving unassigned vector register bits
  • mask agnostic - we don’t care about preserving unmasked vector register bits

How many of these vector blocks can be treated as simple memcpy or memset invocations?

For example, this Ghidra listing snippet looks like a good candidate for memcpy:

00090bdc 57 f0 b7 cd     vsetivli                       zero,0xf,e64,m8,ta,ma
00090be0 07 74 07 02     vle64.v                        v8,(a4)
00090be4 27 f4 07 02     vse64.v                        v8,(a5)

A pcode equivalent might be __builtin_memcpy(dest=(a5), src=(a4), 8 * 15) with a possible context note that vector registers v8 through v16 are changed.

A longer example might be a good candidate for memset:

00090b84 57 70 81 cd     vsetivli                       zero,0x2,e64,m1,ta,ma
00090b88 93 07 07 01     addi                           a5,a4,0x10
00090b8c d7 30 00 5e     vmv.v.i                        v1,0x0
00090b90 a7 70 07 02     vse64.v                        v1,(a4)
00090b94 a7 f0 07 02     vse64.v                        v1,(a5)
00090b98 93 07 07 02     addi                           a5,a4,0x20
00090b9c a7 f0 07 02     vse64.v                        v1,(a5)
00090ba0 93 07 07 03     addi                           a5,a4,0x30
00090ba4 a7 f0 07 02     vse64.v                        v1,(a5)
00090ba8 93 07 07 04     addi                           a5,a4,0x40
00090bac a7 f0 07 02     vse64.v                        v1,(a5)
00090bb0 93 07 07 05     addi                           a5,a4,0x50
00090bb4 a7 f0 07 02     vse64.v                        v1,(a5)
00090bb8 93 07 07 06     addi                           a5,a4,0x60
00090bbc a7 f0 07 02     vse64.v                        v1,(a5)
00090bc0 fd 1b           c.addi                         s7,-0x1
00090bc2 23 38 07 06     sd                             zero,0x70(a4)

This example is based on a minimum VLEN of 128 bits, so the vector registers can hold 2 64 bit elements. The vmv.v.i instruction sets those two elements of v1 to zero. Seven vse64.v instructions then store two 64 bit zeros each to successive memory locations, with a trailing scalar double word store to handle the tail.

A pcode equivalent for this sequence might be __builtin_memset(dest=(a4), 0, 0x78).

top down scan of vector blocks

The python script objdump_analytic.py provides a crude scan of a RISCV-64 binary, reporting on likely vector instruction blocks. It doesn’t handle blocks with more than one vsetvli or vsetivli instruction, something common in vector narrowing or widening operations. If we apply this script to whisper_cpp_vector we can collect a crude field guide to vector expansions.

VLEN in the following is the hart’s vector length, determined at execution time. It is usually something like 128 bits for a general purpose core (aka hart) and up to 1024 bits for a dedicated accelerator hart.

memcpy with known and limited nbytes

This pattern is often found when copying objects of known and limited size. It is useful with objects as small as 4 bytes if the source alignment is unknown and the destination object must be aligned on half-word, word, or double-word boundaries.

;                memcpy(dest=a0, src=a3, nbytes=a4) where a4 < 8 * (VLEN/8)
1d3da:  0c377057                vsetvli zero,a4,e8,m8,ta,ma
1d3de:  02068407                vle8.v  v8,(a3)
1d3e2:  02050427                vse8.v  v8,(a0)

memcpy with unknown nbytes

This pattern is usually found in a simple loop, moving 8 * (VLEN/8) bytes at a time. The a5 register holds the number of bytes processed per iteration.

;                memcpy(dest=a6, src=a7, nbytes=a0) 
1d868:  0c3577d7                vsetvli a5,a0,e8,m8,ta,ma
1d86c:  02088407                vle8.v  v8,(a7)
1d872:  02080427                vse8.v  v8,(a6)

widening floating point reduction

The next example appears to be compiled from estimate_diarization_speaker whose source is:

double energy0 = 0.0f;
double energy1 = 0.0f;

for (int64_t j = is0; j < is1; j++) {
    energy0 += fabs(pcmf32s[0][j]);
    energy1 += fabs(pcmf32s[1][j]);
}

This is a typical reduction with widening pattern.

The vector instructions generated are:

242ce:  0d8077d7                vsetvli a5,zero,e64,m1,ta,ma
242d2:  5e0031d7                vmv.v.i v3,0
242d6:  9e303257                vmv1r.v v4,v3
242da:  0976f7d7                vsetvli a5,a3,e32,mf2,tu,ma
242e4:  0205e107                vle32.v v2,(a1)
242e8:  02066087                vle32.v v1,(a2)
242ec:  2a211157                vfabs.v v2,v2
242f0:  2a1090d7                vfabs.v v1,v1
242f8:  d2411257                vfwadd.wv       v4,v4,v2
242fc:  d23091d7                vfwadd.wv       v3,v3,v1
24312:  0d8077d7                vsetvli a5,zero,e64,m1,ta,ma
24316:  4207d0d7                vfmv.s.f        v1,fa5
2431a:  063091d7                vfredusum.vs    v3,v3,v1
2431e:  42301757                vfmv.f.s        fa4,v3
24326:  06409257                vfredusum.vs    v4,v4,v1
2432a:  424017d7                vfmv.f.s        fa5,v4

A hypothetical vectorized Ghidra might decompile these instructions (ignoring the scalar instructions not displayed here) as:

double vector v3, v4;  // SEW=64 bit
v3 := vector 0;  // load immediate
v4 := v3;        // vector copy
float vector v1, v2;  // SEW=32 bit
while(...) {
    v2 = vector *a1;
    v1 = vector *a2;
    v2 = abs(v2);
    v1 = abs(v1);
    v4 = v4 + v2;  // widening 32 to 64 bits
    v3 = v3 + v1;  // widening 32 to 64 bits
}
double vector v1, v3, v4;
v1[0] = fa5;   // fa5 is the scalar 'carry-in' 
v3[0] = v1[0] +  v3; // unordered vector reduction
fa4 = v3[0];
v4[0] = v1[0] +  v4;
fa5 = v4[0];

The vector instruction vfredusum.vs provides the unordered reduction sum over the elements of a single vector. That’s likely faster than an ordered sum, but the floating point round-off errors will not be deterministic.

Note: this whisper.cpp routine attempts to recognize which of two speakers is responsible for each word of a conversation. A speaker-misattribution exploit might attack functions that call this.

complex structure element copy

The source code includes:

static drwav_uint64 drwav_read_pcm_frames_s16__msadpcm(drwav* pWav, drwav_uint64 framesToRead, drwav_int16* pBufferOut) {
    ...
    pWav->msadpcm.bytesRemainingInBlock = pWav->fmt.blockAlign - sizeof(header);

    pWav->msadpcm.predictor[0] = header[0];
    pWav->msadpcm.predictor[1] = header[1];
    pWav->msadpcm.delta[0] = drwav__bytes_to_s16(header + 2);
    pWav->msadpcm.delta[1] = drwav__bytes_to_s16(header + 4);
    pWav->msadpcm.prevFrames[0][1] = (drwav_int32)drwav__bytes_to_s16(header + 6);
    pWav->msadpcm.prevFrames[1][1] = (drwav_int32)drwav__bytes_to_s16(header + 8);
    pWav->msadpcm.prevFrames[0][0] = (drwav_int32)drwav__bytes_to_s16(header + 10);
    pWav->msadpcm.prevFrames[1][0] = (drwav_int32)drwav__bytes_to_s16(header + 12);

    pWav->msadpcm.cachedFrames[0] = pWav->msadpcm.prevFrames[0][0];
    pWav->msadpcm.cachedFrames[1] = pWav->msadpcm.prevFrames[1][0];
    pWav->msadpcm.cachedFrames[2] = pWav->msadpcm.prevFrames[0][1];
    pWav->msadpcm.cachedFrames[3] = pWav->msadpcm.prevFrames[1][1];
    pWav->msadpcm.cachedFrameCount = 2;
...
}

This gets vectorized into sequences containing:

2c6ce:  ccf27057                vsetivli        zero,4,e16,mf2,ta,ma ; vl=4, SEW=16
2c6d2:  5e06c0d7                vmv.v.x v1,a3              ; v1[0..3] = a3
2c6d6:  3e1860d7                vslide1down.vx  v1,v1,a6   ; v1 = v1[1:3], a6
2c6da:  3e1760d7                vslide1down.vx  v1,v1,a4   ; v1 = v1[1:3], a4
2c6de:  3e1560d7                vslide1down.vx  v1,v1,a0   ; v1 = (a3,a6,a4,a0)

2c6e2:  0d007057                vsetvli zero,zero,e32,m1,ta,ma ; keep existing vl (=4), SEW=32
2c6e6:  4a13a157                vsext.vf2       v2,v1      ; v2 = vector sext(v1) // widening sign extend
2c6ea:  0207e127                vse32.v v2,(a5)            ; memcpy(a5, v2, 4 * 4)
2c6f2:  0a07d087                vlse16.v        v1,(a5),zero ; v1 = a5[]

2c6fa:  0cf07057                vsetvli zero,zero,e16,mf2,ta,ma
2c702:  3e1660d7                vslide1down.vx  v1,v1,a2   ; v1 = v1[1:3], a2
2c70a:  3e16e0d7                vslide1down.vx  v1,v1,a3   ; v1 = v1[1:3], a3
2c70e:  3e1760d7                vslide1down.vx  v1,v1,a4   ; v1 = v1[1:3], a4

2c712:  0d007057                vsetvli zero,zero,e32,m1,ta,ma
2c716:  4a13a157                vsext.vf2       v2,v1
2c71a:  0205e127                vse32.v v2,(a1)

That’s the kind of messy code you could analyze if you had to. Hopefully not.

Last modified April 4, 2024: Spell check pass (08034c6)