Instruction Patterns

Common instruction patterns one might see with vectorized code generation

This page collects architecture-dependent gcc-14 expansions, where simple C sequences are translated into optimized code.

Our baseline is a gcc-14 compiler with -O2 optimization and a base machine architecture of -march=rv64gc. That’s a basic 64 bit RISCV processor (or a single hart of that processor) with support for compressed instructions.

Variant machine architectures considered here are:

march              description
rv64gc             baseline
rv64gcv            baseline + vector extension (dynamic vector length)
rv64gcv_zvl128b    baseline + vector (minimum 128 bit vectors)
rv64gcv_zvl512b    baseline + vector (minimum 512 bit vectors)
rv64gcv_zvl1024b   baseline + vector (minimum 1024 bit vectors)
rv64gc_xtheadbb    baseline + THead bit manipulation extension (no vector)

Memory copy operations

Note: memory copy operations require non-overlapping source and destination. Memory move operations allow overlap but are much more complicated and are not currently optimized.

Optimizing compilers are good at turning simple memory copy operations into confusing - but fast - instruction sequences. GCC can recognize memory copy operations as explicit calls to memcpy or as structure assignments like *a = *b.

The current reference C file is:

extern void *memcpy(void *__restrict dest, const void *__restrict src, __SIZE_TYPE__ n);
extern void *memmov(void *dest, const void *src, __SIZE_TYPE__ n);

/* invoke memcpy with dynamic size */
void cpymem_1 (void *a, void *b, __SIZE_TYPE__ l)
{
  memcpy (a, b, l);
}

/* invoke memcpy with known size and aligned pointers */
extern struct { __INT32_TYPE__ a[16]; } a_a, a_b;

void cpymem_2 ()
{
  memcpy (&a_a, &a_b, sizeof a_a);
}

typedef struct { char c[16]; } c16;
typedef struct { char c[32]; } c32;
typedef struct { short s; char c[30]; } s16;

/* copy fixed 128 bits of memory */
void cpymem_3 (c16 *a, c16* b)
{
  *a = *b;
}

/* copy fixed 256 bits of memory */
void cpymem_4 (c32 *a, c32* b)
{
  *a = *b;
}

/* copy fixed 256 bits of 16 bit aligned memory */
void cpymem_5 (s16 *a, s16* b)
{
  *a = *b;
}

/* memmov allows overlap - don't vectorize or inline */
void movmem_1(void *a, void *b, __SIZE_TYPE__ l)
{
  memmov (a, b, l);
}

Baseline (no vector)

Ghidra 11 with the isa_ext branch decompiler gives us something simple after fixing the signature of the memcpy thunk.

void cpymem_1(void *param_1,void *param_2,size_t param_3)
{
  memcpy(param_1,param_2,param_3);
  return;
}
void cpymem_2(void)
{
  memcpy(&a_a,&a_b,0x40);
  return;
}
void cpymem_3(void *param_1,void *param_2)
{
  memcpy(param_1,param_2,0x10);
  return;
}
void cpymem_4(void *param_1,void *param_2)
{
  memcpy(param_1,param_2,0x20);
  return;
}
void cpymem_5(void *param_1,void *param_2)
{
  memcpy(param_1,param_2,0x20);
  return;
}

rv64gcv - vector extensions

If the compiler knows the target hart supports the vector extension, but is not explicitly told the size of each vector register, it optimizes all of these calls. Ghidra 11 gives us the following, with binutils’ objdump instruction listings added as comments:

long cpymem_1(long param_1,long param_2,long param_3)
{
  long lVar1;
  undefined auVar2 [256];
  do {
    lVar1 = vsetvli_e8m8tama(param_3);  // vsetvli a5,a2,e8,m8,ta,ma
    auVar2 = vle8_v(param_2);           // vle8.v  v8,(a1)
    param_3 = param_3 - lVar1;          // sub     a2,a2,a5
    vse8_v(auVar2,param_1);             // vse8.v  v8,(a0)
    param_2 = param_2 + lVar1;          // add     a1,a1,a5
    param_1 = param_1 + lVar1;          // add     a0,a0,a5
  } while (param_3 != 0);               // bnez    a2,8a8 <cpymem_1>
  return param_1;
}
void cpymem_2(void)
{
                                        // ld      a4,1922(a4) # 2040 <a_b@Base>
                                        // ld      a5,1938(a5) # 2058 <a_a@Base>
  undefined auVar1 [256];
  vsetivli(0x10,0xd3);                  // vsetivli        zero,16,e32,m8,ta,ma
  auVar1 = vle32_v(&a_b);               // vle32.v v8,(a4)
  vse32_v(auVar1,&a_a);                 // vse32.v v8,(a5)
  return;
}
void cpymem_3(undefined8 param_1,undefined8 param_2)
{
  undefined auVar1 [256];
  vsetivli(0x10,0xc0);                   // vsetivli        zero,16,e8,m1,ta,ma
  auVar1 = vle8_v(param_2);              // vle8.v  v1,(a1)
  vse8_v(auVar1,param_1);                // vse8.v  v1,(a0)
  return;
}
void cpymem_4(undefined8 param_1,undefined8 param_2)
{
  undefined auVar1 [256];                // li      a5,32
  vsetvli_e8m8tama(0x20);                // vsetvli        zero,a5,e8,m8,ta,ma
  auVar1 = vle8_v(param_2);              // vle8.v  v8,(a1)
  vse8_v(auVar1,param_1);                // vse8.v  v8,(a0)
  return;
}
void cpymem_5(undefined8 param_1,undefined8 param_2)
{
  undefined auVar1 [256];
  vsetivli(0x10,0xcb);                   // vsetivli        zero,16,e16,m8,ta,ma
  auVar1 = vle16_v(param_2);             // vle16.v v8,(a1)
  vse16_v(auVar1,param_1);               // vse16.v v8,(a0)
  return;
}

The variation in the vset* instructions is a bit puzzling. It may be due to alignment: copying a short int to a misaligned odd address can generate an exception at the scalar store instruction, so perhaps the vectorized copy is expected to raise the same exception.

1 - Application Survey

Survey a voice-to-text app for common vector instruction patterns

Take an exemplar RISCV-64 binary like whisper.cpp, with its many vector instructions. Which vector patterns are easy to recognize, either for a human Ghidra user or for a hypothetical Ghidra plugin?

Some of the most common patterns correspond to memcpy or memset invocations where the number of bytes is known at compile time as is the alignment of operands.

ML apps like whisper.cpp often work with parameters of less than 8 bits, so there can be a lot of demarshalling, unpacking, and repacking operations. That means lots of vector bit manipulation and width conversion operations.

ML apps also do a lot of vector, matrix, and tensor arithmetic, so we can expect to find vectorized arithmetic operations mixed in with vector parameter conversion operations.

Note: This page is likely to change rapidly as we get a better handle on the problem and develop better analytic tools to guide the process.

Survey for vector instruction blocks

Most vector instructions come in groups started with a vsetvli or vsetivli instruction to set up the vector context. If the number of vector elements is known at compile time and less than 32, then the vsetivli instruction is often used. Otherwise the vsetvli instruction is used.

Scanning for these instructions showed 673 vsetvli and 888 vsetivli instructions within whisper.cpp.

The most common vsetvli instruction (343 out of 673) is type 0xc3 or e8,m8,ta,ma. That expands to:

  • element width = 8 bits - no alignment checks are needed, 16 elements per vector register if VLEN=128
  • multiplier = 8 - up to 8 vector registers are processed in parallel
  • tail agnostic - we don’t care about preserving unassigned vector register bits
  • mask agnostic - we don’t care about preserving unmasked vector register bits

The most common vsetivli instruction (565 out of 888) is type 0xd8 or e64,m1,ta,ma. That expands to:

  • element width = 64 bits - all memory operations should be 64 bit aligned, 2 elements per vector register if VLEN=128
  • multiplier = 1 - only the named vector register is used
  • tail agnostic - we don’t care about preserving unassigned vector register bits
  • mask agnostic - we don’t care about preserving unmasked vector register bits

A similar but less frequent vsetivli instruction (102 out of 888) is type 0xdb or e64,m8,ta,ma. That expands to:

  • element width = 64 bits - all memory operations should be 64 bit aligned, 2 elements per vector register if VLEN=128
  • multiplier = 8 - up to 8 vector registers are processed in parallel, or 16 64 bit elements if VLEN=128
  • tail agnostic - we don’t care about preserving unassigned vector register bits
  • mask agnostic - we don’t care about preserving unmasked vector register bits

The second most common vsetivli instruction (107 out of 888) is type 0xc7 or e8,mf2,ta,ma. That expands to:

  • element width = 8 bits
  • multiplier = 1/2 - vector registers are only half used, perhaps to allow element widening to 16 bits
  • tail agnostic - we don’t care about preserving unassigned vector register bits
  • mask agnostic - we don’t care about preserving unmasked vector register bits
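
These expansions follow mechanically from the vtype bit fields: vlmul in bits 2:0, vsew in bits 5:3, vta in bit 6, and vma in bit 7. A minimal C sketch (not part of any existing tooling) that decodes the immediates quoted above:

#include <stdio.h>

/* decode the low 8 bits of an rvv 1.0 vtype value into e*,m*,t*,m* notation */
static void decode_vtype (unsigned vtype)
{
  static const char *lmul[8] = {"m1", "m2", "m4", "m8", "?", "mf8", "mf4", "mf2"};
  unsigned sew = 8u << ((vtype >> 3) & 7);     /* e8, e16, e32, e64 */
  printf ("0x%02x = e%u,%s,%s,%s\n", vtype, sew, lmul[vtype & 7],
          (vtype & 0x40) ? "ta" : "tu",        /* tail agnostic/undisturbed */
          (vtype & 0x80) ? "ma" : "mu");       /* mask agnostic/undisturbed */
}

int main (void)
{
  decode_vtype (0xc3);   /* e8,m8,ta,ma   - most common vsetvli type  */
  decode_vtype (0xd8);   /* e64,m1,ta,ma  - most common vsetivli type */
  decode_vtype (0xdb);   /* e64,m8,ta,ma */
  decode_vtype (0xc7);   /* e8,mf2,ta,ma */
  return 0;
}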

How many of these vector blocks can be treated as simple memcpy or memset invocations?

For example, this Ghidra listing snippet looks like a good candidate for memcpy:

00090bdc 57 f0 b7 cd     vsetivli                       zero,0xf,e64,m8,ta,ma
00090be0 07 74 07 02     vle64.v                        v8,(a4)
00090be4 27 f4 07 02     vse64.v                        v8,(a5)

A pcode equivalent might be __builtin_memcpy(dest=(a5), src=(a4), 8 * 15) with a possible context note that vector registers v8 through v15 are changed.
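
A plausible source construct behind this block - a guess, not traced back to a specific whisper.cpp line - is a fixed-size copy of fifteen 64 bit elements:

#include <string.h>

/* hypothetical example: copy 15 * 8 = 0x78 bytes between 64 bit aligned objects.
   gcc-14 at -O2 with -march=rv64gcv may expand this into a single
   vsetivli / vle64.v / vse64.v group when the elements fit in one register group. */
void copy_block (long dst[15], const long src[15])
{
  memcpy (dst, src, 15 * sizeof (long));
}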

A longer example might be a good candidate for memset:

00090b84 57 70 81 cd     vsetivli                       zero,0x2,e64,m1,ta,ma
00090b88 93 07 07 01     addi                           a5,a4,0x10
00090b8c d7 30 00 5e     vmv.v.i                        v1,0x0
00090b90 a7 70 07 02     vse64.v                        v1,(a4)
00090b94 a7 f0 07 02     vse64.v                        v1,(a5)
00090b98 93 07 07 02     addi                           a5,a4,0x20
00090b9c a7 f0 07 02     vse64.v                        v1,(a5)
00090ba0 93 07 07 03     addi                           a5,a4,0x30
00090ba4 a7 f0 07 02     vse64.v                        v1,(a5)
00090ba8 93 07 07 04     addi                           a5,a4,0x40
00090bac a7 f0 07 02     vse64.v                        v1,(a5)
00090bb0 93 07 07 05     addi                           a5,a4,0x50
00090bb4 a7 f0 07 02     vse64.v                        v1,(a5)
00090bb8 93 07 07 06     addi                           a5,a4,0x60
00090bbc a7 f0 07 02     vse64.v                        v1,(a5)
00090bc0 fd 1b           c.addi                         s7,-0x1
00090bc2 23 38 07 06     sd                             zero,0x70(a4)

This example is based on a minimum VLEN of 128 bits, so each vector register can hold two 64 bit elements. The vmv.v.i instruction sets both elements of v1 to zero. Seven vse64.v instructions then store two 64 bit zeros each to successive memory locations, with a trailing scalar double word store to handle the tail.

A pcode equivalent for this sequence might be __builtin_memset(dest=(a4), 0, 0x78).
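
A plausible source construct for this expansion - again a guess rather than a line traced back to whisper.cpp - is zeroing a 120 byte object whose size is known at compile time:

#include <string.h>

/* hypothetical layout: 15 * 8 = 0x78 bytes */
struct block { long v[15]; };

void clear_block (struct block *p)
{
  /* gcc-14 at -O2 with -march=rv64gcv may expand this into vmv.v.i / vse64.v
     stores plus a scalar store for the 8 byte tail, much like the listing above */
  memset (p, 0, sizeof *p);
}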

top down scan of vector blocks

The Python script objdump_analytic.py provides a rough scan of a RISCV-64 binary, reporting likely vector instruction blocks. It doesn’t handle blocks with more than one vsetvli or vsetivli instruction, something common in vector narrowing or widening operations. Applying this script to whisper_cpp_vector gives us a crude field guide to vector expansions.

VLEN in the following is the hart’s vector register length in bits. It is fixed by the hardware implementation and only discoverable at execution time - typically something like 128 bits for a general purpose hart and up to 1024 bits for a dedicated accelerator hart.
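
When it matters, VLEN can be read on the target hart itself. A small sketch using the rvv 1.0 vlenb CSR, which holds VLEN in bytes:

/* read this hart's VLEN in bits; vlenb is a read-only CSR defined by rvv 1.0 */
static inline unsigned long vlen_bits (void)
{
  unsigned long vlenb;
  __asm__ ("csrr %0, vlenb" : "=r" (vlenb));
  return vlenb * 8;
}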

memcpy with known and limited nbytes

This pattern is often found when copying objects of known and limited size. It is useful with objects as small as 4 bytes if the source alignment is unknown and the destination object must be aligned on half-word, word, or double-word boundaries.

;                memcpy(dest=a0, src=a3, nbytes=a4) where a4 < 8 * (VLEN/8)
1d3da:  0c377057                vsetvli zero,a4,e8,m8,ta,ma
1d3de:  02068407                vle8.v  v8,(a3)
1d3e2:  02050427                vse8.v  v8,(a0)

memcpy with unknown nbytes

This pattern usually appears inside a simple loop (compare the cpymem_1 expansion earlier), moving up to 8 * (VLEN/8) bytes per iteration. The a5 register holds the number of bytes processed in each iteration.

;                memcpy(dest=a6, src=a7, nbytes=a0) 
1d868:  0c3577d7                vsetvli a5,a0,e8,m8,ta,ma
1d86c:  02088407                vle8.v  v8,(a7)
1d872:  02080427                vse8.v  v8,(a6)

widening floating point reduction

The next example appears to be compiled from estimate_diarization_speaker whose source is:

double energy0 = 0.0f;
double energy1 = 0.0f;

for (int64_t j = is0; j < is1; j++) {
    energy0 += fabs(pcmf32s[0][j]);
    energy1 += fabs(pcmf32s[1][j]);
}

This is a typical reduction with widening pattern.

The vector instructions generated are:

242ce:  0d8077d7                vsetvli a5,zero,e64,m1,ta,ma
242d2:  5e0031d7                vmv.v.i v3,0
242d6:  9e303257                vmv1r.v v4,v3
242da:  0976f7d7                vsetvli a5,a3,e32,mf2,tu,ma
242e4:  0205e107                vle32.v v2,(a1)
242e8:  02066087                vle32.v v1,(a2)
242ec:  2a211157                vfabs.v v2,v2
242f0:  2a1090d7                vfabs.v v1,v1
242f8:  d2411257                vfwadd.wv       v4,v4,v2
242fc:  d23091d7                vfwadd.wv       v3,v3,v1
24312:  0d8077d7                vsetvli a5,zero,e64,m1,ta,ma
24316:  4207d0d7                vfmv.s.f        v1,fa5
2431a:  063091d7                vfredusum.vs    v3,v3,v1
2431e:  42301757                vfmv.f.s        fa4,v3
24326:  06409257                vfredusum.vs    v4,v4,v1
2432a:  424017d7                vfmv.f.s        fa5,v4

A hypothetical vectorized Ghidra might decompile these instructions (ignoring the scalar instructions not displayed here) as:

double vector v3, v4;  // SEW=64 bit
v3 := vector 0;  // load immediate
v4 := v3;        // vector copy
float vector v1, v2;  // SEW=32 bit
while(...) {
    v2 = vector *a1;
    v1 = vector *a2;
    v2 = abs(v2);
    v1 = abs(v1);
    v4 = v4 + v2;  // widening 32 to 64 bits
    v3 = v3 + v1;  // widening 32 to 64 bits
}
double vector v1, v3, v4;
v1[0] = fa5;   // fa5 is the scalar 'carry-in' 
v3[0] = v1[0] + sum(v3); // unordered vector reduction
fa4 = v3[0];
v4[0] = v1[0] + sum(v4);
fa5 = v4[0];

The vector instruction vfredusum.vs provides the unordered reduction sum over the elements of a single vector. That’s likely faster than an ordered sum, but the floating point round-off errors will not be deterministic.
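
As a scalar C model - a sketch of the instruction semantics, not of any generated code - vfredusum.vs vd,vs2,vs1 computes roughly:

/* vd[0] = vs1[0] + sum(vs2[0..vl-1]); the hardware may add in any order,
   so floating point results can differ from a strict left-to-right sum */
double vfredusum (const double *vs2, double vs1_0, int vl)
{
  double acc = vs1_0;              /* scalar carry-in from vs1[0] */
  for (int i = 0; i < vl; i++)
    acc += vs2[i];
  return acc;                      /* written to element 0 of vd */
}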

Note: this whisper.cpp routine attempts to recognize which of two speakers is responsible for each word of a conversation. A speaker-misattribution exploit might attack functions that call this.

complex structure element copy

The source code includes:

static drwav_uint64 drwav_read_pcm_frames_s16__msadpcm(drwav* pWav, drwav_uint64 framesToRead, drwav_int16* pBufferOut) {
    ...
    pWav->msadpcm.bytesRemainingInBlock = pWav->fmt.blockAlign - sizeof(header);

    pWav->msadpcm.predictor[0] = header[0];
    pWav->msadpcm.predictor[1] = header[1];
    pWav->msadpcm.delta[0] = drwav__bytes_to_s16(header + 2);
    pWav->msadpcm.delta[1] = drwav__bytes_to_s16(header + 4);
    pWav->msadpcm.prevFrames[0][1] = (drwav_int32)drwav__bytes_to_s16(header + 6);
    pWav->msadpcm.prevFrames[1][1] = (drwav_int32)drwav__bytes_to_s16(header + 8);
    pWav->msadpcm.prevFrames[0][0] = (drwav_int32)drwav__bytes_to_s16(header + 10);
    pWav->msadpcm.prevFrames[1][0] = (drwav_int32)drwav__bytes_to_s16(header + 12);

    pWav->msadpcm.cachedFrames[0] = pWav->msadpcm.prevFrames[0][0];
    pWav->msadpcm.cachedFrames[1] = pWav->msadpcm.prevFrames[1][0];
    pWav->msadpcm.cachedFrames[2] = pWav->msadpcm.prevFrames[0][1];
    pWav->msadpcm.cachedFrames[3] = pWav->msadpcm.prevFrames[1][1];
    pWav->msadpcm.cachedFrameCount = 2;
...
}

This gets vectorized into sequences containing:

2c6ce:  ccf27057                vsetivli        zero,4,e16,mf2,ta,ma ; vl=4, SEW=16
2c6d2:  5e06c0d7                vmv.v.x v1,a3              ; v1[0..3] = a3
2c6d6:  3e1860d7                vslide1down.vx  v1,v1,a6   ; v1 = v1[1:3], a6
2c6da:  3e1760d7                vslide1down.vx  v1,v1,a4   ; v1 = v1[1:3], a4
2c6de:  3e1560d7                vslide1down.vx  v1,v1,a0   ; v1 = (a3,a6,a4,a0)

2c6e2:  0d007057                vsetvli zero,zero,e32,m1,ta,ma ; keep existing vl (=4), SEW=32
2c6e6:  4a13a157                vsext.vf2       v2,v1      ; v2 = vector sext(v1) // widening sign extend
2c6ea:  0207e127                vse32.v v2,(a5)            ; memcpy(a5, v2, 4 * 4)
2c6f2:  0a07d087                vlse16.v        v1,(a5),zero ; v1[0..3] = *a5 (stride 0)

2c6fa:  0cf07057                vsetvli zero,zero,e16,mf2,ta,ma
2c702:  3e1660d7                vslide1down.vx  v1,v1,a2   ; v1 = v1[1:3], a2
2c70a:  3e16e0d7                vslide1down.vx  v1,v1,a3   ; v1 = v1[1:3], a3
2c70e:  3e1760d7                vslide1down.vx  v1,v1,a4   ; v1 = v1[1:3], a4

2c712:  0d007057                vsetvli zero,zero,e32,m1,ta,ma
2c716:  4a13a157                vsext.vf2       v2,v1
2c71a:  0205e127                vse32.v v2,(a1)
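
The vslide1down.vx steps above pack scalar registers into a 4 element vector one element at a time. As a scalar C sketch (assuming vl = 4 and 16 bit elements, per the preceding vsetivli):

/* rough model of vslide1down.vx vd,vs2,rs1 with vl = 4:
   shift the elements down one position and insert the scalar at the top */
void vslide1down_16 (short vd[4], const short vs2[4], short rs1)
{
  vd[0] = vs2[1];
  vd[1] = vs2[2];
  vd[2] = vs2[3];
  vd[3] = rs1;     /* the newest scalar lands in the highest element */
}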

That’s the kind of messy code you could analyze if you had to. Hopefully not.

2 - Application Top Down Analysis

How much complexity do vector instructions add to a top down analysis?

We know that whisper.cpp contains lots of vector instructions. Now we want to know how few of those vector instruction blocks actually need close analysis.

For this analysis we will assume a specific goal - inspect the final text output phase to see if an adversary has modified the generated text.

First we want to understand the unmodified behavior using a simple demo case. One of the whisper.cpp examples works well. It was built for the x86-64-v3 platform, not the riscv-64 gcv platform, but that’s fine - we just want to understand the rough sequencing and get a handle on the strings we might find in or near the top level main routine.

what is the expected behavior?

Note: added comments are flagged with //

/opt/whisper_cpp$ ./main -f samples/jfk.wav
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:      CPU total size =   147.46 MB (1 buffers)
whisper_model_load: model size    =  147.37 MB
whisper_init_state: kv self size  =   16.52 MB
whisper_init_state: kv cross size =   18.43 MB
whisper_init_state: compute buffer (conv)   =   14.86 MB
whisper_init_state: compute buffer (encode) =   85.99 MB
whisper_init_state: compute buffer (cross)  =    4.78 MB
whisper_init_state: compute buffer (decode) =   96.48 MB

system_info: n_threads = 4 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | 

// done with initialization, let's run speech-to-text
main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...

// this is the reference line our adversary wants to modify:
[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.

// display statistics
whisper_print_timings:     load time =   183.72 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    10.30 ms
whisper_print_timings:   sample time =    33.90 ms /   131 runs (    0.26 ms per run)
whisper_print_timings:   encode time =   718.87 ms /     1 runs (  718.87 ms per run)
whisper_print_timings:   decode time =     8.35 ms /     2 runs (    4.17 ms per run)
whisper_print_timings:   batchd time =   150.96 ms /   125 runs (    1.21 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  1110.87 ms

The adversary wants to change the text output from “… ask what you can do for your country.” to “… ask what you can do for your enemy.” They would likely insert a string substitution into the code between the output of main: processing and whisper_print_timings:, probably very close to the code printing timestamp intervals like [00:00:00.000 --> 00:00:11.000].

what function names and strings look relevant?

Our RISCV-64 binary retains some function names and lots of relevant strings. We want to accumulate strings that occur in the demo printout, then glance at the functions that reference those strings.

For this example we will use a binary that includes some debugging type information. Ghidra can determine names of structure types but not necessarily the size or field names of those structures.

strings

  • %s: processing '%s' (%d samples, %.1f sec), %d threads, %d processors, %d beams + best of %d, lang = %s, task = %s, %stimestamps = %d ... is referenced near the middle of main
  • [%s --> %s] is referenced by whisper_print_segment_callback
  • [%s --> %s] %s\n is referenced by whisper_full_with_state
  • segment occurs in several places, suggesting that the word refers to a segment of text generated from speech between two timestamps.
  • ctx occurs 33 times, suggesting that a context structure is used - and occasionally displayed with field names
  • error: failed to initialize whisper context\n is referenced within main. It may help in understanding internal data organization.

functions

  • main - Ghidra decompiles this as ~1000 C statements, including many vector statements
  • whisper_print_timings - referenced directly in main near the end
  • whisper_full_with_state - referenced indirectly from main via whisper_full_parallel and whisper_full
  • output_txt - referenced directly in main, invokes I/O routines like std::__ostream_insert<>. There are other output routines like output_json. The specific output routine can be selected as a command line parameter to main.

types and structs

Ghidra knows that these exist as names, but the details are left to us to unravel.

  • gpt_params and gpt_vocab - these look promising, at a lower ML level
  • whisper_context - this likely holds most of the top-level data
  • whisper_full_params and whisper_params - likely structures related to the optional parameters revealed with the --help command line option.
  • whisper_segment - possibly a segment of digitized audio to be converted into text.
  • whisper_vocab - possibly holding the text words known to the training data.

notes

Now we have enough context to narrow the search. We want to know:

  • how does main call either whisper_print_segment_callback or whisper_full_with_state?
    • whisper_full is called directly by main. Ghidra reports this to be about 3000 lines of C. The Ghidra call tree suggests that this function does most of the speech-to-text tensor math and other ML heavy lifting.
    • whisper_print_segment_callback appears to be inserted into a C++ object vtable as a function pointer. The object itself appears to be built on main’s stack, so we don’t immediately know its size or use. whisper_print_segment_callback is less than a tenth the size of whisper_full_with_state.
  • how does the JFK output text get appended to the string [%s --> %s]?
  • from what structures is the output text retrieved?
  • where are those structures initialized? How large are they, and are any of their fields named in diagnostic output?
  • are there any diagnostic routines displaying the contents of such structures?

next steps

A simple but tedious technique involves a mix of top-down and bottom-up analysis. We work upwards from strings and function references, and down from the main routine towards the functions associated with our target text string. Trial and error with lots of backtracking are common here, so switching back and forth between top-down and bottom-up exploration can provide fresh insights.

Remember that we don’t want to understand any more of whisper.cpp than we have to. The adversary we are chasing only needs to know where the generated text comes within reach. Neither they nor we need to understand all of the ways the C++ standard library might use vector instructions during I/O subsystem initialization.

On the other hand, they and we may need to recognize basic I/O and string handling operations, since the target text is likely to exist as either a standard string or a standard vector of strings.

Note: This isn’t a tutorial on how to approach a C++ reverse engineering challenge - it’s an evaluation of how vectorization might make that more difficult and an exploration of what additional tools Ghidra or Ghidra users may find useful when faced with vectorization. That means we’ll skip most of the non-vector analysis.

vectorization obscures initialization

This sequence from main affects initialization and obscures a possible exploit vector.

  vsetivli_e8m8tama(0x17);         // memcpy(puStack_110, "models/ggml-base.en.bin", 0x17)
  auVar27 = vle8_v(0xa6650);
  vsetivli_e8m8tama(0xf);          // memcpy(puStack_f0, " [SPEAKER_TURN]", 0xf)
  auVar26 = vle8_v(0xa6668);
  puStack_f0 = auStack_e0;
  vsetivli_e8m8tama(0x17);
  vse8_v(auVar27,puStack_110);
  vsetivli_e8m8tama(0xf);
  vse8_v(auVar26,puStack_f0);
  puStack_d0 = &uStack_c0;
  vsetivli_e64m1tama(2);           // memset(lStack_b0, 0, 16)
  vmv_v_i(auVar25,0);
  vse64_v(auVar25,&lStack_b0);
  *(char *)((long)puStack_110 + 0x17) = '\0';

If the hypothetical adversary wanted to replace the training model ggml-base.en.bin with a less benign model, changing the memory reference within vle8_v(0xa6650) would be a good place to do it. Note that the compiler has interleaved instructions generated from the two memcpy expansions, at the cost of two extra vsetivli instructions. This allows more time for the vector load instructions to complete.

Focus on output_txt

Some browsing in Ghidra suggests that the following section of main is close to where we need to focus.

    lVar11 = whisper_full_parallel
                      (ctx,(long)pFVar18,(ulong)pvStack_348,
                      (long)(int)(lStack_340 - (long)pvStack_348 >> 2),
                      (long)pvVar20);
  if (lVar11 == 0) {
    putchar(10,pFVar18);
    if (params.do_output_txt != false) {
  /* try { // try from 0001dce8 to 0001dceb has its CatchHandler @ 0001e252 */
      std::operator+(&full_params,(undefined8 *)pFStack_2e0,
                      (undefined8 *)pFStack_2d8,(undefined8 *)".txt",
                      (char *)pvVar20);
      uVar13 = full_params._0_8_;
  /* try { // try from 0001dcfc to 0001dcfd has its CatchHandler @ 0001e2ec */
      std::vector<>::vector(unaff_s3,(vector<> *)unaff_s5);
  /* try { // try from 0001dd06 to 0001dd09 has its CatchHandler @ 0001e2f0 */
      output_txt(ctx,(char *)uVar13,&params,(vector *)unaff_s3);
      std::vector<>::~vector(unaff_s3);
      std::__cxx11::basic_string<>::_M_dispose((basic_string<> *)&full_params);
    }
    ...
  }

Looking into output_txt Ghidra gives us:

long output_txt(whisper_context *ctx,char *output_file_path,whisper_params *param_3,vector *param_4)

{
    fprintf(_stderr,"%s: saving output to \'%s\'\n","output_txt",output_file_path);
    max_index = whisper_full_n_segments(ctx);
    index = 0;
    if (0 < max_index) {
      do {
        __s = (char *)whisper_full_get_segment_text(ctx,index);
    ...
        sVar8 = strlen(__s);
        std::__ostream_insert<>((basic_ostream *)plVar7,__s,sVar8);
    ...
        index = (long)((int)index + 1);
      } while (max_index != index);
    ...
    }
...
}

Finally, whisper_full_get_segment_text is decompiled into:

undefined8 whisper_full_get_segment_text(whisper_context *ctx,long index)
{
  gp = &__global_pointer$;
  return *(undefined8 *)(index * 0x50 + *(long *)(ctx->state + 0xa5f8) + 0x10);
}

Now the adversary has enough information to try rewriting the generated text from an arbitrary segment of speech. The text is found in an array linked into the ctx context variable, probably during the call to whisper_full_parallel.
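
The decompiled expression implies an array of 0x50 byte elements with a text pointer at offset 0x10. A hypothetical reconstruction - field names and the remaining fields are guesses; only the element stride and the 0x10 offset come from the decompilation:

/* hypothetical: 0x50 bytes per element, indexed by segment number */
struct segment {
  long t0;                /* offset 0x00 - perhaps the start timestamp */
  long t1;                /* offset 0x08 - perhaps the end timestamp   */
  char *text;             /* offset 0x10 - the pointer returned above  */
  char unknown[0x38];     /* remaining fields not recovered here       */
};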

added complexity of vectorization

Our key goal is to understand how much effort to put into Ghidra’s decompiler processing of RISCV-64 vector instructions. We measure that effort relative to the effort needed to understand the other instructions an optimizing C++ compiler produces when implementing libstdc++ containers like vectors.

Take a closer look at the call to output_txt:

std::vector<>::vector(unaff_s3,(vector<> *)unaff_s5);
output_txt(ctx,(char *)uVar13,&params,(vector *)unaff_s3);
std::vector<>::~vector(unaff_s3);

The unaff_s3 parameter to output_txt might be important. Maybe we should examine the constructor and destructor for this object to probe its internal structure.

In fact unaff_s3 is only used when passing stereo audio into output_txt, so it is more of a red herring slowing down the analysis than a true roadblock. Its internal structure is a C++ standard vector of C++ standard vectors of float, so it’s a decent example of what happens when RISCV-64 vector instructions are used to implement vectors (and two dimensional matrices) at a higher abstraction level.

A little analysis shows us that std::vector<>::vector is actually a copy constructor for a class generated from a vector template. The true type of unaff_s3 and unaff_s5 is roughly std::vector<std::vector<float>>.

Comment: the copy constructor and the associated destructor are likely present only because the programmer didn’t mark the parameter as a const reference.

The listing of the destructor std::vector<>::~vector(unaff_s3) shows that no vector instructions are used. The inner vectors are deleted and their memory reclaimed, then the outer containing vector is deleted.

The constructor std::vector<>::vector is different. Vector instructions are used often, but in very simple contexts.

  • The only vset mode used is vsetivli_e64m1tama(2), asking for no more than two 64 bit elements in a vector register
  • The most common vector pattern stores 0 into two adjacent 64 bit pointers
  • In one case a 64 bit value is stored into two adjacent 64 bit pointers (see the sketch below).
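
A plausible reading - assuming the usual libstdc++ representation of a std::vector as three pointers - is that these stores initialize the pointer triples of the inner vectors:

/* illustrative only: the common libstdc++ std::vector layout. One vse64.v with
   vl=2 and e64 can zero the first two fields; the third is handled by a
   separate store, matching the patterns listed above. */
struct vec_rep {
  void *start;            /* first element                      */
  void *finish;           /* one past the last element          */
  void *end_of_storage;   /* one past the end of the allocation */
};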

Summary

If whisper.cpp is representative of a broader class of ML programs compiled for RISCV-64 vector-enabled hardware, then:

  1. Ghidra’s sleigh subsystem needs to recognize at least those vector instructions found in the rvv 1.0 release.
  2. The decompiler view should have access to pcodeops for all of those vector instructions.
  3. The 20 to 50 most common vset* configurations (e.g., e64m1tama) should be explicitly recognized at the pcodeop layer and displayed in the decompiler view.
  4. Ghidra users should have documentation on common RISCV-64 vector instruction patterns generated during compilation. These patterns should include common loop patterns and builtin expansions for memcpy and memset, plus examples showing the common source code patterns resulting in vector reduction, width conversion, slideup/down, and gather/scatter instructions.

Other Ghidra extensions would be nice to have but likely deliver diminishing bang-for-the-buck relative to multiplatform C++ analytics:

  1. Extend sleigh *.sinc file syntax to convey comments or hints to be visible in the decompiler view, either as pop-ups, instruction info, or comment blocks.
  2. Take advantage of the open source nature of RISCV ISA to display links to open source documents on vector instructions when clicking on a given instruction.
  3. Treat pcodeops as function calls within the decompiler view, enabling signature overrides and type assignment to the arguments.
  4. Create a decompiler plugin framework that can scan the decompiled source and translate vector instruction patterns back into __builtin_memcpy(...) calls.
  5. Create a decompiler plugin framework that can scan the decompiled source and generate inline comments in a sensible vector notation.

The toughest challenges might be:

  1. Find a Ghidra micro-architecture-independent approach to untangling vector instruction generation.
  2. Use ML translation techniques to match C, C++, and Rust source patterns to generated vector instruction sequences for known architectures, compilers, and compiler optimization settings.