Whisper Deep Dive

This exercise mocks up a forensic analysis of a hypothetical voice-to-text application believed to be based on Whisper.cpp. The previous examples applied Advisor to fairly random vector instruction sequences found in a Whisper.cpp compilation without much identifying metadata. This time we will do things more methodically, using a specific Whisper.cpp release built with specific build instructions, analyzed in Ghidra in both stripped and unstripped binary formats. Dependent libraries libc, libm, and libstdc++ will be imported into Ghidra from the toolchain used to construct the whisper executable. Once we have trained Advisor to help with the known-source whisper application, we might be better able to use it in analyzing potentially malicious whisper-like applications.

This is an iterative process: we make some initial guesses about how the application-under-test (AUT) was built, rebuild our whisper reference model the same way, then adjust either the reference model or the build parameters until we see similar key patterns in our AUT and reference models.

The initial guesses are:

  • Similar to Whisper.cpp release 1.7.1
  • Built for RISCV64 cores with the rva22 profile plus vector extensions, something like the SiFive P670 cores within a SG2380 processor.
  • Built with gcc 15.0.0 using the -march=rv64gcv, -ffast-math, and -O3 options for a linux-like OS.
  • Dynamically linked with libc, libm, and libstdc++ as of mid 2024.

It’s worthwhile establishing key structures used by Whisper - and likely by any malicious code forked from Whisper.

Inspecting the reference source code suggests these structures:

  • whisper_context
  • whisper_state - created by whisper_init_state(struct whisper_context * ctx)
  • whisper_context_params

1 - Building the Reference Model

We start by importing Whisper.cpp into our Bazel build environment. We may eventually want a full fork of the code, but not until we are sure of the base release and what directions that fork will need to take.

This import starts with a simple addition to our MODULE.bazel workspace file:

# MODULE.bazel
http_archive = use_repo_rule("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")
# whisper.cpp is an open source voice-to-text inference app built on OpenAI's Whisper model.
# It is a useful exemplar of autovectorization of ML code with some examples of hand-coded
# riscv intrinsics.
http_archive(
    name = "whisper_cpp",
    urls = ["https://github.com/ggerganov/whisper.cpp/archive/refs/tags/v1.7.1.tar.gz"],
    strip_prefix = "whisper.cpp-1.7.1/",
    build_file = "//:whisper-cpp.BUILD",
    sha256 = "97f19a32212f2f215e538ee37a16ff547aaebc54817bd8072034e02466ce6d55"
)

Next we add whisper-cpp.BUILD to show how to build libraries and binaries. The instructions for the whisper library and the main binary include these stanzas:

cc_library(
    name = "whisper",
    srcs = [
        "ggml/src/ggml.c",
        "ggml/src/ggml-aarch64.c",
        "ggml/src/ggml-alloc.c",
        "ggml/src/ggml-backend.cpp",
        "ggml/src/ggml-backend-impl.h",
        "ggml/src/ggml-impl.h",
        "ggml/src/ggml-quants.c",
        "src/whisper.cpp",
    ],
    copts = [
        "-I%s/include" % EXTERNAL_PATH,
        "-I%s/ggml/include" % EXTERNAL_PATH,
        "-I%s/ggml/src" % EXTERNAL_PATH,
        "-pthread",
        "-O3",
        "-ffast-math",
    ],
    ...
    defines = [
        "NDEBUG",
        "_XOPEN_SOURCE=600",
        "_GNU_SOURCE",
        "__FINITE_MATH_ONLY__=0",
        "__riscv_v_intrinsic=0",
    ],
    ...
)
cc_binary(
    name = "main",
    srcs = [
        "examples/common.cpp",
        "examples/common.h",
        "examples/common-ggml.cpp",
        "examples/common-ggml.h",
        "examples/dr_wav.h",
        "examples/grammar-parser.cpp",
        "examples/grammar-parser.h",
        "examples/main/main.cpp",
    ],
    ...
    deps = [
        "whisper",
    ],
)

Now we can build the reference app using our existing RISCV-64 toolchain:

$ bazel build --platforms=//platforms:riscv_gcc --copt='-march=rv64gcv' @whisper_cpp//:main
...
$ file bazel-bin/external/+_repo_rules+whisper_cpp/main
bazel-bin/external/+_repo_rules+whisper_cpp/main: ELF 64-bit LSB executable, UCB RISC-V, RVC, double-float ABI, version 1 (GNU/Linux), dynamically linked, interpreter /lib/ld-linux-riscv64-lp64d.so.1, for GNU/Linux 4.15.0, not stripped

$ readelf -A bazel-bin/external/+_repo_rules+whisper_cpp/main
Attribute Section: riscv
File Attributes
  Tag_RISCV_stack_align: 16-bytes
  Tag_RISCV_arch: "rv64i2p1_m2p0_a2p1_f2p2_d2p2_c2p0_v1p0_zicsr2p0_zifencei2p0_zmmul1p0_zaamo1p0_zalrsc1p0_zca1p0_zcd1p0_zve32f1p0_zve32x1p0_zve64d1p0_zve64f1p0_zve64x1p0_zvl128b1p0_zvl32b1p0_zvl64b1p0"
  Tag_RISCV_priv_spec: 1
  Tag_RISCV_priv_spec_minor: 11

The final step is to locate the toolchain libraries used in this build, so that we can load them into Ghidra. They are usually cached in a per-user location. We’ll search for the RISCV libstdc++ toolchain library:

$ bazel info
...
output_base: /run/user/1000/bazel
output_path: /run/user/1000/bazel/execroot/_main/bazel-out
package_path: %workspace%
release: release 7.4.0
...
$ find /run/user/1000 -name libstdc++\*
...
/run/user/1000/bazel/external/gcc_riscv_suite+/riscv64-unknown-linux-gnu/lib/libstdc++.so.6.0.33
...
/run/user/1000/bazel/external/gcc_riscv_suite+/riscv64-unknown-linux-gnu/lib/libstdc++.so.6
/run/user/1000/bazel/external/gcc_riscv_suite+/riscv64-unknown-linux-gnu/lib/libstdc++.so

We will want to load libstdc++.so.6 into Ghidra before we load the reference app.

2 - Loading the Reference Model into Ghidra

Create a new Ghidra project and load the whisper dependencies and then whisper itself, in both stripped and unstripped forms.

/run/user/1000/bazel/external/gcc_riscv_suite+/lib/libc.so.6
/run/user/1000/bazel/external/gcc_riscv_suite+/lib/libm.so.6
/run/user/1000/bazel/external/gcc_riscv_suite+/riscv64-unknown-linux-gnu/lib/libstdc++.so.6
/run/user/1000/bazel/execroot/_main/bazel-out/k8-fastbuild/bin/external/_main~_repo_rules~whisper_cpp/main

We can check the machine architecture for which these libraries were built with readelf -A:

$ readelf -A /run/user/1000/bazel/external/gcc_riscv_suite+/lib/libc.so.6
Attribute Section: riscv
File Attributes
  Tag_RISCV_stack_align: 16-bytes
  Tag_RISCV_arch: "rv64i2p1_m2p0_a2p1_f2p2_d2p2_c2p0_zicsr2p0_zifencei2p0_zmmul1p0_zaamo1p0_zalrsc1p0"
  Tag_RISCV_unaligned_access: Unaligned access
  Tag_RISCV_priv_spec: 1
  Tag_RISCV_priv_spec_minor: 11

These libraries were most likely built with the non-vector rv64gc machine architecture.

The stripped version of main is generated by copying the non-stripped binary to /tmp and then stripping it with the riscv toolchain strip before importing it into Ghidra:

/run/user/1000/bazel/external/gcc_riscv_suite+/bin/riscv64-unknown-linux-gnu-strip /tmp/main

Now we can use the non-stripped version to orient ourselves and find reference points, then visit the stripped version to use those reference points and start to recover symbol names and structures.

3 - Ghidra Examination

We want to know how we can use Advisor to untangle vectorized instruction sequences. We’ve seen that Advisor can help with simple loops and builtin functions like memcpy. Now we want to tackle vectorized ‘shuffle’ code, where GCC turns a sequence of simple assignments or initializations into a much more obscure sequence of vector instructions.

We’ll assume the Ghidra user wishes to search for malicious behavior adjacent to the output_txt function, called by main after the voice-to-text inference engine has crunched the numbers.

The first step is to locate the main routine in our stripped binary. There is no main symbol left after stripping, so we need to find a path from the entry point to main in the unstripped binary first. The entry point is the symbol _start or entry.

In the unstripped binary:

void _start(void)
{
...
  gp = &__global_pointer$;
  uVar1 = _PREINIT_0();
  __libc_start_main(0x2649e,in_stack_00000000,&stack0x00000008,0,0,uVar1,&stack0x00000000);
  ebreak();
  main();
  return;
}

While the stripped binary is:

void entry(void)
{
  undefined8 uVar1;
  undefined8 in_stack_00000000;
  
  gp = &__global_pointer$;
  uVar1 = _PREINIT_0();
  __libc_start_main(0x2649e,in_stack_00000000,&stack0x00000008,0,0,uVar1,&stack0x00000000);
  ebreak();
  FUN_0001e758();
  return;
}

So we know that main is FUN_0001e758.

In the source code, main calls output_txt. There is no output_txt symbol in either stripped or non-stripped binaries, so this function has apparently been converted into inline code deep within main.

There are several different paths forward for examination. Sometimes the best approach is to explore a short distance along each likely path, backtracking or switching paths when we get stuck. For this exercise we want to know how to make Advisor useful in at least some of those cases where we get stuck.

Paths available:

  • Search for C strings as literals, then find where they are used. This will often give printf or logging utility functions.
  • Search for C++ string constructors given literals as input. Start to identify standard library string objects in the code.
  • Look for initialization or print functions recognizable either from symbol names or printf formatting strings.
  • Start to identify recurring structures passed by pointers. This can include context, state, and parameter structures.

Let’s start with a routine that touches several of those paths. This is the basic decompiled output from the stripped binary, for a function that gets called a lot by our identified main routine. Peeking into the unstripped binary, we see that its signature is void __thiscall std::string::string<>(string *this,char *cStr,allocator *param_2) - a C++ basic_string constructor given a literal C string as input. Note that there is actually no allocator passed into the function.

void FUN_000542e0(undefined8 *param_1,undefined *param_2)

{
  undefined *puVar1;
  long lVar2;
  undefined *puVar3;
  long lVar4;
  undefined8 *puVar5;
  undefined auVar6 [256];
  long in_vl;
  
  gp = &__global_pointer$;
  puVar5 = param_1 + 2;
  *param_1 = puVar5;
  if (param_2 == (undefined *)0x0) {
                    /* WARNING: Subroutine does not return */
    std::__throw_logic_error("basic_string: construction from null is not valid");
  }
  puVar1 = param_2;
  lVar4 = 0;
  do {
    vsetvli_e8m1tama(0);
    puVar1 = puVar1 + lVar4;
    auVar6 = vle8ff_v(puVar1);
    auVar6 = vmseq_vi(auVar6,0);
    lVar2 = vfirst_m(auVar6);
    lVar4 = in_vl;
  } while (lVar2 < 0);
  puVar1 = puVar1 + (lVar2 - (long)param_2);
  puVar3 = puVar1;
  if (puVar1 < (undefined *)0x10) {
    if (puVar1 == (undefined *)0x1) {
      *(undefined *)(param_1 + 2) = *param_2;
      goto LAB_00054326;
    }
    if (puVar1 == (undefined *)0x0) goto LAB_00054326;
  }
  else {
    puVar5 = (undefined8 *)operator.new((ulong)(puVar1 + 1));
    param_1[2] = puVar1;
    *param_1 = puVar5;
  }
  do {
    lVar4 = vsetvli_e8m8tama(puVar3);
    auVar6 = vle8_v(param_2);
    puVar3 = puVar3 + -lVar4;
    param_2 = param_2 + lVar4;
    vse8_v(auVar6,puVar5);
    puVar5 = (undefined8 *)((long)puVar5 + lVar4);
  } while (puVar3 != (undefined *)0x0);
  puVar5 = (undefined8 *)*param_1;
LAB_00054326:
  param_1[1] = puVar1;
  *(undefined *)((long)puVar5 + (long)puVar1) = 0;
  return;
}

The exercise here is to recover the basic_string internal structure and identify the two vector stanzas.

The Advisor identifies the first vector stanza as a builtin_strlen and the second as a builtin_memcpy. The std::string structure is 0x20 bytes and consists of a char*, a 64 bit string length, and a 16 byte union field. If the string is shorter than 16 bytes, so that it fits together with its null terminator, it is stored directly in the 16 byte union field and the char* points at that field. Otherwise, new memory is allocated for the copy, the char* points at the new memory, and the allocated capacity is stored in the first 8 bytes of the union (the param_1[2] = puVar1 store above).

The next step is easy enough to make the Advisor unnecessary. A std::vector copy constructor involves two vector instruction stanzas.

The new vector has three 64 bit pointer fields, all of which need to be zeroed. GCC 15 does that with:

  vsetivli_e64m1tama(2);
  vmv_v_i(in_v1,0);
  ...
  vse64_v(in_v1,this);
  *(undefined8 *)&this->field_0x10 = 0;

That’s a little bit odd, since it is using three vector instructions to replace two scalar instructions, followed by a separate scalar store instruction. It could alternatively have used three scalar store instructions or three vector instructions with an m2 LMUL multiplier option. Perhaps this is an example of incomplete or over-eager optimization, or an optimization from a RISC-V vendor who knows that vector instructions can be executed in parallel with scalar instructions.

A little later in the copy constructor a builtin_memcpy vector stanza occurs, to copy the contents of the original vector into the newly initialized vector.

This suggests:

  • Vector stanzas like builtin_memcpy, builtin_strlen, and vector instructions to zero 16 bytes are common and fairly easy to recognize, either by eye or Advisor. Adding more builtin functions to the exemplar directory makes good sense.
  • Vector stanzas often occur in initialization sequences, where they can be difficult to untangle from associated C++ object initializations. If we want to tackle this, we also need examples of libstdc++ initializations, especially for vectors, maps, and iostreams.
  • We need more examples of less common vector stanzas, including gather and slide operations.

4 - Dealing with C++

GCC 15 uses RISCV vector instruction sequences in many initialization sequences - even when there is no need for a loop. If we want to understand that code, we need a decent understanding of what is getting initialized. One way to move forward is with a small program that uses some of the same libstdc++ classes, to help us understand their memory layout and especially the fields that may need initialization.

The first iteration of this is a toy program using std::string, std::vector, std::pair, std::map, and std::ofstream library code. We generally want to know object sizes, pointers, and key internal fields. If something like ofstream is constructed on the stack, we can probably ignore any interior objects initialized within it.

#include <string>
#include <vector>
#include <map>
#include <iostream>
#include <fstream>
#include <cstdint>

void dumpHex(const uint64_t* p, int numwords)  {
    std::cout << "\tRaw: ";
    for (int i = 0; i < numwords; i++) {
        std::cout << "0x" << std::hex << p[i];
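        // Always true inside this loop, so every value (including the
        // last one) is followed by ", ", as seen in the output below.
        if (i < numwords) {
            std::cout << ", ";
        }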
    }
    std::cout << std::endl;
}

void showString(const std::string* s, const char* label) {
    std::cout << "std::string " << label << " = " << *s << std::endl;
    std::cout << "\tRaw Size = 0x" << std::hex << sizeof(*s) << std::endl;
    std::cout << "\tLength = 0x" << std::hex << s->length() << std::endl;
    dumpHex((const uint64_t*)s, 4);
}

void showVector(const std::vector<std::string>* v, const char* label) {
    std::cout << "std::vector<std::string> " << label << ":" << std::endl;
    std::cout << "\tRaw Size = 0x" << std::hex << sizeof(*v) << std::endl;
    std::cout << "\tInternal Size = " << v->size() << std::endl;
    dumpHex((const uint64_t*)v, 3);
}

void showPair(const std::pair<std::string,std::string>* p, const char* label) {
    std::cout << "std::pair<std::string,std::string> " << label << ":" << std::endl;
    std::cout << "\tRaw Size = 0x" << std::hex << sizeof(*p) << std::endl;
    dumpHex((const uint64_t*)p, 8);
}

void showMap(const std::map<std::string,std::string>* token_map, const char* label) {
    std::cout << "std::map<std::string,std::string> " << label << ":" << std::endl;
    std::cout << "\tRaw Size = 0x" << std::hex << sizeof(*token_map) << std::endl;
    std::cout << "\tInternal Size = " << token_map->size() << std::endl;
    dumpHex((const uint64_t*)token_map, 6);
}

void showOfstream(const std::ofstream* fout, const char* label) {
    std::cout << "std::ofstream " << label << ":" << std::endl;
    std::cout << "\tRaw Size = 0x" << std::hex << sizeof(*fout) << std::endl;
    dumpHex((const uint64_t*)fout, 8);
}

int main() {
    std::cout << "Initializing\n";
    std::ofstream fout("/tmp/scratch");
    showOfstream(&fout, "initialized ofstream");
    std::string xs("short string");
    showString(&xs, "short string");
    std::string x("This is a sample long string");
    showString(&x, "long string");
    std::vector<std::string> vx;
    showVector(&vx, "empty vector");
    vx.push_back(x);
    showVector(&vx, "singleton vector");

    fout << "something to fill the file" << std::endl;

    std::pair<std::string,std::string> map_element("key", "value");
    showPair(&map_element, "map_element");
    std::map<std::string,std::string> token_map;
    showMap(&token_map, "token_map, empty");
    token_map.insert(map_element);
    showMap(&token_map, "token_map, one insertion");
    fout.close();
    showOfstream(&fout, "closed ofstream");
    return 0;
}

Build with:

$ bazel build --platforms=//platforms:riscv_gcc --copt='-march=rv64gcv' other_src:stdlibc++_exploration

Run this under qemu emulation:

$ qemu-riscv64-static -L /opt/riscvx/sysroot -E LD_LIBRARY_PATH=/opt/riscvx/sysroot/riscv64-unknown-linux-gnu/lib/ bazel-bin/other_src/stdlibc++_exploration
Initializing
std::ofstream initialized ofstream:
	Raw Size = 0x200
	Raw: 0x7fbff0608d70, 0x7fbff0608bb8, 0x288a0, 0x288a0, 0x288a0, 0x0, 0x0, 0x0, 
std::string short string = short string
	Raw Size = 0x20
	Length = 0xc
	Raw: 0x7fbfe3ffe820, 0xc, 0x74732074726f6873, 0x7f00676e6972, 
std::string long string = This is a sample long string
	Raw Size = 0x20
	Length = 0x1c
	Raw: 0x2a8b0, 0x1c, 0x1c, 0x7fbfe3ffe8f0, 
std::vector<std::string> empty vector:
	Raw Size = 0x18
	Internal Size = 0
	Raw: 0x0, 0x0, 0x0, 
std::vector<std::string> singleton vector:
	Raw Size = 0x18
	Internal Size = 1
	Raw: 0x2a8e0, 0x2a900, 0x2a900, 
std::pair<std::string,std::string> map_element:
	Raw Size = 0x40
	Raw: 0x7fbfe3ffe890, 0x3, 0x79656b, 0x7fbff0411528, 0x7fbfe3ffe8b0, 0x5, 0x65756c6176, 0x7fbff14d751e, 
std::map<std::string,std::string> token_map, empty:
	Raw Size = 0x30
	Internal Size = 0
	Raw: 0x1, 0x7fbf00000000, 0x0, 0x7fbfe3ffe858, 0x7fbfe3ffe858, 0x0, 
std::map<std::string,std::string> token_map, one insertion:
	Raw Size = 0x30
	Internal Size = 1
	Raw: 0x1, 0x7fbf00000000, 0x2a940, 0x2a940, 0x2a940, 0x1, 
std::ofstream closed ofstream:
	Raw Size = 0x200
	Raw: 0x7fbff0608d70, 0x7fbff0608bb8, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 

Now we can load bazel-bin/other_src/stdlibc++_exploration into Ghidra and get some insight into how these standard library classes are initialized.

std::string has a size of 0x20 bytes and a structure similar to:

struct string { /* stdlib basic string */
    char *cstr;
    uint64_t length;
    char data[16];
};

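If the string is 15 bytes or less, it is stored directly in data, leaving room for the null terminator, and cstr points at data. Otherwise new memory is allocated, cstr points at it, and the first 8 bytes of data hold the allocated capacity (the second 0x1c in the long string dump above).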

std::vector<T> has a size of 0x18 bytes, regardless of the component type T.

struct vector {
    T *start;        // points to the first element in the vector
    T *end;          // points just past the last element in the vector
    T *alloc_end;    // points just past the last empty element allocated for the vector.
};

std::pair<T1,T2> is simply a concatenation of elements of type T1 and T2, so a pair of strings is 0x40 bytes and a pair of string pointers is 0x10 bytes.

std::ofstream is a large object of 0x200 bytes. We’ll ignore its internal structure for now - and especially any whisper.cpp code initializing elements of that structure.

Ghidra examination of stdlibc++_exploration provides some insight and more than a few red herrings.

stdlibc++_exploration

std::vector

The three 64 bit pointers that make up a std::vector<std::string> are initialized inline to zero with a stanza of five vector instructions - even though that does not look optimal.

  vsetivli_e8mf2tama(8);
  vmv_v_i(in_v1,0);
  vse8_v(in_v1,&vx.end);
  vse8_v(in_v1,&vx.alloc_end);
  vse8_v(in_v1,&vx);
  showVector(&vx,"empty vector");

vx.push_back(x) makes a copy of the string x and allocates space for both the vector element and the string copy with operator.new.

std::vectors are most easily recognized by their destructors:

std::vector<>::~vector((vector<> *)&vx);

std::ofstream

File structures like std::ofstream are easy to recognize through their constructors and destructors. They can be confusing though, since they are likely to be reused if multiple files are opened and closed in the same function.