Exemplars

List the current importable and buildable exemplars, their origins, and the Ghidra features they are intended to validate or stress.

Overview

Exemplars suitable for Ghidra import are generally collected by platform architecture, such as riscv64/exemplars or x86_64/exemplars. Some are extracted from system disk images; others are built locally from small source files and an appropriate compiler toolchain. The initial scope covers Linux-capable RISCV-64 systems of the kind found in network appliances or ML inference engines, which biases the collection toward privileged code, concurrency management, and performance optimization. The scope expands slightly to x86_64 exemplars that can help triage issues that show up first in RISCV-64 exemplars.

You can get decent coverage with this set of exemplars:

  • from a general purpose RISCV-64 disk image:
    • kernel - a RISCV-64 kernel built with a recent Linux release
    • kernel load module - an ELF binary intended to be loaded into a running kernel
    • system library - libc.so or libssl.so copied from a generic Linux disk image
    • system application - a user application linking against system libraries and running over a Linux kernel
  • built from source, with the development tip of the gcc toolchain and many explicit ISA extensions:
    • binutils assembly test suite - ISA extensions usually show up here first, along with preferred disassembly patterns
    • memcpy and other libc replacements coded with RISCV-64 ISA extension intrinsics
    • libssl.so and libcrypto.so built from source and configured for all standard and frozen crypto, vector, and bit manipulation instruction extensions.
    • DPDK network appliance source code: l3fwd and l2fwd.
    • a custom cross-compiled kernel, with ISA extensions enabled

In general, visual inspection of these exemplars after importing into Ghidra should show:

  • no failed constructors, so all instructions are recognized by Ghidra during disassembly
  • no missing pcode
  • all vector vset* instructions are unwrapped to show selected element width, multiplier, tail and mask handling

Ghidra will in a few cases disassemble an instruction differently than binutils’ objdump. That’s fine when the difference stems from a limitation of Ghidra’s SLEIGH language, but where alignment with objdump is possible, it is preferred.

Imported exemplars

Most of the imported large binary exemplars are broken out of available Fedora disk images. The top level acquireExternalExemplars.py script controls this process, sometimes with some manual intervention to handle image mounting. Selection of the imported disk image is controlled with text like:

LOGLEVEL = logging.WARN
FEDORA_RISCV_SITE = "http://fedora.riscv.rocks/kojifiles/work/tasks/6900/1466900"
FEDORA_RISCV_IMAGE = "Fedora-Developer-39-20230927.n.0-sda.raw"
FEDORA_KERNEL = "vmlinuz-6.5.4-300.0.riscv64.fc39.riscv64"
FEDORA_KERNEL_OFFSET = 40056
FEDORA_KERNEL_DECOMPRESSED = "vmlinux-6.5.4-300.0.riscv64.fc39.riscv64"
FEDORA_SYSMAP = "System.map-6.5.4-300.0.riscv64.fc39.riscv64"

Fedora kernel

Warning: the cited Fedora disk image may no longer be maintained. If so, we will replace it with a custom cross-compiled kernel tuned for a hypothetical network appliance.

This exemplar kernel is not an ELF file, so import and analysis need some manual help.

  • The import process explicitly sets the processor on the command line: -processor RISCV:LE:64:RV64IC. This will likely be the same as the processor determined from imported kernel load modules.
  • Ghidra recognizes three sections, one text and two data. All three need to be moved to the offset suggested in the associated System.map file. For example, .text moves from 0x1000 to 0x80001000. Test this by verifying function start addresses identified in System.map look like actual RISCV-64 kernel functions. Most begin with 16 bytes of no-op instructions to support debugging and tracing operations.
  • Mark .text as code by selecting from 0x80001000 to 0x80dfffff and hitting the D key.

Verification

Verify that kernel code correctly references data:

  1. locate the address of panic in System.map: ffffffff80b6b188
  2. go to 0x80b6b188 in Ghidra and verify that this is a function
  3. display references to panic and examine the decompiler window.
 /* WARNING: Subroutine does not return */
  panic(s_Fatal_exception_in_interrupt_813f84f8);

Notes

This kernel includes 149 strings containing sifive, most of which also appear in System.map. It’s not immediately clear whether these indicate kernel mods by SiFive or a SiFive SDK kernel module compiled into the kernel.

The kernel currently includes a few RISCV instruction set extensions not handled by Ghidra, and possibly not even by binutils and the gas RISCV assembler. Current Linux kernels can bypass the standard assembler to insert custom or obscure privileged instructions.

This Linux kernel explicitly includes ISA extension code for processors that support those extensions. For example, if the kernel boots on a processor supporting the Zbb bit manipulation extension, the vanilla strlen, strcmp, and strncmp kernel functions are patched out to invoke strlen_zbb, strcmp_zbb, and strncmp_zbb respectively, as sketched below.
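
A minimal user-space C sketch of that dispatch idea, with hypothetical names - the real kernel patches the call sites in place through its alternatives mechanism rather than calling through a pointer:

#include <stddef.h>
#include <stdbool.h>

/* Portable fallback. */
static size_t strlen_generic(const char *s) {
    const char *p = s;
    while (*p) p++;
    return (size_t)(p - s);
}

/* Stand-in for the Zbb-accelerated variant; the real kernel version
   scans a doubleword at a time using Zbb's orc.b instruction. */
static size_t strlen_zbb(const char *s) {
    return strlen_generic(s);
}

/* The kernel learns this from the ISA string at boot; stubbed here. */
static bool cpu_has_zbb(void) {
    return false;
}

size_t (*kstrlen)(const char *) = strlen_generic;

void apply_alternatives(void) {      /* run once at boot */
    if (cpu_has_zbb())
        kstrlen = strlen_zbb;
}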

This kernel can support up to 64 discrete ISA extensions, of which about 30 are currently defined. It has some support for hybrid processors, where each of the hardware threads (aka ‘harts’) can support a different mix of ISA extensions.

Note: The combination of instruction set extensions and self-modifying privileged code makes for a fertile ground for Ghidra research. We can expect vector variants of memcpy inline expansion sometime in 2024, significantly complicating cyberanalysis of even the simplest programs.

Fedora kernel modules

Kernel modules are typically ELF files compiled as Position Independent Code, often using more varied ELF relocation types for dynamic loading and linking into kernel memory space. This study looks at the igc.ko kernel module for a type of Intel network interface device. Network device drivers can have some of the most time-critical and race-condition-rich behavior, making this class of driver a good exemplar.

RISCV relocation types found in this exemplar include:

R_RISCV_64(2), R_RISCV_BRANCH(16), R_RISCV_JAL(17), R_RISCV_CALL(18), R_RISCV_PCREL_HI20(23), R_RISCV_PCREL_LO12_I(24), R_RISCV_ADD32(35), R_RISCV_ADD64(36), R_RISCV_SUB32(39), R_RISCV_SUB64(40), R_RISCV_RVC_BRANCH(44), and R_RISCV_RVC_JUMP(45)

Verification

Open Ghidra’s Relocation Table window and verify that all relocations were applied.

Go to igc_poll, open a decompiler window, and export the function as igc_poll.c. Compare this file with the provided igc_poll_decompiled.c in the visual difftool of your choice (e.g. meld) and check for the presence of lines like:

netdev_printk(&_LC7,*(undefined8 *)(lVar33 + 8),"Unknown Tx buffer type\n");

This statement generates - and provides tests for - at least four relocation types.

Notes

The decompiler translates all fence instructions as fence(). This kernel module uses 8 distinct fence instructions to request memory barriers. The sleigh files should probably be extended to show either fence(1,5) or the Linux macro names given in linux/arch/riscv/include/asm/barrier.h.
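
For reference, those macro names wrap fence instructions roughly as follows - paraphrased from linux/arch/riscv/include/asm/barrier.h, and the exact definitions vary by kernel version:

/* The predecessor/successor sets name the accesses being ordered:
   i/o = device input/output, r/w = memory reads/writes. */
#define RISCV_FENCE(p, s) \
        __asm__ __volatile__ ("fence " #p "," #s : : : "memory")

#define mb()    RISCV_FENCE(iorw,iorw)  /* full barrier */
#define rmb()   RISCV_FENCE(ir,ir)      /* read barrier */
#define wmb()   RISCV_FENCE(ow,ow)      /* write barrier */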

Fedora system libraries

System libraries like libc.so and libssl.so typically link to versioned shareable object libraries like libc.so.6 and libssl.so.3.0.5. Ghidra imports RISCV system libraries well.

Relocation types observed include:

R_RISCV_64(2), R_RISCV_RELATIVE(3), R_RISCV_JUMP_SLOT(5), and R_RISCV_TLS_TPREL64(11)

R_RISCV_TLS_TPREL64 is currently unsupported by Ghidra; it occurs about 15 times in the libc.so.6 .got section and not at all in libssl.so.3.0.5. It shows up in multithreaded applications that use thread-local storage.
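
For context, thread-local storage in a shared object is what generates this relocation type - a minimal sketch, assuming a riscv64 cross-compiler like the one used elsewhere in this document:

/* tls_counter.c - build with something like:
 *   riscv64-unknown-linux-gnu-gcc -shared -fPIC -ftls-model=initial-exec \
 *       tls_counter.c -o libtlscounter.so
 * readelf -r on the result should show TLS relocations such as
 * R_RISCV_TLS_TPREL64 against the .got section. */
static __thread long counter;    /* one instance per thread */

long bump(void) {
    return ++counter;            /* thread-pointer-relative offset, resolved via the GOT */
}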

Fedora system executables

The ssh utility imports cleanly into Ghidra.

Relocation types observed include:

R_RISCV_64(2), R_RISCV_RELATIVE(3), R_RISCV_JUMP_SLOT(5)

Function thunks referencing external library functions do not automatically get the external function’s name propagated into the thunk’s name.

Locally built exemplars

Imported binaries are generally locked into a single platform and a single toolchain. The imported binaries above are built for a SiFive development board - a 64 bit RISCV processor supporting the Integer and Compressed instruction sets - with a gcc-13 toolchain. If we want some variation on that, say to look ahead at the challenges a gcc-14 toolchain might throw our way, we need to build our own exemplars.

Open source test suites can be a good source of feature-focused importable exemplars. To test Ghidra’s ability to import RISCV instruction set extensions, we can import many of the files from binutils-gdb/gas/testsuite/gas/riscv.

For example, most of the ratified set of RISCV vector instructions are used in vector-insns.s. If we assemble this with a gas assembler compatible with the -march=rv32ifv architecture we get an importable binary exemplar for those instructions. Even better, we can disassemble that exemplar with a compatible objdump and get the reference disassembly to compare against Ghidra’s disassembly. This gives us three kinds of insights into Ghidra’s import capabilities:

  1. When new instructions appear in the binutils gas main branch, they are good candidates for implementation in Ghidra within the next 12 months. This currently includes the approved vector, bit manipulation, cache management, and crypto extensions, plus about a dozen vendor-specific extensions from Alibaba’s THead RISCV server initiative.
  2. These exemplars drive extension of Ghidra’s RISCV sleigh files, both as new instruction definitions and as pcode semantics for display in the decompiler window.
  3. Disassembly of those exemplars with a current binutils objdump utility gives us a reference disassembly to compare with Ghidra’s. We can minimize arbitrary or erroneous Ghidra disassembly by comparing the two disassembler views. Ghidra and objdump have different goals, so we don’t need strict alignment of Ghidra with objdump.

Most exemplars appear as four related files. We can use the vector exemplar as an example.

  • The source file is riscv64/generated/assemblySamples/vector.S, copied from binutils-gdb/gas/testsuite/gas/riscv/vector-insns.s.
  • vector.S is assembled into riscv64/exemplars/vector.o
  • That assembly run generates the assembly output listing riscv64/exemplars/vector.log.
  • riscv64/exemplars/vector.o is finally processed by binutils objdump to generate the reference disassembly riscv64/exemplars/vector.objdump.

riscv64/exemplars/vector.o is then imported into the Ghidra exemplars project, where we can evaluate the import and disassembly results.

Assembly language exemplars usually don’t have any sensible decompilation. C or C++ language exemplars usually do, so that gives the test analyst more to work with.

Another example shows Ghidra’s difficulty with vector-optimized code. Compile this C code for the rv64gcv architecture (RISCV-64 with vector extensions) using the gcc-14 compiler suite released in May of 2024.

#include <stdio.h>
int main(int argc, char** argv){
    const int N = 1320;
    char s[N];
    for (int i = 0; i < N - 1; ++i)
        s[i] = i + 1;
    s[N - 1] = '\0';
    printf(s);
}

Ghidra’s 11.0 release decompiles this into:

/* WARNING: Control flow encountered unimplemented instructions */

void main(void)

{
  gp = &__global_pointer$;
                    /* WARNING: Unimplemented instruction - Truncating control flow here */
  halt_unimplemented();
}

Try the import again with the isa_ext experimental branch of Ghidra:

undefined8 main(void)

{
  undefined auVar1 [64];
  undefined8 uVar2;
  undefined (*pauVar3) [64];
  long lVar4;
  long lVar5;
  undefined auVar6 [256];
  undefined auVar7 [256];
  char local_540 [1319];
  undefined uStack_19;
  
  gp = &__global_pointer$;
  pauVar3 = (undefined (*) [64])local_540;
  lVar4 = 0x527;
  vsetvli_e32m1tama(0);
  auVar7 = vid_v();
  do {
    lVar5 = vsetvli(lVar4,0xcf);
    auVar6 = vmv1r_v(auVar7);
    lVar4 = lVar4 - lVar5;
    auVar6 = vncvt_xxw(auVar6);
    vsetvli(0,0xc6);
    auVar6 = vncvt_xxw(auVar6);
    auVar6 = vadd_vi(auVar6,1);
    auVar1 = vse8_v(auVar6);
    *pauVar3 = auVar1;
    uVar2 = vsetvli_e32m1tama(0);
    pauVar3 = (undefined (*) [64])(*pauVar3 + lVar5);
    auVar6 = vmv_v_x(lVar5);
    auVar7 = vadd_vv(auVar7,auVar6);
  } while (lVar4 != 0);
  uStack_19 = 0;
  printf(local_540,uVar2);
  return 0;
}

That Ghidra branch decompiles the function, but the decompilation listing only resembles the C source code if you are familiar with RISCV vector extension instructions.

Repeat the example, this time building with a gcc-13 compiler suite. Ghidra 11.0 does a fine job of decompiling this.

undefined8 main(void)
{
  long lVar1;
  char acStack_541 [1320];
  undefined uStack_19;
    gp = &__global_pointer$;
  lVar1 = 1;
  do {
    acStack_541[lVar1] = (char)lVar1;
    lVar1 = lVar1 + 1;
  } while (lVar1 != 0x528);
  uStack_19 = 0;
  printf(acStack_541 + 1);
  return 0;
}

Custom Linux kernel and kernel mods

The Fedora 39 disk image is a good exemplar of endpoint system code. We can supplement that with a custom kernel build. This gives us more flexibility and a peek into future system builds.

Building a custom kernel - with standard kernel modules - requires steps like these:

  1. Download the linux kernel source from https://github.com/torvalds/linux.git
    • This example currently uses the kernel development tip shortly after version 6.9 RC2
  2. Generate a new .config kernel configuration file with a command like:
    $ PATH=$PATH:/opt/riscvx/bin
    $ make ARCH=riscv CROSS_COMPILE=riscv64-unknown-linux-gnu- MY_CFLAGS='-march=rv64gcv_zba_zbb_zbc_zbkb_zbkc_zbkx_zvbb_zvbc' menuconfig
    
  3. In the menuconfig view, select the architecture-specific features we want to examine. This will likely include platform selections like Vector extension support and Zbb extension support. It may also include Cryptographic API selections like Accelerated Cryptographic Algorithms for CPU (riscv)
  4. Build the kernel and selected kernel modules with a gcc 14.0.0 riscv64 toolchain
    $ make ARCH=riscv CROSS_COMPILE=riscv64-unknown-linux-gnu- MY_CFLAGS='-march=rv64gcv_zba_zbb_zbc_zbkb_zbkc_zbkx_zvbb_zvbc' all
    
  5. Copy the selected vmlinux ELF files into the riscv64/exemplars directory:
    $ cp vmlinux ~/projects/github/ghidra_import_tests/riscv64/exemplars/vmlinux_6.9rc2
    $ cp arch/riscv/crypto/aes-riscv64-zvkned-zvbb-zvkg.o ~/projects/github/ghidra_import_tests/riscv64/exemplars/vmlinux_6.9rc2_aes-riscv64-zvkned-zvbb-zvkg.o
    

Analysis

Importing the custom vmlinux kernel into Ghidra 11.1-DEV(isa_ext) shows:

  • there are relatively few vector extension sequences in the kernel - 17 instances of vset*.
    • for example, __asm_vector_usercopy uses vector loads and stores to copy into user memory spaces.
  • there are Zbb variants - strcmp_zbb, strlen_zbb, and strncmp_zbb - which can be patched in at runtime on processors that support Zbb

Importing the aes-riscv64-zvkned-zvbb-zvkg.o object file - presumably available for use in loadable kernel crypto modules - shows:

  • two functions aes_xts_encrypt_zvkned_zvbb_zvkg and aes_xts_decrypt_zvkned_zvbb_zvkg
  • many vector, crypto, and bit manipulation extension instructions.

Commit logs for the Linux kernel sources suggest that the riscv vector crypto functions were derived from openssl source code, possibly intended for use in file system encryption and decryption.

x86_64 exemplars

A few x86_64 exemplars exist to explore the scope of issues raised by RISCV exemplars. The x86_64/exemplars directory shows how optimizing gcc-14 compilations handle simple loops and built-ins like memcpy for various microarchitectures.

Intel and AMD x86_64 microarchitectures are commonly grouped into profiles, or microarchitecture levels, like x86-64-v2, x86-64-v3, and x86-64-v4. Each level has its own set of instruction set extensions, so an optimizing compiler like gcc-14 will autovectorize loops and built-ins differently for each.
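
Gcc exposes each level’s features as predefined macros, which gives a quick way to confirm what a given -march level selects. A small C probe (the macros shown are standard gcc definitions; the level labels are the profiles each extension first appears in):

#include <stdio.h>

int main(void) {
#if defined(__AVX512F__)
    puts("x86-64-v4 class: AVX-512F available");   /* -march=x86-64-v4 */
#elif defined(__AVX2__)
    puts("x86-64-v3 class: AVX2 available");       /* -march=x86-64-v3 */
#elif defined(__SSE4_2__)
    puts("x86-64-v2 class: SSE4.2 available");     /* -march=x86-64-v2 */
#else
    puts("baseline x86-64");
#endif
    return 0;
}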

The memcpy exemplar set includes source code and three executables compiled from that source code with -march=x86-64-v2, -march=x86-64-v3, and -march=x86-64-v4. The binutils-2.41 objdump disassembly is provided for each executable, for comparison with Ghidra’s disassembly window.

x86_64/exemplars$ ls memcpy*
memcpy.c  memcpy_x86-64-v2  memcpy_x86-64-v2.objdump  memcpy_x86-64-v3  memcpy_x86-64-v3.objdump  memcpy_x86-64-v4  memcpy_x86-64-v4.objdump
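
The underlying source is essentially a thin wrapper around the builtin - a hypothetical minimal sketch, not necessarily the exact memcpy.c in the repository:

#include <stddef.h>
#include <string.h>

/* gcc -O3 with -march=x86-64-v2, -v3, or -v4 expands this builtin into
 * SSE4, AVX2, or AVX-512 copy sequences respectively; the three objdump
 * listings show the spread Ghidra must cope with. */
void copy_block(char *dst, const char *src, size_t n) {
    memcpy(dst, src, n);
}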

These exemplars suggest several Ghidra issues:

  • Ghidra’s disassembler fails to recognize many of the vector instructions generated by gcc-14 with -march=x86-64-v4 and -O3.
  • Ghidra’s decompiler gives the user little help in recognizing the semantics of memcpy or of many simple loops compiled with -march=x86-64-v2 or -march=x86-64-v3.
  • Ghidra users should be prepared for a wide variety of vector-optimized instruction sequences; pattern recognition will be difficult.

Custom exemplars

Not all RISCV instruction set extensions are standardized and supported by open source compiler suites. Vendors can generate their own custom extensions. These may be instructions proposed for standardization, instructions that predate standardized extensions and are effectively deprecated for new RISCV variants, or (potentially) instructions considered non-public licenseable intellectual property.

We have one example of a set of vendor-specific RISC-V extension exemplars that is pending classification. Some of the WCH QingKe 32 bit RISCV processors support what they call extended instructions, or XW instructions, such as c.lbu, c.lhu, c.sb, c.sh, c.lbusp, c.lhusp, c.sbsp, and c.shsp. The encoding of these custom instructions overlaps other standardized extensions like Zcd, while some of the instruction mnemonics overlap those of Zcb. There is no known evidence that these XW instructions are tracked for inclusion in binutils, as the full-custom extensions from Alibaba’s THead group are. There is no evidence that these XW instructions are considered licensable or proprietary to WCH (Nanjing Qinheng Microelectronics).

https://github.com/ArcaneNibble has generated a set of binary exemplars for this vendor custom extension. Naming conventions for full-custom extensions are very much To Be Determined. The RISCV binutils toolchain attaches an architecture tag to each ELF file it generates. For these binary exemplars that is:

Tag_RISCV_arch: "rv32i2p0_m2p0_a2p0_f2p0_c2p0_xw2p2"

That architecture tag implies the binaries are for a base RISCV 32 bit processor, with the standard compressed (c) extension version 2.0 and other standard extensions. The vendor custom (x) extension (w) version 2.2 (2p2) is enabled. The Zcd and Zcb extensions are explicitly not enabled, so there is no conflict with either assembly or disassembly of the instructions.

These exemplars are currently filed under riscv64/exemplars as:

custom
└── wch
    ├── lbu.S
    ├── lbusp.S
    ├── lhu.S
    ├── lhusp.S
    ├── sb.S
    ├── sbsp.S
    ├── sh.S
    ├── shsp.S
    ├── w2p2-lbu.o
    ├── w2p2-lbusp.o
    ├── w2p2-lhu.o
    ├── w2p2-lhusp.o
    ├── w2p2-sb.o
    ├── w2p2-sbsp.o
    ├── w2p2-sh.o
    └── w2p2-shsp.o

1 - Whisper_cpp

Explore analysis of a machine learning application built with large language model techniques. What Ghidra gaps does such an analysis reveal?

How might we inspect a machine-learning application for malware? For example, suppose someone altered the automatic speech recognition library whisper.cpp. Would Ghidra be able to cope with the instruction set extensions used to accelerate ML inference engines? What might be added to Ghidra to help the human analyst in this kind of inspection?

Components for this exercise:

  • A Linux x86_64 Fedora 39 base system
  • Ghidra 11.0 public
  • Ghidra 11.1-DEV with the isa_ext branch for RISCV-64 support
  • A stripped target binary whisper_cpp_vendor built with a RISCV-64 gcc-14 toolchain and the whisper.cpp 1.5.4 release.
    • RISCV-64 vector and other approved extensions are enabled for this build
    • published binutils-2.41 vendor-specific extensions are enabled for this build
    • whisper library components are statically linked, while system libraries are dynamically linked
  • Reference binaries whisper_cpp_* built locally with other RISCV-64 gcc toolchains
  • Ghidra’s BSIM binary similarities plugins and analytics

Questions to address:

  • does the presence of vector and other ISA extensions in whisper_cpp_vendor materially hurt Ghidra 11.0 analysis?
  • can BSIM analytics still find similarities between whisper_cpp_vendor and the non-vector build whisper_cpp_default?
  • are there recurring vector instruction patterns present in whisper_cpp_vendor that Ghidra users should be able to recognize?
  • are there additional instructions or instruction-semantics that we should add to the isa_ext branch?
  • if the vendor adds Link Time Optimization to their whisper_cpp_vendor build, does this materially hurt Ghidra 11.0 analysis?

There are a lot of variables in this exercise. Some are important, most are not.

Baseline Ghidra analysis

Starting with the baseline Ghidra 11.0, examine a locally built whisper_cpp_default, an ELF 64-bit LSB executable built with gcc-13.2.1. Import and perform standard analyses to get these statistics:

  • 186558 instructions recognized
  • text segment size 0x8678a
  • 12 bad instruction errors, all of which appear to be the fence.tso instruction extension

Now examine whisper_cpp_vendor (built with gcc 14 rather than gcc 13) with the baseline Ghidra 11.0:

  • 100521 instructions recognized
  • text segment size 0xb93cc
  • 4299 bad instruction errors

Examine whisper_cpp_vendor with the isa_ext branch of 11.1-DEV:

  • 169813 instructions recognized
  • text segment size 0xb93cc
  • 17 bad instruction errors, all of which appear to be the fence.tso instruction extension

Next apply a manual correction to whisper_cpp_vendor, selecting the entire .text segment and forcing disassembly, then clearing any unreachable 0x00 bytes.

  • 190311 instructions recognized
  • 17 bad instruction errors
  • 4138 vset* instructions, of the kind usually found in vector code
  • 946 gather instructions
  • 3562 custom instructions

Finally, reset the ’language’ of whisper_cpp_vendor to match the vendor (THead, for this exercise).

The 3562 custom instructions resolve to:

Instruction   Count   Semantics
th.ext*         151   Sign extract and extend
th.ldd         1719   Load 2 doublewords
th.lwd           10   Load 2 words
th.sdd         1033   Store 2 doublewords
th.swd           16   Store 2 words
th.mula         284   Multiply-add
th.muls          67   Multiply-subtract
th.mveqz        127   Move if == 0
th.mvneqz       154   Move if != 0

This leads to some tentative next steps:

  1. Adding fence.tso to Ghidra looks like a simple small win, and a perfect place to start.
  2. The THead vendor-specific extensions look like simple peep-hole optimizations. The semantics could easily be added to Ghidra as compositions of two existing instruction semantics (see the sketch after this list). Slightly less than 2% of the total instructions are THead vendor customizations.
  3. The baseline Ghidra 11.0 stalls out very quickly on the vector instructions, making an early switch to the isa_ext branch necessary.
  4. The vector gather instructions are unexpectedly prevalent.
  5. Manual inspection and sampling of the 4138 vset* instruction blocks may reveal some key patterns to recognize first.
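
As an illustration of item 2, the THead load-pair semantics amount to two adjacent loads. A hedged C sketch of the concept - operand and immediate encoding details are simplified; see the XTheadMemPair documentation:

#include <stdint.h>

/* th.ldd rd1, rd2, (rs1), imm2: load two consecutive doublewords
   starting at rs1 + (imm2 << 4) */
static inline void th_ldd(uint64_t *rd1, uint64_t *rd2,
                          const uint64_t *rs1, unsigned imm2) {
    *rd1 = rs1[imm2 * 2];
    *rd2 = rs1[imm2 * 2 + 1];
}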

Note: fence.tso is now recognized in the Ghidra 11.1-DEV branch isa_ext, clearing the bad instruction errors.

A top-down assessment

At the highest level, what features of whisper.cpp generate vector instructions?

  • There are about 400 invocations of RISCV vector intrinsics within ggml-quants.c. In these cases the developer has explicitly managed the vectorization.
  • There are an unknown number of automatic loop vectorizations, where gcc-14 has replaced simple scalar loops with vector-based loops. This vectorization will generally reduce the number of loop iterations, but may not always reduce the number of instructions executed.
  • GCC expansion of memcpy calls or structure copies into vector load-store loops.

Much of whisper.cpp involves vector, matrix, or tensor math using ggml math functions. This is also where most of the explicit RISCV vector intrinsic C functions appear, and likely the code the developer believes is most in need of vector performance enhancements.

Example: dot product

ggml_vec_dot_f32(n, sum, x, y) generates the vector dot product of two vectors x and y of length n with the result stored to *sum. In the absence of vector or SIMD support the source code is:

// scalar
    double sumf = 0.0;
    for (int i = 0; i < n; ++i) {
        sumf += (double)(x[i]*y[i]);
    }
    *s = sumf;

GCC-14 will autovectorize this into something Ghidra decompiles like this (comments added after //):

void ggml_vec_dot_f32(long n,float *s,float *x,float *y)

{
  long step;
  double dVar1;
  undefined auVar2 [256];
  undefined in_v2 [256];
  undefined auVar3 [256];
  
  gp = &__global_pointer$;
  if (0 < n) {
    vsetvli_e64m1tama(0);
    vmv_v_i(in_v2,0);                  // v2 = 0
    do {
      step = vsetvli(n,0x97);          // vsetvli a5,a0,e32,mf2,tu,ma
      n = n - step;
      auVar3 = vle32_v(x);             // v3 = *x (slice of size step)
      auVar2 = vle32_v(y);             // v1 = *y (slice of size step)
      x = (float *)sh2add(step,x);      // x = x + step
      auVar2 = vfmul_vv(auVar2,auVar3); // v1 = v1 * v3
      y = (float *)sh2add(step,y);      // y = y + step
      in_v2 = vfwadd_wv(in_v2,auVar2);  // v2 = v1 + v2
    } while (n != 0);
    vsetvli_e64m1tama(0);
    auVar2 = vfmv_sf(0);                 // v1[0] = 0
    auVar2 = vfredusum_vs(in_v2,auVar2); // v1[0] = sum(v2)
    dVar1 = (double)vfmv_fs(auVar2);     // dvar1 = v1[0]
    *s = (float)dVar1;
    return;
  }
  *s = 0.0;
  return;
}

Inspecting this disassembly and decompilation suggests several top down issues:

  • The semantics for sh2add are simple and should be explicit: sh2add(a, b) = (a << 2) + b (see the sketch after this list)
    • This is now implemented in Ghidra 11.1-DEV isa_ext.
  • The vsetvli(n,0x97) instruction should be expanded to show its semantics as vsetvli_e32mf2tuma
    • Running the binary through a RISCV objdump program gives us this formal expansion. It says the selected element width is 32 bits with an LMUL multiplication factor of 1/2, meaning only half of each vector register is used, leaving room for the 64 bit arithmetic output.
    • This is now implemented in Ghidra 11.1-DEV isa_ext.
  • The semantics for vector results need clarification
  • The loop accumulates 64 bit double values from 32 bit inputs. If the vector register length is 256 bits, that means the step size is 4, not 8
  • A capability to generate processor-specific inline hints or comments in the decompiler may be useful, especially if there were a typographic way to distinguish vector and scalar objects.
  • If vector registers were infinitely long the loop might become v2 = x * y and the reduction dvar1 = reduce(+, v2)
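
The sh2add fix mentioned in the first item is tiny; in C terms:

#include <stdint.h>

/* sh2add (Zba): rd = (rs1 << 2) + rs2. In the loop above it advances
   a float pointer by 'step' elements, i.e. step*4 bytes. */
static inline uint64_t sh2add(uint64_t rs1, uint64_t rs2) {
    return (rs1 << 2) + rs2;
}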

The path forward may be to manually analyze several examples from whisper.cpp, extending and revising Ghidra’s semantics and decompiler to add a bit of clarity each time.

Example: auto-vectorization makes the simple complicated

Autovectorization can generate complicated code when the compiler has no knowledge of the number of elements in a vector or of how many elements fit within a single vector register.

A good example is from:

ggml_tensor * ggml_new_tensor_impl(
        struct ggml_context * ctx,
        enum   ggml_type      type,
        int                   n_dims,
        const int64_t       * ne,
        struct ggml_tensor  * view_src,
        size_t                view_offs) {
...
         size_t data_size = ggml_row_size(type, ne[0]);
         for (int i = 1; i < n_dims; i++) {
            data_size *= ne[i];
         }
}

The ne vector typically has at most 4 elements, so the vectorized pair-processing loop will execute at most once. The compiler doesn’t know this, so it autovectorizes the loop into something more complex:

undefined4 * ggml_new_tensor(ggml_context *ctx,undefined8 type,long ndims,int64_t *ne)

{
...
  data_size = ggml_row_size(type,*ne);              // get the first dimension ne[0]
  lVar6 = 1;
  if (1 < ndims) {
    uVar2 = (int)ndims - 1;
    if (1 < (int)ndims - 2U) {                      // if ndims > 3 process two at a time
      piVar7 = ne + 1;                              // starting with ne[1] and ne[2]
      piVar4 = piVar7 + (long)(int)(uVar2 >> 1) * 2;
      vsetivli_e64m1tamu(2);                        //vector length = 2, 64 bit element, tail agnostic mask unchanged
      vmv_v_i(in_v1,1);                             // v1 = (1,1)
      do {
        auVar10 = vle64_v(piVar7);
        piVar7 = piVar7 + 2;
        in_v1 = vmul_vv(in_v1,auVar10);              // v1 = v1 * ne[slice]
      } while (piVar4 != piVar7);
      auVar10 = vid_v();                             // v2 = (0,1)
      vmv_v_i(in_v4,0);                              // v4 = (0,0)
      auVar11 = vadd_vi(auVar10,1);                  // v2 = v2 + 1 = (1,2)
      auVar10 = vmsgtu_vi(auVar11,1);                // v0 = (v2 > 1) = (0, 1)
      vrgather_vv(in_v1,auVar11);                    // v3 = gather(v1, v2) => v3=v1[v2] = (v1[1], 0)
      auVar11 = vadd_vi(auVar11,0xfffffffffffffffe); // v2 = v2 - 2 = (-1,0)
      auVar10 = vrgather_vv(in_v4,auVar11,auVar10);  // v3 = gather_masked(v4,v2,v0.t) = (v3[0], v4[0])
      auVar10 = vmul_vv(auVar10,in_v1);              // v3 = v3 * v1
      vmv_x_s(in_v14,auVar10);                       // a4 = v3[0]
      data_size = data_size * (long)piVar4;          // data_size = data_size * a4
      if ((uVar2 & 1) == 0) goto LAB_00074a80;
      lVar6 = (long)(int)((uVar2 & 0xfffffffe) + 1);
    }
    plVar5 = (long *)sh3add(lVar6,ne);               // multiply by one or two 
    data_size = data_size * *plVar5;                 // 
    if ((int)lVar6 + 1 < ndims) {
      data_size = data_size * plVar5[1];
    }
  }
...
}

That’s a very confusing way to multiply at most four integers. If ne has 1, 2, or 3 elements then no vector instructions are processed at all. If it has 4 elements then the first and last one or two are handled with scalar math while pairs of elements are accumulated in the loop. The gather instructions are used together to generate a mask and then multiply the two elements of vector v1, leaving the result in the first element slot of vector v4.
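
To make that concrete, here is a hypothetical scalar rendering of what the decompiled ndims > 3 path computes - same result as the original for loop, assuming the decompilation above is faithful:

#include <stdint.h>

/* Multiply data_size by ne[1..ndims-1]: pairs first, scalar tail last. */
uint64_t scale_data_size(uint64_t data_size, const int64_t *ne, int ndims) {
    uint64_t lane0 = 1, lane1 = 1;     /* the two elements of v1 */
    int i = 1;
    for (; i + 1 < ndims; i += 2) {    /* the vle64_v/vmul_vv loop over pairs */
        lane0 *= (uint64_t)ne[i];
        lane1 *= (uint64_t)ne[i + 1];
    }
    data_size *= lane0 * lane1;        /* the vrgather/vmul/vmv_x_s reduction */
    for (; i < ndims; i++)             /* the sh3add scalar tail */
        data_size *= (uint64_t)ne[i];
    return data_size;
}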

This particular loop vectorization is likely to change a lot in future releases. The performance impact is negligible either way. The analyst may look at code like this and decide to ignore the ndims>3 case along with all of the vector instructions used within it. Alternatively, we could look at the gcc vectorization code handling the general vector reduction meta operation, then see if this pattern is a macro of some sort within it.

Take a step back and look at the gcc RISCV autovectorization code. It’s changing quite frequently, so it’s probably premature to try to abstract out loop reduction models for Ghidra to recognize. Once it stabilizes, we might draw source exemplars from gcc/gcc/testsuite/gcc.target/riscv/rvv/autovec and build a catalog of source-pattern to instruction-pattern expansions.

Example: source code use of RISCV vector intrinsics

The previous example showed an overly aggressive autovectorization of a simple loop. Here we look at source code that the developer has decided is important enough to directly code in RISCV intrinsic C functions. The function ggml_vec_dot_q5_0_q8_0 is one such function, with separate implementations for ARM_NEON, wasm_simd128, AVX2, AVX, and riscv_v_intrinsic. If none of those accelerators are available a scalar implementation is used instead:

void ggml_vec_dot_q5_0_q8_0(const int n, float * restrict s, const void * restrict vx, const void * restrict vy) {
    const int qk = QK8_0;
    const int nb = n / qk;

    assert(n % qk == 0);
    assert(qk == QK5_0);

    const block_q5_0 * restrict x = vx;
    const block_q8_0 * restrict y = vy;

    // scalar
    float sumf = 0.0;

    for (int i = 0; i < nb; i++) {
        uint32_t qh;
        memcpy(&qh, x[i].qh, sizeof(qh));

        int sumi = 0;

        for (int j = 0; j < qk/2; ++j) {
            const uint8_t xh_0 = ((qh & (1u << (j + 0 ))) >> (j + 0 )) << 4;
            const uint8_t xh_1 = ((qh & (1u << (j + 16))) >> (j + 12));

            const int32_t x0 = ((x[i].qs[j] & 0x0F) | xh_0) - 16;
            const int32_t x1 = ((x[i].qs[j] >>   4) | xh_1) - 16;

            sumi += (x0 * y[i].qs[j]) + (x1 * y[i].qs[j + qk/2]);
        }

        sumf += (GGML_FP16_TO_FP32(x[i].d)*GGML_FP16_TO_FP32(y[i].d)) * sumi;
    }

    *s = sumf;
}

The RISCV intrinsic source is:

Note: added comments are flagged with ///

void ggml_vec_dot_q5_0_q8_0(const int n, float * restrict s, const void * restrict vx, const void * restrict vy) {
    const int qk = QK8_0;  /// QK8_0 = 32
    const int nb = n / qk;

    assert(n % qk == 0);
    assert(qk == QK5_0);   /// QK5_0 = 32

    const block_q5_0 * restrict x = vx;
    const block_q8_0 * restrict y = vy;

    float sumf = 0.0;

    uint32_t qh;

    size_t vl = __riscv_vsetvl_e8m1(qk/2);

    // These temporary registers are for masking and shift operations
    vuint32m2_t vt_1 = __riscv_vid_v_u32m2(vl);
    vuint32m2_t vt_2 = __riscv_vsll_vv_u32m2(__riscv_vmv_v_x_u32m2(1, vl), vt_1, vl);

    vuint32m2_t vt_3 = __riscv_vsll_vx_u32m2(vt_2, 16, vl);
    vuint32m2_t vt_4 = __riscv_vadd_vx_u32m2(vt_1, 12, vl);

    for (int i = 0; i < nb; i++) {
        memcpy(&qh, x[i].qh, sizeof(uint32_t));

        // ((qh & (1u << (j + 0 ))) >> (j + 0 )) << 4;
        vuint32m2_t xha_0 = __riscv_vand_vx_u32m2(vt_2, qh, vl);
        vuint32m2_t xhr_0 = __riscv_vsrl_vv_u32m2(xha_0, vt_1, vl);
        vuint32m2_t xhl_0 = __riscv_vsll_vx_u32m2(xhr_0, 4, vl);

        // ((qh & (1u << (j + 16))) >> (j + 12));
        vuint32m2_t xha_1 = __riscv_vand_vx_u32m2(vt_3, qh, vl);
        vuint32m2_t xhl_1 = __riscv_vsrl_vv_u32m2(xha_1, vt_4, vl);

        // narrowing
        vuint16m1_t xhc_0 = __riscv_vncvt_x_x_w_u16m1(xhl_0, vl);
        vuint8mf2_t xh_0 = __riscv_vncvt_x_x_w_u8mf2(xhc_0, vl);

        vuint16m1_t xhc_1 = __riscv_vncvt_x_x_w_u16m1(xhl_1, vl);
        vuint8mf2_t xh_1 = __riscv_vncvt_x_x_w_u8mf2(xhc_1, vl);

        // load
        vuint8mf2_t tx = __riscv_vle8_v_u8mf2(x[i].qs, vl);

        vint8mf2_t y0 = __riscv_vle8_v_i8mf2(y[i].qs, vl);
        vint8mf2_t y1 = __riscv_vle8_v_i8mf2(y[i].qs+16, vl);

        vuint8mf2_t x_at = __riscv_vand_vx_u8mf2(tx, 0x0F, vl);
        vuint8mf2_t x_lt = __riscv_vsrl_vx_u8mf2(tx, 0x04, vl);

        vuint8mf2_t x_a = __riscv_vor_vv_u8mf2(x_at, xh_0, vl);
        vuint8mf2_t x_l = __riscv_vor_vv_u8mf2(x_lt, xh_1, vl);

        vint8mf2_t x_ai = __riscv_vreinterpret_v_u8mf2_i8mf2(x_a);
        vint8mf2_t x_li = __riscv_vreinterpret_v_u8mf2_i8mf2(x_l);

        vint8mf2_t v0 = __riscv_vsub_vx_i8mf2(x_ai, 16, vl);
        vint8mf2_t v1 = __riscv_vsub_vx_i8mf2(x_li, 16, vl);

        vint16m1_t vec_mul1 = __riscv_vwmul_vv_i16m1(v0, y0, vl);
        vint16m1_t vec_mul2 = __riscv_vwmul_vv_i16m1(v1, y1, vl);

        vint32m1_t vec_zero = __riscv_vmv_v_x_i32m1(0, vl);

        vint32m1_t vs1 = __riscv_vwredsum_vs_i16m1_i32m1(vec_mul1, vec_zero, vl);
        vint32m1_t vs2 = __riscv_vwredsum_vs_i16m1_i32m1(vec_mul2, vs1, vl);

        int sumi = __riscv_vmv_x_s_i32m1_i32(vs2);

        sumf += (GGML_FP16_TO_FP32(x[i].d)*GGML_FP16_TO_FP32(y[i].d)) * sumi;
    }

    *s = sumf;
}

Ghidra’s 11.1 isa_ext rendering of this is (after minor parameter name propagation):

long ggml_vec_dot_q5_0_q8_0(ulong n,float *s,void *vx,void *vy)

{
  ushort *puVar1;
  long lVar2;
  long lVar3;
  long lVar4;
  long lVar5;
  undefined8 uVar6;
  int i;
  float fVar7;
  undefined auVar8 [256];
  undefined auVar9 [256];
  undefined auVar10 [256];
  undefined auVar11 [256];
  undefined in_v7 [256];
  undefined in_v8 [256];
  undefined auVar12 [256];
  undefined auVar13 [256];
  undefined auVar14 [256];
  undefined auVar15 [256];
  undefined auVar16 [256];
  int iStack_4;
  
  gp = &__global_pointer$;
  uVar6 = vsetivli(0x10,0xc0);
  vsetvli(uVar6,0xd1);
  auVar13 = vid_v();
  vmv_v_i(in_v8,1);
  auVar15 = vadd_vi(auVar13,0xc);
  auVar12 = vsll_vv(in_v8,auVar13);
  auVar14 = vsll_vi(auVar12,0x10);
  if (0x1f < (long)n) {
    fVar7 = 0.0;
    vsetvli_e32m1tama(uVar6);
    lVar3 = (long)vx + 2;
    lVar4 = (long)vy + 2;
    i = 0;
    vmv_v_i(in_v7,0);
    vsetivli(4,0xc6);
    do {
      auVar8 = vle8_v(lVar3);
      vse8_v(auVar8,&iStack_4);
      puVar1 = (ushort *)(lVar4 + -2);
      vsetvli(uVar6,0xd1);
      lVar2 = lVar3 + 4;
      auVar8 = vle8_v(lVar2);
      auVar9 = vand_vx(auVar12,(long)iStack_4);
      auVar9 = vsrl_vv(auVar9,auVar13);
      vsetvli(0,199);
      auVar11 = vand_vi(auVar8,0xf);
      vsetvli(0,0xd1);
      auVar9 = vsll_vi(auVar9,4);
      vsetvli(0,199);
      auVar8 = vsrl_vi(auVar8,4);
      vsetvli(0,200);
      auVar9 = vncvt_xxw(auVar9);
      auVar16 = vle8_v(lVar4);
      vsetvli(0,199);
      auVar9 = vncvt_xxw(auVar9);
      vsetvli(0,0xd1);
      auVar10 = vand_vx(auVar14,(long)iStack_4);
      vsetvli(0,199);
      auVar11 = vor_vv(auVar11,auVar9);
      vsetvli(0,0xd1);
      auVar9 = vsrl_vv(auVar10,auVar15);
      vsetvli(0,199);
      auVar10 = vadd_vi(auVar11,0xfffffffffffffff0);
      vsetvli(0,200);
      auVar9 = vncvt_xxw(auVar9);
      vsetvli(0,199);
      auVar10 = vwmul_vv(auVar10,auVar16);
      auVar9 = vncvt_xxw(auVar9);
      vsetvli(0,200);
      lVar5 = lVar4 + 0x10;
      auVar10 = vwredsum_vs(auVar10,in_v7);
      vsetvli(0,199);
      auVar8 = vor_vv(auVar8,auVar9);
      auVar9 = vle8_v(lVar5);
      auVar8 = vadd_vi(auVar8,0xfffffffffffffff0);
      auVar8 = vwmul_vv(auVar8,auVar9);
      vsetvli(0,200);
      auVar8 = vwredsum_vs(auVar8,auVar10);
      vsetivli(4,0xd0);
      vmv_x_s(auVar15,auVar8);
      i = i + 1;
      lVar4 = lVar4 + 0x22;
      fVar7 = (float)(&ggml_table_f32_f16)[*puVar1] *
              (float)(&ggml_table_f32_f16)[*(ushort *)(lVar3 + -2)] * (float)(int)lVar5 + fVar7;
      lVar3 = lVar3 + 0x16;
    } while (i < (int)(((uint)((int)n >> 0x1f) >> 0x1b) + (int)n) >> 5);
    *s = fVar7;
    return lVar2;
  }
  *s = 0.0;
  return n;
}

It looks like the developer unrolled an inner loop and used the LMUL multiplier to help reduce the loop iterations. The immediate action item for us may be to add more explicit decodings for vsetvli and vsetivli, or look for existing processor-specific decoders in the Ghidra decompiler.
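
As a starting point, the vtype immediate is easy to decode by hand. A small C sketch following the RVV 1.0 vtype layout (vlmul in bits 2:0, vsew in bits 5:3, vta in bit 6, vma in bit 7) reproduces the encodings seen in the listings above:

#include <stdio.h>

static const char *LMUL[8] = {"m1","m2","m4","m8","?","mf8","mf4","mf2"};

static void decode_vtype(unsigned v) {
    printf("0x%02x -> e%u, %s, %s, %s\n", v,
           8u << ((v >> 3) & 7),       /* selected element width */
           LMUL[v & 7],                /* register group multiplier */
           (v & 0x40) ? "ta" : "tu",   /* tail agnostic / undisturbed */
           (v & 0x80) ? "ma" : "mu");  /* mask agnostic / undisturbed */
}

int main(void) {
    decode_vtype(0x97);   /* e32, mf2, tu, ma - the dot product loop */
    decode_vtype(0xd1);   /* e32, m2, ta, ma - matches the u32m2 intrinsics */
    decode_vtype(0xc7);   /* e8, mf2, ta, ma - shown as 199 in the listing */
    decode_vtype(0xc8);   /* e16, m1, ta, ma - shown as 200 in the listing */
    return 0;
}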

x86_64 whisper

Let’s take a glance at the x86_64 build of whisper. First copy whisper-cpp.BUILD into the x86_64 workspace then build the executable with two architectures:

$ bazel build --platforms=//platforms:x86_64_default --copt="-march=x86-64-v3" @whisper_cpp//:main
...
$ cp bazel-bin/external/whisper_cpp/main ../exemplars/whisper_cpp_x86-64-v3
...
$ bazel build --platforms=//platforms:x86_64_default --copt="-march=x86-64-v4" @whisper_cpp//:main
...
$ cp bazel-bin/external/whisper_cpp/main ../exemplars/whisper_cpp_x86-64-v4

Load these into Ghidra 11.1-DEV. The x86-64-v4 build is useless in Ghidra: that newer microarchitecture level uses a class of x86_64 vector extensions (AVX-512) that Ghidra doesn’t recognize. The x86-64-v3 build looks accessible.

Try an x86_64 build with the local compiler (Fedora 39 default compiler) and Link Time Optimization enabled:

$ bazel build  --copt="-march=x86-64-v3" --copt="-flto"  --linkopt="-Wl,-flto" @whisper_cpp//:main
...
$ cp bazel-bin/external/whisper_cpp/main ../exemplars/whisper_cpp_x86-64-v3-lto

We’ll leave the differential analysis of link time optimization for another day. A couple of quick notes are worthwhile here:

  • The function ggml_new_tensor no longer exists in the binary. Instead we get ggml_new_tensor_impl.constprop.0, ggml_new_tensor_impl.constprop.2, and ggml_new_tensor_impl.constprop.3. This suggests BSIM could get confused by intermediate functions when trying to connect binaries built with and without LTO.
  • None of the hermetic toolchains appears to work when link time optimization is requested; at least one LTO plugin seems to be missing from the gcc-14 toolchain packaging. We’ll try to remedy that for the next snapshot of gcc-14.

2 - Data Plane Development Kit

Intel’s DPDK framework supports some Intel and Arm Neon vector instructions. What does it do with RISCV extensions? Are ISA extensions materially useful to a network appliance?

Check out DPDK from GitHub, patch the RISCV configuration with riscv64/generated/dpdk/dpdk.pat, and crosscompile with meson and ninja. Copy some of the examples into riscv64/exemplars and examine them in Ghidra.

  • for this we use as many standard extensions as we can, excluding vendor-specific extensions.

Configure a build directory:

$ patch -p1 < .../riscv64/generated/dpdk/dpdk.pat
$ meson setup build --cross-file config/riscv/riscv64_linux_gcc -Dexamples=all
$ cd build

Edit build/build.ninja:

  • replace all occurrences of -ldl with /opt/riscvx/lib/libdl.a - you should see about 235 replacements

Build with:

$ ninja -C build

Check the cross-compilation with:

$ readelf -A build/examples/dpdk-l3fwd
Attribute Section: riscv
File Attributes
  Tag_RISCV_stack_align: 16-bytes
  Tag_RISCV_arch: "rv64i2p1_m2p0_a2p1_f2p2_d2p2_c2p0_v1p0_zicsr2p0_zifencei2p0_zmmul1p0_zba1p0_zbb1p0_zbc1p0_zbkb1p0_zbkc1p0_zbkx1p0_zvbb1p0_zvbc1p0_zve32f1p0_zve32x1p0_zve64d1p0_zve64f1p0_zve64x1p0_zvkb1p0_zvl128b1p0_zvl32b1p0_zvl64b1p0"
  Tag_RISCV_priv_spec: 1
  Tag_RISCV_priv_spec_minor: 11

Import into Ghidra 11.1-DEV(isa_ext), noting multiple import messages similar to:

  • Unsupported Thread-Local Symbol not loaded
  • ELF Relocation Failure: R_RISCV_COPY (4, 0x4) at 00f0c400 (Symbol = stdout) - Runtime copy not supported

Analysis

Source code examination

Explicit vectorization

The source code has explicit parallel coding for Intel and Arm/Neon architectures, for instance within the basic layer 3 forwarding example:

struct acl_algorithms acl_alg[] = {
        {
                .name = "scalar",
                .alg = RTE_ACL_CLASSIFY_SCALAR,
        },
        {
                .name = "sse",
                .alg = RTE_ACL_CLASSIFY_SSE,
        },
        {
                .name = "avx2",
                .alg = RTE_ACL_CLASSIFY_AVX2,
        },
        {
                .name = "neon",
                .alg = RTE_ACL_CLASSIFY_NEON,
        },
        {
                .name = "altivec",
                .alg = RTE_ACL_CLASSIFY_ALTIVEC,
        },
        {
                .name = "avx512x16",
                .alg = RTE_ACL_CLASSIFY_AVX512X16,
        },
        {
                .name = "avx512x32",
                .alg = RTE_ACL_CLASSIFY_AVX512X32,
        },
};

If avx512x32 is selected, then basic trie search and other operations can proceed across 32 flows in parallel. Other examples exist within the code; for more information, search the doc directory for SIMD. Check out lib/fib/trie_avx512.c and the function trie_vec_lookup_x16x2 for a manual AVX-512 vectorization of a trie address-to-next-hop lookup.

Note: It won’t be clear for some time which vectorization transforms actually improve performance on specific processors. Vector support adds a lot of local register space but vector loads and stores can saturate memory bandwidth and drive up processor temperature. We might see earlier adoption in contexts that tolerate higher latency, like firewalls, rather than low-latency switches and routers.

Explicit ML support

DPDK includes ML contributions from Marvell. See doc/guides/mldevs/cnxk.rst for more information and references to the cnxk support. Source code support exists under drivers/ml/cnxk and lib/mldev. lib/mldev/rte_mldev.c may provide some insight into how Marvell expects DPDK users to apply their component. See the Marvell Octeon 10 white paper for some possible applications.

Ghidra analysis

The DPDK exemplars stress test Ghidra in multiple ways:

  • When compiled with RISCV-64 vector and bit manipulation extension support, you get a good mix of autovectorized instruction sequences.
  • There are a number of unsupported thread-local relocations requested
  • ELF relocation failures are reported for R_RISCV_COPY, claiming “Runtime copy is not supported”.
    • this relocation type apparently asks for a symbol to be copied from a shareable object into an executable.
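
For context, a minimal sketch of the kind of source that produces R_RISCV_COPY - a non-PIE executable referencing a data object owned by a shared library:

#include <stdio.h>

/* 'stdout' is a data symbol owned by libc. In a non-PIE link the
   executable reserves its own copy in .bss, and the loader fills it
   in at startup via the COPY relocation. */
int main(void) {
    fprintf(stdout, "hello\n");
    return 0;
}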