1 - Overview

We can help Ghidra import newer binaries by collecting samples of those binaries. The scope is limited to RISCV-64 Linux-capable processors one might find in smart network appliances or Machine Learning inference engines.

Note: This proof-of-concept project focuses on a single processor family, RISCV. Some results are checked against equivalent x86_64 processors to see whether pending issues are limited in scope or likely to affect a larger community.

This project collects files that may stress - in a good way - Ghidra’s import, disassembly, and decompilation capabilities. Others are doing a great job extending Ghidra’s ability to import and recognize C++ structures and classes, so we will focus on lower level objects like instruction sets, relocation codes, and pending toolchain improvements. The primary CPU family will be based on the RISCV-64 processor. This processor is relatively new and easily modified, so it will likely show lots of new features early. Not all of these new features will make it into common use or arenas in which Ghidra is necessary, so we don’t really know how much effort is worth spending on any given feature.

Because the RISCV processor family is relatively new, we can expect compiler and toolchain support to evolve more rapidly than for established families like x86_64. That means RISCV appliances may be more likely to be built with newer compiler toolchains than x86_64 appliances.

There are two key goals here:

  1. Experiment with Ghidra import integration tests that can detect Ghidra regressions. This involves collecting a number of processor and toolchain binary exemplars to be imported, plus analysis scripts to verify those import results remain valid. Example: verify that ELF relocation codes are properly handled when importing a RISCV-64 kernel module (a test sketch follows this list). These integration tests should always pass after changes to Ghidra's source code.
  2. Collect feature-specific binary exemplars that might highlight emergent gaps in Ghidra’s import processes. Ghidra will usually fail to properly import these exemplars, allowing the Ghidra development team to triage the gap and evaluate options for closing it. Example: pass the RISCV instruction set extension testsuite from binutils/gas into Ghidra to test whether Ghidra can recognize all of the new instructions gas can generate.
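
As a minimal sketch - with hypothetical result file names and JSON layout - such an integration test might look like:

#!/usr/bin/env python3
"""Sketch of a relocation regression test; file names and JSON layout
are hypothetical."""
import json
import unittest

class RelocationRegressionTest(unittest.TestCase):
    def test_kernel_module_relocations(self):
        # relocation values collected by a Ghidra post-analysis script
        with open('testresults/igc_ko_relocations.json') as f:
            observed = json.load(f)
        # hand-verified expected values checked into the repository
        with open('testresults/expected_igc_ko_relocations.json') as f:
            expected = json.load(f)
        for name, value in expected.items():
            print('inspecting the %s test' % name)
            self.assertEqual(observed.get(name), value,
                             '%s relocation no longer imports correctly' % name)

if __name__ == '__main__':
    unittest.main()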

A secondary goal developed during testing: explore the impact on Ghidra users of vector instruction set extensions as used in aggressive compiler optimization. The RISCV vector 1.0 instructions as generated by the gcc-14 optimizing compiler can turn simple loops into more complex instruction sequences.

The initial scope focuses on RISCV 64 bit processors capable of running a full Linux network stack, likely implementing the 2023 standard profile.

We want to track recent additions to standard RISCV-64 toolchains (like binutils and gcc) to see how they might make life interesting for Ghidra developers. At present, that includes newly frozen or ratified instruction set architecture (ISA) changes and compiler autovectorization optimizations. Some vendor-specific instruction set extensions will be included if they are accepted into the binutils main branch.

Running integration tests

Note: These scripts use both the unittest and logging frameworks, with the loglevel variously set to INFO or WARN. The exact output may vary.

The first two steps collect binary exemplars for Ghidra to import. Large binaries are extracted from public disk images, such as the latest Fedora RISCV-64 system disk image. Small binaries are generated locally from minimal C or C++ source files and gcc toolchains.

The large binaries are downloaded and extracted using acquireExternalExemplars.py. This script is built on the Python unittest framework to verify the existence of previously extracted exemplars, or to regenerate them if missing.

The small binaries are created - if not already present - with the generateInternalExemplars.py script.
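
Each generation step is a thin unittest wrapper around a cross-compiler invocation; a minimal sketch, with the toolchain path as an assumption:

#!/usr/bin/env python3
"""Sketch of a small-exemplar build step in the style of
generateInternalExemplars.py; the toolchain path is an assumption."""
import subprocess
import unittest
from pathlib import Path

CROSS_GCC = '/opt/riscvx/bin/riscv64-unknown-linux-gnu-gcc'

class GenerateExemplars(unittest.TestCase):
    def test_relocation_test_object(self):
        src = Path('riscv64/generated/userSpaceSamples/relocationTest.c')
        obj = Path('riscv64/exemplars/relocationTest.o')
        if not obj.exists():
            # only rebuild missing exemplars
            subprocess.run([CROSS_GCC, '-c', '-march=rv64gc',
                            str(src), '-o', str(obj)], check=True)
        self.assertTrue(obj.exists())

if __name__ == '__main__':
    unittest.main()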

Warning: GCC-13 and GCC-14 binary toolchains are not included in this project. Sources should be downloaded, compiled, and installed to something like /opt/... then post-processed by bundled toolchain scripts into portable, hermetic tarballs.

$ ./acquireExternalExemplars.py 
...........
----------------------------------------------------------------------
Ran 11 tests in 0.003s

OK

$ ./generateInternalExemplars.py 
.......
----------------------------------------------------------------------
Ran 7 tests in 4.092s

OK

The exemplar binaries can now be imported into two Ghidra projects - one for RISCV64 and another for x86_64. The import process includes pre- and post-script processing. Pre-script processing is used for the kernel import to fix symbol names and the load address. Post-script processing is used for the kernel module import to gather relocation results for later regression testing. These relocation results are saved in testresults/*.json.
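
A post-analysis script collecting those relocation results can be quite small. A minimal sketch in Ghidra's Jython scripting dialect, with a hypothetical output path:

# Sketch of a Ghidra post-analysis script capturing relocation results
# for later regression testing; the output path is hypothetical.
import json

results = []
relocs = currentProgram.getRelocationTable().getRelocations()
while relocs.hasNext():
    r = relocs.next()
    results.append({'address': str(r.getAddress()),
                    'type': r.getType(),
                    'symbol': r.getSymbolName()})

with open('testresults/relocations.json', 'w') as f:
    json.dump(results, f, indent=2)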

Import processing generates a log file for each binary imported into Ghidra. If that log file is newer than the binary, the import process is skipped. If you want to rerun an import for foo.o, simply delete the matching log file in .../exemplars/foo.log.
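
That freshness check amounts to a timestamp comparison, roughly:

# Sketch: skip an import when its log file is newer than the binary.
import os

def needs_import(binary, logfile):
    return (not os.path.exists(logfile)
            or os.path.getmtime(logfile) < os.path.getmtime(binary))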

$ ./importExemplars.py
.INFO:root:Current Kernel import log file found - skipping import
.INFO:root:Current Kernel module import log file found - skipping import
...
.
----------------------------------------------------------------------
Ran 7 tests in 0.003s

OK

Test results gathered during binary imports and saved in testresults/*.json are now compared with expected values in the final script:

$ ./integrationTest.py 
inspecting the R_RISCV_BRANCH relocation test
inspecting the R_RISCV_JAL test
inspecting the R_RISCV_PCREL_HI20 1/2 test
inspecting the R_RISCV_PCREL_HI20 2/2 test
inspecting the R_RISCV_PCREL_LO12_I test
inspecting the R_RISCV_64 test
inspecting the R_RISCV_RVC_BRANCH test
inspecting the R_ADD_32 test
inspecting the R_RISCV_ADD64 test
inspecting the R_SUB_32 test
inspecting the R_RISCV_ADD64 test
inspecting the R_RISCV_RVC_JUMP test
.
----------------------------------------------------------------------
Ran 1 test in 0.000s

Ghidra gap analysis

We are looking for processor features that may soon be commonplace but that current Ghidra releases do not support well. One such feature involves RISCV extensions to the instruction set architecture, especially vector and bit manipulation extensions. For each such feature we might consider the following questions:

  1. What is a current example of this feature, especially an example that supports analysis of the feature or exposes its pathologies?
  2. How and when might this feature impact a significant number of Ghidra analysts?
  3. How much effort might it take Ghidra developers to fill the implied feature gap?
  4. Is this feature specific to RISCV systems or more broadly applicable to other processor families? Would support for that feature be common to many processor families or vary widely by processor?
  5. What are the existing frameworks within Ghidra that might most credibly be extended to support that feature?

Thread Local Storage (TLS) is a fairly simple feature we can use as an example. Addressing each of the five questions in turn we might find:

  1. TLS relocation codes appear occasionally in multithreaded applications across most processor families. They might appear a few times within libc. Ghidra often doesn’t recognize these codes. Existing analytics like objdump and readelf certainly do recognize TLS codes, but do not pretend to provide semantic aid in interpreting those codes. TLS codes have well documented C source contexts in the form of compiler attributes.
  2. The TLS handling gap is unlikely to affect many Ghidra users anytime soon, since TLS relocations appear only rarely and mostly apply to local variables where the decompiler can provide context.
  3. Experienced Ghidra developers might be able to implement the general TLS case easily, but would then have to add supporting ELF import code to a broader range of processor definitions.
  4. The TLS feature is common across most processor families supporting multithreading.
  5. Support within Ghidra might grow out of existing memory space models and existing processor-specific ELF importers.

The general design questions boil down to:

  • how long can we defer working on this gap?
  • how long would it take to fill that gap after we got started?
  • where would we likely want to start?

The Ghidra design team might assign TLS support a relatively low priority, since the gap doesn’t currently have a large impact. If the incidence and complexity of TLS suddenly increased, the extension of existing Ghidra support could likely increase just as rapidly.

Extensions to Instruction Set Architectures make up a much more complicated example. Standardized instructions for cache management and cryptography are likely easy enough to fold into Ghidra’s framework, but vector instruction extensions will hit harder and sooner, without a clear path forward for Ghidra.

1.1 - Glossary

Some of the commonly used terms in this project:

exemplar
An example of a binary file one might expect Ghidra to accept as input. This might be an ELF executable, an ELF object file or library of object files, a kernel load module, or a kernel vmlinux image. Ideally it should be relatively small and easy to screen for hidden malware. Not all features demonstrated by the exemplar need be supported by the current Ghidra release.
platform
The technology base on which one or more exemplars run. A kernel exemplar expects to run on top of a bootloader platform. A Linux application exemplar may consider the Linux kernel plus system libraries as its platform. System libraries like libc.so can then be both exemplars and platform elements.
compiler suite
A compiler suite includes a compiler or cross compiler plus all of the supporting tools and libraries to build executables for a range of platforms. This generally includes a versioned C and C++ compiler, preprocessor, assembler, linker, linker scripts, and core libraries like libgcc. Compiler suites often support many architecture variants, such as 32 or 64 bit word size and a host of microarchitecture or instruction set options. Compiler suites can be customized by selecting specific configurations and options, becoming toolchains.
cross compiler
A compiler capable of generating code for a processor other than the one it is running on. An x86_64 gcc-14 compiler configured to generate RISCV-64 object files would be a cross-compiler. Cross-compilers run on either the local host platform or on a Continuous Integration test server platform.
linker
A tool that takes one or more object files and resolves the runtime linkages internal to those object files. Usually ld on a Linux system. Often generates an ELF file or a kernel image.
loader
A tool - often integrated with the kernel - that loads an ELF file into RAM. The loader finalizes runtime linkages with external objects. The loader will often rewrite code (aka relaxation) to optimize memory references and thus performance.
sysroot
The system root directories provide the interface between platform (kernel and system libraries) and user code. This can be as simple as /usr/include or as complicated as a sysroot/lib/ldscripts holding over 250 ld scripts detailing how a linker should generate code the kernel loader can fully process. Cross-compiler toolchains often need to import a sysroot to build for a given kernel. This can make for a circular dependency.
toolchain
A toolchain is an assembly of cross-compiler, linker, loader, and sysroot, plus a default set of options and switches for each component. Different toolchains might share a gcc compiler suite, but be configured for different platforms - building a kernel image, building libc.so, or building an executable application. Note: the word toolchain is often used in this project where compiler suite is intended.
workspace
An environment that provides mappings between platforms and toolchains. If you want to build an executable for a given platform, just name that platform on the command line and the build tool will select a compatible toolchain and a default set of options. You can still override those options.
hermetic
Build artifacts are not affected by any local host files other than those imported with the toolchain. A hermetic build on a Fedora platform will generate exactly the same binary output as if built on an Ubuntu platform. This allows remote build servers to cache build artifacts and CI/CD servers to use exactly the same build environment as a diverse development team.

2 - Exemplars

List the current importable and buildable exemplars, their origins, and the Ghidra features they are intended to validate or stress.

Overview

Exemplars suitable for Ghidra import are generally collected by platform architecture, such as riscv64/exemplars or x86_64/exemplars. Some are imported from system disk images. Others are locally built from small source code files and an appropriate compiler toolchain. The initial scope includes Linux-capable RISCV 64 bit systems that might be found in network appliances or ML inference engines. That makes for a local bias towards privileged code, concurrency management, and performance optimization. That scope expands slightly to x86_64 exemplars that may help triage issues that show up first in RISCV 64 exemplars.

You can get decent exemplar coverage with this set of exemplars:

  • from a general purpose RISCV-64 disk image:
    • kernel - a RISCV-64 kernel built with a recent Linux release
    • kernel load module - an ELF binary intended to be loaded into a running kernel
    • system library - libc.so or libssl.so copied from a generic Linux disk image
    • system application - a user application linking against system libraries and running over a Linux kernel
  • built from source, with the development tip of the gcc toolchain and many explicit ISA extensions:
    • binutils assembly test suite - ISA extensions usually show up here first, along with preferred disassembly patterns
    • memcpy and other libc replacements coded with RISCV-64 ISA intrinsic extensions
    • libssl.so and libcrypt.so built from source and configured for all standard and frozen crypto, vector, and bit manipulation instruction extensions.
    • DPDK network appliance source code, l3fwd and l2fwd.
    • a custom crosscompiled kernel, with ISA extensions enabled

In general, visual inspection of these exemplars after importing into Ghidra should show:

  • no failed constructors, so all instructions are recognized by Ghidra during disassembly
  • no missing pcode
  • all vector vset* instructions are unwrapped to show selected element width, multiplier, tail and mask handling

Ghidra will in a few cases disassemble an instruction differently than binutils' objdump. That's fine if it is due to a limitation of Ghidra's SLEIGH language; where alignment with objdump is possible, it is preferable.

Imported exemplars

Most of the imported large binary exemplars are broken out of available Fedora disk images. The top level acquireExternalExemplars.py script controls this process, sometimes with some manual intervention to handle image mounting. Selection of the imported disk image is controlled with text like:

LOGLEVEL = logging.WARN
FEDORA_RISCV_SITE = "http://fedora.riscv.rocks/kojifiles/work/tasks/6900/1466900"
FEDORA_RISCV_IMAGE = "Fedora-Developer-39-20230927.n.0-sda.raw"
FEDORA_KERNEL = "vmlinuz-6.5.4-300.0.riscv64.fc39.riscv64"
FEDORA_KERNEL_OFFSET = 40056
FEDORA_KERNEL_DECOMPRESSED = "vmlinux-6.5.4-300.0.riscv64.fc39.riscv64"
FEDORA_SYSMAP = "System.map-6.5.4-300.0.riscv64.fc39.riscv64"

Fedora kernel

Warning: the cited Fedora disk image may no longer be maintained. If so, we will replace it with a custom cross-compiled kernel tuned for a hypothetical network appliance.

This exemplar kernel is not an ELF file, so analysis of the import process will need help.

  • The import process explicitly sets the processor on the command line: -processor RISCV:LE:64:RV64IC. This will likely be the same as the processor determined from imported kernel load modules.
  • Ghidra recognizes three sections, one text and two data. All three need to be moved to the offset suggested in the associated System.map file. For example, .text moves from 0x1000 to 0x80001000. Test this by verifying function start addresses identified in System.map look like actual RISCV-64 kernel functions. Most begin with 16 bytes of no-op instructions to support debugging and tracing operations.
  • Mark .text as code by selecting from 0x80001000 to 0x80dfffff and hitting the D key. A script sketch of these steps follows this list.
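
A Jython pre-script might automate these steps. A sketch, assuming the 0x80000000 rebase suggested by System.map and that memory blocks keep their ELF section names:

# Sketch: rebase kernel sections and mark .text as code.
KERNEL_BASE = 0x80000000

mem = currentProgram.getMemory()
for block in mem.getBlocks():
    start = block.getStart().getOffset()
    if start < KERNEL_BASE:
        # e.g. .text moves from 0x1000 to 0x80001000
        mem.moveBlock(block, toAddr(start + KERNEL_BASE), monitor)

# mark the start of .text as code and let disassembly follow the flow
disassemble(toAddr(KERNEL_BASE + 0x1000))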

Verification

Verify that kernel code correctly references data:

  1. locate the address of panic in System.map: ffffffff80b6b188
  2. go to 0x80b6b188 in Ghidra and verify that this is a function
  3. display references to panic and examine the decompiler window.
 /* WARNING: Subroutine does not return */
  panic(s_Fatal_exception_in_interrupt_813f84f8);

Notes

This kernel includes 149 strings containing sifive, most of which appear in System.map. It's not immediately clear whether these indicate kernel mods by SiFive or a SiFive SDK kernel module compiled into the kernel.

The kernel currently includes a few RISCV instruction set extensions not handled by Ghidra, and possibly not even by binutils and the gas RISCV assembler. Current Linux kernels can bypass the standard assembler to insert custom or obscure privileged instructions.

This Linux kernel explicitly includes ISA extension code for processors that support those extensions. For example, if the kernel boots up on a processor supporting the _zbb bit manipulation instruction extensions, then the vanilla strlen, strcmp, and strncmp kernel functions are patched out to invoke strlen_zbb, strcmp_zbb, and strncmp_zbb respectively.

This kernel can support up to 64 discrete ISA extensions, of which about 30 are currently defined. It has some support for hybrid processors, where each of the hardware threads (aka ‘harts’) can support a different mix of ISA extensions.

Note: The combination of instruction set extensions and self-modifying privileged code makes for a fertile ground for Ghidra research. We can expect vector variants of memcpy inline expansion sometime in 2024, significantly complicating cyberanalysis of even the simplest programs.

Fedora kernel modules

Kernel modules are typically ELF files compiled as Position Independent Code, often using more varied ELF relocation types for dynamic loading and linking into kernel memory space. This study looks at the igc.ko kernel module for a type of Intel network interface device. Network device drivers can have some of the most time-critical and race-condition-rich behavior, making this class of driver a good exemplar.

RISCV relocation types found in this exemplar include:

R_RISCV_64(2), R_RISCV_BRANCH(16), R_RISCV_JAL(17), R_RISCV_CALL(18), R_RISCV_PCREL_HI20(23), R_RISCV_PCREL_LO12_I(24), R_RISCV_ADD32(35), R_RISCV_ADD64(36), R_RISCV_SUB32(39), R_RISCV_SUB64(40), R_RISCV_RVC_BRANCH(44), and R_RISCV_RVC_JUMP(45)

Verification

Open Ghidra’s Relocation Table window and verify that all relocations were applied.

Go to igc_poll, open a decompiler window, and export the function as igc_poll.c. Compare this file with the provided igc_poll_decompiled.c in the visual difftool of your choice (e.g. meld) and check for the presence of lines like:

netdev_printk(&_LC7,*(undefined8 *)(lVar33 + 8),"Unknown Tx buffer type\n");

This statement generates - and provides tests for - at least four relocation types.

Notes

The decompiler translates all fence instructions as fence(). This kernel module uses 8 distinct fence instructions to request memory barriers. The sleigh files should probably be extended to show either fence(1,5) or the Linux macro names given in linux/arch/riscv/include/asm/barrier.h.
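
The fence operand fields are simple to decode. Here is a sketch following the unprivileged ISA encoding (pred in instruction bits 27:24, succ in bits 23:20, each a 4-bit iorw set), under which fence(1,5) would render as fence w,ow:

# Sketch: decode the predecessor/successor sets of a RISCV fence instruction.
def iorw(bits):
    return ''.join(name for name, bit in
                   (('i', 8), ('o', 4), ('r', 2), ('w', 1)) if bits & bit)

def decode_fence(instr):
    pred = (instr >> 24) & 0xf
    succ = (instr >> 20) & 0xf
    return 'fence %s,%s' % (iorw(pred), iorw(succ))

print(decode_fence(0x0ff0000f))   # the full barrier: fence iorw,iorw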

Fedora system libraries

System libraries like libc.so and libssl.so typically link to versioned shareable object libraries like libc.so.6 and libssl.so.3.0.5. Ghidra imports RISCV system libraries well.

Relocation types observed include:

R_RISCV_64(2), R_RISCV_RELATIVE(3), R_RISCV_JUMP_SLOT(5), and R_RISCV_TLS_TPREL64(11)

R_RISCV_TLS_TPREL64 is currently unsupported by Ghidra, appearing in the libc.so.6 .got section about 15 times. This relocation type does not appear in libssl.so.3.0.5. It appears in multithreaded applications that use thread-local storage.

Fedora system executables

The ssh utility imports cleanly into Ghidra.

Relocation types observed include:

R_RISCV_64(2), R_RISCV_RELATIVE(3), R_RISCV_JUMP_SLOT(5)

Function thunks referencing external library functions do not automatically get the name of the external function propagated into the name of the thunk.

Locally built exemplars

Imported binaries are generally locked into a single platform and a single toolchain. The imported binaries above are built for a SiFive development board, a 64 bit RISCV processor with support for Integer and Compressed instruction sets, and a gcc-13 toolchain. If we want some variation on that, say to look ahead at challenges a gcc-14 toolchain might throw our way, we need to build our own exemplars.

Open source test suites can be a good source for feature-focused importable exemplars. If we want to test Ghidra’s ability to import RISCV instruction set extensions, we want to import many of the files from binutils-gdb/gas/testsuite/gas/riscv.

For example, most of the ratified set of RISCV vector instructions are used in vector-insns.s. If we assemble this with a gas assembler compatible with the -march=rv32ifv architecture we get an importable binary exemplar for those instructions. Even better, we can disassemble that exemplar with a compatible objdump and get the reference disassembly to compare against Ghidra’s disassembly. This gives us three kinds of insights into Ghidra’s import capabilities:

  1. When new instructions appear in the binutils gas main branch, they are good candidates for implementation in Ghidra within the next 12 months. This currently includes the approved vector, bit manipulation, cache management, and crypto extensions plus about a dozen vendor-specific extensions from Alibaba's THead RISCV server initiative.
  2. These exemplars drive extension of Ghidra’s RISCV sleigh files, both as new instruction definitions and as pcode semantics for display in the decompiler window.
  3. Disassembly of those exemplars with a current binutils objdump utility gives us a reference disassembly to compare with Ghidra’s. We can minimize arbitrary or erroneous Ghidra disassembly by comparing the two disassembler views. Ghidra and objdump have different goals, so we don’t need strict alignment of Ghidra with objdump.

Most exemplars appear as four related files. We can use the vector exemplar as an example.

  • The source file is riscv64/generated/assemblySamples/vector.S, copied from binutils-gdb/gas/testsuite/gas/riscv/vector-insns.s.
  • vector.S is assembled into riscv64/exemplars/vector.o
  • That assembly run generates the assembly output listing riscv64/exemplars/vector.log.
  • riscv64/exemplars/vector.o is finally processed by binutils objdump to generate the reference disassembly riscv64/exemplars/vector.objdump.

The riscv64/exemplars/vector.o is then imported into the Ghidra exemplars project, where we can evaluate the import and disassembly results.
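
A sketch of the generation side of that pipeline, assuming a /opt/riscvx binutils installation and the -march=rv32ifv architecture mentioned above:

#!/usr/bin/env python3
"""Sketch of the assemble/objdump exemplar pipeline; toolchain paths
are assumptions."""
import subprocess

AS = '/opt/riscvx/bin/riscv64-unknown-linux-gnu-as'
OBJDUMP = '/opt/riscvx/bin/riscv64-unknown-linux-gnu-objdump'

SRC = 'riscv64/generated/assemblySamples/vector.S'
OBJ = 'riscv64/exemplars/vector.o'
LOG = 'riscv64/exemplars/vector.log'
DUMP = 'riscv64/exemplars/vector.objdump'

# assemble, capturing the listing as the .log file
with open(LOG, 'w') as log:
    subprocess.run([AS, '-march=rv32ifv', '-al', SRC, '-o', OBJ],
                   stdout=log, check=True)

# reference disassembly to compare against Ghidra's
with open(DUMP, 'w') as dump:
    subprocess.run([OBJDUMP, '-d', OBJ], stdout=dump, check=True)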

Assembly language exemplars usually don’t have any sensible decompilation. C or C++ language exemplars usually do, so that gives the test analyst more to work with.

Another example shows Ghidra’s difficulty with vector optimized code. Compile this C code for the rv64gcv architecture (RISCV-64 with vector extensions), using the gcc-14 compiler suite released in May of 2024.

#include <stdio.h>
int main(int argc, char** argv){
    const int N = 1320;
    char s[N];
    for (int i = 0; i < N - 1; ++i)
        s[i] = i + 1;
    s[N - 1] = '\0';
    printf(s);
}

Ghidra’s 11.0 release decompiles this into:

/* WARNING: Control flow encountered unimplemented instructions */

void main(void)

{
  gp = &__global_pointer$;
                    /* WARNING: Unimplemented instruction - Truncating control flow here */
  halt_unimplemented();
}

Try the import again with the isa_ext experimental branch of Ghidra:

undefined8 main(void)

{
  undefined auVar1 [64];
  undefined8 uVar2;
  undefined (*pauVar3) [64];
  long lVar4;
  long lVar5;
  undefined auVar6 [256];
  undefined auVar7 [256];
  char local_540 [1319];
  undefined uStack_19;
  
  gp = &__global_pointer$;
  pauVar3 = (undefined (*) [64])local_540;
  lVar4 = 0x527;
  vsetvli_e32m1tama(0);
  auVar7 = vid_v();
  do {
    lVar5 = vsetvli(lVar4,0xcf);
    auVar6 = vmv1r_v(auVar7);
    lVar4 = lVar4 - lVar5;
    auVar6 = vncvt_xxw(auVar6);
    vsetvli(0,0xc6);
    auVar6 = vncvt_xxw(auVar6);
    auVar6 = vadd_vi(auVar6,1);
    auVar1 = vse8_v(auVar6);
    *pauVar3 = auVar1;
    uVar2 = vsetvli_e32m1tama(0);
    pauVar3 = (undefined (*) [64])(*pauVar3 + lVar5);
    auVar6 = vmv_v_x(lVar5);
    auVar7 = vadd_vv(auVar7,auVar6);
  } while (lVar4 != 0);
  uStack_19 = 0;
  printf(local_540,uVar2);
  return 0;
}

That Ghidra branch decompiles the function, but the decompilation listing only resembles the C source code if you are familiar with RISCV vector extension instructions.

Repeat the example, this time building with a gcc-13 compiler suite. Ghidra 11.0 does a fine job of decompiling this.

undefined8 main(void)
{
  long lVar1;
  char acStack_541 [1320];
  undefined uStack_19;
    gp = &__global_pointer$;
  lVar1 = 1;
  do {
    acStack_541[lVar1] = (char)lVar1;
    lVar1 = lVar1 + 1;
  } while (lVar1 != 0x528);
  uStack_19 = 0;
  printf(acStack_541 + 1);
  return 0;
}

custom Linux kernel and kernel mods

The Fedora 39 disk image is a good exemplar of endpoint system code. We can supplement that with a custom kernel build. This gives us more flexibility and a peek into future system builds.

Building a custom kernel - with standard kernel modules - requires steps like these:

  1. Download the linux kernel source from https://github.com/torvalds/linux.git
    • This example currently uses the kernel development tip shortly after version 6.9 RC2
  2. Generate a new .config kernel configuration file with a command like:
    $ PATH=$PATH:/opt/riscvx/bin
    $ make ARCH=riscv CROSS_COMPILE=riscv64-unknown-linux-gnu- MY_CFLAGS='-march=rv64gcv_zba_zbb_zbc_zbkb_zbkc_zbkx_zvbb_zvbc' menuconfig
    
  3. In the menuconfig view select the architecture-specific features we want to enable. This will likely include platform selections like Vector extension support and Zbb extension support. It may also include Cryptographic API selections like Accelerated Cryptographic Algorithms for CPU (riscv)
  4. Build the kernel and selected kernel modules with a gcc 14.0.0 riscv64 toolchain
    $ make ARCH=riscv CROSS_COMPILE=riscv64-unknown-linux-gnu- MY_CFLAGS='-march=rv64gcv_zba_zbb_zbc_zbkb_zbkc_zbkx_zvbb_zvbc' all
    
  5. Copy the selected vmlinux ELF files into the riscv64/exemplars directory:
    $ cp vmlinux ~/projects/github/ghidra_import_tests/riscv64/exemplars/vmlinux_6.9rc2
    $ cp arch/riscv/crypto/aes-riscv64-zvkned-zvbb-zvkg.o ~/projects/github/ghidra_import_tests/riscv64/exemplars/vmlinux_6.9rc2_aes-riscv64-zvkned-zvbb-zvkg.o
    

analysis

Importing the custom vmlinux kernel into Ghidra 11.1-DEV(isa_ext) shows:

  • there are relatively few vector extension sequences in the kernel - 17 instances of vset*.
    • for example, __asm_vector_usercopy uses vector loads and stores to copy into user memory spaces.
  • there are Zbb variants strcmp_zbb, strlen_zbb, and strncmp_zbb, which the kernel can patch in when the processor supports the Zbb extension

Importing the aes-riscv64-zvkned-zvbb-zvkg.o object file - presumably available for use in loadable kernel crypto modules - shows:

  • two functions aes_xts_encrypt_zvkned_zvbb_zvkg and aes_xts_decrypt_zvkned_zvbb_zvkg
  • many vector, crypto, and bit manipulation extension instructions.

Commit logs for the Linux kernel sources suggest that the riscv vector crypto functions were derived from openssl source code, possibly intended for use in file system encryption and decryption.

x86_64 exemplars

A few x86_64 exemplars exist to explore the scope of issues raised by RISCV exemplars. The x86_64/exemplars directory shows how optimizing gcc-14 compilations handle simple loops and built-ins like memcpy for various microarchitectures.

Intel microarchitectures can be grouped into common profiles like x86-64-v2, x86-64-v3, and x86-64-v4. Each has its own set of instruction set extensions, so an optimizing compiler like gcc-14 will autovectorize loops and built-ins differently for each microarchitecture.

The memcpy exemplar set includes source code and three executables compiled from that source code with -march=x86-64-v2, -march=x86-64-v3, and -march=x86-64-v4. The binutils-2.41 objdump disassembly is provided for each executable, for comparison with Ghidra’s disassembly window.

x86_64/exemplars$ ls memcpy*
memcpy.c  memcpy_x86-64-v2  memcpy_x86-64-v2.objdump  memcpy_x86-64-v3  memcpy_x86-64-v3.objdump  memcpy_x86-64-v4  memcpy_x86-64-v4.objdump

These exemplars suggest several Ghidra issues:

  • Ghidra’s disassembler is generally unable to recognize many vector instructions generated by gcc-14 with -march=x86-64-v4 and -O3.
  • Ghidra’s decompiler provides the user little help in recognizing the semantics of memcpy or many simple loops with -march=x86-64-v2 or -march=x86-64-v3.
  • Ghidra users should be prepared for a wide variety of vector-optimized instruction sequences. Pattern recognition will be difficult.

custom exemplars

Not all RISCV instruction set extensions are standardized and supported by open source compiler suites. Vendors can generate their own custom extensions. These may be instructions proposed for standardization, instructions that predate standardized extensions and are now effectively deprecated for new RISCV variants, and (potentially) instructions that are considered non-public licenseable intellectual property.

We have one example of a set of vendor-specific RISC-V extension exemplars that is pending classification. Some of the WCH QingKe 32 bit RISCV processors support what they call extended instructions, or XW instructions, like c.lbu, c.lhu, c.sb, c.sh, c.lbusp, c.lhusp, c.sbsp, and c.shsp. The encoding for these custom instructions overlaps other, standardized extensions like Zcd, while some of the instruction mnemonics overlap those of Zcb. There is no known evidence that these XW instructions are tracked for inclusion in binutils, as other full-custom extensions from Alibaba's THead group are. There is no evidence that these XW instructions are considered licensable or proprietary to WCH (Nanjing Qinheng Microelectronics).

https://github.com/ArcaneNibble has generated a set of binary exemplars for this vendor custom extension. Naming conventions for full-custom extensions are very much To Be Determined. The RISCV binutils toolchain attaches an architecture tag to each ELF file it generates. For these binary exemplars that is:

Tag_RISCV_arch: "rv32i2p0_m2p0_a2p0_f2p0_c2p0_xw2p2"

That architectural tag implies the binaries are for a base RISCV 32 bit processor, with the standard compressed (c) extension version 2.0 and other standard extensions. The vendor custom (x) extension (w) version 2.2 (2p2) is enabled. The Zcd and Zcb extensions are explicitly not enabled, so there is no conflict with either assembly or disassembly of the instructions.
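
The tag grammar is regular enough to split mechanically into (extension, version) pairs; a sketch:

# Sketch: split a Tag_RISCV_arch string into (extension, version) pairs,
# where '2p2' encodes version 2.2.
import re

def parse_arch(tag):
    base, rest = tag[:4], tag[4:]        # e.g. 'rv32', 'i2p0_m2p0_...'
    exts = []
    for chunk in rest.split('_'):
        name, major, minor = re.fullmatch(r'(.+?)(\d+)p(\d+)', chunk).groups()
        exts.append((name, major + '.' + minor))
    return base, exts

print(parse_arch('rv32i2p0_m2p0_a2p0_f2p0_c2p0_xw2p2'))
# ('rv32', [('i', '2.0'), ('m', '2.0'), ('a', '2.0'),
#           ('f', '2.0'), ('c', '2.0'), ('xw', '2.2')])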

These exemplars are currently filed under riscv64/exemplars as:

custom
└── wch
    ├── lbu.S
    ├── lbusp.S
    ├── lhu.S
    ├── lhusp.S
    ├── sb.S
    ├── sbsp.S
    ├── sh.S
    ├── shsp.S
    ├── w2p2-lbu.o
    ├── w2p2-lbusp.o
    ├── w2p2-lhu.o
    ├── w2p2-lhusp.o
    ├── w2p2-sb.o
    ├── w2p2-sbsp.o
    ├── w2p2-sh.o
    └── w2p2-shsp.o

2.1 - Whisper_cpp

Explore analysis of a machine learning application built with large language model techniques. What Ghidra gaps does such an analysis reveal?

How might we inspect a machine-learning application for malware? For example, suppose someone altered the automatic speech recognition library whisper.cpp. Would Ghidra be able to cope with the instruction set extensions used to accelerate ML inference engines? What might be added to Ghidra to help the human analyst in this kind of inspection?

Components for this exercise:

  • A Linux x86_64 Fedora 39 base system
  • Ghidra 11.0 public
  • Ghidra 11.1-DEV with the isa_ext branch for RISCV-64 support
  • A stripped target binary whisper_cpp_vendor built with RISCV-64 gcc-14 toolchain and the whisper.cpp 1.5.4 release.
    • RISCV-64 vector and other approved extensions are enabled for this build
    • published binutils-2.41 vendor-specific extensions are enabled for this build
    • whisper library components are statically linked, while system libraries are dynamically linked
  • Reference binaries whisper_cpp_* built locally with other RISCV-64 gcc toolchains
  • Ghidra’s BSIM binary similarities plugins and analytics

Questions to address:

  • does the presence of vector and other ISA extensions in whisper_cpp_vendor materially hurt Ghidra 11.0 analysis?
  • can BSIM analytics still find similarities between whisper_cpp_vendor and the non-vector build whisper_cpp_default?
  • are there recurring vector instruction patterns present in whisper_cpp_vendor that Ghidra users should be able to recognize?
  • are there additional instructions or instruction-semantics that we should add to the isa_ext branch?
  • if the vendor adds Link Time Optimization to their whisper_cpp_vendor build, does this materially hurt Ghidra 11.0 analysis?

There are a lot of variables in this exercise. Some are important, most are not.

Baseline Ghidra analysis

Starting with the baseline Ghidra 11.0, examine a locally built whisper_cpp_default, an ELF 64-bit LSB executable built with gcc-13.2.1. Import and perform standard analyses to get these statistics:

  • 186558 instructions recognized
  • text segment size 0x8678a
  • 12 bad instruction errors, all of which appear to be the fence.tso instruction extension

Now examine whisper_cpp_vendor (built with gcc 14 rather than gcc 13) with the baseline Ghidra 11.0:

  • 100521 instructions recognized
  • text segment size 0xb93cc
  • 4299 bad instruction errors

Examine whisper_cpp_vendor with the isa_ext branch of 11.1-DEV:

  • 169813 instructions recognized
  • text segment size 0xb93cc
  • 17 bad instruction errors, all of which appear to be the fence.tso instruction extension

Next apply a manual correction to whisper_cpp_vendor, selecting the entire .text segment and forcing disassembly, then clearing any unreachable 0x00 bytes.

  • 190311 instructions recognized
  • 17 bad instruction errors
  • 4138 vset* instructions usually found in vector code
  • 946 gather instructions
  • 3562 custom instructions

Finally, reset the ’language’ of whisper_cpp_vendor to match the vendor (THead, for this exercise).

The 3562 custom instructions resolve to:

Instruction   Count   Semantics
th.ext*         151   Sign extract and extend
th.ldd         1719   Load 2 doublewords
th.lwd           10   Load 2 words
th.sdd         1033   Store 2 doublewords
th.swd           16   Store 2 words
th.mula         284   Multiply-add
th.muls          67   Multiply-subtract
th.mveqz        127   Move if == 0
th.mvneqz       154   Move if != 0

This leads to some tentative next steps:

  1. Adding fence.tso to Ghidra looks like a simple small win, and a perfect place to start.
  2. The THead vendor-specific extensions look like simple peep-hole optimizations. The semantics could easily be added to Ghidra as compositions of two original instruction semantics. Slightly less than 2% of the total instructions are THead vendor customizations.
  3. The baseline Ghidra 11.0 stalls out very quickly on the vector instructions, making an early switch to the isa_ext branch necessary.
  4. The vector gather instructions are unexpectedly prevalent.
  5. Manual inspection and sampling of the 4138 vset* instruction blocks may reveal some key patterns to recognize first.

Note: fence.tso is now recognized in the Ghidra 11.1-DEV branch isa_ext, clearing the bad instruction errors.

A top-down assessment

At the highest level, what features of whisper.cpp generate vector instructions?

  • There are about 400 invocations of RISCV vector intrinsics within ggml-quants.c. In these cases the developer has explicitly managed the vectorization.
  • There are an unknown number of automatic loop vectorizations, where gcc-14 has replaced simple scalar loops with vector-based loops. This vectorization will generally reduce the number of loop iterations, but may not always reduce the number of instructions executed.
  • Gcc expansions of memcpy or structure copies into vector load-store loops.

Much of whisper.cpp involves vector, matrix, or tensor math using ggml math functions. This is also where most of the explicit RISCV vector intrinsic C functions appear, and likely the code the developer believes is most in need of vector performance enhancements.

Example: dot product

ggml_vec_dot_f32(n, sum, x, y) generates the vector dot product of two vectors x and y of length n with the result stored to *sum. In the absence of vector or SIMD support the source code is:

// scalar
    double sumf = 0.0;
    for (int i = 0; i < n; ++i) {
        sumf += (double)(x[i]*y[i]);
    }
    *s = sumf;   // s is the float* result parameter

GCC-14 will autovectorize this into something Ghidra decompiles like this (comments added after //):

void ggml_vec_dot_f32(long n,float *s,float *x,float *y)

{
  long step;
  double dVar1;
  undefined auVar2 [256];
  undefined in_v2 [256];
  undefined auVar3 [256];
  
  gp = &__global_pointer$;
  if (0 < n) {
    vsetvli_e64m1tama(0);
    vmv_v_i(in_v2,0);                  // v2 = 0
    do {
      step = vsetvli(n,0x97);          // vsetvli a5,a0,e32,mf2,tu,ma
      n = n - step;
      auVar3 = vle32_v(x);             // v3 = *x (slice of size step)
      auVar2 = vle32_v(y);             // v1 = *y (slice of size step)
      x = (float *)sh2add(step,x);      // x = x + step
      auVar2 = vfmul_vv(auVar2,auVar3); // v1 = v1 * v3
      y = (float *)sh2add(step,y);      // y = y + step
      in_v2 = vfwadd_wv(in_v2,auVar2);  // v2 = v1 + v2
    } while (n != 0);
    vsetvli_e64m1tama(0);
    auVar2 = vfmv_sf(0);                 // v1[0] = 0
    auVar2 = vfredusum_vs(in_v2,auVar2); // v1[0] = sum(v2)
    dVar1 = (double)vfmv_fs(auVar2);     // dvar1 = v1[0]
    *s = (float)dVar1;
    return;
  }
  *s = 0.0;
  return;
}

Inspecting this disassembly and decompilation suggests several top down issues:

  • The semantics for sh2add are simple and should be made explicit: sh2add(a, b) = (a << 2) + b
    • This is now implemented in Ghidra 11.1-DEV isa_ext.
  • The vsetvli(n,0x97) instruction should be expanded to show its semantics as vsetvli_e32mf2tuma; a vtype decoder sketch appears after this list.
    • Running the binary through a RISCV objdump program gives us this formal expansion. This instruction says that the selected element width is 32 bits with an LMUL multiplication factor of 1/2. This means that only half of the vector register is used, to allow for the 64 bit arithmetic output.
    • This is now implemented in Ghidra 11.1-DEV isa_ext.
  • The semantics for vector results need clarification.
  • The loop accumulates 64 bit double values from 32 bit input values. If the vector length is 256 bits, that means the step size is 4, not 8.
  • A capability to generate processor-specific inline hints or comments in the decompiler may be useful, especially if there were a typographic way to distinguish vector and scalar objects.
  • If vector registers were infinitely long the loop might become v2 = x * y and the reduction dvar1 = reduce(+, v2).
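
As a sketch of that vtype expansion, assuming the RVV 1.0 layout (vlmul in bits 2:0, vsew in bits 5:3, vta in bit 6, vma in bit 7):

# Sketch: decode the vtype immediate of vsetvli/vsetivli.
VLMUL = {0: 'm1', 1: 'm2', 2: 'm4', 3: 'm8', 5: 'mf8', 6: 'mf4', 7: 'mf2'}

def decode_vtype(vtype):
    sew = 8 << ((vtype >> 3) & 0x7)          # selected element width, bits
    lmul = VLMUL[vtype & 0x7]                # register group multiplier
    ta = 'ta' if (vtype >> 6) & 1 else 'tu'  # tail agnostic / undisturbed
    ma = 'ma' if (vtype >> 7) & 1 else 'mu'  # mask agnostic / undisturbed
    return 'e%d,%s,%s,%s' % (sew, lmul, ta, ma)

print(decode_vtype(0x97))   # e32,mf2,tu,ma - the loop above
print(decode_vtype(0xd1))   # e32,m2,ta,ma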

The path forward may be to manually analyze several examples from whisper.cpp, extending and revising Ghidra’s semantics and decompiler to add a bit of clarity each time.

Example: auto-vectorization makes the simple complicated

Autovectorization can generate complicated code when the compiler has no knowledge of the number of elements in a vector or the number of elements that can fit within a single vector register.

A good example is from:

ggml_tensor * ggml_new_tensor_impl(
        struct ggml_context * ctx,
        enum   ggml_type      type,
        int                   n_dims,
        const int64_t       * ne,
        struct ggml_tensor  * view_src,
        size_t                view_offs) {
...
         size_t data_size = ggml_row_size(type, ne[0]);
         for (int i = 1; i < n_dims; i++) {
            data_size *= ne[i];
         }
}

The ne vector typically has at most 4 elements, so this loop executes at most a few times. The compiler doesn't know this, so it autovectorizes the loop into something more complex:

undefined4 * ggml_new_tensor(ggml_context *ctx,undefined8 type,long ndims,int64_t *ne)

{
...
  data_size = ggml_row_size(type,*ne);              // get the first dimension ne[0]
  lVar6 = 1;
  if (1 < ndims) {
    uVar2 = (int)ndims - 1;
    if (1 < (int)ndims - 2U) {                      // if ndims > 3 process two at a time
      piVar7 = ne + 1;                              // starting with ne[1] and ne[2]
      piVar4 = piVar7 + (long)(int)(uVar2 >> 1) * 2;
      vsetivli_e64m1tamu(2);                        //vector length = 2, 64 bit element, tail agnostic mask unchanged
      vmv_v_i(in_v1,1);                             // v1 = (1,1)
      do {
        auVar10 = vle64_v(piVar7);
        piVar7 = piVar7 + 2;
        in_v1 = vmul_vv(in_v1,auVar10);              // v1 = v1 * ne[slice]
      } while (piVar4 != piVar7);
      auVar10 = vid_v();                             // v2 = (0,1)
      vmv_v_i(in_v4,0);                              // v4 = (0,0)
      auVar11 = vadd_vi(auVar10,1);                  // v2 = v2 + 1 = (1,2)
      auVar10 = vmsgtu_vi(auVar11,1);                // v0 = (v2 > 1) = (0, 1)
      vrgather_vv(in_v1,auVar11);                    // v3 = gather(v1, v2) => v3=v1[v2] = (v1[1], 0)
      auVar11 = vadd_vi(auVar11,0xfffffffffffffffe); // v2 = v2 - 2 = (-1,0)
      auVar10 = vrgather_vv(in_v4,auVar11,auVar10);  // v3 = gather_masked(v4,v2,v0.t) = (v3[0], v4[0])
      auVar10 = vmul_vv(auVar10,in_v1);              // v3 = v3 * v1
      vmv_x_s(in_v14,auVar10);                       // a4 = v3[0]
      data_size = data_size * (long)piVar4;          // data_size = data_size * a4
      if ((uVar2 & 1) == 0) goto LAB_00074a80;
      lVar6 = (long)(int)((uVar2 & 0xfffffffe) + 1);
    }
    plVar5 = (long *)sh3add(lVar6,ne);               // multiply by one or two 
    data_size = data_size * *plVar5;                 // 
    if ((int)lVar6 + 1 < ndims) {
      data_size = data_size * plVar5[1];
    }
  }
...
}

That’s a very confusing way to multiply at most four integers. If ne has 1, 2, or 3 elements then no vector instructions are processed at all. If it has 4 elements then the first and last one or two are handled with scalar math while pairs of elements are accumulated in the loop. The gather instructions are used together to generate a mask and then multiply the two elements of vector v1, leaving the result in the first element slot of vector v4.

This particular loop vectorization is likely to change a lot in future releases. The performance impact is negligible either way. The analyst may look at code like this and decide to ignore the ndims>3 case along with all of the vector instructions used within it. Alternatively, we could look at the gcc vectorization code handling the general vector reduction meta operation, then see if this pattern is a macro of some sort within it.

Take a step back and look at the gcc RISCV autovectorization code. It's changing quite frequently, so it's probably premature to try to abstract out loop reduction models that we can get Ghidra to recognize. When that work stabilizes we might draw source exemplars from gcc/gcc/testsuite/gcc.target/riscv/rvv/autovec and build a catalog of source pattern to instruction pattern expansions.

Example: source code use of RISCV vector intrinsics

The previous example showed an overly aggressive autovectorization of a simple loop. Here we look at source code that the developer has decided is important enough to directly code in RISCV intrinsic C functions. The function ggml_vec_dot_q5_0_q8_0 is one such function, with separate implementations for ARM_NEON, wasm_simd128, AVX2, AVX, and riscv_v_intrinsic. If none of those accelerators are available a scalar implementation is used instead:

void ggml_vec_dot_q5_0_q8_0(const int n, float * restrict s, const void * restrict vx, const void * restrict vy) {
    const int qk = QK8_0;
    const int nb = n / qk;

    assert(n % qk == 0);
    assert(qk == QK5_0);

    const block_q5_0 * restrict x = vx;
    const block_q8_0 * restrict y = vy;

    // scalar
    float sumf = 0.0;

    for (int i = 0; i < nb; i++) {
        uint32_t qh;
        memcpy(&qh, x[i].qh, sizeof(qh));

        int sumi = 0;

        for (int j = 0; j < qk/2; ++j) {
            const uint8_t xh_0 = ((qh & (1u << (j + 0 ))) >> (j + 0 )) << 4;
            const uint8_t xh_1 = ((qh & (1u << (j + 16))) >> (j + 12));

            const int32_t x0 = ((x[i].qs[j] & 0x0F) | xh_0) - 16;
            const int32_t x1 = ((x[i].qs[j] >>   4) | xh_1) - 16;

            sumi += (x0 * y[i].qs[j]) + (x1 * y[i].qs[j + qk/2]);
        }

        sumf += (GGML_FP16_TO_FP32(x[i].d)*GGML_FP16_TO_FP32(y[i].d)) * sumi;
    }

    *s = sumf;
}

The RISCV intrinsic source is:

Note: added comments are flagged with ///

void ggml_vec_dot_q5_0_q8_0(const int n, float * restrict s, const void * restrict vx, const void * restrict vy) {
    const int qk = QK8_0;  /// QK8_0 = 32
    const int nb = n / qk;

    assert(n % qk == 0);
    assert(qk == QK5_0);   /// QK5_0 = 32

    const block_q5_0 * restrict x = vx;
    const block_q8_0 * restrict y = vy;

    float sumf = 0.0;

    uint32_t qh;

    size_t vl = __riscv_vsetvl_e8m1(qk/2);

    // These temporary registers are for masking and shift operations
    vuint32m2_t vt_1 = __riscv_vid_v_u32m2(vl);
    vuint32m2_t vt_2 = __riscv_vsll_vv_u32m2(__riscv_vmv_v_x_u32m2(1, vl), vt_1, vl);

    vuint32m2_t vt_3 = __riscv_vsll_vx_u32m2(vt_2, 16, vl);
    vuint32m2_t vt_4 = __riscv_vadd_vx_u32m2(vt_1, 12, vl);

    for (int i = 0; i < nb; i++) {
        memcpy(&qh, x[i].qh, sizeof(uint32_t));

        // ((qh & (1u << (j + 0 ))) >> (j + 0 )) << 4;
        vuint32m2_t xha_0 = __riscv_vand_vx_u32m2(vt_2, qh, vl);
        vuint32m2_t xhr_0 = __riscv_vsrl_vv_u32m2(xha_0, vt_1, vl);
        vuint32m2_t xhl_0 = __riscv_vsll_vx_u32m2(xhr_0, 4, vl);

        // ((qh & (1u << (j + 16))) >> (j + 12));
        vuint32m2_t xha_1 = __riscv_vand_vx_u32m2(vt_3, qh, vl);
        vuint32m2_t xhl_1 = __riscv_vsrl_vv_u32m2(xha_1, vt_4, vl);

        // narrowing
        vuint16m1_t xhc_0 = __riscv_vncvt_x_x_w_u16m1(xhl_0, vl);
        vuint8mf2_t xh_0 = __riscv_vncvt_x_x_w_u8mf2(xhc_0, vl);

        vuint16m1_t xhc_1 = __riscv_vncvt_x_x_w_u16m1(xhl_1, vl);
        vuint8mf2_t xh_1 = __riscv_vncvt_x_x_w_u8mf2(xhc_1, vl);

        // load
        vuint8mf2_t tx = __riscv_vle8_v_u8mf2(x[i].qs, vl);

        vint8mf2_t y0 = __riscv_vle8_v_i8mf2(y[i].qs, vl);
        vint8mf2_t y1 = __riscv_vle8_v_i8mf2(y[i].qs+16, vl);

        vuint8mf2_t x_at = __riscv_vand_vx_u8mf2(tx, 0x0F, vl);
        vuint8mf2_t x_lt = __riscv_vsrl_vx_u8mf2(tx, 0x04, vl);

        vuint8mf2_t x_a = __riscv_vor_vv_u8mf2(x_at, xh_0, vl);
        vuint8mf2_t x_l = __riscv_vor_vv_u8mf2(x_lt, xh_1, vl);

        vint8mf2_t x_ai = __riscv_vreinterpret_v_u8mf2_i8mf2(x_a);
        vint8mf2_t x_li = __riscv_vreinterpret_v_u8mf2_i8mf2(x_l);

        vint8mf2_t v0 = __riscv_vsub_vx_i8mf2(x_ai, 16, vl);
        vint8mf2_t v1 = __riscv_vsub_vx_i8mf2(x_li, 16, vl);

        vint16m1_t vec_mul1 = __riscv_vwmul_vv_i16m1(v0, y0, vl);
        vint16m1_t vec_mul2 = __riscv_vwmul_vv_i16m1(v1, y1, vl);

        vint32m1_t vec_zero = __riscv_vmv_v_x_i32m1(0, vl);

        vint32m1_t vs1 = __riscv_vwredsum_vs_i16m1_i32m1(vec_mul1, vec_zero, vl);
        vint32m1_t vs2 = __riscv_vwredsum_vs_i16m1_i32m1(vec_mul2, vs1, vl);

        int sumi = __riscv_vmv_x_s_i32m1_i32(vs2);

        sumf += (GGML_FP16_TO_FP32(x[i].d)*GGML_FP16_TO_FP32(y[i].d)) * sumi;
    }

    *s = sumf;
}

Ghidra’s 11.1 isa_ext rendering of this is (after minor parameter name propagation):

long ggml_vec_dot_q5_0_q8_0(ulong n,float *s,void *vx,void *vy)

{
  ushort *puVar1;
  long lVar2;
  long lVar3;
  long lVar4;
  long lVar5;
  undefined8 uVar6;
  int i;
  float fVar7;
  undefined auVar8 [256];
  undefined auVar9 [256];
  undefined auVar10 [256];
  undefined auVar11 [256];
  undefined in_v7 [256];
  undefined in_v8 [256];
  undefined auVar12 [256];
  undefined auVar13 [256];
  undefined auVar14 [256];
  undefined auVar15 [256];
  undefined auVar16 [256];
  int iStack_4;
  
  gp = &__global_pointer$;
  uVar6 = vsetivli(0x10,0xc0);
  vsetvli(uVar6,0xd1);
  auVar13 = vid_v();
  vmv_v_i(in_v8,1);
  auVar15 = vadd_vi(auVar13,0xc);
  auVar12 = vsll_vv(in_v8,auVar13);
  auVar14 = vsll_vi(auVar12,0x10);
  if (0x1f < (long)n) {
    fVar7 = 0.0;
    vsetvli_e32m1tama(uVar6);
    lVar3 = (long)vx + 2;
    lVar4 = (long)vy + 2;
    i = 0;
    vmv_v_i(in_v7,0);
    vsetivli(4,0xc6);
    do {
      auVar8 = vle8_v(lVar3);
      vse8_v(auVar8,&iStack_4);
      puVar1 = (ushort *)(lVar4 + -2);
      vsetvli(uVar6,0xd1);
      lVar2 = lVar3 + 4;
      auVar8 = vle8_v(lVar2);
      auVar9 = vand_vx(auVar12,(long)iStack_4);
      auVar9 = vsrl_vv(auVar9,auVar13);
      vsetvli(0,199);
      auVar11 = vand_vi(auVar8,0xf);
      vsetvli(0,0xd1);
      auVar9 = vsll_vi(auVar9,4);
      vsetvli(0,199);
      auVar8 = vsrl_vi(auVar8,4);
      vsetvli(0,200);
      auVar9 = vncvt_xxw(auVar9);
      auVar16 = vle8_v(lVar4);
      vsetvli(0,199);
      auVar9 = vncvt_xxw(auVar9);
      vsetvli(0,0xd1);
      auVar10 = vand_vx(auVar14,(long)iStack_4);
      vsetvli(0,199);
      auVar11 = vor_vv(auVar11,auVar9);
      vsetvli(0,0xd1);
      auVar9 = vsrl_vv(auVar10,auVar15);
      vsetvli(0,199);
      auVar10 = vadd_vi(auVar11,0xfffffffffffffff0);
      vsetvli(0,200);
      auVar9 = vncvt_xxw(auVar9);
      vsetvli(0,199);
      auVar10 = vwmul_vv(auVar10,auVar16);
      auVar9 = vncvt_xxw(auVar9);
      vsetvli(0,200);
      lVar5 = lVar4 + 0x10;
      auVar10 = vwredsum_vs(auVar10,in_v7);
      vsetvli(0,199);
      auVar8 = vor_vv(auVar8,auVar9);
      auVar9 = vle8_v(lVar5);
      auVar8 = vadd_vi(auVar8,0xfffffffffffffff0);
      auVar8 = vwmul_vv(auVar8,auVar9);
      vsetvli(0,200);
      auVar8 = vwredsum_vs(auVar8,auVar10);
      vsetivli(4,0xd0);
      vmv_x_s(auVar15,auVar8);
      i = i + 1;
      lVar4 = lVar4 + 0x22;
      fVar7 = (float)(&ggml_table_f32_f16)[*puVar1] *
              (float)(&ggml_table_f32_f16)[*(ushort *)(lVar3 + -2)] * (float)(int)lVar5 + fVar7;
      lVar3 = lVar3 + 0x16;
    } while (i < (int)(((uint)((int)n >> 0x1f) >> 0x1b) + (int)n) >> 5);
    *s = fVar7;
    return lVar2;
  }
  *s = 0.0;
  return n;
}

It looks like the developer unrolled an inner loop and used the LMUL multiplier to help reduce the loop iterations. The immediate action item for us may be to add more explicit decodings for vsetvli and vsetivli, or look for existing processor-specific decoders in the Ghidra decompiler.

x86_64 whisper

Let’s take a glance at the x86_64 build of whisper. First copy whisper-cpp.BUILD into the x86_64 workspace then build the executable with two architectures:

$ bazel build --platforms=//platforms:x86_64_default --copt="-march=x86-64-v3" @whisper_cpp//:main
...
$ cp bazel-bin/external/whisper_cpp/main ../exemplars/whisper_cpp_x86-64-v3
...
$ bazel build --platforms=//platforms:x86_64_default --copt="-march=x86-64-v4" @whisper_cpp//:main
...
$ cp bazel-bin/external/whisper_cpp/main ../exemplars/whisper_cpp_x86-64-v4

Load these into Ghidra 11.1-DEV. The x86-64-v4 build is useless in Ghidra, since a different class of x86_64 vector extensions is used in that newer microarchitecture and Ghidra doesn’t recognize it. The x86-64-v3 build looks accessible.

Try an x86_64 build with the local compiler (Fedora 39 default compiler) and Link Time Optimization enabled:

$ bazel build  --copt="-march=x86-64-v3" --copt="-flto"  --linkopt="-Wl,-flto" @whisper_cpp//:main
...
$ cp bazel-bin/external/whisper_cpp/main ../exemplars/whisper_cpp_x86-64-v3-lto

We’ll leave the differential analysis of link time optimization for another day. A couple of quick notes are worthwhile here:

  • The function ggml_new_tensor no longer exists in the binary. Instead we get ggml_new_tensor_impl.constprop.0, ggml_new_tensor_impl.constprop.2, and ggml_new_tensor_impl.constprop.3. This suggests BSIM could get confused with intermediate functions if trying to connect binaries built with and without LTO.
  • None of the hermetic toolchains appear to work when link time optimization is requested. There appears to be at least one LTO plugin missing from the gcc-14 toolchain packaging. We'll try to supply it in the next snapshot of gcc-14.

2.2 - Data Plane Development Kit

Intel's DPDK framework supports some Intel and Arm Neon vector instructions. What does it do with RISCV extensions? Are ISA extensions materially useful to a network appliance?

Check out DPDK from GitHub, patch the RISCV configuration with riscv64/generated/dpdk/dpdk.pat, and crosscompile with meson and ninja. Copy some of the examples into riscv64/exemplars and examine them in Ghidra.

  • for this we use as many standard extensions as we can, excluding vendor-specific extensions.

Configure a build directory:

$ patch -p1 < .../riscv64/generated/dpdk/dpdk.pat
$ meson setup build --cross-file config/riscv/riscv64_linux_gcc -Dexamples=all
$ cd build

Edit build/build.ninja:

  • replace all occurrences of -ldl with /opt/riscvx/lib/libdl.a - you should see about 235 replacements

Build with:

$ ninja -C build

Check the cross-compilation with:

$ readelf -A build/examples/dpdk-l3fwd
Attribute Section: riscv
File Attributes
  Tag_RISCV_stack_align: 16-bytes
  Tag_RISCV_arch: "rv64i2p1_m2p0_a2p1_f2p2_d2p2_c2p0_v1p0_zicsr2p0_zifencei2p0_zmmul1p0_zba1p0_zbb1p0_zbc1p0_zbkb1p0_zbkc1p0_zbkx1p0_zvbb1p0_zvbc1p0_zve32f1p0_zve32x1p0_zve64d1p0_zve64f1p0_zve64x1p0_zvkb1p0_zvl128b1p0_zvl32b1p0_zvl64b1p0"
  Tag_RISCV_priv_spec: 1
  Tag_RISCV_priv_spec_minor: 11

Import into Ghidra 11.1-DEV(isa_ext), noting multiple import messages similar to:

  • Unsupported Thread-Local Symbol not loaded
  • ELF Relocation Failure: R_RISCV_COPY (4, 0x4) at 00f0c400 (Symbol = stdout) - Runtime copy not supported

Analysis

source code examination

explicit vectorization

The source code has explicit parallel coding for Intel and Arm/Neon architectures, for instance within the basic layer 3 forwarding example:

struct acl_algorithms acl_alg[] = {
        {
                .name = "scalar",
                .alg = RTE_ACL_CLASSIFY_SCALAR,
        },
        {
                .name = "sse",
                .alg = RTE_ACL_CLASSIFY_SSE,
        },
        {
                .name = "avx2",
                .alg = RTE_ACL_CLASSIFY_AVX2,
        },
        {
                .name = "neon",
                .alg = RTE_ACL_CLASSIFY_NEON,
        },
        {
                .name = "altivec",
                .alg = RTE_ACL_CLASSIFY_ALTIVEC,
        },
        {
                .name = "avx512x16",
                .alg = RTE_ACL_CLASSIFY_AVX512X16,
        },
        {
                .name = "avx512x32",
                .alg = RTE_ACL_CLASSIFY_AVX512X32,
        },
};

If avx512x32 is selected, then basic trie search and other operations can proceed across 32 flows in parallel. Other examples exist within the code. For more information, search the doc directory for SIMD. Check out lib/fib/trie_avx512.c and the function trie_vec_lookup_x16x2 for an AVX manual vectorization of a trie address to next hop lookup.

Note: It won’t be clear for some time which vectorization transforms actually improve performance on specific processors. Vector support adds a lot of local register space but vector loads and stores can saturate memory bandwidth and drive up processor temperature. We might see earlier adoption in contexts that tolerate higher latency, like firewalls, rather than low-latency switches and routers.

explicit ML support

DPDK includes ML contributions from Marvell. See doc/guides/mldevs/cnxk.rst for more information and references to the cnxk support. Source code support exists under drivers/ml/cnxk and lib/mldev. lib/mldev/rte_mldev.c may provide some insight into how Marvell expects DPDK users to apply their component. See the Marvell Octeon 10 white paper for some possible applications.

Ghidra analysis

The DPDK exemplars stress test Ghidra in multiple ways:

  • When compiled with RISCV-64 vector and bit manipulation extension support, you get a good mix of autovectorization instructions.
  • A number of unsupported thread-local relocations are requested.
  • ELF relocation failures are reported for R_RISCV_COPY, claiming “Runtime copy is not supported”.
    • this relocation code apparently asks for a symbol to be copied from a shareable object into an executable - see the sketch below.
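
A minimal sketch of C source that typically produces such a COPY relocation - our own illustration, not taken from the DPDK sources - is a non-PIE executable that takes the address of a data symbol exported by a shared library:

#include <stdio.h>

/* Hypothetical reproduction sketch: taking the address of libc's stdout
   in a static initializer of a non-PIE executable usually makes the
   linker emit a COPY relocation (R_RISCV_COPY on RISCV-64) for stdout. */
FILE **stdout_ref = &stdout;

int main(void) {
    fprintf(*stdout_ref, "hello\n");
    return 0;
}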

3 - Issues

Summarize Ghidra import issues here to promote discussion on relative priority and possible solutions.

Thread local storage class handling

Thread local storage provides one instance of the variable per extant thread. GCC supports this feature as:

__thread char threadLocalString[4096];

Binaries built with this feature will often include ELF relocation codes like R_RISCV_TLS_TPREL64. This relocation code is not recognized by Ghidra, nor is it clear how TLS storage itself should be handled within Ghidra - perhaps as a memory section akin to BSS?

To reproduce, import libc.so.6 and look for lines like Elf Relocation Warning: Type = R_RISCV_TLS_TPREL64. Alternatively, compile, link, and import riscv64/generated/userSpaceSamples/relocationTest.c.
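
As a rough sketch of what such a test file contains - our own illustration, not the project file itself - thread-local data plus any access to it is enough to emit TLS relocations:

/* Compiling and linking this (e.g. into a shared object) should emit TLS
   relocation codes such as R_RISCV_TLS_TPREL64 or its GD/LD relatives. */
__thread char threadLocalString[4096];   /* one copy per extant thread */

char firstByte(void) {
    return threadLocalString[0];         /* access goes through the thread pointer */
}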

Vector instruction support

Newer C compiler releases can replace simple loops and standard C library invocations with processor-specific vector instructions. These vector instructions can be handled poorly by Ghidra’s disassembler and worse by Ghidra’s decompiler. See autovectorization and vector_intrinsics for examples.

3.1 - inferring semantics from code patterns

How can we do a better job of recognizing semantic patterns in optimized code? Instruction set extensions make that more challenging.

Ghidra users want to understand the intent of binary code. The semantics and intent of the memcpy(dest,src,nbytes) operation are pretty clear. If the compiler converts this into a call to a named external function, that’s easy to recognize. If it converts it into a simple inline loop of load and store instructions, that should be recognizable too.

Optimizing compilers like gcc can generate many different instruction sequences from a simple concept like memcpy or strnlen, especially if the processor for which the code is intended supports advanced vector or bit manipulation instructions. We can examine the compiler testsuite to see what those patterns can be, enabling either human or machine translation of those sequences into the higher level semantics of memory movement or finding the first null byte in a string.

GCC recognizes memory copy operations via the operation cpymem. Calls to the standard library memcpy and various kinds of struct copies are translated into this RTL (Register Transfer Language) cpymem token. The processor-specific gcc back-end then expands cpymem into a half-dozen or so instruction patterns, depending on size, alignment, and instruction set extensions of the target processor.

In an ideal world, Ghidra would recognize all of those RTL operations as pcode operations, and further recognize all of the common back-end expansions for all processor variants. It might rewrite the decompiler window or simply add comments indicating a likely cpymem pcode expansion.

It’s enough for now to show how to gather the more common patterns to help human Ghidra operators untangle these optimizations and understand the simpler semantics they encode.

Note: This example uses RISCV vector optimization - many other optimizations are supported by gcc too.

Patterns in gcc vectorization source code

Maybe the best reference on gcc vectorization is the gcc source code itself.

  • What intrinsics are likely to be replaced with vector code?
  • What patterns of vector assembly instructions are likely to be generated?
  • How does the gcc test suite search for those patterns to verify intrinsic replacement is correct?

Start with:

  • gcc/config/riscv/riscv-vector-builtins.cc
  • gcc/config/riscv/riscv-vector-strings.cc
  • gcc/config/riscv/autovec.md
  • gcc/config/riscv/riscv-string.cc

Ghidra semantics use pcode operations. GCC uses something similar: RTL (Register Transfer Language) operations, described in gcc/doc/md.texi. The relevant operations include (see the sketch after this list):

  • cpymem
  • setmem
  • strlen
  • rawmemchr
  • cmpstrn
  • cmpstr
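
To connect these RTL names to source code, here is a sketch - our own example, with the usual caveat that actual lowering depends on target support, size, alignment, and optimization level - of C idioms gcc may lower to them:

#include <string.h>

void rtl_idioms(char *dst, const char *src, unsigned long n) {
    memcpy(dst, src, n);              /* cpymem  */
    memset(dst, 0, n);                /* setmem  */
    unsigned long len = strlen(src);  /* strlen  */
    int cmp = strncmp(dst, src, n);   /* cmpstrn */
    (void)len; (void)cmp;             /* keep the results live */
}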

The cpymem op covers inline calls to memcpy and structure copies. Trace this out:

riscv.md:

(define_expand "cpymem<mode>"
  [(parallel [(set (match_operand:BLK 0 "general_operand")
                   (match_operand:BLK 1 "general_operand"))
              (use (match_operand:P 2 ""))
              (use (match_operand:SI 3 "const_int_operand"))])]
  ""
{
  if (riscv_expand_block_move (operands[0], operands[1], operands[2]))
    DONE;
  else
    FAIL;
})

riscv_expand_block_move is also mentioned in riscv-protos.h and riscv-string.cc.

Look into riscv-string.cc:

/* This function delegates block-move expansion to either the vector
   implementation or the scalar one.  Return TRUE if successful or FALSE
   otherwise.  */

bool
riscv_expand_block_move (rtx dest, rtx src, rtx length)
{
  if (TARGET_VECTOR && stringop_strategy & STRATEGY_VECTOR)
    {
      bool ok = riscv_vector::expand_block_move (dest, src, length);
      if (ok)
        return true;
    }

  if (stringop_strategy & STRATEGY_SCALAR)
    return riscv_expand_block_move_scalar (dest, src, length);

  return false;
...
}
...
/* --- Vector expanders --- */

namespace riscv_vector {

/* Used by cpymemsi in riscv.md .  */

bool
expand_block_move (rtx dst_in, rtx src_in, rtx length_in)
{
  /*
    memcpy:
        mv a3, a0                       # Copy destination
    loop:
        vsetvli t0, a2, e8, m8, ta, ma  # Vectors of 8b
        vle8.v v0, (a1)                 # Load bytes
        add a1, a1, t0                  # Bump pointer
        sub a2, a2, t0                  # Decrement count
        vse8.v v0, (a3)                 # Store bytes
        add a3, a3, t0                  # Bump pointer
        bnez a2, loop                   # Any more?
        ret                             # Return
 */
}
}

Note that the RISCV assembly instructions in the comment are just an example, and that the C++ implementation handles many different variants. The ret instruction is not part of the expansion, just copied into the source code from the testsuite.

The testsuite (gcc/testsuite/gcc.target/riscv) shows which variants are common enough to test against.

a minimalist call to memcpy

void f1 (void *a, void *b, __SIZE_TYPE__ l)
{
  memcpy (a, b, l);
}
** f1:
XX      \.L\d+: # local label is ignored
**      vsetvli\s+[ta][0-7],a2,e8,m8,ta,ma
**      vle8\.v\s+v\d+,0\(a1\)
**      vse8\.v\s+v\d+,0\(a0\)
**      add\s+a1,a1,[ta][0-7]
**      add\s+a0,a0,[ta][0-7]
**      sub\s+a2,a2,[ta][0-7]
**      bne\s+a2,zero,\.L\d+
**      ret
*/

a typed call to memcpy

void f2 (__INT32_TYPE__* a, __INT32_TYPE__* b, int l)
{
  memcpy (a, b, l);
}

Additional type information doesn’t appear to affect the inline code:

** f2:
XX      \.L\d+: # local label is ignored
**      vsetvli\s+[ta][0-7],a2,e8,m8,ta,ma
**      vle8\.v\s+v\d+,0\(a1\)
**      vse8\.v\s+v\d+,0\(a0\)
**      add\s+a1,a1,[ta][0-7]
**      add\s+a0,a0,[ta][0-7]
**      sub\s+a2,a2,[ta][0-7]
**      bne\s+a2,zero,\.L\d+
**      ret

memcpy with aligned elements and known size

In this case the arguments are aligned and 64 bytes (512 bits) in length.

extern struct { __INT32_TYPE__ a[16]; } a_a, a_b;
void f3 ()
{
  memcpy (&a_a, &a_b, sizeof a_a);
}

The generated sequence varies depending on how much the compiler knows about the target architecture.

** f3: { target { { any-opts "-mcmodel=medlow" } && { no-opts "-march=rv64gcv_zvl512b" "-march=rv64gcv_zvl1024b" "--param=riscv-autovec-lmul=dynamic" "--param=riscv-autovec-lmul=m2" "--param=riscv-autovec-lmul=m4" "--param=riscv-autovec-lmul=m8" "--param=riscv-autovec-preference=fixed-vlmax" } } }
**        lui\s+[ta][0-7],%hi\(a_a\)
**        addi\s+[ta][0-7],[ta][0-7],%lo\(a_a\)
**        lui\s+[ta][0-7],%hi\(a_b\)
**        addi\s+a4,[ta][0-7],%lo\(a_b\)
**        vsetivli\s+zero,16,e32,m8,ta,ma
**        vle32.v\s+v\d+,0\([ta][0-7]\)
**        vse32\.v\s+v\d+,0\([ta][0-7]\)
**        ret

** f3: { target { { any-opts "-mcmodel=medlow --param=riscv-autovec-preference=fixed-vlmax" "-mcmodel=medlow -march=rv64gcv_zvl512b --param=riscv-autovec-preference=fixed-vlmax" } && { no-opts "-march=rv64gcv_zvl1024b" } } }
**        lui\s+[ta][0-7],%hi\(a_a\)
**        lui\s+[ta][0-7],%hi\(a_b\)
**        addi\s+[ta][0-7],[ta][0-7],%lo\(a_a\)
**        addi\s+a4,[ta][0-7],%lo\(a_b\)
**        vl(1|4|2)re32\.v\s+v\d+,0\([ta][0-7]\)
**        vs(1|4|2)r\.v\s+v\d+,0\([ta][0-7]\)
**        ret

** f3: { target { { any-opts "-mcmodel=medlow -march=rv64gcv_zvl1024b" "-mcmodel=medlow -march=rv64gcv_zvl512b" } && { no-opts "--param=riscv-autovec-preference=fixed-vlmax" } } }
**        lui\s+[ta][0-7],%hi\(a_a\)
**        lui\s+[ta][0-7],%hi\(a_b\)
**        addi\s+a4,[ta][0-7],%lo\(a_b\)
**        vsetivli\s+zero,16,e32,(m1|m4|mf2),ta,ma
**        vle32.v\s+v\d+,0\([ta][0-7]\)
**        addi\s+[ta][0-7],[ta][0-7],%lo\(a_a\)
**        vse32\.v\s+v\d+,0\([ta][0-7]\)
**        ret

** f3: { target { { any-opts "-mcmodel=medany" } && { no-opts "-march=rv64gcv_zvl512b" "-march=rv64gcv_zvl256b" "-march=rv64gcv_zvl1024b" "--param=riscv-autovec-lmul=dynamic" "--param=riscv-autovec-lmul=m8" "--param=riscv-autovec-lmul=m4" "--param=riscv-autovec-preference=fixed-vlmax" } } }
**        lla\s+[ta][0-7],a_a
**        lla\s+[ta][0-7],a_b
**        vsetivli\s+zero,16,e32,m8,ta,ma
**        vle32.v\s+v\d+,0\([ta][0-7]\)
**        vse32\.v\s+v\d+,0\([ta][0-7]\)

** f3: { target { { any-opts "-mcmodel=medany"  } && { no-opts "-march=rv64gcv_zvl512b" "-march=rv64gcv_zvl256b" "-march=rv64gcv" "-march=rv64gc_zve64d" "-march=rv64gc_zve32f" } } }
**        lla\s+[ta][0-7],a_b
**        vsetivli\s+zero,16,e32,m(f2|1|4),ta,ma
**        vle32.v\s+v\d+,0\([ta][0-7]\)
**        lla\s+[ta][0-7],a_a
**        vse32\.v\s+v\d+,0\([ta][0-7]\)
**        ret

** f3: { target { { any-opts "-mcmodel=medany --param=riscv-autovec-preference=fixed-vlmax" } && { no-opts "-march=rv64gcv_zvl1024b" } } }
**        lla\s+[ta][0-7],a_a
**        lla\s+[ta][0-7],a_b
**        vl(1|2|4)re32\.v\s+v\d+,0\([ta][0-7]\)
**        vs(1|2|4)r\.v\s+v\d+,0\([ta][0-7]\)
**        ret

3.2 - autovectorization

If a processor supports vector (aka SIMD) instructions, optimizing compilers will use them. That means Ghidra may need to make sense of the generated code.

Loop autovectorization

What happens when a gcc toolchain optimizes the following code?

#include <stdio.h>
int main(int argc, char** argv){
    const int N = 1320;
    char s[N];
    for (int i = 0; i < N - 1; ++i)
        s[i] = i + 1;
    s[N - 1] = '\0';
    printf(s);
}

This involves a simple loop filling a character array with integers. It isn’t a well-formed C string, so the printf statement is just there to keep the character array from being optimized away.

The elements of the loop involve incremental indexing, narrowing from 16 bit to 8 bit elements, and storage in a 1320 element vector.

The result depends on the compiler version and what kind of microarchitecture gcc-14 was told to compile for.

Compile and link this file with a variety of compiler versions, flags, and microarchitectures to see how well Ghidra tracks toolchain evolution. In each case the decompiler output is manually adjusted to relabel variables like s and i and remove extraneous declarations.

RISCV-64 gcc-13, no optimization, no vector extensions

$ bazel build -s --platforms=//platforms:riscv_userspace  gcc_vectorization:narrowing_loop
...

Ghidra gives:

  char s [1319];
 ...
  int i;
  ...
  for (i = 0; i < 0x527; i = i + 1) {
    s[i] = (char)i + '\x01';
  }
  ...
  printf(s);

The loop consists of 17 instructions and 60 bytes. It is executed 1319 times.

RISCV-64 gcc-13, full optimization, no vector extensions

bazel build -s --platforms=//platforms:riscv_userspace --copt="-O3" gcc_vectorization:narrowing_loop

Ghidra gives:

  long i;
  char s_offset_by_1 [1320];
 
  i = 1;
  do {
    s_offset_by_1[i] = (char)i;
    i = i + 1;
  } while (i != 0x528);
  uStack_19 = 0;
  printf(s_offset_by_1 + 1);

The loop consists of 4 instructions and 14 bytes. It is executed 1319 times.

Note that Ghidra has reconstructed the target vector s a bit strangely, with the beginning offset by one byte to help shorten the loop.
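
A sketch of the equivalent source for what gcc emitted - our reconstruction from the decompiler listing above - shows why the array origin appears shifted:

char s[1320];

void fill(void) {
    /* the induction variable starts at 1, so the store lands in s[i - 1];
       the stored values match s[i] = i + 1 for i in 0..1318 */
    for (long i = 1; i != 0x528; ++i)
        s[i - 1] = (char)i;
    s[1319] = '\0';
}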

RISCV-64 gcc-13, full optimization, with vector extensions

bazel build -s --platforms=//platforms:riscv_userspace --copt="-O3"  gcc_vectorization:narrowing_loop_vector

The Ghidra import is essentially unchanged - updating the target architecture from rv64igc to rv64igcv makes no difference when building with gcc-13.

RISCV-64 gcc-14, no optimization, no vector extensions

bazel build -s --platforms=//platforms:riscv_vector gcc_vectorization:narrowing_loop

The Ghidra import is essentially unchanged - updating gcc from gcc-13 to gcc-14 makes no difference without optimization.

RISCV-64 gcc-14, full optimization, no vector extensions

bazel build -s --platforms=//platforms:riscv_vector --copt="-O3" gcc_vectorization:narrowing_loop

The Ghidra import is essentially unchanged - updating gcc from gcc-13 to gcc-14 makes no difference when using the default target architecture without vector extensions.

RISCV-64 gcc-14, full optimization, with vector extensions

Build with -march=rv64gcv to tell the compiler to assume the processor supports RISCV vector extensions.

bazel build -s --platforms=//platforms:riscv_vector --copt="-O3"  gcc_vectorization:narrowing_loop_vector

Ghidra 11.0’s decompiler gives up at the first unrecognized vector instruction:

                    /* WARNING: Unimplemented instruction - Truncating control flow here */
  halt_unimplemented();

The disassembly window shows that the loop consists of 13 instructions and 46 bytes. Many of these are vector extension instructions for which Ghidra 11.0 has no semantics. Different RISCV processors will take a different number of iterations to finish the loop. If the processor VLEN=128, then each vector register will hold 4 32 bit integers and the loop will take 330 iterations. If the processor VLEN=1024 then the loop will take 83 iterations.

Either way, Ghidra 11.0 will fail to decompile any such autovectorized loop, and fail to decompile the remainder of any function which contains such an autovectorized loop.

x86-64 gcc-14, full optimization, with sapphirerapids

Note: Intel’s Sapphire Rapids includes high end server processors like the Xeon Max family. A more general choice for x86_64 exemplars would be -march=x86-64-v3 with -O2 optimization. We can expect full Red Hat Linux distributions to be built with those options soon. The x86-64-v4 microarchitecture is a generalization of the Sapphire Rapids microarchitecture, and would more likely be found in servers specialized for numerical analysis or possibly ML applications.

$ bazel build -s --platforms=//platforms:x86_64_default --copt="-O3" --copt="-march=sapphirerapids" gcc_vectorization:narrowing_loop

Ghidra 11.0 disassembler and decompiler fail immediately on hitting the first vector instruction vpbroadcastd, an older avx2 vector extension.

    /* WARNING: Bad instruction - Truncating control flow here */
  halt_baddata();

builtin autovectorization

GCC can replace calls to some functions like memcpy with inline - and potentially vectorized - instruction sequences.

This source file shows different ways memcpy can be compiled.

include "common.h"
#include <string.h>

int main() {
  const int N = 127;
  const uint32_t seed = 0xdeadbeef;
  srand(seed);

  // data gen
  double A[N];
  gen_rand_1d(A, N);

  // compute
  double copy[N];
  memcpy(copy, A, sizeof(A));
  
  // prevent optimization from removing result
  printf("%f\n", copy[N-1]);
}

RISCV-64 builds

Build with:

$ bazel build --platforms=//platforms:riscv_vector --copt="-O3" gcc_vectorization:memcpy_vector

Ghidra 11 gives:

undefined8 main(void)

{
  undefined auVar1 [64];
  undefined *puVar2;
  undefined (*pauVar3) [64];
  long lVar4;
  long lVar5;
  undefined auVar6 [256];
  undefined local_820 [8];
  undefined8 uStack_818;
  undefined auStack_420 [1032];
  
  srand(0xdeadbeef);
  puVar2 = auStack_420;
  gen_rand_1d(auStack_420,0x7f);
  pauVar3 = (undefined (*) [64])local_820;
  lVar4 = 0x3f8;
  do {
    lVar5 = vsetvli_e8m8tama(lVar4);
    auVar6 = vle8_v(puVar2);
    lVar4 = lVar4 - lVar5;
    auVar1 = vse8_v(auVar6);
    *pauVar3 = auVar1;
    puVar2 = puVar2 + lVar5;
    pauVar3 = (undefined (*) [64])(*pauVar3 + lVar5);
  } while (lVar4 != 0);
  printf("%f\n",uStack_818);
  return 0;
}

What would we like the decompiler to show instead? The memcpy pattern should be fairly general and stable.

  src = auStack_420;
  gen_rand_1d(auStack_420,0x7f);
  dest = (undefined (*) [64])local_820;
  /* char* dest, src;
    dest[0..n] ≔ src[0..n]; */
  n = 0x3f8;
  do {
    lVar2 = vsetvli_e8m8tama(n);
    auVar3 = vle8_v(src);
    n = n - lVar2;
    auVar1 = vse8_v(auVar3);
    *dest = auVar1;
    src = src + lVar2;
    dest = (undefined (*) [64])(*dest + lVar2);
  } while (n != 0);

More generally, we want a precomment showing the memcpy in vector terms immediately before the loop. The type definition of dest is a red herring to be dealt with.

x86-64 builds

Build with:

$ bazel build -s --platforms=//platforms:x86_64_default --copt="-O3" --copt="-march=sapphirerapids" gcc_vectorization:memcpy_sapphirerapids

Ghidra 11.0’s disassembler and decompiler bail out when they reach the inline replacement for memcpy - gcc-14 has replaced the call with vector instructions like vmovdqu64, which is unrecognized by Ghidra.

void main(void)

{
  undefined auStack_428 [1016];
  undefined8 uStack_30;
  
  uStack_30 = 0x4010ad;
  srand(0xdeadbeef);
  gen_rand_1d(auStack_428,0x7f);
                    /* WARNING: Bad instruction - Truncating control flow here */
  halt_baddata();
}

3.3 - vector intrinsics

Invoking RISCV vector instructions from C.

RISCV vector intrinsic functions can be coded into C or C++.

The RISCV vector intrinsics documentation includes examples of code that might be found shortly in libc:

void *memcpy_vec(void *restrict destination, const void *restrict source,
                 size_t n) {
  unsigned char *dst = destination;
  const unsigned char *src = source;
  // copy data byte by byte
  for (size_t vl; n > 0; n -= vl, src += vl, dst += vl) {
    vl = __riscv_vsetvl_e8m8(n);
    vuint8m8_t vec_src = __riscv_vle8_v_u8m8(src, vl);
    __riscv_vse8_v_u8m8(dst, vec_src, vl);
  }
  return destination;
}

Note: GCC-14 autovectorization will often convert normal calls to memcpy into something very similar to the memcpy_vec code above, then assemble it down to RISCV vector instructions.

As another example, here is a snippet of code from the whisper.cpp voice-to-text open source project:

...
#ifdef __riscv_v_intrinsic
#include <riscv_vector.h>
#endif
...
#elif defined(__riscv_v_intrinsic)

    size_t vl = __riscv_vsetvl_e32m4(QK8_0);

    for (int i = 0; i < nb; i++) {
        // load elements
        vfloat32m4_t v_x   = __riscv_vle32_v_f32m4(x+i*QK8_0, vl);

        vfloat32m4_t vfabs = __riscv_vfabs_v_f32m4(v_x, vl);
        vfloat32m1_t tmp   = __riscv_vfmv_v_f_f32m1(0.0f, vl);
        vfloat32m1_t vmax  = __riscv_vfredmax_vs_f32m4_f32m1(vfabs, tmp, vl);
        float amax = __riscv_vfmv_f_s_f32m1_f32(vmax);
        ...
    }

Normally you would expect to see functions like __riscv_vfabs_v_f32m4 defined in the include file riscv_vector.h, where Ghidra could process it and help identify calls to these intrinsics. The vector intrinsic functions are instead autogenerated directly into GCC’s internal compiled header format when the compiler is built - there are just too many variants to cope with. The PDF listing of all intrinsic functions is currently over 4000 pages long. For example, the signature for __riscv_vfredmax_vs_f32m4_f32m1 is given on page 734 under Vector Reduction Operations as

vfloat32m1_t __riscv_vfredmax_vs_f32m4_f32m1(vfloat32m4_t vs2, vfloat32m1_t vs1, size_t vl);

There aren’t all that many vector instruction genotypes, but there are an enormous number of contextual variations the compiler and assembler know about.
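
Most of that context is packed into the intrinsic’s name. A rough decomposition of the example above - our reading of the naming convention, not an excerpt from the listing:

/* __riscv_ vfredmax _vs _f32m4 _f32m1
 *  prefix  operation form source result
 *
 * vfredmax : floating-point reduction, maximum
 * vs       : vector-scalar operand form
 * f32m4    : 32-bit float elements, register group LMUL=4 (vs2)
 * f32m1    : 32-bit float elements, LMUL=1 (vs1 and the result)
 */
vfloat32m1_t __riscv_vfredmax_vs_f32m4_f32m1(vfloat32m4_t vs2,
                                             vfloat32m1_t vs1, size_t vl);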

3.4 - link time optimization

Link Time Optimization

Link Time Optimization (LTO) is a relatively new form of toolchain optimization that can produce smaller and faster binaries. It can also mutate control flows in those binaries, making Ghidra analysis trickier, especially if one is using BSim to look for control flow similarities.

Can we generate importable exemplars using LTO to show what such optimization steps look like in Ghidra?

LTO needs a command line parameter added for both compilation and linking. With bazel, that means --copt="-flto" --linkopt="-Wl,-flto" is enough to request LTO optimization on a build. These lto flags can also be defaulted into the toolchain definition or individual build files.

Let’s try this with a progressively more complicated series of exemplars:

# Build helloworld without LTO as a control
$ bazel build -s --copt="-O2" --platforms=//platforms:riscv_vector userSpaceSamples:helloworld
...
$ ls -l bazel-bin/userSpaceSamples/helloworld
-r-xr-xr-x. 1 --- --- 8624 Jan 31 10:44 bazel-bin/userSpaceSamples/helloworld

# The helloworld exemplar doesn't benefit much from link time optimization
$ bazel build -s  --copt="-O2"  --copt="-flto" --linkopt="-Wl,-flto" --platforms=//platforms:riscv_vector userSpaceSamples:helloworld
$ ls -l bazel-bin/userSpaceSamples/helloworld
-r-xr-xr-x. 1 --- --- 8608 Jan 31 10:46 bazel-bin/userSpaceSamples/helloworld

The memcpy source exemplar can be built three ways:

  • without vector extensions and without LTO - build target gcc_vectorization:memcpy
  • with vector extensions and without LTO - build target gcc_vectorization:memcpy_vector
  • with vector extensions and with LTO - build target gcc_vectorization:memcpy_lto

In this case the LTO options are configured into gcc_vectorization/BUILD.

$ bazel build -s --platforms=//platforms:riscv_vector gcc_vectorization:memcpy
...
INFO: Build completed successfully ...
$ ls -l bazel-bin/gcc_vectorization/memcpy
-r-xr-xr-x. 1 --- --- 13488 Jan 31 11:16 bazel-bin/gcc_vectorization/memcpy
$ bazel build -s  --platforms=//platforms:riscv_vector gcc_vectorization:memcpy_vector
INFO: Build completed successfully ...
$ ls -l bazel-bin/gcc_vectorization/memcpy_vector
-r-xr-xr-x. 1 --- --- 13728 Jan 31 11:18 bazel-bin/gcc_vectorization/memcpy_vector
$ bazel build -s  --platforms=//platforms:riscv_vector gcc_vectorization:memcpy_lto
ERROR: ...: Linking gcc_vectorization/memcpy_lto failed: (Exit 1): gcc failed: error executing CppLink command (from target //gcc_vectorization:memcpy_lto) ...
lto1: internal compiler error: in riscv_hard_regno_nregs, at config/riscv/riscv.cc:8058
Please submit a full bug report, with preprocessed source (by using -freport-bug).
See <https://gcc.gnu.org/bugs/> for instructions.
lto-wrapper: fatal error: external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-gcc returned 1 exit status
compilation terminated.

So it looks like LTO has problems with RISCV vector instructions. We’ll keep testing this as more gcc-14 snapshots become available, but as a lower priority exercise. LTO does not yet seem to be a popular optimization.

3.5 - testing pcode semantics

Ghidra processor semantics needs tests

Processor instructions known to Ghidra are defined in Sleigh pcode or semantic sections. Adding new instructions - such as instruction set extensions to an existing processor - requires a pcode description of what that instruction does. That pcode drives both the decompiler process and any emulator or debugger processes.

This generates a conflict in testing. Should we test for maximum clarity for semantics rendered in the decompiler window or maximum fidelity in any Ghidra emulator? For example, should a divide instruction include pcode to test against a divide-by-zero? Should floating point instructions guard against NaN (Not a Number) inputs?

We assume here that decompiler fidelity is more important than emulator fidelity. That implies:

  • ignore any exception-generating cases, including divide-by-zero, NaN, memory access and memory alignment.
  • pcode must allow for normal C implicit type conversions, such as between different integer and floating point lengths.
    • this implies pcode must pay attention to Ghidra’s type inference system.

Concept of Operations

Individual instructions are wrapped in C and exercised within a Google Test C++ framework. The test framework is then executed within a qemu static emulation environment.

For example, let’s examine two riscv-64 instructions: fcvt.w.s and fmv.x.w

  • fcvt.w.s converts a floating-point number in floating-point register rs1 to a signed 32-bit integer in integer register rd.
  • fmv.x.w moves the single-precision value in floating-point register rs1 represented in IEEE 754-2008 encoding to the lower 32 bits of integer register rd. For RV64, the higher 32 bits of the destination register are filled with copies of the floating-point number’s sign bit.

These two instructions have similar signatures but very different semantics. fcvt.w.s performs a float to int type conversion, so the float 1.0 can be converted to int 1. fmv.x.w moves the raw bits between float and int registers without any type conversion.
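
The distinction is easy to state in plain C - a reference sketch of our own, not project code:

#include <stdint.h>
#include <string.h>

int32_t fcvt_w_s_ref(float x) {      /* value conversion: 1.0f -> 1 */
    return (int32_t)x;
}

int32_t fmv_x_w_ref(float x) {       /* raw bit move: 1.0f -> 0x3f800000 */
    int32_t bits;
    memcpy(&bits, &x, sizeof bits);  /* no value conversion at all */
    return bits;
}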

We can generate simple exemplars of both instructions with this C code:

int fcvt_w_s(float* x) {
    return (int)*x;
}

int fmv_x_w(float* x) {
    int val;
    float src = *x;

    __asm__ __volatile__ (
        "fmv.x.w  %0, %1" \
        : "=r" (val) \
        : "f" (src));
    return val;
}

Ghidra’s 11.2-DEV decompiler renders these as:

long fcvt_w_s(float *param_1)
{
  return (long)(int)*param_1;
}
long fmv_x_w(float *param_1)
{
  return (long)(int)param_1;
}

fmv_x_w is missing a dereference operation. It also implies an implicit type conversion where none is actually performed. Let’s trace how to use these test results to improve the decompiler output.

Running tests

The draft test harness can be built and run from the top level workspace directory.

$ bazel build --platforms=//riscv64/generated/platforms:riscv_userspace riscv64/generated/emulator_tests:testSemantics
Starting local Bazel server and connecting to it...
INFO: Analyzed target //riscv64/generated/emulator_tests:testSemantics (74 packages loaded, 1902 targets configured).
...
INFO: From Executing genrule //riscv64/generated/emulator_tests:testSemantics:
...
INFO: Found 1 target...
Target //riscv64/generated/emulator_tests:testSemantics up-to-date:
  bazel-bin/riscv64/generated/emulator_tests/results
INFO: Elapsed time: 4.172s, Critical Path: 1.93s
INFO: 4 processes: 1 internal, 3 linux-sandbox.
INFO: Build completed successfully, 4 total actions

$ cat bazel-bin/riscv64/generated/emulator_tests/results
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from FP
[ RUN      ] FP.testharness
[       OK ] FP.testharness (3 ms)
[ RUN      ] FP.fcvt
[       OK ] FP.fcvt (10 ms)
[ RUN      ] FP.fmv
[       OK ] FP.fmv (1 ms)
[ RUN      ] FP.fp16
[       OK ] FP.fp16 (0 ms)
[----------] 4 tests from FP (15 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (19 ms total)
[  PASSED  ] 4 tests.

$ file bazel-bin/riscv64/generated/emulator_tests/libfloatOperations.so
bazel-bin/riscv64/generated/emulator_tests/libfloatOperations.so: ELF 64-bit LSB shared object, UCB RISC-V, RVC, double-float ABI, version 1 (SYSV), dynamically linked, not stripped

Test parameters

What version of Ghidra are we testing against?

What do we do with Ghidra patches that improve the decompilation results?

  • If the patched instructions only exist within the isa_ext fork, we will make the changes to that fork and its PR
  • If the patches come from unmerged PRs we may cherry-pick them into isa_ext. This includes some of the fcvt and fmv patches from other sources.

Test example

Use meld to compare the original C source file floatOperations.c with the exported C decompiler view from Ghidra 11.2-DEV. A quick inspection shows some errors to address:

Original:  float fcvt_s_wu(uint32_t* i) {return (float)*i;}
Ghidra:    float fcvt_s_wu(uint *param_1){return (float)ZEXT416(*param_1);}

Original:  double fcvt_d_wu(uint32_t* j){return (double)*j;}
Ghidra:    double fcvt_d_wu(uint *param_1){return (double)ZEXT416(*param_1);}

Ghidra:    long fmv_x_w(float *param_1){return (long)(int)*param_1;}
Ghidra:    long fmv_x_d(double *param_1){return (long)(int)*param_1;}
Ghidra:    long fcvt_h_w(int param_1){return (long)param_1;}
Ghidra:    long fcvt_h_wu(int param_1){return (long)param_1;}
Ghidra:    ulong fcvt_h_d(ulong *param_1){return *param_1 & 0xffffffff;}

The errors include:

  • spurious ZEXT416 in two places
  • fmv instructions appear to force an implicit type conversion where none is wanted
  • missing dereference operation in fcvt_h_w and fcvt_h_wu
  • bad mask operation in fcvt_h_d

Next steps

Testing semantics for the zfh half-precision floating point instructions is more complicated than usual. Ghidra’s semantics and pcode system has no known provision for half-precision floating point, so emulation won’t work well. The current zfh implementation makes these _fp16 objects look like 32 bit floats in registers and like 16 bit shorts in memory operations, making Ghidra type inferencing even more confusing.

Let’s look at a more limited scope, the definition of the Ghidra trunc pcode op.

The documentation says trunc produces a signed integer obtained by truncating its argument.

  • how does trunc set its result type?
  • does trunc expect only a floating point double?
  • what would it take to define trunc_u to generate an unsigned integer?
  • what would it take to accept a half-precision floating point value as an argument?

The documentation also says that float2float ‘copies a floating-point number with more or less precision’, so its implementation may tell us something about type inferencing.

  • Ghidra/Features/Decompiler/src/decompile/cpp/pcodeparse.cc binds float2float to OP_FLOAT2FLOAT
  • this leads to CPUI_FLOAT_FLOAT2FLOAT and to several files under Ghidra/Features/Decompiler/src/decompile/cpp.
  • functions like FloatFormat::opFloat2Float and FloatFormat::opTrunc look relevant in float.hh and float.cc
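
For orientation, plain-C analogues of these conversions - a sketch, where trunc_u is the hypothetical op named in the list above:

double f2f_widen(float f)   { return (double)f; }   /* float2float, widening  */
float  f2f_narrow(double d) { return (float)d;  }   /* float2float, narrowing */
long   trunc_ref(double d)  { return (long)d;   }   /* trunc: toward zero     */
unsigned long trunc_u_ref(double d) { return (unsigned long)d; }  /* hypothetical trunc_u */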

4 - Gap Analysis Example

What gaps in Ghidra’s import processes need the most long term attention?

Some features are easy or quick to add to Ghidra’s import processes. Other features might be nice to have but just aren’t worth the effort. How do we approach features that are probably going to be important in the long term but would require a lot of effort to address?

This section considers RISCV-64 code optimization by vector instruction insertion as an example. Either the compiler or the coder can choose to replace sequences of simple instructions with sequences of vector instructions. Those vector sequences often do not have a clean C representation in Ghidra’s decompiler view, making it difficult for Ghidra users to understand what the code is doing and to look for malware or other pathologies.

The overview introduced an approach to this sort of challenge:

  1. What are current examples of this feature, especially examples that support analysis of the feature or its pathologies?
  2. How and when might this feature impact a significant number of Ghidra analysts?
  3. How much effort might it take Ghidra developers to fill the implied feature gap? Do we fill it by extending the core of Ghidra, by generating new plugin scripts or tools, or by educating Ghidra users on how to recognize semantic patterns from raw instructions?
  4. Is this feature specific to RISCV systems or more broadly applicable to other processor families? Would support for that feature be common to many processor families or vary widely by processor?
  5. What are the existing frameworks within Ghidra that might most credibly be extended to support that feature?

4.1 - Examples

Where does this gap appear?

memory copy

  • alignment issues
  • obfuscated memcpy and strcpy inline code

other pcode or RTL expansions

loop optimization

vector intrinsics

ML and AI subsystems

4.2 - Impact

What is the impact of this gap?

How

Ghidra’s current limits in handling RISCV-64 vector instructions will impact users in phases: the initial impacts are modest and fairly easy to deal with, while later impacts will take significant design work to address.

The most immediate impact involves Ghidra disassembly and decompilation failure when encountering unrecognized instructions. The Fedora 39 exemplar kernel contains several extension instructions that Ghidra 11 can’t recognize. These are limited in number and don’t have a material impact on someone examining RISCV kernel code. The voice-to-text app whisper.cpp shows more serious limits - roughly one third of the app’s instructions are unprocessed by Ghidra 11 because of vector and other extension instructions.

That impact can be addressed by simply defining the missing instructions, as in Ghidra’s isa_ext experimental branch. This will allow the disassembler and decompiler to process all instructions in the app. This is necessary but not sufficient, since many or most of the vector extension instructions do not have a clean pcode representation. Obvious calls to memcpy will be replaced with one of a half-dozen inline vector instruction sequences. Simple or nested loops will be ‘vectorized’ with fewer iterations but much more complex instruction opcode sequences. Optimizing compilers can handle those complexities, while Ghidra users searching for malware will have a harder time of it.

The general challenge for Ghidra is that of reconstructing the context from sequences of vector extension instructions.

When

Note: Some material comes as-is from https://www.reddit.com/r/RISCV

The first generally available 64 bit RISCV vector systems development kit appeared in January 2024, based on the relatively modest THead C908 core. This SDK appears tuned for video processing, perhaps video surveillance applications aggregating multiple cameras into a common video feed. We are probably several years from seeing server-class systems built on SiFive P870 cores, and fabricated on the fastest available fab lines. Memory bandwidth is poor at present, while energy efficiency is potentially better than x86_64 designs.

Judging from internet hype, we can expect to see RISCV vector code appearing in replacements of ARM systems (automotive and possibly cell phone) and as the extensible basis of AI applications.

  • Cores announced
    • SiFive
      • P670 2 x 128 bit vector units, up to 16 cores
      • P870 2 x 128 bit vector units, vector crypto, up to 16 cores
    • Alibaba XuanTie THead
      • C908 with RVV 1.0 support, 128 bit VLEN; announced 2022
    • StarFive
      • Starfive does not appear to offer a vector RISCV core
  • SDKs available
    • CanMV-K230, dual C908 cores, triple video camera inputs, $40; one core supports RVV 1.0 at 1.6 GHz; 512 MB RAM; announced 2023
    • Sophgo SG2380 due Q3 2024 with 16 core SiFive P670 and 8 core SiFive X280

Who is working on this

January 2024 saw a flurry of open source toolchain and framework contributions from several sources.

  • binutils contributors
    • multiple recent contributors from Alibaba, mostly in support of THead extensions
  • gcc contributors
    • Intel, Alibaba, RiVAI (ref the XCVsimd extension), Embecosm, SiFive, ESWIN Computing, Ventana Micro, and Andes Technology all contributed to the riscv testsuite in the last two weeks.
  • glibc contributions
    • some references to Alibaba riscv extensions
  • ML framework contributors

4.3 - Effort

How much effort might it take to fill the gap?

4.4 - Scope

Does the scope of this gap extend to other processors?

  • x86_64 comparison
  • alignment

4.5 - Existing Frameworks

Which Ghidra frameworks might be extended to fill the gap?

Outline

  • What can we add to sleigh .sinc files?
    • add all extension instructions
    • add translation of Elf file attributes into vendor-specific processor selection
    • flesh out extension mnemonics to convey vector context, especially vset* instructions
    • add comments or metadata that is accessible to the decompiler
  • What can we add to pcode semantics?
    • gcc built-ins like __builtin_memcpy or popcount
    • cross platform vector notation
    • processor dependent decompiler plugins
  • What can we add to disassembler
    • generalized instruction information on common use patterns
  • What can we add to decompiler
    • reconstruct gcc RTL built-ins
  • What plugins can we add?
    • reconstruct gcc RTL built-ins
  • What external tools can we leverage?
    • generate .sinc updates based on objdump mnemonics
    • known source exemplar builds to correlate RTL expressions with instruction sequences
    • apply general ML translation to undo pcode expansion into vector instructions

5 - Platforms and Toolchains

Code is built by a toolchain (compiler, linker) to run on a platform (e.g., a Pixel 7a cellphone).

This project adopts the Bazel framework for building importable exemplars. Platforms describe the foundation on which code will run. Toolchains compile and link code for different platforms. Bazel builds are hermetic, which for our purposes means that platforms and toolchains are all versioned and importable, so build results are the same no matter where the build host may be.

Example of RISCV-64 platforms and toolchains

The directory RISCV64/toolchain defines these platforms:

  • //platforms:riscv_userspace for a generic RISCV-64 Linux appliance with the usual libc and libstdio APIs
  • //platforms:riscv_vector for a more specialized RISCV-64 Linux appliance with vector extensions supported
  • //platforms:riscv_custom for a highly specialized RISCV-64 Linux appliance with vector and vendor-specific extensions supported
  • //platforms:riscv_local for toolchain debugging, using a local file system toolchain under /opt/riscvx

Note: The current binutils and gcc show more vendor-specific instruction set extensions from THead, so we will arbitrarily use that as the exemplar custom platform.

This directory defines these toolchains:

  • //toolchains:riscv64-default - a gcc-13 stable RISCV compiler, linker, loader, and sysroot of related include files and libraries
  • //toolchains:riscv64-next - a gcc-14 unreleased but feature-frozen RISCV compiler, linker, loader, and sysroot of related include files and libraries
  • //toolchains:riscv64-custom - a variant of //toolchains:riscv64-next with multiple standard and vendor-specific ISA extensions enabled by default
  • //toolchains:riscv64-local - a toolchain executing out of /opt/riscvx instead of a portable tarball. Generally useful only when debugging the generation of a fully portable and hermetic toolchain tarball.

Exemplars are built by naming the platform for each build. Bazel then finds a compatible toolchain to complete the build.

# compile for the riscv_userspace platform, automatically selecting the riscv64-default toolchain with gcc-13.
bazel build -s --platforms=//platforms:riscv_userspace gcc_vectorization:helloworld_challenge
# compile for the riscv_vector platform, automatically selecting the riscv64-next toolchain with gcc-14.
bazel build -s --platforms=//platforms:riscv_vector gcc_vectorization:helloworld_challenge

This table shows relationships between platforms, constraints, toolchains, and default options:

platform                     cpu constraint          toolchain                     default options added  optimized options
//platforms:riscv_userspace  //toolchains:riscv64    //toolchains:riscv64-default                         -O3
//platforms:riscv_vector     //toolchains:riscv64-v  //toolchains:riscv64-next     -march=rv64gcv         -O3
//platforms:riscv_custom     //toolchains:riscv64-c  //toolchains:riscv64-custom   -march=rv64gcv_zba_zbb_zbc_zbkb_zbkc_zbkx_zvbc_xtheadba_xtheadbb_xtheadbs_xtheadcmo_xtheadcondmov_xtheadmac_xtheadfmemidx_xtheadmempair_xtheadsync  -O3
//platforms:riscv_local      //toolchains:riscv64-l  //toolchains:riscv64-local                           -O3

Notes:

  • The -O3 option is likely too aggressive. The -O2 option would be more common in broadly released software.
  • //toolchains:riscv64-default currently uses a gcc-13 toolchain suite
  • the other toolchains use various developmental snapshots of the gcc-14 toolchain suite
  • vector extensions version 1.0 are default on //toolchains:riscv64-next and //toolchains:riscv64-custom
  • //toolchains:riscv64-custom adds bit manipulation and many of the THead extensions supported by binutils.

Warning: C options can be added by the toolchain, within a BUILD file, and on the command line. For options like -O and -march, only the last instance of the option affects the build.

Toolchain details

Toolchains generally include several components that can affect the generated binaries:

  • the gcc compiler, built from source and configured for a specific target architecture and language set
  • binutils utilities, including a gas assembler with support for various instruction set extensions and disassembler tools like objdump that provide reference handling of newer instructions.
  • linker and linker scripts
  • a sysroot holding files the above subsystems would normally expect to find under /usr, for instance /usr/include files supplied by the kernel and standard libraries
  • libc, libstdc++, etc.
  • default compiler options and include directories

The toolchain prepared for building a kernel module won’t be the same as a toolchain built for userspace programs, even if the compilers are identical.

See adding toolchains for an example of adding a new toolchain to this project.

5.1 - ISA Extensions

Extensions to a processor family’s Instruction Set Architecture add capability and complexity.

The RISCV community has a rich set of extensions to the base Instruction Set Architecture. That means a diverse set of new binary import targets to test against. This work-in-progress is collected in the riscv64/generated/assemblySamples directory. The basic idea is to compare current Ghidra disassembly with current binutils objdump disassembly, using object files assembled from the binutils gas testsuite. For example:

  • riscv64/generated/assemblySamples/h-ext-64.S was copied from the binutils gas testsuite. It contains unit test instructions for hypervisor support extensions like hfence.vvma and hlv.w.
  • riscv64/exemplars/h-ext-64.o is the object file produced by a current snapshot of the binutils 2.41 assembler. The associated listing is riscv64/exemplars/h-ext-64.list.
  • riscv64/exemplars/h-ext-64.objdump is the output from disassembling riscv64/exemplars/h-ext-64.o using the current snapshot of the binutils 2.41 objdump.

So we want to open Ghidra, import riscv64/exemplars/h-ext-64.o, and compare the disassembly window to riscv64/exemplars/h-ext-64.objdump, then triage any variances.

Some variances are trivial. The h-ext-64.S tests include alias instructions that assemble into a single 4 byte sequence; disassembly can only return a single instruction, perhaps the simplest of the given aliases.

Other variances are harder - it looks like Ghidra expects to see an earlier, deprecated set of vector instructions rather than the currently approved set.

riscv64/generated/assemblySamples/TODO.md collects some of the variances noted so far.

One big question is what kind of pcode Ghidra should generate for some of these instructions - and how many Ghidra users will care about that pcode. The short term answer is to treat extension instructions as pcode function calls. The longer term answer may be to wait until gcc-14 comes out with support for vector extensions, then see what kind of C source is conventionally used when invoking those extensions. The memcpy inline function from libc is a likely place to find early use of vector instructions.

Also, what can we safely ignore for now? The proposed vendor-specific T-Head extension instruction th.l2cache.iall won’t be seen by most Ghidra users. On the other hand, the encoding rules published with those T-Head extensions look like a good example to follow.

The Fedora 39 kernel includes virtual machine cache management instructions that are not necessarily supported by binutils - they are ‘assembled’ with gcc macros before reaching the binutils assembler. We will ignore those instruction extensions for now, and only consider instruction extensions supported by binutils.

Determining the ISA extensions required by a binary

Some newer compilers annotate executable binaries by adding the ISA extensions used during the build.

$ /opt/riscvx/bin/riscv64-unknown-linux-gnu-readelf -A riscv64/exemplars/whisper_cpp_default
Attribute Section: riscv
File Attributes
  Tag_RISCV_stack_align: 16-bytes
  Tag_RISCV_arch: "rv64i2p1_m2p0_a2p1_f2p2_d2p2_c2p0_zicsr2p0_zmmul1p0"

$ /opt/riscvx/bin/riscv64-unknown-linux-gnu-readelf -A riscv64/exemplars/whisper_cpp_vector
Attribute Section: riscv
File Attributes
  Tag_RISCV_stack_align: 16-bytes
  Tag_RISCV_arch: "rv64i2p1_m2p0_a2p1_f2p2_d2p2_c2p0_v1p0_zicsr2p0_zifencei2p0_zmmul1p0_zve32f1p0_zve32x1p0_zve64d1p0_zve64f1p0_zve64x1p0_zvl128b1p0_zvl32b1p0_zvl64b1p0"
  Tag_RISCV_priv_spec: 1
  Tag_RISCV_priv_spec_minor: 11

$ /opt/riscvx/bin/riscv64-unknown-linux-gnu-readelf -A riscv64/exemplars/whisper_cpp_vendor
Attribute Section: riscv
File Attributes
  Tag_RISCV_stack_align: 16-bytes
  Tag_RISCV_arch: "rv64i2p1_m2p0_a2p1_f2p2_d2p2_c2p0_v1p0_zicsr2p0_zifencei2p0_zmmul1p0_zba1p0_zbb1p0_zbc1p0_zbkb1p0_zbkc1p0_zbkx1p0_zvbc1p0_zve32f1p0_zve32x1p0_zve64d1p0_zve64f1p0_zve64x1p0_zvl128b1p0_zvl32b1p0_zvl64b1p0_xtheadba1p0_xtheadbb1p0_xtheadbs1p0_xtheadcmo1p0_xtheadcondmov1p0_xtheadfmemidx1p0_xtheadmac1p0_xtheadmempair1p0_xtheadsync1p0"
  Tag_RISCV_priv_spec: 1
  Tag_RISCV_priv_spec_minor: 11

If Tag_RISCV_arch contains the substring v1p0, then the associated binary was built assuming RV Vector 1.0 extension instructions are present on the executing CPU hardware thread.
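
A quick check for that substring - our own helper sketch, not project code - needs the leading underscore to avoid false matches inside longer extension names like zvl128b1p0:

#include <stdbool.h>
#include <string.h>

bool uses_rvv_1_0(const char *tag_riscv_arch) {
    /* the standalone vector extension appears as "..._v1p0_..." */
    return strstr(tag_riscv_arch, "_v1p0") != NULL;
}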

6 - Instruction Patterns

Common instruction patterns one might see with vectorized code generation

This page collects architecture-dependent gcc-14 expansions, where simple C sequences are translated into optimized code.

Our baseline is a gcc-14 compiler with -O2 optimization and a base machine architecture of -march=rv64gc. That’s a basic 64 bit RISCV processor (or a hart core of that processor) with support for compressed instructions.

Variant machine architectures considered here are:

march             description
rv64gc            baseline
rv64gcv           baseline + vector extension (dynamic vector length)
rv64gcv_zvl128b   baseline + vector (minimum 128 bit vectors)
rv64gcv_zvl512b   baseline + vector (minimum 512 bit vectors)
rv64gcv_zvl1024b  baseline + vector (minimum 1024 bit vectors)
rv64gc_xtheadbb   baseline + THead bit manipulation extension (no vector)

Memory copy operations

Note: memory copy operations require non-overlapping source and destination. Memory move operations allow overlap, but are much more complicated and are not currently optimized.

Optimizing compilers are good at turning simple memory copy operations into confusing - but fast - instruction sequences. GCC can recognize memory copy operations as calls to memcpy or as structure assignments like *a = *c.

The current reference C file is:

extern void *memcpy(void *__restrict dest, const void *__restrict src, __SIZE_TYPE__ n);
extern void *memmov(void *dest, const void *src, __SIZE_TYPE__ n);

/* invoke memcpy with dynamic size */
void cpymem_1 (void *a, void *b, __SIZE_TYPE__ l)
{
  memcpy (a, b, l);
}

/* invoke memcpy with known size and aligned pointers */
extern struct { __INT32_TYPE__ a[16]; } a_a, a_b;

void cpymem_2 ()
{
  memcpy (&a_a, &a_b, sizeof a_a);
}

typedef struct { char c[16]; } c16;
typedef struct { char c[32]; } c32;
typedef struct { short s; char c[30]; } s16;

/* copy fixed 128 bits of memory */
void cpymem_3 (c16 *a, c16* b)
{
  *a = *b;
}

/* copy fixed 256 bits of memory */
void cpymem_4 (c32 *a, c32* b)
{
  *a = *b;
}

/* copy fixed 256 bits of memory */
void cpymem_5 (s16 *a, s16* b)
{
  *a = *b;
}

/* memmov allows overlap - don't vectorize or inline */
void movmem_1(void *a, void *b, __SIZE_TYPE__ l)
{
  memmov (a, b, l);
}

Baseline (no vector)

Ghidra 11 with the isa_ext branch decompiler gives us something simple after fixing the signature of the memcpy thunk.

void cpymem_1(void *param_1,void *param_2,size_t param_3)
{
  memcpy(param_1,param_2,param_3);
  return;
}
void cpymem_2(void)
{
  memcpy(&a_a,&a_b,0x40);
  return;
}
void cpymem_3(void *param_1,void *param_2)
{
  memcpy(param_1,param_2,0x10);
  return;
}
void cpymem_4(void *param_1,void *param_2)
{
  memcpy(param_1,param_2,0x20);
  return;
}
void cpymem_5(void *param_1,void *param_2)
{
  memcpy(param_1,param_2,0x20);
  return;
}

rv64gcv - vector extensions

If the compiler knows the target hart can process vector extensions, but is not told explicitly the size of each vector register, it optimizes all of these calls. Ghidra 11 gives us the following, with binutils’ objdump instruction listings added as comments:

long cpymem_1(long param_1,long param_2,long param_3)
{
  long lVar1;
  undefined auVar2 [256];
  do {
    lVar1 = vsetvli_e8m8tama(param_3);  // vsetvli a5,a2,e8,m8,ta,ma
    auVar2 = vle8_v(param_2);           // vle8.v  v8,(a1)
    param_3 = param_3 - lVar1;          // sub     a2,a2,a5
    vse8_v(auVar2,param_1);             // vse8.v  v8,(a0)
    param_2 = param_2 + lVar1;          // add     a1,a1,a5
    param_1 = param_1 + lVar1;          // add     a0,a0,a5
  } while (param_3 != 0);               // bnez    a2,8a8 <cpymem_1>
  return param_1;
}
void cpymem_2(void)
{
                                        // ld      a4,1922(a4) # 2040 <a_b@Base>
                                        // ld      a5,1938(a5) # 2058 <a_a@Base>
  undefined auVar1 [256];
  vsetivli(0x10,0xd3);                  // vsetivli        zero,16,e32,m8,ta,ma
  auVar1 = vle32_v(&a_b);               // vle32.v v8,(a4)
  vse32_v(auVar1,&a_a);                 // vse32.v v8,(a5)
  return;
}
void cpymem_3(undefined8 param_1,undefined8 param_2)
{
  undefined auVar1 [256];
  vsetivli(0x10,0xc0);                   // vsetivli        zero,16,e8,m1,ta,ma
  auVar1 = vle8_v(param_2);              // vle8.v  v1,(a1)
  vse8_v(auVar1,param_1);                // vse8.v  v1,(a0)
  return;
}
void cpymem_4(undefined8 param_1,undefined8 param_2)
{
  undefined auVar1 [256];                // li      a5,32
  vsetvli_e8m8tama(0x20);                // vsetvli        zero,a5,e8,m8,ta,ma
  auVar1 = vle8_v(param_2);              // vle8.v  v8,(a1)
  vse8_v(auVar1,param_1);                // vse8.v  v8,(a0)
  return;
}
void cpymem_5(undefined8 param_1,undefined8 param_2)
{
  undefined auVar1 [256];
  vsetivli(0x10,0xcb);                   // vsetivli        zero,16,e16,m8,ta,ma
  auVar1 = vle16_v(param_2);             // vle16.v v8,(a1)
  vse16_v(auVar1,param_1);               // vse16.v v8,(a0)
  return;
}

The variation in the vset* instructions is a bit puzzling. This may be due to alignment issues - trying to copy a short int into a misaligned odd address generates an exception at the store instruction, so perhaps the vector optimization is supposed to throw an exception there too.

6.1 - Application Survey

Survey a voice-to-text app for common vector instruction patterns

Take an exemplar RISCV-64 binary like whisper.cpp, with its many vector instructions. Which vector patterns are easy to recognize, either for a human Ghidra user or for a hypothetical Ghidra plugin?

Some of the most common patterns correspond to memcpy or memset invocations where the number of bytes is known at compile time as is the alignment of operands.

ML apps like whisper.cpp often work with parameters of less than 8 bits, so there can be a lot of demarshalling, unpacking, and repacking operations. That means lots of vector bit manipulation and width conversion operations.

ML apps also do a lot of vector, matrix, and tensor arithmetic, so we can expect to find vectorized arithmetic operations mixed in with vector parameter conversion operations.

Note: This page is likely to change rapidly as we get a better handle on the problem and develop better analytic tools to guide the process.

Survey for vector instruction blocks

Most vector instructions come in groups started with a vsetvli or vsetivli instruction to set up the vector context. If the number of vector elements is known at compile time and less than 32, then the vsetivli instruction is often used. Otherwise the vsetvli instruction is used.

Scanning for these instructions showed 673 vsetvli and 888 vsetivli instructions within whisper.cpp.

The most common vsetvli instruction (343 out of 673) is type 0xc3 or e8,m8,ta,ma. That expands to:

  • element width = 8 bits - no alignment checks are needed, 16 elements per vector register if VLEN=128
  • multiplier = 8 - up to 8 vector registers are processed in parallel
  • tail agnostic - we don’t care about preserving unassigned vector register bits
  • mask agnostic - we don’t care about preserving unmasked vector register bits

The most common vsetivli instruction (565 out of 888) is type 0xd8 or e64,m1,ta,ma. That expands to:

  • element width = 64 bits - all memory operations should be 64 bit aligned, 2 elements per vector register if VLEN=128
  • multiplier = 1 - only the named vector register is used
  • tail agnostic - we don’t care about preserving unassigned vector register bits
  • mask agnostic - we don’t care about preserving unmasked vector register bits

A similar common vsetivli instruction (102 out of 888) is type 0xdb or e64,m8,ta,ma. That expands to:

  • element width = 64 bits - all memory operations should be 64 bit aligned, 2 elements per vector register if VLEN=128
  • multiplier = 8 - up to 8 vector registers are processed in parallel, or 16 64 bit elements if VLEN=128
  • tail agnostic - we don’t care about preserving unassigned vector register bits
  • mask agnostic - we don’t care about preserving unmasked vector register bits

The second most common vsetivli instruction (107 out of 888) is type 0xc7 or e8,mf2,ta,ma. That expands to:

  • element width = 8 bits
  • multiplier = 1/2 - vector registers are only half used, perhaps to allow element widening to 16 bits
  • tail agnostic - we don’t care about preserving unassigned vector register bits
  • mask agnostic - we don’t care about preserving unmasked vector register bits
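
These type byte expansions follow directly from the RVV 1.0 vtype layout. A small decoder sketch - our own code, matching the four examples above:

#include <stdint.h>
#include <stdio.h>

/* vtype layout: vlmul[2:0], vsew[5:3], vta[6], vma[7] */
static const char *LMUL[8] = {"m1","m2","m4","m8","?","mf8","mf4","mf2"};

void decode_vtype(uint8_t vtype) {
    printf("e%d,%s,%s,%s\n",
           8 << ((vtype >> 3) & 7),       /* vsew: e8, e16, e32, e64   */
           LMUL[vtype & 7],               /* register group multiplier */
           (vtype & 0x40) ? "ta" : "tu",  /* tail agnostic/undisturbed */
           (vtype & 0x80) ? "ma" : "mu"); /* mask agnostic/undisturbed */
}

/* decode_vtype(0xc3) prints e8,m8,ta,ma; decode_vtype(0xd8) prints e64,m1,ta,ma */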

How many of these vector blocks can be treated as simple memcpy or memset invocations?

For example, this Ghidra listing snippet looks like a good candidate for memcpy:

00090bdc 57 f0 b7 cd     vsetivli                       zero,0xf,e64,m8,ta,ma
00090be0 07 74 07 02     vle64.v                        v8,(a4)
00090be4 27 f4 07 02     vse64.v                        v8,(a5)

A pcode equivalent might be __builtin_memcpy(dest=(a5), src=(a4), 8 * 15) with a possible context note that vector registers v8 through v16 are changed.

A longer example might be a good candidate for memset:

00090b84 57 70 81 cd     vsetivli                       zero,0x2,e64,m1,ta,ma
00090b88 93 07 07 01     addi                           a5,a4,0x10
00090b8c d7 30 00 5e     vmv.v.i                        v1,0x0
00090b90 a7 70 07 02     vse64.v                        v1,(a4)
00090b94 a7 f0 07 02     vse64.v                        v1,(a5)
00090b98 93 07 07 02     addi                           a5,a4,0x20
00090b9c a7 f0 07 02     vse64.v                        v1,(a5)
00090ba0 93 07 07 03     addi                           a5,a4,0x30
00090ba4 a7 f0 07 02     vse64.v                        v1,(a5)
00090ba8 93 07 07 04     addi                           a5,a4,0x40
00090bac a7 f0 07 02     vse64.v                        v1,(a5)
00090bb0 93 07 07 05     addi                           a5,a4,0x50
00090bb4 a7 f0 07 02     vse64.v                        v1,(a5)
00090bb8 93 07 07 06     addi                           a5,a4,0x60
00090bbc a7 f0 07 02     vse64.v                        v1,(a5)
00090bc0 fd 1b           c.addi                         s7,-0x1
00090bc2 23 38 07 06     sd                             zero,0x70(a4)

This example is based on a minimum VLEN of 128 bits, so the vector registers can hold 2 64 bit elements. The vmv.v.i instruction sets those two elements of v1 to zero. Seven vse64.v instructions then store two 64 bit zeros each to successive memory locations, with a trailing scalar double word store to handle the tail.

A pcode equivalent for this sequence might be __builtin_memset(dest=(a4), 0, 0x78).

top down scan of vector blocks

The python script objdump_analytic.py provides a crude scan of a RISCV-64 binary, reporting on likely vector instruction blocks. It doesn’t handle blocks with more than one vsetvli or vsetivli instruction, something common in vector narrowing or widening operations. If we apply this script to whisper_cpp_vector we can collect a crude field guide to vector expansions.

VLEN in the following is the hart’s vector length, determined at execution time. It is usually something like 128 bits for a general purpose core (aka hart) and up to 1024 bits for a dedicated accelerator hart.
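
When in doubt, VLEN can be probed directly on the target hart. The read-only vlenb CSR holds VLEN/8, the width of one vector register in bytes. A minimal sketch, built with -march=rv64gcv:

#include <cstdio>

int main() {
    unsigned long vlenb;
    // vlenb is a read-only CSR holding VLEN/8 (RVV 1.0)
    asm volatile("csrr %0, vlenb" : "=r"(vlenb));
    std::printf("VLEN = %lu bits\n", vlenb * 8);
    return 0;
}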

memcpy with known and limited nbytes

This pattern is often found when copying objects of known and limited size. It is useful with objects as small as 4 bytes if the source alignment is unknown and the destination object must be aligned on half-word, word, or double-word boundaries.

;                memcpy(dest=a0, src=a3, nbytes=a4) where a4 < 8 * (VLEN/8)
1d3da:  0c377057                vsetvli zero,a4,e8,m8,ta,ma
1d3de:  02068407                vle8.v  v8,(a3)
1d3e2:  02050427                vse8.v  v8,(a0)
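
A hypothetical source shape that can produce this single-pass pattern: a fixed-size copy small enough to fit one m8 register group, with no alignment guarantees on the source. The names below are illustrative:

#include <cstring>

struct Packet { char bytes[100]; };   // 100 < 8 * (VLEN/8) when VLEN >= 128

// The whole object fits one vector register group, so gcc-14 -O3
// -march=rv64gcv can emit a single vsetvli/vle8.v/vse8.v triple
// instead of a stripmine loop.
void copy_packet(Packet* dst, const Packet* src) {
    std::memcpy(dst, src, sizeof *src);
}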

memcpy with unknown nbytes

This pattern is usually found in a simple loop, moving 8 * (VLEN/8) bytes at a time. The a5 register holds the number of bytes processed per iteration.

;                memcpy(dest=a6, src=a7, nbytes=a0) 
1d868:  0c3577d7                vsetvli a5,a0,e8,m8,ta,ma
1d86c:  02088407                vle8.v  v8,(a7)
1d872:  02080427                vse8.v  v8,(a6)
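
The matching source shape here is any byte copy whose length is only known at run time - a plain loop or an unbounded memcpy call. A hedged sketch with illustrative names:

#include <cstddef>

// gcc-14 -O3 -march=rv64gcv can strip-mine this loop: each pass asks
// vsetvli for up to 8 * (VLEN/8) bytes, moves them with vle8.v/vse8.v,
// then advances both pointers by the element count returned in a5.
void copy_bytes(char* dst, const char* src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}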

widening floating point reduction

The next example appears to be compiled from estimate_diarization_speaker, whose source includes:

double energy0 = 0.0f;
double energy1 = 0.0f;

for (int64_t j = is0; j < is1; j++) {
    energy0 += fabs(pcmf32s[0][j]);
    energy1 += fabs(pcmf32s[1][j]);
}

This is a typical reduction with widening pattern.

The vector instructions generated are:

242ce:  0d8077d7                vsetvli a5,zero,e64,m1,ta,ma
242d2:  5e0031d7                vmv.v.i v3,0
242d6:  9e303257                vmv1r.v v4,v3
242da:  0976f7d7                vsetvli a5,a3,e32,mf2,tu,ma
242e4:  0205e107                vle32.v v2,(a1)
242e8:  02066087                vle32.v v1,(a2)
242ec:  2a211157                vfabs.v v2,v2
242f0:  2a1090d7                vfabs.v v1,v1
242f8:  d2411257                vfwadd.wv       v4,v4,v2
242fc:  d23091d7                vfwadd.wv       v3,v3,v1
24312:  0d8077d7                vsetvli a5,zero,e64,m1,ta,ma
24316:  4207d0d7                vfmv.s.f        v1,fa5
2431a:  063091d7                vfredusum.vs    v3,v3,v1
2431e:  42301757                vfmv.f.s        fa4,v3
24326:  06409257                vfredusum.vs    v4,v4,v1
2432a:  424017d7                vfmv.f.s        fa5,v4

A hypothetical vectorized Ghidra might decompile these instructions (ignoring the scalar instructions not displayed here) as:

double vector v3, v4;  // SEW=64 bit
v3 := vector 0;  // load immediate
v4 := v3;        // vector copy
float vector v1, v2;  // SEW=32 bit
while(...) {
    v2 = vector *a1;
    v1 = vector *a2;
    v2 = abs(v2);
    v1 = abs(v1);
    v4 = v4 + v2;  // widening 32 to 64 bits
    v3 = v3 + v1;  // widening 32 to 64 bits
}
double vector v1, v3, v4;
v1[0] = fa5;   // fa5 is the scalar 'carry-in' 
v3[0] = v1[0] + sum(v3); // unordered reduction over v3's elements
fa4 = v3[0];
v4[0] = v1[0] + sum(v4);
fa5 = v4[0];

The vector instruction vfredusum.vs provides the unordered reduction sum over the elements of a single vector. That’s likely faster than an ordered sum, but the floating point round-off errors will not be deterministic.
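
The root cause is that floating point addition is not associative - an unordered reduction may combine elements in any order, and different orders can round differently. A scalar illustration:

#include <cstdio>

int main() {
    float big = 1.0e8f, small = 1.0f;
    // 1.0f is below the rounding granularity of float near 1.0e8f,
    // so the result depends entirely on the order of operations.
    std::printf("%f\n", (big + small) - big);   // prints 0.000000
    std::printf("%f\n", (big - big) + small);   // prints 1.000000
    return 0;
}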

Note: this whisper.cpp routine attempts to recognize which of two speakers is responsible for each word of a conversation. A speaker-misattribution exploit might attack functions that call this.

complex structure element copy

The source code includes:

static drwav_uint64 drwav_read_pcm_frames_s16__msadpcm(drwav* pWav, drwav_uint64 framesToRead, drwav_int16* pBufferOut) {
    ...
    pWav->msadpcm.bytesRemainingInBlock = pWav->fmt.blockAlign - sizeof(header);

    pWav->msadpcm.predictor[0] = header[0];
    pWav->msadpcm.predictor[1] = header[1];
    pWav->msadpcm.delta[0] = drwav__bytes_to_s16(header + 2);
    pWav->msadpcm.delta[1] = drwav__bytes_to_s16(header + 4);
    pWav->msadpcm.prevFrames[0][1] = (drwav_int32)drwav__bytes_to_s16(header + 6);
    pWav->msadpcm.prevFrames[1][1] = (drwav_int32)drwav__bytes_to_s16(header + 8);
    pWav->msadpcm.prevFrames[0][0] = (drwav_int32)drwav__bytes_to_s16(header + 10);
    pWav->msadpcm.prevFrames[1][0] = (drwav_int32)drwav__bytes_to_s16(header + 12);

    pWav->msadpcm.cachedFrames[0] = pWav->msadpcm.prevFrames[0][0];
    pWav->msadpcm.cachedFrames[1] = pWav->msadpcm.prevFrames[1][0];
    pWav->msadpcm.cachedFrames[2] = pWav->msadpcm.prevFrames[0][1];
    pWav->msadpcm.cachedFrames[3] = pWav->msadpcm.prevFrames[1][1];
    pWav->msadpcm.cachedFrameCount = 2;
...
}

This gets vectorized into sequences containing:

2c6ce:  ccf27057                vsetivli        zero,4,e16,mf2,ta,ma ; vl=4, SEW=16
2c6d2:  5e06c0d7                vmv.v.x v1,a3              ; v1[0..3] = a3
2c6d6:  3e1860d7                vslide1down.vx  v1,v1,a6   ; v1 = v1[1:3], a6
2c6da:  3e1760d7                vslide1down.vx  v1,v1,a4   ; v1 = v1[1:3], a4
2c6de:  3e1560d7                vslide1down.vx  v1,v1,a0   ; v1 = (a3,a6,a4,a0)

2c6e2:  0d007057                vsetvli zero,zero,e32,m1,ta,ma ; keep existing vl (=4), SEW=32
2c6e6:  4a13a157                vsext.vf2       v2,v1      ; v2 = vector sext(v1) // widening sign extend
2c6ea:  0207e127                vse32.v v2,(a5)            ; memcpy(a5, v2, 4 * 4)
2c6f2:  0a07d087                vlse16.v        v1,(a5),zero ; stride 0: broadcast the 16 bit value at (a5) into v1

2c6fa:  0cf07057                vsetvli zero,zero,e16,mf2,ta,ma
2c702:  3e1660d7                vslide1down.vx  v1,v1,a2   ; v1 = v1[1:3], a2
2c70a:  3e16e0d7                vslide1down.vx  v1,v1,a3   ; v1 = v1[1:3], a3
2c70e:  3e1760d7                vslide1down.vx  v1,v1,a4   ; v1 = v1[1:3], a4

2c712:  0d007057                vsetvli zero,zero,e32,m1,ta,ma
2c716:  4a13a157                vsext.vf2       v2,v1
2c71a:  0205e127                vse32.v v2,(a1)

That’s the kind of messy code you could analyze if you had to. Hopefully not.

6.2 - Application Top Down Analysis

How much complexity do vector instructions add to a top down analysis?

We know that whisper.cpp contains lots of vector instructions. Now we want to understand how few of those vector instruction blocks we actually need to analyze.

For this analysis we will assume a specific goal - inspect the final text output phase to see if an adversary has modified the generated text.

First we want to understand the unmodified behavior using a simple demo case. One of the whisper.cpp examples works well. It was built for the x86-64-v3 platform, not the riscv-64 gcv platform, but that’s fine - we just want to understand the rough sequencing and get a handle on the strings we might find in or near the top level main routine.

what is the expected behavior?

Note: added comments are flagged with //

/opt/whisper_cpp$ ./main -f samples/jfk.wav
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:      CPU total size =   147.46 MB (1 buffers)
whisper_model_load: model size    =  147.37 MB
whisper_init_state: kv self size  =   16.52 MB
whisper_init_state: kv cross size =   18.43 MB
whisper_init_state: compute buffer (conv)   =   14.86 MB
whisper_init_state: compute buffer (encode) =   85.99 MB
whisper_init_state: compute buffer (cross)  =    4.78 MB
whisper_init_state: compute buffer (decode) =   96.48 MB

system_info: n_threads = 4 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | 

// done with initialization, let's run speech-to-text
main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...

// this is the reference line our adversary wants to modify:
[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.

// display statistics
whisper_print_timings:     load time =   183.72 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    10.30 ms
whisper_print_timings:   sample time =    33.90 ms /   131 runs (    0.26 ms per run)
whisper_print_timings:   encode time =   718.87 ms /     1 runs (  718.87 ms per run)
whisper_print_timings:   decode time =     8.35 ms /     2 runs (    4.17 ms per run)
whisper_print_timings:   batchd time =   150.96 ms /   125 runs (    1.21 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  1110.87 ms

The adversary wants to change the text output from “… ask not what you can do for your country.” to “… ask not what you can do for your enemy.” They likely drop a string substitution into the code between the output of main: processing and whisper_print_timings:, probably very close to code printing timestamp intervals like [00:00:00.000 --> 00:00:11.000].

what function names and strings look relevant?

Our RISCV-64 binary retains some function names and lots of relevant strings. We want to accumulate strings that occur in the demo printout, then glance at the functions that reference those strings.

For this example we will use a binary that includes some debugging type information. Ghidra can determine names of structure types but not necessarily the size or field names of those structures.

strings

  • %s: processing '%s' (%d samples, %.1f sec), %d threads, %d processors, %d beams + best of %d, lang = %s, task = %s, %stimestamps = %d ... is referenced near the middle of main
  • [%s --> %s] is referenced by whisper_print_segment_callback
  • [%s --> %s] %s\n is referenced by whisper_full_with_state
  • segment occurs in several places, suggesting that the word refers to a segment of text generated from speech between two timestamps.
  • ctx occurs 33 times, suggesting that a context structure is used - and occasionally displayed with field names
  • error: failed to initialize whisper context\n is referenced within main. It may help in understanding internal data organization.

functions

  • main - Ghidra decompiles this as ~1000 C statements, including many vector statements
  • whisper_print_timings - referenced directly in main near the end
  • whisper_full_with_state - referenced indirectly from main via whisper_full_parallel and whisper_full
  • output_txt - referenced directly in main, invokes I/O routines like std::__ostream_insert<>. There are other output routines like output_json. The specific output routine can be selected as a command line parameter to main.

types and structs

Ghidra knows that these exist as names, but the details are left to us to unravel.

  • gpt_params and gpt_vocab - these look promising, at a lower ML level
  • whisper_context - this likely holds most of the top-level data
  • whisper_full_params and whisper_params - likely structures related to the optional parameters revealed with the --help command line option.
  • whisper_segment - possibly a segment of digitized audio to be converted into text.
  • whisper_vocab - possibly holding the text words known to the training data.

notes

Now we have enough context to narrow the search. We want to know:

  • how does main call either whisper_print_segment_callback or whisper_full_with_state?
    • whisper_full is called directly by main. Ghidra reports this to be about 3000 lines of C. The Ghidra call tree suggests that this function does most of the speech-to-text tensor math and other ML heavy lifting.
    • whisper_print_segment_callback appears to be inserted into a C++ object vtable as a function pointer. The object itself appears to be built on main’s stack, so we don’t immediately know its size or use. whisper_print_segment_callback is less than a tenth the size of whisper_full_with_state.
  • how does the JFK output text get appended to the string [%s --> %s]?
  • from what structures is the output text retrieved?
  • where are those structures initialized? How large are they, and are any of their fields named in diagnostic output?
  • are there any diagnostic routines displaying the contents of such structures?

next steps

A simple but tedious technique involves a mix of top-down and bottom-up analysis. We work upwards from strings and function references, and down from the main routine towards the functions associated with our target text string. Trial and error with lots of backtracking are common here, so switching back and forth between top-down and bottom-up exploration can provide fresh insights.

Remember that we don’t want to understand any more of whisper.cpp than we have to. The adversary we are chasing only needs to find where the generated text comes within reach. Neither they nor we need to understand all of the ways the C++ standard library might use vector instructions during I/O subsystem initialization.

On the other hand, they and we may need to recognize basic I/O and string handling operations, since the target text is likely to exist as either a standard string or a standard vector of strings.

Note: This isn’t a tutorial on how to approach a C++ reverse engineering challenge - it’s an evaluation of how vectorization might make that more difficult and an exploration of what additional tools Ghidra or Ghidra users may find useful when faced with vectorization. That means we’ll skip most of the non-vector analysis.

vectorization obscures initialization

This sequence from main affects initialization and obscures a possible exploit vector.

  vsetivli_e8m8tama(0x17);         // memcpy(puStack_110, "models/ggml-base.en.bin", 0x17)
  auVar27 = vle8_v(0xa6650);
  vsetivli_e8m8tama(0xf);          // memcpy(puStack_f0, " [SPEAKER_TURN]", 0xf)
  auVar26 = vle8_v(0xa6668);
  puStack_f0 = auStack_e0;
  vsetivli_e8m8tama(0x17);
  vse8_v(auVar27,puStack_110);
  vsetivli_e8m8tama(0xf);
  vse8_v(auVar26,puStack_f0);
  puStack_d0 = &uStack_c0;
  vsetivli_e64m1tama(2);           // memset(lStack_b0, 0, 16)
  vmv_v_i(auVar25,0);
  vse64_v(auVar25,&lStack_b0);
  *(char *)((long)puStack_110 + 0x17) = '\0';

If the hypothetical adversary wanted to replace the training model ggml-base.en.bin with a less benign model, changing the memory reference within vle8_v(0xa6650) would be a good place to do it. Note that the compiler has interleaved instructions generated from the two memcpy expansions, at the cost of two extra vsetivli instructions. This allows more time for the vector load instructions to complete.

Focus on output_txt

Some browsing in Ghidra suggests that the following section of main is close to where we need to focus.

    lVar11 = whisper_full_parallel
                      (ctx,(long)pFVar18,(ulong)pvStack_348,
                      (long)(int)(lStack_340 - (long)pvStack_348 >> 2),
                      (long)pvVar20);
  if (lVar11 == 0) {
    putchar(10,pFVar18);
    if (params.do_output_txt != false) {
  /* try { // try from 0001dce8 to 0001dceb has its CatchHandler @ 0001e252 */
      std::operator+(&full_params,(undefined8 *)pFStack_2e0,
                      (undefined8 *)pFStack_2d8,(undefined8 *)".txt",
                      (char *)pvVar20);
      uVar13 = full_params._0_8_;
  /* try { // try from 0001dcfc to 0001dcfd has its CatchHandler @ 0001e2ec */
      std::vector<>::vector(unaff_s3,(vector<> *)unaff_s5);
  /* try { // try from 0001dd06 to 0001dd09 has its CatchHandler @ 0001e2f0 */
      output_txt(ctx,(char *)uVar13,&params,(vector *)unaff_s3);
      std::vector<>::~vector(unaff_s3);
      std::__cxx11::basic_string<>::_M_dispose((basic_string<> *)&full_params);
    }
    ...
  }

Looking into output_txt, Ghidra gives us:

long output_txt(whisper_context *ctx,char *output_file_path,whisper_params *param_3,vector *param_4)

{
    fprintf(_stderr,"%s: saving output to \'%s\'\n","output_txt",output_file_path);
    max_index = whisper_full_n_segments(ctx);
    index = 0;
    if (0 < max_index) {
      do {
        __s = (char *)whisper_full_get_segment_text(ctx,index);
    ...
        sVar8 = strlen(__s);
        std::__ostream_insert<>((basic_ostream *)plVar7,__s,sVar8);
    ...
        index = (long)((int)index + 1);
      } while (max_index != index);
    ...
    }
...
}

Finally, whisper_full_get_segment_text is decompiled into:

undefined8 whisper_full_get_segment_text(whisper_context *ctx,long index)
{
  gp = &__global_pointer$;
  return *(undefined8 *)(index * 0x50 + *(long *)(ctx->state + 0xa5f8) + 0x10);
}

Now the adversary has enough information to try rewriting the generated text from an arbitrary segment of speech. The text is found in an array linked into the ctx context variable, probably during the call to whisper_full_parallel.
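
Combining this accessor with the [%s --> %s] %s output format suggests a per-segment record like the sketch below. Only the 0x50 stride and the 0x10 offset are confirmed by the decompilation; the other fields are labeled guesses:

#include <cstdint>
#include <string>

// Hypothetical reconstruction of one 0x50 byte array element.
struct segment_guess {
    std::int64_t t0;     // offset 0x00: start timestamp? (guess)
    std::int64_t t1;     // offset 0x08: end timestamp? (guess)
    std::string  text;   // offset 0x10: libstdc++ std::string stores its
                         // data pointer first, so the single load at +0x10
                         // yields the C string directly
    // remaining bytes up to the 0x50 stride are unidentified
};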

added complexity of vectorization

Our key goal is to understand how much effort to put into Ghidra’s decompiler processing of RISCV-64 vector instructions. We measure that effort relative to what is already needed to understand the other instructions produced by a C++ optimizing compiler implementing libstdc++ containers like std::vector.

Take a closer look at the call to output_txt:

std::vector<>::vector(unaff_s3,(vector<> *)unaff_s5);
output_txt(ctx,(char *)uVar13,&params,(vector *)unaff_s3);
std::vector<>::~vector(unaff_s3);

The unaff_s3 parameter to output_txt might be important. Maybe we should examine the constructor and destructor for this object to probe its internal structure.

In fact unaff_s3 is only used when passing stereo audio into output_txt, so it is more of a red herring slowing down the analysis than a true roadblock. Its internal structure is a C++ standard vector of C++ standard vectors of float, so it’s a decent example of what happens when RISCV-64 vector instructions are used to implement vectors (and two dimensional matrices) at a higher abstraction level.

A little analysis shows us that std::vector<>::vector is actually a copy constructor for a class generated from a vector template. The true type of unaff_s3 and unaff_s5 is roughly std::vector<std::vector<float>>.

Comment: the copy constructor and the associated destructor are likely present only because the programmer didn’t mark the parameter as a const reference.
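
In source terms, the difference looks something like this - illustrative signatures, not the actual whisper.cpp declarations:

#include <vector>

using Stereo = std::vector<std::vector<float>>;

void by_value(Stereo s) {}         // forces the copy ctor/dtor pair we see
void by_ref(const Stereo& s) {}    // passes a pointer; no copy, no cleanup

void caller(const Stereo& pcmf32s) {
    by_value(pcmf32s);   // copy constructed before the call, destroyed after
    by_ref(pcmf32s);     // nothing extra at the call site
}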

The listing for the destructor std::vector<>::~vector(unaff_s3) shows that no vector instructions are used. The inner vectors are deleted and their memory reclaimed, then the outer containing vector is deleted.

The constructor std::vector<>::vector is different. Vector instructions are used often, but in very simple contexts.

  • The only vset mode used is vsetivli_e64m1tama(2), asking for no more than two 64 bit elements in a vector register
  • The most common vector pattern stores 0 into two adjacent 64 bit pointers
  • In one case a 64 bit value is stored into two adjacent 64 bit pointer slots (see the layout sketch below).
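
This pattern follows directly from libstdc++’s std::vector representation - three pointers, the first two of which can be zeroed together with a single two-element vector store. The field names below follow libstdc++ conventions, but the sketch is illustrative:

// One vsetivli_e64m1tama(2) / vmv.v.i / vse64.v sequence zeroes
// _M_start and _M_finish in a single 16 byte store.
struct vec_layout {
    void* _M_start;           // element storage
    void* _M_finish;          // one past the last element
    void* _M_end_of_storage;  // one past the allocated capacity
};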

Summary

If whisper.cpp is representative of a broader class of ML programs compiled for RISCV-64 vector-enabled hardware, then:

  1. Ghidra’s sleigh subsystem needs to recognize at least those vector instructions found in the rvv 1.0 release.
  2. The decompiler view should have access to pcodeops for all of those vector instructions.
  3. The 20 to 50 most common vset* configurations (e.g., e64m1tama) should be explicitly recognized at the pcodeop layer and displayed in the decompiler view.
  4. Ghidra users should have documentation on common RISCV-64 vector instruction patterns generated during compilation. These patterns should include common loop patterns and builtin expansions for memcpy and memset, plus examples showing the common source code patterns resulting in vector reduction, width conversion, slideup/down, and gather/scatter instructions.

Other Ghidra extensions would be nice to have but likely deliver diminishing bang-for-the-buck relative to multiplatform C++ analytics:

  1. Extend sleigh *.sinc file syntax to convey comments or hints to be visible in the decompiler view, either as pop-ups, instruction info, or comment blocks.
  2. Take advantage of the open source nature of the RISCV ISA to display links to open source documents on vector instructions when clicking on a given instruction.
  3. Treat pcodeops as function calls within the decompiler view, enabling signature overrides and type assignment to the arguments.
  4. Create a decompiler plugin framework that can scan the decompiled source and translate vector instruction patterns back into __builtin_memcpy(...) calls.
  5. Create a decompiler plugin framework that can scan the decompiled source and generate inline comments in a sensible vector notation.

The toughest challenges might be:

  1. Find a Ghidra micro-architecture-independent approach to untangling vector instruction generation.
  2. Use ML translation techniques to match C, C++, and Rust source patterns to generated vector instruction sequences for known architectures, compilers, and compiler optimization settings.

7 - Testbed Internals

This testbed uses several open source components that need descriptions and reference links.

Ghidra development sources

We track the Ghidra repository for released Ghidra packages, currently Ghidra 11.0. A Ghidra fork is also used here which adds proposed RISCV instruction set extension support. The host environment for this project is currently a Fedora 39 workstation with an AMD Ryzen 9 5900HX and 32 GB of RAM.

Toolchain sources

binutils

  • commit 2c73aeb8d2e02de7b69cbcb13361cfbca9d76a4e (HEAD, tag: binutils-2_41), Date: Sun Jul 30 14:55:52 2023 +0100 - the 2.41 release
  • local source directory /home2/vendor/binutils-gdb

gcc, stdlib

  • source repo: git://gcc.gnu.org/git/gcc.git
  • commit ac9c81dd76cfc34ed53402049021689a61c6d6e7 (HEAD -> master, origin/trunk, origin/master, origin/HEAD), Date: Mon Dec 18 21:40:00 2023 +0800
  • local source directory /home2/vendor/gcc

glibc

  • source repo: git@github.com:bminor/glibc.git
  • commit e957308723ac2e55dad360d602298632980bbd38 (HEAD -> master, origin/master, origin/HEAD) Date: Fri Dec 15 12:04:05 2023 -0800
  • local source directory /home2/vendor/glibc

Bazel

website sources

  • hugo v0.120.4, installed as a Fedora snap package
  • docsy v0.8.0

7.1 - adding toolchains

Adding a new toolchain takes lots of little steps, and some trial and error.

Overview

We want x86_64 exemplars built with the same next generation of gcc, libc, and libstdc++ as we use for RISCV exemplars. This will give us some hints about how common new issues may be and how global new solutions may need to be.

We will generate this x86_64 gcc-14 toolchain in much the same way as our existing RISCV-64 gcc-14 toolchain.

This example uses the latest released version of binutils and the development head of gcc and glibc.

If we were building a toolchain for an actual product we would start by configuring and building a specialized kernel, which would prepopulate the system root. We aren’t doing that here, so we will use placeholders from the Fedora 40 x86_64 kernel.

binutils and the first gcc pass

We want binutils installed first.

$ cd /home2/vendor/binutils-gdb
$ git log
commit 2c73aeb8d2e02de7b69cbcb13361cfbca9d76a4e (HEAD, tag: binutils-2_41)
Author: Nick Clifton <nickc@redhat.com>
Date:   Sun Jul 30 14:55:52 2023 +0100

    The 2.41 release!
...
$ cd /home2/build_x86/binutils
$ .../vendor/binutils-gdb/configure --prefix=/opt/gcc14 --disable-multilib --enable-languages=c,c++,rust,lto
...
$ make
$ make install
...

The gcc suite and the glibc standard library have a circular dependency. We build and install the basic gcc capability first, then glibc, and then finish with the rest of gcc. During this process we likely need to add system files to the new sysroot directory.

$ cd /home2/vendor/gcc
$ git log
commit ac9c81dd76cfc34ed53402049021689a61c6d6e7 (HEAD -> master, origin/trunk, origin/master, origin/HEAD)
Author: Pan Li <pan2.li@intel.com>
Date:   Mon Dec 18 21:40:00 2023 +0800

    RISC-V: Rename the rvv test case.
...
$ cd /home2/build_x86/gcc
/home2/vendor/gcc/configure --prefix=/opt/gcc14 --disable-multilib --enable-languages=c,c++,rust,lto
$ make
...
$ make install
...

The make and make install may throw errors after completing the basic compiler. If so, we can complete the build after we get glibc installed.

glibc

We should have enough of gcc-14 built to configure and build the 64 bit glibc package. This pending release of glibc has lots of changes, so we can expect some tinkering to get it to work for us.


$ cd /home2/vendor/glibc
$ git log
commit e957308723ac2e55dad360d602298632980bbd38 (HEAD -> master, origin/master, origin/HEAD)
Author: Matthew Sterrett <matthew.sterrett@intel.com>
Date:   Fri Dec 15 12:04:05 2023 -0800

    x86: Unifies 'strlen-evex' and 'strlen-evex512' implementations.
...
$ mkdir -p /home2/build_x86/glibc
$ cd /home2/build_x86/glibc
$ /home2/vendor/glibc/configure CC="/opt/gcc14/bin/gcc" --prefix="/usr" install_root=/opt/gcc14/sysroot --disable-werror --enable-shared --disable-multilib
$ make
$ make install_root=/opt/gcc14/sysroot install
$ du -hs /opt/gcc14/sysroot
105M	/opt/gcc14/sysroot

gcc finish

If the gcc installation errored out before completion, try it again after glibc is installed. This time it should complete without error.

testing the local toolchain

Next we want to exercise the toolchain by compiling a very simple C program:

#include <stdio.h>
int main(int argc, char** argv){
    const int N = 1320;
    char s[N];
    for (int i = 0; i < N - 1; ++i)
        s[i] = i + 1;
    s[N - 1] = '\0';
    printf(s);
}

We’ll build it with three sets of options and import all three into Ghidra 11:

/opt/gcc14/bin/gcc gcc_vectorization/helloworld_challenge.c -o a_unoptimized.out
/opt/gcc14/bin/gcc -O3 gcc_vectorization/helloworld_challenge.c -o a_host_optimized.out
/opt/gcc14/bin/gcc -march=rocketlake -O3 gcc_vectorization/helloworld_challenge.c -o a_rocketlake_optimized.out

Note: Rocket Lake is Intel’s codename for its 11th generation Core microprocessors

Ghidra 11 gives us:

  • a_unoptimized.out imports and decompiles cleanly, with recognizable disassembly and decompiler output of 5 lines of code.
  • a_host_optimized.out imports cleanly and decompiles into about 150 lines of hard-to-interpret C code. The loop has been autovectorized using instructions like PUNPCKHWD, PUNPCKLWD, and PADDD - these are baseline SSE2 integer vector instructions, not AVX-512 extensions.
  • a_rocketlake_optimized.out fails to disassemble or decompile when it hits AVX2 instructions like vpbroadcastd. Binutils 2.41’s objdump appears to recognize these instructions.

As a stretch goal, what does the gcc-14 Rust compiler give us?

/opt/gcc14/bin/gccrs -frust-incomplete-and-experimental-compiler-do-not-use src/main.rs
src/main.rs:25:5: error: unknown macro: [log::info]
   25 |     log::info!(
      |     ^~~
src/main.rs:29:5: error: unknown macro: [log::info]
   29 |     log::info!(
      |     ^~~
...

If gccrs can’t handle basic Rust macros, it isn’t very useful for generating exemplars. We won’t include it in our portable toolchain.

packaging the toolchain

Now we need to make the toolchain hermetic, portable, and ready for Bazel workspaces.

Hermeticity means that nothing under /opt/gcc14 makes a hidden reference to local host files under /usr. Any such reference needs to be changed into a relative reference. These are common in small shareable object files that are actually linker scripts referencing one or more true shareable object libraries.

You can often identify possible troublemakers by searching for smallish regular files with a .so extension.

$ find /opt/gcc14 -name \*.so -type f -size -1000c -ls
... 273 Dec 27 12:41 /opt/gcc14/lib/libc.so
... 126 Dec 27 12:42 /opt/gcc14/lib/libm.so
... 132 Dec 27 11:42 /opt/gcc14/lib64/libgcc_s.so

$ cat /opt/gcc14/lib/libc.so
/* GNU ld script
   Use the shared library, but some functions are only in
   the static library, so try that secondarily.  */
OUTPUT_FORMAT(elf64-x86-64)
GROUP ( /opt/gcc14/lib/libc.so.6 /opt/gcc14/lib/libc_nonshared.a  AS_NEEDED ( /opt/gcc14/lib/ld-linux-x86-64.so.2 ) )

$ cat /opt/gcc14/lib/libm.so
/* GNU ld script
*/
OUTPUT_FORMAT(elf64-x86-64)
GROUP ( /opt/gcc14/lib/libm.so.6  AS_NEEDED ( /opt/gcc14/lib/libmvec.so.1 ) )

$ cat /opt/gcc14/lib64/libgcc_s.so
/* GNU ld script
   Use the shared library, but some functions are only in
   the static library.  */
GROUP ( libgcc_s.so.1 -lgcc )

So two of the three files need patching: replacing the absolute prefix /opt/gcc14/lib with a relative reference. Any text editor will do.

Next we need to identify all dynamic host dependencies of binaries under /opt/gcc14. The ldd command will identify these local system files, which should be collected into a separate tarball. This tarball can be shared with other cross-compilers built at the same time, and is generally portable across similar Linux kernels and distributions.

At this point we can strip the executable files within the toolchain and identify the ones we want to keep in the portable toolchain tarball. Scripts under generated/toolchains/gcc-14-*/scripts will help with that.

generate.sh uses rsync to copy selected files from /opt/gcc14 into /tmp/export, strips the known binaries, and creates the portable tarball. It then collects relevant dynamic libraries from the host and creates a second portable tarball.

These two tarballs can then be copied to other computers and imported into a project by adding stanzas to the project WORKSPACE file.

installing the toolchain

The toolchain tarball is currently in /opt/bazel. We need its full path and sha256sum to make it accessible within our workspace. Edit WORKSPACE to include:

load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

# gcc-14 x86_64 toolchain from snapshot gcc-14 and glibc development heads
http_archive(
    name = "gcc-14-x86_64-suite",
    urls = ["file:///opt/bazel/x86_64_linux_gnu-14.tar.xz"],
    build_file = "//:gcc-14-x86_64-suite.BUILD",
    sha256 = "40cc4664a11b8da56478393c7c8b823b54f250192bdc1e1181c9e4f8ac15e3be",
)

# system libraries used by toolchain build system
# We built the custom toolchain on a fedora x86_64 platform, so we need some
# fedora x86_64 sharable system libraries to execute.
http_archive(
    name = "fedora39-system-libs",
    urls = ["file:///opt/bazel/fedora39_system_libs.tar.xz"],
    build_file = "//:fedora39-system-libs.BUILD",
    sha256 = "fe91415b05bb902964f05f7986683b84c70338bf484f23d05f7e8d4096949d1b",
)

Bazel will unpack this tarball into an external project directory, something like /run/user/1000/bazel/execroot/_main/external/gcc-14-x86_64-suite/. Individual files and filegroups within that directory are defined in x86_64/generated/gcc-14-x86_64-suite.BUILD. The filegroup compiler_files is probably the most important, as it collects everything that might be used in anything launched from gcc or g++. The full Bazel name for this filegroup is @gcc-14-x86_64-suite//:compiler_files.

Each custom toolchain is defined within the x86_64/generated/toolchains/BUILD file. This associates filegroups from a (possibly shared) toolchain tarball like gcc-14-x86_64-suite with a set of default compiler and linker options and standard libraries. We might want multiple gcc-14 toolchains, for building kernels, kernel modules, and userspace applications respectively.

Most of the configuration exists within stanzas like this:

toolchain(
    name = "x86_64_default",
    target_compatible_with = [
        ":x86_64",
    ],
    toolchain = ":x86_64-default-gcc",
    toolchain_type = "@bazel_tools//tools/cpp:toolchain_type",
)
cc_toolchain(
    name = "x86_64-default-gcc",
    all_files = ":all_files",
    ar_files = ":gcc_14_compiler_files",
    as_files = ":empty",
    compiler_files = ":gcc_14_compiler_files",
    dwp_files = ":empty",
    linker_files = ":gcc_14_compiler_files",
    objcopy_files = ":empty",
    strip_files = ":empty",
    supports_param_files = 0,
    toolchain_config = ":x86_64-default-gcc-config",
    toolchain_identifier = "x86_64-default-gcc",
)
cc_toolchain_config(
    name = "x86_64-default-gcc-config",
    abi_libc_version = ":empty",
    abi_version = ":empty",
    compile_flags = [
        # take the isystem ordering from the output of gcc -xc++ -E -v -
        "--sysroot", "external/gcc-14-x86_64-suite/sysroot/",
        "-Wall",
    ],
    compiler = "gcc",
    coverage_compile_flags = ["--coverage"],
    coverage_link_flags = ["--coverage"],
    cpu = "x86_64",
    # we really want the following to be constructed from $(output_base) or $(location ...)
    cxx_builtin_include_directories = [
       OUTPUT_BASE + "/external/gcc-14-x86_64-suite/sysroot/usr/include",
       OUTPUT_BASE + "/external/gcc-14-x86_64-suite/x86_64-pc-linux-gnu/include/c++/14.0.0",
       OUTPUT_BASE + "/external/gcc-14-x86_64-suite/lib/gcc/x86_64-pc-linux-gnu/14.0.0/include",
       OUTPUT_BASE + "/external/gcc-14-x86_64-suite/lib/gcc/x86_64-pc-linux-gnu/14.0.0/include-fixed",
       ],
    cxx_flags = [
        "-std=c++20",
        "-fno-rtti",
        ],
    dbg_compile_flags = ["-g"],
    host_system_name = ":empty",
    link_flags = ["--sysroot", "external/gcc-14-x86_64-suite/sysroot/"],
    link_libs = ["-lstdc++", "-lm"],
    opt_compile_flags = [
        "-g0",
        "-Os",
        "-DNDEBUG",
        "-ffunction-sections",
        "-fdata-sections",
    ],
    opt_link_flags = ["-Wl,--gc-sections"],
    supports_start_end_lib = False,
    target_libc = ":empty",
    target_system_name = ":empty",
    tool_paths = {
        "ar": "gcc-14-x86_64/imported/ar",
        "ld": "gcc-14-x86_64/imported/ld",
        "cpp": "gcc-14-x86_64/imported/cpp",
        "gcc": "gcc-14-x86_64/imported/gcc",
        "dwp": ":empty",
        "gcov": ":empty",
        "nm": "gcc-14-x86_64/imported/nm",
        "objcopy": "gcc-14-x86_64/imported/objcopy",
        "objdump": "gcc-14-x86_64/imported/objdump",
        "strip": "gcc-14-x86_64/imported/strip",
    },
    toolchain_identifier = "gcc-14-x86_64",
    unfiltered_compile_flags = [
        "-fno-canonical-system-headers",
        "-Wno-builtin-macro-redefined",
        "-D__DATE__=\"redacted\"",
        "-D__TIMESTAMP__=\"redacted\"",
        "-D__TIME__=\"redacted\"",
    ],
)

The tool_paths element points to small bash scripts needed to launch compiler components like gcc, ar, and strip. These give us the chance to use imported system shareable object libraries rather than the host’s shareable object libraries.

#!/bin/bash
set -euo pipefail
PATH=`pwd`/toolchains/gcc-14-x86_64/imported \
LD_LIBRARY_PATH=external/fedora39-system-libs \
  external/gcc-14-x86_64-suite/bin/gcc "$@"

finding the hidden toolchain dependencies

Compiling and linking source files takes many dependent files from /opt/gcc14. The next step is tedious and iterative - we need to prove that the portable toolchain tarball derived from /opt/gcc14 never references any file in that directory, or any local host file under /usr. Bazel can do that for us, at the cost of identifying every file or file ‘glob’ that may be called for each of the toolchain primitives. It runs the toolchain in a sandbox, forcing an exception on all references not previously declared as dependencies.

This kind of exception looks like this:

ERROR: /home/XXX/projects/github/ghidra_import_tests/x86_64/generated/userSpaceSamples/BUILD:3:10: Compiling userSpaceSamples/helloworld.c failed: absolute path inclusion(s) found in rule '//userSpaceSamples:helloworld':
the source file 'userSpaceSamples/helloworld.c' includes the following non-builtin files with absolute paths (if these are builtin files, make sure these paths are in your toolchain):
  '/usr/include/stdc-predef.h'
  '/usr/include/stdio.h'

If you see this, check:

  • whether stdio.h was installed in the right directory under /opt/gcc14.
  • whether stdio.h was copied into /tmp/export when building the tarball
  • whether the instances of stdio.h appeared in the appropriate compiler file groups defined in gcc-14-x86_64-suite.BUILD
  • whether those filegroups were properly imported into the Bazel sandbox for your build
  • whether the compile_flags for your toolchain tell gcc-14 to search the sandbox for the directories containing stdio.h
    "-isystem", "external/gcc-14-x86_64-suite/sysroot/usr/include",
    
  • whether the link_flags for your toolchain tell gcc-14 to search the sandbox for the directories containing crt1.o and crti.o

using the toolchain

We can test our new toolchain with a build of helloworld.

x86_64/generated$ bazel clean
INFO: Starting clean (this may take a while). Consider using --async if the clean takes more than several minutes.
x86_64/generated$ bazel run -s --platforms=//platforms:x86_64_default userSpaceSamples:helloworld
INFO: Analyzed target //userSpaceSamples:helloworld (69 packages loaded, 1538 targets configured).
SUBCOMMAND: # //userSpaceSamples:helloworld [action 'Compiling userSpaceSamples/helloworld.c', configuration: 672d6d72a34879952e2365b9bc032c10f7e50fda380c4b7c8e86b49faa982e8b, execution platform: @@local_config_platform//:host, mnemonic: CppCompile]
(cd /run/user/1000/bazel/execroot/_main && \
  exec env - \
    PATH=/home/thixotropist/.local/bin:/home/thixotropist/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/var/lib/snapd/snap/bin:/home/thixotropist/.local/bin:/home/thixotropist/bin:/opt/ghidra_10.3.2_PUBLIC/:/home/thixotropist/.cargo/bin::/usr/lib/jvm/jdk-17-oracle-x64/bin:/opt/gradle-7.6.2/bin \
    PWD=/proc/self/cwd \
  toolchains/gcc-14-x86_64/imported/gcc -U_FORTIFY_SOURCE --sysroot external/gcc-14-x86_64-suite/sysroot/ -Wall -MD -MF bazel-out/k8-fastbuild/bin/userSpaceSamples/_objs/helloworld/helloworld.pic.d '-frandom-seed=bazel-out/k8-fastbuild/bin/userSpaceSamples/_objs/helloworld/helloworld.pic.o' -fPIC -iquote . -iquote bazel-out/k8-fastbuild/bin -iquote external/bazel_tools -iquote bazel-out/k8-fastbuild/bin/external/bazel_tools -fno-canonical-system-headers -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' -c userSpaceSamples/helloworld.c -o bazel-out/k8-fastbuild/bin/userSpaceSamples/_objs/helloworld/helloworld.pic.o)
# Configuration: 672d6d72a34879952e2365b9bc032c10f7e50fda380c4b7c8e86b49faa982e8b
# Execution platform: @@local_config_platform//:host
SUBCOMMAND: # //userSpaceSamples:helloworld [action 'Linking userSpaceSamples/helloworld', configuration: 672d6d72a34879952e2365b9bc032c10f7e50fda380c4b7c8e86b49faa982e8b, execution platform: @@local_config_platform//:host, mnemonic: CppLink]
(cd /run/user/1000/bazel/execroot/_main && \
  exec env - \
    PATH=/home/thixotropist/.local/bin:/home/thixotropist/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/var/lib/snapd/snap/bin:/home/thixotropist/.local/bin:/home/thixotropist/bin:/opt/ghidra_10.3.2_PUBLIC/:/home/thixotropist/.cargo/bin::/usr/lib/jvm/jdk-17-oracle-x64/bin:/opt/gradle-7.6.2/bin \
    PWD=/proc/self/cwd \
  toolchains/gcc-14-x86_64/imported/gcc -o bazel-out/k8-fastbuild/bin/userSpaceSamples/helloworld -Wl,-S --sysroot external/gcc-14-x86_64-suite/sysroot/ bazel-out/k8-fastbuild/bin/userSpaceSamples/_objs/helloworld/helloworld.pic.o -lstdc++ -lm)
# Configuration: 672d6d72a34879952e2365b9bc032c10f7e50fda380c4b7c8e86b49faa982e8b
# Execution platform: @@local_config_platform//:host
INFO: Found 1 target...
Target //userSpaceSamples:helloworld up-to-date:
  bazel-bin/userSpaceSamples/helloworld
INFO: Elapsed time: 0.289s, Critical Path: 0.10s
INFO: 6 processes: 4 internal, 2 linux-sandbox.
INFO: Build completed successfully, 6 total actions
INFO: Running command line: bazel-bin/userSpaceSamples/helloworld
Hello World!
$ strings bazel-bin/userSpaceSamples/helloworld|grep -i gcc
GCC: (GNU) 14.0.0 20231218 (experimental)

Things to note:

  • The command line includes --platforms=//platforms:x86_64_default to show we are not building for the local host
  • toolchains/gcc-14-x86_64/imported/gcc is invoked twice, once to compile and once to link
  • --sysroot external/gcc-14-x86_64-suite/sysroot is used twice, to avoid including host files under /usr
  • The helloworld executable happens to execute on the host machine.
  • The helloworld executable contains no references to gcc-13, the native toolchain on the host machine.

Now try a C++ build:

$ bazel run -s --platforms=//platforms:x86_64_default userSpaceSamples:helloworld++
...
Target //userSpaceSamples:helloworld++ up-to-date:
  bazel-bin/userSpaceSamples/helloworld++
INFO: Elapsed time: 0.589s, Critical Path: 0.51s
INFO: 3 processes: 1 internal, 2 linux-sandbox.
INFO: Build completed successfully, 3 total actions
INFO: Running command line: bazel-bin/userSpaceSamples/helloworld++
Hello World!

cleanup

We’ve got a working toolchain, but with many dangling links, duplicate files, and unused definitions. The toolchain files normally provided by a kernel were copied in as needed from the host, with the understanding that we never really needed runnable applications.

If this were a production environment we would be a lot more careful. It’s not, so we will just summarize some of the areas that might benefit from such a cleanup.

toolchain directories

Adding and testing a toolchain involves lots of similar-looking directories.

/opt/gcc14

This directory is the install target for our binutils, gcc, and glibc builds. It is not itself used by the Bazel build framework, and need not be present on any host machine running the toolchain.

  • The overall size is reported as 3.1 GB, inflated somewhat by multiple hard links
  • fdupes reports 2446 duplicate files (in 2136 sets), occupying 200.2 megabytes
  • There are six files over 150 MB in size

/tmp/export

This directory holds a subset of /opt/gcc14, with many binaries stripped. The hard links of /opt/gcc14 are lost. It may be discarded after the portable tarball is generated.

  • the overall size is 731 MB
  • fdupes reports 1010 duplicate files (in 983 sets), occupying 69.0 megabytes
  • There are 16 files over 10 MB in size

/opt/bazel/x86_64_linux_gnu-14.tar.xz

The compressed portable tarball size is 171 MB. It expands into a locally cached equivalent of /tmp/export. This file must be accessible to Bazel during a cross-compilation, either as a file reference or as a remote http or https URL.

/run/user/1000/bazel/execroot/_main/external/x86_64_linux_gnu-14

Bazel agents will decompress the tarball if and when needed into a local cache directory. In this case, it is unpacked into a RAM file system for speed.

/run/user/1000/bazel/sandbox/linux-sandbox/2/execroot/_main/external/x86_64_linux_gnu-14

Temporary sandboxes like this are created when individual compiler and linker steps are executed. They contain whatever subset of the cached toolchain tarball is explicitly named as a dependency for that step. Toolchain references outside of the sandbox are often flagged as hermeticity errors and abort the build.

7.2 - debugging toolchains

Debugging toolchains can be tedious

Suppose you wanted to build a gcc-14 toolchain with the latest glibc standard libraries, and you were using a Linux host with gcc-14 and reasonably current glibc standard libraries. How would you guarantee that none of your older host files were accidentally used where you expected the newer gcc and glibc files to be used?

Bazel enforces this hermeticity by running all toolchain steps in a sandbox, where only declared dependencies of the toolchain components are visible. That means nothing under /usr or $HOME is generally available, and any attempt to access files there will abort the build.

Example:

ERROR: /home/XXX/projects/github/ghidra_import_tests/x86_64/generated/userSpaceSamples/BUILD:3:10: Compiling userSpaceSamples/helloworld.c failed: absolute path inclusion(s) found in rule '//userSpaceSamples:helloworld':
the source file 'userSpaceSamples/helloworld.c' includes the following non-builtin files with absolute paths (if these are builtin files, make sure these paths are in your toolchain):
  '/usr/include/stdc-predef.h'
  '/usr/include/stdio.h'

In this example the toolchain tried to load host files, where it should have been loading equivalent files from the toolchain tarball.

Toolchain failure modes

Bazel toolchains should provide and encapsulate almost everything host computers need to compile and link executables. The goal is simply to minimize toolchain differences between individual developers’ workstations and the reference Continuous Integration test servers. The toolchains do not include kernels or loaders, or system code tightly associated with the kernel. That presents a challenge, since we want the linker to be imported as part of the toolchain, while the system loader is provided by the host.

Common toolchain failure modes often show up during crosscompilation of something as simple as riscv64-unknown-linux-gnu-gcc helloworld.c.

  • The gcc compiler must find the compiler dynamic libraries it was compiled with, probably using LD_LIBRARY_PATH to find them.
    • These include compiler-specific files like libstdc++.so.6 which links to concrete versions like libstdc++.so.6.0.32.
    • These libraries must be part of the imported toolchain tarball and explicitly named as Bazel toolchain dependencies so that they are imported into the ‘sandbox’ isolating the build from system libraries
    • Other host-specific loader files should not be part of the toolchain tarball. These include the dynamic loader ld-linux-x86-64.so.2
  • The gcc executable must find and execute multiple other executables from the toolchain, such as cpp, as, and ld.
    • These should not be the same executables as may be provided by the native host system
    • Each of these other executables must find their own dependencies, never the host system’s files of similar name.
  • Many of the toughest problems surface during the linking phase of crosscompilation, where gcc internally invokes the linker ld.
    • ld executes on the host computer - we assume an x86_64 linux system - which means it needs an x86_64 libc.so library from the toolchain. It also generally needs to link object files against the target platform’s libc.so library from a different directory in the toolchain.
    • ld also often needs files specific to the target system’s kernel or loader. These include files like usr/lib/crt1.o.
    • ld accepts many arguments detailing the target system’s memory model. Different arguments cause the linker to require different linker scripts under .../ldscripts.
    • ld sharable object files can be scripts referencing other libraries - and those references may be absolute, not relative. These scripts may need to be patched so that host paths are not followed.

Compiler developers often refactor their dependent file layouts, making it very easy to not have required files in the expected places. You will generally get a useful error message if something like crt1.o isn’t located. If a dynamic library is not found in a child process, you might just get a segfault.

The debugging process often proceeds with:

  1. Run a python integration test script that reveals multiple toolchain Bazel failures
  2. Isolate and execute a single, relatively simple, failing Bazel build operation
  3. Add Bazel diagnostics to the build command, such as --sandbox_debug
  4. Locate the Bazel sandbox created for that build command and execute the gcc command directly
  5. Check the sandbox to verify that key files are available within the sandbox, not just present in the imported toolchain tarball
  6. Execute the gcc command within an strace command, with options to follow child processes and expand strings. Examine execve and open system calls to verify that imported files are found before host system files, and that the imported files are actually in a searched directory

Bazel segmentation faults after upgrade

The crosscompiler toolchain assumes that all files needed for a build are known to the Bazel build system. This assumption often breaks when upgrading a compiler or OS. This example shows what can happen when updating the host OS from Fedora 39 to Fedora 40.

The relevant integration test is generateInternalExemplars.py:

$ ./generateInternalExemplars.py
...
FAIL: test_03_riscv64_build (__main__.T0BazelEnvironment.test_03_riscv64_build)
riscV64 C build of helloworld, with checks to see if a compatible toolchain was
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/thixotropist/projects/github/ghidra_import_tests/./generateInternalExemplars.py", line 58, in test_03_riscv64_build
    self.assertEqual(0, result.returncode,
AssertionError: 0 != 1 : bazel //platforms:riscv_userspace build of userSpaceSamples:helloworld failed
...
Ran 8 tests in 6.290s

FAILED (failures=5)

The error log is large, showing 5 failures out of 8 tests. We will narrow the test to a single test case:

$ python  ./generateInternalExemplars.py T0BazelEnvironment.test_03_riscv64_build
INFO:root:Running: bazel --noworkspace_rc --output_base=/run/user/1000/bazel build -s --distdir=/opt/bazel/distdir --incompatible_enable_cc_toolchain_resolution --experimental_enable_bzlmod --incompatible_sandbox_hermetic_tmp=false --save_temps --platforms=//platforms:riscv_userspace --compilation_mode=dbg userSpaceSamples:helloworld
...
ERROR: /home/thixotropist/projects/github/ghidra_import_tests/riscv64/generated/userSpaceSamples/BUILD:3:10: Compiling userSpaceSamples/helloworld.c failed: (Segmentation fault): gcc failed: error executing CppCompile command (from target //userSpaceSamples:helloworld) toolchains/gcc-14-riscv/imported/gcc -U_FORTIFY_SOURCE '--sysroot=external/gcc-14-riscv64-suite/sysroot' -Wall -g -MD -MF bazel-out/k8-dbg/bin/userSpaceSamples/_objs/helloworld/helloworld.pic.s.d ... (remaining 20 arguments skipped)

Use --sandbox_debug to see verbose messages from the sandbox and retain the sandbox build root for debugging
toolchains/gcc-14-riscv/imported/gcc: line 5:     4 Segmentation fault      (core dumped) PATH=`pwd`/toolchains/gcc-14-riscv/imported LD_LIBRARY_PATH=external/fedora39-system-libs external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-gcc "$@"
ERROR: /home/thixotropist/projects/github/ghidra_import_tests/riscv64/generated/userSpaceSamples/BUILD:3:10: Compiling userSpaceSamples/helloworld.c failed: (Segmentation fault): gcc failed: error executing CppCompile command (from target //userSpaceSamples:helloworld) toolchains/gcc-14-riscv/imported/gcc -U_FORTIFY_SOURCE '--sysroot=external/gcc-14-riscv64-suite/sysroot' -Wall -g -MD -MF bazel-out/k8-dbg/bin/userSpaceSamples/_objs/helloworld/helloworld.pic.i.d ... (remaining 20 arguments skipped)

Use --sandbox_debug to see verbose messages from the sandbox and retain the sandbox build root for debugging
toolchains/gcc-14-riscv/imported/gcc: line 5:     4 Segmentation fault      (core dumped) PATH=`pwd`/toolchains/gcc-14-riscv/imported LD_LIBRARY_PATH=external/fedora39-system-libs external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-gcc "$@"
ERROR: /home/thixotropist/projects/github/ghidra_import_tests/riscv64/generated/userSpaceSamples/BUILD:3:10: Compiling userSpaceSamples/helloworld.c failed: (Segmentation fault): gcc failed: error executing CppCompile command (from target //userSpaceSamples:helloworld) toolchains/gcc-14-riscv/imported/gcc -U_FORTIFY_SOURCE '--sysroot=external/gcc-14-riscv64-suite/sysroot' -Wall -g -MD -MF bazel-out/k8-dbg/bin/userSpaceSamples/_objs/helloworld/helloworld.pic.d ... (remaining 19 arguments skipped)

The three segmentation fault core dumps can be found in /var/lib/systemd/coredump/.

The ERROR message indicates segmentation faults while generating three dependency listings. To drill down further we take the Use --sandbox_debug hint and run the single bazel build command:

$  cd riscv64/generated/
riscv64/generated $ bazel --noworkspace_rc --output_base=/run/user/1000/bazel build -s --sandbox_debug --distdir=/opt/bazel/distdir --incompatible_enable_cc_toolchain_resolution --experimental_enable_bzlmod --incompatible_sandbox_hermetic_tmp=false --save_temps --platforms=//platforms:riscv_userspace --compilation_mode=dbg userSpaceSamples:helloworld
...
ERROR: /home/thixotropist/projects/github/ghidra_import_tests/riscv64/generated/userSpaceSamples/BUILD:3:10: Compiling userSpaceSamples/helloworld.c failed: (Segmentation fault): linux-sandbox failed: error executing CppCompile command 
  (cd /run/user/1000/bazel/sandbox/linux-sandbox/4/execroot/_main && \
  exec env - \
    PATH=/home/thixotropist/.local/bin:/home/thixotropist/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/var/lib/snapd/snap/bin:/home/thixotropist/.local/bin:/home/thixotropist/bin:/opt/ghidra_11.1_DEV/:/home/thixotropist/.cargo/bin::/usr/lib/jvm/jdk-17-oracle-x64/bin:/opt/gradle-7.6.2/bin \
    PWD=/proc/self/cwd \
    TMPDIR=/tmp \
  /home/thixotropist/.cache/bazel/_bazel_thixotropist/install/80f400a450641cd3dd880bb8dec91ff8/linux-sandbox -t 15 -w /dev/shm -w /run/user/1000/bazel/sandbox/linux-sandbox/4/execroot/_main -w /tmp -S /run/user/1000/bazel/sandbox/linux-sandbox/4/stats.out -D /run/user/1000/bazel/sandbox/linux-sandbox/4/debug.out -- toolchains/gcc-14-riscv/imported/gcc -U_FORTIFY_SOURCE '--sysroot=external/gcc-14-riscv64-suite/sysroot' -Wall -g -MD -MF bazel-out/k8-dbg/bin/userSpaceSamples/_objs/helloworld/helloworld.pic.i.d '-frandom-seed=bazel-out/k8-dbg/bin/userSpaceSamples/_objs/helloworld/helloworld.pic.i' -fPIC -iquote . -iquote bazel-out/k8-dbg/bin -iquote external/bazel_tools -iquote bazel-out/k8-dbg/bin/external/bazel_tools -fno-canonical-system-headers -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' -c userSpaceSamples/helloworld.c -E -o bazel-out/k8-dbg/bin/userSpaceSamples/_objs/helloworld/helloworld.pic.i)
toolchains/gcc-14-riscv/imported/gcc: line 5:     4 Segmentation fault      (core dumped) PATH=`pwd`/toolchains/gcc-14-riscv/imported LD_LIBRARY_PATH=external/fedora39-system-libs external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-gcc "$@"

This tells us several things:

  • the failing command is trying to generate helloworld.pic.i from userSpaceSamples/helloworld.c with the gcc flag -E. This means the failure involves the preprocessor phase, not the compiler or linker phase.
  • the failing command is executing in the sandbox directory /run/user/1000/bazel/sandbox/linux-sandbox/4.

The next step is to rerun the generated command outside of bazel, but using the bazel sandbox.

$ pushd /run/user/1000/bazel/sandbox/linux-sandbox/4/execroot/_main
$ toolchains/gcc-14-riscv/imported/gcc -U_FORTIFY_SOURCE '--sysroot=external/gcc-14-riscv64-suite/sysroot' -Wall -g -MD -MF bazel-out/k8-dbg/bin/userSpaceSamples/_objs/helloworld/helloworld.pic.i.d '-frandom-seed=bazel-out/k8-dbg/bin/userSpaceSamples/_objs/helloworld/helloworld.pic.i' -fPIC -iquote . -iquote bazel-out/k8-dbg/bin -iquote external/bazel_tools -iquote bazel-out/k8-dbg/bin/external/bazel_tools -fno-canonical-system-headers -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' -c userSpaceSamples/helloworld.c -E -o bazel-out/k8-dbg/bin/userSpaceSamples/_objs/helloworld/helloworld.pic.i
toolchains/gcc-14-riscv/imported/gcc: line 5: 552557 Segmentation fault      (core dumped) PATH=`pwd`/toolchains/gcc-14-riscv/imported LD_LIBRARY_PATH=external/fedora39-system-libs external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-gcc "$@"
$ cat toolchains/gcc-14-riscv/imported/gcc
#!/bin/bash
set -euo pipefail
PATH=`pwd`/toolchains/gcc-14-riscv/imported \
LD_LIBRARY_PATH=external/fedora39-system-libs \
  external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-gcc "$@"
$ ls -l external/gcc-14-riscv64-suite/bin
total 0
riscv64-unknown-linux-gnu-ar -> /run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-ar
riscv64-unknown-linux-gnu-as -> /run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-as
riscv64-unknown-linux-gnu-cpp -> /run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-cpp
riscv64-unknown-linux-gnu-gcc -> /run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-gcc
riscv64-unknown-linux-gnu-ld -> /run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-ld
riscv64-unknown-linux-gnu-ld.bfd -> /run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-ld.bfd
riscv64-unknown-linux-gnu-objdump -> /run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-objdump
riscv64-unknown-linux-gnu-ranlib -> /run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-ranlib
riscv64-unknown-linux-gnu-strip -> /run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-strip

$ ls -l external/fedora39-system-libs
total 0
libc.so -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libc.so
libc.so.6 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libc.so.6
libexpat.so.1 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libexpat.so.1
libexpat.so.1.8.10 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libexpat.so.1.8.10
libgcc_s-13-20231205.so.1 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libgcc_s-13-20231205.so.1
libgcc_s.so.1 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libgcc_s.so.1
libgmp.so.10 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libgmp.so.10
libgmp.so.10.4.1 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libgmp.so.10.4.1
libisl.so.15 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libisl.so.15
libisl.so.15.1.1 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libisl.so.15.1.1
libmpc.so.3 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libmpc.so.3
libmpc.so.3.3.1 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libmpc.so.3.3.1
libmpfr.so.6 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libmpfr.so.6
libmpfr.so.6.2.0 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libmpfr.so.6.2.0
libm.so.6 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libm.so.6
libpython3.12.so -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libpython3.12.so
libpython3.12.so.1.0 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libpython3.12.so.1.0
libpython3.so -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libpython3.so
libstdc++.so.6 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libstdc++.so.6
libstdc++.so.6.0.32 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libstdc++.so.6.0.32
libz.so.1 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libz.so.1
libz.so.1.2.13 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libz.so.1.2.13
libzstd.so.1 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libzstd.so.1
libzstd.so.1.5.5 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libzstd.so.1.5.5

This suggests a missing or out-of-date shared library, so try executing cpp with and without overriding the library path:

$ LD_LIBRARY_PATH=external/fedora39-system-libs /run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-cpp --version
Segmentation fault (core dumped)
$ /run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-cpp --version
riscv64-unknown-linux-gnu-cpp (g3f23fa7e74f) 13.2.1 20230901
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Next see which libraries are required for cpp to execute:

$ ldd /run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-cpp
	linux-vdso.so.1 (0x00007ffdb7172000)
	libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007faf79200000)
	libm.so.6 => /lib64/libm.so.6 (0x00007faf7911d000)
	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007faf794ff000)
	libc.so.6 => /lib64/libc.so.6 (0x00007faf78f30000)
	/lib64/ld-linux-x86-64.so.2 (0x00007faf79547000)

Is this a case of a missing library, or something corrupt in our imported fedora39-system-libs? Try a differential test in which we search both library directories, in different orders:

$ LD_LIBRARY_PATH=/lib64/:/run/user/1000/bazel/execroot/_main/external/fedora39-system-libs ldd /run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-cpp
	linux-vdso.so.1 (0x00007ffeb2b92000)
	libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007fda32200000)
	libm.so.6 => /lib64/libm.so.6 (0x00007fda3211d000)
	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fda324a7000)
	libc.so.6 => /lib64/libc.so.6 (0x00007fda31f30000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fda324d6000)
$  LD_LIBRARY_PATH=/run/user/1000/bazel/execroot/_main/external/fedora39-system-libs:/lib64 ldd /run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-cpp
Segmentation fault (core dumped)

We can trace the library and child process actions with commands like:

$ strace -f --string-limit=1000 ldd /run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-cpp 2>&1 |egrep 'openat|execve'
execve("/usr/bin/ldd", ["ldd", "/run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-cpp"], 0x7ffcd3479788 /* 54 vars */) = 0
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib64/libtinfo.so.6", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/dev/tty", O_RDWR|O_NONBLOCK) = 3
openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/lib64/gconv/gconv-modules.cache", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/bin/ldd", O_RDONLY) = 3
openat(AT_FDCWD, "/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/share/locale/en_US.UTF-8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale/en_US.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale/en_US/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale/en.UTF-8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale/en.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale/en/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 595197] execve("/lib64/ld-linux-x86-64.so.2", ["/lib64/ld-linux-x86-64.so.2", "--verify", "/run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-cpp"], 0x56317c32bf10 /* 54 vars */) = 0
[pid 595197] openat(AT_FDCWD, "/run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-cpp", O_RDONLY|O_CLOEXEC) = 3
[pid 595200] execve("/lib64/ld-linux-x86-64.so.2", ["/lib64/ld-linux-x86-64.so.2", "/run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-cpp"], 0x56317c339130 /* 58 vars */) = 0
[pid 595200] openat(AT_FDCWD, "/run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-cpp", O_RDONLY|O_CLOEXEC) = 3
[pid 595200] openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
[pid 595200] openat(AT_FDCWD, "/lib64/libstdc++.so.6", O_RDONLY|O_CLOEXEC) = 3
[pid 595200] openat(AT_FDCWD, "/lib64/libm.so.6", O_RDONLY|O_CLOEXEC) = 3
[pid 595200] openat(AT_FDCWD, "/lib64/libgcc_s.so.1", O_RDONLY|O_CLOEXEC) = 3
[pid 595200] openat(AT_FDCWD, "/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3

Examine the imported fedora39-system-libs directory, finding one significant error. The file libc.so is not a symbolic link but a GNU linker script, referencing the host’s /lib64/libc.so.6, /usr/lib64/libc_nonshared.a, and /lib64/ld-linux-x86-64.so.2.
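For reference, such a linker script typically looks like the following - an illustration, since the exact contents vary by glibc build:

/* GNU ld script */
OUTPUT_FORMAT(elf64-x86-64)
GROUP ( /lib64/libc.so.6 /usr/lib64/libc_nonshared.a  AS_NEEDED ( /lib64/ld-linux-x86-64.so.2 ) )

If we purge libc.* from fedora39-system-libs we get a saner result: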

$ LD_LIBRARY_PATH=/run/user/1000/bazel/execroot/_main/external/fedora39-system-libs:/lib64 ldd /run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-cpp 2>&1 
	linux-vdso.so.1 (0x00007ffda9fed000)
	libstdc++.so.6 => /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libstdc++.so.6 (0x00007f0c1c45a000)
	libm.so.6 => /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libm.so.6 (0x00007f0c1c379000)
	libgcc_s.so.1 => /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libgcc_s.so.1 (0x00007f0c1c355000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f0c1c168000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f0c1c6b0000)

Now we have a hermeticity design question to resolve - which system libraries do we import, and which do we pull from the host machine? This exercise suggests we use the host libraries for the dynamic loader and the standard C library, and import the libraries associated with the C and C++ compiler.

Update the LD_LIBRARY_PATH variable in all toolchain scripts and explicitly remove libc.* files from the imported system libraries.
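The revised wrapper script might look like this - a sketch under the assumptions above, not necessarily the exact project script; the key change is that libc and the dynamic loader now resolve on the host:

#!/bin/bash
set -euo pipefail
# imported compiler support libraries first, host /lib64 for libc and the loader
PATH=`pwd`/toolchains/gcc-14-riscv/imported \
LD_LIBRARY_PATH=external/fedora39-system-libs:/lib64 \
  external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-gcc "$@"

With those changes in place, retry the failing tests: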

$ ./generateInternalExemplars.py 
INFO:root:Running: bazel --noworkspace_rc --output_base=/run/user/1000/bazel build -s --distdir=/opt/bazel/distdir --incompatible_enable_cc_toolchain_resolution --experimental_enable_bzlmod --incompatible_sandbox_hermetic_tmp=false --save_temps --platforms=@local_config_platform//:host --compilation_mode=dbg userSpaceSamples:helloworld
.INFO:root:Running: bazel query //platforms:*
.INFO:root:Running: bazel --noworkspace_rc --output_base=/run/user/1000/bazel build -s --distdir=/opt/bazel/distdir --incompatible_enable_cc_toolchain_resolution --experimental_enable_bzlmod --incompatible_sandbox_hermetic_tmp=false --save_temps --platforms=//platforms:riscv_userspace --compilation_mode=dbg userSpaceSamples:helloworld
INFO:root:Running: file bazel-bin/userSpaceSamples/_objs/helloworld/helloworld.pic.o
.INFO:root:Running: bazel --noworkspace_rc --output_base=/run/user/1000/bazel build -s --distdir=/opt/bazel/distdir --incompatible_enable_cc_toolchain_resolution --experimental_enable_bzlmod --incompatible_sandbox_hermetic_tmp=false --save_temps --platforms=//platforms:riscv_userspace --compilation_mode=dbg userSpaceSamples:helloworld++
INFO:root:Running: file bazel-bin/userSpaceSamples/_objs/helloworld++/helloworld.pic.o
.INFO:root:Running: bazel --noworkspace_rc --output_base=/run/user/1000/bazel build -s --distdir=/opt/bazel/distdir --incompatible_enable_cc_toolchain_resolution --experimental_enable_bzlmod --incompatible_sandbox_hermetic_tmp=false --save_temps --platforms=//platforms:riscv_custom assemblySamples:archive
.INFO:root:Running: bazel --noworkspace_rc --output_base=/run/user/1000/bazel build -s --distdir=/opt/bazel/distdir --incompatible_enable_cc_toolchain_resolution --experimental_enable_bzlmod --incompatible_sandbox_hermetic_tmp=false --save_temps --platforms=//platforms:riscv_custom gcc_expansions:archive
.INFO:root:Running: bazel --noworkspace_rc --output_base=/run/user/1000/bazel build -s --distdir=/opt/bazel/distdir --incompatible_enable_cc_toolchain_resolution --experimental_enable_bzlmod --incompatible_sandbox_hermetic_tmp=false --save_temps --platforms=//platforms:riscv_custom @whisper_cpp//:main @whisper_cpp//:main.stripped
INFO:root:Running: bazel --noworkspace_rc --output_base=/run/user/1000/bazel build -s --distdir=/opt/bazel/distdir --incompatible_enable_cc_toolchain_resolution --experimental_enable_bzlmod --incompatible_sandbox_hermetic_tmp=false --save_temps --platforms=//platforms:riscv_vector @whisper_cpp//:main @whisper_cpp//:main.stripped
INFO:root:Running: bazel --noworkspace_rc --output_base=/run/user/1000/bazel build -s --distdir=/opt/bazel/distdir --incompatible_enable_cc_toolchain_resolution --experimental_enable_bzlmod --incompatible_sandbox_hermetic_tmp=false --save_temps --platforms=//platforms:riscv_userspace @whisper_cpp//:main @whisper_cpp//:main.stripped
.INFO:root:Running: bazel --noworkspace_rc --output_base=/run/user/1000/bazel build -s --distdir=/opt/bazel/distdir --incompatible_enable_cc_toolchain_resolution --experimental_enable_bzlmod --incompatible_sandbox_hermetic_tmp=false --save_temps --platforms=//platforms:x86_64_default gcc_vectorization:archive
.
----------------------------------------------------------------------
Ran 8 tests in 61.357s

OK

7.3 - refreshing toolchains

Refreshing (updating) an existing toolchain is mostly straightforward.

Warning: This sequence uses unreleased code for binutils, gcc, and glibc. We use this experimental toolchain to get a glimpse of future toolchains and products, not for stable code.

Update binutils

binutils’ latest release is 2.42. Let’s update our RISCV toolchain to use the current binutils head, which is currently very close to the released version. The git log shows relatively little change to the RISCV assembler, other than some corrections to the THead extension encodings.

  • Update the source directory to commit a197d5f7eb27e99c27577 (January 18, 2024); RISCV updates from various alibaba contributors have landed since the previous snapshot.
  • switch to the binutils build directory and refresh the configuration, build, and install to /opt/riscvx.
$ /home2/vendor/binutils-gdb/configure --prefix=/opt/riscvx --target=riscv64-unknown-linux-gnu
$ make
$ make install

Update gcc

  • Update to the tip of the master branch, glancing at the log to see that alibaba, intel, rivai, rivos, and others have contributed recent RISCV updates.
  • switch to the existing build directory, clean the old configuration, and repeat the configuration used before.
  • make and install to /opt/riscvx
$ make distclean
$ /home2/vendor/gcc/configure --prefix=/opt/riscvx --enable-languages=c,c++,lto --disable-multilib --target=riscv64-unknown-linux-gnu --with-sysroot=/opt/riscvx/sysroot
$ make
$ make install

update glibc

Update the source directory to the tip of the master branch, then refresh the configuration, build, and install:

$ ../../vendor/glibc/configure CC=/opt/riscvx/bin/riscv64-unknown-linux-gnu-gcc  --host=riscv64-unknown-linux-gnu --target=riscv64-unknown-linux-gnu --prefix=/opt/riscvx --disable-werror --enable-shared --disable-multilib
$ make
$ make install

testing the refreshed toolchain

The previous steps generate a new, non-portable toolchain under /opt/riscvx. Before generating the portable tarball (e.g., risc64_linux_gcc-14.0.1.tar.xz) we can exercise the newer toolchain directly. If we pass --platforms=//platforms:riscv_local to bazel it will use a toolchain loaded from local files under /opt/riscvx instead of files extracted from the portable tarball.

This is mostly useful in debugging the bazel ‘sandbox’ - recognizing newer files required by the toolchain that have been installed locally but not explicitly included in the portable tarball.

For example, suppose we are refreshing the gcc-14 toolchain from 14.0.0 to 14.0.1. The following sequence of builds should all succeed:

# build with an unrelated and fully released toolchain as a control experiment
$ bazel build --platforms=//platforms:riscv_userspace @whisper_cpp//:main
...
Target @@whisper_cpp//:main up-to-date:
  bazel-bin/external/whisper_cpp/main        /// build was successful
...
$ strings bazel-bin/external/whisper_cpp/main|grep GCC
GCC_3.0
GCC: (GNU) 13.2.1 20230901                   /// only the released compiler was used
GCC: (g3f23fa7e74f) 13.2.1 20230901
_Unwind_Resume@GCC_3.0

# repeat with the local toolchain introducing 14.0.1 for the application build
$ bazel build --platforms=//platforms:riscv_local @whisper_cpp//:main
...
Target @@whisper_cpp//:main up-to-date:
  bazel-bin/external/whisper_cpp/main       /// build was successful
...
$ strings bazel-bin/external/whisper_cpp/main|grep GCC
GCC_3.0
GCC: (GNU) 13.2.1 20230901                  /// some system files were previously compiled
GCC: (GNU) 14.0.1 20240130 (experimental)   /// the new toolchain was used in part
_Unwind_Resume@GCC_3.0

# repeat with the candidate portable tarball
$ bazel build --platforms=//platforms:riscv_vector @whisper_cpp//:main
...
Target @@whisper_cpp//:main up-to-date:
  bazel-bin/external/whisper_cpp/main       /// build was successful
...
$ strings bazel-bin/external/whisper_cpp/main|grep GCC
GCC_3.0
GCC: (GNU) 13.2.1 20230901
GCC: (GNU) 14.0.1 20240130 (experimental)   /// the new toolchain was used in part
_Unwind_Resume@GCC_3.0

Different build options can require different files in the portable tarball, so this kind of test may fail for some projects while succeeding in others. That’s easily fixed by updating the generate.sh rsync script that builds the portable tarball.
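The fix is usually one more copy rule in that script. A hypothetical example (the staging directory layout is illustrative, not the project’s actual script):

# copy a newly required toolchain support directory into the staging tree
# before re-packing the portable tarball
rsync -a /opt/riscvx/libexec/ staging/opt/riscvx/libexec/
tar -cJf risc64_linux_gcc-14.0.1.tar.xz -C staging .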

8 - Notes

Put unstructured comments here until we know what to do with them.

TODO

  • Update the isa_ext Ghidra branch to expand vsetvli arguments
    • vsetvli zero,zero,0xc5 → vsetvli zero,zero,e8,mf8,ta,ma
    • vsetvli zero,zero,0x18 → vsetvli zero,zero,e64,m1,tu,mu
  • Determine why the isa_ext Ghidra branch fails to disassemble the bext instruction in b-ext-64.o and b-ext.o
    • that regression was due to an accidental typo
  • Determine why zvbc.o won’t disassemble
    • These are compressed (16 bit) vector multiply instructions not currently defined in isa_ext
  • Determine why unknown.o won’t disassemble, and document where we found these instructions
    • These instructions include several sfence variants, hinval_vvma, hinval_gvma, orc.b, cbo.clean, cbo.inval, and cbo.flush. orc.b is handled properly; the others are not implemented.
  • Clarify python scripts to show more of the directory context

Experiments

how much Ghidra complexity does gcc-14 introduce in a full build?

Assume a vendor generates a new toolchain with multiple extensions enabled by default. What fraction of the compiled functions would contain extensions unrecognized by Ghidra 11.0? Since THead has supplied most of the vendor-specific extensions known to binutils 2.41, we’ll use that as a reference. The architecture name will be something like:

-march=rv64gv_zba_zbb_zbc_zbkb_zbkc_zbkx_zvbc_xtheadba_xtheadbb_xtheadbs_xtheadcmo_xtheadcondmov_xtheadmac_xtheadfmemidx_xtheadmempair_xtheadsync
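To generate exemplars, that string would be passed to a sufficiently new cross-compiler in the usual way; a hypothetical invocation (the source file name is ours):

$ riscv64-unknown-linux-gnu-gcc -O2 -march=rv64gv_zba_zbb_zbc_zbkb_zbkc_zbkx_zvbc_xtheadba_xtheadbb_xtheadbs_xtheadcmo_xtheadcondmov_xtheadmac_xtheadfmemidx_xtheadmempair_xtheadsync -c exemplar.c -o exemplar.o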

Add some C++ code to exercise libstdc++ ordered maps (based on red-black trees?), unordered maps (hash table based), and the Murmur hash function - something like the sketch below.
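A minimal sketch of that exemplar code (function and variable names are ours; libstdc++’s std::hash for strings is Murmur-derived, which pulls in the hash function we want to observe):

#include <map>
#include <string>
#include <unordered_map>

// exercise the libstdc++ containers named above so the compiled binary
// contains red-black tree, hash table, and string-hashing code paths
int exercise_containers() {
    std::map<int, int> ordered;                     // red-black tree internally
    std::unordered_map<std::string, int> unordered; // hash table based
    for (int i = 0; i < 1000; ++i) {
        ordered[i] = i * i;
        unordered[std::to_string(i)] = i;           // std::hash<std::string> -> Murmur-style hashing
    }
    return (int)(ordered.count(500) + unordered.count("500"));
}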

There are a few places where THead customized instructions are used. The Murmur hash function uses vector load and store instructions to implement 8 byte unaligned reads. Bit manipulation extension instructions are not yet used.

Initial results suggest the largest complexity impact will be gcc rewriting memory and structure copies with vector code. This may be especially true for hardware requiring aligned integers where alignment cannot be guaranteed.

8.1 - Hardware Availability

When will RISCV-64 cores be deployed into systems needing reverse-engineering?

General purpose systems

https://www.cnx-software.com/2022/11/02/sifive-p670-and-p470-risc-v-processors-add-risc-v-vector-extensions/

https://www.cnx-software.com/2023/08/30/sifive-unveils-p870-high-performance-core-discusses-future-of-risc-v

https://github.com/riscv/riscv-profiles/blob/main/rva23-profile.adoc

https://www.scmp.com/tech/tech-trends/article/3232686/chinas-top-chip-designers-form-risc-v-patent-alliance-promote-semiconductor-self-sufficiency

Note: the general SiFive SDK boards might have been deprioritized in favor of specific licensing agreements. https://www.sifive.com/boards/hifive-pro-p550

https://liliputing.com/sifive-hifive-pro-p550-dev-board-coming-this-summer-with-intel-horse-creek-risc-v-chip/

We might expect to see high performance network appliances in 2026 using chip architectures like the SiFive 670 or 870, or from one of the alternative Chinese vendors. Chips with vector extensions are due soon, with crypto extensions coming shortly after. A network appliance development board might have two P670 class sockets and four to eight 10 GbE network interfaces.

To manage scope, we won’t be worrying about instructions supporting AI or complex virtualization. Custom instructions that might be used in network appliances are definitely in scope, while custom instructions for nested virtualization are not. Possibly in scope are new instructions that help manage or synchronize multi-socket cache memory.

Let’s set a provocative long term goal: How will Ghidra analyze a future network appliance that combines Machine Learning with self-modifying code to accelerate network routing and forwarding? Such a device might generate fast-path code sequences to sessionize incoming packets and deliver them with minimal cache flushes or branches taken.

A RISCV-64 implementation of the Marvell Octeon 10 might be a feasible future hardware component.

Portable appliances

This might include cell phones or voice-recognition apps. Things that today might use an Arm core set but be implemented with RISC-V cores in the future.

role of mixed 32 and 64 bit cores

Consider a midpoint network appliance (router or firewall) sitting near the Provider-Customer demarcation. What might an appealing RISCV processor look like? This kind of appliance likely handles a mix of link layer protocols with an optimization for low energy dissipation and low latency per packet. A fast and simple serializer/deserializer feeding a RISCV classifier and forwarding engine makes sense here. You don’t want to do network or application layer processing unless the appliance has a firewall role.

Link layer processing means a packet mix of stacked MPLS and VLAN tags with IPv4 and IPv6 network layers underneath. Packet header processing won’t need 64 bit addressing, but might benefit from the high memory bandwidth of a 64 bit core. Expect fast header hashing combined with fast hashmap session lookups (for MPLS, VLAN, and selected IP) or fast trie session lookups (for IPv4 and IPv6). Network stacks can have a lot of branches, creating pipeline stalls, so hyperthreading may make sense.

Denial of Service and overload protections make fast analytics necessary at the session level. That’s where a 64 bit core with vector and other extensions can be useful.

This all suggests we might see more hybrid RISCV designs, with a mix of many lean 32 bit cores supported by one or two 64 bit cores. The 32 bit cores handle fast link layer processing and the 64 bit cores handle background analytics and control.

In the extreme case, the 64 bit analytics engine rewrites link layer code for the 32 bit cores continuously, optimizing code paths depending on what the link layer classifiers determine the most common packet types to be for each physical port. Cache management and branch prediction hints might drive new instruction additions.

Code rewriting could start as simple updates to RISCV hint branch instructions and possibly prefetch instructions, so it isn’t necessarily as radical as it sounds.

8.2 - Network Appliances

What will RISCV-64 cores offer networking?

will vector instructions be useful in network appliances?

Network kernel code has lots of conditional branches and very few loops. This suggests RISCV vector instructions won’t be found in network appliances anytime soon, other than memmove or similar simple contexts. Gather-scatter, bit manipulation, and crypto instruction extensions are likely to be useful in networking much sooner. Ghidra will have a much easier time generating pcode for those instructions than the 25K+ RISCV vector intrinsic C functions covering all combinations of vector instructions and vector processing modes.

What should Ghidra do when faced with a counter-example, say a network appliance that aggressively moves vector analytics into network processing? Such an appliance - perhaps a smart edge router or a zero-trust gateway device - might combine the following:

  • 64 RISCV cores with no floating point or vector capability, optimized for traditional network ingress processing. These cores are designed to cope with the many branches of network packet processing, possibly including better branch prediction and hyperthreading.
  • 2 or more RISCV cores with full floating point and vector capability, optimized for performing analytics on the inbound packet stream. These analytics can range from simple statistics generation to heuristic sessionization to self-modifying code generation. The self-modifying code may be either eBPF code or native RISCV instructions, depending on how aggressive the designers may be.

In the extreme case, this might be a generative AI subsystem trained on inbound packets and emitting either optimized packet handling code or threat-detection indicators. How would a Ghidra analyst look for malware in such a system?

midpoint versus endpoint network appliances

We need to be clearer about what kind of network code we might find in different contexts:

  • midpoint equipment like network-edge routers and switches, optimized for maximum throughput
  • endpoint equipment like host computers, web servers, and database servers where applications take up the bulk of the CPU cycles

For each of these contexts we have at least two topology variants:

  • Inline network code through which packets must transit, generally optimized for low latency and high throughput
  • Tapped network code (e.g., wireshark or port-mirrored accesses) observing copies of packets for session and endpoint analytics. Latency is not an issue here.

Midpoint network appliances may need to track session state. A simple network switch is close to stateless. A real-world network switch has a lot of session state to manage if it supports:

  • denial of service overload detection or other flow control
  • link bonding or equal-weight multipath routing

The key point here is that midpoint network appliances may benefit from instruction set extensions that enable faster packet classification, hashing, and cached session lookup. An adaptive midpoint network appliance might adjust the packet classification code in real-time, based on the mix of MPLS, VLAN, IPv4, IPv6, and VPN traffic most often seen on any given network interface. ISA extensions supporting gather, hash, vector comparison, and find-first operations are good candidates here.

8.3 - A vectorization case study

Compare and debug human and gcc vectorization

This case study compares human and compiler vectorization of a simple ML quantization algorithm. We’ll assume we need to inspect the code to understand why these two binaries sometimes produce different results. Our primary goal is to see whether we can improve Ghidra’s RISCV pcode generation to make such analyses easier. A secondary goal is to collect generated instruction patterns that may help Ghidra users understand what optimizing vectorizing compilers can do to source code.

The ML algorithm under test comes from https://github.com/ggerganov/llama.cpp. It packs an array of 32 bit floats into a set of q8_0 blocks to condense large model files. The q8_0 quantization reduces 32 32 bit floating point numbers to 32 8 bit integers with an associated 16 bit floating point scale factor.
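That description implies a compact block layout, sketched below using the field names (d and qs) that appear throughout this section; the typedef for the 16 bit scale factor is our assumption about how the bit pattern is stored:

#include <stdint.h>

typedef uint16_t ggml_fp16_t;   // 16 bit float stored as a raw bit pattern

typedef struct {
    ggml_fp16_t d;              // scale factor for the block
    int8_t qs[32];              // 32 quantized 8 bit values, one per input float
} block_q8_0;                   // 34 bytes per block, as noted below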

The ggml-quants.c file in the llama.cpp repo provides both scalar source code (quantize_row_q8_0_reference) and hand-generated vectorized source code (quantize_row_q8_0).

  • The quantize_row_q8_0 function has several #ifdef sections providing hand-generated vector intrinsics for riscv, avx2, arm/neon, and wasm.
  • The quantize_row_q8_0_reference function source uses more loops but no vector instructions. GCC-14 will autovectorize the scalar quantize_row_q8_0_reference, producing vector code that is quite different from the hand-generated vector intrinsics.

The notional AI development shop wants to use Ghidra to inspect generated assembly instructions for both quantize_row_q8_0 and quantize_row_q8_0_reference to track down reported quirks. On some systems they produce identical results, on others the results differ. The test framework includes:

  • A target RISCV-64 processor supporting vector and compressed instructions.
  • GCC-14 developmental (pending release) compiler toolchain for native x86_64 builds.
  • GCC-14 developmental (pending release) RISCV-64 cross-compiler toolchain with standard options -march=rv64gcv, -O3, and -ffast-math.
  • qemu-riscv64-static emulated execution of user space RISCV-64 applications on an x86_64 Linux test server.
  • A generic unit testing framework like gtest.
  • Ghidra 11+ with the isa_ext branch supporting RISCV 1.0 vector instructions.

The unit test process involves three executions:

  • a reference x86_64 execution to test the logic on a common platform.
  • a qemu-riscv64-static execution with an emulated VLEN=256 bits.
  • a qemu-riscv64-static execution with an emulated VLEN=128 bits.

Note: This exercise uses whisper C and C++ source code as ‘ground truth’, coupled with a C++ test framework. If we didn’t have source code, we would have to reconstruct key library source files based on Ghidra inspection, then refine those reconstructions until Ghidra and unit testing show that our reconstructions behave the same as the original binaries.

As setup for the Ghidra inspection, we will build and run all three and expect to see three PASSED notifications:

$ bazel run --platforms=//platforms:x86_64 case_studies:unitTests
...

INFO: Analyzed target //case_studies:unitTests (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
Target //case_studies:unitTests up-to-date:
  bazel-bin/case_studies/unitTests
INFO: Elapsed time: 21.065s, Critical Path: 20.71s
INFO: 37 processes: 2 internal, 35 linux-sandbox.
INFO: Build completed successfully, 37 total actions
INFO: Running command line: bazel-bin/case_studies/unitTests
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from FP16
[ RUN      ] FP16.convertFromFp32Reference
[       OK ] FP16.convertFromFp32Reference (0 ms)
[ RUN      ] FP16.convertFromFp32VectorIntrinsics
[       OK ] FP16.convertFromFp32VectorIntrinsics (0 ms)
[----------] 2 tests from FP16 (0 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (0 ms total)
[  PASSED  ] 2 tests.

$ bazel build --platforms=//platforms:riscv_vector case_studies:unitTests
$ bazel build --platforms=//platforms:riscv_vector --define __riscv_v_intrinsics=1 case_studies:unitTests
WARNING: Build option --platforms has changed, discarding analysis cache (this can be expensive, see https://bazel.build/advanced/performance/iteration-speed).
INFO: Analyzed target //case_studies:unitTests (0 packages loaded, 1904 targets configured).
...
INFO: Found 1 target...
Target //case_studies:unitTests up-to-date:
  bazel-bin/case_studies/unitTests
INFO: Elapsed time: 22.265s, Critical Path: 22.07s
INFO: 37 processes: 2 internal, 35 linux-sandbox.
INFO: Build completed successfully, 37 total actions
$ export QEMU_CPU=rv64,zba=true,zbb=true,v=true,vlen=256,vext_spec=v1.0,rvv_ta_all_1s=true,rvv_ma_all_1s=true
$ qemu-riscv64-static -L /opt/riscvx -E LD_LIBRARY_PATH=/opt/riscvx/riscv64-unknown-linux-gnu/lib/ bazel-bin/case_studies/unitTests
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from FP16
[ RUN      ] FP16.convertFromFp32Reference
[       OK ] FP16.convertFromFp32Reference (1 ms)
[ RUN      ] FP16.convertFromFp32VectorIntrinsics
[       OK ] FP16.convertFromFp32VectorIntrinsics (0 ms)
[----------] 2 tests from FP16 (2 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (6 ms total)
[  PASSED  ] 2 tests.

Target //case_studies:unitTests up-to-date:
  bazel-bin/case_studies/unitTests
INFO: Elapsed time: 8.984s, Critical Path: 8.88s
INFO: 29 processes: 2 internal, 27 linux-sandbox.
INFO: Build completed successfully, 29 total actions

$ QEMU_CPU=rv64,zba=true,zbb=true,v=true,vlen=256,vext_spec=v1.0,rvv_ta_all_1s=true,rvv_ma_all_1s=true
$ qemu-riscv64-static -L /opt/riscvx -E LD_LIBRARY_PATH=/opt/riscvx/riscv64-unknown-linux-gnu/lib/ bazel-bin/case_studies/unitTests
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from FP16
[ RUN      ] FP16.convertFromFp32Reference
[       OK ] FP16.convertFromFp32Reference (1 ms)
[ RUN      ] FP16.convertFromFp32VectorIntrinsics
[       OK ] FP16.convertFromFp32VectorIntrinsics (0 ms)
[----------] 2 tests from FP16 (2 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (6 ms total)
[  PASSED  ] 2 tests.

$ export QEMU_CPU=rv64,zba=true,zbb=true,v=true,vlen=128,vext_spec=v1.0,rvv_ta_all_1s=true,rvv_ma_all_1s=true
$ qemu-riscv64-static -L /opt/riscvx -E LD_LIBRARY_PATH=/opt/riscvx/riscv64-unknown-linux-gnu/lib/ bazel-bin/case_studies/unitTests
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from FP16
[ RUN      ] FP16.convertFromFp32Reference
[       OK ] FP16.convertFromFp32Reference (1 ms)
[ RUN      ] FP16.convertFromFp32VectorIntrinsics
case_studies/unitTests.cpp:55: Failure
Expected equality of these values:
  dest[0].d
    Which is: 12175
  fp16_test_array.d
    Which is: 13264
fp16 scale factor is correct
case_studies/unitTests.cpp:57: Failure
Expected equality of these values:
  comparison
    Which is: -65
  0
entire fp16 block is converted correctly
[  FAILED  ] FP16.convertFromFp32VectorIntrinsics (8 ms)
[----------] 2 tests from FP16 (10 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (14 ms total)
[  PASSED  ] 1 test.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] FP16.convertFromFp32VectorIntrinsics

 1 FAILED TEST

These results imply:

  • The hand-vectorized quantize_row_q8_0 test passes on harts with VLEN=256 but fails when executed on harts with VLEN=128. Further tracing suggests that quantize_row_q8_0 only processes the first 16 floats, not the 32 floats that should be processed in each block.
  • The gcc autovectorized quantize_row_q8_0_reference passes on both types of harts.

Now we need to import the riscv-64 unitTests program into Ghidra and examine the compiled differences between quantize_row_q8_0 and quantize_row_q8_0_reference.

Note: Remember that our real integration test goal is to look for new problems or regressions in Ghidra’s decompiler presentation of functions like these, and then to look for ways to improve that presentation.

Original Source Code

The goal of both quantize_row_q8_0* routines is a lossy compression of 32 bit floats into blocks of 8 bit scaled values. The routines should return identical results, with quantize_row_q8_0 invoked on architectures with vector acceleration and quantize_row_q8_0_reference for all other architectures.

static const int QK8_0 = 32;
// reference implementation for deterministic creation of model files
void quantize_row_q8_0_reference(const float * restrict x, block_q8_0 * restrict y, int k) {
    assert(k % QK8_0 == 0);
    const int nb = k / QK8_0;

    for (int i = 0; i < nb; i++) {
        float amax = 0.0f; // absolute max

        for (int j = 0; j < QK8_0; j++) {
            const float v = x[i*QK8_0 + j];
            amax = MAX(amax, fabsf(v));
        }

        const float d = amax / ((1 << 7) - 1);
        const float id = d ? 1.0f/d : 0.0f;

        y[i].d = GGML_FP32_TO_FP16(d);

        for (int j = 0; j < QK8_0; ++j) {
            const float x0 = x[i*QK8_0 + j]*id;

            y[i].qs[j] = roundf(x0);
        }
    }
}
void quantize_row_q8_0(const float * restrict x, void * restrict vy, int k) {
    assert(QK8_0 == 32);
    assert(k % QK8_0 == 0);
    const int nb = k / QK8_0;

    block_q8_0 * restrict y = vy;

#if defined(__ARM_NEON)
...
#elif defined(__wasm_simd128__)
...
#elif defined(__AVX2__) || defined(__AVX__)
...
#elif defined(__riscv_v_intrinsic)

    size_t vl = __riscv_vsetvl_e32m4(QK8_0);

    for (int i = 0; i < nb; i++) {
        // load elements
        vfloat32m4_t v_x   = __riscv_vle32_v_f32m4(x+i*QK8_0, vl);

        vfloat32m4_t vfabs = __riscv_vfabs_v_f32m4(v_x, vl);
        vfloat32m1_t tmp   = __riscv_vfmv_v_f_f32m1(0.0f, vl);
        vfloat32m1_t vmax  = __riscv_vfredmax_vs_f32m4_f32m1(vfabs, tmp, vl);
        float amax = __riscv_vfmv_f_s_f32m1_f32(vmax);

        const float d = amax / ((1 << 7) - 1);
        const float id = d ? 1.0f/d : 0.0f;

        y[i].d = GGML_FP32_TO_FP16(d);
        vfloat32m4_t x0 = __riscv_vfmul_vf_f32m4(v_x, id, vl);

        // convert to integer
        vint16m2_t   vi = __riscv_vfncvt_x_f_w_i16m2(x0, vl);
        vint8m1_t    vs = __riscv_vncvt_x_x_w_i8m1(vi, vl);

        // store result
        __riscv_vse8_v_i8m1(y[i].qs , vs, vl);
    }
#else
    GGML_UNUSED(nb);
    // scalar
    quantize_row_q8_0_reference(x, y, k);
#endif
}

The reference version has an outer loop iterating over blocks of 32 scalar 32 bit floats, with two inner loops operating on the floats within each block. The first inner loop accumulates the maximum of absolute values within the block to generate a scale factor, while the second inner loop applies that scale factor to each 32 bit float in the block and then converts the scaled value to an 8 bit integer. Each output block is then a 16 bit floating point scale factor plus 32 8 bit scaled integers.

The code includes some distractions that complicate Ghidra analysis:

  • The k input parameter is signed, making the integer division by 32 more complicated than it needs to be.
  • The GGML_FP32_TO_FP16(d) conversion might be a single instruction on some architectures, but it requires branch evaluation on our RISCV-64 target architecture. GCC may elect to duplicate code in order to minimize the number of branches needed - the sketch after this list shows where the branches come from.
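Here is a simplified scalar float-to-half conversion - our own illustrative sketch, not ggml’s implementation - that truncates rather than rounds and collapses Inf/NaN to the canonical NaN 0x7e00 seen later in the decompiled output:

#include <stdint.h>
#include <string.h>

static inline uint16_t fp32_to_fp16_sketch(float f) {
    uint32_t b;
    memcpy(&b, &f, sizeof b);                     // type-pun via memcpy
    uint16_t sign = (uint16_t)((b >> 16) & 0x8000);
    uint32_t mag  = b & 0x7fffffff;               // magnitude without the sign bit
    if (mag >= 0x7f800000) return sign | 0x7e00;  // Inf/NaN -> canonical NaN
    if (mag >= 0x47800000) return sign | 0x7c00;  // finite overflow -> Inf
    if (mag <  0x38800000) return sign;           // underflow -> signed zero, subnormals ignored
    // normal range: rebias the exponent (127 -> 15) and truncate the mantissa
    return sign | (uint16_t)((mag - 0x38000000) >> 13);
}

Every one of those conditionals is a branch the compiler must place or duplicate.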

The hand-optimized quantize_row_q8_0 has similar distractions, plus a few more:

  • The two inner loops have been converted into RISCV vector intrinsics, such that each iteration processes 32 4 byte floats into a single 34 byte block_q8_0 struct.
  • Four adjacent vector registers are grouped with the m4 setting. On architectures with a vector length VLEN=256, that means all 32 4 byte floats per block will fit nicely and can be processed in parallel. If the architecture only supports a vector length of VLEN=128, then only half of each block will be processed in every iteration. That accounts for the unit test failure; the arithmetic after this list makes it concrete.
  • The code uses standard RISCV vector intrinsics - of which there are nearly 40,000 variants. The root of each intrinsic is generally a single vector instruction, extended with information on the expected vector context (from vset* instructions) and the expected return type of the result. There is no C header file providing signatures for all possible variants, so there is nothing Ghidra can import and use in the decompiler view.
  • The __riscv_vle32_v_f32m4 intrinsic is likely the slowest of the set, as this 32 bit instruction will require a 128 byte memory read, stalling the instruction pipeline for some number of cycles.
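The vsetvl arithmetic makes the failure mode concrete. The RISCV vector specification defines vl = min(AVL, VLMAX) with VLMAX = VLEN * LMUL / SEW, so for __riscv_vsetvl_e32m4(QK8_0) we have AVL=32, SEW=32, and LMUL=4:

VLEN=256: vl = min(32, 256 * 4 / 32) = 32    // the full block fits in one iteration
VLEN=128: vl = min(32, 128 * 4 / 32) = 16    // only half of each block is processed

Because vl is computed once, before the loop, a VLEN=128 hart silently quantizes only the first 16 floats of every block.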

Ghidra inspection

inspecting the hand-vectorized quantizer

Load unitTest into Ghidra and inspect quantize_row_q8_0. We know the correct signature so we can override what Ghidra has inferred, then name the parameters so that they look more like the source code.

void quantize_row_q8_0(float *x,block_q0_0 *y,long k)

{
  float fVar1;
  int iVar2;
  char *pcVar3;
  undefined8 uVar4;
  uint uVar5;
  int iVar6;
  ulong uVar7;
  undefined8 uVar8;
  undefined in_v1 [256];
  undefined auVar9 [256];
  undefined auVar10 [256];
  gp = &__global_pointer$;
  if (k < 0x20) {
    return;
  }
  uVar4 = vsetvli_e8m1tama(0x20);
  vsetvli_e32m1tama(uVar4);
  uVar5 = 0x106c50;
  iVar2 = (int)(((uint)((int)k >> 0x1f) >> 0x1b) + (int)k) >> 5;
  iVar6 = 0;
  vmv_v_i(in_v1,0);
  pcVar3 = y->qs;
  do {
    while( true ) {
      vsetvli_e32m4tama(uVar4);
      auVar9 = vle32_v(x);
      auVar10 = vfsgnjx_vv(auVar9,auVar9);
      auVar10 = vfredmax_vs(auVar10,in_v1);
      uVar8 = vfmv_fs(auVar10);
      fVar1 = (float)uVar8 * 0.007874016;
      uVar7 = (ulong)(uint)fVar1;
      if ((fVar1 == 0.0) || (uVar7 = (ulong)(uint)(127.0 / (float)uVar8), uVar5 << 1 < 0xff000001))
      break;
      auVar9 = vfmul_vf(auVar9,uVar7);
      vsetvli_e16m2tama(0);
      ((block_q0_0 *)(pcVar3 + -2))->d = 0x7e00;
      auVar9 = vfncvt_xfw(auVar9);
      iVar6 = iVar6 + 1;
      vsetvli_e8m1tama(0);
      auVar9 = vncvt_xxw(auVar9);
      vse8_v(auVar9,pcVar3);
      x = x + 0x20;
      pcVar3 = pcVar3 + 0x22;
      if (iVar2 <= iVar6) {
        return;
      }
    }
    iVar6 = iVar6 + 1;
    auVar9 = vfmul_vf(auVar9,uVar7);
    vsetvli_e16m2tama(0);
    auVar9 = vfncvt_xfw(auVar9);
    vsetvli_e8m1tama(0);
    auVar9 = vncvt_xxw(auVar9);
    vse8_v(auVar9,pcVar3);
    x = x + 0x20;
    uVar5 = uVar5 & 0xfff;
    ((block_q0_0 *)(pcVar3 + -2))->d = (short)uVar5;
    pcVar3 = pcVar3 + 0x22;
  } while (iVar6 < iVar2);
  return;
}

Note: an earlier run showed several pcode errors in riscv-rvv.sinc, which have been fixed as of this run.

Red herrings - none of these have anything to do with RISCV or vector intrinsics:

  • uVar5 = 0x106c50; - there is no uVar5 variable, just a shared upper immediate load register.
  • iVar2 = (int)(((uint)((int)k >> 0x1f) >> 0x1b) + (int)k) >> 5; - since k is a signed long and not unsigned, the compiler has to implement the divide by 32 with rounding adjustments for negative numbers.
  • fVar1 = (float)uVar8 * 0.007874016; - the compiler changed a division by 127.0 into a multiplication by 0.007874016.
  • ((block_q0_0 *)(pcVar3 + -2))->d - the compiler has set pcVar3 to point to an element within the block, so it uses negative offsets to address preceding elements.
  • duplicate code blocks - the conversion from a 32 bit float to the 16 bit float involves some branches. The compiler has decided that duplicating following code for at least one branch will be faster.
  • Decompiler handling of fmv.x.w instructions looks odd. fmv.x.w moves the single-precision value in floating-point register rs1, represented in IEEE 754-2008 encoding, to the lower 32 bits of integer register rd. This works fine when the source is zero, but it has no clear C-like representation otherwise. These instructions may be better replaced with specialized pcode operations.

There is one discrepancy that does involve the vectorization code. The source code uses a standard RISCV vector intrinsic function to store data:

__riscv_vse8_v_i8m1(y[i].qs, vs, vl);

Ghidra pcode for this instruction after renaming operands is (currently):

vse8_v(vs, y[i].qs);

The order of the first two parameters is swapped. We should probably align the pcode with the standard intrinsic signature as much as possible. Those intrinsics have context and type information encoded into their names, which Ghidra does not currently have, so we can’t match exactly.

inspecting the auto-vectorized quantizer

Load unitTest into Ghidra and inspect quantize_row_q8_0_reference. We know the correct signature so we can override what Ghidra has inferred, then name the parameters so that they look more like the source code.

void quantize_row_q8_0_reference(float *x,block_q0_0 *y,long k)

{
  float fVar1;
  long lVar2;
  long lVar3;
  char *pcVar4;
  ushort uVar5;
  int iVar6;
  ulong uVar7;
  undefined8 uVar8;
  undefined auVar9 [256];
  undefined auVar10 [256];
  undefined auVar11 [256];
  undefined auVar12 [256];
  undefined auVar13 [256];
  undefined auVar14 [256];
  undefined in_v7 [256];
  undefined auVar15 [256];
  undefined auVar16 [256];
  undefined auVar17 [256];
  undefined auVar18 [256];
  undefined auVar19 [256];

  gp = &__global_pointer$;
  if (k < 0x20) {
    return;
  }
  vsetivli_e32m1tama(4);
  pcVar4 = y->qs;
  lVar2 = 0;
  iVar6 = 0;
  auVar15 = vfmv_sf(0xff800000);
  vmv_v_i(in_v7,0);
  do {
    lVar3 = (long)x + lVar2;
    auVar14 = vle32_v(lVar3);
    auVar13 = vle32_v(lVar3 + 0x10);
    auVar10 = vfsgnjx_vv(auVar14,auVar14);
    auVar9 = vfsgnjx_vv(auVar13,auVar13);
    auVar10 = vfmax_vv(auVar10,in_v7);
    auVar9 = vfmax_vv(auVar9,auVar10);
    auVar12 = vle32_v(lVar3 + 0x20);
    auVar11 = vle32_v(lVar3 + 0x30);
    auVar10 = vfsgnjx_vv(auVar12,auVar12);
    auVar10 = vfmax_vv(auVar10,auVar9);
    auVar9 = vfsgnjx_vv(auVar11,auVar11);
    auVar9 = vfmax_vv(auVar9,auVar10);
    auVar10 = vle32_v(lVar3 + 0x40);
    auVar18 = vle32_v(lVar3 + 0x50);
    auVar16 = vfsgnjx_vv(auVar10,auVar10);
    auVar16 = vfmax_vv(auVar16,auVar9);
    auVar9 = vfsgnjx_vv(auVar18,auVar18);
    auVar9 = vfmax_vv(auVar9,auVar16);
    auVar17 = vle32_v(lVar3 + 0x60);
    auVar16 = vle32_v(lVar3 + 0x70);
    auVar19 = vfsgnjx_vv(auVar17,auVar17);
    auVar19 = vfmax_vv(auVar19,auVar9);
    auVar9 = vfsgnjx_vv(auVar16,auVar16);
    auVar9 = vfmax_vv(auVar9,auVar19);
    auVar9 = vfredmax_vs(auVar9,auVar15);
    uVar8 = vfmv_fs(auVar9);
    fVar1 = (float)uVar8 * 0.007874016;
    uVar7 = (ulong)(uint)fVar1;
    if (fVar1 == 0.0) {
LAB_00076992:
      uVar5 = ((ushort)lVar3 & 0xfff) + ((ushort)((uint)lVar3 >> 0xd) & 0x7c00);
    }
    else {
      uVar7 = (ulong)(uint)(127.0 / (float)uVar8);
      uVar5 = 0x7e00;
      if ((uint)lVar3 << 1 < 0xff000001) goto LAB_00076992;
    }
    auVar9 = vfmv_vf(uVar7);
    auVar14 = vfmul_vv(auVar14,auVar9);
    auVar14 = vfcvt_xfv(auVar14);
    auVar13 = vfmul_vv(auVar13,auVar9);
    auVar13 = vfcvt_xfv(auVar13);
    auVar12 = vfmul_vv(auVar12,auVar9);
    auVar12 = vfcvt_xfv(auVar12);
    auVar11 = vfmul_vv(auVar9,auVar11);
    auVar11 = vfcvt_xfv(auVar11);
    vsetvli_e16mf2tama(0);
    auVar14 = vncvt_xxw(auVar14);
    vsetvli_e8mf4tama(0);
    auVar14 = vncvt_xxw(auVar14);
    vse8_v(auVar14,pcVar4);
    vsetvli_e32m1tama(0);
    auVar14 = vfmul_vv(auVar9,auVar10);
    vsetvli_e16mf2tama(0);
    ((block_q0_0 *)(pcVar4 + -2))->d = (ushort)((ulong)lVar3 >> 0x10) & 0x8000 | uVar5;
    auVar10 = vncvt_xxw(auVar13);
    vsetvli_e8mf4tama(0);
    auVar10 = vncvt_xxw(auVar10);
    vse8_v(auVar10,pcVar4 + 4);
    vsetvli_e32m1tama(0);
    auVar13 = vfcvt_xfv(auVar14);
    vsetvli_e16mf2tama(0);
    auVar10 = vncvt_xxw(auVar12);
    vsetvli_e8mf4tama(0);
    auVar10 = vncvt_xxw(auVar10);
    vse8_v(auVar10,pcVar4 + 8);
    vsetvli_e32m1tama(0);
    auVar12 = vfmul_vv(auVar9,auVar18);
    vsetvli_e16mf2tama(0);
    auVar10 = vncvt_xxw(auVar11);
    vsetvli_e8mf4tama(0);
    auVar10 = vncvt_xxw(auVar10);
    vse8_v(auVar10,pcVar4 + 0xc);
    vsetvli_e32m1tama(0);
    auVar11 = vfcvt_xfv(auVar12);
    vsetvli_e16mf2tama(0);
    auVar10 = vncvt_xxw(auVar13);
    vsetvli_e8mf4tama(0);
    auVar10 = vncvt_xxw(auVar10);
    vse8_v(auVar10,pcVar4 + 0x10);
    vsetvli_e32m1tama(0);
    auVar10 = vfmul_vv(auVar9,auVar17);
    vsetvli_e16mf2tama(0);
    auVar11 = vncvt_xxw(auVar11);
    vsetvli_e8mf4tama(0);
    auVar11 = vncvt_xxw(auVar11);
    vse8_v(auVar11,pcVar4 + 0x14);
    vsetvli_e32m1tama(0);
    auVar10 = vfcvt_xfv(auVar10);
    vsetvli_e16mf2tama(0);
    auVar10 = vncvt_xxw(auVar10);
    vsetvli_e8mf4tama(0);
    auVar10 = vncvt_xxw(auVar10);
    vse8_v(auVar10,pcVar4 + 0x18);
    vsetvli_e32m1tama(0);
    auVar9 = vfmul_vv(auVar16,auVar9);
    auVar9 = vfcvt_xfv(auVar9);
    vsetvli_e16mf2tama(0);
    iVar6 = iVar6 + 1;
    auVar9 = vncvt_xxw(auVar9);
    vsetvli_e8mf4tama(0);
    auVar9 = vncvt_xxw(auVar9);
    vse8_v(auVar9,pcVar4 + 0x1c);
    lVar2 = lVar2 + 0x80;
    pcVar4 = pcVar4 + 0x22;
    if ((int)(((uint)((int)k >> 0x1f) >> 0x1b) + (int)k) >> 5 <= iVar6) {
      return;
    }
    vsetvli_e32m1tama(0);
  } while( true );
}

Some of the previous red herrings show up here too. Things to note:

  • undefined auVar19 [256]; - something in riscv-rvv.sinc is claiming vector registers are 256 bits long - that’s not generally true, so hunt down the confusion.
    • riscv.reg.sinc is the root of this, with @define VLEN "256" and define register offset=0x4000 size=$(VLEN) [ v0 ...]. What should Ghidra believe the size of vector registers to be? More generally, should the size and element type of vector registers be mutable?
  • the autovectorizer has correctly decided VLEN=128 architectures must be supported, and has dedicated 8 vector registers to hold all 32 floats required per loop iteration. Unlike the hand-optimized solution, the 8 vector registers are handled by 8 interleaved sequences of vector instructions. This roughly doubles the instruction count, but provides good distribution of load and store memory operations across the loop, likely minimizing execution stalls.

RISCV vector instruction execution engines - and autovectorization passes in gcc - are both so immature that we have no idea which implementation performs better. At best we can guess that autovectorization will be good enough to make hand-optimized coding with RISCV intrinsic functions rarely needed.

Vectorized function analysis without source code

Now try using Ghidra to inspect a function that dominates execution time in the whisper.cpp demo - ggml_vec_dot_f16. We’ll do this without first checking the source code. We’ll make a few reasonable assumptions:

  • this is likely a vector dot product
  • the vector elements are 16 bit floating point values of the type we’ve seen already.

A quick inspection lets us rewrite the function signature as:

void ggml_vec_dot_f16(long n,float *sum,fp16 *x,fp16 *y) {...}

That quick inspection also shows a glaring error - the pcode semantics for vluxei64.v has left out a critical parameter. It’s present in the listing view but missing in the pcode semantics view. Fix this and move on.

After tinkering with variable names and signatures, we get:

void ggml_vec_dot_q8_0_q8_0(long n,float *sum,block_q8_0 *x,block_q8_0 *y)

{
  block_q8_0 *pbVar1;
  int iVar2;
  char *px_qs;
  char *py_qs;
  undefined8 uVar3;
  undefined8 uVar4;
  float partial_sum;
  undefined auVar5 [256];
  undefined auVar6 [256];
  undefined in_v5 [256];

  gp = &__global_pointer$;
  partial_sum = 0.0;
  uVar4 = vsetvli_e8m1tama(0x20);
  if (0x1f < n) {
    px_qs = x->qs;
    py_qs = y->qs;
    iVar2 = 0;
    vsetvli_e32m1tama(uVar4);
    vmv_v_i(in_v5,0);
    do {
      pbVar1 = (block_q8_0 *)(px_qs + -2);
      vsetvli_e8m1tama(uVar4);
      auVar6 = vle8_v(px_qs);
      auVar5 = vle8_v(py_qs);
      auVar5 = vwmul_vv(auVar6,auVar5);
      vsetvli_e16m2tama(0);
      auVar5 = vwredsum_vs(auVar5,in_v5);
      vsetivli_e32m1tama(0);
      uVar3 = vmv_x_s(auVar5);
      iVar2 = iVar2 + 1;
      px_qs = px_qs + 0x22;
      partial_sum = (float)(int)uVar3 *
                    (float)(&ggml_table_f32_f16)[((block_q8_0 *)(py_qs + -2))->field0_0x0] *
                    (float)(&ggml_table_f32_f16)[pbVar1->field0_0x0] + partial_sum;
      py_qs = py_qs + 0x22;
    } while (iVar2 < (int)(((uint)((int)n >> 0x1f) >> 0x1b) + (int)n) >> 5);
  }
  *sum = partial_sum;
  return;
}

That’s fairly clear - the two vectors are presented as arrays of block_q8_0 structs, each with 32 entries and a scale factor d. An earlier run showed another error, now fixed, with the pcode for vmv_x_s.

8.4 - Pcode testing

Ghidra testing of semantic pcode.

Note: paths and names are likely to change here. Use these notes just as a guide.

The Ghidra 11 isa_ext branch makes heavy use of user-defined pcode (aka Sleigh semantics). Much of that pcode is arbitrarily defined, adding more confusion to an already complex field. Can we build a testing framework to highlight problem areas in pcode semantics?

For example, let’s look at Ghidra’s decompiler rendering of two RISCV-64 vector instructions vmv.s.x and vmv.x.s. These instructions move a single element between an integer scalar register and the first element of a vector register. The RISCV vector definition says:

  • The vmv.x.s instruction copies a single SEW-wide element from index 0 of the source vector register to a destination integer register.
  • The vmv.s.x instruction copies the scalar integer register to element 0 of the destination vector register.

These instructions have a lot of symmetry, but the current isa_ext branch doesn’t render them symmetrically. Let’s build a sample function that uses both instructions followed by an assertion of what we expect to see.

bool test_integer_scalar_vector_move() {
    ///@ exercise integer scalar moves into and out of a vector register
    int x = 1;
    int y = 0;
    // set vector mode to something simple
    __asm__ __volatile__ ("vsetivli zero,1,e32,m1,ta,ma\n\t");
    // execute both instructions to set y:= x
    __asm__ __volatile__ ("vmv.s.x  v1, %1\n\t" "vmv.x.s  %0, v1"\
                          : "=r" (y) \
                          : "r" (x) );
    return x==y;
}

This function should return the boolean value true. It’s defined in the file failing_tests/pcodeSamples.cpp and compiled into the library libsamples.so. The function is executed within the test harness failing_tests/pcodeTests.cpp.

Build (with O3 optimization) and execute the test harness with:

$ bazel clean
INFO: Starting clean (this may take a while). Consider using --async if the clean takes more than several minutes.
$ bazel build -s -c opt  --platforms=//platforms:riscv_vector failing_tests:samples
$ cp -f bazel-bin/failing_tests/libsamples.so /tmp
$ bazel build -s -c opt  --platforms=//platforms:riscv_vector failing_tests:pcodeTests
$ export QEMU_CPU=rv64,zba=true,zbb=true,v=true,vlen=128,vext_spec=v1.0,rvv_ta_all_1s=true,rvv_ma_all_1s=true
$ qemu-riscv64-static -L /opt/riscvx -E LD_LIBRARY_PATH=/opt/riscvx/riscv64-unknown-linux-gnu/lib/ bazel-bin/failing_tests/pcodeTests
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from VectorMove
[ RUN      ] VectorMove.vmv_s_x
[       OK ] VectorMove.vmv_s_x (0 ms)
[----------] 1 test from VectorMove (0 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (3 ms total)
[  PASSED  ] 1 test.

Now import /tmp/libsamples.so into Ghidra and look for test* functions. The decompilation is:

bool test_integer_scalar_vector_move(void)
{
  undefined8 uVar1;
  undefined in_v1 [256];
  vsetivli_e32m1tama(1);
  vmv_s_x(in_v1,1);
  uVar1 = vmv_x_s(in_v1);
  return (int)uVar1 == 1;
}

This shows two issues:

  • Ghidra believes vector v1 is 256 bits long, when in fact it is unknown at compile and link time. That’s probably OK for now, as it provides a hint that this is a vector register.
  • The treatment of instruction output is inconsistent. For vmv_s_x, the output is the first element of in_v1. For vmv_x_s, the output is the scalar register uVar1. That might be OK, since we don’t specify what happens to other elements of in_v1.

The general question raised by this example is how to treat pcode output - as an output parameter or as a returned result? The sleigh pcode documentation suggests that parameters are assumed to be inputs, with the only output being the register receiving the result of the pcode operation. A quick glance at the ARM Neon and AARCH64 SVE vector sleigh files suggests that this is the convention, but perhaps not a requirement.
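Schematically, the two conventions now in play look like this in the sleigh semantic sections (a sketch with constraint patterns omitted, not the actual isa_ext source):

rd = vmv_x_s(vs2);   # pcodeop output bound to the returned result
vmv_s_x(vd, rs1);    # pcodeop output passed as the first parameter

Rewriting the second form as vd = vmv_s_x(rs1); would restore the symmetry, at the cost of implying that the entire vector register is written.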

Let’s try adding some more test cases before taking any action.

8.5 - Tracking Convergence

We can track external events to plan future integration test effort.

This project gets more relevant when RISCV-64 processors start appearing in appliances with more instruction set extensions and with code compiled by newer compiler toolchains.

The project results get easier to integrate if and when more development effort is applied to specific Ghidra components.

This page collects external sites to track for convergence.

Toolchains and platforms

binutils

New instruction extensions often appear here as the first public implementation. Check out the opcodes, aliases, and disassembly patterns found in the test suite.

  • track the source
  • inspect git log include/opcode/|grep riscv|head
  • inspect git log gas/testsuite/gas/riscv
  • track updates to the list of supported extensions


compilers

  • track the source
  • inspect git log gcc/testsuite/gcc.target/riscv


Look for commits indicating the stability of vectorization or new compound loop types that now allow auto vectorization.

libraries


  • glibc
    • Not much specific to RISC-V
  • openssl (in master, not released as of openssl 3.2)

kernels

  • track the source
  • inspect git log arch/riscv

Note: the Linux kernel just added vector crypto support, derived from the openssl crypto routines. This appears to mostly be in support of encrypted file systems.

system images

  • Fedora
  • Ubuntu

cloud instances

RISCV International Wiki

The RISCV International wiki home page leads to:

ISA Extensions

  • profiles and individual standards-tracked extensions
  • vendor-specific extensions
  • gcc intrinsics

Applications

  • track source
  • Look for use of riscv intrinsics alongside their arm/Neon and avx2 equivalents, as opposed to code that relies on compiler autovectorization (see the sketch after this list).
  • Watch for standardization of 16-bit floating point.
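
The intrinsics pattern to look for resembles this sketch: a hand-vectorized elementwise add. The function is illustrative, with names per the v1.0 intrinsics spec shipped with gcc-14:

#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

// Illustrative hand-vectorized kernel of the kind applications ship
// alongside NEON or AVX2 variants: a strip-mined elementwise add.
void add_i32(int32_t *dst, const int32_t *a, const int32_t *b, size_t n)
{
  for (size_t vl; n > 0; n -= vl, a += vl, b += vl, dst += vl) {
    vl = __riscv_vsetvl_e32m1(n);                  // elements this pass
    vint32m1_t va = __riscv_vle32_v_i32m1(a, vl);  // load a
    vint32m1_t vb = __riscv_vle32_v_i32m1(b, vl);  // load b
    __riscv_vse32_v_i32m1(dst, __riscv_vadd_vv_i32m1(va, vb, vl), vl);
  }
}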

Ghidra

similar vector instruction suites

Ghidra/Processors/AARCH64/data/languages/AARCH64sve.sinc defines the instructions used by the AARCH64 Scalable Vector Extension package. This suite is similar to the RISCV vector suite in that it is vector register length agnostic. It was added in March of 2019 and has not been updated since.

pcode extensions

Ghidra/Features/Decompiler/src/decompile/cpp holds much of the existing Ghidra code for system and user-defined pcode ops. userop.h and userop.cc look relevant, with caheckman a common contributor.

Community

8.6 - Deep Dive Openssl

Openssl’s configuration for RISCV ISA extensions provides a good worked example.

/home2/build_openssl$ ../vendor/openssl/Configure linux64-riscv64 --cross-compile-prefix=/opt/riscvx/bin/riscv64-unknown-linux-gnu- -march=rv64gcv_zkne_zknd_zknh_zvkng_zvksg
$ perl configdata.pm --dump

Command line (with current working directory = .):

    /usr/bin/perl ../vendor/openssl/Configure linux64-riscv64 --cross-compile-prefix=/opt/riscvx/bin/riscv64-unknown-linux-gnu- -march=rv64gcv_zkne_zknd_zknh_zvkng_zvksg

Perl information:

    /usr/bin/perl
    5.38.2 for x86_64-linux-thread-multi

Enabled features:

    afalgeng
    apps
    argon2
    aria
    asm
    async
    atexit
    autoalginit
    autoerrinit
    autoload-config
    bf
    blake2
    bulk
    cached-fetch
    camellia
    capieng
    cast
    chacha
    cmac
    cmp
    cms
    comp
    ct
    default-thread-pool
    deprecated
    des
    dgram
    dh
    docs
    dsa
    dso
    dtls
    dynamic-engine
    ec
    ec2m
    ecdh
    ecdsa
    ecx
    engine
    err
    filenames
    gost
    http
    idea
    legacy
    loadereng
    makedepend
    md4
    mdc2
    module
    multiblock
    nextprotoneg
    ocb
    ocsp
    padlockeng
    pic
    pinshared
    poly1305
    posix-io
    psk
    quic
    unstable-qlog
    rc2
    rc4
    rdrand
    rfc3779
    rmd160
    scrypt
    secure-memory
    seed
    shared
    siphash
    siv
    sm2
    sm2-precomp
    sm3
    sm4
    sock
    srp
    srtp
    sse2
    ssl
    ssl-trace
    static-engine
    stdio
    tests
    thread-pool
    threads
    tls
    ts
    ui-console
    whirlpool
    tls1
    tls1-method
    tls1_1
    tls1_1-method
    tls1_2
    tls1_2-method
    tls1_3
    dtls1
    dtls1-method
    dtls1_2
    dtls1_2-method

Disabled features:

    acvp-tests          [cascade]        OPENSSL_NO_ACVP_TESTS
    asan                [default]        OPENSSL_NO_ASAN
    brotli              [default]        OPENSSL_NO_BROTLI
    brotli-dynamic      [default]        OPENSSL_NO_BROTLI_DYNAMIC
    buildtest-c++       [default]        
    winstore            [not-windows]    OPENSSL_NO_WINSTORE
    crypto-mdebug       [default]        OPENSSL_NO_CRYPTO_MDEBUG
    devcryptoeng        [default]        OPENSSL_NO_DEVCRYPTOENG
    ec_nistp_64_gcc_128 [default]        OPENSSL_NO_EC_NISTP_64_GCC_128
    egd                 [default]        OPENSSL_NO_EGD
    external-tests      [default]        OPENSSL_NO_EXTERNAL_TESTS
    fips                [default]        
    fips-securitychecks [cascade]        OPENSSL_NO_FIPS_SECURITYCHECKS
    fuzz-afl            [default]        OPENSSL_NO_FUZZ_AFL
    fuzz-libfuzzer      [default]        OPENSSL_NO_FUZZ_LIBFUZZER
    ktls                [default]        OPENSSL_NO_KTLS
    md2                 [default]        OPENSSL_NO_MD2 (skip crypto/md2)
    msan                [default]        OPENSSL_NO_MSAN
    rc5                 [default]        OPENSSL_NO_RC5 (skip crypto/rc5)
    sctp                [default]        OPENSSL_NO_SCTP
    tfo                 [default]        OPENSSL_NO_TFO
    trace               [default]        OPENSSL_NO_TRACE
    ubsan               [default]        OPENSSL_NO_UBSAN
    unit-test           [default]        OPENSSL_NO_UNIT_TEST
    uplink              [no uplink_arch] OPENSSL_NO_UPLINK
    weak-ssl-ciphers    [default]        OPENSSL_NO_WEAK_SSL_CIPHERS
    zlib                [default]        OPENSSL_NO_ZLIB
    zlib-dynamic        [default]        OPENSSL_NO_ZLIB_DYNAMIC
    zstd                [default]        OPENSSL_NO_ZSTD
    zstd-dynamic        [default]        OPENSSL_NO_ZSTD_DYNAMIC
    ssl3                [default]        OPENSSL_NO_SSL3
    ssl3-method         [default]        OPENSSL_NO_SSL3_METHOD

Config target attributes:

    AR => "ar",
    ARFLAGS => "qc",
    CC => "gcc",
    CFLAGS => "-Wall -O3",
    CXX => "g++",
    CXXFLAGS => "-Wall -O3",
    HASHBANGPERL => "/usr/bin/env perl",
    RANLIB => "ranlib",
    RC => "windres",
    asm_arch => "riscv64",
    bn_ops => "SIXTY_FOUR_BIT_LONG RC4_CHAR",
    build_file => "Makefile",
    build_scheme => [ "unified", "unix" ],
    cflags => "-pthread",
    cppflags => "",
    cxxflags => "-std=c++11 -pthread",
    defines => [ "OPENSSL_BUILDING_OPENSSL" ],
    disable => [  ],
    dso_ldflags => "-Wl,-z,defs",
    dso_scheme => "dlfcn",
    enable => [ "afalgeng" ],
    ex_libs => "-ldl -pthread",
    includes => [  ],
    lflags => "",
    lib_cflags => "",
    lib_cppflags => "-DOPENSSL_USE_NODELETE",
    lib_defines => [  ],
    module_cflags => "-fPIC",
    module_cxxflags => undef,
    module_ldflags => "-Wl,-znodelete -shared -Wl,-Bsymbolic",
    perl_platform => "Unix",
    perlasm_scheme => "linux64",
    shared_cflag => "-fPIC",
    shared_defflag => "-Wl,--version-script=",
    shared_defines => [  ],
    shared_ldflag => "-Wl,-znodelete -shared -Wl,-Bsymbolic",
    shared_rcflag => "",
    shared_sonameflag => "-Wl,-soname=",
    shared_target => "linux-shared",
    thread_defines => [  ],
    thread_scheme => "pthreads",
    unistd => "<unistd.h>",

Recorded environment:

    AR = 
    BUILDFILE = 
    CC = 
    CFLAGS = 
    CPPFLAGS = 
    CROSS_COMPILE = 
    CXX = 
    CXXFLAGS = 
    HASHBANGPERL = 
    LDFLAGS = 
    LDLIBS = 
    OPENSSL_LOCAL_CONFIG_DIR = 
    PERL = 
    RANLIB = 
    RC = 
    RCFLAGS = 
    WINDRES = 
    __CNF_CFLAGS = 
    __CNF_CPPDEFINES = 
    __CNF_CPPFLAGS = 
    __CNF_CPPINCLUDES = 
    __CNF_CXXFLAGS = 
    __CNF_LDFLAGS = 
    __CNF_LDLIBS = 

Makevars:

    AR              = /opt/riscvx/bin/riscv64-unknown-linux-gnu-ar
    ARFLAGS         = qc
    ASFLAGS         = 
    CC              = /opt/riscvx/bin/riscv64-unknown-linux-gnu-gcc
    CFLAGS          = -Wall -O3 -march=rv64gcv_zkne_zknd_zknh_zvkng_zvksg
    CPPDEFINES      = 
    CPPFLAGS        = 
    CPPINCLUDES     = 
    CROSS_COMPILE   = /opt/riscvx/bin/riscv64-unknown-linux-gnu-
    CXX             = /opt/riscvx/bin/riscv64-unknown-linux-gnu-g++
    CXXFLAGS        = -Wall -O3 -march=rv64gcv_zkne_zknd_zknh_zvkng_zvksg
    HASHBANGPERL    = /usr/bin/env perl
    LDFLAGS         = 
    LDLIBS          = 
    PERL            = /usr/bin/perl
    RANLIB          = /opt/riscvx/bin/riscv64-unknown-linux-gnu-ranlib
    RC              = /opt/riscvx/bin/riscv64-unknown-linux-gnu-windres
    RCFLAGS         = 

NOTE: These variables only represent the configuration view.  The build file
template may have processed these variables further, please have a look at the
build file for more exact data:
    Makefile

build file:

    Makefile

build file templates:

    ../vendor/openssl/Configurations/common0.tmpl
    ../vendor/openssl/Configurations/unix-Makefile.tmpl
$ make
...
/opt/riscvx/lib/gcc/riscv64-unknown-linux-gnu/14.0.1/../../../../riscv64-unknown-linux-gnu/bin/ld: cannot find -ldl: No such file or directory

The error is in the linking phase: we did not provide the sysroot and library search paths needed by the cross-compiling linker.

A quick check shows that the generated object files include:

$  find . -name \*risc\*.o
./crypto/sm4/libcrypto-lib-sm4-riscv64-zvksed.o
./crypto/sm4/libcrypto-shlib-sm4-riscv64-zvksed.o
./crypto/aes/libcrypto-lib-aes-riscv64-zvkned.o
./crypto/aes/libcrypto-shlib-aes-riscv64-zvkned.o
./crypto/aes/libcrypto-shlib-aes-riscv64-zvbb-zvkg-zvkned.o
./crypto/aes/libcrypto-shlib-aes-riscv64-zkn.o
./crypto/aes/libcrypto-lib-aes-riscv64-zkn.o
./crypto/aes/libcrypto-shlib-aes-riscv64-zvkb-zvkned.o
./crypto/aes/libcrypto-shlib-aes-riscv64.o
./crypto/aes/libcrypto-lib-aes-riscv64-zvkb-zvkned.o
./crypto/aes/libcrypto-lib-aes-riscv64-zvbb-zvkg-zvkned.o
./crypto/aes/libcrypto-lib-aes-riscv64.o
./crypto/chacha/libcrypto-shlib-chacha_riscv.o
./crypto/chacha/libcrypto-lib-chacha_riscv.o
./crypto/chacha/libcrypto-lib-chacha-riscv64-zvkb.o
./crypto/chacha/libcrypto-shlib-chacha-riscv64-zvkb.o
./crypto/libcrypto-shlib-riscv64cpuid.o
./crypto/libcrypto-lib-riscv64cpuid.o
./crypto/sha/libcrypto-lib-sha_riscv.o
./crypto/sha/libcrypto-lib-sha256-riscv64-zvkb-zvknha_or_zvknhb.o
./crypto/sha/libcrypto-shlib-sha512-riscv64-zvkb-zvknhb.o
./crypto/sha/libcrypto-shlib-sha_riscv.o
./crypto/sha/libcrypto-shlib-sha256-riscv64-zvkb-zvknha_or_zvknhb.o
./crypto/sha/libcrypto-lib-sha512-riscv64-zvkb-zvknhb.o
./crypto/sm3/libcrypto-lib-sm3-riscv64-zvksh.o
./crypto/sm3/libcrypto-lib-sm3_riscv.o
./crypto/sm3/libcrypto-shlib-sm3-riscv64-zvksh.o
./crypto/sm3/libcrypto-shlib-sm3_riscv.o
./crypto/libcrypto-shlib-riscvcap.o
./crypto/modes/libcrypto-shlib-ghash-riscv64.o
./crypto/modes/libcrypto-shlib-ghash-riscv64-zvkg.o
./crypto/modes/libcrypto-lib-aes-gcm-riscv64-zvkb-zvkg-zvkned.o
./crypto/modes/libcrypto-lib-ghash-riscv64.o
./crypto/modes/libcrypto-shlib-ghash-riscv64-zvkb-zvbc.o
./crypto/modes/libcrypto-lib-ghash-riscv64-zvkg.o
./crypto/modes/libcrypto-shlib-aes-gcm-riscv64-zvkb-zvkg-zvkned.o
./crypto/modes/libcrypto-lib-ghash-riscv64-zvkb-zvbc.o
./crypto/libcrypto-lib-riscvcap.o

That suggests we need to cover more extensions:

  • zvbb
  • zvbc
  • zvkb
  • zvkg
  • zvkned

The openssl source code conditionally defines symbols like:

  • RISCV_HAS_V
  • RISCV_HAS_ZVBC
  • RISCV_HAS_ZVKB
  • RISCV_HAS_ZVKNHA
  • RISCV_HAS_ZVKNHB
  • RISCV_HAS_ZVKSH
  • RISCV_HAS_ZBKB
  • RISCV_HAS_ZBB
  • RISCV_HAS_ZBC
  • RISCV_HAS_ZKND
  • RISCV_HAS_ZKNE
  • RISCV_HAS_ZVKG - currently missing, or a union of zvkng and zvksg?
  • RISCV_HAS_ZVKNED - currently missing or a union of zvkned and zvksed?
  • RISCV_HAS_ZVKSED - currently missing, defined but unused?

These symbols are defined in crypto/riscvcap.c after analyzing the march string passed to the compiler.
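
For illustration only, here is a toy version of that string-to-symbol mapping. Openssl’s real logic lives in its Configure scripts and crypto/riscvcap.c; the names below are ours:

#include <cctype>
#include <iostream>
#include <sstream>
#include <string>

int main()
{
  // Toy mapping from a -march string to RISCV_HAS_* style symbols.
  // Skips the base "rv64gcv" token, which carries V itself.
  std::string march = "rv64gcv_zkne_zknd_zknh_zvkng_zvksg";
  std::istringstream ss(march.substr(march.find('_') + 1));
  std::string ext;
  while (std::getline(ss, ext, '_')) {
    for (char &c : ext)
      c = std::toupper(static_cast<unsigned char>(c));
    std::cout << "RISCV_HAS_" << ext << "\n";
  }
}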

So the next steps include:

  • Define LDFLAGS and LDLIBS to enable building a riscv-64 openssl.so.
  • Add additional march elements to generate as many ISA extension exemplars as we can.
  • Iterate on Ghidra sinc files to define any missing instructions.
  • Extend riscv-64 assembly samples to include all riscv-64 ISA extensions appearing in openssl source.
  • Verify that we have acceptable pcode opcodes for all riscv-64 ISA extensions appearing in openssl source.
Rerun Configure with the extended march string:

/home2/build_openssl$ ../vendor/openssl/Configure linux64-riscv64 --cross-compile-prefix=/opt/riscvx/bin/riscv64-unknown-linux-gnu- -march=rv64gcv_zkne_zknd_zknh_zvkng_zvbb_zvbc_zvkb_zvkg_zvkned_zvksg

Patch the generated Makefile to:

< CNF_EX_LIBS=-ldl -pthread
---
> CNF_EX_LIBS=/opt/riscvx/lib/libdl.a -pthread
$ make

Then:

  • Open libcrypto.so.3 and libssl.so.3 in Ghidra.
  • Analyze each binary and open the Bookmarks window.
  • Verify - in the Bookmarks window - that all instructions were disassembled and that no instructions lack pcode.

Integration testing (manual)

Disassembly testing against binutils reference dumps can follow these steps:

  • Open libcrypto.so.3 in Ghidra
  • export as ASCII to /tmp/libcrypto.so.3.txt
  • export as C/C++ to /tmp/libcrypto.so.3.c
  • generate reference disassembly via
    • /opt/riscvx/bin/riscv64-unknown-linux-gnu-objdump -j .text -D libcrypto.so.3 > libcrypto.so.3_ref.txt
  • grep both /tmp/libcrypto.so.3.txt and libcrypto.so.3_ref.txt for vset instructions, comparing operands
  • optionally parse the vector instructions out of both files and compare decodings (see the sketch below)
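
A sketch of that optional comparison step follows. The file paths come from the steps above; the regex and the simple positional pairing are simplifications:

#include <fstream>
#include <iostream>
#include <regex>
#include <string>
#include <vector>

// Collect the text of vset* instructions, in order of appearance.
static std::vector<std::string> collect_vsets(const std::string &path)
{
  std::ifstream in(path);
  std::vector<std::string> found;
  std::string line;
  std::regex pat(R"(vset\w*\s+\S.*)");  // e.g. "vsetivli zero,4,e32,m1,ta,ma"
  std::smatch m;
  while (std::getline(in, line))
    if (std::regex_search(line, m, pat))
      found.push_back(m.str(0));
  return found;
}

int main()
{
  auto ghidra = collect_vsets("/tmp/libcrypto.so.3.txt");
  auto ref = collect_vsets("libcrypto.so.3_ref.txt");
  if (ghidra.size() != ref.size())
    std::cout << "count mismatch: " << ghidra.size()
              << " vs " << ref.size() << "\n";
  for (size_t i = 0; i < ghidra.size() && i < ref.size(); i++)
    if (ghidra[i] != ref[i])
      std::cout << i << ": " << ghidra[i] << " != " << ref[i] << "\n";
}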

inspect extension management

How does Openssl manage RISCV ISA extensions? We’ll use the gcm_ghash family of functions as an example.

  • At compile time any march=rv64gcv_z... arguments are processed by the Openssl configuration tool and turned into #ifdef variables. These can include combinations like RISCV_HAS_ZVKB_AND_ZVKSED. Multiple versions of key routines are compiled, each with different required extensions.
  • The compiler can also use any of the bit manipulation and vector extensions in local optimization.
  • At runtime the library queries the underlying system to see which extensions are supported. The function gcm_get_funcs returns the preferred set of implementing functions (a simplified dispatch sketch follows this list). The gcm_ghash set can include:
    • gcm_ghash_4bit
    • gcm_ghash_rv64i_zvkb_zvbc
    • gcm_ghash_rv64i_zvkg
    • gcm_ghash_rv64i_zbc
    • gcm_ghash_rv64i_zbc__zbkb
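
A minimal sketch of that runtime dispatch shape, standing in for part of gcm_get_funcs. The function names come from the list above; the capability flags and stub bodies are placeholders, not Openssl’s actual code:

#include <cstddef>
#include <cstdint>

// Placeholder capability flags; openssl derives the real values at
// runtime in crypto/riscvcap.c.
static bool RISCV_HAS_ZVKG = false;
static bool RISCV_HAS_ZVKB = false;
static bool RISCV_HAS_ZVBC = false;

typedef void (*ghash_fn)(uint64_t Xi[2], const uint8_t *inp, size_t len);

// Stub bodies so the sketch compiles; the real routines also take a
// precomputed Htable argument.
static void gcm_ghash_4bit(uint64_t[2], const uint8_t *, size_t) {}
static void gcm_ghash_rv64i_zvkg(uint64_t[2], const uint8_t *, size_t) {}
static void gcm_ghash_rv64i_zvkb_zvbc(uint64_t[2], const uint8_t *, size_t) {}

// Return the best gcm_ghash implementation the running CPU supports.
ghash_fn gcm_get_ghash()
{
  if (RISCV_HAS_ZVKG)
    return gcm_ghash_rv64i_zvkg;       // vector-crypto GHASH, fewest instructions
  if (RISCV_HAS_ZVKB && RISCV_HAS_ZVBC)
    return gcm_ghash_rv64i_zvkb_zvbc;  // vector bitmanip + carryless multiply
  return gcm_ghash_4bit;               // portable table-driven default
}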

gcm_ghash_4bit is the default version, with 412 instructions, of which 11 are vector instructions inserted by the compiler.

gcm_ghash_rv64i_zvkg is the most advanced version, with only 32 instructions. Ghidra decompiles it as:

void gcm_ghash_rv64i_zvkg(undefined8 param_1,undefined8 param_2,long param_3,long param_4)
{
  undefined auVar1 [256];
  undefined auVar2 [256];
  vsetivli_e32m1tumu(4);
  auVar1 = vle32_v(param_2);
  vle32_v(param_1);
  do {
    auVar2 = vle32_v(param_3);
    param_3 = param_3 + 0x10;
    param_4 = param_4 + -0x10;
    auVar2 = vghsh_vv(auVar1,auVar2);
  } while (param_4 != 0);
  vse32_v(auVar2,param_1);
  return;
}

That reveals an error in our sinc files: several instructions use the vd register as both an input and an output, so our pcode semantics need updating. Make that change and inspect the Ghidra output again:

void gcm_ghash_rv64i_zvkg(undefined8 param_1,undefined8 param_2,long param_3,long param_4)
{
  undefined auVar1 [256];
  undefined auVar2 [256];
  undefined auVar3 [256];
  vsetivli_e32m1tumu(4);
  auVar2 = vle32_v(param_2);
  auVar1 = vle32_v(param_1);
  do {
    auVar3 = vle32_v(param_3);
    param_3 = param_3 + 0x10;
    param_4 = param_4 + -0x10;
    auVar1 = vghsh_vv(auVar1,auVar2,auVar3);
  } while (param_4 != 0);
  vse32_v(auVar1,param_1);
  return;
}

That’s better - commit and push.

8.7 - Building a gcc-14 toolchain

Building a new toolchain can be messy.

A C or C++ toolchain needs at least these components:

  • kernel - to supply key header files and loader dependencies
  • binutils - to supply assembler and linker
  • gcc - to supply the compiler and compiler dependencies
  • glibc - to supply key libraries and header files
  • sysroot - a directory containing the libraries and resources expected for the root of the target system

These components have cross-dependencies. A full gcc build needs libc.so from glibc. A full glibc build needs libgcc from gcc. There are different ways to handle these cross-dependencies, such as splitting the gcc build into two phases or prepopulating the build directories with ‘close-enough’ files from a previous build.

The sysroot component is the trickiest to handle, since gcc and glibc need to pull files from the sysroot as they update files within sysroot. You can generally start with a bootstrap sysroot, say from a previous toolchain, then update it with the latest binutils, gcc, and glibc.

Start with released tarballs for gcc and glibc. We’ll use the development tip of binutils for this pass.

Copy kernel header files into /opt/riscv/sysroot/usr/include/.

Configure and install binutils:

$ /home2/vendor/binutils-gdb/configure --prefix=/opt/riscv/sysroot --with-sysroot=/opt/riscv/sysroot --target=riscv64-unknown-linux-gnu
$ make -j4
$ make install

Configure and install minimal gcc:

$ /home2/vendor/gcc-14.1.0/configure --prefix=/opt/riscv --enable-languages=c,c++ --disable-multilib --target=riscv64-unknown-linux-gnu --with-sysroot=/opt/riscv/sysroot
$ make all-gcc
$ make install-gcc

Configure and install glibc:

$ ../../vendor/glibc-2.39/configure --host=riscv64-unknown-linux-gnu --target=riscv64-unknown-linux-gnu --prefix=/opt/riscv --disable-werror --enable-shared --disable-multilib --with-headers=/opt/riscv/sysroot/usr/include
$ make install-bootstrap-headers=yes install_root=/opt/riscv/sysroot install-headers

Cleaning sysroot of bootstrap artifacts

How do we replace any older sysroot bootstrap files with their freshly built versions? The most common problems involve libgcc*, libc*, and crt* files. The bootstrap sysroot needs these files to exist. The toolchain build process should replace them, but it may not replace all instances of these files.

Let’s scrub the libgcc files, comparing the gcc directory in which they are built with the sysroot directories in which they will be saved.

$ B=/home2/build_riscv/gcc
$ S=/opt/riscv/sysroot
$ find $B $S -name libgcc_s.so -ls
 57940911      4 -rw-r--r--   1 ____     ____          132 May 10 12:28 /home2/build_riscv/gcc/gcc/libgcc_s.so
 57940908      4 -rw-r--r--   1 ____     ____          132 May 10 12:28 /home2/build_riscv/gcc/riscv64-unknown-linux-gnu/libgcc/libgcc_s.so
 14361792      4 -rw-r--r--   1 ____     ____          132 May 10 12:32 /opt/riscv/sysroot/riscv64-unknown-linux-gnu/lib/libgcc_s.so
 14351655      4 -rw-r--r--   1 ____     ____          132 May 10 08:52 /opt/riscv/sysroot/lib/libgcc_s.so
 $ diff /opt/riscv/sysroot/lib/libgcc_s.so /opt/riscv/sysroot/riscv64-unknown-linux-gnu/lib/libgcc_s.so
 $ cat /opt/riscv/sysroot/lib/libgcc_s.so
/* GNU ld script
   Use the shared library, but some functions are only in
   the static library.  */
GROUP ( libgcc_s.so.1 -lgcc )
$ 
  • /opt/riscv/sysroot/lib/libgcc_s.so is our bootstrap input
  • /home2/build_riscv/gcc/gcc/libgcc_s.so and /home2/build_riscv/gcc/riscv64-unknown-linux-gnu/libgcc/libgcc_s.so are the generated outputs
  • the bootstrap input is identical to the generated output
  • neither the input nor the output contains absolute paths

Now check libgcc_s.so.1 for staleness:

$ find $B $S -name libgcc_s.so.1 -ls
 57940910    700 -rw-r--r--   1 ____     ____       713128 May 10 12:28 /home2/build_riscv/gcc/gcc/libgcc_s.so.1
 57946454    700 -rwxr-xr-x   1 ____     ____       713128 May 10 12:28 /home2/build_riscv/gcc/riscv64-unknown-linux-gnu/libgcc/libgcc_s.so.1
 14361791    700 -rw-r--r--   1 ____     ____       713128 May 10 12:32 /opt/riscv/sysroot/riscv64-unknown-linux-gnu/lib/libgcc_s.so.1
 14351656    696 -rw-r--r--   1 ____     ____       708624 May 10 08:53 /opt/riscv/sysroot/lib/libgcc_s.so.1

That looks like a potential problem: the bootstrap copy in /opt/riscv/sysroot/lib is older and smaller than the freshly generated files. We need to fix that:

$ rm /opt/riscv/sysroot/lib/libgcc_s.so.1
$ ln /opt/riscv/sysroot/riscv64-unknown-linux-gnu/lib/libgcc_s.so.1 /opt/riscv/sysroot/lib/libgcc_s.so.1

Next check the crt* files:

$ find $B $S -name crt\*.o -ls
 57940817      8 -rw-r--r--   1 ____     ____         4248 May 10 12:28 /home2/build_riscv/gcc/gcc/crtbeginS.o
 57940826      4 -rw-r--r--   1 ____     ____          848 May 10 12:28 /home2/build_riscv/gcc/gcc/crtn.o
 57940824      4 -rw-r--r--   1 ____     ____          848 May 10 12:28 /home2/build_riscv/gcc/gcc/crti.o
 57940827      8 -rw-r--r--   1 ____     ____         4712 May 10 12:28 /home2/build_riscv/gcc/gcc/crtbeginT.o
 57940822      4 -rw-r--r--   1 ____     ____         1384 May 10 12:28 /home2/build_riscv/gcc/gcc/crtendS.o
 57940823      4 -rw-r--r--   1 ____     ____         1384 May 10 12:28 /home2/build_riscv/gcc/gcc/crtend.o
 57940815      4 -rw-r--r--   1 ____     ____         3640 May 10 12:28 /home2/build_riscv/gcc/gcc/crtbegin.o
 57940800      8 -rw-r--r--   1 ____     ____         4248 May  9 16:00 /home2/build_riscv/gcc/riscv64-unknown-linux-gnu/libgcc/crtbeginS.o
 57940808      4 -rw-r--r--   1 ____     ____          848 May  9 16:00 /home2/build_riscv/gcc/riscv64-unknown-linux-gnu/libgcc/crtn.o
 57940806      4 -rw-r--r--   1 ____     ____          848 May  9 16:00 /home2/build_riscv/gcc/riscv64-unknown-linux-gnu/libgcc/crti.o
 57940803      8 -rw-r--r--   1 ____     ____         4712 May  9 16:00 /home2/build_riscv/gcc/riscv64-unknown-linux-gnu/libgcc/crtbeginT.o
 57940812      4 -rw-r--r--   1 ____     ____         1384 May  9 16:00 /home2/build_riscv/gcc/riscv64-unknown-linux-gnu/libgcc/crtendS.o
 57940804      4 -rw-r--r--   1 ____     ____         1384 May  9 16:00 /home2/build_riscv/gcc/riscv64-unknown-linux-gnu/libgcc/crtend.o
 57940798      4 -rw-r--r--   1 ____     ____         3640 May  9 16:00 /home2/build_riscv/gcc/riscv64-unknown-linux-gnu/libgcc/crtbegin.o
 14351609     16 -rw-r--r--   1 ____     ____        13848 May 10 08:48 /opt/riscv/sysroot/usr/lib/crt1.o
 14351614      4 -rw-r--r--   1 ____     ____          952 May 10 08:48 /opt/riscv/sysroot/usr/lib/crti.o
 14351623      4 -rw-r--r--   1 ____     ____          952 May 10 08:49 /opt/riscv/sysroot/usr/lib/crtn.o
 14361798      8 -rw-r--r--   1 ____     ____         4248 May 10 12:32 /opt/riscv/sysroot/lib/gcc/riscv64-unknown-linux-gnu/14.1.0/crtbeginS.o
 14361802      4 -rw-r--r--   1 ____     ____         3640 May 10 12:32 /opt/riscv/sysroot/lib/gcc/riscv64-unknown-linux-gnu/14.1.0/crtbegin.o
 14361803      4 -rw-r--r--   1 ____     ____         1384 May 10 12:32 /opt/riscv/sysroot/lib/gcc/riscv64-unknown-linux-gnu/14.1.0/crtend.o
 14361804      4 -rw-r--r--   1 ____     ____          848 May 10 12:32 /opt/riscv/sysroot/lib/gcc/riscv64-unknown-linux-gnu/14.1.0/crti.o
 14361805      4 -rw-r--r--   1 ____     ____          848 May 10 12:32 /opt/riscv/sysroot/lib/gcc/riscv64-unknown-linux-gnu/14.1.0/crtn.o
 14361806      4 -rw-r--r--   1 ____     ____         1384 May 10 12:32 /opt/riscv/sysroot/lib/gcc/riscv64-unknown-linux-gnu/14.1.0/crtendS.o
 14361807      8 -rw-r--r--   1 ____     ____         4712 May 10 12:32 /opt/riscv/sysroot/lib/gcc/riscv64-unknown-linux-gnu/14.1.0/crtbeginT.o

The files in /opt/riscv/sysroot/usr/lib are likely the bootstrap files. The sysroot files are otherwise identical to the build files, with these exceptions:

  • crt1.o is not generated by the gcc build process. It is provided by glibc (built from its csu sources), not by the kernel.
  • The crti.o and crtn.o bootstrap files differ from the generated files. If we want to use this updated sysroot to build a 14.2.0 toolchain, we probably want the newer versions.

So replace the bootstrap /opt/riscv/sysroot/usr/lib/crt*.o with hard links to the generated files.