Exemplars

List the current importable and buildable exemplars, their origins, and the Ghidra features they are intended to validate or stress.

Overview

Exemplars suitable for Ghidra import are generally collected by platform architecture, such as riscv64/exemplars or x86_64/exemplars. Some are imported from system disk images. Others are locally built from small source code files and an appropriate compiler toolchain. The initial scope includes Linux-capable RISCV 64 bit systems that might be found in network appliances or ML inference engines. That makes for a local bias towards privileged code, concurrency management, and performance optimization. That scope expands slightly to x86_64 exemplars that may help triage issues that show up first in RISCV 64 exemplars.

The following set of exemplars gives decent coverage:

from a general purpose RISCV-64 disk image:
- kernel - a RISCV-64 kernel built with a recent Linux release
- kernel load module - an ELF binary intended to be loaded into a running kernel
- system library - libc.so or libssl.so copied from a generic Linux disk image
- system application - a user application linking against system libraries and running over a Linux kernel
built from source, with the development tip of the gcc toolchain and many explicit ISA extensions:
- binutils assembly test suite - ISA extensions usually show up here first, along with preferred disassembly patterns
- memcpy and other libc replacements coded with RISCV-64 ISA intrinsic extensions
- libssl.so and libcrypt.so built from source and configured for all standard and frozen crypto, vector, and bit manipulation instruction extensions.
- DPDK network appliance source code, l3fwd and l2fwd.
- a custom crosscompiled kernel, with ISA extensions enabled

In general, visual inspection of these exemplars after importing into Ghidra should show:

no failed constructors, so all instructions are recognized by Ghidra during disassembly
no missing pcode
all vector vset* instructions are unwrapped to show selected element width, multiplier, tail and mask handling

Ghidra will in a few cases disassemble an instruction differently than binutils’ objdump. That’s fine, if it is due to a limitation of Ghidra’s SLEIGH language. If alignment to objdump is possible, that’s preferable.

Imported exemplars

Most of the imported large binary exemplars are broken out of available Fedora disk images. The top level acquireExternalExemplars.py script controls this process, sometimes with some manual intervention to handle image mounting. Selection of the imported disk image is controlled with text like:

LOGLEVEL = logging.WARN
FEDORA_RISCV_SITE = "http://fedora.riscv.rocks/kojifiles/work/tasks/6900/1466900"
FEDORA_RISCV_IMAGE = "Fedora-Developer-39-20230927.n.0-sda.raw"
FEDORA_KERNEL = "vmlinuz-6.5.4-300.0.riscv64.fc39.riscv64"
FEDORA_KERNEL_OFFSET = 40056
FEDORA_KERNEL_DECOMPRESSED = "vmlinux-6.5.4-300.0.riscv64.fc39.riscv64"
FEDORA_SYSMAP = "System.map-6.5.4-300.0.riscv64.fc39.riscv64"

Fedora kernel

Warning: the cited Fedora disk image may no longer be maintained. If so, we will replace it with a custom cross-compiled kernel tuned for a hypothetical network appliance.

This exemplar kernel is not an ELF file, so the import and analysis process will need some manual help.

The import process explicitly sets the processor on the command line: -processor RISCV:LE:64:RV64IC. This will likely be the same as the processor determined from imported kernel load modules.
Ghidra recognizes three sections, one text and two data. All three need to be moved to the offset suggested in the associated System.map file. For example, .text moves from 0x1000 to 0x80001000. Test this by verifying function start addresses identified in System.map look like actual RISCV-64 kernel functions. Most begin with 16 bytes of no-op instructions to support debugging and tracing operations.
Mark .text as code by selecting from 0x80001000 to 0x80dfffff and hitting the D key.

Verification

Verify that kernel code correctly references data:

locate the address of panic in System.map: ffffffff80b6b188
go to 0x80b6b188 in Ghidra and verify that this is a function
display references to panic and examine the decompiler window.

 /* WARNING: Subroutine does not return */
  panic(s_Fatal_exception_in_interrupt_813f84f8);
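The System.map lookup in the verification steps above can be scripted. This is a minimal sketch, assuming the usual three-column System.map layout (address, symbol type, name) and assuming, as in the panic example, that the Ghidra address is the low 32 bits of the kernel virtual address. The helper names and the symbol type letter in the sample line are illustrative, not taken from the project scripts.

```python
def parse_system_map(lines):
    """Map symbol name -> kernel virtual address from System.map-style lines."""
    symbols = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        address, _symbol_type, name = parts[0], parts[1], parts[2]
        symbols[name] = int(address, 16)
    return symbols


def to_ghidra_address(kernel_va):
    """Keep the low 32 bits, matching ffffffff80b6b188 -> 0x80b6b188 above."""
    return kernel_va & 0xFFFFFFFF


# The type letter 'T' is an assumption; only the address and the name
# come from the text above.
sample = ["ffffffff80b6b188 T panic"]
symbols = parse_system_map(sample)
print(hex(to_ghidra_address(symbols["panic"])))  # 0x80b6b188
```

The same translation is consistent with the .text move from 0x1000 to 0x80001000, so a list of function symbols could be spot-checked against Ghidra's function starts in one pass.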

Notes

This kernel includes 149 strings containing sifive, most of which appear in System.map. It’s not immediately clear whether these indicate kernel mods by SiFive or a SiFive SDK kernel module compiled into the kernel.

The kernel currently includes a few RISCV instruction set extensions not handled by Ghidra, and possibly not even by binutils and the gas RISCV assembler. Current Linux kernels can bypass the standard assembler to insert custom or obscure privileged instructions.

This Linux kernel explicitly includes ISA extension code for processors that support those extensions. For example, if the kernel boots up on a processor supporting the _zbb bit manipulation instruction extensions, then the vanilla strlen, strcmp, and strncmp kernel functions are patched out to invoke strlen_zbb, strcmp_zbb, and strncmp_zbb respectively.

This kernel can support up to 64 discrete ISA extensions, of which about 30 are currently defined. It has some support for hybrid processors, where each of the hardware threads (aka ‘harts’) can support a different mix of ISA extensions.

Note: The combination of instruction set extensions and self-modifying privileged code makes for a fertile ground for Ghidra research. We can expect vector variants of memcpy inline expansion sometime in 2024, significantly complicating cyberanalysis of even the simplest programs.

Fedora kernel modules

Kernel modules are typically ELF files compiled as Position Independent Code, often using more varied ELF relocation types for dynamically loading and linking into kernel memory space. This study looks at the igc.ko kernel module for a type of Intel network interface device. Network device drivers can have some of the most time-critical and race-condition-rich behavior, making this class of driver a good exemplar.

RISCV relocation types found in this exemplar include:

R_RISCV_64(2), R_RISCV_BRANCH(16), R_RISCV_JAL(17), R_RISCV_CALL(18), R_RISCV_PCREL_HI20(23), R_RISCV_PCREL_LO12_I(24), R_RISCV_ADD32(35), R_RISCV_ADD64(36), R_RISCV_SUB32(39), R_RISCV_SUB64(40), R_RISCV_RVC_BRANCH(44), and R_RISCV_RVC_JUMP(45)
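When checking a relocation table, it can help to have this list in machine-readable form. The sketch below records the numbers and names exactly as listed above and tallies a sequence of relocation type numbers. The function name and the sample input are hypothetical, and how the type numbers are extracted from the module is left open.

```python
from collections import Counter

# Relocation numbers and names as listed above for igc.ko.
IGC_RELOCATION_TYPES = {
    2: "R_RISCV_64",
    16: "R_RISCV_BRANCH",
    17: "R_RISCV_JAL",
    18: "R_RISCV_CALL",
    23: "R_RISCV_PCREL_HI20",
    24: "R_RISCV_PCREL_LO12_I",
    35: "R_RISCV_ADD32",
    36: "R_RISCV_ADD64",
    39: "R_RISCV_SUB32",
    40: "R_RISCV_SUB64",
    44: "R_RISCV_RVC_BRANCH",
    45: "R_RISCV_RVC_JUMP",
}


def summarize_relocations(type_numbers):
    """Count relocations by name, flagging numbers outside the known list."""
    counts = Counter()
    for number in type_numbers:
        name = IGC_RELOCATION_TYPES.get(number, f"UNKNOWN({number})")
        counts[name] += 1
    return counts


# Hypothetical input: type numbers pulled from a relocation dump.
observed = [18, 18, 23, 24, 2, 99]
for name, count in sorted(summarize_relocations(observed).items()):
    print(f"{name}: {count}")
```

Any UNKNOWN entry would point at a relocation type that this exemplar has not exercised before.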

Verification

Open Ghidra’s Relocation Table window and verify that all relocations were applied.

Go to igc_poll, open a decompiler window, and export the function as igc_poll.c. Compare this file with the provided igc_poll_decompiled.c in the visual difftool of your choice (e.g. meld) and check for the presence of lines like:

netdev_printk(&_LC7,*(undefined8 *)(lVar33 + 8),"Unknown Tx buffer type\n");

This statement generates - and provides tests for - at least four relocation types.

Notes

The decompiler translates all fence instructions as fence(). This kernel module uses 8 distinct fence instructions to request memory barriers. The SLEIGH files should probably be extended to show either fence(1,5) or the Linux macro names given in linux/arch/riscv/include/asm/barrier.h.
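To illustrate what a more descriptive rendering could carry, here is a small sketch that decodes the predecessor and successor sets from a 32 bit FENCE instruction word, following the base ISA encoding (pred in bits 27:24, succ in bits 23:20, each ordered i, o, r, w from the high bit). The function names are hypothetical, and this is not how Ghidra's SLEIGH definitions are written.

```python
FENCE_OPCODE = 0x0F  # MISC-MEM major opcode


def access_set(bits):
    """Render a 4-bit predecessor or successor field as a subset of 'iorw'."""
    names = "iorw"  # bit 3 = i, bit 2 = o, bit 1 = r, bit 0 = w
    return "".join(n for i, n in enumerate(names) if bits & (8 >> i))


def decode_fence(word):
    """Return (pred, succ) for a FENCE word, or None for anything else.

    The fm field in bits 31:28 is ignored in this sketch.
    """
    opcode = word & 0x7F
    funct3 = (word >> 12) & 0x7
    if opcode != FENCE_OPCODE or funct3 != 0:
        return None
    pred = (word >> 24) & 0xF
    succ = (word >> 20) & 0xF
    return access_set(pred), access_set(succ)


print(decode_fence(0x0330000F))  # ('rw', 'rw')
print(decode_fence(0x0FF0000F))  # ('iorw', 'iorw')
```

A decompiler rendering built on these two fields would let the analyst tell a read barrier from a full I/O barrier at a glance.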

Fedora system libraries

System libraries like libc.so and libssl.so typically link to versioned shared object libraries like libc.so.6 and libssl.so.3.0.5. Ghidra imports RISCV system libraries well.

Relocation types observed include:

R_RISCV_64(2), R_RISCV_RELATIVE(3), R_RISCV_JUMP_SLOT(5), and R_RISCV_TLS_TPREL64(11)

R_RISCV_TLS_TPREL64 is currently unsupported by Ghidra, appearing in the libc.so.6 .got section about 15 times. This relocation type does not appear in libssl.so.3.0.5. It appears in multithreaded applications that use thread-local storage.

Fedora system executables

The ssh utility imports cleanly into Ghidra.

Relocation types observed include:

R_RISCV_64(2), R_RISCV_RELATIVE(3), R_RISCV_JUMP_SLOT(5)

Function thunks referencing external library functions do not automatically get the name of the external function propagated into the name of the thunk.

Locally built exemplars

Imported binaries are generally locked into a single platform and a single toolchain. The imported binaries above are built for a SiFive development board, a 64 bit RISCV processor with support for Integer and Compressed instruction sets, and a gcc-13 toolchain. If we want some variation on that, say to look ahead at challenges a gcc-14 toolchain might throw our way, we need to build our own exemplars.

Open source test suites can be a good source for feature-focused importable exemplars. If we want to test Ghidra’s ability to import RISCV instruction set extensions, we want to import many of the files from binutils-gdb/gas/testsuite/gas/riscv.

For example, most of the ratified set of RISCV vector instructions are used in vector-insns.s. If we assemble this with a gas assembler compatible with the -march=rv32ifv architecture we get an importable binary exemplar for those instructions. Even better, we can disassemble that exemplar with a compatible objdump and get the reference disassembly to compare against Ghidra’s disassembly. This gives us three kinds of insights into Ghidra’s import capabilities:

When new instructions appear in the binutils gas main branch, they are good candidates for implementation in Ghidra within the next 12 months. This currently includes approved vector, bit manipulation, cache management, and crypto extensions plus about a dozen vendor-specific extensions from Alibaba’s T-Head RISCV server initiative.
These exemplars drive extension of Ghidra’s RISCV SLEIGH files, both as new instruction definitions and as pcode semantics for display in the decompiler window.
Disassembly of those exemplars with a current binutils objdump utility gives us a reference disassembly to compare with Ghidra’s. We can minimize arbitrary or erroneous Ghidra disassembly by comparing the two disassembler views. Ghidra and objdump have different goals, so we don’t need strict alignment of Ghidra with objdump.

Most exemplars appear as four related files. We can use the vector exemplar as an example.

The source file is riscv64/generated/assemblySamples/vector.S, copied from binutils-gdb/gas/testsuite/gas/riscv/vector-insns.s.
vector.S is assembled into riscv64/exemplars/vector.o
That assembly run generates the assembly output listing riscv64/exemplars/vector.log.
riscv64/exemplars/vector.o is finally processed by binutils objdump to generate the reference disassembly riscv64/exemplars/vector.objdump.

The riscv64/exemplars/vector.o is then imported into the Ghidra exemplars project, where we can evaluate the import and disassembly results.
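The four-file convention can be written down as a small helper, which is handy when checking that an exemplar set is complete. This is only a sketch of the naming pattern described above; the function name and dictionary keys are hypothetical and do not come from the project's build scripts.

```python
from pathlib import PurePosixPath


def exemplar_files(name, arch_dir="riscv64"):
    """Return the four related paths for an assembly exemplar such as 'vector'."""
    root = PurePosixPath(arch_dir)
    return {
        "source": root / "generated" / "assemblySamples" / f"{name}.S",
        "object": root / "exemplars" / f"{name}.o",
        "listing": root / "exemplars" / f"{name}.log",
        "objdump": root / "exemplars" / f"{name}.objdump",
    }


for role, path in exemplar_files("vector").items():
    print(f"{role:8} {path}")
```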

Assembly language exemplars usually don’t have any sensible decompilation. C or C++ language exemplars usually do, so that gives the test analyst more to work with.

Another example shows Ghidra’s difficulty with vector optimized code. Compile this C code for the rv64gcv architecture (RISCV-64 with vector extensions), using the gcc-14 compiler suite released in May of 2024.

#include <stdio.h>
int main(int argc, char** argv){
    const int N = 1320;
    char s[N];
    for (int i = 0; i < N - 1; ++i)
        s[i] = i + 1;
    s[N - 1] = '\0';
    printf(s);
}

Ghidra’s 11.0 release decompiles this into:

/* WARNING: Control flow encountered unimplemented instructions */

void main(void)

{
  gp = &__global_pointer$;
                    /* WARNING: Unimplemented instruction - Truncating control flow here */
  halt_unimplemented();
}

Try the import again with the isa_ext experimental branch of Ghidra:

undefined8 main(void)

{
  undefined auVar1 [64];
  undefined8 uVar2;
  undefined (*pauVar3) [64];
  long lVar4;
  long lVar5;
  undefined auVar6 [256];
  undefined auVar7 [256];
  char local_540 [1319];
  undefined uStack_19;
  
  gp = &__global_pointer$;
  pauVar3 = (undefined (*) [64])local_540;
  lVar4 = 0x527;
  vsetvli_e32m1tama(0);
  auVar7 = vid_v();
  do {
    lVar5 = vsetvli(lVar4,0xcf);
    auVar6 = vmv1r_v(auVar7);
    lVar4 = lVar4 - lVar5;
    auVar6 = vncvt_xxw(auVar6);
    vsetvli(0,0xc6);
    auVar6 = vncvt_xxw(auVar6);
    auVar6 = vadd_vi(auVar6,1);
    auVar1 = vse8_v(auVar6);
    *pauVar3 = auVar1;
    uVar2 = vsetvli_e32m1tama(0);
    pauVar3 = (undefined (*) [64])(*pauVar3 + lVar5);
    auVar6 = vmv_v_x(lVar5);
    auVar7 = vadd_vv(auVar7,auVar6);
  } while (lVar4 != 0);
  uStack_19 = 0;
  printf(local_540,uVar2);
  return 0;
}

That Ghidra branch decompiles, but the decompilation listing only resembles the C source code if you are familiar with RISCV vector extension instructions.
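For readers less familiar with the vector extension, the loop structure in that listing can be modeled in a few lines. The sketch below is a simplified model, not an emulator: it assumes a vector register length of 128 bits by default, assumes vsetvli grants min(remaining, VLMAX) elements (one behavior the specification permits), and treats the two vncvt steps as a plain truncation from 32 bits to 8 bits. The function names are hypothetical.

```python
def scalar_fill(n):
    """The C loop: s[i] = i + 1 for i < n - 1, then a terminating zero."""
    s = bytearray(n)
    for i in range(n - 1):
        s[i] = (i + 1) & 0xFF  # truncation to char
    s[n - 1] = 0
    return bytes(s)


def vector_fill(n, vlen_bits=128):
    """Strip-mined model of the decompiled vector loop."""
    vlmax = vlen_bits // 32            # elements per register at e32, m1
    s = bytearray(n)
    indices = list(range(vlmax))       # vid.v: 0, 1, 2, ...
    remaining = n - 1                  # lVar4 = 0x527 when n == 1320
    offset = 0                         # pauVar3, as an index into s
    while remaining != 0:
        vl = min(remaining, vlmax)     # vsetvli: elements handled this pass
        narrowed = [x & 0xFF for x in indices[:vl]]   # two vncvt.x.x.w steps
        values = [(x + 1) & 0xFF for x in narrowed]   # vadd.vi with immediate 1
        s[offset:offset + vl] = bytes(values)         # vse8.v
        offset += vl
        remaining -= vl
        indices = [x + vl for x in indices]           # vmv.v.x then vadd.vv
    s[n - 1] = 0
    return bytes(s)


assert vector_fill(1320) == scalar_fill(1320)
```

The model makes the correspondence explicit: the vector loop keeps a register of running indices and advances it by the granted vector length each pass, where the scalar loop increments a single counter.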

Repeat the example, this time building with a gcc-13 compiler suite. Ghidra 11.0 does a fine job of decompiling this.

undefined8 main(void)
{
  long lVar1;
  char acStack_541 [1320];
  undefined uStack_19;
  gp = &__global_pointer$;
  lVar1 = 1;
  do {
    acStack_541[lVar1] = (char)lVar1;
    lVar1 = lVar1 + 1;
  } while (lVar1 != 0x528);
  uStack_19 = 0;
  printf(acStack_541 + 1);
  return 0;
}

Custom Linux kernel and kernel mods

The Fedora 39 disk image is a good exemplar of endpoint system code. We can supplement that with a custom kernel build. This gives us more flexibility and a peek into future system builds.

Building a custom kernel - with standard kernel modules - requires steps like these:

Download the linux kernel source from https://github.com/torvalds/linux.git
- This example currently uses the kernel development tip shortly after version 6.9 RC2

Generate a new .config kernel configuration file with a command like:

$ PATH=$PATH:/opt/riscvx/bin
$ make ARCH=riscv CROSS_COMPILE=riscv64-unknown-linux-gnu- MY_CFLAGS='-march=rv64gcv_zba_zbb_zbc_zbkb_zbkc_zbkx_zvbb_zvbc' menuconfig

In the menuconfig view select the architecture-specific features we want to examine. This will likely include platform selections like Vector extension support and Zbb extension support. It may also include Cryptographic API selections like Accelerated Cryptographic Algorithms for CPU (riscv).

Build the kernel and selected kernel modules with a gcc 14.0.0 riscv64 toolchain

$ make ARCH=riscv CROSS_COMPILE=riscv64-unknown-linux-gnu- MY_CFLAGS='-march=rv64gcv_zba_zbb_zbc_zbkb_zbkc_zbkx_zvbb_zvbc' all

Copy the vmlinux ELF file and selected object files into the riscv64/exemplars directory:

$ cp vmlinux ~/projects/github/ghidra_import_tests/riscv64/exemplars/vmlinux_6.9rc2
$ cp arch/riscv/crypto/aes-riscv64-zvkned-zvbb-zvkg.o ~/projects/github/ghidra_import_tests/riscv64/exemplars/vmlinux_6.9rc2_aes-riscv64-zvkned-zvbb-zvkg.o

Analysis

Importing the custom vmlinux kernel into Ghidra 11.1-DEV(isa_ext) shows:

there are relatively few vector extension sequences in the kernel - 17 instances of vset*.
- for example, __asm_vector_usercopy uses vector loads and stores to copy into user memory spaces.
there are Zbb variants: strcmp_zbb, strlen_zbb, and strncmp_zbb which can be patched into calls
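A count like the 17 vset* instances can be cross-checked against a reference disassembly. The sketch below assumes objdump-style instruction lines with tab-separated fields (address, encoding, mnemonic, operands); that layout is an assumption about the listing format, the function name is hypothetical, and the sample lines are made up for illustration rather than taken from the kernel.

```python
from collections import Counter


def count_mnemonics(objdump_lines, prefix):
    """Count instruction mnemonics that start with the given prefix."""
    counts = Counter()
    for line in objdump_lines:
        fields = line.split("\t")
        if len(fields) < 3 or not fields[0].strip().endswith(":"):
            continue  # not an instruction line (label, blank, header)
        mnemonic = fields[2].strip()
        if mnemonic.startswith(prefix):
            counts[mnemonic] += 1
    return counts


# Illustrative lines only; addresses and symbol name are invented.
sample = [
    "0000000000010430 <example>:",
    "   10430:\t0d007057          \tvsetvli\tzero,zero,e32,m1,ta,ma",
    "   10434:\t0c607057          \tvsetvli\tzero,zero,e8,mf4,ta,ma",
    "   10438:\t00008067          \tret",
]
print(sum(count_mnemonics(sample, "vset").values()))  # 2
```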

Importing the aes-riscv64-zvkned-zvbb-zvkg.o object file - presumably available for use in loadable kernel crypto modules - shows:

two functions aes_xts_encrypt_zvkned_zvbb_zvkg and aes_xts_decrypt_zvkned_zvbb_zvkg
many vector, crypto, and bit manipulation extension instructions.

Commit logs for the Linux kernel sources suggest that the riscv vector crypto functions were derived from openssl source code, possibly intended for use in file system encryption and decryption.

x86_64 exemplars

A few x86_64 exemplars exist to explore the scope of issues raised by RISCV exemplars. The x86_64/exemplars directory shows how optimizing gcc-14 compilations handle simple loops and built-ins like memcpy for various microarchitectures.

Intel microarchitectures can be grouped into common profiles like x86-64-v2, x86-64-v3, and x86-64-v4. Each has its own set of instruction set extensions, so an optimizing compiler like gcc-14 will autovectorize loops and built-ins differently for each microarchitecture.

The memcpy exemplar set includes source code and three executables compiled from that source code with -march=x86-64-v2, -march=x86-64-v3, and -march=x86-64-v4. The binutils-2.41 objdump disassembly is provided for each executable, for comparison with Ghidra’s disassembly window.

x86_64/exemplars$ ls memcpy*
memcpy.c  memcpy_x86-64-v2  memcpy_x86-64-v2.objdump  memcpy_x86-64-v3  memcpy_x86-64-v3.objdump  memcpy_x86-64-v4  memcpy_x86-64-v4.objdump

These exemplars suggest several Ghidra issues:

Ghidra’s disassembler is generally unable to recognize many vector instructions generated by gcc-14 with -march=x86-64-v4 and -O3.
Ghidra’s decompiler provides the user little help in recognizing the semantics of memcpy or many simple loops with -march=x86-64-v2 or -march=x86-64-v3.
Ghidra users should be prepared for wide variety in vector optimized instruction sequences. Pattern recognition will be difficult.

Custom exemplars

Not all RISCV instruction set extensions are standardized and supported by open source compiler suites. Vendors can generate their own custom extensions. These may be instructions that are proposed for standardization, instructions that predate standardized extensions and are effectively deprecated for new RISCV variants, and (potentially) instructions that are considered non-public licensable intellectual property.

We have one example of a set of vendor-specific RISC-V extension exemplars that is pending classification. Some of the WCH QingKe 32 bit RISCV processors support what WCH calls extended instructions, or XW instructions, such as c.lbu, c.lhu, c.sb, c.sh, c.lbusp, c.lhusp, c.sbsp, and c.shsp. The encoding for these custom instructions overlaps that of other, standardized extensions like Zcd, while some of the instruction mnemonics overlap those of Zcb. There is no known evidence that these XW instructions are tracked for inclusion in binutils, as other full-custom extensions from Alibaba’s T-Head group are. There is no evidence that these XW instructions are considered licensable or proprietary to WCH (Nanjing Qinheng Microelectronics).

https://github.com/ArcaneNibble has generated a set of binary exemplars for this vendor custom extension. Naming conventions for full-custom extensions are very much To Be Determined. The RISCV binutils toolchain attaches an architecture tag to each ELF file it generates. For these binary exemplars that is:

Tag_RISCV_arch: "rv32i2p0_m2p0_a2p0_f2p0_c2p0_xw2p2"

That architectural tag implies the binaries are for a base RISCV 32 bit processor, with the standard compressed (c) extension version 2.0 and other standard extensions. The vendor custom (x) extension (w) version 2.2 (2p2) is enabled. The Zcd and Zcb extensions are explicitly not enabled, so there is no conflict with either assembly or disassembly of the instructions.
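The tag can be decoded mechanically. The following sketch splits a Tag_RISCV_arch string into the base ISA and a map of extension names to (major, minor) versions, assuming the underscore-separated name-plus-version pattern seen in the tag above. The function name is hypothetical and the parser is not meant to cover every form the toolchain may emit.

```python
import re


def parse_arch_tag(tag):
    """Split a Tag_RISCV_arch string into (base, {extension: (major, minor)})."""
    first, *rest = tag.split("_")
    match = re.fullmatch(r"(rv\d+)([a-z])(\d+)p(\d+)", first)
    if match is None:
        raise ValueError(f"unrecognized base ISA field: {first!r}")
    base, ext, major, minor = match.groups()
    extensions = {ext: (int(major), int(minor))}
    for field in rest:
        # Lazy name match so a trailing '<major>p<minor>' is split off.
        m = re.fullmatch(r"([a-z][a-z0-9]*?)(\d+)p(\d+)", field)
        if m is None:
            raise ValueError(f"unrecognized extension field: {field!r}")
        extensions[m.group(1)] = (int(m.group(2)), int(m.group(3)))
    return base, extensions


base, exts = parse_arch_tag("rv32i2p0_m2p0_a2p0_f2p0_c2p0_xw2p2")
print(base)        # rv32
print(exts["c"])   # (2, 0)
print(exts["xw"])  # (2, 2)
```

A check like `"zcb" not in exts` then confirms that the standardized extensions with overlapping encodings or mnemonics are absent from a given exemplar.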

These exemplars are currently filed under riscv64/exemplars as:

custom
└── wch
    ├── lbu.S
    ├── lbusp.S
    ├── lhu.S
    ├── lhusp.S
    ├── sb.S
    ├── sbsp.S
    ├── sh.S
    ├── shsp.S
    ├── w2p2-lbu.o
    ├── w2p2-lbusp.o
    ├── w2p2-lhu.o
    ├── w2p2-lhusp.o
    ├── w2p2-sb.o
    ├── w2p2-sbsp.o
    ├── w2p2-sh.o
    └── w2p2-shsp.o


Last modified May 16, 2024: Complete Issue #22, including refactoring to pull toolchain suite into top level (995d842)