Ghidra Import Testing

A testbed for binaries Ghidra should - or may soon - be able to import.

Introduction

Ghidra is a wonderful tool for analysis of executable binaries. This project attempts to collect elements of a testing framework for the binary import process - loading something with executable code into a Ghidra project. There are many types of binaries to analyze, and each type is evolving rapidly. This project collects examples of those binaries (aka exemplars) so that Ghidra users and developers can gauge Ghidra’s ability to import them and display their disassemblies and decompilations. We’ll concentrate on executable binaries for 64-bit RISCV processors, partly because this architecture is evolving rapidly. Which of those evolutions are worth tracking in Ghidra?

Exemplars here fall into one of two classes:

  1. Existing binaries available for downloading from the public internet, such as a RISCV-64 Fedora system boot image:
    1. user space executables
    2. user space system libraries
    3. kernel loadable modules and device drivers
    4. kernel code
  2. Small source files (assembly, C, C++, or maybe Rust) demonstrating a single concept, which can be compiled and/or linked into an importable ELF binary. These source files are likely copied from open source test suites or feature demos:
    • Ghidra’s disassembler is a lot like binutils’ objdump, so the gas testsuite is a great source of these sample source files. For example, this project covers RISCV instruction set extensions by importing the binutils test files for vector, bit manipulation, and cache control extension instructions, then comparing Ghidra’s disassembly with that of the latest objdump.
    • Ghidra’s decompiler needs to make sense of the binaries generated by compilers like gcc, ideally turning the disassembly output into something resembling the original source file. GCC’s optimization can make that a challenge, so we include some small source files triggering gcc’s autovectorization optimization.
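
As an illustration of the second kind of source file, a small autovectorization exemplar might look like the sketch below. The file name, function name, toolchain name, and build flags are assumptions made for illustration; they are not taken from this project. Whether the loop is actually vectorized depends on the gcc version and options in use.

```c
/* saxpy_exemplar.c - hypothetical exemplar, not a file from this project.
 * Assumed build line (adjust to your own cross toolchain):
 *   riscv64-unknown-linux-gnu-gcc -O3 -march=rv64gcv -c saxpy_exemplar.c
 */
#include <stddef.h>

/* Compute y[i] += a * x[i] for i in [0, n).
 * The restrict qualifiers promise that x and y do not overlap, which
 * makes this loop a simple candidate for autovectorization. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y) {
  for (size_t i = 0; i < n; i++) {
    y[i] += a * x[i];
  }
}
```

Compiling with -c yields a relocatable ELF object that can be imported on its own, so no main function is needed for this kind of exemplar.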

This test harness includes several components:

  1. scripts to download existing disk images and break them down into kernel code and the three types of ELF importables listed above.
  2. portable toolchains capable of cross-compiling and linking the small source files with various versions of gcc.
  3. one or more integration test programs using Python’s unittest framework. These are used to catch regressions and help keep Ghidra aligned with advances in toolchain development.

Example

Ratified RISCV extensions include vector extensions that can speed up code ranging from memcpy to machine learning inference engines.

This lets performance-minded developers replace a call to memcpy with a call to a function like this instead:

#include <stddef.h>
#include <riscv_vector.h>

void *memcpy_vec(void *restrict destination, const void *restrict source,
                 size_t n) {
  unsigned char *dst = destination;
  const unsigned char *src = source;
  // copy data in chunks of vl bytes; vsetvl picks vl (at most n) each pass
  for (size_t vl; n > 0; n -= vl, src += vl, dst += vl) {
    vl = __riscv_vsetvl_e8m8(n);
    vuint8m8_t vec_src = __riscv_vle8_v_u8m8(src, vl);
    __riscv_vse8_v_u8m8(dst, vec_src, vl);
  }
  return destination;
}

With gcc-14, due for release in mid-2024, calls to the C standard library version of memcpy can also be optimized by the compiler into similar vectorized inline code. Ghidra should probably be able to make sense of those instruction sequences, perhaps even recognizing vectorized patterns that correspond to common library functions like memcpy and strncpy.
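
A minimal sketch of a source file that could exercise this behavior is shown below. The struct and function names are hypothetical and not part of this project. Whether a given gcc build expands either call into inline vector instructions depends on its version, the target flags, and the copy size, so treat this as a probe to compile and inspect, not as a guaranteed reproduction.

```c
/* memcpy_probe.c - hypothetical exemplar, not a file from this project. */
#include <stddef.h>
#include <string.h>

struct packet {
  unsigned char payload[64];
};

/* Fixed-size copy: the length is a compile-time constant, which gives
 * the compiler the most freedom to expand memcpy inline. */
void copy_packet(struct packet *dst, const struct packet *src) {
  memcpy(dst->payload, src->payload, sizeof dst->payload);
}

/* Variable-length copy, for comparison: the compiler may emit an inline
 * loop or simply call the library memcpy. */
void copy_bytes(void *restrict dst, const void *restrict src, size_t n) {
  memcpy(dst, src, n);
}
```

Importing the resulting object file into Ghidra and comparing the decompiler output for the two functions shows whether the inlined sequences are still recognizable as copies.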