Overview

We can help Ghidra import newer binaries by collecting samples of those binaries. The scope is limited to RISCV-64 Linux-capable processors one might find in smart network appliances or Machine Learning inference engines.

Note: This proof-of-concept project focuses on a single processor family, RISCV. Some results are checked against equivalent x86_64 processors, to see whether pending issues are limited in scope or likely to affect a larger community.

This project collects files that may stress - in a good way - Ghidra’s import, disassembly, and decompilation capabilities. Others are doing a great job extending Ghidra’s ability to import and recognize C++ structures and classes, so we will focus on lower level objects like instruction sets, relocation codes, and pending toolchain improvements. The primary CPU family will be based on the RISCV-64 processor. This processor is relatively new and easily modified, so it will likely show lots of new features early. Not all of these new features will make it into common use or arenas in which Ghidra is necessary, so we don’t really know how much effort is worth spending on any given feature.

Because the RISCV processor family is relatively new, we can expect its compiler and toolchain support to evolve more rapidly than that of more established families like x86_64. That means RISCV appliances may be more likely to be built with newer compiler toolchains than x86_64 appliances.

There are two key goals here:

Experiment with Ghidra import integration tests that can detect Ghidra regressions. This involves collecting a number of processor and toolchain binary exemplars to be imported plus analysis scripts to verify those import results remain valid. Example: verify that ELF relocation codes are properly handled when importing a RISCV-64 kernel module. These integration tests should always pass after changes to Ghidra’s source code.
Collect feature-specific binary exemplars that might highlight emergent gaps in Ghidra’s import processes. Ghidra will usually fail to properly import these exemplars, allowing the Ghidra development team to triage the gap and evaluate options for closing it. Example: pass the RISCV instruction set extension testsuite from binutils/gas into Ghidra to test whether Ghidra can recognize all of the new instructions gas can generate.

A secondary goal emerged during testing: explore the impact on Ghidra users of vector instruction set extensions as used in aggressive compiler optimization. The RISCV 1.0 vector instructions as generated by the gcc 14.0 optimizing compiler can turn simple loops into more complex instruction sequences.

The initial scope focuses on RISCV 64-bit processors capable of running a full Linux network stack, likely implementing the 2023 standard profile.

We want to track recent additions to standard RISCV-64 toolchains (like binutils and gcc) to see how they might make life interesting for Ghidra developers. At present, that includes newly frozen or ratified instruction set architecture (ISA) changes and compiler autovectorization optimizations. Some vendor-specific instruction set extensions will be included if they are accepted into the binutils main branch.

Running integration tests

Note: These scripts use both the unittest and logging frameworks, with the log level variously set to INFO or WARN. The exact output may vary.

The first two steps collect binary exemplars for Ghidra to import. Large binaries are extracted from public disk images, such as the latest Fedora RISCV-64 system disk image. Small binaries are generated locally from minimal C or C++ source files and gcc toolchains.

The large binaries are downloaded and extracted using acquireExternalExemplars.py. This script is built on the python unittest framework to either verify the existence of previously extracted exemplars or regenerate those if missing.

The small binaries are created - if not already present - with the generateInternalExemplars.py script.
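The "verify, or regenerate if missing" behavior described above can be sketched as a unittest case. This is a minimal illustration only, not the project's actual code: the helper `ensure_exemplar`, the callback argument, and the file name `demo.o` are all hypothetical.

```python
import tempfile
import unittest
from pathlib import Path


def ensure_exemplar(path: Path, regenerate) -> bool:
    """Return True if the exemplar already existed, False if it was rebuilt.

    `regenerate` is a hypothetical callback standing in for whatever
    download, extraction, or compile step produces the file.
    """
    if path.is_file():
        return True
    path.parent.mkdir(parents=True, exist_ok=True)
    regenerate(path)
    return False


class ExemplarTest(unittest.TestCase):
    def test_exemplar_present_or_rebuilt(self):
        with tempfile.TemporaryDirectory() as tmp:
            target = Path(tmp) / "exemplars" / "demo.o"  # hypothetical name
            # First call: the file is missing, so the callback creates it.
            self.assertFalse(
                ensure_exemplar(target, lambda p: p.write_bytes(b"\x7fELF"))
            )
            # Second call: the file exists, so the callback must not run.
            self.assertTrue(
                ensure_exemplar(target, lambda p: self.fail("unexpected rebuild"))
            )
            self.assertTrue(target.is_file())
```

Saved to a file, a case like this can be run with `python -m unittest`. Because a second run finds the file already present, it does almost no work, which is consistent with the very short run times in the transcripts that follow.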

Warning: GCC-13 and GCC-14 binary toolchains are not included in this project. Sources should be downloaded, compiled, and installed to something like /opt/... then post-processed by bundled toolchain scripts into portable, hermetic tarballs.

$ ./acquireExternalExemplars.py 
...........
----------------------------------------------------------------------
Ran 11 tests in 0.003s

OK

$ ./generateInternalExemplars.py 
.......
----------------------------------------------------------------------
Ran 7 tests in 4.092s

OK

The exemplar binaries can now be imported into two Ghidra projects - one for RISCV64 and another for x86_64. The import process includes pre- and post-script processing. Pre-script processing is used for the kernel import to fix symbol names and load address. Post-script processing is used for the kernel module import to gather relocation results for later regression testing. These relocation results are saved in testresults/*.json.

Import processing generates a log file for each binary imported into Ghidra. If that log file is newer than the binary, the import process is skipped. To rerun an import for foo.o, delete the matching log file, .../exemplars/foo.log.

$ ./importExemplars.py
.INFO:root:Current Kernel import log file found - skipping import
.INFO:root:Current Kernel module import log file found - skipping import
...
.
----------------------------------------------------------------------
Ran 7 tests in 0.003s

OK
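The skip rule described before this transcript (an import is skipped when its log file is newer than the binary) amounts to a modification-time comparison. The sketch below shows one way to express it. The function names and the foo.o to foo.log naming convention in the same directory are assumptions made for illustration, not taken from importExemplars.py.

```python
import os
import tempfile
from pathlib import Path


def log_path_for(binary: Path) -> Path:
    """Assumed convention: foo.o maps to foo.log in the same directory."""
    return binary.with_suffix(".log")


def import_is_current(binary: Path, log: Path) -> bool:
    """True when a log exists and is newer than the binary (import can be skipped)."""
    return log.is_file() and log.stat().st_mtime > binary.stat().st_mtime


# Usage with throwaway files and explicit timestamps.
with tempfile.TemporaryDirectory() as tmp:
    binary = Path(tmp) / "foo.o"
    binary.write_bytes(b"\x7fELF")
    log = log_path_for(binary)
    log.write_text("import finished\n")
    os.utime(binary, (1000, 1000))  # binary is older ...
    os.utime(log, (2000, 2000))     # ... than its log
    print("skip import:", import_is_current(binary, log))
    log.unlink()                    # deleting the log forces a rerun
    print("skip import:", import_is_current(binary, log))
```

Deleting the log is enough to force a rerun because a missing log never counts as current.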

Test results gathered during binary imports and saved in testresults/*.json are now compared with expected values in the final script:

$ ./integrationTest.py 
inspecting the R_RISCV_BRANCH relocation test
inspecting the R_RISCV_JAL test
inspecting the R_RISCV_PCREL_HI20 1/2 test
inspecting the R_RISCV_PCREL_HI20 2/2 test
inspecting the R_RISCV_PCREL_LO12_I test
inspecting the R_RISCV_64 test
inspecting the R_RISCV_RVC_BRANCH test
inspecting the R_ADD_32 test
inspecting the R_RISCV_ADD64 test
inspecting the R_SUB_32 test
inspecting the R_RISCV_ADD64 test
inspecting the R_RISCV_RVC_JUMP test
.
----------------------------------------------------------------------
Ran 1 test in 0.000s
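
The comparison step can be pictured as loading a JSON results file and checking each entry against a stored expectation. The sketch below is illustrative only. The layout of testresults/*.json is not shown in this document, so the flat name-to-value mapping, the placeholder values, and the file name relocations.json are all assumptions.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical expected values keyed by test name. The real JSON layout
# is not documented here, so this flat mapping is an assumption.
EXPECTED = {
    "R_RISCV_BRANCH": "placeholder-a",
    "R_RISCV_JAL": "placeholder-b",
}


def load_results(path: Path) -> dict:
    """Read one JSON results file written during import."""
    with path.open() as handle:
        return json.load(handle)


def find_mismatches(observed: dict, expected: dict) -> list:
    """Return the names of expected entries that are missing or differ."""
    return [
        name
        for name, want in expected.items()
        if name not in observed or observed[name] != want
    ]


# Usage: write a results file with one deliberate difference, then compare.
with tempfile.TemporaryDirectory() as tmp:
    results_file = Path(tmp) / "relocations.json"  # hypothetical file name
    results_file.write_text(
        json.dumps({"R_RISCV_BRANCH": "placeholder-a", "R_RISCV_JAL": "changed"})
    )
    for name in find_mismatches(load_results(results_file), EXPECTED):
        print(f"regression suspected in the {name} test")
```

Inside a unittest case, the same check could be expressed as an assertion that `find_mismatches` returns an empty list.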

Ghidra gap analysis

We are looking for processor features that may soon be commonplace but that current Ghidra releases do not support well. One such feature involves RISCV extensions to the instruction set architecture, especially vector and bit manipulation extensions. For each such feature we might consider the following questions:

What is a current example of this feature, especially an example that supports analysis of the feature or exposes its pathologies?
How and when might this feature impact a significant number of Ghidra analysts?
How much effort might it take Ghidra developers to fill the implied feature gap?
Is this feature specific to RISCV systems or more broadly applicable to other processor families? Would support for that feature be common to many processor families or vary widely by processor?
What are the existing frameworks within Ghidra that might most credibly be extended to support that feature?

Thread Local Storage (TLS) is a fairly simple feature we can use as an example. Addressing each of the five questions in turn we might find:

TLS relocation codes appear occasionally in multithreaded applications across most processor families. They might appear a few times within libc. Ghidra often doesn’t recognize these codes. Existing analytics like objdump and readelf certainly do recognize TLS codes, but do not pretend to provide semantic aid in interpreting those codes. TLS codes have well documented C source contexts in the form of compiler attributes.
The TLS handling gap is unlikely to affect many Ghidra users anytime soon, mostly because TLS relocation codes appear only rarely and mostly apply to local variables where the decompiler can provide context.
Experienced Ghidra developers might be able to implement the general TLS case easily, but would then have to add supporting ELF import code to a broader range of processor definitions.
The TLS feature is common across most processor families supporting multithreading.
Support within Ghidra might grow out of existing memory space models and existing processor-specific ELF importers.

The general design questions boil down to:

how long can we defer working on this gap?
how long would it take to fill that gap after we got started?
where would we likely want to start?

The Ghidra design team might assign TLS support a relatively low priority, since the gap doesn’t currently have a large impact. If the incidence and complexity of TLS suddenly increased, the extension of existing Ghidra support could likely increase just as rapidly.

Extensions to Instruction Set Architectures make up a much more complicated example. Standardized instructions for cache management and cryptography are likely easy enough to fold into Ghidra’s framework, but vector instruction extensions will hit harder and sooner, without a clear path forward for Ghidra.

1 - Glossary

Some of the commonly used terms in this project:

exemplar: An example of a binary file one might expect Ghidra to accept as input. This might be an ELF executable, an ELF object file or library of object files, a kernel load module, or a kernel vmlinux image. Ideally it should be relatively small and easy to screen for hidden malware. Not all features demonstrated by the exemplar need be supported by the current Ghidra release.
platform: The technology base one or more exemplars are used on. A kernel exemplar expects to be run on top of a bootloader platform. A Linux application exemplar may consider the Linux kernel plus system libraries as its platform. System libraries like libc.so can then be both exemplars and platform elements.
compiler suite: A compiler suite includes a compiler or cross compiler plus all of the supporting tools and libraries to build executables for a range of platforms. This generally includes a versioned C and C++ compiler, preprocessor, assembler, linker, linker scripts, and core libraries like libgcc. Compiler suites often support many architecture variants, such as 32 or 64 bit word size and a host of microarchitecture or instruction set options. Compiler suites can be customized by selecting specific configurations and options, becoming toolchains.
cross compiler: A compiler capable of generating code for a processor other than the one it is running on. An x86_64 gcc-14 compiler configured to generate RISCV-64 object files would be a cross-compiler. Cross-compilers run on either the local host platform or on a Continuous Integration test server platform.
linker: A tool that takes one or more object files and resolves those runtime linkages internal to those object files. Usually ld on a Linux system. Often generates an ELF file or a kernel image.
loader: A tool - often integrated with the kernel - that loads an ELF file into RAM. The loader finalizes runtime linkages with external objects. The loader will often rewrite code (aka relaxation) to optimize memory references and so improve performance.
sysroot: The system root directories provide the interface between platform (kernel and system libraries) and user code. This can be as simple as /usr/include or as complicated as a sysroot/lib/ldscripts holding over 250 ld scripts detailing how a linker should generate code the kernel loader can fully process. Cross-compiler toolchains often need to import a sysroot to build for a given kernel. This can make for a circular dependency.
toolchain: A toolchain is an assembly of cross-compiler, linker, loader, and sysroot, plus a default set of options and switches for each component. Different toolchains might share a gcc compiler suite, but be configured for different platforms - building a kernel image, building libc.so, or building an executable application. Note: the word toolchain is often used in this project where compiler suite is intended.
workspace: An environment that provides mappings between platforms and toolchains. If you want to build an executable for a given platform, just name that platform on the command line and the build tool will select a compatible toolchain and a default set of options. You can still override those options.
hermetic: Build artifacts are not affected by any local host files other than those imported with the toolchain. A hermetic build on a Fedora platform will generate exactly the same binary output as if built on an Ubuntu platform. This allows remote build servers to cache build artifacts and CI/CD servers to use exactly the same build environment as a diverse development team.
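
The hermetic property can be spot-checked by comparing digests of the same artifact built on two different hosts. The sketch below assumes both artifacts have been copied to the local filesystem. The helper names are hypothetical, and the usage example uses throwaway stand-in files rather than real build outputs.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large binaries need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_artifact(first: Path, second: Path) -> bool:
    """True when two build outputs are byte-for-byte identical."""
    return sha256_of(first) == sha256_of(second)


# Usage with stand-in files for outputs built on two different hosts.
with tempfile.TemporaryDirectory() as tmp:
    built_on_host_a = Path(tmp) / "a.o"
    built_on_host_b = Path(tmp) / "b.o"
    built_on_host_a.write_bytes(b"\x7fELF-demo")
    built_on_host_b.write_bytes(b"\x7fELF-demo")
    print("identical:", same_artifact(built_on_host_a, built_on_host_b))
```

A digest mismatch would suggest that some host-specific input leaked into the build.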