Ghidra Advisor Reference

1: Advisor Dependencies
2: Populating the Database
3: Feature Analysis
4: Frequently Asked Questions

1 - Advisor Dependencies

Ghidra

Ghidra 11.3_DEV with the RISCV-64 isa_ext branch. Without this branch Ghidra is stuck with a never-ratified older version of RISCV support.

Bazel

Bazel 8.0

Bazel builds in this workspace generate output in the temporary directory /run/user/1000/bazel, as specified in .bazelrc. This override can be changed or removed.

This project should work with Bazel 7.x as well, after adjusting some toolchain path names. Bazel 8 uses ‘+’ instead of ‘~’ as an external repo naming suffix and ‘@@’ instead of ‘@’ to identify standard bazel repositories.

Toolchain

binutils 2.42.50
gcc 15.0
glibc developmental version 2.39.9000
sysroot - a stripped down linux sysroot derived from the sysroot bootstrap in riscv-gnu-toolchain

The toolchain is packaged locally as a Bazel module named gcc_riscv_suite, version 15.0.0.1. (Note that this is the first patch to the Bazel module based on the unreleased GCC-15.0.0). This module depends on a second module, fedora_syslibs version 41.0.0. These are served out of a local Bazel module repository. The gcc_riscv_suite and fedora_syslibs modules wrap a 42 MB and 4.0 MB tarball, respectively.

Emulators

Two qemu emulators are used, both built from source shortly after the 9.0.50 release.

qemu-riscv64 provides user space emulation, which is very useful for exploring the behavior of particularly confusing assembly code sequences.
qemu-system-riscv64 provides full RISCV-64 VM hosting. This is more narrowly useful when testing binaries like DPDK which require non-standard kernel options or kernel modules.
- The RISCV-64 VM used here is based on an Ubuntu 24.04 disk image and the u-boot.bin boot loader. This boot loader is critical for RISCV VMs, since the emulated BIOS firmware provides the kernel with the definitive set of RISCV extensions available to the hardware threads (aka harts).

Jupyter

jupyterlab 4.1.1

System

Fedora 41 with wayland graphics.
Python 3.13

2 - Populating the Database

The training set consists of matched C source code and RISCV-64 disassembly code. The C source is processed through the C preprocessor cpp and indent. That code is then compiled with GCC for at least two different machine architectures, and the results are saved under ./data.

Populating the Training Set Database

The initial C source code is selected from the GCC riscv autovector testsuite. We can add custom examples of code to fill gaps or represent code patterns we might find in a Ghidra binary under review. Autovectored loops over structure arrays can be especially confusing to interpret, so we will likely want extra samples of that type.

The C sources for these two test suites appear in ./gcc_riscv_testsuite and ./custom_testsuite. The script generator.py processes these into cpp output (*.i), compiled libraries (*.so), and objdump assembly listings (*_objdump) for each requested machine architecture.

For example, gcc_riscv_testsuite/rvv/autovec/reduc/reduc-1.c is processed by generator.py into:

data/gcc_riscv_testsuite/rvv/autovec/reduc/reduc-1_rv64gc.i
data/gcc_riscv_testsuite/rvv/autovec/reduc/reduc-1_rv64gc.so
data/gcc_riscv_testsuite/rvv/autovec/reduc/reduc-1_rv64gc_objdump
data/gcc_riscv_testsuite/rvv/autovec/reduc/reduc-1_rv64gcv.i
data/gcc_riscv_testsuite/rvv/autovec/reduc/reduc-1_rv64gcv.so
data/gcc_riscv_testsuite/rvv/autovec/reduc/reduc-1_rv64gcv_objdump
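The naming pattern in the listing above can be sketched in Python. This is a minimal illustration inferred from the reduc-1.c example only; the function output_paths and its data_root parameter are hypothetical, and the real generator.py may construct these paths differently.

```python
from pathlib import PurePosixPath

# Assumption: mirrors the MARCH_SET value this document quotes from variables.bzl.
MARCH_SET = ("rv64gc", "rv64gcv")


def output_paths(source, data_root="data"):
    """Return the expected generator outputs for one C source file.

    Hypothetical helper: the naming scheme is inferred from the
    reduc-1.c example, not taken from generator.py itself.
    """
    stem = PurePosixPath(source).with_suffix("")  # drop the trailing ".c"
    paths = []
    for march in MARCH_SET:
        base = PurePosixPath(data_root) / f"{stem}_{march}"
        paths.append(f"{base}.i")        # cpp output
        paths.append(f"{base}.so")       # compiled library
        paths.append(f"{base}_objdump")  # objdump assembly listing
    return paths


paths = output_paths("gcc_riscv_testsuite/rvv/autovec/reduc/reduc-1.c")
```

Each source file yields three artifacts per machine architecture, so adding an entry to MARCH_SET adds three more files per sample.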

The ingest.py script reads everything under ./data to populate the sample table in the SQLite3 database.

.schema sample
CREATE TABLE sample(id INTEGER PRIMARY KEY AUTOINCREMENT, namespace TEXT, arch TEXT, name TEXT, source TEXT, assembly TEXT);
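A row insert against this schema might look like the sketch below, using Python's standard sqlite3 module. The helper insert_sample and the example values chosen for namespace and name are assumptions for illustration; the document does not say how ingest.py fills those columns.

```python
import sqlite3

# Same columns as the schema above; IF NOT EXISTS makes the sketch re-runnable.
SCHEMA = (
    "CREATE TABLE IF NOT EXISTS sample("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, namespace TEXT, arch TEXT, "
    "name TEXT, source TEXT, assembly TEXT)"
)


def insert_sample(conn, namespace, arch, name, source, assembly):
    """Insert one sample row and return its generated id (hypothetical helper)."""
    cur = conn.execute(
        "INSERT INTO sample(namespace, arch, name, source, assembly) "
        "VALUES (?, ?, ?, ?, ?)",
        (namespace, arch, name, source, assembly),
    )
    conn.commit()
    return cur.lastrowid


conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute(SCHEMA)
row_id = insert_sample(
    conn,
    "gcc_riscv_testsuite",          # assumed namespace value
    "rv64gcv",
    "rvv/autovec/reduc/reduc-1",    # assumed name value
    "/* contents of the .i file */",
    "/* contents of the _objdump file */",
)
rows = conn.execute("SELECT id, arch, name FROM sample").fetchall()
```

Parameterized queries (the ? placeholders) keep quoting in C source or assembly text from breaking the SQL statement.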

Next we need to generate signatures from this table with the sample_analytics.py script. At present, signatures are simple strings. We have three signature types at the moment.

.schema signatures
CREATE TABLE signatures(id INTEGER PRIMARY KEY AUTOINCREMENT, sample_id INTEGER, signature_type TEXT, signature_value TEXT);
select distinct signature_type from signatures;
Traits
Opcodes, sorted
Opcodes, ordered

The Traits signature holds simple facts from the disassembly code, such as hasLoop if at least one backwards branch exists.
The Opcodes, ordered signature is a simple list of vector and branch opcodes, concatenated in the same order as they are found in the disassembly.
The Opcodes, sorted signature is similar to Opcodes, ordered, but sorted into alphanumeric order. This may be useful if the compiler reorders instructions.
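The two opcode signatures can be sketched as follows. This is an illustration under stated assumptions, not the code in sample_analytics.py: the comma separator, the BRANCH_OPCODES set, and the rule "any opcode starting with v is a vector opcode" are all simplifications chosen for the example.

```python
# Assumption: a small set of conditional branch mnemonics and pseudo-ops.
BRANCH_OPCODES = {
    "beq", "bne", "blt", "bge", "bltu", "bgeu",
    "beqz", "bnez", "blez", "bgez", "bltz", "bgtz",
    "bgt", "ble", "bgtu", "bleu",
}


def opcode_signatures(opcodes):
    """Build the ordered and sorted opcode signature strings.

    Only vector and branch opcodes are kept; everything else is dropped.
    """
    kept = [op for op in opcodes if op.startswith("v") or op in BRANCH_OPCODES]
    return {
        "Opcodes, ordered": ",".join(kept),
        "Opcodes, sorted": ",".join(sorted(kept)),
    }


sigs = opcode_signatures(
    ["vsetvli", "vle32.v", "vredsum.vs", "sub", "bnez", "vmv.x.s", "ret"]
)
# sigs["Opcodes, ordered"] is "vsetvli,vle32.v,vredsum.vs,bnez,vmv.x.s"
# sigs["Opcodes, sorted"]  is "bnez,vle32.v,vmv.x.s,vredsum.vs,vsetvli"
```

Two listings that differ only in instruction scheduling produce different ordered signatures but the same sorted signature.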

Querying the Database in Advisor

Workflow Generation

Users can select assembly code from Ghidra’s listing window, then run analysis cells in Advisor.ipynb to generate reports on the types of C code that may match the listing. Users will likely want to iterate complex selections by adding custom examples and repeating the match, to see if they can reproduce the C code that might have generated the vectorized assembly.

3 - Feature Analysis

At a very high level the Advisor tries to translate between two languages - C or C++ source code that a human might write and the sequence of assembly language instructions a compiler generates. Compilers are very good at the forward translation from C to assembly. We want something that works in the reverse direction - suggesting C or other source code that might have been compiled into specific assembly sequences.

The Advisor tries to brute-force this reverse translation by compiling a reference set of C sources into binaries, extracting the instructions with objdump into a database, then looking for the best match to the instructions copied into the clipboard. The GCC compiler test suite gives us thousands of reference C functions to start with.
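The "best match" step could be approximated as in the sketch below. The document does not say how the Advisor scores candidates, so this example substitutes Python's standard difflib.SequenceMatcher as a stand-in similarity measure; the function rank_matches and the sample names are hypothetical.

```python
from difflib import SequenceMatcher


def rank_matches(query, references, top_n=3):
    """Rank reference samples by opcode-sequence similarity to the query.

    query is a list of opcodes; references maps a sample name to its
    opcode list. Returns (name, score) pairs, best match first.
    """
    scored = []
    for name, opcodes in references.items():
        ratio = SequenceMatcher(None, query, opcodes).ratio()
        scored.append((ratio, name))
    # Highest score first; ties broken by name for a stable order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(name, round(ratio, 3)) for ratio, name in scored[:top_n]]


references = {
    "reduc-1": ["vsetvli", "vle32.v", "vredsum.vs", "bnez"],
    "copy-1": ["vsetvli", "vle8.v", "vse8.v", "bnez"],
}
ranked = rank_matches(["vsetvli", "vle32.v", "vredsum.vs", "bnez"], references)
```

A sequence-based score like this is sensitive to instruction reordering, which is one reason a sorted signature can be a useful second opinion.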

Some features are easy to recognize.

If the assembly listing includes a backwards branch instruction and branch target, then the source code likely contains a loop.
If the assembly listing includes an instruction matching vred*, vfred*, or vwred*, then the source code likely contains a vectorized reduction loop reading a vector and emitting a scalar.
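Both checks can be sketched against objdump-style text. The trait name hasLoop comes from this document; hasReduction, the line regex, and the branch set are assumptions made for the example, and the instruction encodings in the sample listing are placeholders rather than real encodings.

```python
import re

# Assumed objdump line shape: "  address:  encoding  opcode  operands".
LINE_RE = re.compile(r"^\s*([0-9a-f]+):\s+[0-9a-f]+\s+(\S+)\s*(.*)$")
# Matches the vred*, vfred*, and vwred* patterns named above.
REDUCTION_RE = re.compile(r"^v[fw]?red")
BRANCHES = {"beq", "bne", "blt", "bge", "bltu", "bgeu", "beqz", "bnez"}


def traits(listing):
    """Return a sorted list of trait names found in a disassembly listing."""
    found = set()
    for line in listing.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        address = int(match.group(1), 16)
        opcode, operands = match.group(2), match.group(3)
        if REDUCTION_RE.match(opcode):
            found.add("hasReduction")  # hypothetical trait name
        if opcode in BRANCHES and operands:
            # The branch target is the last comma-separated operand.
            target_text = operands.split()[0].split(",")[-1]
            try:
                target = int(target_text, 16)
            except ValueError:
                continue
            if target < address:  # backwards branch
                found.add("hasLoop")
    return sorted(found)


listing = """
    10a0:  00000000   vsetvli zero,a5,e32,m1,ta,ma
    10a4:  00000000   vle32.v v1,(a0)
    10a8:  00000000   vredsum.vs v1,v1,v2
    10ac:  00000000   bnez a5,10a0 <sum+0x10>
"""
found = traits(listing)
```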

Other features are mostly distractions, adding entropy that we would like to ignore:

The local choice of registers to hold intermediate values
The specific loop termination branch condition. A test against a counter, an input pointer, or an output pointer is equally valid, but only one will be implemented.
Instruction ordering is often arbitrary inside a loop, as counters and pointers are incremented/decremented.
The compiler may reorder instructions to minimize the impact of memory latency.
The compiler will change the emitted instructions for an inlined function depending on what it knows at compile time. This is especially true when the compiler knows the exact number of loop iterations, the alignment of operands, and the minimum size of vector registers.
The compiler will change the emitted instructions based on the local ‘register pressure’ - whether or not there are lots of free vector registers.
The compiler (or inline header macros) will translate a simple loop into multiple code blocks selected at run time. If the count is small, a scalar implementation is used. If the count is large, one or more vector blocks are used.
The compiler writers sometimes have to guess whether to optimize for instruction count, minimal branches, or memory accesses.
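One way to discount the first distraction in the list above, the local choice of registers, is to replace register names with class placeholders before comparing instructions. The sketch below is an assumption about how that could be done, not a description of the Advisor's code; normalize and the placeholder letters V, F, and R are hypothetical, and the regular expressions cover only common ABI register names.

```python
import re

# Vector registers v0..v31.
VECTOR_REG = re.compile(r"\bv(?:[12]?\d|3[01])\b")
# Floating point ABI names such as fa0, fs1, ft11.
FLOAT_REG = re.compile(r"\bf[ast]\d{1,2}\b")
# Integer ABI names and raw x-register names.
SCALAR_REG = re.compile(r"\b(?:zero|ra|sp|gp|tp|fp|[ast]\d{1,2}|x\d{1,2})\b")


def normalize(instruction):
    """Replace register operands with V, F, or R; leave the opcode untouched."""
    parts = instruction.split(None, 1)
    if not parts:
        return ""
    opcode = parts[0]
    operands = parts[1] if len(parts) > 1 else ""
    operands = VECTOR_REG.sub("V", operands)
    operands = FLOAT_REG.sub("F", operands)
    operands = SCALAR_REG.sub("R", operands)
    return f"{opcode} {operands}".strip()


a = normalize("vle32.v v1,(a0)")
b = normalize("vle32.v v8,(a5)")
c = normalize("vsetvli zero,a5,e32,m1,ta,ma")
```

After normalization the two vle32.v loads compare equal, while fields such as e32 and m1 in vsetvli are preserved because they carry real information about element width and register grouping.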

And some features are harder to recognize but useful for the Ghidra user:

Operand type is sometimes set at runtime, not encoded into the instruction opcode.
Compilers can emit completely different code if the machine architecture indicates a vector length of greater than 128 bits.
Vector registers may be grouped based on runtime context, so that the number of registers read or written must be inferred from instruction flows.
The compiler will accept intrinsic vector functions - not all vector loops have a C counterpart.

4 - Frequently Asked Questions

Why Bazel?: Bazel does a good job of managing cross-compiler builds and build caches together, where the cross-compiler toolchain can be switched easily.
How do I compile with support for RISCV Instruction Set Architecture extensions?: The binutils and gcc base code need to support those extensions first. The gcc compiler uses the -march= command line option to identify which extensions to apply for a given compilation. For example, -march=rv32gcv says vector instructions are supported, while -march=rv32gc excludes vector instructions.
What machine architectures are currently implemented?: The variables.bzl file sets MARCH_SET = ("rv64gc", "rv64gcv"). Most sources are then compiled with and without vector support.
Are all RISCV vector binaries runnable on all vector hardware threads?: Not always. By default GCC will build for a minimum vector register length (VLEN) of 128 bits, which should be portable across all general purpose RISCV harts. If _zvl512b is added to the -march setting, GCC knows that vector registers are at least 512 bits wide and can unroll loops more aggressively, generating code that will fail on 128-bit vector harts. This can get complicated when processors have both 128-bit and 512-bit cores, like the sg2380.
Aren’t vector extensions unlikely to be used in programs that don’t do vector math?: No. Vector extensions are very likely to be found in inlined utilities like memcpy and strncmp. Most simple loops over arrays of structs can be optimized with vector instructions.
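The relationship between a -march string and the minimum VLEN described in the answers above can be sketched as a small parser. This is a deliberately simplified illustration: min_vlen is a hypothetical helper, it only looks at the single-letter v extension and _zvl<N>b suffixes, and it ignores other extensions that can also enable vector support.

```python
import re


def min_vlen(march):
    """Return the minimum VLEN in bits implied by a -march string.

    Returns 0 when the string does not include the v extension.
    Simplification: only 'v' and '_zvl<N>b' are considered.
    """
    base, *extensions = march.lower().split("_")
    # Single-letter extensions follow the "rv32" or "rv64" prefix.
    if "v" not in base[4:]:
        return 0
    vlen = 128  # default minimum for the v extension
    for ext in extensions:
        match = re.fullmatch(r"zvl(\d+)b", ext)
        if match:
            vlen = max(vlen, int(match.group(1)))
    return vlen


portable = min_vlen("rv64gcv")
wide = min_vlen("rv64gcv_zvl512b")
scalar_only = min_vlen("rv64gc")
```

A check like this could flag database samples or binaries that assume more than 128 bits, which matters when deciding whether a listing can run on a given hart.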