The material here is loosely collected into a set of notes and examples.
We can help Ghidra import newer binaries by collecting samples of those binaries. The scope is limited to RISCV-64 Linux-capable processors one might find in smart network appliances or Machine Learning inference engines.
Note: This proof-of-concept project focuses on a single processor family, RISCV. Some results are checked against equivalent x86_64 processors, to see whether pending issues are limited in scope or likely to hit a larger community.
This project collects files that may stress - in a good way - Ghidra’s import, disassembly, and decompilation capabilities. Others are doing a great job extending Ghidra’s ability to import and recognize C++ structures and classes, so we will focus on lower level objects like instruction sets, relocation codes, and pending toolchain improvements. The primary CPU family will be based on the RISCV-64 processor. This processor is relatively new and easily modified, so it will likely show lots of new features early. Not all of these new features will make it into common use or arenas in which Ghidra is necessary, so we don’t really know how much effort is worth spending on any given feature.
The RISCV processor family being relatively new, we can expect compiler and toolchain support to be evolving more rapidly than more established families like x86_64. That means RISCV appliances may be more likely to be built with newer compiler toolchains than x86_64 appliances.
There are two key goals here:

- Import binaries built with binutils/gas into Ghidra to test whether Ghidra can recognize all of the new instructions gas can generate.
- A secondary goal developed during testing: explore the impact on Ghidra users of vector instruction set extensions as used in aggressive compiler optimization. The RISCV 1.0 vector instructions as generated by the gcc 14.0 optimizing compiler can turn simple loops into more complex instruction sequences.
The initial scope focuses on RISCV 64 bit processors capable of running a full Linux network stack, likely implementing the 2023 standard profile.
We want to track recent additions to standard RISCV-64 toolchains (like binutils and gcc) to see how they might make life interesting for Ghidra developers. At present, that includes newly frozen or ratified instruction set architecture (ISA) changes and compiler autovectorization optimizations. Some vendor-specific instruction set extensions will be included if they are accepted into the binutils main branch.
Note: These scripts use both unittest and logging frameworks, where the loglevel is variously set at INFO or WARN. The exact output may vary.
The first two steps collect binary exemplars for Ghidra to import. Large binaries are extracted from public disk images, such as the latest Fedora RISCV-64 system disk image. Small binaries are generated locally from minimal C or C++ source files and gcc toolchains.
The large binaries are downloaded and extracted using acquireExternalExemplars.py. This script is built on the Python unittest framework to either verify the existence of previously extracted exemplars or regenerate them if missing.
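The check-or-regenerate pattern described above can be sketched roughly as follows. The helper name is hypothetical, not taken from the project; the real script reportedly wraps similar checks in unittest test methods.

```python
import os

# Hypothetical helper sketching the check-or-regenerate pattern described
# above. acquireExternalExemplars.py wraps similar logic in unittest test
# methods; this standalone function is an illustration only.
def ensure_exemplar(path, regenerate):
    """Return True if the exemplar exists, regenerating it first if missing."""
    if not os.path.exists(path):
        regenerate(path)  # e.g. download and extract from a disk image
    return os.path.exists(path)
```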
The small binaries are created - if not already present - with the generateInternalExemplars.py script.
Warning: GCC-13 and GCC-14 binary toolchains are not included in this project. Sources should be downloaded, compiled, and installed to something like /opt/..., then post-processed by bundled toolchain scripts into portable, hermetic tarballs.
$ ./acquireExternalExemplars.py
...........
----------------------------------------------------------------------
Ran 11 tests in 0.003s
OK
$ ./generateInternalExemplars.py
.......
----------------------------------------------------------------------
Ran 7 tests in 4.092s
OK
The exemplar binaries can now be imported into two Ghidra projects - one for RISCV64 and another for x86_64.
The import process includes pre- and post-script processing. Pre-script processing is used for the kernel import to fix symbol names and the load address. Post-script processing is used for the kernel module import to gather relocation results for later regression testing. These relocation results are saved in testresults/*.json.
Import processing generates a log file for each binary imported into Ghidra. If that log file is newer than the
binary, the import process is skipped. If you want to rerun an import for foo.o, simply delete the matching log file in
.../exemplars/foo.log.
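The skip rule above amounts to a timestamp comparison; a sketch (the function name is hypothetical, not the project's):

```python
import os

# Sketch of the freshness rule described above (name is hypothetical):
# an import is needed unless the log file exists and is newer than the binary.
def needs_import(binary_path, log_path):
    if not os.path.exists(log_path):
        return True
    return os.path.getmtime(log_path) < os.path.getmtime(binary_path)
```

Deleting the log file simply makes the first test fail, forcing a re-import.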
$ ./importExemplars.py
.INFO:root:Current Kernel import log file found - skipping import
.INFO:root:Current Kernel module import log file found - skipping import
...
.
----------------------------------------------------------------------
Ran 7 tests in 0.003s
OK
Test results gathered during binary imports and saved in testresults/*.json
are now compared with expected
values in the final script:
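A minimal sketch of that comparison step, assuming the JSON files map relocation names to observed values (the helper and schema are illustrative, not the project's actual code):

```python
import json

# Illustrative sketch: load saved relocation results and report mismatches
# against expected values. The schema (name -> value) is an assumption.
def check_relocations(results_path, expected):
    with open(results_path) as f:
        results = json.load(f)
    return {name: (want, results.get(name))
            for name, want in expected.items()
            if results.get(name) != want}  # empty dict means all checks pass
```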
$ ./integrationTest.py
inspecting the R_RISCV_BRANCH relocation test
inspecting the R_RISCV_JAL test
inspecting the R_RISCV_PCREL_HI20 1/2 test
inspecting the R_RISCV_PCREL_HI20 2/2 test
inspecting the R_RISCV_PCREL_LO12_I test
inspecting the R_RISCV_64 test
inspecting the R_RISCV_RVC_BRANCH test
inspecting the R_ADD_32 test
inspecting the R_RISCV_ADD64 test
inspecting the R_SUB_32 test
inspecting the R_RISCV_ADD64 test
inspecting the R_RISCV_RVC_JUMP test
.
----------------------------------------------------------------------
Ran 1 test in 0.000s
We are looking for processor features that may soon be commonplace but that current Ghidra releases do not support well. One such feature involves RISCV extensions to the instruction set architecture, especially vector and bit manipulation extensions. For each such feature we might consider the following questions:
Thread Local Storage (TLS) is a fairly simple feature we can use as an example. Addressing each of the five questions in turn we might find:
TLS relocation codes appear in system libraries like libc. Ghidra often doesn't recognize these codes. Existing analytics like objdump and readelf certainly do recognize TLS codes, but do not pretend to provide semantic aid in interpreting those codes. TLS codes have well documented C source contexts in the form of compiler attributes.

The general design questions boil down to:
The Ghidra design team might assign TLS support a relatively low priority, since the gap doesn’t currently have a large impact. If the incidence and complexity of TLS suddenly increased, the extension of existing Ghidra support could likely increase just as rapidly.
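One quick way to measure that incidence is to count TLS relocation types in `readelf -r` text output; a rough sketch (the relocation names are from the RISC-V psABI):

```python
# Sketch: tally TLS relocation types appearing in `readelf -r` text output,
# a rough measure of how often TLS shows up in a given binary.
TLS_RELOCS = ("R_RISCV_TLS_TPREL64", "R_RISCV_TLS_DTPMOD64", "R_RISCV_TLS_DTPREL64")

def count_tls_relocs(readelf_text):
    counts = {}
    for line in readelf_text.splitlines():
        for name in TLS_RELOCS:
            if name in line:
                counts[name] = counts.get(name, 0) + 1
    return counts
```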
Extensions to Instruction Set Architectures make up a much more complicated example. Standardized instructions for cache management and cryptography are likely easy enough to fold into Ghidra’s framework, but vector instruction extensions will hit harder and sooner, without a clear path forward for Ghidra.
Some of the commonly used terms in this project:

- linker: for example ld on a Linux system. Often generates an ELF file or a kernel image. May rewrite generated code (relaxation) to optimize memory references and so performance.
- sysroot: can be as simple as /usr/include or as complicated as a sysroot/lib/ldscripts holding over 250 ld scripts detailing how a linker should generate code the kernel loader can fully process. Cross-compiler toolchains often need to import a sysroot to build for a given kernel. This can make for a circular dependency.
- toolchain: the tools involved in compiling and linking, whether building a library like libc.so or building an executable application. Note: the word toolchain is often used in this project where compiler suite is intended.

List the current importable and buildable exemplars, their origins, and the Ghidra features they are intended to validate or stress.
Exemplars suitable for Ghidra import are generally collected by platform architecture, such as riscv64/exemplars
or x86_64/exemplars
.
Some are imported from system disk images. Others are locally built from small source code files and an appropriate compiler toolchain.
The initial scope includes Linux-capable RISCV 64 bit systems that might be found in network appliances or ML inference engines. That makes for a local
bias towards privileged code, concurrency management, and performance optimization. That scope expands slightly to x86_64 exemplars that
may help triage issues that show up first in RISCV 64 exemplars.
You can get decent coverage with this set of exemplars:

- libssl.so and libcrypt.so built from source and configured for all standard and frozen crypto, vector, and bit manipulation instruction extensions.
- Packet forwarding examples like l3fwd and l2fwd.

In general, visual inspection of these exemplars after importing into Ghidra should show:
Ghidra will in a few cases disassemble an instruction differently than binutils' objdump. That's fine if it is due to a limitation of Ghidra's SLEIGH language; if alignment to objdump is possible, that's preferable.
Most of the imported large binary exemplars are broken out of available Fedora disk images. The top level acquireExternalExemplars.py script controls this process, sometimes with manual intervention to handle image mounting. Selection of the imported disk image is controlled with text like:
LOGLEVEL = logging.WARN
FEDORA_RISCV_SITE = "http://fedora.riscv.rocks/kojifiles/work/tasks/6900/1466900"
FEDORA_RISCV_IMAGE = "Fedora-Developer-39-20230927.n.0-sda.raw"
FEDORA_KERNEL = "vmlinuz-6.5.4-300.0.riscv64.fc39.riscv64"
FEDORA_KERNEL_OFFSET = 40056
FEDORA_KERNEL_DECOMPRESSED = "vmlinux-6.5.4-300.0.riscv64.fc39.riscv64"
FEDORA_SYSMAP = "System.map-6.5.4-300.0.riscv64.fc39.riscv64"
Warning: the cited Fedora disk image may no longer be maintained. If so, we will replace it with a custom cross-compiled kernel tuned for a hypothetical network appliance.

This exemplar kernel is not an ELF file, so analysis of the import process will need help:

- Import with -processor RISCV:LE:64:RV64IC. This will likely be the same as the processor determined from imported kernel load modules.
- Relocate the image using the System.map file. For example, .text moves from 0x1000 to 0x80001000. Test this by verifying that function start addresses identified in System.map look like actual RISCV-64 kernel functions. Most begin with 16 bytes of no-op instructions to support debugging and tracing operations.
- Mark .text as code by selecting from 0x80001000 to 0x80dfffff and hitting the D key.

Verify that kernel code correctly references data:
- Find panic in System.map: ffffffff80b6b188.
- Navigate to panic and examine the decompiler window:

/* WARNING: Subroutine does not return */
panic(s_Fatal_exception_in_interrupt_813f84f8);
This kernel includes 149 strings including sifive, most of which appear in System.map. It's not immediately clear whether these indicate kernel mods by SiFive or a SiFive SDK kernel module compiled into the kernel.
The kernel currently includes a few RISCV instruction set extensions not handled by Ghidra, and possibly not even by binutils and the gas RISCV assembler. Current Linux kernels can bypass the standard assembler to insert custom or obscure privileged instructions.

This Linux kernel explicitly includes ISA extension code for processors that support those extensions. For example, if the kernel boots up on a processor supporting the _zbb bit manipulation instruction extensions, then the vanilla strlen, strcmp, and strncmp kernel functions are patched out to invoke strlen_zbb, strcmp_zbb, and strncmp_zbb respectively.
This kernel can support up to 64 discrete ISA extensions, of which about 30 are currently defined. It has some support for hybrid processors, where each of the hardware threads (aka ‘harts’) can support a different mix of ISA extensions.
Note: The combination of instruction set extensions and self-modifying privileged code makes fertile ground for Ghidra research. We can expect vector variants of memcpy inline expansion sometime in 2024, significantly complicating cyberanalysis of even the simplest programs.
Kernel modules are typically ELF files compiled as Position Independent Code, often using more varied ELF relocation types for dynamically loading and linking into kernel memory space. This study looks at the igc.ko kernel module for a type of Intel network interface device. Network device drivers can have some of the most time-critical and race-condition-rich behavior, making this class of driver a good exemplar.
RISCV relocation types found in this exemplar include:
R_RISCV_64(2), R_RISCV_BRANCH(16), R_RISCV_JAL(17), R_RISCV_CALL(18), R_RISCV_PCREL_HI20(23), R_RISCV_PCREL_LO12_I(24), R_RISCV_ADD32(35), R_RISCV_ADD64(36), R_RISCV_SUB32(39), R_RISCV_SUB64(40), R_RISCV_RVC_BRANCH(44), and R_RISCV_RVC_JUMP(45)
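Several of these come in pairs: R_RISCV_PCREL_HI20 and R_RISCV_PCREL_LO12_I split a 32-bit pc-relative offset across an auipc and a following addi or load. A sketch of the psABI arithmetic any relocation handler must reproduce:

```python
# Sketch of the R_RISCV_PCREL_HI20 / R_RISCV_PCREL_LO12_I arithmetic:
# the +0x800 rounding keeps the low part within a signed 12-bit immediate.
def pcrel_split(offset):
    hi20 = (offset + 0x800) >> 12
    lo12 = offset - (hi20 << 12)        # always in [-2048, 2047]
    return hi20 & 0xFFFFF, lo12

def pcrel_join(hi20, lo12):
    if hi20 >= 1 << 19:                 # sign-extend the 20-bit field
        hi20 -= 1 << 20
    return (hi20 << 12) + lo12
```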
Open Ghidra’s Relocation Table window and verify that all relocations were applied.
Go to igc_poll, open a decompiler window, and export the function as igc_poll.c. Compare this file with the provided igc_poll_decompiled.c in the visual difftool of your choice (e.g. meld) and check for the presence of lines like:
netdev_printk(&_LC7,*(undefined8 *)(lVar33 + 8),"Unknown Tx buffer type\n");
This statement generates - and provides tests for - at least four relocation types.
The decompiler translates all fence instructions as fence(). This kernel module uses 8 distinct fence instructions to request memory barriers. The sleigh files should probably be extended to show either fence(1,5) or the Linux macro names given in linux/arch/riscv/include/asm/barrier.h.
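The fence operands are two 4-bit masks over {input, output, read, write}. A sketch of the decoding such an extension would need, producing the operand strings objdump prints (Linux's barrier.h maps, e.g., fence rw,rw to smp_mb()):

```python
# Sketch: decode the fence instruction's 4-bit predecessor/successor masks
# (I=8, O=4, R=2, W=1) into the operand form objdump prints.
def fence_mask(bits):
    return "".join(ch for ch, bit in zip("iorw", (8, 4, 2, 1)) if bits & bit)

def fence_str(pred, succ):
    return f"fence {fence_mask(pred)},{fence_mask(succ)}"
```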
System libraries like libc.so and libssl.so typically link to versioned shareable object libraries like libc.so.6 and libssl.so.3.0.5. Ghidra imports RISCV system libraries well.
Relocation types observed include:
R_RISCV_64(2), R_RISCV_RELATIVE(3), R_RISCV_JUMP_SLOT(5), and R_RISCV_TLS_TPREL64(11)
R_RISCV_TLS_TPREL64 is currently unsupported by Ghidra, appearing in the libc.so.6 .got section about 15 times. This relocation type does not appear in libssl.so.3.0.5. It appears in multithreaded applications that use thread-local storage.
The ssh
utility imports cleanly into Ghidra.
Relocation types observed include:
R_RISCV_64(2), R_RISCV_RELATIVE(3), R_RISCV_JUMP_SLOT(5)
Function thunks referencing external library functions do not automatically get the name of the external function propagated into the name of the thunk.
Imported binaries are generally locked into a single platform and a single toolchain. The imported binaries above are built for a SiFive development board, a 64 bit RISCV processor with support for Integer and Compressed instruction sets, and a gcc-13 toolchain. If we want some variation on that, say to look ahead at challenges a gcc-14 toolchain might throw our way, we need to build our own exemplars.
Open source test suites can be a good source for feature-focused importable exemplars. If we want to test Ghidra’s ability to import RISCV instruction set extensions, we want to import many of the files from binutils-gdb/gas/testsuite/gas/riscv.
For example, most of the ratified set of RISCV vector instructions are used in vector-insns.s
.
If we assemble this with a gas assembler compatible with the -march=rv32ifv architecture, we get an importable binary exemplar for those instructions. Even better, we can disassemble that exemplar with a compatible objdump and get the reference disassembly to compare against Ghidra's disassembly.
This gives us three kinds of insights into Ghidra's import capabilities:

- If exemplars assemble cleanly with the binutils gas main branch, they are good candidates for implementation in Ghidra within the next 12 months. This currently includes vector, bit manipulation, cache management, and crypto approved extensions plus about a dozen vendor-specific extensions from AliBaba's THead RISCV server initiative.
- The binutils objdump utility gives us a reference disassembly to compare with Ghidra's. We can minimize arbitrary or erroneous Ghidra disassembly by comparing the two disassembler views.
- Ghidra and objdump have different goals, so we don't need strict alignment of Ghidra with objdump.
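A comparison along those lines might start as simply as normalizing whitespace and diffing line pairs; a sketch:

```python
# Sketch: report (objdump, ghidra) line pairs whose normalized text differs,
# a crude first pass at comparing two disassembly listings.
def diff_disassembly(objdump_lines, ghidra_lines):
    norm = lambda s: " ".join(s.split()).lower()
    return [(a, b) for a, b in zip(objdump_lines, ghidra_lines)
            if norm(a) != norm(b)]
```

A real comparison would first align by address and strip raw opcode bytes, but even this crude pass flags mnemonic and operand disagreements.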
Most exemplars appear as four related files. We can use the vector exemplar as an example:

- riscv64/generated/assemblySamples/vector.S, copied from binutils-gdb/gas/testsuite/gas/riscv/vector-insns.s.
- vector.S is assembled into riscv64/exemplars/vector.o, with a log in riscv64/exemplars/vector.log.
- riscv64/exemplars/vector.o is finally processed by binutils objdump to generate the reference disassembly riscv64/exemplars/vector.objdump.
- riscv64/exemplars/vector.o is then imported into the Ghidra exemplars project, where we can evaluate the import and disassembly results.
Assembly language exemplars usually don’t have any sensible decompilation. C or C++ language exemplars usually do, so that gives the test analyst more to work with.
Another example shows Ghidra’s difficulty with vector optimized code. Compile this C code for the rv64gcv
architecture (RISCV-64 with vector extensions), using the
gcc-14 compiler suite released in May of 2024.
#include <stdio.h>

int main(int argc, char** argv) {
    const int N = 1320;
    char s[N];
    for (int i = 0; i < N - 1; ++i)
        s[i] = i + 1;
    s[N - 1] = '\0';
    printf(s);
}
Ghidra’s 11.0 release decompiles this into:
/* WARNING: Control flow encountered unimplemented instructions */
void main(void)
{
gp = &__global_pointer$;
/* WARNING: Unimplemented instruction - Truncating control flow here */
halt_unimplemented();
}
Try the import again with the isa_ext
experimental branch of Ghidra:
undefined8 main(void)
{
undefined auVar1 [64];
undefined8 uVar2;
undefined (*pauVar3) [64];
long lVar4;
long lVar5;
undefined auVar6 [256];
undefined auVar7 [256];
char local_540 [1319];
undefined uStack_19;
gp = &__global_pointer$;
pauVar3 = (undefined (*) [64])local_540;
lVar4 = 0x527;
vsetvli_e32m1tama(0);
auVar7 = vid_v();
do {
lVar5 = vsetvli(lVar4,0xcf);
auVar6 = vmv1r_v(auVar7);
lVar4 = lVar4 - lVar5;
auVar6 = vncvt_xxw(auVar6);
vsetvli(0,0xc6);
auVar6 = vncvt_xxw(auVar6);
auVar6 = vadd_vi(auVar6,1);
auVar1 = vse8_v(auVar6);
*pauVar3 = auVar1;
uVar2 = vsetvli_e32m1tama(0);
pauVar3 = (undefined (*) [64])(*pauVar3 + lVar5);
auVar6 = vmv_v_x(lVar5);
auVar7 = vadd_vv(auVar7,auVar6);
} while (lVar4 != 0);
uStack_19 = 0;
printf(local_540,uVar2);
return 0;
}
That Ghidra branch decompiles, but the decompilation listing only resembles the C source code if you are familiar with RISCV vector extension instructions.
Repeat the example, this time building with a gcc-13 compiler suite. Ghidra 11.0 does a fine job of decompiling this.
undefined8 main(void)
{
long lVar1;
char acStack_541 [1320];
undefined uStack_19;
gp = &__global_pointer$;
lVar1 = 1;
do {
acStack_541[lVar1] = (char)lVar1;
lVar1 = lVar1 + 1;
} while (lVar1 != 0x528);
uStack_19 = 0;
printf(acStack_541 + 1);
return 0;
}
The Fedora 39 disk image is a good exemplar of endpoint system code. We can supplement that with a custom kernel build. This gives us more flexibility and a peek into future system builds.
Building a custom kernel - with standard kernel modules - requires steps like these:
- Generate the .config kernel configuration file with a command like:

$ PATH=$PATH:/opt/riscvx/bin
$ make ARCH=riscv CROSS_COMPILE=riscv64-unknown-linux-gnu- MY_CFLAGS='-march=rv64gcv_zba_zbb_zbc_zbkb_zbkc_zbkx_zvbb_zvbc' menuconfig

- In the menuconfig view select the architecture-specific features we want. This will likely include platform selections like Vector extension support and Zbb extension support. It may also include Cryptographic API selections like Accelerated Cryptographic Algorithms for CPU (riscv).
- Build the kernel and copy the exemplars:
$ make ARCH=riscv CROSS_COMPILE=riscv64-unknown-linux-gnu- MY_CFLAGS='-march=rv64gcv_zba_zbb_zbc_zbkb_zbkc_zbkx_zvbb_zvbc' all
$ cp vmlinux ~/projects/github/ghidra_import_tests/riscv64/exemplars/vmlinux_6.9rc2
$ cp arch/riscv/crypto/aes-riscv64-zvkned-zvbb-zvkg.o ~/projects/github/ghidra_import_tests/riscv64/exemplars/vmlinux_6.9rc2_aes-riscv64-zvkned-zvbb-zvkg.o
Importing the custom vmlinux kernel into Ghidra 11.1-DEV(isa_ext) shows:

- vector instructions, including vset* variants
- __asm_vector_usercopy, which uses vector loads and stores to copy into user memory spaces
- strcmp_zbb, strlen_zbb, and strncmp_zbb, which can be patched into calls

Importing the aes-riscv64-zvkned-zvbb-zvkg.o object file - presumably available for use in loadable kernel crypto modules - shows vector crypto functions like aes_xts_encrypt_zvkned_zvbb_zvkg and aes_xts_decrypt_zvkned_zvbb_zvkg.
Commit logs for the Linux kernel sources suggest that the riscv vector crypto functions were derived from openssl source code, possibly intended for use in file system encryption and decryption.
A few x86_64 exemplars exist to explore the scope of issues raised by RISCV exemplars. The x86_64/exemplars directory shows how optimizing gcc-14 compilations handle simple loops and built-ins like memcpy for various microarchitectures.
Intel microarchitectures can be grouped into common profiles like x86-64-v2, x86-64-v3, and x86-64-v4. Each has its own set of instruction set extensions, so an optimizing compiler like gcc-14 will autovectorize loops and built-ins differently for each microarchitecture.

The memcpy exemplar set includes source code and three executables compiled from that source code with -march=x86-64-v2, -march=x86-64-v3, and -march=x86-64-v4. The binutils-2.41 objdump disassembly is provided for each executable, for comparison with Ghidra's disassembly window.
x86_64/exemplars$ ls memcpy*
memcpy.c memcpy_x86-64-v2 memcpy_x86-64-v2.objdump memcpy_x86-64-v3 memcpy_x86-64-v3.objdump memcpy_x86-64-v4 memcpy_x86-64-v4.objdump
These exemplars suggest several Ghidra issues, for instance with:

- code generated with -march=x86-64-v4 and -O3
- memcpy or many simple loops compiled with -march=x86-64-v2 or -march=x86-64-v3
.Not all RISCV instruction set extensions are standardized and supported by open source compiler suites. Vendors can generate their own custom extensions. These may be instructions that are proposed for standardization, instructions that predate standardized extensions that are effectively deprecated for new RISCV variants, and (potentially) instructions that are considered non-public licenseable intellectual property.
We have one example of a set of vendor-specific RISC-V extension exemplars that is pending classification. Some of the WCH QingKe 32 bit RISCV processors support what they call extended instruction or XW instructions like c.lbu, c.lhu, c.sb, c.sh, c.lbusp, c.lhusp, c.sbsp, and c.shsp.
The encoding for these custom instructions overlaps other, standardized extensions like Zcd, while some of the instruction mnemonics overlap those of Zcb.

There is no known evidence that these XW instructions are tracked for inclusion in binutils, as other full-custom extensions from the THead Alibaba group are. There is no evidence that these XW instructions are considered licensable or proprietary to WCH (Nanjing Qinheng Microelectronics).
https://github.com/ArcaneNibble has generated a set of binary exemplars for this vendor custom extension. Naming conventions for full-custom extensions are very much To Be Determined. The RISCV binutils toolchain attaches an architecture tag to each ELF file it generates. For these binary exemplars that is:
Tag_RISCV_arch: "rv32i2p0_m2p0_a2p0_f2p0_c2p0_xw2p2"
That architectural tag implies the binaries are for a base RISCV 32 bit processor, with the standard compressed (c) extension version 2.0 and other standard extensions. The vendor custom (x) extension (w) version 2.2 (2p2) is enabled. The Zcd and Zcb extensions are explicitly not enabled, so there is no conflict with either assembly or disassembly of the instructions.
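The tag follows a regular pattern - extension name, then MAJORpMINOR - so 2p2 reads as version 2.2. A sketch of a parser for such tags:

```python
import re

# Sketch: split a Tag_RISCV_arch string into (extension, version) pairs.
# The NpM suffix encodes version N.M, so "xw2p2" is extension "xw" at 2.2.
def parse_arch_tag(tag):
    base = re.match(r"rv(32|64)", tag)
    pairs = []
    for part in tag[base.end():].split("_"):
        m = re.fullmatch(r"([a-z]+?)(\d+)p(\d+)", part)
        if m:
            pairs.append((m.group(1), f"{m.group(2)}.{m.group(3)}"))
    return pairs
```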
These exemplars are currently filed under riscv64/exemplars as:
custom
└── wch
├── lbu.S
├── lbusp.S
├── lhu.S
├── lhusp.S
├── sb.S
├── sbsp.S
├── sh.S
├── shsp.S
├── w2p2-lbu.o
├── w2p2-lbusp.o
├── w2p2-lhu.o
├── w2p2-lhusp.o
├── w2p2-sb.o
├── w2p2-sbsp.o
├── w2p2-sh.o
└── w2p2-shsp.o
Explore analysis of a machine learning application built with large language model techniques. What Ghidra gaps does such an analysis reveal?
How might we inspect a machine-learning application for malware? For example, suppose someone altered the automatic speech recognition library whisper.cpp. Would Ghidra be able to cope with the instruction set extensions used to accelerate ML inference engines? What might be added to Ghidra to help the human analyst in this kind of inspection?
Components for this exercise:
isa_ext
branch for RISCV-64 supportwhisper_cpp_vendor
built with RISCV-64 gcc-14 toolchain and the whisper.cpp 1.5.4 release.
whisper_cpp_*
built locally with other RISCV-64 gcc toolchainsQuestions to address:
whisper_cpp_vendor
materially hurt Ghidra 11.0 analysis?whisper_cpp_vendor
and the non-vector build whisper_cpp_default
whisper_cpp_vendor
that Ghidra users should be able to recognize?isa_ext
branch?whisper_cpp_vendor
build, does this materially hurt Ghidra 11.0 analysis?There are a lot of variables in this exercise. Some are important, most are not.
Starting with the baseline Ghidra 11.0, examine a locally built whisper_cpp_default, an ELF 64-bit LSB executable built with gcc-13.2.1. Import and perform standard analyses to get these statistics:

- errors involving the fence.tso instruction extension

Now examine whisper_cpp_vendor (built with gcc 14 rather than gcc 13) with the baseline Ghidra 11.0:
Examine whisper_cpp_vendor with the isa_ext branch of 11.1-DEV:

- errors involving the fence.tso instruction extension

Next apply a manual correction to whisper_cpp_vendor, selecting the entire .text segment and forcing disassembly, then clearing any unreachable 0x00 bytes:

- vset* instructions usually found in vector code
- gather instructions
- custom instructions

Finally, reset the 'language' of whisper_cpp_vendor to match the vendor (THead, for this exercise).
The 3562 custom instructions resolve to:
| Instruction | Count | Semantics |
|---|---|---|
| th.ext* | 151 | Sign extract and extend |
| th.ldd | 1719 | Load 2 doublewords |
| th.lwd | 10 | Load 2 words |
| th.sdd | 1033 | Store 2 doublewords |
| th.swd | 16 | Store 2 words |
| th.mula | 284 | Multiply-add |
| th.muls | 67 | Multiply-subtract |
| th.mveqz | 127 | Move if == 0 |
| th.mvneqz | 154 | Move if != 0 |
This leads to some tentative next steps:

- Adding fence.tso to Ghidra looks like a simple small win, and a perfect place to start.
- Vector instructions make the isa_ext branch necessary.
- gather instructions are unexpectedly prevalent.
- vset* instruction blocks may reveal some key patterns to recognize first.

Note: fence.tso is now recognized in the Ghidra 11.1-DEV branch isa_ext, clearing the bad instruction errors.
At the highest level, what features of whisper.cpp generate vector instructions?

- Explicit vector intrinsics, for instance in ggml-quants.c. In these cases the developer has explicitly managed the vectorization.
- Compiler expansion of memcpy or structure copies into vector load-store loops.

Much of whisper.cpp involves vector, matrix, or tensor math using ggml math functions. This is also where most of the explicit RISCV vector intrinsic C functions appear, and likely the code the developer believes is most in need of vector performance enhancements.
ggml_vec_dot_f32(n, sum, x, y) generates the vector dot product of two vectors x and y of length n, with the result stored to *sum. In the absence of vector or SIMD support the source code is:
// scalar (s is the float* output parameter, called sum above)
double sumf = 0.0;
for (int i = 0; i < n; ++i) {
    sumf += (double)(x[i]*y[i]);
}
*s = sumf;
GCC-14 will autovectorize this into something Ghidra decompiles like this (comments added after //
):
void ggml_vec_dot_f32(long n,float *s,float *x,float *y)
{
long step;
double dVar1;
undefined auVar2 [256];
undefined in_v2 [256];
undefined auVar3 [256];
gp = &__global_pointer$;
if (0 < n) {
vsetvli_e64m1tama(0);
vmv_v_i(in_v2,0); // v2 = 0
do {
step = vsetvli(n,0x97); // vsetvli a5,a0,e32,mf2,tu,ma
n = n - step;
auVar3 = vle32_v(x); // v3 = *x (slice of size step)
auVar2 = vle32_v(y); // v1 = *y (slice of size step)
x = (float *)sh2add(step,x); // x = x + step
auVar2 = vfmul_vv(auVar2,auVar3); // v1 = v1 * v3
y = (float *)sh2add(step,y); // y = y + step
in_v2 = vfwadd_wv(in_v2,auVar2); // v2 = v1 + v2
} while (n != 0);
vsetvli_e64m1tama(0);
auVar2 = vfmv_sf(0); // v1[0] = 0
auVar2 = vfredusum_vs(in_v2,auVar2); // v1[0] = sum(v2)
dVar1 = (double)vfmv_fs(auVar2); // dvar1 = v1[0]
*s = (float)dVar1;
return;
}
*s = 0.0;
return;
}
Inspecting this disassembly and decompilation suggests several top down issues:

- Zba instructions like sh2add are simple and their semantics should be explicit: sh2add(a, b) = (a << 2) + b.
- The vsetvli(n,0x97) instruction should be expanded to show its semantics, e.g. as vsetvli_e32mf2tuma.
- The key operations to surface are the vector product v2 = x * y and the reduction dvar1 = reduce(+, v2).
The path forward may be to manually analyze several examples from whisper.cpp
, extending and revising Ghidra’s semantics
and decompiler to add a bit of clarity each time.
Autovectorization can generate complicated code when the compiler has no knowledge of the number of elements in a vector or the number of elements that can fit within single vector register.
A good example is from:
ggml_tensor * ggml_new_tensor_impl(
struct ggml_context * ctx,
enum ggml_type type,
int n_dims,
const int64_t * ne,
struct ggml_tensor * view_src,
size_t view_offs) {
...
size_t data_size = ggml_row_size(type, ne[0]);
for (int i = 1; i < n_dims; i++) {
data_size *= ne[i];
}
}
The ne vector typically has up to 4 elements, so this loop will be executed at most once. The compiler doesn't know this, so it autovectorizes the loop into something more complex:
undefined4 * ggml_new_tensor(ggml_context *ctx,undefined8 type,long ndims,int64_t *ne)
{
...
data_size = ggml_row_size(type,*ne); // get the first dimension ne[0]
lVar6 = 1;
if (1 < ndims) {
uVar2 = (int)ndims - 1;
if (1 < (int)ndims - 2U) { // if ndims > 3 process two at a time
piVar7 = ne + 1; // starting with ne[1] and ne[2]
piVar4 = piVar7 + (long)(int)(uVar2 >> 1) * 2;
vsetivli_e64m1tamu(2); //vector length = 2, 64 bit element, tail agnostic mask unchanged
vmv_v_i(in_v1,1); // v1 = (1,1)
do {
auVar10 = vle64_v(piVar7);
piVar7 = piVar7 + 2;
in_v1 = vmul_vv(in_v1,auVar10); // v1 = v1 * ne[slice]
} while (piVar4 != piVar7);
auVar10 = vid_v(); // v2 = (0,1)
vmv_v_i(in_v4,0); // v4 = (0,0)
auVar11 = vadd_vi(auVar10,1); // v2 = v2 + 1 = (1,2)
auVar10 = vmsgtu_vi(auVar11,1); // v0 = (v2 > 1) = (0, 1)
vrgather_vv(in_v1,auVar11); // v3 = gather(v1, v2) => v3=v1[v2] = (v1[1], 0)
auVar11 = vadd_vi(auVar11,0xfffffffffffffffe); // v2 = v2 - 2 = (-1,0)
auVar10 = vrgather_vv(in_v4,auVar11,auVar10); // v3 = gather_masked(v4,v2,v0.t) = (v3[0], v4[0])
auVar10 = vmul_vv(auVar10,in_v1); // v3 = v3 * v1
vmv_x_s(in_v14,auVar10); // a4 = v3[0]
data_size = data_size * (long)piVar4; // data_size = data_size * a4
if ((uVar2 & 1) == 0) goto LAB_00074a80;
lVar6 = (long)(int)((uVar2 & 0xfffffffe) + 1);
}
plVar5 = (long *)sh3add(lVar6,ne); // multiply by one or two
data_size = data_size * *plVar5; //
if ((int)lVar6 + 1 < ndims) {
data_size = data_size * plVar5[1];
}
}
...
}
That’s a very confusing way to multiply at most four integers. If ne has 1, 2, or 3 elements then no vector instructions are processed at all. If it has 4 elements then the first and last one or two are handled with scalar math while pairs of elements are accumulated in the loop. The gather instructions are used together to generate a mask and then multiply the two elements of vector v1, leaving the result in the first element slot of vector v4.
This particular loop vectorization is likely to change a lot in future releases. The performance impact is negligible either way. The analyst may look at code like this and decide to ignore the ndims>3 case along with all of the vector instructions used within it. Alternatively, we could look at the gcc vectorization code handling the general vector reduction meta operation, then see if this pattern is a macro of some sort within it.
Take a step back and look at the gcc RISCV autovectorization code. It’s changing quite frequently, so it’s probably premature to try and abstract out
loop reduction models that we can get Ghidra to recognize. When that happens we might draw source exemplars from
gcc/gcc/testsuite/gcc.target/riscv/rvv/autovec
and build a catalog of source pattern to instruction pattern expansions.
The previous example showed an overly aggressive autovectorization of a simple loop. Here we look at source code that the developer has decided is important enough to directly code in RISCV intrinsic C functions. The function ggml_vec_dot_q5_0_q8_0 is one such function, with separate implementations for ARM_NEON, wasm_simd128, AVX2, AVX, and riscv_v_intrinsic. If none of those accelerators are available, a scalar implementation is used instead:
void ggml_vec_dot_q5_0_q8_0(const int n, float * restrict s, const void * restrict vx, const void * restrict vy) {
    const int qk = QK8_0;
    const int nb = n / qk;

    assert(n % qk == 0);
    assert(qk == QK5_0);

    const block_q5_0 * restrict x = vx;
    const block_q8_0 * restrict y = vy;

    // scalar
    float sumf = 0.0;
    for (int i = 0; i < nb; i++) {
        uint32_t qh;
        memcpy(&qh, x[i].qh, sizeof(qh));

        int sumi = 0;
        for (int j = 0; j < qk/2; ++j) {
            const uint8_t xh_0 = ((qh & (1u << (j + 0 ))) >> (j + 0 )) << 4;
            const uint8_t xh_1 = ((qh & (1u << (j + 16))) >> (j + 12));

            const int32_t x0 = ((x[i].qs[j] & 0x0F) | xh_0) - 16;
            const int32_t x1 = ((x[i].qs[j] >> 4) | xh_1) - 16;

            sumi += (x0 * y[i].qs[j]) + (x1 * y[i].qs[j + qk/2]);
        }

        sumf += (GGML_FP16_TO_FP32(x[i].d)*GGML_FP16_TO_FP32(y[i].d)) * sumi;
    }

    *s = sumf;
}
The RISCV intrinsic source is:
Note: added comments are flagged with ///
void ggml_vec_dot_q5_0_q8_0(const int n, float * restrict s, const void * restrict vx, const void * restrict vy) {
    const int qk = QK8_0;  /// QK8_0 = 32
    const int nb = n / qk;

    assert(n % qk == 0);
    assert(qk == QK5_0);   /// QK5_0 = 32

    const block_q5_0 * restrict x = vx;
    const block_q8_0 * restrict y = vy;

    float sumf = 0.0;
    uint32_t qh;

    size_t vl = __riscv_vsetvl_e8m1(qk/2);

    // These temporary registers are for masking and shift operations
    vuint32m2_t vt_1 = __riscv_vid_v_u32m2(vl);
    vuint32m2_t vt_2 = __riscv_vsll_vv_u32m2(__riscv_vmv_v_x_u32m2(1, vl), vt_1, vl);
    vuint32m2_t vt_3 = __riscv_vsll_vx_u32m2(vt_2, 16, vl);
    vuint32m2_t vt_4 = __riscv_vadd_vx_u32m2(vt_1, 12, vl);

    for (int i = 0; i < nb; i++) {
        memcpy(&qh, x[i].qh, sizeof(uint32_t));

        // ((qh & (1u << (j + 0 ))) >> (j + 0 )) << 4;
        vuint32m2_t xha_0 = __riscv_vand_vx_u32m2(vt_2, qh, vl);
        vuint32m2_t xhr_0 = __riscv_vsrl_vv_u32m2(xha_0, vt_1, vl);
        vuint32m2_t xhl_0 = __riscv_vsll_vx_u32m2(xhr_0, 4, vl);

        // ((qh & (1u << (j + 16))) >> (j + 12));
        vuint32m2_t xha_1 = __riscv_vand_vx_u32m2(vt_3, qh, vl);
        vuint32m2_t xhl_1 = __riscv_vsrl_vv_u32m2(xha_1, vt_4, vl);

        // narrowing
        vuint16m1_t xhc_0 = __riscv_vncvt_x_x_w_u16m1(xhl_0, vl);
        vuint8mf2_t xh_0 = __riscv_vncvt_x_x_w_u8mf2(xhc_0, vl);
        vuint16m1_t xhc_1 = __riscv_vncvt_x_x_w_u16m1(xhl_1, vl);
        vuint8mf2_t xh_1 = __riscv_vncvt_x_x_w_u8mf2(xhc_1, vl);

        // load
        vuint8mf2_t tx = __riscv_vle8_v_u8mf2(x[i].qs, vl);
        vint8mf2_t y0 = __riscv_vle8_v_i8mf2(y[i].qs, vl);
        vint8mf2_t y1 = __riscv_vle8_v_i8mf2(y[i].qs+16, vl);

        vuint8mf2_t x_at = __riscv_vand_vx_u8mf2(tx, 0x0F, vl);
        vuint8mf2_t x_lt = __riscv_vsrl_vx_u8mf2(tx, 0x04, vl);

        vuint8mf2_t x_a = __riscv_vor_vv_u8mf2(x_at, xh_0, vl);
        vuint8mf2_t x_l = __riscv_vor_vv_u8mf2(x_lt, xh_1, vl);

        vint8mf2_t x_ai = __riscv_vreinterpret_v_u8mf2_i8mf2(x_a);
        vint8mf2_t x_li = __riscv_vreinterpret_v_u8mf2_i8mf2(x_l);

        vint8mf2_t v0 = __riscv_vsub_vx_i8mf2(x_ai, 16, vl);
        vint8mf2_t v1 = __riscv_vsub_vx_i8mf2(x_li, 16, vl);

        vint16m1_t vec_mul1 = __riscv_vwmul_vv_i16m1(v0, y0, vl);
        vint16m1_t vec_mul2 = __riscv_vwmul_vv_i16m1(v1, y1, vl);

        vint32m1_t vec_zero = __riscv_vmv_v_x_i32m1(0, vl);

        vint32m1_t vs1 = __riscv_vwredsum_vs_i16m1_i32m1(vec_mul1, vec_zero, vl);
        vint32m1_t vs2 = __riscv_vwredsum_vs_i16m1_i32m1(vec_mul2, vs1, vl);

        int sumi = __riscv_vmv_x_s_i32m1_i32(vs2);

        sumf += (GGML_FP16_TO_FP32(x[i].d)*GGML_FP16_TO_FP32(y[i].d)) * sumi;
    }

    *s = sumf;
}
Ghidra’s 11.1 isa_ext rendering of this is (after minor parameter name propagation):
long ggml_vec_dot_q5_0_q8_0(ulong n,float *s,void *vx,void *vy)
{
ushort *puVar1;
long lVar2;
long lVar3;
long lVar4;
long lVar5;
undefined8 uVar6;
int i;
float fVar7;
undefined auVar8 [256];
undefined auVar9 [256];
undefined auVar10 [256];
undefined auVar11 [256];
undefined in_v7 [256];
undefined in_v8 [256];
undefined auVar12 [256];
undefined auVar13 [256];
undefined auVar14 [256];
undefined auVar15 [256];
undefined auVar16 [256];
int iStack_4;
gp = &__global_pointer$;
uVar6 = vsetivli(0x10,0xc0);
vsetvli(uVar6,0xd1);
auVar13 = vid_v();
vmv_v_i(in_v8,1);
auVar15 = vadd_vi(auVar13,0xc);
auVar12 = vsll_vv(in_v8,auVar13);
auVar14 = vsll_vi(auVar12,0x10);
if (0x1f < (long)n) {
fVar7 = 0.0;
vsetvli_e32m1tama(uVar6);
lVar3 = (long)vx + 2;
lVar4 = (long)vy + 2;
i = 0;
vmv_v_i(in_v7,0);
vsetivli(4,0xc6);
do {
auVar8 = vle8_v(lVar3);
vse8_v(auVar8,&iStack_4);
puVar1 = (ushort *)(lVar4 + -2);
vsetvli(uVar6,0xd1);
lVar2 = lVar3 + 4;
auVar8 = vle8_v(lVar2);
auVar9 = vand_vx(auVar12,(long)iStack_4);
auVar9 = vsrl_vv(auVar9,auVar13);
vsetvli(0,199);
auVar11 = vand_vi(auVar8,0xf);
vsetvli(0,0xd1);
auVar9 = vsll_vi(auVar9,4);
vsetvli(0,199);
auVar8 = vsrl_vi(auVar8,4);
vsetvli(0,200);
auVar9 = vncvt_xxw(auVar9);
auVar16 = vle8_v(lVar4);
vsetvli(0,199);
auVar9 = vncvt_xxw(auVar9);
vsetvli(0,0xd1);
auVar10 = vand_vx(auVar14,(long)iStack_4);
vsetvli(0,199);
auVar11 = vor_vv(auVar11,auVar9);
vsetvli(0,0xd1);
auVar9 = vsrl_vv(auVar10,auVar15);
vsetvli(0,199);
auVar10 = vadd_vi(auVar11,0xfffffffffffffff0);
vsetvli(0,200);
auVar9 = vncvt_xxw(auVar9);
vsetvli(0,199);
auVar10 = vwmul_vv(auVar10,auVar16);
auVar9 = vncvt_xxw(auVar9);
vsetvli(0,200);
lVar5 = lVar4 + 0x10;
auVar10 = vwredsum_vs(auVar10,in_v7);
vsetvli(0,199);
auVar8 = vor_vv(auVar8,auVar9);
auVar9 = vle8_v(lVar5);
auVar8 = vadd_vi(auVar8,0xfffffffffffffff0);
auVar8 = vwmul_vv(auVar8,auVar9);
vsetvli(0,200);
auVar8 = vwredsum_vs(auVar8,auVar10);
vsetivli(4,0xd0);
vmv_x_s(auVar15,auVar8);
i = i + 1;
lVar4 = lVar4 + 0x22;
fVar7 = (float)(&ggml_table_f32_f16)[*puVar1] *
(float)(&ggml_table_f32_f16)[*(ushort *)(lVar3 + -2)] * (float)(int)lVar5 + fVar7;
lVar3 = lVar3 + 0x16;
} while (i < (int)(((uint)((int)n >> 0x1f) >> 0x1b) + (int)n) >> 5);
*s = fVar7;
return lVar2;
}
*s = 0.0;
return n;
}
It looks like the developer unrolled an inner loop and used the LMUL multiplier to help reduce the loop iterations. The immediate action item for us may be to add more explicit decodings for vsetvli and vsetivli, or look for existing processor-specific decoders in the Ghidra decompiler.
Let’s take a glance at the x86_64 build of whisper. First copy whisper-cpp.BUILD into the x86_64 workspace then build the executable with two architectures:
$ bazel build --platforms=//platforms:x86_64_default --copt="-march=x86-64-v3" @whisper_cpp//:main
...
$ cp bazel-bin/external/whisper_cpp/main ../exemplars/whisper_cpp_x86-64-v3
...
$ bazel build --platforms=//platforms:x86_64_default --copt="-march=x86-64-v4" @whisper_cpp//:main
...
$ cp bazel-bin/external/whisper_cpp/main ../exemplars/whisper_cpp_x86-64-v4
Load these into Ghidra 11.1-DEV. The x86-64-v4 build is useless in Ghidra, since that newer microarchitecture uses a different class of x86_64 vector extensions that Ghidra doesn’t recognize. The x86-64-v3 build looks accessible.
Try an x86_64 build with the local compiler (Fedora 39 default compiler) and Link Time Optimization enabled:
$ bazel build --copt="-march=x86-64-v3" --copt="-flto" --linkopt="-Wl,-flto" @whisper_cpp//:main
...
$ cp bazel-bin/external/whisper_cpp/main ../exemplars/whisper_cpp_x86-64-v3-lto
We’ll leave the differential analysis of link time optimization for another day. A couple of quick notes are worthwhile here:
ggml_new_tensor no longer exists in the binary. Instead we get ggml_new_tensor_impl.constprop.0, ggml_new_tensor_impl.constprop.2, and ggml_new_tensor_impl.constprop.3. This suggests BSIM could get confused with intermediate functions if trying to connect binaries built with and without LTO.

Intel’s DPDK framework supports some Intel and Arm Neon vector instructions. What does it do for RISCV extensions? Are ISA extensions materially useful to a network appliance?
Check out DPDK from GitHub, patch the RISCV configuration with riscv64/generated/dpdk/dpdk.pat, and cross-compile with meson and ninja. Copy some of the examples into riscv64/exemplars and examine them in Ghidra.
Configure a build directory:
$ patch -p1 < .../riscv64/generated/dpdk/dpdk.pat
$ meson setup build --cross-file config/riscv/riscv64_linux_gcc -Dexamples=all
$ cd build
Edit build/build.ninja, replacing -ldl with /opt/riscvx/lib/libdl.a - you should see about 235 replacements. Build with:
$ ninja -C build
Check the cross-compilation with:
$ readelf -A build/examples/dpdk-l3fwd
Attribute Section: riscv
File Attributes
Tag_RISCV_stack_align: 16-bytes
Tag_RISCV_arch: "rv64i2p1_m2p0_a2p1_f2p2_d2p2_c2p0_v1p0_zicsr2p0_zifencei2p0_zmmul1p0_zba1p0_zbb1p0_zbc1p0_zbkb1p0_zbkc1p0_zbkx1p0_zvbb1p0_zvbc1p0_zve32f1p0_zve32x1p0_zve64d1p0_zve64f1p0_zve64x1p0_zvkb1p0_zvl128b1p0_zvl32b1p0_zvl64b1p0"
Tag_RISCV_priv_spec: 1
Tag_RISCV_priv_spec_minor: 11
Import into Ghidra 11.1-DEV (isa_ext), noting multiple import warning messages.
The source code has explicit parallel coding for Intel and Arm/Neon architectures, for instance within the basic layer 3 forwarding example:
struct acl_algorithms acl_alg[] = {
{
.name = "scalar",
.alg = RTE_ACL_CLASSIFY_SCALAR,
},
{
.name = "sse",
.alg = RTE_ACL_CLASSIFY_SSE,
},
{
.name = "avx2",
.alg = RTE_ACL_CLASSIFY_AVX2,
},
{
.name = "neon",
.alg = RTE_ACL_CLASSIFY_NEON,
},
{
.name = "altivec",
.alg = RTE_ACL_CLASSIFY_ALTIVEC,
},
{
.name = "avx512x16",
.alg = RTE_ACL_CLASSIFY_AVX512X16,
},
{
.name = "avx512x32",
.alg = RTE_ACL_CLASSIFY_AVX512X32,
},
};
If avx512x32 is selected, then basic trie search and other operations can proceed across 32 flows in parallel.
Other examples exist within the code. For more information, search the doc
directory for SIMD
. Check out lib/fib/trie_avx512.c
and the function trie_vec_lookup_x16x2
for an AVX manual vectorization of a trie address to next hop lookup.
Note: It won’t be clear for some time which vectorization transforms actually improve performance on specific processors. Vector support adds a lot of local register space but vector loads and stores can saturate memory bandwidth and drive up processor temperature. We might see earlier adoption in contexts that tolerate higher latency, like firewalls, rather than low-latency switches and routers.
DPDK includes ML contributions from Marvell. See doc/guides/mldevs/cnxk.rst
for more information and references to the cnxk support.
Source code support exists under drivers/ml/cnxk
and lib/mldev
. lib/mldev/rte_mldev.c
may provide some insight into how
Marvell expects DPDK users to apply their component.
See the Marvell Octeon 10 white paper for some possible applications.
The DPDK exemplars stress test Ghidra in multiple ways: the importer rejects R_RISCV_COPY relocations, claiming “Runtime copy is not supported”.
Summarize Ghidra import issues here to promote discussion on relative priority and possible solutions.
Thread local storage provides one instance of the variable per extant thread. GCC supports this feature as:
__thread char threadLocalString[4096];
Binaries built with this feature will often include ELF relocation codes like R_RISCV_TLS_TPREL64. This relocation code is not recognized by Ghidra, nor is it clear how TLS storage itself should be handled within Ghidra - perhaps as a memory section akin to BSS?
To reproduce, import libc.so.6 and look for lines like Elf Relocation Warning: Type = R_RISCV_TLS_TPREL64. Alternatively, compile, link, and import riscv64/generated/userSpaceSamples/relocationTest.c.
Newer C compiler releases can replace simple loops and standard C library invocations with processor-specific vector instructions. These vector instructions can be handled poorly by Ghidra’s disassembler and worse by Ghidra’s decompiler. See autovectorization and vector_intrinsics for examples.
How can we do a better job of recognizing semantic patterns in optimized code? Instruction set extensions make that more challenging.
Ghidra users want to understand the intent of binary code. The semantics
and intent of the memcpy(dest,src,nbytes)
operation are pretty clear.
If the compiler converts this into a call to a named external function,
that’s easy to recognize. If it converts it into a simple inline loop of load and
store instructions, that should be recognizable too.
Optimizing compilers like gcc can generate many different instruction sequences from
a simple concept like memcpy
or strnlen
, especially if the processor for which the code is intended
supports advanced vector or bit manipulation instructions. We can examine the compiler
testsuite to see what those patterns can be, enabling either human or machine translation of
those sequences into the higher level semantics of memory movement or finding the first null
byte in a string.
GCC recognizes memory copy operations via the cpymem operation. Calls to the standard library memcpy and various kinds of struct copies are translated into this RTL (Register Transfer Language) cpymem token. The processor-specific gcc back end then expands cpymem into a half-dozen or so instruction patterns, depending on size, alignment, and instruction set extensions of the target processor.
In the ideal world, Ghidra would recognize all of those RTL operations as pcode operations, and further recognize
all of the common back end expansions for all processor variants. It might rewrite the decompiler window or simply
add comments indicating a likely cpymem
pcode expansion.
It’s enough for now to show how to gather the more common patterns to help human Ghidra operators untangle these optimizations and understand the simpler semantics they encode.
Note: This example uses RISCV vector optimization - many other optimizations are supported by gcc too.
Maybe the best reference on gcc vectorization is the gcc source code itself.
Start with:
gcc/config/riscv/riscv-vector-builtins.cc
gcc/config/riscv/riscv-vector-strings.cc
gcc/config/riscv/autovec.md
gcc/config/riscv/riscv-string.cc
Ghidra semantics use pcode operations. GCC uses something similar: RTL (Register Transfer Language) operations, described in gcc/doc/md.texi. These include:
cpymem
setmem
strlen
rawmemchr
cmpstrn
cmpstr
The cpymem op covers inline calls to memcpy and structure copies. Trace this out in riscv.md:
(define_expand "cpymem<mode>"
  [(parallel [(set (match_operand:BLK 0 "general_operand")
                   (match_operand:BLK 1 "general_operand"))
              (use (match_operand:P 2 ""))
              (use (match_operand:SI 3 "const_int_operand"))])]
  ""
{
  if (riscv_expand_block_move (operands[0], operands[1], operands[2]))
    DONE;
  else
    FAIL;
})
riscv_expand_block_move is also mentioned in riscv-protos.h and riscv-string.cc. Look into riscv-string.cc:
/* This function delegates block-move expansion to either the vector
   implementation or the scalar one. Return TRUE if successful or FALSE
   otherwise. */
bool
riscv_expand_block_move (rtx dest, rtx src, rtx length)
{
  if (TARGET_VECTOR && stringop_strategy & STRATEGY_VECTOR)
    {
      bool ok = riscv_vector::expand_block_move (dest, src, length);
      if (ok)
        return true;
    }
  if (stringop_strategy & STRATEGY_SCALAR)
    return riscv_expand_block_move_scalar (dest, src, length);
  return false;
  ...
}
...
/* --- Vector expanders --- */
namespace riscv_vector {
/* Used by cpymemsi in riscv.md . */
bool
expand_block_move (rtx dst_in, rtx src_in, rtx length_in)
{
  /*
    memcpy:
        mv a3, a0                       # Copy destination
    loop:
        vsetvli t0, a2, e8, m8, ta, ma  # Vectors of 8b
        vle8.v v0, (a1)                 # Load bytes
        add a1, a1, t0                  # Bump pointer
        sub a2, a2, t0                  # Decrement count
        vse8.v v0, (a3)                 # Store bytes
        add a3, a3, t0                  # Bump pointer
        bnez a2, loop                   # Any more?
        ret                             # Return
  */
}
}
Note that the RISCV assembly instructions in the comment are just an example, and that the C++ implementation
handles many different variants. The ret
instruction is not part of the expansion, just copied into the source
code from the testsuite.
The testsuite (gcc/testsuite/gcc.target/riscv
) shows which variants are common enough to test against.
void f1 (void *a, void *b, __SIZE_TYPE__ l)
{
memcpy (a, b, l);
}
/*
** f1:
XX \.L\d+: # local label is ignored
** vsetvli\s+[ta][0-7],a2,e8,m8,ta,ma
** vle8\.v\s+v\d+,0\(a1\)
** vse8\.v\s+v\d+,0\(a0\)
** add\s+a1,a1,[ta][0-7]
** add\s+a0,a0,[ta][0-7]
** sub\s+a2,a2,[ta][0-7]
** bne\s+a2,zero,\.L\d+
** ret
*/
void f2 (__INT32_TYPE__* a, __INT32_TYPE__* b, int l)
{
memcpy (a, b, l);
}
Additional type information doesn’t appear to affect the inline code:
/*
** f2:
XX \.L\d+: # local label is ignored
** vsetvli\s+[ta][0-7],a2,e8,m8,ta,ma
** vle8\.v\s+v\d+,0\(a1\)
** vse8\.v\s+v\d+,0\(a0\)
** add\s+a1,a1,[ta][0-7]
** add\s+a0,a0,[ta][0-7]
** sub\s+a2,a2,[ta][0-7]
** bne\s+a2,zero,\.L\d+
** ret
*/
In this case arguments are aligned and 512 bytes in length.
extern struct { __INT32_TYPE__ a[16]; } a_a, a_b;
void f3 ()
{
memcpy (&a_a, &a_b, sizeof a_a);
}
The generated sequence varies depending on how much the compiler knows about the target architecture.
** f3: { target { { any-opts "-mcmodel=medlow" } && { no-opts "-march=rv64gcv_zvl512b" "-march=rv64gcv_zvl1024b" "--param=riscv-autovec-lmul=dynamic" "--param=riscv-autovec-lmul=m2" "--param=riscv-autovec-lmul=m4" "--param=riscv-autovec-lmul=m8" "--param=riscv-autovec-preference=fixed-vlmax" } } }
** lui\s+[ta][0-7],%hi\(a_a\)
** addi\s+[ta][0-7],[ta][0-7],%lo\(a_a\)
** lui\s+[ta][0-7],%hi\(a_b\)
** addi\s+a4,[ta][0-7],%lo\(a_b\)
** vsetivli\s+zero,16,e32,m8,ta,ma
** vle32.v\s+v\d+,0\([ta][0-7]\)
** vse32\.v\s+v\d+,0\([ta][0-7]\)
** ret
** f3: { target { { any-opts "-mcmodel=medlow --param=riscv-autovec-preference=fixed-vlmax" "-mcmodel=medlow -march=rv64gcv_zvl512b --param=riscv-autovec-preference=fixed-vlmax" } && { no-opts "-march=rv64gcv_zvl1024b" } } }
** lui\s+[ta][0-7],%hi\(a_a\)
** lui\s+[ta][0-7],%hi\(a_b\)
** addi\s+[ta][0-7],[ta][0-7],%lo\(a_a\)
** addi\s+a4,[ta][0-7],%lo\(a_b\)
** vl(1|4|2)re32\.v\s+v\d+,0\([ta][0-7]\)
** vs(1|4|2)r\.v\s+v\d+,0\([ta][0-7]\)
** ret
** f3: { target { { any-opts "-mcmodel=medlow -march=rv64gcv_zvl1024b" "-mcmodel=medlow -march=rv64gcv_zvl512b" } && { no-opts "--param=riscv-autovec-preference=fixed-vlmax" } } }
** lui\s+[ta][0-7],%hi\(a_a\)
** lui\s+[ta][0-7],%hi\(a_b\)
** addi\s+a4,[ta][0-7],%lo\(a_b\)
** vsetivli\s+zero,16,e32,(m1|m4|mf2),ta,ma
** vle32.v\s+v\d+,0\([ta][0-7]\)
** addi\s+[ta][0-7],[ta][0-7],%lo\(a_a\)
** vse32\.v\s+v\d+,0\([ta][0-7]\)
** ret
** f3: { target { { any-opts "-mcmodel=medany" } && { no-opts "-march=rv64gcv_zvl512b" "-march=rv64gcv_zvl256b" "-march=rv64gcv_zvl1024b" "--param=riscv-autovec-lmul=dynamic" "--param=riscv-autovec-lmul=m8" "--param=riscv-autovec-lmul=m4" "--param=riscv-autovec-preference=fixed-vlmax" } } }
** lla\s+[ta][0-7],a_a
** lla\s+[ta][0-7],a_b
** vsetivli\s+zero,16,e32,m8,ta,ma
** vle32.v\s+v\d+,0\([ta][0-7]\)
** vse32\.v\s+v\d+,0\([ta][0-7]\)
** f3: { target { { any-opts "-mcmodel=medany" } && { no-opts "-march=rv64gcv_zvl512b" "-march=rv64gcv_zvl256b" "-march=rv64gcv" "-march=rv64gc_zve64d" "-march=rv64gc_zve32f" } } }
** lla\s+[ta][0-7],a_b
** vsetivli\s+zero,16,e32,m(f2|1|4),ta,ma
** vle32.v\s+v\d+,0\([ta][0-7]\)
** lla\s+[ta][0-7],a_a
** vse32\.v\s+v\d+,0\([ta][0-7]\)
** ret
*/
** f3: { target { { any-opts "-mcmodel=medany --param=riscv-autovec-preference=fixed-vlmax" } && { no-opts "-march=rv64gcv_zvl1024b" } } }
** lla\s+[ta][0-7],a_a
** lla\s+[ta][0-7],a_b
** vl(1|2|4)re32\.v\s+v\d+,0\([ta][0-7]\)
** vs(1|2|4)r\.v\s+v\d+,0\([ta][0-7]\)
** ret
If a processor supports vector (aka SIMD) instructions, optimizing compilers will use them. That means Ghidra may need to make sense of the generated code.
What happens when a gcc toolchain optimizes the following code?
#include <stdio.h>
int main(int argc, char** argv){
    const int N = 1320;
    char s[N];
    for (int i = 0; i < N - 1; ++i)
        s[i] = i + 1;
    s[N - 1] = '\0';
    printf(s);
}
This involves a simple loop filling a character array with integers. It isn’t a well-formed C string, so the printf statement is just there to keep the character array from being optimized away. The elements of the loop involve incremental indexing, narrowing to 8-bit elements, and storage in a 1320-element vector.
The result depends on the compiler version and what kind of microarchitecture gcc-14 was told to compile for.
Compile and link this file with a variety of compiler versions, flags, and microarchitectures to see how well
Ghidra tracks toolchain evolution. In each case the decompiler output is manually adjusted to relabel
variables like s
and i
and remove extraneous declarations.
$ bazel build -s --platforms=//platforms:riscv_userspace gcc_vectorization:narrowing_loop
...
Ghidra gives:
char s [1319];
...
int i;
...
for (i = 0; i < 0x527; i = i + 1) {
s[i] = (char)i + '\x01';
}
...
printf(s);
The loop consists of 17 instructions and 60 bytes. It is executed 1319 times.
$ bazel build -s --platforms=//platforms:riscv_userspace --copt="-O3" gcc_vectorization:narrowing_loop
Ghidra gives:
long i;
char s_offset_by_1 [1320];
i = 1;
do {
s_offset_by_1[i] = (char)i;
i = i + 1;
} while (i != 0x528);
uStack_19 = 0;
printf(s_offset_by_1 + 1);
The loop consists of 4 instructions and 14 bytes. It is executed 1319 times.
Note that Ghidra has reconstructed the target vector s
a bit strangely, with the beginning
offset by one byte to help shorten the loop.
$ bazel build -s --platforms=//platforms:riscv_userspace --copt="-O3" gcc_vectorization:narrowing_loop_vector
The Ghidra import is essentially unchanged - updating the target architecture from rv64igc
to rv64igcv
makes no difference
when building with gcc-13.
$ bazel build -s --platforms=//platforms:riscv_vector gcc_vectorization:narrowing_loop
The Ghidra import is essentially unchanged - updating gcc from gcc-13 to gcc-14 makes no difference without optimization.
$ bazel build -s --platforms=//platforms:riscv_vector --copt="-O3" gcc_vectorization:narrowing_loop
The Ghidra import is essentially unchanged - updating gcc from gcc-13 to gcc-14 makes no difference - when using the default target architecture without vector extensions.
Build with -march=rv64gcv
to tell the compiler to assume the processor supports RISCV vector extensions.
$ bazel build -s --platforms=//platforms:riscv_vector --copt="-O3" gcc_vectorization:narrowing_loop_vector
/* WARNING: Unimplemented instruction - Truncating control flow here */
halt_unimplemented();
The disassembly window shows that the loop consists of 13 instructions and 46 bytes. Many of these are vector extension instructions for which Ghidra 11.0 has no semantics. Different RISCV processors will take a different number of iterations to finish the loop. If the processor VLEN=128, then each vector register will hold four 32-bit integers and the loop will take 330 iterations. If the processor VLEN=1024, each vector register will hold 32 such integers and the loop will take 42 iterations.
Either way, Ghidra 11.0 will fail to decompile any such autovectorized loop, and fail to decompile the remainder of any function which contains such an autovectorized loop.
Note: Intel’s Sapphire Rapids includes high-end server processors like the Xeon Max family. A more general choice for x86_64 exemplars would be
-march=x86-64-v3
with -O2 optimization. We can expect full Red Hat Linux distributions soon built with those options. The x86-64-v4
microarchitecture is a generalization of the Sapphire Rapids microarchitecture, and would more likely be found in servers specialized for numerical analysis or possibly ML applications.
$ bazel build -s --platforms=//platforms:x86_64_default --copt="-O3" --copt="-march=sapphirerapids" gcc_vectorization:narrowing_loop
Ghidra 11.0 disassembler and decompiler fail immediately on hitting the first vector instruction vpbroadcastd
, an older avx2 vector extension.
/* WARNING: Bad instruction - Truncating control flow here */
halt_baddata();
GCC can replace calls to some functions like memcpy with inline - and potentially vectorized - instruction sequences. This source file shows different ways memcpy can be compiled.
include "common.h"
#include <string.h>
int main() {
const int N = 127;
const uint32_t seed = 0xdeadbeef;
srand(seed);
// data gen
double A[N];
gen_rand_1d(A, N);
// compute
double copy[N];
memcpy(copy, A, sizeof(A));
// prevent optimization from removing result
printf("%f\n", copy[N-1]);
}
Build with:
$ bazel build --platforms=//platforms:riscv_vector --copt="-O3" gcc_vectorization:memcpy_vector
Ghidra 11 gives:
undefined8 main(void)
{
  undefined auVar1 [64];
  undefined *puVar2;
  undefined (*pauVar3) [64];
  long lVar4;
  long lVar5;
  undefined auVar6 [256];
  undefined local_820 [8];
  undefined8 uStack_818;
  undefined auStack_420 [1032];

  srand(0xdeadbeef);
  puVar2 = auStack_420;
  gen_rand_1d(auStack_420,0x7f);
  pauVar3 = (undefined (*) [64])local_820;
  lVar4 = 0x3f8;
  do {
    lVar5 = vsetvli_e8m8tama(lVar4);
    auVar6 = vle8_v(puVar2);
    lVar4 = lVar4 - lVar5;
    auVar1 = vse8_v(auVar6);
    *pauVar3 = auVar1;
    puVar2 = puVar2 + lVar5;
    pauVar3 = (undefined (*) [64])(*pauVar3 + lVar5);
  } while (lVar4 != 0);
  printf("%f\n",uStack_818);
  return 0;
}
What would we like the decompiler to show instead? The memcpy
pattern should be fairly general and stable.
src = auStack_420;
gen_rand_1d(auStack_420,0x7f);
dest = (undefined (*) [64])local_820;
/* char* dest, src;
   dest[0..n] ≔ src[0..n]; */
n = 0x3f8;
do {
  lVar2 = vsetvli_e8m8tama(n);
  auVar3 = vle8_v(src);
  n = n - lVar2;
  auVar1 = vse8_v(auVar3);
  *dest = auVar1;
  src = src + lVar2;
  dest = (undefined (*) [64])(*dest + lVar2);
} while (n != 0);
More generally, we want a precomment showing the memcpy in vector terms immediately before
the loop. The type definition of dest
is a red herring to be dealt with.
Build with:
$ bazel build -s --platforms=//platforms:x86_64_default --copt="-O3" --copt="-march=sapphirerapids" gcc_vectorization:memcpy_sapphirerapids
Ghidra 11.0’s disassembler and decompiler bail out when they reach the inline replacement for memcpy
- gcc-14 has replaced the call with
vector instructions like vmovdqu64
, which is unrecognized by Ghidra.
void main(void)
{
  undefined auStack_428 [1016];
  undefined8 uStack_30;

  uStack_30 = 0x4010ad;
  srand(0xdeadbeef);
  gen_rand_1d(auStack_428,0x7f);
  /* WARNING: Bad instruction - Truncating control flow here */
  halt_baddata();
}
Invoking RISCV vector instructions from C.
RISCV vector intrinsic functions can be coded directly into C or C++. The RISCV vector intrinsics documentation includes examples of code that might be found shortly in libc:
void *memcpy_vec(void *restrict destination, const void *restrict source,
                 size_t n) {
  unsigned char *dst = destination;
  const unsigned char *src = source;

  // copy data byte by byte
  for (size_t vl; n > 0; n -= vl, src += vl, dst += vl) {
    vl = __riscv_vsetvl_e8m8(n);
    vuint8m8_t vec_src = __riscv_vle8_v_u8m8(src, vl);
    __riscv_vse8_v_u8m8(dst, vec_src, vl);
  }

  return destination;
}
Note: GCC-14 autovectorization will often convert normal calls to
memcpy
into something very similar to thememcpy_vec
code above, then assemble it down to RISCV vector instructions.
As another example, here is a snippet of code from the whisper.cpp voice-to-text open source project:
...
#ifdef __riscv_v_intrinsic
#include <riscv_vector.h>
#endif
...
#elif defined(__riscv_v_intrinsic)
    size_t vl = __riscv_vsetvl_e32m4(QK8_0);

    for (int i = 0; i < nb; i++) {
        // load elements
        vfloat32m4_t v_x = __riscv_vle32_v_f32m4(x+i*QK8_0, vl);

        vfloat32m4_t vfabs = __riscv_vfabs_v_f32m4(v_x, vl);
        vfloat32m1_t tmp = __riscv_vfmv_v_f_f32m1(0.0f, vl);
        vfloat32m1_t vmax = __riscv_vfredmax_vs_f32m4_f32m1(vfabs, tmp, vl);
        float amax = __riscv_vfmv_f_s_f32m1_f32(vmax);
        ...
    }
Normally you would expect to see functions like __riscv_vfabs_v_f32m4
defined in the include file riscv_vector.h
, where Ghidra could process it and help identify calls to
these intrinsics. The vector intrinsic functions are instead autogenerated directly into GCC’s internal compiled header format when the compiler is built - there are just too many variants
to cope with. The PDF listing of all intrinsic functions is currently over 4000 pages long. For example, the signature for __riscv_vfredmax_vs_f32m4_f32m1
is given on page 734 under
Vector Reduction Operations
as
vfloat32m1_t __riscv_vfredmax_vs_f32m4_f32m1(vfloat32m4_t vs2, vfloat32m1_t vs1, size_t vl);
There aren’t all that many vector instruction genotypes, but there are an enormous number of contextual variations the compiler and assembler know about.
Link Time Optimization
Link Time Optimization (LTO) is a relatively new form of toolchain optimization that can produce smaller and faster binaries. It can also mutate control flows in those binaries making Ghidra analysis trickier, especially if one is using BSIM to look for control flow similarities.
Can we generate importable exemplars using LTO to show what such optimization steps look like in Ghidra?
LTO needs a command line parameter added for both compilation and linking. With bazel, that means --copt="-flto" --linkopt="-Wl,-flto" is enough to request LTO optimization on a build. These lto flags can also be defaulted into the toolchain definition or individual build files.
Let’s try this with a progressively more complicated series of exemplars:
# Build helloworld without LTO as a control
$ bazel build -s --copt="-O2" --platforms=//platforms:riscv_vector userSpaceSamples:helloworld
...
$ ls -l bazel-bin/userSpaceSamples/helloworld
-r-xr-xr-x. 1 --- --- 8624 Jan 31 10:44 bazel-bin/userSpaceSamples/helloworld
# The helloworld exemplar doesn't benefit much from link time optimization
$ bazel build -s --copt="-O2" --copt="-flto" --linkopt="-Wl,-flto" --platforms=//platforms:riscv_vector userSpaceSamples:helloworld
$ ls -l bazel-bin/userSpaceSamples/helloworld
-r-xr-xr-x. 1 --- --- 8608 Jan 31 10:46 bazel-bin/userSpaceSamples/helloworld
The memcpy source exemplar can be built three ways:

gcc_vectorization:memcpy
gcc_vectorization:memcpy_vector
gcc_vectorization:memcpy_lto

In this case the LTO options are configured into gcc_vectorization/BUILD.
$ bazel build -s --platforms=//platforms:riscv_vector gcc_vectorization:memcpy
...
INFO: Build completed successfully ...
$ ls -l bazel-bin/gcc_vectorization/memcpy
-r-xr-xr-x. 1 --- --- 13488 Jan 31 11:16 bazel-bin/gcc_vectorization/memcpy
$ bazel build -s --platforms=//platforms:riscv_vector gcc_vectorization:memcpy_vector
INFO: Build completed successfully ...
$ ls -l bazel-bin/gcc_vectorization/memcpy_vector
-r-xr-xr-x. 1 --- --- 13728 Jan 31 11:18 bazel-bin/gcc_vectorization/memcpy_vector
$ bazel build -s --platforms=//platforms:riscv_vector gcc_vectorization:memcpy_lto
ERROR: ...: Linking gcc_vectorization/memcpy_lto failed: (Exit 1): gcc failed: error executing CppLink command (from target //gcc_vectorization:memcpy_lto) ...
lto1: internal compiler error: in riscv_hard_regno_nregs, at config/riscv/riscv.cc:8058
Please submit a full bug report, with preprocessed source (by using -freport-bug).
See <https://gcc.gnu.org/bugs/> for instructions.
lto-wrapper: fatal error: external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-gcc returned 1 exit status
compilation terminated.
So it looks like LTO has problems with RISCV vector instructions. We’ll keep testing this as more gcc 14 snapshots become available, but as a lower priority exercise. LTO does not seem like a popular optimization.
Ghidra processor semantics needs tests
Processor instructions known to Ghidra are defined in Sleigh pcode or semantic sections. Adding new instructions - such as instruction set extensions to an existing processor - requires a pcode description of what that instruction does. That pcode drives both the decompiler process and any emulator or debugger processes.
This generates a conflict in testing. Should we test for maximum clarity for semantics rendered in the decompiler window or maximum fidelity in any Ghidra emulator? For example, should a divide instruction include pcode to test against a divide-by-zero? Should floating point instructions guard against NaN (Not a Number) inputs?
We assume here that decompiler fidelity is more important than emulator fidelity. That implies:
Individual instructions are wrapped in C and exercised within a Google Test C++ framework. The test framework is then executed within a qemu static emulation environment.
For example, let’s examine two riscv-64 instructions: fcvt.w.s and fmv.x.w. The ISA manual describes them as follows:

fcvt.w.s converts a floating-point number in floating-point register rs1 to a signed 32-bit or 64-bit integer, respectively, in integer register rd.

fmv.x.w moves the single-precision value in floating-point register rs1 represented in IEEE 754-2008 encoding to the lower 32 bits of integer register rd. For RV64, the higher 32 bits of the destination register are filled with copies of the floating-point number’s sign bit.

These two instructions have similar signatures but very different semantics. fcvt.w.s performs a float to int type conversion, so the float 1.0 can be converted to int 1. fmv.x.w moves the raw bits between float and int registers without any type conversion.
We can generate simple exemplars of both instructions with this C code:
int fcvt_w_s(float* x) {
return (int)*x;
}
int fmv_x_w(float* x) {
int val;
float src = *x;
__asm__ __volatile__ (
"fmv.x.w %0, %1" \
: "=r" (val) \
: "f" (src));
return val;
}
Ghidra’s 11.2-DEV decompiler renders these as:
long fcvt_w_s(float *param_1)
{
return (long)(int)*param_1;
}
long fmv_x_w(float *param_1)
{
return (long)(int)param_1;
}
The fmv_x_w rendering is missing a dereference operation, and it implies a type conversion when none is actually performed. Let’s trace how to use these test results to improve the decompiler output.
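The semantic difference the tests need to capture can be modeled in portable C, without RISCV hardware. This is only a sketch of what correct pcode should compute; the model_ function names are ours, not Ghidra's:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of fcvt.w.s: a value conversion from float to signed int. */
static int32_t model_fcvt_w_s(float f) {
    return (int32_t)f;
}

/* Model of fmv.x.w: a raw bit move -- the IEEE 754 encoding of the
 * float lands unchanged in the integer register, no conversion. */
static int32_t model_fmv_x_w(float f) {
    int32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

With input 1.0f the first model yields the integer 1, while the second yields 0x3f800000, the IEEE 754 encoding of 1.0 - exactly the distinction the current decompiler output loses.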
The draft test harness can be built and run from the top level workspace directory.
$ bazel build --platforms=//riscv64/generated/platforms:riscv_userspace riscv64/generated/emulator_tests:testSemantics
Starting local Bazel server and connecting to it...
INFO: Analyzed target //riscv64/generated/emulator_tests:testSemantics (74 packages loaded, 1902 targets configured).
...
INFO: From Executing genrule //riscv64/generated/emulator_tests:testSemantics:
...
INFO: Found 1 target...
Target //riscv64/generated/emulator_tests:testSemantics up-to-date:
bazel-bin/riscv64/generated/emulator_tests/results
INFO: Elapsed time: 4.172s, Critical Path: 1.93s
INFO: 4 processes: 1 internal, 3 linux-sandbox.
INFO: Build completed successfully, 4 total actions
$ cat bazel-bin/riscv64/generated/emulator_tests/results
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from FP
[ RUN ] FP.testharness
[ OK ] FP.testharness (3 ms)
[ RUN ] FP.fcvt
[ OK ] FP.fcvt (10 ms)
[ RUN ] FP.fmv
[ OK ] FP.fmv (1 ms)
[ RUN ] FP.fp16
[ OK ] FP.fp16 (0 ms)
[----------] 4 tests from FP (15 ms total)
[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (19 ms total)
[ PASSED ] 4 tests.
$ file bazel-bin/riscv64/generated/emulator_tests/libfloatOperations.so
bazel-bin/riscv64/generated/emulator_tests/libfloatOperations.so: ELF 64-bit LSB shared object, UCB RISC-V, RVC, double-float ABI, version 1 (SYSV), dynamically linked, not stripped
What version of Ghidra are we testing against?
What do we do with Ghidra patches that improve the decompilation results?
fcvt and fmv patches from other sources.

Use meld to compare the original C source file floatOperations.c with the exported C decompiler view from Ghidra 11.2-DEV. A quick inspection shows some errors to address:
Original | Ghidra
---|---
float fcvt_s_wu(uint32_t* i) {return (float)*i;} | float fcvt_s_wu(uint *param_1){return (float)ZEXT416(*param_1);}
double fcvt_d_wu(uint32_t* j){return (double)*j;} | double fcvt_d_wu(uint *param_1){return (double)ZEXT416(*param_1);}
long fmv_x_w(float *param_1){return (long)(int)*param_1;} | 
long fmv_x_d(double *param_1){return (long)(int)*param_1;} | 
long fcvt_h_w(int param_1){return (long)param_1;} | 
long fcvt_h_wu(int param_1){return (long)param_1;} | 
ulong fcvt_h_d(ulong *param_1){return *param_1 & 0xffffffff;} | 
The errors include:

- ZEXT416 in two places
- fmv instructions appear to force an implicit type conversion where none is wanted
- fcvt_h_w and fcvt_h_wu
- fcvt_h_d
Testing semantics for the zfh
half-precision floating point instructions is more complicated
than usual. Ghidra’s semantics and pcode system has no known provision for half-precision floating point,
so emulation won’t work well. The current zfh
implementation makes these _fp16 objects look
like 32 bit floats in registers and like 16 bit shorts in memory operations, making Ghidra type inferencing
even more confusing.
Let’s look at a more limited scope, the definition of the Ghidra trunc pcode op. The documentation says trunc produces a signed integer obtained by truncating its argument.

- How does trunc set its result type?
- Does trunc expect only a floating point double?
- Is there a trunc_u to generate an unsigned integer?

The documentation also says that float2float ‘copies a floating-point number with more or less precision’, so its implementation may tell us something about type inferencing.
- Ghidra/Features/Decompiler/src/decompile/cpp/pcodeparse.cc binds float2float to OP_FLOAT2FLOAT and CPUI_FLOAT_FLOAT2FLOAT, and to several files under Ghidra/Features/Decompiler/src/decompile/cpp.
- FloatFormat::opFloat2Float and FloatFormat::opTrunc look relevant in float.hh and float.cc
What gaps in Ghidra’s import processes need the most long term attention?
Some features are easy or quick to add to Ghidra’s import processes. Other features might be nice to have but just aren’t worth the effort. How do we approach features that are probably going to be important in the long term but would require a lot of effort to address?
This section considers RISCV-64 code optimization by vector instruction insertion as an example. Either the compiler or the coder can choose to replace sequences of simple instructions with sequences of vector instructions. Those vector sequences often do not have a clean C representation in Ghidra’s decompiler view, making it difficult for Ghidra users to understand what the code is doing and to look for malware or other pathologies.
The overview introduced an approach to this sort of challenge:
Where does this gap appear?
What is the impact of this gap?
Ghidra’s current limits in handling RISCV-64 vector instructions will impact users in phases, where the initial impacts are modest and fairly easy to deal with while later impacts will take significant design work to address.
The most immediate impact involves Ghidra disassembly and decompilation failure when encountering unrecognized instructions. The Fedora 39 exemplar kernel contains several extension instructions that Ghidra 11 can’t recognize. These are limited in number and don’t have a material impact on someone examining RISCV kernel code. The voice-to-text app whisper.cpp shows more serious limits - roughly one third of the app’s instructions are unprocessed by Ghidra 11 because of vector and other extension instructions.
That impact can be addressed by simply defining the missing instructions, as in Ghidra’s isa_ext
experimental branch. This will allow the disassembler and
decompiler to process all instructions in the app. This is necessary but not sufficient, since many or most of the vector extension instructions do not have
a clean pcode representation. Obvious calls to memcpy
will be replaced with one of a half-dozen inline vector instruction sequences. Simple or nested
loops will be ‘vectorized’ with fewer iterations but much more complex instruction opcode sequences. Optimizing compilers can handle those complexities, while
Ghidra users searching for malware will have a harder time of it.
The general challenge for Ghidra is that of reconstructing the context from sequences of vector extension instructions.
Note: Some material comes as-is from https://www.reddit.com/r/RISCV
The first generally available 64 bit RISCV vector systems development kit has just become available (January 2024), based on the relatively modest THead C908 core. This SDK appears tuned for video processing, perhaps video surveillance applications aggregating multiple cameras into a common video feed. We are probably several years from seeing server-class systems built on SiFive P870 cores, and fabricated on the fastest available fab lines. Memory bandwidth is poor at present, while energy efficiency is potentially better than x86_64 designs.
Judging from internet hype, we can expect to see RISCV vector code appearing in replacements of ARM systems (automotive and possibly cell phone) and as the extensible basis of AI applications.
January 2024 saw a flurry of open source toolchain and framework contributions from several sources.
How much effort might it take to fill the gap?
Does the scope of this gap extend to other processors?
Which Ghidra frameworks might be extended to fill the gap?
- .sinc files?
- vset* instructions

Code is built by a toolchain (compiler, linker) to run on a platform (e.g., a Pixel 7a cellphone).
This project adopts the Bazel framework for building importable exemplars. Platforms describe the foundation on which code will run. Toolchains compile and link code for different platforms. Bazel builds are hermetic, which for our purposes means that platforms and toolchains are all versioned and importable, so build results are the same no matter where the build host may be.
The directory RISCV64/toolchain defines these platforms:

- //platforms:riscv_userspace for a generic RISCV-64 Linux appliance with the usual libc and libstdio APIs
- //platforms:riscv_vector for a more specialized RISCV-64 Linux appliance with vector extensions supported
- //platforms:riscv_custom for a highly specialized RISCV-64 Linux appliance with vector and vendor-specific extensions supported
- //platforms:riscv_local for toolchain debugging, using a local file system toolchain under /opt/riscvx
Note: The current binutils and gcc show more vendor-specific instruction set extensions from THead, so we will arbitrarily use that as the exemplar custom platform.
This directory defines these toolchains:

- //toolchains:riscv64-default - a gcc-13 stable RISCV compiler, linker, loader, and sysroot of related include files and libraries
- //toolchains:riscv64-next - a gcc-14 unreleased but feature-frozen RISCV compiler, linker, loader, and sysroot of related include files and libraries
- //toolchains:riscv64-custom - a variant of //toolchains:riscv64-next with multiple standard and vendor-specific ISA extensions enabled by default
- //toolchains:riscv64-local - a toolchain executing out of /opt/riscvx instead of a portable tarball. Generally useful only when debugging the generation of a fully portable and hermetic toolchain tarball.

Exemplars are built by naming the platform for each build. Bazel then finds a compatible toolchain to complete the build.
# compile for the riscv_userspace platform, automatically selecting the riscv64-default toolchain with gcc-13.
bazel build -s --platforms=//platforms:riscv_userspace gcc_vectorization:helloworld_challenge
# compile for the riscv_vector platform, automatically selecting the riscv64-next toolchain with gcc-14.
bazel build -s --platforms=//platforms:riscv_vector gcc_vectorization:helloworld_challenge
This table shows relationships between platforms, constraints, toolchains, and default options:

platform | cpu constraint | toolchain | default options | added optimized options
---|---|---|---|---
//platforms:riscv_userspace | //toolchains:riscv64 | //toolchains:riscv64-default | -O3 | 
//platforms:riscv_vector | //toolchains:riscv64-v | //toolchains:riscv64-next | -march=rv64gcv | -O3
//platforms:riscv_custom | //toolchains:riscv64-c | //toolchains:riscv64-custom | -march=rv64gcv_zba_zbb_zbc_zbkb_zbkc_zbkx_zvbc_xtheadba_xtheadbb_xtheadbs_xtheadcmo_xtheadcondmov_xtheadmac_xtheadfmemidx_xtheadmempair_xtheadsync | -O3
//platforms:riscv_local | //toolchains:riscv64-l | //toolchains:riscv64-local | -O3 | 
Notes:

- The -O3 option is likely too aggressive. The -O2 option would be more common in broadly released software.
- //toolchains:riscv64-default currently uses a gcc-13 toolchain suite
- //toolchains:riscv64-next and //toolchains:riscv64-custom use a gcc-14 toolchain suite
- //toolchains:riscv64-custom adds bit manipulation and many of the THead extensions supported by binutils.
adds bit manipulation and many of the THead extensions supported by binutils.Warning: C options can be added by the toolchain, within a BUILD file, and on the command line. For options like
-O
and-march
, only the last instance of the option affects the build.
Toolchains generally include several components that can affect the generated binaries:

- a gas assembler with support for various instruction set extensions, and disassembler tools like objdump that provide reference handling of newer instructions
- a sysroot holding files the above subsystems would normally expect to find under /usr, for instance /usr/include files supplied by the kernel and standard libraries
See adding toolchains for an example of adding a new toolchain to this project.
Extensions to a processor family’s Instruction Set Architecture add capability and complexity.
The RISCV community has a rich set of extensions to the base Instruction Set Architecture. That means a diverse set
of new binary import targets to test against. This work-in-progress is collected in the riscv64/generated/assemblySamples
directory.
The basic idea is to compare current Ghidra disassembly with current binutils objdump disassembly, using object files assembled from the binutils gas testsuite. For example:

- riscv64/generated/assemblySamples/h-ext-64.S was copied from the binutils gas testsuite. It contains unit test instructions for hypervisor support extensions like hfence.vvma and hlv.w.
- riscv64/exemplars/h-ext-64.o is the object file produced by a current snapshot of the binutils 2.41 assembler. The associated listing is riscv64/exemplars/h-ext-64.list.
- riscv64/exemplars/h-ext-64.objdump is the output from disassembling riscv64/exemplars/h-ext-64.o using the current snapshot of the binutils 2.41 objdump.
.So we want to open Ghidra, import riscv64/exemplars/h-ext-64.o
, and compare the disassembly window to riscv64/exemplars/h-ext-64.objdump
, then triage
any variances.
Some variances are trivial. The h-ext-64.S tests include instruction aliases that assemble into the same 4 byte sequence; disassembly can return only a single instruction, perhaps the simplest of the given aliases.

Other variances are harder - it looks like Ghidra expects an earlier, deprecated set of vector instructions rather than the currently approved set.
riscv64/generated/assemblySamples/TODO.md
collects some of the variances noted so far.
One big question is what kind of pcode Ghidra should generate for some of these instructions - and how many Ghidra users will care about that pcode.
The short term answer is to treat extension instructions as pcode function calls. The longer term answer may be to wait until GCC14 comes out with support for
vector extensions, then see what kind of C source is conventionally used when invoking those extensions. The memcpy
inline function from libc
is a likely
place to find early use of vector instructions.
Also, what can we safely ignore for now? The proposed vendor-specific T-Head extension instruction
th.l2cache.iall
won’t be seen by most Ghidra users.
On the other hand, the encoding rules published with those T-Head extensions look like a good example to follow.
The Fedora 39 kernel includes virtual machine cache management instructions that are not necessarily supported by binutils - they are ‘assembled’ with gcc macros before reaching the binutils assembler. We will ignore those instruction extensions for now, and only consider instruction extensions supported by binutils.
Some newer compilers annotate executable binaries by adding the ISA extensions used during the build.
$ /opt/riscvx/bin/riscv64-unknown-linux-gnu-readelf -A riscv64/exemplars/whisper_cpp_default
Attribute Section: riscv
File Attributes
Tag_RISCV_stack_align: 16-bytes
Tag_RISCV_arch: "rv64i2p1_m2p0_a2p1_f2p2_d2p2_c2p0_zicsr2p0_zmmul1p0"
$ /opt/riscvx/bin/riscv64-unknown-linux-gnu-readelf -A riscv64/exemplars/whisper_cpp_vector
Attribute Section: riscv
File Attributes
Tag_RISCV_stack_align: 16-bytes
Tag_RISCV_arch: "rv64i2p1_m2p0_a2p1_f2p2_d2p2_c2p0_v1p0_zicsr2p0_zifencei2p0_zmmul1p0_zve32f1p0_zve32x1p0_zve64d1p0_zve64f1p0_zve64x1p0_zvl128b1p0_zvl32b1p0_zvl64b1p0"
Tag_RISCV_priv_spec: 1
Tag_RISCV_priv_spec_minor: 11
$ /opt/riscvx/bin/riscv64-unknown-linux-gnu-readelf -A riscv64/exemplars/whisper_cpp_vendor
Attribute Section: riscv
File Attributes
Tag_RISCV_stack_align: 16-bytes
Tag_RISCV_arch: "rv64i2p1_m2p0_a2p1_f2p2_d2p2_c2p0_v1p0_zicsr2p0_zifencei2p0_zmmul1p0_zba1p0_zbb1p0_zbc1p0_zbkb1p0_zbkc1p0_zbkx1p0_zvbc1p0_zve32f1p0_zve32x1p0_zve64d1p0_zve64f1p0_zve64x1p0_zvl128b1p0_zvl32b1p0_zvl64b1p0_xtheadba1p0_xtheadbb1p0_xtheadbs1p0_xtheadcmo1p0_xtheadcondmov1p0_xtheadfmemidx1p0_xtheadmac1p0_xtheadmempair1p0_xtheadsync1p0"
Tag_RISCV_priv_spec: 1
Tag_RISCV_priv_spec_minor: 11
If Tag_RISCV_arch
contains the substring v1p0
, then the associated binary was built assuming RV Vector 1.0 extension instructions are present on the executing CPU hardware thread.
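When scripting this check over many binaries, note that a bare substring search for v1p0 could also match inside a longer extension token. A hedged sketch (the function name is ours) that compares whole '_'-separated fields of the Tag_RISCV_arch string:

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if the Tag_RISCV_arch string lists the extension+version
 * token `ext` (e.g. "v1p0") as a complete '_'-separated field. */
static int has_arch_token(const char *arch, const char *ext) {
    size_t n = strlen(ext);
    const char *p = arch;
    while (p != NULL) {
        const char *sep = strchr(p, '_');                 /* end of this field */
        size_t len = sep ? (size_t)(sep - p) : strlen(p); /* field length */
        if (len == n && memcmp(p, ext, n) == 0)
            return 1;                                     /* exact field match */
        p = sep ? sep + 1 : NULL;                         /* next field */
    }
    return 0;
}
```

Applied to the readelf output above, this reports the vector extension for whisper_cpp_vector and whisper_cpp_vendor but not for whisper_cpp_default.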
Common instruction patterns one might see with vectorized code generation
This page collects architecture-dependent gcc-14 expansions, where simple C sequences are translated into optimized code.
Our baseline is a gcc-14 compiler with -O2
optimization and a base machine architecture of -march=rv64gc
. That’s a basic 64 bit RISCV
processor (or a hart
core of that processor) with support for compressed instructions.
Variant machine architectures considered here are:
march | description |
---|---|
rv64gc | baseline |
rv64gcv | baseline + vector extension (dynamic vector length) |
rv64gcv_zvl128b | baseline + vector (minimum 128 bit vectors) |
rv64gcv_zvl512b | baseline + vector (minimum 512 bit vectors) |
rv64gcv_zvl1024b | baseline + vector (minimum 1024 bit vectors) |
rv64gc_xtheadbb | baseline + THead bit manipulation extension (no vector) |
Note: memory copy operations require non-overlapping source and destination. memory move operations allow overlap but are much more complicated and are not currently optimized.
Optimizing compilers are good at turning simple memory copy operations into confusing - but fast - instruction sequences.
GCC can recognize memory copy operations as calls to memcpy
or as structure assignments like *a = *c
.
The current reference C file is:
extern void *memcpy(void *__restrict dest, const void *__restrict src, __SIZE_TYPE__ n);
extern void *memmov(void *dest, const void *src, __SIZE_TYPE__ n);
/* invoke memcpy with dynamic size */
void cpymem_1 (void *a, void *b, __SIZE_TYPE__ l)
{
memcpy (a, b, l);
}
/* invoke memcpy with known size and aligned pointers */
extern struct { __INT32_TYPE__ a[16]; } a_a, a_b;
void cpymem_2 ()
{
memcpy (&a_a, &a_b, sizeof a_a);
}
typedef struct { char c[16]; } c16;
typedef struct { char c[32]; } c32;
typedef struct { short s; char c[30]; } s16;
/* copy fixed 128 bits of memory */
void cpymem_3 (c16 *a, c16* b)
{
*a = *b;
}
/* copy fixed 256 bits of memory */
void cpymem_4 (c32 *a, c32* b)
{
*a = *b;
}
/* copy fixed 256 bits of memory */
void cpymem_5 (s16 *a, s16* b)
{
*a = *b;
}
/* memmov allows overlap - don't vectorize or inline */
void movmem_1(void *a, void *b, __SIZE_TYPE__ l)
{
memmov (a, b, l);
}
Ghidra 11 with the isa_ext
branch decompiler gives us something simple after fixing the signature of the memcpy
thunk.
void cpymem_1(void *param_1,void *param_2,size_t param_3)
{
memcpy(param_1,param_2,param_3);
return;
}
void cpymem_2(void)
{
memcpy(&a_a,&a_b,0x40);
return;
}
void cpymem_3(void *param_1,void *param_2)
{
memcpy(param_1,param_2,0x10);
return;
}
void cpymem_4(void *param_1,void *param_2)
{
memcpy(param_1,param_2,0x20);
return;
}
void cpymem_5(void *param_1,void *param_2)
{
memcpy(param_1,param_2,0x20);
return;
}
If the compiler knows the target hart can process vector extensions, but is not told explicitly the size of each vector register, it optimizes all of these calls. Ghidra 11 gives us the following, with binutils’ objdump instruction listings added as comments:
long cpymem_1(long param_1,long param_2,long param_3)
{
long lVar1;
undefined auVar2 [256];
do {
lVar1 = vsetvli_e8m8tama(param_3); // vsetvli a5,a2,e8,m8,ta,ma
auVar2 = vle8_v(param_2); // vle8.v v8,(a1)
param_3 = param_3 - lVar1; // sub a2,a2,a5
vse8_v(auVar2,param_1); // vse8.v v8,(a0)
param_2 = param_2 + lVar1; // add a1,a1,a5
param_1 = param_1 + lVar1; // add a0,a0,a5
} while (param_3 != 0); // bnez a2,8a8 <cpymem_1>
return param_1;
}
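The decompiled do-while above is the standard RVV strip-mining idiom: each pass handles vl bytes, where vl is whatever vsetvli grants (up to 8 * VLEN/8 bytes with m8 grouping). A scalar sketch of what the loop computes - VLMAX is an assumed stand-in for the hart-dependent maximum:

```c
#include <stddef.h>
#include <string.h>

enum { VLMAX = 128 };  /* assumed: 8 * (VLEN/8) bytes for a VLEN=128 hart, m8 */

/* Scalar model of the strip-mined vle8.v/vse8.v copy loop in cpymem_1. */
static void model_cpymem_1(unsigned char *dest, const unsigned char *src,
                           size_t n) {
    while (n != 0) {
        size_t vl = n < VLMAX ? n : VLMAX; /* vsetvli a5,a2,e8,m8,ta,ma */
        memcpy(dest, src, vl);             /* vle8.v v8,(a1); vse8.v v8,(a0) */
        n    -= vl;                        /* sub a2,a2,a5 */
        src  += vl;                        /* add a1,a1,a5 */
        dest += vl;                        /* add a0,a0,a5 */
    }                                      /* bnez a2,<loop> */
}
```

Recognizing this shape - a vsetvli, a load/store pair, and pointer/counter updates in a loop - is what would let a decompiler collapse the whole block back to a memcpy call.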
void cpymem_2(void)
{
// ld a4,1922(a4) # 2040 <a_b@Base>
// ld a5,1938(a5) # 2058 <a_a@Base>
undefined auVar1 [256];
vsetivli(0x10,0xd3); // vsetivli zero,16,e32,m8,ta,ma
auVar1 = vle32_v(&a_b); // vle32.v v8,(a4)
vse32_v(auVar1,&a_a); // vse32.v v8,(a5)
return;
}
void cpymem_3(undefined8 param_1,undefined8 param_2)
{
undefined auVar1 [256];
vsetivli(0x10,0xc0); // vsetivli zero,16,e8,m1,ta,ma
auVar1 = vle8_v(param_2); // vle8.v v1,(a1)
vse8_v(auVar1,param_1); // vse8.v v1,(a0)
return;
}
void cpymem_4(undefined8 param_1,undefined8 param_2)
{
undefined auVar1 [256]; // li a5,32
vsetvli_e8m8tama(0x20); // vsetvli zero,a5,e8,m8,ta,ma
auVar1 = vle8_v(param_2); // vle8.v v8,(a1)
vse8_v(auVar1,param_1); // vse8.v v8,(a0)
return;
}
void cpymem_5(undefined8 param_1,undefined8 param_2)
{
undefined auVar1 [256];
vsetivli(0x10,0xcb); // vsetivli zero,16,e16,m8,ta,ma
auVar1 = vle16_v(param_2); // vle16.v v8,(a1)
vse16_v(auVar1,param_1); // vse16.v v8,(a0)
return;
}
The variation in the vset*
instructions is a bit puzzling. This may be due to
alignment issues - trying to copy a short int
into a misaligned odd address generates
an exception at the store instruction, so perhaps the vector optimization is supposed
to throw an exception there too.
Survey a voice-to-text app for common vector instruction patterns
Take an exemplar RISCV-64 binary like whisper.cpp
, with its many vector instructions.
Which vector patterns are easy to recognize, either for a human Ghidra user or for a hypothetical Ghidra plugin?
Some of the most common patterns correspond to memcpy
or memset
invocations where the number of bytes is known at
compile time as is the alignment of operands.
ML apps like whisper.cpp
often work with parameters of less than 8 bits, so there can be a lot of demarshalling, unpacking,
and repacking operations. That means lots of vector bit manipulation and width conversion operations.
ML apps also do a lot of vector, matrix, and tensor arithmetic, so we can expect to find vectorized arithmetic operations mixed in with vector parameter conversion operations.
Note: This page is likely to change rapidly as we get a better handle on the problem and develop better analytic tools to guide the process.
Most vector instructions come in groups started with a vsetvli
or vsetivli
instruction to set up the vector context.
If the number of vector elements is known at compile time and less than 32, then the vsetivli
instruction is often used.
Otherwise the vsetvli
instruction is used.
Scanning for these instructions showed 673 vsetvli
and 888 vsetivli
instructions within whisper.cpp
.
The most common vsetvli
instruction (343 out of 673) is type 0xc3 or e8,m8,ta,ma
. That expands to:
The most common vsetivli
instruction (565 out of 888) is type 0xd8 or e64,m1,ta,ma
. That expands to:
A similar common vsetivli
instruction (102 out of 888) is type 0xdb or e64,m8,ta,ma
. That expands to:
The second most common vsetivli
instruction (107 out of 888) is type 0xc7 or e8,mf2,ta,ma
. That expands to:
How many of these vector blocks can be treated as simple memcpy
or memset
invocations?
For example, this Ghidra listing snippet looks like a good candidate for memcpy
:
00090bdc 57 f0 b7 cd vsetivli zero,0xf,e64,m8,ta,ma
00090be0 07 74 07 02 vle64.v v8,(a4)
00090be4 27 f4 07 02 vse64.v v8,(a5)
A pcode equivalent might be __builtin_memcpy(dest=(a5), src=(a4), 8 * 15)
with a possible context note that
vector registers v8 through v16 are changed.
A longer example might be a good candidate for memset
:
00090b84 57 70 81 cd vsetivli zero,0x2,e64,m1,ta,ma
00090b88 93 07 07 01 addi a5,a4,0x10
00090b8c d7 30 00 5e vmv.v.i v1,0x0
00090b90 a7 70 07 02 vse64.v v1,(a4)
00090b94 a7 f0 07 02 vse64.v v1,(a5)
00090b98 93 07 07 02 addi a5,a4,0x20
00090b9c a7 f0 07 02 vse64.v v1,(a5)
00090ba0 93 07 07 03 addi a5,a4,0x30
00090ba4 a7 f0 07 02 vse64.v v1,(a5)
00090ba8 93 07 07 04 addi a5,a4,0x40
00090bac a7 f0 07 02 vse64.v v1,(a5)
00090bb0 93 07 07 05 addi a5,a4,0x50
00090bb4 a7 f0 07 02 vse64.v v1,(a5)
00090bb8 93 07 07 06 addi a5,a4,0x60
00090bbc a7 f0 07 02 vse64.v v1,(a5)
00090bc0 fd 1b c.addi s7,-0x1
00090bc2 23 38 07 06 sd zero,0x70(a4)
This example is based on a minimum VLEN of 128 bits, so each vector register can hold two 64-bit elements. The vmv.v.i instruction sets both elements of v1 to zero. Seven vse64.v instructions then store two 64-bit zeros each to successive memory locations, with a trailing scalar double-word store to handle the tail.
A pcode equivalent for this sequence might be __builtin_memset(dest=(a4), 0, 0x78)
.
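The 0x78 byte count follows from the vector context: with VLEN=128 and e64,m1, each vse64.v stores 2 x 8 = 16 bytes, so seven stores plus the trailing sd cover 7*16 + 8 = 0x78 bytes. A scalar sketch of the unrolled sequence, assuming VLEN=128:

```c
#include <string.h>

/* Scalar model of the vmv.v.i/vse64.v zeroing sequence above (VLEN=128). */
static void model_memset_block(unsigned char *a4) {
    for (int i = 0; i < 7; i++)
        memset(a4 + 0x10 * i, 0, 16); /* vse64.v v1,(a5): two 64-bit zeros */
    memset(a4 + 0x70, 0, 8);          /* sd zero,0x70(a4): scalar tail */
}
```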
The python script objdump_analytic.py
provides a crude scan of a RISCV-64 binary, reporting on likely vector instruction blocks. It doesn’t handle blocks with more than one
vsetvli
or vsetivli
instruction, something common in vector narrowing or widening operations. If we apply this script to whisper_cpp_vector
we can collect a crude field guide to vector expansions.
VLEN in the following is the hart’s vector length, determined at execution time. It is usually something like 128 bits for a general purpose core (aka hart) and up to 1024 bits for a dedicated accelerator hart.
This pattern is often found when copying objects of known and limited size. It is useful with objects as small as 4 bytes if the source alignment is unknown and the destination object must be aligned on half-word, word, or double-word boundaries.
; memcpy(dest=a0, src=a3, nbytes=a4) where a4 < 8 * (VLEN/8)
1d3da: 0c377057 vsetvli zero,a4,e8,m8,ta,ma
1d3de: 02068407 vle8.v v8,(a3)
1d3e2: 02050427 vse8.v v8,(a0)
This pattern is usually found in a simple loop, moving 8 * (VLEN/8) bytes at a time. The a5 register holds the number of bytes processed per iteration.
; memcpy(dest=a6, src=a7, nbytes=a0)
1d868: 0c3577d7 vsetvli a5,a0,e8,m8,ta,ma
1d86c: 02088407 vle8.v v8,(a7)
1d872: 02080427 vse8.v v8,(a6)
The next example appears to be compiled from estimate_diarization_speaker
whose source is:
double energy0 = 0.0f;
double energy1 = 0.0f;
for (int64_t j = is0; j < is1; j++) {
energy0 += fabs(pcmf32s[0][j]);
energy1 += fabs(pcmf32s[1][j]);
}
This is a typical reduction with widening pattern.
The vector instructions generated are:
242ce: 0d8077d7 vsetvli a5,zero,e64,m1,ta,ma
242d2: 5e0031d7 vmv.v.i v3,0
242d6: 9e303257 vmv1r.v v4,v3
242da: 0976f7d7 vsetvli a5,a3,e32,mf2,tu,ma
242e4: 0205e107 vle32.v v2,(a1)
242e8: 02066087 vle32.v v1,(a2)
242ec: 2a211157 vfabs.v v2,v2
242f0: 2a1090d7 vfabs.v v1,v1
242f8: d2411257 vfwadd.wv v4,v4,v2
242fc: d23091d7 vfwadd.wv v3,v3,v1
24312: 0d8077d7 vsetvli a5,zero,e64,m1,ta,ma
24316: 4207d0d7 vfmv.s.f v1,fa5
2431a: 063091d7 vfredusum.vs v3,v3,v1
2431e: 42301757 vfmv.f.s fa4,v3
24326: 06409257 vfredusum.vs v4,v4,v1
2432a: 424017d7 vfmv.f.s fa5,v4
A hypothetical vectorized Ghidra might decompile these instructions (ignoring the scalar instructions not displayed here) as:
double vector v3, v4; // SEW=64 bit
v3 := vector 0; // load immediate
v4 := v3; // vector copy
float vector v1, v2; // SEW=32 bit
while(...) {
v2 = vector *a1;
v1 = vector *a2;
v2 = abs(v2);
v1 = abs(v1);
v4 = v4 + v2; // widening 32 to 64 bits
v3 = v3 + v1; // widening 32 to 64 bits
}
double vector v1, v3, v4;
v1[0] = fa5; // fa5 is the scalar 'carry-in'
v3[0] = v1[0] + ⅀ v3; // unordered vector reduction
fa4 = v3[0];
v4[0] = v1[0] + ⅀ v4;
fa5 = v4[0];
The vector instruction vfredusum.vs
provides the unordered reduction sum over the elements of a single vector. That’s likely faster than an ordered sum,
but the floating point round-off errors will not be deterministic.
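Ignoring strip-mining and the non-deterministic summation order, the whole sequence reduces to this scalar sketch (the function name is ours):

```c
#include <math.h>

/* Scalar model of the vfabs.v / vfwadd.wv / vfredusum.vs sequence:
 * widen each 32-bit |x| to 64 bits, accumulate, then reduce to a scalar. */
static void model_energy(const float *p0, const float *p1, int n,
                         double *energy0, double *energy1) {
    double e0 = 0.0, e1 = 0.0;      /* vmv.v.i v3,0 / vmv1r.v v4,v3 */
    for (int i = 0; i < n; i++) {
        e0 += (double)fabsf(p0[i]); /* vfabs.v v2 + vfwadd.wv v4,v4,v2 */
        e1 += (double)fabsf(p1[i]); /* vfabs.v v1 + vfwadd.wv v3,v3,v1 */
    }
    *energy0 = e0;                  /* vfredusum.vs + vfmv.f.s fa4/fa5 */
    *energy1 = e1;
}
```

This matches the original estimate_diarization_speaker source loop, which is the recovery a vector-aware decompiler would ideally produce.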
Note: this
whisper.cpp
routine attempts to recognize which of two speakers is responsible for each word of a conversation. A speaker-misattribution exploit might attack functions that call this.
The source code includes:
static drwav_uint64 drwav_read_pcm_frames_s16__msadpcm(drwav* pWav, drwav_uint64 framesToRead, drwav_int16* pBufferOut) {
...
pWav->msadpcm.bytesRemainingInBlock = pWav->fmt.blockAlign - sizeof(header);
pWav->msadpcm.predictor[0] = header[0];
pWav->msadpcm.predictor[1] = header[1];
pWav->msadpcm.delta[0] = drwav__bytes_to_s16(header + 2);
pWav->msadpcm.delta[1] = drwav__bytes_to_s16(header + 4);
pWav->msadpcm.prevFrames[0][1] = (drwav_int32)drwav__bytes_to_s16(header + 6);
pWav->msadpcm.prevFrames[1][1] = (drwav_int32)drwav__bytes_to_s16(header + 8);
pWav->msadpcm.prevFrames[0][0] = (drwav_int32)drwav__bytes_to_s16(header + 10);
pWav->msadpcm.prevFrames[1][0] = (drwav_int32)drwav__bytes_to_s16(header + 12);
pWav->msadpcm.cachedFrames[0] = pWav->msadpcm.prevFrames[0][0];
pWav->msadpcm.cachedFrames[1] = pWav->msadpcm.prevFrames[1][0];
pWav->msadpcm.cachedFrames[2] = pWav->msadpcm.prevFrames[0][1];
pWav->msadpcm.cachedFrames[3] = pWav->msadpcm.prevFrames[1][1];
pWav->msadpcm.cachedFrameCount = 2;
...
}
This gets vectorized into sequences containing:
2c6ce: ccf27057 vsetivli zero,4,e16,mf2,ta,ma ; vl=4, SEW=16
2c6d2: 5e06c0d7 vmv.v.x v1,a3 ; v1[0..3] = a3
2c6d6: 3e1860d7 vslide1down.vx v1,v1,a6 ; v1 = v1[1:3], a6
2c6da: 3e1760d7 vslide1down.vx v1,v1,a4 ; v1 = v1[1:3], a4
2c6de: 3e1560d7 vslide1down.vx v1,v1,a0 ; v1 = (a3,a6,a4,a0)
2c6e2: 0d007057 vsetvli zero,zero,e32,m1,ta,ma ; keep existing vl (=4), SEW=32
2c6e6: 4a13a157 vsext.vf2 v2,v1 ; v2 = vector sext(v1) // widening sign extend
2c6ea: 0207e127 vse32.v v2,(a5) ; memcpy(a5, v2, 4 * 4)
2c6f2: 0a07d087 vlse16.v v1,(a5),zero ; v1 = a5[]
2c6fa: 0cf07057 vsetvli zero,zero,e16,mf2,ta,ma
2c702: 3e1660d7 vslide1down.vx v1,v1,a2 ; v1 = v1[1:3], a2
2c70a: 3e16e0d7 vslide1down.vx v1,v1,a3 ; v1 = v1[1:3], a3
2c70e: 3e1760d7 vslide1down.vx v1,v1,a4 ; v1 = v1[1:3], a4
2c712: 0d007057 vsetvli zero,zero,e32,m1,ta,ma
2c716: 4a13a157 vsext.vf2 v2,v1
2c71a: 0205e127 vse32.v v2,(a1)
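The commented idiom above - a broadcast, three vslide1down.vx steps, then a widening sign extension - assembles four 16-bit scalars into contiguous 32-bit fields. A scalar sketch (the function name is ours):

```c
#include <stdint.h>

/* Model of vmv.v.x + 3x vslide1down.vx + vsext.vf2 + vse32.v:
 * pack (x0,x1,x2,x3) as 16-bit lanes, then sign-extend each to 32 bits. */
static void model_pack_sext4(int16_t x0, int16_t x1, int16_t x2, int16_t x3,
                             int32_t out[4]) {
    int16_t v[4] = { x0, x0, x0, x0 };     /* vmv.v.x v1,a3: broadcast */
    const int16_t in[3] = { x1, x2, x3 };
    for (int k = 0; k < 3; k++) {          /* vslide1down.vx v1,v1,rs */
        for (int i = 0; i < 3; i++)
            v[i] = v[i + 1];               /* shift lanes down by one */
        v[3] = in[k];                      /* insert the scalar in the top lane */
    }
    for (int i = 0; i < 4; i++)
        out[i] = (int32_t)v[i];            /* vsext.vf2 v2,v1 + vse32.v */
}
```

In the drwav source this corresponds to filling four consecutive drwav_int32 fields from four drwav_int16 values in one vector store.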
That’s the kind of messy code you could analyze if you had to. Hopefully not.
How much complexity do vector instructions add to a top down analysis?
We know that whisper.cpp contains lots of vector instructions. Now we want to understand how few vector instruction blocks we really need to understand.
For this analysis we will assume a specific goal - inspect the final text output phase to see if an adversary has modified the generated text.
First we want to understand the unmodified behavior using a simple demo case. One of the whisper.cpp examples works well. It was built for the x86-64-v3 platform, not the riscv-64 gcv platform, but that’s fine - we just want to understand the rough sequencing and get a handle on the strings we might find in or near the top level main routine.
Note: added comments are flagged with
//
/opt/whisper_cpp$ ./main -f samples/jfk.wav
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs = 99
whisper_model_load: CPU total size = 147.46 MB (1 buffers)
whisper_model_load: model size = 147.37 MB
whisper_init_state: kv self size = 16.52 MB
whisper_init_state: kv cross size = 18.43 MB
whisper_init_state: compute buffer (conv) = 14.86 MB
whisper_init_state: compute buffer (encode) = 85.99 MB
whisper_init_state: compute buffer (cross) = 4.78 MB
whisper_init_state: compute buffer (decode) = 96.48 MB
system_info: n_threads = 4 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 |
// done with initialization, let's run speech-to-text
main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...
// this is the reference line our adversary wants to modify:
[00:00:00.000 --> 00:00:11.000] And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
// display statistics
whisper_print_timings: load time = 183.72 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 10.30 ms
whisper_print_timings: sample time = 33.90 ms / 131 runs ( 0.26 ms per run)
whisper_print_timings: encode time = 718.87 ms / 1 runs ( 718.87 ms per run)
whisper_print_timings: decode time = 8.35 ms / 2 runs ( 4.17 ms per run)
whisper_print_timings: batchd time = 150.96 ms / 125 runs ( 1.21 ms per run)
whisper_print_timings: prompt time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 1110.87 ms
The adversary wants to change the text output from “… ask what you can do for your country.” to “… ask what you can do for your enemy.”
They likely drop a string substitution into the code between the output of `main: processing` and `whisper_print_timings:`, probably very close to the code printing timestamp intervals like `[00:00:00.000 --> 00:00:11.000]`.
Our RISCV-64 binary retains some function names and lots of relevant strings. We want to accumulate strings that occur in the demo printout, then glance at the functions that reference those strings.
For this example we will use a binary that includes some debugging type information. Ghidra can determine names of structure types but not necessarily the size or field names of those structures.
- `%s: processing '%s' (%d samples, %.1f sec), %d threads, %d processors, %d beams + best of %d, lang = %s, task = %s, %stimestamps = %d ...` is referenced near the middle of `main`
- `[%s --> %s]` is referenced by `whisper_print_segment_callback`
- `[%s --> %s] %s\n` is referenced by `whisper_full_with_state`
- `segment` occurs in several places, suggesting that the word refers to a segment of text generated from speech between two timestamps
- `ctx` occurs 33 times, suggesting that a context structure is used - and occasionally displayed with field names
- `error: failed to initialize whisper context\n` is referenced within `main`. It may help in understanding internal data organization.
- `main` - Ghidra decompiles this as ~1000 C statements, including many vector statements
- `whisper_print_timings` - referenced directly in `main` near the end
- `whisper_full_with_state` - referenced indirectly from `main` via `whisper_full_parallel` and `whisper_full`
- `output_txt` - referenced directly in `main`; invokes I/O routines like `std::__ostream_insert<>`. There are other output routines like `output_json`. The specific output routine can be selected as a command line parameter to `main`.

Ghidra knows that these exist as names, but the details are left to us to unravel.

- `gpt_params` and `gpt_vocab` - these look promising, at a lower ML level
- `whisper_context` - this likely holds most of the top-level data
- `whisper_full_params` and `whisper_params` - likely structures related to the optional parameters revealed with the `--help` command line option
- `whisper_segment` - possibly a segment of digitized audio to be converted as speech
- `whisper_vocab` - possibly holding the text words known to the training data

Now we have enough context to narrow the search. We want to know:
- Does `main` call either `whisper_print_segment_callback` or `whisper_full_with_state`?
- `whisper_full` is called directly by `main`. Ghidra reports this to be about 3000 lines of C. The Ghidra call tree suggests that this function does most of the speech-to-text tensor math and other ML heavy lifting.
- `whisper_print_segment_callback` appears to be inserted into a C++ object vtable as a function pointer. The object itself appears to be built on `main`'s stack, so we don't immediately know its size or use. `whisper_print_segment_callback` is less than a tenth the size of `whisper_full_with_state`.
- What code references `[%s --> %s]`?

A simple but tedious technique involves a mix of top-down and bottom-up analysis. We work upwards from strings and function references, and down
from the main
routine towards the functions associated with our target text string. Trial and error with lots of backtracking are common here, so
switching back and forth between top-down and bottom-up exploration can provide fresh insights.
Remember that we don’t want to understand any more of whisper.cpp
than we have to. The adversary we are chasing only wants to understand where
the generated text comes within reach. Neither they nor we need to understand all of the ways the C++ standard library might use vector instructions
during I/O subsystem initialization.
On the other hand, they and we may need to recognize basic I/O and string handling operations, since the target text is likely to exist as either a standard string or a standard vector of strings.
Note: This isn’t a tutorial on how to approach a C++ reverse engineering challenge - it’s an evaluation of how vectorization might make that more difficult and an exploration of what additional tools Ghidra or Ghidra users may find useful when faced with vectorization. That means we’ll skip most of the non-vector analysis.
This sequence from main
affects initialization and obscures a possible exploit vector.
vsetivli_e8m8tama(0x17); // memcpy(puStack_110, "models/ggml-base.en.bin", 0x17)
auVar27 = vle8_v(0xa6650);
vsetivli_e8m8tama(0xf); // memcpy(puStack_f0, " [SPEAKER_TURN]", 0xf)
auVar26 = vle8_v(0xa6668);
puStack_f0 = auStack_e0;
vsetivli_e8m8tama(0x17);
vse8_v(auVar27,puStack_110);
vsetivli_e8m8tama(0xf);
vse8_v(auVar26,puStack_f0);
puStack_d0 = &uStack_c0;
vsetivli_e64m1tama(2); // memset(lStack_b0, 0, 16)
vmv_v_i(auVar25,0);
vse64_v(auVar25,&lStack_b0);
*(char *)((long)puStack_110 + 0x17) = '\0';
If the hypothetical adversary wanted to replace the training model ggml-base.en.bin
with a less benign model, changing the
memory reference within vle8_v(0xa6650)
would be a good place to do it. Note that the compiler has interleaved instructions
generated from the two memcpy expansions, at the cost of two extra vsetivli
instructions. This allows more time for the
vector load instructions to complete.
output_txt
Some browsing in Ghidra suggests that the following section of main
is close to where we need to focus.
lVar11 = whisper_full_parallel
(ctx,(long)pFVar18,(ulong)pvStack_348,
(long)(int)(lStack_340 - (long)pvStack_348 >> 2),
(long)pvVar20);
if (lVar11 == 0) {
putchar(10,pFVar18);
if (params.do_output_txt != false) {
/* try { // try from 0001dce8 to 0001dceb has its CatchHandler @ 0001e252 */
std::operator+(&full_params,(undefined8 *)pFStack_2e0,
(undefined8 *)pFStack_2d8,(undefined8 *)".txt",
(char *)pvVar20);
uVar13 = full_params._0_8_;
/* try { // try from 0001dcfc to 0001dcfd has its CatchHandler @ 0001e2ec */
std::vector<>::vector(unaff_s3,(vector<> *)unaff_s5);
/* try { // try from 0001dd06 to 0001dd09 has its CatchHandler @ 0001e2f0 */
output_txt(ctx,(char *)uVar13,¶ms,(vector *)unaff_s3);
std::vector<>::~vector(unaff_s3);
std::__cxx11::basic_string<>::_M_dispose((basic_string<> *)&full_params);
}
...
}
Looking into output_txt
Ghidra gives us:
long output_txt(whisper_context *ctx,char *output_file_path,whisper_params *param_3,vector *param_4)
{
fprintf(_stderr,"%s: saving output to \'%s\'\n","output_txt",output_file_path);
max_index = whisper_full_n_segments(ctx);
index = 0;
if (0 < max_index) {
do {
__s = (char *)whisper_full_get_segment_text(ctx,index);
...
sVar8 = strlen(__s);
std::__ostream_insert<>((basic_ostream *)plVar7,__s,sVar8);
...
index = (long)((int)index + 1);
} while (max_index != index);
...
}
...
}
Finally, whisper_full_get_segment_text
is decompiled into:
undefined8 whisper_full_get_segment_text(whisper_context *ctx,long index)
{
gp = &__global_pointer$;
return *(undefined8 *)(index * 0x50 + *(long *)(ctx->state + 0xa5f8) + 0x10);
}
Now the adversary has enough information to try rewriting the generated text from an arbitrary segment of speech.
The text is found in an array linked into the ctx
context variable, probably during the call to whisper_full_parallel
.
Our key goal is to understand how much effort to put into Ghidra’s decompiler processing of RISCV-64 vector instructions. The metric for measuring that effort is relative to the effort needed to understand the other instructions produced by a C++ optimizing compiler implementing libstdc++ containers like vectors.
Take a closer look at the call to output_txt
:
std::vector<>::vector(unaff_s3,(vector<> *)unaff_s5);
output_txt(ctx,(char *)uVar13,¶ms,(vector *)unaff_s3);
std::vector<>::~vector(unaff_s3);
The unaff_s3
parameter to output_txt
might be important. Maybe we should examine the constructor and destructor for
this object to probe its internal structure.
In fact unaff_s3
is only used when passing stereo audio into output_txt
, so it is more of a red herring
slowing down the analysis than a true roadblock. Its internal structure is a C++ standard vector of C++ standard vectors
of float, so it's a decent example of what happens when RISCV-64 vector instructions are used to implement vectors
(and two dimensional matrices) at a higher abstraction level.
A little analysis shows us that std::vector<>::vector
is actually a copy constructor for a class generated from
a vector template. The true type of unaff_s3
and unaff_s5
is roughly std::vector<std::vector<float>>
.
Comment: the copy constructor and the associated destructor are likely present only because the programmer didn’t mark the parameter as a
const
reference.
The destructor std::vector<>::~vector(unaff_s3)
listing shows no vector instructions are used. The inner vectors
are deleted and their memory reclaimed, then the outer containing vector is deleted.
The constructor std::vector<>::vector
is different. Vector instructions are used often, but in very simple contexts.
- The `vset` mode used is `vsetivli_e64m1tama(2)`, asking for no more than two 64 bit elements in a vector register

If whisper.cpp is representative of a broader class of ML programs compiled for RISCV-64 vector-enabled hardware, then:

- `vset*` configurations (e.g., `e64m1tama`) should be explicitly recognized at the pcodeop layer and displayed in the decompiler view.
- Ghidra should recognize common vector expansions of `memcpy` and `memset`, plus examples showing the common source code patterns resulting in vector reduction, width conversion, slideup/down, and gather/scatter instructions.

Other Ghidra extensions would be nice to have but likely deliver diminishing bang-for-the-buck relative to multiplatform C++ analytics:
- Extending `*.sinc` file syntax to convey comments or hints to be visible in the decompiler view, either as pop-ups, instruction info, or comment blocks.
- Displaying recognized vector idioms as `__builtin_memcpy(...)` calls.

The toughest challenges might be:
This testbed uses several open source components that need descriptions and reference links.
We track the Ghidra repository for released Ghidra packages, currently Ghidra 11.0. A Ghidra fork is also used here which adds proposed RISCV instruction set extension support. The host environment for this project is currently a Fedora 39 workstation with an AMD Ryzen 9 5900HX and 32 GB of RAM.
/home2/vendor/binutils-gdb
/home2/vendor/gcc
/home2/vendor/glibc
Adding a new toolchain takes lots of little steps, and some trial and error.
We want x86_64 exemplars built with the same next generation of gcc, libc, and libstdc++ as we use for RISCV exemplars. This will give us some hints about how common new issues may be and how global new solutions may need to be.
We will generate this x86_64 gcc-14 toolchain about the same way as our existing RISCV-64 gcc-14 toolchain.
This example uses the latest released version of binutils and the development head of gcc and glibc.
If we were building a toolchain for an actual product we would start by configuring and building a specialized kernel, which would prepopulate the system root. We aren’t doing that here, so we will use placeholders from the Fedora 40 x86_64 kernel.
We want binutils installed first.
$ cd /home2/vendor/binutils-gdb
$ git log
commit 2c73aeb8d2e02de7b69cbcb13361cfbca9d76a4e (HEAD, tag: binutils-2_41)
Author: Nick Clifton <nickc@redhat.com>
Date: Sun Jul 30 14:55:52 2023 +0100
The 2.41 release!
...
$ cd /home2/build_x86/binutils
$ .../vendor/binutils-gdb/configure --prefix=/opt/gcc14 --disable-multilib --enable-languages=c,c++,rust,lto
...
$ make
$ make install
...
The gcc suite and the glibc standard library have a circular dependency. We build and install the basic gcc capability first, then glibc, and then finish with the rest of gcc. During this process we likely need to add system files to the new sysroot directory.
$ cd /home2/vendor/gcc
$ git log
commit ac9c81dd76cfc34ed53402049021689a61c6d6e7 (HEAD -> master, origin/trunk, origin/master, origin/HEAD)
Author: Pan Li <pan2.li@intel.com>
Date: Mon Dec 18 21:40:00 2023 +0800
RISC-V: Rename the rvv test case.
...
$ cd /home2/build_x86/gcc
/home2/vendor/gcc/configure --prefix=/opt/gcc14 --disable-multilib --enable-languages=c,c++,rust,lto
$ make
...
$ make install
...
The make
and make install
may throw errors after completing the basic compiler. If so, we can
complete the build after we get glibc installed.
We should have enough of gcc-14 built to configure and build the 64 bit glibc package. This pending release of glibc has lots of changes, so we can expect some tinkering to get it to work for us.
$ cd /home2/vendor/glibc
$ git log
commit e957308723ac2e55dad360d602298632980bbd38 (HEAD -> master, origin/master, origin/HEAD)
Author: Matthew Sterrett <matthew.sterrett@intel.com>
Date: Fri Dec 15 12:04:05 2023 -0800
x86: Unifies 'strlen-evex' and 'strlen-evex512' implementations.
...
$ mkdir -p /home2/build_x86/glibc
$ cd /home2/build_x86/glibc
$ /home2/vendor/glibc/configure CC="/opt/gcc14/bin/gcc" --prefix="/usr" install_root=/opt/gcc14/sysroot --disable-werror --enable-shared --disable-multilib
$ make
$ make install_root=/opt/gcc14/sysroot install
$ du -hs /opt/gcc14/sysroot
105M /opt/gcc14/sysroot
If the gcc
installation errored out before completion, try it again after glibc is installed. This time it should complete without error.
Next we want to exercise the toolchain by compiling a very simple C program:
#include <stdio.h>
int main(int argc, char** argv){
const int N = 1320;
char s[N];
for (int i = 0; i < N - 1; ++i)
s[i] = i + 1;
s[N - 1] = '\0';
printf(s);
}
We’ll build it with three sets of options and import all three into Ghidra 11
/opt/gcc14/bin/gcc gcc_vectorization/helloworld_challenge.c -o a_unoptimized.out
/opt/gcc14/bin/gcc -O3 gcc_vectorization/helloworld_challenge.c -o a_host_optimized.out
/opt/gcc14/bin/gcc -march=rocketlake -O3 gcc_vectorization/helloworld_challenge.c -o a_rocketlake_optimized.out
Note: Rocket Lake is Intel’s codename for its 11th generation Core microprocessors
Ghidra 11 gives us:
- `a_unoptimized.out` imports and decompiles cleanly, with recognizable disassembly and decompiler output of 5 lines of code.
- `a_host_optimized.out` imports cleanly and decompiles into about 150 lines of hard-to-interpret C code. The loop has been autovectorized using instructions like `PUNPCKHWD`, `PUNPCKLWD`, and `PADDD`. These are baseline SSE2 packed-integer instructions rather than AVX-512 extensions.
- `a_rocketlake_optimized.out` fails to disassemble or decompile when it hits AVX2 instructions like `vpbroadcastd`. Binutils 2.41's `objdump` appears to recognize these instructions.
appears to recognize these instructions.As a stretch goal, what does the gcc-14 Rust compiler give us?
/opt/gcc14/bin/gccrs -frust-incomplete-and-experimental-compiler-do-not-use src/main.rs
src/main.rs:25:5: error: unknown macro: [log::info]
25 | log::info!(
| ^~~
src/main.rs:29:5: error: unknown macro: [log::info]
29 | log::info!(
| ^~~
...
If gccrs can't handle basic Rust macros, it isn't very useful for generating exemplars. We won't include it in our portable toolchain.
Now we need to make the toolchain hermetic, portable, and ready for Bazel workspaces.
Hermeticity means that nothing under /opt/gcc14
makes a hidden reference to local host files under /usr
. Any such reference needs to
be changed into a relative reference. These are common in short shareable object files that link to one or more true shareable object libraries.
You can often identify possible troublemakers by searching for smallish regular files with a .so
extension.
$ find /opt/gcc14 -name \*.so -type f -size -1000c -ls
... 273 Dec 27 12:41 /opt/gcc14/lib/libc.so
... 126 Dec 27 12:42 /opt/gcc14/lib/libm.so
... 132 Dec 27 11:42 /opt/gcc14/lib64/libgcc_s.so
$ cat /opt/gcc14/lib/libc.so
/* GNU ld script
Use the shared library, but some functions are only in
the static library, so try that secondarily. */
OUTPUT_FORMAT(elf64-x86-64)
GROUP ( /opt/gcc14/lib/libc.so.6 /opt/gcc14/lib/libc_nonshared.a AS_NEEDED ( /opt/gcc14/lib/ld-linux-x86-64.so.2 ) )
$ cat /opt/gcc14/lib/libm.so
/* GNU ld script
*/
OUTPUT_FORMAT(elf64-x86-64)
GROUP ( /opt/gcc14/lib/libm.so.6 AS_NEEDED ( /opt/gcc14/lib/libmvec.so.1 ) )
$ cat /opt/gcc14/lib64/libgcc_s.so
/* GNU ld script
Use the shared library, but some functions are only in
the static library. */
GROUP ( libgcc_s.so.1 -lgcc )
So two of the three files need patching: replacing `/opt/gcc14/lib` with `.`. Any text editor will do.
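After patching, the `libc.so` linker script would read something like this sketch - only the absolute path prefixes change:

```
/* GNU ld script */
OUTPUT_FORMAT(elf64-x86-64)
GROUP ( ./libc.so.6 ./libc_nonshared.a AS_NEEDED ( ./ld-linux-x86-64.so.2 ) )
```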
Next we need to identify all dynamic host dependencies of binaries under /opt/gcc14
.
The ldd
command will identify these local system files,
which should be collected into a separate tarball. This tarball can be shared with other cross-compilers
built at the same time, and is generally portable across similar Linux kernels and distributions.
At this point we can strip
the executable files within the toolchain and identify the ones we want to keep in the portable toolchain tarball.
Scripts under generated/toolchains/gcc-14-*/scripts
will help with that.
generate.sh
uses rsync to copy selected files from /opt/gcc14
into /tmp/export
, stripping the known binaries, and creating the portable tarball.
It then collects relevant dynamic libraries from the host and creates a second portable tarball.
These two tarballs can then be copied to other computers and imported into a project by adding stanzas to the project WORKSPACE file.
The toolchain tarball is currently in `/opt/bazel`. We need its full path and sha256sum to make it accessible within our workspace. Edit `WORKSPACE` to include:
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")
# gcc-14 x86_64 toolchain from snapshot gcc-14 and glibc development heads
http_archive(
name = "gcc-14-x86_64-suite",
urls = ["file:///opt/bazel/x86_64_linux_gnu-14.tar.xz"],
build_file = "//:gcc-14-x86_64-suite.BUILD",
sha256 = "40cc4664a11b8da56478393c7c8b823b54f250192bdc1e1181c9e4f8ac15e3be",
)
# system libraries used by toolchain build system
# We built the custom toolchain on a fedora x86_64 platform, so we need some
# fedora x86_64 sharable system libraries to execute.
http_archive(
name = "fedora39-system-libs",
urls = ["file:///opt/bazel/fedora39_system_libs.tar.xz"],
build_file = "//:fedora39-system-libs.BUILD",
sha256 = "fe91415b05bb902964f05f7986683b84c70338bf484f23d05f7e8d4096949d1b",
)
Bazel will unpack this tarball into an external project directory, something like /run/user/1000/bazel/execroot/_main/external/gcc-14-x86_64-suite/
.
Individual files and filegroups within that directory are defined in x86_64/generated/gcc-14-x86_64-suite.BUILD
.
The filegroup compiler_files
is probably the most important, as it collects everything that might be used in anything launched from gcc or g++.
The full Bazel name for this filegroup is @gcc-14-x86_64-suite//:compiler_files
.
Each custom toolchain is defined within the x86_64/generated/toolchains/BUILD
file. This associates filegroups from a (possibly shared) toolchain tarball like
gcc-14-x86_64-suite
with a set of default compiler and linker options and standard libraries. We might want multiple gcc-14 toolchains, for building kernels,
kernel modules, and userspace applications respectively.
Most of the configuration exists within stanzas like this:
toolchain(
name = "x86_64_default",
target_compatible_with = [
":x86_64",
],
toolchain = ":x86_64-default-gcc",
toolchain_type = "@bazel_tools//tools/cpp:toolchain_type",
)
cc_toolchain(
name = "x86_64-default-gcc",
all_files = ":all_files",
ar_files = ":gcc_14_compiler_files",
as_files = ":empty",
compiler_files = ":gcc_14_compiler_files",
dwp_files = ":empty",
linker_files = ":gcc_14_compiler_files",
objcopy_files = ":empty",
strip_files = ":empty",
supports_param_files = 0,
toolchain_config = ":x86_64-default-gcc-config",
toolchain_identifier = "x86_64-default-gcc",
)
cc_toolchain_config(
name = "x86_64-default-gcc-config",
abi_libc_version = ":empty",
abi_version = ":empty",
compile_flags = [
# take the isystem ordering from the output of gcc -xc++ -E -v -
"--sysroot", "external/gcc-14-x86_64-suite/sysroot/",
"-Wall",
],
compiler = "gcc",
coverage_compile_flags = ["--coverage"],
coverage_link_flags = ["--coverage"],
cpu = "x86_64",
# we really want the following to be constructed from $(output_base) or $(location ...)
cxx_builtin_include_directories = [
OUTPUT_BASE + "/external/gcc-14-x86_64-suite/sysroot/usr/include",
OUTPUT_BASE + "/external/gcc-14-x86_64-suite/x86_64-pc-linux-gnu/include/c++/14.0.0",
OUTPUT_BASE + "/external/gcc-14-x86_64-suite/lib/gcc/x86_64-pc-linux-gnu/14.0.0/include",
OUTPUT_BASE + "/external/gcc-14-x86_64-suite/lib/gcc/x86_64-pc-linux-gnu/14.0.0/include-fixed",
],
cxx_flags = [
"-std=c++20",
"-fno-rtti",
],
dbg_compile_flags = ["-g"],
host_system_name = ":empty",
link_flags = ["--sysroot", "external/gcc-14-x86_64-suite/sysroot/"],
link_libs = ["-lstdc++", "-lm"],
opt_compile_flags = [
"-g0",
"-Os",
"-DNDEBUG",
"-ffunction-sections",
"-fdata-sections",
],
opt_link_flags = ["-Wl,--gc-sections"],
supports_start_end_lib = False,
target_libc = ":empty",
target_system_name = ":empty",
tool_paths = {
"ar": "gcc-14-x86_64/imported/ar",
"ld": "gcc-14-x86_64/imported/ld",
"cpp": "gcc-14-x86_64/imported/cpp",
"gcc": "gcc-14-x86_64/imported/gcc",
"dwp": ":empty",
"gcov": ":empty",
"nm": "gcc-14-x86_64/imported/nm",
"objcopy": "gcc-14-x86_64/imported/objcopy",
"objdump": "gcc-14-x86_64/imported/objdump",
"strip": "gcc-14-x86_64/imported/strip",
},
toolchain_identifier = "gcc-14-x86_64",
unfiltered_compile_flags = [
"-fno-canonical-system-headers",
"-Wno-builtin-macro-redefined",
"-D__DATE__=\"redacted\"",
"-D__TIMESTAMP__=\"redacted\"",
"-D__TIME__=\"redacted\"",
],
)
The tool_paths
element points to small bash scripts needed to launch compiler components like gcc
and ar
and strip
.
These give us the chance to use imported system shareable object libraries rather than the host’s shareable object libraries.
#!/bin/bash
set -euo pipefail
PATH=`pwd`/toolchains/gcc-14-x86_64/imported \
LD_LIBRARY_PATH=external/fedora39-system-libs \
external/gcc-14-x86_64-suite/bin/gcc "$@"
Compiling and linking source files takes many dependent files from /opt/gcc14
. The next step is tedious and iterative - we need to prove
that the portable toolchain tarball derived from /opt/gcc14
never references any file in that directory, or any local host file under /usr
.
Bazel can do that for us, at the cost of identifying every file or file ‘glob’ that may be called for each of the toolchain primitives.
It runs the toolchain in a sandbox, forcing an exception on all references not previously declared as dependencies.
This kind of exception looks like this:
ERROR: /home/XXX/projects/github/ghidra_import_tests/x86_64/generated/userSpaceSamples/BUILD:3:10: Compiling userSpaceSamples/helloworld.c failed: absolute path inclusion(s) found in rule '//userSpaceSamples:helloworld':
the source file 'userSpaceSamples/helloworld.c' includes the following non-builtin files with absolute paths (if these are builtin files, make sure these paths are in your toolchain):
'/usr/include/stdc-predef.h'
'/usr/include/stdio.h'
If you see this, check that:

- `stdio.h` was installed in the right directory under `/opt/gcc14`
- `stdio.h` was copied into `/tmp/export` when building the tarball
- `stdio.h` appeared in the appropriate compiler file groups defined in `gcc-14-x86_64-suite.BUILD`
- the compile flags tell the compiler where to find `stdio.h`, e.g. `"-isystem", "external/gcc-14-x86_64-suite/sysroot/usr/include"`
- the link step can find startup files like `crt1.o` and `crti.o`
We can test our new toolchain with a build of `helloworld`.
x86_64/generated$ bazel clean
INFO: Starting clean (this may take a while). Consider using --async if the clean takes more than several minutes.
x86_64/generated$ bazel run -s --platforms=//platforms:x86_64_default userSpaceSamples:helloworld
INFO: Analyzed target //userSpaceSamples:helloworld (69 packages loaded, 1538 targets configured).
SUBCOMMAND: # //userSpaceSamples:helloworld [action 'Compiling userSpaceSamples/helloworld.c', configuration: 672d6d72a34879952e2365b9bc032c10f7e50fda380c4b7c8e86b49faa982e8b, execution platform: @@local_config_platform//:host, mnemonic: CppCompile]
(cd /run/user/1000/bazel/execroot/_main && \
exec env - \
PATH=/home/thixotropist/.local/bin:/home/thixotropist/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/var/lib/snapd/snap/bin:/home/thixotropist/.local/bin:/home/thixotropist/bin:/opt/ghidra_10.3.2_PUBLIC/:/home/thixotropist/.cargo/bin::/usr/lib/jvm/jdk-17-oracle-x64/bin:/opt/gradle-7.6.2/bin \
PWD=/proc/self/cwd \
toolchains/gcc-14-x86_64/imported/gcc -U_FORTIFY_SOURCE --sysroot external/gcc-14-x86_64-suite/sysroot/ -Wall -MD -MF bazel-out/k8-fastbuild/bin/userSpaceSamples/_objs/helloworld/helloworld.pic.d '-frandom-seed=bazel-out/k8-fastbuild/bin/userSpaceSamples/_objs/helloworld/helloworld.pic.o' -fPIC -iquote . -iquote bazel-out/k8-fastbuild/bin -iquote external/bazel_tools -iquote bazel-out/k8-fastbuild/bin/external/bazel_tools -fno-canonical-system-headers -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' -c userSpaceSamples/helloworld.c -o bazel-out/k8-fastbuild/bin/userSpaceSamples/_objs/helloworld/helloworld.pic.o)
# Configuration: 672d6d72a34879952e2365b9bc032c10f7e50fda380c4b7c8e86b49faa982e8b
# Execution platform: @@local_config_platform//:host
SUBCOMMAND: # //userSpaceSamples:helloworld [action 'Linking userSpaceSamples/helloworld', configuration: 672d6d72a34879952e2365b9bc032c10f7e50fda380c4b7c8e86b49faa982e8b, execution platform: @@local_config_platform//:host, mnemonic: CppLink]
(cd /run/user/1000/bazel/execroot/_main && \
exec env - \
PATH=/home/thixotropist/.local/bin:/home/thixotropist/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/var/lib/snapd/snap/bin:/home/thixotropist/.local/bin:/home/thixotropist/bin:/opt/ghidra_10.3.2_PUBLIC/:/home/thixotropist/.cargo/bin::/usr/lib/jvm/jdk-17-oracle-x64/bin:/opt/gradle-7.6.2/bin \
PWD=/proc/self/cwd \
toolchains/gcc-14-x86_64/imported/gcc -o bazel-out/k8-fastbuild/bin/userSpaceSamples/helloworld -Wl,-S --sysroot external/gcc-14-x86_64-suite/sysroot/ bazel-out/k8-fastbuild/bin/userSpaceSamples/_objs/helloworld/helloworld.pic.o -lstdc++ -lm)
# Configuration: 672d6d72a34879952e2365b9bc032c10f7e50fda380c4b7c8e86b49faa982e8b
# Execution platform: @@local_config_platform//:host
INFO: Found 1 target...
Target //userSpaceSamples:helloworld up-to-date:
bazel-bin/userSpaceSamples/helloworld
INFO: Elapsed time: 0.289s, Critical Path: 0.10s
INFO: 6 processes: 4 internal, 2 linux-sandbox.
INFO: Build completed successfully, 6 total actions
INFO: Running command line: bazel-bin/userSpaceSamples/helloworld
Hello World!
$ strings bazel-bin/userSpaceSamples/helloworld|grep -i gcc
GCC: (GNU) 14.0.0 20231218 (experimental)
Things to note:
- `--platforms=//platforms:x86_64_default` shows we are not building for the local host
- `toolchains/gcc-14-x86_64/imported/gcc` is invoked twice, once to compile and once to link
- `--sysroot external/gcc-14-x86_64-suite/sysroot` is used twice, to avoid including host files under `/usr`
- The `helloworld` executable happens to execute on the host machine.
- The `helloworld` executable contains no references to gcc-13, the native toolchain on the host machine.
$ bazel run -s --platforms=//platforms:x86_64_default userSpaceSamples:helloworld++
...
Target //userSpaceSamples:helloworld++ up-to-date:
bazel-bin/userSpaceSamples/helloworld++
INFO: Elapsed time: 0.589s, Critical Path: 0.51s
INFO: 3 processes: 1 internal, 2 linux-sandbox.
INFO: Build completed successfully, 3 total actions
INFO: Running command line: bazel-bin/userSpaceSamples/helloworld++
Hello World!
We've got a working toolchain, but with many dangling links, duplicate files, and unused definitions. The toolchain files normally provided by a kernel were copied in as needed from the host, with the understanding that we never really needed runnable applications.
If this were a production environment we would be a lot more careful. It’s not, so we will just summarize some of the areas that might benefit from such a cleanup.
Adding and testing a toolchain involves lots of similar-looking directories.
This directory is the install target for our binutils, gcc, and glibc builds. It is not itself used by the Bazel build framework, and need not be present on any host machine running the toolchain.
`fdupes` reports 2446 duplicate files (in 2136 sets), occupying 200.2 megabytes.

This directory holds a subset of `/opt/gcc14`, with many binaries stripped. The hard links of `/opt/gcc14` are lost. It may be discarded after the portable tarball is generated.
`fdupes` reports 1010 duplicate files (in 983 sets), occupying 69.0 megabytes.

The compressed portable tarball size is 171M. It expands into a locally cached equivalent of `/tmp/export`.
This file must be accessible to Bazel during a cross-compilation, either as a file reference or as a remote http or
https URL.
Bazel agents will decompress the tarball if and when needed into a local cache directory. In this case, it is unpacked into a RAM file system for speed.
Temporary sandboxes like this are created when individual compiler and linker steps are executed. They contain whatever subset of the cached toolchain tarball is explicitly named as dependencies for that step. Toolchain references outside of the sandbox are often flagged as hermeticity errors and abort the build.
Debugging toolchains can be tedious
Suppose you wanted to build a gcc-14 toolchain with the latest glibc standard libraries, and you were using a Linux host with gcc-14 and reasonably current glibc standard libraries. How would you guarantee that none of your older host files were accidentally used where you expected the newer gcc and glibc files to be used?
Bazel enforces this hermeticity by running all toolchain steps in a sandbox, where only declared dependencies of the toolchain components are visible. That means nothing under /usr or $HOME is generally available, and any attempt to access files there will abort the build.
Example:
ERROR: /home/XXX/projects/github/ghidra_import_tests/x86_64/generated/userSpaceSamples/BUILD:3:10: Compiling userSpaceSamples/helloworld.c failed: absolute path inclusion(s) found in rule '//userSpaceSamples:helloworld':
the source file 'userSpaceSamples/helloworld.c' includes the following non-builtin files with absolute paths (if these are builtin files, make sure these paths are in your toolchain):
'/usr/include/stdc-predef.h'
'/usr/include/stdio.h'
In this example the toolchain tried to load host files, where it should have been loading equivalent files from the toolchain tarball.
Bazel toolchains should provide and encapsulate almost everything host computers need to compile and link executables. The goal is simply to minimize toolchain differences between individual developers’ workstations and the reference Continuous Integration test servers. The toolchains do not include kernels or loaders, or system code tightly associated with the kernel. That presents a challenge, since we want the linker to be imported as part of the toolchain, while the system loader is provided by the host.
Common toolchain failure modes often show up during cross-compilation of something as simple as riscv64-unknown-linux-gnu-gcc helloworld.c.
- The gcc compiler must find the compiler dynamic libraries it was compiled with, probably using LD_LIBRARY_PATH to find them. These include files like libstdc++.so.6, which links to concrete versions like libstdc++.so.6.0.32, plus the loader ld-linux-x86-64.so.2.
- The gcc executable must find and execute multiple other executables from the toolchain, such as cpp, as, and ld.
- gcc internally invokes the linker ld:
  - ld executes on the host computer - we assume an x86_64 Linux system - which means it needs an x86_64 libc.so library from the toolchain. It also generally needs to link object files against the target platform's libc.so library, found elsewhere in the toolchain.
  - ld also often needs files specific to the target system's kernel or loader, such as usr/lib/crt1.o.
  - ld accepts many arguments detailing the target system's memory model. Different arguments cause the linker to require different linker scripts under .../ldscripts.
  - ld sharable object files can be scripts referencing other libraries - and those references may be absolute, not relative. These scripts may need to be patched so that host paths are not followed.

Compiler developers often refactor their dependent file layouts, making it very easy to not have required files in the expected places. You will generally get a useful error message if something like crt1.o isn't located. If a dynamic library is not found in a child process, you might just get a segfault.
The debugging process often proceeds with:

- Rerunning the failing build with --sandbox_debug to retain the sandbox build root.
- Running the failing gcc command directly.
- Running the gcc command within an strace command, with options to follow child processes and expand strings. Examine execve and open system calls to verify that imported files are found before host system files, and that the imported files are actually in a searched directory.

The cross-compiler toolchain assumes that all files needed for a build are known to the Bazel build system. This assumption often breaks when upgrading a compiler or OS. This example shows what can happen when updating the host OS from Fedora 39 to Fedora 40.
The relevant integration test is generateInternalExemplars.py:
$ ./generateInternalExemplars.py
...
FAIL: test_03_riscv64_build (__main__.T0BazelEnvironment.test_03_riscv64_build)
riscV64 C build of helloworld, with checks to see if a compatible toolchain was
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/thixotropist/projects/github/ghidra_import_tests/./generateInternalExemplars.py", line 58, in test_03_riscv64_build
self.assertEqual(0, result.returncode,
AssertionError: 0 != 1 : bazel //platforms:riscv_userspace build of userSpaceSamples:helloworld failed
...
Ran 8 tests in 6.290s
FAILED (failures=5)
The error log is large, showing 5 failures out of 8 tests. We will narrow the test to a single test case:
$ python ./generateInternalExemplars.py T0BazelEnvironment.test_03_riscv64_build
INFO:root:Running: bazel --noworkspace_rc --output_base=/run/user/1000/bazel build -s --distdir=/opt/bazel/distdir --incompatible_enable_cc_toolchain_resolution --experimental_enable_bzlmod --incompatible_sandbox_hermetic_tmp=false --save_temps --platforms=//platforms:riscv_userspace --compilation_mode=dbg userSpaceSamples:helloworld
...
ERROR: /home/thixotropist/projects/github/ghidra_import_tests/riscv64/generated/userSpaceSamples/BUILD:3:10: Compiling userSpaceSamples/helloworld.c failed: (Segmentation fault): gcc failed: error executing CppCompile command (from target //userSpaceSamples:helloworld) toolchains/gcc-14-riscv/imported/gcc -U_FORTIFY_SOURCE '--sysroot=external/gcc-14-riscv64-suite/sysroot' -Wall -g -MD -MF bazel-out/k8-dbg/bin/userSpaceSamples/_objs/helloworld/helloworld.pic.s.d ... (remaining 20 arguments skipped)
Use --sandbox_debug to see verbose messages from the sandbox and retain the sandbox build root for debugging
toolchains/gcc-14-riscv/imported/gcc: line 5: 4 Segmentation fault (core dumped) PATH=`pwd`/toolchains/gcc-14-riscv/imported LD_LIBRARY_PATH=external/fedora39-system-libs external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-gcc "$@"
ERROR: /home/thixotropist/projects/github/ghidra_import_tests/riscv64/generated/userSpaceSamples/BUILD:3:10: Compiling userSpaceSamples/helloworld.c failed: (Segmentation fault): gcc failed: error executing CppCompile command (from target //userSpaceSamples:helloworld) toolchains/gcc-14-riscv/imported/gcc -U_FORTIFY_SOURCE '--sysroot=external/gcc-14-riscv64-suite/sysroot' -Wall -g -MD -MF bazel-out/k8-dbg/bin/userSpaceSamples/_objs/helloworld/helloworld.pic.i.d ... (remaining 20 arguments skipped)
Use --sandbox_debug to see verbose messages from the sandbox and retain the sandbox build root for debugging
toolchains/gcc-14-riscv/imported/gcc: line 5: 4 Segmentation fault (core dumped) PATH=`pwd`/toolchains/gcc-14-riscv/imported LD_LIBRARY_PATH=external/fedora39-system-libs external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-gcc "$@"
ERROR: /home/thixotropist/projects/github/ghidra_import_tests/riscv64/generated/userSpaceSamples/BUILD:3:10: Compiling userSpaceSamples/helloworld.c failed: (Segmentation fault): gcc failed: error executing CppCompile command (from target //userSpaceSamples:helloworld) toolchains/gcc-14-riscv/imported/gcc -U_FORTIFY_SOURCE '--sysroot=external/gcc-14-riscv64-suite/sysroot' -Wall -g -MD -MF bazel-out/k8-dbg/bin/userSpaceSamples/_objs/helloworld/helloworld.pic.d ... (remaining 19 arguments skipped)
The three segmentation fault core dumps can be found in /var/lib/systemd/coredump/.
The ERROR message indicates segmentation faults when generating three dependency listings. To drill down further we take the Use --sandbox_debug hint and run the single bazel build command:
$ cd riscv64/generated/
riscv64/generated $ bazel --noworkspace_rc --output_base=/run/user/1000/bazel build -s --sandbox_debug --distdir=/opt/bazel/distdir --incompatible_enable_cc_toolchain_resolution --experimental_enable_bzlmod --incompatible_sandbox_hermetic_tmp=false --save_temps --platforms=//platforms:riscv_userspace --compilation_mode=dbg userSpaceSamples:helloworld
...
ERROR: /home/thixotropist/projects/github/ghidra_import_tests/riscv64/generated/userSpaceSamples/BUILD:3:10: Compiling userSpaceSamples/helloworld.c failed: (Segmentation fault): linux-sandbox failed: error executing CppCompile command
(cd /run/user/1000/bazel/sandbox/linux-sandbox/4/execroot/_main && \
exec env - \
PATH=/home/thixotropist/.local/bin:/home/thixotropist/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/var/lib/snapd/snap/bin:/home/thixotropist/.local/bin:/home/thixotropist/bin:/opt/ghidra_11.1_DEV/:/home/thixotropist/.cargo/bin::/usr/lib/jvm/jdk-17-oracle-x64/bin:/opt/gradle-7.6.2/bin \
PWD=/proc/self/cwd \
TMPDIR=/tmp \
/home/thixotropist/.cache/bazel/_bazel_thixotropist/install/80f400a450641cd3dd880bb8dec91ff8/linux-sandbox -t 15 -w /dev/shm -w /run/user/1000/bazel/sandbox/linux-sandbox/4/execroot/_main -w /tmp -S /run/user/1000/bazel/sandbox/linux-sandbox/4/stats.out -D /run/user/1000/bazel/sandbox/linux-sandbox/4/debug.out -- toolchains/gcc-14-riscv/imported/gcc -U_FORTIFY_SOURCE '--sysroot=external/gcc-14-riscv64-suite/sysroot' -Wall -g -MD -MF bazel-out/k8-dbg/bin/userSpaceSamples/_objs/helloworld/helloworld.pic.i.d '-frandom-seed=bazel-out/k8-dbg/bin/userSpaceSamples/_objs/helloworld/helloworld.pic.i' -fPIC -iquote . -iquote bazel-out/k8-dbg/bin -iquote external/bazel_tools -iquote bazel-out/k8-dbg/bin/external/bazel_tools -fno-canonical-system-headers -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' -c userSpaceSamples/helloworld.c -E -o bazel-out/k8-dbg/bin/userSpaceSamples/_objs/helloworld/helloworld.pic.i)
toolchains/gcc-14-riscv/imported/gcc: line 5: 4 Segmentation fault (core dumped) PATH=`pwd`/toolchains/gcc-14-riscv/imported LD_LIBRARY_PATH=external/fedora39-system-libs external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-gcc "$@"
This tells us several things:

- bazel was generating helloworld.pic.i from userSpaceSamples/helloworld.c with the gcc flag -E. This means the failure involves the preprocessor phase, not the compiler or linker phase.
- the sandbox was retained at /run/user/1000/bazel/sandbox/linux-sandbox/4.

The next step is to rerun the generated command outside of bazel, but using the bazel sandbox.
$ pushd /run/user/1000/bazel/sandbox/linux-sandbox/4/execroot/_main
$ toolchains/gcc-14-riscv/imported/gcc -U_FORTIFY_SOURCE '--sysroot=external/gcc-14-riscv64-suite/sysroot' -Wall -g -MD -MF bazel-out/k8-dbg/bin/userSpaceSamples/_objs/helloworld/helloworld.pic.i.d '-frandom-seed=bazel-out/k8-dbg/bin/userSpaceSamples/_objs/helloworld/helloworld.pic.i' -fPIC -iquote . -iquote bazel-out/k8-dbg/bin -iquote external/bazel_tools -iquote bazel-out/k8-dbg/bin/external/bazel_tools -fno-canonical-system-headers -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' -c userSpaceSamples/helloworld.c -E -o bazel-out/k8-dbg/bin/userSpaceSamples/_objs/helloworld/helloworld.pic.i
toolchains/gcc-14-riscv/imported/gcc: line 5: 552557 Segmentation fault (core dumped) PATH=`pwd`/toolchains/gcc-14-riscv/imported LD_LIBRARY_PATH=external/fedora39-system-libs external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-gcc "$@"
$ cat toolchains/gcc-14-riscv/imported/gcc
#!/bin/bash
set -euo pipefail
PATH=`pwd`/toolchains/gcc-14-riscv/imported \
LD_LIBRARY_PATH=external/fedora39-system-libs \
external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-gcc "$@"
$ ls -l external/gcc-14-riscv64-suite/bin
total 0
riscv64-unknown-linux-gnu-ar -> /run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-ar
riscv64-unknown-linux-gnu-as -> /run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-as
riscv64-unknown-linux-gnu-cpp -> /run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-cpp
riscv64-unknown-linux-gnu-gcc -> /run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-gcc
riscv64-unknown-linux-gnu-ld -> /run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-ld
riscv64-unknown-linux-gnu-ld.bfd -> /run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-ld.bfd
riscv64-unknown-linux-gnu-objdump -> /run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-objdump
riscv64-unknown-linux-gnu-ranlib -> /run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-ranlib
riscv64-unknown-linux-gnu-strip -> /run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-strip
$ ls -l external/fedora39-system-libs
total 0
libc.so -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libc.so
libc.so.6 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libc.so.6
libexpat.so.1 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libexpat.so.1
libexpat.so.1.8.10 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libexpat.so.1.8.10
libgcc_s-13-20231205.so.1 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libgcc_s-13-20231205.so.1
libgcc_s.so.1 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libgcc_s.so.1
libgmp.so.10 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libgmp.so.10
libgmp.so.10.4.1 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libgmp.so.10.4.1
libisl.so.15 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libisl.so.15
libisl.so.15.1.1 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libisl.so.15.1.1
libmpc.so.3 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libmpc.so.3
libmpc.so.3.3.1 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libmpc.so.3.3.1
libmpfr.so.6 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libmpfr.so.6
libmpfr.so.6.2.0 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libmpfr.so.6.2.0
libm.so.6 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libm.so.6
libpython3.12.so -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libpython3.12.so
libpython3.12.so.1.0 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libpython3.12.so.1.0
libpython3.so -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libpython3.so
libstdc++.so.6 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libstdc++.so.6
libstdc++.so.6.0.32 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libstdc++.so.6.0.32
libz.so.1 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libz.so.1
libz.so.1.2.13 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libz.so.1.2.13
libzstd.so.1 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libzstd.so.1
libzstd.so.1.5.5 -> /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libzstd.so.1.5.5
This suggests a missing or out-of-date shared library, so try executing cpp with and without overriding the library path:
$ LD_LIBRARY_PATH=external/fedora39-system-libs /run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-cpp --version
Segmentation fault (core dumped)
$ /run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-cpp --version
riscv64-unknown-linux-gnu-cpp (g3f23fa7e74f) 13.2.1 20230901
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Next, see which libraries are required for cpp to execute:
$ ldd /run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-cpp
linux-vdso.so.1 (0x00007ffdb7172000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007faf79200000)
libm.so.6 => /lib64/libm.so.6 (0x00007faf7911d000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007faf794ff000)
libc.so.6 => /lib64/libc.so.6 (0x00007faf78f30000)
/lib64/ld-linux-x86-64.so.2 (0x00007faf79547000)
Is this a case of a missing library, or something corrupt in our imported fedora39-system-libs?
Try a differential test in which we search both libraries, in different orders:
$ LD_LIBRARY_PATH=/lib64/:/run/user/1000/bazel/execroot/_main/external/fedora39-system-libs ldd /run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-cpp
linux-vdso.so.1 (0x00007ffeb2b92000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007fda32200000)
libm.so.6 => /lib64/libm.so.6 (0x00007fda3211d000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fda324a7000)
libc.so.6 => /lib64/libc.so.6 (0x00007fda31f30000)
/lib64/ld-linux-x86-64.so.2 (0x00007fda324d6000)
$ LD_LIBRARY_PATH=/run/user/1000/bazel/execroot/_main/external/fedora39-system-libs:/lib64 ldd /run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-cpp
Segmentation fault (core dumped)
We can trace the library and child process actions with commands like:
$ strace -f --string-limit=1000 ldd /run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-cpp 2>&1 |egrep 'openat|execve'
execve("/usr/bin/ldd", ["ldd", "/run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-cpp"], 0x7ffcd3479788 /* 54 vars */) = 0
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib64/libtinfo.so.6", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/dev/tty", O_RDWR|O_NONBLOCK) = 3
openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/lib64/gconv/gconv-modules.cache", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/bin/ldd", O_RDONLY) = 3
openat(AT_FDCWD, "/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/share/locale/en_US.UTF-8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale/en_US.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale/en_US/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale/en.UTF-8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale/en.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale/en/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 595197] execve("/lib64/ld-linux-x86-64.so.2", ["/lib64/ld-linux-x86-64.so.2", "--verify", "/run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-cpp"], 0x56317c32bf10 /* 54 vars */) = 0
[pid 595197] openat(AT_FDCWD, "/run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-cpp", O_RDONLY|O_CLOEXEC) = 3
[pid 595200] execve("/lib64/ld-linux-x86-64.so.2", ["/lib64/ld-linux-x86-64.so.2", "/run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-cpp"], 0x56317c339130 /* 58 vars */) = 0
[pid 595200] openat(AT_FDCWD, "/run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-cpp", O_RDONLY|O_CLOEXEC) = 3
[pid 595200] openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
[pid 595200] openat(AT_FDCWD, "/lib64/libstdc++.so.6", O_RDONLY|O_CLOEXEC) = 3
[pid 595200] openat(AT_FDCWD, "/lib64/libm.so.6", O_RDONLY|O_CLOEXEC) = 3
[pid 595200] openat(AT_FDCWD, "/lib64/libgcc_s.so.1", O_RDONLY|O_CLOEXEC) = 3
[pid 595200] openat(AT_FDCWD, "/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
Examining the imported fedora39-system-libs directory turns up one significant error. The file libc.so is not a symbolic link but a loader script, referencing the host's /lib64/libc.so.6, /usr/lib64/libc_nonshared.a, and /lib64/ld-linux-x86-64.so.2. If we purge libc.* from fedora39-system-libs we get a saner result:
$ LD_LIBRARY_PATH=/run/user/1000/bazel/execroot/_main/external/fedora39-system-libs:/lib64 ldd /run/user/1000/bazel/execroot/_main/external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-cpp 2>&1
linux-vdso.so.1 (0x00007ffda9fed000)
libstdc++.so.6 => /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libstdc++.so.6 (0x00007f0c1c45a000)
libm.so.6 => /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libm.so.6 (0x00007f0c1c379000)
libgcc_s.so.1 => /run/user/1000/bazel/execroot/_main/external/fedora39-system-libs/libgcc_s.so.1 (0x00007f0c1c355000)
libc.so.6 => /lib64/libc.so.6 (0x00007f0c1c168000)
/lib64/ld-linux-x86-64.so.2 (0x00007f0c1c6b0000)
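For context, the problematic libc.so is a GNU ld linker script rather than an ELF file. On a Fedora x86_64 host it looks roughly like this (exact paths and format line vary by release):

```
/* GNU ld script */
OUTPUT_FORMAT(elf64-x86-64)
GROUP ( /lib64/libc.so.6 /usr/lib64/libc_nonshared.a  AS_NEEDED ( /lib64/ld-linux-x86-64.so.2 ) )
```

Importing this script verbatim into the sandbox makes the dynamic loader chase absolute host paths, defeating the point of importing the library at all.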
Now we have a hermeticity design question to resolve - which system libraries do we import, and which do we pull from the host machine? This exercise suggests we use the host libraries for dynamic loading and for the standard C library libc.so, and import the libraries associated with the C and C++ compiler.
Update the LD_LIBRARY_PATH variable in all toolchain scripts and explicitly remove libc.* files from the system libraries, then retry the failing tests:
$ ./generateInternalExemplars.py
INFO:root:Running: bazel --noworkspace_rc --output_base=/run/user/1000/bazel build -s --distdir=/opt/bazel/distdir --incompatible_enable_cc_toolchain_resolution --experimental_enable_bzlmod --incompatible_sandbox_hermetic_tmp=false --save_temps --platforms=@local_config_platform//:host --compilation_mode=dbg userSpaceSamples:helloworld
.INFO:root:Running: bazel query //platforms:*
.INFO:root:Running: bazel --noworkspace_rc --output_base=/run/user/1000/bazel build -s --distdir=/opt/bazel/distdir --incompatible_enable_cc_toolchain_resolution --experimental_enable_bzlmod --incompatible_sandbox_hermetic_tmp=false --save_temps --platforms=//platforms:riscv_userspace --compilation_mode=dbg userSpaceSamples:helloworld
INFO:root:Running: file bazel-bin/userSpaceSamples/_objs/helloworld/helloworld.pic.o
.INFO:root:Running: bazel --noworkspace_rc --output_base=/run/user/1000/bazel build -s --distdir=/opt/bazel/distdir --incompatible_enable_cc_toolchain_resolution --experimental_enable_bzlmod --incompatible_sandbox_hermetic_tmp=false --save_temps --platforms=//platforms:riscv_userspace --compilation_mode=dbg userSpaceSamples:helloworld++
INFO:root:Running: file bazel-bin/userSpaceSamples/_objs/helloworld++/helloworld.pic.o
.INFO:root:Running: bazel --noworkspace_rc --output_base=/run/user/1000/bazel build -s --distdir=/opt/bazel/distdir --incompatible_enable_cc_toolchain_resolution --experimental_enable_bzlmod --incompatible_sandbox_hermetic_tmp=false --save_temps --platforms=//platforms:riscv_custom assemblySamples:archive
.INFO:root:Running: bazel --noworkspace_rc --output_base=/run/user/1000/bazel build -s --distdir=/opt/bazel/distdir --incompatible_enable_cc_toolchain_resolution --experimental_enable_bzlmod --incompatible_sandbox_hermetic_tmp=false --save_temps --platforms=//platforms:riscv_custom gcc_expansions:archive
.INFO:root:Running: bazel --noworkspace_rc --output_base=/run/user/1000/bazel build -s --distdir=/opt/bazel/distdir --incompatible_enable_cc_toolchain_resolution --experimental_enable_bzlmod --incompatible_sandbox_hermetic_tmp=false --save_temps --platforms=//platforms:riscv_custom @whisper_cpp//:main @whisper_cpp//:main.stripped
INFO:root:Running: bazel --noworkspace_rc --output_base=/run/user/1000/bazel build -s --distdir=/opt/bazel/distdir --incompatible_enable_cc_toolchain_resolution --experimental_enable_bzlmod --incompatible_sandbox_hermetic_tmp=false --save_temps --platforms=//platforms:riscv_vector @whisper_cpp//:main @whisper_cpp//:main.stripped
INFO:root:Running: bazel --noworkspace_rc --output_base=/run/user/1000/bazel build -s --distdir=/opt/bazel/distdir --incompatible_enable_cc_toolchain_resolution --experimental_enable_bzlmod --incompatible_sandbox_hermetic_tmp=false --save_temps --platforms=//platforms:riscv_userspace @whisper_cpp//:main @whisper_cpp//:main.stripped
.INFO:root:Running: bazel --noworkspace_rc --output_base=/run/user/1000/bazel build -s --distdir=/opt/bazel/distdir --incompatible_enable_cc_toolchain_resolution --experimental_enable_bzlmod --incompatible_sandbox_hermetic_tmp=false --save_temps --platforms=//platforms:x86_64_default gcc_vectorization:archive
.
----------------------------------------------------------------------
Ran 8 tests in 61.357s
OK
Refreshing (updating) an existing toolchain is mostly straightforward.
Warning: This sequence uses unreleased code for binutils, gcc, and glibc. We use this experimental toolchain to get a glimpse of future toolchains and products, not for stable code.
binutils’ latest release is 2.42. Let’s update our RISCV toolchain to use the current binutils head, which is currently very close to the released version. The git log shows relatively little change to the RISCV assembler, other than some corrections to the THead extension encodings.
Update the binutils source directory, then configure, build, and install into /opt/riscvx:
$ /home2/vendor/binutils-gdb/configure --prefix=/opt/riscvx --target=riscv64-unknown-linux-gnu
$ make
$ make install
Update the gcc source directory, then configure, build, and install into /opt/riscvx:
$ make distclean
$ /home2/vendor/gcc/configure --prefix=/opt/riscvx --enable-languages=c,c++,lto --disable-multilib --target=riscv64-unknown-linux-gnu --with-sysroot=/opt/riscvx/sysroot
$ make
$ make install
Update the glibc source directory to the tip of the master branch, refresh the configuration, build, and install:
$ ../../vendor/glibc/configure CC=/opt/riscvx/bin/riscv64-unknown-linux-gnu-gcc --host=riscv64-unknown-linux-gnu --target=riscv64-unknown-linux-gnu --prefix=/opt/riscvx --disable-werror --enable-shared --disable-multilib
$ make
$ make install
The previous steps generate a new, non-portable toolchain under /opt/riscvx. Before generating the portable tarball (e.g., risc64_linux_gcc-14.0.1.tar.xz) we can exercise the newer toolchain. If we pass --platforms=//platforms:riscv_local to bazel it will use a toolchain loaded from local files under /opt/riscvx instead of files extracted from the portable tarball.
This is mostly useful in debugging the bazel 'sandbox' - recognizing newer files required by the toolchain that have been installed locally but not explicitly included in the portable tarball.
For example, suppose we are refreshing the gcc-14 toolchain from 14.0.0 to 14.0.1. The following sequence of builds should all succeed:
# build with an unrelated and fully released toolchain as a control experiment
$ bazel build --platforms=//platforms:riscv_userspace @whisper_cpp//:main
...
Target @@whisper_cpp//:main up-to-date:
bazel-bin/external/whisper_cpp/main /// build was successful
...
$ strings bazel-bin/external/whisper_cpp/main|grep GCC
GCC_3.0
GCC: (GNU) 13.2.1 20230901 /// the released compiler only was used
GCC: (g3f23fa7e74f) 13.2.1 20230901
_Unwind_Resume@GCC_3.0
# repeat with the local toolchain introducing 14.0.1 for the application build
$ bazel build --platforms=//platforms:riscv_local @whisper_cpp//:main
...
Target @@whisper_cpp//:main up-to-date:
bazel-bin/external/whisper_cpp/main /// build was successful
...
$ strings bazel-bin/external/whisper_cpp/main|grep GCC
GCC_3.0
GCC: (GNU) 13.2.1 20230901 /// some system files were previously compiled
GCC: (GNU) 14.0.1 20240130 (experimental) /// the new toolchain was used in part
_Unwind_Resume@GCC_3.0
# repeat with the candidate portable tarball
$ bazel build --platforms=//platforms:riscv_vector @whisper_cpp//:main
...
Target @@whisper_cpp//:main up-to-date:
bazel-bin/external/whisper_cpp/main /// build was successful
...
$ strings bazel-bin/external/whisper_cpp/main|grep GCC
GCC_3.0
GCC: (GNU) 13.2.1 20230901
GCC: (GNU) 14.0.1 20240130 (experimental) /// the new toolchain was used in part
_Unwind_Resume@GCC_3.0
Different build options can require different files in the portable tarball, so this kind of test may fail for some projects while succeeding in others. That's easily fixed by updating the generate.sh rsync script that builds the portable tarball.
Put unstructured comments here until we know what to do with them.
- Extend the isa_ext Ghidra branch to expand vsetvli arguments, for example:
  - vsetvli zero,zero,0xc5 ⇒ vsetvli zero,zero,e8,mf8,ta,ma
  - vsetvli zero,zero,0x18 ⇒ vsetvli zero,zero,e64,m1,tu,mu
- The isa_ext Ghidra branch fails to disassemble the bext instruction in b-ext-64.o and b-ext.o.
- zvbc.o won't disassemble.
- In the isa_ext branch, unknown.o won't disassemble; we should reference where we found these instructions: sfence, hinval_vvma, hinval_gvma, orc.b, cbo.clean, cbo.inval, cbo.flush. orc.b is handled properly; the others are not implemented.

Assume a vendor generates a new toolchain with multiple extensions enabled by default. What fraction of the compiled functions would contain extensions unrecognized by Ghidra 11.0? Since THead has supplied most of the vendor-specific extensions known to binutils 2.41, we'll use that as a reference. The architecture name will be something like
-march=rv64gv_zba_zbb_zbc_zbkb_zbkc_zbkx_zvbc_xtheadba_xtheadbb_xtheadbs_xtheadcmo_xtheadcondmov_xtheadmac_xtheadfmemidx_xtheadmempair_xtheadsync
Add some C++ code to exercise libstdc++ ordered maps (based on red-black trees?), unordered maps (hash table based), and the Murmur hash function.
There are a few places where THead customized instructions are used. The Murmur hash function uses vector load and store instructions to implement 8 byte unaligned reads. Bit manipulation extension instructions are not yet used.
Initial results suggest the largest complexity impact will be gcc rewriting of memory and structure copies with vector code. This may be especially true for hardware requiring aligned integers where alignment cannot be guaranteed.
When will RISCV-64 cores be deployed into systems needing reverse-engineering?
https://github.com/riscv/riscv-profiles/blob/main/rva23-profile.adoc
Note: the general SiFive SDK boards might have been deprioritized in favor of specific licensing agreements. https://www.sifive.com/boards/hifive-pro-p550
We might expect to see high performance network appliances in 2026 using chip architectures like the SiFive 670 or 870, or from one of the alternative Chinese vendors. Chips with vector extensions are due soon, with crypto extensions coming shortly after. A network appliance development board might have two P670 class sockets and four to eight 10 GbE network interfaces.
To manage scope, we won’t be worrying about instructions supporting AI or complex virtualization. Custom instructions that might be used in network appliances are definitely in scope, while custom instructions for nested virtualization are not. Possibly in scope are new instructions that help manage or synchronize multi-socket cache memory.
Let’s set a provocative long term goal: How will Ghidra analyze a future network appliance that combines Machine Learning with self-modifying code to accelerate network routing and forwarding? Such a device might generate fast-path code sequences to sessionize incoming packets and deliver them with minimal cache flushes or branches taken.
A RISCV-64 implementation of the Marvell Octeon 10 might be a feasible future hardware component.
This might include cell phones or voice-recognition apps. Things that today might use an Arm core set but be implemented with RISC-V cores in the future.
Consider a midpoint network appliance (router or firewall) sitting near the Provider-Customer demarcation. What might an appealing RISCV processor look like? This kind of appliance likely handles a mix of link layer protocols, optimized for low energy dissipation and low latency per packet. A fast and simple serializer/deserializer feeding a RISCV classifier and forwarding engine makes sense here. You don't want to do network or application layer processing unless the appliance has a firewall role.
Link layer processing means a packet mix of stacked MPLS and VLAN tags with IPv4 and IPv6 network layers underneath. Packet header processing won't need 32 bit addressing, but might benefit from the high memory bandwidth of a 64 bit core. It needs fast header hashing combined with fast hashmap session lookups (for MPLS, VLAN, and selected IP) or fast trie session lookups (for IPv4 and IPv6). Network stacks can have a lot of branches, creating pipeline stalls, so hyperthreading may make sense.
Denial of Service and overload protections make fast analytics necessary at the session level. That’s where a 64 bit core with vector and other extensions can be useful.
This all suggests we might see more hybrid RISCV designs, with a mix of many lean 32 bit cores supported by one or two 64 bit cores. The 32 bit cores handle fast link layer processing and the 64 bit cores handle background analytics and control.
In the extreme case, the 64 bit analytics engine rewrites link layer code for the 32 bit cores continuously, optimizing code paths depending on what the link layer classifiers determine the most common packet types to be for each physical port. Cache management and branch prediction hints might drive new instruction additions.
Code rewriting could start as simple updates to RISCV hint branch instructions and possibly prefetch instructions, so it isn’t necessarily as radical as it sounds.
What will RISCV-64 cores offer networking?
Network kernel code has lots of conditional branches and very few loops. This suggests RISCV vector instructions won't be found in network appliances anytime soon, other than in memmove or similar simple contexts. Gather-scatter, bit manipulation, and crypto instruction extensions are likely to be useful in networking much sooner. Ghidra will have a much easier time generating pcode for those instructions than for the 25K+ RISCV vector intrinsic C functions covering all combinations of vector instructions and vector processing modes.
What should Ghidra do when faced with a counter-example, say a network appliance that aggressively moves vector analytics into network processing? Such an appliance - perhaps a smart edge router or a zero-trust gateway device - might combine the following:
In the extreme case, this might be a generative AI subsystem trained on inbound packets and emitting either optimized packet handling code or threat-detection indicators. How would a Ghidra analyst look for malware in such a system?
We need to be clearer about what kind of network code we might find in different contexts:
For each of these contexts we have at least two topology variants:
Midpoint network appliances may need to track session state. A simple network switch is close to stateless. A real-world network switch has a lot of session state to manage if it supports:
The key point here is that midpoint network appliances may benefit from instruction set extensions that enable faster packet classification, hashing, and cached session lookup. An adaptive midpoint network appliance might adjust the packet classification code in real-time, based on the mix of MPLS, VLAN, IPv4, IPv6, and VPN traffic most often seen on any given network interface. ISA extensions supporting gather, hash, vector comparison, and find-first operations are good candidates here.
Compare and debug human and gcc vectorization
This case study compares human and compiler vectorization of a simple ML quantization algorithm. We’ll assume we need to inspect the code to understand why these two binaries sometimes produce different results. Our primary goal is to see whether we can improve Ghidra’s RISCV pcode generation to make such analyses easier. A secondary goal is to collect generated instruction patterns that may help Ghidra users understand what optimizing vectorizing compilers can do to source code.
The ML algorithm under test comes from https://github.com/ggerganov/llama.cpp. It packs an array of 32 bit floats into a set of q8_0 blocks to condense large model files. The q8_0 quantization reduces each group of 32 32 bit floating point numbers to 32 8 bit integers plus an associated 16 bit floating point scale factor.
The ggml-quants.c file in the llama.cpp repo provides both scalar source code (quantize_row_q8_0_reference) and hand-generated vectorized source code (quantize_row_q8_0).

- The quantize_row_q8_0 function has several #ifdef sections providing hand-generated vector intrinsics for riscv, avx2, arm/neon, and wasm.
- The quantize_row_q8_0_reference function source uses more loops but no vector instructions. GCC-14 will autovectorize the scalar quantize_row_q8_0_reference, producing vector code that is quite different from the hand-generated vector intrinsics.

The notional AI development shop wants to use Ghidra to inspect generated assembly instructions for both quantize_row_q8_0 and quantize_row_q8_0_reference to track down reported quirks. On some systems they produce identical results, on others the results differ. The test framework includes:
- compilation with -march=rv64gcv, -O3, and -ffast-math.
- qemu-riscv64-static emulated execution of user space RISCV-64 applications on an x86_64 Linux test server.
- unit tests built with gtest.
- the Ghidra isa_ext branch supporting RISCV 1.0 vector instructions.

The unit test process involves three unit test executions:

- natively on the x86_64 test server
- in a qemu-riscv64-static environment with an emulated VLEN=256 bits
- in a qemu-riscv64-static environment with an emulated VLEN=128 bits

Note: This exercise uses whisper C and C++ source code as 'ground truth', coupled with a C++ test framework. If we didn't have source code, we would have to reconstruct key library source files based on Ghidra inspection, then refine those reconstructions until Ghidra and unit testing show that our reconstructions behave the same as the original binaries.
As setup for the Ghidra inspection, we will build and run all three and expect to see three PASSED notifications:
$ bazel run --platforms=//platforms:x86_64 case_studies:unitTests
...
INFO: Analyzed target //case_studies:unitTests (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
Target //case_studies:unitTests up-to-date:
bazel-bin/case_studies/unitTests
INFO: Elapsed time: 21.065s, Critical Path: 20.71s
INFO: 37 processes: 2 internal, 35 linux-sandbox.
INFO: Build completed successfully, 37 total actions
INFO: Running command line: bazel-bin/case_studies/unitTests
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from FP16
[ RUN ] FP16.convertFromFp32Reference
[ OK ] FP16.convertFromFp32Reference (0 ms)
[ RUN ] FP16.convertFromFp32VectorIntrinsics
[ OK ] FP16.convertFromFp32VectorIntrinsics (0 ms)
[----------] 2 tests from FP16 (0 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (0 ms total)
[ PASSED ] 2 tests.
$ bazel build --platforms=//platforms:riscv_vector case_studies:unitTests
$ bazel build --platforms=//platforms:riscv_vector --define __riscv_v_intrinsics=1 case_studies:unitTests
WARNING: Build option --platforms has changed, discarding analysis cache (this can be expensive, see https://bazel.build/advanced/performance/iteration-speed).
INFO: Analyzed target //case_studies:unitTests (0 packages loaded, 1904 targets configured).
...
INFO: Found 1 target...
Target //case_studies:unitTests up-to-date:
bazel-bin/case_studies/unitTests
INFO: Elapsed time: 22.265s, Critical Path: 22.07s
INFO: 37 processes: 2 internal, 35 linux-sandbox.
INFO: Build completed successfully, 37 total actions
$ export QEMU_CPU=rv64,zba=true,zbb=true,v=true,vlen=256,vext_spec=v1.0,rvv_ta_all_1s=true,rvv_ma_all_1s=true
$ qemu-riscv64-static -L /opt/riscvx -E LD_LIBRARY_PATH=/opt/riscvx/riscv64-unknown-linux-gnu/lib/ bazel-bin/case_studies/unitTests
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from FP16
[ RUN ] FP16.convertFromFp32Reference
[ OK ] FP16.convertFromFp32Reference (1 ms)
[ RUN ] FP16.convertFromFp32VectorIntrinsics
[ OK ] FP16.convertFromFp32VectorIntrinsics (0 ms)
[----------] 2 tests from FP16 (2 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (6 ms total)
[ PASSED ] 2 tests.
Target //case_studies:unitTests up-to-date:
bazel-bin/case_studies/unitTests
INFO: Elapsed time: 8.984s, Critical Path: 8.88s
INFO: 29 processes: 2 internal, 27 linux-sandbox.
INFO: Build completed successfully, 29 total actions
$ export QEMU_CPU=rv64,zba=true,zbb=true,v=true,vlen=256,vext_spec=v1.0,rvv_ta_all_1s=true,rvv_ma_all_1s=true
$ qemu-riscv64-static -L /opt/riscvx -E LD_LIBRARY_PATH=/opt/riscvx/riscv64-unknown-linux-gnu/lib/ bazel-bin/case_studies/unitTests
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from FP16
[ RUN ] FP16.convertFromFp32Reference
[ OK ] FP16.convertFromFp32Reference (1 ms)
[ RUN ] FP16.convertFromFp32VectorIntrinsics
[ OK ] FP16.convertFromFp32VectorIntrinsics (0 ms)
[----------] 2 tests from FP16 (2 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (6 ms total)
[ PASSED ] 2 tests.
$ export QEMU_CPU=rv64,zba=true,zbb=true,v=true,vlen=128,vext_spec=v1.0,rvv_ta_all_1s=true,rvv_ma_all_1s=true
$ qemu-riscv64-static -L /opt/riscvx -E LD_LIBRARY_PATH=/opt/riscvx/riscv64-unknown-linux-gnu/lib/ bazel-bin/case_studies/unitTests
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from FP16
[ RUN ] FP16.convertFromFp32Reference
[ OK ] FP16.convertFromFp32Reference (1 ms)
[ RUN ] FP16.convertFromFp32VectorIntrinsics
case_studies/unitTests.cpp:55: Failure
Expected equality of these values:
dest[0].d
Which is: 12175
fp16_test_array.d
Which is: 13264
fp16 scale factor is correct
case_studies/unitTests.cpp:57: Failure
Expected equality of these values:
comparison
Which is: -65
0
entire fp16 block is converted correctly
[ FAILED ] FP16.convertFromFp32VectorIntrinsics (8 ms)
[----------] 2 tests from FP16 (10 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (14 ms total)
[ PASSED ] 1 test.
[ FAILED ] 1 test, listed below:
[ FAILED ] FP16.convertFromFp32VectorIntrinsics
1 FAILED TEST
These results imply:

- The quantize_row_q8_0 test passes on harts with VLEN=256 but fails when executed on harts with VLEN=128. Further tracing suggests that quantize_row_q8_0 only processes the first 16 floats, not the 32 floats that should be processed in each block.
- quantize_row_q8_0_reference passes on both types of harts.

Now we need to import the riscv-64 unitTests program into Ghidra and examine the compiled differences between quantize_row_q8_0 and quantize_row_q8_0_reference.
Note: Remember that our real integration test goal is to look for new problems or regressions in Ghidra’s decompiler presentation of functions like these, and then to look for ways to improve that presentation.
The goal of both quantize_row_q8_0* routines is a lossy compression of 32 bit floats into blocks of 8 bit scaled values. The routines should return identical results, with quantize_row_q8_0 invoked on architectures with vector acceleration and quantize_row_q8_0_reference used on all other architectures.
static const int QK8_0 = 32;
// reference implementation for deterministic creation of model files
void quantize_row_q8_0_reference(const float * restrict x, block_q8_0 * restrict y, int k) {
assert(k % QK8_0 == 0);
const int nb = k / QK8_0;
for (int i = 0; i < nb; i++) {
float amax = 0.0f; // absolute max
for (int j = 0; j < QK8_0; j++) {
const float v = x[i*QK8_0 + j];
amax = MAX(amax, fabsf(v));
}
const float d = amax / ((1 << 7) - 1);
const float id = d ? 1.0f/d : 0.0f;
y[i].d = GGML_FP32_TO_FP16(d);
for (int j = 0; j < QK8_0; ++j) {
const float x0 = x[i*QK8_0 + j]*id;
y[i].qs[j] = roundf(x0);
}
}
}
void quantize_row_q8_0(const float * restrict x, void * restrict vy, int k) {
assert(QK8_0 == 32);
assert(k % QK8_0 == 0);
const int nb = k / QK8_0;
block_q8_0 * restrict y = vy;
#if defined(__ARM_NEON)
...
#elif defined(__wasm_simd128__)
...
#elif defined(__AVX2__) || defined(__AVX__)
...
#elif defined(__riscv_v_intrinsic)
size_t vl = __riscv_vsetvl_e32m4(QK8_0);
for (int i = 0; i < nb; i++) {
// load elements
vfloat32m4_t v_x = __riscv_vle32_v_f32m4(x+i*QK8_0, vl);
vfloat32m4_t vfabs = __riscv_vfabs_v_f32m4(v_x, vl);
vfloat32m1_t tmp = __riscv_vfmv_v_f_f32m1(0.0f, vl);
vfloat32m1_t vmax = __riscv_vfredmax_vs_f32m4_f32m1(vfabs, tmp, vl);
float amax = __riscv_vfmv_f_s_f32m1_f32(vmax);
const float d = amax / ((1 << 7) - 1);
const float id = d ? 1.0f/d : 0.0f;
y[i].d = GGML_FP32_TO_FP16(d);
vfloat32m4_t x0 = __riscv_vfmul_vf_f32m4(v_x, id, vl);
// convert to integer
vint16m2_t vi = __riscv_vfncvt_x_f_w_i16m2(x0, vl);
vint8m1_t vs = __riscv_vncvt_x_x_w_i8m1(vi, vl);
// store result
__riscv_vse8_v_i8m1(y[i].qs , vs, vl);
}
#else
GGML_UNUSED(nb);
// scalar
quantize_row_q8_0_reference(x, y, k);
#endif
}
The reference version has an outer loop iterating over blocks, with two inner loops operating on the 32 scalar 32 bit floats within each block. The first inner loop accumulates the maximum of absolute values within the block to generate a scale factor, while the second inner loop applies that scale factor to each 32 bit float in the block and then converts the scaled value to an 8 bit integer. Each output block is then a 16 bit floating point scale factor plus 32 8 bit scaled integers.
The code includes some distractions that complicate Ghidra analysis:

- The GGML_FP32_TO_FP16(d) conversion might be a single instruction on some architectures, but it requires branch evaluation on our RISCV-64 target architecture. GCC may elect to duplicate code in order to minimize the number of branches needed.

The hand-optimized quantize_row_q8_0 has similar distractions, plus a few more:

- Output is written into the block_q8_0 struct.
- The vector intrinsics use an m4 setting. On architectures with a vector length VLEN=256, that means all 32 4 byte floats per block will fit nicely and can be processed in parallel. If the architecture only supports a vector length of VLEN=128, then only half of each block will be processed in every iteration. That accounts for the unit test failure.
- The __riscv_vle32_v_f32m4 intrinsic is likely the slowest of the set, as this 32 bit instruction will require a 128 byte memory read, stalling the instruction pipeline for some number of cycles.

Load unitTest into Ghidra and inspect quantize_row_q8_0. We know the correct signature so we can override what Ghidra has inferred, then name the parameters so that they look more like the source code.
void quantize_row_q8_0(float *x,block_q0_0 *y,long k)
{
float fVar1;
int iVar2;
char *pcVar3;
undefined8 uVar4;
uint uVar5;
int iVar6;
ulong uVar7;
undefined8 uVar8;
undefined in_v1 [256];
undefined auVar9 [256];
undefined auVar10 [256];
gp = &__global_pointer$;
if (k < 0x20) {
return;
}
uVar4 = vsetvli_e8m1tama(0x20);
vsetvli_e32m1tama(uVar4);
uVar5 = 0x106c50;
iVar2 = (int)(((uint)((int)k >> 0x1f) >> 0x1b) + (int)k) >> 5;
iVar6 = 0;
vmv_v_i(in_v1,0);
pcVar3 = y->qs;
do {
while( true ) {
vsetvli_e32m4tama(uVar4);
auVar9 = vle32_v(x);
auVar10 = vfsgnjx_vv(auVar9,auVar9);
auVar10 = vfredmax_vs(auVar10,in_v1);
uVar8 = vfmv_fs(auVar10);
fVar1 = (float)uVar8 * 0.007874016;
uVar7 = (ulong)(uint)fVar1;
if ((fVar1 == 0.0) || (uVar7 = (ulong)(uint)(127.0 / (float)uVar8), uVar5 << 1 < 0xff000001))
break;
auVar9 = vfmul_vf(auVar9,uVar7);
vsetvli_e16m2tama(0);
((block_q0_0 *)(pcVar3 + -2))->d = 0x7e00;
auVar9 = vfncvt_xfw(auVar9);
iVar6 = iVar6 + 1;
vsetvli_e8m1tama(0);
auVar9 = vncvt_xxw(auVar9);
vse8_v(auVar9,pcVar3);
x = x + 0x20;
pcVar3 = pcVar3 + 0x22;
if (iVar2 <= iVar6) {
return;
}
}
iVar6 = iVar6 + 1;
auVar9 = vfmul_vf(auVar9,uVar7);
vsetvli_e16m2tama(0);
auVar9 = vfncvt_xfw(auVar9);
vsetvli_e8m1tama(0);
auVar9 = vncvt_xxw(auVar9);
vse8_v(auVar9,pcVar3);
x = x + 0x20;
uVar5 = uVar5 & 0xfff;
((block_q0_0 *)(pcVar3 + -2))->d = (short)uVar5;
pcVar3 = pcVar3 + 0x22;
} while (iVar6 < iVar2);
return;
}
Note: an earlier run showed several pcode errors in riscv-rvv.sinc, which have been fixed as of this run.
Red herrings - none of these have anything to do with RISCV or vector intrinsics:

- uVar5 = 0x106c50; - there is no uVar5 variable, just a shared upper immediate load register.
- iVar2 = (int)(((uint)((int)k >> 0x1f) >> 0x1b) + (int)k) >> 5; - since k is a signed long and not unsigned, the compiler has to implement the divide by 32 with rounding adjustments for negative numbers.
- fVar1 = (float)uVar8 * 0.007874016; - the compiler changed a division by 127.0 into a multiplication by 0.007874016.
- ((block_q0_0 *)(pcVar3 + -2))->d - the compiler has set pcVar3 to point to an element within the block, so it uses negative offsets to address preceding elements.
- The fmv.x.w instruction looks odd. fmv.x.w moves the single-precision value in floating-point register rs1, represented in IEEE 754-2008 encoding, to the lower 32 bits of integer register rd. This works fine when the source is zero, but it has no clear C-like representation otherwise. These may be better replaced with specialized pcode operations.

There is one discrepancy that does involve the vectorization code. The source code uses a standard RISCV vector intrinsic function to store data:
__riscv_vse8_v_i8m1(y[i].qs, vs, vl);
Ghidra pcode for this instruction after renaming operands is (currently):
vse8_v(vs, y[i].qs);
The order of the first two parameters is swapped. We should probably align the pcode with the standard intrinsic signature as much as possible. Those intrinsics have context and type information encoded into their names, which Ghidra does not currently have, so we can't match them exactly.
Load unitTest into Ghidra and inspect quantize_row_q8_0_reference. We know the correct signature so we can override what Ghidra has inferred, then name the parameters so that they look more like the source code.
void quantize_row_q8_0_reference(float *x,block_q0_0 *y,long k)
{
float fVar1;
long lVar2;
long lVar3;
char *pcVar4;
ushort uVar5;
int iVar6;
ulong uVar7;
undefined8 uVar8;
undefined auVar9 [256];
undefined auVar10 [256];
undefined auVar11 [256];
undefined auVar12 [256];
undefined auVar13 [256];
undefined auVar14 [256];
undefined in_v7 [256];
undefined auVar15 [256];
undefined auVar16 [256];
undefined auVar17 [256];
undefined auVar18 [256];
undefined auVar19 [256];
gp = &__global_pointer$;
if (k < 0x20) {
return;
}
vsetivli_e32m1tama(4);
pcVar4 = y->qs;
lVar2 = 0;
iVar6 = 0;
auVar15 = vfmv_sf(0xff800000);
vmv_v_i(in_v7,0);
do {
lVar3 = (long)x + lVar2;
auVar14 = vle32_v(lVar3);
auVar13 = vle32_v(lVar3 + 0x10);
auVar10 = vfsgnjx_vv(auVar14,auVar14);
auVar9 = vfsgnjx_vv(auVar13,auVar13);
auVar10 = vfmax_vv(auVar10,in_v7);
auVar9 = vfmax_vv(auVar9,auVar10);
auVar12 = vle32_v(lVar3 + 0x20);
auVar11 = vle32_v(lVar3 + 0x30);
auVar10 = vfsgnjx_vv(auVar12,auVar12);
auVar10 = vfmax_vv(auVar10,auVar9);
auVar9 = vfsgnjx_vv(auVar11,auVar11);
auVar9 = vfmax_vv(auVar9,auVar10);
auVar10 = vle32_v(lVar3 + 0x40);
auVar18 = vle32_v(lVar3 + 0x50);
auVar16 = vfsgnjx_vv(auVar10,auVar10);
auVar16 = vfmax_vv(auVar16,auVar9);
auVar9 = vfsgnjx_vv(auVar18,auVar18);
auVar9 = vfmax_vv(auVar9,auVar16);
auVar17 = vle32_v(lVar3 + 0x60);
auVar16 = vle32_v(lVar3 + 0x70);
auVar19 = vfsgnjx_vv(auVar17,auVar17);
auVar19 = vfmax_vv(auVar19,auVar9);
auVar9 = vfsgnjx_vv(auVar16,auVar16);
auVar9 = vfmax_vv(auVar9,auVar19);
auVar9 = vfredmax_vs(auVar9,auVar15);
uVar8 = vfmv_fs(auVar9);
fVar1 = (float)uVar8 * 0.007874016;
uVar7 = (ulong)(uint)fVar1;
if (fVar1 == 0.0) {
LAB_00076992:
uVar5 = ((ushort)lVar3 & 0xfff) + ((ushort)((uint)lVar3 >> 0xd) & 0x7c00);
}
else {
uVar7 = (ulong)(uint)(127.0 / (float)uVar8);
uVar5 = 0x7e00;
if ((uint)lVar3 << 1 < 0xff000001) goto LAB_00076992;
}
auVar9 = vfmv_vf(uVar7);
auVar14 = vfmul_vv(auVar14,auVar9);
auVar14 = vfcvt_xfv(auVar14);
auVar13 = vfmul_vv(auVar13,auVar9);
auVar13 = vfcvt_xfv(auVar13);
auVar12 = vfmul_vv(auVar12,auVar9);
auVar12 = vfcvt_xfv(auVar12);
auVar11 = vfmul_vv(auVar9,auVar11);
auVar11 = vfcvt_xfv(auVar11);
vsetvli_e16mf2tama(0);
auVar14 = vncvt_xxw(auVar14);
vsetvli_e8mf4tama(0);
auVar14 = vncvt_xxw(auVar14);
vse8_v(auVar14,pcVar4);
vsetvli_e32m1tama(0);
auVar14 = vfmul_vv(auVar9,auVar10);
vsetvli_e16mf2tama(0);
((block_q0_0 *)(pcVar4 + -2))->d = (ushort)((ulong)lVar3 >> 0x10) & 0x8000 | uVar5;
auVar10 = vncvt_xxw(auVar13);
vsetvli_e8mf4tama(0);
auVar10 = vncvt_xxw(auVar10);
vse8_v(auVar10,pcVar4 + 4);
vsetvli_e32m1tama(0);
auVar13 = vfcvt_xfv(auVar14);
vsetvli_e16mf2tama(0);
auVar10 = vncvt_xxw(auVar12);
vsetvli_e8mf4tama(0);
auVar10 = vncvt_xxw(auVar10);
vse8_v(auVar10,pcVar4 + 8);
vsetvli_e32m1tama(0);
auVar12 = vfmul_vv(auVar9,auVar18);
vsetvli_e16mf2tama(0);
auVar10 = vncvt_xxw(auVar11);
vsetvli_e8mf4tama(0);
auVar10 = vncvt_xxw(auVar10);
vse8_v(auVar10,pcVar4 + 0xc);
vsetvli_e32m1tama(0);
auVar11 = vfcvt_xfv(auVar12);
vsetvli_e16mf2tama(0);
auVar10 = vncvt_xxw(auVar13);
vsetvli_e8mf4tama(0);
auVar10 = vncvt_xxw(auVar10);
vse8_v(auVar10,pcVar4 + 0x10);
vsetvli_e32m1tama(0);
auVar10 = vfmul_vv(auVar9,auVar17);
vsetvli_e16mf2tama(0);
auVar11 = vncvt_xxw(auVar11);
vsetvli_e8mf4tama(0);
auVar11 = vncvt_xxw(auVar11);
vse8_v(auVar11,pcVar4 + 0x14);
vsetvli_e32m1tama(0);
auVar10 = vfcvt_xfv(auVar10);
vsetvli_e16mf2tama(0);
auVar10 = vncvt_xxw(auVar10);
vsetvli_e8mf4tama(0);
auVar10 = vncvt_xxw(auVar10);
vse8_v(auVar10,pcVar4 + 0x18);
vsetvli_e32m1tama(0);
auVar9 = vfmul_vv(auVar16,auVar9);
auVar9 = vfcvt_xfv(auVar9);
vsetvli_e16mf2tama(0);
iVar6 = iVar6 + 1;
auVar9 = vncvt_xxw(auVar9);
vsetvli_e8mf4tama(0);
auVar9 = vncvt_xxw(auVar9);
vse8_v(auVar9,pcVar4 + 0x1c);
lVar2 = lVar2 + 0x80;
pcVar4 = pcVar4 + 0x22;
if ((int)(((uint)((int)k >> 0x1f) >> 0x1b) + (int)k) >> 5 <= iVar6) {
return;
}
vsetvli_e32m1tama(0);
} while( true );
}
Some of the previous red herrings show up here too. Things to note:

- undefined auVar19 [256]; - something in riscv-rvv.sinc is claiming vector registers are 256 bits long - that's not generally true, so hunt down the confusion. riscv.reg.sinc is the root of this, with @define VLEN "256" and define register offset=0x4000 size=$(VLEN) [ v0 ...]. What should Ghidra believe the size of vector registers to be? More generally, should the size and element type of vector registers be mutable?

RISCV vector instruction execution engines - and autovectorization passes in gcc - are both so immature that we have no idea which implementation performs better. At best we can guess that autovectorization will be good enough to make hand optimized coding with riscv intrinsic functions rarely needed.
Now try using Ghidra to inspect a function that dominates execution time in the whisper.cpp demo - ggml_vec_dot_f16. We'll do this without first checking the source code. We'll make a few reasonable assumptions:

A quick inspection lets us rewrite the function signature as:

void ggml_vec_dot_f16(long n,float *sum,fp16 *x,fp16 *y) {...}
That quick inspection also shows a glaring error - the pcode semantics for vluxei64.v has left out a critical parameter. It's present in the listing view but missing in the pcode semantics view. Fix this and move on.

After tinkering with variable names and signatures, we get:
void ggml_vec_dot_q8_0_q8_0(long n,float *sum,block_q8_0 *x,block_q8_0 *y)
{
block_q8_0 *pbVar1;
int iVar2;
char *px_qs;
char *py_qs;
undefined8 uVar3;
undefined8 uVar4;
float partial_sum;
undefined auVar5 [256];
undefined auVar6 [256];
undefined in_v5 [256];
gp = &__global_pointer$;
partial_sum = 0.0;
uVar4 = vsetvli_e8m1tama(0x20);
if (0x1f < n) {
px_qs = x->qs;
py_qs = y->qs;
iVar2 = 0;
vsetvli_e32m1tama(uVar4);
vmv_v_i(in_v5,0);
do {
pbVar1 = (block_q8_0 *)(px_qs + -2);
vsetvli_e8m1tama(uVar4);
auVar6 = vle8_v(px_qs);
auVar5 = vle8_v(py_qs);
auVar5 = vwmul_vv(auVar6,auVar5);
vsetvli_e16m2tama(0);
auVar5 = vwredsum_vs(auVar5,in_v5);
vsetivli_e32m1tama(0);
uVar3 = vmv_x_s(auVar5);
iVar2 = iVar2 + 1;
px_qs = px_qs + 0x22;
partial_sum = (float)(int)uVar3 *
(float)(&ggml_table_f32_f16)[((block_q8_0 *)(py_qs + -2))->field0_0x0] *
(float)(&ggml_table_f32_f16)[pbVar1->field0_0x0] + partial_sum;
py_qs = py_qs + 0x22;
} while (iVar2 < (int)(((uint)((int)n >> 0x1f) >> 0x1b) + (int)n) >> 5);
}
*sum = partial_sum;
return;
}
That's fairly clear - the two vectors are presented as arrays of block_q8_0 structs, each with 32 entries and a scale factor d. An earlier run showed another error, now fixed, with the pcode for vmv_x_s.
Ghidra testing of semantic pcode.
Note: paths and names are likely to change here. Use these notes just as a guide.
The Ghidra 11 isa_ext branch makes heavy use of user-defined pcode (aka Sleigh semantics). Much of that pcode is arbitrarily defined, adding more confusion to an already complex field. Can we build a testing framework to highlight problem areas in pcode semantics?

For example, let's look at Ghidra's decompiler rendering of two RISCV-64 vector instructions, vmv.s.x and vmv.x.s. These instructions move a single element between an integer scalar register and the first element of a vector register. The RISCV vector definition says:
These instructions have a lot of symmetry, but the current isa_ext branch doesn’t render them symmetrically. Let’s build a sample function that uses both instructions followed by an assertion of what we expect to see.
bool test_integer_scalar_vector_move() {
///@ exercise integer scalar moves into and out of a vector register
int x = 1;
int y = 0;
// set vector mode to something simple
__asm__ __volatile__ ("vsetivli zero,1,e32,m1,ta,ma\n\t");
// execute both instructions to set y:= x
__asm__ __volatile__ ("vmv.s.x v1, %1\n\t" "vmv.x.s %0, v1"\
: "=r" (y) \
: "r" (x) );
return x==y;
}
This function should return the boolean value true. It's defined in the file failing_tests/pcodeSamples.cpp and compiled into the library libsamples.so. The function is executed within the test harness failing_tests/pcodeTests.cpp.
Build (with O3 optimization) and execute the test harness with:
$ bazel clean
INFO: Starting clean (this may take a while). Consider using --async if the clean takes more than several minutes.
$ bazel build -s -c opt --platforms=//platforms:riscv_vector failing_tests:samples
$ cp -f bazel-bin/failing_tests/libsamples.so /tmp
$ bazel build -s -c opt --platforms=//platforms:riscv_vector failing_tests:pcodeTests
$ export QEMU_CPU=rv64,zba=true,zbb=true,v=true,vlen=128,vext_spec=v1.0,rvv_ta_all_1s=true,rvv_ma_all_1s=true
$ qemu-riscv64-static -L /opt/riscvx -E LD_LIBRARY_PATH=/opt/riscvx/riscv64-unknown-linux-gnu/lib/ bazel-bin/failing_tests/pcodeTests
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from VectorMove
[ RUN ] VectorMove.vmv_s_x
[ OK ] VectorMove.vmv_s_x (0 ms)
[----------] 1 test from VectorMove (0 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (3 ms total)
[ PASSED ] 1 test.
Now import /tmp/libsamples.so into Ghidra and look for test* functions. The decompilation is:
bool test_integer_scalar_vector_move(void)
{
undefined8 uVar1;
undefined in_v1 [256];
vsetivli_e32m1tama(1);
vmv_s_x(in_v1,1);
uVar1 = vmv_x_s(in_v1);
return (int)uVar1 == 1;
}
This shows two issues:

- For vmv_s_x, the output is the first element of in_v1.
- For vmv_x_s, the output is the scalar register uVar1. That might be OK, since we don't specify what happens to other elements of in_v1.

The general question raised by this example is how to treat pcode output - as an output parameter or as a returned result? The sleigh pcode documentation suggests that parameters are assumed to be input parameters, with the only output register being the one returned by the pcode operation. A quick glance at the ARM Neon and AARCH64 SVE vector sleigh files suggests that this is the convention, but perhaps not a requirement.
Let’s try adding some more test cases before taking any action.
We can track external events to plan future integration test effort.
This project gets more relevant when RISCV-64 processors start appearing in appliances with more instruction set extensions and with code compiled by newer compiler toolchains.
The project results get easier to integrate if and when more development effort is applied to specific Ghidra components.
This page collects external sites to track for convergence.
New instruction extensions often appear here as the first public implementation. Check out the opcodes, aliases, and disassembly patterns found in the test suite.
git log include/opcode/|grep riscv|head
git log gas/testsuite/gas/riscv
git log gcc/testsuite/gcc.target/riscv
Look for commits indicating the stability of vectorization or new compound loop types that now allow auto vectorization.
git log arch/riscv
Note: the Linux kernel just added vector crypto support, derived from the openssl crypto routines. This appears to mostly be in support of encrypted file systems.
The RISCV International wiki home page leads to:
Ghidra/Processors/AARCH64/data/languages/AARCH64sve.sinc defines the instructions used by the AARCH64 Scalable Vector Extensions package. This suite is similar to the RISCV vector suite in that it is vector register length agnostic. It was added in March of 2019 and has not been updated since.

Ghidra/Features/Decompiler/src/decompile/cpp holds much of the existing Ghidra code for system and user defined pcodes. userop.h and userop.cc look relevant, with caheckman a common contributor.
Openssl configuration for ISA Extensions provides a good example.
/home2/build_openssl$ ../vendor/openssl/Configure linux64-riscv64 --cross-compile-prefix=/opt/riscvx/bin/riscv64-unknown-linux-gnu- -march=rv64gcv_zkne_zknd_zknh_zvkng_zvksg
$ perl configdata.pm --dump
Command line (with current working directory = .):
/usr/bin/perl ../vendor/openssl/Configure linux64-riscv64 --cross-compile-prefix=/opt/riscvx/bin/riscv64-unknown-linux-gnu- -march=rv64gcv_zkne_zknd_zknh_zvkng_zvksg
Perl information:
/usr/bin/perl
5.38.2 for x86_64-linux-thread-multi
Enabled features:
afalgeng
apps
argon2
aria
asm
async
atexit
autoalginit
autoerrinit
autoload-config
bf
blake2
bulk
cached-fetch
camellia
capieng
cast
chacha
cmac
cmp
cms
comp
ct
default-thread-pool
deprecated
des
dgram
dh
docs
dsa
dso
dtls
dynamic-engine
ec
ec2m
ecdh
ecdsa
ecx
engine
err
filenames
gost
http
idea
legacy
loadereng
makedepend
md4
mdc2
module
multiblock
nextprotoneg
ocb
ocsp
padlockeng
pic
pinshared
poly1305
posix-io
psk
quic
unstable-qlog
rc2
rc4
rdrand
rfc3779
rmd160
scrypt
secure-memory
seed
shared
siphash
siv
sm2
sm2-precomp
sm3
sm4
sock
srp
srtp
sse2
ssl
ssl-trace
static-engine
stdio
tests
thread-pool
threads
tls
ts
ui-console
whirlpool
tls1
tls1-method
tls1_1
tls1_1-method
tls1_2
tls1_2-method
tls1_3
dtls1
dtls1-method
dtls1_2
dtls1_2-method
Disabled features:
acvp-tests [cascade] OPENSSL_NO_ACVP_TESTS
asan [default] OPENSSL_NO_ASAN
brotli [default] OPENSSL_NO_BROTLI
brotli-dynamic [default] OPENSSL_NO_BROTLI_DYNAMIC
buildtest-c++ [default]
winstore [not-windows] OPENSSL_NO_WINSTORE
crypto-mdebug [default] OPENSSL_NO_CRYPTO_MDEBUG
devcryptoeng [default] OPENSSL_NO_DEVCRYPTOENG
ec_nistp_64_gcc_128 [default] OPENSSL_NO_EC_NISTP_64_GCC_128
egd [default] OPENSSL_NO_EGD
external-tests [default] OPENSSL_NO_EXTERNAL_TESTS
fips [default]
fips-securitychecks [cascade] OPENSSL_NO_FIPS_SECURITYCHECKS
fuzz-afl [default] OPENSSL_NO_FUZZ_AFL
fuzz-libfuzzer [default] OPENSSL_NO_FUZZ_LIBFUZZER
ktls [default] OPENSSL_NO_KTLS
md2 [default] OPENSSL_NO_MD2 (skip crypto/md2)
msan [default] OPENSSL_NO_MSAN
rc5 [default] OPENSSL_NO_RC5 (skip crypto/rc5)
sctp [default] OPENSSL_NO_SCTP
tfo [default] OPENSSL_NO_TFO
trace [default] OPENSSL_NO_TRACE
ubsan [default] OPENSSL_NO_UBSAN
unit-test [default] OPENSSL_NO_UNIT_TEST
uplink [no uplink_arch] OPENSSL_NO_UPLINK
weak-ssl-ciphers [default] OPENSSL_NO_WEAK_SSL_CIPHERS
zlib [default] OPENSSL_NO_ZLIB
zlib-dynamic [default] OPENSSL_NO_ZLIB_DYNAMIC
zstd [default] OPENSSL_NO_ZSTD
zstd-dynamic [default] OPENSSL_NO_ZSTD_DYNAMIC
ssl3 [default] OPENSSL_NO_SSL3
ssl3-method [default] OPENSSL_NO_SSL3_METHOD
Config target attributes:
AR => "ar",
ARFLAGS => "qc",
CC => "gcc",
CFLAGS => "-Wall -O3",
CXX => "g++",
CXXFLAGS => "-Wall -O3",
HASHBANGPERL => "/usr/bin/env perl",
RANLIB => "ranlib",
RC => "windres",
asm_arch => "riscv64",
bn_ops => "SIXTY_FOUR_BIT_LONG RC4_CHAR",
build_file => "Makefile",
build_scheme => [ "unified", "unix" ],
cflags => "-pthread",
cppflags => "",
cxxflags => "-std=c++11 -pthread",
defines => [ "OPENSSL_BUILDING_OPENSSL" ],
disable => [ ],
dso_ldflags => "-Wl,-z,defs",
dso_scheme => "dlfcn",
enable => [ "afalgeng" ],
ex_libs => "-ldl -pthread",
includes => [ ],
lflags => "",
lib_cflags => "",
lib_cppflags => "-DOPENSSL_USE_NODELETE",
lib_defines => [ ],
module_cflags => "-fPIC",
module_cxxflags => undef,
module_ldflags => "-Wl,-znodelete -shared -Wl,-Bsymbolic",
perl_platform => "Unix",
perlasm_scheme => "linux64",
shared_cflag => "-fPIC",
shared_defflag => "-Wl,--version-script=",
shared_defines => [ ],
shared_ldflag => "-Wl,-znodelete -shared -Wl,-Bsymbolic",
shared_rcflag => "",
shared_sonameflag => "-Wl,-soname=",
shared_target => "linux-shared",
thread_defines => [ ],
thread_scheme => "pthreads",
unistd => "<unistd.h>",
Recorded environment:
AR =
BUILDFILE =
CC =
CFLAGS =
CPPFLAGS =
CROSS_COMPILE =
CXX =
CXXFLAGS =
HASHBANGPERL =
LDFLAGS =
LDLIBS =
OPENSSL_LOCAL_CONFIG_DIR =
PERL =
RANLIB =
RC =
RCFLAGS =
WINDRES =
__CNF_CFLAGS =
__CNF_CPPDEFINES =
__CNF_CPPFLAGS =
__CNF_CPPINCLUDES =
__CNF_CXXFLAGS =
__CNF_LDFLAGS =
__CNF_LDLIBS =
Makevars:
AR = /opt/riscvx/bin/riscv64-unknown-linux-gnu-ar
ARFLAGS = qc
ASFLAGS =
CC = /opt/riscvx/bin/riscv64-unknown-linux-gnu-gcc
CFLAGS = -Wall -O3 -march=rv64gcv_zkne_zknd_zknh_zvkng_zvksg
CPPDEFINES =
CPPFLAGS =
CPPINCLUDES =
CROSS_COMPILE = /opt/riscvx/bin/riscv64-unknown-linux-gnu-
CXX = /opt/riscvx/bin/riscv64-unknown-linux-gnu-g++
CXXFLAGS = -Wall -O3 -march=rv64gcv_zkne_zknd_zknh_zvkng_zvksg
HASHBANGPERL = /usr/bin/env perl
LDFLAGS =
LDLIBS =
PERL = /usr/bin/perl
RANLIB = /opt/riscvx/bin/riscv64-unknown-linux-gnu-ranlib
RC = /opt/riscvx/bin/riscv64-unknown-linux-gnu-windres
RCFLAGS =
NOTE: These variables only represent the configuration view. The build file
template may have processed these variables further, please have a look at the
build file for more exact data:
Makefile
build file:
Makefile
build file templates:
../vendor/openssl/Configurations/common0.tmpl
../vendor/openssl/Configurations/unix-Makefile.tmpl
$ make
...
/opt/riscvx/lib/gcc/riscv64-unknown-linux-gnu/14.0.1/../../../../riscv64-unknown-linux-gnu/bin/ld: cannot find -ldl: No such file or directory
The error occurs in the linking phase, since we did not provide the sysroot and library search paths needed by the cross-compiling linker.
A quick check of the generated object files shows:
$ find . -name \*risc\*.o
./crypto/sm4/libcrypto-lib-sm4-riscv64-zvksed.o
./crypto/sm4/libcrypto-shlib-sm4-riscv64-zvksed.o
./crypto/aes/libcrypto-lib-aes-riscv64-zvkned.o
./crypto/aes/libcrypto-shlib-aes-riscv64-zvkned.o
./crypto/aes/libcrypto-shlib-aes-riscv64-zvbb-zvkg-zvkned.o
./crypto/aes/libcrypto-shlib-aes-riscv64-zkn.o
./crypto/aes/libcrypto-lib-aes-riscv64-zkn.o
./crypto/aes/libcrypto-shlib-aes-riscv64-zvkb-zvkned.o
./crypto/aes/libcrypto-shlib-aes-riscv64.o
./crypto/aes/libcrypto-lib-aes-riscv64-zvkb-zvkned.o
./crypto/aes/libcrypto-lib-aes-riscv64-zvbb-zvkg-zvkned.o
./crypto/aes/libcrypto-lib-aes-riscv64.o
./crypto/chacha/libcrypto-shlib-chacha_riscv.o
./crypto/chacha/libcrypto-lib-chacha_riscv.o
./crypto/chacha/libcrypto-lib-chacha-riscv64-zvkb.o
./crypto/chacha/libcrypto-shlib-chacha-riscv64-zvkb.o
./crypto/libcrypto-shlib-riscv64cpuid.o
./crypto/libcrypto-lib-riscv64cpuid.o
./crypto/sha/libcrypto-lib-sha_riscv.o
./crypto/sha/libcrypto-lib-sha256-riscv64-zvkb-zvknha_or_zvknhb.o
./crypto/sha/libcrypto-shlib-sha512-riscv64-zvkb-zvknhb.o
./crypto/sha/libcrypto-shlib-sha_riscv.o
./crypto/sha/libcrypto-shlib-sha256-riscv64-zvkb-zvknha_or_zvknhb.o
./crypto/sha/libcrypto-lib-sha512-riscv64-zvkb-zvknhb.o
./crypto/sm3/libcrypto-lib-sm3-riscv64-zvksh.o
./crypto/sm3/libcrypto-lib-sm3_riscv.o
./crypto/sm3/libcrypto-shlib-sm3-riscv64-zvksh.o
./crypto/sm3/libcrypto-shlib-sm3_riscv.o
./crypto/libcrypto-shlib-riscvcap.o
./crypto/modes/libcrypto-shlib-ghash-riscv64.o
./crypto/modes/libcrypto-shlib-ghash-riscv64-zvkg.o
./crypto/modes/libcrypto-lib-aes-gcm-riscv64-zvkb-zvkg-zvkned.o
./crypto/modes/libcrypto-lib-ghash-riscv64.o
./crypto/modes/libcrypto-shlib-ghash-riscv64-zvkb-zvbc.o
./crypto/modes/libcrypto-lib-ghash-riscv64-zvkg.o
./crypto/modes/libcrypto-shlib-aes-gcm-riscv64-zvkb-zvkg-zvkned.o
./crypto/modes/libcrypto-lib-ghash-riscv64-zvkb-zvbc.o
./crypto/libcrypto-lib-riscvcap.o
That suggests we need to cover more extensions in our -march string. The openssl source code conditionally defines symbols for these extension combinations; these symbols are defined in crypto/riscvcap.c after analyzing the march string passed to the compiler.
So the next steps include:
- extending the -march string to cover the additional extensions
- setting LDFLAGS and LDLIBS to enable building a riscv-64 openssl.so
From the build_openssl directory, rerun Configure with the extended -march string:
$ ../vendor/openssl/Configure linux64-riscv64 --cross-compile-prefix=/opt/riscvx/bin/riscv64-unknown-linux-gnu- -march=rv64gcv_zkne_zknd_zknh_zvkng_zvbb_zvbc_zvkb_zvkg_zvkned_zvksg
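One way to satisfy the linker without patching the generated Makefile afterwards is to export LDFLAGS and LDLIBS before running Configure, since OpenSSL records both from the environment. The flag values below are assumptions about this sysroot layout, not tested settings:

```shell
# Assumed sysroot layout for the cross linker; adjust to your install.
export LDFLAGS="--sysroot=/opt/riscvx/sysroot -L/opt/riscvx/lib"
export LDLIBS="-pthread"
# Then rerun Configure with the same --cross-compile-prefix and -march
# arguments as above.
```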
Patch the generated Makefile to:
< CNF_EX_LIBS=-ldl -pthread
---
> CNF_EX_LIBS=/opt/riscvx/lib/libdl.a -pthread
$ make
Then examine libcrypto.so.3 and libssl.so.3 in Ghidra.
Disassembly testing against binutils reference dumps can follow these steps:
- Import libcrypto.so.3 into Ghidra
- Export the disassembly to /tmp/libcrypto.so.3.txt and the decompilation to /tmp/libcrypto.so.3.c
- Generate a reference dump: /opt/riscvx/bin/riscv64-unknown-linux-gnu-objdump -j .text -D libcrypto.so.3 > libcrypto.so.3_ref.txt
- Search /tmp/libcrypto.so.3.txt and libcrypto.so.3_ref.txt for vset instructions, comparing operands
How does Openssl manage RISCV ISA extensions? We'll use the gcm_ghash family of functions as examples.
The march=rv64gcv_z... arguments are processed by the Openssl configuration tool and turned into #ifdef variables. These can include combinations like RISCV_HAS_ZVKB_AND_ZVKSED. Multiple versions of key routines are compiled, each with a different set of required extensions. At runtime, gcm_get_funcs returns the preferred set of implementing functions. The gcm_ghash set can include:
gcm_ghash_4bit
gcm_ghash_rv64i_zvkb_zvbc
gcm_ghash_rv64i_zvkg
gcm_ghash_rv64i_zbc
gcm_ghash_rv64i_zbc__zbkb
The gcm_ghash_4bit
is the default version with 412 instructions, of which 11 are vector instructions inserted by the compiler.
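Instruction counts like these can be reproduced by slicing one function out of the objdump reference listing and counting its mnemonics. This is a sketch, assuming the usual objdump line layout (address, raw bytes, mnemonic, operands) and a blank line after each function; the helper name is my own:

```shell
# count_insns <dump-file> <function-name>
# Prints the total instruction count and the vector (v*) instruction count
# for the named function in an objdump disassembly listing.
count_insns() {
  awk -v fn="<$2>:" '
    index($0, fn)  { inside = 1; next }   # function header line
    inside && /^$/ { inside = 0 }         # blank line ends the function
    inside         { total++; if ($3 ~ /^v/) vec++ }
    END            { print total + 0, vec + 0 }
  ' "$1"
}
# e.g. count_insns libcrypto.so.3_ref.txt gcm_ghash_4bit
```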
The gcm_ghash_rv64i_zvkg
is the most advanced version with only 32 instructions. Ghidra decompiles this as:
void gcm_ghash_rv64i_zvkg(undefined8 param_1,undefined8 param_2,long param_3,long param_4)
{
undefined auVar1 [256];
undefined auVar2 [256];
vsetivli_e32m1tumu(4);
auVar1 = vle32_v(param_2);
vle32_v(param_1);
do {
auVar2 = vle32_v(param_3);
param_3 = param_3 + 0x10;
param_4 = param_4 + -0x10;
auVar2 = vghsh_vv(auVar1,auVar2);
} while (param_4 != 0);
vse32_v(auVar2,param_1);
return;
}
That shows an error in our sinc files - several vector instructions use the vd register as both an input and an output, so our pcode semantics need updating: in the listing above, the accumulator loaded from param_1 is discarded, and each vghsh_vv result never feeds back into the next iteration. Fix the semantics and inspect the Ghidra output again:
void gcm_ghash_rv64i_zvkg(undefined8 param_1,undefined8 param_2,long param_3,long param_4)
{
undefined auVar1 [256];
undefined auVar2 [256];
undefined auVar3 [256];
vsetivli_e32m1tumu(4);
auVar2 = vle32_v(param_2);
auVar1 = vle32_v(param_1);
do {
auVar3 = vle32_v(param_3);
param_3 = param_3 + 0x10;
param_4 = param_4 + -0x10;
auVar1 = vghsh_vv(auVar1,auVar2,auVar3);
} while (param_4 != 0);
vse32_v(auVar1,param_1);
return;
}
That’s better - commit and push.
Building a new toolchain can be messy.
A C or C++ toolchain needs at least three components:
- binutils, for the assembler and linker
- gcc, for the compiler proper and support libraries like libgcc
- glibc, for the C runtime and shared libraries
These components have cross-dependencies. A full gcc build needs libc.so from glibc. A full glibc build needs libgcc from gcc. There are different ways to handle these cross-dependencies, such as splitting the gcc build into two phases or prepopulating the build directories with ‘close-enough’ files from a previous build.
The sysroot component is the trickiest to handle, since gcc and glibc need to pull files from the sysroot as they update files within the sysroot. You can generally start with a bootstrap sysroot, say from a previous toolchain, then update it with the latest binutils, gcc, and glibc.
Start with a released tarball for gcc and glibc. We’ll use the development tip of binutils for this pass.
Copy kernel header files into /opt/riscv/sysroot/usr/include/.
Configure and install binutils:
$ /home2/vendor/binutils-gdb/configure --prefix=/opt/riscv/sysroot --with-sysroot=/opt/riscv/sysroot --target=riscv64-unknown-linux-gnu
$ make -j4
$ make install
Configure and install minimal gcc:
$ /home2/vendor/gcc-14.1.0/configure --prefix=/opt/riscv --enable-languages=c,c++ --disable-multilib --target=riscv64-unknown-linux-gnu --with-sysroot=/opt/riscv/sysroot
$ make all-gcc
$ make install-gcc
Configure and install glibc:
$ ../../vendor/glibc-2.39/configure --host=riscv64-unknown-linux-gnu --target=riscv64-unknown-linux-gnu --prefix=/opt/riscv --disable-werror --enable-shared --disable-multilib --with-headers=/opt/riscv/sysroot/usr/include
$ make install-bootstrap-headers=yes install_root=/opt/riscv/sysroot install-headers
How do we replace any older sysroot bootstrap files with their freshly built versions? The most common problems involve libgcc*, libc*, and crt* files. The bootstrap sysroot needs these files to exist. The toolchain build process should replace them, but it may not replace all instances of these files.
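Since dates and sizes are only weak hints, checksums make stale copies easier to spot. This helper is a sketch of my own (the name and the md5sum approach are not part of the build tooling): identical names with differing checksums under the sysroot mark candidates for scrubbing.

```shell
# check_stale <root> <name>...
# Lists every copy of each named artifact under <root> with a checksum;
# differing checksums for the same name flag a possible stale bootstrap file.
check_stale() {
  root="$1"; shift
  for name in "$@"; do
    echo "== $name"
    find "$root" -name "$name" -exec md5sum {} +
  done
}
# e.g. check_stale /opt/riscv/sysroot libgcc_s.so.1 crti.o crtn.o
```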
Let’s scrub the libgcc files, comparing the gcc directory in which they are built with the sysroot directories in which they will be saved.
$ B=/home2/build_riscv/gcc
$ S=/opt/riscv/sysroot
$ find $B $S -name libgcc_s.so -ls
57940911 4 -rw-r--r-- 1 ____ ____ 132 May 10 12:28 /home2/build_riscv/gcc/gcc/libgcc_s.so
57940908 4 -rw-r--r-- 1 ____ ____ 132 May 10 12:28 /home2/build_riscv/gcc/riscv64-unknown-linux-gnu/libgcc/libgcc_s.so
14361792 4 -rw-r--r-- 1 ____ ____ 132 May 10 12:32 /opt/riscv/sysroot/riscv64-unknown-linux-gnu/lib/libgcc_s.so
14351655 4 -rw-r--r-- 1 ____ ____ 132 May 10 08:52 /opt/riscv/sysroot/lib/libgcc_s.so
$ diff /opt/riscv/sysroot/lib/libgcc_s.so /opt/riscv/sysroot/riscv64-unknown-linux-gnu/lib/libgcc_s.so
$ cat /opt/riscv/sysroot/lib/libgcc_s.so
/* GNU ld script
Use the shared library, but some functions are only in
the static library. */
GROUP ( libgcc_s.so.1 -lgcc )
$
- /opt/riscv/sysroot/lib/libgcc_s.so is our bootstrap input
- /home2/build_riscv/gcc/gcc/libgcc_s.so and /home2/build_riscv/gcc/riscv64-unknown-linux-gnu/libgcc/libgcc_s.so are the generated outputs
Now check libgcc_s.so.1 for staleness:
$ find $B $S -name libgcc_s.so.1 -ls
57940910 700 -rw-r--r-- 1 ____ ____ 713128 May 10 12:28 /home2/build_riscv/gcc/gcc/libgcc_s.so.1
57946454 700 -rwxr-xr-x 1 ____ ____ 713128 May 10 12:28 /home2/build_riscv/gcc/riscv64-unknown-linux-gnu/libgcc/libgcc_s.so.1
14361791 700 -rw-r--r-- 1 ____ ____ 713128 May 10 12:32 /opt/riscv/sysroot/riscv64-unknown-linux-gnu/lib/libgcc_s.so.1
14351656 696 -rw-r--r-- 1 ____ ____ 708624 May 10 08:53 /opt/riscv/sysroot/lib/libgcc_s.so.1
That looks like a potential problem. The bootstrap file in /opt/riscv/sysroot/lib is older and smaller than the generated files. We need to fix that:
$ rm /opt/riscv/sysroot/lib/libgcc_s.so.1
$ ln /opt/riscv/sysroot/riscv64-unknown-linux-gnu/lib/libgcc_s.so.1 /opt/riscv/sysroot/lib/libgcc_s.so.1
Next check the crt* files:
$ find $B $S -name crt\*.o -ls
57940817 8 -rw-r--r-- 1 ____ ____ 4248 May 10 12:28 /home2/build_riscv/gcc/gcc/crtbeginS.o
57940826 4 -rw-r--r-- 1 ____ ____ 848 May 10 12:28 /home2/build_riscv/gcc/gcc/crtn.o
57940824 4 -rw-r--r-- 1 ____ ____ 848 May 10 12:28 /home2/build_riscv/gcc/gcc/crti.o
57940827 8 -rw-r--r-- 1 ____ ____ 4712 May 10 12:28 /home2/build_riscv/gcc/gcc/crtbeginT.o
57940822 4 -rw-r--r-- 1 ____ ____ 1384 May 10 12:28 /home2/build_riscv/gcc/gcc/crtendS.o
57940823 4 -rw-r--r-- 1 ____ ____ 1384 May 10 12:28 /home2/build_riscv/gcc/gcc/crtend.o
57940815 4 -rw-r--r-- 1 ____ ____ 3640 May 10 12:28 /home2/build_riscv/gcc/gcc/crtbegin.o
57940800 8 -rw-r--r-- 1 ____ ____ 4248 May 9 16:00 /home2/build_riscv/gcc/riscv64-unknown-linux-gnu/libgcc/crtbeginS.o
57940808 4 -rw-r--r-- 1 ____ ____ 848 May 9 16:00 /home2/build_riscv/gcc/riscv64-unknown-linux-gnu/libgcc/crtn.o
57940806 4 -rw-r--r-- 1 ____ ____ 848 May 9 16:00 /home2/build_riscv/gcc/riscv64-unknown-linux-gnu/libgcc/crti.o
57940803 8 -rw-r--r-- 1 ____ ____ 4712 May 9 16:00 /home2/build_riscv/gcc/riscv64-unknown-linux-gnu/libgcc/crtbeginT.o
57940812 4 -rw-r--r-- 1 ____ ____ 1384 May 9 16:00 /home2/build_riscv/gcc/riscv64-unknown-linux-gnu/libgcc/crtendS.o
57940804 4 -rw-r--r-- 1 ____ ____ 1384 May 9 16:00 /home2/build_riscv/gcc/riscv64-unknown-linux-gnu/libgcc/crtend.o
57940798 4 -rw-r--r-- 1 ____ ____ 3640 May 9 16:00 /home2/build_riscv/gcc/riscv64-unknown-linux-gnu/libgcc/crtbegin.o
14351609 16 -rw-r--r-- 1 ____ ____ 13848 May 10 08:48 /opt/riscv/sysroot/usr/lib/crt1.o
14351614 4 -rw-r--r-- 1 ____ ____ 952 May 10 08:48 /opt/riscv/sysroot/usr/lib/crti.o
14351623 4 -rw-r--r-- 1 ____ ____ 952 May 10 08:49 /opt/riscv/sysroot/usr/lib/crtn.o
14361798 8 -rw-r--r-- 1 ____ ____ 4248 May 10 12:32 /opt/riscv/sysroot/lib/gcc/riscv64-unknown-linux-gnu/14.1.0/crtbeginS.o
14361802 4 -rw-r--r-- 1 ____ ____ 3640 May 10 12:32 /opt/riscv/sysroot/lib/gcc/riscv64-unknown-linux-gnu/14.1.0/crtbegin.o
14361803 4 -rw-r--r-- 1 ____ ____ 1384 May 10 12:32 /opt/riscv/sysroot/lib/gcc/riscv64-unknown-linux-gnu/14.1.0/crtend.o
14361804 4 -rw-r--r-- 1 ____ ____ 848 May 10 12:32 /opt/riscv/sysroot/lib/gcc/riscv64-unknown-linux-gnu/14.1.0/crti.o
14361805 4 -rw-r--r-- 1 ____ ____ 848 May 10 12:32 /opt/riscv/sysroot/lib/gcc/riscv64-unknown-linux-gnu/14.1.0/crtn.o
14361806 4 -rw-r--r-- 1 ____ ____ 1384 May 10 12:32 /opt/riscv/sysroot/lib/gcc/riscv64-unknown-linux-gnu/14.1.0/crtendS.o
14361807 8 -rw-r--r-- 1 ____ ____ 4712 May 10 12:32 /opt/riscv/sysroot/lib/gcc/riscv64-unknown-linux-gnu/14.1.0/crtbeginT.o
The files in /opt/riscv/sysroot/usr/lib are likely the bootstrap files. The sysroot files are identical to the build files, with two exceptions:
- crt1.o is not generated by the gcc build process; it is installed by glibc.
- The crti.o and crtn.o bootstrap files differ from the generated files. If we want to use this updated sysroot to build a 14.2.0 toolchain, we probably want the newer versions.
So replace the bootstrap /opt/riscv/sysroot/usr/lib/crt*.o files with hard links to the generated files.
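The replacement can be scripted. This is a sketch under the directory layout shown above; relink_crt is a hypothetical helper, and it deliberately skips crt1.o, which the gcc build does not produce:

```shell
# relink_crt <generated-dir> <bootstrap-dir>
# Replace bootstrap crti.o/crtn.o with hard links to the freshly built copies.
relink_crt() {
  for f in crti.o crtn.o; do
    if [ -f "$1/$f" ]; then
      rm -f "$2/$f"
      ln "$1/$f" "$2/$f"
    fi
  done
}
# e.g. relink_crt /opt/riscv/sysroot/lib/gcc/riscv64-unknown-linux-gnu/14.1.0 \
#                 /opt/riscv/sysroot/usr/lib
```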