Notes

Put unstructured comments here until we know what to do with them.

TODO

  • Update the isa_ext Ghidra branch to expand vsetvli arguments (see the vtype decoding sketch after this list)
    • vsetvli zero,zero,0xc5 → vsetvli zero,zero,e8,mf8,ta,ma
    • vsetvli zero,zero,0x18 → vsetvli zero,zero,e64,m1,tu,mu
  • Determine why the isa_ext Ghidra branch fails to disassemble the bext instruction in b-ext-64.o and b-ext.o
    • that regression was due to an accidental typo
  • Determine why zvbc.o won’t disassemble
    • These are compressed (16 bit) vector multiply instructions not currently defined in isa_ext
  • Determine why unknown.o won’t disassemble or reference where we found these instructions
    • These instructions include sfence, hinval_vvma, hinval_gvma, orc.b, cbo.clean, cbo.inval, and cbo.flush. orc.b is handled properly; the others are not implemented.
  • Clarify python scripts to show more of the directory context
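
Those two vtype immediates decode according to the RVV 1.0 vtype layout (vlmul in bits 2:0, vsew in bits 5:3, vta in bit 6, vma in bit 7). A minimal C++ sketch that reproduces the expanded forms binutils prints:

#include <cstdio>

// Decode an RVV 1.0 vtype immediate into the mnemonic form binutils prints.
// Field layout: vlmul[2:0], vsew[5:3], vta bit 6, vma bit 7.
static void decode_vtype(unsigned v) {
    static const char *sew[]  = {"e8", "e16", "e32", "e64", "e?", "e?", "e?", "e?"};
    static const char *lmul[] = {"m1", "m2", "m4", "m8", "m?", "mf8", "mf4", "mf2"};
    std::printf("0x%x -> %s,%s,%s,%s\n", v,
                sew[(v >> 3) & 7], lmul[v & 7],
                (v & 0x40) ? "ta" : "tu",
                (v & 0x80) ? "ma" : "mu");
}

int main() {
    decode_vtype(0xc5);   // prints e8,mf8,ta,ma
    decode_vtype(0x18);   // prints e64,m1,tu,mu
    return 0;
}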

Experiments

how much Ghidra complexity does gcc-14 introduce in a full build?

Assume a vendor generates a new toolchain with multiple extensions enabled by default. What fraction of the compiled functions would contain extensions unrecognized by Ghidra 11.0? Since THead has supplied most of the vendor-specific extensions known to binutils 2.41, we’ll use that as a reference. The architecture name will be something like

-march=rv64gv_zba_zbb_zbc_zbkb_zbkc_zbkx_zvbc_xtheadba_xtheadbb_xtheadbs_xtheadcmo_xtheadcondmov_xtheadmac_xtheadfmemidx_xtheadmempair_xtheadsync

Add some C++ code to exercise libstdc++ ordered maps (based on red-black trees?), unordered maps (hash table based), and the Murmur hash function.
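
One possible shape for that exercise code, sketched below; the key format and container sizes are arbitrary choices. The unordered map exercises libstdc++'s Murmur-based string hashing (std::_Hash_bytes) indirectly rather than calling the hash routine itself:

#include <cstdio>
#include <map>
#include <string>
#include <unordered_map>

// Exercise libstdc++ ordered maps (red-black tree), unordered maps (hash
// table), and - indirectly - the Murmur-based std::hash<std::string>.
int main() {
    std::map<std::string, long> ordered;
    std::unordered_map<std::string, long> hashed;

    for (long i = 0; i < 100000; ++i) {
        std::string key = "key-" + std::to_string(i);
        ordered[key] = i;
        hashed[key] = i;
    }

    long hits = 0;
    for (long i = 0; i < 100000; i += 7) {
        std::string key = "key-" + std::to_string(i);
        hits += ordered.count(key) + hashed.count(key);
    }

    std::printf("hits: %ld\n", hits);
    return 0;
}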

There are a few places where THead customized instructions are used. The Murmur hash function uses vector load and store instructions to implement 8 byte unaligned reads. Bit manipulation extension instructions are not yet used.

Initial results suggest the largest complexity impact will be gcc's rewriting of memory and structure copies as vector code. This may be especially true for hardware requiring aligned integer accesses where alignment cannot be guaranteed.

1 - Hardware Availability

When will RISCV-64 cores be deployed into systems needing reverse-engineering?

General purpose systems

https://www.cnx-software.com/2022/11/02/sifive-p670-and-p470-risc-v-processors-add-risc-v-vector-extensions/

https://www.cnx-software.com/2023/08/30/sifive-unveils-p870-high-performance-core-discusses-future-of-risc-v

https://github.com/riscv/riscv-profiles/blob/main/rva23-profile.adoc

https://www.scmp.com/tech/tech-trends/article/3232686/chinas-top-chip-designers-form-risc-v-patent-alliance-promote-semiconductor-self-sufficiency

Note: the general SiFive SDK boards might have been deprioritized in favor of specific licensing agreements. https://www.sifive.com/boards/hifive-pro-p550

https://liliputing.com/sifive-hifive-pro-p550-dev-board-coming-this-summer-with-intel-horse-creek-risc-v-chip/

We might expect to see high performance network appliances in 2026 using chip architectures like the SiFive 670 or 870, or from one of the alternative Chinese vendors. Chips with vector extensions are due soon, with crypto extensions coming shortly after. A network appliance development board might have two P670 class sockets and four to eight 10 GbE network interfaces.

To manage scope, we won’t be worrying about instructions supporting AI or complex virtualization. Custom instructions that might be used in network appliances are definitely in scope, while custom instructions for nested virtualization are not. Possibly in scope are new instructions that help manage or synchronize multi-socket cache memory.

Let’s set a provocative long term goal: How will Ghidra analyze a future network appliance that combines Machine Learning with self-modifying code to accelerate network routing and forwarding? Such a device might generate fast-path code sequences to sessionize incoming packets and deliver them with minimal cache flushes or branches taken.

A RISCV-64 analog of the Marvell Octeon 10 might be a feasible future hardware component.

Portable appliances

This might include cell phones or voice-recognition apps: devices that today might use an Arm core set but could be implemented with RISC-V cores in the future.

role of mixed 32 and 64 bit cores

Consider a midpoint network appliance (router or firewall) sitting near the Provider-Customer demarcation. What might an appealing RISCV processor for this role look like? This kind of appliance likely handles a mix of link layer protocols, optimized for low energy dissipation and low latency per packet. A fast and simple serializer/deserializer feeding a RISCV classifier and forwarding engine makes sense here. You don’t want to do network or application layer processing unless the appliance has a firewall role.

Link layer processing means a packet mix of stacked MPLS and VLAN tags with IPv4 and IPv6 network layers underneath. Packet header processing won’t need 32 bit addressing, but might benefit from the high memory bandwidth of a 64 bit core. It would also benefit from fast header hashing combined with fast hashmap session lookups (for MPLS, VLAN, and selected IP) or fast trie session lookups (for IPv4 and IPv6). Network stacks can have a lot of branches, creating pipeline stalls, so hyperthreading may make sense.

Denial of Service and overload protections make fast analytics necessary at the session level. That’s where a 64 bit core with vector and other extensions can be useful.

This all suggests we might see more hybrid RISCV designs, with a mix of many lean 32 bit cores supported by one or two 64 bit cores. The 32 bit cores handle fast link layer processing and the 64 bit cores handle background analytics and control.

In the extreme case, the 64 bit analytics engine rewrites link layer code for the 32 bit cores continuously, optimizing code paths depending on what the link layer classifiers determine the most common packet types to be for each physical port. Cache management and branch prediction hints might drive new instruction additions.

Code rewriting could start as simple updates to RISCV hint branch instructions and possibly prefetch instructions, so it isn’t necessarily as radical as it sounds.

2 - Network Appliances

What will RISCV-64 cores offer networking?

will vector instructions be useful in network appliances?

Network kernel code has lots of conditional branches and very few loops. This suggests RISCV vector instructions won’t be found in network appliances anytime soon, other than in memmove or similar simple contexts. Gather-scatter, bit manipulation, and crypto instruction extensions are likely to be useful in networking much sooner. Ghidra will have a much easier time generating pcode for those instructions than for the 25K+ RISCV vector intrinsic C functions covering all combinations of vector instructions and vector processing modes.

What should Ghidra do when faced with a counter-example, say a network appliance that aggressively moves vector analytics into network processing? Such an appliance - perhaps a smart edge router or a zero-trust gateway device - might combine the following:

  • 64 RISCV cores with no floating point or vector capability, optimized for traditional network ingress processing. These cores are designed to cope with the many branches of network packet processing, possibly including better branch prediction and hyperthreading.
  • 2 or more RISCV cores with full floating point and vector capability, optimized for performing analytics on the inbound packet stream. These analytics can range from simple statistics generation to heuristic sessionization to self-modifying code generation. The self-modifying code may be either eBPF code or native RISCV instructions, depending on how aggressive the designers may be.

In the extreme case, this might be a generative AI subsystem trained on inbound packets and emitting either optimized packet handling code or threat-detection indicators. How would a Ghidra analyst look for malware in such a system?

midpoint versus endpoint network appliances

We need to be clearer about what kind of network code we might find in different contexts:

  • midpoint equipment like network-edge routers and switches, optimized for maximum throughput
  • endpoint equipment like host computers, web servers, and database servers where applications take up the bulk of the CPU cycles

For each of these contexts we have at least two topology variants:

  • Inline network code through which packets must transit, generally optimized for low latency and high throughput
  • Tapped network code (e.g., wireshark or port-mirrored accesses) observing copies of packets for session and endpoint analytics. Latency is not an issue here.

Midpoint network appliances may need to track session state. A simple network switch is close to stateless. A real-world network switch has a lot of session state to manage if it supports:

  • denial of service overload detection or other flow control
  • link bonding or equal-weight multipath routing

The key point here is that midpoint network appliances may benefit from instruction set extensions that enable faster packet classification, hashing, and cached session lookup. An adaptive midpoint network appliance might adjust the packet classification code in real-time, based on the mix of MPLS, VLAN, IPv4, IPv6, and VPN traffic most often seen on any given network interface. ISA extensions supporting gather, hash, vector comparison, and find-first operations are good candidates here.

3 - A vectorization case study

Compare and debug human and gcc vectorization

This case study compares human and compiler vectorization of a simple ML quantization algorithm. We’ll assume we need to inspect the code to understand why these two binaries sometimes produce different results. Our primary goal is to see whether we can improve Ghidra’s RISCV pcode generation to make such analyses easier. A secondary goal is to collect generated instruction patterns that may help Ghidra users understand what optimizing vectorizing compilers can do to source code.

The ML algorithm under test comes from https://github.com/ggerganov/llama.cpp. It packs an array of 32 bit floats into a set of q8_0 blocks to condense large model files. The q8_0 quantization reduces each group of 32 32 bit floating point numbers to 32 8 bit integers plus an associated 16 bit floating point scale factor.
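
For reference, the block layout implied by that description (and by the 34 byte stride visible in the decompiled loops later in this case study) is roughly the following; llama.cpp's actual typedef may differ in naming:

#include <stdint.h>

#define QK8_0 32

// One q8_0 block: a half precision scale factor plus 32 quantized values,
// 2 + 32 = 34 bytes (the 0x22 pointer stride seen in the decompiled loops).
typedef uint16_t ggml_fp16_t;   // IEEE 754 binary16 stored as raw bits

typedef struct {
    ggml_fp16_t d;              // scale factor ("delta")
    int8_t      qs[QK8_0];      // quantized values
} block_q8_0;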

The ggml-quants.c file in the llama.cpp repo provides both scalar source code (quantize_row_q8_0_reference) and hand-generated vectorized source code (quantize_row_q8_0).

  • The quantize_row_q8_0 function has several #ifdef sections providing hand-generated vector intrinsics for riscv, avx2, arm/neon, and wasm.
  • The quantize_row_q8_0_reference function source uses more loops but no vector instructions. GCC-14 will autovectorize the scalar quantize_row_q8_0_reference, producing vector code that is quite different from the hand-generated vector intrinsics.

The notional AI development shop wants to use Ghidra to inspect generated assembly instructions for both quantize_row_q8_0 and quantize_row_q8_0_reference to track down reported quirks. On some systems they produce identical results, on others the results differ. The test framework includes:

  • A target RISCV-64 processor supporting vector and compressed instructions.
  • GCC-14 developmental (pending release) compiler toolchain for native x86_64 builds
  • GCC-14 developmental (pending release) RISCV-64 cross-compiler toolchain with standard options -march=rv64gcv, -O3, and -ffast-math.
  • qemu-riscv64-static emulated execution of user space RISCV-64 applications on an x86_64 Linux test server.
  • A generic unit testing framework like gtest.
  • Ghidra 11+ with the isa_ext branch supporting RISCV 1.0 vector instructions.

The unit test process involves three unit test executions:

  • a reference x86_64 execution to test the logic on a common platform.
  • within a qemu-riscv64-static environment with an emulated VLEN=256 bits
  • within a qemu-riscv64-static environment with an emulated VLEN=128 bits

Note: This exercise uses whisper.cpp C and C++ source code as ‘ground truth’, coupled with a C++ test framework. If we didn’t have source code, we would have to reconstruct key library source files based on Ghidra inspection, then refine those reconstructions until Ghidra and unit testing show that our reconstructions behave the same as the original binaries.

As setup to the Ghidra inspection, we will build and run all three and expect to see three PASSED notifications:

$ bazel run --platforms=//platforms:x86_64 case_studies:unitTests
...

INFO: Analyzed target //case_studies:unitTests (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
Target //case_studies:unitTests up-to-date:
  bazel-bin/case_studies/unitTests
INFO: Elapsed time: 21.065s, Critical Path: 20.71s
INFO: 37 processes: 2 internal, 35 linux-sandbox.
INFO: Build completed successfully, 37 total actions
INFO: Running command line: bazel-bin/case_studies/unitTests
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from FP16
[ RUN      ] FP16.convertFromFp32Reference
[       OK ] FP16.convertFromFp32Reference (0 ms)
[ RUN      ] FP16.convertFromFp32VectorIntrinsics
[       OK ] FP16.convertFromFp32VectorIntrinsics (0 ms)
[----------] 2 tests from FP16 (0 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (0 ms total)
[  PASSED  ] 2 tests.

$ bazel build --platforms=//platforms:riscv_vector case_studies:unitTests
$ bazel build --platforms=//platforms:riscv_vector --define __riscv_v_intrinsics=1 case_studies:unitTests
WARNING: Build option --platforms has changed, discarding analysis cache (this can be expensive, see https://bazel.build/advanced/performance/iteration-speed).
INFO: Analyzed target //case_studies:unitTests (0 packages loaded, 1904 targets configured).
...
INFO: Found 1 target...
Target //case_studies:unitTests up-to-date:
  bazel-bin/case_studies/unitTests
INFO: Elapsed time: 22.265s, Critical Path: 22.07s
INFO: 37 processes: 2 internal, 35 linux-sandbox.
INFO: Build completed successfully, 37 total actions
$ export QEMU_CPU=rv64,zba=true,zbb=true,v=true,vlen=256,vext_spec=v1.0,rvv_ta_all_1s=true,rvv_ma_all_1s=true
$ qemu-riscv64-static -L /opt/riscvx -E LD_LIBRARY_PATH=/opt/riscvx/riscv64-unknown-linux-gnu/lib/ bazel-bin/case_studies/unitTests
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from FP16
[ RUN      ] FP16.convertFromFp32Reference
[       OK ] FP16.convertFromFp32Reference (1 ms)
[ RUN      ] FP16.convertFromFp32VectorIntrinsics
[       OK ] FP16.convertFromFp32VectorIntrinsics (0 ms)
[----------] 2 tests from FP16 (2 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (6 ms total)
[  PASSED  ] 2 tests.

Target //case_studies:unitTests up-to-date:
  bazel-bin/case_studies/unitTests
INFO: Elapsed time: 8.984s, Critical Path: 8.88s
INFO: 29 processes: 2 internal, 27 linux-sandbox.
INFO: Build completed successfully, 29 total actions

$ QEMU_CPU=rv64,zba=true,zbb=true,v=true,vlen=256,vext_spec=v1.0,rvv_ta_all_1s=true,rvv_ma_all_1s=true
$ qemu-riscv64-static -L /opt/riscvx -E LD_LIBRARY_PATH=/opt/riscvx/riscv64-unknown-linux-gnu/lib/ bazel-bin/case_studies/unitTests
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from FP16
[ RUN      ] FP16.convertFromFp32Reference
[       OK ] FP16.convertFromFp32Reference (1 ms)
[ RUN      ] FP16.convertFromFp32VectorIntrinsics
[       OK ] FP16.convertFromFp32VectorIntrinsics (0 ms)
[----------] 2 tests from FP16 (2 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (6 ms total)
[  PASSED  ] 2 tests.

$ export QEMU_CPU=rv64,zba=true,zbb=true,v=true,vlen=128,vext_spec=v1.0,rvv_ta_all_1s=true,rvv_ma_all_1s=true
$ qemu-riscv64-static -L /opt/riscvx -E LD_LIBRARY_PATH=/opt/riscvx/riscv64-unknown-linux-gnu/lib/ bazel-bin/case_studies/unitTests
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from FP16
[ RUN      ] FP16.convertFromFp32Reference
[       OK ] FP16.convertFromFp32Reference (1 ms)
[ RUN      ] FP16.convertFromFp32VectorIntrinsics
case_studies/unitTests.cpp:55: Failure
Expected equality of these values:
  dest[0].d
    Which is: 12175
  fp16_test_array.d
    Which is: 13264
fp16 scale factor is correct
case_studies/unitTests.cpp:57: Failure
Expected equality of these values:
  comparison
    Which is: -65
  0
entire fp16 block is converted correctly
[  FAILED  ] FP16.convertFromFp32VectorIntrinsics (8 ms)
[----------] 2 tests from FP16 (10 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (14 ms total)
[  PASSED  ] 1 test.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] FP16.convertFromFp32VectorIntrinsics

 1 FAILED TEST

These results imply:

  • The hand-vectorized quantize_row_q8_0 test passes on harts with VLEN=256 but fails when executed on harts with VLEN=128. Further tracing suggests that quantize_row_q8_0 only processes the first 16 floats, not the 32 floats that should be processed in each block.
  • The gcc autovectorized quantize_row_q8_0_reference passes on both types of harts.

Now we need to import the riscv-64 unitTests program into Ghidra and examine the compiled differences between quantize_row_q8_0 and quantize_row_q8_0_reference.

Note: Remember that our real integration test goal is to look for new problems or regressions in Ghidra’s decompiler presentation of functions like these, and then to look for ways to improve that presentation.

Original Source Code

The goal of both quantize_row_q8_0* routines is a lossy compression of 32 bit floats into blocks of 8 bit scaled values. The routines should return identical results, with quantize_row_q8_0 invoked on architectures with vector acceleration and quantize_row_q8_0_reference for all other architectures.

static const int QK8_0 = 32;
// reference implementation for deterministic creation of model files
void quantize_row_q8_0_reference(const float * restrict x, block_q8_0 * restrict y, int k) {
    assert(k % QK8_0 == 0);
    const int nb = k / QK8_0;

    for (int i = 0; i < nb; i++) {
        float amax = 0.0f; // absolute max

        for (int j = 0; j < QK8_0; j++) {
            const float v = x[i*QK8_0 + j];
            amax = MAX(amax, fabsf(v));
        }

        const float d = amax / ((1 << 7) - 1);
        const float id = d ? 1.0f/d : 0.0f;

        y[i].d = GGML_FP32_TO_FP16(d);

        for (int j = 0; j < QK8_0; ++j) {
            const float x0 = x[i*QK8_0 + j]*id;

            y[i].qs[j] = roundf(x0);
        }
    }
}
void quantize_row_q8_0(const float * restrict x, void * restrict vy, int k) {
    assert(QK8_0 == 32);
    assert(k % QK8_0 == 0);
    const int nb = k / QK8_0;

    block_q8_0 * restrict y = vy;

#if defined(__ARM_NEON)
...
#elif defined(__wasm_simd128__)
...
#elif defined(__AVX2__) || defined(__AVX__)
...
#elif defined(__riscv_v_intrinsic)

    size_t vl = __riscv_vsetvl_e32m4(QK8_0);

    for (int i = 0; i < nb; i++) {
        // load elements
        vfloat32m4_t v_x   = __riscv_vle32_v_f32m4(x+i*QK8_0, vl);

        vfloat32m4_t vfabs = __riscv_vfabs_v_f32m4(v_x, vl);
        vfloat32m1_t tmp   = __riscv_vfmv_v_f_f32m1(0.0f, vl);
        vfloat32m1_t vmax  = __riscv_vfredmax_vs_f32m4_f32m1(vfabs, tmp, vl);
        float amax = __riscv_vfmv_f_s_f32m1_f32(vmax);

        const float d = amax / ((1 << 7) - 1);
        const float id = d ? 1.0f/d : 0.0f;

        y[i].d = GGML_FP32_TO_FP16(d);
        vfloat32m4_t x0 = __riscv_vfmul_vf_f32m4(v_x, id, vl);

        // convert to integer
        vint16m2_t   vi = __riscv_vfncvt_x_f_w_i16m2(x0, vl);
        vint8m1_t    vs = __riscv_vncvt_x_x_w_i8m1(vi, vl);

        // store result
        __riscv_vse8_v_i8m1(y[i].qs , vs, vl);
    }
#else
    GGML_UNUSED(nb);
    // scalar
    quantize_row_q8_0_reference(x, y, k);
#endif
}

The reference version has an outer loop iterating over blocks of 32 scalar 32 bit floats, with two inner loops operating on the floats within each block. The first inner loop accumulates the maximum of absolute values within the block to generate a scale factor, while the second inner loop applies that scale factor to each 32 bit float in the block and then converts the scaled value to an 8 bit integer. Each output block is then a 16 bit floating point scale factor plus 32 8 bit scaled integers.

The code includes some distractions that complicate Ghidra analysis:

  • The k input parameter is signed, making the integer division by 32 more complicated than it needs to be.
  • The GGML_FP32_TO_FP16(d) conversion might be a single instruction on some architectures, but it requires branch evaluation on our RISCV-64 target architecture. GCC may elect to duplicate code in order to minimize the number of branches needed.

The hand-optimized quantize_row_q8_0 has similar distractions, plus a few more:

  • The two inner loops have been converted into RISCV vector intrinsics, such that each iteration processes 32 4 byte floats into a single 34 byte block_q8_0 struct.
  • Four adjacent vector registers are grouped with the m4 setting, so each vector register group holds VLEN×4 bits. On architectures with a vector length VLEN=256 that is 1024 bits, exactly the 32 4 byte floats per block, which can all be processed in parallel. If the architecture only supports VLEN=128, the group holds 512 bits, so only half of each block (16 floats) is processed in each iteration. That accounts for the unit test failure.
  • The code uses standard riscv_intrinsics - of which there are nearly 40,000 variants. The root of each intrinsic name is generally a single vector instruction, extended with information on the expected vector context (from vset* instructions) and the expected return type of the result. There is no C header file providing signatures for all possible variants, so there is nothing Ghidra can import and use in the decompiler view.
  • The __riscv_vle32_v_f32m4 intrinsic is likely the slowest of the set, as this 32 bit instruction will require a 128 byte memory read, stalling the instruction pipeline for some number of cycles.

Ghidra inspection

inspecting the hand-vectorized quantizer

Load unitTest into Ghidra and inspect quantize_row_q8_0. We know the correct signature so we can override what Ghidra has inferred, then name the parameters so that they look more like the source code.

void quantize_row_q8_0(float *x,block_q0_0 *y,long k)

{
  float fVar1;
  int iVar2;
  char *pcVar3;
  undefined8 uVar4;
  uint uVar5;
  int iVar6;
  ulong uVar7;
  undefined8 uVar8;
  undefined in_v1 [256];
  undefined auVar9 [256];
  undefined auVar10 [256];
  gp = &__global_pointer$;
  if (k < 0x20) {
    return;
  }
  uVar4 = vsetvli_e8m1tama(0x20);
  vsetvli_e32m1tama(uVar4);
  uVar5 = 0x106c50;
  iVar2 = (int)(((uint)((int)k >> 0x1f) >> 0x1b) + (int)k) >> 5;
  iVar6 = 0;
  vmv_v_i(in_v1,0);
  pcVar3 = y->qs;
  do {
    while( true ) {
      vsetvli_e32m4tama(uVar4);
      auVar9 = vle32_v(x);
      auVar10 = vfsgnjx_vv(auVar9,auVar9);
      auVar10 = vfredmax_vs(auVar10,in_v1);
      uVar8 = vfmv_fs(auVar10);
      fVar1 = (float)uVar8 * 0.007874016;
      uVar7 = (ulong)(uint)fVar1;
      if ((fVar1 == 0.0) || (uVar7 = (ulong)(uint)(127.0 / (float)uVar8), uVar5 << 1 < 0xff000001))
      break;
      auVar9 = vfmul_vf(auVar9,uVar7);
      vsetvli_e16m2tama(0);
      ((block_q0_0 *)(pcVar3 + -2))->d = 0x7e00;
      auVar9 = vfncvt_xfw(auVar9);
      iVar6 = iVar6 + 1;
      vsetvli_e8m1tama(0);
      auVar9 = vncvt_xxw(auVar9);
      vse8_v(auVar9,pcVar3);
      x = x + 0x20;
      pcVar3 = pcVar3 + 0x22;
      if (iVar2 <= iVar6) {
        return;
      }
    }
    iVar6 = iVar6 + 1;
    auVar9 = vfmul_vf(auVar9,uVar7);
    vsetvli_e16m2tama(0);
    auVar9 = vfncvt_xfw(auVar9);
    vsetvli_e8m1tama(0);
    auVar9 = vncvt_xxw(auVar9);
    vse8_v(auVar9,pcVar3);
    x = x + 0x20;
    uVar5 = uVar5 & 0xfff;
    ((block_q0_0 *)(pcVar3 + -2))->d = (short)uVar5;
    pcVar3 = pcVar3 + 0x22;
  } while (iVar6 < iVar2);
  return;
}

Note: an earlier run showed several pcode errors in riscv-rvv.sinc, which have been fixed as of this run.

Red herrings - none of these have anything to do with RISCV or vector intrinsics

  • uVar5 = 0x106c50; - there is no uVar5 variable, just a shared upper immediate load register.
  • iVar2 = (int)(((uint)((int)k >> 0x1f) >> 0x1b) + (int)k) >> 5; - since k is a signed long and not unsigned, the compiler has to implement the divide by 32 with rounding adjustments for negative numbers.
  • fVar1 = (float)uVar8 * 0.007874016; - the compiler changed a division by 127.0 into a multiplication by 0.007874016.
  • ((block_q0_0 *)(pcVar3 + -2))->d - the compiler has set pcVar3 to point to an element within the block, so it uses negative offsets to address preceding elements.
  • duplicate code blocks - the conversion from a 32 bit float to the 16 bit float involves some branches. The compiler has decided that duplicating the code following at least one branch will be faster.
  • Decompiler handling of fmv.x.w instructions looks odd. fmv.x.w moves the single-precision value in floating-point register rs1, represented in IEEE 754-2008 encoding, to the lower 32 bits of integer register rd. This works fine when the source is zero, but it has no clear C-like representation otherwise. These may be better represented by specialized pcode operations.

There is one discrepancy that does involve the vectorization code. The source code uses a standard RISCV vector intrinsic function to store data:

__riscv_vse8_v_i8m1(y[i].qs, vs, vl);

Ghidra pcode for this instruction after renaming operands is (currently):

vse8_v(vs, y[i].qs);

The order of the first two parameters is swapped. We should probably align the pcode with the standard intrinsic signature as much as possible to avoid deviations. Those intrinsics have context and type information encoded into their names, information Ghidra does not currently have, so we can't match them exactly.

inspecting the auto-vectorized quantizer

Load unitTest into Ghidra and inspect quantize_row_q8_0_reference. We know the correct signature so we can override what Ghidra has inferred, then name the parameters so that they look more like the source code.

void quantize_row_q8_0_reference(float *x,block_q0_0 *y,long k)

{
  float fVar1;
  long lVar2;
  long lVar3;
  char *pcVar4;
  ushort uVar5;
  int iVar6;
  ulong uVar7;
  undefined8 uVar8;
  undefined auVar9 [256];
  undefined auVar10 [256];
  undefined auVar11 [256];
  undefined auVar12 [256];
  undefined auVar13 [256];
  undefined auVar14 [256];
  undefined in_v7 [256];
  undefined auVar15 [256];
  undefined auVar16 [256];
  undefined auVar17 [256];
  undefined auVar18 [256];
  undefined auVar19 [256];

  gp = &__global_pointer$;
  if (k < 0x20) {
    return;
  }
  vsetivli_e32m1tama(4);
  pcVar4 = y->qs;
  lVar2 = 0;
  iVar6 = 0;
  auVar15 = vfmv_sf(0xff800000);
  vmv_v_i(in_v7,0);
  do {
    lVar3 = (long)x + lVar2;
    auVar14 = vle32_v(lVar3);
    auVar13 = vle32_v(lVar3 + 0x10);
    auVar10 = vfsgnjx_vv(auVar14,auVar14);
    auVar9 = vfsgnjx_vv(auVar13,auVar13);
    auVar10 = vfmax_vv(auVar10,in_v7);
    auVar9 = vfmax_vv(auVar9,auVar10);
    auVar12 = vle32_v(lVar3 + 0x20);
    auVar11 = vle32_v(lVar3 + 0x30);
    auVar10 = vfsgnjx_vv(auVar12,auVar12);
    auVar10 = vfmax_vv(auVar10,auVar9);
    auVar9 = vfsgnjx_vv(auVar11,auVar11);
    auVar9 = vfmax_vv(auVar9,auVar10);
    auVar10 = vle32_v(lVar3 + 0x40);
    auVar18 = vle32_v(lVar3 + 0x50);
    auVar16 = vfsgnjx_vv(auVar10,auVar10);
    auVar16 = vfmax_vv(auVar16,auVar9);
    auVar9 = vfsgnjx_vv(auVar18,auVar18);
    auVar9 = vfmax_vv(auVar9,auVar16);
    auVar17 = vle32_v(lVar3 + 0x60);
    auVar16 = vle32_v(lVar3 + 0x70);
    auVar19 = vfsgnjx_vv(auVar17,auVar17);
    auVar19 = vfmax_vv(auVar19,auVar9);
    auVar9 = vfsgnjx_vv(auVar16,auVar16);
    auVar9 = vfmax_vv(auVar9,auVar19);
    auVar9 = vfredmax_vs(auVar9,auVar15);
    uVar8 = vfmv_fs(auVar9);
    fVar1 = (float)uVar8 * 0.007874016;
    uVar7 = (ulong)(uint)fVar1;
    if (fVar1 == 0.0) {
LAB_00076992:
      uVar5 = ((ushort)lVar3 & 0xfff) + ((ushort)((uint)lVar3 >> 0xd) & 0x7c00);
    }
    else {
      uVar7 = (ulong)(uint)(127.0 / (float)uVar8);
      uVar5 = 0x7e00;
      if ((uint)lVar3 << 1 < 0xff000001) goto LAB_00076992;
    }
    auVar9 = vfmv_vf(uVar7);
    auVar14 = vfmul_vv(auVar14,auVar9);
    auVar14 = vfcvt_xfv(auVar14);
    auVar13 = vfmul_vv(auVar13,auVar9);
    auVar13 = vfcvt_xfv(auVar13);
    auVar12 = vfmul_vv(auVar12,auVar9);
    auVar12 = vfcvt_xfv(auVar12);
    auVar11 = vfmul_vv(auVar9,auVar11);
    auVar11 = vfcvt_xfv(auVar11);
    vsetvli_e16mf2tama(0);
    auVar14 = vncvt_xxw(auVar14);
    vsetvli_e8mf4tama(0);
    auVar14 = vncvt_xxw(auVar14);
    vse8_v(auVar14,pcVar4);
    vsetvli_e32m1tama(0);
    auVar14 = vfmul_vv(auVar9,auVar10);
    vsetvli_e16mf2tama(0);
    ((block_q0_0 *)(pcVar4 + -2))->d = (ushort)((ulong)lVar3 >> 0x10) & 0x8000 | uVar5;
    auVar10 = vncvt_xxw(auVar13);
    vsetvli_e8mf4tama(0);
    auVar10 = vncvt_xxw(auVar10);
    vse8_v(auVar10,pcVar4 + 4);
    vsetvli_e32m1tama(0);
    auVar13 = vfcvt_xfv(auVar14);
    vsetvli_e16mf2tama(0);
    auVar10 = vncvt_xxw(auVar12);
    vsetvli_e8mf4tama(0);
    auVar10 = vncvt_xxw(auVar10);
    vse8_v(auVar10,pcVar4 + 8);
    vsetvli_e32m1tama(0);
    auVar12 = vfmul_vv(auVar9,auVar18);
    vsetvli_e16mf2tama(0);
    auVar10 = vncvt_xxw(auVar11);
    vsetvli_e8mf4tama(0);
    auVar10 = vncvt_xxw(auVar10);
    vse8_v(auVar10,pcVar4 + 0xc);
    vsetvli_e32m1tama(0);
    auVar11 = vfcvt_xfv(auVar12);
    vsetvli_e16mf2tama(0);
    auVar10 = vncvt_xxw(auVar13);
    vsetvli_e8mf4tama(0);
    auVar10 = vncvt_xxw(auVar10);
    vse8_v(auVar10,pcVar4 + 0x10);
    vsetvli_e32m1tama(0);
    auVar10 = vfmul_vv(auVar9,auVar17);
    vsetvli_e16mf2tama(0);
    auVar11 = vncvt_xxw(auVar11);
    vsetvli_e8mf4tama(0);
    auVar11 = vncvt_xxw(auVar11);
    vse8_v(auVar11,pcVar4 + 0x14);
    vsetvli_e32m1tama(0);
    auVar10 = vfcvt_xfv(auVar10);
    vsetvli_e16mf2tama(0);
    auVar10 = vncvt_xxw(auVar10);
    vsetvli_e8mf4tama(0);
    auVar10 = vncvt_xxw(auVar10);
    vse8_v(auVar10,pcVar4 + 0x18);
    vsetvli_e32m1tama(0);
    auVar9 = vfmul_vv(auVar16,auVar9);
    auVar9 = vfcvt_xfv(auVar9);
    vsetvli_e16mf2tama(0);
    iVar6 = iVar6 + 1;
    auVar9 = vncvt_xxw(auVar9);
    vsetvli_e8mf4tama(0);
    auVar9 = vncvt_xxw(auVar9);
    vse8_v(auVar9,pcVar4 + 0x1c);
    lVar2 = lVar2 + 0x80;
    pcVar4 = pcVar4 + 0x22;
    if ((int)(((uint)((int)k >> 0x1f) >> 0x1b) + (int)k) >> 5 <= iVar6) {
      return;
    }
    vsetvli_e32m1tama(0);
  } while( true );
}

Some of the previous red herrings show up here too. Things to note:

  • undefined auVar19 [256]; - something in riscv-rvv.sinc is claiming vector registers are 256 bits long - that’s not generally true, so hunt down the confusion.
    • riscv.reg.sinc is the root of this, with @define VLEN "256" and define register offset=0x4000 size=$(VLEN) [ v0 ...]. What should Ghidra believe the size of vector registers to be? More generally, should the size and element type of vector registers be mutable?
  • the autovectorizer has correctly decided VLEN=128 architectures must be supported, and has dedicated 8 vector registers to hold all 32 floats required per loop iteration. Unlike the hand-optimized solution, the 8 vector registers are handled by 8 interleaved sequences of vector instructions. This roughly doubles the instruction count, but provides good distribution of load and store memory operations across the loop, likely minimizing execution stalls.

RISCV vector instruction execution engines - and autovectorization passes in gcc - are both so immature that we have no idea which implementation performs better. At best we can guess that autovectorization will be good enough to make hand optimized coding with riscv intrinsic functions rarely needed.

Vectorized function analysis without source code

Now try using Ghidra to inspect a function that dominates execution time in the whisper.cpp demo - ggml_vec_dot_f16. We’ll do this without first checking the source code. We’ll make a few reasonable assumptions:

  • this is likely a vector dot product
  • the vector elements are 16 bit floating point values of the type we’ve seen already.

A quick inspection lets us rewrite the function signature as:

void ggml_vec_dot_f16(long n,float *sum,fp16 *x,fp16 *y) {...}

That quick inspection also shows a glaring error - the pcode semantics for vluxei64.v has left out a critical parameter. It’s present in the listing view but missing in the pcode semantics view. Fix this and move on.

After tinkering with variable names and signatures, we get:

void ggml_vec_dot_q8_0_q8_0(long n,float *sum,block_q8_0 *x,block_q8_0 *y)

{
  block_q8_0 *pbVar1;
  int iVar2;
  char *px_qs;
  char *py_qs;
  undefined8 uVar3;
  undefined8 uVar4;
  float partial_sum;
  undefined auVar5 [256];
  undefined auVar6 [256];
  undefined in_v5 [256];

  gp = &__global_pointer$;
  partial_sum = 0.0;
  uVar4 = vsetvli_e8m1tama(0x20);
  if (0x1f < n) {
    px_qs = x->qs;
    py_qs = y->qs;
    iVar2 = 0;
    vsetvli_e32m1tama(uVar4);
    vmv_v_i(in_v5,0);
    do {
      pbVar1 = (block_q8_0 *)(px_qs + -2);
      vsetvli_e8m1tama(uVar4);
      auVar6 = vle8_v(px_qs);
      auVar5 = vle8_v(py_qs);
      auVar5 = vwmul_vv(auVar6,auVar5);
      vsetvli_e16m2tama(0);
      auVar5 = vwredsum_vs(auVar5,in_v5);
      vsetivli_e32m1tama(0);
      uVar3 = vmv_x_s(auVar5);
      iVar2 = iVar2 + 1;
      px_qs = px_qs + 0x22;
      partial_sum = (float)(int)uVar3 *
                    (float)(&ggml_table_f32_f16)[((block_q8_0 *)(py_qs + -2))->field0_0x0] *
                    (float)(&ggml_table_f32_f16)[pbVar1->field0_0x0] + partial_sum;
      py_qs = py_qs + 0x22;
    } while (iVar2 < (int)(((uint)((int)n >> 0x1f) >> 0x1b) + (int)n) >> 5);
  }
  *sum = partial_sum;
  return;
}

That’s fairly clear - the two vectors are presented as arrays of block_q8_0 structs, each with 32 entries and a scale factor d. An earlier run showed another error, now fixed, with the pcode for vmv_x_s.
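
Translating that decompiled loop back into scalar form gives a sketch like the one below. It reuses the block_q8_0 layout sketched earlier, and fp16_to_fp32() is a hypothetical stand-in for the ggml_table_f32_f16 lookup the binary performs; the actual llama.cpp source will differ in detail:

#include <stdint.h>

// Scalar reconstruction of the decompiled loop above (a sketch, not the
// original source). Each block contributes dot(x.qs, y.qs) * x.d * y.d.
float fp16_to_fp32(uint16_t h);   // stand-in for the ggml_table_f32_f16 lookup

void vec_dot_q8_0_q8_0_scalar(long n, float *sum,
                              const block_q8_0 *x, const block_q8_0 *y) {
    float s = 0.0f;
    for (long i = 0; i < n / 32; i++) {      // one iteration per 32-element block
        int dot = 0;
        for (int j = 0; j < 32; j++)
            dot += (int)x[i].qs[j] * (int)y[i].qs[j];
        // scale the integer dot product by both blocks' fp16 scale factors
        s += (float)dot * fp16_to_fp32(x[i].d) * fp16_to_fp32(y[i].d);
    }
    *sum = s;
}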

4 - Pcode testing

Ghidra testing of semantic pcode.

Note: paths and names are likely to change here. Use these notes just as a guide.

The Ghidra 11 isa_ext branch makes heavy use of user-defined pcode (aka Sleigh semantics). Much of that pcode is arbitrarily defined, adding more confusion to an already complex field. Can we build a testing framework to highlight problem areas in pcode semantics?

For example, let’s look at Ghidra’s decompiler rendering of two RISCV-64 vector instructions vmv.s.x and vmv.x.s. These instructions move a single element between an integer scalar register and the first element of a vector register. The RISCV vector definition says:

  • The vmv.x.s instruction copies a single SEW-wide element from index 0 of the source vector register to a destination integer register.
  • The vmv.s.x instruction copies the scalar integer register to element 0 of the destination vector register.

These instructions have a lot of symmetry, but the current isa_ext branch doesn’t render them symmetrically. Let’s build a sample function that uses both instructions followed by an assertion of what we expect to see.

bool test_integer_scalar_vector_move() {
    ///@ exercise integer scalar moves into and out of a vector register
    int x = 1;
    int y = 0;
    // set vector mode to something simple
    __asm__ __volatile__ ("vsetivli zero,1,e32,m1,ta,ma\n\t");
    // execute both instructions to set y:= x
    __asm__ __volatile__ ("vmv.s.x  v1, %1\n\t" "vmv.x.s  %0, v1"\
                          : "=r" (y) \
                          : "r" (x) );
    return x==y;
}

This function should return true. It’s defined in the file failing_tests/pcodeSamples.cpp and compiled into the library libsamples.so. The function is executed within the test harness failing_tests/pcodeTests.cpp, sketched below.
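
The relevant part of that harness presumably looks something like this, inferred from the VectorMove.vmv_s_x name in the run log rather than copied from the actual file:

#include <gtest/gtest.h>

// declared in failing_tests/pcodeSamples.cpp, compiled into libsamples.so
bool test_integer_scalar_vector_move();

// one test per pcode sample; the name matches the run log below
TEST(VectorMove, vmv_s_x) {
    EXPECT_TRUE(test_integer_scalar_vector_move());
}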

Build (with O3 optimization) and execute the test harness with:

$ bazel clean
INFO: Starting clean (this may take a while). Consider using --async if the clean takes more than several minutes.
$ bazel build -s -c opt  --platforms=//platforms:riscv_vector failing_tests:samples
$ cp -f bazel-bin/failing_tests/libsamples.so /tmp
$ bazel build -s -c opt  --platforms=//platforms:riscv_vector failing_tests:pcodeTests
$ export QEMU_CPU=rv64,zba=true,zbb=true,v=true,vlen=128,vext_spec=v1.0,rvv_ta_all_1s=true,rvv_ma_all_1s=true
$ qemu-riscv64-static -L /opt/riscvx -E LD_LIBRARY_PATH=/opt/riscvx/riscv64-unknown-linux-gnu/lib/ bazel-bin/failing_tests/pcodeTests
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from VectorMove
[ RUN      ] VectorMove.vmv_s_x
[       OK ] VectorMove.vmv_s_x (0 ms)
[----------] 1 test from VectorMove (0 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (3 ms total)
[  PASSED  ] 1 test.

Now import /tmp/libsamples.so into Ghidra and look for test* functions. The decompilation is:

bool test_integer_scalar_vector_move(void)
{
  undefined8 uVar1;
  undefined in_v1 [256];
  vsetivli_e32m1tama(1);
  vmv_s_x(in_v1,1);
  uVar1 = vmv_x_s(in_v1);
  return (int)uVar1 == 1;
}

This shows two issues:

  • Ghidra believes vector register v1 is 256 bits long, when in fact the vector length is unknown at compile and link time. That’s probably OK for now, as it provides a hint that this is a vector register.
  • The treatment of instruction output is inconsistent. For vmv_s_x, the output is the first element of in_v1. For vmv_x_s, the output is the scalar register uVar1. That might be OK, since we don’t specify what happens to other elements of in_v1.

The general question raised by this example is how to treat pcode output - as an output parameter or as a returned result? The sleigh pcode documentation suggests that parameters are assumed to be inputs, with the only output being the one returned by the pcode operation. A quick glance at the ARM Neon and AARCH64 SVE vector sleigh files suggests that this is the convention, but perhaps not a requirement.

Let’s try adding some more test cases before taking any action.

5 - Tracking Convergence

We can track external events to plan future integration test effort.

This project gets more relevant when RISCV-64 processors start appearing in appliances with more instruction set extensions and with code compiled by newer compiler toolchains.

The project results get easier to integrate if and when more development effort is applied to specific Ghidra components.

This page collects external sites to track for convergence.

Toolchains and platforms

binutils

New instruction extensions often appear here as the first public implementation. Check out the opcodes, aliases, and disassembly patterns found in the test suite.

  • track the source
  • inspect git log include/opcode/|grep riscv|head
  • inspect git log gas/testsuite/gas/riscv
  • track updates to the list of supported extensions

sample log

compilers

  • track the source
  • inspect git log gcc/testsuite/gcc.target/riscv

log

Look for commits indicating the stability of vectorization or new compound loop types that now allow auto vectorization.

libraries

log

  • glibc
    • Not much specific to RISC-V
  • openssl (in master, not released as of openssl 3.2)

kernels

  • track the source
  • inspect git log arch/riscv

Note: the Linux kernel just added vector crypto support, derived from the openssl crypto routines. This appears to mostly be in support of encrypted file systems.

system images

  • Fedora
  • Ubuntu

cloud instances

RISCV International Wiki

The RISCV International wiki home page leads to:

ISA Extensions

  • profiles and individual standards-tracked extensions
  • vendor-specific extensions
  • gcc intrinsics

Applications

  • track source
  • Look for use of riscv intrinsics with arm/Neon and avx2 equivalents as opposed to allowing compiler autovectorization.
  • Watch for standardization of 16 bit floating point

Ghidra

similar vector instruction suites

Ghidra/Processors/AARCH64/data/languages/AARCH64sve.sinc defines the instructions used by the AARCH64 Scalable Vector Extensions package. This suite is similar to the RISCV vector suite in that it is vector register length agnostic. It was added in March of 2019 and not updated since.

pcode extensions

Ghidra/Features/Decompiler/src/decompile/cpp holds much of the existing Ghidra code for system and user defined pcodes. userop.h and userop.cc look relevant, with caheckman a common contributor.

Community

6 - Deep Dive Openssl

Openssl configuration for ISA Extensions provides a good example.

/home2/build_openssl$ ../vendor/openssl/Configure linux64-riscv64 --cross-compile-prefix=/opt/riscvx/bin/riscv64-unknown-linux-gnu- -march=rv64gcv_zkne_zknd_zknh_zvkng_zvksg
$ perl configdata.pm --dump

Command line (with current working directory = .):

    /usr/bin/perl ../vendor/openssl/Configure linux64-riscv64 --cross-compile-prefix=/opt/riscvx/bin/riscv64-unknown-linux-gnu- -march=rv64gcv_zkne_zknd_zknh_zvkng_zvksg

Perl information:

    /usr/bin/perl
    5.38.2 for x86_64-linux-thread-multi

Enabled features:

    afalgeng
    apps
    argon2
    aria
    asm
    async
    atexit
    autoalginit
    autoerrinit
    autoload-config
    bf
    blake2
    bulk
    cached-fetch
    camellia
    capieng
    cast
    chacha
    cmac
    cmp
    cms
    comp
    ct
    default-thread-pool
    deprecated
    des
    dgram
    dh
    docs
    dsa
    dso
    dtls
    dynamic-engine
    ec
    ec2m
    ecdh
    ecdsa
    ecx
    engine
    err
    filenames
    gost
    http
    idea
    legacy
    loadereng
    makedepend
    md4
    mdc2
    module
    multiblock
    nextprotoneg
    ocb
    ocsp
    padlockeng
    pic
    pinshared
    poly1305
    posix-io
    psk
    quic
    unstable-qlog
    rc2
    rc4
    rdrand
    rfc3779
    rmd160
    scrypt
    secure-memory
    seed
    shared
    siphash
    siv
    sm2
    sm2-precomp
    sm3
    sm4
    sock
    srp
    srtp
    sse2
    ssl
    ssl-trace
    static-engine
    stdio
    tests
    thread-pool
    threads
    tls
    ts
    ui-console
    whirlpool
    tls1
    tls1-method
    tls1_1
    tls1_1-method
    tls1_2
    tls1_2-method
    tls1_3
    dtls1
    dtls1-method
    dtls1_2
    dtls1_2-method

Disabled features:

    acvp-tests          [cascade]        OPENSSL_NO_ACVP_TESTS
    asan                [default]        OPENSSL_NO_ASAN
    brotli              [default]        OPENSSL_NO_BROTLI
    brotli-dynamic      [default]        OPENSSL_NO_BROTLI_DYNAMIC
    buildtest-c++       [default]        
    winstore            [not-windows]    OPENSSL_NO_WINSTORE
    crypto-mdebug       [default]        OPENSSL_NO_CRYPTO_MDEBUG
    devcryptoeng        [default]        OPENSSL_NO_DEVCRYPTOENG
    ec_nistp_64_gcc_128 [default]        OPENSSL_NO_EC_NISTP_64_GCC_128
    egd                 [default]        OPENSSL_NO_EGD
    external-tests      [default]        OPENSSL_NO_EXTERNAL_TESTS
    fips                [default]        
    fips-securitychecks [cascade]        OPENSSL_NO_FIPS_SECURITYCHECKS
    fuzz-afl            [default]        OPENSSL_NO_FUZZ_AFL
    fuzz-libfuzzer      [default]        OPENSSL_NO_FUZZ_LIBFUZZER
    ktls                [default]        OPENSSL_NO_KTLS
    md2                 [default]        OPENSSL_NO_MD2 (skip crypto/md2)
    msan                [default]        OPENSSL_NO_MSAN
    rc5                 [default]        OPENSSL_NO_RC5 (skip crypto/rc5)
    sctp                [default]        OPENSSL_NO_SCTP
    tfo                 [default]        OPENSSL_NO_TFO
    trace               [default]        OPENSSL_NO_TRACE
    ubsan               [default]        OPENSSL_NO_UBSAN
    unit-test           [default]        OPENSSL_NO_UNIT_TEST
    uplink              [no uplink_arch] OPENSSL_NO_UPLINK
    weak-ssl-ciphers    [default]        OPENSSL_NO_WEAK_SSL_CIPHERS
    zlib                [default]        OPENSSL_NO_ZLIB
    zlib-dynamic        [default]        OPENSSL_NO_ZLIB_DYNAMIC
    zstd                [default]        OPENSSL_NO_ZSTD
    zstd-dynamic        [default]        OPENSSL_NO_ZSTD_DYNAMIC
    ssl3                [default]        OPENSSL_NO_SSL3
    ssl3-method         [default]        OPENSSL_NO_SSL3_METHOD

Config target attributes:

    AR => "ar",
    ARFLAGS => "qc",
    CC => "gcc",
    CFLAGS => "-Wall -O3",
    CXX => "g++",
    CXXFLAGS => "-Wall -O3",
    HASHBANGPERL => "/usr/bin/env perl",
    RANLIB => "ranlib",
    RC => "windres",
    asm_arch => "riscv64",
    bn_ops => "SIXTY_FOUR_BIT_LONG RC4_CHAR",
    build_file => "Makefile",
    build_scheme => [ "unified", "unix" ],
    cflags => "-pthread",
    cppflags => "",
    cxxflags => "-std=c++11 -pthread",
    defines => [ "OPENSSL_BUILDING_OPENSSL" ],
    disable => [  ],
    dso_ldflags => "-Wl,-z,defs",
    dso_scheme => "dlfcn",
    enable => [ "afalgeng" ],
    ex_libs => "-ldl -pthread",
    includes => [  ],
    lflags => "",
    lib_cflags => "",
    lib_cppflags => "-DOPENSSL_USE_NODELETE",
    lib_defines => [  ],
    module_cflags => "-fPIC",
    module_cxxflags => undef,
    module_ldflags => "-Wl,-znodelete -shared -Wl,-Bsymbolic",
    perl_platform => "Unix",
    perlasm_scheme => "linux64",
    shared_cflag => "-fPIC",
    shared_defflag => "-Wl,--version-script=",
    shared_defines => [  ],
    shared_ldflag => "-Wl,-znodelete -shared -Wl,-Bsymbolic",
    shared_rcflag => "",
    shared_sonameflag => "-Wl,-soname=",
    shared_target => "linux-shared",
    thread_defines => [  ],
    thread_scheme => "pthreads",
    unistd => "<unistd.h>",

Recorded environment:

    AR = 
    BUILDFILE = 
    CC = 
    CFLAGS = 
    CPPFLAGS = 
    CROSS_COMPILE = 
    CXX = 
    CXXFLAGS = 
    HASHBANGPERL = 
    LDFLAGS = 
    LDLIBS = 
    OPENSSL_LOCAL_CONFIG_DIR = 
    PERL = 
    RANLIB = 
    RC = 
    RCFLAGS = 
    WINDRES = 
    __CNF_CFLAGS = 
    __CNF_CPPDEFINES = 
    __CNF_CPPFLAGS = 
    __CNF_CPPINCLUDES = 
    __CNF_CXXFLAGS = 
    __CNF_LDFLAGS = 
    __CNF_LDLIBS = 

Makevars:

    AR              = /opt/riscvx/bin/riscv64-unknown-linux-gnu-ar
    ARFLAGS         = qc
    ASFLAGS         = 
    CC              = /opt/riscvx/bin/riscv64-unknown-linux-gnu-gcc
    CFLAGS          = -Wall -O3 -march=rv64gcv_zkne_zknd_zknh_zvkng_zvksg
    CPPDEFINES      = 
    CPPFLAGS        = 
    CPPINCLUDES     = 
    CROSS_COMPILE   = /opt/riscvx/bin/riscv64-unknown-linux-gnu-
    CXX             = /opt/riscvx/bin/riscv64-unknown-linux-gnu-g++
    CXXFLAGS        = -Wall -O3 -march=rv64gcv_zkne_zknd_zknh_zvkng_zvksg
    HASHBANGPERL    = /usr/bin/env perl
    LDFLAGS         = 
    LDLIBS          = 
    PERL            = /usr/bin/perl
    RANLIB          = /opt/riscvx/bin/riscv64-unknown-linux-gnu-ranlib
    RC              = /opt/riscvx/bin/riscv64-unknown-linux-gnu-windres
    RCFLAGS         = 

NOTE: These variables only represent the configuration view.  The build file
template may have processed these variables further, please have a look at the
build file for more exact data:
    Makefile

build file:

    Makefile

build file templates:

    ../vendor/openssl/Configurations/common0.tmpl
    ../vendor/openssl/Configurations/unix-Makefile.tmpl
$ make
...
opt/riscvx/lib/gcc/riscv64-unknown-linux-gnu/14.0.1/../../../../riscv64-unknown-linux-gnu/bin/ld: cannot find -ldl: No such file or directory

The error is in the linking phase, since we did not provide the correct sysroot and path information needed by the cross-compiling linker.

A quick check of the object files generated includes:

$  find . -name \*risc\*.o
./crypto/sm4/libcrypto-lib-sm4-riscv64-zvksed.o
./crypto/sm4/libcrypto-shlib-sm4-riscv64-zvksed.o
./crypto/aes/libcrypto-lib-aes-riscv64-zvkned.o
./crypto/aes/libcrypto-shlib-aes-riscv64-zvkned.o
./crypto/aes/libcrypto-shlib-aes-riscv64-zvbb-zvkg-zvkned.o
./crypto/aes/libcrypto-shlib-aes-riscv64-zkn.o
./crypto/aes/libcrypto-lib-aes-riscv64-zkn.o
./crypto/aes/libcrypto-shlib-aes-riscv64-zvkb-zvkned.o
./crypto/aes/libcrypto-shlib-aes-riscv64.o
./crypto/aes/libcrypto-lib-aes-riscv64-zvkb-zvkned.o
./crypto/aes/libcrypto-lib-aes-riscv64-zvbb-zvkg-zvkned.o
./crypto/aes/libcrypto-lib-aes-riscv64.o
./crypto/chacha/libcrypto-shlib-chacha_riscv.o
./crypto/chacha/libcrypto-lib-chacha_riscv.o
./crypto/chacha/libcrypto-lib-chacha-riscv64-zvkb.o
./crypto/chacha/libcrypto-shlib-chacha-riscv64-zvkb.o
./crypto/libcrypto-shlib-riscv64cpuid.o
./crypto/libcrypto-lib-riscv64cpuid.o
./crypto/sha/libcrypto-lib-sha_riscv.o
./crypto/sha/libcrypto-lib-sha256-riscv64-zvkb-zvknha_or_zvknhb.o
./crypto/sha/libcrypto-shlib-sha512-riscv64-zvkb-zvknhb.o
./crypto/sha/libcrypto-shlib-sha_riscv.o
./crypto/sha/libcrypto-shlib-sha256-riscv64-zvkb-zvknha_or_zvknhb.o
./crypto/sha/libcrypto-lib-sha512-riscv64-zvkb-zvknhb.o
./crypto/sm3/libcrypto-lib-sm3-riscv64-zvksh.o
./crypto/sm3/libcrypto-lib-sm3_riscv.o
./crypto/sm3/libcrypto-shlib-sm3-riscv64-zvksh.o
./crypto/sm3/libcrypto-shlib-sm3_riscv.o
./crypto/libcrypto-shlib-riscvcap.o
./crypto/modes/libcrypto-shlib-ghash-riscv64.o
./crypto/modes/libcrypto-shlib-ghash-riscv64-zvkg.o
./crypto/modes/libcrypto-lib-aes-gcm-riscv64-zvkb-zvkg-zvkned.o
./crypto/modes/libcrypto-lib-ghash-riscv64.o
./crypto/modes/libcrypto-shlib-ghash-riscv64-zvkb-zvbc.o
./crypto/modes/libcrypto-lib-ghash-riscv64-zvkg.o
./crypto/modes/libcrypto-shlib-aes-gcm-riscv64-zvkb-zvkg-zvkned.o
./crypto/modes/libcrypto-lib-ghash-riscv64-zvkb-zvbc.o
./crypto/libcrypto-lib-riscvcap.o

That suggests we need to cover more extensions:

  • zvbb
  • zvbc
  • zvkb
  • zvkg
  • zvkned

The openssl source code conditionally defines symbols like:

  • RISCV_HAS_V
  • RISCV_HAS_ZVBC
  • RISCV_HAS_ZVKB
  • RISCV_HAS_ZVKNHA
  • RISCV_HAS_ZVKNHB
  • RISCV_HAS_ZVKSH
  • RISCV_HAS_ZBKB
  • RISCV_HAS_ZBB
  • RISCV_HAS_ZBC
  • RISCV_HAS_ZKND
  • RISCV_HAS_ZKNE
  • RISCV_HAS_ZVKG - currently missing, or a union of zvkng and zvksg?
  • RISCV_HAS_ZVKNED - currently missing or a union of zvkned and zvksed?
  • RISCV_HAS_ZVKSED - currently missing, defined but unused?

These symbols are defined in crypto/riscvcap.c after analyzing the march string passed to the compiler.

So the next steps include:

  • Define LDFLAGS and LDLIBS to enable building a riscv-64 openssl.so.
  • add additional march elements to generate as many ISA extension exemplars as we can
  • iterate on Ghidra sinc files to define any missing instructions
  • extend riscv-64 assembly samples to include all riscv-64 ISA extensions appearing in openssl source
  • verify that we have acceptable pcode opcodes for all riscv-64 ISA extensions appearing in openssl source
Reconfigure with the additional march elements:

/home2/build_openssl$ ../vendor/openssl/Configure linux64-riscv64 --cross-compile-prefix=/opt/riscvx/bin/riscv64-unknown-linux-gnu- -march=rv64gcv_zkne_zknd_zknh_zvkng_zvbb_zvbc_zvkb_zvkg_zvkned_zvksg

Patch the generated Makefile to:

< CNF_EX_LIBS=-ldl -pthread
---
> CNF_EX_LIBS=/opt/riscvx/lib/libdl.a -pthread
$ make
  • open libcrypto.so.3 and libssl.so.3 in Ghidra.
  • analyze and open bookmarks
  • verify - in the Bookmarks window - that all instructions disassembled and no instructions lack pcode

Integration testing (manual)

Disassembly testing against binutils reference dumps can follow these steps:

  • Open libcrypto.so.3 in Ghidra
  • export as ASCII to /tmp/libcrypto.so.3.txt
  • export as C/C++ to /tmp/libcrypto.so.3.c
  • generate reference disassembly via
    • /opt/riscvx/bin/riscv64-unknown-linux-gnu-objdump -j .text -D libcrypto.so.3 > libcrypto.so.3_ref.txt
  • grep both /tmp/libcrypto.so.3.txt and libcrypto.so.3_ref.txt for vset instructions, comparing operands
  • optionally parse vector instructions out of both files and compare decodings (see the sketch after this list)
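
A minimal C++ sketch of that optional comparison, assuming both the Ghidra ASCII export and the objdump reference keep one instruction per line; the paths follow the steps above and the normalization is deliberately crude:

#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>

// Build a histogram of "mnemonic operands" strings for every vset* line in a dump.
static std::map<std::string, int> vset_histogram(const std::string &path) {
    std::map<std::string, int> hist;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        auto pos = line.find("vset");               // vsetvli, vsetivli, vsetvl
        if (pos == std::string::npos)
            continue;
        std::istringstream rest(line.substr(pos));
        std::string mnemonic, operands;
        rest >> mnemonic >> operands;               // e.g. "vsetvli" "zero,zero,e8,mf8,ta,ma"
        ++hist[mnemonic + " " + operands];
    }
    return hist;
}

int main() {
    auto ghidra = vset_histogram("/tmp/libcrypto.so.3.txt");
    auto ref    = vset_histogram("libcrypto.so.3_ref.txt");
    // report any decoding that appears a different number of times in the two dumps
    for (const auto &[decode, count] : ref) {
        auto it = ghidra.find(decode);
        int seen = (it == ghidra.end()) ? 0 : it->second;
        if (seen != count)
            std::cout << decode << ": objdump=" << count << " ghidra=" << seen << "\n";
    }
    return 0;
}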

inspect extension management

How does Openssl manage RISCV ISA extensions? We’ll use the gcm_ghash family of functions as examples.

  • At compile time any march=rv64gcv_z... arguments are processed by the Openssl configuration tool and turned into #ifdef variables. These can include combinations like RISCV_HAS_ZVKB_AND_ZVKSED. Multiple versions of key routines are compiled, each with different required extensions.
  • The compiler can also use any of the bit manipulation and vector extensions in local optimization.
  • At runtime the library queries the underlying system to see which extensions are supported. The function gcm_get_funcs returns the preferred set of implementing functions (a dispatch sketch follows this list). The gcm_ghash set can include:
    • gcm_ghash_4bit
    • gcm_ghash_rv64i_zvkb_zvbc
    • gcm_ghash_rv64i_zvkg
    • gcm_ghash_rv64i_zbc
    • gcm_ghash_rv64i_zbc__zbkb
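
A hedged sketch of that runtime dispatch for the gcm_ghash family follows. The riscv_has_*() helpers are hypothetical stand-ins for whatever capability queries crypto/riscvcap.c actually provides, and the gcm_ghash_* prototypes only mirror the names above with an assumed signature:

#include <stddef.h>
#include <stdint.h>

extern "C" {
// per-extension implementations, as listed above (signatures assumed)
void gcm_ghash_4bit(uint64_t Xi[2], const void *Htable, const uint8_t *inp, size_t len);
void gcm_ghash_rv64i_zbc(uint64_t Xi[2], const void *Htable, const uint8_t *inp, size_t len);
void gcm_ghash_rv64i_zvkb_zvbc(uint64_t Xi[2], const void *Htable, const uint8_t *inp, size_t len);
void gcm_ghash_rv64i_zvkg(uint64_t Xi[2], const void *Htable, const uint8_t *inp, size_t len);
// hypothetical capability queries built from the march analysis in riscvcap.c
bool riscv_has_zvkg();
bool riscv_has_zvkb();
bool riscv_has_zvbc();
bool riscv_has_zbc();
}

using gcm_ghash_fn = void (*)(uint64_t[2], const void *, const uint8_t *, size_t);

// Prefer the most capable GHASH implementation the running hart supports,
// falling back to the portable table-driven default.
gcm_ghash_fn pick_gcm_ghash() {
    if (riscv_has_zvkg())                     return gcm_ghash_rv64i_zvkg;
    if (riscv_has_zvkb() && riscv_has_zvbc()) return gcm_ghash_rv64i_zvkb_zvbc;
    if (riscv_has_zbc())                      return gcm_ghash_rv64i_zbc;
    return gcm_ghash_4bit;
}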

The gcm_ghash_4bit is the default version with 412 instructions, of which 11 are vector instructions inserted by the compiler.

The gcm_ghash_rv64i_zvkg is the most advanced version with only 32 instructions. Ghidra decompiles this as:

void gcm_ghash_rv64i_zvkg(undefined8 param_1,undefined8 param_2,long param_3,long param_4)
{
  undefined auVar1 [256];
  undefined auVar2 [256];
  vsetivli_e32m1tumu(4);
  auVar1 = vle32_v(param_2);
  vle32_v(param_1);
  do {
    auVar2 = vle32_v(param_3);
    param_3 = param_3 + 0x10;
    param_4 = param_4 + -0x10;
    auVar2 = vghsh_vv(auVar1,auVar2);
  } while (param_4 != 0);
  vse32_v(auVar2,param_1);
  return;
}

That shows an error in our sinc files - several instructions use the vd register as both an input and an output, so our pcode semantics need updating. Do this and inspect the Ghidra output again:

void gcm_ghash_rv64i_zvkg(undefined8 param_1,undefined8 param_2,long param_3,long param_4)
{
  undefined auVar1 [256];
  undefined auVar2 [256];
  undefined auVar3 [256];
  vsetivli_e32m1tumu(4);
  auVar2 = vle32_v(param_2);
  auVar1 = vle32_v(param_1);
  do {
    auVar3 = vle32_v(param_3);
    param_3 = param_3 + 0x10;
    param_4 = param_4 + -0x10;
    auVar1 = vghsh_vv(auVar1,auVar2,auVar3);
  } while (param_4 != 0);
  vse32_v(auVar1,param_1);
  return;
}

That’s better - commit and push.

7 - Building a gcc-14 toolchain

Building a new toolchain can be messy.

A C or C++ toolchain needs at least these components:

  • kernel - to supply key header files and loader dependencies
  • binutils - to supply assembler and linker
  • gcc - to supply the compiler and compiler dependencies
  • glibc - to supply key libraries and header files
  • sysroot - a directory containing the libraries and resources expected for the root of the target system

These components have cross-dependencies. A full gcc build needs libc.so from glibc. A full glibc build needs libgcc from gcc. There are different ways to handle these cross-dependencies, such as splitting the gcc build into two phases or prepopulating the build directories with ‘close-enough’ files from a previous build.

The sysroot component is the trickiest to handle, since gcc and glibc need to pull files from the sysroot as they update files within sysroot. You can generally start with a bootstrap sysroot, say from a previous toolchain, then update it with the latest binutils, gcc, and glibc.

Start with a released tarball for gcc and glibc. We’ll use the development tip of binutils for this pass.

Copy kernel header files into /opt/riscv/sysroot/usr/include/.

Configure and install binutils:

$ /home2/vendor/binutils-gdb/configure --prefix=/opt/riscv/sysroot --with-sysroot=/opt/riscv/sysroot --target=riscv64-unknown-linux-gnu
$ make -j4
$ make install

Configure and install minimal gcc:

$ /home2/vendor/gcc-14.1.0/configure --prefix=/opt/riscv --enable-languages=c,c++ --disable-multilib --target=riscv64-unknown-linux-gnu --with-sysroot=/opt/riscv/sysroot
$ make all-gcc
$ make install-gcc

Configure and install glibc

$ ../../vendor/glibc-2.39/configure --host=riscv64-unknown-linux-gnu --target=riscv64-unknown-linux-gnu --prefix=/opt/riscv --disable-werror --enable-shared --disable-multilib --with-headers=/opt/riscv/sysroot/usr/include
$ make install-bootstrap-headers=yes install_root=/opt/riscv/sysroot install-headers

Cleaning sysroot of bootstrap artifacts

How do we replace any older sysroot bootstrap files with their freshly built versions? The most common problems involve libgcc*, libc*, and crt* files. The bootstrap sysroot needs these files to exist. The toolchain build process should replace them, but it may not replace all instances of these files.

Let’s scrub the libgcc files, comparing the gcc directory in which they are built with the sysroot directories in which they will be saved.

$ B=/home2/build_riscv/gcc
$ S=/opt/riscv/sysroot
$ find $B $S -name libgcc_s.so -ls
 57940911      4 -rw-r--r--   1 ____     ____          132 May 10 12:28 /home2/build_riscv/gcc/gcc/libgcc_s.so
 57940908      4 -rw-r--r--   1 ____     ____          132 May 10 12:28 /home2/build_riscv/gcc/riscv64-unknown-linux-gnu/libgcc/libgcc_s.so
 14361792      4 -rw-r--r--   1 ____     ____          132 May 10 12:32 /opt/riscv/sysroot/riscv64-unknown-linux-gnu/lib/libgcc_s.so
 14351655      4 -rw-r--r--   1 ____     ____          132 May 10 08:52 /opt/riscv/sysroot/lib/libgcc_s.so
 $ diff /opt/riscv/sysroot/lib/libgcc_s.so /opt/riscv/sysroot/riscv64-unknown-linux-gnu/lib/libgcc_s.so
 $ cat /opt/riscv/sysroot/lib/libgcc_s.so
/* GNU ld script
   Use the shared library, but some functions are only in
   the static library.  */
GROUP ( libgcc_s.so.1 -lgcc )
$ 
  • /opt/riscv/sysroot/lib/libgcc_s.so is our bootstrap input
  • /home2/build_riscv/gcc/gcc/libgcc_s.so and /home2/build_riscv/gcc/riscv64-unknown-linux-gnu/libgcc/libgcc_s.so are the generated outputs
  • the bootstrap input is identical to the generated outputs
  • neither the input nor the outputs contain absolute paths

Now check libgcc_s.so.1 for staleness:

$ find $B $S -name libgcc_s.so.1 -ls
 57940910    700 -rw-r--r--   1 ____     ____       713128 May 10 12:28 /home2/build_riscv/gcc/gcc/libgcc_s.so.1
 57946454    700 -rwxr-xr-x   1 ____     ____       713128 May 10 12:28 /home2/build_riscv/gcc/riscv64-unknown-linux-gnu/libgcc/libgcc_s.so.1
 14361791    700 -rw-r--r--   1 ____     ____       713128 May 10 12:32 /opt/riscv/sysroot/riscv64-unknown-linux-gnu/lib/libgcc_s.so.1
 14351656    696 -rw-r--r--   1 ____     ____       708624 May 10 08:53 /opt/riscv/sysroot/lib/libgcc_s.so.1

That looks like a potential problem. The bootstrap file is older and smaller than the generated files. We need to fix that:

$ rm /opt/riscv/sysroot/lib/libgcc_s.so.1
$ ln /opt/riscv/sysroot/riscv64-unknown-linux-gnu/lib/libgcc_s.so.1 /opt/riscv/sysroot/lib/libgcc_s.so.1

Next check the crt* files:

$ find $B $S -name crt\*.o -ls
 57940817      8 -rw-r--r--   1 ____     ____         4248 May 10 12:28 /home2/build_riscv/gcc/gcc/crtbeginS.o
 57940826      4 -rw-r--r--   1 ____     ____          848 May 10 12:28 /home2/build_riscv/gcc/gcc/crtn.o
 57940824      4 -rw-r--r--   1 ____     ____          848 May 10 12:28 /home2/build_riscv/gcc/gcc/crti.o
 57940827      8 -rw-r--r--   1 ____     ____         4712 May 10 12:28 /home2/build_riscv/gcc/gcc/crtbeginT.o
 57940822      4 -rw-r--r--   1 ____     ____         1384 May 10 12:28 /home2/build_riscv/gcc/gcc/crtendS.o
 57940823      4 -rw-r--r--   1 ____     ____         1384 May 10 12:28 /home2/build_riscv/gcc/gcc/crtend.o
 57940815      4 -rw-r--r--   1 ____     ____         3640 May 10 12:28 /home2/build_riscv/gcc/gcc/crtbegin.o
 57940800      8 -rw-r--r--   1 ____     ____         4248 May  9 16:00 /home2/build_riscv/gcc/riscv64-unknown-linux-gnu/libgcc/crtbeginS.o
 57940808      4 -rw-r--r--   1 ____     ____          848 May  9 16:00 /home2/build_riscv/gcc/riscv64-unknown-linux-gnu/libgcc/crtn.o
 57940806      4 -rw-r--r--   1 ____     ____          848 May  9 16:00 /home2/build_riscv/gcc/riscv64-unknown-linux-gnu/libgcc/crti.o
 57940803      8 -rw-r--r--   1 ____     ____         4712 May  9 16:00 /home2/build_riscv/gcc/riscv64-unknown-linux-gnu/libgcc/crtbeginT.o
 57940812      4 -rw-r--r--   1 ____     ____         1384 May  9 16:00 /home2/build_riscv/gcc/riscv64-unknown-linux-gnu/libgcc/crtendS.o
 57940804      4 -rw-r--r--   1 ____     ____         1384 May  9 16:00 /home2/build_riscv/gcc/riscv64-unknown-linux-gnu/libgcc/crtend.o
 57940798      4 -rw-r--r--   1 ____     ____         3640 May  9 16:00 /home2/build_riscv/gcc/riscv64-unknown-linux-gnu/libgcc/crtbegin.o
 14351609     16 -rw-r--r--   1 ____     ____        13848 May 10 08:48 /opt/riscv/sysroot/usr/lib/crt1.o
 14351614      4 -rw-r--r--   1 ____     ____          952 May 10 08:48 /opt/riscv/sysroot/usr/lib/crti.o
 14351623      4 -rw-r--r--   1 ____     ____          952 May 10 08:49 /opt/riscv/sysroot/usr/lib/crtn.o
 14361798      8 -rw-r--r--   1 ____     ____         4248 May 10 12:32 /opt/riscv/sysroot/lib/gcc/riscv64-unknown-linux-gnu/14.1.0/crtbeginS.o
 14361802      4 -rw-r--r--   1 ____     ____         3640 May 10 12:32 /opt/riscv/sysroot/lib/gcc/riscv64-unknown-linux-gnu/14.1.0/crtbegin.o
 14361803      4 -rw-r--r--   1 ____     ____         1384 May 10 12:32 /opt/riscv/sysroot/lib/gcc/riscv64-unknown-linux-gnu/14.1.0/crtend.o
 14361804      4 -rw-r--r--   1 ____     ____          848 May 10 12:32 /opt/riscv/sysroot/lib/gcc/riscv64-unknown-linux-gnu/14.1.0/crti.o
 14361805      4 -rw-r--r--   1 ____     ____          848 May 10 12:32 /opt/riscv/sysroot/lib/gcc/riscv64-unknown-linux-gnu/14.1.0/crtn.o
 14361806      4 -rw-r--r--   1 ____     ____         1384 May 10 12:32 /opt/riscv/sysroot/lib/gcc/riscv64-unknown-linux-gnu/14.1.0/crtendS.o
 14361807      8 -rw-r--r--   1 ____     ____         4712 May 10 12:32 /opt/riscv/sysroot/lib/gcc/riscv64-unknown-linux-gnu/14.1.0/crtbeginT.o

The files in /opt/riscv/sysroot/usr/lib are likely the bootstrap files. The sysroot files are identical to the build files, with exceptions:

  • crt1.o is not generated by the gcc build process; it is normally installed by the glibc build.
  • The crti.o and crtn.o bootstrap files differ from the generated files. If we wanted to use this updated sysroot to build a 14.2.0 toolchain, we would probably want the newer versions.

So replace the bootstrap /opt/riscv/sysroot/usr/lib/crt*.o with hard links to the generated files.