Issues

Summarize Ghidra import issues here to promote discussion on relative priority and possible solutions.

Thread local storage class handling

Thread local storage provides one instance of a variable per extant thread. GCC supports this feature through the __thread keyword:

__thread char threadLocalString[4096];

Binaries built with this feature will often include ELF relocation codes like R_RISCV_TLS_TPREL64. Ghidra does not recognize this relocation code, nor is it clear how TLS storage itself should be modeled within Ghidra - perhaps as a memory section akin to BSS?

To reproduce, import libc.so.6 and look for lines like Elf Relocation Warning: Type = R_RISCV_TLS_TPREL64. Alternatively, compile, link, and import riscv64/generated/userSpaceSamples/relocationTest.c.
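For context, here is a minimal hedged sketch (not taken from the project sources) of code that accesses a thread-local variable; when compiled and linked into a shared object, accesses like this can give rise to TLS relocations such as R_RISCV_TLS_TPREL64:

/* Hypothetical sketch: reading or writing the __thread array goes through
   the thread pointer (tp), which the TLS relocation ultimately resolves. */
__thread char threadLocalString[4096];

char firstThreadLocalByte(void) {
    return threadLocalString[0];
}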

Vector instruction support

Newer C compiler releases can replace simple loops and standard C library invocations with processor-specific vector instructions. These vector instructions can be handled poorly by Ghidra’s disassembler and even worse by its decompiler. See autovectorization and vector_intrinsics for examples.

1 - inferring semantics from code patterns

How can we do a better job of recognizing semantic patterns in optimized code? Instruction set extensions make that more challenging.

Ghidra users want to understand the intent of binary code. The semantics and intent of the memcpy(dest,src,nbytes) operation are pretty clear. If the compiler converts this into a call to a named external function, that’s easy to recognize. If it converts it into a simple inline loop of load and store instructions, that should be recognizable too.

Optimizing compilers like gcc can generate many different instruction sequences from a simple concept like memcpy or strnlen, especially if the processor for which the code is intended supports advanced vector or bit manipulation instructions. We can examine the compiler testsuite to see what those patterns can be, enabling either human or machine translation of those sequences into the higher level semantics of memory movement or finding the first null byte in a string.

GCC recognizes memory copy operations via the cpymem operation. Calls to the standard library memcpy and various kinds of struct copies are translated into this RTL (Register Transfer Language) cpymem token. The processor-specific gcc back end then expands cpymem into a half-dozen or so instruction patterns, depending on the size, alignment, and instruction set extensions of the target processor.
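As a hedged illustration (the function and type names below are ours, not from the gcc sources), both of the following source forms can lower to the same cpymem expansion:

#include <string.h>

struct blob { char data[256]; };

/* explicit library call form */
void copy_explicit(struct blob *d, const struct blob *s) {
    memcpy(d, s, sizeof *d);
}

/* struct assignment form - also a block copy from the compiler's viewpoint */
void copy_struct(struct blob *d, const struct blob *s) {
    *d = *s;
}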

In an ideal world, Ghidra would recognize all of those RTL operations as pcode operations, and further recognize all of the common back-end expansions for all processor variants. It might rewrite the decompiler window or simply add comments indicating a likely cpymem pcode expansion.

It’s enough for now to show how to gather the more common patterns to help human Ghidra operators untangle these optimizations and understand the simpler semantics they encode.

Note: This example uses RISCV vector optimization - many other optimizations are supported by gcc too.

Patterns in gcc vectorization source code

Maybe the best reference on gcc vectorization is the gcc source code itself.

  • What intrinsics are likely to be replaced with vector code?
  • What patterns of vector assembly instructions are likely to be generated?
  • How does the gcc test suite search for those patterns to verify intrinsic replacement is correct?

Start with:

  • gcc/config/riscv/riscv-vector-builtins.cc
  • gcc/config/riscv/riscv-vector-strings.cc
  • gcc/config/riscv/autovec.md
  • gcc/config/riscv/riscv-string.cc

Ghidra expresses instruction semantics with pcode operations. GCC uses something similar in RTL (Register Transfer Language), documented in gcc/doc/md.texi. The RTL operations of interest here include:

  • cpymem
  • setmem
  • strlen
  • rawmemchr
  • cmpstrn
  • cmpstr

The cpymem op covers inline calls to memcpy and structure copies. Trace this out:

riscv.md:

(define_expand "cpymem<mode>"
  [(parallel [(set (match_operand:BLK 0 "general_operand")
                   (match_operand:BLK 1 "general_operand"))
              (use (match_operand:P 2 ""))
              (use (match_operand:SI 3 "const_int_operand"))])]
  ""
{
  if (riscv_expand_block_move (operands[0], operands[1], operands[2]))
    DONE;
  else
    FAIL;
})

riscv_expand_block_move is also mentioned in riscv-protos.h and riscv-string.cc.

Look into riscv-string.cc:

/* This function delegates block-move expansion to either the vector
   implementation or the scalar one.  Return TRUE if successful or FALSE
   otherwise.  */

bool
riscv_expand_block_move (rtx dest, rtx src, rtx length)
{
  if (TARGET_VECTOR && stringop_strategy & STRATEGY_VECTOR)
    {
      bool ok = riscv_vector::expand_block_move (dest, src, length);
      if (ok)
        return true;
    }

  if (stringop_strategy & STRATEGY_SCALAR)
    return riscv_expand_block_move_scalar (dest, src, length);

  return false;
...
}
...
/* --- Vector expanders --- */

namespace riscv_vector {

/* Used by cpymemsi in riscv.md .  */

bool
expand_block_move (rtx dst_in, rtx src_in, rtx length_in)
{
  /*
    memcpy:
        mv a3, a0                       # Copy destination
    loop:
        vsetvli t0, a2, e8, m8, ta, ma  # Vectors of 8b
        vle8.v v0, (a1)                 # Load bytes
        add a1, a1, t0                  # Bump pointer
        sub a2, a2, t0                  # Decrement count
        vse8.v v0, (a3)                 # Store bytes
        add a3, a3, t0                  # Bump pointer
        bnez a2, loop                   # Any more?
        ret                             # Return
 */
}
}

Note that the RISCV assembly instructions in the comment are just an example, and that the C++ implementation handles many different variants. The ret instruction is not part of the expansion, just copied into the source code from the testsuite.

The testsuite (gcc/testsuite/gcc.target/riscv) shows which variants are common enough to test against.

a minimalist call to memcpy

void f1 (void *a, void *b, __SIZE_TYPE__ l)
{
  memcpy (a, b, l);
}
** f1:
XX      \.L\d+: # local label is ignored
**      vsetvli\s+[ta][0-7],a2,e8,m8,ta,ma
**      vle8\.v\s+v\d+,0\(a1\)
**      vse8\.v\s+v\d+,0\(a0\)
**      add\s+a1,a1,[ta][0-7]
**      add\s+a0,a0,[ta][0-7]
**      sub\s+a2,a2,[ta][0-7]
**      bne\s+a2,zero,\.L\d+
**      ret
*/

a typed call to memcpy

void f2 (__INT32_TYPE__* a, __INT32_TYPE__* b, int l)
{
  memcpy (a, b, l);
}

Additional type information doesn’t appear to affect the inline code:

** f2:
XX      \.L\d+: # local label is ignored
**      vsetvli\s+[ta][0-7],a2,e8,m8,ta,ma
**      vle8\.v\s+v\d+,0\(a1\)
**      vse8\.v\s+v\d+,0\(a0\)
**      add\s+a1,a1,[ta][0-7]
**      add\s+a0,a0,[ta][0-7]
**      sub\s+a2,a2,[ta][0-7]
**      bne\s+a2,zero,\.L\d+
**      ret

memcpy with aligned elements and known size

In this case the arguments are aligned and 64 bytes (16 32-bit integers) in length.

extern struct { __INT32_TYPE__ a[16]; } a_a, a_b;
void f3 ()
{
  memcpy (&a_a, &a_b, sizeof a_a);
}

The generated sequence varies depending on how much the compiler knows about the target architecture.

** f3: { target { { any-opts "-mcmodel=medlow" } && { no-opts "-march=rv64gcv_zvl512b" "-march=rv64gcv_zvl1024b" "--param=riscv-autovec-lmul=dynamic" "--param=riscv-autovec-lmul=m2" "--param=riscv-autovec-lmul=m4" "--param=riscv-autovec-lmul=m8" "--param=riscv-autovec-preference=fixed-vlmax" } } }
**        lui\s+[ta][0-7],%hi\(a_a\)
**        addi\s+[ta][0-7],[ta][0-7],%lo\(a_a\)
**        lui\s+[ta][0-7],%hi\(a_b\)
**        addi\s+a4,[ta][0-7],%lo\(a_b\)
**        vsetivli\s+zero,16,e32,m8,ta,ma
**        vle32.v\s+v\d+,0\([ta][0-7]\)
**        vse32\.v\s+v\d+,0\([ta][0-7]\)
**        ret

** f3: { target { { any-opts "-mcmodel=medlow --param=riscv-autovec-preference=fixed-vlmax" "-mcmodel=medlow -march=rv64gcv_zvl512b --param=riscv-autovec-preference=fixed-vlmax" } && { no-opts "-march=rv64gcv_zvl1024b" } } }
**        lui\s+[ta][0-7],%hi\(a_a\)
**        lui\s+[ta][0-7],%hi\(a_b\)
**        addi\s+[ta][0-7],[ta][0-7],%lo\(a_a\)
**        addi\s+a4,[ta][0-7],%lo\(a_b\)
**        vl(1|4|2)re32\.v\s+v\d+,0\([ta][0-7]\)
**        vs(1|4|2)r\.v\s+v\d+,0\([ta][0-7]\)
**        ret

** f3: { target { { any-opts "-mcmodel=medlow -march=rv64gcv_zvl1024b" "-mcmodel=medlow -march=rv64gcv_zvl512b" } && { no-opts "--param=riscv-autovec-preference=fixed-vlmax" } } }
**        lui\s+[ta][0-7],%hi\(a_a\)
**        lui\s+[ta][0-7],%hi\(a_b\)
**        addi\s+a4,[ta][0-7],%lo\(a_b\)
**        vsetivli\s+zero,16,e32,(m1|m4|mf2),ta,ma
**        vle32.v\s+v\d+,0\([ta][0-7]\)
**        addi\s+[ta][0-7],[ta][0-7],%lo\(a_a\)
**        vse32\.v\s+v\d+,0\([ta][0-7]\)
**        ret

** f3: { target { { any-opts "-mcmodel=medany" } && { no-opts "-march=rv64gcv_zvl512b" "-march=rv64gcv_zvl256b" "-march=rv64gcv_zvl1024b" "--param=riscv-autovec-lmul=dynamic" "--param=riscv-autovec-lmul=m8" "--param=riscv-autovec-lmul=m4" "--param=riscv-autovec-preference=fixed-vlmax" } } }
**        lla\s+[ta][0-7],a_a
**        lla\s+[ta][0-7],a_b
**        vsetivli\s+zero,16,e32,m8,ta,ma
**        vle32.v\s+v\d+,0\([ta][0-7]\)
**        vse32\.v\s+v\d+,0\([ta][0-7]\)

** f3: { target { { any-opts "-mcmodel=medany"  } && { no-opts "-march=rv64gcv_zvl512b" "-march=rv64gcv_zvl256b" "-march=rv64gcv" "-march=rv64gc_zve64d" "-march=rv64gc_zve32f" } } }
**        lla\s+[ta][0-7],a_b
**        vsetivli\s+zero,16,e32,m(f2|1|4),ta,ma
**        vle32.v\s+v\d+,0\([ta][0-7]\)
**        lla\s+[ta][0-7],a_a
**        vse32\.v\s+v\d+,0\([ta][0-7]\)
**        ret
*/

** f3: { target { { any-opts "-mcmodel=medany --param=riscv-autovec-preference=fixed-vlmax" } && { no-opts "-march=rv64gcv_zvl1024b" } } }
**        lla\s+[ta][0-7],a_a
**        lla\s+[ta][0-7],a_b
**        vl(1|2|4)re32\.v\s+v\d+,0\([ta][0-7]\)
**        vs(1|2|4)r\.v\s+v\d+,0\([ta][0-7]\)
**        ret

2 - autovectorization

If a processor supports vector (aka SIMD) instructions, optimizing compilers will use them. That means Ghidra may need to make sense of the generated code.

Loop autovectorization

What happens when a gcc toolchain optimizes the following code?

#include <stdio.h>
int main(int argc, char** argv){
    const int N = 1320;
    char s[N];
    for (int i = 0; i < N - 1; ++i)
        s[i] = i + 1;
    s[N - 1] = '\0';
    printf(s);
}

This involves a simple loop filling a character array with integers. It isn’t a well-formed C string, so the printf statement is just there to keep the character array from being optimized away.

The elements of the loop involve incremental indexing, narrowing from 16 bit to 8 bit elements, and storage in a 1320 element vector.

The result depends on the compiler version and the microarchitecture the compiler was told to target.

Compile and link this file with a variety of compiler versions, flags, and microarchitectures to see how well Ghidra tracks toolchain evolution. In each case the decompiler output is manually adjusted to relabel variables like s and i and remove extraneous declarations.

RISCV-64 gcc-13, no optimization, no vector extensions

$ bazel build -s --platforms=//platforms:riscv_userspace  gcc_vectorization:narrowing_loop
...

Ghidra gives:

  char s [1319];
 ...
  int i;
  ...
  for (i = 0; i < 0x527; i = i + 1) {
    s[i] = (char)i + '\x01';
  }
  ...
  printf(s);

The loop consists of 17 instructions and 60 bytes. It is executed 1319 times.

RISCV-64 gcc-13, full optimization, no vector extensions

bazel build -s --platforms=//platforms:riscv_userspace --copt="-O3" gcc_vectorization:narrowing_loop

Ghidra gives:

  long i;
  char s_offset_by_1 [1320];
 
  i = 1;
  do {
    s_offset_by_1[i] = (char)i;
    i = i + 1;
  } while (i != 0x528);
  uStack_19 = 0;
  printf(s_offset_by_1 + 1);

The loop consists of 4 instructions and 14 bytes. It is executed 1319 times.

Note that Ghidra has reconstructed the target vector s a bit strangely, with the beginning offset by one byte to help shorten the loop.
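A hedged C sketch of the rewrite the compiler appears to have made here, with the original loop on top for comparison:

/* original source loop */
for (int i = 0; i < N - 1; ++i)
    s[i] = i + 1;

/* roughly what the optimized scalar code does: the loop counter and the
   stored value coincide, so the decompiler sees the buffer start one byte
   early (its s_offset_by_1) */
for (long i = 1; i != 0x528; i = i + 1)
    s[i - 1] = (char)i;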

RISCV-64 gcc-13, full optimization, with vector extensions

bazel build -s --platforms=//platforms:riscv_userspace --copt="-O3"  gcc_vectorization:narrowing_loop_vector

The Ghidra import is essentially unchanged - updating the target architecture from rv64igc to rv64igcv makes no difference when building with gcc-13.

RISCV-64 gcc-14, no optimization, no vector extensions

bazel build -s --platforms=//platforms:riscv_vector gcc_vectorization:narrowing_loop

The Ghidra import is essentially unchanged - updating gcc from gcc-13 to gcc-14 makes no difference without optimization.

RISCV-64 gcc-14, full optimization, no vector extensions

bazel build -s --platforms=//platforms:riscv_vector --copt="-O3" gcc_vectorization:narrowing_loop

The Ghidra import is essentially unchanged - updating gcc from gcc-13 to gcc-14 makes no difference when using the default target architecture without vector extensions.

RISCV-64 gcc-14, full optimization, with vector extensions

Build with -march=rv64gcv to tell the compiler to assume the processor supports RISCV vector extensions.

bazel build -s --platforms=//platforms:riscv_vector --copt="-O3"  gcc_vectorization:narrowing_loop_vector
Ghidra 11.0’s decompiler truncates the function at the first vector instruction it cannot model:

                    /* WARNING: Unimplemented instruction - Truncating control flow here */
  halt_unimplemented();

The disassembly window shows that the loop consists of 13 instructions and 46 bytes. Many of these are vector extension instructions for which Ghidra 11.0 has no semantics. Different RISCV processors will take a different number of iterations to finish the loop. If the processor VLEN=128, then each vector register will hold four 32-bit integers and the loop will take 330 iterations. If the processor VLEN=1024 then the loop will take 83 iterations.

Either way, Ghidra 11.0 will fail to decompile any such autovectorized loop, and fail to decompile the remainder of any function which contains such an autovectorized loop.

x86-64 gcc-14, full optimization, with sapphirerapids

Note: Intel’s Sapphire Rapids includes high end server processors like the Xeon Max family. A more general choice for x86_64 exemplars would be -march=x86-64-v3 with -O2 optimization. We can expect full Red Hat Linux distributions built with those options soon. The x86-64-v4 microarchitecture is a generalization of the Sapphire Rapids microarchitecture, and would more likely be found in servers specialized for numerical analysis or possibly ML applications.

$ bazel build -s --platforms=//platforms:x86_64_default --copt="-O3" --copt="-march=sapphirerapids" gcc_vectorization:narrowing_loop

Ghidra 11.0’s disassembler and decompiler fail immediately on hitting the first vector instruction, vpbroadcastd, an older AVX2 vector extension instruction.

    /* WARNING: Bad instruction - Truncating control flow here */
  halt_baddata();

builtin autovectorization

GCC can replace calls to some functions like memcpy with inline - and potentially vectorized - instruction sequences.

This source file shows different ways memcpy can be compiled.

include "common.h"
#include <string.h>

int main() {
  const int N = 127;
  const uint32_t seed = 0xdeadbeef;
  srand(seed);

  // data gen
  double A[N];
  gen_rand_1d(A, N);

  // compute
  double copy[N];
  memcpy(copy, A, sizeof(A));
  
  // prevent optimization from removing result
  printf("%f\n", copy[N-1]);
}

RISCV-64 builds

Build with:

$ bazel build --platforms=//platforms:riscv_vector --copt="-O3" gcc_vectorization:memcpy_vector

Ghidra 11 gives:

undefined8 main(void)

{
  undefined auVar1 [64];
  undefined *puVar2;
  undefined (*pauVar3) [64];
  long lVar4;
  long lVar5;
  undefined auVar6 [256];
  undefined local_820 [8];
  undefined8 uStack_818;
  undefined auStack_420 [1032];
  
  srand(0xdeadbeef);
  puVar2 = auStack_420;
  gen_rand_1d(auStack_420,0x7f);
  pauVar3 = (undefined (*) [64])local_820;
  lVar4 = 0x3f8;
  do {
    lVar5 = vsetvli_e8m8tama(lVar4);
    auVar6 = vle8_v(puVar2);
    lVar4 = lVar4 - lVar5;
    auVar1 = vse8_v(auVar6);
    *pauVar3 = auVar1;
    puVar2 = puVar2 + lVar5;
    pauVar3 = (undefined (*) [64])(*pauVar3 + lVar5);
  } while (lVar4 != 0);
  printf("%f\n",uStack_818);
  return 0;
}

What would we like the decompiler to show instead? The memcpy pattern should be fairly general and stable.

  src = auStack_420;
  gen_rand_1d(auStack_420,0x7f);
  dest = (undefined (*) [64])local_820;
  /* char* dest, src;
    dest[0..n] ≔ src[0..n]; */
  n = 0x3f8;
  do {
    lVar2 = vsetvli_e8m8tama(n);
    auVar3 = vle8_v(src);
    n = n - lVar2;
    auVar1 = vse8_v(auVar3);
    *dest = auVar1;
    src = src + lVar2;
    dest = (undefined (*) [64])(*dest + lVar2);
  } while (n != 0);

More generally, we want a precomment showing the memcpy in vector terms immediately before the loop. The type definition of dest is a red herring to be dealt with.

x86-64 builds

Build with:

$ bazel build -s --platforms=//platforms:x86_64_default --copt="-O3" --copt="-march=sapphirerapids" gcc_vectorization:memcpy_sapphirerapids

Ghidra 11.0’s disassembler and decompiler bail out when they reach the inline replacement for memcpy - gcc-14 has replaced the call with vector instructions like vmovdqu64, which Ghidra does not recognize.

void main(void)

{
  undefined auStack_428 [1016];
  undefined8 uStack_30;
  
  uStack_30 = 0x4010ad;
  srand(0xdeadbeef);
  gen_rand_1d(auStack_428,0x7f);
                    /* WARNING: Bad instruction - Truncating control flow here */
  halt_baddata();
}

3 - vector intrinsics

Invoking RISCV vector instructions from C.

RISCV vector intrinsic functions can be coded into C or C++.

The RISCV vector intrinsics documentation includes examples of code that might be found shortly in libc:

void *memcpy_vec(void *restrict destination, const void *restrict source,
                 size_t n) {
  unsigned char *dst = destination;
  const unsigned char *src = source;
  // copy data byte by byte
  for (size_t vl; n > 0; n -= vl, src += vl, dst += vl) {
    vl = __riscv_vsetvl_e8m8(n);
    vuint8m8_t vec_src = __riscv_vle8_v_u8m8(src, vl);
    __riscv_vse8_v_u8m8(dst, vec_src, vl);
  }
  return destination;
}

Note: GCC-14 autovectorization will often convert normal calls to memcpy into something very similar to the memcpy_vec code above, then assemble it down to RISCV vector instructions.

As another example, here is a snippet of code from the whisper.cpp voice-to-text open source project:

...
#ifdef __riscv_v_intrinsic
#include <riscv_vector.h>
#endif
...
#elif defined(__riscv_v_intrinsic)

    size_t vl = __riscv_vsetvl_e32m4(QK8_0);

    for (int i = 0; i < nb; i++) {
        // load elements
        vfloat32m4_t v_x   = __riscv_vle32_v_f32m4(x+i*QK8_0, vl);

        vfloat32m4_t vfabs = __riscv_vfabs_v_f32m4(v_x, vl);
        vfloat32m1_t tmp   = __riscv_vfmv_v_f_f32m1(0.0f, vl);
        vfloat32m1_t vmax  = __riscv_vfredmax_vs_f32m4_f32m1(vfabs, tmp, vl);
        float amax = __riscv_vfmv_f_s_f32m1_f32(vmax);
        ...
    }

Normally you would expect to see functions like __riscv_vfabs_v_f32m4 defined in the include file riscv_vector.h, where Ghidra could process it and help identify calls to these intrinsics. The vector intrinsic functions are instead autogenerated directly into GCC’s internal compiled header format when the compiler is built - there are just too many variants to cope with. The PDF listing of all intrinsic functions is currently over 4000 pages long. For example, the signature for __riscv_vfredmax_vs_f32m4_f32m1 is given on page 734 under Vector Reduction Operations as

vfloat32m1_t __riscv_vfredmax_vs_f32m4_f32m1(vfloat32m4_t vs2, vfloat32m1_t vs1, size_t vl);

There aren’t all that many vector instruction genotypes, but there are an enormous number of contextual variations the compiler and assembler know about.

4 - link time optimization

Link Time Optimization

Link Time Optimization (LTO) is a relatively new form of toolchain optimization that can produce smaller and faster binaries. It can also mutate control flows in those binaries, making Ghidra analysis trickier, especially if one is using BSim to look for control flow similarities.

Can we generate importable exemplars using LTO to show what such optimization steps look like in Ghidra?

LTO needs a command line parameter added for both compilation and linking. With bazel, that means --copt="-flto" --linkopt="-Wl,-flto" is enough to request LTO optimization on a build. These lto flags can also be defaulted into the toolchain definition or individual build files.

Let’s try this with a progressively more complicated series of exemplars:

# Build helloworld without LTO as a control
$ bazel build -s --copt="-O2" --platforms=//platforms:riscv_vector userSpaceSamples:helloworld
...
$ ls -l bazel-bin/userSpaceSamples/helloworld
-r-xr-xr-x. 1 --- --- 8624 Jan 31 10:44 bazel-bin/userSpaceSamples/helloworld

# The helloworld exemplar doesn't benefit much from link time optimization
$ bazel build -s  --copt="-O2"  --copt="-flto" --linkopt="-Wl,-flto" --platforms=//platforms:riscv_vector userSpaceSamples:helloworld
$ ls -l bazel-bin/userSpaceSamples/helloworld
-r-xr-xr-x. 1 --- --- 8608 Jan 31 10:46 bazel-bin/userSpaceSamples/helloworld

The memcpy source exemplar can be built three ways:

  • without vector extensions and without LTO - build target gcc_vectorization:memcpy
  • with vector extensions and without LTO - build target gcc_vectorization:memcpy_vector
  • with vector extensions and with LTO - build target gcc_vectorization:memcpy_lto

In this case the LTO options are configured into gcc_vectorization/BUILD.

$ bazel build -s --platforms=//platforms:riscv_vector gcc_vectorization:memcpy
...
INFO: Build completed successfully ...
$ ls -l bazel-bin/gcc_vectorization/memcpy
-r-xr-xr-x. 1 --- --- 13488 Jan 31 11:16 bazel-bin/gcc_vectorization/memcpy
$ bazel build -s  --platforms=//platforms:riscv_vector gcc_vectorization:memcpy_vector
INFO: Build completed successfully ...
$ ls -l bazel-bin/gcc_vectorization/memcpy_vector
-r-xr-xr-x. 1 --- --- 13728 Jan 31 11:18 bazel-bin/gcc_vectorization/memcpy_vector
$ bazel build -s  --platforms=//platforms:riscv_vector gcc_vectorization:memcpy_lto
ERROR: ...: Linking gcc_vectorization/memcpy_lto failed: (Exit 1): gcc failed: error executing CppLink command (from target //gcc_vectorization:memcpy_lto) ...
lto1: internal compiler error: in riscv_hard_regno_nregs, at config/riscv/riscv.cc:8058
Please submit a full bug report, with preprocessed source (by using -freport-bug).
See <https://gcc.gnu.org/bugs/> for instructions.
lto-wrapper: fatal error: external/gcc-14-riscv64-suite/bin/riscv64-unknown-linux-gnu-gcc returned 1 exit status
compilation terminated.

So it looks like LTO has problems with RISCV vector instructions. We’ll keep testing this as more gcc-14 snapshots become available, but as a lower priority exercise. LTO does not appear to be a widely used optimization.

5 - testing pcode semantics

Ghidra processor semantics needs tests

Processor instructions known to Ghidra are defined in Sleigh pcode or semantic sections. Adding new instructions - such as instruction set extensions to an existing processor - requires a pcode description of what that instruction does. That pcode drives both the decompiler process and any emulator or debugger processes.

This generates a conflict in testing. Should we test for maximum clarity for semantics rendered in the decompiler window or maximum fidelity in any Ghidra emulator? For example, should a divide instruction include pcode to test against a divide-by-zero? Should floating point instructions guard against NaN (Not a Number) inputs?
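As a concrete example of the tradeoff, here is a hedged C sketch of a RISCV integer divide; the ISA defines division by zero to return all ones rather than trap:

/* Sketch only: emulator-faithful pcode would model the guard below, while
   decompiler-friendly pcode would express just the bare division. */
long div_semantics(long rs1, long rs2) {
    if (rs2 == 0)
        return -1;      /* RISCV div returns all ones on divide-by-zero */
    return rs1 / rs2;
}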

We assume here that decompiler fidelity is more important than emulator fidelity. That implies:

  • ignore any exception-generating cases, including divide-by-zero, NaN, memory access and memory alignment.
  • pcode must allow for normal C implicit type conversions, such as between different integer and floating point lengths.
    • this implies pcode must pay attention to Ghidra’s type inference system.

Concept of Operations

Individual instructions are wrapped in C and exercised within a Google Test C++ framework. The test framework is then executed within a qemu static emulation environment.

For example, let’s examine two riscv-64 instructions: fcvt.w.s and fmv.x.w

  • fcvt.w.s converts a single-precision floating-point number in floating-point register rs1 to a signed 32-bit integer in integer register rd.
  • fmv.x.w moves the single-precision value in floating-point register rs1 represented in IEEE 754-2008 encoding to the lower 32 bits of integer register rd. For RV64, the higher 32 bits of the destination register are filled with copies of the floating-point number’s sign bit.

These two instructions have similar signatures but very different semantics. fcvt.w.s performs a float-to-int type conversion, so the float 1.0 can be converted to int 1. fmv.x.w moves the raw bits between float and int registers without any type conversion.

We can generate simple exemplars of both instructions with this C code:

int fcvt_w_s(float* x) {
    return (int)*x;
}

int fmv_x_w(float* x) {
    int val;
    float src = *x;

    __asm__ __volatile__ (
        "fmv.x.w  %0, %1" \
        : "=r" (val) \
        : "f" (src));
    return val;
}

Ghidra’s 11.2-DEV decompiler renders these as:

long fcvt_w_s(float *param_1)
{
  return (long)(int)*param_1;
}
long fmv_x_w(float *param_1)
{
  return (long)(int)param_1;
}

The fmv_x_w rendering is missing a dereference operation and implies an implicit type conversion where none is actually performed. Let’s trace how to use these test results to improve the decompiler output.
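For reference, a hedged plain-C sketch (the function name is ours) of the semantics the fmv.x.w pcode should capture - a raw bit move with sign extension, not a value conversion:

#include <stdint.h>
#include <string.h>

int64_t fmv_x_w_semantics(float src) {
    int32_t bits;
    memcpy(&bits, &src, sizeof bits);   /* copy the raw IEEE 754 bit pattern */
    return (int64_t)bits;               /* RV64: sign-extend bit 31 into the upper half */
}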

Running tests

The draft test harness can be built and run from the top level workspace directory.

$ bazel build --platforms=//riscv64/generated/platforms:riscv_userspace riscv64/generated/emulator_tests:testSemantics
Starting local Bazel server and connecting to it...
INFO: Analyzed target //riscv64/generated/emulator_tests:testSemantics (74 packages loaded, 1902 targets configured).
...
INFO: From Executing genrule //riscv64/generated/emulator_tests:testSemantics:
...
INFO: Found 1 target...
Target //riscv64/generated/emulator_tests:testSemantics up-to-date:
  bazel-bin/riscv64/generated/emulator_tests/results
INFO: Elapsed time: 4.172s, Critical Path: 1.93s
INFO: 4 processes: 1 internal, 3 linux-sandbox.
INFO: Build completed successfully, 4 total actions

$ cat bazel-bin/riscv64/generated/emulator_tests/results
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from FP
[ RUN      ] FP.testharness
[       OK ] FP.testharness (3 ms)
[ RUN      ] FP.fcvt
[       OK ] FP.fcvt (10 ms)
[ RUN      ] FP.fmv
[       OK ] FP.fmv (1 ms)
[ RUN      ] FP.fp16
[       OK ] FP.fp16 (0 ms)
[----------] 4 tests from FP (15 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (19 ms total)
[  PASSED  ] 4 tests.

$ file bazel-bin/riscv64/generated/emulator_tests/libfloatOperations.so
bazel-bin/riscv64/generated/emulator_tests/libfloatOperations.so: ELF 64-bit LSB shared object, UCB RISC-V, RVC, double-float ABI, version 1 (SYSV), dynamically linked, not stripped

Test parameters

What version of Ghidra are we testing against?

What do we do with Ghidra patches that improve the decompilation results?

  • If the patched instructions only exist within the isa_ext fork, we will make the changes to that fork and PR
  • If the patches come from unmerged PRs we may cherry-pick them into isa_ext. This includes some of the fcvt and fmv patches from other sources.

Test example

Use meld to compare the original C source file floatOperations.c with the exported C decompiler view from Ghidra 11.2-DEV. A quick inspection shows some errors to address:

Original:

float fcvt_s_wu(uint32_t* i) {return (float)*i;}
double fcvt_d_wu(uint32_t* j){return (double)*j;}

Ghidra:

float fcvt_s_wu(uint *param_1){return (float)ZEXT416(*param_1);}
double fcvt_d_wu(uint *param_1){return (double)ZEXT416(*param_1);}
long fmv_x_w(float *param_1){return (long)(int)*param_1;}
long fmv_x_d(double *param_1){return (long)(int)*param_1;}
long fcvt_h_w(int param_1){return (long)param_1;}
long fcvt_h_wu(int param_1){return (long)param_1;}
ulong fcvt_h_d(ulong *param_1){return *param_1 & 0xffffffff;}

The errors include:

  • spurious ZEXT416 in two places
  • fmv instructions appear to force an implicit type conversion where none is wanted
  • missing dereference operation in fcvt_h_w and fcvt_h_wu
  • bad mask operation in fcvt_h_d

Next steps

Testing semantics for the zfh half-precision floating point instructions is more complicated than usual. Ghidra’s semantics and pcode system has no known provision for half-precision floating point, so emulation won’t work well. The current zfh implementation makes these _fp16 objects look like 32-bit floats in registers and like 16-bit shorts in memory operations, making Ghidra type inferencing even more confusing.
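A hedged zfh exemplar in the style of the fcvt tests above (assumes a toolchain with _Float16 and zfh support; the function name is illustrative):

int fcvt_w_h(_Float16 *x) {
    /* half-precision load from memory, then float-to-int conversion;
       the register-vs-memory width mismatch described above is what
       confuses Ghidra's type inference */
    return (int)*x;
}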

Let’s look at a more limited scope, the definition of the Ghidra trunc pcode op.

The documentation says trunc produces a signed integer obtained by truncating its argument.

  • how does trunc set its result type?
  • does trunc expect only a floating point double?
  • what would it take to define a trunc_u op to generate an unsigned integer? (see the sketch after this list)
  • what would it take to accept a half-precision floating point value as an argument?
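A hedged C illustration of the signed/unsigned distinction behind the trunc_u question (both function names are hypothetical):

#include <stdint.h>

/* trunc: truncate toward zero to a signed integer (the existing pcode op) */
int64_t trunc_signed(double d) { return (int64_t)d; }

/* trunc_u: a hypothetical unsigned counterpart */
uint64_t trunc_unsigned(double d) { return (uint64_t)d; }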

The documentation also says that float2float ‘copies a floating-point number with more or less precision’, so its implementation may tell us something about type inferencing.

  • Ghidra/Features/Decompiler/src/decompile/cpp/pcodeparse.cc binds float2float to OP_FLOAT2FLOAT
  • this leads to CPUI_FLOAT_FLOAT2FLOAT and to several files under Ghidra/Features/Decompiler/src/decompile/cpp.
  • functions like FloatFormat::opFloat2Float and FloatFormat::opTrunc look relevant in float.hh and float.cc