Whisper Memcpy

Is it easy to recognize vector expansions of libc functions like memcpy?

Let’s locate some explicit invocations of memcpy within Whisper and see what the Advisor has to say.

struct whisper_context * whisper_init_from_buffer_with_params_no_state(void * buffer, size_t buffer_size, struct whisper_context_params params) {
    struct buf_context {
        uint8_t* buffer;
        size_t size;
        size_t current_offset;
    };
    loader.read = [](void * ctx, void * output, size_t read_size) {
        buf_context * buf = reinterpret_cast<buf_context *>(ctx);

        size_t size_to_copy = buf->current_offset + read_size < buf->size ? read_size : buf->size - buf->current_offset;

        memcpy(output, buf->buffer + buf->current_offset, size_to_copy);
        buf->current_offset += size_to_copy;

        return size_to_copy;
    };
};

This source example shows a few traits:

the number of bytes to copy is not in general known at compile time
the buffer type is uint8_t*
there are no alignment guarantees

GCC 15 compiles the lambda stored in loader.read as whisper_init_from_buffer_with_params_no_state::{lambda(void*,void*,unsigned_long)#1}::_FUN. The relevant Ghidra listing window instruction sequence (trimmed of address and whitespace) is:

LAB_000b0be2
    vsetvli  a3,param_3,e8,m8,ta,ma  
    vle8.v   v8,(a4)
    c.sub    param_3,a3
    c.add    a4,a3
    vse8.v   v8,(param_2)
    c.add    param_2,a3
    c.bnez   param_3,LAB_000b0be2

Copying the Ghidra listing to the clipboard and running the Advisor gives us:

Clipboard Contents to Analyze

LAB_000b0be2                                    XREF[1]:     000b0bf4(j)
000b0be2 d7 76 36 0c     vsetvli                        a3,param_3,e8,m8,ta,ma
000b0be6 07 04 07 02     vle8.v                         v8,(a4)
000b0bea 15 8e           c.sub                          param_3,a3
000b0bec 36 97           c.add                          a4,a3
000b0bee 27 84 05 02     vse8.v                         v8,(param_2)
000b0bf2 b6 95           c.add                          param_2,a3
000b0bf4 7d f6           c.bnez                         param_3,LAB_000b0be2

Signatures:

    Element width is = 8 bits
    Vector registers are grouped with MUL = 8
    Vector load: vle8.v
        Vector load is to multiple registers
    Vector store: vse8.v
        Vector store is from multiple registers
    At least one loop exists
    Significant operations, in the order they appear:
        vsetvli,vle8.v,vse8.v,_loop
    Significant operations, in alphanumeric order:
        _loop,vle8.v,vse8.v,vsetvli

Similarity Analysis

Compare the clipped example to the database of vectorized examples.

The best match is id=1873 [1.000]= _loop,vle8.v,vse8.v,vsetvli

The clip is similar to the reference example data/custom_testsuite/builtins/memcpy_rv64gcv:memcpy_255
Reference C Source

void memcpy_255()
{
  __builtin_memcpy (to, from, 255);
};

Reference Compiled Assembly Code

65e	auipc	a3,0x2
662	ld	a3,-1678(a3)
666	auipc	a2,0x0
66a	addi	a2,a2,82
66e	li	a4,255
672	vsetvli	a5,a4,e8,m8,ta,ma
676	vle8.v	v8,(a2)
67a	sub	a4,a4,a5
67c	add	a2,a2,a5
67e	vse8.v	v8,(a3)
682	add	a3,a3,a5

The Advisor has matched the vector instruction loop to the GCC __builtin_memcpy test case where the number of bytes to transfer is large (255). The individual scalar instructions are not the same.

This example shows something important that we probably want to add to the Advisor’s report:

The vsetvli instruction includes the m8 multiplier option, which means vector operations cover groups of 8 registers. The vle8.v only references vector register v8, but the loads and stores affect the 8 registers v8 through v15. If the __builtin_memcpy appeared in an inline code fragment, where there may be more pressure on vector register availability, we might have seen very similar code with multipliers of m4, m2, or m1.

What does the Ghidra decompiler show for this instruction sequence?

  do {
    lVar3 = vsetvli_e8m8tama(uVar1);
    auVar4 = vle8_v(lVar2);
    uVar1 = uVar1 - lVar3;
    lVar2 = lVar2 + lVar3;
    vse8_v(auVar4,param_2);
    param_2 = (void *)((long)param_2 + lVar3);
  } while (uVar1 != 0);

What would we like Ghidra’s decompiler to show instead? Something like:

__builtin_memcpy(param_2, lvar2, uVar1);

That’s not quite correct, as __builtin_memcpy doesn’t mutate the values param_2 or lvar2.

Last modified November 18, 2024: Extend analysis of Whisper (7d7e800)