adding toolchains
Adding a new toolchain takes lots of little steps, and some trial and error.
Overview
We want x86_64 exemplars built with the same next generation of gcc, libc, and libstdc++ as we use for RISCV exemplars. This will give us some hints about how common new issues may be and how global new solutions may need to be.
We will generate this x86_64 gcc-14 toolchain about the same way as our existing RISCV-64 gcc-14 toolchain.
This example uses the latest released version of binutils and the development head of gcc and glibc.
If we were building a toolchain for an actual product we would start by configuring and building a specialized kernel, which would prepopulate the system root. We aren’t doing that here, so we will use placeholders from the Fedora 40 x86_64 kernel.
binutils and the first gcc pass
We want binutils installed first.
$ cd /home2/vendor/binutils-gdb
$ git log
commit 2c73aeb8d2e02de7b69cbcb13361cfbca9d76a4e (HEAD, tag: binutils-2_41)
Author: Nick Clifton <nickc@redhat.com>
Date: Sun Jul 30 14:55:52 2023 +0100
The 2.41 release!
...
$ cd /home2/build_x86/binutils
.../vendor/binutils-gdb/configure --prefix=/opt/gcc14 --disable-multilib --enable-languages=c,c++,rust,lto
...
$ make
$ make install
...
The gcc suite and the glibc standard library have a circular dependency. We build and install the basic gcc capability first, then glibc, and then finish with the rest of gcc. During this process we likely need to add system files to the new sysroot directory.
$ cd /home2/vendor/gcc
$ git log
commit ac9c81dd76cfc34ed53402049021689a61c6d6e7 (HEAD -> master, origin/trunk, origin/master, origin/HEAD)
Author: Pan Li <pan2.li@intel.com>
Date: Mon Dec 18 21:40:00 2023 +0800
RISC-V: Rename the rvv test case.
...
$ cd /home2/build_x86/gcc
/home2/vendor/gcc/configure --prefix=/opt/gcc14 --disable-multilib --enable-languages=c,c++,rust,lto
$ make
...
$ make install
...
The make
and make install
may throw errors after completing the basic compiler. If so, we can
complete the build after we get glibc installed.
glibc
We should have enough of gcc-14 built to configure and build the 64 bit glibc package. This pending release of glibc has lots of changes, so we can expect some tinkering to get it to work for us.
$ cd /home2/vendor/glibc
$ git log
commit e957308723ac2e55dad360d602298632980bbd38 (HEAD -> master, origin/master, origin/HEAD)
Author: Matthew Sterrett <matthew.sterrett@intel.com>
Date: Fri Dec 15 12:04:05 2023 -0800
x86: Unifies 'strlen-evex' and 'strlen-evex512' implementations.
...
$ mkdir -p /home2/build_x86/glibc
$ cd /home2/build_x86/glibc
$ /home2/vendor/glibc/configure CC="/opt/gcc14/bin/gcc" --prefix="/usr" install_root=/opt/gcc14/sysroot --disable-werror --enable-shared --disable-multilib
$ make
$ make install_root=/opt/gcc14/sysroot install
$ du -hs /opt/gcc14/sysroot
105M /opt/gcc14/sysroot
gcc finish
If the gcc
installation errored out before completion, try it again after glibc is installed. This time it should complete without error.
testing the local toolchain
Next we want to exercise the toolchain by compiling a very simple C program:
#include <stdio.h>
int main(int argc, char** argv){
const int N = 1320;
char s[N];
for (int i = 0; i < N - 1; ++i)
s[i] = i + 1;
s[N - 1] = '\0';
printf(s);
}
We’ll build it with three sets of options and import all three into Ghidra 11
/opt/gcc14/bin/gcc gcc_vectorization/helloworld_challenge.c -o a_unoptimized.out
/opt/gcc14/bin/gcc -O3 gcc_vectorization/helloworld_challenge.c -o a_host_optimized.out
/opt/gcc14/bin/gcc -march=rocketlake -O3 gcc_vectorization/helloworld_challenge.c -o a_rocketlake_optimized.out
Note: Rocket Lake is Intel’s codename for its 11th generation Core microprocessors
Ghidra 11 gives us:
a_unoptimized.out
imports and decompiles cleanly, with recognizable disassembly and decompiler output of 5 lines of code.a_host_optimized.out
imports cleanly and decompiles into about 150 lines of hard-to-interpret C code. The loop has been autovectorized using instructions likePUNPCKHWD
,PUNPCKLWD
, andPADDD
. These appear to be AVX-512 vector extensions.a_rocketlake_optimized.out
fails to disassemble or decompile when it hits AVX2 instructions likevpbroadcastd
. Binutils 2.41’sobjdump
appears to recognize these instructions.
As a stretch goal, what does the gcc-14 Rust compiler give us?
/opt/gcc14/bin/gccrs -frust-incomplete-and-experimental-compiler-do-not-use src/main.rs
src/main.rs:25:5: error: unknown macro: [log::info]
25 | log::info!(
| ^~~
src/main.rs:29:5: error: unknown macro: [log::info]
29 | log::info!(
| ^~~
...
If gccrs can’t handle basic rust macros, it isn’t very useful for generating exemplars. We won’t include it in our portable toolchain.
packaging the toolchain
Now we need to make the toolchain hermetic, portable, and ready for Bazel workspaces.
Hermeticity means that nothing under /opt/gcc14
makes a hidden reference to local host files under /usr
. Any such reference needs to
be changed into a relative reference. These are common in short shareable object files that link to one or more true shareable object libraries.
You can often identify possible troublemakers by searching for smallish regular files with a .so
extension.
$ find /opt/gcc14 -name \*.so -type f -size -1000c -ls
/opt/gcc14$ find /opt/gcc14 -name \*.so -type f -size -1000c -ls
... 273 Dec 27 12:41 /opt/gcc14/lib/libc.so
... 126 Dec 27 12:42 /opt/gcc14/lib/libm.so
... 132 Dec 27 11:42 /opt/gcc14/lib64/libgcc_s.so
$ cat /opt/gcc14/lib/libc.so
/* GNU ld script
Use the shared library, but some functions are only in
the static library, so try that secondarily. */
OUTPUT_FORMAT(elf64-x86-64)
GROUP ( /opt/gcc14/lib/libc.so.6 /opt/gcc14/lib/libc_nonshared.a AS_NEEDED ( /opt/gcc14/lib/ld-linux-x86-64.so.2 ) )
$ cat /opt/gcc14/lib/libm.so
/* GNU ld script
*/
OUTPUT_FORMAT(elf64-x86-64)
GROUP ( /opt/gcc14/lib/libm.so.6 AS_NEEDED ( /opt/gcc14/lib/libmvec.so.1 ) )
$ cat /opt/gcc14/lib/libm.so
/* GNU ld script
*/
OUTPUT_FORMAT(elf64-x86-64)
GROUP ( /opt/gcc14/lib/libm.so.6 AS_NEEDED ( /opt/gcc14/lib/libmvec.so.1 ) )
thixotropist@mini:/opt/gcc14$ cat /opt/gcc14/lib64/libgcc_s.so
/* GNU ld script
Use the shared library, but some functions are only in
the static library. */
GROUP ( libgcc_s.so.1 -lgcc )
So two of the three files need patching: replacing /opt/gcc14/lib
with .
. Any text editor will do.
Next we need to identify all dynamic host dependencies of binaries under /opt/gcc14
.
The ldd
command will identify these local system files,
which should be collected into a separate tarball. This tarball can be shared with other cross-compilers
built at the same time, and is generally portable across similar Linux kernels and distributions.
At this point we can strip
the executable files within the toolchain and identify the ones we want to keep in the portable toolchain tarball.
Scripts under generated/toolchains/gcc-14-*/scripts
will help with that.
generate.sh
uses rsync to copy selected files from /opt/gcc14
into /tmp/export
, stripping the known binaries, and creating the portable tarball.
It then collects relevant dynamic libraries from the host and creates a second portable tarball.
These two tarballs can then be copied to other computers and imported into a project by adding stanzas to the project WORKSPACE file.
installing the toolchain
The toolchain tarball is currently in /opt/bazel``. We need its full path and sha256sum to make it accessible within our workspace. Edit
WORKSPACE` to include:
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")
# gcc-14 x86_64 toolchain from snapshot gcc-14 and glibc development heads
http_archive(
name = "gcc-14-x86_64-suite",
urls = ["file:///opt/bazel/x86_64_linux_gnu-14.tar.xz"],
build_file = "//:gcc-14-x86_64-suite.BUILD",
sha256 = "40cc4664a11b8da56478393c7c8b823b54f250192bdc1e1181c9e4f8ac15e3be",
)
# system libraries used by toolchain build system
# We built the custom toolchain on a fedora x86_64 platform, so we need some
# fedora x86_64 sharable system libraries to execute.
http_archive(
name = "fedora39-system-libs",
urls = ["file:///opt/bazel/fedora39_system_libs.tar.xz"],
build_file = "//:fedora39-system-libs.BUILD",
sha256 = "fe91415b05bb902964f05f7986683b84c70338bf484f23d05f7e8d4096949d1b",
)
Bazel will unpack this tarball into an external project directory, something like /run/user/1000/bazel/execroot/_main/external/gcc-14-x86_64-suite/
.
Individual files and filegroups within that directory are defined in x86_64/generated/gcc-14-x86_64-suite.BUILD
.
The filegroup compiler_files
is probably the most important, as it collects everything that might be used in anything launched from gcc or g++.
The full Bazel name for this filegroup is @gcc-14-x86_64-suite//:compiler_files
.
Each custom toolchain is defined within the x86_64/generated/toolchains/BUILD
file. This associates filegroups from a (possibly shared) toolchain tarball like
gcc-14-x86_64-suite
with a set of default compiler and linker options and standard libraries. We might want multiple gcc-14 toolchains, for building kernels,
kernel modules, and userspace applications respectively.
Most of the configuration exists within stanzas like this:
toolchain(
name = "x86_64_default",
target_compatible_with = [
":x86_64",
],
toolchain = ":x86_64-default-gcc",
toolchain_type = "@bazel_tools//tools/cpp:toolchain_type",
)
cc_toolchain(
name = "x86_64-default-gcc",
all_files = ":all_files",
ar_files = ":gcc_14_compiler_files",
as_files = ":empty",
compiler_files = ":gcc_14_compiler_files",
dwp_files = ":empty",
linker_files = ":gcc_14_compiler_files",
objcopy_files = ":empty",
strip_files = ":empty",
supports_param_files = 0,
toolchain_config = ":x86_64-default-gcc-config",
toolchain_identifier = "x86_64-default-gcc",
)
cc_toolchain_config(
name = "x86_64-default-gcc-config",
abi_libc_version = ":empty",
abi_version = ":empty",
compile_flags = [
# take the isystem ordering from the output of gcc -xc++ -E -v -
"--sysroot", "external/gcc-14-x86_64-suite/sysroot/",
"-Wall",
],
compiler = "gcc",
coverage_compile_flags = ["--coverage"],
coverage_link_flags = ["--coverage"],
cpu = "x86_64",
# we really want the following to be constructed from $(output_base) or $(location ...)
cxx_builtin_include_directories = [
OUTPUT_BASE + "/external/gcc-14-x86_64-suite/sysroot/usr/include",
OUTPUT_BASE + "/external/gcc-14-x86_64-suite/x86_64-pc-linux-gnu/include/c++/14.0.0",
OUTPUT_BASE + "/external/gcc-14-x86_64-suite/lib/gcc/x86_64-pc-linux-gnu/14.0.0/include",
OUTPUT_BASE + "/external/gcc-14-x86_64-suite/lib/gcc/x86_64-pc-linux-gnu/14.0.0/include-fixed",
],
cxx_flags = [
"-std=c++20",
"-fno-rtti",
],
dbg_compile_flags = ["-g"],
host_system_name = ":empty",
link_flags = ["--sysroot", "external/gcc-14-x86_64-suite/sysroot/"],
link_libs = ["-lstdc++", "-lm"],
opt_compile_flags = [
"-g0",
"-Os",
"-DNDEBUG",
"-ffunction-sections",
"-fdata-sections",
],
opt_link_flags = ["-Wl,--gc-sections"],
supports_start_end_lib = False,
target_libc = ":empty",
target_system_name = ":empty",
tool_paths = {
"ar": "gcc-14-x86_64/imported/ar",
"ld": "gcc-14-x86_64/imported/ld",
"cpp": "gcc-14-x86_64/imported/cpp",
"gcc": "gcc-14-x86_64/imported/gcc",
"dwp": ":empty",
"gcov": ":empty",
"nm": "gcc-14-x86_64/imported/nm",
"objcopy": "gcc-14-x86_64/imported/objcopy",
"objdump": "gcc-14-x86_64/imported/objdump",
"strip": "gcc-14-x86_64/imported/strip",
},
toolchain_identifier = "gcc-14-x86_64",
unfiltered_compile_flags = [
"-fno-canonical-system-headers",
"-Wno-builtin-macro-redefined",
"-D__DATE__=\"redacted\"",
"-D__TIMESTAMP__=\"redacted\"",
"-D__TIME__=\"redacted\"",
],
)
The tool_paths
element points to small bash scripts needed to launch compiler components like gcc
and ar
and strip
.
These give us the chance to use imported system shareable object libraries rather than the host’s shareable object libraries.
#!/bin/bash
set -euo pipefail
PATH=`pwd`/toolchains/gcc-14-x86_64/imported \
LD_LIBRARY_PATH=external/fedora39-system-libs \
external/gcc-14-x86_64-suite/bin/gcc "$@"
finding the hidden toolchain dependencies
Compiling and linking source files takes many dependent files from /opt/gcc14
. The next step is tedious and iterative - we need to prove
that the portable toolchain tarball derived from /opt/gcc14
never references any file in that directory, or any local host file under /usr
.
Bazel can do that for us, at the cost of identifying every file or file ‘glob’ that may be called for each of the toolchain primitives.
It runs the toolchain in a sandbox, forcing an exception on all references not previously declared as dependencies.
This kind of exception looks like this:
ERROR: /home/XXX/projects/github/ghidra_import_tests/x86_64/generated/userSpaceSamples/BUILD:3:10: Compiling userSpaceSamples/helloworld.c failed: absolute path inclusion(s) found in rule '//userSpaceSamples:helloworld':
the source file 'userSpaceSamples/helloworld.c' includes the following non-builtin files with absolute paths (if these are builtin files, make sure these paths are in your toolchain):
'/usr/include/stdc-predef.h'
'/usr/include/stdio.h'
If you see this check:
- whether
stdio.h
was installed in the right directory under/opt/gcc14
. - whether
stdio.h
was copied into/tmp/export
when building the tarball - whether the instances of
stdio.h
appeared in the appropriate compiler file groups defined ingcc-14-x86_64-suite.BUILD
- whether those filegroups were properly imported into the Bazel sandbox for your build
- whether the compile_flags for your toolchain tell gcc-14 to search the sandbox for the directories containing
stdio.h
"-isystem", "external/gcc-14-x86_64-suite/sysroot/usr/include",
- whether the link_flags for your toolchain tell gcc-14 to search the sandbox for the directories containing
crt1.o
andcrti.o
using the toolchain
We can test our new toolchain with a build of helloworld
.
x86_64/generated$ bazel clean
INFO: Starting clean (this may take a while). Consider using --async if the clean takes more than several minutes.
x86_64/generated$ bazel run -s --platforms=//platforms:x86_64_default userSpaceSamples:helloworld
INFO: Analyzed target //userSpaceSamples:helloworld (69 packages loaded, 1538 targets configured).
SUBCOMMAND: # //userSpaceSamples:helloworld [action 'Compiling userSpaceSamples/helloworld.c', configuration: 672d6d72a34879952e2365b9bc032c10f7e50fda380c4b7c8e86b49faa982e8b, execution platform: @@local_config_platform//:host, mnemonic: CppCompile]
(cd /run/user/1000/bazel/execroot/_main && \
exec env - \
PATH=/home/thixotropist/.local/bin:/home/thixotropist/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/var/lib/snapd/snap/bin:/home/thixotropist/.local/bin:/home/thixotropist/bin:/opt/ghidra_10.3.2_PUBLIC/:/home/thixotropist/.cargo/bin::/usr/lib/jvm/jdk-17-oracle-x64/bin:/opt/gradle-7.6.2/bin \
PWD=/proc/self/cwd \
toolchains/gcc-14-x86_64/imported/gcc -U_FORTIFY_SOURCE --sysroot external/gcc-14-x86_64-suite/sysroot/ -Wall -MD -MF bazel-out/k8-fastbuild/bin/userSpaceSamples/_objs/helloworld/helloworld.pic.d '-frandom-seed=bazel-out/k8-fastbuild/bin/userSpaceSamples/_objs/helloworld/helloworld.pic.o' -fPIC -iquote . -iquote bazel-out/k8-fastbuild/bin -iquote external/bazel_tools -iquote bazel-out/k8-fastbuild/bin/external/bazel_tools -fno-canonical-system-headers -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' -c userSpaceSamples/helloworld.c -o bazel-out/k8-fastbuild/bin/userSpaceSamples/_objs/helloworld/helloworld.pic.o)
# Configuration: 672d6d72a34879952e2365b9bc032c10f7e50fda380c4b7c8e86b49faa982e8b
# Execution platform: @@local_config_platform//:host
SUBCOMMAND: # //userSpaceSamples:helloworld [action 'Linking userSpaceSamples/helloworld', configuration: 672d6d72a34879952e2365b9bc032c10f7e50fda380c4b7c8e86b49faa982e8b, execution platform: @@local_config_platform//:host, mnemonic: CppLink]
(cd /run/user/1000/bazel/execroot/_main && \
exec env - \
PATH=/home/thixotropist/.local/bin:/home/thixotropist/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/var/lib/snapd/snap/bin:/home/thixotropist/.local/bin:/home/thixotropist/bin:/opt/ghidra_10.3.2_PUBLIC/:/home/thixotropist/.cargo/bin::/usr/lib/jvm/jdk-17-oracle-x64/bin:/opt/gradle-7.6.2/bin \
PWD=/proc/self/cwd \
toolchains/gcc-14-x86_64/imported/gcc -o bazel-out/k8-fastbuild/bin/userSpaceSamples/helloworld -Wl,-S --sysroot external/gcc-14-x86_64-suite/sysroot/ bazel-out/k8-fastbuild/bin/userSpaceSamples/_objs/helloworld/helloworld.pic.o -lstdc++ -lm)
# Configuration: 672d6d72a34879952e2365b9bc032c10f7e50fda380c4b7c8e86b49faa982e8b
# Execution platform: @@local_config_platform//:host
INFO: Found 1 target...
Target //userSpaceSamples:helloworld up-to-date:
bazel-bin/userSpaceSamples/helloworld
INFO: Elapsed time: 0.289s, Critical Path: 0.10s
INFO: 6 processes: 4 internal, 2 linux-sandbox.
INFO: Build completed successfully, 6 total actions
INFO: Running command line: bazel-bin/userSpaceSamples/helloworld
Hello World!
$ strings bazel-bin/userSpaceSamples/helloworld|grep -i gcc
GCC: (GNU) 14.0.0 20231218 (experimental)
Things to note:
- The command line includes
--platforms=//platforms:x86_64_default
to show we are not building for the local host toolchains/gcc-14-x86_64/imported/gcc
is invoked twice, once to compile and once to link--sysroot external/gcc-14-x86_64-suite/sysroot
is used twice, to avoid including host files under/usr
- The
helloworld
executable happens to execute on the host machine. - The
helloworld
executable contains no references to gcc-13, the native toolchain on the host machine.
Now try a C++ build:
$ bazel run -s --platforms=//platforms:x86_64_default userSpaceSamples:helloworld++
...
Target //userSpaceSamples:helloworld++ up-to-date:
bazel-bin/userSpaceSamples/helloworld++
INFO: Elapsed time: 0.589s, Critical Path: 0.51s
INFO: 3 processes: 1 internal, 2 linux-sandbox.
INFO: Build completed successfully, 3 total actions
INFO: Running command line: bazel-bin/userSpaceSamples/helloworld++
Hello World!
cleanup
We’ve got a working toolchain, but with many dangling links, duplicate files, and unused definitions. The toolchain files normally provided by a kernel were copied in as needed from the host, with the understanding that we never really needed unable applications.
If this were a production environment we would be a lot more careful. It’s not, so we will just summarize some of the areas that might benefit from such a cleanup.
toolchain directories
Adding and testing a toolchain involves lots of similar-looking directories.
/opt/gcc14
This directory is the install target for our binutils, gcc, and glibc builds. It is not itself used by the Bazel build framework, and need not be present on any host machine running the toolchain.
- The overall size is reported as 3.1 GB, inflated somewhat by multiple hard links
fdupes
reports 2446 duplicate files (in 2136 sets), occupying 200.2 megabytes- There are six files over 150 MB in size
/tmp/export
This directory holds a subset of /opt/gcc14
, with many binaries stripped. The hard links
of `/opt/gcc14`` are lost. It may be discarded after the portable tarball is generated.
- the overall size is 731 MB
fdupes
reports 1010 duplicate files (in 983 sets), occupying 69.0 megabytes- There are 16 files over 10 MB in size
/opt/bazel/x86_64_linux_gnu-14.tar.xz
The compressed portable tarball size is 171M. It expands into a locally cached equivalent of /tmp/export
.
This file must be accessible to Bazel during a cross-compilation, either as a file reference or as a remote http or
https URL.
/run/user/1000/bazel/execroot/_main/external/x86_64_linux_gnu-14
Bazel agents will decompress the tarball if and when needed into a local cache directory. In this case, it is unpacked into a RAM file system for speed.
/run/user/1000/bazel/sandbox/linux-sandbox/2/execroot/_main/external/x86_64_linux_gnu-14
Temporary sandboxes like this are created when individual compiler and linker steps are executed. They implement whatever subset of the cached toolchain tarball are explicitly named as dependencies for that step. Toolchain references outside of the sandbox are often flagged as hermeticity errors and abort the build.