GPU Programming

Sounio has two GPU compiler paths in this repo snapshot:

souc-linux-x86_64 (the default self-hosted artifact): compiles GPU syntax, runs a serial CPU fallback, and lets kretikos build predefined PTX/CUBIN artifact templates.
souc-linux-x86_64-gpu (the GPU-facing artifact): separate binary with broader PTX emission via build --backend gpu.

Self-hosted GPU path (CPU fallback + artifact templates)

The default compiler (bin/souc or bin/souc-linux-x86_64) accepts GPU kernel syntax and runs a deterministic serial CPU fallback. The kretikos CLI uses that compiler to build small in-tree emitter drivers, then writes predefined GPU artifact templates:

# Emit predefined PTX templates (self-hosted, no GPU required)
# Each command writes to its -o path; reusing /tmp/kernel.ptx overwrites the previous output
kretikos emit-ptx vec_add     -o /tmp/kernel.ptx
kretikos emit-ptx vec_sub     -o /tmp/kernel.ptx
kretikos emit-ptx vec_mul     -o /tmp/kernel.ptx
kretikos emit-ptx vec_div     -o /tmp/kernel.ptx
kretikos emit-ptx vec_add_f64 -o /tmp/kernel.ptx
kretikos emit-ptx fma         -o /tmp/kernel.ptx
kretikos emit-ptx fma_f64     -o /tmp/kernel.ptx
kretikos emit-ptx store_u32_const -o /tmp/kernel.ptx

# Emit predefined Metal/MSL templates (self-hosted, no macOS required)
kretikos emit-metal vec_add          -o /tmp/kernel.metal
kretikos emit-metal ossm_oct_step    -o /tmp/kernel.metal
kretikos emit-metal sedenion_cd_step -o /tmp/kernel.metal

# Emit predefined CUBIN templates
kretikos emit-cubin exit_only -o /tmp/kernel.cubin
kretikos emit-cubin store_u32_const -o /tmp/kernel.cubin
kretikos emit-cubin vec_add_f32 -o /tmp/kernel.cubin
kretikos emit-cubin epistemic_dual_f32 -o /tmp/kernel.cubin

# Emit a structural artifact bundle: PTX, CUBIN, hashes, boundaries
kretikos bundle -o /tmp/kretikos-bundle

# Optionally record assembler/disassembler and CUDA Driver API evidence
kretikos bundle -o /tmp/kretikos-validated-bundle --validate-toolchain --validate-runtime
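
The emit-ptx commands above all target the same /tmp/kernel.ptx, so running them in sequence leaves only the last template on disk. A minimal sketch that writes each predefined PTX template to its own file might look like this. KRETIKOS, OUT_DIR, and emit_all_ptx are illustrative names chosen here, not part of the kretikos CLI; the template names are copied from the commands above.

```shell
#!/bin/sh
# Sketch: emit every predefined PTX template to a separate file.
# KRETIKOS and OUT_DIR are hypothetical variables for this example only.
KRETIKOS="${KRETIKOS:-kretikos}"
OUT_DIR="${OUT_DIR:-/tmp/kretikos-ptx}"

# Template names copied from the emit-ptx commands in this document.
PTX_TEMPLATES="vec_add vec_sub vec_mul vec_div vec_add_f64 fma fma_f64 store_u32_const"

emit_all_ptx() {
    mkdir -p "$OUT_DIR" || return 1
    for name in $PTX_TEMPLATES; do
        # Stop at the first template that fails to emit.
        "$KRETIKOS" emit-ptx "$name" -o "$OUT_DIR/$name.ptx" || return 1
    done
}

# Only run when the CLI is actually available.
if command -v "$KRETIKOS" >/dev/null 2>&1; then
    emit_all_ptx
else
    echo "kretikos not found on PATH; skipping" >&2
fi
```

Stopping at the first failure keeps the output directory from silently holding a partial set of templates.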

What that proves today:

The self-hosted compiler compiles the kernel fn, with GPU, GPU.launch, and GPU.sync constructs.
The CPU fallback runs kernels serially with deterministic thread/block IDs.
The kretikos CLI wrapper emits predefined PTX text and NVIDIA CUDA ELF/CUBIN byte templates from in-tree Sounio code.
The kretikos bundle command writes a machine-readable sounio.kretikos.bundle.v1 manifest with artifact hashes, structural checks, optional ptxas/nvdisasm and CUDA Driver API validation, and explicit non-claims.
This path does not yet lower arbitrary user kernel source into PTX/CUBIN.
Toolchain and runtime validation apply only to the predefined PTX/CUBIN templates selected by kretikos; they do not validate arbitrary user-written kernels.
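
For a manual spot check of one emitted PTX template outside of kretikos bundle, something like the following sketch could be used. It is an independent check, not a description of what --validate-toolchain does internally. PTXAS, ARCH, and check_ptx are illustrative names, and sm_80 is an assumption taken from the cuda-sm80 lane named in this document.

```shell
#!/bin/sh
# Sketch: assemble an emitted PTX file with ptxas when the CUDA toolkit is present.
# PTXAS, ARCH, and check_ptx are hypothetical names, not part of kretikos.
PTXAS="${PTXAS:-ptxas}"
ARCH="${ARCH:-sm_80}"   # assumption: matches the cuda-sm80 lane

# Returns 0 if ptxas accepts the file, 1 if it rejects it,
# and 2 if ptxas or the input file is unavailable.
check_ptx() {
    ptx_file="$1"
    if [ ! -f "$ptx_file" ]; then
        echo "missing input: $ptx_file" >&2
        return 2
    fi
    if ! command -v "$PTXAS" >/dev/null 2>&1; then
        echo "ptxas not found; skipping" >&2
        return 2
    fi
    # Discard the assembled output; only the exit status matters here.
    if "$PTXAS" -arch="$ARCH" "$ptx_file" -o /dev/null; then
        return 0
    fi
    return 1
}
```

Calling check_ptx /tmp/kernel.ptx after one of the emit-ptx commands would then report whether the local assembler accepts that template. The distinct return code for a missing toolchain keeps "not checked" separate from "rejected".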

GPU artifact path (broader PTX emission)

The checked GPU artifact is a separate binary that emits PTX for a broader range of kernel patterns through build --backend gpu:

export SOUC_GPU_BIN="$(pwd)/artifacts/omega/souc-bin/souc-linux-x86_64-gpu"
export SOUNIO_STDLIB_PATH="$(pwd)/stdlib"

"$SOUC_GPU_BIN" info
"$SOUC_GPU_BIN" check examples/gpu.sio
"$SOUC_GPU_BIN" check tests/run-pass/gpu_launch_surface.sio
"$SOUC_GPU_BIN" build examples/kernel_matmul.sio --backend gpu -o /tmp/kernel_matmul.ptx

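The commands above assume the GPU artifact is present and executable. A hedged sketch that guards for that before running them could look like this. gpu_smoke and PTX_OUT are illustrative names introduced here; the paths and subcommands are the ones shown above, and the final non-empty-file check is an assumption about what a successful build leaves behind.

```shell
#!/bin/sh
# Sketch: run the documented GPU-artifact commands only if the binary is usable.
# gpu_smoke and PTX_OUT are hypothetical names for this example.
SOUC_GPU_BIN="${SOUC_GPU_BIN:-$(pwd)/artifacts/omega/souc-bin/souc-linux-x86_64-gpu}"
SOUNIO_STDLIB_PATH="${SOUNIO_STDLIB_PATH:-$(pwd)/stdlib}"
PTX_OUT="${PTX_OUT:-/tmp/kernel_matmul.ptx}"
export SOUC_GPU_BIN SOUNIO_STDLIB_PATH

gpu_smoke() {
    if [ ! -x "$SOUC_GPU_BIN" ]; then
        echo "GPU artifact not found or not executable: $SOUC_GPU_BIN" >&2
        return 2
    fi
    "$SOUC_GPU_BIN" info || return 1
    "$SOUC_GPU_BIN" check examples/gpu.sio || return 1
    "$SOUC_GPU_BIN" check tests/run-pass/gpu_launch_surface.sio || return 1
    "$SOUC_GPU_BIN" build examples/kernel_matmul.sio --backend gpu -o "$PTX_OUT" || return 1
    # Assumption: a successful build leaves a non-empty PTX file at the -o path.
    [ -s "$PTX_OUT" ]
}
```

Running gpu_smoke from the repository root mirrors the command sequence above and returns 2 when the artifact itself is missing, which separates a setup problem from a compiler failure.
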
Public surface versus source-tree surface

The checked public GPU artifact currently accepts:

kernel fn
with GPU
perform GPU.launch(...)
perform GPU.sync()
PTX emission through build --backend gpu

The self-hosted compiler (bin/souc) accepts the same syntax and additionally routes:

predefined PTX templates through kretikos emit-ptx
predefined CUBIN templates through kretikos emit-cubin
structural PTX/CUBIN artifact bundles through kretikos bundle
optional fail-closed toolchain/runtime checks through kretikos bundle --require-toolchain and kretikos bundle --require-runtime

The checked public GPU artifact does not yet resolve the older gpu.* intrinsic namespace from historical sketches:

gpu.thread_id.*
gpu.block_id.*
gpu.block_dim.*
gpu.alloc<T>(...)

Those names are still part of the implementation story, but they are not yet the recommended public syntax; prefer the kernel fn, with GPU, and perform GPU.* forms listed above.

Backend evidence

The strongest GPU evidence in the repo is under artifacts/omega/:

gpu_codegen_parity.v1.json
gpu_binary_attestation.v1.json
gpu_runtime_attest_gate.v1.json
gpu_public_contract.v1.json
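
A quick way to confirm that these evidence files are present in a checkout is sketched below. EVIDENCE_DIR and check_evidence are illustrative names; the sketch only tests that each file exists and is non-empty, and it makes no assumption about the JSON contents.

```shell
#!/bin/sh
# Sketch: confirm the four evidence files listed above exist and are non-empty.
# EVIDENCE_DIR and check_evidence are hypothetical names for this example.
EVIDENCE_DIR="${EVIDENCE_DIR:-artifacts/omega}"

check_evidence() {
    missing=0
    for f in gpu_codegen_parity.v1.json gpu_binary_attestation.v1.json \
             gpu_runtime_attest_gate.v1.json gpu_public_contract.v1.json; do
        if [ -s "$EVIDENCE_DIR/$f" ]; then
            echo "ok       $f"
        else
            echo "missing  $f"
            missing=$((missing + 1))
        fi
    done
    # Exit status is non-zero when any file is absent or empty.
    [ "$missing" -eq 0 ]
}
```

Calling check_evidence from the repository root prints one status line per file.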

Current attested compute lanes:

CUDA: cuda-sm80
ROCm: rocm-gfx942

Where the bigger GPU implementation lives

self-hosted/gpu/ contains PTX, SPIR-V, Metal, runtime, tensor, and tuning work.
docs/features/GPU_RUNTIME.md is the repo-native explanation of the current contract.
The self-hosted tree still contains an internal gpu-emit path, but the checked public CLI path is build --backend gpu.