Kretikos GPU Compiler
Scope note: Kretikos is an active research lane. Clinical pharmacology (vancomycin ε gates, PBPK dissertation demo) is the primary public demo surface on this website — not GPU storefront features.
Kretikos (Κρητικός) = Cretan. Like Ariadne’s thread through the Labyrinth of Knossos, this is the named path for Sounio’s GPU compiler work: source-level thread builtins, serial CPU fallback, in-tree PTX/CUBIN templates, and artifact bundles.
Executive reading
Most languages wave at GPU ambition and hide the current boundary. Kretikos does the opposite: it exposes a bounded GPU surface as a first-class language feature, then ties each claim to an inspectable artifact or command.
- Checked x-axis builtins compile today: gpu_thread_id_x(), gpu_block_id_x(), gpu_block_dim_x(), gpu_sync_threads()
- y/z builtin spellings are recognized; the native CPU fallback returns deterministic stubs
- Predefined PTX templates are live via kretikos emit-ptx using the in-tree self-hosted/gpu/ptx.sio emitter
- Predefined CUBIN templates are live via kretikos emit-cubin using the in-tree self-hosted/gpu/nvidia_bare.sio emitter
- Structural artifact bundles are live via kretikos bundle, with PTX/CUBIN hashes and explicit non-claims
- Kretikos depends on the active bin/souc/souc-linux-x86_64 compiler to build its small emitter drivers
- CPU fallback path is deterministic and serial; it is not a parallel GPU simulator
- The backend source (PTX emitter, SPIR-V lowerer, CUDA ELF generator) lives in-tree at self-hosted/gpu/
What Kretikos proves
A language serious about scientific software must not delegate GPU honesty to a black-box runtime. Kretikos demonstrates that Sounio can:
- expose thread indexing as source-level syntax, not driver-level magic
- emit predefined PTX and CUBIN artifact templates from self-hosted code
- run kernels on a deterministic serial CPU fallback for syntax and runtime smoke tests
- keep the compiler, runtime, and narrative aligned under real pressure
Bare-metal syntax
kernel fn vec_add(n: i64) with GPU {
    let tid = gpu_thread_id_x()
    let bid = gpu_block_id_x()
    let bdim = gpu_block_dim_x()
    let i = bid * bdim + tid
    if i >= n { return }
    // This thread owns element i.
}

fn main() with GPU, IO {
    let grid = (16, 1, 1)
    let block = (64, 1, 1)
    perform GPU.launch(vec_add, grid, block)(1024)
    perform GPU.sync()
}
This is the checked public shape today: 1D indexing plus deterministic CPU fallback. Broader y/z lowering and richer memory surfaces are separate promotion lanes.
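To make the 1D indexing concrete, here is a minimal Python model of what a serial fallback launch could look like for the kernel above. It is a sketch, not Sounio's runtime: launch_serial and vec_add_model are hypothetical names, and the block-major iteration order is an assumption (the text above only states that the fallback is deterministic and serial). gpu_sync_threads() is not modelled.

```python
from typing import Callable, List, Tuple

Dim3 = Tuple[int, int, int]


def launch_serial(kernel: Callable[..., None], grid: Dim3, block: Dim3, *args) -> None:
    """Call `kernel` once per (block, thread) pair along x, one after another.

    ASSUMPTION: blocks are visited in order, then threads within each block.
    Only the x components of `grid` and `block` are used, matching the 1D shape.
    """
    grid_x, block_x = grid[0], block[0]
    for bid in range(grid_x):
        for tid in range(block_x):
            kernel(tid, bid, block_x, *args)


def vec_add_model(tid: int, bid: int, bdim: int, n: int, owned: List[int]) -> None:
    """Hypothetical Python stand-in for the vec_add kernel body."""
    i = bid * bdim + tid  # same global index as in the Sounio kernel
    if i >= n:            # threads past the end of the data do nothing
        return
    owned.append(i)       # "this thread owns element i"


owned: List[int] = []
launch_serial(vec_add_model, (16, 1, 1), (64, 1, 1), 1024, owned)
print(len(owned), owned[0], owned[-1])  # 1024 0 1023
```

With a grid of 16 blocks and 64 threads per block, the launch covers exactly the 1024 indices passed as n. If n were smaller than 16 * 64, the bounds check would make the surplus threads return early.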
Verified commands
# Compile and run on serial CPU fallback (no GPU hardware required)
souc examples/kernel_source_level.sio /tmp/kretikos_demo.elf
/tmp/kretikos_demo.elf
# Emit predefined PTX templates (self-hosted, no GPU required)
kretikos emit-ptx vec_add -o /tmp/kretikos.ptx
kretikos emit-ptx vec_sub -o /tmp/kretikos.ptx
kretikos emit-ptx vec_mul -o /tmp/kretikos.ptx
kretikos emit-ptx vec_div -o /tmp/kretikos.ptx
kretikos emit-ptx vec_add_f64 -o /tmp/kretikos.ptx
kretikos emit-ptx fma -o /tmp/kretikos.ptx
kretikos emit-ptx fma_f64 -o /tmp/kretikos.ptx
kretikos emit-ptx store_u32_const -o /tmp/kretikos.ptx
# Emit predefined Metal/MSL templates (self-hosted, no macOS required)
kretikos emit-metal vec_add -o /tmp/kretikos.metal
kretikos emit-metal ossm_oct_step -o /tmp/kretikos.metal
kretikos emit-metal sedenion_cd_step -o /tmp/kretikos.metal
# Emit the predefined vec_add_f32 CUBIN template
kretikos emit-cubin vec_add_f32 -o /tmp/kretikos.cubin
# Emit a structural artifact bundle with hashes and boundaries
kretikos bundle -o /tmp/kretikos-bundle
# Add optional toolchain/runtime validation when the host exposes those tools
kretikos bundle -o /tmp/kretikos-validated-bundle --validate-toolchain --validate-runtime
# Build PTX with the checked GPU artifact (broader pattern support)
export SOUC_GPU_BIN="$(pwd)/artifacts/omega/souc-bin/souc-linux-x86_64-gpu"
"$SOUC_GPU_BIN" build examples/kernel_source_level.sio --backend gpu -o /tmp/kretikos.ptx
Why this matters
GPU work is where language marketing goes to die. It is easy to sketch intrinsics or promise tensor-core support. It is much harder to keep a self-hosted compiler, a bounded GPU surface, and a public-facing narrative aligned while real bugs surface in stack alignment, launch parameters, and PTX module loading.
Kretikos is valuable not only because it accelerates compute, but because it forces the language to reveal whether its honesty survives contact with backend artifacts.
Support tier
- checked builtins: gpu_thread_id_x, gpu_block_id_x, gpu_block_dim_x, gpu_sync_threads (self-hosted compiler)
- recognized CPU-fallback stubs: y/z thread/block IDs return 0; y/z block dimensions return 1
- PTX templates: kretikos emit-ptx with 6 patterns (vec_add, vec_sub, vec_mul, vec_div, vec_add_f64, fma); broader PTX emission through build --backend gpu (checked GPU artifact)
- MSL templates: kretikos emit-metal with 3 patterns (vec_add, ossm_oct_step, sedenion_cd_step) from in-tree self-hosted emitter
- CUBIN templates: kretikos emit-cubin (in-tree self-hosted emitter)
- Artifact bundle: kretikos bundle emits PTX+CUBIN plus sounio.kretikos.bundle.v1
- Optional promotion checks: --validate-toolchain records ptxas/nvdisasm evidence, and --validate-runtime attempts the CUDA Driver API rung with exact not_run reasons on non-GPU hosts
- backend source: self-hosted/gpu/ptx.sio, self-hosted/gpu/ptx_advanced.sio, self-hosted/gpu/nvidia_bare.sio, self-hosted/gpu/spirv_lower.sio
- attested targets: CUDA sm_80, ROCm gfx942
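One way to read the y/z stub values in the list above: if a kernel applied the usual per-axis formula bid * bdim + tid on every axis, IDs of 0 and a block dimension of 1 would place every thread at y = 0 and z = 0. The Python sketch below only illustrates that arithmetic. The names STUB_ID, STUB_DIM, global_index, and fallback_coords are hypothetical, and nothing here describes how Sounio implements the stubs.

```python
# Stub values as stated in the support tier list: y/z IDs are 0, y/z block dims are 1.
STUB_ID = 0
STUB_DIM = 1


def global_index(tid: int, bid: int, bdim: int) -> int:
    """Per-axis global index, the same formula the 1D kernel uses on x."""
    return bid * bdim + tid


def fallback_coords(tid_x: int, bid_x: int, bdim_x: int) -> tuple:
    """Coordinates a kernel would compute if y/z builtins returned the stubs."""
    x = global_index(tid_x, bid_x, bdim_x)
    y = global_index(STUB_ID, STUB_ID, STUB_DIM)  # 0 * 1 + 0
    z = global_index(STUB_ID, STUB_ID, STUB_DIM)  # 0 * 1 + 0
    return (x, y, z)


print(fallback_coords(5, 2, 64))  # (133, 0, 0)
```

Under this reading, a kernel that mentions y/z builtins still compiles and runs on the fallback, but it only ever touches the first row and first plane of its data.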
The boundary
Kretikos does not claim:
- that every GPU backend is production-complete
- that multi-dimensional kernels emit PTX for all patterns (currently pattern-matched)
- that gpu.alloc<T>() or shared-memory abstractions are checked public surface
- that the CPU fallback simulates a parallel GPU grid
The honest claim is narrower and therefore stronger: there is a GPU compiler lane here, and the current Kretikos CLI artifact bundle is a predefined-template surface, not arbitrary user-kernel lowering. Toolchain and runtime validation apply only to those selected templates; they do not validate arbitrary user-written GPU kernels.
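Because the bundle ties its claims to hashes, a reader may want to recompute digests independently and compare them by eye. The Python sketch below does that for the PTX and CUBIN files in a bundle directory. It makes several assumptions that the text does not confirm: that the hashes are SHA-256, that artifacts carry .ptx and .cubin suffixes, and that comparing against the sounio.kretikos.bundle.v1 record is done manually, since its schema is not described here. hash_artifacts is a hypothetical helper name.

```python
import hashlib
from pathlib import Path
from typing import Dict, Tuple


def hash_artifacts(bundle_dir: str,
                   suffixes: Tuple[str, ...] = (".ptx", ".cubin")) -> Dict[str, str]:
    """Return {relative path: SHA-256 hex digest} for artifact files in a bundle.

    ASSUMPTION: SHA-256 and the suffix list are illustrative choices, not
    confirmed details of the Kretikos bundle format.
    """
    root = Path(bundle_dir)
    digests: Dict[str, str] = {}
    for path in sorted(root.rglob("*")):  # sorted for a stable, repeatable order
        if path.is_file() and path.suffix in suffixes:
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests
```

For example, after running kretikos bundle -o /tmp/kretikos-bundle, calling hash_artifacts("/tmp/kretikos-bundle") would list one digest per emitted template, which can then be compared with whatever the bundle itself records.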