Resources

High-trust sources for grounding the AI inference lab. Triton is young and fast-moving, so prefer official docs, primary source material, reproducible benchmarks, and real codebases over generic roadmaps.

How to use these resources

Use Netra-style tasks as the short-term benchmark.
Use Triton/CUDA material for hands-on kernel skill.
Use Baseten's inference material as the high-level map.
Use real runtimes like sam3.c, llama.cpp, vLLM, and SGLang as architecture studies, but scope each study to one subsystem at a time.
Avoid roadmap-driven wandering unless it directly supports an experiment.

Current target

Netra Runtime — AI Engineer Interview Puzzles
Practical benchmark for Triton kernels, quantization/dequantization, QLoRA/FSDP2, torch.compile, upstream PRs, and performance explanations.
Use as a forcing function, not necessarily as the final employer target.

Primary / canonical

Introducing Triton: Open-source GPU programming for neural networks — OpenAI (2021)
Trust: ★★★★★ (the source of record). The phrase that grounds Lesson 0001: Triton uses an SPMD model "in which programs – rather than threads – are blocked." Kernels are "launched concurrently with different program_id's on a grid of so-called instances." ~10-min read.
Triton documentation — Introduction — triton-lang.org
Trust: ★★★★★ (official). Lists exactly what the compiler automates: "automatic coalescing, thread swizzling, pre-fetching, automatic vectorization, tensor core-aware instruction selection, shared memory allocation/synchronization, asynchronous copy scheduling."
Triton tutorial — Vector Addition — triton-lang.org
Trust: ★★★★★ (official). The canonical first kernel. Source of the annotated add_kernel in Lesson 0001.
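
The program-level model quoted above can be mimicked on the CPU before touching a GPU. The sketch below is plain Python, not Triton: `add_program`, `launch`, and `BLOCK_SIZE = 4` are hypothetical names and values chosen for illustration, and the loop over `pid` runs sequentially where a real launch would run instances concurrently. It only mirrors the general shape of a blocked vector add, in which each program instance handles a block of offsets and masks the tail.

```python
import math

BLOCK_SIZE = 4  # illustrative value only; chosen small so the output is easy to read


def add_program(pid, x, y, out, n_elements):
    """One program instance: handles a whole block of elements, not one value."""
    block_start = pid * BLOCK_SIZE
    offsets = [block_start + i for i in range(BLOCK_SIZE)]
    # The mask guards the last block when n_elements is not a multiple of BLOCK_SIZE.
    mask = [off < n_elements for off in offsets]
    for off, in_bounds in zip(offsets, mask):
        if in_bounds:
            out[off] = x[off] + y[off]


def launch(x, y):
    """Run one program per block; returns the result and the grid size."""
    n = len(x)
    out = [0.0] * n
    grid = math.ceil(n / BLOCK_SIZE)  # number of program instances
    for pid in range(grid):  # sequential stand-in for a concurrent launch
        add_program(pid, x, y, out, n)
    return out, grid


x = [float(i) for i in range(10)]
y = [10.0 * i for i in range(10)]
result, grid = launch(x, y)
print(grid)    # 3 programs cover 10 elements
print(result)
```

With 10 elements and a block of 4, the third program has only two valid offsets, which is the case the mask exists for.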

Inference engineering map

Inference Engineering — Baseten
Use as a stack map: models, hardware, software, optimization techniques, modalities, and production. Do not publish standalone chapter summaries; connect reading to experiments.
inferenceengineering.tech
Interactive companion/course. Useful for reinforcement, but lower priority than implementing and benchmarking.

GPU fundamentals

Programming Massively Parallel Processors
Deep CUDA/GPU fundamentals: execution model, memory hierarchy, tiling, synchronization, performance reasoning, and parallel patterns.
How it comes into play: use it as the "why" underneath Triton. Triton hides thread-level CUDA details, but Netra-style kernel work still requires the mental model: coalesced memory access, occupancy, cache behavior, shared memory, arithmetic intensity, and why a kernel is memory-bound or compute-bound.
Read selectively after experiments raise questions. Example: run Triton vector add, then read about memory bandwidth/coalescing; attempt matmul, then read tiling/shared-memory chapters; work on dequantization, then read memory hierarchy and data movement. Avoid reading cover-to-cover before coding.
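
The memory-bound versus compute-bound question can be estimated on paper before profiling. The following is a minimal roofline-style sketch, assuming float32 data and ideal memory traffic (each input read once, each output written once). The peak numbers are hypothetical placeholders, not the specs of any real GPU, and all function names are invented for this example.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte moved to or from memory."""
    return flops / bytes_moved


def vector_add_intensity(n, bytes_per_elem=4):
    # n additions; reads x and y, writes out: 3 * n elements moved.
    return arithmetic_intensity(n, 3 * n * bytes_per_elem)


def matmul_intensity(n, bytes_per_elem=4):
    # n x n matmul: 2 * n^3 FLOPs; ideal traffic touches A, B, and C once each.
    return arithmetic_intensity(2 * n**3, 3 * n * n * bytes_per_elem)


def classify(intensity, peak_flops, peak_bandwidth):
    # The ridge point is the intensity at which compute and bandwidth balance.
    ridge = peak_flops / peak_bandwidth
    return "memory-bound" if intensity < ridge else "compute-bound"


# Hypothetical device, for illustration only.
PEAK_FLOPS = 20e12  # 20 TFLOP/s
PEAK_BW = 1e12      # 1 TB/s, so the ridge point is 20 FLOP/byte

for name, ai in [
    ("vector add", vector_add_intensity(1 << 20)),
    ("matmul 4096", matmul_intensity(4096)),
]:
    print(f"{name}: {ai:.3f} FLOP/byte -> {classify(ai, PEAK_FLOPS, PEAK_BW)}")
```

Under these assumptions vector add sits at 1/12 FLOP per byte, far below the ridge point, while a large matmul sits well above it. Real kernels rarely reach the ideal traffic figure, which is where the tiling and shared-memory chapters come in.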

Runtime codebases to study

NetraRuntime/sam3.c
Still to be studied and personally verified. An external codebase summary describes it as a pure-C SAM3 inference runtime with custom tensor ops, CPU/Metal backends, mmap weight loading, quantization support, feature caching, CLI/bindings, and benchmarks.
Use it as a model-specific runtime engineering case study only after scoping one subsystem at a time, such as weight loading, feature caching, or a single backend kernel family.

Background on the CUDA model (the diagram)

Thread block (CUDA programming) — Wikipedia
Trust: ★★★☆☆ (good enough for the vocabulary in the diagram: threadIdx, blockIdx, blockDim, gridDim).
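
The diagram vocabulary combines into a single global index: in a one-dimensional launch, each thread computes `blockIdx.x * blockDim.x + threadIdx.x`. The sketch below emulates that indexing in plain Python so it can be checked without a GPU. The function names are hypothetical, and the nested loops stand in for threads that would run in parallel.

```python
def global_thread_ids(grid_dim, block_dim):
    """Enumerate the global index each (block, thread) pair would compute."""
    ids = []
    for block_idx in range(grid_dim):        # blockIdx.x
        for thread_idx in range(block_dim):  # threadIdx.x
            ids.append(block_idx * block_dim + thread_idx)
    return ids


def cuda_style_add(x, y, block_dim=4):
    """Thread-per-element vector add, emulated sequentially."""
    n = len(x)
    grid_dim = (n + block_dim - 1) // block_dim  # ceiling division
    out = [0.0] * n
    for i in global_thread_ids(grid_dim, block_dim):
        if i < n:  # the per-thread bounds check; surplus threads do nothing
            out[i] = x[i] + y[i]
    return out, grid_dim * block_dim


out, launched = cuda_style_add([float(i) for i in range(10)], [1.0] * 10)
print(launched)  # 12 threads launched for 10 elements
print(out)
```

Here each thread owns exactly one element, which is the unit of work that Triton replaces with a whole block per program.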

Secondary (orientation, read with care)

Triton Is Not CUDA in Python — It's a Tiling DSL — Medium
Trust: ★★★☆☆. Useful framing ("one program processes a whole tile, not one value"), but verify any claim against the official docs above.
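
The tile framing extends naturally to two dimensions. As a rough illustration in plain Python (not Triton), the sketch below assigns each hypothetical program `(pid_m, pid_n)` a `BLOCK_M` by `BLOCK_N` tile of an `M` by `N` matrix and checks that the tiles cover every element exactly once. All names and sizes are assumptions made for this example.

```python
import math
from collections import Counter


def tile_indices(pid_m, pid_n, M, N, BLOCK_M, BLOCK_N):
    """Elements owned by one program; edge tiles are clipped to the matrix."""
    rows = range(pid_m * BLOCK_M, min((pid_m + 1) * BLOCK_M, M))
    cols = range(pid_n * BLOCK_N, min((pid_n + 1) * BLOCK_N, N))
    return [(r, c) for r in rows for c in cols]


def cover(M, N, BLOCK_M, BLOCK_N):
    """Return the number of programs and how often each element is visited."""
    grid_m = math.ceil(M / BLOCK_M)
    grid_n = math.ceil(N / BLOCK_N)
    counts = Counter()
    for pid_m in range(grid_m):
        for pid_n in range(grid_n):
            counts.update(tile_indices(pid_m, pid_n, M, N, BLOCK_M, BLOCK_N))
    return grid_m * grid_n, counts


programs, counts = cover(M=5, N=7, BLOCK_M=2, BLOCK_N=4)
print(programs)     # 6 programs
print(len(counts))  # 35 elements, each visited once
```

Six programs cover 35 elements here, whereas a thread-per-element launch would need at least 35 threads. The clipping in `tile_indices` plays the same role as a load/store mask on edge tiles.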

Communities (for testing understanding with practitioners)

GPU MODE (formerly CUDA MODE) Discord + YouTube lectures — active practitioner community for Triton/CUDA kernel work.
Triton GitHub Discussions — questions answered by maintainers.
r/CUDA and r/LocalLLaMA for inference-focused discussion.