CUDA ↔ Triton cheat-sheet

Reference · the execution-model vocabulary, compressed · keep open while reading lessons

The one-line model

CUDA: you write code for one thread → one scalar.  Triton: you write code for one program instance → one tile (block of data). The compiler turns your tile into threads.
CUDA execution model diagram: GRID of blocks, BLOCK of threads, and a single THREAD. Triton programs at the BLOCK level and removes the THREAD level from source code.
The CUDA execution model. Triton programs at the BLOCK level; the THREAD level is compiler-managed.

Glossary (adhere to this in every lesson)

TermMeaning
GridAll the work for a single kernel launch; a (multi-dim) array of blocks/programs.
Block (CUDA)A group of threads sharing fast on-chip memory; can synchronize. The unit Triton programs at.
Thread (CUDA)The smallest execution unit; processes one scalar. Invisible in Triton source.
Program instance (Triton)One run of your @triton.jit kernel. Occupies the role of a CUDA block, but its body operates on a whole tile.
Tile (Triton)A small fixed-size array (dims are powers of two) that one program loads, computes, and stores as a unit.
SPMDSingle Program, Multiple Data. Both models are SPMD — Triton just blocks programs instead of threads.
MaskBoolean tile that disables out-of-bounds lanes so the last (ragged) block is safe.

Symbol map

CUDATritonQuestion it answers
blockIdx.xtl.program_id(axis=0)Which block/program am I?
blockIdx.{y,z}tl.program_id(1), tl.program_id(2)2-D / 3-D launch grids.
gridDim.xlaunch grid tuple; tl.num_programs(0)How many programs total?
blockDim.xBLOCK_SIZE: tl.constexprHow many elements per program (you choose / autotune).
threadIdx.x— none —(Compiler-owned. You never write it.)
__shared__ / __syncthreads()— none —(Compiler allocates & synchronizes shared memory.)

What the Triton compiler does for you

Things you'd hand-tune in CUDA that simply don't appear in Triton code:

Triton docs · Introduction↗

The kernel skeleton (memorize the shape)

@triton.jit
def kernel(x_ptr, ..., n_elements, BLOCK_SIZE: tl.constexpr):
    pid     = tl.program_id(axis=0)             # 1. who am I
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)  # 2. my tile of indices
    mask    = offsets < n_elements              # 3. guard the edge
    x       = tl.load(x_ptr + offsets, mask=mask)         # 4. load tile
    # ... compute over the whole tile ...                # 5. tensor math
    tl.store(out_ptr + offsets, result, mask=mask)       # 6. store tile

# launch
grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
kernel[grid](x, ..., n_elements, BLOCK_SIZE=1024)

Six beats: who am I → my indices → mask → load → compute → store.

💬 Ask your teacher to expand any row of these tables into its own lesson.

Reference · Zain's AI Inference Lab · sources: openai.com/index/triton, triton-lang.org