Triton deletes a level

Lesson 1 · mission: understand the GPU execution model · the diagram, re-drawn for Triton

The CUDA execution model: a GRID of blocks (gridDim.x/y/z) with one block highlighted at (1,0); a BLOCK of threads (blockDim.x/y/z) with one thread highlighted at (2,3); and a single THREAD. blockIdx locates the block in the grid, threadIdx the thread in the block.
The CUDA execution model (Netra Runtime). Triton keeps the GRID and BLOCK; it removes the THREAD from your code.

The diagram you shared is the CUDA execution model: three nested levels — a GRID of BLOCKS, each block a 3-D set of THREADS. CUDA asks you to write code from the point of view of one single thread, which figures out which data element it owns using blockIdx and threadIdx.

Triton keeps the grid and the block. It deletes the thread from your code. You write one program that owns a whole block of data at once. That is the entire idea of this lesson — everything below is detail.

The three levels, in one breath

LevelIn the diagramWhat it is
GRIDthe big outer cube, gridDim.{x,y,z}All the work for one kernel launch.
BLOCKthe small purple cube at (1,0), blockDim.{x,y,z}A group of threads that can share fast memory and synchronize.
THREADthe tiny orange cell at (2,3)"A single thread of computation, minding its own business" — processes one scalar.

CUDA vs Triton: who writes what

CUDA — you are one thread

Triton — you are one program

OpenAI's own words: Triton revisits the SPMD model and proposes "a variant in which programs — rather than threads — are blocked." Kernels are "launched concurrently with different program_id's on a grid of so-called instances." (Introducing Triton, OpenAI)

The vocabulary map

CUDA (the diagram)TritonNote
blockIdx.xtl.program_id(axis=0)Which block/program am I? (Literally reads blockIdx.x in the compiled PTX.)
gridDim.xthe launch grid tupleHow many programs were launched.
blockDim.xBLOCK_SIZE (a tl.constexpr)How many elements one program owns — set by you, tuned by autotuner.
threadIdx.x— none —There is no thread index in Triton. The compiler owns the threads.

See it in the canonical first kernel

Vector add: output = x + y. Watch how every line is about the block, never a thread.

@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements,
               BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)            # which block am I?  (≈ blockIdx.x)
    block_start = pid * BLOCK_SIZE         # first element MY block owns
    offsets = block_start + tl.arange(0, BLOCK_SIZE)   # a whole TILE of indices
    mask = offsets < n_elements            # guard the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)            # load the tile
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y                         # one op over the WHOLE tile
    tl.store(output_ptr + offsets, output, mask=mask)  # store the tile

Source (official tutorial): Triton · Vector Addition↗

# launch: how many programs (blocks) do we need?
grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)

The tells that you're at the block level, not the thread level:

"But surely the threads still exist?" Yes — on the hardware. tl.program_id(0) compiles down to blockIdx.x, and the tile becomes many threads. The point is they're absent from your source code: the compiler does coalescing, thread swizzling, shared-memory allocation/synchronization, vectorization and tensor-core scheduling for you. (Triton docs)
Cold-recall defense (one breath):
CUDA = write code for one thread that grabs one element via blockIdx+threadIdx. Triton = write code for one program instance that grabs a whole tile via tl.program_id; there is no threadIdx, and the compiler manages the threads.

Read next

Primary source

Read Introducing Triton (OpenAI) — ~10 min. Stop at the line about "programs — rather than threads — are blocked"; that one sentence is this whole lesson.

💬 I'm your teacher for this — ask me followups any time. Confused about why a "program" becomes "threads" on the GPU, or what a tile really is? Ask, and I'll build the next lesson around it.

Lesson 1 · Zain's AI Inference Lab · mission: understand the GPU execution model via Triton