Triton deletes a level

Lesson 1 · mission: understand the GPU execution model · the diagram, re-drawn for Triton

The CUDA execution model: a GRID of blocks (gridDim.x/y/z) with one block highlighted at (1,0); a BLOCK of threads (blockDim.x/y/z) with one thread highlighted at (2,3); and a single THREAD. blockIdx locates the block in the grid, threadIdx the thread in the block. — The CUDA execution model (Netra Runtime). Triton keeps the GRID and BLOCK; it removes the THREAD from your code.

The diagram you shared is the CUDA execution model: three nested levels — a GRID of BLOCKS, each block a 3-D set of THREADS. CUDA asks you to write code from the point of view of one single thread, which figures out which data element it owns using blockIdx and threadIdx.

Triton keeps the grid and the block. It deletes the thread from your code. You write one program that owns a whole block of data at once. That is the entire idea of this lesson — everything below is detail.

The three levels, in one breath

Level	In the diagram	What it is
GRID	the big outer cube, `gridDim.{x,y,z}`	All the work for one kernel launch.
BLOCK	the small purple cube at `(1,0)`, `blockDim.{x,y,z}`	A group of threads that can share fast memory and synchronize.
THREAD	the tiny orange cell at `(2,3)`	"A single thread of computation, minding its own business" — processes one scalar.

CUDA vs Triton: who writes what

CUDA — you are one thread

Your code runs once per thread.
You compute your element: i = blockIdx.x*blockDim.x + threadIdx.x.
You manage shared memory, __syncthreads(), coalescing, vectorization — by hand.
Mental unit: a scalar.

Triton — you are one program

Your code runs once per block (called a program instance).
You compute your block's range of elements with tl.program_id(0).
The compiler does shared memory, sync, coalescing, vectorization for you.
Mental unit: a tile (a small array).

OpenAI's own words: Triton revisits the SPMD model and proposes "a variant in which programs — rather than threads — are blocked." Kernels are "launched concurrently with different program_id's on a grid of so-called instances." (Introducing Triton, OpenAI)

The vocabulary map

CUDA (the diagram)	Triton	Note
`blockIdx.x`	`tl.program_id(axis=0)`	Which block/program am I? (Literally reads `blockIdx.x` in the compiled PTX.)
`gridDim.x`	the launch `grid` tuple	How many programs were launched.
`blockDim.x`	`BLOCK_SIZE` (a `tl.constexpr`)	How many elements one program owns — set by you, tuned by autotuner.
`threadIdx.x`	— none —	There is no thread index in Triton. The compiler owns the threads.

See it in the canonical first kernel

Vector add: output = x + y. Watch how every line is about the block, never a thread.

@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements,
               BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)            # which block am I?  (≈ blockIdx.x)
    block_start = pid * BLOCK_SIZE         # first element MY block owns
    offsets = block_start + tl.arange(0, BLOCK_SIZE)   # a whole TILE of indices
    mask = offsets < n_elements            # guard the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)            # load the tile
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y                         # one op over the WHOLE tile
    tl.store(output_ptr + offsets, output, mask=mask)  # store the tile

Source (official tutorial): Triton · Vector Addition↗

# launch: how many programs (blocks) do we need?
grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)

The tells that you're at the block level, not the thread level:

tl.arange(0, BLOCK_SIZE) — you build an array of offsets, not a single index.
output = x + y — one line adds 1024 elements at once. In CUDA this would be one scalar add, run by 1024 separate threads.
mask — because a program owns a fixed-size tile, the last (ragged) block masks off the overflow. This is the price of thinking in tiles.

"But surely the threads still exist?" Yes — on the hardware. tl.program_id(0) compiles down to blockIdx.x, and the tile becomes many threads. The point is they're absent from your source code: the compiler does coalescing, thread swizzling, shared-memory allocation/synchronization, vectorization and tensor-core scheduling for you. (Triton docs)

Cold-recall defense (one breath):
CUDA = write code for one thread that grabs one element via blockIdx+threadIdx. Triton = write code for one program instance that grabs a whole tile via tl.program_id; there is no threadIdx, and the compiler manages the threads.

Primary source

Read Introducing Triton (OpenAI) — ~10 min. Stop at the line about "programs — rather than threads — are blocked"; that one sentence is this whole lesson.

💬 I'm your teacher for this — ask me followups any time. Confused about why a "program" becomes "threads" on the GPU, or what a tile really is? Ask, and I'll build the next lesson around it.

Lesson 1 · Zain's AI Inference Lab · mission: understand the GPU execution model via Triton