Experiment 0001 — Triton vector add on a free T4
Status: prepared, not yet run on T4.
Purpose
Close the first real lab loop after Lesson 0001:
- Run a simple Triton kernel on an NVIDIA GPU.
- Check correctness against PyTorch.
- Capture a small benchmark against torch addition.
- Record what broke or surprised me.
This is not meant to prove performance skill yet. It proves the workflow: local notes → free GPU run → captured output → learning record.
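The script itself is not reproduced in these notes, so the following is only a minimal sketch of the three steps above (run a kernel, check it against PyTorch, time both), written in the style of the standard Triton vector-add pattern. The names add_kernel, triton_add, blocks_needed and time_ms, the block size of 1024, and the problem size of 2^20 elements are all assumptions made for illustration, not values taken from the actual vector_add_benchmark.py.

```python
try:
    import torch
    import triton
    import triton.language as tl
except (ImportError, OSError):  # lets this file load where the GPU stack is absent
    torch = triton = tl = None


def blocks_needed(n_elements, block_size):
    """Number of program instances needed to cover n_elements (ceiling division)."""
    return (n_elements + block_size - 1) // block_size


if triton is not None:

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        # Each program instance handles one contiguous block of elements.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements  # guards the last, partially filled block
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)


def triton_add(x, y, block_size=1024):
    """Launch add_kernel over a 1D grid; block_size is an assumed default."""
    out = torch.empty_like(x)
    n = out.numel()
    grid = (blocks_needed(n, block_size),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=block_size)
    return out


def time_ms(fn, warmup=10, reps=100):
    """Average GPU time per call in milliseconds, measured with CUDA events."""
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(reps):
        fn()
    end.record()
    torch.cuda.synchronize()  # wait for the GPU before reading the timer
    return start.elapsed_time(end) / reps


def main():
    if torch is None or not torch.cuda.is_available():
        print("PyTorch, Triton and a CUDA GPU are all required; skipping run.")
        return
    n = 1 << 20  # assumed problem size, small enough for a free T4
    x = torch.rand(n, device="cuda")
    y = torch.rand(n, device="cuda")
    torch.testing.assert_close(triton_add(x, y), x + y)
    print("GPU:", torch.cuda.get_device_name(0))
    print("correctness: passed")
    print(f"triton: {time_ms(lambda: triton_add(x, y)):.4f} ms")
    print(f"torch:  {time_ms(lambda: x + y):.4f} ms")


if __name__ == "__main__":
    main()
```

Timing with CUDA events rather than wall-clock time matters here because kernel launches are asynchronous; without the synchronize call the measurement would mostly capture launch overhead.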
Target environment
- Colab or Kaggle notebook
- NVIDIA T4 preferred, but any CUDA GPU is acceptable for the first loop
- Python 3
- PyTorch with CUDA
- Triton
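Before running the benchmark, it may help to confirm that the notebook actually provides the pieces listed above. The following is a small sketch of such a check; the function name describe_environment and the dictionary keys are hypothetical, chosen only to line up with the questions this experiment wants answered (which GPU, and whether Triton imports).

```python
import sys


def describe_environment():
    """Collect the facts worth recording before the first run."""
    info = {
        "python": sys.version.split()[0],
        "torch": None,
        "cuda_available": False,
        "gpu_name": None,
        "triton": None,
        "triton_error": None,
    }
    try:
        import torch

        info["torch"] = torch.__version__
        info["cuda_available"] = bool(torch.cuda.is_available())
        if info["cuda_available"]:
            info["gpu_name"] = torch.cuda.get_device_name(0)
    except Exception:
        pass  # PyTorch missing or broken: leave the defaults in place
    try:
        import triton

        info["triton"] = getattr(triton, "__version__", "unknown")
    except Exception as exc:
        # Keep the error text so it can be pasted into the notes.
        info["triton_error"] = repr(exc)
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If gpu_name comes back empty in Colab, the runtime type is probably still set to CPU and needs to be switched to a GPU runtime before the experiment can proceed.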
How to run
In Colab/Kaggle, upload or paste vector_add_benchmark.py, then run the following (prefix each command with ! when running from a notebook cell):
python vector_add_benchmark.py
If Triton is missing:
pip install triton
python vector_add_benchmark.py
What to capture after running
Paste the terminal/notebook output below.
TODO: paste actual output here after T4 run.
Notes after run
- TODO: What GPU did I get?
- TODO: Did Triton import cleanly?
- TODO: Did correctness pass?
- TODO: Was Triton faster or slower than torch for this tiny benchmark?
- TODO: What did I misunderstand from Lesson 0001?