2024-06-03

Programming on GPU with Triton

Why you should learn Triton instead of CUDA


GPU programming is performance-critical, and it has traditionally meant writing kernels in a low-level language like CUDA. Triton, a language developed by OpenAI, lets you write kernels at a higher level in Python and compiles them down to the GPU.

This is one of the talks from the GPU optimization workshop hosted by Chip Huyen, and these are my notes on it.

Philippe Tillet of OpenAI talks about Triton

  1. Triton is a block-based programming language for GPUs.

  2. Triton is an alternative to CUDA.

  3. CUDA is very flexible and gives the developer control over almost everything (e.g. what every thread does, what goes into memory, which data structures to use).

  4. But this can also be a con: it complicates things and can unknowingly hurt performance. In my opinion, CUDA also has a steep learning curve.

  5. There are graph compilers which are simpler but lack flexibility.

  6. So Triton is the middle ground: it's simpler than CUDA but still provides a lot of flexibility.

  7. Triton works on more or less any consumer hardware that follows the typical von Neumann architecture (a shared cache, memory controllers, multiple cores, and a local cache for each core).

  8. With Triton we're essentially programming these cores individually.

  9. It's up to the compiler to decide which cache to use.

  10. Triton code looks similar to NumPy or PyTorch: a Python function decorated with @triton.jit (see the vector-add sketch after this list).

  11. Triton is highly performant because its compiler applies several optimizations automatically:

  12. Peephole optimization - the compiler recognizes patterns and rewrites tensor ops into code that is faster, uses less memory, and takes less code. Examples of these rewrites are removing redundant code and combining operations (there's a toy sketch after this list).

  13. SRAM allocation - the compiler allocates fast on-chip memory for you: not only registers but also shared memory.

  14. Automatic vectorization - the compiler analyzes how loops can be combined so that scalar code becomes vectorized code.
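
Below is a minimal sketch of what a Triton kernel looks like in practice: a plain vector-add written as a Python function decorated with @triton.jit. The kernel name, the BLOCK_SIZE value, and the wrapper function are my own choices for illustration, not anything from the talk.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one block of BLOCK_SIZE contiguous elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the last, possibly partial, block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)  # launch one program per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

Notice that you only describe what one block does to its tile of data; the compiler decides how that tile is placed in registers or shared memory and can turn the contiguous masked loads into vectorized memory accesses (points 9, 13 and 14 above).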
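
And here's a toy illustration of the peephole idea from point 12 (not Triton's actual pass): scan a short list of ops and drop patterns that cancel out, such as two back-to-back transposes of the same tensor.

```python
def peephole(ops: list[str]) -> list[str]:
    """Remove adjacent pairs of 'trans' ops, since trans(trans(x)) == x."""
    out = []
    for op in ops:
        if out and out[-1] == "trans" and op == "trans":
            out.pop()  # the two transposes cancel; drop both
        else:
            out.append(op)
    return out


print(peephole(["load", "trans", "trans", "dot", "store"]))
# ['load', 'dot', 'store']
```

The real compiler works on its intermediate representation rather than strings, but the principle is the same: recognize a local pattern and replace it with something cheaper.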

My recommendation is to start with Triton and then move to CUDA for advanced or complex use cases.

Resources

  1. Triton Puzzles by Sasha Rush
  2. A Practitioner's Guide to Triton by Umer Hayat Adil