Patterns in Optimizing Triton Kernels for Reduction Ops
The Triton language offers a tile-oriented, CTA-level programming paradigm that strikes a good balance between hardware control and mental burden, allowing non-GPU experts to author kernels with reasonable performance in a short time. However, when writing Triton kernels for a general-purpose library, we need to choose appropriate algorithms and task-partitioning schemes according to the problem size and the data layout to achieve better performance. This report shares some common tricks and patterns for optimizing reduction-like kernels, using softmax as an example: persistent reduction, the online softmax normalizer, split reduction, and task-partitioning schemes for outer reduction.
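To make the online softmax normalizer concrete before any kernel code, here is a minimal sketch in plain Python rather than Triton. It visits the input block by block, keeping a running maximum and a running normalizer that is rescaled whenever the maximum grows. The function name `online_softmax` and the `block_size` parameter are illustrative assumptions, not part of any library, and the sketch assumes finite inputs.

```python
import math


def online_softmax(xs, block_size=4):
    """Softmax of a list, computing max and normalizer in one blocked pass.

    Hypothetical helper for illustration only; assumes finite inputs.
    """
    running_max = float("-inf")  # largest value seen so far
    running_sum = 0.0            # sum of exp(x - running_max) over values seen so far

    for start in range(0, len(xs), block_size):
        block = xs[start:start + block_size]
        new_max = max(running_max, max(block))
        # Rescale the old normalizer to the new reference maximum,
        # then add this block's contribution.
        running_sum = running_sum * math.exp(running_max - new_max) + sum(
            math.exp(x - new_max) for x in block
        )
        running_max = new_max

    # Final pass: normalize each element with the accumulated statistics.
    return [math.exp(x - running_max) / running_sum for x in xs]


if __name__ == "__main__":
    print(online_softmax([1.0, 2.0, 3.0, 4.0, 5.0], block_size=2))
```

Compared with a textbook softmax, which reads the input once for the maximum and once more for the sum, this form obtains both statistics in a single pass over the data, which matters when a row is too large to keep on chip.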