
Introducing Grouped GEMM APIs in cuBLAS and More Performance Updates

The latest release of the NVIDIA cuBLAS library, version 12.5, continues to deliver new functionality and performance improvements for deep learning (DL) and high-performance computing (HPC) workloads. This post provides an overview of the following updates to cuBLAS matrix multiplications (matmuls) since version 12.0, along with a walkthrough:

  1. Grouped GEMM APIs for single, double, and half precisions
  2. Latest LLM matmul performance on NVIDIA Hopper (H100 and H200) and NVIDIA Ada (L40S) GPUs
  3. A note on cuBLAS performance tuning options, benchmarking, and API recommendations
  4. Improved functional coverage in cuBLASLt

Grouped GEMM APIs

Grouped GEMM APIs can be viewed as a generalization of the batched GEMM APIs, enabling matrix multiplications with different sizes, transpositions, and scaling factors to be grouped and parallelized in a single kernel launch.

One example where this approach provides a speedup is the generation phase of a mixture-of-experts (MoE) model with batch sizes of 8 and 64 and FP16 inputs and outputs. In this example, the grouped GEMM API can achieve a 1.2x speedup over naive looping with the batched GEMM API.

This is notable because the current grouped GEMM kernels only leverage warp-level MMA instructions, yet they compete with the batched GEMM kernels, which leverage warp group-level MMA (wgmma) instructions.

Two new sets of APIs are available in the cuBLAS library for grouped GEMM support:

  1. cublas<t>gemmGroupedBatched for FP32 (including TF32) and FP64 precisions, where <t> is S or D for single and double precision, respectively.
  2. cublasGemmGroupedBatchedEx for FP16, BF16, FP32 (including TF32), and FP64 precisions.

These APIs currently support variable shapes, transpositions, and scaling factors. Visit NVIDIA/CUDALibrarySamples on GitHub to see examples for cuBLAS Extension APIs and cuBLAS Level 3 APIs.
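The sketch below shows roughly what a grouped GEMM call looks like, using the FP32 variant and two groups of different shapes. It assumes the parameter layout documented for these APIs in cuBLAS 12.5 (per-group shape, transpose, and scaling arrays, plus one pointer per individual problem); the shapes, group sizes, and pointer arrays are illustrative only, and the samples linked above remain the authoritative reference.

```c
#include <cublas_v2.h>

// Minimal sketch: two groups of FP32 GEMMs launched with a single grouped call.
// Device allocation, pointer-array setup, and error checking are omitted.
void grouped_gemm_sketch(cublasHandle_t handle,
                         const float *const dA[],   // sum(group_size) entries:
                         const float *const dB[],   // one pointer per problem
                         float *const dC[]) {
    const int group_count = 2;
    const int group_size[2] = {8, 64};              // problems per group

    // Per-group parameters: each group can have its own shapes,
    // transpositions, and scaling factors.
    const cublasOperation_t transa[2] = {CUBLAS_OP_N, CUBLAS_OP_N};
    const cublasOperation_t transb[2] = {CUBLAS_OP_N, CUBLAS_OP_T};
    const int m[2] = {128, 256}, n[2] = {8, 64}, k[2] = {4096, 4096};
    const int lda[2] = {128, 256}, ldb[2] = {4096, 64}, ldc[2] = {128, 256};
    const float alpha[2] = {1.0f, 1.0f}, beta[2] = {0.0f, 0.0f};

    // All 8 + 64 = 72 GEMMs run from one launch instead of two separate
    // batched calls (or 72 individual GEMM calls).
    cublasSgemmGroupedBatched(handle, transa, transb, m, n, k,
                              alpha, dA, lda,
                              dB, ldb,
                              beta, dC, ldc,
                              group_count, group_size);
}
```

The batched GEMM APIs require every problem in the batch to share one shape and one set of scaling factors; the grouped APIs relax that restriction while keeping the single-launch benefit.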

Latest LLM matmul performance on NVIDIA H100, H200, and L40S GPUs

The latest snapshot of matmul performance for NVIDIA H100, H200, and L40S GPUs is presented in Figure 1 for Llama 2 70B and GPT3 training workloads. These speedups are measured without locking GPU clocks and account for the number of times each GEMM is repeated in the workload. A speedup of nearly 3x and 5x can be seen on H200 compared to A100 for Llama 2 70B and GPT3 training phase, respectively. The final end-to-end speedup of the actual workload will depend on these speedups and the speedup of the non-GEMM fraction of each workload.

Figure 1. Speedup of the GEMM-only fraction of end-to-end Llama 2 70B and GPT3 workloads on L40S, H100-SXM, and H200-SXM GPUs compared to A100 PCIe

Library performance and benchmarking

This section covers the following topics:

  1. Runtime performance heuristics, or how the cuBLAS library dispatches the fastest implementation on average. This explains why users sometimes encounter performance gaps when comparing cuBLAS with other backends.
  2. The performance tuning API in the cuBLAS library, which unlocks faster implementations when available. This explains how users can reach the best performance with cuBLAS before separate specialized kernels are needed.

Runtime heuristics

The cuBLAS library leverages a recommender system at runtime to dispatch the fastest configuration possible for any user-requested matmul. Each configuration comprises an implementation (kernel) and its runtime launch parameters. This recommender system is trained on actual timing data from running a large number of problems (covering multiple precisions, matrix shapes, layouts, and epilogues) with several available configurations on the GPU.

Figure 2 shows an example of what this data looks like on a Hopper GPU fixed at a typical operating clock, with performance normalized to the GPU peak at that clock.

Figure 2. Sampling of various GEMMs using multiple configurations (kernels and launch parameters) from different kernel families in the cuBLAS library, plotted against the GPU roofline at a fixed frequency (y-axis: % of theoretical peak; x-axis: arithmetic intensity in FLOP/B, log scale)
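To make the x-axis in Figures 2 and 3 concrete, the arithmetic intensity of a matmul is its floating-point work divided by the bytes it moves. The sketch below is a rough back-of-the-envelope calculation, assuming each operand is read once and the output is written once; the shapes are illustrative.

```c
#include <stdio.h>

// Arithmetic intensity (FLOP/B) of a GEMM D = A*B, where A is MxK and B is KxN.
// Assumes each operand is read once, the output is written once, and
// element_size is the number of bytes per element (for example, 2 for FP16).
double gemm_arithmetic_intensity(long long M, long long N, long long K,
                                 int element_size) {
    double flops = 2.0 * M * N * K;                           // multiply + add
    double bytes = (double)element_size * (M * K + K * N + M * N);
    return flops / bytes;
}

int main(void) {
    // A large square FP16 GEMM typically lands in the compute-bound region...
    printf("AI = %.1f FLOP/B\n", gemm_arithmetic_intensity(4096, 4096, 4096, 2));
    // ...while a skinny GEMM (for example, a batch-8 generation matmul)
    // falls in the memory-bound region of the roofline.
    printf("AI = %.1f FLOP/B\n", gemm_arithmetic_intensity(4096, 8, 4096, 2));
    return 0;
}
```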

The goal of this recommender system is then to choose the fastest implementation for each of these problems at runtime. In the absence of auto-tuning, this can only be achieved on average across the problem space. Figure 3 shows the outcome of training such a recommender system: the runtime heuristics achieve 93% (geomean) of the best available performance across the visualized dataset.

Figure 3. Relative performance of the implementation chosen by the runtime heuristics over the best implementation available in the cuBLAS library, with a geomean of 93% (x-axis: arithmetic intensity in FLOP/B, log scale)

It’s important to note that there are always specific matmul problems where this recommender system would not return the best implementation out of the box, even if that implementation is available in the library. 

Performance tuning with cuBLASLt APIs

This section describes how to ensure meaningful comparisons across different backends and how to further close the gap to the best available implementation using the cuBLAS heuristics API.

So far, the assumption has been that users leverage the cuBLAS matmul API (that is, cublasLtMatmul), which in turn dispatches the best implementation option according to the default heuristics summarized earlier. As this option is not always guaranteed to be strictly the fastest, the cuBLAS library exposes another API, cublasLtMatmulAlgoGetHeuristic, which enables users to easily perform auto-tuning and dispatch a faster implementation than the one returned by the default runtime heuristics.

This API accepts the same problem description that users would otherwise pass to the matmul API. But instead of dispatching a matmul, it returns a list of candidate implementations that users can iterate over, run once each, and from which they can select a potentially faster implementation. Visit NVIDIA/CUDALibrarySamples to see an example of auto-tuning in cuBLAS. Figure 4 shows the main sections of this example.

Figure 4. An example of auto-tuning in cuBLAS (left) with the three main operations involved (right)
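The sketch below illustrates that flow. It assumes the cuBLASLt handle, matmul descriptor, matrix layouts, preference object, and device buffers have already been created as in the linked sample; the candidate count and single-run timing are illustrative (in practice, warm up and average over several runs).

```c
#include <cublasLt.h>
#include <cuda_runtime.h>
#include <float.h>

// Sketch: query several candidate algorithms, time each once, keep the fastest.
// Descriptor, layout, and buffer creation plus error checking are omitted.
cublasLtMatmulAlgo_t pick_fastest_algo(
    cublasLtHandle_t ltHandle, cublasLtMatmulDesc_t operationDesc,
    cublasLtMatrixLayout_t Adesc, cublasLtMatrixLayout_t Bdesc,
    cublasLtMatrixLayout_t Cdesc, cublasLtMatrixLayout_t Ddesc,
    cublasLtMatmulPreference_t preference,
    const float *alpha, const void *dA, const void *dB,
    const float *beta, const void *dC, void *dD,
    void *workspace, size_t workspaceSize, cudaStream_t stream) {

    // 1. Ask the heuristics for up to 8 candidate implementations.
    const int requested = 8;
    cublasLtMatmulHeuristicResult_t results[8];
    int returned = 0;
    cublasLtMatmulAlgoGetHeuristic(ltHandle, operationDesc,
                                   Adesc, Bdesc, Cdesc, Ddesc,
                                   preference, requested, results, &returned);

    // 2. Run each candidate once and time it with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int best = 0;
    float bestMs = FLT_MAX;
    for (int i = 0; i < returned; ++i) {
        cudaEventRecord(start, stream);
        cublasLtMatmul(ltHandle, operationDesc, alpha,
                       dA, Adesc, dB, Bdesc, beta,
                       dC, Cdesc, dD, Ddesc,
                       &results[i].algo, workspace, workspaceSize, stream);
        cudaEventRecord(stop, stream);
        cudaEventSynchronize(stop);

        float ms = 0.f;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < bestMs) { bestMs = ms; best = i; }
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    // 3. Reuse the winning algo for subsequent cublasLtMatmul calls
    //    with the same problem description.
    return results[best].algo;
}
```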

Since the performance tuning options on recent GPUs are only available through the cublasLt* APIs, any benchmarking must be done with this set of APIs and by leveraging cublasLtMatmulAlgoGetHeuristic. Note that PyTorch matmul does not currently expose this capability, so it does not reflect the best available performance if used as a proxy to benchmark cuBLAS. While a similar option, cublasGemmAlgo_t, is available in the cublasGemmEx API, it is a no-op on NVIDIA Ampere architecture and newer GPUs.

Migrate to cublasLtMatmul API

It is recommended that advanced users on NVIDIA Ampere architecture and newer GPUs migrate from cublasGemmEx to cublasLtMatmul. In addition to the performance tuning options discussed earlier, the transition to the cublasLtMatmul API unlocks access to fused epilogues and the growing support for mixed-precision matrix multiplications. To facilitate this transition, see the cuBLASLt Library API examples.
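As a small illustration of what the cublasLtMatmul path adds, the sketch below attaches a fused bias + ReLU epilogue to an existing matmul descriptor, something cublasGemmEx cannot express. The descriptor creation is assumed to have happened elsewhere, and dBias is a hypothetical device pointer used only for illustration.

```c
#include <cublasLt.h>

// Sketch: attach a fused bias + ReLU epilogue to an existing matmul descriptor.
// operationDesc is assumed to have been created with cublasLtMatmulDescCreate;
// dBias is a hypothetical device pointer to a bias vector of length M.
void set_relu_bias_epilogue(cublasLtMatmulDesc_t operationDesc,
                            const void *dBias) {
    cublasLtEpilogue_t epilogue = CUBLASLT_EPILOGUE_RELU_BIAS;
    cublasLtMatmulDescSetAttribute(operationDesc,
                                   CUBLASLT_MATMUL_DESC_EPILOGUE,
                                   &epilogue, sizeof(epilogue));
    cublasLtMatmulDescSetAttribute(operationDesc,
                                   CUBLASLT_MATMUL_DESC_BIAS_POINTER,
                                   &dBias, sizeof(dBias));
    // A subsequent cublasLtMatmul call using this descriptor applies the bias
    // add and ReLU inside the same kernel as the matmul, avoiding extra
    // elementwise kernels and round trips to memory.
}
```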

Better functionality and performance in cuBLASLt

Since cuBLAS 12.0, the following enhancements have been included in the cuBLAS library:

  1. Fused epilogue support parity between BF16 and FP16 precisions on NVIDIA Ampere and NVIDIA Ada.
  2. Additional fused epilogues on NVIDIA Hopper and NVIDIA Ampere.
  3. Support for FP8 on NVIDIA Ada GPUs and performance updates for the Ada L4, L40, and L40S.
  4. Removal of M, N, and batch size limitations of the cuBLASLt matmul API, closing functional gaps relative to the cuBLAS gemmEx API.
  5. Improved performance of heuristics cache for workloads with high eviction rate.
  6. cuBLAS symbols are available in CUDA Toolkit symbols for Linux repository. To learn more, see NVIDIA CUDA Toolkit Symbol Server.

To learn more about cuBLAS, see the documentation and samples.
