Speedup Benchmark vs Vendor Libraries

This document presents a benchmark comparison between our library, BitBLAS, and several vendor libraries (cuBLAS, CUTLASS, bitsandbytes, FasterTransformer, TensorRT-LLM, vLLM, and Marlin) across different matrix operation types (GEMM, GEMV) and data formats (float16xfloat16, int8xint8, float16xint4/nf4). The benchmarks were conducted on NVIDIA GPUs (a 24GB RTX 3090 and an 80GB A100) with CUDA 12.1 installed.

Benchmark Overview

Tested Operations and Formats

  • GEMM (General Matrix Multiply) and GEMV (General Matrix-Vector Multiply)
  • Data formats: float16xfloat16, int8xint8, float16xint4/nf4
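The difference between the two operation types can be sketched with plain NumPy, where GEMV is simply the M == 1 special case of GEMM (the shapes below are illustrative, not taken from the benchmark configurations):

```python
import numpy as np

# GEMM: (M, K) x (K, N) -> (M, N); GEMV is the M == 1 special case.
M, N, K = 8, 16, 32

A = np.random.rand(M, K).astype(np.float16)  # activations
W = np.random.rand(K, N).astype(np.float16)  # weights
gemm_out = A @ W            # GEMM: (8, 32) @ (32, 16) -> (8, 16)

v = np.random.rand(1, K).astype(np.float16)  # a single activation row
gemv_out = v @ W            # GEMV: (1, 32) @ (32, 16) -> (1, 16)

print(gemm_out.shape, gemv_out.shape)
```

GEMV shapes dominate single-token decoding in LLM inference, which is why the configuration table below includes both M == 1 (V*) and large-M (M*) cases.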

Hardware

  • NVIDIA RTX 3090 (24GB)
  • NVIDIA A100 (80GB)

Software

  • CUDA 12.1
  • Compared libraries: cuBLAS, CUTLASS, bitsandbytes, FasterTransformer, TensorRT-LLM, vLLM, Marlin
  • Versions and commit IDs:
    • bitsandbytes == 0.43.0
    • vLLM: 865732342b4e3b8a4ef38f28a2a5bdb87cf3f970
    • FasterTransformer: 1afbf20129647a35d108152fc6789bc1d029cda5
    • TensorRT-LLM: 2bf3a0a4287069ac55ee3304c285b08592d3d1bc
    • CUTLASS: 629f4653c3ea3db3264030382956fabe715f3436
    • Marlin: 512f1b1ba39ff708bcc95419f11cfd1285cd31b3

Results Summary

RTX 3090 Benchmarks

  • Float16 and int8 GEMM with Tensor Cores: BitBLAS matches the performance of cuBLAS and CUTLASS.
  • Float16xnf4 GEMV and GEMM: BitBLAS achieves roughly 2x the speed of bitsandbytes and 4x the speed of the float16 baseline.
  • Float16xint4 GEMM: BitBLAS delivers the best performance among the compared libraries.
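To make the float16xint4 format concrete, here is a hypothetical sketch of weight-only int4 dequantization (the helper `dequantize_int4` is an illustrative assumption, not BitBLAS's actual kernel): each uint8 packs two 4-bit unsigned values, a zero point of 8 recovers the signed range [-8, 7], and a float16 scale restores magnitude before the float16 matmul.

```python
import numpy as np

def dequantize_int4(packed: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Illustrative only: real kernels fuse this unpacking into the matmul.
    lo = (packed & 0x0F).astype(np.int8) - 8   # low nibble, zero point 8
    hi = (packed >> 4).astype(np.int8) - 8     # high nibble, zero point 8
    vals = np.stack([lo, hi], axis=-1).reshape(packed.shape[0], -1)
    return vals.astype(np.float16) * scale     # rescale to float16

packed = np.array([[0x98, 0x07]], dtype=np.uint8)  # nibbles: (8, 9), (7, 0)
scale = np.float16(0.5)
print(dequantize_int4(packed, scale))  # values: 0.0, 0.5, -0.5, -4.0
```

nf4 works the same way structurally, except the 16 codes index a fixed non-uniform lookup table instead of mapping linearly to [-8, 7].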

A100 Benchmarks

  • Int4 Dequantize Performance: BitBLAS outperforms bitsandbytes, faster-transformer, tensorrt-llm, vLLM, and Marlin.

Benchmark Configuration

The benchmark configurations for each test scenario are detailed below:

| config | Provider | M     | N     | K     |
|--------|----------|-------|-------|-------|
| V0     | None     | 1     | 16384 | 16384 |
| V1     | BLOOM    | 1     | 43008 | 14336 |
| V2     | BLOOM    | 1     | 14336 | 14336 |
| V3     | BLOOM    | 1     | 57344 | 14336 |
| V4     | BLOOM    | 1     | 14336 | 57344 |
| V5     | OPT      | 1     | 9216  | 9216  |
| V6     | OPT      | 1     | 36864 | 9216  |
| V7     | OPT      | 1     | 9216  | 36864 |
| V8     | LLAMA    | 1     | 22016 | 8192  |
| V9     | LLAMA    | 1     | 8192  | 22016 |
| V10    | LLAMA-2  | 1     | 8192  | 8192  |
| V11    | LLAMA-2  | 1     | 28672 | 8192  |
| V12    | LLAMA-2  | 1     | 8192  | 28672 |
| M0     | None     | 16384 | 16384 | 16384 |
| M1     | BLOOM    | 8192  | 43008 | 14336 |
| M2     | BLOOM    | 8192  | 14336 | 14336 |
| M3     | BLOOM    | 8192  | 57344 | 14336 |
| M4     | BLOOM    | 8192  | 14336 | 57344 |
| M5     | OPT      | 8192  | 9216  | 9216  |
| M6     | OPT      | 8192  | 36864 | 9216  |
| M7     | OPT      | 8192  | 9216  | 36864 |
| M8     | LLAMA    | 8192  | 22016 | 8192  |
| M9     | LLAMA    | 8192  | 8192  | 22016 |
| M10    | LLAMA-2  | 8192  | 8192  | 8192  |
| M11    | LLAMA-2  | 8192  | 28672 | 8192  |
| M12    | LLAMA-2  | 8192  | 8192  | 28672 |

Note: To reproduce the third-party frameworks' benchmark results, please refer to mlc-benchmark.

Benchmark Images

BitNET 1.58B INT8xINT2 matmul batch-size (BS) scaling on A100.

int8xiint2_scaling

RTX 3090 Benchmark Results

3090-gemm-fp16

3090-gemm-s8

3090-nf4-gemv

3090-nf4-gemm

A100 Benchmark Results

a100-wq-gemv

a100-wq-gemm

INT8xUINT1 matmul batch-size (BS) scaling on A100.

int8xiint1_scaling