[Discussion] Refactor the dispatcher design for incoming different backends #81

LeiWang1999 · 2024-07-07T16:59:29Z

Our tile-based kernel optimization has limitations for quantized kernels with small LLM shapes, as discussed in issue #64. Since BitBLAS is a library that can provide different backends for different scenarios, PR #80 introduces a CUDA implementation for efficient small-batch quantized matrix multiplication. Looking ahead, we are also considering implementing quantized flash attention with our TL backend. BitBLAS therefore needs to decide when and how to dispatch operation configurations to different backends, and this new component requires careful design.
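To make the question concrete, here is a minimal sketch of one possible shape for such a dispatcher: an ordered list of rules, where the first backend whose predicate accepts the configuration wins. Everything in it is a hypothetical illustration and not the BitBLAS API: the `OpConfig` fields, the `BackendDispatcher` class, the backend names `"cuda"` and `"tile"`, and the batch threshold of 16 are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class OpConfig:
    """Hypothetical operation configuration (not the real BitBLAS config)."""
    M: int                # batch dimension (rows of the activation)
    N: int                # output features
    K: int                # reduction dimension
    weight_bits: int = 4  # bit width of the quantized weight


# A rule pairs a backend name with a predicate over the configuration.
Rule = Tuple[str, Callable[[OpConfig], bool]]


class BackendDispatcher:
    """Returns the first registered backend whose predicate accepts a config."""

    def __init__(self) -> None:
        self._rules: List[Rule] = []

    def register(self, name: str, accepts: Callable[[OpConfig], bool]) -> None:
        # Registration order is the priority order.
        self._rules.append((name, accepts))

    def dispatch(self, config: OpConfig) -> str:
        for name, accepts in self._rules:
            if accepts(config):
                return name
        raise ValueError(f"no backend accepts {config}")


dispatcher = BackendDispatcher()
# Assumed rule: small-batch quantized matmul goes to the hand-written CUDA path.
# The threshold 16 is illustrative only, not a measured crossover point.
dispatcher.register("cuda", lambda c: c.M <= 16 and c.weight_bits < 16)
# Assumed fallback: the tile-based path handles every other configuration.
dispatcher.register("tile", lambda c: True)

small = dispatcher.dispatch(OpConfig(M=1, N=4096, K=4096))
large = dispatcher.dispatch(OpConfig(M=1024, N=4096, K=4096))
print(small, large)
```

Keeping the rules as data rather than hard-coded branches means a new backend, such as a future TL attention path, could be added by registering one more rule. Whether priority should come from registration order or from measured performance is exactly the kind of design choice this discussion needs to settle.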

LeiWang1999 mentioned this issue Jul 10, 2024

[Dev] Refactor Backend Dispatch and Kernel Wrap Related Design #83

Merged
