[Discussion] Refactor the dispatcher design for incoming different backends #81

Open
LeiWang1999 opened this issue Jul 7, 2024 · 0 comments

Our tile-based kernel optimization has limitations for quantized kernels with small LLM shapes, as discussed in issue #64. Since BitBLAS is a library that can provide different backends for different scenarios, PR #80 introduces a CUDA implementation of efficient small-batch quantized matrix multiplication. Looking ahead, we are also considering implementing quantized flash attention with our TL backend. BitBLAS therefore needs to decide when and how to dispatch operation configurations to the different backends, and this new component needs a carefully thought-out design.
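To make the discussion concrete, here is a minimal sketch of one possible dispatcher shape: each backend registers a predicate over the operation configuration, and the dispatcher picks the highest-priority backend that accepts it. Every name here (`MatmulConfig`, `Backend`, `Dispatcher`, the backend labels, and the `M <= 16` small-batch threshold) is hypothetical and chosen only for illustration. None of it is existing BitBLAS API or a proposed final design.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class MatmulConfig:
    """Hypothetical operation configuration (not a BitBLAS class)."""
    M: int  # batch / row dimension
    N: int
    K: int
    w_dtype: str = "int4"


@dataclass
class Backend:
    """Hypothetical backend entry: a name plus a support predicate."""
    name: str
    supports: Callable[[MatmulConfig], bool]
    priority: int = 0  # higher priority is tried first


class Dispatcher:
    def __init__(self) -> None:
        self._backends: List[Backend] = []

    def register(self, backend: Backend) -> None:
        self._backends.append(backend)
        # Keep the list ordered so that select() tries high priority first.
        self._backends.sort(key=lambda b: b.priority, reverse=True)

    def select(self, config: MatmulConfig) -> str:
        # Return the first backend whose predicate accepts the config.
        for backend in self._backends:
            if backend.supports(config):
                return backend.name
        raise LookupError(f"no backend supports {config}")


dispatcher = Dispatcher()
# General fallback: the tile-based path accepts everything.
dispatcher.register(Backend("tile", lambda c: True, priority=0))
# Assumed rule: route small batches to the CUDA kernel (threshold is made up).
dispatcher.register(Backend("cuda_small_batch", lambda c: c.M <= 16, priority=10))

print(dispatcher.select(MatmulConfig(M=1, N=4096, K=4096)))
print(dispatcher.select(MatmulConfig(M=256, N=4096, K=4096)))
```

A predicate-plus-priority registry keeps the "when" (the support rule) next to each backend and the "how" (the selection order) in one place, so adding a future backend would not require touching existing ones. Whether the real rules should be static predicates, measured benchmarks, or something else is exactly the open question of this issue.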
