Expected throughput? #3
Can you provide any insight into expected throughput, relative to a "base" transformer implementation?
I.e., if you consider two models with the same hidden size, number of layers, etc., will the sparse_attention version run significantly slower (and if so, presumably because of recompute)?
Apologies if this was covered in the paper--I skimmed and didn't see it addressed.
Am considering getting this up and running--extremely interesting--but would like a sense of whether there is a major throughput hit before doing so.
Thank you--very neat to see a successful evolution from https://openai.com/blog/block-sparse-gpu-kernels/.
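For what it's worth, one rough way to gauge the recompute cost in isolation (before wiring up the repo's blocksparse kernels) is to time the same dense attention block with and without gradient checkpointing. This is only a sketch under my own assumptions--it uses PyTorch rather than the repo's TensorFlow code, and the sizes below are placeholders, not the paper's configs:

```python
# Minimal timing sketch: dense attention with vs. without recompute
# (gradient checkpointing). Not the repo's code; all sizes are placeholders.
import time
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

def attention_block(x, qkv, proj, n_head):
    # Standard dense multi-head self-attention forward pass.
    b, t, d = x.shape
    q, k, v = qkv(x).chunk(3, dim=-1)
    def split(z):
        return z.view(b, t, n_head, d // n_head).transpose(1, 2)
    q, k, v = split(q), split(k), split(v)
    att = (q @ k.transpose(-2, -1)) / (d // n_head) ** 0.5
    att = att.softmax(dim=-1)
    out = (att @ v).transpose(1, 2).reshape(b, t, d)
    return proj(out)

def bench(use_recompute, steps=10, b=4, t=512, d=512, n_head=8):
    # Returns approximate tokens/second for forward + backward.
    torch.manual_seed(0)
    qkv = nn.Linear(d, 3 * d)
    proj = nn.Linear(d, d)
    x = torch.randn(b, t, d, requires_grad=True)
    start = time.time()
    for _ in range(steps):
        if use_recompute:
            # Recompute the block's activations during backward.
            y = checkpoint(attention_block, x, qkv, proj, n_head,
                           use_reentrant=False)
        else:
            y = attention_block(x, qkv, proj, n_head)
        y.sum().backward()
        x.grad = None
    return steps * b * t / (time.time() - start)

print("dense, no recompute:", bench(False))
print("dense, recompute   :", bench(True))
```

Obviously this only isolates the recompute overhead for a dense block; the actual sparse kernels change the FLOP count as well, so it's a lower-bound sanity check rather than a real comparison.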