CUTLASS 3.3.0

@hwu36 hwu36 released this 06 Dec 01:55
a75b4ac
  • New Mixed-input Hopper GEMMs covering 16-bit x 8-bit input types with optimal performance.
  • New Mixed-input Ampere GEMMs with support for canonical layouts (TN). The implementation supports upcast on operandB for {fp16, bf16} x {s8, u8} and upcast on operandA for {s8, u8} x {fp16, bf16}. The kernels also include fast numeric conversion recipes and warp-level shuffles to achieve optimal performance.
  • New Copy Async based Hopper GEMMs, which support input tensors with less than 16B alignment (across s8/fp8/fp16/bf16/tf32 types) with optimal performance. As part of this, new kernel schedules and Copy Ops SM80_CP_ASYNC_CACHE_* were also added.
  • EVT Support for RELU with Aux bitmap tensor store (used in dRELU). See SM90 EVT fusions for details.
  • Various subbyte enhancements such as tagged device ptrs, support for vectorized copy, various operators to treat subbyte iterators as pointers, and full-fledged CuTe Tensor support.
  • Support for Clang as a host compiler.
  • Support for void-C kernels and SM80 mixed-input GEMMs in the CUTLASS Python interface.
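
To illustrate the idea behind the RELU-with-Aux-bitmap item above, here is a minimal host-side sketch in plain Python. It is not CUTLASS code and does not use the EVT API: the function names `relu_with_bitmap` and `drelu_from_bitmap` are hypothetical, and the bit layout (one bit per element, least significant bit first) is an assumption made for the example, not necessarily the layout CUTLASS uses. The point is only that the forward pass can store a 1-bit-per-element mask, and the backward pass (dRELU) can recover the gradient from that mask without the original activations.

```python
def relu_with_bitmap(values):
    """ReLU over a flat list; also returns a bit-packed mask of positive inputs.

    Assumed layout for this sketch: element i maps to bit (i % 8) of byte (i // 8).
    """
    out = [v if v > 0 else 0.0 for v in values]
    bitmap = bytearray((len(values) + 7) // 8)  # one bit per element, rounded up
    for i, v in enumerate(values):
        if v > 0:
            bitmap[i // 8] |= 1 << (i % 8)
    return out, bytes(bitmap)


def drelu_from_bitmap(grad_out, bitmap):
    """Backward ReLU: keep the gradient only where the forward input was positive."""
    return [
        g if (bitmap[i // 8] >> (i % 8)) & 1 else 0.0
        for i, g in enumerate(grad_out)
    ]


x = [-1.0, 2.0, 0.0, 3.5, -0.5, 4.0, 1.0, -2.0, 0.25]
y, bitmap = relu_with_bitmap(x)
grad_in = drelu_from_bitmap([1.0] * len(x), bitmap)
```

In this sketch nine fp32 activations are summarized by two bytes of mask, which is why a bitmap is an attractive auxiliary output when only the sign information is needed in the backward pass.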