Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid LDGSTS routing by changing default copy to be universalcopy #1674

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

ZelboK
Copy link

@ZelboK ZelboK commented Aug 1, 2024

#1672

This PR changes the default copy to be UniversalCopy so the LDGSTS instruction is avoided, and downstream users will need to specify the copy type if they want to use it which is more intuitive.

Note: I imagine since all of the tests and examples that are configured to use DefaultCopy will need to now transition over to AutoVectorizingCopyWithAssumedAlignment<128> as that was the previous copy type. This way we can preserve behavior across benchmarks and tests. Before I make a bunch of changes across files though I'd like to get feedback now incase I'm missing anything.

cutlass_library.egg-info/
build/
.vscode/
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm hoping this is okay to add? I've a habit of just doing git add . for small repos and it can get annoying having to deal with me accidently committing my build directory.

@@ -485,7 +485,7 @@ int main(int argc, char const **args) {
// Tiled copy from Smem to Registers
// Note : CuTe will vectorize this copy if the tiling + swizzling above were right
using TiledCopyS2R = TiledCopy<
Copy_Atom<DefaultCopy, ElementAcc>,
Copy_Atom<AutoVectorizingCopyWithAssumedAlignment<128>, ElementAcc>,
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the DefaultCopy used to be AutoVectorizingCopyWithAssumedAlignment<128>, it must now be changed to explicitly specify this to preserve behavior. I believe this is the easiest way to reduce the scope of this PR and keep changes minimal.

@@ -109,7 +109,7 @@ template <class T, class GmemLayout, class RmemTiler>
void
test_copy_vectorization(GmemLayout gmem_layout, RmemTiler rmem_tiler)
{
test_copy_vectorization<T>(DefaultCopy{}, gmem_layout, rmem_tiler);
test_copy_vectorization<T>(AutoVectorizingCopyWithAssumedAlignment<128>{}, gmem_layout, rmem_tiler);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the DefaultCopy used to be AutoVectorizingCopyWithAssumedAlignment<128>, it must now be changed to explicitly specify this to preserve behavior. I believe this is the easiest way to reduce the scope of this PR and keep changes minimal.

@ZelboK
Copy link
Author

ZelboK commented Aug 1, 2024

cc @thakkarV

Any ideas on other places of CUTLASS that might need help? Perhaps any Ampere related issues?

@thakkarV
Copy link
Collaborator

thakkarV commented Aug 1, 2024

@ccecka I like this change. wdyt?

@mammoth831
Copy link

how about removing select_elementwise_copy and just using UniversalCopy<SrcType,DstType>{} by default to avoid LDGSTS?

@ccecka
Copy link

ccecka commented Aug 2, 2024

Or (un)setting CUTE_ARCH_CP_ASYNC_SM80_ENABLED so that select_elementwise_copy always chooses UniversalCopy.

All of these changes look very dangerous to me.

@ZelboK
Copy link
Author

ZelboK commented Aug 2, 2024

Or (un)setting CUTE_ARCH_CP_ASYNC_SM80_ENABLED so that select_elementwise_copy always chooses UniversalCopy.

The problem as I understood it was that logically it's a bit unintuitive that DefaultCopy dispatches to this instruction because the user doesn't actually know that it's doing this unless they inspect under the hood. Wouldn't unsetting this be equivalent to the user just specifying UniversalCopy from the get go? They have to know what's going on in the first place, and it doesn't prevent them from having to discover they need more synchronizations.

All of these changes look very dangerous to me.

I'm probably missing some context here. Could you give an example situation of where this change could go wrong? (Might be a stupid question 😅 )

@ZelboK
Copy link
Author

ZelboK commented Aug 2, 2024

how about removing select_elementwise_copy and just using UniversalCopy<SrcType,DstType>{} by default to avoid LDGSTS?

I think this could work too, actually, IIUC if downstream code specifies that it wants to do a specific async cp it'll go through a different codepath from this anyway. Also won't need to change the default type so less code changes overall.

@thakkarV
Copy link
Collaborator

thakkarV commented Aug 7, 2024

@ccecka and I had a conversation about this internally, and have some thoughts.

  1. we likely do not want to disable this auto LDGSTS. This is because it is only enabled when SM80 is set
  2. when the source tensor is tagged to have a gmem pointer
  3. when the destination tensor is tagged to have an smem pointer

As for the MR itself, you are replacing a copy with a universal copy with assumed alignment of 128b. This is very dangerous because not all tensors may have that alignment.

We are working on a nicer fix of this internally that entails adding a simple assignment copy type that is only used when no atom is specified.

@ZelboK
Copy link
Author

ZelboK commented Aug 7, 2024

As for the MR itself, you are replacing a copy with a universal copy with assumed alignment of 128b. This is very dangerous because not all tensors may have that alignment.

I actually made a mistake I believe and I meant to do AutoVectorizingCopyWithAssumedAlignment<8>; and not 128... That is my bad. With that being said I'm curious to know if I'm missing any context - the DefaultCopy is set to AutoVectorizingCopyWithAssumedAlignment<8>;, right? If all DefaultCopys in the codebase were replaced with this explicit type why would it be dangerous? Or was that only the case because in the code I mistakenly put 128 bits?

Copy link

github-actions bot commented Sep 6, 2024

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants