Finish Avx512 specific lightup for Vector128/256/512<T> #85207

Open
7 of 8 tasks
Tracked by #77034
tannergooding opened this issue Apr 23, 2023 · 4 comments

@tannergooding
Member

tannergooding commented Apr 23, 2023

With #80814, we achieved functional parity of Vector512<T> with Vector128<T> and Vector256<T>. However, AVX-512-capable hardware exposes some new instructions that allow additional hardware-acceleration opportunities for all three types.

This includes:

  • ConvertToDouble() - vcvtqq2pd & vcvtuqq2pd
  • ConvertToInt64() - vcvtpd2qq
  • ConvertToUInt32() - vcvtps2udq
  • ConvertToUInt64() - vcvtpd2uqq
  • ConditionalSelect() - vpternlog
  • Shuffle() - vpermi2*, vpermt2*, etc.
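
A minimal usage sketch of the corresponding `Vector512` APIs that would light up on these instructions (the wrapper class and method names below are purely illustrative; the instruction mappings in the comments assume the relevant Avx512F/Avx512DQ support is present):

```csharp
using System.Runtime.Intrinsics;

static class Avx512LightupSketch
{
    // Expected to lower to vcvtqq2pd when Avx512DQ is available.
    static Vector512<double> ConvertLongs(Vector512<long> value) =>
        Vector512.ConvertToDouble(value);

    // Expected to lower to vcvtps2udq when Avx512F is available.
    static Vector512<uint> ConvertFloats(Vector512<float> value) =>
        Vector512.ConvertToUInt32(value);

    // Expected to lower to a single vpternlog rather than an and/andnot/or sequence.
    static Vector512<int> Select(Vector512<int> mask, Vector512<int> left, Vector512<int> right) =>
        Vector512.ConditionalSelect(mask, left, right);
}
```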

We should also ensure that all APIs are accelerated as intrinsics where applicable. In particular, the following still use managed fallbacks (the fallbacks are themselves vectorized, so they remain accelerated); a rough sketch of the fallback shape follows below:

  • Vector512.Dot()
  • Vector512.Sum()

There may be others as well, so a general audit to validate would be good.
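
To illustrate what "managed fallback, but still vectorized" means here, Vector512.Dot can be expressed entirely in terms of operations that are already accelerated. This is a rough sketch under that interpretation, not the actual BCL implementation:

```csharp
using System.Runtime.Intrinsics;

static class Vector512DotSketch
{
    // Semantically, Dot(left, right) is a lane-wise multiply followed by a
    // horizontal sum; both pieces are already vectorized, so the remaining
    // work is teaching the JIT to recognize the Dot API itself as an intrinsic.
    static float Dot(Vector512<float> left, Vector512<float> right) =>
        Vector512.Sum(left * right);
}
```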

@tannergooding added the area-CodeGen-coreclr and avx512 labels Apr 23, 2023
@tannergooding added this to the 8.0.0 milestone Apr 23, 2023
@ghost

ghost commented Apr 23, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@DeepakRajendrakumaran
Contributor

@tannergooding Sum is a weird one. I don't believe any single AVX-512 instruction exists for it.

Documentation:
https://www.intel.com/content/www/us/en/docs/cpp-compiler/developer-guide-reference/2021-8/intrinsics-for-integer-reduction-operations.html
(screenshot of the Intel reduction-intrinsics table from the page linked above)

Clang implementation:
https://godbolt.org/z/hW1Thxhe3
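
For reference, the reduction in that Clang output can be mirrored with the managed intrinsics by repeatedly folding the upper half into the lower half. This is only an illustrative sketch (the class name is made up, and the JIT's eventual codegen may differ):

```csharp
using System.Runtime.Intrinsics;

static class Vector512SumSketch
{
    // Halving reduction: 512 -> 256 -> 128, then a final 128-bit horizontal sum.
    // There is no single AVX-512 "reduce add" instruction, so each step extracts
    // the upper half and adds it element-wise to the lower half.
    static int Sum(Vector512<int> value)
    {
        Vector256<int> v256 = value.GetLower() + value.GetUpper();
        Vector128<int> v128 = v256.GetLower() + v256.GetUpper();
        return Vector128.Sum(v128);
    }
}
```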

@JulieLeeMSFT
Member

Most of this is being handled in #100993.

@tannergooding
Member Author

Vector512.Dot is the only one left here and can be done in .NET 10. The current implementation is functionally correct and generates the expected codegen; it just isn't handled directly as an intrinsic in the JIT.

@tannergooding modified the milestones: 9.0.0, 10.0.0 Aug 1, 2024