[arm64] seqdec llTable access is extremely slow when profiled #466
I often see this on superscalar processors. The profile attributes the cost to one particular point, but in reality the processor is resolving the other instructions in parallel. There are no branches in this piece of code, so while the PC appears "stuck" at a particular instruction, the processor is actually working through the rest. I have tried a lot of different combinations of this piece of code. The "simpler" (but slower) alternative is:
or branchless:
The last is the fastest of the alternatives, but the original still gives the best performance in my tests.
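As a rough illustration of the branching versus branchless trade-off discussed here (this is not the code from this thread), a zero-width bit read can be handled either with an explicit `if` or by relying on Go's shift semantics, where shifting a `uint64` by 64 yields 0. The `bitReader` type and both method names below are hypothetical and heavily simplified:

```go
package main

// bitReader is a hypothetical, simplified bit reader used only for
// illustration. The next unread bits sit at the top of value.
type bitReader struct {
	value    uint64 // buffered bits
	bitsRead uint8  // number of bits already consumed from value
}

// getBitsBranch returns the next n bits (0 <= n <= 64) and uses an
// explicit branch to handle the n == 0 case.
func (b *bitReader) getBitsBranch(n uint8) int {
	if n == 0 {
		return 0
	}
	v := (b.value << (b.bitsRead & 63)) >> ((64 - n) & 63)
	b.bitsRead += n
	return int(v)
}

// getBitsBranchless returns the next n bits without a source-level branch.
// For n == 0 the right shift count is 64, which Go defines to produce 0
// for unsigned operands, so no special case is needed.
func (b *bitReader) getBitsBranchless(n uint8) int {
	v := (b.value << (b.bitsRead & 63)) >> (64 - uint(n))
	b.bitsRead += n
	return int(v)
}
```

Whether the branchless form is actually faster depends on the code the compiler emits for the target architecture and on how predictable the branch is, so it has to be measured rather than assumed.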
I'll give the last one a try on N1 arm64 and report back whether there's a performance improvement. If so, I would suggest splitting that function into _arm64.go and _other.go implementations.
You could also try without the |
The closest I can get:
with
I suspect getBits16ZeroARM is faster on ARM. I prefer not to have platform-specific implementations unless it is ASM or the gains can justify it, mainly to keep the amount of duplicated code down.
Unfortunately, no improvement. Hrm.
Yes, it is quite fiddly. Tiny changes lead to rather unpredictable differences. I found a small improvement in #467. I tried a lot of variations, including reordering the first part of the loop, but all except the one above led to regressions. For example, I also tried this, which should (logically) reduce the dependency chain, and it also produces nicer assembly:
... but still gives worse performance.
... is another rejected alternative. Welcome to superscalar optimizations :D
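For readers unfamiliar with the term, "reducing the dependency chain" means restructuring code so that fewer operations have to wait for the result of the previous one. A generic sketch, unrelated to the decoder code in this thread and using hypothetical function names, is a sum with one accumulator versus two independent accumulators:

```go
package main

// sumChained adds every element into a single accumulator, so each
// addition depends on the result of the previous one.
func sumChained(xs []uint32) uint32 {
	var s uint32
	for _, x := range xs {
		s += x
	}
	return s
}

// sumSplit keeps two independent accumulators, giving the processor two
// shorter dependency chains that it may be able to execute in parallel.
func sumSplit(xs []uint32) uint32 {
	var a, b uint32
	i := 0
	for ; i+1 < len(xs); i += 2 {
		a += xs[i]
		b += xs[i+1]
	}
	if i < len(xs) {
		a += xs[i] // leftover element when the length is odd
	}
	return a + b
}
```

As the comments above show, a logically shorter chain does not guarantee a faster result on a given core, which is why each variant has to be benchmarked.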
For some reason, the array access to llTable on seqdec.go:283 profiles as being significantly slower on arm64 than the mlTable and ofTable accesses, taking 5-10x longer than any other similar access. Disassembly is attached below (note that the time is misattributed to :285, but after reordering the code, I was able to get the problem to show up on the llTable access). This suggests the processor reports the llTable instruction as completed in the program counter, but then stalls before it can execute the :285 instructions.