[arm64] seqdec llTable access is extremely slow when profiled #466

Closed
lizthegrey opened this issue Jan 9, 2022 · 6 comments

@lizthegrey (Contributor)

For some reason, the array access to llTable on seqdec.go:283 profiles as significantly slower on arm64 than the mlTable and ofTable accesses, taking 5-10x longer than any other similar access. Disassembly is attached below. Note that the time is misattributed to :285, but after reordering the code I was able to make the problem show up in the llTable access itself; this suggests the processor retires the llTable load (advancing the program counter) but then stalls before it can execute the :285 instructions.

     260ms      320ms     540cd0: NOOP                                    ;github.com/klauspost/compress/zstd.(*sequenceDecs).decode seqdec.go:281
         .          .     540cd4: MOVD 344(RSP), R8                       ;bitreader.go:60
         .          .     540cd8: MOVD 32(R8), R9
     110ms      110ms     540cdc: MOVBU 40(R8), R10                       ;github.com/klauspost/compress/zstd.(*bitReader).getBitsFast bitreader.go:61
      10ms       10ms     540ce0: ADD R10, R5, R11
         .          .     540ce4: MOVB R11, 40(R8)                        ;bitreader.go:61
      10ms       10ms     540ce8: ADD R6, R4, R11                         ;github.com/klauspost/compress/zstd.(*sequenceDecs).decode seqdec.go:282
         .          .     540cec: AND $31, R11, R11                       ;seqdec.go:282
      20ms       20ms     540cf0: AND $63, R10, R10                       ;github.com/klauspost/compress/zstd.(*sequenceDecs).decode bitreader.go:60
         .          .     540cf4: LSL R10, R9, R9                         ;bitreader.go:60
         .          .     540cf8: ORR $64, ZR, R10
      30ms       30ms     540cfc: SUB R5, R10, R5                         ;github.com/klauspost/compress/zstd.(*bitReader).getBitsFast bitreader.go:60
         .          .     540d00: AND $63, R5, R5                         ;bitreader.go:60
         .          .     540d04: LSR R5, R9, R5
         .          .     540d08: MOVW R5, R5                             ;bitreader.go:62
         .          .     540d0c: ASR R11, R5, R9                         ;seqdec.go:282
      60ms       60ms     540d10: ADD R3>>16, R9, R9                      ;github.com/klauspost/compress/zstd.(*sequenceDecs).decode seqdec.go:283
         .          .     540d14: UBFIZ $3, R9, $9, R9                    ;seqdec.go:283
         .          .     540d18: MOVD 256(RSP), R11
      20ms       20ms     540d1c: MOVD (R11)(R9), R3                      ;github.com/klauspost/compress/zstd.(*sequenceDecs).decode seqdec.go:283
     1.10s      1.10s     540d20: AND $31, R6, R9                         ;github.com/klauspost/compress/zstd.(*sequenceDecs).decode seqdec.go:285
         .          .     540d24: ASR R9, R5, R9                          ;seqdec.go:285
         .          .     540d28: UBFIZ $1, R4, $4, R12                   ;seqdec.go:286
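
For reference, the profile and disassembly above were captured with something along these lines (a sketch; the exact benchmark selection and symbol regex may differ):

	$ go test -run=None -bench=BenchmarkDecoder_DecoderSmall -cpuprofile=cpu.out ./zstd
	$ go tool pprof -disasm='sequenceDecs.*decode' zstd.test cpu.out
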
@klauspost (Owner)

I often see this on superscalar processors. The profile shows the cost at a particular point, but in reality the processor is resolving the other instructions in parallel.

There are no branches in this piece of code, so while the PC appears "stuck" at a particular point, the processor is actually still working through the remaining instructions. I have tried a lot of different combinations of this piece of code.

The "simpler" (but slower) alternative is:

			bits := uint16(br.getBits(llState.nbBits()))
			llState = llTable[(llState.newState()+bits)&maxTableMask]

			bits = uint16(br.getBits(mlState.nbBits()))
			mlState = mlTable[(mlState.newState()+bits)&maxTableMask]

			bits = uint16(br.getBits(ofState.nbBits()))
			ofState = ofTable[(ofState.newState()+bits)&maxTableMask]

or branchless:

			nb := llState.nbBits() & 15
			bits := uint16(br.getBitsFast(nb)) & bitMask[nb]
			llState = llTable[(llState.newState()+bits)&maxTableMask]

			nb = mlState.nbBits() & 15
			bits = uint16(br.getBitsFast(nb)) & bitMask[nb]
			mlState = mlTable[(mlState.newState()+bits)&maxTableMask]

			nb = ofState.nbBits() & 15
			bits = uint16(br.getBitsFast(nb)) & bitMask[nb]
			ofState = ofTable[(ofState.newState()+bits)&maxTableMask]

The last is the fastest of the alternatives, but the original still gives the best performance in my tests.
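
For context, the "simpler" version calls getBits, which guards n == 0 with a branch, while the branchless one calls getBitsFast and masks the result instead; roughly like this (a sketch, see bitreader.go for the exact definitions):

// getBits returns n bits, guarding the n == 0 case (sketch).
func (b *bitReader) getBits(n uint8) int {
	if n == 0 /*|| b.bitsRead >= 64 */ {
		return 0
	}
	return b.getBitsFast(n)
}

// getBitsFast omits the check, keeping the hot loop branchless,
// but requires that at least one bit is requested (sketch).
func (b *bitReader) getBitsFast(n uint8) int {
	const regMask = 64 - 1
	v := uint32((b.value << (b.bitsRead & regMask)) >> ((regMask + 1 - n) & regMask))
	b.bitsRead += n
	return int(v)
}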

@lizthegrey (Contributor, author)

> or branchless:
>
> 			nb := llState.nbBits() & 15
> 			bits := uint16(br.getBitsFast(nb)) & bitMask[nb]
> 			llState = llTable[(llState.newState()+bits)&maxTableMask]
>
> 			nb = mlState.nbBits() & 15
> 			bits = uint16(br.getBitsFast(nb)) & bitMask[nb]
> 			mlState = mlTable[(mlState.newState()+bits)&maxTableMask]
>
> 			nb = ofState.nbBits() & 15
> 			bits = uint16(br.getBitsFast(nb)) & bitMask[nb]
> 			ofState = ofTable[(ofState.newState()+bits)&maxTableMask]
>
> The last is the fastest of the alternatives, but the original still gives the best performance in my tests.

I'll give the last a try on N1 arm64 and report back whether there's a performance improvement; if so, I'd suggest splitting that else block into _arm64.go and _other.go function implementations and hoping the compiler inlines the calls accordingly.

@klauspost (Owner)

You could also try without the if nBits == 0 { check, though I suspect the check is worth it.

@klauspost (Owner)

The closest I can get:

			nb := llState.nbBits()
			bits := br.getBits16Zero2AMD(nb)
			llState = llTable[(llState.newState()+bits)&maxTableMask]

			nb = mlState.nbBits()
			bits = br.getBits16Zero2AMD(nb)
			mlState = mlTable[(mlState.newState()+bits)&maxTableMask]

			nb = ofState.nbBits()
			bits = br.getBits16Zero2AMD(nb)
			ofState = ofTable[(ofState.newState()+bits)&maxTableMask]

with

// getBits16Zero2AMD returns up to 16 bits.
// There are no checks if the buffer is filled.
func (b *bitReader) getBits16Zero2AMD(n uint8) uint16 {
	const regMask = 64 - 1
	v := uint16((b.value << (b.bitsRead & regMask)) >> ((regMask + 1 - n) & regMask))
	b.bitsRead += n
	return v & bitMask[n&15]
}

// getBits16ZeroARM returns up to 16 bits.
// There are no checks if the buffer is filled.
func (b *bitReader) getBits16ZeroARM(n uint8) uint16 {
	const regMask = 64 - 1
	v := uint16((b.value << (b.bitsRead & regMask)) >> (regMask + 1 - n))
	b.bitsRead += n
	return v
}
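
(The & regMask on the shift count lets the compiler prove the shift amount is below 64 and skip its shift guard, and the n&15 index similarly lets it elide the bounds check on the bitMask lookup; the ARM variant skips the result mask entirely and relies on the plain shift.)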

I suspect getBits16ZeroARM is faster on ARM.

I prefer not to have platform-specific implementations unless it is ASM or the gains can justify it, mainly to keep the amount of duplicated code down.
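
For completeness, the split suggested above would look roughly like this (a sketch; the file names and the shared function name are hypothetical):

// bitreader_other.go (hypothetical file)
//go:build !arm64

package zstd

// getBits16Zero returns up to 16 bits, masking both shift count and result.
func (b *bitReader) getBits16Zero(n uint8) uint16 {
	const regMask = 64 - 1
	v := uint16((b.value << (b.bitsRead & regMask)) >> ((regMask + 1 - n) & regMask))
	b.bitsRead += n
	return v & bitMask[n&15]
}

// bitreader_arm64.go (hypothetical file)
//go:build arm64

package zstd

// getBits16Zero returns up to 16 bits via a plain shift.
func (b *bitReader) getBits16Zero(n uint8) uint16 {
	const regMask = 64 - 1
	v := uint16((b.value << (b.bitsRead & regMask)) >> (regMask + 1 - n))
	b.bitsRead += n
	return v
}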

@lizthegrey (Contributor, author)

> I suspect getBits16ZeroARM is faster on ARM.

Unfortunately, no improvement. Hrm.

$ benchcmp output_control_118 output_exp_118
benchcmp is deprecated in favor of benchstat: https://pkg.go.dev/golang.org/x/perf/cmd/benchstat
benchmark                                                   old ns/op     new ns/op     delta
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-4               7073512       7181837       +1.53%
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-4           1854666       1873355       +1.01%
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-4            23120944      23471222      +1.51%
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-4              17288399      17548898      +1.51%
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-4            5907932       5976676       +1.16%
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-4             7373540       7496899       +1.67%
BenchmarkDecoder_DecoderSmall/html_x_4.zst-4                4097023       4203566       +2.60%
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-4          531579        538771        +1.35%
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-4          350158        349539        -0.18%
BenchmarkDecoder_DecoderSmall/urls.10K.zst-4                20243415      20374383      +0.65%
BenchmarkDecoder_DecoderSmall/html.zst-4                    1955304       1972872       +0.90%
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-4           141322        142898        +1.12%
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-4                  878129        890749        +1.44%
BenchmarkDecoder_DecodeAll/geo.protodata.zst-4              230013        231535        +0.66%
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-4               2877714       2926625       +1.70%
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-4                 2150034       2174270       +1.13%
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-4               733509        742105        +1.17%
BenchmarkDecoder_DecodeAll/alice29.txt.zst-4                916192        930390        +1.55%
BenchmarkDecoder_DecodeAll/html_x_4.zst-4                   506000        519509        +2.67%
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-4             64815         65632         +1.26%
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-4             41036         41040         +0.01%
BenchmarkDecoder_DecodeAll/urls.10K.zst-4                   2504961       2532167       +1.09%
BenchmarkDecoder_DecodeAll/html.zst-4                       242472        245036        +1.06%
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-4              17783         17924         +0.79%
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-4          225014        223843        -0.52%
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-4      57646         58154         +0.88%
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-4       723635        732553        +1.23%
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-4         538650        546634        +1.48%
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-4       187196        186088        -0.59%
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-4        230059        233741        +1.60%
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-4           126618        130186        +2.82%
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-4     16318         16809         +3.01%
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-4     10354         10379         +0.24%
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-4           629595        634676        +0.81%
BenchmarkDecoder_DecodeAllParallel/html.zst-4               60873         61513         +1.05%
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-4      4517          4674          +3.48%

$ benchcmp output_control_117 output_exp_117
benchcmp is deprecated in favor of benchstat: https://pkg.go.dev/golang.org/x/perf/cmd/benchstat
benchmark                                                   old ns/op     new ns/op     delta
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-4               7186153       7329577       +2.00%
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-4           1879633       1894968       +0.82%
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-4            23693226      24100953      +1.72%
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-4              17583795      17928858      +1.96%
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-4            6029481       6132862       +1.71%
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-4             7535099       7694845       +2.12%
BenchmarkDecoder_DecoderSmall/html_x_4.zst-4                4504959       4532914       +0.62%
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-4          541644        546734        +0.94%
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-4          349682        349063        -0.18%
BenchmarkDecoder_DecoderSmall/urls.10K.zst-4                20496737      20772657      +1.35%
BenchmarkDecoder_DecoderSmall/html.zst-4                    1982347       1995645       +0.67%
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-4           145109        146767        +1.14%
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-4                  888187        906370        +2.05%
BenchmarkDecoder_DecodeAll/geo.protodata.zst-4              232888        234864        +0.85%
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-4               2938792       3001457       +2.13%
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-4                 2180548       2221924       +1.90%
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-4               749985        763023        +1.74%
BenchmarkDecoder_DecodeAll/alice29.txt.zst-4                935423        955110        +2.10%
BenchmarkDecoder_DecodeAll/html_x_4.zst-4                   557482        560147        +0.48%
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-4             66008         66632         +0.95%
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-4             41158         41178         +0.05%
BenchmarkDecoder_DecodeAll/urls.10K.zst-4                   2540351       2570418       +1.18%
BenchmarkDecoder_DecodeAll/html.zst-4                       246193        247732        +0.63%
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-4              18222         18389         +0.92%
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-4          222816        227526        +2.11%
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-4      58407         58926         +0.89%
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-4       738104        754114        +2.17%
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-4         552184        558742        +1.19%
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-4       187512        191093        +1.91%
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-4        234596        244870        +4.38%
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-4           139564        140376        +0.58%
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-4     16913         16701         -1.25%
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-4     10401         10377         -0.23%
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-4           639893        646898        +1.09%
BenchmarkDecoder_DecodeAllParallel/html.zst-4               61690         63171         +2.40%
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-4      4778          4700          -1.63%

@klauspost (Owner)

Yes, it is quite fiddly. Tiny changes lead to rather unpredictable differences. I found a small improvement in #467.

I tried a lot of variations, including reordering the first part of the loop, but all but the one above led to regressions.

For example, I also tried this, which should (logically) reduce the dependency chain, and it also produces nicer assembly:

			bits := br.get32BitsFast(nBits)
			ofNB := ofState.nbBits() & 15
			mlNB := mlState.nbBits() & 15

			lowBits := uint16(bits >> ofNB)
			lowBits &= bitMask[mlNB]
			mlState = mlTable[(mlState.newState()+lowBits)&maxTableMask]

			lowBits = uint16(bits) & bitMask[ofNB]
			ofState = ofTable[(ofState.newState()+lowBits)&maxTableMask]

			lowBits = uint16(bits >> ((ofNB + mlNB) & 31))
			llState = llTable[(llState.newState()+lowBits)&maxTableMask]

... but still gives worse performance.

			bits := br.get32BitsFast(nBits)
			ofNB := ofState.nbBits() & 15
			mlNB := mlState.nbBits() & 15

			lowBitsOF := uint16(bits) & bitMask[ofNB]
			lowBitsML := uint16(bits>>ofNB) & bitMask[mlNB]
			lowBitsLL := uint16(bits >> ((ofNB + mlNB) & 31))
			
			mlState = mlTable[(mlState.newState()+lowBitsML)&maxTableMask]
			ofState = ofTable[(ofState.newState()+lowBitsOF)&maxTableMask]
			llState = llTable[(llState.newState()+lowBitsLL)&maxTableMask]

... is another rejected alternative.

Welcome to superscalar optimizations :D
