
Add pass to convert Uint8 to int8 across operators #2826

Open

Wants to merge 52 commits into base: develop
Conversation

TedThemistokleous
Collaborator

@TedThemistokleous TedThemistokleous commented Feb 23, 2024

~~Related to #1904~~

This originally came from parsing DynamicQuantizeLinear, but we've since found that other operators weren't handling uint8 inputs correctly, so we need a pass to handle the conversion properly.

This depends on the more recent #2888 so that we can handle MatMulInteger, ConvInteger, DynamicQuantizeLinear, and uint8 at compile time rather than in the parser.

Ted Themistokleous added 2 commits February 23, 2024 15:04
We parse this instruction in as a combination of ops. The issue here is that we require int8 for MLIR, as uint8 is unsupported for creating kernels.

In this case we convert from unsigned to signed and use uint16 as the accumulator to avoid overflow before converting to int8 (illustrated in the sketch after the commit messages below).
Need to adjust parse and verify tests for the operator if it contains the added int8 conversion on the output.
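For context, here is a minimal standalone sketch of the conversion arithmetic (illustrative only, not the kernel or parser code from this PR; it uses int16 where the commit message above mentions uint16). Shifting the uint8 range [0, 255] down by 128 keeps the dequantized result unchanged, since scale * (q - zp) == scale * ((q - 128) - (zp - 128)), but the subtraction wraps around if the intermediate stays in 8 unsigned bits, which is why a wider accumulator is needed:

```cpp
// Illustrative sketch only, not code from this PR.
#include <cassert>
#include <cstdint>

int main()
{
    uint8_t q = 50;

    // Keeping the intermediate in uint8 wraps around: 50 - 128 becomes 178.
    uint8_t wrapped = static_cast<uint8_t>(q - 128);
    assert(wrapped == 178);

    // Widening first gives the intended int8 value of -78 (range -128..127).
    int8_t shifted = static_cast<int8_t>(static_cast<int16_t>(q) - 128);
    assert(shifted == -78);
    return 0;
}
```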
@TedThemistokleous TedThemistokleous added the onnxruntime (PR changes interaction between MIGraphX and Onnxruntime) and bugfix (Fixes a bug found in the code) labels Feb 23, 2024
@TedThemistokleous TedThemistokleous self-assigned this Feb 23, 2024
@TedThemistokleous TedThemistokleous linked an issue Feb 23, 2024 that may be closed by this pull request

codecov bot commented Feb 23, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.86%. Comparing base (263509b) to head (2cde7cc).
Report is 1 commit behind head on develop.

Current head 2cde7cc differs from pull request most recent head 2654eae

Please upload reports for the commit 2654eae to get more accurate results.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #2826      +/-   ##
===========================================
+ Coverage    91.82%   91.86%   +0.04%     
===========================================
  Files          486      480       -6     
  Lines        18993    18240     -753     
===========================================
- Hits         17440    16757     -683     
+ Misses        1553     1483      -70     


@pfultz2
Collaborator

pfultz2 commented Feb 23, 2024

Without this we attempt to get MLIR to try to create int8 x uint8, which fails at compile.

Can you explain this in more detail?

@TedThemistokleous
Collaborator Author

Without this we attempt to get MLIR to try to create int8 x uint8, which fails at compile.

Can you explain this in more detail?

When speaking with the MLIR team, they said they don't support uint8 x int8 quant_dots, or uint8 in general, for their kernels. I was seeing failures when using models optimized by Onnxruntime with int8. This is related to the drop in performance, as we would fail to compile the model in Onnxruntime and then default to the ROCm EP and/or CPU.

@pfultz2
Collaborator

pfultz2 commented Feb 23, 2024

But why are we getting mixed types like this in the first place?

I am wondering if we should have a pass to fix this up instead of doing this in the frontend parser.

@TedThemistokleous
Collaborator Author

TedThemistokleous commented Feb 23, 2024

But why are we getting mixed types like this in the first place?

Because the implementation of DynamicQuantizeLinear breaks it up into separate instructions which hardcode uint8 as the output type for the zero point, as defined by the ONNX standard.

I am wondering if we should have a pass to fix this up instead of doing this in the frontend parser.

Doing more tests on this, it seems to break the ONNX backend tests, since there is an expectation that all zero-point outputs will be uint8. So it appears we should be doing a pass instead: if one of the inputs to a quantizelinear is a uint8 type, add in the convert (a rough sketch of that idea follows).
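Roughly, the shape of that pass would look something like the sketch below. This is only an illustration assuming MIGraphX's module/instruction API (iterator_for, make_op, insert_instruction, replace_instruction); the actual pass in this PR covers more operators and also has to shift values and zero points by 128 to stay numerically equivalent, which is omitted here.

```cpp
// Sketch only: insert a convert-to-int8 in front of any uint8 argument of the
// ops we care about. API names are assumed from MIGraphX and may not match
// the final implementation in this PR.
#include <vector>
#include <migraphx/instruction.hpp>
#include <migraphx/iterator_for.hpp>
#include <migraphx/make_op.hpp>
#include <migraphx/module.hpp>

namespace migraphx {

void convert_uint8_inputs(module& m)
{
    for(auto ins : iterator_for(m))
    {
        if(ins->name() != "quantizelinear" and ins->name() != "quant_dot")
            continue;
        std::vector<instruction_ref> new_args;
        bool changed = false;
        for(auto input : ins->inputs())
        {
            if(input->get_shape().type() == shape::uint8_type)
            {
                // Note: a real pass also needs the +/-128 shift discussed above;
                // a bare convert alone would saturate values above 127.
                input = m.insert_instruction(
                    ins, make_op("convert", {{"target_type", shape::int8_type}}), input);
                changed = true;
            }
            new_args.push_back(input);
        }
        if(changed)
            m.replace_instruction(ins, ins->get_operator(), new_args);
    }
}

} // namespace migraphx
```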

Ted Themistokleous added 4 commits February 27, 2024 21:08
Use this to determine if an instruction has a specific input type used for one of its arguments.

Added tests to pick out instruction based on type
@TedThemistokleous TedThemistokleous marked this pull request as draft February 28, 2024 06:11
@TedThemistokleous
Collaborator Author

Moving to draft. Will craft a better matcher after tomorrow morning's coffee. @lakhinderwalia I see your point from Monday's meeting. Looks like we can just target and replace q_min/q_max as well as the convert at the end to adjust the data range, without requiring the extra shift, add, and converts (see the sketch below).
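For reference, a small standalone sketch of that range adjustment (my own illustration following the ONNX DynamicQuantizeLinear formula, not the matcher code from this PR): the scale is the same for both ranges, and switching q_min/q_max from [0, 255] to [-128, 127] just shifts the computed zero point down by 128 before clipping.

```cpp
// Illustrative sketch of DynamicQuantizeLinear-style scale/zero-point math,
// parameterized by the target range. Not the matcher/pass code from this PR.
#include <algorithm>
#include <cmath>

struct quant_params
{
    float scale;
    int zero_point;
};

inline quant_params dynamic_quant_params(float x_min, float x_max, int q_min, int q_max)
{
    // The adjusted range must include zero so zero stays exactly representable.
    x_min = std::min(x_min, 0.0f);
    x_max = std::max(x_max, 0.0f);
    float scale = (x_max - x_min) / static_cast<float>(q_max - q_min);
    if(scale == 0.0f)
        scale = 1.0f; // avoid division by zero for an all-zero input
    float zp = std::clamp(
        q_min - x_min / scale, static_cast<float>(q_min), static_cast<float>(q_max));
    return {scale, static_cast<int>(std::nearbyint(zp))};
}

// dynamic_quant_params(min, max, 0, 255)    -> uint8 zero point
// dynamic_quant_params(min, max, -128, 127) -> the same zero point shifted down by 128
```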

Ted Themistokleous added 2 commits February 28, 2024 14:15
….hpp

Makes this easier to test

Added additional test cases and passes to existing dynamicquantizelinear verify/parse tests
@TedThemistokleous TedThemistokleous marked this pull request as ready for review February 28, 2024 18:40
@TedThemistokleous
Collaborator Author

TedThemistokleous commented Feb 28, 2024

Updated this to be a separate pass. Seeing some odd behavior in the added tests when this gets converted to kernels on the GPU. Need to sort out that last bit, but the rest should be good for review.

Ted Themistokleous added 4 commits February 28, 2024 23:30
Need this to verify whether our conversion to int8 is indeed correct.
@TedThemistokleous TedThemistokleous changed the base branch from develop to fix_parse_dynamicquantizelinear March 18, 2024 18:04
@TedThemistokleous
Collaborator Author

Split out the changes for parse_dynamicquantizelinear into their own PR. Retargeted this branch to that PR until that changeset gets in. Will revisit once that's merged.

@causten
Collaborator

causten commented Mar 18, 2024

@TedThemistokleous to review and resolve conflicts

@TedThemistokleous TedThemistokleous removed the high priority (A PR with high priority for review and merging) label Mar 18, 2024
@TedThemistokleous
Collaborator Author

DQL fix + Pass changes

Summary:
gpu::code_object::reduce_min_min_kernel: 2.04166ms / 49 = 0.0416666ms, 13%
gpu::code_object::reduce_max_max_sub_mul_kernel: 2.03947ms / 49 = 0.0416219ms, 13%
gpu::code_object::mul_quantizelinear_kernel: 1.66411ms / 48 = 0.0346689ms, 10%
gpu::code_object::mlir_quant_dot: 1.35967ms / 47 = 0.0289292ms, 9%
gpu::code_object::convert_kernel: 1.15707ms / 50 = 0.0231415ms, 7%
gpu::code_object::quantizelinear_convert_sub_quantizelinear_kernel: 1.15445ms / 49 = 0.0235602ms, 7%
gpu::code_object::div_neg_clip_nearbyint_kernel: 1.14034ms / 49 = 0.0232722ms, 7%
gpu::code_object::mul_kernel: 1.12784ms / 49 = 0.0230171ms, 7%
gpu::code_object::contiguous_kernel: 0.854442ms / 36 = 0.0237345ms, 6%
gpu::code_object::quantizelinear_kernel: 0.833104ms / 36 = 0.0231418ms, 5%
gpu::code_object::layernorm_mul_add_kernel: 0.580536ms / 24 = 0.024189ms, 4%
gpu::code_object::dequantizelinear_add_add_kernel: 0.534086ms / 23 = 0.0232211ms, 4%
gpu::code_object::mlir_quant_dot_dequantizelinear_add: 0.38627ms / 13 = 0.0297131ms, 3%
load: 0.316312ms / 537 = 0.000589036ms, 2%
gpu::code_object::dequantizelinear_mul_where_reduce_max_sub_exp_reduce_sum_div_quantizelinear_kernel: 0.293268ms / 12 = 0.024439ms, 2%
gpu::code_object::dequantizelinear_add_mul_mul_add_mul_exp_add_div_kernel: 0.287277ms / 12 = 0.0239397ms, 2%
gpu::code_object::mlir_quant_dot_dequantizelinear: 0.280463ms / 12 = 0.0233719ms, 2%
multibroadcast: 0.245326ms / 295 = 0.000831613ms, 2%
hip::hip_copy_literal: 0.0990334ms / 150 = 0.000660223ms, 1%
gpu::code_object::mlir_quant_dot_dequantizelinear_mul: 0.0969119ms / 1 = 0.0969119ms, 1%
reshape_lazy: 0.0966945ms / 131 = 0.000738126ms, 1%
transpose: 0.0604593ms / 48 = 0.00125957ms, 1%
slice: 0.04459ms / 36 = 0.00123861ms, 1%
gpu::code_object::add_layernorm_mul_add_kernel: 0.0248186ms / 1 = 0.0248186ms, 1%
gpu::code_object::dequantizelinear_add_kernel: 0.0238949ms / 1 = 0.0238949ms, 1%
gpu::code_object::gather_kernel: 0.0233956ms / 1 = 0.0233956ms, 1%
@param: 0.00996924ms / 26 = 0.000383432ms, 1%
hip::hip_allocate_memory: 0.00088522ms / 1 = 0.00088522ms, 1%
check_context::migraphx::gpu::context: 0.00072764ms / 1 = 0.00072764ms, 1%

Batch size: 1
Rate: 189.117 inferences/sec
Total time: 5.28774ms
Total instructions time: 16.7771ms
Overhead time: 0.371628ms, -11.4893ms
Overhead: 7%, -217%
[ MIGraphX Version: 2.10.0. ] Complete: bin/driver perf ../int8_models/gpt2_1_int8_gpu.onnx --input-dim @input_ids 1 32 --fill1 input_ids --disable-fast-math --int8


Summary:
gpu::code_object::reduce_max_max_sub_mul_kernel: 2.05055ms / 49 = 0.0418479ms, 13%
gpu::code_object::reduce_min_min_kernel: 2.04459ms / 49 = 0.0417264ms, 13%
gpu::code_object::mul_quantizelinear_kernel: 1.66972ms / 48 = 0.0347858ms, 10%
gpu::code_object::mlir_quant_dot: 1.36225ms / 47 = 0.028984ms, 9%
gpu::code_object::convert_kernel: 1.16309ms / 50 = 0.0232618ms, 7%
gpu::code_object::quantizelinear_convert_sub_quantizelinear_kernel: 1.16019ms / 49 = 0.0236774ms, 7%
gpu::code_object::div_neg_clip_nearbyint_kernel: 1.14452ms / 49 = 0.0233575ms, 7%
gpu::code_object::mul_kernel: 1.13133ms / 49 = 0.0230883ms, 7%
gpu::code_object::contiguous_kernel: 0.860616ms / 36 = 0.023906ms, 6%
gpu::code_object::quantizelinear_kernel: 0.833249ms / 36 = 0.0231458ms, 5%
gpu::code_object::layernorm_mul_add_kernel: 0.583202ms / 24 = 0.0243001ms, 4%
gpu::code_object::dequantizelinear_add_add_kernel: 0.537737ms / 23 = 0.0233799ms, 4%
gpu::code_object::mlir_quant_dot_dequantizelinear_add: 0.385738ms / 13 = 0.0296722ms, 3%
load: 0.314656ms / 537 = 0.000585952ms, 2%
gpu::code_object::dequantizelinear_add_pow_mul_add_mul_tanh_add_mul_mul_kernel: 0.311504ms / 12 = 0.0259587ms, 2%
gpu::code_object::dequantizelinear_mul_where_reduce_max_sub_exp_reduce_sum_div_quantizelinear_kernel: 0.295388ms / 12 = 0.0246157ms, 2%
gpu::code_object::mlir_quant_dot_dequantizelinear: 0.285528ms / 12 = 0.023794ms, 2%
multibroadcast: 0.257628ms / 295 = 0.000873314ms, 2%
gpu::code_object::mlir_quant_dot_dequantizelinear_mul: 0.100579ms / 1 = 0.100579ms, 1%
hip::hip_copy_literal: 0.0988059ms / 150 = 0.000658706ms, 1%
reshape_lazy: 0.0981314ms / 131 = 0.000749094ms, 1%
transpose: 0.0633442ms / 48 = 0.00131967ms, 1%
slice: 0.0449541ms / 36 = 0.00124872ms, 1%
gpu::code_object::add_layernorm_mul_add_kernel: 0.0250007ms / 1 = 0.0250007ms, 1%
gpu::code_object::dequantizelinear_add_kernel: 0.0242269ms / 1 = 0.0242269ms, 1%
gpu::code_object::gather_kernel: 0.0238603ms / 1 = 0.0238603ms, 1%
@param: 0.0102222ms / 26 = 0.000393161ms, 1%
check_context::migraphx::gpu::context: 0.0009512ms / 1 = 0.0009512ms, 1%
hip::hip_allocate_memory: 0.0009052ms / 1 = 0.0009052ms, 1%

Batch size: 1
Rate: 190.495 inferences/sec
Total time: 5.24948ms
Total instructions time: 16.8825ms
Overhead time: 0.37222ms, -11.633ms
Overhead: 7%, -222%
[ MIGraphX Version: 2.10.0. ] Complete: bin/driver perf ../int8_models/gpt2_1_int8_gpu.onnx --input-dim @input_ids 1 32 --fill1 input_ids --int8

Base automatically changed from fix_parse_dynamicquantizelinear to develop March 19, 2024 14:29
@TedThemistokleous
Collaborator Author

TedThemistokleous commented Apr 9, 2024

Not required, but this can still be used as a perf improvement now that #2903 has been added. Will need to determine the speedup after rebase.

The fixes were pulled out of here into some other changes, but I'm still curious about the perf if we can just auto-convert uint8 / handle matmul correctly now.

@TedThemistokleous TedThemistokleous removed the bugfix (Fixes a bug found in the code) label Apr 29, 2024
@@ -869,6 +875,12 @@ auto skip_broadcasts_converts(Ms... ms)
return skip(name("broadcast", "multibroadcast", "contiguous", "convert"))(ms...);
}

template <class... Ms>
auto skip_broadcast_squeeze(Ms... ms)
Collaborator

Just put this matcher in the .cpp file, as it's specific to the pass.
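For example, something along these lines in the pass's .cpp, following the existing skip_broadcasts_converts pattern shown in the hunk above (the exact list of ops to skip here is my assumption, not taken from this PR):

```cpp
// Sketch only: define the matcher locally in the pass's translation unit
// rather than in matcher.hpp. The skipped op names are illustrative.
namespace {

template <class... Ms>
auto skip_broadcast_squeeze(Ms... ms)
{
    return migraphx::match::skip(
        migraphx::match::name("broadcast", "multibroadcast", "squeeze"))(ms...);
}

} // namespace
```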

@@ -56,7 +56,7 @@ struct quantizelinear

shape compute_shape(std::vector<shape> inputs) const
{
check_shapes{inputs, *this}.same_dims().has(2, 3);
check_shapes{inputs, *this}.has(2, 3);
Collaborator

Why is this removed? They should have the same dimensions.

@TedThemistokleous TedThemistokleous changed the title from "Set output of Dynamicquantizelinear to be int8 instead of uint8" to "Add pass to convert Uint8 to int8 across operators" May 6, 2024
@migraphx-bot
Collaborator

Test | Batch | Rate new (63952f) | Rate old (ae2b02) | Diff | Compare
torchvision-resnet50 64 3,232.60 3,232.34 0.01%
torchvision-resnet50_fp16 64 6,870.53 6,875.78 -0.08%
torchvision-densenet121 32 2,428.49 2,429.76 -0.05%
torchvision-densenet121_fp16 32 4,044.59 4,068.80 -0.60%
torchvision-inceptionv3 32 1,633.77 1,636.23 -0.15%
torchvision-inceptionv3_fp16 32 2,742.80 2,744.68 -0.07%
cadene-inceptionv4 16 769.95 771.82 -0.24%
cadene-resnext64x4 16 802.01 802.64 -0.08%
slim-mobilenet 64 7,430.92 7,438.28 -0.10%
slim-nasnetalarge 64 207.02 207.40 -0.18%
slim-resnet50v2 64 3,337.00 3,328.69 0.25%
bert-mrpc-onnx 8 1,152.71 1,149.05 0.32%
bert-mrpc-tf 1 310.84 308.98 0.60%
pytorch-examples-wlang-gru 1 413.25 416.23 -0.72%
pytorch-examples-wlang-lstm 1 387.91 374.95 3.46% 🔆
torchvision-resnet50_1 1 798.49 799.37 -0.11%
cadene-dpn92_1 1 431.50 432.81 -0.30%
cadene-resnext101_1 1 379.03 378.19 0.22%
onnx-taau-downsample 1 344.37 345.37 -0.29%
dlrm-criteoterabyte 1 35.00 35.05 -0.13%
dlrm-criteoterabyte_fp16 1 57.26 57.40 -0.24%
agentmodel 1 9,579.00 9,772.50 -1.98%
unet_fp16 2 57.91 57.79 0.21%
resnet50v1_fp16 1 931.31 939.67 -0.89%
resnet50v1_int8 1 955.32 929.28 2.80%
bert_base_cased_fp16 64 1,136.01 1,142.07 -0.53%
bert_large_uncased_fp16 32 350.06 351.79 -0.49%
bert_large_fp16 1 208.15 209.75 -0.76%
distilgpt2_fp16 16 2,138.98 2,151.98 -0.60%
yolov5s 1 508.41 506.15 0.45%
tinyllama 1 43.35 43.34 0.02%
vicuna-fastchat 1 174.09 169.35 2.80%
whisper-tiny-encoder 1 410.11 411.82 -0.42%
whisper-tiny-decoder 1 427.12 429.16 -0.47%

Check results before merge 🔆

@migraphx-bot
Copy link
Collaborator


✅ bert-mrpc-onnx: PASSED: MIGraphX meets tolerance
✅ bert-mrpc-tf: PASSED: MIGraphX meets tolerance
✅ pytorch-examples-wlang-gru: PASSED: MIGraphX meets tolerance
✅ pytorch-examples-wlang-lstm: PASSED: MIGraphX meets tolerance
✅ torchvision-resnet50_1: PASSED: MIGraphX meets tolerance
✅ cadene-dpn92_1: PASSED: MIGraphX meets tolerance
✅ cadene-resnext101_1: PASSED: MIGraphX meets tolerance
✅ dlrm-criteoterabyte: PASSED: MIGraphX meets tolerance
✅ agentmodel: PASSED: MIGraphX meets tolerance
✅ unet: PASSED: MIGraphX meets tolerance
✅ resnet50v1: PASSED: MIGraphX meets tolerance
✅ bert_base_cased_fp16: PASSED: MIGraphX meets tolerance
🔴 bert_large_uncased_fp16: FAILED: MIGraphX is not within tolerance - check verbose output
✅ bert_large: PASSED: MIGraphX meets tolerance
✅ yolov5s: PASSED: MIGraphX meets tolerance
✅ tinyllama: PASSED: MIGraphX meets tolerance
✅ vicuna-fastchat: PASSED: MIGraphX meets tolerance
✅ whisper-tiny-encoder: PASSED: MIGraphX meets tolerance
✅ whisper-tiny-decoder: PASSED: MIGraphX meets tolerance
✅ distilgpt2_fp16: PASSED: MIGraphX meets tolerance

@umangyadav umangyadav removed their request for review August 17, 2024 13:19
@causten
Collaborator

causten commented Sep 4, 2024

@TedThemistokleous Is this PR still needed?
