
Add pass to convert Uint8 to int8 across operators #2826

Open

Wants to merge 52 commits into base: develop
Conversation

TedThemistokleous
Collaborator

@TedThemistokleous TedThemistokleous commented Feb 23, 2024

~~Related to #1904~~

This originally came from parsing DynamicQuantizeLinear, but we've since found that other operators weren't handling uint8 inputs correctly, so we need a pass to handle the conversion properly.

This depends on the more recent #2888 so that we can handle MatMulInteger, ConvInteger, DynamicQuantizeLinear, and uint8 at compile time rather than in the parser.

Ted Themistokleous added 2 commits February 23, 2024 15:04
We parse this instruction in as a combination of ops. The issue here is that we require int8 for MLIR, as uint8 is unsupported for creating kernels.

In this case we convert from unsigned to signed and use uint16 as the accumulator to avoid overflow before converting to int8 (illustrated in the sketch after the commit messages below).
Need to adjust parse and verify tests for the operator if it contains the added int8 conversion on the output.
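For context, here is a minimal standalone sketch of the conversion arithmetic (illustrative only, not the kernel or parser code from this PR; it uses int16 where the commit message above mentions uint16). Shifting the uint8 range [0, 255] down by 128 keeps the dequantized result unchanged, since scale * (q - zp) == scale * ((q - 128) - (zp - 128)), but the subtraction wraps around if the intermediate stays in 8 unsigned bits, which is why a wider accumulator is needed:

```cpp
// Illustrative sketch only, not code from this PR.
#include <cassert>
#include <cstdint>

int main()
{
    uint8_t q = 50;

    // Keeping the intermediate in uint8 wraps around: 50 - 128 becomes 178.
    uint8_t wrapped = static_cast<uint8_t>(q - 128);
    assert(wrapped == 178);

    // Widening first gives the intended int8 value of -78 (range -128..127).
    int8_t shifted = static_cast<int8_t>(static_cast<int16_t>(q) - 128);
    assert(shifted == -78);
    return 0;
}
```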
@TedThemistokleous TedThemistokleous added the onnxruntime (PR changes interaction between MIGraphX and Onnxruntime) and bugfix (Fixes a bug found in the code) labels Feb 23, 2024
@TedThemistokleous TedThemistokleous self-assigned this Feb 23, 2024
@TedThemistokleous TedThemistokleous linked an issue Feb 23, 2024 that may be closed by this pull request

codecov bot commented Feb 23, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.86%. Comparing base (263509b) to head (2cde7cc).
Report is 1 commit behind head on develop.

Current head 2cde7cc differs from pull request most recent head 2654eae

Please upload reports for the commit 2654eae to get more accurate results.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #2826      +/-   ##
===========================================
+ Coverage    91.82%   91.86%   +0.04%     
===========================================
  Files          486      480       -6     
  Lines        18993    18240     -753     
===========================================
- Hits         17440    16757     -683     
+ Misses        1553     1483      -70     


@pfultz2
Collaborator

pfultz2 commented Feb 23, 2024

Without this we attempt to get MLIR to try to create int8 x uint8, which fails at compile.

Can you explain this in more detail?

@TedThemistokleous
Collaborator Author

Without this we attempt to get MLIR to try to create int8 x uint8, which fails at compile.

Can you explain this in more detail?

When speaking with the MLIR team, they said they don't support uint8 x int8 quant_dots, or uint8 in general, for their kernels. I was seeing failures when using models optimized by Onnxruntime with int8. This is related to the drop in performance, as we would fail to compile the model in Onnxruntime and then default to the ROCm EP and/or CPU.

@pfultz2
Collaborator

pfultz2 commented Feb 23, 2024

But why are we getting mixed types like this in the first place?

I am wondering if we should have a pass to fix this up instead of doing this in the frontend parser.

@TedThemistokleous
Collaborator Author

TedThemistokleous commented Feb 23, 2024

But why are we getting mixed types like this in the first place?

Because the implementation of DynamicQuantizeLinear breaks it up into separate instructions which hardcode uint8 as the output type for the zero point, as defined by the ONNX standard.

I am wondering if we should have a pass to fix this up instead of doing this in the frontend parser.

Doing more tests on this, it seems to break the ONNX backend tests, since there is an expectation that all zero-point outputs will be uint8. So it appears we should be doing a pass instead: if one of the inputs to a quantizelinear is a uint8 type, add in the convert (a rough sketch of that idea follows).
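Roughly, the shape of that pass would look something like the sketch below. This is only an illustration assuming MIGraphX's module/instruction API (iterator_for, make_op, insert_instruction, replace_instruction); the actual pass in this PR covers more operators and also has to shift values and zero points by 128 to stay numerically equivalent, which is omitted here.

```cpp
// Sketch only: insert a convert-to-int8 in front of any uint8 argument of the
// ops we care about. API names are assumed from MIGraphX and may not match
// the final implementation in this PR.
#include <vector>
#include <migraphx/instruction.hpp>
#include <migraphx/iterator_for.hpp>
#include <migraphx/make_op.hpp>
#include <migraphx/module.hpp>

namespace migraphx {

void convert_uint8_inputs(module& m)
{
    for(auto ins : iterator_for(m))
    {
        if(ins->name() != "quantizelinear" and ins->name() != "quant_dot")
            continue;
        std::vector<instruction_ref> new_args;
        bool changed = false;
        for(auto input : ins->inputs())
        {
            if(input->get_shape().type() == shape::uint8_type)
            {
                // Note: a real pass also needs the +/-128 shift discussed above;
                // a bare convert alone would saturate values above 127.
                input = m.insert_instruction(
                    ins, make_op("convert", {{"target_type", shape::int8_type}}), input);
                changed = true;
            }
            new_args.push_back(input);
        }
        if(changed)
            m.replace_instruction(ins, ins->get_operator(), new_args);
    }
}

} // namespace migraphx
```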

Ted Themistokleous added 4 commits February 27, 2024 21:08
Use this to determine if an instruction has a specific input type used for one of its arguments.

Added tests to pick out instruction based on type
@TedThemistokleous TedThemistokleous marked this pull request as draft February 28, 2024 06:11
@TedThemistokleous
Collaborator Author

Moving to draft. Will craft a better matcher after tomorrow morning's coffee. @lakhinderwalia I see your point from Monday's meeting. Looks like we can just target and replace q_min/q_max as well as the convert at the end to adjust the data range, without requiring the extra shift, add, and converts (see the sketch below).
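For reference, a small standalone sketch of that range adjustment (my own illustration following the ONNX DynamicQuantizeLinear formula, not the matcher code from this PR): the scale is the same for both ranges, and switching q_min/q_max from [0, 255] to [-128, 127] just shifts the computed zero point down by 128 before clipping.

```cpp
// Illustrative sketch of DynamicQuantizeLinear-style scale/zero-point math,
// parameterized by the target range. Not the matcher/pass code from this PR.
#include <algorithm>
#include <cmath>

struct quant_params
{
    float scale;
    int zero_point;
};

inline quant_params dynamic_quant_params(float x_min, float x_max, int q_min, int q_max)
{
    // The adjusted range must include zero so zero stays exactly representable.
    x_min = std::min(x_min, 0.0f);
    x_max = std::max(x_max, 0.0f);
    float scale = (x_max - x_min) / static_cast<float>(q_max - q_min);
    if(scale == 0.0f)
        scale = 1.0f; // avoid division by zero for an all-zero input
    float zp = std::clamp(
        q_min - x_min / scale, static_cast<float>(q_min), static_cast<float>(q_max));
    return {scale, static_cast<int>(std::nearbyint(zp))};
}

// dynamic_quant_params(min, max, 0, 255)    -> uint8 zero point
// dynamic_quant_params(min, max, -128, 127) -> the same zero point shifted down by 128
```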

Ted Themistokleous added 2 commits February 28, 2024 14:15
….hpp

Makes this easier to test

Added additional test cases and passes to existing dynamicquantizelinear verify/parse tests
@TedThemistokleous TedThemistokleous marked this pull request as ready for review February 28, 2024 18:40
@TedThemistokleous
Collaborator Author

TedThemistokleous commented Feb 28, 2024

Updated this to be a separate pass. Seeing some odd behavior in the added tests when this gets converted to kernels on the GPU. Need to sort out that last bit, but the rest should be good for review.

Ted Themistokleous added 4 commits February 28, 2024 23:30
Need this to verify whether our conversion to int8 is indeed correct.
@TedThemistokleous TedThemistokleous changed the base branch from develop to fix_parse_dynamicquantizelinear March 18, 2024 18:04
@TedThemistokleous
Collaborator Author

Split out the changes for parse_dynamicquantizelinear into their own PR. Retargeted this branch to that PR until that changeset gets in. Will revisit once that's merged.

@causten
Collaborator

causten commented Mar 18, 2024

@TedThemistokleous to review and resolve conflicts

@TedThemistokleous TedThemistokleous removed the high priority (A PR with high priority for review and merging) label Mar 18, 2024
@TedThemistokleous
Collaborator Author

DQL fix + Pass changes

Summary:
gpu::code_object::reduce_min_min_kernel: 2.04166ms / 49 = 0.0416666ms, 13%
gpu::code_object::reduce_max_max_sub_mul_kernel: 2.03947ms / 49 = 0.0416219ms, 13%
gpu::code_object::mul_quantizelinear_kernel: 1.66411ms / 48 = 0.0346689ms, 10%
gpu::code_object::mlir_quant_dot: 1.35967ms / 47 = 0.0289292ms, 9%
gpu::code_object::convert_kernel: 1.15707ms / 50 = 0.0231415ms, 7%
gpu::code_object::quantizelinear_convert_sub_quantizelinear_kernel: 1.15445ms / 49 = 0.0235602ms, 7%
gpu::code_object::div_neg_clip_nearbyint_kernel: 1.14034ms / 49 = 0.0232722ms, 7%
gpu::code_object::mul_kernel: 1.12784ms / 49 = 0.0230171ms, 7%
gpu::code_object::contiguous_kernel: 0.854442ms / 36 = 0.0237345ms, 6%
gpu::code_object::quantizelinear_kernel: 0.833104ms / 36 = 0.0231418ms, 5%
gpu::code_object::layernorm_mul_add_kernel: 0.580536ms / 24 = 0.024189ms, 4%
gpu::code_object::dequantizelinear_add_add_kernel: 0.534086ms / 23 = 0.0232211ms, 4%
gpu::code_object::mlir_quant_dot_dequantizelinear_add: 0.38627ms / 13 = 0.0297131ms, 3%
load: 0.316312ms / 537 = 0.000589036ms, 2%
gpu::code_object::dequantizelinear_mul_where_reduce_max_sub_exp_reduce_sum_div_quantizelinear_kernel: 0.293268ms / 12 = 0.024439ms, 2%
gpu::code_object::dequantizelinear_add_mul_mul_add_mul_exp_add_div_kernel: 0.287277ms / 12 = 0.0239397ms, 2%
gpu::code_object::mlir_quant_dot_dequantizelinear: 0.280463ms / 12 = 0.0233719ms, 2%
multibroadcast: 0.245326ms / 295 = 0.000831613ms, 2%
hip::hip_copy_literal: 0.0990334ms / 150 = 0.000660223ms, 1%
gpu::code_object::mlir_quant_dot_dequantizelinear_mul: 0.0969119ms / 1 = 0.0969119ms, 1%
reshape_lazy: 0.0966945ms / 131 = 0.000738126ms, 1%
transpose: 0.0604593ms / 48 = 0.00125957ms, 1%
slice: 0.04459ms / 36 = 0.00123861ms, 1%
gpu::code_object::add_layernorm_mul_add_kernel: 0.0248186ms / 1 = 0.0248186ms, 1%
gpu::code_object::dequantizelinear_add_kernel: 0.0238949ms / 1 = 0.0238949ms, 1%
gpu::code_object::gather_kernel: 0.0233956ms / 1 = 0.0233956ms, 1%
@param: 0.00996924ms / 26 = 0.000383432ms, 1%
hip::hip_allocate_memory: 0.00088522ms / 1 = 0.00088522ms, 1%
check_context::migraphx::gpu::context: 0.00072764ms / 1 = 0.00072764ms, 1%

Batch size: 1
Rate: 189.117 inferences/sec
Total time: 5.28774ms
Total instructions time: 16.7771ms
Overhead time: 0.371628ms, -11.4893ms
Overhead: 7%, -217%
[ MIGraphX Version: 2.10.0. ] Complete: bin/driver perf ../int8_models/gpt2_1_int8_gpu.onnx --input-dim @input_ids 1 32 --fill1 input_ids --disable-fast-math --int8


Summary:
gpu::code_object::reduce_max_max_sub_mul_kernel: 2.05055ms / 49 = 0.0418479ms, 13%
gpu::code_object::reduce_min_min_kernel: 2.04459ms / 49 = 0.0417264ms, 13%
gpu::code_object::mul_quantizelinear_kernel: 1.66972ms / 48 = 0.0347858ms, 10%
gpu::code_object::mlir_quant_dot: 1.36225ms / 47 = 0.028984ms, 9%
gpu::code_object::convert_kernel: 1.16309ms / 50 = 0.0232618ms, 7%
gpu::code_object::quantizelinear_convert_sub_quantizelinear_kernel: 1.16019ms / 49 = 0.0236774ms, 7%
gpu::code_object::div_neg_clip_nearbyint_kernel: 1.14452ms / 49 = 0.0233575ms, 7%
gpu::code_object::mul_kernel: 1.13133ms / 49 = 0.0230883ms, 7%
gpu::code_object::contiguous_kernel: 0.860616ms / 36 = 0.023906ms, 6%
gpu::code_object::quantizelinear_kernel: 0.833249ms / 36 = 0.0231458ms, 5%
gpu::code_object::layernorm_mul_add_kernel: 0.583202ms / 24 = 0.0243001ms, 4%
gpu::code_object::dequantizelinear_add_add_kernel: 0.537737ms / 23 = 0.0233799ms, 4%
gpu::code_object::mlir_quant_dot_dequantizelinear_add: 0.385738ms / 13 = 0.0296722ms, 3%
load: 0.314656ms / 537 = 0.000585952ms, 2%
gpu::code_object::dequantizelinear_add_pow_mul_add_mul_tanh_add_mul_mul_kernel: 0.311504ms / 12 = 0.0259587ms, 2%
gpu::code_object::dequantizelinear_mul_where_reduce_max_sub_exp_reduce_sum_div_quantizelinear_kernel: 0.295388ms / 12 = 0.0246157ms, 2%
gpu::code_object::mlir_quant_dot_dequantizelinear: 0.285528ms / 12 = 0.023794ms, 2%
multibroadcast: 0.257628ms / 295 = 0.000873314ms, 2%
gpu::code_object::mlir_quant_dot_dequantizelinear_mul: 0.100579ms / 1 = 0.100579ms, 1%
hip::hip_copy_literal: 0.0988059ms / 150 = 0.000658706ms, 1%
reshape_lazy: 0.0981314ms / 131 = 0.000749094ms, 1%
transpose: 0.0633442ms / 48 = 0.00131967ms, 1%
slice: 0.0449541ms / 36 = 0.00124872ms, 1%
gpu::code_object::add_layernorm_mul_add_kernel: 0.0250007ms / 1 = 0.0250007ms, 1%
gpu::code_object::dequantizelinear_add_kernel: 0.0242269ms / 1 = 0.0242269ms, 1%
gpu::code_object::gather_kernel: 0.0238603ms / 1 = 0.0238603ms, 1%
@param: 0.0102222ms / 26 = 0.000393161ms, 1%
check_context::migraphx::gpu::context: 0.0009512ms / 1 = 0.0009512ms, 1%
hip::hip_allocate_memory: 0.0009052ms / 1 = 0.0009052ms, 1%

Batch size: 1
Rate: 190.495 inferences/sec
Total time: 5.24948ms
Total instructions time: 16.8825ms
Overhead time: 0.37222ms, -11.633ms
Overhead: 7%, -222%
[ MIGraphX Version: 2.10.0. ] Complete: bin/driver perf ../int8_models/gpt2_1_int8_gpu.onnx --input-dim @input_ids 1 32 --fill1 input_ids --int8

Base automatically changed from fix_parse_dynamicquantizelinear to develop March 19, 2024 14:29
@TedThemistokleous
Collaborator Author

TedThemistokleous commented Apr 9, 2024

Not required, but this can still be used as a perf improvement now that #2903 has been added. Will need to determine the speedup after rebase.

The fixes were pulled out of here into some other changes, but I'm still curious about the perf if we can just auto-convert uint8 / handle matmul correctly now.

@TedThemistokleous TedThemistokleous removed the bugfix (Fixes a bug found in the code) label Apr 29, 2024
@@ -869,6 +875,12 @@ auto skip_broadcasts_converts(Ms... ms)
return skip(name("broadcast", "multibroadcast", "contiguous", "convert"))(ms...);
}

template <class... Ms>
auto skip_broadcast_squeeze(Ms... ms)
Collaborator

Just put this matcher in the .cpp file, as it's specific to the pass.
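For example, something along these lines in the pass's .cpp, following the existing skip_broadcasts_converts pattern shown in the hunk above (the exact list of ops to skip here is my assumption, not taken from this PR):

```cpp
// Sketch only: define the matcher locally in the pass's translation unit
// rather than in matcher.hpp. The skipped op names are illustrative.
namespace {

template <class... Ms>
auto skip_broadcast_squeeze(Ms... ms)
{
    return migraphx::match::skip(
        migraphx::match::name("broadcast", "multibroadcast", "squeeze"))(ms...);
}

} // namespace
```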

@@ -56,7 +56,7 @@ struct quantizelinear

shape compute_shape(std::vector<shape> inputs) const
{
check_shapes{inputs, *this}.same_dims().has(2, 3);
check_shapes{inputs, *this}.has(2, 3);
Collaborator

Why is this removed? They should have the same dimensions.

@TedThemistokleous TedThemistokleous changed the title from "Set output of Dynamicquantizelinear to be int8 instead of uint8" to "Add pass to convert Uint8 to int8 across operators" May 6, 2024
@migraphx-bot
Collaborator

Test | Batch | Rate new (63952f) | Rate old (ae2b02) | Diff | Compare
torchvision-resnet50 64 3,232.60 3,232.34 0.01%
torchvision-resnet50_fp16 64 6,870.53 6,875.78 -0.08%
torchvision-densenet121 32 2,428.49 2,429.76 -0.05%
torchvision-densenet121_fp16 32 4,044.59 4,068.80 -0.60%
torchvision-inceptionv3 32 1,633.77 1,636.23 -0.15%
torchvision-inceptionv3_fp16 32 2,742.80 2,744.68 -0.07%
cadene-inceptionv4 16 769.95 771.82 -0.24%
cadene-resnext64x4 16 802.01 802.64 -0.08%
slim-mobilenet 64 7,430.92 7,438.28 -0.10%
slim-nasnetalarge 64 207.02 207.40 -0.18%
slim-resnet50v2 64 3,337.00 3,328.69 0.25%
bert-mrpc-onnx 8 1,152.71 1,149.05 0.32%
bert-mrpc-tf 1 310.84 308.98 0.60%
pytorch-examples-wlang-gru 1 413.25 416.23 -0.72%
pytorch-examples-wlang-lstm 1 387.91 374.95 3.46% 🔆
torchvision-resnet50_1 1 798.49 799.37 -0.11%
cadene-dpn92_1 1 431.50 432.81 -0.30%
cadene-resnext101_1 1 379.03 378.19 0.22%
onnx-taau-downsample 1 344.37 345.37 -0.29%
dlrm-criteoterabyte 1 35.00 35.05 -0.13%
dlrm-criteoterabyte_fp16 1 57.26 57.40 -0.24%
agentmodel 1 9,579.00 9,772.50 -1.98%
unet_fp16 2 57.91 57.79 0.21%
resnet50v1_fp16 1 931.31 939.67 -0.89%
resnet50v1_int8 1 955.32 929.28 2.80%
bert_base_cased_fp16 64 1,136.01 1,142.07 -0.53%
bert_large_uncased_fp16 32 350.06 351.79 -0.49%
bert_large_fp16 1 208.15 209.75 -0.76%
distilgpt2_fp16 16 2,138.98 2,151.98 -0.60%
yolov5s 1 508.41 506.15 0.45%
tinyllama 1 43.35 43.34 0.02%
vicuna-fastchat 1 174.09 169.35 2.80%
whisper-tiny-encoder 1 410.11 411.82 -0.42%
whisper-tiny-decoder 1 427.12 429.16 -0.47%

Check results before merge 🔆

@migraphx-bot
Copy link
Collaborator


✅ bert-mrpc-onnx: PASSED: MIGraphX meets tolerance
✅ bert-mrpc-tf: PASSED: MIGraphX meets tolerance
✅ pytorch-examples-wlang-gru: PASSED: MIGraphX meets tolerance
✅ pytorch-examples-wlang-lstm: PASSED: MIGraphX meets tolerance
✅ torchvision-resnet50_1: PASSED: MIGraphX meets tolerance
✅ cadene-dpn92_1: PASSED: MIGraphX meets tolerance
✅ cadene-resnext101_1: PASSED: MIGraphX meets tolerance
✅ dlrm-criteoterabyte: PASSED: MIGraphX meets tolerance
✅ agentmodel: PASSED: MIGraphX meets tolerance
✅ unet: PASSED: MIGraphX meets tolerance
✅ resnet50v1: PASSED: MIGraphX meets tolerance
✅ bert_base_cased_fp16: PASSED: MIGraphX meets tolerance
🔴 bert_large_uncased_fp16: FAILED: MIGraphX is not within tolerance - check verbose output
✅ bert_large: PASSED: MIGraphX meets tolerance
✅ yolov5s: PASSED: MIGraphX meets tolerance
✅ tinyllama: PASSED: MIGraphX meets tolerance
✅ vicuna-fastchat: PASSED: MIGraphX meets tolerance
✅ whisper-tiny-encoder: PASSED: MIGraphX meets tolerance
✅ whisper-tiny-decoder: PASSED: MIGraphX meets tolerance
✅ distilgpt2_fp16: PASSED: MIGraphX meets tolerance

@umangyadav umangyadav removed their request for review August 17, 2024 13:19
@causten
Collaborator

causten commented Sep 4, 2024

@TedThemistokleous Is this PR still needed?
