[Torch] Support returning quantized weights and bias for BYOC use cases #9135
This addresses the issue discussed in https://discuss.tvm.apache.org/t/qnn-pytorch-byoc-full-integer-qnn-support/11127
PyTorch stores quantized weights in a custom packed format, so we cannot directly access the 8-bit weights as Numpy arrays. We use a PyTorch function to unpack quantized weights into float32 arrays and their quantization parameters.
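For reference, a minimal sketch of what such unpacking looks like for a per-tensor quantized tensor (the accessors `dequantize()`, `q_scale()`, `q_zero_point()`, and `int_repr()` are standard PyTorch APIs; the shapes and parameters below are made up for illustration):

```python
import torch

# Per-tensor affine quantization of a float32 weight (illustrative values).
w_fp32 = torch.randn(16, 8)
qw = torch.quantize_per_tensor(w_fp32, scale=0.05, zero_point=0, dtype=torch.qint8)

# PyTorch stores qw in its own packed format; these accessors recover
# the pieces the frontend needs as plain values / Numpy arrays.
w_float = qw.dequantize().numpy()   # float32 weights, with scale applied back
scale = qw.q_scale()                # per-tensor quantization scale
zero_point = qw.q_zero_point()      # per-tensor zero point
w_int8 = qw.int_repr().numpy()      # raw int8 values
```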
By default, we use `qnn.op.quantize(...)` to recover the int8 weights in the QNN graph: the frontend returns float32 weights to users and relies on QNN lowering plus the Relay constant folding pass to quantize the weights at compile time. In BYOC use cases, however, we cannot apply the constant folding pass on a QNN graph. I added a new option to quantize weights in the frontend, using a function equivalent to `qnn.op.quantize(...)` that operates on Numpy arrays. In hindsight, we should have chosen this approach from the beginning. The old behavior is kept as the default for backward compatibility.

cc @comaniac