
Test merge unity #2

Closed

wants to merge 876 commits

Conversation

Hzfengsy (Owner)

No description provided.

junrushao and others added 30 commits October 3, 2023 10:36
This PR contains a minor fix for RCCL integration.
This commit adds two new operations (R.quantize and R.dequantize) and
supports them in the LegalizeOps pass.
…he#15861)

Prior to this commit, the `RewriteCUDAGraph` pass would
unconditionally rewrite an `IRModule`, and was conditionally
included as a lowering pass used in `relax.build`, based on the
current `PassContext`.  This commit moves the check on the
`PassContext` from the `relax.build` method to the `RewriteCUDAGraph`
pass itself.  This allows the pass to be part of a lowering
flow that is constructed once and is later used when the
`PassContext.current()` may have changed.
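As a sketch of the usage this enables, a lowering flow can be constructed once and then applied under whichever `PassContext` is current at apply time; the config key `"relax.backend.use_cuda_graph"` below is an assumption about how the rewrite is toggled, and the empty module is only a stand-in.

```python
import tvm
from tvm import relax

# The lowering flow is constructed once...
pipeline = tvm.transform.Sequential([relax.transform.RewriteCUDAGraph()])

mod = tvm.IRModule()  # stand-in for a real lowered module

# ...and applied later, under whichever PassContext is current at that time.
with tvm.transform.PassContext(config={"relax.backend.use_cuda_graph": True}):
    with_graph = pipeline(mod)

with tvm.transform.PassContext(config={"relax.backend.use_cuda_graph": False}):
    without_graph = pipeline(mod)  # the pass now checks the context itself and becomes a no-op
```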
This PR uses static NCCL instead of the dynamically linked one to ensure
out-of-the-box use of the TVM Unity wheel.
This PR fixes a bug introduced in apache#15827 that caused the CUDA graph's
stream to be discarded.
This commit adds `-lrt` to the TVM runtime when it is linked against static NCCL.
Static NCCL depends on the symbol `shm_unlink`, which comes from librt.
Removed instances of accidentally repeated words from comments. There
are cases where duplicated words appear legitimately; those cases remain
unmodified.
* [Unity] Use PrimValue as offset in R.tril and R.triu

This mirrors the support in `topi`, which accepts a `PrimExpr` as the
offset of the diagonal (see the sketch after this list).

* Update implementation to avoid the `-Wsequence-point` warning

I believe the `-Wsequence-point` warning raised by gcc is spurious, as the
`index++` occurs within a braced-initialization list, which has a
defined left-to-right evaluation order.  However, it is better to avoid the
warning altogether.

* Updated attr usage to args

* Correct relax op names in msc

* Parametrize failing MSC unit tests, mark with xfail

* Lint fix

* Marked relay to relax tests as known failures
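As a sketch of what the `R.tril`/`R.triu` change enables, the diagonal offset can be supplied as a `PrimValue` rather than a plain Python integer. This uses the Python expression-building API; `relax.op.tril` and `relax.PrimValue` exist, but treat the exact calling convention as an assumption.

```python
from tvm import relax, tir

x = relax.Var("x", relax.TensorStructInfo((8, 8), "float32"))
k = relax.PrimValue(tir.IntImm("int64", 1))  # offset as a PrimValue rather than a Python int
call = relax.op.tril(x, k)                   # previously the offset was a plain integer
```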
* reconstruct codegen

* minor fix

* minor fix

* minor fix

* update tests

* minor fix
The Disco worker originally imported `tvm.testing.disco` automatically for
convenient unit testing. However, `tvm.testing` is a special subpackage
that introduces many unnecessary dependencies, for example pytest. This
PR removes such dependencies by moving the testing function
registration logic directly to the entry file.
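As a sketch of the registration style this enables, the worker entry file can register the helpers it needs directly via `tvm.register_func`, without importing `tvm.testing` (and therefore pytest). The function name and body below are hypothetical.

```python
import tvm

@tvm.register_func("tests.disco.add_one", override=True)
def add_one(x: int) -> int:
    # Registered directly in the entry file; no tvm.testing import required.
    return x + 1
```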
This PR adds support for ReLU in the NN module and op,
and also adds support for GELU in the NN modules.
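A hypothetical sketch of how the new activations might be used from the `tvm.relax.frontend.nn` frontend; the op names (`op.relu`, `op.gelu`) and module layout are assumptions based on the description above, not taken from the PR.

```python
from tvm.relax.frontend import nn
from tvm.relax.frontend.nn import op

class SmallMLP(nn.Module):
    def __init__(self):
        self.fc = nn.Linear(16, 16)

    def forward(self, x: nn.Tensor) -> nn.Tensor:
        # relu is the newly added op; gelu support is added alongside it
        return op.gelu(op.relu(self.fc(x)))
```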
…pache#15883)

* [Unity][Transform] Allow static Relax arguments to dynamic PrimFunc

Prior to this commit, the `relax.transform.FuseTIR` transform required
that the shape arguments passed into a `PrimFunc` be structurally
equivalent to the shapes of the parameters, and that any replacement
of a symbolic `tir.Var` be with a symbolic `tir.Var` in the fused
function.

This commit updates the `SymbolicMatcher` to instead extract a
`Map<tir::Var, PrimExpr>`.  As a result, a Relax tensor with
statically-known shape can be passed into a TIR PrimFunc with dynamic
shape.  The resulting fused TIR function is in terms of the
statically-known shape, and no longer contains the symbolic variable.
…_tir_inplace` (apache#15878)

* Add call_inplace_packed operator

* Whitespace
…the normalized_shape (apache#15894)

* Fix the KeyError and correctly use the normalized_shape

* Update test_frontend_from_fx.py
* Fix MaxPool TypeError

* Add regression test case.
* delete unused import and add class docstring

* add test for fast math transform

* Update test_fast_math_transform.py
This commit adds debugging information to checks in the `FuseOps`
pass.  While the existing checks indicate where an error occurred in
the `FuseOps` code, this adds information on the relax expressions
that caused the error.
Prior to this commit, `relax::ExternFunc` nodes would be de-duplicated
as part of the `EliminateCommonSubexpr` pass.  This commit instead
ignores the `relax::ExternFunc` nodes, retaining the in-line
definitions.
…e#15884)

* add support for torch.tensor as index

* still doesn't fit general array indexing

* support at most one tensor index, to avoid errors (see the sketch after this list)

* correct tests for tensor as index

* code style

* code style

* code style

* code style
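A hypothetical PyTorch module exercising the indexing pattern that the fx frontend can now convert (the module, shapes, and index values here are illustrative, not from the PR).

```python
import torch

class TensorIndex(torch.nn.Module):
    def forward(self, x: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
        # A single torch.Tensor used as an index is now supported;
        # multiple tensor indices (e.g. x[idx0, idx1]) remain unsupported.
        return x[idx]

out = TensorIndex()(torch.randn(8, 4), torch.tensor([0, 2, 5]))
```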
…ache#15893)

* fix the error in the pad_einsum documentation

* Update schedule.py
apache#15804)

* [Unity] Fix TVMError when loading ONNX model with CumSum operator

* Add regression test for loading ONNX model with CumSum operator

* Fix formatting

* Fix spacing errors
* [Unity] Fix TVMScript Issues in Testcases

Due to frequent syncs with upstream, some of the testcases are broken
because of changes in TVMScript. This PR fixes the broken testcases.
…ache#15699)

* [Unity][Analysis] Implemented DefinableTIRVarsInStructInfo

The existing utility `TIRVarsInStructInfo` returns all TIR variables,
regardless of whether they are suitable for a variable definition, or
are usage sites.  This utility walks over the struct info once,
returning both the definable symbolic variables and the used symbolic
variables.

* [Unity][Analysis] Accept relax::Expr arg in Defined/FreeSymbolicVars

Prior to this commit, this utility could only be used with a
`relax::Function` argument.  This allows individual expressions to be
inspected, even if they are not part of a complete function.

* [Unity] Propagate symbolic variables in LiftTransformParams

* Updated LiftTransformParams to use support::OrderedSet

* Fixed import after rebase
…pache#15923)

Prior to this commit, the `tvm::script::printer::AttrPrinter` class
took the attribute path as a `const ObjectPath&`.  In both places
where an `AttrPrinter` is called, the temporary object
`n_p->Attr("attrs")` is passed for this argument.  While binding a
temporary object to a const reference can extend the lifetime of the
temporary, this requires the const reference to be in the same scope
as the temporary, and does not apply in this case (see [this
stackoverflow post](https://stackoverflow.com/a/2784304)).  Therefore,
this reference is only valid through the construction of `AttrPrinter
printer`, and is invalid during its usage on the following line.

This dangling reference has caused segfaults in CI for unrelated
changes ([example](https://ci.tlcpack.ai/blue/organizations/jenkins/tvm-unity/detail/PR-15904/3/pipeline)),
and can be reproduced with the following test case.

```python
import pytest

from tvm.script import relax as R

@pytest.mark.parametrize("iter", range(10000))
def test_argmax_without_specified_axis(iter):
    @R.function
    def func(x: R.Tensor((1, 2, 3, 4), "float32")):
        return R.argmax(x)

    func.script(show_meta=True)
```

This test case is not included in this commit, as the reproduction is
not consistent, with failure requiring on the order of 10k iterations
to trigger.  In addition, reproduction was sensitive to the following
conditions.

* The function being printed must contain at least one `relax::Call`
  node, with an operation that has attributes.

* TVM must be built with optimization enabled.  In gcc, the
  `-ftree-dse` optimization, which is part of `-O1`, is required to
  trigger the bug.

* Python's default allocation must be used.  If `PYTHONMALLOC=malloc`
  is set to instead use the system's `malloc`, the segfault was no
  longer triggered.

This commit updates `AttrPrinter` to accept the `ObjectPath` by value.
With the change applied, the above test ran 100k times without error.
…he#15822)

* [Unity][VM] Improved error message in CodeGenVM::EmitKillObject

This was implemented while debugging CI failures in
apache#15810, but is not otherwise related
to the changes in that PR.

* ci bump
…pache#15904)

* [Unity][Transform] Canonicalize and use CSE between pattern matches

The `PatternRewriter` is intended to iterate until no matching
patterns remain.  Prior to this commit, this only involved repeating
the pattern match rewrite rules.  However, intermediate results
produced by pattern replacement could cause the iterative pattern
matching to terminate early.

* If two rewrite rules each introduce the same intermediate, there
  will exist two copies of that intermediate, which can prevent
  `only_used_by` patterns from matching.  Applying
  `EliminateCommonSubexpr` allows the pattern matching to continue.

* Applying a rewrite rule may result in dangling intermediates that
  are no longer used.  These dangling intermediates may prevent the
  next application of a rewrite rule that uses the `only_used_by`
  constraint.  Applying `RemoveAllUnused` allows the pattern matching
  to continue.

* A rewrite rule that returns a `relax::Var` or `relax::TupleGetItem`
  as the replacement introduces trivial var-to-var rebinding, which
  are not tracked by `PatternRewriter`.  Applying
  `CanonicalizeBindings` allows the pattern matching to continue.

While this could be fixed externally by repeatedly applying
`rewrite_call`, this would require re-inspecting the entire function,
and not just the dataflow block in which the replacement occurred (see
the usage sketch below).

* Fix tests for removing redundant reshapes

* Fixed failing unit tests, along with edge case in CSE
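A minimal usage sketch of `rewrite_call` with the dataflow pattern API in `tvm.relax.dpl`; the pattern and replacement are illustrative, and the relevant point is that intermediate canonicalization, CSE, and dead-binding removal now happen between pattern-matching rounds.

```python
from tvm import relax
from tvm.relax.dpl import is_op, wildcard, rewrite_call
from tvm.script import ir as I, relax as R

@I.ir_module
class Before:
    @R.function
    def main(x: R.Tensor((2, 3), "float32")) -> R.Tensor((2, 3), "float32"):
        with R.dataflow():
            y = R.add(x, x)
            z = R.add(y, y)
            R.output(z)
        return z

pattern = is_op("relax.add")(wildcard(), wildcard())

def rewriter(expr, matches):
    # Illustrative replacement: add(a, a) -> multiply(a, 2).
    a, b = expr.args
    if a.same_as(b):
        return relax.op.multiply(a, relax.const(2.0, "float32"))
    return expr  # returning the original expression leaves it unchanged

new_main = rewrite_call(pattern, rewriter, Before["main"])
```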
…#15917)

* support torch.arange() + int in dynamo (see the sketch after this list)

* code style

* code style
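A hypothetical model exercising the newly supported `torch.arange()` plus int pattern in the dynamo frontend (shapes and values are illustrative).

```python
import torch

class ArangePlusInt(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # torch.arange() followed by addition with a Python int
        return x + (torch.arange(x.shape[-1]) + 1)

out = ArangePlusInt()(torch.randn(2, 8))
```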
This PR introduces the PagedKVCache object to the Relax runtime
for KV cache value management in batching settings.

One test file is included. Note that this file does not test the
attention function/kernel; that part will be uploaded
and tested separately.
vincentccc and others added 28 commits January 8, 2024 17:53
* update dp4a tensor intrin

* update dp4a tensor intrin

* lint

---------

Co-authored-by: Lufang CHEN 陈橹方 <[email protected]>
If a matrix multiplication cannot be performed due to incompatible
shapes, the error message now specifies the arguments, the shape of
each argument, and which dimension of the shape has a mismatch.
Previously, this error message only provided the dimension of the
mismatch.
…16307)

Prior to this commit, an error message would occur in
`ExprMutator::ReEmitBinding` if the struct info is missing from the
generated value.  However, because this error was generated from
inside `GetStructInfo`, it didn't include sufficient context for
debugging.  This commit checks the struct info explicitly, and
includes the context of the updated variable in the error message.
)

Prior to this commit, the `BundleModelParams` transform would replace model parameters
with `param_tuple[index]` within expressions.  These nested
expressions would then be normalized, resulting in `gv =
param_tuple[index]` or `lv = param_tuple[index]` variable
definitions.  These auto-generated `gv` and `lv` names make it quite
difficult to determine which model parameter is being used.

This commit updates the `BundleModelParams` transform to explicitly
produce the bound variable, `orig_param_name = param_tuple[index]`,
preserving human-readable names from the parameters.
…pache#16367)

Resolve a bug that caused undefined relax variables in the output of
`CanonicalizeBindings` for cases where `VisitVarDef(const Var&)`
replaces a variable, and `VisitExpr_(const VarNode*)` returns a value
with different struct info, both occurring within the same
`VarBinding`.

The ExprMutator is only allowed to update a variable's struct info
if the value bound to it has new struct info.  When
CanonicalizeBindings replaces a trivial binding, this may provide
better struct info as a result.

Prior to this commit, `ExprMutator::ReEmitBinding` defined a
remap for `binding->var->vid`, even if the derived class defined a
replacement by overriding `VisitVarDef`.  If the derived class
defines a new variable binding by overriding `VisitVarDef`, and
also causes a variable replacement by overriding `VisitExpr` and
returning a type with different struct info, then `ExprMutator`
must check for both `binding->var->vid` *AND* `new_var->vid`.  The
former may be present in the unmodified graph, and the latter may
be produced by the derived class before delegating to the base
class.

This commit updates `ExprMutator::ReEmitBinding` to define entries for
both replacements that may be required.
apache#16362)

This PR adds a sanity check to ensure that all `tir_var_upper_bound`
attrs used by static memory planning have integers as their value type.
This check helps avoid mistakes caused by using the wrong value types.

The check is needed since `func->GetAttr<Map<String, IntImm>>`
does not perform a type check.
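A minimal TVMScript sketch of the attribute that this check guards; the attribute name comes from the description above, and the point is that the value must be an integer (a string such as "1024" would now be rejected).

```python
from tvm.script import ir as I, relax as R

@I.ir_module
class Mod:
    @R.function
    def main(x: R.Tensor(("n",), "float32")) -> R.Tensor(("n",), "float32"):
        # Upper bound for the symbolic dimension `n`, consumed by static memory planning.
        R.func_attr({"tir_var_upper_bound": {"n": 1024}})
        return R.add(x, x)
```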
…pache#16310)

Prior to this commit, several diagnostics in the `WellFormedChecker`
would explicitly extract the name from `relax::Var`, `tir::Var`, and
`GlobalVar` instances.  This is unnecessary, as these classes can
be printed directly, and extracting the name bypasses any changes to the
default printing behavior (e.g. printing of variable addresses) that may
be useful while debugging.
…ache#16306)

* [Unity][Transform] Update LambdaLift to use name of lifted lambda

Prior to this commit, the `LambdaLift` pass named each function
as `"lifted_func_" + i`, in incremental order of occurrence.  This
provided unique names for each function, but could be difficult to
read, or to refer to the lifted functions.  This commit updates the
naming scheme to use the location at which the lifted lambda occurs to
generate a unique name for the new `GlobalVar`.

* Update variables names and comments for unique function naming

* Add unit test for conflicting name
CI images should also be updated to install cmake 3.24
…e` kernel and add test (apache#16376)

fix typo bug and add test for vllm reconstruct_from_cache kernel
apache#16349)

* [Unity][MSC] Avoid depending on trivial bindings in Relax intermediate

The conversion from tensorflow to MSC is done by first converting from
tensorflow to relay, then converting from relay to executable python
code, executing that python code to generate relax, and finally
converting from relax to MSC.  During the relax phase of this
conversion, some relax passes are applied to the `IRModule`, including
`FuseOpsByPattern`.

The test cases in `test_msc/test_translate_tensorflow.py` rely on
`FuseOpsByPattern` preserving trivial bindings (e.g. `var_1 = var_2`)
in the relax IRModule.  If these trivial bindings are removed by
`CanonicalizeBindings`, then the test cases in this file fail.  The
presence or absence of trivial bindings after `FuseOpsByPattern` should be
considered an implementation detail, and relax passes should not be
required to preserve trivial bindings.

This PR updates the relay to executable python step of the tensorflow
to MSC conversion, to remove trivial bindings and output a variable
name that matches the expected value in the test case.  While not an
ideal resolution, as other variable name changes could still
reintroduce the same test failures, it ensures that `FuseOpsByPattern`
may canonicalize bindings as an internal pre- or post-processing step
without breaking these unit tests.

* Update implementation to remove dataflow block in MSC codegen

The potential for duplicate variable names was introduced by having
the `block_builder.emit_output` call, which is only required to export
values from a dataflow block.  The dataflow block is not used in any
later MSC conversion, and its removal avoids this re-export of
variables.

If the dataflow block is required in the future, it can be generated
using `tvm.relax.transform.ConvertToDataflowBlock`.

* Make failing test cases be close to the same structural form

* Updated tests to validate output after compilation

* Lint fixes
…e#16314)

* [Unity][Analysis] Add utility for collecting compile-time bindings

Whether an optimization should be performed may depend on when the
variables in an expression are known.

For example, consider a LoRA-adjusted model, with base weights `W` of
shape `[m,n]`, LoRA components `A` and `B` with shapes `[r,n]` and
`[m,r]` respectively, and activations `x` with shape `[n,1]`.  The
LoRA-adjusted matmul could be computed either as `(W + B*A)*x` or as
`(W*x + B*(A*x))`.

If `A` and `B` are provided at run-time, then computing `(W*x +
B*(A*x))` requires significantly fewer computations.

* `(W + B*A)*x`: `m*n*(2*r + 3)` operations
  1. `B*A`: `2*m*n*r` operations using a naive matmul
  2. Adding `W` to (1): `m*n` operations
  3. Multiplying `x` by (2): `2*m*n` operations

* `(W*x + B*(A*x))`: `2*m*n + 2*r*n + 2*m*r + m` operations
  1. `W*x`: `2*m*n` operations
  2. `A*x`: `2*r*n` operations
  3. Multiplying `B` by (2): `2*m*r` operations
  4. Adding (1) and (3): `m` operations

However, if `A` and `B` are known at compile-time, then computing `(W
+ B*A)*x` groups all compile-time values together, allowing them to be
computed earlier (i.e. using `LiftTransformParams`).

* `(W + B*A)*x`: `2*m*n` operations
  1. `B*A`: 0 operations, computed at compile-time
  2. Adding `W` to (1): 0 operations, computed at compile-time
  3. Multiplying `x` by (2): `2*m*n` operations

Since the choice of optimized expression depends on which parameters
can be computed at compile-time, it is useful to have a utility that
identifies values that can be computed at compile-time.
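As a quick numeric check of the operation counts above (the sizes below are arbitrary):

```python
m, n, r = 4096, 4096, 16

fused_runtime  = m * n * (2 * r + 3)        # (W + B*A)*x with A, B known only at run time
factored       = 2*m*n + 2*r*n + 2*m*r + m  # (W*x + B*(A*x))
fused_compiled = 2 * m * n                  # (W + B*A)*x with B*A folded at compile time

print(fused_runtime, factored, fused_compiled)
# For r much smaller than m and n, `factored` is far cheaper than `fused_runtime`,
# while `fused_compiled` is cheapest when the fold can happen at compile time.
```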

* [Unity] QoL improvements for Dataflow matching

- Update the zero-parameter `WildcardPattern` constructor to produce a
  valid instance.  Previously, the zero-parameter constructor produced
  a null instance of `WildcardPattern`, which resulted in an error
  when used.  The `WildcardPattern` was expected to be constructed
  through the `Wildcard` function instead.  Since all other
  `DFPattern` child classes could be constructed explicitly, this
  could lead to unexpected outcomes.

- Check for `pattern.defined()` when performing a pattern-match.  If
  a null instance of a pattern is provided, this gives an error
  message with more context than the one raised by `DFPatternFunctor`.

- Expose `RewriteCall` for use in C++.  Previously, it had only been
  exposed through the FFI registry, and had no declaration in a header
  file.

* [Unity][Transform] Implement relax.transform.AdjustMatmulOrder

Reorder `x*(A*B)` to `(x*A)*B`.  Intended for optimization of LoRA
models, for which `(x*A)*B` has a much smaller memory footprint.

* Fix copy-paste error

* Check for re-orderings from the LHS; skip if a benefit cannot be proven
This PR supports PagedKVCache by leveraging TIR kernels.

Right now we do not have sufficient TIR kernels for multi-level
sequences in PagedKVCache; therefore `Fork` in PagedKVCache
is disabled when such a function does not exist.

This PR adds a "reduced" creator of PagedKVCache, where
some auxiliary functions such as the begin/end forward function
of prefill/decode default to None.

CUDA tests are added to ensure correctness.

Co-authored-by: Hongyi Jin <[email protected]>
Co-authored-by: Bohan Hou <[email protected]>
* [Unity][nnModule] Dynamic shape support in nn Module
fix onnx frontend

Co-authored-by: cheng wen <chengven027-intellif>
…6396)

This PR enhances PagedKVCache with inline RoPE computation,
which unblocks the movement towards sliding window and attention
sink.

Both FlashInfer and TIR kernels are updated in this PR with
the RoPE calculation. Note that FlashInfer is bumped in order
to include the RoPE update.

The previous standalone kernel used for RoPE application
is thereby removed.

---

Co-authored-by: Bohan Hou <[email protected]>
Co-authored-by: Hongyi Jin <[email protected]>
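For reference, a numpy sketch of the rotary position embedding computation that the kernels above now apply inline; the rotate-half layout and base of 10000 are conventional choices, and the actual kernels may use a different variant or memory layout.

```python
import numpy as np

def rope(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """x: [seq_len, num_heads, head_dim]; positions: [seq_len] absolute positions."""
    seq_len, num_heads, head_dim = x.shape
    half = head_dim // 2
    freqs = base ** (-np.arange(half) / half)       # per-dimension rotation frequencies
    angles = positions[:, None] * freqs[None, :]    # [seq_len, half]
    cos = np.cos(angles)[:, None, :]                # broadcast over heads
    sin = np.sin(angles)[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```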
…che#16111)

This PR enhances the static block memory planning pass.
Prior to this PR, memory planning only worked on memory
allocations that are not externally referenced. In dynamic
shape settings, such memory allocation is not fully static
and may lead to memory fragmentation.

This PR enhances the behavior so that, for such memory
allocations, we first allocate a storage with regard to its
estimated upper bound (when known), and then allocate the
tensor with the actual dynamic shape out of that storage.
This ensures static memory allocation and avoids
memory fragmentation.
* [Unity] Split DecomposeOpsForTraining into two steps

Prior to this commit, the `DecomposeOpsForTraining` transform directly
replaced `relax.nn.batch_norm` with more primitive relax operations.
This required the decomposed form of `relax.nn.batch_norm` to be
duplicated in `DecomposeOpsForInference`.  This commit refactors the
pass to occur in two steps: first apply training-specific
mutations, then decompose.

Having a dedicated `DecomposeOps` step also provides a single clear location
for operator decomposition, which may be migrated into the operator
definition in the future, similar to `FLegalize`.

* Updated ApplyPassToFunction utility to use a regex
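A minimal sketch of applying the two entry points after this refactor, assuming their public names (`DecomposeOpsForInference`, `DecomposeOpsForTraining`) are unchanged; the example module is illustrative.

```python
from tvm import relax
from tvm.script import ir as I, relax as R

@I.ir_module
class Mod:
    @R.function
    def main(
        x: R.Tensor((1, 4, 8, 8), "float32"),
        gamma: R.Tensor((4,), "float32"),
        beta: R.Tensor((4,), "float32"),
        mean: R.Tensor((4,), "float32"),
        var: R.Tensor((4,), "float32"),
    ) -> R.Tensor((1, 4, 8, 8), "float32"):
        bn = R.nn.batch_norm(x, gamma, beta, mean, var, axis=1)
        return bn[0]

# Both entry points now share a single decomposition step internally.
inference_mod = relax.transform.DecomposeOpsForInference()(Mod)
training_mod = relax.transform.DecomposeOpsForTraining()(Mod)
```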
@Hzfengsy Hzfengsy closed this Jan 22, 2024
@Hzfengsy Hzfengsy deleted the test-merge-unity branch January 24, 2024 05:19