[METAL] Fix issue with GPU fails #7819

echuraev · 2021-04-09T20:14:51Z

Added a first run to the auto scheduler. This run is needed to check
that the generated kernel is correct. If we run the time evaluator
directly on an incorrect kernel, our application on an iOS device can
be added to the ignore list because of the large number of incorrect
kernels it has committed. A single run before the time evaluator helps
us avoid this problem.
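
A minimal sketch of the "one run before timing" idea, written in plain Python rather than against the TVM API. `measure_kernel`, `run_kernel` and `sync` are hypothetical names used only for illustration; the point is that a broken kernel is submitted once and rejected, instead of being submitted repeatedly by the timing loop.

```python
import time


def measure_kernel(run_kernel, sync, repeat=3):
    """Validate a kernel with one plain run, then time it.

    run_kernel and sync are hypothetical callables standing in for the
    compiled function and the device synchronization call.
    """
    # Validation run: a broken kernel fails here after a single
    # submission, so the timing loop below never sees it.
    try:
        run_kernel()
        sync()
    except RuntimeError as err:
        return None, str(err)

    # Timed runs: only reached when the validation run succeeded.
    costs = []
    for _ in range(repeat):
        start = time.perf_counter()
        run_kernel()
        sync()
        costs.append(time.perf_counter() - start)
    return costs, None
```

With a failing kernel the function returns `(None, <error message>)` after exactly one submission; with a working kernel it submits `repeat + 1` times in total.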

Added completed handlers to all command buffers in the Metal runtime.
This lets the runtime catch GPU errors and report them to the host
application.

When an error occurs, we have to create a new stream. Added a
mechanism for error handling and stream creation from the Python interface.
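
As a rough illustration of the recovery flow, here is a sketch in plain Python. `FakeDevice` and its `create_stream` / `set_stream` / `free_stream` methods are hypothetical stand-ins, not the TVM interface, and freeing the stream after every run is an assumption of this sketch. It shows each run getting its own stream, so a stream that saw a GPU error is discarded and the next run starts on a fresh one.

```python
class FakeDevice:
    """Hypothetical stand-in for a GPU device; not the TVM API."""

    def __init__(self):
        self._next_id = 0
        self.current = None
        self.freed = []

    def create_stream(self):
        # Hand out a new integer handle for every stream.
        self._next_id += 1
        return self._next_id

    def set_stream(self, stream):
        self.current = stream

    def free_stream(self, stream):
        self.freed.append(stream)


def run_on_fresh_stream(dev, kernel):
    """Run kernel on its own stream and always release that stream."""
    stream = dev.create_stream()
    dev.set_stream(stream)
    try:
        return kernel(), None
    except RuntimeError as err:
        # The failed stream is not reused; the next call creates a new one.
        return None, str(err)
    finally:
        dev.free_stream(stream)
```

The error is returned to the caller as a string instead of being swallowed, which mirrors the goal of reporting GPU errors to the host application.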



src/runtime/minrpc/rpc_reference.h

python/tvm/_ffi/runtime_ctypes.py

tqchen · 2021-04-14T13:18:42Z

also cc @masahi @csullivan @ZihengJiang please help to review this PR

src/runtime/metal/metal_device_api.mm

src/runtime/metal/metal_common.h

masahi · 2021-04-16T08:27:58Z

@echuraev please kick CI again

masahi · 2021-04-16T20:06:50Z

@tqchen blocked by your change request

masahi · 2021-04-16T22:48:23Z

thanks @echuraev @tqchen

* [METAL] Fix issue with GPU fails
* Try to fix QEMU build
* Apply comment
* Apply comments and fix build
* Apply comments and fix lint
* Fix CI

masahi · 2021-05-02T12:04:08Z

hmm it seems this commit broke auto scheduling on vulkan. Removing the change in auto_scheduler/measure.py fixes it. I'll take a look but @echuraev do you know what could be wrong?

tqchen · 2021-05-02T12:06:25Z

@masahi this could be due to the stream management introduced in this PR (explicit calls to set stream and new stream/free stream). I believe in vk we should always allocate and return an indicator of the default stream

masahi · 2021-05-02T12:21:05Z

ok I see

tvm/src/runtime/vulkan/vulkan.cc

Lines 397 to 399 in 46e0634

  void SetStream(Device dev, TVMStreamHandle stream) final {
    LOG(FATAL) << "Not implemented";
    return;

tqchen · 2021-05-02T13:16:29Z

We can let new stream return nullptr, and implement setstream/freestream as a no-op for nullptr
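
A small Python model of this suggestion, for illustration only. The real change would live in the C++ Vulkan runtime; `SingleQueueDevice` and its method names are hypothetical, `None` plays the role of `nullptr`, and rejecting non-default handles is an assumption of this sketch rather than something stated in the thread.

```python
class SingleQueueDevice:
    """Hypothetical model of a backend with no real stream concept."""

    def create_stream(self):
        # None plays the role of nullptr: it denotes the default stream.
        return None

    def set_stream(self, stream):
        # No-op for the default stream.
        self._check_default(stream)

    def free_stream(self, stream):
        # No-op for the default stream.
        self._check_default(stream)

    @staticmethod
    def _check_default(stream):
        # Assumption of this sketch: any other handle is a caller error.
        if stream is not None:
            raise ValueError("only the default stream is supported")
```

Generic code that creates, sets and frees a stream around each measurement can then run unchanged on such a backend.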


…rations (#7969) rpc_runner_run interacts with stream handlers following PR #7819. Vulkan currently adds everything into a single command buffer per CPU thread, so there isn't a corresponding concept of streams. Therefore, added no-op implementations for these DeviceAPI methods. Co-authored-by: Eric Lunderberg
