Building error on ARM Device #4994

Closed
AashinShazar opened this issue Nov 21, 2020 · 19 comments

@AashinShazar

So I'm trying to build JAX from source for the NVIDIA Jetson TX2, which is an ARM device.

I've got bazel up and running, and the build gets almost all the way through until it hits the following error:

[9,973 / 10,056] Compiling external/llvm-project/llvm/lib/CodeGen/InlineSpiller.cpp; 19s local ... (4 actions running)
Target //build:install_xla_in_source_tree up-to-date:
  bazel-bin/build/install_xla_in_source_tree
INFO: Elapsed time: 16931.191s, Critical Path: 374.05s
INFO: 6229 processes: 6229 local.
INFO: Build completed successfully, 10379 total actions
INFO: Running command line: bazel-bin/build/install_xla_in_source_tree /home/nvidia/jax/build
INFO: Build completed successfully, 10379 total actions
Traceback (most recent call last):
  File "/home/nvidia/.cache/bazel/_bazel_nvidia/a5643b5cc286b9b13a96818003a4a7dd/execroot/__main__/bazel-out/arm-opt/bin/build/install_xla_in_source_tree.runfiles/__main__/build/install_xla_in_source_tree.py", line 94, in <module>
    copy(r.Rlocation("__main__/jaxlib/cusolver_kernels.so"))
  File "/home/nvidia/.cache/bazel/_bazel_nvidia/a5643b5cc286b9b13a96818003a4a7dd/execroot/__main__/bazel-out/arm-opt/bin/build/install_xla_in_source_tree.runfiles/__main__/build/install_xla_in_source_tree.py", line 54, in copy
    _copy_so(src_file, dst_dir, dst_filename=dst_filename)
  File "/home/nvidia/.cache/bazel/_bazel_nvidia/a5643b5cc286b9b13a96818003a4a7dd/execroot/__main__/bazel-out/arm-opt/bin/build/install_xla_in_source_tree.runfiles/__main__/build/install_xla_in_source_tree.py", line 43, in _copy_so
    shutil.copy(src_file, dst_file)
  File "/usr/lib/python3.6/shutil.py", line 245, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/usr/lib/python3.6/shutil.py", line 121, in copyfile
    with open(dst, 'wb') as fdst:
PermissionError: [Errno 13] Permission denied: '/home/nvidia/jax/build/jaxlib/cusolver_kernels.so'
Traceback (most recent call last):
  File "build/build.py", line 457, in <module>
    main()
  File "build/build.py", line 452, in main
    shell(command)
  File "build/build.py", line 51, in shell
    output = subprocess.check_output(cmd)
  File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['bazelisk', 'run', '--verbose_failures=true', '--config=short_logs', '--config=mkl_open_source_only', '--config=cuda', '--define=xla_python_enable_gpu=true', ':install_xla_in_source_tree', '/home/nvidia/jax/build', '--cpu=arm']' returned non-zero exit status 1.

Looking more closely, I see a permission error for this path:
PermissionError: [Errno 13] Permission denied: '/home/nvidia/jax/build/jaxlib/cusolver_kernels.so'

I've tried modifying the permissions with chmod and chown, but with no luck.

I'd really appreciate any pointers or guidance on resolving this, thank you!

@hawkinsp
Collaborator

We're currently moving our jaxlib build around a bit, so there might be a bit of breakage.
Can you try patching in #4982? Does it help?

@AashinShazar
Author

AashinShazar commented Nov 21, 2020

I've patched in the mentioned PR. Is there somewhere in the files I need to specify that I'm building on an ARM device / aarch64 platform?

Previously, I followed the guidance in #773, but I noticed the patch modifies the files where I would have specified the ["--cpu=arm"] option.

@hawkinsp
Collaborator

hawkinsp commented Nov 21, 2020

One thing that needs fixing is this line in build_wheel.py in that PR; otherwise the wheel will have the wrong platform name.

 cpu_name = "amd64" if platform.system() == "Windows" else "x86_64"

Specifying --cpu=arm in build.py, as that issue suggests, might also help. We've never tried this, though!
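
For illustration, a minimal sketch of how that line could be made architecture-aware; the helper name infer_cpu_name is hypothetical and not part of the PR, and the eventual fix may look different:

 import platform

 def infer_cpu_name():
   # Hypothetical helper: derive the wheel's CPU tag from the host
   # architecture instead of hard-coding x86_64, so an aarch64 build
   # gets the right platform name.
   machine = platform.machine().lower()
   if platform.system() == "Windows":
     return "amd64"
   if machine in ("aarch64", "arm64"):
     return "aarch64"
   return "x86_64"

 cpu_name = infer_cpu_name()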

@AashinShazar
Author

So I'm getting the following errors:

[8,692 / 11,264] Compiling external/llvm-project/llvm/utils/TableGen/CodeGenDAGPatterns.cpp; 28s local ... (4 actions running)
ERROR: /home/nvidia/.cache/bazel/_bazel_nvidia/a5643b5cc286b9b13a96818003a4a7dd/external/org_tensorflow/tensorflow/compiler/xla/python/BUILD:354:11: undeclared inclusion(s) in rule '@org_tensorflow//tensorflow/compiler/xla/python:outfeed_receiver_py':
this rule is missing dependency declarations for the following files included by 'external/org_tensorflow/tensorflow/compiler/xla/python/outfeed_receiver_py.cc':
  'bazel-out/aarch64-opt/bin/external/local_config_python/python_include/numpy/arrayobject.h'
  'bazel-out/aarch64-opt/bin/external/local_config_python/python_include/numpy/ndarrayobject.h'
  'bazel-out/aarch64-opt/bin/external/local_config_python/python_include/numpy/ndarraytypes.h'
  'bazel-out/aarch64-opt/bin/external/local_config_python/python_include/numpy/npy_common.h'
  'bazel-out/aarch64-opt/bin/external/local_config_python/python_include/numpy/numpyconfig.h'
  'bazel-out/aarch64-opt/bin/external/local_config_python/python_include/numpy/_numpyconfig.h'
  'bazel-out/aarch64-opt/bin/external/local_config_python/python_include/numpy/npy_endian.h'
  'bazel-out/aarch64-opt/bin/external/local_config_python/python_include/numpy/npy_cpu.h'
  'bazel-out/aarch64-opt/bin/external/local_config_python/python_include/numpy/utils.h'
  'bazel-out/aarch64-opt/bin/external/local_config_python/python_include/numpy/_neighborhood_iterator_imp.h'
  'bazel-out/aarch64-opt/bin/external/local_config_python/python_include/numpy/npy_1_7_deprecated_api.h'
  'bazel-out/aarch64-opt/bin/external/local_config_python/python_include/numpy/old_defines.h'
  'bazel-out/aarch64-opt/bin/external/local_config_python/python_include/numpy/__multiarray_api.h'
  'bazel-out/aarch64-opt/bin/external/local_config_python/python_include/numpy/npy_interrupt.h'
In file included from external/org_tensorflow/tensorflow/compiler/xla/service/buffer_assignment.h:37:0,
                 from external/org_tensorflow/tensorflow/compiler/xla/service/compiler.h:30,
                 from external/org_tensorflow/tensorflow/compiler/xla/client/local_client.h:27,
                 from external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_client.h:32,
                 from external/org_tensorflow/tensorflow/compiler/xla/python/outfeed_receiver_py.cc:26:
external/org_tensorflow/tensorflow/compiler/xla/service/memory_space_assignment.h:502:3: warning: multi-line comment [-Wcomment]
   //       /   \
   ^
external/org_tensorflow/tensorflow/compiler/xla/service/memory_space_assignment.h:1024:3: warning: multi-line comment [-Wcomment]
   //       /        \          \       \
   ^
In file included from bazel-out/aarch64-opt/bin/external/local_config_python/python_include/numpy/ndarraytypes.h:1809:0,
                 from bazel-out/aarch64-opt/bin/external/local_config_python/python_include/numpy/ndarrayobject.h:18,
                 from bazel-out/aarch64-opt/bin/external/local_config_python/python_include/numpy/arrayobject.h:4,
                 from external/org_tensorflow/tensorflow/compiler/xla/python/types.h:22,
                 from external/org_tensorflow/tensorflow/compiler/xla/python/outfeed_receiver_py.cc:29:
bazel-out/aarch64-opt/bin/external/local_config_python/python_include/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
 #warning "Using deprecated NumPy API, disable it by " \
  ^~~~~~~
In file included from external/org_tensorflow/tensorflow/compiler/xla/python/outfeed_receiver_py.cc:29:0:
external/org_tensorflow/tensorflow/compiler/xla/python/types.h:77:8: warning: ‘xla::PythonBufferTree’ declared with greater visibility than the type of its field ‘xla::PythonBufferTree::arrays’ [-Wattributes]
 struct PythonBufferTree {
        ^~~~~~~~~~~~~~~~
external/org_tensorflow/tensorflow/compiler/xla/python/types.h:98:8: warning: ‘xla::CastToArrayResult’ declared with greater visibility than the type of its field ‘xla::CastToArrayResult::array’ [-Wattributes]
 struct CastToArrayResult {
        ^~~~~~~~~~~~~~~~~
cc1plus: warning: unrecognized command line option ‘-Wno-stringop-truncation’
Target //build:build_wheel failed to build
INFO: Elapsed time: 4972.990s, Critical Path: 185.17s
INFO: 2804 processes: 2804 local.
FAILED: Build did NOT complete successfully
ERROR: Build failed. Not running target
FAILED: Build did NOT complete successfully
Traceback (most recent call last):
  File "build/build.py", line 465, in <module>
    main()
  File "build/build.py", line 460, in main
    shell(command)
  File "build/build.py", line 51, in shell
    output = subprocess.check_output(cmd)
  File "/home/nvidia/c4aarch64_installer/envs/SFTM_GPU/lib/python3.8/subprocess.py", line 411, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/home/nvidia/c4aarch64_installer/envs/SFTM_GPU/lib/python3.8/subprocess.py", line 512, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['bazelisk', 'run', '--verbose_failures=true', '--config=short_logs', '--config=mkl_open_source_only', '--config=cuda', '--define=xla_python_enable_gpu=true', ':build_wheel', '--', '--output_path=/home/nvidia/jax/dist', '--cpu=arm']' returned non-zero exit status 1.

Any ideas on how I can resolve this?

@tomweingarten

I'm having a very similar "permission denied" problem on Ubuntu while building for an Intel CPU. I'm using master as of today, which includes #4982. I tried both the downloaded bazel 3.1.0 and bazelisk with 3.7.0.

Build options:
USE_BAZEL_VERSION=3.7.0 python build/build.py --enable_cuda --cuda_version 11.1 --cudnn_version 8.0.5 --cuda_compute_capabilities 8.6 --enable_mkl_dnn true --enable_march_native true --bazel_path=/usr/local/bin/bazelisk

Error:
Traceback (most recent call last):
  File "/home/tom/.cache/bazel/_bazel_tom/58b081ab250964c45d8160fdfcced5ca/execroot/__main__/bazel-out/k8-opt/bin/build/build_wheel.runfiles/__main__/build/build_wheel.py", line 167, in <module>
    prepare_wheel(sources_path)
  File "/home/tom/.cache/bazel/_bazel_tom/58b081ab250964c45d8160fdfcced5ca/execroot/__main__/bazel-out/k8-opt/bin/build/build_wheel.runfiles/__main__/build/build_wheel.py", line 113, in prepare_wheel
    copy_to_jaxlib(r.Rlocation("__main__/jaxlib/cusolver_kernels.so"))
  File "/home/tom/.cache/bazel/_bazel_tom/58b081ab250964c45d8160fdfcced5ca/execroot/__main__/bazel-out/k8-opt/bin/build/build_wheel.runfiles/__main__/build/build_wheel.py", line 71, in copy_file
    _copy_so(src_file, dst_dir, dst_filename=dst_filename)
  File "/home/tom/.cache/bazel/_bazel_tom/58b081ab250964c45d8160fdfcced5ca/execroot/__main__/bazel-out/k8-opt/bin/build/build_wheel.runfiles/__main__/build/build_wheel.py", line 60, in _copy_so
    shutil.copy(src_file, dst_file)
  File "/usr/lib/python3.8/shutil.py", line 415, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/usr/lib/python3.8/shutil.py", line 261, in copyfile
    with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
PermissionError: [Errno 13] Permission denied: '/tmp/jaxlib5hx2wxwi/jaxlib/cusolver_kernels.so'

@AashinShazar
Author

Yeah, I tried rebuilding after pulling master, which includes #4982, and I get the same error as @tomweingarten above.

Here's my output:

Target //build:build_wheel up-to-date:
  /home/nvidia/.cache/bazel/_bazel_nvidia/79ba50d43708ccfe1f35f6759102563a/execroot/__main__/bazel-out/arm-opt/bin/build/build_wheel
INFO: Elapsed time: 16627.317s, Critical Path: 363.96s
INFO: 6246 processes: 6246 local.
INFO: Build completed successfully, 10402 total actions
INFO: Running command line: /home/nvidia/.cache/bazel/_bazel_nvidia/79ba50d43708ccfe1f35f6759102563a/execroot/__main__/bazel-out/arm-opt/bin/build/build_wheel '--output_path=/media/nvidia/NIKON/jax/dist'
INFO: Build completed successfully, 10402 total actions
Traceback (most recent call last):
  File "/home/nvidia/.cache/bazel/_bazel_nvidia/79ba50d43708ccfe1f35f6759102563a/execroot/__main__/bazel-out/arm-opt/bin/build/build_wheel.runfiles/__main__/build/build_wheel.py", line 173, in <module>
    prepare_wheel(sources_path)
  File "/home/nvidia/.cache/bazel/_bazel_nvidia/79ba50d43708ccfe1f35f6759102563a/execroot/__main__/bazel-out/arm-opt/bin/build/build_wheel.runfiles/__main__/build/build_wheel.py", line 119, in prepare_wheel
    copy_to_jaxlib(r.Rlocation("__main__/jaxlib/cusolver_kernels.so"))
  File "/home/nvidia/.cache/bazel/_bazel_nvidia/79ba50d43708ccfe1f35f6759102563a/execroot/__main__/bazel-out/arm-opt/bin/build/build_wheel.runfiles/__main__/build/build_wheel.py", line 77, in copy_file
    _copy_so(src_file, dst_dir, dst_filename=dst_filename)
  File "/home/nvidia/.cache/bazel/_bazel_nvidia/79ba50d43708ccfe1f35f6759102563a/execroot/__main__/bazel-out/arm-opt/bin/build/build_wheel.runfiles/__main__/build/build_wheel.py", line 63, in _copy_so
    shutil.copy(src_file, dst_file)
  File "/home/nvidia/c4aarch64_installer/envs/SFTM_GPU/lib/python3.8/shutil.py", line 415, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/home/nvidia/c4aarch64_installer/envs/SFTM_GPU/lib/python3.8/shutil.py", line 261, in copyfile
    with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
PermissionError: [Errno 13] Permission denied: '/tmp/jaxlibkz3m3qjx/jaxlib/cusolver_kernels.so'
Traceback (most recent call last):
  File "build/build.py", line 466, in <module>
    main()
  File "build/build.py", line 461, in main
    shell(command)
  File "build/build.py", line 51, in shell
    output = subprocess.check_output(cmd)
  File "/home/nvidia/c4aarch64_installer/envs/SFTM_GPU/lib/python3.8/subprocess.py", line 411, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/home/nvidia/c4aarch64_installer/envs/SFTM_GPU/lib/python3.8/subprocess.py", line 512, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['bazel', 'run', '--verbose_failures=true', '--config=short_logs', '--config=mkl_open_source_only', '--config=cuda', '--cpu=arm', '--define=xla_python_enable_gpu=true', ':build_wheel', '--', '--output_path=/media/nvidia/NIKON/jax/dist']' returned non-zero exit status 1.
(SFTM_GPU) nvidia@nvidia-desktop:/media/nvidia/NIKON/jax$

@tomweingarten

I have absolutely no idea what I'm doing, but I managed to get this to build with a few hacks:

  • Currently build_wheel.py:162 defaults to using a temporary dir: tmpdir = tempfile.TemporaryDirectory(prefix="jaxlib"). I changed this back to the previous behavior: sources_path = os.path.join(os.getcwd(), "jaxlib")
  • build_wheel.py:104 tries to copy setup.py onto itself, so I commented that out
  • I still had some weird permissions problems, so I just ran sudo chmod -R u+rw ~/.cache/bazel/ and it worked.

YMMV :)
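
For reference, a rough sketch of the first two hacks above, assuming the build_wheel.py layout of that revision (exact line numbers and call names may differ in your checkout):

 import os

 # Hack 1: stage the wheel sources in a directory under the current working
 # directory instead of tempfile.TemporaryDirectory(prefix="jaxlib").
 sources_path = os.path.join(os.getcwd(), "jaxlib")
 os.makedirs(sources_path, exist_ok=True)

 # Hack 2: comment out the call around build_wheel.py:104 that copies
 # setup.py onto itself (shown schematically; the real call may differ):
 # copy_file("build/setup.py", dst_dir=sources_path)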

@bayerj

bayerj commented Nov 27, 2020

Nice to know. Do the tests pass?

@AashinShazar
Author

> I have absolutely no idea what I'm doing, but I managed to get this to build with a few hacks:
>
>   • Currently build_wheel.py:162 defaults to using a temporary dir: tmpdir = tempfile.TemporaryDirectory(prefix="jaxlib"). I changed this back to the previous behavior: sources_path = os.path.join(os.getcwd(), "jaxlib")
>   • build_wheel.py:104 tries to copy setup.py onto itself, so I commented that out
>   • I still had some weird permissions problems, so I just ran sudo chmod -R u+rw ~/.cache/bazel/ and it worked.
>
> YMMV :)

I can confirm that this works! I managed to build this on an ARM device with these changes and some changes from earlier. Now to run some tests and see if it's actually working.

@AashinShazar
Author

> Nice to know. Do the tests pass?

I get a ton of "replacing crashed worker gw1" error messages and failures for some of the tests.

However, for my use case of JAX in my project, it seems to be working just fine, which is a little strange. I wonder if the tests crashing has something to do with the ARM device (Jetson TX2) I am using.

@tomweingarten

tomweingarten commented Nov 27, 2020

Similarly, the tests do not pass, but my training loop runs.
841 failed, 8423 passed, 968 skipped, 16 errors in 1215.52s (0:20:15)
test_summary.txt

Incidentally, should we add pip install pytest-tornasync to the documentation for running tests?

I also noticed a lot of tests fail with CUDA OOM, I assume because the tests are running in parallel? I was only using a couple hundred MB of GPU RAM for other apps. Is there a way to tell pytest to run only one CUDA test at a time?

@hawkinsp
Collaborator

hawkinsp commented Nov 30, 2020

If you are using a GPU, you either need to avoid running the tests in parallel (pytest -n 1) or use the platform allocator (see https://jax.readthedocs.io/en/latest/gpu_memory_allocation.html#gpu-memory-allocation; try XLA_PYTHON_CLIENT_ALLOCATOR=platform).

We recommend pytest-xdist in our documentation; I'm not familiar with pytest-tornasync but it doesn't look relevant to us.
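
If it helps, one way to apply the allocator suggestion is to set the variable in Python before any JAX computation runs, e.g. at the top of a test script or in conftest.py (a sketch, not the only way to do it):

 import os

 # Set before the GPU backend initializes so the platform allocator is used
 # instead of the default preallocating one.
 os.environ.setdefault("XLA_PYTHON_CLIENT_ALLOCATOR", "platform")

 import jax  # safe to import and use JAX after the variable is set

Equivalently, export the variable in the shell and pass -n 1 to pytest so only one test process touches the GPU at a time.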

It would be helpful if we could figure out why the permissions problem happens for your tmpdir. Is /tmp perhaps mounted with some unusual mount options on your system?

We can definitely work around the problem by always using a subdirectory of the source tree; just don't choose jaxlib, or you will recreate the "moving a file over itself" problem you saw. Any unused directory will do.
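
For example, a hypothetical variant of that workaround (not necessarily what the eventual fix does) would stage the sources in a fresh directory under the source tree rather than under /tmp:

 import os, tempfile

 # Create an unused staging directory inside the checkout; any prefix other
 # than plain "jaxlib" avoids copying files onto themselves.
 sources_path = tempfile.mkdtemp(prefix="jaxlib_build_", dir=os.getcwd())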

@tomweingarten

Thanks! I followed the commands in the documentation here: https://jax.readthedocs.io/en/latest/developer.html#running-the-tests. Doing that gave an error that I needed to install pytest-tornasync -- but uninstalling all the pytest packages and reinstalling without tornasync seems to work fine, so it must have been a weird dependency glitch.

/tmp/ permissions are standard for me:
drwxrwxrwt 23 root root 40960 Nov 30 10:05 tmp

Is it possible that the change to build_wheel.py is causing it to look for files in the /tmp directory even though they're being built in the working directory, so the permission error is actually because it's unable to find the files? It seems weird, but when I looked at the /tmp directory during the build I didn't see anything there.

I'm re-running the tests now to see if they pass.

@hawkinsp
Collaborator

> Thanks! I followed the commands in the documentation here: https://jax.readthedocs.io/en/latest/developer.html#running-the-tests. Doing that gave an error that I needed to install pytest-tornasync -- but uninstalling all the pytest packages and reinstalling without tornasync seems to work fine, so it must have been a weird dependency glitch.

Yes, this sounds like a pytest mystery of some kind.

> /tmp/ permissions are standard for me:
> drwxrwxrwt 23 root root 40960 Nov 30 10:05 tmp

I'm actually wondering more about mount options. Try:

mount  | grep /tmp

Is /tmp a separate filesystem mounted with, say, noexec?

> Is it possible that the change to build_wheel.py is causing it to look for files in the /tmp directory even though they're being built in the working directory, so the permission error is actually because it's unable to find the files? It seems weird, but when I looked at the /tmp directory during the build I didn't see anything there.

Well, that build_wheel.py script is supposed to create the temporary directory inside /tmp. It seems using /tmp is problematic, so we'll need to find a better option.

@tomweingarten

Good thought, but unfortunately /tmp/ is not mounted separately; it's a directory in the root mount with the standard Ubuntu options: / type ext4 (rw,relatime,errors=remount-ro)

@hawkinsp
Collaborator

I suspect #5051 may fix the permissions error. Try it out?

@tomweingarten

That did the trick!

@hawkinsp
Collaborator

I submitted #5053 instead of #5051 for tedious reasons related to CI.

Should we close this issue and open new ones for any test failures?

@hawkinsp
Collaborator

hawkinsp commented Dec 1, 2020

Closing. Keep us posted on how well things work on the Jetson TX2!

@hawkinsp hawkinsp closed this as completed Dec 1, 2020