This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

cgroup issue with nvidia container runtime on Debian testing #1447

Closed
7 of 9 tasks
super-cooper opened this issue Jan 7, 2021 · 62 comments

@super-cooper

super-cooper commented Jan 7, 2021

1. Issue or feature description

Whenever I try to build or run an NVIDIA container, Docker fails with the error message:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.

2. Steps to reproduce the issue

$ docker run --rm --gpus all nvidia/cuda:11.0-base-ubuntu20.04 nvidia-smi

3. Information to attach (optional if deemed irrelevant)

  • Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info
I0107 20:43:11.917241 36435 nvc.c:282] initializing library context (version=1.3.1, build=ac02636a318fe7dcc71eaeb3cc55d0c8541c1072)
I0107 20:43:11.917283 36435 nvc.c:256] using root /
I0107 20:43:11.917290 36435 nvc.c:257] using ldcache /etc/ld.so.cache
I0107 20:43:11.917300 36435 nvc.c:258] using unprivileged user 1000:1000
I0107 20:43:11.917316 36435 nvc.c:299] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0107 20:43:11.917404 36435 nvc.c:301] dxcore initialization failed, continuing assuming a non-WSL environment
W0107 20:43:11.918351 36436 nvc.c:187] failed to set inheritable capabilities
W0107 20:43:11.918381 36436 nvc.c:188] skipping kernel modules load due to failure
I0107 20:43:11.918527 36437 driver.c:101] starting driver service
I0107 20:43:11.921734 36435 nvc_info.c:680] requesting driver information with ''
I0107 20:43:11.932012 36435 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.450.80.02
I0107 20:43:11.932402 36435 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.450.80.02
I0107 20:43:11.932976 36435 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.450.80.02
I0107 20:43:11.933027 36435 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.450.80.02
I0107 20:43:11.933435 36435 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.450.80.02
I0107 20:43:11.933470 36435 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.450.80.02
I0107 20:43:11.933501 36435 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.450.80.02
I0107 20:43:11.933991 36435 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-encode.so.450.80.02
I0107 20:43:11.934024 36435 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.450.80.02
I0107 20:43:11.934094 36435 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-cfg.so.450.80.02
I0107 20:43:11.934545 36435 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvcuvid.so.450.80.02
I0107 20:43:11.934976 36435 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.450.80.02
I0107 20:43:11.935258 36435 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libGLX_nvidia.so.450.80.02
I0107 20:43:11.935783 36435 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libGLESv2_nvidia.so.450.80.02
I0107 20:43:11.936188 36435 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libGLESv1_CM_nvidia.so.450.80.02
I0107 20:43:11.936243 36435 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libEGL_nvidia.so.450.80.02
I0107 20:43:11.936622 36435 nvc_info.c:169] selecting /usr/lib/i386-linux-gnu/libnvidia-tls.so.450.80.02
I0107 20:43:11.937013 36435 nvc_info.c:169] selecting /usr/lib/i386-linux-gnu/libnvidia-glvkspirv.so.450.80.02
I0107 20:43:11.937296 36435 nvc_info.c:169] selecting /usr/lib/i386-linux-gnu/libnvidia-glsi.so.450.80.02
I0107 20:43:11.937573 36435 nvc_info.c:169] selecting /usr/lib/i386-linux-gnu/libnvidia-glcore.so.450.80.02
I0107 20:43:11.937881 36435 nvc_info.c:169] selecting /usr/lib/i386-linux-gnu/libnvidia-eglcore.so.450.80.02
I0107 20:43:11.938438 36435 nvc_info.c:169] selecting /usr/lib/i386-linux-gnu/nvidia/current/libGLX_nvidia.so.450.80.02
I0107 20:43:11.938920 36435 nvc_info.c:169] selecting /usr/lib/i386-linux-gnu/nvidia/current/libGLESv2_nvidia.so.450.80.02
I0107 20:43:11.939282 36435 nvc_info.c:169] selecting /usr/lib/i386-linux-gnu/nvidia/current/libGLESv1_CM_nvidia.so.450.80.02
I0107 20:43:11.939730 36435 nvc_info.c:169] selecting /usr/lib/i386-linux-gnu/nvidia/current/libEGL_nvidia.so.450.80.02
W0107 20:43:11.939751 36435 nvc_info.c:350] missing library libnvidia-opencl.so
W0107 20:43:11.939756 36435 nvc_info.c:350] missing library libnvidia-fatbinaryloader.so
W0107 20:43:11.939761 36435 nvc_info.c:350] missing library libnvidia-allocator.so
W0107 20:43:11.939767 36435 nvc_info.c:350] missing library libnvidia-compiler.so
W0107 20:43:11.939772 36435 nvc_info.c:350] missing library libnvidia-ngx.so
W0107 20:43:11.939776 36435 nvc_info.c:350] missing library libvdpau_nvidia.so
W0107 20:43:11.939780 36435 nvc_info.c:350] missing library libnvidia-opticalflow.so
W0107 20:43:11.939785 36435 nvc_info.c:350] missing library libnvidia-fbc.so
W0107 20:43:11.939790 36435 nvc_info.c:350] missing library libnvidia-ifr.so
W0107 20:43:11.939795 36435 nvc_info.c:350] missing library libnvoptix.so
W0107 20:43:11.939801 36435 nvc_info.c:350] missing library libnvidia-cbl.so
W0107 20:43:11.939805 36435 nvc_info.c:354] missing compat32 library libnvidia-ml.so
W0107 20:43:11.939810 36435 nvc_info.c:354] missing compat32 library libnvidia-cfg.so
W0107 20:43:11.939814 36435 nvc_info.c:354] missing compat32 library libcuda.so
W0107 20:43:11.939818 36435 nvc_info.c:354] missing compat32 library libnvidia-opencl.so
W0107 20:43:11.939823 36435 nvc_info.c:354] missing compat32 library libnvidia-ptxjitcompiler.so
W0107 20:43:11.939828 36435 nvc_info.c:354] missing compat32 library libnvidia-fatbinaryloader.so
W0107 20:43:11.939832 36435 nvc_info.c:354] missing compat32 library libnvidia-allocator.so
W0107 20:43:11.939837 36435 nvc_info.c:354] missing compat32 library libnvidia-compiler.so
W0107 20:43:11.939841 36435 nvc_info.c:354] missing compat32 library libnvidia-ngx.so
W0107 20:43:11.939846 36435 nvc_info.c:354] missing compat32 library libvdpau_nvidia.so
W0107 20:43:11.939851 36435 nvc_info.c:354] missing compat32 library libnvidia-encode.so
W0107 20:43:11.939856 36435 nvc_info.c:354] missing compat32 library libnvidia-opticalflow.so
W0107 20:43:11.939860 36435 nvc_info.c:354] missing compat32 library libnvcuvid.so
W0107 20:43:11.939865 36435 nvc_info.c:354] missing compat32 library libnvidia-fbc.so
W0107 20:43:11.939870 36435 nvc_info.c:354] missing compat32 library libnvidia-ifr.so
W0107 20:43:11.939874 36435 nvc_info.c:354] missing compat32 library libnvidia-rtcore.so
W0107 20:43:11.939879 36435 nvc_info.c:354] missing compat32 library libnvoptix.so
W0107 20:43:11.939884 36435 nvc_info.c:354] missing compat32 library libnvidia-cbl.so
I0107 20:43:11.940108 36435 nvc_info.c:276] selecting /usr/lib/nvidia/current/nvidia-smi
I0107 20:43:11.940153 36435 nvc_info.c:276] selecting /usr/lib/nvidia/current/nvidia-debugdump
I0107 20:43:11.940169 36435 nvc_info.c:276] selecting /usr/bin/nvidia-persistenced
W0107 20:43:11.941108 36435 nvc_info.c:376] missing binary nvidia-cuda-mps-control
W0107 20:43:11.941117 36435 nvc_info.c:376] missing binary nvidia-cuda-mps-server
I0107 20:43:11.941136 36435 nvc_info.c:438] listing device /dev/nvidiactl
I0107 20:43:11.941142 36435 nvc_info.c:438] listing device /dev/nvidia-uvm
I0107 20:43:11.941146 36435 nvc_info.c:438] listing device /dev/nvidia-uvm-tools
I0107 20:43:11.941151 36435 nvc_info.c:438] listing device /dev/nvidia-modeset
I0107 20:43:11.941175 36435 nvc_info.c:317] listing ipc /run/nvidia-persistenced/socket
W0107 20:43:11.941193 36435 nvc_info.c:321] missing ipc /tmp/nvidia-mps
I0107 20:43:11.941198 36435 nvc_info.c:745] requesting device information with ''
I0107 20:43:11.947879 36435 nvc_info.c:628] listing device /dev/nvidia0 (GPU-6518be5e-14ff-e277-21aa-73b482890bee at 00000000:07:00.0)
NVRM version:   450.80.02
CUDA version:   11.0

Device Index:   0
Device Minor:   0
Model:          GeForce GTX 980 Ti
Brand:          GeForce
GPU UUID:       GPU-6518be5e-14ff-e277-21aa-73b482890bee
Bus Location:   00000000:07:00.0
Architecture:   5.2
I0107 20:43:11.947903 36435 nvc.c:337] shutting down library context
I0107 20:43:11.948696 36437 driver.c:156] terminating driver service
I0107 20:43:11.949026 36435 driver.c:196] driver service terminated successfully
  • Kernel version from uname -a
 Linux lambda 5.8.0-3-amd64 #1 SMP Debian 5.8.14-1 (2020-10-10) x86_64 GNU/Linux
  • Any relevant kernel output lines from dmesg
  • Driver information from nvidia-smi -a
Thu Jan  7 15:45:08 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 980 Ti  On   | 00000000:07:00.0  On |                  N/A |
|  0%   45C    P5    29W / 250W |    403MiB /  6083MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                              
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3023      G   /usr/lib/xorg/Xorg                177MiB |
|    0   N/A  N/A      4833      G   /usr/bin/gnome-shell              166MiB |
|    0   N/A  N/A      7609      G   ...AAAAAAAAA= --shared-files       54MiB |
+-----------------------------------------------------------------------------+
  • Docker version from docker version
Server: Docker Engine - Community
Engine:
 Version:          20.10.2
 API version:      1.41 (minimum version 1.12)
 Go version:       go1.13.15
 Git commit:       8891c58
 Built:            Mon Dec 28 16:15:28 2020
 OS/Arch:          linux/amd64
 Experimental:     false
containerd:
 Version:          1.4.3
 GitCommit:        269548fa27e0089a8b8278fc4fc781d7f65a939b
nvidia:
 Version:          1.0.0-rc92
 GitCommit:        ff819c7e9184c13b7c2607fe6c30ae19403a7aff
docker-init:
 Version:          0.19.0
 GitCommit:        de40ad0
  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                   Version        Architecture Description
+++-======================================-==============-============-=================================================================
un  bumblebee-nvidia                       <none>         <none>       (no description available)
ii  glx-alternative-nvidia                 1.2.0          amd64        allows the selection of NVIDIA as GLX provider
un  libegl-nvidia-legacy-390xx0            <none>         <none>       (no description available)
un  libegl-nvidia-tesla-418-0              <none>         <none>       (no description available)
un  libegl-nvidia-tesla-440-0              <none>         <none>       (no description available)
un  libegl-nvidia-tesla-450-0              <none>         <none>       (no description available)
ii  libegl-nvidia0:amd64                   450.80.02-2    amd64        NVIDIA binary EGL library
ii  libegl-nvidia0:i386                    450.80.02-2    i386         NVIDIA binary EGL library
un  libegl1-glvnd-nvidia                   <none>         <none>       (no description available)
un  libegl1-nvidia                         <none>         <none>       (no description available)
un  libgl1-glvnd-nvidia-glx                <none>         <none>       (no description available)
ii  libgl1-nvidia-glvnd-glx:amd64          450.80.02-2    amd64        NVIDIA binary OpenGL/GLX library (GLVND variant)
ii  libgl1-nvidia-glvnd-glx:i386           450.80.02-2    i386         NVIDIA binary OpenGL/GLX library (GLVND variant)
un  libgl1-nvidia-glx                      <none>         <none>       (no description available)
un  libgl1-nvidia-glx-any                  <none>         <none>       (no description available)
un  libgl1-nvidia-glx-i386                 <none>         <none>       (no description available)
un  libgl1-nvidia-legacy-390xx-glx         <none>         <none>       (no description available)
un  libgl1-nvidia-tesla-418-glx            <none>         <none>       (no description available)
un  libgldispatch0-nvidia                  <none>         <none>       (no description available)
ii  libgles-nvidia1:amd64                  450.80.02-2    amd64        NVIDIA binary OpenGL|ES 1.x library
ii  libgles-nvidia1:i386                   450.80.02-2    i386         NVIDIA binary OpenGL|ES 1.x library
ii  libgles-nvidia2:amd64                  450.80.02-2    amd64        NVIDIA binary OpenGL|ES 2.x library
ii  libgles-nvidia2:i386                   450.80.02-2    i386         NVIDIA binary OpenGL|ES 2.x library
un  libgles1-glvnd-nvidia                  <none>         <none>       (no description available)
un  libgles2-glvnd-nvidia                  <none>         <none>       (no description available)
un  libglvnd0-nvidia                       <none>         <none>       (no description available)
ii  libglx-nvidia0:amd64                   450.80.02-2    amd64        NVIDIA binary GLX library
ii  libglx-nvidia0:i386                    450.80.02-2    i386         NVIDIA binary GLX library
un  libglx0-glvnd-nvidia                   <none>         <none>       (no description available)
un  libnvidia-cbl                          <none>         <none>       (no description available)
un  libnvidia-cfg.so.1                     <none>         <none>       (no description available)
ii  libnvidia-cfg1:amd64                   450.80.02-2    amd64        NVIDIA binary OpenGL/GLX configuration library
un  libnvidia-cfg1-any                     <none>         <none>       (no description available)
ii  libnvidia-container-tools              1.3.1-1        amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64             1.3.1-1        amd64        NVIDIA container runtime library
ii  libnvidia-eglcore:amd64                450.80.02-2    amd64        NVIDIA binary EGL core libraries
ii  libnvidia-eglcore:i386                 450.80.02-2    i386         NVIDIA binary EGL core libraries
un  libnvidia-eglcore-450.80.02            <none>         <none>       (no description available)
ii  libnvidia-encode1:amd64                450.80.02-2    amd64        NVENC Video Encoding runtime library
ii  libnvidia-glcore:amd64                 450.80.02-2    amd64        NVIDIA binary OpenGL/GLX core libraries
ii  libnvidia-glcore:i386                  450.80.02-2    i386         NVIDIA binary OpenGL/GLX core libraries
un  libnvidia-glcore-450.80.02             <none>         <none>       (no description available)
ii  libnvidia-glvkspirv:amd64              450.80.02-2    amd64        NVIDIA binary Vulkan Spir-V compiler library
ii  libnvidia-glvkspirv:i386               450.80.02-2    i386         NVIDIA binary Vulkan Spir-V compiler library
un  libnvidia-glvkspirv-450.80.02          <none>         <none>       (no description available)
un  libnvidia-legacy-340xx-cfg1            <none>         <none>       (no description available)
un  libnvidia-legacy-390xx-cfg1            <none>         <none>       (no description available)
ii  libnvidia-ml-dev:amd64                 11.1.1-3       amd64        NVIDIA Management Library (NVML) development files
un  libnvidia-ml.so.1                      <none>         <none>       (no description available)
ii  libnvidia-ml1:amd64                    450.80.02-2    amd64        NVIDIA Management Library (NVML) runtime library
ii  libnvidia-ptxjitcompiler1:amd64        450.80.02-2    amd64        NVIDIA PTX JIT Compiler
ii  libnvidia-rtcore:amd64                 450.80.02-2    amd64        NVIDIA binary Vulkan ray tracing (rtcore) library
un  libnvidia-rtcore-450.80.02             <none>         <none>       (no description available)
un  libnvidia-tesla-418-cfg1               <none>         <none>       (no description available)
un  libnvidia-tesla-440-cfg1               <none>         <none>       (no description available)
un  libnvidia-tesla-450-cfg1               <none>         <none>       (no description available)
un  libnvidia-tesla-450-cuda1              <none>         <none>       (no description available)
un  libnvidia-tesla-450-ml1                <none>         <none>       (no description available)
un  libopengl0-glvnd-nvidia                <none>         <none>       (no description available)
ii  nvidia-alternative                     450.80.02-2    amd64        allows the selection of NVIDIA as GLX provider
un  nvidia-alternative--kmod-alias         <none>         <none>       (no description available)
un  nvidia-alternative-legacy-173xx        <none>         <none>       (no description available)
un  nvidia-alternative-legacy-71xx         <none>         <none>       (no description available)
un  nvidia-alternative-legacy-96xx         <none>         <none>       (no description available)
ii  nvidia-container-runtime               3.4.0-1        amd64        NVIDIA container runtime
un  nvidia-container-runtime-hook          <none>         <none>       (no description available)
ii  nvidia-container-toolkit               1.4.0-1        amd64        NVIDIA container runtime hook
ii  nvidia-cuda-dev:amd64                  11.1.1-3       amd64        NVIDIA CUDA development files
un  nvidia-cuda-doc                        <none>         <none>       (no description available)
ii  nvidia-cuda-gdb                        11.1.1-3       amd64        NVIDIA CUDA Debugger (GDB)
un  nvidia-cuda-mps                        <none>         <none>       (no description available)
ii  nvidia-cuda-toolkit                    11.1.1-3       amd64        NVIDIA CUDA development toolkit
ii  nvidia-cuda-toolkit-doc                11.1.1-3       all          NVIDIA CUDA and OpenCL documentation
un  nvidia-current                         <none>         <none>       (no description available)
un  nvidia-current-updates                 <none>         <none>       (no description available)
un  nvidia-docker                          <none>         <none>       (no description available)
ii  nvidia-docker2                         2.5.0-1        all          nvidia-docker CLI wrapper
ii  nvidia-driver                          450.80.02-2    amd64        NVIDIA metapackage
un  nvidia-driver-any                      <none>         <none>       (no description available)
ii  nvidia-driver-bin                      450.80.02-2    amd64        NVIDIA driver support binaries
un  nvidia-driver-bin-450.80.02            <none>         <none>       (no description available)
un  nvidia-driver-binary                   <none>         <none>       (no description available)
ii  nvidia-driver-libs:amd64               450.80.02-2    amd64        NVIDIA metapackage (OpenGL/GLX/EGL/GLES libraries)
ii  nvidia-driver-libs:i386                450.80.02-2    i386         NVIDIA metapackage (OpenGL/GLX/EGL/GLES libraries)
un  nvidia-driver-libs-any                 <none>         <none>       (no description available)
un  nvidia-driver-libs-nonglvnd            <none>         <none>       (no description available)
ii  nvidia-egl-common                      450.80.02-2    amd64        NVIDIA binary EGL driver - common files
ii  nvidia-egl-icd:amd64                   450.80.02-2    amd64        NVIDIA EGL installable client driver (ICD)
ii  nvidia-egl-icd:i386                    450.80.02-2    i386         NVIDIA EGL installable client driver (ICD)
un  nvidia-glx-any                         <none>         <none>       (no description available)
ii  nvidia-installer-cleanup               20151021+12    amd64        cleanup after driver installation with the nvidia-installer
un  nvidia-kernel-450.80.02                <none>         <none>       (no description available)
ii  nvidia-kernel-common                   20151021+12    amd64        NVIDIA binary kernel module support files
ii  nvidia-kernel-dkms                     450.80.02-2    amd64        NVIDIA binary kernel module DKMS source
un  nvidia-kernel-source                   <none>         <none>       (no description available)
ii  nvidia-kernel-support                  450.80.02-2    amd64        NVIDIA binary kernel module support files
un  nvidia-kernel-support--v1              <none>         <none>       (no description available)
un  nvidia-kernel-support-any              <none>         <none>       (no description available)
un  nvidia-legacy-304xx-alternative        <none>         <none>       (no description available)
un  nvidia-legacy-304xx-driver             <none>         <none>       (no description available)
un  nvidia-legacy-340xx-alternative        <none>         <none>       (no description available)
un  nvidia-legacy-340xx-vdpau-driver       <none>         <none>       (no description available)
un  nvidia-legacy-390xx-vdpau-driver       <none>         <none>       (no description available)
un  nvidia-legacy-390xx-vulkan-icd         <none>         <none>       (no description available)
ii  nvidia-legacy-check                    450.80.02-2    amd64        check for NVIDIA GPUs requiring a legacy driver
un  nvidia-libopencl1                      <none>         <none>       (no description available)
un  nvidia-libopencl1-dev                  <none>         <none>       (no description available)
ii  nvidia-modprobe                        460.27.04-1    amd64        utility to load NVIDIA kernel modules and create device nodes
un  nvidia-nonglvnd-vulkan-common          <none>         <none>       (no description available)
un  nvidia-nonglvnd-vulkan-icd             <none>         <none>       (no description available)
un  nvidia-opencl-dev                      <none>         <none>       (no description available)
un  nvidia-opencl-icd                      <none>         <none>       (no description available)
un  nvidia-openjdk-8-jre                   <none>         <none>       (no description available)
ii  nvidia-persistenced                    450.57-1       amd64        daemon to maintain persistent software state in the NVIDIA driver
ii  nvidia-profiler                        11.1.1-3       amd64        NVIDIA Profiler for CUDA and OpenCL
ii  nvidia-settings                        450.80.02-1+b1 amd64        tool for configuring the NVIDIA graphics driver
un  nvidia-settings-gtk-450.80.02          <none>         <none>       (no description available)
ii  nvidia-smi                             450.80.02-2    amd64        NVIDIA System Management Interface
ii  nvidia-support                         20151021+12    amd64        NVIDIA binary graphics driver support files
un  nvidia-tesla-418-vdpau-driver          <none>         <none>       (no description available)
un  nvidia-tesla-418-vulkan-icd            <none>         <none>       (no description available)
un  nvidia-tesla-440-vdpau-driver          <none>         <none>       (no description available)
un  nvidia-tesla-440-vulkan-icd            <none>         <none>       (no description available)
un  nvidia-tesla-450-driver                <none>         <none>       (no description available)
un  nvidia-tesla-450-vulkan-icd            <none>         <none>       (no description available)
un  nvidia-tesla-alternative               <none>         <none>       (no description available)
ii  nvidia-vdpau-driver:amd64              450.80.02-2    amd64        Video Decode and Presentation API for Unix - NVIDIA driver
ii  nvidia-visual-profiler                 11.1.1-3       amd64        NVIDIA Visual Profiler for CUDA and OpenCL
ii  nvidia-vulkan-common                   450.80.02-2    amd64        NVIDIA Vulkan driver - common files
ii  nvidia-vulkan-icd:amd64                450.80.02-2    amd64        NVIDIA Vulkan installable client driver (ICD)
ii  nvidia-vulkan-icd:i386                 450.80.02-2    i386         NVIDIA Vulkan installable client driver (ICD)
un  nvidia-vulkan-icd-any                  <none>         <none>       (no description available)
ii  xserver-xorg-video-nvidia              450.80.02-2    amd64        NVIDIA binary Xorg driver
un  xserver-xorg-video-nvidia-any          <none>         <none>       (no description available)
un  xserver-xorg-video-nvidia-legacy-304xx <none>         <none>       (no description available)
  • NVIDIA container library version from nvidia-container-cli -V
version: 1.3.1
build date: 2020-12-14T14:18+00:00
build revision: ac02636a318fe7dcc71eaeb3cc55d0c8541c1072
build compiler: x86_64-linux-gnu-gcc-8 8.3.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
  • NVIDIA container library logs (see troubleshooting)
  • Docker command, image and tag used
docker run --rm --gpus all nvidia/cuda:11.0-base-ubuntu20.04 nvidia-smi
@DanielCeregatti

Hi,

I'm experiencing the same issue. For now I've worked around it:

In /etc/nvidia-container-runtime/config.toml I've set no-cgroups = true. With that, the container starts, but the nvidia devices are not added to it. Once I add the devices explicitly, the container works again.

Here are the relevant lines from my docker-compose.yml:

    devices:
      - /dev/nvidia0:/dev/nvidia0
      - /dev/nvidiactl:/dev/nvidiactl
      - /dev/nvidia-modeset:/dev/nvidia-modeset
      - /dev/nvidia-uvm:/dev/nvidia-uvm
      - /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools

This is equivalent to docker run --device /dev/whatever ..., but I'm not sure of the exact syntax.

Hope this helps.
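
For reference, a minimal shell sketch of the workaround described in this comment. It rests on several assumptions: the helper names `set_no_cgroups` and `print_docker_cmd` are hypothetical, the config file is assumed to already contain a (possibly commented-out) `no-cgroups` line, and the `--gpus all` flag and image are carried over from the original report rather than from the compose file quoted here. The second helper only prints the `docker run` command so that it can be checked before it is run.

```shell
#!/bin/sh
# Sketch of the no-cgroups workaround. Both helper names are hypothetical.

# Set "no-cgroups = true" in the given config file. Assumes the file already
# contains a (possibly commented-out) no-cgroups line.
set_no_cgroups() {
    # -i.bak edits in place and keeps a backup next to the original.
    sed -E -i.bak 's/^#?[[:space:]]*no-cgroups[[:space:]]*=.*/no-cgroups = true/' "$1"
}

# Print (rather than run) a docker command that passes each device node given
# as an argument through with --device, mirroring the compose "devices:" list.
print_docker_cmd() {
    cmd="docker run --rm --gpus all"
    for dev in "$@"; do
        # Map each host device node to the same path inside the container.
        cmd="$cmd --device $dev:$dev"
    done
    echo "$cmd nvidia/cuda:11.0-base-ubuntu20.04 nvidia-smi"
}

# As root: set_no_cgroups /etc/nvidia-container-runtime/config.toml
print_docker_cmd /dev/nvidia0 /dev/nvidiactl /dev/nvidia-modeset \
    /dev/nvidia-uvm /dev/nvidia-uvm-tools
```

The device list should match whatever nodes exist on the host (for example, as listed by `nvidia-container-cli info`); the five paths above are simply the ones from the compose snippet.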

@lissyx

lissyx commented Jan 14, 2021

This seems to be related to the systemd upgrade to 247.2-2, which was uploaded to sid three weeks ago and has now made its way to testing. This commit highlights the change of cgroup hierarchy: https://salsa.debian.org/systemd-team/systemd/-/commit/170fb124a32884bd9975ee4ea9e1ffbbc2ee26b4

Indeed, the default setup no longer exposes /sys/fs/cgroup/devices, which libnvidia-container uses according to https://github.com/NVIDIA/libnvidia-container/blob/ac02636a318fe7dcc71eaeb3cc55d0c8541c1072/src/nvc_container.c#L379-L382

Booting with the documented systemd.unified_cgroup_hierarchy=false kernel command line parameter brings back the /sys/fs/cgroup/devices entry, and libnvidia-container is happier.
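
A quick way to see which layout a host is using is to look at the cgroup mount itself. The sketch below is an illustration under stated assumptions, not something taken from libnvidia-container: the function name `cgroup_layout` is hypothetical, and it relies on two facts about the kernel interface, namely that a cgroup v2 (unified) root contains a `cgroup.controllers` file, and that a v1 (legacy or hybrid) setup mounts the devices controller at a `devices` subdirectory.

```shell
#!/bin/sh
# Report which cgroup layout is mounted at the given root
# (defaults to /sys/fs/cgroup). The function name is hypothetical.
cgroup_layout() {
    root="${1:-/sys/fs/cgroup}"
    if [ -f "$root/cgroup.controllers" ]; then
        # A cgroup v2 root exposes cgroup.controllers directly.
        echo "unified"
    elif [ -d "$root/devices" ]; then
        # cgroup v1 mounts the devices controller as its own directory.
        echo "legacy-or-hybrid"
    else
        echo "unknown"
    fi
}

cgroup_layout
```

On a host that prints `unified`, the `/sys/fs/cgroup/devices` path referenced above is absent, which matches the "cgroup subsystem devices not found" error in the original report.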

@klueska
Contributor

klueska commented Jan 14, 2021

@lissyx Thank you for pointing out the crux of the issue.
We are in the process of rearchitecting the nvidia container stack in such a way that issues such as this should not exist in the future (because we will rely on runc, or whatever the configured container runtime is, to do all cgroup setup instead of doing it ourselves).

That said, this rearchitecting effort will take at least another 9 months to complete. I'm curious what the impact is, and how difficult it would be to add cgroup v2 support to libnvidia-container in the meantime, to prevent issues like this until the rearchitecting is complete.

@seemethere

Wanted to also chime in to say that I'm also experiencing this on Fedora 33

@mathstuf

Could the title be updated to indicate that it is systemd cgroup layout related?

@klueska
Contributor

klueska commented Jan 25, 2021

I was under the impression this issue was related to adding cgroup v2 support.

The systemd cgroup layout issue was resolved in:
https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/49

And released today as part of libnvidia-container v1.3.2:
https://github.com/NVIDIA/libnvidia-container/releases/tag/v1.3.2

If these resolve this issue, please comment and close. Thanks.
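
To confirm whether a host already has that release, the installed version can be compared against 1.3.2. This is a rough sketch under assumptions: the helper names `version_ge` and `parse_cli_version` are hypothetical, the comparison relies on `sort -V` (available in GNU coreutils), and the parsing assumes the `version: X.Y.Z` line that `nvidia-container-cli -V` prints in the original report. It only checks the package version and says nothing about whether the container actually starts.

```shell
#!/bin/sh
# version_ge A B: succeed if version A >= version B (relies on "sort -V").
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Extract the version from "nvidia-container-cli -V"-style output on stdin.
parse_cli_version() {
    awk '/^version:/ { print $2; exit }'
}

if command -v nvidia-container-cli >/dev/null 2>&1; then
    installed=$(nvidia-container-cli -V | parse_cli_version)
    if version_ge "$installed" 1.3.2; then
        echo "libnvidia-container $installed is at or above v1.3.2"
    else
        echo "libnvidia-container $installed predates v1.3.2; consider upgrading"
    fi
else
    echo "nvidia-container-cli not found"
fi
```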

@super-cooper changed the title from "Docker unable to run nvidia containers on Debian testing" to "cgroup issue with nvidia container runtime on Debian testing" on Jan 25, 2021
@super-cooper
Author

Issue resolved by the latest release. Thank you everyone <3

@regzon

regzon commented Jan 28, 2021

Did you set the following parameter: systemd.unified_cgroup_hierarchy=false?

Or did you just upgrade all the packages?

@super-cooper
Author

For me it was solved by upgrading the package.

@regzon

regzon commented Jan 30, 2021

Thank you, @super-cooper, for the reply.

I am having exactly the same issue on Debian Testing even after an upgrade.

1. Issue or feature description

docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.

2. Steps to reproduce the issue

docker run --rm --gpus all nvidia/cuda:11.0-base-ubuntu20.04 nvidia-smi

3. Information to attach (optional if deemed irrelevant)

  • Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info
I0130 05:23:50.494974 4486 nvc.c:282] initializing library context (version=1.3.2, build=fa9c778f687e9ac7be52b0299fa3b6ac2d9fbf93)
I0130 05:23:50.495160 4486 nvc.c:256] using root /
I0130 05:23:50.495178 4486 nvc.c:257] using ldcache /etc/ld.so.cache
I0130 05:23:50.495194 4486 nvc.c:258] using unprivileged user 1000:1000
I0130 05:23:50.495256 4486 nvc.c:299] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0130 05:23:50.495644 4486 nvc.c:301] dxcore initialization failed, continuing assuming a non-WSL environment
W0130 05:23:50.499341 4487 nvc.c:187] failed to set inheritable capabilities
W0130 05:23:50.499369 4487 nvc.c:188] skipping kernel modules load due to failure
I0130 05:23:50.499601 4488 driver.c:101] starting driver service
I0130 05:23:50.504376 4486 nvc_info.c:680] requesting driver information with ''
I0130 05:23:50.506132 4486 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.460.32.03
I0130 05:23:50.506191 4486 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.460.32.03
I0130 05:23:50.506283 4486 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.460.32.03
I0130 05:23:50.506375 4486 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.460.32.03
I0130 05:23:50.506418 4486 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.460.32.03
I0130 05:23:50.506467 4486 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.460.32.03
I0130 05:23:50.506512 4486 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.460.32.03
I0130 05:23:50.506557 4486 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.460.32.03
I0130 05:23:50.506669 4486 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-cfg.so.460.32.03
I0130 05:23:50.506714 4486 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cbl.so.460.32.03
I0130 05:23:50.507077 4486 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.460.32.03
I0130 05:23:50.507376 4486 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libGLX_nvidia.so.460.32.03
I0130 05:23:50.507476 4486 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libGLESv2_nvidia.so.460.32.03
I0130 05:23:50.507569 4486 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libGLESv1_CM_nvidia.so.460.32.03
I0130 05:23:50.507669 4486 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libEGL_nvidia.so.460.32.03
W0130 05:23:50.507732 4486 nvc_info.c:350] missing library libnvidia-opencl.so
W0130 05:23:50.507741 4486 nvc_info.c:350] missing library libnvidia-fatbinaryloader.so
W0130 05:23:50.507748 4486 nvc_info.c:350] missing library libnvidia-allocator.so
W0130 05:23:50.507754 4486 nvc_info.c:350] missing library libnvidia-compiler.so
W0130 05:23:50.507760 4486 nvc_info.c:350] missing library libnvidia-ngx.so
W0130 05:23:50.507766 4486 nvc_info.c:350] missing library libvdpau_nvidia.so
W0130 05:23:50.507772 4486 nvc_info.c:350] missing library libnvidia-encode.so
W0130 05:23:50.507781 4486 nvc_info.c:350] missing library libnvidia-opticalflow.so
W0130 05:23:50.507788 4486 nvc_info.c:350] missing library libnvcuvid.so
W0130 05:23:50.507796 4486 nvc_info.c:350] missing library libnvidia-fbc.so
W0130 05:23:50.507806 4486 nvc_info.c:350] missing library libnvidia-ifr.so
W0130 05:23:50.507815 4486 nvc_info.c:350] missing library libnvoptix.so
W0130 05:23:50.507823 4486 nvc_info.c:354] missing compat32 library libnvidia-ml.so
W0130 05:23:50.507832 4486 nvc_info.c:354] missing compat32 library libnvidia-cfg.so
W0130 05:23:50.507848 4486 nvc_info.c:354] missing compat32 library libcuda.so
W0130 05:23:50.507859 4486 nvc_info.c:354] missing compat32 library libnvidia-opencl.so
W0130 05:23:50.507869 4486 nvc_info.c:354] missing compat32 library libnvidia-ptxjitcompiler.so
W0130 05:23:50.507880 4486 nvc_info.c:354] missing compat32 library libnvidia-fatbinaryloader.so
W0130 05:23:50.507889 4486 nvc_info.c:354] missing compat32 library libnvidia-allocator.so
W0130 05:23:50.507897 4486 nvc_info.c:354] missing compat32 library libnvidia-compiler.so
W0130 05:23:50.507906 4486 nvc_info.c:354] missing compat32 library libnvidia-ngx.so
W0130 05:23:50.507915 4486 nvc_info.c:354] missing compat32 library libvdpau_nvidia.so
W0130 05:23:50.507925 4486 nvc_info.c:354] missing compat32 library libnvidia-encode.so
W0130 05:23:50.507933 4486 nvc_info.c:354] missing compat32 library libnvidia-opticalflow.so
W0130 05:23:50.507942 4486 nvc_info.c:354] missing compat32 library libnvcuvid.so
W0130 05:23:50.507950 4486 nvc_info.c:354] missing compat32 library libnvidia-eglcore.so
W0130 05:23:50.507960 4486 nvc_info.c:354] missing compat32 library libnvidia-glcore.so
W0130 05:23:50.507970 4486 nvc_info.c:354] missing compat32 library libnvidia-tls.so
W0130 05:23:50.507979 4486 nvc_info.c:354] missing compat32 library libnvidia-glsi.so
W0130 05:23:50.507988 4486 nvc_info.c:354] missing compat32 library libnvidia-fbc.so
W0130 05:23:50.507998 4486 nvc_info.c:354] missing compat32 library libnvidia-ifr.so
W0130 05:23:50.508007 4486 nvc_info.c:354] missing compat32 library libnvidia-rtcore.so
W0130 05:23:50.508015 4486 nvc_info.c:354] missing compat32 library libnvoptix.so
W0130 05:23:50.508025 4486 nvc_info.c:354] missing compat32 library libGLX_nvidia.so
W0130 05:23:50.508031 4486 nvc_info.c:354] missing compat32 library libEGL_nvidia.so
W0130 05:23:50.508040 4486 nvc_info.c:354] missing compat32 library libGLESv2_nvidia.so
W0130 05:23:50.508050 4486 nvc_info.c:354] missing compat32 library libGLESv1_CM_nvidia.so
W0130 05:23:50.508060 4486 nvc_info.c:354] missing compat32 library libnvidia-glvkspirv.so
W0130 05:23:50.508068 4486 nvc_info.c:354] missing compat32 library libnvidia-cbl.so
I0130 05:23:50.508515 4486 nvc_info.c:276] selecting /usr/lib/nvidia/current/nvidia-smi
I0130 05:23:50.508580 4486 nvc_info.c:276] selecting /usr/lib/nvidia/current/nvidia-debugdump
I0130 05:23:50.508612 4486 nvc_info.c:276] selecting /usr/bin/nvidia-persistenced
W0130 05:23:50.509049 4486 nvc_info.c:376] missing binary nvidia-cuda-mps-control
W0130 05:23:50.509060 4486 nvc_info.c:376] missing binary nvidia-cuda-mps-server
I0130 05:23:50.509100 4486 nvc_info.c:438] listing device /dev/nvidiactl
I0130 05:23:50.509109 4486 nvc_info.c:438] listing device /dev/nvidia-uvm
I0130 05:23:50.509118 4486 nvc_info.c:438] listing device /dev/nvidia-uvm-tools
I0130 05:23:50.509127 4486 nvc_info.c:438] listing device /dev/nvidia-modeset
I0130 05:23:50.509168 4486 nvc_info.c:317] listing ipc /run/nvidia-persistenced/socket
W0130 05:23:50.509192 4486 nvc_info.c:321] missing ipc /tmp/nvidia-mps
I0130 05:23:50.509200 4486 nvc_info.c:745] requesting device information with ''
I0130 05:23:50.516712 4486 nvc_info.c:628] listing device /dev/nvidia0 (GPU-6064a007-a943-7f11-1ad7-12ac87046652 at 00000000:01:00.0)
NVRM version:   460.32.03
CUDA version:   11.2

Device Index:   0
Device Minor:   0
Model:          GeForce GTX 960M
Brand:          GeForce
GPU UUID:       GPU-6064a007-a943-7f11-1ad7-12ac87046652
Bus Location:   00000000:01:00.0
Architecture:   5.0
I0130 05:23:50.516775 4486 nvc.c:337] shutting down library context
I0130 05:23:50.517704 4488 driver.c:156] terminating driver service
I0130 05:23:50.518087 4486 driver.c:196] driver service terminated successfully
  • Kernel version from uname -a
Linux stas 5.10.0-2-amd64 #1 SMP Debian 5.10.9-1 (2021-01-20) x86_64 GNU/Linux
  • Any relevant kernel output lines from dmesg
[  487.597570] docker0: port 1(vethb7a49e6) entered blocking state
[  487.597573] docker0: port 1(vethb7a49e6) entered disabled state
[  487.597786] device vethb7a49e6 entered promiscuous mode
[  487.773120] docker0: port 1(vethb7a49e6) entered disabled state
[  487.776548] device vethb7a49e6 left promiscuous mode
[  487.776556] docker0: port 1(vethb7a49e6) entered disabled state
  • Driver information from nvidia-smi -a
Timestamp                                 : Sat Jan 30 08:26:51 2021
Driver Version                            : 460.32.03
CUDA Version                              : 11.2

Attached GPUs                             : 1
GPU 00000000:01:00.0
    Product Name                          : GeForce GTX 960M
    Product Brand                         : GeForce
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-6064a007-a943-7f11-1ad7-12ac87046652
    Minor Number                          : 0
    VBIOS Version                         : 82.07.82.00.10
    MultiGPU Board                        : No
    Board ID                              : 0x100
    GPU Part Number                       : N/A
    Inforom Version
        Image Version                     : N/A
        OEM Object                        : N/A
        ECC Object                        : N/A
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x01
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x139B10DE
        Bus Id                            : 00000000:01:00.0
        Sub System Id                     : 0x380217AA
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 1
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : N/A
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : N/A
            HW Power Brake Slowdown       : N/A
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 4046 MiB
        Used                              : 4 MiB
        Free                              : 4042 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 1 MiB
        Free                              : 255 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
            Double Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
        Aggregate
            Single Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
            Double Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 33 C
        GPU Shutdown Temp                 : 101 C
        GPU Slowdown Temp                 : 96 C
        GPU Max Operating Temp            : 92 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : N/A
        Power Draw                        : N/A
        Power Limit                       : N/A
        Default Power Limit               : N/A
        Enforced Power Limit              : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 135 MHz
        SM                                : 135 MHz
        Memory                            : 405 MHz
        Video                             : 405 MHz
    Applications Clocks
        Graphics                          : 1097 MHz
        Memory                            : 2505 MHz
    Default Applications Clocks
        Graphics                          : 1097 MHz
        Memory                            : 2505 MHz
    Max Clocks
        Graphics                          : 1202 MHz
        SM                                : 1202 MHz
        Memory                            : 2505 MHz
        Video                             : 1081 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 1351
            Type                          : G
            Name                          : /usr/lib/xorg/Xorg
            Used GPU Memory               : 2 MiB
  • Docker version from docker version
Client: Docker Engine - Community
 Version:           20.10.2
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        2291f61
 Built:             Mon Dec 28 16:17:34 2020
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.2
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       8891c58
  Built:            Mon Dec 28 16:15:28 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.3
  GitCommit:        269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc:
  Version:          1.0.0-rc92
  GitCommit:        ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                   Version                        Architecture Description
+++-======================================-==============================-============-=================================================================
un  bumblebee-nvidia                       <none>                         <none>       (no description available)
ii  glx-alternative-nvidia                 1.2.0                          amd64        allows the selection of NVIDIA as GLX provider
un  libegl-nvidia-legacy-390xx0            <none>                         <none>       (no description available)
un  libegl-nvidia-tesla-418-0              <none>                         <none>       (no description available)
un  libegl-nvidia-tesla-440-0              <none>                         <none>       (no description available)
un  libegl-nvidia-tesla-450-0              <none>                         <none>       (no description available)
ii  libegl-nvidia0:amd64                   460.32.03-1                    amd64        NVIDIA binary EGL library
un  libegl1-glvnd-nvidia                   <none>                         <none>       (no description available)
un  libegl1-nvidia                         <none>                         <none>       (no description available)
un  libgl1-glvnd-nvidia-glx                <none>                         <none>       (no description available)
ii  libgl1-nvidia-glvnd-glx:amd64          460.32.03-1                    amd64        NVIDIA binary OpenGL/GLX library (GLVND variant)
un  libgl1-nvidia-glx                      <none>                         <none>       (no description available)
un  libgl1-nvidia-glx-any                  <none>                         <none>       (no description available)
un  libgl1-nvidia-glx-i386                 <none>                         <none>       (no description available)
un  libgl1-nvidia-legacy-390xx-glx         <none>                         <none>       (no description available)
un  libgl1-nvidia-tesla-418-glx            <none>                         <none>       (no description available)
un  libgldispatch0-nvidia                  <none>                         <none>       (no description available)
ii  libgles-nvidia1:amd64                  460.32.03-1                    amd64        NVIDIA binary OpenGL|ES 1.x library
ii  libgles-nvidia2:amd64                  460.32.03-1                    amd64        NVIDIA binary OpenGL|ES 2.x library
un  libgles1-glvnd-nvidia                  <none>                         <none>       (no description available)
un  libgles2-glvnd-nvidia                  <none>                         <none>       (no description available)
un  libglvnd0-nvidia                       <none>                         <none>       (no description available)
ii  libglx-nvidia0:amd64                   460.32.03-1                    amd64        NVIDIA binary GLX library
un  libglx0-glvnd-nvidia                   <none>                         <none>       (no description available)
ii  libnvidia-cbl:amd64                    460.32.03-1                    amd64        NVIDIA binary Vulkan ray tracing (cbl) library
un  libnvidia-cbl-460.32.03                <none>                         <none>       (no description available)
un  libnvidia-cfg.so.1                     <none>                         <none>       (no description available)
ii  libnvidia-cfg1:amd64                   460.32.03-1                    amd64        NVIDIA binary OpenGL/GLX configuration library
un  libnvidia-cfg1-any                     <none>                         <none>       (no description available)
ii  libnvidia-container-tools              1.3.2-1                        amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64             1.3.2-1                        amd64        NVIDIA container runtime library
ii  libnvidia-eglcore:amd64                460.32.03-1                    amd64        NVIDIA binary EGL core libraries
un  libnvidia-eglcore-460.32.03            <none>                         <none>       (no description available)
ii  libnvidia-glcore:amd64                 460.32.03-1                    amd64        NVIDIA binary OpenGL/GLX core libraries
un  libnvidia-glcore-460.32.03             <none>                         <none>       (no description available)
ii  libnvidia-glvkspirv:amd64              460.32.03-1                    amd64        NVIDIA binary Vulkan Spir-V compiler library
un  libnvidia-glvkspirv-460.32.03          <none>                         <none>       (no description available)
un  libnvidia-legacy-340xx-cfg1            <none>                         <none>       (no description available)
un  libnvidia-legacy-390xx-cfg1            <none>                         <none>       (no description available)
un  libnvidia-ml.so.1                      <none>                         <none>       (no description available)
ii  libnvidia-ml1:amd64                    460.32.03-1                    amd64        NVIDIA Management Library (NVML) runtime library
ii  libnvidia-ptxjitcompiler1:amd64        460.32.03-1                    amd64        NVIDIA PTX JIT Compiler
ii  libnvidia-rtcore:amd64                 460.32.03-1                    amd64        NVIDIA binary Vulkan ray tracing (rtcore) library
un  libnvidia-rtcore-460.32.03             <none>                         <none>       (no description available)
un  libnvidia-tesla-418-cfg1               <none>                         <none>       (no description available)
un  libnvidia-tesla-440-cfg1               <none>                         <none>       (no description available)
un  libnvidia-tesla-450-cfg1               <none>                         <none>       (no description available)
un  libopengl0-glvnd-nvidia                <none>                         <none>       (no description available)
ii  nvidia-alternative                     460.32.03-1                    amd64        allows the selection of NVIDIA as GLX provider
un  nvidia-alternative--kmod-alias         <none>                         <none>       (no description available)
un  nvidia-alternative-legacy-173xx        <none>                         <none>       (no description available)
un  nvidia-alternative-legacy-71xx         <none>                         <none>       (no description available)
un  nvidia-alternative-legacy-96xx         <none>                         <none>       (no description available)
ii  nvidia-container-runtime               3.4.1-1                        amd64        NVIDIA container runtime
un  nvidia-container-runtime-hook          <none>                         <none>       (no description available)
ii  nvidia-container-toolkit               1.4.1-1                        amd64        NVIDIA container runtime hook
un  nvidia-cuda-mps                        <none>                         <none>       (no description available)
un  nvidia-current                         <none>                         <none>       (no description available)
un  nvidia-current-updates                 <none>                         <none>       (no description available)
ii  nvidia-detect                          460.32.03-1                    amd64        NVIDIA GPU detection utility
un  nvidia-docker                          <none>                         <none>       (no description available)
ii  nvidia-docker2                         2.5.0-1                        all          nvidia-docker CLI wrapper
ii  nvidia-driver                          460.32.03-1                    amd64        NVIDIA metapackage
un  nvidia-driver-any                      <none>                         <none>       (no description available)
ii  nvidia-driver-bin                      460.32.03-1                    amd64        NVIDIA driver support binaries
un  nvidia-driver-bin-460.32.03            <none>                         <none>       (no description available)
un  nvidia-driver-binary                   <none>                         <none>       (no description available)
ii  nvidia-driver-libs:amd64               460.32.03-1                    amd64        NVIDIA metapackage (OpenGL/GLX/EGL/GLES libraries)
un  nvidia-driver-libs-any                 <none>                         <none>       (no description available)
un  nvidia-driver-libs-nonglvnd            <none>                         <none>       (no description available)
ii  nvidia-egl-common                      460.32.03-1                    amd64        NVIDIA binary EGL driver - common files
ii  nvidia-egl-icd:amd64                   460.32.03-1                    amd64        NVIDIA EGL installable client driver (ICD)
un  nvidia-glx-any                         <none>                         <none>       (no description available)
ii  nvidia-installer-cleanup               20151021+13                    amd64        cleanup after driver installation with the nvidia-installer
un  nvidia-kernel-460.32.03                <none>                         <none>       (no description available)
ii  nvidia-kernel-common                   20151021+13                    amd64        NVIDIA binary kernel module support files
ii  nvidia-kernel-dkms                     460.32.03-1                    amd64        NVIDIA binary kernel module DKMS source
un  nvidia-kernel-source                   <none>                         <none>       (no description available)
ii  nvidia-kernel-support                  460.32.03-1                    amd64        NVIDIA binary kernel module support files
un  nvidia-kernel-support--v1              <none>                         <none>       (no description available)
un  nvidia-kernel-support-any              <none>                         <none>       (no description available)
un  nvidia-legacy-304xx-alternative        <none>                         <none>       (no description available)
un  nvidia-legacy-304xx-driver             <none>                         <none>       (no description available)
un  nvidia-legacy-340xx-alternative        <none>                         <none>       (no description available)
un  nvidia-legacy-340xx-vdpau-driver       <none>                         <none>       (no description available)
un  nvidia-legacy-390xx-vdpau-driver       <none>                         <none>       (no description available)
un  nvidia-legacy-390xx-vulkan-icd         <none>                         <none>       (no description available)
ii  nvidia-legacy-check                    460.32.03-1                    amd64        check for NVIDIA GPUs requiring a legacy driver
un  nvidia-libopencl1-dev                  <none>                         <none>       (no description available)
ii  nvidia-modprobe                        460.32.03-1                    amd64        utility to load NVIDIA kernel modules and create device nodes
un  nvidia-nonglvnd-vulkan-common          <none>                         <none>       (no description available)
un  nvidia-nonglvnd-vulkan-icd             <none>                         <none>       (no description available)
un  nvidia-opencl-icd                      <none>                         <none>       (no description available)
ii  nvidia-openjdk-8-jre                   9.+8u272-b10-0+deb9u1~11.1.1-4 amd64        Obsolete OpenJDK Java runtime, for NVIDIA applications
ii  nvidia-persistenced                    460.32.03-1                    amd64        daemon to maintain persistent software state in the NVIDIA driver
un  nvidia-settings                        <none>                         <none>       (no description available)
ii  nvidia-smi                             460.32.03-1                    amd64        NVIDIA System Management Interface
ii  nvidia-support                         20151021+13                    amd64        NVIDIA binary graphics driver support files
un  nvidia-tesla-418-vdpau-driver          <none>                         <none>       (no description available)
un  nvidia-tesla-418-vulkan-icd            <none>                         <none>       (no description available)
un  nvidia-tesla-440-vdpau-driver          <none>                         <none>       (no description available)
un  nvidia-tesla-440-vulkan-icd            <none>                         <none>       (no description available)
un  nvidia-tesla-450-vulkan-icd            <none>                         <none>       (no description available)
un  nvidia-tesla-alternative               <none>                         <none>       (no description available)
ii  nvidia-vdpau-driver:amd64              460.32.03-1                    amd64        Video Decode and Presentation API for Unix - NVIDIA driver
ii  nvidia-vulkan-common                   460.32.03-1                    amd64        NVIDIA Vulkan driver - common files
ii  nvidia-vulkan-icd:amd64                460.32.03-1                    amd64        NVIDIA Vulkan installable client driver (ICD)
un  nvidia-vulkan-icd-any                  <none>                         <none>       (no description available)
ii  xserver-xorg-video-nvidia              460.32.03-1                    amd64        NVIDIA binary Xorg driver
un  xserver-xorg-video-nvidia-any          <none>                         <none>       (no description available)
un  xserver-xorg-video-nvidia-legacy-304xx <none>                         <none>       (no description available)
  • NVIDIA container library version from nvidia-container-cli -V
version: 1.3.2
build date: 2021-01-25T11:07+00:00
build revision: fa9c778f687e9ac7be52b0299fa3b6ac2d9fbf93
build compiler: x86_64-linux-gnu-gcc-8 8.3.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
  • NVIDIA container library logs (see troubleshooting)
    /var/log/nvidia-container-toolkit.log is not generated.
  • Docker command, image and tag used
docker run --rm --gpus all nvidia/cuda:11.0-base-ubuntu20.04 nvidia-smi

@klueska Could you please check the issue?

@elezar
Member

elezar commented Feb 1, 2021

@regzon thanks for indicating that this is still an issue. Could you please check what your systemd cgroup configuration is? (See, for example, this other issue, which shows similar behaviour: docker/cli#2104 (comment))

@klueska
Contributor

klueska commented Feb 1, 2021

@regzon your issue is likely related to the fact that libnvidia-container does not support cgroups v2.

You will need to follow the suggestion in the comment above (#1447 (comment)) to force systemd to use v1 cgroups.

In any case -- we do not officially support Debian Testing nor cgroups v2 (yet).
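
A quick way to tell which hierarchy a host booted with is to look at the filesystem type mounted at /sys/fs/cgroup. The sketch below assumes GNU stat (for `-fc %T`); the helper name `cgroup_version_from_fstype` is made up for illustration and is not part of any NVIDIA tooling.

```shell
#!/bin/sh
# Map the filesystem type mounted at /sys/fs/cgroup to a cgroup version.
# cgroup2fs -> unified (v2) hierarchy; tmpfs -> v1 (or hybrid) layout.
# Hypothetical helper, for illustration only.
cgroup_version_from_fstype() {
    case "$1" in
        cgroup2fs) echo "v2" ;;
        tmpfs)     echo "v1" ;;
        *)         echo "unknown" ;;
    esac
}

# Query the live system only if the mount point exists.
if [ -d /sys/fs/cgroup ]; then
    fstype=$(stat -fc %T /sys/fs/cgroup 2>/dev/null || echo "unknown")
    echo "cgroup hierarchy: $(cgroup_version_from_fstype "$fstype")"
fi
```

If this reports v2 and the installed libnvidia-container predates cgroup v2 support, the "cgroup subsystem devices not found" error from the original report is the likely outcome.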

@regzon

regzon commented Feb 1, 2021

@elezar @klueska thank you for your help. When forcing systemd not to use the unified hierarchy, everything works fine. I thought that the latest libnvidia-container upgrade would resolve the issue (as it did for @super-cooper). But if the upgrade is not intended to fix the issue with cgroups, then everything is fine.

@flixr

flixr commented Feb 1, 2021

@klueska I'm having the same "issue", i.e. missing support for cgroups v2 (which I would very much like for other reasons).
Is there already an issue for this to track?

@klueska
Contributor

klueska commented Feb 4, 2021

We are not planning on building support for cgroups v2 into the existing nvidia-docker stack.

Please see my comment above for more info:
#1447 (comment)

@flixr

flixr commented Feb 4, 2021

Let me rephrase it then: I want to use nvidia-docker on a system where cgroup v2 is enabled (systemd.unified_cgroup_hierarchy=true).
Right now this is not working and this bug is closed. So is there an issue that I can track to know when I can use nvidia-docker on hosts with cgroup v2 enabled?

@klueska
Contributor

klueska commented Feb 4, 2021

We have it tracked in our internal JIRA with a link to this issue as the location to report once the work is complete:
NVIDIA/libnvidia-container#111

@jelmd

jelmd commented Feb 26, 2021

Facebook oomd requires cgroup v2, i.e. systemd.unified_cgroup_hierarchy=1. So either users freeze the boxes pretty often and render them unusable, or they cannot use nvidia-containers. Both are crap. We will probably drop the nvidia-docker nonsense.

@4n0m4l0u5

For Debian users, you can disable cgroup hierarchy by editing
/etc/default/grub
and adding
systemd.unified_cgroup_hierarchy=0
to the end of the GRUB_CMDLINE_LINUX_DEFAULT options. Example:
...
GRUB_CMDLINE_LINUX_DEFAULT="quiet systemd.unified_cgroup_hierarchy=0"
...

Then run
update-grub
and reboot for changes to take effect.

It's worth noting that I also had to modify /etc/nvidia-container-runtime/config.toml to remove the '@' symbol and update to the correct location of ldconfig for my system (Debian Unstable), e.g.:
ldconfig = "/usr/sbin/ldconfig"

This worked for me, I hope this saves someone else some time.
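
The GRUB edit described above can also be scripted. This is only a sketch: `add_grub_param` is a hypothetical helper (not shipped by GRUB or Debian), it assumes the variable sits on a single double-quoted line, and the demo runs against a scratch file rather than the real /etc/default/grub.

```shell
#!/bin/sh
# Append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT in the given file,
# unless it is already there. Hypothetical helper, for illustration only.
add_grub_param() {
    file=$1
    param=$2
    # Nothing to do if the parameter is already on that line.
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
        return 0
    fi
    # Insert the parameter just before the closing double quote, then write
    # the result back in place so ownership and mode are preserved.
    sed "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"|\1 ${param}\"|" "$file" > "$file.new" \
        && cat "$file.new" > "$file" \
        && rm -f "$file.new"
}

# Demo on a scratch file; the real target would be /etc/default/grub.
scratch=$(mktemp)
printf 'GRUB_CMDLINE_LINUX_DEFAULT="quiet"\n' > "$scratch"
add_grub_param "$scratch" systemd.unified_cgroup_hierarchy=0
cat "$scratch"
rm -f "$scratch"
```

On a real system the call would need root, followed by update-grub and a reboot as described above.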

@Zethson

Zethson commented Apr 10, 2021

Fix on Arch:

Edit /etc/nvidia-container-runtime/config.toml and change #no-cgroups=false to no-cgroups=true. After a restart of the docker.service everything worked as usual.
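
For anyone who wants to script that toggle, here is a minimal sketch. `set_no_cgroups` is a hypothetical helper, it assumes the setting appears once in the file (commented out or not), and the demo works on a scratch file instead of the real config.toml.

```shell
#!/bin/sh
# Set the no-cgroups key in a config.toml-style file to the given value,
# uncommenting the line if needed. Hypothetical helper, for illustration only.
set_no_cgroups() {
    file=$1
    value=$2
    # Match "no-cgroups = ..." with an optional leading "#", rewrite it, and
    # write the result back in place so ownership and mode are preserved.
    sed "s|^[[:space:]]*#\{0,1\}[[:space:]]*no-cgroups[[:space:]]*=.*|no-cgroups = ${value}|" "$file" > "$file.new" \
        && cat "$file.new" > "$file" \
        && rm -f "$file.new"
}

# Demo on a scratch file; the real target would be
# /etc/nvidia-container-runtime/config.toml, followed by a docker restart.
scratch=$(mktemp)
printf '[nvidia-container-cli]\n#no-cgroups = false\n' > "$scratch"
set_no_cgroups "$scratch" true
cat "$scratch"
rm -f "$scratch"
```

Note that a later comment in this thread reports that with no-cgroups=true the container started but nvidia-smi could not see the GPU, so this may not be a complete fix on every system.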

@gabrielebaris

gabrielebaris commented Apr 11, 2021

@Zethson I also use Arch and yesterday I followed your suggestion. It seemed to work (I was able to start the containers), but when running nvidia-smi I had no access to my GPU from inside Docker.
After reading the other answers in this issue, I solved it by adding systemd.unified_cgroup_hierarchy=0 to the boot kernel parameters and commenting out the no-cgroups entry in /etc/nvidia-container-runtime/config.toml again.

@wernight

wernight commented May 3, 2021

Arch now has cgroup v2 enabled by default, so it would be useful to plan for supporting it.

@adam505hq

> Fix on Arch:
>
> Edit /etc/nvidia-container-runtime/config.toml and change #no-cgroups=false to no-cgroups=true. After a restart of the docker.service everything worked as usual.

Awesome this works well.

@biggs

biggs commented May 10, 2021

Fix on NixOS (where cgroup v2 is also now default): add
systemd.enableUnifiedCgroupHierarchy = false;
and restart.

@prismplex

prismplex commented May 30, 2021

This worked for me on Manjaro Linux (Arch Linux as base) without deactivating cgroup v2:
Create the folder docker.service.d under /etc/systemd/system, create file override.conf in this folder:

[Service]
ExecStartPre=-/usr/bin/nvidia-modprobe -c 0 -u

After that you have to add the following content to your docker-compose.yml, thank you @DanielCeregatti :

    devices:
      - /dev/nvidia0:/dev/nvidia0
      - /dev/nvidiactl:/dev/nvidiactl
      - /dev/nvidia-modeset:/dev/nvidia-modeset
      - /dev/nvidia-uvm:/dev/nvidia-uvm
      - /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools

Background: The nvidia-uvm and nvidia-uvm-tools device nodes did not exist under /dev for me. After running nvidia-modprobe -c 0 -u they appeared, but they disappeared again after a reboot. This workaround creates these nodes before Docker starts. Unfortunately I don't know why they do not exist by default. Maybe somebody can fill in the details. I am currently using Linux 5.12, so it may be related to this kernel version.

Edit: This workaround works only if the container using NVIDIA is restarted afterwards. I do not know why, but if it is not, the container starts but cannot access the created device nodes.

Update 25.06.2021:
Found out why I had to restart jellyfin. Docker started before my disks were online. If somebody has this problem too, here is the fix:
openmediavault/openmediavault#458 (comment)
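
To check whether the device nodes discussed above are present before starting containers, something like the following could be used. It is a sketch only: `missing_nvidia_nodes` is a hypothetical helper, and the node list simply mirrors the devices from the compose snippet above (a system with several GPUs would have more nvidiaN entries).

```shell
#!/bin/sh
# Print the names of expected NVIDIA device nodes that are absent from a
# directory (default /dev). Hypothetical helper, for illustration only.
missing_nvidia_nodes() {
    dir=${1:-/dev}
    for node in nvidia0 nvidiactl nvidia-modeset nvidia-uvm nvidia-uvm-tools; do
        # -e rather than -c so the function can be tried against a scratch dir
        [ -e "$dir/$node" ] || echo "$node"
    done
}

missing=$(missing_nvidia_nodes /dev)
if [ -n "$missing" ]; then
    # Unquoted on purpose: collapse the newline-separated names onto one line.
    echo "missing under /dev:" $missing
else
    echo "all expected NVIDIA device nodes are present"
fi
```

If nvidia-uvm or nvidia-uvm-tools show up as missing, running nvidia-modprobe -c 0 -u as described above is what created them in the original report.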

@Frikster

Frikster commented Jan 24, 2022

If you need cgroups active (so you cannot set no-cgroups = true) and you're on PopOS 21.10: as per this explanation, this one command fixed the issue for me while keeping cgroups on:

sudo kernelstub -a systemd.unified_cgroup_hierarchy=0

I then had to reboot and the issue is gone.

@klueska
Contributor

klueska commented Jan 28, 2022

libnvidia-container-1.8.0-rc.2 is now live with some minor updates to fix some edge cases around cgroupv2 support.

Please see NVIDIA/libnvidia-container#111 (comment) for instructions on how to get access to this RC (or wait for the full release at the end of next week).

Note: This does not directly add debian testing support, but you can point to the debian10 repo and install from there for now.

@jbcpollak

This may be useful for Ubuntu users running into this issue:

> So nowadays all you actually need is libnvidia-container.list to get access to all of the new packages, but if you have nvidia-docker.list that is still OK, because it also contains entries for all of the repos listed in libnvidia-container.list (it just contains entries for more, now unnecessary, repos as well).

@klueska , I just wanted to mention when I go to the following URLs:

https://nvidia.github.io/nvidia-docker/ubuntu18.04/nvidia-docker.list
https://nvidia.github.io/nvidia-docker/ubuntu20.04/nvidia-docker.list

I get a valid apt list in response.

But if I visit:

https://nvidia.github.io/nvidia-docker/ubuntu18.04/libnvidia-container.list
https://nvidia.github.io/nvidia-docker/ubuntu20.04/libnvidia-container.list

I get # Unsupported distribution! # Check https://nvidia.github.io/nvidia-docker.

It appears the list has been moved back to the original filename?

@jbcpollak

ah, 🤦 , much appreciated, thanks for making it explicit.

@klueska
Contributor

klueska commented Feb 4, 2022

libnvidia-container-1.8.0 with cgroupv2 support is now GA

Release notes here:
https://github.com/NVIDIA/libnvidia-container/releases/tag/v1.8.0

@klueska
Contributor

klueska commented Feb 4, 2022

Debian 11 support has now been added such that running the following should now work as expected:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
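
To see in advance which list URL the snippet above will request on a given machine, the distribution string can be computed separately. This is a sketch; `repo_list_url` is a hypothetical helper that only rebuilds the URL pattern used above from an os-release file.

```shell
#!/bin/sh
# Build the nvidia-docker.list URL for the distribution described by an
# os-release file (default /etc/os-release). Hypothetical helper.
repo_list_url() {
    os_release=${1:-/etc/os-release}
    # Source the file in a subshell so ID and VERSION_ID do not leak into
    # (or get inherited from) the caller's environment.
    distribution=$(unset ID VERSION_ID; . "$os_release"; echo "$ID$VERSION_ID")
    echo "https://nvidia.github.io/nvidia-docker/${distribution}/nvidia-docker.list"
}

# Show the URL for the current system if os-release is readable.
if [ -r /etc/os-release ]; then
    repo_list_url /etc/os-release
fi
```

On rolling releases such as Debian testing, VERSION_ID is typically not set, so the string collapses to just the ID (for example "debian"). Fetching the resulting URL and looking for the "# Unsupported distribution!" reply quoted earlier in the thread is one way to confirm whether the repo recognises the host before adding it to apt.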

@pkubik

pkubik commented Mar 19, 2022

> libnvidia-container-1.8.0 with cgroupv2 support is now GA

Are any additional steps required? I have libnvidia-container1 in version 1.8.0-1 on PopOS and the error persists:

docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.

@klueska
Contributor

klueska commented Mar 22, 2022

You must not have the package installed correctly, because that error message doesn't exist in that form in v1.8.0.

YodaEmbedding added a commit to YodaEmbedding/nixos that referenced this issue Mar 24, 2022
Took an entire day to find the linked comment [1] by @biggs, which says:

> Fix on NixOS (where cgroup v2 is also now default): add
> `systemd.enableUnifiedCgroupHierarchy = false;`
> and restart.

Indeed, after applying this commit and then running
`sudo systemctl restart docker`, any of the following commands works:

```bash
sudo docker run --gpus=all nvidia/cuda:10.0-runtime nvidia-smi
sudo docker run --runtime=nvidia nvidia/cuda:10.0-runtime nvidia-smi
sudo nvidia-docker run nvidia/cuda:10.0-runtime nvidia-smi
```

ARGH!!!1

Links:
[1] NVIDIA/nvidia-docker#1447 (comment)
[2] NixOS/nixpkgs#127146
[3] NixOS/nixpkgs#73800
[4] https://blog.zentria.company/posts/nixos-cgroupsv2/

P.S.
I use Colemak, but typing arstarstarst doesn't have the same ring to it.
skogsbrus added a commit to skogsbrus/os that referenced this issue May 1, 2022
@sippmilkwolf

Hi all: I encountered a problem, can you help me?

$ docker run --rm --gpus all nvidia/cuda:11.0-base-ubuntu20.04 nvidia-smi

Unable to find image 'nvidia/cuda:11.0-base-ubuntu20.04' locally
11.0-base-ubuntu20.04: Pulling from nvidia/cuda
d72e567cc804: Pull complete
0f3630e5ff08: Pull complete
b6a83d81d1f4: Pull complete
651c4abefb41: Pull complete
dfde59c9d941: Pull complete
9b2bcdc98b8a: Pull complete
Digest: sha256:46477a46a8bfbe2982e2efb7cc80f3200970109f221971c204bd6b8df13fd3fd
Status: Downloaded newer image for nvidia/cuda:11.0-base-ubuntu20.04
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: driver rpc error: timed out: unknown.

$ nvidia-container-cli -k -d /dev/tty info

-- WARNING, the following logs are for debugging purposes only --

I0722 06:41:38.215548 281198 nvc.c:376] initializing library context (version=1.10.0, build=395fd41701117121f1fd04ada01e1d7e006a37ae)
I0722 06:41:38.215569 281198 nvc.c:350] using root /
I0722 06:41:38.215572 281198 nvc.c:351] using ldcache /etc/ld.so.cache
I0722 06:41:38.215574 281198 nvc.c:352] using unprivileged user 1001:1001
I0722 06:41:38.215586 281198 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0722 06:41:38.215659 281198 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W0722 06:41:38.246482 281199 nvc.c:273] failed to set inheritable capabilities
W0722 06:41:38.246515 281199 nvc.c:274] skipping kernel modules load due to failure
I0722 06:41:38.246846 281200 rpc.c:71] starting driver rpc service
W0722 06:41:48.273267 281198 rpc.c:121] terminating driver rpc service (forced)
I0722 06:41:51.670661 281198 rpc.c:135] driver rpc service terminated with signal 15
nvidia-container-cli: initialization error: driver rpc error: timed out
I0722 06:41:51.670713 281198 nvc.c:434] shutting down library context
[milkwolf@node80 alphafold]$ uname -a
Linux node80 4.18.0-305.3.1.el8.x86_64 #1 SMP Tue Jun 1 16:14:33 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

nvidia-smi -a

==============NVSMI LOG==============

Timestamp : Fri Jul 22 14:42:41 2022
Driver Version : 510.47.03
CUDA Version : 11.6

Attached GPUs : 8
GPU 00000000:1B:00.0
Product Name : NVIDIA A40
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1322721025409
GPU UUID : GPU-1d0c23ba-38ae-9977-97c5-ae814c4a83a3
Minor Number : 0
VBIOS Version : 94.02.5C.00.03
MultiGPU Board : No
Board ID : 0x1b00
GPU Part Number : 900-2G133-0000-000
Module ID : 0
Inforom Version
Image Version : G133.0200.00.05
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : 510.47.03
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x1B
Device : 0x00
Domain : 0x0000
Device Id : 0x223510DE
Bus Id : 00000000:1B:00.0
Sub System Id : 0x145A10DE
GPU Link Info
PCIe Generation
Max : 4
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 0 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 46068 MiB
Reserved : 633 MiB
Used : 0 MiB
Free : 45434 MiB
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 192 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 40 C
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 95 C
GPU Max Operating Temp : 88 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 30.51 W
Power Limit : 300.00 W
Default Power Limit : 300.00 W
Enforced Power Limit : 300.00 W
Min Power Limit : 100.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 1095 MHz
SM : 1095 MHz
Memory : 7250 MHz
Video : 945 MHz
Applications Clocks
Graphics : 1740 MHz
Memory : 7251 MHz
Default Applications Clocks
Graphics : 1740 MHz
Memory : 7251 MHz
Max Clocks
Graphics : 1740 MHz
SM : 1740 MHz
Memory : 7251 MHz
Video : 1530 MHz
Max Customer Boost Clocks
Graphics : 1740 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 706.250 mV
Processes : None

GPU 00000000:1C:00.0
Product Name : NVIDIA A40
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1322721031628
GPU UUID : GPU-548d0907-a96c-784e-3b81-c48e7f19e93b
Minor Number : 1
VBIOS Version : 94.02.5C.00.03
MultiGPU Board : No
Board ID : 0x1c00
GPU Part Number : 900-2G133-0000-000
Module ID : 0
Inforom Version
Image Version : G133.0200.00.05
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : 510.47.03
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x1C
Device : 0x00
Domain : 0x0000
Device Id : 0x223510DE
Bus Id : 00000000:1C:00.0
Sub System Id : 0x145A10DE
GPU Link Info
PCIe Generation
Max : 4
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 0 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 46068 MiB
Reserved : 633 MiB
Used : 0 MiB
Free : 45434 MiB
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 192 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 42 C
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 95 C
GPU Max Operating Temp : 88 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 30.54 W
Power Limit : 300.00 W
Default Power Limit : 300.00 W
Enforced Power Limit : 300.00 W
Min Power Limit : 100.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 1080 MHz
SM : 1080 MHz
Memory : 7250 MHz
Video : 945 MHz
Applications Clocks
Graphics : 1740 MHz
Memory : 7251 MHz
Default Applications Clocks
Graphics : 1740 MHz
Memory : 7251 MHz
Max Clocks
Graphics : 1740 MHz
SM : 1740 MHz
Memory : 7251 MHz
Video : 1530 MHz
Max Customer Boost Clocks
Graphics : 1740 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 706.250 mV
Processes : None

GPU 00000000:1F:00.0
Product Name : NVIDIA A40
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1322721031352
GPU UUID : GPU-80dc21f2-da05-01cc-cc84-b91a0146db63
Minor Number : 2
VBIOS Version : 94.02.5C.00.03
MultiGPU Board : No
Board ID : 0x1f00
GPU Part Number : 900-2G133-0000-000
Module ID : 0
Inforom Version
Image Version : G133.0200.00.05
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : 510.47.03
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x1F
Device : 0x00
Domain : 0x0000
Device Id : 0x223510DE
Bus Id : 00000000:1F:00.0
Sub System Id : 0x145A10DE
GPU Link Info
PCIe Generation
Max : 4
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 0 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 46068 MiB
Reserved : 633 MiB
Used : 0 MiB
Free : 45434 MiB
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 192 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 42 C
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 95 C
GPU Max Operating Temp : 88 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 32.46 W
Power Limit : 300.00 W
Default Power Limit : 300.00 W
Enforced Power Limit : 300.00 W
Min Power Limit : 100.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 1080 MHz
SM : 1080 MHz
Memory : 7250 MHz
Video : 945 MHz
Applications Clocks
Graphics : 1740 MHz
Memory : 7251 MHz
Default Applications Clocks
Graphics : 1740 MHz
Memory : 7251 MHz
Max Clocks
Graphics : 1740 MHz
SM : 1740 MHz
Memory : 7251 MHz
Video : 1530 MHz
Max Customer Boost Clocks
Graphics : 1740 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 706.250 mV
Processes : None

GPU 00000000:23:00.0
Product Name : NVIDIA A40
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1322721031113
GPU UUID : GPU-e3d3cf38-081a-a94f-6568-cbcb9bff6508
Minor Number : 3
VBIOS Version : 94.02.5C.00.03
MultiGPU Board : No
Board ID : 0x2300
GPU Part Number : 900-2G133-0000-000
Module ID : 0
Inforom Version
Image Version : G133.0200.00.05
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : 510.47.03
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x23
Device : 0x00
Domain : 0x0000
Device Id : 0x223510DE
Bus Id : 00000000:23:00.0
Sub System Id : 0x145A10DE
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 0 %
Performance State : P0
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 46068 MiB
Reserved : 633 MiB
Used : 0 MiB
Free : 45434 MiB
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 192 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 43 C
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 95 C
GPU Max Operating Temp : 88 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 84.06 W
Power Limit : 300.00 W
Default Power Limit : 300.00 W
Enforced Power Limit : 300.00 W
Min Power Limit : 100.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 1740 MHz
SM : 1740 MHz
Memory : 7250 MHz
Video : 1530 MHz
Applications Clocks
Graphics : 1740 MHz
Memory : 7251 MHz
Default Applications Clocks
Graphics : 1740 MHz
Memory : 7251 MHz
Max Clocks
Graphics : 1740 MHz
SM : 1740 MHz
Memory : 7251 MHz
Video : 1530 MHz
Max Customer Boost Clocks
Graphics : 1740 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 943.750 mV
Processes : None

GPU 00000000:35:00.0
Product Name : NVIDIA A40
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1322721031777
GPU UUID : GPU-017aa6ee-82f9-7bd2-86bb-64fc98434678
Minor Number : 4
VBIOS Version : 94.02.5C.00.03
MultiGPU Board : No
Board ID : 0x3500
GPU Part Number : 900-2G133-0000-000
Module ID : 0
Inforom Version
Image Version : G133.0200.00.05
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : 510.47.03
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x35
Device : 0x00
Domain : 0x0000
Device Id : 0x223510DE
Bus Id : 00000000:35:00.0
Sub System Id : 0x145A10DE
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 0 %
Performance State : P0
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 46068 MiB
Reserved : 633 MiB
Used : 0 MiB
Free : 45434 MiB
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 192 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 43 C
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 95 C
GPU Max Operating Temp : 88 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 80.98 W
Power Limit : 300.00 W
Default Power Limit : 300.00 W
Enforced Power Limit : 300.00 W
Min Power Limit : 100.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 1740 MHz
SM : 1740 MHz
Memory : 7250 MHz
Video : 1530 MHz
Applications Clocks
Graphics : 1740 MHz
Memory : 7251 MHz
Default Applications Clocks
Graphics : 1740 MHz
Memory : 7251 MHz
Max Clocks
Graphics : 1740 MHz
SM : 1740 MHz
Memory : 7251 MHz
Video : 1530 MHz
Max Customer Boost Clocks
Graphics : 1740 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 981.250 mV
Processes : None

GPU 00000000:36:00.0
Product Name : NVIDIA A40
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1322721032951
GPU UUID : GPU-9202c88d-2d98-8136-ef96-0188084ace39
Minor Number : 5
VBIOS Version : 94.02.5C.00.03
MultiGPU Board : No
Board ID : 0x3600
GPU Part Number : 900-2G133-0000-000
Module ID : 0
Inforom Version
Image Version : G133.0200.00.05
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : 510.47.03
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x36
Device : 0x00
Domain : 0x0000
Device Id : 0x223510DE
Bus Id : 00000000:36:00.0
Sub System Id : 0x145A10DE
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 0 %
Performance State : P0
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 46068 MiB
Reserved : 633 MiB
Used : 0 MiB
Free : 45434 MiB
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 192 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 43 C
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 95 C
GPU Max Operating Temp : 88 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 80.36 W
Power Limit : 300.00 W
Default Power Limit : 300.00 W
Enforced Power Limit : 300.00 W
Min Power Limit : 100.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 1740 MHz
SM : 1740 MHz
Memory : 7250 MHz
Video : 1530 MHz
Applications Clocks
Graphics : 1740 MHz
Memory : 7251 MHz
Default Applications Clocks
Graphics : 1740 MHz
Memory : 7251 MHz
Max Clocks
Graphics : 1740 MHz
SM : 1740 MHz
Memory : 7251 MHz
Video : 1530 MHz
Max Customer Boost Clocks
Graphics : 1740 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 987.500 mV
Processes : None

GPU 00000000:39:00.0
Product Name : NVIDIA A40
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1322721032357
GPU UUID : GPU-f93cde14-a961-f8f8-9764-fe5c07c9077e
Minor Number : 6
VBIOS Version : 94.02.5C.00.03
MultiGPU Board : No
Board ID : 0x3900
GPU Part Number : 900-2G133-0000-000
Module ID : 0
Inforom Version
Image Version : G133.0200.00.05
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : 510.47.03
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x39
Device : 0x00
Domain : 0x0000
Device Id : 0x223510DE
Bus Id : 00000000:39:00.0
Sub System Id : 0x145A10DE
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 0 %
Performance State : P0
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 46068 MiB
Reserved : 633 MiB
Used : 0 MiB
Free : 45434 MiB
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 192 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 44 C
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 95 C
GPU Max Operating Temp : 88 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 86.21 W
Power Limit : 300.00 W
Default Power Limit : 300.00 W
Enforced Power Limit : 300.00 W
Min Power Limit : 100.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 1740 MHz
SM : 1740 MHz
Memory : 7250 MHz
Video : 1530 MHz
Applications Clocks
Graphics : 1740 MHz
Memory : 7251 MHz
Default Applications Clocks
Graphics : 1740 MHz
Memory : 7251 MHz
Max Clocks
Graphics : 1740 MHz
SM : 1740 MHz
Memory : 7251 MHz
Video : 1530 MHz
Max Customer Boost Clocks
Graphics : 1740 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 981.250 mV
Processes : None

GPU 00000000:3D:00.0
Product Name : NVIDIA A40
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1322721030965
GPU UUID : GPU-5904f6ca-6dcc-295b-9608-47d4904e3a85
Minor Number : 7
VBIOS Version : 94.02.5C.00.03
MultiGPU Board : No
Board ID : 0x3d00
GPU Part Number : 900-2G133-0000-000
Module ID : 0
Inforom Version
Image Version : G133.0200.00.05
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : 510.47.03
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x3D
Device : 0x00
Domain : 0x0000
Device Id : 0x223510DE
Bus Id : 00000000:3D:00.0
Sub System Id : 0x145A10DE
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 0 %
Performance State : P0
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 46068 MiB
Reserved : 633 MiB
Used : 0 MiB
Free : 45434 MiB
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 192 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 43 C
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 95 C
GPU Max Operating Temp : 88 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 80.82 W
Power Limit : 300.00 W
Default Power Limit : 300.00 W
Enforced Power Limit : 300.00 W
Min Power Limit : 100.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 1740 MHz
SM : 1740 MHz
Memory : 7250 MHz
Video : 1530 MHz
Applications Clocks
Graphics : 1740 MHz
Memory : 7251 MHz
Default Applications Clocks
Graphics : 1740 MHz
Memory : 7251 MHz
Max Clocks
Graphics : 1740 MHz
SM : 1740 MHz
Memory : 7251 MHz
Video : 1530 MHz
Max Customer Boost Clocks
Graphics : 1740 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 968.750 mV
Processes : None

$ docker version
Client: Docker Engine - Community
Version: 20.10.7
API version: 1.41
Go version: go1.13.15
Git commit: f0df350
Built: Wed Jun 2 11:56:24 2021
OS/Arch: linux/amd64
Context: default
Experimental: true

Server: Docker Engine - Community
Engine:
Version: 20.10.7
API version: 1.41 (minimum version 1.12)
Go version: go1.13.15
Git commit: b0f5bc3
Built: Wed Jun 2 11:54:48 2021
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.4.6
GitCommit: d71fcd7d8303cbf684402823e425e9dd2e99285d
runc:
Version: 1.0.0-rc95
GitCommit: b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7
docker-init:
Version: 0.19.0
GitCommit: de40ad0

$ rpm -qa '*nvidia*'

nvidia-docker2-2.11.0-1.noarch
libnvidia-container-tools-1.10.0-1.x86_64
libnvidia-container1-1.10.0-1.x86_64
nvidia-container-toolkit-1.11.0-0.1.rc.1.x86_64

$ nvidia-container-cli -V

cli-version: 1.10.0
lib-version: 1.10.0
build date: 2022-06-13T10:40+00:00
build revision: 395fd41701117121f1fd04ada01e1d7e006a37ae
build compiler: gcc 8.5.0 20210514 (Red Hat 8.5.0-13)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -I/usr/include/tirpc -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

@elezar
Member

elezar commented Jul 22, 2022

@sippmilkwolf the issue you show is not related to the original post and can occur when persistence mode is not enabled. Please see https://docs.nvidia.com/deploy/driver-persistence/index.html#persistence-daemon.

@sippmilkwolf

@sippmilkwolf the issue you show is not related to the original post and can occur when persistence mode is not enabled. Please see https://docs.nvidia.com/deploy/driver-persistence/index.html#persistence-daemon.

I followed the method in the link you provided and it solved the problem. Thank you for your help :-)

@universebreaker

I'm trying to set up nvidia-docker2 on PopOS 22.04. I confirmed that the installed libnvidia-container1 is 1.8.0, but I still get the cgroup not found message:
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.

@j2l

j2l commented Sep 13, 2022

I'm trying to setup nvidia-docker2 on PopOS 22.04, I confirmed the libnvidia-container1 installed is 1.8.0 but I still got the cgroup not found message: docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.

I also have PopOS 22.04 and the same issue.
I tried kernelstub -a systemd.unified_cgroup_hierarchy=0 and restarted.
I tried adding no-cgroups=true.
I noticed that $ID is POP in:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
so I changed it to:
distribution=$(. /etc/os-release;echo ubuntu$VERSION_ID) && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
then updated and upgraded to the latest packages today.
I then went back to the default with kernelstub -d systemd.unified_cgroup_hierarchy=0 and restarted.

But still no luck. Same error as you @universebreaker
Here are my installed nvidia packages:

sudo apt list --installed *nvidia*
Listing... Done
libnvidia-cfg1-515/jammy,now 515.65.01-1pop0~1663007364~22.04~9bd887b amd64 [installed,automatic]
libnvidia-common-515/jammy,jammy,now 515.65.01-1pop0~1663007364~22.04~9bd887b all [installed,automatic]
libnvidia-compute-515/jammy,now 515.65.01-1pop0~1663007364~22.04~9bd887b amd64 [installed,automatic]
libnvidia-compute-515/jammy,now 515.65.01-1pop0~1663007364~22.04~9bd887b i386 [installed,automatic]
libnvidia-container-tools/jammy,now 1.8.0-1~1644255740~22.04~76ed4b4 amd64 [installed,automatic]
libnvidia-container1/jammy,now 1.8.0-1~1644255740~22.04~76ed4b4 amd64 [installed,automatic]
libnvidia-decode-515/jammy,now 515.65.01-1pop0~1663007364~22.04~9bd887b amd64 [installed,automatic]
libnvidia-decode-515/jammy,now 515.65.01-1pop0~1663007364~22.04~9bd887b i386 [installed,automatic]
libnvidia-egl-wayland1/jammy,jammy,now 1:1.1.9-1.1 amd64 [installed,automatic]
libnvidia-encode-515/jammy,now 515.65.01-1pop0~1663007364~22.04~9bd887b amd64 [installed,automatic]
libnvidia-encode-515/jammy,now 515.65.01-1pop0~1663007364~22.04~9bd887b i386 [installed,automatic]
libnvidia-extra-515/jammy,now 515.65.01-1pop0~1663007364~22.04~9bd887b amd64 [installed,automatic]
libnvidia-fbc1-515/jammy,now 515.65.01-1pop0~1663007364~22.04~9bd887b amd64 [installed,automatic]
libnvidia-fbc1-515/jammy,now 515.65.01-1pop0~1663007364~22.04~9bd887b i386 [installed,automatic]
libnvidia-gl-515/jammy,now 515.65.01-1pop0~1663007364~22.04~9bd887b amd64 [installed,automatic]
libnvidia-gl-515/jammy,now 515.65.01-1pop0~1663007364~22.04~9bd887b i386 [installed,automatic]
nvidia-compute-utils-515/jammy,now 515.65.01-1pop0~1663007364~22.04~9bd887b amd64 [installed,automatic]
nvidia-container-toolkit/jammy,now 1.8.0-1pop1~1644260705~22.04~60691e5 amd64 [installed,automatic]
nvidia-dkms-515/jammy,now 515.65.01-1pop0~1663007364~22.04~9bd887b amd64 [installed]
nvidia-docker2/jammy,jammy,now 2.9.0-1~1644261147~22.04~c7639fe all [installed]
nvidia-driver-515/jammy,now 515.65.01-1pop0~1663007364~22.04~9bd887b amd64 [installed]
nvidia-kernel-common-515/jammy,now 515.65.01-1pop0~1663007364~22.04~9bd887b amd64 [installed,automatic]
nvidia-kernel-source-515/jammy,now 515.65.01-1pop0~1663007364~22.04~9bd887b amd64 [installed,automatic]
nvidia-settings/jammy,jammy,now 510.47.03-0ubuntu1 amd64 [installed,automatic]
nvidia-utils-515/jammy,now 515.65.01-1pop0~1663007364~22.04~9bd887b amd64 [installed,automatic]
xserver-xorg-video-nvidia-515/jammy,now 515.65.01-1pop0~1663007364~22.04~9bd887b amd64 [installed,automatic]
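
When toggling systemd.unified_cgroup_hierarchy as described above, it can help to confirm which cgroup mode the host actually booted into before testing the container again. The following is a minimal sketch, not something from this thread: the function name classify_cgroup_fs is made up for illustration, and it assumes GNU coreutils stat and the common convention that /sys/fs/cgroup is a cgroup2fs mount on a unified (v2) host and a tmpfs on a legacy or hybrid (v1) host.

```shell
#!/bin/sh
# Hypothetical helper: map the filesystem type of /sys/fs/cgroup to a cgroup mode.
classify_cgroup_fs() {
    case "$1" in
        cgroup2fs) echo "v2" ;;       # unified hierarchy
        tmpfs)     echo "v1" ;;       # legacy or hybrid: per-controller mounts under a tmpfs
        *)         echo "unknown" ;;
    esac
}

if [ -d /sys/fs/cgroup ]; then
    # stat -f reports on the filesystem; %T prints its type in human-readable form.
    fs_type=$(stat -fc %T /sys/fs/cgroup 2>/dev/null || echo unknown)
    echo "cgroup mode: $(classify_cgroup_fs "$fs_type") (filesystem type: $fs_type)"
    # Show whether the kernel command line carries an explicit override.
    grep -o 'systemd.unified_cgroup_hierarchy=[01]' /proc/cmdline 2>/dev/null \
        || echo "no unified_cgroup_hierarchy override on the kernel command line"
fi
```

If this still reports v2 after adding the kernel option and rebooting, the option probably did not reach the kernel command line.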

@seth100

seth100 commented Oct 11, 2022

@j2l Exactly the same error for me!
Is there any solution, please? Thanks

@elezar
Member

elezar commented Oct 11, 2022

@universebreaker @seth100 @j2l there was a bug in the code to handle cgroups (for nested containers specifically) in v1.8.0. This was addressed in v1.8.1. Could any of you upgrade the NVIDIA Container Toolkit to at least v1.8.1 and see whether the problem persists?

Note that if you upgrade to v1.11.0 (the latest) there may be an issue with the upgrade as discussed in #1682 (although this is only expected to affect RPM-based distributions).
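
To check quickly whether an installed libnvidia-container1 is at or above v1.8.1, something like the sketch below may be useful. It is only an illustration under a few assumptions: the helper names upstream_version and version_ge are invented here, the comparison relies on GNU sort -V, and the package query uses dpkg-query, so it applies to Debian-based systems only.

```shell
#!/bin/sh
# Hypothetical helper: strip the packaging suffix, keeping only the upstream
# version (everything before the first '-' or '~').
upstream_version() {
    printf '%s\n' "${1%%[-~]*}"
}

# Hypothetical helper: succeed when $1 >= $2 under GNU sort's version ordering.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

if command -v dpkg-query >/dev/null 2>&1; then
    installed=$(dpkg-query -W -f='${Version}' libnvidia-container1 2>/dev/null || true)
    if [ -n "$installed" ]; then
        if version_ge "$(upstream_version "$installed")" 1.8.1; then
            echo "libnvidia-container1 $installed is at or above v1.8.1"
        else
            echo "libnvidia-container1 $installed is older than v1.8.1, consider upgrading"
        fi
    fi
fi
```

A version at or above v1.8.1 only rules out that particular bug; it does not guarantee that the cgroup error is gone.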

@seth100

seth100 commented Oct 11, 2022

@elezar I already have the latest v1.11.0 on my Pop!_OS 22.04 LTS!
List of my packages:

libnvidia-cfg1-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]
libnvidia-common-515/jammy,jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 all [installed,automatic]
libnvidia-compute-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]
libnvidia-compute-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 i386 [installed,automatic]
libnvidia-container-tools/jammy,now 1.11.0-0pop1~1663542983~22.04~fbd1818 amd64 [installed,automatic]
libnvidia-container1/jammy,now 1.11.0-0pop1~1663542983~22.04~fbd1818 amd64 [installed,automatic]
libnvidia-decode-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]
libnvidia-decode-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 i386 [installed,automatic]
libnvidia-egl-wayland1/jammy,now 1:1.1.9-1.1 amd64 [installed,automatic]
libnvidia-encode-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]
libnvidia-encode-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 i386 [installed,automatic]
libnvidia-extra-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]
libnvidia-fbc1-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]
libnvidia-fbc1-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 i386 [installed,automatic]
libnvidia-gl-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]
libnvidia-gl-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 i386 [installed,automatic]
nvidia-compute-utils-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]
nvidia-container-toolkit-base/jammy,now 1.11.0-0pop1~1663593585~22.04~5b13c4c amd64 [installed,automatic]
nvidia-container-toolkit/jammy,now 1.11.0-0pop1~1663593585~22.04~5b13c4c amd64 [installed,automatic]
nvidia-dkms-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed]
nvidia-docker2/jammy,jammy,now 2.11.0-1~1663542535~22.04~0f7519f all [installed]
nvidia-driver-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed]
nvidia-kernel-common-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]
nvidia-kernel-source-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]
nvidia-settings/jammy,now 510.47.03-0ubuntu1 amd64 [installed,automatic]
nvidia-utils-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]
xserver-xorg-video-nvidia-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]

@seth100

seth100 commented Oct 12, 2022

OK, I found the fix:

sudo vi /etc/apt/preferences.d/pop-default-settings

# Add this at the end of the file:
# ---
Package: *
Pin: origin nvidia.github.io
Pin-Priority: 1002
# ---

sudo apt-get update
sudo apt-get upgrade

here are the package versions now:

libnvidia-cfg1-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]
libnvidia-common-515/jammy,jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 all [installed,automatic]
libnvidia-compute-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]
libnvidia-compute-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 i386 [installed,automatic]
libnvidia-container-tools/bionic,now 1.12.0~rc.1-1 amd64 [installed,automatic]
libnvidia-container1/bionic,now 1.12.0~rc.1-1 amd64 [installed,automatic]
libnvidia-decode-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]
libnvidia-decode-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 i386 [installed,automatic]
libnvidia-egl-wayland1/jammy,now 1:1.1.9-1.1 amd64 [installed,automatic]
libnvidia-encode-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]
libnvidia-encode-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 i386 [installed,automatic]
libnvidia-extra-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]
libnvidia-fbc1-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]
libnvidia-fbc1-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 i386 [installed,automatic]
libnvidia-gl-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]
libnvidia-gl-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 i386 [installed,automatic]
nvidia-compute-utils-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]
nvidia-container-toolkit-base/bionic,now 1.12.0~rc.1-1 amd64 [installed,automatic]
nvidia-container-toolkit/bionic,now 1.12.0~rc.1-1 amd64 [installed,automatic]
nvidia-dkms-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed]
nvidia-docker2/bionic,now 2.11.0-1 all [installed]
nvidia-driver-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed]
nvidia-kernel-common-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]
nvidia-kernel-source-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]
nvidia-settings/jammy,now 510.47.03-0ubuntu1 amd64 [installed,automatic]
nvidia-utils-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]
xserver-xorg-video-nvidia-515/jammy,now 515.65.01-1pop0~1663626642~22.04~1f94f41 amd64 [installed,automatic]

and now docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi works!!!
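
The pin above can also be generated and kept in its own preferences file instead of editing pop-default-settings by hand. This is just a sketch: the function name pin_stanza and the file name nvidia-container-pin are made up for illustration, while the origin and priority are the ones used in the fix above.

```shell
#!/bin/sh
# Hypothetical helper: print an APT pin stanza covering every package
# served from the given origin.
pin_stanza() {
    origin=$1
    priority=$2
    printf 'Package: *\nPin: origin %s\nPin-Priority: %s\n' "$origin" "$priority"
}

# Preview the stanza with the values from the fix above.
pin_stanza nvidia.github.io 1002

# To apply it (assumed file name; files in preferences.d need either no
# extension or a .pref extension):
#   pin_stanza nvidia.github.io 1002 | sudo tee /etc/apt/preferences.d/nvidia-container-pin
#   sudo apt-get update
# To see which repository wins for a package afterwards:
#   apt-cache policy libnvidia-container1
```

Note that a priority above 1000 lets APT pick the pinned origin even when that means a downgrade, so apt-cache policy is worth checking before running the upgrade.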

@RafalSkolasinski

The method from @seth100 did not work for me on my installation of Pop 22.04, but using the install instructions from https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#setting-up-nvidia-container-toolkit did the trick.
