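This problem bugged me for two days, so I'm writing it down.

The problem shows up when starting an image that already has CUDA installed. In my case it happened when starting the Horizon OpenExplorer (天工开物) toolchain image, and the error was: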

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: mount error: file creation failed: /var/lib/docker/overlay2/6c984e34fc5db268b0ace9cfe81f3786af8af43477ad96269a15b4fc7abed9a6/merged/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: file exists: unknown.
ERRO[0001] error waiting for container:
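
The long overlay2 path in the message is on the host; the part after /merged is the path of the file inside the image, which is the one that matters for the fix. A minimal sketch for pulling it out of the error line (conflicting_path is a hypothetical helper name I made up, not part of Docker or the NVIDIA tools):

```shell
#!/bin/bash
# Hypothetical helper: read an nvidia-container-cli "file exists" error line
# on stdin and print the path of the conflicting file as seen inside the image.
conflicting_path() {
  # Keep only what sits between ".../merged" and ": file exists".
  sed -n 's#.*file creation failed: .*/merged\(/[^:]*\): file exists.*#\1#p'
}
```

Piping the second line of the error above into conflicting_path should print the path under /usr/lib/x86_64-linux-gnu that has to be dealt with inside the image.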

The script that creates the container:

#!/bin/bash

dataset_path=$1
run_type=$2
version=v2.6.2b

if [ -z "$dataset_path" ];then
  echo "Please specify the dataset path"
  # Exit with a non-zero status so callers can tell the script failed
  exit 1
fi
dataset_path=$(readlink -f "$dataset_path")

echo "Docker version is ${version}"
echo "Dataset path is $dataset_path"

open_explorer_path=$(readlink -f "$(dirname "$0")")
echo "OpenExplorer package path is $open_explorer_path"

echo "Run in GPU mode"
docker run -it -p 9991:22 --net=bridge --ipc=host --pid=host --name oe_infer \
  --gpus all --privileged \
  -v "$open_explorer_path":/open_explorer \
  -v "$dataset_path":/data/horizon_x3/data \
  -v /workspace:/workspace \
  openexplorer/ai_toolchain_ubuntu_20_xj3_gpu:"$version"
# CPU mode: same command without --gpus all and --privileged
# docker run -it -p 9991:22 --net=bridge --ipc=host --pid=host --name oe_infer \
#   -v "$open_explorer_path":/open_explorer \
#   -v "$dataset_path":/data/horizon_x3/data \
#   -v /workspace:/workspace \
#   openexplorer/ai_toolchain_ubuntu_20_xj3_gpu:"$version"
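
The script reads run_type from $2 but never uses it, so switching between GPU and CPU mode means commenting lines in and out. A rough sketch of how the switch could be driven by that argument instead; run_args is a hypothetical helper, and treating the value "cpu" as the CPU switch is my assumption, not something the script above defines:

```shell
#!/bin/bash
# Hypothetical helper: print the mode-dependent docker run options, one per line.
# "cpu" drops --gpus all and --privileged; any other value keeps them,
# mirroring the two docker run commands in the script above.
run_args() {
  local run_type=$1
  local args=(-it -p 9991:22 --net=bridge --ipc=host --pid=host --name oe_infer)
  if [ "$run_type" != "cpu" ]; then
    args+=(--gpus all --privileged)
  fi
  printf '%s\n' "${args[@]}"
}
```

The output could be read back with mapfile -t args < <(run_args "$run_type") and passed to docker run as "${args[@]}", followed by the -v options and the image name.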

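[Solution]
1> Creating the GPU container directly with the script above produces the error shown, most likely because a file already present in the image conflicts with one that the NVIDIA runtime tries to mount. So first create the container in CPU mode, without GPU access: uncomment the commented-out docker run command above and comment out the GPU one;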

2> Run this CPU container; this should succeed. Inside the container, delete the file named in the error, in my case /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 (one approach I saw online also deletes /usr/lib/x86_64-linux-gnu/libcuda.so.1);

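3> Open a new terminal and commit this CPU container as a new image. For simplicity you can overwrite the original image, in my case docker commit docker_id openexplorer/ai_toolchain_ubuntu_20_xj3_gpu:v2.6.2b (include the tag: without one, docker commit tags the result as latest, and the script above would still start the unmodified v2.6.2b image). docker images should then show this image as created a few seconds ago, which means it worked;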

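4> Remove the CPU container (the script reuses the name oe_infer, so docker run would otherwise fail with a name conflict), switch the script back to the GPU docker run command, and run it again to create the GPU container. The problem should be gone.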

5> Run nvidia-smi and nvcc -V inside the container; if both print their normal output, everything should be working.
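
Steps 2 to 4 can also be done non-interactively from the host. A minimal sketch, assuming the CPU container from step 1 is named oe_infer and is still running, and that the image tag is v2.6.2b as in the script; fix_image and the DOCKER override are hypothetical names I added for illustration:

```shell
#!/bin/bash
# Hypothetical helper covering steps 2-4 prep. Set DOCKER="echo docker"
# to print the commands instead of running them.
fix_image() {
  local docker_cmd=${DOCKER:-docker}
  local container=${CONTAINER:-oe_infer}
  local image=openexplorer/ai_toolchain_ubuntu_20_xj3_gpu:v2.6.2b

  # Step 2: delete the conflicting library inside the running CPU container.
  $docker_cmd exec "$container" rm -f /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
  # Step 3: commit the container over the original image, tag included.
  $docker_cmd commit "$container" "$image"
  # Free the container name so the GPU run can reuse it.
  $docker_cmd rm -f "$container"
}
```

Running it once with DOCKER="echo docker" set is a cheap way to check the three commands before letting it touch the real image. After that, rerun the creation script in GPU mode and check nvidia-smi and nvcc -V as in step 5.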
