项目场景 [with ERRTYPE = cudaError; bool THRW = true] CUDA failure 999: unknown error ; GPU=24 :

需要升级之前老的程序,之前的cuda 是10.2


问题描述:

环境

cuda 11.2 (之前是10.2)

onnxruntime-gpu 1.10

python 3.9.7

项目场景 with ERRTYPE = cudaError CUDA failure 999 unknown error_python

启动程序的时候

Traceback (most recent call last):
  File "/home/aiuser/cover/liheng-foggun/app.py", line 15, in <module>
    model = DetectMultiBackend(weights=config.paddle.model_file)
  File "/home/aiuser/miniconda3/envs/cover/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/aiuser/cover/liheng-foggun/models/yolo.py", line 37, in __init__
    self.session = onnxruntime.InferenceSession(weights, providers=['CUDAExecutionProvider'])
  File "/home/aiuser/miniconda3/envs/cover/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 335, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/home/aiuser/miniconda3/envs/cover/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 379, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
RuntimeError: /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:122 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE =
 cudaError; bool THRW = true] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:116 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*
, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] CUDA failure 999: unknown error ; GPU=24 ; hostname=aiserver-sl-01 ; expr=cudaSetDevice(info_.device_id);

原因分析:

1.刚开始以为是onnxruntime-gpu 版本问题 升级到了 1.12 还是报错

2.网上又说是不兼容的问题

3.试试重装下驱动,卸载了11.2 的时候 通过nvidia-smi 发现之前10.2的驱动还存在

4.是因为之前的驱动没有卸载干净


解决方案:

1.卸载10.2

sudo /usr/local/cuda-10.2/bin/cuda-uninstaller

2.安装新驱动

#离线安装 515.57
sudo ./NVIDIA-Linux-x86_64-515.57.run -no-x-check -no-nouveau-check

VIDIA-Linux-x86_64-515.57.run -no-x-check -no-nouveau-check