How to create environments with different Python and CUDA versions
Linux
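I use Ubuntu, or more precisely Windows Subsystem for Linux 2 (WSL2). This article uses conda to create separate Python environments. If you need to reproduce several papers whose environments differ, I strongly advise against installing CUDA directly on the system.
Setting the Python version with conda
Some "ancient" papers from 2020 ship with source code (in neural networks, ancient means roughly a year old). If you run that code in your everyday Python environment, package conflicts are very likely.
Check the conda version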
conda -V
What to do if the miniconda version cannot be found
Create a new Python environment
conda create --name softnet_spotme -c anaconda python=3.7.16
Activate the environment
conda activate softnet_spotme
Leave the environment
conda deactivate
Delete the environment
conda remove -n softnet_spotme --all
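What the prompt looks like once the environment is active (this matters: if a different environment is active, packages get installed somewhere else)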
(softnet_spotme) zhutianci@DESKTOP-M29UJV1:~$
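Besides reading the prompt, the active environment can be checked from inside Python. The sketch below relies on the CONDA_DEFAULT_ENV variable that conda exports on activation; the helper name active_conda_env is made up for illustration.

```python
import os
import sys


def active_conda_env(environ=None):
    """Return the name of the active conda environment, or None.

    Hypothetical helper: it only reads CONDA_DEFAULT_ENV, which
    `conda activate` sets in the shell.
    """
    env = os.environ if environ is None else environ
    name = env.get("CONDA_DEFAULT_ENV")
    return name or None


if __name__ == "__main__":
    print("active conda env:", active_conda_env())
    # sys.prefix is the root of the interpreter running this script,
    # i.e. where `pip install` would put packages.
    print("interpreter prefix:", sys.prefix)
```

If the printed name is not softnet_spotme, run conda activate softnet_spotme before installing anything.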
Source code pitfalls
Platform-specific problems
The cause: I am on Linux, while the original author used Windows.
ERROR: Could not find a version that satisfies the requirement pywin32==227 (from versions: none)
ERROR: No matching distribution found for pywin32==227
So I deleted pywin32 from requirement.txt.
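Rather than editing the file by hand, the Windows-only pins could be filtered out with a few lines of Python. This is a rough sketch, assuming the Windows-only package names are known in advance; WINDOWS_ONLY and drop_windows_only are illustrative names, not part of the paper's repository.

```python
import re

# Assumption: packages that are only published for Windows. Extend as needed.
WINDOWS_ONLY = {"pywin32", "pypiwin32"}


def drop_windows_only(lines, windows_only=WINDOWS_ONLY):
    """Return the requirement lines whose package name is not Windows-only."""
    kept = []
    for line in lines:
        stripped = line.strip()
        # The package name is everything before the first version
        # specifier, marker, extras bracket, or whitespace.
        name = re.split(r"[=<>!~;\[\s]", stripped, maxsplit=1)[0].lower()
        if name in windows_only:
            continue
        kept.append(line)
    return kept


if __name__ == "__main__":
    sample = ["numpy==1.19.5", "pywin32==227", "dlib==19.21.1"]
    print(drop_windows_only(sample))
```

An alternative that keeps the file usable on both platforms is a pip environment marker, for example pywin32==227; sys_platform == "win32", which makes pip skip the line on Linux.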
Packages with no matching release at all
ERROR: Could not find a version that satisfies the requirement tensorflow-gpu==2.4.1 (from versions: 2.5.0, 2.5.1, 2.5.2, 2.5.3, 2.6.0, 2.6.1, 2.6.2, 2.6.3, 2.6.4, 2.6.5, 2.7.0rc0, 2.7.0rc1, 2.7.0, 2.7.1, 2.7.2, 2.7.3, 2.7.4, 2.8.0rc0, 2.8.0rc1, 2.8.0, 2.8.1, 2.8.2, 2.8.3, 2.8.4, 2.9.0rc0, 2.9.0rc1, 2.9.0rc2, 2.9.0, 2.9.1, 2.9.2, 2.9.3, 2.10.0rc0, 2.10.0rc1, 2.10.0rc2, 2.10.0rc3, 2.10.0, 2.10.1, 2.11.0rc0, 2.11.0rc1, 2.11.0rc2, 2.11.0, 2.12.0)
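If some packages cannot be found, your Python version is probably too new; this paper's code should run on Python 3.7 to 3.8. pip only lists releases that have a build compatible with the running interpreter, which is why 2.4.1 is missing from the list above.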
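A quick guard at the top of a setup script can catch the wrong interpreter before pip does. This is a minimal sketch; the 3.7 to 3.8 range comes from the paragraph above, and python_in_range is a hypothetical helper name.

```python
import sys


def python_in_range(version, low=(3, 7), high=(3, 8)):
    """True if the (major, minor) part of `version` lies within [low, high]."""
    return low <= tuple(version[:2]) <= high


if __name__ == "__main__":
    if not python_in_range(sys.version_info):
        print("Python %d.%d is outside 3.7-3.8; old pins may not resolve."
              % sys.version_info[:2])
```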
dlib problems
Running the following command may fail with all kinds of errors.
pip install dlib==19.21.1
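My fix is probably not general: the errors went away after I removed the system-wide CUDA installation. Another possible fix is to install cmake.
CUDA pitfalls (mismatched CUDA versions can cause all kinds of errors)
Removing the system-wide CUDA
Cleaning up Ubuntu package sources (installing CUDA directly on the system is not recommended)
When you install CUDA, some tutorials have you add various package sources to Ubuntu, and these cause conflicts later on.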
https://askubuntu.com/questions/307/how-can-ppas-be-removed
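This is the method I used, but do not remove the NVIDIA driver along the way. If you do remove it, it is not a disaster; it should be easy to reinstall.
I still recommend using conda so that several different versions can coexist.
Installing CUDA in the current environment
Remember: everything below is installed into the current conda environment only. Once you switch to another environment, it no longer applies.
Install cudatoolkit; look up the version that matches your TensorFlow version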
conda install -c conda-forge cudatoolkit=11.1
Install cuDNN
pip install nvidia-cudnn-cu11==8.6.0.163
Install TensorFlow
pip install tensorflow==2.4.1
Create the configuration directory. If you are unsure where it will end up, run echo $CONDA_PREFIX first.
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
Write CUDNN_PATH into it
echo 'CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
Then write LD_LIBRARY_PATH
echo 'export LD_LIBRARY_PATH=$CONDA_PREFIX/lib/:$CUDNN_PATH/lib:$LD_LIBRARY_PATH' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
Source the file
source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
nvcc will not run?
This requires installing cudatoolkit-dev
conda install -c conda-forge cudatoolkit-dev=11.1
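Reasons the GPU may be unusable
Some files fail to load; search the whole filesystem for them
sudo find / -name 'libcusolver.so.10'
Search a specific directory
find /home/zhutianci/miniconda3/envs/softnet_spotme/lib -name 'libcusolver.so.10'
If you find the file, add its directory to $LD_LIBRARY_PATH. Usually it is not there, though, in which case you can create a hard link from the newer version
cd $CONDA_PREFIX/lib # $LD_LIBRARY_PATH may hold several colon-separated entries, so cd into the environment's lib directory
sudo ln libcusolver.so.11 libcusolver.so.10 # hard link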
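The search can also be limited to the directories the dynamic loader will actually consult. The sketch below walks every entry of LD_LIBRARY_PATH and reports where a given library file exists; find_library is a hypothetical helper, and it does not cover system directories such as those configured in /etc/ld.so.conf.

```python
import os


def find_library(name, search_path):
    """Return every existing path `dir/name` for the directories listed in
    `search_path` (entries separated by os.pathsep, ':' on Linux)."""
    hits = []
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue  # skip empty entries such as a trailing ':'
        candidate = os.path.join(directory, name)
        if os.path.exists(candidate):
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    path = os.environ.get("LD_LIBRARY_PATH", "")
    print(find_library("libcusolver.so.10", path))
```

An empty list means none of the LD_LIBRARY_PATH directories contains the file, which is the situation where the link above becomes necessary.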
Verify that the GPU is usable before running the model
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
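Make sure there are no real errors. My WSL2 kernel has no NUMA support; enabling it looks like a hassle, since you would have to compile the kernel yourself.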
2023-07-30 13:28:14.671071: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2023-07-30 13:28:15.338384: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2023-07-30 13:28:15.466891: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:927] could not open file to read NUMA node: /sys/bus/pci/devices/0000:09:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-30 13:28:15.466941: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:09:00.0 name: NVIDIA GeForce RTX 3070 computeCapability: 8.6
coreClock: 1.725GHz coreCount: 46 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2023-07-30 13:28:15.466963: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2023-07-30 13:28:15.468342: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2023-07-30 13:28:15.468382: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2023-07-30 13:28:15.468841: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2023-07-30 13:28:15.468974: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2023-07-30 13:28:15.470469: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2023-07-30 13:28:15.470803: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2023-07-30 13:28:15.470886: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2023-07-30 13:28:15.470959: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:927] could not open file to read NUMA node: /sys/bus/pci/devices/0000:09:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-30 13:28:15.471000: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:927] could not open file to read NUMA node: /sys/bus/pci/devices/0000:09:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-07-30 13:28:15.471022: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
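In the log above, the NUMA lines are marked E, yet the device is still found and added ("Found device 0", "Adding visible gpu devices: 0"). To avoid reading the log by eye, the one-liner can be wrapped in a small script that reports the outcome explicitly. This is a minimal sketch; visible_gpus is a hypothetical helper, and it only uses tf.config.list_physical_devices from the command above.

```python
def visible_gpus():
    """Return TensorFlow's list of GPUs, or None if TensorFlow is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return tf.config.list_physical_devices("GPU")


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow is not installed in this environment.")
    elif not gpus:
        print("TensorFlow imported, but no GPU is visible.")
    else:
        print("GPUs:", gpus)
```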
Windows
A file cannot be found
Error:
Could not load dynamic library 'cusolver64_10.dll'; dlerror: cusolver64_10.dll not found
If you use conda, go to this directory; the path differs from machine to machine
cd D:\Softwares\miniconda3\envs\SFAMNet\Library\bin
Open a terminal as administrator and run the following command:
New-Item -ItemType SymbolicLink -Path .\cusolver64_10.dll -Target .\cusolver64_11.dll
Configure the PyCharm terminal
cmd.exe "/K" "D:\Softwares\miniconda3\Scripts\activate.bat" "D:\Softwares\miniconda3"
Install nvcc
conda install -c "nvidia/label/cuda-11.3.0" cuda-nvcc
Verify PyTorch
python -c "import torch; print(torch.cuda.is_available())"