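Preface
- I ran into this at work and wrote up some brief notes.
- My understanding is limited, so corrections are welcome.
For each person there is only one true duty: to find oneself, and then to hold to it in one's heart for a lifetime, wholeheartedly and without pause. Every other path is incomplete. It is a way of escape, a cowardly return to the ideals of the crowd, drifting with the current, a fear of one's own inner self. (Hermann Hesse, Demian)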
Current system environment
System environment
┌──[root@test]-[~]
└─$hostnamectl
Static hostname: test
Icon name: computer-desktop
Chassis: desktop
Machine ID: addc7ca21ef24518a9465c499eb3c8b7
Boot ID: 14aa59cc6960431c95d328684b521844
Operating System: Ubuntu 22.04.2 LTS
Kernel: Linux 5.19.0-43-generic
Architecture: x86-64
Hardware Vendor: Micro-Star International Co., Ltd.
Hardware Model: MS-7C83
Graphics card model
┌──[root@test]-[~]
└─$lspci -vnn | grep VGA
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA106 [GeForce RTX 3060 Lite Hash Rate] [10de:2504] (rev a1) (prog-if 00 [VGA controller])
┌──[root@test]-[~]
└─$
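Before installing the NVIDIA driver, the Nouveau driver must be disabled.
Nouveau is an open-source driver for NVIDIA graphics cards, developed and maintained by the community. It can stand in for the official NVIDIA driver on Linux, but its performance and feature set may fall short of the official driver.
With Nouveau you may not be able to use NVIDIA features such as CUDA and the deep learning libraries. If you need them, install the official NVIDIA driver.
Disabling the Nouveau driver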
┌──[root@test]-[~]
└─$sudo vim /etc/modprobe.d/blacklist-nouveau.conf
┌──[root@test]-[~]
└─$cat /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
┌──[root@test]-[~]
└─$sudo update-initramfs -u
update-initramfs: Generating /boot/initrd.img-5.19.0-43-generic
After the reboot, no output from the lsmod check below means Nouveau was disabled successfully.
┌──[root@test]-[~]
└─$reboot
┌──[root@test]-[~]
└─$lsmod | grep nouveau
┌──[root@test]-[~]
└─$
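The `lsmod | grep nouveau` check above can also be scripted. The following is a minimal sketch, assuming a Linux system where `/proc/modules` lists one loaded module per line with the module name in the first field; the helper name `is_module_loaded` is made up for illustration.

```python
from pathlib import Path


def is_module_loaded(name: str, modules_text: str) -> bool:
    """Return True if `name` is the first field of any line in /proc/modules text."""
    for line in modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False


if __name__ == "__main__":
    proc_modules = Path("/proc/modules")
    if proc_modules.exists():
        # False is the desired result after blacklisting and rebooting
        print("nouveau loaded:", is_module_loaded("nouveau", proc_modules.read_text()))
    else:
        print("/proc/modules not found, cannot check")
```

Matching on the first field avoids a false positive when another module merely lists nouveau as a dependent.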
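Installing the NVIDIA driver
The driver version used here, nvidia-driver-510, must match the driver version expected by the CUDA release installed later.
If a driver was installed previously, uninstall it first:
# Check the graphics card model
lspci -vnn | grep VGA
# Remove the old driver (the pattern is quoted so the shell does not expand it)
sudo apt-get remove --purge '^nvidia-.*'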
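Offline installation
In an offline environment the driver has to be installed manually. Download it from: https://www.nvidia.com/Download/index.aspx?lang=en-us
# Make the .run file executable
sudo chmod a+x NVIDIA-Linux-x86_64-515.86.01.run
# Install
sudo ./NVIDIA-Linux-x86_64-515.86.01.run -no-x-check -no-nouveau-check -no-opengl-files
# -no-x-check: do not abort the installation if an X server is running
# -no-nouveau-check: skip the check for a loaded Nouveau driver
# -no-opengl-files: install only the driver files, without the OpenGL files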
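Online installation
With network access, install through the package manager. This is the approach used below. First list the available driver versions: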
┌──[root@test]-[~]
└─$ubuntu-drivers devices
== /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0 ==
modalias : pci:v000010DEd00002504sv00001462sd0000397Dbc03sc00i00
vendor : NVIDIA Corporation
model : GA106 [GeForce RTX 3060 Lite Hash Rate]
driver : nvidia-driver-530-open - distro non-free
driver : nvidia-driver-470 - distro non-free
driver : nvidia-driver-525-open - third-party non-free
driver : nvidia-driver-535 - third-party non-free
driver : nvidia-driver-520 - third-party non-free
driver : nvidia-driver-510 - distro non-free
driver : nvidia-driver-525 - third-party non-free
driver : nvidia-driver-515-server - distro non-free
driver : nvidia-driver-535-open - third-party non-free recommended
driver : nvidia-driver-530 - third-party non-free
driver : nvidia-driver-470-server - distro non-free
driver : nvidia-driver-515-open - distro non-free
driver : nvidia-driver-525-server - distro non-free
driver : nvidia-driver-515 - third-party non-free
driver : xserver-xorg-video-nouveau - distro free builtin
┌──[root@test]-[~]
└─$
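The `ubuntu-drivers devices` listing can be parsed to pull out the package names and the one flagged as recommended. The sketch below assumes the line layout shown above (`driver : <package> - <tags>`); the function names and the shortened sample are illustrative only. The recommended package is not necessarily the one to install: this article picks nvidia-driver-510 to match CUDA 11.6.

```python
import re
from typing import List, Optional, Tuple

# Matches lines such as:
#   driver : nvidia-driver-535-open - third-party non-free recommended
DRIVER_LINE = re.compile(r"^driver\s*:\s*(\S+)\s+-\s+(.+)$")


def parse_drivers(output: str) -> List[Tuple[str, str]]:
    """Return (package, tags) pairs for every driver line in the listing."""
    drivers = []
    for line in output.splitlines():
        match = DRIVER_LINE.match(line.strip())
        if match:
            drivers.append((match.group(1), match.group(2).strip()))
    return drivers


def recommended_driver(output: str) -> Optional[str]:
    """Return the package tagged 'recommended', or None if there is none."""
    for name, tags in parse_drivers(output):
        if "recommended" in tags.split():
            return name
    return None


SAMPLE = """\
vendor : NVIDIA Corporation
driver : nvidia-driver-510 - distro non-free
driver : nvidia-driver-535-open - third-party non-free recommended
driver : xserver-xorg-video-nouveau - distro free builtin
"""

if __name__ == "__main__":
    print(recommended_driver(SAMPLE))
```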
Install
┌──[root@test]-[~]
└─$sudo apt install nvidia-driver-510 -y
Reboot the machine
┌──[root@test]-[~]
└─$reboot
Check that the installation succeeded and note the version information
┌──[root@test]-[~]
└─$nvidia-smi
Thu Jun 15 11:49:43 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03 Driver Version: 510.108.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| 0% 38C P8 16W / 170W | 172MiB / 12288MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1386 G /usr/lib/xorg/Xorg 60MiB |
| 0 N/A N/A 1650 G /usr/bin/gnome-shell 109MiB |
+-----------------------------------------------------------------------------+
┌──[root@test]-[~]
└─$cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 510.108.03 Thu Oct 20 05:10:45 UTC 2022
GCC version: gcc version 11.3.0 (Ubuntu 11.3.0-1ubuntu1~22.04.1)
┌──[root@test]-[~]
└─$
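Installing CUDA
CUDA is a parallel computing platform and programming model from NVIDIA, designed to use the GPU's parallel computing power to accelerate compute-intensive applications.
CUDA consists of the CUDA driver and the CUDA Toolkit. It supports several programming languages, including C, C++, Fortran, and Python.
- The CUDA driver is the interface between the GPU and the operating system.
- The CUDA Toolkit provides the compiler, libraries, and tools for developing CUDA applications.
If CUDA was installed previously, uninstall it first: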
sudo /usr/local/cuda-11.6/bin/cuda-uninstaller
sudo rm -rf /usr/local/cuda-11.6
sudo: /usr/local/cuda-11.8/bin/uninstall_cuda_8.0.pl: command not found
┌──[root@test]-[~]
└─$sudo /usr/local/cuda-11.6/bin/
bin2c cuda-gdbserver ncu nsys-ui nv-nsight-cu-cli
computeprof cuda-memcheck ncu-ui nvcc nvprof
compute-sanitizer cuda-uninstaller nsight_ee_plugins_manage.sh __nvcc_device_query nvprune
crt/ cu++filt nsight-sys nvdisasm nvvp
cudafe++ cuobjdump nsys nvlink ptxas
cuda-gdb fatbinary nsys-exporter nv-nsight-cu
┌──[root@test]-[~]
└─$sudo /usr/local/cuda-11.6/bin/cuda-uninstaller
In the terminal UI that appears, press Space to select every component, then choose Done. Once the uninstall finishes, install again.
┌──[root@test]-[~]
└─$sudo /usr/local/cuda-11.6/bin/cuda-uninstaller
Successfully uninstalled
┌──[root@test]-[~]
└─$sudo rm -rf /usr/local/cuda-11.6
Download the installer from the official site
┌──[root@test]-[~]
└─$chmod +x cuda_*
The CUDA installer chosen here is cuda_11.6.0_510.39.01_linux.run, the release that corresponds to driver 510.
┌──[root@test]-[~]
└─$ll cuda*
-rwxr-xr-x 1 root root 3488951771 1月 11 2022 cuda_11.6.0_510.39.01_linux.run*
-rwxr-xr-x 1 root root 3490450898 5月 5 2022 cuda_11.7.0_515.43.04_linux.run*
-rwxr-xr-x 1 root root 4317456991 4月 17 23:04 cuda_12.1.1_530.30.02_linux.run*
-rwxr-xr-x 1 root root 853 5月 17 19:52 cuda_log.log*
-rw-r--r-- 1 root root 2472241638 7月 29 2021 cuda-repo-ubuntu2004-11-4-local_11.4.1-470.57.02-1_amd64.deb
-rw-r--r-- 1 root root 2699477842 5月 5 2022 cuda-repo-ubuntu2204-11-7-local_11.7.0-515.43.04-1_amd64.deb
┌──[root@test]-[~]
└─$
┌──[root@test]-[~]
└─$sudo ./cuda_12.1.1_530.30.02_linux.run
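The driver was already installed above, so leave it unselected in the installer and install only the CUDA components. A successful installation prints: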
┌──[root@test]-[~]
└─$sudo ./cuda_11.6.0_510.39.01_linux.run
===========
= Summary =
===========
Driver: Not Selected
Toolkit: Installed in /usr/local/cuda-11.6/
Please make sure that
- PATH includes /usr/local/cuda-11.6/bin
- LD_LIBRARY_PATH includes /usr/local/cuda-11.6/lib64, or, add /usr/local/cuda-11.6/lib64 to /etc/ld.so.conf and run ldconfig as root
To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-11.6/bin
***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 510.00 is required for CUDA 11.6 functionality to work.
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
sudo <CudaInstaller>.run --silent --driver
Logfile is /var/log/cuda-installer.log
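The warning above says CUDA 11.6 needs a driver of at least 510.00. A minimal sketch of that comparison follows, reading the installed version from the text of `/proc/driver/nvidia/version` as shown earlier. The function names are hypothetical, and the two minimum versions are copied from the installer warnings quoted in this article, not from a full compatibility table.

```python
import re
from pathlib import Path
from typing import Optional, Tuple

# Minimum driver versions, taken from the installer warnings quoted in this article
MIN_DRIVER = {"11.6": (510, 0), "12.1": (530, 0)}


def parse_driver_version(proc_text: str) -> Optional[Tuple[int, ...]]:
    """Extract the dotted driver version from the 'NVRM version:' line."""
    match = re.search(r"NVRM version:.*?(\d+\.\d+(?:\.\d+)?)", proc_text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def driver_satisfies(installed: Tuple[int, ...], cuda_release: str) -> bool:
    """Compare the installed driver against the minimum for a CUDA release."""
    return installed >= MIN_DRIVER[cuda_release]


if __name__ == "__main__":
    version_file = Path("/proc/driver/nvidia/version")
    if version_file.exists():
        installed = parse_driver_version(version_file.read_text())
        print("driver:", installed)
        if installed is not None:
            print("ok for CUDA 11.6:", driver_satisfies(installed, "11.6"))
    else:
        print("NVIDIA kernel module not loaded")
```

With the 510.108.03 driver from this article, the check passes for CUDA 11.6 and fails for CUDA 12.1.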
Add the environment variables named in the installer summary (PATH and LD_LIBRARY_PATH), then check them:
┌──[root@test]-[/b1205]
└─$echo $LD_LIBRARY_PATH
/usr/local/cuda-11.6/lib64:/usr/local/cuda-11.6/lib64
┌──[root@test]-[/b1205]
└─$echo $PATH
/usr/local/cuda-11.6/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
┌──[root@test]-[/b1205]
└─$
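The two `echo` checks above can be folded into a small script. This is a sketch only: the helper name `missing_cuda_entries` is made up, and `/usr/local/cuda-11.6` is assumed as the install prefix because that is what the installer summary reports.

```python
import os
from typing import List, Mapping


def missing_cuda_entries(env: Mapping[str, str], cuda_home: str) -> List[str]:
    """Return the variables that do not yet contain the expected CUDA directory."""
    expected = {
        "PATH": f"{cuda_home}/bin",
        "LD_LIBRARY_PATH": f"{cuda_home}/lib64",
    }
    missing = []
    for var, directory in expected.items():
        # Split on ':' and compare whole entries, not substrings
        entries = env.get(var, "").split(":")
        if directory not in entries:
            missing.append(var)
    return missing


if __name__ == "__main__":
    missing = missing_cuda_entries(os.environ, "/usr/local/cuda-11.6")
    print("missing:", missing if missing else "nothing")
```

Comparing whole entries also makes it easy to avoid appending a directory twice, as happened with the duplicated lib64 entry in the LD_LIBRARY_PATH output above.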
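Installing cuDNN
cuDNN is NVIDIA's acceleration library for deep neural networks. It provides optimized implementations of operations such as convolution, pooling, and normalization, which substantially speeds up deep neural networks on the GPU. cuDNN runs on top of CUDA, so install the matching CUDA version before installing cuDNN.
https://developer.nvidia.com/rdp/cudnn-download
You need to register an account and log in first, then download from here:
https://developer.nvidia.com/rdp/cudnn-archive
Choose the version that matches your CUDA release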
┌──[root@test]-[~]
└─$ls cudnn*
cudnn-local-repo-ubuntu2204-8.8.1.3_1.0-1_amd64.deb
sudo dpkg -i cudnn-local-repo-ubuntu2204-8.8.1.3_1.0-1_amd64.deb
sudo cp /var/cudnn-local-repo-*/cudnn-local-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get install libcudnn8=8.8.1.3-1+cuda1
sudo apt-get install libcudnn8-dev=8.8.1.3-1+cuda1
sudo apt-get install libcudnn8-samples=8.8.1.3-1+cuda1
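One way to confirm that the cuDNN runtime is visible to the dynamic loader is to open it with `ctypes` and call `cudnnGetVersion`. The sketch below assumes the shared object is named `libcudnn.so.8` (what the libcudnn8 package is expected to install) and that the cuDNN 8.x integer encoding is major*1000 + minor*100 + patch. The helper names are hypothetical.

```python
import ctypes
from typing import Optional, Tuple


def decode_cudnn_version(version: int) -> Tuple[int, int, int]:
    """Split a cuDNN 8.x version integer into (major, minor, patch)."""
    major, rest = divmod(version, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def load_cudnn_version(soname: str = "libcudnn.so.8") -> Optional[int]:
    """Return cudnnGetVersion() from the shared library, or None if it cannot be loaded."""
    try:
        lib = ctypes.CDLL(soname)
    except OSError:
        return None
    lib.cudnnGetVersion.restype = ctypes.c_size_t
    return int(lib.cudnnGetVersion())


if __name__ == "__main__":
    raw = load_cudnn_version()
    if raw is None:
        print("libcudnn.so.8 could not be loaded")
    else:
        print("cuDNN version:", decode_cudnn_version(raw))
```

For the 8.8.1 packages installed above, the decoded tuple should be (8, 8, 1).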
Confirm that the installation succeeded
┌──[root@test]-[~]
└─$nvcc -V && nvidia-smi
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Fri_Dec_17_18:16:03_PST_2021
Cuda compilation tools, release 11.6, V11.6.55
Build cuda_11.6.r11.6/compiler.30794723_0
Thu Jun 15 14:42:58 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03 Driver Version: 510.108.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| 0% 51C P8 21W / 170W | 105MiB / 12288MiB | 12% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1386 G /usr/lib/xorg/Xorg 81MiB |
| 0 N/A N/A 1650 G /usr/bin/gnome-shell 22MiB |
+-----------------------------------------------------------------------------+
┌──[root@test]-[~]
└─$
Write a test script to exercise the GPU
(py39) test@test:~/code/Face$ cat cuda_vim.py
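import numpy as np
import time
from numba import cuda

@cuda.jit
def increment_kernel(array):
    # One thread per element; the guard skips threads past the end of the array
    idx = cuda.grid(1)
    if idx < array.size:
        array[idx] += 1

def main():
    n = 1000000000
    a = np.zeros(n, dtype=np.int32)
    threads_per_block = 1024
    # Round up so every element is covered by a thread
    blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
    start = time.time()
    # Passing a host array makes Numba copy it to the GPU and back
    increment_kernel[blocks_per_grid, threads_per_block](a)
    end = time.time()
    print("Time taken: ", end - start)

if __name__ == "__main__":
    # Loop forever so the GPU load stays visible in nvidia-smi
    while True:
        main()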
(py39) test@test:~/code/Face$
While the script runs, watch nvidia-smi in a second terminal shows the python process on the GPU:
Every 2.0s: nvidia-smi test: Thu Jun 15 14:44:47 2023
Thu Jun 15 14:44:47 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03 Driver Version: 510.108.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| 0% 55C P2 51W / 170W | 4025MiB / 12288MiB | 22% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1386 G /usr/lib/xorg/Xorg 81MiB |
| 0 N/A N/A 1650 G /usr/bin/gnome-shell 22MiB |
| 0 N/A N/A 32031 C python 3917MiB |
+-----------------------------------------------------------------------------+
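The launch configuration and the memory figure can be checked with plain arithmetic. The sketch below reuses the values from the test script (one billion int32 elements, 1024 threads per block); the function names are illustrative.

```python
import math


def blocks_needed(n: int, threads_per_block: int) -> int:
    """Ceiling division: the smallest block count that covers n elements."""
    return (n + threads_per_block - 1) // threads_per_block


def array_mib(n: int, bytes_per_element: int) -> float:
    """Size of the array in MiB."""
    return n * bytes_per_element / 2**20


if __name__ == "__main__":
    n = 1_000_000_000
    print(blocks_needed(n, 1024))        # 976563
    print(round(array_mib(n, 4), 1))     # 3814.7
    # The integer formula agrees with the floating-point ceiling
    print(blocks_needed(n, 1024) == math.ceil(n / 1024))
```

The array alone accounts for about 3815 MiB of the 3917 MiB reported for the python process. The remainder is presumably CUDA context overhead.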
Problems encountered
Installing the newer 530 release
The newer release fails with the following error:
┌──[root@test]-[~]
└─$sudo ./cuda_12.1.1_530.30.02_linux.run
Error! Could not locate dkms.conf file.
File: /var/lib/dkms/nvidia-fs/2.15.3/source/dkms.conf does not exist.
cat: /var/log/nvidia/.uninstallManifests/kernelobjects-components/uninstallManifest-nvidia_fs: No such file or directory
make: *** No rule to make target 'uninstall'. Stop.
Error! DKMS tree already contains: nvidia-fs-2.15.3
You cannot add the same module/version combo more than once.
===========
= Summary =
===========
Driver: Not Selected
Toolkit: Installed in /usr/local/cuda-12.1/
Please make sure that
- PATH includes /usr/local/cuda-12.1/bin
- LD_LIBRARY_PATH includes /usr/local/cuda-12.1/lib64, or, add /usr/local/cuda-12.1/lib64 to /etc/ld.so.conf and run ldconfig as root
To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-12.1/bin
To uninstall the kernel objects, run ko-uninstaller in /usr/local/kernelobjects/bin
***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 530.00 is required for CUDA 12.1 functionality to work.
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
sudo <CudaInstaller>.run --silent --driver
Logfile is /var/log/cuda-installer.log
┌──[root@test]-[~]
└─$
Workaround: switch to the older 510 release.
nvvp fails to start
┌──[root@test]-[~]
└─$nvvp
Nvvp: Cannot open display:
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.eclipse.osgi.storage.FrameworkExtensionInstaller (file:/usr/local/cuda-11.6/libnvvp/plugins/org.eclipse.osgi_3.10.1.v20140909-1633.jar) to method java.net.URLClassLoader.addURL(java.net.URL)
WARNING: Please consider reporting this to the maintainers of org.eclipse.osgi.storage.FrameworkExtensionInstaller
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Nvvp: Cannot open display:
Nvvp:
An error has occurred. See the log file
/usr/local/cuda-11.6/libnvvp/configuration/1686795694122.log.
┌──[root@test]-[~]
└─$
This does not work in a plain SSH session; nvvp needs a desktop (graphical) environment.
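The empty value after "Cannot open display:" suggests that the DISPLAY variable is not set in the session. A small sketch of that check follows; the helper name `has_display` is made up. SSH X11 forwarding (`ssh -X`) would be one way to get DISPLAY set remotely, but it was not tried here.

```python
import os
from typing import Mapping


def has_display(env: Mapping[str, str]) -> bool:
    """True if the environment advertises an X11 display."""
    return bool(env.get("DISPLAY"))


if __name__ == "__main__":
    if has_display(os.environ):
        print("DISPLAY is set to", os.environ["DISPLAY"])
    else:
        print("DISPLAY is not set; GUI tools such as nvvp cannot open a window")
```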
Running it in a desktop environment reports:
Gtk-Message: 09:10:26.571: Failed to load module "canberra-gtk-module"
Install the following package:
┌──[root@test]-[~]
└─$sudo apt-get install libcanberra-gtk-module
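Installing nvidia-driver-XXX-open fails
nvidia-driver-530-open and nvidia-driver-530 package the same driver branch with different kernel modules. The -open package uses NVIDIA's open-source GPU kernel modules, while the package without the suffix uses the proprietary kernel modules. The user-space components are proprietary in both, which is why both are tagged non-free.
The distro and third-party labels in the ubuntu-drivers output describe which repository provides a package, not how well it is supported. In the listing above, nvidia-driver-530-open comes from the distribution's repository and nvidia-driver-530 from a third-party repository.
Installing nvidia-driver-530-open failed with unmet dependencies: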
nvidia-driver-530-open : Depends: libnvidia-gl-530 (= 530.41.03-0ubuntu0.22.04.2) but it is not going to be installed
Depends: nvidia-dkms-530-open (<= 530.41.03-1)
Depends: nvidia-dkms-530-open (>= 530.41.03)
Depends: nvidia-kernel-common-530 (<= 530.41.03-1) but it is not going to be installed
Depends: nvidia-kernel-common-530 (>= 530.41.03) but it is not going to be installed
Depends: nvidia-kernel-source-530-open (= 530.41.03-0ubuntu0.22.04.2) but it is not going to be installed
Depends: libnvidia-compute-530 (= 530.41.03-0ubuntu0.22.04.2) but it is not going to be installed
Depends: libnvidia-extra-530 (= 530.41.03-0ubuntu0.22.04.2) but it is not going to be installed
Depends: nvidia-compute-utils-530 (= 530.41.03-0ubuntu0.22.04.2) but it is not going to be installed
Depends: libnvidia-decode-530 (= 530.41.03-0ubuntu0.22.04.2) but it is not going to be installed
Depends: libnvidia-encode-530 (= 530.41.03-0ubuntu0.22.04.2) but it is not going to be installed
Depends: nvidia-utils-530 (= 530.41.03-0ubuntu0.22.04.2) but it is not going to be installed
Depends: xserver-xorg-video-nvidia-530 (= 530.41.03-0ubuntu0.22.04.2) but it is not going to be installed
Depends: libnvidia-cfg1-530 (= 530.41.03-0ubuntu0.22.04.2) but it is not going to be installed
Depends: libnvidia-fbc1-530 (= 530.41.03-0ubuntu0.22.04.2) but it is not going to be installed
Recommends: libnvidia-compute-530:i386 (= 530.41.03-0ubuntu0.22.04.2)
Recommends: libnvidia-decode-530:i386 (= 530.41.03-0ubuntu0.22.04.2)
Recommends: libnvidia-encode-530:i386 (= 530.41.03-0ubuntu0.22.04.2)
Recommends: libnvidia-fbc1-530:i386 (= 530.41.03-0ubuntu0.22.04.2)
Recommends: libnvidia-gl-530:i386 (= 530.41.03-0ubuntu0.22.04.2)
E: Unable to correct problems, you have held broken packages.
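Workaround: the steps below were tried but did not fix the problem. Switching to the version without the -open suffix did.
# Update the package lists and the installed packages:
sudo apt update
sudo apt upgrade
# Try to repair any broken packages:
sudo apt --fix-broken install
# Clear the cache of downloaded package files:
sudo apt clean
# Remove the broken package and reinstall it:
sudo apt remove nvidia-driver-530-open
sudo apt autoremove
sudo apt install nvidia-driver-530-open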