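To check whether a Linux server has a GPU, use the lspci command.
 
PCI (Peripheral Component Interconnect) is a standard for connecting peripheral devices.
A motherboard has many PCI slots for graphics cards, network cards, sound cards, and other peripherals, and the lspci command lists every device attached to them.
 
# Install the lspci command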
 
yum install -y pciutils
 
1. View graphics card information on Linux:
 
lspci | grep -i vga
 
2. For an NVIDIA GPU, you can also filter by vendor name:
 
lspci | grep -i nvidia
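
The `vga` filter can miss cards that do not register as VGA devices; NVIDIA data-center GPUs, for example, usually appear in lspci output as a `3D controller`. The following is a minimal sketch that matches the common display-class names. The function name `list_gpu_devices` is made up for illustration.

```shell
# list_gpu_devices: keep only lspci lines whose device class looks like a GPU.
# Reads lspci-style text on stdin; the class names below are the ones
# lspci commonly prints for display-class devices.
list_gpu_devices() {
    grep -Ei 'vga compatible controller|3d controller|display controller'
}

# Query the real PCI bus only when lspci is available.
if command -v lspci >/dev/null 2>&1; then
    lspci | list_gpu_devices || echo "no GPU-class PCI device found"
fi
```

Because the function reads from stdin, it can also be run against saved lspci output from another machine.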
 
Download the driver that matches your GPU model.
 
 
Install the driver:
 
chmod +x NVIDIA-Linux-x86_64-470.57.02.run
 
sh NVIDIA-Linux-x86_64-470.57.02.run
 
Alternatively, install the CUDA toolkit directly; its installer bundles the GPU driver.
 
Installing the driver on CentOS 7 fails with the following error:
 
ERROR: The Nouveau kernel driver is currently in use by your system.  This driver is incompatible with the NVIDIA driver, and must be disabled before
         proceeding.  Please consult the NVIDIA driver README and your Linux distribution's documentation for details on how to correctly disable the
         Nouveau kernel driver.

WARNING: One or more modprobe configuration files to disable Nouveau are already present at:
           /usr/lib/modprobe.d/nvidia-installer-disable-nouveau.conf, /etc/modprobe.d/nvidia-installer-disable-nouveau.conf.  Please be sure you have
           rebooted your system since these files were written.  If you have rebooted, then Nouveau may be enabled for other reasons, such as being
           included in the system initial ramdisk or in your X configuration file.  Please consult the NVIDIA driver README and your Linux
           distribution's documentation for details on how to correctly disable the Nouveau kernel driver

ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation
         problems in the README available on the Linux driver download page at www.nvidia.com.
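The installer reports that the nouveau driver shipped with the system must be disabled. Use the commands below to disable it.
 
Disable it temporarily (the change is lost on reboot):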
modprobe -r nouveau
 
To make the change persist across reboots, do the following:
 
1. Add a blacklist entry
cp /etc/modprobe.d/nvidia-installer-disable-nouveau.conf  /etc/modprobe.d/blacklist.conf
 
2. Regenerate the initramfs image
 
mv /boot/initramfs-$(uname -r).img  /boot/initramfs-$(uname -r).img.bak
 
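dracut creates an initramfs image for the kernel with the given version. If <kernel version> is omitted, the version of the running kernel is used. If <image> is omitted or empty, the default location /boot/initramfs-<kernel version>.img is used.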
 
dracut -v /boot/initramfs-$(uname -r).img  $(uname -r)
 
3. Reboot for the change to take effect
reboot
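
After the reboot, it may be worth confirming that nouveau is really gone before running the installer again. A small sketch, assuming lsmod prints its usual column layout with the module name first; `nouveau_is_loaded` is a hypothetical helper name.

```shell
# nouveau_is_loaded: read lsmod-style text on stdin and succeed (exit 0)
# if a module named exactly "nouveau" is listed in the first column.
nouveau_is_loaded() {
    awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}

# Check the running system only when lsmod is available.
if command -v lsmod >/dev/null 2>&1; then
    if lsmod | nouveau_is_loaded; then
        echo "nouveau is still loaded"
    else
        echo "nouveau is not loaded"
    fi
fi
```

If nouveau is still listed, the warning quoted earlier suggests checking whether it is included in the initial ramdisk or enabled in the X configuration.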
 
Then run the installer again. It fails with:
ERROR: Unable to find the development tool `cc` in your path; please make sure that you have the package 'gcc' installed.  If gcc is installed on
         your system, then please check that `cc` is in your PATH.

ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation
         problems in the README available on the Linux driver download page at www.nvidia.com.
The error indicates that gcc is missing.
 
Install gcc:
 
yum -y install gcc gcc-c++
 
Running the installer again still fails:
ERROR: Unable to find the kernel source tree for the currently running kernel.  Please make sure you have installed the kernel source files for your
         kernel and that they are properly configured; on Red Hat Linux systems, for example, be sure you have the 'kernel-source' or 'kernel-devel'
         RPM installed.  If you know the correct kernel source files are installed, you may specify the kernel source path with the
         '--kernel-source-path' command line option.
Install kernel-devel:
 
yum -y install kernel-devel
 
Then run the installer again. The same error appears:

ERROR: Unable to find the kernel source tree for the currently running kernel.  Please make sure you have installed the kernel source files for your
         kernel and that they are properly configured; on Red Hat Linux systems, for example, be sure you have the 'kernel-source' or 'kernel-devel'
         RPM installed.  If you know the correct kernel source files are installed, you may specify the kernel source path with the
         '--kernel-source-path' command line option.
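
One likely reason the error persists, judging by the paths used in this walkthrough, is that `yum install kernel-devel` pulled in headers for a newer kernel release (3.10.0-1160.41.1) than the one currently running (3.10.0-1160). The sketch below checks for that mismatch; `kernel_devel_matches` is a hypothetical helper, and /usr/src/kernels is assumed to be where kernel-devel places its trees, as on CentOS 7.

```shell
# kernel_devel_matches: succeed if a kernel-devel tree for the given kernel
# release exists under the given base directory (default: /usr/src/kernels).
kernel_devel_matches() {
    release=$1
    base=${2:-/usr/src/kernels}
    [ -d "$base/$release" ]
}

running=$(uname -r)
if kernel_devel_matches "$running"; then
    echo "kernel-devel matches running kernel $running"
else
    echo "no kernel-devel tree for $running; installed trees:"
    ls /usr/src/kernels 2>/dev/null || true
    # Possible fix (assumes the repositories still carry this exact version):
    #   yum -y install "kernel-devel-$(uname -r)"
fi
```

When the versions differ, the alternatives are to install the matching kernel-devel package, to update the kernel and reboot into it, or to point the installer at the existing tree explicitly.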

Specify the kernel source path and run the installer again:
 
./NVIDIA-Linux-x86_64-470.57.02.run --kernel-source-path=/usr/src/kernels/3.10.0-1160.41.1.el7.x86_64/
The installer output is as follows:
WARNING: nvidia-installer was forced to guess the X library path '/usr/lib64' and X module path '/usr/lib64/xorg/modules'; these paths were not
           queryable from the system.  If X fails to find the NVIDIA X driver module, please install the `pkg-config` utility and the X.Org
           SDK/development package for your distribution and reinstall the driver.

Install NVIDIA's 32-bit compatibility libraries?

                                                  Yes                                               No

WARNING: Unable to determine the path to install the libglvnd EGL vendor library config files. Check that you have pkg-config and the libglvnd
           development libraries installed, or specify a path with --glvnd-egl-config-path.

The warnings above are produced when installing in text mode and can be ignored.

Installation of the kernel module for the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version 470.57.02) is now complete.
Installation complete.
 
Reboot the CentOS 7 system.
 
After rebooting, run nvidia-smi to check whether the driver was installed successfully.
 
It fails with:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
To recover:
 
step1: Install dkms
 
yum -y install epel-release
yum -y install dkms 
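DKMS stands for Dynamic Kernel Module Support. It maintains drivers that live outside the kernel tree and can rebuild their modules automatically after the kernel version changes. Before using dkms, make sure it is installed on the system.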
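
Once dkms is installed, `dkms status` lists the modules it knows about and their state. The sketch below wraps that check. The exact line format differs between dkms releases, so the pattern is an assumption, and `dkms_knows_module` is a made-up helper name.

```shell
# dkms_knows_module: read `dkms status` output on stdin and succeed if any
# line starts with the given module name followed by "," or "/".
dkms_knows_module() {
    grep -Eq "^$1[,/]"
}

# Query dkms only when it is installed.
if command -v dkms >/dev/null 2>&1; then
    if dkms status | dkms_knows_module nvidia; then
        echo "dkms has an nvidia module registered"
    else
        echo "dkms has no nvidia module registered"
    fi
fi
```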
 
step2: Look up the driver version
 
ls /usr/src|grep nvidia
nvidia-470.57.02
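
The version can also be extracted from the directory name instead of being copied by hand. A minimal sketch, assuming the driver sources sit in a directory named nvidia-<version> under /usr/src as shown above; `nvidia_src_versions` is a hypothetical helper.

```shell
# nvidia_src_versions: print the version part of every nvidia-<version>
# entry found under the given base directory (default: /usr/src).
nvidia_src_versions() {
    base=${1:-/usr/src}
    ls "$base" 2>/dev/null | sed -n 's/^nvidia-//p'
}

# Take the first version found and show the dkms command it would feed.
ver=$(nvidia_src_versions | head -n 1)
if [ -n "$ver" ]; then
    echo "would run: dkms install -m nvidia -v $ver"
else
    echo "no nvidia-* source directory found under /usr/src"
fi
```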
 
step3: Rebuild the nvidia driver module
 
dkms install -m nvidia -v 470.57.02
 
It fails with:
Error! echo
Your kernel headers for kernel 3.10.0-1160.el7.x86_64 cannot be found at
/lib/modules/3.10.0-1160.el7.x86_64/build or /lib/modules/3.10.0-1160.el7.x86_64/source.
You can use the --kernelsourcedir option to tell DKMS where it's located.

Specify the kernel source directory with --kernelsourcedir and run the command again:

dkms install -m nvidia -v 470.57.02 --kernelsourcedir=/usr/src/kernels/3.10.0-1160.41.1.el7.x86_64

It fails with:
 
Error! Bad return status for module build on kernel: 3.10.0-1160.el7.x86_64 (x86_64)
Consult /var/lib/dkms/nvidia/470.57.02/build/make.log for more information.
Check the log:

tail -100 /var/lib/dkms/nvidia/470.57.02/build/make.log
 
DKMS make.log for nvidia-470.57.02 for kernel 3.10.0-1160.el7.x86_64 (x86_64)
Wed Sep  8 16:40:36 CST 2021
make: *** /lib/modules/3.10.0-1160.el7.x86_64/build: No such file or directory.  Stop.
make: *** [modules] Error 2
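The log shows that /lib/modules/3.10.0-1160.el7.x86_64/build does not exist. Recreate the symlink so that it points to the installed kernel-devel tree, then run the dkms install command from step3 again:

cd /lib/modules/3.10.0-1160.el7.x86_64
rm -rf  build
ln -s /usr/src/kernels/3.10.0-1160.41.1.el7.x86_64 build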
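
A quick way to confirm that the build symlink resolves before retrying the module build is sketched below. `build_link_ok` is a hypothetical helper, and `readlink -f` assumes GNU coreutils, which CentOS 7 provides.

```shell
# build_link_ok: succeed when <modules dir>/build resolves to a directory.
# A missing or dangling symlink makes the test fail.
build_link_ok() {
    [ -d "$1/build/" ]
}

moddir="/lib/modules/$(uname -r)"
if build_link_ok "$moddir"; then
    echo "build -> $(readlink -f "$moddir/build")"
else
    echo "$moddir/build is missing or dangling"
fi
```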
 
Verify that the installation succeeded:
 
nvidia-smi
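
For a scripted check, nvidia-smi can print selected fields as CSV. The sketch below compares the reported driver version with the one installed in this walkthrough; `report_driver` and `driver_matches` are made-up helper names, and the version is treated as the last comma-separated field on each line.

```shell
# report_driver: print GPU name and driver version, one GPU per line.
report_driver() {
    nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
}

# driver_matches: read that CSV on stdin and succeed only if there is at
# least one line and every line ends with the expected driver version.
driver_matches() {
    awk -F', ' -v want="$1" '$NF != want { bad = 1 } END { exit (NR == 0 || bad) }'
}

# Query the GPU only when nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
    if report_driver | driver_matches 470.57.02; then
        echo "driver 470.57.02 is active"
    else
        echo "driver version differs or nvidia-smi failed"
    fi
fi
```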
 