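Checking whether the GPU build of torch is installed

Reposted

mob64ca14163a4f 2024-12-11 11:43:44

Tags: torch GPU build install check, deep learning, artificial intelligence, server, CUDA. Category: game development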

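1. Online install: first create the PyTorch environment: conda create -n ljj_torch112 python=3.8

Check what the local machine has. Look at your CUDA version first (the most authoritative check is nvcc --version).

The machine had CUDA 10.0, which was not a good match, so I switched to CUDA 10.2, which is more commonly used.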
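The CUDA version check can also be scripted. The sketch below is a minimal illustration, not part of the original post: it runs nvcc --version and pulls out the release number. The helper names parse_nvcc_release and local_cuda_toolkit_version are hypothetical. Note that nvcc reports the installed toolkit, while nvidia-smi reports the highest CUDA version the driver supports, so the two can differ.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_nvcc_release(output: str) -> Optional[str]:
    """Extract the 'release X.Y' number from `nvcc --version` output."""
    match = re.search(r"release\s+(\d+\.\d+)", output)
    return match.group(1) if match else None


def local_cuda_toolkit_version() -> Optional[str]:
    """Return the toolkit version reported by nvcc, or None if nvcc is not on PATH."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


print("CUDA toolkit (nvcc):", local_cuda_toolkit_version())
```

If this prints None, nvcc is either not installed or not on PATH, which is common on machines that only have the driver.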

Create the PyTorch environment with the conda create command from step 1, then activate it: conda activate ljj_torch112

1.2 Install PyTorch (bundled with CUDA 10.2)

I want version 1.12, not the newer 2.0, so I looked up the previous PyTorch versions here:

Start Locally | PyTorch

conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=10.2 -c pytorch

1.2.1 Check the versions of torch, CUDA, and so on
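The original post showed this check only as a screenshot. A minimal sketch of what such a check might look like follows; the function name torch_gpu_report is hypothetical, while torch.__version__, torch.version.cuda, torch.cuda.is_available(), torch.cuda.device_count() and torch.cuda.get_device_name() are standard PyTorch attributes and calls.

```python
def torch_gpu_report() -> dict:
    """Collect the facts that show whether the GPU build of torch is usable."""
    try:
        import torch
    except (ImportError, OSError):
        # OSError covers a missing CUDA shared library at import time
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # None here means a CPU-only build was installed
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["device_count"] = torch.cuda.device_count()
        report["device_0"] = torch.cuda.get_device_name(0)
    return report


for key, value in torch_gpu_report().items():
    print(f"{key}: {value}")
```

For the install command above, one would expect built_with_cuda to read 10.2. If cuda_available is False while built_with_cuda is set, the driver or the visible devices are the more likely problem than the torch package itself.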

1.3 Install DGL

Deep Graph Library (dgl.ai)

Linux 64 :: Anaconda.org

conda install -c dglteam/label/cu102 dgl

Install the other required packages:

#conda install cudatoolkit
pip install scipy
pip install normflows==1.4
pip install tensorboardx==2.5.1
pip install tqdm
pip install torchtext==0.4.0
pip install scikit-learn
pip install pandas
pip install wandb
# Log the terminal session to a file
script -f a.log
# Some of the code needs apex, so install apex with the steps below (the commands can be copied and pasted):
sudo apt install git

Method 1: git clone https://github.com/NVIDIA/apex.git (I used git; where git does not work on a school server, use the wget method below)
Summary: I tried method 1 and it works. It can be used on the school's 8×T4 server.

(Reference: "Installing nvidia/apex on Linux", Zhihu (zhihu.com))

Method 2: if git clone does not work, download apex with wget instead:

wget https://codeload.github.com/NVIDIA/apex/zip/refs/heads/master -O master.zip
# Unzip
unzip master.zip -d /mnt/hdd1/ljj/apex
(Reference: ChatGPT)
cd /mnt/hdd1/allusers/ljj/4other/apex/apex-master
python3 setup.py install
Source: "Installing nvidia/apex on Linux", Zhihu (zhihu.com)

Summary: I tried method 2 and it works. It can be used on the school's 4×T4 server.

Check whether apex was installed successfully: python -c "import apex"

Reference: "apex installation methods", 51CTO blog
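The apex check can be done without actually importing the package. The sketch below is an illustration under that assumption; is_importable is a hypothetical helper built on the standard library's importlib.util.find_spec.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if Python can locate the top-level module without importing it."""
    return importlib.util.find_spec(module_name) is not None


print("apex importable:", is_importable("apex"))
```

Run this from a directory other than the unpacked apex source tree. Otherwise the local apex folder may be picked up from the current directory and the check can report True even though nothing was installed into the environment.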

If the code needs torchscale, install it as well:
pip install torchscale
pip install torchtext==0.4.0

1.3.1 Use conda list to check whether DGL was installed successfully
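As a complement to conda list, the installed versions can be read from inside Python. This is a rough sketch, not the author's method: installed_versions is a hypothetical helper, and the package names are taken from the install commands in this post. It assumes each package registers metadata under exactly that name, which may not hold for every conda build (for example a CUDA-specific DGL package), so a NOT INSTALLED line is a prompt to double-check with conda list rather than a verdict.

```python
from importlib import metadata

# Distribution names taken from the install commands in this post
PACKAGES = ["dgl", "scipy", "normflows", "tensorboardx", "tqdm",
            "torchtext", "scikit-learn", "pandas", "wandb"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if not found."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name:15s} {version or 'NOT INSTALLED'}")
```

Comparing the printed versions with the pinned ones (normflows 1.4, tensorboardx 2.5.1, torchtext 0.4.0) is a quick way to spot a package that pip silently upgraded.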

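2. Offline install by uploading files: installing the CUDA build of PyTorch + DGL on the school server (not recommended, because I never got this method to work)

2.1 Upload the torch and dgl files (the red box in the screenshot marked where dgl is placed; dgl then does not need pip install)

(Screenshot: the location on the 8×T4 server)

(Screenshot: the location on the 4×T4 server)

2.2 Install torch from the uploaded package (dgl does not need pip install)

After that the code can be run.

Without the commands below, however, it still fails: because the packages were not installed through pip, a number of dependencies are missing and have to be added by hand: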

conda install cudatoolkit
pip install scipy
pip install normflows==1.4
pip install tensorboardx==2.5.1
pip install tqdm

# Log the terminal session to a file
script -f a.log


# Command for the Wiki dataset
CUDA_VISIBLE_DEVICES="1"  python main.py --dataset Wiki-One --data_path ./Wiki --few 5 --data_form Pre-Train --prefix np_rgcn_attn_planar_wiki_5shot_intrain_g_batch_1024_eval_8 --device 1 --batch_size 32 --flow Planar -dim 50 --g_batch 1024 --eval_batch 8 --eval_epoch 2000

Details (the commands above already cover this, so this part can be skipped)

Running the code failed with: OSError: libcublas.so.11: cannot open shared object file: No such file or directory

(Reference: "OSError: libcublas.so.11: cannot open shared object file: No such file or directory [error on import onnx]", 墨理学AI's blog, CSDN)

Fix: conda install cudatoolkit

If that does not work, try another method (the original post showed it only as screenshots).
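To see which cuBLAS library an environment actually contains, one option is to list the matching files under its lib directory. The sketch below is an assumption-laden illustration: find_shared_libs is a hypothetical helper, and it assumes conda places CUDA libraries in <environment>/lib, which is the usual layout on Linux. The number after .so is the CUDA major version, so a torch build that asks for libcublas.so.11 expects a CUDA 11.x runtime, whereas cudatoolkit 10.2 provides libcublas.so.10.

```python
import os
import sys
from pathlib import Path
from typing import List, Optional


def find_shared_libs(pattern: str, env_prefix: Optional[str] = None) -> List[str]:
    """List files matching pattern in <env_prefix>/lib (defaults to the active environment)."""
    prefix = Path(env_prefix or os.environ.get("CONDA_PREFIX", sys.prefix))
    return sorted(str(p) for p in (prefix / "lib").glob(pattern))


matches = find_shared_libs("libcublas.so*")
print("\n".join(matches) if matches else "no libcublas found in this environment")
```

If the list contains only a version other than the one named in the error, the installed torch package and the installed cudatoolkit were probably built for different CUDA versions.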

——————————————————————————————————————————

The next run failed with this error:

RuntimeError: CUDA out of memory. Tried to allocate 1.25 GiB (GPU 0; 14.76 GiB total capacity; 10.37 GiB already allocated; 1.14 GiB free; 12.74 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

2.1 Find the cause:

import torch

# ... code before the problematic line ...

# The line that raised the CUDA out-of-memory error
# (edges is presumably the DGL edge batch passed to a message function)
x = torch.cat([edges.src['h'], edges.data['feat'], edges.dst['feat']], dim=1)

# Return cached but unused blocks to the GPU; tensors that are still referenced are not freed
torch.cuda.empty_cache()

# ... rest of the code ...
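The error message itself points at PYTORCH_CUDA_ALLOC_CONF and max_split_size_mb. A minimal sketch of trying that suggestion and printing the allocated and reserved figures follows. The value 128 is purely illustrative, not a recommendation from the original post, and gib is a hypothetical helper; torch.cuda.memory_allocated and torch.cuda.memory_reserved are standard PyTorch calls.

```python
import os

# Must be set before the first CUDA allocation; 128 is an illustrative value only
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")


def gib(num_bytes: int) -> float:
    """Convert bytes to GiB, the unit used in the error message."""
    return num_bytes / 1024 ** 3


try:
    import torch
except (ImportError, OSError):
    torch = None  # torch missing or its CUDA libraries failed to load

if torch is not None and torch.cuda.is_available():
    print(f"allocated: {gib(torch.cuda.memory_allocated(0)):.2f} GiB")
    print(f"reserved:  {gib(torch.cuda.memory_reserved(0)):.2f} GiB")
```

This setting only helps with fragmentation, that is, when reserved memory is much larger than allocated memory. In the error above, 10.37 GiB of a 14.76 GiB card was already allocated, so lowering --batch_size, --g_batch or --eval_batch in the run command is likely the more effective change.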

3. Using wandb on the server:

(Reference: "Prerequisites for using wandb: sign up and log in", 无脑敲代码，bug漫天飞's blog, CSDN)

This part covers using wandb locally (see the linked material):

Trying it out on my own computer

The parts added to my own code, in total:

1. Import wandb and initialize it with init

import wandb
wandb.init(project="my-project")
wandb.watch_called = False  # Re-run the model without restarting the runtime, unnecessary after our next release

2. Inside the test function:

wandb.log({ "Examples": example_images, "Test Accuracy": 100. * correct / len(test_loader.dataset), "Test Loss": test_loss })

3. config

config = wandb.config  # Initialize config
config.batch_size = 4  # input batch size for training (default:64)
config.test_batch_size = 10  # input batch size for testing(default:1000)
config.epochs = 50  # number of epochs to train(default:10)
config.lr = 0.1  # learning rate(default:0.01)
config.momentum = 0.1  # SGD momentum(default:0.5)
config.no_cuda = False  # disables CUDA training
config.seed = 42  # random seed(default:42)
config.log_interval = 10  # how many batches to wait before logging training status

4. Training results

wandb.watch(model, log="all")
for epoch in range(1, config.epochs + 1):
    train(config, model, device, train_loader, optimizer, epoch)
    test(config, model, device, test_loader, classes)
torch.save(model.state_dict(), 'model.h5')
wandb.save('model.h5')
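The four pieces above can be put together in one small script. The sketch below is a hedged illustration rather than the author's actual training code: build_config and run are hypothetical names, the logged loss is a placeholder, and mode="offline" is an assumption chosen for servers without reliable internet access (wandb.init accepts a mode argument, and offline runs are stored locally).

```python
def build_config() -> dict:
    """Hyperparameters copied from the config step in this post."""
    return {
        "batch_size": 4, "test_batch_size": 10, "epochs": 50, "lr": 0.1,
        "momentum": 0.1, "no_cuda": False, "seed": 42, "log_interval": 10,
    }


def run(mode: str = "offline") -> None:
    """Start a run, log one placeholder metric per epoch, then finish."""
    import wandb  # imported here so the file still loads without wandb installed

    run_handle = wandb.init(project="my-project", config=build_config(), mode=mode)
    for epoch in range(1, wandb.config.epochs + 1):
        # Placeholder value; a real script would log the loss computed in test()
        wandb.log({"epoch": epoch, "Test Loss": 1.0 / epoch})
    run_handle.finish()


print(build_config())
```

Calling run() writes the run under a local wandb directory, which can be uploaded later with wandb sync once the machine is online. Passing the hyperparameters through config= in wandb.init is an alternative to assigning them one by one on wandb.config.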