Preface
I. Installing the edgeai-torchvision environment
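First, note that torch must be installed in the virtual environment before torchvision, and that torchvision has to be built from source, because "the standard torchvision will not support all the features in this repository." My system CUDA version is 11.7, but edgeai-torchvision currently supports only up to CUDA 11.3, so I installed the PyTorch and torchvision versions built for CUDA 11.3. Following setup.sh, that means pytorch 1.10.0 and torchvision 0.11.0; the other dependencies only need to be at versions that work.
However, this fails with: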
RuntimeError: Detected that PyTorch and torchvision were compiled with different CUDA versions. PyTorch has CUDA Version=11.3 and torchvision has CUDA Version=11.7. Please reinstall the torchvision that matches your PyTorch install.
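I tried several approaches and all of them failed. After reading the setup.py code closely, I realized that the source build of torchvision does not link against the CUDA in the virtual environment but against the system CUDA version.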
edgeai-torchvision/torchvision/extension.py
def _check_cuda_version():
    """
    Make sure that CUDA versions match between the pytorch install and torchvision install
    """
    if not _HAS_OPS:
        return -1
    import torch
    _version = torch.ops.torchvision._cuda_version()
    if _version != -1 and torch.version.cuda is not None:
        tv_version = str(_version)
        if int(tv_version) < 10000:
            tv_major = int(tv_version[0])
            tv_minor = int(tv_version[2])
        else:
            tv_major = int(tv_version[0:2])
            tv_minor = int(tv_version[3])
        t_version = torch.version.cuda
        t_version = t_version.split('.')
        t_major = int(t_version[0])
        t_minor = int(t_version[1])
        if t_major != tv_major or t_minor != tv_minor:
            raise RuntimeError("Detected that PyTorch and torchvision were compiled with different CUDA versions. "
                               "PyTorch has CUDA Version={}.{} and torchvision has CUDA Version={}.{}. "
                               "Please reinstall the torchvision that matches your PyTorch install."
                               .format(t_major, t_minor, tv_major, tv_minor))
    return _version
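The string indexing in the check above depends on how the CUDA version integer is laid out. As a rough sketch only, assuming the value follows the CUDA_VERSION convention of major * 1000 + minor * 10, the same major/minor pair can be recovered arithmetically. The helper name decode_cuda_version is hypothetical and is not part of torchvision.

```python
def decode_cuda_version(version: int):
    """Split an integer such as 11030 into (major, minor).

    Assumption: the value follows the CUDA_VERSION convention,
    i.e. major * 1000 + minor * 10.
    """
    if version < 0:
        # The check above treats -1 as "torchvision built without CUDA".
        raise ValueError("negative value: no CUDA version to decode")
    major = version // 1000
    minor = (version % 1000) // 10
    return major, minor


if __name__ == "__main__":
    for raw in (11030, 11070, 9020):
        print(raw, "->", decode_cuda_version(raw))
```

With this layout, a torchvision built against the system toolkit in this post would report 11070, while PyTorch reports 11.3, which is exactly the mismatch in the error message.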
/home/xxx/miniconda3/envs/edgeaitv/lib/python3.8/site-packages/torch/utils/cpp_extension.py
def _check_cuda_version(self):
    if CUDA_HOME:
        nvcc = os.path.join(CUDA_HOME, 'bin', 'nvcc')
        cuda_version_str = subprocess.check_output([nvcc, '--version']).strip().decode(*SUBPROCESS_DECODE_ARGS)
        cuda_version = re.search(r'release (\d+[.]\d+)', cuda_version_str)
        if cuda_version is not None:
            cuda_str_version = cuda_version.group(1)
            cuda_ver = packaging.version.parse(cuda_str_version)
            torch_cuda_version = packaging.version.parse(torch.version.cuda)
            if cuda_ver != torch_cuda_version:
                # major/minor attributes are only available in setuptools>=49.6.0
                if getattr(cuda_ver, "major", float("nan")) != getattr(torch_cuda_version, "major", float("nan")):
                    raise RuntimeError(CUDA_MISMATCH_MESSAGE.format(cuda_str_version, torch.version.cuda))
                warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
    else:
        raise RuntimeError(CUDA_NOT_FOUND_MESSAGE)
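The source code around the failure shows the root cause: when torchvision is built from source, the CUDA version is taken from nvcc under CUDA_HOME, so the CUDA version of the virtual environment must match the system CUDA version. The system currently has CUDA 11.7, but building edgeai-torchvision needs CUDA 11.3, and it must come from the system installation. CUDA 11.3 therefore has to be installed at the system level, in a way that still makes it easy to switch back to CUDA 11.7 later. For the installation steps, see my post "[Software/hardware environment and tool installation] NVIDIA driver/CUDA version relationships and CUDA installation".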
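Before rebuilding, it can help to confirm which toolkit version the build will see. The sketch below mirrors the idea of the check above (parse the "release X.Y" text printed by nvcc and compare it with the version PyTorch reports). The function names and the sample string are illustrative assumptions, not part of PyTorch, and the sample is not a captured log.

```python
import re


def parse_nvcc_release(nvcc_output: str):
    """Extract (major, minor) from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def toolkit_matches_torch(nvcc_output: str, torch_cuda: str) -> bool:
    """True when nvcc's release equals the major.minor that PyTorch reports."""
    toolkit = parse_nvcc_release(nvcc_output)
    wanted = tuple(int(part) for part in torch_cuda.split(".")[:2])
    return toolkit == wanted


# Illustrative text in the style of `nvcc --version` output.
SAMPLE = "Cuda compilation tools, release 11.7, V11.7.99"

if __name__ == "__main__":
    print(parse_nvcc_release(SAMPLE))
    print(toolkit_matches_torch(SAMPLE, "11.3"))
```

In a real environment, nvcc_output would come from running the nvcc binary under CUDA_HOME (as in the check above) and torch_cuda from torch.version.cuda. If the two disagree, the torchvision build will be compiled against the wrong toolkit.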
Error 1:
raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'int'.
`np.int` was a deprecated alias for the builtin `int`. To avoid this error in existing code, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
This is a NumPy version issue. There are two ways to fix it:
1) numpy.int was deprecated in NumPy 1.20 and was removed in NumPy 1.24. You can change it to numpy.int_, or just int.
2) Install a NumPy release that still provides the alias:
pip3 install numpy==1.19
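If you prefer option 1 and the alias shows up in many places, the substitution can be done mechanically. This is a minimal sketch that rewrites source text. The helper name replace_np_int is hypothetical, and it assumes NumPy is referenced as np or numpy. Review the result before committing it.

```python
import re

# Matches `np.int` / `numpy.int` only as a whole name, so np.int64,
# np.int32, np.int_ and np.integer are left untouched.
_NP_INT = re.compile(r"\b(?:np|numpy)\.int\b")


def replace_np_int(source: str) -> str:
    """Rewrite the removed alias np.int to the builtin int in source text."""
    return _NP_INT.sub("int", source)


if __name__ == "__main__":
    print(replace_np_int("idx = x.astype(np.int)"))
    print(replace_np_int("y = np.int64(3) + np.int_(1)"))
```

The word-boundary anchors are the design choice that matters here: a plain string replace of "np.int" would also corrupt np.int64 and np.int32.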
Error 2:
packages/torch/utils/tensorboard/__init__.py", line 4, in <module>
LooseVersion = distutils.version.LooseVersion
AttributeError: module 'distutils' has no attribute 'version'
This is a setuptools version issue: the installed setuptools is too new.
# Use pip here, not `conda uninstall setuptools`: conda resolves the packages that depend on setuptools and removes them all, so answering y means reconfiguring the whole environment.
pip3 uninstall setuptools
pip3 install setuptools==59.5.0
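After pinning, a quick check of what is actually installed can save a failed build. The following is only a sketch: the PINS table simply restates the versions used in this post, and the helper names are hypothetical. The comparison is a prefix match (so 1.19 accepts any 1.19.x), which is looser than pip's == operator.

```python
from importlib import metadata

# Pins taken from this post; adjust them to your own environment.
PINS = {"numpy": "1.19", "setuptools": "59.5.0"}


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def satisfies_pin(installed, pin: str) -> bool:
    """True when the installed version starts with the pinned components."""
    if installed is None:
        return False
    pin_parts = pin.split(".")
    return installed.split(".")[: len(pin_parts)] == pin_parts


if __name__ == "__main__":
    for name, pin in PINS.items():
        found = installed_version(name)
        status = "ok" if satisfies_pin(found, pin) else "differs"
        print(f"{name}: installed={found} pinned={pin} -> {status}")
```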
II. Testing the environment
1. Image classification
Run the script directly:
sh run_edgeailite_classification.sh
Or run the command directly:
python ./references/edgeailite/scripts/train_classification_main.py --dataset_name cifar100_classification --model_name mobilenetv2_tv_x1 --data_path ./data/datasets/cifar100_classification --img_resize 32 --img_crop 32 --rand_scale 0.5 1.0
This fails with:
edgeai-torchvision/references/edgeailite/engine/train_classification.py", line 695, in validate
progress_bar.set_postfix(Epoch='{}'.format(status_str))
TypeError: set_postfix() missing 1 required positional argument: 'postfix'
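The cause is that the source calls the function incorrectly: set_postfix requires a positional postfix argument, but the call passes only a keyword. Change the call to: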
progress_bar.set_postfix('Epoch={}'.format(status_str))
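To see why the keyword form fails, here is a toy reproduction. PlainProgressBar is a hypothetical stand-in and not the edgeailite progress bar. The signature is an assumption chosen because it is consistent with the reported "missing 1 required positional argument: 'postfix'" message.

```python
class PlainProgressBar:
    """Hypothetical stand-in, not the edgeailite class: `postfix` is required."""

    def __init__(self):
        self.postfix = ""

    def set_postfix(self, postfix, **kwargs):
        # The text to display must arrive as the first positional argument.
        self.postfix = str(postfix)


bar = PlainProgressBar()
status_str = "3/10"

try:
    bar.set_postfix(Epoch="{}".format(status_str))  # keyword only: fails
    keyword_call_failed = False
except TypeError:
    keyword_call_failed = True

bar.set_postfix("Epoch={}".format(status_str))  # positional string: accepted
print(keyword_call_failed, bar.postfix)
```

The keyword form is the calling style some other progress bar libraries accept, which is probably how it ended up in the script.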
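The script trains the model first, then runs quantization training starting from the trained model, and finally validates to estimate the accuracy of the quantized result. Working through it gives a basic understanding of the implementation logic and overall flow of the classification pipeline.
Each stage produces three files: the trained PyTorch model file, the converted ONNX model file, and the TorchScript model file.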
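2. Semantic segmentation
Adjust the configuration parameters to match your software and hardware environment, then run the script: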
sh run_edgeailite_segmentation.sh
Error 1:
edgeai-torchvision/torchvision/edgeailite/xvision/datasets/cityscapes_plus.py", line 519, in cityscapes_segmentation
train_split = CityscapesDataLoader(dataset_config, root, split_name, gt, transforms=transforms[0],
TypeError: __init__() got an unexpected keyword argument 'annotation_prefix'
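Background on how keyword arguments are forwarded: the blog post "python *args和**kwargs详解" (惊瑟's blog).
Replace the failing call with one that does not pass the annotation_prefix argument, as in the earlier version of the code. That resolves the problem. For the earlier code, see the commit:
Modelmaker integration v1 · TexasInstruments/edgeai-torchvision@f108240
Use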
train_dataset, val_dataset = xvision.datasets.__dict__[args.dataset_name](args.dataset_config, args.data_path, split=split_arg, transforms=transforms)
in place of the original:
train_dataset, val_dataset = xvision.datasets.__dict__[args.dataset_name](args.dataset_config, args.data_path, split=split_arg, transforms=transforms, annotation_prefix=args.annotation_prefix)
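An alternative to editing the call site by hand is to drop any keyword the target does not declare. This is not the approach taken in this post, only a sketch of the idea. filter_kwargs and ToyLoader are hypothetical names, and the annotation_prefix value is a placeholder.

```python
import inspect


def filter_kwargs(func, kwargs):
    """Keep only the keyword arguments that `func` declares by name."""
    params = inspect.signature(func).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)  # func accepts **kwargs, nothing to drop
    return {k: v for k, v in kwargs.items() if k in params}


class ToyLoader:
    """Hypothetical dataset class whose __init__ predates annotation_prefix."""

    def __init__(self, root, split, transforms=None):
        self.root, self.split, self.transforms = root, split, transforms


# Placeholder arguments; "example_prefix" is not a real configuration value.
extra = {"transforms": None, "annotation_prefix": "example_prefix"}
loader = ToyLoader("./data", "train", **filter_kwargs(ToyLoader, extra))
print(sorted(filter_kwargs(ToyLoader, extra)))
```

Silently dropping arguments can hide real configuration mistakes, so removing the argument explicitly, as done above, is the more transparent fix.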
Error 2:
AttributeError: module 'PIL.Image' has no attribute 'ANTIALIAS'
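Cause (see the blog post "AttributeError: module 'PIL.Image' has no attribute 'ANTIALIAS'" on 软件测试大叔's blog): the ANTIALIAS constant was removed in Pillow 10.0.0. Use PIL.Image.LANCZOS or PIL.Image.Resampling.LANCZOS instead. (This is exactly the algorithm that ANTIALIAS referred to; it just can no longer be accessed under the name ANTIALIAS.) Alternatively, downgrade to an older Pillow release: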
print(PIL.__version__)
pip uninstall -y Pillow
pip install Pillow==9.5.0
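If downgrading is not an option, the constant can be looked up in a way that works whether or not ANTIALIAS still exists. The sketch below takes the module as a parameter so it can be exercised without Pillow installed. lanczos_filter is a hypothetical helper, and the SimpleNamespace objects are stand-ins for PIL.Image, not real Pillow modules.

```python
from types import SimpleNamespace


def lanczos_filter(image_module):
    """Return the LANCZOS resampling constant from a PIL.Image-like module."""
    resampling = getattr(image_module, "Resampling", None)
    if resampling is not None:
        return resampling.LANCZOS  # newer Pillow: enum-style access
    if hasattr(image_module, "LANCZOS"):
        return image_module.LANCZOS
    return image_module.ANTIALIAS  # old releases where only the alias exists


# Stand-ins for PIL.Image in a newer and an older Pillow (illustrative values).
newer = SimpleNamespace(Resampling=SimpleNamespace(LANCZOS="enum-lanczos"), LANCZOS="lanczos")
older = SimpleNamespace(LANCZOS="lanczos", ANTIALIAS="lanczos")

if __name__ == "__main__":
    print(lanczos_filter(newer), lanczos_filter(older))
```

In real code this would be used as `from PIL import Image` followed by `img.resize(size, lanczos_filter(Image))`, replacing each use of Image.ANTIALIAS.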
III. Design tasks
References
1. Version relationships between torch, torchvision, and CUDA for installation
End