Preface

 

I. Installing the edgeai-torchvision environment

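First, note that torch must be installed in the virtual environment before torchvision, and that torchvision is built and installed from source, because "the standard torchvision will not support all the features in this repository." My system CUDA version is 11.7, but edgeai-torchvision currently supports only up to CUDA 11.3, so I installed the PyTorch and torchvision versions built for CUDA 11.3. Following setup.sh, that means PyTorch 1.10.0 and torchvision 0.11.0. For the other dependencies, any version that works is fine.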

However, this fails with:

RuntimeError: Detected that PyTorch and torchvision were compiled with different CUDA versions. PyTorch has CUDA Version=11.3 and torchvision has CUDA Version=11.7. Please reinstall the torchvision that matches your PyTorch install.

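I tried several approaches and all of them failed. After reading the setup.py code closely, I realized what was happening: when torchvision is built from source, it does not link against the CUDA in the virtual environment but against the system CUDA version.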

edgeai-torchvision/torchvision/extension.py

def _check_cuda_version():
    """
    Make sure that CUDA versions match between the pytorch install and torchvision install
    """
    if not _HAS_OPS:
        return -1
    import torch
    _version = torch.ops.torchvision._cuda_version()
    if _version != -1 and torch.version.cuda is not None:
        tv_version = str(_version)
        if int(tv_version) < 10000:
            tv_major = int(tv_version[0])
            tv_minor = int(tv_version[2])
        else:
            tv_major = int(tv_version[0:2])
            tv_minor = int(tv_version[3])
        t_version = torch.version.cuda
        t_version = t_version.split('.')
        t_major = int(t_version[0])
        t_minor = int(t_version[1])
        if t_major != tv_major or t_minor != tv_minor:
            raise RuntimeError("Detected that PyTorch and torchvision were compiled with different CUDA versions. "
                               "PyTorch has CUDA Version={}.{} and torchvision has CUDA Version={}.{}. "
                               "Please reinstall the torchvision that matches your PyTorch install."
                               .format(t_major, t_minor, tv_major, tv_minor))
    return _version
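The string slicing above can be hard to follow, so here is a minimal sketch of the same comparison using integer arithmetic. It assumes that `_cuda_version()` returns the value of the `CUDA_VERSION` macro, which encodes a release as major*1000 + minor*10 (11030 for CUDA 11.3). The helper names `decode_cuda_version` and `versions_match` are made up for illustration and are not part of torchvision.

```python
def decode_cuda_version(encoded):
    """Split an encoded CUDA version such as 11030 into (major, minor).

    Assumes the major*1000 + minor*10 encoding of the CUDA_VERSION macro.
    """
    major = encoded // 1000
    minor = (encoded % 1000) // 10
    return major, minor


def versions_match(encoded_tv, torch_cuda):
    """Compare torchvision's encoded version with a torch.version.cuda style string."""
    t_major, t_minor = (int(part) for part in torch_cuda.split(".")[:2])
    return decode_cuda_version(encoded_tv) == (t_major, t_minor)


# torchvision built against CUDA 11.7, PyTorch built against CUDA 11.3:
print(decode_cuda_version(11070))
print(versions_match(11070, "11.3"))
print(versions_match(11030, "11.3"))
```

With the values from the error message, the first comparison is the mismatch that triggers the RuntimeError and the second is the combination the build should produce.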

/home/xxx/miniconda3/envs/edgeaitv/lib/python3.8/site-packages/torch/utils/cpp_extension.py

    def _check_cuda_version(self):
        if CUDA_HOME:
            nvcc = os.path.join(CUDA_HOME, 'bin', 'nvcc')
            cuda_version_str = subprocess.check_output([nvcc, '--version']).strip().decode(*SUBPROCESS_DECODE_ARGS)
            cuda_version = re.search(r'release (\d+[.]\d+)', cuda_version_str)
            if cuda_version is not None:
                cuda_str_version = cuda_version.group(1)
                cuda_ver = packaging.version.parse(cuda_str_version)
                torch_cuda_version = packaging.version.parse(torch.version.cuda)
                if cuda_ver != torch_cuda_version:
                    # major/minor attributes are only available in setuptools>=49.6.0
                    if getattr(cuda_ver, "major", float("nan")) != getattr(torch_cuda_version, "major", float("nan")):
                        raise RuntimeError(CUDA_MISMATCH_MESSAGE.format(cuda_str_version, torch.version.cuda))
                    warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))

        else:
            raise RuntimeError(CUDA_NOT_FOUND_MESSAGE)

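The source code around these errors shows the main cause: when torchvision is built from source, the CUDA version is read from nvcc under CUDA_HOME. Note that cpp_extension.py only warns when the minor versions differ, so the build itself succeeds and the mismatch surfaces later in torchvision's own check. The CUDA version of the PyTorch build in the virtual environment therefore has to match the system CUDA version. The system currently has CUDA 11.7, but building edgeai-torchvision needs CUDA 11.3, and it must come from the system, so CUDA 11.3 has to be installed as well, in a way that makes it easy to switch back to CUDA 11.7 later. For the installation procedure, see "[Software/hardware environment and tool installation] NVIDIA driver/CUDA version relationship and CUDA installation".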
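Before rebuilding, it can help to confirm which CUDA release the build will actually pick up. The sketch below reuses the regular expression from cpp_extension.py to pull the release out of `nvcc --version` output and compares it with a `torch.version.cuda` style string. The function names and the sample output line are illustrative assumptions, not taken from PyTorch.

```python
import re
from typing import Optional


def parse_nvcc_release(nvcc_output: str) -> Optional[str]:
    """Extract the 'major.minor' release from `nvcc --version` output, if present."""
    match = re.search(r"release (\d+[.]\d+)", nvcc_output)
    return match.group(1) if match else None


def same_major_minor(nvcc_release: str, torch_cuda: str) -> bool:
    """True when both version strings agree on major and minor."""
    return nvcc_release.split(".")[:2] == torch_cuda.split(".")[:2]


# Illustrative sample of the line nvcc prints; real output has more lines.
sample = "Cuda compilation tools, release 11.7, V11.7.64"
release = parse_nvcc_release(sample)
print(release, same_major_minor(release, "11.3"))
```

On a real machine, the input would be the decoded output of `subprocess.check_output([nvcc, "--version"])` with nvcc taken from CUDA_HOME, and the second argument would be `torch.version.cuda`.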

Error 1:

raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'int'.
`np.int` was a deprecated alias for the builtin `int`. To avoid this error in existing code, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.

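This problem is tied to the NumPy version. There are two ways to resolve it, and installing a pinned NumPy version (option 2) is the quickest:

1) numpy.int was deprecated in NumPy 1.20 and removed in NumPy 1.24. Change it to numpy.int_, or just int.

2) Install a NumPy version that still provides the alias:

pip3 install numpy==1.19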
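If pinning NumPy is not an option, option 1 means editing the affected source files. A minimal sketch of that rewrite is shown below: it replaces the removed `np.int` / `numpy.int` alias with the builtin `int` and leaves sized types such as `np.int64` untouched. The helper name `replace_np_int` is hypothetical, and a regex pass like this does not understand Python syntax, so review the result before committing it.

```python
import re

# Matches `np.int` or `numpy.int` only when `int` is the whole attribute name,
# so `np.int64`, `np.int32` and `np.int_` are left alone.
_NP_INT_ALIAS = re.compile(r"\b(?:np|numpy)\.int\b")


def replace_np_int(source: str) -> str:
    """Replace the removed `np.int` alias with the builtin `int`."""
    return _NP_INT_ALIAS.sub("int", source)


before = "x = np.int(3); y = np.int64(4); z = arr.astype(numpy.int)"
print(replace_np_int(before))
```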

Error 2:

packages/torch/utils/tensorboard/__init__.py", line 4, in <module>
    LooseVersion = distutils.version.LooseVersion
AttributeError: module 'distutils' has no attribute 'version'

This is a setuptools version problem: the installed setuptools is too new. Reference: the post "AttributeError: module 'distutils' has no attribute 'version'" (solution).

# Use pip here, not `conda uninstall setuptools`: when conda uninstalls a package it resolves everything that depends on it and removes all of it. If you answer y, the whole environment has to be set up again.
pip3 uninstall setuptools
pip3 install setuptools==59.5.0

 

II. Testing the environment

 1. Image classification

Run the script directly:

sh run_edgeailite_classification.sh

Or run the command directly:

python ./references/edgeailite/scripts/train_classification_main.py --dataset_name cifar100_classification --model_name mobilenetv2_tv_x1 --data_path ./data/datasets/cifar100_classification --img_resize 32 --img_crop 32 --rand_scale 0.5 1.0

This fails with:

edgeai-torchvision/references/edgeailite/engine/train_classification.py", line 695, in validate
    progress_bar.set_postfix(Epoch='{}'.format(status_str))
TypeError: set_postfix() missing 1 required positional argument: 'postfix'

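The cause is that the source code calls the function incorrectly: set_postfix expects a positional argument named postfix, but the call passes only a keyword argument. Change the call to: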

progress_bar.set_postfix('Epoch={}'.format(status_str))
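To see why the keyword form fails, here is a minimal reproduction. `StepProgressBar` is a hypothetical stand-in, not the repository's progress bar class; its signature is only an assumption chosen because it produces the same TypeError as the traceback above.

```python
class StepProgressBar:
    """Hypothetical progress bar whose set_postfix requires a positional argument."""

    def __init__(self):
        self.postfix = ""

    def set_postfix(self, postfix, **kwargs):
        # Extra keyword arguments are accepted but do not fill `postfix`.
        self.postfix = postfix


def try_keyword_call(bar, status_str):
    """Return the TypeError message raised by the keyword-only call, or None."""
    try:
        bar.set_postfix(Epoch="{}".format(status_str))
    except TypeError as exc:
        return str(exc)
    return None


bar = StepProgressBar()
keyword_error = try_keyword_call(bar, "1/10")      # original call style
bar.set_postfix("Epoch={}".format("1/10"))         # corrected call style
print(keyword_error)
print(bar.postfix)
```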

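The script first trains the model, then runs quantization training starting from the trained model, and finally validates to estimate the accuracy of the quantized result. Working through it is enough to understand the implementation logic and overall flow of the classification pipeline.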

Each stage produces three files: the trained PyTorch model file, the converted ONNX model file, and the TorchScript model file.

2. Semantic segmentation

Adjust the configuration parameters to match your hardware and software environment, then run the script:

sh run_edgeailite_segmentation.sh

Error 1:

edgeai-torchvision/torchvision/edgeailite/xvision/datasets/cityscapes_plus.py", line 519, in cityscapes_segmentation
    train_split = CityscapesDataLoader(dataset_config, root, split_name, gt, transforms=transforms[0],
TypeError: __init__() got an unexpected keyword argument 'annotation_prefix'

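Background reading: "Python *args and **kwargs explained" (惊瑟's blog).

Replace the failing call with one that does not pass the annotation_prefix argument (compare with an earlier version of the code). This resolves the problem.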

Modelmaker integration v1 · TexasInstruments/edgeai-torchvision@f108240

Use

train_dataset, val_dataset = xvision.datasets.__dict__[args.dataset_name](args.dataset_config, args.data_path, split=split_arg, transforms=transforms)

in place of the original

train_dataset, val_dataset = xvision.datasets.__dict__[args.dataset_name](args.dataset_config, args.data_path, split=split_arg, transforms=transforms, annotation_prefix=args.annotation_prefix)
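The traceback points at a constructor deeper in the call chain rather than at the line that was edited. A small sketch of how that can happen is below: a factory forwards extra keyword arguments with `**kwargs`, and the inner constructor rejects the one it does not know. `InnerLoader`, `dataset_factory` and the value "instances" are hypothetical and only mirror the shape of the problem, not the repository's code.

```python
class InnerLoader:
    """Hypothetical loader whose __init__ does not know annotation_prefix."""

    def __init__(self, root, split, transforms=None):
        self.root = root
        self.split = split
        self.transforms = transforms


def dataset_factory(root, split, **kwargs):
    """Hypothetical factory that forwards every extra keyword argument unchanged."""
    return InnerLoader(root, split, **kwargs)


def build(**extra):
    """Return the loader, or the TypeError message if construction fails."""
    try:
        return dataset_factory("./data", "train", **extra)
    except TypeError as exc:
        return str(exc)


failing = build(transforms=None, annotation_prefix="instances")
working = build(transforms=None)
print(failing)
print(type(working).__name__)
```

Dropping the argument at the outermost call, as done above, keeps it from ever reaching the constructor that cannot accept it.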

Error 2:

AttributeError: module 'PIL.Image' has no attribute 'ANTIALIAS'

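Cause (see the post "AttributeError: module 'PIL.Image' has no attribute 'ANTIALIAS'" on 软件测试大叔's blog):

The ANTIALIAS constant was removed in Pillow 10.0.0. Use PIL.Image.LANCZOS or PIL.Image.Resampling.LANCZOS instead. (This is exactly the same algorithm that ANTIALIAS referred to; it just can no longer be accessed under the name ANTIALIAS.) Alternatively, downgrade to an older Pillow version. To check the installed version and downgrade: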

import PIL
print(PIL.__version__)  # shows which Pillow version is installed

pip uninstall -y Pillow
pip install Pillow==9.5.0
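If the code has to run under both old and new Pillow versions, the filter can also be looked up by whichever name exists. The following is a minimal sketch of that idea. `lanczos_filter` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for `PIL.Image` so the snippet runs without Pillow installed.

```python
from types import SimpleNamespace


def lanczos_filter(image_module):
    """Return the Lanczos resampling constant under whichever name the module exposes."""
    resampling = getattr(image_module, "Resampling", None)
    if resampling is not None and hasattr(resampling, "LANCZOS"):
        return resampling.LANCZOS       # newer Pillow: Image.Resampling.LANCZOS
    if hasattr(image_module, "LANCZOS"):
        return image_module.LANCZOS     # Image.LANCZOS
    return image_module.ANTIALIAS       # older Pillow only


# Stand-ins for PIL.Image in a new and an old Pillow; the string values are placeholders.
new_style = SimpleNamespace(Resampling=SimpleNamespace(LANCZOS="lanczos"), LANCZOS="lanczos")
old_style = SimpleNamespace(ANTIALIAS="lanczos")
print(lanczos_filter(new_style), lanczos_filter(old_style))
```

With Pillow installed, the intended use would be `from PIL import Image` followed by `img.resize(size, lanczos_filter(Image))` wherever the code currently passes `Image.ANTIALIAS`.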

 

 

III. Designing tasks

 

References

1. Version compatibility between torch, torchvision, and CUDA

2. github_edgeai-torchvision

3. github_torchvision