1. Things to watch out for: torch version issues
The code requires torch>=1.1, and torch and CUDA versions have to match each other, which really matters. I first installed 0.4.0, which did not work; then 1.4.0, but that needs a newer CUDA than I have; so I went back to 1.0.0 to match my cuda-9.0, only to discover the code requires >=1.1.0. That cost me two or three hours.
If CUDA and PyTorch do not match, check which CUDA version you have first, then install the torch version that corresponds to it.
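A quick way to see what is currently installed (nothing here is specific to this repo); the system CUDA version itself can be checked with nvcc --version:
import torch

print('torch:', torch.__version__)            # installed torch version
print('built for CUDA:', torch.version.cuda)  # CUDA version this torch build expects
print('CUDA available:', torch.cuda.is_available())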
During this process you are also likely to hit the following error, where the download fails because the connection is too slow:
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/pip/_vendor/urllib3/response.py", line 397, in _error_catcher
yield
File "/usr/local/lib/python3.5/dist-packages/pip/_vendor/urllib3/response.py", line 479, in read
data = self._fp.read(amt)
File "/usr/local/lib/python3.5/dist-packages/pip/_vendor/cachecontrol/filewrapper.py", line 62, in read
data = self.__fp.read(amt)
File "/usr/lib/python3.5/http/client.py", line 458, in read
n = self.readinto(b)
File "/usr/lib/python3.5/http/client.py", line 498, in readinto
n = self.fp.readinto(b)
File "/usr/lib/python3.5/socket.py", line 575, in readinto
return self._sock.recv_into(b)
File "/usr/lib/python3.5/ssl.py", line 929, in recv_into
return self.read(nbytes, buffer)
File "/usr/lib/python3.5/ssl.py", line 791, in read
return self._sslobj.read(len, buffer)
File "/usr/lib/python3.5/ssl.py", line 575, in read
v = self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/pip/_internal/cli/base_command.py", line 188, in main
status = self.run(options, args)
File "/usr/local/lib/python3.5/dist-packages/pip/_internal/commands/install.py", line 345, in run
resolver.resolve(requirement_set)
File "/usr/local/lib/python3.5/dist-packages/pip/_internal/legacy_resolve.py", line 196, in resolve
self._resolve_one(requirement_set, req)
File "/usr/local/lib/python3.5/dist-packages/pip/_internal/legacy_resolve.py", line 359, in _resolve_one
abstract_dist = self._get_abstract_dist_for(req_to_install)
File "/usr/local/lib/python3.5/dist-packages/pip/_internal/legacy_resolve.py", line 307, in _get_abstract_dist_for
self.require_hashes
File "/usr/local/lib/python3.5/dist-packages/pip/_internal/operations/prepare.py", line 199, in prepare_linked_requirement
progress_bar=self.progress_bar
File "/usr/local/lib/python3.5/dist-packages/pip/_internal/download.py", line 1064, in unpack_url
progress_bar=progress_bar
File "/usr/local/lib/python3.5/dist-packages/pip/_internal/download.py", line 924, in unpack_http_url
progress_bar)
File "/usr/local/lib/python3.5/dist-packages/pip/_internal/download.py", line 1152, in _download_http_url
_download_url(resp, link, content_file, hashes, progress_bar)
File "/usr/local/lib/python3.5/dist-packages/pip/_internal/download.py", line 861, in _download_url
hashes.check_against_chunks(downloaded_chunks)
File "/usr/local/lib/python3.5/dist-packages/pip/_internal/utils/hashes.py", line 75, in check_against_chunks
for chunk in chunks:
File "/usr/local/lib/python3.5/dist-packages/pip/_internal/download.py", line 829, in written_chunks
for chunk in chunks:
File "/usr/local/lib/python3.5/dist-packages/pip/_internal/utils/ui.py", line 156, in iter
for x in it:
File "/usr/local/lib/python3.5/dist-packages/pip/_internal/download.py", line 818, in resp_read
decode_content=False):
File "/usr/local/lib/python3.5/dist-packages/pip/_vendor/urllib3/response.py", line 531, in stream
data = self.read(amt=amt, decode_content=decode_content)
File "/usr/local/lib/python3.5/dist-packages/pip/_vendor/urllib3/response.py", line 496, in read
raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/local/lib/python3.5/dist-packages/pip/_vendor/urllib3/response.py", line 402, in _error_catcher
raise ReadTimeoutError(self._pool, None, 'Read timed out.')
pip._vendor.urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='pypi.tuna.tsinghua.edu.cn', port=443): Read timed out.
There is also a download timeout; raising the timeout from 100 to 1000 fixed it for me.
The fix:
sudo /usr/bin/python3.5 -m pip install --default-timeout=1000 --no-cache-dir torch==1.1.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
If the Tsinghua mirror does not work, switch to the Aliyun mirror. I spent three hours just getting this torch download to finish:
sudo /usr/bin/python3.5 -m pip install --default-timeout=1000 --no-cache-dir torch==1.1.0 -i https://mirrors.aliyun.com/pypi/simple
2.error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
Judging from the error, some supporting libraries are missing:
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
The fix suggested in other blogs:
sudo apt-get install python3.5-dev
This alone did not fix it for me; change the version number in the package name to match your own Python version.
After running the command above and then also installing every Python package listed in the requirements folder, the build finally went through. So it is probably not only Ubuntu system libraries that are missing, but some Python packages as well.
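Whether the dev headers pulled in by python3.5-dev actually ended up in place can be checked quickly (a sketch; it only inspects the running interpreter):
import os
import sysconfig

# x86_64-linux-gnu-gcc usually fails like this when Python.h is missing
inc = sysconfig.get_paths()['include']
print(inc, os.path.exists(os.path.join(inc, 'Python.h')))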
3.error: [Errno 13] Permission denied: '/usr/local/lib/python3.5/dist-packages/easy-install.pth'
This one is simple: just chmod the directory.
sudo chmod 777 /usr/local/lib/python3.5/dist-packages
sudo chmod 777 /usr/local/lib/python3.5/dist-packages/*
4.error: [Errno 13] Permission denied: '/usr/local/bin/convert-onnx-to-caffe2'
Fix:
sudo chmod 777 /usr/local/bin
5. fatal error: torch/extension.h: No such file or directory
This one again comes down to the torch version: with an old torch such as 0.4.0 the header simply is not there, so upgrade torch to 1.1.0.
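A quick way to check whether the installed torch actually ships that header (just a sketch that inspects the local install):
import os
import torch
from torch.utils.cpp_extension import include_paths

print(torch.__version__)
# torch/extension.h should sit under one of torch's include directories
for d in include_paths():
    print(d, os.path.exists(os.path.join(d, 'torch', 'extension.h')))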
6.from .. import deform_conv_cuda
This one was maddening: test.py kept throwing this error over and over and I could not get past it for a long time. Other people's blogs say all you need is:
python setup.py develop
and that should be it. But when I ran setup.py it threw yet another annoying error, which is the next problem; how I finally got around it is described there as well.
7. ':/usr/local/cuda-10.0/nvcc': No such file or directory
The error says nvcc cannot be found in the CUDA directory, but the nvcc file is definitely there; it is the path being used that is wrong.
Other people's fix is simple: add the CUDA path to ~/.bashrc and reload it:
export CUDA_HOME=/usr/local/cuda-9.0
source ~/.bashrc
According to everyone else that should be enough, but I fiddled with it for an entire day and it still did not work; I was ready to give up. Part of the problem is that the server has both cuda-10.0 and cuda-10.1 installed, and the path sometimes pointed at 10.0 and sometimes at 10.1, which is a pain.
Also look closely at the path inside the quotes: there is an extra colon ':' at the front. Other people's advice is to delete the colon and re-enter the path, which looks right, but it simply would not work for me. On top of that I was working on the lab server and could not connect from home; I kept launching ./pycharm.sh from the terminal with no response, then had to look up the PyCharm process id and kill it. The whole thing was exasperating.
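Before fighting with .bashrc any further, it helps to print what the environment actually contains; a small diagnostic sketch (purely a check, nothing repo-specific):
import os
import shutil

cuda_home = os.environ.get('CUDA_HOME')
print('CUDA_HOME =', repr(cuda_home))        # repr shows a stray ':' if one sneaked in
print('nvcc on PATH:', shutil.which('nvcc')) # None means cuda/bin is not on PATH
if cuda_home:
    print('nvcc under CUDA_HOME:',
          os.path.exists(os.path.join(cuda_home, 'bin', 'nvcc')))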
What finally worked was starting a fresh project: I created a new Python 3.5 virtual environment, installed everything listed in the requirements folder again, and then ran setup.py develop; after that test.py ran normally as well.
To this day I do not understand why a method that works for everyone else just would not work on my machine.
At this point setup.py runs without any problems.
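Once the build succeeds, a quick way to confirm the compiled CUDA ops are importable (a sketch; the module path below assumes the mmdetection 1.x layout my copy uses):
# an ImportError here would mean the CUDA ops were not built for this environment
try:
    from mmdet.ops.dcn import deform_conv_cuda
    print('deform_conv_cuda OK:', deform_conv_cuda.__file__)
except ImportError as e:
    print('CUDA ops not built:', e)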
Then run test.py. Here is the result of a quick test, using the Mask Scoring R-CNN x101_64x4d_fpn_1x model:
(The demo photo manages to make skinny me look this fat.)
Problems encountered when training on my own dataset:
8.AssertionError: annotation file format <class 'list'> not supported
For training I switched to my own data and changed the json annotation paths, running:
/usr/bin/python3.5 tools/train_chicken.py configs/mask_rcnn_x101_64x4d_fpn_1x.py --gpus 1 --validate --work_dir /work_dir
The json paths are set in configs/mask_rcnn_x101_64x4d_fpn_1x.py, in this block:
data = dict(
    imgs_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/train_via_region_data.json',
        img_prefix=data_root + 'train2017/',
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/val_via_region_data.json',
        img_prefix=data_root + 'val2017/',
        pipeline=test_pipeline),
    test=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/test_via_region_data.json',
        img_prefix=data_root + 'val2017/',
        pipeline=test_pipeline))
The error points at my annotations: the assertion means the top level of my json is a list, while the COCO-style loader expects a dict. I downloaded the official COCO json files first to compare them with mine. The cause was that my labeling tool exported the regions as polygons, whereas the COCO-style json I compared against stores rectangles, just the four corner points; after switching my data to rectangles it worked.
I have uploaded the script that generates the segmentation json to my CSDN downloads page.
9.KeyError: 'categories'
Still a json problem. After switching to rectangles, the error complains about the categories. Fixing it means either editing the COCO categories or mapping my own class onto one of the existing COCO classes; my class was chicken, which COCO does not have, so I changed it to bird.
The categories field also has to be added to the json itself.
Also edit voc_classes in class_names.py under mmdetection/mmdet/core/evaluation, replacing it with the class names of the dataset you are training on. Note that if there is only one class you still need the trailing comma, otherwise it errors out; see the sketch below:
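A sketch of what I mean (assuming the mmdetection 1.x layout, where class_names.py defines the class lists as functions; chicken is just my own class):
# mmdet/core/evaluation/class_names.py (sketch)
def voc_classes():
    return ['chicken']  # replace the original list with your own class names

# if the classes are given as a tuple elsewhere (e.g. CLASSES in a dataset class),
# a single class still needs the trailing comma; without it the value is just the
# string 'chicken' and iterating over it yields single characters
CLASSES = ('chicken', )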
The crux of all this is producing a json file in the standard COCO format.
The web page I uploaded can generate a COCO-style dataset; once the data is in COCO format, training runs normally.
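For reference, a COCO-style instance file is a dict (not a list) with images, annotations and categories at the top level; a minimal sketch that writes such a skeleton (the file name, image size and the bird id are just my example values):
import json

coco = {
    'images': [
        {'id': 0, 'file_name': '000001.jpg', 'width': 640, 'height': 480},
    ],
    'annotations': [
        {'id': 0,
         'image_id': 0,
         'category_id': 16,                                        # 16 = bird in COCO
         'bbox': [47, 49, 145, 130],                               # x, y, w, h
         'area': 18850,
         'iscrowd': 0,
         'segmentation': [[47, 49, 192, 49, 192, 179, 47, 179]]},  # list of polygons
    ],
    'categories': [
        {'id': 16, 'name': 'bird', 'supercategory': 'animal'},
    ],
}

with open('train_via_region_data.json', 'w') as f:
    json.dump(coco, f, indent=2)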
10.ValueError: need at least one array to concatenate
After building the COCO-style dataset, running the following command fails:
/usr/bin/python3.5 tools/train_chicken.py configs/mask_rcnn_x101_64x4d_fpn_1x.py
The traceback:
Traceback (most recent call last):
File "tools/train_chicken.py", line 142, in <module>
main()
File "tools/train_chicken.py", line 138, in main
meta=meta)
File "/home/zlee/下载/mmdetection-master/mmdet/apis/train.py", line 111, in train_detector
meta=meta)
File "/home/zlee/下载/mmdetection-master/mmdet/apis/train.py", line 225, in _non_dist_train
runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
File "/home/zlee/.local/lib/python3.5/site-packages/mmcv/runner/runner.py", line 359, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/zlee/.local/lib/python3.5/site-packages/mmcv/runner/runner.py", line 259, in train
for i, data_batch in enumerate(data_loader):
File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 193, in __iter__
return _DataLoaderIter(self)
File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 493, in __init__
self._put_indices()
File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 591, in _put_indices
indices = next(self.sample_iter, None)
File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/sampler.py", line 172, in __iter__
for idx in self.sampler:
File "/home/zlee/下载/mmdetection-master/mmdet/datasets/loader/sampler.py", line 63, in __iter__
indices = np.concatenate(indices)
ValueError: need at least one array to concatenate
So a ValueError. I have not solved this one yet.
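From the traceback, np.concatenate in the sampler is being handed an empty list of indices, which usually means the dataset ended up with zero usable images. A sanity check on the annotation file that I would try (a sketch, with my own path and class name):
from pycocotools.coco import COCO

coco = COCO('data/coco/annotations/train_via_region_data.json')  # my json path
print('categories:', [c['name'] for c in coco.loadCats(coco.getCatIds())])
cat_ids = coco.getCatIds(catNms=['bird'])                        # my class name
img_ids = coco.getImgIds(catIds=cat_ids)
print('images containing the class:', len(img_ids))
print('annotations for the class:', len(coco.getAnnIds(catIds=cat_ids)))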
11.KeyError: 'category_id'
This usually means a field is referenced somewhere but never defined.
Here the category id was simply missing from the json; adding it fixes it. The original annotations looked like this:
"annotations": [{
"id": 0,
"image_id": "0",
"segmentation": [47, 49, 192, 49, 192, 179, 47, 179],
"area": 18850,
"bbox": [47, 49, 145, 130],
"iscrowd": 0
},
{
"id": 1,
"image_id": "0",
"segmentation": [45, 304, 187, 304, 187, 563, 45, 563],
"area": 36778,
"bbox": [45, 304, 142, 259],
"iscrowd": 0
},
Adding category_id fixes it. For the concrete id you have to look it up in the COCO category list; mine is bird, whose id is 16:
"annotations": [{
"id": 0,
"image_id": "0",
"segmentation": [47, 49, 192, 49, 192, 179, 47, 179],
"area": 18850,
"bbox": [47, 49, 145, 130],
"iscrowd": 0,
"category_id": 16
},
{
"id": 1,
"image_id": "0",
"segmentation": [45, 304, 187, 304, 187, 563, 45, 563],
"area": 36778,
"bbox": [45, 304, 142, 259],
"iscrowd": 0,
"category_id": 16
},
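Instead of editing the file by hand, a small patch script does the same thing (a sketch; the path and the id 16 are my own values):
import json

path = 'annotations/train_via_region_data.json'  # my annotation file
with open(path) as f:
    data = json.load(f)

# add category_id to every annotation that is missing it (16 = bird in COCO)
for ann in data['annotations']:
    ann.setdefault('category_id', 16)

with open(path, 'w') as f:
    json.dump(data, f, indent=2)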