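Contents
Preparation
Downloading the source code
Setting up the environment
Building a VOC dataset
data directory structure
Training
Compiling the CUDA dependencies
Pretrained model
Modifying pascal_voc.py
Running training
Problems encountered
Main references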
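Preparation
Downloading the source code
Faster R-CNN source for PyTorch 0.4.0: GitHub - jwyang/faster-rcnn.pytorch: A faster pytorch implementation of faster r-cnn
Faster R-CNN source for PyTorch 1.0.0: GitHub - jwyang/faster-rcnn.pytorch at pytorch-1.0
Setting up the environment
In the directory that contains requirements.txt, install the required libraries with: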
pip install -r requirements.txt
After that, it is best to downgrade scipy, for example to 1.2.1; otherwise later steps may fail with an error
pip uninstall scipy
pip install scipy==1.2.1
Building a VOC dataset
data directory structure
data
├─VOCdevkit2007
│  └─VOC2007
│      ├─Annotations
│      ├─ImageSets
│      │  ├─Layout
│      │  ├─Main
│      │  └─Segmentation
│      ├─JPEGImages
│      ├─SegmentationClass
│      └─SegmentationObject
└─pretrained_model
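pretrained_model holds the pretrained model,
Annotations holds the XML label files,
JPEGImages holds the JPG image files,
Main under ImageSets holds the txt files for the training, validation, and test sets; each file lists image IDs (file names without the extension), one per line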
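The split files under ImageSets/Main can be generated from the annotation file names. The following is a minimal sketch, not part of the repository: write_voc_splits is a hypothetical helper, and the 80/20 ratios and the fixed seed are assumptions to adjust for your own data.

```python
import random
from pathlib import Path


def write_voc_splits(voc_root, trainval_ratio=0.8, train_ratio=0.8, seed=0):
    """Write trainval/train/val/test.txt under ImageSets/Main from the XML file stems."""
    voc_root = Path(voc_root)
    ids = sorted(p.stem for p in (voc_root / "Annotations").glob("*.xml"))
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible

    n_trainval = int(len(ids) * trainval_ratio)
    n_train = int(n_trainval * train_ratio)
    splits = {
        "trainval": ids[:n_trainval],
        "train": ids[:n_train],
        "val": ids[n_train:n_trainval],
        "test": ids[n_trainval:],
    }

    main_dir = voc_root / "ImageSets" / "Main"
    main_dir.mkdir(parents=True, exist_ok=True)
    for name, members in splits.items():
        # One image ID per line, without the file extension.
        text = "".join(f"{image_id}\n" for image_id in sorted(members))
        (main_dir / f"{name}.txt").write_text(text)
    return {name: len(members) for name, members in splits.items()}
```

Calling write_voc_splits("data/VOCdevkit2007/VOC2007") would write the four txt files next to the existing folders and return the number of IDs in each split.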
Training
Compiling the CUDA dependencies
cd lib
python setup.py build develop
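Pretrained model
The pretrained model must be placed in the pretrained_model folder
Modifying pascal_voc.py
Change the detection classes in faster-rcnn.pytorch/lib/datasets/pascal_voc.py to your own classes
Class names must be lowercase!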
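A likely reason for the lowercase requirement is that the annotation loader lowercases the XML name before looking it up in the class table; that is an assumption about the repository code, so check it against your copy of pascal_voc.py. The sketch below uses hypothetical classes (cat, dog) and a hypothetical helper to show the lookup.

```python
# Hypothetical custom classes; '__background__' stays first so it gets index 0.
CLASSES = ('__background__', 'cat', 'dog')
CLASS_TO_IND = dict(zip(CLASSES, range(len(CLASSES))))


def label_to_index(xml_name):
    """Map an XML <name> value to a class index using a lowercased, stripped key."""
    return CLASS_TO_IND[xml_name.lower().strip()]
```

With this kind of lookup, an XML label such as "Dog" still resolves, but a class written as 'Dog' in the tuple would never match and would raise a KeyError.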
Running training
CUDA_VISIBLE_DEVICES=0 python trainval_net.py --dataset pascal_voc --net res101 --bs 4 --nw 4 --lr 0.005 --lr_decay_step 5 --cuda --epochs 50
Testing
Run the test with the following command:
CUDA_VISIBLE_DEVICES=9 python test_net.py --dataset pascal_voc --net res101 --checksession 1 --checkepoch 1 --checkpoint 6755 --cuda
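The --checksession, --checkepoch, and --checkpoint values presumably select one saved checkpoint file. Here is a minimal sketch of how that path could be assembled, assuming the layout load_dir/net/dataset/faster_rcnn_SESSION_EPOCH_CHECKPOINT.pth. Both the layout and the checkpoint_path helper are assumptions, so compare the result with the files actually written to your models directory.

```python
import os


def checkpoint_path(load_dir, net, dataset, session, epoch, checkpoint):
    """Build the assumed checkpoint path from the command-line values."""
    file_name = "faster_rcnn_{}_{}_{}.pth".format(session, epoch, checkpoint)
    return os.path.join(load_dir, net, dataset, file_name)
```

If the path built from your arguments does not exist, the three numbers most likely do not match a checkpoint that training actually saved.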
demo.py
Run demo.py with the following command:
CUDA_VISIBLE_DEVICES=0 python demo.py --net res101 --checksession 1 --checkepoch 4 --checkpoint 13512 --cuda --load_dir models
Error: RuntimeError: Error(s) in loading state_dict for resnet:
RuntimeError: Error(s) in loading state_dict for resnet:
size mismatch for RCNN_cls_score.weight: copying a param with shape torch.Size([16, 2048]) from checkpoint, the shape in current model is torch.Size([21, 2048]).
size mismatch for RCNN_cls_score.bias: copying a param with shape torch.Size([16]) from checkpoint, the shape in current model is torch.Size([21]).
size mismatch for RCNN_bbox_pred.weight: copying a param with shape torch.Size([64, 2048]) from checkpoint, the shape in current model is torch.Size([84, 2048]).
size mismatch for RCNN_bbox_pred.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([84]).
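It kept failing:
Change the detection classes in demo.py so that they match the classes used for training:
The server is short on resources because everyone is using it, so I could not finish testing (test + demo) today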
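The numbers in the size-mismatch message above are consistent with a checkpoint trained on 16 classes (background included) being loaded into a model built for 21 classes: the classification layer has one row per class and the box-regression layer has four rows per class. The sketch below reproduces those shapes with a hypothetical head_shapes helper; the feature width 2048 is taken from the error message.

```python
def head_shapes(num_classes, feat_dim=2048):
    """Expected shapes of the two output layers for a class count that includes background."""
    return {
        "RCNN_cls_score.weight": (num_classes, feat_dim),
        "RCNN_cls_score.bias": (num_classes,),
        "RCNN_bbox_pred.weight": (4 * num_classes, feat_dim),  # 4 box values per class
        "RCNN_bbox_pred.bias": (4 * num_classes,),
    }
```

head_shapes(16) gives the checkpoint shapes from the message and head_shapes(21) gives the shapes of the current model, which is why the class list in demo.py has to match the one used for training.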
Problems encountered
1. Compilation fails with invalid command 'develop'
usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
or: setup.py --help [cmd1 cmd2 ...]
or: setup.py --help-commands
or: setup.py cmd --help
error: invalid command 'develop'
Reference: python setup.py develop · Issue #92 · django-extensions/django-extensions · GitHub
In setup.py, replace
from distutils.core import setup
with
from setuptools import setup
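Then activate the virtual environment in the server terminal, cd to the corresponding directory, and run the following commands in order: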
conda activate your_env_name
cd lib
python setup.py build develop
Compilation now starts!
2. Along the way you may hit can't import imread; downgrading scipy, for example to 1.2.1, resolves it
pip uninstall scipy
pip install scipy==1.2.1
3. libstdc++.so.6: version `GLIBCXX_3.4.30' not found
ImportError: /home/zy/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/lib/../../../../libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /home/zy/anaconda3/envs/pytorch/lib/python3.6/site-packages/cv2/cv2.abi3.so)
List the GLIBCXX versions supported by the system libstdc++.so.6:
strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBC
The highest version listed is 3.4.30
GLIBCXX versions supported by the libstdc++.so.6 in the anaconda environment:
strings /home/zy/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/lib/../../../../libstdc++.so.6 | grep GLIBCX
In the anaconda environment the highest version is 3.4.29, but version 3.4.30 is required
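The same comparison can be scripted instead of reading the strings output by eye. The sketch below scans the raw bytes of a library for GLIBCXX_x.y.z markers and reports the highest one; glibcxx_versions and max_glibcxx are hypothetical helper names, and the approach is simply a text scan equivalent to strings piped into grep.

```python
import re


def glibcxx_versions(data):
    """Return the sorted GLIBCXX version tuples found in the raw bytes of a library."""
    found = set(re.findall(rb"GLIBCXX_(\d+(?:\.\d+)+)", data))
    return sorted(tuple(int(part) for part in v.decode().split(".")) for v in found)


def max_glibcxx(path):
    """Return the highest GLIBCXX version in the file at path, or None if there is none."""
    with open(path, "rb") as f:
        versions = glibcxx_versions(f.read())
    return ".".join(map(str, versions[-1])) if versions else None
```

Running max_glibcxx on the system library and on the copy inside the conda environment shows directly which of the two lacks the required version.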
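Reference: Ubuntu系统anaconda报错version `GLIBCXX_3.4.30' not found - Death_Knight - 博客园 (cnblogs.com)
List the libstdc++.so.6 related files in the anaconda environment (run these from the environment's lib directory, here /home/zy/anaconda3/envs/pytorch/lib):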
ls libstdc++.so
ls libstdc++.so -al
ls libstdc++.so.6 -al
ls libstdc++.so.6.0.29 -al
List the libstdc++.so.6 related files under the system library path with:
ls -al /usr/lib/x86_64-linux-gnu/libstdc++.so.6
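At this point libstdc++.so and libstdc++.so.6 in the anaconda environment are symlinks that point to libstdc++.so.6.0.29
Use the following commands to repoint libstdc++.so and libstdc++.so.6 in the anaconda environment to the copy under the system path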
rm libstdc++.so
rm libstdc++.so.6
ln -s /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30 libstdc++.so
ln -s /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30 libstdc++.so.6
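Checking again shows that the links now point to version 6.0.30
You can also try the following; the method above worked for me, so I did not test this one: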
(已解决)Import报错 Version `GLIBCXX_3.4.22‘ not found_glibcxx_3.4.28_可可与鱼的博客-CSDN博客
4. Running again, the previous error is gone, but a new problem appears
ImportError: torch.utils.ffi is deprecated. Please use cpp extensions instead.
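I realized I was using the 0.4.0 code, which is why so many problems came up, and I now have to switch to the PyTorch 1.0 code. It was frustrating to find out, after all that effort, that the version was 0.4.0
Faster R-CNN source for PyTorch 0.4.0: GitHub - jwyang/faster-rcnn.pytorch: A faster pytorch implementation of faster r-cnn
Faster R-CNN source for PyTorch 1.0.0: GitHub - jwyang/faster-rcnn.pytorch at pytorch-1.0
For 0.4.0, see Faster RCNN 环境配置_faster rcnn环境配置_吾人为学的博客-CSDN博客, which still uses 0.4.0
5. Running fails with ImportError: cannot import name '_mask'
Reference: ImportError: cannot import name '_mask' · Issue #410 · jwyang/faster-rcnn.pytorch · GitHub
Activate the virtual environment, then install the COCO API under the data directory with the following commands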
cd data
git clone https://github.com/pdollar/coco.git
cd coco/PythonAPI
make
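6. New problem: TypeError: load() missing 1 required positional argument: 'Loader'
Cause: newer versions of PyYAML no longer support the old yaml.load() call without a Loader argument,
way1: replace the call with any of the following three forms: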
yaml.load(file,Loader=yaml.FullLoader)
yaml.safe_load(file)
yaml.load(file, Loader=yaml.CLoader)
way2: downgrade pyyaml from 6.0 to 5.4.1 (this is how I solved it; it also felt like the most convenient option)
pip uninstall pyyaml
pip install pyyaml==5.4.1
Finally no more errors, what a relief
7. Oh no, another error
ValueError: Caught ValueError in DataLoader worker process 1.
ValueError: operands could not be broadcast together with shapes (683,1024,4) (1,1,3) (683,1024,4)
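I had been using multiple GPUs; after switching to a single GPU the error went away, but why are rpn_cls, rpn_box, and so on all nan?
In the end the error came back, the same as above. I also noticed that my dataset file names all have 5 digits, whereas they should have 6
8. After that change, the error is assert (boxes[:, 2] >= boxes[:, 0]).all() AssertionError
In lib/datasets/pascal_voc.py, function _load_pascal_annotation(),
remove every -1 after Xmin, Ymin, Xmax, Ymax
In lib/datasets/imdb.py, function append_flipped_images(),
to clean up the data, add the following code below the line boxes[:, 2] = widths[i] - oldx1 - 1: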
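for b in range(len(boxes)):
    if boxes[b][2] < boxes[b][0]:
        # Swap x1 and x2. A plain alias (aboxes = boxes) shares memory with boxes,
        # so it would not preserve the old x1 for the second assignment.
        boxes[b][0], boxes[b][2] = boxes[b][2], boxes[b][0]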
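Since this assertion usually points at the annotations themselves, it can help to scan the XML files before training. A commonly cited cause is a coordinate of 0, which turns negative once 1 is subtracted; that explanation is an assumption here, not something verified against this dataset. The sketch below flags boxes that are degenerate or fall outside the image, assuming 1-based VOC coordinates and a size element in every file. find_bad_boxes is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def find_bad_boxes(xml_text):
    """Return (name, xmin, ymin, xmax, ymax) for boxes that are degenerate or out of bounds."""
    root = ET.fromstring(xml_text)
    width = int(root.findtext("size/width"))
    height = int(root.findtext("size/height"))
    bad = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        xmin, ymin, xmax, ymax = (
            int(float(box.findtext(key))) for key in ("xmin", "ymin", "xmax", "ymax")
        )
        # Assumes 1-based coordinates, so 0 is already out of range.
        inside = xmin >= 1 and ymin >= 1 and xmax <= width and ymax <= height
        if not inside or xmin >= xmax or ymin >= ymax:
            bad.append((obj.findtext("name"), xmin, ymin, xmax, ymax))
    return bad
```

Running this over every file in Annotations and fixing or dropping the reported boxes is an alternative to patching the loader.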
9. Another run produced this error
roidb[i]['img_id'] = imdb.image_id_at(i)
IndexError: list index out of range
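Reference: roidb[i]['image'] = imdb.image_path_at(i) · Issue #79 · rbgirshick/fast-rcnn · GitHub
This can be caused by stale cache files: delete the cache files for the training data under fast-rcnn-master/data/cache/ (in this repository, presumably data/cache/) and retry. That solved it!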
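Clearing the cache is easy to forget after every change to the annotations or the split files, so it can be scripted. This is a minimal sketch with a hypothetical helper, clear_roidb_cache; the cache directory to pass in is an assumption and should be whatever folder holds the .pkl files in your checkout.

```python
from pathlib import Path


def clear_roidb_cache(cache_dir):
    """Delete cached *.pkl files in cache_dir and return the names of the deleted files."""
    removed = []
    for pkl in sorted(Path(cache_dir).glob("*.pkl")):
        pkl.unlink()
        removed.append(pkl.name)
    return removed
```

Only files ending in .pkl directly inside the given directory are removed; everything else is left untouched.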
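10. ValueError: operands could not be broadcast together with shapes (1024,717,4) (1,1,3) (1024,717,4) appeared again, so this problem had not actually been solved
The likely cause is that some images have 4 channels (RGB + alpha); keeping only the three RGB channels is enough
In lib/model/utils/blob.py, insert the following before line 39: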
if im.shape[2] == 4:
    im = im[:, :, :3]
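The broadcast error is easy to reproduce outside the training loop: subtracting per-channel means of shape (1, 1, 3) from a 4-channel image fails, while the same subtraction works once the alpha channel is dropped. The sketch below assumes NumPy is installed; to_three_channels is a hypothetical helper, the mean values are illustrative, and the grayscale branch is an addition that the fix above does not cover.

```python
import numpy as np

# Illustrative per-channel means with shape (1, 1, 3), matching the shape in the error message.
PIXEL_MEANS = np.array([[[102.9801, 115.9465, 122.7717]]])


def to_three_channels(im):
    """Return an H x W x 3 array from a grayscale, RGB, or RGBA image array."""
    if im.ndim == 2:
        im = np.stack([im] * 3, axis=2)  # grayscale: repeat the single channel
    elif im.shape[2] == 4:
        im = im[:, :, :3]  # RGBA: drop the alpha channel
    return im
```

After the conversion, to_three_channels(im) - PIXEL_MEANS broadcasts without error for all three input layouts.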
Training now runs normally
CUDA_VISIBLE_DEVICES=3,4 python trainval_net.py --dataset pascal_voc --net res101 --bs 4 --nw 4 --lr 0.005 --lr_decay_step 5 --cuda --epochs 50
11. Another error
RuntimeError: Caught RuntimeError in DataLoader worker process 3.
RuntimeError: The expanded size of the tensor (1200) must match the existing size (0) at non-singleton dimension 1. Target sizes: [600, 1200, 3]. Tensor sizes: [600, 0, 3]
1439
[session 1][epoch 1][iter 3800/6756] loss: 0.7875, lr: 5.00e-03
fg/bg=(10/1014), time cost: 53.474536
rpn_cls: 0.1268, rpn_box: 0.0292, rcnn_cls: 0.0855, rcnn_box 0.0142
Traceback (most recent call last):
File "trainval_net.py", line 310, in <module>
data = next(data_iter)
File "/home/zy/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
data = self._next_data()
File "/home/zy/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1085, in _next_data
return self._process_data(data)
File "/home/zy/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1111, in _process_data
data.reraise()
File "/home/zy/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/_utils.py", line 428, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 2.
Original Traceback (most recent call last):
File "/home/zy/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
data = fetcher.fetch(index)
File "/home/zy/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/zy/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/zy/faster-rcnn.pytorch-pytorch-1.0/lib/roi_data_layer/roibatchLoader.py", line 177, in __getitem__
padding_data[:, :data_width, :] = data[0]
RuntimeError: The expanded size of the tensor (1200) must match the existing size (0) at non-singleton dimension 1. Target sizes: [600, 1200, 3]. Tensor sizes: [600, 0, 3]
Try deleting the pkl files under fast-rcnn-master/data/cache/
Errors during testing:
1. AttributeError: 'NoneType' object has no attribute 'text'
File "/home/zy/faster-rcnn.pytorch-pytorch-1.0/lib/datasets/voc_eval.py", line 22, in parse_rec
obj_struct['pose'] = obj.find('pose').text
AttributeError: 'NoneType' object has no attribute 'text'
Fix: comment out this line in the original file: obj_struct['pose'] = obj.find('pose').text
My XML files contain none of pose, truncated, or difficult, so I commented out all three
2. Error: KeyError: 'difficult'
difficult = np.array([x['difficult'] for x in R]).astype(np.bool)
KeyError: 'difficult'
Fix: edit faster-rcnn.pytorch-pytorch-1.0/lib/datasets/voc_eval.py and comment out the code related to difficult
Reference: 利用py-faster-rcnn训练目标检测模型_liuyan20062010的博客-CSDN博客
Modified places:
def parse_rec(filename):
    """ Parse a PASCAL VOC xml file """
    tree = ET.parse(filename)
    objects = []
    for obj in tree.findall('object'):
        obj_struct = {}
        obj_struct['name'] = obj.find('name').text
        # obj_struct['pose'] = obj.find('pose').text  # commented out
        # obj_struct['truncated'] = int(obj.find('truncated').text)  # commented out
        # obj_struct['difficult'] = int(obj.find('difficult').text)  # commented out
        bbox = obj.find('bndbox')
        obj_struct['bbox'] = [int(bbox.find('xmin').text),
                              int(bbox.find('ymin').text),
                              int(bbox.find('xmax').text),
                              int(bbox.find('ymax').text)]
        objects.append(obj_struct)
    return objects
# extract gt objects for this class
class_recs = {}
npos = 0
for imagename in imagenames:
    R = [obj for obj in recs[imagename] if obj['name'] == classname]
    bbox = np.array([x['bbox'] for x in R])
    # difficult = np.array([x['difficult'] for x in R]).astype(np.bool)  # commented out
    difficult = np.zeros(len(R), dtype=bool)  # added: no object is marked difficult
    det = [False] * len(R)
    # npos = npos + sum(~difficult)  # commented out
    npos = npos + len(R)  # added: npos must still count ground-truth boxes, or recall divides by zero
    class_recs[imagename] = {'bbox': bbox,
                             'difficult': difficult,
                             'det': det}
if ovmax > ovthresh:
    # if not R['difficult'][jmax]:  # commented out
    if not R['det'][jmax]:  # kept, so that a first match still counts as a true positive
        tp[d] = 1.
        R['det'][jmax] = 1
    else:
        fp[d] = 1.
else:
    fp[d] = 1.
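An alternative to commenting out these lines is to let the parser fall back to defaults when pose, truncated, or difficult is missing, so the rest of the evaluation code can stay as it was. The sketch below is a hypothetical variant that parses XML text rather than a file path; the default values ("Unspecified", 0, 0) are assumptions.

```python
import xml.etree.ElementTree as ET


def parse_rec_text(xml_text):
    """Parse PASCAL VOC XML text; missing pose/truncated/difficult fall back to defaults."""
    root = ET.fromstring(xml_text)
    objects = []
    for obj in root.findall("object"):
        bbox = obj.find("bndbox")
        objects.append({
            "name": obj.findtext("name"),
            # findtext returns the default when the tag is absent.
            "pose": obj.findtext("pose", default="Unspecified"),
            "truncated": int(obj.findtext("truncated", default="0")),
            "difficult": int(obj.findtext("difficult", default="0")),
            "bbox": [int(bbox.findtext(key)) for key in ("xmin", "ymin", "xmax", "ymax")],
        })
    return objects
```

Because every record then carries a difficult value of 0, both the AttributeError and the KeyError: 'difficult' should disappear without touching the true-positive bookkeeping.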