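The site being crawled is the same one as in the previous post, https://www.dmzj.com/. This time, though, we use selenium to collect all the images of each chapter. The reason is that on dynamic pages, such as the common case where content is generated by JavaScript, a static request returns the HTML before the scripts have run. The JS-generated part does not exist yet at that point, so a static approach cannot retrieve it.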
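To make the limitation concrete, here is a small sketch using only the standard library. The HTML string is a made-up stand-in for what a plain HTTP client would receive, not the actual markup of the site: the `<img>` tags only come into existence when a browser executes the script.

```python
from html.parser import HTMLParser

# Hypothetical page source as a plain HTTP client would receive it:
# the <img> tags are only created once a browser runs the script.
STATIC_HTML = """
<html><body>
  <div id="reader"></div>
  <script>
    var pages = ["/img/1.jpg", "/img/2.jpg"];
    pages.forEach(function (src) {
      var img = document.createElement("img");
      img.src = src;
      document.getElementById("reader").appendChild(img);
    });
  </script>
</body></html>
"""


class ImgCounter(HTMLParser):
    """Counts <img> start tags that are present in the raw markup."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.count += 1


counter = ImgCounter()
counter.feed(STATIC_HTML)
print(counter.count)  # prints 0: no <img> exists until the script runs
```

A browser rendering the same page would end up with two images, which is the gap selenium fills.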
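As for selenium's webdriver, my own understanding is that it simulates a user browsing the web with a browser, except that the clicks come from code rather than from the user working the mouse by hand. I use the Firefox browser here.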
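The idea of driving a browser from code can be sketched roughly as below. This is a minimal sketch, assuming Selenium is installed and that Firefox and geckodriver are available on the machine. The function name `fetch_rendered_html`, the `make_driver` parameter and the fixed wait are illustrative choices, not taken from the original post.

```python
import time


def fetch_rendered_html(url, make_driver=None, wait_seconds=3.0):
    """Open url in a browser and return the HTML after scripts have run."""
    if make_driver is None:
        # Imported lazily so the module loads even without Selenium installed.
        from selenium import webdriver
        make_driver = webdriver.Firefox  # needs Firefox and geckodriver

    driver = make_driver()
    try:
        driver.get(url)
        # Crude fixed pause so the page's JavaScript can finish;
        # an explicit wait on a known element would be more robust.
        time.sleep(wait_seconds)
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page failed.
        driver.quit()
```

Calling `fetch_rendered_html("https://www.dmzj.com/")` would open a Firefox window, load the page, and hand back the rendered HTML as a string. Passing the driver factory as a parameter is only a convenience so that another browser can be swapped in.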
The code for this part is as follows:
For each page, we fetch the element that contains all of the img entries.
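One way this step might look is sketched below with the standard library's HTML parser, working on the rendered HTML string. It rests on an assumption that should be checked against the real page: that the images are listed as `<option>` entries inside a `<select>` whose id is `page_select`, with the image URL in `value` and the page number in the option text. That id, the sample markup, and the function names are all hypothetical.

```python
import re
from html.parser import HTMLParser


class PageSelectParser(HTMLParser):
    """Collects (value, text) for each <option> in <select id=select_id>."""

    def __init__(self, select_id):
        super().__init__()
        self.select_id = select_id
        self.in_select = False
        self.current_value = None  # value of the <option> being read, if any
        self.current_text = []
        self.options = []

    def _flush(self):
        # Store the option read so far (also covers an unclosed <option>).
        if self.current_value is not None:
            self.options.append((self.current_value, "".join(self.current_text)))
            self.current_value = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select" and attrs.get("id") == self.select_id:
            self.in_select = True
        elif tag == "option" and self.in_select:
            self._flush()
            self.current_value = attrs.get("value") or ""
            self.current_text = []

    def handle_data(self, data):
        if self.current_value is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "option":
            self._flush()
        elif tag == "select":
            self._flush()
            self.in_select = False


def extract_pages(html, select_id="page_select"):
    """Return (image_urls, page_numbers) parsed from rendered HTML."""
    parser = PageSelectParser(select_id)
    parser.feed(html)
    urls, numbers = [], []
    for value, text in parser.options:
        # Protocol-relative URLs ("//host/path") get an explicit scheme.
        urls.append("https:" + value if value.startswith("//") else value)
        match = re.search(r"\d+", text)
        numbers.append(match.group() if match else "")
    return urls, numbers


# Made-up markup standing in for the rendered chapter page.
sample = (
    '<select id="page_select">'
    '<option value="//example.invalid/img/a.jpg">Page 1</option>'
    '<option value="//example.invalid/img/b.jpg">Page 2</option>'
    "</select>"
)
print(extract_pages(sample))
```

The function returns two parallel lists, one of image URLs and one of page numbers as strings, which is the same shape as the test output shown in this post.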
Let's run a test.
The output is as follows:
D:\software\Anaconda3\install\envs\pytorch\python.exe D:/software/PyCharm/code/spider9.py
['https://images.dmzj.com/img/chapterpic/1247/25304/14492330112801.jpg', 'https://images.dmzj.com/img/chapterpic/1247/25304/1449233011435.jpg', 'https://images.dmzj.com/img/chapterpic/1247/25304/14492330119295.jpg', 'https://images.dmzj.com/img/chapterpic/1247/25304/14492330124847.jpg', 'https://images.dmzj.com/img/chapterpic/1247/25304/14492330130246.jpg', 'https://images.dmzj.com/img/chapterpic/1247/25304/14492330140432.jpg', 'https://images.dmzj.com/img/chapterpic/1247/25304/14492330152702.jpg', 'https://images.dmzj.com/img/chapterpic/1247/25304/14494576363317.jpg', 'https://images.dmzj.com/img/chapterpic/1247/25304/14492330159679.jpg', 'https://images.dmzj.com/img/chapterpic/1247/25304/14492330165486.jpg', 'https://images.dmzj.com/img/chapterpic/1247/25304/14492330174622.jpg', 'https://images.dmzj.com/img/chapterpic/1247/25304/14492330180602.jpg', 'https://images.dmzj.com/img/chapterpic/1247/25304/14492330190377.jpg', 'https://images.dmzj.com/img/chapterpic/1247/25304/14492330192663.jpg', 'https://images.dmzj.com/img/chapterpic/1247/25304/14492330204047.jpg', 'https://images.dmzj.com/img/chapterpic/1247/25304/14492330205342.jpg']
['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16']
Process finished with exit code 0
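The result looks fine, so now we modify the earlier code. Only part of it changes: the overall framework can stay as it is, and only the logic for downloading a single chapter is revised.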
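A revised single-chapter download step could look roughly like the sketch below, which takes the two lists (image URLs and page numbers) and saves each image under its page number. The function names, the directory layout and the optional `referer` parameter are illustrative assumptions, not the code from the previous post. Whether the image host actually requires a Referer header is also an assumption.

```python
import os
import urllib.request


def plan_chapter_files(image_urls, page_numbers, chapter_dir):
    """Pair each image URL with a local path named after its page number."""
    plan = []
    for url, number in zip(image_urls, page_numbers):
        # Reuse the extension from the URL; fall back to .jpg if there is none.
        extension = os.path.splitext(url)[1] or ".jpg"
        plan.append((url, os.path.join(chapter_dir, number + extension)))
    return plan


def download_chapter(image_urls, page_numbers, chapter_dir, referer=None):
    """Download every page of one chapter into chapter_dir."""
    os.makedirs(chapter_dir, exist_ok=True)
    for url, path in plan_chapter_files(image_urls, page_numbers, chapter_dir):
        request = urllib.request.Request(url)
        if referer:
            # Some image hosts reject requests that lack a Referer header.
            request.add_header("Referer", referer)
        with urllib.request.urlopen(request, timeout=30) as response:
            with open(path, "wb") as out:
                out.write(response.read())
```

Keeping the path planning separate from the network calls makes it easy to check the file names before anything is downloaded.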
The complete code is as follows:
The result is as follows: