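PaddleHub: A Tool for Applying Paddle Pretrained Models
- This article introduces PaddleHub, PaddlePaddle's pretrained model management tool, and shows how to use it to try out models quickly and to perform transfer learning. A GPU environment is recommended for running the programs; when launching the environment, select the "Advanced" (高级版) environment as shown in the figure below.
If you have no compute card (算力卡) resources, you can click the link to apply for them.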
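Overview
Let's start with a question: what can ten lines of Python code do? Some would say a small calendar or a simple chatbot. In fact, ten lines of code are enough to train a deep learning model, and PaddlePaddle's PaddleHub makes that easy.
PaddleHub is the pretrained model management tool of the PaddlePaddle ecosystem. It aims to make large-scale pretrained models more accessible to developers in that ecosystem. With PaddleHub, users can conveniently obtain PaddlePaddle pretrained models and, combined with the Fine-tune API, quickly complete the whole workflow from transfer learning to application deployment, so that pretrained models better serve each user's specific scenario.
PaddleHub currently supports five major categories: text, image, video, speech, and industrial applications. It provides a large number of high-quality pretrained models that cover a wide range of tasks, including but not limited to lexical analysis, sentiment analysis, image classification, image segmentation, object detection, keypoint detection, and video classification. As one of the most active parts of the PaddlePaddle ecosystem, PaddleHub also follows current events and promptly open-sources models for practical scenarios, such as masked-face detection and classification and pneumonia CT image analysis (see Figure 1), to help developers build applications quickly.
Figure 1: Pneumonia CT image analysis and masked-face detection and classification results
Normally, a user who wants a model for inference has to collect and annotate training data, develop the algorithm, train the model, and deploy it for prediction. Each of these steps takes considerable manpower and cost. PaddleHub was built to address this problem: users can use its pretrained models directly, or train the model they need through transfer learning, and get inference running quickly.
So what is transfer learning? Put simply, it means using existing knowledge to learn new knowledge; for example, someone who can ride a bicycle will learn to ride an electric bike fairly quickly. A common form of transfer learning is fine-tuning a pretrained model: based on the current task, the user picks from PaddleHub a model that has already been trained successfully, ideally on a dataset similar to that of the new scenario, and then only needs to fine-tune the model parameters on the new scenario's data to complete training.
In short, PaddleHub simplifies data collection, algorithm development, model training, and prediction deployment so that models work out of the box, and adding high-quality domain data is all that is needed to quickly improve model quality.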
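To make the idea concrete, here is a minimal pure-Python sketch of one simple form of transfer learning. It is an illustration under stated assumptions, not PaddleHub code: `pretrained_features` is a hypothetical stand-in for a pretrained network whose parameters stay frozen, the dataset is made up, and only a small new classifier head is trained on the new task's data. Full fine-tuning would typically also update the pretrained parameters.

```python
import math


def pretrained_features(x):
    """Hypothetical stand-in for a frozen pretrained network (never updated here)."""
    return [math.tanh(x[0] + x[1]), math.tanh(x[0] - x[1])]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Probability of class 1 from the new head applied to the frozen features."""
    feats = pretrained_features(x)
    return sigmoid(sum(w * f for w, f in zip(weights, feats)) + bias)


def mean_loss(weights, bias, samples):
    """Average binary cross-entropy over (input, label) pairs."""
    total = 0.0
    for x, y in samples:
        p = predict(weights, bias, x)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(samples)


def train_head(samples, epochs=200, lr=0.5):
    """Train only the new head with full-batch gradient descent."""
    weights, bias = [0.0, 0.0], 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in samples:
            feats = pretrained_features(x)
            error = predict(weights, bias, x) - y  # gradient of the loss w.r.t. the logit
            grad_w = [g + error * f for g, f in zip(grad_w, feats)]
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Tiny made-up dataset for the "new task": the label is 1 when x[0] + x[1] > 0.
new_task_data = [([1.0, 0.5], 1), ([0.8, 0.1], 1), ([-0.9, -0.2], 0), ([-0.4, -0.7], 0)]
head_weights, head_bias = train_head(new_task_data)
print("loss before:", mean_loss([0.0, 0.0], 0.0, new_task_data))
print("loss after: ", mean_loss(head_weights, head_bias, new_task_data))
```

Because the feature extractor is reused as-is, only three numbers (two weights and a bias) have to be learned from the new data, which is why a handful of examples is enough in this toy setting.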
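PaddleHub provides three main functions:
- Quick inference from the command line: following the "model as software" design philosophy, PaddleHub lets users run predictions through the Python API or the command line, which makes the PaddlePaddle model zoo easier to use.
- Transfer learning with pretrained models: choose a high-quality pretrained model and combine it with the Fine-tune API to finish model training in a short time.
- One-command service deployment with PaddleHub Serving: set up an API service for your own model with a simple command.
Prerequisites
Before using PaddleHub, complete the following tasks:
- Install Python: version 3.5 or later on Linux or macOS; version 3.6 or later on Windows.
- Install PaddlePaddle 2.0; see Quick Installation for details.
- Install PaddleHub 2.0 or later.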
!pip install paddlehub==2.0.0rc
Successfully installed easydict-1.9 filelock-3.0.12 gitdb-4.0.5 gitpython-3.1.12 packaging-20.8 paddlehub-2.0.0rc0 paddlenlp-2.0.0b3 seqeval-1.2.2 smmap-3.0.4
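The prerequisites above can also be checked from Python before going further. The sketch below is illustrative only: the helper names are hypothetical, the minimum Python versions are the ones listed above, and the package check merely reports whether `paddle` and `paddlehub` can be found, without verifying their versions.

```python
import importlib.util
import platform
import sys


def minimum_python(system_name):
    """Minimum Python version listed above: 3.6 on Windows, 3.5 elsewhere."""
    return (3, 6) if system_name == "Windows" else (3, 5)


def python_is_supported(version_info=sys.version_info, system_name=None):
    """Return True if the interpreter meets the minimum for this operating system."""
    system_name = system_name or platform.system()
    return tuple(version_info[:2]) >= minimum_python(system_name)


def missing_packages(names=("paddle", "paddlehub")):
    """Return the packages from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python version OK:", python_is_supported())
    print("Missing packages:", missing_packages() or "none")
```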
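Note:
Downloading datasets and pretrained models with PaddleHub requires internet access. You can call server_check() to check the connection between the local machine and the remote PaddleHub-Server, as shown below. If the server is reachable, the message "Request Hub-Server successfully" is displayed; otherwise "Request Hub-Server unsuccessfully" is displayed.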
import paddlehub
paddlehub.server_check()
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/matplotlib/__init__.py:107: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
from collections import MutableMapping
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/matplotlib/rcsetup.py:20: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
from collections import Iterable, Mapping
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/matplotlib/colors.py:53: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
from collections import Sized
[2021-01-14 18:04:26,869] [ INFO] - Request Hub-Server successfully.
True
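Pretrained Models
PaddleHub supports more than 200 mainstream pretrained models covering image classification, keypoint detection, object detection, text recognition, image generation, face detection, image editing, image segmentation, video classification, video restoration, lexical analysis, semantic models, sentiment analysis, text moderation, text generation, speech synthesis, industrial quality inspection, and more.
On the official website, click the "All Models" (所有模型) link in the "Learn Models" (学习模型) section of the home page to view all pretrained models supported by PaddleHub. As shown in Figure 2, the navigation bar on the left lists the model types, and within each type users can see nearly two hundred pretrained models organized by network architecture, pretraining dataset, and similar attributes. To the right of the navigation bar, brief information about the pretrained models of the selected type is presented as cards, including the model name, scenario category (image, text, video, speech, industrial application), network type, pretraining dataset, and a short description. To view the details of a particular pretrained model, click its card.
Figure 2: The All Models page
After choosing a pretrained model, install it by following the "Select a model version to install" section in the model's details on the official website. For example, the installation command for the lac model is: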
! hub install lac
You are using Paddle compiled with TensorRT, but TensorRT dynamic library is not found. Ignore this if TensorRT is not needed.
Download https://bj.bcebos.com/paddlehub/paddlehub_dev/lac_2.2.0.tar.gz
[##################################################] 100.00%
Decompress /home/aistudio/.paddlehub/tmp/tmpzb35zm2v/lac_2.2.0.tar.gz
[##################################################] 100.00%
[2021-01-14 18:04:42,668] [ INFO] - Successfully installed lac-2.2.0
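Quick Inference from the Command Line
To let users try PaddlePaddle model inference quickly, PaddleHub supports quick inference from the command line. For example, the following command uses the lexical analysis model LAC (Lexical Analysis of Chinese) to segment a sentence into words.
Note: LAC is a joint lexical analysis model that performs Chinese word segmentation, part-of-speech tagging, and named entity recognition as a whole.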
!hub run lac --input_text "现在,慕尼黑再保险公司不仅是此类行动的倡议者,更是将其大量气候数据整合进保险产品中,并与公众共享大量天气信息,参与到新能源领域的保障中。"
[2021-01-14 18:04:56,552] [ WARNING] - The _initialize method in HubModule will soon be deprecated, you can use the __init__() to handle the initialization of the object
W0114 18:04:56.589459 519 analysis_predictor.cc:1058] Deprecated. Please use CreatePredictor instead.
[{'word': ['现在', ',', '慕尼黑再保险公司', '不仅', '是', '此类', '行动', '的', '倡议者', ',', '更是', '将', '其', '大量', '气候', '数据', '整合', '进', '保险', '产品', '中', ',', '并', '与', '公众', '共享', '大量', '天气', '信息', ',', '参与', '到', '新能源', '领域', '的', '保障', '中', '。'], 'tag': ['TIME', 'w', 'ORG', 'c', 'v', 'r', 'n', 'u', 'n', 'w', 'd', 'p', 'r', 'a', 'n', 'n', 'v', 'v', 'n', 'n', 'f', 'w', 'c', 'p', 'n', 'v', 'a', 'n', 'n', 'w', 'v', 'v', 'n', 'n', 'u', 'vn', 'f', 'w']}]
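The result printed above is a list with one dictionary per input text, holding parallel 'word' and 'tag' lists. As a small illustration of how such a result could be post-processed, the sketch below pairs each word with its tag and picks out the words tagged ORG. The helper names are hypothetical, and the sample is just the first five tokens of the output shown above.

```python
def pair_words_with_tags(result):
    """Turn one LAC-style result dict into a list of (word, tag) tuples."""
    return list(zip(result["word"], result["tag"]))


def words_with_tag(result, tag):
    """Return the words whose tag equals `tag`, in their original order."""
    return [word for word, t in pair_words_with_tags(result) if t == tag]


# First five tokens of the output above.
sample = {
    "word": ["现在", ",", "慕尼黑再保险公司", "不仅", "是"],
    "tag": ["TIME", "w", "ORG", "c", "v"],
}

print(pair_words_with_tags(sample)[:3])
print(words_with_tag(sample, "ORG"))
```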
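The command-line format for quick inference is shown below, with the following parameters:
- module-name: the model name.
- input-parameter: the input parameter, i.e. "--input_text" in the example above.
- input-value: the input value for inference, i.e. the quoted sentence that follows --input_text in the example above (any text such as "今天是个好日子" can be used).
The command-line format and parameter values differ from model to model; see the "Command-line prediction example" section of each model for details.
hub run ${module-name} ${input-parameter} ${input-value}
Currently only some of the pretrained models in PaddleHub support quick inference from the command line. To find out whether a given model supports it, check whether its introduction on the official website includes sections on command-line prediction and service deployment.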
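When the command needs to be issued from a script rather than typed by hand, the format above can be assembled programmatically. The following is a minimal sketch using only the Python standard library; `build_hub_run_command` and `run_hub_command` are hypothetical helpers, and actually running the command assumes the `hub` executable is installed and on the PATH.

```python
import shlex
import subprocess


def build_hub_run_command(module_name, input_parameter, input_value):
    """Return the argument list for: hub run <module-name> <input-parameter> <input-value>."""
    return ["hub", "run", module_name, input_parameter, input_value]


def run_hub_command(command):
    """Run the command and return its standard output (requires PaddleHub to be installed)."""
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return completed.stdout


command = build_hub_run_command("lac", "--input_text", "今天是个好日子")
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(command))
```

Passing the arguments as a list avoids shell quoting problems when the input text contains spaces or punctuation.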
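Figure 3: Example of a prediction model page
Transfer Learning with Pretrained Models
With a high-quality pretrained model and the PaddleHub Fine-tune API, users need only a small amount of code to build deep learning models for natural language processing and computer vision scenarios. Taking text classification as an example, the process has four steps:
1. Select and load a pretrained model
This example uses the ERNIE Tiny model to demonstrate fine-tuning with PaddleHub. ERNIE Tiny compresses the ERNIE 2.0 Base model, mainly through model structure compression and model distillation. Compared with ERNIE 2.0, ERNIE Tiny delivers a 4.3x prediction speedup, which makes it better suited to industrial deployment.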
!hub install ernie_tiny==2.0.1
Download https://bj.bcebos.com/paddlehub/paddlehub_dev/ernie_tiny_2.0.1.tar.gz
[##################################################] 100.00%
Decompress /home/aistudio/.paddlehub/tmp/tmp9cxbz2jk/ernie_tiny_2.0.1.tar.gz
[##################################################] 100.00%
[2021-01-14 18:05:14,353] [ INFO] - Successfully installed ernie_tiny-2.0.1
import paddlehub as hub
model = hub.Module(name='ernie_tiny', version='2.0.1', task='seq-cls', num_classes=2)
[2021-01-14 18:05:21,013] [ INFO] - Downloading https://paddlenlp.bj.bcebos.com/models/transformers/ernie_tiny/ernie_tiny.pdparams and saved to /home/aistudio/.paddlenlp/models/ernie-tiny
[2021-01-14 18:05:21,015] [ INFO] - Downloading ernie_tiny.pdparams from https://paddlenlp.bj.bcebos.com/models/transformers/ernie_tiny/ernie_tiny.pdparams
100%|██████████| 354158/354158 [00:08<00:00, 43591.73it/s]
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py:1245: UserWarning: Skip loading for classifier.weight. classifier.weight is not found in the provided dict.
warnings.warn(("Skip loading for {}. ".format(key) + str(err)))
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py:1245: UserWarning: Skip loading for classifier.bias. classifier.bias is not found in the provided dict.
warnings.warn(("Skip loading for {}. ".format(key) + str(err)))
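The parameters are:
- name: the model name; options include ernie, ernie_tiny, bert-base-cased, bert-base-chinese, roberta-wwm-ext, roberta-wwm-ext-large, and others.
- version: the module version number.
- task: the fine-tune task. Here it is seq-cls, which denotes a text classification task.
- num_classes: the number of classes in the current text classification task, determined by the dataset used; the default is 2.
PaddleHub also offers BERT and other models. The models that currently support text classification, with their loading examples, are listed below: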
Model | PaddleHub Module
ERNIE, Chinese | hub.Module(name='ernie')
ERNIE tiny, Chinese | hub.Module(name='ernie_tiny')
ERNIE 2.0 Base, English | hub.Module(name='ernie_v2_eng_base')
ERNIE 2.0 Large, English | hub.Module(name='ernie_v2_eng_large')
BERT-Base, English Cased | hub.Module(name='bert-base-cased')
BERT-Base, English Uncased | hub.Module(name='bert-base-uncased')
BERT-Large, English Cased | hub.Module(name='bert-large-cased')
BERT-Large, English Uncased | hub.Module(name='bert-large-uncased')
BERT-Base, Multilingual Cased | hub.Module(name='bert-base-multilingual-cased')
BERT-Base, Multilingual Uncased | hub.Module(name='bert-base-multilingual-uncased')
BERT-Base, Chinese | hub.Module(name='bert-base-chinese')
BERT-wwm, Chinese | hub.Module(name='chinese-bert-wwm')
BERT-wwm-ext, Chinese | hub.Module(name='chinese-bert-wwm-ext')
RoBERTa-wwm-ext, Chinese | hub.Module(name='roberta-wwm-ext')
RoBERTa-wwm-ext-large, Chinese | hub.Module(name='roberta-wwm-ext-large')
RBT3, Chinese | hub.Module(name='rbt3')
RBTL3, Chinese | hub.Module(name='rbtl3')
ELECTRA-Small, English | hub.Module(name='electra-small')
ELECTRA-Base, English | hub.Module(name='electra-base')
ELECTRA-Large, English | hub.Module(name='electra-large')
ELECTRA-Base, Chinese | hub.Module(name='chinese-electra-base')
ELECTRA-Small, Chinese | hub.Module(name='chinese-electra-small')
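With the single line of code above, model is initialized as a model suited to text classification: the ERNIE Tiny pretrained model followed by a fully connected layer.
The figure above is from: https://arxiv.org/pdf/1810.04805.pdf
2. Prepare the dataset and read the data
Users can run transfer training on either a custom dataset or a dataset provided by PaddleHub.
(1) ChnSentiCorp, a dataset provided by PaddleHub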
# Automatically download the dataset and extract it to $HUB_HOME/.paddlehub/dataset under the user directory
train_dataset = hub.datasets.ChnSentiCorp(
    tokenizer=model.get_tokenizer(), max_seq_len=128, mode='train')
dev_dataset = hub.datasets.ChnSentiCorp(
    tokenizer=model.get_tokenizer(), max_seq_len=128, mode='dev')
[2021-01-14 18:06:07,529] [ INFO] - Downloading vocab.txt from https://paddlenlp.bj.bcebos.com/models/transformers/ernie_tiny/vocab.txt
100%|██████████| 459/459 [00:00<00:00, 6793.01it/s]
[2021-01-14 18:06:07,889] [ INFO] - Downloading spm_cased_simp_sampled.model from https://paddlenlp.bj.bcebos.com/models/transformers/ernie_tiny/spm_cased_simp_sampled.model
100%|██████████| 1083/1083 [00:00<00:00, 8108.15it/s]
[2021-01-14 18:06:08,252] [ INFO] - Downloading dict.wordseg.pickle from https://paddlenlp.bj.bcebos.com/models/transformers/ernie_tiny/dict.wordseg.pickle
100%|██████████| 161822/161822 [00:04<00:00, 39625.95it/s]
Download https://bj.bcebos.com/paddlehub-dataset/chnsenticorp.tar.gz
[##################################################] 100.00%
Decompress /home/aistudio/.paddlehub/tmp/tmp09k65v5a/chnsenticorp.tar.gz
[##################################################] 100.00%
[2021-01-14 18:06:23,215] [ INFO] - Found /home/aistudio/.paddlenlp/models/ernie-tiny/vocab.txt
[2021-01-14 18:06:23,222] [ INFO] - Found /home/aistudio/.paddlenlp/models/ernie-tiny/spm_cased_simp_sampled.model
[2021-01-14 18:06:23,225] [ INFO] - Found /home/aistudio/.paddlenlp/models/ernie-tiny/dict.wordseg.pickle
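- tokenizer: the tokenizer required by the module. It segments the input text and converts it into the input format the module needs.
- mode: the data split to load. The options are train, dev (the validation set, as used above), and test; the default is train.
- max_seq_len: the maximum sequence length used by ERNIE/BERT models. If you run out of GPU memory, lower this value.
The pretrained ERNIE model processes Chinese text character by character. The tokenizer converts raw input text into the input form that the model accepts. The pretrained models in PaddleHub 2.0 come with a matching built-in tokenizer, which can be obtained through the model.get_tokenizer method.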
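To illustrate what converting raw text into model input involves, here is a toy character-level sketch. It is not the PaddleHub tokenizer: the vocabulary, the special-token ids, and the function name are made up for this example, and the real tokenizer returned by model.get_tokenizer() relies on the model's own vocabulary files. The sketch only shows the general pattern of splitting text into characters, adding special tokens, truncating to max_seq_len, and padding.

```python
# Made-up vocabulary for illustration only.
TOY_VOCAB = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[UNK]": 3, "今": 4, "天": 5, "好": 6}


def toy_encode(text, max_seq_len=8, vocab=TOY_VOCAB):
    """Encode `text` character by character into exactly `max_seq_len` ids."""
    # Reserve two positions for the [CLS] and [SEP] special tokens.
    chars = list(text)[: max_seq_len - 2]
    tokens = ["[CLS]"] + chars + ["[SEP]"]
    # Characters missing from the vocabulary map to [UNK].
    ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    seq_len = len(ids)  # number of real (non-padding) positions
    ids += [vocab["[PAD]"]] * (max_seq_len - seq_len)
    return {"input_ids": ids, "seq_len": seq_len}


encoded = toy_encode("今天好!")
print(encoded)
```

A smaller max_seq_len shortens every encoded example, which is why lowering it reduces memory use at the cost of truncating long texts.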
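(2) Custom datasets
To use a custom dataset, preprocess it into a format the pretrained model can read. For example, for a PaddleHub text classification task, the custom data must be split into a training set, a validation set, and a test set.
a. Set up the dataset directory.
Arrange the dataset directory as follows.
├──data: data directory
    ├── train.txt: training set
    ├── dev.txt: validation set
    └── test.txt: test set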
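One way to produce this layout from a single list of labeled examples is sketched below. The helper name, the 80/10/10 split ratio, the fixed random seed, and the demo texts are assumptions chosen for illustration; the header line and Tab-separated columns follow the file format described in step b, and the sketch assumes the texts themselves contain no Tab or newline characters.

```python
import os
import random
import tempfile


def split_and_write(samples, out_dir, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle (label, text) pairs and write train.txt, dev.txt and test.txt to out_dir."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_dev = int(len(shuffled) * ratios[1])
    splits = {
        "train.txt": shuffled[:n_train],
        "dev.txt": shuffled[n_train:n_train + n_dev],
        "test.txt": shuffled[n_train + n_dev:],  # whatever remains
    }
    os.makedirs(out_dir, exist_ok=True)
    for file_name, rows in splits.items():
        with open(os.path.join(out_dir, file_name), "w", encoding="utf-8") as f:
            f.write("label\ttext_a\n")  # header line
            for label, text in rows:
                f.write(f"{label}\t{text}\n")
    return {name: len(rows) for name, rows in splits.items()}


# Made-up demo data: ten short texts over three labels.
demo_samples = [(["房产", "教育", "社会"][i % 3], f"示例文本{i}") for i in range(10)]
with tempfile.TemporaryDirectory() as tmp_dir:
    counts = split_and_write(demo_samples, os.path.join(tmp_dir, "data"))
print(counts)
```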
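b. Set the file format and content.
UTF-8 encoding is recommended for the training, validation, and test files. The first column is the class label and the second column is the text, separated by a Tab. It is recommended to put the column names "label" and "text_a" on the first line of each file, also separated by a Tab, for example: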
label text_a
房产 昌平京基鹭府10月29日推别墅1200万套起享97折
教育 贵州2011高考录取分数线发布理科一本448分
社会 众多白领因集体户口面临结婚难题
...
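Before training, it can be worth checking that a file really follows this layout. The sketch below is one possible check using only plain Python; the function name and error messages are hypothetical. It verifies the header line, the two Tab-separated columns, and that every label belongs to the expected label list.

```python
def parse_dataset_lines(lines, label_list):
    """Validate a header plus 'label<TAB>text' rows and return (label, text) pairs."""
    # Blank lines are skipped, so row numbers below count non-blank lines only.
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if not rows or rows[0].split("\t") != ["label", "text_a"]:
        raise ValueError("first line must be 'label<TAB>text_a'")
    examples = []
    for row_number, row in enumerate(rows[1:], start=2):
        columns = row.split("\t")
        if len(columns) != 2:
            raise ValueError(f"row {row_number}: expected 2 columns, got {len(columns)}")
        label, text = columns
        if label not in label_list:
            raise ValueError(f"row {row_number}: unknown label {label!r}")
        examples.append((label, text))
    return examples


sample_lines = [
    "label\ttext_a\n",
    "房产\t昌平京基鹭府10月29日推别墅1200万套起享97折\n",
    "教育\t贵州2011高考录取分数线发布理科一本448分\n",
    "社会\t众多白领因集体户口面临结婚难题\n",
]
examples = parse_dataset_lines(sample_lines, label_list=["房产", "教育", "社会"])
print(len(examples), "valid examples")
```

To check a real file, pass an open file object instead of the list, for example `parse_dataset_lines(open(path, encoding="utf-8"), labels)`.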
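c. Load the custom dataset.
To load a custom text classification dataset, simply subclass the base class TextClassificationDataset and change the dataset location and the label list, as in the following code: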
from paddlehub.datasets.base_nlp_dataset import TextClassificationDataset

class SeqClsDataset(TextClassificationDataset):
    # Directory where the dataset is stored
    base_path = '/path/to/dataset'
    # List of the dataset's labels
    label_list = ['体育', '科技', '社会', '娱乐', '股票', '房产', '教育', '时政', '财经', '星座', '游戏', '家居', '彩票', '时尚']

    def __init__(self, tokenizer, max_seq_len: int = 128, mode: str = 'train'):
        # Map the requested split to its data file; anything other than train/test uses dev.txt
        if mode == 'train':
            data_file = 'train.txt'
        elif mode == 'test':
            data_file = 'test.txt'
        else:
            data_file = 'dev.txt'
        super().__init__(
            base_path=self.base_path,
            tokenizer=tokenizer,
            max_seq_len=max_seq_len,
            mode=mode,
            data_file=data_file,
            label_list=self.label_list,
            is_file_with_header=True)

# Choose the model you need and get its tokenizer
import paddlehub as hub
model = hub.Module(name='ernie_tiny', task='seq-cls', num_classes=len(SeqClsDataset.label_list))
tokenizer = model.get_tokenizer()

# Instantiate the training set
train_dataset = SeqClsDataset(tokenizer)
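Users can now instantiate SeqClsDataset to obtain the dataset and use hub.Trainer to fine-tune the pretrained model on the text classification task; see the PaddleHub text classification demo for details.
Note:
For custom datasets with CV pretrained models, see "Adapting PaddleHub to custom data for fine-tuning" (PaddleHub适配自定义数据完成finetune).
3. Choose the optimization strategy and run configuration
Run the following code to fine-tune the text classification model: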
import paddle
optimizer = paddle.optimizer.Adam(learning_rate=5e-5, parameters=model.parameters())
trainer = hub.Trainer(model, optimizer, checkpoint_dir='test_ernie_text_cls', use_gpu=True)
trainer.train(train_dataset, epochs=3, batch_size=32, eval_dataset=dev_dataset, save_interval=1)
[2021-01-14 18:06:45,223] [ WARNING] - PaddleHub model checkpoint not found, start from scratch...
[2021-01-14 18:06:46,358] [ TRAIN] - Epoch=1/3, Step=10/300 loss=0.6446 acc=0.6375 lr=0.000050 step/sec=8.96 | ETA 00:01:40
[2021-01-14 18:06:47,307] [ TRAIN] - Epoch=1/3, Step=20/300 loss=0.4035 acc=0.8688 lr=0.000050 step/sec=10.54 | ETA 00:01:32
[2021-01-14 18:06:48,258] [ TRAIN] - Epoch=1/3, Step=30/300 loss=0.2783 acc=0.8812 lr=0.000050 step/sec=10.51 | ETA 00:01:30
[2021-01-14 18:06:49,210] [ TRAIN] - Epoch=1/3, Step=40/300 loss=0.2588 acc=0.9000 lr=0.000050 step/sec=10.50 | ETA 00:01:29
[2021-01-14 18:06:50,158] [ TRAIN] - Epoch=1/3, Step=50/300 loss=0.2476 acc=0.9062 lr=0.000050 step/sec=10.55 | ETA 00:01:28
[2021-01-14 18:06:51,105] [ TRAIN] - Epoch=1/3, Step=60/300 loss=0.2832 acc=0.9062 lr=0.000050 step/sec=10.56 | ETA 00:01:27
[2021-01-14 18:06:52,051] [ TRAIN] - Epoch=1/3, Step=70/300 loss=0.2453 acc=0.9031 lr=0.000050 step/sec=10.58 | ETA 00:01:27
[2021-01-14 18:06:53,000] [ TRAIN] - Epoch=1/3, Step=80/300 loss=0.3446 acc=0.8781 lr=0.000050 step/sec=10.53 | ETA 00:01:27
[2021-01-14 18:06:53,946] [ TRAIN] - Epoch=1/3, Step=90/300 loss=0.2419 acc=0.9094 lr=0.000050 step/sec=10.56 | ETA 00:01:27
[2021-01-14 18:06:54,897] [ TRAIN] - Epoch=1/3, Step=100/300 loss=0.2760 acc=0.8938 lr=0.000050 step/sec=10.52 | ETA 00:01:26
[2021-01-14 18:06:55,846] [ TRAIN] - Epoch=1/3, Step=110/300 loss=0.2552 acc=0.9031 lr=0.000050 step/sec=10.54 | ETA 00:01:26
[2021-01-14 18:06:56,795] [ TRAIN] - Epoch=1/3, Step=120/300 loss=0.2802 acc=0.8844 lr=0.000050 step/sec=10.54 | ETA 00:01:26
[2021-01-14 18:06:57,746] [ TRAIN] - Epoch=1/3, Step=130/300 loss=0.2462 acc=0.9031 lr=0.000050 step/sec=10.51 | ETA 00:01:26
[2021-01-14 18:06:58,698] [ TRAIN] - Epoch=1/3, Step=140/300 loss=0.2153 acc=0.9094 lr=0.000050 step/sec=10.50 | ETA 00:01:26
[2021-01-14 18:06:59,651] [ TRAIN] - Epoch=1/3, Step=150/300 loss=0.2140 acc=0.9187 lr=0.000050 step/sec=10.49 | ETA 00:01:26
[2021-01-14 18:07:00,611] [ TRAIN] - Epoch=1/3, Step=160/300 loss=0.2318 acc=0.9250 lr=0.000050 step/sec=10.42 | ETA 00:01:26
[2021-01-14 18:07:01,563] [ TRAIN] - Epoch=1/3, Step=170/300 loss=0.2424 acc=0.8969 lr=0.000050 step/sec=10.51 | ETA 00:01:26
[2021-01-14 18:07:02,515] [ TRAIN] - Epoch=1/3, Step=180/300 loss=0.1933 acc=0.9250 lr=0.000050 step/sec=10.50 | ETA 00:01:26
[2021-01-14 18:07:03,468] [ TRAIN] - Epoch=1/3, Step=190/300 loss=0.2376 acc=0.9156 lr=0.000050 step/sec=10.50 | ETA 00:01:26
[2021-01-14 18:07:04,415] [ TRAIN] - Epoch=1/3, Step=200/300 loss=0.2600 acc=0.8938 lr=0.000050 step/sec=10.56 | ETA 00:01:26
[2021-01-14 18:07:05,372] [ TRAIN] - Epoch=1/3, Step=210/300 loss=0.1915 acc=0.9219 lr=0.000050 step/sec=10.45 | ETA 00:01:26
[2021-01-14 18:07:06,328] [ TRAIN] - Epoch=1/3, Step=220/300 loss=0.2076 acc=0.9313 lr=0.000050 step/sec=10.46 | ETA 00:01:26
[2021-01-14 18:07:07,276] [ TRAIN] - Epoch=1/3, Step=230/300 loss=0.1849 acc=0.9281 lr=0.000050 step/sec=10.55 | ETA 00:01:26
[2021-01-14 18:07:08,230] [ TRAIN] - Epoch=1/3, Step=240/300 loss=0.2051 acc=0.9219 lr=0.000050 step/sec=10.48 | ETA 00:01:26
[2021-01-14 18:07:09,178] [ TRAIN] - Epoch=1/3, Step=250/300 loss=0.2602 acc=0.9125 lr=0.000050 step/sec=10.55 | ETA 00:01:26
[2021-01-14 18:07:10,127] [ TRAIN] - Epoch=1/3, Step=260/300 loss=0.1979 acc=0.9281 lr=0.000050 step/sec=10.54 | ETA 00:01:26
[2021-01-14 18:07:11,087] [ TRAIN] - Epoch=1/3, Step=270/300 loss=0.1809 acc=0.9406 lr=0.000050 step/sec=10.41 | ETA 00:01:26
[2021-01-14 18:07:12,041] [ TRAIN] - Epoch=1/3, Step=280/300 loss=0.2120 acc=0.9125 lr=0.000050 step/sec=10.49 | ETA 00:01:26
[2021-01-14 18:07:12,997] [ TRAIN] - Epoch=1/3, Step=290/300 loss=0.1672 acc=0.9313 lr=0.000050 step/sec=10.45 | ETA 00:01:26
[2021-01-14 18:07:13,941] [ TRAIN] - Epoch=1/3, Step=300/300 loss=0.2095 acc=0.9187 lr=0.000050 step/sec=10.60 | ETA 00:01:26
[2021-01-14 18:07:15,169] [ EVAL] - Evaluation on validation dataset: [Evaluation result] avg_acc=0.9292
[2021-01-14 18:07:27,287] [ EVAL] - Saving best model to test_ernie_text_cls/best_model [best acc=0.9292]
[2021-01-14 18:07:27,289] [ INFO] - Saving model checkpoint to test_ernie_text_cls/epoch_1
[2021-01-14 18:07:40,309] [ TRAIN] - Epoch=2/3, Step=10/300 loss=0.1009 acc=0.9719 lr=0.000050 step/sec=0.38 | ETA 00:02:39
[2021-01-14 18:07:41,258] [ TRAIN] - Epoch=2/3, Step=20/300 loss=0.1035 acc=0.9656 lr=0.000050 step/sec=10.54 | ETA 00:02:37
[2021-01-14 18:07:42,203] [ TRAIN] - Epoch=2/3, Step=30/300 loss=0.0717 acc=0.9781 lr=0.000050 step/sec=10.58 | ETA 00:02:35
[2021-01-14 18:07:43,164] [ TRAIN] - Epoch=2/3, Step=40/300 loss=0.1062 acc=0.9625 lr=0.000050 step/sec=10.41 | ETA 00:02:33
[2021-01-14 18:07:44,123] [ TRAIN] - Epoch=2/3, Step=50/300 loss=0.0798 acc=0.9688 lr=0.000050 step/sec=10.43 | ETA 00:02:31
[2021-01-14 18:07:45,080] [ TRAIN] - Epoch=2/3, Step=60/300 loss=0.0684 acc=0.9750 lr=0.000050 step/sec=10.46 | ETA 00:02:29
[2021-01-14 18:07:46,030] [ TRAIN] - Epoch=2/3, Step=70/300 loss=0.1395 acc=0.9563 lr=0.000050 step/sec=10.52 | ETA 00:02:27
[2021-01-14 18:07:46,978] [ TRAIN] - Epoch=2/3, Step=80/300 loss=0.0953 acc=0.9750 lr=0.000050 step/sec=10.55 | ETA 00:02:26
[2021-01-14 18:07:47,928] [ TRAIN] - Epoch=2/3, Step=90/300 loss=0.1744 acc=0.9469 lr=0.000050 step/sec=10.53 | ETA 00:02:24
[2021-01-14 18:07:48,878] [ TRAIN] - Epoch=2/3, Step=100/300 loss=0.1134 acc=0.9563 lr=0.000050 step/sec=10.53 | ETA 00:02:23
[2021-01-14 18:07:49,824] [ TRAIN] - Epoch=2/3, Step=110/300 loss=0.1100 acc=0.9719 lr=0.000050 step/sec=10.57 | ETA 00:02:21
[2021-01-14 18:07:50,774] [ TRAIN] - Epoch=2/3, Step=120/300 loss=0.1317 acc=0.9594 lr=0.000050 step/sec=10.53 | ETA 00:02:20
[2021-01-14 18:07:51,728] [ TRAIN] - Epoch=2/3, Step=130/300 loss=0.1149 acc=0.9594 lr=0.000050 step/sec=10.48 | ETA 00:02:19
[2021-01-14 18:07:52,678] [ TRAIN] - Epoch=2/3, Step=140/300 loss=0.1106 acc=0.9594 lr=0.000050 step/sec=10.53 | ETA 00:02:17
[2021-01-14 18:07:53,629] [ TRAIN] - Epoch=2/3, Step=150/300 loss=0.1503 acc=0.9437 lr=0.000050 step/sec=10.51 | ETA 00:02:16
[2021-01-14 18:07:54,590] [ TRAIN] - Epoch=2/3, Step=160/300 loss=0.1165 acc=0.9688 lr=0.000050 step/sec=10.40 | ETA 00:02:15
[2021-01-14 18:07:55,547] [ TRAIN] - Epoch=2/3, Step=170/300 loss=0.1219 acc=0.9531 lr=0.000050 step/sec=10.46 | ETA 00:02:14
[2021-01-14 18:07:56,506] [ TRAIN] - Epoch=2/3, Step=180/300 loss=0.0948 acc=0.9688 lr=0.000050 step/sec=10.43 | ETA 00:02:13
[2021-01-14 18:07:57,468] [ TRAIN] - Epoch=2/3, Step=190/300 loss=0.1614 acc=0.9313 lr=0.000050 step/sec=10.40 | ETA 00:02:12
[2021-01-14 18:07:58,429] [ TRAIN] - Epoch=2/3, Step=200/300 loss=0.1075 acc=0.9594 lr=0.000050 step/sec=10.40 | ETA 00:02:11
[2021-01-14 18:07:59,395] [ TRAIN] - Epoch=2/3, Step=210/300 loss=0.0625 acc=0.9781 lr=0.000050 step/sec=10.35 | ETA 00:02:10
[2021-01-14 18:08:00,359] [ TRAIN] - Epoch=2/3, Step=220/300 loss=0.1832 acc=0.9375 lr=0.000050 step/sec=10.37 | ETA 00:02:10
[2021-01-14 18:08:01,325] [ TRAIN] - Epoch=2/3, Step=230/300 loss=0.0925 acc=0.9531 lr=0.000050 step/sec=10.35 | ETA 00:02:09
[2021-01-14 18:08:02,285] [ TRAIN] - Epoch=2/3, Step=240/300 loss=0.1071 acc=0.9594 lr=0.000050 step/sec=10.42 | ETA 00:02:08
[2021-01-14 18:08:03,244] [ TRAIN] - Epoch=2/3, Step=250/300 loss=0.1390 acc=0.9500 lr=0.000050 step/sec=10.42 | ETA 00:02:07
[2021-01-14 18:08:04,203] [ TRAIN] - Epoch=2/3, Step=260/300 loss=0.1107 acc=0.9688 lr=0.000050 step/sec=10.43 | ETA 00:02:06
[2021-01-14 18:08:05,169] [ TRAIN] - Epoch=2/3, Step=270/300 loss=0.1033 acc=0.9563 lr=0.000050 step/sec=10.36 | ETA 00:02:06
[2021-01-14 18:08:06,134] [ TRAIN] - Epoch=2/3, Step=280/300 loss=0.2035 acc=0.9406 lr=0.000050 step/sec=10.36 | ETA 00:02:05
[2021-01-14 18:08:07,093] [ TRAIN] - Epoch=2/3, Step=290/300 loss=0.1285 acc=0.9469 lr=0.000050 step/sec=10.43 | ETA 00:02:04
[2021-01-14 18:08:08,048] [ TRAIN] - Epoch=2/3, Step=300/300 loss=0.1037 acc=0.9688 lr=0.000050 step/sec=10.47 | ETA 00:02:04
[2021-01-14 18:08:09,299] [ EVAL] - Evaluation on validation dataset: [Evaluation result] avg_acc=0.9400
[2021-01-14 18:08:31,268] [ EVAL] - Saving best model to test_ernie_text_cls/best_model [best acc=0.9400]
[2021-01-14 18:08:31,271] [ INFO] - Saving model checkpoint to test_ernie_text_cls/epoch_2
[2021-01-14 18:08:44,266] [ TRAIN] - Epoch=3/3, Step=10/300 loss=0.0417 acc=0.9844 lr=0.000050 step/sec=0.28 | ETA 00:02:55
[2021-01-14 18:08:45,224] [ TRAIN] - Epoch=3/3, Step=20/300 loss=0.0459 acc=0.9844 lr=0.000050 step/sec=10.44 | ETA 00:02:54
[2021-01-14 18:08:46,190] [ TRAIN] - Epoch=3/3, Step=30/300 loss=0.0663 acc=0.9750 lr=0.000050 step/sec=10.35 | ETA 00:02:52
[2021-01-14 18:08:47,144] [ TRAIN] - Epoch=3/3, Step=40/300 loss=0.0633 acc=0.9750 lr=0.000050 step/sec=10.48 | ETA 00:02:51
[2021-01-14 18:08:48,095] [ TRAIN] - Epoch=3/3, Step=50/300 loss=0.0283 acc=0.9969 lr=0.000050 step/sec=10.52 | ETA 00:02:50
[2021-01-14 18:08:49,055] [ TRAIN] - Epoch=3/3, Step=60/300 loss=0.0390 acc=0.9781 lr=0.000050 step/sec=10.42 | ETA 00:02:48
[2021-01-14 18:08:50,009] [ TRAIN] - Epoch=3/3, Step=70/300 loss=0.0752 acc=0.9750 lr=0.000050 step/sec=10.48 | ETA 00:02:47
[2021-01-14 18:08:50,959] [ TRAIN] - Epoch=3/3, Step=80/300 loss=0.0303 acc=0.9844 lr=0.000050 step/sec=10.53 | ETA 00:02:46
[2021-01-14 18:08:51,912] [ TRAIN] - Epoch=3/3, Step=90/300 loss=0.0703 acc=0.9688 lr=0.000050 step/sec=10.49 | ETA 00:02:45
[2021-01-14 18:08:52,866] [ TRAIN] - Epoch=3/3, Step=100/300 loss=0.0521 acc=0.9906 lr=0.000050 step/sec=10.48 | ETA 00:02:44
[2021-01-14 18:08:53,818] [ TRAIN] - Epoch=3/3, Step=110/300 loss=0.0278 acc=0.9875 lr=0.000050 step/sec=10.50 | ETA 00:02:42
[2021-01-14 18:08:54,771] [ TRAIN] - Epoch=3/3, Step=120/300 loss=0.0539 acc=0.9875 lr=0.000050 step/sec=10.50 | ETA 00:02:41
[2021-01-14 18:08:55,735] [ TRAIN] - Epoch=3/3, Step=130/300 loss=0.0273 acc=0.9844 lr=0.000050 step/sec=10.37 | ETA 00:02:40
[2021-01-14 18:08:56,710] [ TRAIN] - Epoch=3/3, Step=140/300 loss=0.0463 acc=0.9812 lr=0.000050 step/sec=10.26 | ETA 00:02:39
[2021-01-14 18:08:57,673] [ TRAIN] - Epoch=3/3, Step=150/300 loss=0.0636 acc=0.9812 lr=0.000050 step/sec=10.38 | ETA 00:02:38
[2021-01-14 18:08:58,651] [ TRAIN] - Epoch=3/3, Step=160/300 loss=0.0455 acc=0.9812 lr=0.000050 step/sec=10.23 | ETA 00:02:37
[2021-01-14 18:08:59,619] [ TRAIN] - Epoch=3/3, Step=170/300 loss=0.0745 acc=0.9812 lr=0.000050 step/sec=10.33 | ETA 00:02:37
[2021-01-14 18:09:00,581] [ TRAIN] - Epoch=3/3, Step=180/300 loss=0.0619 acc=0.9906 lr=0.000050 step/sec=10.39 | ETA 00:02:36
[2021-01-14 18:09:01,541] [ TRAIN] - Epoch=3/3, Step=190/300 loss=0.0867 acc=0.9750 lr=0.000050 step/sec=10.42 | ETA 00:02:35
[2021-01-14 18:09:02,496] [ TRAIN] - Epoch=3/3, Step=200/300 loss=0.0570 acc=0.9781 lr=0.000050 step/sec=10.47 | ETA 00:02:34
[2021-01-14 18:09:03,454] [ TRAIN] - Epoch=3/3, Step=210/300 loss=0.0582 acc=0.9781 lr=0.000050 step/sec=10.44 | ETA 00:02:33
[2021-01-14 18:09:04,405] [ TRAIN] - Epoch=3/3, Step=220/300 loss=0.0804 acc=0.9719 lr=0.000050 step/sec=10.51 | ETA 00:02:32
[2021-01-14 18:09:05,361] [ TRAIN] - Epoch=3/3, Step=230/300 loss=0.0390 acc=0.9906 lr=0.000050 step/sec=10.46 | ETA 00:02:31
[2021-01-14 18:09:06,316] [ TRAIN] - Epoch=3/3, Step=240/300 loss=0.0314 acc=0.9875 lr=0.000050 step/sec=10.47 | ETA 00:02:31
[2021-01-14 18:09:07,272] [ TRAIN] - Epoch=3/3, Step=250/300 loss=0.0564 acc=0.9812 lr=0.000050 step/sec=10.46 | ETA 00:02:30
[2021-01-14 18:09:08,228] [ TRAIN] - Epoch=3/3, Step=260/300 loss=0.0294 acc=0.9938 lr=0.000050 step/sec=10.47 | ETA 00:02:29
[2021-01-14 18:09:09,187] [ TRAIN] - Epoch=3/3, Step=270/300 loss=0.0260 acc=0.9938 lr=0.000050 step/sec=10.42 | ETA 00:02:28
[2021-01-14 18:09:10,148] [ TRAIN] - Epoch=3/3, Step=280/300 loss=0.0523 acc=0.9812 lr=0.000050 step/sec=10.41 | ETA 00:02:28
[2021-01-14 18:09:11,112] [ TRAIN] - Epoch=3/3, Step=290/300 loss=0.1009 acc=0.9688 lr=0.000050 step/sec=10.37 | ETA 00:02:27
[2021-01-14 18:09:12,072] [ TRAIN] - Epoch=3/3, Step=300/300 loss=0.0494 acc=0.9844 lr=0.000050 step/sec=10.42 | ETA 00:02:26
[2021-01-14 18:09:13,319] [ EVAL] - Evaluation on validation dataset: [Evaluation result] avg_acc=0.9458
[2021-01-14 18:09:35,225] [ EVAL] - Saving best model to test_ernie_text_cls/best_model [best acc=0.9458]
[2021-01-14 18:09:35,229] [ INFO] - Saving model checkpoint to test_ernie_text_cls/epoch_3
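The [ TRAIN] lines in the log above share a fixed layout, so the loss and accuracy values can be extracted with a regular expression, for example to plot them. The sketch below is written for this log format only; the pattern and the helper name parse_train_line are assumptions made for illustration and are not part of PaddleHub.

```python
import re

# Pattern for the [ TRAIN] lines shown in the log above (assumed layout).
TRAIN_LINE = re.compile(
    r"Epoch=(?P<epoch>\d+)/\d+, Step=(?P<step>\d+)/\d+ "
    r"loss=(?P<loss>[\d.]+) acc=(?P<acc>[\d.]+)"
)


def parse_train_line(line):
    """Return (epoch, step, loss, acc) for a [ TRAIN] line, or None otherwise."""
    match = TRAIN_LINE.search(line)
    if match is None:
        return None
    return (
        int(match["epoch"]),
        int(match["step"]),
        float(match["loss"]),
        float(match["acc"]),
    )


# Last training line of the log above, copied verbatim.
sample = (
    "[2021-01-14 18:09:12,072] [ TRAIN] - Epoch=3/3, Step=300/300 "
    "loss=0.0494 acc=0.9844 lr=0.000050 step/sec=10.42 | ETA 00:02:26"
)
print(parse_train_line(sample))
```

Lines that do not match, such as the [ INFO] checkpoint messages, return None and can simply be skipped.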
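Optimization strategy
Paddle 2.0-rc provides a range of optimizers, such as SGD, Adam, and Adamax; see the optimizer documentation for details.
Adam takes the following parameters:
- learning_rate: the global learning rate, 1e-3 by default;
- parameters: the model parameters to optimize.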
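Run configuration
Trainer controls Fine-tune training and accepts the following parameters:
- model: the model to optimize;
- optimizer: the optimizer to use;
- use_gpu: whether to use the GPU;
- use_vdl: whether to visualize the training process with VisualDL;
- checkpoint_dir: the directory where model parameters are saved;
- compare_metrics: the metric used to select the best model to save.
trainer.train controls the training run itself and accepts the following parameters:
- train_dataset: the dataset used for training;
- epochs: the number of training epochs;
- batch_size: the training batch size; when using a GPU, adjust batch_size to suit your hardware;
- num_workers: the number of workers, 0 by default;
- eval_dataset: the validation dataset;
- log_interval: the logging interval, measured in training batches;
- save_interval: the model-saving interval, measured in training epochs.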
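To make the two interval parameters concrete, the sketch below simulates when a training loop would log and when it would save. It is a toy model of the semantics described above, not PaddleHub's actual Trainer code. The function name schedule is hypothetical, and the values log_interval=10 and save_interval=1 are assumptions chosen to match the log shown earlier (300 steps per epoch, 3 epochs).

```python
def schedule(steps_per_epoch, epochs, log_interval, save_interval):
    """Yield ("log", epoch, step) and ("save", epoch) events in training order."""
    for epoch in range(1, epochs + 1):
        for step in range(1, steps_per_epoch + 1):
            # log_interval counts training batches (steps).
            if step % log_interval == 0:
                yield ("log", epoch, step)
        # save_interval counts completed epochs.
        if epoch % save_interval == 0:
            yield ("save", epoch)


events = list(schedule(steps_per_epoch=300, epochs=3, log_interval=10, save_interval=1))
log_events = [e for e in events if e[0] == "log"]
save_events = [e for e in events if e[0] == "save"]
print(len(log_events), len(save_events))
```

With these values the simulation produces 30 log events per epoch and one save per epoch, which mirrors the rhythm of the training log.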
4. Model prediction
After Fine-tune completes, the model that performed best on the validation set is saved in the ${CHECKPOINT_DIR}/best_model directory, where ${CHECKPOINT_DIR} is the checkpoint directory chosen for Fine-tune.
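As a small illustration, the path to the saved parameters can be assembled from the checkpoint directory. The helper name best_model_params is hypothetical; the file name model.pdparams is taken from the prediction code in this section, and test_ernie_text_cls is the checkpoint directory that appears in the training log.

```python
from pathlib import Path


def best_model_params(checkpoint_dir):
    """Return the expected location of the best model's parameter file."""
    return Path(checkpoint_dir) / "best_model" / "model.pdparams"


params_path = best_model_params("test_ernie_text_cls")
print(params_path.as_posix())
```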
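The following sentences serve as the data to predict on with this model:
这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般 (This hotel is rather old, and the discounted rooms are mediocre. Overall it is average.)
怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片 (I started the screening full of excitement, but found that after it ended, an episode of a Mickey Mouse cartoon came on.)
作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。 (For an older four-star hotel, the rooms are still very clean and quite good. The airport pickup service is good, and you can check in on the shuttle, which saves time.)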
import paddlehub as hub

# Texts to classify; each sample is a list holding one sentence.
data = [
    ['这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般'],
    ['怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片'],
    ['作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。'],
]
# Map class indices to readable labels.
label_map = {0: 'negative', 1: 'positive'}

# Load ernie_tiny together with the best checkpoint saved during Fine-tune.
model = hub.Module(
    name='ernie_tiny',
    version='2.0.1',
    task='seq-cls',
    load_checkpoint='./test_ernie_text_cls/best_model/model.pdparams',
    label_map=label_map)
results = model.predict(data, max_seq_len=50, batch_size=1, use_gpu=False)
for idx, text in enumerate(data):
    print('Data: {} \t Label: {}'.format(text[0], results[idx]))
[2021-01-14 18:10:49,270] [ INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-tiny/ernie_tiny.pdparams
[2021-01-14 18:10:54,747] [ INFO] - Loaded parameters from /home/aistudio/test_ernie_text_cls/best_model/model.pdparams
[2021-01-14 18:10:54,801] [ INFO] - Found /home/aistudio/.paddlenlp/models/ernie-tiny/vocab.txt
[2021-01-14 18:10:54,804] [ INFO] - Found /home/aistudio/.paddlenlp/models/ernie-tiny/spm_cased_simp_sampled.model
[2021-01-14 18:10:54,807] [ INFO] - Found /home/aistudio/.paddlenlp/models/ernie-tiny/dict.wordseg.pickle
Data: 这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般 Label: negative
Data: 怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片 Label: negative
Data: 作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。 Label: positive
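For transfer-learning instructions for other PaddleHub models, refer to the PaddleHub documentation.
In addition, PaddleHub provides online trial environments on AI Studio for popular, commonly used models:
| Pre-trained model | Task type | Dataset | AI Studio link |
| --- | --- | --- | --- |
| resnet50_vd_imagenet_ssld | Image classification | Flowers dataset | |
| msgnet | Style transfer | MiniCOCO dataset | |
| user_guided_colorization | Image colorization | Canvas oil painting dataset | |
| ernie_tiny | Text classification | ChnSentiCorp sentiment analysis dataset | - |
| ernie_tiny | Sequence labeling | MSRA_NER sequence labeling dataset | - |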
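PaddleHub Serving: one-command service deployment
PaddleHub makes model prediction fast, but developers often need to move a local prediction workflow online. Whether exposing a service port publicly or setting up a prediction service on a local network, PaddleHub needs to be able to deploy model prediction services quickly. PaddleHub Serving, a one-command model serving tool, was created to meet this need. With a single command, developers can start an online model prediction service without having to choose or implement a web framework.
PaddleHub Serving is a one-command model service deployment tool built on PaddleHub. It starts an online prediction service through the Hub command-line tool. The front end handles network requests with Flask and Gunicorn, while the back end calls the PaddleHub prediction interface directly. It also supports multiple processes, using multi-core CPUs to raise concurrency and keep the prediction service performant.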
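1. Supported models
PaddleHub Serving currently supports deploying every PaddleHub model that can predict directly, including NLP models such as lac and senta_bilstm and CV models such as yolov3_darknet53_coco2017 and vgg16_imagenet; for more models, see the list of models supported by PaddleHub. Support for quickly deploying models produced with the PaddleHub Fine-tune API is planned.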
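2. Deployment
To deploy a pre-trained model with PaddleHub Serving, proceed as follows:
(1) Start the server
PaddleHub Serving can be started in two ways: from the command line or from a configuration file.
a. Starting from the command line
Start command: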
hub serving start --modules Module1==Version1 Module2==Version2 ... \
                  --port XXXX \
                  --use_gpu \
                  --use_multiprocess \
                  --workers \
                  --gpu
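Parameters:
| Parameter | Purpose |
| --- | --- |
| --modules/-m | Models pre-installed by PaddleHub Serving, listed as multiple Module==Version pairs |
| --port/-p | Service port, 8866 by default |
| --use_gpu | Use the GPU for prediction; requires paddlepaddle-gpu to be installed |
| --use_multiprocess | Whether to enable concurrent mode; single-process by default. Recommended for multi-core CPU machines |
| --workers | Number of concurrent tasks in concurrent mode; defaults to 2*cpu_count-1, where cpu_count is the number of CPU cores |
| --gpu | IDs of the GPU cards to use; for example, 1,2 uses cards 1 and 2. Only card 0 is used by default |
NOTE: --use_gpu cannot be used together with --use_multiprocess.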
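The options above can also be assembled programmatically, for instance before handing the command to subprocess. The following is a minimal sketch under stated assumptions: the helper name build_start_command is hypothetical, and passing --workers and --gpu as a flag followed by a value is an assumption based on the parameter descriptions, not a confirmed CLI contract. The check reflects the NOTE that --use_gpu and --use_multiprocess cannot be combined.

```python
def build_start_command(modules, port=8866, use_gpu=False,
                        use_multiprocess=False, workers=None, gpu=None):
    """Build the argument list for `hub serving start` (illustrative only)."""
    if use_gpu and use_multiprocess:
        # Mirrors the NOTE above: the two modes are mutually exclusive.
        raise ValueError("--use_gpu cannot be used together with --use_multiprocess")
    cmd = ["hub", "serving", "start", "--modules", *modules, "--port", str(port)]
    if use_gpu:
        cmd.append("--use_gpu")
    if use_multiprocess:
        cmd.append("--use_multiprocess")
    if workers is not None:
        cmd += ["--workers", str(workers)]  # assumed value form
    if gpu is not None:
        cmd += ["--gpu", gpu]  # assumed value form, e.g. "1,2"
    return cmd


command = build_start_command(["lac==1.1.0"], use_multiprocess=True, workers=2)
print(" ".join(command))
```

Building a list rather than a single string avoids shell quoting issues if the command is later passed to subprocess.run.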
b. Starting from a configuration file
Start command:
hub serving start --config config.json
The format of config.json is as follows:
{
    "modules_info": {
        "yolov3_darknet53_coco2017": {
            "init_args": {
                "version": "1.0.0"
            },
            "predict_args": {
                "batch_size": 1,
                "use_gpu": false
            }
        },
        "lac": {
            "init_args": {
                "version": "1.1.0"
            },
            "predict_args": {
                "batch_size": 1,
                "use_gpu": false
            }
        }
    },
    "port": 8866,
    "use_multiprocess": false,
    "workers": 2,
    "gpu": "0,1,2"
}
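Parameters:
| Parameter | Purpose |
| --- | --- |
| modules_info | Models pre-installed by PaddleHub Serving, listed as a dictionary keyed by model name. Each entry holds that model's init_args and predict_args |
| port | Service port, 8866 by default |
| use_gpu | Use the GPU for prediction; requires paddlepaddle-gpu to be installed |
| use_multiprocess | Whether to enable concurrent mode; single-process by default. Recommended for multi-core CPU machines |
| workers | Number of concurrent tasks to start; only takes effect in concurrent mode. Defaults to 2*cpu_count-1, where cpu_count is the number of CPU cores |
| gpu | IDs of the GPU cards to use; for example, 1,2 uses cards 1 and 2. Only card 0 is used by default |
NOTE: --use_gpu cannot be used together with --use_multiprocess.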
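A configuration file of this shape can be generated rather than written by hand. The sketch below builds the dictionary with the standard json module and applies the default of 2*cpu_count-1 workers described in the table. The helper name make_serving_config is hypothetical, and the fixed predict_args values are copied from the example config rather than from any required schema.

```python
import json
import os


def make_serving_config(modules, port=8866, use_multiprocess=False, workers=None):
    """Build a config dictionary; `modules` maps module name to version string."""
    if workers is None:
        # Default described in the table: 2*cpu_count-1.
        cpu_count = os.cpu_count() or 1
        workers = 2 * cpu_count - 1
    return {
        "modules_info": {
            name: {
                "init_args": {"version": version},
                # Values copied from the example config above.
                "predict_args": {"batch_size": 1, "use_gpu": False},
            }
            for name, version in modules.items()
        },
        "port": port,
        "use_multiprocess": use_multiprocess,
        "workers": workers,
    }


config = make_serving_config({"lac": "1.1.0"}, workers=2)
print(json.dumps(config, indent=4))
```

Writing the result with json.dump to config.json would give a file that can be passed to hub serving start --config.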
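(2) Access the server
Once the model prediction service is deployed with PaddleHub Serving, clients can call the prediction interface to obtain results. The interface URL has the following format:
http://127.0.0.1:8866/predict/<MODULE>
where <MODULE> is the model name.
Sending a POST request returns the prediction result. The demo later in this section walks through deploying and using PaddleHub Serving.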
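The URL format and the POST request can be sketched with the standard library alone. The example below only constructs the request and does not send it, so it runs without a server. The helper name build_predict_request is hypothetical, and using urllib here is a choice made to keep the sketch dependency-free; the demo in this section uses the requests package instead.

```python
import json
import urllib.request


def build_predict_request(module, payload, host="127.0.0.1", port=8866):
    """Build (but do not send) a JSON POST request for /predict/<MODULE>."""
    url = "http://{}:{}/predict/{}".format(host, port, module)
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_predict_request("lac", {"texts": ["今天是个好日子"], "batch_size": 1})
print(req.get_method(), req.full_url)

# To actually send the request (requires a running service):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```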
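(3) Custom development with PaddleHub Serving
After deploying a model service with PaddleHub Serving, you can build on the resulting interface, for example by offering a web service or integrating it into an application, which reduces the prediction load on clients and improves performance. A web page demo is shown below:
(4) Stop serving
Use the stop command to shut down a running service:
$ hub serving stop --port XXXX
Parameters:
| Parameter | Purpose |
| --- | --- |
| --port/-p | Port of the service to stop, 8866 by default |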
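Demo
This demo uses two models, the lac word segmentation service and ernie pre-trained word embeddings, to show how to deploy online services with PaddleHub Serving.
(1) Online lac word segmentation service
There are three steps:
Step1. Deploy the lac online service
The goal is to deploy a lac online service that returns word segmentation results for text through an interface.
First, choose either of the two startup methods:
$ hub serving start -m lac
or
$ hub serving start -c serving_config.json
where serving_config.json contains: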
{
    "modules_info": {
        "lac": {
            "init_args": {
                "version": "1.1.0"
            },
            "predict_args": {
                "batch_size": 1,
                "use_gpu": false
            }
        }
    },
    "port": 8866,
    "use_multiprocess": false,
    "workers": 2
}
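The startup success screen is shown in the figure:
The lac online word segmentation service is now deployed on port 8866. The warning shown here is a Flask notice and does not affect usage.
Step2. Access the lac prediction interface
Once the service is deployed, it can be tested. The test texts are 今天是个好日子 ("Today is a good day") and 天气预报说今天要下雨 ("The weather forecast says it will rain today").
The client code is as follows: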
# coding: utf8
import requests
import json

if __name__ == "__main__":
    # Texts to predict on; they go into the dictionary {"texts": [text_1, text_2, ...]}
    text = ["今天是个好日子", "天气预报说今天要下雨"]
    # Each key names an argument passed to the prediction method; this example uses "texts".
    # For a local deployment, the equivalent call is lac.analysis_lexical(data=text, batch_size=1)
    data = {"texts": text, "batch_size": 1}
    # Select the lac prediction method through the URL; the request is sent as a POST
    url = "http://127.0.0.1:8866/predict/lac"
    # Set the POST request headers to application/json
    headers = {"Content-Type": "application/json"}
    r = requests.post(url=url, headers=headers, data=json.dumps(data))
    # Print the prediction result
    print(json.dumps(r.json(), indent=4, ensure_ascii=False))
Running it produces the following result:
{
    "msg": "",
    "results": [
        {
            "tag": [
                "TIME", "v", "q", "n"
            ],
            "word": [
                "今天", "是", "个", "好日子"
            ]
        },
        {
            "tag": [
                "n", "v", "TIME", "v", "v"
            ],
            "word": [
                "天气预报", "说", "今天", "要", "下雨"
            ]
        }
    ],
    "status": "0"
}
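In this response, each entry of results holds two parallel lists, word and tag. A short sketch of pairing them up follows; the response dictionary is copied from the output above, and the helper name tagged_words is hypothetical.

```python
# Response copied from the output above.
response = {
    "msg": "",
    "results": [
        {"tag": ["TIME", "v", "q", "n"],
         "word": ["今天", "是", "个", "好日子"]},
        {"tag": ["n", "v", "TIME", "v", "v"],
         "word": ["天气预报", "说", "今天", "要", "下雨"]},
    ],
    "status": "0",
}


def tagged_words(response):
    """Return one list of (word, tag) pairs per input sentence."""
    return [list(zip(item["word"], item["tag"])) for item in response["results"]]


for sentence in tagged_words(response):
    print(" ".join("{}/{}".format(word, tag) for word, tag in sentence))
```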
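Step3. Stop the serving service
Because the service was started on the default port 8866, the matching stop command is:
$ hub serving stop --port 8866
Alternatively, omit the port, which defaults to 8866:
$ hub serving stop
After serving finishes cleaning up, it prints:
$ PaddleHub Serving will stop.
The serving service has then stopped.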
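(2) Deploying the ernie pre-trained word embedding service API
Step1. Start PaddleHub Serving
Run the start command:
$ hub serving start -m ernie
This deploys a service API for obtaining pre-trained word embeddings, on the default port 8866.
NOTE: To predict on a GPU, set the CUDA_VISIBLE_DEVICES environment variable before starting the service; otherwise, no setting is needed.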
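When the service is launched from a Python script, the environment variable can be set on a copy of the current environment and passed to the child process. This is a minimal sketch: the helper name serving_env and the GPU id "0" are assumptions for illustration, and the launch line is left commented out because it requires PaddleHub to be installed.

```python
import os


def serving_env(gpu_ids=None):
    """Copy the current environment, optionally setting CUDA_VISIBLE_DEVICES."""
    env = dict(os.environ)
    if gpu_ids is not None:
        env["CUDA_VISIBLE_DEVICES"] = gpu_ids
    return env


env = serving_env("0")  # "0" is an example GPU id
print(env["CUDA_VISIBLE_DEVICES"])

# To launch the service with this environment (requires PaddleHub):
# import subprocess
# subprocess.run(["hub", "serving", "start", "-m", "ernie"], env=env)
```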
Step2. Send a prediction request
With the server configured, the few lines of code below send a prediction request and fetch the result:
import requests
import json

# Texts to predict on; they go into the dictionary {"texts": [text_1, text_2, ...]}
text = [["今天是个好日子", "天气预报说今天要下雨"], ["这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般"]]
# Each key names an argument passed to the prediction method; this example uses "texts".
# For a local deployment, the equivalent call is module.get_embedding(texts=text)
data = {"texts": text}
# The request is sent as a POST with a JSON content type
url = "http://10.12.121.132:8866/predict/ernie"
# Set the POST request headers to application/json
headers = {"Content-Type": "application/json"}
r = requests.post(url=url, headers=headers, data=json.dumps(data))
print(r.json())