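Tagging is a disambiguation task: words are ambiguous, having more than one possible part of speech, and the goal is to find the correct tag for the situation. For example, book can be a verb (book that flight) or a noun (hand me that book). That can be a determiner (Does that flight serve dinner) or a complementizer (I thought that your flight was earlier). The goal of POS tagging is to resolve these ambiguities, choosing the proper tag for the context. How common is tag ambiguity?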

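1. The HMM Algorithm
In this section we introduce the use of the hidden Markov model (HMM) for part-of-speech tagging. The HMM is a sequence model. A sequence model, or sequence classifier, is a model whose job is to assign a label or class to each unit in a sequence, thus mapping a sequence of observations to a sequence of labels. An HMM is a probabilistic sequence model: given a sequence of units (words, letters, morphemes, sentences, and so on), it computes a probability distribution over possible label sequences and chooses the best one.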

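The HMM is based on augmenting the Markov chain. A Markov chain is a model that tells us about the probabilities of sequences of random variables, called states, each of which can take on values from some set. These sets can be words, tags, or symbols representing anything, such as the weather. A Markov chain makes a very strong assumption: if we want to predict the future of the sequence, all that matters is the current state. States before the current state have no impact on the future except via the current state. It is as if, to predict tomorrow's weather, you could examine today's weather but were not allowed to look at yesterday's.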

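More formally, consider a sequence of state variables $q_1, q_2, \dots, q_i$. A Markov model embodies the Markov assumption on the probabilities of this sequence: when predicting the future, the past does not matter, only the present does.

$$P(q_i = a \mid q_1 \dots q_{i-1}) = P(q_i = a \mid q_{i-1})$$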
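As a rough illustration of the Markov assumption, the sketch below scores a state sequence by multiplying the probability of the first state by one transition probability per step. The weather states and all numbers are invented for this example; they do not come from the text.

```python
# Hypothetical first-order Markov chain over weather states.
# All probabilities are made up for illustration.
initial = {"sunny": 0.6, "rainy": 0.4}
transition = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}


def sequence_probability(states):
    """P(q1..qn) = P(q1) * product of P(q_i | q_{i-1}).

    Each step looks only at the previous state, which is exactly
    the Markov assumption.
    """
    prob = initial[states[0]]
    for prev, cur in zip(states, states[1:]):
        prob *= transition[prev][cur]
    return prob


# 0.6 (start sunny) * 0.8 (sunny -> sunny) * 0.2 (sunny -> rainy)
print(sequence_probability(["sunny", "sunny", "rainy"]))
```

Because every factor depends on a single previous state, the cost of scoring a sequence grows linearly with its length.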

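A Markov chain is useful when we need to compute the probability of a sequence of observable events. In many cases, however, the events we are interested in are hidden: we do not observe them directly. For example, we do not normally observe part-of-speech tags in a text. Rather, we see words and must infer the tags from the word sequence. We call the tags hidden because they are not observed.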


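The hidden Markov model (HMM) is a widely used sequence-modeling tool, applied in natural language processing, speech recognition, bioinformatics, and other fields. It is a statistical model that describes a Markov process whose states are hidden and must be inferred from the observations.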

An HMM is usually described by three components: the state sequence, the observation sequence, and the model parameters. The model parameters in turn consist of an initial state probability vector, a state transition matrix, and an emission matrix.

  • State sequence $Q = (q_1, q_2, \dots, q_T)$, where each $q_t$ is one of $N$ hidden states
  • Observation sequence $O = (o_1, o_2, \dots, o_T)$
  • Model parameters $\lambda = (\pi, A, B)$
  • Initial state probability vector $\pi = (\pi_i)$, where $\pi_i = P(q_1 = i)$ is the probability of starting in state $i$
  • State transition matrix $A = [a_{ij}]$, where $a_{ij} = P(q_{t+1} = j \mid q_t = i)$ is the probability of moving from state $i$ to state $j$
  • Emission matrix $B = [b_j(k)]$, where $b_j(k) = P(o_t = k \mid q_t = j)$ is the probability of generating observation $k$ in state $j$
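To make the parameter set concrete, here is a minimal sketch of a three-tag toy model written as plain Python lists. The tag set, vocabulary, and every probability are hypothetical values chosen for illustration; `validate_hmm` is a helper name introduced here, not part of any library.

```python
# Hypothetical toy tagger: every number below is invented for illustration.
STATES = ["DET", "NOUN", "VERB"]      # hidden states (tags)
VOCAB = ["book", "that", "flight"]    # observable symbols (words)

pi = [0.3, 0.3, 0.4]                  # pi[i]: probability of starting in state i
A = [[0.1, 0.8, 0.1],                 # A[i][j]: probability of moving i -> j
     [0.2, 0.3, 0.5],
     [0.5, 0.3, 0.2]]
B = [[0.1, 0.8, 0.1],                 # B[j][k]: probability that state j emits word k
     [0.4, 0.1, 0.5],
     [0.6, 0.1, 0.3]]


def is_distribution(row, tol=1e-9):
    """True if the row is non-negative and sums to 1 within tol."""
    return all(p >= 0 for p in row) and abs(sum(row) - 1.0) < tol


def validate_hmm(pi, A, B):
    """Check that pi and every row of A and B are probability distributions."""
    n = len(pi)
    return (
        is_distribution(pi)
        and len(A) == n and all(len(r) == n and is_distribution(r) for r in A)
        and len(B) == n and all(is_distribution(r) for r in B)
    )


print(validate_hmm(pi, A, B))
```

Each row of `A` and `B` is a conditional distribution, so checking that rows sum to one catches most typing mistakes in hand-written parameters.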

Two basic problems need to be solved for an HMM:

  1. Given the model $\lambda$ and an observation sequence $O$, how do we compute the probability $P(O \mid \lambda)$ of that sequence?
  2. Given the model $\lambda$ and an observation sequence $O$, how do we find the most probable state sequence $Q^*$?
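Both problems can be stated directly as a brute-force enumeration over every possible state sequence, which is a useful reference point even though its cost grows exponentially with sentence length. The sketch below assumes a hypothetical three-tag toy model with invented probabilities; `joint_probability` and `brute_force` are names introduced here for illustration.

```python
from itertools import product

# Hypothetical toy tagger: every number below is invented for illustration.
STATES = ["DET", "NOUN", "VERB"]
VOCAB = {"book": 0, "that": 1, "flight": 2}
pi = [0.3, 0.3, 0.4]
A = [[0.1, 0.8, 0.1], [0.2, 0.3, 0.5], [0.5, 0.3, 0.2]]
B = [[0.1, 0.8, 0.1], [0.4, 0.1, 0.5], [0.6, 0.1, 0.3]]


def joint_probability(path, obs):
    """P(Q, O | lambda) for one concrete state path."""
    p = pi[path[0]] * B[path[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
    return p


def brute_force(obs):
    """Enumerate all N**T paths: sum them (problem 1), keep the best (problem 2)."""
    total, best_path, best_p = 0.0, None, -1.0
    for path in product(range(len(STATES)), repeat=len(obs)):
        p = joint_probability(path, obs)
        total += p
        if p > best_p:
            best_path, best_p = path, p
    return total, [STATES[i] for i in best_path]


obs = [VOCAB[w] for w in "book that flight".split()]
likelihood, best_tags = brute_force(obs)
print(likelihood)   # approximately 0.058854
print(best_tags)    # ['VERB', 'DET', 'NOUN']
```

With 3 states and 3 words this visits only 27 paths, but the count is $N^T$ in general, which is why dynamic programming is used in practice.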

The Forward-Backward Algorithm for HMMs

The forward pass of the forward-backward algorithm solves the first problem: given the model $\lambda$ and an observation sequence $O = (o_1, \dots, o_T)$, compute the probability $P(O \mid \lambda)$ of that sequence.

Let the forward probability $\alpha_t(i)$ denote the joint probability of being in state $i$ at time $t$ and having observed $o_1, o_2, \dots, o_t$:

$$\alpha_t(i) = P(o_1, o_2, \dots, o_t, q_t = i \mid \lambda)$$

Using dynamic programming, $\alpha_t(i)$ can be computed recursively:

$$\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \right] b_j(o_{t+1}), \quad t = 1, \dots, T-1$$

where the initial value is $\alpha_1(i) = \pi_i\, b_i(o_1)$.

The probability of the observation sequence, $P(O \mid \lambda)$, is then the sum of $\alpha_T(i)$ over all states $i$:

$$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$$
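The forward recursion can be sketched in a few lines of plain Python. This is a minimal version under stated assumptions: the three-tag model and its probabilities are hypothetical, time is 0-based in the code (so `alpha[0]` corresponds to $\alpha_1$), and no scaling or log-space arithmetic is applied, so long sequences would underflow.

```python
# Hypothetical toy tagger: every number below is invented for illustration.
STATES = ["DET", "NOUN", "VERB"]
VOCAB = {"book": 0, "that": 1, "flight": 2}
pi = [0.3, 0.3, 0.4]
A = [[0.1, 0.8, 0.1], [0.2, 0.3, 0.5], [0.5, 0.3, 0.2]]
B = [[0.1, 0.8, 0.1], [0.4, 0.1, 0.5], [0.6, 0.1, 0.3]]


def forward(obs, pi, A, B):
    """Return (alpha, P(O | lambda)) where alpha[t][i] uses 0-based time t."""
    n = len(pi)
    # Initialisation: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    # Recursion: alpha_{t+1}(j) = (sum_i alpha_t(i) * a_ij) * b_j(o_{t+1})
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([
            sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
            for j in range(n)
        ])
    # Termination: sum the last column over all states
    return alpha, sum(alpha[-1])


obs = [VOCAB[w] for w in "book that flight".split()]
alpha, likelihood = forward(obs, pi, A, B)
print(likelihood)   # approximately 0.058854
```

The loop does $O(N^2)$ work per time step, so the total cost is $O(N^2 T)$ instead of the $N^T$ paths a direct enumeration would visit.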

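The backward algorithm is a step toward the second problem, finding the most probable state sequence $Q^*$ for a given model and observation sequence $O$. On its own it does not produce a state sequence; combined with the forward probabilities, it gives the probability of each state at each time step.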

Let the backward probability $\beta_t(i)$ denote the probability of the remaining observations $o_{t+1}, o_{t+2}, \dots, o_T$ given that the model is in state $i$ at time $t$:

$$\beta_t(i) = P(o_{t+1}, o_{t+2}, \dots, o_T \mid q_t = i, \lambda)$$

Again using dynamic programming, $\beta_t(i)$ can be computed recursively, this time from the end of the sequence backward:

$$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \quad t = T-1, \dots, 1$$

where the initial value is $\beta_T(i) = 1$.
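A matching sketch of the backward recursion is given below, again on a hypothetical three-tag model with invented numbers and 0-based time indices. As a sanity check it recovers the sequence probability from the first column, using the identity $P(O \mid \lambda) = \sum_i \pi_i\, b_i(o_1)\, \beta_1(i)$.

```python
# Hypothetical toy tagger: every number below is invented for illustration.
STATES = ["DET", "NOUN", "VERB"]
VOCAB = {"book": 0, "that": 1, "flight": 2}
pi = [0.3, 0.3, 0.4]
A = [[0.1, 0.8, 0.1], [0.2, 0.3, 0.5], [0.5, 0.3, 0.2]]
B = [[0.1, 0.8, 0.1], [0.4, 0.1, 0.5], [0.6, 0.1, 0.3]]


def backward(obs, A, B):
    """Return beta where beta[t][i] uses 0-based time t."""
    n, T = len(A), len(obs)
    # Initialisation: beta_T(i) = 1 for every state
    beta = [[1.0] * n for _ in range(T)]
    # Recursion, right to left:
    # beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(
                A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(n)
            )
    return beta


obs = [VOCAB[w] for w in "book that flight".split()]
beta = backward(obs, A, B)

# The sequence probability can also be read off the first column.
likelihood = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(len(pi)))
print(likelihood)   # approximately 0.058854
```

Getting the same value from the backward pass as from the forward pass is a cheap way to catch indexing errors in either implementation.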

Given the observation sequence $O$, the forward-backward algorithm yields, for every time $t$, the probability $\gamma_t(i)$ of being in state $i$, that is, the posterior probability that the model is in state $i$ at time $t$ given $O$. It is obtained from the forward probabilities $\alpha_t(i)$ and the backward probabilities $\beta_t(i)$:

$$\gamma_t(i) = P(q_t = i \mid O, \lambda) = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)}$$
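The posterior $\gamma_t(i)$ can be sketched by running both passes and normalising their product at each time step. The code below is one possible arrangement, assuming a hypothetical three-tag model with invented probabilities; `posteriors` is a name introduced here for illustration.

```python
# Hypothetical toy tagger: every number below is invented for illustration.
STATES = ["DET", "NOUN", "VERB"]
VOCAB = {"book": 0, "that": 1, "flight": 2}
pi = [0.3, 0.3, 0.4]
A = [[0.1, 0.8, 0.1], [0.2, 0.3, 0.5], [0.5, 0.3, 0.2]]
B = [[0.1, 0.8, 0.1], [0.4, 0.1, 0.5], [0.6, 0.1, 0.3]]


def forward(obs, pi, A, B):
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
                      for j in range(n)])
    return alpha


def backward(obs, A, B):
    n, T = len(A), len(obs)
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(n))
    return beta


def posteriors(obs, pi, A, B):
    """gamma[t][i] = alpha_t(i) * beta_t(i) / sum_j alpha_t(j) * beta_t(j)."""
    gamma = []
    for a_t, b_t in zip(forward(obs, pi, A, B), backward(obs, A, B)):
        unnorm = [a * b for a, b in zip(a_t, b_t)]
        z = sum(unnorm)               # equals P(O | lambda) at every t
        gamma.append([u / z for u in unnorm])
    return gamma


obs = [VOCAB[w] for w in "book that flight".split()]
gamma = posteriors(obs, pi, A, B)
# Most likely tag at each position, chosen independently per time step.
per_position = [STATES[max(range(len(row)), key=row.__getitem__)] for row in gamma]
print(per_position)   # ['VERB', 'DET', 'NOUN']
```

Picking the best state separately at each position maximises per-word accuracy, but the resulting sequence is not guaranteed to be the single most probable path, which is why a separate decoding algorithm is still needed.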

Next, the Viterbi algorithm can be used to find the most probable state sequence $Q^*$.

The Viterbi Algorithm for HMMs

The Viterbi algorithm is a dynamic programming algorithm for finding the most probable state sequence $Q^*$. Specifically, let $\delta_t(i)$ denote the highest probability of any single state path that ends in state $i$ at time $t$ and accounts for the observations $o_1, o_2, \dots, o_t$:

$$\delta_t(i) = \max_{q_1, \dots, q_{t-1}} P(q_1, \dots, q_{t-1}, q_t = i, o_1, \dots, o_t \mid \lambda)$$

Using dynamic programming, $\delta_t(i)$ can be computed recursively:

$$\delta_{t+1}(j) = \max_{1 \le i \le N} \left[ \delta_t(i)\, a_{ij} \right] b_j(o_{t+1}), \quad t = 1, \dots, T-1$$

where the initial value is $\delta_1(i) = \pi_i\, b_i(o_1)$.

To recover the most probable state sequence $Q^*$, the recursion also maintains a back-pointer array $\psi$ that records, for each state at each time step, the best predecessor state:

$$\psi_{t+1}(j) = \arg\max_{1 \le i \le N} \left[ \delta_t(i)\, a_{ij} \right]$$

From the recursively computed $\delta_t(i)$ values and the $\psi$ array, the most probable state sequence $Q^*$ can be built backward:

  1. At time $T$, choose the state $i$ that maximizes $\delta_T(i)$ as $q_T^*$
  2. For $t = T-1, T-2, \dots, 1$, choose in turn the state $i$ that maximizes $\delta_t(i)\, a_{i q_{t+1}^*}$ as the state $q_t^*$ at time $t$; this is the back-pointer $\psi_{t+1}(q_{t+1}^*)$ stored during the recursion
  3. The resulting state sequence $Q^* = (q_1^*, q_2^*, \dots, q_T^*)$ is the most probable one
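The recursion and the backtracking steps above can be sketched together as follows. This is a minimal version on a hypothetical three-tag model with invented probabilities; it uses 0-based time indices and raw probabilities, whereas a practical tagger would usually work in log space to avoid underflow.

```python
# Hypothetical toy tagger: every number below is invented for illustration.
STATES = ["DET", "NOUN", "VERB"]
VOCAB = {"book": 0, "that": 1, "flight": 2}
pi = [0.3, 0.3, 0.4]
A = [[0.1, 0.8, 0.1], [0.2, 0.3, 0.5], [0.5, 0.3, 0.2]]
B = [[0.1, 0.8, 0.1], [0.4, 0.1, 0.5], [0.6, 0.1, 0.3]]


def viterbi(obs, pi, A, B):
    """Return (best state path as indices, probability of that path)."""
    n = len(pi)
    # Initialisation: delta_1(i) = pi_i * b_i(o_1); no predecessor yet
    delta = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    psi = [[0] * n]
    # Recursion: keep the best predecessor for every state at every step
    for o in obs[1:]:
        prev = delta[-1]
        d_row, p_row = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: prev[i] * A[i][j])
            p_row.append(best_i)
            d_row.append(prev[best_i] * A[best_i][j] * B[j][o])
        delta.append(d_row)
        psi.append(p_row)
    # Termination: best final state, then follow back-pointers right to left
    last = max(range(n), key=lambda i: delta[-1][i])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(psi[t][path[-1]])
    path.reverse()
    return path, delta[-1][last]


obs = [VOCAB[w] for w in "book that flight".split()]
path, prob = viterbi(obs, pi, A, B)
tags = [STATES[i] for i in path]
print(tags, prob)   # ['VERB', 'DET', 'NOUN'] with probability about 0.0384
```

The structure mirrors the forward pass with the sum replaced by a max, plus the extra `psi` table that makes backtracking possible.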

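In summary, the forward-backward algorithm and the Viterbi algorithm address the two basic HMM problems respectively: the former computes the probability of an observation sequence, and the latter finds the most probable state sequence.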