Research on Semi-supervised Learning Based Approach for Lao Part of Speech Tagging

Computer Science (计算机科学), 2016, Vol. 43, Issue (9): 103-106. doi: 10.11896/j.issn.1002-137X.2016.09.019

• The 3rd CCF Big Data Conference (2015) •

YANG Bei (杨蓓), ZHOU Lan-jiang (周兰江), YU Zheng-tao (余正涛), LIU Li-jia (刘丽佳)

  1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500; Key Laboratory of Intelligent Information Processing, Kunming University of Science and Technology, Kunming 650500 (all four authors)
  • Online: 2018-12-01  Published: 2018-12-01
  • Funding:
    This work was supported by the project "Research on Methods for Chinese-Thai Cross-Language News Event Retrieval" (61462054).

Abstract: Lao has very few annotated corpus resources, so supervised learning cannot be applied directly to Lao lexical analysis. To address this problem, a semi-supervised learning approach to Lao part-of-speech tagging is proposed. First, a simple probability model is built from the small available tag dictionary and an untagged corpus to obtain a more complete tag dictionary. Second, integer programming is used to obtain a large automatically tagged corpus. Finally, with sufficient training data available, a second-order hidden Markov model is trained to perform high-quality Lao part-of-speech tagging. The proposed method performs well on Lao part-of-speech tagging, reaching an accuracy of 89.8%.

Key words: Semi-supervised learning, Second-order hidden Markov model, Lao part-of-speech tagging, Probability model, Integer programming
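
The final stage named in the abstract, tagging with a second-order hidden Markov model, can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything below is an assumption made for illustration: transitions are modelled as P(t_i | t_{i-2}, t_{i-1}), emissions as P(w_i | t_i), both with add-alpha smoothing, and decoding uses Viterbi search over tag pairs. The function names (`train`, `viterbi`), the `START` padding tag, and the toy English corpus are hypothetical and stand in for the automatically tagged Lao corpus.

```python
import math
from collections import defaultdict

START = "<s>"  # hypothetical padding tag for the two positions before a sentence


def train(tagged_sentences, alpha=0.1):
    """Count trigram transitions and emissions; return add-alpha smoothed log-probabilities."""
    trans = defaultdict(lambda: defaultdict(float))  # (t_{i-2}, t_{i-1}) -> t_i -> count
    emit = defaultdict(lambda: defaultdict(float))   # t_i -> w_i -> count
    tag_set, vocab = set(), set()
    for sentence in tagged_sentences:
        prev2, prev1 = START, START
        for word, tag in sentence:
            trans[(prev2, prev1)][tag] += 1
            emit[tag][word] += 1
            tag_set.add(tag)
            vocab.add(word)
            prev2, prev1 = prev1, tag
    tags = sorted(tag_set)

    def log_trans(prev2, prev1, tag):
        row = trans.get((prev2, prev1), {})
        return math.log((row.get(tag, 0.0) + alpha) / (sum(row.values()) + alpha * len(tags)))

    def log_emit(tag, word):
        row = emit.get(tag, {})
        # One extra vocabulary slot reserves probability mass for unseen words.
        return math.log((row.get(word, 0.0) + alpha) / (sum(row.values()) + alpha * (len(vocab) + 1)))

    return tags, log_trans, log_emit


def viterbi(words, tags, log_trans, log_emit):
    """Return the most probable tag sequence; each search state is a pair (t_{i-1}, t_i)."""
    if not words:
        return []
    best = {(START, START): 0.0}  # state -> best log-probability of any path ending there
    backpointers = []             # per position: state -> tag two positions back
    for word in words:
        new_best, pointers = {}, {}
        for (prev2, prev1), score in best.items():
            for tag in tags:
                candidate = score + log_trans(prev2, prev1, tag) + log_emit(tag, word)
                key = (prev1, tag)
                if key not in new_best or candidate > new_best[key]:
                    new_best[key] = candidate
                    pointers[key] = prev2
        best = new_best
        backpointers.append(pointers)
    # Follow the back-pointers from the best final state to the sentence start.
    state = max(best, key=best.get)
    result = [state[1]]
    for pointers in reversed(backpointers[1:]):
        prev2 = pointers[state]
        result.append(state[0])
        state = (prev2, state[0])
    result.reverse()
    return result


# Toy English data, used only to keep the example readable.
toy_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
tags, log_trans, log_emit = train(toy_corpus)
print(viterbi(["a", "cat", "barks"], tags, log_trans, log_emit))  # ['DET', 'NOUN', 'VERB']
```

Tracking tag pairs rather than single tags is what makes the model second-order: the search cost per word grows from |T|^2 to |T|^3 for a tag set T, which is why a model of this kind generally needs the larger training corpus that the earlier stages are meant to supply.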

