Chinese address segmentation algorithm based on deep learning
LI Yi (李一); LIU Jiping (刘纪平); LUO An (罗安)
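Abstract:
To address the heavy dependence of traditional word segmentation on dictionaries, this paper proposes a deep-learning-based algorithm for segmenting and recombining Chinese address elements. Addresses are first split with the bigram bisection method. Then, using a set of web point-of-interest (POI) addresses as samples, a deep-learning-based method performs feature matching and recombination of the address elements, yielding automatic segmentation of Chinese addresses with the address element as the unit. More than ten thousand POI address records collected from the web serve as the experimental samples. The experimental results show that the algorithm both reduces the dependence on dictionaries and considerably improves the segmentation accuracy for place names and addresses.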
Keywords: Chinese word segmentation; bigram bisection method; deep learning; address elements
Foundation: Basic Scientific Research Fund Project of the Chinese Academy of Surveying and Mapping (7771605)
Authors: LI Yi (李一); LIU Jiping (刘纪平); LUO An (罗安)
DOI: 10.16251/j.cnki.1009-2307.2018.10.017
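
The two stages named in the abstract, bigram splitting followed by recombination into address elements, can be illustrated with a minimal sketch. It assumes that the bigram step produces overlapping two-character units, and it replaces the deep-learning stage with a hand-written list of boundary flags; the abstract specifies neither detail. The function names `bigram_split` and `recombine`, the `cut_flags` input, and the sample address are hypothetical and are not taken from the paper.

```python
def bigram_split(address: str) -> list[str]:
    """Split an address into overlapping two-character units (assumed form)."""
    if len(address) < 2:
        return [address] if address else []
    return [address[i:i + 2] for i in range(len(address) - 1)]


def recombine(address: str, cut_flags: list[bool]) -> list[str]:
    """Rebuild address elements from per-bigram boundary flags.

    cut_flags[i] is True when an element boundary falls between
    characters i and i + 1. In the paper a trained model would make
    this decision; here the flags are supplied by hand (assumption).
    """
    if len(cut_flags) != max(len(address) - 1, 0):
        raise ValueError("need exactly one flag per bigram")
    elements, start = [], 0
    for i, cut in enumerate(cut_flags):
        if cut:
            elements.append(address[start:i + 1])
            start = i + 1
    if address:
        elements.append(address[start:])
    return elements


# Illustrative address, not from the paper's data set.
address = "北京市海淀区莲花池西路28号"
bigrams = bigram_split(address)

# Hand-written stand-in for model output: boundaries after 市, 区 and 路.
cut_flags = [False] * len(bigrams)
for position in (2, 5, 10):
    cut_flags[position] = True

elements = recombine(address, cut_flags)
print(bigrams[:3])
print(elements)
```

Working on one flag per bigram keeps the recombination step independent of any dictionary: the only input besides the raw string is a yes/no boundary decision for each adjacent character pair, which is the kind of decision a learned classifier could supply.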