To extend the segmentation dictionary and thereby improve word segmentation accuracy, this paper presents an information-entropy-based algorithm for extracting high-frequency Chinese words; its results can be used to identify out-of-vocabulary words and to expand existing dictionaries.
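The sentence above does not say how entropy enters the extraction, so the following is only a minimal sketch of one common reading: a character n-gram is kept as a word candidate when it is frequent and the characters on its left and right are varied (high boundary entropy). The function names, the thresholds, and the `dictionary` parameter are all assumptions made for illustration, not the paper's actual method.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (in bits) of a Counter of neighbouring characters."""
    total = sum(counter.values())
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def extract_candidates(text, max_len=4, min_freq=3, min_entropy=1.0,
                       dictionary=frozenset()):
    """Return (ngram, frequency, boundary_entropy) for likely new words.

    Assumed heuristic: keep n-grams of 2..max_len characters that occur at
    least min_freq times, whose left and right neighbour entropies are both
    at least min_entropy, and that are not already in `dictionary`.
    """
    freq = Counter()
    left = defaultdict(Counter)   # ngram -> counts of preceding characters
    right = defaultdict(Counter)  # ngram -> counts of following characters
    n = len(text)
    for i in range(n):
        for k in range(2, max_len + 1):
            if i + k > n:
                break
            gram = text[i:i + k]
            freq[gram] += 1
            if i > 0:
                left[gram][text[i - 1]] += 1
            if i + k < n:
                right[gram][text[i + k]] += 1

    result = []
    for gram, f in freq.items():
        if f < min_freq or gram in dictionary:
            continue
        if not left[gram] or not right[gram]:
            continue  # no neighbours observed on one side
        h = min(entropy(left[gram]), entropy(right[gram]))
        if h >= min_entropy:
            result.append((gram, f, h))
    # Most frequent candidates first; ties broken alphabetically.
    return sorted(result, key=lambda item: (-item[1], item[0]))
```

Candidates returned this way could then be reviewed and added to the segmentation dictionary; passing the existing dictionary as `dictionary` restricts the output to out-of-vocabulary strings.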
In the lexical analysis phase, this work does the following: it adds computer-science terminology to a general-purpose segmentation dictionary and eliminates part-of-speech ambiguity;
Initially, it was a Chinese word segmentation component built primarily for the open source project Lucene, combining dictionary-based segmentation with grammar analysis algorithms.
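To make "dictionary-based segmentation" concrete, here is a minimal sketch of forward maximum matching, one standard dictionary technique. It is not taken from the component described above, it covers only the dictionary half (no grammar analysis), and the function name and `max_word_len` parameter are hypothetical.

```python
def forward_max_match(text, dictionary, max_word_len=4):
    """Segment `text` greedily, preferring the longest dictionary word.

    Characters that start no dictionary word are emitted as single tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # fall back to a single character
        # Try the longest window first, shrinking down to two characters.
        for k in range(min(max_word_len, len(text) - i), 1, -1):
            if text[i:i + k] in dictionary:
                match = text[i:i + k]
                break
        tokens.append(match)
        i += len(match)
    return tokens
```

A greedy matcher like this cannot resolve overlapping ambiguities on its own, which is one reason a real component would pair it with further analysis.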