要建成高质量的标注语料库,必须制订出完备的加工规范。
To build a high-quality annotated corpus, a complete set of processing guidelines must be drawn up.
在大规模标注语料库的基础上进行研究,可以如实反映现代汉语语言现象全貌。
Research based on a large-scale annotated corpus can faithfully reflect the full picture of modern Chinese language phenomena.
为高效地建立句法标注语料库,设计研发了一个实用的中文句法编辑与分析辅助系统。
To build a syntactically annotated corpus efficiently, a practical assistant system for Chinese syntactic editing and analysis was designed and developed.
通过真实语料上进行的比较实验,证明了该方法能有效利用大量未标注语料提高算法的泛化能力。
Comparative experiments on real corpora show that the method can effectively exploit large amounts of unlabeled data to improve the algorithm's generalization ability.
其优点在于两个方面:1不受词义标注语料库规模的影响;2对特定词语意义的消歧准确率可达到100%。
The advantages of the feature-based approach are twofold: (1) it is not affected by the size of the sense-tagged corpus, and (2) it can disambiguate certain specific words with a precision of up to 100%.
中文语义级手工标注语料的稀缺,以及中文句子结构的复杂性,都成为中文语义角色标注任务面临的重要问题。
The scarcity of manually annotated semantic-level corpora for Chinese and the complexity of Chinese sentence structure are both major problems for Chinese semantic role labeling (SRL).
对大规模真实语料的标注实验表明基于转换的方法与三元统计模型方法相得益彰;
Tagging experiments on a large-scale real corpus show that the transformation-based method and the trigram statistical model complement each other.
同时,我们提出了一种新颖、实用的训练语料标注方案,这使得隐马尔科夫模型在音乐实体识别上变得实际可行。
Meanwhile, we propose a novel and practical scheme for annotating training corpora, which makes the Hidden Markov Model practically feasible for music entity recognition.
主要交代本文的选题缘由、研究范围和研究方法、语料来源和标注格式,对以往的研究成果进行概括和总结。
It mainly presents the reasons for choosing the topic, the research scope and methods, and the corpus sources and annotation format, and summarizes previous research findings.
学生建立毕业论文写作个性化语料库应当在教师指导下做好语料收集、分类、整理、标注各个环节的工作。
In building a personalized corpus for thesis writing, students should, under their teacher's guidance, carry out each step carefully: collecting, classifying, organizing, and annotating the corpus material.
句法标注是语料标注的重点、难点所在,必须以一定的句法理论为基础。
Syntactic annotation is the key and most difficult part of corpus annotation, and must be grounded in a particular syntactic theory.
本文采用了一个基于CTOBI的停顿指数标注的语料库,利用有指导的学习方法对自动停顿指数标注方面做了一些有益的探索。
This paper uses a corpus annotated with break indices based on C-TOBI and, applying supervised learning methods, makes some useful explorations in automatic break index labeling.
本文详细论述了语料库的标注,片段单元的定义,组合分析和概率计算。
This paper discusses in detail the annotation of the corpus, the definition of fragment units, combination analysis, and probability calculation.
在对大规模语料库进行深加工时,保证词性标注的一致性已成为建设高质量语料库的首要问题。
In the deep processing of large-scale corpora, ensuring the consistency of part-of-speech tagging has become the foremost problem in building high-quality corpora.
本文从语料库加工流程的角度,探讨了这一问题,并借助XML(可扩展置标语言)提出了错误标注的具体实现方法。
This paper discusses this problem from the perspective of the corpus processing workflow and, with the help of XML (Extensible Markup Language), proposes a concrete method for implementing error annotation.
首先是建立法律语料库,主要是法律书面语的语料库,并对其进行机器自动分词和词性标注。
The first step is to build a legal corpus, mainly a corpus of written legal language, and to apply automatic word segmentation and part-of-speech tagging to it.
平台可对大规模语料库中的词义进行标注,同时自动统计出标注效果,方便人工进行验证。
The platform can tag word senses in large-scale corpora and automatically compute statistics on the tagging results, which makes manual verification convenient.
对大型语料库进行韵律标注现已经成为一种语言研究和言语工程中广为使用的研究手段。
Prosodic labeling of large corpora has now become a popular research tool in linguistic research and speech technology.
同时组块库的获取和收集也是一项迫切的任务,由于不易直接获取具有组块标注的语料,当前大多组块语料库是通过转化现有树库获得。
Meanwhile, acquiring and collecting chunk banks is also an urgent task; because corpora with chunk annotations are hard to obtain directly, most current chunk corpora are derived by converting existing treebanks.
本文首先探索了基于单语料库的无监督中文词性标注。
This paper first explores unsupervised Chinese part-of-speech tagging based on a monolingual corpus.
兼类词的词类排歧是汉语语料词性标注中的难点问题,它严重影响语料的词性标注质量。
Disambiguating the word class of multi-category words is a difficult problem in part-of-speech tagging of Chinese corpora, and it seriously affects tagging quality.
最后,采用一个未经过标注的语料库进行测试,取得了非常好的效果,证明了模型的优越性。
Finally, the model was tested on an unannotated corpus and achieved very good results, demonstrating its superiority.
实验表明,利用该方法可使语料库标注的准确率提高2.5%。
Experiments show that the method can improve the accuracy of corpus annotation by 2.5%.
本文以宾州中文树库为实验语料,考查了不同规模的标注数据对模型性能的影响,实验结果表明,本文提出的无监督词性标注方法提高了中文词性标注的性能。
Using the Penn Chinese Treebank as the experimental corpus, this paper examines how different amounts of annotated data affect model performance; the results show that the proposed unsupervised part-of-speech tagging method improves the performance of Chinese part-of-speech tagging.
首先采用基于字符长度的改进的统计方法对平行语料进行句子级的对齐,并对英文语料和中文语料分别进行词性标注和切分与词性标注。
The parallel corpora are first aligned at the sentence level using an improved statistical method based on character length; the English corpus is then part-of-speech tagged, and the Chinese corpus is segmented and part-of-speech tagged.
第二章介绍了本文所选取语料的收集、整理和标注,并简要说明了在产出性词汇提取过程中出现的一些情况。
The second chapter describes the collection, organization, and annotation of the corpus selected for this study, and briefly explains some situations that arose while extracting productive vocabulary.
蒙古语短语标注是蒙古语语料库语言学研究的进一步深化。
Mongolian phrase tagging is a further deepening of research in Mongolian corpus linguistics.
本文讨论了汉语语料库的加工技术,即对语料库进行词法、句法和语义等方面的标注。
This paper discusses processing techniques for Chinese corpora. The Chinese corpus must be annotated with part of …
统计已对齐和标注的双语语料中的名词和名词短语生成候选术语集。
A candidate term set is generated by collecting statistics on the nouns and noun phrases in the aligned and annotated bilingual corpus.