信息抽取是从自由文本语料库构建数据库,实现情报自动收集的有效途径之一。
Information extraction is one of the main approaches for building databases from free-text corpora and for automatically collecting intelligence.
一些文本语料库进行了分类,例如通过类型或者主题;有时候语料库的类别相互重叠。
Some text corpora are categorized, e.g., by genre or topic; sometimes the categories of a corpus overlap each other.
为了训练你的方法,你可以使用从第一个家庭作业得到的讲稿转录文本语料(GZ)和6.001课本中源文件(GZ)。
To train your method, you will use the lecture transcript corpus (GZ) from the first homework and a 6.001 textbook source file (GZ).
任何文本分析中,第一步都是从文本内容生成一个语料库(corpus),后续的分析将应用于此语料库。
In any text analysis, the first step is to generate a corpus from the textual content, with the subsequent analysis being applied to the corpus.
跨一组相关文档执行文本分析可以导致更高质量的分类,因为您可以交叉引用更大的语料库,并分析出文档之间更深层的关系。
Performing textual analysis across a set of related documents can result in higher-quality categorization, as you can cross-reference from a larger corpus and glean deeper relations between documents.
生成语料库的原因之一是规范化文本并删除任何不相关的内容。
One of the reasons for generating a corpus is to normalize text and remove anything that isn't relevant.
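As a rough illustration of the normalization step described above, the sketch below (all names hypothetical) lowercases text, strips markup-like debris, drops punctuation, and collapses whitespace; a real pipeline would typically add tokenization, stop-word removal, and similar steps:

```python
import re

def normalize(text):
    """A minimal normalization pass: lowercase, strip markup-like
    tags, drop punctuation, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)      # remove HTML-like tags
    text = re.sub(r"[^\w\s]", " ", text)      # drop punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

def build_corpus(documents):
    """Build a corpus as a list of normalized documents."""
    return [normalize(doc) for doc in documents]

corpus = build_corpus(["<p>Hello,  World!</p>", "Corpus-based   NLP."])
print(corpus)  # → ['hello world', 'corpus based nlp']
```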
通过大型语料库(海量文本)来检查是个好方法。
Checking against a large corpus (a massive body of text) is a good approach.
本文基于大量真实的WTO语料,考察WTO文本的语言现象,分析其特有的句法特征,并探讨其汉译的一些策略。
Based on a large corpus of authentic WTO texts, this paper examines their linguistic features, particularly their syntactic features, and explores strategies for translating such texts into Chinese.
自动机的设计充分考虑了各种类别的实体的文本结构特点,在大规模人民日报语料上测试时取得了很好的识别效果。
The design of the automaton fully considers the textual structure of each kind of entity, and it achieved good recognition results when tested on a large-scale People's Daily corpus.
同时以大量真实文本为语料,详细探讨“这”、“那”指代词在情景语境中的手势和非手势指示、上下文语境中回指、预指等现象的规律或倾向性规律。
Meanwhile, using the real texts in our corpus, we investigate the gestural and symbolic deixis of "this" and "that" in situational context, as well as their anaphoric and cataphoric uses in linguistic context.
语料库语言学作为一门新兴的学科,可以应用于文学批评领域来分析文学文本。
As an emerging discipline, corpus linguistics can be applied in the field of literary criticism to analyze literary texts.
统计机器翻译是利用基于语料库训练得到的统计参数模型,将源语言的文本翻译成目标语言,它是机器翻译的主流方向。
Statistical machine translation (SMT) translates source-language text into the target language using statistical parameter models trained on a corpus; it has become the mainstream approach in machine translation research.
翻译英语语料库(TEC)是世界上首个当代翻译英语语料库,包含有许多源语书面文本的英语译文。
TEC (the Translational English Corpus) is the world's first corpus of contemporary translated English, consisting of English translations of written texts from a range of source languages.
本文在大规模语料的基础上,利用语言模型中稀疏事件的概率估计方法对汉语的熵进行计算,并讨论了语料规模等因素对熵的影响。
Based on a large-scale corpus, this paper applies probability estimation methods for sparse events in language models to compute the entropy of Chinese, and discusses how factors such as corpus size affect the entropy.
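To make the entropy computation concrete, here is a minimal maximum-likelihood sketch (the function name `char_entropy` is hypothetical); the paper's actual method additionally uses smoothed probability estimates for sparse events, which this sketch omits:

```python
import math
from collections import Counter

def char_entropy(text):
    """Maximum-likelihood per-character entropy in bits:
    H = -sum_c p(c) * log2 p(c), with p(c) estimated by
    relative frequency in the text."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

print(char_entropy("aabb"))  # → 1.0 (two equiprobable symbols)
```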
本文利用三种特征选择方法、两种权重计算方法、五种停用词表以及支持向量机分类器对汽车语料的文本情感类别进行了研究。
This paper studies text sentiment classification on an automobile corpus using three feature selection methods, two term weighting methods, five stop-word lists, and a support vector machine classifier.
基于概率的算法只考虑了训练集语料的概率模型,对于不同领域的文本的处理不尽如人意。
Probability-based methods consider only the probabilistic model of the training corpus, so they perform poorly on texts from other domains.
首先作者建立了一个语料库,收录了从1992年到2008年17年的研究生入学英语考试试题文本。
First, the author built a corpus of postgraduate entrance English exam texts covering the 17 years from 1992 to 2008.
文章描述了一种从熟语料中自动获取文本切分知识的机器学习的方法。
This paper presents a machine learning method to automatically acquire text segmentation knowledge from an annotated Chinese corpus.
基于语料库的语义接受度(SAS)研究是在线衡量文本理解程度的可行性方法。
The corpus-based study of the Semantic Accessibility Scale (SAS) is a feasible method for measuring text comprehension online.
如何通过现有的互译文本来建立大规模的双语语料库,对双语互译文本的加工成为至关重要的问题。
To build a large-scale bilingual corpus from existing translated texts, the processing of bilingual parallel texts becomes a crucial problem.
“互联网用作语料库”是一种把互联网上的文本用作语料资源的新兴方法。
While the web is not an archetypal corpus, the "web as corpus" method, which uses texts on the Internet as corpus resources, is undeniably useful and has found widespread application in linguistic data retrieval and linguistic hypothesis testing.
抽取电子邮件和手机短信的多种文本特征,分别在TREC07P电子邮件语料和真实中文手机短信语料上进行了垃圾信息过滤实验。
By extracting multiple text features from email and short message service (SMS) documents, spam filtering experiments are run on the TREC07P email corpus and a real Chinese SMS corpus, respectively.
但是大规模双语平行语料库的获取并不容易,现有的平行语料库在规模、时效性和领域的平衡性等方面还不能满足处理真实文本的实际需要。
However, large-scale bilingual parallel corpora are not easy to obtain, and existing parallel corpora cannot meet the practical needs of processing real texts in terms of scale, timeliness, and domain balance.
该方法通过对机器分词语料和人工校对语料的学习,自动获取中文文本的分词校对规则,并应用规则对机器分词结果进行自动校对。
The method learns word segmentation correction rules for Chinese text from machine-segmented corpora and their manually corrected counterparts, and applies the learned rules to automatically correct the output of machine word segmentation.
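The rule-learning idea can be caricatured as follows. This is a heavily simplified sketch (whole-sentence rules keyed by the machine token sequence, with hypothetical names), not the paper's actual algorithm, which learns local correction rules:

```python
def learn_rules(machine_segs, gold_segs):
    """Learn correction rules from paired machine-segmented and
    manually corrected sentences (each a list of words)."""
    rules = {}
    for machine, gold in zip(machine_segs, gold_segs):
        # Only learn from pairs covering the same character string
        # where the machine output disagrees with the gold standard.
        if "".join(machine) == "".join(gold) and machine != gold:
            rules[tuple(machine)] = gold
    return rules

def correct(machine_seg, rules):
    """Apply a learned rule if one matches, else keep the machine output."""
    return rules.get(tuple(machine_seg), machine_seg)

# Classic ambiguity: "研究生命起源" mis-segmented as 研究生/命/起源.
rules = learn_rules([["研究生", "命", "起源"]], [["研究", "生命", "起源"]])
print(correct(["研究生", "命", "起源"], rules))  # → ['研究', '生命', '起源']
```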