This chapter covers using regular expressions to identify content in web page source code, web page preprocessing, extracting and quantifying web page feature terms, building inverted files, and removing near-duplicate web pages.
Source: 基于Web的网络搜索技术研究 (Research on Web-Based Network Search Technology)
This paper mainly discusses how to build a fast file search engine using inverted files.
This paper examines Chinese full-text retrieval techniques based on compressed inverted files, including data compression methods, storage, retrieval, and ranking mechanisms.
The algorithm needs to scan the transaction database only once, and all transaction processing is carried out on the inverted file into which the transaction database is mapped.
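
The single-scan idea in the sentence above can be illustrated with a minimal sketch. It is not the cited algorithm itself; it only assumes that the inverted file maps each item to the set of IDs of the transactions containing it. The names `build_inverted_file` and `support`, and the sample transactions, are hypothetical and chosen for illustration.

```python
from collections import defaultdict


def build_inverted_file(transactions):
    """Scan the transaction list once, mapping each item to the IDs of the transactions that contain it."""
    inverted = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            inverted[item].add(tid)
    return dict(inverted)


def support(inverted, itemset):
    """Count the transactions containing every item in itemset by intersecting posting sets."""
    postings = [inverted.get(item, set()) for item in itemset]
    if not postings:
        return 0
    return len(set.intersection(*postings))


# Hypothetical sample data: four transactions, IDs 0 to 3.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

inverted = build_inverted_file(transactions)
print(support(inverted, ["bread", "milk"]))  # prints 2 (transactions 0 and 2)
```

After the one pass that builds `inverted`, every later support count touches only the posting sets, so the original transaction list does not need to be read again.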