Page retrieval is analyzed, covering multi-threading, detection of duplicate pages, spider traps, and the text repository.
In recent years, academic plagiarism has frequently been reported, and the growing number of duplicate pages on the Web lowers retrieval efficiency and inconveniences users.
To address the large number of duplicate Web pages that current search engines commonly return, a search-result optimization algorithm based on the DBSCAN clustering algorithm is proposed.
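The abstract does not give the algorithm's details, but the general idea of DBSCAN-based de-duplication can be sketched as follows: represent each result by its word set, cluster near-identical results with DBSCAN, and keep one representative per cluster. This is a minimal illustration, not the paper's method; the Jaccard distance, `eps`, and `min_samples` values are all illustrative assumptions.

```python
# Illustrative sketch: de-duplicating search results with a plain DBSCAN.
# Distance measure, thresholds, and sample data are assumptions, not the
# paper's actual algorithm.

def jaccard_distance(a: set, b: set) -> float:
    """1 - |A∩B| / |A∪B|; 0.0 means identical word sets."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def dbscan(docs, eps=0.5, min_samples=2):
    """Textbook DBSCAN over word-set distances.

    Returns one cluster label per document; -1 marks noise,
    i.e. results with no near-duplicates.
    """
    sets = [set(d.lower().split()) for d in docs]
    n = len(docs)
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(n)
                     if jaccard_distance(sets[i], sets[j]) <= eps]
        if len(neighbors) < min_samples:
            labels[i] = -1  # noise; may be reclaimed as a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = [k for k in range(n)
                           if jaccard_distance(sets[j], sets[k]) <= eps]
            if len(j_neighbors) >= min_samples:  # j is a core point: expand
                queue.extend(k for k in j_neighbors if labels[k] is None)
    return labels

results = [
    "DBSCAN groups duplicate web pages in search results",
    "DBSCAN groups duplicate web pages in the search results",
    "a page about an entirely different topic",
]
labels = dbscan(results)

# Keep one representative per cluster; noise points are unique results.
seen, deduped = set(), []
for doc, label in zip(results, labels):
    if label == -1 or label not in seen:
        deduped.append(doc)
        seen.add(label)
```

Here the two near-identical snippets fall into one cluster and only one of them survives, while the unrelated page is kept as noise, which is the behavior a result-optimization step of this kind would rely on.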