[Keywords]: computer application; Chinese information processing; feature code string; fuzzy matching; duplicate removal algorithm; redundant web pages
The algorithm selects the top-ranked web pages from the source search results and applies DBSCAN clustering to them based on inter-page similarity, removing as many redundant pages as possible and thereby optimizing the search results.
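The clustering step described above can be sketched as a minimal density-based (DBSCAN-style) pass over a precomputed page-similarity matrix, treating `1 - similarity` as the distance. The `eps` and `min_pts` values here are illustrative assumptions; the source does not state the parameters used.

```python
def dbscan(sim, eps=0.3, min_pts=2):
    """Minimal DBSCAN over a precomputed similarity matrix.

    sim[i][j] is a page similarity in [0, 1]; distance = 1 - sim[i][j].
    Returns one label per page: a cluster id >= 0, or -1 for noise.
    Neighborhoods include the point itself, so min_pts counts it too.
    (Illustrative sketch only; not the paper's exact procedure.)
    """
    n = len(sim)

    def neighbors(i):
        return [j for j in range(n) if 1.0 - sim[i][j] <= eps]

    labels = [None] * n
    cid = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1          # not dense enough: mark as noise
            continue
        labels[i] = cid             # i is a core point: start a cluster
        seeds = [j for j in nbrs if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:     # previously noise -> border point
                labels[j] = cid
                continue
            if labels[j] is not None:
                continue
            labels[j] = cid
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:  # j is also core: keep expanding
                seeds.extend(k for k in j_nbrs if labels[k] is None)
        cid += 1
    return labels
```

After clustering, keeping one representative page per cluster (plus the noise pages) yields the deduplicated result list.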
Based on the characteristics of redundant web pages, this paper introduces the idea of fuzzy matching and, exploiting both the content and the structural information of page text, proposes a fast feature-string-based duplicate-removal algorithm for Chinese web pages; the algorithm is then further optimized for efficiency.
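A feature-string approach of this kind can be sketched as follows: condense each page's text into a short feature string, then fuzzy-match the strings instead of comparing full documents. The extraction rule (first two characters of each sentence) and the 0.8 threshold are hypothetical choices for illustration; the paper's exact feature construction is not given here.

```python
import difflib
import re


def feature_string(text, width=2):
    """Condense a page's text into a short feature code string by taking
    the first `width` characters of each sentence.
    (Hypothetical extraction rule; the source does not specify its own.)
    """
    sentences = [s.strip() for s in re.split(r"[。！？.!?]", text) if s.strip()]
    return "".join(s[:width] for s in sentences)


def is_duplicate(a, b, threshold=0.8):
    """Fuzzy-match two pages via their feature strings: pages whose
    feature strings are sufficiently similar count as near-duplicates.
    """
    fa, fb = feature_string(a), feature_string(b)
    ratio = difflib.SequenceMatcher(None, fa, fb).ratio()
    return ratio >= threshold
```

Because the feature strings are far shorter than the pages themselves, the pairwise fuzzy comparison stays cheap, which is the efficiency point the abstract alludes to.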
On the other hand, using a framework adds a learning curve and can generate superfluous style rules and markup, ultimately bloating the page's code.