网页自动分类一般包括网页净化、特征选择、向量表示、训练算法、分类算法等五个部分。
Automatic web page classification generally comprises five parts: web page purification, feature selection, vector representation, the training algorithm, and the classification algorithm.
网页查重主要包括两部分,一是对原始网页的处理,主要是对网页噪音净化以及对网页主题信息的提取;
Web page duplicate detection mainly consists of two parts. The first is the processing of the original web pages, chiefly removing page noise and extracting each page's topic information;
其具体过程是对净化后的网页文件,使用CDC进行内容块的分割,使每个网页成为许多内容块的集合。
Specifically, CDC is applied to each purified web page file to split it into content blocks, so that every web page becomes a set of many content blocks.
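The splitting step can be illustrated with a minimal sketch. It assumes that "CDC" stands for content-defined chunking, which the sentence above does not state, and it uses a simple polynomial rolling hash chosen only for illustration. The function names `cdc_chunks` and `page_block_set` and all parameter values (window, mask, size limits) are hypothetical, not taken from the source.

```python
import hashlib


def cdc_chunks(data: bytes, window: int = 16, mask: int = 0x3F,
               min_size: int = 32, max_size: int = 256) -> list[bytes]:
    """Split data into content blocks at positions chosen by a rolling hash.

    Assumption: CDC means content-defined chunking. All parameters are
    illustrative defaults, not values from the source.
    """
    base = 257
    mod = 1 << 32
    top = pow(base, window - 1, mod)  # weight of the oldest byte in the window

    chunks = []
    start = 0  # index where the current block begins
    h = 0      # rolling hash of the last `window` bytes of the current block
    for i, byte in enumerate(data):
        if i - start >= window:
            # Window is full: drop the contribution of the oldest byte.
            h = (h - data[i - window] * top) % mod
        h = (h * base + byte) % mod
        size = i - start + 1
        # Cut when the hash matches the mask (after min_size) or at max_size.
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing block, may be shorter than min_size
    return chunks


def page_block_set(page_text: str) -> set[str]:
    """Represent a purified page as the set of digests of its content blocks."""
    blocks = cdc_chunks(page_text.encode("utf-8"))
    return {hashlib.sha1(block).hexdigest() for block in blocks}
```

Because block boundaries depend on the content rather than on fixed offsets, two pages that share long passages tend to share blocks as well. How the resulting sets are compared is not described in the text above, so it is left out of this sketch.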