This paper presents a method for extracting the main content of Web pages based on statistical and content features.
The method inherits the advantages of the statistical approach, and it uses content features to overcome a limitation of earlier purely statistical methods, which cannot extract Web pages containing multiple content bodies.
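The abstract does not describe the algorithm itself, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes a simple statistic (text length and link density per block) combined with one content feature (punctuation count), and it keeps every block that passes rather than only the single best one, which is one way multiple content bodies could be retained. The function names, thresholds, and the choice of block-level tags are all hypothetical.

```python
import re

# Hypothetical helpers for illustration; none of this comes from the paper.
TAG_RE = re.compile(r"<[^>]+>")
LINK_RE = re.compile(r"(?is)<a\b[^>]*>(.*?)</a>")
PUNCT_RE = re.compile(r"[,.;:!?,。;:!?]")
BLOCK_SPLIT_RE = re.compile(r"(?i)</?(?:div|p|td|li|section|article)\b[^>]*>")


def strip_tags(fragment):
    """Remove markup and collapse whitespace."""
    return " ".join(TAG_RE.sub("", fragment).split())


def extract_bodies(html, min_len=40, max_link_ratio=0.3, min_punct=2):
    """Return the text of every block that looks like body content.

    Statistical tests (assumed): minimum text length, maximum link density.
    Content feature (assumed): minimum number of punctuation marks.
    All passing blocks are kept, so a page may yield several bodies.
    """
    bodies = []
    for block in BLOCK_SPLIT_RE.split(html):
        text = strip_tags(block)
        if len(text) < min_len:
            continue  # too short to be a content body
        link_chars = sum(len(strip_tags(m)) for m in LINK_RE.findall(block))
        if link_chars / len(text) > max_link_ratio:
            continue  # mostly anchor text: navigation or advertising
        if len(PUNCT_RE.findall(text)) < min_punct:
            continue  # prose normally contains punctuation
        bodies.append(text)
    return bodies
```

Called on a page with a navigation bar, two separate articles, and a sponsored link, a filter of this kind would be expected to return the two articles as separate bodies and drop the rest. The thresholds would need tuning against real pages.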