For Web applications such as information extraction and information filtering, the original Web page must first be segmented into several appropriate information blocks, a preprocessing step that facilitates later processing.
The key idea is that, because web pages are semi-structured, the algorithm segments each page into topic areas according to its HTML tags and content.
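As a rough illustration of tag-based segmentation, the sketch below splits a page into text blocks wherever a block-level HTML tag opens or closes. It is not the algorithm the sentence refers to: the set of boundary tags, the class name `BlockSegmenter`, and the function `segment_page` are all assumptions made for this example, and it uses only tags, not content similarity, to decide where a block ends.

```python
from html.parser import HTMLParser

# Assumption: these tags mark block boundaries; the source does not list any.
BLOCK_TAGS = {"div", "p", "table", "tr", "td", "ul", "ol", "li", "br",
              "h1", "h2", "h3", "h4", "h5", "h6", "section", "article"}
# Text inside these tags is never visible page content, so it is skipped.
SKIP_TAGS = {"script", "style"}


class BlockSegmenter(HTMLParser):
    """Hypothetical helper: collects runs of text separated by block tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished text blocks, in document order
        self._buffer = []     # text pieces of the block being built
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def flush(self):
        # Join the buffered pieces and collapse runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buffer.append(data)


def segment_page(html):
    """Return the page's visible text as a list of blocks."""
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any text left after the last block tag
    return parser.blocks
```

Inline tags such as `<b>` or `<a>` do not end a block, so a sentence with emphasized words stays in one piece. A content-aware version would additionally merge or split these blocks by topic, which this sketch does not attempt.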