Data-rich Section Extraction from HTML pages

Jiying Wang

Department of Computer Science, HKUST

We propose a novel algorithm, DSE (Data-rich Subtree Extraction), to recognize and extract the data-rich section of an HTML page. We apply DSE as a pre-processing (clean-up) step for two typical web information retrieval problems: topic distillation and web information extraction. Our experiments show that, for the test data sets we used, DSE correctly identifies the data-rich sections of HTML pages with 100% accuracy. For topic distillation, it can effectively reduce the size of the root set, thereby improving the precision and accuracy of the HITS algorithm. For web information extraction with the IEPAD algorithm, it can decrease the number of patterns that IEPAD discovers, thus shortening the time needed to generalize a wrapper for HTML pages.