DOM-based keyword extraction from web pages
Self-archived version: final draft
Citation: Shah, Himat; Rezaei, Mohammad; Fränti, Pasi (2019). DOM-based keyword extraction from web pages. AIIPCC '19: Proceedings of the International Conference on Artificial Intelligence, Information Processing and Cloud Computing, 62. doi:10.1145/3371425.3371495
We present D-rank, an unsupervised, language- and domain-independent method for automatically extracting keywords from a single web page. The method does not use any corpus; it relies only on information and features available on the page itself, including the page URL, word frequency, title, hyperlinks, and headers, all of which are extracted from the DOM tree of the page. Each word is scored according to its importance, which is determined by the positions in which it appears on the page. Experimental results on web pages in three different languages show the effectiveness of the proposed method.
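The general idea of position-based scoring can be illustrated with a minimal Python sketch. It is not the authors' D-rank implementation: the abstract does not give the scoring formula, so the weights in WEIGHTS, the tokenizer, the rule that URL words only boost words already found on the page, and the names PositionScorer and extract_keywords are all assumptions made for illustration. The sketch uses only the standard library and no corpus.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical position weights; the abstract does not state D-rank's values.
WEIGHTS = {"url": 3.0, "title": 5.0, "header": 4.0, "link": 2.0, "body": 1.0}
HEADERS = {"h1", "h2", "h3", "h4", "h5", "h6"}
SKIPPED = {"script", "style"}


def tokenize(text):
    """Lowercased runs of three or more letters (Unicode-aware)."""
    return [w.lower() for w in re.findall(r"[^\W\d_]{3,}", text)]


class PositionScorer(HTMLParser):
    """Adds a position-dependent weight to a word each time it occurs."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags, outermost first
        self.scores = Counter()  # word -> accumulated score

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this tolerates unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if any(t in SKIPPED for t in self.stack):
            return  # ignore script and style contents
        if "title" in self.stack:
            position = "title"
        elif HEADERS.intersection(self.stack):
            position = "header"
        elif "a" in self.stack:
            position = "link"
        else:
            position = "body"
        for word in tokenize(data):
            self.scores[word] += WEIGHTS[position]


def extract_keywords(html, url, top_k=5):
    """Return the top_k (word, score) pairs for one page."""
    scorer = PositionScorer()
    scorer.feed(html)
    scorer.close()
    # Assumption: URL path words only boost words that also occur on the page.
    for word in tokenize(urlparse(url).path):
        if word in scorer.scores:
            scorer.scores[word] += WEIGHTS["url"]
    return scorer.scores.most_common(top_k)


if __name__ == "__main__":
    page = "<html><head><title>Sourdough bread</title></head><body><h1>Baking</h1></body></html>"
    print(extract_keywords(page, "https://example.com/sourdough.html", top_k=3))
```

Because every occurrence adds its position weight, word frequency is reflected implicitly: a word repeated in the body accumulates score, while a single occurrence in the title or a header counts for more than a single occurrence in running text.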