
HisTrace: A System for Mining on News-Related Articles instead of Web Pages

Abstract: The Web now plays an important part in people's real-life activities. Researchers not only in computer science but also in sociology and economics may be interested in mining information directly related to real-life events, that is, news-related information on the Web. In this paper we propose a system that enables mining on news-related articles instead of raw web pages. The system has two functional tasks: 1) mining for news-related articles and 2) duplicate elimination. For the first task, a novel approach for determining the titles, contents and publication-times of news-related articles is presented. Anchor texts are first used to extract titles from HTML bodies, and contents are then extracted right after the titles. After that, crawl-times and [...] are used to compute initial publication-times for all articles. Finally, times extracted from HTML bodies, URLs and anchor texts are used to determine precise publication-times for possible articles. For the second task, a duplicate detection algorithm for news-related articles is described, which is based on LCS (longest common subsequence) and achieves both high precision and high recall. The framework of this algorithm was presented as a general-purpose algorithm for web pages in a previously published paper. In this paper we explain why the algorithm is particularly suitable for news-related articles and present the corresponding implementation details. Evaluations have been conducted which show the effectiveness of our approaches.
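The abstract names LCS as the basis of the duplicate detection step but gives no further detail, so the following is only a minimal sketch of how an LCS-based similarity check between two article bodies might look. It is not the HisTrace algorithm itself. The function names, the whitespace tokenization, the normalization by the shorter article and the threshold of 0.8 are all assumptions made for illustration.

```python
def lcs_length(a, b):
    """Return the length of the longest common subsequence of a and b."""
    # Rolling dynamic-programming row: prev[j] is the LCS length of the
    # items of `a` processed so far and the first j items of `b`.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(text_a, text_b):
    """Token-level LCS length divided by the shorter article's length.

    Assumption: whitespace tokenization and normalization by the shorter
    text, so that an article embedded in a longer page still scores high.
    """
    tokens_a = text_a.split()
    tokens_b = text_b.split()
    if not tokens_a or not tokens_b:
        return 0.0
    return lcs_length(tokens_a, tokens_b) / min(len(tokens_a), len(tokens_b))


def is_duplicate(text_a, text_b, threshold=0.8):
    """Flag two articles as duplicates (the threshold is hypothetical)."""
    return lcs_similarity(text_a, text_b) >= threshold
```

Because a subsequence preserves token order but tolerates insertions, a measure of this kind is largely unaffected by extra material such as navigation text or editor notes inserted into a reposted article. The quadratic running time of the plain dynamic programme would likely need optimization for a large crawl.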

[Similar Literature]
China Important Conference Papers Full-text Database: top 1 result
1. HisTrace: A System for Mining on News-Related Articles instead of Web Pages [A]; Proceedings 2010 IEEE 2nd Symposium on Web Society [C]; 2010