Analysis of the SP&HC clustering algorithm for micro-blog topic detection
Abstract: Micro-blog sites carry a large volume of text, which makes agglomerative hierarchical clustering unsuitable, while Single-Pass clustering on its own gives inaccurate detection results. To address these problems, the SP&HC clustering algorithm is proposed, which combines the two. First, Single-Pass clustering performs a coarse clustering of the large body of micro-blog texts and collects small, highly cohesive topics. This greatly reduces both the content and the number of items, until the resulting topics meet the requirements of hierarchical clustering. Hierarchical clustering then merges similar topics until a preset condition is met. Experimental results show that the overall performance of SP&HC on recall and accuracy is better than that of either of the two original algorithms.

Keywords:
- micro-blog
- hot topic
- hierarchical clustering algorithm
- Single-Pass clustering algorithm
- SP&HC clustering algorithm
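The two-stage procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the text representation, the similarity measure, or the thresholds, so the whitespace tokenization, raw term counts, cosine similarity, summed-count centroids, default threshold values, and all function names below are assumptions. Real micro-blog text in Chinese would also need a word segmenter in place of `split()`.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def single_pass(docs, threshold):
    """Stage 1 (assumed form): one pass over the texts.

    Each text joins its most similar existing cluster if the similarity
    reaches `threshold`; otherwise it opens a new cluster.
    """
    clusters = []  # each cluster: {"centroid": Counter, "members": [doc indices]}
    for i, doc in enumerate(docs):
        vec = Counter(doc.split())  # assumption: whitespace tokens, raw counts
        best, best_sim = None, -1.0
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= threshold:
            # Summed counts act as the centroid; cosine ignores the scale.
            best["centroid"].update(vec)
            best["members"].append(i)
        else:
            clusters.append({"centroid": vec, "members": [i]})
    return clusters


def hierarchical_merge(clusters, threshold):
    """Stage 2 (assumed form): agglomerative merging of stage-1 clusters.

    Repeatedly merges the most similar pair of clusters and stops when
    the best pairwise similarity drops below `threshold`.
    """
    clusters = [
        {"centroid": Counter(c["centroid"]), "members": list(c["members"])}
        for c in clusters
    ]
    while len(clusters) > 1:
        best_pair, best_sim = None, -1.0
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i]["centroid"], clusters[j]["centroid"])
                if sim > best_sim:
                    best_pair, best_sim = (i, j), sim
        if best_sim < threshold:
            break
        i, j = best_pair
        clusters[i]["centroid"].update(clusters[j]["centroid"])
        clusters[i]["members"].extend(clusters[j]["members"])
        del clusters[j]
    return clusters


def sp_hc(docs, sp_threshold=0.6, hc_threshold=0.3):
    """Run the coarse Single-Pass stage, then merge its clusters hierarchically."""
    return hierarchical_merge(single_pass(docs, sp_threshold), hc_threshold)
```

In this sketch the first threshold is kept high so that stage 1 yields small, cohesive clusters, and the quadratic pairwise comparison in stage 2 runs over those clusters rather than over every individual text, which is the motivation the abstract gives for combining the two algorithms.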