Topic Information Collection Based on the Hidden Markov Model
Haiyan Jiang, Xingce Wang, Zhongke Wu, Mingquan Zhou, Xuesong Wang and Jigang Wang
Abstract Topic information collection algorithms are widely used for their accuracy. In this chapter, a Hidden Markov Model (HMM) is used to learn and judge the relevance between a Uniform Resource Locator (URL) and the topic information. The Rocchio method is used to construct the prototype vectors relevant to the topic information, and the HMM is used to learn the preferred browsing paths. Concept maps that capture the semantics of the webpages are constructed, from which the web's link structure can be determined. Finally, experiments confirm the validity of the algorithm: compared with the Best-First algorithm, it collects more topic information pages and achieves a higher precision ratio.
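The prototype-vector step mentioned in the abstract can be illustrated with a minimal sketch of the textbook Rocchio formulation. This is not the authors' implementation: the whitespace tokenizer, the raw term-frequency weighting, the values of `alpha` and `beta`, the clipping of negative weights, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector (assumed tokenizer: lowercase split)."""
    return Counter(text.lower().split())


def centroid(vectors):
    """Mean of sparse vectors stored as term -> weight mappings."""
    total = Counter()
    for v in vectors:
        total.update(v)  # Counter.update adds counts term by term
    n = len(vectors)
    return {t: w / n for t, w in total.items()} if n else {}


def rocchio_prototype(relevant, non_relevant, alpha=1.0, beta=0.25):
    """Textbook Rocchio: alpha * mean(relevant) - beta * mean(non_relevant).

    alpha and beta are illustrative defaults, not values from the chapter.
    Terms whose weight is not positive are dropped (a common convention).
    """
    pos = centroid(relevant)
    neg = centroid(non_relevant)
    proto = {}
    for term in set(pos) | set(neg):
        weight = alpha * pos.get(term, 0.0) - beta * neg.get(term, 0.0)
        if weight > 0:
            proto[term] = weight
    return proto


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

A page (or the text around a URL) can then be scored against the topic by `cosine(prototype, tf_vector(page_text))`; how that score feeds into the HMM is specific to the chapter and is not reproduced here.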
Keywords Topic information collection · Hidden Markov model · Crawler · Uniform resource locator (URL) · Prototype vector · Precision ratio · Recall ratio
H. Jiang: College of Information Science and Technology, Beijing Normal University, Qilu Normal University, Jinan, Shandong, China
X. Wang (corresponding author), Z. Wu, M. Zhou, Xuesong Wang, J. Wang: College of Information Science and Technology, Beijing Normal University, No.19, Xinjiekouwai Street, Haidian District, Beijing, China e-mail: [email protected]
Y. Yang and M. Ma (eds.), Proceedings of the 2nd International Conference on Green Communications and Networks 2012 (GCN 2012): Volume 1, Lecture Notes in Electrical Engineering 223, DOI: 10.1007/978-3-642-35419-9_16 Springer-Verlag Berlin Heidelberg 2013
16.1 Introduction At present, search engines such as Google and Baidu have become the most effective means for people to obtain information on the Internet. Further improving the recall ratio and precision ratio over the vast amount of information on the Internet has become the most important issue in the design of future search engines [1]. Vertical search engines have emerged to collect information for specific industries and fields, and they can obtain more accurate and effective results in those fields. The key factor in the design of a vertical search engine is improving the accuracy of topic information collection. The Best-First algorithm, used by traditional web crawlers, misses information when relevant topic pages are not directly linked to one another, a situation known as the tunnel [2]. The optimal state sequence obtained through a Hidden Markov Model (HMM) fits the page access sequence observed in topic information collection. We therefore model the identification of collection paths to further improve the precision ratio of information collection. In this chapter, we design a representation of the topic information, simulate users' access to create the table of topic keywords, calculate the content similarity of the pages, establish the topic relevance of the uniform resource locators (URLs), and simulate users' access sequences to construct an HMM. The model learns through multiuser acc
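The notion of an optimal state sequence through an HMM fitting a page access sequence can be illustrated with standard Viterbi decoding. The following is only a sketch: the text above does not state which decoding procedure the authors use, and the state names (`on_topic`, `tunnel`), the observation symbols (`high`, `low` content similarity), and all probabilities used with it are hypothetical.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state path for `obs`, with its log-probability.

    Assumes every probability that is looked up is strictly positive,
    because the computation is carried out in log space.
    """
    # best[t][s]: best log-probability of any path ending in state s at step t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace the best path backwards from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]
```

With a toy model in which on-topic pages mostly show high similarity and tunnel pages mostly show low similarity, an observed sequence of high, low, high similarity can decode to on-topic, tunnel, on-topic. This is the kind of path a purely greedy Best-First crawler would abandon at the low-similarity page.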