Research on the Web Data Preprocessing


Chaodong Lu and Xin Xiong

Abstract  To address the text classification problems that arise in Web data mining and data preprocessing, this paper applies the Gini index to Web data mining preprocessing. Based on an in-depth analysis of the principles of the Gini index and of text features, a new measure function is constructed from the Gini index, and feature selection is carried out in the original feature space using the purity principle of the Gini index. The results show that the method achieves high classification accuracy with lower computational complexity, and can improve classification performance in Web data preprocessing.

Keywords  Web data mining  •  Data preprocessing  •  Gini index  •  Text classification

105.1 Introduction

The continuing improvement of data mining technology and the spread of its applications have laid the foundation for the emergence and wide use of Web mining. As users access Web content, their activity produces a new kind of Web-based data: the Web log. A Web log involves a variety of objects, including users, the query words they submit, and the Web pages they click. These objects have their own attributes and are also related to objects of other kinds. Exploiting this information can effectively improve user satisfaction with access to information on the Web and improve the utilization of information [1]. Web mining can help users find relevant information quickly and accurately, and can discover previously unknown, potentially useful information in Web data to

C. Lu (*)
Wuhan University of Technology, Wuhan 430070, Hubei, China
e-mail: [email protected]

C. Lu · X. Xiong
Henan Institute of Engineering, Zhengzhou 451191, Henan, China

Z. Zhong (ed.), Proceedings of the International Conference on Information Engineering and Applications (IEA) 2012, Lecture Notes in Electrical Engineering 220, DOI: 10.1007/978-1-4471-4844-9_105, © Springer-Verlag London 2013


understand the client's interests and to provide customization and other personalized information for specific users. However, the collected Web log data may contain redundant records and may also be missing some records, so the data must be processed before mining to obtain a format suitable for pattern discovery [2]. Preprocessing normally includes data cleaning, user identification, session identification, and path completion. Data cleaning deletes the log entries that are not needed for mining. User identification associates page references with the users who made them, distinguishing different users who request the same page, even those who share the same IP address. Session identification gathers all page references made by a given user in a log and divides them into user sessions. Path completion fills in the page references that are missing because of browser caches and proxy servers.
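The first three preprocessing steps can be illustrated with a minimal Python sketch. It is not the authors' implementation. The record layout, the list of ignored file suffixes, the use of the (IP address, user agent) pair to tell users apart, and the 30-minute session timeout are all assumptions chosen for illustration. Path completion is left out because it needs the site's link structure or referrer data, which this sketch does not model.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed heuristics for illustration; none of these values come from the paper.
IGNORED_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js", ".ico")
SESSION_TIMEOUT = timedelta(minutes=30)


def clean(records):
    """Data cleaning: keep successful requests that are not embedded resources."""
    return [
        r for r in records
        if 200 <= r["status"] < 300
        and not r["url"].lower().endswith(IGNORED_SUFFIXES)
    ]


def identify_users(records):
    """User identification: treat each distinct (IP, user agent) pair as one user."""
    users = defaultdict(list)
    for r in records:
        users[(r["ip"], r["agent"])].append(r)
    return users


def identify_sessions(user_records, timeout=SESSION_TIMEOUT):
    """Session identification: open a new session after a gap longer than timeout."""
    sessions = []
    for r in sorted(user_records, key=lambda rec: rec["time"]):
        if sessions and r["time"] - sessions[-1][-1]["time"] <= timeout:
            sessions[-1].append(r)
        else:
            sessions.append([r])
    return sessions


def preprocess(records):
    """Run cleaning, user identification and session identification in order."""
    result = {}
    for user, recs in identify_users(clean(records)).items():
        result[user] = [[r["url"] for r in s] for s in identify_sessions(recs)]
    return result


# Hypothetical, already-parsed log records used only to exercise the sketch.
T0 = datetime(2012, 1, 1, 10, 0, 0)
SAMPLE_LOG = [
    {"ip": "10.0.0.1", "agent": "A", "time": T0,
     "url": "/index.html", "status": 200},
    {"ip": "10.0.0.1", "agent": "A", "time": T0 + timedelta(seconds=1),
     "url": "/logo.gif", "status": 200},
    {"ip": "10.0.0.1", "agent": "B", "time": T0 + timedelta(minutes=1),
     "url": "/index.html", "status": 200},
    {"ip": "10.0.0.1", "agent": "A", "time": T0 + timedelta(minutes=5),
     "url": "/news.html", "status": 200},
    {"ip": "10.0.0.1", "agent": "A", "time": T0 + timedelta(minutes=50),
     "url": "/about.html", "status": 200},
    {"ip": "10.0.0.1", "agent": "A", "time": T0 + timedelta(minutes=51),
     "url": "/missing.html", "status": 404},
]

if __name__ == "__main__":
    for user, sessions in preprocess(SAMPLE_LOG).items():
        print(user, sessions)
```

In the sample data two users share one IP address and are separated only by their user agent, which mirrors the shared-IP case mentioned above. A production system would likely combine this with cookies or login data where available.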

105.2 Web Log Data Source and Pretreatment Processes

Web users access the information needed before minin