Research on the Web Data Preprocessing

Chaodong Lu and Xin Xiong

Abstract To address text classification in Web data mining and data preprocessing, this paper applies the Gini index to Web data mining preprocessing. Through in-depth analysis of the principles of the Gini index and of text features, a new Gini-index-based measure function is constructed and used for feature selection in the original feature space, following the purity principle of the Gini index. The results show that the method achieves high classification accuracy with lower computational complexity and can improve classification performance in Web data preprocessing.

Keywords Web data mining • Data preprocessing • Gini index • Text classification
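The purity idea behind Gini-index feature selection can be illustrated with a minimal sketch. The measure function constructed in the paper is not reproduced in this excerpt, so the score below, the sum over classes of P(class | term) squared, is an assumed, commonly used purity form rather than the authors' definition. The function names `gini_purity` and `select_features` and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def gini_purity(docs, labels):
    """Score each term by sum_c P(c | term)^2 (assumed purity form).

    A score of 1.0 means the term occurs in documents of one class only;
    lower scores mean the term is spread across classes.
    """
    # term -> Counter mapping class label -> number of documents containing the term
    doc_freq = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):  # count each term once per document
            doc_freq[term][label] += 1

    scores = {}
    for term, per_class in doc_freq.items():
        total = sum(per_class.values())
        scores[term] = sum((n / total) ** 2 for n in per_class.values())
    return scores


def select_features(docs, labels, k):
    """Return the k terms with the highest purity (ties broken alphabetically)."""
    scores = gini_purity(docs, labels)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


# Toy corpus (hypothetical): tokenized documents with class labels.
docs = [
    ["goal", "match", "team"],
    ["team", "score", "goal"],
    ["stock", "market", "team"],
    ["market", "price", "stock"],
]
labels = ["sport", "sport", "finance", "finance"]

scores = gini_purity(docs, labels)
top_terms = select_features(docs, labels, 6)
```

In this toy corpus "team" appears in both classes and receives the lowest score, so it is the one term left out of the top six. A raw purity score also rewards terms that occur in a single document, so a practical measure would usually weight it by term frequency as well.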

C. Lu (*) Wuhan University of Technology, Wuhan 430070, Hubei, China e-mail: [email protected]
C. Lu · X. Xiong Henan Institute of Engineering, Zhengzhou 451191, Henan, China

Z. Zhong (ed.), Proceedings of the International Conference on Information Engineering and Applications (IEA) 2012, Lecture Notes in Electrical Engineering 220, DOI: 10.1007/978-1-4471-4844-9_105, © Springer-Verlag London 2013

105.1 Introduction

Data mining technology continues to improve, and its applications have laid the foundation for the emergence and wide use of Web mining. As users consume Web content, their activity produces a new kind of data: the Web log. It records a variety of objects, including users, the query words they submit, and the Web pages they click. These objects have their own attributes and are also related to objects of other kinds; exploiting this information can effectively improve users' satisfaction with the information they obtain on the Web and improve the utilization of information [1]. Web mining can help users find relevant information quickly and accurately, and can discover previously unknown, potentially useful information in Web data in order to understand clients' interests and to provide customized, personalized information for specific users.

However, collected Web log data may contain redundant records and may be missing others, so the data must be processed before mining to obtain a format suitable for pattern discovery [2]. The process normally includes data cleaning, user identification, session identification, and path completion. Data cleaning deletes the log entries that the mining process does not need. User identification associates page references with the users who made them, distinguishing different users who request the same page, even when they share an IP address. Session identification gathers all page references made by a given user in the log and divides them into user sessions. Path completion fills in page references that are missing because of browser caching and proxy servers.
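The first three preprocessing steps can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' procedure: the record layout (a dict with `ip`, `agent`, `time`, `url`, `status`), the static-file suffix list, the use of IP address plus user agent to separate users, and the 30-minute session timeout are all assumptions chosen for the example. Path completion is omitted because it requires the site's link structure.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed cleaning rule: embedded static resources are not needed for mining.
STATIC_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js", ".ico")
# Assumed session threshold; a gap longer than this starts a new session.
SESSION_TIMEOUT = timedelta(minutes=30)


def clean(records):
    """Data cleaning: keep successful requests for non-static pages."""
    return [
        r for r in records
        if r["status"] == 200 and not r["url"].lower().endswith(STATIC_SUFFIXES)
    ]


def identify_users(records):
    """User identification: separate users sharing an IP by their user agent."""
    users = defaultdict(list)
    for r in records:
        users[(r["ip"], r["agent"])].append(r)
    return users


def identify_sessions(user_records):
    """Session identification: split one user's requests on long time gaps."""
    sessions = []
    for r in sorted(user_records, key=lambda rec: rec["time"]):
        if sessions and r["time"] - sessions[-1][-1]["time"] <= SESSION_TIMEOUT:
            sessions[-1].append(r)
        else:
            sessions.append([r])
    return sessions


def rec(ip, agent, hhmm, url, status=200):
    """Build a hypothetical, already-parsed log record for the example."""
    time = datetime.strptime("2012-01-01 " + hhmm, "%Y-%m-%d %H:%M")
    return {"ip": ip, "agent": agent, "time": time, "url": url, "status": status}


log = [
    rec("10.0.0.1", "Firefox", "09:00", "/index.html"),
    rec("10.0.0.1", "Firefox", "09:00", "/logo.gif"),
    rec("10.0.0.1", "Chrome", "09:01", "/index.html"),
    rec("10.0.0.1", "Firefox", "09:05", "/news.html"),
    rec("10.0.0.1", "Firefox", "10:30", "/index.html"),
    rec("10.0.0.1", "Chrome", "09:02", "/missing.html", status=404),
]

cleaned = clean(log)
users = identify_users(cleaned)
sessions = {user: identify_sessions(recs) for user, recs in users.items()}
```

Here both users share one IP address and are told apart only by user agent, which is a heuristic and can fail when two users run the same browser behind one proxy.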

105.2 Web Log Data Source and Pretreatment Processes

Web users access the information needed before minin
