Data Mining
25.1 Introduction

The Web is the world’s largest source of information. It records the real world from many aspects at every moment. This success is due in part to XML-based technology, which provides a means of information interchange between applications, as well as a semistructured data model for integrating information and knowledge. Information retrieval has enabled the development of useful web search engines. Relevance criteria based on both textual content and link structure are very useful for effectively retrieving text-rich documents.

The wealth of information in huge databases or on the Web has aroused tremendous interest in the area of data mining, also known as knowledge discovery in databases (KDD). Data mining draws on a variety of techniques from the fields of databases, machine learning, and pattern recognition. The objective is to uncover useful patterns and associations in large databases: data mining automatically searches large stores of data for consistent patterns and/or relationships between variables so as to predict future behavior. The process of data mining consists of three phases: data preprocessing and exploration, model selection and validation, and final deployment.

Structured databases have well-defined features, so data mining on them can readily achieve good results. Web mining is more difficult since the World Wide Web is a less structured database. There are three general types of web mining: web structure mining, web usage mining (context mining), and web content mining. Content mining unveils useful information about the relationships of web pages based on their content. Similarly, context mining unveils useful information about the relationships of web pages based on past visitor activity. Context mining is usually applied to the access-logs of the web site.
Some of the most common data items found in access-logs are the IP address of the visitor, the date and time of the access, the time zone of the visitor, the size of the data transferred, the URL accessed, the protocol used, and the access method. The data stored in access-logs is configurable at the web server, with the items mentioned above appearing in most access-logs.

K.-L. Du and M. N. S. Swamy, Neural Networks and Statistical Learning, DOI: 10.1007/978-1-4471-5571-3_25, © Springer-Verlag London 2014
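To make the access-log items listed above concrete, the following minimal sketch extracts them from a single log line. It assumes the line follows the widely used Common Log Format; since the logged fields are configurable at the web server, the pattern would need adjusting for other layouts. The names `CLF_PATTERN` and `parse_access_log_line`, as well as the sample line, are illustrative assumptions and do not come from the text.

```python
import re
from datetime import datetime

# Assumed layout: Common Log Format. Real servers may log different fields.
CLF_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '                                  # IP address, identity, user
    r'\[(?P<timestamp>[^\]]+)\] '                            # date, time, and zone offset
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '   # access method, URL, protocol
    r'(?P<status>\d{3}) (?P<size>\d+|-)'                     # status code, bytes transferred
)


def parse_access_log_line(line):
    """Return a dict of fields from one log line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # "%z" parses the time-zone offset recorded with the timestamp, e.g. "-0700".
    fields["timestamp"] = datetime.strptime(
        fields["timestamp"], "%d/%b/%Y:%H:%M:%S %z"
    )
    fields["status"] = int(fields["status"])
    # A "-" in the size field means no bytes were transferred.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


# Hypothetical sample line in Common Log Format.
sample = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326'
)
record = parse_access_log_line(sample)
```

Records parsed this way could then be grouped by IP address and ordered by timestamp to approximate visitor sessions, which is the kind of past visitor activity that context mining works from.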
Machine learning provides the technical basis of data mining. Data mining first needs to discover the structural features in a database, and exploratory techniques based on self-organization, such as clustering, are particularly promising. Neurofuzzy systems are ideal tools for knowledge representation. Bayesian networks provide a consistent framework to model the probabilistic dependencies among variables. Classification is also a fundamental method in data mining. Raw data in databases typically contains obsolete or redundant fields, outliers, and invalid values. Data cleaning and data transformation may therefore be required before data mining. A graph