Mining Web Data

“Data is a precious thing, and will last longer than the systems themselves.”—Tim Berners-Lee

18.1 Introduction

C. C. Aggarwal, Data Mining: The Textbook, DOI 10.1007/978-3-319-14142-8_18 © Springer International Publishing Switzerland 2015

The Web is a unique phenomenon in many ways: in its scale, in the distributed and uncoordinated nature of its creation, in the openness of the underlying platform, and in the resulting diversity of applications it has enabled. Examples of such applications include e-commerce, user collaboration, and social network analysis. Because the Web is both created and used in a distributed and uncoordinated way, it is a rich treasure trove of diverse types of data. This data can be either a source of knowledge about various subjects or personal information about users. Aside from the content available in the documents on the Web, the usage of the Web results in a significant amount of data in the form of user logs or Web transactions.

There are two primary types of data available on the Web that are used by mining algorithms.

1. Web content information: This information corresponds to the Web documents and links created by users. The documents are linked to one another with hypertext links. Thus, the content information contains two components that can be mined either together or in isolation.
  • Document data: The document data are extracted from the pages on the World Wide Web. Some of these extraction methods are discussed in Chap. 13.
  • Linkage data: The Web can be viewed as a massive graph, in which the pages correspond to nodes, and the linkages correspond to edges between nodes. This linkage information can be used in many ways, such as searching the Web or determining the similarity between nodes.

2. Web usage data: This data corresponds to the patterns of user activity that are enabled by Web applications. These patterns could be of various types.
  • Web transactions, ratings, and user feedback: Web users frequently buy various types of items on the Web, or express their affinity for specific products in the form of ratings. In such cases, the buying behavior and/or ratings can be leveraged to make inferences about the preferences of different users. In some cases, the user feedback is provided in the form of textual user reviews that are referred to as opinions.
  • Web logs: User browsing behavior is captured in the form of Web logs that are maintained at most Web sites. This browsing information can be leveraged to make inferences about user activity.
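The graph view of linkage data described above can be made concrete with a small sketch. The page names and links below are hypothetical toy data, and the two functions are illustrative choices rather than methods taken from this chapter: in-degree serves as a very rough stand-in for how link structure might inform search, and Jaccard overlap of out-links is one simple assumed way to compare two nodes.

```python
# Hypothetical toy link data: each page (node) maps to the pages it links to (edges).
links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["b.html", "c.html"],
}


def in_degrees(graph):
    """Count the incoming links of every page in the graph."""
    counts = {page: 0 for page in graph}
    for targets in graph.values():
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


def jaccard_out_link_similarity(graph, u, v):
    """Jaccard overlap of the out-link sets of pages u and v (an assumed measure)."""
    out_u = set(graph.get(u, []))
    out_v = set(graph.get(v, []))
    union = out_u | out_v
    if not union:
        return 0.0  # two pages with no out-links share nothing to compare
    return len(out_u & out_v) / len(union)


if __name__ == "__main__":
    print(in_degrees(links))
    print(jaccard_out_link_similarity(links, "a.html", "d.html"))
```

In this toy graph, `a.html` and `d.html` link to exactly the same pages, so their similarity is the maximum value under this measure, while `d.html` has no incoming links at all.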

These diverse data types automatically define the types of applications that are common on the Web. In line with the different data types, the applications are also either content-centric or usage-centric.

1. Content-centric applications: The documents and links on the Web are used in various applications such as search, clustering, and classification. Some examples of such applications are as follows:
  • Data mining applications: Web documents are used