Assisting cluster coherency via n-grams and clustering as a tool to deal with the new user problem
ORIGINAL ARTICLE
Christos Bouras • Vassilis Tsogkas
Received: 12 December 2013 / Accepted: 29 April 2014
© Springer-Verlag Berlin Heidelberg 2014
Abstract Collaborative filtering systems typically need to acquire some data about a new user before they can start making personalized suggestions, a situation commonly referred to as the "new user problem". In this work we attempt to address the new user problem via a personalized strategy for prompting the user with articles to rate. Our approach makes use of hypernyms extracted from the WordNet database and converges quickly to the actual user interests based on a minimal number of user ratings, which are provided during the registration process. In addition, we explore the possible enhancement of document clustering results, in particular the clustering of news articles from the web, when word-based n-grams are used during the keyword extraction phase. We present and evaluate a weighting approach that combines the clustering of news articles derived from the web with n-grams that are extracted from the articles at an offline stage. This technique is then compared with the single-minded "bag-of-words" representation that our clustering algorithm, W-kmeans, previously used. Our experimentation reveals that by fine-tuning the weighting parameters between keywords and n-grams, as well as the value of n itself, a significant improvement in the clustering result metrics can be achieved.
Keywords New user problem · Collaborative filtering · Clustering · W-kmeans · K-means · Personalized strategy · n-grams · Text preprocessing
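To make the idea of weighting keywords against word-based n-grams concrete, the following is a minimal sketch and not the authors' W-kmeans implementation. The function names, the whitespace tokenizer, and the linear weighting with `keyword_weight` and `ngram_weight` are all assumptions introduced for illustration; the abstract does not specify the actual weighting scheme.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Return the word-based n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def weighted_features(text, n=2, keyword_weight=0.7, ngram_weight=0.3):
    """Build one feature vector mixing single keywords and word n-grams.

    Hypothetical scheme: each raw count is scaled by its weight, so the
    two weights control how much n-grams contribute next to keywords.
    """
    tokens = text.lower().split()  # naive tokenizer, assumed for brevity
    features = Counter()
    for term, count in Counter(tokens).items():
        features[term] += keyword_weight * count
    for gram, count in Counter(word_ngrams(tokens, n)).items():
        features[gram] += ngram_weight * count
    return dict(features)
```

Vectors built this way could be passed to any clustering routine; varying `n` and the two weights corresponds to the kind of parameter tuning the abstract describes.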
1 Introduction
Every day, more and more news articles, books, journals, research papers, web pages, and movies are made available online. As the volume of available information grows, we quickly become overwhelmed and seek assistance in finding the most interesting, valuable, or entertaining items on which to spend our scarce time. Historically, humans have adapted well to pieces of information and have developed an excellent filtering ability to make quick judgments. Three technologies are commonly used to address these information overload challenges. Each of them focuses primarily on a particular set of tasks or questions:
C. Bouras · V. Tsogkas
Computer Engineering and Informatics Department, University of Patras, Patras, Greece
e-mail: [email protected]
C. Bouras (corresponding author)
Computer Technology Institute and Press "Diophantus", Rion, 26500 Patras, Greece
e-mail: [email protected]
• Information Retrieval (IR), which focuses on tasks that involve fulfilling ephemeral interest queries, such as finding the articles related to President Obama
• Information Filtering (IF), which focuses on tasks that involve classifying streams of new content into categories, such as finding any newly released articles regarding the political situation in the Middle East, or any newly released movies w