Clustering by Intent: A Semi-Supervised Method to Discover Relevant Clusters Incrementally

1 Hewlett-Packard Labs, Palo Alto, USA
2 Hewlett-Packard Labs, Haifa, Israel

Abstract. Our business users have often been frustrated with clustering results that do not suit their purpose; when trying to discover clusters of product complaints, the algorithm may return clusters of product models instead. The fundamental issue is that complex text data can be clustered in many different ways, and, really, it is optimistic to expect relevant clusters from an unsupervised process, even with parameter tinkering. We studied this problem in an interactive context and developed an effective solution that recasts the problem formulation in a way that is radically different from traditional or semi-supervised clustering. Given training labels of some known classes, our method incrementally proposes complementary clusters. In tests on various business datasets, we consistently get relevant results at interactive time scales. This paper describes the method and demonstrates its superior ability using publicly available datasets. For automated evaluation, we devised a unique cluster evaluation framework to match the business user’s utility.

Keywords: Semi-supervised clustering · Class discovery · Topic detection

1 Introduction

Hewlett-Packard uses text mining techniques to help analyze customer surveys, customer support logs, engineer repair notes, system logs, etc. [11]. Though clustering technologies are employed to discover important topics in the data, usually only a small fraction of the proposed clusters are relevant. This is expected by data mining practitioners, but can prove somewhat disappointing to business users. The fundamental issue is that such complex text data can be clustered in many different ways, and it is unlikely that an unsupervised algorithm stumbles upon the one that suits the user’s current intent. We have often found that such algorithms still fail to produce useful clusters even after data mining experts repeatedly adjust the various parameters. Furthermore, once some initial large clusters are recognized and dealt with, the remaining data tends to produce decreasingly useful clusters. In fact, sometimes the removal of the known issues causes a shift to less relevant breakdowns of the data: for example, after setting aside some clusters of known laptop issues (old batteries or cracked displays), the remaining data may be more likely to cluster by product type or geography, frustrating the intent of the user.

© Springer International Publishing Switzerland 2015. A. Bifet et al. (Eds.): ECML PKDD 2015, Part III, LNAI 9286, pp. 20–36, 2015. DOI: 10.1007/978-3-319-23461-8_2

One may think that semi-supervised clustering algorithms would provide the answer [2], but they do not. We explored using constrained clustering, a form of semi-supervised learning with must-link and cannot-link constraints [3,26], but we found its results mostly useless for our purposes (see Tables 1 and 2). Additionally, we considered constrained non-negative matrix factorization (CNMF) methods [8,18].