Clustering by Intent: A Semi-Supervised Method to Discover Relevant Clusters Incrementally


1 Hewlett-Packard Labs, Palo Alto, USA
[email protected]
2 Hewlett-Packard Labs, Haifa, Israel

Abstract. Our business users have often been frustrated with clustering results that do not suit their purpose; when trying to discover clusters of product complaints, the algorithm may return clusters of product models instead. The fundamental issue is that complex text data can be clustered in many different ways, and, really, it is optimistic to expect relevant clusters from an unsupervised process, even with parameter tinkering. We studied this problem in an interactive context and developed an effective solution that re-casts the problem formulation, radically different from traditional or semi-supervised clustering. Given training labels of some known classes, our method incrementally proposes complementary clusters. In tests on various business datasets, we consistently get relevant results and at interactive time scales. This paper describes the method and demonstrates its superior ability using publicly available datasets. For automated evaluation, we devised a unique cluster evaluation framework to match the business user’s utility.

Keywords: Semi-supervised clustering · Class discovery · Topic detection

1 Introduction

Hewlett-Packard uses text mining techniques to help analyze customer surveys, customer support logs, engineer repair notes, system logs, etc. [11]. Though clustering technologies are employed to discover important topics in the data, usually only a small fraction of the proposed clusters are relevant. This is expected by data mining practitioners, but can prove somewhat disappointing to business users. The fundamental issue is that such complex text data can be clustered in many different ways, and it is unlikely that an unsupervised algorithm stumbles upon the one that suits the user’s current intent. We have often found that unsupervised algorithms still fail to produce useful clusters even after data mining experts repeatedly adjust the various parameters. Furthermore, once some initial large clusters are recognized and dealt with, the remaining data tends to produce decreasingly useful clusters. In fact, sometimes the removal of the known issues causes a shift to less relevant breakdowns of the data, e.g., by setting aside some clusters of known laptop issues (old batteries or cracked displays), the remaining data may be more likely to cluster by product type or geography, frustrating the intent of the user.

One may think that semi-supervised clustering algorithms would provide the answer [2], but they do not. We explored using constrained clustering, a form of semi-supervised learning with must-link and cannot-link constraints [3,26], but we found its results mostly useless for our purposes (see Tables 1 and 2). Additionally, we considered constrained non-negative matrix factorization (CNMF) methods [8,18].

© Springer International Publishing Switzerland 2015. A. Bifet et al. (Eds.): ECML PKDD 2015, Part III, LNAI 9286, pp. 20–36, 2015. DOI: 10.1007/978-3-319-23461-8_2
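To make the must-link and cannot-link terminology concrete, the following minimal Python sketch counts how many pairwise constraints a given cluster assignment violates. It is an illustration only, not the method of this paper nor the algorithms of [3,26]; the function name `count_violations`, the variable names, and the toy data are all assumptions introduced here.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[int, int]  # indices of two documents in the dataset


def count_violations(
    labels: Sequence[int],
    must_link: List[Pair],
    cannot_link: List[Pair],
) -> Tuple[int, int]:
    """Return (violated must-links, violated cannot-links).

    labels[i] is the cluster id assigned to document i (hypothetical input).
    A must-link pair is violated when its two documents land in different
    clusters; a cannot-link pair is violated when they share a cluster.
    """
    ml_violations = sum(1 for i, j in must_link if labels[i] != labels[j])
    cl_violations = sum(1 for i, j in cannot_link if labels[i] == labels[j])
    return ml_violations, cl_violations


# Toy assignment of five documents to three clusters (made-up data).
labels = [0, 0, 1, 1, 2]
must_link = [(0, 1), (2, 4)]    # pairs the user says belong together
cannot_link = [(1, 2), (2, 3)]  # pairs the user says must be separated

violations = count_violations(labels, must_link, cannot_link)
print(violations)
```

A constrained clustering algorithm would typically search for an assignment that keeps both counts at or near zero. Note that such pairwise constraints only describe relationships among documents the user has already examined.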
