Bonsai: diverse and shallow trees for extreme multi-label classification
Sujay Khandagale · Han Xiao · Rohit Babbar

Received: 13 July 2019 / Revised: 1 November 2019 / Accepted: 4 June 2020
© The Author(s) 2020
Abstract

Extreme multi-label classification (XMC) refers to supervised multi-label learning involving hundreds of thousands or even millions of labels. In this paper, we develop a suite of algorithms, called Bonsai, which generalizes the notion of label representation in XMC and partitions the labels in the representation space to learn shallow trees. We show three concrete realizations of this label representation space: (i) the input space, spanned by the input features; (ii) the output space, spanned by label vectors based on their co-occurrence with other labels; and (iii) the joint space, combining the input and output representations. Furthermore, the constraint-free multi-way partitions learnt iteratively in these spaces lead to shallow trees. By combining the effect of shallow trees and generalized label representation, Bonsai achieves the best of both worlds: fast training, comparable to state-of-the-art tree-based methods in XMC, and much better prediction accuracy, particularly on tail labels. On the benchmark Amazon-3M dataset with 3 million labels, Bonsai outperforms a state-of-the-art one-vs-rest method in prediction accuracy while being approximately 200 times faster to train. The code for Bonsai is available at https://github.com/xmc-aalto/bonsai.

Keywords Large-scale multi-label classification · Extreme multi-label classification · Large label space
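The three representation spaces described in the abstract can be made concrete with a small sketch. The following is an illustrative reconstruction, not the authors' implementation: it builds one vector per label in each space from a dense feature matrix `X` of shape (n, d) and a binary label matrix `Y` of shape (n, L). The function name and the row-normalization step are assumptions for illustration.

```python
import numpy as np

def label_representations(X, Y, space="input"):
    """Build one vector per label in the chosen representation space.

    X: (n, d) instance feature matrix.
    Y: (n, L) binary instance-label matrix.
    Illustrative sketch of the three spaces; not the paper's exact code.
    """
    if space == "input":
        # Input space: each label is represented by the sum of the
        # feature vectors of the training instances tagged with it.
        V = Y.T @ X                        # shape (L, d)
    elif space == "output":
        # Output space: each label is represented by its co-occurrence
        # counts with every label (including itself).
        V = Y.T @ Y                        # shape (L, L)
    elif space == "joint":
        # Joint space: concatenation of the two representations above.
        V = np.hstack([Y.T @ X, Y.T @ Y])  # shape (L, d + L)
    else:
        raise ValueError(f"unknown space: {space}")
    # L2-normalize rows so that partitioning is driven by direction,
    # not by label frequency (an assumed but common choice).
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    return V / np.maximum(norms, 1e-12)
```

Partitioning the rows of the returned matrix (e.g. with k-means into K groups, recursively) then yields the shallow, multi-way label trees the paper describes.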
Editors: Larisa Soldatova, Joaquin Vanschoren.
* Rohit Babbar, [email protected]
1 Aalto University, Helsinki, Finland

Machine Learning

1 Introduction

Extreme Multi-label Classification (XMC) refers to supervised learning of a classifier which can automatically label an instance with a small subset of relevant labels from an extremely large set of all possible target labels. Machine learning problems involving hundreds of thousands of labels are common in various domains such as product categorization for e-commerce (McAuley and Leskovec 2013; Shen et al. 2011; Bengio et al. 2010; Agrawal et al. 2013), hash-tag suggestion in social media (Denton et al. 2015), annotating web-scale encyclopedias (Partalas et al. 2015), and image classification (Krizhevsky et al. 2012; Deng et al. 2010). It has been demonstrated that the framework of XMC can also be leveraged to effectively address ranking problems arising in bid-phrase suggestion in web advertising and in the suggestion of relevant items in recommendation systems (Prabhu and Varma 2014). From the machine learning perspective, building eff

Fig. 1 Label frequency in the dataset WikiLSHTC-325K shows a power-law distribution. The X-axis shows label IDs sorted by their frequency in training instances and the Y-axis gives the actual frequency (on a log scale). Note that more than half of the labels have fewer than 5 training instances
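The tail statistic highlighted in the caption of Fig. 1 is easy to reproduce in spirit. The sketch below draws synthetic Zipf-distributed per-label frequencies (illustrative data, not WikiLSHTC-325K itself) and counts the fraction of labels with fewer than 5 training instances, mirroring how the figure's X- and Y-axes are constructed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-label training frequencies drawn from a Zipf
# (power-law) distribution; an illustrative stand-in for the
# real label counts of an XMC dataset.
n_labels = 100_000
freqs = rng.zipf(2.0, size=n_labels)

# Sort labels by frequency in descending order, as on the
# X-axis of Fig. 1 (Y-axis would be these counts on a log scale).
sorted_freqs = np.sort(freqs)[::-1]

# Fraction of tail labels with fewer than 5 training instances.
tail_fraction = np.mean(freqs < 5)
print(f"labels with < 5 training instances: {tail_fraction:.1%}")
```

With a Zipf exponent of 2.0, well over half of the synthetic labels fall below 5 training instances, matching the qualitative picture in the caption.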