A New Evolving Tree for Text Document Clustering and Visualization

Abstract
The Self-Organizing Map (SOM) is a popular neural network model for clustering and visualization problems. However, it suffers from two major limitations, viz., (1) it does not support online learning; and (2) the map size has to be predetermined, which can lead to many "trial-and-error" runs before arriving at an optimal map size. Thus, an evolving model, i.e., the Evolving Tree (ETree), is used as an alternative to the SOM for undertaking a text document clustering problem in this study. ETree forms a hierarchical (tree) structure in which nodes are allowed to grow, and each leaf node represents a cluster of documents. An experimental study is conducted using articles from a flagship conference of Universiti Malaysia Sarawak (UNIMAS), i.e., the Engineering Conference (ENCON). The experimental results are analyzed and discussed, and the outcome shows a new application of ETree in text document clustering and visualization.

W. L. Chang · K. M. Tay (corresponding author)
Faculty of Engineering, Universiti Malaysia Sarawak, Sarawak, Malaysia

C. P. Lim
Centre for Intelligent Systems Research, Deakin University, Geelong, Australia

V. Snášel et al. (eds.), Soft Computing in Industrial Applications, Advances in Intelligent Systems and Computing 223, DOI: 10.1007/978-3-319-00930-8_13, © Springer International Publishing Switzerland 2014

1 Introduction
Clustering is the task of assigning data objects to a number of groups (or clusters) so that objects in the same cluster are more similar to one another than to those in other clusters [1]. It converts sets of non-linear data into a human- and/or machine-understandable format, which can be very useful for unsupervised learning systems. Well-known clustering tools include the Self-Organizing Map (SOM) [2, 3], k-means clustering [4], and fuzzy c-means clustering [5, 6]. The SOM is an artificial neural network that maps high-dimensional data onto a low-dimensional grid of nodes [2, 7] while retaining the relationships among the data as faithfully as possible. Various applications of the SOM have been reported in the literature, e.g., speech recognition [8, 9], feature extraction [10], robotic arms [11], noise reduction in telecommunication [12], and textual document clustering [13]. Various extensions of the SOM, e.g., hierarchical search [14, 15], growing SOM [16, 17], growing hierarchical SOM (GHSOM) [18], and the evolving tree (ETree) [19], have also been proposed over the years. In general, these approaches increase the flexibility of the SOM and reduce the learning time when processing large data samples.

Text document clustering (also known as text categorization) is the process of grouping text documents according to their similarity [20]. The use of clustering tools in text document clustering is not new. Examples include the naive Bayes-based document clustering model [21], WEBSOM [22], and support vector machines-based for imbalanced text document classificat