Investigation of Feature Selection Techniques on Performance of Automatic Text Categorization

Automatic text categorization (ATC) is a technique for classifying text documents. Predefined classes are assigned to documents based on their textual content. Large numbers of features are extracted from text documents, and documents are represented


1 Introduction

The volume of digital documents is growing very quickly, and keeping those documents classified is important. The main aim of digital document classification is to assign documents to predefined classes. It is an active research area in information retrieval [1] and in machine learning on digital text. Many supervised algorithms are applied to digital text documents for classification, such as support vector machines [2], Naïve Bayes [3], decision trees [4], and nearest neighbors [5]. Text categorization [6] of digital documents has two phases: a training phase and a classification (testing) phase. Earlier, subject indexing and feature extraction methods [7] were used for text categorization. However, these methods were not very successful for classification. Text categorization methods based on term frequency and inverse document frequency count how often a term occurs but do not consider where it occurs. Therefore, these methods were not efficient at determining the class of text data, because the position of a term is highly relevant for identifying a document.

The remainder of the paper is organized as follows: Sect. 2 discusses the related work. Sect. 3 describes the material and methodology used for this work. Sect. 4 presents the experimental results and discussion. Lastly, Sect. 5 concludes this study.
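The observation that frequency-based weighting ignores term position can be illustrated with a minimal sketch. It assumes the weighting meant in the text is the common term frequency multiplied by inverse document frequency; the function name `tf_idf` and the three toy sentences are hypothetical and are not taken from the paper's datasets or implementation.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (bag-of-words TF-IDF).

    Illustrative only: whitespace tokenization, relative term frequency,
    and an unsmoothed log(N / df) inverse document frequency.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical toy corpus: the first two documents contain the same
# terms in a different order and mean different things.
docs = [
    "dog bites man",
    "man bites dog",
    "stock prices fall",
]
vectors = tf_idf(docs)

# Word order is lost, so both documents get the same representation.
print(vectors[0] == vectors[1])  # prints True
```

Because only counts enter the weights, any reordering of the terms in a document leaves its vector unchanged, which is the limitation described above.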

D. S. Sisodia (B)
National Institute of Technology, Raipur, India
e-mail: [email protected]

A. Shukla
Jaypee University of Engineering & Technology, Guna, India
e-mail: [email protected]

© Springer Nature Singapore Pte Ltd. 2019
R. K. Shukla et al. (eds.), Data, Engineering and Applications, https://doi.org/10.1007/978-981-13-6347-4_7


2 Related Work

Earlier, text classification was done manually, which was not efficient. Later, many classification schemes came into existence, such as subject indexing [8], term frequency [9], Gini index [10], mutual information, and information gain [11]. A significant amount of research has since been done on automatic text categorization (ATC). Term frequency and subject indexing were also used for classification, but these techniques relied on term redundancy [12] and the subject index and missed the relevance of the term. The Gini index is a global feature selection method for text classification. It is an improved version of an attribute selection algorithm. Currently, weighted feature selection algorithms [13] are used for automatic text categorization, since they are based on the mutual information [14, 15] of the terms in the dataset. Mutual information and maximum entropy classification [16] are basic techniques used by researchers for machine learning and information retrieval from text documents.
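As an illustration of one of the selection criteria named above, the following sketch ranks terms by information gain, taken here in its usual sense as the reduction in class entropy obtained from knowing whether a term occurs in a document. The helper names, the toy corpus, and the class labels are hypothetical; this is not the authors' implementation, and the other criteria (Gini index, mutual information) are not shown.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )


def information_gain(docs, labels, term):
    """Entropy reduction in the labels from knowing whether `term` occurs."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (
        len(with_term) / n * entropy(with_term)
        + len(without_term) / n * entropy(without_term)
    )
    return entropy(labels) - conditional


# Hypothetical toy corpus: each document is represented as a set of terms.
docs = [
    {"goal", "match", "team"},
    {"team", "score", "win"},
    {"market", "stock", "price"},
    {"stock", "team", "trade"},
]
labels = ["sport", "sport", "finance", "finance"]

# Rank the vocabulary from most to least informative about the class.
vocabulary = sorted(set().union(*docs))
ranked = sorted(
    vocabulary,
    key=lambda t: information_gain(docs, labels, t),
    reverse=True,
)
print(ranked[0])  # prints stock
```

In this toy corpus "stock" separates the two classes perfectly and receives the full 1 bit of gain, whereas "team" occurs in both classes and scores much lower. A feature selector built on this criterion would keep only the top-ranked terms.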

3 Material and Methodology

3.1 Data Source

Four datasets have been taken from
