Pre-trained Data Augmentation for Text Classification
Hugo Queiroz Abonizio and Sylvio Barbon Junior
State University of Londrina (UEL), Londrina, Brazil {hugo.abonizio,barbon}@uel.br
Abstract. Data augmentation is a widely adopted method for improving model performance in image classification tasks. Although it is still not as ubiquitous in the Natural Language Processing (NLP) community, some methods have already been proposed to increase the amount of training data using simple text transformations or text generation through language models. However, recent text classification tasks need to deal with domains characterized by small amounts of text and informal writing, e.g., Online Social Network content, which reduces the capabilities of current methods. To face these challenges, taking advantage of pre-trained language models, low computational resource consumption, and model compression, we propose the PRE-trained Data AugmenTOR (PREDATOR) method. Our data augmentation method is composed of two modules: the Generator, which synthesizes new samples grounded on a lightweight model, and the Filter, which selects only the high-quality ones. Experiments comparing Bidirectional Encoder Representations from Transformers (BERT), Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), and Multinomial Naive Bayes (NB) on three datasets showed effective accuracy improvements: 28.5% with LSTM in the best scenario and an average of 8% across all scenarios. PREDATOR was able to augment real-world social media datasets as well as other domains, outperforming recent text augmentation techniques.
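The two-module design summarized above (a Generator proposing candidates, a Filter keeping only high-quality ones) can be sketched minimally as follows. The class names, the toy synonym table, and the quality heuristic are illustrative assumptions for exposition, not the authors' actual PREDATOR implementation, which relies on a lightweight language model.

```python
import random

# Toy synonym table standing in for a real generative model (assumption).
SYNONYMS = {"good": ["great", "fine"], "bad": ["poor", "awful"]}

class Generator:
    """Synthesizes candidate samples from a labeled seed sentence."""
    def generate(self, text, n=3):
        candidates = []
        for _ in range(n):
            words = text.split()
            # Replace one replaceable word with a synonym, if any exists.
            idx = [i for i, w in enumerate(words) if w in SYNONYMS]
            if idx:
                i = random.choice(idx)
                words[i] = random.choice(SYNONYMS[words[i]])
            candidates.append(" ".join(words))
        return candidates

class Filter:
    """Keeps candidates that differ from the seed but preserve its length
    (a stand-in quality criterion; the paper's Filter is more elaborate)."""
    def select(self, seed, candidates):
        return [c for c in candidates
                if c != seed and len(c.split()) == len(seed.split())]

gen, flt = Generator(), Filter()
seed = "the movie was good"
augmented = flt.select(seed, gen.generate(seed, n=5))
```

The key design point is the separation of concerns: the Generator is free to over-produce noisy candidates, while the Filter enforces quality before anything reaches the training set.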
Keywords: Data augmentation · Text classification · Online social networks

1 Introduction
Data augmentation techniques have been successfully applied in machine learning models to improve their generalization capacity. It is a common strategy to avoid overfitting the training data, mainly in scenarios of data scarcity and situations where labeled examples are expensive. Since the performance of machine learning models is highly correlated with the amount and the quality of the data used during training, low-data scenarios become a challenge for practitioners [13]. Several techniques have been proposed and evaluated for image data [30], but the field of textual data augmentation is still incipient. Simple transformations, such as flipping, cropping, and other image manipulations, are often label-preserving in image classification tasks [3,18], but this assumption does not hold for text data. Changing word order or removing parts of a sentence might change its whole semantics, resulting in low-quality samples and negatively impacting the performance.

The authors would like to thank the financial support of the National Council for Scientific and Technological Development (CNPq) of Brazil - Grant of Project 420562/2018-4 - and Fundação Araucária.

© Springer Nature Switzerland AG 2020. R. Cerri and R. C. Prati (Eds.): BRACIS 2020, LNAI 12319, pp. 551–565, 2020. https://doi.org/10.1007/978-3-030-61377-8_38
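To make the contrast with image augmentation concrete, the snippet below applies text analogues of two image-style transformations, flipping (word-order reversal) and cropping (truncation), to a sentence. The example sentence is illustrative; the point is that, unlike for pixels, both operations can invert or destroy the label-relevant meaning.

```python
def flip(text):
    # Analogue of horizontal image flipping: reverse the word order.
    return " ".join(reversed(text.split()))

def crop(text, keep=3):
    # Analogue of image cropping: keep only the first `keep` words.
    return " ".join(text.split()[:keep])

sentence = "the dog bit the man"
flipped = flip(sentence)  # "man the bit dog the": syntax and meaning lost
cropped = crop(sentence)  # "the dog bit": the object of the action is gone
```

Whereas a flipped photograph of a dog is still a photograph of a dog, neither transformed sentence is a label-preserving variant of the original, which is why text augmentation needs semantics-aware generation rather than blind geometric-style transforms.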