Pre-trained Data Augmentation for Text Classification
Hugo Queiroz Abonizio and Sylvio Barbon Junior
State University of Londrina (UEL), Londrina, Brazil {hugo.abonizio,barbon}@uel.br
Abstract. Data augmentation is a widely adopted method for improving model performance in image classification tasks. Although it is still not as ubiquitous in the Natural Language Processing (NLP) community, some methods have already been proposed to increase the amount of training data using simple text transformations or text generation through language models. However, recent text classification tasks need to deal with domains characterized by small amounts of text and informal writing, e.g., Online Social Network content, which reduces the capabilities of current methods. To face these challenges, taking advantage of pre-trained language models, low computational resource consumption, and model compression, we propose the PRE-trained Data AugmenTOR (PREDATOR) method. Our data augmentation method is composed of two modules: the Generator, which synthesizes new samples grounded on a lightweight model, and the Filter, which selects only the high-quality ones. Experiments comparing Bidirectional Encoder Representations from Transformers (BERT), Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), and Multinomial Naive Bayes (NB) on three datasets showed effective accuracy improvements: 28.5% with LSTM in the best scenario and an average of 8% across all scenarios. PREDATOR was able to augment real-world social media datasets as well as other domains, outperforming recent text augmentation techniques.
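The two-module design summarized above (a Generator proposing candidates, a Filter keeping only high-quality ones) can be sketched minimally as follows. The class names, the toy synonym table, and the quality heuristic are illustrative assumptions for exposition, not the authors' actual PREDATOR implementation, which relies on a lightweight language model.

```python
import random

# Toy synonym table standing in for a real generative model (assumption).
SYNONYMS = {"good": ["great", "fine"], "bad": ["poor", "awful"]}

class Generator:
    """Synthesizes candidate samples from a labeled seed sentence."""
    def generate(self, text, n=3):
        candidates = []
        for _ in range(n):
            words = text.split()
            # Replace one replaceable word with a synonym, if any exists.
            idx = [i for i, w in enumerate(words) if w in SYNONYMS]
            if idx:
                i = random.choice(idx)
                words[i] = random.choice(SYNONYMS[words[i]])
            candidates.append(" ".join(words))
        return candidates

class Filter:
    """Keeps candidates that differ from the seed but preserve its length
    (a stand-in quality criterion; the paper's Filter is more elaborate)."""
    def select(self, seed, candidates):
        return [c for c in candidates
                if c != seed and len(c.split()) == len(seed.split())]

gen, flt = Generator(), Filter()
seed = "the movie was good"
augmented = flt.select(seed, gen.generate(seed, n=5))
```

The key design point is the separation of concerns: the Generator is free to over-produce noisy candidates, while the Filter enforces quality before anything reaches the training set.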
Keywords: Data augmentation · Text classification · Online social networks

1 Introduction
Data augmentation techniques have been successfully applied in machine learning models to improve their generalization capacity. It is a common strategy to avoid overfitting the training data, mainly in scenarios of data scarcity and situations where labeled examples are expensive. Since the performance of machine learning models is highly correlated with the amount and the quality of the data used during training, low-data scenarios become a challenge for practitioners [13]. Several techniques have been proposed and evaluated for image data [30], but the field of textual data augmentation is still incipient. Simple transformations, such as flipping, cropping, and other image manipulations, are often label-preserving in image classification tasks [3,18], but this assumption does not hold for text data. Changing word order or removing parts of a sentence might change its whole semantics, resulting in low-quality samples and negatively impacting the performance.

The authors would like to thank the financial support of the National Council for Scientific and Technological Development (CNPq) of Brazil - Grant of Project 420562/2018-4 - and Fundação Araucária.

© Springer Nature Switzerland AG 2020. R. Cerri and R. C. Prati (Eds.): BRACIS 2020, LNAI 12319, pp. 551–565, 2020. https://doi.org/10.1007/978-3-030-61377-8_38
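To make the contrast with image augmentation concrete, the snippet below applies text analogues of two image-style transformations, flipping (word-order reversal) and cropping (truncation), to a sentence. The example sentence is illustrative; the point is that, unlike for pixels, both operations can invert or destroy the label-relevant meaning.

```python
def flip(text):
    # Analogue of horizontal image flipping: reverse the word order.
    return " ".join(reversed(text.split()))

def crop(text, keep=3):
    # Analogue of image cropping: keep only the first `keep` words.
    return " ".join(text.split()[:keep])

sentence = "the dog bit the man"
flipped = flip(sentence)  # "man the bit dog the": syntax and meaning lost
cropped = crop(sentence)  # "the dog bit": the object of the action is gone
```

Whereas a flipped photograph of a dog is still a photograph of a dog, neither transformed sentence is a label-preserving variant of the original, which is why text augmentation needs semantics-aware generation rather than blind geometric-style transforms.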