ViMs: a high-quality Vietnamese dataset for abstractive multi-document summarization


Nhi-Thao Tran¹ • Minh-Quoc Nghiem¹ • Nhung T. H. Nguyen¹ • Ngan Luu-Thuy Nguyen² • Nam Van Chi¹ • Dien Dinh¹

© Springer Nature B.V. 2020

¹ Faculty of Information Technology, HCMC University of Science, Ho Chi Minh City, Vietnam

² Faculty of Computer Science, HCMC University of Information Technology, Ho Chi Minh City, Vietnam

Lang Resources & Evaluation

Abstract Automatic text summarization is important in this era due to the exponential growth of documents available on the Internet. In the Vietnamese language, VietnameseMDS is the only publicly available dataset for this task. Although the dataset has 199 clusters, there are only three documents in each cluster, which is small compared to typical datasets in English. This motivates us to construct ViMs, a large, high-quality Vietnamese dataset for abstractive multi-document summarization. To that end, we recruited 29 annotators and enhanced MDSWriter, an open-source annotation tool, to support the annotators in creating gold standard summaries. As a result, ViMs has 600 summaries corresponding to 300 clusters of 1,945 documents. We have verified the reliability of our dataset using a variety of metrics: conventional Cohen's κ; relaxed Cohen's κ, a new metric that we propose to make it more suitable for abstractive summarization; and ROUGE scores. A relaxed κ score of 0.55 indicates that ViMs attains moderate agreement between annotators. Meanwhile, the ROUGE scores are 0.729 for ROUGE-1, 0.507 for ROUGE-2 and 0.524 for ROUGE-SU4. We have further evaluated ViMs using three different summarization systems: TextRank, CFVi and MUSEEC. Their ROUGE-1 scores are 0.628, 0.711 and 0.732, respectively. These results show that the ViMs dataset is suitable for both training and evaluating multi-document summarization systems. We have made the dataset and the evaluation results of this work publicly available to the research community. Unlike previous work, which published only the final summarization dataset, we also publish intermediate annotation results, which can be used in other NLP problems such as sentence classification.

Keywords Abstractive summarization • Multi-document summarization • Vietnamese dataset • Automatic summarization
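As an illustration of the conventional agreement metric named in the abstract, the following is a minimal Python sketch of Cohen's κ for two annotators, computed as κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The function name `cohen_kappa` and the binary sentence labels are hypothetical and are not taken from the ViMs data. The relaxed κ proposed by the authors is not defined in this excerpt, so it is not reproduced here.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Conventional Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both annotators.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two annotators'
    # marginal frequencies, summed over all labels.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if p_e == 1.0:
        # Both annotators used one identical label for every item.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels: 1 = sentence selected for the summary, 0 = not selected.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
annotator_2 = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

In this toy example the annotators agree on 8 of 10 items (p_o = 0.8) and the chance agreement is p_e = 0.52, so κ is 0.28 / 0.48, roughly 0.58.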

1 Introduction

In recent years, there has been increasing interest in automatic text summarization due to the exponential growth of documents available on the Internet. Automatic text summarization is the process of using software to create a concise and fluent summary that contains the major points of a document or a set of documents. Such systems save readers time when digesting the large volume of information available on the Internet. Based on the input, text summarization can be categori
