A comparative study of abstractive and extractive summarization techniques to label subgroups on patent dataset


Cinthia M. Souza1 · Magali R. G. Meireles1 · Paulo E. M. Almeida2

Received: 23 December 2019
© Akadémiai Kiadó, Budapest, Hungary 2020

Abstract
Patents are an important source of information for measuring the technological advancement of a specific knowledge domain. To facilitate the search for information in patent datasets, classification systems separate documents into groups according to the area of knowledge and assign names that describe their content. The increase in the number of patented inventions leads to the need to subdivide these groups. Since these groups belong to a restricted knowledge domain, naming the generated subcategories can be extremely laborious. This work aims to compare the performance of abstractive and extractive summarization techniques in the task of generating sentences directly associated with the content of patents. The abstractive summarization model was composed of a Seq2Seq architecture and an LSTM network. The training was conducted with a dataset of patent titles and abstracts. The validation process was performed using the ROUGE set of metrics. The results obtained by the generated model were compared with the sentence resulting from an extractive summarization algorithm applied to the task of naming patent groups. The main idea was to help the specialist name new patent groups created by clustering systems. The naming experiments were performed on a dataset of abstracts of patent documents. Comparative experiments were conducted using four subgroups from the United States Patent and Trademark Office, which uses the Cooperative Patent Classification system.

Keywords Computational intelligence · Knowledge representation · Information systems · Automatic text summarization · Patent datasets
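As an illustration of the n-gram overlap that the ROUGE metrics measure, a minimal Python sketch of ROUGE-N is given here. It assumes plain lowercase whitespace tokenization with no stemming. The function name `rouge_n` and the example strings are hypothetical and are not taken from the study, whose ROUGE implementation is not specified in this section.

```python
from collections import Counter


def rouge_n(candidate: str, reference: str, n: int = 1) -> dict:
    """Return ROUGE-N precision, recall and F1 for one candidate/reference pair."""

    def ngrams(text: str) -> Counter:
        # Assumption: lowercase whitespace tokenization, no stemming.
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Clipped overlap: an n-gram counts only as often as it occurs in both texts.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example: a generated label scored against a reference title.
scores = rouge_n("wireless power transfer system",
                 "system for wireless power transfer")
print(scores)
```

In this example every unigram of the candidate appears in the reference, so precision is 1.0, while recall is 4/5 because the reference word "for" is not covered. Published evaluations usually rely on an established ROUGE package rather than a hand-written function like this one.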

* Magali R. G. Meireles [email protected]

1 Pontifical Catholic University of Minas Gerais, Belo Horizonte, MG, Brazil

2 Federal Center for Technological Education of Minas Gerais, Belo Horizonte, MG, Brazil




Scientometrics

Introduction

Patents are an important knowledge source and, therefore, their analysis has been considered a useful tool for research and for management development. Patents are one of the most effective ways to protect an invention today (Wang et al. 2019). One of the objectives of granting patents is to facilitate the dissemination of scientific knowledge (Ouellette 2017). However, finding information in these documents is becoming an increasingly complex task due to the large number of patents in datasets (Sjögren et al. 2018). These documents use complex language, with excessive descriptive technical detail and idiosyncrasies related to the structure of the patent document and the length of its sentences. As a result, retrieving and analyzing these documents is time-consuming and laborious (Codina-Filbà et al. 2017; Gomez 2019). The efficient analysis of these documents allows for monitori