Mining Hidden Topics from Newspaper Quotations: The COVID-19 Pandemic




, Abu Bakar Siddiqur Rahman1(B) , Grigori Sidorov1(B) , and Alexander Gelbukh1(B)

1 Instituto Politécnico Nacional, Mexico City, México
{sidorov,gelbukh}@cic.ipn.mx
2 Dalat University, Da Lat, Viet Nam

Abstract. In this paper, we extract quotations from Al Jazeera’s news articles containing keywords related to the COVID-19 pandemic. We apply Latent Dirichlet allocation (LDA), coherence measures, and clustering algorithms to explore, in an unsupervised manner, latent topics in a dataset of about 3400 quotations and to see how the coronavirus affects human beings. By using noun phrases as inputs for training and the C_V measure for coherence values, we obtain an average coherence value of 0.66 with the lowest average number of topics, 24.8. The result covers some of the top issues that our world has been facing during the COVID-19 pandemic.

Keywords: Topic model · Latent Dirichlet Allocation · Quotation mining · COVID-19
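As an illustration of the kind of coherence score mentioned in the abstract, the following is a minimal sketch of a UMass-style coherence computed from document co-occurrence counts. It is not the authors' implementation: the function name `umass_coherence`, the toy documents, and the topic word lists are all assumptions made for this example, and library implementations may aggregate or normalise the score differently.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """UMass-style coherence of a single topic (illustrative sketch).

    top_words: topic words ordered from most to least probable.
    documents: iterable of tokenised documents (lists of words).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every one of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    # combinations() keeps the input order, so `earlier` is always the
    # more probable word of each pair and acts as the conditioning word.
    for earlier, later in combinations(top_words, 2):
        denom = doc_freq(earlier)
        if denom == 0:
            continue  # assumption: skip pairs whose first word never occurs
        score += math.log((doc_freq(earlier, later) + 1) / denom)
    return score


# Hypothetical toy corpus, invented for this example only.
toy_docs = [
    ["virus", "lockdown", "economy"],
    ["virus", "vaccine"],
    ["lockdown", "economy"],
    ["football", "match"],
]

coherent = umass_coherence(["virus", "lockdown", "economy"], toy_docs)
incoherent = umass_coherence(["virus", "football", "match"], toy_docs)
print(coherent, incoherent)
```

In this toy setting, the topic whose words co-occur in the same documents receives the higher score. A topic-number search would compute such a score for models trained with different numbers of topics and compare the results.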

1 Introduction

Originating in Wuhan, the coronavirus (COVID-19) has quickly spread to 229 countries and territories, becoming a global pandemic within just a few months. No one could have expected the massive impact of this virus on human society: a disease initially thought to be like the flu has led to hundreds of thousands of deaths and a predictable economic recession in 2020 and possibly in later years. Policies for dealing with the virus differ across countries, ranging from lockdowns (China, Italy, Spain, Germany) to a “herd immunity” model (Sweden, and initially recommended in the UK). Whatever policies or methods are used, and however effective they are, the voices of prominent people in a country probably carry enough weight to guide the public on how to counter the coronavirus.

A quotation (or quote) reflects someone’s statement or thought, typically recorded when a prominent individual says something in an interview or public speech. Direct quotations are considered more subjective than interpretations of them (indirect quotations) because they contain the exact words of the speakers or authors [1]. Hence, we use direct quotations to understand exactly what an individual thinks and his/her opinions about a certain problem, e.g., COVID-19. Note that we use "quotations" to refer to direct quotations in the remainder of this paper.

© Springer Nature Switzerland AG 2020
L. Martínez-Villaseñor et al. (Eds.): MICAI 2020, LNAI 12469, pp. 51–64, 2020. https://doi.org/10.1007/978-3-030-60887-3_5


T. H. Ta et al.

We feed about 3400 quotations from Al Jazeera into a Latent Dirichlet allocation (LDA) model to mine hidden topics and learn what these people, especially politicians, care about regarding the COVID-19 pandemic. We try several options for removing stopwords or high-probability words from quotations before or after training the LDA model. We select the best number of topics by combining coherence measures (C_V and C_UMass) with the sum of inverse topic frequency (ITF). Our purpose is to reduce the number of topics (topic number). Except f