Building a knowledge graph by using cross-lingual transfer method and distributed MinIE algorithm on apache spark


S.I. : WORLDCIST'20

Phuc Do1 · Trung Phan1 · Hung Le1 · Brij B. Gupta2,3,4

Received: 18 June 2020 / Accepted: 27 October 2020
© Springer-Verlag London Ltd., part of Springer Nature 2020

Abstract For centuries, text has been the simplest and most effective way to store human knowledge. With advances in technology, the volume of text keeps growing, and extracting useful information from it has become an exceptionally complex task. To address this problem, we present a pipeline that extracts core knowledge from large quantities of text using distributed computing. The components of our pipeline are systems known to yield good results. The outputs of the proposed system are stored in a knowledge graph, that is, a graph that stores knowledge as triples (head, relation, tail). Existing knowledge graphs include the Google Knowledge Graph, YAGO, DBLP, and DBpedia. These knowledge graphs have one thing in common: they are in English. English is studied by many researchers worldwide and has become a rich-resource language (with many natural language processing tools and data sets). Vietnamese, on the other hand, is a low-resource language. Therefore, we use a cross-lingual transfer method to build a Vietnamese knowledge graph. First, we collect text about Vietnam tourism, written mostly in Vietnamese, using Google Search and Wikipedia. Next, we translate it into English with Google Translate and use English natural language processing tools such as the Stanford Parser, coreference resolution, ClausIE, and MinIE to extract useful triples from the text. Finally, the triples are translated back into Vietnamese to build a Vietnam tourism knowledge graph. Because we work with massive text, we develop a distributed algorithm to extract triples from its sentences. This is a distributed version of MinIE, which was originally developed for a single-machine model.
In the Apache Spark framework, we divide the massive text into many smaller parts and move them to the worker nodes, each of which runs the distributed MinIE function. On each worker node, Spark distributed MinIE extracts the triples from the sentences of the local text, in parallel with the other nodes. Finally, the results of the worker nodes are sent back to the master node to build the knowledge graph. We conduct experiments with distributed MinIE on a Spark cluster to demonstrate the performance advantage of our proposed algorithm.

Keywords Knowledge graph · Cross-lingual transfer method · Distributed MinIE · Natural language processing · Triples extraction
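
The partition, extract, and collect pattern summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: a thread pool stands in for the Spark worker nodes, and `toy_extract` is a hypothetical placeholder for MinIE that simply treats the first word of a sentence as the head, the second as the relation, and the rest as the tail. All function names below are assumptions introduced for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, NamedTuple, Tuple


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str


def toy_extract(sentence: str) -> List[Triple]:
    """Hypothetical stand-in for MinIE: first word = head,
    second word = relation, remaining words = tail."""
    words = sentence.strip().rstrip(".").split()
    if len(words) < 3:
        return []  # too short to form a triple
    return [Triple(words[0], words[1], " ".join(words[2:]))]


def partition(items: List[str], n_parts: int) -> List[List[str]]:
    """Split the sentences into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        parts.append(items[start:end])
        start = end
    return parts


def extract_partition(sentences: List[str]) -> List[Triple]:
    """Work done on one 'worker': extract triples from its local sentences."""
    return [t for s in sentences for t in toy_extract(s)]


def distributed_extract(sentences: List[str], n_workers: int = 2) -> List[Triple]:
    """Partition the text, process the partitions in parallel, and gather
    the results on the caller (playing the role of the master node)."""
    parts = partition(sentences, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(extract_partition, parts))  # order preserved
    return [t for part in results for t in part]


def build_graph(triples: List[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Store triples as an adjacency map: head -> [(relation, tail), ...]."""
    graph: Dict[str, List[Tuple[str, str]]] = {}
    for t in triples:
        graph.setdefault(t.head, []).append((t.relation, t.tail))
    return graph


sentences = [
    "Hanoi is the capital of Vietnam.",
    "Hue has an imperial citadel.",
    "Hi.",
]
triples = distributed_extract(sentences, n_workers=2)
graph = build_graph(triples)
```

On a real cluster, the same shape would presumably map onto Spark operations that apply a function to each partition and then collect the results; the sketch only shows the data flow, not the performance behaviour.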

Brij B. Gupta [email protected]
Phuc Do [email protected]
Trung Phan [email protected]
Hung Le [email protected]

1 University of Information Technology, Vietnam National University, Ho Chi Minh City, Vietnam

2 National Institute o