Building a knowledge graph by using cross-lingual transfer method and distributed MinIE algorithm on apache spark


S.I. : WORLDCIST'20

Phuc Do1 · Trung Phan1 · Hung Le1 · Brij B. Gupta2,3,4

Received: 18 June 2020 / Accepted: 27 October 2020 © Springer-Verlag London Ltd., part of Springer Nature 2020

Abstract For centuries, text has been the simplest and most effective way to store human knowledge. With advances in technology, the volume of text keeps growing, and extracting useful information from it has become an exceptionally complex task. To address this problem, we present a pipeline that extracts core knowledge from large quantities of text using distributed computing. The components of our pipeline are systems known to yield good results. The outputs of our proposed system are stored in a knowledge graph, which is a graph that stores knowledge in the form of triples (head, relation, tail). Existing knowledge graphs include the Google Knowledge Graph, YAGO, DBLP, and DBpedia. These knowledge graphs have one thing in common: they are in English. English is studied by many researchers worldwide and has become a rich-resource language, with many natural language processing tools and data sets. Vietnamese, on the other hand, is a low-resource language. Therefore, we use a cross-lingual transfer method to build a Vietnamese knowledge graph. First, we collect text about Vietnam tourism, written mostly in Vietnamese, using Google Search and Wikipedia. Next, we translate it into English with Google Translate and use English natural language processing tools such as the Stanford Parser, coreference resolution, ClausIE, and MinIE to extract useful triples from the text. Finally, the triples are translated back into Vietnamese to build a Vietnam tourism knowledge graph. Because we work with massive text, we develop a distributed algorithm to extract triples from its sentences. This is a distributed version of MinIE, which was originally developed for a single machine. In the Apache Spark framework, we divide the massive text into many smaller parts and move them to the worker nodes, each of which runs the distributed MinIE function. Each worker node extracts, in parallel, the triples from the sentences in its local text. The results from the worker nodes are then sent back to the master node to build the knowledge graph. We conduct experiments with the distributed MinIE on a Spark cluster to demonstrate the performance gains of our proposed algorithm.

Keywords Knowledge graph · Cross-lingual transfer method · Distributed MinIE · Natural language processing · Triples extraction
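The partition, extract, and collect flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a thread pool from Python's standard library stands in for the Spark worker nodes, and `extract_triples` is a hypothetical toy extractor that stands in for MinIE. The function names, the round-robin partitioning, and the sample sentences are all assumptions made for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def extract_triples(sentence):
    """Toy stand-in for MinIE (assumption): read '<head> <relation> <tail...>'."""
    words = sentence.strip().rstrip(".").split()
    if len(words) < 3:
        return []
    return [(words[0], words[1], " ".join(words[2:]))]


def split_into_partitions(sentences, num_partitions):
    """Deal sentences round-robin into num_partitions smaller lists."""
    return [sentences[i::num_partitions] for i in range(num_partitions)]


def process_partition(partition):
    """Work done by one 'worker': extract triples from its local sentences."""
    triples = []
    for sentence in partition:
        triples.extend(extract_triples(sentence))
    return triples


def build_graph(sentences, num_partitions=2):
    """Partition the text, extract in parallel, then merge on the 'master'."""
    partitions = split_into_partitions(sentences, num_partitions)
    # Threads play the role of worker nodes here; this is not Spark.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        results = list(pool.map(process_partition, partitions))
    # The knowledge graph is kept as head -> list of (relation, tail) edges.
    graph = defaultdict(list)
    for triples in results:
        for head, relation, tail in triples:
            graph[head].append((relation, tail))
    return dict(graph)


# Illustrative sentences only; they are not taken from the paper's data set.
sentences = [
    "Hanoi is the capital of Vietnam.",
    "Hue has an imperial citadel.",
    "Hanoi hosts many museums.",
]
graph = build_graph(sentences)
```

In an actual Spark job, the partitioning and the collection of results would be handled by the framework, and the per-partition function would call the real extractor.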

Corresponding author: Brij B. Gupta

1 University of Information Technology, Vietnam National University, Ho Chi Minh City, Vietnam

2 National Institute o

