ECCParaCorp: a cross-lingual parallel corpus towards cancer education, dissemination and application



RESEARCH

Open Access

Hetong Ma1†, Feihong Yang1†, Jiansong Ren2†, Ni Li2, Min Dai2, Xuwen Wang1, An Fang1, Jiao Li1, Qing Qian1* and Jie He3*

From 5th China Health Information Processing Conference, Guangzhou, China. 22-24 November 2019

Abstract

Background: The increasing global incidence of cancer has a serious health impact in countries worldwide. Knowledge-powered health systems in different languages would enhance clinicians' healthcare practice, patients' health management and public health literacy. A high-quality corpus containing cancer information is a necessary foundation of cancer education. Massive unstructured information resources exist in clinical narratives, electronic health records (EHRs) and similar sources, but they can be used to train AI models only after being transformed into a structured corpus. However, the scarcity of multilingual cancer corpora limits intelligent processing, such as machine translation in medical scenarios. We therefore created a cancer-specific cross-lingual corpus and opened it to the public for academic use.

Methods: To build an English-Chinese cancer parallel corpus, we developed a seven-step workflow: data retrieval, data parsing, data processing, corpus implementation, assessment verification, corpus release, and application. We applied the workflow to PDQ (Physician Data Query), a cross-lingual, comprehensive and authoritative cancer information resource. We constructed, validated and released the parallel corpus, named ECCParaCorp, and made it openly accessible online.

Results: The proposed English-Chinese Cancer Parallel Corpus (ECCParaCorp) consists of 6685 aligned text pairs in XML, Excel and CSV formats: 5190 sentence pairs, 1083 phrase pairs and 412 word pairs. It covers six cancers (breast cancer, liver cancer, lung cancer, esophageal cancer, colorectal cancer, and stomach cancer) and three cancer themes (prevention, screening, and treatment). All data in the parallel corpus are online and available for users to browse and download (http://www.phoc.org.cn/ECCParaCorp/).
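The three pair granularities reported in the Results can be pictured as rows of a flat table. The following is a minimal Python sketch, not the authors' tooling: the column names (`level`, `cancer`, `theme`, `en`, `zh`) are hypothetical and not taken from the released files, and the sample rows are invented for illustration rather than drawn from the corpus.

```python
import csv
import io
from collections import Counter

# Hypothetical schema and invented rows; the released CSV may differ.
SAMPLE_CSV = """level,cancer,theme,en,zh
sentence,breast cancer,screening,Mammography is a screening test for breast cancer.,乳腺X线摄影是乳腺癌的一种筛查方法。
phrase,lung cancer,prevention,smoking cessation,戒烟
word,liver cancer,treatment,chemotherapy,化疗
"""


def count_pairs_by_level(csv_text):
    """Tally aligned pairs per granularity level (sentence, phrase, word)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["level"] for row in reader)


counts = count_pairs_by_level(SAMPLE_CSV)
print(dict(counts))

# Consistency check on the totals reported in the abstract.
reported = {"sentence": 5190, "phrase": 1083, "word": 412}
assert sum(reported.values()) == 6685
```

Run against the real download, the same tally would be expected to reproduce the reported per-level counts, provided the file actually carries a granularity column.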

* Correspondence: [email protected]; [email protected]
† Hetong Ma, Feihong Yang and Jiansong Ren contributed equally to this work.
1 Institute of Medical Information/Library, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
3 Department of Thoracic Surgery, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
Full list of author information is available at the end of the article

© The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) a