A benchmark dataset and case study for Chinese medical question intent classification


RESEARCH

Open Access

Nan Chen, Xiangdong Su*, Tongyang Liu, Qizhi Hao and Ming Wei

From 5th China Health Information Processing Conference
Guangzhou, China. 22–24 November 2019

Abstract

Background: To provide satisfying answers, a medical QA system has to understand the intent of users' questions precisely. Training a deep-learning model for medical intent classification in a supervised way requires high-quality datasets. Currently, there is no public dataset for Chinese medical intent classification, and datasets from other fields are not applicable to medical QA systems. To solve this problem, we construct a Chinese medical intent dataset (CMID) using questions from medical QA websites. On this basis, we compare four intent classification models on CMID in a case study.

Methods: The questions in CMID are obtained from several medical QA websites. The intent annotation standard, developed by medical experts, includes four types and 36 subtypes of user intent. Besides the intent label, CMID also provides two types of additional information: word segmentation and named entities. We use crowdsourcing to annotate the intent of each Chinese medical question. Word segmentation and named entities are obtained using Jieba and a well-trained Lattice-LSTM model, respectively. To obtain more accurate segmentation, we loaded a Chinese medical dictionary of 530,000 entries into the segmenter. We also select four popular deep learning-based models and compare their intent classification performance on CMID.

Results: The final CMID contains 12,000 Chinese medical questions and is organized in JSON format. Each question is labeled with its intent, word segmentation, and named entity information. Question length and number of entities are also analyzed in detail. Among FastText, TextCNN, TextRNN, and TextGCN, FastText achieves the best results on the four-type intent classification and TextCNN achieves the best results on the 36-subtype intent classification.

Conclusions: In this work, we provide a dataset for Chinese medical intent classification, which can be used in medical QA and related fields. We performed an intent classification task on CMID and analyzed the content of the dataset.

Keywords: Intent classification, Dataset, Word segmentation, Named entity recognition
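The abstract describes each CMID question as a JSON record that carries an intent label, a word segmentation, and named entities, and it mentions statistics on question length and entity counts. As a rough illustration only, the sketch below parses two made-up records and computes such statistics. The field names (`text`, `intent_type`, `intent_subtype`, `segments`, `entities`), the label values, and the sample questions are all hypothetical; they are not taken from the released dataset, whose actual schema should be checked before reuse.

```python
import json
from collections import Counter

# Hypothetical records: the field names, label values, and questions are
# illustrative assumptions, NOT the actual CMID schema or data.
SAMPLE_JSON = """
[
  {"text": "感冒了吃什么药",
   "intent_type": "drug",
   "intent_subtype": "drug/recommendation",
   "segments": ["感冒", "了", "吃", "什么", "药"],
   "entities": [{"mention": "感冒", "category": "disease"}]},
  {"text": "高血压能治好吗",
   "intent_type": "condition",
   "intent_subtype": "condition/curability",
   "segments": ["高血压", "能", "治好", "吗"],
   "entities": [{"mention": "高血压", "category": "disease"}]}
]
"""


def summarize(records):
    """Return intent-type counts, mean question length in characters,
    and mean number of entities per question for a non-empty list."""
    n = len(records)
    type_counts = Counter(r["intent_type"] for r in records)
    mean_length = sum(len(r["text"]) for r in records) / n
    mean_entities = sum(len(r["entities"]) for r in records) / n
    return type_counts, mean_length, mean_entities


records = json.loads(SAMPLE_JSON)
type_counts, mean_length, mean_entities = summarize(records)
print(dict(type_counts), mean_length, mean_entities)
```

A simple sanity check on records of this shape is that the concatenated segments reproduce the original question text; a mismatch would suggest a segmentation or encoding problem.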

*Correspondence: [email protected]
Inner Mongolia Key Laboratory of Mongolian Information Processing Technology, College of Computer Science, Inner Mongolia University, University West Road, Hohhot, China

© The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

