Automatic document screening of medical literature using word and text embeddings in an active learning setting


Andres Carvallo1 · Denis Parra1 · Hans Lobel1 · Alvaro Soto1
Received: 1 October 2019
© Akadémiai Kiadó, Budapest, Hungary 2020

Abstract Document screening is a fundamental task within Evidence-based Medicine (EBM), a practice that provides scientific evidence to support medical decisions. Several approaches have tried to reduce physicians' workload of screening and labeling vast amounts of documents to answer clinical questions. Previous works tried to semi-automate document screening and reported promising results, but they were evaluated on small datasets, which limits how far the results generalize. Moreover, recent works in natural language processing have introduced neural language models, but none have compared their performance in EBM. In this paper, we evaluate the impact of several document representations, namely TF-IDF and neural language models (BioBERT, BERT, Word2Vec, and GloVe), in an active learning setting for document screening in EBM. Our goal is to reduce the number of documents that physicians need to label to answer clinical questions. We evaluate these methods on both a small, challenging dataset (CLEF eHealth 2017) and a larger one that is easier to rank (Epistemonikos). Our results indicate that both word and text neural embeddings always outperform the traditional TF-IDF representation. Comparing the neural word and text embeddings with one another, BERT and BioBERT yielded the best results on the CLEF eHealth dataset. On the larger dataset, Epistemonikos, Word2Vec and BERT were the most competitive, which makes BERT the most consistent model across the two corpora. In terms of active learning, an uncertainty sampling strategy combined with logistic regression achieved the best overall performance among the methods under evaluation, and did so in fewer iterations. Finally, we compared our best models, trained using active learning, with other authors' methods from CLEF eHealth, and obtained better results in terms of work saved for physicians in the document-screening task.
Keywords Active learning · Document screening · Natural language processing
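
The combination named in the abstract, uncertainty sampling with a logistic regression, can be illustrated with a minimal sketch. It is not the authors' implementation: toy 2-D vectors stand in for document embeddings, the logistic regression is a plain gradient-descent version written with the standard library only, and every name here (make_toy_corpus, fit_logreg, uncertainty_sampling, the oracle callback) is hypothetical.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logreg(X, y, epochs=200, lr=0.5):
    """Fit a logistic regression with batch gradient descent (assumed settings)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                gw[j] += err * xj
            gb += err
        w = [wj - lr * g / n for wj, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def predict_proba(model, x):
    """Estimated probability that document vector x is relevant."""
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def uncertainty_sampling(X, oracle, seed_idx, rounds, batch=1):
    """Pool-based active learning: query the documents the model is least sure about."""
    labeled = {i: oracle(i) for i in seed_idx}
    for _ in range(rounds):
        idx = sorted(labeled)
        model = fit_logreg([X[i] for i in idx], [labeled[i] for i in idx])
        pool = [i for i in range(len(X)) if i not in labeled]
        if not pool:
            break
        # Least certain first: predicted probability closest to 0.5.
        pool.sort(key=lambda i: abs(predict_proba(model, X[i]) - 0.5))
        for i in pool[:batch]:
            labeled[i] = oracle(i)  # the "physician" labels the queried document
    # Refit once more so the returned model uses every collected label.
    idx = sorted(labeled)
    model = fit_logreg([X[i] for i in idx], [labeled[i] for i in idx])
    return model, labeled

def make_toy_corpus(n_per_class=20, seed=0):
    """Synthetic stand-in for embeddings: relevant near (2, 2), irrelevant near (-2, -2)."""
    rng = random.Random(seed)
    X, truth = [], []
    for label, center in ((1, 2.0), (0, -2.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(center, 0.5), rng.gauss(center, 0.5)])
            truth.append(label)
    return X, truth

X, truth = make_toy_corpus()
# Start from one labeled document of each class, then ask for 8 more labels.
model, labeled = uncertainty_sampling(X, lambda i: truth[i], seed_idx=[0, 20], rounds=8)
accuracy = sum((predict_proba(model, x) >= 0.5) == bool(t) for x, t in zip(X, truth)) / len(X)
print(f"labeled {len(labeled)} of {len(X)} documents, accuracy {accuracy:.2f}")
```

In a real screening setting the vectors would come from one of the representations compared in the paper (TF-IDF, Word2Vec, GloVe, BERT, BioBERT), and the oracle would be a physician's relevance judgment rather than a lookup in known labels.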

* Andres Carvallo

1 Pontificia Universidad Catolica de Chile, Santiago, Chile


Scientometrics

Introduction Evidence-based Medicine (EBM) is a practice that provides scientific evidence to support medical decisions. Nowadays this evidence is obtained from biomedical journals, usually accessed through PubMed, a search engine that provides free access to abstracts of biomedical research articles as well as to the MEDLINE database. The open problem is finding the documents relevant to a clinical question or query within a massive volume of information. As a consequence, searching for and screening articles can take a long time, sometimes consuming a large part of a physician's workday (Miwa et al. 2014; Elliott
