Navigation-based candidate expansion and pretrained language models for citation recommendation
Rodrigo Nogueira 1,2 · Zhiying Jiang 2 · Kyunghyun Cho 3,4,5,6 · Jimmy Lin 2
Received: 19 May 2020
© Akadémiai Kiadó, Budapest, Hungary 2020
Abstract
Citation recommendation systems for the scientific literature, to help authors find papers that should be cited, have the potential to speed up discoveries and uncover new routes for scientific exploration. We treat this task as a ranking problem, which we tackle with a two-stage approach: candidate generation followed by reranking. Within this framework, we adapt to the scientific domain a proven combination based on “bag of words” retrieval followed by rescoring with a BERT model. We experimentally show the effects of domain adaptation, both in terms of pretraining on in-domain data and exploiting in-domain vocabulary. In addition, we introduce a novel navigation-based document expansion strategy to enrich the candidate documents fed into our neural models. On three benchmark datasets, our methods achieve or rival the state of the art in the citation recommendation task.
Keywords: Transformers · Domain adaptation · Citation graph
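The two-stage approach described in the abstract (bag-of-words candidate generation followed by reranking) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the BM25 parameters `k1` and `b`, and in particular `overlap_scorer`, a simple Jaccard overlap that stands in for the BERT rescoring model, which is not reproduced. This is not the authors' implementation.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    # Bag-of-words scoring of every document against the query (BM25).
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def generate_candidates(query, docs, k):
    # Stage 1: keep the top-k documents with a nonzero bag-of-words score.
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked[:k] if scores[i] > 0]


def overlap_scorer(query, doc):
    # Placeholder for a neural reranker: Jaccard overlap of token sets.
    q, d = set(tokenize(query)), set(tokenize(doc))
    return len(q & d) / len(q | d) if q | d else 0.0


def recommend(query, docs, k=3, scorer=overlap_scorer):
    # Stage 2: reorder the stage-1 candidates with the (placeholder) scorer.
    candidates = generate_candidates(query, docs, k)
    return sorted(candidates, key=lambda i: scorer(query, docs[i]), reverse=True)


docs = [
    "neural ranking with transformers",
    "protein folding dynamics",
    "transformers for citation recommendation ranking",
]
print(recommend("citation recommendation with transformers", docs))
```

In a realistic setting, the `scorer` argument would be replaced by a model that scores each (query, candidate) pair, while the cheap first stage keeps the number of pairs the model must score small.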
* Rodrigo Nogueira [email protected]
1 Tandon School of Engineering, New York University, New York, USA
2 David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada
3 Courant Institute of Mathematical Sciences, New York University, New York, USA
4 Center for Data Science, New York University, New York, USA
5 Facebook AI Research, New York, USA
6 CIFAR Azrieli Global Scholar, Toronto, Canada
Scientometrics

Introduction
The volume of scientific publications is growing at an incredible rate. For example, nearly a million articles are added per year to MEDLINE, a bibliographic database of the life sciences and biomedical literature (https://www.nlm.nih.gov/bsd/stats/cit_added.html). A recent study estimates that three million papers are published annually in the English language, with a growth rate of 3–5% per year (Johnson et al. 2018). This flood of information has made it nearly impossible for researchers to keep abreast of discoveries and innovations, both in their specific sub-field as well as more broadly. Furthermore, there is an overwhelming amount of material that a scientist entering a new field of study needs to read before becoming familiarized with common concepts, methods, and other foundations. A number of tools have come along to help researchers cope with this deluge. For example, keyword-based literature search engines (Google Scholar, Microsoft Academic, PubMed, and Semantic Scholar) and citation recommendation tools (Bollacker et al. 1999; Basu et al. 2001; McNee et al. 2002; Kodakateri Pudhiyaveetil et al. 2009; He et al. 2010) help scientists find relevant articles, often exploiting citation networks to identify what’s important in a particular field. Methods to automatically populate scientific knowledge bases (Gao et al. 2006; Spangler