A Web Resource for Exploring the CORD-19 Dataset Using Root- and Rule-Based Phrases
REVIEW ARTICLE
© Indian Institute of Science 2020.
Jacob Collard1, Talapady Bhat1, Eswaran Subrahmanian1,2*, Ira Monarch1, Jonah Tash1, Ram Sriram1 and John Elliot1
Abstract | This short paper describes a web resource, the NIST CORD-19 Web Resource, for community exploration of the COVID-19 Open Research Dataset (CORD-19). The exploration tools in the web resource use the NIST-developed Root- and Rule-based method, which exploits underlying linguistic structures to create terms that represent phrases in a corpus. The method auto-suggests related terms, helping users discover terms that refine a search of the heterogeneous COVID-19 document base. The method also produces taxonomic structures in the target domain and provides semantic information about the relationships between terms. This term structure can serve as a basis for creating topic modeling and trend analysis tools. In this paper, we describe the use of a novel search engine to demonstrate some of these capabilities.
Keywords: Root- and rule-based method, CORD-19 dataset, Auto-suggest search
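As a rough illustration of the auto-suggest idea mentioned in the abstract, the sketch below indexes phrases by their constituent words and suggests phrases that share words with a query. It is not the NIST R&R implementation: the whitespace tokenizer, the `STOP_WORDS` list, the function names, and the sample phrases are all assumptions made for this example, and the actual method derives terms from linguistic rules rather than simple word overlap.

```python
from collections import defaultdict

# Hypothetical stop list for this toy example; the R&R method itself relies
# on linguistic structure, not on a stop list like this one.
STOP_WORDS = {"of", "the", "a", "an", "in", "for", "and", "to"}


def roots(phrase):
    """Split a phrase into lowercase words ('roots'), dropping stop words."""
    return [w for w in phrase.lower().split() if w not in STOP_WORDS]


def build_index(phrases):
    """Map each root to the set of phrases that contain it."""
    index = defaultdict(set)
    for phrase in phrases:
        for root in roots(phrase):
            index[root].add(phrase)
    return index


def suggest(query, index):
    """Return phrases sharing at least one root with the query.

    Results are ranked by the number of shared roots, with ties broken
    alphabetically.
    """
    counts = defaultdict(int)
    for root in set(roots(query)):
        for phrase in index.get(root, ()):
            counts[phrase] += 1
    return sorted(counts, key=lambda p: (-counts[p], p))


# Made-up sample phrases, not taken from CORD-19.
phrases = [
    "viral protein structure",
    "spike protein",
    "protein binding site",
    "respiratory infection",
]
index = build_index(phrases)
print(suggest("spike protein", index))
```

Here a query for "spike protein" surfaces the other phrases built on the root "protein", which is the kind of related-term suggestion a user could follow to narrow or widen a search.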
1 Introduction
The NIST Root- and Rule-based method (R&R)1,2 is a framework built around linguistic structures to identify and index key phrases. R&R defines how individual natural language words (“roots”) can be combined into structured terms. These structured terms represent natural language phrases in accordance with their linguistic structure, so that relationships between individual words or between complex phrases can be identified. The terms both disambiguate and normalize the natural language text according to their linguistic structure (“rules”). The R&R method draws insight from noun compounds in Sanskrit, German, Latin, and other languages. It can be used for simple single-term searches and for advanced searches that disambiguate a user’s query to increase the relevance of retrieved documents and normalize the query to retrieve a wider set of relevant documents. The method can be applied to a variety of domains without modifying the overall framework; practical differences in vocabulary, language use, and abbreviations can be accounted for by modifying a few simple parameters of the framework.

J. Indian Inst. Sci. | VOL xxx:x | xxx–xxx 2020 | journal.iisc.ernet.in

2 The NIST CORD-19 Web Resource
The COVID-19 Open Research Dataset (CORD-19) provides a collection of articles related to SARS-CoV-2, COVID-19, and related viruses and diseases.3 The corpus is obtained from PubMed, the WHO, bioRxiv, and medRxiv, and is updated daily with new relevant articles. This large, open-access corpus makes it possible for researchers to track and analyze new developments related to COVID-19. However, because of its size and breadth, answering a specific research question can be difficult when only a subset of the corpus is needed, since extracting the relevant subset is itself a challenging problem. Identifying trends in research also requir