Pathway information extracted from 25 years of pathway figures



DATABASE

Open Access

Kristina Hanspers1†, Anders Riutta1†, Martina Summer-Kutmon2,3 and Alexander R. Pico1*

*Correspondence: [email protected]
†Kristina Hanspers and Anders Riutta contributed equally to this work.
1 Institute of Data Science and Biotechnology, Gladstone Institutes, San Francisco, CA, USA
Full list of author information is available at the end of the article

Abstract

Thousands of pathway diagrams are published each year as static figures inaccessible to computational queries and analyses. Using a combination of machine learning, optical character recognition, and manual curation, we identified 64,643 pathway figures published between 1995 and 2019 and extracted 1,112,551 instances of human genes, comprising 13,464 unique NCBI genes, participating in a wide variety of biological processes. This collection represents an order of magnitude more genes than found in the text of the same papers, and thousands of genes missing from other pathway databases, thus presenting new opportunities for discovery and research.

Keywords: Pathways, Figures, Literature, OCR, Gene sets

Background

The molecular mechanisms underlying biology are often outlined as pathway diagrams. In textbooks and on whiteboards, these depictions are fundamental to a biologist’s training. As mental models for practitioners, they serve as scaffolds for forming hypotheses and integrating new knowledge. And in the scientific literature, pathway figures are the pinnacle of communication for published work, synthesizing diverse sources and types of data spanning decades into a coherent model.

Though often published only as static images, pathways express dynamic interactions. Common examples include metabolic cycles, gene regulation, and signaling cascades. Depicted interactions play out over a spectrum of electrochemical, enzymatic, and developmental timescales. When properly modeled as an interaction network and annotated with standard identifiers, pathway knowledge can be conveyed with greater precision in formats amenable to computational analysis. Unlike static images, pathway models can be used in enrichment analyses [1], enhanced data visualization [2, 3], knowledge graphs [4, 5], biomedical inference [6], and database queries [7, 8].

Over the past couple of decades, a number of pathway databases, including GenMAPP [9], MetaCyc [10, 11], KEGG [12] and Reactome [13, 14], took on the challenge of curating canonical pathway biology, each with its own focus and approach. A broader, community-curated approach was

© The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included i