Automated Extraction of Structured Data from Text Notes in the Electronic Medical Record
J Gen Intern Med DOI: 10.1007/s11606-020-06110-8 © Society of General Internal Medicine 2020
INTRODUCTION
Collecting data at the point-of-care is a critical task for many clinical studies, a process made more feasible by the advent of electronic medical records (EMRs).1 However, creating data entry structures generally requires EMR programming by information technology (IT) specialists,2 resulting in delays and costs that are prohibitive for smaller studies and investigators with limited funding. We developed an alternative strategy to enter and extract structured data from free-text EMR notes, taking advantage of templates that make data parsing tractable. Most EMRs, including the two largest US vendors (Epic and Cerner), allow users to create and share templates within their notes. Within such templates, specific fields are available for the user to choose from a list of options (an enumeration data type) that populates a specific portion of the text when selected. We describe here our method for programmatically extracting structured data from notes created with dedicated templates.
METHODS
Our technique involves three steps, which we illustrate in the Epic EMR (Epic Systems, Verona, WI): (1) construct a text template (“SmartPhrase”) containing a unique string identifier tag and embedded list enumerations (“SmartLists”) to allow data entry directly into notes; (2) query a back-end relational database (“Clarity”) to capture notes containing the unique text string tag; and (3) parse the captured notes to extract data into structured form using a Python script that employs regular expressions to identify the necessary fields (Fig. 1). Our SQL and Python code are available under an open-source MIT License at https://github.com/alexanderflint/structured-datafrom-notes. We tested this approach in a study of stroke treatment in the 21-hospital Kaiser Permanente Northern California (KPNC) health system. To test performance, 7 text data extraction builds (templates) were created with varying numbers of data elements (1 to 11), users (3 to 20), and hospital centers (1 to 21). Clarity was queried with Teradata SQL Assistant v13.11 to capture notes based on a unique text string present in each template. Selected users were granted access to the SmartPhrases and given brief feedback on their intended use. After initial roll-out, no additional user feedback was provided so that we could determine user-generated error rates in the absence of reinforcement. This project was judged to not meet the regulatory definition of research by the KPNC Research Determination Official.
Received May 22, 2020; Revised June 2, 2020; Accepted August 4, 2020
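Steps (2) and (3) of the technique can be illustrated with a minimal Python sketch. It is not the authors' published code: the tag `#STROKE_STUDY_V1#`, the field labels, the allowed options, and the `notes` table are all hypothetical, and an in-memory SQLite database stands in for the Clarity back end purely so the example is self-contained. Any value that does not match an enumerated option is recorded as `None`, which is one plausible way to flag an unusable field.

```python
import re
import sqlite3

# Hypothetical tag and enumerated fields; the real template text is not given here.
TAG = "#STROKE_STUDY_V1#"
FIELDS = {
    "onset_witnessed": ("yes", "no", "unknown"),
    "treatment": ("alteplase", "tenecteplase", "none"),
}


def capture_notes(conn, tag):
    """Step 2: return (note_id, note_text) rows whose text contains the tag."""
    # instr() does a literal substring match, so '_' or '%' in the tag
    # are not treated as wildcards (as they would be with LIKE).
    cur = conn.execute(
        "SELECT note_id, note_text FROM notes "
        "WHERE instr(note_text, ?) > 0 ORDER BY note_id",
        (tag,),
    )
    return cur.fetchall()


def parse_note(text):
    """Step 3: extract each 'label: value' field with a regular expression."""
    record = {}
    for name, allowed in FIELDS.items():
        match = re.search(
            rf"^{re.escape(name)}:[ \t]*(.*)$", text, flags=re.MULTILINE
        )
        value = match.group(1).strip().lower() if match else None
        # Keep only values that belong to the enumeration; anything else is unusable.
        record[name] = value if value in allowed else None
    return record


def demo():
    """Build a toy notes table (assumed schema), then capture and parse."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE notes (note_id INTEGER PRIMARY KEY, note_text TEXT)")
    conn.executemany(
        "INSERT INTO notes VALUES (?, ?)",
        [
            (1, f"{TAG}\nonset_witnessed: Yes\ntreatment: alteplase\n"),
            (2, "Routine follow-up visit, no template used."),
            (3, f"Free text. {TAG}\nonset_witnessed: maybe\ntreatment: none\n"),
        ],
    )
    rows = capture_notes(conn, TAG)
    conn.close()
    return {note_id: parse_note(text) for note_id, text in rows}


if __name__ == "__main__":
    for note_id, record in demo().items():
        print(note_id, record)
```

In this toy data, note 2 lacks the tag and is never captured, while the free-text entry "maybe" in note 3 is rejected because it is not one of the enumerated options.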
RESULTS
Our method used minimal computing resources. Querying 1217 notes from 17,331,944 stored notes took 72 seconds and further data parsing took < 2 seconds. The usable-field rate was high (20,989/21,709 fields = 96.7%), with lower usable-field rates associated with larger numbers of centers, users, and data fields (Table 1).
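As a rough illustration of how a usable-field rate might be computed from parsed records, the sketch below counts a field as usable when its parsed value is not `None`. That definition, the function name, and the toy records are assumptions made for illustration; only the totals 20,989 and 21,709 come from the results above.

```python
def usable_field_rate(records):
    """Fraction of fields across all records that hold a usable (non-None) value."""
    total = sum(len(record) for record in records)
    if total == 0:
        raise ValueError("no fields to score")
    usable = sum(value is not None for record in records for value in record.values())
    return usable / total


# Toy parsed records (hypothetical): 3 of 4 fields are usable.
example_records = [
    {"onset_witnessed": "yes", "treatment": None},
    {"onset_witnessed": "no", "treatment": "none"},
]
example_rate = usable_field_rate(example_records)

# Arithmetic check of the reported figure: 20,989 usable fields out of 21,709.
reported_rate = 20_989 / 21_709

if __name__ == "__main__":
    print(f"toy example: {example_rate:.1%}")
    print(f"reported:    {reported_rate:.1%}")
```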
DISCUSSION
We describe