Insights Into Conducting Audiological Research With Clinical Databases.

Purpose: The clinical data stored in electronic health records (EHRs) provide unique opportunities for audiological clinical research. In this article, we share insights from our experience of working with a large clinical database of over 730,000 cases. Method: Under a framework outlining the process from patient care to researcher data use, we describe issues that can arise in each step of this process and how we overcame specific issues in our data set. Results: Correct interpretation of findings depends on an understanding of the data source and structure, and efforts to establish confidence in the data through the processes are discussed under the framework. Conclusion: We conclude that EHRs have considerable utility in audiological research, though researchers must exhibit caution and consideration when working with EHRs.

Improvements in electronic health record (EHR) technology and its potential to enhance clinical and organizational outcomes have made EHRs integral to health care provision (Menachemi & Collum, 2011). In 2017, EHRs were used by an estimated 86% of health care providers in the United States (Office of the National Coordinator for Health Information Technology, 2019). This widespread adoption was, in part, influenced by the Health Information Technology for Economic and Clinical Health Act of 2009, which invested in the adoption and meaningful use of EHRs in the United States (Blumenthal & Tavenner, 2010). A positive impact of this is the use of EHRs in large-scale medical research and a continued interest in their use, as demonstrated by initiatives focusing on large data sets sponsored by both the National Institutes of Health and the Patient-Centered Outcomes Research Institute (Fleurence et al., 2014; Margolis et al., 2014; Washington & Lipstein, 2011).
Research utilizing EHR data to improve clinical processes and to describe relationships among diagnosed health conditions is extensive in many medical fields, but to date, just a few published audiological studies have utilized EHRs or other large clinical databases. However, their contributions demonstrate the utility of EHR research in audiology. For example, Helfer et al. (2005) used diagnostic codes from the U.S. Department of Defense to highlight the need for future health service requirements for warfighters by describing trends in noise-induced hearing loss and other comorbidities in soldiers. Wilson and McArdle utilized audiometric data from the Department of Veterans Affairs (VA) to describe audiometric notches in a clinical population (Wilson & McArdle, 2013) and to support evidence for the clinical recommendation of testing interoctave frequencies (Wilson & McArdle, 2014). Billings et al. (2018) described the prevalence and audiometric abnormalities of VA patients with normal hearing, highlighting a need to address the common clinical presentation of patients with normal hearing reporting hearing difficulties. Zapala et al. (2010) used EHR data to determine safety of self-referral to audiology clinics in Medicare beneficiaries. Researchers utilizing the pediatric AudGenDB database (containing audiological, EHR, radiological imagery, and genetic results) have focused on relationships between hearing loss and genetic and/or rare diseases, which are often difficult to capture in other data sources (Kreicher et al., 2018; Weir, Kreicher, et al., 2016). The potential knowledge to be gained from EHRs is immense because they enable one to examine interrelationships between such variables, some of which (e.g., health care processes) are not measured in prospective epidemiological studies. Currently, we are working with a large clinical database of over 730,000 cases obtained from the VA EHR system. To be included in the analytic sample, patients must have been fit with a hearing aid at a VA facility between April 2012 and October 2014. A large amount of data is available for these patients, including audiology-specific information (e.g., audiometric thresholds, hearing aid information) available between April 2012 and October 2014, and extensive health data (e.g., diagnoses, clinical procedures, lab tests) available between January 2007 and December 2017. Procedural data are derived from Current Procedural Terminology (CPT), and diagnostic information is derived from International Classification of Diseases (ICD) Versions 9 and 10. Ultimately, the goal of the project is to expose relationships among diagnostic and care process information (before and after hearing aid use) and audiological outcome variables in hearing aid users in order to gain a broader understanding of factors that impact hearing aid use and outcomes. Future publications will describe the data set in detail and will report findings on such relationships because these are outside the scope of this article.
The intent of this article is to discuss the process of working with EHR data gleaned from our experience of working with a large clinical database and to discuss how leveraging EHR data can benefit audiological research. Although data from the VA EHR are the basis for this work, the themes described are applicable to EHR data sources, in general. We describe examples using audiometric data and recorded CPT and ICD codes. The article is organized as follows. First, we provide a background of EHRs and describe the stages required to move from patient care to researcher data use. We then present issues that can arise when working with EHRs and, next, describe our experience working with the VA EHR, including how we overcame specific issues. Finally, we discuss data interpretation and applications of EHR research in audiology.

EHR Background and Stages From Patient Care to Data Use
EHRs were initially introduced to improve health care quality and capture billing data (Institute of Medicine, 2003). In general, EHRs contain longitudinal data collected during delivery of health care that are relevant to patient care, such as demographics, vital statistics, claims, administrative, and clinical data. EHRs may be specific to one clinic or may contain integrated data from a hospital-wide or interhospital linked system. Therefore, both the structure and content of EHRs vary by source (Häyrinen et al., 2008). Data eventually provided to researchers are a compilation of patients' clinical data recorded during clinical encounters, defined as instances of contact between patients and clinical providers. An encounter does not necessarily result in a clinical procedure or diagnosis. For this, we introduce the concept of events, which are subsets of clinical encounters. Verheij et al. (2018) describe a framework outlining the process of moving from patient care to researcher data use and how each stage can impact data quality in the research data set. This article utilizes a similar framework, classifying the process into four stages: (1) an event occurs, (2) the event is recorded, (3) data from the EHR are extracted, and (4) data are prepared for research. Below, we further describe each stage and discuss how the way each step is completed can impact the research data set.

Stage 1 (Event Occurs)
For purposes of this article, we define an event as a patient receiving a diagnosis or procedure. This refers to any given event that occurs during a clinical encounter, the occurrence of which is influenced by many factors, some of which are described below.
In order for an event to occur, the patient must first seek medical care. The extent to which care is sought for symptoms varies not only by symptom knowledge, interpretation, and beliefs (Petrova et al., 2019), but also by ethnicity (Williams et al., 2019), socioeconomic status (McCutchan et al., 2015), and gender (Magaard et al., 2017). Second, the presence or absence of an event is subject to the clinical judgment of a provider. Ultimately, it is the provider's decision to order a diagnostic test or treatment. If a provider lacks resources (time, knowledge), a test, for example, may not be ordered, and therefore, the event does not appear to occur. Similarly, resource constraints on a given clinical site influence the occurrence of an event. For example, if a certain technology for a procedure is necessary but unavailable, the event will not occur. Awareness of these factors is necessary because they influence subsequent stages. Although we acknowledge that this definition does not capture situations in which a patient chooses not to seek help or is turned away by hospital staff prior to being formally seen, or when a physician chooses not to enter a code for a particular event, we believe it is the most accurate and practical definition to use when analyzing data from this EHR.

Stage 2 (Event Is Recorded)
This stage refers to the documentation of the event in Stage 1. For purposes of this article, the event is observed through the presence of CPT and/or ICD codes. Generally, CPT codes indicate that a specific procedure was performed during the encounter, and ICD codes describe the diagnosis that was either assigned for the first time or that was treated during the encounter. A subset of ICD-9 and ICD-10 codes also identify procedures. The time and date of all events are recorded. The recording of an event in the EHR is influenced by provider care practices, the EHR system itself, purpose and meaning of codes, entry error, and policy changes. Provider care practices influence the mode of entry and completeness of data. When describing medical encounters, some providers rely heavily on free-text entries, some on CPT and ICD codes, and some use both. These practices, which can be influenced by practice guidelines, provider training, and time constraints, impact the ultimate picture of a patient's clinical care. Because CPT and ICD codes are used to justify levels of service provision and billing, some providers choose to enter codes only when it is necessary (e.g., when ordering a lab test). Additionally, the use and meaning of codes are dynamic in that they may change over time to address changes in demand for care, evolving standards of care, and population demographics (Agniel et al., 2018). One result of this may be increased difficulty in interpreting code usage over time. Finally, manual entry of codes is subject to error, in terms of miscoding or not entering complete data (O'Malley et al., 2005). Errors can be random, such as a provider choosing an incorrect code, or systematic, arising from, for example, legislative changes. An example of this was the national change from ICD-9 to ICD-10 in October 2015, which resulted in an increase in the number of available codes from approximately 13,000 to approximately 68,000. This results in complexities when trying to interpret data spanning this transition. Lastly, if the information is to be linked to other health data for a given patient, the EHR systems must be interoperable across the facilities at which that patient had sought care. Understanding factors related to data recording is integral to identifying solutions for sources of error or bias.

Stage 3 (Data Are Extracted)
For use in research, data must be extracted from the EHR. This is typically completed by data experts via extraction queries using specific data management software. Data validity can potentially be compromised if the query contains errors or if there are limitations in the extraction software. The complexity of data extraction requires data experts and study team members to liaise effectively about the specific data needed for research.

Stage 4 (Data Preparation)
Data preparation is a time-intensive process in which preprocessing and cleaning of the data are used to transform the extracted data into a database suitable for research and to remove or correct corrupt or inaccurate data entries. Errors that occurred at earlier stages (1-3) can sometimes be identified and fixed, and care must be taken not to introduce new ones. While some errors may be easy to identify, such as an incorrectly keyed free-text entry, others will be difficult to detect. For example, CPT and ICD codes are often used together to identify a procedure (CPT) tied to a diagnosis (ICD). If, for a given encounter, there are no CPT codes, it is not possible to verify that the ICD code is correctly specified. Missing data are problematic to interpret because the reason they arose may not be identifiable. Missing data could arise because a patient sought care elsewhere, a mistake was made in data recording, or there were errors in data extraction. Although missing data are common in research, they can have a substantial effect on conclusions drawn from the data in that they can, for example, lead to bias in the estimation of parameters and/or change the representativeness of samples (Graham, 2009). Identifying causes of missing data is integral to correct interpretation of findings.

Our Experience Working With EHR Data
Next, we describe our experience working with the VA EHR and how we addressed issues with data integrity at each stage.

Stage 1 (Event Occurs)
Our target study sample comprises veterans for whom a hearing aid was ordered at the VA between 2012 and 2014. Inclusion in the sample therefore requires patients to have sought audiological care at a VA facility. Whether or not a patient is included is also influenced by the audiologist's decision to recommend hearing aids, whether or not the patient decided to follow that recommendation, and whether or not the patient attended the hearing aid fitting appointment. Some of these factors may covary with other patient variables (e.g., age) and pose potential confounds for research analyses. This raises the issue of representativeness of the data set relative to the intended population. We assume VA audiologists recommend hearing aids appropriately, and previous research suggests that uptake of hearing aids is high among veterans who have a hearing test (Saunders et al., 2016). Given that our data set includes patients with a hearing aid order, we are confident that audiological care occurred at the VA. However, as it is known that some patients seek care both within and outside of the VA system, we cannot determine how many patients seek nonaudiological health care for comorbid conditions outside of the VA. Implications of this are discussed in Stage 2 below.

Stage 2 (Event Is Recorded)
As described previously, recording of clinical information is influenced by provider care practices, and it is difficult to assess completeness and accuracy of CPT and ICD codes. First, although the audiological data we have are comprehensive, some audiometric threshold data are missing. We believe this is most likely because, during the study inclusion time period (2012-2014), audiologists were not required to enter audiometric data into the EHR. Although this may raise an issue of sample representativeness, there is no reason to believe there was a systematic bias in missing audiometric data.
Next, we consider CPT and ICD codes. As mentioned above, some patients seek care both within and outside of the VA system. Although systems exist to share medical records between community and VA providers (Byrne et al., 2014), this is not universal and the extent to which ICD codes assigned from community providers are recorded in the VA EHR is unknown. As a result, it is likely that the prevalence of some comorbid conditions is underestimated in our sample. A study by Miller et al. (2004) compared the prevalence of diabetes in veterans derived using a gold standard definition of both VA and Medicare information against a definition of only VA information. Across 3 years, the prevalence estimate using only VA data was only slightly lower (approximately 2 percentage points). A study by Lei et al. (2018) used a similar method to compare dementia prevalence and, again, found the prevalence in the VA data to be slightly lower (approximately 2.6 percentage points). These studies indicate that, in terms of completeness of CPT and ICD codes, data are generally comprehensive, though the prevalence of some conditions may be slightly underestimated. To obtain the most accurate picture of each individual's health, we used multiple codes to classify those with a particular diagnosis by creating comprehensive lists of disease-specific codes. For example, when considering cognitive impairment, many specific diagnoses were considered (Alzheimer's disease, vascular dementia, unspecified dementia, etc.). In doing this, we increased the chances of capturing everyone with cognitive impairment. Furthermore, by broadening the coding category, we also accounted for changing coding practices over time. It is also worth noting that comorbid conditions of interest for hearing research are generally chronic (e.g., multiple sclerosis, diabetes, Parkinson's disease). Because ICD codes are assigned at appointments that both diagnose and treat a condition, it is likely that the presence of a chronic condition will be captured if it was treated at a VA facility, even if it was not originally diagnosed there.
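The idea of classifying a diagnosis from a broad list of disease-specific codes can be sketched as follows. This is a minimal illustration, not the study's actual code lists: the prefixes shown are an abbreviated, assumed subset for cognitive impairment, and the function name `has_condition` and the `(version, code)` record layout are hypothetical.

```python
# Abbreviated, illustrative prefix lists (assumption); a real study would
# curate complete lists with clinical input.
COGNITIVE_IMPAIRMENT_PREFIXES = {
    "ICD-9": ("290", "294.1", "294.2", "331.0"),
    "ICD-10": ("F01", "F02", "F03", "G30"),
}


def has_condition(patient_codes, prefixes_by_version):
    """Return True if any (version, code) pair starts with a listed prefix.

    patient_codes: iterable of (icd_version, code) tuples for one patient.
    prefixes_by_version: dict mapping ICD version to a tuple of code prefixes.
    """
    for version, code in patient_codes:
        prefixes = prefixes_by_version.get(version, ())
        # str.startswith accepts a tuple and returns False for an empty one.
        if code.startswith(prefixes):
            return True
    return False
```

Matching on prefixes within each ICD version keeps ICD-9 and ICD-10 codes from being confused with one another while still grouping several specific diagnoses under one condition.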
It is inevitable that there are coding errors in our data set, although possibly at a relatively low rate because of the long-term use and acceptance of the VA EHR (Edsall & Adler, 2011). Around October 2015, the data show increased prevalence rates for many conditions. We attribute this to the policy change that resulted in the change from ICD-9 to ICD-10 coding. This has also been reported in other studies and requires special approaches (e.g., selection of time intervals uncontaminated by the transition) when analyzing and interpreting the data (Yoon & Chow, 2017).
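One way to select time intervals uncontaminated by the transition is to label each event date by coding era and set aside a buffer around the changeover. The sketch below assumes a 90-day buffer on either side of October 1, 2015; the buffer width and the function name `coding_era` are illustrative choices, not values from this study.

```python
from datetime import date

ICD10_START = date(2015, 10, 1)  # U.S. changeover from ICD-9 to ICD-10


def coding_era(event_date, buffer_days=90):
    """Label a date as 'icd9', 'icd10', or 'transition'.

    Dates within buffer_days of the changeover are labeled 'transition'
    so that they can be excluded or analyzed separately.
    """
    if abs((event_date - ICD10_START).days) <= buffer_days:
        return "transition"
    return "icd9" if event_date < ICD10_START else "icd10"
```

Prevalence estimates can then be computed separately within the `icd9` and `icd10` intervals rather than across the whole period.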

Stage 3 (Data Are Extracted)
Data extraction was performed by VA data specialists and facilitated by members of the study team. Data were extracted from two separate database systems: Patient Care Services (PCS) and the Corporate Data Warehouse (CDW). PCS contains audiology-related data, including hearing thresholds, hearing aid information, and outcome measures. The CDW contains the remainder of the EHR, including CPT and ICD codes. For patients who attended a hearing aid fitting between April 2012 and October 2014, CPT and ICD codes were extracted for the 11-year period, starting January 2007 and ending in December 2017. Extracting data for this 11-year period allows us to consider comorbid conditions present both before and after the hearing aid fitting and gives us confidence that we have comprehensive health data for patients in the study sample. Audiometric data were extracted for dates from April 2012 to October 2014.
We encountered an incorrectly specified extraction query for audiometric data that resulted in all audiometric thresholds of 0 dB HL being coded as missing. This error was identified and corrected but illustrates the need for thorough data checking and preparation (see below).
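A simple frequency check is one way such an error might surface. The sketch below is a hypothetical heuristic, not the check used in this study: it counts recorded threshold values and flags a data set in which 0 dB HL never appears even though neighboring values (-5 or 5 dB HL) do. Missing entries are assumed to be represented as `None`.

```python
from collections import Counter


def threshold_summary(values):
    """Return (number of missing entries, Counter of recorded values)."""
    missing = sum(1 for v in values if v is None)
    counts = Counter(v for v in values if v is not None)
    return missing, counts


def zero_looks_dropped(values):
    """Flag data in which 0 dB HL is absent but adjacent values are present."""
    _, counts = threshold_summary(values)
    # Counter returns 0 for values that never occurred.
    return counts[0] == 0 and (counts[5] > 0 or counts[-5] > 0)
```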

Stage 4 (Data Preparation)
The process of data preparation was crucial in establishing confidence in our data set. This process was performed by an expert data analyst and focused on condensing and reorganizing large amounts of data into tables with simplified structures, as well as combining the two data sets (PCS and CDW). Here, we describe a few of the many steps we took to preprocess and clean the data. Identifying hearing-related encounters using CPT and ICD codes was required to document and understand patients' clinical pathways. However, we initially did not have a good understanding of which codes were used in VA clinical practice for such clinical encounters. To empirically determine these codes, we randomly selected 100,000 patients from the sample and, for each patient, noted up to five dates on which codes for either audiometry or hearing aid orders were recorded in the PCS database. We then extracted from the CDW database all CPT and ICD codes recorded on those dates for the patient in question. Aggregate lists of all CPT and ICD codes and their total counts across patients and dates were compiled. Given that the CPT and ICD codes were recorded on a date with a known hearing-related encounter, we inferred that these codes were related to hearing health care. We used this information to classify types of audiological appointments (hearing evaluation, hearing aid fitting, etc.), which was necessary for subsequent analyses. Often, a single encounter was associated with multiple events specifying diagnoses and procedures. It should be noted that, although there exist other potential data sources in the EHR that may relate to clinical encounters, we found that the best way to identify a valid occurrence of an encounter was to use CPT and ICD codes, one reason being that they provide the most complete data.
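The empirical code-discovery step can be sketched as a date-matched count. This is a simplified illustration under stated assumptions: the record layouts, the function name `codes_on_hearing_dates`, and the choice of each patient's earliest five dates are hypothetical (the text does not say which five dates were used), and the random selection of patients is omitted.

```python
from collections import Counter


def codes_on_hearing_dates(pcs_dates, cdw_records, max_dates=5):
    """Count CDW codes recorded on each patient's known hearing-related dates.

    pcs_dates: dict of patient_id -> list of ISO date strings on which
        audiometry or a hearing aid order was recorded in PCS.
    cdw_records: iterable of (patient_id, iso_date, code) tuples from CDW.
    """
    # Keep at most max_dates dates per patient (earliest first; an assumption).
    keep = {pid: set(sorted(dates)[:max_dates]) for pid, dates in pcs_dates.items()}
    counts = Counter()
    for pid, day, code in cdw_records:
        if day in keep.get(pid, ()):
            counts[code] += 1
    return counts
```

Calling `counts.most_common()` on the result gives an aggregate list of codes ordered by frequency, which can then be reviewed to decide which codes indicate hearing-related appointments.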
Next, extensive cleaning was done on audiometric data to prepare it for use and to remove erroneous values. To do so, we first used exploratory analyses (e.g., summary statistics) to check the integrity of each variable independently. The problem of audiometric thresholds of 0 dB HL being coded as missing (discussed above) was identified through the use of summary statistics of audiometric thresholds. Second, nonnumerical values were resolved. A small percentage of threshold values were nonnumerical, often taking the form of "X+" (presumably indicating that the patient did not respond at the upper limit of testing), "CNT" (could not test) or "DNT" (did not test), "X*," and "+X+" (presumably erroneous entries). The entries "X+" were replaced with an indicator value of 120, which is above the highest value used in the VA. This value allows for computation of pure-tone averages and conveys the notion of extreme hearing loss while still allowing researchers to distinguish it from entries indicating a measured threshold. Values of "CNT," "DNT," "X*," and "+X+" were replaced with a missing value indicator. Third, given that standard practice uses 5-dB steps in clinical testing, we assumed that any value not divisible by 5 was an erroneous entry, and thus, these were also replaced with a missing value indicator. The examples provided here include only a few of the steps taken to preprocess audiometric data, and we intend to describe more detailed processes in subsequent publications. Interested readers may look to Mellor et al. (2018) for additional considerations regarding audiologic data preparation in large data sets. In short, the data preparation stage was lengthy and recursive and required expertise from a data analyst, a clinician with knowledge of audiology, and researchers.
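The three threshold-cleaning rules described above can be sketched in a single function. This is an illustration under stated assumptions rather than the study's code: raw entries are assumed to be strings, the "X" in "X+" is assumed to stand for a number (e.g., "110+"), and the name `clean_threshold` and the use of `None` as the missing value indicator are hypothetical.

```python
NO_RESPONSE = 120  # indicator value described in the text for "X+" entries


def clean_threshold(raw):
    """Return a threshold in dB HL, NO_RESPONSE, or None for missing."""
    text = str(raw).strip().upper()
    # Rule for "X+": a number followed by "+" (no response at the test limit).
    if (
        text.endswith("+")
        and not text.startswith("+")
        and text[:-1].lstrip("-").isdigit()
    ):
        return NO_RESPONSE
    try:
        value = int(text)
    except ValueError:
        # "CNT", "DNT", "X*", "+X+", and any other nonnumerical entry.
        return None
    # Clinical testing uses 5-dB steps; other values are treated as errors.
    return value if value % 5 == 0 else None
```

For example, `clean_threshold("110+")` yields the 120 indicator, whereas `clean_threshold("CNT")` and `clean_threshold("42")` both yield `None`.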

Data Interpretation and Applications
EHRs contain a wealth of clinical data that have the potential to provide insights about associations between medical conditions, demographics, health care processes, treatments, and patient outcomes in large samples. However, because the data are not collected or recorded in the controlled manner seen in prospective research studies, considerable caution must be taken when using them for research. Substantial effort is required to avoid or ameliorate errors and biases inherent to EHR data sets. This includes interpretation of results, as it is easy to produce "statistically significant" results in large data sets. Correctly interpreting findings from EHR studies depends both on a deep understanding of the data source and structure, as well as efforts to establish confidence in the data through processes such as those described above.
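The ease of obtaining "statistically significant" results in large data sets can be illustrated with a short calculation. The sketch below assumes a two-sided z test comparing two equal-sized groups with unit variance, so that a standardized mean difference d gives z = d * sqrt(n / 2). The group size of 365,000 is a hypothetical split used only for illustration, and the function name `two_sided_p` is an assumption.

```python
import math


def two_sided_p(d, n_per_group):
    """Two-sided p value for a standardized mean difference d.

    Assumes two groups of equal size n_per_group with known unit variance.
    """
    z = d * math.sqrt(n_per_group / 2)
    # For a standard normal, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# A negligible difference (d = 0.01) in a very large versus a small sample.
p_large = two_sided_p(0.01, 365_000)
p_small = two_sided_p(0.01, 100)
```

Under these assumptions, the same trivially small difference falls well below conventional significance thresholds in the large sample and nowhere near them in the small one, which is why effect sizes and clinical relevance deserve attention alongside p values.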
There are many advantages in working with EHR data, the most obvious being the large sample size, the vast range of variables, and the availability of longitudinal diagnostic and treatment information. Relative to prospective epidemiological studies, EHR research is time- and cost-efficient, allowing for the generation of findings without new data collection. Additionally, findings from EHR studies may lead to the generation of new hypotheses for experimental research. For audiology, research with EHRs permits examination of associations between health, demographics, and audiological variables and outcomes and can yield a better understanding of the longitudinal progression of audiological care processes. Large data sets can also facilitate the use of new methodologies, such as machine learning and predictive modeling in audiological research (Saunders et al., 2020).

Conclusions
This article outlines our experience working with EHR data under the framework of a process from patient care to researcher data use. We discuss issues that may arise when working with EHRs and describe how we addressed those issues in this data set. We conclude that EHRs have considerable utility in audiological research, though researchers must exhibit caution, consideration, and reflection when working with EHRs.