
Accuracy of heart failure ascertainment using routinely collected healthcare data: a systematic review and meta-analysis



Ascertainment of heart failure (HF) hospitalizations in cardiovascular trials is costly and complex, involving processes that could be streamlined by using routinely collected healthcare data (RCD). The utility of coded RCD for HF outcome ascertainment in randomized trials requires assessment. We systematically reviewed studies assessing RCD-based HF outcome ascertainment against “gold standard” (GS) methods to study the feasibility of using such methods in clinical trials.


Studies assessing International Classification of Diseases (ICD)-coded RCD-based HF outcome ascertainment against GS methods and reporting at least one agreement statistic were identified by searching MEDLINE and Embase from inception to May 2021. Data on study characteristics, details of RCD and GS data sources and definitions, and test statistics were reviewed. Summary sensitivities and specificities for studies ascertaining acute and prevalent HF were estimated using a bivariate random effects meta-analysis. Heterogeneity was evaluated using I2 statistics and hierarchical summary receiver operating characteristic (HSROC) curves.


A total of 58 studies of 48,643 GS-adjudicated HF events were included in this review. Strategies used to improve case identification included the use of broader coding definitions, combining multiple data sources, and using machine learning algorithms to search free text data, but these methods were not always successful and at times reduced specificity in individual studies. Meta-analysis of 17 acute HF studies showed that RCD algorithms had high specificity (96.2%, 95% confidence interval [CI] 91.5–98.3) but lacked sensitivity (63.5%, 95% CI 51.3–74.1), with similar results for 21 prevalent HF studies. There was considerable heterogeneity between studies.


RCD can correctly identify HF outcomes but may miss approximately one-third of events. Methods used to improve case identification should also focus on minimizing false positives.



Heart failure (HF) is an important cause of morbidity and mortality in the general population affecting 1–3% of adults, with over 64 million people estimated to be affected worldwide [1,2,3]. It is a significant burden on healthcare systems, accounting for about 2% of all healthcare expenditure in countries across Europe and the USA [1, 2]. Therefore, HF is an important target for treatment, requiring large randomized, controlled trials to assess potential interventions. Such large trials can be complex and costly [4, 5]. Ascertainment of HF admissions in a clinical trial often requires clinic visits (with or without manual medical records review) to identify potential events, gathering clinical documents for reported events, and independent clinical adjudication to confirm or refute events. This process could be streamlined to reduce the complexity and overall cost of trials [6,7,8]. Routinely collected healthcare data (RCD) may help to achieve this goal by supporting the ascertainment of HF outcomes during within-trial periods, and post-trial assessments of the impact on longer-term HF risk [9].

RCD is defined as “healthcare data collected for purposes other than research or without specific a priori research questions developed before collection” [10]. When patients are diagnosed with HF during a healthcare encounter, this diagnosis, along with other data relating to the encounter, is recorded in RCD, usually in the form of coded diagnoses. The most common RCD source is hospital administrative claims data (ACD), an umbrella term for data generated as part of the financial administration of hospitals [11, 12]. Other RCD sources include patient or disease registries and epidemiological surveys (detailed definitions of RCD sources used are provided in Additional file 1: Supplemental Methods). RCD can be used to ascertain events by searching the data for specific codes or coding algorithms.

Ascertaining hospitalizations for HF from such sources can be problematic as HF is a chronic disease with episodes of decompensation requiring admission, and commonly used coding systems do not distinguish between acute events and prevalent chronic disease.

A meta-analysis published in 2014 of 11 studies reporting sensitivity and specificity of coded administrative data for ascertaining HF showed that pooled sensitivity was 75% (95% confidence interval [CI] 74.7–75.9) and pooled specificity was 97% (95% CI 96.8–96.9) [13]. These findings mirrored two previous reviews [14, 15]. However, that review included a limited number of studies, and some studies had very small numbers of HF events. It is also possible that coding practices have improved over the last decade. A systematic review from 2020, focused entirely on Europe and including 20 studies using electronic health records and primary care data, reported sensitivities ≤ 66% and specificities ≥ 95% in most of the studies [16]. However, it excluded other data sources such as claims databases and registries and was geographically restricted. We have systematically reviewed all studies that assessed the utility of RCD for HF outcome ascertainment to summarise the currently available evidence supporting their use in cardiovascular outcomes trials.


This review follows the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines for conducting and reporting a systematic review [17].

Search strategy

A search was conducted of all available peer-reviewed literature on MEDLINE and Embase, from their inception (1946 and 1974 respectively) until May 2021, using the Ovid search engine. The initial search strategy was broad and aimed to include any studies where RCD was used to ascertain HF. No limits were set for the initial search. Multiple search terms, including different phrasings or synonyms of the same term, were used (see Systematic Review Protocol in the Supplementary Appendix for search strategy and inclusion criteria). After removing duplicates, the titles and abstracts of potentially eligible articles were reviewed and those meeting the inclusion criteria underwent full-text review. The references of the full-text papers were hand-searched for additional relevant articles.

Inclusion and exclusion criteria

To be included in the review, a study was required to assess the utility of coded RCD for ascertainment of HF against gold standard (GS) ascertainment criteria. We selected full-length, peer-reviewed articles published in English that used RCD to ascertain HF events and reported at least one agreement statistic, or sufficient data to allow its calculation, for International Classification of Diseases (ICD) code-based definitions of HF. All included studies must have defined a GS against which to assess the RCD-based ascertainment method and included at least 50 HF events identified using the GS method relevant to that study. The GS method is defined as the reference standard against which each study assessed its RCD-based outcome ascertainment method. Examples include medical records review using pre-specified criteria. Articles were excluded if they used free-text electronic medical records (i.e., narrative clinical notes) as the sole RCD source, as these would be considered medical records and are often used as the GS for event adjudication (see Systematic Review Protocol in Supplementary Appendix for detailed exclusion criteria).

Data extraction

The full-text articles were reviewed by the first author (MAG) who abstracted the data into a data collection form. The author extracted study characteristics, details of the data sources (RCD and GS), type of hospital encounter (e.g., inpatient, outpatient, or emergency department attendances), and data definitions used, along with agreement statistics for the ICD code or coding algorithm used to ascertain HF. The agreement statistics extracted included sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and kappa scores. Where agreement statistics were unavailable, raw data was extracted for calculation where possible. Most routine databases list the main reason for hospitalization (most responsible diagnosis) in a primary position and secondary complications or pre-existing comorbidities in secondary positions. As the distinction between these categories is likely to be important in ascertaining incident episodes of heart failure (e.g., hospitalization due to HF decompensation) as potential trial outcomes, the coding positions and agreement statistics according to coding position were also abstracted where available. If a study used more than one RCD definition or algorithm, the algorithm with the best agreement statistics was used for the main analysis.

Studies were categorized according to which types of RCD-based and GS HF events were included. Studies that only included hospitalizations for decompensated HF, irrespective of a prior HF diagnosis, were categorized as acute HF studies. These studies were the main focus of the analysis as such methods could be used to collect follow-up information in a clinical trial. Studies that included all individuals with HF recorded over the study period (new and pre-existing HF) were categorized as prevalent HF studies. Such methods could be used to identify potential participants for inclusion in clinical trials. Studies that defined HF as a comorbid disease in individuals admitted with another main diagnosis such as myocardial infarction were also included in the prevalent HF category.

If a study assessed both acute and prevalent HF, or different ICD versions, or more than one coding position separately, the agreement statistics were extracted for all relevant event types or RCD algorithms for subgroup analysis. The first author conducted a second review of the abstracted data comparing them against the original abstract to correct any discrepancies in the data collection form. Any uncertainties were resolved through discussion with two senior clinicians (MMM and RJH).

Study quality assessment

A quality assessment of the included studies was undertaken using the revised tool for Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) [18]. Three authors (WK, ME, and AEM) independently reviewed the studies and extracted data using the QUADAS-2 template, and the first author reviewed and collated the final quality assessment. Studies were classified as having a low, high, or unclear risk of bias for 4 domains (patient selection, index test, reference standard, and flow and timing) and the first 3 domains were also assessed for applicability to the review question (see Supplemental Methods in Additional file 1 for details). Studies were considered to have a “low risk” of bias or “low concern” regarding applicability if all domains were low risk. If one or more domains had unclear or high risk the study was considered to be “at risk” of bias or have “some concerns” regarding applicability. A sensitivity analysis excluding studies at risk of bias was undertaken.
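The overall classification rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the text: the domain keys and the dictionary layout are hypothetical conventions, not part of the QUADAS-2 tool or of the review's data collection form.

```python
# Domain names follow the four QUADAS-2 domains named in the text;
# the key spellings and dict structure are assumptions for illustration.
RISK_DOMAINS = ("patient_selection", "index_test", "reference_standard", "flow_and_timing")
# Only the first three domains are assessed for applicability.
APPLICABILITY_DOMAINS = RISK_DOMAINS[:3]


def overall_risk_of_bias(ratings):
    """Return "low risk" only if every domain is rated "low", otherwise "at risk"."""
    return "low risk" if all(ratings[d] == "low" for d in RISK_DOMAINS) else "at risk"


def overall_applicability(ratings):
    """Return "low concern" only if the first three domains are all rated "low"."""
    if all(ratings[d] == "low" for d in APPLICABILITY_DOMAINS):
        return "low concern"
    return "some concerns"


# Hypothetical study: one domain rated "unclear" makes the study "at risk".
example_bias = {
    "patient_selection": "low",
    "index_test": "low",
    "reference_standard": "unclear",
    "flow_and_timing": "low",
}
print(overall_risk_of_bias(example_bias))
```

A single "unclear" or "high" rating is enough to move a study out of the low-risk group, which is why the sensitivity analysis restricted to low-risk studies retains only a subset of the included studies.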

Statistical analysis

Studies were grouped according to whether they assessed acute or prevalent HF. Other potential sources of heterogeneity included coding system, position and definitions used, RCD and GS data source, study size, publication date, and country or region (e.g., Europe). All agreement statistics (sensitivity, specificity, PPV or NPV) and 95% CI (exact binomial CI) were calculated using available data (see Additional file 1: Figure S1 for an example 2 × 2 table) [19]. Summary sensitivity, specificity, and a summary receiver operating characteristic (SROC) plot with a summary curve (using the hierarchical SROC model) were obtained using the Stata command metandi [20]. As these are random effects models that may give undue weight to smaller studies, an additional sensitivity analysis was undertaken limited to studies with > 200 GS events.
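As a rough illustration of how the agreement statistics and their exact binomial confidence intervals follow from a 2 × 2 table, the sketch below computes them in plain Python. The counts are hypothetical, the function names are invented for this example, and the exact interval is assumed to be the Clopper-Pearson interval, found here by bisection rather than with the Stata routines used in the review. The direct summation is only practical for small tables.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), by direct summation (small n only)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve(k, n, target):
    """Find p with binom_cdf(k, n, p) == target; the CDF decreases as p increases."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > target:
            lo = mid  # CDF still too large, so p must be larger
        else:
            hi = mid
    return (lo + hi) / 2


def exact_ci(x, n, alpha=0.05):
    """Clopper-Pearson interval for x successes out of n."""
    lower = 0.0 if x == 0 else _solve(x - 1, n, 1 - alpha / 2)
    upper = 1.0 if x == n else _solve(x, n, alpha / 2)
    return lower, upper


def agreement_statistics(tp, fp, fn, tn):
    """Return {statistic: (estimate, lower, upper)} from 2 x 2 cell counts."""
    cells = {
        "sensitivity": (tp, tp + fn),  # GS-positive events found by RCD
        "specificity": (tn, tn + fp),  # GS-negative records not flagged by RCD
        "ppv": (tp, tp + fp),
        "npv": (tn, tn + fn),
    }
    return {name: (x / n, *exact_ci(x, n)) for name, (x, n) in cells.items()}


# Hypothetical table: 80 true positives, 5 false positives,
# 20 false negatives, 95 true negatives.
for name, (est, lo, hi) in agreement_statistics(80, 5, 20, 95).items():
    print(f"{name}: {est:.3f} ({lo:.3f} to {hi:.3f})")
```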

The I2 statistic was used to assess heterogeneity between the sensitivity and specificity estimates in addition to visual inspection of the HSROC curves [21]. All analyses were performed using Stata version 17.
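For readers unfamiliar with the I2 statistic, the following minimal sketch shows one common way to compute it for a set of proportions, using Cochran's Q on the logit scale. The choice of the logit scale, the 0.5 continuity correction, and the example counts are assumptions made for illustration; the text does not specify the exact computation used in the review.

```python
from math import log


def i_squared(events, totals):
    """I2 (%) for proportions events[i] / totals[i], pooled on the logit scale."""
    logits, weights = [], []
    for x, n in zip(events, totals):
        a, b = x + 0.5, n - x + 0.5          # continuity correction avoids log(0)
        logits.append(log(a / b))
        weights.append(1 / (1 / a + 1 / b))  # inverse of the logit variance
    pooled = sum(w * y for w, y in zip(weights, logits)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, logits))  # Cochran's Q
    df = len(logits) - 1
    # I2 is the share of total variation beyond what chance alone would produce.
    return 0.0 if q <= df else 100 * (q - df) / q


# Hypothetical sensitivities from three studies: 20/100, 60/100 and 90/100.
print(round(i_squared([20, 60, 90], [100, 100, 100]), 1))
```

Values close to 100%, as reported in this review, indicate that almost all of the observed variation between studies reflects genuine differences rather than sampling error.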

Formal testing for publication bias was undertaken by a regression of the log diagnostic odds ratio against 1/√effective sample size (ESS), weighted by ESS, with a P < 0.05 for the slope coefficient indicating significant asymmetry [22] (see Additional file 1, Supplemental Methods, Statistical Methods and Interpretation for details).
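The regression described above can be sketched as a weighted least-squares fit. In this illustration the effective sample size is assumed to take its usual form, ESS = 4 × n1 × n2 / (n1 + n2), where n1 and n2 are the numbers of GS-positive and GS-negative records; a 0.5 continuity correction is added to each cell, the study counts are hypothetical, and the function names are invented for the example. The P value for the slope would come from a t distribution with (number of studies − 2) degrees of freedom and is not computed here.

```python
from math import log, sqrt


def weighted_slope(xs, ys, ws):
    """Weighted least squares of y on x; returns (slope, intercept, slope SE)."""
    sw = sum(ws)
    xbar = sum(w * x for w, x in zip(ws, xs)) / sw
    ybar = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - xbar) * (y - ybar) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    rss = sum(w * (y - intercept - slope * x) ** 2 for w, x, y in zip(ws, xs, ys))
    se = sqrt(rss / (len(xs) - 2) / sxx)
    return slope, intercept, se


def funnel_asymmetry(studies):
    """studies: list of (tp, fp, fn, tn). Returns (slope, SE, t statistic)."""
    xs, ys, ws = [], [], []
    for tp, fp, fn, tn in studies:
        n1, n2 = tp + fn, fp + tn              # GS-positive and GS-negative totals
        ess = 4 * n1 * n2 / (n1 + n2)          # assumed effective sample size
        ln_dor = log((tp + 0.5) * (tn + 0.5) / ((fp + 0.5) * (fn + 0.5)))
        xs.append(1 / sqrt(ess))
        ys.append(ln_dor)
        ws.append(ess)
    slope, _, se = weighted_slope(xs, ys, ws)
    return slope, se, slope / se


# Hypothetical 2 x 2 counts for four studies.
example_studies = [(80, 5, 20, 95), (150, 30, 90, 900), (40, 2, 35, 300), (300, 60, 120, 2500)]
print(funnel_asymmetry(example_studies))
```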


Qualitative synthesis

Study selection

The initial Embase and MEDLINE searches yielded 2790 articles in total and an additional 56 records were identified through a manual search of references during full-text review. After the removal of duplicates and non-English language articles and abstract review, 129 articles were selected for full-text review. Of these, 71 were excluded and 58 articles were included in the final synthesis (Fig. 1).

Fig. 1

Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flowchart summarising the study selection process. Legend: EMR indicates electronic medical records; GS, gold standard; HF, heart failure; n, number of records and RCD, routinely collected healthcare data

Study characteristics

The 58 studies included 48,643 GS HF events in total. 34 studies (including 30,458 GS HF events) assessed acute HF outcomes [23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57], 21 studies (including 5210 HF events) assessed prevalent HF [12, 49, 58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76] while three studies (with 12,975 HF events) assessed both [77,78,79]. The majority of the studies (59%) were conducted in the USA and Canada. Additional file 1: Table S1 and Table S2 summarize the characteristics of the 58 studies.

Study quality assessment

The overall risk of bias was low for 28 (48%) studies (Additional file 1: Table S3). Of the remaining 30 studies, 7 had at least one high-risk domain and 23 had one or more domains with unclear risk of bias. Of 7 studies with high-risk domains, 6 had a reference standard at risk of not correctly classifying the target condition [28, 57, 68, 70, 71, 79] while, in one study, patients were inappropriately excluded from the analysis as they did not receive the reference standard [57]. Concerns regarding applicability were low for 42 studies (72%). Fourteen of the 16 studies with “some concern” regarding applicability were also considered “at risk” for overall risk of bias, with concerns about the reference standard being the most common issue in both areas.

Gold standard data sources and definition

Forty-nine (85%) studies used hospital medical records as the GS data source (Additional file 1: Table S4 summarizes the sources of routine and GS data). The remaining studies used primary care records (2 studies) [49, 76], and specialty databases or registries containing coded clinical data (5 studies) [12, 24, 35, 42, 57]. One study assessed outcomes against participant self-report [71], and another study conducted prospective medical assessments and echocardiography [37].

Most studies (85%) undertook a further step of clinical adjudication of the GS source data according to study-defined or guideline criteria. Three studies used the recoding of medical records by professional coders as the GS source [28, 68, 79], while the remaining six studies did not undertake any adjudication (Additional file 1: Table S5 summarizes the GS ascertainment methods used, and Table S6 the main guideline criteria used for GS adjudication).

Routine data sources and definition

Forty-two (72%) studies relied solely on admitted care or inpatient data sources, whilst 15 (26%) studies also used outpatient or emergency department data [23, 30, 40,41,42, 45, 49,50,51, 53, 54, 59, 71, 76, 77]. One (2%) study only used outpatient data [67]. Additional file 1: Table S4 summarises the main routine data sources. Forty-two (72%) studies used hospital discharge data (HDD) as the main RCD source. Only three (5%) studies included prescribing data [39, 57, 71], while two (3%) studies included laboratory data [23, 50]. Fifty (86%) studies used only one RCD source, whilst eight (14%) studies used a combination of two or more sources [23, 39, 50, 57, 71,72,73, 76]. Two (3%) studies combined coded HDD with machine learning algorithms and keyword searches to ascertain HF events from free text HDD, electronic medical records, and discharge summaries [72, 73].

All the studies identified used data coded in one of three revisions of the ICD coding system (ICD-8, -9, or -10) with some studies using more than one. 32 (55%) studies used ICD-9 codes only, 16 (28%) studies used ICD-10 codes only and one (2%) used ICD-8 codes only [28]. Nine (16%) studies used a combination of revisions [25, 34, 39, 48, 50, 53, 65, 70, 76].

The coding algorithms used varied considerably between studies. Four (7%) studies did not define the specific coding algorithm used [25, 29, 58, 62]. The commonest ICD-9 code used was 428.x (heart failure) alone (17 studies) or in combination with other codes (20 studies). The commonest ICD-10 code used was I50.x (heart failure) alone (9 studies) or in combination with others (15 studies). Additional file 1: Tables S7 and S8 summarize the ICD-9 and -10 coding algorithms used respectively, while Additional file 1: Table S9 includes a list of all the HF codes used in the studies along with their definitions.

Most studies specified the ICD HF code position (primary, secondary, any) within the database. Among 37 studies ascertaining acute HF, 4 (11%) studies reported algorithms with HF codes in the primary position and any position separately [28, 30, 44, 56], 11 (30%) only reported algorithms with HF codes in the primary position, and 21 (57%) only reported algorithms with codes in any position. One study algorithm (2%) used codes in positions 1–6 [36].
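To make the distinction between coding positions concrete, the sketch below shows how a narrow code search might be applied to a discharge record. It uses the two most common codes reported above (ICD-9 428.x and ICD-10 I50.x); the record layout, with the primary diagnosis listed first, and the function name are assumptions for illustration and do not reproduce any specific study algorithm.

```python
# Narrow HF definition: ICD-9 428.x and ICD-10 I50.x, matched by code stem.
HF_CODE_STEMS = ("428", "I50")


def has_hf_code(diagnoses, position="any"):
    """Check a list of ICD codes (primary diagnosis first) for an HF code.

    position="primary" looks only at the first code; "any" looks at all codes.
    """
    if position not in ("primary", "any"):
        raise ValueError("position must be 'primary' or 'any'")
    candidates = diagnoses[:1] if position == "primary" else diagnoses
    # Drop dots and normalise case so "I50.9", "i509" and "I509" all match.
    cleaned = (code.replace(".", "").upper() for code in candidates)
    return any(code.startswith(HF_CODE_STEMS) for code in cleaned)


# Hypothetical admission: myocardial infarction primary, HF as a secondary code.
admission = ["I21.0", "I50.9", "E11.9"]
print(has_hf_code(admission, "primary"), has_hf_code(admission, "any"))
```

An admission like this one is counted by an any-position algorithm but not by a primary-position algorithm, which is one reason the two approaches can differ in sensitivity and specificity.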

Ascertainment of acute heart failure

Results of individual studies

Table 1 summarizes the agreement statistics of the main study algorithm(s) for each study considering acute HF grouped by country (as RCD sources are likely to be similar) and ordered by sensitivity or PPV (highest to lowest). There was a wide range of performance across studies with sensitivities ranging from as low as 13% to > 90%. Only 8/23 (35%) studies reported a sensitivity > 80%. Although specificity also ranged widely between 20 and > 90%, 17/21 (81%) studies reported a specificity > 80%.

Table 1 Agreement statistics for the best ICD code-based algorithm(s) for acute heart failure studies


Sufficient data for meta-analysis were available for 17,986 GS HF events from 17/37 studies assessing RCD for acute HF. The funnel plot for publication bias with the superimposed regression line is shown in Additional file 1: Figure S2. The slope coefficient was not statistically significant (P value = 0.73), indicating a symmetrical funnel plot and a low likelihood of publication bias.

Table 2 provides the summary statistics for acute and prevalent RCD algorithms overall and according to the diagnostic position of HF codes. The summary sensitivity and specificity for acute HF studies were 63.5% (95% CI 51.3–74.1) and 96.2% (95% CI 91.5–98.3) respectively (Table 2). The agreement was similar in studies which included codes in the primary diagnostic position and any diagnostic position. When the analysis was restricted to 14 studies (17,540 GS HF events in total) with > 200 GS HF events the summary sensitivity was lower while specificity remained unchanged (Table 2 and Additional file 1: Figure S3a). When the analysis was restricted to 9 studies at low risk of bias, summary sensitivity was lower while specificity was similar (Table 2).

Table 2 Agreement statistics for coding algorithms ascertaining acute and prevalent heart failure according to coding position

Figure 2 shows the forest plot of paired sensitivities and specificities for acute HF studies. There was marked heterogeneity between studies ascertaining acute HF (I2 99.3% and 99.7% for sensitivity and specificity respectively). The SROC plot for acute HF (Fig. 3a) has a wide 95% prediction region with individual study algorithms scattered away from the HSROC curve also suggesting considerable heterogeneity between studies, with no clear relationship between sensitivity and specificity. Heterogeneity remained regardless of the coding position used (Additional file 1: Figure S4).

Fig. 2

Forest plot of paired sensitivities and specificities of study algorithms ascertaining acute heart failure. Legend: Algorithms sorted by diagnostic code position. Summary points estimated using a bivariate random effects model. CI indicates confidence intervals; FN, false negatives; FP, false positives; I2, I2 statistic describing the percentage of variation across studies that is due to heterogeneity rather than chance; TN, true negatives and TP, true positives

Fig. 3

SROC plots for the diagnostic accuracy of coding algorithms ascertaining acute and prevalent heart failure. Legend: a Acute heart failure (HF) algorithms and b Prevalent HF algorithms. HSROC indicates hierarchical summary receiver operating characteristic curve, grey circle, the sensitivity and (1-specificity) of an individual study with the size of the circle proportionate to study size; summary point, summary sensitivity, and specificity; 95% confidence region, 95% confidence region for the summary point, and the 95% prediction region, the area in which we can say with 95% certainty the true sensitivity and specificity of a future study will be contained

Subgroup analysis

Given the significant heterogeneity between studies, Additional file 1: Table S10 summarises agreement statistics for studies ascertaining acute HF according to other subgroups of interest that are potential sources of heterogeneity. While there were differences in summary statistics between subgroups, they had wide confidence intervals. However, algorithms from studies using medical records as the GS data source reported a higher summary sensitivity (72.6%, 95% CI 61.2–81.7) than those using registry data (41.2%, 95% CI 30.3–53.0) with similar summary specificities. Four studies with < 1500 participants had higher summary sensitivity (75.3%, 95% CI 41.4–93.0) and lower specificity (76.1%, 95% CI 63.2–85.4) compared to 13 studies with ≥ 1500 participants (59.8%, 95% CI 48.2–70.5 and 97.9%, 95% CI 95.4–99.1 respectively).

There was considerable heterogeneity with I2 ≥ 98% within all subgroups (Additional file 1: Table S10). Some of these subgroups only included a small number of studies and the summary results should be interpreted with caution.

Ascertainment of prevalent heart failure

Results of individual studies

Table 3 summarizes the agreement statistics of the main study algorithm(s) for each study ascertaining prevalent HF grouped by country and ordered by sensitivity or PPV (highest to lowest).

Table 3 Agreement statistics for the best algorithm (s) assessing prevalent heart failure

There was a wide range of performance across studies similar to acute HF studies, but a specificity ≥ 90% was reported by all 22 studies reporting specificities while only 27% reported a sensitivity ≥ 80%.


Twenty-one of 24 studies (including 19,840 GS HF events) ascertaining prevalent HF provided sufficient data for meta-analysis. Statistical testing for publication bias showed no significant asymmetry (P value = 0.57) indicating a low likelihood of publication bias (Additional file 1: Figure S2). The overall summary sensitivity and specificity were 63.7% (95% CI 55.3–71.3) and 98.1% (95% CI 97.0–98.8) respectively (Table 2). The result of restricting the analysis to 10 studies with > 200 GS events was similar to the impact on acute HF (Table 2 and Additional file 1: Figure S3b). Restricting the analysis to 8 studies at low risk of bias produced similar summary sensitivity and specificity to the overall result (Table 2).

Figure 4 shows the forest plot of paired sensitivities and specificities for prevalent HF studies. There was significant heterogeneity between studies similar to acute HF studies (Table 2, Fig. 3b, Additional file 1: Figure S5).

Fig. 4

Forest plot of paired sensitivities and specificities of study algorithms ascertaining prevalent heart failure. Legend: Algorithms sorted by diagnostic code position. Summary points are estimated using a bivariate random effects model. CI indicates confidence intervals; FN, false negatives; FP, false positives; I2, I2 statistic describing the percentage of variation across studies that is due to heterogeneity rather than chance; TN, true negatives and TP, true positives


RCD sources are becoming increasingly accessible to researchers and are an invaluable resource for cost-effective, streamlined clinical research. The present study demonstrated that acute HF outcomes ascertained using RCD have good specificity (96%) but lack sensitivity (63%), with similar results for prevalent HF outcomes. This indicates that whilst RCD-based ascertainment is effective at correctly identifying people who have HF, it missed approximately one-third of cases, suggesting that further improvements are required in HF outcome ascertainment methods. The wide confidence interval around the summary estimate of sensitivity (95% CI 51.3–74.1) is compatible with RCD-based ascertainment methods missing between 26% and 49% of acute heart failure cases. Furthermore, there was significant heterogeneity between studies and within subgroups which is not explained by differences in RCD coding algorithms, the GS or the country of origin, study size, or year of publication, suggesting there may be other factors such as differences in the populations studied. Therefore, both the summary statistics and subgroup analysis must be interpreted with caution.

A previous review suggested that the use of broader parameters along with laboratory and prescription data may help identify more cases [13]. However, this study has not been able to confirm this, as there were only a few studies using these sources. Eight studies used algorithms combining different sources, coding combinations, periods of data identification etc. [23, 31, 33, 39, 57, 59, 76, 77]. However, the sensitivity in these studies was no different from other studies with simpler algorithms and RCD sources, indicating that the use of complex algorithms did not necessarily improve sensitivity [23, 33, 76]. Using multiple codes from the same source compared to I50x/428x alone (broad vs narrow algorithms) has also not led to a significant increase in sensitivity for acute HF studies (67.1% vs 70.7%) in this meta-analysis (Additional file 1: Table S10). However, this comparison is again between the results of different studies. One study of 99 GS events compared several narrow versus broad coding definitions and found no difference in diagnostic accuracy [76]. Although using machine learning algorithms or keyword searches of free-text entries improved sensitivity this came at the cost of lower specificity in individual studies [72, 73].

Characteristics of better performing algorithms

There were 5 studies with acute HF algorithms that performed above the estimated average with sensitivities > 75% while maintaining specificities > 90% [27, 28, 32, 47, 79]. However, two of these used re-coded medical records as the GS to assess coding practices [28, 79] and all of these studies were considered ‘at risk’ of bias. The use of recoded data may not be a true reflection of the actual presence or absence of disease and may explain the high concordance. In contrast, three studies using registry data as the GS source had worse sensitivities than average (Table 1). This suggests that differences in the GS may explain some of the variation between studies. The only commonalities of the remaining 3 high-performing studies were the use of ICD-9 coded inpatient HDD as the RCD source and adjudicated medical records as the GS.

Prevalent HF studies performed better with 12 studies demonstrating sensitivities > 75% while maintaining specificities > 96%. Five of these studies used RCD from Canadian hospital discharge abstract databases which are coded according to national standards [63, 65, 72, 76, 79]. One of these combined HDD with physician billing data obtaining a sensitivity and specificity of 84.8% and 97.0% respectively (Table 3) [76]. One Canadian study increased its sensitivity from 57.4% (95% CI 51.8–63.0) using an ICD-10 code search of HDD alone to 83.3% (95% CI 73.9–72.8%) by combining the code search with a machine learning algorithm of unstructured free-text entries while maintaining specificity [72]. Similar results were obtained by a German study where combining an ICD-10 code search of HDD with a machine learning algorithm of unstructured free-text improved sensitivity from 49.5% (95% CI 42.8–56.3) to 83.8% (95% CI 78.3–88.4) [73]. The study with the highest sensitivity, specificity, and kappa scores was an Australian study which again used re-coded medical records as the GS, which may explain the high concordance [68].

Limitations of review

There are some limitations to this review. The availability of agreement statistics and information such as the coding algorithms used was variable and made direct comparison between all studies difficult. The quality of the available studies was variable with about half of studies assessed as ‘at risk’ of bias. However, restricting to studies with ‘low risk’ of bias resulted in similar summary estimates of sensitivity and specificity.

This meta-analysis utilizes the currently recommended bivariate and HSROC models which are random effects models that may give undue weight to smaller studies. However, the aim of the meta-analysis is not to present an exact summary but an overall estimate of the likely average sensitivity and specificity of using RCD for ascertainment of HF outcomes. The potential impact of using random-effects meta-analysis was assessed by doing an additional analysis limited to studies with > 200 GS events.

The comparisons between the different algorithms were limited as they were assessed in diverse study populations rather than within the same population, requiring cautious interpretation of the summary statistics and subgroup analysis. For example, a possible impact of the coding position was demonstrated in the meta-analysis results, with studies ascertaining acute HF in the primary position having better summary sensitivity and specificity than those using codes in any position (Table 2). However, four acute HF studies assessing the impact of coding position on diagnostic performance within each study all showed that using codes in the primary position reduces sensitivity and improves specificity compared to codes in any position (Table 1) [28, 30, 44, 56].

This review was also restricted to English language articles, and 24 abstract-only studies were excluded. These restrictions may have introduced publication bias, as may any studies withheld from publication because of poor validation statistics. However, no statistically significant publication bias was detected.

The WHO ICD-8, -9, and -10 codes do not support separate coding of HF sub-types (e.g., HF with preserved ejection fraction). Although some studies did include additional codes from the ICD-CM codes (USA) and the ICD-CA codes (Canada), this review could only assess the ascertainment of acute HF and prevalent HF irrespective of subtype. The implementation of the new WHO ICD-11 codes, which include heart failure codes capturing preserved, mid-range, and reduced ejection fraction, may allow HF subtypes to be captured in the future [80].

Practical implications and future directions

When using acute HF outcomes to assess treatment effects in trials, a high false negative rate (low sensitivity) will have no impact on the point estimate of the overall treatment effect (provided the missing events are evenly distributed between the control arm and active arm), but it will reduce the statistical power of the trial and lead to widening of confidence intervals. In contrast, low specificity (high false positive rate) can lead to underestimation of treatment effects. Therefore, it is important to ensure that any steps taken to improve the sensitivity of HF algorithms have minimal impact on specificity. A logical way to achieve this may be to broaden the diagnostic codes used to capture HF (and/or combine more than one data source) as attempted by some studies and add a second method to maintain specificity such as a manual review of RCD records by clinicians to confirm or refute suspected events. This second method is less resource-intensive than GS adjudication of medical records and may improve diagnostic accuracy in a similar way to using machine learning algorithms on free text entries but has not been used in any of the studies reviewed [72, 73].
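The argument about sensitivity and specificity can be checked with a short worked calculation. The sketch assumes non-differential misclassification (the same sensitivity and specificity in both arms) and hypothetical true event risks of 10% in the control arm and 8% in the active arm; the sensitivity and specificity values are the acute HF summary estimates from this review, used purely for illustration, and the function names are invented for the example.

```python
def observed_risk(true_risk, sensitivity, specificity):
    """Risk of a recorded event: detected true events plus false positives."""
    return sensitivity * true_risk + (1 - specificity) * (1 - true_risk)


def observed_risk_ratio(risk_active, risk_control, sensitivity, specificity):
    """Risk ratio (active vs control) as it would appear in the recorded data."""
    return (observed_risk(risk_active, sensitivity, specificity)
            / observed_risk(risk_control, sensitivity, specificity))


TRUE_ACTIVE, TRUE_CONTROL = 0.08, 0.10  # hypothetical true risks (risk ratio 0.80)

# Low sensitivity with perfect specificity: fewer events, same risk ratio.
rr_low_sensitivity = observed_risk_ratio(TRUE_ACTIVE, TRUE_CONTROL, 0.635, 1.0)

# Adding imperfect specificity: false positives dilute the treatment effect.
rr_low_specificity = observed_risk_ratio(TRUE_ACTIVE, TRUE_CONTROL, 0.635, 0.962)

print(round(rr_low_sensitivity, 3), round(rr_low_specificity, 3))
```

Under these assumptions, missed events shrink the event count in both arms by the same proportion, so the ratio is preserved while precision is lost, whereas false positives add roughly the same absolute risk to both arms and pull the ratio towards 1.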

Finally, the considerable variation in agreement statistics between studies may be related to differences in coding practices. Therefore, any new RCD source or ascertainment method is likely to require validation prior to use for HF outcome ascertainment.
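Validating a new RCD source against a gold standard comes down to a two-by-two comparison of coded and adjudicated events. As a minimal sketch, with hypothetical counts and names chosen only for illustration, the usual agreement statistics could be computed as follows:

```python
from dataclasses import dataclass
import math


@dataclass
class TwoByTwo:
    tp: int  # RCD-coded HF, confirmed by the gold standard
    fp: int  # RCD-coded HF, refuted by the gold standard
    fn: int  # gold standard HF with no RCD code
    tn: int  # no HF by either method


def agreement_statistics(table):
    """Sensitivity, specificity, PPV, and NPV of RCD coding against the gold standard."""
    return {
        "sensitivity": table.tp / (table.tp + table.fn),
        "specificity": table.tn / (table.tn + table.fp),
        "ppv": table.tp / (table.tp + table.fp),
        "npv": table.tn / (table.tn + table.fn),
    }


# Hypothetical validation sample of 1,000 records (not data from the review).
example = TwoByTwo(tp=127, fp=15, fn=73, tn=785)
stats = agreement_statistics(example)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Because PPV and NPV depend on how common HF is in the validation sample, they do not transfer directly to populations with a different prevalence, which is one reason a fresh validation may be needed for each new data source.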


Conclusions

While there is considerable heterogeneity between studies assessing RCD-based HF outcome ascertainment, this study confirms that the presence of HF codes in RCD correctly identifies true HF but substantially underestimates the number of events. Strategies used to improve case identification include the use of broader coding definitions, multiple data sources, and machine learning algorithms applied to free text data. However, these methods were not always successful and at times reduced specificity in individual studies. Therefore, methods used to improve case identification should also focus on minimizing false positives.

Availability of data and materials

This study brought together existing data that are openly available at the locations cited in the reference documentation. All data generated or analyzed are included in the published article and its supplementary files.



Abbreviations

ACD: Administrative claims data
CI: Confidence intervals
GS: Gold standard
HDD: Hospital discharge data
HF: Heart failure
ICD: International Classification of Disease
NPV: Negative predictive value
PPV: Positive predictive value
RCD: Routinely collected healthcare data
(H)SROC: (Hierarchical) summary receiver operating characteristic


  1. McMurray JJ, Pfeffer MA. Heart failure. Lancet. 2005;365(9474):1877–89.

  2. James SL, Abate D, Abate KH, Abay SM, Abbafati C, Abbasi N, et al. Global, regional, and national incidence, prevalence, and years lived with disability for 354 diseases and injuries for 195 countries and territories, 1990–2017: a systematic analysis for the Global Burden of Disease Study 2017. Lancet. 2018;392(10159):1789–858.

  3. Bragazzi NL, Zhong W, Shu J, Abu Much A, Lotan D, Grupper A, et al. Burden of heart failure and underlying causes in 195 countries and territories from 1990 to 2017. Eur J Prev Cardiol. 2021;28(15):1682–90.

  4. Sertkaya A, Wong HH, Jessup A, Beleche T. Key cost drivers of pharmaceutical clinical trials in the United States. Clin Trials. 2016;13(2):117–26.

  5. Speich B, von Niederhäusern B, Schur N, Hemkens LG, Fürst T, Bhatnagar N, et al. Systematic review on costs and resource use of randomized clinical trials shows a lack of transparent and comprehensive data. J Clin Epidemiol. 2018;96:1–11.

  6. Zannad F, Pfeffer MA, Bhatt DL, Bonds DE, Borer JS, Calvo-Rojas G, et al. Streamlining cardiovascular clinical trials to improve efficiency and generalisability. Heart. 2017;103(15):1156.

  7. Calvo G, McMurray JJV, Granger CB, Alonso-García Á, Armstrong P, Flather M, et al. Large streamlined trials in cardiovascular disease. Eur Heart J. 2014;35(9):544–8.

  8. Collins R. Back to the future: the urgent need to re-introduce streamlined trials. Eur Heart J Suppl. 2018;20(suppl C):C14–7.

  9. Van Staa T-P, Goldacre B, Gulliford M, Cassell J, Pirmohamed M, Taweel A, et al. Pragmatic randomized trials using routine electronic health records: putting them to the test. BMJ. 2012;344:e55.

  10. Benchimol EI, Smeeth L, Guttmann A, Harron K, Moher D, Petersen I, et al. The REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) statement. PLoS Med. 2015;12(10):e1001885.

  11. Cadarette SM, Wong L. An introduction to health care administrative data. Can J Hosp Pharm. 2015;68(3):232–7.

  12. Etzioni DA, Lessow C, Bordeianou LG, Kunitake H, Deery SE, Carchman E, et al. Concordance between registry and administrative data in the determination of comorbidity: a multi-institutional study. Ann Surg. 2020;272(6):1006–11.

  13. McCormick N, Lacaille D, Bhole V, Avina-Zubieta JA. Validity of heart failure diagnoses in administrative databases: a systematic review and meta-analysis. PLoS ONE. 2014;9(8):e104519.

  14. Quach S, Blais C, Quan H. Administrative data have high variation in validity for recording heart failure. Can J Cardiol. 2010;26(8):e306–12.

  15. Saczynski JS, Andrade SE, Harrold LR, Tjia J, Cutrona SL, Dodd KS, et al. A systematic review of validated methods for identifying heart failure using administrative data. Pharmacoepidemiol Drug Saf. 2012;21(SUPPL. 1):129–40.

  16. Davidson J, Banerjee A, Muzambi R, Smeeth L, Warren-Gash C. Validity of acute cardiovascular outcome diagnoses recorded in European electronic health records: a systematic review. Clin Epidemiol. 2020;12:1095–111.

  17. Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;372:n71.

  18. Whiting PF, Rutjes AW, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med. 2011;155(8):529–36.

  19. Seed P. DIAGT: Stata module to report summary statistics for diagnostic tests compared to true disease status. Statistical Software Components. 2010.

  20. Harbord RM, Whiting P. Metandi: Meta-analysis of diagnostic accuracy using hierarchical logistic regression. Stata J. 2009;9(2):211–29.

  21. Dwamena B. MIDAS: Stata module for meta-analytical integration of diagnostic test accuracy studies. Statistical Software Components. 2007.

  22. Deeks JJ, Macaskill P, Irwig L. The performance of tests of publication bias and other sample size effects in systematic reviews of diagnostic test accuracy was assessed. J Clin Epidemiol. 2005;58(9):882–93.

  23. Alqaisi F, Williams LK, Peterson EL, Lanfear DE. Comparing methods for identifying patients with heart failure using electronic data sources. BMC Health Serv Res. 2009;9:237.

  24. Austin PC, Daly PA, Tu JV. A multicenter study of the coding accuracy of hospital discharge administrative data for patients admitted to cardiac care units in Ontario. Am Heart J. 2002;144(2):290–6.

  25. Blackburn DF, Shnell G, Lamb DA, Tsuyuki RT, Stang MR, Wilson TW. Coding of heart failure diagnoses in Saskatchewan: a validation study of hospital discharge abstracts. J Popul Ther Clin Pharmacol. 2011;18(3):e407–15.

  26. Bosco-Levy P, Duret S, Picard F, Dos Santos P, Puymirat E, Gilleron V, et al. Diagnostic accuracy of the international classification of diseases, tenth revision, codes of heart failure in an administrative database. Pharmacoepidemiol Drug Saf. 2019;28(2):194–200.

  27. Cozzolino F, Montedori A, Abraha I, Eusebi P, Grisci C, Heymann AJ, et al. A diagnostic accuracy study validating cardiovascular ICD-9-CM codes in healthcare administrative databases. The Umbria data-value project. PLoS ONE. 2019;14(7):e0218919.

  28. Fisher ES, Whaley FS, Krushat WM, Malenka DJ, Fleming C, Baron JA, et al. The accuracy of Medicare’s hospital claims data: progress has been made, but problems remain. Am J Public Health. 1992;82(2):243–8.

  29. Fonseca C, Sarmento PM, Marques F, Ceia F. Validity of a discharge diagnosis of heart failure: implications of misdiagnosing. Congest Heart Fail. 2008;14(4):187–91.

  30. Frolova N, Bakal JA, McAlister FA, Rowe BH, Quan H, Kaul P, et al. Assessing the use of International Classification of Diseases-10th revision codes from the emergency department for the identification of acute heart failure. JACC: Heart Fail. 2015;3(5):386–91.

  31. Goff DC Jr, Pandey DK, Chan FA, Ortiz C, Nichaman MZ. Congestive heart failure in the United States: Is there more than meets the I(CD Code)? The Corpus Christi Heart Project. Arch Intern Med. 2000;160(2):197–202.

  32. Heckbert SR, Kooperberg C, Safford MM, Psaty BM, Hsia J, McTiernan A, et al. Comparison of self-report, hospital discharge codes, and adjudication of cardiovascular events in the Women’s Health Initiative. Am J Epidemiol. 2004;160(12):1152–8.

  33. Huang H, Turner M, Raju S, Reich J, Leatherman S, Armstrong K, et al. Identification of acute decompensated heart failure hospitalisations using administrative data. Am J Cardiol. 2017;119(11):1791–6.

  34. Ingelsson E, Ärnlöv J, Sundström J, Lind L. The validity of a diagnosis of heart failure in a hospital discharge register. Eur J Heart Fail. 2005;7(5):787–91.

  35. Jollis JG, Ancukiewicz M, DeLong ER, Pryor DB, Muhlbaier LH, Mark DB. Discordance of databases designed for claims payment versus clinical information systems: Implications for outcomes research. Ann Intern Med. 1993;119(8):844–50.

  36. Khand AU, Shaw M, Gemmel I, Cleland JGF. Do discharge codes underestimate hospitalisation due to heart failure? Validation study of hospital discharge coding for heart failure. Eur J Heart Fail. 2005;7(5):792–7.

  37. Kümler T, Gislason GH, Kirk V, Bay M, Nielsen OW, Køber L, et al. Accuracy of a heart failure diagnosis in administrative registers. Eur J Heart Fail. 2008;10(7):658–60.

  38. Lee DS, Donovan L, Austin PC, Gong Y, Liu PP, Rouleau JL, et al. Comparison of coding of heart failure and comorbidities in administrative and clinical data for use in outcomes research. Med Care. 2005;43(2):182–8.

  39. Mahonen M, Jula A, Harald K, Antikainen R, Tuomilehto J, Zeller T, et al. The validity of heart failure diagnoses obtained from administrative registers. Eur J Prev Cardiol. 2013;20(2):254–9.

  40. Mard S, Nielsen FE. Positive predictive value and impact of misdiagnosis of a heart failure diagnosis in administrative registers among patients admitted to a University Hospital cardiac care unit. Clin Epidemiol. 2010;2:235–9.

  41. McCullough PA, Philbin EF, Spertus JA, Kaatz S, Sandberg KR, Weaver WD, et al. Confirmation of a heart failure epidemic: findings from the Resource Utilization Among Congestive Heart Failure (REACH) study. J Am Coll Cardiol. 2002;39(1):60–9.

  42. Merry AH, Boer JM, Schouten LJ, Feskens EJ, Verschuren WM, Gorgels AP, et al. Validity of coronary heart diseases and heart failure based on hospital discharge and mortality data in the Netherlands using the cardiovascular registry Maastricht cohort study. Eur J Epidemiol. 2009;24(5):237–47.

  43. Ono Y, Taneda Y, Takeshima T, Iwasaki K, Yasui A. Validity of claims diagnosis codes for cardiovascular diseases in diabetes patients in Japanese administrative database. Clin Epidemiol. 2020;12:367–75.

  44. Psaty BM, Delaney JA, Arnold AM, Curtis LH, Fitzpatrick AL, Heckbert SR, et al. Study of cardiovascular health outcomes in the era of claims data. Circulation. 2016;133(2):156–64.

  45. Roger VL, Weston SA, Redfield MM, Hellermann-Homan JP, Killian J, Yawn BP, et al. Trends in heart failure incidence and survival in a community-based population. JAMA. 2004;292(3):344–50.

  46. Schaufelberger M, Ekestubbe S, Hultgren S, Persson H, Reimstad A, Schaufelberger M, et al. Validity of heart failure diagnoses made in 2000–2012 in western Sweden. ESC Heart Fail. 2020;7(1):37–46.

  47. Schellenbaum GD, Heckbert SR, Smith NL, Rea TD, Lumley T, Kitzman DW, et al. Congestive heart failure incidence and prognosis: case identification using central adjudication versus hospital discharge diagnoses. Ann Epidemiol. 2006;16(2):115–22.

  48. Teng THK, Finn J, Hung J, Geelhoed E, Hobbs M. A validation study: how effective is the hospital morbidity data as a surveillance tool for heart failure in Western Australia? Aust Public Health. 2008;32(5):405–7.

  49. Wilchesky M, Tamblyn RM, Huang A. Validation of diagnostic codes within medical services claims. J Clin Epidemiol. 2004;57(2):131–41.

  50. Cohen SS, Roger VL, Weston SA, Jiang R, Movva N, Yusuf AA, et al. Evaluation of claims-based computable phenotypes to identify heart failure patients with preserved ejection fraction. Pharmacol Res Perspect. 2020;8(6):e00676.

  51. Delekta J, Hansen SM, AlZuhairi KS, Bork CS, Joensen AM. The validity of the diagnosis of heart failure (I50.0-I50.9) in the Danish National Patient Register. Dan Med J. 2018;65(4):5470.

  52. Pfister R, Michels G, Wilfred J, Luben R, Wareham NJ, Khaw K-T. Does ICD-10 hospital discharge code I50 identify people with heart failure? A validation study within the EPIC-Norfolk study. Int J Cardiol. 2013;168(4):4413–4.

  53. Sundbøll J, Adelborg K, Munch T, Frøslev T, Sørensen HT, Bøtker HE, et al. Positive predictive value of cardiovascular diagnoses in the Danish National Patient Registry: a validation study. BMJ Open. 2016;6(11):e012832.

  54. Thygesen SK, Christiansen CF, Christensen S, Lash TL, Sørensen HT. The predictive value of ICD-10 diagnostic coding used to assess Charlson comorbidity index conditions in the population-based Danish National Registry of Patients. BMC Med Res Methodol. 2011;11:83.

  55. Presley CA, Min JY, Chipman J, Greevy RA, Grijalva CG, Griffin MR, et al. Validation of an algorithm to identify heart failure hospitalisations in patients with diabetes within the veterans health administration. BMJ Open. 2018;8(3):e020455.

  56. Rosamond WD, Chang PP, Baggett C, Johnson A, Bertoni AG, Shahar E, et al. Classification of heart failure in the atherosclerosis risk in communities (ARIC) study. Circ Heart Fail. 2012;5(2):152–9.

  57. Li Q, Glynn RJ, Dreyer NA, Liu J, Mogun H, Setoguchi S. Validity of claims-based definitions of left ventricular systolic dysfunction in medicare patients. Pharmacoepidemiol Drug Saf. 2011;20(7):700–8.

  58. Chong WF, Ding YY, Heng BH. A comparison of comorbidities obtained from hospital administrative data and medical charts in older patients with pneumonia. BMC Health Serv Res. 2011;11(1):105.

  59. Fleming ST, Sabatino SA, Kimmick G, Cress R, Wu XC, Trentham-Dietz A, et al. Developing a claim-based version of the ACE-27 comorbidity index: a comparison with medical record review. Med Care. 2011;49(8):752–60.

  60. Humphries KH, Rankin JM, Carere RG, Buller CE, Kiely FM, Spinelli JJ. Co-morbidity data in outcomes research: are clinical data derived from administrative databases a reliable alternative to chart review? J Clin Epidemiol. 2000;53(4):343–9.

  61. Powell H, Lim LLY, Heller RF. Accuracy of administrative data to assess comorbidity in patients with heart disease: an Australian perspective. J Clin Epidemiol. 2001;54(7):687–93.

  62. Preen DB, Holman CDAJ, Lawrence DM, Baynham NJ, Semmens JB. Hospital chart review provided more accurate comorbidity information than data from a general practitioner survey or an administrative database. J Clin Epidemiol. 2004;57(12):1295–304.

  63. Quan H, Parsons GA, Ghali WA. Validity of information on comorbidity derived from ICD-9-CCM administrative data. Med Care. 2002;40(8):675–85.

  64. Sarfati D, Hill S, Purdie G, Dennett E, Blakely T. How well does routine hospitalisation data capture information on comorbidity in New Zealand? N Z Med J. 2010;123(1310):50–61.

  65. So L, Evans D, Quan H. ICD-10 coding algorithms for defining comorbidities of acute myocardial infarction. BMC Health Serv Res. 2006;6:161.

  66. Soo M, Robertson LM, Ali T, Clark LE, Fluck N, Johnston M, et al. Approaches to ascertaining comorbidity information: validation of routine hospital episode data with clinician-based case note review. BMC Res Notes. 2014;7:253.

  67. Borzecki AM, Wong AT, Hickey EC, Ash AS, Berlowitz DR. Identifying hypertension-related comorbidities from administrative data: what’s the optimal approach? Am J Med Qual. 2004;19(5):201–6.

  68. Henderson T, Shepheard J, Sundararajan V. Quality of diagnosis and procedure coding in ICD-10 administrative data. Med Care. 2006;44(11):1011–9.

  69. Kieszak SM, Flanders WD, Kosinski AS, Shipp CC, Karp H. A comparison of the Charlson comorbidity Index derived from medical record data and administrative billing data. J Clin Epidemiol. 1999;52(2):137–42.

  70. Quan H, Li B, Duncan Saunders L, Parsons GA, Nilsson CI, Alibhai A, et al. Assessing validity of ICD-9-CM and ICD-10 administrative data in recording clinical conditions in a unique dually coded database. Health Serv Res. 2008;43(4):1424–41.

  71. Rector TS, Wickstrom SL, Shah M, Thomas Greenlee N, Rheault P, Rogowski J, et al. Specificity and sensitivity of claims-based algorithms for identifying members of Medicare+Choice health plans that have chronic medical conditions. Health Serv Res. 2004;39(6 Pt 1):1839–57.

  72. Xu Y, Martin E, D’Souza AG, Doktorchik CTA, Jiang J, Lee S, et al. Enhancing ICD-Code-based case definition for heart failure using electronic medical record data. J Card Fail. 2020;15:610–7.

  73. Kaspar M, Fette G, Güder G, Seidlmayer L, Ertl M, Dietrich G, et al. Underestimated prevalence of heart failure in hospital inpatients: a comparison of ICD codes and discharge letter information. Clin Res Cardiol. 2018;107(9):778–87.

  74. Luthi J-C, Troillet N, Eisenring M-C, Sax H, Burnand B, Quan H, et al. Administrative data outperformed single-day chart review for comorbidity measure. Internat J Qual Health Care. 2007;19(4):225–31.

  75. van Doorn C, Bogardus ST, Williams CS, Concato J, Towle VR, Inouye SK. Risk adjustment for older hospitalized persons: a comparison of two methods of data collection for the Charlson index. J Clin Epidemiol. 2001;54(7):694–701.

  76. Schultz SE, Rothwell DM, Chen Z, Tu K. Identifying cases of congestive heart failure from administrative data: a validation study using primary care patient records. Chron Dis Inj Canada. 2013;33(3):160–6.

  77. Allen LA, Yood MU, Wagner EH, Aiello Bowles EJ, Pardee R, Wellman R, et al. Performance of claims-based algorithms for identifying heart failure and cardiomyopathy among patients diagnosed with breast cancer. Med Care. 2014;52(5):e30–8.

  78. Birman-Deych E, Waterman AD, Yan Y, Nilasena DS, Radford MJ, Gage BF. Accuracy of ICD-9-CM codes for identifying cardiovascular and stroke risk factors. Med Care. 2005;43(5):480–5.

  79. Juurlink D PC, Croxford R, Chong A, Austin P, Tu J, Laupacis A. Canadian Institute for Health Information Discharge Abstract Database: a validation study. Toronto: Institute for Clinical Evaluative Sciences; 2006.

  80. International Classification of Diseases. Eleventh Revision (ICD-11). Geneva: World Health Organisation; 2022.

Rights retention

For the purpose of open access, the author(s) has applied a Creative Commons Attribution (CC BY) license to any Author Accepted Manuscript version arising.


Funding

This study was conducted using departmental funding from the Clinical Trial Service Unit (CTSU), Nuffield Department of Population Health, University of Oxford. CTSU receives support from the UK Medical Research Council (which funds the MRC Population Health Research Unit in a strategic partnership with the University of Oxford, MC-UU_00017/3, MC-UU_00017/5), the British Heart Foundation, Cancer Research UK, and Health Data Research (HDR) UK.

Author information

Authors and Affiliations



Contributions

All authors contributed to the study's conception and design. The literature search, qualitative synthesis, and statistical analysis were performed by Michelle A. Goonasekera. Marion M. Mafham and Richard J. Haynes acted as second reviewers to resolve any uncertainties. Waseem Karsan, Muram El-Nayir, Amy E. Mallorie, and Michelle A. Goonasekera undertook the quality assessment. Statistical analysis and data interpretation were supervised by Sarah Parish and Alison Offer. The first draft of the manuscript was written by Michelle A. Goonasekera, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Marion M. Mafham.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

SP, MM, AO, RH, MG, WK, ME, and AEM work in the Clinical Trial Service Unit and Epidemiological Studies Unit of the Nuffield Department of Population Health at the University of Oxford. The unit has a staff policy of not taking any personal payments directly or indirectly from industry (with reimbursement sought only for the costs of travel and accommodation to attend scientific meetings). It has received research grants from Abbott, AstraZeneca, Bayer, Boehringer Ingelheim, Eli Lilly, GlaxoSmithKline, The Medicines Company, Merck, Mylan, Novartis, Novo Nordisk, Pfizer, Roche, Schering, and Solvay, which are governed by University of Oxford contracts that protect its independence.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1:

Supplemental methods. Table S1. Characteristics of studies ascertaining acute heart failure (ordered by country and number of gold standard events). Table S2. Characteristics of studies ascertaining prevalent heart failure (ordered by country and number of gold standard events). Table S3. QUADAS-2 study quality assessment. Table S4. Sources of routine and gold standard data by country or region. Table S5. Gold standard heart failure ascertainment methods used in the reviewed studies. Table S6. Guidelines used for gold standard adjudication. Table S7. ICD-9 coding algorithms used to define heart failure in the studies reviewed. Table S8. ICD-10 coding algorithms used to define heart failure in the studies reviewed. Table S9. List of ICD codes used across the studies and their definitions. Table S10. Summary diagnostic accuracy statistics for coding algorithms ascertaining acute heart failure according to subgroup. Supplemental Figure S1. Calculation of performance statistics. Supplemental Figure S2. Funnel plot for the meta-analysis of studies ascertaining acute and prevalent HF using effective sample size weighted regression tests of funnel plot asymmetry. Supplemental Figure S3. SROC plot for the diagnostic accuracy of coding algorithms in studies with > 200 gold standard (GS) heart failure (HF) events. Supplemental Figure S4. SROC plots for the diagnostic accuracy of RCD algorithms ascertaining acute heart failure according to coding position. Supplemental Figure S5. SROC plots for the diagnostic accuracy of RCD algorithms ascertaining prevalent heart failure according to coding position.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. A copy of this licence can be viewed on the Creative Commons website. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Goonasekera, M.A., Offer, A., Karsan, W. et al. Accuracy of heart failure ascertainment using routinely collected healthcare data: a systematic review and meta-analysis. Syst Rev 13, 79 (2024).
