NOVA and DENOVA scores were developed to guide endocarditis risk assessment in Enterococcus faecalis bacteremia (EfB), but some of their criteria may be open to interpretation. We aimed to evaluate their inter-rater reliability and feasibility.
Methods: Thirty-two physicians from four specialties involved in the management of endocarditis independently evaluated eight EfB patient records using the NOVA and DENOVA scores. DENOVA was applied eight times and NOVA sixteen times per record. Inter-rater reliability was measured with Krippendorff’s alpha, and agreement with Fleiss’ kappa. Completion time was also recorded.
Results: No record received identical scores from all raters. NOVA showed low inter-rater reliability (α = 0.37), while DENOVA reached moderate levels (α = 0.49). High agreement was found for extreme score values, but agreement dropped markedly for intermediate values. Among score items, auscultation of a murmur (A) and valve disease (V) had the highest reliability (α > 0.8), while duration of symptoms (D) and origin of infection (O) had the lowest (α < 0.2). Completion times were similar between NOVA and DENOVA but varied by specialty.
Conclusion: The reproducibility of these scores is limited, especially near critical thresholds, highlighting the need to complement scoring tools with clinical judgment in EfB.
Gram-positive cocci bacteremia is frequently associated with endocarditis: 5 %‒20 % of cases for Staphylococcus aureus, 8 %‒26 % for Enterococcus faecalis, and 1 %‒48 % for Streptococcus spp., depending on the species.1,2 Hence, reliable and simple tools are needed to rapidly identify patients with bacteremia at higher risk of endocarditis.
In their narrative review, Rasmussen et al. present an algorithm based on an endocarditis risk stratification system for investigating patients with gram-positive cocci bacteremia.1 This risk assessment uses scores developed to help clinicians target patients requiring echocardiography: the VIRSTA, PREDICT and POSITIVE scores for S. aureus; the NOVA and DENOVA scores for E. faecalis; or the HANDOC score for non-β-hemolytic Streptococcus spp.3,4 These scores were designed and optimized to maximize diagnostic performance. In E. faecalis bacteremia, the DENOVA score performs better than the NOVA score, chiefly through higher specificity (sensitivity 95 %‒100 % and specificity 84 %‒85 % for a threshold ≥ 3, versus sensitivity 97 %‒99 % and specificity 23 %‒56 % for a threshold ≥ 4).4,5
However, these scores were established from data collected retrospectively and some of their criteria may be open to interpretation. Reproducibility and feasibility studies, which have not yet been carried out, are therefore needed to address these limitations.
The objective of our study was therefore to investigate the reproducibility and feasibility of the NOVA and DENOVA scores.
Methods
Definitions
The NOVA score includes the 4 criteria that compose its acronym: N (5 points), O (4 points), V (2 points), and A (1 point). The DENOVA score adds two criteria (D and E) and weights all six criteria equally, at 1 point each.
The criteria were defined as follows: D, duration of symptoms consistent with endocarditis ≥ 7 days before the first positive blood culture; E, clinical examination or imaging result compatible with septic embolization; N, number of positive blood cultures for E. faecalis ≥ 2; O, absence of a focal infection that could be the origin of the bacteremia; V, heart valve disease predisposing to a moderate or high risk of infective endocarditis, including native valve disease, previous endocarditis, or the presence of a valve prosthesis; A, auscultation of a heart murmur.
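For illustration, the scoring logic reduces to a weighted sum; the minimal Python sketch below encodes the weights and thresholds given above (function and variable names are ours and purely illustrative, not part of the original studies):

```python
# Minimal sketch of NOVA/DENOVA computation from the six binary criteria.
# Weights and positivity thresholds follow the definitions above.

NOVA_WEIGHTS = {"N": 5, "O": 4, "V": 2, "A": 1}    # NOVA: weighted items, positive if >= 4
DENOVA_ITEMS = ("D", "E", "N", "O", "V", "A")      # DENOVA: 1 point each, positive if >= 3

def nova_score(criteria):
    """Sum the NOVA weights of the criteria marked present."""
    return sum(w for item, w in NOVA_WEIGHTS.items() if criteria.get(item, False))

def denova_score(criteria):
    """Count how many of the six DENOVA criteria are present."""
    return sum(1 for item in DENOVA_ITEMS if criteria.get(item, False))

# Hypothetical patient: >= 2 positive blood cultures (N), no identified focus
# of infection (O), and a heart murmur (A); the other criteria are absent.
patient = {"D": False, "E": False, "N": True, "O": True, "V": False, "A": True}
print(nova_score(patient))    # 10 -> NOVA positive (threshold >= 4)
print(denova_score(patient))  # 3  -> DENOVA positive (threshold >= 3)
```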
Score feasibility and reproducibility
In one of the centers included in the DENOVA validation study, 27 patients had E. faecalis bacteremia in 2019.5 Eight of them (P1‒P8) were intentionally selected to cover a gradient of endocarditis risk (DENOVA scores 0‒5). According to the ESC 2023 modified Duke criteria, two patients (P1 and P6) had definite and two (P3 and P4) possible endocarditis.6
Thirty-two clinicians (8 infectious diseases physicians, 8 cardiologists, 8 internal medicine physicians, and 8 medical residents) participated as raters. They had full access to the retrospective medical records, including clinical notes, microbiological results, and imaging reports. Each rater assessed four cases: two using NOVA and two using DENOVA. Because NOVA is fully embedded within DENOVA, the same items were scored twice, resulting in 64 evaluations for DENOVA and 128 for NOVA, with DENOVA and NOVA scores determined 8 and 16 times per medical record, respectively.
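These totals follow directly from the allocation; as a quick arithmetic check (a sketch restating the design, with all numbers taken from the text above):

```python
# Arithmetic check of the evaluation counts implied by the study design.
raters, records = 32, 8
denova_evals = raters * 2                 # each rater scores 2 cases with DENOVA -> 64
nova_evals = raters * 2 + denova_evals    # 2 direct NOVA cases per rater, plus the
                                          # NOVA items embedded in DENOVA -> 128
print(denova_evals, nova_evals)                        # 64 128
print(denova_evals // records, nova_evals // records)  # 8 DENOVA and 16 NOVA scores per record
```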
The data, collected by questionnaire, comprised the response to each criterion and the time spent reading the record and determining the score (completion time).
Study endpoints and specific statistical analysis
The primary endpoint was inter-rater reliability, assessed with Krippendorff’s alpha coefficient: α = 1 indicates perfect reliability and α = 0 the absence of reliability; the score was considered similarly interpretable by different raters if α ≥ 0.8, and still acceptable if α ≥ 0.67.7 The mean concordance for each criterion was calculated by averaging the percentages of agreement for each criterion of each patient; 95 % confidence intervals of concordance were computed by 1000-iteration bootstrapping with random resampling, with replacement, of patients and raters. Agreement for each score value was assessed with Fleiss’ kappa.8 Completion times were compared using the Kruskal-Wallis test.
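As an illustration of how such an analysis could be run, the Python sketch below combines Krippendorff’s alpha (with a bootstrap confidence interval resampling patients and raters), Fleiss’ kappa, and a Kruskal-Wallis test. The krippendorff, statsmodels, and scipy packages are one possible tool choice, and the data are randomly generated placeholders; this is not the authors’ actual code:

```python
# Illustrative reliability analysis on fake data: Krippendorff's alpha with a
# bootstrap CI, Fleiss' kappa, and a Kruskal-Wallis test on completion times.
import numpy as np
import krippendorff                                    # pip install krippendorff
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa
from scipy.stats import kruskal

rng = np.random.default_rng(0)

# Ratings matrix: one row per rater, one column per patient record (binary here,
# as for a single criterion such as D or O).
ratings = rng.integers(0, 2, size=(8, 8))

alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="nominal")

# 95% CI: resample raters (rows) and patients (columns) with replacement, 1000 times.
boot = []
for _ in range(1000):
    rows = rng.choice(ratings.shape[0], ratings.shape[0], replace=True)
    cols = rng.choice(ratings.shape[1], ratings.shape[1], replace=True)
    boot.append(krippendorff.alpha(reliability_data=ratings[np.ix_(rows, cols)],
                                   level_of_measurement="nominal"))
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])

# Fleiss' kappa expects a subjects x categories count table; aggregate_raters
# builds it from a (subjects, raters) matrix, hence the transpose.
table, _ = aggregate_raters(ratings.T)
kappa = fleiss_kappa(table, method="fleiss")

# Kruskal-Wallis on completion times (seconds) across the four rater groups.
groups = [rng.normal(mu, 60, 16) for mu in (250, 270, 310, 400)]
h_stat, p_value = kruskal(*groups)

print(f"alpha={alpha:.2f} [{ci_low:.2f}-{ci_high:.2f}], kappa={kappa:.2f}, p={p_value:.3f}")
```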
To our knowledge, formal power calculations are not available for Krippendorff’s α or Fleiss’ κ. However, a post-hoc power calculation for the t-test comparing completion times indicated a power of 0.88.
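For readers who wish to reproduce such a check, the sketch below shows a post-hoc power computation for a two-sample t-test with statsmodels; the effect size and group sizes are hypothetical placeholders, since the study’s actual inputs are not reported here:

```python
# Illustrative post-hoc power for a two-sample t-test (placeholder inputs).
from statsmodels.stats.power import TTestIndPower

# effect_size is Cohen's d; nobs1 is the size of group 1; ratio = n2/n1.
power = TTestIndPower().power(effect_size=0.8, nobs1=32, alpha=0.05, ratio=1.0)
print(round(power, 2))  # ~0.88 with these placeholder inputs
```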
Results
Score reproducibility
The distribution of scores obtained for each record with NOVA and DENOVA is shown in Fig. 1. No record was given the same score by all of its evaluators. The score was positive or negative for all evaluators in only half of instances (8/16 for NOVA and 4/8 for DENOVA).
Using Krippendorff’s alpha, inter-rater reliability was significant for both scores, but with a low α value for NOVA (α = 0.37 [0.24‒0.49]; p < 0.001) and a moderate one for DENOVA (α = 0.49 [0.24‒0.73]; p < 0.01). When each criterion was considered independently, V and A showed strong inter-rater reliability (α = 0.86 and 0.83; concordance 95 %), E and N moderate reliability (α = 0.34 and 0.41; concordance 77 % and 75 %), and D and O the lowest reliability (α = 0.16 and 0.13; concordance 55 % and 62 %). Confidence intervals and p-values are detailed in Table 1.
Table 1. Inter-rater reliability for each criterion, assessed with Krippendorff’s alpha.
The mean concordance was calculated by averaging the percentages of agreement for each criterion of each patient (values range from 50 % to 100 %); 95 % confidence intervals of concordance were computed by 1000-iteration bootstrapping with random resampling, with replacement, of patients and raters. Inter-rater reliability was assessed for each criterion using Krippendorff’s alpha coefficient (α = 1 indicates perfect reliability and α = 0 the absence of reliability; the score was considered similarly interpretable by different raters if α ≥ 0.8, and still acceptable if α ≥ 0.67).
When agreement was assessed by score value with Fleiss’ kappa, moderate agreement was observed for very high or very low values, but agreement was not significant for intermediate values (6‒9 for NOVA and 1‒3 for DENOVA).
Score feasibility
No significant difference was observed between the median completion times for the NOVA and DENOVA scores (4′38 vs. 5′32, p = 0.21). Median completion time differed between evaluator groups: 4′11 (extremes: 1′46 to 9′06) for cardiologists, 4′32 (2′36 to 10′05) for infectious diseases physicians, 5′09 (1′15 to 12′55) for internal medicine physicians, and 6′37 (3′00 to 14′20) for medical residents (p < 0.01). Completion time did not differ according to whether the score was positive or negative, for both NOVA (4′33 vs. 5′15, p = 0.4) and DENOVA (6′03 vs. 4′55, p = 0.21).
Discussion
Both scores showed overall poor inter-rater reproducibility, particularly for values close to the positivity threshold. Some raters also occasionally disagreed with the conclusion of the score they had obtained (data not reported). Beyond diagnostic performance and reproducibility, we also considered feasibility, using completion time as a pragmatic, albeit imperfect, indicator. The similar durations observed for NOVA and DENOVA suggest that the two additional items in DENOVA do not increase the practical burden for clinicians. As in other domains, our findings confirm that both reproducibility and ease of use should be assessed when developing a new scoring system for endocarditis risk assessment, especially when the criteria involved are open to personal interpretation.9
In addition, the DENOVA and NOVA scores have other limitations. Endocarditis diagnosis has no gold standard and relies on major and minor criteria ‒ some of which are already part of the DENOVA score (number of positive blood cultures, predisposing cardiac lesion, embolization) ‒ and the major criterion of cardiac imaging was not sought in all patients in the original studies.6 Moreover, using retrospective data to develop and validate a score is also an issue: first, criteria are easier to evaluate afterwards, with the entire medical record available; second, determining the exact time of assessment is often ambiguous, as the question of transesophageal echocardiography (TEE) remains unresolved for many patients. These limitations are sources of important biases that may overestimate score performance.10
In our study, raters assessed retrospective data. This design made it possible to explore both interpretation variability and uncertainty, while allowing a larger number of independent evaluators to be included. Nevertheless, it may not reproduce real-time clinical reasoning. In addition, the small number of clinical scenarios may not fully reflect the diversity of E. faecalis bacteremia presentations, and the limited number of raters per case could have reduced the precision of inter-rater comparisons.
Further studies, particularly with prospective data collection, are therefore needed to better define the real-life usefulness of endocarditis scores and to compare their performance with clinical judgement. The objective remains to avoid unnecessary TEE in patients at low risk of endocarditis.
Conclusion
When assessing the risk of endocarditis in patients with E. faecalis bacteremia, low inter-rater reproducibility suggests that scores should not replace clinical judgment, especially in situations of intermediate risk.
Ethical approval
The study was approved by an institutional review board, the Ethical Committee of Research in Tropical and Infectious Diseases (CER-MIT 2022–0106).
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
We thank all the physicians who responded to the questionnaires, and Prof. Xavier Duval and Prof. Mathieu Nacher for their critical reading of the manuscript.