Methods

How we find, appraise, synthesize, and report the diagnostic accuracy data behind every finding on BedsideDx.

Why we use likelihood ratios

A likelihood ratio (LR) describes how much a test result shifts the probability of disease. An LR+ of 10 raises the post-test probability substantially after a positive result; an LR− of 0.1 substantially lowers it after a negative result. Unlike predictive values, LRs do not depend directly on disease prevalence, and unlike sensitivity and specificity taken separately, a single LR can be applied straight to a pre-test probability, which is why we lead with them.

We translate each LR into one of five helpfulness labels using the standard Sackett interpretation thresholds (JAMA 1994, “How to use an article about a diagnostic test”):

  • Very helpful: LR+ ≥ 10 or LR− ≤ 0.1 — large, often conclusive shifts
  • Helpful: LR+ 5–<10 or LR− >0.1–0.2 — moderate shifts
  • Somewhat helpful: LR+ 2–<5 or LR− >0.2–0.5 — small but sometimes useful shifts
  • Minimally helpful: LR+ 1.4–<2 or LR− >0.5–0.7 — small shifts, rarely actionable
  • Not helpful: LR in the no-information zone (LR+ <1.4 or LR− >0.7), or a 95% CI that crosses 1.0. A result statistically indistinguishable from a useless test shouldn’t shift clinical decisions, even if the point estimate looks favorable.
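
The thresholds above can be sketched as a small lookup. This is an illustrative sketch only, not the site's actual code: the function name `helpfulness_label` and the `ci` parameter are hypothetical, a CI is assumed to "cross" 1.0 when 1.0 lies between its bounds (inclusive), and any LR at or below 0.7 is read on the rule-out scale regardless of which test result produced it.

```python
from typing import Optional, Tuple


def helpfulness_label(lr: float, ci: Optional[Tuple[float, float]] = None) -> str:
    """Map a likelihood ratio to one of the five helpfulness labels.

    Hypothetical helper: `ci` is an optional (lower, upper) 95% CI.
    """
    if lr < 0:
        raise ValueError("a likelihood ratio cannot be negative")

    # Assumption: a CI "crosses 1.0" when 1.0 lies within its bounds.
    if ci is not None and ci[0] <= 1.0 <= ci[1]:
        return "Not helpful"

    if lr >= 1.4:
        # Rule-in scale (LR+ thresholds from the list above).
        if lr >= 10:
            return "Very helpful"
        if lr >= 5:
            return "Helpful"
        if lr >= 2:
            return "Somewhat helpful"
        return "Minimally helpful"

    if lr <= 0.7:
        # Rule-out scale (LR- thresholds from the list above).
        if lr <= 0.1:
            return "Very helpful"
        if lr <= 0.2:
            return "Helpful"
        if lr <= 0.5:
            return "Somewhat helpful"
        return "Minimally helpful"

    # Between 0.7 and 1.4: the no-information zone.
    return "Not helpful"
```

For example, `helpfulness_label(6.0)` gives "Helpful", while `helpfulness_label(6.0, ci=(0.8, 40.0))` gives "Not helpful" because the interval includes 1.0.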

Some findings work in reverse: a positive finding actually argues against the diagnosis. The classic example is reproducible chest wall tenderness during workup of acute coronary syndrome (ACS): a positive finding decreases the probability of ACS. Mathematically that’s the same as a strong rule-out signal, and we label it that way on the site so the clinical meaning is clear.

Literature search

For every finding-diagnosis pair, we search the published diagnostic-accuracy literature in this order of preference:

  1. Audit-grade systematic reviews and meta-analyses with PRISMA-compliant search and QUADAS-2 risk-of-bias appraisal, and original JAMA Rational Clinical Examination series articles. When one of these directly addresses the clinical question, we use its pooled LRs and confidence intervals as published.
  2. Primary diagnostic-accuracy studies from PubMed and Embase when no audit-grade synthesis exists. We screen for consecutive or random patient selection, a blinded reference standard, adequate sample size, and full diagnostic-accuracy reporting (sensitivity, specificity, and either LRs or the underlying 2×2 data).
  3. When the literature returns no source meeting our inclusion criteria, we label the finding Insufficient Evidence rather than computing a number from low-quality inputs.

What we don’t use as a primary tier anchor: preprints, conference abstracts, non-peer-reviewed technical reports, and compiled evidence-based-exam textbooks. A tier anchor should be a piece of peer-reviewed primary evidence or an audit-grade synthesis — not a curated secondary work. Compiled textbooks and review chapters are useful for orientation and may appear in supporting notes where they supply a unique piece of synthesis, but they don’t determine tier.

Inclusion criteria for tier assignment

A finding is tiered above Insufficient Evidence only when all of the following hold:

  1. Audit-grade source. The evidence comes from a PRISMA + QUADAS-2 systematic review or meta-analysis, an original JAMA Rational Clinical Examination article, or a single primary diagnostic-accuracy study with full diagnostic test accuracy (DTA) reporting.
  2. Target alignment. The diagnosis studied in the source matches the clinical diagnosis on BedsideDx. A pool that targets a compound diagnosis (e.g., “shoulder impingement syndrome”) does not anchor a finding for a subtype (e.g., “subacromial bursitis alone”).
  3. Statistically informative. The reported LR’s 95% confidence interval excludes 1.0. When no CI is reported in a single-study source, the point estimate must be substantially distant from 1.0 to support a Limited tier.
  4. One source anchors one finding-diagnosis pair. A single paper’s pooled data is not extended to additional diagnoses or finding subtypes without independent validation.

Evidence categories

Each finding-diagnosis pair carries one of four evidence labels. The category reflects the rigor of the underlying evidence base — not the magnitude of the LR. A high LR from a single small study still rests on limited evidence.

  • Strong — pooled estimate from an audit-grade systematic review or meta-analysis, or an original JAMA Rational Clinical Examination article.
  • Moderate — anchored on a substantial single-cohort primary study, used when systematic reviews exist but do not separately pool that specific finding-diagnosis pair. Uncommon on the site.
  • Limited — a single primary diagnostic-accuracy study, or a defensible extraction of an individual study from within a higher-tier review.
  • Insufficient Evidence — included because clinicians traditionally look for the finding, but the literature search returned no source that met our inclusion criteria (no audit-grade synthesis exists, the available pool targets a different diagnosis, the reported CI crosses 1, or the only available sources are of a type we don’t use as tier anchors). These findings appear without a helpfulness label; they document the clinical context without making a statistical claim.

How the calculations work

Reported LRs. We report the LRs and 95% confidence intervals as published by the source. When the source reports sensitivity and specificity but not LRs, we derive LR+ = sensitivity / (1 − specificity) and LR− = (1 − sensitivity) / specificity. We do not run our own meta-analyses; pooling is done by the source synthesis.
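
The two derivation formulas can be written out directly. The following is a minimal sketch rather than the site's implementation: the function name `likelihood_ratios` is hypothetical, inputs are assumed to be proportions between 0 and 1, and the choice to raise an error when specificity is exactly 0 or 1 (where one of the ratios is undefined) is an assumption of this example.

```python
from typing import Tuple


def likelihood_ratios(sensitivity: float, specificity: float) -> Tuple[float, float]:
    """Derive (LR+, LR-) from sensitivity and specificity given as proportions."""
    for name, value in (("sensitivity", sensitivity), ("specificity", specificity)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")

    # Assumption of this sketch: refuse the edge cases instead of
    # returning infinity, because a denominator would be zero.
    if specificity in (0.0, 1.0):
        raise ValueError("LRs are undefined when specificity is exactly 0 or 1")

    lr_pos = sensitivity / (1.0 - specificity)   # LR+ = sens / (1 - spec)
    lr_neg = (1.0 - sensitivity) / specificity   # LR- = (1 - sens) / spec
    return lr_pos, lr_neg
```

With a sensitivity of 0.90 and a specificity of 0.80, this yields an LR+ of about 4.5 and an LR− of about 0.125.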

The probability calculator. The calculator takes your pre-test probability for a diagnosis and the LR for the test result you observed (Present, Absent, or unselected) and applies Bayes’ rule in odds form: the pre-test probability is converted to odds, the odds are multiplied by the likelihood ratio, and the resulting post-test odds are converted back to a probability. For reverse tests, the same multiplication correctly produces a lower post-test probability. The calculator declines to compute when the relevant CI crosses 1.0, since multiplying by a value already flagged as statistically null would manufacture false precision.
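
The odds-form update described here is short enough to show in full. This is a sketch under stated assumptions, not the calculator's source: the name `post_test_probability` is hypothetical, returning `None` stands in for "declines to compute", and a CI is treated as crossing 1.0 when 1.0 lies between its bounds (inclusive).

```python
from typing import Optional, Tuple


def post_test_probability(
    pre_test_prob: float,
    lr: float,
    ci: Optional[Tuple[float, float]] = None,
) -> Optional[float]:
    """Apply Bayes' rule in odds form; return None when the CI crosses 1.0."""
    if not 0.0 < pre_test_prob < 1.0:
        raise ValueError("pre-test probability must be strictly between 0 and 1")
    if lr < 0:
        raise ValueError("a likelihood ratio cannot be negative")

    # Assumption: decline to compute when the interval includes 1.0.
    if ci is not None and ci[0] <= 1.0 <= ci[1]:
        return None

    pre_odds = pre_test_prob / (1.0 - pre_test_prob)  # probability -> odds
    post_odds = pre_odds * lr                         # Bayes' rule on odds
    return post_odds / (1.0 + post_odds)              # odds -> probability
```

Starting from a pre-test probability of 20%, an LR of 10 moves the estimate to roughly 71%, while a reverse finding with an LR of 0.3 lowers it to roughly 7%. The same call with `ci=(0.8, 40.0)` returns `None`.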

Limitations we acknowledge

Spectrum effects. Diagnostic-accuracy studies are typically conducted in selected populations (referral centers, specific specialty clinics) and do not always translate to primary care or unselected emergency populations. When a study population differs substantially from likely point-of-care use, we note it.

Inter-rater reliability. Many physical exam findings have only fair-to-moderate inter-rater reliability (kappa 0.3–0.6), which limits the practical reproducibility of even the best LRs. The Babinski reflex, rhonchi, and the quality of the cardiac second sound are well-known examples.

Reference standard drift. What was a reasonable reference standard 30 years ago (e.g., chest radiograph for pleural effusion) may now be eclipsed by superior modalities (e.g., lung ultrasound). LR estimates measured against older reference standards may make an exam finding look more accurate than it would against the standards used in current practice.

Publication bias. Studies showing useful test characteristics are more likely to be published than studies with null or unfavorable results. We mitigate this by relying on Cochrane diagnostic test accuracy (DTA) reviews and JAMA Rational Clinical Examination reviews whenever possible, both of which actively search for unpublished and grey literature.
