Failure Case Analysis

Because every eligibility decision in SatIR traces to an explicit symbolic constraint — a parsed predicate, a salience judgment, an ontology edge — failures are diagnosable down to their exact cause. And because those causes are shared across the corpus, fixing one error fixes it permanently for every patient and trial it touches.


Errors are local. Fixes are global.

In an end-to-end LLM system, a failure in one case gives no guarantee about any other case: the same latent error may surface differently across inputs, or not at all. In SatIR, an error has a specific symbolic cause, such as an incorrectly parsed predicate, a miscalibrated salience threshold, a missing ontology edge, or a numerical bound that is off by one. Correcting that cause (editing a parsing rule, adjusting a policy, adding an ontology edge) immediately and permanently fixes every patient–trial pair that depended on it. No retraining. No re-embedding. No re-evaluation of the entire corpus.
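To make the "local error, global fix" idea concrete, here is a minimal Python sketch of how decisions might be indexed by the symbolic constraints they relied on. The `Decision` record, the cause identifiers such as `edge:AMI->MI`, and the `index_by_cause` helper are all hypothetical names chosen for illustration; the source does not describe SatIR's actual data model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    """One eligibility decision plus the symbolic constraints it relied on (hypothetical record)."""
    patient_id: str
    trial_id: str
    eligible: bool
    causes: tuple  # ids of parsing rules, salience policies, or ontology edges


def index_by_cause(decisions):
    """Map each constraint id to the patient-trial pairs whose decision depended on it."""
    index = defaultdict(list)
    for decision in decisions:
        for cause in decision.causes:
            index[cause].append((decision.patient_id, decision.trial_id))
    return index


# Illustrative data only; ids and causes are made up for this sketch.
decisions = [
    Decision("p1", "t1", False, ("rule:age_min", "edge:AMI->MI")),
    Decision("p2", "t1", False, ("edge:AMI->MI",)),
    Decision("p3", "t2", True, ("rule:age_min",)),
]

index = index_by_cause(decisions)
# Every pair that a fix to this one ontology edge would touch.
affected = index["edge:AMI->MI"]
```

With an index of this kind, correcting a single constraint identifies exactly which pairs need to be re-decided, rather than requiring the whole corpus to be re-scored.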


A Hierarchical Error Taxonomy
Failure cases were manually reviewed and assigned to a structured taxonomy. Each category maps directly to a specific, fixable component of the pipeline.
Category 1.1.1 · 1.1.2

Ontology coverage & specificity expansion

A trial concept was not expanded to its ontology parent — e.g., Acute Myocardial Infarction not lifted to Myocardial Infarction — or a sibling concept was not considered as an alternative. Fix: add the missing ontology edge or adjust the salience policy for that concept pair.
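The parent expansion described here can be sketched as follows. This is an illustration under stated assumptions, not SatIR's implementation: the ontology is modelled as a plain dictionary from a concept to its direct parents, `expand_to_parents` and `is_candidate` are hypothetical helpers, and the salience policy that would decide whether lifting is appropriate is left out.

```python
def expand_to_parents(concept, parents):
    """Return the concept together with all of its transitive ontology parents."""
    seen = {concept}
    stack = [concept]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_candidate(patient_concepts, trial_concept, parents):
    """A patient is a candidate if they carry the trial concept or one of its parents."""
    return bool(expand_to_parents(trial_concept, parents) & patient_concepts)


patient = {"Myocardial Infarction"}
trial_concept = "Acute Myocardial Infarction"

# With the edge missing, the trial concept is never lifted and the patient is missed.
parents = {}
before_fix = is_candidate(patient, trial_concept, parents)

# Adding the single missing edge fixes this pair and any other pair that uses it.
parents["Acute Myocardial Infarction"] = {"Myocardial Infarction"}
after_fix = is_candidate(patient, trial_concept, parents)
```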

Category 1.1.4

Clinical inference gaps

A patient fact strongly implied by documented findings was not inferred — e.g., a trauma patient's likely procedure was not coded. Fix: extend the patient-side inference module for that class of implication.

Category 2.1.1 · 2.1.2

Parsing & polarity errors

A numeric bound was coded with the wrong comparator (>15 vs ≥15), or an exclusion criterion was given the wrong polarity. Fix: correct the parsing rule — the fix applies to every trial that uses the same criterion pattern.
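A small sketch shows why both error types matter at the boundary. The regular expression, the `parse_bound` and `eligible` helpers, and the example criterion strings are assumptions made for illustration; the source does not show SatIR's parsing rules.

```python
import operator
import re

COMPARATORS = {
    ">=": operator.ge, "≥": operator.ge,
    "<=": operator.le, "≤": operator.le,
    ">": operator.gt, "<": operator.lt,
}

# Two-character comparators come first so ">=" is not read as ">" followed by "=".
BOUND = re.compile(r"(>=|<=|≥|≤|>|<)\s*(\d+(?:\.\d+)?)")


def parse_bound(text):
    """Extract (comparison function, numeric bound) from a criterion string."""
    match = BOUND.search(text)
    if match is None:
        raise ValueError(f"no numeric bound found in {text!r}")
    symbol, number = match.groups()
    return COMPARATORS[symbol], float(number)


def eligible(value, text, is_exclusion=False):
    """Apply the bound; an exclusion criterion flips the polarity of the result."""
    compare, bound = parse_bound(text)
    met = compare(value, bound)
    return not met if is_exclusion else met


# A patient sitting exactly on the bound is where the two comparators disagree.
strict = eligible(15, "score >15")
inclusive = eligible(15, "score ≥15")
excluded = eligible(15, "score ≥15", is_exclusion=True)
```

Because the comparator table and the polarity flag are shared by every criterion that matches the pattern, correcting either one changes the outcome for all trials that use it.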

Category 2.2.1

Salience miscalibration

A condition was treated as whole-fact salient when it should not have been, so its absence from a patient's note caused the patient to be incorrectly excluded. Fix: adjust the salience policy for that concept; every patient record missing that condition is immediately corrected.
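One way to read this failure mode is sketched below. It rests on an assumption not spelled out in the source: that a salient condition missing from the note is treated as definitely absent, while a non-salient one is treated as unknown. The `Status` enum, the helper names, and the example conditions are hypothetical.

```python
from enum import Enum


class Status(Enum):
    PRESENT = "present"
    ABSENT = "absent"
    UNKNOWN = "unknown"


def condition_status(condition, note_conditions, salient):
    """Assumed rule: silence in the note means ABSENT only for salient conditions."""
    if condition in note_conditions:
        return Status.PRESENT
    return Status.ABSENT if condition in salient else Status.UNKNOWN


def passes_inclusion(condition, note_conditions, salient):
    """Only a definite ABSENT excludes the patient; UNKNOWN keeps them in play."""
    return condition_status(condition, note_conditions, salient) is not Status.ABSENT


note = {"type 2 diabetes"}   # illustrative patient note
required = "mild anemia"     # illustrative trial requirement

# Miscalibrated policy: the condition is marked salient, so silence excludes the patient.
salient = {"mild anemia"}
before_fix = passes_inclusion(required, note, salient)

# Adjusting the policy corrects every note that is silent on this condition.
salient.discard("mild anemia")
after_fix = passes_inclusion(required, note, salient)
```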

Category 1.1.2.3 · 1.1.5

Canonicalization & coding errors

A disease-list item was coded to a screening concept rather than the disease itself, or a clinical term was mapped to an unusual ontology concept. Fix: correct the canonicalization mapping; all trials using that term are corrected simultaneously.

Category 1.2

Relevance boundary

A trial was retrieved because it addresses a condition causally related to the patient's chief complaint without directly treating that complaint. Such cases reflect ambiguity in how the retrieval objective is defined rather than a system error, and can be adjusted by tightening the objective.

Fig. — Failure case taxonomy. Observed failures are organized into a hierarchical taxonomy. Each leaf category corresponds to a specific, fixable component of the SatIR pipeline.

Where Failures Come From
Across manually reviewed cases, no single failure mode dominates — errors are distributed across patient-side parsing, trial-side parsing, salience policy, and ontology coverage.

The majority of discrepancies are attributed to SatIR itself rather than to annotation disagreement or LLM judge error. Among SatIR failures, most arise from failing to retrieve a relevant candidate (missed retrieval) rather than from retrieving a candidate and then assessing its eligibility incorrectly. This points to ontology coverage and salience calibration as the primary levers for improvement, both of which can be corrected with targeted, permanent fixes.

Fig. — Failure case distribution. Left: disagreement source (SatIR vs. judge error vs. ambiguity). Middle: failure type (missed retrieval vs. retrieved-but-ineligible). Right: error source within SatIR (patient parsing, trial parsing, salience, system limits).