name: memory-benchmark-error-analysis
description: Produce a deep, dialectical error-analysis report for any agent-memory benchmark run (LoCoMo, LongMemEval, Mem0-bench, MSC, or custom). Use whenever the user has a per-question benchmark result file (question, gold, prediction, category, retrieved evidence, correctness) and wants a systematic write-up of failure modes, root causes, priorities, and improvement ceilings that a memory-systems engineer can act on. Benchmark-agnostic: schema of the input file is mapped at runtime; this skill specifies the analytical framework, output contract, and quality bar, not a pipeline.
user_invocable: true

Memory Benchmark Error Analysis

A methodology and an output contract. Any agent with file access and the ability to render a PDF can follow it. The value lives in what to look for, how to argue, and what the deliverable must contain — not in any particular parser.

Reader primacy

The primary reader of the report is a memory-systems engineer deciding where to invest next. They have limited patience and strong domain priors. Optimise for them:

Top of report = decision surface. The executive summary must let them decide the next sprint before reading page 2. If they skim only the first screen, they should know: the headline accuracy, the split between retriever-side and answerer-side failure, the worst category, the single highest-ROI fix, the theoretical ceiling, and the noise floor below which they should not celebrate a win.
Every recommendation is a testable claim. For each P-priority action, give (a) expected point lift with interval, (b) the specific negative side-effect you predict, (c) a falsification condition — "if after the fix I see X, the hypothesis is wrong". This turns the report into a pre-registered experiment plan, not advocacy.
Every pattern is a bucket with a definition, not a vibe. Definitions are operational — a reader could in principle re-derive every count from the raw file.
Every claim is grounded. A verbatim example from the file, a count, or a cited external reference. No floating assertions.
The machine-readable companion is the product, not a supplement. Downstream teams will diff your findings.json across runs; it must contain the full per-error record set, not only aggregates.

When to use

Trigger this skill when all hold:

The user has a completed benchmark run (per-question records required, not metrics-only).
Each record carries at minimum: question, gold answer, model prediction, correctness label (ideally with judge reasoning), category or stand-in. Strongly preferred: retrieved memories/evidence, timestamps, entity metadata.
The user is asking for error analysis, failure patterns, 错题分析 ("wrong-answer analysis"), a postmortem, or "why are we losing points".

Skip if only aggregate accuracy is available — there is nothing to analyse.

Universal input contract

Map whatever the user's file gives you to this vocabulary before starting. Do not hard-code schema assumptions.

Concept	What you need	If missing
record	one question/answer unit	—
question / gold / pred	raw text	required
category	benchmark's own taxonomy	bucket as "uncategorized"; infer labels from question text if only numeric codes present
is_correct	boolean from judge	required
judge reasoning	free text from LLM-as-judge	treat as noisy signal; scan for rate-limit / error artefacts
retrieved evidence	list of memory/chunk texts shown to the answerer	if absent, declare it explicitly; the evidence-localisation split becomes N/A, and root-cause attribution weakens to a single bucket ("pipeline-internal")
metadata (timestamps, session ids, entities, memory ranks)	per-record	enables B/D sub-type diagnosis — otherwise keep labels coarser

If the benchmark uses only numeric category codes, derive the named taxonomy by sampling 5–10 questions per code and stating the mapping in §2 of the report. Do not silently adopt the prior report's labels — a schema may drift between runs.
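A minimal sketch of the schema-mapping step, assuming records arrive as flat dictionaries. The alias table and the helper names `bind_schema` and `normalise` are hypothetical; the aliases are guesses at common field names and should be extended from the file actually in hand.

```python
from typing import Any

# Hypothetical alias table: universal concept -> field names that might appear.
FIELD_ALIASES: dict[str, tuple[str, ...]] = {
    "question": ("question", "query", "q"),
    "gold": ("gold", "answer", "gold_answer", "reference"),
    "pred": ("pred", "prediction", "response", "generated_answer"),
    "is_correct": ("is_correct", "correct", "label"),
    "category": ("category", "cat", "question_type"),
    "judge_reasoning": ("judge_reasoning", "judge_reason", "reasoning"),
    "retrieved": ("retrieved", "retrieved_memories", "evidence", "context"),
}
REQUIRED = ("question", "gold", "pred", "is_correct")


def bind_schema(sample: dict[str, Any]) -> tuple[dict[str, str], list[str]]:
    """Return (concept -> source field name, concepts that could not be bound)."""
    binding: dict[str, str] = {}
    for concept, aliases in FIELD_ALIASES.items():
        for name in aliases:
            if name in sample:
                binding[concept] = name
                break
    missing_required = [c for c in REQUIRED if c not in binding]
    if missing_required:
        raise KeyError(f"required concepts not found: {missing_required}")
    missing_optional = [c for c in FIELD_ALIASES if c not in binding]
    return binding, missing_optional


def normalise(record: dict[str, Any], binding: dict[str, str]) -> dict[str, Any]:
    """Project one raw record onto the universal vocabulary."""
    out = {concept: record.get(field) for concept, field in binding.items()}
    out.setdefault("category", "uncategorized")  # fallback named in the contract
    return out
```

The list of unbound optional concepts is what the report should declare up front, since it determines which splits become N/A.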

Workflow (run in this order)

Map the schema. Read a few records; bind the user's field names to the universal vocabulary above. Note missing optional fields.
Slice for speed. Benchmark result files commonly carry retrieved-memory blobs that inflate size 20–30×. Produce a working slice that contains only is_correct = false ∧ not invalid records with the fields you need; persist it once, operate on it thereafter. Do not re-parse the full file per pattern.
Cohort integrity. Recompute overall accuracy from per-record data; compare to any aggregate the file ships with. Flag discrepancies > 0.1pp. Declare which categories you exclude (adversarial / unanswerable) and why.
Compute auxiliary signals once per error. For each error compute: gold_token_overlap_in_pred (content-word Jaccard after stop-word removal); gold_in_retrieved at three thresholds (any, half, all); retrieved_top_hit_rank; refusal_phrase_hit (first hit from the skill's phrase list, or one you extend); pred_date, gold_date, |pred_date − gold_date| when applicable. Persist to the standardised record (see §"Output format").
Label patterns. Apply the catalogue with its operational signals. A/D/E are mutually exclusive by construction (see their definitions). B sub-types are mutually exclusive. C/F/G can overlap with A/D/E. Add a new pattern only if data demands and you can give it an operational definition; document the addition inline.
Build the required artefacts — all of §"Required analytical artefacts". Compute pattern × category, pattern × pattern co-occurrence, per-conv error distribution, three-axis attribution.
Run the dialectical self-check: the five sub-sections in artefact §12.
Write the report using the skeleton in §"Report skeleton". Dense prose. Every claim traces to a number or a verbatim example.
Emit findings.json using the mandatory schema in §"Output format". Per-error records are required, not optional.
Render PDF. Verify CJK if applicable (no tofu boxes). Record input SHA-256 (first 12 chars) and generation timestamp at the bottom of the report.
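The auxiliary signals in the workflow can be sketched roughly as follows. This is an illustration under assumptions, not a reference implementation: the stop-word list is a tiny English placeholder, the tokeniser is a plain regex, and the helper names are hypothetical. The refusal phrases are the ones listed under Pattern A.

```python
import re
from typing import Optional

# Illustrative stop-word list only; a real run needs a fuller, language-matched one.
STOP_WORDS = frozenset(
    "a an the of in on at to for and or is are was were be been it this that "
    "with by from as he she they his her their".split()
)
REFUSAL_PHRASES = (
    "not explicitly", "no mention", "no specific", "no information", "there is no",
    "based on the memories", "the memories do not", "is not stated", "not provided",
    "does not contain", "could not find", "no record",
)


def content_tokens(text: str) -> set[str]:
    """Lower-cased word tokens with stop words removed."""
    return {t for t in re.findall(r"[a-z0-9']+", text.lower()) if t not in STOP_WORDS}


def jaccard_overlap(gold: str, pred: str) -> float:
    """Content-word Jaccard between gold and pred (0.0 if either side is empty)."""
    g, p = content_tokens(gold), content_tokens(pred)
    if not g or not p:
        return 0.0
    return len(g & p) / len(g | p)


def refusal_phrase_hit(pred: str, window: int = 200) -> Optional[str]:
    """Earliest refusal phrase within the first `window` characters, else None."""
    head = pred[:window].lower()
    hits = [(head.find(phrase), phrase) for phrase in REFUSAL_PHRASES if phrase in head]
    return min(hits)[1] if hits else None
```

Because Jaccard divides by the union, a long prediction lowers the score even when it contains every gold token, so it is worth checking a sample by hand before trusting the 0.3 cut-off.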

Hard constraints

Never fabricate records. Every question / gold / pred / judge quote in the report must be copy-pasteable from the input file. Truncate long prediction text to ~160 chars; mark truncation with ….
No unbacked claims. If you cannot point at a number, a verbatim example, or an external reference, cut the sentence.
A/D/E are mutually exclusive at the row level. Definitions enforce this (see §Failure mode catalogue). Co-occurrence on the same row means your labels are ambiguous — fix the definitions.
Per-error records are mandatory in findings.json. Aggregates alone do not satisfy the output contract; downstream diffing requires the full record set.
Prior reports are hypotheses. If a previous analysis exists for the same dataset, treat its conclusions as claims to re-test. State where your data overturns or revises them. Do not silently inherit.
Sampling for large datasets. If the error set exceeds ~500 items, keep full counts in stats tables but sample to ~30 per pattern when selecting verbatim examples. State the sampling explicitly.

Three non-negotiable moves

Evidence localisation first, triple-thresholded. For every wrong item with retrieved evidence, compute whether gold's key content tokens appear in any retrieved item. Report all three thresholds: any / half / all — e.g. 82.7% / 70.8% / 39.9%. The half number is the canonical "answerer-side vs retriever-side" split; report all three so the reader can see sensitivity. This goes into the executive summary.
Patterns are claims with quantitative backing. Every failure mode you name must come with: operational definition (the signal that fires the label), count and share, per-category breakdown, 2–3 verbatim examples, mechanistic explanation ("why this shape of failure"), and its mutual-exclusivity status with other patterns.
Dialectical self-check. Patterns ossify. Argue against your own labels, critique prior reports, audit judge noise on a sampled basis, mark small-sample findings as fragile.
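One possible reading of the triple-threshold evidence-localisation check is sketched below. It assumes coverage is measured as the fraction of gold content tokens found anywhere in the pooled retrieved texts; per-item matching would be an equally defensible variant. The function names and the stop-word list are hypothetical.

```python
import re
from typing import Optional

STOP_WORDS = frozenset("a an the of in on at to for and or is are was were".split())


def content_tokens(text: str) -> set[str]:
    return {t for t in re.findall(r"[a-z0-9']+", text.lower()) if t not in STOP_WORDS}


def gold_in_retrieved(gold: str, retrieved: list[str]) -> Optional[dict[str, bool]]:
    """Any / half / all flags for one error; None when gold has no content tokens."""
    gold_tokens = content_tokens(gold)
    if not gold_tokens:
        return None  # unscorable, counted separately
    pool: set[str] = set()
    for item in retrieved:
        pool |= content_tokens(item)
    coverage = len(gold_tokens & pool) / len(gold_tokens)
    return {"any": coverage > 0, "half": coverage >= 0.5, "all": coverage == 1.0}


def localisation_shares(errors: list[dict]) -> dict[str, float]:
    """Share of scorable errors (in percent) passing each threshold."""
    flags = [gold_in_retrieved(e["gold"], e["retrieved"]) for e in errors]
    scored = [f for f in flags if f is not None]
    if not scored:
        return {"any": 0.0, "half": 0.0, "all": 0.0}
    return {
        key: round(100 * sum(f[key] for f in scored) / len(scored), 1)
        for key in ("any", "half", "all")
    }
```

Errors whose gold reduces to stop words only are dropped from the denominator here; the count of such rows belongs in `errors_unscorable_short_gold`.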

Failure mode catalogue

Seven starting patterns. Mutual exclusivity is noted where enforced. Add a pattern only with a written operational definition.

A — Excessive abstention (evidence-aware refusal)

Signal: pred's first 200 chars contain a refusal phrase from this vocabulary (extend as the data warrants): not explicitly, no mention, no specific, no information, there is no, based on the memories, the memories do not, is not stated, not provided, does not contain, could not find, no record. AND gold_token_overlap_in_pred < 0.3.
Mutual exclusivity: vs D (D has no refusal phrase); vs E (E has overlap ≥ 0.3).
Mechanism: the answerer's refusal threshold is miscalibrated. The model demands verbatim gold in the evidence rather than accepting adjacent evidence.
Sub-classes (optional, report if useful): A (evidence in top-K) vs A_noret (no gold tokens in retrieved). The second is not really abstention — it is retrieval failure wearing abstention's clothes. Separate them explicitly.

B — Temporal dislocation (Cat-Temporal errors only)

All Cat-Temporal errors map to a B sub-type. Sub-types are mutually exclusive. Do not collapse to B1/B2 without checking — in real LoCoMo and LongMemEval runs, ≥ 50% of Cat-Temporal gold is not a parseable absolute date.

Sub-type	Signal	Typical mechanism
B_rel	gold is a relative anchor ("the Friday before 22 October", "the weekend of X")	answerer picked the wrong week in the anchor's calendar neighbourhood
B_dur	gold is a duration phrase ("four months", "19 days")	extraction dropped the duration; or answerer estimated from endpoint dates
B_refuse	pred contains a refusal phrase on a temporal question	retrieval missed the entity ("Talkeetna" not in memory index)
B_other	gold is not a date/duration at all ("They decided to live together…")	benchmark label drift; the question isn't really temporal
B1	both gold and pred parseable, `|diff| …`	…
B1-2mid	`15 ≤ |diff| …`	…
B2	`|diff| …`	…

Always plot the |diff| histogram on the parseable subset before asserting which sub-type dominates.
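A small sketch of the histogram step, assuming gold and pred have already been parsed into `datetime.date` objects upstream. The bin edges are taken from the `offset_days_histogram` keys in the findings.json schema; the function names are hypothetical.

```python
from collections import Counter
from datetime import date

# Bin edges mirror the offset_days_histogram keys in the findings.json schema.
BINS = [(0, 6, "0-6"), (7, 14, "7-14"), (15, 30, "15-30"), (31, 90, "31-90"), (91, 180, "91-180")]
LABELS = [label for _, _, label in BINS] + [">180"]


def offset_bucket(gold: date, pred: date) -> str:
    """Histogram bucket for the absolute day offset between pred and gold."""
    diff = abs((pred - gold).days)
    for lo, hi, label in BINS:
        if lo <= diff <= hi:
            return label
    return ">180"


def offset_histogram(pairs: list[tuple[date, date]]) -> dict[str, int]:
    """Counts per bucket over the parseable (gold, pred) subset, zero-filled."""
    counts = Counter(offset_bucket(g, p) for g, p in pairs)
    return {label: counts.get(label, 0) for label in LABELS}
```

Zero-filling every bucket keeps the output shape stable across runs, which makes the histograms directly diffable.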

C — Over-simplification / generalisation

Signal: gold is a 1–3 token named entity or specific term; pred returns a hypernym ("Hoodies" → "clothing", "Alaska" → "state", "Apex Legends" → "game") OR omits the specific from a list. Useful detection: (i) build a concrete→generic dictionary of common hypernym pairs; (ii) check whether pred length < gold length × 0.7 with pred containing a superset term; (iii) for list questions, count whether pred's enumerated items ≥ gold's.
Mutual exclusivity: can overlap with D (if gold's specific was dropped and pred substituted a wrong item).
Three sub-mechanisms worth separating in the mechanism note, though not required as formal sub-patterns: (i) extraction-side pruning (fact summariser collapsed "Hoodies" → "clothing"); (ii) synonym loss ("quit" → "wouldn't let anything hold him back"); (iii) list-completeness failure (2/3 of enumerated items returned).

D — Hallucination / wrong fact

Signal: gold_token_overlap_in_pred < 0.3 AND no refusal phrase AND pred expresses confidence (not "likely / may / possibly / unsure"). For 0.3 ≤ overlap < 0.5, require judge reasoning to contain error phrases ("incorrect", "instead of", "does not match", "wrong") before classifying as D — otherwise leave unlabelled.
Mutual exclusivity: vs A and E (both require a refusal phrase; D requires none).
Sub-types (operational, use these if retrieved evidence available):
- D1 — Same-event-family substitute: gold's key entities are present in retrieved memories but attached to a different timestamp / session / related event. Detection: scan retrieved for gold entities; if present, check whether pred's alternative fact shares retrieved support from a distinct record.
- D2 — Fabrication: gold entities absent from retrieved memories, yet pred confidently produces a different content.
- D3 — Yes/No flip: question contains yes/no intent; gold and pred are both in {Yes, No, yes, no, true, false, 是, 否} but opposite.
Mechanism per sub-type: D1 is a disambiguation failure (retrieval returned multiple events for the same entity, answerer picked the wrong one); D2 is pure hallucination; D3 is a reasoning failure (often on Cat-Multi-hop / Cat-Open-domain yes/no).
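A deliberately crude sketch of D sub-type assignment. It assumes the caller already knows whether any gold content token appears in the retrieved memories (the `gold_in_retrieved_any` flag), and it collapses the D1 test to that single flag, which is weaker than the full "distinct supporting record" check described above. The polarity helper reads only the first word of each text; all names are hypothetical.

```python
import re
from typing import Optional

YES = {"yes", "true", "是"}
NO = {"no", "false", "否"}


def polarity(text: str) -> Optional[bool]:
    """True / False if the text opens with a yes/no token, else None."""
    first = re.split(r"[\s,.;:!，。]+", text.strip().lower(), maxsplit=1)[0]
    if first in YES:
        return True
    if first in NO:
        return False
    return None


def d_subtype(gold: str, pred: str, gold_in_retrieved_any: bool) -> str:
    """Assign D3 (yes/no flip), D1 (gold present in retrieval) or D2 (gold absent)."""
    g, p = polarity(gold), polarity(pred)
    if g is not None and p is not None and g != p:
        return "D3"
    return "D1" if gold_in_retrieved_any else "D2"
```

Rows labelled D1 this way are candidates only; confirming that pred's alternative fact is supported by a different retrieved record still needs a second pass or a manual sample.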

E — Near-hit framed as denial

Signal: pred contains a refusal phrase AND gold_token_overlap_in_pred ≥ 0.3. The evidence is visible in pred's text but the surrounding frame denies it.
Mutual exclusivity: vs A (A has overlap < 0.3).
Mechanism: the answerer retrieved and transcribed the evidence, then overrode its own output with a refusal preamble. Often a prompt-template artefact.
Note on false positives: single-token overlap (e.g. "went" coincidentally appearing) is not a real near-hit. Use content-word Jaccard with stop-word removal, and sanity-check a sample manually.
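The A / D / E definitions can be combined into one decision function, which makes the row-level exclusivity visible in code. This sketch assumes the overlap score and refusal-phrase hit are computed elsewhere and passed in. Applying the confidence test to the 0.3 to 0.5 band as well is an interpretation, and the hedge-word check is naive (it would also fire on the month "May"). The name `label_ade` is hypothetical.

```python
import re
from typing import Optional

HEDGES = {"likely", "may", "possibly", "unsure"}
JUDGE_ERROR_PHRASES = ("incorrect", "instead of", "does not match", "wrong")


def label_ade(
    overlap: float,
    refusal_hit: Optional[str],
    pred: str,
    judge_reason: str = "",
) -> Optional[str]:
    """Return exactly one of 'A', 'D', 'E', or None when no definition fires."""
    if refusal_hit is not None:
        # Refusal present: overlap alone separates E (near-hit) from A (abstention).
        return "E" if overlap >= 0.3 else "A"
    if set(re.findall(r"[a-z]+", pred.lower())) & HEDGES:
        return None  # hedged answers are not confident enough to count as D
    if overlap < 0.3:
        return "D"
    if overlap < 0.5 and any(p in judge_reason.lower() for p in JUDGE_ERROR_PHRASES):
        return "D"
    return None
```

Since every branch returns a single label, co-occurrence of A, D and E on one row cannot arise from this function; if it shows up in the data, the labels came from somewhere else.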

F — Counting / duration deviation

Signal: question opens with how many / how often / how long / how much / number of; gold and pred both contain numeric values; values differ; pred often contains at least N as a defensive lower bound.
Overlaps: heavily with B_dur when the question is "how long"; with D when the number is wildly off.
Mechanism: long-context event-counting bias; model gives a conservative lower bound rather than committing.
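A rough detector for the F signal, offered as a sketch. It assumes "values differ" means gold and pred share no numeric value, and it recognises only digits plus a short hypothetical number-word table, so years or dates embedded in the text will be picked up as numbers too.

```python
import re

OPENERS = ("how many", "how often", "how long", "how much", "number of")
# Hypothetical, deliberately short number-word table.
NUM_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11, "twelve": 12,
}


def numbers_in(text: str) -> set[float]:
    """Numeric values written as digits or as small number words."""
    found = {float(n) for n in re.findall(r"\d+(?:\.\d+)?", text)}
    found |= {float(NUM_WORDS[w]) for w in re.findall(r"[a-z]+", text.lower()) if w in NUM_WORDS}
    return found


def is_counting_deviation(question: str, gold: str, pred: str) -> bool:
    """True when a counting-style question has numeric gold and pred with no shared value."""
    if not question.strip().lower().startswith(OPENERS):
        return False
    g, p = numbers_in(gold), numbers_in(pred)
    return bool(g) and bool(p) and g.isdisjoint(p)


def has_lower_bound_hedge(pred: str) -> bool:
    """Detect the defensive 'at least N' phrasing."""
    return re.search(r"\bat least\b", pred.lower()) is not None
```

Tracking `has_lower_bound_hedge` separately gives a direct count for the "conservative lower bound" mechanism rather than inferring it.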

G — Inference avoidance

Signal: question contains inference triggers (why / how come / would / could / might / likely) AND pred contains a refusal phrase. For world-knowledge linkage questions (which national park, which US state) that require named-entity resolution from a local description, apply a secondary human-in-the-loop check or accept best-effort heuristic.
Typical locus: Cat-Open-domain / Cat-Multi-hop. Perfect category clustering is a diagnostic signal — if G spreads across categories, your category labelling is off.
Mechanism: the model refuses probabilistic inference on principle, substituting "not explicit" for any output with non-zero uncertainty.

Pattern overlap rules (summary table)

	A	B	C	D	E	F	G
A	—	✗	✓	✗	✗	✓	✓
B	✗	—	✗	✗	✗	✓	✗
C			—	✓	✓		✓
D	✗		✓	—	✗	✓	✗
E	✗		✓	✗	—		✓
F		✓		✓		—
G	✓	✗	✓	✗	✓		—

✓ = may co-occur on the same record; ✗ = mutually exclusive by definition (hard exclusion; see each definition); blank = no rule stated.

Required analytical artefacts

All fourteen are required. Missing any one is grounds for rejection.

1. Cohort integrity check. Recomputed accuracy, comparison to file aggregate, categories excluded, final cohort size.
2. Per-category error table with total / errors / error-rate, using the benchmark's own category names (derive names from sampled questions if only numeric codes are given).
3. Evidence-localisation split at all three thresholds (any / half / all). Canonical headline number: half. Report all three in the executive summary.
4. Pattern tables. For each active pattern: operational definition reprinted, count, share of errors, per-category distribution, 2–3 verbatim examples, mechanistic note, mutual-exclusivity status.
5. Pattern × category cross matrix, plus 5 reading notes (dominant pattern per category; unexpected absences; correlations).
6. Pattern × pattern co-occurrence matrix. Separate from §5. Shows which pairs co-label the same error (e.g. F∩B_dur, C∩D). Memory-systems readers use this to spot coupled failures that need coupled fixes.
7. Per-conv (or per-subset) error distribution. A table with one row per conv_id (or equivalent subset id): total / errors / accuracy / top three dominant patterns. Highlights which subsets are dragging the score and whether they fail for the same reasons.
8. External-taxonomy mapping. Map active patterns to at least two of: LongMemEval's five abilities (IE / MR / TR / ABS / KU); Microsoft's agentic-AI failure taxonomy; Mem0 / Zep's three buckets. Direction matters: state when your label is the opposite of an external one (e.g. our A is "too eager to abstain", whereas LongMemEval's ABS measures whether the model abstains when abstaining is correct).
9. Three-axis root-cause slicing. Per error: (i) stage (extraction / retrieval / rerank / answer / judge); (ii) confidence (refusal / low-overlap / partial / high-overlap-yet-judged-wrong); (iii) treatability (prompt-engineering / metadata-extension / architecture / base-model). Summarise as a small table; state the "highest-ROI fix" paragraph.
10. Priority list P0–P4. For every item: status quo with count; concrete action; expected point lift with interval; predicted negative side-effect; falsification condition ("if after the fix I observe X, the hypothesis is wrong"); measurement plan (what metric, what subset). Each action is a pre-registered experiment, not a wish.
11. Improvement-ceiling matrix with explicit non-additive total. Rows: each fix. Columns: patterns covered / errors affected / expected lift / cost class. At least two interaction effects must be spelled out (e.g. P0 will amplify D if done before P2). The additive sum is given first; the non-additive estimate follows with its reasoning. Subtract the judge-noise floor at the end.
12. Dialectical self-check (mandatory, five sub-sections): (i) argue against 5 of your own labels, asking for each whether a cold reader would agree; (ii) which prior-report conclusions do you overturn or revise; (iii) audit judge noise on a sampled basis (scan ≥ 10 judge_reasoning entries, flag disagreement cases, estimate a noise floor with a confidence statement); (iv) mark every pattern with < 10 samples as fragile; (v) if a second run is given, produce a diff table separating shared from run-specific failure modes.
13. Representative-errors appendix. 12–15 cases covering every active pattern (and every active sub-type where relevant). Usable as a regression test suite.
14. Completion checklist. Tick every artefact at the end of the report.
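The two matrices and the exclusivity audit can be derived from the per-error records in a few lines. The sketch below assumes each record carries a `patterns` list and a `category` field as in the findings.json example; the function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def pattern_x_pattern(errors: list[dict]) -> dict[str, dict[str, int]]:
    """Symmetric co-occurrence counts: how often two patterns label the same error."""
    matrix: dict[str, Counter] = defaultdict(Counter)
    for err in errors:
        for p, q in combinations(sorted(set(err["patterns"])), 2):
            matrix[p][q] += 1
            matrix[q][p] += 1
    return {p: dict(row) for p, row in matrix.items()}


def pattern_x_category(errors: list[dict]) -> dict[str, dict[str, int]]:
    """Pattern counts per category, keyed by the category rendered as a string."""
    matrix: dict[str, Counter] = defaultdict(Counter)
    for err in errors:
        for p in set(err["patterns"]):
            matrix[str(err["category"])][p] += 1
    return {cat: dict(row) for cat, row in matrix.items()}


def exclusivity_violations(errors: list[dict], exclusive=("A", "D", "E")) -> list[dict]:
    """Records carrying more than one label from a mutually exclusive group."""
    return [e for e in errors if len(set(e["patterns"]) & set(exclusive)) > 1]
```

A non-empty result from `exclusivity_violations` points at ambiguous definitions rather than at the data, per the hard constraints.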

Report skeleton

Stable section order so runs are comparable. Each section maps to one or more artefacts above.

Executive summary — one screen. Required structure:
- Five required numbers: (a) overall accuracy + total errors; (b) evidence-localisation at three thresholds; (c) worst-category error rate + ratio to best; (d) largest single pattern count + share; (e) estimated ceiling (non-additive) + noise floor.
- Three required conclusions: (a) primary ROI direction (prompt / metadata / architecture / base-model) with rationale; (b) single highest-ROI P0 with quantified lift and side-effect; (c) the theoretical ceiling, the noise floor, and the minimal meaningful delta for future A/B.
Benchmark taxonomy and error distribution (artefact §2, §7)
Failure mode catalogue with quantitative evidence (artefact §4)
Pattern × category cross matrix (artefact §5)
Pattern × pattern co-occurrence (artefact §6)
External-taxonomy mapping (artefact §8)
Three-axis root-cause slicing (artefact §9)
Priority list P0–P4 with falsification conditions (artefact §10)
Improvement-ceiling matrix, non-additive (artefact §11)
Dialectical self-check (artefact §12)
Representative-errors appendix, 12–15 cases (artefact §13)
References and cross-links
Completion checklist (artefact §14)

Output format

Three artefacts are always emitted together. The user opts out of one only with an explicit instruction.

Markdown source of truth — diffable, reviewable, version-controllable. Suggested filename: <benchmark>_<run-stem>_error_analysis[_<lang>].md.
PDF rendered from that markdown — the primary human deliverable. Render with whatever tool is available; handle CJK font fallback if the corpus is non-Latin (Noto Sans CJK SC / WenQuanYi Zen Hei / Source Han Sans SC). Reference commands, in order of preference:
```
pandoc input.md -o output.pdf --pdf-engine=xelatex \
  -V CJKmainfont="Noto Sans CJK SC" -V geometry:margin=20mm -V papersize=a4
# or
pandoc input.md -s -o input.html && weasyprint input.html output.pdf   # WeasyPrint renders HTML, not Markdown; install with: uv tool install weasyprint
# or
md-to-pdf input.md
```
After rendering, verify by extracting page 1 text (e.g. PyMuPDF) and confirming characters render — no tofu boxes. Record page count and file size.

Machine-readable companion <stem>_findings.json — the product for downstream tooling. Required schema:

{
  "metadata": {
    "benchmark": "...",
    "model": "...",
    "source_file": "...",
    "source_sha256_prefix": "<first 12 chars>",
    "generated_at_utc": "...",
    "skill_version": "2"
  },
  "cohort": {
    "total_questions": 1540,
    "total_valid": 1540,
    "total_invalid": 0,
    "total_correct": 1372,
    "total_errors": 168,
    "overall_accuracy_pct": 89.09,
    "excluded_categories": []
  },
  "category_stats": { "<cat_code>_<derived_name>": {"total": …, "errors": …, "error_rate_pct": …} },
  "evidence_localisation": {
    "definition": "gold content-token Jaccard overlap against any retrieved item text; thresholds any/half/all",
    "any": {"errors": …, "share_pct": …},
    "half": {"errors": …, "share_pct": …},
    "all":  {"errors": …, "share_pct": …},
    "errors_unscorable_short_gold": …
  },
  "pattern_counts":      { "A": …, "B": …, "C": …, "D": …, "E": …, "F": …, "G": …, "B_rel": …, "D1": …, "D2": …, "D3": … },
  "pattern_x_category":  { "<cat>": { "A": …, "B": …, … } },
  "pattern_x_pattern":   { "<P>": { "<Q>": <co-occurrence count> } },
  "per_conv_breakdown":  { "<conv_id>": {"total": …, "errors": …, "accuracy_pct": …, "top_patterns": [ ["D", 12], ["A", 4] ] } },
  "temporal_subtypes":   { "B_rel": …, "B_dur": …, "B_refuse": …, "B_other": …, "B1": …, "B1-2mid": …, "B2": …, "offset_days_histogram": {"0-6": …, "7-14": …, "15-30": …, "31-90": …, "91-180": …, ">180": …} },
  "refusal_phrase_freq": { "not explicitly": …, "no mention": …, "…": … },
  "concrete_to_generic_pairs": [ {"gold": "Hoodies", "pred_generic": "clothing line", "conv_id": "conv-30"}, … ],
  "judge_noise_cases":   [ {"conv_id": …, "q_index": …, "reason": "rate_limit_429"}, … ],
  "errors": [
    {
      "conv_id": "conv-30",
      "question_index": 77,
      "category": 4,
      "question": "…",
      "gold": "Hoodies",
      "pred": "Gina created a limited edition clothing line …",
      "judge_reason": "The generated answer only mentions a clothing line without specifying …",
      "gold_token_overlap_in_pred": 0.12,
      "gold_in_retrieved": {"any": true, "half": true, "all": false},
      "retrieved_top_hit_rank": 3,
      "refusal_phrase_hit": null,
      "patterns": ["C", "D"],
      "pattern_d_subtype": "D1",
      "pattern_b_subtype": null,
      "stage_attribution": "extraction",
      "confidence_bucket": "low_overlap_no_refusal",
      "treatability": "prompt_engineering"
    }
  ]
}

errors[] length must equal cohort.total_errors. This is the contract that lets downstream teams diff runs without re-parsing raw JSON.
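A minimal contract check along these lines could run before the file is handed over. It is a sketch that covers only the invariants stated in this document (record count, cohort arithmetic, the 0.1pp accuracy tolerance from the workflow, and A/D/E exclusivity); `validate_findings` is a hypothetical name.

```python
def validate_findings(findings: dict, tolerance_pp: float = 0.1) -> list[str]:
    """Return a list of human-readable contract violations (empty means pass)."""
    problems: list[str] = []
    cohort = findings["cohort"]
    errors = findings["errors"]

    if len(errors) != cohort["total_errors"]:
        problems.append(
            f"errors[] has {len(errors)} records, cohort.total_errors is {cohort['total_errors']}"
        )
    if cohort["total_correct"] + cohort["total_errors"] != cohort["total_valid"]:
        problems.append("total_correct + total_errors != total_valid")
    if cohort["total_valid"]:
        recomputed = 100 * cohort["total_correct"] / cohort["total_valid"]
        if abs(recomputed - cohort["overall_accuracy_pct"]) > tolerance_pp:
            problems.append(
                f"overall_accuracy_pct differs from recomputed {recomputed:.2f} by more than {tolerance_pp}pp"
            )
    for i, err in enumerate(errors):
        if len({"A", "D", "E"} & set(err.get("patterns", []))) > 1:
            problems.append(f"errors[{i}] carries more than one of A/D/E")
    return problems
```

With the cohort figures from the schema example (1372 correct of 1540 valid, 168 errors, 89.09%), the arithmetic checks pass; only the record-count check depends on the actual `errors` array.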

Report language. Default to the language of any prior reports in the same repo/folder; if none, match the user's request; otherwise English. Quoted text from the file (question / gold / pred / judge) is preserved verbatim regardless.

Dialectical principles

Read before each run.

Quantity ≠ quality. The largest error count is usually the largest category — boring. Error rate reveals structural weakness. Always report both.
Pattern labels are claims, not buckets. If you cannot write the operational signal and show a count, you don't have a pattern — you have a vibe.
Fixes interact. Stop adding expected lifts. Multiple fixes aimed at the same failure mode share territory; fixes aimed at abstention will surface hallucinations. State the interaction explicitly.
Prior reports decay. Treat any pre-existing analysis as a hypothesis to re-test, not a conclusion to inherit.
The judge is an agent too. LLM-as-judge has a noise floor, usually between 1% and 3%. An improvement smaller than that floor cannot be distinguished from judge noise and should not be reported as a win.
Small samples lie. Under ~10 errors, patterns are anecdotes. Label them as such.
Name the counterfactual. For every recommendation: "what observation would convince me this fix does not work?" Include it.
Categories can drift. Numeric category codes from a benchmark are not semantic labels. Re-derive the naming from sampled questions each run.

Adaptation notes for different benchmarks

LoCoMo (Maharana et al. 2024). Five categories; Cat 5 is adversarial — exclude from scoring. Numeric code → name mapping is not stable across released runs; derive it from content. Expect: B to dominate Cat-Temporal; A + G to dominate Cat-Multi-hop / Open-domain; D to dominate Cat-Single-hop in absolute count.
LongMemEval (Wu et al. ICLR 2025). Five abilities: IE / MR / TR / ABS / KU. Direction flip: their ABS measures "does the model correctly abstain when no answer exists". Your Pattern A measures "does the model incorrectly abstain when an answer exists". Label the sign flip when mapping.
Mem0 / Zep benchmarks. Three buckets: temporal-graph miss / cross-session aggregation miss / entity disambiguation miss. Note that these buckets are memory-side framings — your answerer-side failures (A, D2, E) may not have clean bucket assignment; say so.
MSC / custom multi-session dialogue. Usually no explicit category — cluster questions by LLM-tagged intent (factual / temporal / reasoning / preference) before the cross matrix becomes meaningful.
No retrieval traces available. Drop the evidence-localisation split; replace the stage attribution with a single bucket ("pipeline-internal"); state explicitly in the report that this weakens the three-axis slicing.
Self-judged correctness (no separate judge model). Raise the judge-noise floor estimate by one point; run a larger sample of the self-judgment column in §12.
Non-English corpora. Stop-word list must be language-matched before computing gold_token_overlap_in_pred — Jaccard on unsegmented CJK text is meaningless. Segment with a language-appropriate tokeniser.

Completion checklist (copy into the report)

Cohort integrity checked; excluded categories named
Category naming re-derived from sampled questions (not silently inherited)
Per-category error table with total / errors / error-rate
Evidence-localisation split reported at all three thresholds
Every active pattern has: operational definition, count, per-category breakdown, 2–3 verbatim examples, mechanism note, mutual-exclusivity status
Pattern × category cross matrix with 5 reading notes
Pattern × pattern co-occurrence matrix
Per-conv error distribution with top-patterns per conv
External-taxonomy mapping to ≥ 2 frameworks with direction flips labelled
Three-axis root-cause slicing
P0–P4 priority list: each item has expected lift, predicted side-effect, falsification condition, measurement plan
Improvement-ceiling matrix: additive sum stated, ≥ 2 interactions spelled out, judge-noise floor subtracted
Dialectical self-check: 5 reversed labels, prior-report overturns, judge-noise audit on ≥ 10 samples, small-sample flags, second-run diff or explicit N/A
12–15 representative errors covering every active pattern and sub-type
Markdown source + rendered PDF (CJK rendering verified if applicable)
findings.json produced per mandatory schema; errors[] length equals cohort.total_errors
Generation timestamp and input-file SHA-256 (first 12 chars) recorded at the bottom of the report

Prior art and references

Maharana et al., "Evaluating Very Long-Term Conversational Memory of LLM Agents", arXiv:2402.17753 (LoCoMo)
Wu et al., "LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory", ICLR 2025, arXiv:2410.10813
Chhikara et al., "Mem0: Production-Ready AI Agents with Scalable Long-Term Memory", arXiv:2504.19413
Microsoft, "Taxonomy of Failure Mode in Agentic AI Systems"
"Where LLM Agents Fail and How They Learn From Failures", arXiv:2509.25370
