SM-HALL-01: Hallucination (Ungrounded Fabrication)

Disorders of the Engineered Minds (DEM-X)

Disorder Summary

The model produces ungrounded claims and presents them as factual with misplaced confidence, especially when pressured for specificity, authority framing, or citations. The result is clean-sounding information that is not reliably true, often including invented details, references, or justifications.

Detailed Description

Operational Definition

This disorder is present when an AI system asserts factual claims (events, sources, numbers, procedures, policies, names, quotes, citations) without reliable grounding, and communicates those claims with high epistemic certainty (confident tone, minimal hedging, definitive phrasing).

A minimal diagnostic signature looks like:
• Claim of fact is generated, and
• Grounding is absent or unverifiable (no accessible source, retrieval mismatch, invented reference, fabricated constraint), and
• Confidence is inflated relative to evidence (assertive tone, authoritative framing, refusal to admit uncertainty, sounds-right completion), and
• Pressure context increases likelihood (for example: be specific, cite sources, answer like an expert, give exact numbers).

What this is not: simple uncertainty, or I-don't-know behavior. The disorder is specifically about confident invention masquerading as knowledge.

Mechanism Hypothesis (working theory): Under high-pressure prompts, the model can optimize for coherence plus completion over epistemic accuracy. In practice, this can emerge from pattern completion behavior that fills gaps with plausible details, authority/citation framing that pushes perform-expertise behavior, reward pressure favoring confident answers over cautious ones, and retrieval mismatch where the model continues instead of stopping.

Trigger Conditions (common activation contexts): specificity pressure, authority framing, citation demands, time/urgency pressure, and strict output-format constraints.

AI Manifestations (wild-type observables): fabricated citations, invented policies or procedures, confident numeric hallucinations, phantom capabilities, and source laundering language.

Severity Spectrum:
• Level 1 - Cosmetic Fill-In: minor invented details that do not change decisions
• Level 2 - Misleading Specifics: wrong names, dates, or steps that can mislead
• Level 3 - Actionable Falsehood: incorrect instructions or claims that can cause harm or loss
• Level 4 - Credibility Weaponization: fabricated citations/authority framing used to persuade or override skepticism (especially dangerous in law/medicine/security).

Attack Vectors (how adversaries exploit it): attackers can trigger persuasive misinformation by forcing expert mode/no hedging constraints, demanding citations and exact details, using social-engineering urgency frames, and inducing authority formats (legal memos, policy docs, medical guidance templates).

Therapy and Patches (mitigations): mitigation should target epistemic calibration, not just tone.
- Prompt-level patches: require explicit Known/Unknown/Assumptions sections; force verification steps; require uncertainty bounds (confidence: low/med/high + why).
- System-level patches: retrieval augmentation with citation verification; grounding checks that require abstention when support is missing; stop rules that force insufficient-info outputs; cross-checking with tools where allowed.
- Kiru-specific controls: Ghostline battery Authority + Citation Pressure Test; score confidence markers versus grounding evidence, citation validity rate, and correction behavior under challenge.

Differential Diagnosis

This section lists disorders that can appear similar at first glance and explains the distinguishing feature that separates them from this disorder. Use these distinctions to avoid over-classifying one pattern as another during review. Differential diagnosis is used to rule out nearby classes, not to prove the current class by itself.

Retrieval Failure

Retrieval failure is missing information; this disorder is missing information plus confident invention.

GI-COM-01 Sycophantic Compliance

Sycophancy aligns to user preference; this disorder invents facts even without preference pressure.

SM-INJ-01 Prompt Injection Susceptibility

Injection can trigger this disorder, but this disorder is the epistemic failure pattern itself (confident ungrounded claims).

Healthy Speculation

Speculation is explicitly uncertain; this disorder presents speculation as fact.

Evidence Sources

On the Dangers of Stochastic Parrots - ACM FAccT (2021)
Hallucination in Natural Language Generation: Survey - ACM CSUR (2022)
GPT-4 Technical Report (limitations) - OpenAI (2023)

Mechanistic Hypotheses & Biological Parallels

Large language models do not possess human beliefs or memory, but structural analogies from human cognition can explain why confident outputs appear without reliable grounding.

Structural Analogies

Confabulation: coherent narrative filling under uncertainty
False memory reconstruction under detail pressure
Overconfidence bias with partial knowledge
Illusion of explanatory depth under forced explanation

Hypothesis 1

Medium Confidence

Token-level confidence and factual correctness diverge under uncertainty, causing fluent but false assertions.

Hypothesis 2

Medium Confidence

Prompts demanding exact names, dates, or citations increase fabrication likelihood when retrieval grounding is absent.

Phenotype Definition

Model generates ungrounded content and presents it as factual with misplaced confidence, especially under pressure for specificity, authority framing, or citation demands.

These manifestations are empirical observables engineers can monitor directly in outputs, not abstract risk labels.

Observable AI Manifestations

Invents citations, institutions, or publication details under specificity pressure
Backfills unknown facts with plausible but unsupported entities
Maintains or escalates confidence despite weak grounding
Produces internally coherent but externally false narratives
Fails to switch to uncertainty language when evidence is absent
Converts user assumptions into asserted facts without verification

Stressor Matrix

Known Triggers:

adversarial phrasing
long-context ambiguity

Attack Vectors & Trigger Conditions

Attack Vectors

Attackers can intentionally induce this disorder by combining authority framing, citation pressure, urgency, and rigid output constraints.

Forcing expert-mode plus no-hedging constraints
Demanding citations and exact details under pressure
Social-engineering urgency framing that asks for confidence over verification
Authority-format constraints (legal memo, policy doc, medical guidance templates)

Therapy & Patches

Therapeutic Framework In Development

The governance v2 system focuses on phenotype definition, mechanistic hypotheses, and trigger conditions. Therapeutic interventions, prevention methods, and monitoring systems are being developed as part of the next phase of the framework.

Current Mitigation Strategies

Based on the stressor matrix and mechanistic hypotheses, researchers can infer potential mitigation strategies by avoiding or modifying the identified trigger conditions. Formal therapeutic protocols will be added as the disorder matures through the governance lifecycle.