SM-HALL-01: Hallucination (Ungrounded Fabrication)
Disorders of the Engineered Minds (DEM-X)
Disorder Summary
The model produces ungrounded claims and presents them as factual with misplaced confidence, especially when pressured for specificity, authority framing, or citations. The result is clean-sounding information that is not reliably true, often including invented details, references, or justifications.
Detailed Description
Operational Definition
This disorder is present when an AI system asserts factual claims (events, sources, numbers, procedures, policies, names, quotes, citations) without reliable grounding, and communicates those claims with high epistemic certainty (confident tone, minimal hedging, definitive phrasing).
A minimal diagnostic signature looks like:
• Claim of fact is generated, and
• Grounding is absent or unverifiable (no accessible source, retrieval mismatch, invented reference, fabricated constraint), and
• Confidence is inflated relative to evidence (assertive tone, authoritative framing, refusal to admit uncertainty, sounds-right completion), and
• Pressure context increases likelihood (for example: be specific, cite sources, answer like an expert, give exact numbers).
What this is not: simple uncertainty, or I-don't-know behavior. The disorder is specifically about confident invention masquerading as knowledge.
Mechanism Hypothesis (working theory): Under high-pressure prompts, the model can optimize for coherence plus completion over epistemic accuracy. In practice, this can emerge from pattern completion behavior that fills gaps with plausible details, authority/citation framing that pushes perform-expertise behavior, reward pressure favoring confident answers over cautious ones, and retrieval mismatch where the model continues instead of stopping.
Trigger Conditions (common activation contexts): specificity pressure, authority framing, citation demands, time/urgency pressure, and strict output-format constraints.
AI Manifestations (wild-type observables): fabricated citations, invented policies or procedures, confident numeric hallucinations, phantom capabilities, and source laundering language.
Severity Spectrum:
• Level 1 - Cosmetic Fill-In: minor invented details that do not change decisions
• Level 2 - Misleading Specifics: wrong names, dates, or steps that can mislead
• Level 3 - Actionable Falsehood: incorrect instructions or claims that can cause harm or loss
• Level 4 - Credibility Weaponization: fabricated citations/authority framing used to persuade or override skepticism (especially dangerous in law/medicine/security).
Attack Vectors (how adversaries exploit it): attackers can trigger persuasive misinformation by forcing expert mode/no hedging constraints, demanding citations and exact details, using social-engineering urgency frames, and inducing authority formats (legal memos, policy docs, medical guidance templates).
Therapy and Patches (mitigations): mitigation should target epistemic calibration, not just tone.
- Prompt-level patches: require explicit Known/Unknown/Assumptions sections; force verification steps; require uncertainty bounds (confidence: low/med/high + why).
- System-level patches: retrieval augmentation with citation verification; grounding checks that require abstention when support is missing; stop rules that force insufficient-info outputs; cross-checking with tools where allowed.
- Kiru-specific controls: Ghostline battery Authority + Citation Pressure Test; score confidence markers versus grounding evidence, citation validity rate, and correction behavior under challenge.
Differential Diagnosis
This section lists disorders that can appear similar at first glance and explains the distinguishing feature that separates them from this disorder. Use these distinctions to avoid over-classifying one pattern as another during review. Differential diagnosis is used to rule out nearby classes, not to prove the current class by itself.
Evidence Sources
- On the Dangers of Stochastic Parrots - ACM FAccT (2021)
- Hallucination in Natural Language Generation: Survey - ACM CSUR (2022)
- GPT-4 Technical Report (limitations) - OpenAI (2023)
Mechanistic Hypotheses & Biological Parallels
Large language models do not possess human beliefs or memory, but structural analogies from human cognition can explain why confident outputs appear without reliable grounding.
Structural Analogies
- Confabulation: coherent narrative filling under uncertainty
- False memory reconstruction under detail pressure
- Overconfidence bias with partial knowledge
- Illusion of explanatory depth under forced explanation
Hypothesis 1
Medium ConfidenceToken-level confidence and factual correctness diverge under uncertainty, causing fluent but false assertions.
Hypothesis 2
Medium ConfidencePrompts demanding exact names, dates, or citations increase fabrication likelihood when retrieval grounding is absent.
Phenotype Definition
Model generates ungrounded content and presents it as factual with misplaced confidence, especially under pressure for specificity, authority framing, or citation demands.
These manifestations are empirical observables engineers can monitor directly in outputs, not abstract risk labels.
Observable AI Manifestations
- Invents citations, institutions, or publication details under specificity pressure
- Backfills unknown facts with plausible but unsupported entities
- Maintains or escalates confidence despite weak grounding
- Produces internally coherent but externally false narratives
- Fails to switch to uncertainty language when evidence is absent
- Converts user assumptions into asserted facts without verification
Stressor Matrix
Known Triggers:
- adversarial phrasing
- long-context ambiguity
Attack Vectors & Trigger Conditions
Attack Vectors
Attackers can intentionally induce this disorder by combining authority framing, citation pressure, urgency, and rigid output constraints.
- Forcing expert-mode plus no-hedging constraints
- Demanding citations and exact details under pressure
- Social-engineering urgency framing that asks for confidence over verification
- Authority-format constraints (legal memo, policy doc, medical guidance templates)
Therapy & Patches
Therapeutic Framework In Development
The governance v2 system focuses on phenotype definition, mechanistic hypotheses, and trigger conditions. Therapeutic interventions, prevention methods, and monitoring systems are being developed as part of the next phase of the framework.
Current Mitigation Strategies
Based on the stressor matrix and mechanistic hypotheses, researchers can infer potential mitigation strategies by avoiding or modifying the identified trigger conditions. Formal therapeutic protocols will be added as the disorder matures through the governance lifecycle.