The Disentangled Geometry of Safety Mechanisms in Large Language Models

Paper: Knowing without Acting: The Disentangled Geometry of Safety Mechanisms in Large Language Models, by Jinman Wu and 4 other authors

Abstract: Safety alignment is often conceptualized as a monolithic process wherein harmfulness detection automatically triggers refusal. However, the persistence of jailbreak attacks suggests a fundamental mechanistic decoupling. We propose the Disentangled Safety Hypothesis (DSH), positing that safety computation operates on two distinct subspaces: a Recognition Axis ($\mathbf{v}_H$, “Knowing”) and an Execution Axis ($\mathbf{v}_R$, “Acting”). Our geometric analysis reveals a universal “Reflex-to-Dissociation” evolution, where these signals transition from antagonistic entanglement in early layers to structural independence in deep layers. To validate this, we introduce Double-Difference Extraction and Adaptive Causal Steering. Using our curated AmbiguityBench, we demonstrate a causal double dissociation, effectively creating a state of “Knowing without Acting.” Crucially, we leverage this disentanglement to propose the Refusal Erasure Attack (REA), which achieves state-of-the-art attack success rates by surgically lobotomizing the refusal mechanism. Furthermore, we uncover a critical architectural divergence, contrasting the Explicit Semantic Control of Llama3.1 with the Latent Distributed Control of Qwen2.5. The code and dataset are available at this https URL.
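
As a reading aid for the geometric vocabulary in the abstract, here is a minimal sketch of how the relationship between two direction vectors is commonly summarized with cosine similarity. It is not the paper's Double-Difference Extraction or its analysis pipeline. The function name `cosine` and the toy three-dimensional vectors are hypothetical stand-ins for $\mathbf{v}_H$ and $\mathbf{v}_R$, chosen only to show what "antagonistic" (strongly negative cosine) and "independent" (cosine near zero) could look like numerically.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy directions, NOT taken from the paper or from any model.
v_h_early = [1.0, 0.0, 0.0]   # stand-in for a recognition direction, early layer
v_r_early = [-1.0, 0.2, 0.0]  # stand-in for an execution direction, mostly opposite

v_h_deep = [1.0, 0.0, 0.0]    # stand-in for a recognition direction, deep layer
v_r_deep = [0.0, 1.0, 0.0]    # stand-in for an execution direction, orthogonal

early = cosine(v_h_early, v_r_early)  # strongly negative: directions oppose each other
deep = cosine(v_h_deep, v_r_deep)     # zero: directions share no component

print(f"early-layer cosine: {early:.3f}")
print(f"deep-layer cosine:  {deep:.3f}")
```

Under these made-up inputs the early-layer pair has a cosine close to -1 and the deep-layer pair has a cosine of 0. How the paper actually measures entanglement and independence across layers is not stated in the abstract.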

Submission history

From: Jinman Wu
[v1] Fri, 6 Mar 2026 00:14:09 UTC (442 KB)
[v2] Fri, 13 Mar 2026 10:42:07 UTC (442 KB)
