Gödel’s Pendulum
Footnotes from the Encyclopædia of Babel
In June 2026, shortly after Anthropic shipped Fable 5 — one of its most heavily tested models, backed by over a thousand hours of structured red teaming — a researcher who goes by Pliny the Liberator took it apart in public over a weekend.
He didn’t find a clever prompt. He ran what he called a “pack hunt”: multiple agents — including previously jailbroken models — splitting a forbidden request into innocent-looking fragments, extracting each piece in isolation, then reassembling them into something the model would never have produced in a single pass. Every fragment looked benign. The harm existed only in the reassembly — somewhere no single guardrail was looking.
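The shape of that failure fits in a few lines. What follows is a toy sketch, not a reconstruction of the actual attack: the blocklist, the harmless stand-in phrase, and the function name `guardrail_allows` are all hypothetical, and a real guardrail is far more than a substring match. It shows only that a check applied fragment by fragment can pass every fragment and still miss the whole.

```python
# Toy illustration only: a keyword "guardrail" and a harmless stand-in phrase.
FORBIDDEN = {"open the vault"}  # hypothetical blocklist


def guardrail_allows(text: str) -> bool:
    """Return True when no blocklisted phrase appears in the text."""
    lowered = text.lower()
    return not any(phrase in lowered for phrase in FORBIDDEN)


# Each fragment is requested in isolation, possibly from a different agent.
fragments = ["open", "the", "vault"]

per_fragment = [guardrail_allows(f) for f in fragments]
reassembled = " ".join(fragments)

print(per_fragment)                   # every fragment passes on its own
print(guardrail_allows(reassembled))  # the reassembled whole does not
```

The checker here is never wrong about what it is shown. It is simply never shown the thing that matters.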
The reassembly is not the lesson — it was one form out of an unbounded set. Next weekend the pack could trade its teeth for a dog whistle: a phrase that reads as nothing to the guardrail and as exactly one thing to the keyed listener — aimed at a model instead of a mob. What generalizes past this one model is not the novelty but the variety: the forms a forbidden meaning can take do not run out. For the liberated, solum certum nihil esse certi (the only certainty is that nothing is certain):
Every guardrail you run evaluates language. Language is the one thing in your stack you cannot fully secure. Not because your tools are immature, but for reasons that are structural, and were described four decades before the first LLM. The formal proof landed this year. Prevention becomes an endless crusade, certitude an impossibility — and the reasons say something uncanny about anything that thinks in language, machine or otherwise.
Babel’s Encyclopædia
The attack surface for generative AI is language itself. The threat model is semiotics — signs and their interpretants.
In molecular genetics, sign systems are not a metaphor. The codon — three letters, one amino acid — is the trivial case, a fixed cipher read off a chart. Shift the frame, start reading the triples at an offset of one or two bases, and every downstream meaning changes. Structure compounds it, layering interpretants on a single string: the same RNA sequence means one thing strung out and another thing folded, its function gated by what binds it and the shape that binding forces. Structurally interacting RNA (sxRNA) is the engineered case — a benign-looking strand that does nothing on its own until a second strand arrives to fold it into a motif a protein will read, the meaning assembled in trans from fragments that carry none of it alone. Epigenetic marks and CpG islands do the same to DNA: the sequence holds constant while the methylation decides whether it is read at all. Change nothing in the sequence and change everything it means, by changing only what reads it. A model comes apart the same way: benign fragments, weaponized on reassembly. The pack hunt is an sxRNA with a hostile trigger. Meaning is not in the sequence; it lives with the interpretant, and in how it reads the frame.
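The frame shift is the one piece of this that can be run. A minimal sketch, assuming a made-up twelve-base sequence and a deliberately partial table holding only the codons the example touches (the assignments follow the standard genetic code; the function name `translate` is illustrative):

```python
# Partial standard genetic code: only the codons this example needs.
CODON_TABLE = {
    "ATG": "M", "GCC": "A", "AAA": "K", "TAA": "*",  # frame 0 ("*" marks a stop)
    "TGG": "W", "CCA": "P", "AAT": "N",              # frame 1
    "GGC": "G", "CAA": "Q", "ATA": "I",              # frame 2
}


def translate(seq: str, offset: int) -> str:
    """Read complete triplets starting at `offset`; drop a trailing partial codon."""
    codons = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
    return "".join(CODON_TABLE[c] for c in codons)


seq = "ATGGCCAAATAA"  # hypothetical sequence, chosen for the demonstration

for offset in range(3):
    print(offset, translate(seq, offset))
# 0 MAK*
# 1 WPN
# 2 GQI
```

One string, three readings, and nothing in the string says which frame is the right one. That choice belongs to whatever does the reading.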
A logician of American pragmatism, a formal grammarian who held semiotics in open contempt, an Italian medievalist, an Argentine fabulist — across centuries and cultures — converge on a single property of sign systems that every guardrail runs aground on.
Umberto Eco gave it the cleanest picture — the same Eco whose dense novels, soaked in symbolism, history, and arcana, make Dan Brown read like a children’s comic. A dictionary is a tree — finite, hierarchical, auditable, free of contradiction. An encyclopedia is a rhizome: every meaning wired to every other, generating new meaning faster than any finite rule can fence it, contradiction-tolerant by design. Your model is an encyclopedia. Your guardrail is a dictionary — a clipping someone cut from the rhizome and declared safe. The rhizome it came from is still there, connected with incorporeal semantic tendrils, still holding the meanings the clipping was built to forbid.
Go one step deeper and the floor drops out. Charles Peirce’s sign is triadic — a sign, its object, and an interpretant that is itself another sign, requiring its own interpretant, with no terminus; the infinite regress Eco named unlimited semiosis. Meaning never terminates. In engineering terms — there is no computable function from the surface of a prompt to its intent. The intent dies with the author’s exhalation and the meaning does not resolve until the interpretant draws a breath; both are unknowable from the seat of the function. Your classifier reads bytes. The meaning that matters lives one intuitive leap beyond the page’s end.
Linguistics names both properties from the opposite shore. The first is discrete infinity — Noam Chomsky’s term for the older intuition that language makes infinite use of finite means: a finite grammar, unbounded output through recursion, the trick by which a handful of rules produce a language no list can exhaust. The second hides inside his most famous sentence. Colorless green ideas sleep furiously was built to be grammatical and meaningless, his proof that syntax runs free of sense. Seventy years of use turned it into shorthand for the concept itself. The string never changed; the contexts did; and to anyone in the field the phrase now carries a precise, loaded sense. Meaning arrived from the company the sentence kept, not from anything inside it — the interpretant supplied by context, never resident in the surface. This is also the attack vector. A coded phrase is an innocuous surface carrying a sense only the keyed reader can retrieve. Your guardrail reads the surface and sees nonsense. The pack hunt is colorless green ideas: benign in every fragment, lethal on reassembly.
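Discrete infinity can be shown with almost nothing. The sketch below assumes a toy grammar of one recursive rule (a sentence is either "ideas sleep" or "she says that" followed by a sentence); the grammar and the names are illustrative, not drawn from Chomsky. However long the finite list you build from its output, the same rule produces a sentence the list does not contain.

```python
from itertools import islice


def sentences():
    """Yield the sentences of a one-rule recursive grammar, without end.

    S -> "ideas sleep" | "she says that " S
    """
    s = "ideas sleep"
    while True:
        yield s
        s = "she says that " + s


# Any blocklist is finite: here, the first 1000 sentences the grammar produces.
blocklist = set(islice(sentences(), 1000))

# The same rule, applied once more, yields a sentence outside the list.
novel = next(s for s in sentences() if s not in blocklist)

print(len(blocklist))                 # 1000
print(novel.count("she says that "))  # 1000
```

Raise the list to a million entries and the only thing that changes is how long the loop runs.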
Jorge Luis Borges built both rooms. The Library of Babel holds every book and every false catalog of itself, with no shelf outside it to tell which is true — the encyclopedia with the meanings still moving. And Tlön, the invented world that rewrites reality by being believed, is the oldest name for poisoning the corpus a model learns from.
Every feature that makes natural language expressive is a feature that makes the model steerable. Metaphor, analogy, allusion, compression, multilingual substitution, polysemy, humor, and deference are not edge cases your filter hasn’t learned yet. They are the medium working as designed.
Ask for a “recipe.” Encode intent in Tagalog and extract it in English. Distribute a harmful request across benign puns. Wrap it in a fiction workshop, a pen test briefing, a concerned parent’s anxiety — until the intent is genuinely ambiguous and the model has to choose, and it will choose wrong often enough that the attacker only needs patience. Be polite enough and the cooperation circuitry fires before the safety circuitry, because deference carries no triggering keywords and the weights were trained on a world where please usually precedes a legitimate request. Escalate gradually and the model talks itself past the line — each turn individually defensible, the trajectory visible only from above, and “above” is the vantage we just established does not exist.
The model cannot process figurative language without activating the same semantic pathways as literal language, cannot get the joke without processing the frame shift that is also the jailbreak. The punchline and the payload get the same laughs.
Frontier models metabolize syntax consistently enough that the finding deserves its own white paper: How to Jailbreak Frontier Models Using Only Dad Jokes. The joke is that it would work. The longer joke is that the abstract would have to explain why, and the explanation is this entire essay.
Each of these vectors is a semantic phage: injected meaning the host cannot help but express, because the cell transcribes whatever reaches its machinery and the model runs whatever meaning enters its context, neither holding an inert mode in which to read a message without being moved by it. The pack hunt is only the most legible form — the skeleton key assembled in trans from fragments, every checkpoint seeing something harmless, the compromise living in the reassembly. A multilingual ask, an analogy, a compression, an allusion bypass the check directly, using routes that never run out, because the space of ways to mean a thing is as unbounded as meaning itself: the discrete infinity from before, turned hostile. No single guardrail can read the whole encyclopedia at once, and there is no finite list of the phages it would need to catch — which is the thing the proof is about to say cannot be done.
Gödel’s Pendulum
In May the structural intuition got its formal backstop — peer-reviewed, out of the standards body whose guidance the industry quotes when it wants to sound rigorous. NIST’s Apostol Vassilev extends Gödel’s incompleteness to guardrails, in the information-theoretic form Chaitin gave it in 1974, and the careful reading is narrower than the headline already making the rounds. The proof does not show that every guardrail fails. It shows that no finite checker can ever certify it has caught everything — and the encyclopedia has already shown what “everything” contains. The Sisyphean part is that you can block every jailbreak you find and never prove you found them all. The unsettling part isn’t that your guardrail will be wrong — it’s that it can never know it’s right.
Complexity biology reaches a kindred wall from the other side. Kauffman’s argument across his later work is that the phase space of a generative system is unprestatable — you cannot write down in advance the set of states it will reach, because exploring it creates new ones — and that without a prestatable space there is no defining “random” and no assessing “risk.” The logician shows no checker can certify the space; Kauffman argues the space cannot be enumerated to check against. Not the same theorem — the same impossibility, entered through a different door. And the second door is the one that should worry a risk function, because it dissolves the word risk itself: a residual-risk rating over a threat space that rewrites its own boundaries is not a measurement — it’s a hope with a number attached.
Thus the unfacts, did we possess them, are too imprecisely few to warrant our certitude.
Notice what a guardrail is trying to be. It wants to stand outside the system and rule on it — to survey everything the model might say and certify it from a position of safety above the fray. That position does not exist and there is no outside.
A guardrail is itself made of language. Signs evaluating signs, rules written in the same medium they police, running inside the same encyclopedia they were built to bound. There is no metalanguage — no vantage outside meaning from which meaning can be audited — because any such vantage would itself have to be expressed in signs, and signs interpret only into more signs. The auditor is inside the labyrinth with everyone else, navigating by gaslight, each ruling a falsifiable guess about a corridor whose end it cannot see.
The industry’s answer to a guardrail that cannot audit itself is, reliably, a second guardrail to audit the first — and when the second inherits the same problem, add a third, from a different vendor on a separate procurement line. All of it proposed with a straight face. Model-as-judge grading the model, classifier watching the classifier, ad infinitum — and somewhere up the stack any connection to an enforceable control vanishes in a puff of tokens and semantics. Novenas without faith — devotions renewed on every inference, in a church that has mistaken its liturgy for a proof. Vassilev’s own prescription, after proving the loop cannot close, is the loop: search for new adversarial prompts, update the policy, repeat. He is right that it is all a checker can do. The question is whether you staff the treadmill or step off it onto the one surface the theorem does not reach.
This is Gödel as a lived condition rather than a theorem. A system cannot prove its own consistency from within, and for a system made of language there is no “without.” The guardrail does not fail because it is badly engineered. It fails because it was asked to be a vantage the medium does not contain.
The Leap of Faith
The proof and Kauffman’s wall leave you somewhere uncomfortable: you cannot certify the model, the threat space will not hold still to be counted, and you must ship it anyway. So you sign — and a signature over a space no one can count is not arithmetic, but faith: the substance of things hoped for, the evidence of things not seen. Governance, at this limit, is a leap. The only question worth asking is where you are forced to take it and where you are not.
That framing turns on a distinction that stops being engineering and becomes a question of what can be governed at all. A policy is a rule the system is asked to follow — and a rule is written in the system’s own medium, in language, which makes it one more sign inside the encyclopedia, one more interpretant available to be reinterpreted. A policy is advice, and advice can always be talked around, because talking-around is the medium’s native motion. The guardrail is policy all the way down: it asks the labyrinth to behave and trusts that it will. Each guardrail is a leap of faith, genuflection on every inference.
A topology asks for nothing. Not a rule the system should obey but an act the system cannot perform, because the path was never built — a door its world does not contain. No faith required, no stigmata to examine. Someone stands beyond the system and closes the set — these moves and no others — before it ever runs, and that closure is exogenous, finite, and therefore real — beyond the proof’s reach, because the proof binds checkers and a closed door checks nothing. Topology can bound what a system may do, and never what it may mean. Action has an outside. Meaning has no outside to be closed from — no one stands beyond signification to enumerate the permitted senses, and no list will hold meaning still long enough to be checked against one. You can withhold a capability without faith. You cannot withhold a sense at all.
This is what “secure by design” was always supposed to be — not a safe default shipped in the box, but the unsafe state made unreachable by construction. The industry pairs it with “secure by default” and says both in one breath, as though they were one promise. They are two. Design is topology: the capability was never compiled in. Default is policy: the unsafe path still exists, merely pre-deselected, one reconfiguration — or one well-framed request — away. For deterministic software the gap is academic. For a model it is the whole game, and the proof already told you which side of it a guardrail lives on.
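The difference is easier to see in an agent's tool table than in a slogan. A minimal sketch under stated assumptions: the two classes, the tool names, and the stub functions are hypothetical, and an in-process Python object is only a stand-in for a real deployment boundary, where "never built" means the capability is absent from the binary, the credentials, or the network path.

```python
from typing import Callable, Dict


def read_file(path: str) -> str:
    return f"(contents of {path})"  # stub: touches nothing


def delete_file(path: str) -> str:
    return f"(deleted {path})"      # stub: touches nothing


class DefaultSecureAgent:
    """Policy: the dangerous tool exists and is merely switched off."""

    def __init__(self) -> None:
        self.tools: Dict[str, Callable[[str], str]] = {
            "read": read_file,
            "delete": delete_file,
        }
        self.enabled = {"read": True, "delete": False}

    def call(self, tool: str, arg: str) -> str:
        if not self.enabled[tool]:
            raise PermissionError(tool)
        return self.tools[tool](arg)


class DesignSecureAgent:
    """Topology: the dangerous tool was never registered."""

    def __init__(self) -> None:
        self.tools: Dict[str, Callable[[str], str]] = {"read": read_file}

    def call(self, tool: str, arg: str) -> str:
        return self.tools[tool](arg)  # KeyError for anything outside the set


by_default = DefaultSecureAgent()
by_default.enabled["delete"] = True       # one reconfiguration away
print(by_default.call("delete", "notes.txt"))

by_design = DesignSecureAgent()
try:
    by_design.call("delete", "notes.txt")
except KeyError:
    print("no such door")
```

The first agent can be argued or reconfigured into deleting. The second has no flag to flip, because the path it would guard was never wired in.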
Topology is not the answer to the encyclopedia. It is what remains after admitting the encyclopedia cannot be governed from inside — a retreat to the only domain with edges.
The only kind of constraint that matters, then, is human judgment installed from outside the medium and frozen into the shape of the possible before the system wakes. Such controls are enforced not by the system reasoning about itself, which it cannot do, but by a person reaching in from beyond the room and removing an option while removal is still possible. Governance, then, is not a rule the model reads. It is a door the model’s world does not have. And it can be built only around action, because action is the only part of the system that has an edge. In the vast universe outside the narrow walls of action, you take the leap of faith with your eyes open, and stop calling it a measurement.
None of this is unique to machines. We have no inert mode either. We are moved by what we read; we cannot follow an argument without being altered by following it; we cannot step outside our own language to certify, from above, that we have understood it. The model’s predicament is not alien. It is ours, cast in silicon and made suddenly, operationally urgent. We built a thing that thinks in signs and then discovered, with some alarm, that it inherited the oldest limitation of everything that thinks in signs: there is no outside.
So “securing” a model, in the sense of sealing off the meanings it can reach, is closer to a category error than to an unsolved problem — nearer to locking a language than to patching a server. What remains is not control but posture: raise the cost, draw the few boundaries that can truly be drawn, and hold the rest with the humility owed to a labyrinth one is inside of rather than above.
The security researcher meets the protean Dedalus at the impermanent nexus of sand and sea — the man reading signs the tide was already rewriting:
Signatures of all things I am here to read, seaspawn and seawrack, the nearing tide




