2-Body Problems and the Intelligence Illusion
Why Language Models Fail When Truth Is Co-Defined
Abstract
Most commentary on large language models still frames their limitations in terms of intelligence, hallucination, or benchmark performance. That framing is too coarse. The more interesting boundary appears when correctness depends on multiple interacting states that define each other over time. In such systems, truth is not simply retrieved, completed, or inferred from a static context window. It is negotiated through feedback, pressure, and delayed consequences.
This essay argues that modern language models perform well when a problem can be approximated as a single-pass traversal over stable structure, but degrade when the system resembles a two-body problem: one subsystem shapes the effective rules of another, which in turn reshapes the first. In these settings, the model does not merely lack information. It lacks a durable mechanism for maintaining coupled state coherence, revising earlier commitments, and discovering boundaries that only become visible under load. The failure mode is not best described as ignorance. It is better understood as non-convergence.
1. Introduction
The standard criticism of language models is that they are next-token predictors. This is true, but incomplete. The phrase is often used as if it ends the discussion, when in practice it only names the mechanism, not the boundary. A calculator also has a mechanism. The question is what class of problems the mechanism can stably solve.
In bounded domains, language models can be remarkably effective. If the structure is known, if the local objective is clear, and if correctness can be checked immediately, they often produce useful work. This explains their strength in boilerplate generation, test scaffolding, translation across familiar frameworks, and reconstruction of common artifacts.
The picture changes when the system does not present a fixed path. In many engineering problems, the difficulty is not that the solution is hidden in a large search space. It is that the relevant boundary is not fully specified in advance. It must be inferred from the behavior of multiple coupled subsystems over time. The problem is no longer completion. It becomes stabilization.
That is the point at which the intelligence illusion begins to collapse.
2. From Single-Pass Completion to Coupled State Coherence
A useful distinction is not between easy and hard tasks, but between stable and coupled ones.
In a stable problem, the rules are largely fixed. The task may still be difficult, but the system does not redefine itself while it is being solved. A developer writing a parser for a known grammar, translating an API call from one library to another, or generating unit tests for an existing function operates inside a space where the core constraints are already declared.
In a coupled problem, multiple states interact and continuously reshape one another. Correctness depends on preserving coherence across these interactions. A change in one part of the system is not merely propagated forward. It alters the effective meaning of neighboring decisions.
This can be stated more formally. Let a system contain states A and B, each of which constrains the other. A valid solution is not simply a sequence that makes A look plausible and then makes B look plausible. It is a joint configuration in which both remain consistent under continued interaction.
That requirement is stronger than local correctness. It is a demand for coupled state coherence.
Language models are weak exactly where this requirement becomes dominant. They can often satisfy constraints one at a time. They are far less reliable when those constraints must be held in tension simultaneously across time.
3. The Two-Body Problem
The metaphor of a two-body problem is useful here, not because software obeys celestial mechanics, but because it captures the shift from isolated movement to relational dependence.
A model performs well when the problem resembles a single body moving through a known field. There is a path, or at least a dense region of precedent, and the task is to continue along it. This is the domain of autocomplete, retrieval, adaptation, and local synthesis.
The problem changes when system one creates an effective rule for system two, and system two in turn creates an effective rule for system one. Now the difficulty is not merely computing the next step. Each step changes the local physics for the other participant.
This is common in real systems. A scheduler defines latency expectations that shape allocator behavior. The allocator then changes memory pressure and locality, which in turn changes scheduling behavior. A retry policy shapes persistence semantics, while storage semantics change the cost and safety of the retry policy. A UI loading model shapes backend caching assumptions, while backend cache invalidation reshapes what the UI can safely assume.
In such cases, there is no single fixed rulebook waiting to be queried. The rulebook is partly endogenous to the interaction.
Language models are not designed for this class of problem. They can narrate one side, then the other, and even produce a plausible explanation of their interaction. What they do not reliably do is maintain the joint pressure surface between them.
4. Fixed Points, False Stability, and Non-Convergence
The failure is easier to describe if we borrow the language of fixed-point iteration.
A stable solution in a coupled system is not merely a plausible sequence of actions. It is a state that remains valid when the system is re-evaluated against its own constraints. In other words, the desired outcome is a configuration for which applying the governing update no longer changes the essential truth of the state.
Language models do not naturally search for that kind of equilibrium. They generate forward. Even when wrapped in loops, tools, or agent frameworks, each pass remains an approximation produced over an evolving textual summary. This can look like iteration, but it is not the same as convergence.
The result is a familiar pattern in practical work. The system appears to improve. Local symptoms disappear. Tests pass. The code becomes larger, more guarded, and more plausible. Yet the architecture feels heavier, not cleaner. Subsequent fixes interact in increasingly brittle ways. What has been reached is not a fixed point but a false plateau: enough local consistency to continue, not enough global coherence to settle.
This is why claims that an agent reaches ninety-nine percent completion often require careful interpretation. The missing portion is rarely a small amount of evenly distributed labor. In living systems, the unresolved part is often the part that forces compression, deletion, or redefinition of boundaries. Expansion is easy to continue. Convergence is harder to detect.
5. Retrospective Causality and the Missing Rewind
A second limitation follows from the first. In coupled systems, visible failure often occurs far from the originating mistake.
A boundary may be violated at turn three, while the crash, leak, corruption, latency spike, or contradiction only becomes visible at turn nine. Debugging therefore requires more than surface correction. It requires retrospective causal tracing: identifying not where the problem appeared, but where the system first became invalid.
This is a surprisingly important distinction. A model can explain an error message, summarize a stack trace, and even suggest a local patch. None of that guarantees it can reconstruct the earlier commitment that made the failure inevitable.
Human engineers do this imperfectly, but they do attempt it. A production bug is often investigated by walking backward through a chain of assumptions, trying to locate the first point at which reality stopped matching the mental model. This kind of rewind is essential in systems where delayed effects dominate.
Language models are not reliable at this. Even with looped prompting, external memory, or tool use, they tend to repair the currently visible surface rather than re-evaluating the full decision lineage. They do not strongly bind themselves to past commitments, nor do they consistently enforce temporal accountability against earlier decisions.
That is why they often patch symptoms while leaving the real fracture in place.
6. Boundaries Are Not Always Pre-Declared
The previous sections assume that a boundary exists and can, in principle, be recovered. Real systems are often worse than that.
In many engineering domains, the correct boundary is not fully known in advance. It is not simply missing from the prompt. It is unresolved in the system itself. Ownership boundaries, failure domains, lifecycle partitions, and load-bearing abstractions are frequently discovered through pressure rather than specified cleanly at the outset.
This is one reason C and C++ remain useful examples. They expose the physical cost of pretending a boundary is clearer than it is. Code can compile, run, and even pass tests while still containing an invalid ownership story. The program may not segfault immediately. The incorrect boundary can remain latent until a particular timing pattern, allocation pressure, aliasing path, or cleanup order forces reality to adjudicate the ambiguity.
At that moment the system reveals something important: the boundary was never truly settled. It was being deferred.
A useful way to phrase this is that boundary existence is often a lazy-loaded negotiation with reality. A design proposes a partition of responsibility. The system then runs. Reality pushes back through latency, contention, invalid access, state explosion, or delayed contradiction. The boundary is refined, shifted, or abandoned. This continues until a workable partition is found, or until the system becomes too brittle to recover cleanly.
Language models do not participate naturally in this negotiation. They tend to act as if the relevant boundary already exists and merely needs to be expressed more clearly. That assumption is often wrong.
7. Ownership as a Concrete Example
Manual memory management and high-performance ownership design make the above dynamics visible in a compact way.
The safe advice is easy. Use smart pointers. Prefer RAII. Make lifetime legible. In many codebases this is good advice. It reduces ambiguity by encoding certain invariants directly into the language-level structure.
But that is not where the real design pressure lives.
The harder question is not whether a pointer should be wrapped. It is who should own the underlying object at all, why that ownership should live there, and what the performance, concurrency, locality, and lifecycle implications of that decision will be.
Should the object be uniquely owned, pooled, region-allocated, borrowed transiently, pinned across asynchronous boundaries, or reclaimed on an epoch schedule? Is destruction part of correctness, throughput, or backpressure? Does the current design create a crisp ownership graph, or a shared fog in which responsibility is smeared across modules until no one can tell who is allowed to delete, retain, or observe the value?
These are not merely stylistic questions. They are dynamic boundary questions. The correct answer depends on interacting subsystems, expected lifetimes, mutation patterns, and operational pressure.
A language model can often recognize known ownership idioms. It is weaker at originating a coherent ownership topology when that topology must emerge from the behavior of the whole system.
8. Why External Tools Do Not Fully Solve the Problem
A common counterargument is that these limitations disappear once the model is given external tools: a Python interpreter, symbolic execution, a database, a compiler, a verifier, or structured memory.
These tools do solve some real problems. They improve arithmetic correctness, provide exact state storage, make deterministic transforms possible, and allow parts of the system to be checked rather than guessed. This matters.
But the strongest version of the present argument survives tool augmentation.
External tools can maintain state. They do not decide which state matters, which couplings are load-bearing, which abstraction is legitimate, or which past decision should now be considered invalid in light of a new contradiction. They execute a model of the system. They do not choose the right model by themselves.
This distinction is easy to miss because deterministic execution can create the appearance of understanding. Yet if the dependency graph is wrong, the interpreter simply executes the wrong universe more faithfully. If the selected state variables omit the true boundary, the database stores an organized version of the wrong problem. If the verifier checks only digitized constraints, it strengthens the portion of the fence that was explicitly drawn while leaving the unmodeled region untouched.
The bottleneck is not memory alone. It is coherence under selection.
9. What the Intelligence Illusion Actually Is
Taken together, these observations suggest a more precise account of the intelligence illusion.
Language models appear intelligent when the system they operate on is already structurally tame. The relevant rules are declared, the coupling is weak, the objective is stable, and the failure surface is local. In that environment, forward completion and surface coherence are often enough to produce valuable work.
The illusion begins to break when truth is not simply encoded in the current context, but must be stabilized across interacting states, revised in light of delayed consequences, and discovered through pressure rather than declaration.
At that point, the model is no longer navigating a known route. It is attempting to move through a system whose boundaries are partly unresolved and whose constraints are partly produced by the interaction itself. The failure is not just that it lacks depth. It is that it lacks a durable mechanism for convergence, rewind, and negotiated boundary discovery.
That distinction matters because it relocates the boundary from raw intelligence to system physics.
10. Implications for Engineering Practice
If this account is broadly correct, several practical implications follow.
First, the domains in which language models are most useful are not identical to the domains in which they are most trusted. They are strongest where the structure is already stabilized and weakest where stability itself is still under construction.
Second, looped agents should not be evaluated solely by whether they eventually produce a passing artifact. They should be evaluated by whether they reduce global ambiguity, compress the system, and preserve causal accountability across turns. A sequence of locally successful patches can still represent structural drift.
Third, boundary work remains a distinctly human bottleneck. This does not mean humans are always better coders. It means humans are still the primary mechanism by which ambiguous systems are forced into stable partitions of responsibility. The engineering problem is often not generating code, but deciding where one subsystem should end so another can begin.
Fourth, verification remains asymmetric. It is easier to generate plausible structure than to prove that a negotiated boundary truly holds under delayed and adversarial conditions. This asymmetry does not disappear because the generated output is articulate.
Finally, the labor implications are subtle. The fact that models fail at load-bearing coupled systems does not prevent them from displacing large amounts of routine work. Cheap local competence can still compress teams, erase apprenticeship tasks, and shift the surviving human role toward verification, arbitration, and boundary restoration. That is a real economic force even if the models remain structurally weak at the deepest tier of system design.
11. Conclusion
The most important limitation of language models is not that they occasionally fabricate facts, miss a bug, or produce brittle code. Those failures are real but superficial. The deeper limitation appears when truth is not pre-declared, but co-defined through interaction.
In that class of system, correctness depends on maintaining coherence across coupled states, revising earlier commitments when later evidence invalidates them, and discovering boundaries that only become visible under pressure. These are not edge cases of reasoning. They are central features of real engineering.
A model that moves forward convincingly can still fail here, because the problem is no longer completion. It is stabilization.
That is why the metaphor of a two-body problem is useful. It names the shift from isolated plausibility to relational truth. It marks the point where each subsystem stops being merely a context for the other and becomes part of its physics.
Language models can imitate many things. They can explain rules, restate invariants, and generate competent local structure. What they do not yet reliably do is inhabit a world whose boundaries are still being negotiated.
And in real systems, that negotiation is often where the truth begins.
The Ledge & The Sludge
These conclusions pushed me to look back through my own career and sort the work more ruthlessly: which class of work still resists current AI architectures, and which is already sliding into commodity sludge. That exercise became The Ledge & The Sludge. If you work in tech, it should help you evaluate where you stand and whether your work will endure the generative-AI compression.
References
- Vaswani, A., et al. (2017). Attention Is All You Need.
- Noy, S., & Zhang, W. (2023). Experimental Evidence on the Productivity Effects of Generative Artificial Intelligence.
- Author. The 99% Illusion.
- Author. Agentic AI, AT Fields, and the Return to the Primordial LCL.
- Author. Beyond LeetCode: Why AI Can Code but Can't Create.
- Friston, K., et al. (2020). Work on active inference and model-based cognition.
- Selected literature on formal verification, model checking, and constraint satisfaction in software systems.