Architectural Limits of Large Language Models for Autonomous Discovery in Out-of-Distribution Environments
Two scenarios bound the discussion that follows. In the first, a quantitative trading system, fit on a decade of market history, encounters a regime its training data never contained: a sudden change in correlation structure, a new monetary stance, an unanticipated dislocation in liquidity. In the second, an AI-controlled robot lands on a previously unobserved planet and must observe, hypothesize, decide, and learn from raw data alone. The first scenario recurs in practice and is measured in billions of dollars of profit and loss; the second is hypothetical. Both, however, turn on the same architectural question:
Can a system optimized to predict within a fixed distribution learn to operate in a region of state space its training data never described?
For the planetary agent, success means producing a working model of an unknown world. For the trading system, success means surviving — and ideally exploiting — a regime its training set did not contain. The two settings differ in degree, not in kind. The planet is the limit case in which the divergence between training and deployment distributions becomes infinite; the market is the same problem in continuous, bounded form, recurring with every structural break. The financial setting is where this matters in practice; the planetary setting is where the architectural failure modes become unmistakable.
On the use of the planetary analogy. The financial setting is the primary applied domain of this essay; the interplanetary agent serves as a deliberate limiting case. Where the two appear side by side, they are two readings of the same architectural mechanism, not two separate topics. The planetary case pushes each mechanism to its limit, where the failure mode of a pure interpolator becomes unmistakable; the financial case grounds the same mechanism in consequences measured in profit and loss.
The central thesis of this essay is that contemporary large language models are powerful interpolators but weak extrapolators. They are trained to predict, compress, imitate, and recombine patterns from data they have already seen. An LLM deployed alone is unlikely to succeed as an autonomous learner in either of the two scenarios above, not because the model is unintelligent in any conventional sense, but because several architectural primitives required for autonomous discovery are absent or weak. Hypothesis generation is not learning, and explanation is not discovery. The remainder of the essay enumerates the missing primitives and pairs each with its mathematical structure.
Most large language models are trained against a single, well-defined objective: given previous tokens, predict the next. This produces a model with broad linguistic competence, factual recall, reasoning style, mathematical pattern recognition, programming structure, and analogical fluency. It does not produce a system optimized for open-ended discovery.
Discovery requires a different objective. An agent operating in an OOD environment — whether a market in regime transition or a new planet — must observe unfamiliar phenomena, infer hidden causes, test hypotheses against future data, revise beliefs, and choose actions whose value lies primarily in reducing its own uncertainty about the world's structure. None of this is next-token prediction. It is active causal learning combined with autonomous model-building.
An LLM can, of course, generate the sentence "I should form a hypothesis and test it." Generating that sentence is not the same as having an internal mechanism whose updates are driven by hypothesis testing.
Notation. We write \(\theta\) for the parameters of a learning model (e.g., LLM weights) and \(\phi\) for unknown environment parameters; this distinction is preserved throughout.
The standard LLM is trained by autoregressive maximum likelihood:
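In a standard form, with \(x_{1:T}\) denoting a token sequence drawn from the training distribution \(p_{\text{train}}\) (the symbols \(x_t\) and \(T\) are notation assumed here), the objective can be written as

\[
\theta^* = \arg\max_\theta \; \mathbb{E}_{x_{1:T} \sim p_{\text{train}}} \left[ \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}) \right].
\]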
A discovery-capable agent optimizes a fundamentally different objective: the expected information gain of an action \(a\) about environment parameters \(\phi\),
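One standard way to write this quantity, assuming a current dataset \(\mathcal{D}\), a candidate observation \(o\), and Shannon entropy \(H\) (all notation introduced here for illustration), is

\[
\mathrm{EIG}(a) = H\big[p(\phi \mid \mathcal{D})\big] - \mathbb{E}_{o \sim p(o \mid a, \mathcal{D})} \Big[ H\big[p(\phi \mid \mathcal{D}, a, o)\big] \Big] = I(\phi; o \mid a, \mathcal{D}).
\]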
A complete exploratory agent maximizes a Bellman value with both extrinsic reward and intrinsic epistemic value:
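A minimal sketch of such a value function, assuming a belief state \(b\) over \(\phi\), an extrinsic reward \(r\), a discount factor \(\gamma \in [0, 1)\), and a weight \(\lambda \ge 0\) on information gain (all of these symbols are illustrative assumptions), is

\[
V(b) = \max_a \; \Big\{ \mathbb{E}\big[r(b, a)\big] + \lambda \, I(\phi; o \mid a, b) + \gamma \, \mathbb{E}_{o \sim p(o \mid b, a)} \big[ V(b') \big] \Big\},
\]

where \(I(\phi; o \mid a, b)\) is the expected information gain of action \(a\) under belief \(b\), and \(b'\) is the posterior belief after taking \(a\) and observing \(o\).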
The next-token loss never asks "which observation would most reduce my uncertainty about the structure of this environment?" — and that is the question both a market explorer and a planetary scientist must ask.
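The question can be made concrete with a small numerical sketch. The function below computes the expected information gain of an action over a discrete set of hypotheses about the environment. The two-regime prior and both likelihood tables are invented for illustration and do not describe any real market or instrument.

```python
import math

def entropy(p):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def expected_information_gain(prior, likelihood):
    """Expected reduction in entropy over hypotheses from one observation.

    prior[h]         : current belief in hypothesis h
    likelihood[h][o] : p(observation o | hypothesis h) under the chosen action
    """
    n_hyp, n_obs = len(prior), len(likelihood[0])
    gain = entropy(prior)
    for o in range(n_obs):
        p_o = sum(prior[h] * likelihood[h][o] for h in range(n_hyp))
        if p_o == 0.0:
            continue  # this outcome cannot occur, so it contributes nothing
        posterior = [prior[h] * likelihood[h][o] / p_o for h in range(n_hyp)]
        gain -= p_o * entropy(posterior)
    return gain

# Hypothetical example: two candidate regimes, two possible actions.
prior = [0.5, 0.5]
passive_action = [[0.5, 0.5], [0.5, 0.5]]   # outcome is the same under both regimes
probing_action = [[0.9, 0.1], [0.1, 0.9]]   # outcome depends strongly on the regime

eig_passive = expected_information_gain(prior, passive_action)
eig_probing = expected_information_gain(prior, probing_action)
print(f"passive: {eig_passive:.3f} nats, probing: {eig_probing:.3f} nats")
```

Under these assumed numbers the passive action teaches the agent nothing, while the probing action removes roughly half of its uncertainty about the regime. An agent that values information would rank the two actions accordingly; a next-token objective has no term that does so.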
A deployed LLM does not change its internal weights during use; its core knowledge is fixed at the end of training. It can be augmented with context windows, retrieval, scratchpads, episodic memory, and summarization. These augmentations modify the model's conditioning, not its parameters.
The distinction matters because real environments are non-stationary. A market does not announce that its volatility regime has shifted; an alien geology does not arrive labeled. A genuine learner updates its beliefs about the world's parameters as new evidence arrives. A frozen model cannot, even when its retrieval system can be fed arbitrarily many new documents.
There is a corresponding distinction between remembering and learning. A retrieval system can store the sentence "Volatility expanded after the announcement" and surface it later. Genuine learning would require integrating that observation into the model's predictive structure, e.g., "Announcements of this class produce volatility regimes whose half-life depends on prior positioning, and this is now a stable feature of my model." The second form requires stable abstraction, causal integration, and parameter-level revision.
Current deployed LLMs do not perform this kind of continual learning. A deployed LLM is a frozen function: \(\theta = \theta^*\), fixed. A genuine learner updates its belief about the environment \(\phi\) as data arrives. The general Bayesian form is:
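Writing \(\mathcal{D}_t\) for the data arriving at step \(t\) and \(\mathcal{D}_{1:t}\) for everything observed so far (the subscript convention is an assumption made here), one standard statement is

\[
p(\phi \mid \mathcal{D}_{1:t}) = \frac{p(\mathcal{D}_t \mid \phi, \mathcal{D}_{1:t-1}) \; p(\phi \mid \mathcal{D}_{1:t-1})}{p(\mathcal{D}_t \mid \mathcal{D}_{1:t-1})},
\]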
which simplifies, under the standard assumption that observations are conditionally independent given \(\phi\), to:
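With \(\mathcal{D}_t\) the data at step \(t\) and \(\mathcal{D}_{1:t-1}\) the data before it (an assumed subscript convention), the recursive update would read

\[
p(\phi \mid \mathcal{D}_{1:t}) \;\propto\; p(\mathcal{D}_t \mid \phi) \; p(\phi \mid \mathcal{D}_{1:t-1}).
\]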
The corresponding parametric continual-learning form is \(\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t, \mathcal{D}_t)\).
Retrieval, scratchpads, and long contexts modify the conditioning of \(p_\theta(y \mid x, c)\) but leave \(\theta\) untouched: \(y \sim p_\theta(\cdot \mid x, c)\), \(\theta\) unchanged. That is the formal definition of remembering without learning.
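A toy simulation illustrates the difference between a frozen predictor and one that keeps updating. The sketch below assumes a binary outcome whose probability shifts from 0.7 to 0.3 at a regime break, a frozen model fixed at the old value, and a Beta-Bernoulli learner that applies the Bayesian update at every step. All numbers are hypothetical, and the learner is deliberately the simplest one available, not a proposal for a production system.

```python
import math
import random

def log_loss(p, y):
    """Negative log-likelihood of a binary outcome y under predicted probability p."""
    return -math.log(p if y == 1 else 1.0 - p)

def simulate(seed=0, steps_per_regime=200):
    rng = random.Random(seed)
    frozen_p = 0.7           # frozen model: fit to the old regime, never updated
    alpha, beta = 1.0, 1.0   # Beta(1, 1) prior for the updating learner
    frozen_loss = learner_loss = 0.0
    # First regime (p = 0.7) is used only for fitting; losses are scored after the break.
    for regime_p, scored in ((0.7, False), (0.3, True)):
        for _ in range(steps_per_regime):
            y = 1 if rng.random() < regime_p else 0
            learner_p = alpha / (alpha + beta)   # posterior mean before seeing y
            if scored:
                frozen_loss += log_loss(frozen_p, y)
                learner_loss += log_loss(learner_p, y)
            alpha += y                           # conjugate Bayesian update
            beta += 1 - y
    return frozen_loss, learner_loss, alpha / (alpha + beta)

frozen_loss, learner_loss, learner_mean = simulate()
print(f"post-break log loss: frozen={frozen_loss:.1f}, learner={learner_loss:.1f}")
print(f"learner's final estimate: {learner_mean:.2f}")
```

Even this stationary learner, which treats both regimes as one, drifts toward the new regime and accumulates less loss than the frozen model. It adapts slowly because it never discounts old evidence; a learner that also modeled the break itself would adapt faster.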
The most general statement of the problem is that contemporary LLMs are trained on \(p_{\text{train}}\) and deployed on \(p_{\text{new}}\), and the quality of inference depends on the divergence between the two. In financial markets, \(p_{\text{new}}\) is the distribution prevailing in the current regime, and \(p_{\text{train}}\) is some weighted mixture of historical regimes.
Across regime boundaries the divergence is large but ordinarily bounded. In the limit case of a previously unobserved planet, \(p_{\text{new}}\) may place mass on phenomena that have zero probability under \(p_{\text{train}}\), in which case the divergence is unbounded.
The model's most likely failure mode in either setting is not silence but semantic projection: it forces unfamiliar observations into the closest categories its training distribution provides, even when no available category is correct. On a planet, a chemical growth process might be classified as biological. In a market, a structural break in liquidity might be classified as ordinary mean-reverting volatility.
For any loss \(\ell\) taking values in \([0, 1]\), Pinsker's inequality and the variational characterization of total variation distance give:
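A standard form of this bound, writing \(\mathrm{TV}\) for total variation distance and \(\mathrm{KL}\) for Kullback-Leibler divergence in nats, is

\[
\big| \mathbb{E}_{p_{\text{new}}}[\ell] - \mathbb{E}_{p_{\text{train}}}[\ell] \big| \;\le\; \mathrm{TV}(p_{\text{new}}, p_{\text{train}}) \;\le\; \sqrt{\tfrac{1}{2} \, \mathrm{KL}(p_{\text{new}} \,\|\, p_{\text{train}})}.
\]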
At a regime transition the KL is large; when \(p_{\text{new}}\) assigns positive probability to events of zero probability under \(p_{\text{train}}\), the KL is infinite. The bound then carries no information. It is already vacuous once the KL exceeds 2 nats, because the expected value of a loss in \([0, 1]\) can never shift by more than 1. The associated failure mode is semantic projection:
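One way to formalize this, assuming the model scores categories with \(p_\theta(c \mid o)\) (a hypothetical classification head, used only for illustration), is

\[
\hat{c}(o) = \arg\max_{c \,\in\, \mathcal{C}_{\text{train}}} p_\theta(c \mid o),
\]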
i.e., classifying a novel observation \(o\) into the closest trained category even when no element of \(\mathcal{C}_{\text{train}}\) is correct.
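The behavior of the divergence bound in the bounded and unbounded cases can be checked numerically. The sketch below compares a training distribution with two deployment distributions over four event types: one that only reweights known events, and one that places mass on an event the training distribution never saw. The probabilities are invented for illustration.

```python
import math

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl_divergence(p, q):
    """KL(p || q) in nats; infinite if p has mass where q has none."""
    total = 0.0
    for a, b in zip(p, q):
        if a == 0.0:
            continue
        if b == 0.0:
            return math.inf
        total += a * math.log(a / b)
    return total

def pinsker_bound(p, q):
    """Pinsker upper bound on TV(p, q), capped at the trivial bound of 1."""
    return min(1.0, math.sqrt(0.5 * kl_divergence(p, q)))

# Hypothetical event probabilities; the fourth event is absent from training.
p_train = [0.5, 0.3, 0.2, 0.0]
p_regime = [0.3, 0.3, 0.4, 0.0]   # known events, reweighted
p_novel = [0.2, 0.2, 0.2, 0.4]    # mass on a previously unseen event

for name, p_new in (("regime shift", p_regime), ("novel event", p_novel)):
    print(name,
          "KL =", kl_divergence(p_new, p_train),
          "TV =", total_variation(p_new, p_train),
          "bound =", pinsker_bound(p_new, p_train))
```

In the reweighted case the KL is finite and the bound is informative. In the novel-event case the KL is infinite and the bound collapses to the trivial value of 1, even though the total variation distance itself remains finite.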
LLMs learn correlations. Concepts that co-occur in text are linked; relationships are recovered in the form of conditional probabilities \(p(Y \mid X)\). Discovery requires more: it requires distinguishing what would happen given that we observe \(X = x\) from what would happen if we set \(X = x\). The two are not the same, even in expectation.
A market stress test asks the second question, not the first. The historical correlation between an indicator and a return is one thing; what would happen if a portfolio's exposure to that indicator were changed deliberately is another. Without a causal model, an agent in either setting conflates observation with intervention and will systematically overstate its predictive power. In trading, this is a central mechanism by which spurious correlations survive backtests but fail in deployment.
Pearl's distinction makes the gap precise:
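In do-notation, the general statement is

\[
p(Y \mid X = x) \;\neq\; p(Y \mid \mathrm{do}(X = x)) \quad \text{in general},
\]

with equality holding only in special cases, for example when \(X\) and \(Y\) share no unblocked common cause.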
A structural causal model defines variables by autonomous mechanisms \(X_i := f_i(\text{Pa}_i, U_i)\) where \(\text{Pa}_i\) are direct causes and \(U_i\) are exogenous noises. A stress test is, formally, a do-operation:
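A sketch of the corresponding definition, assuming the intervention targets a single variable \(X_j\) and fixes it to a stressed value \(x'\) (both symbols are illustrative), replaces that variable's mechanism and leaves every other mechanism intact:

\[
\mathrm{do}(X_j = x'): \quad X_j := x', \qquad X_i := f_i(\text{Pa}_i, U_i) \;\; \text{for all } i \neq j.
\]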
Recovering the graph \(G\) and the mechanisms \(\{f_i\}\) from interventional data is the core problem of causal discovery, and pure observational text-prediction does not solve it.
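The gap between conditioning and intervening can be reproduced in a few lines. The sketch below assumes a toy structural causal model in which a hidden variable \(Z\) drives both \(X\) and \(Y\), and \(X\) has no causal effect on \(Y\) at all. The structure and the 0.9 noise levels are assumptions chosen to make the effect visible.

```python
import random

def sample(rng, do_x=None):
    """Draw (x, y) from a toy SCM: Z -> X and Z -> Y, with no edge from X to Y."""
    z = rng.random() < 0.5                          # hidden common cause
    if do_x is None:
        x = z if rng.random() < 0.9 else (not z)    # X := f_X(Z, U_X)
    else:
        x = do_x                                    # do(X = x): mechanism replaced
    y = z if rng.random() < 0.9 else (not z)        # Y := f_Y(Z, U_Y)
    return x, y

def p_y_given_x1(n=20000, seed=0):
    """Observational estimate of p(Y = 1 | X = 1)."""
    rng = random.Random(seed)
    ys = [y for x, y in (sample(rng) for _ in range(n)) if x]
    return sum(ys) / len(ys)

def p_y_do_x1(n=20000, seed=0):
    """Interventional estimate of p(Y = 1 | do(X = 1))."""
    rng = random.Random(seed)
    return sum(sample(rng, do_x=True)[1] for _ in range(n)) / n

observational = p_y_given_x1()
interventional = p_y_do_x1()
print(f"p(Y=1 | X=1)     ~ {observational:.2f}")
print(f"p(Y=1 | do(X=1)) ~ {interventional:.2f}")
```

Under these assumptions the observational estimate is close to 0.82, while the interventional one is close to 0.5: seeing \(X = 1\) is strong evidence about \(Y\), yet setting \(X = 1\) changes nothing. A backtest measures the first quantity; a deliberate change in exposure realizes the second.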
LLMs produce coherent explanations on demand. Given a set of observations, a current model can generate a fluent causal narrative drawn from its training corpus. The narrative may be elegant; it may even be correct. But coherence is not evidence.
Science requires more than retrodictive explanation. It requires that hypotheses make falsifiable predictions, that those predictions be tested against data the hypotheses had no opportunity to fit, and that the hypotheses be revised when they fail. The clearest example in finance is post-hoc market commentary. Given any move, an LLM can produce a plausible explanation. The story will retrodict; it may not predict.
Two hypotheses \(H_1, H_2\) should be compared on data they did not see during fitting:
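One common way to score the comparison, assuming each hypothesis is fit on \(\mathcal{D}_{\text{fit}}\) and evaluated on a disjoint \(\mathcal{D}_{\text{test}}\) (names chosen here for illustration), is the held-out log-likelihood ratio

\[
\Lambda = \log p(\mathcal{D}_{\text{test}} \mid H_1, \mathcal{D}_{\text{fit}}) - \log p(\mathcal{D}_{\text{test}} \mid H_2, \mathcal{D}_{\text{fit}}),
\]

with \(\Lambda > 0\) favoring \(H_1\).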
Occam's razor in its formal Minimum Description Length form penalizes hypotheses that merely fit:
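In the two-part form, with \(L(H)\) the code length of the hypothesis and \(L(\mathcal{D} \mid H)\) the code length of the data given the hypothesis, the preferred hypothesis is

\[
H^* = \arg\min_H \; \big[ L(H) + L(\mathcal{D} \mid H) \big].
\]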
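A hypothesis that "can explain anything" pays for that flexibility in one of two ways: either its own description length \(L(H)\) grows to the size of a lookup table, or the code length of the data given the hypothesis, \(L(\mathcal{D} \mid H)\), stays large because the hypothesis rules nothing out. Either way it compresses nothing, which is the formal signature of not doing science. Today's LLMs cannot yet search the space of novel mathematical models; their hypothesis space is largely fixed by training.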
A discovery-capable agent must distinguish, at the level of its internal state, between knowing, suspecting, guessing, and imagining. In finance, this is more than a philosophical requirement: calibration determines bet size, and bet size determines survival.
The Kelly criterion makes the link mechanical. Optimal capital allocation depends on the probability the agent assigns to a favorable outcome; an overconfident probability produces a position that is too large. Persistent overconfidence, a commonly reported behavior of LLMs on out-of-distribution prompts, therefore produces position sizes that are systematically too large precisely on tail bets, where mispricing is most damaging.
Frank Knight's 1921 distinction between measurable risk and unmeasurable uncertainty maps approximately onto the modern Bayesian decomposition into aleatoric and epistemic uncertainty. A system that does not separate the two tends to price tail exposure on the basis of historical volatility alone.
Calibration: the Expected Calibration Error is:
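In the standard binned form, with \(n\) predictions partitioned into \(M\) confidence bins \(B_m\) (the bin count \(M\) is a free choice), it is

\[
\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \, \big| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \big|,
\]

where \(\mathrm{acc}(B_m)\) is the empirical accuracy in bin \(m\) and \(\mathrm{conf}(B_m)\) is the mean stated confidence in that bin.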
For a binary bet with win probability \(p\) at net odds \(b\), the Kelly fraction is:
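The standard expression, with \(f^*\) the fraction of capital staked, is

\[
f^* = \frac{bp - (1 - p)}{b} = p - \frac{1 - p}{b}.
\]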
Total predictive uncertainty decomposes, for environment parameters \(\phi\), as:
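One standard information-theoretic form, assuming a predictive distribution \(p(y \mid x, \phi)\) and a posterior \(p(\phi \mid \mathcal{D})\) (notation assumed here), is

\[
\underbrace{H\big[\mathbb{E}_{\phi \sim p(\phi \mid \mathcal{D})} \, p(y \mid x, \phi)\big]}_{\text{total}} = \underbrace{\mathbb{E}_{\phi \sim p(\phi \mid \mathcal{D})} \, H\big[p(y \mid x, \phi)\big]}_{\text{aleatoric}} + \underbrace{I(y; \phi \mid x, \mathcal{D})}_{\text{epistemic}}.
\]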
A system that does not access the second term cannot say "I do not yet know" with operational consequences — it cannot reduce position size, defer a decision, or solicit additional evidence on principled grounds.
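The cost of miscalibration under Kelly sizing can be shown with a short worked example. The sketch below assumes an even-odds bet whose true win probability is 0.55 and an overconfident system that believes it is 0.75. Both probabilities are hypothetical, and the function names are chosen for illustration.

```python
import math

def kelly_fraction(p, b):
    """Kelly stake for win probability p at net odds b, floored at zero (no bet)."""
    return max(0.0, (b * p - (1.0 - p)) / b)

def expected_log_growth(f, p_true, b):
    """Expected log growth per bet when staking fraction f (requires f < 1)."""
    return p_true * math.log(1.0 + b * f) + (1.0 - p_true) * math.log(1.0 - f)

b = 1.0             # even odds
p_true = 0.55       # actual win probability (assumed)
p_believed = 0.75   # overconfident estimate (assumed)

f_calibrated = kelly_fraction(p_true, b)
f_overconfident = kelly_fraction(p_believed, b)
g_calibrated = expected_log_growth(f_calibrated, p_true, b)
g_overconfident = expected_log_growth(f_overconfident, p_true, b)

print(f"calibrated:    stake {f_calibrated:.2f}, growth {g_calibrated:+.4f} per bet")
print(f"overconfident: stake {f_overconfident:.2f}, growth {g_overconfident:+.4f} per bet")
```

With these numbers the calibrated stake is 10% of capital and grows wealth slowly, while the overconfident stake is 50% and has negative expected log growth. The bet itself is favorable in both cases; only the stated probability differs.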
A common counterargument is that very large context windows close the gap: if a model can ingest millions of tokens of observations, it can in effect learn a new market or a new planet. This conflates conditioning with learning.
A context window is working memory. Information held in it modifies the conditional distribution from which the model draws its outputs but does not modify its parameters. When the context is cleared, that information is lost. A genuine learner internalizes new evidence at the level of its own representations.
Summarization is a lossy operation, and the data-processing inequality guarantees that information about the world's parameters cannot increase under any deterministic compression. For a deterministic summarizer \(\sigma: c \mapsto \tilde{c}\) inducing the Markov chain \(\phi \to c \to \tilde{c}\), the data-processing inequality gives:
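With \(I\) denoting mutual information, the inequality reads

\[
I(\phi; \tilde{c}) \;\le\; I(\phi; c).
\]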
Information about the environment \(\phi\) cannot increase through summarization, and for a lossy summarizer it generally decreases. Generic summarizers, optimized for typical content, tend to suppress rare events, which is the wrong inductive bias when those rare events are precisely what carry the most predictive content about a non-stationary environment.
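The loss can be measured directly in a toy setting. The sketch below assumes two environment states and three observable event types, one of which (a liquidity gap) is rare but occurs only in the stressed state. A hypothetical summarizer folds that rare event into an ordinary large move. The joint probabilities and the event names are invented for illustration.

```python
import math
from collections import defaultdict

def mutual_information(joint):
    """I(phi; c) in nats for a joint distribution given as {(phi, c): probability}."""
    p_phi, p_c = defaultdict(float), defaultdict(float)
    for (phi, c), p in joint.items():
        p_phi[phi] += p
        p_c[c] += p
    return sum(p * math.log(p / (p_phi[phi] * p_c[c]))
               for (phi, c), p in joint.items() if p > 0)

def summarize(joint, sigma):
    """Push the joint distribution through a deterministic summarizer sigma(c)."""
    out = defaultdict(float)
    for (phi, c), p in joint.items():
        out[(phi, sigma(c))] += p
    return dict(out)

# Hypothetical joint p(phi, c): 90% calm, 10% stressed.
joint = {
    ("calm", "small_move"): 0.72, ("calm", "large_move"): 0.18,
    ("calm", "liquidity_gap"): 0.00,
    ("stressed", "small_move"): 0.03, ("stressed", "large_move"): 0.03,
    ("stressed", "liquidity_gap"): 0.04,
}

def drop_rare(c):
    # Assumed summarizer behavior: the rare event is reported as a large move.
    return "large_move" if c == "liquidity_gap" else c

mi_full = mutual_information(joint)
mi_summary = mutual_information(summarize(joint, drop_rare))
print(f"I(phi; c) = {mi_full:.4f} nats, I(phi; summary) = {mi_summary:.4f} nats")
```

The summarized record retains strictly less information about the state than the raw record, and the lost portion is concentrated in the event that identified the stressed state unambiguously.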
The deepest limitation is that LLMs operate inside a fixed conceptual vocabulary. They can recombine known categories with great fluency, but they cannot easily invent new categories that improve prediction.
The history of finance contains an instructive sequence of category formations: high-yield, securitized credit, exchange-traded funds, smart beta, cryptocurrencies. Each rendered prior factor models incomplete. Each required, eventually, a new latent variable in the description of the market. This problem is not solved by naming. It is solved by finding latent variables that are both predictive and parsimonious.
Ontology formation can be formulated as the search for a latent variable \(Z\) that compresses observations while preserving information about future outcomes. An MDL-style objective is:
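One plausible form, assuming past observations \(X\), future outcomes \(Y\), and code lengths \(L(\cdot)\) as in two-part MDL (the decomposition into these three terms is an illustrative assumption), is

\[
Z^* = \arg\min_Z \; \big[ L(Z) + L(X \mid Z) + L(Y \mid Z) \big].
\]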
This shares the intuition of the classical Information Bottleneck objective: \(\min_{p(z \mid x)} I(X; Z) - \beta I(Z; Y)\). Today's LLMs cannot perform this search over the space of novel programs. Their hypothesis space is fixed by training; genuine ontology formation requires searching outside it.
The capabilities discussed above can be assembled into a single discovery loop that any agent operating in a non-stationary environment must execute. The loop has five components: a belief update, an action-selection rule that values information, a hypothesis test against future data, an ontology revision step, and a communication interface with the outside world.
The LLM, in this picture, is the communication layer. It explains, summarizes, generates hypotheses for human review, and translates between formalism and natural language. These are valuable functions. They are not the discovery loop; they are its final stage.
The argument of this essay can be stated in a single sentence: today's large language models solve a problem — next-token prediction within a fixed distribution — that is fundamentally easier than the problem an autonomous learner in a non-stationary environment must solve. The gap between the two is architectural, not a matter of scale.
The financial implication is immediate. A trading system that relies on a frozen, retrieval-augmented LLM as its primary learner is not solving the problem its environment poses. It is interpolating against the regime in which it was trained, while the regime that determines its survival is the next one. Static interpolators are therefore inadequate at regime transitions, and in markets they face a second pressure: a pattern that is traded profitably attracts capital that erodes it, so a static strategy tends to decay through its own success.
The contrast in summary. Today's deployed LLM is the solution to a single training-time problem:
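With \(x_{1:T}\) a token sequence drawn from \(p_{\text{train}}\) (the token symbols are assumed here), that problem is

\[
\theta^* = \arg\max_\theta \; \mathbb{E}_{x_{1:T} \sim p_{\text{train}}} \left[ \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}) \right].
\]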
A learner operating in a non-stationary environment must solve, instead, a system of coupled problems on different timescales:
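A sketch of such a system, assembled from the standard form of each component, is given below. The symbols \(\mathcal{D}_t\) (data at step \(t\)), \(r\) (extrinsic reward), \(\lambda\) (weight on information), \(\mathrm{EIG}\) (expected information gain), and \(L(\cdot)\) (code length) are assumptions made here for illustration.

\[
\begin{aligned}
p(\phi_t \mid \mathcal{D}_{1:t}) &\propto p(\mathcal{D}_t \mid \phi_t) \; p(\phi_t \mid \mathcal{D}_{1:t-1}) && \text{(belief update)} \\
a_t &= \arg\max_a \; \big\{ \mathbb{E}[r(a)] + \lambda \, \mathrm{EIG}(a) \big\} && \text{(action selection)} \\
Z_t &= \arg\min_Z \; \big[ L(Z) + L(\mathcal{D}_{1:t} \mid Z) \big] && \text{(ontology revision)}
\end{aligned}
\]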
subject to a structural causal model \(X_i := f_i(\text{Pa}_i, U_i)\) on the inferred graph \(G_t\). The world parameters \(\phi_t\) and graph \(G_t\) are inferred from data; the action \(a_t\) is chosen online; the ontology \(Z_t\) is revised on a slower timescale. The first equation is what we have. The system is what we would need.
The following list groups foundational papers and books according to the architectural argument they support. The list is intentionally selective rather than exhaustive: