[Diagram: Market State at Time T. Inputs: price, volume, volatility regime, trend, momentum, macro context, retrieved historical cases, and baseline probabilities.]
Training AI agents to improve their market reasoning through historical outcomes, structured feedback, and reward-based policy refinement.
In language models, RLHF uses human preferences to improve model behavior. In market reasoning, the feedback source is different: the market itself provides delayed, noisy, and probabilistic feedback through realized outcomes.
Rather than treating financial forecasting as a static supervised problem, we model it as an iterative decision-and-feedback process. The objective is not deterministic prediction — it is calibrated scenario reasoning: multiple possible outcomes, probability estimates, confidence levels, and explicit uncertainty.
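To make that output concrete, one way to represent a calibrated scenario analysis is as a structured object. This is a minimal sketch: the class and field names here are illustrative, not a fixed schema from the system.

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    """One branch of the agent's scenario map (illustrative fields)."""
    description: str   # e.g. "breakout continues above resistance"
    probability: float # agent-assigned probability, in [0, 1]
    confirmation: str  # condition that would confirm this branch
    invalidation: str  # condition that would invalidate it

@dataclass
class ScenarioAnalysis:
    """The agent's full output for one market state: calibrated scenario
    reasoning rather than a single deterministic forecast."""
    main: Scenario
    alternatives: list[Scenario] = field(default_factory=list)
    confidence: float = 0.5  # overall confidence in the analysis, in [0, 1]

    def probabilities(self) -> list[float]:
        return [self.main.probability] + [s.probability for s in self.alternatives]

    def is_normalized(self, tol: float = 1e-6) -> bool:
        # Sanity check: scenario probabilities should sum to ~1.
        return abs(sum(self.probabilities()) - 1.0) < tol
```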
Every historical time step is a learning episode. The agent observes, reasons, the market resolves, and the system audits — converting outcomes into reward signals that improve the policy.
1. Agent observes: price, volume, volatility regime, trend, momentum, macro context, retrieved historical cases, and baseline probabilities.
2. Agent analyzes: produces a main scenario, alternatives, probability estimates, confidence, risk factors, and confirmation and invalidation conditions.
3. Market resolves: outcomes at horizons T+1, T+3, T+5, and T+10 become known, and the realized price path is compared against the agent's scenarios.
4. Professor audits: evaluates reasoning quality, not just correctness. Was confidence justified? Were probabilities well-calibrated? Did invalidation logic hold?
5. Reward assigned: a graded reward, positive for calibrated correct reasoning, neutral for reasonable misses, negative for overconfident wrong analysis.
6. Policy updated: scenario weighting, probability calibration, confidence expression, and regime adaptation are all updated from the reward signal.
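Put together, one historical episode is roughly the following loop. This is a sketch under assumed interfaces: `market`, `agent`, and `professor` are hypothetical objects standing in for the components above, and the method names are illustrative.

```python
def run_episode(agent, professor, market, t):
    """Run one learning episode at historical time step t (illustrative)."""
    # 1. Observe: structured state, retrieved memory, baseline priors.
    state = market.state_at(t)

    # 2. Analyze: the agent commits to a scenario map before outcomes are known.
    analysis = agent.analyze(state)

    # 3. Resolve: reveal the realized price path over the forecast horizons.
    outcome = market.realized_path(t, horizons=(1, 3, 5, 10))

    # 4. Audit: the professor grades reasoning quality, not just correctness.
    audit = professor.audit(state, analysis, outcome)

    # 5. Reward and 6. Update: convert the audit into a graded scalar and
    # refine scenario weighting, calibration, and confidence expression.
    agent.update(state, analysis, audit.reward)
    return audit.reward
```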
The state representation is deliberately broad — structured data, statistical features, retrieved memory, and optionally visual chart structure. Nothing is assumed irrelevant before evidence.
The professor does not label the agent right or wrong. It evaluates the reasoning process against what actually happened — producing graded feedback that resembles RLHF preference datasets, but derived from market data.
Positive reward: the realized outcome matched the main scenario, confidence was moderate, invalidation was not triggered, and probability mass was well-assigned.
Neutral reward: the realized outcome was an alternative scenario, but it had been assigned meaningful probability; confidence was appropriately low and the reasoning was grounded.
Negative reward: the realized outcome had been assigned near-zero probability, confidence was high, and invalidation conditions were missing or ignored.
Example (rewarded): the agent correctly weighted the volatility regime and macro context, reducing conviction before a breakout event.
Example (penalized): the agent anchored too heavily on one data source, ignoring retrieved historical cases that showed similar setups resolving against the main view.
Example (rewarded): although the main scenario missed, the agent produced an actionable scenario map with clear confirmation conditions. Branching quality scored positively.
A correct prediction still earns a weak reward if reasoning was poorly grounded or overconfident. A wrong main scenario may still earn a positive signal if the realized outcome was assigned meaningful probability and confidence was calibrated.
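To make the grading concrete, a minimal scoring rule might combine a calibration term, an overconfidence penalty, and a process check. This is an illustrative sketch, not the production reward: the weights, the log-score choice, and the function signature are all assumptions.

```python
import math

def graded_reward(prob_realized: float, confidence: float,
                  invalidation_respected: bool) -> float:
    """Graded reward for one episode (illustrative).

    prob_realized: probability the agent assigned to the scenario
        that actually occurred, in [0, 1].
    confidence: the agent's stated confidence, in [0, 1].
    invalidation_respected: whether invalidation conditions existed
        and were honored.
    """
    # Calibration term: a log score rewards probability mass placed on
    # the realized outcome and heavily penalizes near-zero assignments.
    calibration = math.log(max(prob_realized, 1e-6))

    # Overconfidence penalty: high confidence paired with low realized
    # probability is the worst case (the negative-reward pattern above).
    overconfidence = confidence * (1.0 - prob_realized)

    # Process term: missing or ignored invalidation logic is penalized
    # even when the directional call happened to be right.
    process = 0.0 if invalidation_respected else -1.0

    return calibration - 2.0 * overconfidence + process
```

Under this kind of rule, a confident miss (near-zero realized probability, high confidence) scores far below a humble miss that still assigned the outcome meaningful probability, matching the graded cases above.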
This reward structure is then used to build preference datasets and reward models analogous to those in RLHF workflows — but grounded entirely in market evidence.
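From those graded rewards, preference pairs can be assembled much as RLHF pipelines assemble human comparisons. A sketch, assuming several candidate analyses are scored per market state; `margin` is a hypothetical threshold that filters out noisy comparisons.

```python
def build_preference_pairs(scored_episodes, margin=0.5):
    """Turn reward-scored analyses into (chosen, rejected) pairs.

    scored_episodes: iterable of (state_id, analysis, reward) triples,
        with several candidate analyses per market state.
    Returns dicts suitable for preference-based fine-tuning (e.g. DPO).
    """
    by_state = {}
    for state_id, analysis, reward in scored_episodes:
        by_state.setdefault(state_id, []).append((analysis, reward))

    pairs = []
    for state_id, candidates in by_state.items():
        for chosen, r_hi in candidates:
            for rejected, r_lo in candidates:
                # Keep only pairs with a meaningful reward gap, so noisy
                # market feedback does not create false preferences.
                if r_hi - r_lo >= margin:
                    pairs.append({"state": state_id,
                                  "chosen": chosen,
                                  "rejected": rejected})
    return pairs
```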
The Gym transforms historical market data into learning episodes. Once stable, any student agent, professor, reward model, or multimodal architecture can be evaluated consistently against the same environment, separating agent from environment. A minimal interface sketch follows the episode steps below.
1. Observation: market context, features, volatility regime, and macro event flags.
2. Prompt construction: a structured prompt with retrieved memory, baseline priors, and a chart image.
3. Agent output: scenario analysis, probabilities, confidence, and invalidation conditions.
4. Outcome reveal: T+1…T+10 revealed only after the agent commits to an analysis.
5. Audit: graded evaluation of reasoning quality against the realized outcome.
6. Reward: composite score used to update the policy or build preference pairs.
7. Memory: episode stored, and the retrieval index updated for future similarity search.
8. Advance: walk forward to T+1; train/validation/test splits are enforced by time.
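As an interface, the Gym can follow the familiar reset/step contract. A sketch under assumed data structures: the record fields, the `professor` interface, and the class name are placeholders, not the actual implementation.

```python
class MarketReasoningGym:
    """Walk-forward environment over historical market data (a sketch).

    Each step exposes one market state, accepts a committed analysis,
    and only then reveals the realized T+1..T+10 path for auditing.
    Time-ordered splits prevent look-ahead leakage.
    """

    HORIZONS = (1, 3, 5, 10)

    def __init__(self, history, professor, start=0):
        self.history = history      # time-ordered market records (assumed dicts)
        self.professor = professor  # audit component, assumed interface
        self.t = start

    def reset(self, start=0):
        self.t = start
        return self._observation()

    def _observation(self):
        # Only information available at time t: no future leakage.
        rec = self.history[self.t]
        return {"features": rec["features"],
                "regime": rec["regime"],
                "macro_flags": rec["macro_flags"]}

    def step(self, analysis):
        # Outcomes are revealed only after the agent commits its analysis.
        outcome = [self.history[self.t + h]["price"] for h in self.HORIZONS]
        audit = self.professor.audit(self._observation(), analysis, outcome)
        self.t += 1  # walk forward to T+1
        done = self.t + max(self.HORIZONS) >= len(self.history)
        return self._observation(), audit.reward, done, {"audit": audit}
```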
The first policy is a reasoning policy: how to weight scenarios, calibrate probability, express confidence, and adapt to regime changes. Only once that is disciplined does execution quality follow.
- Calibration updates: adjust scenario probabilities based on the accumulated reward signal across historical episodes.
- Regime-conditioned weighting: learn which scenario weighting strategy works best per volatility regime or market structure.
- Preference-based fine-tuning: generate candidate analyses, rank them by reward, build preference pairs, and fine-tune the agent.
- Reward-model selection: train a reward model from audit labels and use it to select the best analysis at inference time (see the sketch after this list).
- Offline learning: learn from a fixed historical dataset of episodes without live market interaction.
- Reward-conditioned generation: condition the agent on target reward levels, enabling goal-conditioned generation of reasoning.
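As one concrete instance, reward-model selection at inference time reduces to best-of-N sampling. A sketch; `agent.analyze` (assumed stochastic) and `reward_model.score` are assumed interfaces, not the system's actual API.

```python
def best_of_n(agent, reward_model, state, n=8):
    """Sample n candidate analyses and return the one the learned reward
    model scores highest (best-of-N selection; interfaces are assumed)."""
    candidates = [agent.analyze(state) for _ in range(n)]
    return max(candidates, key=lambda a: reward_model.score(state, a))
```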
Charts contain information that is difficult to fully encode numerically: compression and expansion rhythm, failed breakout anatomy, volatility clustering, support/resistance interaction. Our multimodal extension tests whether vision-language models can use this additional modality within the same feedback loop.
Baseline agent: statistical representation only (OHLCV, volatility regime, momentum, retrieved priors). This representation is standard input for all agents.
Multimodal agent: a vision-language model receives fixed chart images alongside the structured features, and is evaluated under the same RLMF (reinforcement learning from market feedback) loop for calibration uplift.
Research question: can multimodal agents use visual market structure to improve calibrated scenario reasoning when grounded by market feedback?
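One way to make "calibration uplift" measurable is to score both agents on the same held-out episodes with a proper scoring rule such as the Brier score and compare the means. A sketch, treating the realized scenario as a binary event; the pairing of per-episode probabilities is an assumption.

```python
def brier_score(prob_realized):
    """Brier-style score for the probability the agent assigned to the
    scenario that actually occurred, treated as a binary event.
    0.0 is perfect, 1.0 is worst."""
    return (1.0 - prob_realized) ** 2

def calibration_uplift(baseline_probs, multimodal_probs):
    """Mean Brier improvement of the multimodal agent over the baseline
    on the same held-out episodes; positive values indicate uplift."""
    assert len(baseline_probs) == len(multimodal_probs)
    base = sum(map(brier_score, baseline_probs)) / len(baseline_probs)
    multi = sum(map(brier_score, multimodal_probs)) / len(multimodal_probs)
    return base - multi
```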
An RLMF agent analyzing a live market state across multiple reasoning layers — pattern recognition, historical retrieval, scenario generation, probability calibration. Each layer feeds the next.