THE FINANCE LAB
Research Stream · 001

Reinforcement Learning from Market Feedback

Training AI agents to improve their market reasoning through historical outcomes, structured feedback, and reward-based policy refinement.

01 · The Idea

Markets are the feedback signal.

In language models, RLHF uses human preferences to improve model behavior. In market reasoning, the feedback source is different: the market itself provides delayed, noisy, and probabilistic feedback through realized outcomes.

Rather than treating financial forecasting as a static supervised problem, we model it as an iterative decision-and-feedback process. The objective is not deterministic prediction — it is calibrated scenario reasoning: multiple possible outcomes, probability estimates, confidence levels, and explicit uncertainty.

Framework  | Feedback source         | Policy target
RLHF       | Human preference        | Language quality
RLMF       | Market outcome + audit  | Scenario calibration
Supervised | Labeled data            | Point prediction

02 · The Loop

Six steps. One continuous cycle.

Every historical time step is a learning episode. The agent observes, reasons, the market resolves, and the system audits — converting outcomes into reward signals that improve the policy.

T · Market State at Time T
Price, volume, volatility regime, trend, momentum, macro context, retrieved historical cases and baseline probabilities.

A · Agent Scenario Analysis
Produces main scenario, alternatives, probability estimates, confidence, risk factors, confirmation and invalidation conditions.

O · Future Market Outcome
Horizons T+1, T+3, T+5, T+10 become known. Realized price path compared against agent scenarios.

P · Professor / Audit Layer
Evaluates reasoning quality, not just correctness. Was confidence justified? Were probabilities well-calibrated? Did invalidation logic hold?

R · Reward Signal
Graded reward: positive for calibrated correct reasoning, neutral for reasonable misses, negative for overconfident wrong analysis.

Policy Improvement
Scenario weighting, probability calibration, confidence expression, regime adaptation, all updated from the reward signal.
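
Read as code, one pass through the cycle might look like the sketch below. Every name here (market.state_at, agent.analyze, professor.audit, policy.update) is a hypothetical placeholder for the components above, not a fixed API.

    # One RLMF learning episode, written as a minimal sketch.
    # All objects and method names are hypothetical placeholders.
    def run_episode(t, market, agent, professor, policy):
        state = market.state_at(t)                    # T · observe market state
        analysis = agent.analyze(state)               # A · scenario analysis
        outcome = market.outcome(t, (1, 3, 5, 10))    # O · realized path at T+1..T+10
        audit = professor.audit(analysis, outcome)    # P · grade reasoning quality
        reward = audit.composite_reward()             # R · graded reward signal
        policy.update(state, analysis, reward)        # policy improvement
        return reward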

03 · Market State

What the agent observes at time T.

The state representation is deliberately broad — structured data, statistical features, retrieved memory, and optionally visual chart structure. Nothing is assumed irrelevant before evidence.

Price & volume history
Technical & statistical features
Volatility regime
Trend & momentum descriptors
Support / resistance structure
Event & macro context
Retrieved similar historical cases
Baseline scenario probabilities
Chart images (multimodal)
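
One way to package these observations is a plain dataclass. The sketch below simply mirrors the list; the field names are an illustrative assumption, not a fixed schema.

    from dataclasses import dataclass, field

    # Hypothetical observation container for time T. Fields mirror
    # the list above; none of this is a required schema.
    @dataclass
    class MarketState:
        prices: list[float]                  # price history window
        volumes: list[float]                 # volume history window
        features: dict[str, float]           # technical & statistical features
        vol_regime: str                      # e.g. "LOW", "HIGH"
        trend: str                           # trend & momentum descriptors
        levels: dict[str, float]             # support / resistance structure
        macro_flags: list[str]               # event & macro context
        retrieved_cases: list[dict] = field(default_factory=list)
        baseline_probs: dict[str, float] = field(default_factory=dict)
        chart_image: bytes | None = None     # optional multimodal input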

What the agent produces.

Main Scenario: Continuation
Alt — Reversal: Bearish
Alt — Stall: Wait / Confirm
Confidence: Moderate · 0.61
Invalidation: Break below 3-day VWAP
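
The output card above translates naturally into a small structured type. This sketch is one possible shape, with illustrative field names and the example's values in comments.

    from dataclasses import dataclass

    # Hypothetical structure for the agent's analysis.
    @dataclass
    class Scenario:
        label: str           # e.g. "Continuation", "Reversal", "Stall"
        stance: str          # e.g. "Bullish", "Bearish", "Wait / Confirm"
        probability: float

    @dataclass
    class ScenarioAnalysis:
        main: Scenario
        alternatives: list[Scenario]
        confidence: float    # e.g. 0.61 ("Moderate")
        invalidation: str    # e.g. "Break below 3-day VWAP"
        confirmation: str | None = None
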
04 · The Professor

An audit layer that grades reasoning quality, not just outcomes.

The professor does not label the agent right or wrong. It evaluates the reasoning process against what actually happened — producing graded feedback that resembles RLHF preference datasets, but derived from market data.

+ Positive reward · Calibrated correct reasoning
Realized outcome was the main scenario. Confidence was moderate. Invalidation was not triggered. Probability was well-assigned.

→ Neutral reward · Reasonable probabilistic miss
Realized outcome was an alternative scenario that had been assigned meaningful probability. Confidence was appropriately low. Reasoning was grounded.

— Negative reward · Overconfident wrong analysis
Realized outcome was assigned near-zero probability. Confidence was high. Invalidation conditions were missing or ignored.

+ Positive reward · Regime-aware adjustment
Agent correctly weighted volatility regime and macro context, reducing conviction before a breakout event.

— Negative reward · Evidence overweight
Agent anchored too heavily on one data source, ignoring retrieved historical cases showing similar setups resolving against the main view.

→ Neutral reward · Useful scenario branching
Although the main scenario missed, the agent produced an actionable scenario map with clear confirmation conditions. Branching quality scored positively.
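
The grading rules implied by these examples can be sketched as a small decision function. The thresholds below are illustrative assumptions, not calibrated values from the framework.

    # Hypothetical three-way grading over an audited episode.
    # `probs` maps scenario labels to agent-assigned probabilities.
    def grade(probs: dict[str, float], confidence: float, realized: str) -> str:
        p_realized = probs.get(realized, 0.0)
        main = max(probs, key=probs.get)              # highest-probability scenario
        if realized == main and confidence < 0.80:
            return "positive"                         # calibrated correct reasoning
        if p_realized >= 0.20 and confidence < 0.70:
            return "neutral"                          # reasonable probabilistic miss
        if p_realized < 0.05 and confidence >= 0.70:
            return "negative"                         # overconfident wrong analysis
        return "neutral"                              # everything in between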

05 · Reward Model

Reward is a composite, not a single score.

A correct prediction still earns a weak reward if reasoning was poorly grounded or overconfident. A wrong main scenario may still earn a positive signal if the realized outcome was assigned meaningful probability and confidence was calibrated.

R_total = w₁ · ScenarioCorrectness
        + w₂ · ProbabilityCalibration
        + w₃ · ConfidenceCalibration
        + w₄ · RiskAwareness
        − w₅ · DrawdownExposure
        + w₆ · InvalidationLogicQuality

This reward structure is then used to build preference datasets and reward models analogous to those in RLHF workflows — but grounded entirely in market evidence.
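
Computed directly, the composite is just a signed weighted sum. The weights below are illustrative assumptions; only DrawdownExposure enters with a negative sign, as in the formula above.

    # Hypothetical realization of R_total as a signed weighted sum.
    def composite_reward(components: dict[str, float],
                         weights: dict[str, float]) -> float:
        signs = {"DrawdownExposure": -1.0}      # the only penalized term
        return sum(weights[name] * signs.get(name, 1.0) * value
                   for name, value in components.items())

    # Illustrative weights only; tuning them is part of the research.
    weights = {
        "ScenarioCorrectness": 1.0,
        "ProbabilityCalibration": 0.8,
        "ConfidenceCalibration": 0.6,
        "RiskAwareness": 0.5,
        "DrawdownExposure": 0.7,
        "InvalidationLogicQuality": 0.4,
    }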

06 · Market Gym

A structured environment for walk-forward learning.

The Gym transforms historical market data into learning episodes. Once stable, any student agent, professor, reward model, or multimodal architecture can be evaluated consistently against the same environment — separating agent from environment.

01 · State at T
Market context, features, volatility regime, macro event flags.

02 · Agent Input
Structured prompt with retrieved memory, baseline priors, chart image.

03 · Student Response
Scenario analysis, probabilities, confidence, invalidation conditions.

04 · Hidden Outcome
T+1…T+10 revealed after the agent commits to its analysis.

05 · Professor Audit
Graded evaluation of reasoning quality against the realized outcome.

06 · Reward Signal
Composite score used to update the policy or build preference pairs.

07 · Memory Update
Episode stored. Retrieval index updated for future similarity search.

08 · Next Episode
Walk-forward to T+1. Train / validation / test splits enforced by time.
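
End to end, the eight steps reduce to a short walk-forward loop. Every object below (episodes, agent, professor, memory, policy) is a hypothetical stand-in for the components named above.

    # Walk-forward sketch of the Gym loop. Time ordering of `episodes`
    # enforces the no-look-ahead splits of step 08.
    def run_gym(episodes, agent, professor, memory, policy):
        rewards = []
        for ep in episodes:
            prompt = ep.build_input(memory)             # 01-02 · state + prompt
            analysis = agent.respond(prompt)            # 03 · student response
            outcome = ep.reveal_outcome()               # 04 · hidden T+1..T+10
            audit = professor.audit(analysis, outcome)  # 05 · professor audit
            reward = audit.composite_reward()           # 06 · reward signal
            policy.update(ep.state, analysis, reward)
            memory.store(ep, analysis, audit)           # 07 · memory update
            rewards.append(reward)                      # 08 · next episode
        return rewards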

07 · Policy Improvement

The policy improves reasoning, not just execution.

The policy learned first is a reasoning policy: how to weight scenarios, calibrate probabilities, express confidence, and adapt to regime changes. Only once that is disciplined does execution quality follow.

01 · Reward-weighted regression
Adjust scenario probabilities based on the accumulated reward signal across historical episodes.

02 · Contextual bandits
Learn which scenario weighting strategy works best per volatility regime or market structure.

03 · Preference optimization (DPO / ORPO)
Generate candidate analyses, rank by reward, build preference pairs, fine-tune the agent (sketched after this list).

04 · Reward model reranking
Train a reward model from audit labels and use it to select the best analysis at inference time.

05 · Offline reinforcement learning
Learn from a fixed historical dataset of episodes without live market interaction.

06 · Decision transformers
Condition the agent on target reward levels, enabling goal-conditioned generation of reasoning.
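
For method 03, preference pairs fall out of reward-ranked candidates. The sketch below assumes each episode yields several candidate analyses scored by the composite reward; the margin is an illustrative threshold, not a prescribed value.

    from itertools import combinations

    # Hypothetical DPO-style pair construction. `candidates` is a list
    # of (analysis_text, reward) tuples for one episode.
    def preference_pairs(candidates, margin=0.2):
        pairs = []
        for (text_a, r_a), (text_b, r_b) in combinations(candidates, 2):
            if abs(r_a - r_b) >= margin:        # keep clearly separated pairs
                chosen, rejected = (text_a, text_b) if r_a > r_b else (text_b, text_a)
                pairs.append({"chosen": chosen, "rejected": rejected})
        return pairs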

08 · Multimodal Extension

Does visual structure add signal beyond engineered features?

Charts contain information that is difficult to fully encode numerically: compression and expansion rhythm, failed breakout anatomy, volatility clustering, support/resistance interaction. Our multimodal extension tests whether vision-language models can use this additional modality within the same feedback loop.

Research question: can multimodal agents use visual market structure to improve calibrated scenario reasoning when grounded by market feedback?
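
The only new plumbing the extension needs is a per-episode chart image. A minimal rendering sketch using matplotlib (an assumption; any renderer works) could look like this; what the vision-language model extracts from the image is exactly the research question.

    import io
    import matplotlib
    matplotlib.use("Agg")                    # headless rendering
    import matplotlib.pyplot as plt

    # Render the price window at T into PNG bytes, suitable for the
    # optional chart-image field of the market state.
    def render_chart(prices: list[float]) -> bytes:
        fig, ax = plt.subplots(figsize=(4, 2.5), dpi=100)
        ax.plot(prices, linewidth=1.0)
        ax.set_title("Price window at T")
        buf = io.BytesIO()
        fig.savefig(buf, format="png")
        plt.close(fig)
        return buf.getvalue()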

Live Demo · RLMF Agent

Watch the agent reason.

An RLMF agent analyzing a live market state across multiple reasoning layers — pattern recognition, historical retrieval, scenario generation, probability calibration. Each layer feeds the next.

MARKET STATE · SPX · T+0 · 26 APR 2026 · 14:22 UTC
Price: 5,842.11 · Vol regime: LOW · Trend: UPTREND D1 · ATR(14): 38.2 · Retrieved cases: 12
Layer 1 — Market Structure Recognition
Layer 2 — Historical Case Retrieval
Layer 3 — Volatility & Regime Context
Layer 4 — Scenario Generation
Layer 5 — Probability Calibration
STRUCTURED OUTPUT · RLMF AGENT v4.2
Main Scenario · Bullish Continuation · P = 0.68
Alt — Failed Breakout · Reversal · P = 0.22
Alt — Range Bound · Consolidation · P = 0.10
Confidence: Moderate · 0.62
Horizon: T+1 · T+3 · T+5
Confirmation: Close above 5,870 on vol expansion
Invalidation: Break below 5,800 on daily close
Reward signal: Pending market resolution at T+1…