THE FINANCE LAB
Research Stream · 001

Reinforcement Learning from Market Feedback

Training AI agents to improve their market reasoning through historical outcomes, structured feedback, and reward-based policy refinement.

01 · The Idea

Markets are the feedback signal.

In language models, RLHF uses human preferences to improve model behavior. In market reasoning, the feedback source is different: the market itself provides delayed, noisy, and probabilistic feedback through realized outcomes.

Rather than treating financial forecasting as a static supervised problem, we model it as an iterative decision-and-feedback process. The objective is not deterministic prediction — it is calibrated scenario reasoning: multiple possible outcomes, probability estimates, confidence levels, and explicit uncertainty.

Framework  | Feedback source         | Policy target
RLHF       | Human preference        | Language quality
RLMF       | Market outcome + audit  | Scenario calibration
Supervised | Labeled data            | Point prediction

02 · The Loop

Six steps. One continuous cycle.

Every historical time step is a learning episode. The agent observes, reasons, the market resolves, and the system audits — converting outcomes into reward signals that improve the policy.

T · Market State at Time T
Price, volume, volatility regime, trend, momentum, macro context, retrieved historical cases and baseline probabilities.

A · Agent Scenario Analysis
Produces main scenario, alternatives, probability estimates, confidence, risk factors, confirmation and invalidation conditions.

O · Future Market Outcome
Horizons T+1, T+3, T+5, T+10 become known. Realized price path compared against agent scenarios.

P · Professor / Audit Layer
Evaluates reasoning quality, not just correctness. Was confidence justified? Were probabilities well-calibrated? Did invalidation logic hold?

R · Reward Signal
Graded reward: positive for calibrated correct reasoning, neutral for reasonable misses, negative for overconfident wrong analysis.

Policy Improvement
Scenario weighting, probability calibration, confidence expression, regime adaptation, all updated from the reward signal.
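
Read as code, one pass through the cycle might look like the sketch below. Every name here (market.state_at, agent.analyze, professor.audit, policy.update) is a hypothetical placeholder for the components above, not a fixed API.

    # One RLMF learning episode, written as a minimal sketch.
    # All objects and method names are hypothetical placeholders.
    def run_episode(t, market, agent, professor, policy):
        state = market.state_at(t)                    # T · observe market state
        analysis = agent.analyze(state)               # A · scenario analysis
        outcome = market.outcome(t, (1, 3, 5, 10))    # O · realized path at T+1..T+10
        audit = professor.audit(analysis, outcome)    # P · grade reasoning quality
        reward = audit.composite_reward()             # R · graded reward signal
        policy.update(state, analysis, reward)        # policy improvement
        return reward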

03 · Market State

What the agent observes at time T.

The state representation is deliberately broad — structured data, statistical features, retrieved memory, and optionally visual chart structure. Nothing is assumed irrelevant before evidence.

Price & volume history
Technical & statistical features
Volatility regime
Trend & momentum descriptors
Support / resistance structure
Event & macro context
Retrieved similar historical cases
Baseline scenario probabilities
Chart images (multimodal)
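
One way to package these observations is a plain dataclass. The sketch below simply mirrors the list; the field names are an illustrative assumption, not a fixed schema.

    from dataclasses import dataclass, field

    # Hypothetical observation container for time T. Fields mirror
    # the list above; none of this is a required schema.
    @dataclass
    class MarketState:
        prices: list[float]                  # price history window
        volumes: list[float]                 # volume history window
        features: dict[str, float]           # technical & statistical features
        vol_regime: str                      # e.g. "LOW", "HIGH"
        trend: str                           # trend & momentum descriptors
        levels: dict[str, float]             # support / resistance structure
        macro_flags: list[str]               # event & macro context
        retrieved_cases: list[dict] = field(default_factory=list)
        baseline_probs: dict[str, float] = field(default_factory=dict)
        chart_image: bytes | None = None     # optional multimodal input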

What the agent produces.

Main Scenario: Continuation
Alt — Reversal: Bearish
Alt — Stall: Wait / Confirm
Confidence: Moderate · 0.61
Invalidation: Break below 3-day VWAP
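
The output card above translates naturally into a small structured type. This sketch is one possible shape, with illustrative field names and the example's values in comments.

    from dataclasses import dataclass

    # Hypothetical structure for the agent's analysis.
    @dataclass
    class Scenario:
        label: str           # e.g. "Continuation", "Reversal", "Stall"
        stance: str          # e.g. "Bullish", "Bearish", "Wait / Confirm"
        probability: float

    @dataclass
    class ScenarioAnalysis:
        main: Scenario
        alternatives: list[Scenario]
        confidence: float    # e.g. 0.61 ("Moderate")
        invalidation: str    # e.g. "Break below 3-day VWAP"
        confirmation: str | None = None
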
04 · The Professor

An audit layer that grades reasoning quality, not just outcomes.

The professor does not label the agent right or wrong. It evaluates the reasoning process against what actually happened — producing graded feedback that resembles RLHF preference datasets, but derived from market data.

+ Positive reward · Calibrated correct reasoning
Realized outcome was the main scenario. Confidence was moderate. Invalidation was not triggered. Probability was well-assigned.

→ Neutral reward · Reasonable probabilistic miss
Realized outcome was an alternative scenario that had been assigned meaningful probability. Confidence was appropriately low. Reasoning was grounded.

— Negative reward · Overconfident wrong analysis
Realized outcome was assigned near-zero probability. Confidence was high. Invalidation conditions were missing or ignored.

+ Positive reward · Regime-aware adjustment
Agent correctly weighted volatility regime and macro context, reducing conviction before a breakout event.

— Negative reward · Evidence overweight
Agent anchored too heavily on one data source, ignoring retrieved historical cases showing similar setups resolving against the main view.

→ Neutral reward · Useful scenario branching
Although the main scenario missed, the agent produced an actionable scenario map with clear confirmation conditions. Branching quality scored positively.
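
The grading rules implied by these examples can be sketched as a small decision function. The thresholds below are illustrative assumptions, not calibrated values from the framework.

    # Hypothetical three-way grading over an audited episode.
    # `probs` maps scenario labels to agent-assigned probabilities.
    def grade(probs: dict[str, float], confidence: float, realized: str) -> str:
        p_realized = probs.get(realized, 0.0)
        main = max(probs, key=probs.get)              # highest-probability scenario
        if realized == main and confidence < 0.80:
            return "positive"                         # calibrated correct reasoning
        if p_realized >= 0.20 and confidence < 0.70:
            return "neutral"                          # reasonable probabilistic miss
        if p_realized < 0.05 and confidence >= 0.70:
            return "negative"                         # overconfident wrong analysis
        return "neutral"                              # everything in between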

05 · Reward Model

Reward is a composite, not a single score.

A correct prediction still earns a weak reward if reasoning was poorly grounded or overconfident. A wrong main scenario may still earn a positive signal if the realized outcome was assigned meaningful probability and confidence was calibrated.

R_total = w₁ · ScenarioCorrectness
        + w₂ · ProbabilityCalibration
        + w₃ · ConfidenceCalibration
        + w₄ · RiskAwareness
        − w₅ · DrawdownExposure
        + w₆ · InvalidationLogicQuality

This reward structure is then used to build preference datasets and reward models analogous to those in RLHF workflows — but grounded entirely in market evidence.
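
Computed directly, the composite is just a signed weighted sum. The weights below are illustrative assumptions; only DrawdownExposure enters with a negative sign, as in the formula above.

    # Hypothetical realization of R_total as a signed weighted sum.
    def composite_reward(components: dict[str, float],
                         weights: dict[str, float]) -> float:
        signs = {"DrawdownExposure": -1.0}      # the only penalized term
        return sum(weights[name] * signs.get(name, 1.0) * value
                   for name, value in components.items())

    # Illustrative weights only; tuning them is part of the research.
    weights = {
        "ScenarioCorrectness": 1.0,
        "ProbabilityCalibration": 0.8,
        "ConfidenceCalibration": 0.6,
        "RiskAwareness": 0.5,
        "DrawdownExposure": 0.7,
        "InvalidationLogicQuality": 0.4,
    }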

06 · Market Gym

A structured environment for walk-forward learning.

The Gym transforms historical market data into learning episodes. Once stable, any student agent, professor, reward model, or multimodal architecture can be evaluated consistently against the same environment — separating agent from environment.

01 · State at T
Market context, features, volatility regime, macro event flags.

02 · Agent Input
Structured prompt with retrieved memory, baseline priors, chart image.

03 · Student Response
Scenario analysis, probabilities, confidence, invalidation conditions.

04 · Hidden Outcome
T+1…T+10 revealed after the agent commits to its analysis.

05 · Professor Audit
Graded evaluation of reasoning quality against the realized outcome.

06 · Reward Signal
Composite score used to update the policy or build preference pairs.

07 · Memory Update
Episode stored. Retrieval index updated for future similarity search.

08 · Next Episode
Walk-forward to T+1. Train / validation / test splits enforced by time.
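
End to end, the eight steps reduce to a short walk-forward loop. Every object below (episodes, agent, professor, memory, policy) is a hypothetical stand-in for the components named above.

    # Walk-forward sketch of the Gym loop. Time ordering of `episodes`
    # enforces the no-look-ahead splits of step 08.
    def run_gym(episodes, agent, professor, memory, policy):
        rewards = []
        for ep in episodes:
            prompt = ep.build_input(memory)             # 01-02 · state + prompt
            analysis = agent.respond(prompt)            # 03 · student response
            outcome = ep.reveal_outcome()               # 04 · hidden T+1..T+10
            audit = professor.audit(analysis, outcome)  # 05 · professor audit
            reward = audit.composite_reward()           # 06 · reward signal
            policy.update(ep.state, analysis, reward)
            memory.store(ep, analysis, audit)           # 07 · memory update
            rewards.append(reward)                      # 08 · next episode
        return rewards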

07 · Policy Improvement

The policy improves reasoning, not just execution.

The policy learned first is a reasoning policy: how to weight scenarios, calibrate probabilities, express confidence, and adapt to regime changes. Only once that is disciplined does execution quality follow.

01 · Reward-weighted regression
Adjust scenario probabilities based on the accumulated reward signal across historical episodes.

02 · Contextual bandits
Learn which scenario weighting strategy works best per volatility regime or market structure.

03 · Preference optimization (DPO / ORPO)
Generate candidate analyses, rank by reward, build preference pairs, fine-tune the agent (sketched after this list).

04 · Reward model reranking
Train a reward model from audit labels and use it to select the best analysis at inference time.

05 · Offline reinforcement learning
Learn from a fixed historical dataset of episodes without live market interaction.

06 · Decision transformers
Condition the agent on target reward levels, enabling goal-conditioned generation of reasoning.
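
For method 03, preference pairs fall out of reward-ranked candidates. The sketch below assumes each episode yields several candidate analyses scored by the composite reward; the margin is an illustrative threshold, not a prescribed value.

    from itertools import combinations

    # Hypothetical DPO-style pair construction. `candidates` is a list
    # of (analysis_text, reward) tuples for one episode.
    def preference_pairs(candidates, margin=0.2):
        pairs = []
        for (text_a, r_a), (text_b, r_b) in combinations(candidates, 2):
            if abs(r_a - r_b) >= margin:        # keep clearly separated pairs
                chosen, rejected = (text_a, text_b) if r_a > r_b else (text_b, text_a)
                pairs.append({"chosen": chosen, "rejected": rejected})
        return pairs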

08 · Multimodal Extension

Does visual structure add signal beyond engineered features?

Charts contain information that is difficult to fully encode numerically: compression and expansion rhythm, failed breakout anatomy, volatility clustering, support/resistance interaction. Our multimodal extension tests whether vision-language models can use this additional modality within the same feedback loop.

Research question: can multimodal agents use visual market structure to improve calibrated scenario reasoning when grounded by market feedback?
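
The only new plumbing the extension needs is a per-episode chart image. A minimal rendering sketch using matplotlib (an assumption; any renderer works) could look like this; what the vision-language model extracts from the image is exactly the research question.

    import io
    import matplotlib
    matplotlib.use("Agg")                    # headless rendering
    import matplotlib.pyplot as plt

    # Render the price window at T into PNG bytes, suitable for the
    # optional chart-image field of the market state.
    def render_chart(prices: list[float]) -> bytes:
        fig, ax = plt.subplots(figsize=(4, 2.5), dpi=100)
        ax.plot(prices, linewidth=1.0)
        ax.set_title("Price window at T")
        buf = io.BytesIO()
        fig.savefig(buf, format="png")
        plt.close(fig)
        return buf.getvalue()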

Live Demo · RLMF Agent

Watch the agent reason.

An RLMF agent analyzing a live market state across multiple reasoning layers — pattern recognition, historical retrieval, scenario generation, probability calibration. Each layer feeds the next.

MARKET STATE · SPX · T+0 · 26 APR 2026 · 14:22 UTC
Price: 5,842.11 · Vol regime: LOW · Trend: UPTREND D1 · ATR(14): 38.2 · Retrieved cases: 12
Layer 1 — Market Structure Recognition
Layer 2 — Historical Case Retrieval
Layer 3 — Volatility & Regime Context
Layer 4 — Scenario Generation
Layer 5 — Probability Calibration
STRUCTURED OUTPUT · RLMF AGENT v4.2
Main Scenario · Bullish Continuation · P = 0.68
Alt — Failed Breakout · Reversal · P = 0.22
Alt — Range Bound · Consolidation · P = 0.10
Confidence: Moderate · 0.62
Horizon: T+1 · T+3 · T+5
Confirmation: Close above 5,870 on vol expansion
Invalidation: Break below 5,800 on daily close
Reward signal: Pending market resolution at T+1…