Fine-Tuning LLMs with RL for ML Code Generation: Post-Mortem

Following up from my talk at the Numerai Symposium today - thanks to everyone who attended and especially those who asked great questions afterward. The feedback inspired me to dig into what the LLM was actually writing and make some adjustments.

Quick Recap

I built a system that uses reinforcement learning to teach Mistral-7B to write ML code for crypto trading. The model generates Python code, the code runs against real data, and the Sharpe ratio becomes the reward signal for PPO updates.
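
For concreteness, the reward is just an annualized Sharpe ratio computed from the backtest returns of whatever strategy the generated code produces. Here’s a minimal sketch of that calculation - the annualization factor is illustrative and the risk-free rate is ignored:

```python
import numpy as np

def sharpe_reward(returns: np.ndarray, periods_per_year: int = 365) -> float:
    """Annualized Sharpe ratio of per-period strategy returns.

    This single scalar is what gets fed back to PPO as the reward.
    """
    std = returns.std()
    if std == 0:
        return 0.0  # a flat strategy earns nothing
    return float(returns.mean() / std * np.sqrt(periods_per_year))

# Toy example: random daily returns stand in for real backtest output.
rng = np.random.default_rng(0)
print(sharpe_reward(rng.normal(loc=0.0005, scale=0.01, size=365)))
```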

After running overnight (775 experiments, 48 PPO cycles), I got curious about what code the model was actually producing.

What I Found

Someone asked: “What models is it trying?”

So I checked:

| Model Type      | Usage |
|-----------------|-------|
| LightGBM        | 96%   |
| XGBoost         | 0%    |
| Neural Nets     | 0%    |
| Everything else | 0%    |

It used LightGBM 96% of the time and never tried anything else.
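
The check doesn’t need anything fancy - scanning the stored snippets for which library each one imports is enough to rebuild the table above. Something along these lines (the directory layout and marker list are illustrative; a rough sketch rather than the exact script):

```python
from collections import Counter
from pathlib import Path

# First matching marker wins; anything without a match is "Everything else".
MARKERS = {"lightgbm": "LightGBM", "xgboost": "XGBoost", "torch": "Neural Nets"}

def count_model_usage(code_dir: str) -> Counter:
    counts = Counter()
    for path in Path(code_dir).glob("*.py"):  # one file per experiment (assumed)
        code = path.read_text()
        label = next(
            (name for marker, name in MARKERS.items() if marker in code),
            "Everything else",
        )
        counts[label] += 1
    return counts

# print(count_model_usage("experiments/generated_code"))
```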

Same story with features - 99.9% of experiments used the template features, and only 1 out of 775 attempted custom feature engineering.

The Problem: My Prompt

Re-reading my prompt, I realized I’d been too specific. I included a complete working example using LightGBM:

"model_config": "import lightgbm as lgb…"

The model learned to copy and modify this example rather than explore alternatives. Classic exploitation over exploration.

The Fix

New prompt - just the rules, no examples:

AVAILABLE PACKAGES:

numpy, pandas, scikit-learn, scipy, statsmodels, lightgbm, xgboost, torch

Model must have .fit() and .predict_proba() methods.

Be creative. Try different models, features, parameters.

Now that PyTorch is available, the model can try neural networks. XGBoost, random forests, SVMs - all fair game. Let’s see what it discovers on its own.
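
For a sense of what that contract allows, here’s one hypothetical shape a non-LightGBM submission could take: a tiny PyTorch classifier wrapped to expose the required .fit() and .predict_proba() methods. Nothing like this has shown up in the runs yet - it’s purely an illustration of the interface:

```python
import numpy as np
import torch
import torch.nn as nn

class TorchMLP:
    """Minimal MLP for binary labels, exposing the required sklearn-style API."""

    def __init__(self, n_features: int, hidden: int = 32, epochs: int = 50, lr: float = 1e-3):
        self.net = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU(), nn.Linear(hidden, 2))
        self.epochs, self.lr = epochs, lr

    def fit(self, X, y):
        X = torch.as_tensor(np.asarray(X), dtype=torch.float32)
        y = torch.as_tensor(np.asarray(y), dtype=torch.long)  # 0/1 class labels
        opt = torch.optim.Adam(self.net.parameters(), lr=self.lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(self.epochs):  # full-batch training, kept simple
            opt.zero_grad()
            loss = loss_fn(self.net(X), y)
            loss.backward()
            opt.step()
        return self

    def predict_proba(self, X):
        X = torch.as_tensor(np.asarray(X), dtype=torch.float32)
        with torch.no_grad():
            return torch.softmax(self.net(X), dim=1).numpy()

# Usage mirrors any sklearn-style estimator:
# probs = TorchMLP(n_features=X_train.shape[1]).fit(X_train, y_train).predict_proba(X_val)
```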

Why 43% of Experiments Failed

| Error Type   | %   | Cause                                              |
|--------------|-----|----------------------------------------------------|
| KeyError     | 34% | Tried accessing DataFrame columns that don’t exist |
| Syntax Error | 12% | Malformed Python code                              |
| TypeError    | 12% | Wrong argument types                               |
| JSON Parse   | 9%  | LLM output wasn’t valid JSON                       |

The biggest issue - KeyError - happened because the prompt didn’t specify what columns were available in the data. The model would try things like full_data_base['target_data'] when that column doesn’t exist.

Fixed in the new prompt by explicitly listing the available columns.
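
Bucketing failures like these is just a matter of catching the exception class when a generated snippet runs. Below is a simplified sketch - the DataFrame schema and harness shape are assumptions, but the exception classes map directly to the table above:

```python
import json
from typing import Optional

import pandas as pd

# Illustrative schema; the real column list is what the new prompt now spells out.
DATA = pd.DataFrame(columns=["open", "high", "low", "close", "volume"])

def classify_failure(llm_output: str) -> Optional[str]:
    """Return an error bucket, or None if the generated snippet ran cleanly."""
    try:
        config = json.loads(llm_output)
    except json.JSONDecodeError:
        return "JSON Parse"
    try:
        # The generated code sees the training data as `full_data_base`.
        exec(config["model_config"], {"full_data_base": DATA})
    except SyntaxError:
        return "Syntax Error"
    except KeyError:
        return "KeyError"  # e.g. indexing a column that doesn't exist
    except TypeError:
        return "TypeError"
    return None

# The KeyError case from above:
print(classify_failure('{"model_config": "full_data_base[\'target_data\']"}'))
```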

Run 2 Started

Reset the model weights back to base Mistral and cleared the replay buffer. The new run is live with the open-ended prompt.

This is the flexibility that RL provides - same training loop, different prompt, potentially very different outcomes. Will report back with what the model tries when given freedom to explore.