Following up from my talk at the Numerai Symposium today - thanks to everyone who attended and especially those who asked great questions afterward. The feedback inspired me to dig into what the LLM was actually writing and make some adjustments.
Quick Recap
I built a system that uses reinforcement learning to teach Mistral-7B to write ML code for crypto trading. The model generates Python code, the code runs against real data, and the Sharpe ratio becomes the reward signal for PPO updates.
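The reward step can be sketched like this (a minimal sketch; the annualization factor and the zero-volatility fallback are my assumptions, not the run's exact code):

```python
import numpy as np

def sharpe_reward(period_returns, periods_per_year=365):
    """Turn a backtest's per-period returns into a scalar PPO reward:
    the annualized Sharpe ratio of the generated strategy."""
    r = np.asarray(period_returns, dtype=float)
    if r.std() == 0:
        return 0.0  # flat strategy: zero reward rather than a divide-by-zero
    return float(r.mean() / r.std() * np.sqrt(periods_per_year))
```

The scalar this returns is all PPO ever sees - the generated code itself is opaque to the optimizer.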
After running overnight (775 experiments, 48 PPO cycles), I got curious about what code the model was actually producing.
What I Found
Someone asked: "What models is it trying?"
So I checked:
| Model Type | Usage |
|---|---|
| LightGBM | 96% |
| XGBoost | 0% |
| Neural Nets | 0% |
| Everything else | 0% |
It used LightGBM 96% of the time and never tried anything else.
Same story with features - 99.9% of experiments used the template features, only 1 out of 775 attempted custom feature engineering.
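For context, "custom feature engineering" here means anything beyond the template columns - for example, simple rolling transforms like these (the column name and window sizes are illustrative assumptions):

```python
import pandas as pd

def add_rolling_features(df: pd.DataFrame, price_col: str = "close") -> pd.DataFrame:
    """Example of features the model almost never wrote on its own:
    momentum and volatility derived from a price column."""
    out = df.copy()
    out["ret_1"] = out[price_col].pct_change()    # 1-period return
    out["mom_7"] = out[price_col].pct_change(7)   # 7-period momentum
    out["vol_7"] = out["ret_1"].rolling(7).std()  # 7-period volatility
    return out
```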
The Problem: My Prompt
Re-reading my prompt, I realized I'd been too specific. I included a complete working example using LightGBM:
"model_config": "import lightgbm as lgb…"
The model learned to copy and modify this example rather than explore alternatives. Classic exploitation over exploration.
The Fix
New prompt - just the rules, no examples:
AVAILABLE PACKAGES:
numpy, pandas, scikit-learn, scipy, statsmodels, lightgbm, xgboost, torch
Model must have .fit() and .predict_proba() methods.
Be creative. Try different models, features, parameters.
Now PyTorch is available, meaning the model could try neural networks. XGBoost, random forests, SVMs - all fair game. Let's see what it discovers on its own.
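The only hard constraint left is the interface: anything with .fit() and .predict_proba() qualifies. Here's a toy plain-numpy logistic regression satisfying that contract (the hyperparameters and the sklearn-style two-column probability output are my assumptions):

```python
import numpy as np

class NumpyLogReg:
    """Minimal model meeting the prompt's contract: .fit() and .predict_proba().
    Plain-numpy logistic regression trained by gradient descent."""

    def __init__(self, lr=0.1, epochs=500):
        self.lr, self.epochs = lr, epochs

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        self.w = np.zeros(X.shape[1])
        self.b = 0.0
        for _ in range(self.epochs):
            p = 1.0 / (1.0 + np.exp(-(X @ self.w + self.b)))  # sigmoid
            grad = p - y                                      # dLoss/dlogit
            self.w -= self.lr * X.T @ grad / len(y)
            self.b -= self.lr * grad.mean()
        return self

    def predict_proba(self, X):
        p = 1.0 / (1.0 + np.exp(-(np.asarray(X, dtype=float) @ self.w + self.b)))
        return np.column_stack([1.0 - p, p])  # sklearn-style shape (n, 2)
```

Any generated class - LightGBM wrapper, PyTorch MLP, or something like this - plugs into the same evaluation harness as long as it honors those two methods.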
Why 43% Failed to Run
| Error Type | % of Failures | Cause |
|---|---|---|
| KeyError | 34% | Tried accessing DataFrame columns that don't exist |
| Syntax Error | 12% | Malformed Python code |
| TypeError | 12% | Wrong argument types |
| JSON Parse | 9% | LLM output wasn't valid JSON |
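The JSON failures are the easiest to guard against before anything executes. A heuristic sketch (not the run's actual parser) that tolerates prose or markdown fences around the object:

```python
import json
import re

def parse_model_config(llm_output: str):
    """Extract the first JSON object from raw LLM output.
    Returns the parsed dict, or None if nothing parseable is found."""
    match = re.search(r"\{.*\}", llm_output, re.DOTALL)  # greedy: outermost braces
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
```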
The biggest issue - KeyError - happened because the prompt didn't specify what columns were available in the data. The model would try things like full_data_base["target_data"] when that column doesn't exist.
Fixed in the new prompt by explicitly listing the available columns.
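Listing columns in the prompt reduces these errors but won't eliminate them, so it's also worth screening generated code before running it. A regex heuristic (the column set here is an illustrative assumption, not the real schema):

```python
import re

AVAILABLE_COLUMNS = {"open", "high", "low", "close", "volume"}  # illustrative schema

def check_column_refs(code: str, available=AVAILABLE_COLUMNS):
    """Find DataFrame-style accesses like df["name"] in generated code and
    return any referenced column names missing from the available set."""
    referenced = set(re.findall(r'\[\s*[\'"](\w+)[\'"]\s*\]', code))
    return sorted(referenced - available)

check_column_refs('full_data_base["target_data"]')  # → ['target_data']
```

Flagged experiments can be rejected for a fixed penalty reward instead of burning a full backtest on a guaranteed KeyError.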
Run 2 Started
Reset the model weights back to base Mistral and cleared the replay buffer. The new run is live with the open-ended prompt.
This is the flexibility that RL provides - same training loop, different prompt, potentially very different outcomes. Will report back with what the model tries when given freedom to explore.