Synthetic data generation using GANs

rtachinardi · September 5, 2021, 10:57pm

In traditional quant projects, generating synthetic data to support backtesting is becoming common practice. There is a good summary of it in Appendix A of “Machine Learning for Asset Managers” by Marcos Lopez de Prado.

There are various methods for generating synthetic data, and one of the most promising is GANs (Generative Adversarial Networks).

I have searched the forum for discussions on this topic but couldn’t find any, so I’m posting this to bring it up. What do you think? Could it be useful in Numerai?

yxbot · September 6, 2021, 11:44am

Here is a repo that provides some good resources on using GANs to generate synthetic data:

I haven’t tried it out, though. I nearly used the PATE-GAN model in one of my work projects, but in the end I didn’t bother.

rtachinardi · September 6, 2021, 2:49pm

Thank you for the tip, I will check it out.

Are you using any other techniques besides GANs to generate synthetic data? Or would you say they aren’t needed in the tournament?

yxbot · September 6, 2021, 2:52pm

No, I haven’t tried that approach so far, and I probably won’t, since they are releasing a larger dataset soon. Also, I have been pretty happy with my models so far.

I would wait until the new dataset arrives anyway, because there will be additional validation data to play with.

jacob_stahl · September 6, 2021, 9:04pm

I’m curious about how effective this would be. Is there a sanity test you can use on the synthetic samples to verify that they reflect patterns in the dataset? You can’t eyeball them the way you can with the output of an image GAN.

rtachinardi · September 6, 2021, 9:20pm

Oh, I didn’t know that. Do you have any links to posts about this new dataset?

rtachinardi · September 6, 2021, 9:29pm

Yes, there are such tests, but they’re rather more complex for time series than for other kinds of data (like images).

Here’s an example:

And you can find the source code here:

Marcos Lopez de Prado’s books (“Advances in Financial Machine Learning” and “Machine Learning for Asset Managers”) also discuss this problem, but unfortunately they aren’t available online for free, so I can’t link them.
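To make the idea of a sanity test a bit more concrete, here is a minimal sketch of two very basic checks. It is not the method from the example linked above, just an illustration using only the Python standard library. The function names (`ks_statistic`, `pearson`, `compare_real_vs_synthetic`) and the toy data are my own assumptions and are not part of any Numerai tooling. The sketch compares each feature's marginal distribution with a two-sample Kolmogorov-Smirnov statistic and compares pairwise correlations between features. It deliberately ignores any time or era ordering, which is the part that makes time series harder.

```python
import random
import statistics


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x, then compare the CDFs at x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sx = sum((u - mx) ** 2 for u in x) ** 0.5
    sy = sum((v - my) ** 2 for v in y) ** 0.5
    return cov / (sx * sy)


def compare_real_vs_synthetic(real, synthetic):
    """Both arguments are lists of columns (one list of floats per feature)."""
    n = len(real)
    max_ks = max(ks_statistic(real[k], synthetic[k]) for k in range(n))
    max_corr_gap = 0.0
    for p in range(n):
        for q in range(p + 1, n):
            gap = abs(pearson(real[p], real[q]) - pearson(synthetic[p], synthetic[q]))
            max_corr_gap = max(max_corr_gap, gap)
    return {"max_ks": max_ks, "max_corr_gap": max_corr_gap}


def toy_sample(rng, n, rho):
    """Hypothetical stand-in for a dataset: two standard-normal features with correlation rho."""
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = rho * x + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        xs.append(x)
        ys.append(y)
    return [xs, ys]


rng = random.Random(0)
real = toy_sample(rng, 2000, 0.8)   # plays the role of the real data
good = toy_sample(rng, 2000, 0.8)   # a generator that kept the dependence structure
bad = toy_sample(rng, 2000, 0.0)    # a generator that lost it

good_report = compare_real_vs_synthetic(real, good)
bad_report = compare_real_vs_synthetic(real, bad)
print("good:", good_report)
print("bad: ", bad_report)
```

In this toy setup the `bad` sample has the same marginals as the real data by construction, so the KS check alone would not flag it. Only the correlation gap shows that the relationship between the features was lost, which is why checking marginals by themselves is not enough.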

yxbot · September 6, 2021, 9:44pm

There has been plenty of chatter on Rocket.Chat. The release date is supposed to be 8th Sep, so in 2 days’ time. They also say there will be a post on the forum.

Have a look at this link
and this tweet

rtachinardi · September 6, 2021, 9:59pm

Thank you very much, I will take a look!
