Using Transformers on Numerai's stock market data

In the first Quant Club with our chief scientist Michael Oliver, @jrb talked about how he used transformers on the Numerai data (Numerai Quant Club with Michael Oliver - YouTube). I was curious how @jrb and other Numerai data scientists have set up the problem to work with transformer architectures.

7 Likes

I started experimenting with Transformers on the v3 data. jrb20 was my first transformer model. It’s a vanilla 4-layer transformer that takes embeddings of the 1050 features as a sequence, with a single linear neuron at the end applied to the concatenated sequence output. My newer models are bigger mixtures of experts of small transformer-like models, where the routing model is a hypernetwork. These are much trickier to train, but the results seem quite promising. I believe the hard routing (and the softmax attention in the transformers) gives these models GBDT-like properties, combined with the flexibility of NNs (pre-training, better loss functions, better optimizers, architectural tricks, etc.).
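For concreteness, here's a minimal sketch of that kind of setup; the layer sizes and the use of nn.TransformerEncoder are my own illustrative guesses, not @jrb's actual code:

import torch
from torch import nn

class VanillaFeatureTransformer(nn.Module):
    """Illustrative sketch: embed each of the 1050 quantised features as a
    token, run a 4-layer transformer over the sequence, then apply a single
    linear neuron to the concatenated sequence output."""
    def __init__(self, n_features=1050, n_bins=5, dim=16, depth=4, heads=4):
        super().__init__()
        # one embedding per (feature, bin) pair
        self.embed = nn.Embedding(n_features * n_bins, dim)
        self.register_buffer('offsets', torch.arange(n_features) * n_bins)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(n_features * dim, 1)  # single output neuron

    def forward(self, x):  # x: (batch, n_features) integer bins in [0, n_bins)
        tokens = self.embed(x + self.offsets)  # (batch, n_features, dim)
        out = self.encoder(tokens)
        return self.head(out.flatten(1)).squeeze(-1)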

11 Likes

To me it was clear from the beginning that I had to use NNs to be “different” - anyone can use trees.
My models are multilevel: I use transformers to generate different features/data representations for second-level models.
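As a hedged illustration of that kind of stacking (everything here is a hypothetical sketch, not the poster's actual pipeline; lightgbm is assumed for the second level):

import numpy as np
import lightgbm as lgb
import torch

@torch.no_grad()
def transformer_features(transformer, x_tokens):
    # assumes the trained transformer returns (batch, seq, dim) hidden states
    h = transformer(x_tokens)
    return h.mean(dim=1).cpu().numpy()  # one pooled representation per row

def fit_second_level(transformer, X_raw, x_tokens, y):
    # concatenate raw features with the learned representations and fit a
    # second-level GBDT on top
    reps = transformer_features(transformer, x_tokens)
    stacked = np.concatenate([X_raw, reps], axis=1)
    return lgb.LGBMRegressor(n_estimators=2000, learning_rate=0.01).fit(stacked, y)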

5 Likes

How about using the era as a sequence input to the Transformer, with some hacks for long sequences (~6k rows per era)?
It should help with synthetic era generation and with self-supervised learning as well. It can also be trained on multiple targets with custom losses.

The current rough implementation seems to struggle a little with longer sequences. Adding more layers helps.

Colab: NumeraiTransformerEra.ipynb
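For reference, a minimal sketch of the era-as-sequence setup (dimensions are illustrative; note that full self-attention over ~6k tokens is quadratic in sequence length, which is where the long-sequence hacks come in):

import torch
from torch import nn

class EraAsSequence(nn.Module):
    def __init__(self, n_features, dim=64, depth=4, heads=4):
        super().__init__()
        self.proj = nn.Linear(n_features, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, 1)  # per-row prediction

    def forward(self, era_x):              # era_x: (1, n_rows, n_features)
        # every row of the era is one token; rows are unordered, so no
        # positional encoding is added
        h = self.encoder(self.proj(era_x))
        return self.head(h).squeeze(-1)    # (1, n_rows)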

5 Likes

I’ve tried a hacked implementation of TabTransformer, treating the features as categorical (so not using continuous features).

Currently:

  • Just using the small dataset to avoid running out of GPU memory.
  • Using Pearson correlation as the loss function, with a penalty on feature exposure (a sketch of such a loss follows the model code below).
  • Using one era per batch.
  • Predicting all targets at once, then ranking and averaging the predictions.

No results worth sharing so far. A draft of the code looks like this:

# imports the draft relies on
import torch
import torch.nn.functional as F
from torch import nn, einsum
from einops import rearrange

# helpers
def exists(val):
    return val is not None

def default(val, d):
    return val if exists(val) else d

# classes
class Residual(nn.Module):
    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x, **kwargs):
        return self.fn(x, **kwargs) + x

class PreNorm(nn.Module):
    def __init__(self, dim, fn):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fn = fn

    def forward(self, x, **kwargs):
        return self.fn(self.norm(x), **kwargs)

# attention
class GEGLU(nn.Module):
    def forward(self, x):
        x, gates = x.chunk(2, dim = -1)
        return x * F.gelu(gates)

class FeedForward(nn.Module):
    def __init__(self, dim, mult = 4, dropout = 0.):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim * mult * 2),
            GEGLU(),
            nn.Dropout(dropout),
            nn.Linear(dim * mult, dim)
        )

    def forward(self, x, **kwargs):
        return self.net(x)

class Attention(nn.Module):
    def __init__(
        self,
        dim,
        heads = 8,
        dim_head = 16,
        dropout = 0.
    ):
        super().__init__()
        inner_dim = dim_head * heads
        self.heads = heads
        self.scale = dim_head ** -0.5

        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)
        self.to_out = nn.Linear(inner_dim, dim)

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        h = self.heads
        q, k, v = self.to_qkv(x).chunk(3, dim = -1)
        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = h), (q, k, v))
        sim = einsum('b h i d, b h j d -> b h i j', q, k) * self.scale

        attn = sim.softmax(dim = -1)
        attn = self.dropout(attn)

        out = einsum('b h i j, b h j d -> b h i d', attn, v)
        out = rearrange(out, 'b h n d -> b n (h d)', h = h)
        return self.to_out(out)

# transformer
class Transformer(nn.Module):
    def __init__(self, num_tokens, dim, depth, heads, dim_head, attn_dropout, ff_dropout):
        super().__init__()
        self.embeds = nn.Embedding(num_tokens, dim)
        self.layers = nn.ModuleList([])

        for _ in range(depth):
            self.layers.append(nn.ModuleList([
                Residual(PreNorm(dim, Attention(dim, heads = heads, dim_head = dim_head, dropout = attn_dropout))),
                Residual(PreNorm(dim, FeedForward(dim, dropout = ff_dropout))),
            ]))

    def forward(self, x):
        x = self.embeds(x)

        for attn, ff in self.layers:
            x = attn(x)
            x = ff(x)

        return x

# mlp
class MLP(nn.Module):
    def __init__(self, dims, act = None):
        super().__init__()
        dims_pairs = list(zip(dims[:-1], dims[1:]))
        layers = []
        for ind, (dim_in, dim_out) in enumerate(dims_pairs):
            is_last = ind >= (len(dims_pairs) - 1)
            linear = nn.Linear(dim_in, dim_out)
            layers.append(linear)

            if is_last:
                continue

            act = default(act, nn.ReLU())
            layers.append(act)
            layers.append(nn.Dropout(p=.3))

        self.mlp = nn.Sequential(*layers)

    def forward(self, x):
        return self.mlp(x)

# main class
class TabTransformer(nn.Module):
    def __init__(
        self,
        *,
        categories,
        dim,
        depth,
        heads,
        dim_head = 16,
        dim_out = 1,
        mlp_hidden_mults = (4, 2),
        mlp_act = None,
        num_special_tokens = 2,
        attn_dropout = 0.,
        ff_dropout = 0.
    ):
        super().__init__()
        assert all(map(lambda n: n > 0, categories)), 'number of each category must be positive'

        # categories related calculations
        self.num_categories = len(categories)
        self.num_unique_categories = sum(categories)

        # create category embeddings table
        self.num_special_tokens = num_special_tokens
        total_tokens = self.num_unique_categories + num_special_tokens

        # for automatically offsetting unique category ids to the correct position in the categories embedding table
        categories_offset = F.pad(torch.tensor(list(categories)), (1, 0), value = num_special_tokens)
        categories_offset = categories_offset.cumsum(dim = -1)[:-1]
        self.register_buffer('categories_offset', categories_offset)

        # transformer
        self.transformer = Transformer(
            num_tokens = total_tokens,
            dim = dim,
            depth = depth,
            heads = heads,
            dim_head = dim_head,
            attn_dropout = attn_dropout,
            ff_dropout = ff_dropout
        )

        # mlp to logits
        input_size = (dim * self.num_categories)
        l = input_size // 8

        hidden_dimensions = list(map(lambda t: l * t, mlp_hidden_mults))
        all_dimensions = [input_size, *hidden_dimensions, dim_out]

        self.mlp = MLP(all_dimensions, act = mlp_act)

    def forward(self, x_categ):
        assert x_categ.shape[-1] == self.num_categories, f'you must pass in {self.num_categories} values for your categories input'
        # avoid the original in-place `+=`, which would mutate the caller's tensor
        x_categ = x_categ + self.categories_offset

        x = self.transformer(x_categ)

        return self.mlp(x.flatten(1))

# main
# every feature in the small dataset takes one of 5 integer values
categories_unique_values = (5,) * FEATURE_LEN  # FEATURE_LEN = number of features in the small dataset

model = TabTransformer(
    categories = categories_unique_values,      
    dim = 32,                           
    dim_out = TARGET_LEN, #len of all targets                       
    depth = 6,                          
    heads = 8,                          
    attn_dropout = 0.1,                 
    ff_dropout = 0.1,                   
    mlp_hidden_mults = (4, 2),          
    mlp_act = nn.ReLU()                
)
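The draft above omits the training loss. A minimal sketch of a Pearson correlation loss with a feature exposure penalty, matching the list above, might look like this (the helper names and exposure_weight are illustrative, not from the original post):

import torch

def pearson_corr(a, b, eps=1e-8):
    # Pearson correlation of two 1-D tensors
    a = a - a.mean()
    b = b - b.mean()
    return (a * b).sum() / (a.norm() * b.norm() + eps)

def corr_loss_with_exposure_penalty(preds, targets, features, exposure_weight=0.1):
    # maximise correlation with the target, i.e. minimise its negative
    loss = -pearson_corr(preds, targets)
    # penalise the mean squared correlation between predictions and each feature
    exposures = torch.stack([pearson_corr(preds, features[:, i])
                             for i in range(features.shape[1])])
    return loss + exposure_weight * exposures.pow(2).mean()

Here preds and targets are 1-D tensors for a single era (one era per batch, as above), and features is the (rows, n_features) matrix for that era.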
6 Likes

An update on TabTransformers. Inspired by what @surajp mentioned, I tried Linear Attention with a TabTransformer and managed to run it on the medium dataset, with a bunch of targets and by splitting the data a bit (with vanilla attention, an RTX 3090 could only just manage the small dataset; linear attention scales linearly rather than quadratically in sequence length).

Metrics are a bit better than with the small dataset. I would have expected lower feature exposure (I penalize it in the loss function; clearly not enough) and lower correlation with the example predictions. Anyway, I haven't tuned anything yet, as training takes quite a while.

Sharing the code with the Linear Attention below. The helpers and the TabTransformer class are unchanged from my previous post, so only the new and modified parts are shown in full.

Any feedback is welcome!

# helpers (exists, default), Residual, PreNorm, GEGLU and FeedForward
# are identical to the previous post and omitted here
def linear_attn(q, k, v):
    dim = q.shape[-1]

    # softmax over the feature dimension of q and the sequence dimension of k:
    # this lets the (k^T v) context be computed once, so the cost is linear
    # rather than quadratic in sequence length
    q = q.softmax(dim=-1)
    k = k.softmax(dim=-2)

    q = q * dim ** -0.5

    context = einsum('bhnd,bhne->bhde', k, v)
    attn = einsum('bhnd,bhde->bhne', q, context)
    return attn.reshape(*q.shape)

class LinearAttention(nn.Module):
    def __init__(self, dim, heads, dim_head = None, dropout = 0.):
        super().__init__()
        assert dim_head or (dim % heads) == 0, 'embedding dimension must be divisible by number of heads'
        d_heads = default(dim_head, dim // heads)
        self.heads = heads
        self.d_heads = d_heads
        self.global_attn_fn = linear_attn

        self.to_q = nn.Linear(dim, d_heads * heads, bias = False)
        self.to_k = nn.Linear(dim, d_heads * heads, bias = False)
        self.to_v = nn.Linear(dim, d_heads * heads, bias = False)
        self.to_out = nn.Linear(d_heads * heads, dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, **kwargs):
        q, k, v = (self.to_q(x), self.to_k(x), self.to_v(x))
        b, t, e, h, dh = *q.shape, self.heads, self.d_heads
        merge_heads = lambda x: x.reshape(*x.shape[:2], -1, dh).transpose(1, 2)
        q, k, v = map(merge_heads, (q, k, v))

        # the list/concat over attention branches was vestigial with a single
        # global pass, so call it directly
        attn = self.global_attn_fn(q, k, v)
        attn = attn.transpose(1, 2).reshape(b, t, -1)
        return self.dropout(self.to_out(attn))

# transformer
class Transformer(nn.Module):
    def __init__(self, num_tokens, dim, depth, heads, dim_head, attn_dropout, ff_dropout):
        super().__init__()
        self.embeds = nn.Embedding(num_tokens, dim)
        self.layers = nn.ModuleList([])

        for _ in range(depth):
            self.layers.append(nn.ModuleList([
                Residual(PreNorm(dim, LinearAttention(dim, heads, dim_head = dim_head, dropout = attn_dropout))),
                Residual(PreNorm(dim, FeedForward(dim, dropout = ff_dropout))),
            ]))

    def forward(self, x):
        x = self.embeds(x)

        for attn, ff in self.layers:
            x = attn(x)
            x = ff(x)

        return x

# MLP and the main TabTransformer class are identical to the previous post
# and omitted here

# main
# every feature takes one of 5 integer values
categories_unique_values = (5,) * FEATURE_LEN

model_transformer = TabTransformer(
    categories = categories_unique_values,      
    dim = 16,                           
    dim_out = TARGET_LEN,                        
    depth = 3,                          
    heads = 4,                          
    attn_dropout = 0.1,                 
    ff_dropout = 0.1,                   
    mlp_hidden_mults = (4, 2),          
    mlp_act = nn.ReLU()                
)

5 Likes

Considering that OpenAI provides transformer-based foundation models, is it feasible to use their models to handle our trading needs directly? For example, convert our signals into serial inputs with expected outputs, similar to LLM token prediction, and let the transformer figure out the relationships between those key inputs. We might upload our data to OpenAI, fine-tune a model, and get it to do classification work (buy/sell). Please share your thoughts or any experiments being done.

Yep, I'm working on something like that for Signals.

The idea is to feed all the stocks at once and then use a decoder to forecast all stocks together. This would allow the model to understand the interactions between stocks. It also allows country and sector to be embedded into the data. Instead of OpenAI, I am trying to build a Transformer from scratch for this flexibility.
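A minimal sketch of that cross-sectional idea (all names and sizes are my own illustrative assumptions, and it uses an encoder for brevity where the post mentions a decoder):

import torch
from torch import nn

class CrossSectionTransformer(nn.Module):
    """Each stock on a given date is one token: its features plus learned
    country and sector embeddings. Self-attention across all stocks lets
    the model capture cross-stock interactions."""
    def __init__(self, n_features, n_countries, n_sectors, dim=64, depth=4, heads=4):
        super().__init__()
        self.feat_proj = nn.Linear(n_features, dim)
        self.country_emb = nn.Embedding(n_countries, dim)
        self.sector_emb = nn.Embedding(n_sectors, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, 1)  # one forecast per stock

    def forward(self, feats, country, sector):
        # feats: (1, n_stocks, n_features); country, sector: (1, n_stocks) int ids
        tok = self.feat_proj(feats) + self.country_emb(country) + self.sector_emb(sector)
        return self.head(self.encoder(tok)).squeeze(-1)  # (1, n_stocks)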

5 Likes

You mean converting daily changes into time series tokens? Like a 0.01% change as “u”, a 0.02% change as the token “uu”, -0.06% as “dddddd”, and so on, then training a transformer to forecast the next day’s returns from a dictionary of “u”/“d” run-length tokens (assuming 0.01% as the base unit), thus treating the daily percent changes as natural (language) sequences? Seems doable.

Yes, close. Basically, treat the signal as a language.
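A toy sketch of that tokenisation (the clipping bound and the "flat" token are my own illustrative choices):

def return_to_token(pct_change, base=0.01, max_steps=100):
    # quantise a daily percent change into signed 0.01% steps and render
    # it as a run of "u"s or "d"s, e.g. 0.02 -> "uu", -0.06 -> "dddddd"
    steps = max(-max_steps, min(max_steps, round(pct_change / base)))
    if steps == 0:
        return "flat"
    return ("u" if steps > 0 else "d") * abs(steps)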

1 Like

Just adding a reference to my post that explains this approach better: “Eras” of Transformers.

These examples appear to be using the Transformer model as engineered for LLMs. I’m curious if anyone has experimented with more of a Vision Transformer model. I’ve been exploring this, with no significant results as of yet, but I could also be doing something wildly wrong with my architecture. I’d love to connect with anyone who’s given this a shot.

1 Like

I guess the underlying idea and architecture of transformer encoders stay the same for both language and vision encoding; it's just the way we encode images and text to feed to the model that differs.

Yes, tokenization in ViTs is a lot simpler. We just cut up the image into patches, linearly project them, and then add position embeddings before feeding them into a cascade of transformer encoder layers. It's regrettable that we don't have end-to-end learning of tokenization in text transformers (text tokenizers are trained before the transformer itself), although IIRC there have been some attempts at it.
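For reference, a minimal sketch of that ViT-style tokenisation (sizes are the usual illustrative defaults, nothing specific to Numerai):

import torch
from torch import nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, channels=3, dim=768):
        super().__init__()
        n_patches = (img_size // patch_size) ** 2
        # a strided conv performs patchify + linear projection in one step
        self.proj = nn.Conv2d(channels, dim, kernel_size=patch_size, stride=patch_size)
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))

    def forward(self, img):               # img: (batch, 3, 224, 224)
        x = self.proj(img)                # (batch, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)  # (batch, 196, dim)
        return x + self.pos               # ready for the encoder layers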

2 Likes

I’m still experimenting with transformers. I’m finding that there isn’t a direct relationship between what the model learns on training data and how it performs on validation; at least it doesn’t seem as directly correlated as when training a random forest or GBT. I’ve read that transformers require a significant amount of training data, and I’m wondering if this is part of the reason. @jrb, do you have a sense of how the main transformer hyperparameters relate to our data? For example, number of layers vs. number of heads per layer vs. embedding dimension? I’d like to build some general understanding of the impact each of these values has on the learning/generalization of the model.

1 Like

You can create a near-infinite amount of training data by masking features. I’m not training any new transformer-based models because the incentives of the tournament are now tilted towards training tree-based models with feature engineering. Also, I don’t have any intuitive sense of which hyperparameters work best with transformers on this data. I’d recommend starting small and experimenting. Almost all of the received wisdom about transformers from NLP and CV applies here.
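A sketch of that masking-as-augmentation idea (mask_prob and the mask value are illustrative):

import torch

def mask_features(x, mask_prob=0.15, mask_value=0):
    # x: (batch, n_features) integer-binned features; randomly replace a
    # fraction of entries so every step sees a different corrupted view
    mask = torch.rand_like(x, dtype=torch.float) < mask_prob
    return x.masked_fill(mask, mask_value)

# usage inside a training loop: preds = model(mask_features(batch_x))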

5 Likes

Curious about what makes you say that @jrb ?

TC incentivises novel predictions and CORR incentivises maximising CORR. With 2xCORR and 1xTC, the incentives are tilted towards maximising CORR at the cost of everything else. And if you’re after the highest CORR, it’s hard to beat gradient boosted tree ensembles.

2 Likes

Any thoughts on why positive corr20v2 may also come with negative TC? E.g., jrb46D: corr20v2 0.0061, TC -0.0009, while another model, jrb19, has lower corr20v2 (0.0036) and positive TC (0.0154).

I've noticed a similar pattern in some of my own models (9111953: positive TC, low corr20v2; 10122004: negative TC, higher corr20v2).

Adjusting the corr20v2 and TC multipliers can compensate for this behavior; however, a corr20v2 multiplier of 2 is needed to stake a model.

The GBDTs are indeed hard to beat. I am training Transformers specifically to extract value over the example model, and barely getting anything (of course, my implementation isn't perfect). While GBDTs seem highly overfitted to the training set, NNs that reach that level often hit saturation, and validation CORR starts decreasing. The discrete decision boundaries of a tree ensemble generalize nicely, becoming more robust and smoother with something like 20k trees; NNs, on the other hand, generate smooth, continuous boundaries (hyperplanes), making them more susceptible to small changes in inputs. My endeavour now is to find what can help the NN generalize well. An ensemble of NNs is a nice idea, but at that point it's easier to just go with trees.