From a NN perspective this is a bit surprising. One benefit of using a NN is that it learns non-linear cross-feature relationships on its own. Here the author forces the network to learn some relationships before injecting that knowledge into a bigger network, which raises the question: why not work with a big network in the first place (such as a wide & deep architecture)?

I am not familiar with the dataset used in the blog post, so I cannot evaluate the result myself (error rate from 37.6% down to 37.2%). I wish the author had given additional examples.

Feature embedding can be quite effective outside NLP. I have been using it at work when working with sparse features. It goes like this: train an autoencoder to reconstruct the sparse features, then keep only the encoder part and set its weights to “non trainable”. The sparse features now have a dense representation. So far I have not been successful in applying this idea to the tournament.
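
To make the recipe concrete, here is a minimal sketch in plain NumPy. Everything in it is an assumption for illustration: the toy data, the layer sizes, the learning rate, and the names (`encode`, `W_enc`, ...) are made up and are not what I run at work. In Keras the "freeze" step would amount to setting `trainable = False` on the encoder; here it simply means the encoder weights are no longer updated after training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sparse input: 200 rows, 50 binary features, roughly 5% non-zero (illustrative only).
X = (rng.random((200, 50)) < 0.05).astype(float)

n_in, n_code = X.shape[1], 8          # 50 sparse inputs -> 8 dense dimensions
W_enc = rng.normal(0.0, 0.1, (n_in, n_code))
b_enc = np.zeros(n_code)
W_dec = rng.normal(0.0, 0.1, (n_code, n_in))
b_dec = np.zeros(n_in)

def encode(x):
    """Encoder half: sparse features -> dense code."""
    return np.tanh(x @ W_enc + b_enc)

def reconstruction_loss(x):
    """Mean squared error between the input and its reconstruction."""
    return float(np.mean((encode(x) @ W_dec + b_dec - x) ** 2))

loss_before = reconstruction_loss(X)

# Train encoder and decoder jointly with plain gradient descent.
lr = 0.1
for _ in range(500):
    H = encode(X)
    X_hat = H @ W_dec + b_dec
    err = (X_hat - X) / X.shape[0]        # gradient of the loss w.r.t. X_hat
    dH = (err @ W_dec.T) * (1.0 - H ** 2)  # backprop through tanh
    W_dec -= lr * (H.T @ err)
    b_dec -= lr * err.sum(axis=0)
    W_enc -= lr * (X.T @ dH)
    b_enc -= lr * dH.sum(axis=0)

loss_after = reconstruction_loss(X)

# "Freeze" the encoder: from here on W_enc and b_enc are never updated again,
# and encode() is used as a fixed feature transformer for the downstream model.
dense_features = encode(X)
print(dense_features.shape, loss_before, loss_after)
```

The decoder is thrown away after training; only `dense_features` (or `encode` applied to new rows) is fed to the downstream model.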

Embedding categorical features is often a good strategy. You can get some intuition about how it works by looking at it as a sequence of steps: (1) associate an integer with each category, (2) one-hot encode the variable (this can be super sparse), then (3) apply a smaller dense layer on the one-hot encoding. This is what the author has implemented with the `tf.feature_column` functions and the `DenseFeatures` input layer.
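
The three steps can be sketched in a few lines of NumPy. This is only an illustration of the intuition, not the author's code: the toy categories, the embedding size, and the random weight matrix `W` are assumptions, and the "dense layer" here has no bias and no activation. It also shows why frameworks implement embeddings as a table lookup: multiplying a one-hot vector by `W` just selects one row of `W`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy categorical column (illustrative values).
categories = ["red", "green", "blue", "green", "red"]

# (1) Associate an integer with each category.
vocab = {c: i for i, c in enumerate(sorted(set(categories)))}
ids = np.array([vocab[c] for c in categories])

# (2) One-hot encode: one column per category, a single 1 per row.
one_hot = np.eye(len(vocab))[ids]

# (3) Apply a smaller dense layer (no bias, no activation) to the one-hot rows.
embedding_dim = 2
W = rng.normal(size=(len(vocab), embedding_dim))
via_dense = one_hot @ W

# Equivalent shortcut: look up row `id` of W directly.
via_lookup = W[ids]

print(np.allclose(via_dense, via_lookup))
```

In a real model `W` is learned together with the rest of the network rather than drawn at random.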