Today, Dustin told me about a Theano bug he was working on with Fred (soon to be fixed).
Basically, 0**x can give NaNs, since y**x is calculated as exp(x * log(y)), and log(0) is -inf. Looking at the MDN cost, the coefficients, means, and std_devs all get exponentiated at some point. Although it is very unlikely that the means will ever be exactly 0, the stds and coefficients are outputs of softplus and softmax respectively, and so they can very easily be driven to 0. Setting a floor value for all of these parameters has eliminated my NaNs.
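A minimal NumPy sketch of the kind of floor I mean (the floor value and the saturating inputs are made up for illustration; the real fix lives in the Theano graph):

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

FLOOR = 1e-6  # hypothetical floor; the actual value is a tuning choice

# Raw pre-activation outputs chosen so the activations underflow to exactly 0.
raw_std = np.array([-800.0, 0.5])     # softplus(-800) underflows to 0.0
raw_coeff = np.array([0.0, -800.0])   # softmax drives the second entry to 0.0

# Without the floor, log(0) = -inf appears inside the cost,
# and downstream arithmetic on it produces NaNs.
std = np.maximum(softplus(raw_std), FLOOR)
coeff = np.maximum(softmax(raw_coeff), FLOOR)

# With the floor, every log in the MDN cost stays finite.
print(np.log(std), np.log(coeff))
```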
However, as I suspected, the exploding gradient is still an issue. I tried a learning rate that consistently gave NaNs on the first epoch, and it no longer does. Instead, the first step increases the likelihood dramatically, after which the likelihood stays approximately constant (at least for a few epochs; I am running longer to see what happens next). Although the likelihood the network obtains in this one epoch is better than any I've achieved with lower learning rates, I'm not sure how much to trust this result. I will need to make more detailed qualitative comparisons, i.e. inspect graphs more closely.
It seems like the algorithm is totally lost after taking this big step, and has no reliable gradient signal. It may be able to recover the scent (so to speak), though, which is why I am training longer.
In any case, now that the NaNs are gone, it should be easy to use gradient clipping to address the problem. I’m currently testing my implementation of gradient clipping.
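My actual implementation operates on the Theano gradient expressions, but the scheme I'm testing amounts to rescaling by the global norm. A NumPy sketch (the threshold here is illustrative, not my tuned value):

```python
import numpy as np

def clip_gradients(grads, max_norm):
    """Rescale a list of gradient arrays so their global L2 norm
    is at most max_norm; gradients below the threshold pass through."""
    total_norm = np.sqrt(sum((g ** 2).sum() for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# Illustration: an exploding gradient gets rescaled onto the norm ball,
# while a small gradient is left untouched.
big = [np.array([3.0, 4.0]) * 100]   # global norm 500
small = [np.array([0.3, 0.4])]       # global norm 0.5

clipped = clip_gradients(big, max_norm=5.0)
print(np.sqrt(sum((g ** 2).sum() for g in clipped)))  # norm is now 5.0
```

The appeal over flooring alone is that it bounds the step size directly, so one pathological batch can't throw the parameters into a region with no useful gradient signal.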
I’d still like to understand how the network is able to achieve so much better likelihood via one big gradient step at the beginning than via incremental improvements (the way training is supposed to work on hard problems like this). I strongly suspect it is completely spurious.