So, I’ve been working recently with Junyoung Chung on an RNN implementation based on Vincent Dumoulin’s implementation, but training the model only on sequences corresponding to a single phoneme (here is a simple script to build such data), to see whether it is even possible to fit such a small dataset well. One interesting result that I talked about in class was this one. In the cyan plot, we can see that the prediction error with respect to the ground truth varies a lot along the sequence.

This is an argument for modelling the conditional variance over time, to account for the way uncertainty varies along the sequence. Increasing the number of learned parameters of the conditional distribution we are modelling (here, the variance) can actually be seen as another way to increase capacity, other than just increasing the size of the network.
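
To make the idea a bit more concrete, here is a minimal NumPy sketch of a Gaussian negative log-likelihood where the variance is allowed to change at every time step. It is not the RNN code discussed above: the function name `gaussian_nll`, the toy sequence and its noise schedule are all made up for illustration, and the "predicted" variance is simply the true one, standing in for what a network with an extra variance output might learn.

```python
import numpy as np


def gaussian_nll(y, mu, log_var):
    """Mean per-step negative log-likelihood of y under N(mu, exp(log_var)).

    log_var may be a scalar (one shared variance) or an array with one
    value per time step (a conditional variance that changes over time).
    """
    y, mu, log_var = np.broadcast_arrays(
        np.asarray(y, dtype=float),
        np.asarray(mu, dtype=float),
        np.asarray(log_var, dtype=float),
    )
    per_step = 0.5 * (np.log(2.0 * np.pi) + log_var + (y - mu) ** 2 / np.exp(log_var))
    return float(np.mean(per_step))


# Hypothetical toy sequence: the noise level grows along the sequence,
# so the prediction error of the mean varies a lot over time.
rng = np.random.default_rng(0)
T = 2000
true_std = np.linspace(0.1, 2.0, T)
mu = np.zeros(T)  # stand-in for the model's predicted mean
y = mu + true_std * rng.standard_normal(T)

# Option 1: a single variance for the whole sequence (its best value is the MSE).
fixed_log_var = np.log(np.mean((y - mu) ** 2))
nll_fixed = gaussian_nll(y, mu, fixed_log_var)

# Option 2: one variance per time step (here the true one, as an idealised case).
nll_varying = gaussian_nll(y, mu, 2.0 * np.log(true_std))

print("NLL with one shared variance:  ", nll_fixed)
print("NLL with time-varying variance:", nll_varying)
```

On this toy data the time-varying variance gives a lower negative log-likelihood than the best single variance, even though the predicted mean is identical in both cases. In a real model the log-variance would be an additional output of the network, trained jointly with the mean by minimising this same loss.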