So, I’ve been working recently with Junyoung Chung on the RNN implementation based on Vincent Dumoulin’s implementation, but this time trying the model only on sequences corresponding to a single phoneme (here is a simple script to build such data), to see whether it is even possible to fit such small data well. One interesting result that I talked about in class was this one. In the cyan plot, we can see that the prediction error from the ground truth varies a lot along the sequence.
This is an argument for modelling the conditional variance over time, to account for the variable uncertainty along the sequence. Increasing the number of learned parameters of the conditional distribution we are modelling (here, adding the variance) can actually be seen as another way to increase capacity, other than just increasing the size of the network.
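To make this concrete, here is a minimal numpy sketch (my own illustration, not the course code) of a diagonal-Gaussian negative log-likelihood where the log-variance is predicted per time step; when a step has a large prediction error, letting the variance grow at that step yields a lower loss than keeping a fixed unit variance:

```python
import numpy as np

def gaussian_nll(target, mean, log_var):
    """Negative log-likelihood of `target` under a diagonal Gaussian whose
    mean and log-variance are both predicted, one pair per time step."""
    return 0.5 * np.sum(
        log_var + (target - mean) ** 2 / np.exp(log_var) + np.log(2 * np.pi)
    )

target = np.array([0.0, 3.0])   # second step has a large prediction error
mean = np.zeros(2)

fixed = gaussian_nll(target, mean, np.zeros(2))             # unit variance everywhere
adapted = gaussian_nll(target, mean, np.array([0.0, 2.0]))  # inflated variance on the hard step
# adapted < fixed: the model "pays" less for a big error when it also
# predicts high uncertainty at that step.
```

With a fixed variance, this loss reduces to squared error up to a constant, which is why learning the variance genuinely adds capacity.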
When doing unsupervised learning with sequences (well… it’s not entirely unsupervised, but still), one possible way to map our problem to a supervised learning problem is to make a Markov assumption of order k.
To clarify, for a sequence of size T we are trying, for example, to maximize the likelihood p(x_1, …, x_T). The Markov assumption of order k states that p(x_t | x_1, …, x_{t-1}) = p(x_t | x_{t-k}, …, x_{t-1}). AR(p) is a special case (the linear one) of such a model.
So we just have to define such a conditional model (including the limit case for the first k steps) in the same way we would define a supervised regression model, thus obtaining a probabilistic model whose likelihood we maximize.
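The mapping to supervised pairs can be sketched as follows (a toy illustration of the order-k windowing, with hypothetical names):

```python
import numpy as np

def make_supervised_pairs(seq, k):
    """Turn a 1-D sequence into (window, next value) training examples
    under a Markov assumption of order k: predict x_t from x_{t-k}..x_{t-1}."""
    X = np.array([seq[t - k:t] for t in range(k, len(seq))])
    y = np.array([seq[t] for t in range(k, len(seq))])
    return X, y

X, y = make_supervised_pairs(np.arange(10.0), k=3)
# X[0] is [0, 1, 2] and its target y[0] is 3
```

Any regression model applied to (X, y) then defines the conditional distribution we need.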
Anyway, this is just to try to justify the name here.
P.S: There is still no preprocessing yet, and I’m not using both phones and phonemes. If people want to add that, feel free to make a separate branch.
P.P.S: I’m working with Vincent for a more pylearn2-friendly implementation of this.
EDIT: Here are some posts using this kind of modelling:
– First test with PyLearn2
– Speech synthesis project description and first attempt at a regression MLP
– Initial experiment: ’aaaaaa’ and ‘oooooo’
– First experiment: vanilla MLP with Theano
I might have forgotten some. If so please tell me.
So I wrapped TIMIT into a class. You can use it as you see fit.
I haven’t added any preprocessing (centering, normalization, wavelets, Fourier transform, LPC…). (EDIT: I do however use the segment_axis function that João used here to cut the sequences into frames; copy that file into your Python path.)
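For readers who don’t want to fetch the file right away, a simplified version of what segment_axis does (frames with overlap; the real function also handles end-padding and n-dimensional arrays) looks like this:

```python
import numpy as np

def frame_sequence(x, frame_len, overlap=0):
    """Cut a 1-D signal into frames of `frame_len` samples, with consecutive
    frames sharing `overlap` samples (trailing samples that do not fill a
    whole frame are dropped, unlike the real segment_axis)."""
    hop = frame_len - overlap
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

frames = frame_sequence(np.arange(10), frame_len=4, overlap=2)
# frames[0] is [0, 1, 2, 3], frames[1] is [2, 3, 4, 5], ...
```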
This class uses a reduced set of phonemes, since the same phoneme can be (and is) written in multiple ways (mentioned here).
People seem skeptical of the processing of the data that I have made.
I’m fine with that.
Actually, just by looking at an integer vector, I can’t really tell whether it is supposed to be a sound or whether someone has been playing a prank on me by replacing the meaningful waveform vectors with random ones, nor whether the data is raw or in another representation like wavelets or MFCC. It is somewhat interesting that we expect our machine learning algorithm to figure this out.
So, I’ve made a Python script to check that the vectors make sense. I pick a random sentence from the training data, plot its waveform, look at the corresponding phonemes and words, and output a .WAV file. I also output the features of the speaker so I can check that the voice matches.
It’s supposed to say “Diane may splurge and buy a turquoise necklace”.
Also, reading the script might help you understand how to use the .npy and .pkl files.
P.S: On an unrelated note, while I would not expect words to bring much information beyond the phonemes, I would consider the final punctuation to be obviously important for learning the prosody (assertion, question, exclamation…). So important that I haven’t included this feature yet…
P.P.S: Now, if I wanted to transform the data into mainstream representations like the Fourier transform or wavelets, I might want to try scipy’s signal processing and discrete Fourier transform packages.
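As a quick sanity check in that direction (my own sketch, using numpy’s real FFT, which scipy mirrors in scipy.fftpack): the magnitude spectrum of a synthetic frame should peak at the frequency of a pure tone.

```python
import numpy as np

# The magnitude spectrum of a pure tone peaks at the tone's frequency.
# With 1000 Hz at 16 kHz and 512-sample frames, the tone falls exactly
# on FFT bin 32, so there is no spectral leakage.
sample_rate = 16000                        # TIMIT's sampling rate
frame_len = 512
t = np.arange(frame_len) / sample_rate
frame = np.sin(2 * np.pi * 1000.0 * t)     # synthetic 1 kHz tone

spectrum = np.abs(np.fft.rfft(frame))
peak_hz = np.argmax(spectrum) * sample_rate / frame_len   # 1000.0
```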
So it seems that the preprocessed data we were given are MFCCs of the frames, which is problematic because that transform is hard to invert.
After some investigation with Jörg, we were able to obtain a rawer version of TIMIT. The .WAV files of the raw TIMIT are in the NIST WAV format, which is not readable by mainstream audio players (not even VLC!). One possible trick is to use the sox command on Linux to convert your files to a readable .WAV format.
sox [input] -t sndfile [output]
Not only does this let you listen to your files in mainstream audio players, but the resulting format can also be read by scipy to build vectors (see scipy.io.wavfile). This also answers my previous question about listening to vectors, since the package includes tools to create .WAV files from a vector.
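Here is a small round-trip sketch (my own, assuming a converted 16 kHz, 16-bit file like TIMIT’s) of the vector → .WAV → vector path with scipy.io.wavfile:

```python
import os
import tempfile

import numpy as np
from scipy.io import wavfile

# Round trip: vector -> .WAV -> vector, at TIMIT's 16 kHz with 16-bit
# samples. wavfile.write produces a file playable in a normal audio player.
rate = 16000
tone = (0.5 * np.iinfo(np.int16).max
        * np.sin(2 * np.pi * 440.0 * np.arange(rate) / rate)).astype(np.int16)

path = os.path.join(tempfile.mkdtemp(), "tone.wav")
wavfile.write(path, rate, tone)        # vector -> listenable .WAV
rate_back, data = wavfile.read(path)   # .WAV -> vector
```

Since the samples are written as int16, the round trip is exact.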
So now we have the tools to convert TIMIT into a format like numpy.array, more exploitable for standard machine learning algorithms.
P.S: It also seems that we have the alignment of the sound sequences with the words. I’m not sure how exploitable this is over the phonemes, because the dimensionality of this kind of feature (6102 words, 4739 after stemming with the nltk implementation) makes it an avenue for overfitting.
P.P.S: The formatted data path with the instructions can be found in the Google Doc. I used some code from Yann Dauphin in my script to link all the information (waveform, speaker identity, speaker information, phoneme, word). If you have any questions or notice any bugs, please write in the comment section.
So I’m following the IFT6266 course at Université de Montréal. And it seems the project for this course is about speech synthesis. This blog is where I will log my experiments and thoughts on this project.
As explained here (and shown here), speech synthesis is traditionally split into two stages: prosody building and waveform generation. After building the prosody, the waveform generation is typically done by gluing together components of a dictionary matching the prosody and the phonemes, then smoothing the resulting patchwork. This compartmentalization of the task has advantages and drawbacks similar to those of relying on prior knowledge: the system is easier to monitor (and debug), but we are likely to bias it (here, slightly toward a robot voice). I find the result quite good though.
However, we could hope to obtain something better by generating the waveform more directly, and therefore possibly obtain smoother waveforms. We will try to build such a system through a machine learning approach (some bibliographical references on speech synthesis with this kind of approach are here). If we shortcut the prosody building, we can hope that the system will somehow learn it by itself. For this learning task, we will be given a dataset of people (of different sex, race, age, height and education) saying some “phonetically rich sentences” in several dialects.
This dataset takes the form of several varying-length sequences of audio frames, represented as fixed-length real vectors. These frames might have been centered and normalized for our convenience (with the mean and standard deviation computed across all frames and components). The data are augmented with context variables (both discrete and continuous) such as speaker information (age, dialect, education, height, race and sex) and the sequence of phonemes, which is aligned with the sequence of frames in the training set.
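If we end up doing the standardization ourselves, it is a one-liner in numpy (a sketch with made-up frame dimensions; keeping the statistics around is what lets us invert the transform later):

```python
import numpy as np

# Standardization with a single mean and standard deviation pooled over
# all frames and components, as described above. The shapes are made up
# (1000 frames of 240 samples); mu and sigma must be saved to map
# generated frames back to the original scale.
rng = np.random.RandomState(0)
frames = rng.randn(1000, 240) * 3.0 + 5.0

mu, sigma = frames.mean(), frames.std()
normalized = (frames - mu) / sigma
restored = normalized * sigma + mu
```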
One proposed evaluation of the generative model resulting from our training is the conditional log-likelihood on the test set, i.e. the sum over test sentences of the log-likelihood of each sentence given the speaker information and the unaligned sequence of phonemes. This might bias my choice of model toward those with a tractable conditional log-likelihood.
My code will be on this github.
P.S: For our own debugging, I wonder how easy it would be to generate a .wav file from our generated sequences. It might require recomputing the original training set’s mean and standard deviation. But so far I can’t even listen to the original .wav files.