So it seems that the preprocessed data we were given are MFCCs of frames, which is problematic because that transform is hard to invert.
After some investigation with Jörg, we were able to obtain a rawer version of TIMIT. The .WAV files of the raw TIMIT are in NIST SPHERE format, which is not readable by mainstream audio players (not even VLC!). One possible trick is to use the sox command-line tool on Linux to convert a file to a readable .WAV format:
sox [input] -t sndfile [output]
Not only does this allow you to listen to your files through a mainstream audio player, but the result is also a format that scipy can load into vectors (see scipy.io.wavfile). This also answers my previous question about listening to vectors, because the package includes tools to create .WAV files from vectors.
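A minimal sketch of the round trip with scipy.io.wavfile (the tone here is just a stand-in vector; TIMIT files would be read the same way):

```python
import numpy as np
from scipy.io import wavfile

# Build one second of a 440 Hz sine wave as a 16-bit vector at 16 kHz.
rate = 16000
t = np.arange(rate) / rate
samples = (0.5 * np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)

# Write the vector to a playable .WAV file, then read it back as an array.
wavfile.write("tone.wav", rate, samples)
rate_back, samples_back = wavfile.read("tone.wav")
```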
So now we have the tools to convert TIMIT into a format like numpy.array, which is more amenable to standard machine learning algorithms.
P.S: It also seems that we have the alignment of the sequence of sounds with words. I’m not sure how exploitable this is compared to phonemes, because the dimensionality of this kind of feature (6102 words, 4739 after stemming with the nltk implementation) makes it prone to overfitting.
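For illustration, this is the kind of stemming nltk provides (the word list here is made up; stemming is what collapses the 6102-word vocabulary down to 4739 stems by merging inflected forms):

```python
from nltk.stem import PorterStemmer

# Inflected forms of the same word map to a single stem,
# shrinking the vocabulary used as a feature.
stemmer = PorterStemmer()
words = ["washing", "washed", "wash", "water"]
stems = [stemmer.stem(w) for w in words]
```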
P.P.S: The formatted data path with the instructions can be found in the Google Doc. I used some code from Yann Dauphin in my script to link all the information (waveform, speaker identity, speaker information, phonemes, words). If you have any questions or notice any bugs, please write in the comment section.
So I’m following the IFT6266 course at Université de Montréal, and the project for this course is about speech synthesis. This blog is where I will log my experiments and thoughts on the project.
As explained here (and shown here), speech synthesis is traditionally split into two stages: prosody building and waveform generation. After building the prosody, the waveform is typically generated by gluing together components of a dictionary matching the prosody and the phonemes, then smoothing the resulting patchwork. This compartmentalization of tasks has advantages and drawbacks similar to those of relying on prior knowledge: the system is easier to monitor (and debug), but we are likely to bias it (here slightly toward a robotic voice). I find the result quite good, though.
However, we could hope to obtain something better by generating the waveform more directly, and thereby possibly obtain a smoother waveform. We will try to build such a system through a machine learning approach (some bibliographical references on speech synthesis with this kind of approach are here). If we shortcut the prosody-building stage, we can hope that the system will somehow learn prosody by itself. For this learning task, we will be given a dataset of people (of different sex, race, age, height and education) saying some “phonetically rich sentences” in several dialects.
This dataset takes the form of several varying-length sequences of audio frames, each frame represented as a fixed-length real vector. These frames might have been centered and normalized for our convenience (with the mean and standard deviation computed across all frames and components). The data are augmented with context variables (both discrete and continuous), such as speaker information (age, dialect, education, height, race and sex) and the sequence of phonemes, which is aligned with the sequence of frames in the training set.
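A sketch of that normalization, assuming a single scalar mean and standard deviation over all frames and components (the array shape here is made up for illustration):

```python
import numpy as np

# Hypothetical stacked frames: (n_frames, frame_length).
rng = np.random.default_rng(0)
frames = rng.normal(loc=3.0, scale=2.0, size=(1000, 240))

# Center and normalize with statistics computed across ALL
# frames and components, as the preprocessing may have done.
mean = frames.mean()
std = frames.std()
normalized = (frames - mean) / std
```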
One proposed evaluation of the generative model resulting from our training is the conditional log-likelihood on the test set, i.e. the sum of the log-likelihoods of each test sentence given the speaker information and the unaligned sequence of phonemes. This might bias the type of model I would use toward models with tractable conditional log-likelihood.
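In my own notation (an assumption about the exact form of the evaluation), this metric would be:

```latex
\mathcal{L} = \sum_{i=1}^{N} \log p\!\left(x^{(i)} \mid s^{(i)}, \pi^{(i)}\right)
```

where $x^{(i)}$ is the $i$-th test waveform sequence, $s^{(i)}$ the corresponding speaker information, and $\pi^{(i)}$ the unaligned phoneme sequence.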
My code will be on this github.
P.S: For our own debugging, I wonder how easy it would be to generate .wav files from our generated sequences. It might require recomputing the original training set’s mean and standard deviation. But so far I can’t even listen to the original .wav files.
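If we had generated sequences back in the waveform domain, writing them out could look like this sketch (the generated samples and the training-set statistics here are placeholders; the real statistics would have to be recomputed as mentioned above):

```python
import numpy as np
from scipy.io import wavfile

# Stand-in for one second of normalized generated samples at 16 kHz.
rng = np.random.default_rng(0)
generated = rng.normal(size=16000)

# Placeholder statistics; the real ones would come from the training set.
train_mean, train_std = 0.0, 0.1

# Undo the normalization, scale to the int16 range, clip, and write
# a playable .WAV file for listening.
samples = generated * train_std + train_mean
samples = np.clip(samples * 32767, -32768, 32767).astype(np.int16)
wavfile.write("generated.wav", 16000, samples)
```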