When doing unsupervised learning with sequences (well… it’s not entirely unsupervised, but still), a possible way to map our problem to a supervised learning problem is to make the Markov assumption of order k.
To clarify, in a sequence of size T we are trying, for example, to maximize the likelihood p(x_1, …, x_T). The Markov assumption of order k states that p(x_t | x_1, …, x_{t-1}) = p(x_t | x_{t-k}, …, x_{t-1}). AR(p) is a special case (the linear one) of such a model.
So we just have to define such a conditional model (including the limit case for the first k steps) the same way we would define a supervised regression model, thus obtaining a probabilistic model whose likelihood we maximize.
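Concretely, the mapping to a supervised problem amounts to slicing each sequence into (window, target) pairs. The sketch below uses a hypothetical helper, `make_supervised_pairs`, which is not part of the actual code:

```python
import numpy as np

def make_supervised_pairs(sequence, k):
    """Slice a 1-D sequence into (window, target) pairs under an
    order-k Markov assumption: predict x_t from x_{t-k}, ..., x_{t-1}."""
    X = np.array([sequence[t - k:t] for t in range(k, len(sequence))])
    y = np.array([sequence[t] for t in range(k, len(sequence))])
    return X, y

# Toy example: a sequence of 10 samples with k = 3
seq = np.arange(10, dtype=float)
X, y = make_supervised_pairs(seq, k=3)
# X has one row of k past samples per target in y
```

Any regression model (e.g. an MLP) can then be trained on (X, y) as usual.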
Anyway, this is just to try to justify the name here.
P.S: There is still no preprocessing yet. I’m not using phones AND phonemes. If people want to add that, feel free to make a separate branch.
P.P.S: I’m working with Vincent for a more pylearn2-friendly implementation of this.
EDIT: Here are some posts using this kind of modeling:
– First test with PyLearn2
– Speech synthesis project description and first attempt at a regression MLP
– Initial experiment: ‘aaaaaa’ and ‘oooooo’
– FIRST EXPERIMENT — VANILLA MLP WITH THEANO
I might have forgotten some. If so please tell me.
So I wrapped TIMIT into a class. You can use it as you see fit.
I haven’t added any preprocessing (centering, normalization, wavelets, Fourier transform, LPC…). (EDIT: I do, however, use the segment_axis function used by João here to cut the sequence into frames; copy this file into your Python path.)
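For illustration, here is a minimal stand-in for what segment_axis does; the real function is more general (it handles padding options and n-dimensional arrays), while this sketch only covers the simple 1-D case:

```python
import numpy as np

def frame_sequence(signal, frame_length, overlap=0):
    """Minimal stand-in for segment_axis: cut a 1-D signal into
    overlapping frames. Trailing samples that do not fill a whole
    frame are dropped."""
    hop = frame_length - overlap
    n_frames = 1 + (len(signal) - frame_length) // hop
    return np.stack([signal[i * hop:i * hop + frame_length]
                     for i in range(n_frames)])

# 12 samples, frames of 4 with an overlap of 2 -> hop of 2
frames = frame_sequence(np.arange(12), frame_length=4, overlap=2)
```

With an overlap of 2, consecutive frames share half their samples, which is the usual setup before applying a window and a transform per frame.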
This class is using a reduced set of phonemes, as the same phoneme can be written (and is written) in multiple ways (mentioned here).
People seem skeptical of the processing of the data that I have made.
I’m fine with that.
Because actually, by just looking at the integer vector, I can’t really tell whether it’s supposed to be a sound, whether someone has been playing a prank on me by replacing the meaningful waveform vectors with random ones, or whether the data is raw or in another representation like wavelets or MFCC. It’s actually somewhat interesting that we expect our machine learning algorithm to figure this out.
So, I’ve made a Python script to check whether the vectors make sense. I pick a random sentence from the training data, plot its waveform, display the corresponding phonemes and words, and output a .WAV file. I also output the features of the speaker so I can check whether the voice fits.
It’s supposed to say “Diane may splurge and buy a turquoise necklace”.
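As a sketch of the .WAV-dumping step, here is how a 16 kHz, 16-bit mono waveform (the TIMIT format) can be written with the standard wave module. The 440 Hz tone here is a stand-in for a real TIMIT waveform vector:

```python
import wave
import numpy as np

rate = 16000  # TIMIT is sampled at 16 kHz
t = np.arange(rate) / rate  # one second of samples
# Synthetic 440 Hz tone in 16-bit integer range, standing in for real data
waveform = (0.3 * np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)

with wave.open('check.wav', 'wb') as f:
    f.setnchannels(1)   # mono, like TIMIT
    f.setsampwidth(2)   # 16-bit samples
    f.setframerate(rate)
    f.writeframes(waveform.tobytes())
```

Listening to the resulting file is the quickest sanity check that the integer vectors really are raw audio.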
Also, reading the script might help you understand how to use the .npy and .pkl files.
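For reference, the two file formats are read back with numpy.load and pickle.load. The demo files below are created on the spot; substitute the actual .npy/.pkl names from the dataset:

```python
import pickle
import numpy as np

# Throwaway demo files standing in for the real dataset files
np.save('demo_waveform.npy', np.zeros(8, dtype=np.int16))
with open('demo_labels.pkl', 'wb') as f:
    pickle.dump(['d', 'ay', 'n'], f)

# .npy files hold numpy arrays (here, a waveform vector)
waveform = np.load('demo_waveform.npy')
# .pkl files hold arbitrary Python objects (here, a phoneme list)
with open('demo_labels.pkl', 'rb') as f:
    labels = pickle.load(f)
```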
P.S: On an unrelated note, while I would expect words not to bring much information beyond the phonemes, I would consider the final punctuation to be obviously important for learning prosody (assertion, question, exclamation…). So important that I haven’t included this feature yet…
P.P.S: Now, if I wanted to transform the data into mainstream representations like the Fourier transform or wavelets, I might want to try the scipy signal processing (scipy.signal) and discrete Fourier transform (scipy.fftpack) packages.
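As a quick sketch of the Fourier route, here is the windowed magnitude spectrum of a single frame using numpy.fft (scipy.fftpack exposes equivalent routines); the 1 kHz sine is a synthetic test frame, not TIMIT data:

```python
import numpy as np

rate = 16000  # TIMIT sampling rate
n = 256       # frame length: 16 ms at 16 kHz
# Synthetic test frame: a pure 1 kHz sine
frame = np.sin(2 * np.pi * 1000 * np.arange(n) / rate)

# Hamming window to reduce spectral leakage, then real FFT
spectrum = np.abs(np.fft.rfft(frame * np.hamming(n)))
freqs = np.fft.rfftfreq(n, d=1.0 / rate)
peak = freqs[np.argmax(spectrum)]  # should land near 1000 Hz
```

Stacking such spectra over consecutive frames gives a spectrogram, a common alternative input representation to the raw waveform.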