So it seems that the preprocessed data we were given are MFCCs of frames, which is problematic because that transform is hard to invert.
After some investigation with Jörg, we were able to obtain a rawer version of TIMIT. The .WAV files of the raw TIMIT are in the NIST SPHERE format (despite the extension), which is not readable by mainstream audio players (not even VLC!). One possible trick is to use the sox command-line tool on Linux to convert your files to a readable .WAV format:
sox [input] -t sndfile [output]
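
To convert the whole corpus, one could wrap that command in a small Python script. This is just a sketch, and the timit_raw/ and timit_converted/ directory names are placeholders for wherever your copy lives:

import os
import subprocess

RAW_DIR = "timit_raw"        # placeholder: root of the raw TIMIT tree
OUT_DIR = "timit_converted"  # placeholder: where converted files go

for root, _, files in os.walk(RAW_DIR):
    for name in files:
        if name.upper().endswith(".WAV"):
            src = os.path.join(root, name)
            # Mirror the original directory structure in the output tree
            dst_dir = os.path.join(OUT_DIR, os.path.relpath(root, RAW_DIR))
            os.makedirs(dst_dir, exist_ok=True)
            dst = os.path.join(dst_dir, name)
            # Same sox invocation as above, applied file by file
            subprocess.run(["sox", src, "-t", "sndfile", dst], check=True)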
Not only does this allow you to listen to your files through mainstream audio players, but it also produces a format that scipy can read into vectors (see scipy.io.wavfile). This also answers my previous question about listening to vectors, because the package includes tools to create .WAV files from vectors.
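
As a minimal round-trip sketch, assuming the file has already been converted with sox as above (the filenames are placeholders):

import numpy as np
from scipy.io import wavfile

# Read a converted file into a numpy array (TIMIT is sampled at 16 kHz)
rate, signal = wavfile.read("converted_utterance.wav")  # placeholder filename
print(rate, signal.dtype, signal.shape)  # e.g. 16000, int16, (n_samples,)

# ... process the vector ...

# Write a vector back out as a playable .WAV file
wavfile.write("resynthesized.wav", rate, signal.astype(np.int16))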
So now we have the tools to convert TIMIT into a format like numpy.array, which is much more convenient for standard machine learning algorithms.
P.S: It also seems that we have the alignment of the sound sequences with words. I'm not sure how useful this is compared to phonemes, because the dimensionality of this kind of feature (6102 words, 4739 after stemming with the nltk implementation) makes it prone to overfitting.
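
For reference, TIMIT ships these alignments as plain-text .WRD (and .PHN) files with one "start_sample end_sample label" line per segment. A minimal parser might look like this (the SA1.WRD filename in the usage comment is just an example):

def read_alignment(path):
    """Parse a TIMIT .WRD or .PHN file into (start, end, label) triplets.

    The start/end integers are sample indices into the corresponding
    waveform, so a segment can be sliced directly out of the signal array.
    """
    segments = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 3:
                start, end, label = parts
                segments.append((int(start), int(end), label))
    return segments

# e.g. slicing one word out of the waveform loaded earlier:
# for start, end, word in read_alignment("SA1.WRD"):
#     chunk = signal[start:end]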
P.P.S: The path to the formatted data, along with instructions, can be found in the Google Doc. I used some code from Yann Dauphin in my script to link all the information together (waveform, speaker identity, speaker information, phonemes, words). If you have any questions or notice any bugs, please write in the comment section.