Listening to a vector

So it seems that the preprocessed data we were given are MFCCs computed on frames, which is problematic because that transform is hard to invert.

After some investigation with Jörg, we were able to obtain a rawer version of TIMIT. The .WAV files of the raw TIMIT are in NIST WAV format, which mainstream audio players cannot read (not even VLC!). One possible trick is to use the sox command in Linux to convert your file to a readable .WAV format.

sox [input] -t sndfile [output]

Not only does it allow you to listen to your files in mainstream audio players, but it is also a format that scipy can read to build vectors. This also answers my previous question about listening to vectors, because the package includes tools to create .WAV files from vectors.

So now we have the tools to convert TIMIT into a format like numpy.array, which is easier to use with standard machine learning algorithms.
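As a rough illustration of the round trip between .WAV files and vectors, here is a minimal sketch using scipy.io.wavfile. The helper names, the 16 kHz default rate, the peak normalization, and the 16-bit PCM encoding are my assumptions for the example, not necessarily what the actual script does.

```python
import os
import tempfile

import numpy as np
from scipy.io import wavfile


def vector_to_wav(path, samples, rate=16000):
    """Write a 1-D vector to a WAV file as 16-bit PCM (hypothetical helper).

    The vector is rescaled so that its peak maps to full scale; the 16 kHz
    default rate is an assumption for this example.
    """
    samples = np.asarray(samples, dtype=np.float64)
    peak = np.max(np.abs(samples)) if samples.size else 0.0
    if peak > 0:
        samples = samples / peak
    pcm = np.round(samples * 32767).astype(np.int16)
    wavfile.write(path, rate, pcm)
    return pcm


def wav_to_vector(path):
    """Read a (converted) WAV file back as (sample rate, numpy array)."""
    rate, data = wavfile.read(path)
    return rate, data


def roundtrip_demo():
    """Write a one-second 440 Hz tone to disk and read it back."""
    rate = 16000
    t = np.arange(rate) / rate
    tone = np.sin(2 * np.pi * 440.0 * t)
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "tone.wav")
        pcm = vector_to_wav(path, tone, rate)
        read_rate, back = wav_to_vector(path)
    return read_rate, pcm, back
```

Calling wav_to_vector on a file produced by the sox conversion should give an array you can feed to a learning algorithm, and vector_to_wav lets you listen to whatever vector comes out the other end.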

P.S.: It also seems that we have the alignment of the sound sequences with words. I’m not sure how much more useful it is than phonemes, because the dimensionality of this kind of feature (6102 words, 4739 after stemming with the nltk implementation) makes it an avenue for overfitting.
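For reference, the vocabulary counts before and after stemming can be obtained along these lines. This is only a sketch: the Porter stemmer is an assumption (nltk ships several stemmers), and vocabulary_sizes is a hypothetical helper that takes any list of word tokens.

```python
from nltk.stem import PorterStemmer


def vocabulary_sizes(words):
    """Return (number of distinct words, number of distinct stems)."""
    stemmer = PorterStemmer()  # assumption: which nltk stemmer was used is not stated
    raw = {w.lower() for w in words}
    stemmed = {stemmer.stem(w) for w in raw}
    return len(raw), len(stemmed)
```

On a toy list such as ["walk", "walks", "walked", "walking", "dark", "suit"], the four forms of "walk" collapse into a single stem, which is the same effect that shrinks the word vocabulary above.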

P.P.S.: The path to the formatted data, along with the instructions, can be found in the Google Doc. I used some code from Yann Dauphin in my script to link all the information together (waveform, speaker identity, speaker information, phoneme, word). If you have any questions or notice any bugs, please write in the comment section.

7 thoughts on “Listening to a vector”

  1. If you want to try using the words (which might actually be a good thing, and something that will have to be done at some stage, maybe not for the class project), I suggest you don’t stem them. Stemming would REMOVE information that is important for generating the associated sound. You are right that TIMIT is probably too small to learn a good word-to-phoneme mapping, and we would need to exploit other resources. There are phonetic dictionaries, and there are much larger speech datasets which relate words to acoustics.

  2. Thanks for this!
    So we really only have 4120 sentences to work with? I thought it would have been more.
    By the way, you have a typo in your readme/Google Doc: I’m guessing the word alignments file is train_wrd.npy rather than the name currently written there. Also, you mention the train_seq_to_phn.npy file twice.

    1. You’re right for the “wrd” and “phn”. I corrected it.

      As for the number of sentences, this is for a train-valid split of 4120-500. The split can be changed, but it can only go as far as 4619-1, right?
      Also, the statistical strength comes from the temporal structure of the individual sentences as well (well, they can be divided…), which, combined with a fairly high number of sentences, could allow some degree of learning.

      1. I think there might be an issue with the train – valid – test split, namely that the split for sentences and the split for phonemes don’t seem to line up. train_seq_to_phn seems to suggest that the training set contains 158084 phonemes, but train_phn has more rows (176580). valid_seq_to_phn and valid_phn aren’t consistent either. But test_seq_to_phn and test_phn are.
