People seem skeptical of the data processing I’ve done.
I’m fine with that.
Fair enough: just by looking at an integer vector, I can’t really tell whether it’s supposed to be a sound or whether someone has pranked me by replacing the meaningful waveform vectors with random ones — nor whether the data is raw or in another representation like wavelets or MFCCs. It’s actually somewhat interesting that we expect our machine learning algorithm to figure this out.
So I’ve made a Python script to check that the vectors make sense. It picks a random sentence from the training data, plots its waveform, prints the corresponding phonemes and words, and writes a .WAV file. It also outputs the speaker’s features, so I can check whether the voice fits.
It’s supposed to say “Diane may splurge and buy a turquoise necklace”.
Also, reading the script might help you understand how to use the .npy and .pkl files.
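The script itself isn’t reproduced here, but a minimal, runnable sketch of the same kind of sanity check might look like the following. The filenames, the 16 kHz sample rate, the int16 layout, and the metadata keys are all my assumptions for illustration, not the actual corpus spec:

```python
import pickle

import numpy as np
from scipy.io import wavfile

# Assumed layout (a guess, not the corpus spec): one .npy per utterance
# holding the raw int16 waveform, plus a .pkl with its metadata.
SAMPLE_RATE = 16000  # assumed sampling rate

# Fabricate a stand-in waveform so the sketch runs end to end.
t = np.linspace(0, 1.0, SAMPLE_RATE, endpoint=False)
tone = 0.3 * np.iinfo(np.int16).max * np.sin(2 * np.pi * 220 * t)
np.save("utterance.npy", tone.astype(np.int16))
with open("utterance.pkl", "wb") as fh:
    pickle.dump({"words": ["diane", "may", "splurge"], "speaker": "F-01"}, fh)

# The actual sanity check: load the vector, inspect the metadata,
# and write a .wav you can listen to.
vec = np.load("utterance.npy")
assert vec.dtype == np.int16, "expected raw 16-bit PCM samples"
with open("utterance.pkl", "rb") as fh:
    meta = pickle.load(fh)
wavfile.write("check.wav", SAMPLE_RATE, vec)

# Round-trip read to confirm the .wav holds the same samples.
sr, back = wavfile.read("check.wav")
print(sr, meta["speaker"], np.array_equal(back, vec))
```

If the round-trip samples match and the audio sounds like speech (rather than noise), the integer vectors really are waveforms.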
P.S.: On an unrelated note: while I wouldn’t expect words to add much information over the phonemes, I would consider the final punctuation obviously important for prosody learning (statement, question, exclamation…). So important that I haven’t included this feature yet…
P.P.S.: If I wanted to transform the data into mainstream representations like the Fourier transform or wavelets, I would try SciPy’s signal processing and discrete Fourier transform packages (scipy.signal and scipy.fft).
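As a rough illustration of what those packages give you, here is a sketch using a toy 440 Hz tone in place of a real utterance (the tone and the 16 kHz rate are my stand-ins): `scipy.fft.rfft` for a global spectrum, and `scipy.signal.stft` for a time-frequency representation closer to what a spectrogram front-end would use.

```python
import numpy as np
from scipy.fft import rfft, rfftfreq
from scipy.signal import stft

SAMPLE_RATE = 16000  # stand-in rate for this toy example
t = np.linspace(0, 1.0, SAMPLE_RATE, endpoint=False)
signal = np.sin(2 * np.pi * 440 * t)  # toy 440 Hz tone, not a real utterance

# Global DFT: which frequency dominates the whole signal?
spectrum = np.abs(rfft(signal))
freqs = rfftfreq(len(signal), d=1.0 / SAMPLE_RATE)
peak = freqs[np.argmax(spectrum)]
print(peak)  # the 440 Hz tone falls exactly on a 1 Hz-spaced bin

# Short-time Fourier transform: frequency content over time,
# with nperseg//2 + 1 frequency bins per frame.
f, times, Z = stft(signal, fs=SAMPLE_RATE, nperseg=512)
print(Z.shape)
```

The STFT output is what you would feed further processing (log-magnitude, mel filterbanks, etc.) to get the usual spectrogram-style features.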