I’m following the IFT6266 course at Université de Montréal, and the project for this course is about speech synthesis. This blog is where I will log my experiments and thoughts on this project.
As explained here (and shown here), speech synthesis is traditionally split into two stages: prosody building and waveform generation. Once the prosody is built, the waveform is typically generated by gluing together components of a dictionary that match the prosody and the phonemes, then smoothing the resulting patchwork. This compartmentalization of the task has advantages and drawbacks similar to those of relying on prior knowledge: the system is easier to monitor (and debug), but we are likely to bias it (here, slightly toward a robot voice). I find the result quite good though.
However, we could hope to obtain something better by generating the waveform more directly, and thereby possibly get a smoother waveform. We will try to build such a system through a machine learning approach (some bibliographical references regarding speech synthesis with this kind of approach are here). If we bypass the explicit prosody-building stage, we can hope that the system will somehow learn to build the prosody by itself. For this learning task, we will be given a dataset of people (of different sex, race, age, height and education) saying some “phonetically rich sentences” in several dialects.
This dataset takes the form of several varying-length sequences of audio frames, each frame represented as a fixed-length real vector. These frames might have been centered and normalized for our convenience (with the mean and standard deviation computed across all frames and components). The data are augmented with context variables (both discrete and continuous), such as speaker information (age, dialect, education, height, race and sex) and the sequence of phonemes, which is aligned with the sequence of frames in the training set.
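To make the centering and normalization concrete, here is a minimal sketch of what I assume it means: a single scalar mean and a single scalar standard deviation, computed over every component of every frame of every sequence. The function names and the nested-list layout (sequences of frames of floats) are my own assumptions, not the format of the course dataset.

```python
import math

def global_mean_std(sequences):
    """Scalar mean and std over all sequences, frames and components."""
    values = [x for seq in sequences for frame in seq for x in frame]
    mean = sum(values) / len(values)
    var = sum((x - mean) ** 2 for x in values) / len(values)
    return mean, math.sqrt(var)

def normalize(sequences, mean, std):
    """Center and scale every component with the same scalar mean and std.

    Assumes std > 0, i.e. the data is not constant.
    """
    return [[[(x - mean) / std for x in frame] for frame in seq]
            for seq in sequences]

# Two toy "sentences" of different lengths, with frames of length 2.
toy = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]
mean, std = global_mean_std(toy)
normalized = normalize(toy, mean, std)
```

Keeping `mean` and `std` around matters, because they are needed to map generated frames back to the original scale.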
One proposed evaluation of the generative model resulting from our training is the conditional log-likelihood on the test set, i.e. the sum of the log-likelihoods of each test sentence given the speaker information and the unaligned sequence of phonemes. This might bias my choice toward models with a tractable conditional log-likelihood.
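As a sketch of how I understand this evaluation, the score is just a sum over test sentences of whatever log-probability the model assigns to the frames given the context. Everything named here is hypothetical: `log_prob` stands for any model with a tractable conditional log-likelihood, and `toy_log_prob` is a placeholder that ignores the context and treats every component as an independent standard Gaussian.

```python
import math

def conditional_log_likelihood(log_prob, test_set):
    """Sum of log p(frames | speaker, phonemes) over all test sentences.

    `test_set` is a list of (frames, speaker, phonemes) triples and
    `log_prob` is any callable returning the model's log-probability.
    """
    return sum(log_prob(frames, speaker, phonemes)
               for frames, speaker, phonemes in test_set)

def toy_log_prob(frames, speaker, phonemes):
    """Placeholder model: independent standard Gaussians, context ignored."""
    log_norm = -0.5 * math.log(2 * math.pi)
    return sum(log_norm - 0.5 * x * x for frame in frames for x in frame)

# Two toy test sentences with empty context.
toy_test_set = [([[0.0]], {}, []), ([[0.0, 0.0]], {}, [])]
score = conditional_log_likelihood(toy_log_prob, toy_test_set)
```

A real model would of course use the speaker information and the phonemes; the point is only that the per-sentence term must be something we can actually compute.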
My code will be on this GitHub.
P.S.: For our own debugging, I wonder how easy it would be to generate a .wav file from our generated sequences. It might require recomputing the original training set mean and standard deviation. But so far I can’t even listen to the original .wav files.
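A rough sketch of what that conversion could look like, using only Python's standard `wave` and `struct` modules. It rests on assumptions I have not verified against the dataset: that frames are consecutive, non-overlapping chunks of raw samples, that the audio is 16-bit mono, and that the sample rate is 16 kHz. The function name `frames_to_wav` is my own.

```python
import struct
import wave

def frames_to_wav(frames, mean, std, path, sample_rate=16000):
    """Undo the normalization and write the samples as a 16-bit mono .wav.

    Assumes non-overlapping frames of raw samples (unverified assumption).
    Returns the number of samples written.
    """
    # Map every component back to the original scale and flatten the frames.
    samples = [x * std + mean for frame in frames for x in frame]
    # Round and clip to the signed 16-bit range.
    clipped = [max(-32768, min(32767, int(round(s)))) for s in samples]
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 2 bytes per sample = 16 bits
        f.setframerate(sample_rate)
        f.writeframes(struct.pack("<%dh" % len(clipped), *clipped))
    return len(clipped)
```

Calling `frames_to_wav(generated_frames, mean, std, "sample.wav")` would then give a file that any audio player should open, provided the assumptions above hold.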