Waveform Synthesis of Held-out LibriSpeech Samples

Original audio
Griffin-Lim (3 iterations)
Griffin-Lim (50 iterations)
Griffin-Lim (150 iterations)
SPSI
SPSI + Griffin-Lim (3 iterations)
SPSI + Griffin-Lim (50 iterations)
MCNN (baseline)
MCNN (2 heads)
MCNN (filter width of 9)
MCNN (losses (1) and (2))
MCNN (loss (1) only)
Audio demos

Original audio
Griffin-Lim (3 iterations)
Griffin-Lim (50 iterations)
Griffin-Lim (150 iterations)
SPSI
SPSI + Griffin-Lim (3 iterations)
SPSI + Griffin-Lim (50 iterations)
MCNN (baseline)
MCNN (2 heads)
MCNN (filter width of 9)
MCNN (losses (1) and (2))
MCNN (loss (1) only)

Original audio
Griffin-Lim (3 iterations)
Griffin-Lim (50 iterations)
Griffin-Lim (150 iterations)
SPSI
SPSI + Griffin-Lim (3 iterations)
SPSI + Griffin-Lim (50 iterations)
MCNN (baseline)
MCNN (2 heads)
MCNN (filter width of 9)
MCNN (losses (1) and (2))
MCNN (loss (1) only)


Waveform Synthesis of Samples from an Unseen Speaker

Original audio
Griffin-Lim (150 iterations)
SPSI + Griffin-Lim (50 iterations)
MCNN (baseline)
MCNN (trained on the unseen speaker dataset)

Original audio
Griffin-Lim (150 iterations)
SPSI + Griffin-Lim (50 iterations)
MCNN (baseline)
MCNN (trained on the unseen speaker dataset)

Original audio
Griffin-Lim (150 iterations)
SPSI + Griffin-Lim (50 iterations)
MCNN (baseline)
MCNN (trained on the unseen speaker dataset)


Contributions of Multiple Heads

Original audio
MCNN
Head 1 output
Head 2 output
Head 3 output
Head 4 output
Head 5 output
Head 6 output
Head 7 output
Head 8 output