Too Real: Microsoft has developed a new iteration of its neural codec language model, Vall-E, that surpasses previous efforts in terms of naturalness, speech robustness, and speaker similarity. It is the first of its kind to reach human parity in a pair of popular benchmarks, and is apparently so lifelike that Microsoft has no plans to grant access to the public.
Building on the original Vall-E, the new AI voice tool integrates two major enhancements that Microsoft says greatly improve performance. Grouped code modeling organizes codec codes into groups, which shortens the sequence the model has to process. That boosts inference speed and helps overcome the challenges associated with long-sequence modeling.
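To make the sequence-length point concrete, here is a minimal Python sketch of the grouping idea. It is an illustration under stated assumptions, not Microsoft's implementation: the function names, the fixed group size, and the padding of the final group with a `pad_id` are all hypothetical choices made for this example.

```python
def group_codes(codes, group_size, pad_id=0):
    """Pack a flat list of codec codes into fixed-size groups.

    Assumption: the last group is padded with pad_id when the
    sequence length is not a multiple of group_size.
    """
    padded = list(codes)
    remainder = len(padded) % group_size
    if remainder:
        padded.extend([pad_id] * (group_size - remainder))
    # Each tuple is treated as one step, so the model sees fewer steps.
    return [tuple(padded[i:i + group_size])
            for i in range(0, len(padded), group_size)]


def ungroup_codes(groups, original_length):
    """Flatten the groups and drop any padding added by group_codes."""
    flat = [code for group in groups for code in group]
    return flat[:original_length]


codes = [11, 12, 13, 14, 15, 16, 17]       # 7 made-up codec codes
groups = group_codes(codes, group_size=2)  # 4 steps instead of 7
restored = ungroup_codes(groups, len(codes))
```

With a group size of 2, the model steps through roughly half as many positions, which is where the speed gain described above would come from.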
Repetition-aware sampling, meanwhile, rethinks the original nucleus sampling process by checking for token repetition during decoding. Microsoft said this helps stabilize decoding and prevents the infinite-loop issue that was present in the original Vall-E.
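A rough Python sketch of how such a check could work is shown below. It assumes the decoder first draws a token with nucleus sampling, then measures how often that token appears in a recent window of output, and falls back to sampling from the full distribution when the token repeats too much. The function names, the window size, and the threshold are hypothetical values chosen for illustration, not figures from Microsoft.

```python
import random


def nucleus_sample(probs, top_p, rng):
    """Sample from the smallest set of top tokens whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(probs, history, rng,
                            top_p=0.9, window=10, max_ratio=0.5):
    """Nucleus-sample a token, but resample if it dominates recent output.

    window and max_ratio are illustrative assumptions.
    """
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    if recent:
        ratio = recent.count(token) / len(recent)
        if ratio > max_ratio:
            # Fallback: draw from the full distribution to break the loop.
            token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token


rng = random.Random(0)
probs = [0.97, 0.01, 0.01, 0.01]  # one token dominates, a loop-prone case
stuck_history = [0] * 10          # decoder has emitted token 0 ten times
next_token = repetition_aware_sample(probs, stuck_history, rng)
```

In the stuck case above, plain nucleus sampling can only ever return token 0, while the fallback gives the other tokens a chance to be picked and end the loop.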
Microsoft put Vall-E 2 to the test using the LibriSpeech and VCTK datasets, and it passed both with flying colors. When Redmond claims the AI tool achieves human parity, it means Vall-E 2 matched or exceeded ground truth samples in robustness, similarity, and naturalness. In other words, the tool can produce natural speech that is virtually identical to that of the original speaker.
Microsoft shared dozens of samples from Vall-E 2, which can be found on the project summary page. The samples are remarkably lifelike and hard to tell apart from the human speaker. The AI tool even masters subtleties like putting emphasis on the correct word in a sentence, as people subconsciously do when speaking.
Microsoft said Vall-E 2 is purely a research project, adding that it has no plans to incorporate the tech into a consumer product or release the tool to the general public. Redmond further noted that the technology carries potential risks of misuse, such as impersonating a specific person or spoofing voice identification.
That said, the company believes it could have applications in education, translation, accessibility, journalism, self-authored content, and chatbots, among others.
Image credit: Rootnot Creations