Microsoft's AI speech generator achieves human parity but is too dangerous for the public

Shawn Knight

Too Real: Microsoft has developed a new iteration of its neural codec language model, Vall-E, that surpasses previous efforts in terms of naturalness, speech robustness, and speaker similarity. It is the first of its kind to reach human parity in a pair of popular benchmarks, and is apparently so lifelike that Microsoft has no plans to grant access to the public.

Leveraging Vall-E's groundwork, the new AI voice tool integrates two major enhancements that greatly improve performance. Grouped code modeling organizes the codec codes into groups so the model handles one group per step, shortening sequence lengths, which boosts inference speed and helps overcome the challenges of modeling long sequences.
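To make the grouping idea concrete, here is a minimal Python sketch of partitioning a codec code sequence into fixed-size groups. The group size, padding sentinel, and function name are illustrative assumptions, not details from Microsoft's paper; the point is simply that modeling one group per step divides the length of the sequence the model has to handle.

```python
# Minimal sketch of grouped code modeling (illustrative only; group size,
# names, and padding strategy are assumptions, not Microsoft's code).
from typing import List

def group_codec_codes(codes: List[int], group_size: int = 2) -> List[List[int]]:
    """Partition a flat sequence of codec codes into fixed-size groups.

    Modeling one group per autoregressive step shortens the sequence the
    model must attend over by roughly a factor of `group_size`, which is
    the intuition behind faster inference and easier long-sequence modeling.
    """
    # Pad with a sentinel so the sequence divides evenly into groups.
    PAD = -1
    remainder = len(codes) % group_size
    if remainder:
        codes = codes + [PAD] * (group_size - remainder)
    return [codes[i:i + group_size] for i in range(0, len(codes), group_size)]

# Example: 7 codec codes with group_size=2 -> 4 modeling steps instead of 7.
print(group_codec_codes([101, 57, 903, 12, 44, 618, 7], group_size=2))
```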

Repetition-aware sampling, meanwhile, rethinks the original nucleus sampling process to look for token repetition when decoding. Microsoft said this helps stabilize decoding and prevents the infinite-loop issue that was present in the original Vall-E.
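Based only on that description, a hedged sketch of how such a sampler might work is shown below: decode with nucleus (top-p) sampling, and if the chosen token has repeated too often in the recent history, fall back to sampling from the full distribution. The window size, repetition threshold, and helper names are assumptions for illustration, not Microsoft's actual implementation.

```python
# Hedged sketch of repetition-aware sampling, based only on the article's
# description: sample with nucleus (top-p) sampling, but if the chosen token
# has repeated too often in the recent window, resample from the full
# distribution. Thresholds and window size here are illustrative.
import random
from typing import Dict, List

def nucleus_sample(probs: Dict[int, float], top_p: float = 0.9) -> int:
    # Keep the smallest set of highest-probability tokens whose mass >= top_p.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        total += p
        if total >= top_p:
            break
    tokens, weights = zip(*kept)
    return random.choices(tokens, weights=weights, k=1)[0]

def repetition_aware_sample(probs: Dict[int, float],
                            history: List[int],
                            window: int = 10,
                            repeat_ratio: float = 0.5,
                            top_p: float = 0.9) -> int:
    candidate = nucleus_sample(probs, top_p)
    recent = history[-window:]
    if recent and recent.count(candidate) / len(recent) >= repeat_ratio:
        # Too repetitive: fall back to sampling from the full distribution,
        # which is one way to break the kind of loop the article describes.
        tokens, weights = zip(*probs.items())
        return random.choices(tokens, weights=weights, k=1)[0]
    return candidate

# Example usage with a toy distribution over three codec tokens.
history = [5, 5, 5, 5, 5, 5]
probs = {5: 0.7, 8: 0.2, 9: 0.1}
print(repetition_aware_sample(probs, history))
```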

Microsoft put Vall-E 2 to the test using the LibriSpeech and VCTK datasets, and it passed them both with flying colors. When Redmond claims the AI tool achieves human parity, it means Vall-E 2 performed better than the ground-truth samples on robustness, similarity, and naturalness. In other words, the tool can produce natural speech that is virtually identical to the original speaker.

Microsoft shared dozens of samples from Vall-E 2, which can be found over on the project summary page. Indeed, Vall-E 2 samples are incredibly lifelike and indistinguishable from the human speaker. The AI tool even masters subtleties like putting emphasis on the correct word in a sentence as people subconsciously do when speaking.

Microsoft said Vall-E 2 is purely a research project, adding that it has no plans to incorporate the tech into a consumer product or release the tool to the general public. Redmond further noted that it carries potential risk for misuse, such as impersonating a specific person or spoofing voice identification.

That said, the company believes it could have applications in education, translation, accessibility, journalism, self-authored content, and chatbots, among others.

Image credit: Rootnot Creations


 
MS self-validated their own AI speech generator and said it is so good. That kind of raised many red flags for me. And by the way, nobody has access to it, nor does anyone know how it works in the background. They only shared a dozen samples, and it was concluded that it reached human parity. Wow.
 
Meh, this tactic is old and is just a way to get people hyped up about it for when they do release it someday. They knew the capability that they were after when they developed this, it isn't like it is an accident. Like other models, it remains to be seen if the model's capabilities truly generalize, and how difficult it is to get it out of the uncanny valley.
 
"Microsoft said Vall-E 2 is purely a research project, adding that it has no plans to incorporate the tech into a consumer product or release the tool to the general public."

Microsoft doesn't pour research money and effort into anything without plans to make a profit from it at some point.
 
Very impressive! I'm sure certain government agencies already have it, if they haven't already used it in one of their clandestine operations!
How long will it be before this becomes another weapon in the cyber criminal's arsenal?
These guys need to be held to the same standards as everybody else, which would mean jail time. American law is becoming a farce.
 
If it is too dangerous for people then why create it?
Just a matter of time before everyone has it now, there's no stuffing the genie back in the bottle.
 
Some of the existing text-to-speech generators are fairly good, but they fall apart when they start saying things like "30 M P H" rather than "30 miles per hour". The other key indicator is how enthused they are about the most mundane cr@p. When using text-to-speech on books, there should be a lookup available that allows the author to convert difficult-to-pronounce words to a phonetic equivalent. That can't be difficult to implement.
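For what it's worth, the lookup the commenter describes could be as simple as a substitution table applied before the text ever reaches the TTS engine. The table entries and the normalize() helper below are hypothetical, just to show the shape of the idea.

```python
# Toy sketch of an author-supplied pronunciation lookup: map hard-to-pronounce
# spellings or unit abbreviations to speakable equivalents before synthesis.
# The table contents and the normalize() helper are hypothetical examples.
import re

PRONUNCIATION_TABLE = {
    "mph": "miles per hour",
    "km/h": "kilometres per hour",
    "Hermione": "her-MY-oh-nee",
}

def normalize(text: str) -> str:
    """Replace known abbreviations/names with speakable equivalents."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        return PRONUNCIATION_TABLE.get(word, PRONUNCIATION_TABLE.get(word.lower(), word))
    return re.sub(r"[\w/]+", swap, text)

print(normalize("The car hit 30 mph before Hermione braked."))
# -> "The car hit 30 miles per hour before her-MY-oh-nee braked."
```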

Saying that, human voice actors can be pretty poor too. I was listening to a fun sci-fi story about a murderous robot. It was being narrated by a real human, but one with a very effeminate voice. Maybe that's how the author pictured the robot sounding, but it definitely sounded like the robot had other things on its mind rather than murder :)
 
Imitating someone's likeness and voice honestly should be illegal.
So all comedians who do impressions should be imprisoned?

Instead of kneejerk reactionism, how about we rely on the system of ethics that's worked for us for centuries? If (a) there is a conscious intent to deceive, and (b) you're physically or financially harmed by the deception -- then it should be illegal.
 
These guys need to be held to the same standards as everybody else, which would mean jail time. American law is becoming a farce.
It just follows European law where the only armed part of society are the police who answer solely to the Government. At least in the US the public have the right to bear arms and hold the Government (and their police) in check.
 