Google’s AI can adjust voice emotion, pitch, and speed with 30 minutes of data

Source: venturebeat.com

In a paper originally published last October and accepted to the International Conference on Learning Representations (ICLR) 2020, researchers affiliated with Google and University College London propose an AI model that enables control of speech characteristics like pitch, emotion, and speaking rate with as little as 30 minutes of data.

The work has obvious commercial implications. Brand voices such as Progressive’s Flo (played by comedian Stephanie Courtney) are often pulled in for pick-ups — sessions to address mistakes, changes, or additions in voiceover scripts — long after a recording finishes. AI-assisted voice correction could eliminate the need for these, saving time and money on the part of the actors’ employers.

A previous study investigated the use of so-called style tokens (which represented different categories of emotion) to control speech affect. The method achieved good results with only 5% of labeled data, but it couldn’t handle speech samples in which prosody (i.e., intonation, tone, stress, and rhythm) varies while the emotion stays fixed. The work from Google and University College London addresses this limitation.

The researchers trained the system for 300,000 steps across 32 of Google’s custom-designed tensor processing units (TPUs), a scale of compute exceeding that used in previous work. They report that using 30 minutes of labeled data allowed for a “significant degree” of control over speech rate, valence, and arousal, and that affect accuracy didn’t degrade noticeably as long as at least 10% of the data was labeled. They add that just 3 minutes of data allowed for control of speech rate and extrapolation beyond the data seen during training, a result they claim beat state-of-the-art baselines.
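To put those durations in perspective, here is a minimal sketch that expresses them as a share of the 45 hours of audio reported for the training data set. It assumes the 45-hour figure is the full pool the labeled minutes are drawn from and that the 3-minute figure also refers to labeled audio; the article states neither explicitly. The function and constant names are illustrative only.

```python
# Figures quoted in the article. Treating 45 hours as the total pool
# that the labeled minutes are measured against is an assumption.
TOTAL_AUDIO_HOURS = 45
LABELED_MINUTES = (30, 3)


def labeled_fraction(labeled_minutes: float, total_hours: float) -> float:
    """Return labeled audio as a fraction of the total audio."""
    return labeled_minutes / (total_hours * 60)


for minutes in LABELED_MINUTES:
    share = labeled_fraction(minutes, TOTAL_AUDIO_HOURS)
    print(f"{minutes} min labeled = {share:.2%} of {TOTAL_AUDIO_HOURS} h")
```

Under those assumptions, 30 minutes works out to roughly 1% of the audio and 3 minutes to roughly 0.1%.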

The researchers’ system taps a trained generative model that can synthesize acoustic features from text. Similar to Google’s Tacotron 2, a text-to-speech (TTS) system that generates natural-sounding speech from raw transcripts, the new system produces spectrograms, which are visual representations of frequencies over time. A second model, such as DeepMind’s WaveNet, is then trained to act as a vocoder (a voice codec that analyzes and synthesizes voice data) and turn those spectrograms into audio. (This system uses WaveRNN.)
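For readers unfamiliar with spectrograms, the sketch below computes one for a synthetic tone using only Python's standard library. It is a plain magnitude spectrogram built with a naive discrete Fourier transform, not the researchers' feature pipeline (Tacotron 2, for instance, works with mel-scaled spectrograms), and the sample rate, frame size, and function name are arbitrary choices made for the illustration.

```python
import cmath
import math


def magnitude_spectrogram(samples, frame_size=64, hop=32):
    """Return one list of magnitudes (bins 0..frame_size//2) per frame."""
    # A Hann window tapers each frame to reduce leakage at its edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        bins = []
        for k in range(frame_size // 2 + 1):
            # Naive DFT: correlate the frame with a complex sinusoid at bin k.
            total = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                        for n in range(frame_size))
            bins.append(abs(total))
        frames.append(bins)
    return frames


SAMPLE_RATE = 8000  # Hz, arbitrary choice for the demo
TONE_HZ = 1000      # lands exactly on bin 8 when frame_size is 64
signal = [math.sin(2 * math.pi * TONE_HZ * t / SAMPLE_RATE) for t in range(512)]

spec = magnitude_spectrogram(signal)
# For each frame, find the frequency bin with the largest magnitude.
peak_bins = [max(range(len(frame)), key=frame.__getitem__) for frame in spec]
peak_hz = peak_bins[0] * SAMPLE_RATE / 64
print(len(spec), "frames; strongest frequency in first frame:", peak_hz, "Hz")
```

Each row of `spec` is one time slice and each column one frequency band, which is the grid a spectrogram image plots. A vocoder's job is the reverse direction: turning such a grid back into a waveform.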

The system was trained on an annotated data set comprising 72,405 recordings of up to about 5 seconds each from 40 English speakers, amounting to 45 hours of audio. The speakers, all of whom were trained voice actors, were prompted to read text snippets with varying levels of valence (emotions like sadness or happiness) and arousal (excitement or energy). From these sessions, the researchers obtained six possible affective states, which they modeled and used as labels along with labels for speaking rate (here defined as the number of syllables per second in each utterance).
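The speaking-rate label is simple enough to sketch. The example below computes syllables per second for a single utterance. The syllable counter is a rough vowel-group heuristic chosen for illustration, not the method the researchers used (the article doesn't say how syllables were counted), and both function names are hypothetical.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: one syllable per run of vowels, at least one per word.

    This miscounts words with silent vowels (for example a trailing "e"),
    so treat the result as an approximation.
    """
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def speaking_rate(transcript: str, duration_seconds: float) -> float:
    """Return syllables per second for one utterance."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    words = re.findall(r"[A-Za-z']+", transcript)
    syllables = sum(count_syllables(word) for word in words)
    return syllables / duration_seconds


rate = speaking_rate("The weather is sunny today", 2.0)
print(f"{rate:.1f} syllables per second")
```

A production system would more likely take syllable counts from a pronunciation dictionary or a phoneme aligner than from spelling.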

One of the voices the system modified to have high arousal and an “angry” valence sounds not unlike the default Google Assistant voice, interestingly.

The study’s coauthors acknowledge that the work might raise ethical concerns because it could be misused for misinformation or to commit fraud. Indeed, deepfakes — media that takes a person in an existing image, audio recording, or video and replaces them with someone else’s likeness using AI — are multiplying quickly, and have already been used to defraud a major energy producer. In tandem with tools like Resemble, Baidu’s Deep Voice, and Lyrebird, which need only seconds to minutes of audio samples to clone someone’s voice, it’s not difficult to imagine how this new system might add fuel to the fire.

But the coauthors also assert that in this case, since the focus of this work is on improved prosody with potential benefits to human-computer interfaces, the benefits likely outweigh the risks. “We … urge the research community to take seriously the potential for misuse both of this work and broader advances in TTS,” they wrote.
