Easily available software can imitate a person's voice with such accuracy that it can fool both humans and smart devices, according to a new report.
Researchers at the University of Chicago's Security, Algorithms, Networking and Data (SAND) Lab tested deepfake voice synthesis programs available on the open-source developer community site GitHub to see if they could unlock voice-recognition security on Amazon's Alexa, WeChat and Microsoft Azure.
One of the programs, known as SV2TTS, needs only five seconds of a person's speech to make a passable imitation, according to its developers.
Described as a 'real-time voice cloning toolbox,' SV2TTS was able to trick Microsoft Azure about 30 percent of the time but got the best of both WeChat and Amazon Alexa almost two-thirds, or 63 percent, of the time.
It was also able to fool human ears: 200 volunteers asked to identify the real voices from the deepfakes were tricked about half the time.
The deepfake audio was more successful at faking women's voices and those of non-native English speakers, though 'why that happened, we need to investigate further,' SAND Lab researcher Emily Wenger told New Scientist.
'We find that both humans and machines can be reliably fooled by synthetic speech and that existing defenses against synthesized speech fall short,' the researchers wrote in a report posted on the open-access server arXiv.
'Such tools in the wrong hands will enable a range of powerful attacks against both humans and software systems [aka machines].'
WeChat allows users to log in with their voice, and Alexa lets users make payments to third-party apps such as Uber by voice command, among other features, New Scientist reported. Microsoft Azure's voice-recognition system is certified by several industry bodies.
Wenger and her colleagues also tested another voice synthesis program, AutoVC, which requires five minutes of speech to re-create a target's voice.
AutoVC was able to fool Microsoft Azure only about 15 percent of the time, so the researchers did not test it against WeChat and Alexa.
The lab members were drawn to the subject of audio deepfakes after reading about con artists equipped with voice-imitation software duping a British energy-company executive into sending them more than $240,000 by pretending to be his German boss.
'We wanted to look at how practical can these attacks be, given that we've seen some evidence of them in the real world,' Emily Wenger, a PhD candidate in the SAND Lab, told New Scientist.
The unnamed victim wired the money to a secret account in Hungary in 2019 'to help the company avoid late-payment fines', according to the firm's insurer, Euler Hermes.
The director thought it was a 'strange' demand but believed the convincing German accent when he heard it over the phone, the Washington Post reported.
'The software was able to imitate the voice, and not only the voice—the tonality, the punctuation, the German accent,' the insurer said.
The thieves were only stopped when they tried the ruse a second time and the suspicious executive called his boss directly.
The perpetrators of the scam, billed as the world's first deepfake heist, were never identified and the money was never recovered.
Researchers at cybersecurity firm Symantec say they have found three similar cases of executives being told to send money to private accounts by thieves using AI programs.
One of these losses totaled millions of dollars, Symantec told the BBC.
Voice-synthesis technology works by taking a person's voice and breaking it down into syllables or short sounds before rearranging them into new speech.
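The cut-and-rearrange idea described above can be illustrated with a toy Python sketch. It is only a simplified illustration and not how SV2TTS or AutoVC are implemented. The function names, the fixed unit length and the sound labels are all hypothetical, and short lists of numbers stand in for real audio samples.

```python
from typing import Dict, List

Unit = List[float]  # a short run of audio samples (toy stand-in for real audio)


def build_unit_bank(recording: List[float], labels: List[str],
                    unit_len: int) -> Dict[str, Unit]:
    """Cut a recording into fixed-length units and index them by sound label.

    Assumes (hypothetically) that the recording has already been labeled,
    one label per unit_len samples. The first occurrence of a label is kept.
    """
    bank: Dict[str, Unit] = {}
    for i, label in enumerate(labels):
        start = i * unit_len
        bank.setdefault(label, recording[start:start + unit_len])
    return bank


def synthesize(bank: Dict[str, Unit], target_labels: List[str]) -> List[float]:
    """Rearrange stored units into a new sequence the speaker never said."""
    out: List[float] = []
    for label in target_labels:
        if label not in bank:
            raise KeyError(f"no recorded unit for sound {label!r}")
        out.extend(bank[label])
    return out


# Toy example: three labeled sounds, two samples each.
recording = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
bank = build_unit_bank(recording, ["ka", "to", "mi"], unit_len=2)
new_speech = synthesize(bank, ["mi", "ka"])
```

Here `new_speech` holds the samples for 'mi' followed by those for 'ka', an ordering that never occurred in the original recording. A real system would also need to smooth the joins between units and match pitch and timing, which this sketch leaves out.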