The Terrifying A.I. Scam That Uses Your Loved One’s Voice

On a recent night, a woman named Robin was asleep next to her husband, Steve, in their Brooklyn home, when her phone buzzed on the bedside table. Robin is in her mid-thirties with long, dirty-blond hair. She works as an interior designer, specializing in luxury homes. The couple had gone out to a natural-wine bar in Cobble Hill that evening, and had come home a few hours earlier and gone to bed. Their two young children were asleep in bedrooms down the hall. “I’m always, like, kind of one ear awake,” Robin told me, recently. When her phone rang, she opened her eyes and looked at the caller I.D. It was her mother-in-law, Mona, who never called after midnight. “I’m, like, maybe it’s a butt-dial,” Robin said. “So I ignore it, and I try to roll over and go back to bed. But then I see it pop up again.”

She picked up the phone, and, on the other end, she heard Mona’s voice wailing and repeating the words “I can’t do it, I can’t do it.” “I thought she was trying to tell me that some horrible tragic thing had happened,” Robin told me. Mona and her husband, Bob, are in their seventies. She’s a retired party planner, and he’s a dentist. They spend the warm months in Bethesda, Maryland, and winters in Boca Raton, where they play pickleball and canasta. Robin’s first thought was that there had been an accident. Robin’s parents also winter in Florida, and she pictured the four of them in a car wreck. “Your brain does weird things in the middle of the night,” she said. Robin then heard what sounded like Bob’s voice on the phone. (The family members requested that their names be changed to protect their privacy.) “Mona, pass me the phone,” Bob’s voice said, then, “Get Steve. Get Steve.” Robin took this—that they didn’t want to tell her while she was alone—as another sign of their seriousness. She shook Steve awake. “I think it’s your mom,” she told him. “I think she’s telling me something terrible happened.”

Steve, who has close-cropped hair and an athletic build, works in law enforcement. When he opened his eyes, he found Robin in a state of panic. “She was screaming,” he recalled. “I thought her whole family was dead.” When he took the phone, he heard a relaxed male voice—possibly Southern—on the other end of the line. “You’re not gonna call the police,” the man said. “You’re not gonna tell anybody. I’ve got a gun to your mom’s head, and I’m gonna blow her brains out if you don’t do exactly what I say.”

Steve used his own phone to call a colleague with experience in hostage negotiations. The colleague was muted, so that he could hear the call but wouldn’t be heard. “You hear this???” Steve texted him. “What should I do?” The colleague wrote back, “Taking notes. Keep talking.” The idea, Steve said, was to continue the conversation, delaying violence and trying to learn any useful information.

“I want to hear her voice,” Steve said to the man on the phone.

The man refused. “If you ask me that again, I’m gonna kill her,” he said. “Are you fucking crazy?”

“O.K.,” Steve said. “What do you want?”

The man demanded money for travel; he wanted five hundred dollars, sent through Venmo. “It was such an insanely small amount of money for a human being,” Steve recalled. “But also: I’m obviously gonna pay this.” Robin, listening in, reasoned that someone had broken into Steve’s parents’ home to hold them up for a little cash. On the phone, the man gave Steve a Venmo account to send the money to. It didn’t work, so he tried a few more, and eventually found one that did. The app asked what the transaction was for.

“Put in a pizza emoji,” the man said.

After Steve sent the five hundred dollars, the man patched in a female voice—a girlfriend, it seemed—who said that the money had come through, but that it wasn’t enough. Steve asked if his mother would be released, and the man got upset that he was bringing this up with the woman listening. “Whoa, whoa, whoa,” he said. “Baby, I’ll call you later.” The implication, to Steve, was that the woman didn’t know about the hostage situation. “That made it even more real,” Steve told me. The man then asked for an additional two hundred and fifty dollars to get a ticket for his girlfriend. “I’ve gotta get my baby mama down here to me,” he said. Steve sent the additional sum, and, when it processed, the man hung up.

By this time, about twenty-five minutes had elapsed. Robin cried and Steve spoke to his colleague. “You guys did great,” the colleague said. He told them to call Bob, since Mona’s phone was clearly compromised, to make sure that he and Mona were now safe. After a few tries, Bob picked up the phone and handed it to Mona. “Are you at home?” Steve and Robin asked her. “Are you O.K.?”

Mona sounded fine, but she was unsure of what they were talking about. “Yeah, I’m in bed,” she replied. “Why?”

Artificial intelligence is revolutionizing seemingly every aspect of our lives: medical diagnosis, weather forecasting, space exploration, and even mundane tasks like writing e-mails and searching the Internet. But with increased efficiencies and computational accuracy has come a Pandora’s box of trouble. Deepfake video content is proliferating across the Internet. The month after Russia invaded Ukraine, a video surfaced on social media in which Ukraine’s President, Volodymyr Zelensky, appeared to tell his troops to surrender. (He had not done so.) In early February of this year, Hong Kong police announced that a finance worker had been tricked into paying out twenty-five million dollars after taking part in a video conference with who he thought were members of his firm’s senior staff. (They were not.) Thanks to large language models like ChatGPT, phishing e-mails have grown increasingly sophisticated, too. Steve and Robin, meanwhile, fell victim to another new scam, which uses A.I. to replicate a loved one’s voice. “We’ve now passed through the uncanny valley,” Hany Farid, who studies generative A.I. and manipulated media at the University of California, Berkeley, told me. “I can now clone the voice of just about anybody and get them to say just about anything. And what you think would happen is exactly what’s happening.”

Robots aping human voices are not new, of course. In 1984, an Apple computer became one of the first that could read a text file in a tinny robotic voice of its own. “Hello, I’m Macintosh,” a squat machine announced to a live audience, at an unveiling with Steve Jobs. “It sure is great to get out of that bag.” The computer took potshots at Apple’s main competitor at the time, saying, “I’d like to share with you a maxim I thought of the first time I met an I.B.M. mainframe: never trust a computer you can’t lift.” In 2011, Apple released Siri; inspired by “Star Trek” ’s talking computers, the program could interpret precise commands—“Play Steely Dan,” say, or, “Call Mom”—and respond with a limited vocabulary. Three years later, Amazon released Alexa. Synthesized voices were cohabiting with us.

Still, until a few years ago, advances in synthetic voices had plateaued. They weren’t entirely convincing. “If I’m trying to create a better version of Siri or G.P.S., what I care about is naturalness,” Farid explained. “Does this sound like a human being and not like this creepy half-human, half-robot thing?” Replicating a specific voice is even harder. “Not only do I have to sound human,” Farid went on. “I have to sound like you.” In recent years, though, the problem began to benefit from more money, more data—importantly, troves of voice recordings online—and breakthroughs in the underlying software used for generating speech. In 2019, this bore fruit: a Toronto-based A.I. company called Dessa cloned the podcaster Joe Rogan’s voice. (Rogan responded with “awe” and acceptance on Instagram, at the time, adding, “The future is gonna be really fucking weird, kids.”) But Dessa needed a lot of money and hundreds of hours of Rogan’s very available voice to make their product. Their success was a one-off.

In 2022, though, a New York-based company called ElevenLabs unveiled a service that produced impressive clones of virtually any voice quickly; breathing sounds had been incorporated, and more than two dozen languages could be cloned. ElevenLabs’s technology is now widely available. “You can just navigate to an app, pay five dollars a month, feed it forty-five seconds of someone’s voice, and then clone that voice,” Farid told me. The company is now valued at more than a billion dollars, and the rest of Big Tech is chasing closely behind. The designers of Microsoft’s Vall-E cloning program, which débuted last year, used sixty thousand hours of English-language audiobook narration from more than seven thousand speakers. Vall-E, which is not available to the public, can reportedly replicate the voice and “acoustic environment” of a speaker with just a three-second sample.

Voice-cloning technology has undoubtedly improved some lives. The Voice Keeper is among a handful of companies that are now “banking” the voices of those suffering from voice-depriving diseases like A.L.S., Parkinson’s, and throat cancer, so that, later, they can continue speaking with their own voice through text-to-speech software. A South Korean company recently launched what it describes as the first “AI memorial service,” which allows people to “live in the cloud” after their deaths and “speak” to future generations. The company suggests that this can “alleviate the pain of the death of your loved ones.” The technology has other legal, if less altruistic, applications. Celebrities can use voice-cloning programs to “loan” their voices to record advertisements and other content: the College Football Hall of Famer Keith Byars, for example, recently let a chicken chain in Ohio use a clone of his voice to take orders. The film industry has also benefitted. Actors in films can now “speak” other languages—English, say, when a foreign movie is released in the U.S. “That means no more subtitles, and no more dubbing,” Farid said. “Everybody can speak whatever language you want.” Multiple publications, including The New Yorker, use ElevenLabs to offer audio narrations of stories. Last year, New York’s mayor, Eric Adams, sent out A.I.-enabled robocalls in Mandarin and Yiddish—languages he does not speak. (Privacy advocates called this a “creepy vanity project.”)

But, more often, the technology seems to be used for nefarious purposes, like fraud. This has become easier now that TikTok, YouTube, and Instagram store endless videos of regular people talking. “It’s simple,” Farid explained. “You take thirty or sixty seconds of a kid’s voice and log in to ElevenLabs, and pretty soon Grandma’s getting a call in Grandson’s voice saying, ‘Grandma, I’m in trouble, I’ve been in an accident.’ ” A financial request is almost always the end game. Farid went on, “And here’s the thing: the bad guy can fail ninety-nine per cent of the time, and they will still become very, very rich. It’s a numbers game.” The prevalence of these illegal efforts is difficult to measure, but, anecdotally, they’ve been on the rise for a few years. In 2020, a corporate attorney in Philadelphia took a call from what he thought was his son, who said he had been injured in a car wreck involving a pregnant woman and needed nine thousand dollars to post bail. (He found out it was a scam when his daughter-in-law called his son’s office, where he was safely at work.) In January, voters in New Hampshire received a robocall in Joe Biden’s voice telling them not to vote in the primary. (The man who admitted to generating the call said that he had used ElevenLabs software.) “I didn’t think about it at the time that it wasn’t his real voice,” an elderly Democrat in New Hampshire told the Associated Press. “That’s how convincing it was.”
