AI transcript
0:00:05 At the end of last year, there were 120 tools
0:00:07 with which you can clone someone’s voice.
0:00:12 And by March of this year, it’s become 350.
0:00:15 Being able to identify what is real
0:00:18 is going to become really important,
0:00:19 especially because now,
0:00:22 you can do all of these things at scale.
0:00:26 – One of the reasons that spam works and deep fakes work
0:00:29 is the marginal cost of the next call is so low
0:00:31 that you can do these things en masse.
0:00:34 – It’s way cheaper to detect deep fakes.
0:00:36 We’ve had 10,000 years of evolution.
0:00:39 The way we produce speech has vocal cords,
0:00:43 has the diaphragm, has your lips and your mouth
0:00:44 and your nasal cavity.
0:00:47 It’s really hard for these systems to replicate all of that.
0:00:53 – Deep fake, a portmanteau of deep learning and fake.
0:00:55 It started making its way into the public consciousness
0:00:59 in 2018, but is now fully in the zeitgeist.
0:01:02 – We are seeing an alarming rise of deep fakes.
0:01:05 – Deep fakes are becoming increasingly easy to make.
0:01:07 – Deep fake videos are everywhere now.
0:01:09 – Deep fake robo-caller
0:01:12 with someone using President Biden’s voice.
0:01:14 – Deep fake of President Zelensky.
0:01:15 – Deep fake.
0:01:15 – Deep fake.
0:01:16 – Deep fakes.
0:01:17 – Deep fakes.
0:01:19 – We’ve seen deep fakes across social media,
0:01:22 commerce, sports and of course, politics.
0:01:24 And at the rate that they’re appearing,
0:01:27 deep fakes might sound like an impossible problem to tackle.
0:01:30 But it turns out that despite the decreasing barrier
0:01:34 to creation, our defender tool chest is even more robust.
0:01:37 So in today’s episode, we’ll discuss that
0:01:39 with someone who’s been thinking about voice security
0:01:42 for much longer than the average Twitter user
0:01:44 or even high-ranking politician,
0:01:46 wondering where this all goes.
0:01:50 Today, Vijay Balasubramaniyan, co-founder and CEO of Pindrop,
0:01:54 joins a16z general partner, Martin Casado,
0:01:57 to break down the technology, the policy
0:01:59 and the economy of deep fakes.
0:02:02 Together, they’ll discuss questions like,
0:02:04 just how easy is it to create a deep fake today?
0:02:06 Like, how many seconds of audio do you need
0:02:08 and how many tools are available?
0:02:11 But also, can we detect these things?
0:02:13 And if so, is the cost realistic?
0:02:16 Plus, what does good regulation look like here
0:02:18 in a space moving so quickly?
0:02:21 And have we lost a grip on the truth?
0:02:23 We’ll listen in to find out, but first,
0:02:25 let’s kick things off with how Vijay got here.
0:02:31 As a reminder, the content here
0:02:33 is for informational purposes only.
0:02:35 Should not be taken as legal, business, tax
0:02:36 or investment advice,
0:02:38 or be used to evaluate any investment or security
0:02:40 and is not directed at any investors
0:02:43 or potential investors in any a16z fund.
0:02:45 Please note that a16z and its affiliates
0:02:46 may also maintain investments
0:02:49 in the companies discussed in this podcast.
0:02:51 For more details, including a link to our investments,
0:02:54 please see a16z.com/disclosures.
0:03:02 I’ve been playing in the voice space
0:03:04 for a really long time.
0:03:07 I’m gonna date myself, but I started working at Siemens.
0:03:10 And at Siemens, we were working in landline switches
0:03:13 and EWSD switches and things like that.
0:03:16 And so that’s where I started.
0:03:19 I also worked at Google and there I was working
0:03:23 on the scalability algorithms for video chat.
0:03:25 And so that’s where I got introduced
0:03:28 to a lot of the voice over IP side of things.
0:03:31 And then I came to do my PhD at Georgia Tech.
0:03:34 And so there, I naturally got super interested
0:03:36 in voice security.
0:03:39 And ultimately, Pindrop, which is the company
0:03:42 that I started, was my PhD thesis,
0:03:45 very similar to the way you started off your life
0:03:50 as well, but it turned out to be something pretty meaningful.
0:03:52 And ever since then, it’s been incredible
0:03:54 what’s happened in this space.
0:03:57 – This is why I’m so excited to have you on this podcast.
0:04:00 To many, deep fakes are this new emergent thing,
0:04:03 but you’ve actually been in the voice fraud detection space
0:04:05 for a very long time.
0:04:06 So it’s gonna be great to see your perspective
0:04:08 on how things are different now
0:04:10 and how things are more of the same.
0:04:13 And so maybe to provide a bit of context
0:04:15 to get started on deep fakes:
0:04:17 they’ve entered the zeitgeist,
0:04:21 maybe talk through what they are when we say deep fakes
0:04:24 and why we’re talking so much about them.
0:04:26 – We’ve been doing deep fake detection
0:04:28 for like now seven years.
0:04:32 And even before that, you have people manipulating audio
0:04:34 and manipulating video.
0:04:38 And you saw that with Nancy Pelosi slurring in a speech,
0:04:41 all they did was slow down the audio.
0:04:44 It wasn’t a deep fake, it was actually a cheap fake, right?
0:04:48 And so that is actually what’s existed for a really long time.
0:04:51 What changed is the ability to use
0:04:54 what are known as generative adversarial networks
0:04:58 to constantly improve things like voice cloning
0:05:03 or video cloning or essentially try to get the likeness
0:05:05 of a person really close.
0:05:09 So it’s essentially two systems competing against each other.
0:05:13 And the objective function is I’m gonna get really close
0:05:16 to Martin’s voice and Martin’s face,
0:05:18 and then the other system is trying to figure out,
0:05:20 okay, what are the anomalies?
0:05:22 Can I still detect that it’s a machine
0:05:24 as opposed to a human?
0:05:26 So it’s almost like a reverse Turing test.
0:05:28 And so what ended up happening is
0:05:30 once you start creating these GANs,
0:05:32 which are used in a lot of these spaces
0:05:35 when you run them across multiple iterations,
0:05:37 the system becomes really, really good
0:05:41 ’cause you train a deep learning neural network
0:05:43 and that’s where the deep fake comes from.
0:05:46 And they became so good that lots of people
0:05:49 have extreme difficulty differentiating
0:05:52 between what is human and what is machine.
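To make the GAN setup just described concrete, here is a minimal, purely illustrative PyTorch sketch of two networks trained against each other on toy one-dimensional data. It is not Pindrop's system or a real voice model; every architecture choice and number below is an assumption made only for illustration.

```python
# Toy GAN sketch (illustrative only): a generator learns to mimic "real" data
# while a discriminator learns to tell real from generated, the
# two-competing-systems loop described above.
import torch
import torch.nn as nn

torch.manual_seed(0)

def real_batch(n):
    # Stand-in for "real voice features": samples from a fixed distribution.
    return torch.randn(n, 1) * 0.5 + 2.0

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))   # generator
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))   # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    # 1) Train the discriminator: real samples -> 1, generated samples -> 0.
    real = real_batch(64)
    fake = G(torch.randn(64, 8)).detach()
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # 2) Train the generator to fool the discriminator.
    fake = G(torch.randn(64, 8))
    loss_g = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

# After training, generated samples drift toward the "real" distribution (mean ~2.0).
print("mean of generated samples:", G(torch.randn(1000, 8)).mean().item())
```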
0:05:54 – So let’s break this down a little bit
0:05:57 because I think that deep fakes are more talked about
0:05:59 now than they were in the past, right?
0:06:02 And so clearly this seems to have coincided
0:06:04 with the generative AI wave.
0:06:07 And so do you think it’s fair to say
0:06:10 that there’s a new type of deep fake
0:06:12 that has drafted off the generative AI wave
0:06:15 and therefore we need to have a different posture
0:06:18 or is it just the same but brought to people’s attention
0:06:20 because of generative AI?
0:06:23 – Generative AI has allowed for combinations
0:06:24 of wonderful things.
0:06:27 But when we started, there was just one tool
0:06:29 that could clone your voice, right?
0:06:32 It was called Lyrebird, incredible tool.
0:06:35 It was used for lots of great applications.
0:06:38 At the end of last year, there were 120 tools
0:06:41 with which you can clone someone’s voice.
0:06:45 And by March of this year, it’s become 350.
0:06:47 And there’s a lot of open source tools
0:06:52 that you can use to essentially mimic someone’s voice
0:06:54 or to mimic someone’s likeness.
0:06:58 And that’s the ease with which this has happened.
0:07:03 Essentially the cost of doing this has become close to zero
0:07:06 because all it takes for me to clone your voice,
0:07:10 Martin, is now about three to five seconds
0:07:11 of your audio.
0:07:13 And if I want a really high quality deep fake,
0:07:16 it requires about 15 seconds of audio.
0:07:20 Compare this to before the generative AI boom
0:07:25 where John Legend wanted to become the voice of Google Home
0:07:30 and he spent like close to 20 hours recording himself saying
0:07:32 a whole bunch of things so that Google Home
0:07:34 could say in San Francisco,
0:07:36 the weather is 37 degrees or whatever.
0:07:39 So the fact is that he had to go into a studio,
0:07:43 spend 20 hours recording his voice
0:07:47 in order for you to do that compared to 15 seconds
0:07:50 and 300 different tools available to do it.
0:07:53 – It almost feels to me that we need like new terms
0:07:56 because this idea of cloning voices
0:07:58 has been around for a while.
0:07:59 I don’t know if you remember this, Vijay,
0:08:03 but this wasn’t too long ago when I was in Japan
0:08:06 and I got this call from my parents, which I never do.
0:08:09 And my mom’s like, where are you right now?
0:08:10 And I’m like, I’m in Japan.
0:08:11 And my mom’s like, no, you’re not.
0:08:13 And I’m like, yes, I am.
0:08:15 She says, hold on, let me get your father.
0:08:19 So my dad jumps on the line and he’s like,
0:08:19 where are you?
0:08:20 I’m in Japan.
0:08:22 He’s like, I just talked to you, you were in prison
0:08:27 and I’m leaving to go bring $10,000 of bail money to you.
0:08:29 I’m like, what are you talking about?
0:08:31 And he’s like, listen, someone called and said
0:08:34 that you had a car accident and you were a bit muffled
0:08:39 because you were hurt
0:08:42 and that I needed to bring cash to a certain area.
0:08:44 And like your mom just thought to call you
0:08:46 while I was heading out the door, right?
0:08:48 So of course we called the police after this
0:08:51 and they said, this is a well-known scam
0:08:53 that’s been going on for a very long time.
0:08:57 And it’s probably just someone that tried to sound like you
0:08:59 and muffling their voice, right?
0:09:03 And so it seems that calling somebody
0:09:06 and obfuscating the voice to trick people
0:09:08 has been around for a very long time.
0:09:10 So maybe just from your perspective,
0:09:14 do we need a new term for these generative AI fakes
0:09:16 because they’re somehow fundamentally different
0:09:18 or is this just more of the same?
0:09:20 And we shouldn’t really worry too much about it
0:09:23 because we’ve been dealing with it for a long time.
0:09:26 – Yeah, so it’s interesting it happened to you in Japan man
0:09:29 because the origin of that scam early on,
0:09:33 I went with the Andreessen Horowitz contingent to Japan.
0:09:37 This was way back, this was like close to eight, nine years back
0:09:39 when I was talking about voice fraud,
0:09:43 the Japanese audience talked to me about Ore Ore Sagi,
0:09:45 which is the “help me, grandma” scam.
0:09:48 So it’s exactly that, but at that point in time,
0:09:52 it had started costing Japan close to half a billion dollars
0:09:57 in people losing their life savings to the scams, right?
0:10:01 So in Japan, half a billion dollars close to eight, nine years back.
0:10:05 So the mode of operation is not different, right?
0:10:08 Get vulnerable populations, right,
0:10:11 into an urgent situation,
0:10:14 where they believe they have to act, otherwise it’s disastrous,
0:10:16 and they will comply.
0:10:19 What’s changed is the scale
0:10:22 and the ability to actually mimic your voice.
0:10:25 The fact is that now you have so many tools
0:10:28 that anyone can do it super easily.
0:10:33 Two, before if you had some sort of an accent and things like that,
0:10:36 they couldn’t quite mimic your real voice,
0:10:38 but now because it’s 15 seconds,
0:10:42 your grandson could have a 15 second TikTok video
0:10:45 and that’s all it’s required, not even 15 seconds,
0:10:47 with five seconds and if depending upon the demographic,
0:10:49 you can get a pretty good clone.
0:10:53 So what’s changed is the ability to scale this
0:10:55 and then these fraudsters are combining
0:10:58 these text-to-speech systems with LLM models.
0:11:02 So now you have a system that you’re saying,
0:11:04 okay, when the person says something,
0:11:08 respond back in a particular way crafted by the LLM.
0:11:10 And here is the crazy thing, right?
0:11:12 In LLMs, hallucination is a problem.
0:11:16 So the fact that you’re making shit up is a bad idea.
0:11:19 But if you have to make shit up to convince someone,
0:11:21 well, you must be able to do that.
0:11:22 And it’s crazy.
0:11:28 We see fraud where the LLM is coming up with crazy ways
0:11:31 to convince you that something bad is happening.
0:11:32 Wow, wow, wow.
0:11:34 What I want to get into next is,
0:11:36 are we all doomed, or is it possible to detect these things?
0:11:38 But before we do that, it’d be great if,
0:11:41 since you probably are the world’s expert on voice fraud,
0:11:44 you’ve probably seen more types of voice fraud
0:11:46 than any single person on the planet.
0:11:48 We know of the Ore Ore Sagi,
0:11:50 which is basically what I got hit with.
0:11:53 Can you maybe talk to some other uses of deepfakes
0:11:55 that are prevalent today?
0:11:57 Yeah, so deepfakes existed before,
0:11:59 but if you think about where deepfakes are hitting now,
0:12:01 you can see them, right,
0:12:03 in the political spectrum, they’re there, right?
0:12:07 So election misinformation with President Biden’s campaign
0:12:09 happened, we were the ones who caught it
0:12:11 and identified it and things like that.
0:12:12 What was the specifics?
0:12:13 Are you allowed to talk about it?
0:12:15 Yeah, no, no, for sure.
0:12:17 What happened is early on this year,
0:12:19 and if you think about deepfakes,
0:12:20 they affect three big areas,
0:12:24 commerce, media and communication, right?
0:12:26 And so this is news media, social media.
0:12:30 So what happened is at the beginning of an election year,
0:12:32 you had the first case of election interference
0:12:35 where everyone during the Republican primary
0:12:39 in New Hampshire got a phone call that said,
0:12:40 hey, you know what?
0:12:42 Your vote doesn’t count this Tuesday.
0:12:45 Don’t vote right now, come vote in November.
0:12:48 And this was made in the voice
0:12:50 of the president of the free world, right?
0:12:51 President Biden, right?
0:12:52 That’s the craziness.
0:12:55 They went for the highest profile target,
0:12:56 and you should listen to the audio.
0:12:57 It’s incredible.
0:12:59 It is like President Biden,
0:13:01 and they’ve interspersed it with things
0:13:02 that President Biden says,
0:13:05 like what a bunch of malarkey and things like that.
0:13:08 So that came out and people were like,
0:13:10 okay, is this really President Biden?
0:13:12 So not only did we come in and say,
0:13:13 this was a deep fake,
0:13:15 we have something called source tracing,
0:13:17 which tells us which AI application
0:13:19 was used to create this deep fake.
0:13:21 So we identified the deep fake,
0:13:23 and then we worked with that AI application.
0:13:25 They’re an incredible company.
0:13:28 We worked with them and they immediately found
0:13:31 the person who used that script and shut them down.
0:13:33 So they couldn’t create any other problem.
0:13:37 So this is a great example of different good companies
0:13:40 coming together to shut down a problem.
0:13:41 And so we worked with them.
0:13:42 They shut it down.
0:13:45 And then later on regulation kicked in
0:13:47 and they fined the telco providers
0:13:49 who distributed these calls.
0:13:52 They fined the political analyst
0:13:55 who intentionally created these deep fakes.
0:13:59 But that was the first case of political misinformation.
0:14:00 You see this a lot.
0:14:01 – Was that this year?
0:14:02 – Yeah, it was this year.
0:14:04 It was in January of this year.
0:14:05 – That’s amazing.
0:14:06 Okay, we’ve got politics.
0:14:08 We’ve got bilking old people.
0:14:10 Maybe one more good anecdote
0:14:12 before we get into whether we can detect these things.
0:14:14 – The one thing that’s really close to home
0:14:15 is in commerce, right?
0:14:18 Like financial institutions.
0:14:22 Even though Generative AI came out in 2022, in 2023,
0:14:27 we were seeing essentially one deep fake a month
0:14:29 at some customer, right?
0:14:30 So it was just one deep fake a month
0:14:32 and some customer would face it.
0:14:34 It wasn’t a widespread problem.
0:14:39 But this year, we’ve now seen one deep fake per customer
0:14:41 per day.
0:14:45 So it has rapidly exploded.
0:14:48 And we have certain customers like really big banks
0:14:52 who are getting a deep fake every three hours.
0:14:54 Like it’s insane the speed.
0:14:58 So there has been a 1400% increase
0:15:01 in the amount of deep fakes we’ve seen this year
0:15:04 in the first six months compared to all of last year.
0:15:06 And the year is not even over.
0:15:07 – Wow.
0:15:10 All right, so we have these deep fakes.
0:15:12 They are super prevalent.
0:15:16 They are impacting politics and e-commerce.
0:15:18 Can you talk to like whether these things
0:15:19 are detectable at all?
0:15:22 Is this the beginning of the end or where are we?
0:15:25 – Martin, you’ve lived through many such cycles
0:15:28 where initially it feels like the sky is falling.
0:15:31 Online fraud, emails, spam, there’s a whole bunch of them.
0:15:33 But the situation is the same.
0:15:35 They’re completely detectable.
0:15:39 Right now we’re detecting them with 99% detection rate
0:15:41 with a 1% false positive rate.
0:15:44 So extremely high accuracy on being able to detect them.
0:15:45 – Just to put this in context,
0:15:48 what are numbers for identifying voice?
0:15:50 Not fraud just like whether it’s my voice.
0:15:52 – So it’s roughly about one in every 100,000
0:15:54 to one in every million, right?
0:15:55 That’s the ratio.
0:15:57 So it’s much higher precision, for sure,
0:15:59 and much higher specificity.
0:16:01 But yeah, deep fakes you’re detecting
0:16:03 with a 99% accuracy.
0:16:05 And so these things you’re able to detect
0:16:06 very, very comfortably.
0:16:08 And the reason you’re able to detect it
0:16:12 is because when you think about even something like voice,
0:16:17 you have 8,000 samples of your voice every single second,
0:16:20 even in the lowest fidelity channel,
0:16:22 which is the contact center.
0:16:27 And so you can actually see how the voice changes over time,
0:16:29 8,000 times a second.
0:16:33 And what we find is these deep fake systems,
0:16:35 either in the frequency domain, spectrally,
0:16:40 or in the time domain, make mistakes.
0:16:41 And they make a lot of mistakes.
0:16:43 And the reason they make mistakes,
0:16:46 and you still can’t tell, is because, think about it,
0:16:50 your human ear can’t look at anomalies 8,000 times a second.
0:16:52 If it did, you’d go mad, right?
0:16:54 Like you’d have some serious problems.
0:16:58 So that’s the reason like it’s beautiful to your ear.
0:17:01 You think it’s Martin speaking on the other end,
0:17:04 but that’s where you can use good AI,
0:17:07 which can actually look at things 8,000 times a second.
0:17:11 Or like when we’re doing most online conferencing,
0:17:13 like this podcast, it’s usually 16,000.
0:17:16 So then you have 16,000 samples of your voice.
0:17:17 And if you’re doing music,
0:17:20 you have 44,000 samples of the musician’s voice
0:17:21 every single second.
0:17:25 So there’s so much data and so many anomalies
0:17:28 that you can actually detect these pretty comfortably.
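To make the sample-rate point concrete, here is a toy sketch of the kind of frame-by-frame spectral check being described: summarize each short frame of audio, then flag frame-to-frame changes that would be implausibly fast for a human vocal tract. The feature (spectral centroid) and the threshold are hypothetical illustrations, not Pindrop's detector.

```python
# Illustrative only: scan audio frame by frame and flag suspiciously fast spectral jumps.
import numpy as np

def spectral_centroids(audio, sample_rate=8000, frame_len=256, hop=128):
    """Per-frame spectral centroid: a crude summary of where the energy sits in frequency."""
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    centroids = []
    for start in range(0, len(audio) - frame_len, hop):
        frame = audio[start:start + frame_len] * np.hanning(frame_len)
        mag = np.abs(np.fft.rfft(frame)) + 1e-9
        centroids.append(np.sum(freqs * mag) / np.sum(mag))
    return np.array(centroids)

def flag_anomalous_frames(audio, sample_rate=8000, max_jump_hz=400.0):
    """Flag frame-to-frame jumps larger than a (hypothetical) physiological limit."""
    jumps = np.abs(np.diff(spectral_centroids(audio, sample_rate)))
    return np.where(jumps > max_jump_hz)[0]

# Toy usage: one second of synthetic "speech-like" audio at the 8 kHz contact-center rate.
t = np.linspace(0, 1, 8000, endpoint=False)
audio = np.sin(2 * np.pi * 220 * t) + 0.3 * np.random.randn(8000)
print("suspicious frames:", flag_anomalous_frames(audio))
```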
0:17:32 I see a lot of proposals, particularly from policy circles,
0:17:36 of using things like watermarking or cryptography,
0:17:39 which has always seemed a strange idea to me,
0:17:43 because you’re asking criminals to comply with something.
0:17:44 So I don’t know,
0:17:50 how do you view more active measures to self-identify
0:17:53 either legit or illegitimate traffic?
0:17:56 Yeah, see, this is why you’re in security, Martin,
0:17:58 and almost immediately you realize
0:18:00 that most attackers will not comply
0:18:03 with you putting in a watermark.
0:18:05 But even without putting in a watermark, right?
0:18:08 Like even if you didn’t have an active adversary,
0:18:12 like the President Biden robocall that I referenced before,
0:18:15 when it finally showed up,
0:18:18 the system that actually generated it had a watermark in it.
0:18:21 But when they tested it against that watermark,
0:18:23 they only were able to extract 2%.
0:18:24 Oh, interesting.
0:18:27 So you mean the original Biden call had a watermark?
0:18:30 A watermark, because it was generated by an AI app
0:18:31 that included a watermark.
0:18:32 And then they copied–
0:18:33 (laughs)
0:18:36 And 90% of that watermark went away,
0:18:38 largely because when you take that audio,
0:18:43 play it across air, play it across telephony channels,
0:18:45 the bits and bytes, they get stripped away.
0:18:47 And so once they get stripped away,
0:18:49 and audio is a very sparse channel.
0:18:52 So even if you add it over and over again,
0:18:53 it’s not possible to do it.
0:18:56 So these watermarking techniques,
0:18:57 I mean, they’re a great technique.
0:18:59 You always think about defense in depth,
0:19:01 where they’re present.
0:19:05 You will be able to identify a whole lot more genuine stuff
0:19:07 as a result of these watermarks,
0:19:09 but attackers are not going to comply with it.
0:19:10 When you get videos,
0:19:14 like we are now working with news media organizations,
0:19:17 and 90% of the videos and audios they get from,
0:19:22 for example, the Israel Hamas War are fake.
0:19:23 How many?
0:19:24 90% of them are fake.
0:19:25 – What?
0:19:26 – Yeah.
0:19:27 – I guess I shouldn’t be so surprised, but.
0:19:28 – Yeah.
0:19:29 They’re all made up.
0:19:31 They’re from a different war.
0:19:32 Some of them are cheap fake.
0:19:33 Some of them are actually deep fake.
0:19:36 Some of them are cobbled together.
0:19:40 And so being able to identify what is real
0:19:42 is going to become really important,
0:19:44 especially because now you can do
0:19:46 all of these things at scale.
0:19:48 – Can you draw out how the maturation
0:19:51 in AI technology impacts this?
0:19:54 Because clearly something happened in the last year
0:19:57 to make this economic for attackers,
0:19:59 which we’re seeing arise.
0:20:02 And clearly it’s going to keep getting better.
0:20:04 And so do you have a mental model
0:20:09 for why this doesn’t become a serious problem in the future
0:20:12 or does it become a serious problem in the future?
0:20:14 – So one of the things that we talk about
0:20:17 is that any deep fake detection system should have
0:20:19 strong resilience built into it.
0:20:20 So it should not just be good
0:20:22 at detecting deep fakes right now.
0:20:26 It should be able to detect what we call zero day deep fakes.
0:20:28 A new system gets created.
0:20:30 How do you detect that deep fake?
0:20:33 And essentially the mental model is the following.
0:20:36 One, deep fake architectures
0:20:38 are not simple monolithic systems.
0:20:41 They have like several components within them.
0:20:43 And what ends up happening is each of these components
0:20:46 tend to leave behind artifacts.
0:20:47 We call this a fake print.
0:20:51 So they all leave behind things that they do poorly, right?
0:20:54 And so when you actually create a new system,
0:20:57 you often find they’ve pulled together pieces of other systems
0:21:00 and those leave behind their older fake prints.
0:21:03 And so you can actually detect newer systems
0:21:07 because they usually only improve on one component.
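As a hedged sketch of the “fake print” idea: represent each known generator by the average artifact vector its outputs leave behind, then attribute a new clip to the nearest known print, or call it a new system if nothing is close. The generator names, feature values, and distance threshold below are all hypothetical.

```python
# Illustrative "fake print" attribution: nearest-prototype matching over artifact features.
import numpy as np

known_fakeprints = {
    "generator_A": np.array([0.82, 0.10, 0.33]),   # hypothetical artifact statistics
    "generator_B": np.array([0.15, 0.71, 0.28]),
}

def attribute_source(artifact_vector, prints, max_distance=0.5):
    """Return the closest known generator, or 'unknown/new system' if nothing is close."""
    best_name, best_dist = None, float("inf")
    for name, prototype in prints.items():
        dist = np.linalg.norm(artifact_vector - prototype)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= max_distance else "unknown/new system"

print(attribute_source(np.array([0.80, 0.12, 0.30]), known_fakeprints))  # -> generator_A
```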
0:21:10 The second is we actually run GANs.
0:21:12 So you get these GANs to compete.
0:21:14 Like we create our own deep fake detection system.
0:21:16 Now we say, how do you beat that?
0:21:18 And we have multiple iterations of them running
0:21:20 and we’re constantly running them.
0:21:21 – Sorry, I just wanna make sure that I understand here.
0:21:25 So you’re creating your own deep fake system
0:21:26 using the approach you talked about before,
0:21:28 which is the generative adversarial network.
0:21:30 So then you can create a good deep fake
0:21:32 and then you can create a detection for that.
0:21:32 Is that right?
0:21:33 – Exactly.
0:21:35 And then you beat that detection system
0:21:39 and you run that iteration, iteration, iteration.
0:21:40 And then what you find
0:21:42 is actually something really interesting,
0:21:47 which is if a deep fake system has to serve two masters,
0:21:51 that is, one, I need to make the speech legible
0:21:54 and sound as much like Martin as possible.
0:21:59 And two, I need to deceive a deep fake detection system.
0:22:02 Those two objective functions start to diverge.
0:22:05 So for example, I could start adding noise
0:22:08 and noise is a great way to keep you
0:22:10 from discovering my limitations.
0:22:12 But if I start adding too much noise,
0:22:13 I can’t hear it.
0:22:17 So for example, we were called into one of these deep fakes
0:22:21 where LeBron James apparently was saying bad things
0:22:24 about the coach during the Paris Olympics.
0:22:26 It wasn’t LeBron James, it was a deep fake.
0:22:28 We actually provided his management team
0:22:32 the necessary detail so that on X,
0:22:34 it could be labeled as AI-generated content.
0:22:37 And so we did that.
0:22:39 But if you look at the audio,
0:22:41 there was a lot of noise introduced into it, right?
0:22:44 To try and avoid detection.
0:22:46 But lots of people couldn’t even hear the audio.
0:22:47 They were like, this is really,
0:22:51 and so that’s where you start seeing these systems diverge.
0:22:53 And this is where I have confidence
0:22:54 in our ability to detect it, right?
0:22:57 Which is you run these GANs,
0:22:58 you know the architectures
0:23:01 with which these deep fake generation systems are created.
0:23:03 And ultimately you start seeing divergences
0:23:05 in one of the objective functions.
0:23:07 So either you as a human will be able
0:23:08 to detect some things off,
0:23:11 or we as a system will be able to detect some things off.
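The divergence described here can be illustrated with a toy experiment: take an overly clean synthetic tone, add increasing noise to push down a crude “machine-likeness” measure, and watch the signal-to-noise ratio (a stand-in for intelligibility) collapse at the same time. The measure and all the numbers are purely illustrative assumptions, not a real detector.

```python
# Toy demo of the two diverging objectives: evading a simple "too clean to be human"
# check by adding noise also destroys the quality of the fake.
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 16000, endpoint=False)
clean_fake = np.sin(2 * np.pi * 180 * t)   # stand-in for an overly clean synthetic voice

def spectral_flatness(x):
    """Near 0 for very tonal (machine-clean) signals, near 1 for noise-like signals."""
    mag = np.abs(np.fft.rfft(x)) + 1e-12
    return np.exp(np.mean(np.log(mag))) / np.mean(mag)

for noise_level in [0.0, 0.1, 0.5, 1.0, 2.0]:
    noisy = clean_fake + noise_level * rng.standard_normal(len(t))
    noise_power = max(np.mean((noisy - clean_fake) ** 2), 1e-12)
    snr_db = 10 * np.log10(np.mean(clean_fake ** 2) / noise_power)
    print(f"noise={noise_level:4.1f}  flatness={spectral_flatness(noisy):.3f}  SNR={snr_db:6.1f} dB")
```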
0:23:12 – Awesome.
0:23:14 One of the reasons that spam works
0:23:19 and deepfakes work is the marginal cost of the next call
0:23:22 is so low that you can do these things en masse, right?
0:23:25 Like the marginal cost of the next spam email or whatever.
0:23:28 Do you have even just the most vague sense of,
0:23:32 if it takes me a dollar to generate a deepfake,
0:23:35 how much does it cost to detect a deepfake?
0:23:36 Is it one to one?
0:23:36 Is it 10 to one?
0:23:38 Is it 100 to one?
0:23:41 – It’s way cheaper to detect deepfakes, right?
0:23:42 Because if you think about it,
0:23:45 like what we’ve seen is the closest example
0:23:49 is Apple released its model that could run on device.
0:23:52 And even that model is a small model
0:23:56 that does lots of things like voice to text
0:23:58 and things like that.
0:24:01 Our model is about 100 times smaller than that.
0:24:04 So it’s so much faster in detecting deepfakes.
0:24:08 So the ratio is about 100th right now.
0:24:11 And we’re constantly figuring out ways
0:24:15 to make it even cheaper, but it’s 100th that of generation.
0:24:16 – Wow, I see.
0:24:20 So to detect it is two orders of magnitude cheaper
0:24:21 than creation.
0:24:25 Which means in order for anybody to economically get,
0:24:27 listen, if there is no defense, there’s no defense.
0:24:29 But if there’s a defense that requires the bad guys
0:24:33 to have two orders of magnitude more resources,
0:24:36 which is actually pretty dramatic.
0:24:38 Given normally you go for parity on these things
0:24:40 because it tends to be a lot more good people
0:24:41 than bad people.
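As a back-of-the-envelope sketch of that economics argument: the only figure taken from the conversation is the roughly 100-to-1 ratio between generation and detection cost; the per-clip dollar amount and the call volume are assumptions.

```python
# Illustrative arithmetic only: defender spend vs. attacker spend at a ~100:1 cost ratio.
generation_cost_per_clip = 0.01                             # assumed cost to synthesize one cloned-voice call
detection_cost_per_clip = generation_cost_per_clip / 100    # roughly two orders of magnitude cheaper
calls = 1_000_000

print(f"attacker spend to generate {calls:,} calls: ${generation_cost_per_clip * calls:,.2f}")
print(f"defender spend to screen   {calls:,} calls: ${detection_cost_per_clip * calls:,.2f}")
```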
0:24:42 – And that’s the thing.
0:24:43 You have two orders of magnitude.
0:24:45 And then the fact is that once you know
0:24:46 what a deepfake looks like,
0:24:49 it stays detectable unless they re-architect the entire system.
0:24:53 And very few companies re-architect full pipelines.
0:24:57 And the last time this was done is back in 2015
0:24:59 when Google released Tacotron,
0:25:02 where they re-architected several pieces of the pipeline.
0:25:04 It’s a very expensive proposition.
0:25:06 – Is the intuitive reason that the cost is so much cheaper
0:25:08 to detect is that you just have to do less stuff.
0:25:11 Like the person generating the deepfake has to, like,
0:25:14 sound like a human, be passable to a human
0:25:15 and evade this.
0:25:17 And so that’s just more things than detecting it,
0:25:19 which just can be a much more narrow focus.
0:25:21 So it’ll always be cheaper to detect.
0:25:23 And then you don’t see a period in time
0:25:27 where the AI is so good, no deepfake detection mechanism can detect it.
0:25:28 You don’t see that.
0:25:31 – We don’t see that because either you become so good
0:25:36 at avoiding detection that you actually start becoming worse
0:25:39 at producing human-sounding speech,
0:25:42 or you’re producing human-sounding speech.
0:25:46 And unless you actually create a physical representation
0:25:50 of a human, because we’ve had 10,000 years of evolution
0:25:53 and the way we produce speech has vocal cords,
0:25:56 has the diaphragm, has your lips and your mouth
0:25:59 and your nasal cavity, all of those physical attributes.
0:26:03 So think about the fact that your voice is resonating
0:26:05 through folds of your vocal cord.
0:26:09 And these are subtle things that have changed over time.
0:26:12 It’s all of what has taken you to become you.
0:26:14 And somebody might have punched you in the throat
0:26:17 at some point in time that’s created some kind of thing.
0:26:19 There’s so much that happens.
0:26:22 It’s really hard for these systems to replicate all of that.
0:26:26 They have generic models and those generic models are good.
0:26:28 You can also think about the more we learn
0:26:31 about your voice, Martin, the better we can get
0:26:35 at knowing where your voice is deviating.
0:26:36 – And I have an incentive as a good guy
0:26:37 to work with you on that.
0:26:39 So you’ll have access to data where the bad people
0:26:42 may not have access to data and it totally makes sense.
0:26:45 It seems to me like the spam lessons learned apply here,
0:26:48 which is spam can be very effective for attackers,
0:26:50 very effective.
0:26:53 Defenses can also be incredibly effective,
0:26:54 however you have to put them in place.
0:26:56 And so it’s the same situation here,
0:26:59 which is be sure you have a strategy for deep fake detection.
0:27:01 But if you do, you’ll be okay.
0:27:02 – That’s exactly right.
0:27:04 And I think it has to be in each of the areas.
0:27:06 Like when you think about deepfakes,
0:27:08 you have incredible AI applications
0:27:11 that are doing wonderful things in each of these spaces.
0:27:13 You know, the voice cloning apps,
0:27:15 they’ve actually given voices to people
0:27:17 who have throat cancer and things like that.
0:27:20 Not just throat cancer, people who have been put behind bars
0:27:23 because of a bad political regime
0:27:24 are now getting to spread their message.
0:27:27 So they’re doing some incredible stuff
0:27:29 that you couldn’t do otherwise.
0:27:31 But in each of those situations,
0:27:34 it was with the consent of the user
0:27:37 who wanted their voice recreated, right?
0:27:41 And so that notion that the source AI applications
0:27:44 need to make sure that the people using their platform
0:27:46 actually are the people who want to use their platform.
0:27:48 That’s part A.
0:27:50 – And this is where the partnerships that you talked about
0:27:53 with the actual generation companies comes in
0:27:56 so that you can help them for the legitimate use cases
0:27:58 as well as sniffing out the illegitimate one.
0:27:58 Is that right?
0:27:59 – Absolutely.
0:28:01 – And with ElevenLabs, incredible.
0:28:05 The amount of work they’re doing to create voices ethically
0:28:09 and safely and carefully is incredible.
0:28:12 They’re trying to get lots of great tools out there.
0:28:13 We’re partnering with them.
0:28:16 They’re making their data sets accessible to us.
0:28:18 There are companies like that, right?
0:28:20 Another company called Respeecher.
0:28:22 They did a lot of the Hollywood movies.
0:28:26 So all of these companies are starting to partner
0:28:29 in order to be able to do this in the right way.
0:28:32 And it’s similar to a lot of what happened
0:28:35 in the fraud situation back in the 2000s
0:28:38 or the email spam situation back in the 2000s.
0:28:41 – I want to shift over to policy.
0:28:43 I’ve had a lot of policy discussions lately
0:28:45 in California as well as at the federal level.
0:28:48 And here’s my summary of how our existing policymakers
0:28:50 think about AI.
0:28:52 A, they’re scared and they want to regulate it.
0:28:54 B, they don’t know why they’re scared.
0:28:56 And C, with one exception,
0:28:58 which is none of them want deep fakes of themselves.
0:29:03 So I’ve found a primary motivation around regulating AI
0:29:06 is just this fear of political deep fakes, honestly.
0:29:09 And these are in pretty legit face-to-face conversations.
0:29:11 And so have you given thought
0:29:14 to what guidance you would give to policymakers,
0:29:15 many of who listen to this podcast
0:29:19 and how they should think about any regulations
0:29:21 or rules around this and maybe how it intersects
0:29:23 with things like innovation and free speech, et cetera.
0:29:25 I mean, it’s a complicated topic.
0:29:28 I think the simple one-liner answer is
0:29:32 they should make it really difficult for threat actors
0:29:34 and really flexible for creators, right?
0:29:37 That’s the ultimate difference.
0:29:40 And history is rife with a lot of great examples, right?
0:29:42 Like you lived through the email days
0:29:45 where the CAN-SPAM Act was a great one,
0:29:50 but it came in combination with better ML technologies.
0:29:51 – And I’m of that generation too,
0:29:53 but maybe just walk through how CAN-SPAM works.
0:29:55 I think it’s a good analog.
0:29:58 – You probably know more about the CAN-SPAM Act,
0:30:01 but the CAN-SPAM Act is one where anyone
0:30:03 who’s providing unsolicited marketing
0:30:06 has to be clear on its headers,
0:30:09 has to allow you to opt out, all of those things.
0:30:13 And if you don’t follow this very strict set of policies,
0:30:14 you can be fined.
0:30:18 And you also have great detection technologies
0:30:20 that allow you to detect these spams, right?
0:30:22 And now that you follow a particular standard,
0:30:25 especially when you’re doing unsolicited marketing
0:30:28 or you’re trying to do bad things like pornography,
0:30:30 you have detection, AI/ML technologies
0:30:32 that can detect you well.
0:30:35 The same thing happened when banks went online.
0:30:37 You had a lot of online fraud.
0:30:39 And if you remember, the Know Your Customer Act
0:30:43 and the Anti-Money Laundering Acts came in there.
0:30:48 So the onus was on you as an organization
0:30:49 have to know your customer.
0:30:51 That’s the guarantee.
0:30:52 And so you need technology.
0:30:54 After that, you can do what you want.
0:30:57 What was really good about both of those cases
0:31:00 is they got really specific on one,
0:31:02 what can the technology detect?
0:31:04 Because if the technology can’t detect it,
0:31:06 you can’t litigate, you can’t find the people
0:31:08 who are misusing it and so on.
0:31:10 So what can the technology detect?
0:31:13 And two, how do I make it really specific
0:31:16 on what you can and cannot do
0:31:18 in order to be able to do this?
0:31:21 And so I think those two were great examples
0:31:23 of how we should think about legislation.
0:31:26 And in deep fake, there is this very clear thing, right?
0:31:27 Like you have free speech,
0:31:29 but for the longest time,
0:31:32 anytime you used free speech for fraud,
0:31:34 or you were trying to incite violence,
0:31:37 or you were trying to do obscene things,
0:31:38 these are clear places
0:31:41 where the free speech guarantees go away.
0:31:44 So I think if you’re doing that, you should be fined.
0:31:47 And you should have laws that protect you against that.
0:31:49 And that’s the model I think of.
0:31:50 – Awesome.
0:31:53 So I’m gonna add just one thing from CAN-SPAM
0:31:55 that I think that you’ve touched on,
0:31:57 but I was actually working email security there.
0:31:59 So I think that this highlighted,
0:32:01 I wanna see if you agree with this kind of characterization.
0:32:04 So the first one is for illegal use,
0:32:06 policy doesn’t really help
0:32:08 because people aren’t gonna comply
0:32:09 and they’re gonna do whatever they want
0:32:11 and they’re doing something criminal anyways.
0:32:15 And so for that, we just rely on the most technical solution.
0:32:17 You can make recommendations,
0:32:18 but for strictly illegal uses,
0:32:19 you have to rely on technology.
0:32:21 No policy is gonna keep you safe.
0:32:24 But then there’s this kind of gray area of unwanted stuff.
0:32:27 And the unwanted stuff, you didn’t ask for it.
0:32:30 It may not be illegal, but it’s super annoying
0:32:32 and it’s unwanted and it can fill your inbox.
0:32:35 And for those, you can put in rules
0:32:36 because if somebody crosses those rules,
0:32:39 you can litigate them or you can opt out of it.
0:32:40 And so regulation applies to the unwanted stuff.
0:32:42 I could see that definitely happening here.
0:32:44 And then of course, there’s the wanted stuff
0:32:46 which doesn’t require any regulation.
0:32:47 Is that a fair characterization?
0:32:49 – That’s a really good characterization.
0:32:52 I think you’ve said it really, really well.
0:32:54 And the only other thing that I’ll say is right now
0:32:57 because we consume things through a lot of platforms,
0:33:00 platforms should be held accountable at some level
0:33:05 to clearly demarcating what is real and what is not.
0:33:08 Because otherwise it’s going to be really hard
0:33:11 for the average consumer to know
0:33:13 that this is AI generated versus this is not.
0:33:17 So I think there’s a certain amount of accountability there.
0:33:19 – Because the technology is where it is,
0:33:22 putting the onus on the platforms to do best practices
0:33:24 just like we did for spam, right?
0:33:27 Like I rely on Microsoft and Google
0:33:29 for the spam detection doing the same type of thing
0:33:30 for the platform.
0:33:32 It sounds like a very sensible recommendation.
0:33:33 – Yeah.
0:33:34 – All right, great.
0:33:35 So let’s just go ahead and wrap this up.
0:33:38 So key point number one is deepfakes
0:33:39 have been around for a long time.
0:33:43 We probably need a new name for this new generation
0:33:46 and this isn’t just like some hypothetical thing
0:33:48 but you’re seeing a massive increase.
0:33:50 You said as much as one per day
0:33:53 and the cost to generate has gone way down.
0:33:57 Good news is that these things are evidently detectable
0:33:59 and in your opinion will always be detectable
0:34:02 if you have a solution in place.
0:34:06 And then as a result, I think any policy should
0:34:09 provide the guidance and maybe accountability
0:34:10 for the platforms to detect it
0:34:12 because we can actually detect it.
0:34:15 And so listen, it’s something for people to know about
0:34:17 but it’s not the end of the world
0:34:19 and policy makers don’t have to regulate all of AI
0:34:21 for this one specific use case.
0:34:22 Is this a fair synopsis?
0:34:24 – This is a beautiful synopsis, Martin.
0:34:26 You’ve captured it really well.
0:34:30 – All right, that is all for today.
0:34:33 If you did make it this far, first of all, thank you.
0:34:35 We put a lot of thought into each of these episodes
0:34:37 whether it’s guests, the calendar Tetris,
0:34:39 the cycles with our amazing editor Tommy
0:34:41 until the music is just right.
0:34:43 So if you like what we put together,
0:34:47 consider dropping us a line at ratethispodcast.com/a16z
0:34:50 and let us know what your favorite episode is.
0:34:53 It’ll make my day and I’m sure Tommy’s too.
0:34:54 We’ll catch you on the flip side.
0:34:57 (upbeat music)
Deepfakes—AI-generated fake videos and voices—have become a widespread concern across politics, social media, and more. As they become easier to create, the threat grows. But so do the tools to detect them.
In this episode, Vijay Balasubramaniyan, cofounder and CEO of Pindrop, joins a16z’s Martin Casado to discuss how deepfakes work, how easily they can be made, and what defenses we have. They’ll also explore the role of policy and regulation in this rapidly changing space.
Have we lost control of the truth? Listen to find out.
Resources:
Find Vijay on Twitter: https://x.com/vijay_voice
Find Martin on Twitter: https://x.com/martin_casado
Stay Updated:
Let us know what you think: https://ratethispodcast.com/a16z
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://twitter.com/stephsmithio
Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.