a16z Podcast: Deep Learning for the Life Sciences

AI transcript
0:00:02 Hi, and welcome to the A16Z podcast.
0:00:03 I’m Hannah.
0:00:05 Deep learning has come to the life sciences.
0:00:09 Lately it seems every week a published study comes out with code on top.
0:00:14 In this episode, a16z general partner on the bio fund Vijay Pande and Bharath Ramsundar talk
0:00:17 about how AI and ML are unlocking the field in a new way.
0:00:22 In a conversation around their recently published book, Deep Learning for the Life Sciences,
0:00:25 written along with co-authors Peter Eastman and Patrick Walters.
0:00:30 The book aims to give developers and scientists a toolkit on how to use deep learning for
0:00:35 genomics, chemistry, biophysics, microscopy, medical analysis, and other areas.
0:00:36 So why now?
0:00:40 What is it about ML’s development that is allowing it to finally make an impact in this
0:00:41 field?
0:00:43 And what is the practical toolkit?
0:00:44 The right problems to attack?
0:00:46 The right questions to ask?
0:00:51 Above and beyond that, as this deep learning toolkit becomes more and more accessible, biology
0:00:54 is becoming democratized through ML.
0:00:57 So how is the hacker ethos coming to the world of biology?
0:01:01 And what might open source biology truly look like?
0:01:05 So Bharath, we’ve spent a lot of time thinking about deep learning and the life sciences.
0:01:10 It’s a great time, I think, for people to become practitioners in this space, especially
0:01:15 for people who maybe have never done machine learning before, coming from the life sciences
0:01:18 side, or maybe people from the machine learning side getting into the life sciences.
0:01:22 But maybe the place to kick it off is what’s special about now?
0:01:23 Why should people be thinking about this?
0:01:28 The challenge of programming biology has been that we don’t know biology, so we make up
0:01:33 theoretical models, the computers get it wrong, and biologists and chemists understandably
0:01:36 get grumpy and say, “Why are you wasting my time?”
0:01:40 But with machine learning, the advantage is that we can actually learn from the raw data.
0:01:43 And all of a sudden, we have this powerful new tool there.
0:01:45 It can find things that we didn’t know before.
0:01:50 And this is why now is the time to get into it, really to enable that next wave of breakthroughs
0:01:51 in the core science.
0:01:57 The part that still blows me away is just how fast this field is moving, and it feels
0:02:03 like it’s a combination of having the open source code on places like GitHub and arXiv,
0:02:07 and there’s an impactful paper a week when it used to be maybe a paper a quarter
0:02:09 or a paper a year.
0:02:13 And the fact that code is coming with the paper, it’s just layering on top.
0:02:17 That seems to me to be the critical thing that’s different now.
0:02:21 I think when you can clone a repo off GitHub, you don’t automatically have new insights
0:02:23 just because you’re using a new tool.
0:02:27 And now that thousands of people are getting into it, I think all of a sudden you’ll find
0:02:32 lots of semi-self-taught biologists who are really starting to find new, interesting things.
0:02:33 And that is why it’s exciting.
0:02:38 It’s like the hacker ethos, but kind of coming into the bio world, which has typically
0:02:40 been much more buttoned down.
0:02:44 I think anyone who can clone a repo can start really making a difference.
0:02:47 I think that’s going to be where the real long-term impact arises from these types
0:02:48 of efforts.
0:02:53 You don’t need a journal subscription to get arXiv papers or to get the code, and that
0:02:54 alone is kind of amazing.
0:02:59 It wasn’t that long ago that a lot of academic software was sold, maybe for
0:03:01 $500, which is very material.
0:03:02 That’s one piece.
0:03:08 You connect that to the concept that AI or ML can now unlock things in biology.
0:03:12 Then biology is becoming democratized, which is kind of your point.
0:03:17 And so let’s talk about that because we’re still learning biology collectively.
0:03:20 What is it about deep learning in biology now?
0:03:22 Because biology’s old, machine learning is old.
0:03:23 What’s new now?
0:03:26 Deep learning gets this question all over the place:
0:03:27 why does it work now?
0:03:30 The first neural nets kind of popped out in the 1950s.
0:03:32 And I think it’s really a combination of things.
0:03:38 I think that part of it is the hardware, really, the hardware, the software, the growth of kind
0:03:42 of rapid linear algebra stacks that have made it accessible.
0:03:47 I think also an underappreciated part of it is the growth of the cloud and the internet
0:03:48 really.
0:03:51 Neural nets are about as janky now as they were in the ’80s.
0:03:55 The difference is that I can now pull up a blog post where someone says, “Oh, these things
0:03:56 are janky.
0:03:57 Here’s the 17 things I did.
0:03:59 I can copy, paste that into my code.”
0:04:01 And all of a sudden, I’m a neural net expert.
0:04:02 It’s almost that easy.
0:04:07 It turns into a tradecraft, almost, that you can learn by just working through it.
0:04:09 That’s why the deep learning toolkit has become accessible.
0:04:14 Then you get to biology, and the question is why biology, why now?
0:04:17 And I think the question’s actually a little deeper.
0:04:21 I think that it’s really about representation learning.
0:04:27 So we have now reached this point where I think we can learn representations of molecules
0:04:28 that are useful.
0:04:33 This has been something that in the science of chemistry, we’ve been doing a long time.
0:04:38 There’s been all sorts of hand-encoded representations of parts of molecular behavior that we think
0:04:39 are important.
0:04:44 But I think now using the new technology from image processing, from word processing, we
0:04:47 can begin to learn molecular representations.
0:04:50 To be fair, I actually don’t think we’ve really broken through there.
0:04:55 If you look at what’s happening in images or text, they’re five years ahead of us.
0:05:00 Well, let me break in here because just for the listeners to give a sense for why representation
0:05:05 is important, and one of my pet examples is that if I gave anybody, say, two five-digit
0:05:07 numbers to add, it’d be trivial.
0:05:12 If I gave you those same five-digit numbers in Roman numerals and you wanted to add them,
0:05:14 the representation there would make this insane.
0:05:15 And what would you do?
0:05:21 Well, you would convert into appropriate representation where the operations are trivial or obvious.
0:05:26 And then the operation is done, and maybe it re-encodes, auto-encodes back to the other
0:05:27 representation.
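The Roman-numeral example can be made concrete with a small sketch. This is purely illustrative (not from the book): convert into a representation where the operation is trivial, operate, then re-encode back.

```python
# Adding numbers given in Roman numerals: awkward in the original
# representation, trivial after converting to positional integers.

ROMAN = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
         (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
         (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]

def roman_to_int(s: str) -> int:
    values = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
    total = 0
    # Pad with "I" so the last character is never treated as subtractive.
    for ch, nxt in zip(s, s[1:] + "I"):
        v = values[ch]
        # Subtractive notation: IV = 4, IX = 9, etc.
        total += -v if v < values[nxt] else v
    return total

def int_to_roman(n: int) -> str:
    out = []
    for value, symbol in ROMAN:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)

# "Add MMXIX and XLII" becomes easy after re-encoding, then encode back:
print(int_to_roman(roman_to_int("MMXIX") + roman_to_int("XLII")))  # MMLXI
```

The arithmetic itself never changes; only the representation does, which is the whole point of the analogy.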
0:05:28 So this is the problem.
0:05:32 It’s like when you have a picture, representations are obvious because it’s pixels, and computers
0:05:34 love pixels.
0:05:39 And maybe even for DNA, DNA is like a one-dimensional image, and so you have bases that are kind
0:05:40 of like pixels.
0:05:44 We used to joke in the early days that we would just take a photograph of a small molecule
0:05:46 and then use all the image machinery, but that’s kind of insane too.
0:05:51 And so with the right representation, things become transparent and obvious; with the wrong
0:05:53 representation, they become hard.
0:05:54 This is really at the heart of machine learning.
0:05:59 It’s that there’s something about the world that I want to compute on, but computers only
0:06:06 accept very limited forms of input: zeros and ones, text strings, simple structures.
0:06:11 Whereas if you take a molecule, a molecule is like a frighteningly complex entity.
0:06:16 So one thing that we often don’t realize is that until 100 years ago, we barely had any
0:06:17 idea what a molecule was.
0:06:23 It’s this alarmingly strange concept that although we see little diagrams in 10th grade
0:06:26 chemistry or whatever, that isn’t what a molecule is.
0:06:31 It’s a much weirder quantum object, dynamic, kind of shifting, flowing.
0:06:33 We barely understand it even now.
0:06:37 So then you just really start asking the question of what is water, for example?
0:06:40 Is it the three characters, H2O?
0:06:43 Is it two hydrogens and oxygen?
0:06:45 Is it some quantum construct?
0:06:47 Is it this dynamic vibrating thing?
0:06:49 Is it this bulk mass?
0:06:52 There’s so many layers to kind of the science of it.
0:06:55 So what you really want to do is you’ve got to pick one, and this is where it gets really
0:06:56 hard, right?
0:07:01 Like, if I’m thirsty, what I care about in water is a glass of water.
0:07:06 If I’m trying to answer deep questions about the structure of Neptune, I might want a slightly
0:07:08 different representation of water.
0:07:14 The power of the new deep learning techniques is we don’t necessarily have to pick a representation.
0:07:17 We don’t have to say water is X or water is Y.
0:07:22 Instead, you say, let’s do some math, and let’s take that math and let the machine really
0:07:27 learn the form of water that it needs to answer the question at hand.
0:07:32 So one form of mathematical construct is thinking of a molecule as a graph.
0:07:37 And if you do this, you can begin to do these graph-deep learning algorithms that can really
0:07:41 extract meaningful structure from the molecule itself.
0:07:46 We’ve learned, finally, that here’s a general enough mathematical form we can use to extract
0:07:52 meaningful insights about molecules or these critical biological chemical entities that
0:07:56 we can then use to answer real questions in the real world.
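A hedged sketch of the molecule-as-graph idea discussed above: the adjacency structure and per-atom features below are toy values for water, and the update is only the bare message-passing skeleton of a graph convolution, without the learned weights and nonlinearities a real layer (e.g. in a library like DeepChem) would add.

```python
# Molecule as a graph: atoms are nodes, bonds are edges. One
# graph-convolution step updates each atom's features by aggregating
# its neighbors' features: h_i <- h_i + sum over j in N(i) of h_j.

# Water (H2O): node 0 = O, nodes 1 and 2 = H; bonds O-H and O-H.
adjacency = {0: [1, 2], 1: [0], 2: [0]}

# Toy per-atom features: [atomic_number, num_hydrogens_attached]
features = {0: [8.0, 2.0], 1: [1.0, 0.0], 2: [1.0, 0.0]}

def graph_conv_step(adj, feats):
    """One round of neighbor aggregation over the molecular graph."""
    new_feats = {}
    for node, h in feats.items():
        agg = list(h)
        for neighbor in adj[node]:
            agg = [a + b for a, b in zip(agg, feats[neighbor])]
        new_feats[node] = agg
    return new_feats

updated = graph_conv_step(adjacency, features)
print(updated[0])  # oxygen now "sees" both hydrogens: [10.0, 2.0]
```

Stacking several such steps (with learned transforms in between) is how graph-deep-learning models extract structure from the molecule itself.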
0:08:00 What I think is interesting here in particular is that so much has been developed on images,
0:08:03 and there’s a lot of biology that’s images, and so we could just spend the whole time
0:08:08 talking about images, and it could be microscopy or radiology and tons of good stuff there.
0:08:12 But there’s a lot of biology that’s more than images, and molecules is a good example.
0:08:16 For a long time, it seemed like deep learning was being so successful in images that that’s
0:08:17 all it really did.
0:06:23 And if you could take your square peg and put it in whatever hole you’ve got, it would work.
0:08:26 What you’re talking about for graphs is kind of an interesting evolution of this, because
0:08:30 a graph and an image are different types of representations.
0:08:35 But at a technical level, convolutional networks for images or graph convolutions for graphs
0:08:39 are kind of borrowing a concept at a higher level.
0:08:44 The biology version of machine learning is starting to sort of grow up and starting to
0:08:49 not just be a direct copy of what was done with images and in other areas, but now starting
0:08:50 to be its own thing.
0:08:55 A five-year-old can really point out the critical points in an image, but you almost
0:08:58 need a PhD to understand the critical points of a protein.
0:09:04 So you have this dual burden of understanding, and it’s taken
0:09:09 a while for the biological machine learning approach to really mature because we’ve had
0:09:13 to spend so much time even figuring out the basics.
0:09:18 But now we’re finally at this point where it feels like we are diverging a little bit
0:09:23 from the core trunk of what people have done for images or text.
0:09:26 In another five years, I’m going to be blown away by what this thing does.
0:09:28 It’s going to understand more deeply.
0:09:35 So we kind of have this sort of connection between democratization of ML, ML into biology,
0:09:38 democratization into biology, but I don’t think we’re there yet.
0:09:42 I think for ML, I think there really is a sense of democratization.
0:09:49 You could code on your phone and do some useful things or certainly on a laptop, a cheap laptop.
0:09:51 But for biology, what is missing?
0:09:53 One is data, and there’s a fair bit of data.
0:09:58 In the book, we talk about the PDB, we talk about other data sets, and there are publicly
0:10:02 available data sets, but somehow that doesn’t get you into the big leagues.
0:10:06 So like if in this vision of democratizing biology, what’s left to be done?
0:10:12 In some ways, the democratization of ML is a teensy bit of an illusion even.
0:10:17 It’s because the core constructs were already mathematically invented: the
0:10:24 convolutional neural net and its cousins like the LSTM, the other core mathematical
0:10:28 breakthroughs that have been designed, so that you can take these building blocks and just
0:10:30 apply them straight out.
0:10:35 In biology, as you pointed out earlier, I think we don’t have those core building blocks
0:10:36 just yet.
0:10:41 We don’t know what the LEGO pieces are that would enable a newcomer to really start to
0:10:44 do breakthrough work.
0:10:45 We’re closer than we were.
0:10:49 I think we’ve had the beginnings of a toolbox, but we’re not there yet.
0:10:53 Let’s think about what happened on the ML side as inspiration for the Bio side.
0:10:54 How much is it driven through academia?
0:10:56 How much driven through companies?
0:10:59 Because what I’m getting at is that there’s a lot out in the open in academia.
0:11:02 I don’t know if we’re seeing that being open sourced by companies.
0:11:07 We’re getting to this really weird set of influences where in order for companies to
0:11:09 gain influence, they need to open source.
0:11:14 This is why 10 years ago, I can’t imagine that Google would have open sourced TensorFlow.
0:11:20 It would have been core proprietary technology, but now they know that if they don’t do that,
0:11:24 developers will shift to some other platform by some other company.
0:11:25 Exactly.
0:11:30 It’s weird that the competitive market forces are driving democratization.
0:11:35 PyTorch is basically Facebook-based and TensorFlow is from Google.
0:11:37 Let’s say Google kept TensorFlow proprietary.
0:11:39 What would be so bad for them if they did that?
0:11:41 What if everybody outside used PyTorch?
0:11:45 I think there’s a really neat analogy to the financial sector.
0:11:50 A lot of financial banks have masses of functional programs that they keep under the hood, under
0:11:51 the covers.
0:11:55 If you look at Jane Street, or I believe Standard Chartered, or a few of these other
0:12:00 big institutions, lots and lots of functional code hiding behind those walls.
0:12:04 But that hasn’t really infiltrated further out.
0:12:09 This actually, I think, in the long run weakens them because it’s harder to train, it’s harder
0:12:12 to find new talent, it’s more specialized.
0:12:17 A lot of the code base at Google is proprietary, like the original MapReduce was never put
0:12:18 out there.
0:12:22 This I think has actually caused them a little bit of a problem in that new developers coming
0:12:27 in have to spend months and months and months getting up to speed with the Google stack,
0:12:32 whereas if you look at TensorFlow, it doesn’t take any time at all, someone could walk in
0:12:34 and basically be able to write TensorFlow.
0:12:36 They’ve been using it for months to years.
0:12:37 Exactly.
0:12:42 And I think at the scale that Big Tech is at, this is just like, it’s a powerful market
0:12:43 advantage.
0:12:45 They’re almost outsourcing their education process.
0:12:48 And I guess if they don’t put it out, someone else will, and then they’ll learn on their
0:12:49 platform.
0:12:52 Yes, but then maybe what is the missing part in biology?
0:12:57 We’ve got pharma, a huge force there, but they have very specific goals.
0:13:02 A lot of agricultural companies, but it’s much more disparate.
0:13:08 Yeah, it’s dramatically hard to actually take an existing organization and turn it into
0:13:10 an AI machine learning organization.
0:13:17 So one thing I’ve honestly been surprised by is that when I’ve seen companies or organizations
0:13:22 I know try to incorporate AI into their drug discovery process, it ends up taking them
0:13:27 years and years and years, because they’re fighting all these upstream battles: weeks
0:13:32 to get their old computing system upgraded to the right version of their numerical library
0:13:35 so they could even install TensorFlow.
0:13:41 And then there were all these questions about who can actually, say, upgrade the core software.
0:13:44 Is it this department?
0:13:47 How much do they need to talk to the biologists, to the chemists?
0:13:52 And the fact is that pharma and the existing big companies are not built this way.
0:13:57 That’s not their core expertise, whereas if you look at Facebook or Google, they’ve been
0:14:02 doing machine learning for almost two decades now, from the first AdWords model.
0:14:07 So in some sense, they had to change very little about their culture, like, yeah, there’s
0:14:11 a slight difference instead of this function, use that function, but whatever.
0:14:15 But the core culture was there, and I think the culture, the people, changing that is
0:14:20 going to be dramatically hard, which is why I think it will really take, I think, ten
0:14:24 years and a generation of students who have been trained in the new way to come in and
0:14:25 shift it.
0:14:26 Yeah.
0:14:27 Well, Google was a startup too, right?
0:14:31 I think, you know, the thesis was, and is, that startups will be able to build
0:14:32 a new culture.
0:14:37 And I think the key thing that we’re seeing, sort of boots on the ground, is that the
0:14:41 culture can’t be your data scientists and machine learning people in one room and
0:14:45 your biologists in another room; they have to be one team.
0:14:50 What’s intriguing to me is just the size of the bio market.
0:14:55 Biology is healthcare, it’s agriculture, it’s food, it could be the future of manufacturing.
0:14:59 There’s so many different places that biology plays a role to date and will play a role,
0:15:02 but it just means that, to the point we’re talking about, these companies
0:15:06 are just being built right now.
0:15:12 There’s this whole host of challenges here, because biology is hard, and you have to build
0:15:17 that selective understanding: of the 10 best practices that existed,
0:15:19 five are actually still best practices.
0:15:23 The other five we need to toss out the window and replace with a deep learning model.
0:15:27 That kind of very painstaking process of experimentation and understanding.
0:15:31 That I think is like where the really hard innovation is happening.
0:15:32 And that’s going to take time.
0:15:36 You’re never going to be able to replace like a world-class biologist with any machine learning
0:15:37 program.
0:15:43 A world-class biologist is typically fricking brilliant and they often bring a set of understanding
0:15:46 that no programmer or no computer scientist can.
0:15:51 Now, the flip side holds true too, and I think that merger, as you said, that’s where
0:15:53 the magic is.
0:15:57 One really interesting factoid I heard from an entrepreneur in the space is that the
0:16:03 best biologists that they could hire had a market rate that was lower than
0:16:10 an intermediate front-end dev, and, you know, of course, front-end is
0:16:11 hard engineering.
0:16:15 I don’t want to put that down, but there’s so many fewer of these biologists, so there’s
0:16:21 almost this market imbalance of how is it possible that, you know, you can take really
0:16:27 a world-class biologist of whom there’s maybe a couple of hundred in the world and not have
0:16:29 them be valued properly by the market.
0:16:32 So do you even out those pay scales in one company?
0:16:37 Do you like have two awkward pay ladders that coexist and create tension in your company?
0:16:41 These are the types of like really hard operational questions that almost have nothing to do with
0:16:43 the science, but at the heart of it they do.
0:16:45 Maybe it’s interesting to talk about like how we can help people get there.
0:16:49 Yeah, so what’s like the training they should be doing, maybe we could even go like super
0:16:50 nuts and bolts.
0:16:52 So I got my laptop, what do I do?
0:16:58 So I mean, like, I guess there’s a couple of key packages we install, like TensorFlow, maybe
0:17:00 DeepChem, something like that.
0:17:04 Python is often already installed, let’s say on a Mac, is that it?
0:17:06 And then we start going through papers and books and code.
0:17:11 I think the first place really is to, you need to form an understanding of like what
0:17:14 are the problems even that you can think about.
0:17:19 I think if you’re not trained as a biologist, and even if you are, you might not see that
0:17:25 intersection of these are the problems where biological machine learning can or cannot work.
0:17:29 And that I think is really what the book tries to teach you, as in like, what’s the frame
0:17:30 of thinking?
0:17:34 What’s the lens at which you look at this world and say that, oh, that is data coming
0:17:36 out of a microscope.
0:17:40 I should spend 30 minutes, spin up a convnet, and iterate on that.
0:17:46 Versus: this is a really gnarly thing about how I prepare my C. elegans samples.
0:17:49 I don’t think the deep learning is going to help me here.
0:17:52 And I think it’s that blend of knowledge that the book tries to give you.
0:17:53 It’s like a guidebook.
0:17:57 When you see a new problem, you ask, is this a machine learning problem?
0:17:59 If so, let me use these muscles.
0:18:03 If it’s not a machine learning problem, well, I know that I need to talk to someone who
0:18:05 does know these things.
0:18:06 And that’s what we try to give.
0:18:08 Andrew Ng has a great rule of thumb.
0:18:12 If, you know, a human can do it in a second, deep learning can probably figure it out.
0:18:19 So start with something like say microscopy, you have an image coming in and an expert
0:18:22 can probably eyeball and say, interesting, not interesting.
0:18:24 So there’s this binary choice.
0:18:30 And there’s some arcane black box that was trained within the expert’s head and experience.
0:18:34 That’s actually the sort of thing machine learning is like made to solve.
0:18:38 So really ask yourself, like, when you see something like that, is there some type of
0:18:44 perceptual input coming in, image, sound, text, and increasingly molecules, a weird
0:18:49 new form of perception, almost magnetic or quantum, but you have perceptual input coming
0:18:50 in.
0:18:56 And is there a simple right, wrong, left, right, intensity type answer that you want
0:18:57 from it?
0:19:00 If you do, that’s really a machine learning problem at its heart.
0:19:01 Well, so that’s one type of machine learning.
0:19:06 And I think the benefit there of that, what human can do in a second, deep learning can
0:19:12 do, especially since, in principle, on the cloud, you could spin up 10,000 servers.
0:19:15 Suddenly you’ve got 10,000 people working to solve the problem.
0:19:17 And then they go back to something else.
0:19:19 That’s just something you can’t do with people.
0:19:24 Or you’ve got 10,000 people working 24/7, as necessary, can’t do that with people.
0:19:28 But there’s another type of machine learning, which is to do things people can’t.
0:19:32 Or maybe more specifically, do things individual people can’t, but maybe crowds could.
0:19:37 So like we see this in radiology, right, where the machine learning can have accuracies
0:19:41 greater than any individual, akin to what, let’s say, the consensus would be, which would
0:19:43 be the gold standard.
0:19:47 That’s maybe the real exciting part, sort of the so-called superhuman intelligence.
0:19:49 Where are the boundaries of possibilities there?
0:19:54 One of the biggest problems really with deep learning is that you have some like strange
0:19:56 and crazy prediction.
0:20:02 Now I think that there’s a fallacy that people fall into of trusting the machine too easily.
0:20:07 Because 90% of the time that’s going to be garbage.
0:20:12 And I think that really kind of the challenge of picking out these bits of superhuman insight
0:20:16 is to know how to shave off the trash predictions.
0:20:17 Yeah.
0:20:19 Is 90% an exaggeration or is it really 90%?
0:20:23 I like nice round numbers, so that might have just been something I picked out.
0:20:26 But there’s like this great example, I think, in medicine.
0:20:32 So there’s scans coming in and the deep learning algorithm is doing like amazing at predicting
0:20:33 it.
0:20:38 And then they dug into it, and it turned out that the scans came from three centers.
0:20:43 One of them had some type of center label on the scan, the trauma center,
0:20:44 versus the other non-trauma centers.
0:20:49 The deep learning algorithm, like a kindergartner told to do this, learned to identify
0:20:52 the trauma label and flag those cases.
0:20:57 So if you did naive statistics blending them all together, you’d look amazing.
0:20:58 But really it’s looking for a sticker.
0:20:59 Yeah.
0:21:00 I mean, there’s tons of examples like that.
0:21:04 One where pathology images had a ruler in them and the model becomes a ruler detector, and
0:21:05 so on.
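The trauma-center story above can be reconstructed as a toy simulation; every name and number below is invented purely for illustration. A "model" that never looks at the scan and only reads the site label still scores around 90%, which is exactly the shortcut-learning trap being described.

```python
# Shortcut learning in miniature: if a hospital's site label correlates
# with the outcome, a model can score brilliantly by reading the label
# instead of the scan. Synthetic data, stdlib only.

import random

random.seed(0)

# Each "scan" is (site, label). In this synthetic set, the trauma center
# mostly sees positive cases, so the site alone is highly predictive.
def make_scan():
    site = random.choice(["trauma_center", "clinic"])
    positive = random.random() < (0.9 if site == "trauma_center" else 0.1)
    return site, positive

data = [make_scan() for _ in range(1000)]

# The "sticker detector": ignore the scan content entirely.
predict = lambda site: site == "trauma_center"

accuracy = sum(predict(site) == label for site, label in data) / len(data)
print(f"accuracy by reading the site label alone: {accuracy:.0%}")  # ~90%
```

Nothing medically meaningful was learned, which is why an accuracy that looks amazing on blended statistics deserves suspicion.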
0:21:09 Like, you know, an AUC, a measure of accuracy, of close to 1.0.
0:21:14 We all have to be very suspicious of that, because just running a second experiment wouldn’t
0:21:16 predict the first experiment with that type of accuracy.
0:21:18 Anything that’s too good to be true probably is.
0:21:19 Yeah.
0:21:23 I think then you get into the really subtle challenges, which is that, you know, the algorithm
0:21:28 tells me this molecule should be non-toxic to a human and should have effect on this,
0:21:29 you know, indication.
0:21:31 Do I trust it?
0:21:34 Is it possible that there’s a false pattern learned there?
0:21:37 Humans make these types of mistakes all the time, right?
0:21:42 Like if you’re in any actual biotech, you know that there are going to be molecules
0:21:44 made, or development candidates, that are disproven.
0:21:49 So you’re getting into the hard core of learning, which is that, is this real?
0:21:51 The reality is we don’t have answers to these.
0:21:55 We’re really kind of trending into the edge of machine learning today, which is,
0:21:58 is this a causal mechanism?
0:21:59 Does A cause B?
0:22:01 Is it a spurious correlation?
0:22:04 And now we’re getting to that place where humans aren’t necessarily better.
0:22:09 We talk about some techniques for interpreting, for looking at kind of what informed the decision
0:22:11 of the deep learning algorithm.
0:22:16 And we do provide a few kind of tips and tricks to start thinking about it, but the reality
0:22:19 is that’s kind of the hard part of machine learning.
0:22:20 It’s the edge.
0:22:24 The interpreting chapter is one of my favorite ones because it’s often sort of become so-called
0:22:29 common wisdom that machine learning is a black box, but in fact, it doesn’t have to be and
0:22:32 there’s lots of things to do and we are quite prescriptive there.
0:22:36 So the interpretability I think also is frankly what’s going to make human beings more at
0:22:38 peace with this.
0:22:40 And this isn’t anything unique to machine learning.
0:22:46 If you had some guru who’s just spouting off stuff and said, “Buy this stock X and short
0:22:51 the stock Y and put all your life savings into it,” you probably would be thinking, “Okay,
0:22:54 well, maybe, but why?”
0:22:57 So I think this is just human nature and there’s no reason why our interaction with machines
0:22:59 would be any different.
0:23:04 What I think is interesting is human beings are notoriously bad at causality.
0:23:07 We kind of attribute things to be causal when they’re not causal at all.
0:23:12 We do that in our lives from why did that person give me that cup of coffee to why did
0:23:13 that drug fail?
0:23:15 All these different reasons.
0:23:17 There’s two big misconceptions about machine learning.
0:23:18 One is lack of interpretability.
0:23:22 The second one is correlation doesn’t mean causation, which is true, but somehow people
0:23:26 take that to mean it’s impossible to compute causality.
0:23:30 And that’s the part that I think people have to really be educated on because there are
0:23:33 now numerous theories of causality.
0:23:36 And you could use probabilistic graphical models, PGMs.
0:23:38 There’s lots of ways to go after causality.
0:23:40 The whole trick though is you need time series data.
0:23:43 What’s beautiful about biology or at least in healthcare is that we’ve got time series
0:23:45 data in many cases.
0:23:51 So now perhaps finally there’s the ability to really understand causality in a way that
0:23:55 human beings couldn’t because we’re so bad at it and machines are good at it and we’ve
0:23:56 got the data.
0:24:00 Can you think of a place where in your experience the algorithms have succeeded in teasing out
0:24:03 a causal structure that people missed?
0:24:10 Yeah, so I think in healthcare we always think about what is leading to various changes like
0:24:15 this drug having adverse effects, this diet having positive or negative effects.
0:24:20 All of these things are being understood in the category of real world evidence, which
0:24:23 is a big deal in pharma these days.
0:24:28 And if you think about it, a clinical trial is really a poor man’s surrogate for not
0:24:32 understanding causality, because if we don’t understand causality you’ve got to do this
0:24:36 thing where it’s double-blind, we start from scratch, follow it in time, and
0:24:37 see what happens.
0:24:42 If you understood causality you might be able to just get a lot of results from just mining
0:24:43 the data itself.
0:24:47 As a great example you can’t do clinical trials for all pairs of drugs.
0:24:51 I mean just doing for a single drug is ridiculously expensive and important, but all pairs of
0:24:54 drugs would never happen, but people take pairs of drugs all the time.
0:24:59 And so finding their adverse effects from real world data is probably the only way to do it, and
0:25:03 we can actually get causality and there’s tons of interesting journal medicine papers
0:25:07 sort of saying, “Aha, we found this from doing data analyses.”
0:25:09 I think that’s just starting out.
0:25:14 Honestly, I think that bio-AI drug discovery needs to take a page from the self-driving
0:25:19 car companies; in the neighboring self-driving car world, simulators are all the rage.
0:25:25 And really because it’s that same notion of causality almost, like there’s a structure
0:25:30 to the world: pedestrians walk out, chickens, alligators, whatever crazy thing;
0:25:34 it happens.
0:25:39 So I think there they’ve built this amazing infrastructure of being able to run these
0:25:44 repeated experiments, almost a randomized clinical trials, but informed by real data.
0:25:49 We don’t yet have that infrastructure in the bio world, and I know there are a couple of exciting
0:25:53 startups starting to kind of move in that direction, but I think it’s when we
0:25:58 can really probe the causality at scale and then in addition to just probing it, when
0:26:04 the simulator is wrong, use the new data point that came in and have the simulator learn
0:26:05 to fix itself.
0:26:09 That’s when you get to this really amazing feedback loop that could really revolutionize
0:26:10 biology.
0:26:15 Yeah, so we talked about some basic nuts and bolts about how to get started and the framing
0:26:17 of questions, which is a key part.
0:26:22 So let’s say people, they’re set up, they’ve got their question, where do they go from
0:26:23 there?
0:26:27 I mean, in a sense, we’re talking about something closer to open source biology, to the extent
0:26:33 that biology is programmable. And synthetic biology very much is; it’s been around
0:26:35 for a while, but I think it’s really starting to crest.
0:26:39 How do these pieces come together such that we could finally get to this sort of open source
0:26:41 biology democratization of biology?
0:26:45 A big part of this is really the growth of community.
0:26:49 There are people behind all these GitHub pages that you see.
0:26:54 There are real, decentralized, powerful organizations. If you look at the Linux Foundation,
0:26:59 or if you look at, say, Bitcoin Core, there are networks of open source contributors
0:27:02 that really form this brain trust.
0:27:03 It’s very diffuse.
0:27:08 It’s not centralized in a Stanford or Harvard med department, or whatever.
0:27:11 And I think what we’re going to see is the advent of similar decentralized brain trusts
0:27:13 in the bio world.
0:27:17 It’s a network of experts who are spread across the world and who
0:27:20 contribute through these code patches.
0:27:22 And that, I think, is not at all new to the software world.
0:27:24 We’ve seen that for decades.
0:27:25 It’s totally new to biology.
0:27:26 It’s alien.
0:27:32 Like, you would be surprised how much skepticism there can be at the idea that a non-Harvard
0:27:35 trained, say, biologist can come up with a deep insight.
0:27:37 We know that to be a fact, right?
0:27:43 There are multiple PhDs’ worth of work in just the Linux kernel, and that community
0:27:46 really doesn’t care to get that stamp of approval.
0:27:50 So I think we’re going to see the similar parallel kind of knowledge base that grows
0:27:51 organically.
0:27:55 But it takes time because you’re talking about the building of another kind of almost educational
0:27:57 structure, which is this new and exciting direction.
0:28:00 Here’s the challenge I worry about the most: if you’re building a
0:28:05 Linux kernel, you can test whether it works or doesn’t work relatively easily.
0:28:09 Even as it is, there’s this huge reproducibility crisis in biology.
0:28:14 So how does one sort of become immune from that, or at least not tainted by that?
0:28:15 How do you know what to trust?
0:28:18 And this is a really, really interesting question.
0:28:23 And this is kind of shading a little bit almost into the crypto world, right?
0:28:27 Like, you could potentially think about this experiment where you have a molecule.
0:28:31 You don’t know what’s going to happen to it, but maybe you create a prediction market
0:28:34 on the future of that small molecule.
0:28:38 And you could then begin to create these historical records of predictions.
0:28:42 And we all know there are expert drug pickers at Big Pharma who can eyeball
0:28:45 a molecule and say, that one’s going through, that one’s failing.
0:28:48 And five years later, you’re like, well, okay, yes, I was right.
0:28:52 There is the beginnings of infrastructure for these feedback mechanisms, but it’s a really
0:28:53 hard problem.
0:28:54 Yeah.
0:28:55 I’m trying to think though what that would be like.
0:28:59 The huge thing is like, you could imagine if it was a simple question, like, is this
0:29:01 drug soluble?
0:29:03 Someone might run a cheap software calculation.
0:29:07 Someone might do the experiment and there’s different levels of cost of having different
0:29:09 levels of certainty.
0:29:14 You’re essentially describing a decentralized research facility.
0:29:16 Maybe the question is who would use it?
0:29:21 This is, I think, the really hard part, because I think that biopharma tends to be a little
0:29:24 more risk-averse, for good reasons, than many other industries.
0:29:29 But I actually think that in the long run, this could be really interesting, because if
0:29:34 you have multiple assets in a company, you could unbundle the assets, and
0:29:39 then you could start to get this much more granular understanding of which assets
0:29:41 actually do well and which assets don’t.
0:29:47 And if you make it okay for people to place a bet on these assets, all of a sudden
0:29:53 it’s de-risked, because if you’re a big pharma and you’re like, I don’t really believe that
0:30:00 Alzheimer’s molecule does what is claimed, but I’m going to say there are 15% odds it goes
0:30:01 through.
0:30:04 I’ll just invest 15% of what I would have in another world.
0:30:07 The trick, and what we’re talking about now is really the world of financial instruments
0:30:11 as well, is that you have to know how to price the risk of an asset.
0:30:15 And so it could be in the end, one of the first interesting applications of deep learning,
0:30:20 machine learning is to use all the available data to give the maximum likelihood estimator
0:30:22 of what we think this asset is going to be.
0:30:25 It prices the asset and then people can go from there.
0:30:29 It’s kind of a fun world where we’re sort of thinking about how the financial world,
0:30:33 machine learning world and biology come together to kind of decentralize it and democratize
0:30:34 it.
0:30:40 I think there are opportunities to allow for more risk, for the long tail to be played
0:30:41 out.
0:30:45 You don’t have as many interesting hypotheses that go dead in the water because they weren’t
0:30:48 de-risked enough for a big bet.
0:30:53 So, you know, what I think the big takeaway for me here is that there is that possible
0:30:56 world, but I forget if this is the way you learned how to program.
0:31:03 The way many of us did is, I learned when I was about 11 on an actual TI-99/4A, and
0:31:07 I was just playing around with it, and I learned so much because I could just get my
0:31:09 hands right in it.
0:31:13 And I think kind of, my hope for the book is that it’s kind of the equivalent in biology
0:31:16 that people can get their hands in it and I don’t know where they’re going to go with
0:31:17 it.
0:31:19 I don’t know if they go where we’re describing.
0:31:22 That’s one of many possible futures, but I think that’s what we’re hopefully
0:31:23 able to give people.
0:31:25 We are opening up the sandbox.
0:31:30 Here’s what we’ve learned in kind of these very exclusive academic institutions.
0:31:36 Let’s throw the gate open, say here’s as much as we know as we can try to distill it down
0:31:38 and do what you will with it.
0:31:42 Like open source means no permission, so go to town and hopefully do something good for
0:31:44 the world is kind of the dream.
0:31:45 That sounds fantastic.
0:31:46 Well, thank you so much for joining us.
0:31:47 Thank you for having me.

with Vijay Pande (@vijaypande) and Bharath Ramsundar

Deep learning has arrived in the life sciences: every week, it seems, a new published study comes out… with code on top. In this episode, a16z General Partner Vijay Pande and Bharath Ramsundar talk about how AI/ML is unlocking the field in a new way, in a conversation around their book, Deep Learning for the Life Sciences: Applying Deep Learning to Genomics, Microscopy, Drug Discovery, and More (co-authored with Peter Eastman and Patrick Walters).

So — why now? ML is old, and bio is certainly old. What is it about deep learning’s evolution that is allowing it to finally make a major impact in the life sciences? What is the practical toolkit you need, the right kinds of problems to attack, and the right questions to ask? How is the hacker ethos coming to the world of biology? And what might “open source biology” look like in the future?