From AlphaFold to MMseqs2-GPU: How AI is Accelerating Protein Science

AI transcript
0:00:15 Hello, and welcome to the NVIDIA AI podcast. I’m your host, Noah Kravitz.
0:00:21 Work being done on protein structures is amongst the most exciting and impactful work
0:00:27 being assisted by AI right now. With us today are two of the leaders in the industry at the
0:00:31 forefront of research and development on protein structures. They’re here to talk about some
0:00:37 exciting developments in the area, including the recent acceptance of a major paper to Nature
0:00:43 Methods. So let’s get right into it. Today with me are Chris Dallago and Martin Steinegger. Chris is
0:00:48 research lead at NVIDIA and visiting professor at Duke University. He’s been at the forefront of GPU
0:00:55 accelerated machine learning, advancing the way we apply AI to complex problems in biology. And he’s
0:01:00 been working on several protein structure related activities that we’re going to talk about in a
0:01:06 second. And with Chris is Martin Steinegger. Martin is an associate professor of biology at Seoul
0:01:10 National University, joining us from South Korea this morning, this morning our time, evening your
0:01:17 time, Martin. And he is a co-author on the Nobel Prize-winning AlphaFold paper, as well as the developer
0:01:23 of many foundational tools used in homology search and structure prediction, and increasingly used for
0:01:29 protein design. Here to unpack all of that and get into the excitement around this month’s
0:01:34 announcements, and just generally speaking, all of the advancements in protein structure.
0:01:38 Chris and Martin, thank you so much for joining the NVIDIA AI podcast.
0:01:39 Thanks for having us, Noah.
0:01:40 Thank you.
0:01:46 So let’s start high level. And maybe we’ll start with you, Martin. But obviously, Chris, feel free to jump
0:01:52 in. Why is this so important? What’s the significance of proteins and their 3D structures? And why are we
0:01:55 talking about it on the NVIDIA AI podcast?
0:02:02 Yeah, so it’s a very good question. Proteins are really these like small machineries that drive cells or drive
0:02:09 effectively everything in life around us. And so they’re composed of amino acids. So you have like 20 of these
0:02:16 amino acids, and you can imagine like putting these amino acids in a line as a string. And if you put the
0:02:21 string into water, the solvent, what happens is that this string turns into a three-dimensional
0:02:26 structure. And this three-dimensional structure really implies the function of this machinery.
0:02:31 And that is really what we care about in the end. And it’s important for drug discovery, you know,
0:02:36 understanding how this machinery works, how can we modify it for our purposes, how can we really,
0:02:41 yeah, use it for humanity’s good in some way. And so the structure is really fundamental for that.
0:02:47 And can you tell us then a little bit about the work that you’re doing, your lab does at
0:02:51 Seoul National University? And you got into it a little bit talking about the machinery,
0:02:58 but why are these structures so central? Why have researchers been spending so much, you know,
0:03:02 blood, sweat, and tears, to put it that way, for years, trying to understand the way these
0:03:04 structures work and relate?
0:03:10 Yeah. And so a bit about our work. So talking about proteins, so there are about like 22,000 or so
0:03:16 that are encoded in the human genome. But in nature, there are way more, we have millions or billions of
0:03:22 these machines out there. And the work that we do in our lab is to kind of understand all of them and
0:03:27 organize them, make them searchable. You can imagine like Google for proteins, so taking a protein
0:03:32 sequence and finding similar-looking ones. And why is that important? Because of the information that
0:03:37 you can find from similar ones. And you can imagine there are, like, similar proteins in apes and so on,
0:03:43 you know, like this evolution between them. And by finding and identifying them, you are somehow able to
0:03:46 understand this one protein that you care about from human, for example, better, you know,
0:03:51 you all of a sudden have understanding that certain amino acids cannot be changed, certain amino acids
0:03:56 might be in contact with each other, actually like three-dimensional contacts. And this is really
0:03:59 important for the structure in the end, and then for its function.
0:04:06 And so Chris, how have models like AlphaFold (we mentioned Martin is a co-author on the Nobel Prize-winning
0:04:13 paper), how have these models revolutionized the scientific community and academics and industry by proxy?
0:04:22 Yeah, I think that, I mean, they had an incredible impact. I don’t think I have seen any pharma company or
0:04:29 techbio or biotech company at this point that doesn’t use AlphaFold in some way in their research. They
0:04:35 either use it very directly or they use it as a proxy to do their research. So clearly it has helped
0:04:41 drug hunters to create new drugs. Beyond that, as Martin mentioned, by being able to fold the proteins
0:04:46 very efficiently, getting their 3D structures computationally, people are able to start making
0:04:51 discoveries about how biology actually works. So it’s not just about we can predict this protein and
0:04:57 you know, it has this 3D shape, but we can start making inferences about what are these four proteins
0:05:03 doing together? How do they form a particular pathway? How does that pathway interact with brain
0:05:10 function? And so I can see that from a basic biology discovery, from some of the experience that I have
0:05:16 at Duke, it’s having a tremendous impact. And with all of the interactions that I luckily have with the
0:05:23 industry through NVIDIA, clearly everybody’s using either AlphaFold very directly or as a proxy, for
0:05:27 instance, by going through the AlphaFold database, which is an incredible resource.
0:05:44 And so today, NVIDIA made a number of announcements around different research outputs. Some of them, or many of them, I should say, in collaboration with partners, all having to do with protein structure and generation. Chris, can you give us kind of the highlights and rundown why this is such an important moment?
0:06:14 Yeah, absolutely. I joined NVIDIA about three years ago, a little bit over three years ago at this point. And when I joined, we were really thinking about what is our role, or at least I joined when people were starting to think about what is our role in this messy and crazy interesting space of digital biology and computational biology. And I think since then, we’ve really matured to understand that our strength comes from, first of all, interacting with the ecosystem,
0:06:44 with our partners, with Martin, collaborating on interesting work. And secondarily, obviously, our background is in accelerated computing, and so making these tools work faster. And so we’ve made a bunch of announcements over the last couple of months, and this month in particular, around accelerated structure prediction: specific operations that are common to all of the structure prediction methods, whether it’s AlphaFold, OpenFold, ColabFold, or the new models like Boltz,
0:07:07 like the triangular attention and multiplication operations. So we’ve done targeted accelerations of these atomic things that researchers can use to build the next generation of models. And we’ve also accelerated the models themselves. So we have inference microservices, specifically, it’s called NVIDIA Inference Microservices, or NIM, for the inference of Boltz and OpenFold and AlphaFold.
0:07:11 So people can just take these models and run with them, and they are accelerated by us.
0:07:17 And then we’ve done, obviously, the announcement with Martin already last year.
0:07:21 We announced the availability of the software and the preprint.
0:07:28 Today we are announcing the acceptance in Nature Methods, as well as some new compatibility with Blackwell GPUs.
0:07:40 And all of this is basically our attempt as NVIDIA to contribute to the community, to make it easier for people to use these tools in an efficient manner and to make the best of the hardware that they get.
0:07:47 Martin, among the announcements made was the acceptance of MMseqs2-GPU, a homology retrieval tool, to Nature Methods.
0:07:55 Can you explain what homology retrieval is and why GPU acceleration changes the game for protein sequence searching?
0:07:59 Okay, let me explain the why from the perspective of protein structure prediction.
0:08:12 So when you do an AlphaFold prediction, you actually do not predict just the single protein sequence into a structure alone, but rather you need homology information.
0:08:19 So the way how you get that is by taking a protein and you search through a database of hundreds of millions or billions of protein sequences.
0:08:24 You retrieve the ones that hopefully share a homologous relationship to this one.
0:08:35 And they form this, like, multiple sequence alignment where you now have, for example, a protein from human, one that is close from chimpanzee, one close from whale, and even from bacteria.
0:08:44 And this information in the end is really constraining the folding space for you because it’s kind of telling you what is possible, what was possible in the history of time through that protein.
0:08:47 You know, like there are certain positions that you don’t want to change at all.
0:08:49 It might be like extremely functional, conserved sites.
0:08:55 But then you find also patterns in this that somehow tell you that certain things are in physical contact with each other.
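To make that concrete, here is a minimal Python sketch of how conserved columns can be read off a multiple sequence alignment; the toy alignment and species labels are invented for illustration, and a real MSA would come from a homology search such as MMseqs2:

```python
# Toy illustration: columns where (almost) every homolog keeps the same amino
# acid are candidates for functionally conserved positions. The alignment
# below is made up; real MSAs come from searching large sequence databases.
from collections import Counter

msa = [
    "MKVLAT-GL",   # e.g. the human protein
    "MKVLAS-GL",   # e.g. a chimpanzee homolog
    "MKILAT-GV",   # e.g. a whale homolog
    "MRVLGTAGL",   # e.g. a distant bacterial homolog
]

def column_conservation(alignment):
    """Fraction of sequences sharing the most common residue in each column."""
    n_seqs = len(alignment)
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]        # ignore gap characters
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / n_seqs)
    return scores

for pos, score in enumerate(column_conservation(msa), start=1):
    print(f"position {pos:2d}: conservation {score:.2f}")
```

Columns that stay identical across distant homologs are the positions Martin describes as ones you don't want to change at all; correlated changes between pairs of columns are what hint at 3D contacts.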
0:09:15 So along those lines, MMseqs2 is part of a bigger set of tools that were announced today, building on previous tools you alluded to.
0:09:34 Martin, you started to get into this a little bit, but Chris, for the uninitiated, how do all of the pieces of the puzzle of supporting libraries, different AI tools, help make something like MMseqs2, you know, that much easier to use and more useful in actual practice?
0:09:37 Yeah, I mean, MMseqs2 is a spectacular tool.
0:09:40 I was very lucky that Martin agreed to work with us.
0:09:44 If I have to be honest here, I’ve been chasing him for a long time.
0:09:47 We had been friends since long before I joined NVIDIA.
0:09:48 Amazing.
0:09:59 So, you know, I consider our ability to contribute to MMseqs2 and make it more helpful to be, ultimately, you know, really like a trophy for me personally.
0:10:04 And I think for NVIDIA, it’s a great testimonial as to what we can do when we collaborate with the right partners.
0:10:18 How it fits into the bigger picture, I think there’s one figure that shows this very well, which is: if you take AlphaFold 2 off the shelf, the way it is today on the GitHub repository of DeepMind.
0:10:28 If you execute an inference for a protein structure, you will be stuck with 80% of the compute time being allocated to this homology retrieval step.
0:10:30 So, the machine learning step, the actual inference,
0:10:34 is only taking about 20% of the execution time.
0:10:38 With MMseqs2-GPU, we are inverting that relationship.
0:10:43 So, now it only takes 20% of the total execution time to do this homology retrieval step.
0:10:48 And 80% of the time is actually the deep learning inference part.
0:10:59 And so, now we have all of these other things that are coming in, cuEquivariance, NVIDIA Inference Microservices, all of these other tools that we are developing that accelerate those machine learning steps.
0:11:02 And so, we are reducing, again, the time of the 80%.
0:11:04 And so, maybe we get something that is 50-50.
0:11:06 And then, maybe we invert it again.
0:11:11 Maybe the structure prediction is so fast that the homology retrieval becomes the bottleneck again.
0:11:14 And then, we will work more with Martin to make that even faster.
0:11:15 Right?
0:11:18 So, we see this as sort of like our superpower is acceleration.
0:11:27 And we want to make sure that we are always balancing, you know, the most computationally intensive step and optimizing it as much as we can.
0:11:32 So, that’s what you get from the MMseqs2 work as well as the cuEquivariance work that we’ve done.
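As a rough back-of-the-envelope illustration of what that 80/20 inversion means end to end (the absolute minutes below are hypothetical; only the 80/20 proportions come from the discussion above):

```python
# Hypothetical timings illustrating the 80/20 inversion described above.
baseline_total = 100.0                      # minutes for one run, chosen arbitrarily
msa_before = 0.8 * baseline_total           # 80%: CPU homology retrieval / MSA step
inference_before = 0.2 * baseline_total     # 20%: deep-learning inference

# If GPU-accelerated retrieval shrinks until inference accounts for ~80% of the
# new total (and inference time itself is unchanged), the new total becomes:
new_total = inference_before / 0.8
msa_after = new_total - inference_before    # the remaining ~20%

print(f"before: {baseline_total:.0f} min total ({msa_before:.0f} MSA + {inference_before:.0f} inference)")
print(f"after:  {new_total:.0f} min total ({msa_after:.0f} MSA + {inference_before:.0f} inference)")
print(f"end-to-end speedup: {baseline_total / new_total:.1f}x")   # roughly 4x in this toy example
```

The same arithmetic explains Chris's point about re-balancing: once the inference side is accelerated too, the ratio shifts back toward retrieval, and the next optimization target changes again.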
0:11:33 Fantastic.
0:11:37 I’m speaking with Christian Dallago and Martin Steinegger.
0:11:41 Chris is a research lead at NVIDIA and a visiting professor at Duke University.
0:11:51 And he’s been at the forefront of GPU-accelerated machine learning, applying advanced AI to complex problems in biology for years now.
0:11:59 And we’re talking about, well, a few different activities, but central to them, the MMseqs2-GPU paper being accepted in Nature Methods.
0:12:04 And Martin is an associate professor of biology at Seoul National University.
0:12:15 And as we’ve been speaking of, he’s a co-author on the AlphaFold paper that won the Nobel Prize and the developer of many foundational tools, as Chris was saying, used in homology search, as well as structure prediction.
0:12:24 I want to zoom out a little bit and talk about research going beyond just AlphaFold and into some of the other tools the community is using.
0:12:31 Martin, you’ve led the way in building several popular tools at Seoul National, including MMseqs2-GPU.
0:12:33 What inspires you?
0:12:38 What kind of gives you the vision, the intuition to build tools that are used by the broader community?
0:12:41 Yeah, that’s a good question.
0:12:44 So for MMseqs2-GPU, MMseqs2 actually goes far back.
0:12:48 It’s work that was done when I was doing my PhD with Johannes Söding.
0:12:52 So effectively, that was driven initially by his main idea.
0:12:57 And then I later thought about like, what is important, you know, like, why would you develop that tool?
0:12:59 MMseqs2 is a homology search method.
0:13:07 There were already many, many before, for example, BLAST, a very popular, amazing tool, well-maintained by the NCBI.
0:13:09 So why would you build another one?
0:13:11 I think, you know, something has changed.
0:13:14 And for things that change, like, for example, you have to see where data is going.
0:13:18 So at that time, metagenomic data was starting to explode.
0:13:27 So we had more and more sequencing data from environments where we go out and, like, sequence human gut, we sequence the oceans, we sequence forests and so on.
0:13:30 And we needed a tool to somehow to search through that very, very fast.
0:13:36 So that’s really why MMseqs2 became important, because it somehow made sense of this protein data.
0:13:45 And then we developed Foldseek, and that’s one of the first projects that happened when I started my professor position at Seoul National University.
0:13:49 And there, we could somehow see that structures were starting to take off.
0:13:57 It was actually before the AlphaFold database; it was clear that AlphaFold2 was there, and big structural data would come, and more and more high-quality structures would be there.
0:13:59 And so we would have a lot of structure data.
0:14:02 I thought initially we would have a lot of, like, really low-quality structure data.
0:14:09 We somehow needed a tool to organize that, you know, like, to find, in the low-quality structure data, these, like, pieces of knowledge.
0:14:12 And we need a tool that can compare structures very, very fast.
0:14:15 And that’s how Foldseek somehow started.
0:14:17 And we extended it to Foldseek multimer.
0:14:25 You know, now with AlphaFold, like, models, you can start predicting not just monomeric structures, but you can see how things are interacting, like, how two proteins interact with each other.
0:14:28 And, yeah, that generates a lot of data, right?
0:14:35 If you just imagine how many possibilities you have in humans, we have 22,000 genes, you do 22,000 times 22,000, and that’s just pairs, right?
0:14:36 Just a pairwise interaction.
0:14:37 That’s a huge number.
0:14:41 And so we need somehow tools that go through that and organize this.
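A quick sanity check on that combinatorics, taking the roughly 22,000 protein-coding genes mentioned earlier (the count is approximate, and this ignores isoforms and complexes larger than pairs):

```python
# Rough count of how quickly pairwise (and higher-order) interaction screens blow up.
from math import comb

n_proteins = 22_000                    # approximate number of human protein-coding genes

pairs = comb(n_proteins, 2)            # unordered pairs: n * (n - 1) / 2
triplets = comb(n_proteins, 3)         # it gets far worse for complexes of three

print(f"{pairs:,} possible pairs")     # roughly 242 million
print(f"{triplets:,} possible triplets")
```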
0:14:46 So we’re somehow looking where data is going, as well as, like, what is needed by biologists in the end, right?
0:14:54 So what will be really important for them is, like, to be given the tools to look at the data that comes out and be ready for the next type of data modality.
0:15:02 So you mentioned Foldseek, and correct me if I’ve got this wrong, but Foldseek lets researchers compare protein structures.
0:15:05 It wasn’t even needed at this scale before AlphaFold.
0:15:07 Now it’s become so important.
0:15:15 What are some of the other steps and some of the other tools that might be used in structure analysis kind of coming after AlphaFold?
0:15:18 I mean, AlphaFold has really changed the game, right?
0:15:23 We had 200,000 structures, now we have hundreds of millions.
0:15:32 So now I feel like every tool that was there before, that worked well before in the pre-AlphaFold world, has to be somehow rethought.
0:15:38 Because before it had to work with, like, 10 structures, now it has to work with hundreds of millions of structures at the same time, right?
0:15:43 So I think we kind of, in our lab at least, try to get everything into this AlphaFold era.
0:15:48 You know, all of these tools that existed before, can we just make them ready for this data explosion?
0:16:02 So Foldseek for multimeric comparisons, then Folddisco for finding 3D motifs, you know, like pockets, enzymatic sites, and really, really, like, the functional core of proteins in big databases.
0:16:04 And I think there’s more and more coming.
0:16:07 You know, I think now the direction is like, how can we do motions?
0:16:13 How can we do MD, molecular dynamics, with AI? Before, MD was very, very expensive, right?
0:16:17 You need a supercomputer and a lot of time to really get that.
0:16:23 But if you can speed it up by a few orders of magnitude, then all of a sudden we have a data explosion of MD data, right?
0:16:24 So how can we organize this?
0:16:32 So somehow there’s always this, yeah, we are always close to a new data avalanche that is in front of us.
0:16:34 So we need somehow to build the tools to make sense of that.
0:16:40 Is computational biology fundamentally limited by analysis at scale?
0:16:41 Yeah, yeah.
0:16:42 I mean, Martin just alluded to it.
0:16:47 I think we’re seeing these avalanches of data and it comes in two flavors, right?
0:16:50 So first of all, you have to generate that data somehow.
0:16:54 And a lot of the data now is not being generated anymore in the lab.
0:16:56 It’s being generated computationally.
0:16:57 Martin just talked about it, right?
0:17:04 So the reason why Foldseek is so important is because now we have databases of protein structures that have been generated by AlphaFold.
0:17:09 And then we actually need faster tools to be able to sift through the data and see what is important.
0:17:13 What are the pockets that, you know, enable enzymatic activity?
0:17:18 Can we find enzymes that do something that we are interested in, you know, and solve problems that way?
0:17:21 So I do think that we are constantly being pulled in two directions.
0:17:25 One is, yeah, we have scale problems to generate the data.
0:17:28 And then we have the problem of scale to sift through.
0:17:34 And I think ultimately the solution to that are computational tools, like the ones that Martin is developing,
0:17:37 and accelerated methods, like the ones that we are co-developing, that
0:17:43 can really bring that to the efficient inference that we need to make sense of all of this data as it is growing.
0:17:52 What’s the collaboration like between you two, kind of Chris at NVIDIA and in industry, Martin at Seoul National and in academia?
0:17:54 What’s that collaboration like?
0:18:00 And then kind of more broadly, tools that you co-develop, how are they made available to the scientific community?
0:18:04 Yeah, maybe let me explain from the academic side.
0:18:09 I mean, I work mostly with academic groups here and there with companies.
0:18:12 I knew Chris from before he was working with NVIDIA.
0:18:15 We were in the same lab, actually.
0:18:18 We were not directly at the same lab at the same time,
0:18:23 but I was in the Burkhard Rost lab for my bachelor’s while Chris was doing his PhD there.
0:18:25 And so we know each other’s work.
0:18:27 Indirectly crossed paths many times.
0:18:31 And, you know, whenever I would be in the lab, I would hear about Martin.
0:18:33 I hope that Martin sometimes heard about something that I did.
0:18:37 I don’t know, that’s how we crossed paths.
0:18:40 I mean, especially like when the language, protein language model work started,
0:18:43 Chris and Michael Heinzinger really, like, took off with that.
0:18:46 And so I was already working with them.
0:18:51 So when then Chris started working at NVIDIA and said, okay, we might be able to accelerate that.
0:18:57 I thought that would be a great opportunity because, funnily, I wanted to have a GPU-accelerated HHblits.
0:19:04 HHblits is a software similar to MMseqs that was developed by Johannes Söding’s group at the Max Planck Institute in Göttingen.
0:19:07 And when I reached out to Johannes to do an internship in his lab, I actually said,
0:19:10 why don’t we make a GPU-accelerated HHblits?
0:19:13 At that time, GPUs were not that powerful yet.
0:19:15 You know, it was like over 10, 15 years ago.
0:19:18 GPU computing had just started.
0:19:20 But I thought, okay, at one point it will happen.
0:19:22 You know, at one point it must work.
0:19:24 And then Chris said, okay, we’re going to do it.
0:19:26 And I said, okay, I had this idea a long, long, long time ago.
0:19:28 Let’s try to do it again.
0:19:31 But now with NVIDIA’s power behind it, you know, really understanding the hardware very well
0:19:36 and all the details and have the compiler in hand, I think that’s like a really unique opportunity.
0:19:41 And then with Bertil Schmidt’s group at the University of Mainz, we could really pull it off.
0:19:43 So they developed more or less the algorithm.
0:19:48 Then NVIDIA was there to, like, understand what the compiler is doing and where time is lost.
0:19:52 They actually changed the compiler to make it really as fast as it is now.
0:19:58 And I think that was a really great combination of like making academia and industry work together.
0:20:00 And you mentioned about open sourcing.
0:20:04 I mean, in the beginning I said I can only do that if we open source everything.
0:20:07 It should be free, and there should be no patent involved.
0:20:11 It should just be as we always do it, like an open code repository, and everybody should
0:20:14 be able to use it from the get-go.
0:20:18 And that’s really how we have handled this well in this collaboration, which I really appreciate.
0:20:19 Yeah.
0:20:24 And then if I may just add to that, like as Martin said, I think he maybe had the idea 10 years
0:20:25 before I did.
0:20:32 But when Martin’s MMseqs2 paper, which was released in 2017, came out, I remember I was
0:20:35 actually in the U.S. at the time, I printed out the paper, I was looking at it and I was
0:20:40 thinking to myself, this must be, there must be an opportunity to accelerate this with GPUs.
0:20:46 At the time I was just in academia, I came back to Munich, I did my PhD, I started at NVIDIA.
0:20:50 And the first thing that I was on my mind since the day that I started was, I have to reach out
0:20:51 to Martin.
0:20:55 We need to start talking about whether it makes sense to GPU-accelerate MMseqs2.
0:20:59 We will find a way to make it work on the open sourcing side of things.
0:21:04 Because ultimately, our interest is, you know, making things easier for the community.
0:21:04 Right.
0:21:09 And having the community be able to consume these tools is obviously in our great interest.
0:21:15 And so I think we are, you will see it also, there are other announcements today about actually
0:21:18 the open sourcing of a generative model for protein design.
0:21:21 We’re going to, we can talk about that in a minute.
0:21:25 But our role in this space is really to work on the accelerations.
0:21:31 And, you know, we’re going to make sure that those work as efficiently as they can on our
0:21:35 GPUs, and that’s the part that we own, but then ultimately the maintenance, the accuracy
0:21:40 of the tooling, the sort of like the, you know, this, this beautiful thing that is MM6 and many
0:21:42 other tools like it that exist out there.
0:21:46 We want to collaborate with our partners to make sure that they are, you know, working as
0:21:47 they expect them to.
0:21:50 And so that’s, that’s why the collaboration with Martin happened.
0:21:54 That’s why the open sourcing of the work, and we will continue doing that.
0:21:56 At least that’s our intention.
0:22:00 Chris, has NVIDIA received feedback from partners, academic or industry,
0:22:03 on MMseqs2 or any of the related tools?
0:22:03 Yeah.
0:22:09 So, I mean, I’ll talk, and maybe, Martin, I’ll ask you about what you saw from the academic side,
0:22:12 but on the, on the industry side, the feedback was fantastic.
0:22:20 So we know from a lot of our close collaborators in the industry, many startups that actually even
0:22:25 emailed us that this unblocked them even for funding rounds, which is incredible.
0:22:31 So we’ve done the thing that I think NVIDIA is really, really good at, which is lifting
0:22:36 everybody up and helping everybody be more successful at what they want to do, whether
0:22:37 that’s in industry or in academia.
0:22:41 And again, we’ve done it in a very open and collaborative way, which is my favorite way
0:22:42 of working.
0:22:42 Yeah.
0:22:45 I mean, I have a list of startups that all have used MMseqs2-GPU.
0:22:50 As I said, some have reached out personally to say this enabled us to get more funding,
0:22:54 which obviously to me makes me very happy because it’s a great signal.
0:22:56 And then we have good, good feedback also from academia.
0:22:58 Maybe Martin, you have some insights there.
0:22:59 Yeah.
0:23:05 MMseqs2-GPU is actually more than MMseqs2 in the end, because it drives not just MMseqs2, but also
0:23:06 Foldseek.
0:23:09 And so it’s actually the same code base, and even way more tools; we call it, like, the
0:23:10 Marv universe.
0:23:11 Marv is our small logo.
0:23:16 It’s like a red character that many people might have seen in the biological community.
0:23:21 One user reached out to me and at one point said like, oh, this GPU implementation is like
0:23:22 linear.
0:23:26 You made a quadratic problem of like searching a big set against another big set linear.
0:23:27 And I was like, nah, it’s not true.
0:23:28 That cannot be true.
0:23:29 And it is also not true.
0:23:34 But what happens is this person had like a 16 core computer and then he had like a
0:23:36 gaming GPU inside.
0:23:38 And at the same time, he runs the thing on this gaming GPU.
0:23:43 And we have in our paper benchmarked against a 128 core machine, you know, like a really
0:23:44 monster server.
0:23:47 And so obviously the 16 core is much, much, much weaker.
0:23:54 And so Foldseek on the GPU is about four times faster than a 128-core machine in our benchmarks, right?
0:23:56 So that you can just imagine the scale.
0:24:00 Like, obviously it looked very linear to that user because it was extremely accelerated,
0:24:02 but it is still quadratic.
0:24:05 But it was amazing feedback, I think.
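A small sketch of why a large constant-factor speedup can feel linear to a user even though an all-vs-all search stays quadratic; the per-comparison cost and the 100x factor below are invented purely for illustration:

```python
# An all-vs-all search over n sequences performs on the order of n^2 comparisons.
# A GPU shrinks the cost of each comparison, but doubling n still roughly
# quadruples the total work; only the constant in front of n^2 changes.

def cpu_seconds(n, cost_per_comparison=1e-7):   # hypothetical CPU cost per comparison
    return n * n * cost_per_comparison

def gpu_seconds(n, speedup=100.0):              # hypothetical 100x per-comparison speedup
    return cpu_seconds(n) / speedup

for n in (10_000, 100_000, 1_000_000):
    print(f"n={n:>9,}  cpu={cpu_seconds(n):>14,.1f} s  gpu={gpu_seconds(n):>12,.1f} s")
```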
0:24:05 Yeah.
0:24:11 And so just to add to what Martin said regarding Foldseek, I mean, both Foldseek and MMseqs2
0:24:16 with MMseqs2-GPU can actually run on multi-GPU systems.
0:24:23 And this is quite exciting because maybe the limit of, you know, a really large CPU-based
0:24:29 node on a cluster is like those 128 cores that it can host.
0:24:34 With GPUs, we can easily get more than just a single GPU by just adding more GPUs to the
0:24:34 server.
0:24:41 And so we showed also in the manuscript that we are able to get better performance by sharing
0:24:43 the workload across GPUs on a single node.
0:24:48 And again, there’s a lot of opportunity there to build those servers up with different
0:24:50 configurations of different GPUs as need be.
0:24:54 For instance, maybe one GPU or a couple of GPUs can compute the MSA, a couple can compute the
0:24:56 structure prediction, and all of that runs in parallel.
0:25:02 So there’s a lot of exciting opportunity to, you know, not just make the MSA
0:25:07 computation possible at all on a single workstation, but even make it more efficient by adding
0:25:08 more GPUs to it.
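A minimal sketch of the kind of query sharding this describes, splitting a batch of queries across GPUs and merging the hits; `search_on_gpu` is a hypothetical placeholder, not an actual MMseqs2 or Foldseek API:

```python
# Illustrative only: shard queries across GPUs and run the shards concurrently.
from concurrent.futures import ThreadPoolExecutor

def search_on_gpu(queries, device_id):
    # Placeholder: in practice this would launch the GPU search for one shard,
    # pinned to the given device, and return that shard's hits.
    return [(query, f"gpu:{device_id}") for query in queries]

def sharded_search(queries, n_gpus):
    # One shard per GPU, processed in parallel, results flattened back together.
    shards = [queries[i::n_gpus] for i in range(n_gpus)]
    with ThreadPoolExecutor(max_workers=n_gpus) as pool:
        per_shard = pool.map(search_on_gpu, shards, range(n_gpus))
    return [hit for hits in per_shard for hit in hits]

if __name__ == "__main__":
    demo_queries = [f"protein_{i}" for i in range(8)]
    for hit in sharded_search(demo_queries, n_gpus=2):
        print(hit)
```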
0:25:12 So as we start to wrap up here, let’s shift and kind of look ahead a little bit.
0:25:16 Chris, I think earlier you mentioned releasing code for a model.
0:25:22 Can you talk a little bit about what that model is and kind of how you think about the frontier
0:25:25 when it comes to biology research using ML?
0:25:29 Yeah, we released code for a model which we call La Proteina.
0:25:35 This has been a fantastic collaboration with many different research teams at NVIDIA driven
0:25:39 by Carsten, Kreiss and Kieran and Tomas.
0:25:45 And the idea here is if you think about the Nobel Prize in chemistry last year, it was won by
0:25:47 protein structure prediction and protein design.
0:25:49 Both sides of the coin, if you want.
0:25:50 Really important.
0:25:53 We’ve talked about protein structure prediction and acceleration today.
0:25:55 MMseqs2 falls into that.
0:25:56 cuEquivariance falls into that.
0:25:57 NIMs fall into that.
0:26:02 We’re also doing work on protein design, not because we want to become protein designers
0:26:07 at NVIDIA and sort of like, you know, because we want to become a drug discovery company.
0:26:08 Not at all.
0:26:12 But we need to understand the problem to be able to accelerate it and to scale it.
0:26:18 And so the protein design work that was done by this amazing collaboration at NVIDIA has
0:26:24 really focused on scaling generative models for protein design, you know, across GPUs and
0:26:25 across infrastructures.
0:26:30 And with that scale, we’re actually gaining performance when it comes to accuracy, when
0:26:35 it comes to the metrics that we usually use to see whether these models are actually good,
0:26:36 at least computationally.
0:26:38 We haven’t done the wet lab experiments yet.
0:26:43 And so, you know, we obviously, as I said, we are interested in making this available to
0:26:44 as many people as we can.
0:26:50 And so we’ve released this open source, completely open in the sense that both academic as well
0:26:53 as industry partners can take this, modify it, use it.
0:26:58 It’s more interesting for us that they do so and they teach us what works for them and
0:27:03 what doesn’t and how we can, you know, work on the next iterations of these models.
0:27:07 So, yeah, I think the direction that we’re going is really protein design.
0:27:12 We want to be able to not just predict what proteins do, but also tweak them and make them
0:27:13 do something that we’re interested in.
0:27:17 And that comes with the problem of scale again on both fronts.
0:27:21 So we need actually data to be able to train these things at scale.
0:27:25 So Proteina specifically was trained on the AlphaFold database, which is, again, all of
0:27:28 these protein structures that have been generated with AlphaFold 2.
0:27:33 And then it uses a transformer architecture that is scaled up compared to maybe what was available
0:27:34 prior.
0:27:40 And so we really need both of the scaling things to work together in order for us to, you know,
0:27:41 make progress in the field.
0:27:43 At least that’s my perspective on it.
0:27:49 You mentioned, Chris, kind of a focus on protein design, but to flip things back for a moment
0:27:51 to structure, where do you see things headed?
0:27:56 Where do you see AI-driven protein structure prediction headed over, let’s say, the next
0:27:57 three to five years?
0:28:00 I feel like monomers, we have very much under control.
0:28:03 I mean, we do very well in structure prediction of monomers.
0:28:07 So there are just very few gains that you can actually see in these computational methods.
0:28:13 We have mentioned multimers multiple times, and multimer prediction is also really,
0:28:18 significantly better than what we had before, but we’re still not on the same level as we
0:28:18 were with monomers.
0:28:24 So there’s still a lot to gain if we could actually close the gap there as well and confidently
0:28:29 predict protein-protein interactions and the respective structural model that comes out
0:28:29 of it.
0:28:31 Because then you can start reasoning up, right?
0:28:36 You can find pairwise interactions and then you can find triplets and so on, and can build
0:28:38 up, at the end, the whole machine, right?
0:28:42 And that can be composed of tens or even of hundreds of these units in the end.
0:28:46 And I think there in the end, like these interactions, this is really where function is coming from
0:28:51 in my perspective, you know, where really biology is happening and where you really want to go
0:28:55 for like dragging things and see where you can actually attack things for human health, as
0:28:58 well as just understanding it from like an environmental perspective.
0:29:02 And I think that is one direction, you know, then the next one is like, can we now take
0:29:02 these machines?
0:29:04 Can we put them into cellular context?
0:29:06 Can we put lipid layers around it?
0:29:09 And then all of a sudden we have a cell wall and we have the machine inside, right?
0:29:11 And now we know what interacts with it.
0:29:15 And can we then go and can you see the dynamics of these structures, right?
0:29:18 Can we actually see how they function, how they work?
0:29:22 And now we’re getting closer and closer and closer, step by step to this like cellular
0:29:25 system and in the end, hopefully getting to all of it.
0:29:30 But that comes really with scale issues, you know, even for this, having these nice things
0:29:34 like the pairwise protein-protein interactions, we need faster ways how we can actually find
0:29:36 these interactions, how we can build them up, right?
0:29:41 Everything I just said sounds really nice, but technically really, really challenging because
0:29:43 everything is like a combinatorial problem, right?
0:29:45 And it easily explodes.
0:29:51 Yeah, and if I may add, I mean, Martin already said the protein interaction topic is super
0:29:55 interesting and the fact that you can use AlphaFold, it was a discovery, right?
0:30:00 AlphaFold was developed to predict monomers, single proteins, and then people realized, hey,
0:30:03 it actually folds protein interactions really well.
0:30:09 But while that is true, it will fold every protein interaction, even those that may not actually
0:30:10 interact in real life.
0:30:15 So we have the first problem that is really a scientific one about what actually does
0:30:16 interact.
0:30:20 And that’s quite an interesting question because AlphaFold is always going to give you
0:30:20 an answer, right?
0:30:24 So we basically have to figure out, are those like real interactors?
0:30:26 Do they physically interact?
0:30:28 Are they in the same location in the cell?
0:30:30 Super interesting research stuff.
0:30:35 But then if you think about folding and the evolution of it, obviously with newer models like
0:30:40 AlphaFold 3 and bolts and sort of like this new generation of models, what they’ve added
0:30:43 is the capacity to fold more molecules.
0:30:44 It’s not just proteins.
0:30:50 It’s proteins and DNA, proteins and RNA, proteins and small molecules, modified proteins.
0:30:56 So it’s giving all of that diversity of life that we need to understand in the cellular context,
0:30:57 as Martin was saying.
0:31:02 And then you even more so add scale to the problem because now it’s not just the protein
0:31:06 and the other protein that interact, but it may be those two proteins and some drug that
0:31:07 interacts with them, right?
0:31:12 And so you start getting all of these combinatorial effects that just, you know, blow up the
0:31:14 problem, the search space immensely.
0:31:19 And as Martin mentioned, we need efficient ways of, you know, proving that before we even
0:31:20 go into the prediction.
0:31:24 And then when we have the prediction, we need to be able to do it very quickly and accurately
0:31:28 so that we can, you know, that we get the right answers to our questions.
0:31:32 So much happening and so much on the cusp of happening.
0:31:36 What are the best ways for listeners to stay involved?
0:31:41 Chris on the NVIDIA side and Martin at Soul National and the other work that you’re doing,
0:31:47 where can listeners go, maybe online, social media, published papers, of course, to stay
0:31:48 abreast of anything?
0:31:49 Chris, I’ll start with you.
0:31:50 Sure.
0:32:00 So I think we’re going to start doing more informal releases and updates about our work, but definitely
0:32:01 there are good places to look.
0:32:06 We have different tiers of what we develop in terms of digital biology outputs.
0:32:08 There are product outputs.
0:32:12 And so if you look up digital biology at NVIDIA, you will find Clara Discovery.
0:32:16 There we have tools like BioNemo and the NVIDIA Inference Microservices.
0:32:22 So that’s definitely a good place for the enterprise to stay updated about what’s coming out.
0:32:27 We do have a GitHub organization, which is called NVIDIA Digital Biology.
0:32:29 That’s where we release our research outputs.
0:32:32 For instance, La Proteina is there.
0:32:33 Proteina is there.
0:32:38 Future versions of it will hopefully also be there.
0:32:40 So that’s for the more research side of things.
0:32:44 And then, yeah, we’re probably going to create some new opportunities for people to engage with
0:32:51 our outputs, whether it’s particular guides or blog posts, so that they can see what work
0:32:53 is coming out from digital biology at NVIDIA.
0:32:54 Fantastic.
0:32:56 And Martin, for folks who want to follow your work?
0:32:57 Yeah.
0:33:02 Similar to NVIDIA, we have an organization on GitHub where we are pretty much open source.
0:33:06 We put code out before we have even papers out.
0:33:10 So if you follow the organization, you can kind of always see what happens in the future.
0:33:11 I think that’s really useful.
0:33:14 Obviously, we are active on Blue Sky and LinkedIn.
0:33:19 And so you can just follow us or the collaborators or the people that are on the paper that really
0:33:21 do a lot of work.
0:33:26 And yeah, we try to constantly communicate the new things that we are working on or things
0:33:29 we find exciting and share papers and code and so on.
0:33:30 Yeah.
0:33:30 Fantastic.
0:33:37 Martin, Chris, again, congratulations on the acceptance and publication of MMseqs2-GPU and everything
0:33:42 else you’ve been talking about and really all the best continued in the work you’re doing
0:33:44 and hope to catch up on your progress down the line.
0:33:45 Thank you.
0:33:46 Thanks for having us.
0:34:37 Thank you.

Listen as two leading researchers at the cutting edge of computational biology explore breakthrough GPU accelerations that are changing how we understand life’s molecular machinery.

Chris Dallago, Research Lead at NVIDIA and Visiting Professor at Duke University, and Martin Steinegger, Associate Professor at Seoul National University and co-author of the Nobel Prize-winning AlphaFold paper, join the podcast to discuss homology retrieval, protein design, and how MMseqs2-GPU inverted the traditional 80/20 compute bottleneck in protein structure prediction, enabling faster drug discovery and biological research. 

Learn more at ai-podcast.nvidia.com.
