Summary and Insights
0:00:16 Hello, and welcome to the NVIDIA AI podcast. I’m your host, Noah Kravitz. Ian Buck is here
0:00:21 with us today. Ian is vice president of hyperscale and high-performance computing here at NVIDIA,
0:00:26 and he’s here to discuss mixture of experts, the architecture powering the world’s leading
0:00:31 frontier models, and how extreme co-design can both drive down the cost of generating
0:00:37 intelligence today and future-proof your AI platform for whatever advances come tomorrow.
0:00:41 Ian, welcome. Thanks so much for taking the time to join the podcast.
0:00:42 Thanks, Noah. Glad to be here.
0:00:48 So let’s jump right into it. What is mixture of experts, MoE as we call it? Why does it matter?
0:00:53 If you look at the top 10 open models on the Artificial Analysis leaderboard right now,
0:01:00 they all share the MoE architecture. So can you explain in lay terms what MoE is and why
0:01:05 it’s suddenly become the standard for frontier AI? Yeah, it’s a great question. It’s a term
0:01:12 that gets used in industry and among AI researchers, but it’s not really understood: what does
0:01:17 mixture of experts actually mean? Yeah. We’ve all heard of neural networks, and that’s what these
0:01:23 models are: neurons, parameters, the components of an AI model.
0:01:30 And, you know, when AI got started and really entered the zeitgeist, each parameter of the
0:01:37 neural network simply represented a neuron of the model. And we heard about 1 billion parameter
0:01:42 models, then 10 billion, 100 billion, now trillion parameter models. Those are basically the neurons of
0:01:49 the AI brain that you activate when you ask ChatGPT a question. But something happened along the
0:01:54 way. As these models got smarter and smarter, they naturally got bigger and bigger.
0:02:00 In fact, two years ago when Llama first came on the scene, there was a 7B Llama,
0:02:08 then a 70B Llama, and now we have a 405B, 405 billion parameter, model. And that makes
0:02:12 them smarter. They have more information, they understand more things, and they give you better
0:02:18 answers. But there was a problem. As they got smarter and smarter and smarter, to get the answer,
0:02:24 you actually had to ask and activate every neuron in that brain. So as a result, while the models are
0:02:29 getting more and more intelligent, they’re also getting slower and slower because you had to ask
0:02:35 every neuron and calculate every neuron and perform all the math on every neuron on a GPU. And then it
0:02:41 wasn’t one GPU, it was lots of GPUs, and even more. Along the way, researchers came up with this idea,
0:02:46 and they realized, just like a human brain, you know, we probably don’t need all of these neurons to ask
0:02:50 every question. Simple questions, probably just a few neurons, or different parts of the brain may
0:02:57 encode different information. Let’s just activate those. So to make the AI cheaper, to make the tokens
0:03:01 cheaper, the token being the piece of data flying through the model that eventually becomes a word on the
0:03:06 screen, let’s only activate the neurons we need to activate. And that’s what
0:03:12 mixture of experts is. Instead of having one big model, we actually split the model up into smaller
0:03:19 experts. Same number of total parameters, but now we train the model to only ask the
0:03:23 experts that probably know that information along the way. And that’s part of the training process to
0:03:28 build that model. Once you do that, you can have a model which has maybe 100 billion parameters,
0:03:33 100 billion neurons, but we only ask or activate about 10 billion. That’s a compression mechanism.
0:03:40 That’s a way of making AI cheaper, but still being able to encode all the possible information and answer
0:03:45 all the questions. So most models today are achieving higher and higher intelligence scores
0:03:53 by taking advantage of having lots of experts, and having the model, as it comes
0:03:57 up to the answer, ask only the right experts in order to get the right answers.
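To make the routing idea concrete, here is a toy sketch of a single MoE layer in Python. All of the names (`moe_layer`, `router_w`, `experts_w`) are illustrative, not from any real model, and real systems use learned routers, load balancing, and many stacked layers; this only shows the core mechanic of scoring all experts but running just the top-k:

```python
import numpy as np

def moe_layer(x, router_w, experts_w, top_k=2):
    """Toy mixture-of-experts layer: score all experts with the router,
    but run only the top_k of them and blend their outputs."""
    logits = x @ router_w                     # one routing score per expert
    chosen = np.argsort(logits)[-top_k:]      # indices of the top-k experts
    weights = np.exp(logits[chosen] - logits[chosen].max())
    weights /= weights.sum()                  # softmax over the chosen experts
    # Only top_k expert weight matrices are touched; the rest stay idle.
    return sum(w * (x @ experts_w[i]) for w, i in zip(weights, chosen))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
x = rng.standard_normal(d)                   # one token's hidden state
router_w = rng.standard_normal((d, n_experts))
experts_w = rng.standard_normal((n_experts, d, d))

y = moe_layer(x, router_w, experts_w, top_k=2)
print(y.shape)  # (16,): same output size, but only 2 of the 8 experts ran
```

The compute saving is exactly what Ian describes: with 8 experts and top-2 routing, roughly a quarter of the expert parameters do math for any given token.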
0:04:04 To put some numbers behind it, we have that Llama 405B, 405 billion parameters. That’s one big model.
0:04:09 On leaderboards like Artificial Analysis, which you mentioned, it gets an intelligence score of about
0:04:15 28. 28 is just a weighted score of the benchmarks they tested. But all 405 billion parameters are
0:04:23 being activated. Now, fast forward to a modern open model like OpenAI’s GPT-OSS. It has
0:04:28 120 billion parameters, actually a little smaller in total parameters. But when you ask it
0:04:35 a question, it only activates on the order of about 5 billion parameters. So instead of 405 billion
0:04:39 parameters and all that math and all that cost, it actually only needs to activate about 5 billion
0:04:46 parameters. That’s like a 10 to 1 or beyond compression, making it cheaper. And then it gets
0:04:55 an intelligence score of 61. So it is going from 28 to 61 while going from 405 billion active parameters to 5 billion
0:05:01 active parameters. Way cheaper. It’s not 10x cheaper, it’s still complicated, and we can talk about why these
0:05:07 MoEs are complicated to run. But Artificial Analysis does measure the cost to run the benchmark, so
0:05:13 the cost to calculate the intelligence score. For Llama 405B, I think it currently costs about
0:05:19 $200 for them to query it through a cloud service and get all the answers to create that score.
0:05:26 They asked GPT-OSS the same thing. Its tokens are cheaper, and it only cost about 75 bucks. So MoEs are
0:05:33 allowing models to get bigger and smarter, and allowing tokens to get cheaper. And as a result,
0:05:38 they’re advancing AI. Now, of course, across the board, all the leaderboards, they’re all these mixture of
0:05:43 expert models. Right. Correct me, bring you back on track if I get off here with the questions.
0:05:49 But from kind of a layperson’s standpoint, to use that word, if I’m trying to wrap my head around
0:05:55 this idea of mixture of experts, are the experts divided up in ways that I might think about knowledge?
0:06:02 You know, this expert handles math, and this one handles science, and this one handles, I don’t know,
0:06:08 visual understanding. Yeah, it’s a great question. You know, that is the art of training these things.
0:06:14 In fact, it’s not hard-coded in there. Right. They don’t train a separate model for
0:06:20 math questions and a separate model for telling you how to make a pizza. The beauty of
0:06:25 AI is that the algorithms that these researchers and scientists and companies like Anthropic and OpenAI
0:06:30 and everybody else have figured out is that they can just give it the data, and they encourage the model
0:06:36 to identify and create these little pockets of knowledge. It’s not prescriptive,
0:06:41 it’s just the data that it’s seeing. It naturally clumps the activity of these different
0:06:46 questions to different experts. And then in front of those experts, there’s this thing called
0:06:51 the router. The router is able to look at the stream of tokens, what the question is,
0:06:55 what answer is forming, how the model is thinking, and then predict,
0:07:00 you know what, this one probably goes to that guy or this other guy. In fact, today’s models
0:07:06 may have on the order of dozens of experts on every layer, with a little
0:07:10 router in between, and they may actually ask not just one expert but, at every layer,
0:07:16 two experts or eight experts. And then there’s another unit, which listens to all the
0:07:21 experts. This guy says, I’m pretty sure I got the right answer. Maybe I got the right answer. I don’t
0:07:24 know. I don’t know. I don’t know. It combines the answers and then goes to the next layer. So that’s
0:07:28 actually the architecture of it. You know, it’s kind of like, you could train
0:07:35 one person, one brilliant scientist, train an Einstein, to be able to answer any question.
0:07:40 That’s really hard. It takes a lot of energy. That’s a very expensive person to hire and have
0:07:45 on staff. Instead, maybe I can hire a couple of domain experts, or teach a couple of different
0:07:49 people some things, and I just give them all the question. They can all answer it
0:07:53 very quickly in parallel, and you combine their knowledge. And that’s actually how we
0:07:59 work today. One person is not a company. Companies exist because we have all
0:08:05 this expertise around, and the MoE method is basically applying that to AI. So the models are all trained
0:08:09 that way. There are all sorts of training methods to create the conditions where
0:08:14 information and activations can start grouping together, and you can train these little
0:08:20 routers and combiners. And then you just do that in multiple layers. And sure enough,
0:08:28 at the end of it, you’ve got a chat model like GPT-OSS or Kimi K2. Yeah. Now, MoE isn’t
0:08:35 new to 2025. The idea of the architecture has been around for a few years. So was it being used? Has it
0:08:39 been, you know, being used all along and we just weren’t so aware of it? And then why has it kind of
0:08:44 come to prominence lately? Yeah. The idea of experts is not new in machine learning.
0:08:49 You know, before AI, there was the idea of combining multiple machine learning
0:08:55 models together, and how to do that statistically to improve the accuracy. There’s all sorts
0:09:00 of history and math around that. Yeah. Applying it to AI, though, is relatively new. You know,
0:09:05 the early versions of, we now know, ChatGPT were mixture of experts
0:09:12 models, but that was not publicly known. Okay. Okay. It really wasn’t until the DeepSeek
0:09:17 moment, which was about a year ago, that it really blew the doors open. Because
0:09:24 those DeepSeek researchers were the first to really build a world-class MoE-based model. People
0:09:30 had written papers about it, but it was the first one that actually competed with, and demonstrated the intelligence
0:09:36 scores of, the closed source models. And it was a beast. It was awesome. It had
0:09:42 256 experts in every layer. I mean, it did every single optimization. And as a result, it was
0:09:50 extremely cheap to run. Incredibly complicated, but cheap to run, because it took MoE all
0:09:54 the way to the extreme. And many people think that’s kind of where OpenAI already was, you know, with the
0:10:00 original GPT. So once we had that moment, you know, the first time DeepSeek was run
0:10:05 even on GPU systems, it actually didn’t run that well, because we didn’t have the infrastructure or
0:10:09 even the software to run it that well. The DeepSeek engineers had written all this custom code to
0:10:14 make it run awesome. But at that point, every model maker, every researcher realized, hey, this thing’s
0:10:18 real. We can now see how they did it. The whole thing was open; they published the paper. It’s a
0:10:25 brilliant paper, and it shows the opportunity for MoE. And since that moment, you can see that every
0:10:31 model has now shifted to MoE. DeepSeek sort of shone a light on how to do it, how to train
0:10:37 it, how to do inference and deploy it, and kicked off the revolution of MoEs that
0:10:43 we’ve been enjoying. Right. So we know the DeepSeek moment was huge, as you just said,
0:10:49 for many reasons. Are we going to look back and say, hey, the lights went on
0:10:56 then and, you know, new things will come, but for the moment, is everything MoE? And if not,
0:11:01 why? What’s kind of the decision-making process? When would you train a
0:11:07 model to be MoE and when would you not? You know, I think for all the models that really are
0:11:13 focused on providing an intelligent response, it makes a lot of sense why they’re MoE. Yeah.
0:11:19 You want to do your best to encode as much knowledge into the neural network as possible, so it just
0:11:25 knows things. You don’t need to work out on pencil and paper that
0:11:30 two plus two is four. You just know two plus two is four. So the more neurons you can throw into a holistic
0:11:36 model, the more innate knowledge it has. It doesn’t have to work things out in a reasoning chain
0:11:41 or other such things. So there’s a huge advantage to having models be bigger, as long as we don’t
0:11:48 increase the cost. And that’s why, with MoEs, we want to push the limits of only activating
0:11:54 10%, 5%, 3% of the neurons, with more and more experts. And you can see that in the research and the
0:11:59 way the models are evolving. They’re really pushing the limits. Some of the modern models,
0:12:05 you know, they’ll have 300, 400 experts. Now, getting all those experts combined and
0:12:10 all that communication is complicated; we’ll talk about that. But
0:12:17 having a foundation model with all of those experts allows them to then apply all the other
0:12:23 techniques of inference, of reasoning. It allows smaller models to be distilled and
0:12:28 fine-tuned for specific tasks. It creates a foundation for the rest of the
0:12:32 AI models around the world. Certainly some of the smallest models, for the more
0:12:37 dedicated individual use cases, you know, I’ve got to put a box around a stop sign, or
0:12:43 a ring doorbell that uses AI to detect if it’s a squirrel or not a squirrel,
0:12:47 those small models need to do one specific thing. Probably I can
0:12:51 squeeze it down; I don’t need the complexity of an expert system.
0:12:57 But anything that wants to be agentic, any kind of agent, and pretty much most of the AIs that we
0:13:02 interact with purposefully, they’re all MoEs, because anything can be thrown at them
0:13:06 and they need to be able to reason about a wide variety of different stuff. And it makes AI
0:13:13 cheaper. Yeah. It lowers the cost per token. There’s always that continuous drive:
0:13:17 let’s increase intelligence and let’s lower cost. We can do both with MoEs.
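The cost-versus-intelligence comparison from earlier in the conversation can be put in one small calculation. The figures below are the approximate Artificial Analysis numbers quoted in the episode, treated as round numbers rather than authoritative data:

```python
# Approximate figures quoted in the episode: intelligence score and the cost
# (USD) to run the full Artificial Analysis benchmark suite against the model.
models = {
    "Llama 405B (dense)": {"score": 28, "benchmark_cost_usd": 200},
    "GPT-OSS 120B (MoE)": {"score": 61, "benchmark_cost_usd": 75},
}

for name, m in models.items():
    points_per_dollar = m["score"] / m["benchmark_cost_usd"]
    print(f"{name}: {points_per_dollar:.2f} intelligence points per dollar")
```

On these numbers the MoE model delivers roughly six times the intelligence per dollar of the dense one, which is the "tokenomics" argument in miniature.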
0:13:21 I was going to ask you about that, because it seems like there’s this focus happening
0:13:26 now. You know, generative AI has progressed far enough, and certainly it’s everywhere,
0:13:32 including, uh, the news, the business section, if you will. And there’s a shift kind of
0:13:38 from, you know, the biggest models, raw speed, the highest scores, to, as you said,
0:13:44 how much does this cost, and can we get it to be cheaper while being just as smart, if not more
0:13:49 intelligent? So we’re calling it tokenomics, right? Not in the sense of blockchain or crypto
0:13:54 tokens, but, as you mentioned, AI systems generating tokens: reasoning tokens,
0:14:01 output tokens, what have you. So if we’re focused on bringing the cost down, how does a more complex
0:14:06 system, and I’m kind of inferring here a little bit, but I would imagine it’s more expensive to
0:14:11 architect and to train, perhaps not to run, but in total cost. How does a more expensive, kind of premium
0:14:16 system actually drive the total cost down? Yeah. There’s a wonderful symbiotic relationship
0:14:23 that happens in the market between the AI hardware and the models that are being created
0:14:28 to serve AI. They inherently have to make sense together. You know, if the hardware
0:14:33 offers a certain level of connectivity, a certain GPU performance, a certain memory size, obviously
0:14:38 building an AI model that’s even bigger is going to be hard to take to market, or even not possible to
0:14:44 efficiently train. So, you know, from the original Kepler GPUs that were used for
0:14:51 those first cat-recognizing AIs to today’s modern GB200 and GB300 NVL72, you can see a pattern where,
0:14:56 with every new platform, we advance the state of the art of what
0:15:02 NVIDIA is able to offer. The compute performance, the memory performance, the connectivity, the IO,
0:15:08 we’ll talk about NVLink, those things enable the next wave: to train the next model,
0:15:15 but also to do inference. And they add complexity. You know, when we started, we were
0:15:21 doing PCIe cards, basically graphics cards that plugged into the server equivalent of your
0:15:27 PC and used the floating point calculations and the graphics memory to do that computation. And they
0:15:34 were great. When the AI revolution took off, we saw that by adding more floating point calculations, building
0:15:40 a bigger GPU, adding things like HBM memory, increasing the power beyond what a
0:15:49 typical PCIe slot can do, we would often increase what was possible in AI, not by just the
0:15:55 percentage of more flops or memory bandwidth, but by X factors. And that’s really because the AI models
0:15:59 that we were able to build were bigger, smarter, could run more efficiently, and could do more things.
0:16:07 You know, people talk about TCO as the cost. But TCO in itself is not the goal. If all you want is
0:16:13 the lowest cost, you know, buy one GPU. Sure. The goal is
0:16:18 actually to improve intelligence, and intelligence per dollar, the cost of that
0:16:23 intelligence. Or, if we’re at the same level of intelligence, say that 60 score from
0:16:28 Artificial Analysis, are we reducing the cost of that intelligence over time?
0:16:33 The tokens that people need to buy are the cost of running it. That’s really the goal in
0:16:39 every generation of NVIDIA architecture. We’re looking to figure out what technologies we can
0:16:46 incorporate, expand, double down on, invest in, or pull from the community and our
0:16:51 partners in order to deliver X factors of performance improvement, where even the existing models,
0:16:57 like the current MoEs, could get an X factor of performance improvement.
0:17:02 You know, we’re not afraid to add more cost and more technology on a per-GPU basis.
0:17:07 HBM memory is a lot more expensive than old school graphics memory, but
0:17:14 it only increases the cost by percentages, and because you now have HBM, and because you
0:17:19 have the bandwidth it offers and can connect it to that much floating point, you can deliver an X factor
0:17:24 in total end-to-end performance. Yeah. Yeah. And we saw that actually, you know, when DeepSeek R1
0:17:29 came out, you know, the, the GPU at the time was the Hopper H200 system. Mm-hmm.
0:17:35 Hopper had eight GPUs in a server. They were all connected with NVLink through an NVLink switch.
0:17:40 So we could effectively build one giant GPU of eight GPUs working as one. Right.
0:17:45 That was really important. The model is so large, it couldn’t really fit on a single GPU. It had to use
0:17:51 multi-GPU, and the researchers that built DeepSeek took great advantage of that. It also had
0:17:58 that NVLink capability, so we could actually put every expert on a different GPU.
0:18:02 You could parallelize the work. You could run things even more efficiently, even faster.
0:18:07 And as those experts all had to talk to each other, they would do that over NVLink.
0:18:14 That was very important. Before we had NVLink, you know, you would have to send things over a PCIe bus,
0:18:18 and only one could talk at a time, and it was much slower. Because we have NVLink, every GPU can
0:18:23 talk to every other GPU at full speed. It’s totally unblocked,
0:18:30 literally gigabytes and terabytes a second of bandwidth, without any concern for collision.
0:18:35 It was critical for those DeepSeek engineers to get good performance. Now, obviously,
0:18:41 this also happened at a time which, we can now say, was when we were in the heart of building
0:18:47 what is now the GB200 NVL72, where we scaled up the number of GPUs we can connect
0:18:53 from just eight GPUs in a server to 72 GPUs in an entire rack, a 9x multiple.
0:18:54 Yeah.
0:19:01 Now that’s a lot more GPUs. So did the cost go up? Certainly, obviously, that many
0:19:04 GPUs, an entire rack of GPUs versus a server, is a lot more money.
0:19:05 Sure.
0:19:09 In fact, we actually even had to add more technology, because we needed to take all those
0:19:14 NVSwitches and build a separate NVSwitch plane, which does cost more. But because
0:19:19 we did that, we could actually parallelize and improve the performance of DeepSeek R1 even more.
0:19:24 We could take all those experts, and instead of having to try to make it all fit and work within
0:19:30 the eight GPUs, we could actually get all 72 GPUs working as one. And the improvement
0:19:36 of just going generation to generation, being able to further parallelize and run all those
0:19:42 experts across it, increased the performance so much that we actually got a 15x
0:19:49 improvement on running DeepSeek R1 versus only adding, you know, percents, about 50% more
0:19:52 total cost on a per-GPU basis.
0:19:53 Wow. Okay.
0:19:57 That actually generated a 10x reduction in the cost per token.
0:19:58 Right, right, right, right, right.
0:20:02 So we do have to add more technology, and we want to keep adding more technology.
0:20:09 NVIDIA is a technology company, but we turn that technology back into performance, which
0:20:14 in the net of it reduces the cost per token, because those 72 GPUs are that much faster. And
0:20:18 as a result, people can get more out of that rack, more on a per-GPU
0:20:25 basis. And we’ve taken it down: on Hopper, it cost about $1 to get a million
0:20:26 tokens. Okay.
0:20:31 Roughly a million words. It’s now down to about $0.10. So people look at the rack and they say
0:20:32 it’s expensive. Right.
0:20:36 But the way you do that is you’ve put all that investment into NVLink,
0:20:41 into all the connectivity and all the next generation software. And you also do all
0:20:46 that software work to make it all work really well. And generation over generation, you get
0:20:51 that multiple, that 10x multiple, in the reduction in cost. That’s just one model. That same story
0:20:56 is playing out for us in everything else. And those are models that were built and trained
0:20:58 and designed for Hopper. Right.
0:21:02 You know, we’re starting to see some models come out that
0:21:06 are trained on Blackwell, and you’re going to see that raise the bar and go
0:21:11 even further. So this is the virtuous cycle that we’ve been working so fervently to
0:21:17 help make happen. We might add percents in terms of cost and complexity
0:21:23 on a per-GPU basis, but we aim at every generation to deliver X factors of performance.
0:21:29 And as a result, we dramatically lower the cost per token, by 10x.
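The arithmetic behind that claim is worth making explicit. Taking the episode's rough figures, about 50% more total cost per GPU and a 15x throughput improvement on the DeepSeek R1 workload, the cost per token falls by an order of magnitude:

```python
# Episode's approximate per-GPU figures for the Hopper -> GB200 NVL72 step.
cost_multiple = 1.5         # ~50% more total cost per GPU
throughput_multiple = 15.0  # ~15x more tokens per second on DeepSeek R1

# Cost per token scales as (cost per unit time) / (tokens per unit time).
cost_per_token_multiple = cost_multiple / throughput_multiple
print(f"cost per token: {cost_per_token_multiple:.2f}x of the previous generation")
print(f"i.e. a {throughput_multiple / cost_multiple:.0f}x reduction")
```

That 10x lines up with the drop Ian cites from roughly $1 to $0.10 per million tokens.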
0:21:35 As I’m listening to you describe, you know, NVLink and the advances in getting the experts,
0:21:40 getting the GPUs, to communicate and kind of act as one, I can’t help but think, we need NVLink
0:21:45 for, like, Teams meetings, so we can get everybody, instead of talking over each
0:21:48 other, to communicate as one at the speed of light.
0:21:49 That’s right.
0:21:53 Speaking with Ian Buck. Ian is vice president of hyperscale and high performance computing
0:21:58 at NVIDIA, and we’re discussing mixture of experts and why it’s become the architecture,
0:22:03 well, as it has been for a while, but is now getting public prominence, if you will, the architecture
0:22:08 behind so many leading frontier models, and what goes into not only architecting and training
0:22:14 the models, but the infrastructure that really makes them hum. And Ian, I wanted to ask you,
0:22:18 you talked about this a little bit, as I said, with, you know, NVLink and all
0:22:23 of the technologies you kind of alluded to as you were describing the MoE architecture.
0:22:30 But what is it specifically about these NVIDIA systems that makes them such a good and such
0:22:35 a unique fit for these complex MoE models, and able to achieve, as you just described, this
0:22:38 lowering cost of intelligence measured per token?
0:22:44 Yeah. It’s interesting, and it goes back to the original idea of having
0:22:45 experts. Okay.
0:22:50 We’re reducing the cost per token by not turning on every neuron, but only turning on the ones we
0:22:57 need. It’s a cost savings. And we talked about Llama, the 405 billion parameter Llama model:
0:23:01 in order to use it, you’ve got to activate all 405 billion of those neurons,
0:23:03 even though they’re not all needed. Right.
0:23:09 Look at GPT-OSS. It’s 120 billion parameters, still a lot, but you only need about 5 billion
0:23:15 parameters. So it is smart, and as a cost saving measure, it only does five.
0:23:24 You’ll notice, though, that’s more than a 10x reduction in the number
0:23:26 of neurons we’re actually doing math on.
0:23:27 Yep.
0:23:31 The cost, unfortunately, on GPT-OSS doesn’t drop by that same fraction.
0:23:32 Right.
0:23:38 You know, it is an X factor cheaper, about 3x less cost, but it’s not proportionally less cost.
0:23:39 Sure, yeah.
0:23:45 There’s a hidden tax to MoE, and it’s all about how those experts
0:23:51 need to communicate with each other. In order to get MoEs to run efficiently, those experts
0:23:56 are all doing their math very, very fast, and they all need to communicate with each other
0:24:01 very, very quickly. And one of the challenges with MoEs, as we get sparser
0:24:07 and sparser, which makes the models more and more valuable, and we’re
0:24:12 saving more and more cost, is: can we make sure that all that math is happening, and
0:24:17 all those experts can talk to each other, without ever going idle, without ever waiting
0:24:22 for a message? You’re buying those GPUs, you’re paying for them, so
0:24:24 they can do the math they need to do.
0:24:25 Right.
0:24:29 Not to sit around and wait for someone else to send them something. Or worse, the network
0:24:34 that connects all these GPUs gets gummed up, and now everybody’s sitting idle, and that’s
0:24:36 going to go straight to the bottom line of the cost.
0:24:37 Yeah, yeah.
0:24:43 So that’s the key part: the hidden cost of MoE is communication. We’ve looked at, you
0:24:48 know, can we make it work with just point-to-point? Like, maybe I can just connect this
0:24:53 GPU with this GPU, and this GPU with that GPU. It’d be a much lower cost to just directly
0:24:57 wire them up. But there’s a limit to how much I can do that. If I take one GPU and connect
0:25:04 it to four, well, this GPU’s IO is now split four ways, and I can only do that so far.
0:25:10 And even with our Hopper systems, we had eight GPUs and there was an NVSwitch chip. We built another
0:25:15 chip specifically for this, but we couldn’t scale beyond that eight, because that’s the chip.
0:25:15 Yeah.
0:25:20 So, if you have a point-to-point or a torus-like network, you’re fundamentally limited in how
0:25:27 much MoE you can do, how cheap you can make those tokens, because the hidden cost of MoE is communication.
0:25:32 And if you try to go bigger than what a neighboring or point-to-point
0:25:38 connection, or some kind of loop or message passing scheme, or a fabric like Ethernet can do, they
0:25:45 weren’t designed for this. The best answer is no compromises. I want this expert, this GPU,
0:25:51 to be able to talk to every other expert at full speed, no limitations, no worry about congestion.
0:25:55 I need a network, I want to connect these things, so that there’s nothing blocking.
0:25:56 Yeah.
0:26:01 And that’s what NVLink is. In fact, that chip we built is specifically designed to
0:26:07 make sure that every GPU, with all of its terabytes a second of bandwidth, can talk to
0:26:12 every other GPU at full speed, and never compromise on the maximum IO bandwidth we can get out of
0:26:13 every GPU.
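A back-of-the-envelope model of why point-to-point wiring stops scaling, sketched under simplifying assumptions. The per-GPU IO number is purely illustrative, not an exact NVLink spec, and a real fabric has many more constraints than this captures:

```python
def p2p_pair_bandwidth(gpu_io_gbps, n_direct_links):
    """Direct wiring: one GPU's fixed IO budget is split across its links."""
    return gpu_io_gbps / n_direct_links

def switched_pair_bandwidth(gpu_io_gbps):
    """Idealized non-blocking switch: any pair can burst at the full port
    rate, because the switch fabric absorbs the all-to-all traffic pattern."""
    return gpu_io_gbps

io = 1800  # illustrative per-GPU IO in gigabits/s (an assumed number)
print(p2p_pair_bandwidth(io, 4))    # 450.0: each of 4 direct links gets 1/4
print(switched_pair_bandwidth(io))  # 1800: full port speed to any peer
```

The point is the shape of the tradeoff: with direct wiring, adding peers divides your bandwidth per pair, while a non-blocking switch keeps every pair at full port speed no matter how many GPUs share the fabric.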
0:26:18 We did that with Hopper, with eight-way NVLink, and one of the big innovations, and obviously it took
0:26:25 a lot of engineering, was making that 72 GPUs per rack, every one of those GPUs at full
0:26:31 speed, no constraints. And you can see that taking off. You can see the benefit:
0:26:37 it allows people to go even further and build even bigger models. The Kimi K2 model is even
0:26:46 bigger than the GPT-OSS one. We now have an open source, trillion parameter model, Kimi K2, yet it only uses 32 billion
0:26:48 parameters when you ask it a question.
0:26:49 Right.
0:26:54 That’s like a 3% activation of the brain.
0:26:55 Yeah.
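The activation fractions for the models mentioned work out like this, with parameter counts as quoted in the episode and treated as round numbers:

```python
# (total parameters, active parameters per token), as quoted in the episode.
models = {
    "GPT-OSS 120B": (120e9, 5e9),
    "Kimi K2":      (1e12, 32e9),
}

for name, (total, active) in models.items():
    print(f"{name}: {active / total:.1%} of parameters active per token")
```

Kimi K2's 32 billion of a trillion comes out to 3.2%, the "3% activation of the brain" Ian mentions.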
0:27:00 It’s incredibly complicated. It’s 61 layers, over 340 experts. They all have to talk to each
0:27:01 other.
0:27:07 And as a result, we now have open models at trillion parameter scale, at those levels of intelligence,
0:27:14 and the cost is comparable to, even lower than, what we could ever possibly have
0:27:18 with a fully dense model. It’s possible because of that NVLink connectivity.
0:27:23 NVIDIA is committed to it. Let’s keep going down that path and build. We have some of the world’s
0:27:27 best SerDes engineers, signal processing engineers, wire engineers, mechanical engineers, to make
0:27:33 all that work without having costs explode, and make it all connected. Every one of those GPUs,
0:27:38 by the way, is connected with a copper wire to one switch and then to another switch. There’s a reason
0:27:43 why it all sits in the rack: because we’re running at 200 gigabits per second on every one of those
0:27:50 wires. It’s PAM4 signaling, so it’s four levels per wire, a 0, 1, 2, or 3, not just a 0 or 1.
0:27:55 We’ve gone past binary at this point. And it’s going so fast that its wavelength
0:28:05 is about a millimeter, I think. So we’re pushing the limits of physics, keeping it all nice and
0:28:12 tight, and also doing everything in copper for low cost. We’re super happy with GB200 and what it’s
0:28:19 been able to do for inference, just keeping the cost and driving the cost of tokens down,
0:28:21 down, down, while intelligence goes up, up, up.
0:28:24 So is this getting into what we call extreme co-design?
0:28:30 Yeah. One of the joys of working at NVIDIA is that we’re the one company that works with
0:28:31 every company in AI.
0:28:32 Right, yes.
0:28:39 And, you know, we work with them in building their data centers, in getting the latest GPUs
0:28:44 to them, in explaining the NVL72 architecture, in building and helping build a lot of the software
0:28:50 that they use. We have teams working on PyTorch, on JAX, on SGLang, on vLLM, and all the software
0:28:55 that’s out there. And as these model makers are building new models that push the limits,
0:29:02 some inside NVIDIA now, but all around the world, we can co-design with them
0:29:08 how to take the maximum utility out of those 72 GPUs to manage that hidden cost of communication,
0:29:14 to make sure every GPU is running at 110%, computing on the fewest possible neurons
0:29:21 and doing that seamlessly and incredibly fast. All the while, thinking about the next model.
0:29:27 You know, what’s that next GPT, that next vision model, the next video model, the next Sora,
0:29:33 and making smart decisions about how to add more bandwidth, more communication,
0:29:38 more NVLink, and the right kind of floating point. And doing all of that without blowing out cost
0:29:45 or power, while keeping and leveraging all the work that they’ve done to date so that
0:29:50 it can be applied moving forward. This is the extreme co-design that we do at NVIDIA.
0:29:55 And it’s something that the folks I get to work with, and probably some of those watching this, get to enjoy.
0:30:02 And we work really, really hard, continuously, on performance, not just to have and be
0:30:07 the fastest, but also to reduce the cost, because, well, you talked about tokenomics.
0:30:17 If just our software alone could increase performance by 2x, you’ve now reduced the cost per token by 2x,
0:30:21 direct to the user and the customer, or whoever’s going to deploy this AI.
0:30:26 I was on a call this morning. We got a model from a customer. They wanted some help.
0:30:34 We applied the latest NVFP4 techniques, the latest kernel fusions, the latest NVLink communication I/O overlaps.
0:30:41 Within two weeks, we hit 2x on their model and gave them the code back. And, you know, we’re not done.
0:30:44 There are so many places where we could optimize.
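The tokenomics of that 2x is straightforward: the hardware's hourly cost is unchanged, so doubling throughput halves the cost per token. A sketch with hypothetical rates (the dollar and token figures below are made up for illustration):

```python
# Cost per million tokens = (dollars per hour) / (tokens per hour) * 1e6.
def cost_per_million_tokens(dollars_per_hour, tokens_per_second):
    return dollars_per_hour / (tokens_per_second * 3600) * 1_000_000

# Same hardware cost, 2x throughput after a software-only optimization pass.
# Both rates here are hypothetical, chosen only to show the ratio.
before = cost_per_million_tokens(dollars_per_hour=100.0, tokens_per_second=50_000)
after = cost_per_million_tokens(dollars_per_hour=100.0, tokens_per_second=100_000)
assert before / after == 2.0   # 2x faster software → 2x cheaper tokens
```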
0:30:49 I think a lot of people get confused. They see a GPU with a certain number of flops and they say, oh, that’s better or faster.
0:30:58 I’ll tell you, this stuff’s pretty complicated. Managing and running 72 GPUs with 384 experts and all the different kernels and all the different AI and all the different math.
0:31:03 We didn’t even talk about KV caching and reasoning models and all the tricks and techniques.
0:31:14 That’s an end-to-end problem. It requires extreme co-design between the hardware (what’s the art of the possible), the model builders themselves, and the dense and deep software stack that runs on it.
0:31:19 NVIDIA actually has more software engineers than hardware engineers specifically for that process.
0:31:20 Right. Yep.
0:31:27 So to kind of zoom out for a second, and harkening back to what you just said about, you know, thinking about what’s next.
0:31:33 We’ve been talking about MOE in the context of language models, predominantly, you know, for now.
0:31:38 And the GB200 NVL72 is really well suited to that architecture.
0:31:45 But is there a risk of focusing too narrowly on this single model trend of MOE?
0:31:48 What happens when we get, you know, sort of beyond MOE?
0:31:53 What happens? Is the architecture still well suited? Is the cost of tokens still going down?
0:32:03 How do you think about that going forward? And how is the design that NVIDIA has today ready for whatever the next trend might be?
0:32:28 Well, there’s one clear trend in AI is that intelligence creates opportunity. As the models get smarter, as they start to learn new things, or as they specialize in certain areas, they create opportunities to advance that industry, that science, that application, or just make computers more productive for you and I every day.
0:32:29 Yeah.
0:32:41 And in order to do that, we need to make the models smarter themselves. We need to use techniques like reasoning, which is only going to generate more tokens. And the only way to advance the state of the art of AI, well, there’s lots of ways.
0:32:55 One way NVIDIA can help is just to reduce the cost of tokens. And in doing that, MOE is just an optimization technique. If you don’t need all the neurons, don’t waste time computing on them.
0:33:12 That’s an idea that’s not unique to LLMs and chatbots. That’s just a good idea. So we see it may materialize in different ways, and how these networks and experts want to communicate, or the shapes of the models, are actually diversifying in lots of ways. There’s lots of different techniques.
0:33:31 Mixture of experts is certainly one of them that will stick around for a while. There are lots of other hybrid approaches and other things that people are talking about, or trade-offs that you can make in order to reduce cost. But we see MOEs happening not just in chatbots; similar sparsity and expert techniques are being applied in vision models and video models.
0:33:49 As the models expand into science, they’re not just generating tokens that turn into the words you and I talk about, but working on proteins, or on material properties, or on things like robotics, path planning, logic, or business applications.
0:34:01 All of those will benefit from having a large, intelligent model that can be sparsely optimized to only use and leverage the part that is needed for that particular question, that particular use case.
0:34:21 You can always go back down to the squirrel detector in a doorbell, but there’s usually a benefit to having a model that is actually able to reason, or has some multimodal aspects: maybe listen to what’s going on, see the things around it, and be able to make intelligent decisions smartly.
0:34:31 That is going to continue to grow. And NVIDIA is not just working on MOEs. We’ve got lots of different irons in the fire. There’s lots of different models. The models are diverse.
0:34:44 I get to work in HPC as well. The whole supercomputing community has now embraced AI, building all sorts of models for simulating physics and simulating weather, things that look nothing like chatbots, but they’re going to use MOEs.
0:35:02 They’re going to use every trick in the book, because the opportunity is huge. The ability to revolutionize biology, to do drug discovery, for cancer research alone, is an investment that the whole world’s making right now.
0:35:27 And they can take these ideas and take our platform and apply them to their domain, their problem: to take an open source model or a general model and fine-tune it to be a science model, or an application-specific model, or a business model. That’s possible because they’re starting from a really intelligent model that can learn, or be used to teach another model, to make things possible.
0:35:28 Yeah.
0:35:52 So I’m super excited about MOEs. I’m super excited, and we’ll continue to work on reducing the cost of every token. And while that may make our technology bigger, smarter, more complicated at times, and may make it more expensive, it is going to deliver X-factors in capability and intelligence, and as a result, dramatically lower the cost per token.
0:36:11 Ian, for listeners who want to dive in further, we could talk about this all day, but you have things to go build and customers to take care of and all that good stuff. Where can listeners go online? What’s the best place to start to dive into MOEs, into the infrastructure you’ve been talking about, into any and all of it?
0:36:20 Yeah. Check out GTC. You know, it’s a conference we started a few years ago, well, maybe now over a decade ago. I guess I was there for the first one.
0:36:22 It’s called the GPU Technology Conference.
0:36:23 Right.
0:36:29 It’s not a business conference, although obviously many business people show up. It’s not a demo conference. It’s a developer conference.
0:36:30 Yeah.
0:36:39 And if you want to learn more, go check out GTC. We put all the presentations online. Jensen’s keynote is wonderful. He’ll explain it even better than I can.
0:36:50 And you can watch them; we actually do a few a year now. I encourage you to check out GTC, go see the old ones. And if you’re going to be in San Jose in March, please come and check it out and attend.
0:37:00 There’s tons of sessions at every level from beginner to deep dive. If you want to go down to the hardware, all the NVIDIA experts will be there. All of the different developers are going to be there.
0:37:09 It’s kind of the go-to place to learn, and also to present your work on what you can do with GPUs and the state of the art of AI. Check it out.
0:37:22 Perfect. Ian Buck, again, thank you. And you know, for what it’s worth, Jensen’s an amazing presenter. You did a great job explaining all this. So we appreciate you taking the time. And as always, all the best to you and your teams on continued progress.
0:37:23 Thank you.
0:00:21 with us today. Ian is vice president of hyperscale and high-performance computing here at NVIDIA,
0:00:26 and he’s here to discuss mixture of experts, the architecture powering the world’s leading
0:00:31 frontier models, and how extreme co-design can both drive down the cost of generating
0:00:37 intelligence today and future-proof your AI platform for whatever advances come tomorrow.
0:00:41 Ian, welcome. Thanks so much for taking the time to join the podcast.
0:00:42 Thanks, Noah. Glad to be here.
0:00:48 So let’s jump right into it. What is mixture of experts, MOE as we call it? Why does it matter?
0:00:53 If you look at the top 10 open models on artificial analysis right now on their leaderboard,
0:01:00 they all share the MOE architecture. So can you explain kind of in lay terms what MOE is and why
0:01:05 it’s suddenly become the standard for frontier AI? Yeah, it’s a great question. I think there’s a lot
0:01:12 of, it’s a term that is used in industry and amongst AI researchers, but it’s not really understood like
0:01:17 what does mixture of experts mean? Yeah. We’ve all heard of neural networks, and that’s what these
0:01:23 neural networks are. They’re neurons, they’re parameters, they’re components of an AI model.
0:01:30 And, you know, when AI got started and really entered the zeitgeist of the world, the neural network was
0:01:37 simple: each parameter represented a neuron of the model. And we heard about a 1 billion parameter
0:01:42 model, then a 10 billion, a 100 billion, now trillion parameter models. Those are basically the neurons of
0:01:49 the AI brain that you activate when you ask ChatGPT a question. But something happened along the
0:01:54 way. As these models got smarter and smarter and smarter, they naturally got bigger and bigger and
0:02:00 bigger. In fact, you know, two years ago when Llama first came on the scene, there was a 7B Llama,
0:02:08 and then there was a 70B Llama, and now we have a 405B, 405 billion parameter model. And that makes
0:02:12 them smarter. They have more information, they understand more things, and they give you better
0:02:18 answers. But there was a problem. As they got smarter and smarter and smarter, to get the answer,
0:02:24 you actually had to ask and activate every neuron in that brain. So as a result, while the models are
0:02:29 getting more and more intelligent, they’re also getting slower and slower because you had to ask
0:02:35 every neuron and calculate every neuron and perform all the math on every neuron on a GPU. And then it
0:02:41 wasn’t one GPU, it was lots of GPUs, and even more. Along the way, researchers came up with this idea,
0:02:46 and they realized, just like a human brain, you know, we probably don’t need all of these neurons to ask
0:02:50 every question. Simple questions, probably just a few neurons, or different parts of the brain may
0:02:57 encode different information. Let’s just activate those. So to make the AI cheaper, or rather to make the tokens,
0:03:01 the pieces of data that fly through the model and eventually become a word on the
0:03:06 screen, cheaper: let’s only activate the neurons we need to activate. And that’s what
0:03:12 mixture of experts is. Instead of having one big model, we actually split the model up into smaller
0:03:19 experts. Same number of total parameters, but now we train the model to only ask the
0:03:23 experts that probably know that information along the way. And that’s part of the training process to
0:03:28 build that model. Once you do that, you can have a model which has maybe 100 billion parameters,
0:03:33 100 billion neurons, but we only ask or activate about 10 billion. That’s a compression mechanism.
0:03:40 That’s a way of making AI cheaper, but still being able to encode all the possible information and answer
0:03:45 all the questions. So most models today are achieving higher and higher intelligence scores
0:03:53 by taking advantage of having lots of experts, and having the model, as it comes
0:03:57 up with the answer, ask only the right experts in order to get the right answers.
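The split-and-route idea can be shown as a toy mixture-of-experts layer. This is a from-scratch sketch with made-up sizes, not any production architecture: a tiny router scores every expert, only the top-k actually compute, and their outputs are blended by the router's weights.

```python
import math
import random

random.seed(0)
d_model, n_experts, top_k = 4, 6, 2

# Router: one weight vector per expert; an expert's score is a dot product.
router = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_experts)]
# Each "expert" here is just a small d_model x d_model weight matrix.
experts = [[[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_model)]
           for _ in range(n_experts)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def moe_layer(x):
    scores = [dot(w, x) for w in router]                   # score every expert
    chosen = sorted(range(n_experts), key=lambda i: scores[i])[-top_k:]
    exps = [math.exp(scores[i]) for i in chosen]
    gates = [e / sum(exps) for e in exps]                  # softmax over chosen
    out = [0.0] * d_model
    for g, i in zip(gates, chosen):                        # only top-k compute
        y = [dot(row, x) for row in experts[i]]
        out = [o + g * yj for o, yj in zip(out, y)]
    return out

token = [random.gauss(0, 1) for _ in range(d_model)]
print(len(moe_layer(token)))  # → 4: same output size, most experts idle
```

Real models stack dozens of these layers and learn the router and expert weights jointly; the shape of the computation is the same.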
0:04:04 To put some numbers behind it, we have that Llama 405B, 405 billion parameters. That’s one big model.
0:04:09 On leaderboards like Artificial Analysis, which you mentioned, it gets an intelligence score of about
0:04:15 28. 28 is just a weighted score of the benchmarks they tested. But all 405 billion parameters are
0:04:23 getting activated. Now, fast forward to a modern open model like OpenAI’s GPT OSS model. It has
0:04:28 120 billion parameters, actually a little bit smaller in total parameters. But when you ask
0:04:35 a question, it only activates on the order of about 5 billion parameters. So instead of 405 billion
0:04:39 parameters and all that math and all that cost, it actually only needs to activate about 5 billion
0:04:46 parameters. That’s like a 10 to 1 or beyond compression, making it cheaper. And then it gets
0:04:55 an intelligence score of 61. So it is going from 28 to 61, while going from 405 billion active parameters to 5 billion
0:05:01 parameters. Way cheaper. It’s not 10x cheaper; it’s still complicated, and we can talk about why these
0:05:07 MOEs are complicated to run. But Artificial Analysis does measure the cost to run the benchmark, like
0:05:13 how to run and calculate the intelligence score. For Llama 405B, I think it currently costs about
0:05:19 $200 for them, via a cloud service, to get all the answers to create that score.
0:05:26 They asked GPT OSS the same thing. Its tokens are cheaper, and it only cost about 75 bucks. So MOEs are
0:05:33 allowing models to get bigger and smarter. It’s allowing them to get cheaper. And as a result,
0:05:38 advancing AI. Now, of course, across the board, all the leaderboards, they’re all these mixture of
0:05:43 expert models. Right. Correct me, bring me back on track if I get off here with the questions.
0:05:49 But from kind of a layperson standpoint, to use that word, if I’m trying to wrap my head around
0:05:55 this idea of mixture of experts, are the experts divided up in ways that I might think about knowledge?
0:06:02 You know, this expert handles math, and this one handles science, and this one handles, I don’t know,
0:06:08 visual understanding. Yeah, it’s a great question. You know, that is the art of training these things.
0:06:14 In fact, it’s not like it’s hard-coded in there. Right. They don’t train a separate model for
0:06:20 doing math questions, and a separate model for telling you how to make a pizza. The beauty of
0:06:25 AI is that the algorithms that these researchers and scientists and companies like Anthropic and OpenAI
0:06:30 and everybody else have figured out is that they can just give it the data, and they encourage the model
0:06:36 to sort of identify and create these little pockets of knowledge. It’s not prescriptive,
0:06:41 but just from the data that they’re seeing, it naturally clumps the activity of these different
0:06:46 questions to different experts. And then in front of those experts, there’s this thing called
0:06:51 the router. And the router is actually able to just look at the string of the question, what
0:06:55 answer is forming, how the model is thinking, and then predict:
0:07:00 you know what, this one probably goes to that guy or this other guy. In fact, today’s models
0:07:06 may have on the order of dozens of experts on every layer of the model. And there’s a little
0:07:10 router in between, and they may actually ask not just one expert; at every layer, they may ask
0:07:16 two experts or eight experts. And then there’s another unit which listens to all the
0:07:21 experts. This guy says, I’m pretty sure I got the right answer; maybe I got the right answer; I don’t
0:07:24 know; I don’t know; I don’t know. It combines the answers and then goes to the next one. So that’s
0:07:28 actually the architecture of it. You know, it’s kind of like, um, you could train
0:07:35 one person, one brilliant scientist. You train an Einstein to be able to answer any question.
0:07:40 That’s really hard. It takes a lot of energy. That’s a very expensive person to hire and have
0:07:45 on staff. Instead, maybe I can hire a couple of domain experts, or teach a couple of different
0:07:49 people some stuff. And, you know, I just give them all the question. They can all answer it
0:07:53 very quickly in parallel, and you combine the knowledge. And that’s actually how we
0:07:59 work today. One person is not a company. Companies exist because we have all
0:08:05 this expertise around, and the MOE method is basically applying that to AI. So the models are all trained
0:08:09 that way. There are all sorts of training methods to create the conditions where
0:08:14 information and activations can start grouping and gathering together. And you can train these little
0:08:20 routers and combiners. And then you just do that in multiple, multiple layers. And sure enough,
0:08:28 at the end of it, you’ve got a chat model like a GPT OSS or a Kimi K2. Yeah. Now, MOE isn’t
0:08:35 new to 2025. The idea of the architecture has been around for a few years. So was it being used? Has it
0:08:39 been, you know, being used all along and we just weren’t so aware of it? And then why has it kind of
0:08:44 come to prominence lately? Yeah. The idea of experts is not new in machine learning.
0:08:49 You know, before AI, there was an idea of combining multiple machine learning
0:08:55 models together, and how to do that statistically to improve the accuracy. There’s all sorts
0:09:00 of history and math around that. Yeah. Applying it to AI, though, is relatively new. You know,
0:09:05 the early versions of, we now know, ChatGPT were mixture-of-experts
0:09:12 models, but that was not publicly known. Okay. It really wasn’t until the DeepSeek
0:09:17 moment, which was about a year ago, that it really blew the doors open. Because DeepSeek,
0:09:24 those researchers were the first to really build a world-class MOE-based model. People
0:09:30 have written papers about it, but it was one that actually competed and demonstrated the intelligence
0:09:36 scores that could keep up with the closed-source models. And it was a beast. It was awesome. It had
0:09:42 256 experts in every layer. I mean, it did every single optimization. And as a result, it was
0:09:50 extremely cheap to run. Incredibly complicated, but cheap to run, because it took MOE all
0:09:54 the way to the extreme. And many people think it’s kind of where OpenAI was, you know, with the
0:10:00 original GPT. So now, once we had that moment, you know, the first time DeepSeek was run
0:10:05 on even GPU systems, it actually didn’t run that well, because we didn’t have the infrastructure or
0:10:09 even the software to run it that well. The DeepSeek engineers had written all this custom code to
0:10:14 make it run awesome. But at that point, every model maker, every researcher realized, hey, this thing’s
0:10:18 real. We can now see how they did it. I mean, the whole thing was open; they published the paper. It’s a
0:10:25 brilliant paper, and it shows the opportunity for MOE. And since that moment, you can see that every
0:10:31 model now has shifted to building MOEs. DeepSeek sort of shone a light on how to do it, how to train
0:10:37 it, how to do inference and deploy it, and sort of kicked off that revolution of MOEs that
0:10:43 we’ve been enjoying. Right. So we know the DeepSeek moment was huge, as you just said, for
0:10:49 for many reasons. Are we going to look back and say, hey, the lights went on
0:10:56 then, and, you know, new things will come? But for the moment, is everything MOE? And if not,
0:11:01 why not? What’s kind of the decision-making process? When would you train a
0:11:07 model to be MOE and when would you not? You know, I think for all the models that really are
0:11:13 focused on providing an intelligent response, it makes a lot of sense why they’re MOE. Yeah.
0:11:19 You want to do your best to encode as much knowledge into the neural network, so it just
0:11:25 knows things. You don’t need to work out on pencil and paper that two plus two
0:11:30 is four; you just know two plus two is four. So the more neurons you can throw into a holistic
0:11:36 model, the more innate knowledge it has. It doesn’t have to work that out in a reasoning chain
0:11:41 or other such things. So there’s a huge advantage to having models be bigger as long as we don’t
0:11:48 increase the cost. And that’s why MOEs, we want to be able to push the limits of only activating
0:11:54 10%, 5%, 3% of the neurons, more and more experts. And then you can see that in the research and the
0:11:59 way the models are evolving. They’re really pushing the limits, um, of seeing some of the modern models,
0:12:05 you know, they’ll have 300, 400 experts. They’re trying to combine now getting all those experts and
0:12:10 all that communication is complicated. We’ll talk about that. Um, but it is innate by, you know,
0:12:17 having that, the foundation model with all of those experts allows them to then apply all the other
0:12:23 techniques of inference, of reasoning. It allows models that are smaller to be distilled and
0:12:28 fine-tuned for specific tasks. It creates a foundation for the rest of the
0:12:32 AI models around the world. Certainly some of the smallest models, you know, for the more
0:12:37 dedicated, individual use cases. You know, I’ve got to put a box around a stop sign, or I’ve got,
0:12:43 you know, a Ring doorbell that uses AI to detect if it’s a squirrel or not a squirrel. You know,
0:12:47 those small models need to do one specific thing. Probably I can
0:12:51 squeeze it down; I don’t need to go to the complexity of an expert system,
0:12:57 but anything that wants to be agentic, any kind of agent, and pretty much most of the AIs that we
0:13:02 interact with purposefully, they’re all MOEs, because they can be thrown, and need to know
0:13:06 and be able to reason about, a wide variety of different stuff. And it makes AI
0:13:13 cheaper. Yeah. It lowers the cost per token. So there’s always that driving of cost, and the continuous push:
0:13:17 let’s increase intelligence and let’s lower costs. We can do both with MOEs.
0:13:21 I was going to ask you about that because there’s this, it seems like there’s this focus happening,
0:13:26 now, you know, generative has, has progressed far enough and certainly it’s, it’s everywhere,
0:13:32 you know, including, uh, the news, the business section, if you will. And there’s a shift kind of
0:13:38 from, you know, the biggest models, raw speed, you know, the highest scores to, as you said,
0:13:44 how much does this cost and can we get it to be cheaper while being just as smart, if not more
0:13:49 intelligent? So we’re calling it tokenomics, right? So not in the sense of blockchain or crypto
0:13:54 tokens, but, you know, as you mentioned, AI systems generating tokens, reasoning tokens,
0:14:01 output tokens, what have you. So if we’re focused on bringing the cost down, how does a more complex
0:14:06 system (and I’m kind of inferring here a little bit, but I would imagine it’s more expensive to
0:14:11 architect and to train, perhaps not to run, but in total cost), how does a more expensive, kind of premium
0:14:16 system actually drive the total cost down? Yeah. There’s a wonderful symbiotic relationship
0:14:23 that happens in the market between the AI hardware and the models that are being created
0:14:28 to serve AI. They inherently kind of have to make sense together. You know, if the hardware
0:14:33 offers a certain level of connectivity, a certain GPU performance, a certain memory size, obviously
0:14:38 building an AI model that’s even bigger is going to be hard to take to market, or even not possible to
0:14:44 efficiently train. So, you know, from the original Kepler GPUs that were used for
0:14:51 those first cat-detecting AIs, to today’s modern GB200 and GB300 NVL72, you can see a pattern where,
0:14:56 you know, with every new platform, we advance the state of the art of what
0:15:02 NVIDIA is able to offer. The compute performance, the memory performance, the connectivity, the I/O,
0:15:08 we’ll talk about NVLink, those things enable the next wave: to train the next model,
0:15:15 but also to do inference. And they add complexity. You know, when we started, we were
0:15:21 doing PCIe cards, little, basically graphics cards that plugged into the server equivalent of your
0:15:27 PC, and used the floating point calculations and the graphics memory in order to do that computation. And they
0:15:34 were great. When the AI revolution took off, we saw that by adding more floating point calculation and building
0:15:40 a bigger GPU, adding things like HBM memory, adding things like increasing the power beyond what a
0:15:49 typical PCIe slot will do, we would often increase the performance of what was capable in AI, not by just the
0:15:55 percentage of more flops or memory bandwidth, but by X-factors. And that’s really because the AI models
0:15:59 we were able to build were bigger, smarter, could run more efficiently, and could do more things.
0:16:07 You know, people talk about TCO as the cost. But TCO in itself is not the goal;
0:16:13 it’s just the lowest cost. You want the lowest cost? You know, buy one GPU. Sure. The goal is
0:16:18 actually to deliver, to improve intelligence, and intelligence per dollar, the cost of that
0:16:23 intelligence. Or, if we’re at the same level of intelligence, say that, you know, 60 score from
0:16:28 Artificial Analysis, are we reducing the cost of that intelligence over time?
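Intelligence per dollar can be computed directly from the figures quoted earlier in the conversation: a score of about 28 at roughly $200 of benchmark cost for the dense Llama 405B, versus about 61 at roughly $75 for the GPT OSS MOE.

```python
# Score-per-dollar using the benchmark numbers cited earlier in the episode.
models = {
    "Llama 405B (dense)": {"score": 28, "benchmark_cost_usd": 200},
    "GPT OSS 120B (MOE)": {"score": 61, "benchmark_cost_usd": 75},
}
for name, m in models.items():
    ratio = m["score"] / m["benchmark_cost_usd"]
    print(f"{name}: {ratio:.2f} intelligence points per dollar")
# → Llama 405B (dense): 0.14 intelligence points per dollar
# → GPT OSS 120B (MOE): 0.81 intelligence points per dollar
```

Roughly a 5-6x improvement in intelligence per dollar, which is the ratio the speaker is optimizing for.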
0:16:33 The tokens that people need to buy are the cost in order to run it. That’s really the goal in
0:16:39 every generation of NVIDIA architecture. You know, we’re looking to figure out what technologies we can
0:16:46 incorporate, expand, double down on, invest in, or pull from the community
0:16:51 and our partners, in order to deliver X-factors of performance improvement, where the models,
0:16:57 even the existing models, like the current MOEs, could get an X-factor of performance improvement.
0:17:02 You know, we’re not afraid to add more cost and more technology on a per-GPU basis.
0:17:07 You know, HBM memory is a lot more expensive than the old-school graphics memory, but it
0:17:14 only increases the cost in percentages. Because you now have HBM, and because you
0:17:19 have the bandwidth that it offers, and it can connect to that much floating point, you can deliver an X-factor
0:17:24 in total end-to-end performance. Yeah. Yeah. And we saw that actually, you know, when DeepSeek R1
0:17:29 came out, the GPU at the time was the Hopper H200 system. Mm-hmm.
0:17:35 Hopper had, uh, eight GPUs in a server. They were all connected with NVLink through an NVLink switch.
0:17:40 So we could effectively build one giant GPU of eight GPUs working as one. Right.
0:17:45 That was really important. The model is so large, it couldn’t really fit on a single GPU. It had to use
0:17:51 multi-GPU. And the researchers that built DeepSeek took great advantage of that. It also had
0:17:58 that NVLink capability, so we could actually put every expert on different GPUs. And you could see that.
0:18:02 You could parallelize the work. You could run things even more efficiently, even faster.
0:18:07 And because as those experts all had to talk to each other, they would do that over NVLink.
0:18:14 So that was very important. Before we had NVLink, you know, you would have to send things over a PCIe bus
0:18:18 and only one could talk at a time. And it was much slower. Because we have NVLink, all those GPUs can
0:18:23 talk to every other GPU at full speed. It’s totally unblocked, you know,
0:18:30 literally at gigabytes and terabytes a second of bandwidth, without any concern for collision.
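The expert parallelism being described can be sketched in a toy form: spread the experts across GPUs, and every token's activations must reach whichever GPUs host its chosen experts. That is the all-to-all traffic NVLink absorbs. The placement and routing below are naive illustrations, not DeepSeek's actual scheduler.

```python
# 256 experts per layer (the DeepSeek figure above) spread over 8 GPUs.
n_gpus, n_experts = 8, 256

def home_gpu(expert_id):
    return expert_id % n_gpus   # naive round-robin expert placement

def cross_gpu_hops(source_gpu, chosen_experts):
    # One cross-GPU transfer for every chosen expert that lives elsewhere.
    return sum(1 for e in chosen_experts if home_gpu(e) != source_gpu)

# A token sitting on GPU 0, routed by the MOE layer to 8 experts:
chosen = [3, 17, 40, 64, 91, 128, 200, 255]
print(cross_gpu_hops(0, chosen))  # → 4: half the chosen experts live off-GPU
```

With hundreds of experts and thousands of tokens in flight, nearly every token needs several of these transfers per layer, which is why the unblocked all-to-all bandwidth matters so much.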
0:18:35 It was critical for those DeepSeek researchers to get good performance. Fast forward: obviously,
0:18:41 it also happened at a time, we can now say, when we were in the heart of building
0:18:47 what is now the GB200 NVL72, where we scaled up the number of GPUs we can connect
0:18:53 from just eight GPUs in a server to 72 GPUs in an entire rack, a 9x multiple.
0:18:54 Yeah.
0:19:01 Now that’s a lot more GPUs. So did the cost go up? It certainly did. Obviously, 9x that many
0:19:04 GPUs, an entire rack of GPUs versus a server, is a lot more money.
0:19:05 Sure.
0:19:09 In fact, we actually even had to add more technology, because we needed to take all those
0:19:14 NVSwitches and build a separate NVSwitch plane. It does cost more. But because
0:19:19 we did that, we can actually parallelize and improve the performance of DeepSeek R1 even more.
0:19:24 We can take all those experts, and instead of having to try to make it all fit and work within
0:19:30 the eight GPUs, we could actually get all 72 GPUs working as one. And that improved
0:19:36 performance. Just going generation to generation, being able to further parallelize and run all those
0:19:42 experts across it could actually increase the performance so much that we actually got a 15x
0:19:49 improvement on running DeepSeek R1, versus only adding, you know, about 50% more
0:19:52 total cost on a per-GPU basis.
0:19:53 Wow. Okay.
0:19:57 That actually generated a 10x reduction in the cost per token.
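The arithmetic behind that pair of numbers: throughput rose 15x while per-GPU cost rose roughly 50%, so the cost of each token fell by about 15 / 1.5 = 10x.

```python
# Generation-over-generation tokenomics, using the figures just quoted.
speedup = 15.0          # DeepSeek R1 throughput, Hopper -> GB200 NVL72
cost_multiplier = 1.5   # ~50% more cost on a per-GPU basis

token_cost_reduction = speedup / cost_multiplier
print(f"{token_cost_reduction:.0f}x cheaper per token")  # → 10x cheaper per token
```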
0:19:58 Right, right, right.
0:20:02 So we do have to add more technology. We want to keep adding more technology.
0:20:09 NVIDIA is a technology company, but we turn that technology back into performance, which
0:20:14 in the net of it reduces the cost per token, because those 72 GPUs are that much faster. And
0:20:18 as a result, they can actually get more out of that rack, more on a per
0:20:25 GPU basis. And we’ve taken it down from what was Hopper. It cost about $1 to get a million
0:20:26 tokens. Okay.
0:20:31 Roughly a million words. It’s now down to about $0.10. So people look at the rack and they say
0:20:32 it’s expensive. Right.
0:20:36 But the way you do that is that you’ve put all that investment into NVLink,
0:20:41 into all the connectivity and all the next generation software. And you also do all
0:20:46 that software work to make it all work really well. And generation over generation, you get
0:20:51 that multiple, that 10x multiple, in the reduction in cost. That’s just one model. That same story
0:20:56 is playing out for us and everything else. And those are models that were built and trained
0:20:58 and designed for Hopper. Right.
0:21:02 You know, we’re entering into the, you know, the starting to see some models come out that
0:21:06 are trained on Blackwell and you’re going to see that, you know, now raise the bar and go
0:21:11 even further. So this is the virtuous cycle that we’ve been working so feverishly to
0:21:17 help make happen. We might add, you know, percents in terms of cost and complexity
0:21:23 on a per GPU basis. But we, we aim at every generation to deliver X factors of performance.
0:21:29 And as a result, we dramatically lower the cost per token by 10x.
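The generation-over-generation arithmetic Ian walks through (roughly 15x the throughput for about 1.5x the per-GPU cost) can be checked directly. A sketch using the figures quoted in the conversation; treat them as illustrative, not official pricing:

```python
def cost_per_token_multiplier(perf_gain: float, cost_gain: float) -> float:
    """Relative cost per token of the new generation versus the old one.

    Cost per token scales as (dollars spent) / (tokens produced).
    """
    return cost_gain / perf_gain

mult = cost_per_token_multiplier(perf_gain=15.0, cost_gain=1.5)
hopper_dollars_per_million = 1.00   # ~$1 per million tokens on Hopper, as quoted
print(f"new cost per token: {mult:.2f}x the old one")
print(f"~${hopper_dollars_per_million * mult:.2f} per million tokens")
```

With those inputs the multiplier comes out to 0.10, which is the "10x reduction" and the move from about $1 to about $0.10 per million tokens.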
0:21:35 As I’m listening to you describe, you know, NVLink and the advances in getting the experts,
0:21:40 getting the GPUs to communicate and kind of act as one, I can’t help but think, we need NVLink
0:21:45 for, like, Teams meetings, so that instead of talking over each
0:21:48 other, we can all communicate as one at the speed of light.
0:21:49 That’s right.
0:21:53 Speaking with Ian Buck, Ian is vice president of hyperscale and high performance computing
0:21:58 at NVIDIA. And we’re discussing mixture of experts and why it’s become the architecture,
0:22:03 well, as it has been for a while, but now getting public prominence, if you will, the architecture
0:22:08 behind so many leading frontier models and what goes into not only architecting and training
0:22:14 the models, but the infrastructure that really makes them hum. And Ian, I wanted to ask you,
0:22:18 you talked about this a little bit, as I said, with, you know, NVLink and all
0:22:23 of the technologies you kind of alluded to as you were describing the MOE architecture.
0:22:30 But what is it specifically about these NVIDIA systems that makes them such a good and such
0:22:35 a unique fit for these complex MOE models, able to achieve, as you just described, you
0:22:38 know, this lowering cost of intelligence measured per token?
0:22:44 Yeah. It’s an interesting question, and understanding it goes back to the original idea about having
0:22:45 experts. Okay.
0:22:50 We’re reducing the cost per token by not turning on every neuron, but only turning on the ones we
0:22:57 need. It’s a cost savings. And we talked about Llama, the 405 billion parameter Llama model:
0:23:01 in order to use it, you’ve got to activate all 405 billion of those neurons,
0:23:03 even though they’re not all needed. Right.
0:23:09 Look at GPT-OSS: it’s 120 billion parameters, still a lot, but you only need about 5 billion
0:23:15 parameters. And so it is smart, and as a cost-saving measure, it only uses five.
0:23:24 You also notice, though, that’s like 10x less, actually more than 10x, around 1% of the number
0:23:26 of neurons we’re actually doing math on.
0:23:27 Yep.
0:23:31 Unfortunately, though, the cost on GPT-OSS isn’t 1%, actually.
0:23:32 Right.
0:23:38 You know, it is X-factors better, about 3x less cost, but it’s not, you know, 1% of the cost.
0:23:39 Sure, yeah.
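The routing idea behind this, score every expert but run only a handful per token, can be sketched as a toy top-k router. A minimal illustration in NumPy; the sizes, random weights, and gating scheme here are made up for the example and are far smaller than any real model’s configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

N_EXPERTS, TOP_K, D = 8, 2, 16   # toy sizes, not any real model's config

# Each "expert" here is just one small feed-forward weight matrix.
experts = [rng.standard_normal((D, D)) * 0.1 for _ in range(N_EXPERTS)]
router_w = rng.standard_normal((D, N_EXPERTS)) * 0.1  # gating network

def moe_forward(x):
    """Route one token: score all experts, but run only the top-k."""
    logits = x @ router_w
    top = np.argsort(logits)[-TOP_K:]                       # chosen expert indices
    gate = np.exp(logits[top]) / np.exp(logits[top]).sum()  # renormalized weights
    # Only TOP_K of the N_EXPERTS weight matrices are touched; the skipped
    # matrix multiplies are exactly the cost savings of sparse activation.
    out = sum(g * (x @ experts[i]) for g, i in zip(gate, top))
    return out, sorted(int(i) for i in top)

token = rng.standard_normal(D)
out, chosen = moe_forward(token)
print(f"activated {len(chosen)} of {N_EXPERTS} experts: {chosen}")
```

Scaling the same structure up is how a 120 billion parameter model can answer with only about 5 billion parameters of math per token.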
0:23:45 There’s a hidden tax to MOE. And it’s all about how those experts want and
0:23:51 need to communicate with each other. In order to get MOEs to run efficiently, those experts
0:23:56 are all doing their math very, very, very fast. And they all need to communicate with each other
0:24:01 very, very, very quickly. And one of the challenges with MOEs, as we go and get sparser
0:24:07 and sparser, which makes the models more and more valuable, and we’re
0:24:12 saving more and more cost, is can we make sure that all that math is happening and
0:24:17 all those experts can talk to each other without ever going idle, without ever waiting
0:24:22 for a message. You’re buying those GPUs, you’re paying for them so
0:24:24 they can do the math they need to do.
0:24:25 Right.
0:24:29 Not to sit around and wait for someone else to send them something. Or worse, the network
0:24:34 that connects all these GPUs gets gummed up, and now everybody’s sitting idle, and that’s
0:24:36 going to go straight to the bottom line of the cost.
0:24:37 Yeah, yeah.
0:24:43 So, that’s the key part, and the hidden cost of MOE is communication. We’ve looked at, you
0:24:48 know, can we make it work with just point to point? Like, maybe I can just connect this
0:24:53 GPU with this GPU, and this GPU with that GPU. It’ll be a much lower cost to actually just directly
0:24:57 wire them up. But there’s a limit to how much I can do that. If I take one GPU and I connect
0:25:04 it to four, well, this GPU now, its IO is split four ways, and I can only do that so far.
0:25:10 And even with our Hopper systems, we had eight, and there was an NV switch chip. We built another
0:25:15 chip specifically for this, but we couldn’t scale beyond that eight, because that’s the chip.
0:25:15 Yeah.
0:25:20 So, if you have point to point or a torus-like network, you’re fundamentally limited in how
0:25:27 cheap you can make those tokens, because the hidden cost of MOE is communication.
0:25:32 And if you try to go bigger than, you know, a neighboring or point to point
0:25:38 connection or some kind of loop or message passing thing, or use a fabric like Ethernet, they
0:25:45 weren’t designed for this. The best answer is no compromises. I want this expert, this GPU,
0:25:51 to be able to talk to every other expert at full speed, no limitations, no worry about congestion.
0:25:55 I need a network, I want to connect these things so there’s no, there’s nothing blocking.
0:25:56 Yeah.
0:26:01 And that’s what NVLink is. In fact, that chip that we built is specifically designed to
0:26:07 make sure that every GPU, with all of its terabytes a second of bandwidth, can talk to
0:26:12 every other chip at full speed, and never compromise on the maximum IO bandwidth we can get out of
0:26:13 every GPU.
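The difference between a shared bus where only one GPU talks at a time and a non-blocking fabric can be put in rough numbers. A back-of-envelope sketch; the 64 MB payload and 64 GB/s per-GPU bandwidth below are assumptions for illustration, not NVLink specifications:

```python
def exchange_time_s(n_gpus: int, payload_bytes: float, gbytes_per_s: float,
                    non_blocking: bool) -> float:
    """Seconds for every GPU to ship `payload_bytes` of expert activations."""
    bw = gbytes_per_s * 1e9
    if non_blocking:
        # Switched, non-blocking fabric: all GPUs transmit at once at full speed.
        return payload_bytes / bw
    # Shared-bus style: only one GPU talks at a time, so transfers serialize.
    return n_gpus * payload_bytes / bw

payload = 64e6  # 64 MB of activations per GPU per exchange (made-up figure)
slow = exchange_time_s(72, payload, 64.0, non_blocking=False)
fast = exchange_time_s(72, payload, 64.0, non_blocking=True)
print(f"serialized: {slow*1e3:.1f} ms  non-blocking: {fast*1e3:.2f} ms  "
      f"speedup: {slow/fast:.0f}x")
```

In this toy model the serialized exchange is slower by a factor equal to the GPU count, which is the intuition for why a 72-GPU MOE wants every GPU talking concurrently.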
0:26:18 We did that with Hopper, with 8-way. And one of the big innovations, and obviously it took
0:26:25 a lot of engineering, was to make that 72-GPU rack work: every one of those GPUs at full
0:26:31 speed, no constraints. And you can see that taking off. You can see the benefit, you know,
0:26:37 that allows people to go even further and build even bigger models. The Kimi K2 model is even
0:26:46 bigger than the GPT one. We now have an open source, trillion parameter model, Kimi K2, yet it only uses 32 billion
0:26:48 parameters when you ask it a question.
0:26:49 Right.
0:26:54 That’s like a 3% activation of the brain.
0:26:55 Yeah.
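The activation ratios for the models mentioned in the conversation work out as follows. A quick check using the parameter counts quoted above, with the dense Llama included for contrast:

```python
# (total parameters, parameters active per token), as quoted in the conversation
models = {
    "Llama 405B (dense)": (405e9, 405e9),
    "GPT-OSS 120B (MOE)": (120e9, 5e9),
    "Kimi K2 1T (MOE)":   (1000e9, 32e9),
}
for name, (total, active) in models.items():
    # Kimi K2 lands near the ~3% "brain activation" figure Ian mentions.
    print(f"{name}: {active / total:.1%} of parameters active per token")
```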
0:27:00 It’s incredibly complicated. It’s 61 layers, over 340 experts. They all got to talk to each
0:27:01 other.
0:27:07 And as a result, we now have open models at trillion parameter scale levels of intelligence,
0:27:14 and the cost is comparable to, and even lower than, what we could ever possibly have
0:27:18 with a fully dense model. It’s possible because of that NVLink connectivity.
0:27:23 NVIDIA is committed to it. Let’s keep going down that path and build. We have some of the world’s
0:27:27 best SerDes engineers, signal processing engineers, wire engineers, mechanical engineers, to make
0:27:33 all that work without having costs explode, and make it all connected. Every one of those GPUs,
0:27:38 by the way, is connected with a copper wire to one switch to another switch. There’s a reason
0:27:43 why it all sits in the rack is because we’re running at 200 gigabits per second on every one of those
0:27:50 wires. It’s PAM4 signaling, so it’s four levels per wire: a 0, 1, 2, or 3, not just a 0 or 1.
0:27:55 We’ve gone past the binary at this point. And it’s going so fast, it’s actually, its wavelength
0:28:05 is about a millimeter, I think. So we’re pushing the limits of physics, keeping it all nice and
0:28:12 tight, and also doing everything in copper for low cost. We’re super happy with GB200 and what it’s
0:28:19 been able to do for inference, and just keeping and driving the cost of tokens down,
0:28:21 down, down, while intelligence goes up, up, up.
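The PAM4 point can be made concrete: four voltage levels carry two bits per symbol, so a lane’s bit rate is twice its symbol rate. A sketch; the 100 GBd symbol rate below is an assumption chosen to match the 200 gigabits per second per wire quoted above:

```python
import math

LEVELS = 4                                # PAM4: amplitude levels 0, 1, 2, 3
BITS_PER_SYMBOL = int(math.log2(LEVELS))  # log2(4) = 2 bits per symbol

def lane_gbps(symbol_rate_gbaud: float) -> float:
    """Bit rate of one wire given its symbol (baud) rate."""
    return symbol_rate_gbaud * BITS_PER_SYMBOL

print(f"{BITS_PER_SYMBOL} bits per symbol")
print(f"100 GBd PAM4 lane -> {lane_gbps(100.0):.0f} Gb/s")
```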
0:28:24 So is this getting into what we call extreme co-design?
0:28:30 Yeah. One of the joys of working in NVIDIA is that we’re the one company that works with
0:28:31 every company in AI.
0:28:32 Right, yes.
0:28:39 And, you know, we work with them in building their data centers, in getting the latest GPUs
0:28:44 to them, in explaining the NVL72 architecture, in building and helping build a lot of the software
0:28:50 that they use. We have teams working on PyTorch, on JAX, on SGLang, on vLLM, and all the software
0:28:55 that’s out there. And as these model makers are building new models and pushing the limits,
0:29:02 some inside NVIDIA now, actually, but all around the world, we can co-design with them
0:29:08 how to take the maximum utility out of those 72 GPUs to manage that hidden cost of communication,
0:29:14 to make sure every GPU is running at 110% on computing on the fewest possible neurons
0:29:21 and doing that seamlessly and incredibly fast. All the while, thinking about the next model.
0:29:27 You know, what’s that next GPT, that next vision model, the next video model, the next Sora,
0:29:33 and figuring and making smart decisions about how to add more bandwidth, more communication,
0:29:38 more NVLink, and the right kind of floating point. And doing all of that without blowing out cost
0:29:45 or blowing out power, while keeping and leveraging all the work that they’ve done to date so that
0:29:50 it can be applied moving forward. This is the extreme co-design that we do at NVIDIA.
0:29:55 And it’s something the folks that I get to work with, and probably those watching this, get to enjoy.
0:30:02 And we work really, really hard to continuously work on performance, not just to have the fastest and be
0:30:07 the fastest, but also to reduce the cost, because, well, you talked about tokenomics.
0:30:17 If our software alone could increase performance by 2x, you’ve now reduced the cost per token by 2x,
0:30:21 direct to the user and the customer, or whoever’s going to deploy this AI.
0:30:26 I was on a call this morning. We got a model from a customer. They wanted some help.
0:30:34 We applied the latest NVFP4 techniques, the latest kernel fusions, the latest NVLink communication IO overlaps.
0:30:41 Within two weeks, we hit 2x on their model and gave them the code back. And, you know, we’re not done.
0:30:44 There’s a, there’s so many places where we could optimize.
0:30:49 I think a lot of people get confused. They see a GPU with a certain number of flops and they say, oh, that’s better or faster.
0:30:58 I’ll tell you, this stuff’s pretty complicated. Managing and running 72 GPUs with 384 experts and all the different kernels and all the different AI and all the different math.
0:31:03 We didn’t even talk about KV caching and reasoning models and all the tricks and techniques.
0:31:14 That’s an end to end problem. It requires extreme co-design between the hardware, what’s the art of the possible, the model builders themselves and the dense and deep software stack that run on it.
0:31:19 NVIDIA actually has more software engineers than hardware engineers specifically for that process.
0:31:20 Right. Yep.
0:31:27 So to kind of zoom out for a second, harkening back to what you just said about, you know, thinking about what’s next.
0:31:33 We’ve been talking about MOE in the context of language models, predominantly.
0:31:38 And the GB200 NVL72 is really well suited to that architecture.
0:31:45 But is there a risk of focusing too narrowly on this single model trend of MOE?
0:31:48 What happens when we get, you know, sort of beyond MOE?
0:31:53 What happens? Is the architecture still well suited? Is the cost of tokens still going down?
0:32:03 How do you think about that going forward? And how does the, you know, the design that NVIDIA has today, you know, how is it ready for whatever the next trend might be?
0:32:28 Well, there’s one clear trend in AI is that intelligence creates opportunity. As the models get smarter, as they start to learn new things, or as they specialize in certain areas, they create opportunities to advance that industry, that science, that application, or just make computers more productive for you and I every day.
0:32:29 Yeah.
0:32:41 And in order to do that, we need to make the models smarter themselves. We need to use techniques like reasoning, which is only going to generate more tokens. And the only way to advance the state of the art of AI, well, there’s lots of ways.
0:32:55 One way NVIDIA can help is just reduce the cost of tokens. And MOE is just an optimization technique: if you don’t need all the neurons, don’t waste time computing on them.
0:33:12 That’s an idea. That’s not unique to LLMs and chatbots. That’s just a good idea. So we see it may materialize in different ways, and how these networks and experts want to communicate, or the shape of the models, is actually diversifying in lots of ways. There’s lots of different techniques.
0:33:31 Mixture of experts is certainly one of them that will stick around for a while. There’s lots of other hybrid approaches and other things that people are talking about, or trade-offs that you can make in order to reduce cost. But we see MOEs happening not just in chatbots: similar sparsity, MOE expert applications, being done in vision models and video models.
0:33:49 As the models are expanding into science, not just generating tokens which turn into the words that you and I talk about, but working on proteins, or working on material properties, or working on things like robotics and path planning, or logic, or business applications.
0:34:01 All of those will benefit from having a large intelligent model that can be sparsely optimized to only use and leverage the part that is needed for that particular question, that particular use case.
0:34:21 You can always go down to the, back down to the squirrel detector in a doorbell, but there’s, there’s usually a benefit to having a model that is actually able to reason about or has some, some multimodal aspects, maybe listen to what’s going on, see the things around it and be able to make intelligent decisions smartly.
0:34:31 That is going to continue to grow. And NVIDIA is not just working on MOEs. We’ve got lots of different irons in the fire. There’s lots of different models. The models are diverse.
0:34:44 I get to work in HPC as well. The whole supercomputing community has now embraced AI, building all sorts of models for simulating physics and simulating weather and things that look nothing like chatbots, but they’re going to use MOEs.
0:35:02 They’re going to use every trick in the book, because the opportunity is huge. The ability to revolutionize, like, biology, to do drug discovery, for cancer research alone, is an investment that the whole world’s making right now.
0:35:27 And they can take these ideas and take our platform and apply them to their domain, their problem, to take an open source model or a general model and fine-tune it to be a science model, or an application specific model, or a business model. That’s possible because they’re starting from a really intelligent model that can learn, or be used to teach another model, to make things possible.
0:35:28 Yeah.
0:35:52 So I’m super excited about MOEs, and we’ll continue to work on reducing the cost per token. And while that may make our technology bigger, smarter, more complicated at times, and make it more expensive, it is going to deliver X-factors in capability, improvement, and intelligence, and as a result, dramatically lower the cost per token.
0:36:11 Ian, for listeners who want to dive in further, we could talk about this all day, but you have things to go build and customers to take care of and all that good stuff. Where can listeners go online? What’s the best place to start to dive into MOEs, to, uh, the infrastructure you’ve been talking about to any and all of it?
0:36:20 Yeah. Check out GTC. You know, we started this conference a few years ago, well, maybe now over a decade ago. I guess I was there for the first one.
0:36:22 It’s called the GPU Technology Conference.
0:36:23 Right.
0:36:29 It’s not a business conference, although obviously many business people show up. It’s not a demo conference. It’s a developer conference.
0:36:30 Yeah.
0:36:39 And if you want to learn more, go check out GTC. We put all the presentations online. Jensen’s keynote’s wonderful. He has a, he’ll explain it even better than I can.
0:36:50 And you can watch; we actually do a few a year now. I encourage you to check out GTC, go see the old ones. And if you’re going to be in San Jose in March, please come check it out and attend.
0:37:00 There’s tons of sessions at every level from beginner to deep dive. If you want to go down to the hardware, all the NVIDIA experts will be there. All of the different developers are going to be there.
0:37:09 It’s kind of the go-to place to go learn, and also to present your work on what you can do with GPUs and the state of the art of AI. Check it out.
0:37:22 Perfect. Ian Buck, again, thank you. And you know, for what it’s worth, Jensen’s an amazing presenter. You did a great job explaining all this. So we appreciate you taking the time. And as always, all the best to you and your teams on continued progress.
0:37:23 Thank you.
Discover how mixture‑of‑experts (MoE) architecture is enabling smarter AI models without a proportional increase in the required compute and cost. Using vivid analogies and real-world examples, NVIDIA’s Ian Buck breaks down MoE models, their hidden complexities, and why extreme co-design across compute, networking, and software is essential to realizing their full potential. Learn more: https://blogs.nvidia.com/blog/mixture-of-experts-frontier-models/
