AI transcript
0:00:11 Dylan runs Semi Analysis, a well-respected research and analysis company that specializes
0:00:16 in semiconductors, GPUs, CPUs, and AI hardware in general.
0:00:23 Nathan is a research scientist at the Allen Institute for AI and is the author of the
0:00:27 amazing blog on AI called Interconnects.
0:00:32 They are both highly respected, read, and listened to by the experts, researchers, and
0:00:35 engineers in the field of AI.
0:00:38 And personally, I’m just a fan of the two of them.
So I use the DeepSeek moment that shook the AI world a bit as an opportunity to sit down
0:00:48 with them and lay it all out.
From DeepSeek, OpenAI, Google, xAI, Meta, Anthropic, to NVIDIA and TSMC, and to U.S.-China-Taiwan
relations, and everything else that is happening at the cutting edge of AI.
0:01:08 This conversation is a deep dive into many critical aspects of the AI industry.
0:01:13 While it does get super technical, we try to make sure that it’s still accessible to
0:01:19 folks outside of the AI field by defining terms, stating important concepts explicitly,
0:01:24 spelling out acronyms, and, in general, always moving across the several layers of abstraction
0:01:26 and levels of detail.
0:01:32 There is a lot of hype in the media about what AI is and isn’t.
0:01:38 The purpose of this podcast, in part, is to cut through the hype, through the bullshit,
0:01:45 and the low-resolution analysis, and to discuss in detail how stuff works and what the implications
0:01:46 are.
Let me also, if I may, comment on the new OpenAI o3-mini reasoning model, the release
0:01:58 of which we were anticipating during the conversation, and it did indeed come out right after.
0:02:05 Its capabilities and costs are on par with our expectations as we stated.
OpenAI o3-mini is indeed a great model, but it should be stated that DeepSeek R1 has similar
0:02:17 performance on benchmarks, is still cheaper, and it reveals its chain of thought reasoning
which o3-mini does not.
0:02:23 It only shows a summary of the reasoning.
Plus R1 is open-weight, and o3-mini is not.
By the way, I got a chance to play with o3-mini, and anecdotally, vibe-check-wise, I felt
that o3-mini, specifically o3-mini-high, is better than R1.
Still, for me personally, I find that Claude Sonnet 3.5 is the best model for programming, except
for tricky cases where I will use o1 Pro to brainstorm.
0:02:57 Either way, many more better AI models will come, including reasoning models, both from
0:03:00 American and Chinese companies.
0:03:03 They will continue to shift the cost curve.
0:03:07 But the “DeepSeek” moment is indeed real.
0:03:13 I think it will still be remembered five years from now as a pivotal event in tech history,
0:03:19 due in part to the geopolitical implications, but for other reasons too, as we discuss in
0:03:23 detail from many perspectives in this conversation.
And now, a quick few-second mention of each sponsor.
0:03:29 Check them out in the description, it’s the best way to support this podcast.
We got Invideo AI for video generation, GitHub for coding, Shopify for selling stuff online,
NetSuite for running your business, and AG1 for staying healthy.
0:03:44 Choose wisely, my friends.
Also, if you want to get in touch with me for whatever reason, go to lexfridman.com/contact.
And now, on to the full ad reads. No ads in the middle. I try to make these interesting,
but if you skip them, please still check out our sponsors. I enjoy their stuff.
0:04:01 Maybe you will too.
0:04:05 This video is brought to you by a new sponsor, but I’ve known these folks for a long time
and they’re a perfect fit for this podcast.
They’re called Invideo AI. It’s a video-generating app that allows you to create full-length videos
using just text prompts. It’s intuitive, it works amazingly well, and it’s truly incredible what
you can do.
0:04:28 I’ve been playing quite a bit and using it for stock footage, and by the way they make
0:04:35 it super easy for you to switch between actually available stock footage and AI generated footage.
0:04:41 I’ve been preparing a lot for a conversation with Tim Sweeney who is the creator of Unreal
Engine, and there are 3D worlds, and you get to think about the role of AI in generating
0:04:49 those 3D worlds.
0:04:52 That’s what’s coming, 5, 10, 20 years from now.
In video games and simulations, a fundamental part of our lives will be generated with
0:04:58 AI.
And I think Invideo AI does a masterful job of pushing us in that direction in the 2D
0:05:05 plane of video.
0:05:11 Now, I think this is not a tool that replaces human creativity.
0:05:14 I think it supercharges human creativity.
0:05:22 I think now and for a long, long time to come, humans will be in the loop of creating great
0:05:28 art because we’re creating for each other and only humans truly deeply know what makes
other humans go “ah,” like the old Kerouac line.
If you want to try out Invideo AI, you can do so for free at invideo.io/lexpod, saving
0:05:47 time and money on production costs.
0:05:53 This episode is brought to you by the thing that’s brought me joy for many, many years
0:06:00 and created a community for hundreds of thousands, millions, I don’t know how many developers
0:06:03 and that place is called GitHub.
0:06:11 It is a company that really has supercharged the developer community.
0:06:14 I mean, where would the world be without GitHub?
0:06:21 And they’re also, as a company, pushing the limits of what’s possible in terms of AI
0:06:24 code generation, AI assisted coding.
0:06:27 They were pioneers on co-pilot.
0:06:29 They are still pioneers in co-pilot.
0:06:33 It’s super competitive space and they are doing their best to win.
0:06:37 I will forever be a supporter of GitHub co-pilot.
0:06:41 Now it integrates in a bunch of IDEs, not just into VS Code.
0:06:45 I am, of course, a VS Code guy at this time.
0:06:48 I did use JetBrains for a long time.
0:06:50 I still dabble a little bit.
For people who don’t know, JetBrains has a plethora, I don’t like using that word, it
seems elitist, but there’s got to be a better word.
There are a lot of different sorts of sub-IDEs inside JetBrains.
I’ve even used DataGrip, which manages MySQL databases.
0:07:15 I should mention, and this might be embarrassing, but I have not, ooh, this might be interesting,
0:07:25 but I have not used anything like co-pilot on any database management GUIs.
0:07:29 I wonder if DataGrip integrates co-pilot.
0:07:31 I’m going to have to check that out.
0:07:38 But everything I use, I’m writing SQL queries from scratch inside the database management
0:07:39 GUI.
0:07:45 If I want to do complicated queries, I’ll go to any of the LLMs.
That’s going to be Claude Sonnet 3.5, or if it’s part of the code, then I’m going to be inside
0:07:52 my IDE.
0:07:57 I just like having a GUI management of a database.
0:07:58 I’m going to have to check that out with it.
0:08:01 If DataGrip integrates co-pilot, that’s going to be incredible.
0:08:05 If not, I’m going to yell from the top of my lungs, hoping it will eventually because
0:08:11 it’ll make my life a bit easier to have the visual component of a database together with
0:08:16 a code component of SQL queries, yeah, it will be amazing.
0:08:22 Anyway, go check out GitHub co-pilot at gh.io/copilot.
0:08:27 This episode is brought to you by Shopify, not Spotify, Shopify.
Easily confused; the CEOs get mistakenly tagged on X often.
0:08:33 They’re both great CEOs, but this is Shopify.
0:08:40 You can sell anywhere with a great looking online store using Shopify.
0:08:45 I’ve been learning a lot about the Silk Road actually, not the digital one.
0:08:54 The one that for a lot of human history served as a place for merchants to travel and trade
0:08:55 goods.
I’m reading a lot about Genghis Khan, who enforced the rule of law on the Silk Road, and that
0:09:09 actually had a big invigorating effect on the economy of the Eurasian region.
0:09:16 Anyway, that was before computers, if they had computers, imagine if they had computers.
Boy, would the Genghis Khan force be terrifying.
0:09:31 Or maybe not, maybe each technological age has their own kind of military tactician,
0:09:37 their own human that matches perfectly for that time in order to conquer the land and
0:09:38 people.
0:09:42 Still, what a terrifying time that was.
0:09:49 Much of human history, lots of beauty, but lots of ways to die.
0:09:56 So, I’m glad to be living in the 21st century where I can sit back with a margarita.
I don’t drink margaritas, but if I wanted to, I could, and then buy stuff from stores created
0:10:02 by Shopify.
0:10:10 Anyway, you can sign up for a $1 per month trial period at Shopify.com/Lex, go to Shopify.com/Lex
0:10:13 to take your business to the next level today.
This episode was also brought to you by NetSuite, an all-in-one business management system.
0:10:22 Not sure why I said that so slowly, but I did.
0:10:29 I actually did a little intermission for five, six minutes for this episode where I added
in the middle of it an addendum after having tried OpenAI o3-mini.
0:10:42 That was such a weird feeling to sort of insert myself in the middle of an episode.
0:10:44 I felt like a third wheel to myself.
0:10:47 It’s like, “Hey, hey everyone, what are you doing?
0:10:50 Why did you guys not invite me to this party?”
0:10:52 That’s what I felt like.
Hey Lex from the past, it’s me, Lex from the future.
Right, I should be talking about NetSuite, which is an all-in-one cloud business management
0:11:00 system.
0:11:11 It’s the machine inside the machine and boy, are we increasingly building stacks of machines.
0:11:18 Layers and layers and layers of abstraction until we’re just sitting back on a beach somewhere
0:11:22 talking to an AI system that’s taking care of everything else.
0:11:28 Anyway, you can download the CFO’s guide to AI and Machine Learning at Netsuite.com/Lex.
0:11:37 This episode is also brought to you by AG1, an all-in-one daily drink to support better
0:11:38 health and performance.
0:11:39 I drank it today.
0:11:40 I enjoyed it today.
0:11:42 I’ve been sleeping very, very little.
0:11:47 The amount of work I have to do is insane.
Last night I went to bed at 7 a.m., 8 a.m., thinking about doing an all-nighter.
0:11:56 It’s madness.
0:12:03 But anyway, at 6 a.m., I drank an AG1 and I was sitting in a couch and I was watching
like 10 minutes of American Primeval.
0:12:13 I watched like 5, 10 minutes of a show at a time and I was sipping on the AG1 and I was
0:12:20 thinking how lucky, how fucking lucky I am to be alive.
0:12:25 First of all because I’m watching the American Frontier and people being just brutal to each
0:12:31 other, the brutal reality of nature and war during that time and the lawlessness during
0:12:32 that time.
But also just how lucky I am to be on this spinning rock, drinking this green healthy drink.
0:12:48 Being able to watch a show, being able to work hard towards the thing I love, being able
0:12:51 to love, being able to breathe, all of it.
0:12:52 Just amazing.
Anyway, they’ll give you a one-month supply of fish oil when you sign up at drinkag1.com/lex.
This is the Lex Fridman Podcast.
0:13:06 To support it, please check out our sponsors in the description.
0:13:28 And now, dear friends, here’s Dylan Patel and Nathan Lambert.
A lot of people are curious to understand China’s DeepSeek AI models, so let’s lay
0:13:33 it out.
Can you describe what DeepSeek V3 and DeepSeek R1 are, how they work, how they’re trained?
0:13:43 Let’s look at the big picture and then we’ll zoom in on the details.
Yeah, so DeepSeek V3 is a new mixture-of-experts transformer language model from DeepSeek,
which is based in China.
0:13:58 They have some new specifics in the model that we’ll get into.
0:14:03 Largely, this is an open-weight model and it’s an instruction model like what you would
use in ChatGPT.
0:14:09 They also released what is called the base model, which is before these techniques of
0:14:11 post-training.
0:14:16 Most people use instruction models today and those are what’s served in all sorts of applications.
0:14:21 This was released, I believe, December 26th or that week.
And then weeks later, on January 20th, DeepSeek released DeepSeek R1, which is a reasoning
0:14:33 model which really accelerated a lot of this discussion.
This reasoning model has a lot of overlapping training steps with DeepSeek V3, and it’s confusing
0:14:44 that you have a base model called v3 that you do something to to get a chat model and
0:14:47 then you do some different things to get a reasoning model.
0:14:51 I think a lot of the AI industry is going through this challenge of communications right now
0:14:54 where OpenAI makes fun of their own naming schemes.
They have GPT-4o, they have OpenAI o1, and there’s a lot of types of models, so we’re
0:15:02 going to break down what each of them are.
There’s a lot of technical specifics on training, and we’ll go from high-level to specific and kind
0:15:09 of go through each of them.
0:15:13 There’s so many places we can go here, but maybe let’s go to open weights first.
0:15:17 What does it mean for a model to be open weights and what are the different flavors of open
0:15:18 source in general?
0:15:22 Yeah, so this discussion has been going on for a long time in AI, it became more important
since ChatGPT, or more focal since ChatGPT, at the end of 2022.
0:15:33 Open weights is the accepted term for when model weights of a language model are available
0:15:35 on the internet for people to download.
0:15:39 Those weights can have different licenses, which is effectively the terms by which you
0:15:41 can use the model.
0:15:44 There are licenses that come from history and open source software.
0:15:48 There are licenses that are designed by companies specifically.
All of Llama, DeepSeek, Qwen, Mistral, these popular names in open weight models have some
0:15:57 of their own licenses.
It’s complicated because not all of these models have the same terms.
0:16:06 The big debate is on what makes a model open weight.
0:16:07 Why are we saying this term?
0:16:08 It’s kind of a mouthful.
0:16:12 It sounds close to open source, but it’s not the same.
0:16:16 There’s still a lot of debate on the definition and soul of open source AI.
Open source software has a rich history on freedom to modify, freedom to take it on your
own, freedom from many restrictions on how you would use the software, and what that means
0:16:31 for AI is still being defined.
0:16:33 For what I do, I work at the Allen Institute for AI.
0:16:34 We’re a nonprofit.
0:16:39 We want to make AI open for everybody and we try to lead on what we think is truly open
0:16:40 source.
0:16:43 There’s not full agreement in the community, but for us that means releasing the training
0:16:49 data, releasing the training code, and then also having open weights like this.
0:16:52 We’ll get into the details of the models.
0:16:57 Again and again, as we try to get deeper into how the models were trained, we will say things
0:17:02 like the data processing, data filtering, data quality is the number one determinant
0:17:07 of the model quality and then a lot of the training code is the determinant on how long
0:17:10 it takes to train and how fast your experimentation is.
0:17:18 Without fully open source models where you have access to this data, it’s harder to replicate.
We’ll get into cost numbers for DeepSeek V3 on mostly GPU hours and how much you could
0:17:28 pay to rent those yourselves, but without the data, the replication cost is going to
0:17:31 be far, far higher.
0:17:32 Same goes for the code.
0:17:37 We should also say that this is probably one of the more open models out of the frontier
0:17:39 models.
There’s a full spectrum here, where probably the fullest open source is, like you said, open code, open
data, open weights; this is not open code.
This is probably not open data, and this is open weights.
The licensing is the MIT license, or, I mean, there’s some nuance in the different models, but it’s
towards the free end; in terms of the open source movement, these are the good guys.
DeepSeek is doing fantastic work for disseminating understanding of AI.
0:18:19 Their papers are extremely detailed in what they do and for other teams around the world,
0:18:25 they’re very actionable in terms of improving your own training techniques.
0:18:27 We’ll talk about licenses more.
0:18:32 The DeepSeq R1 model has a very permissive license, it’s called the MIT license.
0:18:36 That effectively means there’s no downstream restrictions on commercial use.
0:18:38 There’s no use case restrictions.
0:18:43 You can use the outputs from the models to create synthetic data.
0:18:44 This is all fantastic.
0:18:48 I think the closest peer is something like Lama, where you have the weights and you have
0:18:50 a technical report.
The technical report is very good for Llama; one of the most-read PDFs of the year
last year is the Llama 3 paper, but in some ways it’s slightly less actionable.
0:19:03 It has less details on the training specifics, less plots and so on.
The Llama 3 license is more restrictive than MIT, and then between the DeepSeek custom license
and the Llama license, we can get into this whole rabbit hole.
0:19:16 I think we’ll make sure we want to go down the license rabbit hole before we do specifics.
0:19:17 Yeah.
It should be stated that one of the implications of DeepSeek is that it puts pressure on Llama and everybody
else, on OpenAI, to push towards open source.
0:19:30 That’s the other side of open source that you mentioned is how much is published in
0:19:32 detail about it.
0:19:38 How open are you with the insights behind the code?
How good are the technical reports?
0:19:43 Are they hand wavy or is there actual details in there?
That’s one of the things that DeepSeek did well: they published a lot of the details.
0:19:47 Yeah.
Especially in the DeepSeek V3, which is their pre-training paper, they were very clear that
0:19:58 they are doing interventions on the technical stack that go at many different levels.
0:20:03 For example, to get highly efficient training, they’re making modifications at or below
0:20:06 the CUDA layer for NVIDIA chips.
0:20:10 I have never worked there myself and there are a few people in the world that do that
very well, and some of them are at DeepSeek.
These types of people are at DeepSeek and leading American frontier labs, but there are not many
0:20:19 places.
0:20:25 To help people understand the other implication of open weights, there’s a topic we’ll return
0:20:26 to often here.
0:20:38 There’s a fear that China, the nation, might have interest in stealing American data, violating
0:20:40 privacy of American citizens.
0:20:45 What can we say about open weights to help us understand what the weights are able to
0:20:49 do in terms of stealing people’s data?
0:20:54 These weights that you can download from Huggingface or other platforms are very big matrices of
0:20:55 numbers.
0:20:59 You can download them to a computer in your own house that has no internet and you can
0:21:03 run this model and you’re totally in control of your data.
0:21:07 That is something that is different than how a lot of language model usage is actually
0:21:12 done today, which is mostly through APIs, where you send your prompt to GPUs run by
0:21:14 certain companies.
0:21:17 These companies will have different distributions and policies on how your data is stored, if
0:21:23 it is used to train future models, where it is stored, if it is encrypted, and so on.
With open weights, the fate of your data is in your own hands, and that is something
0:21:31 that is deeply connected to the soul of open source.
0:21:35 It’s not the model that steals your data, it’s whoever’s hosting the model, which could
be China, if you’re using the DeepSeek app, or it could be Perplexity.
0:21:46 You’re trusting them with your data, or OpenAI, you’re trusting them with your data.
0:21:48 Some of these are American companies, some of these are Chinese companies, but the model
0:21:51 itself is not doing the stealing.
0:21:52 That’s the host.
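To make the "weights in your own hands" point concrete, here is a minimal sketch of running an open-weight model locally with the Hugging Face transformers library. The repo id is an assumption (a small distilled R1 variant, since the full checkpoints are far too large for most single machines), and nothing here is specific to how DeepSeek themselves serve the model.

```python
# A minimal sketch of running an open-weight model entirely on your own machine.
# The repo id below is an assumption (a small distilled R1 variant); the full
# DeepSeek V3/R1 checkpoints are far too large for most single machines.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # assumed Hugging Face repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# After the weights are downloaded once, this generation step needs no network
# connection at all, so your prompt never leaves your computer.
inputs = tokenizer("Explain the history of the Roman Empire to me.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Once the files are on disk, the generation step above runs with no internet connection, which is the point being made here.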
0:21:56 All right, so back to the basics.
0:22:01 What’s the difference between DeepSeek v3 and DeepSeek r1?
0:22:05 Can we try to lay out the confusion potential?
Yes, so for one, I very much understand many people being confused by these two
0:22:11 model names.
0:22:15 So I would say the best way to think about this is that when training a language model,
you have what is called pre-training, which is when, on large amounts of mostly internet
text, you’re trying to predict the next token, and what to know about
0:22:30 these new DeepSeek models is that they do this internet large-scale pre-training once
0:22:33 to get what is called DeepSeek v3 base.
0:22:34 This is the base model.
0:22:37 It’s just going to finish your sentences for you.
0:22:42 It’s going to be harder to work with than ChatGPT, and then what DeepSeek did is they’ve
0:22:49 done two different post-training regimes to make the models have specific desirable behaviors.
So one is the more normal model in terms of the last few years of AI: an instruct model,
a chat model, an “aligned model,” a helpful model.
There are many ways to describe this; it is more standard post-training.
0:23:06 So this is things like instruction tuning, reinforcement learning from human feedback.
0:23:08 We’ll get into some of these words.
0:23:12 And this is what they did to create the DeepSeek v3 model.
0:23:18 This was the first model to be released, and it is very high-performance, it’s competitive
with GPT-4, Llama 405B, and so on.
0:23:26 And then when this release was happening, we don’t know their exact timeline, or soon
0:23:32 after they were finishing the training of a different training process from the same
0:23:37 next token prediction base model that I talked about, which is when this new reasoning training
0:23:41 that people have heard about comes in in order to create the model that is called DeepSeek
0:23:42 R1.
The R throughout this conversation is good grounding for reasoning, and the name is
also similar to OpenAI’s o1, which is the other reasoning model that people have heard
0:23:52 about.
0:23:56 And we’ll have to break down the training for R1 in more detail, because for one, we
0:24:02 have a paper detailing it, but also it is a far newer set of techniques for the AI community,
0:24:06 so it’s a much more rapidly evolving area of research.
0:24:13 Maybe we should also say the big two categories of training of pre-training and post-training,
0:24:14 these umbrella terms that people use.
0:24:20 So what is pre-training and what is post-training, and what are the different flavors of things
0:24:22 underneath post-training umbrella?
Yeah, so for pre-training, I’m using some of the same words that really get the message across:
you’re doing what is called autoregressive prediction to predict the next token in a
0:24:32 series of documents.
This is done over, as standard practice, trillions of tokens, so this is a ton of data that is
0:24:41 mostly scraped from the web.
In some of DeepSeek’s earlier papers, they talk about their training data being distilled
0:24:47 for math.
0:24:52 I shouldn’t use this word yet, but taken from Common Crawl, and that’s a public access
0:24:56 that anyone listening to this could go download data from the Common Crawl website.
0:24:58 This is a crawler that is maintained publicly.
Yes, other tech companies eventually shift to their own crawler, and DeepSeek likely has
0:25:05 done this as well, as most frontier labs do.
0:25:10 But this sort of data is something that people can get started with, and you’re just predicting
0:25:12 text in a series of documents.
0:25:19 This can be scaled to be very efficient, and there’s a lot of numbers that are thrown
0:25:24 around in AI training, like how many floating-point operations or flops are used, and you can
0:25:30 also look at how many hours of these GPUs that are used.
0:25:37 It’s largely one-loss function taken to a very large amount of compute usage.
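For reference, the "one loss function" being described is the standard next-token cross-entropy objective. This is textbook notation, not a DeepSeek-specific detail:

\[
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
\]

averaged over the trillions of tokens in the pre-training corpus.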
You set up really efficient systems, and then at the end of that you have the base model,
and post-training is where there is a lot more complexity in terms of how the process
0:25:55 is emerging or evolving, and the different types of training losses that you will use.
0:26:00 This is a lot of techniques grounded in the natural language processing literature.
0:26:04 The oldest technique, which is still used today, is something called instruction tuning,
0:26:07 or also known as supervised fine-tuning.
These acronyms will be IFT or SFT, and people really go back and forth between them,
0:26:17 and I will probably do the same, which is where you add this formatting to the model,
0:26:23 where it knows to take a question that is like, “Explain the history of the Roman Empire
to me,” or the sort of question you’ll see on Reddit or Stack Overflow, and then the model
will respond in an information-dense but presentable manner.
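As a rough illustration of the formatting being described, instruction-tuning data is usually stored as role-tagged messages and flattened into a single training string. The tags below are hypothetical placeholders, not the actual special tokens of DeepSeek or any particular model:

```python
# Illustrative only: the special tokens differ between model families, but
# instruction tuning teaches the model a structure roughly like this, so it
# knows where the user's question ends and where its answer should begin.
example = {
    "messages": [
        {"role": "user", "content": "Explain the history of the Roman Empire to me."},
        {"role": "assistant", "content": "The Roman Empire began in 27 BC when ..."},
    ]
}

def to_training_text(messages):
    # The <|...|> tags here are hypothetical placeholders, not any model's real tokens.
    parts = [f"<|{m['role']}|>\n{m['content']}" for m in messages]
    return "\n".join(parts) + "\n<|end|>"

print(to_training_text(example["messages"]))
```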
0:26:38 The core of that formatting is in this instruction-tuning phase, and then there’s two other categories
0:26:41 of loss functions that are being used today.
0:26:44 One I will classify as preference fine-tuning.
0:26:48 Preference fine-tuning is a generalized term for what came out of reinforcement learning
0:26:52 from human feedback, which is RLHF.
0:26:58 This reinforcement learning from human feedback is credited as the technique that helped chat
0:27:00 GPT break through.
0:27:05 It is a technique to make the responses that are nicely formatted, like these Reddit answers,
0:27:08 more in tune with what a human would like to read.
0:27:13 This is done by collecting pairwise preferences from actual humans out in the world to start,
0:27:18 and now AIs are also labeling this data, and we’ll get into those trade-offs.
0:27:23 You have this kind of contrastive loss function between a good answer and a bad answer.
0:27:25 The model learns to pick up these trends.
0:27:27 There’s different implementation ways.
0:27:29 You have things called reward models.
0:27:31 You could have direct alignment algorithms.
0:27:35 There’s a lot of really specific things you can do, but all of this is about fine-tuning
0:27:37 to human preferences.
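One concrete form of that contrastive loss is the Bradley-Terry reward-model objective used in many RLHF pipelines; direct alignment algorithms like DPO are a related variant. The sketch below is generic and assumes you already have scalar scores for the chosen and rejected answers:

```python
import torch
import torch.nn.functional as F

# One concrete form of the "contrastive loss between a good answer and a bad
# answer": the Bradley-Terry reward-model loss used in many RLHF pipelines.
# r_chosen and r_rejected are scalar scores a reward model assigns to the
# preferred and rejected completions of the same prompt.
def pairwise_preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Push the chosen answer's score above the rejected answer's score.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage with made-up scores for a batch of three comparisons.
loss = pairwise_preference_loss(torch.tensor([1.2, 0.3, 2.0]),
                                torch.tensor([0.1, 0.5, -1.0]))
print(loss.item())
```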
0:27:43 The final stage is much newer and will link to what is done in R1, and these reasoning
0:27:46 models is, I think, OpenAI’s name for this.
0:27:51 They had this new API in the fall, which they called the Reinforcement Fine-Tuning API.
0:27:55 This is the idea that you use the techniques of reinforcement learning, which is a whole
0:27:56 framework of AI.
0:27:58 There’s a deep literature here.
0:28:04 To summarize, it’s often known as trial and error learning, or the subfield of AI where
0:28:10 you’re trying to make sequential decisions in a certain potentially noisy environment.
0:28:14 There’s a lot of ways we can go down that, but fine-tuning language models where they
0:28:19 can generate an answer, and then you check to see if the answer matches the true solution.
0:28:24 For math or code, you have an exactly correct answer for math.
0:28:26 You can have unit tests for code.
What we’re doing is we are checking the language model’s work, and we’re giving it multiple
0:28:32 opportunities on the same questions to see if it is right.
If you keep doing this, the models can learn to improve in verifiable domains to a great extent.
0:28:39 It works really well.
0:28:42 It’s a newer technique in the academic literature.
It’s been used at frontier labs in the US, which don’t share every detail, for multiple years.
0:28:52 This is the idea of using reinforcement learning with language models, and it has been taking
off, especially in this DeepSeek moment.
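A minimal sketch of the verifiable-reward idea for math, under the assumption that the model is prompted to end with a line like "Answer: <number>"; the sampling and policy update are stubbed out, and only the grading step is concrete:

```python
import re

# A sketch of reinforcement learning with verifiable rewards for math: sample
# several answers to the same question, grade each against the known solution,
# and feed the resulting 0/1 rewards into a policy update (stubbed out here).
# The "Answer: <number>" convention is an assumption about how the model is prompted.
def extract_final_answer(completion: str) -> str:
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", completion)
    return match.group(1) if match else ""

def verifiable_reward(completion: str, ground_truth: str) -> float:
    return 1.0 if extract_final_answer(completion) == ground_truth else 0.0

ground_truth = "84"  # the known solution to "What is 12 * 7?"
attempts = [
    "12 * 7 = 84. Answer: 84",
    "12 * 7 is around seventy-something. Answer: 74",
]
rewards = [verifiable_reward(a, ground_truth) for a in attempts]
print(rewards)  # [1.0, 0.0]; these rewards would drive the policy-gradient step
```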
0:29:00 We should say that there’s a lot of exciting stuff going on, again, across the stack, but
in post-training, probably this year, there are going to be a lot of interesting developments.
0:29:06 We’ll talk about it.
I almost forgot to talk about the difference between DeepSeek V3 and R1 on the user experience
0:29:13 side.
0:29:16 Forget the technical stuff, forget all of that.
0:29:19 People that don’t know anything about AI, they show up.
0:29:20 What’s the actual experience?
0:29:24 What’s the use case for each one when they actually type and talk to it?
0:29:26 What is each good at, that kind of thing?
Let’s start with DeepSeek V3 again.
It’s what more people would have tried, or something like it.
0:29:35 You ask it a question, it’ll start generating tokens very fast, and those tokens will look
0:29:38 like a very human legible answer.
0:29:41 It’ll be some sort of markdown list.
It might have formatting to help draw you to the core details in the answer, and it’ll
0:29:49 generate tens to hundreds of tokens.
A token is normally a word, for common words, or a sub-word part of a longer word.
0:30:01 It’ll look like a very high-quality Reddit or Stack Overflow answer.
0:30:06 These models are really getting good at doing these across a wide variety of domains.
Even things that, if you’re an expert, are close to the fringe of knowledge,
they will still be fairly good at.
Cutting-edge AI topics that I do research on, these models are capable as a study aid,
0:30:23 and they’re regularly updated.
Where this changes is with DeepSeek R1, what is called these reasoning models, is
0:30:34 when you see tokens coming from these models to start, it will be a large chain of thought
0:30:35 process.
0:30:39 We’ll get back to chain of thought in a second, which looks like a lot of tokens where the
0:30:41 model is explaining the problem.
0:30:45 The model will often break down the problem and be like, “Okay, they asked me for this.
0:30:46 Let’s break down the problem.
0:30:50 I’m going to need to do this,” and you’ll see all of this generating from the model.
0:30:52 It’ll come very fast in most user experiences.
0:30:55 These APIs are very fast, so you’ll see a lot of tokens, a lot of words show up really
0:30:56 fast.
0:31:01 It’ll keep flowing on the screen, and this is all the reasoning process, and then eventually
0:31:05 the model will change its tone in R1, and it’ll write the answer, where it summarizes
0:31:11 its reasoning process and writes a similar answer to the first types of model.
In DeepSeek’s case, which is part of why this was so popular even outside the AI community,
0:31:21 is that you can see how the language model is breaking down problems.
0:31:24 You get this answer on a technical side.
0:31:27 They train the model to do this specifically where they have a section, which is reasoning,
0:31:31 and then it generates a special token, which is probably hidden from the user most of the
0:31:35 time, which says, “Okay, I’m starting the answer,” so the model is trained to do this
0:31:37 two-stage process on its own.
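As a sketch of that two-stage output, the open R1 release is commonly served with the chain of thought wrapped in think tags before the final answer; treat the exact tag names below as an assumption about the serving format rather than a guaranteed detail:

```python
# Splitting an R1-style completion into its reasoning and its final answer.
# The <think> ... </think> delimiters reflect how the open R1 release is
# commonly served; treat the exact tag names as an assumption here.
def split_reasoning(completion: str) -> tuple[str, str]:
    start, end = "<think>", "</think>"
    if start in completion and end in completion:
        reasoning = completion.split(start, 1)[1].split(end, 1)[0].strip()
        answer = completion.split(end, 1)[1].strip()
        return reasoning, answer
    return "", completion.strip()

demo = ("<think>The user wants one truly novel insight... is this truly novel? "
        "Let me dig deeper.</think>Humans convert selfish desires into cooperative systems.")
reasoning, answer = split_reasoning(demo)
print(reasoning)
print(answer)
```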
0:31:43 If you use a similar model in, say, OpenAI, OpenAI’s user interface is trying to summarize
0:31:49 this process for you nicely by showing the sections that the model is doing, and it’ll
0:31:54 kind of click through, it’ll say, breaking down the problem, making X calculation, cleaning
0:31:58 the result, and then the answer will come for something like OpenAI.
0:32:03 Maybe it’s useful here to go through an example of a DeepSeq R1 reasoning.
And so, if you’re looking at the screen here, what you’ll see is a screenshot of the DeepSeek
chat app, and at the top is “Thought for 151 seconds” with the drop-down arrow.
0:32:18 Underneath that, if we were in an app that we were running, the drop-down arrow would
0:32:19 have the reasoning.
0:32:25 So, in this case, the specific question, which, you know, I’m philosophically/podhead
inclined, so this is asking DeepSeek R1 for one truly novel insight about humans.
0:32:39 And it reveals the reasoning, and basically, the truly novel aspect is what’s pushing
0:32:44 the reasoning to constantly sort of the model asking itself, “Is this truly novel?”
0:32:50 So it’s actually challenging itself to be more novel, more counterintuitive, less cringe,
0:32:51 I suppose.
0:32:57 So some of the reasoning says, this is just snapshots, “Alternatively, humans have a
0:33:01 unique meta-emotion where they feel emotions about their own emotions, e.g. feeling guilty
0:33:02 about being angry.
This recursive emotional layering creates complex motivational drives that don’t exist
0:33:07 in other animals.
0:33:09 The insight is that human emotions are nested.”
0:33:14 So it’s like, it’s reasoning through how humans feel emotions.
0:33:15 It’s reasoning about meta-emotions.
0:33:17 It’s going to have pages and pages of this.
0:33:20 It’s almost too much to actually read, but it’s nice to skim as it’s coming.
0:33:21 It’s a stream of consciousness.
0:33:26 It’s a James Joyce-like stream of consciousness, and then it goes, “Wait, the user wants something
0:33:28 that’s not seen anywhere else.
0:33:30 Let me dig deeper.”
0:33:35 And consider the human ability to hold contradictory beliefs simultaneously, cognitive dissonance
0:33:41 is known, but perhaps the function is to allow flexible adaptation, so on and so forth.
0:33:50 I mean, that really captures the public imagination that, holy shit, this isn’t, I mean, intelligence
0:33:57 slash almost like an inkling of sentience, because you’re thinking through, you’re self-reflecting,
0:33:59 you’re deliberating.
0:34:06 And the final result of that, after 157 seconds, is humans instinctively convert selfish desires
0:34:13 into cooperative systems by collectively pretending abstract rules, money, laws, rights are real.
0:34:18 These shared hallucinations act as, quote, “games,” where competition is secretly redirected
0:34:25 to benefit the group, turning conflict into society’s fuel, pretty profound, I mean, you
0:34:26 know.
This is a bit of a digression, but a lot of people have found that these reasoning
0:34:34 models can sometimes produce much more eloquent text.
0:34:39 That is at least an interesting example, I think, depending on how open-minded you are,
0:34:42 you find language models interesting or not, and there’s a spectrum there.
0:34:47 Well, I mean, we’ll talk about different benchmarks as well, but some is just a vibe.
Like that, in itself, is a, let’s say, quote, “fire tweet,” if I’m trying to produce something
where people are like, “Oh, shit.” Okay, so that’s a chain of thought; we’ll probably
return to it more.
0:35:07 How are they able to achieve such low cost on the training and the inference?
Maybe you could talk through the training first.
0:35:16 Yeah, so there’s two main techniques that they implemented that are probably the majority
0:35:20 of their efficiency, and then there’s a lot of implementation details that maybe we’ll
0:35:23 gloss over or get into later that sort of contribute to it.
0:35:29 But those two main things are, one, is they went to a mixture of experts model, which
0:35:30 we’ll define in a second.
And then the other thing is that they invented this new technique called MLA, multi-head latent attention.
0:35:36 Both of these are big deals.
0:35:40 Mixture of experts is something that’s been in the literature for a handful of years,
0:35:46 and OpenAI with GPT-4 was the first one to productize a mixture of experts model.
0:35:51 And what this means is, when you look at the common models around that most people have
0:35:55 been able to interact with that are open, think Lama.
0:36:01 Lama is a dense model, i.e., every single parameter or neuron is activated as you’re
0:36:05 going through the model for every single token you generate.
0:36:08 Now with a mixture of experts model, you don’t do that.
0:36:10 How does the human actually work?
0:36:16 Well, my visual cortex is active when I’m thinking about vision tasks and other things.
0:36:18 My amygdala is when I’m scared.
0:36:21 These different aspects of your brain are focused on different things.
0:36:24 A mixture of experts model attempts to approximate this to some extent.
0:36:30 It’s nowhere close to what a brain architecture is, but different portions of the model activate.
0:36:34 You’ll have a set number of experts in the model and a set number that are activated each
0:36:35 time.
0:36:38 And this dramatically reduces both your training and inference costs.
0:36:44 Because now, if you think about the parameter count as the total embedding space for all
0:36:49 of this knowledge that you’re compressing down during training, when you’re embedding
0:36:54 this data in instead of having to activate every single parameter every single time you’re
0:36:58 training or running inference, now you can just activate a subset.
0:37:01 And the model will learn which expert to route to for different tasks.
0:37:06 And so this is a humongous innovation in terms of, hey, I can continue to grow the total
0:37:08 embedding space of parameters.
And so DeepSeek’s model is 600-something billion parameters.
Relative to Llama 405B, it’s 405 billion parameters.
Relative to Llama 70B, it’s 70 billion parameters.
0:37:23 So this model technically has more embedding space for information to compress all of the
0:37:25 world’s knowledge that’s on the internet down.
0:37:31 But at the same time, it is only activating around 37 billion of the parameters.
0:37:35 So only 37 billion of these parameters actually need to be computed every single time you’re
0:37:38 training data or inferencing data out of it.
And so versus, again, the Llama model, 70 billion parameters must be activated, or 405 billion
0:37:44 parameters must be activated.
0:37:49 So you’ve dramatically reduced your compute cost when you’re doing training and inference
0:37:51 with this mixture of experts architecture.
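To put rough numbers on it, about 37 billion active out of 600-something billion total means on the order of six percent of the parameters are touched per token. Below is a toy top-k mixture-of-experts layer; the dimensions and expert counts are made up, and DeepSeek's real layer additionally uses shared experts and the routing changes discussed later, so this is only a sketch of the mechanism:

```python
import torch
import torch.nn as nn

# Toy mixture-of-experts layer: a router picks the top-k experts per token,
# and only those experts' weights are used for that token. Dimensions and
# expert counts here are made up for illustration.
class ToyMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                      # x: [tokens, d_model]
        scores = self.router(x)                # [tokens, n_experts]
        weights, idx = torch.topk(scores.softmax(dim=-1), self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):             # only k of n_experts run per token
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

print(ToyMoE()(torch.randn(4, 64)).shape)  # torch.Size([4, 64])
```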
So should we break down where it actually applies and go into the transformer?
Is that useful?
0:37:57 Let’s go.
0:37:58 Let’s go into the transformer.
0:38:04 The transformer is a thing that is talked about a lot, and we will not cover every detail.
0:38:09 Essentially the transformer is built on repeated blocks of this attention mechanism, and then
a traditional dense, fully connected multilayer perceptron, whatever word you want to use
0:38:19 for your normal neural network, and you alternate these blocks, there’s other details.
0:38:22 And where a mixture of experts is applied is at this dense model.
The dense model holds most of the weights if you count them in a transformer model.
0:38:32 So you can get really big gains from this mixture of experts on parameter efficiency,
0:38:37 at training and inference, because you get this efficiency by not activating all of these
0:38:38 parameters.
0:38:43 We should also say that a transformer is a giant neural network.
And for 15 years now, there’s been what’s called the deep learning revolution.
Networks have gotten larger and larger, and at a certain point the scaling laws appeared where
0:38:54 people realized…
0:38:57 This is a scaling law shirt by the way.
0:39:04 Representing scaling laws, where it became more and more formalized that bigger is better
0:39:07 across multiple dimensions of what bigger means.
0:39:12 But these are all neural networks we’re talking about, and we’re talking about different architectures
0:39:17 of how to construct these neural networks such that the training and the inference on
0:39:19 them is super efficient.
Every different type of model has a different scaling law for it, which effectively says, for
0:39:29 how much compute you put in, the architecture will get to different levels of performance
0:39:30 at test tasks.
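One commonly cited parametric form of such a scaling law, from the Chinchilla line of work and used here only as an illustration, fits loss as a function of parameter count N and training tokens D:

\[
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
\]

where E, A, B, alpha, and beta are constants fit per architecture and training setup; different architectures trace out different such curves for the same compute.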
0:39:34 And mixture of experts is one of the ones at training time, even if you don’t consider
0:39:36 the inference benefits, which are also big.
0:39:41 At training time, your efficiency with your GPUs is dramatically improved by using this
0:39:43 architecture if it is well implemented.
0:39:50 So you can get effectively the same performance model and evaluation scores with numbers like
0:39:51 30% less compute.
0:39:55 I think there’s going to be a wide variation depending on your implementation details and
0:39:56 stuff.
0:40:00 But it is just important to realize that this type of technical innovation is something
0:40:02 that gives huge gains.
0:40:07 And I expect most companies that are serving their models to move to this mixture of experts
implementation. Historically, the reason why not everyone might do it is because of the
0:40:15 implementation complexity, especially when doing these big models.
So this is one of the things that DeepSeek gets credit for: they do this extremely
0:40:20 well.
They do this mixture of experts extremely well. This architecture, for what is called DeepSeekMoE,
where MoE is the shortened version of mixture of experts, is multiple papers old.
0:40:35 This part of their training infrastructure is not new to these models alone.
And the same goes for what Dylan mentioned with multi-head latent attention; it’s all about reducing
memory usage during inference, and the same during training, by using some fancy low-rank
approximation math.
0:40:51 If you get into the details with this latent attention, it’s one of those things that I
0:40:56 look at and say, okay, they’re doing really complex implementations because there’s other
0:41:01 parts of language models such as embeddings that are used to extend the context length.
The common one that DeepSeek uses is rotary positional embeddings, which is called RoPE.
And if you want to use RoPE with a normal MoE, it’s kind of a sequential thing.
You take two of the attention matrices and you rotate them by a complex-valued
rotation, which is a matrix multiplication. With DeepSeek’s MLA, with this new attention
architecture, they need to do some clever things, because they’re not set up the same,
and it just makes the implementation complexity much higher.
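For readers who want to see the rotation being described, here is a minimal, generic RoPE sketch; real implementations differ in channel layout and caching, and this is not DeepSeek's code:

```python
import torch

# A minimal rotary positional embedding (RoPE): each pair of query/key
# channels is rotated by an angle that grows with the token's position,
# which is the "rotation by a complex value" being described. Real
# implementations differ in channel layout and caching; this is a sketch.
def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    seq_len, dim = x.shape                      # x: [seq_len, head_dim], head_dim even
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

print(apply_rope(torch.randn(5, 8)).shape)  # torch.Size([5, 8])
```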
0:41:30 So they’re managing all of these things.
And these are probably the sort of things that OpenAI, these closed labs, are doing.
0:41:37 We don’t know if they’re doing the exact same techniques, but they actually shared them
0:41:42 with the world, which is really nice to be like, this is the cutting edge of efficient
0:41:43 language model training.
And some of this requires low-level engineering; it is just a giant mess of trickery.
So as I understand it, they go one level below CUDA, so super low-level programming of GPUs.
Effectively, NVIDIA builds this library called Nickel, right?
0:42:03 In which, you know, when you’re training a model, you have all these communications
0:42:06 between every single layer of the model and you may have over a hundred layers.
What does Nickel stand for?
0:42:08 It’s NCCL.
NVIDIA Collective Communications Library.
0:42:12 Nice.
0:42:13 Damn.
And so, when you’re training a model, right, you’re going to have all these all-reduces
and all-gathers, right?
Between each layer, between the multi-layer perceptron or feed-forward network and the
attention mechanism, you’ll have basically the model synchronized, right?
Or you’ll have an all-reduce and an all-gather.
0:42:36 And this is a communication between all the GPUs in the network, whether it’s in training
0:42:37 or inference.
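Conceptually, an all-reduce leaves every GPU holding the element-wise sum of everyone's tensor, and an all-gather leaves every GPU holding a copy of everyone's tensor. The sketch below uses torch.distributed with a single-process "gloo" group purely so it runs anywhere; on NVIDIA GPUs the backend would be "nccl", the library being discussed:

```python
import os
import torch
import torch.distributed as dist

# What an all-reduce does conceptually: every rank contributes its local
# tensor, and every rank ends up with the element-wise sum. During training
# this is how gradients and activations get synchronized across devices.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

local_grad = torch.tensor([1.0, 2.0, 3.0])
dist.all_reduce(local_grad, op=dist.ReduceOp.SUM)   # sum across all ranks
gathered = [torch.zeros(3) for _ in range(dist.get_world_size())]
dist.all_gather(gathered, local_grad)               # every rank gets every rank's copy
print(local_grad, gathered)
dist.destroy_process_group()
```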
0:42:39 So NVIDIA has a standard library.
0:42:43 This is one of the reasons why it’s really difficult to use anyone else’s hardware for
0:42:47 training is because no one’s really built a standard communications library.
0:42:50 And NVIDIA has done this at a sort of a higher level, right?
Now DeepSeek has certain limitations around the GPUs that they have access to.
0:43:00 The interconnects are limited to some extent by the restrictions of the GPUs that were
0:43:04 shipped into China legally, not the ones that are smuggled but legally shipped in that they
0:43:05 used to train this model.
0:43:09 They had to figure out how to get efficiencies, right?
0:43:14 And one of those things is that instead of just calling the NVIDIA library, Nickel, right?
They instead created their own; they scheduled their own communications, which some of the
0:43:22 labs do, right?
Meta talked about in Llama 3 how they made their own custom version of Nickel.
0:43:28 This is, they didn’t talk about the implementation details.
0:43:31 This is some of what they did, probably not as well as, maybe not as well as deep seek
0:43:36 because deep seek, you know, necessity is the mother of innovation and they had to do
0:43:37 this.
0:43:41 Whereas in the case, you know, OpenAI has people that do this sort of stuff, Anthropic,
0:43:42 et cetera.
0:43:45 But, you know, deep seek certainly did it publicly and they may have done it even better
0:43:50 because they were gimped on a certain aspect of the chips that they have access to.
0:43:57 And so they scheduled communications, you know, by scheduling specific SMs, SMs you could
0:44:00 think of as like the core on a GPU, right?
So there’s hundreds of cores, or there’s, you know, a bit over a hundred SMs on
0:44:08 a GPU and they were specifically scheduling, hey, which ones are running the model, which
ones are doing all-reduce, which ones are doing all-gather, right?
0:44:13 And they would flip back and forth between them.
0:44:16 And this requires extremely low level programming.
0:44:20 This is what Nickel does automatically or other NVIDIA libraries handle this automatically
0:44:21 usually.
0:44:22 Yeah, exactly.
And so technically they’re using, you know, PTX, which you could
think of as sort of like an assembly-type language.
0:44:30 It’s not exactly that or instruction set, right?
0:44:35 Like coding directly to assembly or instruction set, it’s not exactly that, but that’s still
0:44:39 part of technically CUDA, but it’s like, do I want to write in Python, you know, PyTorch
0:44:41 equivalent and call NVIDIA libraries?
0:44:43 Do I want to go down to the C level, right?
Or, you know, code at an even lower level, or do I want to go all the way down to the assembly
or ISA level?
0:44:52 And there are cases where you go all the way down there at the very big labs, but most
0:44:54 companies just do not do that, right?
0:44:58 Because it’s a waste of time and the efficiency gains you get are not worth it.
But DeepSeek’s implementation is so complex, right?
0:45:03 Especially with their mixture of experts, right?
People have done mixture of experts, but they’re generally 8 or 16 experts, right?
And they activate two.
0:45:13 So, you know, one of the words we like to use is like sparsity factor, right?
0:45:14 Or usage, right?
So you might have, you know, one fourth of your model activate, right?
And that’s what Mistral’s Mixtral model did, right?
Their model that really catapulted them to, like, oh my God, they’re really, really good.
OpenAI has also had models that are MoE, and so have all the other major closed labs.
But what DeepSeek did, that maybe only the leading labs have just recently started
doing, is have such a high sparsity factor, right?
0:45:40 It’s not one fourth of the model, right?
0:45:43 Two out of eight experts activating every time you go through the model.
0:45:46 It’s eight out of 256.
0:45:50 And there’s different implementations for mixture of experts where you can have some
0:45:56 of these experts that are always activated, which this just looks like a small neural network.
0:45:58 And then all the tokens go through that.
0:46:03 And then they also go through some that are selected by this routing mechanism.
And one of the innovations in DeepSeek’s architecture is that they changed the routing
0:46:10 mechanism in mixture of expert models.
0:46:15 There’s something called an auxiliary loss, which effectively means during training, you
0:46:21 want to make sure that all of these experts are used across the tasks that the model sees.
Why there can be failures in mixture of experts is that when you’re doing this training,
the one objective is token prediction accuracy.
0:46:34 And if you just let training go with a mixture of expert model on your own, it can be that
0:46:39 the model learns to only use a subset of the experts.
0:46:43 And in the MOE literature, there’s something called the auxiliary loss, which helps balance
0:46:44 them.
0:46:49 But if you think about the loss functions of deep learning, this even connects to the
0:46:54 bitter lesson is that you want to have the minimum inductive bias in your model to let
0:46:56 the model learn maximally.
And this auxiliary loss, this balancing across experts, could be seen as in tension with the
0:47:04 prediction accuracy of the tokens.
So we don’t know the exact extent of the DeepSeekMoE change, which is, instead of
doing an auxiliary loss, they have an extra parameter in their routing, which, after the
batches, they update to make sure that the next batches all have a similar
use of experts.
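Here is a sketch of that bias-based balancing as described, with the caveat that the specifics (step size, exact update rule) are assumptions rather than DeepSeek's published constants: each expert gets a bias that shifts routing scores only for top-k selection, nudged up when the expert is underused and down when overused, instead of adding an auxiliary balancing term to the loss itself.

```python
import torch

# Sketch of auxiliary-loss-free load balancing as described above: a per-expert
# bias shifts the routing scores only for selection, and it is adjusted between
# batches so underused experts get picked more and overused ones less.
n_experts, k, step = 8, 2, 0.01
bias = torch.zeros(n_experts)

def route(scores: torch.Tensor) -> torch.Tensor:           # scores: [tokens, n_experts]
    _, idx = torch.topk(scores + bias, k, dim=-1)           # bias shifts selection only
    return idx

def update_bias(idx: torch.Tensor) -> None:
    global bias
    counts = torch.bincount(idx.flatten(), minlength=n_experts).float()
    target = idx.numel() / n_experts
    bias += step * torch.sign(target - counts)               # boost underused, damp overused

idx = route(torch.randn(16, n_experts))
update_bias(idx)
print(bias)
```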
0:47:22 And this type of change can be big, it can be small, but they add up over time.
0:47:27 And this is the sort of thing that just points to them innovating and I’m sure all the labs
0:47:30 that are training big MOEs are looking at this sort of things, which is getting away
from the auxiliary loss; some of them might already use it, but you just keep
accumulating gains.
0:47:40 And we’ll talk about the philosophy of training and how you organize these organizations.
0:47:44 And a lot of it is just compounding small improvements over time in your data and your
0:47:48 architecture and your post training and how they integrate with each other.
0:47:49 And deep-seek does the same thing.
And some of them are shared, or a lot of them are; we have to take them at face value that they share
their most important details.
0:47:56 I mean, the architecture and the weights are out there.
0:47:59 So we’re seeing what they’re doing and it adds up.
0:48:02 Going back to sort of the like efficiency and complexity point, right?
0:48:05 It’s 32 versus four, right?
0:48:08 For like mixed draw and other MOE models that have been publicly released.
0:48:13 So this ratio is extremely high and sort of what Nathan was getting at there was, when
0:48:19 you have such a different level of sparsity, you can’t just have every GPU have the entire
0:48:20 model, right?
0:48:21 The model’s too big.
0:48:22 There’s too much complexity there.
0:48:25 So you have to split up the model with different types of parallelism, right?
0:48:29 And so you might have different experts on different GPU nodes.
0:48:34 But now what happens when this set of data that you get, hey, all of it looks like this
0:48:39 one way and all of it should route to one part of my model, right?
0:48:45 So when all of it routes to one part of the model, then you can have this overloading
0:48:49 of a certain set of the GPU resources or a certain set of the GPUs.
0:48:54 And then the rest of the training network sits idle because all of the tokens are just
0:48:55 routing to that.
0:48:56 So this is the biggest complexity.
0:49:02 One of the biggest complexities with running a very sparse mixture of experts model, i.e.,
0:49:07 this 32 ratio versus this four ratio is that you end up with so many of the experts just
0:49:08 sitting there idle.
0:49:10 So how do I load balance between them?
0:49:12 How do I schedule the communications between them?
0:49:19 This is a lot of the extremely low level detailed work that they figured out in the public first
0:49:24 and potentially second or third in the world and maybe even first in some cases.
What lesson, in the direction of the bitter lesson, do you take from all of this?
0:49:33 Is this going to be the direction where a lot of the gain is going to be, which is this
0:49:36 kind of low level optimization?
0:49:42 Or is this a short term thing where the biggest gains will be more on the algorithmic high
0:49:45 level side of post training?
Is this a short-term leap because they’ve figured out a hack, because under constraints, necessity is
the mother of invention?
0:49:55 Or is there still a lot of gains?
0:49:59 I think we should summarize what the bitter lesson actually is about.
0:50:04 The bitter lesson, essentially, if you paraphrase it, is that the types of training that will
win out in deep learning as we go are those methods which are scalable in learning
and search, as it calls out.
0:50:19 This scale word gets a lot of attention in this.
0:50:27 The interpretation that I use is effectively to avoid adding the human priors to your learning
0:50:28 process.
0:50:32 If you read the original essay, this is what it talks about is how researchers will try
0:50:38 to come up with clever solutions to their specific problem that might get them small
0:50:40 gains in the short term.
0:50:45 While simply enabling these deep learning systems to work efficiently and for these
0:50:50 bigger problems in the long term might be more likely to scale and continue to drive
0:50:53 success.
Whereas we were talking about relatively small implementation changes to the mixture
0:50:59 of experts model.
So it’s like, “Okay, we will need a few more years to know if one of these is
0:51:08 actually really crucial to the bitter lesson, but the bitter lesson is really this long
0:51:13 term arc of how simplicity can often win and there’s a lot of sayings in the industry
0:51:14 like the models just want to learn.
You have to give them the simple loss landscape where you put compute through the model and
0:51:24 they will learn and getting barriers out of the way.”
That’s where the power of something like Nickel comes in, where standardized code can
be used by a lot of people to create sort of simple innovations that can scale, which
is why the code base for DeepSeek is probably a giant mess.
I’m sure DeepSeek definitely has code bases that are extremely messy where they’re testing
these new ideas, multi-head latent attention.
0:51:50 Probably could start in something like a Jupyter notebook or somebody tries something on a
0:51:54 few GPUs and that is really messy.
But the stuff that trains DeepSeek V3 and DeepSeek R1, those libraries, if you were to present
0:52:04 them to us, I would guess are extremely high quality code.
0:52:07 High quality readable code.
I think there is one aspect to note, though, which is the general ability for that
to transfer across different types of runs.
0:52:21 You may make really, really high quality code for one specific model architecture at one
0:52:22 size.
0:52:26 Then that is not transferable to, “Hey, when I make this architecture tweak, everything’s
0:52:28 broken again.”
That’s something that, with their specific low-level coding of scheduling SMs,
could be specific to this model architecture and size.
Whereas NVIDIA’s collectives library is more like, “Hey, it’ll work for anything.
0:52:44 You want to do an all-reduce?
0:52:45 Great.
0:52:46 I don’t care what your model architecture is.
0:52:47 It’ll work.”
0:52:51 You’re giving up a lot of performance when you do that in many cases, but it’s worthwhile
0:52:57 for them to do the specific optimization for the specific run given the constraints that
0:52:58 they have regarding compute.
I wonder how stressful it is when these frontier models initiate training, to have the code,
to push the button, and now you’re spending a large amount of money and time to train this.
0:53:22 There must be a lot of innovation on the debugging stage of making sure there’s no issues that
0:53:27 you’re monitoring and visualizing every aspect of the training, all that kind of stuff.
0:53:31 When people are training, they have all these various dashboards, but the most simple one
0:53:33 is your loss.
0:53:38 It continues to go down, but in reality, especially with more complicated stuff like MoE, the
0:53:42 biggest problem with it, or FP8 training, which is another innovation, going to a lower-precision
0:53:47 number format, i.e., less accurate, is that you end up with loss spikes.
0:53:49 No one knows why the loss spike happened.
0:53:50 A lot of the time, you do.
0:53:51 Some of them you do.
0:53:52 That’s a bad data.
0:53:56 I give the AI2’s example of what blew up our earlier models, is a subreddit called Microwave
0:53:57 Gang.
0:53:58 We love the shout-out.
0:53:59 It’s a real thing.
0:54:01 You can pull up Microwave Gang.
0:54:05 Essentially, it’s a subreddit where everybody makes posts that are just the letter M, so
0:54:06 it’s like, mmm.
0:54:11 There’s extremely long sequences of the letter M, and then the comments are like beep beep
0:54:12 because it’s in the micro-events.
0:54:16 If you pass this into a model that's trained to produce normal text, it's extremely
0:54:22 high loss, because normally when you see an M, you don't predict M's for a long time.
0:54:24 This is something that causes the loss spikes for us.
0:54:28 This is old, this is not recent, and when you have more mature
0:54:31 data systems, that's not the thing that causes the loss spike.
0:54:36 What Dylan is saying is true, but there are levels to this sort of idea.
0:54:41 With regards to the stress, these people are like, you’ll go out to dinner with a friend
0:54:46 that works at one of these labs, and they’ll just be looking at their phone every 10 minutes,
0:54:49 and they’re not like, you know, it’s one thing if they’re texting, but they’re just like,
0:54:50 like, is the loss–
0:54:56 Yeah, it’s like tokens per second, loss not blown up, they’re just watching this.
0:54:59 And the heart rate goes up if there’s a spike.
0:55:01 And some level of spikes is normal, right?
0:55:03 It’ll recover and be back.
0:55:07 Sometimes a lot of the old strategy was like, you just stop the run, restart from the old
0:55:10 version, and then like, change the data mix, and then it keeps going.
0:55:12 There are even different types of spikes.
0:55:17 So Dirk Groeneveld has a theory that it's like fast spikes and slow spikes, where there
0:55:20 are– sometimes when you’re looking at the loss and there are other parameters, you can
0:55:24 see it start to creep up and then blow up, and that’s really hard to recover from, so
0:55:25 you have to go back much further.
0:55:28 So you have the stressful period where it’s like flat or it might start going up, and
0:55:29 you’re like, what do I do?
0:55:33 Whereas there are also loss spikes that are– it looks good, and then there’s one spiky
0:55:34 data point.
0:55:36 And what you can do is you just skip those.
0:55:39 You see that there’s a spike, you’re like, okay, I can ignore this data, don’t update
0:55:41 the model, and do the next one, and it’ll recover quickly.
0:55:47 But with these trickier implementations, as you get more complex in your architecture
0:55:52 and you scale up to more GPUs, you have more potential for your loss blowing up.
0:55:54 So there’s a distribution.
0:55:56 The whole idea of grokking also comes in, right?
0:56:00 It’s like, just because it slowed down from improving and loss doesn’t mean it’s not learning,
0:56:04 because all of a sudden it could be like this, and it could just spike down in loss again,
0:56:06 because it truly learned something, right?
0:56:08 And it took some time for it to learn that.
0:56:10 It’s not like a gradual process, right?
0:56:13 And that’s what humans are like, that’s what models are like.
0:56:15 It’s really a stressful task, as you mentioned.
0:56:18 And the whole time, the dollar count is going up.
0:56:20 Every company has failed runs.
0:56:23 You need failed runs to push the envelope on your infrastructure.
0:56:28 So a lot of news cycles are made of X company had Y failed run.
0:56:32 Every company that’s trying to push the frontier of AI has these.
0:56:37 So yes, it’s noteworthy because it’s a lot of money, and it can be week to month setback,
0:56:39 but it is part of the process.
0:56:44 But how do you get, if you're DeepSeek, how do you get to a place where, holy shit, there's
0:56:46 a successful combination of hyperparameters?
0:56:49 A lot of small failed runs.
0:56:55 So rapid iteration through failed runs and successful ones.
0:57:01 And then you build up some intuition, like this mixture of experts works, and then
0:57:03 this implementation of MLA works.
0:57:08 Key hyperparameters like learning rate and regularization and things like this.
0:57:11 And you find the regime that works for your code base.
0:57:13 I’ve talked to people at Frontier Labs.
0:57:18 There’s a story that you can tell where training language models is kind of a path that you
0:57:19 need to follow.
0:57:24 So you need to unlock the ability to train a certain type of model or a certain scale,
0:57:27 and then your code base and your internal know-how of which hyperparameters work for
0:57:28 it is kind of known.
0:57:33 And you look at the DeepSeek papers and models, they've scaled up, they've added complexity,
0:57:36 and it’s just continuing to build the capabilities that they have.
0:57:39 Here’s the concept of a YOLO run.
0:57:42 So YOLO, you only live once.
0:57:47 And what it is, is there’s all this experimentation you do at the small scale.
0:57:48 Research ablations.
0:57:53 You have your Jupyter Notebook where you’re experimenting with MLA on three GPUs or whatever.
0:57:58 And you’re doing all these different things like, “Hey, do I do four active experts,
0:57:59 128 experts?
0:58:01 Do I arrange the experts this way?”
0:58:03 All these different model architecture things.
0:58:05 You’re testing at a very small scale.
0:58:09 Several researchers, few GPUs, tens of GPUs, hundreds of GPUs, whatever it is.
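To make that kind of small-scale sweep concrete, here is a hedged sketch of how the MoE ablations being described might be parameterized. The field names and values are illustrative guesses, not DeepSeek's actual configuration, and train_small_scale is a hypothetical stand-in for whatever launches a short run.

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class MoEConfig:
    num_experts: int      # total routed experts, e.g. 64 or 128
    active_experts: int   # experts activated per token, e.g. 2 or 4
    num_layers: int
    hidden_dim: int

# A toy ablation grid of the sort you might sweep on a handful of GPUs
# before committing everything to a YOLO run.
grid = product([64, 128], [2, 4, 8], [12], [1024])
configs = [MoEConfig(e, a, l, d) for e, a, l, d in grid if a < e]

for cfg in configs:
    # train_small_scale(cfg) is a hypothetical helper that would launch a short
    # run and log loss curves; here we just print the configuration.
    print(cfg)
```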
0:58:13 And then, all of a sudden, you’re like, “Okay, guys, no more fucking around.
0:58:14 No more screwing around.
0:58:19 Everyone, take all the resources we have, let’s pick what we think will work, and just
0:58:20 go for it.”
0:58:21 YOLO.
0:58:24 And this is where that sort of stress comes in, is like, "Well, I know it works here,
0:58:28 but some things that work at this scale don't work at that scale, and some things that work up here don't
0:58:29 work down here."
0:58:30 Right?
0:58:31 In terms of scale.
0:58:38 It’s really truly a YOLO run, and there’s this discussion of certain researchers just
0:58:40 have this methodical nature.
0:58:44 They can find the whole search space and figure out all the ablations of different research
0:58:45 and really see what is best.
0:58:50 And there’s certain researchers who just have that innate gut instinct of, “This is the
0:58:51 YOLO run.
0:58:52 I’m looking at the data.
0:58:53 This is it.”
0:58:57 This is why you want to work in post-training, because the GPU cost for training is lower,
0:59:01 so you can make a higher percentage of your training runs YOLO runs.
0:59:02 Yeah.
0:59:03 For now.
0:59:04 Yeah.
0:59:05 For now.
0:59:06 For now.
0:59:09 So, some of this is fundamentally luck, still.
0:59:10 Luck is skill, right?
0:59:11 In many cases.
0:59:12 Yeah.
0:59:13 I mean, it looks lucky, right?
0:59:17 But the hill to climb, if you're at one of these labs and you have an evaluation
0:59:21 that you're not crushing, there's a repeated playbook of how you improve things.
0:59:24 There are localized improvements, which might be data improvements, and these add up into
0:59:26 the whole model just being much better.
0:59:30 And when you zoom in really close, it can be really obvious that this model is just really
0:59:33 bad at this thing, and we can fix it, and you just add these up.
0:59:38 So, some of it feels like luck, but on the ground, especially with these new reasoning
0:59:43 models we're talking about, there are just so many ways that we can poke around, and normally,
0:59:45 it's that some of them give big improvements.
0:59:47 The search space is near infinite, right?
0:59:53 And yet, the amount of compute and time you have is very low, and you have to hit release
0:59:54 schedules.
1:00:00 You have to not get blown past by everyone, otherwise, what happened with DeepSeek crushing
1:00:03 Meta, and Mistral, and Cohere, and all these guys happens; they moved too slow, right?
1:00:06 They maybe were too methodical, I don’t know, they didn’t hit the YOLO run, whatever the
1:00:09 reason was, maybe they weren’t as skilled.
1:00:13 You can call it luck if you want, but at the end of the day, it’s skill.
1:00:16 So, 2025 is the year of the YOLO run.
1:00:19 It seems like all the labs are going in.
1:00:24 I think it’s even more impressive what OpenAI did in 2022.
1:00:28 At the time, no one believed in a mixture of experts models at Google, who had all the
1:00:34 researchers, OpenAI had such little compute, and they devoted all of their compute for many
1:00:40 months, all of it, 100%, for many months to GPT4, with a brand new architecture with
1:00:44 no belief that, hey, let me spend a couple hundred million dollars, which is all of the
1:00:47 money I have on this model, right?
1:00:49 That is truly YOLO, right?
1:00:54 Now, people are like, all these training run failures that are in the media, right?
1:00:58 It’s like, okay, great, but actually, a huge chunk of my GPs are doing inference.
1:01:03 I still have a bunch doing research constantly, and yes, my biggest cluster is training on
1:01:09 this YOLO run, but that YOLO run is much less risky than what OpenAI did in 2022, or maybe
1:01:13 what DeepSeek did now, or sort of like, hey, we're just going to throw everything at it.
1:01:18 The big winners throughout human history are the ones who are willing to do YOLO at some
1:01:19 point.
1:01:25 Okay, what do we understand about the hardware it's been trained on, DeepSeek?
1:01:29 DeepSeek is very interesting. Let's take a second to zoom out on who they are, first
1:01:30 of all, right?
1:01:35 HighFlyer is a hedge fund that has historically done quantitative trading in China as well
1:01:40 as elsewhere, and they have always had a significant number of GPUs, right?
1:01:45 In the past, a lot of these high-frequency trading, algorithmic quant traders used FPGAs,
1:01:47 but it shifted to GPUs, definitely, and there’s both, right?
1:01:52 But GPUs especially, and HighFlyer, which is the hedge fund that owns DeepSeek, and everyone
1:01:56 who works for DeepSeek is part of HighFlyer, to some extent, right?
1:01:59 It’s the same parent company, same owner, same CEO.
1:02:05 They had all these resources and infrastructure for trading, and then they devoted a humongous
1:02:10 portion of them to training models, both language models and otherwise, right?
1:02:15 Because these techniques were heavily AI-influenced.
1:02:21 More recently, people have realized, hey, trading with, even when you go back to Renaissance
1:02:26 and all these quantitative firms, natural language processing is the key to trading
1:02:30 really fast, understanding a press release and making the right trade, right?
1:02:33 And so, DeepSeek has always been really good at this.
1:02:39 And even as far back as 2021, they have press releases and papers saying, hey, we’re the
1:02:44 first company in China with an A100 cluster this large, those 10,000 A100 GPUs, right?
1:02:46 This is in 2021.
1:02:48 Now this wasn’t all for training large language models.
1:02:54 This was mostly for training models for their quantitative aspects, their quantitative trading,
1:02:57 as well as a lot of that was natural language processing, to be clear, right?
1:02:59 And so this is the sort of history, right?
1:03:03 So verifiable fact is that in 2021, they built the largest Chinese cluster.
1:03:06 At least, they claim it was the largest cluster in China, 10,000 GPUs.
1:03:11 Before export controls started, they had a huge cluster, before any conversation of
1:03:12 export controls.
1:03:16 So then you step it forward to, what have they done over the last four years since then,
1:03:17 right?
1:03:21 Obviously, they’ve continued to operate the hedge fund, probably make tons of money.
1:03:24 And the other thing is that they’ve leaned more and more and more into AI.
1:03:27 The CEO, Liang Wenfeng, Liang…
1:03:30 You're putting me on the spot with this, we discussed this before.
1:03:31 Liang Wenfeng, right?
1:03:32 The CEO, he owns…
1:03:33 All of them.
1:03:38 Liang Wenfeng, he owns maybe a little bit more than half the company allegedly, right?
1:03:44 He’s an extremely Elon Jensen kind of figure where he’s just involved in everything, right?
1:03:48 And so over that time period, he’s gotten really in-depth into AI.
1:03:50 He actually has a bit of a…
1:03:54 If you see some of the statements, a bit of an e/acc vibe almost, right?
1:03:56 Total AGI vibes.
1:03:57 We need to do this.
1:04:01 We need to make a new ecosystem of open AI.
1:04:05 We need China to lead on this sort of ecosystem because historically, the Western countries
1:04:11 have led on software ecosystems, and he straight-up acknowledges, like, in order to do this,
1:04:15 we need to do something different. DeepSeek is his way of doing this.
1:04:17 Some of the translated interviews with him are fantastic.
1:04:18 So he has done interviews?
1:04:19 Yeah.
1:04:21 You think he would do a Western interview or no?
1:04:22 Or is there controls on the channel?
1:04:26 There hasn’t been one yet, but I would try it.
1:04:29 I just got a Chinese translator, so it was great.
1:04:30 This is how I’ll push.
1:04:38 So, a fascinating figure, an engineer pushing full-on into AI, leveraging the success from the high-frequency
1:04:39 trading.
1:04:40 Very direct quotes.
1:04:44 We will not switch to closed source when asked about this stuff.
1:04:50 Very long-term motivated in how the ecosystem of AI should work.
1:04:57 And I think from a Chinese perspective, he wants a Chinese company to build this vision.
1:05:01 And so this is sort of like the “visionary” behind the company.
1:05:03 This hedge fund still exists, this quantitative firm.
1:05:10 And so, you know, slowly he got turned to this full view of
1:05:12 AI, everything about this, right?
1:05:15 At some point, it slowly maneuvered there and he made DeepSeek.
1:05:17 And DeepSeek has done multiple models since then.
1:05:19 They’ve acquired more and more GPUs.
1:05:22 They share infrastructure with the fund, right?
1:05:28 And so, you know, there is no exact public number for the GPU resources that they have
1:05:32 besides these 10,000 GPUs that they bought in 2021, right?
1:05:34 And they were fantastically profitable, right?
1:05:40 And then this paper claims they used only 2,000 H800 GPUs, which is a restricted GPU that was
1:05:43 previously allowed in China, but is no longer allowed, and there's a new version.
1:05:47 But it’s basically NVIDIA’s H100 for China, right?
1:05:51 And then there’s some restrictions on it, specifically around the communications sort
1:05:52 of speed, the interconnect speed, right?
1:05:57 Which is why they had to do this crazy SM, you know, scheduling stuff, right?
1:05:58 So going back to that, right?
1:06:03 It’s like, this is obviously not true in terms of their total GPU count.
1:06:08 Obvious available GPUs, but for this training run, you think 2,000 is the correct number
1:06:09 or no?
1:06:13 So this is where it takes, you know, a significant amount of sort of like zoning in, right?
1:06:16 Like, what do you call your training run, right?
1:06:20 Do you count all of the research and ablations that you ran, right?
1:06:23 Studying all this stuff, because yes, you can do a YOLO run, but at some level you have
1:06:26 to do the test at the small scale, and then you have to do some test at medium scale before
1:06:28 you go to a large scale.
1:06:32 Accepted practice is that for any given model that is a notable advancement, you’re going
1:06:37 to do two to four X compute of the full training run in experiments alone.
1:06:42 So a lot of this compute that’s being scaled up is probably used in large part at this
1:06:43 time for research.
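As a back-of-the-envelope illustration of that rule of thumb: if the headline pre-training run costs C GPU-hours, the total including experiments lands somewhere around 3C to 5C. The sketch below just encodes that arithmetic, using roughly the pre-training GPU-hours DeepSeek's V3 paper reports as the headline figure.

```python
# Rough arithmetic only; the 2-4x experiment multiplier is the rule of thumb
# quoted above, and the headline figure is approximately what the V3 paper reports.
pretraining_gpu_hours = 2.8e6          # roughly the reported H800 GPU-hours for pre-training
experiment_multiplier_low, experiment_multiplier_high = 2, 4

total_low = pretraining_gpu_hours * (1 + experiment_multiplier_low)
total_high = pretraining_gpu_hours * (1 + experiment_multiplier_high)
print(f"Total compute: {total_low:.1e} to {total_high:.1e} GPU-hours")
```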
1:06:47 Yeah, and research will, you know, research begets the new ideas that let you get huge
1:06:48 efficiency.
1:06:49 Right.
1:06:50 Research gets you o1.
1:06:52 You break through, so you need to bet on it.
1:06:56 So some of the pricing strategy they will discuss has the research baked into the price.
1:07:01 So the numbers that DeepSeek specifically said publicly, right, are just the 10,000
1:07:06 GPUs in 2021, and then 2,000 GPUs for only the pre-training for V3.
1:07:08 They did not discuss cost on R1.
1:07:13 They did not discuss cost on all the other RL, right, for the instruct model that they
1:07:14 made, right?
1:07:18 They only discussed the pre-training for the base model, and they did not discuss anything
1:07:19 on research and ablations.
1:07:23 And they do not talk about any of the resources that are shared in terms of, hey, the fund
1:07:25 is using all these GPUs, right?
1:07:30 And we know that they’re very profitable and that 10,000 GPUs in 2021.
1:07:36 So some of the research that we’ve found is that we actually believe they have closer
1:07:38 to 50,000 GPUs.
1:07:39 We as SemiAnalysis.
1:07:44 So we should say that you’re sort of one of the world experts in figuring out what everybody’s
1:07:49 doing in terms of the semiconductor in terms of cluster buildouts in terms of, like, who
1:07:52 is doing what in terms of training runs.
1:07:53 So yeah.
1:07:54 So that’s the we.
1:07:55 Okay, go ahead.
1:07:56 Yeah, sorry.
1:07:58 We believe they actually have something closer to 50,000 GPUs, right?
1:08:00 Now, this is split across many tasks, right?
1:08:03 Again, the fund, research and ablations.
1:08:05 For ballpark, how much would OpenAI or Anthropic have?
1:08:10 I think the clearest example we have, because Meta is also open, they talk about, like, order
1:08:15 of 60K to 100K, H100 equivalent GPUs in their training clusters.
1:08:16 Right.
1:08:20 Like Llama 3, they trained on 16,000 H100s, right?
1:08:23 But the company of Meta last year publicly disclosed they bought, like, 400 something
1:08:24 thousand GPUs.
1:08:25 Yeah.
1:08:26 Right?
1:08:27 So of course, tiny percentage on the training.
1:08:31 Again, like most of it is, like, serving me the best Instagram reels, right?
1:08:32 Or whatever, right?
1:08:37 I mean, we could get into the cost of, like, what is the cost of ownership for a 2,000 GPU cluster,
1:08:38 10,000?
1:08:40 There are just different sizes of companies that can afford
1:08:44 these things, and DeepSeek is reasonably big.
1:08:49 Their compute allocation, comparatively, is one of the top few in the world.
1:08:52 It’s not OpenAI, Anthropoc, et cetera, but they have a lot of compute.
1:08:56 Can you, in general, actually just zoom out and also talk about the Hopper architecture,
1:09:02 the NVIDIA Hopper GPU architecture and the difference between H100 and H800, like you
1:09:03 mentioned, the interconnects?
1:09:04 Yeah.
1:09:08 So there’s, you know, Ampere was the A100 and then H100 Hopper, right?
1:09:12 People use them synonymously in the US because really there’s just H100 and now there’s H200,
1:09:13 right?
1:09:15 Mostly.
1:09:19 In China, they’ve had, there have been different salvos of export restrictions.
1:09:22 So initially the US government limited on a two-factor scale, right?
1:09:25 Which is chip interconnect versus flops, right?
1:09:29 So any chip that had interconnect bandwidth above a certain level and floating-
1:09:33 point operations above a certain level was restricted.
1:09:37 Later the government realized that this was a flaw in the restriction and they cut it
1:09:40 down to just floating point operations.
1:09:45 And so, H800 had high flops, low communication?
1:09:46 Exactly.
1:09:50 So the H800 was the same performance as H100 on flops, right?
1:09:53 But it didn’t have, it just had the interconnect bandwidth cut.
1:09:58 DeepSeek knew how to utilize this, you know, hey, even though we're cut back on the interconnect,
1:10:04 we can do all this fancy stuff to figure out how to use the GPU fully anyways, right?
1:10:10 And so that was back in October 2022, but later in 2023, end of 2023 implemented in
1:10:14 2024, the US government banned the H800, right?
1:10:18 And so by the way, this H800 cluster, these 2000 GPUs was not even purchased in 2024,
1:10:19 right?
1:10:22 It was purchased in late 2023.
1:10:23 And they’re just getting the model out now, right?
1:10:25 Because it takes a lot of research, et cetera.
1:10:29 H800 was banned and now there’s a new chip called the H20.
1:10:34 The H20 is cut back on only flops, but the interconnect bandwidth is the same.
1:10:38 And in fact, in some ways, it’s better than the H100 because it has better memory bandwidth
1:10:39 and memory capacity.
1:10:43 So there are, you know, NVIDIA is working within the constraints of what the government
1:10:46 sets and then builds the best possible GPU for China.
1:10:50 Can we take this actual tangent and we’ll return back to the hardware?
1:10:55 Is the philosophy, the motivation, the case for export controls?
1:10:56 What is it?
1:11:00 Dario Amodei just published a blog post about export controls.
1:11:06 The case he makes is that if AI becomes super powerful, and he says by 2026 we'll have AGI
1:11:11 or super powerful AI, then whoever builds that will have
1:11:13 a significant military advantage.
1:11:22 And so because the United States is a democracy and as he says, China is authoritarian or has
1:11:29 authoritarian elements, you want a unipolar world where the super powerful military because
1:11:31 of the AI is one that’s a democracy.
1:11:38 It’s a much more complicated world geopolitically when you have two superpowers with super powerful
1:11:41 AI and one is authoritarian.
1:11:42 So that’s the case he makes.
1:11:47 And so we want to, the United States wants to use export controls to slow China down, to make
1:11:55 sure that China can't do these gigantic training runs that will presumably be required to
1:11:57 build AGI.
1:11:58 This is very abstract.
1:12:03 I think this is the goal of how some people describe export controls: this super powerful
1:12:05 AI.
1:12:08 And you touched on the training run idea.
1:12:13 There’s not many worlds where China cannot train AI models.
1:12:18 Export controls are kneecapping the amount of compute or the density of compute that
1:12:20 China can have.
1:12:25 And if you think about the AI ecosystem right now as all of these AI companies, revenue
1:12:30 numbers are up and to the right, the AI usage is just continuing to grow, more GPUs are
1:12:31 going to inference.
1:12:37 A large part of export controls, if they work is just that the amount of AI that can be
1:12:40 run in China is going to be much lower.
1:12:43 So on the training side, DeepSeek V3 is a great example, where you have a very focused team
1:12:46 that can still get to the frontier of AI.
1:12:51 These 2,000 GPUs are not that hard to get, all things considered, in the world.
1:12:53 They’re still going to have those GPUs.
1:12:54 They’re still going to be able to train models.
1:12:58 But if there’s going to be a huge market for AI, if you have strong export controls and
1:13:02 you want to have 100,000 GPUs just serving the equivalent of chat GPT clusters with good
1:13:08 export controls, it also just makes it so that AI can be used much less.
1:13:14 And I think that is a much easier goal to achieve than trying to debate on what AGI
1:13:15 is.
1:13:19 And if you have these extremely intelligent autonomous AIs and data centers, those are
1:13:23 the things that could be running in these GPU clusters in the United States, but not
1:13:24 in China.
1:13:27 To some extent, training a model does effectively nothing, right?
1:13:28 Yeah.
1:13:29 I have a model.
1:13:35 The thing that Dario is speaking to is the implementation of that model once trained to
1:13:41 then create huge economic growth, huge increases in military capabilities, huge capability increases
1:13:46 in productivity of people, betterment of lives, whatever you want to direct super powerful
1:13:48 AI towards, you can.
1:13:51 But that requires a significant amounts of compute, right?
1:13:56 And so the US government has effectively said, and forever, right, like training will always
1:13:59 be a portion of the total compute.
1:14:03 We mentioned Meta’s 400,000 GPUs, only 16,000 made Lama, right?
1:14:08 So the percentage that Meta is dedicating to inference, now this might be for recommendation
1:14:12 systems that are trying to hack our mind into spending more time and watching more ads.
1:14:16 Or if it’s for a super powerful AI that’s doing productive things, doesn’t matter about
1:14:22 the exact use that our economic system decides, it’s that that can be delivered in whatever
1:14:23 way we want.
1:14:28 Whereas with China, you’re expert restrictions, great, you’re never going to be able to cut
1:14:29 everything off, right?
1:14:33 And I think that’s quite well understood by the US government, is that you can’t cut
1:14:34 everything off.
1:14:36 And they’ll make their own chips.
1:14:37 And they’re trying to make their own chips.
1:14:38 They’ll be worse than ours.
1:14:41 But the whole point is to just keep a gap, right?
1:14:46 And therefore, at some point, in a world of 2%, 3% economic growth, this
1:14:51 is really dumb, by the way, to cut off high tech and not make money off of it.
1:14:55 But in a world where super powerful AI comes about and then starts creating significant
1:14:59 changes in society, which is what all the AI leaders and big tech companies believe,
1:15:02 I think super powerful AI is going to change society massively.
1:15:07 And therefore, this compounding effect of the difference in compute is really important.
1:15:12 There’s some sci-fi out there where AI is measured in the power of, in like how much
1:15:14 power is delivered to compute, right?
1:15:18 Or how much is being, that’s sort of a way of thinking about what’s the economic output
1:15:20 is just how much power are you directing towards that AI?
1:15:24 Should we talk about reasoning models with this as a way that this might be actionable
1:15:26 as something that people can actually see?
1:15:31 So the reasoning models that are coming out with R1 and O1, they’re designed to use
1:15:32 more compute.
1:15:37 There’s a lot of buzzy words in the AI community about this, test time compute, inference time
1:15:38 compute, whatever.
1:15:40 But Dylan has good research on this.
1:15:43 You can get to the specific numbers on the ratio of when you train a model, you can look
1:15:47 at things about the amount of compute used at training and amount of compute used at inference.
1:15:52 These reasoning models are making inference way more important to doing complex tasks.
1:15:56 In the fall, in December, OpenAI announced this o3 model.
1:16:00 There’s another thing in AI when things move fast, we get both announcements and releases.
1:16:03 Analytics are essentially blog posts where you pat yourself on the back and you say you
1:16:07 did things and releases are run the models out there, the papers out there, et cetera.
1:16:13 So OpenAI has announced o3, and we can check if o3-mini is out as of recording, potentially.
1:16:17 But that doesn’t really change the point, which is that the breakthrough result was something
1:16:22 called the ARC-AGI task, which is the Abstraction and Reasoning Corpus for artificial general
1:16:23 intelligence.
1:16:29 François Chollet is the guy behind it; it's a multi-year-old paper.
1:16:30 It’s a brilliant benchmark.
1:16:36 And the thing about OpenAI o3 solving this was that it used some large number of samples
1:16:37 in the API.
1:16:40 The API has like thinking effort and number of samples.
1:16:47 They used 1,000 samples to solve this task, and it comes out to be like five to $20 per
1:16:51 question, where you're putting in effectively a math puzzle, and then it takes on the order of
1:16:53 dollars to answer one question.
1:16:55 And this is a lot of compute.
1:16:59 If it’s going to take off in the US, open AI needs a ton of GPUs on inference to capture
1:17:00 this.
1:17:04 Open AI, chat GPT Pro subscription, which is $200 a month, which Sam said they’re losing
1:17:08 money on, which means that people are burning a lot of GPUs on inference.
1:17:09 And I’ve signed up with it.
1:17:10 I’ve played with it.
1:17:15 I don’t think I’m a power user, but I use it, and it’s like, that is the thing that
1:17:20 a Chinese company with mediumly strong expert controls, there will always be loopholes, might
1:17:21 not be able to do it all.
1:17:26 And the main result for o3 is also a spectacular coding performance.
1:17:32 And if that feeds back into AI companies being able to experiment better…
1:17:38 So presumably the idea is, for an AGI, a much larger fraction of the compute will be used
1:17:42 for this test-time compute, for the reasoning, for the AGI that goes into a room and thinks about
1:17:50 how to take over the world and comes back in 2.7 hours, and it's going to take a lot of
1:17:51 compute.
1:17:56 This is what people, CEOs or leaders of OpenAI and Anthropic, talk about: autonomous
1:18:00 AI models, where you give them a task and they work on it in the background.
1:18:04 My personal definition of AGI is much simpler.
1:18:09 I think language models are a form of AGI and all of the super powerful stuff is a next
1:18:13 step that’s great if we get these tools, but a language model has so much value and so
1:18:14 many domains.
1:18:16 It is a general intelligence to me.
1:18:20 But this next step of agentic things where they’re independent and they can do tasks
1:18:26 that aren’t in the training data is what the few year outlook that these AI companies are
1:18:27 driving for.
1:18:32 I think the terminology here that Dario uses is super powerful AI, so I agree with you
1:18:33 on the AGI.
1:18:36 I think we already have something that's exceptionally impressive,
1:18:42 that Alan Turing would for sure say is AGI, but he's referring more to something that, once you're
1:18:48 in possession of it, you would have a significant military and geopolitical advantage over other
1:18:49 nations.
1:18:52 So it’s not just like you can ask it how to cook an omelet.
1:18:55 And he has a much more positive view in his essay, Machines of Loving Grace.
1:19:00 I’ve read into this, that we don’t have enough background in physical sciences to gauge exactly
1:19:07 how competent I am and if AI can revolutionize biology, I’m safe saying that AI is going
1:19:10 to accelerate the progress of any computational science.
1:19:14 So we’re doing a depth-first search here on topics, taking tangent of a tangent.
1:19:19 So let’s continue on that depth-first search.
1:19:25 You said that you’re both feeling the AGI, so what’s your timeline?
1:19:29 Dario is 2026 for the super powerful AI.
1:19:37 That’s basically agentic to a degree where it’s a real security threat, that level of
1:19:38 AGI.
1:19:39 What’s your timeline?
1:19:43 I don’t like to attribute specific abilities because predicting specific abilities and when
1:19:44 is very hard.
1:19:49 I think mostly if you’re going to say that I’m feeling the AGI is that I expect continued
1:19:51 rapid surprising progress over the next few years.
1:19:57 So something like R1 is less surprising to me from DeepSeq because I expect there to
1:20:00 be new paradigms where substantial progress can be made.
1:20:04 DeepSeq R1 is so unsettling because we’re kind of on this path with chatGPT.
1:20:05 It’s getting better.
1:20:06 It’s getting better.
1:20:07 It’s getting better.
1:20:10 And then we have a new direction for changing the models and we took one step like this
1:20:12 and we took a step up.
1:20:15 So it looks like a really fast slope and then we’re going to just take more steps.
1:20:19 Like it’s just really unsettling when you have these big steps and I expect that to
1:20:20 keep happening.
1:20:25 I see I’ve tried opening I operator, I’ve tried quad computer use.
1:20:26 They’re not there yet.
1:20:31 I understand the idea, but it’s just so hard to predict what is the breakthrough that will
1:20:35 make something like that work and I think it’s more likely that we have breakthroughs that
1:20:37 work and things that we don’t know what they’re going to do.
1:20:43 So like everyone wants agents, Dario has very eloquent way of describing this and I just
1:20:47 think that there’s going to be more than that so I could just expect these things to
1:20:48 come.
1:20:54 I’m going to have to try to pin you down to a date on the AGI timeline.
1:20:56 The nuclear weapon moment.
1:21:04 So moment where on the geopolitical stage, there’s a real like, because we’re talking
1:21:09 about export controls, when do you think, just even a throw out a date, when do you think
1:21:10 that would be?
1:21:14 For me, it’s probably after 2030, so I’m not as …
1:21:15 That’s what I would say.
1:21:16 So define that, right?
1:21:18 Because to me, it kind of almost has already happened, right?
1:21:23 You look at elections in India and Pakistan, people get AI voice calls and think they’re
1:21:25 talking to the politician, right?
1:21:28 The AI diffusion rules, which were enacted in the last couple of weeks of the Biden admin
1:21:34 and it looks like the Trump admin will keep and potentially even strengthen, limit cloud computing
1:21:38 and GPU sales to countries that are not even related to China.
1:21:43 Portugal and all these normal countries are on the, you need approval from the US list.
1:21:48 Yeah, Portugal and all these countries that are allies, right?
1:21:49 Singapore, right?
1:21:53 They freaking have F-35s and we don't let them buy GPUs.
1:21:56 This to me is already to the scale of like, you know …
1:22:01 Well, that just means that the US military is really nervous about this new technology.
1:22:06 That doesn’t mean the technology is already there, so they might be just very cautious
1:22:11 about this thing that they don’t quite understand, but that’s a really good point.
1:22:18 The robocalls, swarms of semi-intelligent bots could be a weapon, could be doing a lot
1:22:19 of social engineering.
1:22:23 I mean, there’s tons of talk about, you know, from the 2016 elections, like Cambridge Analytica
1:22:25 and all this stuff, Russian influence.
1:22:29 I mean, every country in the world is pushing stuff onto the internet and has narratives
1:22:30 they want, right?
1:22:35 Like that’s every, like technically competent, whether it’s Russia, China, US, Israel, et
1:22:36 cetera, right?
1:22:41 They’re pushing viewpoints onto the internet and mass and language models crash the cost
1:22:43 of like very intelligent sounding.
1:22:47 There’s some research that shows that the distribution is actually a limiting factor.
1:22:55 So language models haven’t yet made misinformation particularly, like, changed the equation there.
1:22:56 The internet is still ongoing.
1:23:00 I think there’s a blog, AI Snake Oil and some of my friends at Princeton that write on this
1:23:01 stuff.
1:23:02 So there is research.
1:23:04 It’s like, it’s a default that everyone assumes and I would have thought the same thing is
1:23:07 that misinformation doesn’t get far worse with language models.
1:23:12 I think in terms of internet posts and things that people have been measuring, it hasn’t
1:23:16 been a exponential increase or something extremely measurable and things you’re talking about
1:23:18 with like voice calls and stuff like that.
1:23:22 It could be in modalities that are harder to measure.
1:23:26 So it’s something that it’s too soon to tell in terms of, I think that’s like political
1:23:34 instability via the web is very, it’s monitored by a lot of researchers to see what’s happening.
1:23:37 I think that you’re asking about like the AGI thing.
1:23:42 If you were to make me give a year, I would be like, okay, I have AI CEOs saying this, they've
1:23:44 been saying two years for a while.
1:23:51 I think that people like Dario, Anthropic's CEO, have thought about this so deeply.
1:23:56 I need to take their word seriously, but also understand that they have different incentives.
1:24:00 So I would be like, add a few years to that, which is how you get something like
1:24:02 2030 or a little after 2030.
1:24:07 I think to some extent we have capabilities that hit a certain point where any one person
1:24:13 could say, okay, if I can leverage those capabilities for X amount of time, this is AGI, call it
1:24:19 '27, '28, but then the cost of actually operating that capability, this is going to be my point, is
1:24:24 so extreme that no one can actually deploy it at scale, en masse, to actually completely
1:24:27 revolutionize the economy at the snap of a finger.
1:24:30 So I don’t think it will be like a snap of the finger moment.
1:24:31 It’s a physical constraint.
1:24:35 However, it’ll be a, oh, the capabilities are here, but I can’t deploy it everywhere.
1:24:43 And so one simple example going back to 2023 was when Bing with GPT-4 came out and everyone
1:24:45 was freaking out about search, right?
1:24:46 Perplexity came out.
1:24:50 If you did the cost on implementing GPT-3 into every Google search, it was like, oh, okay,
1:24:53 this is just physically impossible to implement.
1:24:59 And as we step forward, going back to the test time compute thing: a query, where you
1:25:02 ask ChatGPT a question, costs cents, right?
1:25:05 For their most capable chat model, right?
1:25:11 To get a query back. To solve an ARC-AGI problem, though, costs five to 20 bucks, right?
1:25:14 And it's only going up from there.
1:25:20 This is a thousand, 10,000 X factor difference in cost to respond to a query versus do a task.
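That factor can be sanity-checked with simple arithmetic. The per-query cost range below is an assumed placeholder for "it costs cents," and the per-task range is the $5 to $20 figure quoted above.

```python
# Back-of-the-envelope ratio; the per-query cost range is an assumption.
query_cost_dollars = (0.002, 0.01)   # an ordinary chat query, "cents" (assumed)
task_cost_dollars = (5.0, 20.0)      # one ARC-AGI style task, as quoted above

ratio_low = task_cost_dollars[0] / query_cost_dollars[1]    # 500x
ratio_high = task_cost_dollars[1] / query_cost_dollars[0]   # 10,000x
print(f"Roughly {ratio_low:,.0f}x to {ratio_high:,.0f}x more per task than per query")
```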
1:25:26 And the task of ARC-AGI is simple to some extent, you know,
1:25:29 but it's also like, what are the tasks that we want?
1:25:32 Okay, AGI, quote unquote, what we have today, can do ARC-AGI.
1:25:35 Three years from now, it can do much more complicated problems, but the cost is going
1:25:39 to be measured in thousands and thousands and hundreds of thousands of dollars of GPU
1:25:44 time, and there just won't be enough power and infrastructure to operate this and therefore
1:25:47 shift everything in the world at the snap of a finger.
1:25:53 But at that moment, who gets to control and point the AGI at a task?
1:25:57 And so this was in Dario’s post that he’s like, hey, China can effectively and more quickly
1:26:01 than us point their AGI at military tasks, right?
1:26:06 And they have been in many ways, faster at adopting certain new technologies into, into
1:26:07 their military, right?
1:26:09 Especially with regards to drones, right?
1:26:14 The US maybe has a longstanding advantage in, you know, large air power, fighter jets,
1:26:20 bombers, but when it comes to asymmetric arms such as drones, they've completely
1:26:22 leapfrogged the US and the West.
1:26:27 And the, the fear that Dario is sort of pointing out there, I think, is that, yeah, great.
1:26:30 We’ll have AGI in the commercial sector.
1:26:33 The US military won’t be able to implement it super fast.
1:26:36 Chinese military could and they could direct all their resources to implementing it in
1:26:41 the military, and therefore solving, you know, military logistics, or solving some other
1:26:45 aspect like disinformation targeted at a certain set of people so they can flip a country's
1:26:50 politics or something like that that is actually catastrophic, versus, you know, the US,
1:26:54 where it'll be more capitalistically allocated just towards
1:26:58 whatever is the highest return on investment, which might be like building, you know, factories
1:26:59 better or whatever.
1:27:04 So everything I’ve seen, people’s intuition seems to fail on robotics.
1:27:06 So you have this kind of general optimism.
1:27:08 I’ve seen this on self-driving cars.
1:27:12 People think it’s much easier problem than it is similar with drones.
1:27:18 Here I understand it a little bit less, but I’ve just seen the reality of the war in Ukraine
1:27:21 and the usage of drones at both sides.
1:27:28 And it seems that humans still far outperform any fully autonomous systems.
1:27:35 AI is an assistant, but humans driving FPV drones, where the human is controlling most of it, just
1:27:37 far, far, far outperform AI systems.
1:27:43 So I think it’s not obvious to me that we’re going to have swarms of autonomous robots
1:27:46 anytime soon in the military context.
1:27:53 Maybe the fastest I can imagine is 2030, which is why I said 2030 for the superpower for AI.
1:27:59 Whenever you have large scale swarms of robots doing military actions, that’s when the world
1:28:02 just starts to look different to me.
1:28:04 So that’s the thing I’m really worried about.
1:28:10 But there could be cyber war, cyber war type of technologies that from social engineering
1:28:16 to actually just swarms of robots that find attack vectors in our code bases and shut
1:28:19 down power grids, that kind of stuff.
1:28:23 And it could be one of those things like on any given weekend or something.
1:28:24 Power goes out.
1:28:26 Nobody knows why.
1:28:27 And the world changes forever.
1:28:32 Just power going out for two days in all of the United States.
1:28:35 That will lead to murder, to chaos.
1:28:39 But going back to export controls.
1:28:49 Do you see that as a useful way to control the balance of power geopolitically in the
1:28:50 context of AI?
1:28:55 And I think, going back to my viewpoint, if you believe we're in the sort of stage
1:29:00 of economic growth and change that we've been in for the last 20 years, the export controls
1:29:05 are absolutely guaranteeing that China will win long term.
1:29:10 If you do not believe AI is going to make significant changes to society in the next
1:29:15 10 years or five years… Five-year timelines are sort of what the executives and such
1:29:18 of AI companies and even big tech companies believe,
1:29:20 but even 10-year timelines are reasonable.
1:29:29 But once you get to, hey, these timelines are below that time period, then the only
1:29:35 way to sort of create a sizable advantage or disadvantage for America versus China is
1:29:42 if you constrain compute because talent is not really something that’s constraining.
1:29:46 China arguably has more talent, more STEM graduates, more programmers.
1:29:48 The US can draw upon the world’s people, which it does.
1:29:51 There’s tons of foreigners in the AI industry.
1:29:55 So many of these AI teams are all people without a US passport.
1:30:01 Yeah, I mean, many of them are Chinese people who are moving to America, and that’s great.
1:30:03 That’s exactly what we want.
1:30:08 But that talent is one aspect, but I don’t think that’s one that is a measurable advantage
1:30:09 for the US or not.
1:30:12 It truly is just whether or not compute.
1:30:18 Even on the compute side, when we look at chips versus data centers, China has the unprecedented
1:30:24 ability to build ridiculous sums of power, like clockwork.
1:30:26 They’re always building more and more power.
1:30:31 They’ve got steel mills that individually are the size of the entire US industry.
1:30:36 And they’ve got aluminum mills that consume gigawatts and gigawatts of power.
1:30:40 And when we talk about what’s the biggest data center, opening, I made this huge thing
1:30:43 about Stargate, their announcement there.
1:30:48 That’s like once it’s fully built out in a few years, it’ll be two gigawatts of power.
1:30:53 And this is still smaller than the largest industrial facilities in China.
1:30:56 China, if they wanted to build the largest data center in the world, if they had access
1:30:58 to the chips, could.
1:31:02 So it’s not just a question of when, not if, right?
1:31:08 So their industrial capacity far exceeds the United States to manufacture stuff.
1:31:13 So long term, they’re going to be manufacturing chips there.
1:31:14 Chips are a little bit more specialized.
1:31:16 I’m specifically referring to the data centers, right?
1:31:20 Chips, fabs take huge amounts of power, don’t get me wrong.
1:31:22 That’s not necessarily the gating factor there.
1:31:28 The gating factor on how fast people can build the largest clusters today in the US is power.
1:31:35 It could be power generation, power transmission, substations and all these sorts of transformers
1:31:40 and all these things, building the data center, these are all constraints on the US industry’s
1:31:45 ability to build larger and larger training systems as well as deploying more and more
1:31:46 inference compute.
1:31:51 I think we need to make the point clear on why the time is now for people that don’t think
1:31:54 about this because essentially with export controls, you’re making it so China cannot
1:31:57 make or get cutting edge chips.
1:32:02 And the idea is that if you time this wrong, China is pouring a ton of money into their
1:32:03 chip production.
1:32:07 And if you time it wrong, they are going to have more capacity for production, more capacity
1:32:11 for energy and figure out how to make the chips and have more capacity than the rest
1:32:14 of the world to make the chips because everybody can buy, they’re going to sell their Chinese
1:32:15 chips to everybody.
1:32:17 They might subsidize them.
1:32:21 And therefore, if AI takes a long time to become differentiated, we've kneecapped the
1:32:24 financial performance of American companies.
1:32:28 NVIDIA can sell less, TSMC cannot sell to China.
1:32:34 So therefore, we have less demand to like keep driving the production cycle.
1:32:37 So that’s the assumption behind the timing being important.
1:32:40 Less than 10 years or five years to above, right?
1:32:45 China will win because of these restrictions long-term unless AI does something in the
1:32:52 short-term, which I believe AI will do, make massive changes to society in the medium short-term.
1:32:55 And so that’s the big unlocker there.
1:33:03 And even today, if Xi Jinping decided to get "scale-pilled," i.e., decide that scaling
1:33:09 laws are what matter, just like US executives like Satya Nadella and Mark Zuckerberg and
1:33:14 Sundar and all these US executives of the biggest, most powerful tech companies have
1:33:18 decided they're "scale-pilled" and they're building multi-gigawatt data centers, right?
1:33:22 Whether it’s in Texas or Louisiana or Wisconsin, wherever it is, they’re building these massive
1:33:28 things that cost as much as their entire budget for spending on data centers globally in one
1:33:29 spot, right?
1:33:32 This is what they’ve committed to for next year, year after, et cetera.
1:33:37 And so they’re so convinced that this is the way, that this is what they’re doing.
1:33:42 But if China decided to, they could do it faster than us, but this is where the restrictions
1:33:43 come in.
1:33:48 It’s not clear that China, as a whole, has decided from the highest levels that this
1:33:49 is a priority.
1:33:50 The US sort of has, right?
1:33:55 You see Trump talking about DeepSeek and Stargate within the same week, right?
1:33:59 So he’s in the Biden and Min as well, had a lot of discussions about AI and such.
1:34:01 It’s clear that they think about it.
1:34:06 Only just last week did DeepSeek meet the second-in-command of China, right?
1:34:09 Like they have not even met the top, and they haven’t met Xi.
1:34:17 Xi hasn’t sat down, and they only just released a subsidy of a trillion RMB, roughly $160 billion,
1:34:23 which is closer to the spending of Microsoft and Meta and Google combined for this year.
1:34:28 So it’s like, they’re realizing it just now, but that’s where these export restrictions
1:34:33 come in and say, “Hey, you can’t ship the most powerful US chips to China.
1:34:35 You can ship a cut-down version.
1:34:39 You can’t ship the most powerful chips to all these countries who we know we’re just
1:34:41 going to rent it to China.
1:34:42 You have to limit the numbers, right?”
1:34:43 And the tools.
1:34:48 And same with manufacturing of equipment, tools, all these different aspects.
1:34:52 But it all stems from AI, and then what downstream can slow them down in AI?
1:34:56 And so the entire semiconductor restrictions, you read them, they are very clear.
1:35:01 It’s about AI and military civil fusion of technology, right?
1:35:02 It’s very clear.
1:35:04 And then from there, it goes, “Oh, well, we’re banning them from buying like lithography
1:35:10 tools and etch tools and deposition tools, and oh, this random subsystem from a random
1:35:12 company that’s like tiny, right?”
1:35:13 Like why are we banning this?
1:35:17 Because all of it, the US government has decided is critical to AI systems.
1:35:22 I think the fulcrum point is like the transition from seven nanometer to five nanometer chips,
1:35:27 where I think it was Huawei that had the seven nanometer chip a few years ago, which caused
1:35:31 another political brouhaha, almost like this moment.
1:35:35 And then it’s like ASML, deep UV, what is that?
1:35:37 Extreme ultraviolet lithography.
1:35:42 To set context on the chips, what Nathan’s referring to is in 2020, Huawei released their
1:35:48 Ascend 910 chip, which was an AI chip, first one on seven nanometer before Google did,
1:35:49 before NVIDIA did.
1:35:54 And they submitted it to the MLPerf benchmark, which is sort of an industry-standard machine
1:35:56 learning performance benchmark.
1:35:57 And it did quite well.
1:36:00 And it was the best chip at the submission, right?
1:36:02 This was a huge deal.
1:36:09 The Trump admin, of course, banned Huawei from getting seven nanometer chips from TSMC.
1:36:13 And so then they had to switch to using internal domestically produced chips, which was a multi-year
1:36:14 setback.
1:36:16 Many companies have done seven nanometer chips.
1:36:21 And the question is, we don’t know how much Huawei was subsidizing production of that
1:36:22 chip.
1:36:25 Intel has made seven nanometer chips that are not profitable and things like this.
1:36:30 So this is how all feeds back into the economic engine of export controls.
1:36:36 Well, so you’re saying that for now Xi Jinping has not felt the AGI, but it feels like the
1:36:42 DeepSeek moment might, like, there might be meetings going on now where he's going
1:36:46 to start wearing the same t-shirt and things are going to escalate.
1:36:49 I mean, like this, he may have woken up last week, right?
1:36:54 Liang Wenfeng met the vice chair, the second-in-command guy, and they had a meeting.
1:36:59 And then the next day, they announced the AI subsidies, which are trillion RMB, right?
1:37:04 So it’s possible that this deep-seek moment is truly the beginning of a cold war.
1:37:06 That’s what a lot of people are worried about.
1:37:10 People in AI have been worried that this is going towards a cold war or already is.
1:37:15 But it’s not deep-seek’s fault, but there’s something, a bunch of factors came together
1:37:19 where it was like this explosion, I mean, it all has to do with NVIDIA stock going down
1:37:27 up. It’s just some mass hysteria that happened that eventually led to Xi Jinping having meetings
1:37:29 and waking up to this idea.
1:37:35 And the US government realized this on October 7th, 2022, before ChatGPT released. That restriction
1:37:38 on October 7th dropped and shocked everyone.
1:37:40 And it was very clearly aimed at AI.
1:37:42 Everyone was like, “What the heck are you doing?”
1:37:44 Stable diffusion was out then, but not ChatGPT.
1:37:45 Yeah, but not ChatGPT.
1:37:50 I’m starting to be rumblings of what Gen. AI can do to society.
1:37:54 But it was very clear, I think, to at least National Security Council and those sort of
1:37:59 folks that this was where the world is headed, this cold war that’s happening.
1:38:10 So is there any concerns that the export controls push China to take military action in Taiwan?
1:38:11 This is the big risk, right?
1:38:16 The further you push China away from having access to cutting-edge American and global
1:38:20 technologies, the more likely they are to say, “Well, because I can’t access it, I might
1:38:21 as well…”
1:38:23 No one should access it, right?
1:38:26 And there’s a few interesting aspects of that, right?
1:38:30 China has an urban-rural divide, like no other.
1:38:36 They have a male-to-female birth ratio like no other, to the point where, if you look in
1:38:38 most of China, it's like the ratio is not that bad, but when you look at single dudes
1:38:42 in rural China, it's like a 30-to-1 ratio.
1:38:43 And those are disenfranchised dudes, right?
1:38:48 Like, quote-unquote, the US has an incel problem; China does, too.
1:38:51 It's just they're placated in some way or crushed down.
1:38:52 What do you do with these people?
1:38:55 And at the same time, you’re not allowed to access the most important technology, at
1:38:57 least the US thinks so.
1:39:00 China is maybe starting to think this is the most important technology by starting to dump
1:39:01 subsidies in it, right?
1:39:04 They thought EVs and renewables were the most important technology.
1:39:05 They dominate that now, right?
1:39:12 And now, they started thinking about semiconductors in the late 2010s and early 2020s, and now
1:39:16 they’ve been dumping money and they’re catching up rapidly, and they’re going to do the same
1:39:19 with AI because they’re very talented, right?
1:39:27 So the question is, when does this hit a breaking point, right?
1:39:32 And if China sees this as, hey, they can continue even without having access, and starting
1:39:37 a true hot war, right, taking over Taiwan or trying to subvert its democracy in some way
1:39:42 or blockading it, hurts the rest of the world far more than it hurts them, this is something
1:39:45 they could potentially do, right?
1:39:48 And so is this pushing them towards that, potentially, right?
1:39:55 I’m not quite a geopolitical person, but it’s obvious that the world regime of peace and trade
1:40:01 is super awesome for economics, but at some point, it could break, right?
1:40:05 I think we should comment on why the Chinese economy would be hurt by that: they're
1:40:06 export heavy.
1:40:10 I think the United States buys so much, and if that goes away, that's how their economy
1:40:11 is hurt.
1:40:16 Also, they just would not be able to import raw materials from all over the world, right?
1:40:21 The U.S. would just shut down trade through the Strait of Malacca, and at the same time,
1:40:27 you could argue almost all the GDP growth in America since the '70s has been either population
1:40:30 growth or tech, right?
1:40:35 Because your life today is not that much better than someone from the ’80s outside of tech,
1:40:36 right?
1:40:40 You still, you know, cars, they all have semiconductors in them everywhere, fridges, semiconductors
1:40:41 everywhere.
1:40:44 There’s these funny stories about how Russians were taking apart laundry machines because
1:40:48 they had certain like Texas instrument chips that they could then repurpose and put into
1:40:51 like their anti-missile things, right?
1:40:57 Like their S-400 or whatever, you would know more about this, but there’s all sorts of like
1:41:00 everything about semiconductors is so integral to every part of our lives.
1:41:07 So can you explain the role of TSMC in the story of semiconductors and maybe also how
1:41:11 the United States can break the reliance on TSMC?
1:41:13 I don’t think it’s necessarily breaking the reliance.
1:41:21 I think it’s getting TSMC to, you know, build in the U.S., but so taking a step back, right?
1:41:25 TSMC produces most of the world’s chips, right?
1:41:28 Especially on the foundry side, you know, there’s a lot of companies that build their
1:41:35 own chips: Samsung, Intel, you know, STMicro, Texas Instruments, you know, Analog Devices,
1:41:40 all these kinds of companies build their own chips, and NXP, but more and more of these companies
1:41:44 are outsourcing to TSMC and have been for multiple decades.
1:41:49 Can you explain the supply chain there and where most of TSMC is in terms of manufacturing?
1:41:50 Sure.
1:41:54 So, historically, supply chain was companies would build their own chips, they would, you
1:41:57 know, be a company started, they’d build their own chips, and then they’d design the
1:42:00 chip and build the chip and sell it.
1:42:05 Over time, this became really difficult because the cost of building a fab continues to compound
1:42:06 every single generation.
1:42:10 Of course, the technology, figuring out the technology for it is incredibly difficult,
1:42:14 regardless, but just the dollars and cents that are required, ignoring, you know, saying,
1:42:17 “Hey, yes, I have all the technical capability,” which it’s really hard to get that, by the
1:42:18 way, right?
1:42:20 “I have all the technical capability,” some things failing, et cetera.
1:42:24 But if you look at just the dollars to spend to build that next generation fab, it keeps
1:42:25 growing, right?
1:42:28 Sort of like, you know, Moore’s Law is halving the cost of chips every two years.
1:42:32 There’s a separate law that’s sort of like doubling the cost of fabs every handful of
1:42:33 years.
1:42:36 And so, you look at a leading edge fab that is going to be profitable today that’s building,
1:42:39 you know, three nanometer chips or two nanometer chips in the future.
1:42:43 That’s going to cost north of $30, $40 billion, right?
1:42:45 And that’s just for, like, a token amount.
1:42:47 That’s like the base building block.
1:42:48 You probably need to build multiple, right?
1:42:53 And so, when you look at the industry over the last, you know, if I go back 20, 30 years
1:42:57 ago, there were 20, 30 companies that could build the most advanced chips, and then they
1:42:59 would design them themselves and sell them, right?
1:43:01 So, companies like AMD would build their own chips.
1:43:03 Intel, of course, still builds their own chips, which they’re very famous for.
1:43:07 IBM would build their own chips, and, you know, you could just keep going down the list.
1:43:09 All these companies built their own chips.
1:43:13 Slowly they kept falling like flies, and that’s because of what TSMC did, right?
1:43:17 They created the Foundry business model, which is, I’m not going to design any chips.
1:43:22 I’m just going to contract manufacturer chips for other people, and one of their early customers
1:43:23 is NVIDIA, right?
1:43:28 NVIDIA is the only semiconductor company that’s doing more
1:43:33 than a billion dollars of revenue that was started in the era of foundries, right?
1:43:36 Every other company started before then, and at some point had fabs, which is actually
1:43:37 incredible, right?
1:43:41 You know, like AMD and Intel and Broadcom, throughout the industry.
1:43:45 It’s like everyone had fabs at some point, or, you know, some companies
1:43:46 like Broadcom.
1:43:50 It was like a merger, an amalgamation of various companies that rolled up, but even today Broadcom
1:43:51 has fabs, right?
1:43:57 They build iPhone RF radio chips in Colorado for, you know, for Apple, right?
1:44:00 Like all these companies had fabs, and for most of the fabs, they threw
1:44:05 them away or sold them off, or they got rolled into something else, and now everyone relies
1:44:06 on TSMC, right?
1:44:10 Including Intel, their latest PC chip uses TSMC chips, right?
1:44:13 It also uses some Intel chips, but it uses TSMC process.
1:44:17 Can you explain why the Foundry model is so successful for these companies?
1:44:19 Why, why are they going with this?
1:44:20 Economies of scale.
1:44:21 Scale.
1:44:22 Yeah.
1:44:24 So, I mean, like, like I mentioned, right, the cost of building a fab is so high.
1:44:30 The R&D is so difficult, and when you look at like these, like companies that had their
1:44:35 own vertical stack, there was an antiquated process of like, okay, like I’m so hyper-customized
1:44:37 to each specific chip, right?
1:44:40 But as we’ve gone through the history of sort of like the last 50 years of electronics and
1:44:44 semiconductors, A, you need more and more specialization, right?
1:44:46 Because Moore’s Law has died.
1:44:47 Dennard scaling has died.
1:44:49 I.e. chips are not getting better just for free, right?
1:44:53 You know, from manufacturing, you have to make real architectural innovations, right?
1:44:56 Google is not just running on Intel CPUs for web-serving.
1:44:57 They have a YouTube chip.
1:44:58 They have TPUs.
1:44:59 They have Pixel chips.
1:45:04 They have a wide diversity of chips that, you know, generate all the economic value
1:45:05 of Google, right?
1:45:07 You know, it’s running all the services and stuff.
1:45:10 And so, and this is just Google, and you could go across any company in the industry, and
1:45:11 it’s like this, right?
1:45:15 Cars contain 5,000 chips, you know, 200 different varieties of them, right?
1:45:16 All these random things.
1:45:18 A Tesla door handle has two chips, right?
1:45:19 Like it’s like ridiculous.
1:45:20 And it’s a cool door handle, right?
1:45:23 It’s like, you know, you don’t think about it, but it has two really cheap,
1:45:26 like penny chips in there, right?
1:45:30 Anyway, so as you have more diversity of chips, as you have more specialization required and
1:45:35 as the cost of fabs continues to grow, you need someone who is laser focused on building
1:45:40 the best process technology and making it as flexible as possible.
1:45:44 I think you could say it simply, which is the cost per fab goes up.
1:45:48 And if you are a small player that makes a few types of chips, you’re not going to have
1:45:53 the demand to pay back the cost of the fab, whereas TSMC can have many different customers
1:45:58 and aggregate all this demand into one place, and then they’re the only one that makes
1:46:03 enough money building chips to build the next fab.
1:46:07 So this is kind of why the companies slowly get killed: 10 years ago they
1:46:11 have a chip that is profitable and is good enough, but the cost to build
1:46:12 the next one goes up.
1:46:16 They may try to do this, fail because they don’t have the money to make it work.
1:46:19 And then they don’t have any chips or they build it and it’s too expensive and they just
1:46:20 are not profitable.
1:46:22 You know, there’s more failure points, right?
1:46:27 You know, you could have one little process related to like some sort of like a chemical
1:46:31 etch or some sort of like plasma etch or you know, some little process that screws up.
1:46:33 You didn’t engineer it, right?
1:46:34 And now the whole company falls apart.
1:46:35 You can’t make chips, right?
1:46:40 And so super, super powerful companies like Intel had the ability to weather the storm,
1:46:44 like, hey, they still exist today, even though they really screwed up their manufacturing
1:46:45 six, seven years ago.
1:46:47 But in the case of like AMD, they almost went bankrupt.
1:46:52 They had to sell their fabs to Mubadala UAE, right?
1:46:56 And like that became a separate company called Global Foundries, which is a foundry firm.
1:46:59 And then AMD was able to then focus on like on the return back up was like, hey, let’s
1:47:05 focus on making chiplets and a bunch of different chips for different markets and focusing on
1:47:09 specific workloads rather than, you know, all of these different things.
1:47:10 And so you get more diversity of chips.
1:47:14 You have more companies than ever designing chips, but you have fewer companies than ever
1:47:16 manufacturing them, right?
1:47:20 And this is, this is where TSMC comes in as they’ve, they’ve just been the best, right?
1:47:22 They are so good at it, right?
1:47:23 They’re customer focused.
1:47:25 They make it easy for you to fabricate your chips.
1:47:28 They take all of that complexity and like kind of try and abstract a lot of it away from
1:47:29 you.
1:47:30 They make good money.
1:47:35 They don’t make insane money, but they make good money and, and they’re able to aggregate
1:47:38 all this demand and continue to build the next fab, the next fab, the next fab.
1:47:41 So why is Taiwan so special for TSMC?
1:47:43 Why is it happening there?
1:47:45 Can it be replicated inside the United States?
1:47:46 Yeah.
1:47:50 So there’s, there’s aspects of it that I would say yes and aspects that I’d say no,
1:47:51 right?
1:47:58 TSMC is way ahead because former executive Morris Chang of Texas Instruments wasn’t promoted
1:48:02 to CEO and he’s like, screw this, I’m going to go make my own chip company, right?
1:48:03 And he went to Taiwan and made TSMC, right?
1:48:06 And there’s, there’s a whole lot more story there.
1:48:09 So Texas Instruments could have been, you know, it could have been TSMC,
1:48:11 but Texas Semiconductor Manufacturing, right?
1:48:14 Instead of, you know, Texas Instruments, right?
1:48:17 But, you know, so there is that whole story there, but they’re sitting here in Texas.
1:48:19 I mean, and that sounds like a human story.
1:48:20 Like he didn’t get promoted.
1:48:24 And just the brilliance of Morris Chang, you know, which I wouldn’t underplay, but there’s
1:48:28 also like a different level of like how, how this works, right?
1:48:35 So in Taiwan, you know, the top percent of graduates, the students that go
1:48:40 to the best school, which is NTU, the top percent of those all go work at TSMC, right?
1:48:41 And guess what their pay is?
1:48:45 Their starting pay is like $80,000, $70,000, right?
1:48:49 Which is like, that’s like starting pay for like a good graduate in the U.S., right?
1:48:53 Not the top, the top graduates are making hundreds of thousands of dollars at the Googles
1:48:57 and the Amazons, and now I guess the open AIs of the world, right?
1:49:01 So there is, there is a large dichotomy of like what is the top one percent of the society
1:49:04 doing and where are they headed because of economic reasons, right?
1:49:06 Intel never paid that crazy good, right?
1:49:08 And it didn’t make sense to them, right?
1:49:09 That’s one aspect, right?
1:49:10 Where is the best going?
1:49:11 Second is the work ethic, right?
1:49:16 Like, you know, we like to work, you work a lot, we work a lot, but at the
1:49:21 end of the day, what is the time and amount of work
1:49:23 that you’re doing, and what does a fab require, right?
1:49:25 Fabs are not work-from-home jobs.
1:49:28 You go into the fab, and it’s grueling work, right?
1:49:34 There’s, hey, if there is any amount of vibration, right, an earthquake happens, vibrates the
1:49:39 machines, they’re all, you know, they’re either broken, you’ve scrapped some of your production,
1:49:42 and then in many cases, they’re like not calibrated properly.
1:49:45 So when TSMC, when there’s an earthquake, right, recently there’s been an earthquake,
1:49:50 TSMC doesn’t call their employees, they just, they just go to the fab, and like, they just
1:49:55 show up, the parking lot gets slammed, and people just go into the fab and fix it, right?
1:49:57 Like it’s like an army of ants, right?
1:50:01 Like it’s like, you know, a hive of ants doesn’t get told by the queen what to do, the ants
1:50:02 just know.
1:50:06 It’s like one person just specializes on these one task, and it’s like, you’re gonna take
1:50:09 this one tool, and you’re the best person in the world, and this is what you’re gonna
1:50:11 do for your whole life is this one task in the fab.
1:50:16 Which is like some special chemistry plus nano manufacturing on one line of tools that
1:50:20 continues to get iterated, and yeah, it’s just like, it’s like a specific plasma etch
1:50:22 for removing silicon dioxide, right?
1:50:26 That’s all you focus on your whole career, and it’s like such a specialized thing.
1:50:30 And so it’s not like the tasks are transferable; AI today is awesome because people can
1:50:32 pick it up like that.
1:50:36 Semiconductor manufacturing is very antiquated and difficult, none of the materials are online
1:50:39 for people to read easily and learn, right?
1:50:43 The papers are very dense, and like it takes a lot of experience to learn.
1:50:47 And so it makes the barrier to entry much higher too.
1:50:50 So when you talk about, hey, you have all these people that are super specialized, they
1:50:55 will work, you know, 80 hours a week in a factory, right, in a fab.
1:50:59 And if anything goes wrong, they’ll go show up in the middle of the night because some
1:51:01 earthquake, their wife is like, there’s an earthquake.
1:51:05 He’s like, great, I’m gonna go to the fab, it’s like, would you, like as an American
1:51:06 do that, right?
1:51:11 The kinds of things are like, what, you know, I guess are the exemplifying like why TSMC
1:51:12 is so amazing.
1:51:14 Now, can you replicate it in the U.S.?
1:51:18 Let’s not ignore Intel was the leader in manufacturing for over 20 years.
1:51:23 They brought every technology to market first, besides EUV: strained silicon, high-K metal
1:51:28 gates, FinFET, you know, the list goes on and on and on of technologies that Intel brought
1:51:36 to market first, made the most money from, and manufactured at scale, first, best, highest
1:51:37 profit margins, right?
1:51:40 So it’s not that Intel can’t do this, right?
1:51:43 It’s that the culture has broken, right?
1:51:44 You’ve invested in the wrong things.
1:51:46 They said no to the iPhone.
1:51:50 They had all these different things regarding like, you know, mismanagement of the fabs,
1:51:53 mismanagement of designs, this lockup, right?
1:51:57 And at the same time, all these brilliant people, right, these like 50,000 PhDs, you
1:52:02 know, or masters that have been working on specific chemical or physical processes or
1:52:05 nanomanufacturing processes for decades in Oregon, they’re still there.
1:52:07 They’re still producing amazing work.
1:52:11 It’s just like getting it to the last mile of production at high yield where you can
1:52:17 manufacture dozens and hundreds of different kinds of chips, you know, with a good customer
1:52:18 experience, that has broken, right?
1:52:19 You know, it’s that customer experience.
1:52:23 It’s like the, like part of it is like people will say Intel was too pompous in the 2000s,
1:52:24 2010s, right?
1:52:26 They just thought they were better than everyone.
1:52:29 The tool guys were like, oh, I don’t think that this is mature enough.
1:52:30 They’re like, oh, you just don’t know.
1:52:31 We know, right?
1:52:32 This sort of stuff would happen.
1:52:38 And so can the U.S. bring it to the, can the U.S. bring leading edge semiconductor manufacturing
1:52:39 to the U.S.?
1:52:40 Emphatically, yes, right?
1:52:41 And we are, right?
1:52:42 It’s happening.
1:52:44 Arizona is getting better and better as time goes on.
1:52:51 TSMC has built, you know, roughly 20% of their capacity for five nanometer in the U.S., right?
1:52:54 Now this is nowhere near enough, right?
1:52:57 You know, 20% of capacity in the U.S. is like nothing, right?
1:53:00 And furthermore, this is still dependent on Taiwan existing, right?
1:53:02 There’s a sort of important way to separate it out.
1:53:06 There’s R&D and there’s high volume manufacturing.
1:53:11 There are, effectively, there are three places in the world that are doing leading edge R&D.
1:53:13 There’s Hsinchu, Taiwan.
1:53:14 There’s Hillsboro, Oregon.
1:53:18 And there is Pyeongtaek, South Korea, right?
1:53:22 These three places are doing the leading edge R&D for the rest of the world’s leading edge
1:53:24 semiconductors, right?
1:53:29 Now manufacturing can be distributed more globally, right?
1:53:34 And this is sort of where this dichotomy exists of like who’s actually modifying the process,
1:53:40 who’s actually developing the next generation one, who’s improving them; it’s Hsinchu, it’s Hillsboro,
1:53:41 it’s Pyeongtaek, right?
1:53:45 It is not the rest of these, you know, fabs like Arizona, right?
1:53:46 Arizona is a paperweight.
1:53:53 If Hsinchu disappeared off the face of the planet, you know, within a year, a couple years, Arizona
1:53:54 would stop producing too, right?
1:53:56 It’s actually like pretty critical.
1:54:00 One of the things I like to say is if I had like a few missiles, I know exactly where
1:54:01 I could cause the most economic damage, right?
1:54:03 It’s not targeting the White House, right?
1:54:04 It’s the R&D centers.
1:54:08 It’s the R&D centers for TSMC, Intel, Samsung, and then some of the memory guys, Micron and
1:54:09 Hynix.
1:54:12 Because they define the future evolution of these semiconductors and everything’s moving
1:54:21 so rapidly that it really is fundamentally about R&D, and it is all about TSMC.
1:54:27 And so TSMC, you know, you cannot purchase a vehicle without TSMC chips, right?
1:54:31 You cannot purchase a fridge without TSMC chips.
1:54:36 Like, I think one of the few things you can purchase, ironically, is a Texas Instruments
1:54:37 like graphing calculator, right?
1:54:39 Because they actually manufacture in Texas.
1:54:44 But like, outside of that, like a laptop, a phone, anything, servers, right, GPUs, none
1:54:48 of this stuff can exist, and this is without TSMC, and in many cases, it’s not even like
1:54:52 the leading edge, you know, sexy 5-nanometer chip, 3-nanometer chip, 2-nanometer chip.
1:54:57 Oftentimes, it’s just like some stupid power IC that’s like converting from like, you know,
1:54:58 some voltage to another, right?
1:54:59 And it’s made at TSMC, right?
1:55:00 This is what China is investing in as well.
1:55:04 It’s like, they can build out this long tail fab where the techniques are much more known.
1:55:07 You don’t have to figure out these problems with the EUV.
1:55:12 They’re investing in this, and then they have large supply for things like the car door
1:55:14 handles and the random stuff.
1:55:20 And that trickles down into this whole economic discussion as well, which is they have far
1:55:23 more than we do, and having supply for things like this is crucial to normal life.
1:55:27 So they’re doing, they’re starting to invest in high-volume manufacture, but they’re not
1:55:28 doing R&D.
1:55:32 So they do R&D on their own, they’re just way behind, right?
1:55:40 So I would say like, in 2015, China had a five-year plan where they defined certain
1:55:45 goals for 2020 and 2025, including like 80% domestic production of semiconductors.
1:55:46 They’re not going to hit that, right, to be clear.
1:55:49 But they are in certain areas really, really close, right?
1:55:55 Like BYD is probably going to be the first company in the world to not have to use TSMC
1:55:58 for making chips, because they have their own fabs, right.
1:56:04 Now they still have to buy some chips from foreign, for example, like around like self-driving
1:56:06 ADAS capabilities, because those are really high-end.
1:56:11 But at least, like, an internal combustion engine car has 40 chips, you know, just
1:56:14 for controlling flow rates and all these things, and EVs are even more complicated.
1:56:19 So all these different power ICs and battery management controllers and all these things,
1:56:21 they’re insourcing, right?
1:56:25 And this is something that like China has been doing since 2015.
1:56:29 Now as far as like the trailing edge, they’re getting so much capacity there.
1:56:33 As far as the leading edge, right, i.e. this five nanometer and so on and so forth, right,
1:56:35 where GPUs, they are still behind.
1:56:39 And this is, the U.S. restrictions are trying to stop them in the latter.
1:56:43 But you know, all that’s happened, you know, is, yes, they’ve slowed down their five nanometer,
1:56:48 three nanometer, et cetera, but they’ve accelerated their, hey, 45 nanometer, 90 nanometer power
1:56:54 IC or analog IC or, you know, random chip in my keyboard, right, that kind of stuff.
1:56:59 So there is an angle of, like, the U.S.’s actions, you know, from
1:57:04 the angle of the export controls, have been so inflammatory at slowing down China’s progress
1:57:08 on the leading edge that they’ve turned around and have accelerated their progress elsewhere
1:57:12 because they know that this is so important, right, if the U.S. is going to lock them out
1:57:15 here, what if they lock us out here as well in the trailing edge.
1:57:18 And so going back, can the U.S. build it here?
1:57:20 Yes, but it’s going to take a ton of money.
1:57:26 I truly think like to revolutionize and completely insource semiconductors would take a decade
1:57:27 and a trillion dollars.
1:57:32 Is some of it also culture, like you said, extreme competence, extreme work ethic in
1:57:33 Taiwan?
1:58:37 If you have the demand and the money is on the line, the American companies figure it out.
1:57:42 It’s going to take handholding with the government, but I think that the culture helps TSMC break
1:57:44 through and it’s easier for them.
1:57:47 TSMC has some like 90,000 employees, right?
1:57:49 It’s not actually that insane amount.
1:57:52 The Arizona fab has 3,000 from Taiwan.
1:57:55 And these people, like their wives were like, yeah, we’re not going to have kids unless
1:57:59 you sign up for the Arizona fab, we go to Arizona, and we have our kids there.
1:58:01 There’s also a Japan fab where the same thing happened, right?
1:58:06 And so like these wives drove like these dudes to like go to Japan or America to have the
1:58:07 kids there.
1:58:09 And it’s like, it’s an element of culture.
1:58:10 Yeah, sure.
1:58:14 Taiwan works that hard, but also like the US has done in the past, they could do it now,
1:58:15 right?
1:58:20 You know, we can just import, I say import, the best people in the world if we want to.
1:58:22 That’s where the immigration conversation is a tricky one.
1:58:27 And there’s been a lot of debate over that, but yeah, it seems absurdly controversial to
1:58:28 import the best people in the world.
1:58:31 I don’t understand why it’s controversial.
1:58:32 That’s the one of the ways of winning.
1:58:33 I’m sure we agree with you.
1:58:38 And like even if you can’t import those people, I still think you could do a lot to manufacture
1:58:40 most of them in the US if the money’s there, right?
1:58:41 And so like…
1:58:42 It’s just way more expensive.
1:58:44 It’s not profitable for a long time.
1:58:49 And that’s the context of like the CHIPS Act is only like $50 billion relative to some
1:58:54 of the renewable initiatives that were passed in the Inflation Reduction Act and the Infrastructure
1:58:57 Act, which total in the hundreds of billions of dollars, right?
1:59:02 And so the amount of money that the US is spending on the semiconductor industry is nothing,
1:59:03 right?
1:59:07 Whereas all these other countries have structural advantages in terms of like work ethic and
1:59:12 amount of work and things like that, but also a number of STEM graduates, the percentile
1:59:14 of their best going to that, right?
1:59:19 But they also have differences in terms of like, “Hey, there’s just tax benefits in the
1:59:22 law and have been in the law for 20 years,” right?
1:59:25 And then some countries have massive subsidies, right?
1:59:29 China has something like $200 billion of semiconductor subsidies a year.
1:59:33 We’re talking about $50 billion in the US over like six years, right?
1:59:38 So the sheer difference in the subsidy amounts is also huge, right?
1:59:43 And so I think Trump has been talking about tariffing Taiwan recently.
1:59:48 That’s sort of like one of these things that’s like, “Oh, okay, well, maybe he doesn’t want
1:59:50 to subsidize the semiconductor industry.”
1:59:54 Obviously, tariffing Taiwan is going to cost a lot of things to go get much more expensive,
1:59:57 but does it change the equation for TSMC building more fabs in the US?
1:59:59 That’s what he’s sort of positing, right?
2:00:06 So can you lay out the importance, by the way, it’s incredible how much you know about
2:00:07 so much.
2:00:10 We told you Dylan knows all this stuff.
2:00:11 Yeah.
2:00:15 So, okay, you laid out why TSMC is really important.
2:00:22 If we look out into the future, 10, 20 years out, US-China relationship seems like it can
2:00:32 go to a dark place of Cold War, escalated Cold War, even hot war, or to a good place
2:00:39 of anything from frenemies to cooperation to working together.
2:00:46 So in this game theory, complicated game, what are the different trajectories?
2:00:47 What should US be doing?
2:00:52 Like what do you see as the different possible trajectories of US-China relations as both
2:00:57 leaders start to feel the AGI more and more and see the importance of chips and the importance
2:00:58 of AI?
2:01:04 I mean, ultimately, the export controls are pointing towards a separate future economy.
2:01:11 I think the US has made it clear to Chinese leaders that we intend to control this technology
2:01:17 at whatever cost to global economic integration.
2:01:18 So that…
2:01:19 It’s hard to unwind that.
2:01:20 Like the…
2:01:21 To the same extent…
2:01:24 To the same extent, they’ve also limited US companies from entering China.
2:01:27 So it has been a long time coming.
2:01:34 At some point, there was a convergence, but over at least the last decade, it’s been branching
2:01:37 further and further out, like US companies can’t enter China, Chinese companies can’t
2:01:43 enter the US, the US is saying, “Hey, China, you can’t get access to our technologies in
2:01:48 certain areas,” and China’s rebutting with the same thing: they’ve done some
2:01:52 restrictions around specific materials like gallium and things like that, that they’ve tried to limit
2:01:53 the US on.
2:01:54 One of the…
2:01:58 There’s a US drone company that’s not allowed to buy batteries, and they have military customers,
2:02:02 and this drone company just tells the military customers, like, “Hey, just get it from Amazon
2:02:04 because I can’t actually physically get them,” right?
2:02:08 There’s all these things that are happening that point to further and further divergence.
2:02:13 I have zero idea, and I would love if we could all hold hands and sing Kumbaya, but I have
2:02:15 zero idea how that could possibly happen.
2:02:20 Is the divergence good or bad for avoiding war?
2:02:26 Is it possible that the divergence in terms of manufactured chips of training AI systems
2:02:29 is actually good for avoiding military conflict?
2:02:34 It’s an objective fact that the world has been the most peaceful it has ever been when
2:02:40 there are global hegemons, right, or regional hegemons, right, in historical context, right?
2:02:43 The Mediterranean was the most peaceful ever when the Romans were there, right?
2:02:46 China had very peaceful and warring times, and the peaceful times were when dynasties
2:02:50 had a lockhold over not just themselves, but all their tributaries around them, right?
2:02:56 And likewise, the most peaceful time in human history has been when the US was the global
2:02:57 hegemon, right?
2:02:58 The last, you know, handful of decades.
2:03:02 Now, we’ve sort of seen things start to slide, right, with Russia, Ukraine, with what’s going
2:03:06 on in the Middle East, and, you know, Taiwan risk, all these different things are starting
2:03:08 to bubble up, still objectively extremely peaceful.
2:03:14 Now, what happens when it’s not one global hegemon, but it’s two, obviously, and China
2:03:18 will be competitive or even overtake the US like it’s possible, right?
2:03:24 And so this change in global hegemony, I don’t think it ever happens super peacefully, right,
2:03:28 when empires fall, right, which is a possible trajectory for America.
2:03:32 They don’t fall gracefully, right, like they don’t just slide out of irrelevance.
2:03:34 Usually there’s a lot of shaking.
2:03:39 And so, you know, what the US is trying to do is maintain its top position, and what
2:03:42 China is trying to do is become the top position, right?
2:03:47 And obviously, there’s budding of heads here in the most simple terms.
2:03:51 And that could take shape in all kinds of ways, including proxy wars.
2:03:54 It seems like it’s already happening.
2:04:00 As much as I want there to be centuries of prolonged peace, it looks like further instability
2:04:03 internationally is ahead.
2:04:08 And the US’s like sort of like current task is like, hey, if we control AI, if we’re the
2:04:14 leader in AI, then AI significantly accelerates progress, then we can maintain the global hegemony
2:04:15 position.
2:04:16 And therefore…
2:04:17 I hope that works.
2:04:21 And as an American, like, you know, kind of like, okay, I guess that’s gonna lead to peace
2:04:22 for us.
2:04:27 Now, obviously, other people around the world get affected negatively, you know, obviously
2:04:32 the Chinese people are not gonna be in as advantageous of a position if that happens.
2:04:37 But, you know, this is sort of the reality of like what’s being done and the actions
2:04:38 that are being carried out.
2:04:42 So can we go back to the specific detail of the different hardware?
2:04:51 There’s this nice graphic in the export controls of which GPUs are allowed to be exported
2:04:52 and which are not.
2:04:55 Can you kind of explain the difference?
2:05:02 Is there, from a technical perspective, are the H20s promising?
2:05:03 Yeah.
2:05:07 So this goes, and I think we’d have to like, we need to dive really deep into the reasoning
2:05:09 aspect and what’s going on there.
2:05:14 But the H20, you know, the US has gone through multiple iterations of the export controls,
2:05:15 right?
2:05:19 This H800 was at one point allowed back in ’23, but then it got canceled.
2:05:23 And by then, you know, DeepSeek had already built their cluster of, they claim, 2,000 of them.
2:05:26 I think they actually have many more, like something like 10,000 of those.
2:05:28 And now this H20 is the legally allowed chip, right?
2:05:31 Nvidia shipped a million of these last year to China, right?
2:05:34 For context, there’s like four or five million GPUs, right?
2:05:40 So the percentage of GPUs that were this China specific H20 is quite high, right?
2:05:43 You know, roughly 20%, 25%, right, 20% or so.
2:05:49 And so this H20 has been neutered in one way, but it’s actually upgraded in other ways,
2:05:50 right?
2:05:53 You know, you could think of chips along three axes for AI, right?
2:05:58 You know, ignoring software stack and like exact architecture, just raw specifications.
2:06:01 There’s floating point operations, right, flops.
2:06:06 There is memory bandwidth and memory capacity, right, memory I/O.
2:06:09 And then there is interconnect, right, chip to chip interconnections.
2:06:15 All three of these are incredibly important for making AI systems, right?
2:06:17 Because AI systems involve a lot of compute.
2:06:22 They involve a lot of moving memory around, whether it be to memory or to other chips,
2:06:23 right?
2:06:27 And so of these three vectors, the US initially had two of them
2:06:30 controlled and one of them not controlled: flops and interconnect bandwidth
2:06:32 were initially controlled.
2:06:34 And then they said, no, no, no, no, we’re going to remove the interconnect bandwidth and just
2:06:37 make it a very simple only flops.
2:06:41 But now Nvidia can make a chip that, okay, it’s cut down on flops, you
2:06:48 know, it’s like one third that of the H100, right, on spec sheet paper performance
2:06:53 for flops, you know, in real world, it’s closer to like half or maybe even like 60%
2:06:54 of it, right?
2:06:57 But then on the other two vectors, it’s just as good for interconnect bandwidth.
2:07:02 And then for memory bandwidth and memory capacity, the H20 has more memory bandwidth and more
2:07:05 memory capacity than the H100, right?
2:07:10 Now, recently, you know, in our research, we cut our estimate of Nvidia’s production of the H20 for this
2:07:12 year down drastically.
2:07:15 They were going to make another two million of those this year, but they just canceled
2:07:18 all the orders a couple of weeks ago.
2:07:21 In our view, that’s because we think that they think they’re going to get restricted,
2:07:22 right?
2:07:25 Because why would they cancel all these orders for H20?
2:07:28 Because they shipped a million of them last year, they had orders in for a couple million
2:07:29 this year and just gone, right?
2:07:32 For H20, B20, right, a successor to H20.
2:07:33 And now they’re all gone.
2:07:35 Now why would they do this, right?
2:07:37 I think it’s, it’s very clear, right?
2:07:44 The H20 is actually better for certain tasks and that certain task is reasoning, right?
2:07:49 Reasoning is incredibly like different than, you know, when you look at the different regimes
2:07:53 of models, right, pre-training is all about flops, right?
2:07:54 It’s all about flops.
2:07:58 There’s things you do like mixture of experts that we talked about to trade off interconnect
2:08:03 or to trade off, you know, other aspects and lower the flops and rely more on interconnect
2:08:04 and memory.
2:08:07 But at the end of the day, it’s flops is everything, right?
2:08:11 We talk about models in terms of, like, how many flops they are, right?
2:08:14 So like, you know, we talk about, oh, GPT-4 is 2E25, right?
2:08:22 2 times 10 to the 25th, you know, 25 zeros, right, flops, right, floating point operations.
2:08:23 For training.
2:08:24 For training, right?
2:08:28 And we’re talking about the restrictions for the 2E24, right, or 25.
2:08:34 The U.S. had an executive order, which Trump recently rescinded, which was, hey, 1E26, once
2:08:38 you hit that number of floating point operations, you must notify the government, and you must
2:08:40 share your results with us, right?
2:08:43 Like, there’s a level of model where the U.S. government must be told, right?
2:08:44 And that’s 1E26.
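To make these exponents concrete, a common rule of thumb (an approximation, not something stated in the conversation) is that dense-transformer training compute is roughly 6 times parameters times training tokens. A minimal sketch using Llama 3.1 405B's publicly reported scale:

```python
# Rule-of-thumb training compute for dense transformers: FLOPs ~= 6 * N * D,
# where N is the parameter count and D is the number of training tokens.
def training_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

# Llama 3.1 405B was reported as ~405B parameters trained on ~15T tokens.
flops = training_flops(405e9, 15e12)
print(f"{flops:.2e}")        # ~3.6e+25, i.e. the same 1e25-scale regime discussed here

# The (since-rescinded) executive order's reporting threshold was 1e26 FLOPs.
print(flops >= 1e26)         # False: roughly 3x below the notification threshold
```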
2:08:49 And so as we move forward, this is an incredibly important, flop is the vector that the government
2:08:54 has cared about historically, but the other two vectors are arguably just as important,
2:08:55 right?
2:09:00 And especially when we come to this new paradigm, which the world is only just learning about
2:09:02 over the last six months, right, reasoning.
2:09:08 And do we understand firmly which of the three dimensions is best for reasoning?
2:09:09 So interconnect.
2:09:10 The flops don’t matter as much.
2:09:11 Is it memory?
2:09:12 Memory, right?
2:09:13 It’s context-length.
2:09:16 We’re going to get into technical stuff real fast.
2:09:19 There’s two articles in this one that I could show, maybe graphics that might be interesting
2:09:20 for you to pull up.
2:12:27 For the listeners, we’re looking at the section on O1 inference architecture tokenomics.
2:09:29 You want to explain KVCache before we talk about this?
2:09:30 I think, like, it’s better to.
2:09:31 Okay.
2:09:36 But we need to go through a lot of specific technical things of transformers to make this
2:09:37 easy for people.
2:09:40 Because it’s incredibly important because this changes how models work.
2:09:45 But I think resetting, right, why is memory so important?
2:09:48 It’s because so far we’ve talked about parameter counts, right?
2:12:51 And with mixture of experts, you can change how many active parameters versus total parameters
2:09:54 to embed more data but have less flops.
2:09:58 But more important, you know, another aspect of, you know, what’s part of this humongous
2:10:01 revolution in the last handful of years is the transformer, right?
2:10:03 And the attention mechanism.
2:10:07 Attention mechanism is that the model understands the relationships between all the words in
2:10:09 its context, right?
2:10:13 And that is separate from the parameters themselves, right?
2:10:16 And that is something that you must calculate, right?
2:10:23 How each token, right, each word in the context length is relatively connected to each other,
2:10:24 right?
2:10:25 And I think, I think, Nate, that you should explain KVCache better.
2:10:27 KV cache is one of the optimizations that enable this.
2:10:31 So the attention operator has three core things.
2:10:34 It’s queries, keys, and values.
2:10:37 QKV is the thing that goes into this.
2:10:38 You’ll look at the equation.
2:10:41 You see that these matrices are multiplied together.
2:10:44 These words, query, key, and value, come from information retrieval backgrounds, where the
2:10:49 query is the thing you’re trying to get the values for: you match it against the keys, and the values
2:10:50 get reweighted.
2:10:53 My background’s not information retrieval and things like this.
2:10:56 It’s just fun to have backlinks.
2:11:00 And what effectively happens is that when you’re doing these matrix multiplications,
2:11:04 you’re having matrices that are of the size of the context length, so the number of tokens
2:11:06 that you put into the model.
2:11:12 And the KVCache is effectively some form of compressed representation of all the previous
2:11:13 tokens in the model.
2:11:17 So when you’re doing this, we talk about autoregressive models.
2:11:18 You predict one token at a time.
2:11:20 You start with whatever your prompt was.
2:11:24 You ask a question, like, who was the president in 1825?
2:11:26 The model then is going to generate its first token.
2:11:31 For each of these tokens, you’re doing the same attention operator where you’re multiplying
2:11:38 these query, key, value, matrices, but the math is very nice so that when you’re doing
2:11:44 this repeatedly, this KVCache, this key value operation, you can keep appending the new
2:11:45 values to it.
2:11:50 So you keep track of what your previous values you were inferring over in this autoregressive
2:11:51 chain.
2:11:53 You keep it in memory the whole time.
2:11:58 And this is a really crucial thing to manage when serving inference at scale.
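As a minimal sketch of the mechanics (illustrative NumPy, not any particular serving stack): each decode step projects only the newest token, appends its key and value rows to the cache, and attends over everything cached so far.

```python
import numpy as np

def decode_step(x_new, W_q, W_k, W_v, k_cache, v_cache):
    """One autoregressive step with a KV cache.

    x_new is the hidden state of the newest token (shape (1, d_model));
    the caches hold keys/values for every previous token and only grow.
    """
    q = x_new @ W_q                                  # query for the new token, (1, d_head)
    k_cache = np.vstack([k_cache, x_new @ W_k])      # append this token's key
    v_cache = np.vstack([v_cache, x_new @ W_v])      # append this token's value
    scores = q @ k_cache.T / np.sqrt(q.shape[-1])    # (1, tokens_so_far)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax over the whole prefix
    out = weights @ v_cache                          # attention output for the new token
    return out, k_cache, v_cache
```

Each call only does new work for the latest token (starting from empty (0, d_head) caches), but the caches have to stay resident in GPU memory for the life of the request, which is the constraint being described here.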
2:12:02 There are far bigger experts in this, and there are so many levels of detail that you
2:12:03 can go into.
2:12:10 Essentially, one of the key “drawbacks” of the attention operator and the transformer
2:12:16 is that there is a form of quadratic memory cost in proportion to the context length.
2:12:21 So as you put in longer questions, the memory used in order to make that computation is going
2:12:24 up in the form of a quadratic.
2:12:28 You’ll hear about a lot of other language model architectures that are sub-quadratic
2:12:33 or linear attention forms, which is state space models.
2:12:34 We don’t need to go down all these now.
2:12:40 And then there’s innovations on attention to make this memory usage and the ability to
2:12:44 attend over long contexts much more accurate and high performance.
2:12:48 And those innovations are going to help you with your highly memory constraints.
2:12:50 They help with memory constraint and performance.
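To put rough numbers on the quadratic term (an illustrative calculation, not from the conversation): the naive attention score matrix is context length squared per head, so quadrupling the context multiplies it by sixteen.

```python
# Naive attention materializes an L x L score matrix (per head, per layer),
# so that piece of the memory grows quadratically with context length L.
for L in (1_024, 4_096, 16_384):
    print(L, L * L)   # 1,048,576 -> 16,777,216 -> 268,435,456 entries
```

Fused attention kernels avoid materializing that full matrix, which is part of what the innovations mentioned here buy you, but the KV cache itself still grows with every token and dominates at long contexts.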
2:12:54 So if you put in a book into, I think, Gemini is the model that has the longest context length
2:12:55 that people are using.
2:12:58 Gemini is known for 1 million and now 2 million context length.
2:13:03 You put a whole book into Gemini and sometimes it’ll draw facts out of it.
2:13:04 It’s not perfect.
2:13:05 They’re getting better.
2:13:07 So there’s two things.
2:13:09 There’s one to be able to serve this on the memory level.
2:13:14 Google has magic with their TPU stack where they can serve really long contexts.
2:13:18 And then there’s also many decisions along the way to actually make long context performance
2:13:19 work.
2:13:20 There’s data.
2:13:25 There’s subtle changes to these computations in attention and it changes the architecture.
2:13:30 But serving long contexts is extremely memory constrained, especially when you’re making
2:13:31 a lot of predictions.
2:13:36 I actually don’t know exactly why output tokens are more expensive than input tokens, but I think essentially
2:13:40 output tokens, you have to do more computation because you have to sample from the model.
2:13:41 I can explain that.
2:13:47 So today, if you use a model, like you look at an API, OpenAI charges a certain price
2:13:52 per million tokens and that price for input and output tokens is different.
2:13:59 And the reason is that when you’re inputting a query into the model, let’s say you have
2:14:04 a book, that book you must now calculate the entire KV cache for, this key value cache.
2:14:08 And so when you do that, that is a parallel operation.
2:14:12 All of the tokens can be processed at one time and therefore you can dramatically reduce
2:14:13 how much you’re spending.
2:14:18 The flop requirements for generating a token and an input token are identical.
2:14:21 If I input one token or if I generate one token, it’s completely identical.
2:14:23 I have to go through the model.
2:14:30 But the difference is that I can do that input, i.e. the pre-fill, i.e. the prompt, simultaneously
2:14:33 in a batch nature and therefore it is all flop.
2:14:37 I think the pricing model they mostly use is that input tokens are about one fourth the price
2:14:38 of the output.
2:14:39 Correct.
2:14:42 But then output tokens, the reason why it’s so expensive is because I can’t do it in
2:14:43 parallel.
2:14:44 It’s autoregressive.
2:14:48 Every time I generate a token, I must not only read
2:14:54 the whole entire model into memory and activate it, calculate it to generate the next token.
2:14:58 I also have to read the entire KV cache, and I generate a token, and I append that one token
2:15:02 I generated and its KV values to the cache, and then I do it again.
2:15:05 And so therefore this is a non-parallel operation.
2:15:11 And this is one where you have to, in the case of pre-fill or prompt, you pull the whole model
2:15:14 in and you calculate 20,000 tokens at once, right?
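A minimal sketch of why the two phases price differently (illustrative pseudologic; the forward_all and forward_one methods are hypothetical stand-ins, not any provider's actual API): prefill is one parallel pass over every prompt token, while decode loops token by token, rereading the weights and the growing KV cache each step.

```python
def prefill(model, prompt_tokens):
    # One batched pass: every prompt token goes through the model together,
    # producing the full KV cache in a single, flop-bound step.
    kv_cache = model.forward_all(prompt_tokens)        # hypothetical method
    return kv_cache

def decode(model, kv_cache, max_new_tokens):
    # Sequential loop: each output token requires reloading the weights and the
    # ever-growing KV cache, which is why it is memory-bound and priced higher.
    generated = []
    for _ in range(max_new_tokens):
        token, kv_cache = model.forward_one(kv_cache)  # hypothetical method
        generated.append(token)
    return generated
```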
2:15:20 So these are features that APIs are shipping, which is like prompt caching, pre-filling
2:15:22 because you can drive prices down and you can make APIs much faster.
2:15:25 If you know you’re going to keep, if you run a business and you’re going to keep passing
2:15:31 the same initial content to Claude’s API, you can load that into the Anthropic API and always
2:15:32 keep it there.
2:15:36 But it’s very different than we’re kind of leading to the reasoning models, which we
2:15:41 talked, we showed this example earlier and read some of this kind of mumbling stuff.
2:15:45 And what happens is that the output context length is so much higher.
2:15:49 And I mean, I learned a lot about this from Dylan’s work, which is essentially, as the
2:15:54 output length gets higher, you’re writing this quadratic in terms of memory used.
2:15:59 And then the GPUs that we have, effectively, you’re going to run out of memory and they’re
2:16:01 all trying to serve multiple requests at once.
2:16:05 So doing this batch processing, where not all of the prompts are exactly the same, really
2:16:06 complex handling.
2:16:10 And then as context lengths get longer, there’s this limit, I think you call it critical batch
2:16:15 size, where your ability to serve more users,
2:16:19 so how much you can parallelize your inference, plummets because of this long context.
2:16:23 So your memory usage is going way up with these reasoning models.
2:16:25 And you still have a lot of users.
2:16:29 So effectively, the cost to serve multiplies by a ton.
2:16:34 And we’re looking at a plot where the x-axis is the sequence length,
2:16:37 i.e., how many tokens are being generated plus the prompt.
2:16:40 So if I put in a book, that’s a million tokens.
2:16:43 But if I put in the sky is blue, then that’s like six tokens or whatever.
2:16:49 I should say that what we’re calling reasoning and chain of thought is extending the sequence
2:16:50 length.
2:16:51 It’s mostly output.
2:16:56 So before three months ago, whenever O1 launched, all of the use cases for long context length
2:16:59 were like, let me put a ton of documents in and then get an answer out.
2:17:05 And it’s a single pre-fill, compute a lot in parallel, and then output a little bit.
2:17:09 Now with reasoning and agents, this is a very different idea.
2:17:13 Now instead, I might only have like, hey, do this task or I might have all these documents.
2:17:17 But at the end of the day, the model is not just like producing a little bit.
2:17:19 It’s producing tons of information.
2:17:22 This chain of thought just continues to go and go and go and go.
2:17:27 And so the sequence length is effectively that if it’s generated 10,000 tokens, it’s
2:17:29 10,000 sequence length.
2:17:31 And plus whatever you input it in the prompt.
2:17:39 And so this chart is showing, and it’s a logarithmic chart, right, that as you grow from 1K to 4K
2:17:45 or 4K to 16K, the memory requirements grow so fast for your KV cache that you end up
2:17:51 not being able to run a certain number of– your sequence length is capped, or the number
2:17:52 of users you can serve is capped.
2:17:53 Let’s say the model.
2:17:57 So this is showing for a 405B model at batch size 64.
2:17:58 Llama 3.1 405B.
2:17:59 Yeah.
2:18:00 Yeah.
2:18:01 And batch size is crucial too.
2:18:05 Essentially, they just– you want to have higher batch size to parallelize your throughput.
2:18:07 64 different users at once, right?
2:18:08 Yeah.
2:18:09 And therefore, your serving costs are lower, right?
2:18:11 Because the server costs the same, right?
2:18:14 This is 8H100s, roughly $2 an hour per GPU.
2:18:16 That’s $16 an hour, right?
2:18:18 That is somewhat of a fixed cost.
2:18:21 You can do things to make it lower, of course, but it’s like $16 an hour.
2:18:23 Now how many users can you serve?
2:18:24 How many tokens can you generate?
2:18:26 And then you divide the two, and that’s your cost, right?
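Spelling out that division with the numbers just quoted (the throughput figure below is a made-up assumption purely for illustration):

```python
gpu_cost_per_hour = 2.0                        # dollars per H100-hour, as quoted
server_cost_per_hour = 8 * gpu_cost_per_hour   # 8 GPUs -> $16/hour

tokens_per_hour = 1_000_000                    # hypothetical throughput for the whole server
cost_per_million_tokens = server_cost_per_hour / (tokens_per_hour / 1_000_000)
print(cost_per_million_tokens)                 # $16/M tokens at this throughput;
                                               # double the throughput and the cost halves
```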
2:18:31 And so with reasoning models, this is where a lot of the complexity comes about and why
2:18:33 memory is so important.
2:18:37 Because if you have limited amounts of memory, then you can’t serve so many users.
2:18:40 If you have limited amounts of memory, your serving speeds get lower, right?
2:18:43 And so your costs get a lot, lot worse.
2:18:47 Because all of a sudden, if I was used to, hey, on the $16 an hour server, I’m serving
2:18:53 Llama 405B, or if I’m serving, you know, DeepSeek V3, and it’s all chat style applications,
2:18:55 i.e. we’re just chatting.
2:18:58 The sequence lengths are a thousand, a few thousand, right?
2:19:01 You know, when you use a language model, it’s a few thousand context lengths most times.
2:19:04 Sometimes you’re dropping a big document, but then you process it, you get your answer,
2:19:05 you throw it away, right?
2:19:07 You move on to the next thing, right?
2:19:12 Whereas with reasoning, I’m now generating tens of thousands of tokens in sequence, right?
2:19:16 And so this memory, this KV cache has to stay resident, and you have to keep loading it.
2:19:19 You have to keep it, keep it in memory constantly.
2:19:21 And now this crowds out other users, right?
2:19:25 If there’s now a reasoning task, right, and the model is capable of reasoning, then all
2:19:30 of a sudden, that memory pressure means that I can’t serve as many users simultaneously.
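A rough, hedged calculation of why that memory pressure bites. The dimensions below are Llama 3.1 405B's published configuration as publicly reported, with an fp16 cache assumed:

```python
# KV cache bytes per token = 2 (keys and values) * layers * kv_heads * head_dim * bytes/value
layers, kv_heads, head_dim, bytes_per_value = 126, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
print(per_token)                           # ~516,096 bytes, i.e. roughly 0.5 MB of cache per token

batch, seq_len = 64, 16_384                # 64 concurrent users, reasoning-length sequences
total_gb = per_token * batch * seq_len / 1e9
print(round(total_gb))                     # ~541 GB of KV cache alone

# An 8x H100 node has 640 GB of HBM, and the weights themselves already take
# hundreds of GB, so either the batch size or the sequence length has to give.
```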
2:19:32 Let’s go into DeepSeek again.
2:19:37 So we’re in the post-DeepSeek R1 time, I think.
2:19:41 And there’s two sides to this market watching how hard it is to serve it.
2:19:43 On one side, we’re going to talk about DeepSeek themselves.
2:19:46 They now have a chat app that got to number one on the App Store.
2:19:50 Disclaimer, number one on the App Store is measured by velocity, so it’s not necessarily
2:19:53 saying that more people have the DeepSeek app than the ChatGPT app.
2:19:57 But it is still remarkable, Claude has never hit the number one in the App Store, even
2:20:00 though everyone in San Francisco is like, “Oh my God, you got to use Claude, don’t use
2:20:01 ChatGPT.”
2:20:02 So DeepSeek hit this.
2:20:06 They also launched an API product recently where you can ping their API and get these
2:20:10 super long responses for R1 out.
2:20:13 At the same time, as these are out, we’ll get to what’s happened to them.
2:20:18 Because the model weights for DeepSeek R1 are openly available and the license is very friendly,
2:20:22 the MIT license, so it’s commercially usable, all of these mid-sized companies and big
2:20:28 companies are trying to be first to serve R1 to their users.
2:20:31 We are trying to evaluate R1 because we have really similar research going on, we released
2:20:34 the model and we’re trying to compare to it.
2:20:40 Out of all the companies that are quote unquote serving R1 and they’re doing it at prices
2:20:44 that are way higher than the DeepSeek API, most of them barely work and the throughput
2:20:45 is really low.
2:20:50 And to give context, one part of the freak-out was that China reached this capability.
2:20:52 The other aspect is they did it so cheap.
2:20:56 And they’re so cheap, we kind of talked about on the training side, why it was so cheap.
2:21:00 Let’s talk about why it’s so cheap on the inference, it works well and it’s cheap.
2:21:02 Why is R1 so damn cheap?
2:21:05 So I think there’s a couple factors here.
2:21:09 One is that they do have model architecture innovations.
2:21:15 This MLA, this new attention that they’ve done, is different than the attention from “Attention
2:21:17 Is All You Need,” the original transformer attention.
2:21:22 Now others have already innovated, there’s a lot of work like MQA, GQA, local global,
2:21:25 all these different innovations that try to bend the curve.
2:21:28 It’s still quadratic, but the constant is now smaller.
2:21:33 Related to our previous discussion, this multi-head latent attention can save about
2:21:39 80 to 90% in memory from the attention mechanism, which helps especially at long context.
2:21:42 It’s 80 to 90% versus the original, but then less versus what people are actually doing.
2:21:44 It’s still an innovation.
2:21:48 This 80 to 90% doesn’t say that the whole model is 80 to 90% cheaper, just as one part
2:21:49 of it.
2:21:50 And not just that, right?
2:21:54 Other people have implemented techniques like local global sliding window and GQA MQA.
2:22:00 But anyways, DeepSeek’s attention mechanism is a true architectural innovation, tons
2:22:04 of experimentation, and this dramatically reduces the memory pressure.
2:22:05 It’s still there, right?
2:22:07 It’s still a quadratic, it’s still attention, it’s still quadratic.
2:22:10 It’s just dramatically reduced it relative to prior forms.
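A back-of-the-envelope comparison of where that saving comes from. The DeepSeek figures below match its published V3 configuration approximately, so treat this as a sketch:

```python
# Standard multi-head attention caches a full key and value per head, per token, per layer.
heads, head_dim = 128, 128
mha_per_layer = 2 * heads * head_dim        # 32,768 cached values per token per layer

# A grouped-query baseline (8 KV heads) already shrinks that a lot.
kv_heads = 8
gqa_per_layer = 2 * kv_heads * head_dim     # 2,048 values

# MLA caches a compressed latent (~512 dims) plus a small decoupled RoPE key (~64 dims).
mla_per_layer = 512 + 64                    # 576 values

print(1 - mla_per_layer / mha_per_layer)    # ~0.98: ~98% smaller than vanilla multi-head attention
print(1 - mla_per_layer / gqa_per_layer)    # ~0.72: ~72% smaller than a grouped-query baseline
```

The exact percentage depends on which baseline you compare against, which is the "versus the original, versus what people are actually doing" distinction above.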
2:22:11 All right.
2:22:12 That’s the memory pressure.
2:22:19 I should say, in case people don’t know, R1 is 27 times cheaper than 01.
2:22:22 We think that OpenAI had a large margin built in.
2:22:23 Okay.
2:22:24 So that’s one.
2:22:25 There’s multiple factors.
2:22:26 We should break down the factors, I think.
2:22:34 It’s two bucks per million token output for R1 and $60 per million token output for 01.
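For the arithmetic behind that ratio (list prices at the time were approximately these, so treat them as a sketch):

```python
r1_output_price = 2.19    # dollars per million output tokens, DeepSeek R1 API (approximate)
o1_output_price = 60.00   # dollars per million output tokens, OpenAI o1 API
print(round(o1_output_price / r1_output_price, 1))   # ~27.4, the "27 times cheaper" figure
```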
2:22:37 Yeah, let’s look at this.
2:22:39 So, I think this is very important, right?
2:22:45 OpenAI has that drastic gap between their pricing and DeepSeek’s.
2:22:49 But DeepSeek is offering the same model, because they open-weighted it, to everyone else for a
2:22:54 very similar, like much lower price than what others are able to serve it for, right?
2:22:56 So there’s two factors here, right?
2:22:58 Their model is cheaper, right?
2:22:59 It is 27 times cheaper.
2:23:01 I don’t remember the number exactly off the top of my head.
2:23:09 So we’re looking at a graphic that’s showing different places serving V3, DeepSeek V3, which
2:23:16 is similar to DeepSeek R1, and there’s a vast difference in serving costs, right?
2:23:18 Serving costs, and what explains that difference?
2:23:21 And so, part of it is OpenAI has a fantastic margin, right?
2:23:26 They’re serving, when they’re doing inference, their gross margins are north of 75%, right?
2:23:30 So that’s a four to five X factor right there of the cost difference is that OpenAI is just
2:23:34 making crazy amounts of money because they’re the only one with a capability.
2:23:35 Do they need that money?
2:23:36 Are they using it for R&D?
2:23:40 They’re losing money, obviously, as a company because they spend so much on training, right?
2:23:44 So the inference itself has a very high margin, but it doesn’t recoup the cost of everything
2:23:45 else they’re doing.
2:23:50 So yes, they need that money because the revenue and margins pay for continuing to build the
2:23:51 next thing, right?
2:23:52 As long as they’re raising more money.
2:23:55 So the suggestion is that DeepSeek is really bleeding money?
2:23:56 So here’s one thing, right?
2:24:01 So we’ll get to this in a second, but like DeepSeek doesn’t have any capacity to actually
2:24:02 serve the model.
2:24:03 They stopped signups.
2:24:06 The ability to use it is non-existent now, right?
2:24:09 For most people because so many people are trying to use it, they just don’t have the
2:24:11 GPUs to serve it, right?
2:24:15 OpenAI has hundreds of thousands of GPUs between them and Microsoft to serve their models.
2:24:18 DeepSeek has far fewer, by a large factor, right?
2:24:22 Even if you believe our research, which says 50,000 GPUs, a portion of those are for research,
2:24:24 portion of those are for the hedge fund, right?
2:24:29 They still have nowhere close to the GPU volumes and capacity to serve the model, right?
2:24:30 At scale.
2:24:32 So it is cheaper.
2:24:34 A part of that is OpenAI making a ton of money.
2:24:37 Is DeepSeek making money on their API?
2:24:38 Unknown.
2:24:39 I don’t actually think so.
2:24:41 And part of that is this chart, right?
2:24:43 Look at all the other providers, right?
2:24:46 Together AI, Fireworks AI are very high-end companies, right?
2:24:50 Ex-Meta; Together AI has Tri Dao, the inventor of Flash Attention, right?
2:24:52 Which is a huge efficiency technique, right?
2:24:57 They’re very efficient good companies, and I do know those companies make money, right?
2:24:59 Not tons of money on inference, but they make money.
2:25:03 And so they’re serving at like a five to seven X difference in cost, right?
2:25:07 And so now when you equate, okay, OpenAI is making tons of money, that’s like a five
2:25:08 X difference.
2:25:11 And the companies that are trying to make money for this model is like a five X difference.
2:25:13 There is still a gap, right?
2:25:16 There’s still a gap, and that is just DeepSeek being really freaking good, right?
2:25:20 The model architecture, MLA, the way they did the MOE, all these things, there is like
2:25:22 legitimate just efficiency differences.
2:25:25 Other low-level libraries that we talked about in training, some of them probably translate
2:25:27 to inference, and those weren’t released.
2:25:32 So we may go a bit into conspiracy land, but is it possible the Chinese government is
2:25:34 subsidizing DeepSeek?
2:25:37 I actually don’t think they are.
2:25:43 I think when you look at the Chinese labs, there’s Huawei has a lab, Moonshot AI, there’s
2:25:46 a couple other labs out there that are really close with the government.
2:25:51 And then there’s labs like Alibaba and DeepSeek, which are not close with the government.
2:25:58 And we talked about the CEO, this reverent figure who’s quite different, who has very
2:26:02 different viewpoints based on the Chinese interviews that are translated than what the
2:26:04 CCP might necessarily want.
2:26:06 Now, to be clear, does he have a loss leader?
2:26:08 Because he can fund it through his hedge fund?
2:26:09 Yeah, sure.
2:26:10 So the hedge fund might be subsidizing it?
2:26:11 Yes.
2:26:12 I mean, they absolutely did, right?
2:26:13 Because DeepSeek has not raised much money.
2:26:18 They’re now trying to raise a round in China, but they have not raised money historically.
2:26:20 It’s all just been funded by the hedge fund.
2:26:23 And he owns over half the company, like 50%, 60% of the company’s owned by him.
2:26:27 Some of the interviews, there’s a discussion on how doing this is a recruiting tool.
2:26:31 You see this at the American companies too, it’s like having GPUs, recruiting tool, being
2:26:34 at the cutting edge of AI, recruiting tool.
2:26:35 Open sourcing.
2:26:36 Open sourcing, recruiting tool.
2:26:41 They were so far behind and they got so much talent because they just open sourced stuff.
2:26:42 More conspiracy thoughts.
2:26:47 Is it possible, since they’re a hedge fund, that they timed everything with this release
2:26:56 and the pricing, and they shorted NVIDIA stock and stock of US AI companies, and released
2:27:01 it with just perfect timing to be able to make money?
2:27:02 If they did, boss move.
2:27:04 Like, they released it on Inauguration Day.
2:27:09 They know what’s on the international calendar, but I mean, I don’t
2:27:10 expect them to.
2:27:13 If you listen to their motivations for AI, it’s like…
2:27:14 No, if you…
2:27:16 They released V3 on December 26th.
2:27:18 Who releases the day after Christmas?
2:27:19 No one looks.
2:27:23 They released the papers before this, the V3 paper and the R1 paper, so people had been
2:27:27 looking at them and were like, “Wow,” and then they just released the R1 model.
2:27:31 I think they’re just shipping as fast as they can and who cares about Christmas, who cares
2:27:32 about…
2:27:35 Get it out before Chinese New Year, obviously, which just happened.
2:27:39 I don’t think they actually were timing the market or trying to make the biggest splash
2:27:40 possible.
2:27:41 I think they’re just shipping.
2:27:43 I think that’s one of their big advantages.
2:27:47 We know that a lot of the American companies are very invested in safety, and that is the
2:27:52 central culture of a place like Anthropic, and I think Anthropic sounds like a wonderful
2:27:53 place to work.
2:27:58 But if safety is your number one goal, it takes way longer to get artifacts out.
2:28:01 That’s why Anthropic is not open sourcing things.
2:28:02 That’s their claim.
2:28:04 But there’s reviews internally.
2:28:08 Anthropic mentions things to international governments.
2:28:12 There’s been news of how Anthropic has done pre-release testing with the UK AI Safety Institute.
2:28:16 All of these things add inertia to the process of getting things out, and we’re on this
2:28:19 trend line where the progress is very high.
2:28:23 If you reduce the time from when your model is done training, you run evals, it’s good.
2:28:29 You want to get it out as soon as possible to maximize the perceived quality of your
2:28:30 outputs.
2:28:31 DeepSeek does this so well.
2:28:35 Dario explicitly said Claude 3.5 Sonnet was trained like nine months or a year.
2:28:36 Nine to 10 months ago.
2:28:40 Nine to 10 months ago, and I think it took them another handful of months to release
2:28:41 it.
2:28:46 There is a significant gap here, and especially with reasoning models.
2:28:51 The word on the street in San Francisco is that Anthropic has a better model than 03, and
2:28:52 they won’t release it.
2:28:53 Why?
2:28:56 Because chains of thought are scary, and they are legitimately scary.
2:29:00 If you look at R1, it flips back and forth between Chinese and English.
2:29:03 Sometimes it’s gibberish, and then the right answer comes out.
2:29:04 For you and I, it’s like, “Great.”
2:29:09 It’s like, people are infatuated with it, and you’re telling me this is a high-value
2:29:13 thing, and it works, and it’s doing this, it’s amazing.
2:29:17 You talked about that chain of thought for that philosophical thing, which is not something
2:29:19 they trained it to be philosophically good.
2:29:23 It’s just an artifact of the chain of thought training it did.
2:29:28 That’s super important in that, can I inspect your mind and what you’re thinking right
2:29:29 now?
2:29:30 No.
2:29:32 I don’t know if you’re lying to my face.
2:29:33 Chain of thought models are that way.
2:29:38 This is where the “risk” differs between a chat application, where, “Hey, I asked the model
2:29:43 to say bad words,” or whatever, or how to make anthrax, and it tells me. That’s unsafe,
2:29:47 sure, but that’s something I could get relatively easily anyway.
2:29:51 What if I tell the AI to do a task, and then it does the task all of a sudden randomly
2:29:53 in a way that I don’t want?
2:29:56 Now it’s task versus response, which is very different.
2:29:58 The bar for safety is much higher.
2:30:00 At least this is Anthropic’s case.
2:30:03 For DeepSeek, they’re like, ship, right?
2:30:04 Yeah.
2:30:08 The bar for safety is probably lowered a bit because of DeepSeek.
2:30:10 I mean, there’s parallels here to the space race.
2:30:17 The reason the Soviets probably put a man in space first is because their
2:30:20 bar for safety was lower.
2:30:23 And they killed that dog, right, and all these things, right?
2:30:28 So it’s like less risk averse than the US-based program.
2:30:33 And there’s parallels here, but there’s probably going to be downward pressure on that safety
2:30:35 bar for the US companies, right?
2:30:39 This is something that Dario talks about. That’s the situation that Dario wants
2:30:44 to avoid. He talks about the difference between the race to the bottom and the race to the
2:30:45 top.
2:30:47 And the race to the top is where there’s a very high standard on safety.
2:30:51 There’s a very high standard on how your model performs in certain crucial evaluations.
2:30:55 And when certain companies are really good at it, they will converge.
2:30:56 This is the idea.
2:31:05 And ultimately, AI is not confined to one nationality or to one set of morals for what
2:31:06 it should mean.
2:31:10 And there’s a lot of arguments on like, should we stop open sourcing models?
2:31:13 And if the US stops, it’s pretty clear.
2:31:17 I mean, it’s way easier to see now with DeepSeek that a different international body will be
2:31:19 the one that builds it.
2:31:23 We talk about the cost of training, DeepSeek has this shocking $5 million number.
2:31:27 Think about how many entities in the world can afford 100 times that to have the best
2:31:30 open source model that people use in the world.
2:31:36 And it’s like, it’s a scary reality, which is that these open models are probably going
2:31:39 to keep coming for the time being, whether or not we want to stop them.
2:31:44 And it is, like stopping them might make it even worse and harder to prepare, but it just
2:31:50 means that the preparation and understanding what AI can do is just so much more important.
2:31:55 That’s why I’m here the end of the day, but it’s like letting that sink into people, especially
2:31:58 not in AI is that like this is coming.
2:32:03 There are some structural things in a global interconnected world that you have to accept.
2:32:04 Yeah.
2:32:10 You mentioned something that Mark Zuckerberg mentioned on the earnings call.
2:32:13 He said that I think in light of some of the recent news, the new competitor DeepSeek
2:32:17 from China, I think it’s one of the things that we’re talking about is there’s going
2:32:19 to be an open source standard globally.
2:32:24 And I think for our kind of national advantage, it’s important that it’s an American standard.
2:32:26 So we take that seriously.
2:32:29 We want to build the AI system that people around the world are using.
2:32:34 And I think that if anything, some of the recent news has only strengthened our conviction
2:32:35 that this is the right thing to be focused on.
2:32:36 So yeah, open sourcing.
2:32:37 Yeah.
2:32:44 Mark Zuckerberg is not new to espousing American values in how he presents his company’s trajectory.
2:32:49 Their products have long since been banned in China, and I respect him saying it directly.
2:32:54 And there’s an interesting aspect of just because it’s open-weights or open-source doesn’t
2:32:56 mean it can’t be subverted.
2:33:01 There have been many open-source software bugs that have been– for example, there was a
2:33:06 Linux bug that was found after 10 years, which was clearly a backdoor, because somebody
2:33:09 was like, why is this taking half a second to load?
2:33:10 This is the recent one.
2:33:11 Right?
2:33:12 Why is this taking half a second to load?
2:33:13 And it was like, oh, crap.
2:33:14 There’s a backdoor here.
2:33:15 That’s why.
2:33:19 And it’s like, this is very much possible with AI models.
2:33:23 Today, the alignment of these models is very clear.
2:33:26 I’m not going to say bad words.
2:33:27 I’m not going to teach you how to make anthrax.
2:33:29 I’m not going to talk about Tiananmen Square.
2:33:35 I’m not going to– things like, I’m going to say, “Taiwan is part of–” that’s just an eastern
2:33:36 preference.
2:33:37 Right?
2:33:41 All these things depend on who you are, what you align it to, whether– and even
2:33:44 xAI is aligned a certain way, right?
2:33:47 They might be– it’s not aligned in the like woke sense.
2:33:50 It’s not aligned in the like pro-China sense, but there is certain things that are imbued
2:33:51 within the model.
2:33:55 Now, when you release this publicly in an instruct model that’s open weights, this can
2:33:57 then proliferate, right?
2:34:01 But as these systems get more and more capable, what you can embed deep down in the model
2:34:04 is not as clear, right?
2:34:08 And so there are– that is like one of the big fears is like, if an American model or
2:34:13 a Chinese model is the top model, right, you’re going to embed things that are unclear.
2:34:14 And it can be unintentional, too, right?
2:34:18 Like British English is dead because American LLMs won, right?
2:34:22 And the internet is American, and therefore, like, color is spelled the way Americans spell
2:34:23 it, right?
2:34:24 And this is just–
2:34:25 A lot of strong words right now.
2:34:26 Yeah.
2:34:27 This is just like– this is just the factual nature of the LLMs now.
2:34:28 Yeah, the right way to–
2:34:29 I mean, it’s like Karpathy’s tweet.
2:34:33 English is the hottest programming language, and that English is defined by a bunch of
2:34:36 companies that primarily are in San Francisco.
2:34:42 The right way to spell optimization is with a Z, just in case you– I think it’s an S
2:34:43 in British English.
2:34:44 It is.
2:34:45 I have colleagues that put–
2:34:46 Take something silly, right?
2:34:50 Something as silly as spelling, which, you know, Brits and Americans
2:34:52 will laugh about, probably, right?
2:34:54 I don’t think we care that much.
2:35:00 But, you know, some people will, and this can boil down into very, very important
2:35:04 topics, like, hey, subverting people, right?
2:35:06 You know, chatbots, right?
2:35:11 Character AI has shown that these models can talk to kids or adults, and
2:35:13 people will feel a certain way, right?
2:35:15 And that’s unintentional alignment.
2:35:19 But like, what happens when there’s intentional alignment deep down on the open source standard?
2:35:24 It’s a backdoor today for like Linux, right, that we discover, or some encryption system,
2:35:25 right?
2:35:28 China uses different encryption than NIST defines, the US NIST, because there’s clearly– at
2:35:31 least they think there’s backdoors in it, right?
2:35:36 What happens when the models are backdoors, not just to computer systems, but to our minds?
2:35:38 Yeah, they’re cultural backdoors.
2:35:44 The thing that amplifies the relevance of cultural language models is that we are used
2:35:49 to this mode of interacting with people in back-and-forth conversation.
2:35:56 And we now have a very powerful computer system that slots into a social context
2:36:02 we’re used to, which makes people very– we don’t know the extent to which people can
2:36:03 be impacted by that.
2:36:10 So there could be– this is one– this is an actual concern with a Chinese company that
2:36:16 is providing open-weights models is that there could be some secret Chinese government sort
2:36:21 of requirement for these models to have a certain kind of backdoor, to have some kind
2:36:22 of thing where–
2:36:24 I don’t necessarily think it’ll be a backdoor, right?
2:36:27 Because once it’s open-weights, it doesn’t like phone home.
2:36:32 It’s more about, if it recognizes a certain system– no, no, it could
2:36:36 be a backdoor in the sense of, hey, if you’re building something
2:36:40 in software, all of a sudden it’s a software agent: oh, program this backdoor that only
2:36:41 we know about.
2:36:45 Or it could be like, subvert the mind to think that like, XYZ opinion is the correct one.
2:36:50 And Anthropic has research on this where they show that if you put different phrases– certain
2:36:55 phrases in at pre-training, you can then elicit different behavior when you’re actually using
2:36:58 the model because they’ve like poisoned the pre-training data.
2:37:03 I don’t think– like, as of now, I don’t think anybody in a production system is trying
2:37:05 to do anything like this.
2:37:10 I think it’s mostly– Anthropic is doing very direct work and mostly just subtle things.
2:37:15 We don’t know what these models are going to– how they are going to generate tokens,
2:37:19 what information they’re going to represent, and what the complex representations they
2:37:20 have are.
2:37:25 Well, we’re talking about Anthropic, which is generally just permeated with
2:37:29 good humans trying to do good in the world.
2:37:32 But we just don’t know of any labs–
2:37:41 this would be done in a military context– that are explicitly trained so that, OK,
2:37:49 the front door looks like a happy LLM, but underneath, it’s a thing that will, over time,
2:37:52 do the maximum amount of damage to our quote-unquote enemies.
2:37:57 There’s this very good quote from Sam Altman who, you know, he can be a hype piece sometime,
2:38:01 but one of the things he said– and I think I agree is that superhuman persuasion will
2:38:04 happen before superhuman intelligence, right?
2:38:09 And if that’s the case, then these things before– before we get this AGI/ASI stuff,
2:38:14 we can embed superhuman persuasion towards our ideal or whatever the ideal of the modelmaker
2:38:15 is, right?
2:38:19 And again, like today, I truly don’t believe DeepSeek has done this, right?
2:38:21 But it is a sign of like what could happen.
2:38:25 So one of the dystopian worlds is described by Brave New World.
2:38:32 So we could just be stuck scrolling Instagram, looking at cute puppies or worse, and then
2:38:37 talking to bots that are giving us a narrative and would completely get lost in that world
2:38:41 that’s controlled by somebody else versus thinking independently.
2:38:45 And that’s a major concern as we rely more and more on these kinds of systems.
2:38:48 I mean, we’ve already seen that sort of recommendation systems.
2:38:53 Yeah, recommendation systems hack the dopamine-induced reward circuit, but the brain is a lot more
2:38:57 complicated. What other sort of circuits, quote-unquote feedback loops, in your brain
2:39:03 can you hack/subvert? Recommendation systems are purely just trying to increase
2:39:05 time and ads, et cetera.
2:39:10 But there’s so many more goals that can be achieved through these complicated models.
2:39:14 There’s no reason in some number of years that you can’t train a language model to
2:39:18 maximize time spent on a chat app.
2:39:19 Right now they are trained–
2:39:21 I mean, is that not what character AI has done?
2:39:23 Time per session is like two hours.
2:39:28 Yeah, character AI very likely could be optimizing this where it’s like the way that this data
2:39:31 is collected is naive or it’s like you’re presented a few options and you choose them,
2:39:34 but that’s not the only way that these models are going to be trained.
2:39:39 It’s naive stuff like talk to an anime girl, but it can be like, yeah, this is a risk,
2:39:40 right?
2:39:46 It’s a bit of a cliche thing to say, but over the past year I had a few stretches of time
2:39:51 where I didn’t use social media or the internet at all and just read books and was out in
2:39:59 nature and it clearly has an effect on the mind where it changed– I feel like I’m returning–
2:40:06 of course, I was raised before the internet really took off, but I’m returning to someone–
2:40:09 I know you’re going– I mean, you can see it physiologically.
2:40:15 I’d take three days if I’m backpacking or something and you’re literally breaking down
2:40:16 addiction cycles.
2:40:19 Yeah, I feel like I’m more in control of my mind.
2:40:24 There feels like a sovereignty of intelligence that’s happening when I’m disconnected from
2:40:25 the internet.
2:40:30 I think the more I use the internet and social media, the more other people are controlling
2:40:31 my mind.
2:40:35 That’s definitely a feeling, and then in the future that would be not other people but
2:40:39 algorithms or other people presented to me via algorithms.
2:40:43 I mean, there are already tons of AI bots on the internet and every so– right now it’s
2:40:48 not frequent, but every so often I have replied to one and it instantly replied and I’m
2:40:49 like, “Crap, that was a bot.”
2:40:52 That is just going to become more common.
2:40:53 They’re going to get good.
2:40:58 One of the hilarious things about technology over its history is that the illicit adult
2:41:02 entertainment industry has always adopted technologies first, right?
2:41:09 Whether it was video streaming to where there’s now the independent adult illicit content
2:41:15 creators who have their subscription pages, and there, they actually heavily utilize–
2:41:18 Generative AI has already been like diffusion models and all that is huge there.
2:41:24 But now these subscription-based individual creators do use bots to approximate themselves
2:41:26 and chat with their fans.
2:41:27 People pay a lot for it.
2:41:28 And people pay a lot.
2:41:29 Right?
2:41:32 A lot of times it’s them, but a lot of times there are agencies that do this for these
2:41:35 creators and do it on a mass scale.
2:41:42 The largest creators are able to talk to hundreds or thousands of people at a time because
2:41:43 of these bots.
2:41:45 And so it’s already being used there.
2:41:50 Obviously, video streaming and other technologies have gone there first.
2:41:52 It’s going to come to the rest of society too.
2:41:58 There’s a general concern that models get censored by the companies that deploy them.
2:42:06 In one case, we’ve seen that– and maybe censorship is one word; alignment, maybe via RLHF or
2:42:08 some other way, is another word.
2:42:15 So we saw that with black Nazi image generation with Gemini.
2:42:22 As you mentioned, we also see that with Chinese models refusing to answer what happened in
2:42:25 June 4th, 1989 at Tiananmen Square.
2:42:27 So how can this be avoided?
2:42:33 And maybe can you just in general talk about how this happens and how can it be avoided?
2:42:36 You give multiple examples.
2:42:40 There’s probably a few things to keep in mind here.
2:42:46 One is the kind of Tiananmen Square factual knowledge.
2:42:48 How does that get embedded into the models?
2:42:55 Two is the Gemini, what you called the black Nazi incident, which is when Gemini as a system
2:42:59 had this extra thing put into it that dramatically changed the behavior.
2:43:06 And then three is what most people would call general alignment, RLHF post training.
2:43:10 Each of these have very different scopes in how they are applied.
2:43:14 In order to do– if you’re just going to look at the model weights, in order to audit specific
2:43:20 facts is extremely hard because you have to comb through the pre-training data and look
2:43:25 at all of this and then that’s terabytes of files and look for very specific words or
2:43:26 hints of the words.
2:43:31 So I guess one way to say it is that you can insert censorship or alignment at various
2:43:36 stages in the pipeline and what you referred to now is at the very beginning of the data.
2:43:40 So if you want to get rid of facts in a model, you have to do it at every stage.
2:43:42 You have to do it at the pre-training.
2:43:45 So most people think that pre-training is where most of the knowledge is put into the
2:43:51 model and then you can elicit and move that in different ways, whether through post training
2:43:53 or whether through systems afterwards.
2:43:55 This is where the whole hacking models comes from.
2:44:00 Like, GPT will not tell you how to make anthrax, but if you try really, really hard, you can
2:44:04 eventually get to tell you about anthrax because they didn’t filter it from the pre-training
2:44:05 data set.
2:44:06 Right?
2:44:12 But by the way, removing facts has such an ominous dark feel to it.
2:44:15 Almost think it’s practically impossible because you effectively have to remove them
2:44:17 from the internet.
2:44:18 You’re taking on a–
2:44:24 Did they remove the thing from the subreddits, the MMM?
2:44:25 It gets filtered out.
2:44:26 Right.
2:44:29 So you have quality filters, which are small language models that look at a document and
2:44:31 tell you, like, how good is this text?
2:44:35 Is it close to a Wikipedia article, which is a good thing that we want language models
2:44:36 to be able to imitate?
2:44:40 So couldn’t you do a small language model that filters out mentions of Tiananmen Square
2:44:41 in the data?
2:44:45 Yes, but is it going to catch word play or encoded language at the same time?
2:44:48 I mean, people have been memeing in games and other stuff about
2:44:54 how to say things without saying Tiananmen Square, so there’s always different
2:44:55 ways to do it.
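To make the filtering idea concrete, here is a minimal sketch of what a crude pre-training data filter could look like, assuming a hypothetical quality_score classifier and a placeholder blocklist; real pipelines are far more elaborate, and as just noted, exact-match filters miss word play and encoded references.

```python
# Sketch of a pre-training data filter: a quality score plus a keyword blocklist.
# `quality_score` stands in for a small learned classifier (e.g. "how Wikipedia-like
# is this text?"); here it is a stub so the example runs on its own.

def quality_score(document: str) -> float:
    """Hypothetical stand-in for a small language-model quality classifier."""
    # A real filter would be a trained model; this stub just rewards longer documents.
    return min(len(document.split()) / 500.0, 1.0)

BLOCKLIST = {"example_banned_phrase"}  # placeholder terms, not a real blocklist

def keep_document(document: str, quality_threshold: float = 0.5) -> bool:
    lowered = document.lower()
    if any(term in lowered for term in BLOCKLIST):
        return False  # exact-match filters are easy to evade with paraphrase or code words
    return quality_score(document) >= quality_threshold

corpus = ["a long, well-written article " * 60, "short spam", "text with example_banned_phrase"]
filtered = [doc for doc in corpus if keep_document(doc)]
print(len(filtered), "of", len(corpus), "documents kept")
```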
2:45:00 There’s, hey, the internet as a whole does tend to just have a slight left bias because
2:45:06 it’s always been richer, more affluent, younger people on the internet relative to the rest
2:45:07 of the population.
2:45:11 So there is already inherently a slight left bias on the internet.
2:45:15 So how do you filter things that are this complicated?
2:45:19 And some of these can be factual, nonfactual, but Tiananmen Square is obviously the example
2:45:27 of a factual one, but it gets a lot harder when you’re talking about aligning to an ideal.
2:45:32 And so Grok, for example, Elon’s tried really hard to make the model not be super PC and
2:45:37 woke, but the best way to do pretraining is to throw the whole freaking internet at it.
2:45:40 And then later, figure it out. But at the end of the day, the model at its core now
2:45:42 still has some of these ideals.
2:45:46 You still ingested Reddit slash r slash politics, which is probably the largest political discussion
2:45:49 board in the world that’s freely available to scrape.
2:45:50 And guess what?
2:45:51 That’s left leaning, right?
2:45:56 And so, you know, there are some aspects like that you just can’t censor unless you try
2:45:59 really, really, really, really, really hard.
2:46:05 So the base model will always have some TDS, Trump derangement syndrome because it’s trained
2:46:06 so much.
2:46:12 It’ll have the ability to express it, but what if there’s a wide representation in the
2:46:13 data?
2:46:14 So this is what happens.
2:46:16 This is a lot of what is called post-training.
2:46:21 It’s a series of techniques to get the model on rails of a really specific behavior.
2:46:26 And I mean, you also have the ingested data of Twitter or
2:46:29 Reddit slash r slash The_Donald, which is also super pro-Trump, right?
2:46:32 And then you have fascist subreddits or communist subreddits.
2:46:36 So the model in pretraining ingests everything.
2:46:37 It has no worldview.
2:46:42 Now, it does have some skew, because more of the text is skewed a certain way,
2:46:47 which is generally slightly left, but also somewhat intellectual.
2:46:52 The general internet is just a certain way.
2:46:55 And then, as Nathan’s about to describe eloquently, right?
2:46:57 You can elicit certain things out.
2:46:58 And there’s a lot of history here.
2:47:00 So we can go through multiple examples and what happened.
2:47:06 Llama 2 was a launch where the phrase “too much RLHF” or “too much safety” was
2:47:12 everywhere; that was the whole narrative after Llama 2’s chat models were released.
2:47:16 And the examples are the sorts of things like you would ask Llama 2 chat, how do you kill
2:47:17 a Python process?
2:47:21 And it would say, I can’t talk about killing because that’s a bad thing.
2:47:26 And anyone that is trying to design an AI model will probably agree that that’s just
2:47:28 like, eh, model, you messed up a bit on the training there.
2:47:31 I don’t think they meant to do this, but this was in the model weights.
2:47:35 So this wasn’t necessarily a– there are things called system prompts, which
2:47:41 are, when you’re querying a model, a piece of text that is shown to the model, but not
2:47:42 to the user.
2:47:46 So a fun example is your system prompt could be talk like a pirate.
2:47:50 So no matter what the user says to the model, it’ll respond like a pirate.
2:47:54 In practice, what they are is you are a helpful assistant.
2:47:55 You should break down problems.
2:48:00 If you don’t know about something, don’t tell them. Your date cutoff is this, today’s date
2:48:01 is this.
2:48:03 It’s a lot of really useful context for how can you answer a question well.
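To illustrate, a system prompt is just a hidden first message the model conditions on. Below is a minimal sketch using the common chat-message format; this is only the data structure, not any particular vendor's client or actual production prompt, and the prompt text here is made up.

```python
# The system prompt is text the model sees but the user does not.
messages = [
    {
        "role": "system",
        "content": (
            "You are a helpful assistant. Break problems down step by step. "
            "If you do not know something, say so. "
            "Knowledge cutoff: <date>. Today's date: <date>."
        ),
    },
    {"role": "user", "content": "How do I kill a Python process?"},
]

# The playful example from the conversation: the model will answer in character
# no matter what the user asks.
pirate_messages = [
    {"role": "system", "content": "Talk like a pirate in every response."},
    {"role": "user", "content": "What's the weather like today?"},
]

for turn in messages:
    print(turn["role"], ":", turn["content"][:60])
```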
2:48:06 And Anthropic publishes their system prompt.
2:48:07 Yes.
2:48:08 But I think it’s great.
2:48:10 And there’s a lot of research that goes into this. One of your previous guests, Amanda
2:48:15 Askell, is probably the most knowledgeable person, at least in the combination of
2:48:20 execution and sharing; she’s the person that should talk about system prompts and the character
2:48:21 of models.
2:48:22 Yeah.
2:48:27 And then people should read the system prompts, because you’re trying to nudge the model,
2:48:31 sometimes through extreme politeness, to be a certain way.
2:48:32 And you could use this for bad things.
2:48:37 I mean, we’ve done tests: what if I tell the model to be a dumb model,
2:48:39 which evaluation scores go down?
2:48:43 And it’ll have this behavior where it sometimes says, oh, I’m supposed
2:48:44 to be dumb.
2:48:48 It doesn’t affect math abilities as much, but on something
2:48:52 like human judgment, the quality would drop to the floor.
2:48:57 Let’s go back to post-training specifically. With RLHF around Llama 2, too much
2:49:01 safety prioritization was baked into the model weights.
2:49:05 This makes you refuse things in a really annoying way for users.
2:49:06 It’s not great.
2:49:12 It caused a lot of awareness to be attached to RLHF that it makes the models dumb, and
2:49:13 it stigmatized the word.
2:49:14 It did.
2:49:15 In AI culture.
2:49:20 And as the techniques have evolved, that’s no longer the case. All of these labs
2:49:23 have very fine-grained control over what they get out of the models through techniques
2:49:24 like RLHF.
2:49:28 Different labs are at different levels: on one end
2:49:31 of the spectrum is Google.
2:49:34 And then maybe OpenAI does less, and Anthropic does less.
2:49:38 And then on the other end of the spectrum is xAI, but they all have different
2:49:41 forms of RLHF trying to make them a certain way.
2:49:48 And the important thing to say is that no matter how you want the model to behave,
2:49:51 these RLHF and preference-tuning techniques also improve performance.
2:49:56 So on things like math evals and code evals, there is something innate to these, what
2:49:58 is called contrastive loss functions.
2:49:59 We could start to get into RLHF here.
2:50:04 We don’t really need to, but RLHF also boosts performance on anything from a chat task to
2:50:06 a math problem to a code problem.
2:50:10 So it is becoming a much more useful tool to these labs.
2:50:13 So this kind of takes us through the arc. We’ve talked about pre-training: hard to
2:50:14 get rid of things.
2:50:18 We’ve talked about post-training and how you can mess it up.
2:50:24 It’s a complex multifaceted optimization with 10 to 100 person teams converging at one artifact.
2:50:27 It’s really easy to not do it perfectly.
2:50:29 And then there’s the third case, which is what we talked about Gemini.
2:50:34 The thing about Gemini is this was a served product, where Google has their internal
2:50:35 model weights.
2:50:37 They’ve done all these processes that we talked about.
2:50:41 And in the served product, what came out after this was that they had a prompt rewriting
2:50:45 user queries to boost diversity or something.
2:50:48 And this just made the outputs blatantly wrong.
2:50:52 It was some sort of organizational failure that had this prompt in that position.
2:50:55 And I think Google executives probably have owned this.
2:50:59 I didn’t pay that attention in that detail, but it was just a mess up in execution that
2:51:01 led to this ridiculous thing.
2:51:04 But at the system level, the model weights might have been fine.
2:51:08 So at the very end of the pipeline, there was a rewriting to something like a system
2:51:09 prompt.
2:51:14 It was like the system prompt, or what is called in industry prompt rewriting.
2:51:19 So especially for image models: if you’re using DALL-E or ChatGPT, it can generate you
2:51:20 an image.
2:51:25 You’ll say, draw me a beautiful car. These leading image models
2:51:28 benefit from highly descriptive prompts.
2:51:32 So what would happen is if you do that on chat, a language model behind the scenes will rewrite
2:51:35 the prompt, say, make this more descriptive.
2:51:37 And then that is passed to the image model.
2:51:41 So prompt rewriting is something that is used at multiple levels of industry.
2:51:42 And it’s used effectively for image models.
2:51:47 And the Gemini example is just a failed execution.
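A minimal sketch of the prompt-rewriting idea described here; rewrite_with_llm is a hypothetical stand-in for the behind-the-scenes language model call, not Google's or OpenAI's actual pipeline.

```python
# Sketch of prompt rewriting for an image model: a language model expands a terse
# user prompt into a highly descriptive one before it is passed to the image model.

def rewrite_with_llm(user_prompt: str) -> str:
    """Hypothetical stand-in: a real system would call a language model here."""
    return (
        f"{user_prompt}, highly detailed, dramatic lighting, "
        "wide-angle shot, photorealistic"
    )

def generate_image(user_prompt: str) -> str:
    descriptive_prompt = rewrite_with_llm(user_prompt)
    # image_model(descriptive_prompt) would run here; we just show the handoff.
    return descriptive_prompt

print(generate_image("draw me a beautiful car"))
```

In the Gemini incident, the extra instruction lived at roughly this stage of the pipeline, which is why the served product could misbehave even if the model weights were fine.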
2:51:52 Big philosophical question here with RLHF to generalize.
2:52:00 Where is human input, human in the loop, human data most useful at the current stage?
2:52:06 For the past few years, the highest cost human data has been in these preferences, which
2:52:11 is comparing, I would say highest cost and highest total usage.
2:52:15 So a lot of money has gone to these pairwise comparisons where you have two model outputs
2:52:19 and a human is comparing between the two of them.
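For concreteness, here is a sketch of what a single pairwise preference record and the standard Bradley-Terry-style reward-model objective look like; the record fields are illustrative, not any lab's actual schema.

```python
# Sketch of pairwise preference data and the standard reward-model objective.
# A human sees two completions for the same prompt and picks the better one;
# a reward model is then trained so that r(chosen) > r(rejected).
import math

preference_record = {
    "prompt": "Explain why the sky is blue.",
    "chosen": "Sunlight scatters off air molecules; shorter blue wavelengths scatter most...",
    "rejected": "The sky is blue because blue is the color of the ocean reflected upward.",
}

def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen completion is preferred."""
    return -math.log(1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected))))

# If the reward model scores the chosen answer higher, the loss is small.
print(bradley_terry_loss(2.0, 0.5))   # chosen scored higher -> low loss
print(bradley_terry_loss(0.5, 2.0))   # reversed -> high loss
```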
2:52:22 In earlier years, there was a lot of this instruction tuning data.
2:52:28 So creating highly specific examples to something like a Reddit question to a domain that you
2:52:29 care about.
2:52:31 Language models used to struggle on math and code.
2:52:34 So you would pay experts in math and code to come up with questions and write detailed
2:52:37 answers that were used to train the models.
2:52:43 Now it is the case that there are many model options that are way better than humans at
2:52:47 writing detailed and eloquent answers for things like math and code.
2:52:52 So they talked about this with the Llama 3 release, where they switched to using Llama
2:52:55 3 405B to write their answers for math and code.
2:53:00 But in their paper they talk about how they use extensive human preference data, which
2:53:03 is something that they haven’t gotten AIs to replace.
2:53:06 There are other techniques in industry like constitutional AI where you use human data
2:53:08 for preferences and AI for preferences.
2:53:12 And I expect the AI part to scale faster than the human part.
2:53:18 But in the research that we have access to, humans are still in this kind of preference
2:53:19 loop.
2:53:24 So as reasoning becomes bigger and bigger and bigger, as we said, where’s the role of
2:53:25 humans in that?
2:53:27 It’s even less prevalent.
2:53:32 So the remarkable thing about these reasoning results, and especially the DeepSeek R1 paper,
2:53:37 is this result that they call DeepSeek R1-Zero, which is they took one of these pre-trained
2:53:40 models, they took DeepSeek V3 base.
2:53:44 And then they do this reinforcement learning optimization on verifiable questions or verifiable
2:53:48 rewards for a lot of questions and a lot of training.
2:53:51 And these reasoning behaviors emerge naturally.
2:53:54 So these things like wait, let me see, wait, let me check this.
2:53:56 Oh, that might be a mistake.
2:53:59 And they emerge from only having questions and answers.
2:54:03 And when you’re using the model, the part that you look at is the completion.
2:54:08 So in this case, all of that just emerges from this large scale RL training.
2:54:14 And that model, which the weights are available, has no human preferences added into the post
2:54:15 training.
2:54:20 The full DeepSeek R1 model has some of this human preference tuning, this RLHF,
2:54:22 after the reasoning stage.
2:54:26 But the very remarkable thing is that you can get these reasoning behaviors.
2:54:29 And it’s very unlikely that there’s humans writing out reasoning chains.
2:54:33 It’s very unlikely that they somehow hacked OpenAI and they got access to
2:54:35 01’s reasoning chains.
2:54:40 It’s something about the pre-trained language models and this RL training where you reward
2:54:42 the model for getting the question right.
2:54:47 And therefore it’s trying multiple solutions, and this chain of thought emerges.
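A toy sketch of the verifiable-rewards idea, not DeepSeek's actual implementation: sample several completions per question, score each one purely by whether the final answer checks out, and use that score as the training signal; real systems run GRPO or PPO-style policy updates on top of exactly this kind of reward, and sample_completion here is a hypothetical stub.

```python
# Toy sketch of reinforcement learning from verifiable rewards.
import random

def sample_completion(question: str) -> str:
    """Hypothetical policy: emits a (possibly wrong) chain of thought and an answer."""
    guess = random.choice(["4", "5"])
    return f"Let me check... 2 + 2 should be {guess}. Answer: {guess}"

def extract_answer(completion: str) -> str:
    return completion.split("Answer:")[-1].strip()

def verifiable_reward(completion: str, reference: str) -> float:
    return 1.0 if extract_answer(completion) == reference else 0.0

question, reference = "What is 2 + 2?", "4"
samples = [sample_completion(question) for _ in range(8)]
rewards = [verifiable_reward(s, reference) for s in samples]
# A real trainer would now increase the probability of the high-reward samples.
print(f"mean reward over {len(samples)} samples:", sum(rewards) / len(rewards))
```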
2:54:53 This might be a good place to mention the eloquent and insightful tweet
2:54:56 of the great and powerful Andrej Karpathy.
2:55:00 I think he had a bunch of thoughts, but one of them: “Last thought. Not sure if this
2:55:04 is obvious.” You know something profound is coming when he’s saying he’s not sure if
2:55:05 it’s obvious.
2:55:10 There are two major types of learning in both children and in deep learning.
2:55:15 There’s one, imitation learning, watch and repeat, i.e., pre-training, supervised fine-
2:55:19 tuning, and two, trial-and-error learning, reinforcement learning.
2:55:22 My favorite simple example is AlphaGo.
2:55:25 One is learning by imitating expert players.
2:55:28 Two is reinforcement learning to win the game.
2:55:34 Almost every single shocking result of deep learning and the source of all magic is always
2:55:35 two.
2:55:37 Two is significantly more powerful.
2:55:39 Two is what surprises you.
2:55:43 Two is when the paddle learns to hit the ball behind the blocks in Breakout.
2:55:47 Two is when AlphaGo beats even Lee Sedol.
2:55:53 And two is the aha moment when DeepSeek or 01, et cetera, discovers that it works
2:55:59 well to reevaluate your assumptions, backtrack, try something else, et cetera.
2:56:04 It’s the solving strategies you see this model use in its chain of thought.
2:56:07 It’s how it goes back and forth thinking to itself.
2:56:12 These thoughts are emergent, three exclamation points.
2:56:17 And this is actually seriously incredible, impressive and new, and is publicly available
2:56:18 and documented.
2:56:24 The model could never learn this with imitation, because the cognition of the model
2:56:27 and the cognition of the human labeler are different.
2:56:32 The human would never know to correctly annotate these kinds of solving strategies and what
2:56:34 they should even look like.
2:56:38 They have to be discovered during reinforcement learning as empirically and statistically useful
2:56:39 towards the final outcome.
2:56:43 Anyway, the AlphaZero sort of metaphor analogy here.
2:56:48 Can you speak to that, the magic of the chain of thought that he’s referring to?
2:56:52 I think it’s good to recap AlphaGo and AlphaZero because it plays nicely with these analogies
2:56:54 between imitation learning and learning from scratch.
2:57:00 So AlphaGo, the beginning of the process, was learning from humans. It was the
2:57:06 first expert-level Go or chess player in DeepMind’s series of models,
2:57:07 and they had some human data.
2:57:12 And then why it is called AlphaZero is that there was zero human data in the loop.
2:57:17 And that change to AlphaZero made a model that was dramatically more powerful for DeepMind.
2:57:23 So this remove of the human prior, the human inductive bias makes the final system far
2:57:24 more powerful.
2:57:29 We mentioned bitter lesson hours ago, and this is all aligned with this.
2:57:33 And then there’s been a lot of discussion and language models.
2:57:34 This is not new.
2:57:40 This goes back to the whole Q* rumors, which if you piece together the pieces is probably
2:57:46 the start of OpenAI figuring out its 01 stuff when last year in November, the Q* rumors
2:57:47 came out.
2:57:53 There’s a lot of intellectual drive to know when is something like this going to happen
2:57:57 with language models because we know these models are so powerful and we know it has been
2:57:59 so successful in the past.
2:58:05 And it is a reasonable analogy that this new type of reinforcement learning training for
2:58:08 reasoning models is when the doors open to this.
2:58:15 We don’t yet have the equivalent of move 37, which is the famous move where DeepMind’s
2:58:18 AI playing Go stumped Lee Sedol completely.
2:58:22 We don’t have something that’s that level of a focal point, but that doesn’t mean that
2:58:25 the approach to the technology is different, and the impact of the general training–
2:58:27 it’s still incredibly new.
2:58:28 What do you think that point would be?
2:58:32 What would be move 37 for chain of thought, for reasoning?
2:58:33 Scientific discovery.
2:58:38 You use this sort of reasoning problem and it’s just something we fully don’t expect.
2:58:40 I think it’s actually probably simpler than that.
2:58:46 It’s probably something related to computer use or robotics rather than scientific discovery.
2:58:51 Because the important aspect here is models take so much data to learn.
2:58:54 They’re not sample efficient.
2:58:59 They take the entire web over 10 trillion tokens to train on.
2:59:03 This would take a human thousands of years to read.
2:59:09 A lot of that stuff, models know better than us.
2:59:11 Humans are way, way, way more sample efficient.
2:59:13 That is because of the self-play.
2:59:18 How does a baby learn what its body is as it sticks its foot in its mouth and it says,
2:59:20 “Oh, this is my body.”
2:59:25 It sticks its hand in its mouth and it calibrates its touch on its fingers with the most sensitive
2:59:29 touch thing, its tongue. That’s how babies learn.
2:59:32 It’s just self-play over and over and over and over again.
2:59:38 Now we have something that is similar to that with these verifiable proofs, whether it’s
2:59:46 a unit test and code or a mathematical verifiable task, generate many traces of reasoning.
2:59:47 Keep branching them out.
2:59:48 Keep branching them out.
2:59:51 Then check at the end, “Hey, which one actually has the right answer?”
2:59:52 Most of them are wrong.
2:59:53 Great.
2:59:54 These are the few that are right.
2:59:57 Maybe we use some sort of reward model outside of this to select even the best one to preference
2:59:58 as well.
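A minimal sketch of that branch-and-prune loop under stated assumptions: generate_trace and reward_model_score are hypothetical stubs standing in for the policy model and an outside reward model, and the arithmetic question is just a toy verifiable task.

```python
# Sketch of the branch-and-prune idea: generate many reasoning traces per problem,
# keep only the ones whose final answer verifies, and optionally rank the survivors
# with a reward model to pick the best.
import random

def generate_trace(question: str) -> str:
    """Hypothetical policy model: returns a reasoning trace ending in an answer."""
    answer = random.choice(["42", "41", "42", "40"])
    return f"Reasoning about '{question}'... Answer: {answer}"

def is_correct(trace: str, reference: str) -> bool:
    return trace.split("Answer:")[-1].strip() == reference

def reward_model_score(trace: str) -> float:
    """Hypothetical reward model: here it simply prefers shorter traces."""
    return -len(trace)

question, reference = "What is 6 * 7?", "42"
traces = [generate_trace(question) for _ in range(16)]        # branch out
survivors = [t for t in traces if is_correct(t, reference)]   # prune wrong answers
best = max(survivors, key=reward_model_score) if survivors else None
print(f"{len(survivors)}/{len(traces)} traces verified; best kept:")
print(best)
```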
3:00:00 Now you’ve started to get better and better at these benchmarks.
3:00:05 You’ve seen, over the last six months, a skyrocketing in a lot of different benchmarks, right?
3:00:09 All math and code benchmarks were pretty much solved except for FrontierMath, which is
3:00:16 designed to be questions that aren’t practical to most people because they’re exam-
3:00:19 level, open math problem type things.
3:00:23 It’s on the math problems that are somewhat reasonable, which is somewhat complicated
3:00:25 word problems or coding problems.
3:00:27 It’s just what Dylan is saying.
3:00:31 The thing here is that these are only with verifiable tasks.
3:00:35 Earlier I showed an example of the really interesting thing that happens when you apply chain of thought
3:00:36 to a non-verifiable domain.
3:00:42 It’s just like a human chatting, thinking about what’s novel for humans, a unique thought.
3:00:48 But this task and form of training only works when it’s verifiable.
3:00:53 From here, the thought is, “Okay, we can continue to scale this current training method by increasing
3:00:55 the number of verifiable tasks.”
3:00:58 In math and coding, coding probably has a lot more to go.
3:01:02 Math has a lot less to go in terms of what are verifiable things.
3:01:07 Can I create a solver that then I generate trajectories toward or reasoning traces towards
3:01:11 and then prune the ones that don’t work and keep the ones that do work?
3:01:14 Those are going to be solved pretty quickly, but even if you’ve solved math, you have not
3:01:17 actually created intelligence.
3:01:24 This is where I think the aha moment of computer use or robotics will come in because now you
3:01:28 have a sandbox or a playground that is infinitely verifiable.
3:01:32 Did you … Messing around on the internet, there are so many actions that you can do
3:01:33 that are verifiable.
3:01:37 It’ll start off with login to a website, create an account, click a button here, blah, blah,
3:01:38 blah.
3:01:41 But it’ll then get to the point where it’s, “Hey, go do a task on Tasker,” or whatever
3:01:47 these other, all these various task websites, “Hey, go get hundreds of likes,” and it’s
3:01:48 going to fail.
3:01:49 It’s going to spawn hundreds of accounts.
3:01:50 It’s going to fail on most of them.
3:01:51 But this one got to 1,000.
3:01:52 Great.
3:01:53 It’s going to reach the verifiable thing.
3:01:57 You just keep iterating this loop over and over, and same with robotics.
3:02:01 That’s where you have an infinite playground of tasks like, “Hey, did I put the ball in
3:02:02 the bucket?”
3:02:04 All the way to, “Oh, did I build a car?”
3:02:09 There’s a whole trajectory to speedrun or what models can do.
3:02:14 But at some point, I truly think that we’ll spawn models, and initially all the training
3:02:15 will be in sandboxes.
3:02:19 But then at some point, the language model pre-training is going to be dwarfed by what
3:02:24 is this reinforcement learning … You’ll pre-train a multimodal model that can see,
3:02:28 that can read, that can write, blah, blah, blah, whatever, vision, audio, et cetera.
3:02:34 But then you’ll have it play in a sandbox infinitely, figure out math, figure out code,
3:02:37 figure out navigating the web, figure out operating a robot arm.
3:02:42 And then it’ll learn so much, and the aha moment, I think, will be when this is available
3:02:45 to then create something that’s not good.
3:02:46 Like, “Oh, cool.
3:02:47 Part of it was figuring out how to use the web.
3:02:52 Now, all of a sudden, it’s figured out really well how to just get hundreds of thousands
3:02:55 of followers that are real and real engagement on Twitter, because all of a sudden, this
3:02:57 is one of the things that are verifiable.”
3:02:59 And maybe not just engagement, but make money.
3:03:00 Yes, of course.
3:03:08 I mean, that could be the thing where almost fully automated, it makes $10 million by being
3:03:12 an influencer selling a product, creating the product.
3:03:17 And I’m not referring to a hype product, but an actual product, like, “Holy shit.
3:03:19 This thing created a business.
3:03:20 It’s running it.
3:03:23 It’s the face of the business,” that kind of thing.
3:03:29 Or maybe a number one song, like, it creates the whole infrastructure required to create
3:03:32 the song, to be the influencer that represents that song, that kind of thing.
3:03:33 It makes a lot of money.
3:03:34 That could be the…
3:03:38 I mean, our culture respects money in that kind of way.
3:03:40 And it’s verifiable, right?
3:03:41 It’s verifiable.
3:03:42 All right.
3:03:43 The bank account can’t lie.
3:03:44 Exactly.
3:03:48 There’s surprising evidence that once you set up the ways of collecting the verifiable
3:03:55 domain that this can work, there’s been a lot of research before this R1 on math problems.
3:03:59 And they approach math with language models just by increasing the number of samples.
3:04:01 So you can just try again and again and again.
3:04:05 And you look at the amount of times that the language models get it right.
3:04:10 And what we see is that even very bad models get it right sometimes.
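As a small illustration of the "just sample more" observation, here is the unbiased pass@k estimator commonly used in sampling evaluations; it shows how often a weak model solves a problem within k attempts.

```python
# pass@k: probability that at least one of k sampled completions is correct,
# estimated from n samples of which c were correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# A model that is right on only 5 of 100 samples still "passes" most problems at k=64.
print(round(pass_at_k(n=100, c=5, k=1), 3))   # ~0.05
print(round(pass_at_k(n=100, c=5, k=64), 3))  # ~0.99
```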
3:04:14 And the whole idea behind reinforcement learning is that you can learn from very sparse rewards.
3:04:20 So the space of language and the space of tokens, whether you’re generating language
3:04:25 or tasks for a robot, is so big that you might say that it’s like, I mean, each…
3:04:27 The tokenizer of our language model can be like 200,000 things.
3:04:30 So at each step, it can sample from that big of a space.
3:04:36 So if it can generate a bit of a signal that it can climb onto, that’s what the whole field
3:04:39 of RL is around, is learning from sparse rewards.
3:04:43 And the same thing has played out in math where it’s like very weak models that sometimes
3:04:44 generate answers.
3:04:47 We see research already that you can boost their math scores.
3:04:50 You can do this sort of RL training for math.
3:04:54 It might not be as effective, but if you take a one billion parameter model, so something
3:04:59 600 times smaller than DeepSeek, you can boost its grade school math scores very directly
3:05:02 with a small amount of this training.
3:05:05 So it’s not to say that this is coming soon.
3:05:09 Setting up the verification domains is extremely hard and there’s a lot of nuance in this.
3:05:15 But there are some basic things that we have seen before where it’s at least expectable
3:05:17 that there’s a domain and there’s a chance that this works.
3:05:18 All right.
3:05:20 So we have fun things happening in real time.
3:05:26 This is a good opportunity to talk about other reasoning models 01, 03.
3:05:32 Just now, OpenAI, as perhaps expected, released 03 mini.
3:05:35 What are we expecting from the different flavors?
3:05:41 Can you just lay out the different flavors of the 01 models and, from Gemini, the reasoning
3:05:42 model?
3:05:44 Something I would say about these reasoning models is we talked a lot about reasoning
3:05:47 training on math and code.
3:05:49 And what is done is that you have the base model.
3:05:51 We’ve talked about a lot on the internet.
3:05:54 You do this large scale reasoning training with reinforcement learning.
3:06:00 And then what DeepSeek detailed in this R1 paper, which for me answers one of the
3:06:06 big open questions on how you do this, is that they did reasoning-heavy but very standard
3:06:09 post-training techniques after the large-scale reasoning RL.
3:06:14 So they did the same things with a form of instruction tuning through rejection sampling,
3:06:18 which is essentially heavily filtered instruction tuning with some reward models.
3:06:22 And then they did this RLHF, but they made it math heavy.
3:06:28 So some of this transfers; we looked at this philosophical example earlier on.
3:06:31 One of the big open questions is how much does this transfer?
3:06:36 If we bring in domains after the reasoning training, are all the models going to become
3:06:37 eloquent writers by reasoning?
3:06:39 Is this philosophy stuff going to be open?
3:06:42 We don’t know in the research of how much this will transfer.
3:06:45 There’s other things about how we can make soft verifiers and things like this, but there
3:06:51 is more training after reasoning, which makes it easier to use these reasoning models.
3:06:52 And that’s what we’re using right now.
3:06:55 So if we’re going to talk about 03 mini and 01, these have gone through these
3:07:00 extra techniques that are designed for human preferences after being trained to elicit
3:07:01 reasoning.
3:07:06 I think one of the things that people are ignoring is Google’s Gemini flash thinking
3:07:10 is both cheaper than R1 and better.
3:07:11 And they released it in the beginning of December.
3:07:12 And nobody’s talking about it.
3:07:13 No one cares.
3:07:14 It has a different flavor to it.
3:07:19 Its behavior is less expressive than something like 01; it’s a bit more constrained.
3:07:25 Qwen released a model last fall, QwQ, which was their preview reasoning model.
3:07:29 And DeepSeek had R1-Lite last fall, where these models kind of felt like they’re on
3:07:33 rails, where they really, really only can do math and code.
3:07:35 And 01 is it can answer anything.
3:07:41 It might not be perfect for some tasks, but it’s flexible and has some richness to it.
3:07:46 And this is kind of the art of, like, cooking: was the model a little bit undercooked?
3:07:50 It’s good to get a model out the door, but it’s hard to gauge, and it
3:07:54 takes a lot of taste to be like, is this a full-fledged model?
3:07:55 Can I use this for everything?
3:07:58 And they’re probably more similar for math and code.
3:08:05 My quick read is that Gemini flash is like not trained the same way as 01, but taking
3:08:08 an existing training stack, adding reasoning to it.
3:08:11 So taking a more normal training stack and adding reasoning to it.
3:08:13 And I’m sure they’re going to have more.
3:08:17 I mean, they’ve done quick releases on Gemini flash, the reasoning, and this is the second
3:08:20 version from the holidays.
3:08:25 It’s evolving fast and it takes longer to make this training stack where you’re doing
3:08:26 this large scale RL.
3:08:31 Ask it the same question from earlier, the one about the human nature.
3:08:32 Yeah.
3:08:35 What was the human nature one?
3:08:39 The reason why I can ramble about this so much is that we’ve been working on
3:08:45 this at AI2 before 01 was fully available to everyone and before R1, which is essentially
3:08:47 using this RL training for fine-tuning.
3:08:50 We use this in our Tülu series of models.
3:08:56 And you can elicit the same behaviors, where it says things like “wait” and so on, but it’s
3:09:01 subtler in the training process, so this kind of reasoning expression is much lighter.
3:09:04 So there’s essentially a gradation, and just how much of this RL training you
3:09:07 put into it determines how the output looks.
3:09:15 So we’re now using Gemini 2.0 Flash Thinking Experimental 01-21.
3:09:20 It summarized the prompt as humans self-domesticated apes.
3:09:21 The perspective.
3:09:22 Okay.
3:09:23 All right.
3:09:25 So wait, is this reviewing the reasoning?
3:09:27 Here’s why this is novel.
3:09:28 Okay.
3:09:29 Click to expand.
3:09:30 Click to expand.
3:09:31 Okay.
3:09:33 Analyze the request.
3:09:34 Novel is the keyword.
3:09:37 See how it just looks a little different.
3:09:39 It looks like a normal output.
3:09:40 Yeah.
3:09:41 Yes.
3:09:43 I mean, in some sense, it’s better structured.
3:09:45 It makes more sense.
3:09:50 Oh, when it latched onto human and then it went into organisms and oh, wow.
3:09:56 Apex predator, focus on domestication, apply domestication to humans, explore the idea
3:09:57 of self-domestication.
3:09:58 Not good.
3:09:59 Not good.
3:10:02 Where is this going?
3:10:08 Refine, articulate the insight, greater facial expressiveness and communication ability.
3:10:09 Yes.
3:10:10 Yes.
3:10:11 Plasticity and adaptability.
3:10:12 Yes.
3:10:13 Dependence on social groups.
3:10:14 Yes.
3:10:15 All right.
3:10:17 And self-critique and refined further.
3:10:19 Wow.
3:10:20 Is this truly novel?
3:10:23 Is it well supported?
3:10:25 So on and so forth.
3:10:29 And the insight it’s getting at is humans are not just social animals, but profoundly
3:10:32 self-domesticated apes.
3:10:37 And the self-domestication is the key to understanding our unique cognitive and social abilities.
3:10:39 Self-domesticated apes.
3:10:40 Self-domest…
3:10:42 I prefer the deep-seek response.
3:10:43 Self-domest…
3:10:48 I mean, it’s novel, the insight is novel.
3:10:53 I mean, that’s like a good book title, self-domesticated apes, like there could be a case made for
3:10:54 that.
3:10:55 I mean, yeah, it’s cool.
3:10:58 And it’s revealing the reasoning, it’s magical.
3:10:59 It’s magical.
3:11:01 Like, this is really powerful.
3:11:04 Hello, everyone.
3:11:09 This is Lex with a quick intermission, recorded after the podcast.
3:11:14 Since we reviewed responses from DeepSeek R1 and Gemini Flash 2.0 Thinking during this
3:11:20 conversation, I thought at this moment, it would be nice to insert myself quickly doing
3:11:28 the same for OpenAI 01 Pro and 03 Mini with the same prompt, the prompt being give one
3:11:32 truly novel insight about humans.
3:11:40 And I thought I would, in general, give my vibe check and vibe-based anecdotal report
3:11:46 on my own experiences with the new 03 Mini model, now that I’ve got a chance to spend
3:11:49 many hours with it in different kinds of contexts and applications.
3:11:56 So I would probably categorize this question as, let’s say, open-ended philosophical question.
3:12:03 And in particular, the emphasis on novelty, I think is a nice way to test one of the capabilities
3:12:09 of the model, which is come up with something that makes you pause and almost surprise you
3:12:11 with its brilliance.
3:12:16 So that said, my general review, after running each of the models on this question a bunch
3:12:22 of times, is that 01 Pro consistently gave brilliant answers.
3:12:29 Because they gave me pause and made me think, both cutting in its insight and just really
3:12:36 nicely phrased with wit, with clarity, with nuance, over and over consistently generating
3:12:37 the best answers.
3:12:43 After that is R1, which is less consistent, but again, delivered brilliance.
3:12:46 Gemini Flash 2.0 Thinking was third.
3:12:50 And last was 03 Mini, actually.
3:12:55 It often gave quite a generic answer, at least to my particular sensibilities.
3:13:01 That said, in a bunch of other applications that I tested for brainstorming purposes,
3:13:07 it actually worked extremely well and often outperformed R1.
3:13:11 But on this open-ended philosophical question, it did consistently worse.
3:13:16 Now, another important element for each of these models is how the reasoning is presented.
3:13:23 DeepSeek R1 shows the full chain of thought tokens, which I personally just love.
3:13:27 For these open-ended philosophical questions, it’s really, really interesting to see the
3:13:28 model think through it.
3:13:34 But really also just stepping back, me as a person who appreciates intelligence and reasoning
3:13:40 and reflection, reading these kind of chain of thought raw tokens of R1, there’s something
3:13:48 genuinely beautiful about observing the path of deliberation in an intelligence system.
3:13:55 I think we don’t always have that explicitly laid out for us humans, so to see it in another
3:14:01 intelligence system, the non-linearity of it, akin to Ulysses or Finnegans Wake by
3:14:03 James Joyce, it’s just beautiful to watch.
3:14:09 Anyway, as we discussed in the episode DeepSeek R1, talked about humans being able to convert
3:14:14 selfish desires into cooperative systems by collectively pretending abstract rules like
3:14:21 money laws and rights are real, and the shared hallucinations act as games, where competition
3:14:26 is secretly redirected to benefit the group, turning conflict into society’s fuel.
3:14:32 Gemini 2.0 Flash Thinking said, “Humans are not just social animals, but self-domesticated
3:14:37 apes, and this self-domestication is the key to understanding our unique cognitive and
3:14:38 social abilities.”
3:14:43 Now, it’s important to say that the chain of thought there was really interesting.
3:14:50 It was looking through the entire evolution of life on Earth, considering apex predators,
3:14:55 and considering how from that we ended up to where we are.
3:14:59 I think that domestication by choice is a really interesting angle.
3:15:04 Again, it’s one of those things when somebody presents a different angle on a seemingly
3:15:06 obvious thing, it just makes me smile.
3:15:12 And the same with DeepSeek R1, that these hallucinations of money, laws, and rights,
3:15:18 and us collectively pretending like it’s real, and we play games with them that look like
3:15:22 competition when secretly we’re just cooperating with each other.
3:15:25 And that is the fuel of progress, beautifully put.
3:15:30 Now, OpenAI 01 Pro consistently, over and over, delivered bangers.
3:15:34 I can go through many of them, but the first one was, “Humans are the only species that
3:15:40 turns raw materials into symbolic resources, then uses those symbols to reorganize the
3:15:46 very materials they came from, creating a closed feedback loop between meaning and matter.”
3:15:52 Here, I just ran it again, banger after banger, I’m telling you, humans are unique among
3:15:57 known species in that they simultaneously rewrite two layers of reality, the external
3:16:04 world and their own private mental landscapes, and then merge these two rewritten layers
3:16:12 into a continuous personal narrative that feels objectively true, feels true.
3:16:13 This is poetry.
3:16:23 Okay, and then o3 Mini High for me was smart, fast actually, and kind of generic.
3:16:25 Never quite got there for me.
3:16:31 So here’s the first one I got from o3 Mini, “Humans are not fixed beings, but rather
3:16:37 ongoing narratives, dynamic stories that we continuously write, edit, and reinterpret.
3:16:42 This narrative plasticity is more than just memory or self-reflection, it’s an intrinsic
3:16:48 cognitive process that acts like an internal error correction system, it allows us to adapt
3:16:53 our identities and values over time in response to new experiences, challenges, and social
3:16:54 context.”
3:17:00 Now, it almost sneaks up to something approximating cutting insight with narrative plasticity
3:17:05 in quotes, but then it goes back to the sort of generic. I don’t know, all of these
3:17:08 models are incredible for different reasons.
3:17:13 There’s a lot of concerns as we discussed in this episode, but there’s a lot of reasons
3:17:16 to be excited as well.
3:17:18 And I’ve probably spoken for too long.
3:17:26 I am severely sleep deprived, borderline delirious, so hopefully some of this made sense.
3:17:31 And now, dear friends, back to the episode.
3:17:38 I think to Nathan’s point, when you look at the reasoning models, to me, even when I
3:17:46 used R1 versus o1, there was that sort of rough-around-the-edges feeling, right?
3:17:50 And Flash Thinking earlier, I didn’t use this version, but the one from December, and it
3:17:53 definitely had that rough-around-the-edges feeling, right, where it’s just not
3:17:56 fleshed out in as many ways, right?
3:18:02 Sure, they added math and coding capabilities via these verifiers in RL, but it feels like
3:18:07 they lost something in certain areas, and o1 is worse performing than the chat models in many areas
3:18:09 as well, to be clear.
3:18:10 Not by a lot.
3:18:11 Not by a lot though, right?
3:18:16 And it’s like R1 definitely felt to me like it was worse than V3 in certain areas, like
3:18:21 doing this RL expressed and learned a lot, but then it weakened in other areas.
3:18:28 And so I think that’s one of the big differences between these models, and what o1 offers.
3:18:30 And then OpenAI has o1 Pro.
3:18:35 And what they did with o3, which is also very unique, is that they stacked search on top
3:18:37 of Chain of Thought, right?
3:18:41 And so Chain of Thought is one thing where it’s able, it’s one chain, it back tracks,
3:18:46 goes back and forth, but how they solved the ARC-AGI challenge was not just the Chain of
3:18:47 Thought.
3:18:52 It was also sampling many times, i.e. running them in parallel, and then selecting.
3:18:54 Is running in parallel actually search?
3:18:58 Because I don’t know if we have the full information on how o1 Pro works, or like I’m not, I don’t
3:19:01 have enough information to confidently say that it is search.
3:19:02 It is parallel samples.
3:19:03 Yeah.
3:19:04 And then what?
3:19:05 And it selects something.
3:19:06 And we don’t know what the selection function is.
3:19:11 The reason why we’re debating is because since o1 was announced, there’s been a lot of interest
3:19:15 in techniques called Monte Carlo tree search, which is where you will break down the chain
3:19:17 of thought into intermediate steps.
3:19:19 We haven’t defined Chain of Thought.
3:19:23 Chain of Thought is from a paper from years ago that introduced the idea of asking a
3:19:27 language model, which at the time was much less easy to use, to reason in steps.
3:19:29 You would say, let’s verify step by step.
3:19:32 And it would induce the model to do this bulleted list of steps.
3:19:36 Chain of Thought is now almost a default in models, where if you ask it a math question,
3:19:39 you don’t need to tell it to think step by step.
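To make that concrete, here is a minimal sketch of chain-of-thought prompting. It assumes the current OpenAI Python SDK and uses a placeholder model name; the technique itself is nothing more than appending a step-by-step cue to the prompt.

```python
# Minimal chain-of-thought prompting sketch (assumes the OpenAI Python SDK,
# openai>=1.0, with an API key in the environment; the model name is a placeholder).
from openai import OpenAI

client = OpenAI()

question = "A train travels 60 miles in 1.5 hours. What is its average speed?"
cot_prompt = question + "\nLet's verify step by step before giving the final answer."

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": cot_prompt}],
)
print(response.choices[0].message.content)  # intermediate steps, then the answer
```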
3:19:43 And the idea with Monte Carlo tree search is that you would take an intermediate point in
3:19:47 that chain, do some sort of expansion, spend more compute, and then just select the right
3:19:48 one.
3:19:52 That’s a very complex form of search that has been used in things like MuZero and AlphaZero,
3:19:53 potentially.
3:19:55 I know MuZero does this.
3:19:59 Another form of search is just asking five different people and then taking the majority
3:20:00 answer.
3:20:01 Yes.
3:20:04 There’s a variety of– it could be complicated, it could be simple.
3:20:08 We don’t know what it is, just that they are not just issuing one chain of thought in
3:20:09 sequence.
3:20:14 They are launching many in parallel, and on ARC-AGI, they launched 1,000 in parallel
3:20:19 for the one that really shocked everyone, that beat the benchmark.
3:20:22 They would launch 1,000 in parallel, and then they would get the right answer, like 80 percent
3:20:25 of the time or 70 percent of the time, 90 maybe even.
3:20:28 Whereas if they just launched one, it was like 30 percent.
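As a rough sketch of the simplest version of what that could look like, here is parallel sampling plus a majority vote over final answers (self-consistency). Everything below is a toy stand-in; nothing is publicly known about OpenAI's actual selection function.

```python
# Toy sketch of parallel sampling with majority voting over final answers.
import random
from collections import Counter

def sample_answers(prompt, n, generate):
    """Draw n independent chain-of-thought samples; `generate` is any
    prompt -> final-answer function (here a fake stand-in for a model)."""
    return [generate(prompt) for _ in range(n)]

def majority_vote(answers):
    """Self-consistency selection: return the most common final answer."""
    return Counter(answers).most_common(1)[0][0]

def fake_model(prompt):
    # Pretend a single sample is right only ~40% of the time, and scattered
    # among different wrong answers otherwise.
    return "42" if random.random() < 0.4 else random.choice(["41", "43", "44"])

random.seed(0)
samples = sample_answers("What is the answer?", n=101, generate=fake_model)
print(majority_vote(samples))  # almost always "42", even though one sample often is not
```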
3:20:29 There are many extensions to this.
3:20:35 I would say the simplest one is that our language models today have been designed to give the
3:20:39 right answer the highest percentage of the time in one response.
3:20:44 We are now opening the door to different ways of running inference on our models in which
3:20:49 we need to reevaluate many parts of the training process, which normally opens the door to
3:20:54 more progress, but we don’t know if OpenAI changed a lot, or if just sampling more and
3:20:57 multiple choices is what they’re doing, or if it’s something more complex, but they changed
3:21:02 the training and they know that the inference mode is going to be different.
3:21:09 We’re talking about o1 Pro, $200 a month, and they’re losing money.
3:21:17 The thing that we’re referring to, this fascinating exploration of the test time compute space,
3:21:18 is that actually possible?
3:21:20 Do we have enough compute for that?
3:21:22 Do the financials make sense?
3:21:28 The fantastic thing is, and it’s in the thing that I just pulled up earlier, but the cost
3:21:35 for GPT-3 has plummeted, if you scroll up just a few images, I think.
3:21:39 The important question is, hey, is cost the limiting factor here?
3:21:44 My view is that we’ll have really awesome intelligence, AGI, before we
3:21:47 have it permeate throughout the economy.
3:21:53 The reason is, GPT-3 was trained in, what, 2020, 2021, and the cost for running
3:22:01 inference on it was $60, $70 per million tokens, so the cost per unit of intelligence was ridiculous.
3:22:07 Now, as we scaled forward two years, we’ve had a 1200x reduction in cost to achieve the
3:22:10 same level of intelligence as GPT-3.
3:22:19 Here on the x-axis is time over just a couple of years, and on the y-axis is log scale dollars
3:22:23 to run inference on a million tokens.
3:22:31 You have just about a linear decline in log scale from GPT-3 through 3.5 to Llama.
3:22:37 It’s like five cents or something like that now, versus $60, so 1200x.
3:22:43 Those aren’t the exact numbers, but it’s 1200x, I remember that number, a humongous decrease
3:22:44 in cost per unit of intelligence.
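For what it's worth, the arithmetic behind that claim is a one-liner; a quick back-of-the-envelope check using the approximate figures quoted here rather than exact prices:

```python
# Back-of-the-envelope check of the cost-curve claim (approximate figures only).
gpt3_cost_at_launch = 60.0     # dollars per million tokens, roughly 2021
gpt3_quality_cost_now = 0.05   # dollars per million tokens for a GPT-3-class model today
print(f"~{gpt3_cost_at_launch / gpt3_quality_cost_now:.0f}x cheaper")  # ~1200x
```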
3:22:47 Now, the freak out over DeepSeek is, oh my god, they made it so cheap.
3:22:51 Actually, if you look at this trend line, they’re not below the trend line, first of
3:22:54 all, at least for GPT-3.
3:22:58 They are the first to hit it, which is a big deal, but they’re not below the trend line
3:22:59 as far as GPT-3.
3:23:00 Now, we have GPT-4.
3:23:02 What’s going to happen with these reasoning capabilities?
3:23:07 It’s a mix of architectural innovations, it’s a mix of better data, and it’s going to be
3:23:10 better training techniques, and all of these different better inference systems, better
3:23:17 hardware going from each generation of GPU to new generations or ASICs.
3:23:22 Everything is going to take this cost curve down and down and down and down, and then
3:23:27 can I just spawn a thousand different LLMs to create a task and then pick from one of
3:23:31 them or whatever search technique I want, a tree, Monte Carlo tree search, maybe it gets
3:23:33 that complicated.
3:23:38 Maybe it doesn’t because it’s too complicated to actually scale, who knows, bitter lesson.
3:23:46 The question is, I think, when not if, because the rate of progress is so fast.
3:23:52 Nine months ago, Dario said the cost to train and inference was this, and
3:23:57 now we’re much better than this, and DeepSeek is much better than this, and that cost curve
3:24:02 for GPT-4, which was also roughly $60 per million tokens when it launched, has already
3:24:10 fallen to $2 or so, and we’re going to get it down to cents, probably, for GPT-4 quality,
3:24:15 and then that’s the base for the reasoning models like o1 that we have today, and o1 Pro
3:24:20 is spawning multiple, and o3, and so on and so forth. These search techniques are too expensive
3:24:25 today, but they will get cheaper, and that’s what’s going to unlock the intelligence.
3:24:28 So get cheaper and cheaper and cheaper.
3:24:34 The big DeepSeek R1 release freaked everybody out because of how cheap it was.
3:24:38 One of the manifestations of that is NVIDIA stock plummeted.
3:24:40 Can you explain what happened?
3:24:47 And also just explain this moment and whether NVIDIA is going to keep winning.
3:24:53 We’re both NVIDIA bulls here, I would say, and in some ways, the market response is reasonable.
3:24:59 Most of the market, NVIDIA’s biggest customers in the US are major tech companies, and they’re
3:25:05 spending a ton on AI, and a simple interpretation of deep seek is you can get really good models
3:25:10 without spending as much on AI, so in that capacity, it’s like, oh, maybe these big tech
3:25:12 companies won’t need to spend as much on AI and go down.
3:25:16 The actual thing that happened is much more complex, where there’s social factors, where
3:25:21 there’s the app rising in the App Store, the social contagion that is happening, and then I think
3:25:25 some of it is just like, I don’t trade, I don’t know anything about financial markets,
3:25:28 but the social pressure builds up over the weekend, where if it hadn’t been
3:25:32 during the weekend, there would have been multiple days of trading while this was really building,
3:25:37 but it comes over the weekend and then everybody wants to sell, and that is a social contagion.
3:25:41 I think there were a lot of false narratives, which is like, hey, these guys are spending
3:25:44 billions on models, and they’re not spending billions on models.
3:25:49 No one spent more than a billion dollars on a model that’s released publicly.
3:25:57 GPT-4 was a couple hundred million, and then they’ve reduced the cost with 4 Turbo and 4o, but
3:25:59 billion dollar model runs are coming.
3:26:02 That includes pre-training and post-training, and then the other number is like, hey, DeepSeek
3:26:06 didn’t include everything; they didn’t include a lot of the cost that goes to research
3:26:07 and all this sort of stuff.
3:26:10 A lot of the cost goes to inference, a lot of the cost goes to post-training.
3:26:11 None of these things were factored.
3:26:12 It’s research salaries.
3:26:16 All these things are counted in the billions of dollars that OpenAI is spending, but they
3:26:21 weren’t counted in the, hey, $6 million, $5 million that DeepSeek spent.
3:26:25 So there’s a bit of misunderstanding of what these numbers are, and then there’s also an
3:26:31 element of, NVIDIA has just been a straight line up, and there’s been so many different
3:26:35 narratives that have been trying to push down, I won’t say push down NVIDIA stock, but everyone
3:26:39 is looking for a reason to sell or to be worried.
3:26:43 It was Blackwell delays, there are GPU delays, there’s a lot of reports, every two weeks there’s
3:26:48 a new report about their GPUs being delayed.
3:26:51 There’s the whole thing about scaling laws ending.
3:26:52 It’s so ironic.
3:26:53 It lasted a month.
3:26:58 It was just, literally just, hey, models aren’t getting better.
3:27:01 They’re just not getting better, there’s no reason to spend more, pre-training scaling
3:27:02 is dead.
3:27:08 After that, it’s like o1, o3, R1, and now it’s like, wait, models are progressing
3:27:09 too fast.
3:27:14 Slow down the progress, stop spending on GPUs, but the funniest thing I think that comes
3:27:21 out of this is, Jevons paradox is true, AWS pricing for H100s has gone up over the
3:27:24 last couple of weeks.
3:27:28 Since a little bit after Christmas, since V3 was launched, AWS H100 pricing has gone
3:27:29 up.
3:27:35 H200s are almost out of stock everywhere because H200 has more memory and therefore R1 wants
3:27:37 that chip over H100, right?
3:27:40 We were trying to get GPUs on a short notice this week for a demo and it wasn’t that easy.
3:27:45 We were trying to get just like 16 or 32 H100s for a demo and it was not very easy.
3:27:52 For people who don’t know, Jevons paradox is, when the efficiency goes up, somehow
3:27:57 magically, counter-intuitively, the total resource consumption goes up as well.
3:28:03 The semiconductors are like 50 years of Moore’s Law, every two years, half the cost, double
3:28:07 the transistors, just like clockwork, and it’s slowed down, obviously, but the semiconductor
3:28:09 industry has gone up the whole time, right?
3:28:10 It’s been wavy, right?
3:28:13 There’s obviously cycles and stuff, and I don’t expect AI to be any different, right?
3:28:18 There’s going to be ebbs and flows, but in AI, it’s just playing out at an insane time
3:28:19 scale, right?
3:28:21 It was 2X every two years.
3:28:24 This is 1200X in like three years, right?
3:28:28 So it’s like the scale of improvement that is hard to wrap your head around.
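As a toy illustration of Jevons paradox in this setting, with completely made-up demand numbers, the per-token price can collapse while total spend on compute still rises:

```python
# Toy Jevons-paradox arithmetic: price per unit falls, demand rises faster,
# so total consumption and spend go up. All numbers are invented for illustration.
price_before = 60.0     # $ per million tokens
tokens_before = 1e9     # tokens demanded per month at the old price

price_after = 0.05      # $ per million tokens after a ~1200x efficiency gain
tokens_after = 5e12     # demand explodes once intelligence is nearly free

spend_before = price_before * tokens_before / 1e6   # $60,000
spend_after = price_after * tokens_after / 1e6      # $250,000: total spend still rose
print(spend_before, spend_after)
```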
3:28:35 Yeah, I was confused because to me, NVIDIA’s stock should have gone up, but maybe
3:28:39 it went down because there’s kind of a suspicion of foul play on the side of China or something
3:28:40 like this.
3:28:45 But if you just look purely at the actual principles at play here, it’s obvious, yeah,
3:28:46 the Jevons paradox.
3:28:52 More progress that AI makes, or the higher the derivative of AI progress is, especially
3:28:56 because NVIDIA is in the best place, the higher the derivative is, the sooner the market’s
3:29:01 going to be bigger and expanding, and NVIDIA is the only one that does everything reliably
3:29:02 right now.
3:29:05 Because it’s not like an NVIDIA competitor arose.
3:29:08 It’s another company that’s using NVIDIA.
3:29:14 Who historically has been a large NVIDIA customer and has press releases about them
3:29:19 cheering about being China’s biggest NVIDIA customer, right?
3:29:23 Maybe they’ve quieted down, but I think that’s another element of is that they don’t want
3:29:29 to say how many GPUs they have because, hey, yes, they have H800s, yes, they have H20s.
3:29:32 They also have some H100s, which are smuggled in.
3:29:34 Can you speak to that, to the smuggling?
3:29:39 What’s the scale of smuggling that’s feasible for a nation state to do for companies?
3:29:41 Is it possible to…?
3:29:44 I think there’s a few angles of smuggling here.
3:29:48 One is, ByteDance arguably is the largest smuggler of GPUs for China.
3:29:50 China is not supposed to have GPUs.
3:29:52 ByteDance has over 500,000 GPUs.
3:29:53 Why?
3:29:55 Because they’re all rented from companies around the world.
3:29:56 They rent from Oracle.
3:29:57 They rent from Google.
3:30:01 They rent from all these massive cloud companies and a bunch of smaller cloud companies too, right?
3:30:03 All the neoclouds of the world.
3:30:06 They rent so, so many GPUs, they also buy a bunch, right?
3:30:09 And they do this for mostly what meta does, right?
3:30:10 Serving TikTok.
3:30:11 Serving…
3:30:12 Back to the next best…
3:30:13 Separate discussion.
3:30:14 Same as that, right?
3:30:15 To be clear, that’s today the view.
3:30:16 Use, right?
3:30:17 And it’s a valid use, right?
3:30:19 It’s a dopamine circuit, right?
3:30:25 Now, that’s theoretically now very much restricted with the AI diffusion rules, which happened
3:30:27 in the last week of the Biden admin.
3:30:33 And Trump admin looks like they’re going to keep them, which limits allies, even Singapore.
3:30:37 And Singapore is 20%, 30% of NVIDIA’s revenue.
3:30:41 But Singapore’s had a moratorium on building data centers for 15 years, because
3:30:42 they don’t have enough power.
3:30:43 So where are they going?
3:30:44 Oh, yeah.
3:30:47 I mean, I’m not claiming they’re all going to China, right?
3:30:48 But a portion are…
3:30:53 Many are going to Malaysia, including Microsoft and Oracle have big data centers in Malaysia.
3:30:56 They’re going all over Southeast Asia, probably India as well, right?
3:31:00 There’s stuff routing, but the diffusion rules are very de facto.
3:31:04 You can only buy this many GPUs from this country, and you can only rent a cluster of
3:31:06 this large to companies that are Chinese, right?
3:31:10 They’re very explicit on trying to stop smuggling, right?
3:31:17 And a big chunk of it was, “Hey, a random company buys 16 servers, ships them to China,
3:31:18 right?”
3:31:25 Actually, I saw a photo from someone in the semiconductor industry who leads a team for
3:31:30 networking chips that competes with NVIDIA, and he sent a photo of a guy checking into
3:31:36 a first-class United flight from San Francisco to Shanghai or Shenzhen with a super micro
3:31:41 box that was this big, which can only contain GPUs, right?
3:31:45 And he was booking first-class, because think about it: $3 to $5K for your first-class ticket,
3:31:51 the box costs $240,000, $250,000 in the US, you sell it for $300,000 in China, and wait, you just got
3:31:54 a free first-class ticket and a lot more money.
3:31:57 So it’s like, you know, and that’s like small-scale smuggling.
3:32:01 Most of the large-scale smuggling is like companies in Singapore and Malaysia, like
3:32:04 routing them around or renting GPUs completely legally.
3:32:05 I want to jump in.
3:32:06 How much does this scale?
3:32:10 I think there’s been some number, like some people with a higher-level economics understanding
3:32:15 say that as you go from one billion of smuggling to 10 billion, you’re hiding whole
3:32:18 levels of economic activity, and that’s the most reasonable thing to me: there’s
3:32:23 going to be some level where it’s so obvious that it’s easier to find this economic activity.
3:32:32 Yeah, so my belief is that last year, roughly, NVIDIA made a million H20s, which are legally
3:32:35 allowed to be shipped to China, and which we talked about are better for reasoning, right, inference
3:32:40 at least, not training, but reasoning inference, and inference generally.
3:32:47 Then they also had a couple hundred thousand, we think like 200 to 300,000 GPUs were routed
3:32:50 to China from, you know, Singapore, Malaysia, US, wherever.
3:32:55 Companies spun up, bought 16 GPUs, 64 GPUs, whatever it is, routed them, and Huawei is known
3:32:59 for having spun up a massive network of companies to get the materials they need after they
3:33:03 were banned in 2018, so it’s not like otherworldly, but I agree, right?
3:33:07 Nathan’s point is like, hey, you can’t smuggle up $10 billion of GPUs.
3:33:11 And then the third source, which is just now banned, which wasn’t considered smuggling,
3:33:19 but is China is renting, I believe from our research, Oracle’s biggest GPU customer is
3:33:21 ByteDance, right?
3:33:24 And for Google, I think it’s their second biggest customer, right?
3:33:27 And you go down the list of clouds, and especially these smaller cloud companies that aren’t
3:33:30 like the hyperscalers, right?
3:33:34 Think beyond CoreWeave and Lambda even, there’s a whole, there’s 60 different new cloud companies
3:33:35 serving NVIDIA GPUs.
3:33:38 I think ByteDance is renting a lot of these, right?
3:33:39 All over, right?
3:33:44 And so these companies are renting GPUs to Chinese companies, and that was completely
3:33:48 legal up until the diffusion rules, which happened just a few weeks ago.
3:33:54 And even now, you can rent GPU clusters that are less than 2,000 GPUs, or you can buy GPUs
3:33:57 and ship them wherever you want if they’re less than 1,500 GPUs, right?
3:34:02 So it’s like, there are still some ways to smuggle, but yeah, it’s not, as the numbers
3:34:03 grow, right?
3:34:07 A hundred-something billion dollars of revenue for NVIDIA last year, 200-something billion
3:34:08 this year, right?
3:34:14 And if next year, it could nearly double again, or more than double, based on what we see
3:34:19 with data center footprints being built out all across the U.S. and the rest of the world,
3:34:22 it’s going to be really hard for China to keep up with these rules, right?
3:34:28 Yes, there will always be smuggling, and DeepSeek-level models, GPT-4-level models, o1-level
3:34:32 models are capable of being trained on what China can get, even the next tier above that.
3:34:39 But if we speedrun a couple more jumps, right, to billion-dollar models, $10 billion models,
3:34:44 then it becomes, hey, there is a compute disadvantage for China for training models and serving them.
3:34:46 And the serving part is really critical, right?
3:34:48 DeepSeek cannot serve their model today, right?
3:34:51 It’s completely out of inventory.
3:34:55 Its downloads have actually already started falling in the App Store, because you download it,
3:34:56 you try and sign up.
3:34:58 They say we’re not taking registrations because they have no capacity, right?
3:35:02 You open it up, you get like less than five tokens per second if you even get your request
3:35:03 approved, right?
3:35:06 Because there’s just no capacity, because they just don’t have enough GPUs to serve
3:35:08 the model, even though it’s incredibly efficient.
3:35:13 It would be fascinating to watch the smuggling, because, I mean, there’s drug smuggling, right?
3:35:20 That’s a market, there’s weapons smuggling, and GPUs will surpass that at some point.
3:35:25 Chips are the highest value per kilogram, probably by far.
3:35:27 I have another question for you, Dylan.
3:35:31 Do you track model API access internationally?
3:35:36 How easy is it for Chinese companies to use hosted model APIs from the US?
3:35:38 Yeah, I mean, that’s incredibly easy, right?
3:35:43 OpenAI publicly stated DeepSeek uses their API, and as they say, they have evidence, right?
3:35:47 And this is another element of the training regime: people at OpenAI have claimed that
3:35:51 it’s a distilled model, i.e., you’re taking OpenAI’s model, you’re generating a lot of
3:35:55 output, and then you’re training your own model on that output.
3:35:57 And even if that’s the case, what they did is still amazing, by the way, what DeepSeek
3:35:58 did efficiency-wise.
3:36:02 Distillation is standard practice in industry, whether or not, if you’re at a closed lab
3:36:06 where you care about terms of service and IP closely, you distill from your own models.
3:36:10 If you’re a researcher and you’re not building any products, you distill from the OpenAI
3:36:11 models.
3:36:12 This is a good opportunity.
3:36:16 Can you explain big picture distillation as a process?
3:36:17 What is distillation?
3:36:18 What’s the process of distillation?
3:36:20 We’ve talked a lot about training language models.
3:36:24 They are trained on text, and post-training, you’re trying to train on very high-quality
3:36:29 text that you want the model to match the features of, or if you’re using RL, you’re
3:36:30 letting the model find its own thing.
3:36:35 But for supervised fine-tuning, for preference data, you need to have some completions that
3:36:37 the model is trying to learn to imitate.
3:36:42 And what you do there is instead of human data, or instead of the model you’re currently
3:36:47 training, you take completions from a different, normally more powerful model.
3:36:53 I think there’s rumors that these big models that people are waiting for, these GPT-5s
3:36:58 of the world, the Claude 3 Opuses of the world, are used internally to do this distillation
3:36:59 process at OpenAI.
3:37:04 There’s also public examples, right, like Meta explicitly stated, not necessarily distilling,
3:37:09 but they used 405B as a reward model for 70B in their Llama 3.2 or 3.3.
3:37:11 This is all the same topic.
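To make the mechanics concrete, here is a minimal sketch of distillation as just described, with placeholder stand-ins rather than any real API: a stronger teacher model writes the completions, and the student is then fine-tuned on those pairs exactly as it would be on human-written data.

```python
# Minimal distillation sketch (placeholder functions, not a real API).

def teacher_generate(prompt):
    # Stand-in for a call to the stronger model (a hosted API, or a local
    # larger model such as the 405B-as-teacher setup mentioned above).
    return f"[high-quality completion for: {prompt}]"

def build_distillation_dataset(prompts):
    # Each example pairs a prompt with the teacher's completion.
    return [{"prompt": p, "completion": teacher_generate(p)} for p in prompts]

dataset = build_distillation_dataset([
    "Explain Jevons paradox in one sentence.",
    "Prove that the sum of two even numbers is even.",
])

# The student model is then trained with ordinary supervised fine-tuning:
# maximize log P_student(completion | prompt) over this dataset. The teacher
# simply replaces the human annotator in the data pipeline.
for example in dataset:
    print(example["prompt"], "->", example["completion"][:40])
```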
3:37:15 So is this ethical, is this legal?
3:37:22 Why does that Financial Times article headline say OpenAI says there’s evidence that
3:37:26 China’s DeepSeek used its model to train a competitor?
3:37:30 This has a long history, at least on the academic side and research side,
3:37:32 because you’re trying to interpret OpenAI’s rules.
3:37:36 OpenAI’s terms of service say that you cannot build a competitor with outputs from their
3:37:37 model.
3:37:42 Terms of service are different than a license, which are essentially a contract between organizations.
3:37:46 So the terms of service apply to my OpenAI account: if I violate it, OpenAI can cancel
3:37:47 my account.
3:37:51 This is very different than a license that says how you could use a downstream artifact.
3:37:54 So a lot of it hinges on a word that is very unclear in the AI space, which is what is
3:37:55 a competitor.
3:38:01 And then the ethical aspect of it is like, why is it unethical for me to train on your
3:38:04 model when you can train on the internet’s text, right?
3:38:12 So there’s a bit of a hypocrisy because OpenAI and potentially most of the companies trained
3:38:14 on the internet’s text without permission.
3:38:20 There’s also a clear loophole, which is that I generate data from OpenAI and then I upload
3:38:25 it somewhere and then somebody else trains on it and the link has been broken.
3:38:27 They’re not under the same terms of service contract.
3:38:32 There’s a lot of back and forth, a lot of still-to-be-discovered details that don’t make
3:38:33 a lot of sense.
3:38:38 This is why a lot of models today, even if they train on zero OpenAI data, you ask the
3:38:42 model who trained you, it’ll say, I am ChatGPT, trained by OpenAI, because there’s
3:38:47 so much copy-paste of OpenAI outputs on the internet that you just weren’t
3:38:52 able to filter it out, and there was nothing in the RL,
3:38:56 or post-training, or SFT, whatever, that says, hey, I’m actually a model by the Allen Institute
3:38:58 instead of OpenAI.
3:38:59 We have to do this if we serve a demo.
3:39:04 We do research and we use OpenAI APIs because it’s useful and you want to understand post-
3:39:08 training, and our research models, they will say they’re written by OpenAI unless
3:39:12 we put in the system prompt that we talked about, like, I am Tulu, I am a language model
3:39:14 trained by the Allen Institute for AI.
3:39:18 And if you ask more people around industry, especially with post training, it’s a very
3:39:24 doable task to make the model say who it is or to suppress the OpenAI thing.
3:39:28 So on some level, it might be that DeepSeek didn’t care that it was saying that it was
3:39:29 by OpenAI.
3:39:32 Like if you’re going to upload model weights, it doesn’t really matter because anyone that’s
3:39:37 serving it in an application and cares a lot about serving is going to, when serving it,
3:39:40 if they’re using it for a specific task, they’re going to tailor it to that.
3:39:42 And it doesn’t matter that it’s saying it’s ChatGPT.
3:39:46 Oh, I guess the one of the ways to do that is like a system prompt or something like
3:39:47 that.
3:39:49 Like if you’re serving it to say that you’re…
3:39:50 That’s what we do.
3:39:55 Like if we host the demo, you say you are Tulu 3, a language model trained by the Allen
3:39:56 Institute for AI.
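As a small illustration of that fix, the identity is just a system message prepended to every request when the model is served; the message format below follows the common chat convention, and only the identity string comes from the conversation above.

```python
# Prepend an identity system prompt when serving an open-weight model so it
# stops claiming to be ChatGPT. These messages are then rendered with the
# model's chat template by whatever inference server hosts the weights.
messages = [
    {
        "role": "system",
        "content": "You are Tulu 3, a language model trained by the Allen "
                   "Institute for AI.",
    },
    {"role": "user", "content": "Who trained you?"},
]
print(messages)
```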
3:40:00 We also are benefited from OpenAI data because it’s a great research tool.
3:40:07 I mean, do you think there’s any truth and value to OpenAI’s claim that there’s
3:40:10 evidence that China’s DeepSeek used its model to train?
3:40:16 I think everyone has benefited regardless because the data’s on the internet.
3:40:18 And therefore, it’s in your pre-training now, right?
3:40:23 There are like subreddits where people share the best ChatGPT outputs and those are in
3:40:24 your model…
3:40:26 I think that they’re trying to shift the narrative.
3:40:28 They’re trying to protect themselves.
3:40:32 And we saw this years ago: ByteDance was actually banned from some OpenAI APIs for training
3:40:34 on outputs.
3:40:39 There are other AI startups where, if you’re in the AI community, people have just told
3:40:43 us they trained on OpenAI outputs and they never got banned.
3:40:45 That’s how they bootstrapped their early models.
3:40:49 So it’s much easier to get off the ground using this than to set up human pipelines
3:40:50 and build a strong model.
3:40:54 So there’s a long history here and a lot of the communications are seen like narrative
3:40:55 control.
3:40:59 Actually, over the last couple of days, we’ve seen a lot of people distill DeepSeek’s model
3:41:04 into Llama models because the DeepSeek models are kind of complicated to run inference on
3:41:08 because they’re a mixture of experts and they’re 600-plus billion parameters and all this.
3:41:12 And people distilled them into the Llama models because the Llama models are so easy to serve
3:41:16 and everyone’s built the pipelines and tooling for inference with the Llama models because
3:41:18 it’s the open standard.
3:41:21 So we’ve seen a sort of roundabout, right?
3:41:22 Is it bad?
3:41:23 Is it illegal?
3:41:24 Maybe it’s illegal, whatever.
3:41:25 I don’t know about that.
3:41:26 But it could break contracts.
3:41:27 I don’t think it’s illegal.
3:41:30 Legally, no one’s going to jail for this.
3:41:35 I think fundamentally, I think it’s ethical or I hope it’s ethical because the moment
3:41:42 it becomes, we ban that kind of thing, it’s going to make everybody much worse off.
3:41:48 And I also actually, this is difficult, but I think you should be allowed to train on
3:41:49 the internet.
3:41:52 I know a lot of authors and creators are very sensitive about it.
3:41:54 That’s a difficult question.
3:41:57 But the moment you’re not allowed to train on the internet.
3:41:58 I agree.
3:42:01 I have a schizo take on how you can solve this, because it already works.
3:42:04 I have a reasonable take on it.
3:42:10 So, you know, A, Japan has a law under which you’re allowed to train on any training data and
3:42:15 copyrights don’t apply if you want to train a model. B, Japan has nine gigawatts of
3:42:17 curtailed nuclear power.
3:42:23 C, Japan is allowed under the AI diffusion rule to import as many GPUs as they’d like.
3:42:25 So all we have to do, and we have a market here to make:
3:42:30 we build massive data centers, we rent them to the labs, and then we train models in a
3:42:33 legally permissible way, and there’s no ifs, ands, or buts.
3:42:38 And now, the models have no potential copyright lawsuit from New York Times or anything like
3:42:39 that.
3:42:40 No, no, it’s just completely legal.
3:42:41 Genius.
3:42:46 The early copyright lawsuits have fallen in the favor of AI training.
3:42:53 I would say that the long tail of use is going to go in favor of AI, which is, if you scrape
3:42:56 trillions of tokens of data, you’re not looking at each piece of it individually.
3:43:01 You’re not looking and saying this one New York Times article is so important to me.
3:43:05 But if you’re doing audio generation for music or image generation, and you say make
3:43:10 it in the style of X person, that’s a reasonable case where you could figure out what is their
3:43:12 profit margin on inference.
3:43:17 I don’t know if it’s going to be the 50/50 of YouTube creator program or something, but
3:43:19 I would opt into that program as a writer.
3:43:26 Please, it’s going to be a rough journey, but there will be some solutions like that
3:43:27 that make sense.
3:43:30 But there’s a long tail where it’s just on the Internet.
3:43:36 I think there’s another aspect that that Financial Times article implied, and that leads to
3:43:37 a more general question.
3:43:45 How difficult is spying, espionage, and stealing of actual secret code
3:43:48 and data from inside companies?
3:43:49 How much of that is being attempted?
3:43:52 Code and data are hard, but ideas are easy.
3:43:58 Silicon Valley operates on the way that top employees get bought out by other companies
3:43:59 for a pay raise.
3:44:04 And a large reason why these companies do this is to bring ideas with them.
3:44:05 There’s no…
3:44:09 I mean, in California, there are rules; certain non-competes or whatever are illegal
3:44:10 there.
3:44:14 And whether or not there’s NDAs and things, that is how a lot of it happens.
3:44:19 Recently, there was somebody from Gemini who helped make this one million context length,
3:44:23 and everyone is saying the next Llama, I mean, he went to the Meta team, is going
3:44:26 to have one million context length.
3:44:29 And that’s kind of how the world works.
3:44:34 As far as industrial espionage and things, that has been greatly successful in the past.
3:44:39 The Americans did it to the Brits, the Chinese have done it to the Americans, and so on and so
3:44:40 forth.
3:44:43 It is a fact of life.
3:44:48 And so to argue industrial espionage can be stopped is probably unrealistic; you can make
3:44:49 it difficult.
3:44:54 Even then, there’s all these stories about, “Hey, F35 and F22 have already been given
3:44:57 to China in terms of design plans and stuff.”
3:45:03 Stealing code and stuff between, I’d say, companies, not nation states, is probably very difficult.
3:45:08 But ideas are discussed a lot, whether it be a house party in San Francisco, or a company
3:45:15 changing employees, or the always-mythical honeypot that always gets talked about, like
3:45:17 someone gets honeypotted.
3:45:21 Because everyone working on AI is a single dude who’s in their 20s and 30s.
3:45:25 Not everyone, but an insanely high percentage.
3:45:28 So there’s always all these like, and obviously–
3:45:32 So being honeypotted is like, a spy, a female spy, approaches you and like–
3:45:33 Yeah.
3:45:36 Or male, right?
3:45:37 It’s San Francisco, right?
3:45:44 But as a single dude, I will say in his late 20s, we are very easily corrupted, right?
3:45:47 Not corrupted myself, but you know, we are, we are, right?
3:45:48 Everybody else, not me.
3:45:49 Yeah, exactly.
3:45:50 I’m too oblivious and I am not single.
3:45:53 So I’m saved from one espionage access.
3:45:59 Yeah, you have to make sure to close all security vulnerabilities.
3:46:05 So you do collect a lot of information about each of the mega clusters for each of the
3:46:08 major AI companies.
3:46:12 Can you talk about the buildouts for each one that stand out?
3:46:13 Yeah.
3:46:17 I think the thing that’s like really important about these mega cluster buildouts is they’re
3:46:20 completely unprecedented in scale, right?
3:46:24 US data center power consumption, you know, has been slowly on the rise and
3:46:29 it’s gone up to 2%, 3% even through the cloud computing revolution, right?
3:46:32 Data center consumption as a percentage of total US power.
3:46:34 And that’s been over decades, right, of data centers, et cetera.
3:46:36 It’s been climbing, climbing slowly.
3:46:41 But now it’s at 2% to 3%, and by the end of this decade, even when
3:46:47 I say something like 10% by 2028 or 2030, a lot of people, traditional
3:46:51 data center people, say that’s nuts.
3:46:54 But then people who are in AI, who have really looked at this, the
3:46:58 Anthropics and OpenAIs, are like, that’s not enough, okay?
3:47:04 But like, you know, this is both globally distributed, or distributed throughout
3:47:07 the US, as well as like centralized clusters, right?
3:47:10 The distributed-throughout-the-US part is exciting and it’s the bulk of it, right?
3:47:17 Like, hey, you know, OpenAI or, you know, say Meta’s adding a gigawatt, right?
3:47:20 But most of it is distributed through the US for inference and all these other things,
3:47:21 right?
3:47:24 So maybe we should lay out what a cluster is.
3:47:28 So, you know, does this include AWS?
3:47:32 Maybe it’s good to talk about the different kinds of clusters and what you mean by megaclusters
3:47:36 and what’s the GPU and what’s the computer and what is not that far back.
3:47:37 But yeah.
3:47:39 So like, what do we mean by the clusters?
3:47:41 No, man, I thought I was about to do the Apple ad, right?
3:47:43 What’s a computer?
3:47:49 So, so traditionally data centers and data center tasks have been a distributed systems
3:47:54 problem that is capable of being spread very far and widely, right?
3:48:00 I send a request to Google, it gets routed to a data center somewhat close to me.
3:48:05 It does whatever search ranking recommendation sends a result back, right?
3:48:09 The nature of the task is changing rapidly in that the task, there’s two tasks that people
3:48:10 are really focused on now, right?
3:48:12 It’s not database access.
3:48:14 It’s not serve me the right page, serve me the right ad.
3:48:20 It’s now inference, and inference is dramatically different from traditional distributed systems,
3:48:22 but it looks a lot more similar.
3:48:24 And then there’s training, right?
3:48:28 The inference side is still like, hey, I’m going to put, you know, thousands of GPUs
3:48:33 in, you know, blocks all around these data centers, I’m going to run models on them,
3:48:37 you know, user submits a request, gets kicked off, or hey, my service, you know, they submit
3:48:38 a request to my service, right?
3:48:41 They’re on Word and they’re like, oh yeah, help me copilot and it kicks it off or I’m
3:48:45 on my windows, copilot, whatever, Apple intelligence, whatever it is, it gets kicked off to a data
3:48:46 center, right?
3:48:51 And that data center does some work and sends it back, that’s inference, that is going to
3:48:55 be the bulk of compute, but then, you know, and that’s like, you know, there’s thousands
3:48:59 of data centers that we’re tracking with like satellites and like all these other things.
3:49:01 And those are the bulk of what’s being built.
3:49:05 And so that’s what’s really reshaping things and that’s what’s getting
3:49:11 millions of GPUs, but the scale of the largest cluster is also really important, right?
3:49:17 When we look back at history, right, like, you know, or through the age of AI, right?
3:49:22 Like it was a really big deal when they did AlexNet on, I think, two GPUs or four GPUs?
3:49:23 I don’t remember.
3:49:24 It was a really big deal.
3:49:25 It’s a big deal because you use GPUs.
3:49:29 It’s a big deal to use GPUs and they use multiple, right?
3:49:32 But then over time, its scale has just been compounding, right?
3:49:40 And so when you skip forward to GPT-3, then GPT-4: GPT-4 was 20,000 A100 GPUs, an unprecedented
3:49:44 run, right, in terms of the size and the cost, right, a couple hundred million dollars on
3:49:48 a YOLO, right, a YOLO run for GPT-4, and it yielded, you know, this magical improvement
3:49:53 that was like perfectly in line with what the experiments predicted, just on a log scale, right?
3:49:55 Oh yeah, they have that plot from the paper.
3:49:56 The technical report.
3:49:58 The scaling laws were perfect, right?
3:50:00 But that’s not a crazy number, right?
3:50:05 20,000 A100s, roughly each GPU is consuming 400 watts.
3:50:09 And then when you add in the whole server, right, everything, it’s like 15 to 20 megawatts
3:50:11 of power, right?
3:50:15 You know, maybe you could look up what the power consumption of a human person is
3:50:19 because the numbers are going to get silly, but like 15 to 20 megawatts was standard data
3:50:20 center size.
3:50:21 It was just unprecedented.
3:50:22 That was all GPUs running at one time.
3:50:23 And what’s a toaster?
3:50:24 Yeah.
3:50:29 A toaster is like a similar power consumption to an A100, right?
3:50:34 H100 comes around, they increase the power from like 400 to 700 watts and that’s just
3:50:36 per GPU and then there’s all the associated stuff around it.
3:50:40 So once you count all that, it’s roughly like 1200 to 1400 watts.
3:50:43 For everything, networking, CPUs, memory, blah, blah, blah.
3:50:46 So we should also say, so what’s required?
3:50:53 You said power, so a lot of power is required, a lot of heat is generated, cooling is required
3:50:58 and because there’s a lot of GPUs that have to be or CPUs or whatever, they have to be
3:50:59 connected.
3:51:00 So there’s a lot of networking.
3:51:01 Yeah.
3:51:02 Right.
3:51:03 Yeah.
3:51:04 So I think, yeah.
3:51:05 Sorry for skipping past that.
3:51:06 And then the data center itself is like complicated, right?
3:51:10 But these are still standard sized data centers for GPT-4 scale, right?
3:51:16 Now we step forward to sort of what is the scale of clusters that people built last year,
3:51:17 right?
3:51:18 And it ranges widely, right?
3:51:22 It ranges from like, hey, these are standard data centers and we’re just using multiple
3:51:25 of them and connecting them together really with a ton of fiber between them, a lot of
3:51:27 networking, et cetera.
3:51:29 That’s what OpenAI and Microsoft did in Arizona, right?
3:51:31 And so they have a, you know, 100,000 GPUs, right?
3:51:32 Meta, similar thing.
3:51:36 They took their standard existing data center design and it looks like an H and they connected
3:51:39 multiple them together.
3:51:44 And you know, they got to, they first did 16,000 GPUs, 24,000 GPUs total.
3:51:46 Only 16,000 of them were running on the training run because GPUs are very
3:51:47 unreliable.
3:51:51 So they needed to have spares to like swap in and out all the way to like now 100,000
3:51:54 GPUs that they’re training on Lama 4 on currently, right?
3:51:56 Like 128,000 or so, right?
3:52:03 This is, you know, think about 100,000 GPUs with roughly 1,400 watts apiece.
3:52:05 That’s 140 megawatts, 150 megawatts, right?
3:52:07 For 128,000, right?
3:52:11 So you’re talking about, you’ve jumped from 15 to 20 megawatts to 10x, you know, almost
3:52:17 10x that number, 9x that number to 150 megawatts in two years, right?
3:52:19 From 2022 to 2024, right?
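Since the arithmetic here is easy to lose track of, a quick sanity check of the figures quoted, using approximate all-in per-GPU power that includes the surrounding server:

```python
# Rough power arithmetic for the cluster sizes discussed (approximate figures).
a100_all_in_watts = 750      # ~400 W GPU, roughly 750 W with the rest of the server
gpt4_scale_gpus = 20_000
print(gpt4_scale_gpus * a100_all_in_watts / 1e6, "MW")   # ~15 MW, the "15 to 20 MW" range

h100_all_in_watts = 1_400    # ~700 W GPU, ~1,200-1,400 W all-in with networking, CPUs, memory
llama_scale_gpus = 100_000
print(llama_scale_gpus * h100_all_in_watts / 1e6, "MW")  # ~140 MW, the jump described here
```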
3:52:23 And some people like Elon, he admittedly, right, and he says himself got into the game
3:52:26 a little bit late for pre-training large language models, right?
3:52:27 XAI was started later, right?
3:52:32 But then he moved heaven and earth to get his data center up and get the largest cluster
3:52:33 in the world, right?
3:52:35 Which is 200,000 GPUs.
3:52:36 And he did that.
3:52:39 He bought a factory in Memphis.
3:52:42 He’s upgrading the substation, but at the same time he’s got a bunch of mobile power
3:52:45 generation, a bunch of single cycle turbines.
3:52:48 He tapped the natural gas line that’s right next to the factory and he’s just pulling a
3:52:50 ton of gas, burning gas.
3:52:52 He’s generating all this power.
3:52:56 He’s in a factory, in an old appliance factory that’s shut down and moved to China long ago,
3:52:57 right?
3:53:00 And he’s got 200,000 GPUs in it.
3:53:01 And now what’s the next scale, right?
3:53:02 All the hyperscalers have done this.
3:53:06 Now the next scale is something that’s even bigger, right?
3:53:10 And so, you know, Elon, just to stick on the topic, he’s building his own natural gas plant,
3:53:13 like a proper one right next door.
3:53:18 He’s deploying tons of Tesla Mega Pack batteries to make the power more smooth and all sorts
3:53:19 of other things.
3:53:23 He’s got like industrial chillers to cool the water down because he’s water cooling the
3:53:24 chips.
3:53:28 So, all these crazy things to get the clusters bigger and bigger.
3:53:34 But when you look at, like, say, what OpenAI did with Stargate, that’s not the Arizona one, that’s
3:53:36 in Abilene, Texas, right?
3:53:38 What they’ve announced at least, right?
3:53:39 It’s not built, right?
3:53:40 Elon says they don’t have the money.
3:53:42 You know, there’s some debates about this.
3:53:46 But at full scale, at least the first section is like definitely money’s accounted for,
3:53:47 but there’s multiple sections.
3:53:52 But at full scale, that data center is going to be 2.2 gigawatts, right, 2200 megawatts
3:53:59 of power in and roughly like 1.8 gigawatts or 1800 megawatts, yeah, 1800 megawatts of
3:54:01 power delivered to chips, right?
3:54:06 Now, this is an absurd scale, 2.2 gigawatts is like more than most cities, right, you
3:54:13 know, to be clear, delivered to a single cluster that’s connected to do training, right?
3:54:16 To train these models, to do both the pre-training, the post-training, all of this stuff, right?
3:54:17 This is insane.
3:54:18 This is insane.
3:54:19 This is insane.
3:54:20 This is a nuclear power plant again.
3:54:21 And everyone is doing this, right?
3:54:22 Everyone is doing this, right?
3:54:24 Meta in Louisiana, right?
3:54:29 They’re building two natural gas plants, massive ones, and then they’re building this massive
3:54:31 data center.
3:54:37 Amazon has like plans for this scale, Google has plans for this scale, XAI has plans for
3:54:38 this scale, right?
3:54:42 Like all of these, the guys that are racing, the companies that are racing are racing hard
3:54:46 and they’re doing multi-gigawatt data centers, right?
3:54:52 You build this out because they think that, yeah, if I now have, you know, obviously pre-training
3:54:55 scaling is going to continue, but to some extent, but then also all this post-training
3:54:58 stuff where you have an RL sandbox for computer use or whatever, right?
3:55:01 Like, you know, this is where they’re going to, and all these variable domains where they
3:55:06 just keep learning and learning and learning, self-play, whatever it is, makes the AI so
3:55:09 much more capable because the line does go up, right?
3:55:11 As you throw more compute, you get more performance.
3:55:15 The shirt is about scaling laws, you know, to some extent it is diminishing returns, right?
3:55:18 You 10x the compute, you don’t get 10x better model, right?
3:55:21 You get a diminishing returns, but also you get efficiency improvements, so you bend the
3:55:23 curve, right?
3:55:27 And data centers of this scale are, you know, wreaking, you know, a lot of
3:55:29 havoc on the power network, right?
3:55:33 And, you know, Nathan was mentioning there’s, Amazon has tried to buy this nuclear power
3:55:38 plant, Talen, and if you look at the Talen stock, it’s just like skyrocketing and, you
3:55:41 know, like they’re building a massive multi-gigawatt data center there, and, you know, you just
3:55:44 go down the list, there’s so many ramifications.
3:55:49 One thing is like certain regions of the U.S. transmitting power cost more than actually
3:55:51 generating it, right?
3:55:55 Because the grid is so slow to build, and the demand for power and the ability to build
3:55:59 power and like re-ramping on a natural gas plant or even a coal plant is like easy enough
3:56:01 to do, but like transmitting the power is really hard.
3:56:06 So in some parts of the U.S., like in Virginia, it costs more to transmit power than it costs
3:56:09 to generate it, which is like, you know, there’s all sorts of like second order effects that
3:56:10 are insane here.
3:56:13 Can the power grid support this kind of growth?
3:56:16 You know, Trump’s executive orders, there’s a, there’s a Biden executive order before
3:56:21 the end of the year, but then Trump had some more executive orders, which hopefully reduced
3:56:26 the regulations to where, yes, things can be built, but yeah, this is a big, big challenge,
3:56:27 right?
3:56:28 Is building enough power fast enough?
3:56:32 Are you going to basically have a nuclear power plant next to a data center for each
3:56:33 one of these?
3:56:38 So, so the funny thing here is, it’s too slow to build the power plant; to build a power
3:56:42 plant or to re-configure an existing power plant is too slow.
3:56:46 And so therefore you must use natural gas. Also, data center power consumption is flat, right?
3:56:49 Which is why nuclear is also good for it.
3:56:55 Like long-term, nuclear is a very natural fit, but you can’t do solar or anything intermittent in the
3:56:57 short term like that.
3:56:58 Because data center power demand is constant, right?
3:57:03 Like you’re telling me, you know, I’m going to buy tens of billions of dollars of GPUs
3:57:04 and idle them because the power is not being generated.
3:57:05 Like power is cheap, right?
3:57:10 Like if you look at the cost of a cluster, less than 20% of it is power, right?
3:57:14 Most of it is the capital cost and depreciation of the GPUs, right?
3:57:15 And so it’s like, well, screw it.
3:57:17 I’ll just like, you know, I’ll just build natural gas plants.
3:57:18 This is what Meta’s doing in Louisiana.
3:57:22 This is what OpenAI is doing in Texas and like all these different places.
3:57:25 They may not be doing it directly, but they are partnered with someone.
3:57:28 And so there is a couple of hopes, right?
3:57:32 Like one is, you know, and Elon, what he’s doing in Memphis is like, you know, to the
3:57:36 extreme, they’re not just using dual combined cycle gas, which is like super efficient.
3:57:40 He’s also just using single cycle and like mobile generators and stuff, which is less
3:57:41 efficient.
3:57:45 But, you know, there’s also the flip side, which is that solar power generation
3:57:49 follows one curve and wind follows another, different, uncorrelated one.
3:57:53 So if you stack both of those, plus you get a big chunk of batteries, plus you have a
3:57:56 little bit of gas, it is possible to run it more green.
3:57:59 It’s just that the time scales for that are slow, right?
3:58:04 So people are trying, but, you know, Meta basically said, whatever, don’t care about
3:58:08 my sustainability pledge, or they’ll buy a PPA, a power purchase
3:58:12 agreement, where there’ll be a massive wind farm or solar farm, like wherever.
3:58:15 And then they’ll just pretend like those electrons are being consumed by the data center.
3:58:18 But in reality, they’re paying for the power over there and selling it to the grid, and they’re
3:58:20 buying power here.
3:58:24 And then another thing is like Microsoft quit on some of their sustainability pledges, right?
3:58:29 Elon, he, what he did with Memphis is objectively somewhat dirty, but he’s also doing it in an
3:58:34 area where there’s like a bigger natural gas plant right next door and like a sewer next
3:58:37 or not a sewer, but like a wastewater treatment and a garbage dump nearby, right?
3:58:41 And he’s obviously made the world a lot cleaner than that one data center is going to
3:58:42 make it dirty, right?
3:58:47 So I think like it’s fine to some extent, and maybe AGI solves, you know, global warming
3:58:48 and stuff, right?
3:58:51 Whatever it is, you know, this is, this is sort of the attitude that people at the labs
3:58:52 have, right?
3:58:53 Which is like, yeah, it’s great.
3:58:54 We’ll just use gas, right?
3:58:58 Because the race is that important and if we lose, you know, that’s way worse, right?
3:59:05 I should say that I got asked to visit the Memphis data center and it’s kind of incredible.
3:59:11 I mean, I visited with Elon, and just the teams and the rate of innovation
3:59:12 there is insane.
3:59:18 Because my sense is that, you know, nobody’s ever done anything of this scale and nobody
3:59:23 has certainly ever done anything of this scale at the rate that XAI is doing.
3:59:28 So they’re figuring it out, I mean, sitting in all these meetings with them
3:59:29 brainstorming.
3:59:31 It’s like, it’s insane.
3:59:32 It’s exciting.
3:59:35 Because they’re like, they’re trying to figure out what the bottlenecks are, how to remove
3:59:39 the bottlenecks, how to make sure that, you know, there’s just so many really cool things
3:59:46 about putting together a data center because, you know, everything has to work.
3:59:51 It’s the people that do, like, the sysadmin stuff, you know, the machine learning, all of that is
3:59:52 the exciting thing, and so on.
3:59:59 But really the people that run everything are the folks that know like the low level software
4:00:02 and hardware that runs everything, the networking, all of that.
4:00:06 And so you have to like make sure you have procedures that test everything.
4:00:07 I think they’re using Ethernet.
4:00:12 I don’t know how they’re doing the networking, but they’re using NVIDIA Spectrum X Ethernet.
4:00:16 There’s actually like, I think, yeah, the unsung heroes are the cooling and electrical
4:00:18 systems, which are just like glossed over.
4:00:19 Yeah.
4:00:24 But I think one story that maybe exemplifies how insane this stuff
4:00:29 is, is when you’re training, right, you’re running through the model
4:00:32 a bunch, right, in the most simplistic terms, running through the model a bunch.
4:00:37 And then you’re going to exchange everything and synchronize the weights, right?
4:00:38 So you’ll do a step.
4:00:40 This is like a step in model training, right?
4:00:42 At every step, your loss goes down, hopefully, and it doesn’t always.
4:00:46 But in the simplest terms, you’ll be computing a lot and then you’ll exchange, right?
4:00:49 The interesting thing is GPU power is most of it.
4:00:50 Networking power is some, but it’s a lot less.
4:00:53 But so while you’re computing, your power for your GPUs is here.
4:00:57 But then when you’re exchanging weights, if you’re not able to overlap communications
4:01:01 and compute perfectly, there may be a time period where your GPUs are just idle and you’re
4:01:04 exchanging weights and you’re like, hey, the model’s updating.
4:01:07 So you’re exchanging the gradients, you do the model update, and then you start training
4:01:08 again.
4:01:10 So the power goes, right?
4:01:11 And it’s super spiky.
4:01:16 And so funnily enough, right, like this, when you talk about the scale of data center power,
4:01:17 right?
4:01:19 You can blow stuff up so easily.
4:01:25 And so Meta actually has accidentally upstreamed something to code in PyTorch where they added
4:01:28 an operator, and I kid you not, whoever made this, like I want to hug the guy because it
4:01:35 says, it’s like pytorch.powerplant_no_blow_up, equals zero or equals one.
4:01:38 And what it does, what it does is amazing, right?
4:01:42 Basically, you know, when you’re exchanging the weights, the GPU will just compute fake
4:01:44 numbers so the power doesn’t spike too much.
4:01:48 And so then the power plants don’t blow up because the transient spikes screw stuff up.
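A tiny illustrative sketch of the effect being described, not the real PyTorch operator: each training step alternates a compute phase, with the GPUs at full draw, and a weight-exchange phase, where the GPUs are nearly idle unless you deliberately keep them busy, and that alternation is what produces the transients.

```python
# Toy model of facility power draw during training. "smooth=True" mimics the
# mitigation described above: do throwaway work during the gradient exchange
# so the draw stays flat instead of swinging between full load and idle.

def power_trace(steps, smooth=False, full_mw=140, idle_mw=20):
    trace = []
    for _ in range(steps):
        trace.append(full_mw)                          # forward/backward: full load
        trace.append(full_mw if smooth else idle_mw)   # gradient exchange phase
    return trace

print(power_trace(3, smooth=False))  # [140, 20, 140, 20, 140, 20]: big transients
print(power_trace(3, smooth=True))   # [140, 140, 140, 140, 140, 140]: flat draw
```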
4:01:49 Well, that makes sense.
4:01:51 I mean, you have to do that kind of thing.
4:01:53 You have to make sure they’re not idle, yeah.
4:01:56 And Elon’s solution was like, let me throw a bunch of Tesla mega packs and a few other
4:01:57 things, right?
4:02:01 Like everyone has different solutions, but like Meta’s at least was publicly and openly
4:02:05 known, which is just like, set this operator, and what this operator does is it just makes
4:02:08 the GPUs compute nothing so that the power doesn’t spike.
4:02:11 But that just tells you how much power you’re working with.
4:02:12 I mean, it’s insane.
4:02:13 It’s insane.
4:02:18 You can almost just go to Google, like scale, like what does X watts do and go through all
4:02:21 the scales from one watt to a kilowatt to a megawatt.
4:02:26 And you look and stare at that, and at how high on the list a gigawatt is, and it’s
4:02:27 mind-blowing.
4:02:30 Can you say something about the cooling?
4:02:37 So I know Elon’s using liquid cooling, I believe in all cases, that’s a new thing,
4:02:38 right?
4:02:39 Most of them don’t use liquid cooling.
4:02:41 Is there something interesting to say about the cooling?
4:02:42 Yeah, yeah.
4:02:46 Air cooling has been the de facto standard: throw a bunch of metal, heat pipes, et cetera,
4:02:47 and fans at it, right?
4:02:48 And like, keep it cold.
4:02:50 That’s been enough to cool it.
4:02:55 People have been dabbling in water cooling, Google’s TPUs are water cooled, right?
4:02:58 So they’ve been doing that for a few years.
4:03:01 But with GPUs, no one’s ever done, and no one’s ever done the scale of water cooling
4:03:04 that Elon just did, right?
4:03:09 Now, for NVIDIA’s next generation, for the highest-end GPU, water cooling is mandatory.
4:03:10 You have to water cool it.
4:03:14 So Elon did it on this current generation, and that required a lot of stuff, right?
4:03:19 If you look at some of the satellite photos and stuff of the Memphis facility, there’s
4:03:22 all these external water chillers that are just sitting there, basically.
4:03:26 They look like semi-truck pod things, what’s it called, containers.
4:03:29 But really those are water chillers, and he has like 90 of those water chillers just sitting
4:03:30 outside.
4:03:35 90 different containers, right, that chill the water, bring it back to the data center,
4:03:38 and then you distribute it to all the chips, pull all the heat out, and then send it back,
4:03:39 right?
4:03:44 So it’s both a way to cool the chips, but also an efficiency thing, all right?
4:03:50 And going back to that sort of three-vector thing, right, there is memory bandwidth, FLOPS,
4:03:51 and interconnect.
4:03:56 The closer the chips are together, the easier it is to do high-speed interconnects, right?
4:04:00 And so this is also like a reason why you’re going to go water cooling is because you can
4:04:06 just put the chips right next to each other, and therefore get higher speed connectivity.
4:04:14 I got to ask you, so in one of your recent posts, there’s a section called Cluster Measuring
4:04:17 Contest, so…
4:04:21 There’s another word there, but I won’t say it, you know?
4:04:25 What, who’s got the biggest now, and who’s going to have the biggest?
4:04:29 Today, individual largest is Elon, right?
4:04:30 Right.
4:04:31 Elon’s cluster.
4:04:34 Elon’s cluster in Memphis, 200,000 GPUs, right?
4:04:39 Meta has like 128,000, OpenAI has 100,000, now to be clear, other companies have more
4:04:42 GPUs than Elon, they just don’t have them in one place, right?
4:04:44 And for training, you want them tightly connected.
4:04:50 There’s some techniques that people are researching and working on that let you train across multiple
4:04:54 regions, but for the most part, you want them all in like one area, right?
4:04:57 So you can connect them highly with high-speed networking.
4:05:04 And so, you know, Elon today has 200,000 GPUs: 100,000 H100s and 100,000 H200s, right?
4:05:11 Meta, OpenAI, you know, and Amazon all have on the scale of 100,000, a little bit less.
4:05:14 But next, this year, right, this year, people are building much more, right?
4:05:19 Anthropic and Amazon are building a cluster of 400,000 Trainium 2, which is Amazon’s own
4:05:22 chip, trying to get away from Nvidia, right?
4:05:27 You know, Meta and OpenAI have plans on the scale of hundreds of thousands.
4:05:33 But by next year, you’ll have like 500,000 to 700,000 GPU clusters, and note those GPUs
4:05:36 are much higher power consumption than existing ones, right?
4:05:40 Hopper 700 watts, Blackwell goes to 1200 watts, right?
4:05:44 So the power per chip is growing and the number of chips is growing, right?
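As a rough sanity check on those figures, here is a back-of-the-envelope calculation; the 1.4x overhead multiplier for networking, CPUs, and cooling is an illustrative assumption, not a number from the conversation.

```python
def cluster_power_mw(num_gpus: int, watts_per_gpu: float, overhead: float = 1.4) -> float:
    # Total facility power in megawatts: GPU draw plus an assumed overhead factor
    # for networking, CPUs, and cooling.
    return num_gpus * watts_per_gpu * overhead / 1e6

print(cluster_power_mw(200_000, 700))    # ~196 MW for 200k Hopper-class GPUs at 700 W each
print(cluster_power_mw(700_000, 1200))   # ~1176 MW, roughly a gigawatt, for 700k Blackwells
```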
4:05:45 Nuts.
4:05:48 So, Elon said he’ll get to a million.
4:05:50 You think that’s actually feasible?
4:05:53 I mean, I don’t doubt Elon, right?
4:05:57 The filings that he has for like, you know, the power plant and the Tesla battery packs,
4:06:00 it’s clear he has some crazy plans for Memphis.
4:06:03 Like permits and stuff is open record, right?
4:06:07 But it’s not quite clear, you know, what the plans are and what the time scales are.
4:06:09 I just never doubt Elon, right?
4:06:10 You know, that’s, he’s going to surprise us.
4:06:12 So what’s the idea with these clusters?
4:06:18 If you have a million GPUs, what percentage in, let’s say, two, three years is used for
4:06:25 training, for pre-training, and what percent is used for the actual
4:06:26 computation, for inference?
4:06:28 So these mega clusters make no sense for inference, right?
4:06:31 You could route inference there and just not train.
4:06:35 But most of the inference capacity is being, you know, hey, I’ve got a 30 megawatt data
4:06:36 center here.
4:06:37 I’ve got 50 megawatts here.
4:06:38 I’ve got a hundred here, whatever.
4:06:43 I’ll just throw inference in all of those because the mega clusters, right, multi gigawatt
4:06:47 data centers, I want to train there because that’s where all of my GPUs are co-located
4:06:51 where I can put them at a super high networking speed connected together, right?
4:06:52 Because that’s what you need for training.
4:06:55 Now with pre-training, this is the old scale, right?
4:06:59 You could increase parameters, you’d increase data, model gets better.
4:07:03 That doesn’t apply anymore because there’s not much more data in the pre-training side,
4:07:04 right?
4:07:08 Yes, there’s video and audio and image that has not been fully taken advantage of.
4:07:09 So there’s a lot more scaling.
4:07:14 But a lot of people like, have taken transcripts of YouTube videos and that gets you a lot
4:07:15 of the data.
4:07:17 It doesn’t get you all of the learning value out of the video and image data.
4:07:20 But, you know, there’s still scaling to be done on pre-training.
4:07:24 This post-training world is where all the flops are going to be spent, right?
4:07:27 The model is going to play with itself, it’s going to self-play, it’s going to do verifiable
4:07:32 tasks, it’s going to do computer use in sandboxes, it might even do simulated robotics things,
4:07:33 right?
4:07:39 All of these things are going to be environments where compute is spent in quote unquote post-training.
4:07:42 But I think it’s going to be good, we’re going to drop the post from post-training.
4:07:43 Yeah.
4:07:48 It’s going to be pre-training and it’s going to be training, I think, at some point.
4:07:54 Because for the bulk of the last few years, pre-training has dwarfed post-training.
4:07:59 But with these verifiable methods, especially ones that scale potentially infinitely, like
4:08:04 computer use in robotics, not just math encoding, right, where you can verify what’s happening,
4:08:07 those infinitely verifiable tasks, it seems you can spend as much compute as you want
4:08:08 on them.
4:08:09 Especially as the context length increases.
4:08:13 Because the end of pre-training is when you increase the context length for these models.
4:08:17 And we’ve talked earlier in the conversation about how the context length, when you have
4:08:20 a long input, is much easier to manage than output.
4:08:25 And a lot of these post-training and reasoning techniques rely on a ton of sampling and it’s
4:08:27 becoming increasingly long context.
4:08:31 So, effectively, your compute efficiency goes down.
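A rough way to see why long outputs are so much more expensive than long inputs (a simplified sketch that ignores attention and KV-cache costs): the prompt is handled in one parallel prefill pass, while every generated token needs its own sequential forward pass.

```python
def forward_passes(prompt_tokens: int, output_tokens: int) -> int:
    # Prefill: the whole prompt, however long, is processed in one parallel pass
    # (prompt length affects the cost of that pass, not the number of passes).
    prefill_passes = 1
    # Decode: each generated token requires its own sequential pass.
    decode_passes = output_tokens
    return prefill_passes + decode_passes

print(forward_passes(prompt_tokens=100_000, output_tokens=10))   # 11 passes
print(forward_passes(prompt_tokens=100, output_tokens=10_000))   # 10,001 passes
```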
4:08:36 I think FLOPS is the standard for how you measure it, but with RL, where you have
4:08:40 to do all these things and move your weights around in a different way than in
4:08:46 pre-training and straight generation, it’s going to become less efficient, and FLOPS is going
4:08:48 to be less of a useful term.
4:08:51 And then as the infrastructure gets better, it’s probably going to go back to FLOPs.
4:08:56 So all of the things we’ve been talking about is most likely going to be NVIDIA, right?
4:08:57 Is there any competitors?
4:09:00 Google, Google, I kind of ignored them.
4:09:02 Yeah, what’s the story with TPU?
4:09:03 What’s the story with TPU?
4:09:04 Like, what’s the…
4:09:06 TPU is awesome, right?
4:09:07 It’s great.
4:09:11 Google is, they’re a bit more tepid on building data centers for some reason.
4:09:12 They’re building big data centers.
4:09:13 Don’t get me wrong.
4:09:17 They actually have the biggest cluster, I was talking about NVIDIA clusters.
4:09:20 They actually have the biggest cluster, period.
4:09:23 But the way they do it is very interesting, right?
4:09:26 They have two sort of data center super regions, right?
4:09:29 In that the data center isn’t physically, like all of the GPUs aren’t physically on
4:09:33 one site, but they’re like 30 miles from each other, not GPUs, TPUs, right?
4:09:37 They have like in Iowa and Nebraska, they have four data centers that are just like right
4:09:38 next to each other.
4:09:42 Why doesn’t Google flex its cluster size more often?
4:09:43 Go to multi-data-center training.
4:09:46 There are good images in there, so I’ll show you what I mean.
4:09:49 It’s the SemiAnalysis multi-data-center article.
4:09:52 So this is like, you know, so this is an image of like what a standard Google data center
4:09:53 looks like.
4:09:56 By the way, their data centers look very different than anyone else’s data centers.
4:09:57 What are we looking at here?
4:10:00 So these are, yeah, so if you see this image, right?
4:10:02 In the center, there are these big rectangular boxes, right?
4:10:05 Those are where the actual chips are kept.
4:10:10 And then if you scroll down a little bit further, you can see there’s like these water pipes,
4:10:14 there’s these chiller cooling towers in the top and a bunch of like diesel generators.
4:10:16 The diesel generators are backup power.
4:10:21 The data center itself looks physically smaller than the water chillers, right?
4:10:25 So the chips are actually easier to keep together, but then cooling all the water
4:10:27 for the water cooling is very difficult, right?
4:10:32 So Google has like a very advanced infrastructure that no one else has for the TPU.
4:10:35 And what they do is they’ve like stamped these data center, they’ve stamped a bunch of these
4:10:37 data centers out in a few regions, right?
4:10:42 So if you go a little bit further down, this is Microsoft’s.
4:10:43 This is in Arizona.
4:10:46 This is where GPT-5 quote unquote will be trained, you know.
4:10:48 If it doesn’t exist already.
4:10:50 Yeah, if it doesn’t exist already.
4:10:54 But each of these data centers, I’ve shown a couple images of them, they’re like really
4:10:56 closely co-located in the same region, right?
4:10:57 Nebraska, Iowa.
4:11:01 And then they also have a similar one in Ohio complex, right?
4:11:04 And so these data centers are really close to each other.
4:11:07 And what they’ve done is they’ve connected them super high bandwidth with fiber.
4:11:09 And so these are just a bunch of data centers.
4:11:14 And the point here is that Google has a very advanced infrastructure, very tightly connected
4:11:16 in a small region.
4:11:19 So Elon will always have the biggest cluster fully connected, right?
4:11:21 Because it’s all in one building, right?
4:11:23 And he’s completely right on that, right?
4:11:27 Google has the biggest cluster, by a significant margin, but you have to spread it over
4:11:30 three sites; you have to go across multiple sites.
4:11:33 Why doesn’t Google compete with Nvidia?
4:11:36 Why don’t they sell TPUs?
4:11:38 I think there’s a couple problems with it.
4:11:46 One, TPU has been a way of allowing search to be really freaking cheap and of building
4:11:48 models for that, right?
4:11:52 And so a big chunk of Google’s TPU purchases and usage, really all of it,
4:11:56 is for internal workloads, right?
4:12:02 Whether it be search, now Gemini, YouTube, all these different applications that they
4:12:06 have, you know, ads, these are where all their TPUs are being spent, and that’s what they’re
4:12:08 hyper focused on, right?
4:12:12 And so there’s certain like aspects of the architecture that are optimized for their
4:12:15 use case that are not optimized elsewhere, right?
4:12:19 One simple one is like they’ve open sourced a Gemma model and they called it Gemma 7B,
4:12:20 right?
4:12:24 But then it’s actually eight billion parameters because the vocabulary is so large, and the
4:12:28 reason they made the vocabulary so large is because the TPU’s matrix multiply unit
4:12:32 is massive, because that’s what they’ve sort of optimized for.
4:12:35 And so they decided, oh, I’ll just make the vocabulary large too, even though it makes
4:12:38 no sense to do so in such a small model, because that fits on their hardware.
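As a rough, illustrative calculation of how a huge vocabulary inflates the parameter count, using approximately Gemma-7B-like dimensions, which are assumptions here rather than exact figures:

```python
vocab_size = 256_000   # very large vocabulary, sized to keep the TPU's big matrix unit busy
hidden_dim = 3_072     # roughly Gemma-7B-like hidden size

embedding_params = vocab_size * hidden_dim
print(f"{embedding_params / 1e9:.2f}B parameters just in the embedding table")
# -> ~0.79B; with a separate (untied) output head it would roughly double,
# which is how a nominally "7B" model ends up closer to 8B+ total parameters.
```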
4:12:42 So Gemma doesn’t run as efficiently on a GPU as Llama does, right?
4:12:46 But vice versa, Llama doesn’t run as efficiently on a TPU as Gemma does, right?
4:12:50 And so there are certain aspects of hardware-software co-design here.
4:12:53 So all their search models, their ranking and recommendation models, all these different
4:12:59 models that are AI but not gen AI, right, have been hyper-optimized with TPUs forever.
4:13:03 The software stack is super optimized, but all of this software stack has not been released
4:13:06 publicly at all, right?
4:13:09 Very small portions of it, JAX and XLA, have been, but the experience when you’re
4:13:13 inside of Google and you’re training on TPUs as a researcher, you don’t need to know anything
4:13:15 about the hardware in many cases, right?
4:13:21 It’s like pretty beautiful, but as soon as you step outside, a lot of them go back.
4:13:23 They leave Google and then they go back.
4:13:26 Yeah, they’re like, they leave and they start a company because they have all these amazing
4:13:29 research ideas and they’re like, wait, infrastructure is hard.
4:13:30 Software is hard.
4:13:31 And this is on GPUs.
4:13:34 Or if they try to use TPUs, same thing, because they don’t have access to all this code.
4:13:37 And so it’s like, how do you convince a company whose golden goose is search, where they’re
4:13:43 making hundreds of billions of dollars, to start selling TPUs, which they used to
4:13:50 only buy a couple billion of, you know, I think in 2023, they bought like a couple billion.
4:13:53 And now they’re buying like 10 billion to 15 billion dollars worth, but how do you convince
4:13:56 them that they should just buy like twice as many and figure out how to sell them and
4:13:57 make 30 billion dollars?
4:14:00 Who cares about making 30 billion dollars?
4:14:04 Won’t that 30 billion exceed actually the search profit eventually?
4:14:10 Oh, I mean, like, you’re always going to make more money on services than on hardware.
4:14:14 I mean, yeah, to be clear, today, people are spending a lot more on hardware
4:14:16 than they are on the services, right?
4:14:19 Because the hardware front runs the service spend.
4:14:24 But like, if there’s no revenue for AI stuff or not enough revenue, then obviously like
4:14:26 it’s going to blow up, right?
4:14:28 People won’t continue to spend on GPUs forever.
4:14:31 And NVIDIA is trying to move up the stack with software that they’re trying to
4:14:33 sell and license and stuff, right?
4:14:38 But Google has never had that like DNA of like, this is a product we should sell, right?
4:14:42 Google Cloud does it, which is a separate organization from the TPU team, which is a
4:14:45 separate organization from the DeepMind team, which is a separate organization from the
4:14:46 search team, right?
4:14:47 There’s a lot of bureaucracy.
4:14:50 Wait, Google Cloud is a separate team than the TPU team?
4:14:54 Technically TPU sits under infrastructure, which sits under Google Cloud.
4:15:01 But Google Cloud, which is for renting stuff, and TPU architecture have very different
4:15:02 goals, right?
4:15:04 And hardware and software, all of this, right?
4:15:09 Like the JAX and XLA teams do not serve Google’s customers externally, whereas Nvidia’s various
4:15:14 CUDA teams, for things like NCCL, serve external customers, right?
4:15:19 The internal teams like JAX and XLA and stuff, they more so serve DeepMind and search, right?
4:15:21 And so their customers are different; they’re not building a product for them.
4:15:29 Do you understand why AWS keeps winning versus Azure for cloud versus Google Cloud?
4:15:32 Yeah, Google Cloud is tiny, isn’t it, relative to AWS?
4:15:34 Google Cloud is third, yeah, yeah.
4:15:37 Microsoft is the second biggest, but Amazon is the biggest, right?
4:15:42 And Microsoft deceptively sort of includes like Microsoft Office 365 and things like
4:15:43 that.
4:15:44 It’s enterprise-wide licenses.
4:15:46 So in reality, the gulf is even larger.
4:15:48 Microsoft is still second though, right?
4:15:49 Amazon is way bigger.
4:15:50 Why?
4:15:52 Because using AWS is better and easier.
4:15:53 And in many cases, it’s cheaper.
4:15:54 It was first.
4:15:55 And it’s first.
4:15:56 It was first.
4:15:57 Yeah, but there’s a lot of things that are first that…
4:15:58 Well, it’s easier.
4:16:00 It’s harder to switch than it is to…
4:16:01 Yeah, okay.
4:16:02 But AWS is…
4:16:03 There’s big fees for switching too.
4:16:06 AWS generates over 80% of Amazon’s profit.
4:16:07 I think over 90%.
4:16:08 That’s insane.
4:16:12 The distribution centers are just like, one day we’ll decide to make money from this.
4:16:13 But they haven’t yet, right?
4:16:14 Like they make tiny little profit from it.
4:16:17 One day Amazon Prime will triple in price.
4:16:22 You would think they would improve AWS interface because it’s like horrible.
4:16:25 It’s like clunky, but everybody’s…
4:16:28 Yeah, one would think.
4:16:31 I think actually Google’s interface is sometimes nice, but it’s also like they don’t care about
4:16:35 anyone besides their top customers and like their customer service sucks and like they
4:16:36 have a lot less.
4:16:39 I mean, all these companies, they optimized for the big customers.
4:16:40 Yeah.
4:16:41 It’s supposed to be for business.
4:16:44 But Amazon has always optimized for the small customer too though, right?
4:16:47 Like obviously they optimize a lot for the big customer, but like when they started,
4:16:51 they just would go to like random Bay Area things and give out credits, right?
4:16:52 And then they like…
4:16:53 Or just put in your credit card and use us, right?
4:16:54 Like back in the early days.
4:16:55 So they’ve always…
4:16:56 The business has grown with them, right?
4:16:57 In Virgin.
4:16:58 So like, why does Amazon…
4:17:02 Like, why is Snowflake all over Amazon? Because Snowflake, in the beginning when Amazon didn’t
4:17:04 care about them, was still using Amazon, right?
4:17:08 And then of course one day Snowflake and Amazon have a super huge partnership, but this
4:17:11 is the case: Amazon’s user experience and quality is better.
4:17:15 Also, a lot of the silicon they’ve engineered gives them a lower cost structure in
4:17:21 traditional cloud: storage, CPU, networking, that kind of stuff, and in databases, right?
4:17:27 Like, I think four of Amazon’s top five
4:17:31 gross-profit products are all database-related products, like Redshift and all these
4:17:32 things, right?
4:17:38 So Amazon has a very good silicon-to-user-experience pipeline with
4:17:39 AWS.
4:17:40 I think Google…
4:17:42 Their silicon teams?
4:17:46 Yeah, they have awesome silicon internally, TPU, the YouTube chip, some of these other
4:17:48 chips that they’ve made.
4:17:52 And the problem is they’re not serving external customers, they’re serving internal customers,
4:17:53 right?
4:17:56 I mean, NVIDIA’s entire culture is designed from the bottom up to do this.
4:18:01 There’s this recent book, The Nvidia Way, by Tae Kim, that details this and how they
4:18:07 look for future opportunities and ready their CUDA software libraries to make it so that
4:18:13 new applications of high performance computing can very rapidly be evolved on CUDA and NVIDIA
4:18:14 chips.
4:18:18 And that is entirely different than Google as a services business.
4:18:19 Yeah.
4:18:22 NVIDIA, it should be said, is a truly special company.
4:18:26 Like, I mean, the whole culture, everything, they’re really optimized for that
4:18:27 kind of thing.
4:18:33 Which is, is there somebody that can even challenge NVIDIA hardware-wise? Intel, AMD?
4:18:35 I really don’t think so.
4:18:42 We went through a very long process of working with AMD on training on their GPUs and inference
4:18:43 and stuff.
4:18:44 And they’re decent.
4:18:46 Their hardware is better in many ways than NVIDIA’s.
4:18:48 The problem is their software is really bad.
4:18:50 And I think they’re getting better, right?
4:18:54 They’re getting better faster, but the gulf is so large.
4:18:58 They don’t spend enough resources on it, or haven’t historically, right?
4:19:02 Maybe they’re changing their tune now, but for multiple months, we were submitting the
4:19:03 most bugs, right?
4:19:05 Like, ah, semianalysis, right?
4:19:06 Like, what the fuck?
4:19:08 Like, why are we submitting the most bugs, right?
4:19:11 Because they only cared about their biggest customers.
4:19:15 And so they’d ship them a private image, blah, blah, blah, and it’s like, okay, but like,
4:19:20 I am just using PyTorch and I want to use the publicly available libraries and you don’t
4:19:21 care about that, right?
4:19:25 So, they’re getting better, but I don’t think AMD is a real possibility, and Intel’s obviously in
4:19:29 dire straits right now and needs to be saved somehow.
4:19:33 Very important for national security, for American, you know, technology.
4:19:36 Can you explain the obvious, so why are they in dire straits?
4:19:39 Going back to earlier, only three companies can do R&D, right?
4:19:45 TSMC in Hsinchu, Taiwan, Samsung in Pyeongtaek, and then Intel in Hillsboro.
4:19:46 Samsung’s doing horribly.
4:19:47 Intel’s doing horribly.
4:19:50 We could be in a world where there’s only one company that can do R&D, and that one company
4:19:52 already manufactures most chips.
4:19:55 They’ve been gaining market share anyways, but like, that’s a critical thing, right?
4:19:58 So whatever happens to Taiwan matters, because the rest of the world’s semiconductor industry, and therefore
4:20:01 tech, relies on Taiwan, right?
4:20:03 And that’s obviously precarious.
4:20:08 As far as like Intel, they’ve been slowly steadily declining.
4:20:13 They were on top of servers and PCs, but now Apple’s done the M1 and Nvidia’s releasing
4:20:17 a PC chip and Qualcomm’s releasing a PC chip and in servers, hyperscalers are all making
4:20:23 their own ARM-based server chips, and Intel has no AI silicon wins, right?
4:20:25 They have very small wins.
4:20:29 And they never got into mobile because they said no to the iPhone and like, all these
4:20:32 things have compounded and they’ve lost their process technology leadership, right?
4:20:35 They were ahead for 20 years and now they’re behind by at least a couple years, right?
4:20:40 And they’re trying to catch back up and we’ll see if like their 18A, 14A strategy works
4:20:42 out where they try and leapfrog TSMC.
4:20:46 But like, and Intel is just like losing tons of money anyways, right?
4:20:49 And they just fired their CEO, even though the CEO was the only person who understood
4:20:50 the company.
4:20:51 Well, right, we’ll see.
4:20:56 He was not the best, but he was pretty good, relatively, technical guy.
4:20:57 Where does Intel make most of its money?
4:20:58 The CPUs, though.
4:21:01 PCs and data center CPUs, yeah, but data center CPUs are all going cloud.
4:21:05 And Amazon, Microsoft, Google are making their own ARM-based CPUs.
4:21:10 And then PC side, AMD’s gained market share, Nvidia’s launching a chip.
4:21:11 Whether or not that’s going to be a success, right?
4:21:15 MediaTek, Qualcomm have launched chips, Apple’s doing well, right?
4:21:19 Like they could get squeezed a little bit in PC, although PC generally, I imagine, will
4:21:21 just stick with Intel mostly on the Windows side.
4:21:25 Let’s talk about the broad AI race, who do you think wins?
4:21:26 We talked about Google.
4:21:31 The leader, the default leader has been Google because of their infrastructure advantage.
4:21:35 Well, like in the news, open AI is the leader.
4:21:36 They’re the leading in the narrative.
4:21:37 They have the best model.
4:21:40 They have the best model that people can use and they’re experts.
4:21:42 And they have the most AI revenue.
4:21:43 Yeah.
4:21:45 Open AI is winning, right?
4:21:48 So who’s making money on AI right now?
4:21:49 Is anyone making money?
4:21:53 So accounting profit wise, Microsoft is making money, but they’re spending a lot on capex,
4:21:54 right?
4:21:56 You know, and that gets depreciated over years.
4:22:01 Meta is making tons of money, but with recommendation systems, which is AI, but not with Llama, right?
4:22:04 Llama’s losing money for sure, right?
4:22:08 I think anthropic and open AI are obviously not making money because otherwise they wouldn’t
4:22:09 be raising money, right?
4:22:12 They have to raise money to build more, right?
4:22:14 Well, theoretically, they are making money, right?
4:22:18 You spent a few hundred million dollars on GPT-4 and it’s doing billions in revenue.
4:22:22 So obviously it’s making money, although they had to continue to research to get the compute
4:22:24 efficiency wins, right?
4:22:30 And move down the curve to get that 1200X that has been achieved for GPT-3.
4:22:35 Maybe we’re only at a couple hundred X now, but with GPT-4 Turbo and 4o, and there’ll be
4:22:40 another one, probably cheaper than GPT-4o even, that comes out at some point.
4:22:42 And that research costs a lot of money, right?
4:22:43 Yep, exactly.
4:22:48 That’s the thing that I guess is not talked about with the cost, that when you’re referring
4:22:54 to the cost of the model, it’s not just the training or the test runs, it’s the actual
4:22:55 research, the manpower.
4:22:59 Yeah, to do things like reasoning right now that that exists, they’re going to scale it,
4:23:00 they’re going to do a lot of research.
4:23:07 I think people focus on the payback question, but it’s really easy to just be like, well,
4:23:10 GDP is humans and industrial capital, right?
4:23:14 And if you can make intelligence cheap, then you can grow a lot, right?
4:23:18 That’s the sort of dumb way to explain it, but that’s sort of what basically the investment
4:23:19 thesis is.
4:23:24 I think only NVIDIA is actually making tons of money and other hardware vendors.
4:23:28 The hyperscalers are all on paper making money, but in reality, they’re like spending a lot
4:23:32 more on purchasing the GPUs, which you don’t know if they’re still going to make this much
4:23:35 money on each GPU in two years, right?
4:23:41 You don’t know if all of a sudden, OpenAI goes kapoof, and now Microsoft has like hundreds
4:23:46 of thousands of GPUs they were renting to OpenAI that they paid for themselves with
4:23:50 their investment in them, that no longer have a customer, right?
4:23:53 This is always a possibility, I don’t believe that, right?
4:23:57 I think OpenAI will keep raising money, I think others will keep raising money because
4:24:02 the investments, the returns from it are going to be eventually huge once we have AGI.
4:24:05 So do you think multiple companies will get, let’s assume-
4:24:07 I don’t think it’s going to be winner-take-all.
4:24:08 Okay.
4:24:12 So let’s not call it AGI or whatever; it’s not like a single day.
4:24:13 It’s a gradual thing.
4:24:15 Super powerful AI.
4:24:20 But it’s a gradually increasing set of features that are useful and make a lot of money.
4:24:22 Rapidly increasing set of features.
4:24:25 Rapidly increasing set of features.
4:24:32 So you’re saying a lot of companies will be, it just seems absurd that all of these companies
4:24:35 are building gigantic data centers.
4:24:39 There are companies that will benefit from AI but not because they trained the best model.
4:24:44 Meta has so many avenues to benefit from AI and all of their services, people are there,
4:24:47 people spend time on Meta’s platforms and it’s a way to make more money per user per
4:24:48 hour.
4:24:58 Yeah, it seems like Google X/XAI/Tesla, important to say, and then Meta will benefit not directly
4:25:06 from the AI like the LLMs, but from the intelligence, like the additional boost of intelligence to
4:25:07 the products they already sell.
4:25:12 So whether that’s the recommendation system or for Elon, who’s been talking about Optimus,
4:25:16 the robot, potentially the intelligence of the robot.
4:25:20 And then you have personalized robots in the home, that kind of thing.
4:25:25 He thinks it’s a 10 plus trillion dollar business, which-
4:25:30 At some point maybe, not soon, but who knows what robotics-
4:25:35 Let’s do a TAM analysis, right, 8 billion humans and let’s get 8 billion robots, right,
4:25:39 and let’s pay them the average salary and yeah, there we go, 10 trillion.
4:25:40 More than 10 trillion.
4:25:46 Yeah, I mean, if there’s robots everywhere, why does it have to be just eight billion
4:25:47 robots?
4:25:48 Yeah, of course, of course.
4:25:51 I’m gonna have like one robot, you’re gonna have like 20.
4:25:54 Yeah, I mean, I see a use case for that.
4:25:59 So yeah, I guess the benefit would be in the products as well, which is why OpenAI is in
4:26:00 a trickier position because they-
4:26:04 All of the value of OpenAI right now as a brand is in ChatGPT.
4:26:09 And for most users, there’s not that much of a reason that they
4:26:14 need OpenAI to be spending billions and billions of dollars on the next best model when they
4:26:17 could just license Llama 5 for way cheaper.
4:26:22 So that’s kind of like, ChatGPT is an extremely valuable entity to them.
4:26:25 But like, they could make more money just off that.
4:26:29 The chat application clearly does not have tons of room to continue, right?
4:26:30 Like the standard Chat, right?
4:26:33 Where you’re just using it for a random question and stuff, right?
4:26:36 The cost continues to collapse, V3 is the latest one.
4:26:37 It’ll go down to ads.
4:26:39 But it’s gonna get supported by ads, right?
4:26:44 Like, you know, Meta already serves 405B and probably loses money on it, but at some point,
4:26:48 you know, the models are gonna get so cheap that they can just
4:26:50 serve them for free, ad supported, right?
4:26:53 And that’s what Google is going to be able to do, and that’s obviously they’ve got a
4:26:54 bigger reach, right?
4:26:56 So Chat is not going to be the only use case.
4:27:00 It’s like these reasoning, code, agents, computer use.
4:27:03 All this stuff is where OpenAI has to actually go to make money in the future.
4:27:04 Otherwise, they’re kaput.
4:27:09 But X, Google and Meta have these other products.
4:27:15 So doesn’t, isn’t it likely that OpenAI and Anthropic disappear eventually?
4:27:18 Unless they’re so good at models, they are.
4:27:19 But it’s such a cutting edge.
4:27:20 I mean, yes.
4:27:22 It depends on where you think AI capabilities are going.
4:27:24 You have to keep winning.
4:27:25 Yes.
4:27:26 You have to keep winning.
4:27:31 As you climb, even if the AI capabilities are going super rapidly awesome into the direction
4:27:39 of AGI, like there’s still a boost for X in terms of data, Google in terms of data, Meta
4:27:44 in terms of data, in terms of other products and the money and like there’s just huge amounts
4:27:45 of money.
4:27:46 But the whole idea is human data is kind of tapped out.
4:27:47 We don’t care.
4:27:48 We don’t care.
4:27:49 We care about self-play, verifiable tasks.
4:27:50 Yes, the self-play.
4:27:51 Think about AWS.
4:27:52 Which is an R&D problem.
4:27:56 AWS does not make a lot of money on each individual machine.
4:28:01 And the same can be said for the most powerful AI platform, which is even though the calls
4:28:06 to the API are so cheap, there’s still a lot of money to be made by owning that platform.
4:28:10 And there’s a lot of discussions as it’s the next compute layer.
4:28:14 You have to believe that, and there’s a lot of discussions that tokens and tokenomics
4:28:18 and LLM APIs are the next compute layer or the next paradigm for the economy, kind of
4:28:20 like energy and oil was.
4:28:26 But there’s also like, you have to sort of believe that APIs and chat are not where AI
4:28:27 is stuck, right?
4:28:30 It is actually just tasks and agents and robotics and computer use.
4:28:36 And those are the areas where all the value will be delivered, not API, not chat application.
4:28:43 Is it possible you have, I mean, it all just becomes a commodity and you have the very
4:28:49 thin wrapper, like perplexity, just joking.
4:28:51 There are a lot of wrappers making a lot of money.
4:28:52 Yeah.
4:28:56 But do you think it’s possible that people will just even forget what OpenAI and
4:28:57 Anthropic are?
4:29:00 And just because there’ll be wrappers around the API and it just dynamically…
4:29:04 If model progress is not rapid, yeah, it’s becoming a commodity, right?
4:29:09 DeepSeek V3 shows this, but also the GPT-3 chart from earlier showed this, right?
4:29:12 Llama 3B is 1200X cheaper than GPT-3.
4:29:17 Any GPT-3, like anyone whose business model is GPT-3 level capabilities is dead.
4:29:20 Anyone whose business model is GPT-4 level capabilities is dead, right?
4:29:25 It is a common saying that the best businesses being made now are ones that are predicated
4:29:26 on models getting better, right?
4:29:32 Which would be like wrappers, things that are riding the wave of the models.
4:29:35 The short term, the company that could make the most money is the one that figures out
4:29:40 what advertising targeting method works for language model generations.
4:29:45 We have the meta ads, which are hyper-targeted in feed, not within specific pieces of content.
4:29:49 And we have search ads that are used by Google and Amazon has been rising a lot on search.
4:29:56 But within a return from chat GPT, it is not clear how you get a high-quality placed ad
4:29:57 within the output.
4:30:04 And if you can do that with model costs coming down, you can just get super high revenue.
4:30:07 That revenue is totally untapped and it’s not clear technically how it is done.
4:30:12 Yeah, that is sort of the AdSense innovation that Google did.
4:30:18 Then one day you’ll have an ad in the ChatGPT output, and that’s going to make billions of dollars.
4:30:20 And it could be very subtle.
4:30:21 It could be in conversation.
4:30:22 We have voice mode now.
4:30:27 It could be some way of making it so the voice introduces certain things.
4:30:30 It’s much harder to measure and it takes imagination, but yeah.
4:30:36 And it shouldn’t come off shady, so you don’t receive public blowback, that kind of thing.
4:30:40 You have to do it loudly enough to where it’s clear it’s an ad, and balance all of that.
4:30:43 So that’s the open question they’re trying to solve.
4:30:45 Anthropic and OpenAI, they need to…
4:30:46 They might not say that they’re trying…
4:30:47 I don’t think they care about that at all.
4:30:49 They don’t care about it right now.
4:30:50 I think it’s places like…
4:30:51 I think they’re purely…
4:30:52 Purely…
4:30:53 They’re experimenting on that more.
4:30:54 Oh, interesting.
4:30:55 Yeah, for sure.
4:30:58 Like, Perplexity, Google, Meta care about this.
4:31:02 I think OpenAI and Anthropic are purely laser focused on…
4:31:03 AGI.
4:31:04 Yeah.
4:31:05 Agents and AGI.
4:31:11 Agents and AGI; with that, I can make tons of money, I can pay for everything.
4:31:12 This is…
4:31:15 It’s just predicated like back on the export control thing.
4:31:19 If you think AGI is five, 10 years away or less, these labs think it’s two, three years
4:31:20 away.
4:31:24 Obviously, your actions are…
4:31:29 If you assume they’re rational actors, which they are mostly, what you do in a two-year
4:31:34 AGI versus five-year versus 10-year is very, very, very different.
4:31:36 Do you think agents are promising?
4:31:40 We have to talk about this.
4:31:44 This is like the excitement of the year that agents are going to…
4:31:51 The generic hype term that a lot of business folks are using, AI agents are going to revolutionize
4:31:52 everything.
4:31:53 Okay.
4:31:55 So, mostly the term agent is obviously overblown.
4:32:00 We’ve talked a lot about reinforcement learning as a way to train for verifiable outcomes.
4:32:04 This should mean something that is open-ended and is solving a task independently on its
4:32:07 own and able to adapt to uncertainty.
4:32:11 There is a lot of the term agent applied to things like Apple Intelligence, which we
4:32:16 still don’t have after the last WWDC, which is orchestrating between apps.
4:32:20 That sort of tool use thing is something that language models can do really well.
4:32:23 Apple Intelligence, I suspect will come eventually.
4:32:24 It’s a closed domain.
4:32:29 It’s your messages app integrating with your photos, with AI in the background.
4:32:30 That will work.
4:32:35 This has been described as an agent by a lot of software companies to get into the narrative.
4:32:43 The question is, what ways can we get language models to generalize to new domains and solve
4:32:45 their own problems in real time?
4:32:49 Maybe some tiny amount of training when they are doing this with fine-tuning themselves
4:32:53 or in-context learning, which is the idea of storing information in a prompt.
4:32:58 You can use learning algorithms to update that and whether or not you believe that that
4:33:05 is going to actually generalize to things like me saying, “Book my trip to go to Austin
4:33:06 in two days.
4:33:10 I have XYZ constraints and actually trusting it.”
4:33:13 I think there’s an HCI problem coming back for information.
4:33:15 Well, what’s your prediction there?
4:33:18 Because my gut says we’re very far away from that.
4:33:23 I think OpenAI’s statement, I don’t know if you’ve seen the five levels, right?
4:33:28 Where it’s chat is level one, reasoning is level two, and then agents is level three.
4:33:31 I think there’s a couple more levels, but it’s important to note, right?
4:33:34 We were in chat for a couple of years, right?
4:33:37 We just theoretically got to reasoning.
4:33:39 We’ll be here for a year or two, right?
4:33:44 And then agents, but at the same time, people can try and approximate capabilities of the
4:33:45 next level.
4:33:49 But the agents are doing things autonomously, doing things for minutes at a time, hours
4:33:52 at a time, et cetera, right?
4:33:56 Right now, everything is doing things for tens of seconds at a time, right?
4:33:59 And then coming back with an output that I still need to verify and use and try to check
4:34:01 out, right?
4:34:05 And the biggest problem is, of course, it’s the same thing with manufacturing, right?
4:34:07 There’s the whole Six Sigma thing, right?
4:34:08 How many nines do you get?
4:34:12 And then you compound the nines onto each other, and it’s like, if you multiply the
4:34:18 per-step reliability across the number of steps, you get a yield, right?
4:34:23 So in semiconductor manufacturing, with tens of thousands of steps, 99.9999% per step is not enough,
4:34:24 right?
4:34:28 Because you multiply it by that many times, you actually end up with like 60% yield, right?
4:34:29 Yeah, or zero.
4:34:30 Or low yield, yeah, or zero.
4:34:32 And this is the same thing with agents, right?
4:34:40 Chaining tasks together each time, LLMs, even the best LLMs on pretty good benchmarks,
4:34:42 don’t get 100%, right?
4:34:45 They get a little bit below that because there’s a lot of noise.
4:34:49 And so how do you get to enough nines, right?
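The compounding math is simple but brutal; a quick illustration (the specific reliabilities and step counts here are just examples, not figures from the conversation):

```python
def chained_success(per_step_reliability: float, steps: int) -> float:
    # Overall success rate when every step must succeed independently.
    return per_step_reliability ** steps

print(chained_success(0.9999, 10_000))     # ~0.37: four nines over 10,000 steps
print(chained_success(0.999999, 500_000))  # ~0.61: even six nines sags over enough steps
print(chained_success(0.95, 20))           # ~0.36: a 20-step agent at 95% per step
```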
4:34:50 This is the same thing with self-driving.
4:34:54 We can’t have self-driving without it being, like, super geofenced like Google’s,
4:34:55 like Waymo, right?
4:34:58 And even then they have a bunch of teleoperators to make sure it doesn’t get stuck, right?
4:35:01 But you can’t do that because it doesn’t have enough nines.
4:35:07 And self-driving has quite a lot of structure because roads have rules.
4:35:08 It’s well-defined.
4:35:09 There’s regulation.
4:35:15 And when you’re talking about computer use for the open web, for example, or the open
4:35:19 operating system, like there’s no, it’s a mess.
4:35:27 So like the possibility, I’m always skeptical of any system that is tasked with interacting
4:35:30 with the human world, with the open messy human world.
4:35:31 That’s the thing.
4:35:35 If we can’t get intelligence that’s enough to solve the human world on its own, we can
4:35:41 create infrastructure like the human operators for Waymo over many years that enables certain
4:35:42 workloads.
4:35:45 There is a company, I don’t remember the name, but that’s literally their pitch.
4:35:47 Yeah, we’re just going to be the human operator when agents fail.
4:35:49 And you just call us and we fix it.
4:35:50 Yeah.
4:35:51 It’s like an API call and it’s hilarious.
4:35:54 There’s going to be tele-operation markets when we get humanoid robots, which is there’s
4:35:59 going to be somebody around the world that’s happy to fix the fact that it can’t finish
4:36:03 loading my dishwasher when I’m unhappy with it, but that’s just going to be part of the
4:36:04 Tesla service package.
4:36:10 I’m just imagining like an AI agent talking to another AI agent.
4:36:15 One company has an AI agent that specializes in helping other AI agents.
4:36:19 But if you can make things that are good at one step, you can stack them together.
4:36:23 So that’s why I’m like, if it takes a long time, we’re going to build infrastructure that
4:36:24 enables it.
4:36:29 You see the Operator launch, they have partnerships with certain websites, with DoorDash, with OpenTable,
4:36:31 with things like this.
4:36:35 Those partnerships are going to let them climb really fast, their model is going to get really
4:36:36 good at those things.
4:36:40 It’s going to prove a concept that might be a network effect where more companies want
4:36:41 to make it easier for AI.
4:36:45 Some companies will be like, no, let’s put blockers in place.
4:36:47 And this is the story of the internet we’ve seen.
4:36:51 We see it now with training data for language models where companies are like, no, you have
4:36:55 to pay; it’s businesses working it out.
4:37:00 That said, I think airlines and hotels have a high incentive to make their
4:37:03 site work really well, and they usually don’t.
4:37:09 Like if you look at how many clicks it takes to order an airplane ticket, it’s insane.
4:37:12 You actually can’t call an American Airlines agent anymore.
4:37:14 They don’t have a phone number.
4:37:20 I mean, it’s horrible on many, on the interface front, to imagine that agents will be able
4:37:25 to deal with that website when I as a human struggle, like I have an existential crisis
4:37:31 every time I try to book an airplane ticket that I don’t, I think it’s going to be extremely
4:37:35 difficult to build an AI agent that’s robust in that way.
4:37:38 But think about it, like United has accepted the Starlink terms, which is they have to provide
4:37:41 Starlink for free, and the users are going to love it.
4:37:45 What if one airline is like, we’re going to take a year and we’re going to make our website
4:37:49 have white text that works perfectly for the AIs.
4:37:53 Every time anyone asks an AI about a flight, they buy whatever airline it is.
4:37:58 Or like, they just say, here’s an API and it’s only exposed to AI agents, and if anyone
4:38:03 queries it, the price is 10% higher for any flight, but we’ll let you see any of our
4:38:05 flights and you can just book any of them.
4:38:06 Here you go.
4:38:07 Agent Matt.
4:38:08 And then it’s like, oh, and I made 10% higher price.
4:38:09 Awesome.
4:38:10 Yeah.
4:38:12 And like, am I willing to pay that for, like, hey, book me a flight to see Lex, right?
4:38:13 And it’s like, yeah, whatever.
4:38:21 I think computers and real world and the open world are really, really messy.
4:38:25 But if you start defining the problem in narrow regions, people are going to be able to create
4:38:32 very, very productive things and ratchet down cost massively, right?
4:38:38 Now, crazy things like robotics in the home, those are going to be a lot harder to do just
4:38:43 like self-driving because there’s just a billion different failure modes, right?
4:38:48 But agents that can like navigate a certain set of websites and do certain sets of tasks
4:38:53 or, you know, take a photo of your groceries, your fridge, or upload
4:38:57 your recipes, and then it figures out what to order from, you know, Amazon slash
4:38:59 Whole Foods food delivery.
4:39:01 Like that’s going to be like pretty quick and easy to do, I think.
4:39:05 So it’s going to be a whole range of business outcomes, and there’s going to be tons
4:39:08 and tons of optimism around it; people can just figure out ways to make money.
4:39:11 To be clear, these sandboxes already exist in research.
4:39:16 There are people who have built clones of all the most popular websites of Google, Amazon,
4:39:20 blah, blah, blah to make it so that there’s, I mean, OpenAI probably has them internally
4:39:21 to train these things.
4:39:26 It’s the same as DeepMind’s robotics team for years has had clusters for robotics where
4:39:28 you interact with robots fully remotely.
4:39:33 They just have a lab in London and you send tasks to it, arrange the blocks and you do
4:39:34 this research.
4:39:39 Obviously, there are techs there that fix stuff, but we’ve turned these cranks of automation
4:39:40 before.
4:39:46 You go from sandbox to progress and then you add one more domain at a time and generalize
4:39:47 it.
4:39:51 I think in the history of NLP and language processing, with instruction tuning and tasks per
4:39:54 language model, it used to be that one language model did one task.
4:39:57 And then in the instruction tuning literature, there’s this point where you start adding
4:40:01 more and more tasks together where it just starts to generalize to every task.
4:40:03 And we don’t know where on this curve we are.
4:40:07 I think for reasoning with this RL and verifiable domains were very early, but we don’t know
4:40:12 where the point is where you just start training on enough domains and poof like more domains
4:40:15 to start working and you’ve crossed the generalization barrier.
4:40:20 Well, what do you think about the programming context?
4:40:28 So software engineering, that’s where I personally know a lot of people interact with AI the
4:40:29 most.
4:40:34 There’s a lot of fear and angst too from current CS students, but that is the area where probably
4:40:40 the most AI revenue and productivity gains have come, whether it be co-pilots or cursor
4:40:44 or what have you, right, this is or just standard chat GPT, right?
4:40:49 Like a lot of, I know very few programmers who don’t have chat GPT and actually many
4:40:53 of them have the $200 tier because that’s what it’s so good for, right?
4:40:58 I think that in that world, we already see it like SWE bench and if you’ve looked at
4:41:03 the benchmark made by some Stanford students, I wouldn’t say it’s like really hard, but
4:41:04 I wouldn’t say it’s easy either.
4:41:08 I think it takes someone who’s been through at least, you know, a few years of CS or a couple
4:41:11 years of programming to do SWE-bench well.
4:41:16 And the models went from 4% to 60% in like a year, right?
4:41:18 And where are they going to go to next year?
4:41:21 You know, it’s going to be higher, probably won’t be 100% because, again, getting those nines is
4:41:23 really hard to do.
4:41:25 But we’re going to get to some point where that’s saturated, and then we’re going to need harder
4:41:28 software engineering benchmarks and so on and so forth.
4:41:33 But the way that people think of it now is: it can do code completion, easy.
4:41:36 It can do some function generation and I have to review it, great.
4:41:41 But really the like software engineering agents I think can be done faster sooner than any
4:41:44 other agent because it is a verifiable domain.
4:41:51 You can always unit test or compile, and there are many different angles, like it can
4:41:55 inspect the whole code base at once, which no engineer really can; only the architects
4:41:59 can really think about this stuff, the really senior guys, and they can define stuff and
4:42:01 then the agent can execute on it.
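A hedged sketch of what "verifiable" means here: a proposed patch can be checked mechanically by applying it to a scratch copy of the repo and running the test suite, which gives an automatic accept/reject signal (or an RL reward). The repo layout and the use of git and pytest are illustrative assumptions, not anyone's actual agent stack.

```python
import shutil
import subprocess
import tempfile

def verify_patch(repo_dir: str, patch: str) -> bool:
    # Work on a throwaway copy so a bad patch can't damage the real checkout.
    workdir = tempfile.mkdtemp()
    shutil.copytree(repo_dir, workdir, dirs_exist_ok=True)
    # Apply the agent's proposed diff from stdin.
    subprocess.run(["git", "apply", "-"], input=patch, text=True, cwd=workdir, check=True)
    # Run the tests; the exit code is the verification signal.
    result = subprocess.run(["pytest", "-q"], cwd=workdir)
    return result.returncode == 0   # reward = 1.0 if the tests pass, else 0.0
```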
4:42:05 So I think I think software engineering costs are going to plummet like crazy and one interesting
4:42:09 aspect of that is when software engineering costs are really low, you get very different
4:42:10 markets.
4:42:11 Right.
4:42:14 So in the US, you have all these platform SaaS companies, right, Salesforce and so on
4:42:15 and so forth.
4:42:16 Right.
4:42:20 In China, no one uses platform SaaS.
4:42:25 Everyone just builds their own stack because software engineering is much cheaper in China,
4:42:29 partially because of the number of STEM graduates, et cetera.
4:42:33 So it’s generally just cheaper to do.
4:42:36 And so at the same time, code LLMs have been adopted much less in China
4:42:39 because the cost of an engineer there is much lower.
4:42:42 But like what happens when every company can just invent their own business logic like
4:42:44 really cheaply and quickly.
4:42:48 You stop using platform SaaS, you start building custom-tailored solutions, you change them
4:42:49 really quickly.
4:42:51 Now all of a sudden your business is a little bit more efficient too potentially because
4:42:56 you’re not dealing with the hell that is some random platform SaaS company’s stuff not
4:43:00 working perfectly and having to adjust workflows, or random business automation cases that don’t
4:43:02 necessarily require AI.
4:43:04 It’s just logic that needs to be built that no one has built, right?
4:43:08 All of these things can happen faster, and so I think software, and then the other domain
4:43:12 is like industrial, chemical, mechanical engineers, who suck at coding, right?
4:43:17 Just generally, and their tools, like semiconductor engineers, their tools are 20 years old.
4:43:21 All the tools run on XP, including ASML lithography tools run on Windows XP, right?
4:43:25 It’s like, you know, and like a lot of the analysis happens in Excel, right?
4:43:29 It’s just like, guys, you can move 20 years forward with all the data you
4:43:31 have gathered and do a lot better.
4:43:34 It’s just you need the engineering skills for software engineering to be delivered to
4:43:36 the actual domain expert engineer.
4:43:40 So I think that’s the area where I’m super duper bullish on, generally,
4:43:42 AI creating value.
4:43:45 The big picture is that I don’t think it’s going to be a cliff.
4:43:51 It’s like, as we talked about, a really good example of how growth changes is when
4:43:53 Meta added Stories.
4:43:57 So Snapchat was on an exponential, Meta added Stories, and Snapchat flatlined.
4:44:01 Software engineers have been up and to the right; AI is going to come in, and it’s probably going
4:44:02 to be flat.
4:44:04 It’s like, it’s not like everyone’s going to lose their job.
4:44:08 It’s hard because the supply corrects more slowly.
4:44:10 So the amount of students is still growing.
4:44:13 And that’ll correct on a multi year, like a year delay.
4:44:16 But the amount of jobs will just turn.
4:44:20 And then maybe in 20, 40 years, it’ll be well down.
4:44:23 But in the next few years, there’s never going to be the snap moment where it’s like software
4:44:24 engineers aren’t useful.
4:44:28 I think also the nature of what it means to be a programmer and what kind of jobs programmers
4:44:29 do changes.
4:44:36 Cause I think there needs to be a human in the loop of everything you’ve talked about.
4:44:41 There’s a really important human in that picture of like correcting the code.
4:44:43 Like fixing.
4:44:45 Thinking larger than the context length.
4:44:46 Yep.
4:44:52 And debugging also, like debugging by sort of reading the code, understanding and steering
4:44:53 the system.
4:44:56 Like, no, no, no, you missed the point; adding more to the prompt.
4:44:58 Kind of like, yes.
4:45:02 And the human designing the perfect Google button, Google’s famous for having people
4:45:04 design buttons that are so perfect.
4:45:07 And it’s like, how, like, how is AI going to do that?
4:45:10 Like they could give you all ideas.
4:45:11 Perfect.
4:45:12 Fine.
4:45:13 I mean, that’s the thing.
4:45:14 You can call it taste.
4:45:19 One thing humans can do is figure out what other humans enjoy better than AI
4:45:20 systems can.
4:45:21 That’s where the preference…
4:45:25 You’re loading that in, but ultimately humans are the greatest preference generator.
4:45:27 That’s where the preference comes from.
4:45:31 And humans are actually very good at reading or judging between two things, and this
4:45:35 goes back to the core of what RLHF and preference tuning is, which is that it’s
4:45:38 hard to generate a good answer for a lot of problems, but it’s easy to see which one
4:45:39 is better.
4:45:43 And that’s how we’re using humans for AI now is judging which one is better.
4:45:47 And that’s what software engineering could look like: the PR review.
4:45:48 Here’s a few options.
4:45:53 Here are some potential pros and cons, and humans are going to be the judges.
4:46:00 I think the thing I would very much recommend is people start, programmers start using AI
4:46:05 and embracing that role of the supervisor of the AI system and like partner of the AI
4:46:10 system versus writing from scratch or not learning coding at all and just generating
4:46:11 stuff.
4:46:14 Because I think there actually has to be a pretty high level of expertise as a programmer
4:46:18 to be able to manage increasingly intelligent systems.
4:46:21 I think it’s that and then becoming a domain expert in something.
4:46:22 Sure.
4:46:23 Yeah.
4:46:27 Because seriously, if you go look at aerospace or semiconductors or chemical engineering,
4:46:30 everyone is using really crappy platforms, really old software.
4:46:34 Like the job of a data scientist is like a joke, right?
4:46:35 In many cases.
4:46:39 In many cases, it’s very real, but it’s like bring what the forefront of human capabilities
4:46:41 are to your domain.
4:46:45 And even if the forefront is from the AI, your domain, you’re at the forefront, right?
4:46:50 So it’s like, you have to be at the forefront of something and then leverage the rising
4:46:52 tide that is AI for everything else.
4:46:53 Yeah.
4:46:59 There’s so much low-hanging fruit everywhere in terms of where software can help automate
4:47:02 a thing or digitize a thing.
4:47:06 In the legal system, that’s why DOGE is exciting.
4:47:12 Yeah, I mean, I got to hang out with a bunch of the DOGE folks, and, I mean, government
4:47:15 is like so old school.
4:47:21 It’s like begging for the modernization of software, of organizing the data, all this
4:47:22 kind of stuff.
4:47:29 I mean, in that case is by design, because bureaucracy protects centers of power and
4:47:33 so on, but software breaks down those barriers.
4:47:39 So it hurts those that are holding onto power, but ultimately benefits humanity.
4:47:44 So there’s a bunch of domains of that kind.
4:47:49 One thing we didn’t fully finish talking about is open source.
4:47:51 So first of all, congrats.
4:47:52 You released a new model.
4:47:53 Yeah.
4:47:54 This is the…
4:47:55 Tulu.
4:47:56 I’ll explain what a Tulu is.
4:48:01 A tulu is a hybrid camel you get when you breed a dromedary with a Bactrian camel.
4:48:05 Back in the early days after ChatGPT, there was a big wave of models coming out, like Alpaca,
4:48:10 Vicuna, et cetera, that were all named after various mammalian species.
4:48:11 So Tulu is…
4:48:14 The brand is multiple years old, which comes from that.
4:48:20 And we’ve been playing at the frontiers of post training with open source code.
4:48:24 And the first part of this release was in the fall, where we used…
4:48:30 We built on Llama’s open models, open-weight models, and then we add in our fully open code
4:48:32 and fully open data.
4:48:36 There’s a popular benchmark that is chatbot arena, and that’s generally the metric by
4:48:41 which how these chat models are evaluated, and it’s humans compare random models from
4:48:42 different organizations.
4:48:48 And if you looked at the leaderboard in November or December, among the top 60 models from
4:48:53 10s to 20s of organizations, none of them had open code or data for just post training.
4:48:57 Among that, even fewer or none have pre-training data and code available, but post training
4:48:58 is much more accessible.
4:49:00 At this time, it’s still pretty cheap and you can do it.
4:49:04 And the thing is like, how high can we push this number where people have accessed all
4:49:05 the code and data?
4:49:07 So that’s kind of the motivation of the project.
4:49:12 We draw on lessons from Llama; NVIDIA had a Nemotron model where the recipe for their
4:49:17 post training was fairly open, with some data and a paper, and it’s putting all these together
4:49:22 to try to create a recipe that lets people fine-tune models like GPT-4 to their domain.
4:49:27 So to be clear, in the case of Tulu, maybe you can talk about Olmo too, but in the
4:49:31 case of Tulu, you’re taking Llama 3 405B.
4:49:35 Tulu has been a series of recipes for post training.
4:49:38 So we’ve done multiple models over years.
4:49:40 And so you’re open sourcing everything.
4:49:41 Yeah.
4:49:45 If you start with an open weight base model, the whole model technically isn’t open source,
4:49:49 because you don’t know what Llama put into it, which is why we have the separate thing
4:49:50 that we’ll get to.
4:49:54 But it’s just getting parts of the pipeline where people can zoom in and customize.
4:49:58 I know I hear from startups and businesses, they’re like, okay, I can take this post training
4:50:00 and try to apply it to my domain.
4:50:01 We talk about verifiers a lot.
4:50:08 We use this idea, which is reinforcement learning with verifiable rewards, RLVR, kind of similar
4:50:12 to RLHF, and we applied it to math.
4:50:18 And the model today, which is, we applied it to the Llama 405B base model from last year,
4:50:20 and we have our other stuff.
4:50:25 We have our instruction tuning and preference tuning, but the math thing is interesting,
4:50:28 which is like, it’s easier to improve this math benchmark.
4:50:32 There’s a benchmark, MATH, all capitals, tough name,
4:50:36 because the benchmark name is just the area that you’re evaluating.
4:50:37 We’re researchers.
4:50:39 We’re not brand strategists.
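To make “verifiable rewards” concrete, here is a minimal sketch of the idea (my illustration, not the actual Tulu/RLVR code): the reward is a programmatic check of the model’s final answer against a known ground truth, rather than a learned reward model.

```python
# Toy verifiable-reward function for math problems (illustrative sketch only).
import re

def extract_final_answer(completion: str):
    # Assumes the model is prompted to end with "Answer: <number>".
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", completion)
    return match.group(1) if match else None

def verifiable_reward(completion: str, ground_truth: str) -> float:
    # Binary reward: 1.0 if the extracted answer matches the reference, else 0.0.
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    return 1.0 if float(answer) == float(ground_truth) else 0.0

# An RL loop (PPO/GRPO-style) then reinforces completions that score 1.0.
print(verifiable_reward("The total is 6 * 7. Answer: 42", "42"))  # 1.0
```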
4:50:43 And this is something that the DeepSeek paper talked about as well, is like at this bigger
4:50:48 model, it’s easier to elicit powerful capabilities with this RL training, and then they distill
4:50:51 it down from that big model to the small model.
4:50:55 And with this model we released today, we saw the same thing. We’re at Ai2.
4:50:56 We don’t have a ton of compute.
4:51:01 We can’t train 405B models all the time, so we just did a few runs and they tend to work.
4:51:07 And it’s like, it just shows that there’s a lot of room for people to play in these things.
4:51:09 And they crushed Llama’s actual release, right?
4:51:11 They’re way better than it.
4:51:12 Yeah.
4:51:15 So our eval numbers, I mean, we have extra months in this, but our eval numbers are much
4:51:18 better than the Llama Instruct model that they released.
4:51:20 And they also said better than DeepSeek V3.
4:51:21 Yeah.
4:51:25 On our eval benchmark, DeepSeek V3 is really similar.
4:51:29 We have a safety benchmark to understand if it will say harmful things and things like
4:51:30 that.
4:51:31 And that’s what drags them down most of the way.
4:51:34 Is it still like an amalgamation of multiple benchmarks, or what do you mean?
4:51:35 Yeah.
4:51:36 So we have 10 evals.
4:51:39 This is like, this is standard practice in post training is you choose your evaluations
4:51:40 you care about.
4:51:43 In academia, in smaller labs, you’ll have fewer evaluations.
4:51:46 In companies, you’ll have one domain that you really care about.
4:51:50 In frontier labs, you’ll have 10s to 20s to maybe even like 100 evaluations of specific
4:51:51 things.
4:51:55 So we choose a representative suite of things that look like chat, precise instruction following,
4:51:58 which is like respond only in emojis.
4:51:59 Like does the model follow weird things like that?
4:52:00 Yeah.
4:52:02 Math, code, and you create a suite like this.
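Those “precise instruction following” checks can also be scored programmatically; as a toy example of the kind of constraint checker such suites use (my own illustration, not Ai2’s actual eval code), a “respond only in emojis” check might look like:

```python
# Toy checker for a "respond only in emojis" instruction (illustrative sketch only).
import unicodedata

def is_emoji_only(response: str) -> bool:
    stripped = response.strip()
    if not stripped:
        return False
    for ch in stripped:
        if ch.isspace():
            continue
        # Count "Symbol, other" codepoints as emoji; anything else fails the check.
        if unicodedata.category(ch) != "So":
            return False
    return True

print(is_emoji_only("🔥🚀🤖"))      # True
print(is_emoji_only("great job"))  # False
```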
4:52:07 So safety would be one of 10 in that type of suite, where you have, like, what does the broader
4:52:09 AI community care about?
4:52:12 And for example, in comparison to DeepSeek, it would be something like: our average
4:52:18 eval score for our model would be 80, including safety, and similar without, and DeepSeek would be
4:52:26 like a 79% average score without safety, and their safety score would bring it down.
4:52:28 Oh, so you beat them even ignoring safety?
4:52:29 Yeah.
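To make that arithmetic concrete with made-up numbers in the spirit of what’s being described (not the actual reported scores): with a 10-eval suite, one weak safety score pulls down an otherwise similar average.

```python
# Hypothetical illustration of a 10-eval average with and without a safety eval.
ours = [80] * 10              # strong across all 10 evals, including safety
theirs = [79] * 9 + [40]      # similar on the 9 capability evals, weak on safety

print(sum(ours) / 10)         # 80.0
print(sum(theirs[:9]) / 9)    # 79.0 -> close without safety
print(sum(theirs) / 10)       # 75.1 -> safety drags the overall average down
```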
4:52:33 So this is something that internally it’s like, I don’t want to win only by like how you shape
4:52:34 the eval benchmark.
4:52:36 So if there’s something that’s like people may or may not care about safety in their
4:52:39 model, safety can come downstream.
4:52:43 Safety can be addressed when you host the model behind an API; safety is addressed in a spectrum
4:52:44 of locations in AI applications.
4:52:47 So it’s like, if you want to say that you have the best recipe, you can’t just gate it
4:52:51 on these things that some people might not want.
4:52:57 And this is just, it’s like the time of progress and we benefit, we can release a model later,
4:53:01 we have more time to learn new techniques like this RL technique, we had started this
4:53:02 in the fall.
4:53:04 It’s now really popular as reasoning models.
4:53:08 The next thing to do for open source post training is to scale up verifiers, to scale
4:53:11 up data, to replicate some of DeepSeek’s results.
4:53:15 And it’s awesome that we have a paper to draw on and it makes it a lot easier.
4:53:22 And that’s the type of thing that’s going on across academic and closed frontier research
4:53:23 in AI.
4:53:25 Since you’re pushing open source, what do you think is the future of it?
4:53:30 Do you think DeepSeek actually changes things, since it’s open source or open weight, or is it
4:53:33 pushing the open source movement in a more open direction?
4:53:35 This goes very back to license discussion.
4:53:38 So DeepSeek R1 with a friendly license is a major reset.
4:53:42 So it’s like the first time that we’ve had a really clear frontier model that is open
4:53:46 weights and with a commercially friendly license with no restrictions on downstream
4:53:49 use cases, synthetic data, distillation, whatever.
4:53:53 This has never been the case at all in the history of AI in the last few years since
4:53:54 ChatGPT.
4:53:57 There have been models that are off the frontier or models with weird licenses that you can’t
4:53:58 really use them.
4:54:04 So isn’t Meta’s license like pretty much permissive except for five companies?
4:54:09 And so this goes to what open source AI is, which is there’s also use case restrictions
4:54:12 in the Llama license, which says you can’t use it for specific things.
4:54:15 So if you come from an open source software background, you would say that that is not
4:54:16 an open source license.
4:54:20 What kind of things are those, though?
4:54:22 At this point, I can’t pull them off the top of my head.
4:54:23 Stuff that’s competitors.
4:54:26 It used to be that military use was one, and they removed that for Scale.
4:54:32 It’ll be like CSAM, like child abuse material.
4:54:35 That’s the type of thing that is forbidden there, but that’s enough from an open source
4:54:38 background to say it’s not an open source license.
4:54:42 And also the Llama license has this horrible thing where you have to name your model Llama
4:54:45 if you touch the Llama model.
4:54:46 So it’s like the branding thing.
4:54:50 So if a company uses Llama, technically the license says that they should say built with
4:54:52 Llama at the bottom of their application.
4:54:54 And from a marketing perspective, that just hurts.
4:54:57 I could suck it up as a researcher and I’m like, oh, it’s fine.
4:55:01 It says Llama-dash on all of our materials for this release.
4:55:06 But this is why we need truly open models, which is, we don’t know DeepSeek R1’s data.
4:55:10 So you’re saying I can’t make a cheap copy of Llama and pretend it’s mine, but I can
4:55:12 do this with the Chinese model.
4:55:13 Hell yeah.
4:55:16 That’s what I was saying.
4:55:21 And that’s why it’s like we want this whole open language models thing, the Olmo thing
4:55:25 is to try to keep the model where everything is open with the data as close to the frontier
4:55:26 as possible.
4:55:27 So we’re compute constrained.
4:55:29 We’re personnel constrained.
4:55:34 We rely on getting insights from people, like John Schulman telling us to do RL on outputs.
4:55:39 We can make these big jumps, but it just takes a long time to push the frontier of open source.
4:55:44 And fundamentally, I would say that that’s because open source AI does not have the same
4:55:46 feedback loops as open source software.
4:55:49 We talked about open source software for security.
4:55:52 Also it’s just because you build something once and you can reuse it.
4:55:55 If you go into a new company, there’s so many benefits.
4:55:58 But if you open source a language model, you have this data sitting around, you have this
4:55:59 training code.
4:56:04 It’s not that easy for someone to come and build on and improve because you need to spend
4:56:05 a lot on compute.
4:56:06 You need to have expertise.
4:56:12 So until there are feedback loops of open source AI, it seems mostly an ideological mission.
4:56:15 People like Mark Zuckerberg, who’s like, America needs this.
4:56:21 And I agree with him, but in the time where the motivation ideologically is high, we need
4:56:26 to capitalize and build this ecosystem around what benefits do you get from seeing the language
4:56:27 model data.
4:56:29 And there’s not a lot about that.
4:56:33 We’re going to try to launch a demo soon where you can look at an Olmo model and a
4:56:39 query and see what pre-training data is similar to it, which is like legally risky and complicated.
4:56:43 But it’s like, what does it mean to see the data that the AI was trained on?
4:56:44 It’s hard to parse.
4:56:45 It’s terabytes of files.
4:56:48 It’s like, I don’t know what I’m going to find in there.
4:56:54 But that’s what we need to do as an ecosystem if people want open source AI to be financially
4:56:55 useful.
4:56:56 We didn’t really talk about Stargate.
4:57:01 I would love to get your opinion on like what the new administration, the Trump administration,
4:57:08 everything that’s being done from the America side and supporting AI infrastructure and
4:57:10 the efforts of the different AI companies.
4:57:11 What do you think about Stargate?
4:57:17 What are we supposed to think about Stargate and does Sam have the money?
4:57:18 Yeah.
4:57:21 So I think Stargate is an opaque thing.
4:57:23 It definitely doesn’t have $500 billion.
4:57:25 It doesn’t even have $100 billion, right?
4:57:30 So what they announced is this $500 billion number, Larry Ellison, Sam Altman and Trump
4:57:31 said it.
4:57:38 They thanked Trump and Trump did do some executive actions that do significantly improve the
4:57:42 ability for this to be built faster.
4:57:45 One of the executive actions he did is on federal land, you can just basically build
4:57:49 data centers in power, pretty much like that.
4:57:52 And then the permitting process is basically gone or you file after the fact.
4:57:56 So like one of the, again, like I had a Schizo take earlier, another Schizo take, if you’ve
4:58:00 ever been to the Presidio in San Francisco, beautiful area.
4:58:03 You could build a power plant and a data center there if you wanted to because it is federal
4:58:04 land.
4:58:05 It used to be a military base.
4:58:11 But you know, obviously this would like piss people off, you know, it’s a good bit.
4:58:14 Anyways, Trump has made it much easier to do this, right?
4:58:18 Generally, Texas has the only unregulated grid in the nation as well.
4:58:19 Let’s go Texas.
4:58:24 And so, you know, therefore like ERCOT enables people to build faster as well.
4:58:27 In addition, the federal regulations are coming down.
4:58:31 And so Stargate is predicated on this, and this is why that whole show happened.
4:58:35 Now, how they came up with a $500 billion number is beyond me.
4:58:39 How they came up with a $100 billion number makes sense to some extent, right?
4:58:44 And there’s actually a good table in here that I would like to show in that Stargate
4:58:49 piece that I had.
4:58:50 It’s the most recent one.
4:58:51 Yeah.
4:58:58 So anyways, Stargate, you know, it’s basically right, like there is, it’s a table about cost.
4:59:01 There, you passed it already.
4:59:03 It’s that one.
4:59:06 So this table is kind of explaining what happens, right?
4:59:10 So Stargate is in Abilene, Texas, the first $100 billion of it.
4:59:17 That site is 2.2 gigawatts of power in, about 1.8 gigawatts of power consumed, right?
4:59:24 Per GPU, they have like roughly, Oracle is already building the first part of this before
4:59:25 Stargate came about.
4:59:27 To be clear, they’ve been building it for a year.
4:59:29 They tried to rent it to Elon, in fact, right?
4:59:31 But Elon was like, “It’s too slow.
4:59:32 I need it faster.”
4:59:34 So then he went and did his Memphis thing.
4:59:38 And so OpenAI was able to get it with this like weird joint venture called Stargate.
4:59:42 They initially signed a deal with just Oracle for the first section of this cluster, right?
4:59:50 This first section of this cluster, right, is roughly $5 billion to $6 billion of server
4:59:51 spend, right?
4:59:54 And then there’s another billion or so of data center spend.
4:59:59 But then likewise, like if you fill out that entire 1.8 gigawatts with the next two generations
5:00:05 of NVIDIA’s chips, GB200, GB300, VR200, and you fill it out completely, that ends up being
5:00:10 roughly $50 billion of server cost, right?
5:00:15 Plus there’s data center cost, plus maintenance cost, plus operation cost, plus all these
5:00:16 things.
5:00:19 And that’s where OpenAI gets to their $100 billion announcement that they had, right?
5:00:22 Because they talked about $100 billion is phase one.
5:00:24 That’s this Abilene, Texas data center, right?
5:00:27 $100 billion of total cost of ownership, quote, unquote, right?
5:00:28 So it’s not CapEx.
5:00:29 It’s not investment.
5:00:32 It’s $100 billion of total cost of ownership.
5:00:35 And then there will be future phases.
5:00:39 They’re looking at other sites that are even bigger than this 2.2 gigawatts, by the way,
5:00:40 in Texas and elsewhere.
5:00:43 And so they’re not completely ignoring that.
5:00:49 But there is the number of $100 billion that they save for phase one, which I do think will
5:00:50 happen.
5:00:51 They don’t even have the money for that.
5:00:54 Furthermore, it’s not $100 billion, it’s $50 billion of spend, right?
5:01:01 And then like $50 billion of operational cost, power, et cetera, rental pricing, et cetera.
5:01:06 Because they’re renting it, OpenAI is renting the GPUs from the Stargate joint venture, right?
5:01:08 What money do they actually have, right?
5:01:11 SoftBank is going to invest, Oracle is going to invest, OpenAI is going to invest.
5:01:13 OpenAI is on the line for $19 billion.
5:01:17 Everyone knows that they’ve only got $6 billion in their last round and $4 billion in debt.
5:01:23 But there is news of like SoftBank maybe investing $25 billion into OpenAI, right?
5:01:25 So that’s part of it, right?
5:01:26 So $19 billion can come from there.
5:01:28 So OpenAI does not have the money at all, right?
5:01:29 To be clear.
5:01:34 Ink is not dried on anything; OpenAI has $0 for this $50 billion, right?
5:01:38 In which they’re legally obligated to put $19 billion of CAPEX into the joint venture
5:01:41 and then the rest they’re going to pay via renting the GPUs from the joint venture.
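As a back-of-the-envelope using only the figures mentioned here (rounded and illustrative, not an actual cost model): the $100 billion phase-one number is a total-cost-of-ownership framing, and OpenAI’s own committed slice of it is much smaller.

```python
# Rough phase-one framing for the Abilene site, using figures from the conversation.
servers_bn = 50        # ~$50B of servers to fill out ~1.8 GW with GB200/GB300/VR200-class chips
operations_bn = 50     # ~$50B of data center build, power, maintenance, rental pricing, etc.
tco_bn = servers_bn + operations_bn
print(f"Phase-one 'total cost of ownership': ~${tco_bn}B")  # the headline $100B, not committed CapEx

openai_capex_bn = 19   # OpenAI's stated obligation to the joint venture
print(f"OpenAI's own CapEx commitment: ~${openai_capex_bn}B; the rest is paid over time as GPU rent")
```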
5:01:44 And then there’s Oracle.
5:01:48 Oracle has a lot of money, they’re building the first section completely, they were spending
5:01:49 for themselves, right?
5:01:55 This $6 billion of CAPEX, $10 billion of TCO, and they were going to do that first section.
5:01:57 They’re paying for that, right?
5:02:00 As far as the rest of the section, I don’t know how much Larry wants to spend, right?
5:02:01 At any point he could pull out, right?
5:02:03 Like this is again, this is like completely voluntary.
5:02:06 So at any point, there’s no signed ink on this, right?
5:02:09 But he potentially could contribute tens of billions of dollars, right, to be clear.
5:02:11 He’s got the money, Oracle’s got the money.
5:02:17 And then there’s like MGX, which is the UAE fund, which technically has $1.5 trillion
5:02:18 for investing in AI.
5:02:21 But again, like, I don’t know how real that money is.
5:02:26 And like, whereas there is no ink signed for this, SoftBank does not have $25 billion
5:02:27 of cash.
5:02:32 They have to sell down their stake in ARM, which is the leader in CPUs and they IPO’ed
5:02:33 it.
5:02:34 This is obviously what they’ve always wanted to do.
5:02:36 They just didn’t know where they’d redeploy the capital.
5:02:38 Selling down the stake in ARM makes a ton of sense.
5:02:42 So they can sell that down and invest in this if they want to and invest in Open AI if they
5:02:43 want to.
5:02:50 As far as, like, money secured, the first 100,000 GB200 cluster can be funded.
5:02:53 Everything else after that is up in the air.
5:02:54 Money’s coming.
5:02:55 I believe the money will come.
5:02:57 I personally do.
5:02:58 It’s a belief.
5:03:02 It’s a belief that they are going to release better models and be able to raise money.
5:03:06 But like the actual reality is that Elon’s right, the money does not exist.
5:03:09 What does the US government have to do with anything?
5:03:10 What does Trump have to do with everything?
5:03:12 He’s just a hype man.
5:03:16 Trump is, he’s reducing the regulation so they can build it faster.
5:03:18 And he’s allowing them to do it, right?
5:03:21 Because any investment of this size is going to involve like antitrust stuff.
5:03:23 So obviously he’s going to allow them to do it.
5:03:27 He’s going to enable the regulations to actually allow it to be built.
5:03:31 I don’t believe there’s any US government dollars being spent on this though.
5:03:32 Yeah.
5:03:37 So I think he’s also just creating a general vibe that this regulation will go down and
5:03:40 this is the era of building.
5:03:42 So if you’re a builder, you want to create stuff.
5:03:43 You want to launch stuff.
5:03:44 This is the time to do it.
5:03:48 And so like we’ve had this 1.8 gigawatt data center in our data for over a year now and
5:03:51 we’ve been like sort of sending it to all of our clients, including many of these companies
5:03:53 that are building the multi gigawatts.
5:03:57 But that is at a level that’s not quite, maybe, executives seeing $500 billion,
5:04:02 $100 billion, and then everyone asking them about it, so it could spur, like, an
5:04:04 even faster arms race, right?
5:04:08 Because there’s already an arms race, but this $100 billion, $500 billion number,
5:04:13 Trump talking about it on TV, it could spur the arms race to be even faster and more
5:04:15 investors to flood in, et cetera, et cetera.
5:04:20 So I think you’re right, in the sense that OpenAI, or sort of Trump,
5:04:23 is sort of championing that people are going to build more, and his actions are going to
5:04:25 let people build more.
5:04:33 What are you excited about in these upcoming several years, in terms of cluster
5:04:40 buildouts, in terms of breakthroughs in AI? Like the best possible future you can imagine
5:04:44 in the next couple of years, two, three, four years, what does that look like? It could
5:04:51 be very specific technical things, like breakthroughs on post-training, or it could be just
5:04:52 size, big.
5:04:53 Yeah.
5:04:55 I mean it’s impressive clusters.
5:05:00 I really, I really enjoy tracking the supply chain and, like, who’s involved in what, I really
5:05:01 do.
5:05:04 It’s really fun to see the numbers, the cost, who’s building what capacity, helping
5:05:07 them figure out how much capacity they should build, winning deals, strategic stuff.
5:05:08 That’s really cool.
5:05:14 I think technologically there’s a lot around the networking side that really excites me
5:05:18 with optics and electronics kind of getting closer and closer, whether it be co-packaged
5:05:22 optics or some new forms of switching.
5:05:25 This is internal to a cluster.
5:05:26 Yeah.
5:05:30 Also multi-data center training; people are putting so much fiber between these
5:05:35 data centers and lighting it up with so much bandwidth that there’s a lot of interesting
5:05:40 stuff happening on that end, telecom has been really boring since 5G and now it’s like really
5:05:42 exciting again on the other side.
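For a rough sense of why that inter-site bandwidth matters for multi-data-center training (my own back-of-the-envelope with assumed numbers, not a description of any real deployment): even one full gradient exchange for a 405B-parameter model is on the order of a terabyte.

```python
# Back-of-the-envelope for cross-datacenter gradient sync (assumed, illustrative numbers;
# real systems shard, compress, and overlap this communication with compute).
params = 405e9            # 405B-parameter model
bytes_per_value = 2       # bf16 gradients (assumption)
sync_bytes = params * bytes_per_value          # ~810 GB per full exchange
link_tbps = 10            # assumed aggregate fiber bandwidth between sites, in terabits/s

seconds_per_sync = sync_bytes * 8 / (link_tbps * 1e12)
print(f"~{sync_bytes / 1e9:.0f} GB per exchange, ~{seconds_per_sync:.2f} s on a {link_tbps} Tbps link")
```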
5:05:44 Can you educate me a little bit about the speed of things?
5:05:49 So the speed of memory versus the speed of interconnect versus the speed of fiber between
5:05:50 data centers.
5:05:53 Are these like orders of magnitude different?
5:05:57 Can we at some point converge towards a place where it all just feels like one computer?
5:05:58 No.
5:06:01 I don’t think that’s possible.
5:06:02 It’s only going to get harder to program.
5:06:03 Not easier.
5:06:04 Okay.
5:06:07 It’s only going to get more difficult and complicated and more layers, right?
5:06:11 The general image that people like to have is like this hierarchy of memory.
5:06:14 So on chip is really close, localized within the chip, right?
5:06:15 You have registers, right?
5:06:19 Those are shared between some compute elements and then you’ll have caches, which are shared
5:06:20 between more compute elements.
5:06:21 Then you have like memory, right?
5:06:24 Like HBM or DRAM, like DDR memory or whatever it is.
5:06:27 And that’s shared between the whole chip.
5:06:31 And then you can have, you know, pools of memory that are shared between many chips, right?
5:06:33 And then storage and you keep zoning out, right?
5:06:38 The access latency across data centers, across within the data center, within a chip is different.
5:06:43 So like you’re obviously always, you’re always going to have different programming paradigms
5:06:44 for this.
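For rough orders of magnitude on that hierarchy (typical textbook-style figures, assumed for illustration rather than measurements of any specific system), each level is a large jump in access latency, which is why it never looks like one flat computer to the programmer:

```python
# Typical order-of-magnitude access latencies (illustrative, textbook-style assumptions).
latencies_ns = {
    "register / L1 cache":              1,
    "last-level cache":                 30,
    "HBM / DRAM on the same chip":     100,
    "another GPU over NVLink":       5_000,
    "another node over the network": 20_000,
    "another data center over fiber": 10_000_000,   # ~10 ms round trip
}
for level, ns in latencies_ns.items():
    print(f"{level:>32}: ~{ns:,} ns")
```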
5:06:45 It’s not going to be easy.
5:06:46 Programming this stuff is going to be hard.
5:06:48 Maybe I can help, right?
5:06:49 You know, with programming this.
5:07:00 But the way to think about it is that like there is, there’s sort of like the more elements
5:07:04 you add to a task, you don’t gain, you don’t get strong scaling, right?
5:07:07 If I double the number of chips, I don’t get 2x the performance, right?
5:07:11 This is just like a reality of computing because there’s inefficiencies.
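A standard way to see the “double the chips, less than double the performance” point (Amdahl’s-law-style arithmetic, not a claim about any specific cluster): if some fraction of each step doesn’t scale with more chips, the speedup saturates.

```python
# Amdahl's-law-style illustration of why 2x the chips is less than 2x the performance.
def speedup(n_chips: int, serial_fraction: float) -> float:
    # serial_fraction: share of each step that doesn't shrink as you add chips
    # (communication, stragglers, launch overheads, etc.), assumed for illustration.
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_chips)

for n in (1, 2, 4, 8, 16):
    print(f"{n:>2} chips -> {speedup(n, serial_fraction=0.05):.2f}x")
# 2 chips -> ~1.90x, 16 chips -> ~9.14x: scaling efficiency drops as you add elements.
```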
5:07:15 And there’s a lot of interesting work being done to make it not, you know, to make it
5:07:19 more linear, whether it’s making the chips more networked together more tightly or,
5:07:23 you know, cool programming models or cool algorithmic things that you can do on the
5:07:25 model side, right?
5:07:27 DeepSeek did some of these really cool innovations because they were limited on interconnect,
5:07:29 but they still needed to parallelize, right?
5:07:31 Like all sorts of, you know, all, everyone’s always doing stuff.
5:07:35 Google’s got a bunch of work and everyone’s got a bunch of work about this.
5:07:39 That stuff is super exciting on the model and workload and innovation side, right?
5:07:42 Hardware, solid state transformers are interesting, right?
5:07:46 For the power side, there’s all sorts of stuff on batteries and there’s all sorts of stuff
5:07:49 on, you know, I think, I think when you look at, if you look at every layer of the compute
5:07:50 stack, right?
5:07:54 Whether it goes from lithography and etch all the way to like fabrication to like optics
5:07:59 to networking to power to transformers to cooling to, you know, networking again, and you
5:08:03 just go up and up and up the stack, you know, even air conditioners for data centers
5:08:04 are like innovating, right?
5:08:07 Like it’s like, there’s like copper cables are innovating, right?
5:08:10 Like you wouldn’t think it, but copper cables, like there’s some innovations happening there
5:08:14 with like the density of how you can pack them and like, it’s like all of these layers
5:08:18 of the stack all the way up to the models, human progress is at a pace that’s never been
5:08:19 seen before.
5:08:22 I’m just imagining you sitting back in a lair somewhere with screens everywhere, just monitoring
5:08:27 the supply chain, where all these clusters are, like all the information you’re gathering,
5:08:28 I mean, you do incredible work.
5:08:29 There’s a big team.
5:08:30 There’s a big team.
5:08:39 I mean, you do quite incredible work with SemiAnalysis, I mean, just keeping your finger
5:08:43 on the pulse of human civilization in the digital world.
5:08:44 It’s pretty cool.
5:08:45 Like just to watch, feel that.
5:08:46 Yeah.
5:08:47 Thank you.
5:08:48 I guess.
5:08:51 Feel all of us like doing shit.
5:08:52 Epic shit.
5:08:53 Feel the AGI.
5:08:59 I mean, from meme to, like, reality. What about you, Nathan, are there, like, breakthroughs that you’re
5:09:01 looking forward to, potentially?
5:09:04 I had a while to think about this while listening to Dylan’s beautiful response.
5:09:06 He didn’t listen to me.
5:09:11 I knew, no, I knew this was coming and it’s like, realistically, training models is very
5:09:13 fun because there’s so much low hanging fruit.
5:09:19 And the thing that makes my job entertaining, I train models, I write analysis about what’s
5:09:24 happening with models and it’s fun because there is obviously so much more progress to
5:09:25 be had.
5:09:29 And the real motivation why I do this, like somewhere where I can share things is that
5:09:33 there’s just, I don’t trust people that are like, trust me bro, we’re going to make AI
5:09:34 good.
5:09:36 It’s like, we’re the ones that it’s like, we’re going to do it and you can trust us
5:09:41 and we’re just going to have all the AI and it’s just like, I would like a future where
5:09:45 more people have a say in what AI is and can understand it.
5:09:49 And that’s a little bit less fun, in that it’s not a, like, purely positive thing of, this is
5:09:50 just all really fun.
5:09:55 Like, training models is fun and bringing people in is fun, but it’s really like, AI, if it
5:09:59 is going to be the most powerful technology of my lifetime, it’s like, we need to have
5:10:06 a lot of people involved in making that, and making it open helps with that, as accessible
5:10:08 as possible, as open as possible.
5:10:09 Yeah.
5:10:14 My read of the last few years is that more openness would help the AI ecosystem in terms
5:10:18 of having more people understand what’s going on, whether that’s researchers from non-AI fields
5:10:20 to governments to everything.
5:10:22 It doesn’t mean that openness will always be the answer.
5:10:27 I think then I will reassess, like, what is the biggest problem facing AI and tack on
5:10:30 a different angle to the wild ride that we’re on.
5:10:37 And for me, just from even the user experience, anytime you have the, like Karpathy said, the
5:10:46 aha moments, like the magic, like seeing the reasoning, the chain of thought, it’s like,
5:10:49 there’s something really just fundamentally beautiful about that.
5:10:53 It’s putting a mirror to ourselves and seeing like, oh shit, it is solving intelligence
5:11:00 as the cliché goal of these companies is, and you get to understand why we humans
5:11:03 are special, the intelligence within us is special.
5:11:08 And for now, also why we’re special in terms of, we seem to be conscious and the AI systems
5:11:14 for now aren’t, and we get to explore that mystery.
5:11:20 So it’s just really cool to get to explore these questions that, I don’t think,
5:11:25 I would have ever imagined would even be possible.
5:11:32 Back when, just watching Deep Blue with excitement, because I wouldn’t have ever thought
5:11:35 this kind of AI would be possible in my lifetime.
5:11:38 It’s like, this is really feels like AI.
5:11:39 It’s incredible.
5:11:44 I started in AI with a quadrotor learning to fly. It’s like, learn to fly, and it
5:11:47 was just like, it learned to fly up, it would hit the ceiling, and we’d stop and catch it.
5:11:51 It’s like, okay, that is like really stupid compared to what’s going on now.
5:11:56 And now you could probably, with natural language, tell it to learn to fly, and it’s going to
5:11:59 generate the control algorithm, the requirement to do that.
5:12:03 There’s low level blockers, like we had to do some weird stuff for that, but you can,
5:12:04 you definitely can.
5:12:07 Back to our robotics conversation, yeah, when you have to interact with the actual physical
5:12:12 world, it’s hard. What gives you hope about the future of human civilization?
5:12:18 Looking into the next 10 years, 100 years, 1,000 years, how long do you think we’ll make
5:12:19 it?
5:12:22 Do you think we’ve got 1,000 years?
5:12:27 Humans will definitely be around in 1,000 years. I think there’s ways that very bad
5:12:31 things could happen and there will be way fewer humans, but humans are very good at surviving.
5:12:35 There’s been a lot of things for which that’s been true.
5:12:39 I don’t think we’re necessarily good at long-term credit assignment of risk,
5:12:44 but when the risk becomes immediate, we tend to figure things out.
5:12:51 For that reason, I’m like, there’s physical constraints to things like AGI, hyper-recursive-
5:12:56 improvement-kills-us-all type stuff, physical reasons, and with how humans have figured things
5:13:00 out before, I’m not too worried about AI takeover.
5:13:05 There are other international things that are worrying, but there’s just fundamental human
5:13:08 goodness and trying to amplify that.
5:13:16 We’re on a tenuous time, and if you look at humanity as a whole, there’s been times where
5:13:20 things go backwards, there’s times when things don’t happen at all, and we’re on what should
5:13:23 be very positive trajectory right now.
5:13:29 Yeah, there seems to be progress, but just with power, there’s spikes of human suffering.
5:13:33 We want to try to minimize the amount of spikes.
5:13:36 Generally humanity is going to suffer a lot less.
5:13:37 I’m very optimistic about that.
5:13:44 I do worry of techno-fascism type stuff arising as AI becomes more and more prevalent and
5:13:48 powerful, and those who control it can do more and more.
5:13:53 Maybe it doesn’t kill us all, but at some point, every very powerful human is going to
5:13:58 want a brain-computer interface so that they can interact with AGI and all of its advantages
5:14:05 in many more ways and merge their mind with it, and that person’s capabilities can leverage those much
5:14:11 better than anyone else, and therefore it won’t be one person to rule them all, but the thing
5:14:16 I worry about is it’ll be a few people, hundreds, thousands, tens of thousands, maybe millions
5:14:22 of people ruling whoever’s left and the economy around it.
5:14:28 The thing that’s probably more worrisome is human-machine amalgamations.
5:14:32 This enables an individual human to have more impact on the world, and that impact can be
5:14:35 both positive and negative.
5:14:39 Generally humans have positive impacts on the world, at least societally, but it’s possible
5:14:44 for individual humans to have such negative impacts, and AGI, at least as I think the
5:14:49 labs define it, which is not a runaway sentient thing, but rather just something that can
5:14:54 do a lot of tasks really efficiently, amplifies the capabilities of someone causing extreme
5:14:56 damage.
5:15:01 For the most part, I think it’ll be used for profit-seeking motives,
5:15:04 which will increase the abundance and supply of things, and therefore reduce suffering,
5:15:05 right?
5:15:07 What’s the goal?
5:15:12 Scrolling on a timeline, just scrolling in stasis.
5:15:15 Scrolling holds the status quo of the world.
5:15:16 That is a positive outcome, right?
5:15:23 Like if I have food tubes and I’m plugged in, scrolling, and I’m happy, that’s a positive outcome.
5:15:30 While expanding out into the cosmos, well, this is a fun time to be alive.
5:15:34 And thank you for pushing the forefront of what is possible in humans, and thank you
5:15:35 for talking to me.
5:15:36 This was fun.
5:15:37 Thanks for having us.
5:15:38 Thanks for having us.
5:15:42 Thanks for listening to this conversation with Dylan Patel and Nathan Lambert.
5:15:46 To support this podcast, please check out our sponsors in the description.
5:15:52 And now, let me leave you some words from Richard Feynman.
5:15:57 For a successful technology, reality must take precedence over public relations.
5:16:01 For nature cannot be fooled.
5:16:03 Thank you for listening, and I hope to see you next time.
Dylan Patel is the founder of SemiAnalysis, a research & analysis company specializing in semiconductors, GPUs, CPUs, and AI hardware. Nathan Lambert is a research scientist at the Allen Institute for AI (Ai2) and the author of a blog on AI called Interconnects.
Thank you for listening ❤ Check out our sponsors: https://lexfridman.com/sponsors/ep459-sc
See below for timestamps, and to give feedback, submit questions, contact Lex, etc.
CONTACT LEX:
Feedback – give feedback to Lex: https://lexfridman.com/survey
AMA – submit questions, videos or call-in: https://lexfridman.com/ama
Hiring – join our team: https://lexfridman.com/hiring
Other – other ways to get in touch: https://lexfridman.com/contact
EPISODE LINKS:
Dylan’s X: https://x.com/dylan522p
SemiAnalysis: https://semianalysis.com/
Nathan’s X: https://x.com/natolambert
Nathan’s Blog: https://www.interconnects.ai/
Nathan’s Podcast: https://www.interconnects.ai/podcast
Nathan’s Website: https://www.natolambert.com/
Nathan’s YouTube: https://youtube.com/@natolambert
Nathan’s Book: https://rlhfbook.com/
SPONSORS:
To support this podcast, check out our sponsors & get discounts:
Invideo AI: AI video generator.
Go to https://invideo.io/i/lexpod
GitHub: Developer platform and AI code editor.
Go to https://gh.io/copilot
Shopify: Sell stuff online.
Go to https://shopify.com/lex
NetSuite: Business management software.
Go to http://netsuite.com/lex
AG1: All-in-one daily nutrition drinks.
Go to https://drinkag1.com/lex
OUTLINE:
(00:00) – Introduction
(13:28) – DeepSeek-R1 and DeepSeek-V3
(35:02) – Low cost of training
(1:01:19) – DeepSeek compute cluster
(1:08:52) – Export controls on GPUs to China
(1:19:10) – AGI timeline
(1:28:35) – China’s manufacturing capacity
(1:36:30) – Cold war with China
(1:41:00) – TSMC and Taiwan
(2:04:38) – Best GPUs for AI
(2:19:30) – Why DeepSeek is so cheap
(2:32:49) – Espionage
(2:41:52) – Censorship
(2:54:46) – Andrej Karpathy and magic of RL
(3:05:17) – OpenAI o3-mini vs DeepSeek r1
(3:24:25) – NVIDIA
(3:28:53) – GPU smuggling
(3:35:30) – DeepSeek training on OpenAI data
(3:45:59) – AI megaclusters
(4:21:21) – Who wins the race to AGI?
(4:31:34) – AI agents
(4:40:16) – Programming and AI
(4:47:43) – Open source
(4:56:55) – Stargate
(5:04:24) – Future of AI
PODCAST LINKS:
– Podcast Website: https://lexfridman.com/podcast
– Apple Podcasts: https://apple.co/2lwqZIr
– Spotify: https://spoti.fi/2nEwCF8
– RSS: https://lexfridman.com/feed/podcast/
– Podcast Playlist: https://www.youtube.com/playlist?list=PLrAXtmErZgOdP_8GztsuKi9nrraNbKKp4
– Clips Channel: https://www.youtube.com/lexclips