#447 – Cursor Team: Future of Programming with AI

AI transcript
0:00:05 The following is a conversation with the founding members of the Cursor team, Michael Truell,
0:00:10 Sualeh Asif, Arvid Lunnemark, and Aman Sanger.
0:00:16 Cursor is a code editor based on VS Code that adds a lot of powerful features for AI-assisted
0:00:18 coding.
0:00:24 It has captivated the attention and excitement of the programming and AI communities.
0:00:30 So I thought this is an excellent opportunity to dive deep into the role of AI in programming.
0:00:36 This is a super technical conversation that is bigger than just about one code editor.
0:00:42 It’s about the future of programming and in general the future of human-AI collaboration
0:00:48 in designing and engineering complicated and powerful systems.
0:00:50 And now a quick few second mention of each sponsor.
0:00:54 Check them out in the description, it’s the best way to support this podcast.
0:00:59 We got Encord for unifying your machine learning stack, MasterClass for learning, Shopify for
0:01:05 selling stuff online, NetSuite for your business, and AG1 for your health.
0:01:07 Choose wisely, my friends.
0:01:13 Also, if you want to get in touch with me for whatever reason, or take a survey or submit
0:01:17 questions for an AMA, all of that will be great.
0:01:19 Go to lexfridman.com/contact.
0:01:24 And now onto the full ad reads, I try to make them interesting, but if you skip them please
0:01:28 still check out our sponsors, I enjoy their stuff, maybe you will too.
0:01:34 This episode is brought to you by Encord, a platform that provides data focused AI tooling
0:01:39 for data annotation, curation management, and for model evaluation.
0:01:45 One of the things I love about these guys is they have a great blog that describes cleanly,
0:01:49 I mean it’s technical, but it’s not too technical, but it’s sufficiently technical to where it’s
0:01:52 actually describing the ideas not BS.
0:01:59 Blog posts on sort of the state of the art, like the OpenAI o1 model that was just released.
0:02:05 So sometimes they integrate it into why this is a part of Encord, why this makes sense,
0:02:06 and sometimes not.
0:02:09 And so I love that, I recommend their blog just in general.
0:02:13 That said, when they are looking at state of the art models, they are always looking for
0:02:15 ways to integrate it into their platform.
0:02:19 Basically, it’s a place to organize your data, and data is everything.
0:02:26 This was true before the popularity and the explosion of attention methods of transformers,
0:02:29 and it is still very much true now.
0:02:34 Sort of the non-synthetic, the human generated data is extremely important.
0:02:37 How you generate that data, how you organize that data, how you leverage it, how you train
0:02:43 on it, how you fine tune on it, the pre-training, the post-training, all of it, the whole thing.
0:02:45 It is extremely, extremely important.
0:02:48 And so Encord takes data very seriously.
0:02:54 Anyway, go try out Encord to create, annotate and manage your AI data at Encord.com/Lex.
0:02:58 That’s Encord.com/Lex.
0:03:04 This episode is also brought to you by MasterClass, where you can watch over 200 classes from the
0:03:07 best people in the world and their respective disciplines.
0:03:09 Carlos Santana on guitar, for example.
0:03:10 I loved that one.
0:03:14 There’s a few guitar ones, Tom Morello, too, great, great, great stuff, but…
0:03:18 Carlos Santana, his instrumental Europa.
0:03:24 I haven’t quite tried to play that, but it’s on my to-do list, it’s sort of one of those
0:03:30 things you know for sure this is the thing I will play, because it’s too beautiful, it’s
0:03:32 too soulful.
0:03:36 It feels like once you play, you understand something about the guitar that you didn’t
0:03:37 before.
0:03:38 It’s not blues.
0:03:39 It’s not.
0:03:41 It’s not what it is.
0:03:51 It’s some kind of dream-like teleportation into a psychedelic world where the tone is
0:03:56 warmer than anything else I’ve ever heard, and still the guitar can cry.
0:03:57 I don’t know.
0:03:58 I love it.
0:03:59 He’s a genius.
0:04:08 So it’s such a gift that you can get a genius like that to teach us about his secrets.
0:04:13 Get unlimited access to every masterclass and get an additional 15% off an annual membership
0:04:16 at masterclass.com/lexpod.
0:04:21 That’s masterclass.com/lexpod.
0:04:27 This episode is also brought to you by Shopify, a platform designed for anyone to sell anywhere
0:04:33 with a great looking online store, or a simple looking online store, like the one I put together
0:04:34 at lexfridman.com/store.
0:04:39 I have a few shirts on there in case you’re interested.
0:04:45 And speaking of shirts, I’m reminded of thrift stores, which I very much loved for a long
0:04:46 time.
0:04:54 I still love thrift stores, a nice place to get stuff like kitchen stuff and clothing.
0:04:58 And the kind of clothing you get at thrift stores is actually pretty interesting because
0:05:00 there’s shirts there.
0:05:03 They’re just unlike anything else you would get anywhere else.
0:05:10 So if you’re sort of selective and creative-minded, there’s a lot of interesting fashion that’s
0:05:11 there.
0:05:16 And in terms of t-shirts, there’s just like hilarious t-shirts, t-shirts that are very
0:05:20 far away from the kind of trajectories you have taken in life, or are not, but you just
0:05:24 haven’t thought about it, like a band that you love, but you’ve never would have thought
0:05:26 to wear their t-shirt.
0:05:32 Anyway, a little bit, I think of Shopify as the internet’s thrift store.
0:05:36 Of course, you can do super classy, you can do super fancy, or you can do super thrift.
0:05:39 All of it is possible.
0:05:42 Sign up for a $1 per month trial period at Shopify.com/Lex.
0:05:44 That’s all lowercase.
0:05:49 Go to Shopify.com/Lex to take your business to the next level today.
0:05:56 This episode is also brought to you by Netsuite, an all-in-one cloud business management system.
0:06:02 Sometimes I think that Netsuite is supporting this podcast because they’re trolling me.
0:06:06 They’re saying, “Hey, Lex, aren’t you doing a little too much talking?
0:06:08 Maybe you should be building more.”
0:06:10 I agree with you, Netsuite.
0:06:12 I agree with you.
0:06:18 And so every time I do an ad read for Netsuite, it is a chance for me to confront my Jungian
0:06:19 shadow.
0:06:26 Some of the demons emerge from the subconscious and ask questions that I don’t have answers
0:06:31 to, questions about one’s mortality and that life is short, and that one of the most fulfilling
0:06:35 things in life is to have a family and kids and all of these things I would very much
0:06:37 like to have.
0:06:43 And also the reality that I love programming and I love building, I love creating cool
0:06:49 things that people can use and share and that would make their life better, all of that.
0:06:53 Of course, I also love listening to podcasts, and I kind of think of this podcast as me
0:06:59 listening to a podcast where I can also maybe participate by asking questions.
0:07:04 So all of these things that you love, but you ask the hard question of like, “Okay,
0:07:08 while life is slipping away, it’s short, it really, really is short.
0:07:13 What do you want to do with the rest of the minutes and the hours that make up your life?”
0:07:14 Yeah.
0:07:16 So thank you for the existential crisis, Netsuite.
0:07:18 I appreciate it.
0:07:22 If you’re running a business, if you have taken the leap into the unknown and started
0:07:27 a company, then you should be using the right tools to manage that company.
0:07:30 In fact, over 37,000 companies have upgraded to Netsuite.
0:07:38 Take advantage of Netsuite’s flexible financing plan at Netsuite.com/Lex, that’s Netsuite.com/Lex.
0:07:43 This episode is also brought to you by The Delicious, The Delicious AG1.
0:07:47 It’s an all-in-one daily drink to support better health and peak performance.
0:07:53 It’s basically a super awesome multivitamin that makes me feel like I have my life together.
0:07:57 Even when everything else feels like it’s falling apart, at least I have AG1.
0:08:00 At least I have that nutritional foundation to my life.
0:08:06 So the fasting I’m doing, all the carnivore diets, all the physical endurance events and
0:08:12 the mental madness of staying up all night or just the stress of certain things I’m
0:08:19 going through, all of that, AG1 is there, at least I have the vitamins.
0:08:25 Also, I sometimes wonder, they used to be called Athletic Greens and now they’re called AG1.
0:08:27 I always wonder, is AG2 coming?
0:08:28 Like, why is it just one?
0:08:32 It’s an interesting branding decision, like AG1.
0:08:38 Me as an OCD kind of programmer type, it’s like, okay, is this a versioning thing?
0:08:42 Okay, is this like AG0.1 alpha?
0:08:45 When’s the final release?
0:08:52 Anyway, the thing I like to say and to consume is AG1, they’ll give you one month supply
0:08:58 of fish oil when you sign up at drinkag1.com/lex.
0:09:02 This is the Lex Fridman Podcast, to support it, please check out our sponsors in the
0:09:03 description.
0:09:08 And now, dear friends, here’s Michael, Sualeh, Arvid, and Aman.
0:09:31 All right, this is awesome, we have Michael, Aman, Sualeh, Arvid here from the Cursor team.
0:09:36 First up, big ridiculous question, what’s the point of a code editor?
0:09:42 So the code editor is largely the place where you build software, and today, or for a long
0:09:47 time, that’s meant the place where you text edit a formal programming language.
0:09:50 And for people who aren’t programmers, the way to think of a code editor is like a really
0:09:56 souped up word processor for programmers, where the reason it’s souped up is code has
0:09:57 a lot of structure.
0:10:03 And so the quote unquote word processor, the code editor, can actually do a lot for you
0:10:08 that word processors in the writing space haven’t been able to do for people editing text there.
0:10:13 And so that’s everything from giving you visual differentiation of the actual tokens in the
0:10:17 code, so you can scan it quickly, to letting you navigate around the code base, like you’re
0:10:21 navigating around the internet with hyperlinks, going to definitions of things you’re
0:10:28 using, to error checking to catch rudimentary bugs.
0:10:33 And so traditionally, that’s what a code editor has meant.
0:10:38 And I think that what a code editor is, is going to change a lot over the next 10 years.
0:10:42 As what it means to build software, maybe starts to look a bit different.
0:10:45 I think also a code editor should just be fun.
0:10:46 Yes.
0:10:47 That is very important.
0:10:48 That is very important.
0:10:54 And it’s actually sort of an underrated aspect of how we decide what to build, like a lot
0:11:00 of the things that we build, and then we try them out, we do an experiment, and then we
0:11:03 actually throw them out because they’re not fun.
0:11:08 And so a big part of being fun is being fast, a lot of the time.
0:11:09 Fast is fun.
0:11:10 Yeah.
0:11:14 That should be a T-shirt.
0:11:18 But fundamentally, I think one of the things that draws a lot of people to building stuff
0:11:24 on computers is this insane iteration speed, where in other disciplines, you might be sort
0:11:29 of gatekept by resources or the ability, even the ability to get a large group together
0:11:34 and coding is this amazing thing where it’s you and the computer and that alone, you can
0:11:36 build really cool stuff really quickly.
0:11:42 So for people to know, Cursor is this super cool new editor that’s a fork of VS Code.
0:11:49 It would be interesting to get your kind of explanation of your own journey of editors.
0:11:54 How did you, I think all of you were big fans of VS Code with Copilot.
0:12:00 How did you arrive at VS Code and how did that lead to your journey with Cursor?
0:12:01 Yeah.
0:12:05 So I think a lot of us, all of us were originally Vim users.
0:12:06 Pure Vim.
0:12:07 Pure Vim.
0:12:08 Yeah.
0:12:11 No Neovim, just pure Vim and a terminal.
0:12:20 And at least for myself, it was around the time that Copilot came out, so 2021, that
0:12:21 I really wanted to try it.
0:12:26 So I went into VS Code, the only platform, the only code editor in which it was available.
0:12:33 And even though I really enjoyed using Vim, just the experience of Copilot with VS Code
0:12:37 was more than good enough to convince me to switch.
0:12:41 And so that kind of was the default until we started working on Cursor.
0:12:43 And maybe we should explain what Copilot does.
0:12:46 It’s like a really nice autocomplete.
0:12:50 It suggests as you start writing a thing, it suggests one or two or three lines how to
0:12:52 complete the thing.
0:12:57 And there’s a fun experience in that, you know, like when you have a close friendship
0:13:02 and your friend completes your sentences, like when it’s done well, there’s an intimate
0:13:03 feeling.
0:13:08 There’s probably a better word than intimate, but there’s a cool feeling of like, holy shit,
0:13:10 it gets me.
0:13:14 And then there’s an unpleasant feeling when it doesn’t get you.
0:13:19 And so there’s that kind of friction, but I would say for a lot of people, the feeling
0:13:21 that it gets me overpowers that it doesn’t.
0:13:25 And I think actually one of the underrated aspects of GitHub Copilot is that even when
0:13:29 it’s wrong, it’s like a little bit annoying, but it’s not that bad because you just type
0:13:33 another character and then maybe then it gets you, or you type another character and then
0:13:34 it gets you.
0:13:35 So even when it’s wrong, it’s not that bad.
0:13:38 Yeah, you can sort of iterate and fix it.
0:13:43 I mean, the other underrated part of Copilot for me is that it was sort of just the first real, real
0:13:44 AI product.
0:13:47 It’s like the first language model consumer product.
0:13:52 So Copilot was kind of like the first killer app for LLMs.
0:13:53 Yeah.
0:13:55 And like the beta was out in 2021.
0:13:56 Right.
0:13:57 Okay.
0:14:00 So what’s the origin story of Cursor?
0:14:05 So around 2020, the scaling laws papers came out from OpenAI.
0:14:10 And that was a moment where this looked like clear predictable progress for the field where
0:14:13 even if we didn’t have any more ideas, looks like you can make these models a lot better
0:14:16 if you had more compute and more data.
0:14:21 By the way, we’ll probably talk for three to four hours on the topic of scaling laws.
0:14:22 Yes.
0:14:27 But just to summarize, it’s a paper and a set of papers and set of ideas that say bigger
0:14:32 might be better for model size and data size in the realm of machine learning.
0:14:34 It’s bigger and better, but predictably better.
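For readers who want the shape of the claim: the scaling-laws work referred to here fits test loss with power laws in model size, dataset size, and compute. A schematic form for the model-size case is shown below, where N is the parameter count and N_c and \alpha_N are fitted constants. The symbols are illustrative and no fitted values from the papers are reproduced here; the form is usually stated for the regime where data and compute are not the bottleneck.

```latex
L(N) \approx \left( \frac{N_c}{N} \right)^{\alpha_N}, \qquad \alpha_N > 0
```

Because this is a straight line on a log-log plot, the loss of a larger model can be extrapolated from smaller ones, which is the sense in which the improvement is "predictable".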
0:14:35 Okay.
0:14:36 That’s another topic of conversation.
0:14:37 Yeah.
0:14:40 So around that time, for some of us, there were like a lot of conceptual conversations
0:14:43 about what’s this going to look like?
0:14:46 What’s the story going to be for all these different knowledge worker fields about how
0:14:51 they’re going to be made better by this technology getting better?
0:14:56 And then I think there were a couple of moments where the theoretical gains predicted in that
0:15:00 paper started to feel really concrete and it started to feel like a moment where you
0:15:06 could actually go and not do a PhD if you wanted to work on, do useful work in AI.
0:15:10 I actually felt like now there was this whole set of systems one could build that were really
0:15:11 useful.
0:15:14 And I think that the first moment we already talked about a little bit, which was playing
0:15:15 with the early beta of Copilot.
0:15:18 Like that was awesome and magical.
0:15:22 I think that the next big moment where everything kind of clicked together was actually getting
0:15:23 early access to GPT-4.
0:15:29 So it was sort of end of 2022 was when we were tinkering with that model.
0:15:32 And the step up in capabilities felt enormous.
0:15:37 And previous to that, we had been working on a couple of different projects. Because
0:15:41 of Copilot, because of scaling laws, because of our prior interest in the technology,
0:15:46 we had been tinkering around with tools for programmers, but things that are very specific.
0:15:52 So we were building tools for financial professionals who have to work within a Jupyter notebook
0:15:55 or playing around with, can you do static analysis with these models?
0:16:01 And then the step up in GPT-4 felt like, look, that really made concrete the theoretical
0:16:06 gains that we had predicted before. It felt like you could build a lot more just immediately
0:16:07 at that point in time.
0:16:13 And also, if we were being consistent, it really felt like this wasn’t just going to
0:16:14 be a point solution thing.
0:16:16 This was going to be all the programming that was going to flow through these models.
0:16:20 And it felt like that demanded a different type of programming environment, a different
0:16:22 type of programming.
0:16:26 And so we set off to build that sort of larger vision around that.
0:16:28 There’s one that I distinctly remember.
0:16:32 So my roommate is an IMO Gold winner.
0:16:35 There’s a competition in the US called the Putnam, which is sort of the IMO for college
0:16:36 people.
0:16:40 And it’s this math competition; he’s exceptionally good.
0:16:50 So Shengtong and Aman, I remember, it’s sort of June of 2022, had this bet on whether
0:16:56 by, like, June or July 2024, you were going to win a gold medal in the IMO with, like,
0:16:57 models.
0:16:59 IMO is International Math Olympiad.
0:17:00 Yeah.
0:17:02 IMO is International Math Olympiad.
0:17:05 And so Arvid and I both also competed in it.
0:17:09 So it was sort of personal.
0:17:13 And I remember thinking, man, this is not going to happen.
0:17:21 This was like, even though I sort of believed in progress, I thought, IMO Gold, like Aman
0:17:22 is just delusional.
0:17:23 Yeah.
0:17:27 That was the, and to be honest, I mean, I was, to be clear, very wrong.
0:17:31 But that was maybe the most prescient bet in the group.
0:17:36 So the, the new results from DeepMind, it turned out that you were correct.
0:17:37 That’s what the–
0:17:38 Well, it was technically not.
0:17:41 Technically incorrect, but one point away.
0:17:43 Aman was very enthusiastic about this stuff.
0:17:48 And before, Aman had this like scaling laws t-shirt that he would walk around with, it
0:17:51 had the like charts and like the formulas on it.
0:17:54 So you like felt the AGI, or you felt the scaling laws?
0:17:55 Yeah.
0:18:01 I distinctly remember there was this one conversation I had with Michael where before
0:18:05 I hadn’t thought super deeply and critically about scaling laws.
0:18:09 And he kind of posed the question, why isn’t scaling all you need?
0:18:12 Or why isn’t scaling going to result in massive gains in progress?
0:18:15 And I think I went through like the, like the stages of grief.
0:18:22 There is anger, denial, and then finally at the end, just thinking about it, acceptance.
0:18:29 And I think I’ve been quite hopeful and optimistic about progress since.
0:18:33 I think one thing I’ll caveat is, I think it also depends on like which domains you’re
0:18:34 going to see progress.
0:18:40 Like math is a great domain because especially like formal theorem proving because you get
0:18:44 this fantastic signal of actually verifying if the thing was correct.
0:18:47 And so this means something like RL can work really, really well.
0:18:52 And I think like you could have systems that are perhaps very superhuman at math and still
0:18:53 not technically have AGI.
0:18:54 Okay.
0:18:57 So can we take it all the way to Cursor?
0:18:58 And what is Cursor?
0:19:04 It’s a fork of VS Code, and VS Code has been one of the most popular editors for a long time.
0:19:06 Like everybody found love with it.
0:19:07 Everybody left Vim.
0:19:13 I left Emacs for, sorry.
0:19:19 So unified in some fundamental way, the developer community.
0:19:23 And then you look at the space of things, you look at the scaling laws, AI is becoming
0:19:24 amazing.
0:19:30 And you decided, okay, it’s not enough to just write an extension for VS Code because
0:19:32 there’s a lot of limitations to that.
0:19:37 We need, if AI is going to keep getting better, better, better, we need to really like rethink
0:19:40 how the AI is going to be part of the editing process.
0:19:45 And so you decided to fork VS Code and start to build a lot of the amazing features we’ll
0:19:48 be able to talk about.
0:19:49 But what was that decision like?
0:19:55 Because there’s a lot of extensions for VS Code, including Copilot, that are doing sort of
0:19:56 AI type stuff.
0:19:59 What was the decision like to just fork VS Code?
0:20:05 So the decision to do an editor seemed kind of self-evident to us for at least what we
0:20:07 wanted to do and achieve.
0:20:10 Because when we started working on the editor, the idea was, these models are going to get
0:20:12 much better, their capabilities are going to improve, and it’s going to entirely change
0:20:16 how you build software, both in that you will have big productivity gains, but also radically,
0:20:20 in that the act of building software is going to change a lot.
0:20:25 And so you’re very limited in the control you have over a code editor if you’re a plugin
0:20:28 to an existing coding environment.
0:20:31 And we didn’t want to get locked in by those limitations.
0:20:34 We wanted to be able to just build the most useful stuff.
0:20:35 Okay.
0:20:41 Well, then the natural question is, VS Code with Copilot is kind of a competitor.
0:20:43 So how do you win?
0:20:46 Is it basically just the speed and the quality of the features?
0:20:52 Yeah, I mean, I think this is a space that is quite interesting, perhaps quite unique
0:20:57 where if you look at previous tech waves, maybe there’s kind of one major thing that
0:21:00 happened and unlocked a new wave of companies.
0:21:06 But every single year, every single jump you get in model capabilities,
0:21:13 you now unlock this new wave of features, things that are possible, especially in programming.
0:21:18 And so I think in AI programming, being even just a few months ahead, let alone a year
0:21:22 ahead, makes your product much, much, much more useful.
0:21:28 I think the Cursor a year from now will need to make the Cursor of today look obsolete.
0:21:33 And I think, you know, Microsoft has done a number of like fantastic things, but I don’t
0:21:37 think they’re in a great place to really keep innovating and pushing on this in the way that
0:21:39 a startup can.
0:21:42 Just rapidly implementing features.
0:21:49 And push, yeah, like and kind of doing the research experimentation necessary to really
0:21:50 push the ceiling.
0:21:53 I don’t know if I think of it in terms of features as I think of it in terms of like
0:21:56 capabilities for programmers.
0:22:01 It’s that like, you know, as, you know, the new o1 model came out and I’m sure there
0:22:06 are going to be more models of different types, like longer context and maybe faster.
0:22:14 Like, there’s all these crazy ideas that you can try and hopefully 10% of the crazy ideas
0:22:17 will make it into something kind of cool and useful.
0:22:24 And we want people to have that sooner. To rephrase, an underrated fact is we’re
0:22:25 making it for ourselves.
0:22:30 When we started Cursor, you really felt this frustration that, you know, models, you could
0:22:33 see models getting better.
0:22:35 The Copilot experience had not changed.
0:22:39 It was like, man, these guys like the ceiling is getting higher.
0:22:41 Like, why are they not making new things?
0:22:42 Like they should be making new things.
0:22:45 They should be like, like, where’s all the alpha features?
0:22:47 There were no alpha features.
0:22:51 It was like, I’m sure it was selling well.
0:22:55 I’m sure it was a great business, but it didn’t feel, I’m one of these people that
0:23:00 really want to try and use new things and it was just, there was no new thing for like
0:23:01 a very long while.
0:23:02 Yeah.
0:23:03 It’s interesting.
0:23:08 I don’t know how you put that into words, but when you compare Cursor with Copilot,
0:23:12 Copilot pretty quickly became, started to feel stale for some reason.
0:23:13 Yeah.
0:23:19 I think one thing that I think helps us is that we’re sort of doing it all in one where
0:23:23 we’re developing the UX and the way you interact with the model.
0:23:28 At the same time as we’re developing like how we actually make the model give better
0:23:33 answers, so like how you build up the prompt or like how do you find the context and, for
0:23:36 Cursor Tab, like how do you train the model.
0:23:41 So I think that helps us to have all of it, like sort of like the same people working
0:23:43 on the entire experience end to end.
0:23:44 Yeah.
0:23:47 It’s like the person making the UI and the person training the model sit like
0:23:50 18 feet away.
0:23:52 Often the same person even.
0:23:53 Yeah.
0:23:57 Often even the same person, so you can create things that are sort of not possible if you’re
0:24:00 not talking, you’re not experimenting.
0:24:03 And you’re using, like you said, Cursor to write Cursor.
0:24:04 Of course.
0:24:05 Oh, yeah.
0:24:06 Yeah.
0:24:07 Well, let’s talk about some of these features.
0:24:14 Let’s talk about the all knowing, the all powerful, praise be to the Tab, auto complete
0:24:17 on steroids basically.
0:24:18 So how does tab work?
0:24:19 What is tab?
0:24:24 To highlight and summarize at a high level, I’d say that there are two things that Cursor
0:24:25 is pretty good at right now.
0:24:27 There are other things that it does.
0:24:31 But two things that it helps programmers with.
0:24:36 One is this idea of looking over your shoulder and being like a really fast colleague who
0:24:41 can kind of jump ahead of you and type and figure out what you’re going to do next.
0:24:47 And that was the original idea behind, that was kind of the kernel, the idea behind good
0:24:51 auto complete was predicting what you’re going to do next, but you can make that concept
0:24:56 even more ambitious by not just predicting the characters after your cursor, but actually
0:24:58 predicting the next entire change you’re going to make the next diff, the next place
0:25:01 you’re going to jump to.
0:25:07 And the second thing Cursor is pretty good at right now too, is helping you sometimes
0:25:13 jump ahead of the AI and tell it what to do and go from instructions to code.
0:25:16 And on both of those, we’ve done a lot of work on making the editing experience for
0:25:21 those things ergonomic and also making those things smart and fast.
0:25:24 One of the things we really wanted was we wanted the model to be able to edit code for
0:25:25 us.
0:25:30 That was kind of a wish and we had multiple attempts at it before we had sort of a good
0:25:34 model that could edit code for you.
0:25:39 Then after we had a good model, I think there had been a lot of effort to make the inference
0:25:45 fast for having a good experience.
0:25:50 And we’ve been starting to incorporate, I mean, Michael sort of mentioned this like
0:25:52 ability to jump to different places.
0:25:57 And that jump to different places, I think, came from a feeling of, you know, once you
0:26:04 accept an edit, it’s like, man, it should be just really obvious where to go next.
0:26:08 It’s like, I’d made this change, the model should just know that like the next place to
0:26:11 go to is like 18 lines down.
0:26:18 If you’re a Vim user, you could press 18jj or whatever, but like, why am I doing this?
0:26:20 Like the model should just know it.
0:26:24 And then so the idea was, you know, you just press tab, it would go 18 lines down and then
0:26:28 make it show you the next edit and you would press tab.
0:26:31 So it was just you, as long as you could keep pressing tab.
0:26:35 And so the internal competition was how many tabs can we make someone press.
0:26:41 Once you have like the idea, more sort of abstractly, the thing to think about is sort
0:26:45 of like, how are the edits sort of zero entropy?
0:26:50 So once you’ve sort of expressed your intent and the edit is, there’s no like new bits
0:26:56 of information to finish your thought, but you still have to type some characters to
0:27:00 like make the computer understand what you’re actually thinking, then maybe the model should
0:27:07 just sort of read your mind and all the zero entropy bits should just be like tabbed away.
0:27:08 Yeah.
0:27:09 That was sort of the abstract.
0:27:12 And this is an interesting thing where if you look at language model loss on different
0:27:19 domains, I believe the bits per byte, which is kind of character-normalized loss, for code
0:27:23 is lower than language, which means in general, there are a lot of tokens in code that are
0:27:27 super predictable, a lot of characters that are super predictable.
0:27:32 And this is I think even magnified when you’re not just trying to autocomplete code, but
0:27:37 predicting what the user is going to do next in their editing of existing code.
0:27:41 And so, you know, the goal of Cursor Tab is, let’s eliminate all the low entropy actions you
0:27:43 take inside of the editor.
0:27:47 When the intent is effectively determined, let’s just jump you forward in time, skip
0:27:48 you forward.
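A minimal sketch of the bits-per-byte metric mentioned above, assuming you already have per-token negative log-likelihoods (in nats) from some language model. The function name and inputs are hypothetical and this is not Cursor's code. Normalizing by bytes instead of tokens is what lets losses be compared across tokenizers and across domains such as code and prose.

```python
import math
from typing import Sequence


def bits_per_byte(token_nll_nats: Sequence[float], text: str) -> float:
    """Convert summed per-token loss (nats) into bits per UTF-8 byte of text."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must be non-empty")
    # nats -> bits: divide by ln(2); then normalize by the byte length.
    total_bits = sum(token_nll_nats) / math.log(2)
    return total_bits / n_bytes


# Made-up losses, for illustration only: four tokens covering an 18-byte line.
example = bits_per_byte([0.2, 0.1, 0.3, 0.1], "for i in range(n):")
```

A lower value means each byte carries less surprise for the model, which is the sense in which many characters in code are "super predictable".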
0:27:52 Well, what’s the intuition and what’s the technical details of how to do next cursor
0:27:54 prediction?
0:27:58 That jump, that’s not, that’s not so intuitive, I think to people.
0:27:59 Yeah.
0:28:03 I think I can speak to a few of the details on how to make these things work.
0:28:09 They need to be incredibly low latency, so you need to train small models on this, on this task.
0:28:15 In particular, they’re incredibly prefill-token hungry, what that means is they have
0:28:19 these really, really long prompts where they see a lot of your code, and they’re not actually
0:28:21 generating that many tokens.
0:28:27 And so the perfect fit for that is using a sparse model, meaning an MOE model.
0:28:31 So that was kind of one, one breakthrough that we made that substantially improved performance
0:28:32 at longer context.
0:28:38 The other being a variant of speculative decoding that we kind of built out called speculative
0:28:39 edits.
0:28:46 These are two, I think, important pieces of what make it quite high quality and very fast.
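The conversation does not spell out the speculative-edits algorithm at this point, so the following is only a sketch of the general idea behind greedy speculative decoding, with one assumption labeled loudly: the draft tokens are taken to be the existing code being edited rather than the output of a small draft model. `verify_draft` and `toy_next_token` are hypothetical names, and the "model" is a lookup table standing in for a real network.

```python
from typing import Callable, List, Sequence

Token = str


def verify_draft(
    next_token: Callable[[Sequence[Token]], Token],
    prefix: Sequence[Token],
    draft: Sequence[Token],
) -> List[Token]:
    """One greedy speculative step.

    Accept draft tokens for as long as they equal what the model itself
    would emit, then append the model's own token at the first disagreement
    (or one extra token if the whole draft was accepted). The result matches
    plain greedy decoding. A real system scores all draft positions in a
    single batched forward pass; the per-position loop here is for clarity.
    """
    context = list(prefix)
    accepted: List[Token] = []
    for tok in draft:
        predicted = next_token(context)
        if predicted != tok:
            accepted.append(predicted)  # model overrides the draft here
            return accepted
        accepted.append(tok)
        context.append(tok)
    accepted.append(next_token(context))  # bonus token after a full match
    return accepted


# Toy stand-in for a model: it always continues toward TARGET.
TARGET = ["def", "f", "(", "x", ")", ":"]


def toy_next_token(context: Sequence[Token]) -> Token:
    return TARGET[len(context)] if len(context) < len(TARGET) else "<eos>"


# Draft = the old code, which differs from the desired edit at one token.
step = verify_draft(toy_next_token, ["def"], ["f", "(", "y", ")"])
```

When most of the old code survives an edit, long runs of draft tokens are accepted per step, which is where the speedup would come from under this assumption.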
0:28:47 Okay.
0:28:51 So MoE, mixture of experts, the input is huge, the output is small.
0:28:52 Yeah.
0:28:53 Okay.
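To make "sparse model" concrete, here is a toy sketch of top-k mixture-of-experts routing in plain Python. It illustrates the general technique only and says nothing about how Cursor's model is built; the experts are hypothetical scalar functions and the router logits are supplied by hand instead of being computed by a learned gate.

```python
import math
from typing import Callable, List, Sequence, Tuple


def top_k_route(router_logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = order[:k]
    peak = max(router_logits[i] for i in chosen)  # subtract max for stability
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_forward(
    x: float,
    experts: Sequence[Callable[[float], float]],
    router_logits: Sequence[float],
    k: int = 2,
) -> float:
    """Run only the k routed experts and mix their outputs by gate weight."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


# Four toy experts; only two run for this token, so compute per token stays
# roughly constant even as the total number of experts grows.
toy_experts = [lambda v: v, lambda v: 2 * v, lambda v: 10 * v, lambda v: -v]
toy_logits = [0.0, 1.0, 1.0, -5.0]
routes = top_k_route(toy_logits, k=2)
y = moe_forward(1.0, toy_experts, toy_logits, k=2)
```

The relevance to the prefill-heavy workload described above is that each prompt token touches only a fraction of the parameters, so very long inputs are cheaper to process than in a dense model of the same total size.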
0:28:57 So what else can you say about how to make, does caching play a role in this particular
0:29:02 model? Caching plays a huge role because you’re dealing with this many input tokens.
0:29:08 If, on every single keystroke that you’re typing in a given line, you had to rerun the model
0:29:13 on all of those tokens passed in, you’re just going to, one, significantly degrade latency,
0:29:16 two, you’re going to kill your GPUs with load.
0:29:22 So you need to design the actual prompts used for the model such that they’re caching
0:29:28 aware and then you need to reuse the KV cache across requests, just so that you’re spending
0:29:30 less work, less compute.
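A small sketch of why prompt layout matters for KV-cache reuse, under the simplifying assumption that a cache can be reused only for the longest token prefix shared with the previous request. `PrefixCache` is a hypothetical bookkeeping class that counts tokens; it does not store real key/value tensors.

```python
from typing import List, Sequence, Tuple


def shared_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class PrefixCache:
    """Tracks how many prompt tokens could reuse cached KV entries."""

    def __init__(self) -> None:
        self.tokens: List[str] = []

    def plan(self, new_tokens: Sequence[str]) -> Tuple[int, int]:
        """Return (tokens reused from cache, tokens that must be recomputed)."""
        reused = shared_prefix_len(self.tokens, new_tokens)
        self.tokens = list(new_tokens)
        return reused, len(new_tokens) - reused


file_context = ["import", "os", "def", "main", "(", ")", ":"]

# Cache-aware layout: stable file context first, the line being typed last.
aware = PrefixCache()
aware.plan(file_context + ["pri"])
aware_second = aware.plan(file_context + ["pri", "nt"])

# Cache-hostile layout: the changing text comes first and invalidates the rest.
hostile = PrefixCache()
hostile.plan(["pri"] + file_context)
hostile_second = hostile.plan(["pri", "nt"] + file_context)
```

With the stable context first, the second keystroke recomputes a single token; with the volatile text first, almost the whole prompt has to be recomputed.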
0:29:37 Again, what are the things that Tab is supposed to be able to do kind of in the near term,
0:29:46 just to sort of linger on that? Generate code, fill empty space, also edit code across multiple
0:29:51 lines and then jump to different locations inside the same file and then launch.
0:29:56 And hopefully jump to different files also. So if you make an edit in one file and maybe
0:30:03 you have to go to another file to finish your thought, it should go to the second file also.
0:30:09 The full generalization is like next action prediction. Sometimes you need to run a command
0:30:14 in the terminal and it should be able to suggest the command based on the code that you wrote
0:30:22 too. Or sometimes you actually need to, like it suggests something, but it’s hard for you
0:30:26 to know if it’s correct because you actually need some more information to learn. You need
0:30:29 to know the type to be able to verify that it’s correct.
0:30:33 So maybe it should actually take you to a place that’s like the definition of something
0:30:38 and then take you back so that you have all the requisite knowledge to be able to accept
0:30:39 the next completion.
0:30:47 Also providing the human the knowledge. Yes. Right. Can you integrate like, I just got
0:30:53 to know a guy named ThePrimeagen who I believe has an SSH, you can order coffee via SSH?
0:30:56 Oh yeah. We did that.
0:30:57 We did that.
0:31:04 So can the model also do that? Like feed you and provide you with caffeine. Okay,
0:31:11 so that’s the general framework. Yeah. And the magic moment would be if it is programming
0:31:17 is this weird discipline where sometimes the next five minutes, not always, but sometimes
0:31:19 the next five minutes of what you’re going to do is actually predictable from the stuff
0:31:23 you’ve done recently. And so can you get to a world where that next five minutes either
0:31:27 happens by you disengaging and it taking you through or maybe a little bit more of just
0:31:30 you seeing next step what it’s going to do and you’re like, okay, that’s good. That’s
0:31:35 good. That’s good. And you can just sort of tap, tap, tap through these big changes.
0:31:39 As we’re talking about this, as you mentioned, one of the really cool and noticeable things
0:31:44 about Cursor is that there’s this whole diff interface situation going on. So like the
0:31:50 model suggests with the red and the green of like, here’s how we’re going to modify
0:31:54 the code. And in the chat window, you can apply and it shows you the diff and you can
0:31:58 accept the diff. So maybe can you speak to whatever direction of that?
0:32:05 We’ll probably have like four or five different kinds of diffs. So we have optimized the
0:32:11 diff for the autocomplete. So that has a different diff interface than when
0:32:16 you’re reviewing larger blocks of code. And then we’re trying to optimize another diff
0:32:23 thing for when you’re doing multiple different files. And sort of at a high level, the difference
0:32:30 is for when you’re doing autocomplete, it should be really, really fast to read. Actually,
0:32:35 it should be really fast to read in all situations. But in autocomplete, it’s sort of you’re really
0:32:40 like your eyes focused in one area, you can’t be in too many, the humans can’t look in
0:32:41 too many different places.
0:32:43 So you’re talking about on the interface side?
0:32:48 On the interface side. So it currently has this box on the side. So we have the current
0:32:53 box. And if it tries to delete code in some place and tries to add other code, it tries
0:32:58 to show you a box on the side. Maybe show it if we pull it up on cursor.com. This is
0:32:59 what we’re talking about.
0:33:05 So that box, it was like three or four different attempts at trying to make this
0:33:12 thing work, where first the attempt was like these blue crossed out lines. So before
0:33:18 it was a box on the side, it used to show you the code to delete by showing you like,
0:33:22 like Google Doc style, you would see like a line through it, then you would see the
0:33:28 new code. And that was super distracting. And then we tried many different, you know,
0:33:33 there was sort of deletions, there was trying to red highlight. Then the next
0:33:40 iteration of it, which is sort of funny, you would hold, on Mac, the option button.
0:33:44 So it would, it would sort of highlight a region of code to show you that there might
0:33:51 be something coming. So maybe in this example, like the input and the value would
0:33:57 all get blue. And the blue was to highlight that the AI had a suggestion for you. So instead
0:34:01 of directly showing you the thing, it would show you that the AI, it would just hint that
0:34:05 the AI had a suggestion. And if you really wanted to see it, you would hold the option
0:34:11 button. And then you would see the new suggestion. And if you release the option button, you
0:34:14 would then see your original code.
0:34:17 So that’s, by the way, that’s pretty nice, but you have to know to hold the option button.
0:34:18 Yeah.
0:34:21 So by the way, I’m not a Mac user, but I got it.
0:34:25 It was, it was a button, I guess, you people have.
0:34:29 It’s, you know, it’s again, it’s just, it’s just not intuitive. I think that’s the, that’s
0:34:30 the key thing.
0:34:33 And there’s a chance this, this is also not the final version of it.
0:34:41 I am personally very excited for making a lot of improvements in this area. Like we often
0:34:48 talk about it as the verification problem where these diffs are great for small edits,
0:34:55 for large edits, or like when it’s multiple files or something, it’s actually a little
0:34:59 bit prohibitive to, to review these diffs.
0:35:04 So there are like a couple of different ideas here. Like one idea that we have is, okay,
0:35:09 you know, like parts of the diffs are important. They have a lot of information. And then parts
0:35:15 of the diff are just very low entropy. They’re like example, like the same thing over and
0:35:16 over again.
0:35:20 And so maybe you can highlight the important pieces and then gray out the not so important
0:35:26 pieces. Or maybe you can have a model that looks at the diff and sees, oh, there’s a
0:35:31 likely bug here. I will like mark this with a little red squiggly and say like, you should
0:35:37 probably like review this part of the diff. And ideas in that vein, I think are exciting.
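One crude way to picture the "low entropy" idea above is to score each hunk of a diff by how often the same edit repeats. This is only a frequency-based proxy under assumed inputs, not the model-based approach being described: the `(removed, added)` hunk representation and the function name `hunk_surprisal` are hypothetical.

```python
import math
from collections import Counter


def hunk_surprisal(hunks):
    """Score each hunk by -log2 of its relative frequency within the diff.

    Repeated, mechanical edits get a low score (candidates for graying out);
    one-off edits get a high score (candidates for highlighting).
    """
    counts = Counter(hunks)
    total = len(hunks)
    return [-math.log2(counts[h] / total) for h in hunks]


# Hypothetical diff: the same rename three times, plus one real logic change.
hunks = [
    ("foo(", "bar("),
    ("foo(", "bar("),
    ("foo(", "bar("),
    ("return x", "return x + 1"),
]
scores = hunk_surprisal(hunks)
```

A reviewer-facing UI could then dim the hunks with the lowest scores and draw attention to the rest.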
0:35:43 Yeah, that’s a really fascinating space of like UX design engineering. So you’re basically
0:35:49 trying to guide the human programmer through all the things they need to read and nothing
0:35:50 more.
0:35:51 Yeah.
0:35:52 Like optimally.
0:35:59 I want an intelligent model to do it. Like currently diff algorithms are, they’re like,
0:36:05 they’re just like normal algorithms. There is no intelligence. There’s like intelligence
0:36:09 that went into designing the algorithm, but then there is no like you don’t care if it’s
0:36:13 about this thing or this thing, as you want a model to do this.
0:36:20 So I think the general question is like, man, these models are going to get much smarter.
0:36:26 As the models get much smarter, the changes they will be able to propose are much bigger.
0:36:29 So as the changes get bigger and bigger and bigger, the humans have to do more and more
0:36:34 and more verification work. It gets more and more and more hard. Like just you need, you
0:36:41 need to help them out. It’s sort of, I don’t want to spend all my time reviewing code.
0:36:46 Can you say a little more about diffs across multiple files?
0:36:51 Yeah. I mean, so GitHub tries to solve this, right? With code review. When you’re doing
0:36:57 code review, you’re reviewing multiple diffs across multiple files. But like Arvid said
0:37:02 earlier, I think you can do much better than code review. You know, code review kind of
0:37:07 sucks. Like you spend a lot of time trying to grok this code that’s often quite unfamiliar
0:37:14 to you. And it often like doesn’t even actually catch that many bugs. And I think you can
0:37:19 significantly improve that review experience using language models, for example, using
0:37:23 the kinds of tricks that Arvid had described of maybe pointing you towards the regions
0:37:31 that actually matter. I think also, if the code is produced by these language models
0:37:38 and it’s not produced by someone else, like the code review experience is designed for
0:37:44 both the reviewer and the person that produced the code. In the case where the person that
0:37:48 produced the code is a language model, you don’t have to care that much about their experience.
0:37:53 And you can design the entire thing around the reviewer, such that the reviewer’s job
0:38:00 is as fun, as easy, as productive as possible. And I think that that feels like the issue
0:38:06 with just kind of naively trying to make these things look like code review. I think you
0:38:09 can be a lot more creative and push the boundary on what’s possible.
0:38:15 Because one idea there is, I think, ordering matters. Generally, when you review a PR,
0:38:20 you have this list of files and you’re reviewing them from top to bottom. But actually, you
0:38:24 actually want to understand this part first, because that came logically first. And then
0:38:28 you want to understand the next part. And you don’t want to have to figure out that
0:38:32 yourself. You want a model to guide you through the thing.
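The ordering idea above can be sketched with a plain dependency sort. The speaker imagines a model choosing the order; as a deterministic stand-in, this sketch assumes we already know which changed files build on which others (the file names and the `depends_on` mapping are hypothetical) and uses the standard library's `graphlib.TopologicalSorter`.

```python
from graphlib import TopologicalSorter


def review_order(depends_on):
    """Order changed files so each is read after the files it builds on.

    `depends_on` maps a file to the set of changed files it depends on.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical pull request touching three files.
depends_on = {
    "models/user.py": set(),
    "api/handler.py": {"models/user.py"},
    "ui/form.tsx": {"api/handler.py"},
}
order = review_order(depends_on)
```

Here the data model is presented first, then the handler that uses it, then the UI that calls the handler, rather than whatever alphabetical order a file list happens to produce.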
0:38:36 And is the step of creation going to be more and more in natural language? Is the goal
0:38:38 versus with actual…
0:38:43 I think sometimes, I don’t think it’s going to be the case that all of programming will
0:38:48 be natural language. And the reason for that is, you know, if I’m programming with Sualeh
0:38:54 and Sualeh is at the computer and the keyboard. And sometimes, if I’m like driving, I want
0:39:00 to say to Sualeh, hey, like implement this function. And that works. And then sometimes
0:39:05 it’s just so annoying to explain to Sualeh what I want him to do. And so I actually take
0:39:11 over the keyboard and I show him, I write like part of the example. And then it makes sense.
0:39:15 And that’s the easiest way to communicate. And so I think that’s also the case for AI.
0:39:19 Like sometimes the easiest way to communicate with AI will be to show an example and then
0:39:23 it goes and does the thing everywhere else. Or sometimes, if you’re making a website,
0:39:28 for example, the easiest way to show to the AI what you want is not to tell it what to
0:39:34 do, but, you know, drag things around or draw things. And yeah. And like maybe eventually
0:39:38 we will get to like brain machine interfaces or whatever and kind of like understand what
0:39:42 you’re thinking. And so I think natural language will have a place. I think it will definitely
0:39:46 not be the way most people program most of the time.
0:39:52 I’m really feeling the AGI with this editor. It feels like there’s a lot of machine learning
0:39:57 going on underneath. Tell me about some of the ML stuff that makes it all work.
0:40:03 Cursor really works via this ensemble of custom models that we’ve trained alongside,
0:40:07 you know, the frontier models that are fantastic at the reasoning intense things. And so Cursor
0:40:12 Tab, for example, is a great example of where you can specialize this model to be even better
0:40:17 than even frontier models if you look at evals on the task we set it at. The other domain
0:40:22 which it’s kind of surprising that it requires custom models, but it’s kind of necessary
0:40:29 and works quite well is in apply. So I think these models are like the frontier models are
0:40:33 quite good at sketching out plans for code and generating like rough sketches of like
0:40:41 the change, but actually creating diffs is quite hard for frontier models for your training
0:40:49 models. Like you try to do this with Sonnet, with o1, any frontier model, and it really messes
0:40:56 up stupid things like counting line numbers, especially in super, super large files. And
0:41:00 so what we’ve done to alleviate this is we let the model kind of sketch out this rough
0:41:06 code block that indicates what the change will be. And we train a model to then apply
0:41:13 that change to the file. And we should say that apply is the model looks at your code.
0:41:19 It gives you a really damn good suggestion of what new things to do. And the seemingly
0:41:26 for humans trivial step of combining the two you’re saying is not so trivial contrary to
0:41:32 popular perception. It is not a deterministic algorithm. Yeah. I think like you see shallow
0:41:38 copies of apply elsewhere. And it just breaks like most of the time because you think you
0:41:43 can kind of try to do some deterministic matching and then it fails, you know, at least 40%
0:41:51 of the time. And that just results in a terrible product experience. I think in general this
0:41:56 regime of you are going to get smarter and smarter models. And like so one other thing
0:42:03 that apply, it lets you do is it lets you use fewer tokens with the most intelligent models.
0:42:10 This is both expensive in terms of latency for generating all these tokens and cost.
0:42:16 So you can give this very, very rough sketch and then have your small models go and implement
0:42:20 it because it’s a much easier task to implement this very, very sketched out code. And I think
0:42:25 that this regime will continue where you can use smarter and smarter models to do the planning
0:42:30 and then maybe the implementation details can be handled by the less intelligent ones.
0:42:35 Perhaps you’ll have, you know, maybe o1, maybe it’ll be even more capable models given an
0:42:43 even higher level plan that is kind of recursively applied by Sonnet and then the apply model.
0:42:47 Maybe we should talk about how to make it fast. I feel like fast is always an interesting
0:42:48 detail.
0:42:51 Yeah. How do you make it fast?
0:42:57 Yeah. So one big component of making it fast is speculative edits. So speculative edits
0:43:02 are a variant of speculative decoding and maybe it’d be helpful to briefly describe speculative
0:43:08 decoding. With speculative decoding, what you do is you can kind of take advantage of the
0:43:14 fact that, you know, most of the time, and I’ll add the caveat that it would be when
0:43:22 you’re memory bound in language model generation. If you process multiple tokens at once, it
0:43:26 is faster than generating one token at a time. So this is like the same reason why if you
0:43:32 look at tokens per second with prompt tokens versus generated tokens, it’s much, much faster
0:43:39 for prompt tokens. So what we do is instead of using what speculative decoding normally
0:43:44 does, which is using a really small model to predict these draft tokens that your larger
0:43:50 model will then go in and verify. With code edits, we have a very strong prior of what
0:43:56 the existing code will look like, and that prior is literally the same exact code. So
0:44:01 what you can do is you can just feed chunks of the original code back into the model,
0:44:05 and then the model will just pretty much agree most of the time that, okay, I’m just going
0:44:10 to spit this code back out. And so you can process all of those lines in parallel, and
0:44:12 you just do this with sufficiently many chunks, and then eventually you’ll reach a point of
0:44:18 disagreement where the model will now predict text that is different from the ground truth
0:44:23 original code. It’ll generate those tokens, and then we kind of will decide after enough
0:44:29 tokens match the original code to restart speculating in chunks of code. What this actually ends
0:44:36 up looking like is just a much faster version of normal editing code. So it looks like a
0:44:41 much faster version of the model rewriting all the code. So we can use the same exact
0:44:48 interface that we use for diffs, but it will just stream down a lot faster.
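The speculative edits loop described above can be sketched in a few lines. This is a toy under stated assumptions, not Cursor's implementation: `next_token` is a hypothetical stand-in for the model's greedy next token, a "forward pass" is only counted rather than run, and the realignment after a disagreement is the simplest possible guess (treat the model's token as replacing one original token) instead of waiting for enough tokens to match.

```python
def speculative_edit(original, next_token, chunk_size=4):
    """Rewrite `original` using the model's own tokens, verifying drafts in chunks.

    `next_token(output_so_far)` returns the model's next token, or None at
    end of sequence. Returns (output_tokens, simulated_forward_passes).
    """
    out, pos, passes = [], 0, 0
    while True:
        passes += 1  # one simulated forward pass checks a whole draft chunk
        draft = original[pos:pos + chunk_size]
        for tok in draft:
            if next_token(out) != tok:
                break  # point of disagreement: stop accepting draft tokens
            out.append(tok)
            pos += 1
        # The same pass also yields the model's own next token.
        tok = next_token(out)
        if tok is None:
            return out, passes
        out.append(tok)
        pos += 1  # crude realignment: assume this token stands in for original[pos]


original = "def add ( a , b ) : return a + b".split()
target = "def sum_two ( a , b ) : return a + b".split()


def replay_model(out):
    # Toy model that always wants to produce `target`.
    return target[len(out)] if len(out) < len(target) else None


edited, passes = speculative_edit(original, replay_model)
```

Because every accepted token is checked against the model, the output is the same as plain token-by-token decoding; the drafts taken from the original code only reduce the number of passes when the model mostly agrees with them.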
0:44:53 And then the advantage is that while it’s streaming, you can just also start reviewing
0:45:01 the code before it’s done. So there’s no big loading screen. So maybe that is part of the
0:45:02 advantage.
0:45:05 So the human can start reading before the thing is done.
0:45:11 I think the interesting riff here is something like speculation is a fairly common idea nowadays.
0:45:16 It’s not only in language models. There’s obviously speculation in CPUs, and there’s
0:45:20 speculation for databases and speculation all over the place.
0:45:28 Let me ask this sort of the ridiculous question of which LLM is better at coding. GPT, Claude,
0:45:32 who wins in the context of programming? And I’m sure the answer is much more nuanced because
0:45:37 it sounds like every single part of this involves a different model.
0:45:46 Yeah, I think there’s no model that Pareto dominates others, meaning it is better in
0:45:55 all categories that we think matter. The categories being speed, ability to edit code, ability
0:45:59 to process lots of code, long context, you know, a couple of other things and kind of
0:46:01 coding capabilities.
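Pareto dominance, as used above, has a short definition that a sketch makes concrete. The model names and scores below are made up for illustration and say nothing about any real model.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)


def pareto_front(models):
    """Names of models that no other model dominates."""
    return [
        name
        for name, scores in models.items()
        if not any(
            dominates(other_scores, scores)
            for other, other_scores in models.items()
            if other != name
        )
    ]


# Hypothetical scores on three of the categories mentioned (higher is better).
models = {
    "model_a": {"speed": 9, "editing": 6, "long_context": 5},
    "model_b": {"speed": 5, "editing": 9, "long_context": 7},
    "model_c": {"speed": 4, "editing": 8, "long_context": 6},
}
front = pareto_front(models)
```

With these numbers `model_b` dominates `model_c`, but neither `model_a` nor `model_b` dominates the other, which is the "no single winner" situation being described.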
0:46:06 The one that I’d say right now is just kind of net best is Sonnet. I think this is a consensus
0:46:07 opinion.
0:46:12 o1 is really interesting and it’s really good at reasoning. So if you give it really
0:46:18 hard programming interview style problems or LeetCode problems, it can do quite, quite
0:46:25 well on them. But it doesn’t feel like it kind of understands your rough intent as well as
0:46:28 Sonnet does.
0:46:34 If you look at a lot of the other frontier models, one qualm I have is it feels like they’re
0:46:39 not necessarily over, I’m not saying they train on benchmarks, but they perform really
0:46:44 well on benchmarks relative to kind of everything that’s kind of in the middle. So if you try
0:46:47 it on all these benchmarks and things that are in the distribution of the benchmarks
0:46:51 they’re evaluated on, you know, they’ll do really well, but when you push them a little
0:46:56 bit outside of that, so I think the one that kind of does best at kind of maintaining that
0:47:00 same capability, like you kind of have the same capability in the benchmark as when you
0:47:03 try to instruct it to do anything with coding.
0:47:09 But another ridiculous question is the difference between the normal programming experience versus
0:47:14 what benchmarks represent? Like where do benchmarks fall short, do you think, when we’re evaluating
0:47:15 these models?
0:47:20 By the way, that’s like a really, really hard. It’s like critically important detail, like
0:47:28 how different benchmarks are versus like real coding. Where real coding, it’s not interview
0:47:35 style coding, it’s you’re doing these, you know, humans are saying like half broken English
0:47:41 sometimes and sometimes you’re saying like, oh, do what I did before. Sometimes you’re
0:47:48 saying, go add this thing and then do this other thing for me and then make this UI element
0:47:54 and then, you know, it’s just like a lot of things are sort of context dependent. You
0:47:58 really want to like understand the human and then do what the human wants as opposed to
0:48:04 sort of this, maybe the way to put it is sort of abstractly is the interview problems are
0:48:15 very well specified. They lean a lot on specification while the human stuff is less specified.
0:48:21 Yeah. I think that this benchmark question is both complicated by what Sualeh just mentioned,
0:48:27 and then also, what Aman was getting into is that even if you like, you know, there’s
0:48:31 this problem of like the skew between what can you actually model on a benchmark versus
0:48:34 real programming and that can be sometimes hard to encapsulate because it’s like real
0:48:39 programming is like very messy and sometimes things aren’t super well specified, what’s
0:48:44 correct or what isn’t. But then it’s also doubly hard because of this public benchmark
0:48:48 problem and that’s both because public benchmarks are sometimes kind of hill climbed on. Then
0:48:53 it’s like really, really hard to also get the data from the public benchmarks out of
0:48:59 the models. And so, for instance, like one of the most popular like agent benchmarks
0:49:05 SWE-bench is really, really contaminated in the training data of these foundation
0:49:09 models. And so if you ask these foundation models to do a SWE-bench problem, and you actually
0:49:12 don’t give them the context of a code base, they can like hallucinate the right file paths.
0:49:18 They can hallucinate the right function names. And so it’s also just the public aspect of
0:49:19 these things is tricky.
0:49:25 Yeah, like in that case, it could be trained on the literal issues or pull requests themselves.
0:49:30 And maybe the labs will start to do a better job or they’ve already done a good job at
0:49:34 decontaminating those things, but they’re not going to omit the actual training data
0:49:38 of the repository itself. Like these are all like some of the most popular Python repositories
0:49:44 like SymPy is one example. I don’t think they’re going to handicap their models on SymPy and
0:49:50 all these popular Python repositories in order to get true evaluation scores on these benchmarks.
0:49:56 I think that given the dearth in benchmarks, there have been like a few interesting crutches
0:50:01 that places that build systems with these models or build these models actually use to
0:50:05 get a sense of are they going in the right direction or not. And in a lot of places,
0:50:09 people will actually just have humans play with the things and give qualitative feedback
0:50:14 on these. Like at one or two of the foundation model companies, they have people for whom that’s
0:50:19 a big part of their role. And internally, we also qualitatively assess these models and
0:50:22 actually lean on that a lot in addition to like private evals that we have.
0:50:23 It’s like the vibe.
0:50:31 Yeah, the vibe benchmark, human benchmark. You pull in the humans to do a vibe check.
0:50:32 Yeah.
0:50:37 I mean, that’s kind of what I do like just like reading online forums and Reddit and
0:50:45 X just like, well, I don’t know how to properly load in people’s opinions because they’ll
0:50:51 say things like, I feel like Claude or GPT’s gotten dumber or something. They’ll say I
0:50:58 feel like and then I sometimes feel like that too. But I wonder if it’s the model’s problem
0:50:59 or mine.
0:51:06 Yeah, with Claude, there’s an interesting take I heard where I think AWS has different
0:51:14 chips. And I suspect they have slightly different numerics than Nvidia GPUs. And someone speculated
0:51:20 that Claude’s degraded performance had to do with maybe using the quantized version that
0:51:27 existed on AWS Bedrock versus whatever was running on Anthropic’s GPUs.
0:51:31 I interviewed a bunch of people that have conspiracy theories. I’m glad we spoke to
0:51:32 this conspiracy theory.
0:51:40 Well, it’s not like conspiracy theory as much. Humans are humans and there’s these details
0:51:48 and you’re doing like this crazy amount of flops and chips are messy and man, you can just
0:51:55 have bugs. Like bugs are, it’s hard to overstate how hard bugs are to avoid.
0:52:00 What’s the role of a good prompt in all of this? You mentioned that benchmarks have
0:52:11 really structured, well-formulated prompts. What should a human be doing to maximize success?
0:52:15 And what’s the importance of what the humans? You wrote a blog post on, called it Prompt
0:52:16 Design.
0:52:23 Yeah, I think it depends on which model you’re using and all of them are slightly different
0:52:26 and they respond differently to different prompts.
0:52:35 But I think the original GPT-4 and the original sort of breed of models last year, they were
0:52:40 quite sensitive to the prompts and they also had a very small context window.
0:52:46 And so we have all of these pieces of information around the codebase that would maybe be relevant
0:52:49 in the prompt. Like you have the docs, you have the files that you add, you have the
0:52:54 conversation history, and then there’s a problem like how do you decide what you actually
0:52:57 put in the prompt and when you have a limited space.
0:53:01 And even for today’s models, even when you have long context, filling out the entire
0:53:06 context window means that it’s slower. It means that sometimes the model actually gets
0:53:09 confused and some models get more confused than others.
0:53:14 And we have this one system internally that we call Priompt, which helps us with that
0:53:26 a little bit. And I think it was built for the era before where we had 8,000 token context
0:53:33 windows. And it’s a little bit similar to when you’re making a website, you sort of,
0:53:38 you want it to work on mobile, you want it to work on a desktop screen, and you have
0:53:45 this dynamic information, which you don’t have, for example, if you’re designing a print
0:53:48 magazine, you know exactly where you can put stuff.
0:53:52 But when you have a website or when you have a prompt, you have these inputs, and then
0:53:55 you need to format them to always work. Even if the input is really big, then you might
0:53:58 have to cut something down.
0:54:02 And so the idea was, okay, let’s take some inspiration. What’s the best way to design
0:54:08 websites? Well, the thing that we really like is React and the declarative approach where
0:54:18 you use JSX in JavaScript, and then you declare, this is what I want, and I think this has
0:54:24 higher priority, or this has higher Z index than something else. And then you have this
0:54:30 rendering engine in web design, it’s like Chrome, and in our case, it’s a Priompt renderer,
0:54:34 which then fits everything onto the page. And as you declare it, it will decide what
0:54:40 you want, and then it figures out what you want. And so we have found that to be quite
0:54:46 helpful. And I think the role of it has sort of shifted over time, where initially it was
0:54:52 to fit to these small context windows. Now it’s really useful because it helps us with
0:54:58 splitting up the data that goes into the prompt and the actual rendering of it. And so it’s
0:55:02 easier to debug because you can change the rendering of the prompt and then try it on
0:55:06 old prompts, because you have the raw data that went into the prompt. And then you can
0:55:11 see, did my change actually improve it for this entire eval set.
0:55:14 So do you literally prompt with JSX?
0:55:18 Yes, yes. So it kind of looks like React, there are components, we have one component
0:55:25 that’s a file component, and it takes in the cursor, usually there’s one line where the
0:55:29 cursor is in your file, and that’s probably the most important line because that’s one
0:55:32 you’re looking at. And so then you can give priorities, so like that line has the highest
0:55:38 priority, and then you subtract one for every line that is farther away. And then eventually
0:55:42 when it’s rendered, it figures out how many lines can actually fit, and it centers around
0:55:43 that thing.
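The line-priority scheme described above can be sketched without any JSX. This is an illustration in Python under assumptions, not the actual renderer: the function name `render_file`, the whitespace-based token count, and the rule of stopping at the first line that does not fit are all hypothetical choices.

```python
def render_file(lines, cursor_line, token_budget):
    """Keep the lines closest to the cursor that fit within a token budget.

    The cursor line has the highest priority and priority drops by one for
    every line of distance, so lines are considered nearest-first.
    """

    def cost(line):
        # Rough token estimate: whitespace-separated pieces, at least one.
        return max(1, len(line.split()))

    nearest_first = sorted(
        range(len(lines)), key=lambda i: (abs(i - cursor_line), i)
    )
    kept, used = set(), 0
    for i in nearest_first:
        if used + cost(lines[i]) > token_budget:
            break  # stop at the first line that no longer fits
        kept.add(i)
        used += cost(lines[i])
    # Emit the surviving lines in their original file order.
    return [lines[i] for i in sorted(kept)]


lines = ["a b", "c d", "e f", "g h", "i j"]
window = render_file(lines, cursor_line=2, token_budget=6)
```

With a budget of six "tokens" the result is the cursor line plus one neighbor on each side; a larger budget widens the window symmetrically until the file is exhausted.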
0:55:44 That’s amazing.
0:55:48 Yeah. And you can do like other fancy things where if you have lots of code blocks from
0:55:53 the entire code base, you could use retrieval and things like embedding and re-ranking scores
0:55:57 to add priorities for each of these components.
0:56:02 So should humans, when they ask questions, also use, try to use something like that?
0:56:07 Like would it be beneficial to write JSX in the prompt? The whole idea is this should
0:56:09 be loose and messy.
0:56:15 I think our goal is kind of that you should just do whatever is the most natural thing
0:56:21 for you. And then our job is to figure out how do we actually like retrieve the relevant
0:56:23 thing so that your thing actually makes sense.
0:56:28 Well, this is sort of the discussion I had with Aravind of Perplexity, is like his whole
0:56:33 idea is like you should let the person be as lazy as he wants.
0:56:39 But like, yeah, that’s a beautiful thing. But I feel like you’re allowed to ask more
0:56:41 of programmers, right?
0:56:45 So like if you say just do what you want, I mean, humans are lazy. There’s a kind of
0:56:52 tension between just being lazy versus like provide more as be prompted almost like the
0:56:59 system pressuring you or inspiring you to be articulate not in terms of the grammar
0:57:05 of the sentences, but in terms of the depth of thoughts that you convey inside the prompts.
0:57:12 I think even as a system gets closer to some level of perfection, often when you ask the
0:57:17 model for something, just not enough intent is conveyed to know what to
0:57:22 do. And there are like a few ways to resolve that intent. One is the simple thing of having
0:57:28 models just ask you, I’m not sure how to do these parts based on your query. Could you
0:57:29 clarify that?
0:57:38 I think the other could be maybe if you there are five or six possible generations, given
0:57:42 the uncertainty present in your query so far, why don’t we just actually show you all those
0:57:44 and let you pick them?
0:57:53 How hard is it to, for the model to choose to speak, talk back? Sort of versus, it’s
0:58:00 hard to sort of like how to deal with the uncertainty. Do I choose to ask for more information
0:58:02 to reduce the ambiguity?
0:58:09 So I mean, one of the things we do is it’s like a recent addition is try to suggest files
0:58:18 that you can add. So while you’re typing, one can guess what the uncertainty is and maybe
0:58:27 suggest that like, you know, maybe you’re writing your API. And we can guess using the
0:58:34 commits that you’ve made previously in the same file that the client and the server is
0:58:41 super useful. And there’s like a hard technical problem of how do you resolve it across all
0:58:46 commits, which files are the most important given your current prompt. And we’re still
0:58:53 sort of, the initial version is rolled out. And I’m sure we can make it much more accurate. It’s
0:58:54 very experimental.
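One simple way to guess related files from commit history, as described above, is to count which files were changed together with the current one. This is a minimal sketch with made-up commits and file names; the function `suggest_files` is hypothetical, and a real system would presumably also weigh the prompt itself.

```python
from collections import Counter


def suggest_files(commits, current_file, top_k=2):
    """Rank files by how often they changed in the same commit as `current_file`."""
    counts = Counter()
    for files in commits:
        if current_file in files:
            counts.update(f for f in files if f != current_file)
    return [name for name, _ in counts.most_common(top_k)]


# Hypothetical history: each entry is the list of files touched by one commit.
commits = [
    ["api.py", "client.ts", "server.py"],
    ["api.py", "client.ts"],
    ["api.py", "client.ts", "server.py", "docs.md"],
    ["README.md"],
]
suggestions = suggest_files(commits, "api.py")
```

In this toy history the client and the server surface as the top suggestions when editing the API file, which mirrors the example given in the conversation.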
0:58:58 But then the idea is we show you like, do you just want to add this file, this file, this
0:59:04 file also, to tell, you know, the model to edit those files for you. Because if maybe
0:59:08 you’re making the API, like, you should also edit the client and the server that is using
0:59:13 the API and the other one resolving the API. And so that will be kind of cool as both there’s
0:59:18 the phase where you’re writing the prompt and there’s before you even click enter, maybe
0:59:20 we can help resolve some of the uncertainty.
0:59:25 To what degree do you use agentic approaches? How useful are agents?
0:59:33 We think agents are really, really cool. Like, I think agents, it’s like, it resembles
0:59:37 sort of like a human, it’s sort of like the things like you can kind of feel that it,
0:59:43 like you’re getting closer to AGI because you see a demo where it acts as a human would
0:59:53 and it’s really, really cool. I think agents are not yet super useful for many things.
1:00:00 I think we’re getting close to where they will actually be useful. And so I think there
1:00:06 are certain types of tasks where having an agent would be really nice. Like, I would
1:00:10 love to have an agent, for example, if like we have a bug where you sometimes can’t command
1:00:17 C and command V inside our chat input box. And that’s a task that’s super well specified.
1:00:21 I just want to say like in two sentences, this does not work. Please fix it. And then
1:00:27 I would love to have an agent that just goes off, does it, and then a day later, I come
1:00:31 back and I review the thing. You mean it goes, finds the right file?
1:00:36 Yeah, it finds the right files, it like tries to reproduce the bug, it like fixes the bug
1:00:39 and then it verifies that it’s correct. And this could be a process that takes a long
1:00:46 time. And so I think I would love to have that. And then I think a lot of programming,
1:00:52 like there is often this belief that agents will take over all of programming. I don’t
1:00:57 think we think that that’s the case, because a lot of programming, a lot of the value is
1:01:02 in iterating, or you don’t actually want to specify something upfront, because you don’t
1:01:06 really know what you want until you’ve seen an initial version, and then you want to iterate
1:01:11 on that, and then you provide more information. And so for a lot of programming, I think you
1:01:15 actually want a system that’s instant, that gives you an initial version instantly back,
1:01:18 and then you can iterate super, super quickly.
1:01:24 What about something that recently came out, like Replit Agent, that does also like setting
1:01:28 up the development environment and installing software packages, configuring everything,
1:01:31 configuring the databases, and actually deploying the app?
1:01:32 Yeah.
1:01:36 Is that also in the set of the things you dream about?
1:01:40 I think so. I think that would be really cool. For certain types of programming, it would
1:01:41 be really cool.
1:01:43 Is that within scope of Cursor?
1:01:49 Yeah. We aren’t actively working on it right now, but it’s definitely like we want
1:01:56 to make the programmer’s life easier and more fun. And some things are just really tedious,
1:02:00 and you need to go through a bunch of steps, and you want to delegate that to an agent.
1:02:04 And then some things, you can actually have an agent in the background while you’re working,
1:02:08 like let’s say you have a PR that’s both back end and front end, and you’re working in the
1:02:12 front end, and then you can have a background agent that does some work and figures out kind
1:02:17 of what you’re doing. And then when you get to the back end part of your PR, then you
1:02:23 have some initial piece of code that you can iterate on. And so that would also be really
1:02:25 cool.
1:02:30 One of the things we already talked about is speed. But I wonder if we can just linger
1:02:36 on that some more, on the various places and the technical details involved in making
1:02:41 this thing really fast. So every single aspect of Cursor, most aspects of Cursor feel really
1:02:46 fast. Like I mentioned, the apply is probably the slowest thing. And for me, I’m sorry,
1:02:47 the pain.
1:02:53 I know. It’s a pain. It’s a pain that we’re feeling and we’re working on fixing it.
1:02:58 Yeah. I mean, it says something that something that takes, I don’t know what it is, like one second or
1:03:04 two seconds, feels slow. That actually shows that everything else is just
1:03:09 really, really fast. So are there some technical details about how to make some of these models,
1:03:14 how to make the chat fast, how to make the diffs fast? Is there something that just jumps
1:03:15 to mind?
1:03:18 Yeah. I mean, so we can go over a lot of the strategies that we use. One interesting thing
1:03:28 is cache warming. And so what you can do is, as the user is typing, you can know that you’re
1:03:33 probably going to use some piece of context. And you can know that before the user is done
1:03:39 typing. So as we discussed before, reusing the KV cache results in lower latency and lower
1:03:44 costs across requests. So as the user starts typing, you can immediately warm the cache
1:03:50 with like, let’s say the current file contents. And then when they’ve pressed enter, there are
1:03:55 very few tokens it actually has to prefill and compute before starting the generation.
1:03:57 This will significantly lower TTFT.
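The cache-warming idea described above can be sketched with a toy model of the cost. This is only an illustration under stated assumptions: `PrefixCache`, `warm`, and `tokens_to_prefill` are hypothetical names, tokens are plain strings, and a real inference server would hold actual key/value tensors rather than a token list.

```python
class PrefixCache:
    """Toy stand-in for a server-side KV cache keyed by a token prefix."""

    def __init__(self):
        # Tokens whose keys/values are assumed to be computed already.
        self.cached = []

    def warm(self, tokens):
        """Prefill `tokens` ahead of time, e.g. while the user is typing."""
        self.cached = list(tokens)

    def tokens_to_prefill(self, prompt):
        """Count the prompt tokens NOT covered by the cached prefix."""
        shared = 0
        for cached_token, prompt_token in zip(self.cached, prompt):
            if cached_token != prompt_token:
                break
            shared += 1
        return len(prompt) - shared


# Hypothetical prompt layout: file contents first, user instruction last.
file_tokens = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
instruction = ["rename", "add", "to", "sum"]
prompt = file_tokens + instruction

cold = PrefixCache()
warm = PrefixCache()
warm.warm(file_tokens)  # happens before the user presses enter

cold_cost = cold.tokens_to_prefill(prompt)
warm_cost = warm.tokens_to_prefill(prompt)
print("tokens to prefill, cold vs warm:", cold_cost, warm_cost)
```

With the file contents already warmed, only the instruction tokens remain to be prefilled after the user presses enter, which is where the lower time to first token comes from.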
1:03:59 Can you explain how KV cache works?
1:04:09 Yeah. So the way transformers work, one of the mechanisms that allow transformers to
1:04:14 not just independently, like the mechanism that allows transformers to not just independently
1:04:19 look at each token, but see previous tokens, is the keys and values in attention. And generally
1:04:25 the way attention works is you have at your current token some query. And then you’ve got
1:04:30 all the keys and values of all your previous tokens, which are some kind of representation
1:04:37 that the model stores internally of all the previous tokens in the prompt. And like by
1:04:43 default, when you’re doing a chat, the model has to, for every single token, do this forward
1:04:48 pass through the entire model. That’s a lot of matrix multiplies that happen. And that
1:04:49 is really, really slow.
1:04:54 Instead, if you have already done that, and you stored the keys and values, and you keep
1:04:59 that in the GPU, then when I’m, let’s say I have it stored for the last N tokens, if
1:05:06 I now want to compute the output token for the N plus first token, I don’t need to pass
1:05:11 those first N tokens through the entire model because I already have all those keys
1:05:16 and values. And so you just need to do the forward pass through that last token. And
1:05:21 then when you’re doing attention, you’re reusing those keys and values that have been computed,
1:05:26 which is the only kind of sequential part, or sequentially dependent part of the transformer.
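A minimal sketch of the KV-cache idea just described, in pure Python with a single attention head and a single layer. The `project` function is a hypothetical stand-in for learned key/value projections, and the query is taken to be the token vector itself; both are simplifying assumptions, not how any particular model is built. The point is only that storing keys and values gives the same outputs with far fewer projections.

```python
import math


def project(x, scale):
    """Hypothetical stand-in for a learned projection matrix."""
    return [scale * xi for xi in x]


def attend(query, keys, values):
    """Softmax attention of one query over all stored keys and values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]


def decode_without_cache(tokens):
    """Recompute keys and values for the whole prefix at every step."""
    outputs, projections = [], 0
    for t in range(len(tokens)):
        prefix = tokens[: t + 1]
        keys = [project(x, 0.5) for x in prefix]
        values = [project(x, 2.0) for x in prefix]
        projections += 2 * len(prefix)
        outputs.append(attend(tokens[t], keys, values))
    return outputs, projections


def decode_with_cache(tokens):
    """Compute keys and values once per token and reuse them afterwards."""
    keys, values, outputs, projections = [], [], [], 0
    for x in tokens:
        keys.append(project(x, 0.5))
        values.append(project(x, 2.0))
        projections += 2
        outputs.append(attend(x, keys, values))
    return outputs, projections


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
slow_out, slow_work = decode_without_cache(tokens)
fast_out, fast_work = decode_with_cache(tokens)
print("same outputs:", slow_out == fast_out)
print("projections without/with cache:", slow_work, fast_work)
```

Without the cache the projection work grows quadratically with sequence length; with it, each new token costs a constant amount of projection work plus one attention read over the stored keys and values.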
1:05:31 Is there like higher level caching of like caching of the prompts or that kind of stuff
1:05:32 that could help?
1:05:40 I see. Yeah, there’s other types of caching you can kind of do. One interesting thing
1:05:48 that you can do for Cursor Tab is you can basically predict ahead as if the user would
1:05:54 have accepted the suggestion and then trigger another request. And so then you’ve cached,
1:05:57 you’ve done the speculative, it’s a mix of speculation and caching, right? Because you’re
1:06:03 speculating what would happen if they accepted it. And then you have this value that is cached,
1:06:07 this suggestion. And then when they press tab, the next one would be waiting for them
1:06:13 immediately. It’s a kind of clever heuristic slash trick that uses higher-level caching
1:06:20 and can make it feel fast, despite there not actually being any changes in the model.
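The speculate-then-cache trick for Cursor Tab can be sketched as follows. Everything here is hypothetical: `SpeculativeTab` and `toy_model` are illustrative names, the "model" is a trivial function, and a real editor would issue the speculative request asynchronously rather than inline.

```python
class SpeculativeTab:
    """Toy suggestion cache that prefetches the post-accept suggestion."""

    def __init__(self, model):
        self.model = model      # callable: text -> suggested continuation
        self.cache = {}         # text -> suggestion computed ahead of time
        self.model_calls = 0

    def _suggest(self, text):
        if text not in self.cache:
            self.model_calls += 1
            self.cache[text] = self.model(text)
        return self.cache[text]

    def show(self, text):
        """Return a suggestion, then speculate that the user accepts it."""
        suggestion = self._suggest(text)
        # Speculative request: what would we suggest after acceptance?
        self._suggest(text + suggestion)
        return suggestion


def toy_model(text):
    """Hypothetical model: alternates between suggesting 'b' and 'a'."""
    return "b" if text.endswith("a") else "a"


tab = SpeculativeTab(toy_model)
first = tab.show("a")              # also prefetches for the accepted state
calls_after_first = tab.model_calls
accepted = "a" + first             # the user presses tab
was_cached = accepted in tab.cache
second = tab.show(accepted)        # served from the cache
print(first, second, was_cached, calls_after_first, tab.model_calls)
```

When the user accepts, the next suggestion is already in the cache, so it appears immediately even though the model itself is no faster.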
1:06:25 And if you can make the KV cache smaller, one of the advantages you get is like, maybe
1:06:29 you can speculate even more. Maybe you can guess, here’s the 10 things that could be
1:06:36 useful. Like, you predict the next 10, and then it’s possible the user hits one
1:06:40 of the 10. It’s like a much higher chance than the user hitting like the exact one that you
1:06:45 show them. Maybe they type another character and we sort of hit something else in the cache.
1:06:52 So there’s all these tricks where the general phenomenon here is, and I think it’s also super
1:07:00 useful for RL, you know, maybe a single sample from the model isn’t very good. But
1:07:07 if you predict like 10 different things, it turns out that the probability that one of the 10 is
1:07:12 right is much higher. There’s these pass@k curves. And, you know, part of RL,
1:07:19 like what RL does is, you know, you can exploit this pass@k phenomenon to make many different
1:07:26 predictions. And one way to think about this, the model sort of knows internally, has like,
1:07:29 has some uncertainty over like, which of the k things is correct, or like, which of the
1:07:36 k things does the human want. So when we RL our, you know, Cursor Tab model, one of the
1:07:44 things we’re doing is we’re predicting which like, which of the 100 different suggestions
1:07:48 the model produces is more amenable for humans. Like, which of them do humans like more than
1:07:53 other things? Maybe, maybe like, there’s something where the model can predict very far ahead
1:07:59 versus like a little bit and maybe somewhere in the middle and, and, you know, just, and
1:08:03 then you can give a reward to the things that humans would like more and sort of punish
1:08:06 the things that they won’t like and sort of then train the model to output the suggestions
1:08:09 that humans would like more. You have these like RL loops that are very useful that exploit
1:08:14 these pass@k curves. Aman, maybe you can go into even more detail?
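To make the pass@k intuition concrete, here is a small sketch under an idealized assumption that is not from the conversation: each sample is independently correct with the same probability `p`. Real pass@k curves are measured empirically and samples are correlated, so treat this only as an illustration of why many guesses beat one.

```python
def pass_at_k(p, k):
    """Chance that at least one of k independent samples is correct,
    assuming each sample is correct with probability p."""
    return 1.0 - (1.0 - p) ** k


# Hypothetical single-sample success rate of 10%.
for k in (1, 10, 100):
    print(k, round(pass_at_k(0.1, k), 4))
```

Even with a weak single-sample success rate, the chance that some sample is right climbs quickly with k, which is the gap an RL reward signal can try to close by teaching the model to put its best guess first.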
1:08:21 Yeah, it’s a little, it is a little different than speed. But I mean, like, technically
1:08:24 you can tie it back in because you can get away with the smaller model if you RL your smaller
1:08:30 model and it gets the same performance as the bigger one. That’s like, and, and while
1:08:35 I was mentioning stuff about KV, about reducing the size of your KV cache, there are other
1:08:41 techniques there as well that are really helpful for speed. So kind of back in the day, like
1:08:47 all the way two years ago, people mainly used multi-head attention. And I think there’s been
1:08:55 a migration towards more efficient attention schemes like grouped-query or multi-query attention.
1:09:01 And this is really helpful for then with larger batch sizes, being able to generate the tokens
1:09:08 much faster. The interesting thing here is this now has no effect on that time to first
1:09:15 token pre-fill speed. The thing this matters for is now generating tokens. And, and why
1:09:21 is that? Because when you’re generating tokens, instead of being bottlenecked by doing the
1:09:26 super parallelizable matrix multiplies across all your tokens, you’re bottlenecked,
1:09:32 for long context with large batch sizes, by how quickly you can read those cached
1:09:38 keys and values. And so then that’s memory bandwidth, and how can we make this faster?
1:09:42 We can try to compress the size of these keys and values. So multi-query attention is the
1:09:47 most aggressive of these. Where normally with multi-head attention, you have some number
1:09:57 of attention heads and some number of kind of query heads. Multi-query just preserves
1:10:03 the query heads, gets rid of all the key-value heads. So there’s only one kind of key-value
1:10:11 head and there’s all the remaining query heads. With grouped-query, you instead, you know, preserve
1:10:19 all the query heads. And then your keys and values are kind of, there are fewer heads
1:10:23 for the keys and values, but you’re not reducing it to just one. But anyways, like the whole
1:10:26 point here is you’re just reducing the size of your KV cache.
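A back-of-the-envelope sketch of how the number of key-value heads drives KV cache size. The model dimensions below are hypothetical and chosen only to give round numbers; the formula counts one key tensor and one value tensor per layer, per KV head, per token, per sequence in the batch.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value=2):
    """Approximate KV cache size; the factor of 2 is keys plus values."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value


# Hypothetical model and workload (16-bit values, 32 query heads).
config = dict(layers=32, head_dim=128, seq_len=8192, batch=8)

mha = kv_cache_bytes(kv_heads=32, **config)  # multi-head: one KV head per query head
gqa = kv_cache_bytes(kv_heads=8, **config)   # grouped-query: 4 query heads share a KV head
mqa = kv_cache_bytes(kv_heads=1, **config)   # multi-query: all query heads share one KV head

gib = 2 ** 30
print("MHA GiB:", mha / gib)
print("GQA GiB:", gqa / gib)
print("MQA GiB:", mqa / gib)
```

Because the cache also scales linearly with sequence length and batch size, shrinking the per-token footprint leaves room for longer prompts or larger batches at the same memory-bandwidth cost.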
1:10:28 And then there is the MLA.
1:10:35 Yeah, multi-latent. That’s a little more complicated. And the way that this works is it kind of
1:10:41 turns the entirety of your keys and values across all your heads into this kind of one
1:10:46 latent vector that is then kind of expanded at inference time.
1:10:52 But MLA is from this company called DeepSeek. It’s quite an interesting algorithm. Maybe
1:11:00 the key idea is sort of in both MQA and in other places, what you’re doing is sort of
1:11:10 reducing the number of KV heads. And the advantage you get from that is there’s less of them,
1:11:16 but maybe the theory is that you actually want a lot of different, like you want each
1:11:23 of the keys and values to actually be different. So one way to reduce the size is you keep
1:11:28 one big shared vector for all the keys and values. And then you have smaller vectors
1:11:34 for every single token, so that you can store only the smaller thing. There’s
1:11:38 some sort of like low-rank reduction. And with the low-rank reduction, at the end of
1:11:43 the day, when you eventually want to compute the final thing, remember that like you’re memory
1:11:47 bound, which means that like, you still have some compute left that you can use for these
1:11:55 things. And so if you can expand the latent vector back out, and somehow like, this is
1:12:01 far more efficient because you’re reducing like, for example, maybe like reducing like
1:12:04 32 or something, like the size of the vector that you’re keeping.
1:12:10 Yeah, there’s perhaps some richness in having a separate set of keys and values and query
1:12:16 that kind of pairwise match up versus compressing that all into one. And that interaction at
1:12:17 least.
1:12:25 Okay. And all of that is dealing with being memory bound. Yeah. And, I mean, ultimately,
1:12:29 how does that map to the user experience, trying to get the... Yeah, the two things that
1:12:35 it maps to: you can now make your cache a lot larger, because you’ve less space allocated
1:12:38 for the KV cache, so you can cache a lot more aggressively and a lot more things. So you
1:12:44 get more cache hits, which are helpful for reducing the time to first token for the reasons
1:12:49 that were kind of described earlier. And then the second being when you start doing inference
1:12:53 with more and more requests and larger and larger batch sizes, you don’t see much of
1:12:58 a slowdown in the speed of generating the tokens.
1:13:02 It also allows you to make your prompt bigger, for certain. Yeah. Yeah. So like
1:13:08 the size of your KV cache is the size of all your prompts multiplied by the number
1:13:11 of prompts being processed in parallel. So you could increase either of those dimensions,
1:13:17 right, the batch size, or the size of your prompts, without degrading the latency of generating
1:13:18 tokens.
1:13:23 Arvid, you wrote a blog post, “Shadow Workspace: Iterating on Code in the Background.” So what’s
1:13:29 going on? So to be clear, we want there to be a lot of stuff happening in the background
1:13:35 and we’re experimenting with a lot of things. Right now, we don’t have much of that happening,
1:13:40 other than like the cache warming or like, you know, figuring out the right context that
1:13:45 goes into your Command-K prompts, for example. But the idea is, if you can actually spend
1:13:53 computation in the background, then you can help the user maybe like, at a slightly longer
1:13:57 time horizon than just predicting the next few lines that you’re going to make. But actually,
1:14:02 like in the next 10 minutes, what are you going to make? And by doing it in the background,
1:14:08 you can spend more computation doing that. And so the idea of the shadow workspace that
1:14:15 we implemented, and we use it internally for like experiments, is that to actually take
1:14:20 advantage of doing stuff in the background, you want some kind of feedback signal to give
1:14:24 back to the model. Because otherwise, like you can get higher performance by just letting
1:14:29 the model think for longer. And so like, o1 is a good example of that. But another
1:14:35 way you can improve performance is by letting the model iterate and get feedback. And so
1:14:41 one very important piece of feedback when you’re a programmer is the language server,
1:14:47 which is this thing that exists for most languages, and there’s like a separate language
1:14:52 server per language. And it can tell you, you know, you’re using the wrong type here,
1:14:56 and then gives you an error, or it can allow you to go to definition and sort of understands
1:15:01 the structure of your code. So language servers are extensions developed by, like there’s a
1:15:05 TypeScript language server developed by the TypeScript people, a Rust language server developed
1:15:09 by the Rust people, and then they all interface over the language server protocol to VS Code.
1:15:13 So that VS Code doesn’t need to have all of the different languages built into VS Code,
1:15:17 but rather you can use the existing compiler infrastructure.
1:15:19 For linting purposes?
1:15:24 It’s for linting. It’s for going to definition. And it’s for like seeing the right types that
1:15:25 you’re using.
1:15:27 So it’s doing like type checking also?
1:15:32 Yes, type checking and going to references. And that’s like when you’re working in a big
1:15:38 project, you kind of need that. If you don’t have that, it’s like really hard to code in
1:15:39 a big project.
1:15:45 Can you say again how that’s being used inside Cursor, the language server protocol communication
1:15:46 thing?
1:15:50 So it’s being used in Cursor to show to the programmer just like in VS Code. But then
1:15:57 the idea is you want to show that same information to the models, the AI models. And you want
1:16:01 to do that in a way that doesn’t affect the user because you want to do it in the background.
1:16:08 And so the idea behind the shadow workspace was, okay, like one way we can do this is
1:16:14 we spawn a separate window of Cursor that’s hidden. And so you can set this flag in
1:16:15 Electron so that it’s hidden.
1:16:19 There is a window, but you don’t actually see it. And inside of this window, the AI agents
1:16:24 can modify code however they want, as long as they don’t save it because it’s still the
1:16:29 same folder, and then can get feedback from the linters and go to definition and iterate
1:16:30 on their code.
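The edit-without-saving feedback loop described above can be sketched roughly as below. This is not Cursor's implementation: Python's built-in `compile` stands in for language-server diagnostics (it only catches syntax errors), and `toy_model` is a hypothetical "model" that knows a single fix. The shape of the loop is the point: edit an in-memory copy, read diagnostics, iterate, and never write to the saved file.

```python
def diagnostics(source):
    """Stand-in for language-server feedback: report syntax errors only."""
    try:
        compile(source, "<shadow>", "exec")
        return []
    except SyntaxError as err:
        return [f"line {err.lineno}: {err.msg}"]


def iterate_in_shadow(saved_source, propose_fix, max_rounds=3):
    """Edit an in-memory copy until diagnostics are clean.

    The saved source is only read, never written, which mirrors the
    'modify code as long as you don't save it' constraint."""
    shadow = saved_source
    for _ in range(max_rounds):
        errors = diagnostics(shadow)
        if not errors:
            break
        shadow = propose_fix(shadow, errors)
    return shadow, diagnostics(shadow)


def toy_model(source, errors):
    """Hypothetical model that knows one fix: add the missing colon."""
    return source.replace("def f(x)\n", "def f(x):\n")


on_disk = "def f(x)\n    return x + 1\n"
fixed, remaining = iterate_in_shadow(on_disk, toy_model)
print("diagnostics before:", diagnostics(on_disk))
print("diagnostics after:", remaining)
```

A real shadow workspace would get much richer signals (type errors, go-to-definition, references) from the language server than this syntax-only stand-in.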
1:16:36 So like literally run everything in the background? Like as if, right, maybe even run the code?
1:16:41 So that’s the eventual version. And that’s what you want. And a lot of the blog post
1:16:47 is actually about how do you make that happen? Because it’s a little bit tricky. You want
1:16:52 it to be on the user’s machine so that it exactly mirrors the user’s environment. And
1:16:57 then on Linux, you can do this cool thing where you can actually mirror the file system
1:17:04 and have the AI make changes to the files. And it thinks that it’s operating on the file
1:17:12 level. But actually, that’s stored in memory. And you can create this kernel extension to
1:17:21 make it work. Whereas on Mac and Windows, it’s a little bit more difficult. But it’s
1:17:23 a fun, technical problem in this way.
1:17:29 One maybe hacky but interesting idea that I like is holding a lock on saving. And so
1:17:33 basically you can then have the language model kind of hold the lock on saving to disk. And
1:17:38 then instead of you operating on the ground truth version of the files that are saved
1:17:41 to disk, you actually are operating on what was the shadow workspace before, and these unsaved
1:17:45 things that only exist in memory, which you still get linter errors for and can code in.
1:17:49 And then when you try to maybe run code, it’s just like there’s a small warning that there’s
1:17:53 a lock. And then you kind of will take back the lock from the language server if you’re
1:17:56 trying to do things concurrently or from the shadow workspace if you’re trying to do things
1:17:57 concurrently.
1:18:02 That’s such an exciting future, by the way. It’s a bit of a tangent. But like to allow
1:18:09 a model to change files. It’s scary for people. But like it’s really cool to be able to just
1:18:16 like let the agent do a set of tasks and you come back the next day and kind of observe.
1:18:18 Like it’s a colleague or something like that.
1:18:23 And I think there may be different versions of like runnability where for the simple things
1:18:27 where you’re doing things in the span of a few minutes on behalf of the user as they’re
1:18:32 programming, it makes sense to make something work locally on their machine. I think for the
1:18:36 more aggressive things, where you’re making larger changes that take longer periods of time,
1:18:41 you’ll probably want to do this in some sandboxed remote environment. And that’s another incredibly
1:18:47 tricky problem of how do you exactly reproduce, or mostly reproduce to the point of it being
1:18:53 effectively equivalent for running code, the user’s environment with this remote sandbox?
1:18:59 I’m curious what kind of agents you want for coding. Do you want them to find bugs? Do
1:19:03 you want them to like implement new features? Like what agents do you want?
1:19:08 So by the way, when I think about agents, I don’t think just about coding. I think, so
1:19:13 for the practice of this particular podcast, there’s video editing and, if you
1:19:18 look in Adobe, there’s a lot of code behind it. It’s very poorly documented code, but you
1:19:25 can interact with Premiere, for example, using code, and basically all the uploading, everything
1:19:30 I do on YouTube, everything, as you could probably imagine, I do all that through code, including
1:19:37 translation and overdubbing, all this. So I envision all those kinds of tasks, so automating
1:19:42 many of the tasks that don’t have to do directly with the editing. So that, okay, that’s what
1:19:49 I was thinking about. But in terms of coding, I would be fundamentally thinking about bug
1:19:56 finding, like many levels of kind of bug finding and also bug finding, like logical bugs, not
1:20:03 logical, like spiritual bugs or something. Ones like, sort of, big directions of implementation,
1:20:04 that kind of stuff.
1:20:12 Yeah. I mean, it’s really interesting that these models are so bad at bug finding when
1:20:17 just naively prompted to find a bug. They’re incredibly poorly calibrated.
1:20:21 Even the smartest models, even o1.
1:20:25 How do you explain that? Is there a good intuition?
1:20:31 I think these models are a really strong reflection of the pre-training distribution. And I do
1:20:36 think they generalize as the loss gets lower and lower. But I don’t think the loss and
1:20:42 the scale is quite, the loss is low enough such that they’re like really fully generalizing
1:20:47 in code. Like the things that we use these things for, the frontier models that they’re
1:20:53 quite good at are really code generation and question answering. And these things exist
1:20:57 in massive quantities in pre-training with all of the code on GitHub and the scale of
1:21:04 many, many trillions of tokens and questions and answers on things like Stack Overflow and
1:21:05 maybe GitHub issues.
1:21:12 And so when you try to push on these things that really don’t exist very much online,
1:21:16 like for example, the Cursor Tab objective of predicting the next edit given the edits
1:21:22 done so far, the brittleness kind of shows. And then bug detection is another great example
1:21:26 where there aren’t really that many examples of like actually detecting real bugs and then
1:21:33 proposing fixes and the models just kind of like really struggle at it. But I think it’s
1:21:36 a question of transferring the model like in the same way that you get this fantastic
1:21:43 transfer from pre-trained models just on code in general to the Cursor Tab objective, you’ll
1:21:48 see a very, very similar thing with generalized models that are really good at code to bug
1:21:51 detection. It just takes like a little bit of kind of nudging in that direction.
1:21:55 Like to be clear, I think they sort of understand code really well. Like while they’re being
1:22:02 pre-trained, like the representation that’s being built up like almost certainly like
1:22:07 somewhere in the stream, there’s the model knows that maybe there’s something sketchy
1:22:13 going on, right? It sort of has some sketchiness, but actually eliciting the sketchiness too.
1:22:20 Like actually like part of it is that humans are really calibrated on which bugs are really
1:22:25 important. It’s not just actually saying like there’s something sketchy. It’s like, is this
1:22:29 sketchy trivial, or is this sketchy like you’re going to take the server down?
1:22:30 Yeah.
1:22:35 Like part of it is maybe the cultural knowledge of like why is a staff engineer a staff engineer?
1:22:40 A staff engineer is good because they know that three years ago, like someone wrote a
1:22:46 really sketchy piece of code that took the server down. And as opposed to like, as opposed
1:22:52 to maybe there’s like, you know, you just, this thing is like an experiment. So like
1:22:56 a few bugs are fine. Like you’re just trying to experiment and get the feel of the thing.
1:23:00 And so if the model gets really annoying when you’re writing an experiment, that’s really
1:23:04 bad. But if you’re writing something for super production, you’re like writing a database,
1:23:08 right? You’re writing code in Postgres or Linux or whatever, like you’re Linus Torvalds.
1:23:13 It’s sort of unacceptable to have even an edge case. And just having the calibration
1:23:18 of like, how paranoid is the user? Like,
1:23:22 But even then, like if you’re putting in maximum paranoia, it still just like doesn’t
1:23:23 quite get it.
1:23:24 Yeah. Yeah.
1:23:30 I mean, but this is hard for humans too, to understand what, which line of code is important,
1:23:35 which is not. Like you, I think one of your principles on a website says, if code can
1:23:41 do a lot of damage, one should add a comment that says this, this, this line of code is,
1:23:42 is dangerous.
1:23:51 And all caps, 10 times, 10 times, no, you say like, for every single line of code inside
1:23:56 the function, you have to, and that’s quite profound. That says something about human
1:24:03 beings because the engineers move on, even the same person might just forget how it can
1:24:06 sink the Titanic, a single function. Like you don’t, you might not intuit that quite
1:24:09 clearly by looking at the single piece of code.
1:24:16 Yeah. And I think that, that one is also partially for today’s AI models, where if you actually
1:24:22 write dangerous, dangerous, dangerous in every single line, like the models will pay more
1:24:26 attention to that and will be more likely to find bugs in that region.
1:24:32 That’s actually just straight up a really good practice of labeling code by how much
1:24:34 damage this can do.
1:24:39 Yeah. I mean, it’s controversial. Some people think it’s ugly. Sualeh.
1:24:42 Well, I actually think it’s like, in fact, I actually think this is one of the things
1:24:48 that I learned from Arvid is, you know, like I sort of aesthetically, I don’t like it,
1:24:53 but I think there’s certainly something where like, it’s useful for the models and humans
1:24:59 just forget a lot. And it’s really easy to make a small mistake and cause like, bring
1:25:05 down, you know, like just bring down the server and like, like, of course we like test a lot
1:25:08 of whatever, but there’s always these things that you have to be very careful.
1:25:12 Yeah. Like with just normal docstrings, I think people will often just skim it when
1:25:18 making a change and think, oh, I know how to do this. And you kind of really need to
1:25:22 point it out to them so that that doesn’t slip through.
1:25:26 Yeah. You have to be reminded that you can do a lot of damage. That’s like, we don’t
1:25:31 really think about that. Like, you think about, okay, how do I figure out how this works so
1:25:35 I can improve it. You don’t think about the other direction that it could do this.
1:25:41 Until we have formal verification for everything, then you can do whatever you want and you
1:25:45 know for certain that you have not introduced a bug if the proof passes.
1:25:48 But concretely, what do you think that future would look like?
1:25:56 I think people will just not write tests anymore and the model will suggest, like you write
1:26:03 a function. The model will suggest a spec and you review the spec. And in the meantime,
1:26:08 a smart reasoning model computes a proof that the implementation follows the spec. And I
1:26:10 think that happens for most functions.
1:26:13 Don’t you think this gets at a little bit, some of the stuff you were talking about earlier
1:26:18 with the difficulty of specifying intent for what you want with software? Where sometimes
1:26:22 it might be because the intent is really hard to specify, it’s also then going to be really
1:26:24 hard to prove that it’s actually matching whatever your intent is.
1:26:27 You think that the spec is hard to generate?
1:26:35 Yeah, or just like for a given spec, maybe you can, I think there is a question of like
1:26:39 can you actually do the formal verification? Like is that possible? I think that there’s
1:26:41 like more to dig into there.
1:26:42 But then also-
1:26:43 Even if you have the spec?
1:26:44 If you have the spec-
1:26:45 How do you map the spec?
1:26:47 Even if you have the spec, is the spec written in natural language?
1:26:48 Yeah, how do you map the spec?
1:26:51 No, the spec would be formal.
1:26:56 I think that you care about things that are not going to be easily well specified in the
1:26:57 spec language.
1:26:58 I see, I see.
1:26:59 Yeah.
1:27:00 Yeah.
1:27:01 Maybe an argument against formal verification is all you need.
1:27:02 Yeah.
1:27:04 But there’s this massive docking-
1:27:07 Replacing something like unit tests, sure.
1:27:08 Yeah.
1:27:09 Yeah.
1:27:14 I think you can probably also evolve the spec languages to capture some of the things
1:27:18 that they don’t really capture right now.
1:27:21 But I don’t know, I think it’s very exciting.
1:27:25 And you’re speaking not just about like single functions, you’re speaking about entire code
1:27:26 bases.
1:27:30 I think entire code bases is harder, but that is what I would love to have.
1:27:32 And I think it should be possible.
1:27:38 Because you can even, there’s like a lot of work recently where you can formally
1:27:40 verify down to the hardware.
1:27:44 So like through the, you formally verify the C code and then you formally verify through
1:27:49 the GCC compiler and then through the Verilog down to the hardware.
1:27:53 And that’s like an incredibly big system, but it actually works.
1:27:57 And I think big code bases are sort of similar in that they’re like multi-layered systems.
1:28:02 And if you can decompose it and formally verify each part, then I think it should be possible.
1:28:05 I think the specification problem is a real problem, but.
1:28:07 How do you handle side effects?
1:28:12 Or how do you handle, I guess, external dependencies like calling the Stripe API?
1:28:14 Maybe Stripe would write a spec for their API.
1:28:18 But like, you can’t do this for everything, like, can you do this for everything you use?
1:28:22 Like, how do you, how do you do it for, if there’s a language, like maybe, maybe like
1:28:26 people will use language models as primitives in the programs they write and there’s like
1:28:27 a dependence on it.
1:28:29 And like, how, how do you now include that?
1:28:32 I think you might be able to prove, prove that still.
1:28:33 Prove what about language models?
1:28:40 I think it feels possible that you could actually prove that a language model is aligned,
1:28:46 for example, or like, you can prove that it actually gives the, the right answer.
1:28:47 That’s the dream.
1:28:48 Yeah, that is.
1:28:54 I mean, that’s, if it’s possible, that’s your “I have a dream” speech. That will certainly
1:29:00 help with, you know, making sure your code doesn’t have bugs and making sure AI doesn’t
1:29:01 destroy all of human civilization.
1:29:08 So the, the full spectrum of AI safety to just bug finding, so you said the models struggle
1:29:09 with bug finding.
1:29:10 What’s the hope?
1:29:15 You know, my hope initially is, and I can let Michael, Michael chime in too, but it’s
1:29:21 like this, it should, you know, first help with the stupid bugs, like it should very
1:29:23 quickly catch the stupid bugs.
1:29:27 Like off-by-one errors. Like, sometimes you write something in a comment and do it the other
1:29:28 way.
1:29:29 It’s like very common.
1:29:30 Like I do this.
1:29:33 I write like less than in a comment and like, I maybe write the greater than sign or something
1:29:34 like that.
1:29:39 And the model is like, you look sketchy, like, you sure you want to do that?
1:29:41 But eventually it should be able to catch harder bugs too.
1:29:42 Yeah.
1:29:48 And I think that it’s also important to note that having good bug-finding models
1:29:53 feels necessary to get to the highest reaches of having AI do more and more programming for
1:29:57 you where you’re going to, you know, if the AI is building more and more of the system
1:30:00 for you, you need to not just generate, but also verify.
1:30:04 And without that, some of the problems that we’ve talked about before with programming
1:30:08 with these models will just become untenable.
1:30:13 So it’s not just for humans, like you write a bug, I write a bug, find the bug for me,
1:30:18 but it’s also being able to verify the AI’s code and check it is really important.
1:30:19 Yeah.
1:30:20 And then how do you actually do this?
1:30:23 Like we have had a lot of contentious dinner discussions of how do you actually train a
1:30:24 bug model?
1:30:30 But one very popular idea is, you know, it’s potentially much easier to introduce a bug
1:30:31 than to actually find the bug.
1:30:36 And so you can train a model to introduce bugs in existing code.
1:30:43 And then you can train a reverse bug model that can find bugs using this synthetic
1:30:44 data.
1:30:46 So that’s like one example.
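[Editor's sketch] The synthetic-data idea above can be illustrated with a toy bug injector. This is only a minimal sketch, not the team's method: where the speakers describe training a model to introduce bugs, this uses a hand-written rule that flips comparison operators (the off-by-one style bug mentioned earlier). The names `FlipNthComparison` and `make_bug_pairs` are hypothetical, and `ast.unparse` assumes Python 3.9 or later. Each resulting (buggy, clean) pair could serve as one training example for a bug-finding model.

```python
import ast

# Hypothetical rule set: each comparison operator and its off-by-one neighbor.
SWAPS = {ast.Lt: ast.LtE, ast.LtE: ast.Lt, ast.Gt: ast.GtE, ast.GtE: ast.Gt}


class FlipNthComparison(ast.NodeTransformer):
    """Swap the n-th eligible comparison operator (e.g. < becomes <=)."""

    def __init__(self, target_index):
        self.target_index = target_index
        self.seen = 0  # number of eligible operators visited so far

    def visit_Compare(self, node):
        self.generic_visit(node)  # handle comparisons nested inside operands
        new_ops = []
        for op in node.ops:
            swap = SWAPS.get(type(op))
            if swap is not None:
                if self.seen == self.target_index:
                    op = swap()
                self.seen += 1
            new_ops.append(op)
        node.ops = new_ops
        return node


def make_bug_pairs(source):
    """Return (buggy_source, clean_source) pairs, one per flippable operator."""
    clean = ast.unparse(ast.parse(source))
    pairs = []
    index = 0
    while True:
        flipper = FlipNthComparison(index)
        tree = flipper.visit(ast.parse(source))
        if flipper.seen <= index:  # ran out of operators to flip
            break
        pairs.append((ast.unparse(tree), clean))
        index += 1
    return pairs


if __name__ == "__main__":
    example = "def in_range(i, n):\n    return 0 <= i < n\n"
    for buggy, clean in make_bug_pairs(example):
        print(buggy)
        print("---")
```

A rule-based injector like this only produces shallow bugs; the appeal of a learned injector, as described in the conversation, is that it could produce more realistic ones.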
1:30:49 But yeah, there are lots of ideas for how to achieve this.
1:30:53 You can also do a bunch of work, not even at the model level of taking the biggest models
1:30:58 and then maybe giving them access to a lot of information that’s not just the code.
1:31:02 Because it’s kind of a hard problem to like stare at a file and be like, where’s the bug?
1:31:04 And you know, that’s hard for humans often, right?
1:31:07 And so often you have to run the code and be able to see things like traces and step
1:31:09 through a debugger.
1:31:12 There’s a whole other direction where it kind of tends toward that.
1:31:15 And it could also be that there are kind of two different product form factors here.
1:31:17 It could be that you have a really specialized model that’s quite fast,
1:31:20 that’s kind of running in the background and trying to spot bugs.
1:31:24 And it might be that sometimes, sort of to Arvid’s earlier example about, you know,
1:31:27 some nefarious input box bug, it might be that sometimes you want to like, there’s, you
1:31:30 know, there’s a bug, you’re not just like checking hypothesis free, you’re like, this
1:31:31 is a problem.
1:31:35 I really want to solve it and you zap that with tons and tons and tons of compute and
1:31:39 you’re willing to put in like $50 to solve that bug or something even more.
1:31:41 Have you thought about integrating money into this whole thing?
1:31:46 Like I would pay probably a large amount of money if you found a bug or even generated
1:31:50 code that I really appreciated. Like I had a moment a few days ago when I started using
1:32:01 Cursor where it generated perfect, like perfect, three functions for interacting with the
1:32:08 YouTube API to update captions and for localization, like in different languages.
1:32:11 The API documentation is not very good.
1:32:15 And the code across, like if I Googled it for a while, I couldn’t find exactly, there’s
1:32:19 a lot of confusing information, and Cursor generated it perfectly.
1:32:23 And I was like, I just sat back, I read the code, I was like, this is correct, I tested
1:32:24 it, it’s correct.
1:32:30 I was like, I want a tip button that goes, yeah, here’s $5.
1:32:34 One, that’s really good just to support the company and support what the interface is.
1:32:41 And the other is that it probably sends a strong signal like, good job, right? That’s a much
1:32:43 stronger signal than just accepting the code, right?
1:32:46 You just actually send like a strong good job.
1:32:53 But and for bug finding, obviously, like there’s a lot of people, you know, that would pay
1:32:58 a huge amount of money for a bug, like a bug bounty thing, right?
1:33:00 Do you guys think about that?
1:33:03 Yeah, it’s a controversial idea inside the company.
1:33:10 I think it sort of depends on how much you believe in humanity almost, you know, like,
1:33:15 I think it would be really cool if like, you spend nothing to try to find a bug.
1:33:18 And if it doesn’t find a bug, you spend $0.
1:33:22 And then if it does find a bug, and you click accept, then it also shows like, in parentheses,
1:33:26 like $1, so you spend $1 to accept the bug.
1:33:30 And then of course there’s a worry like, okay, we spent a lot of computation, like maybe
1:33:32 people will just copy paste.
1:33:34 I think that’s a worry.
1:33:39 And then there is also the worry that like introducing money into the product makes it
1:33:43 like kind of, you know, like it doesn’t feel as fun anymore, like you have to like think
1:33:47 about money and all you want to think about is like the code.
1:33:52 And so maybe it actually makes more sense to separate it out and like you pay some fee
1:33:55 like every month, and then you get all of these things for free.
1:33:59 But there could be a tipping component, which is not like it costs us.
1:34:00 It still has that like dollar symbol.
1:34:06 I think it’s fine, but I also see the point where like, maybe you don’t want to introduce
1:34:07 it.
1:34:09 Yeah, I was gonna say the moment that feels like people do this is when they share it,
1:34:13 when they have this fantastic example, they just kind of share it with their friend.
1:34:16 There is also a potential world where there’s a technical solution to this like honor system
1:34:21 problem too, where if we can get to a place where we understand the output of the system
1:34:24 more, I mean, to the stuff we were talking about with like, you know, error checking
1:34:26 with the LSP and then also running the code.
1:34:30 But if you could get to a place where you could actually somehow verify, oh, I have fixed
1:34:35 the bug, maybe then the bounty system doesn’t need to rely on the honor system too.
1:34:40 How much interaction is there between the terminal and the code? Like how much information
1:34:45 is gained if you run the code in the terminal? Like can you do
1:34:51 like a loop where it runs the code and suggests how to change the code if the code
1:34:57 at runtime gives an error? Or right now, are they separate worlds completely?
1:35:01 Like I know you can like do Control K inside the terminal to help you write the code.
1:35:08 You can use terminal context as well inside of chat, Command K, kind of everything.
1:35:12 We don’t have the looping part yet, though we suspect something like this could make
1:35:13 a lot of sense.
1:35:16 There’s a question of whether it happens in the foreground too, or if it happens in
1:35:19 the background like what we’ve been discussing.
1:35:20 Sure.
1:35:21 The background is pretty cool.
1:35:24 Like running the code in different ways. Plus there’s a database side to this,
1:35:29 which is, how do you protect it from modifying the database? But okay.
1:35:32 I mean, there’s certainly cool solutions there.
1:35:41 There’s this new API that is being developed. It’s not in AWS, but, you know, it certainly,
1:35:42 I think it’s in PlanetScale.
1:35:47 I don’t know if PlanetScale was the first one to add it. It’s this ability to sort of add branches
1:35:53 to a database, which is like if you’re working on a feature and you want to test against
1:35:56 the prod database, but you don’t actually want to test against the prod database, you
1:35:59 could sort of add a branch to the database. And the way they do that is to add a branch
1:36:04 to the write-ahead log, and there’s obviously a lot of technical complexity in doing it
1:36:05 correctly.
1:36:09 I guess database companies need new things to do.
1:36:14 They have good databases now.
1:36:19 And I think like Turbopuffer, which is one of the databases we use, is going to add,
1:36:29 maybe, branching to the write-ahead log. And so maybe the AI agents will use branching
1:36:32 to like test against some branch.
1:36:36 And it’s sort of going to be a requirement for the database to like support branching
1:36:37 or something.
1:36:39 It’d be really interesting if you could branch a file system, right?
1:36:40 Yeah.
1:36:43 If you like everything needs branching, it’s like that.
1:36:44 Yeah.
1:36:45 Yeah.
1:36:48 It’s like that’s the problem with the multiverse, right?
1:36:50 If you branch on everything, that’s like a lot.
1:36:53 I mean, there’s obviously these like super clever algorithms to make sure that you don’t
1:36:58 actually use a lot of space or CPU or whatever.
1:36:59 Okay.
1:37:00 This is a good place to ask about infrastructure.
1:37:03 So you guys mostly use AWS?
1:37:04 What are some interesting details?
1:37:05 What are some interesting challenges?
1:37:08 Why did you choose AWS?
1:37:10 Why is AWS still winning?
1:37:11 Hashtag.
1:37:14 AWS is just really, really good.
1:37:15 It’s really good.
1:37:23 Like whenever you use an AWS product, you just know that it’s going to work.
1:37:28 Like it might be absolute hell to go through the steps to set it up.
1:37:30 Why is the interface so horrible?
1:37:32 Because it’s just so good.
1:37:33 It doesn’t need to.
1:37:34 It’s the nature of winning.
1:37:37 I think it’s exactly, it’s just nature of the winning.
1:37:38 Yeah.
1:37:39 Yeah.
1:37:41 But AWS, you can always trust like it will always work.
1:37:45 And if there is a problem, it’s probably your problem.
1:37:46 Yeah.
1:37:47 Okay.
1:37:52 Are there some interesting challenges, since you guys are a pretty new startup, in
1:37:55 scaling to so many people?
1:37:56 Yeah.
1:38:02 I think that it has been an interesting journey, adding, you know, each extra zero
1:38:06 to the requests per second. You run into all of these issues where, you know, the general
1:38:09 components you’re using for caching and databases run into issues as you make things
1:38:10 bigger and bigger.
1:38:13 And now we’re at the scale where we get like, you know, int overflows on our tables and
1:38:15 things like that.
1:38:19 And then also there have been some custom systems that we’ve built, like for instance
1:38:25 our retrieval system for computing a semantic index of your code base and answering questions
1:38:29 about a code base that have continually, I feel like been one of the trickier things
1:38:30 to scale.
1:38:34 I have a few friends who are super, super senior engineers and one of their sort of
1:38:40 lines is like, it’s very hard to predict where systems will break when you scale them.
1:38:45 You can sort of try to predict in advance, like there’s always something weird that’s
1:38:50 going to happen when you add this extra zero and you thought you thought through everything,
1:38:52 but you didn’t actually think through everything.
1:39:01 But I think for that particular system, we’ve, so what the, for concrete details, the thing
1:39:08 we do is obviously we upload when like, we chunk up all of your code and then we send
1:39:12 up sort of the code for, for embedding and we embed the code.
1:39:18 And then we store the embeddings in a database, but we don’t actually store any of the code.
1:39:22 And then there’s reasons around making sure that we don’t introduce client bugs because
1:39:25 we’re very, very paranoid about client bugs.
1:39:34 We store much of the details on the server, like everything is sort of encrypted.
1:39:39 So one of the technical challenges is always making sure that the local index, the local
1:39:44 code base state is the same as the state that is on the server.
1:39:49 And the way, sort of technically we ended up doing that is, so for every single file,
1:39:52 you can sort of keep this hash.
1:39:56 And then for every folder, you can sort of keep a hash, which is the hash of all of
1:39:57 its children.
1:40:00 And you can sort of recursively do that until the top.
1:40:04 And why, why do something, something complicated?
1:40:07 One thing you could do is you could keep a hash for every file.
1:40:11 Then every minute you could try to download the hashes that are on the server, figure
1:40:13 out what are the files that don’t exist on the server.
1:40:17 Maybe you just created a new file, maybe you just deleted a file, maybe you checked out
1:40:23 a new branch and try to reconcile the state between the client and the server.
1:40:29 But that introduces like absolutely ginormous network overhead, both, both on the client
1:40:30 side.
1:40:35 I mean, nobody really wants us to hammer their Wi-Fi all the time if they’re using Cursor.
1:40:39 But also like, I mean, it would introduce like ginormous overhead on the database.
1:40:47 I mean, it would sort of be reading this tens-of-terabytes database, sort of approaching
1:40:54 like 20 terabytes or something, like every second. That’s just kind of crazy.
1:40:56 You definitely don’t want to do that.
1:41:01 So what you do is you just try to reconcile the single hash, which is at
1:41:02 the root of the project.
1:41:06 And then if, if something mismatches, then you go, you find where all the things disagree.
1:41:09 Maybe you look at the children and see if the hashes match and if the hashes don’t match,
1:41:13 go look at their children and so on, but you only do that in this scenario where things
1:41:14 don’t match.
1:41:16 And for most people, most of the time the hashes match.
1:41:20 So it’s a kind of like hierarchical reconciliation of hashes.
1:41:21 Yeah, something like that.
1:41:22 Yeah.
1:41:23 It’s called a Merkle tree.
1:41:24 Yeah.
1:41:25 Yeah.
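[Editor's sketch] The hierarchical hash reconciliation described above can be sketched in a few lines. This is an illustration under stated assumptions, not Cursor's implementation: the in-memory tree layout, SHA-256, and the function names `build_tree` and `changed_paths` are all hypothetical choices. A file is represented as `bytes` and a folder as a `dict` of children. The key property is that the comparison only descends into subtrees whose hashes disagree, so when the root hashes match nothing else is compared.

```python
import hashlib


def build_tree(node):
    """Build a hash tree. `node` is bytes (a file) or a dict of name -> node (a folder)."""
    if isinstance(node, bytes):
        return {"hash": hashlib.sha256(node).hexdigest(), "children": None}
    children = {name: build_tree(child) for name, child in sorted(node.items())}
    hasher = hashlib.sha256()
    for name, child in children.items():
        # A folder's hash covers the names and hashes of all of its children.
        hasher.update(name.encode("utf-8"))
        hasher.update(b"\0")
        hasher.update(child["hash"].encode("ascii"))
    return {"hash": hasher.hexdigest(), "children": children}


def changed_paths(local, remote, path=""):
    """Return paths that differ, descending only where hashes mismatch."""
    if local is None or remote is None:
        return [path]  # created or deleted on one side
    if local["hash"] == remote["hash"]:
        return []  # whole subtree matches, nothing below needs checking
    if local["children"] is None or remote["children"] is None:
        return [path]  # a file changed (or a file was replaced by a folder)
    out = []
    names = sorted(set(local["children"]) | set(remote["children"]))
    for name in names:
        out += changed_paths(
            local["children"].get(name),
            remote["children"].get(name),
            f"{path}/{name}",
        )
    return out


if __name__ == "__main__":
    client = {"src": {"a.py": b"print(1)", "b.py": b"print(2)"}, "README.md": b"hi"}
    server = {"src": {"a.py": b"print(1)", "b.py": b"print(3)"}, "README.md": b"hi"}
    print(changed_paths(build_tree(client), build_tree(server)))  # ['/src/b.py']
```

In a real client and server setup only the hashes would cross the network, one level at a time, which is what keeps the common case (matching roots) down to a single comparison.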
1:41:28 This is cool to see that you kind of have to think through all these problems.
1:41:32 And I mean, the point of, like the reason it’s gotten hard is just because like the number
1:41:38 of people using it and, you know, some of your customers have really, really large code
1:41:43 bases to the point where, you know, we, we originally reordered our code base, which
1:41:48 is, which is big, but I mean, it’s just not the size of some company that’s been there
1:41:52 for 20 years and sort of has a ginormous number of files and you sort of want to scale that
1:41:53 across programmers.
1:41:58 There’s all these details where like building the simple thing is easy, but scaling it to
1:42:02 a lot of people, like a lot of companies is obviously a difficult problem, which is sort
1:42:06 of, you know, independent of actually, so that there’s part of this scaling our current
1:42:11 solution is also, you know, coming up with new ideas that obviously we’re working on,
1:42:14 but then, but then scaling all of that in the last few weeks, months.
1:42:15 Yeah.
1:42:18 And there are a lot of clever things, like additional things that, that go into this
1:42:24 indexing system. For example, the bottleneck in terms of costs is not storing things in
1:42:27 the vector database or the database. It’s actually embedding the code. And you don’t
1:42:32 want to re-embed the code base for every single person in a company that is using the same
1:42:37 exact code, except for maybe there’s a branch with a few different files or they’ve made
1:42:41 a few local changes. And so, because again, embeddings are the bottleneck, you can do this
1:42:45 one clever trick and not have to worry about like the complexity of dealing with branches
1:42:53 and the other databases, where you just have some cache on the actual vectors computed
1:43:00 from the hash of a given chunk. And so this means that when the nth person at a company
1:43:04 goes and embeds their code base, it’s really, really fast. And you do all this without
1:43:08 actually storing any code on our servers at all. No code data is stored. We just store
1:43:12 the vectors in the vector database and the vector cache.
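[Editor's sketch] The caching trick described above, keying stored vectors by the hash of a chunk, can be illustrated as follows. This is a minimal sketch under assumptions: the class name `EmbeddingCache`, the in-memory dict, and the `toy_embed` stand-in for a real embedding model are all hypothetical. Note that the cache maps hash to vector and never keeps the chunk text itself, which matches the point about not storing code.

```python
import hashlib


class EmbeddingCache:
    """Cache embedding vectors keyed by the SHA-256 hash of the chunk text."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # the expensive call we want to avoid repeating
        self._vectors = {}         # chunk hash -> vector; no source text is kept
        self.misses = 0

    def embed(self, chunk):
        key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if key not in self._vectors:
            self.misses += 1
            self._vectors[key] = self._embed_fn(chunk)
        return self._vectors[key]


def toy_embed(chunk):
    """Stand-in for a real embedding model: a tiny deterministic vector."""
    digest = hashlib.sha256(chunk.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:4]]


if __name__ == "__main__":
    cache = EmbeddingCache(toy_embed)
    first_user = ["def f(): pass", "def g(): pass"]
    second_user = ["def f(): pass", "def g(): return 1"]  # one local change
    for chunk in first_user + second_user:
        cache.embed(chunk)
    print(cache.misses)  # 3: only the changed chunk is embedded again
```

Because identical chunks hash to the same key regardless of which branch or user they come from, the cache sidesteps branch tracking entirely.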
1:43:18 What are the biggest gains at this time you get from indexing the code base? Just
1:43:23 out of curiosity, like what benefit do users have? It seems like longer term,
1:43:27 there’ll be more and more benefit. But in the short term, just asking questions of the
1:43:32 code base. What, what’s the use, what’s the usefulness of that?
1:43:39 I think the most obvious one is just you want to find out where something is happening in
1:43:44 your large code base. And you sort of have a fuzzy memory of, okay, I want to find the
1:43:49 place where we do X. But you don’t exactly know what to search for in a normal text search.
1:43:54 So you ask the chat, you hit Command Enter to ask with the code base chat. And then
1:43:58 very often it finds the right place that you were thinking of.
1:44:02 I think, like you mentioned, in the future, I think this is only going to get
1:44:08 more and more powerful where we’re working a lot on improving the quality of our retrieval.
1:44:11 And I think the ceiling for that is really, really much higher than people give credit
1:44:12 for.
1:44:16 One question that’s good to ask here, have you considered and why haven’t you much
1:44:21 done sort of local stuff to where you can do the, it seems like everything we just discussed
1:44:25 is exceptionally difficult to do. To go, to go to the cloud, you have to think about
1:44:32 all these things with the caching and the, you know, large code base with a large number
1:44:35 of programmers are using the same code base. You have to figure out the puzzle of that.
1:44:41 A lot of it, you know, most software just does stuff, this heavy computational stuff
1:44:45 locally. Have you considered doing sort of embeddings locally?
1:44:49 Yeah, we thought about it. And I think it would be cool to do it locally. I think it’s
1:44:55 just really hard. And, and one thing to keep in mind is that, you know, some of our users
1:45:00 use the latest MacBook Pro, but most of our users, like more than 80% of our users,
1:45:07 are on Windows machines, and many of them are not very powerful. And so local models
1:45:14 really only work on the latest computers. And it’s also a big overhead to build
1:45:19 that in. And so even if we would like to do that, it’s currently not something that we
1:45:23 are able to focus on. And I think there are, there are some people that, that, that do that.
1:45:29 And I think that’s great. But especially as models get bigger and bigger and you want
1:45:34 to do fancier things with like bigger models, it becomes even harder to do it locally.
1:45:39 Yeah. And it’s not a problem with like weaker computers. It’s just that, for example, if
1:45:45 you’re some big company, you have big company code base, it’s just really hard to process
1:45:49 big company code base, even on the beefiest MacBook Pros. So even if it’s not even a
1:45:55 matter of like, if you’re, if you’re just like a student or something, I think if you’re
1:46:00 like the best programmer at a big company, you’re still going to have a horrible experience
1:46:05 if you do everything locally. You could, you could do edge and sort of scrape by, but like,
1:46:07 again, it wouldn’t be fun anymore.
1:46:10 Yeah. Like approximate nearest neighbors on this massive code base is going to just
1:46:17 eat up your memory and your CPU. And that’s just that. Like, let’s talk about
1:46:22 also the modeling side where, as Arvid said, there are these massive headwinds
1:46:29 against local models where, one, things seem to move towards MoEs, where like one benefit
1:46:35 is maybe they’re more memory-bandwidth bound, which plays in favor of local versus using
1:46:43 GPUs or using NVIDIA GPUs. But the downside is these models are just bigger in total.
1:46:47 And you know, they’re going to need to fit often not even on a single node but multiple
1:46:53 nodes. There’s no way that’s going to fit inside of even really good MacBooks. And I
1:46:59 think especially for coding, it’s not a question as much of like, does it clear some bar of
1:47:04 like, the model’s good enough to do these things? And then like we’re satisfied, which may be
1:47:08 the case for other other problems and maybe where local models shine, but people are always
1:47:13 going to want the best, the most intelligent, the most capable things. And that’s going
1:47:17 to be really, really hard to run for almost all people locally.
1:47:22 Don’t you want the most capable model? Like you want Sonnet too?
1:47:23 And also with o1.
1:47:29 I like how you’re pitching me. Would you be satisfied with an inferior model? Listen,
1:47:34 I’m, yes, I’m one of those, but there’s some people that like to do stuff locally, especially
1:47:40 like really, there’s a whole obviously open source movement that kind of resists and it’s
1:47:46 good that they exist actually because you want to resist the power centers that are
1:47:47 growing.
1:47:52 There’s actually an alternative to local models that I am particularly fond of. I think it’s
1:47:58 still very much in the research stage, but you could imagine to do homomorphic encryption
1:48:03 for language model inference. So you encrypt your input on your local machine, then you
1:48:10 send that up and then the server can use lots of computation. They can run models that you
1:48:14 cannot run locally on this encrypted data, but they cannot see what the data is. And
1:48:18 then they send back the answer and you decrypt the answer and only you can see the answer.
1:48:25 So I think that’s still very much research, and all of it is about trying to make the
1:48:30 overhead lower because right now the overhead is really big. But if you can make that happen,
1:48:36 I think that would be really, really cool and I think it would be really, really impactful
1:48:39 because I think one thing that’s actually kind of worrisome is that as these models
1:48:44 get better and better, they’re going to become more and more economically useful and so more
1:48:52 and more of the world’s information and data will flow through one or two centralized actors.
1:48:58 And then there are worries about there can be traditional hacker attempts, but it also
1:49:04 creates this kind of scary part where if all of the world’s information is flowing through
1:49:12 one node in plain text, you can have surveillance in very bad ways. And sometimes that will happen
1:49:18 for, you know, initially will be like good reasons like people will want to try to protect
1:49:23 against like bad actors using AI models in bad ways. And then you will add in some surveillance
1:49:27 code and then someone else will come in and, you know, you’re in a slippery slope and then
1:49:36 you start doing bad things with a lot of the world’s data. And so I’m very hopeful that
1:49:38 we can solve homomorphic encryption for language modeling.
1:49:42 Yeah, doing privacy preserving machine learning, but I would say like that’s the challenge
1:49:48 we have with all software these days. It’s like there’s so many features that can be
1:49:53 provided from the cloud, and all of us increasingly rely on it and it makes our life awesome, but
1:49:56 there’s downsides and that’s why you rely on really good security to protect from basic
1:50:03 attacks. But there’s also only a small set of companies that are controlling that data,
1:50:07 you know, and they obviously have leverage and they could be infiltrated in all kinds
1:50:09 of ways. That’s the world we live in.
1:50:14 Yeah, I mean, the thing I’m just actually quite worried about is sort of the world where
1:50:21 I mean, Anthropic has this responsible scaling policy, and so we’re on like the low ASLs,
1:50:26 which is the Anthropic security level or whatever, of the models, but as we get to like, quote
1:50:36 unquote, ASL 3, ASL 4, whatever models, which are sort of very powerful, for mostly
1:50:41 reasonable security reasons, you would want to monitor all the prompts. But I think that’s
1:50:46 sort of reasonable and understandable where everyone is coming from, but man, it’d be
1:50:52 really horrible if all the world’s information is sort of monitored that heavily. It’s way
1:50:59 too centralized. It’s like this really fine line you’re walking where on the one side,
1:51:05 you don’t want the models to go rogue. On the other side, it’s humans. I don’t know
1:51:11 if I trust all the world’s information to pass through three model providers.
1:51:19 What do you think is different than cloud providers? Because I think a lot of this data would never
1:51:27 have gone to the cloud providers in the first place where this is often like you want to
1:51:31 give more data to the AI models. You want to give personal data that you would never
1:51:38 have put online in the first place to these companies or to these models. And it also
1:51:47 centralizes control, where right now, for cloud, you can often use your own encryption
1:51:56 keys and it just can’t really do much. But here is just centralized actors that see the
1:51:59 exact plaintext of everything.
1:52:03 On the topic of context, that’s actually been a friction for me. When I’m writing code in
1:52:09 Python, there’s a bunch of stuff imported. You could probably intuit the kind of stuff
1:52:17 I would like to include in the context. How hard is it to auto figure out the context?
1:52:24 It’s tricky. I think we can do a lot better at computing the context automatically in
1:52:28 the future. One thing that’s important to notice, there are trade-offs with including
1:52:34 automatic context. The more context you include for these models, first of all, the slower
1:52:39 they are. And the more expensive those requests are, which means you can then do less model
1:52:44 calls and do less fancy stuff in the background. Also, for a lot of these models, they get
1:52:49 confused if you have a lot of information in the prompt. The bar for accuracy and for
1:52:57 relevance of the context you include should be quite high. But already we do some automatic
1:53:00 context in some places within the product. It’s definitely something we want to get a
1:53:03 lot better at.
1:53:11 I think that there are a lot of cool ideas to try there. Both on the learning better retrieval
1:53:16 systems, like better embedding models, better re-rankers, I think that there are also cool
1:53:21 academic ideas. Stuff we’ve tried out internally, but also the field is grappling with writ
1:53:26 large, about can you get language models to a place where you can actually just have the
1:53:31 model itself understand a new corpus of information. And the most popular, talked about version
1:53:34 of this is, can you make the context windows infinite? Then if you make the context windows
1:53:38 infinite, can you make the model actually pay attention to the infinite context? And
1:53:41 then after you can make it pay attention to the infinite context, to make it somewhat
1:53:45 feasible to actually do it, can you then do caching for that infinite context? You don’t
1:53:47 have to re-compute that all the time.
1:53:51 But there are other cool ideas that are being tried that are a little bit more analogous
1:53:56 to fine-tuning of actually learning this information and the weights of the model. And it might
1:54:01 be that you actually get sort of a qualitatively different type of understanding if you do it
1:54:04 more at the weight level than if you do it at the in-context learning level. I think
1:54:08 the jury is still a little bit out on how this is all going to work in
1:54:13 the end. But in the interim, us as a company, we are really excited about better retrieval
1:54:16 systems and picking the parts of the code base that are most relevant to what you’re
1:54:18 doing. We could do that a lot better.
1:54:23 Like one interesting proof of concept for the learning this knowledge directly in the
1:54:31 weights is with VS Code. So we’re in a VS Code fork and VS Code, the code is all public.
1:54:36 So these models in pre-training have seen all the code. They’ve probably also seen questions
1:54:41 and answers about it. And then they’ve been fine-tuned in RLHF to be able to answer questions
1:54:45 about code in general. So when you ask it a question about VS Code, you know, sometimes
1:54:51 it’ll hallucinate, but sometimes it actually does a pretty good job at answering the question.
1:54:56 And I think like this is just by, it happens to be okay. But what if you could actually
1:55:02 like specifically train or post train a model such that it really was built to understand
1:55:08 this code base? It’s an open research question, one that we’re quite interested in. And then
1:55:12 there’s also uncertainty of like, do you want the model to be the thing that end to end is
1:55:16 doing everything, i.e., it’s doing the retrieval in its internals, and then kind of answering
1:55:22 the question, creating the code? Or do you want to separate the retrieval from the frontier
1:55:26 model where maybe, you know, you’ll get some really capable models that are much better
1:55:32 than like the best open source ones in a handful of months. And then you’ll want to separately
1:55:36 train a really good open source model to be the retriever, to be the thing that feeds
1:55:42 in the context to these larger models. Can you speak a little more to the post training
1:55:48 a model to understand the code base? What do you mean by that? Is this a synthetic data
1:55:54 direction? Yeah, I mean, there are many possible ways you could try doing it. There’s certainly
1:55:58 no shortage of ideas. It’s just a question of going in and like trying all of them and
1:56:04 being empirical about which one works best. You know, one very naive thing is to try to
1:56:11 replicate what’s done with VS Code and these frontier models. So let’s like continue
1:56:14 pre-training, some kind of continued pre-training that includes general code data, but also
1:56:20 throws in a lot of the data of some particular repository that you care about. And then in
1:56:25 post training, meaning in let’s just start with instruction fine tuning, you have like
1:56:29 a normal instruction fine tuning data set about code, but you throw in a lot of questions
1:56:36 about code in that repository. So you could either get ground truth ones, which might
1:56:39 be difficult, or you could do what you kind of hinted at or suggested using synthetic
1:56:49 data, i.e. kind of having the model ask questions about various pieces of the code. So you kind
1:56:54 of take the pieces of the code, then prompt the model or have a model propose a question
1:56:59 for that piece of code, and then add those as instruction fine tuning data points. And
1:57:04 then in theory, this might unlock the model’s ability to answer questions about that code
1:57:05 base.
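[Editor's sketch] The synthetic instruction-tuning recipe described above (take pieces of the code, have a model propose a question for each piece, add the result as a fine-tuning data point) might look roughly like this. Everything here is an assumption for illustration: the chunk size, the record fields, and especially `template_qa`, which is a hard-coded stand-in for the model call that would actually write the question and answer.

```python
import json


def chunk_lines(source, max_lines=20):
    """Split source text into consecutive chunks of at most max_lines lines."""
    lines = source.splitlines()
    return ["\n".join(lines[i:i + max_lines]) for i in range(0, len(lines), max_lines)]


def template_qa(path, chunk):
    """Placeholder for a model call that proposes a question about a chunk."""
    first_line = chunk.strip().splitlines()[0]
    question = f"In {path}, what does the code starting with `{first_line}` do?"
    answer = chunk  # placeholder: a real pipeline would have a model write this
    return question, answer


def build_examples(files, propose_qa=template_qa):
    """Turn a mapping of path -> source into instruction-tuning records."""
    examples = []
    for path, source in files.items():
        for chunk in chunk_lines(source):
            if not chunk.strip():
                continue  # skip blank chunks
            question, answer = propose_qa(path, chunk)
            examples.append({"instruction": question, "response": answer, "path": path})
    return examples


def to_jsonl(examples):
    """Serialize records as JSON Lines, a common format for fine-tuning data."""
    return "\n".join(json.dumps(example) for example in examples)


if __name__ == "__main__":
    repo = {"util.py": "def add(a, b):\n    return a + b\n"}
    print(to_jsonl(build_examples(repo)))
```

These records would be mixed into a normal code instruction dataset, as the speaker describes; whether this actually unlocks repository-specific understanding is, in his words, an empirical question.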
1:57:11 Let me ask you about OpenAI o1. What do you think is the role of that kind of test time
1:57:13 compute system in programming?
1:57:18 I think test time compute is really, really interesting. So there’s been the pre-training
1:57:24 regime, which will kind of, as you scale up the amount of data and the size of your model,
1:57:29 get you better and better performance, both on loss and then on downstream benchmarks,
1:57:35 and just general performance when we use it for coding or other tasks. We’re starting
1:57:41 to hit a bit of a data wall, meaning it’s going to be hard to continue scaling up this
1:57:47 regime. And so scaling up test time compute is an interesting way of now increasing the
1:57:54 number of inference time flops that we use, but still getting like, as you increase the
1:57:59 number of flops you use inference time, getting corresponding improvements in the performance
1:58:02 of these models. Traditionally, we just had to literally train a bigger model that always
1:58:06 uses, that always used that many more flops. But now we could perhaps use the same size
1:58:12 model and run it for longer to be able to get an answer at the quality of a much larger
1:58:17 model. And so the really interesting thing I like about this is there are some problems
1:58:22 that perhaps require 100 trillion parameter model intelligence trained on 100 trillion
1:58:30 tokens. But that’s like maybe 1%, maybe like 0.1% of all queries. So are you going to spend
1:58:36 all of this effort, all this compute training a model that costs that much and then run
1:58:43 it so infrequently? It feels completely wasteful when instead you get the model that can, you
1:58:48 train the model that’s capable of doing the 99.9% of queries, then you have a way of inference
1:58:54 time running it longer for those few people that really, really want max intelligence.
1:59:00 How do you figure out which problem requires what level of intelligence? Is that possible
1:59:06 to dynamically figure out when to use GPT-4, when to use a small model and when you need
1:59:09 o1?
1:59:14 I mean, yeah, that’s an open research problem, certainly. I don’t think anyone’s actually
1:59:20 cracked this model routing problem quite well. We have initial implementations
1:59:26 of this for something like Cursor Tab. But at the level of going between
1:59:33 4o, Sonnet, and o1, it’s a bit trickier. There’s also the question of what level of intelligence
1:59:41 you need to determine if the thing is too hard for the 4-level model. Maybe you need
1:59:46 the o1-level model. It’s really unclear.
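The routing idea described above can be sketched as a simple threshold scheme. This is only an illustration, not how Cursor routes requests: the tier names, the thresholds, and the `toy_difficulty` heuristic are all hypothetical, and as the speaker notes, estimating difficulty well is itself an open research problem.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Tier:
    name: str              # hypothetical label, not a real model identifier
    max_difficulty: float  # route here if estimated difficulty <= this value


def route(query: str, tiers: List[Tier],
          estimate_difficulty: Callable[[str], float]) -> str:
    """Return the name of the cheapest tier whose threshold covers the query."""
    score = estimate_difficulty(query)
    for tier in sorted(tiers, key=lambda t: t.max_difficulty):
        if score <= tier.max_difficulty:
            return tier.name
    # No tier covers the score: fall back to the most capable tier.
    return max(tiers, key=lambda t: t.max_difficulty).name


def toy_difficulty(query: str) -> float:
    """Placeholder heuristic (assumption): longer queries are harder, capped at 1.0."""
    return min(len(query.split()) / 50.0, 1.0)


TIERS = [
    Tier("small-model", 0.2),
    Tier("mid-model", 0.6),
    Tier("reasoning-model", 1.0),
]

if __name__ == "__main__":
    print(route("rename this variable", TIERS, toy_difficulty))
```

In practice the difficulty estimator would itself be a learned model, which is exactly where the "how smart does the router need to be" question comes from.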
1:59:51 You mentioned there’s a pre-training process, then there’s post-training, and then there’s
1:59:57 like test time compute. Is that fair, to sort of separate them? Where’s the biggest gains?
2:00:02 Well, it’s weird because like test time compute, there’s like a whole training strategy needed
2:00:06 to get test time compute to work. And the other really weird thing about
2:00:13 this is no one, like outside of the big labs and maybe even just OpenAI, no one really
2:00:18 knows how it works. Like there’ve been some really interesting papers that show hints of
2:00:25 what they might be doing. And so perhaps they’re doing something with tree search using process
2:00:31 reward models. But yeah, I think the issue is we don’t quite know exactly what it looks
2:00:35 like. So it would be hard to kind of comment on like where it fits in. I would put it in
2:00:39 post-training, but maybe the compute spent on this kind of, for getting test time
2:00:45 compute to work for a model is going to dwarf pre-training eventually.
2:00:50 So we don’t even know if o1 is using just like chain of thought, RL, we don’t know how they’re
2:00:53 using any of these. We don’t know anything.
2:00:57 It’s fun to speculate.
2:01:01 If you were to build a competing model, what would you do?
2:01:06 Yeah, so one thing to do would be, I think you probably need to train a process reward
2:01:11 model. So maybe we can get into reward models, and outcome reward models versus process
2:01:16 reward models. Outcome reward models are the kind of traditional reward models that people
2:01:21 train for language modeling. And it’s just looking at
2:01:24 the final thing. So if you’re doing some math problem, let’s look at that final thing: you’ve
2:01:30 done everything, and let’s assign a grade to it, like what’s the reward
2:01:35 for this outcome. Process reward models instead try to grade the chain of thought.
2:01:42 And so OpenAI had some preliminary paper on this I think last summer where they use human
2:01:47 labellers to get this pretty large, several-hundred-thousand-example dataset grading chains
2:01:48 of thought.
2:01:54 Ultimately, it feels like I haven’t seen anything interesting in the ways that people use process
2:02:02 reward models outside of just using it as a means of affecting how we choose between
2:02:06 a bunch of samples. So like what people do in all these papers is they sample a bunch
2:02:12 of outputs from the language model and then use the process reward models to grade all
2:02:16 those generations alongside maybe some other heuristics and then use that to choose the
2:02:18 best answer.
2:02:23 The really interesting thing that people think might work and people want to work is tree
2:02:28 search with these process reward models because if you really can grade every single step
2:02:34 of the chain of thought, then you can kind of branch out and explore multiple paths of
2:02:38 this chain of thought and then use these process reward models to evaluate how good is this
2:02:40 branch that you’re taking?
2:02:45 Yeah when the quality of the branch is somehow strongly correlated with the quality of the
2:02:49 outcome at the very end. It’s like you have a good model of knowing which branch to take.
2:02:52 So not just in the short term but also in the long term.
2:02:55 And the interesting work that I think has been done is figuring out how to properly
2:03:01 train the process, or the interesting work that has been open sourced and that people I think
2:03:07 talk about is how to train the process reward models, maybe in a more automated way. I could
2:03:12 be wrong here, I could be failing to mention some papers. I haven’t seen anything that
2:03:17 seems to work really well for using the process reward models creatively to do tree search
2:03:18 in code.
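As a rough illustration of the tree-search idea, here is a beam search over reasoning steps guided by a step-level scorer. Everything task-specific is a toy assumption: the "reasoning steps" are arithmetic operations, and `toy_score_step` plays the role of a process reward model by rewarding closeness to a target number. Nothing here reflects how o1 or any published system works.

```python
from typing import Callable, List, Tuple

Chain = List[str]


def prm_beam_search(expand: Callable[[Chain], List[str]],
                    score_step: Callable[[Chain, str], float],
                    is_final: Callable[[Chain], bool],
                    beam_width: int = 2,
                    max_depth: int = 4) -> Chain:
    """Branch on candidate next steps, keeping the top-scoring partial chains."""
    beam: List[Tuple[Chain, float]] = [([], 1.0)]
    for _ in range(max_depth):
        grown: List[Tuple[Chain, float]] = []
        for chain, score in beam:
            if chain and is_final(chain):
                grown.append((chain, score))  # finished chains carry over
                continue
            for step in expand(chain):
                grown.append((chain + [step], score * score_step(chain, step)))
        if not grown:
            break
        beam = sorted(grown, key=lambda pair: pair[1], reverse=True)[:beam_width]
        if all(is_final(chain) for chain, _ in beam):
            break
    return beam[0][0]


# --- Toy task (assumption): start at 1 and reach TARGET using "+3" or "*2". ---
TARGET = 10


def value(chain: Chain) -> int:
    x = 1
    for op in chain:
        x = x + 3 if op == "+3" else x * 2
    return x


def toy_expand(chain: Chain) -> List[str]:
    return ["+3", "*2"]


def toy_score_step(chain: Chain, step: str) -> float:
    """Stand-in for a PRM: steps that land closer to the target score higher."""
    return 1.0 / (1.0 + abs(TARGET - value(chain + [step])))


def toy_is_final(chain: Chain) -> bool:
    return value(chain) == TARGET


if __name__ == "__main__":
    print(prm_beam_search(toy_expand, toy_score_step, toy_is_final))
```

The search only helps to the extent that step scores correlate with final outcome quality, which is the condition raised in the conversation.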
2:03:23 This is kind of an AI safety, maybe a bit of a philosophy question. So OpenAI says that
2:03:27 they’re hiding the chain of thought from the user. And they’ve said that that was a difficult
2:03:33 decision to make. They instead of showing the chain of thought, they’re asking the model
2:03:37 to summarize the chain of thought. They’re also in the background saying they’re going
2:03:42 to monitor the chain of thought to make sure the model is not trying to manipulate the user,
2:03:46 which is a fascinating possibility. But anyway, what do you think about hiding the chain of
2:03:47 thought?
2:03:51 One consideration for OpenAI, and this is completely speculative, could be that they
2:03:56 want to make it hard for people to distill these capabilities out of their model. It
2:04:00 might actually be easier if you had access to that hidden chain of thought to replicate
2:04:05 the technology. Because that’s pretty important data like seeing the steps that the model took
2:04:06 to get to the final result.
2:04:08 So you could probably train on that also.
2:04:12 And there was sort of a mirror situation with this, with some of the large language model
2:04:20 providers, and also this is speculation. But some of these APIs used to offer easy access
2:04:25 to log probabilities for all the tokens that they’re generating. And also log probabilities
2:04:30 for the prompt tokens. And then some of these APIs took those away. And again, complete speculation.
2:04:35 But one of the thoughts is that the reason those were taken away is if you have access
2:04:39 to log probabilities, similar to this hidden chain of thought, that can give you even more
2:04:44 information to try and distill these capabilities out of the APIs, out of these biggest models,
2:04:46 into models you control.
2:04:54 As an asterisk on also the previous discussion about us integrating o1, I think that we’re
2:04:55 still learning how to use this model.
2:05:01 So we made o1 available in Cursor because when we got the model, we were really interested
2:05:05 in trying it out. I think a lot of programmers are going to be interested in trying it out.
2:05:13 But o1 is not part of the default Cursor experience in any way yet. And we still haven’t found
2:05:21 a way to integrate it into the editor in a way that we reach for every hour, maybe
2:05:22 even every day.
2:05:30 And so I think the jury is still out on how to use the model. And we haven’t seen examples
2:05:35 yet of people releasing things where it seems really clear like, oh, that’s like now the
2:05:36 use case.
2:05:40 The obvious one to turn to is maybe this can make it easier for you to have these background
2:05:46 things running, to have these models in loops, to have these models be agentic. But we’re
2:05:48 still discovering.
2:05:54 To be clear, we have ideas. We just need to try and get something incredibly useful before
2:05:56 we put it out there.
2:06:04 But it has these significant limitations. Even barring capabilities, it does not stream.
2:06:08 And that means it’s really, really painful to use for things where you want to supervise
2:06:13 the output. And instead you’re just waiting for the wall of text to show up.
2:06:17 Also, it does feel like the early innings of test time compute and search, where it’s just
2:06:25 very, very much a v0. And there’s so many things that don’t feel quite right.
2:06:32 And I suspect in parallel to people increasing the amount of pre-training data and the size
2:06:36 of the models and pre-training and finding tricks there, you’ll now have this other thread
2:06:40 of getting search to work better and better.
2:06:50 So let me ask you about strawberry tomorrow eyes. So it looks like GitHub Copilot might
2:06:56 be integrating o1 in some kind of way. And I think some of the comments are saying, does
2:07:01 this mean Cursor is done? I think I saw one comment saying that.
2:07:03 I thought, time to shut down Cursor.
2:07:05 Time to shut down Cursor. Thank you.
2:07:07 So is it time to shut down Cursor?
2:07:14 I think this space is a little bit different from past software spaces over the 2010s where
2:07:18 I think that the ceiling here is really, really, really incredibly high. And so I think that
2:07:22 the best product in three to four years will just be so much more useful than the best
2:07:30 product today. And you can wax poetic about moats this and brand that. And this is our
2:07:34 advantage. But I think in the end, just if you don’t have, like if you stop innovating
2:07:40 on the product, you will lose. And that’s also great for startups. That’s great for
2:07:44 people trying to enter this market because it means you have an opportunity to win against
2:07:50 people who have, you know, lots of users already by just building something better.
2:07:55 And so I think, yeah, over the next few years, it’s just about building the best product,
2:08:01 building the best system, and that both comes down to the modeling engine side of things.
2:08:03 And it also comes down to the editing experience.
2:08:08 Yeah. I think most of the additional value from Cursor versus everything else out there
2:08:15 is not just integrating the new model fast, like o1. It comes from all of the kind
2:08:19 of depth that goes into these custom models that you don’t realize are working for you
2:08:25 in kind of every facet of the product, as well as like the really thoughtful UX with
2:08:27 every single feature.
2:08:32 All right. From that profound answer, let’s descend back down to the technical. You mentioned
2:08:34 you have a taxonomy of synthetic data.
2:08:35 Oh, yeah.
2:08:36 Can you please explain?
2:08:43 Yeah. I think there are three main kinds of synthetic data. The first is so what is synthetic
2:08:49 data first? So there’s normal data, like non-synthetic data, which is just data that’s naturally
2:08:55 created, i.e. usually it’ll be from humans having done things. So from some human process,
2:09:01 you get this data. Synthetic data, the first one would be distillation. So having a language
2:09:08 model kind of output tokens or probability distributions over tokens. And then you can
2:09:14 train some less capable model on this. This approach is not going to get you a net like
2:09:19 more capable model than the original one that has produced the tokens. But it’s really useful
2:09:24 for if there’s some capability you want to elicit from some really expensive high latency
2:09:31 model, you can then distill that down into some smaller task specific model. The second
2:09:40 kind is when one direction of the problem is easier than the reverse. And so a great
2:09:46 example of this is bug detection, like we mentioned earlier, where it’s a lot easier
2:09:51 to introduce reasonable looking bugs than it is to actually detect them. And this is
2:09:58 probably the case for humans too. And so what you can do is you can get a model that’s not
2:10:02 trained on that much data, that’s not that smart, to introduce a bunch of bugs in code. And
2:10:06 then you can use that synthetic data to train a model that can be
2:10:12 really good at detecting bugs. The last category I think is I guess the main one that feels
2:10:19 like the big labs are doing for synthetic data, which is producing text with language
2:10:26 models that can then be verified easily. So like, you know, an extreme example of this
2:10:31 is if you have a verification system that can detect if language is Shakespeare level
2:10:36 and then you have a bunch of monkeys typing on typewriters, like you can eventually get
2:10:39 enough training data to train a Shakespeare level language model. And I mean, this is
2:10:45 the case, like very much the case for math, where verification is actually really, really
2:10:52 easy for formal, formal languages. And then what you can do is you can have an OK model
2:10:58 generate a ton of rollouts, and then choose the ones that you know have actually proved
2:11:01 the ground truth theorems, and train on that further. There’s similar things you can
2:11:07 do for code with LeetCode-like problems, where if you have some set of tests that you
2:11:12 know correspond to, if something passes these tests, it has actually solved the problem,
2:11:14 you could do the same thing, where we verify that it’s passed the tests and then train the
2:11:19 model on the outputs that have passed the tests. I think it’s gonna be a little tricky getting
2:11:26 this to work in all domains or just in general, like having the perfect verifier feels really,
2:11:31 really hard to do with just, like, open-ended miscellaneous tasks you give the model, or
2:11:35 more like long horizon tasks, even in coding.
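The third category, generating many candidates and keeping only the ones a verifier accepts, can be sketched for code using unit tests as the verifier. The candidate strings below stand in for model rollouts and are made up for illustration. Running generated code with `exec` is only acceptable for a trusted toy like this; real model output would need a sandbox.

```python
from typing import Dict, List, Tuple

TestCase = Tuple[tuple, object]  # (positional arguments, expected return value)


def passes_tests(source: str, func_name: str, tests: List[TestCase]) -> bool:
    """Return True if `source` defines `func_name` and it passes every test."""
    namespace: Dict[str, object] = {}
    try:
        exec(source, namespace)  # unsafe for untrusted code; toy example only
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False


def filter_verified(candidates: List[str], func_name: str,
                    tests: List[TestCase]) -> List[str]:
    """Keep only candidates that pass the tests, e.g. to use as training data."""
    return [src for src in candidates if passes_tests(src, func_name, tests)]


# Hypothetical "rollouts" for the task: implement add(a, b).
CANDIDATES = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return a - b\n",   # wrong answer
    "def add(a, b) return a + b\n",         # does not parse
]
TESTS: List[TestCase] = [((1, 2), 3), ((0, 0), 0)]

if __name__ == "__main__":
    print(len(filter_verified(CANDIDATES, "add", TESTS)))
```

The filter is only as good as the tests: a weak test suite lets wrong solutions through, which is the "perfect verifier is hard" concern raised above.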
2:11:40 That’s because you’re not as optimistic as Arvid. But yeah. So yeah, so that third category
2:11:43 requires having a verifier.
2:11:46 Yeah. Verification is, it feels like it’s best when you know for a fact that it’s correct.
2:11:52 And like, then it wouldn’t be like using a language model to verify. It would be using
2:11:54 tests or formal systems.
2:11:59 Or running the thing too, doing like the human form of verification where you just do manual
2:12:01 quality control.
2:12:02 Yeah.
2:12:04 But like the language model version of that where it’s like running the thing and it actually
2:12:05 understands the output.
2:12:06 Yeah.
2:12:07 No, that’s true.
2:12:08 For somewhere between.
2:12:14 Yeah. I think that’s the category that is most likely to result in like massive gains.
2:12:22 What about RL with feedback side, RLHF versus RLAIF? What’s the role of that in getting
2:12:25 better performance on the models?
2:12:36 Yeah. So RLHF is when the reward model you use is trained from some labels you’ve collected
2:12:38 from humans giving feedback.
2:12:45 I think this works if you have the ability to get a ton of human feedback for this kind
2:12:47 of task that you care about.
2:12:54 RLAIF is interesting because you’re kind of depending on, like this is actually kind
2:13:01 of going to, it’s depending on the constraint that verification is actually a decent bit
2:13:05 easier than generation because it feels like, okay, like what are you doing? You’re using
2:13:08 this language model to look at the language model outputs and then improve the language
2:13:09 model.
2:13:15 But no, it actually may work if the language model has a much easier time verifying some
2:13:20 solution than it does generating it. Then you actually could perhaps get this kind of recursive,
2:13:23 but I don’t think it’s going to look exactly like that.
2:13:30 The other thing you could do, that we kind of do, is like a little bit of a mix of RLAIF
2:13:34 and RLHF, where usually the model is actually quite correct, and this is in the case of
2:13:40 Cursor Tab, at picking between like two possible generations which is the better
2:13:45 one. And then it just needs like a little bit of human nudging, with only like
2:13:54 on the order of 50, 100 examples, to kind of align that prior the model has with exactly
2:13:55 what you want.
2:13:59 It looks different than I think normal RLHF, where you’re usually training these reward models
2:14:02 on tons of examples.
2:14:09 What’s your intuition when you compare generation and verification or generation and ranking?
2:14:12 Is ranking way easier than generation?
2:14:20 My intuition would just say, yeah, it should be like this is kind of going back to like
2:14:25 if you believe P does not equal NP, then there’s this massive class of problems
2:14:30 that are much, much easier to verify given a proof than actually proving it.
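The verify-versus-generate asymmetry can be made concrete with subset sum, a standard NP-complete problem chosen here purely as an illustration (it is not mentioned in the conversation). Checking a proposed answer takes one pass over it, while the naive solver below may try every subset.

```python
from itertools import combinations
from typing import List, Optional, Sequence


def verify(numbers: Sequence[int], indices: Sequence[int], target: int) -> bool:
    """Cheap check: indices are distinct, in range, and the chosen numbers hit the target."""
    return (len(set(indices)) == len(indices)
            and all(0 <= i < len(numbers) for i in indices)
            and sum(numbers[i] for i in indices) == target)


def solve(numbers: Sequence[int], target: int) -> Optional[List[int]]:
    """Brute-force search over up to 2**len(numbers) subsets."""
    for size in range(len(numbers) + 1):
        for indices in combinations(range(len(numbers)), size):
            if verify(numbers, indices, target):
                return list(indices)
    return None


if __name__ == "__main__":
    nums = [3, 34, 4, 12, 5, 2]
    certificate = solve(nums, 9)
    print(certificate, verify(nums, certificate, 9))
```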
2:14:35 I wonder if the same thing will prove P not equal to NP or P equal to NP.
2:14:37 That would be, that would be really cool.
2:14:45 That’d be, whatever, a Fields Medal by AI. Who gets the credit? Another open philosophical
2:14:46 question.
2:14:56 I’m actually surprisingly curious what a good bet for when AI will get the Fields
2:14:57 Medal will be.
2:14:59 Isn’t this Aman’s specialty?
2:15:01 I don’t know what Aman’s bet here is.
2:15:04 Oh, sorry, Nobel Prize or Fields Medal first?
2:15:05 Fields Medal.
2:15:06 Fields Medal level.
2:15:07 Fields Medal comes first.
2:15:08 Fields Medal comes first.
2:15:09 Well, you would say that, of course.
2:15:14 But it’s also this like isolated system you can verify.
2:15:16 Like I don’t even know if I don’t need to do this.
2:15:17 I feel like I have much more to do there.
2:15:22 I felt like the path to get to IMO was a little bit more clear because it already could get
2:15:26 a few IMO problems and there are a bunch of like, there’s a bunch of low hanging fruit
2:15:30 given the literature at the time of like what tactics people could take.
2:15:36 I think I’m, one, much less versed in the space of theorem proving now, and two, yeah, have less
2:15:41 intuition about how close we are to solving these really, really hard open problems.
2:15:43 So you think it’ll be Fields Medal first?
2:15:46 It won’t be like in physics or in.
2:15:47 Oh, 100%.
2:15:51 I think I think that’s probably likely like it’s probably much more likely that it’ll
2:15:52 get them.
2:15:53 Yeah.
2:15:54 Yeah.
2:15:57 Well, I think it’s both to like, I don’t know, like BSD, which is the Birch and Swinnerton-Dyer
2:16:02 conjecture, or like the Riemann hypothesis, or any one of these like hard, hard math problems
2:16:07 that are just like actually really hard. It’s sort of unclear what the path to get even a
2:16:09 solution looks like.
2:16:13 Like we don’t even know what a path looks like, let alone, and you don’t buy the idea
2:16:17 that this is like an isolated system and you can actually, you have a good reward system
2:16:22 and it feels like it’s easier to train for that.
2:16:25 I think we might get Fields Medal before AGI.
2:16:30 I mean, yeah, I’d be very happy, very happy.
2:16:37 But I don’t know, I think 2028, 2030. Fields Medal? Fields Medal.
2:16:38 All right.
2:16:44 It feels like forever from now, given how fast things have been going.
2:16:47 Speaking of how fast things have been going, let’s talk about scaling laws.
2:16:55 So for people who don’t know, maybe it’s good to talk about this whole idea of scaling
2:16:56 laws.
2:16:57 What are they?
2:17:00 Where do you think they stand and where do you think things are going?
2:17:04 I think it was interesting, the original scaling laws paper by OpenAI was slightly wrong because
2:17:11 I think of some issues they did with learning rate schedules and then Chinchilla showed
2:17:13 a more correct version.
2:17:16 And then from then, people have again kind of deviated from doing the compute optimal
2:17:22 thing because people start now optimizing more so for making the thing work really well
2:17:26 given an inference budget.
2:17:31 And I think there are a lot more dimensions to these curves than what we originally used
2:17:37 of just compute, number of parameters and data.
2:17:38 Like inference compute is the obvious one.
2:17:41 I think context length is another obvious one.
2:17:47 So let’s say you care about the two things of inference compute and then context window.
2:17:52 Maybe the thing you want to train is some kind of SSM because they’re much, much cheaper
2:17:55 and faster at super, super long context.
2:17:59 And even if maybe it has 10x worse scaling properties during training, meaning you have to spend
2:18:04 10x more compute to train the thing to get the same level of capabilities, it’s worth
2:18:09 it because you care most about that inference budget for really long context windows.
2:18:13 So it’ll be interesting to see how people kind of play with all these dimensions.
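For readers who want numbers behind "compute optimal," here is a back-of-the-envelope sketch. It relies on two widely used approximations that are not stated in the conversation: training cost of roughly 6 FLOPs per parameter per token, and the Chinchilla rule of thumb of about 20 training tokens per parameter. Both are rough guides rather than exact laws, and the function names are made up for this example.

```python
import math
from typing import Tuple

FLOPS_PER_PARAM_TOKEN = 6.0  # approximation: C ~ 6 * N * D
TOKENS_PER_PARAM = 20.0      # rough Chinchilla rule of thumb (assumption)


def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute for N parameters and D tokens."""
    return FLOPS_PER_PARAM_TOKEN * params * tokens


def compute_optimal(compute_flops: float) -> Tuple[float, float]:
    """Split a FLOP budget into (parameters, tokens) under the 20:1 rule of thumb."""
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return params, TOKENS_PER_PARAM * params


def overtraining_ratio(params: float, tokens: float) -> float:
    """How many times more tokens than the rule of thumb suggests (1.0 = on the line)."""
    return tokens / (TOKENS_PER_PARAM * params)


if __name__ == "__main__":
    budget = training_flops(70e9, 1.4e12)  # a 70B model on 1.4T tokens
    print(compute_optimal(budget))
    # Illustrative numbers: a small model trained far past the rule of thumb,
    # trading extra training compute for cheaper inference.
    print(overtraining_ratio(7e9, 2e12))
```

A ratio well above 1.0 is what "optimizing for an inference budget" looks like in this simplified picture: more training compute than the compute-optimal recipe, in exchange for a smaller model to serve.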
2:18:17 So yeah, I mean you speak to the multiple dimensions, obviously the original conception
2:18:22 was just looking at the variables of the size of the model as measured by parameters and
2:18:25 the size of the data as measured by the number of tokens and looking at the ratio of the
2:18:26 two.
2:18:32 And it’s kind of a compelling notion that there is a number or at least a minimum and
2:18:36 it seems like one was emerging.
2:18:41 Do you still believe that there is a kind of bigger is better?
2:18:49 I mean, I think bigger is certainly better for just raw performance and raw intelligence.
2:18:55 I think the path that people might take is I’m particularly bullish on distillation.
2:19:01 How many knobs can you turn to if we spend like a ton, ton of money on training, get
2:19:07 the most capable cheap model, really, really caring as much as you can because the naive
2:19:10 version of caring as much as you can about inference time compute is what people have
2:19:16 already done with the Llama models, just overtraining the shit out of 7B models on
2:19:19 way, way, way more tokens than is Chinchilla optimal.
2:19:24 But if you really care about it, maybe the thing to do is what Gemma did, which is let’s
2:19:31 not just train on tokens, let’s literally train on minimizing the KL divergence with
2:19:37 the distribution of Gemma 27B, so knowledge distillation there.
2:19:42 And you’re spending the compute of literally training this 27 billion parameter
2:19:46 model on all these tokens just to get out this smaller model.
2:19:50 And the distillation gives you just a faster model, smaller means faster.
2:19:56 Yeah, distillation in theory is I think getting out more signal from the data that you’re
2:19:57 training on.
2:20:01 And it’s like another, it’s perhaps another way of getting over, not like completely over,
2:20:05 but like partially helping with the data wall, where like you only have so much data to train
2:20:09 on, let’s like train this really, really big model on all these tokens and we’ll distill
2:20:10 it into a smaller one.
2:20:16 And maybe we can get more signal per token for this, for this much smaller model than
2:20:18 we would have originally if we trained it.
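The "train on minimizing the KL divergence" idea can be written out as a loss function. This is a minimal pure-Python sketch of a generic knowledge distillation loss, not the actual Gemma training recipe; the temperature parameter and the per-position averaging are assumptions made for the example.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over one position's vocabulary logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_logits: Sequence[Sequence[float]],
                      student_logits: Sequence[Sequence[float]],
                      temperature: float = 1.0) -> float:
    """Mean over token positions of KL(teacher || student)."""
    losses = [
        kl_divergence(softmax(t, temperature), softmax(s, temperature))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    teacher = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
    student = [[1.5, 1.2, 0.3], [0.1, 0.9, 2.0]]
    print(distillation_loss(teacher, student))
```

Compared with training on a single sampled token per position, the student sees the teacher's full distribution at every position, which is one way to read "more signal per token."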
2:20:22 So if I gave you $10 trillion, how would you spend it?
2:20:25 I mean, you can’t buy an island or whatever.
2:20:34 How would you allocate it in terms of improving the big model versus maybe paying for the HF in
2:20:35 the RLHF?
2:20:36 Yeah.
2:20:42 Yeah, I think there’s a lot of these secrets and details about training these large models
2:20:46 that I just don’t know, and that only the large labs are privy to.
2:20:50 And the issue is I would waste a lot of that money if I even attempted this because I wouldn’t
2:20:53 know those things.
2:20:59 Suspending a lot of disbelief and assuming like you had the know-how and operate or if
2:21:03 you’re saying like you have to operate with like the limited information you have now.
2:21:04 No, no, no.
2:21:09 Actually, I would say you swoop in and you get all the information, all the little heuristics,
2:21:17 all the little parameters, all the parameters that define how the thing is trained.
2:21:22 If we look in how to invest money for the next five years in terms of maximizing what
2:21:23 you call raw intelligence.
2:21:25 I mean, isn’t the answer like really simple?
2:21:28 You just try to get as much compute as possible.
2:21:33 Like at the end of the day, all you need to buy is the GPUs and then the researchers
2:21:39 can find all the like they can sort of you can tune whether you want to train a big model
2:21:41 or a small model.
2:21:45 Well this gets into the question of like, are you really limited by compute and money
2:21:50 or are you limited by these other things and I’m more privy to Arvid’s beliefs that
2:21:54 we’re sort of idea limited but there’s always like.
2:21:59 But if you have a lot of compute, you can run a lot of experiments.
2:22:04 So you would run a lot of experiments versus like use that compute to train a gigantic
2:22:05 model.
2:22:10 I would, but I do believe that we are limited in terms of ideas that we have.
2:22:15 I think, yeah, because even with all this compute and like, you know, all the data you
2:22:21 could collect in the world, I think you really are ultimately limited by not even ideas but
2:22:27 just like really good engineering like even with all the capital in the world, would you
2:22:32 really be able to assemble like there aren’t that many people in the world who really like
2:22:34 make the difference here.
2:22:40 And there’s so much work that goes into research that is just like pure, really, really hard
2:22:45 engineering work as like a very kind of hand wavy example.
2:22:48 If you look at the original transformer paper, you know how much work was kind of joining
2:22:54 together a lot of these really interesting concepts embedded in the literature versus
2:22:58 then going in and writing all the code, like maybe the CUDA kernels, maybe whatever else.
2:23:04 I don’t know if it ran on GPUs or TPUs originally, such that it actually saturated the GPU performance,
2:23:05 right?
2:23:07 Getting Noam Shazeer to go in and do all this code, right?
2:23:11 And Noam is like probably one of the best engineers in the world. Or maybe going a step
2:23:15 further like the next generation of models, having these things, like getting model parallelism
2:23:20 to work and scaling it on like, you know, thousands of, or maybe tens of thousands of like V100s,
2:23:23 which I think GPT-3 may have been.
2:23:27 There’s just so much engineering effort that has to go into all of these things to make
2:23:28 it work.
2:23:35 If you really brought that cost down to like, you know, maybe not zero, but just made it
2:23:40 10x easier, made it super easy for someone with really fantastic ideas to immediately
2:23:47 get to the version of like the new architecture they dreamed up that is like getting 50, 40%
2:23:53 utilization on the GPUs, I think that would just speed up research by a ton.
2:23:58 I mean, I think if you see a clear path to improvement, you should always sort of take
2:23:59 the low hanging fruit first, right?
2:24:04 I think probably OpenAI and all the other labs did the right thing to pick off
2:24:10 the low hanging fruit, where the low hanging fruit is like sort of, you could scale up
2:24:19 to a GPT-4.25 scale and you just keep scaling and like things keep getting better and as
2:24:25 long as like, there’s no point of experimenting with new ideas when like everything is working
2:24:30 and you should sort of bang on it and try to get as much juice out of it as possible
2:24:33 and then maybe when you really need new ideas for.
2:24:38 I think if you’re spending 10 trillion dollars, probably want to spend some, you know, then
2:24:41 actually like reevaluate your ideas, like probably you’re idea limited at that point.
2:24:47 I think all of us believe new ideas are probably needed to get, you know, all the way there
2:24:56 to AGI and all of us also probably believe there exist ways of testing out those ideas
2:25:02 at smaller scales and being fairly confident that they’ll play out.
2:25:07 It’s just quite difficult for the labs in their current position to dedicate their
2:25:14 very limited research and engineering talent to exploring all these other ideas when there’s
2:25:21 like this core thing that will probably like improve performance for some like decent amount
2:25:22 of time.
2:25:25 Yeah, but also these big labs like winning.
2:25:31 So they’re just going wild, okay.
2:25:34 So, a big question, looking out into the future.
2:25:38 You’re now at the center of the programming world.
2:25:43 How do you think programming, the nature of programming, changes in the next few months,
2:25:47 in the next year, in the next two years, in the next five years, 10 years?
2:25:52 I think we’re really excited about a future where the programmer is in the driver’s seat
2:25:54 for a long time.
2:26:00 And you’ve heard us talk about this a little bit, but one that emphasizes speed and agency
2:26:05 for the programmer and control, the ability to modify anything you want to modify, the
2:26:08 ability to iterate really fast on what you’re building.
2:26:16 And this is a little different, I think, than where some people are jumping to in the space
2:26:24 where I think one idea that’s captivated people is can you talk to your computer?
2:26:27 Can you have it build software for you as if you’re talking to like an engineering department
2:26:32 or an engineer over Slack and can it just be this sort of isolated text box?
2:26:38 And part of the reason we’re not excited about that is, you know, some of the stuff
2:26:39 we talked about with latency.
2:26:44 But then a big reason we’re not excited about that is because that comes with giving up
2:26:45 a lot of control.
2:26:49 It’s much harder to be really specific when you’re talking in the text box.
2:26:53 And if you’re necessarily just going to communicate with a thing like you would be communicating
2:26:58 with an engineering department, you’re actually abdicating tons of really important decisions
2:26:59 to this bot.
2:27:04 And this kind of gets at fundamentally what engineering is.
2:27:08 I think that some people who are a little bit more removed from engineering might think
2:27:12 of it as, you know, the spec is completely written out and then the engineers just come
2:27:14 and they just implement.
2:27:19 And it’s just about making the thing happen in code and making the thing exist.
2:27:24 But I think a lot of the best engineering, the engineering we enjoy involves tons of
2:27:30 tiny micro decisions about what exactly you’re building and about really hard tradeoffs between,
2:27:35 you know, speed and cost and just all the other things involved in a system.
2:27:41 And we want as long as humans are actually the ones making, you know, designing the software
2:27:44 and the ones specifying what they want to be built.
2:27:47 And it’s not just like company run by all AIs.
2:27:52 We think you’ll really want the human in the driver’s seat dictating these decisions.
2:27:56 And so the jury’s still out on kind of what that looks like.
2:28:01 I think that, you know, one weird idea for what that could look like is it could look
2:28:06 like you kind of, you can control the level of abstraction you view a code base at.
2:28:12 And you can point at specific parts of a code base that may like maybe you digest a code
2:28:15 base by looking at it in the form of pseudocode.
2:28:20 And you can actually edit that pseudocode too and then have changes get made down at
2:28:23 the sort of formal programming level.
2:28:29 And you keep the like, you know, you can gesture at any piece of logic in your software component
2:28:30 of programming.
2:28:33 You keep the in-flow text editing component of programming.
2:28:36 You keep the control of, you can even go down into the code.
2:28:40 You can go at higher levels of abstraction while also giving you these big productivity
2:28:41 gains.
2:28:44 It’d be nice if you can go up and down the abstraction stack.
2:28:45 Yeah.
2:28:46 And there are a lot of details to figure out there.
2:28:48 That’s sort of like a fuzzy idea.
2:28:49 Time will tell if it actually works.
2:28:53 But these principles of control and speed and the human in the driver’s seat we think are
2:28:55 really important.
2:28:58 We think for some things, like Arvid mentioned before, for some styles of programming you
2:29:02 can kind of hand it off chatbot style, you know, if you have a bug that’s really well
2:29:03 specified.
2:29:08 But that’s not most of programming, and that’s also not most of the programming
2:29:10 we think a lot of people value.
2:29:12 What about like the fundamental skill of programming?
2:29:20 There’s a lot of people like young people right now kind of scared, like thinking because
2:29:23 they like love programming, but they’re scared about like, will I be able to have a future
2:29:26 if I pursue this career path?
2:29:30 Do you think the very skill of programming will change fundamentally?
2:29:34 I actually think this is a really, really exciting time to be building software.
2:29:42 We remember what programming was like in, you know, 2013, 2012, whatever it was.
2:29:50 And there was just so much more cruft and boilerplate and, you know, looking up something
2:29:51 really gnarly.
2:29:53 And that stuff still exists.
2:29:55 It’s definitely not at zero.
2:29:59 But programming today is way more fun than back then.
2:30:04 It’s like we’re really getting down to the delight concentration and all the things
2:30:07 that really draw people to programming, like for instance, this element of being able to
2:30:12 build things really fast and speed and also individual control, like all those are just
2:30:15 being turned up a ton.
2:30:18 And so I think it’s just going to be, I think it’s going to be a really, really fun time
2:30:20 for people who build software.
2:30:22 I think that the skills will probably change too.
2:30:26 I think that people’s taste and creative ideas will be magnified and it will be less
2:30:32 about maybe less a little bit about boilerplate text editing, maybe even a little bit less
2:30:36 about carefulness, which I think is really important today if you’re a programmer.
2:30:39 I think it’ll be a lot more fun.
2:30:41 What do you guys think?
2:30:42 I agree.
2:30:49 I’m very excited to be able to change, like just, one thing that happened recently was
2:30:52 like we wanted to do a relatively big migration to our code base.
2:30:58 We were using AsyncLocalStorage in Node.js, which is known to be not very performant, and
2:31:00 we wanted to migrate to a context object.
2:31:04 And this is a big migration that affects the entire code base.
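To make the shape of that migration concrete, here is a minimal sketch in TypeScript, assuming a request ID is the only piece of context involved. The `RequestContext` type and the `handle*` and `log*` function names are hypothetical and are not taken from the Cursor code base; only `AsyncLocalStorage` from `node:async_hooks` is a real Node.js API. The "before" half reads context implicitly from the store, and the "after" half threads an explicit context object through each call.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Hypothetical request-scoped data; the field name is illustrative only.
interface RequestContext {
  requestId: string;
}

// Before: context is implicit and is read from AsyncLocalStorage where needed.
const storage = new AsyncLocalStorage<RequestContext>();

function logImplicit(message: string): string {
  const ctx = storage.getStore(); // undefined when called outside storage.run(...)
  return `[${ctx?.requestId ?? "unknown"}] ${message}`;
}

function handleImplicit(requestId: string): string {
  // run() makes the store visible to everything called inside the callback.
  return storage.run({ requestId }, () => logImplicit("handled"));
}

// After: context is an ordinary argument passed down through every call.
function logExplicit(ctx: RequestContext, message: string): string {
  return `[${ctx.requestId}] ${message}`;
}

function handleExplicit(ctx: RequestContext): string {
  return logExplicit(ctx, "handled");
}

console.log(handleImplicit("req-1"));
console.log(handleExplicit({ requestId: "req-1" }));
```

The explicit version touches every function signature on the call path, which is one plausible reason a change like this reaches across an entire code base.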
2:31:09 And Sualeh and I spent, I don’t know, five days working through this, even with today’s
2:31:10 AI tools.
2:31:17 And I am really excited for a future where I can just show a couple of examples and then
2:31:20 the AI applies that to all of the locations.
2:31:24 And then it highlights, oh, this is a new example, like what should I do and then I
2:31:25 show exactly what to do there.
2:31:29 And then that can be done in like 10 minutes.
2:31:33 And then you can iterate much, much faster. Then you don’t have to think
2:31:38 as much upfront and stand at the blackboard and like think exactly like how are we going
2:31:42 to do this because the cost is so high, but you can just try something first.
2:31:44 And you realize, oh, this is not actually exactly what I want.
2:31:47 And then you can change it instantly again after.
2:31:53 And so, yeah, I think being a programmer in the future is going to be a lot of fun.
2:31:54 Yeah.
2:31:57 I really like that point about, it feels like a lot of the time with programming,
2:32:00 there are two ways you can go about it.
2:32:06 One is like, you think really hard, carefully upfront about the best possible way to do it.
2:32:10 And then you spend your limited time of engineering to actually implement it.
2:32:14 But I much prefer just getting in the code and like, you know, taking a crack at it, seeing
2:32:19 how it kind of lays out, and then iterating really quickly on that.
2:32:21 That feels more fun.
2:32:25 Yeah, like just being able to generate the boilerplate is great.
2:32:31 So you just focus on the difficult design, nuanced difficult design decisions, migration.
2:32:34 I feel like this is, this is a cool one.
2:32:38 Like, it seems like large language models are able to basically translate from one programming
2:32:43 language to another or like translate, like migrate in the general sense of what migrate
2:32:44 is.
2:32:46 But that’s in the current moment.
2:32:51 So, I mean, the fear has to do with like, okay, as these models get better and better,
2:32:53 then you’re doing less and less creative decisions.
2:32:58 And is it going to kind of move to a place where it’s, you’re operating in the design
2:33:03 space of natural language, where natural language is the main programming language.
2:33:07 And I guess I ask that by way of advice, like, if somebody’s interested in programming
2:33:10 now, what do you think they should learn?
2:33:19 Like do they, you guys started in some Java and I forget the, oh, some PHP.
2:33:20 Objective-C.
2:33:21 Objective-C.
2:33:22 There you go.
2:33:23 Yeah, absolutely.
2:33:27 I mean, in the end, we all know JavaScript is going to win.
2:33:28 And not TypeScript.
2:33:33 It’s just, it’s going to be like vanilla JavaScript, it’s just going to eat the world and maybe
2:33:34 a little bit of PHP.
2:33:40 And I mean, it also brings up the question of like, I think Don Knuth has a, this idea
2:33:45 that some percent of the population is geeks and like, there’s a particular kind of psychology
2:33:52 and mind required for programming, and it feels like more and more that expands, the kind of
2:33:58 person that can do great programming might expand.
2:34:04 I think different people do programming for different reasons, but I think the true, maybe
2:34:12 like the best programmers are the ones that really love, just like absolutely love programming,
2:34:21 for example, there are folks on our team who literally, when they get back from
2:34:26 work, they go and then they boot up Cursor and then they start coding on their side projects
2:34:31 for the entire night and they stay up till 3 AM doing that.
2:34:37 And when they’re sad, they say, I just really need to code.
2:34:42 And I think like, you know, there’s that level of programmer where like this obsession
2:34:49 and love of programming, I think makes really the best programmers.
2:34:55 And I think these types of people will really get into the details of how things work.
2:35:01 I guess the question I’m asking, that exact programmer, let’s think about that person.
2:35:06 When the super Tab, the super awesome, praise be the Tab, succeeds,
2:35:11 you keep pressing Tab. That person in the team loves Cursor Tab more than anybody
2:35:12 else.
2:35:13 Yeah.
2:35:17 And it’s also not just like, like pressing tab is like the just pressing tab.
2:35:21 That’s like the easy way to say it in the catch phrase, you know, but what you’re actually
2:35:27 doing when you’re pressing tab is that you’re injecting intent all the time while you’re
2:35:29 doing it.
2:35:30 Sometimes you’re rejecting it.
2:35:32 Sometimes you’re typing a few more characters.
2:35:38 And that’s the way that you’re sort of shaping the thing that’s being created.
2:35:43 And I think programming will change a lot to just what is it that you want to make.
2:35:45 It’s sort of higher bandwidth.
2:35:50 The communication to the computer just becomes higher and higher bandwidth as opposed to like
2:35:53 just typing, which is much lower bandwidth than communicating intent.
2:36:00 I mean, this goes to your manifesto titled Engineering Genius.
2:36:04 We are an applied research lab building extraordinarily productive human-AI systems.
2:36:10 So speaking to this like hybrid element. To start, we’re building the engineer of the
2:36:17 future, a human-AI programmer that’s an order of magnitude more effective than any one engineer.
2:36:21 This hybrid engineer will have effortless control over their code base and no low-entropy
2:36:23 keystrokes.
2:36:28 They will iterate at the speed of their judgment, even in the most complex systems.
2:36:34 Using a combination of AI and human ingenuity, they will outsmart and out-engineer the best
2:36:36 pure AI systems.
2:36:38 We are a group of researchers and engineers.
2:36:42 We build software and models to invent at the edge of what’s useful and what’s possible.
2:36:47 Our work has already improved the lives of hundreds of thousands of programmers.
2:36:51 And on the way to that, we’ll at least make programming more fun.
2:36:53 So thank you for talking today.
2:36:54 Thank you.
2:36:55 Thanks for having us.
2:36:56 Thank you.
2:36:57 Thank you.
2:37:01 Thanks for listening to this conversation with Michael, Sualeh, Arvid, and Aman.
2:37:04 To support this podcast, please check out our sponsors in the description.
2:37:11 And now, let me leave you with a random, funny, and perhaps profound programming quote I saw
2:37:13 on Reddit.
2:37:18 Nothing is as permanent as a temporary solution that works.
2:37:21 Thank you for listening and hope to see you next time.
2:37:22 Bye.
2:37:25 [MUSIC PLAYING]

Aman Sanger, Arvid Lunnemark, Michael Truell, and Sualeh Asif are creators of Cursor, a popular code editor that specializes in AI-assisted programming.
Thank you for listening ❤ Check out our sponsors: https://lexfridman.com/sponsors/ep447-sc
See below for timestamps, transcript, and to give feedback, submit questions, contact Lex, etc.

Transcript:
https://lexfridman.com/cursor-team-transcript

CONTACT LEX:
Feedback – give feedback to Lex: https://lexfridman.com/survey
AMA – submit questions, videos or call-in: https://lexfridman.com/ama
Hiring – join our team: https://lexfridman.com/hiring
Other – other ways to get in touch: https://lexfridman.com/contact

EPISODE LINKS:
Cursor Website: https://cursor.com
Cursor on X: https://x.com/cursor_ai
Anysphere Website: https://anysphere.inc/
Aman’s X: https://x.com/amanrsanger
Aman’s Website: https://amansanger.com/
Arvid’s X: https://x.com/ArVID220u
Arvid’s Website: https://arvid.xyz/
Michael’s Website: https://mntruell.com/
Michael’s LinkedIn: https://bit.ly/3zIDkPN
Sualeh’s X: https://x.com/sualehasif996
Sualeh’s Website: https://sualehasif.me/

SPONSORS:
To support this podcast, check out our sponsors & get discounts:
Encord: AI tooling for annotation & data management.
Go to https://encord.com/lex
MasterClass: Online classes from world-class experts.
Go to https://masterclass.com/lexpod
Shopify: Sell stuff online.
Go to https://shopify.com/lex
NetSuite: Business management software.
Go to http://netsuite.com/lex
AG1: All-in-one daily nutrition drinks.
Go to https://drinkag1.com/lex

OUTLINE:
(00:00) – Introduction
(09:25) – Code editor basics
(11:35) – GitHub Copilot
(18:53) – Cursor
(25:20) – Cursor Tab
(31:35) – Code diff
(39:46) – ML details
(45:20) – GPT vs Claude
(51:54) – Prompt engineering
(59:20) – AI agents
(1:13:18) – Running code in background
(1:17:57) – Debugging
(1:23:25) – Dangerous code
(1:34:35) – Branching file systems
(1:37:47) – Scaling challenges
(1:51:58) – Context
(1:57:05) – OpenAI o1
(2:08:27) – Synthetic data
(2:12:14) – RLHF vs RLAIF
(2:14:01) – Fields Medal for AI
(2:16:43) – Scaling laws
(2:25:32) – The future of programming

PODCAST LINKS:
– Podcast Website: https://lexfridman.com/podcast
– Apple Podcasts: https://apple.co/2lwqZIr
– Spotify: https://spoti.fi/2nEwCF8
– RSS: https://lexfridman.com/feed/podcast/
– Podcast Playlist: https://www.youtube.com/playlist?list=PLrAXtmErZgOdP_8GztsuKi9nrraNbKKp4
– Clips Channel: https://www.youtube.com/lexclips
