Columbia CS Professor: Why LLMs Can’t Discover New Science

AI transcript
0:00:07 Any LLM that was trained on pre-1915 physics would never have come up with a theory of relativity.
0:00:13 Einstein had to sort of reject the Newtonian physics and come up with a space-time continuum.
0:00:15 He completely rewrote the rules.
0:00:20 AGI will be when we are able to create new science, new results, new math.
0:00:25 When an AGI comes up with a theory of relativity, it has to go beyond what it has been trained on
0:00:27 to come up with new paradigms, new science.
0:00:30 That’s my definition of AGI.
0:00:35 Vishal Misra was trying to fix a broken cricket stats page and accidentally helped spark one
0:00:37 of AI’s biggest breakthroughs.
0:00:43 On this episode of the A16Z podcast, I talk with Vishal and A16Z's Martin Casado about how
0:00:48 that moment led to retrieval-augmented generation and how Vishal's formal models explain what
0:00:50 large language models can and can’t do.
0:00:55 We discuss why LLMs might be hitting their limits, what real reasoning looks like, and what it
0:00:56 would take to go beyond them.
0:00:57 Let’s get into it.
0:01:02 Martin, I knew you wanted to have Vishal on.
0:01:06 What do you find so remarkable about him and his contributions that inspired this?
0:01:08 Vishal and I actually have very similar backgrounds.
0:01:09 We both come from networking.
0:01:11 He’s a much more accomplished networking guy than I am.
0:01:12 That’s a high bar given to you.
0:01:16 And so we actually view the world in an information-theoretic way.
0:01:18 It is actually part of networking.
0:01:25 And with all this AI stuff, there’s so much work trying to create models that can help us
0:01:27 understand how these LLMs work.
0:01:33 And in my experience over the last three years, the ones that have most impacted my understanding,
0:01:36 and I think have been the most predictive, are the ones that Vishal has come up with.
0:01:40 He did a previous one that we’re going to talk about called Matrix, is it?
0:01:42 Beyond the black box, but yeah.
0:01:43 Beyond the black box.
0:01:48 So actually, we should put this in the notes for this, but the single best talk I’ve ever
0:01:54 seen on trying to understand how LLMs work is one that Vishal did at MIT, which Hari Balakrishnan
0:01:55 pointed me to, and I watched that.
0:02:00 So he did that work, and then he’s doing more recent work that’s actually trying to scope
0:02:05 out not only how LLMs reason, but it has some reflections on how humans reason, too.
0:02:09 And so I just think he’s doing some of the more profound work in trying to understand and
0:02:12 come up with models, formal models, for how LLMs reason.
0:02:17 On that note, you said his most recent work helped change how you think about how humans think.
0:02:18 Why don’t you flesh that out a little bit?
0:02:19 How did it sort of?
0:02:23 Well, okay, so can I just try to take a rough sketch at it, and then you just tell me how
0:02:23 wrong I am?
0:02:24 Go right ahead.
0:02:30 You’re trying to describe how LLMs work, and one thing that you found is that they reduce
0:02:40 a very, very complex, multidimensional space into basically a geometric manifold that’s
0:02:46 a reduced state space, so it’s reduced degrees of freedom, but you can actually predict where
0:02:49 in the manifold the reasoning can move to, roughly.
0:02:55 So you’ve reduced the dimensionality of the problem to a geometric manifold, and then you
0:03:01 can actually formally specify kind of how far you can reason within that manifold.
0:03:07 And the articulation is that we, or one of the intuitions is that we as humans do the
0:03:12 same thing, is we take this very complex, heavy-tailed, stochastic universe, and we reduce it to
0:03:17 kind of this geometric manifold, and then when we reason, we just move along that manifold.
0:03:20 Yeah, I think you captured it accurately.
0:03:22 That’s kind of the spirit of the work, yeah.
0:03:24 Wait, wait, can I just hear it in your words?
0:03:25 Because I’m a VC, so.
0:03:30 You’re a VC with an H-index of what, 60?
0:03:31 True.
0:03:40 Yeah, so ultimately what all these LLMs are doing, whether the early LLMs or the LLMs that
0:03:46 we have today with all sorts of post-training, RLHF, whatever you do, at the end of the day
0:03:51 what they do is they create a distribution for the next token, right?
0:03:58 So, given a prompt, these LLMs create a distribution for the next token or the next word, and then
0:04:05 they pick something from that distribution using some kind of algorithm to predict the
0:04:07 next token, pick it, and then keep going.
0:04:14 Now, what happens because of the way we train these LLMs, the architecture of the transformers,
0:04:19 and the loss function, the way you put it is, right, it sort of reduces the world into
0:04:21 these Bayesian manifolds.
0:04:30 And as long as the LLM is going in, sort of traversing through these manifolds, it is
0:04:34 confident, and it can produce something which makes sense.
0:05:40 The moment it sort of veers away from the manifold, then it starts hallucinating and starts spouting
0:04:41 nonsense.
0:04:42 Confident nonsense, but nonsense.
0:04:47 So, it creates these manifolds, and the trick is the distribution that is produced.
0:04:51 You can measure the entropy of the distribution.
0:04:53 Right?
0:04:55 But entropy the way Shannon describes it.
0:04:56 Shannon entropy.
0:04:59 Shannon entropy, yeah, not thermodynamic entropy.
0:05:05 So, suppose you have a vocabulary of, let’s say, 50,000 different tokens, and you have a
0:05:09 distribution, next token distribution over these 50,000 tokens.
0:05:12 So, let’s say the cat sat on the, right?
0:05:20 If that is the prompt, then the distribution will have a high probability for mat, or hat, or table,
0:05:27 and a very low probability of, let’s say, ship, or whale, or something like that, right?
0:05:32 So, because of the way it’s trained, it has these distributions.
0:05:36 Now, the distributions can be low entropy or high entropy.
0:05:42 A high entropy distribution means that there are many different ways that the LLM can go
0:05:46 with a high enough probability for all those paths.
0:05:52 Low entropy means that there are only a small set of choices for the next token.
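(A minimal sketch, in Python, of the Shannon entropy idea being discussed; the toy vocabulary and probabilities below are invented for illustration and do not come from any actual model.)

```python
import math

def shannon_entropy(dist):
    """Shannon entropy, in bits, of a next-token distribution {token: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Low-entropy case: "The cat sat on the ..." has only a few plausible continuations.
low = {"mat": 0.6, "hat": 0.2, "table": 0.15, "whale": 0.05}

# High-entropy case: a prompt that could continue in many directions,
# modeled here as 50 roughly equally likely next tokens.
high = {f"token_{i}": 1 / 50 for i in range(50)}

print(shannon_entropy(low))   # about 1.5 bits: the model is confident
print(shannon_entropy(high))  # about 5.6 bits: many plausible next tokens
```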
0:05:58 And the prompts also, you can categorize into two kinds of prompts.
0:06:03 One kind of prompt is, you could say, high information entropy.
0:06:04 Yeah.
0:06:16 So, the way these manifolds work, the LLMs start paying attention to prompts that have
0:06:21 high information entropy and low prediction entropy.
0:06:23 So, what do I mean by that?
0:06:26 So, when I say, I’m going out for dinner.
0:06:27 Yeah.
0:06:28 Right?
0:06:33 So, when I say, I’m going out for dinner, that phrase, the LLMs have been trained
0:06:39 they’ve seen it a lot, and there are many different directions I can go with it.
0:06:44 I can say, I’m going for dinner tonight, I’m going for dinner to McDonald’s, or I’m going
0:06:45 to dinner, blah, blah, blah.
0:06:45 There are many different.
0:06:46 Yeah.
0:06:53 But, when I say, I'm going to dinner with Martin Casado, you know, the LLM, now this is
0:06:54 information rich.
0:06:57 This is sort of a rare phrase.
0:07:02 And now, the sort of realm of possibilities reduces, because Martin is only going to take
0:07:03 me to Michelin star restaurants.
0:07:05 Yep, yep, yep, yep.
0:07:08 I’m not going to go to a McDonald’s.
0:07:09 You get what I’m saying.
0:07:16 The moment you add more context, you make the prompt information rich, the prediction entropy
0:07:16 reduces.
0:07:17 Yep, yep, yep, yep.
0:07:22 And another example that I often cite is…
0:07:25 But just quickly, but what is your takeaway?
0:07:27 What is your implication on that?
0:07:33 So, sorry, I forgot how you described it, but the more precise you are, the more tokens you add, I presume the fewer options you have for the next token.
0:07:43 Is that correct or not correct?
0:07:45 Yeah, yeah, essentially.
0:07:52 So, you’re reducing it, you’re reducing it to a very specific state space when it comes
0:07:55 to confidence in an answer.
0:07:57 And this is kind of a manifold that you can go on.
0:08:04 And then, I mean, do you, do you have kind of a conclusion of what that means for systems
0:08:08 or what that means for reasoning, or is it just a nice way to articulate the bounds of
0:08:09 LLMs?
0:08:16 No, there is something, I don’t know, I don’t know if I should say profound, but there is
0:08:21 something about it which tells what these LLMs can or cannot do, right?
0:08:29 So, one of the examples that I often tell is, suppose I ask you, what is 769 times 1025?
0:08:32 You have no idea.
0:08:36 You can have some vague idea, given the two numbers, right?
0:08:43 And so, in your mind, the next token distribution of the answer is going to be diffuse, right?
0:08:45 You don’t know.
0:08:46 You have maybe a vague guess.
0:08:51 If you are mathematically very good, maybe your guess is more precise, but it is still going
0:08:52 to be diffuse.
0:08:53 And it’s not going to be the correct answer.
0:09:00 But, if I, if you say, can I write it down and do it the way we have learned multiplication
0:09:04 tables, now you know exactly what to do next step, right?
0:09:09 You write 769 and then 1025, and then you know exactly.
0:09:15 So, at each stage of that process, your prediction entropy is very low.
0:09:22 You know exactly what to do because you have been taught this algorithm.
0:09:28 And by invoking this algorithm, saying, okay, I’m not going to just guess the answer, but
0:09:30 I’m going to do it step by step.
0:09:33 Then, your prediction entropy reduces.
0:09:39 And, you can arrive at an answer which you’re confident of and which is correct.
0:09:41 And, the LLMs are pretty much the same way.
0:09:43 That’s why chain of thought works.
0:09:50 What happens with chain of thought is, you ask the LLM to do something chain of thought.
0:09:52 It starts breaking the problem into small steps.
0:09:55 These steps, it has seen in the past.
0:09:56 It has been trained on.
0:10:00 Maybe with some different numbers, but the concept, it has been trained on.
0:10:03 And, once it breaks it down, then it’s confident.
0:10:08 Okay, now I need to do A, B, C, D, and then I arrive at this answer.
0:10:10 Whatever it is.
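(A worked version of the step-by-step path being described, with the same numbers: 769 × 1025 = 769 × 1000 + 769 × 25 = 769,000 + 19,225 = 788,225. Each partial product, and the final addition, is a step whose next move is essentially determined, which is what keeps the prediction entropy low at every stage.)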
0:10:12 Let’s zoom back out.
0:10:13 I want to get into LLMs.
0:10:18 But, first, Vishal, maybe you can give more context on your background and how that informs
0:10:20 your work here.
0:10:21 Okay.
0:10:26 So, yeah, as Martin said, my background is very similar to his.
0:10:28 We, you know, we come from doing networking.
0:10:35 So, my PhD thesis, my sort of early work at Columbia has all been in networking.
0:10:42 But there’s another side of me, another hat that I wear, which is both an entrepreneur and
0:10:43 a cricket fan.
0:10:45 I was going to say, don’t you own a cricket team or something?
0:10:51 I’m a minority owner at your, for your local cricket team, the San Francisco Unicorns.
0:10:52 Yeah, that’s right.
0:10:54 I’m very proud to have you.
0:11:04 So, but, say, in the 90s, I was one of the people who started this portal called Cricinfo.
0:11:11 And Cricinfo, at one point, was the most popular website in the world.
0:11:12 It had more hits than Yahoo.
0:11:14 That was before India came online.
0:11:21 And so, you know, cricket is a very stats-driven sport.
0:11:24 Think baseball multiplied by a thousand.
0:11:31 And we had built this free searchable stats database on Cricinfo called Stats Guru.
0:11:35 And this has been available on Cricinfo since 2000.
0:11:43 But because you can search for anything, everything was made available on Stats Guru.
0:11:48 And, you know, you can’t expect people to write SQL queries to query everything.
0:11:50 So how do you, how did we do it?
0:11:56 Well, it was a web form, you know, where you could formulate your query using that form.
0:12:00 And in the back end, that was translated into SQL query, got the results and got it back.
0:12:05 But as a result, that because you could do everything, everything was made available.
0:12:11 The web form had like 25 different checkboxes, 15 text fields, 18 different dropdowns.
0:12:13 The interface was a mess.
0:12:14 It was very daunting.
0:12:23 And ESPN acquired Cricinfo in the mid-2000s, I think.
0:12:26 But they still kept the same interface.
0:12:29 And that has always sort of nagged me.
0:12:32 And so I still know the people who run ESPN.
0:12:33 Wait, wait, what nagged you?
0:12:35 Is it that Cricinfo did not have natural language queries?
0:12:37 It had a web form for doing queries?
0:12:40 That web form was terrible.
0:12:45 Because of that, only the real nerds used Stats Guru.
0:12:50 Of all the things in the world that bother you, the fact that an old website was a web form.
0:12:54 I appreciate your commitment to aesthetic.
0:13:01 So I’m still friendly with the people who run ESPN Crickinfo.
0:13:05 The editor-in-chief, whenever he comes to New York, you know, we meet up, we go out for a drink.
0:13:08 And so he was here in 2020.
0:13:13 So now the story shifts to how LLMs and me sort of met.
0:13:17 So January 2020, right before the pandemic, he was here.
0:13:21 And I again said, why don't you do something about Stats Guru?
0:13:23 And he looks at me and says, why don't you do something about Stats Guru?
0:13:30 He was kind of joking, but he thought maybe, you know, I had some ways to fix the interface.
0:13:34 So anyway, then the pandemic hit, the world stopped.
0:13:39 But in July of 2020, the first version of GPT-3 was released.
0:13:51 And I saw someone use GPT-3 to write a SQL query for their own database using natural language.
0:13:56 And I thought, can I use this to fix Stats Guru?
0:14:04 So I got early access to GPT-3, you know, getting access those days was difficult, but somehow I got it.
0:14:08 But soon I realized that, you know, no, I cannot really do it.
0:14:12 Because Stats Guru, the backend databases were so complex.
0:14:17 And if you remember, GPT-3 had only a 2048 token context window.
0:14:23 There was no way in hell I could fit the complexities of that database in that context window.
0:14:29 And GPT-3 also did not do instruction following at that time.
0:14:37 But then in trying to solve this problem, I accidentally invented what’s now called RAG.
0:14:47 Where based on the natural language query, I created a database of natural language queries and structure, the structured queries.
0:14:54 Like I created a DSL, which then translated into a REST call to Stats Guru.
0:15:00 So based on the new query, I would look through my set of natural language queries.
0:15:01 I had about 1,500 examples.
0:15:05 And I would pick the six or seven most relevant ones.
0:15:10 And then that and the structured query, I would send as a prefix and the new query.
0:15:13 And GPT-3 magically completed it.
0:15:15 And the accuracy was very high.
0:15:19 So that had been running in production since September 2021.
0:15:23 You know, about 15 months before ChatGPT came.
0:15:27 And, you know, the whole revolution in some sense started.
0:15:29 And RAG became very popular.
0:15:30 I didn’t call it RAG.
0:15:35 But this is something I sort of accidentally did in trying to solve that problem for Cricinfo.
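(A minimal sketch of the retrieval-augmented prompting loop described above, assuming a crude lexical similarity scorer; the example bank, the DSL strings, and the function names are illustrative and are not the actual Cricinfo/Stats Guru implementation.)

```python
from difflib import SequenceMatcher

# Hypothetical bank of (natural language query, DSL translation) pairs,
# standing in for the ~1,500 hand-built examples described above.
EXAMPLES = [
    ("most runs by a batsman in 2019", "STATS(metric=runs, role=batsman, year=2019, sort=desc)"),
    ("best bowling figures at Lord's", "STATS(metric=bowling_figures, venue=Lords, sort=best)"),
    # ... more examples ...
]

def similarity(a: str, b: str) -> float:
    """Crude lexical similarity; a real system might use embeddings instead."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def build_prompt(new_query: str, k: int = 6) -> str:
    """Pick the k most relevant stored examples and prepend them to the new query."""
    ranked = sorted(EXAMPLES, key=lambda ex: similarity(new_query, ex[0]), reverse=True)
    prefix = "\n".join(f"Q: {q}\nDSL: {dsl}" for q, dsl in ranked[:k])
    return f"{prefix}\nQ: {new_query}\nDSL:"

# The assembled prompt is then sent to a completion API (GPT-3 in the story above),
# which continues the pattern and emits the DSL for the new query.
print(build_prompt("most wickets by a bowler in 2019"))
```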
0:15:42 Now, once I built it, you know, I was thrilled that this worked.
0:15:44 But I had no idea why it worked.
0:15:50 You know, I stared at that transformer architecture diagram.
0:15:51 I read those papers.
0:15:54 But I couldn’t understand how or why it worked.
0:16:00 So then I started in this journey of developing a mathematical model,
0:16:02 trying to understand how it worked.
0:16:09 So that’s been sort of my journey through this world of AI and LLMs
0:16:12 because I was trying to solve this cricket problem.
0:16:13 Yeah, amazing.
0:16:17 And so maybe reflecting back since the release of GPT-3,
0:16:20 what has most surprised you about how LLMs have developed?
0:16:23 So what has most surprised me?
0:16:25 The pace of development.
0:16:30 So GPT-3 was, you know, it was a nice parlor trick.
0:16:33 And you had to jump through hoops to get it to do something useful.
0:16:39 But then, you know, ChatGPT was an advance over GPT-3.
0:16:43 And then you had all these things like chain of thought, instruction following.
0:16:46 GPT-4 really made it polished.
0:16:51 And, you know, the pace of development has really surprised me.
0:16:54 Now, you know, when I started working with GPT-3,
0:16:57 I could sort of see what its limitations were,
0:17:00 what I could make it do, what I couldn’t make it do.
0:17:03 But I never thought of it as, you know,
0:17:06 what these LLMs have become for me now
0:17:09 and what they have become for millions of people around the world.
0:17:13 We treat these models as our co-workers.
0:17:15 Almost like an intern.
0:17:18 That, you know, you’re constantly chatting with them,
0:17:21 brainstorming, making them do all sorts of work,
0:17:23 which we couldn’t imagine, you know.
0:17:26 Just when ChatGPT was released,
0:17:28 it was nice, it could write poems,
0:17:29 it could write limericks,
0:17:32 it could answer some questions, with some hallucinations.
0:17:36 But the capabilities that have emerged now,
0:17:40 that pace has been very sort of surprising to me.
0:17:42 Do you see progress plateauing?
0:17:45 Or how do you, either now or in the near future,
0:17:46 how do you see it going?
0:17:52 Yes, in some sense, progress is plateauing.
0:17:54 It’s like the iPhone, you know,
0:17:55 when the iPhone came out,
0:17:58 wow, what is this thing?
0:18:01 And the early iterations, you know,
0:18:04 constantly we were amazed by new capabilities.
0:18:07 But the last, you know, seven, eight, nine years,
0:18:11 it’s maybe the camera got a little bit better
0:18:13 or, you know, one thing changed here
0:18:14 or memory is more.
0:18:17 But there has been no fundamental advance
0:18:19 in what it’s capable of.
0:18:24 You can sort of see a similar thing happening with these LLMs.
0:18:29 And this is not true for just one company and one model, right?
0:18:32 You look at what OpenAI is coming up with
0:18:34 or what Anthropic, Google,
0:18:40 or all these open-source Chinese models, or Mistral.
0:18:44 The capabilities of LLMs have not fundamentally changed.
0:18:46 They’ve become better, right?
0:18:47 They’ve improved.
0:18:53 But they have not crossed into a different realm.
0:18:57 So this is something that I really appreciate about your work.
0:19:01 And so the thing that really struck me is
0:19:03 as soon as these things showed up,
0:19:07 you actually got busy trying to have a formal model
0:19:09 of what they’re capable of,
0:19:12 which was in stark contrast to what everybody else was doing.
0:19:13 Everybody else was like,
0:19:16 AGI, these things are going to, you know,
0:19:18 recursively self-improve.
0:19:20 Like, or they’ll say,
0:19:22 Oh, these are just stochastic parrots,
0:19:23 which doesn’t mean anything.
0:19:25 So everybody had rhetoric.
0:19:27 And sometimes this rhetoric was fanciful.
0:19:30 And sometimes this rhetoric was almost reductionist.
0:19:31 Like, oh, it’s just a database,
0:19:33 which is clearly not true.
0:19:35 And the thing that really struck me about your work
0:19:36 is you’re like,
0:19:38 No, let’s figure out exactly what’s going on.
0:19:39 Let’s come up with a formal model.
0:19:41 And once we have a formal model,
0:19:43 we can reason about what that means.
0:19:47 And then, you know, in my reading of your work,
0:19:48 I kind of break it up in two pieces.
0:19:50 There’s the first one where you basically,
0:19:53 you came up with this, you know, matrix abstraction.
0:19:54 I think it’s worth you talking through.
0:19:57 And then you took in-context learning as an example
0:19:59 and you mapped it to Bayesian reasoning,
0:20:01 which to me was incredibly powerful
0:20:02 because at the time,
0:20:04 nobody knew why in-context learning worked.
0:20:08 So I think it’d be great for you to discuss that
0:20:10 because again, I think it was the first real
0:20:11 kind of formal take on, like,
0:20:13 how these things are working.
0:20:16 And then the more recent work
0:20:17 that you’re working on now
0:20:21 is a kind of more generalized version
0:20:25 of what is the state space
0:20:26 that these models output
0:20:29 when it comes to confidence,
0:20:31 which is the manifold that we’re talking about before.
0:20:33 So I think it’d be great
0:20:36 if you just described your matrix model
0:20:38 and then how you use that
0:20:41 to provide some bounds
0:20:43 what in-context learning is doing,
0:20:44 what’s happening.
0:20:45 Okay.
0:20:49 So, yeah, let’s start with that matrix abstraction.
0:20:51 So the idea behind the matrix is
0:20:53 you have this gigantic matrix
0:20:57 where every row corresponds to a prompt.
0:21:02 And then the number of columns of this matrix
0:21:05 is the vocabulary of the LLM,
0:21:07 the number of tokens it has
0:21:07 that it can emit.
0:21:10 So for every prompt,
0:21:13 this matrix contains
0:21:16 the distribution over this vocabulary.
0:21:19 So when you say the cat sat on the,
0:21:20 you know,
0:21:22 the column that corresponds to mat
0:21:25 will have a high probability.
0:21:26 Most of them will be zero.
0:21:27 But, you know,
0:21:30 reasonable continuations
0:21:32 will have a non-zero probability.
0:21:33 And so you can imagine
0:21:35 that there’s this gigantic matrix.
0:21:38 Now, the size of this matrix is, you know, if we just take the old first-generation GPT-3 model, which had a context window of 2,000 tokens and a vocabulary of 50,000 tokens,
0:21:58 then the number of rows in this matrix is more than the number of atoms across all galaxies that we know of.
0:22:07 So clearly, we cannot represent it exactly.
0:22:13 Now, fortunately, a lot of these rows do not appear in real life, right? An arbitrary collection of tokens, you are not going to use that as a prompt. So a lot of these rows are absent, and a lot of the column values are also zero.
0:22:33 When you say the cat sat on the, it's unlikely to be followed by the token corresponding to, let's say, numbers, or, you know, some arbitrary token. There will be only a very small subset of tokens that can follow a particular prompt. So this matrix is very, very sparse.
0:22:53 But even after that sparsity, and even after removing the sort of gibberish prompts, the size of this matrix is too much for these models to represent, even with the trillion parameters.
0:23:05 So what, in an abstract sense, is happening is that the models get trained on data from the training set, and for a small subset of these rows you have reasonable values for the next token distribution.
0:23:29 Whenever you give it something new as a prompt, right, then it'll try to interpolate between what it has learned and what's there in the new prompt, and come up with a new distribution.
0:23:42 But it's basically, so it's more than a stochastic parrot. It is sort of based on this subset of the matrix that it has been trained on.
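(A toy rendering of the matrix abstraction being described, assuming a tiny made-up vocabulary: each row is a prompt, each column a token, and the stored values are next-token probabilities. The interpolation step here is a naive stand-in for what a trained model actually does.)

```python
# Toy "matrix": a few rows (prompts) with sparse next-token distributions.
# Real models cannot store this table explicitly; they compress it into weights.
MATRIX = {
    "the cat sat on the": {"mat": 0.7, "hat": 0.2, "table": 0.1},
    "i'm going out for dinner": {"tonight": 0.5, "with": 0.3, "at": 0.2},
}

def next_token_distribution(prompt: str) -> dict:
    """Return a next-token distribution, falling back to a crude blend of stored
    rows when the exact prompt was never seen (the interpolation idea above)."""
    prompt = prompt.lower()
    if prompt in MATRIX:
        return MATRIX[prompt]
    blended, total = {}, 0.0
    for known_prompt, dist in MATRIX.items():
        # Weight each known row by its word overlap with the new prompt.
        overlap = len(set(prompt.split()) & set(known_prompt.split()))
        for token, p in dist.items():
            blended[token] = blended.get(token, 0.0) + overlap * p
        total += overlap
    return {token: p / total for token, p in blended.items()} if total else {}

print(next_token_distribution("i'm going out for dinner with martin tonight"))
```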
0:23:53 So when I say, you know, I'm going out for dinner with Martin tonight, now, I'm reasonably sure that it has never encountered that phrase in its training data, right?
0:24:09 But it has encountered variants of this phrase, and given that I'm going out with Martin, it can produce a Bayesian posterior. It uses that evidence, that Martin is the one that I'm going for dinner with, and it'll produce a next token distribution that will focus on the likely places that we are going.
0:24:30 So this matrix is represented in a compressed way, yet the models respond to everything, every prompt. How do they do it? Well, they go back to what they've been trained on, interpolate there, and use the prompt as sort of some evidence to compute a new distribution.
0:24:53 Which, so right, so the context of the prompt impacts the posterior distribution.
0:25:00 Exactly, yeah.
0:25:01 Right. And you mapped it to Bayesian learning, where the context is the new evidence.
0:25:11 New evidence, exactly. To learn from.
0:25:12 So I'll give you, for instance, the cricket example that I spoke about earlier.
0:25:17 Yeah.
0:25:19 So I created my own DSL, which, you know, mapped a natural language query in cricket to this DSL, which then I can translate into a SQL query or a REST API or whatever. But getting the DSL is important.
0:25:32 Now, these LLMs have never seen that DSL. I designed it. But yet, after showing a few examples, it learned it. How did it learn it?
0:25:43 And this is in the prompt. You didn't do any training, no post-training. It's in the prompt.
0:25:48 100% in the prompt, right?
0:25:49 So, like, that's the way to understand it. Yeah, yeah.
0:25:52 This was happening in October 2020, right? I had no access to the internals of OpenAI. I could just, you know, access the API. OpenAI had no access to the internal structure of Stats Guru or the DSL that I cooked up in my head.
0:26:08 Yet, after showing it only a few examples, it learned it right away.
0:26:13 So that's an example where it has seen DSLs or structures in the past, and now, using this evidence that I show, okay, this is what my DSL looks like, for a new natural language query it is able to create the right posterior distribution for the tokens that map to the examples that I've shown.
0:26:35 Now, the other beautiful thing about this is that this is an example of few-shot learning, or in-context learning, right?
0:26:45 But when I give that prompt along with these examples to this LLM, I'm not saying to the LLM, okay, this is an example of few-shot learning, so learn from these examples.
0:26:56 You just pass this to the LLM as a prompt, and it processes it exactly the way it would process any other prompt which is not an example of in-context learning.
0:27:09 So that really means that the underlying mechanism is the same, whether you give a set of examples and then ask it to complete a task, like in-context learning, or just give it some prompt for continuation, like I'm going out for dinner with Martin tonight. There's no in-context learning there.
0:27:30 But the process with which it's generating, or doing this inferencing, is exactly the same. And that's what I have been trying to model and come up with a formal model of.
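(A toy Bayesian-update view of what the in-context examples are doing, with made-up numbers: the demonstrations in the prompt act as evidence that shifts the model's posterior over which kind of continuation is being asked for. This is a sketch of the intuition, not the actual inference inside a transformer.)

```python
def normalize(weights: dict) -> dict:
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}

# Prior over what the text might be: ordinary prose vs. the cricket-DSL pattern.
posterior = {"plain_english": 0.95, "cricket_dsl": 0.05}

# Assumed likelihood of seeing one "Q: ... / DSL: ..." demonstration under each hypothesis.
likelihood = {"plain_english": 0.01, "cricket_dsl": 0.8}

# Each in-context example multiplies in its likelihood (Bayes' rule), so a handful
# of demonstrations makes the DSL-style continuation overwhelmingly likely.
for _ in range(3):  # three few-shot examples in the prompt
    posterior = normalize({h: posterior[h] * likelihood[h] for h in posterior})

print(posterior)  # the probability mass collapses onto "cricket_dsl"
```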
0:27:43 What I’ve
0:27:43 found very
0:27:44 impressive
0:27:46 is you’ve
0:27:46 used this
0:27:48 basic model
0:27:49 to show
0:27:49 a number
0:27:50 of things,
0:27:50 right,
0:27:50 to describe
0:27:51 context learning
0:27:52 and to map
0:27:52 the Bayesian
0:27:52 learning.
0:27:53 But you
0:27:53 did it
0:27:53 for another
0:27:54 one where
0:27:55 you’ve
0:27:55 sketched out
0:27:56 this almost
0:27:57 glib argument
0:27:58 on Twitter,
0:27:58 on X,
0:27:59 where you
0:28:00 made this
0:28:05 rough argument
0:28:06 for why
0:28:06 recursive
0:28:07 self-improvement
0:28:09 can’t happen
0:28:09 without
0:28:10 additional
0:28:11 information.
0:28:11 And so
0:28:13 maybe just
0:28:13 walk through
0:28:14 very quickly
0:28:14 how this
0:28:15 same model
0:28:15 you can
0:28:16 just very
0:28:17 quickly show
0:28:17 that a
0:28:18 model can
0:28:18 never
0:28:19 recursively
0:28:20 self-improve.
0:28:22 So, you know, another phrase that we've been using recently is, you know, the output of the LLM is the inductive closure of what it has been trained on.
0:28:36 Yeah.
0:28:37 So when you say that it can recursively self-improve, it could mean one of two things. So let's get back to the...
0:28:47 Well, actually, you know what's kind of interesting is, often most people agree that if you have one LLM and you just feed the output back into the input, it's not going to do anything.
0:28:57 But then often people will say, well, what if you have two LLMs, you have no external information, but you have two LLMs talking to each other, maybe they can improve each other, and then you can have, like, you know, a takeoff scenario.
0:29:07 But again, you even address this, even in the case of, like, N number of LLMs, using kind of the matrix model to show that you just aren't getting any new information.
0:29:20 So you can represent the sort of information contained in these models. And let's go back to that matrix analogy that I have, the matrix abstraction.
0:29:30 So like I said, you know, these models represent a subset of the rows. So a subset of the rows are represented, but some of these rows are able to help fill out some of the missing rows.
0:29:51 For instance, you know, if the model knows how to do multiplication step by step, then every row corresponding to, let's say, 769 times 1025 or whatever, all those multiplications, it can fill out the answer. Because it has those algorithms sort of embedded in it, you just need to unroll them.
0:30:10 So it can sort of self-improve up to a point. But beyond a point, these models can only sort of generate what they have been trained on.
0:30:21 So let me give you, I'll give you three examples.
0:30:25 Yeah.
0:30:27 So any model, any LLM that was trained on pre-1915 physics would never have come up with a theory of relativity.
0:30:41 Einstein had to sort of reject the Newtonian physics and come up with a space-time continuum. He completely rewrote the rules.
0:30:49 So that is an example of AGI, where you are generating new knowledge. It's not simply unrolling what's already known about the universe. It's actually discovering something fundamental about the universe. And for that you have to go outside your training set.
0:31:07 Similarly, any LLM that was trained on classical physics would not have come up with quantum mechanics. That's wave-particle duality, or this whole probabilistic notion, or that energy is not continuous but is quantized. You had to reject Newtonian physics.
0:31:24 Or Gödel's incompleteness theorem. He had to go outside the axioms to say that, okay, it is incomplete.
0:31:32 So those are examples where you're creating new science or fundamentally new results. That kind of self-improvement is not possible with these architectures. They can refine, they can fill out these rows where the answer already exists.
0:31:50 Another example which has received a lot of press these days is these IMO results, the International Math Olympiad.
0:31:58 Whether it's a human solving it or the LLM solving it, they are not inventing new kinds of math. They are able to connect known results in a sequence of steps to come up with the answer.
0:32:16 So even the LLMs, what they are doing is they are exploring all sorts of solutions. In some of these solutions, they start going on this path where their next token entropy is low.
0:32:30 So that's where I say they are in that Bayesian manifold, where you have this entropy collapse, and by doing those steps you arrive at the answer.
0:32:41 But you're not inventing new math. You're not inventing new axioms or new branches of mathematics. You're sort of using what you've been trained on to arrive at that answer.
0:32:52 So those things LLMs can do, and you know, they'll get better at connecting the known dots. But for creating new dots, I think we need an architectural advance.
0:33:07 So Martin was talking earlier about how the discourse was either stochastic parrots or AGI recursively self-improving.
0:33:16 How do you conceive of the AGI discourse, or even the concept? What does it mean, to the extent that it's useful? How do you think about that?
0:33:27 The way I think about it, the way we've tried to formulate it in our papers, is that it's beyond a stochastic parrot, but it's not AGI. It's doing Bayesian reasoning over what it has been trained on. It's a lot more sophisticated than just a stochastic parrot.
0:33:44 How do you define AGI?
0:33:46 Okay, so how do I define AGI?
0:33:52 The way I would say it is that LLMs currently navigate through this known Bayesian manifold. AGI will create new manifolds. So right now these models navigate, they do not create.
0:34:09 AGI will be when we are able to create new science, new results, new math.
0:34:15 When an AGI comes up with a theory of relativity, I mean, it's an extremely high bar, but you get what I'm saying. It has to go beyond what it has been trained on to come up with new paradigms, new science. And that's my definition of AGI.
0:34:32 Vishal, do you think that, based on the work you've done, you can bound the amount of data or compute that would be needed in order for it to evolve the manifold?
0:34:45 So one of the problems, if you just take LLMs as they exist, is there was so much data used to create them. To create a new manifold will need a lot more data, just because of the basic mechanisms, right? Otherwise it'll just, kind of, you know, get consumed into the existing set of data.
0:35:04 Like, have you found any bounds on what would be needed to actually evolve the manifold in a useful way, or do you think we just need a new architecture?
0:35:14 I personally think that we need a new architecture.
0:35:19 The more data that we have, the more compute we have, we'll get maybe smoother manifolds. So it's like a map.
0:35:25 there’s this view
0:35:26 that people have
0:35:26 they’re like
0:35:27 well
0:35:28 Vishal
0:35:29 this is all
0:35:31 good and well
0:35:32 but you know
0:35:32 I could just
0:35:33 take an LLM
0:35:34 and I can give it
0:35:34 eyes
0:35:35 and I can give it
0:35:35 ears
0:35:36 and I can put it
0:35:36 in the world
0:35:37 and it’ll gain
0:35:38 information
0:35:38 and based on that
0:35:40 it’ll improve
0:35:41 itself
0:35:42 and therefore
0:35:43 it can learn
0:35:44 new things
0:35:44 but the
0:35:45 counterpoint
0:35:46 that I’ve always
0:35:46 just intuitively
0:35:47 thought to that
0:35:48 is
0:35:49 the amount
0:35:49 of data
0:35:50 used to train
0:35:50 these things
0:35:51 is so large
0:35:52 how much
0:35:53 can you actually
0:35:53 evolve that
0:35:54 manifold
0:35:54 given an
0:35:55 incremental
0:35:55 I mean
0:35:56 almost none
0:35:56 at all
0:35:57 right
0:35:57 there has
0:35:58 to be
0:35:59 some other
0:36:00 way to generate
0:36:01 new manifolds
0:36:01 that aren’t
0:36:02 evolving the
0:36:03 existing one
0:36:05 I completely agree. There has to be a new sort of architectural leap that is needed to go beyond the current approach. You know, just throwing more data and more compute, you know, it's going to plateau. It's, you know, the iPhone 15, 16, 17.
0:36:20 And are there any research directions that are promising in your mind, that might help us, you know, go beyond LLM limitations?
0:36:28 So, I mean, again, I love LLMs. They are fantastic, and they are going to increase productivity like nobody's business. But I don't think they are the answer.
0:36:39 So, you know, Yann LeCun famously says that LLMs are a distraction on the road to AGI. I'm not quite in that camp, but I think we need a new architecture, to sit on top of LLMs, to reach AGI.
0:36:59 You know, a very basic thing, you know, what Martin just said: you give them eyes and you give them ears, you make them multi-modal. Of course they'll become more powerful, but you need a little bit more than that.
0:37:09 You know, the way human brains learn, with very few examples, that's not the way transformers learn.
0:37:18 And, you know, I'm not saying that we need to create an Einstein or a Gödel, but there has to be an architectural leap that is able to create these manifolds. And just throwing new data at it will not do it. It'll just smoothen out the already existing manifolds.
0:37:33 So is your goal to actually help, like, think through new architectures, or are you primarily focused on putting formal bounds on existing architectures?
0:37:45 A bit of both. I mean, the former goal is the more ambitious one, the one that everybody is chasing, and yeah, I think about that constantly.
0:37:54 Are there any, even, like, sort of hints of a new architecture? Like, have we started to make any progress on new architectures, or is it...?
0:38:10 You know, Yann LeCun has been pushing this JEPA architecture, energy-based architectures. They seem promising.
0:38:20 The way I have been sort of thinking about it is, you know, there's this set of benchmarks, the ARC Prize, right, that Mike and François Chollet have.
0:38:38 And if you understand why the LLMs are failing on this test, maybe you can sort of reverse engineer a new architecture that will help you succeed on that, right?
0:38:52 And I agree with a lot of what several people say, that, you know, language is great, but language is not the answer.
0:39:01 You know, when I'm looking at catching a ball that is coming to me, I'm mentally doing that simulation in my head. I'm not translating it to language to figure out where it will land. I do that simulation in my head.
0:39:15 So, you know, one of the new architectural things is: how do we get these models to do approximate simulations, to test out an idea and decide whether to proceed or not?
0:39:31 So, yeah, another thing that I've always wondered about is: did we, as humans, develop language because we were intelligent, or is it because we developed language that we accelerated our intelligence? I don't know which side of that camp you fall on, on that question.
0:39:54 what’s
0:39:54 interesting
0:39:55 is
0:39:55 like
0:39:55 you
0:39:55 have
0:39:55 these
0:39:56 anecdotal
0:39:57 examples
0:39:58 of
0:40:00 humans
0:40:00 developing
0:40:01 languages
0:40:01 de novo
0:40:01 that have
0:40:01 been
0:40:02 recorded
0:40:02 right
0:40:02 like
0:40:03 it’s
0:40:03 either
0:40:03 the
0:40:04 Guatemalan
0:40:05 or Nicaraguan
0:40:06 sign language
0:40:06 right
0:40:07 where there
0:40:07 is these
0:40:07 students
0:40:08 that develop
0:40:09 their own
0:40:10 language
0:40:11 without being
0:40:11 taught
0:40:12 and so
0:40:12 that would
0:40:13 suggest
0:40:13 that language
0:40:14 just follows
0:40:14 intelligence
0:40:16 the problem
0:40:16 is they’re
0:40:17 all anecdotal
0:40:17 right
0:40:18 like who
0:40:18 knows
0:40:19 if somebody
0:40:19 didn’t teach
0:40:20 them sign
0:40:20 language
0:40:20 like nobody
0:40:21 really knows
0:40:21 there is
0:40:22 no controls
0:40:23 so this is
0:40:23 all these
0:40:24 observational
0:40:24 studies
0:40:25 and there’s
0:40:26 so few
0:40:26 of them
0:40:27 you have
0:40:28 to wonder
0:40:28 if it’s
0:40:29 just kind
0:40:30 of sloppy
0:40:31 observation
0:40:31 and so I
0:40:31 think that
0:40:32 the question
0:40:32 is still
0:40:33 outstanding
0:40:34 Yeah. So, I mean, language definitely accelerated our intelligence, there's no question about that. But which followed which, we don't know.
0:40:46 I view it as a networking problem, naturally, which is: once you have language, you can communicate, and when you can communicate, you can store, you can replicate.
0:40:53 Yeah, yeah, exactly.
0:40:55 Cool.
0:40:57 Again, this is kind of a wonky question. But, you know, I think one thing that you've brought to the discourse, and for those that are listening to this, I really think that you should look up Vishal's work and read it. Especially if you have a systems background, like a networking systems background, I just think it'll give you a really, really good understanding of kind of the bounds on these models.
0:41:16 But, like, the toolkit that you draw from is, like, information theory and more formal methods. Have you found that the AI community is receptive to this? Or is it like two different cultures, two different planets trying to communicate, and not a lot of common ground? Like, how have you found bringing the networking view of the world to the AI realm?
0:41:40 Some of them are receptive to it, definitely. But, you know, these large conferences and their reviewing processes, it's so random, and the kind of questions they ask. You know, I'm a modeling person, I like to model things.
0:42:00 And, you know, I submitted one version of this work to one very famous machine learning or AI conference, and the reviewer said, okay, this is a model, so what? So there is...
0:42:20 That's absolutely remarkable. So, like, you've actually taken a system that nobody understands, that we have no models for, you actually provided some model that we can use to analyze it, and that alone wasn't sufficient. They're asking, so where are the large-scale experiments to prove this?
0:42:35 I do, listen, I honestly, I mean, I find there's so much empiricism in the current AI community exactly because we don't understand the systems.
0:42:46 You know, it kind of reminds me, I feel like systems went the other way, right? It's like, we had all of these models, but then we didn't understand how the systems worked, and then we just actually did measurement.
0:42:55 It feels like ML, or the AI stuff, is the opposite, which is, like, we know we don't understand them, and so we just measure them, but now we're trying to come up with the models.
0:43:06 Yeah, exactly. So it was so easy, in some sense, to build these artifacts and then just measure them, that people have been going around trying to do that.
0:43:19 And one term I really dislike is prompt engineering. Why? You know, engineering used to mean sending a man to the moon, or providing five nines reliability. Prompt engineering is prompt twiddling.
0:43:34 You fiddle with a prompt, and the output changes, the inference changes. And, you know, you have, like, hundreds of papers just, you know, doing one experiment after the other, changing a prompt this way, that way, and writing up their observations.
0:43:49 And as a result, you know, lots of these papers are being written, are being submitted for review, and reviewers get busy looking at all this kind of empirical work.
0:43:59 And my personal taste is to first try to understand, to model it, and then you can do the other thing.
0:44:09 So, like a true theory guy, I don't know about this bit twiddling.
0:44:15 Let me ask one more LLM question, which is: are there any benchmarks or real-world tasks that, if they occurred, you'd sort of reevaluate and say, hey, maybe LLMs are closer to the path to AGI than I thought?
0:44:31 If the real-world tasks... Good question.
0:44:48 You know, for LLMs, or these models, the one domain where you have the most training data is probably coding. And coding is where you can also have the most structure.
0:45:12 And yet, anyone who has used these tools, whether it's Cursor or whatever, or Claude Code, LLMs continue to hallucinate, continue to generate unreasonable code. You know, you have to constantly babysit these models.
0:45:33 So the day an LLM can create a large software project without any babysitting is the day I'll be a little bit convinced that it's much closer. But again, I don't think it'll be able to create new science. If it does, that's when I'll be convinced.
0:45:54 I think you can almost take a definitional approach to answering this question, Vishal. The problem with these types of questions is, if you have billions of dollars and you can collect whatever data you want, you can make a model do anything you want, right?
0:46:07 And so, you know, at some level you’ve got this entire capital-structure machinery behind these models, so you’re like, oh, it can be good at science.
0:46:18 Well, sure: you put a billion dollars into solving materials science and collect all this data, and you’ll be good at materials science, or whatever it is.
0:46:24 But there is a definitional answer, which is, and I’m going to draw from your work, which is: there is a manifold that’s in there, based on the data it’s been trained on.
0:46:35 And then the question is whether it ever produces something that’s off of that, like a new manifold.
0:46:40 So considering the existing training data, if it ever does that, if it does something that’s outside of that distribution, then clearly we’re on a path to learning new things.
0:46:49 And if not, then everything is just a computational step from what’s already known.
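
To make the criterion Martin is describing concrete, here is a minimal, hypothetical sketch (not anything from Misra’s papers): fit a low-dimensional linear “manifold” to embeddings of training data with PCA, then flag an output whose reconstruction error falls far outside what the training data itself exhibits. Real training manifolds are nonlinear and far higher-dimensional, so treat this purely as an illustration of the “off the manifold” test.

```python
# Hypothetical off-manifold test: fit a low-dimensional linear subspace (PCA)
# to embeddings of training data, then flag new outputs whose reconstruction
# error is far beyond what the training data itself shows.
import numpy as np

def fit_manifold(train_embeddings: np.ndarray, k: int = 10):
    """Return the mean and top-k principal directions of the training embeddings."""
    mean = train_embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(train_embeddings - mean, full_matrices=False)
    return mean, vt[:k]                      # shapes: (d,), (k, d)

def reconstruction_error(x: np.ndarray, mean: np.ndarray, components: np.ndarray) -> float:
    """Distance from x to its projection onto the fitted subspace."""
    centered = x - mean
    projected = components.T @ (components @ centered)
    return float(np.linalg.norm(centered - projected))

# Toy usage: the embeddings here are random stand-ins for whatever
# representation of model outputs you actually trust.
rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 64))
mean, comps = fit_manifold(train, k=10)

baseline = np.array([reconstruction_error(e, mean, comps) for e in train])
threshold = np.percentile(baseline, 99)      # tolerance for "on-manifold"

candidate = rng.normal(size=64) * 5          # pretend this is a new model output
off_manifold = reconstruction_error(candidate, mean, comps) > threshold
print("outside the training distribution?", off_manifold)
```

Any real test would need a much richer notion of the manifold (and of “new”), but the sketch separates “navigating the existing manifold better” from “landing somewhere the training data can’t explain.”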
0:46:55 I guess the counter to that would be: maybe all humans do is work on their own manifold, and Einstein was lucky or something. I guess that would be the counter to that.
0:47:09 There are several, many, Einstein examples, and it’s creating this new manifold. I didn’t want to use that definitional answer; I thought it might sound too wonky, too mathematical.
0:47:21 But essentially, if LLMs really created this new manifold, then I would be convinced.
0:47:29 But so far they have just gotten better at navigating the existing manifold, the existing training set, which is hugely powerful and is going to change the world. I’m not denying that.
0:47:38 I think they are extremely good at what they can do, but there’s a limit to what they can do.
0:47:44 So I have one quick question: what’s next for you? You’ve tackled in-context learning, you’ve got a model for LLMs, and you’ve got a generalized model for their solution space. What are you thinking about tackling next?
0:47:58 In terms of modeling an LLM, or academically?
0:48:02 Academically, I’m, you know, I’m thinking of this: what is the architectural leap that is needed to create this new manifold, and how do we use multimodal data to expand it?
0:48:21 Awesome. You’ll come back and talk to us.
0:48:25 That’s right.
0:48:27 So, I mean, you know, even with LLMs, in the paper we say that you can improve the inference by following this low, or minimum, entropy path.
0:48:42 So that’s a very sort of small step that we are taking: we are building and training models that will do inference based on the entropy path.
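
As a rough illustration of what following a minimum-entropy path during inference could look like, here is a small, hypothetical sketch; it is not the algorithm from the paper, and `next_token_probs` is a stand-in for a real model’s next-token distribution. At each step it prefers the candidate token that leaves the model with the lowest-entropy (most certain) next-token distribution.

```python
# Hypothetical "minimum-entropy path" decoding: one-step lookahead that picks,
# among the top-k candidate tokens, the one whose resulting context leaves the
# model with the lowest-entropy next-token distribution.
import math

def entropy(probs):
    """Shannon entropy (nats) of a next-token distribution given as {token: prob}."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)

def min_entropy_step(next_token_probs, seq, topk=5):
    """Pick the candidate token whose continuation the model is most certain about."""
    candidates = sorted(next_token_probs(seq).items(), key=lambda kv: -kv[1])[:topk]
    best_tok, best_H = None, float("inf")
    for tok, _ in candidates:
        lookahead_H = entropy(next_token_probs(seq + [tok]))
        if lookahead_H < best_H:
            best_tok, best_H = tok, lookahead_H
    return best_tok

def min_entropy_decode(next_token_probs, prefix, steps=10):
    seq = list(prefix)
    for _ in range(steps):
        seq.append(min_entropy_step(next_token_probs, seq))
    return seq

# Toy stand-in "model": a tiny Markov chain over three tokens, just to make the
# sketch runnable; a real implementation would query an LLM's logits instead.
TOY = {
    "a": {"b": 0.9, "c": 0.1},
    "b": {"c": 0.8, "a": 0.2},
    "c": {"a": 0.5, "b": 0.5},
}
def toy_next_token_probs(seq):
    return TOY[seq[-1]]

print(min_entropy_decode(toy_next_token_probs, ["a"], steps=5))
```

Whether such a heuristic actually improves inference depends entirely on the model and task; the point is only to show, operationally, what “following the entropy path” could mean.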
0:48:51 Yeah. By the way, is model probe still up?
0:48:56 Token probe? Yeah, yeah, token probe is still up.
0:48:58 And you can see it, actually. You know, token probe is software that we built, and thanks to Martin and A16Z’s generosity it’s running on your servers, and anyone can go and test it. And what we have done there is we actually show the entropy.
0:49:15 Yeah, it is so enlightening. I recommend anybody listening to this who’s interested actually check out token probe. It shows you the confidence as you go along; it’s remarkable.
0:49:25 So with in-context learning, you know, you create your new DSL and you give it in the prompt, and you can see the confidence rising with each new example, the entropy reducing.
0:49:36 And that sort of is a validation of the model; you can see it sort of unfurling right in front of your eyes. So token probe is...
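
For a sense of the kind of signal being described, here is a small, hypothetical sketch of computing per-token confidence and next-token entropy from a causal language model’s logits, using the Hugging Face transformers API with GPT-2 purely as a stand-in; it is not the actual token probe implementation. With more in-context examples of a new DSL in the prompt, you would expect the entropy on the final query tokens to drop.

```python
# Hypothetical per-token confidence/entropy readout from a causal LM's logits.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def per_token_stats(prompt: str):
    """Return (token, confidence, entropy) for each predicted position in the prompt."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0]                   # (seq_len, vocab)
    probs = torch.softmax(logits[:-1], dim=-1)          # distribution over each next token
    next_ids = ids[0, 1:]
    confidence = probs[torch.arange(len(next_ids)), next_ids]
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=-1)
    tokens = tokenizer.convert_ids_to_tokens(next_ids.tolist())
    return list(zip(tokens, confidence.tolist(), entropy.tolist()))

# Toy usage: a made-up mini "DSL" with two examples followed by a query; compare
# against a prompt with more examples and watch the final tokens' entropy shrink.
for tok, conf, H in per_token_stats("foo -> bar\nbaz -> qux\nfoo ->")[-3:]:
    print(f"{tok!r:>8}  confidence={conf:.3f}  entropy={H:.2f}")
```

The prompt here is a toy; the interesting comparison is between few-shot prompts of increasing length, which is exactly the rising-confidence, falling-entropy effect described above.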
0:49:44 Thanks. Thanks again, Vishal, thanks so much for coming on the podcast. It was a great conversation.
0:49:49 It was great fun. Thank you, thank you so much again.
0:49:54 Thanks for listening to this episode of the A16Z podcast. If you liked this episode, be sure to like, comment, subscribe, leave us a rating or review, and share it with your friends and family.
0:50:05 For more episodes, go to YouTube, Apple Podcasts, and Spotify. Follow us on X at A16Z and subscribe to our Substack at a16z.substack.com.
0:50:15 Thanks again for listening, and I’ll see you in the next episode.
0:50:19 As a reminder, the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any A16Z fund.
0:50:33 Please note that A16Z and its affiliates may also maintain investments in the companies discussed in this podcast.
0:50:38 For more details, including a link to our investments, please see a16z.com/disclosures.
0:51:21 Thank you.

From GPT-1 to GPT-5, LLMs have made tremendous progress in modeling human language. But can they go beyond that to make new discoveries and move the needle on scientific progress?

We sat down with distinguished Columbia CS professor Vishal Misra to discuss this, plus why chain-of-thought reasoning works so well, what real AGI would look like, and what actually causes hallucinations.

 

Resources:

Follow Dr. Misra on X: https://x.com/vishalmisra

Follow Martin on X: https://x.com/martin_casado

 

Stay Updated: 

If you enjoyed this episode, be sure to like, subscribe, and share with your friends!

Find a16z on X: https://x.com/a16z

Find a16z on LinkedIn: https://www.linkedin.com/company/a16z

Listen to the a16z Podcast on Spotify: https://open.spotify.com/show/5bC65RDvs3oxnLyqqvkUYX

Listen to the a16z Podcast on Apple Podcasts: https://podcasts.apple.com/us/podcast/a16z-podcast/id842818711

Follow our host: https://x.com/eriktorenberg

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.

