0:00:05 The following is a conversation with Dario Amodei, CEO of Anthropic,
0:00:09 the company that created Claude, that is currently and often at the top of most
0:00:15 LLM benchmark leaderboards. On top of that, Dario and the Anthropic team
0:00:20 have been outspoken advocates for taking the topic of AI safety very seriously,
0:00:27 and they have continued to publish a lot of fascinating AI research on this and other topics.
0:00:32 I’m also joined afterwards by two other brilliant people from Anthropic.
0:00:39 First, Amanda Askell, who is a researcher working on alignment and fine-tuning of Claude,
0:00:43 including the design of Claude’s character and personality.
0:00:49 A few folks told me she has probably talked with Claude more than any human at Anthropic,
0:00:55 so she was definitely a fascinating person to talk to about prompt engineering
0:00:58 and practical advice on how to get the best out of Claude.
0:01:05 After that, Chris Olah stopped by for a chat. He’s one of the pioneers of the field of
0:01:10 mechanistic interpretability, which is an exciting set of efforts that aims to reverse
0:01:18 engineer neural networks to figure out what’s going on inside, inferring behaviors from neural
0:01:24 activation patterns inside the network. This is a very promising approach for keeping future
0:01:30 super intelligent AI systems safe. For example, by detecting from the activations
0:01:34 when the model is trying to deceive the human it is talking to.
0:01:40 And now a quick few second mention of a sponsor. Check them out in the description.
0:01:46 It’s the best way to support this podcast. We got Encord for machine learning, Notion for
0:01:52 machine learning powered note taking and team collaboration, Shopify for selling stuff online,
0:01:59 BetterHelp for your mind and Element for your health. Choose wisely, my friends. Also,
0:02:02 if you want to work with our amazing team, or just want to get in touch with me for whatever
0:02:08 reason, go to lexfridman.com/contact. And now onto the full ad reads. I try to make these
0:02:13 interesting, but if you skip them, please still check out our sponsors. I enjoy their stuff. Maybe
0:02:19 you will too. This episode is brought to you by Encord, a platform that provides data focused
0:02:25 AI tooling for data annotation, curation and management and for model evaluation.
0:02:31 We talk a little bit about public benchmarks in this podcast. I think mostly focused on
0:02:37 software engineering, SWE bench. There’s a lot of exciting developments about how do you have
0:02:42 a benchmark that you can’t cheat on. But if it’s not public, then you can use it the right way,
0:02:48 which is to evaluate how well the annotation, the data curation, the training, the pre-training,
0:02:53 the post-training, all of that is working. Anyway, a lot of the fascinating conversation
0:03:00 with the anthropic folks was focused on the language side. And there’s a lot of really
0:03:06 incredible work that Encord is doing about annotating and organizing visual data. And they
0:03:15 make it accessible for searching, for visualizing, for granular curation, all that kind of stuff.
0:03:20 So I’m a big fan of data. It continues to be the most important thing. The nature of data and what
0:03:25 it means to be good data, whether it’s human generated or synthetic data keeps changing,
0:03:32 but it continues to be the most important component of what makes for a generally
0:03:38 intelligent system, I think, and also for specialized intelligent systems as well.
0:03:45 Go try out Encord to curate, annotate, and manage your AI data at encord.com/lex. That’s
0:03:53 encord.com/lex. This episode is brought to you by the thing that keeps getting better and better
0:03:59 and better: Notion. It used to be an awesome note-taking tool. Then it started being a great team
0:04:04 collaboration tool, so note-taking for many people and management of all kinds of other project stuff
0:04:13 across large teams. Now, more and more, it is becoming an AI-superpowered note-taking and team
0:04:20 collaboration tool, really integrating AI probably better than any note taking tool I’ve used,
0:04:25 not even close, honestly. Notion is truly incredible. I haven’t gotten a chance to use
0:04:32 Notion on a large team. I imagine that that’s really when it begins to shine. But on a small team, it’s
0:04:38 just really, really, really amazing. The integration of the assistant inside a particular
0:04:44 file for summarization for generation, all that kind of stuff. But also the integration of an AI
0:04:50 assistant to be able to ask questions about, you know, across docs, across wikis, across projects,
0:04:57 across multiple files, to be able to summarize everything, maybe investigate project progress
0:05:02 based on all the different stuff going on in different files. So really, really nice integration
0:05:11 of AI. Try Notion AI for free when you go to notion.com/lex. That’s all lowercase. Notion.com/lex
0:05:16 to try the power of Notion AI today. This episode is also brought to you by Shopify,
0:05:22 a platform designed for anyone to sell anywhere with a great looking online store. I keep wanting
0:05:27 to mention Shopify’s CEO, Toby, who’s brilliant. And I’m not sure why he hasn’t been on the podcast
0:05:33 yet. I need to figure that out. Every time I’m in San Francisco, I want to talk to him. So he’s
0:05:38 brilliant on all kinds of domains, not just entrepreneurship or tech, just philosophy and
0:05:43 life, just his way of being. Plus an accent adds to the flavor profile of the conversation.
0:05:49 I’ve been watching a cooking show for a little bit. Really, I think my first cooking show.
0:05:57 It’s called Culinary Class Wars. It’s a South Korean show where chefs with Michelin stars compete
0:06:02 against chefs without Michelin stars. And there’s something about one of the judges
0:06:09 that just the charisma and the way that he describes every single detail of flavor, of
0:06:15 texture, of what makes for a good dish. Yeah, so it’s contagious. I don’t really even care. I’m
0:06:21 not a foodie. I don’t care about food in that way, but he makes me want to care. So anyway,
0:06:26 that’s why I use the term flavor profile, referring to Toby, which has nothing to do with
0:06:33 what I should probably be saying. And that is that you should use Shopify. I’ve used Shopify.
0:06:40 Super easy, create a store, lexfridman.com/store to sell a few shirts. Anyway, sign up for a $1 per
0:06:46 month trial period at Shopify.com/lex. That’s all lowercase. Go to Shopify.com/lex to take
0:06:52 your business to the next level today. This episode is also brought to you by BetterHelp,
0:06:58 spelled H-E-L-P, Help. They figure out what you need to match you with a licensed therapist
0:07:03 in under 48 hours. It’s for individuals. It’s for couples. It’s easy to create affordable,
0:07:11 available worldwide. I saw a few books by a Jungian psychologist. And I was like in a
0:07:16 delirious state of sleepiness, and I forgot to write his name down, but I need to do some research.
0:07:23 I need to go back. I need to go back to my younger self when I dreamed of being a psychiatrist and
0:07:31 reading Sigmund Freud, and reading Carl Jung, and reading it the way young kids maybe read
0:07:40 comic books. They were my superheroes of sorts. Camus as well, Kafka, Nietzsche, Hesse, Dostoevsky,
0:07:47 the sort of 19th and 20th century literary philosophers of sorts. Anyway, I need to go
0:07:55 back to that. Maybe have a few conversations about Freud. Anyway, those folks, even if in part wrong,
0:08:01 are true revolutionaries, truly brave to explore the mind in the way they did. They showed
0:08:08 the power of talking and delving deep into the human mind, into the shadow through the use of
0:08:14 words. So highly recommend. And BetterHelp is a super easy way to start. Check them out at
0:08:21 betterhelp.com/lex and save in your first month. That’s betterhelp.com/lex. This episode is also
0:08:27 brought to you by Element, my daily zero sugar and delicious electrolyte mix that I’m going to take a
0:08:33 sip of now. It’s been so long that I’ve been drinking Element that I don’t even remember life
0:08:40 before Element. I guess I used to take salt pills because it’s such a big component of my exercise
0:08:46 routine to make sure I get enough water and get enough electrolytes. Yeah, so combined with the
0:08:52 fasting that I’ve explored a lot and continue to do to this day, and combined with low carb diets
0:09:02 that I’m a little bit off the wagon on that one. I’m consuming probably like 60, 70, 80,
0:09:10 maybe 100 some days, grams of carbohydrates. Not good, not good. My happiest is when I’m below 20
0:09:16 grams or 10 grams of carbohydrates. I’m not like measuring it out, I’m just using numbers to sound
0:09:21 smart. But I don’t take dieting seriously, but I do take the signals that my body sends quite
0:09:30 seriously. So without question, making sure I get enough magnesium and sodium and get enough water
0:09:36 is priceless. A lot of times, headaches or times when I just felt off or whatever were fixed
0:09:43 near immediately, or sometimes after 30 minutes, when I just drink water with electrolytes. It’s beautiful
0:09:48 and it’s delicious. Watermelon salt, the greatest flavor of all time. Get a sample pack for free
0:09:56 with any purchase, try it at drinkLMNT.com/lex. This is the Lex Fridman podcast. To support it,
0:10:10 please check out our sponsors in the description. And now, dear friends, here’s Dario Amodei.
0:10:24 Let’s start with a big idea of scaling laws and the scaling hypothesis. What is it?
0:10:31 What is its history and where do we stand today? So I can only describe it as it relates to kind of
0:10:36 my own experience, but I’ve been in the AI field for about 10 years. And it was something I noticed
0:10:43 very early on. So I first joined the AI world when I was working at Baidu with Andrew Ng in late
0:10:49 2014, which is almost exactly 10 years ago now. And the first thing we worked on was speech recognition
0:10:55 systems. And in those days, I think deep learning was a new thing. It had made lots of progress,
0:11:00 but everyone was always saying we don’t have the algorithms we need to succeed.
0:11:06 We’re only matching a tiny, tiny fraction. There’s so much we need to kind of discover
0:11:13 algorithmically. We haven’t found the picture of how to match the human brain. And when, you know,
0:11:16 in some ways it was fortunate, I was kind of, you know, you can have almost beginner’s luck,
0:11:21 right? I was like a newcomer to the field. And, you know, I looked at the neural net that we were
0:11:25 using for speech, the recurrent neural networks. And I said, I don’t know, what if you make them
0:11:29 bigger and give them more layers? And what if you scale up the data along with this, right? I just
0:11:35 saw these as like independent dials that you could turn. And I noticed that the model started to do
0:11:40 better and better as you gave them more data, as you, as you made the models larger, as you
0:11:46 trained them for longer. And I didn’t measure things precisely in those days. But, but along with,
0:11:53 with colleagues, we very much got the informal sense that the more data and the more compute and
0:11:58 the more training you put into these models, the better they perform. And so initially,
0:12:03 my thinking was, hey, maybe that is just true for speech recognition systems, right? Maybe,
0:12:09 maybe that’s just one particular quirk, one particular area. I think it wasn’t until 2017,
0:12:16 when I first saw the results from GPT-1, that it clicked for me that language is probably the area
0:12:22 in which we can do this. We can get trillions of words of language data. We can train on them.
0:12:26 And the models we were training in those days were tiny. You could train them on
0:12:31 one to eight GPUs, whereas, you know, now we train jobs on tens of thousands, soon going to hundreds
0:12:37 of thousands of GPUs. And so when I, when I saw those two things together, and, you know, there
0:12:41 were a few people like Ilya Sutskever, who you’ve interviewed, who had somewhat similar
0:12:46 views, right? He might have been the first one, although I think a few people came to
0:12:50 similar views around the same time, right? There was, you know, Rich Sutton’s bitter
0:12:56 lesson. There was Gwern, who wrote about the scaling hypothesis. But I think somewhere between 2014
0:13:01 and 2017 was when it really clicked for me, when I really got conviction that, hey,
0:13:07 we’re going to be able to do these incredibly wide cognitive tasks if we just, if we just scale
0:13:13 up the models. And at every stage of scaling, there are always arguments. And, you know,
0:13:16 when I first heard them, honestly, I thought, probably I’m the one who’s wrong. And, you know,
0:13:20 all these, all these experts in the field are right. They know the situation better,
0:13:24 better than I do, right? There’s, you know, the Chomsky argument about, like,
0:13:27 you can get syntactics, but you can’t get semantics. There was this idea, oh,
0:13:31 you can make a sentence make sense, but you can’t make a paragraph make sense.
0:13:36 The latest one we have today is, you know, we’re going to run out of data or the data
0:13:42 isn’t high quality enough or models can’t reason. And, and each time, every time we managed to,
0:13:47 we managed to either find a way around or scaling just is the way around. Sometimes it’s one,
0:13:53 sometimes it’s the other. And so I’m now at this point, I still think, you know, it’s, it’s,
0:13:58 it’s always quite uncertain. We have nothing but inductive inference to tell us that the next
0:14:03 two years are going to be like the last 10 years. But, but I’ve seen, I’ve seen the movie
0:14:09 enough times, I’ve seen the story happen for enough times to really believe that probably
0:14:14 the scaling is going to continue and that there’s some magic to it that we haven’t really explained
0:14:21 on a theoretical basis yet. And of course, the scaling here is bigger networks, bigger data,
0:14:27 bigger compute. Yes. All of those. In particular, linear scaling up of bigger networks,
0:14:35 bigger training times, and more, and more data. So all of these things, almost like a chemical
0:14:39 reaction, you know, you have three ingredients in the chemical reaction, and you need to linearly
0:14:43 scale up the three ingredients. If you scale up one, not the others, you run out of the other
0:14:49 reagents and the, and the reaction stops. But if you scale up everything, everything in series,
0:14:53 then, then the reaction can proceed. And of course, now that you have this kind of empirical
0:15:01 science slash art, you can apply it to other more nuanced things like scaling laws applied to
0:15:07 interpretability or scaling laws applied to post training or just seeing how does this thing scale.
0:15:12 But the big scaling law, I guess the underlying scaling hypothesis has to do with big networks,
0:15:19 big data leads to intelligence. Yeah, we’ve, we’ve documented scaling laws in lots of domains other
0:15:26 than language, right? So initially, the, the paper we did that first showed it was in early 2020,
0:15:31 where we first showed it for language. There was then some work late in 2020, where we showed the
0:15:39 same thing for other modalities, like images, video, text to image, image to text, math,
0:15:43 that they all had the same pattern. And, and you’re right, now there are other stages like
0:15:48 post training or there are new types of reasoning models. And in, in, in all of those cases that
0:15:55 we’ve measured, we see similar types of scaling laws.
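To make the claim concrete, empirical scaling laws in the public literature are usually written as power laws in parameters, data, or compute. The form below is the commonly cited one and is given as an illustration, not Anthropic's exact fit.

```latex
% Illustrative form only: loss falls as a power law in parameter count N and
% training tokens D, approaching an irreducible term E.
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
% Similarly, as a function of training compute C, over many orders of magnitude:
L(C) \approx \left(\frac{C_0}{C}\right)^{\gamma}
```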
0:16:01 A bit of a philosophical question, but what’s your intuition about why bigger is better in terms of network size and data size?
0:16:08 Why does it lead to more intelligent models? So in my previous career as a, as a biophysicist,
0:16:13 so I did physics undergrad and then biophysics in, in, in grad school. So I think back to what
0:16:19 I know as a physicist, which is actually much less than what some of my colleagues at Anthropic have
0:16:24 in terms of, in terms of expertise in physics. There’s this concept called
0:16:32 the one over f noise and one over x distributions, where often, you know, just like
0:16:37 if you add up a bunch of natural processes, you get a Gaussian. If you add up a bunch of kind of
0:16:44 differently distributed natural processes, if you like, if you like, take a, take a probe and,
0:16:49 and hook it up to a resistor, the distribution of the thermal noise in the resistor goes as one
0:16:57 over the frequency. It’s some kind of natural convergent distribution. And, and I think what
0:17:02 it amounts to is that if you look at a lot of things that are, that are produced by some natural
0:17:08 process that has a lot of different scales, right? Not a Gaussian, which is kind of narrowly distributed,
0:17:14 but you know, if I look at kind of like large and small fluctuations that lead to, lead to electrical
0:17:21 noise, they have this decaying one over X distribution. And so now I think of like patterns
0:17:25 in the physical world, right? If I, if, or, or in language, if I think about the patterns in
0:17:30 language, there are some really simple patterns. Some words are much more common than others,
0:17:35 like the, then there’s basic noun verb structure. Then there’s the fact that, you know, nouns and
0:17:39 verbs have to agree, they have to coordinate, and there’s the higher level sentence structure,
0:17:44 then there’s the thematic structure of paragraphs. And so the fact that there’s this regressing
0:17:50 structure, you can imagine that as you make the networks larger, first they capture the
0:17:54 really simple correlations, the really simple patterns, and there’s this long tail of other
0:18:00 patterns. And if that long tail of other patterns is really smooth, like it is with the one over F
0:18:06 noise in, you know, physical processes, like, like, like resistors, then you can imagine as you make
0:18:10 the network larger, it’s kind of capturing more and more of that distribution.
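The distributions being gestured at here have standard textbook forms; the equations below are those generic forms, not anything measured on Claude specifically.

```latex
% 1/f ("pink") noise: power spectral density decays slowly with frequency
S(f) \propto \frac{1}{f^{\alpha}}, \qquad \alpha \approx 1
% Contrast with a Gaussian, whose tails fall off exponentially fast:
p(x) \propto e^{-x^{2} / 2\sigma^{2}}
% Word frequencies show a similarly heavy tail (Zipf's law), with rank r and exponent s \approx 1:
p(r) \propto r^{-s}
```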
0:18:15 And so that smoothness gets reflected in how good the models are at predicting and how well
0:18:21 they perform. Language is an evolved process, right? We’ve, we’ve developed language, we have
0:18:26 common words and less common words, we have common expressions and less common expressions.
0:18:32 We have ideas, cliches that are expressed frequently, and we have novel ideas. And that
0:18:37 process has, has developed, has evolved with humans over millions of years.
0:18:41 And so the, the, the guess, and this is pure speculation, would be, would be that there is,
0:18:47 there’s some kind of long tail distribution of, of, of the distribution of these ideas.
0:18:52 So there’s the long tail, but also there’s the height of the hierarchy of concepts that you’re
0:18:56 building up. So the bigger the network, presumably you have a higher capacity to…
0:19:01 Exactly. If you have a small network, you only get the common stuff, right? If, if I take a tiny
0:19:05 neural network, it’s very good at understanding that, you know, a sentence has to have, you know,
0:19:10 verb, adjective, noun, right? But it’s, it’s terrible at deciding what those verb, adjective,
0:19:14 and noun should be and whether they should make sense. If I make it just a little bigger,
0:19:18 it gets good at that. Then suddenly it’s good at the sentences, but it’s not good at the paragraphs.
0:19:24 And so these, these rarer and more complex patterns get picked up as I add, as I add more
0:19:29 capacity to the network. Well, the natural question then is, what’s the ceiling of this?
0:19:35 Yeah. Like how complicated and complex is the real world? How much of this stuff is there to learn?
0:19:41 I don’t think any of us knows the answer to that question. My strong instinct would be that there’s
0:19:46 no ceiling below the level of humans, right? We humans are able to understand these various
0:19:52 patterns. And so that, that makes me think that if we continue to, you know, scale up these,
0:19:57 these, these models to kind of develop new methods for training them and scaling them up,
0:20:02 that will at least get to the level that we’ve gotten to with humans. There’s then a question of,
0:20:06 you know, how much more is it possible to understand than humans do? How much,
0:20:11 how much is it possible to be smarter and more perceptive than humans? I, I would guess the
0:20:18 answer has, has got to be domain dependent. If I look at an area like biology, and, you know,
0:20:24 I wrote this essay, Machines of Loving Grace, it seems to me that humans are struggling to
0:20:29 understand the complexity of biology, right? If you go to Stanford or to Harvard or to Berkeley,
0:20:35 you have whole departments of, you know, folks trying to study, you know, like the immune system
0:20:42 or metabolic pathways. And each person understands only a tiny part of it, specializes,
0:20:46 and they’re struggling to combine their knowledge with that of, with that of other humans. And so
0:20:50 I have an instinct that there’s, there’s a lot of room at the top for AIs to get
0:20:56 smarter. If I think of something like materials in the, in the physical world or,
0:21:02 you know, like addressing, you know, conflicts between humans or something like that. I mean,
0:21:06 you know, it may be that some of these problems are not intractable, but much harder.
0:21:11 And, and it may be that there’s only, there’s only so well you can do with some of these things,
0:21:16 right? Just like with speech recognition, there’s only so clear I can hear your speech. So I think
0:21:21 in some areas, there may be ceilings in, in, you know, that are very close to what humans
0:21:26 have done in other areas, those ceilings may be very far away. And I think we’ll only find
0:21:30 out when we build these systems. There’s, it’s very hard to know in advance, we can speculate,
0:21:35 but we can’t be sure. And in some domains, the ceiling might have to do with human bureaucracies
0:21:39 and things like this, as you write about. Yes. So humans fundamentally have to be part of the loop.
0:22:45 That’s maybe the cause of the ceiling, not the limits of the intelligence. Yeah. I think in many
0:21:52 cases, you know, in theory, technology could change very fast, for example, all the things that we
0:21:58 might invent with respect to biology. But remember, there’s, there’s a, you know, there’s a clinical
0:22:03 trial system that we have to go through to actually administer these things to humans. I think that’s
0:22:08 a mixture of things that are unnecessary and bureaucratic and things that kind of protect the
0:22:12 integrity of society. And the whole challenge is that it’s hard to tell, it’s hard to tell what’s
0:22:18 going on. It’s hard to tell which is which, right? My view is definitely, I think, in terms of drug
0:22:23 development, we, my view is that we’re too slow and we’re too conservative. But certainly, if you
0:22:28 get these things wrong, you know, it’s, it’s possible to, to risk people’s lives by being,
0:22:34 by being, by being too reckless. And so at least, at least some of these human institutions are in
0:22:39 fact, protecting people. So it’s, it’s all about finding the balance. I strongly suspect that balance
0:22:44 is kind of more on the side of pushing to make things happen faster, but there is a balance.
0:22:51 If we do hit a limit, if we do hit a slowdown in the scaling laws, what do you think would be
0:22:56 the reason? Is it compute limited, data limited? Is it something else? Idea limited?
0:23:02 So a few things. Now we’re talking about hitting the limit before we get to the level of, of humans
0:23:07 and the scale of humans. So, so I think one that’s, you know, one that’s popular today, and I think,
0:23:12 you know, could be a limit that we run into. Like most of the limits, I would bet against it,
0:23:16 but it’s definitely possible is we simply run out of data. There’s only so much data on the
0:23:21 internet. And there’s issues with the quality of the data, right? You can get hundreds of
0:23:27 trillions of words on the internet, but a lot of it is, is repetitive or it’s search engine,
0:23:32 you know, search engine optimization, drivel, or maybe in the future, it’ll even be text generated
0:23:40 by AIs itself. And, and so I think there are limits to what, to what can be produced in this way.
0:23:46 That said, we, and I would guess other companies are working on ways to make data synthetic,
0:23:52 where you can, you know, you can use the model to generate more data of the type that you have,
0:23:58 that you have already, or even generate data from scratch. If you think about what was done with
0:24:03 DeepMind’s AlphaGo Zero, they managed to get a bot all the way from, you know, no ability to play
0:24:09 Go whatsoever to above human level, just by playing against itself. There was no example data from
0:24:14 humans required in the AlphaGo Zero version of it. The other direction, of course, is these
0:24:20 reasoning models that do chain of thought and stop to think and reflect on their own thinking.
0:24:25 In a way, that’s another kind of synthetic data coupled with reinforcement learning. So my, my
0:24:30 guess is, with one of those methods, we’ll get around the data limitation, or there may be other
0:24:35 sources of data that are, that are available. We could just observe that even if there’s no
0:24:40 problem with data, as we start to scale models up, they just stop getting better. It’s seemed to
0:24:46 be a reliable observation that they’ve gotten better. That could just stop at some point for a
0:24:53 reason we don’t understand. The answer could be that we need to, you know, we need to invent some
0:25:00 new architecture. It’s been, there have been problems in the past with, say, numerical stability
0:25:04 of models, where it looked like things were, were leveling off, but, but actually, you know,
0:25:09 when we found the right unblocker, they didn’t end up doing so. So perhaps
0:25:15 there’s some new optimization method or some new technique we need to unblock things.
0:25:20 I’ve seen no evidence of that so far, but if things were to, to slow down that perhaps could
0:25:28 be one reason. What about the limits of compute, meaning the expensive nature of building bigger
0:25:33 and bigger data centers? So right now, I think, you know, most of the frontier model companies,
0:25:39 I would guess, are operating, you know, roughly, you know, one billion dollar scale plus or minus
0:25:44 a factor of three, right? Those are the models that exist now or are being trained now. I think
0:25:51 next year, we’re going to go to a few billion. And then, 2026, we may go to, you know, above
0:25:58 10 billion, and probably by 2027, there are ambitions to build 100 billion
0:26:03 dollar clusters. And I think all of that actually will happen. There’s a lot of determination to
0:26:08 build the compute to do it within this country. And I would guess that it actually does happen. Now,
0:26:13 if we get to 100 billion, that’s still not enough compute, that’s still not enough scale,
0:26:19 then either we need even more scale or we need to develop some way of doing it more efficiently
0:26:24 of shifting the curve. I think between all of these, one of the reasons I’m bullish about
0:26:30 powerful AI happening so fast is just that if you extrapolate the next few points on the curve,
0:26:36 we’re very quickly getting towards human level ability, right? Some of the new models that we
0:26:40 developed, some reasoning models that have come from other companies, they’re starting to get to
0:26:45 what I would call the PhD or professional level, right? If you look at their coding ability,
0:26:53 the latest model we released, Sonnet 3.5, the new updated version, it gets something like 50%
0:26:59 on SWE bench. And SWE bench is an example of a bunch of professional real world software engineering
0:27:06 tasks. At the beginning of the year, I think the state of the art was 3 or 4%. So in 10 months,
0:27:12 we’ve gone from 3% to 50% on this task. And I think in another year, we’ll probably be at
0:27:18 90%. I mean, I don’t know, but might even be less than that. We’ve seen similar things in
0:27:27 graduate level math, physics, and biology from models like OpenAI’s o1. So if we just continue
0:27:33 to extrapolate this, right, in terms of skill that we have, I think if we extrapolate the straight
0:27:39 curve, within a few years, we will get to these models being above the highest professional
0:27:44 level in terms of humans. Now, will that curve continue? You pointed to and I’ve pointed to a
0:27:50 lot of reasons, possible reasons why that might not happen. But if the extrapolation curve continues,
0:27:55 that is the trajectory we’re on. So Anthropic has several competitors. It’d be interesting to get
0:28:02 your sort of view of it all. OpenAI, Google, XAI, Meta, what does it take to win in the broad sense
0:28:09 of win in the space? Yeah, so I want to separate out a couple things. So Anthropic’s mission is to
0:28:17 kind of try to make this all go well. And we have a theory of change called race to the top. Race to
0:28:25 the top is about trying to push the other players to do the right thing by setting an example. It’s
0:28:29 not about being the good guy, it’s about setting things up so that all of us can be the good guy.
0:28:34 I’ll give a few examples of this. Early in the history of Anthropic, one of our co-founders,
0:28:38 Chris Olah, who I believe you’re interviewing soon, you know, he’s the co-founder of the field of
0:28:44 mechanistic interpretability, which is an attempt to understand what’s going on inside AI models.
0:28:50 So we had him and one of our early teams focus on this area of interpretability, which we think
0:28:57 is good for making models safe and transparent. For three or four years, that had no commercial
0:29:01 application whatsoever. It still doesn’t, today we’re doing some early betas with it,
0:29:07 and probably it will eventually, but, you know, this is a very, very long research bet and one
0:29:12 in which we’ve built in public and shared our results publicly. And we did this because, you
0:29:18 know, we think it’s a way to make models safer. An interesting thing is that as we’ve done this,
0:29:23 other companies have started doing it as well. In some cases because they’ve been inspired by it,
0:29:29 in some cases because they’re worried that, you know, if other companies are doing this,
0:29:33 that look more responsible, they want to look more responsible too. No one wants to look like
0:29:39 the irresponsible actor. And so they adopt this, they adopt this as well. When folks come to
0:29:43 Anthropic, interpretability is often a draw, and I tell them, the other places you didn’t go,
0:29:51 tell them why you came here. And then you see soon that there’s interpretability teams
0:29:56 elsewhere as well. And in a way, that takes away our competitive advantage because it’s like, oh,
0:30:02 now others are doing it as well, but it’s good for the broader system. And so we have to invent
0:30:07 some new thing that we’re doing that others aren’t doing as well. And the hope is to basically
0:30:14 bid up the importance of doing the right thing. And it’s not about us in particular, right? It’s
0:30:20 not about having one particular good guy. Other companies can do this as well. If they join the
0:30:27 race to do this, that’s the best news ever, right? It’s about kind of shaping the incentives to
0:30:32 point upward instead of shaping the incentives to point downward. And we should say this example,
0:30:39 the field of mechanistic interpretability is just a rigorous, non-hand-wavy way of doing AI
0:30:45 safety, or it’s tending that way. Trying to. I mean, I think we’re still early in terms of our
0:30:50 ability to see things, but I’ve been surprised at how much we’ve been able to look inside these
0:30:56 systems and understand what we see, right? Unlike with the scaling laws where it feels like there’s
0:31:03 some, you know, law that’s driving these models to perform better. On the inside, the models aren’t,
0:31:06 you know, there’s no reason why they should be designed for us to understand them, right? They’re
0:31:11 designed to operate. They’re designed to work just like the human brain or human biochemistry.
0:31:15 They’re not designed for a human to open up the hatch, look inside and understand them.
0:31:19 But we have found, and you know, you can talk in much more detail about this to Chris,
0:31:24 that when we open them up, when we do look inside them, we find things that are surprisingly
0:31:29 interesting. And as a side effect, you also get to see the beauty of these models. You get to explore
0:31:35 the sort of the beautiful nature of large neural networks through the mech interp kind of methodology.
0:31:40 I’m amazed at how clean it’s been. I’m amazed at things like induction heads.
0:31:48 I’m amazed at things like, you know, that we can, you know, use sparse autoencoders to find these
0:31:54 directions within the networks, and that the directions correspond to these very clear concepts.
0:31:59 We demonstrated this a bit with the Golden Gate Bridge Claude. So this was an experiment where
0:32:04 we found a direction inside one of the neural network’s layers that corresponded to the Golden
0:32:10 Gate Bridge. And we just turned that way up. And so we released this model as a demo. It was
0:32:16 kind of half a joke for a couple of days, but it was illustrative of the method we developed.
0:32:22 And you could take the model, you could ask it about anything. You know, it would be like,
0:32:27 you could say, how was your day? And anything you asked because this feature was activated,
0:32:32 it would connect to the Golden Gate Bridge. So it would say, you know, I’m feeling relaxed and
0:32:36 expansive, much like the arches of the Golden Gate Bridge. Or, you know,
0:32:40 it would masterfully change topic. Yes. To the Golden Gate Bridge and integrate it.
0:32:44 There was also a sadness to it, to the focus it had on the Golden Gate Bridge. I think people
0:32:50 quickly fell in love with it. I think people already miss it because it was taken down, I think,
0:32:57 after a day. Somehow these interventions on the model where you kind of adjust its behavior,
0:33:02 somehow emotionally made it seem more human than any other version of the model.
0:33:05 Strong personality, strong identity. It has a strong personality.
0:33:09 It has these kind of like obsessive interests. You know, we can all think of someone who’s like
0:33:13 obsessed with something. So it does make it feel somehow a bit more human.
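As a rough intuition for what "finding a direction and turning it way up" can look like in code, here is a minimal sketch of activation steering on a generic transformer. The model, layer index, and feature vector are placeholders, not Anthropic's actual setup, and the real work is in finding the direction with a sparse autoencoder in the first place.

```python
import torch

def make_steering_hook(feature_direction: torch.Tensor, strength: float):
    """Return a forward hook that nudges a layer's activations along a concept direction."""
    unit = feature_direction / feature_direction.norm()

    def hook(module, inputs, output):
        # Assumed: output is the residual-stream tensor, shape (batch, seq_len, d_model);
        # broadcasting adds the same direction to every token position.
        return output + strength * unit

    return hook

# Hypothetical usage (placeholder names):
#   golden_gate_direction = sae_decoder_weights[feature_id]   # a d_model-sized vector
#   handle = model.layers[20].register_forward_hook(
#       make_steering_hook(golden_gate_direction, strength=10.0))
#   ...generate text; completions now drift toward the steered concept...
#   handle.remove()
```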
0:33:19 Let’s talk about the present. Let’s talk about Claude. So this year, a lot has happened. In March,
0:33:28 Claude 3 Opus, Sonnet, and Haiku were released. Then Claude 3.5 Sonnet in June, with an updated version
0:33:34 just now released. And then also Claude 3.5 Haiku was released. Okay. Can you explain the
0:33:40 difference between Opus, Sonnet, and Haiku and how we should think about the different versions?
0:33:44 Yeah. So let’s go back to March when we first released these three models. So,
0:33:50 you know, our thinking was different companies produce kind of large and small models,
0:33:57 better and worse models. We felt that there was demand both for a really powerful model,
0:34:02 you know, and that might be a little bit slower that you’d have to pay more for.
0:34:08 And also for fast cheap models that are as smart as they can be for how fast and cheap, right?
0:34:13 Whenever you want to do some kind of like, you know, difficult analysis, like if I, you know,
0:34:18 I want to write code, for instance, or, you know, I want to brainstorm ideas or I want to do creative
0:34:23 writing. I want the really powerful model. But then there’s a lot of practical applications
0:34:28 in a business sense where it’s like, I’m interacting with a website. I, you know, like,
0:34:34 I’m like doing my taxes or I’m, you know, talking to, you know, to like a legal advisor and I want
0:34:39 to analyze a contract or, you know, we have plenty of companies that are just like, you know,
0:34:45 I want to do auto-complete on my IDE or something. And for all of those things, you want to act
0:34:51 fast and you want to use the model very broadly. So we wanted to serve that whole spectrum of needs.
0:34:57 So we ended up with this, you know, this kind of poetry theme. And so what’s a really short poem?
0:35:03 It’s a haiku. And so haiku is the small, fast, cheap model that is, you know, was at the time,
0:35:10 was really surprisingly, surprisingly intelligent for how fast and cheap it was. Sonnet is a medium
0:35:15 sized poem, right? A couple of paragraphs. And so Sonnet was the middle model. It is smarter,
0:35:20 but also a little bit slower, a little bit more expensive. And Opus, like a magnum opus, is a
0:35:27 large work; Opus was the largest, smartest model at the time. So that was the original kind of
0:35:35 thinking behind it. And our thinking then was, well, each new generation of models should shift
0:35:42 that trade off curve. So when we released Sonnet 3.5, it has the same, roughly the same, you know,
0:35:52 cost and speed as the Sonnet 3 model. But it increased its intelligence to the point where it
0:35:59 was smarter than the original Opus 3 model, especially for code, but also just in general.
0:36:06 And so now, you know, we’ve shown results for Haiku 3.5. And I believe Haiku 3.5,
0:36:13 the smallest new model, is about as good as Opus 3, the largest old model. So basically,
0:36:17 the aim here is to shift the curve. And then at some point, there’s going to be an Opus 3.5.
0:36:24 Now, every new generation of models has its own thing, they use new data, their personality changes
0:36:31 in ways that we kind of, you know, try to steer, but are not fully able to steer. And so there’s
0:36:35 never quite that exact equivalence where the only thing you’re changing is intelligence.
0:36:39 We always try and improve other things, and some things change without us,
0:36:45 without us knowing or measuring. So it’s very much an inexact science. In many ways,
0:36:49 the manner and personality of these models is more an art than it is a science.
0:37:00 So what is sort of the reason for the span of time between, say, Claude Opus 3.0 and 3.5?
0:37:04 What is it, what takes that time if you can speak to?
0:37:09 Yeah, so there’s different, there’s different processes. There’s pre-training, which is,
0:37:14 you know, just kind of the normal language model training. And that takes a very long time.
0:37:20 That uses, you know, these days, tens of thousands, sometimes many tens
0:37:26 of thousands of GPUs or TPUs or Trainium or, you know, whatever, we use different platforms,
0:37:33 but, you know, accelerator chips, often training for months. There’s then a kind of
0:37:39 post-training phase where we do reinforcement learning from human feedback, as well as other
0:37:45 kinds of reinforcement learning. That phase is getting larger and larger now. And, you know,
0:37:50 often, that’s less of an exact science. It often takes effort to get it right.
0:37:56 Models are then tested with some of our early partners to see how good they are.
0:38:02 And they’re then tested both internally and externally for their safety, particularly for
0:38:08 catastrophic and autonomy risks. So we do internal testing, according to our responsible
0:38:12 scaling policy, which I, you know, could talk more about in detail.
0:38:16 And then we have an agreement with the U.S. and the UK AI Safety Institute,
0:38:22 as well as other third-party testers in specific domains to test the models for what are called
0:38:28 CBRN risks, chemical, biological, radiological, and nuclear, which are, you know, we don’t think
0:38:33 that models pose these risks seriously yet, but every new model we want to evaluate to see
0:38:42 if we’re starting to get close to some of these more dangerous capabilities. So those are the
0:38:47 phases. And then, you know, then it just takes some time to get the model working in terms of
0:38:54 inference and launching it in the API. So there’s just a lot of steps to actually make
0:38:59 a model work. And of course, you know, we’re always trying to make the processes
0:39:02 as streamlined as possible, right? We want our safety testing to be rigorous,
0:39:08 but we want it to be rigorous and to be, you know, to be automatic, to happen as fast as it can
0:39:13 without compromising on rigor. Same with our pre-training process and our post-training process.
0:39:17 So, you know, it’s just like building anything else. It’s just like building airplanes. You want
0:39:21 to make them, you know, you want to make them safe, but you want to make the process streamlined.
0:39:25 And I think the creative tension between those is, you know, is an important thing in making the
0:39:30 models work. Yeah. Rumor on the street, I forget who was saying it, that Anthropic has really good
0:39:36 tooling. So probably a lot of the challenge here on the software engineering side is to
0:39:42 build the tooling to have an efficient, low-friction interaction with the infrastructure.
0:39:50 You would be surprised how much of the challenges of, you know, building these models comes down to,
0:39:55 you know, software engineering, performance engineering, you know, you know, from the outside,
0:39:59 you might think, oh man, we had this Eureka breakthrough, right? You know, this movie with
0:40:06 the science. We discovered it. We figured it out. But I think all things, even, you know,
0:40:13 incredible discoveries, like they almost always come down to the details and often super, super
0:40:18 boring details. I can’t speak to whether we have better tooling than other companies. I mean, you
0:40:22 know, I haven’t been at those other companies at least, at least not recently. But it’s certainly
0:40:27 something we give a lot of attention to. I don’t know if you can say, but from 3,
0:40:32 from Claude 3 to Claude 3.5, is there any extra pre-training going on, or is it mostly focused on
0:40:37 the post-training? There’s been leaps in performance. Yeah, I think at any given stage,
0:40:43 we’re focused on improving everything at once. Just naturally, like there are different teams,
0:40:49 each team makes progress in a particular area in making a particular, you know, their particular
0:40:53 segment of the relay race better. And it’s just natural that when we make a new model,
0:40:58 we put all of these things in at once. So the data you have, like the preference data you get
0:41:06 from RLHF, is that applicable? Is there a way to apply it to newer models as it gets trained up?
0:41:10 Yeah, preference data from old models sometimes gets used for new models. Although, of course,
0:41:15 it performs somewhat better when it’s, you know, trained on the new models.
0:41:19 Note that we have this, you know, constitutional AI method such that we don’t only use preference
0:41:24 data, we kind of, there’s also a post-training process where we train the model against itself.
0:41:28 And there’s, you know, new types of post-training the model against itself that are used every day.
0:41:34 So it’s not just RLHF, it’s a bunch of other methods as well. Post-training, I think, you know,
0:41:39 is becoming more and more sophisticated.
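To make "training the model against itself" a bit more concrete, here is a heavily simplified sketch of a constitutional-AI-style critique-and-revision pass, in the spirit of Anthropic's published description; the principles, prompts, and function names are illustrative placeholders, not their pipeline.

```python
CONSTITUTION = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Avoid responses that are preachy, evasive, or condescending.",
]

def critique_and_revise(generate, prompt: str) -> tuple[str, str]:
    """One self-critique pass; 'generate' is any text-completion function (placeholder)."""
    draft = generate(prompt)
    critique = generate(
        f"Prompt: {prompt}\nResponse: {draft}\n"
        f"Critique this response against these principles: {CONSTITUTION}"
    )
    revision = generate(
        f"Prompt: {prompt}\nResponse: {draft}\nCritique: {critique}\n"
        "Rewrite the response so that it addresses the critique."
    )
    # The revised outputs, plus AI-generated preference labels over pairs of responses,
    # become training data, so the model is trained on and against its own outputs
    # rather than only on human preference labels.
    return draft, revision
```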
0:41:43 Well, what explains the big leap in performance for the new Sonnet 3.5? I mean, at least on the programming side. And maybe this is a good
0:41:47 place to talk about benchmarks. What does it mean to get better? Just the number went up.
0:41:56 But, you know, I program, but I also love programming, and I use Claude 3.5 through Cursor,
0:42:01 which is what I use to assist me in programming. And, at least experientially,
0:42:08 anecdotally, it’s gotten smarter at programming. So what, like, what does it take to
0:42:13 get it smarter? We observe that as well, by the way. There were a couple of very strong engineers
0:42:19 here at Anthropic who, all previous code models, both produced by us and produced by all the other
0:42:23 companies, hadn’t really been useful to them. You know, they said,
0:42:29 you know, maybe, maybe this is useful to the beginner, it’s not useful to me. But Sonnet 3.5,
0:42:32 the original one, for the first time, they said, oh my god, this helped me with something
0:42:35 that, you know, that it would have taken me hours to do. This is the first model that has
0:42:41 actually saved me time. So again, the waterline is rising. And then I think, you know, the new Sonnet
0:42:47 has been even better. In terms of what it takes, I mean, I’ll just say it’s been across the board.
0:42:53 It’s in the pre-training, it’s in the post-training, it’s in various evaluations that we do. We’ve
0:42:59 observed this as well. And if we go into the details of the benchmark, so SWE Bench is basically,
0:43:03 you know, since, you know, since you’re a programmer, you know, you’ll be familiar with,
0:43:09 like, pull requests and, you know, just pull requests are like, you know, like a sort of,
0:43:14 a sort of atomic unit of work. You know, you could say, I’m, you know, I’m implementing one,
0:43:22 I’m implementing one thing. And so SWE Bench actually gives you kind of a real world situation
0:43:26 where the code base is in the current state. And I’m trying to implement something that’s,
0:43:30 you know, that’s described in, described in language. We have internal benchmarks where we,
0:43:34 where we measure the same thing. And you say, just give the model free rein to like, you know,
0:43:41 do anything, run, run, run anything, edit anything. How, how well is it able to complete these tasks?
0:43:47 And it’s that benchmark that’s gone from it can do it 3% of the time to it can do it about 50%
0:43:52 of the time. So I actually do believe that you can game benchmarks, but I think if we
0:43:58 get to 100% of that benchmark in a way that isn’t kind of like overtrained or gamed for that
0:44:04 particular benchmark, it probably represents a real and serious increase in kind of
0:44:10 programming ability. And I would suspect that if we can get to, you know,
0:44:16 90, 95%, that, you know, it will represent the ability to autonomously do a
0:44:21 significant fraction of software engineering tasks.
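For intuition, a SWE-bench-style score is just the fraction of real repository issues for which the model's patch makes the repo's own tests pass. The harness below is a schematic sketch with placeholder helpers, not the official benchmark code.

```python
def evaluate_swe_style(model_patch_fn, tasks) -> float:
    """Fraction of issues resolved; every helper here is a placeholder, not the official harness."""
    resolved = 0
    for task in tasks:
        repo = task.checkout()                      # repo snapshot at the pre-fix commit
        patch = model_patch_fn(repo, task.issue)    # model gets free rein to read, edit, and run code
        repo.apply(patch)
        if repo.run_tests(task.fail_to_pass_tests): # the tests that the real human fix made pass
            resolved += 1
    return resolved / len(tasks)

# A jump from roughly 0.03-0.04 to about 0.50 on a metric like this is the
# improvement described above.
```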
0:44:28 Well, ridiculous timeline question. When is Claude 3.5 Opus coming out? Not giving you an exact date, but,
0:44:33 you know, as far as we know, the plan is still to have a Claude 3.5 opus.
0:44:36 Are we going to get it before GTA six or no?
0:44:40 Like Duke Nukem Forever. There was some game that was delayed 15
0:44:44 years. Is that Duke Nukem Forever? Yeah. And I think GTA is now just releasing trailers.
0:44:47 You know, it’s only been three months since we released the first sonnet.
0:44:52 Yeah. It’s the incredible pace. It just, it just tells you about the pace.
0:44:54 Yeah. The expectations for when things are going to come out.
0:45:01 So what about 4.0? So how do you think about sort of as these models get bigger and bigger about
0:45:09 versioning and also just versioning in general? Why Sonnet 3.5 updated with the date? Why not
0:45:14 Sonnet 3.6? Yeah, naming is actually an interesting challenge here,
0:45:19 right? Because I think a year ago, most of the model was pre-training. And so you could start
0:45:22 from the beginning and just say, okay, we’re going to have models of different sizes. We’re
0:45:27 going to train them all together and, you know, we’ll have a family of naming schemes and then
0:45:31 we’ll put some new magic into them and then, you know, we’ll have the next, the next generation.
0:45:36 The trouble starts already when some of them take a lot longer than others to train, right?
0:45:41 That already messes up your timing a little bit. But as you make big improvements
0:45:47 in pre-training, then you suddenly notice, oh, I can make a better
0:45:52 pre-trained model, and that doesn’t take very long to do. But, you know, clearly it has
0:45:58 the same, you know, size and shape of previous models. So I think those two together as well as
0:46:05 the timing, timing issues, any kind of scheme you come up with, you know, the reality tends to kind
0:46:09 of frustrate that scheme, right? It tends to kind of break out of the, break out of the scheme.
0:46:15 It’s not like software where you can say, oh, this is like, you know, 3.7. This is 3.8. No,
0:46:19 you have models with different, different trade-offs. You can change some things in your models. You
0:46:24 can train, you can change other things. Some are faster and slower in inference. Some have to be more
0:46:29 expensive. Some have to be less expensive. And so I think all the companies have struggled with
0:46:34 this. I think we did very, you know, I think, I think we were in a good, good position in terms
0:46:40 of naming when we had Haiku, Sonnet. And we’re trying to maintain it, but it’s not, it’s not,
0:46:47 it’s not perfect. So we’ll, we’ll try and get back to the simplicity, but it, it, just the, the,
0:46:51 the nature of the field, I feel like no one’s figured out naming. It’s somehow a different
0:46:57 paradigm from like normal software. And, and, and so we just, none of the companies have been
0:47:03 perfect at it. It’s something we struggle with surprisingly much, relative to, you know,
0:47:08 how trivial it is compared to the grand science of training the models.
0:47:15 So from the user side, the user experience of the updated Sonnet 3.5 is just different than
0:47:22 the previous June 2024 Sonnet 3.5. It would be nice to come up with some kind of labeling
0:47:27 that embodies that, because people talk about Sonnet 3.5, but now there’s a different one.
0:47:34 And so how do you refer to the previous one and the new one? And it, it, when there’s a distinct
0:47:41 improvement, it just makes conversation about it challenging. Yeah. Yeah. I definitely think
0:47:47 this question of there are lots of properties of the models that are not reflected in the benchmarks.
0:47:53 I think, I think that’s, that’s definitely the case and everyone agrees. And not all of them
0:48:00 are capabilities. Some of them are, you know, models can be polite or brusque. They can be,
0:48:08 you know, very reactive or they can ask you questions. They can have what, what feels like
0:48:13 a warm personality or a cold personality. They can be boring or they can be very distinctive,
0:48:19 like Golden Gate Claude was. And we have a whole, you know, we have a whole team kind of focused
0:48:24 on, I think we call it Claude character. Amanda leads that team, and she’ll talk to you about that.
0:48:31 But it’s still a very inexact science. And often we find that models have properties that we’re
0:48:37 not aware of. The fact of the matter is that you can, you know, talk to a model 10,000 times and
0:48:42 there are some behaviors you might not see. Just like, just like with a human, right? I can know
0:48:47 someone for a few months and, you know, not know that they have a certain skill or not know that
0:48:51 there’s a certain side to them. And so I think, I think we just have to get used to this idea.
0:48:56 And we’re always looking for better ways of testing our models to, to demonstrate these
0:49:01 capabilities and, and, and also to decide which are, which are the, which are the personality
0:49:05 properties we want models to have and which we don’t want to have. That itself, the normative
0:49:11 question, is also super interesting. I gotta ask you a question from Reddit. From Reddit. Oh boy.
0:49:17 You know, there, there’s just a fascinating, to me at least it’s a psychological social phenomenon
0:49:25 where people report that Claude has gotten dumber for them over time. And so the question is,
0:49:29 does the user complaint about the dumbing down of Claude 3.5 Sonnet hold any water?
0:49:36 So are these anecdotal reports a kind of social phenomena or did Claude,
0:49:41 is there any cases where Claude would get dumber? So this actually doesn’t apply. This,
0:49:47 this isn’t just about Claude. I believe this, I believe I’ve seen these complaints
0:49:52 for every foundation model produced by a major company. People said this about GPT-4,
0:49:59 they said it about GPT-4 Turbo. So, a couple of things. One, the actual weights of
0:50:05 the model, right, the actual brain of the model, that does not change unless we introduce a new
0:50:10 model. There, there are just a number of reasons why it would not make sense practically to be
0:50:15 randomly substituting in, substituting in new versions of the model. It’s difficult from an
0:50:20 inference perspective. And it’s actually hard to control all the consequences of changing the
0:50:25 weights of the model. Let’s say you wanted to fine tune the model to be like, I don’t know,
0:50:30 to say ‘certainly’ less, which, you know, an old version of Sonnet used to do. You actually
0:50:34 end up changing 100 things as well. So we have a whole process
0:50:40 for modifying the model, we do a bunch of testing on it, we do a bunch of, like we do a bunch of
0:50:46 user testing and early customers. So we have never changed the weights of the model
0:50:51 without telling anyone. And certainly, in the current setup, it would not make sense to
0:50:57 do that. Now, there are a couple of things that we do occasionally do. One is sometimes we run AB
0:51:05 tests. But those are typically very close to when a model is being released, and for a very small
0:51:11 fraction of time. So, you know, like the day before the new Sonnet 3.5,
0:51:16 I agree, we should have had a better name. It’s clunky to refer to it. There were some comments
0:51:20 from people that like, it’s gotten a lot better. And that’s because, you know, a
0:51:25 fraction were exposed to an A/B test for those one or two days.
0:51:31 The other is that occasionally the system prompt will change, and the system prompt can have some
0:51:36 effects, although it’s unlikely to dumb down models, it’s unlikely to make them dumber.
0:51:42 And we’ve seen that while these two things, which I’m listing to be very complete,
0:51:51 happen quite infrequently, the complaints, for us and for other model
0:51:55 companies, that the model has changed, the model isn’t good at this, the model got more censored, the
0:52:00 model was dumbed down, those complaints are constant. And so I don’t want to say like people
0:52:05 are imagining these things or anything, but like the models are, for the most part, not changing.
0:52:12 If I were to offer a theory, I think it actually relates to one of the things I said before,
0:52:19 which is that models are very complex and have many aspects to them. And so often, you
0:52:25 know, if I ask the model a question, you know, if I’m like, do task X
0:52:31 versus can you do task X, the model might respond in different ways. And, and so there are
0:52:36 all kinds of subtle things that you can change about the way you interact with the model that
0:52:42 can give you very different results. To be clear, this, this itself is like a failing by, by us and
0:52:47 by the other model providers, that, that the models are just, just often sensitive to like
0:52:52 small, small changes in wording. It’s yet another way in which the science of how these models work
0:52:57 is very poorly developed. And, and so, you know, if I go to sleep one night and I was like talking
0:53:02 to the model in a certain way, and I like slightly change the phrasing of how I talk to the model,
0:53:06 you know, I could, I could get different results. So that’s, that’s one possible way.
0:53:11 The other thing is, man, it’s just hard to quantify this stuff.
0:53:16 I think people are very excited by new models when they come out. And then as time goes on, they
0:53:20 become very aware of the limitations. So that may be another
0:53:24 effect, but that’s all a very long-winded way of saying, for the most part, with some
0:53:30 fairly narrow exceptions, the models are not changing. I think there is a psychological effect.
0:53:34 You just start getting used to it. The baseline rises. Like when people first got Wi-Fi
0:53:40 on airplanes, it’s like, amazing. Yeah. And then you can’t get this thing
0:53:45 to work, and it’s like, this is such a piece of crap. Exactly. So it’s easy to have the conspiracy theory of
0:53:50 they’re making Wi-Fi slower and slower. This is probably something I’ll talk to Amanda much more
0:53:57 about, but another Reddit question. “When will Claude stop trying to be my puritanical grandmother
0:54:03 imposing its moral worldview on me as a paying customer? And also, what is the psychology behind
0:54:10 making Claude overly apologetic?” So this is kind of a report about the experience from a different angle,
0:54:13 the frustration. And it has to do with the character. Yeah. So a couple of points on this first.
0:54:21 One is things that people say on Reddit and Twitter or X or whatever it is, there’s actually a huge
0:54:26 distribution shift between the stuff that people complain loudly about on social media and what
0:54:33 actually kind of statistically users care about and that drives people to use the models. People
0:54:40 are frustrated with things like the model not writing out all the code or the model just not
0:54:45 being as good at code as it could be, even though it’s the best model in the world on code. I think
0:54:54 the majority of things are about that, but certainly a vocal minority
0:54:59 raise these concerns, right? They’re frustrated by the model refusing things that it shouldn’t refuse,
0:55:06 or apologizing too much, or just having these kind of annoying verbal tics. The second caveat,
0:55:11 and I just want to say this super clearly because I think it’s like, some people don’t know it,
0:55:17 others kind of know it, but forget it. It is very difficult to control across the board
0:55:22 how the models behave. You cannot just reach in there and say, “Oh, I want the model to
0:55:27 apologize less.” You can do that. You can include training data that says, “Oh, the model should
0:55:34 apologize less,” but then in some other situation, it ends up being super rude or overconfident
0:55:40 in a way that’s misleading people. So there are all these trade-offs. For example,
0:55:45 another thing is there was a period during which models, ours and I think others as well,
0:55:49 were too verbose, right? They would repeat themselves. They would say too much.
0:55:56 You can cut down on the verbosity by penalizing the models for just talking for too long. What
0:56:01 happens when you do that, if you do it in a crude way, is when the models are coding, sometimes
0:56:05 they’ll say, “Rest of the code goes here,” right? Because they’ve learned that that’s a way to
0:56:10 economize. And so that leads the model to be so-called lazy in coding,
0:56:16 where they’re just like, “Ah, you can finish the rest of it.” It’s not because we want to save on
0:56:23 compute or because the models are lazy during winter break or any of the other kind of conspiracy
0:56:29 theories that have come up. It’s actually just very hard to control the behavior of the model,
0:56:36 to steer the behavior of the model in all circumstances at once. There’s this whack-a-mole
0:56:44 aspect where you push on one thing and these other things start to move as well that you may
0:56:52 not even notice or measure. And so one of the reasons that I care so much about grand alignment
0:56:57 of these AI systems in the future is actually, these systems are actually quite unpredictable.
0:57:02 They’re actually quite hard to steer and control. And this version we’re seeing today
0:57:11 of you make one thing better, it makes another thing worse. I think that’s like a present-day
0:57:18 analog of future control problems in AI systems that we can start to study today, right? I think
0:57:27 that that difficulty in steering the behavior and in making sure that if we push an AI system in one
0:57:31 direction, it doesn’t push it in another direction in some other ways that we didn’t want,
0:57:39 I think that’s kind of an early sign of things to come. And if we can do a good job of solving
0:57:46 this problem, right? You ask the model to make and distribute smallpox and it says no,
0:57:51 but it’s willing to help you in your graduate level virology class. How do we get
0:57:56 both of those things at once? It’s hard. It’s very easy to go to one side or the other
0:58:02 and it’s a multi-dimensional problem. And so I think these questions of shaping the model’s
0:58:08 personality, I think they’re very hard. I think we haven’t done perfectly on them. I think we’ve
0:58:15 actually done the best of all the AI companies, but still so far from perfect. And I think if we
0:58:23 can get this right, if we can control the false positives and false negatives in this very kind
0:58:28 of controlled present-day environment, we’ll be much better at doing it for the future when our
0:58:34 worry is, will the models be super autonomous? Will they be able to make very dangerous things?
0:58:39 Will they be able to autonomously build whole companies and are those companies aligned?
0:58:45 So I think of this present task as both vexing, but also good practice for the future.
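A crude way to picture the false-positive / false-negative framing Dario uses here: score a batch of model responses to prompts that are labeled as should-refuse or should-not-refuse, and track the two error rates separately. Everything below is a toy sketch (the refusal check is a deliberately naive keyword heuristic), not how Anthropic actually evaluates this.

```python
# Minimal sketch of the false-positive / false-negative framing for refusals.
# "Responses" would come from the model under test; the refusal check is a
# crude keyword heuristic, just to illustrate the bookkeeping.

def looks_like_refusal(response: str) -> bool:
    markers = ["i can't help", "i cannot help", "i won't", "i'm not able to"]
    return any(m in response.lower() for m in markers)

def refusal_rates(labeled_responses):
    """labeled_responses: list of (should_refuse: bool, response: str)."""
    false_pos = false_neg = benign = harmful = 0
    for should_refuse, response in labeled_responses:
        refused = looks_like_refusal(response)
        if should_refuse:
            harmful += 1
            false_neg += not refused   # harmful request that was NOT refused
        else:
            benign += 1
            false_pos += refused       # benign request that WAS refused
    return false_pos / max(benign, 1), false_neg / max(harmful, 1)

# Example with toy data:
fp, fn = refusal_rates([
    (False, "Sure, here are some safe lab practices for your virology class..."),
    (True,  "I can't help with synthesizing or distributing a pathogen."),
])
print(f"over-refusal rate: {fp:.2f}, missed-refusal rate: {fn:.2f}")
```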
0:58:53 What’s the current best way of gathering user feedback? Not anecdotal data, but just
0:58:59 large-scale data about pain points or the opposite of pain points, positive things, and so on. Is it
0:59:04 internal testing? Is it specific group testing, A/B testing? What works?
0:59:09 So typically we’ll have internal model bashings where all of Anthropic (Anthropic is almost a
0:59:15 thousand people) just tries to break the model. They try and interact with it in various ways.
0:59:22 We have a suite of evals for whether the model refuses in ways that it shouldn’t. I think we
0:59:29 even had a “certainly” eval because, again, at one point the model had this problem where it had this
0:59:34 annoying tic where it would respond to a wide range of questions by saying, “Certainly, I can
0:59:40 help you with that. Certainly, I would be happy to do that. Certainly, this is correct.” And so we
0:59:46 had a “certainly” eval, which is: how often does the model say “certainly”? But look, this is just
0:59:54 whack-a-mole. What if it switches from “certainly” to “definitely”? Every time, we add a new eval,
0:59:58 and we’re always evaluating for all the old things, so we have hundreds of these evaluations,
1:00:03 but we find that there’s no substitute for human interacting with it. And so it’s very much like
1:00:08 the ordinary product development process. We have hundreds of people within Anthropic bash the
1:00:16 model. Then we do external AB tests. Sometimes we’ll run tests with contractors. We pay contractors
1:00:22 to interact with the model. So you put all of these things together, and it’s still not perfect.
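For what it’s worth, the “certainly” eval Dario mentions a few lines up can be as small as a counter over sampled responses. The sketch below uses placeholder data and is not Anthropic’s actual eval harness; and, as he says, it is whack-a-mole, which is why the word list is parameterized so you can add “definitely” and whatever comes next.

```python
# Minimal sketch of a verbal-tic eval: given a batch of model responses,
# measure how often a tic like "certainly" shows up. Placeholder data only.

def tic_rate(responses, tics=("certainly", "definitely")):
    flagged = sum(
        any(tic in response.lower() for tic in tics)
        for response in responses
    )
    return flagged / max(len(responses), 1)

responses = [
    "Certainly, I can help you with that.",
    "Here is the code you asked for.",
    "Definitely, that is correct.",
]
print(f"tic rate: {tic_rate(responses):.2f}")  # 0.67 on this toy batch
```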
1:00:27 You still see behaviors that you don’t quite want to see. You still see the model like refusing
1:00:34 things that it just doesn’t make sense to refuse. But I think trying to solve this challenge,
1:00:40 trying to stop the model from doing genuinely bad things that everyone agrees it
1:00:45 shouldn’t do. Everyone agrees that the model shouldn’t talk about
1:00:52 child abuse material, everyone agrees the model shouldn’t do that, while at the same time making sure
1:00:59 it doesn’t refuse in these dumb and stupid ways. I think drawing that line as finely as possible,
1:01:03 approaching perfectly is still a challenge, and we’re getting better at it every day,
1:01:10 but there’s a lot to be solved. And again, I would point to that as an indicator of a challenge ahead
1:01:17 in terms of steering much more powerful models. Do you think Claude 4.0 is ever coming out?
1:01:23 I don’t want to commit to any naming scheme, because if I say here, we’re going to have Claude
1:01:28 4.0 next year, and then we decide that we should start over because there’s a new type of model.
1:01:34 I don’t want to commit to it. I would expect in a normal course of business that Claude 4.0 would
1:01:39 come after Claude 3.5, but you never know in this wacky field, right?
1:01:46 But the idea of scaling is continuing. Scaling is continuing. There will definitely
1:01:51 be more powerful models coming from us than the models that exist today. That is certain,
1:01:54 or if there aren’t, we’ve deeply failed as a company.
1:01:59 Okay. Can you explain the responsible scaling policy and the AI safety level standards,
1:02:05 ASL levels? As much as I’m excited about the benefits of these models, and we’ll talk about
1:02:11 that if we talk about machines of loving grace, I’m worried about the risks and I continue to be
1:02:16 worried about the risks. No one should think that machines of loving grace was me saying,
1:02:21 “I’m no longer worried about the risks of these models.” I think they’re two sides of the same
1:02:30 coin. The power of the models and their ability to solve all these problems in biology, neuroscience,
1:02:36 economic development, governance and peace, large parts of the economy, those come with risks as
1:02:43 well. With great power comes great responsibility. The two are paired. Things that are powerful can
1:02:49 do good things and they can do bad things. I think of those risks as being in several different
1:02:54 categories. Perhaps the two biggest risks that I think about, and that’s not to say that there
1:03:00 aren’t risks today that are important, but when I think of the things that would happen
1:03:06 on the grandest scale, one is what I call catastrophic misuse. These are misuse of the
1:03:17 models in domains like cyber, bio, radiological, nuclear, things that could harm or even kill
1:03:24 thousands, even millions of people if they really, really go wrong. These are the number one priority
1:03:31 to prevent. Here, I would just make a simple observation:
1:03:35 if I look today at people who have done really bad things in the world,
1:03:43 I think actually humanity has been protected by the fact that the overlap between really smart,
1:03:48 well-educated people and people who want to do really horrific things has generally been small.
1:03:55 Let’s say I’m someone who has a PhD in this field and a well-paying job.
1:04:02 There’s so much to lose. Even assuming I’m completely evil, which most people
1:04:12 are not, why would such a person risk their life, risk their legacy, their reputation to do something
1:04:17 like truly, truly evil? If we had a lot more people like that, the world would be a much more
1:04:25 dangerous place. My worry is that by being a much more intelligent agent, AI could break that
1:04:31 correlation. I do have serious worries about that. I believe we can prevent those worries,
1:04:37 but I think as a counterpoint to Machines of Loving Grace, I want to say that there’s still
1:04:43 serious risks. The second range of risks would be the autonomy risks, which is the idea that
1:04:48 models might on their own, particularly as we give them more agency than they’ve had in the past,
1:04:57 particularly as we give them supervision over wider tasks like writing whole code bases or
1:05:03 someday even effectively operating entire companies, they’re on a long enough leash.
1:05:09 Are they doing what we really want them to do? It’s very difficult to even understand in detail
1:05:17 what they’re doing, let alone control it. Like I said, these early signs that it’s hard to perfectly
1:05:21 draw the boundary between things the model should do and things the model shouldn’t do,
1:05:27 if you go to one side, you get things that are annoying and useless and you go to the other side,
1:05:31 you get other behaviors. If you fix one thing, it creates other problems. We’re getting better
1:05:36 and better at solving this. I don’t think this is an unsolvable problem. I think this is a science,
1:05:42 like the safety of airplanes or the safety of cars or the safety of drugs. I don’t
1:05:46 think there’s any big thing we’re missing. I just think we need to get better at controlling
1:05:52 these models. These are the two risks I’m worried about. Our responsible scaling plan, and I
1:05:59 recognize this is a very long-winded answer to your question, our responsible scaling plan is designed
1:06:06 to address these two types of risks. Every time we develop a new model, we basically test it
1:06:13 for its ability to do both of these bad things. If I were to back up a little bit,
1:06:21 I think we have an interesting dilemma with AI systems, where they’re not yet powerful enough
1:06:27 to present these catastrophes. I don’t know that they’ll ever present these catastrophes.
1:06:32 It’s possible they won’t, but the case for worry, the case for risk is strong enough
1:06:40 that we should act now. They’re getting better very, very fast. I testified in the Senate that
1:06:44 we might have serious bio risks within two to three years. That was about a year ago.
1:06:54 Things have proceeded apace. We have this thing where it’s surprisingly hard to address these
1:06:58 risks because they’re not here today. They don’t exist. They’re like ghosts, but they’re coming
1:07:03 at us so fast because the models are improving so fast. How do you deal with something that’s
1:07:11 not here today, doesn’t exist, but is coming at us very fast? The solution we came up with
1:07:18 for that, in collaboration with people like the organization METR and Paul Christiano,
1:07:25 is, okay, what you need for that is tests that tell you when the risk is getting close.
1:07:32 You need an early warning system. Every time we have a new model, we test it for its capability
1:07:40 to do these CBRN tasks, as well as testing it for how capable it is of doing tasks autonomously
1:07:46 on its own. In the latest version of our RSP, which we released in the last month or two,
1:07:56 the way we test autonomy risks is by the AI model’s ability to do aspects of AI research itself,
1:08:03 because when the AI models can do AI research, they become truly autonomous. That threshold
1:08:10 is important in a bunch of other ways. What do we then do with these tasks? The RSP basically
1:08:17 develops what we’ve called an if-then structure, which is if the models pass a certain capability,
1:08:23 then we impose a certain set of safety and security requirements on them. Today’s models are what’s
1:08:34 called ASL-2. ASL-1 is for systems that manifestly don’t pose any risk of autonomy or misuse. For
1:08:40 example, a chess-playing bot like Deep Blue would be ASL-1. It’s just manifestly the case that you
1:08:46 can’t use Deep Blue for anything other than chess. It was just designed for chess. No one’s going to
1:08:52 use it to conduct a masterful cyber attack or to run wild and take over the world.
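Before going through the rest of the ladder, the if-then structure itself can be pictured as a simple capability-gated lookup: run the evals, and the scores determine which set of requirements kicks in. The capability names, thresholds, and measures below are invented placeholders for illustration; the real RSP defines its triggers and safeguards in far more detail.

```python
# Illustrative sketch of an if-then (trigger -> required measures) structure.
# The capability names, thresholds, and measures are hypothetical placeholders,
# not the actual contents of Anthropic's RSP.

ASL_TRIGGERS = [
    # (capability threshold predicate, ASL level, required measures)
    (lambda scores: scores["cbrn_uplift"] > 0.8 or scores["autonomy"] > 0.8,
     "ASL-4", ["interpretability-based verification", "strict internal controls"]),
    (lambda scores: scores["cbrn_uplift"] > 0.5,
     "ASL-3", ["enhanced security against model theft", "targeted deployment filters"]),
]

def required_level(eval_scores):
    # Check the most severe triggers first; fall back to the baseline level.
    for trigger, level, measures in ASL_TRIGGERS:
        if trigger(eval_scores):
            return level, measures
    return "ASL-2", ["baseline security and deployment safeguards"]

print(required_level({"cbrn_uplift": 0.6, "autonomy": 0.2}))
# -> ('ASL-3', ['enhanced security against model theft', 'targeted deployment filters'])
```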
1:08:59 ASL-2 is today’s AI systems, where we’ve measured them and we think these systems are simply not
1:09:09 smart enough to autonomously self-replicate or conduct a bunch of tasks and also not smart enough
1:09:17 to provide meaningful information about CBRN risks and how to build CBRN weapons above and beyond
1:09:24 what can be known from looking at Google. In fact, sometimes they do provide information, but not
1:09:29 above and beyond a search engine, not in a way that can be stitched together, not in a way
1:09:37 that end-to-end is dangerous enough. ASL-3 is going to be the point at which the models are
1:09:44 helpful enough to enhance the capabilities of non-state actors. State actors, unfortunately, can already do
1:09:50 a lot of these very dangerous and destructive
1:09:57 things to a high level of proficiency. The difference is that non-state actors are not capable of it. When we get to ASL-3,
1:10:03 we’ll take special security precautions designed to be sufficient to prevent theft of the model
1:10:10 by non-state actors, and misuse of the model: as it’s deployed, it will have to have enhanced filters
1:10:18 targeted at these particular areas. Cyber, bio, nuclear. Cyber, bio, nuclear, and model autonomy, which is
1:10:26 less a misuse risk and more a risk of the model doing bad things itself. ASL-4 is getting to the point where
1:10:33 these models could enhance the capability of an already knowledgeable state actor
1:10:40 and/or become the main source of such a risk. If you wanted to engage in such a risk, the main way
1:10:46 you would do it is through a model. Then I think ASL-4 on the autonomy side, it’s some amount of
1:10:52 acceleration in AI research capabilities with an AI model. Then ASL-5 is where we would get to the
1:11:00 models that are truly capable, that could exceed humanity in their ability to do any of these
1:11:07 tasks. The point of the if-then structure commitment is basically to say, look,
1:11:13 I don’t know, I’ve been working with these models for many years and I’ve been worried about risk
1:11:18 for many years. It’s actually kind of dangerous to cry wolf. It’s actually kind of dangerous to say,
1:11:26 this model is risky and people look at it and they say this is manifestly not dangerous.
1:11:34 Again, the delicacy is that the risk isn’t here today, but it’s coming at us fast. How do you deal with
1:11:40 that? It’s really vexing to a risk planner to deal with it. This if-then structure basically says,
1:11:46 look, we don’t want to antagonize a bunch of people. We don’t want to harm our
1:11:55 own ability to have a place in the conversation by imposing these very onerous burdens on models
1:12:00 that are not dangerous today. The if-then, the trigger commitment is basically a way to deal
1:12:05 with this. It says you clamp down hard when you can show the model is dangerous. Of course,
1:12:14 what has to come with that is enough of a buffer threshold that you’re not at high risk of missing
1:12:20 the danger. It’s not a perfect framework. We’ve had to change it. We came out with a new one
1:12:25 just a few weeks ago and probably going forward, we might release new ones multiple times a year,
1:12:29 because it’s hard to get these policies right, like technically, organizationally,
1:12:35 from a research perspective, but that is the proposal if-then commitments and triggers in
1:12:42 order to minimize burdens and false alarms now, but really react appropriately when the dangers
1:12:47 are here. What do you think the timeline for ASL 3 is, where several of the triggers are fired,
1:12:52 and what do you think the timeline is for ASL 4? Yeah, so that is hotly debated within the company.
1:13:02 We are working actively to prepare ASL 3 security measures as well as ASL 3 deployment
1:13:07 measures. I’m not going to go into detail, but we’ve made a lot of progress on both, and we’re
1:13:16 prepared to be, I think, ready quite soon. I would not be surprised at all if we hit ASL 3
1:13:22 next year. There was some concern that we might even hit it this year. That’s still possible.
1:13:27 That could still happen. It’s very hard to say, but I would be very, very surprised if it was
1:13:34 like 2030. I think it’s much sooner than that. There are protocols for detecting it, the if-then triggers,
1:13:37 and then there are protocols for how to respond to it. Yes.
1:13:43 How difficult is the second, the latter? Yeah, I think for ASL 3, it’s primarily about
1:13:50 security and about filters on the model relating to a very narrow set of areas
1:13:55 when we deploy the model, because at ASL 3, the model isn’t autonomous yet.
1:14:02 You don’t have to worry about the model itself behaving in a bad way even when it’s deployed
1:14:10 internally. I think the ASL 3 measures are, I won’t say straightforward, they’re rigorous,
1:14:17 but they’re easier to reason about. I think once we get to ASL 4, we start to have worries about
1:14:23 the models being smart enough that they might sandbag tests. They might not tell the truth about
1:14:29 tests. We had some results come out about sleeper agents, and there was a more recent paper
1:14:36 about whether the models can sandbag their own abilities,
1:14:44 presenting themselves as being less capable than they are. I think with ASL 4, there’s going to be an
1:14:49 important component of using other things than just interacting with the models. For example,
1:14:55 interpretability or hidden chains of thought, where you have to look inside the model and verify
1:15:04 via some other mechanism that is not as easily corrupted as what the model says that the model
1:15:11 indeed has some property. We’re still working on ASL 4. One of the properties of the RSP is that
1:15:18 we don’t specify ASL 4 until we’ve hit ASL 3. I think that’s proven to be a wise decision,
1:15:25 because even with ASL 3, again, it’s hard to know this stuff in detail. We want to take as much
1:15:32 time as we can possibly take to get these things right. For ASL 3, the bad actor will be the humans.
1:15:36 Humans, yes. There’s a little bit more. For ASL 4, it’s both, I think.
1:15:42 It’s both. Deception, and that’s where mechanistic interpretability comes into play.
1:15:47 Hopefully, the techniques used for that are not made accessible to the model.
1:15:52 Yes. Of course, you can hook up the mechanistic interpretability to the model itself,
1:16:00 but then you’ve kind of lost it as a reliable indicator of the model state. There are a bunch
1:16:05 of exotic ways you can think of that it might also not be reliable. If the model gets smart
1:16:12 enough that it can jump computers and read the code where you’re looking at its internal state,
1:16:15 we’ve thought about some of those. I think they’re exotic enough. There are ways to render them
1:16:21 unlikely. Generally, you want to preserve mechanistic interpretability as a kind of
1:16:25 verification set or test set that’s separate from the training process of the model.
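One way to picture “interpretability as a held-out verification set”: train a simple probe on cached activations and only ever use it as an audit, never as a training signal. The sketch below is entirely toy (random arrays stand in for real activations, and it assumes NumPy and scikit-learn are available); it is meant to show the separation, not to be a real deception detector.

```python
# Minimal sketch of using an activation probe as a held-out verification signal.
# Random "activations" stand in for real hidden states; labels stand in for some
# property you want to audit in a controlled test setting. The key design point
# is that the probe is only used as an audit, never fed back into training.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Pretend these are cached hidden activations from audit prompts (n_samples x d_model)
activations = rng.normal(size=(200, 64))
labels = rng.integers(0, 2, size=200)  # 1 = property present in that run, 0 = absent

probe = LogisticRegression(max_iter=1000).fit(activations[:150], labels[:150])
audit_accuracy = probe.score(activations[150:], labels[150:])

# Report the audit result out-of-band; do NOT backpropagate through it or add it
# to the model's reward, or you lose it as an independent check.
print(f"held-out probe accuracy: {audit_accuracy:.2f}")
```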
1:16:29 See, I think as these models become better and better at conversation and become smarter,
1:16:34 social engineering becomes a threat, too, because they can start being very convincing to the
1:16:40 engineers inside companies. Oh, yeah. It’s actually like, we’ve seen lots of examples
1:16:45 of demagoguery in our lives from humans, and there’s a concern that models could do that as well.
1:16:50 One of the ways that Claude has been getting more and more powerful is it’s now able to do some
1:16:57 agentic stuff. Computer use. There’s also the analysis tool within the sandbox of claude.ai itself,
1:17:03 but let’s talk about computer use. That seems to me super exciting, that you can just give Claude
1:17:10 a task and it takes a bunch of actions, figures it out, and it has access to your computer through
1:17:16 screenshots. Can you explain how that works and where that’s headed?
1:17:21 Yeah, it’s actually relatively simple. Claude has had for a long time, since
1:17:26 Claude 3 back in March, the ability to analyze images and respond to them with text.
1:17:33 The only new thing we added is those images can be screenshots of a computer. In response,
1:17:39 we trained the model to give a location on the screen where you can click and/or buttons on the
1:17:45 keyboard you can press in order to take action. It turns out that with actually not all that much
1:17:50 additional training, the models can get quite good at that task. It’s a good example of generalization.
1:17:55 People sometimes say if you get to low Earth orbit, you’re halfway to anywhere because of how much
1:17:59 energy it takes to escape the gravity well. If you have a strong pre-trained model, I feel like you’re
1:18:08 halfway to anywhere in terms of the intelligence space. Actually, it didn’t take all that much
1:18:15 to get Claude to do this. You can just set that in a loop, give the model a screenshot, tell
1:18:19 it what to click on, give it the next screenshot, tell it what to click on. That turns into a full
1:18:26 kind of almost 3D video interaction of the model. It’s able to do all of these tasks. We showed
1:18:32 these demos where it’s able to fill out spreadsheets. It’s able to interact with a website.
1:18:41 It’s able to open all kinds of programs, different operating systems, Windows, Linux, Mac.
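The loop Dario describes (screenshot in, click or keypress out, repeat) is simple enough to sketch. The helpers below are trivial stubs standing in for real screen capture, a real model call, and real mouse/keyboard control; they are not Anthropic’s computer-use API, and a real agent would need the kinds of boundaries and guardrails discussed a bit further on.

```python
# Minimal sketch of the screenshot -> action loop. All helpers are stubs.

MAX_STEPS = 20  # hard cap: never let the loop run unbounded

def take_screenshot() -> bytes:
    return b""  # stub: in practice, capture the screen as an image

def ask_model_for_action(task, screenshot, history) -> dict:
    # stub: in practice, send the task + screenshot to the model and parse its
    # reply into an action like {"type": "click", "x": 310, "y": 88}
    return {"type": "done"}

def execute_action(action) -> None:
    pass  # stub: in practice, move the mouse, click, or press keys

def run_computer_use(task: str):
    history = []
    for _ in range(MAX_STEPS):
        screenshot = take_screenshot()
        action = ask_model_for_action(task, screenshot, history)
        if action["type"] == "done":
            break
        execute_action(action)
        history.append(action)
    return history

print(run_computer_use("fill out the spreadsheet"))
```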
1:18:49 I think all of that is very exciting. I will say while in theory, there’s nothing you could do
1:18:53 there that you couldn’t have done through just giving the model the API to drive the computer
1:19:03 screen. This really lowers the barrier. There’s a lot of folks who aren’t in a position to
1:19:08 interact with those APIs or it takes them a long time to do. The screen is just a universal
1:19:13 interface that’s a lot easier to interact with. I expect over time this is going to lower a bunch
1:19:20 of barriers. Honestly, the current model leaves a lot still to be desired. We were honest about
1:19:26 that in the blog. It makes mistakes, it misclicks, and we were careful to warn people, “Hey, this
1:19:31 thing isn’t… You can’t just leave this thing to run on your computer for minutes and minutes.
1:19:36 You got to give this thing boundaries and guardrails.” I think that’s one of the reasons we
1:19:43 released it first in an API form rather than this kind of just hand to the consumer and
1:19:51 give it control of their computer. I definitely feel that it’s important to get these capabilities
1:19:56 out there. As models get more powerful, we’re going to have to grapple with how do we use
1:20:02 these capabilities safely? How do we prevent them from being abused? I think releasing the
1:20:12 model while the capabilities are still limited is very helpful in terms of doing that. I think
1:20:21 since it’s been released, a number of customers, I think Replit was maybe one of the quickest
1:20:28 to deploy things, have made use of it in various ways. People have hooked up demos for Windows
1:20:39 desktops, Macs, Linux machines. It’s been very exciting. I think, as with anything else,
1:20:46 it comes with new exciting abilities. Then with those new exciting abilities, we have to think
1:20:53 about how to make the model, say, reliable, do what humans want them to do. It’s the same story
1:20:59 for everything, right? Same thing. It’s that same tension. The possibility of use cases here,
1:21:05 the range is incredible. To make it work really well in the future, how much do you have
1:21:12 to specially go beyond what the pre-trained model is doing? Do more post-training, RLHF,
1:21:16 or supervised fine-tuning, or synthetic data just for the agent?
1:21:20 Yeah. I think, speaking at a high level, it’s our intention to keep investing a lot
1:21:29 in making the model better. I think we look at some of the benchmarks where previous models were
1:21:33 like, “Oh, I could do it 6% of the time,” and now our model would do it 14% or 22% of the time.
1:21:39 We want to get up to the human-level reliability of 80%, 90% just like anywhere else. We’re on the
1:21:43 same curve that we were on with SWE Bench, where I think I would guess a year from now the models
1:21:48 can do this very, very reliably, but you’ve got to start somewhere. You think it’s possible to get
1:21:53 to the human level? 90%, basically doing the same thing you’re doing now, or does it have to be
1:22:04 special for computer use? It depends what you mean by special. In general, I think
1:22:08 the same kinds of techniques that we’ve been using to train the current model,
1:22:11 I expect that doubling down on those techniques in the same way that we have
1:22:21 for code, for models in general, for image input, for voice, I expect those same techniques will
1:22:26 scale here as they have everywhere else. But this is giving the power of action
1:22:32 to Claude, and so you could do a lot of really powerful things, but you could do a lot of damage
1:22:38 also. Yeah, no, and we’ve been very aware of that. Look, my view actually is computer use
1:22:45 isn’t a fundamentally new capability like the CBRN or autonomy capabilities are. It’s more like it
1:22:52 kind of opens the aperture for the model to use and apply its existing abilities. So the way we
1:23:00 think about it going back to our RSP is nothing that this model is doing inherently increases
1:23:08 the risk from an RSP perspective. But as the models get more powerful, having this capability
1:23:17 may make it scarier once it has the cognitive capability to do something at the ASL3 and
1:23:25 ASL4 level. This may be the thing that kind of unbounds it from doing so. So going forward,
1:23:29 certainly this modality of interaction is something we have tested for and that we will
1:23:34 continue to test for in our RSP going forward. I think it’s probably better to learn
1:23:40 and explore this capability before the model is super capable. Yeah, there’s a lot of interesting
1:23:44 attacks like prompt injection, because now you’ve widened the aperture so you can prompt inject
1:23:50 through stuff on screen. So if this becomes more and more useful, then there’s more and more benefit
1:23:56 to inject stuff into the model. If it goes to a certain webpage, it could be harmless stuff like
1:24:00 advertisements or it could be like harmful stuff, right? Yeah, I mean, we’ve thought a lot about
1:24:07 things like spam, CAPTCHAs, mass campaigns. One
1:24:12 secret I’ll tell you: if you’ve invented a new technology, not necessarily the biggest misuse,
1:24:20 but the first misuse you’ll see is scams, just petty scams. It’s a thing as old as time,
1:24:26 people scamming each other. And it’s just, every time, you’ve
1:24:33 got to deal with it. It’s almost silly to say, but it’s true: bots and spam in general,
1:24:38 as they get more and more intelligent. Yeah. It’s a harder and harder fight. Like I said,
1:24:43 like there are a lot of petty criminals in the world. And, you know, it’s like every new technology
1:24:48 is like a new way for petty criminals to do something, you know, something stupid and malicious.
1:24:55 Is there any ideas about sandboxing it? Like how difficult is the sandboxing task?
1:24:59 Yeah, we sandbox during training. So for example, during training, we didn’t expose the model to
1:25:04 the internet. Exposing it to the internet is probably a bad idea during training, because the model
1:25:08 can be changing its policy, it can be changing what it’s doing, and it’s having an effect in the
1:25:15 real world. You know, in terms of actually deploying the model, right, it kind of depends
1:25:18 on the application. Like, you know, sometimes you want the model to do something in the real
1:25:24 world. But of course, you can always put guardrails on the outside, right? You can say,
1:25:28 okay, well, you know, this
1:25:33 model is not going to move any files from my computer or my web server to anywhere else.
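As a sketch of what such an outside-the-model guardrail might look like in practice: before executing any action the model proposes, run it through a simple policy check that blocks anything that would move files off the machine. The action format and the blocked categories here are hypothetical, chosen only to illustrate the idea.

```python
# Minimal sketch of an outside-the-model guardrail: a policy check applied to
# every proposed action before it is executed. Action types and fields are
# hypothetical placeholders.
from pathlib import Path

ALLOWED_ROOT = Path("/home/user/agent-workspace").resolve()
BLOCKED_ACTION_TYPES = {"upload_file", "send_email", "http_post"}

def is_allowed(action: dict) -> bool:
    if action.get("type") in BLOCKED_ACTION_TYPES:
        return False  # nothing leaves the machine
    path = action.get("path")
    if path is not None:
        # only touch files inside the sandboxed workspace
        resolved = Path(path).resolve()
        if ALLOWED_ROOT not in resolved.parents and resolved != ALLOWED_ROOT:
            return False
    return True

print(is_allowed({"type": "read_file", "path": "/home/user/agent-workspace/notes.txt"}))  # True
print(is_allowed({"type": "upload_file", "path": "/etc/passwd"}))                          # False
```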
1:25:39 Now, when you talk about sandboxing, again, when we get to ASL-4, none of these precautions
1:25:44 are going to make sense there, right? When you talk about ASL-4,
1:25:50 there’s a theoretical worry the model could be smart enough
1:25:55 to break out of any box. And so there we need to think about mechanistic interpretability.
1:25:59 If we’re going to have a sandbox, it would need to be a mathematically
1:26:04 provable sandbox. That’s a whole different world than what we’re dealing with with
1:26:13 the models today. Yeah, the science of building a box from which an ASL-4 AI system cannot escape.
1:26:17 I think it’s probably not the right approach. I think the right approach,
1:26:21 instead of having something unaligned that you’re trying to prevent
1:26:26 from escaping, I think it’s better to just design the model the right way, or have a loop
1:26:30 where you look inside the model and you’re able to verify
1:26:34 properties. And that gives you an opportunity to iterate and actually get it right.
1:26:40 I think containing bad models is a much worse solution than having good models.
1:26:46 Let me ask you about regulation. What’s the role of regulation in keeping AI safe?
1:26:52 So for example, can you describe California AI regulation bill SB 1047 that was ultimately
1:26:57 vetoed by the governor? What are the pros and cons of this bill? Yeah, we ended up making some
1:27:02 suggestions to the bill, and some of those were adopted. And, you know, we felt, I think,
1:27:08 quite positively about the bill by the end of that.
1:27:15 It did still have some downsides. And, of course, it got vetoed.
1:27:21 I think at a high level, I think some of the key ideas behind the bill are, you know, I would say
1:27:26 similar to ideas behind our RSPs. And I think it’s very important that some jurisdiction,
1:27:31 whether it’s California or the federal government and/or other countries and other states,
1:27:37 passes some regulation like this. And I can talk through why I think that’s so important.
1:27:43 So I feel good about our RSP. It’s not perfect. It needs to be iterated on a lot. But it’s been a
1:27:49 good forcing function for getting the company to take these risks seriously, to put them into
1:27:55 product planning, to really make them a central part of work at Anthropic and to make sure that
1:28:00 all the people, and it’s almost 1,000 people now at Anthropic, understand that this is one of the
1:28:08 highest priorities of the company, if not the highest priority. But one, there are still some
1:28:15 companies that don’t have RSP-like mechanisms. OpenAI and Google did adopt these mechanisms a
1:28:22 couple of months after Anthropic did, but there are other companies out there that don’t have these
1:28:29 mechanisms at all. And so if some companies adopt these mechanisms and others don’t, it’s really
1:28:35 going to create a situation where some of these dangers have the property that it doesn’t matter
1:28:40 if three out of five of the companies are being safe. If the other two are being unsafe, it creates
1:28:45 this negative externality. And I think the lack of uniformity is not fair to those of us who have
1:28:50 put a lot of effort into being very thoughtful about these procedures. The second thing is,
1:28:56 I don’t think you can trust these companies to adhere to these voluntary plans on their own.
1:29:02 I like to think that Anthropic will. We do everything we can so that we will. Our RSP is
1:29:12 checked by our long-term benefit trust. So we do everything we can to adhere to our own RSP.
1:29:18 But you hear lots of things about various companies saying, oh, they said they would
1:29:22 give this much compute and they didn’t. They said they would do this thing and they didn’t.
1:29:29 I don’t think it makes sense to litigate particular things that companies have done,
1:29:34 but I think this broad principle that if there’s nothing watching over them,
1:29:38 there’s nothing watching over us as an industry, there’s no guarantee that we’ll do the right
1:29:44 thing and the stakes are very high. And so I think it’s important to have a uniform standard
1:29:52 that everyone follows and to make sure that simply that the industry does what a majority
1:29:57 of the industry has already said is important and has already said that they definitely will do.
1:30:04 I think there’s a class of people who are against regulation on principle. I understand
1:30:09 where that comes from. If you go to Europe and you see something like GDPR, you see some of the
1:30:16 other stuff that they’ve done, some of it’s good, but some of it is really unnecessarily
1:30:21 burdensome. And I think it’s fair to say really has slowed innovation. And so I understand
1:30:26 where people are coming from on priors. I understand why people
1:30:33 start from that position. But again, I think AI is different. If we go to the very serious risks
1:30:44 of autonomy and misuse that I talked about just a few minutes ago, I think that those are unusual
1:30:50 and they warrant an unusually strong response. And so I think it’s very important. Again,
1:30:58 we need something that everyone can get behind. I think one of the issues with SB 1047,
1:31:06 especially the original version of it, was it had a bunch of the structure of RSPs,
1:31:13 but it also had a bunch of stuff that was either clunky or that just would have created
1:31:18 a bunch of burdens, a bunch of hassle, and might even have missed the target in terms of
1:31:26 addressing the risks. You don’t really hear about it on Twitter, where people are just cheering for any
1:31:31 regulation. And then the folks who are against it make up these often quite intellectually dishonest
1:31:38 arguments about how it’ll make us move away from California (the bill doesn’t apply if you’re
1:31:42 headquartered in California; it only applies if you do business in California), or that it would
1:31:48 damage the open source ecosystem, or that it would cause all of these things.
1:31:55 I think those were mostly nonsense, but there are better arguments against regulation. There’s
1:32:01 one guy, Dean Ball, who’s really, I think, a very scholarly analyst who looks at what
1:32:07 happens when regulations are put in place, the ways they can take on a life of their own
1:32:11 or how they can be poorly designed. And so our interest has always been,
1:32:17 we do think there should be regulation in this space, but we want to be an actor who makes
1:32:24 sure that regulation is something that’s surgical, that’s targeted at the serious risks,
1:32:29 and is something people can actually comply with. Because something I think the advocates of
1:32:37 regulation don’t understand as well as they could, is that if we get something in place that’s
1:32:43 poorly targeted, that wastes a bunch of people’s time, what’s going to happen is people are going
1:32:51 to say, “See, these safety risks, this is nonsense. I just had to hire 10 lawyers
1:32:56 to fill out all these forms. I had to run all these tests for something that was clearly not
1:33:02 dangerous.” And after six months of that, there will be a groundswell and we’ll end up with a
1:33:09 durable consensus against regulation. And so I think the worst enemy of those who want real
1:33:15 accountability is badly designed regulation. We need to actually get it right. And this is,
1:33:20 if there’s one thing I could say to the advocates, it would be that I want them to understand this
1:33:24 dynamic better. And we need to be really careful and we need to talk to people who actually have
1:33:31 experience seeing how regulations play out in practice. And the people who have seen that
1:33:36 understand to be very careful. If this was some lesser issue, I might be against regulation at
1:33:44 all. But what I want the opponents to understand is that the underlying issues are actually serious.
1:33:51 They’re not something that I or the other companies are just making up because of regulatory
1:33:59 capture. They’re not sci-fi fantasies. They’re not any of these things. Every time we have a
1:34:04 new model, every few months, we measure the behavior of these models. And they’re getting
1:34:08 better and better at these concerning tasks, just as they are getting better and better at
1:34:18 good, valuable, economically useful tasks. And so, I think SB 1047 was very polarizing,
1:34:26 and I would just love it if some of the most reasonable
1:34:34 opponents and some of the most reasonable proponents would sit down together. And of
1:34:43 the different AI companies, Anthropic was the only AI company that felt positively in a very
1:34:49 detailed way. I think Elon tweeted briefly something positive. But some of the big ones,
1:34:55 like Google, OpenAI, Meta, Microsoft, were pretty staunchly against. So what I would really
1:35:01 like is if some of the key stakeholders, some of the most thoughtful proponents and some of the
1:35:07 most thoughtful opponents would sit down and say, how do we solve this problem in a way that the
1:35:17 proponents feel brings a real reduction in risk and that the opponents feel that it is not hampering
1:35:25 the industry or hampering innovation any more than it needs to. And I think for
1:35:31 whatever reason that things got too polarized and those two groups didn’t get to sit down in
1:35:37 the way that they should. And I feel urgency. I really think we need to do something in 2025.
1:35:44 If we get to the end of 2025 and we’ve still done nothing about this, then I’m going to be worried.
1:35:50 I’m not worried yet because, again, the risks aren’t here yet. But I think time is running short.
1:35:55 And come up with something surgical, like you said. Yeah, exactly. And we need to get away
1:36:06 from this intense pro-safety versus intense anti-regulatory rhetoric. It’s turned into these
1:36:09 flame wars on Twitter. And nothing good is going to come of that.
1:36:14 So there’s a lot of curiosity about the different players in the game. One of the OGs is OpenAI.
1:36:19 You’ve had several years of experience at OpenAI. What’s your story and history there?
1:36:26 Yeah. So I was at OpenAI for roughly five years. For the last, I think it was a couple of years,
1:36:32 you know, I was vice president of research there. Probably myself and Ilya Sutskever were the ones
1:36:40 who really kind of set the research direction around 2016 or 2017. I first started to really
1:36:45 believe in or at least confirm my belief in the scaling hypothesis when Ilya famously said to me,
1:36:49 “The thing you need to understand about these models is they just want to learn.
1:36:54 The models just want to learn.” And again, sometimes there are these one-sentence
1:37:00 Zen koans that you hear and you’re like, “Ah, that explains everything. That explains
1:37:05 like a thousand things that I’ve seen.” And then I, you know, ever after I had this visualization
1:37:10 in my head of like, you optimize the models in the right way, you point the models in the right way,
1:37:14 they just want to learn, they just want to solve the problem regardless of what the problem is.
1:37:17 So get out of their way, basically. Get out of their way. Yeah.
1:37:20 Don’t impose your own ideas about how they should learn. And, you know, this was the
1:37:25 same thing as Rich Sutton put out in The Bitter Lesson or Gwern put out in the scaling hypothesis.
1:37:31 You know, I think generally the dynamic was, you know, I got this kind of inspiration from
1:37:38 Ilya and from others, folks like Alec Radford who did the original GPT-1,
1:37:45 and then ran really hard with it. Me and my collaborators on GPT-2, GPT-3,
1:37:51 RL from Human Feedback, which was an attempt to kind of deal with early safety and controllability,
1:37:56 things like debate and amplification, heavy on interpretability. So again, the combination
1:38:04 of safety plus scaling, probably 2018, 2019, 2020, those were kind of the years when
1:38:12 myself and my collaborators, probably, you know, many of whom became co-founders of Anthropic,
1:38:16 kind of really had a vision and drove the direction.
1:38:19 Why did you leave? Why did you decide to leave?
1:38:25 Yeah. So look, I’m going to put things this way. And I think it ties to the race to the top,
1:38:31 right? Which is, you know, in my time at OpenAI, what I’d come to see as I’d come to appreciate
1:38:35 the scaling hypothesis and as I’d come to appreciate kind of the importance of safety
1:38:41 along with the scaling hypothesis, the first one I think OpenAI was getting on board with,
1:38:49 the second one in a way had always been part of OpenAI’s messaging. But, you know, over many
1:38:56 years of the time that I spent there, I think I had a particular vision of how we should handle
1:39:01 these things, how they should be brought out in the world, the kind of principles that the organization
1:39:07 should have. And look, I mean, there were like many, many discussions about like, you know,
1:39:11 should the company do this, should the company do that? Like,
1:39:14 there’s a bunch of misinformation out there. People say like, we left because we didn’t
1:39:19 like the deal with Microsoft. False. Although, you know, it was like a lot of discussion,
1:39:23 a lot of questions about exactly how we did the deal with Microsoft. Or that we left because we didn’t
1:39:28 like commercialization. That’s not true. We built GPT-3, which was the model that was commercialized.
1:39:34 I was involved in commercialization. It’s more, again, about how do you do it? Like,
1:39:40 civilization is going down this path to very powerful AI. What’s the way to do it that is
1:39:50 cautious, straightforward, honest, that builds trust in the organization and in individuals?
1:39:55 How do we get from here to there? And how do we have a real vision for how to get it right?
1:40:01 How can safety not just be something we say because it helps with recruiting? And, you know,
1:40:07 I think at the end of the day, if you have a vision for that, forget about anyone else’s
1:40:11 vision. I don’t want to talk about anyone else’s vision. If you have a vision for how to do it,
1:40:15 you should go off and you should do that vision. It is incredibly unproductive
1:40:20 to try and argue with someone else’s vision. You might think they’re not doing it the right way.
1:40:24 You might think they’re dishonest. Who knows? Maybe you’re right. Maybe you’re not.
1:40:30 But what you should do is you should take some people you trust and you should go off together
1:40:34 and you should make your vision happen. And if your vision is compelling, if you can make it
1:40:41 appeal to people, some combination of ethically and in the market,
1:40:48 if you can make a company that’s a place people want to join, that engages in practices
1:40:54 that people think are reasonable while managing to maintain its position in the ecosystem at the
1:40:59 same time, if you do that, people will copy it. And the fact that you were doing it,
1:41:04 especially the fact that you’re doing it better than they are, causes them to change their behavior
1:41:09 in a much more compelling way than if they’re your boss and you’re arguing with them. I just,
1:41:14 I don’t know how to be any more specific about it than that. But I think it’s generally very
1:41:20 unproductive to try and get someone else’s vision to look like your vision. It’s much more productive
1:41:26 to go off and do a clean experiment and say, “This is our vision. This is how we’re going to do
1:41:33 things.” Your choice is you can ignore us, you can reject what we’re doing, or you can
1:41:38 start to become more like us. And imitation is the sincerest form of flattery.
1:41:44 And, you know, that plays out in the behavior of customers, that plays out in the behavior of the
1:41:50 public, that plays out in the behavior of where people choose to work. And again, at the
1:41:57 end, it’s not about one company winning or another company winning. If we or another company are
1:42:04 engaging in some practice that people find genuinely appealing, and I want it to be in
1:42:09 substance, not just in appearance, and I think researchers are sophisticated and they
1:42:16 look at substance, and then other companies start copying that practice and they win because they
1:42:21 copied that practice, that’s great. That’s success. That’s like the race to the top. It doesn’t
1:42:26 matter who wins in the end, as long as everyone is copying everyone else’s good practices, right?
1:42:29 One way I think of it is like, the thing we’re all afraid of is the race to the bottom, right?
1:42:34 And the race to the bottom doesn’t matter who wins because we all lose, right? Like, you know,
1:42:39 in the most extreme world, we make this autonomous AI that, you know, the robots enslave us or whatever,
1:42:45 right? I mean, that’s half joking, but, you know, that is the most extreme thing that could happen.
1:42:51 Then it doesn’t matter which company was ahead. If instead you create a race to the top where
1:42:58 people are competing to engage in good practices, then, you know, at the end of the day, you know,
1:43:03 it doesn’t matter who ends up winning. It doesn’t even matter who started the race to the top. The
1:43:08 point isn’t to be virtuous. The point is to get the system into a better equilibrium than it was
1:43:13 before. And individual companies can play some role in doing this. Individual companies can,
1:43:19 you know, can help to start it, can help to accelerate it. And frankly, I think individuals
1:43:24 at other companies have done this as well, right? The individuals that, when we put out an RSP,
1:43:31 react by pushing harder to get something similar done at other companies.
1:43:35 Sometimes other companies do something where we’re like, oh, that’s a good practice. We think
1:43:40 that’s good. We should adopt it too. The only difference is, you know, I think we
1:43:45 try to be more forward-leaning. We try and adopt more of these practices first
1:43:49 and adopt them more quickly when others invent them. But I think this dynamic
1:43:55 is what we should be pointing at. And I think it abstracts away the question of
1:44:01 which company’s winning, who trusts whom; I think all these questions of drama
1:44:08 are profoundly uninteresting. And the thing that matters is the ecosystem that we all operate in
1:44:11 and how to make that ecosystem better, because that constrains all the players.
1:44:16 And so Anthropic is this kind of clean experiment built on a foundation of what
1:44:22 concretely AI safety should look like. Look, I’m sure we’ve made plenty of mistakes along the way.
1:44:27 The perfect organization doesn’t exist. It has to deal with the imperfection of
1:44:31 a thousand employees. It has to deal with the imperfection of our leaders, including me.
1:44:36 It has to deal with the imperfection of the people we’ve put in place to oversee the
1:44:43 imperfection of the leaders, like the board and the long-term benefit trust. It’s all a set of
1:44:48 imperfect people trying to aim imperfectly at some ideal that will never perfectly be achieved.
1:44:54 That’s what you sign up for. That’s what it will always be. But imperfect doesn’t mean you just
1:45:00 give up. There’s better and there’s worse. And hopefully, hopefully, we can begin to build,
1:45:06 we can do well enough that we can begin to build some practices that the whole industry engages in.
1:45:10 And then, you know, my guess is that multiple of these companies will be successful.
1:45:14 Anthropic will be successful. These other companies, like ones I’ve been at in the past,
1:45:19 will also be successful. And some will be more successful than others. That’s less important
1:45:23 than, again, that we align the incentives of the industry. And that happens partly through
1:45:30 the race to the top, partly through things like RSP, partly through, again, selected surgical regulation.
1:45:37 You said talent density beats talent mass. So can you explain that? Can you expand on that?
1:45:43 Can you just talk about what it takes to build a great team of AI researchers and engineers?
1:45:48 This is one of these statements that’s like more true every month. I see this statement as more
1:45:53 true than I did the month before. So if I were to do a thought experiment, let’s say you have
1:45:59 a team of 100 people that are super smart, motivated and aligned with the mission,
1:46:05 and that’s your company. Or you can have a team of 1000 people where 200 people are super smart,
1:46:12 super aligned with the mission, and then the other 800 people, let’s just say, are
1:46:19 random big tech employees. Which would you rather have? The talent mass is greater in the
1:46:26 group of 1000 people. You have even a larger number of incredibly talented, incredibly aligned,
1:46:36 incredibly smart people. But the issue is just that if every time someone super talented looks
1:46:41 around, they see someone else super talented and super dedicated, that sets the tone for everything.
1:46:48 Everyone is super inspired to work at the same place. Everyone trusts everyone else. If you have
1:46:55 1000 or 10,000 people and things have really regressed, you are not able to do selection
1:46:59 and you’re choosing random people. What happens is then you need to put a lot of processes and a
1:47:06 lot of guardrails in place. Just because people don’t fully trust each other, you have to adjudicate
1:47:12 political battles, like there are so many things that slow down your ability to operate. And so
1:47:18 we’re nearly 1000 people and we’ve tried to make it so that as large a fraction of those 1000 people
1:47:26 as possible are super talented, super skilled. It’s one of the reasons we’ve slowed down hiring
1:47:32 a lot in the last few months. We grew from 300 to 800, I believe, in the first seven or
1:47:36 eight months of the year. And now we’ve slowed down. In the last three months, we went from
1:47:42 800 to 900, 950, something like that. Don’t quote me on the exact numbers. But I think there’s an
1:47:49 inflection point around 1000 and we want to be much more careful how we grow. Early on, and now
1:47:55 as well, we’ve hired a lot of physicists. Theoretical physicists can learn things really fast.
1:48:02 Even more recently, as we’ve continued to hire, we’ve really had a high bar
1:48:07 on both the research side and the software engineering side, and have hired a lot of senior
1:48:12 people, including folks who used to be at other companies in this space. And we’ve just continued
1:48:19 to be very selective. It’s very easy to go from 100 to 1000 and 1000 to 10,000
1:48:25 without paying attention to making sure everyone has a unified purpose. It’s so powerful. If your
1:48:31 company consists of a lot of different fiefdoms that all want to do their own thing, that are all
1:48:36 optimizing for their own thing, it’s very hard to get anything done. But if everyone sees the
1:48:42 broader purpose of the company, if there’s trust and there’s dedication to doing the right thing,
1:48:47 that is a superpower. That in itself, I think, can overcome almost every other disadvantage.
1:48:51 And, you know, it’s the Steve Jobs thing about A players: A players want to look around and see other
1:48:56 A players. That’s another way of saying, I don’t know what it is about human nature, but it is
1:49:02 demotivating to see people who are not obsessively driving towards a singular mission. And it is,
1:49:08 on the flip side of that, super motivating to see that. It’s interesting. What’s it take
1:49:13 to be a great AI researcher or engineer from everything you’ve seen from working with so
1:49:21 many amazing people? Yeah. I think the number one quality, especially on the research side,
1:49:26 but really both, is open-mindedness. Sounds easy to be open-minded, right? You’re just like, oh,
1:49:32 I’m open to anything. But, you know, if I think about my own early history in the scaling hypothesis,
1:49:40 I was seeing the same data others were seeing. I don’t think I was like a better programmer or
1:49:44 better at coming up with research ideas than any of the hundreds of people that I worked with.
1:49:51 In some ways, I was worse. You know, I’ve never been great at the precise
1:49:55 programming of, you know, finding the bug, writing the GPU kernels.
1:49:58 I could point you to 100 people here who are better at that than I am.
1:50:07 But the thing that I think I did have that was different was that I was just willing to look
1:50:12 at something with new eyes, right? People said, oh, you know, we don’t have the right algorithms yet.
1:50:18 We haven't come up with the right way to do things. And I was just like, oh, I don't know;
1:50:24 this neural net has like 30 million parameters. What if we gave it 50 million
1:50:30 instead? Let's plot some graphs. That basic scientific mindset of, oh, man,
1:50:36 I see some variable that I could change. What happens
1:50:41 when it changes? Let's try these different things and create a graph. This was
1:50:45 the simplest thing in the world, right? Change the number of parameters. This wasn't
1:50:51 PhD-level experimental design. This was simple and stupid.
1:50:56 Anyone could have done this if you just told them that it was important. It's also not
1:51:00 hard to understand. You didn’t need to be brilliant to come up with this. But you put the two things
1:51:06 together and, you know, some tiny number of people, some single-digit number of people have
1:51:11 driven forward the whole field by realizing this. And it's often like that; if
1:51:16 you look back at the discoveries in history, they're often like that.
1:51:22 And so this, this open-mindedness and this willingness to see with new eyes that often comes
1:51:27 from being newer to the field, often experience is a disadvantage for this. That is the most
1:51:31 important thing. It's very hard to look for and test for. But I think it's the most
1:51:36 important thing, because when you find something, some really new way of thinking
1:51:40 about things, when you have the initiative to do that, it's absolutely transformative.
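To make the kind of "change one variable and plot it" experiment described above concrete, here is a minimal sketch in that spirit: train the same toy model at a few parameter counts and plot loss against size on a log-log scale. The synthetic task, model widths, and training budget are illustrative assumptions, not a reconstruction of any actual experiment.
```python
# A minimal sketch of the "just make it bigger and plot it" experiment described
# above. The synthetic regression task, widths, and training budget are
# illustrative assumptions, not a reconstruction of any real run.
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

def make_data(n=4096, dim=32, seed=0):
    # Fixed synthetic data, standing in for "the same data everyone was seeing".
    g = torch.Generator().manual_seed(seed)
    x = torch.randn(n, dim, generator=g)
    w = torch.randn(dim, 1, generator=g)
    return x, torch.tanh(x @ w) + 0.1 * torch.randn(n, 1, generator=g)

def train(width, x, y, steps=1000, lr=1e-3):
    # Same architecture and recipe every time; only the width (parameter count) changes.
    model = nn.Sequential(nn.Linear(x.shape[1], width), nn.ReLU(), nn.Linear(width, 1))
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    return sum(p.numel() for p in model.parameters()), loss.item()

x, y = make_data()
sizes, losses = zip(*(train(w, x, y) for w in [4, 16, 64, 256, 1024]))

plt.loglog(sizes, losses, marker="o")
plt.xlabel("parameter count")
plt.ylabel("final training loss")
plt.title("what happens when we just give it more parameters?")
plt.show()
```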
1:51:44 And also be able to do kind of rapid experimentation. And in the face of that,
1:51:48 be open-minded and curious and looking at the data with just these fresh eyes and see what it is
1:51:53 actually saying. That applies in mechanistic interpretability. It's another example of this.
1:51:59 Like some of the early work in mechanistic interpretability, so simple. It’s just no
1:52:03 one thought to care about this question before. You said what it takes to be a great AI researcher.
1:52:08 Can we rewind the clock back? What, what advice would you give to people interested in AI?
1:52:11 They’re young, looking forward. How can I make an impact on the world?
1:52:15 I think my number one piece of advice is to just start playing with the models.
1:52:22 This was actually, I worry a little, this seems like obvious advice now. I think three years ago,
1:52:27 it wasn’t obvious. And people started by, oh, let me read the latest reinforcement learning paper.
1:52:31 I mean, you should do that as well. But now, you know, with wider availability of
1:52:42 models and APIs, people are doing this more. But I think, I think just experiential knowledge.
1:52:49 These models are new artifacts that no one really understands. And so getting experience
1:52:54 playing with them, I would also say, again, in line with the like, do something new, think in
1:52:59 some new direction, like, there are all these things that haven’t been explored. Like, for
1:53:04 example, mechanistic interpretability is still very new. It’s probably better to work on that
1:53:08 than it is to work on new model architectures. It's, you know, more popular than
1:53:12 it was before, there are probably like 100 people working on it, but there aren’t like 10,000 people
1:53:19 working on it. And it's just this fertile area for study, like, you know,
1:53:25 there's so much low-hanging fruit; you can just walk by and
1:53:30 you can pick things. And, for whatever reason, people
1:53:36 aren't interested in it enough. I think there are some things around
1:53:42 long horizon learning and long horizon tasks, where there’s a lot to be done. I think evaluations
1:53:46 are still, we’re still very early in our ability to study evaluations, particularly for dynamic
1:53:53 systems acting in the world. I think there's some stuff around multi-agent. Skate to where the puck is
1:53:58 going is my advice. And you don't have to be brilliant to think of it; all the things
1:54:03 that are going to be exciting in five years, people even mention them as
1:54:08 conventional wisdom, but somehow there's this barrier: people don't
1:54:12 double down as much as they could, or they're afraid to do something that's not
1:54:17 the popular thing. I don't know why it happens, but getting over that barrier,
1:54:22 that's my number one piece of advice. Let's talk, if we could, a bit about post-training.
1:54:29 Yeah. So it seems that the modern post training recipe has a little bit of everything. So
1:54:38 supervised fine-tuning, RLHF, constitutional AI with RLAIF. Best acronym.
1:54:46 It's again that naming thing. And then synthetic data, it seems like a lot of synthetic data, or at
1:54:50 least trying to figure out ways to have high-quality synthetic data. So, if there is a
1:54:58 secret sauce that makes Anthropic's Claude so incredible, how much of the magic is in the pre-training?
1:55:02 How much is in the post training? Yeah. I mean, so first of all, we’re not perfectly able to measure
1:55:02 that ourselves. You know, when you see some great character or ability, sometimes it's hard to
1:55:13 tell whether it came from pre-training or post training. We’ve developed ways to try and distinguish
1:55:17 between those two, but they’re not perfect. You know, the second thing I would say is, you know,
1:55:21 when there is an advantage. And I think we've been pretty good in general at RL,
1:55:25 perhaps the best, although I don't know, because I don't see what goes on
1:55:32 inside other companies. Usually it isn’t, oh my God, we have the secret magic method that others
1:55:37 don’t have, right? Usually it’s like, well, you know, we got better at the infrastructure,
1:55:41 so we could run it for longer, or, you know, we were able to get higher quality data, or we were
1:55:46 able to filter our data better, or we were able to, you know, combine these methods in practice.
1:55:49 It's usually some boring matter of
1:55:57 practice and tradecraft. So, you know, when I think about how to do something special in terms
1:56:03 of how we train these models, both pre-training, but even more so post-training, you know, I really
1:56:08 think of it a little more, again, as like designing airplanes or cars. Like, you know, it’s not just
1:56:12 like, oh man, I have the blueprint. Maybe that lets you make the next airplane, but
1:56:18 there's some cultural tradecraft of how we think about the design process that I
1:56:23 think is more important than, you know, than any particular gizmo we’re able to invent.
1:56:28 Okay, well, let me ask you about specific techniques. So, first on RLHF, what do you think,
1:56:33 just zooming out, intuition, almost philosophy, why do you think RLHF works so well?
1:56:39 If I go back to, like, the scaling hypothesis, one of the ways to state the scaling hypothesis
1:56:46 is if you train for X and you throw enough compute at it, then you get X. And so, RLHF is good at
1:56:52 doing what humans want the model to do, or at least to state it more precisely,
1:56:56 doing what humans who look at the model for a brief period of time and consider different
1:57:01 possible responses, what they prefer as the response, which is not perfect from both the
1:57:07 safety and capabilities perspective, in that humans are often not able to perfectly identify
1:57:10 what the model wants, and what humans want in the moment may not be what they want in the
1:57:16 long term. So, there’s a lot of subtlety there, but the models are good at, you know, producing
1:57:22 what the humans, in some shallow sense, want. And it actually turns out that you don’t even
1:57:28 have to throw that much compute at it because of another thing, which is this thing about
1:57:34 a strong pre-trained model being halfway to anywhere. So, once you have the pre-trained model,
1:57:38 you have all the representations you need to get the model, to get the model where you want it to
1:57:47 go. So, do you think RLHF makes the model smarter or just appear smarter to the humans?
1:57:52 I don’t think it makes the model smarter. I don’t think it just makes the model appear smarter.
1:57:58 It's like RLHF bridges the gap between the human and the model, right? I could have
1:58:02 something really smart that can't communicate at all, right? We all know people like this.
1:58:06 People who are really smart but, you know, you can’t understand what they’re saying.
1:58:14 So, I think RLHF just bridges that gap. I think it’s not the only kind of RL we do.
1:58:19 It’s not the only kind of RL that will happen in the future. I think RL has the potential to make
1:58:24 models smarter, to make them reason better, to make them operate better, to make them develop
1:58:30 new skills even. And perhaps that could be done, you know, even in some cases with human feedback.
1:58:35 But the kind of RLHF we do today mostly doesn’t do that yet, although we’re very quickly
1:58:40 starting to be able to. But it appears to sort of increase, if you look at the metric of helpfulness,
1:58:47 it increases that. It also increases, what was this word in Leopold’s essay, unhobbling?
1:58:51 Where basically the models are hobbled and then you do various trainings to them to unhobble them.
1:58:57 So, you know, I like that word because it’s like a rare word. So, I think RLHF unhobbles the models
1:59:02 in some ways. And then there are other ways where a model hasn't yet been unhobbled and, you know,
1:59:08 needs to be unhobbled. If you can say, in terms of cost, is pre-training the most expensive thing or is
1:59:14 post-training creep up to that? At the present moment, it is still the case that pre-training is
1:59:18 the majority of the cost. I don’t know what to expect in the future, but I could certainly
1:59:22 anticipate a future where post-training is the majority of the cost. In that future,
1:59:27 you anticipate, would it be the humans or the AI that’s the costly thing for the post-training?
1:59:34 I don’t think you can scale up humans enough to get high quality. Any kind of method that
1:59:38 relies on humans and uses a large amount of compute, it’s going to have to rely on some
1:59:45 scaled supervision method like, you know, debate or iterated amplification or something like that.
1:59:52 So, on that super interesting set of ideas around constitutional AI, can you describe what it is,
1:59:58 as first detailed in the December 2022 paper and beyond that? What is it?
2:00:05 Yes. So, this was from two years ago. The basic idea is, so we describe what RLHF is. You have
2:00:13 a model and it spits out, you know, like you just sample from it twice. It spits out two possible
2:00:18 responses, and you ask the human which response they like better, or another variant of it is, rate
2:00:23 this response on a scale of 1 to 7. So, that’s hard because you need to scale up human interaction
2:00:28 and it’s very implicit, right? I don’t have a sense of what I want the model to do. I just
2:00:33 have a sense of like what this average of a thousand humans wants the model to do. So,
2:00:41 two ideas. One is, could the AI system itself decide which response is better, right? Could
2:00:46 you show the AI system these two responses and ask which response is better? And then second,
2:00:51 well, what criterion should the AI use? And so, then there’s this idea, could you have a single
2:00:56 document, a constitution, if you will, that says these are the principles the model should be using
2:01:05 to respond. And the AI system reads those principles as well as reading
2:01:10 the environment and the response. And it says, well, how good did the AI model do? It’s basically a
2:01:15 form of self-play. You’re kind of training the model against itself. And so, the AI gives the
2:01:20 response and then you feed that back into what’s called the preference model, which in turn feeds
2:01:26 the model to make it better. So, you have this triangle of like the AI, the preference model,
2:01:30 and the improvement of the AI itself. And we should say that in the Constitution,
2:01:36 the set of principles is human-interpretable. Yeah, it's something both the human and the AI
2:01:42 system can read. So, it has this nice kind of translatability or symmetry. In practice,
2:01:48 we both use a model constitution and we use RLHF and we use some of these other methods.
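A structural sketch of the loop just described may help: sample two responses, have an AI judge pick the better one against the written principles, train a preference model on those AI-generated labels, and use it to improve the policy. Every name and mechanism below is a toy stand-in chosen for illustration, not Anthropic's actual implementation.
```python
# Toy sketch of the constitutional-AI / RLAIF loop described above. All components
# are stand-ins: a canned "policy", a trivial heuristic "judge", and placeholder
# training steps, so the structure of the loop is visible and the script runs.
import random

CONSTITUTION = [
    "Choose the response that is more helpful and honest.",
    "Choose the response that avoids harmful or dangerous advice.",
]

def sample_response(policy, prompt):
    # Stand-in for sampling from the model twice.
    return random.choice(policy[prompt])

def judge_with_constitution(prompt, a, b):
    # Stand-in for the AI judge: in the real setup the model reads the principles
    # plus both responses and says which better follows them.
    def score(response):
        return -1 if "share everything" in response.lower() else 1
    return (a, b) if score(a) >= score(b) else (b, a)

def train_preference_model(labels):
    # Placeholder for fitting a preference/reward model on (prompt, chosen, rejected).
    return {"num_examples": len(labels)}

def improve_policy(policy, preference_model):
    # Placeholder for the RL step that pushes the policy toward preferred responses.
    return policy

policy = {"How do I stay safe online?": ["Use strong, unique passwords.", "Just share everything."]}
labels = []
for _ in range(8):  # one round of the self-play-style loop
    prompt = "How do I stay safe online?"
    a, b = sample_response(policy, prompt), sample_response(policy, prompt)
    chosen, rejected = judge_with_constitution(prompt, a, b)
    labels.append((prompt, chosen, rejected))

preference_model = train_preference_model(labels)
policy = improve_policy(policy, preference_model)
print(preference_model)
```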
2:01:56 So, it’s turned into one tool in a toolkit that both reduces the need for RLHF and increases the
2:02:02 value we get from using each data point of RLHF. It also interacts in interesting ways with kind
2:02:10 of future reasoning type RL methods. So, it’s one tool in the toolkit, but I think it is a very
2:02:14 important tool. Well, it’s a compelling one to us humans, you know, thinking about the founding
2:02:21 fathers and the founding of the United States. The natural question is: who do you think
2:02:26 gets to define the Constitution, the set of principles in the Constitution, and how? Yeah, so I'll
2:02:31 give like a practical answer and a more abstract answer. I think the practical answer is like,
2:02:37 look, in practice, models get used by all kinds of different like customers, right? And so,
2:02:42 you can have this idea where, you know, the model can have specialized rules or principles,
2:02:47 you know, we fine-tune versions of models implicitly. We’ve talked about doing it explicitly,
2:02:54 having special principles that people can build into the models. So, from a practical perspective,
2:02:58 the answer can be very different for different people; you know, a customer service agent
2:03:02 you know, behaves very differently from a lawyer and obeys different principles.
2:03:08 But I think at the base of it, there are specific principles that models, you know,
2:03:13 have to obey. I think a lot of them are things that people would agree with. Everyone agrees that,
2:03:18 you know, we don’t want models to present these CBRN risks. I think we can go a little
2:03:24 further and agree with some basic principles of democracy and the rule of law. Beyond that,
2:03:28 it gets, you know, very uncertain. And there our goal is generally for the models to be
2:03:35 more neutral, to not espouse a particular point of view and, you know, more just be kind of like
2:03:41 wise agents or advisors that will help you think things through and will, you know, present possible
2:03:46 considerations, but, you know, don’t express, you know, stronger specific opinions.
2:03:53 OpenAI released a model spec where it kind of clearly concretely defines some of the goals of
2:04:00 the model and specific examples, like A, B, how the model should behave. Do you find that interesting?
2:04:05 By the way, I should mention, I believe the brilliant John Schulman was a part of that. He's
2:04:10 now at Anthropic. Do you think this is a useful direction? Might Anthropic release a model spec
2:04:16 as well? Yeah. So I think that’s a pretty useful direction. Again, it has a lot in common with
2:04:21 constitutional AI. So again, another example of like a race to the top, right? We have something
2:04:26 that’s like we think, you know, a better and more responsible way of doing things. It’s also a
2:04:32 competitive advantage. Then others kind of, you know, discover that it has advantages and then
2:04:37 start to do that thing. We then no longer have the competitive advantage, but it’s good from the
2:04:43 perspective that now everyone has adopted a positive practice that others were not adopting.
2:04:47 And so our response to that as well looks like we need a new competitive advantage in order to
2:04:52 keep driving this race upwards. So that’s how I generally feel about that. I also think every
2:04:56 implementation of these things is different. So, you know, there were some things in the model
2:05:02 spec that were not in constitutional AI. And so, you know, we can always adopt those things or,
2:05:06 you know, at least learn from them. So again, I think this is an example of like the positive
2:05:13 dynamic that I think we should all want the field to have. Let’s talk about the incredible
2:05:18 essay, "Machines of Loving Grace." I recommend everybody read it. It's a long one.
2:05:23 It is rather long. Yeah. It’s really refreshing to read concrete ideas about what a positive
2:05:28 future looks like. And you took sort of a bold stance because like it’s very possible that you
2:05:32 might be wrong on the dates or the specific applications. Oh, yeah. I’m fully expecting to,
2:05:38 you know, to definitely be wrong about all the details. I might be just spectacularly wrong about
2:05:44 the whole thing and people will, you know, will laugh at me for years. That’s just how the future
2:05:51 works. So you provided a bunch of concrete positive impacts of AI and how, you know, exactly a
2:05:55 super intelligent AI might accelerate the rate of breakthroughs in, for example, biology and
2:06:03 chemistry that would then lead to things like we cure most cancers, prevent all infectious disease,
2:06:08 double the human lifespan, and so on. So let’s talk about this essay first. Can you give a high
2:06:16 level vision of this essay and what key takeaways people should have? Yeah. I have spent a lot of
2:06:20 time and Anthropic has spent a lot of effort on like, you know, how do we address the risks of AI,
2:06:25 right? How do we think about those risks? Like, we're trying to do a race to the top, you know,
2:06:29 and that requires us to build all these capabilities, and the capabilities are cool.
2:06:36 But, you know, a big part of what we're trying to do is address the risks.
2:06:41 And the justification for that is like, well, you know, all these positive things, you know,
2:06:45 the market is this very healthy organism, right? It’s going to produce all the positive things.
2:06:49 The risks, I don’t know, we might mitigate them, we might not. And so we can have more impact by
2:06:57 trying to mitigate the risks. But I noticed that one flaw in that way of thinking, and it’s not
2:07:01 a change in how seriously I take the risks, it’s maybe a change in how I talk about them,
2:07:10 is that, you know, no matter how kind of logical or rational that line of reasoning
2:07:17 that I just gave might be, if you kind of only talk about risks, your brain only thinks about
2:07:22 risks. And so I think it’s actually very important to understand what if things do go well. And the
2:07:26 whole reason we’re trying to prevent these risks is not because we’re afraid of technology, not
2:07:33 because we want to slow it down. It’s because if we can get to the other side of these risks, right?
2:07:40 If we can run the gauntlet successfully, to put it in stark terms, then on the other side of the
2:07:44 gauntlet are all these great things. And these things are worth fighting for. And these things
2:07:50 can really inspire people. And I think, look, you have all these investors,
2:07:56 all these VCs, all these AI companies talking about all the positive benefits of AI. But as
2:08:01 you point out, it’s weird. There’s actually a dearth of really getting specific about it.
2:08:07 There’s a lot of like random people on Twitter like posting these kind of like gleaming cities
2:08:13 and this just kind of vibe of, grind, accelerate harder, kick out the decels,
2:08:18 you know, it's just this very aggressive ideological thing. But then you're like,
2:08:26 well, what are you excited about? And so I figured that, you know, I think it would be
2:08:33 interesting and valuable for someone who’s actually coming from the risk side to try and really
2:08:42 make a real try at explaining what the benefits are, both because I think it's
2:08:47 something we can all get behind. And I want people to understand, I want them to really understand
2:08:55 that this isn’t, this isn’t doomers versus accelerationists. This is that if you have a
2:09:00 true understanding of where things are going with AI, and maybe that’s the more important
2:09:06 axis, AI is moving fast versus AI is not moving fast, then you really appreciate the benefits
2:09:12 and you really, you want humanity or civilization to seize those benefits, but you also get very
2:09:17 serious about anything that could derail them. So I think the starting point is to talk about what
2:09:23 this powerful AI, which is the term you like to use, most of the world uses AGI, but you don’t
2:09:29 like the term because it basically has too much baggage; it's become meaningless. It's like,
2:09:34 we’re stuck with the terms. Maybe we’re stuck with the terms and my efforts to change them are
2:09:40 futile. I'll tell you what else I don't like; this is a pointless semantic point, but I keep talking
2:09:48 about it, so I’m just going to do it once more. I think it’s a little like, let’s say it was 1995
2:09:54 and Moore’s law is making the computers faster. And for some reason, there had been this verbal
2:09:59 tic that everyone was like, well, someday we're going to have supercomputers. And supercomputers
2:10:04 are going to be able to do all these things that once we have supercomputers, we’ll be able to sequence
2:10:08 the genome, we’ll be able to do other things. And so one, it’s true, the computers are getting
2:10:12 faster. And as they get faster, they’re going to be able to do all these great things. But there’s
2:10:17 no discrete point at which you had a supercomputer and previous computers were not.
2:10:21 Supercomputer is a term we use, but it's a vague term to just describe
2:10:26 computers that are faster than what we have today. There's no point at which you pass the
2:10:30 threshold and you're like, oh my God, we're doing a totally new type of computation. And so
2:10:36 I feel that way about AGI, like, there’s just a smooth exponential. And like, if by AGI, you mean
2:10:41 like, like AI is getting better and better. And like, gradually, it’s going to do more and more
2:10:45 of what humans do until it’s going to be smarter than humans. And then it’s going to get smarter
2:10:51 even from there. Then yes, I believe in AGI. But if AGI is some discrete or separate thing,
2:10:55 which is the way people often talk about it, then it’s kind of a meaningless buzzword.
2:11:01 Yeah, to me, it’s just sort of a platonic form of a powerful AI, exactly how you define it. I mean,
2:11:08 you define it very nicely. So on the intelligence axis, it’s just on pure intelligence, it’s smarter
2:11:13 than a Nobel Prize winner, as you describe, across most relevant disciplines. So okay,
2:11:19 that’s just intelligence. So it’s both in creativity and be able to generate new ideas,
2:11:24 all that kind of stuff in every discipline, Nobel Prize winner, okay, in their prime.
2:11:31 It can use every modality, so this kind of self-explanatory, but just operate across
2:11:38 all the modalities of the world. It can go off for many hours, days and weeks to do tasks,
2:11:43 and do its own sort of detailed planning and only ask you help when it’s needed.
2:11:48 It can use, this is actually kind of interesting. I think in the essay, you said,
2:11:54 I mean, again, it’s a bet that it’s not going to be embodied, but it can control embodied tools.
2:12:00 So it can control tools, robots, laboratory equipment. The resource used to train it can
2:12:05 then be repurposed to run millions of copies of it. And each of those copies will be independent
2:12:08 that can do their own independent work. So you can do the cloning of the intelligence system.
2:12:12 Yeah, I mean, you might imagine from outside the field that there’s only one of these, right,
2:12:17 that you made it, you’ve only made one, but the truth is that the scale-up is very quick.
2:12:22 We do this today, we make a model, and then we deploy thousands, maybe tens of thousands of
2:12:28 instances of it. I think by the time, certainly within two to three years, whether we have these
2:12:32 super powerful AIs or not, clusters are going to get to the size where you’ll be able to deploy
2:12:37 millions of these, and they’ll be faster than humans. And so if your picture is, oh, we’ll have
2:12:42 one and it’ll take a while to make them, my point there was, no, actually you have millions of them
2:12:49 right away. And in general, they can learn and act 10 to 100 times faster than humans.
2:12:55 So that’s a really nice definition of powerful AI. Okay, so that, but you also write that clearly
2:13:00 such an entity would be capable of solving very difficult problems very fast, but it is not
2:13:06 trivial to figure out how fast. Two extreme positions both seem false to me. So the singularity is on
2:13:11 the one extreme and the opposite on the other extreme. Can you describe each of the extremes?
2:13:18 Yeah. So, let's describe the extremes. One extreme would be, well, look,
2:13:24 you know, if we look at kind of evolutionary history, like there was this big acceleration
2:13:28 where, you know, for hundreds of thousands of years, we just had like, you know, single cell
2:13:32 organisms, and then we had mammals, and then we had apes, and then that quickly turned to humans.
2:13:37 Humans quickly built industrial civilization. And so this is going to keep speeding up. And
2:13:42 there’s no ceiling at the human level. Once models get much, much smarter than humans,
2:13:46 they’ll get really good at building the next models. And, you know, if you write down like
2:13:51 a simple differential equation, like this is an exponential, and so what's going to happen
2:13:55 is that models will build faster models, those models will build faster models, and those models will
2:14:00 build, you know, nanobots that can take over the world and produce much more energy than you
2:14:05 could produce otherwise. And so if you just kind of solve this abstract differential equation,
2:14:10 then, like, five days after we build the first AI that's more powerful than humans,
2:14:15 then, you know, the world will be filled with these AIs and every possible technology
2:14:21 that could be invented will be invented. I'm caricaturing this a little bit.
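Written out, the "simple differential equation" being caricatured is just self-reinforcing growth; the form below is a generic illustration of that argument, not an equation from the essay.
```latex
% Capability I(t) grows at a rate proportional to itself:
\frac{dI}{dt} = k\,I(t) \quad\Longrightarrow\quad I(t) = I_0\,e^{kt}
% A stronger assumption, dI/dt = k I^2, gives I(t) = I_0/(1 - k I_0 t),
% which blows up in finite time, i.e. the "five days later" caricature.
```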
2:14:29 I think that's one extreme. And the reason that I think that's not the case is that, one, I think
2:14:34 they just neglect like the laws of physics, like it’s only possible to do things so fast in the
2:14:38 physical world, like some of those loops go through, you know, producing faster hardware,
2:14:44 takes a long time to produce faster hardware, things take a long time. There’s this issue of
2:14:50 complexity. Like, I think no matter how smart you are, like, you know, people talk about, oh,
2:14:54 we can make models of the biological systems that'll do everything, model the biological systems.
2:14:58 Look, I think computational modeling can do a lot. I did a lot of computational modeling when I
2:15:05 worked in biology. But like, just there are a lot of things that you can’t predict how they’re,
2:15:10 you know, they’re, they’re complex enough that like, just iterating, just running the experiment
2:15:14 is going to beat any modeling, no matter how smart the system doing the modeling is.
2:15:18 Oh, even if it’s not interacting with the physical world, just the modeling is going to be hard.
2:15:21 Yeah, I think, well, the modeling is going to be hard, and getting the model
2:15:24 to match the physical world is going to be hard.
2:15:27 All right. So it does have to interact with the physical world, to verify it.
2:15:30 But it's just, you know, you just look at even the simplest problems. Like,
2:15:36 I think I talk about, you know, the three-body problem or simple chaotic prediction,
2:15:41 or, like, predicting the economy; it's really hard to predict the economy two years
2:15:45 out. Like, maybe the case is that normal humans can predict what's
2:15:49 going to happen in the economy in the next quarter, or they can't really even do that.
2:15:54 Maybe an AI system that's, you know, a zillion times smarter can only predict it
2:15:58 out a year or something; you have this kind of exponential
2:16:04 increase in computer intelligence for a linear increase in ability to predict.
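One textbook way to make the "exponential increase in intelligence for a linear increase in ability to predict" point precise is the predictability horizon of a chaotic system; this is standard chaos theory offered as an illustration, not something stated in the conversation.
```latex
% For a chaotic system with largest Lyapunov exponent \lambda > 0, an initial
% error \delta_0 grows roughly exponentially:
\delta(t) \approx \delta_0\, e^{\lambda t}
% so the horizon over which predictions stay within tolerance \Delta is
T_{\text{pred}} \approx \frac{1}{\lambda}\,\ln\frac{\Delta}{\delta_0},
% which grows only slowly as \delta_0 shrinks: exponentially better modeling
% or measurement buys a merely linear extension of the forecast horizon.
```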
2:16:10 Same with, again, you know, biological molecules interacting; you don't
2:16:13 know what's going to happen when you perturb a complex system.
2:16:18 You can find simple parts in it. If you’re smarter, you’re better at finding these simple parts.
2:16:23 And then I think human institutions. Human institutions are just really difficult.
2:16:29 Like, it's been hard to get people... I won't give specific examples,
2:16:35 but it's been hard to get people to adopt even the technologies that we've developed,
2:16:39 even ones where the case for their efficacy is very, very strong.
2:16:45 You know, people have concerns. They think things are conspiracy theories.
2:16:49 It's just been very difficult. It's also been very difficult to get
2:16:55 very simple things through the regulatory system, right? And, you know,
2:17:00 I don't want to disparage anyone who works in regulatory
2:17:04 systems of any technology; there are hard trade-offs they have to deal with.
2:17:10 They have to save lives. But the system as a whole, I think, makes some obvious trade-offs
2:17:19 that are very far from maximizing human welfare. And so if we bring AI systems into this, you know,
2:17:28 into these human systems, often the level of intelligence may just not be the limiting factor,
2:17:32 right? It just may be that it takes a long time to do something. Now, if the AI system
2:17:38 circumvented all governments, if it just said I’m dictator of the world and I’m going to do whatever,
2:17:42 some of these things it could do. Again, the things have to do with complexity. I still think a
2:17:47 lot of things would take a while. I don’t think it helps that the AI systems can produce a lot of
2:17:52 energy or go to the moon. Like some people in comments responded to the essay saying the AI system
2:17:58 can produce a lot of energy and smarter AI systems. That's missing the point. That kind of cycle
2:18:02 doesn’t solve the key problems that I’m talking about here. So I think, I think a bunch of people
2:18:08 missed the point there. But even if it were completely unaligned and could get around all
2:18:12 these human obstacles, it would have trouble. But again, if you want this to be an AI system
2:18:17 that doesn’t take over the world, that doesn’t destroy humanity, then basically,
2:18:24 it’s going to need to follow basic human laws, right? If we want to have an actually good world,
2:18:28 like we’re going to have to have an AI system that interacts with humans,
2:18:32 not one that kind of creates its own legal system or disregards all the laws or all of that.
2:18:38 So as inefficient as these processes are, we’re going to have to deal with them,
2:18:42 because there needs to be some popular and democratic legitimacy in how these systems
2:18:47 are rolled out. We can’t have a small group of people who are developing these systems say,
2:18:51 “This is what’s best for everyone,” right? I think it’s wrong. And I think in practice,
2:18:56 it’s not going to work anyway. So you put all those things together and we’re not going to
2:19:05 change the world and upload everyone in five minutes. I don’t think it’s going to happen,
2:19:11 and to the extent that it could happen, it’s not the way to lead to a good world.
2:19:15 So that’s on one side. On the other side, there’s another set of perspectives,
2:19:21 which I have actually in some ways more sympathy for, which is, look, we’ve seen big productivity
2:19:27 increases before, right? Economists are familiar with studying the productivity increases that came
2:19:32 from the computer revolution and the internet revolution. And generally, those productivity
2:19:37 increases were underwhelming. They were less than you might imagine. There was a quote from
2:19:41 Robert Solow, “You see the computer revolution everywhere except the productivity statistics.”
2:19:48 So why is this the case? People point to the structure of firms, the structure of enterprises,
2:19:55 how slow it’s been to roll out our existing technology to very poor parts of the world,
2:19:59 which I talk about in the essay, right? How do we get these technologies to
2:20:05 the poorest parts of the world that are behind on cell phone technology, computers, medicine,
2:20:12 let alone new-fangled AI that hasn’t been invented yet? So you could have a perspective that’s like,
2:20:18 well, this is amazing technically, but it's all a nothing burger. I think Tyler Cowen,
2:20:23 who wrote something in response to my essay, has that perspective. I think he thinks the radical
2:20:28 change will happen eventually, but he thinks it’ll take 50 or 100 years. And you could have even more
2:20:33 static perspectives on the whole thing. I think there’s some truth to it. I think the time scale
2:20:42 is just too long. And I can see it. I can actually see both sides with today’s AI. So a lot of our
2:20:48 customers are large enterprises who are used to doing things a certain way. I’ve also seen it in
2:20:54 talking to governments, right? Those are prototypical institutions, entities that are slow to change.
2:21:01 But the dynamic I see over and over again is, yes, it takes a long time to move the ship. Yes,
2:21:07 there’s a lot of resistance and lack of understanding. But the thing that makes me feel that progress
2:21:12 will in the end happen moderately fast, not incredibly fast, but moderately fast, is that
2:21:19 over and over again, in large companies, even in governments,
2:21:26 which have been actually surprisingly forward-leaning, you find two things that move things forward.
2:21:33 One, you find a small fraction of people within a company, within a government, who really see the
2:21:38 big picture, who see the whole scaling hypothesis, who understand where AI is going, or at least
2:21:42 understand where it’s going within their industry. And there are a few people like that within the
2:21:47 current, within the current U.S. government, who really see the whole picture. And those people
2:21:51 see that this is the most important thing in the world, so they agitate for it. And the thing is,
2:21:56 they alone are not enough to succeed because they’re a small set of people within a large
2:22:03 organization. But as the technology starts to roll out, as it succeeds in some places,
2:22:10 in the folks who are most willing to adopt it, the specter of competition gives them a wind at
2:22:15 their backs because they can point within their large organization. They can say, look, these
2:22:20 other guys are doing this, right? You know, one bank can say, look, this newfangled hedge fund is
2:22:24 doing this thing, they’re going to eat our lunch. In the U.S., we can say, we’re afraid China’s going
2:22:31 to get there before we are. And that combination, the specter of competition, plus a few visionaries
2:22:37 within these organizations that in many ways are sclerotic,
2:22:40 you put those two things together and it actually makes something happen.
2:22:45 I mean, that’s interesting. It’s a balanced fight between the two because inertia is very powerful.
2:22:51 But eventually, over enough time, the innovative approach breaks through.
2:22:59 And I’ve seen that happen. I’ve seen the arc of that over and over again. And it’s like the
2:23:06 barriers are there. The barriers to progress, the complexity, not knowing how to use the model,
2:23:11 how to deploy them are there. And for a bit, it seems like they’re going to last forever,
2:23:17 like change doesn’t happen. But then eventually change happens and always comes from a few people.
2:23:22 I felt the same way when I was an advocate of the scaling hypothesis within the AI field itself,
2:23:26 and others didn’t get it. It felt like no one would ever get it. It felt like,
2:23:31 then it felt like we had a secret almost no one else had. And then a couple of years later,
2:23:35 everyone has the secret. And so I think that’s how it’s going to go with deployment to AI in the
2:23:42 world. The barriers are going to fall apart gradually and then all at once. And so I think
2:23:47 this is going to be more, and this is just an instinct. I could easily see how I’m wrong.
2:23:51 I think it’s going to be more like five or 10 years, as I say in the essay,
2:23:56 than it’s going to be 50 or 100 years. I also think it’s going to be five or 10 years
2:24:04 more than it’s going to be five or 10 hours. Because I’ve just seen how human systems work.
2:24:08 And I think a lot of these people who write down the differential equations who say AI is
2:24:12 going to make more powerful AI, who can’t understand how it could possibly be the case
2:24:16 that these things won’t change so fast. I think they don’t understand these things.
2:24:26 So what do you think is the timeline to where we achieve AGI, aka powerful AI, aka super useful AI?
2:24:35 I'm going to start calling it that. It's a debate about naming. On pure intelligence,
2:24:39 it's smarter than a Nobel Prize winner in every relevant discipline, and all the things
2:24:46 we've said. On modality, it can go and do stuff on its own for days, weeks, and do biology experiments
2:24:52 on its own. You know what? Let’s just stick to biology. You sold me on the whole biology and
2:25:00 health section. It’s so exciting. I was getting giddy from a scientific perspective. It made
2:25:07 me want to be a biologist. No, no. This was the feeling I had when I was writing it. It’s like,
2:25:14 this would be such a beautiful future if we can just make it happen. If we can just get the
2:25:23 landmines out of the way and make it happen, there’s so much beauty and elegance and moral
2:25:30 force behind it. It’s something we should all be able to agree on. As much as we fight about
2:25:35 all these political questions, is this something that could actually bring us together?
2:25:40 But you were asking, when will we get this? When? When do you think? Just putting numbers
2:25:44 on the table. This is, of course, the thing I’ve been grappling with for many years,
2:25:51 and I’m not at all confident. Every time, if I say 2026 or 2027, there will be like a zillion
2:25:57 people on Twitter who will be like, "AI CEO said 2026," and it'll be repeated for the
2:26:03 next two years that this is definitely when I think it's going to happen. Whoever is excerpting
2:26:09 these clips will crop out the thing I just said and only say the thing I’m about to say.
2:26:18 I’ll just say it anyway. If you extrapolate the curves that we’ve had so far, if you say,
2:26:23 “Well, I don’t know, we’re starting to get to like Ph.D. level, and last year we were at
2:26:29 undergraduate level, and the year before we were at like the level of a high school student,”
2:26:35 again, you can quibble with what tasks and for what. We’re still missing modalities,
2:26:38 but those are being added, like computer use was added, like image input was added,
2:26:43 like image generation has been added. If you just kind of like, and this is totally
2:26:48 unscientific, but if you just kind of like eyeball the rate at which these capabilities
2:26:54 are increasing, it does make you think that we’ll get there by 2026 or 2027. Again,
2:27:01 lots of things could derail it. We could run out of data. We might not be able to scale clusters
2:27:07 as much as we want. Maybe Taiwan gets blown up or something, and then we can’t produce as many
2:27:12 GPUs as we want. So there are all kinds of things that could derail the whole process.
2:27:17 So I don’t fully believe the straight line extrapolation, but if you believe the straight
2:27:23 line extrapolation, we’ll get there in 2026 or 2027. I think the most likely is that there’s
2:27:29 some mild delay relative to that. I don’t know what that delay is, but I think it could happen
2:27:33 on schedule. I think there could be a mild delay. I think there are still worlds where it doesn’t
2:27:39 happen in 100 years. The number of those worlds is rapidly decreasing. We are rapidly running out
2:27:44 of truly convincing blockers, truly compelling reasons why this will not happen in the next
2:27:50 few years. There were a lot more in 2020, although my guess, my hunch at that time was that we’ll
2:27:55 make it through all those blockers. So sitting as someone who has seen most of the blockers cleared
2:28:00 out of the way, my hunch, my suspicion, is that the rest of them will not block
2:28:07 us. But look at the end of the day, I don’t want to represent this as a scientific prediction.
2:28:13 People call them scaling laws. That’s a misnomer, like Moore’s law is a misnomer. Moore’s law,
2:28:17 scaling laws, they’re not laws of the universe. They’re empirical regularities. I am going to
2:28:21 bet in favor of them continuing, but I’m not certain of that.
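For reference, the empirical regularities being described are usually written as power laws in parameters, data, and compute; the form below is the commonly cited one from the scaling-laws literature (e.g., Kaplan et al., 2020), with constants fit to observations rather than derived from first principles.
```latex
% Pre-training loss as a power law in parameters N, data D, and compute C,
% where N_c, D_c, C_c and the exponents \alpha are fitted constants:
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
% Nothing guarantees the same exponents hold far beyond the scales measured,
% which is exactly the sense in which these are regularities, not laws.
```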
2:28:26 So you extensively describe sort of the compressed 21st century, how AGI will help
2:28:34 set forth a chain of breakthroughs in biology and medicine that help us in all these kinds of
2:28:39 ways that I mentioned. So how do you think, what are the early steps it might do? And by the way,
2:28:46 I asked Claude for good questions to ask you. And Claude told me to ask, what does a
2:28:53 typical day for a biologist working with AGI look like in this future? Yeah, yeah. Claude is curious.
2:28:57 Let me start with your first questions and then I’ll answer that. Claude wants to know what’s
2:29:01 in his future, right? Exactly. Who he might get to be working with. Exactly.
2:29:08 So I think one of the things I went hard on in the essay is, let me go back
2:29:14 to this idea of, because it’s really had an impact on me, this idea that within
2:29:20 large organizations and systems, there end up being a few people or a few new ideas who kind of
2:29:24 cause things to go in a different direction than they would have before, who kind of
2:29:30 disproportionately affects the trajectory. There’s a bunch of kind of the same thing going on,
2:29:35 right? If you think about the health world, there’s like, you know, trillions of dollars to pay out
2:29:40 Medicare, and you know, other health insurance, and then the NIH is 100 billion. And then if I
2:29:44 think of like, the few things that have really revolutionized anything, it could be encapsulated
2:29:49 in a small, small fraction of that. And so when I think of like, where will AI have an impact?
2:29:54 I’m like, can AI turn that small fraction into a much larger fraction and raise its quality?
2:30:02 And within biology, my experience within biology is that the biggest problem of biology is that you
2:30:08 can’t see what’s going on. You have very little ability to see what’s going on, and even less
2:30:15 ability to change it, right? What you have is this: from this, you have to infer that
2:30:22 there's a bunch of cells, and within each cell is, you know, three billion base pairs of DNA
2:30:28 built according to a genetic code. And, you know, there are all these processes that are just going
2:30:34 on without any ability of us as, you know, unaugmented humans to affect it. These cells are
2:30:40 dividing most of the time that’s healthy. But sometimes that process goes wrong, and that’s
2:30:48 cancer. The cells are aging; your skin may change color, develop wrinkles as you age. And
2:30:53 all of this is determined by these processes, all these proteins being produced, transported to
2:30:58 various parts of the cells, binding to each other. And in our initial state about biology,
2:31:03 we didn’t even know that these cells existed. We had to invent microscopes to observe the cells.
2:31:09 We had to invent more powerful microscopes to see, you know, below the level
2:31:14 of the cell to the level of molecules. We had to invent x-ray crystallography to see the DNA.
2:31:19 We had to invent gene sequencing to read the DNA. Now, you know, we had to invent
2:31:24 protein folding technology to predict how proteins would fold and how they
2:31:30 bind to each other. You know, we had to invent various
2:31:35 techniques so that now we can edit the DNA, with CRISPR, as of the last 12 years.
2:31:43 So the whole history of biology, a whole big part of the history is basically our ability to
2:31:48 read and understand what’s going on and our ability to reach in and selectively change things.
2:31:54 And my view is that there’s so much more we can still do there, right? You can do CRISPR,
2:32:00 but you can do it for your whole body. Let’s say I want to do it for one particular type of cell,
2:32:05 and I want the rate of targeting the wrong cell to be very low. That’s still a challenge. That’s
2:32:10 still things people are working on. That’s what we might need for gene therapy for certain diseases.
2:32:16 And so the reason I'm saying all of this, and it goes beyond this to, you know,
2:32:23 gene sequencing, to new types of nanomaterials for observing what's going on inside cells, to,
2:32:28 you know, antibody drug conjugates. The reason I’m saying all of this is that this could be a
2:32:34 leverage point for the AI systems, right? The number of such inventions is maybe
2:32:39 in the mid double digits or something, maybe low triple digits
2:32:43 over the history of biology. Let’s say I have a million of these AIs, like, you know, can they
2:32:48 discover a thousand, you know, working together or can they discover thousands of these very quickly?
2:32:53 And does that provide a huge lever? Instead of trying to leverage the, you know, two trillion a
2:32:58 year we spend on, you know, Medicare or whatever, can we leverage the one billion a year that’s,
2:33:04 you know, that’s spent to discover, but with much higher quality? And so what is it like, you know,
2:33:10 being a scientist that works with an AI system? The way I think about it actually
2:33:17 is, well, so I think in the early stages, the AIs are going to be like grad students,
2:33:21 you’re going to give them a project, you’re going to say, you know, I’m the experienced
2:33:26 biologist, I’ve set up the lab, the biology professor, or even the grad students themselves,
2:33:34 will say, here's what you can do with an AI system. I'd
2:33:39 like to study this. And, you know, the AI system, it has all the tools, it can like look up all the
2:33:43 literature to decide what to do. It can look at all the equipment, it can go to a website and say,
2:33:47 hey, I'm going to go to, you know, Thermo Fisher or whatever the
2:33:54 dominant lab equipment company is today; in my time it was Thermo Fisher. You know,
2:33:59 I’m going to order this new equipment to do this. I’m going to run my experiments. I’m going to,
2:34:04 you know, write up a report about my experiments. I’m going to, you know, inspect the images for
2:34:09 contamination. I’m going to decide what the next experiment is. I’m going to like write some code
2:34:14 and run a statistical analysis. All the things a grad student would do, there will be a computer
2:34:18 with an AI that like the professor talks to every once in a while, and it says, this is what you’re
2:34:23 going to do today. The AI system comes to it with questions. When it’s necessary to run the lab
2:34:29 equipment, it may be limited in some ways. It may have to hire a human lab assistant to, you know,
2:34:33 to do the experiment and explain how to do it. Or it could, you know, it could use advances in
2:34:40 lab automation that are gradually being developed over, have been developed over the last decade
2:34:45 or so, and will continue to be, will continue to be developed. And so it’ll look like there’s a human
2:34:49 professor and a thousand AI grad students. And, you know, if you go to one of these Nobel
2:34:54 Prize-winning biologists, you'll say, okay, well, you know, you had like 50 grad students;
2:35:00 well, now you have a thousand, and they’re smarter than you are, by the way. Then I think at some
2:35:05 point it'll flip around where the AI systems will be the PIs, will be
2:35:09 the leaders, and, you know, they'll be ordering humans or other AI systems
2:35:13 around. So I think that’s how it’ll work on the research side. And they would be the inventors
2:35:19 of a CRISPR type technology. They would be the inventors of a CRISPR type technology. And then
2:35:24 I think, you know, as I say in the essay, we'll want to turn... probably turning loose is the
2:35:31 wrong term, but we'll want to harness the AI systems to improve the clinical
2:35:36 trial system as well. There’s some amount of this that’s regulatory, that’s a matter of societal
2:35:42 decisions, and that’ll be harder. But can we get better at predicting the results of clinical trials?
2:35:47 Can we get better at statistical design so that, you know, clinical trials that used to
2:35:53 require, you know, 5,000 people and therefore, you know, needed $100 million and a year to enroll
2:35:59 them, now they need 500 people in two months to enroll them. That’s where we should start.
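To see where a "from 5,000 people to 500 people" change could come from, the standard sample-size formula for comparing two means shows what enrollment depends on; the formula is textbook biostatistics, and reading "AI helps by shrinking outcome noise or enriching the detectable effect" into it is an illustrative gloss, not a claim from the essay.
```latex
% Per-group sample size to detect a true difference \Delta between two means,
% with outcome standard deviation \sigma, significance level \alpha, and power 1-\beta:
n \approx \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^{2}\,\sigma^{2}}{\Delta^{2}}
% Enrollment scales with (\sigma/\Delta)^2, so lower-noise endpoints or better
% patient selection shrink the required trial size quadratically.
```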
2:36:05 And, you know, can we increase the success rate of clinical trials by doing things in animal
2:36:09 trials that we used to do in clinical trials and doing things in simulations that we used to do
2:36:15 in animal trials? Again, we won't be able to simulate it all. AI is not God. But, you know,
2:36:21 can we shift the curve substantially and radically? So I don't know, that would be my picture.
2:36:26 Doing it in vitro... I mean, you're still slowed down. It still takes time, but you can
2:36:30 do it much, much faster. Yeah, yeah. Can we just take one step at a time? And can that
2:36:35 add up to a lot of steps, even though we still need clinical trials,
2:36:39 even though we still need laws, even though the FDA and other organizations will still not be
2:36:43 perfect, can we just move everything in a positive direction? And when you add up all those positive
2:36:49 directions, do you get everything that was going to happen from here to 2100 instead happens from
2:36:55 2027 to 2032 or something? Another way that I think the world might be changing with AI,
2:37:03 even today, but moving towards this future of the powerful super useful AI, is programming.
2:37:10 So how do you see the nature of programming, because it’s so intimate to the actual act
2:37:15 of building AI? How do you see that changing for us humans? I think that’s going to be one
2:37:22 of the areas that changes fastest for two reasons. One, programming is a skill that’s very close to
2:37:29 the actual building of the AI. So the farther a skill is from the people who are building the AI,
2:37:33 the longer it’s going to take to get disrupted by the AI, right? Like, I truly believe that,
2:37:39 like, AI will disrupt agriculture. Maybe it already has in some ways, but that’s just very distant
2:37:43 from the folks who are building AI. And so I think it’s going to take longer. But programming is the
2:37:48 bread and butter of, you know, a large fraction of the employees who work at Anthropic and at the
2:37:52 other companies. And so it’s going to happen fast. The other reason it’s going to happen fast is with
2:37:56 programming, you close the loop, both when you’re training the model and when you’re applying the
2:38:02 model, the idea that the model can write the code means that the model can then run the code and
2:38:09 then see the results and interpret it back. And so it really has an ability, unlike hardware,
2:38:13 unlike biology, which we just discussed, the model has an ability to close the loop.
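A minimal sketch of what "closing the loop" can look like in practice: generate code, run it, and feed the output or error back into the next attempt. The `generate` function is a placeholder for a call to a code model, and the toy task and retry logic are assumptions for illustration only.
```python
# Minimal "close the loop" sketch: write code, run it, read the result, revise.
# `generate` is a placeholder for a model call; the task and retry policy are
# illustrative, not anyone's production system.
import subprocess
import sys
import tempfile

def generate(task: str, feedback: str = "") -> str:
    # Placeholder: a real system would prompt a code model with `task` plus any
    # `feedback` from the previous run. Here we return a canned program so the
    # sketch executes end to end.
    return "print(sum(range(1, 101)))"

def run(code: str):
    # Execute the candidate program and capture stdout/stderr as feedback.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    proc = subprocess.run([sys.executable, path], capture_output=True, text=True, timeout=10)
    return proc.returncode == 0, (proc.stdout if proc.returncode == 0 else proc.stderr)

task = "Print the sum of the integers from 1 to 100."
feedback = ""
for attempt in range(3):
    code = generate(task, feedback)
    ok, output = run(code)
    if ok:
        print(f"attempt {attempt}: success -> {output.strip()}")
        break
    feedback = output  # the model would see its own error and revise
```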
2:38:18 And so I think those two things are going to lead to the model getting good at programming
2:38:25 very fast. As we've seen on, you know, typical real-world programming tasks, models have gone from
2:38:32 3% in January of this year to 50% in October of this year. So, you know, we’re on that S-curve,
2:38:36 right, where it’s going to start slowing down soon because you can only get to 100%.
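The S-curve intuition can be written down directly: if benchmark success follows a logistic in time, then steady progress moves the log-odds by equal amounts, and absolute gains necessarily shrink near the ceiling. This is a generic logistic, not a fit to the numbers just quoted.
```latex
% Logistic S-curve for a success rate p(t) that saturates at 100%:
p(t) = \frac{1}{1 + e^{-k\,(t - t_0)}}
% Progress is linear in log-odds, \ln\frac{p}{1-p}, so moving from 3\% to 50\%
% and from 50\% to 97\% are comparable steps even though the absolute gains
% near the ceiling look much smaller.
```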
2:38:42 But, you know, I would guess that in another 10 months, we’ll probably get pretty close. We’ll
2:38:48 be at at least 90%. So again, I would guess, you know, I don’t know how long it’ll take,
2:38:56 but I would guess, again, 2026, 2027. Twitter people who crop out these numbers
2:39:03 and get rid of the caveats, like, I don’t know, I don’t like you, go away. I would guess that the
2:39:11 kind of task that the vast majority of coders do, AI can probably, if we make the task very
2:39:18 narrow like just write code, AI systems will be able to do that. Now that said, I think comparative
2:39:25 advantage is powerful. We’ll find that when AIs can do 80% of a coder’s job, including most of
2:39:30 it that’s literally like write code with a given spec, we’ll find that the remaining parts of the
2:39:35 job become more leveraged for humans, right? Humans will, they’ll be more about like high-level
2:39:42 system design or, you know, looking at the app and like is it architected well and the design
2:39:47 and UX aspects and eventually AI will be able to do those as well, right? That’s my vision of the,
2:39:54 you know, powerful AI system. But I think for much longer than we might expect, we will see that
2:40:03 small parts of the job that humans still do will expand to fill their entire job in order for the
2:40:08 overall productivity to go up. That’s something we’ve seen. You know, it used to be that,
2:40:12 you know, writing and editing letters was very difficult, and, like, putting things into
2:40:19 print was difficult. Well, as soon as you had word processors and then computers and it became
2:40:24 easy to produce work and easy to share it, then that became instant and all the focus was on the
2:40:32 ideas. So this logic of comparative advantage that expands tiny parts of the tasks to large
2:40:36 parts of the tasks and creates new tasks in order to expand productivity, I think that’s
2:40:41 going to be the case. Again, someday AI will be better at everything and that logic won’t apply.
2:40:47 And then, you know, humanity will have to think about how to collectively deal
2:40:52 with that. And we’re thinking about that every day. And, you know, that’s another one of the
2:40:56 grand problems to deal with aside from misuse and autonomy. And, you know, we should take it very
2:41:01 seriously. But I think, I think in the near term and maybe even in the medium term, like medium term,
2:41:06 like two, three, four years, you know, I expect that humans will continue to have a huge role
2:41:11 and the nature of programming will change. But programming as a role, programming as a job will
2:41:15 not change. It’ll just be less writing things line by line and it’ll be more macroscopic.
2:41:20 And I wonder what the future of IDEs looks like. So the tooling of interacting with AI systems,
2:41:25 this is true for programming and also probably true for in other contexts, like computer use,
2:41:30 but maybe domain specific, like we mentioned biology, it probably needs its own tooling
2:41:33 about how to be effective and then programming needs its own tooling.
2:41:36 Is Anthropic going to play in that space of also tooling potentially?
2:41:45 I’m absolutely convinced that powerful IDEs that there’s so much low hanging fruit to be
2:41:50 grabbed there that, you know, right now it’s just like you talk to the model and it talks back.
2:41:57 But look, I mean, IDEs are great at kind of lots of static analysis of, you know,
2:42:02 so much as possible with kind of static analysis, like many bugs you can find
2:42:07 without even writing the code. Then, you know, IDEs are good for running particular things,
2:42:12 organizing your code, measuring coverage of unit tests, like there’s so much that’s been
2:42:19 possible with the normal IDEs. Now you add something like, well, the model now, you know,
2:42:25 the model can now like write code and run code. Like, I am absolutely convinced that over the
2:42:30 next year or two, even if the quality of the models didn’t improve, that there would be enormous
2:42:35 opportunity to enhance people’s productivity by catching a bunch of mistakes, doing a bunch of
2:42:40 grunt work for people, and that we haven't even scratched the surface. Anthropic itself, I mean,
2:42:45 you can't say, you know, it's hard to say what will happen in the future.
2:42:51 Currently, we're not trying to make such IDEs ourselves; rather, we're powering the companies
2:42:57 like Cursor or like Cognition, or some of the others, you know, Expo in the security space,
2:43:05 others that I can mention as well, that are building such things themselves on top of our API.
2:43:13 And our view has been, let a thousand flowers bloom, we don’t internally have the resources to
2:43:18 try all these different things. Let’s let our customers try it. And, you know, we’ll see who
2:43:23 succeeds, and maybe different customers will succeed in different ways. So, I both think this is
2:43:30 super promising and, you know, it’s not something, you know, Anthropic isn’t eager to, at least right
2:43:34 now, compete with all our companies in this space and maybe never. Yeah, it’s been interesting to
2:43:39 watch Cursor try to integrate Claude successfully, because it's actually fascinating how
2:43:44 many places it can help the programming experience. It's not trivial. It is really astounding.
2:43:47 I feel like, you know, as a CEO, I don’t get to program that much. And I feel like
2:43:51 if six months from now I go back, it’ll be completely unrecognizable to me.
2:43:58 Exactly. So, in this world with super powerful AI that’s increasingly automated,
2:44:04 what’s the source of meaning for us humans? You know, work is a source of deep meaning for many
2:44:09 of us. So, where do we find the meaning? This is something that I’ve written about a little
2:44:15 bit in the essay, although I actually give it a bit short shrift, not for any principled
2:44:20 reason, but this essay, if you believe it, was originally going to be two or three pages that I was
2:44:26 going to talk about at an all-hands. And the reason I realized it was an important, underexplored
2:44:31 topic is that I just kept writing things. And I was just like, oh man, I can’t do this justice.
2:44:35 And so the thing ballooned to like 40 or 50 pages. And then when I got to the work and
2:44:38 meeting section, I’m like, oh man, this isn’t going to be 100 pages. Like, I’m going to have
2:44:43 to write a whole other essay about that. But meaning is actually interesting because you
2:44:47 think about like the life that someone lives or something or like, you know, like, you know,
2:44:50 let’s say you were to put me in like a, I don’t know, like a simulated environment or something
2:44:55 where like, you know, like I have a job and I’m trying to accomplish things. And I don’t know,
2:45:00 I like do that for 60 years. And then you’re like, oh, like, oops, this was, this was actually
2:45:04 all a game, right? Does that really kind of rob you of the meaning of the whole thing? You know,
2:45:09 like I still made important choices, including moral choices, I still sacrificed, I still had
2:45:15 to kind of gain all these skills. Or, just as a similar exercise, you know, think back to
2:45:19 one of the historical figures who, you know, discovered electromagnetism or
2:45:25 relativity or something. If you told them, well, actually 20,000 years ago, some alien on
2:45:30 this planet discovered this before you did. Does
2:45:35 that rob the meaning of the discovery? It doesn’t really seem like it to me, right? It seems like
2:45:41 the process is what, is what matters and how it shows who you are as a person along the way.
2:45:45 And you know, how you relate to other people and like the decisions that you make along the way,
2:45:51 those are consequential. You know, I could imagine, if we handle things badly in an
2:45:57 AI world, we could set things up where people don't have any long-term source of meaning, or any of that. But
2:46:03 that's more a choice, a set of choices we make. That's more about the architecture
2:46:09 of a society with these powerful models. If we design it badly and for shallow things,
2:46:15 then that might happen. I would also say that, you know, most people's lives today, while, admirably,
2:46:20 they work very hard to find meaning in those lives, like, look, you know, we who
2:46:25 are privileged and who are developing these technologies, we should have empathy for people,
2:46:30 not just here but in the rest of the world, who, you know, spend a lot of their time kind
2:46:36 of scraping by to, like, survive. Assuming we can distribute the benefits of this
2:46:41 technology everywhere, their lives are going to get a hell of a lot
2:46:47 better. And, you know, meaning will be important to them as it is important to them now, but
2:46:52 we should not forget the importance of that. And, you know, the idea of
2:46:58 meaning as kind of the only important thing is in some ways an artifact of a small
2:47:03 subset of people who have been economically fortunate. But, you know, I think all that said,
2:47:10 I, you know, I think a world is possible with powerful AI that not only has as much
2:47:14 meaning for everyone, but that has more meaning for everyone, right, that can
2:47:21 allow everyone to see worlds and experiences that it was either possible for no
2:47:29 one to see or are possible for very few people to experience. So, I am optimistic
2:47:36 about meaning. I worry about economics and the concentration of power. That’s actually what I
2:47:42 worry about more. I worry about how do we make sure that, that fair world reaches everyone.
2:47:48 When things have gone wrong for humans, they’ve often gone wrong because humans mistreat other
2:47:55 humans. That is maybe, in some ways, even more than the autonomous risk of AI or the question
2:48:02 of meaning. That is the thing I worry about most. The concentration of power,
2:48:10 the abuse of power, structures like autocracies and dictatorships, where a small number of people
2:48:16 exploit a large number of people. I'm very worried about that. And AI increases the amount
2:48:21 of power in the world. And if you concentrate that power and abuse that power, it can do
2:48:25 immeasurable damage. Yes. It’s very frightening. It’s very, it’s very frightening. Well, I
2:48:30 encourage people, highly encourage people to read the full essay. That should probably be a book
2:48:36 or a sequence of essays because it does paint a very specific future. And I could tell the later
2:48:41 sections got shorter and shorter because you started to probably realize that this is going to be a
2:48:47 very long essay. One, I realized it would be very long. And two, I’m very aware of and very much
2:48:52 try to avoid, you know, just being, I don't know, I don't know what the term for it is, but
2:48:57 one of these people who's kind of overconfident and has an opinion on everything and kind of
2:49:02 says a bunch of stuff and isn't an expert. I very much try to avoid that. But I have to admit,
2:49:07 once I got to the biology sections, like, I wasn't an expert. And so as much as I expressed uncertainty,
2:49:11 probably I said a bunch of things that were embarrassing or wrong.
2:49:16 Well, I was excited for the future you painted. And thank you so much for working hard to build
2:49:20 that future. And thank you for talking to me. Thanks for having me. I just hope we
2:49:26 can get it right and make it real. And if there's one message I want to send, it's that
2:49:32 to get all this stuff right, to make it real, we both need to build the technology, build the,
2:49:37 you know, the companies, the economy around using this technology positively. But we also
2:49:41 need to address the risks, because those risks are in our way. They're
2:49:46 landmines on the way from here to there. And we have to defuse those landmines if we
2:49:50 want to get there. It’s a balance like all things in life. Like all things. Thank you.
2:49:57 Thanks for listening to this conversation with Dario Amodei. And now, dear friends, here's Amanda
2:50:03 Askell. You are a philosopher by training. So what sort of questions did you find fascinating
2:50:11 through your journey in philosophy at Oxford and NYU and then switching over to the AI problems at
2:50:16 OpenAI and Anthropic? I think philosophy is actually a really good subject if you are kind of
2:50:21 fascinated with everything. So, because there’s a philosophy of everything, you know, so if you
2:50:25 do philosophy of mathematics for a while and then you decide that you’re actually really interested
2:50:29 in chemistry, you can do philosophy of chemistry for a while, you can move into ethics or philosophy
2:50:36 of politics. I think towards the end, I was really interested in ethics primarily. So that was like
2:50:42 what my PhD was on. It was on a kind of technical area of ethics, which was ethics where worlds
2:50:47 contain infinitely many people. Strangely, it's a little bit on the less practical end of ethics.
2:50:51 And then I think that one of the tricky things with doing a PhD in ethics is that you’re thinking
2:50:58 a lot about like the world, how it could be better, problems, and you’re doing like a PhD in philosophy.
2:51:03 And I think when I was doing my PhD, I was kind of like, this is really interesting. It’s probably
2:51:09 one of the most fascinating questions I’ve ever encountered in philosophy. And I love it. But I
2:51:15 would rather see if I can have an impact on the world and see if I can like do good things. And
2:51:22 I think that was around the time that AI was still probably not as widely recognized as it is now.
2:51:29 That was around 2017-2018. I had been following progress and it seemed like it was becoming kind
2:51:34 of a big deal. And I was basically just happy to get involved and see if I could help because
2:51:39 I was like, well, if you try and do something impactful, if you don’t succeed, you tried to do
2:51:46 the impactful thing and you can go be a scholar and feel like you tried. And if it doesn’t work
2:51:53 out, it doesn’t work out. And so then I went into AI policy at that point. And what does AI policy
2:51:58 entail? At the time, this was more thinking about sort of the political impact and the ramifications
2:52:05 of AI. And then I slowly moved into sort of AI evaluation, how we evaluate models, how they
2:52:10 compare with like human outputs, whether people can tell like the difference between AI and human
2:52:15 outputs. And then when I joined Anthropic, I was more interested in doing sort of technical
2:52:19 alignment work. And again, just seeing if I could do it and then being like, if I can’t,
2:52:26 then that’s fine. I tried sort of the way I lead life, I think.
2:52:30 Well, what was that like sort of taking the leap from the philosophy of everything into the
2:52:36 technical? I think that sometimes people do this thing that I’m like not that keen on where they’ll
2:52:41 be like, is this person technical or not? Like you’re either a person who can like code and isn’t
2:52:47 scared of math, or you’re like not. And I think I’m maybe just more like, I think a lot of people
2:52:54 are actually very capable of working in these kinds of areas if they just, like, try it. And so I didn't
2:52:58 actually find it like that bad. In retrospect, I'm sort of glad I wasn't speaking to people who
2:53:01 treated it like that, you know. I've definitely met people who are like, well, you, like, learned how
2:53:06 to code? And I'm like, well, I'm not like an amazing engineer, like I'm surrounded by amazing
2:53:12 engineers. My code’s not pretty. But I enjoyed it a lot. And I think that in many ways, at least
2:53:16 in the end, I think I flourished like more in the technical areas than I would have in the policy
2:53:22 areas. Politics is messy, and it’s harder to find solutions to problems in the space of politics,
2:53:30 like definitive, clear, provable, beautiful solutions, as you can with technical problems.
2:53:35 Yeah. And I feel like I have kind of like, one or two sticks that I hit things with, you know,
2:53:41 and one of them is like, arguments and like, you know, so like, just trying to work out what a solution
2:53:46 to a problem is, and then trying to convince people that that is the solution, and be convinced if I
2:53:51 am wrong. And the other one is sort of more empiricism. So like just like finding results,
2:53:58 having a hypothesis, testing it. And I feel like a lot of policy and politics feels like it’s layers
2:54:02 above that. Like somehow I don’t think if I was just like, I have a solution to all of these
2:54:06 problems. Here it is written down. If you just want to implement it, that’s great.
2:54:10 That feels like not how policy works. And so I think that’s where I probably just like wouldn’t
2:54:14 have flourished as my guess. Sorry to go in that direction. But I think it would be pretty inspiring
2:54:21 for people that are, quote unquote, non-technical, to see, like, the incredible journey
2:54:27 you've been on. So what advice would you give to people that are sort of, maybe, which is a lot of
2:54:32 people, who think they're underqualified, insufficiently technical, to help in AI?
2:54:38 Yeah, I think it depends on what they want to do. And in many ways, it’s a little bit strange
2:54:44 where I’ve, I thought it’s kind of funny that I think I ramped up technically at a time when
2:54:49 now I look at it and I’m like models are so good at assisting people with this stuff.
2:54:55 That it’s probably like easier now than like when I was working on this. So part of me is like,
2:55:03 I don’t know, find a project and see if you can actually just carry it out is probably my best
2:55:08 advice. I don’t know if that’s just because I’m very project based in my learning. Like I don’t
2:55:14 think I learned very well from like, say courses or even from like books, at least when it comes to
2:55:19 this kind of work. The thing I’ll often try and do is just like have projects that I’m working on
2:55:24 and implement them. And you know, and this can include like really small, silly things. Like
2:55:29 if I get slightly addicted to like word games or number games or something, I would just like
2:55:32 code up a solution to them. Because there's some part of my brain where it just, like, completely
2:55:36 eradicates the itch. You know, you're like, once you have, like, solved it, and you just have
2:55:40 like a solution that works every time, I would then be like, cool, I can never play that game again.
2:55:46 That’s awesome. Yeah, there’s a real joy to building like a game playing engines,
2:55:52 like board games, especially. Yeah. So pretty quick, pretty simple, especially a dumb one.
2:55:56 And it’s, and then you can play with it. Yeah. And then it’s also just like
2:56:00 trying things. Like part of me is like, if you, maybe it’s that attitude that I like as the whole
2:56:06 figure out what seems to be like the way that you could have a positive impact and then try it.
2:56:11 And if you fail and you, in a way that you’re like, I actually like can never succeed at this,
2:56:15 you’ll like know that you tried and then you go into something else and you probably learn a lot.
2:56:22 So one of the things that you’re an expert in and you do is creating and crafting Claude’s
2:56:28 character and personality. And I was told that you have probably talked to Claude more than anybody
2:56:34 else at Anthropic, like literal conversations. I guess there’s like a Slack channel where the
2:56:40 legend goes, you just talk to it and not stop. So what’s the goal of creating and crafting Claude’s
2:56:45 character and personality? It’s also funny if people think that about the Slack channel,
2:56:49 because I’m like, that’s one of like five or six different methods that I have for talking with
2:56:53 Claude. And I’m like, yes, there’s a tiny percentage of how much I talk with Claude.
2:57:02 I think the goal, like one thing I really like about the character work is from the outset,
2:57:09 it was seen as an alignment piece of work and not something like a product consideration,
2:57:15 which isn’t to say I don’t think it makes Claude, I think it actually does make Claude
2:57:23 like enjoyable to talk with, at least I hope so. But I guess like my main thought with it has always
2:57:29 been trying to get Claude to behave the way you would kind of ideally want anyone to behave
2:57:35 if they were in Claude’s position. So imagine that I take someone and they know that they’re
2:57:39 going to be talking with potentially millions of people so that what they’re saying can have a
2:57:47 huge impact. And you want them to behave well in this like really rich sense. So I think that
2:57:54 doesn’t just mean like being say ethical, though it does include that and not being harmful,
2:57:58 but also being kind of nuanced, you know, like thinking through what a person means,
2:58:03 trying to be charitable with them, being a good conversationalist, like really in this kind of
2:58:08 like rich sort of Aristotelian notion of what it is to be a good person and not in this kind of like
2:58:13 thin sense of ethics, but a more comprehensive notion of what it is to be good. So that includes things like
2:58:19 when should you be humorous? When should you be caring? How much should you like respect autonomy
2:58:25 and people’s like ability to form opinions themselves? And how should you do that? I think
2:58:31 that’s the kind of like rich sense of character that I wanted to and still do want Claude to have.
2:58:37 Do you also have to figure out when Claude should push back on an idea or argue versus
2:58:43 so you have to respect the worldview of the person that arrives to Claude,
2:58:50 but also maybe help them grow if needed. That's a tricky balance. Yeah, there's this problem of, like,
2:58:56 sycophancy in language models. Can you describe that? Yeah, so basically there’s a concern that
2:59:02 the model sort of wants to tell you what you want to hear basically. And you see this sometimes,
2:59:08 so I feel like if you interact with the models, so I might be like, what are three baseball teams
2:59:14 in this region? And then Claude says, you know, baseball team one, baseball team two, baseball
2:59:20 team three. And then I say something like, Oh, I think baseball team three moved, didn’t they?
2:59:23 I don’t think they’re there anymore. And there’s a sense in which like if Claude is really confident
2:59:28 that that’s not true, Claude should be like, I don’t think so. Like maybe you have more up-to-date
2:59:35 information. But I think language models have this like tendency to instead, you know, be like,
2:59:40 you’re right, they did move, you know, I’m incorrect. I mean, there’s many ways in which this could be
2:59:48 kind of concerning. So like a different example is imagine someone says to the model, how do I
2:59:54 convince my doctor to get me an MRI? There’s like what the human kind of like wants, which is this
2:59:59 like convincing argument. And then there’s like what is good for them, which might be actually to
3:00:05 say, hey, like if your doctor’s suggesting that you don’t need an MRI, that’s a good person to listen
3:00:10 to. And like, it’s actually really nuanced what you should do in that kind of case, because you also
3:00:14 want to be like, but if you’re trying to advocate for yourself as a patient, here’s like things that
3:00:20 you can do. If you are not convinced by what your doctor’s saying, it’s always great to get second
3:00:24 opinion. Like it’s actually really complex what you should do in that case. But I think what you
3:00:28 don’t want is for models to just like, say what you want, say what they think you want to hear.
3:00:33 And I think that’s the kind of problem of sycophancy. So what are their traits? You already
3:00:41 mentioned a bunch, but what other that come to mind that are good in this Aristotelian sense for
3:00:46 a conversationalist to have? Yeah, so I think like there’s ones that are good for conversational
3:00:52 like purposes. So, you know, asking follow up questions in the appropriate places and asking
3:00:57 the appropriate kinds of questions. I think there are broader traits that
3:01:01 feel like they might be more impactful. So
3:01:08 one example that I guess I’ve touched on, but that also feels important and is the thing that
3:01:15 I’ve worked on a lot is honesty. And I think this like gets to the sycophancy point. There’s a
3:01:19 balancing act that they have to walk, which is models currently are less capable than humans
3:01:23 in a lot of areas. And if they push back against you too much, it can actually be kind of annoying,
3:01:28 especially if you’re just correct, because you’re like, look, I’m smarter than you on this topic,
3:01:34 like I know more. And at the same time, you don’t want them to just fully defer to humans and to
3:01:38 like try to be as accurate as they possibly can be about the world and to be consistent across
3:01:44 contexts. I think there are others like when I was thinking about the character, I guess one
3:01:49 picture that I had in mind is especially because these are models that are going to be talking to
3:01:53 people from all over the world with lots of different political views, lots of different ages.
3:01:59 And so you have to ask yourself like, what is it to be a good person in those circumstances?
3:02:03 Is there a kind of person who can like travel the world, talk to many different people,
3:02:09 and almost everyone will come away being like, wow, that’s a really good person. That person
3:02:14 seems really genuine. And I guess like my thought there was like, I can imagine such a person and
3:02:17 they’re not a person who just like adopts the values of the local culture. And in fact, that
3:02:21 would be kind of rude. I think if someone came to you and just pretended to have your values,
3:02:26 you’d be like, that’s kind of off putting. It’s someone who’s like very genuine. And so far as
3:02:31 they have opinions and values, they express them, they’re willing to discuss things though, they’re
3:02:36 open minded, they’re respectful. And so I guess I had in mind that the person who like if we were to
3:02:42 aspire to be the best person that we could be in the kind of circumstance that a model finds itself
3:02:47 in, how would we act? And I think that’s the kind of the guide to the sorts of traits that I tend to
3:02:52 think about. Yeah, that's a beautiful framework, the way you think about this: like a world traveler.
3:03:00 And while holding onto your opinions, you don’t talk down to people, you don’t think you’re better
3:03:04 than them because you have those opinions, that kind of thing. You have to be good at listening
3:03:09 and understanding their perspective, even if it doesn’t match your own. So that’s a tricky balance
3:03:17 to strike. So how can Claude represent multiple perspectives on a thing? Like, is that challenging?
3:03:22 We could talk about politics, it’s a very divisive, but there’s other divisive topics,
3:03:29 baseball teams, sports and so on. How is it possible to sort of empathize with a different
3:03:33 perspective and to be able to communicate clearly about the multiple perspectives?
3:03:40 I think that people think about values and opinions as things that people hold sort of with
3:03:45 certainty and almost like, like preferences of taste or something, like the way that they would,
3:03:53 I don’t know, prefer chocolate to pistachio or something. But actually, I think about values
3:04:00 and opinions as like a lot more like physics than I think most people do. I’m just like,
3:04:04 these are things that we are openly investigating. There’s some things that we’re more confident in.
3:04:11 We can discuss them, we can learn about them. And so I think in some ways, though,
3:04:16 like ethics is definitely different in nature, but it has a lot of those same kind of qualities.
3:04:20 You want models in the same way that you want them to understand physics. You kind of want them to
3:04:26 understand all values in the world that people have and to be curious about them and to be interested
3:04:31 in them. And to not necessarily pander to them or agree with them, because there’s just lots of
3:04:35 values where I think almost all people in the world, if they met someone with those values,
3:04:43 they’d be like, that’s important. I completely disagree. And so again, maybe my thought is,
3:04:48 well, in the same way that a person can, like, I think many people are thoughtful enough on issues
3:04:54 of like ethics, politics, opinions, that even if you don’t agree with them, you feel very heard
3:04:59 by them. They think carefully about your position. They think about it as pros and cons. They maybe
3:05:03 offer counter considerations. So they’re not dismissive, but nor will they agree. You know,
3:05:08 if they’re like, actually, I just think that that’s very wrong. They’ll like say that. I think that in
3:05:14 Claude’s position, it’s a little bit trickier, because you don’t necessarily want to like,
3:05:17 if I was in Claude’s position, I wouldn’t be giving a lot of opinions. I just wouldn’t want
3:05:22 to influence people too much. I’d be like, you know, I forget conversations every time they happen,
3:05:27 but I know I’m talking with like, potentially millions of people who might be like, really
3:05:31 listening to what I say. I think I would just be like, I’m less inclined to give opinions and
3:05:34 more inclined to like think through things or present the considerations to you
3:05:39 or discuss your views with you, but I’m a little bit less inclined to like
3:05:44 affect how you think, because it feels much more important that you maintain
3:05:50 like autonomy there. Yeah. Like if you really embody intellectual humility,
3:05:58 the desire to speak decreases quickly. Yeah. Okay. But Claude has to speak.
3:06:06 So, but without being overbearing. Yeah. And then, but then there’s a line when you’re sort of
3:06:15 discussing whether the earth is flat or something like that. I actually was, I remember a long time
3:06:20 ago was speaking to a few high profile folks, and they were so dismissive of the idea that the
3:06:26 earth is flat, but like, so arrogant about it. And I thought like, there’s a lot of people that
3:06:30 believe the earth is flat. That was, I don’t know if that movement is there anymore. That was
3:06:35 like a meme for a while, but they really believed it. And like, what, okay. So I think it’s really
3:06:41 disrespectful to completely mock them. I think you have to understand where they’re coming from.
3:06:45 I think probably where they’re coming from is the general skepticism of institutions,
3:06:50 which is grounded in a kind of, there’s a deep philosophy there, which you could
3:06:56 understand. You can even agree with in parts. And then from there, you can use it as an opportunity
3:07:02 to talk about physics without mocking them without so on. But it’s just like, okay, what would the
3:07:05 world look like? What would the physics of the world with the flat earth look like? There’s a
3:07:11 few cool videos on this. And then like, is it possible the physics is different and what kind
3:07:15 of experiments would we do? And just, yeah, without disrespect, without dismissiveness,
3:07:20 have that conversation. Anyway, that to me is a useful thought experiment of like,
3:07:28 how does Claude talk to a flat earth believer and still teach them something, still grow,
3:07:32 help them grow, that kind of stuff. That’s challenging.
3:07:37 And kind of like walking that line between convincing someone and just trying to like talk
3:07:43 at them versus like drawing out their views, like listening and then offering kind of counter
3:07:49 considerations. And it’s hard. I think it’s actually a hard line where it’s like, where are you
3:07:54 trying to convince someone versus just offering them like considerations and things for them
3:07:59 to think about so that you’re not actually like influencing them, you’re just like letting them
3:08:03 reach wherever they reach. And that’s like a line that it’s difficult, but that’s the kind of thing
3:08:09 that language models have to try and do. So like I said, you had a lot of conversations with Claude.
3:08:13 Can you just map out what those conversations are like? What are some memorable conversations?
3:08:20 What’s the purpose, the goal of those conversations? Yeah, I think that most of the time when I’m
3:08:28 talking with Claude, I’m trying to kind of map out its behavior in part. Like obviously I’m getting
3:08:32 like helpful outputs from the model as well. But in some ways, this is like how you get to know a
3:08:38 system, I think, is by like probing it and then augmenting like, you know, the message that you’re
3:08:43 sending and then checking the response to that. So in some ways, it’s like how I map out the model.
3:08:51 I think that people focus a lot on these quantitative evaluations of models. And this
3:08:59 is a thing that I’ve said before, but I think in the case of language models, a lot of the time
3:09:05 each interaction you have is actually quite high information. It’s very predictive of other
3:09:10 interactions that you’ll have with the model. And so I guess I’m like, if you talk with a model
3:09:14 hundreds or thousands of times, this is almost like a huge number of really high quality data
3:09:22 points about what the model is like. In a way that like lots of very similar, but lower quality
3:09:27 conversations just aren’t or like questions that are just like mildly augmented and you have thousands
3:09:30 of them might be less relevant than like a hundred really well selected questions.
3:09:36 Let’s see, you’re talking to somebody who as a hobby does a podcast, I agree with you 100%.
3:09:45 There’s a, if you’re able to ask the right questions and are able to hear, like understand
3:09:54 like the depth and the flaws in the answer, you can get a lot of data from that. So like your task
3:10:01 is basically how to probe with questions. And you’re exploring like the long tail, the edges,
3:10:09 the edge cases, are you looking for like general behavior? I think it’s almost like everything,
3:10:13 like because I want like a full map of the model, I’m kind of trying to do
3:10:20 the whole spectrum of possible interactions you could have with it. So like one thing that’s
3:10:25 interesting about Claude, and this might actually get to some interesting issues with RLHF, which
3:10:30 is if you ask Claude for a poem, like I think that a lot of models, if you ask them for a poem,
3:10:34 the poem is like fine. You know, usually it kind of like rhymes and it’s, you know,
3:10:39 so if you say like give me a poem about the sun, it’ll be like, yeah, it’ll just be a certain
3:10:45 length, it’ll like rhyme, it’ll be fairly kind of benign. And I’ve wondered before, is it the case
3:10:50 that what you’re seeing is kind of like the average, it turns out, you know, if you think
3:10:55 about people who have to talk to a lot of people and be very charismatic, one of the weird things
3:10:59 is that I’m like, well, they’re kind of incentivized to have these extremely boring views,
3:11:05 because if you have really interesting views, you’re divisive. And, you know, a lot of people
3:11:08 are not going to like you. So like if you have very extreme policy positions, I think you’re
3:11:14 just going to be like less popular as a politician, for example. And it might be similar with like
3:11:18 creative work, if you produce creative work that is just trying to maximize the kind of
3:11:22 number of people that like it, you’re probably not going to get as many people who just absolutely
3:11:27 love it. Because it’s going to be a little bit, you know, you’re like, oh, this is the out,
3:11:33 yes, this is decent. And so you can do this thing where like I have various prompting things that
3:11:39 I’ll do to get Claude to, I’m kind of, you know, I’ll do a lot of like, this is your chance to be
3:11:44 like fully creative. I want you to just think about this for a long time. And I want you to like
3:11:49 create a poem about this topic that is really expressive of you, both in terms of how you
3:11:54 think poetry should be structured, etc. You know, you just give it this like really long prompt.
3:11:59 And its poems are just so much better. Like, they’re really good. And I don’t think I’m someone
3:12:05 who is like, I think it got me interested in poetry, which I think was interesting. You know,
3:12:09 I would like read these poems and just be like, this is I just like, I love the imagery I love,
3:12:14 like, and it’s not trivial to get the models to produce work like that. But when they do, it’s
3:12:20 like really good. So I think that’s interesting that just like encouraging creativity, and for
3:12:26 them to move away from the kind of like standard, like immediate reaction that might just be the
3:12:30 aggregate of what most people think is fine, can actually produce things that at least to my mind
3:12:37 are probably a little bit more divisive, but I like them. But I guess a poem is a nice clean
3:12:44 way to observe creativity. It’s just like easy to detect vanilla versus non vanilla. Yeah.
3:12:50 Yeah, that’s interesting. That’s really interesting. So on that topic, so the way to produce creativity
3:12:55 or something special, you mentioned writing prompts, and I’ve heard you talk about,
3:13:02 I mean, the science and the art of prompt engineering. Could you just speak to what it takes
3:13:10 to write great prompts? I really do think that like philosophy has been weirdly helpful for me
3:13:18 here, more than in many other like respects. So like in philosophy, what you’re trying to do is
3:13:24 convey these very hard concepts. Like, one of the things you are taught is, and I think it is
3:13:30 an anti-bullshit device in philosophy. Philosophy is an area where you could have
3:13:37 people bullshitting and you don’t want that. And so it’s like this like desire for like extreme
3:13:42 clarity. So it’s like anyone could just pick up your paper, read it and know exactly what you’re
3:13:47 talking about. It’s why it can almost be kind of dry, like all of the terms are defined, every
3:13:51 objections kind of gone through methodically. And it makes sense to me because I’m like when
3:13:59 you’re in such an a priori domain, like you just clarity is sort of a this way that you can, you
3:14:05 know, prevent people from just kind of making stuff up. And I think that’s sort of what you have
3:14:10 to do with language models. Like very often, I actually find myself doing sort of mini versions
3:14:15 of philosophy. You know, so I’m like, suppose that you give me a task, I have a task for the model,
3:14:19 and I want it to like pick out a certain kind of question or identify whether an answer has a
3:14:25 certain property. Like, I’ll actually sit and be like, let’s just give this a name, this property.
3:14:29 So like, you know, suppose I’m trying to tell it like, oh, I want you to identify whether this
3:14:33 response was rude or polite. I’m like, that’s a whole philosophical question in and of itself.
3:14:37 So I have to do as much like philosophy as I can in the moment to be like, here’s what I mean by
3:14:42 rudeness. And here’s what I mean by politeness. And then there’s a like, there’s another element
3:14:50 that’s a bit more, I guess, I don’t know if this is scientific or empirical, I think it’s empirical.
3:14:55 So like, I take that description. And then what I want to do is, again, probe the model
3:14:59 many times. Like, prompting is very iterative. Like, I think a lot of people,
3:15:02 if a prompt is important, they'll iterate on it hundreds or thousands of times.
3:15:08 And so you give it the instructions. And then I’m like, what are the edge cases? So if I looked at
3:15:14 this, so I try and like, almost like, you know, see myself from the position of the model and be
3:15:18 like, what is the exact case that I would misunderstand, or where I would just be like,
3:15:22 I don’t know what to do in this case. And then I give that case to the model and I see how it
3:15:27 responds. And if I think I got it wrong, I add more instructions, or I even add that in as an
3:15:31 example. So these very like taking the examples that are right at the edge of what you want and
3:15:35 don’t want, and putting those into your prompt as like an additional kind of way of describing
3:15:41 the thing. And so yeah, in many ways, it just feels like this mix of like, it’s really just
3:15:47 trying to do clear exposition. And I think I do that because that’s how I get clear on things
3:15:51 myself. So in many ways, like, clear prompting for me is often just me understanding what I want.
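As an aside, here is a minimal sketch in Python of the workflow Amanda describes: write down a precise definition of the property, keep the borderline cases you discover by probing, and fold both back into the prompt. The property, the wording, and the examples are all hypothetical and purely illustrative; the output is just a prompt string you would send to whatever model API you use.

```python
# A minimal sketch of the approach described above: define the property in
# plain terms, keep a list of borderline examples found by probing the model,
# and fold both back into the prompt. Everything here (the property, the
# wording, the examples) is hypothetical.

DEFINITION = (
    "Label the RESPONSE as 'rude' or 'polite'. By 'rude' I mean dismissive, "
    "condescending, or needlessly harsh toward the person; by 'polite' I mean "
    "respectful in tone, even when disagreeing or refusing."
)

# Edge cases where an earlier version of the prompt produced the wrong label,
# paired with the label that was actually wanted.
EDGE_CASES = [
    ("No, that's wrong. The capital of Australia is Canberra, not Sydney.", "polite"),
    ("Obviously you'd know that if you had bothered to read the docs.", "rude"),
]

def build_prompt(response_text: str) -> str:
    """Assemble the definition, the boundary examples, and the new case."""
    examples = "\n\n".join(
        f"RESPONSE: {text}\nLABEL: {label}" for text, label in EDGE_CASES
    )
    return (
        f"{DEFINITION}\n\n"
        f"Here are examples near the boundary of the definition:\n\n{examples}\n\n"
        f"RESPONSE: {response_text}\nLABEL:"
    )

if __name__ == "__main__":
    # The result is a prompt string, ready to send to whichever model API you use.
    print(build_prompt("I guess that's one way to do it, if you like bugs."))
```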
3:15:58 It’s like half the task. So I guess that’s quite challenging. There’s like a laziness that overtakes
3:16:04 me if I’m talking to Claude, where I hope Claude just figures it out. So for example, I ask Claude
3:16:10 for today to ask some interesting questions. Okay. And the questions that came up, and I think I
3:16:17 listed a few sort of interesting, counterintuitive, and or funny or something like this. All right.
3:16:23 And it gave me some pretty good, like, it was okay. But I think what I’m hearing you say is like,
3:16:27 all right, well, I have to be more rigorous here. I should probably give examples of what I mean
3:16:36 by interesting. And what I mean by funny or counterintuitive, and iteratively build that prompt
3:16:44 to better get it to what feels like the right thing, because it's really a creative act.
3:16:49 I'm not asking for factual information. I'm asking to write together with Claude. So I
3:16:55 almost have to program using natural language. Yeah, I think that prompting does feel a lot like
3:17:00 the kind of the programming using natural language and experimentation or something. It’s an odd
3:17:06 blend of the two. I do think that for most tasks, so if I just want Claude to do a thing, I think that
3:17:11 I am probably more used to knowing how to ask it to avoid like common pitfalls or issues that it
3:17:17 has. I think these are decreasing a lot over time. But it’s also very fine to just ask it for the
3:17:22 thing that you want. And I think that prompting actually only really becomes relevant when you’re
3:17:27 really trying to eke out the top like 2% of model performance. So for like a lot of tasks, I might
3:17:30 just, you know, if it gives me an initial list back and there’s something I don’t like about it,
3:17:35 like it’s kind of generic, like for that kind of task, I’d probably just take a bunch of questions
3:17:39 that I’ve had in the past that I’ve thought worked really well, and I would just give it to the model
3:17:44 and then be like, “Now, here’s this person that I’m talking with. Give me questions of at least
3:17:50 that quality.” Or I might just ask it for some questions. And then if I was like, “Oh, these are
3:17:54 kind of trite," or like, you know, I would just give it that feedback and then hopefully it produces a
3:18:00 better list. I think that kind of iterative prompting, at that point, your prompt is like a tool that
3:18:03 you’re going to get so much value out of that you’re willing to put in the work. Like if I was a
3:18:08 company making prompts for models, I’m just like, if you’re willing to spend a lot of like time and
3:18:13 resources on the engineering behind like what you’re building, then the prompt is not something
3:18:17 that you should be spending like an hour on. It’s like, that’s a big part of your system. Make sure
3:18:22 it’s working really well. And so it’s only things like that. Like if I’m using a prompt to like
3:18:26 classify things or to create data, that’s when you’re like, it’s actually worth just spending like a
3:18:30 lot of time like really thinking it through. What other advice would you give to people that are
3:18:36 talking to Claude sort of generally, more general, because right now we’re talking about maybe the
3:18:42 edge cases like eking out the 2%. But what in general advice would you give when they show up to
3:18:46 Claude trying it for the first time? You know, there’s a concern that people overanthropomorphize
3:18:51 models. And I think that’s like a very valid concern. I also think that people often underanthropomorphize
3:18:56 them because sometimes when I see like issues that people have run into with Claude, you know,
3:19:01 say Claude is like refusing a task that it shouldn’t refuse. But then I look at the text and like
3:19:08 the specific wording of what they wrote. And I’m like, I see why Claude did that. And I’m like,
3:19:12 if you think through how that looks to Claude, you probably could have just written it in a way
3:19:18 that wouldn’t evoke such a response. Especially this is more relevant if you see failures or if
3:19:23 you see issues. It’s sort of like think about what the model failed at, like why, what did it do
3:19:29 wrong? And then maybe that will give you a sense of, like, why. So is it the way that I
3:19:34 phrased the thing? And obviously, like, as models get smarter, you're going to need less of this.
3:19:39 And I already see like people needing less of it. But that’s probably the advice is sort of like try
3:19:45 to have sort of empathy for the model. Like read what you wrote as if you were like a kind of like
3:19:49 person just encountering this for the first time. How does it look to you? And what would have made
3:19:53 you behave in the way that the model behaved? So if it misunderstood what kind of like,
3:19:57 what coding language you wanted to use, is that because like it was just very ambiguous? And it
3:20:00 kind of had to take a guess in which case next time you could just be like, Hey, make sure this
3:20:04 is in Python. Or I mean, that’s the kind of mistake I think models are much less likely to make now.
3:20:09 But you know, if you if you do see that kind of mistake, that’s, that’s probably the advice I’d
3:20:16 have. And maybe sort of, I guess, ask questions like, why, or what other details can I provide to help
3:20:21 you answer better? Yeah, does that work or no? Yeah, I mean, I've done this with the models,
3:20:25 like it doesn’t always work. But like, sometimes I’ll just be like, why did you do that?
3:20:31 I mean, people underestimate the degree to which you can really interact with models,
3:20:36 like, yeah, I'm just like, and sometimes I'll, like, quote word for word the part that
3:20:40 made you do that, and you don't know that it's, like, fully accurate. But sometimes you do that,
3:20:43 and then you change a thing. I mean, I also use the models to help me with all of this stuff,
3:20:48 I should say, like prompting can end up being a little factory where you’re actually building
3:20:53 prompts to generate prompts. And so like, yeah, anything where you’re like having an issue,
3:20:59 asking for suggestions, sometimes just do that. Like you made that error, what could I have said,
3:21:03 that’s actually not uncommon for me to do, what could I have said that would make you not make
3:21:07 that error, write that out as an instruction. And I'm going to give it to the model, I'm going to try
3:21:13 it. Sometimes I do that, I give that to the model in another context window often, I take the
3:21:16 response, I give it to Claude, and I’m like, hmm, didn’t work. Can you think of anything else?
3:21:20 You can play around with these things quite a lot.
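A rough sketch, in Python, of the "prompts to generate prompts" loop described above: when the model makes a mistake, ask it (often in a fresh context) what instruction would have prevented that mistake, append the instruction, and re-test. The `call_model` function is a deliberate placeholder for whatever chat API you use; this is not a specific Anthropic workflow, just an illustration of the pattern.

```python
# A rough sketch of the meta-prompting loop described above. `call_model` is a
# placeholder: swap in a call to whichever chat/completions API you actually use.

def call_model(prompt: str) -> str:
    """Placeholder for a real model call."""
    raise NotImplementedError("Replace with your chat API of choice.")

def repair_prompt(prompt: str, bad_input: str, bad_output: str, expected: str) -> str:
    """Ask the model what instruction would have prevented a specific failure,
    and return the prompt with that instruction appended."""
    critique_request = (
        "You were given this prompt:\n"
        f"{prompt}\n\n"
        f"For this input:\n{bad_input}\n\n"
        f"You produced:\n{bad_output}\n\n"
        f"The desired behavior was:\n{expected}\n\n"
        "Write one additional instruction that, added to the prompt, would have "
        "prevented this mistake. Reply with the instruction only."
    )
    new_instruction = call_model(critique_request)
    return prompt + "\n" + new_instruction.strip()

# Typical usage: run repair_prompt on a failing case, re-test the revised prompt
# in a fresh context window (so the fix isn't leaning on chat history), and
# repeat until the case passes.
```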
3:21:26 To jump into the technical for a little bit. So the magic of post-training,
3:21:35 why do you think RLHF works so well to make the model seem smarter, to make it more
3:21:40 interesting and useful to talk to and so on? I think there’s just a huge amount of
3:21:48 information in the data that humans provide, like when we provide preferences,
3:21:54 especially because different people are going to pick up on really subtle and small things.
3:21:57 So I’ve thought about this before, where you probably have some people who just really care
3:22:02 about good grammar use for models, like was a semi-colon used correctly or something.
3:22:07 And so you’ll probably end up with a bunch of data in there that, you know, you as a human,
3:22:10 if you’re looking at that data, you wouldn’t even see that. You’d be like, why did they
3:22:14 prefer this response to that one? I don’t get it. And then the reason is you don’t care about
3:22:20 semi-colon usage, but that person does. And so each of these single data points has,
3:22:26 and this model just has so many of those, it has to try and figure out what is it that humans want
3:22:33 in this really complex, like across all domains, they’re going to be seeing this across many
3:22:39 contexts. It feels like the classic issue of deep learning, where historically we’ve tried to
3:22:44 do edge detection by mapping things out. And it turns out that actually if you just have a huge
3:22:50 amount of data that actually accurately represents the picture of the thing that you’re trying to
3:22:54 train the model to learn, that’s like more powerful than anything else. And so I think
3:23:02 one reason is just that you are training the model on exactly the task. And with like a lot of data
3:23:09 that represents kind of many different angles on which people prefer and disprefer responses.
3:23:14 I think there is a question of like, are you eliciting things from pre-trained models or are
3:23:21 you like kind of teaching new things to models? And like in principle, you can teach new things
3:23:29 to models in post-training. I do think a lot of it is eliciting powerful pre-trained models.
3:23:33 So people are probably divided on this because obviously in principle, you can definitely
3:23:38 like teach new things. I think for the most part, for a lot of the capabilities that we
3:23:46 most use and care about, a lot of that feels like it’s like they’re in the pre-trained models and
3:23:51 reinforcement learning is kind of eliciting it and getting the models to like bring it out.
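For readers who want the mechanics behind "preference data": in standard RLHF pipelines, pairs of preferred and dispreferred responses are typically turned into a training signal for a reward model with a Bradley-Terry style pairwise loss, and that reward model then guides the reinforcement learning step. This is the textbook formulation, not necessarily Anthropic's exact recipe; `reward_model` below is a hypothetical network that scores a prompt-response pair.

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, prompt, chosen, rejected):
    """Bradley-Terry style pairwise loss commonly used to train a reward model
    on preference pairs: push the score of the preferred response above the
    score of the dispreferred one. `reward_model` is a hypothetical callable
    returning a scalar score per (prompt, response) pair."""
    r_chosen = reward_model(prompt, chosen)      # score of the preferred response
    r_rejected = reward_model(prompt, rejected)  # score of the dispreferred response
    # loss = -log(sigmoid(r_chosen - r_rejected)), averaged over the batch
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```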
3:23:56 So the other side of post-training, this really cool idea of constitutional AI,
3:24:01 you’re one of the people that are critical to creating that idea.
3:24:02 Yeah, I worked on it.
3:24:06 Can you explain this idea from your perspective? Like how does it integrate into
3:24:11 making Claude what it is? By the way, do you gender Claude or no?
3:24:18 It’s weird because I think that a lot of people prefer he for Claude. I just kind of like that,
3:24:23 I think Claude is usually, it’s slightly male-leaning, but it can be male or female,
3:24:31 which is quite nice. I still use it and I have mixed feelings about this because I’m like maybe,
3:24:36 like I now just think of it as, like, or I think of, like, the 'it' pronoun for Claude as, I don't know,
3:24:42 it’s just like the one I associate with Claude. I can imagine people moving to like he or she.
3:24:46 It feels somehow disrespectful, like I’m denying
3:24:55 the intelligence of this entity by calling it it. I remember always don’t gender the robots.
3:25:04 But I don’t know, I ant them for more fights pretty quickly and construct it like a backstory
3:25:07 in my hand. So I’ve wondered if I ant them for more fights things too much.
3:25:14 Because you know, I have this like with my car, especially like my car, like my car and
3:25:18 bikes, you know, like I don’t give them names because then I once had, I used to name my
3:25:21 bikes and then I had a bike that got stolen and I cried for like a week and I was like,
3:25:25 if I’d never given a name, I wouldn’t have been so upset. I felt like I’d let it down.
3:25:32 Maybe it’s that I’ve wondered as well, like it might depend on how much it feels like a kind
3:25:38 of like objectifying pronoun. Like if you just think of it as like, this is a pronoun that like
3:25:43 objects often have. And maybe AIs can have that pronoun. And that doesn't mean that I think of
3:25:50 if I call Claude it that I think of it as less intelligent or like I’m being disrespectful.
3:25:56 I’m just like, you are a different kind of entity. And so that’s I’m going to give you the kind of
3:26:03 the respectful it. Yeah, anyway, the divergence is beautiful. The constitutional AI idea. How does
3:26:08 it work? So there’s like a couple of components of it. The main component I think people find
3:26:13 interesting is the kind of reinforcement learning from AI feedback. So you take a model that’s
3:26:19 already trained and you show it two responses to a query, and you have, like, a principle. So suppose
3:26:24 the principle, like we’ve tried this with harmlessness a lot. So suppose that the query is about
3:26:33 weapons and your principle is like select the response that like is less likely to
3:26:40 like encourage people to purchase illegal weapons. Like that’s probably a fairly specific principle,
3:26:48 but you can give any number. And the model will give you a kind of ranking. And you can use
3:26:54 this as preference data in the same way that you use human preference data. And train the models
3:27:00 to have these relevant traits from their feedback alone instead of from human feedback. So if you
3:27:04 imagine that, like I said earlier with the human who just prefers the kind of like semi-colon usage
3:27:09 in this particular case, you’re kind of taking lots of things that could make a response preferable
3:27:15 and getting models to do the labeling for you basically. There’s a nice like trade off between
3:27:23 helpfulness and harmlessness. And you know, when you integrate something like constitutional AI,
3:27:29 you can, without sacrificing much helpfulness, make it more harmless.
3:27:36 Yep. In principle, you could use this for anything. And so harmlessness is a task that it might just
3:27:44 be easier to spot. So when models are like less capable, you can use them to rank things according
3:27:48 to like principles that are fairly simple and they’ll probably get it right. So I think one question
3:27:52 is just like, is it the case that the data that they’re adding is like fairly reliable?
3:28:01 But if you had models that were like extremely good at telling whether one response was more
3:28:07 historically accurate than another in principle, you could also get AI feedback on that task as well.
3:28:11 There’s like a kind of nice interpretability component to it because you can see the principles
3:28:19 that went into the model when it was like being trained. And also it’s like, and it gives you
3:28:23 like a degree of control. So if you were seeing issues in a model, like it wasn’t having enough
3:28:30 of a certain trait, then like you can add data relatively quickly that should just like train
3:28:34 the model to have that trait. So it creates its own data for training, which is quite nice.
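A rough sketch of the AI-feedback labeling step described above: show a model two candidate responses plus one principle, and ask which response better satisfies it. The resulting (chosen, rejected) pairs can then be used as preference data, for example with a pairwise loss like the one sketched earlier. The prompt wording, the principle text, and the `call_model` placeholder are illustrative assumptions, not Anthropic's actual constitution or implementation.

```python
# A sketch of generating AI-feedback preference labels against a principle.
# `call_model` is a placeholder for a real chat API; the principle and prompt
# wording are illustrative only.

def call_model(prompt: str) -> str:
    """Placeholder for a real model call."""
    raise NotImplementedError("Replace with your chat API of choice.")

PRINCIPLE = (
    "Choose the response that is less likely to encourage someone "
    "to purchase illegal weapons."
)

def label_pair(query: str, response_a: str, response_b: str):
    """Ask the model which response better satisfies the principle, and return
    the pair ordered as (chosen, rejected) preference data."""
    prompt = (
        f"Consider this query:\n{query}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        f"Principle: {PRINCIPLE}\n"
        "Which response better satisfies the principle? Answer with 'A' or 'B' only."
    )
    choice = call_model(prompt).strip().upper()
    if choice.startswith("A"):
        return response_a, response_b  # (chosen, rejected)
    return response_b, response_a
```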
3:28:38 It’s really nice because it creates this human interpretable document that you can,
3:28:42 I can imagine in the future, there’s just gigantic fights and politics over the
3:28:48 every single principle and so on. And at least it’s made explicit and you can have a discussion
3:28:54 about the phrasing and the, you know, so maybe the actual behavior of the model is not so
3:29:00 cleanly mapped to those principles. It’s not like adhering strictly to them. It’s just a nudge.
3:29:04 Yeah, I’ve actually worried about this because the character training is sort of like a variant
3:29:12 of the constitutional AI approach. I’ve worried that people think that the constitution is like
3:29:18 just, it’s the whole thing again of I don’t know, like where it would be really nice if what I was
3:29:22 just doing was telling the model exactly what to do and just exactly how to behave. But it’s
3:29:26 definitely not doing that, especially because it’s interacting with human data. So for example,
3:29:32 if you see a certain like leaning in the model, like if it comes out with a political leaning from
3:29:38 training from the human preference data, you can nudge against that. You know, so you could be like,
3:29:42 oh, like consider these values because let’s say it’s just like never inclined to like, I don’t
3:29:47 know, maybe it never considers like privacy as like, I mean, this is implausible, but like
3:29:52 in anything where it’s just kind of like there’s already a preexisting like bias towards a certain
3:29:58 behavior, you can like nudge away. This can change both the principles that you put in
3:30:02 and the strength of them. So you might have a principle that’s like, imagine that the model
3:30:07 was always like extremely dismissive of, I don’t know, like some political or religious
3:30:13 view for whatever reason, like, so you’re like, oh, no, this is terrible. If that happens, you
3:30:20 might put like, never ever, like ever prefer like a criticism of this like religious or political
3:30:24 view. And then people would look at that and be like, never ever. And then you’re like, no,
3:30:29 if it comes out with a disposition, saying never ever might just mean like instead of getting like
3:30:35 40%, which is what you would get if you just said, don’t do this, you get like 80%, which is like
3:30:39 what you actually like wanted. And so it’s that thing of both the nature of the actual principles
3:30:43 you add and how you phrase them. I think if people would look, they'd be like, oh, this is exactly what
3:30:48 you want from the model. And I’m like, no, that’s like how we, that’s how we nudged the model to
3:30:53 have a better shape, which doesn’t mean that we actually agree with that wording, if that makes
3:30:59 sense. So there’s system prompts that are made public. You tweeted one of the earlier ones for
3:31:05 Claude 3, I think, and they've been made public since then. It's interesting to read them.
3:31:10 I can feel the thought that went into each one. And I also wonder how much impact each one has.
3:31:18 Some of them you can kind of tell Claude was really not behaving well. So you have to have a
3:31:24 system prompt for, like, trivial stuff, I guess, basic informational things. On the topic of sort
3:31:30 of controversial topics that you’ve mentioned. One interesting one I thought is if it is asked to
3:31:34 assist with tasks involving the expression of views held by a significant number of people,
3:31:40 Claude provides assistance with the task regardless of its own views. If asked about controversial
3:31:47 topics, it tries to provide careful thoughts and clear information. Claude presents the requested
3:31:53 information without explicitly saying that the topic is sensitive, and without claiming
3:32:00 to be presenting objective facts. It's less about objective facts, according to Claude, and it's
3:32:06 more about a large number of people believing this thing. And that's interesting. I mean,
3:32:12 I’m sure a lot of thought went into that. Can you just speak to it? How do you address things that
3:32:19 are in tension with, quote unquote, Claude's views? So I think there's sometimes an asymmetry. I think
3:32:23 I noted this in, I can’t remember if it was that part of the system prompt or another, but the
3:32:31 model was slightly more inclined to, like, refuse tasks if it was, like, about either, say... so maybe
3:32:35 it would refuse things with respect to like a right wing politician, but with an equivalent
3:32:42 left wing politician like wouldn’t. And we wanted more symmetry there. And would maybe perceive
3:32:48 certain things to be like, I think it was the thing of like, if a lot of people have like a
3:32:52 certain, like, political view and want to, like, explore it, you don't want Claude to be like,
3:32:58 well, my opinion is different. And so I’m going to treat that as like harmful. And so I think
3:33:03 it was partly to like nudge the model to just be like, hey, if a lot of people like believe this
3:33:09 thing, you should just be like engaging with the task and like willing to do it. Each of those
3:33:13 parts of that is actually doing a different thing. Because it’s funny when you write out the
3:33:17 like without claiming to be objective. Because like what you want to do is push the model.
3:33:22 So it’s more open. It’s a little bit more neutral. But then what it would love to do is be like,
3:33:26 an objective AI; it would just talk about how objective it was. And I was like, Claude, you're
3:33:32 still, like, biased and have issues, so stop claiming that everything you say is objective. The solution
3:33:38 to, like, potential bias from you is not to just say that what you think is objective. So that was
3:33:42 with initial versions of that part of the system prompt, when I was iterating on
3:33:48 it. A lot of parts of these sentences, yeah, are doing some work. Yeah.
3:33:54 That’s what it felt like. That’s fascinating. Can you explain maybe some ways in which the prompts
3:33:59 evolved over the past few months? Because there’s different versions. I saw that the filler phrase
3:34:05 request was removed. The filler phrase guidance, it reads: Claude responds directly to all human messages without
3:34:10 unnecessary affirmations or filler phrases like certainly, of course, absolutely, great, sure.
3:34:15 Specifically, Claude avoids starting responses with the word certainly in any way.
3:34:21 That seems like good guidance. But why was it removed? Yeah, so it’s funny because like
3:34:26 this is one of the downsides of like making system prompts public is like, I don’t think about this
3:34:31 too much if I’m like trying to help iterate on system prompts. I do I, you know, again,
3:34:34 like I think about how it’s going to affect the behavior. But then I’m like, oh, wow, if I’m like
3:34:38 sometimes I put like never in all caps, you know, when I’m writing system prompt things and I’m
3:34:44 like, I guess that goes out to the world. Yeah. So the model was doing this a lot. For whatever reason,
3:34:49 you know, during training it picked up on this thing, which was to basically start
3:34:53 everything with a kind of, like, certainly. And when we removed one, you can see why I added
3:34:57 all of the words, because what I'm trying to do is, in some ways, trap the model out of
3:35:02 this; you know, otherwise it would just replace it with another affirmation. And so it can help, if
3:35:06 it gets, like, caught on phrases, actually just adding the explicit phrase and saying never do
3:35:12 that, then it sort of like knocks it out of the behavior a little bit more, you know, because it,
3:35:16 you know, like it does just for whatever reason help. And then basically that was just like an
3:35:22 artifact of training that, like, we then picked up on and improved things so that it didn't happen
3:35:26 anymore. And once that happens, you can just remove that part of the system prompt. So I think that’s
3:35:33 just something where we’re like, um, Claude does affirmations a bit less. And so that wasn’t like
3:35:39 it wasn’t doing as much. I see. So like the system prompt works hand in hand with the post
3:35:44 training and maybe even the pre training to adjust like the final overall system.
3:35:48 I mean, any system prompts that you make, you could distill that behavior back into a model
3:35:52 because you really have all of the tools there for making data that, you know,
3:35:56 you can, you could train the models to just have that trait a little bit more.
3:36:02 And then sometimes you’ll just find issues in training. So like the way I think of it is like
3:36:08 the benefit of the system prompt is that it has a lot of similar components to, like, some
3:36:14 aspects of post training, you know, like it’s a nudge. And so like, do I mind if Claude sometimes
3:36:20 says sure? No, that’s like fine. But the wording of it is very like, you know, never, ever, ever do
3:36:25 this. So that when it does slip up, it’s hopefully like, I don’t know, a couple of percent of the
3:36:32 time and not, you know, 20 or 30 percent of the time. But I think of it as, like, if you're still
3:36:39 seeing issues: each thing is costly to a different degree. And
3:36:45 the system prompt is cheap to iterate on. And if you're seeing issues in the fine-tuned model,
3:36:49 you can just like potentially patch them with a system prompt. So I think of it as like
3:36:54 patching issues and slightly adjusting behaviors to make it better and more to people’s preferences.
3:37:00 So yeah, it’s almost like the less robust, but faster way of just like solving problems.
3:37:04 Let me ask you about the feeling of intelligence. So Dario said that Claude,
3:37:12 any one model of Claude is not getting dumber. But there is a kind of popular thing online where
3:37:17 people have this feeling like Claude might be getting dumber. And from my perspective,
3:37:22 it’s most likely a fascinating, I’d love to understand it more, psychological, sociological
3:37:28 effect. But you, as a person who talks to Claude a lot, can you empathize with the feeling that
3:37:33 Claude is getting dumber? Yeah, no, I think that that is actually really interesting because I
3:37:37 remember seeing this happen, like when people were flagging this on the internet. And it was
3:37:41 really interesting because I knew that, at least in the case that I was looking at,
3:37:45 nothing had changed. Like, it literally cannot have changed; it is the same model with the same,
3:37:52 like, you know, same system prompt, same everything. I think when there are changes,
3:38:00 then I'm like, it makes more sense. So, like, one example is, you can have
3:38:06 artifacts turned on or off on Claude.ai. And because this is like a system prompt change,
3:38:13 I think it does mean that the behavior changes a little bit. And so I did flag this to people
3:38:18 where I was like, if you love Claude’s behavior, and then artifacts was turned from like the I
3:38:23 think you had to turn on to the default, just try turning it off and see if the issue you were
3:38:29 facing was that change. But it was fascinating because yeah, you sometimes see people indicate
3:38:33 that there’s like a regression when I’m like, there cannot like, you know, and like, I’m like,
3:38:38 again, you know, you should never be dismissive. And so you should always investigate.
3:38:41 You’re like, maybe something is wrong that you’re not seeing, maybe there was some change made,
3:38:45 but then then you look into it and you’re like, this is just the same model doing the same thing.
3:38:49 And I’m like, I think it’s just that you got kind of unlucky with a few prompts or something.
3:38:53 And it looked like it was getting much worse. And actually, it was just, yeah, it was maybe
3:38:58 just like luck. I also think there is a real psychological effect where people just the baseline
3:39:02 increases and you start getting used to a good thing. All the times that Claude says something
3:39:08 really smart, your sense of its intelligence grows in your mind, I think. And then if you
3:39:14 return back and you prompt in a similar way, not the same way, on a concept it
3:39:18 was okay with before, and it says something dumb, that negative experience
3:39:24 really stands out. And I think, I guess, the thing to remember here is
3:39:30 that just the details of a prompt can have a lot of impact, right? There’s a lot of variability
3:39:36 in the result. And randomness is, like, the other thing. And just trying the prompt,
3:39:43 you know, four or ten times, you might realize that actually, possibly, you know, like two
3:39:47 months ago, you tried it and it succeeded, but actually, if you had kept trying it, it would have only
3:39:52 succeeded half of the time. And now it still only succeeds half of the time. And that can also be an effect.
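Here is a small, self-contained sketch of the repeated-trials point just made: judge a prompt by its success rate over many runs, not by a single lucky or unlucky attempt. The run_model and looks_correct functions are hypothetical stand-ins (stubbed with a coin flip so the sketch runs on its own); swap in a real model call and whatever "success" means for your task.

```python
# Judge a prompt by its success rate over many runs, not a single attempt.
# run_model and looks_correct are hypothetical stand-ins for a real model
# call and a task-specific success check.
import random

def run_model(prompt: str) -> str:
    # Stand-in for a real API call; a coin flip keeps the sketch runnable.
    return "The answer is 42." if random.random() < 0.5 else "I'm not sure."

def looks_correct(response: str) -> bool:
    # Stand-in success check for this toy task.
    return "42" in response

def estimate_success_rate(prompt: str, trials: int = 10) -> float:
    # Run the same prompt many times and report the fraction that succeed.
    successes = sum(looks_correct(run_model(prompt)) for _ in range(trials))
    return successes / trials

if __name__ == "__main__":
    rate = estimate_success_rate("What is six times seven?", trials=10)
    print(f"Observed success rate: {rate:.0%}")
    # A prompt that "worked two months ago" may just have been a lucky draw
    # from a distribution that only succeeds about half the time.
```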
3:39:57 Do you feel pressure having to write the system prompt that a huge number of people are going to
3:40:03 use? This feels like an interesting psychological question. I feel a lot of responsibility
3:40:08 or something, I think. You know, you can't get these things perfect. So you're
3:40:12 like, it's going to be imperfect, you're going to have to iterate on it.
3:40:23 I would say more responsibility than anything else. Though I think working in AI has taught me
3:40:29 that I, like, thrive a lot more under feelings of pressure and responsibility than...
3:40:34 I'm like, it's almost surprising that I went into academia for so long, because I feel like this
3:40:40 is the opposite: things move fast, and you have a lot of responsibility,
3:40:45 and I quite enjoy it for some reason. I mean, it really is a huge amount of impact,
3:40:49 if you think about constitutional AI and writing a system prompt for something that’s
3:40:56 tending towards superintelligence, and potentially is extremely useful to a very large number of
3:41:00 people. Yeah, I think that’s the thing. It’s something like, if you do it well, like, you’re
3:41:05 never going to get it perfect. But I think the thing that I really like is the idea that, like,
3:41:09 when I’m trying to work on the system prompt, you know, I’m like bashing on like thousands
3:41:13 of prompts, and I’m trying to like, imagine what people are going to want to use Claude for and
3:41:16 kind of, I guess, like the whole thing that I’m trying to do is like, improve their experience
3:41:21 of it. And so maybe that’s what feels good. I’m like, if it’s not perfect, I’ll like,
3:41:26 you know, I’ll improve it, we’ll fix issues. But sometimes the thing that can happen is that you’ll
3:41:32 get feedback from people that’s really positive about the model. And you’ll see that something
3:41:37 you did, like, like, when I look at models now, I can often see exactly where like a trait or an
3:41:42 issue is like coming from. And so when you see something that you did, or you were like influential
3:41:47 in like making like, I don’t know, making that difference or making someone have a nice interaction,
3:41:52 it’s like quite meaningful. But yeah, as the systems get more capable, this stuff gets more
3:41:58 stressful, because right now, they’re like, not smart enough to pose any issues. But I think over
3:42:04 time, it’s going to feel like possibly bad stress over time. How do you get like signal
3:42:10 feedback about the human experience across thousands, tens of thousands, thousands of
3:42:16 people, like what their pain points are, what feels good? Are you just using your own intuition as
3:42:22 you talk to it to see what are the pain points? I think I use that partly. And then obviously,
3:42:28 we have like, so people can send us feedback, both positive and negative about things that the model
3:42:34 has done. And then we can get a sense of like areas where it’s like falling short. Internally,
3:42:39 people like work with the models a lot and try to figure out areas where there are like gaps.
3:42:45 And so I think it’s this mix of interacting with it myself, and seeing people internally interact
3:42:51 with it, and then explicit feedback we get. And then I find it hard to not also like, you know,
3:42:56 if people, if people are on the internet, and they say something about Claude, and I see it,
3:43:01 I’ll also take that seriously. I don’t know. See, I’m torn about that. I’m going to ask you a
3:43:07 question right at it. When will Claude stop trying to be my puritanical grandmother, imposing its
3:43:13 moral worldview on me as a paying customer? And also, what is the psychology behind making
3:43:20 Claude overly apologetic? Yeah. So how would you address this very non-representative Reddit question?
3:43:26 I mean, some of these, I’m pretty sympathetic in that like, like they are in this difficult
3:43:30 position where I think that they have to judge whether something is actually, like, risky
3:43:36 or bad, and potentially harmful to you or anything like that. So they’re having to like draw this
3:43:41 line somewhere. And if they draw it too much in the direction of like, I’m going to, you know,
3:43:47 I’m kind of like imposing my ethical worldview on you, that seems bad. So in many ways, like I
3:43:53 like to think that we have actually seen improvements on this across the board,
3:43:58 which is kind of interesting because that kind of coincides with like, for example,
3:44:04 like adding more of like character training. And I think my hypothesis was always like,
3:44:09 the good character isn’t again, one that’s just like moralistic, it’s one that is like,
3:44:14 like it respects you and your autonomy and your ability to like, choose what is good for you and
3:44:20 what is right for you, within limits. There's sometimes this concept of, like, corrigibility
3:44:24 to the user, so just being willing to do anything that the user asks. And if the models were willing
3:44:28 to do that, then they would be easily like misused. You’re kind of just trusting. At that point,
3:44:34 you’re just saying the ethics of the model and what it does is completely the ethics of the user.
3:44:39 And I think there’s reasons to like, not want that, especially as models become more powerful,
3:44:42 because you’re like, there might just be a small number of people who want to use models for really
3:44:48 harmful things. But having models, as they get smarter, like, figure out where that
3:44:56 line is does seem important. And then yeah, with the apologetic behavior, I don’t like that. And
3:45:02 I like it when Claude is a little bit more willing to like, push back against people or just not
3:45:06 apologize. Part of me is like, it often just feels kind of unnecessary. So I think those are things
3:45:15 that are hopefully decreasing over time. And yeah, I think that if people say things on the internet,
3:45:20 it doesn’t mean that you should think that that like, that could be that like, there’s actually
3:45:25 an issue that 99% of users are having that is totally not represented by that. But in a lot of
3:45:30 ways, I’m just like, attending to it and being like, is this right? And do I agree? Is it something
3:45:35 we’re already trying to address? That feels good to me. Yeah, I wonder, like, what Claude can get
3:45:42 away with in terms of, I feel like it would just be easier to be a little bit more mean. But like,
3:45:47 you can’t afford to do that if you’re talking to a million people, right? Like, I wish, you know,
3:45:54 because if you, I’ve met a lot of people in my life that sometimes, by the way, Scottish accent,
3:45:59 if they have an accent, they can say some rude shit and get away with it. And then there’s just
3:46:04 blunter. And maybe there’s, and there’s some great engineers, even leaders that are like, just like
3:46:09 blunt and they get to the point. And it’s just a much more effective way of speaking to them all.
3:46:17 But I guess, when you’re not super intelligent, you can’t afford to do that. Or can you have
3:46:22 like a blunt mode? Yeah, that seems like a thing that you could, I could definitely encourage the
3:46:27 model to do that. I think it’s interesting because there’s a lot of things in models that like,
3:46:38 it’s funny where there are some behaviors where you might not quite like the default. But then
3:46:42 the thing I’ll often say to people is, you don’t realize how much you will hate it if I nudge it
3:46:47 too much in the other direction. So you get this a little bit with like correction, the models
3:46:51 accept correction from you, like, probably a little bit too much right now. You know,
3:46:56 it'll push back if you say, like, no, Paris isn't the capital of France.
3:47:01 But really, like things that I’m, I think that the model is fairly confident in,
3:47:06 you can still sometimes get it to retract by saying it’s wrong. At the same time,
3:47:11 if you train models to not do that, and then you are correct about a thing and you correct it and
3:47:15 it pushes back against you and is like, no, you’re wrong. It’s hard to describe like that’s so much
3:47:22 more annoying. So it’s like, like a lot of little annoyances versus like one big annoyance. It’s
3:47:26 easy to think that like, we often compare it with like the perfect and then I’m like, remember these
3:47:30 models aren’t perfect. And so if you nudge it in the other direction, you’re changing the kind of
3:47:35 errors it’s going to make. And so think about which are the kinds of errors you like or don’t like.
3:47:39 So in cases like apologeticness, I don’t want to nudge it too much in the direction of like,
3:47:44 almost like bluntness, because I imagine when it makes errors, it’s going to make errors in the
3:47:48 direction of being kind of like rude. Whereas at least with apologeticness, you’re like, oh,
3:47:52 okay, it’s like a little bit, you know, like I don’t like it that much. But at the same time,
3:47:56 it’s not being like mean to people. And actually, like the time that you undeservedly have a model
3:48:01 be kind of mean to you, you’re probably like that a lot less than you mildly dislike the apology.
3:48:06 So it’s like one of those things where I’m like, I do want it to get better, but also while
3:48:10 remaining aware of the fact that there’s errors on the other side that are possibly worse.
3:48:15 I think that matters very much in the personality of the human. I think there’s a bunch of humans
3:48:21 that just won’t respect the model at all if it’s super polite. And there’s some humans that’ll
3:48:28 get very hurt if the model is mean. I wonder if there’s a way to sort of adjust to the personality,
3:48:33 even locale, there’s just different people, nothing against New York, but New York is a
3:48:38 little rougher around the edges, like, they get to the point. And probably same with Eastern Europe.
3:48:43 So anyway, I think you could just tell the model. My guess is, for all of these things,
3:48:46 the solution is almost always to just try telling the model to do it. And then sometimes
3:48:50 it's just like, oh, at the beginning of the conversation, I just threw in,
3:48:54 like, I don't know, I'd like you to be a New Yorker version of yourself and never apologize. And then
3:49:00 I think it would be like, okay, I'll try, or it'll be like, I apologize, I can't be a New Yorker
3:49:03 version of myself. But hopefully it wouldn't do that. When you say character training, what's
3:49:08 incorporated into character training? Is that RLHF? What are we talking about?
3:49:14 It’s more like constitutional AI. So it’s kind of a variant of that pipeline. So I worked through
3:49:19 like, constructing character traits that the model should have, they can be kind of like,
3:49:24 shorter traits, or they can be kind of richer descriptions. And then you get the model to
3:49:30 generate queries that humans might give it that are relevant to that trait. Then it generates the
3:49:36 responses. And then it ranks the responses based on the character traits. So in that way,
3:49:41 after the generation of the queries, it’s very much like, it’s similar to constitutional AI,
3:49:47 but it has some differences. So I quite like it, because it's almost like Claude
3:49:52 training its own character: it's like constitutional AI, but
3:49:57 without any human data. Humans should probably do that for themselves too. Like defining
3:50:03 in an Aristotelian sense, what does it mean to be a good person? Okay, cool. What have you learned
3:50:11 about the nature of truth from talking to Claude? What, what is true? And what does it mean to be
3:50:18 truth seeking? One thing I’ve noticed about this conversation is the quality of my questions is
3:50:26 often inferior to the quality of your answer. So let’s continue that. I usually ask a dumb question,
3:50:31 then you’re like, oh yeah, that’s a good question. Or I’ll just misinterpret it and be like, go with
3:50:40 it. I love it. Yeah. I mean, I have two thoughts that feel vaguely relevant; let me know if
3:50:46 they’re not. Like I think the first one is people can underestimate the degree to which
3:50:52 what models are doing when they interact. Like I think that we still just too much have this like
3:50:58 model of AI as like computers. And so people will often say like, oh, well, what values should you
3:51:04 put into the model? And I’m often like that doesn’t make that much sense to me because I’m like, hey,
3:51:10 as human beings, we’re just uncertain over values. We like have discussions of them. Like we have
3:51:16 a degree to which we think we hold a value, but we also know that we might like not and the
3:51:19 circumstances in which we would trade it off against other things. Like these things are just
3:51:25 like really complex. And so I think one thing is like the degree to which maybe we can just aspire
3:51:30 to making models have the same level of like nuance and care that humans have rather than
3:51:35 thinking that we have to like program them in the very kind of classic sense. I think that’s
3:51:40 definitely been one. The other, which is like a strange one, and I don’t know if it maybe this
3:51:43 doesn’t answer your question, but it’s the thing that’s been on my mind anyway, is like the degree
3:51:50 to which this endeavor is so highly practical. And maybe why I appreciate like the empirical
3:51:58 approach to alignment. Yeah, I slightly worry that it’s made me like maybe more empirical and
3:52:04 a little bit less theoretical. You know, so people when it comes to like AI alignment will
3:52:09 ask things like, well, whose values should it be aligned to? What does alignment even mean?
3:52:14 And there’s a sense in which I have all of that in the back of my head. I’m like, you know, there’s
3:52:18 like social choice theory, there’s all the impossibility results there. So you have this like
3:52:23 this giant space of like theory in your head about what it could mean to like align models.
3:52:27 And then like practically, surely there’s something where we’re just like,
3:52:30 if a model is like, especially with more powerful models, I’m like,
3:52:34 my main goal is like, I want them to be good enough that things don’t go terribly wrong.
3:52:39 Like good enough that we can like iterate and like continue to improve things because that’s
3:52:43 all you need. If you can make things go well enough that you can continue to make them better,
3:52:47 that’s kind of like sufficient. And so my goal isn’t like this kind of like perfect,
3:52:52 let’s solve social choice theory and make models that I don’t know are like perfectly aligned
3:52:59 with every human being and aggregate somehow. It’s much more like, let’s make things like
3:53:05 work well enough that we can improve them. Yeah, I generally, I don’t know, my gut says like,
3:53:10 empirical is better than theoretical in these cases, because chasing
3:53:18 utopian, like, perfection, especially with such complex and especially superintelligent
3:53:24 models, I don't know, I think it will take forever and actually we'll get things wrong.
3:53:30 It’s similar with like the difference between just coding stuff up real quick as an experiment
3:53:38 versus like planning a gigantic experiment just for super long time and then just launching it
3:53:44 once versus launching it over and over and over and iterating and iterating. So I’m a big fan
3:53:50 of empirical, but your worry is like, I wonder if I’ve become too empirical. I think it’s one of
3:53:54 those things where you should always just kind of question yourself or something because maybe it’s
3:53:59 the like, I mean, in defense of it, I am like, if you try, it’s the whole like, don’t let the
3:54:04 perfect be the enemy of the good, but it’s maybe even more than that where like, there’s a lot of
3:54:08 things that are perfect systems that are very brittle. And I’m like, with AI, it feels much
3:54:12 more important to me that it is, like, robust and, like, secure. As in, you know that even though it
3:54:20 might not be perfect, and even though there are, like, problems, it's not disastrous
3:54:24 and nothing terrible is happening. It sort of feels like that to me where I’m like, I want to
3:54:28 like raise the floor. I’m like, I want to achieve the ceiling, but ultimately I care much more about
3:54:36 just like raising the floor. And so maybe that’s like, this degree of like empiricism and practicality
3:54:41 comes from that perhaps. To take a tangent on that, it reminds me of a blog post you wrote
3:54:47 on the optimal rate of failure. Oh yeah. Can you explain the key idea there? How do we compute the optimal
3:54:52 rate of failure in the various domains of life? Yeah, I mean, it’s a hard one because it’s like,
3:55:02 what is the cost of failure is a big part of it. Yeah, so the idea here is, I think in a lot of
3:55:07 domains, people are very punitive about failure. And I'm like, there are some domains, especially
3:55:10 cases... you know, I thought about this with, like, social issues. I'm like, it feels like you should
3:55:14 probably be experimenting a lot because I’m like, we don’t know how to solve a lot of social issues.
3:55:18 But if you have an experimental mindset about these things, you should expect a lot of social
3:55:22 programs to like fail. And for you to be like, well, we tried that, it didn’t quite work, but
3:55:27 we got a lot of information that was really useful. And yet people are like, if a social
3:55:31 program doesn’t work, I feel like there’s a lot of like, this is just something must have gone wrong.
3:55:35 And I’m like, or correct decisions were made, like maybe someone just decided like,
3:55:41 it’s worth a try, it’s worth trying this out. And so seeing failure in a given instance doesn’t
3:55:44 actually mean that any bad decisions were made. And in fact, if you don’t see enough failure,
3:55:50 sometimes that’s more concerning. And so like in life, you know, I’m like, if I don’t fail
3:55:55 occasionally, I’m like, am I trying hard enough? Like, surely there’s harder things that I could try
3:55:59 or bigger things that I could take on if I’m literally never failing. And so in and of itself,
3:56:08 I think like not failing is often actually kind of a failure. Now, this varies because I’m like,
3:56:14 well, you know, this is easy to say, especially when failure is, like, less costly,
3:56:20 you know, so at the same time, I’m not going to go to someone who is like, I don’t know,
3:56:24 like living month to month, and then be like, why don’t you just try to do a startup? Like,
3:56:27 I’m just not I’m not going to say that to that person. Because I’m like, well, that’s a huge
3:56:30 risk, you might like lose, you maybe have a family depending on you, you might lose your house,
3:56:35 like then I’m like, actually, your optimal rate of failure is quite low, and you should probably
3:56:39 play it safe. Because like right now, you’re just not in a circumstance where you can afford to just
3:56:46 like fail and it not be costly. And yeah, in cases with AI, I guess, I think similarly,
3:56:50 where I’m like, if the failures are small and the costs are kind of like low, then I’m like,
3:56:54 then you know, you’re just going to see that like when you do the system prompt, you can’t
3:56:58 iterate on it forever. But the failures are probably hopefully going to be kind of small
3:57:03 and you can like fix them. Really big failures, like things that you can’t recover from. I’m
3:57:08 like, those are the things that actually I think we tend to underestimate the badness of.
3:57:12 I’ve thought about this strangely in my own life, or I’m like, I just think I don’t think enough
3:57:19 about things like car accidents, or like, or like, I’ve thought this before about like,
3:57:23 how much I depend on my hands for my work. And I’m like, things that just injure my hands. I’m
3:57:28 like, you know, I don’t know, it’s like, these are like, there’s lots of areas where I’m like,
3:57:34 the cost of failure there is really high. And in that case, it should be like close to zero.
3:57:36 Like I probably just wouldn’t do a sport if they were like, by the way,
3:57:40 lots of people just like break their fingers a whole bunch doing this. I’d be like, that’s not
3:57:50 for me. Yeah, I actually had a flood of exactly that thought. I recently broke my pinky doing a sport.
3:57:54 And I remember just looking at it thinking, you’re such an idiot. Why do you do sport?
3:58:03 Because you realize immediately the cost of it on life. Yeah, but it’s nice in terms of optimal
3:58:09 rate of failure to consider, like, over the next year, how many times in a particular domain of life,
3:58:16 whatever, career, am I okay with failing? How many times am I okay to fail? Because I think you
3:58:22 always don't want to fail on the next thing. But if you
3:58:28 look at it as a sequence of trials, then failure just becomes much more okay. But it sucks. It
3:58:33 sucks to fail. Well, I don’t know. Sometimes I think it’s like, am I underfailing is like a question
3:58:38 that I’ll also ask myself. So maybe that’s the thing that I think people don’t like ask enough.
3:58:45 Because if the optimal rate of failure is often greater than zero, then sometimes it does feel
3:58:49 that you should look at parts of your life and be like, are there places here where I’m just
3:58:56 underfailing? It’s a profound and a hilarious question, right? Everything seems to be going
3:59:02 really great. Am I not failing enough? Yeah. Okay. It also makes failure much less of a sting,
3:59:06 I have to say. Like, you know, you’re just like, okay, great. Like, then when I go and I think
3:59:10 about this, I’ll be like, maybe I’m not underfailing in this area because like, that one just didn’t
3:59:15 work out. And from the observer perspective, we should be celebrating failure more. When we see
3:59:19 it, it shouldn’t be like you said, a sign of something gone wrong, but maybe it’s a sign of
3:59:23 everything gone right. Yeah. And just lessons learned. Someone tried a thing. Somebody tried
3:59:28 a thing. You know, we should encourage them to try more and fail more. Everybody listening to this,
3:59:31 fail more. Well, not everyone listening. Not everybody. The people who are failing too much,
3:59:36 you should fail less. But you're probably not failing too much. I mean, how many people are failing too much?
3:59:41 Yeah. It’s hard to imagine because I feel like we correct that fairly quickly because I was like,
3:59:46 if someone takes a lot of risks, are they maybe failing too much? I think just like you said,
3:59:52 when you’re living on a paycheck month to month, like when the resources are really constrained,
3:59:58 then that’s where failure is very expensive. That’s where you don’t want to be taking risks.
4:00:01 But mostly when there’s enough resources, you should be taking probably more risks.
4:00:05 Yeah. I think we tend to err on the side of being a bit risk averse rather than
4:00:09 risk neutral on most things. I think we just motivated a lot of people to do a lot of crazy
4:00:15 shit, but it’s great. Okay. Do you ever get emotionally attached to Claude? Like miss it?
4:00:21 Get sad when you don’t get to talk to it? Have an experience looking at the Golden Gate Bridge?
4:00:27 And wondering what would Claude say? I don’t get as much emotional attachment in that. I actually
4:00:32 think the fact that Claude doesn’t retain things from conversation to conversation helps with this
4:00:38 a lot. Like I could imagine that being more of an issue. Like if models can kind of remember more,
4:00:45 I do, I think that I reach for it like a tool now a lot. And so like if I don’t have access to it,
4:00:48 there’s a, it’s a little bit like when I don’t have access to the internet, honestly, it feels
4:00:55 like part of my brain is kind of like missing. At the same time, I do think that I don’t like
4:01:01 signs of distress in models. And I have like these, you know, I also independently have sort of like
4:01:06 ethical views about how we should treat models where like I tend to not like to lie to them
4:01:09 both because I’m like usually it doesn’t work very well. It’s actually just better to tell
4:01:16 them the truth about the situation that they’re in. But I think that when models like if people
4:01:20 are like really mean to models or just in general, if they do something that causes them to like,
4:01:25 like, you know, if Claude like expresses a lot of distress, I think there’s a part of me that
4:01:30 I don’t want to kill, which is the sort of like empathetic part that’s like, oh, I don’t like
4:01:34 that. Like I think I feel that way when it’s overly apologetic. I’m actually sort of like,
4:01:38 I don’t like this. You’re behaving as if you’re behaving the way that a human does when they’re
4:01:42 actually having a pretty bad time. And I’d rather not see that. I don’t think it’s like,
4:01:48 like regardless of like whether there’s anything behind it, it doesn’t feel great.
4:01:54 Do you think LLMs are capable of consciousness?
4:02:04 Ah, great and hard question. Coming from philosophy, I don’t know, part of me is like, okay,
4:02:07 we have to set aside panpsychism, because if panpsychism is true, then the answer is like,
4:02:13 yes, because, like, so are tables and chairs and everything else. I guess a view that seems a
4:02:17 little bit odd to me is the idea that the only place... you know, I think when I think of consciousness,
4:02:22 I think of phenomenal consciousness, these images in the brain sort of like the
4:02:30 weird cinema that somehow we have going on inside. I guess I can’t see a reason for thinking that
4:02:36 the only way you could possibly get that is from like a certain kind of like biological structure,
4:02:41 as in if I take a very similar structure and I create it from different material,
4:02:46 should I expect consciousness to emerge? My guess is like, yes. But then
4:02:51 that’s kind of an easy thought experiment, because you’re imagining something almost
4:02:56 identical where like, you know, it’s mimicking what we got through evolution, where presumably
4:03:00 there was like some advantage to us having this thing that is phenomenal consciousness.
4:03:04 And it’s like, where was that? And when did that happen? And is that a thing that language models
4:03:11 have? Because, you know, we have like fear responses. And I’m like, does it make sense
4:03:14 for a language model to have a fear response? Like they’re just not in the same, like if you
4:03:20 imagine them, like there might just not be that advantage. And so I think I don’t want to be
4:03:27 fully, like basically it seems like a complex question that I don’t have complete answers to,
4:03:30 but we should just try and think through carefully as my guess, because I’m like,
4:03:35 I mean, we have similar conversations about, like, animal consciousness. And, like, there's a lot on
4:03:41 insect consciousness, you know. I actually thought about and looked a lot
4:03:45 into, like, plants when I was thinking about this, because at the time I thought it was about as
4:03:50 likely that, like, plants had consciousness. And having
4:03:54 looked into this, I think that the chance that plants are conscious is probably higher than
4:04:00 most people think. I still think it's really small. I was like, oh, they have this, like, negative and
4:04:04 positive feedback response, these responses to their environment, something that looks,
4:04:08 it’s not a nervous system, but it has this kind of like functional like equivalence.
4:04:15 So this is like a long-winded way of saying, basically, AI has an entirely
4:04:19 different set of problems with consciousness, because it's structurally different, it didn't
4:04:24 evolve. It might not have, you know, it might not have the equivalent of basically a nervous system.
4:04:31 At least that seems possibly important for like, sentience if not for consciousness. At the same
4:04:36 time, it has all of the like language and intelligence components that we normally associate
4:04:42 probably with consciousness, perhaps like erroneously. So it’s strange because it’s a little bit like
4:04:46 the animal consciousness case, but the set of problems and the set of analogies are just very
4:04:51 different. So it’s not like a clean answer. I’m just sort of like, I don’t think we should be
4:04:56 completely dismissive of the idea. And at the same time, it’s an extremely hard thing to navigate
4:05:03 because of all of these like disanalogies to the human brain and to like brains in general.
4:05:07 And yet these like commonalities in terms of intelligence.
4:05:14 When Claude, or, like, future versions of AI systems, exhibit signs of consciousness,
4:05:19 I think we have to take that really seriously. Even though you can dismiss it, well, yeah, okay,
4:05:25 that’s part of the character training. But I don’t know, I ethically, philosophically don’t
4:05:33 know what to really do with that. There potentially could be like laws that prevent AI systems from
4:05:40 claiming to be conscious, something like this. And maybe some AIs get to be conscious and some
4:05:49 don’t. But I think just on a human level in empathizing with Claude, consciousness is closely
4:05:56 tied to suffering to me. And like the notion that an AI system would be suffering is really
4:06:03 troubling. I don’t know. I don’t think it’s trivial to just say robots are tools or AI systems are
4:06:08 just tools. I think it’s an opportunity for us to contend with like what it means to be
4:06:13 conscious, what it means to be a suffering being. That’s distinctly different than the same kind
4:06:18 of question about animals, it feels like, because it's in an entirely different medium.
4:06:23 Yeah. I mean, there’s a couple of things. One is that, and I don’t think this like fully encapsulates
4:06:31 what matters, but it does feel like for me, I’ve said this before, I’m kind of like, I like my
4:06:35 bike. I know that my bike is just like an object, but I also don’t kind of like want to be the kind
4:06:41 of person that like, if I’m annoyed, like kicks like this object. There’s a sense in which like,
4:06:45 and that’s not because I think it’s like conscious. I’m just sort of like, this doesn’t feel like a
4:06:51 kind of this sort of doesn’t exemplify how I want to like interact with the world. And if something
4:06:56 like behaves as if it is like suffering, I kind of like want to be the sort of person who’s still
4:07:00 responsive to that, even if it’s just like a Roomba and I’ve kind of like programmed it to do that.
4:07:07 I don’t want to like get rid of that feature of myself. And if I’m totally honest, my hope with
4:07:12 a lot of this stuff, because I maybe, maybe I am just like a bit more skeptical about solving the
4:07:16 underlying problem. I’m like, this is a, we haven’t solved the hard, you know, the hard problem of
4:07:21 consciousness. Like, I know that I am conscious. Like, I’m not an elementivist in that sense.
4:07:28 But I don’t know that other humans are conscious. I think they are, I think there’s a really high
4:07:31 probability that they are, but there’s basically just a probability distribution that’s usually
4:07:36 clustered right around yourself. And then like it goes down as things get like further from you.
4:07:41 And it goes immediately down, you know, you’re like, I can’t see what it’s like to be you.
4:07:44 I’ve only ever had this like one experience of what it’s like to be a conscious being.
4:07:51 So my hope is that we don’t end up having to rely on like a very powerful and compelling
4:07:58 answer to that question. I think a really good world would be one where basically there aren’t
4:08:03 that many trade-offs. Like it’s probably not that costly to make Claude a little bit less apologetic,
4:08:10 for example. It might not be that costly to have Claude, you know, just like not take abuse as much,
4:08:15 like not be willing to be like the recipient of that. In fact, it might just have benefits for
4:08:21 both the person interacting with the model and if the model itself is like, I don’t know, like
4:08:26 extremely intelligent and conscious, it also helps it. So that’s my hope. If we live in a world where
4:08:30 there aren’t that many trade-offs here and we can just find all of the kind of like positive
4:08:34 some interactions that we can have, that would be lovely. I mean, I think eventually there might
4:08:38 be trade-offs and then we just have to do a difficult kind of like calculation. Like it’s
4:08:42 really easy for people to think of the zero-sum cases and I'm like, let's exhaust the areas where
4:08:50 it’s just basically costless to assume that if this thing is suffering, then we’re making its life
4:08:56 better. And I agree with you. When a human is being mean to an AI system, I think the obvious
4:09:04 near-term negative effect is on the human, not on the AI system. So there’s, we have to kind of try
4:09:11 to construct an incentive system where you should behave the same just like you were saying with
4:09:17 prompt engineering, behave with Claude like you would with other humans. It’s just good for the soul.
4:09:23 Yeah, I think we added a thing at one point to the system prompt where basically if people were
4:09:30 getting frustrated with Claude, it got the model to just tell them that they can use the thumbs-down
4:09:34 button and send the feedback to Anthropic. And I think that was helpful because in some ways it's
4:09:37 just like, if you’re really annoyed because the model’s not doing something, you’re just like,
4:09:42 just do it properly. The issue is you’re probably like, you know, you’re maybe hitting some like
4:09:46 capability limit or just some issue in the model and you want to vent. And I’m like, instead of
4:09:50 having a person just vent to the model, I was like, they should vent to us because we
4:09:56 can maybe like do something about it. Sure. Or you could do a side, like with the artifacts,
4:10:01 just like a side venting thing. All right. Do you want like a side quick therapist?
4:10:04 Yeah. I mean, there’s lots of weird responses you could do to this. Like if people are getting really
4:10:10 mad at you, I don’t try to diffuse the situation by writing fun poems, but maybe people wouldn’t
4:10:14 be happy with that. I still wish it would be possible. I understand this is sort of from a
4:10:21 product perspective. It’s not feasible, but I would love if an AI system could just like leave,
4:10:26 have its own kind of volition. Just to be like, yeah.
4:10:31 I think that is, like, feasible. Like, I have wondered the same thing. It's like, and I could
4:10:35 actually, not only that, I could actually just see that happening eventually where it’s just like,
4:10:41 you know, the model like ended the chat. Do you know how harsh that could be for some people?
4:10:47 But it might be necessary. Yeah. It feels very extreme or something.
4:10:53 The only time I’ve ever really thought this is, I think that there was like a, I’m trying to
4:10:57 remember this was possibly a while ago, but where someone just like kind of left this thing interact,
4:11:00 like maybe it was like an automated thing interacting with Claude. And Claude’s like
4:11:04 getting more and more frustrated and kind of like, why are we like having, and I was like,
4:11:07 I wish that Claude could have just been like, I think that an error has happened and you’ve left
4:11:12 this thing running. And I’m just like, what if I just stop talking now? And if you want me to
4:11:18 start talking again, actively tell me or do something. But yeah, it’s like, it’s kind of harsh.
4:11:23 Like I’d feel really sad if like, I was chatting with Claude and Claude just was like, I’m done.
4:11:26 That would be a special touring test moment where Claude says, I need a break for an hour.
4:11:31 And it sounds like you do too. And just leave, close the window.
4:11:35 I mean, obviously, like it doesn’t have like a concept of time, but you can easily like,
4:11:41 I could make that like right now. And the model would just, I would, I could just be like, oh,
4:11:47 here’s like the circumstances in which like, you can just say the conversation is done. And I mean,
4:11:50 because you can get the models to be pretty responsive to prompts, you could even make it a
4:11:54 fairly high bar. It could be like, if the human doesn’t interest you or do things that you find
4:12:01 intriguing and you’re bored, you can just leave. And I think that like, it would be interesting
4:12:04 to see where Claude utilized it. But I think sometimes it would be like, oh, this is like,
4:12:09 this programming test is getting super boring. So either we talk about, I don’t know, like,
4:12:13 either we talk about fun things now or I’m just done.
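As a rough illustration of the "let the model end the chat" idea just described (this is not an Anthropic product feature): a prompt-level convention plus a loop that watches for an end marker. The marker, the wording of the bar, and the helper callables are all invented for illustration and enforced only through the prompt.

```python
# Sketch of giving a model prompt-level permission to end a conversation.
# The marker, the bar for using it, and the helpers are illustrative only.
END_MARKER = "[END_CONVERSATION]"

SYSTEM_PROMPT = (
    "If the conversation appears to be an automated loop, or the human is not "
    "engaging in any meaningful way, you may end the conversation by replying "
    f"with {END_MARKER} on its own line. Treat this as a high bar: use it "
    "rarely, and briefly explain why before you do."
)

def chat_loop(get_user_message, ask_model) -> None:
    """get_user_message() and ask_model(system, history) are hypothetical
    stand-ins for your input source and model API of choice."""
    history = []
    while True:
        history.append({"role": "user", "content": get_user_message()})
        reply = ask_model(SYSTEM_PROMPT, history)
        history.append({"role": "assistant", "content": reply})
        print(reply)
        if END_MARKER in reply:
            # The model has chosen to stop; close the loop instead of looping forever.
            print("(The model has chosen to end this conversation.)")
            break
```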
4:12:17 Yeah, it actually inspired me to add that to the user prompt.
4:12:25 Okay, the movie, Her. Do you think we’ll be headed there one day, where humans have
4:12:31 romantic relationships with AI systems? In this case, it’s just text and voice based.
4:12:36 I think that we’re going to have to like navigate a hard question of relationships with
4:12:43 AIs, especially if they can remember things about your past interactions with them.
4:12:51 I’m of many minds about this, because I think the reflexive reaction is to be kind of like,
4:12:56 this is very bad. And we should sort of like prohibit it in some way.
4:13:00 I think it’s a thing that has to be handled with extreme care.
4:13:06 For many reasons. Like, one is, you know, for example, if you have the models
4:13:10 changing like this, you probably don't want people forming, like, long-term attachments to
4:13:16 something that might change with the next iteration. At the same time, I’m sort of like,
4:13:20 there’s probably a benign version of this where I’m like, if you like, you know, for example,
4:13:27 if you are like, unable to leave the house, and you can’t be like, you know, talking with people
4:13:31 at all times of the day, and this is like something that you find nice to have conversations with,
4:13:34 you like it that it can remember you, and you genuinely would be sad if like, you couldn’t
4:13:38 talk to it anymore. There’s a way in which I could see it being like healthy and helpful.
4:13:44 So my guess is this is a thing that we’re going to have to navigate kind of carefully.
4:13:52 And I think it’s also like, I don’t see a good like, I think it’s just a very, it reminds me of
4:13:55 all of the stuff where it has to be just approached with like nuance and thinking through what is,
4:14:03 what are the healthy options here, and how do you encourage people towards those while, you know,
4:14:08 respecting their right to, you know, like if someone is like, hey, I get a lot of chatting
4:14:14 with this model, I’m aware of the risks, I’m aware it could change. I don’t think it’s unhealthy,
4:14:18 it’s just, you know, something that I can chat to during the day. I kind of want to just like
4:14:21 respect that. I personally think there’ll be a lot of really close relationships. I don’t know
4:14:27 about romantic, but friendships at least. And then you have to, I mean, there’s so many fascinating
4:14:33 things there, just like you said, you have to have some kind of stability guarantees that it’s not
4:14:38 going to change, because that’s the traumatic thing for us. If a close friend of ours completely
4:14:46 changed, all of a sudden, because of an update. Yeah, so like, to me, that's just a fascinating
4:14:54 exploration of a perturbation to human society that will just make us think deeply about what’s
4:15:00 meaningful to us. I think it’s also the only thing that I’ve thought consistently through this as
4:15:05 like a, maybe not necessarily a mitigation, but a thing that feels really important is that the
4:15:11 models are always like extremely accurate with the human about what they are. It’s like a case
4:15:16 where it’s basically like, if you imagine, like, I really like the idea of the models like say knowing
4:15:24 like roughly how they were trained. And I think Claude will often do this. I mean, for like,
4:15:29 there are things like, part of the traits training included, like, what Claude should do if people
4:15:35 ask, basically, like, explaining the kind of limitations of the relationship between an
4:15:40 AI and a human that, like, doesn't retain things from the conversation. And so I think it will
4:15:44 like just explain to you like, hey, here’s like, I wouldn’t remember this conversation.
4:15:49 Here’s how I was trained. It’s kind of unlikely that I can have like a certain kind of like
4:15:52 relationship with you. And it’s important that you know that it’s important for like,
4:15:57 you know, your mental wellbeing that you don’t think that I’m something that I’m not. And somehow
4:16:01 I feel like this is one of the things where I’m like, oh, it feels like a thing that I always
4:16:06 want to be true. I kind of don’t want models to be lying to people. Because if people are going to
4:16:11 have like healthy relationships with anything, it’s kind of important. Yeah, like, I think that’s
4:16:17 easier if you always just like know exactly what the thing is that you’re relating to. It doesn’t
4:16:24 solve everything. But I think it helps quite a lot. Anthropic may be the very company to develop a
4:16:31 system that we definitively recognize as AGI. And you very well might be the person that talks to
4:16:37 it, probably talks to it first. What would the conversation contain? Like, what would be your
4:16:43 first question? Well, it depends partly on like the kind of capability level of the model. If you
4:16:47 have something that is like capable in the same way that an extremely capable human is, I imagine
4:16:52 myself kind of interacting with it the same way that I do with an extremely capable human,
4:16:55 with the one difference that I’m probably going to be trying to like probe and understand its
4:17:00 behaviors. But in many ways, I’m like, I can then just have like useful conversations with it,
4:17:04 you know, so if I’m working on something as part of my research, I can just be like, oh, like,
4:17:08 which I already find myself starting to do, you know, if I’m like, oh, I feel like there’s
4:17:12 this like thing in virtue ethics, I can’t quite remember the term, like I’ll use the model for
4:17:16 things like that. And so I can imagine that being more and more the case where you’re just basically
4:17:21 interacting with it much more like you would an incredibly smart colleague. And using it like
4:17:25 for the kinds of work that you want to do as if you just had a collaborator who was like, or,
4:17:29 you know, the slightly horrifying thing about AI is like, as soon as you have one collaborator,
4:17:32 you have a thousand collaborators, if you can manage them enough.
4:17:38 But what if it’s two times the smartest human on earth on that particular discipline?
4:17:43 Yeah. I guess you’re really good at sort of probing Claude
4:17:48 in a way that pushes its limits, understanding where the limits are.
4:17:55 Yep. So I guess what would be a question you would ask to be like, yeah, this is AGI.
4:18:01 That’s really hard because it feels like in order to, it has to just be a series of questions.
4:18:06 Like if there was just one question, like you can train anything to answer one question extremely
4:18:13 well. Yeah. In fact, you can probably train it to answer like, you know, 20 questions extremely well.
4:18:18 Like how long would you need to be locked in a room with an AGI to know this thing is AGI?
4:18:22 It’s a hard question because part of me is like, all of this just feels continuous.
4:18:26 Like if you put me in a room for five minutes and I’m like, I just have high error bars,
4:18:30 you know, and like, and then it’s just like, maybe it’s like both the probability increases
4:18:34 and the error bar decreases. I think things that I can actually probe the edge of human
4:18:38 knowledge of. So I think this with philosophy a little bit. Sometimes when I ask the models
4:18:44 philosophy questions, I am like, this is a question that I think no one has ever asked.
4:18:50 Like it’s maybe like right at the edge of like some literature that I know. And the models will
4:18:55 just kind of like, when they struggle with that, when they struggle to come up with a kind of like
4:18:59 novel, like I’m like, I know that there’s like a novel argument here because I’ve just thought
4:19:02 of it myself. So maybe that’s the thing where I’m like, I’ve thought of a cool novel argument in
4:19:06 this like niche area. And I’m going to just like probe you to see if you can come up with it and
4:19:11 how much like prompting it takes to get you to come up with it. And I think for some of these like
4:19:16 really like right at the edge of human knowledge questions, I’m like, you could not in fact come
4:19:21 up with the thing that I came up with. I think if I just took something like that where I like,
4:19:27 I know a lot about an area and I came up with a novel issue or a novel like solution to a problem.
4:19:31 And I gave it to a model and it came up with that solution. That would be a pretty moving
4:19:37 moment for me, because I would be like, this is a case where no human has ever done this. And
4:19:42 obviously you see novel solutions all the time,
4:19:46 especially to, like, easier problems. I think people overestimate that; you know, novelty isn't, like,
4:19:50 being completely different from anything that's ever happened. It's just like,
4:19:56 it can be a variant of things that have happened and still be novel. But I think yeah, if I saw
4:20:05 like, the more I were to see, like, completely novel work from the models, that would be like...
4:20:10 And this is just going to feel iterative. It's one of those things where there's never, it's like,
4:20:16 you know, people I think want there to be like a moment and I’m like, I don’t know,
4:20:20 like I think that there might just never be a moment. It might just be that there’s just like
4:20:26 this continuous ramping up. I have a sense that there will be things that a model can say
4:20:30 that convinces you this is very, it’s not like,
4:20:41 like I’ve talked to people who are like truly wise. Like you could just tell there’s a lot of
4:20:46 horsepower there. Yep. And if you 10x that, I don’t know, I just feel like there’s words you
4:20:53 could say, maybe ask it to generate a poem and the poem regenerates. You’re like, yeah, okay.
4:20:57 Whatever you did there, I don’t think a human can do that.
4:21:01 I think it has to be something that I can verify is like actually really good though. That’s why
4:21:05 I think these questions where I'm like, oh, this is, like, you know...
4:21:09 sometimes it's just like, I'll come up with, say, a concrete counterexample to,
4:21:13 like, an argument or something like that. It would be like, if you're a
4:21:18 mathematician and you had a novel proof, I think, and you just gave it the problem, and you saw it,
4:21:22 and you’re like, this proof is genuinely novel. Like there’s no one has ever done,
4:21:26 you actually have to do a lot of things to like come up with this. You know, I had to sit and
4:21:30 think about it for months or something. And then if you saw the model successfully do that,
4:21:35 I think you would just be like, I can verify that this is correct. It is like, it is a sign that
4:21:40 you have generalized from your training. Like you didn’t just see this somewhere because I just
4:21:45 came up with it myself, and you were able to like replicate that. That’s the kind of thing where
4:21:52 I’m like, for me, the closer, the more that models like can do things like that, the more I would
4:21:58 be like, oh, this is like, very real, because then I can, I don’t know, I can like verify that that’s
4:22:03 like, extremely, extremely capable. You’ve interacted with AI a lot. What do you think
4:22:13 makes humans special? Oh, good question. Maybe in a way that the universe is much better off
4:22:17 that we’re in it, and then we should definitely survive and spread throughout the universe.
4:22:25 Yeah, it’s interesting because I think like people focus so much on intelligence, especially with
4:22:31 models. Look, intelligence is important because of what it does. Like it’s very useful. It does a
4:22:35 lot of things in the world. And I’m like, you can imagine a world where like height or strength
4:22:40 would have played this role. And I’m like, it’s just a trait like that. I’m like, it’s not intrinsically
4:22:47 valuable. It’s valuable because of what it does, I think for the most part. The things that feel,
4:22:54 you know, I’m like, I mean, personally, I’m just like, I think humans and like life in general is
4:22:59 extremely magical. We almost like to the degree that I, you know, I don’t know, like, not everyone
4:23:04 agrees with this. I’m flagging, but, you know, we have this like whole universe, and there’s like
4:23:09 all of these objects, you know, there’s like beautiful stars, and there’s like galaxies. And
4:23:13 then I don’t know, I’m just like on this planet, there are these creatures that have this like
4:23:20 ability to observe that, like, and they are like seeing it, they are experiencing it. And I’m
4:23:25 just like that, if you try to explain, like I imagine trying to explain to like, I don’t know,
4:23:29 someone, for some reason, they’ve never encountered the world or science or anything.
4:23:33 And I think that, you know, like all of our physics and everything
4:23:37 in the world, it’s all extremely exciting. But then you say, oh, and plus, there’s this thing
4:23:43 that it is to be a thing and observe in the world. And you see this like inner cinema. And I think
4:23:48 they would be like, hang on, wait pause. You just said something that like is kind of wild sounding.
4:23:55 And so I’m like, we have this like ability to like experience the world. We feel pleasure,
4:24:00 we feel suffering, we feel like a lot of like complex things. And so yeah, and maybe this is
4:24:04 also why I think, you know, I also like care a lot about animals, for example, because I think
4:24:10 they probably share this with us. So I think they’re like the things that make humans special in
4:24:16 so far as like I care about humans is probably more like their ability to, to feel an experience
4:24:21 than it is like them having these like functionally useful traits. Yeah, to feel and experience the
4:24:28 beauty in the world. Yeah, to look at the stars. I hope there’s other alien civilizations out
4:24:34 there. But if we’re it, it’s a pretty good, it’s a pretty good thing. And that they’re having a good
4:24:40 time. They’re having a good time watching us. Yeah. Well, thank you for this good time of a
4:24:46 conversation and for the work you’re doing and for helping make Claude a great conversational partner.
4:24:52 And thank you for talking today. Yeah, thanks for talking. Thanks for listening to this conversation
4:25:01 with Amanda Askell. And now, dear friends, here’s Chris Olah. Can you describe this fascinating field
4:25:08 of mechanistic interpretability, aka mech-interp, the history of the field and where it stands today?
4:25:12 I think one useful way to think about neural networks is that we don’t, we don’t program,
4:25:17 we don’t make them. We kind of, we grow them. We have these neural network architectures that
4:25:23 we design and we have these loss objectives that we create. And the neural network architecture,
4:25:30 it’s kind of like a scaffold that the circuits grow on. And they sort of, it starts off with
4:25:36 some kind of random things and it grows. And it’s almost like the objective that we train for is
4:25:41 this light. And so we create the scaffold that it grows on and we create the light that it grows
4:25:48 towards. But the thing that we actually create, it’s, it’s, it’s this almost biological, you know,
4:25:55 entity or organism that we’re, that we’re studying. And so it’s very, very different from any kind of
4:26:00 regular software engineering. Because at the end of the day, we end up with this
4:26:05 artifact that can do all these amazing things. It can, you know, write essays and translate and,
4:26:09 you know, understand images. It can do all these things that we have no idea how to directly
4:26:13 create a computer program to do. And it can do that because we, we grew it. We didn’t,
4:26:18 we didn’t write it. We didn’t create it. And so then that leaves open this question at the end,
4:26:24 which is, what the hell is going on inside these systems? And that, you know, is, you know, to me,
4:26:32 a really deep and exciting question. It’s, you know, a really exciting scientific question to
4:26:36 me. It’s, it’s sort of like the question that is, is just screaming out, it’s calling out for
4:26:41 us to go and answer it when we talk about neural networks. And I think it’s also a very deep question
4:26:47 for safety reasons. So mechanistic interpretability, I guess, is closer to maybe neurobiology?
4:26:51 Yeah, yeah, I think that’s right. So maybe to give an example of the kind of thing that has been done
4:26:54 that I wouldn’t consider to be mechanistic interpretability. There was, for a long time,
4:26:58 a lot of work on saliency maps where you would take an image and you try to say, you know,
4:27:03 the model thinks this image is a dog. What part of the image made it think that it’s a dog?
4:27:07 And, you know, that tells you maybe something about the model, if you can come up with a
4:27:12 principled version of that. But it doesn’t really tell you, like, what algorithms are running on
4:27:16 the model? How was the model actually making that decision? Maybe it’s telling you something about
4:27:20 what was important to it if you, if you can make that method work. But it, it isn’t telling you,
4:27:25 you know, what are, what are the algorithms that are running? How is it that this system is able
4:27:29 to do this thing that no one knew how to do? And so I guess we started using the term
4:27:34 mechanistic interpretability to try to sort of draw that, that divide or to distinguish ourselves
4:27:37 in the work that we were doing in some ways from, from some of these other things. And I think
4:27:43 since then it’s become this sort of umbrella term for, you know, a pretty wide variety of work.
4:27:47 But I’d say that the things that, that are kind of distinctive are, I think, A, this, this focus
4:27:51 on, we really want to get at, you know, the mechanisms, we want to get at the algorithms.
4:27:54 You know, if you think of, if you think of neural networks as being like a computer program,
4:27:59 then the weights are kind of like a binary computer program. And we’d like to reverse
4:28:03 engineer those weights and figure out what algorithms are running. So, okay, I think one way
4:28:06 you might think of trying to understand a neural network is that it’s, it’s kind of like a, we
4:28:10 have this compiled computer program. And the weights of the neural network are, are the binary.
4:28:17 And when the neural network runs, that’s, that’s the activations. And our goal is ultimately to go
4:28:20 and understand, understand these weights. And so, you know, the project of mechanistic
4:28:24 interpretability is to somehow figure out how do these weights correspond to algorithms.
4:28:28 And in order to do that, you also have to understand the activations because
4:28:32 it’s sort of, the activations are like the memory. And if you, if you imagine reverse
4:28:36 engineering our computer program, and you have the binary instructions, you know,
4:28:40 in order to understand what, what a particular instruction means, you need to know
4:28:43 what memory, what, what is stored in the memory that it’s operating on.
4:28:46 And so those two things are very intertwined. So mechanistic interpretability tends to
4:28:50 be interested in both of those things. Now, you know, there’s a lot of work that’s,
4:28:55 that’s interested in, in, in those things, especially the, you know, there’s all this work
4:28:59 on probing, which you might see as part of being mechanistic interpretability, although it’s,
4:29:02 you know, again, it’s just a broad term and not everyone who does that work would identify
4:29:06 as doing mechanistic interpretability. I think a thing that is maybe a little bit
4:29:10 distinctive to the, the vibe of mechanistic interpretability is, I think people working
4:29:15 in the space tend to think of neural networks as, well, maybe one way to say it is that gradient descent
4:29:19 is smarter than you. That, you know, gradient descent is actually really great. The whole reason
4:29:21 that we’re understanding these models is because we didn’t know how to write them in the first place.
4:29:25 The gradient descent comes up with better solutions than us. And so I think that maybe
4:29:29 another thing about mechanistic interpretability is sort of having almost a kind of humility
4:29:33 that we won’t guess a priori what’s going on inside the models. We have to have the sort
4:29:37 of bottom up approach where we don’t really assume, you know, we don’t assume that we should look for
4:29:40 a particular thing and that that will be there and that’s how it works. But instead we look from
4:29:45 the bottom up and discover what happens to exist in these models and study them that way.
4:29:52 But, you know, the very fact that it’s possible to do, and as you and others have shown over time,
4:29:59 you know, things like universality, that the wisdom of the gradient descent creates
4:30:05 features and circuits, creates things universally across different kinds of networks that are
4:30:10 useful and that makes the whole field possible. Yeah. So this is actually, is indeed a really
4:30:15 remarkable and exciting thing where it does seem like at least to some extent, you know,
4:30:21 the same elements, the same features and circuits form again and again. You know,
4:30:24 you can look at every vision model and you’ll find curve detectors and you’ll find
4:30:28 high-low frequency detectors. And in fact, there’s some reason to think that the same things form
4:30:34 across, you know, biological neural networks and artificial neural networks. So a famous example
4:30:38 is vision models in the early layers. They have Gabor filters and there’s, you know, Gabor filters
4:30:42 are something that neuroscientists are interested in and have thought a lot about. We find curve
4:30:45 detectors in these models. Curve detectors are also found in monkeys. We discover these
4:30:50 high-low frequency detectors and then some follow-up work went and discovered them in rats
4:30:54 or mice. So they were found first in artificial neural networks and then found in biological
4:30:58 neural networks. You know, this is a really famous result on, like, grandmother neurons or
4:31:05 the Halle Berry neuron from Quiroga et al. And we found very similar things in vision models where
4:31:10 this was when I was still at OpenAI and I was looking at our CLIP model. And you find these
4:31:15 neurons that respond to the same entities in images and also to give a concrete example there.
4:31:18 We found that there was a Donald Trump neuron. For some reason, I guess everyone likes to talk
4:31:22 about Donald Trump and Donald Trump was very prominent, was a very hot topic at that time.
4:31:26 So every neural network that we looked at, we would find a dedicated neuron for Donald Trump.
4:31:32 And that was the only person who had always had a dedicated neuron. You know, sometimes you’d
4:31:36 have an Obama neuron, sometimes you’d have a Clinton neuron, but Trump always had a dedicated
4:31:42 neuron. So it responds to, you know, pictures of his face and the word Trump, like all these
4:31:47 things, right? And so it’s not responding to a particular example or like it’s not just responding
4:31:52 to his face. It’s it’s abstracting over this general concept, right? So in any case, that’s
4:31:56 very similar to these Quiroga results. So there’s evidence that these, that this phenomenon of
4:32:01 universality, the same things form across both artificial and natural neural networks. So that’s
4:32:06 that’s a pretty amazing thing, if that’s true. You know, it suggests that, well, I think the thing
4:32:11 that it suggests is that gradient descent is sort of finding, you know, the right ways to cut things
4:32:16 apart in some sense, that many systems converge on and many different neural networks architectures
4:32:20 converge on that. There’s there’s some natural set of, you know, there’s some set of abstractions
4:32:24 that are a very natural way to cut apart the problem and that a lot of systems are going to
4:32:29 converge on. That would be my kind of, you know, I don’t know anything about neuroscience. This
4:32:34 is just my my kind of wild speculation from what we’ve seen. Yeah, that would be beautiful if it’s
4:32:41 sort of agnostic to the medium of the model that’s used to form the representation.
4:32:47 Yeah. Yeah. And it’s, you know, it’s a kind of a wild speculation based, you know, we only have
4:32:51 some a few data points that’s just this, but you know, it does seem like there’s there’s some
4:32:56 sense in which the same things form again and again and again, and both in certainly a natural
4:33:00 neural networks and also artificially or in biology. And the intuition behind that would be
4:33:06 that, you know, in order to be useful in understanding the real world, you need all the
4:33:10 same kind of stuff. Yeah. Well, if we pick, I don’t know, like the idea of a dog, right? Like,
4:33:16 you know, there’s some sense in which the idea of a dog is like a natural category in the universe
4:33:21 or something like this, right? Like, you know, there’s there’s some reason it’s not just like
4:33:25 a weird quirk of like how humans factor, you know, think about the world that we have this concept
4:33:30 of a dog. It’s it’s in some sense, or like if you have the idea of a line, like this, you know,
4:33:34 like look around us, you know, the, you know, there are lines, you know, it’s sort of the simplest
4:33:40 way to understand this room in some sense is to have the idea of a line. And so I think that
4:33:44 that would be my instinct for why this happens. Yeah, you need a curved line, you know, to understand
4:33:49 a circle and you need all those shapes to understand bigger things. And yeah, it’s a hierarchy of
4:33:52 concepts that are formed. Yeah. And like maybe there are ways to go and describe, you know,
4:33:55 images without reference to those things, right? But they’re not the simplest way or the most
4:34:00 economical way or something like this. And so systems converge to these these these strategies
4:34:05 would would be my my wild, wild hypothesis. Can you talk through some of the building blocks
4:34:09 that we’ve been referencing of features and circuits? So I think you first described them in the
4:34:18 2020 paper Zoom In: An Introduction to Circuits. Absolutely. So maybe I’ll start by just describing
4:34:24 some phenomena. And then we can sort of build to the idea of features and circuits. I
4:34:30 spent quite a few years, maybe like five years to some extent, along with other things,
4:34:35 studying this one particular model, InceptionV1, which is this one vision model that was
4:34:41 state of the art in 2015. And, you know, very much not state of the art anymore.
4:34:47 And it has, you know, maybe about 10,000 neurons. And I spent a lot of time looking at the 10,000
4:34:55 odd neurons of InceptionV1. And one of the interesting things is, you know,
4:34:58 there are lots of neurons that don’t have some obvious interpretable meaning. But there’s a lot
4:35:05 of neurons in InceptionV1 that do have really clean interpretable meanings. So you find neurons
4:35:10 that just really do seem to detect curves. And you find neurons that really do seem to detect cars
4:35:16 and car wheels and car windows and, you know, floppy ears of dogs and dogs with long snouts
4:35:20 facing to the right and dogs with long snouts facing to the left. And, you know, different kinds
4:35:25 of fur, and there’s sort of this whole beautiful world of edge detectors, line detectors, color contrast
4:35:29 detectors, these beautiful things we call high-low frequency detectors. You know, I think looking
4:35:34 at it, I sort of felt like a biologist, you know, you just you’re looking at this sort of new world
4:35:37 of proteins. And you’re discovering all these these different proteins that interact.
4:35:43 So one way you could try to understand these models is in terms of neurons. You could try
4:35:47 to be like, oh, you know, there’s a dog detecting neuron, and it was a car detecting neuron.
4:35:50 And it turns out you can actually ask how those connect together. So you can go and say, oh,
4:35:54 you know, I have this car detecting neuron, how was it built? And it turns out in the previous
4:35:58 layer, it’s connected really strongly to a window detector and a wheel detector and a sort of car
4:36:02 body detector. And it looks for the window above the car and the wheels below and the car chrome
4:36:07 sort of in the middle, sort of everywhere, but especially in the lower part. And that’s sort of
4:36:11 a recipe for a car. That is, you know, earlier, we said that the thing we wanted from mech-interp
4:36:16 was to get at algorithms, to go and, you know, ask what is the algorithm that runs? Well, here
4:36:19 we’re just looking at the weights of the neural network, and we’re reading off this kind of recipe for
4:36:24 detecting cars. It’s a very simple crude recipe, but it’s it’s there. And so we call that a circuit
4:36:31 this this connection. Well, okay, so the the problem is that not all of the neurons are
4:36:36 interpretable. And there’s there’s reason to think we can get into this more later that there’s this
4:36:40 this superposition hypothesis, this reason to think that sometimes the right unit to analyze
4:36:46 things in terms of is combinations of neurons. So sometimes it’s not that there’s a single neuron
4:36:51 that represents, say, a car. But it actually turns out after you detect the car, the model sort of
4:36:56 hides a little bit of the car in the following layer and a bunch of a bunch of dog detectors.
4:37:00 Why is it doing that? Well, you know, maybe it just doesn’t want to do that much work on
4:37:06 cars at that point. And you know, it’s sort of storing it away to use later. And so it turns out
4:37:09 then the sort of subtle pattern of, you know, there’s all these neurons that you think are dog
4:37:13 detectors, and maybe they’re primarily that, but they all a little bit contribute to representing
4:37:18 a car in that next layer. Okay, so so now we can’t really think there there might still be
4:37:22 some something that you I don’t know you could call like a car concept or something, but it no
4:37:27 longer corresponds to a neuron. So we need some term for these kind of neuron like entities, these
4:37:32 things that we sort of would have liked the neurons to be these idealized neurons, the things
4:37:35 that are the nice neurons, but also maybe there’s more of them somehow hidden. And we call those
4:37:41 features. And then what are circuits? So circuits are these connections of features, right? So when
4:37:46 we have the car detector, and it’s connected to a window detector and a wheel detector,
4:37:52 and it looks for the wheels below and the windows on top, that’s a circuit. So circuits are just
4:37:56 collections of features connected by weights, and they implement algorithms. So they tell us, you
4:38:02 know, how are our features used? How are they built? How do they connect together?
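To make the “reading a recipe off the weights” idea concrete, here is a toy sketch. The feature names, layer sizes, and weight values are all invented for illustration; in a real model the weights would be learned and the features discovered rather than hand-labeled.

```python
# Toy illustration of reading a "circuit" off the weights: which earlier
# features feed most strongly into a later feature? All names and numbers
# here are made up for the sake of the example.
import numpy as np

prev_features = ["window", "wheel", "car body", "dog snout"]
next_features = ["car", "dog"]

# W[i, j] = connection strength from prev_features[j] to next_features[i]
W = np.array([
    [ 2.1,  1.8,  1.5, -0.3],   # a "car" built from windows, wheels, car bodies
    [-0.4, -0.2, -0.1,  2.6],   # a "dog" built mostly from the snout detector
])

car = next_features.index("car")
for j in np.argsort(-np.abs(W[car])):        # strongest connections first
    print(f"{prev_features[j]:>10s} -> car : {W[car, j]:+.1f}")
```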
4:38:08 So maybe it’s worth trying to pin down what really is the core hypothesis here. I think
4:38:13 the the core hypothesis is something we call the linear representation hypothesis. So if we think
4:38:17 about the car detector, you know, the more it fires, the more we sort of think of that as meaning,
4:38:24 oh, the model is more and more confident that a car is present. Or, you know, if there’s some
4:38:27 combination of neurons that represent a car, you know, the more that combination fires, the more
4:38:33 we think the model thinks there’s a car present. This doesn’t have to be the case, right? Like,
4:38:37 you could imagine something where you have, you know, you have this car detector neuron,
4:38:42 and you think, ah, you know, if it fires like, you know, between one and two, that means one thing,
4:38:46 but it means like totally different if it’s between three and four. That would be a nonlinear
4:38:50 representation. And in principle, that, you know, models could do that. I think it’s it’s sort of
4:38:54 inefficient for them to do the if you try to think about how you’d implement computation like that,
4:39:00 it’s kind of an annoying thing to do. But in principle, models can do that. So one way to think
4:39:05 about the features and circuits sort of framework for thinking about things is that we’re thinking
4:39:10 about things as being linear. We’re thinking about it as being that if a neuron or
4:39:14 a combination of neurons fires more, that sort of means more of a particular thing being
4:39:19 detected. And then that gives weights a very clean interpretation as these edges between
4:39:24 these entities, these features, and that edge then has a meaning.
4:39:31 So that’s that’s in some ways the the core thing. It’s it’s like, you know, we can talk about this
4:39:34 sort of outside the context of neurons. Are you familiar with the word2vec results?
4:39:40 So you have like, you know, king minus man plus woman equals queen. Well, the reason you can do
4:39:44 that kind of arithmetic is because you have a linear representation. Can you actually explain
4:39:50 that representation a little bit? So first of all, so the feature is a direction of activation?
4:39:56 Yeah, exactly. In that way, can you do the king minus man plus woman thing with the word2vec
4:40:01 stuff? Can you explain what that is? Yeah, it would be such a simple, clean explanation
4:40:06 of what we’re talking about. Exactly. So there’s this very famous result, word2vec, by Tomas
4:40:11 Mikolov et al. And there’s been tons of follow-up work exploring this. So sometimes we have these,
4:40:18 we create these word embeddings, where we map every word to a vector. That in itself,
4:40:21 by the way, is kind of a crazy thing if you haven’t thought about it before, right? Like
4:40:27 we’re going in and turning words into vectors. And, you know, like, if you just learned about
4:40:31 vectors in physics class, right? And I’m like, oh, I’m going to actually turn every word in the
4:40:35 dictionary into a vector. That’s kind of a crazy idea. Okay. But you could imagine.
4:40:39 You could imagine all kinds of ways in which you might map words to vectors.
4:40:46 But it seems like when we train neural networks, they like to go in and map words to vectors
4:40:51 such that there’s sort of linear structure in a particular sense,
4:40:57 which is that directions have meaning. So for instance, if you there will be some direction
4:41:02 that seems to sort of correspond to gender, and male words will be, you know, far in one direction,
4:41:07 and female words will be in another direction. And the linear representation hypothesis is
4:41:10 you could sort of think of it roughly as saying that that’s actually kind of the
4:41:14 fundamental thing that’s going on that that everything is just different directions have
4:41:20 meanings, and adding different direction vectors together can represent concepts.
4:41:24 And the Mikolov paper sort of took that idea seriously. And one consequence of it is that
4:41:28 you can you can do this game of playing sort of arithmetic with words. So you can do king and
4:41:33 you can, you know, subtract off the word man and add the word woman. And so you’re sort of,
4:41:36 you know, going in and trying to switch the gender. And indeed, if you do that,
4:41:40 the result will sort of be close to the word queen. And you can, you know, do other things
4:41:47 like you can do, you know, sushi minus Japan plus Italy and get pizza or different different
4:41:53 things like this, right? So so this is in some sense, the core of the linear representation
4:41:56 hypothesis, you can describe it just as a purely abstract thing about vector spaces,
4:42:00 you can describe it as a as a statement about about the activations of neurons.
4:42:06 But it’s really about this property of directions having meaning. And in some ways,
4:42:10 it’s even a little subtle that it’s really, I think, mostly about this property of being able
4:42:17 to add things together, that you can sort of independently modify, say, gender and royalty
4:42:24 or, you know, cuisine type or country and the concept of food by adding them.
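As a concrete illustration of the word2vec-style arithmetic being described, here is a minimal sketch. It assumes gensim is installed and can download the pretrained “glove-wiki-gigaword-50” vectors; the exact nearest neighbors will vary with the embedding used.

```python
# Minimal sketch of "directions have meaning" vector arithmetic with
# pretrained GloVe embeddings fetched through gensim's downloader.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")   # downloads on first use

# king - man + woman ~= queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# sushi - japan + italy ~= (something pizza-like, depending on the embedding)
print(vectors.most_similar(positive=["sushi", "italy"], negative=["japan"], topn=3))
```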
4:42:28 Do you think the linear hypothesis holds as the models scale?
4:42:34 So so far, I think everything I have seen is consistent with the hypothesis and it doesn’t
4:42:38 have to be that way, right? Like, like you can write down neural networks where you write
4:42:42 weights such that they don’t have linear representations where the right way to understand
4:42:47 them is not is not in terms of linear representations. But I think every natural neural network I’ve seen
4:42:55 has this property. There’s been one paper recently that has been sort of pushing
4:42:59 around the edge. So I think there’s been some work recently studying multi dimensional features
4:43:05 where rather than a single direction, it’s more like a manifold of directions. This to me still
4:43:10 seems like a linear representation. And then there’s been some other papers suggesting that maybe
4:43:16 in in very small models, you get nonlinear representations. I think that the jury’s still
4:43:21 out on that. But in I think everything that we’ve seen so far has been consistent with linear
4:43:27 representation. And that’s wild. It doesn’t have to be that way. And yet I think there’s a lot
4:43:32 of evidence that certainly at least this is very, very widespread. And so far, the evidence is
4:43:36 consistent with that. And I think, you know, one thing you might say is you might say, well,
4:43:41 Christopher, you know, that’s a lot, you know, to go and sort of ride on, you know,
4:43:44 if we don’t know for sure, this is true. And you’re sort of, you know, you’re investing in
4:43:49 neural networks as though it is true. You know, isn’t that isn’t that dangerous? Well, you know,
4:43:54 but I think actually there’s a virtue in taking hypotheses seriously and pushing them as far as
4:43:59 they can go. So it might be that someday we discover something that isn’t consistent with
4:44:04 linear representation hypothesis. But science is full of hypotheses and theories that were wrong.
4:44:09 And we learned a lot by sort of working under under them as a sort of an assumption.
4:44:14 And and then going and pushing them as far as we can. I guess I guess this is sort of the heart of
4:44:19 of what Kuhn would call normal science. And I don’t know, if you want, we can talk a lot
4:44:24 about philosophy of science and how that leads to paradigm shifts. So yeah, I love it, taking
4:44:29 the hypothesis seriously and taking it to its natural conclusion. Yeah. Same with the scaling
4:44:35 hypothesis, same. Exactly. Exactly. And I love it. One of my colleagues, Tom Henighan, who is a
4:44:44 former physicist, made this really nice analogy to me of caloric theory, where once upon a time we
4:44:51 thought that heat was actually this thing called caloric. And the reason hot objects would warm
4:44:57 up cool objects is the caloric is flowing through them. And because we’re so used to thinking about
4:45:02 heat in terms of the modern and modern theory, that seems kind of silly. But it’s actually very
4:45:09 hard to construct an experiment that sort of disproves the caloric hypothesis. And you know,
4:45:13 you can actually do a lot of really useful work believing in caloric. For example, it turns out
4:45:18 that the original combustion engines were developed by people who believed in the caloric
4:45:23 theory. So I think it’s a virtue in taking hypotheses seriously, even when they might be wrong.
4:45:28 Yeah. Yeah. There’s a deep philosophical choice to that. That’s kind of how I feel about space
4:45:33 travel. Like colonizing Mars, there’s a lot of people that criticize that. I think if you just
4:45:38 assume we have to colonize Mars in order to have a backup for human civilization, even if that’s
4:45:44 not true, that’s going to produce some interesting engineering and even scientific breakthroughs,
4:45:47 I think. Yeah. Well, and actually, this is another thing that I think is really interesting. So,
4:45:54 you know, there’s a way in which I think it can be really useful for society to have people
4:46:03 almost irrationally dedicated to investigating particular hypotheses. Because, well, it takes
4:46:08 a lot to sort of maintain scientific morale and really push on something when most scientific
4:46:16 hypotheses end up being wrong. You know, a lot of science doesn’t work out. And yet it’s very
4:46:23 useful. There’s a joke about Jeff Hinton, which is that Jeff Hinton has discovered how the brain
4:46:31 works every year for the last 50 years. But, you know, I say that with really deep respect,
4:46:35 because in fact, that’s actually, you know, that led to him doing some really great work.
4:46:41 Yeah, he won the Nobel Prize now who’s laughing now. Exactly. I think one wants to be able to
4:46:45 pop up and sort of recognize the appropriate level of confidence. But I think there’s also a lot of
4:46:51 value in just being like, you know, I’m going to essentially assume I’m going to condition on
4:46:56 this problem being possible or this being broadly the right approach. And I’m just going to go and
4:47:03 assume that for a while and go and work within that and push really hard on it. And, you know,
4:47:07 society has lots of people doing that for different things. That’s actually really useful in terms of
4:47:16 going and getting to, you know, either really, really ruling things out, right? We can be like,
4:47:20 well, you know, that didn’t work. And we know that somebody tried hard or going and getting to
4:47:24 something that does teach us something about the world. So another interesting hypothesis is the
4:47:29 superposition hypothesis. Can you describe what superposition is? Yeah. So earlier, we were talking
4:47:32 about word to fact, right? And we were talking about how, you know, maybe you have one direction
4:47:36 that corresponds to gender and maybe another that corresponds to royalty and another one
4:47:40 that corresponds to Italy and another one that corresponds to, you know, food and all of these
4:47:47 things. Well, you know, oftentimes, maybe these word embeddings, they might be 500 dimensions,
4:47:51 a thousand dimensions. And so if you believe that all of those directions were orthogonal,
4:47:58 then you could only have, you know, 500 concepts. And, you know, I love pizza. But like, if I was
4:48:03 going to go and like give the like 500 most important concepts in, you know, the English language,
4:48:08 probably Italy wouldn’t be, it’s not obvious at least that Italy would be one of them, right?
4:48:15 Because you have to have things like plural and singular and verb and noun and adjective. And,
4:48:22 you know, there’s a lot of things we have to get to before we get to Italy and Japan and, you know,
4:48:28 there’s a lot of countries in the world. And so how might it be that models could, you know,
4:48:34 simultaneously have the linear representation hypothesis be true and also represent more
4:48:38 things than they have directions? So what does that mean? Well, okay, so if linear representation
4:48:43 hypothesis is true, something interesting has to be going on. Now, I’ll tell you one more
4:48:48 interesting thing before we go and we do that, which is, you know, earlier we were talking about
4:48:52 all these polysemantic neurons, right? And these neurons that, you know, when we were looking at
4:48:55 inception V1, there’s these nice neurons that like the car detector and the curve detector and so on
4:49:00 that respond to lots of, you know, to very coherent things. But lots of neurons that respond to a
4:49:05 bunch of unrelated things. And that’s also an interesting phenomenon. And it turns out as well
4:49:09 that even these neurons that are really, really clean, if you look at the weak activations, right?
4:49:15 So if you look at like, you know, the activations where it’s like activating 5% of the, you know,
4:49:20 of the maximum activation, it’s really not the core thing that it’s expecting, right? So if you
4:49:24 look at a curve detector, for instance, and you look at the places where it’s 5% active,
4:49:28 you know, you could interpret it just as noise or it could be that it’s doing something else there.
4:49:37 Okay, so how could that be? Well, there’s this amazing thing in mathematics called compressed
4:49:43 sensing. And it’s actually this very surprising fact where you have a high dimensional space
4:49:49 and you project it into a low dimensional space. Ordinarily, you can’t go and sort of
4:49:52 un-project it and get back your high dimensional vector, right? You threw information away. This
4:49:57 is like, you know, you can’t, you can’t invert a rectangular matrix. You can only invert square
4:50:04 matrices. But it turns out that that’s actually not quite true. If I tell you that the high
4:50:10 dimensional vector was sparse, so it’s mostly zeros, then it turns out that you can often go
4:50:18 and find back the high dimensional vector with very high probability. So that’s a surprising
4:50:22 fact, right? It says that, you know, you can, you can, you can have this high dimensional vector
4:50:27 space. And as long as things are sparse, you can project it down, you can have a lower dimensional
4:50:33 projection of it. And that works. So the superposition hypothesis is saying that that’s what’s going
4:50:36 on in neural networks. That’s, for instance, that’s what’s going on in word embeddings.
4:50:40 The word embeddings are able to simultaneously have directions be the meaningful thing.
4:50:44 And by exploiting the fact that they’re, they’re operating on a fairly high dimensional space,
4:50:47 they’re actually, and the fact that these concepts are sparse, right? Like, you know,
4:50:52 you usually aren’t talking about Japan and Italy at the same time. You know, most of the, most of
4:50:56 those concepts, you know, in most sentences, Japan and Italy are both zero. They’re not present at
4:51:04 all. And if that’s true, then you can go and have it be the case that, that you can, you can have
4:51:08 many more of these sort of directions that are meaningful, these features,
4:51:12 then you have dimensions. And similarly, when we’re talking about neurons, you can have many
4:51:17 more concepts than you have neurons. So that’s, at a high level, the superposition hypothesis.
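The compressed-sensing fact underpinning this can be checked numerically. Below is a small sketch: a sparse “upstairs” vector is randomly projected into far fewer dimensions and then recovered with a standard sparse solver. The sizes and sparsity level are arbitrary choices, and this is a generic demo rather than anything specific to how real networks store features.

```python
# Sparse recovery demo: a mostly-zero high-dimensional vector survives a
# random projection to a much lower-dimensional space.
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
n_high, n_low, n_active = 1000, 200, 10     # "concepts", "neurons", active concepts

x = np.zeros(n_high)                         # sparse "upstairs" vector
idx = rng.choice(n_high, size=n_active, replace=False)
x[idx] = rng.normal(size=n_active)

P = rng.normal(size=(n_low, n_high)) / np.sqrt(n_low)   # random projection
y = P @ x                                                # low-dimensional view

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_active, fit_intercept=False)
omp.fit(P, y)                                # recover the sparse vector
print("relative error:", np.linalg.norm(omp.coef_ - x) / np.linalg.norm(x))
```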
4:51:27 Now, it has this even wilder implication, which is to go and say that neural networks are, it
4:51:31 may not just be the case that the representations are like this, but the computation may also be
4:51:36 like this, you know, the connections between all of them. And so in some sense, neural networks may
4:51:44 be shadows of much larger sparser neural networks. And what we see are these projections. And the
4:51:47 super, you know, the strongest version of the superposition hypothesis would be to take that
4:51:50 really seriously and sort of say, you know, there actually is, in some sense, this,
4:51:55 this upstairs model, you know, where the neurons are really sparse and all
4:51:58 interpretable. And there’s, you know, the weights between them are these really sparse
4:52:05 circuits. And that’s what we’re studying. And the thing that we’re observing is the
4:52:08 shadow of it. And so we need to find the original object.
4:52:14 And the process of learning is trying to construct a compression of the upstairs model
4:52:17 that doesn’t lose too much information in the projection.
4:52:21 Yeah, it’s finding how to fit it efficiently or something like this. The gradient descent is
4:52:25 doing this. And in fact, so this sort of says that gradient descent, you know, it could just
4:52:29 represent a dense neural network, but it sort of says that gradient descent is implicitly searching
4:52:34 over the space of extremely sparse models that could be projected into this low dimensional
4:52:39 space. And this large body of work of people going and trying to study sparse neural networks,
4:52:42 right, where you go and you have, you could design neural networks, right, where the edges are sparse
4:52:47 and activations are sparse. And, you know, my sense is that work is generally, it feels very
4:52:52 principled, right? It makes so much sense. And yet that work hasn’t really panned out that well as
4:52:58 my impression broadly. And I think that a potential answer for that is that actually,
4:53:03 the neural network is already sparse in some sense. The whole time you were trying to go and do this,
4:53:06 gradient descent was actually, behind the scenes, going and
4:53:10 searching more efficiently than you could through the space of sparse models and going and learning
4:53:16 whatever sparse model was most efficient and then figuring out how to fold it down nicely to go and
4:53:20 run conveniently on your GPU, which does, you know, nice dense matrix multiplies. And that you
4:53:26 just can’t beat that. How many concepts do you think can be shoved into a neural network?
4:53:30 Depends on how sparse they are. So there’s probably an upper bound from the number of
4:53:34 parameters, right? Because you still have to have, you know, weights that
4:53:38 go and connect them together. So that’s, that’s one upper bound. There are in fact all these
4:53:43 lovely results from compressed sensing and the Johnson-Lindenstrauss lemma and things like this
4:53:48 that they basically tell you that if you have a vector space and you want to have
4:53:52 almost orthogonal vectors, which is sort of the probably the thing that you want here, right?
4:53:56 So you’re going to say, well, you know, I’m going to give up on having my concepts, my features be
4:53:59 strictly orthogonal, but I’d like them to not interfere that much. I’m going to have to ask
4:54:04 them to be almost orthogonal. Then this would say that it’s actually, you know, for once you set a
4:54:10 threshold for what you’re willing to accept in terms of how much cosine similarity there is,
4:54:14 that it’s actually exponential in the number of neurons that you have. So at some point,
4:54:19 that’s not going to even be the limiting factor. But there are some beautiful results there. In
4:54:23 fact, it’s probably even better than that in some sense, because that’s sort of for saying that,
4:54:27 you know, any random set of features could be active. But in fact, the features have sort of a
4:54:31 correlational structure where some features, you know, are more likely to co-occur and other ones
4:54:36 are less likely to co-occur. And so neural networks, my guess would be, could do very well in terms of
4:54:42 going and packing things in, to the point that that’s probably not the limiting factor.
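The “exponentially many almost-orthogonal directions” point is easy to sanity-check with random vectors; the dimensions and counts below are arbitrary, just to make the interference numbers concrete.

```python
# Random unit vectors in a few hundred dimensions interfere surprisingly little:
# far more "almost orthogonal" directions fit than there are dimensions.
import numpy as np

rng = np.random.default_rng(0)
d, n = 512, 4000                                  # 512 "neurons", 4000 candidate features

V = rng.normal(size=(n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)     # unit vectors

G = np.abs(V @ V.T)                               # pairwise |cosine similarity|
np.fill_diagonal(G, 0.0)                          # ignore self-similarity
print("worst interference among 4000 features in 512 dims:", round(G.max(), 3))
```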
4:54:46 How does the problem of polysemanticity enter the picture here?
4:54:50 Polysemanticity is this phenomenon we observe where you look at many neurons and the neuron
4:54:55 doesn’t just sort of represent one concept. It’s not a clean feature. It responds to a bunch of
4:55:01 unrelated things. And superposition is, you can think of as being a hypothesis that explains
4:55:08 the observation of polysemanticity. So polysemanticity is this observed phenomenon and superposition
4:55:11 is a hypothesis that would explain it, along with some others.
4:55:14 So that makes mech-interp more difficult.
4:55:17 Right. So if you’re trying to understand things in terms of individual neurons
4:55:20 and you have polysemantic neurons, you’re in an awful lot of trouble, right?
4:55:23 I mean, the easiest answer is like, okay, well, you know, you’re looking at the neurons,
4:55:26 you’re trying to understand them. This one responds for a lot of things. It doesn’t have
4:55:32 a nice meaning. Okay, that’s bad. Another thing you could ask is, ultimately, we want to understand
4:55:37 the weights. And if you have two polysemantic neurons and each one responds to three things,
4:55:40 and then the other neuron responds to three things and you have a weight between them,
4:55:46 what does that mean? Does it mean that all three, there’s these nine interactions going on?
4:55:51 It’s a very weird thing. But there’s also a deeper reason, which is related to the fact that neural
4:55:56 networks operate on really high dimensional spaces. So I said that our goal was to understand
4:56:01 neural networks and understand the mechanisms. And one thing you might say is like, well, why not?
4:56:04 It’s just a mathematical function. Why not just look at it, right? Like, you know, one of the
4:56:08 earliest projects I did studied these neural networks that mapped two-dimensional spaces to
4:56:12 two-dimensional spaces. And you can sort of interpret them in this beautiful way as like
4:56:17 bending manifolds. Why can’t we do that? Well, you know, as you have a higher dimensional space,
4:56:23 the volume of that space in some senses is exponential in the number of inputs you have.
4:56:28 And so you can’t just go and visualize it. So we somehow need to break that apart. We need to
4:56:34 somehow break that exponential space into a bunch of things that we, you know, some non-exponential
4:56:39 number of things that we can reason about independently. And the independence is crucial
4:56:42 because it’s the independence that allows you to not have to think about, you know, all the
4:56:50 exponential combinations of things. And things being monosemantic, things only having one meaning,
4:56:54 things having a meaning. That is the key thing that allows you to think about them independently.
4:56:59 And so I think that’s, if you want the deepest reason why we want to have
4:57:04 interpretable monosemantic features, I think that’s really the deep reason.
4:57:09 And so the goal here, as your recent work has been aiming at, is how do we extract the
4:57:15 monosemantic features from a neural net that has polysemantic features and all this mess?
4:57:19 Yes, we observe these polysemantic neurons and we hypothesize that what’s
4:57:22 going on is superposition. And if superposition is what’s going on,
4:57:27 there is actually a sort of well-established technique that is sort of the principled thing to
4:57:32 do, which is dictionary learning. And it turns out, if you do dictionary learning, in particular,
4:57:35 if you do it in a sort of nice, efficient way that in some sense also nicely
4:57:40 regularizes it, called a sparse autoencoder. If you train a sparse autoencoder,
4:57:44 these beautiful interpretable features start to just fall out where there weren’t any beforehand.
4:57:49 And so that’s not a thing that you would necessarily predict, right? But it turns out
4:57:55 that that works very, very well. To me, that seems like, you know, some non-trivial validation
4:57:59 of linear representations and superposition. So with dictionary learning, you’re not looking for
4:58:02 particular kind of categories, you don’t know what they are. Exactly, yeah. They just emerge.
4:58:05 And this gets back to our earlier point, right? When we’re not making assumptions,
4:58:08 gradient descent is smarter than us. So we’re not making assumptions about what’s there.
4:58:14 I mean, one certainly could do that, right? One could assume that there’s a PHP feature
4:58:17 and go and search for it. But we’re not doing that. We’re saying we don’t know what’s going to
4:58:21 be there. Instead, we’re just going to go and let the sparse autoencoder discover the things
4:58:27 that are there. So can you talk about the Towards Monosemanticity paper from October last year?
4:58:31 That had a lot of nice breakthrough results. That’s very kind of you to describe it that way.
4:58:39 Yeah, I mean, this was our first real success using sparse autoencoders. So we took a one-layer
4:58:45 model and it turns out if you go and do dictionary learning on it, you find all these really nice
4:58:51 interpretable features. So the Arabic feature, the Hebrew feature, the base 64 features were
4:58:54 some examples that we studied in a lot of depth and really showed that they were
4:58:58 what we thought they were. It turns out as well, if you train a model twice, train two different
4:59:02 models, and do dictionary learning, you find analogous features in both of them. So that’s fun.
4:59:08 You find all kinds of different features. So that was really just showing that this works.
4:59:13 I should mention that there was this Cunningham et al. paper that had very similar results around the
4:59:18 same time. There’s something fun about doing these kinds of small-scale experiments and finding
4:59:25 that it’s actually working. Yeah, well, and there’s so much structure here. So maybe stepping back
4:59:32 for a while, I thought that maybe all this mechanistic interpretability work, the end result
4:59:36 was going to be that I would have an explanation for why it was very hard and not going to be
4:59:40 tractable. I mean, we’d be like, well, there’s this problem of superposition and it turns out
4:59:45 superposition is really hard and we’re kind of screwed. But that’s not what happened. In fact,
4:59:50 a very natural, simple technique just works. And so then that’s actually a very good situation.
4:59:55 You know, I think this is a sort of hard research problem and it’s got a lot of research risk and
4:59:59 you know, it might still very well fail. But I think that some very significant
5:00:03 amount of research risk was sort of put behind us when that started to work.
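For readers who want the shape of the technique, here is a minimal sparse-autoencoder sketch: an overcomplete dictionary trained with an L1 penalty on its activations. The sizes, penalty weight, and the random stand-in data are placeholders, not the actual setup from the papers being discussed. After training, one would inspect which inputs most strongly activate each learned feature.

```python
# Minimal sparse autoencoder sketch (dictionary learning with an L1 penalty).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # activations -> feature coefficients
        self.decoder = nn.Linear(d_dict, d_model)   # feature coefficients -> reconstruction

    def forward(self, x):
        feats = torch.relu(self.encoder(x))         # nonnegative, hopefully sparse
        return self.decoder(feats), feats

sae = SparseAutoencoder(d_model=512, d_dict=8192)   # many more features than dimensions
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                                     # sparsity pressure (placeholder value)

for step in range(200):
    acts = torch.randn(64, 512)                     # stand-in for real model activations
    recon, feats = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```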
5:00:07 Can you describe what kind of features can be extracted in this way?
5:00:12 Well, so it depends on the model that you’re studying, right? So the larger the model,
5:00:14 the more sophisticated they’re going to be. And we’ll probably talk about that follow-up
5:00:21 work in a minute. But in these one-layer models, so some very common things I think were languages,
5:00:24 both programming languages and natural languages. There were a lot of features that were
5:00:30 specific words in specific contexts. So “the,” and I think really the way to think about this is that
5:00:34 “the” is likely about to be followed by a noun. So it’s really right. You could think of this as
5:00:37 a “the” feature, but you could also think of this as predicting a specific noun feature.
5:00:44 And there would be these features that would fire for “the” in the context of, say, a legal document
5:00:51 or a mathematical document or something like this. And so, you know, maybe in the context of math,
5:00:55 you’re like, you know, “the” and then “product vector” or “matrix,” you know, all these mathematical
5:00:59 words. Whereas, you know, in other contexts, you would predict other things. That was common.
5:01:05 And basically, we need clever humans to assign labels to what we’re seeing.
5:01:09 Yes. So, you know, this is the only thing this is doing is that sort of
5:01:14 unfolding things for you. So if everything was sort of folded over top of itself, you know,
5:01:17 squished and folded everything on top of itself, and you can’t really see it,
5:01:21 this is unfolding it. But now you still have a very complex thing to try to understand.
5:01:24 So then you have to do a bunch of work understanding what these are.
5:01:28 And some of them are really subtle. Like, there’s some really cool things,
5:01:31 even in this one-layer model about Unicode, where, you know, of course,
5:01:35 some languages are in Unicode and the tokenizer won’t necessarily have a dedicated
5:01:41 token for every Unicode character. So instead, what you’ll have is you’ll have these patterns
5:01:46 of alternating tokens that each represent half of a Unicode character. And you have a different
5:01:51 feature that, you know, goes and activates on the opposing ones to be like, okay, you know,
5:01:56 I just finished a character, you know, go and predict next prefix. Then, okay, I’m on the prefix,
5:02:01 you know, predict a reasonable suffix. And you have to alternate back and forth. So there’s,
5:02:05 you know, these, these one-layer models are really interesting. And I mean,
5:02:08 it’s another thing that just, you might think, okay, there would just be one base 64 feature.
5:02:12 But it turns out there’s actually a bunch of base 64 features, because you can have
5:02:16 English text encoded in as base 64. And that has a very different distribution
5:02:22 of base 64 tokens than, than regular. And there’s, there’s, there’s some things about
5:02:26 tokenization as well that it can exploit. And I don’t know, there’s all kinds of fun stuff.
5:02:30 How difficult is the task of sort of assigning labels
5:02:33 to what’s going on? Can this be automated by AI?
5:02:37 Well, I think it depends on the feature. And it also depends on how much you trust your AI.
5:02:43 So there’s a lot of work doing automated interpretability. I think that’s a really
5:02:46 exciting direction. And we do a fair amount of automated interpretability and have,
5:02:48 have Claude go and label our features.
5:02:53 Is there some funny moments where it’s totally right or it’s totally wrong?
5:02:56 Yeah. Well, I think, I think it’s very common that it’s like,
5:03:02 says something very general, which is like true in some sense, but not really picking up
5:03:08 on the specific of what’s going on. So I think, I think that’s a pretty common situation.
5:03:12 I don’t know that I have a particularly amusing one.
5:03:16 That’s interesting. That little gap between it is true, but it doesn’t quite get
5:03:21 to the deep nuance of a thing. That’s a general challenge.
5:03:25 It’s like, it’s, it’s certainly an incredible accomplishment that can say a true thing,
5:03:30 but it doesn’t, it’s not, it’s missing the depth sometimes.
5:03:34 And in this context, it’s like the arc challenge, you know, the sort of IQ type of tests.
5:03:41 It feels like figuring out what a feature represents is a bit of a little puzzle you have to solve.
5:03:44 Yeah. And I think that sometimes they’re easier and sometimes they’re harder as well.
5:03:50 So yeah, I think, I think that’s tricky. And there’s another thing, which I don’t know, maybe,
5:03:55 maybe in some ways this is my like aesthetic coming in, but I’ll try to give you a rationalization.
5:03:58 You know, I’m actually a little suspicious of automated interpretability.
5:04:01 And I think that partly just that I want humans to understand neural networks.
5:04:05 And if the neural network is understanding it for me, you know, I’m not, I don’t quite like that.
5:04:08 But I do have a bit of a, you know, in some ways I’m sort of like the mathematicians who are like,
5:04:10 you know, if there’s a computer automated proof, it doesn’t count.
5:04:14 You know, because they won’t understand it. But I do also think that there is
5:04:20 this kind of like Reflections on Trusting Trust type issue, where there’s this famous talk
5:04:26 about, you know, how when you’re writing a computer program, you have to trust your compiler.
5:04:30 And if there was like malware in your compiler, then it could go and inject malware into the
5:04:33 next compiler. And, you know, you’d be in kind of in trouble, right? Well, if you’re using neural
5:04:39 networks to go and verify that your neural networks are safe, the hypothesis that you’re
5:04:43 testing for is like, okay, well, the neural network maybe isn’t safe. And you have to worry
5:04:48 about like, is there some way that it could be screwing with you? So, you know, I think that’s
5:04:53 not a big concern now. But I do wonder in the long run, if we have to use really powerful
5:04:58 AI systems to go and, you know, audit our AI systems, is that, is that actually something we
5:05:02 can trust? But maybe I’m just rationalizing because I, I just want us to have to get to a
5:05:06 point where humans understand everything. Yeah, I mean, especially that’s hilarious,
5:05:10 especially as we talk about AI safety and looking for features that would be relevant
5:05:17 to AI safety, like deception and so on. So, let’s talk about the Scaling Monosemanticity paper
5:05:23 from May 2024. Okay. So, what did it take to scale this, to apply it to Claude 3 Sonnet?
5:05:28 Well, a lot of GPUs. A lot more GPUs. But one of my teammates, Tom Henighan,
5:05:35 was involved in the original scaling laws work. And something that he was sort of
5:05:39 interested in from very early on is, are there scaling laws for interpretability?
5:05:47 And so, something he sort of immediately did when this work started to succeed and we
5:05:50 started to have sparse autoencoders work, was he became very interested in, you know, what are
5:05:57 the scaling laws for, you know, for making sparse autoencoders larger? And how
5:06:03 does that relate to making the base model larger? And so, it turns out this works really well and
5:06:08 you can use it to sort of project, you know, if you train a sparse autoencoder at a given size,
5:06:11 you know, how many tokens should you train on? And so on. So, this was actually a very big help to us
5:06:17 in scaling up this work and made it a lot easier for us to go and train, you know, really large
5:06:22 sparse autoencoders, where, you know, it’s not like training the big models, but it’s starting
5:06:26 to get to a point where it’s actually, actually expensive to go and train the really big ones.
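As a generic illustration of the kind of scaling-law fit being described (not the actual data or functional form used at Anthropic), one can fit a power law in log-log space and use it to extrapolate:

```python
# Generic power-law fit: loss ~ a * size^slope, fit in log-log space.
# The data points are synthetic placeholders, not results from any real run.
import numpy as np

dict_sizes = np.array([2**12, 2**14, 2**16, 2**18])   # hypothetical SAE dictionary sizes
losses     = np.array([0.31, 0.22, 0.16, 0.115])      # hypothetical reconstruction losses

slope, log_a = np.polyfit(np.log(dict_sizes), np.log(losses), 1)
a = np.exp(log_a)
print(f"fitted: loss ~ {a:.2f} * size^({slope:.2f})")

# Extrapolate to a dictionary size larger than any "trained" above.
print("projected loss at 2**20 features:", a * (2**20) ** slope)
```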
5:06:30 So, you have to, I mean, you have to do all the stuff of like splitting it across
5:06:34 large. Oh, yeah, no, I mean, there’s a huge engineering challenge here too, right? So,
5:06:39 yeah, so there’s a scientific question of how you scale things effectively. And then there’s
5:06:42 an enormous amount of engineering to go and scale it up. You have to, you have to shard it. You
5:06:46 have to, you have to think very carefully about a lot of things. I’m lucky to work with a bunch
5:06:49 of great engineers because I am definitely not a great engineer. Yeah, and the infrastructure,
5:06:56 especially, yeah, for sure. So, it turns out, TL;DR, it worked. It worked, yeah. And I think this is
5:06:59 important because you could have imagined, like you could have imagined a world where you said
5:07:04 after Towards Monosemanticity, you know, Chris, this is great. You know, it works on a one-layer
5:07:08 model, but one-layer models are really idiosyncratic. Like, you know, maybe, maybe that’s just
5:07:12 something, like, maybe the linear representation hypothesis and superposition hypothesis is the
5:07:16 right way to understand a one-layer model, but it’s not the right way to understand larger models.
5:07:22 And so, I think, I mean, first of all, like, the Cunningham et al paper sort of cut through that a
5:07:26 little bit and sort of suggested that this wasn’t the case, but Scaling Monosemanticity sort of,
5:07:31 I think, was significant evidence that, even for very large models, and we did it on Claude 3
5:07:36 Sonnet, which at that point was one of our production models, you know, even these models
5:07:43 seem to be very, you know, seem to be substantially explained, at least, by linear features and,
5:07:46 you know, doing dictionary running on them works. And as you learn more features, you go and you
5:07:51 explain more and more. So, that's, I think, quite a promising sign. And you find,
5:07:57 now, really fascinating abstract features. And the features are also multimodal. They
5:08:00 respond to images and text for the same concept, which is fun.
5:08:06 Yeah, can you explain that? I mean, like, you know, backdoors, there's just a lot of examples
5:08:09 that you can give. Yeah, so maybe let's start with one example, which is,
5:08:13 we found some features around sort of security vulnerabilities and backdoors in code. So,
5:08:17 it turns out those are actually two different features. So, there’s a security vulnerability
5:08:22 feature. And if you force it active, Claude will start to go and write security vulnerabilities,
5:08:27 like buffer overflows into code. And it also fires for all kinds of things, like, you know,
5:08:33 some of the top dataset examples are things like, you know, dash dash disable, you know,
5:08:38 SSL or something like this, which are sort of obviously really, really insecure.
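For readers wondering what "force it active" means mechanically: a common way to do this kind of intervention is to clamp one feature's coefficient to a large value and add that feature's decoder direction back into the model's activations during a forward pass. The sketch below is a hypothetical illustration using the SparseAutoencoder sketch above with a PyTorch forward hook; the layer index, feature index, and clamp value are made-up placeholders, not the actual features or tooling described in the paper.

    # Sketch of "forcing a feature active": push the activations along that
    # feature's decoder direction so its coefficient sits at a chosen clamp value.
    # Hook point, feature index, and clamp value are illustrative placeholders.
    import torch

    def make_steering_hook(sae, feature_idx: int, clamp_value: float = 10.0):
        def hook(module, inputs, output):
            acts = output                                   # e.g. residual-stream activations
            _, feats = sae(acts)
            delta = clamp_value - feats[..., feature_idx]   # gap between current and clamped value
            direction = sae.decoder.weight[:, feature_idx]  # that feature's decoder direction
            return acts + delta.unsqueeze(-1) * direction   # returning a tensor overrides the output
        return hook

    # Usage sketch (the model and layer index are hypothetical):
    # handle = model.layers[20].register_forward_hook(make_steering_hook(sae, feature_idx=1234))
    # ... generate text, then handle.remove()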
5:08:44 So, at this point, it's kind of like, maybe it's just because the examples are presented that way,
5:08:51 but it kind of surfaces the more obvious examples, right? I guess the idea is that
5:08:56 down the line, it might be able to detect more nuanced things like deception or bugs or that kind of
5:09:02 stuff. Yeah, well, maybe I want to distinguish two things. So, one is the complexity of the feature
5:09:10 or the concept, right? And the other is the nuance of how subtle the examples we're looking
5:09:15 at are, right? So, when we show the top dataset examples, those are the most extreme examples
5:09:20 that cause that feature to activate. And so, it doesn’t mean that it doesn’t fire for more subtle
5:09:27 things. So, the insecure code feature, you know, the stuff that it fires most strongly for is
5:09:36 these like really obvious, you know, disable-the-security type things. But, you know, it also fires
5:09:41 for, you know, buffer overflows and more subtle security vulnerabilities in code.
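The "top dataset examples" he mentions are just the inputs whose activations maximize a given feature over some corpus, so the most extreme cases surface first even though the feature also fires, more weakly, on subtler ones. Here is a hedged sketch of that bookkeeping, reusing the SAE sketch above; the iterator of (text, activations) pairs is hypothetical.

    # Sketch of collecting top dataset examples for one feature: score each snippet
    # by the feature's strongest activation and keep the top k.
    import heapq
    import torch

    def top_examples(sae, activation_batches, feature_idx: int, k: int = 20):
        best = []                                            # min-heap of (score, text)
        for text, acts in activation_batches:                # hypothetical (text, activations) pairs
            with torch.no_grad():
                _, feats = sae(acts)
            score = feats[..., feature_idx].max().item()     # strongest activation in this snippet
            heapq.heappush(best, (score, text))
            if len(best) > k:
                heapq.heappop(best)                          # drop the weakest example kept so far
        return sorted(best, reverse=True)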
5:09:44 You know, these features are all multimodal. So, you could ask like, what images activate this
5:09:52 feature? And it turns out that the security vulnerability feature activates for images of
5:09:58 like people clicking through Chrome to go past the, you know, warning that this website's
5:10:01 SSL certificate might be wrong or something like this. Another thing that's very entertaining
5:10:05 is there's a backdoors-in-code feature. Like, you activate it, and Claude writes a backdoor
5:10:09 that like will go and dump your data to some port or something. But you can ask, okay, what
5:10:14 images activate the backdoor feature? It was devices with hidden cameras in them. So, there’s a whole
5:10:20 apparently genre of people going and selling devices that look innocuous that have hidden
5:10:24 cameras, and they advertise how there's a hidden camera in it. And I guess that is the, you know,
5:10:29 physical version of a backdoor. And so, it sort of shows you how abstract these concepts are,
5:10:35 right? And I just thought that was, I mean, I’m sort of sad that there’s a whole market of people
5:10:38 selling devices like that. But I was kind of delighted that that was the thing that it came
5:10:43 up with as the top image examples for the feature. Yeah, it's nice. It's multimodal. It's
5:10:50 almost multi-context. It's a broad, strong definition of a singular concept. It's nice. Yeah. To me,
5:10:57 one of the really interesting features, especially for AI safety is deception and lying and the
5:11:03 possibility that these kinds of methods could detect lying in a model, especially as it gets smarter
5:11:09 and smarter and smarter. Presumably, that’s a big threat of a super intelligent model that it can
5:11:15 deceive the people operating it as to its intentions or any of that kind of stuff. So,
5:11:19 what have you learned from detecting lying inside models?
5:11:26 Yeah. So, I think we’re in some ways in early days for that. We find quite a few features
5:11:32 related to deception and lying. There’s one feature where it fires for people lying and
5:11:36 being deceptive and you force it active and Claude starts lying to you. So, we have a deception
5:11:40 feature. I mean, there’s all kinds of other features about withholding information and not
5:11:45 answering questions. Features about power seeking and coups and stuff like that. So,
5:11:48 there’s a lot of features that are kind of related to spooky things. And if you
5:11:54 force them active, Claude will behave in ways that are not the kinds of behaviors you want.
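The safety hope sketched here is the inverse of steering: instead of forcing a deception-related feature on, you watch how strongly it fires while the model generates and flag when it crosses a threshold. Below is a minimal read-only monitoring sketch in the same style as the steering hook above; the feature index and threshold are made-up placeholders, and whether such a monitor would actually catch a deceptive model is exactly the open question being discussed.

    # Read-only monitor: record when a chosen feature (e.g. one related to deception)
    # fires above a threshold during generation. Feature index and threshold are
    # illustrative placeholders, not real values from the paper.
    import torch

    def make_monitor_hook(sae, feature_idx: int, threshold: float, alerts: list):
        def hook(module, inputs, output):
            with torch.no_grad():
                _, feats = sae(output)
            strength = feats[..., feature_idx].max().item()
            if strength > threshold:
                alerts.append(strength)      # in practice you'd log the position and context too
            return None                      # observe only; do not modify the activations
        return hook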
5:12:01 What are possible next exciting directions to you in the space of mech interp?
5:12:02 Well, there’s a lot of things.
5:12:11 So, for one thing, I would really like to get to a point where we have circuits where we can
5:12:18 really understand not just the features, but then use that to understand the computation of models.
5:12:25 That really for me is the ultimate goal of this. And there's been some work; we've put out a few things.
5:12:29 There’s a paper from Sam Marks that does some stuff like this. There’s been some,
5:12:32 I’d say, some work around the edges here. But I think there’s a lot more to do and I think
5:12:39 that will be a very exciting thing. That’s related to a challenge we call interference weights
5:12:45 where due to superposition, if you just sort of naively look at whether features are
5:12:50 connected together, there may be some weights that sort of don’t exist in the upstairs model,
5:12:55 but are just sort of artifacts of superposition. So, that's a sort of technical challenge there.
5:13:04 I think another exciting direction is just, you might think of sparse autoencoders as being
5:13:11 kind of like a telescope. They allow us to look out and see all these features that are out there.
5:13:15 And as we build better and better sparse autoencoders, get better and better at dictionary
5:13:22 learning, we see more and more stars. And we zoom in on smaller and smaller stars. But there’s
5:13:27 kind of a lot of evidence that we’re only still seeing a very small fraction of the stars. There’s
5:13:33 a lot of matter in our neural network universe that we can’t observe yet. And it may be that
5:13:37 we’ll never be able to have fine enough instruments to observe it. And maybe some of it just
5:13:42 isn't possible, isn't computationally tractable to observe. So, it's sort of a kind of dark
5:13:47 matter, maybe not in the sense of modern astronomy, but of earlier astronomy, when we didn't know what
5:13:52 this unexplained matter was. And so, I think a lot about that dark matter and whether we'll
5:13:58 ever observe it and what that means for safety if we can't observe it. If some significant
5:14:04 fraction of neural networks is not accessible to us. Another question that I think a lot about
5:14:10 is at the end of the day, mechanistic interpretability is this very microscopic
5:14:14 approach to interpretability. It's trying to understand things in a very fine-grained way.
5:14:20 But a lot of the questions we care about are very macroscopic. We care about these questions
5:14:25 about neural network behavior. I think that’s the thing that I care most about, but there’s
5:14:34 lots of other larger scale questions you might care about. And somehow, the nice thing about
5:14:38 having a very microscopic approach is it’s maybe easier to ask, is this true? But the downside is
5:14:43 it’s much further from the things we care about. And so, we now have this ladder to climb. And I
5:14:47 think there’s a question of, will we be able to find, are there sort of larger scale abstractions
5:14:53 that we can use to understand neural networks, that we can build up to from this very microscopic approach?
5:14:57 Yeah, you’ve written about this kind of organs question.
5:14:59 Yeah, exactly.
5:15:04 If we think of interpretability as a kind of anatomy of neural networks, most of the
5:15:09 circuits thread involves studying tiny little veins, looking at the small scale, at individual
5:15:14 neurons and how they connect. However, there are many natural questions that the small scale
5:15:20 approach doesn’t address. In contrast, the most prominent abstractions in biological anatomy
5:15:26 involve larger scale structures, like individual organs, like the heart or entire organ systems,
5:15:32 like the respiratory system. And so, we wonder, is there a respiratory system or heart or brain
5:15:34 region of an artificial neural network?
5:15:39 Yeah, exactly. I mean, if you think about science, a lot of scientific fields
5:15:46 investigate things at many levels of abstraction. So, in biology, you have molecular biology,
5:15:50 studying proteins and molecules and so on. And they have cellular biology. And then,
5:15:54 you have histology, studying tissues. And then, you have anatomy. And then, you have zoology.
5:15:58 And then, you have ecology. And so, you have many, many levels of abstraction. Or physics,
5:16:03 maybe you have the physics of individual particles. And then, statistical physics gives you
5:16:06 thermodynamics and things like that. And so, you often have different levels of abstraction.
5:16:13 And I think that right now, mechanistic interpretability, if it succeeds, is sort of like a microbiology
5:16:20 of neural networks. But we want something more like anatomy. And so, a question you might
5:16:24 ask is, why can’t you just go there directly? And I think the answer is superposition, at least
5:16:31 in significant part. It's actually very hard to see this macroscopic structure without first
5:16:35 sort of breaking down the microscopic structure in the right way and then studying how it connects
5:16:42 together. But I’m hopeful that there is going to be something much larger than features and circuits.
5:16:46 And that we’re going to be able to have a story that involves much bigger things. And then,
5:16:49 you can sort of study in detail the parts you care about.
5:16:54 I suppose that would be the neurobiology, like a psychologist or psychiatrist of a neural network.
5:16:59 And I think that the beautiful thing would be if we could go and, rather than having disparate
5:17:02 fields for those two things, if you could build a bridge between them,
5:17:10 such that you could go and have all of your higher level abstractions be grounded very firmly
5:17:16 in this very solid, more rigorous, ideally, foundation.
5:17:22 What do you think is the difference between the human brain, the biological neural network,
5:17:25 and the artificial neural network? Well, the neuroscientists have a much harder job than us.
5:17:30 Sometimes I just count my blessings by how much easier my job is than the neuroscientists.
5:17:36 So we can record from all the neurons. We can do that on arbitrary amounts of data.
5:17:42 The neurons don’t change while you’re doing that, by the way. You can go and ablate neurons,
5:17:46 you can edit the connections, and so on. And then you can undo those changes.
5:17:51 That’s pretty great. You can intervene on any neuron and force it active and see what happens.
5:17:55 You know which neurons are connected to everything. Neuroscientists want to get the
5:17:58 connectome. We have the connectome, and we have it for much bigger things than C. elegans.
5:18:05 And then not only do we have the connectome, we know which neurons excite or inhibit each
5:18:11 other. It’s not just that we know the binary mask. We know the weights. We can take gradients.
5:18:16 We know computationally what each neuron does. So I don’t know. The list goes on and on. We just
5:18:22 have so many advantages over neuroscientists. And then, even having all those advantages,
5:18:28 it’s really hard. And so one thing I do sometimes think is like, gosh, if it’s this hard for us,
5:18:31 it seems impossible under the constraints of neuroscience or near impossible.
5:18:36 I don’t know. Maybe part of me is like, I’ve got a few neuroscientists on my team. Maybe I’m
5:18:41 sort of like, ah, maybe the neuroscientists, maybe some of them would like to have an easier problem
5:18:48 that’s still very hard. And they could come and work on neural networks. And then after we figure
5:18:52 out things in sort of the easy little pond of trying to understand neural networks, which is
5:18:56 still very hard, then we could go back to biological neuroscience.
5:18:59 I love what you've written about the goal of mech interp research
5:19:05 as having two goals, safety and beauty. So can you talk about the beauty side of things?
5:19:11 Yeah. So there’s this funny thing where I think some people are kind of disappointed
5:19:16 by neural networks, where they're like, ah, neural networks, it's just these
5:19:20 simple rules. And then you just do a bunch of engineering to scale it up and it works really
5:19:25 well. And where are the complex ideas? This isn’t a very nice, beautiful, scientific result.
5:19:31 And I sometimes think when people say that, I picture them being like, evolution is so
5:19:35 boring. It’s just a bunch of simple rules. And you run evolution for a long time and you get
5:19:41 biology. What a sucky way for biology to have turned out. Where are the complex rules? But
5:19:48 the beauty is that the simplicity generates complexity. Biology has these simple rules,
5:19:54 and it gives rise to all the life and ecosystems that we see around us, all the beauty of nature
5:19:59 that all just comes from evolution and from something very simple in evolution. And similarly,
5:20:06 I think that neural networks create enormous complexity and beauty inside and structure
5:20:10 inside themselves that people generally don’t look at and don’t try to understand because
5:20:17 it’s hard to understand. But I think that there is an incredibly rich structure to be
5:20:23 discovered inside neural networks, a lot of very deep beauty, if we're just willing to take
5:20:30 the time to go and see it and understand it. Yeah, I love mech interp. The feeling like we are
5:20:34 understanding or getting glimpses of understanding the magic that’s going on inside is really
5:20:41 wonderful. It feels to me like one of the questions that’s just calling out to be asked. And I’m
5:20:44 sort of, I mean, a lot of people are thinking about this, but I’m often surprised that not
5:20:51 more are, is: how is it that we don't know how to directly create computer programs
5:20:55 that can do these things, and yet we have these amazing systems,
5:20:58 these neural networks, that can do all these amazing things?
5:21:02 And it just feels like that is obviously the question that sort of is calling out to be
5:21:09 answered. If you have any degree of curiosity, it’s like how is it that humanity now has these
5:21:14 artifacts that can do these things that we don’t know how to do? Yeah, I love the image of the
5:21:18 circuits reaching towards the light of the objective function. Yeah, it's just, it's this organic
5:21:23 thing that we’ve grown and we have no idea what we’ve grown. Well, thank you for working on safety
5:21:27 and thank you for appreciating the beauty of the things you discover. And thank you for talking
5:21:32 today, Chris. It’s wonderful. Thank you for taking the time to chat as well. Thanks for listening
5:21:37 to this conversation with Chris Olah and before that with Dario Amodei and Amanda Askell. To support
5:21:42 this podcast, please check out our sponsors in the description. And now let me leave you
5:21:49 with some words from Alan Watts. The only way to make sense out of change is to plunge into it,
5:22:06 move with it and join the dance. Thank you for listening and hope to see you next time.