AI transcript
0:00:04 to me a different problem.
0:00:06 – The previous decade had mostly been about
0:00:09 understanding data that already exists.
0:00:10 But the next decade was going to be
0:00:12 about understanding new data.
0:00:16 – Visual spatial intelligence is so fundamental.
0:00:19 It’s as fundamental as language.
0:00:21 – It’s like unwrapping presents on Christmas
0:00:22 that every day you know there’s gonna be
0:00:24 some amazing new discovery,
0:00:26 some amazing new application or algorithm somewhere.
0:00:30 – If we see something or if we imagine something,
0:00:34 both can converge towards generating it.
0:00:38 – I think we’re in the middle of a Cambrian explosion.
0:00:42 – To many, the last two years of AI
0:00:44 have felt like a light switch.
0:00:46 Pre and post GPT-3?
0:00:48 Pre and post being able to generate an image
0:00:50 with natural language.
0:00:53 And even pre and post translating any video
0:00:55 with a click of a button.
0:00:57 But to some, like Dr. Fei-Fei Li,
0:01:00 often referred to as the “Godmother of AI”
0:01:03 and longtime professor of computer science at Stanford,
0:01:05 who by the way taught some very well-known researchers
0:01:07 like Andrej Karpathy.
0:01:08 To people like Fei-Fei,
0:01:10 artificial intelligence unlocks have existed
0:01:13 on a multi-decade long continuum.
0:01:15 And that continuum is destined to proceed
0:01:17 into the physical spatial world.
0:01:19 At least that’s what Fei-Fei and her co-founders
0:01:22 of new company, World Labs, believe.
0:01:25 And these four founders pioneered the ecosystem
0:01:28 in so many ways, from Fei-Fei’s ImageNet
0:01:30 to Justin Johnson’s work on scene graphs,
0:01:32 Ben Mildenhall’s work on NeRFs,
0:01:34 or even Christoph Lassner’s work
0:01:36 on the precursor to the Gaussian splat.
0:01:37 And in today’s episode,
0:01:40 you’ll get to hear from Fei-Fei and Justin
0:01:41 as they explore this evolution
0:01:45 with a16z General Partner Martin Casado.
0:01:48 From the very earliest seeds to the recent explosion
0:01:50 of consumer-grade AI applications,
0:01:53 and the key watershed moments along the way.
0:01:56 We’ll, of course, dive into the why now behind World Labs,
0:01:59 but also their choice to focus on spatial intelligence
0:02:02 and what it might really take to build at that frontier,
0:02:05 from algorithmic unlocks to hardware.
0:02:06 All right, let’s get started.
0:02:11 As a reminder, the content here
0:02:12 is for informational purposes only.
0:02:14 Should not be taken as legal, business, tax,
0:02:16 or investment advice,
0:02:18 or be used to evaluate any investment or security,
0:02:19 and is not directed at any investors
0:02:22 or potential investors in any A16Z fund.
0:02:25 Please note that A16Z and its affiliates
0:02:26 may also maintain investments
0:02:29 in the companies discussed in this podcast.
0:02:31 For more details, including a link to our investments,
0:02:34 please see a16z.com/disclosures.
0:02:40 – Over the last two years,
0:02:42 we’ve seen this kind of massive rush
0:02:44 of consumer AI companies and technology,
0:02:45 and it’s been quite wild,
0:02:48 but you’ve been doing this now for decades.
0:02:51 And so maybe walk us a little bit through how we got here,
0:02:54 kind of like your key contributions and insights along the way.
0:02:57 – So it is a very exciting moment, right?
0:03:01 Just zooming back, AI is in a very exciting moment.
0:03:04 I personally have been doing this for two decades plus,
0:03:07 and we have come out of the last AI winter.
0:03:10 We have seen the birth of modern AI.
0:03:13 Then we have seen deep learning taking off,
0:03:16 showing us possibilities like playing chess,
0:03:20 but then we’re starting to see the deepening of the technology
0:03:25 and the industry adoption of some of the earlier possibilities,
0:03:26 like language models.
0:03:31 And now I think we’re in the middle of a Cambrian explosion,
0:03:32 in almost a literal sense,
0:03:35 because now in addition to texts,
0:03:38 you’re seeing pixels, videos, audios,
0:03:42 all coming with possible AI applications and models.
0:03:44 So it’s a very exciting moment.
0:03:45 – I know you both so well,
0:03:47 and many people know you both so well,
0:03:48 because you’re so prominent in the field,
0:03:49 but not everybody grew up in AI.
0:03:51 So maybe it’s kind of worth just going through
0:03:52 like your quick backgrounds,
0:03:54 just to kind of level set the audience.
0:03:55 – Yeah, sure.
0:03:57 So I first got into AI at the end of my undergrad.
0:03:59 I did math and computer science for undergrad at Caltech.
0:04:00 That was awesome.
0:04:01 But then towards the end of that,
0:04:03 there was this paper that came out
0:04:05 that was at the time, a very famous paper, the cat paper,
0:04:07 from Quoc Le and Andrew Ng and others
0:04:09 that were at Google Brain at the time.
0:04:10 And that was like the first time
0:04:13 that I came across this concept of deep learning.
0:04:15 And to me, it just felt like this amazing technology.
0:04:18 And that was the first time that I came across this recipe
0:04:19 that would come to define the next,
0:04:21 more than decade of my life,
0:04:23 which is that you can get these amazingly powerful
0:04:25 learning algorithms that are very generic,
0:04:27 couple them with very large amounts of compute,
0:04:29 couple them with very large amounts of data,
0:04:30 and magic things started to happen
0:04:32 when you combined those ingredients.
0:04:35 So I first came across that idea around 2011 and 2012-ish.
0:04:36 And I just thought, oh my God,
0:04:38 this is gonna be what I wanna do.
0:04:40 It was obvious you gotta go to grad school to do this stuff.
0:04:42 And then saw that Fei-Fei was at Stanford,
0:04:44 one of the few people in the world at the time
0:04:45 who was on that train.
0:04:47 And that was just an amazing time
0:04:49 to be in deep learning and computer vision specifically.
0:04:51 Because that was really the era when this went
0:04:54 from these first nascent bits of technology
0:04:55 that were just starting to work
0:04:56 and really got developed
0:04:59 and spread across a ton of different applications.
0:05:00 So then over that time,
0:05:01 we saw the beginnings of language modeling.
0:05:04 We saw the beginnings of discriminative computer vision
0:05:05 where you could take pictures
0:05:07 and understand what’s in them in a lot of different ways.
0:05:09 We also saw some of the early bits
0:05:10 of what we would now call gen AI,
0:05:13 generative modeling, generating images, generating text.
0:05:15 A lot of those core algorithmic pieces
0:05:17 actually got figured out by the academic community
0:05:18 during my PhD years.
0:05:20 There was a time I would just wake up every morning
0:05:22 and check the new papers on arXiv,
0:05:24 and it was like unwrapping presents on Christmas.
0:05:25 Every day, you know, there’s gonna be
0:05:27 some amazing new discovery,
0:05:28 some amazing new application or algorithm
0:05:29 somewhere in the world.
0:05:31 In the last two years, everyone else in the world
0:05:33 kind of came to the same realization
0:05:36 of using AI to get new Christmas presents every day.
0:05:37 But I think for those of us that have been in the field
0:05:38 for a decade or more,
0:05:41 we’ve sort of had that experience for a very long time.
0:05:44 I come to AI through a different angle,
0:05:45 which is from physics
0:05:47 because my undergraduate background was physics,
0:05:50 but physics is the kind of discipline
0:05:53 that teaches you to ask audacious questions
0:05:57 and think about what is the still remaining mystery
0:05:58 of the world.
0:06:00 Of course, in physics, it’s atomic world,
0:06:02 you know, universe and all that,
0:06:05 but somehow that kind of training,
0:06:09 thinking got me into the audacious question
0:06:11 that really captured my own imagination,
0:06:12 which is intelligence.
0:06:17 So I did my PhD in AI and computational neuroscience
0:06:18 at Caltech.
0:06:21 So Justin and I actually didn’t overlap,
0:06:24 but we share the same alma mater at Caltech.
0:06:25 – And the same advisor.
0:06:27 – Yes, same advisor,
0:06:28 your undergraduate advisor,
0:06:31 my PhD advisor, Pietro Perona.
0:06:34 And my PhD time, which is similar to your PhD time,
0:06:38 was when AI was still in the winter in the public eye,
0:06:40 but it was not in the winter in my eye
0:06:43 because it was that pre-spring hibernation.
0:06:46 There was so much life: machine learning,
0:06:50 statistical modeling was really gaining power.
0:06:53 I think I was one of the native generation
0:06:56 in machine learning and AI,
0:06:59 whereas I look at just this generation
0:07:01 as the native deep learning generation.
0:07:05 So machine learning was the precursor of deep learning
0:07:08 and we were experimenting with all kinds of models,
0:07:11 but one thing came out at the end of my PhD
0:07:14 and the beginning of my assistant professor time,
0:07:19 there was an overlooked element of AI
0:07:24 that is mathematically important to driving generalization,
0:07:27 but the whole field was not thinking that way.
0:07:30 And it was data because we were thinking
0:07:33 about the intricacy of Bayesian models
0:07:35 or kernel methods and all that,
0:07:38 but what was fundamental that my students
0:07:40 and my lab realized probably earlier
0:07:45 than most people is that if you let data drive models,
0:07:47 you can unleash the kind of power
0:07:49 that we haven’t seen before.
0:07:51 And that was really the reason
0:07:56 we went on a pretty crazy bet on ImageNet,
0:08:01 which is, just forget about the scale we’re seeing now;
0:08:03 back then it was thousands of data points.
0:08:07 At that point, the NLP community had their own datasets.
0:08:09 I remember the UC Irvine dataset
0:08:13 or some dataset in NLP, and it was small;
0:08:15 the computer vision community had their datasets,
0:08:19 but all on the order of thousands or tens of thousands.
0:08:22 We were like, we need to drive it to internet scale.
0:08:26 And luckily it was also the coming of age of internet.
0:08:29 So we were riding that wave
0:08:31 and that’s when I came to Stanford.
0:08:34 – So these epochs are what we often talk about.
0:08:36 ImageNet is clearly the epoch that created
0:08:40 or at least maybe made popular and viable computer vision.
0:08:43 In the gen AI wave, we talk about two kind of core unlocks.
0:08:45 One is the Transformers paper, which is attention,
0:08:46 and we talk about stable diffusion.
0:08:48 Is that a fair way to think about this?
0:08:49 Which is there’s these two algorithmic unlocks
0:08:51 that came from academia or Google
0:08:52 and that’s where everything comes from
0:08:54 or has it been more deliberate?
0:08:56 Or have there been other kind of big unlocks
0:08:57 that kind of brought us here
0:08:58 that we don’t talk as much about?
0:09:00 – I think the big unlock is compute.
0:09:03 I know the story of AI is often a story of compute,
0:09:04 but no matter how much people talk about it,
0:09:06 I think people underestimate it, right?
0:09:07 And the amount of growth that we’ve seen
0:09:10 in computational power over the last decade is astounding.
0:09:12 The first paper that’s really credited
0:09:14 with the breakthrough moment in computer vision
0:09:15 for deep learning was AlexNet,
0:09:18 which was a 2012 paper where a deep neural network
0:09:20 did really well on the ImageNet challenge
0:09:22 and just blew away all the other algorithms
0:09:23 that Fei-Fei had been working on
0:09:24 and the types of algorithms
0:09:26 that they had been working on more in grad school.
0:09:29 That AlexNet was a 60 million parameter deep neural network
0:09:33 and it was trained for six days on two GTX580s,
0:09:35 which was the top consumer card at the time,
0:09:37 which came out in 2010.
0:09:39 So I was looking at some numbers last night
0:09:40 just to put these in perspective.
0:09:44 And the newest, latest and greatest from NVIDIA is the GB200.
0:09:47 Do either of you wanna guess how much raw compute factor
0:09:50 we have between the GTX580 and the GB200?
0:09:52 – Shoot, no, what?
0:09:53 – Go for it.
0:09:54 – It’s in the thousands.
0:09:56 So I ran the numbers last night,
0:09:57 that training run
0:10:00 of six days on two GTX 580s,
0:10:02 if you scale, it comes out to just under five minutes
0:10:04 on a single GB200.
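For readers who want to check that scaling claim, here is a rough back-of-the-envelope sketch. The throughput figures are ballpark assumptions for illustration only, not official specs; the point is just that a multi-thousand-fold per-GPU gap turns a six-day run into minutes.

```python
# Back-of-the-envelope scaling of the AlexNet training run.
# Throughput figures below are rough assumptions, not official specs.
GTX_580_TFLOPS = 1.5     # assumed peak throughput of one GTX 580
GB200_TFLOPS = 5000.0    # assumed effective throughput of one GB200

original_gpu_count = 2
original_days = 6

# Total compute budget of the original run (GPU-TFLOPS-days).
budget = original_gpu_count * original_days * GTX_580_TFLOPS

# Time to spend the same budget on a single GB200, in minutes.
minutes_on_gb200 = budget / GB200_TFLOPS * 24 * 60
per_gpu_speedup = GB200_TFLOPS / GTX_580_TFLOPS

print(f"Raw per-GPU factor: ~{per_gpu_speedup:,.0f}x")          # "in the thousands"
print(f"Equivalent GB200 time: ~{minutes_on_gb200:.1f} minutes") # just over 5 minutes
```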
0:10:06 – Justin is making a really good point.
0:10:11 The 2012 AlexNet paper on ImageNet Challenge
0:10:15 is literally a very classic model.
0:10:18 And that is the convolutional neural network model.
0:10:21 The first paper on that was published in the 1980s.
0:10:25 I remember as a graduate student learning that.
0:10:29 And it more or less also has six, seven layers.
0:10:30 Practically the only difference
0:10:33 between AlexNet and that earlier ConvNet
0:10:38 is the two GPUs and the deluge of data.
0:10:42 – Yeah, so I think most people now are familiar
0:10:43 with quote the bitter lesson.
0:10:44 And what the bitter lesson says is
0:10:46 if you make an algorithm, don’t be cute,
0:10:48 just make sure you can take advantage of available compute
0:10:50 ’cause the available compute will show up.
0:10:52 On the other hand, there’s another narrative,
0:10:53 which seems to me to be just as credible,
0:10:55 which is it’s actually new data sources
0:10:56 that unlock deep learning, right?
0:10:57 Like ImageNet is a great example.
0:10:59 Self-attention is great from transformers,
0:11:01 but they’ll also say this is a way you can exploit
0:11:03 human labeling of data because it’s the humans
0:11:04 that put the structure in the sentences.
0:11:06 And if you look at CLIP, let’s say,
0:11:07 well, I could be using the internet
0:11:10 to like actually have humans use the alt tag
0:11:11 to label images, right?
0:11:13 And so like that’s a story of data,
0:11:15 that’s not a story of compute.
0:11:16 And so is the answer just both
0:11:17 or is like one more than the other?
0:11:18 – I think it’s both,
0:11:20 but you’re hitting on another really good point.
0:11:22 So I think there’s actually two epochs
0:11:24 that to me feel quite distinct in the algorithmics here.
0:11:26 So like the ImageNet era
0:11:27 is actually the era of supervised learning.
0:11:29 So in the era of supervised learning,
0:11:30 you have a lot of data,
0:11:32 but you don’t know how to use data on its own.
0:11:34 Like the expectation of ImageNet
0:11:36 and other datasets of that time period
0:11:37 was that we’re gonna get a lot of images,
0:11:39 but we need people to label every one of them.
0:11:42 And all of the training data that we’re gonna train on,
0:11:44 a human labeler has looked at every one
0:11:46 and said something about that image.
0:11:47 And the big algorithmic unlock was that we figured out
0:11:49 how to train on things
0:11:51 that don’t require human labeled data.
0:11:52 – As the naive person in the room
0:11:53 that doesn’t have an AI background,
0:11:56 it seems to me if you’re training on human data,
0:11:58 the humans have labeled it, it’s just not explicit.
0:12:00 – Yeah, I knew you were gonna say that, Martin.
0:12:01 I knew that.
0:12:05 Yes, philosophically, that’s a really important question.
0:12:08 But that actually is more true in language than pixels.
0:12:09 – Fair enough, yeah.
0:12:10 – Right. – Yeah, yeah, yeah, yeah.
0:12:12 But I do think it’s an important distinction
0:12:14 because CLIP really is human labeled.
0:12:16 I think with attention, it's humans who have figured out
0:12:19 the relationships of things, and then you learn them.
0:12:21 So it is human labeled just more implicit than explicit.
0:12:22 – Yeah, it’s still human labeled.
0:12:25 The distinction is that for this supervised learning era,
0:12:27 our learning tasks were much more constrained.
0:12:29 So you would have to come up with this ontology
0:12:30 of concepts that we wanna discover.
0:12:32 Right, if you’re doing ImageNet,
0:12:33 Fei-Fei and your students at the time
0:12:35 spent a lot of time thinking about
0:12:37 which thousand categories should be
0:12:38 in the ImageNet challenge.
0:12:39 Other data sets of that time,
0:12:41 like the COCO dataset for object detection,
0:12:43 they thought really hard about
0:12:45 which 80 categories we put in there.
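To make the contrast in this exchange concrete, here is a minimal sketch, assuming PyTorch, of the two training signals being discussed: a closed-ontology classifier trained on explicit labels (the ImageNet/COCO era) versus a CLIP-style contrastive objective that leans on the implicit labels humans left in web alt text. The function names and the linear classifier head are illustrative, not drawn from any particular codebase.

```python
import torch
import torch.nn.functional as F

# --- Supervised, ImageNet-style: a fixed ontology of K classes, explicit labels ---
def classification_loss(image_features, labels, classifier_head):
    # classifier_head: e.g. nn.Linear(dim, K) over a hand-chosen set of K categories
    logits = classifier_head(image_features)
    return F.cross_entropy(logits, labels)

# --- CLIP-style: implicit labels from paired (image, alt-text) found on the web ---
def contrastive_loss(image_features, text_features, temperature=0.07):
    # Normalize and compare every image in the batch against every caption.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits = image_features @ text_features.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric loss: match images to their captions and captions to their images.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```

Both losses still rest on human effort; the difference is whether the supervision comes from a curated ontology or from labels people left behind implicitly.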
0:12:46 – So let’s walk to gen AI.
0:12:49 So when I was doing my PhD, before you came,
0:12:51 I took machine learning from Andrew Ng
0:12:53 and then I took Bayesian, something very complicated
0:12:55 from Daphne Koller and it was very complicated for me.
0:12:56 A lot of that was just predictive modeling.
0:12:58 And then I remember the whole kind of vision stuff
0:13:00 that you unlock, but then the generative stuff
0:13:02 has shown up, like I would say in the last four years,
0:13:04 which is to me very different.
0:13:05 You’re not identifying objects.
0:13:06 You’re not predicting something.
0:13:07 You’re generating something.
0:13:10 And so maybe kind of walk through like the key unlocks
0:13:12 that got us there and then why it’s different.
0:13:13 And if we should think about it differently
0:13:16 and is it part of a continuum, is it not?
0:13:18 – It is so interesting.
0:13:23 Even during my graduate time, generative models were there.
0:13:25 We wanted to do generation.
0:13:29 Nobody remembers, but even with letters and numbers,
0:13:31 we were trying to do some generation.
0:13:33 Geoff Hinton had papers on generative models.
0:13:36 We were thinking about how to generate.
0:13:39 And in fact, if you think from a probability distribution
0:13:41 point of view, you can mathematically generate.
0:13:43 It’s just nothing we generate
0:13:45 would ever impress anybody, right?
0:13:49 So this concept of generation mathematically,
0:13:52 theoretically is there, but nothing worked.
0:13:56 Justin’s PhD, his entire PhD is a story,
0:14:00 almost a mini story of the trajectory of the field.
0:14:02 He started his first project in data.
0:14:04 I forced them to.
0:14:05 He didn’t like it.
0:14:07 (laughing)
0:14:09 – In retrospect, I learned a lot of really useful things.
0:14:12 – I’m glad you say that now.
0:14:15 – So actually my first paper, both of my PhD and like ever,
0:14:17 my first academic publication ever,
0:14:19 was the image retrieval with scene graphs.
0:14:22 – And then we went into taking pixels, generating words.
0:14:26 And Justin and Andrej really worked on that.
0:14:31 But that was still a very, very lossy way of generating
0:14:34 and getting information out of the pixel world.
0:14:36 And then in the middle, Justin went off
0:14:38 and did a very famous piece of work.
0:14:43 And it was the first time that someone made it real time,
0:14:44 right?
0:14:44 – Yeah, yeah.
0:14:46 So the story there is there was this paper that came out
0:14:48 in 2015, "A Neural Algorithm of Artistic Style,"
0:14:50 led by Leon Gatys.
0:14:51 And the paper came out and they showed
0:14:53 these real world photographs that they had converted
0:14:54 into a Van Gogh style.
0:14:58 And we are kind of used to seeing things like this in 2024,
0:14:59 but this was in 2015.
0:15:01 So this paper just popped up on archive one day
0:15:02 and it blew my mind.
0:15:06 I just got this gen AI brainworm in my brain in 2015
0:15:07 and it did something to me.
0:15:09 And I thought, oh my God, I need to understand
0:15:10 this algorithm, I need to play with it.
0:15:12 I need to make my own images into Van Gogh.
0:15:15 So then I like read the paper and then over a long weekend,
0:15:17 I re-implemented the thing and got it to work.
0:15:19 It was actually very simple algorithm.
0:15:22 So like my implementation was like 300 lines of Lua.
0:15:24 Cause at the time it was Lua.
0:15:24 It was Lua.
0:15:25 This was pre-PyTorch.
0:15:27 So we were using Lua Torch,
0:15:28 but it was like very simple algorithm,
0:15:29 but it was slow, right?
0:15:31 So it was an optimization based thing.
0:15:32 Every image you want to generate,
0:15:33 you need to run this optimization loop,
0:15:35 run this gradient descent loop
0:15:36 for every image that you generate.
0:15:37 The images were beautiful,
0:15:39 but I just wanted it to be faster.
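For readers curious why the original method was slow, here is a heavily abridged sketch of the optimization loop being described (Gatys-style style transfer), written in modern PyTorch rather than the Lua Torch of the time. The feature extractor and the content/style loss functions are placeholders (in the published work they were VGG features and Gram-matrix losses); the point is that every output image requires its own gradient-descent run.

```python
import torch

def stylize(content_img, style_img, feature_extractor, content_loss, style_loss,
            steps=500, lr=0.05):
    """Gatys-style optimization: one gradient-descent run per generated image."""
    # Start from the content image and optimize the pixels themselves.
    generated = content_img.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([generated], lr=lr)

    with torch.no_grad():
        content_feats = feature_extractor(content_img)
        style_feats = feature_extractor(style_img)

    for _ in range(steps):                      # this loop is why it was slow:
        optimizer.zero_grad()                   # hundreds of forward/backward
        feats = feature_extractor(generated)    # passes for every single image
        loss = content_loss(feats, content_feats) + style_loss(feats, style_feats)
        loss.backward()
        optimizer.step()

    return generated.detach()
```

Justin's faster version amortized this cost by training a feed-forward network once, so stylizing a new image becomes a single forward pass.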
0:15:40 – And Justin just did it.
0:15:44 And it was actually, I think your first taste of
0:15:47 an academic work having an industry impact.
0:15:48 – A bunch of people had seen this
0:15:50 artistic style transfer stuff at the time.
0:15:51 And me and a couple others at the same time
0:15:53 came up with different ways to speed this up,
0:15:56 but mine was the one that got a lot of traction.
0:15:59 – Before the world understood gen AI,
0:16:01 Justin's last piece of work in his PhD
0:16:05 was actually inputting language
0:16:07 and getting a whole picture out.
0:16:11 It was one of the first gen AI works, and it used GANs,
0:16:14 which were so hard to use.
0:16:16 The problem was that we were not ready
0:16:18 to use a natural piece of language.
0:16:20 So Justin, as you heard, worked on scene graphs.
0:16:25 So we had to input a scene graph language structure.
0:16:28 So the sheep, the grass, the sky in a graph way,
0:16:31 it literally was one of our photos, right?
0:16:34 And then he and another very good master's student, Agrim,
0:16:36 they got that GAN to work.
0:16:40 So you can see from data to matching,
0:16:45 to style transfer to generative images,
0:16:48 we're starting to see... You asked if this is an abrupt change.
0:16:50 For people like us,
0:16:53 it's already happening in a continuum,
0:16:56 but for the world, the results are more abrupt.
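For context on the structured input Fei-Fei is describing (the sheep, the grass, the sky, connected as a graph rather than written as a sentence), here is a toy illustration of what a scene graph looks like as data. The schema is made up for illustration and is not the exact format used in the paper.

```python
# A toy scene graph: objects plus pairwise relationships, instead of free-form text.
# The schema here is illustrative only.
scene_graph = {
    "objects": ["sheep", "grass", "sky"],
    "relationships": [
        ("sheep", "standing on", "grass"),
        ("grass", "below", "sky"),
    ],
}

# A generator conditioned on this structure (e.g. the GAN-based model discussed above)
# would consume the graph rather than a natural-language sentence:
#   image = scene_graph_to_image_model(scene_graph)   # hypothetical call
```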
0:16:59 – Hey, it’s Steph.
0:17:01 You might know that before my time at A16Z,
0:17:04 I used to work at a company called The Hustle.
0:17:06 And then we were acquired by HubSpot
0:17:08 where I helped build their podcast network.
0:17:09 While I’m not there anymore,
0:17:12 I’m still a big fan of HubSpot podcasts,
0:17:15 especially My First Million.
0:17:17 In fact, I’ve listened to pretty much all 600
0:17:18 of their episodes.
0:17:20 My First Million is perfect for those of you
0:17:22 who are always trying to stay ahead of the curve.
0:17:25 Or in some cases, take matters into your own hands
0:17:28 by building the future yourself.
0:17:30 Hosted by my friends, Sam Parr and Shaan Puri,
0:17:33 who have each built and sold eight-figure businesses
0:17:34 to Amazon and HubSpot,
0:17:36 the show explores business ideas
0:17:38 that you can start tomorrow.
0:17:41 Plus, Sam and Sean jam alongside guests like Mr. Beast,
0:17:43 Rob Dyrdek, Tim Ferriss,
0:17:46 and every so often, you'll even find me there.
0:17:48 From gas station pizza and egg carton businesses
0:17:51 doing millions, all the way up to several guests
0:17:53 making their first billion.
0:17:54 Go check out My First Million
0:17:56 wherever you get your podcasts.
0:18:04 – So I read your book.
0:18:06 And for those that are listening, it’s a phenomenal book.
0:18:08 I like, I really recommend you read it.
0:18:10 And it seems like for a long time,
0:18:11 and I'll ask you, Fei-Fei,
0:18:13 a lot of your research
0:18:17 and your direction has been towards kind of spatial stuff
0:18:19 and pixel stuff and intelligence.
0:18:21 And now you're doing World Labs
0:18:22 and it’s around spatial intelligence.
0:18:24 And so maybe you talk through,
0:18:25 is this been part of a long journey for you?
0:18:28 Like why did you decide to do it now?
0:18:29 Is it a technical unlock?
0:18:30 Is it a personal unlock?
0:18:35 Move us from that milieu of AI research to World Labs.
0:18:38 – For me, it is both personal and intellectual, right?
0:18:41 My entire intellectual journey
0:18:45 is really this passion to seek North stars,
0:18:47 but also believing that those North stars
0:18:51 are critically important for the advancement of our field.
0:18:55 So at the beginning, I remembered after graduate school,
0:19:00 I thought my North star was telling stories of images
0:19:03 because for me, that’s such an important piece
0:19:05 of visual intelligence.
0:19:08 That’s part of what you call AI or AGI.
0:19:11 But when Justin and Andre did that,
0:19:14 I was like, oh my God, that was my life's dream.
0:19:15 What do I do next?
0:19:17 So it came a lot faster.
0:19:20 I thought it would take a hundred years to do that.
0:19:23 But visual intelligence is my passion
0:19:28 because I do believe for every intelligent being,
0:19:32 like people or robots or some other form,
0:19:36 knowing how to see the world, reason about it,
0:19:39 interact in it, whether you’re navigating
0:19:42 or manipulating or making things,
0:19:46 you can even build civilization upon it.
0:19:50 And visual spatial intelligence is so fundamental.
0:19:53 It’s as fundamental as language,
0:19:58 possibly more ancient and more fundamental in certain ways.
0:20:00 So it’s very natural for me
0:20:04 that our North star is to unlock spatial intelligence.
0:20:07 The moment to me is right.
0:20:09 We’ve got these ingredients.
0:20:10 We’ve got compute.
0:20:13 We’ve got much deeper understanding of data,
0:20:15 way deeper than in the ImageNet days.
0:20:19 Compared to those days, we’re so much more sophisticated.
0:20:23 And we’ve got some advancement of algorithms,
0:20:25 including co-founders at World Labs
0:20:28 like Ben Mildenhall and Christoph Lassner,
0:20:31 who were at the cutting edge of NeRF, so
0:20:33 that we are in the right moment
0:20:38 to really make a bet and to focus and just unlock that.
0:20:40 – So I just want to clarify it
0:20:41 for folks that are listening to this.
0:20:42 You're starting this company, World Labs.
0:20:45 Spatial intelligence is kind of how you generally describe
0:20:46 the problem you're solving.
0:20:50 Can you maybe try to crisply describe what that means?
0:20:53 – Yeah, so spatial intelligence is about machines ability
0:20:56 to perceive, reason and act in 3D space and time,
0:20:59 to understand how objects and events are positioned
0:21:01 in 3D space and time,
0:21:02 how interactions in the world
0:21:05 can affect those 4D positions over space-time
0:21:08 and both sort of perceive, reason about, generate,
0:21:11 interact with, really take the machine out of the mainframe
0:21:12 or out of the data center
0:21:13 and putting it out into the world
0:21:15 and understanding the 3D, 4D world
0:21:17 with all of its richness.
0:21:17 – So to be very clear,
0:21:19 are we talking about the physical world
0:21:21 or are we just talking about an abstract notion of world?
0:21:22 – I think it can be both.
0:21:23 I think it can be both
0:21:25 and that encompasses our vision long-term.
0:21:27 Even if you’re generating worlds,
0:21:29 even if you’re generating content positioned in 3D,
0:21:31 it has a lot of benefits.
0:21:32 Or if you’re recognizing the real world,
0:21:34 being able to put 3D understanding
0:21:37 into the real world as well is part of it.
0:21:38 – Just for everybody listening,
0:21:39 like the two other co-founders,
0:21:41 Ben Mildenhall and Christoph Lassner,
0:21:43 are absolute legends in the field at the same level.
0:21:46 These four decided to come out and do this company now
0:21:49 and so I’m trying to dig to like why now is the right time.
0:21:51 – Yeah, I mean, this is again,
0:21:52 part of a longer evolution for me,
0:21:54 but post PhD, when I was really wanting to develop
0:21:56 into my own independent researcher,
0:21:57 both for my later career,
0:21:58 I was just thinking,
0:22:00 what are the big problems in AI and computer vision?
0:22:02 And the conclusion that I came to about that time
0:22:04 was that the previous decade
0:22:06 had mostly been about understanding data
0:22:07 that already exists.
0:22:08 But the next decade
0:22:10 was going to be about understanding new data.
0:22:11 And if we think about that,
0:22:14 the data that already exists was all of the images
0:22:17 and videos that maybe existed on the web already.
0:22:18 And the next decade was gonna be about
0:22:20 understanding new data, right?
0:22:22 People have smartphones, smartphones have cameras,
0:22:23 those cameras have new sensors,
0:22:25 those cameras are positioned in the 3D world.
0:22:27 It’s not just you’re gonna get a bag of pixels
0:22:29 from the internet and know nothing about it
0:22:31 and try to say if it’s a cat or a dog.
0:22:32 We wanna treat these images
0:22:35 as universal sensors to the physical world.
0:22:36 And how can we use that to understand
0:22:38 the 3D and 4D structure of the world,
0:22:41 either in physical spaces or generative spaces?
0:22:44 So I made a pretty big pivot post PhD
0:22:45 into 3D computer vision,
0:22:47 predicting 3D shapes of objects
0:22:49 with some of my colleagues at FAIR at the time.
0:22:51 Then later, I got really enamored by this idea
0:22:54 of learning 3D structure through 2D, right?
0:22:55 Because we talk about data a lot.
0:22:57 3D data is hard to get on its own,
0:22:59 but because there’s a very strong
0:23:00 mathematical connection here,
0:23:03 our 2D images are projections of a 3D world.
0:23:05 And there’s a lot of mathematical structure here
0:23:06 we can take advantage of.
0:23:07 So even if you have a lot of 2D data,
0:23:09 there’s a lot of people who’ve done amazing work
0:23:12 to figure out how can you back out the 3D structure
0:23:15 of the world from large quantities of 2D observations.
0:23:17 And then in 2020, you asked about breakthrough moments.
0:23:18 There was a really big breakthrough moment
0:23:20 from our co-founder, Ben Mildenhall at the time
0:23:22 with his paper NeRF: Neural Radiance Fields.
0:23:26 And that was a very simple, very clear way
0:23:29 of backing out 3D structure from 2D observations.
0:23:31 That just lit a fire under this whole space
0:23:33 of 3D computer vision.
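For readers who have not seen it, here is a highly compressed sketch of the NeRF idea Justin is crediting to Ben Mildenhall: a small MLP maps a 3D point (plus viewing direction) to color and density, and an image is formed by volume-rendering samples along each camera ray, so training only needs 2D photos and their camera poses. This strips out positional encoding, hierarchical sampling, and other details of the published method; it is an assumption-laden illustration, not the actual implementation.

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Map a 3D point + view direction to (RGB, density). Simplified: no positional encoding."""
    def __init__(self, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),          # 3 color channels + 1 density
        )

    def forward(self, points, dirs):
        out = self.mlp(torch.cat([points, dirs], dim=-1))
        rgb, sigma = torch.sigmoid(out[..., :3]), torch.relu(out[..., 3])
        return rgb, sigma

def render_rays(model, origins, dirs, near=2.0, far=6.0, n_samples=64):
    """Volume rendering: accumulate color along each ray, weighted by opacity."""
    t = torch.linspace(near, far, n_samples, device=origins.device)   # sample depths
    points = origins[:, None, :] + dirs[:, None, :] * t[None, :, None]
    rgb, sigma = model(points, dirs[:, None, :].expand_as(points))
    delta = (far - near) / n_samples
    alpha = 1.0 - torch.exp(-sigma * delta)                           # opacity per sample
    trans = torch.cumprod(torch.cat(
        [torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=1), dim=1)[:, :-1]
    weights = alpha * trans
    return (weights[..., None] * rgb).sum(dim=1)                      # composited pixel color

# Training loop (sketch): a photometric loss against the observed 2D images, e.g.
#   loss = ((render_rays(model, ray_origins, ray_dirs) - true_pixel_colors) ** 2).mean()
```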
0:23:34 I think there’s another aspect here
0:23:37 that maybe people outside the field don’t quite understand.
0:23:39 That was also a time when large language models
0:23:40 were starting to take off.
0:23:42 So a lot of the stuff with language modeling
0:23:44 actually had gotten developed in academia.
0:23:46 Even during my PhD, I did some early work
0:23:48 with Andrej Karpathy on language modeling in 2014.
0:23:50 – LSTM, I just don’t remember.
0:23:54 – LSTM, RNNs, GRUs, this was pre-transformer.
0:23:57 But then at some point, like around the GPT-2 time,
0:23:59 like you couldn’t really do those kind of models anymore
0:24:02 in academia because they took way more resourcing.
0:24:03 But there was one really interesting thing.
0:24:06 The NeRF approach that Ben came up with,
0:24:07 like you could train these in a couple hours
0:24:09 on a single GPU.
0:24:10 So I think at that time, there was a dynamic here
0:24:12 that happened, which is that I think
0:24:13 a lot of academic researchers ended up
0:24:15 focusing on a lot of these problems
0:24:17 because there was core algorithmic stuff to figure out
0:24:19 and because you could actually do a lot
0:24:21 without a ton of compute and you could get state-of-the-art
0:24:22 results on a single GPU.
0:24:25 Because of those dynamics, there was a lot of research,
0:24:27 a lot of researchers in academia
0:24:30 were moving to think about what are the core algorithmic ways
0:24:32 that we can advance this area as well.
0:24:34 Then I ended up chatting with Fei-Fei more
0:24:35 and I realized that we were actually–
0:24:37 – She’s very convincing.
0:24:37 – She’s very convincing.
0:24:38 Well, there’s that.
0:24:39 But we talked about trying to figure out
0:24:41 your own independent research trajectory
0:24:42 from your advisor.
0:24:42 Well, it turns out we ended up–
0:24:43 – Oh, no.
0:24:45 – Kind of concluding on– – Converging again.
0:24:46 – Converging on similar things.
0:24:48 – Okay, well, for my end,
0:24:50 when I want to talk to the smartest person,
0:24:53 I call Justin, there's no question about it.
0:24:55 I do want to talk about a very interesting
0:24:58 technical story of pixels
0:25:01 that most people working in language don’t realize
0:25:04 is that pre-Gen AI era in the field of computer vision,
0:25:07 those of us who work on pixels,
0:25:10 we actually have a long history
0:25:13 in an area of research called reconstruction,
0:25:14 3D reconstruction.
0:25:17 It dates back from the ’70s.
0:25:20 You can take photos ’cause humans have two eyes, right?
0:25:23 So in general, it starts with stereo photos
0:25:26 and then you try to triangulate the geometry
0:25:29 and make a 3D shape out of it.
0:25:31 It is a really, really hard problem.
0:25:34 To this day, it's not fundamentally solved
0:25:36 because there’s correspondence and all that.
0:25:38 So this whole field,
0:25:41 which is an older way of thinking about 3D,
0:25:44 has been going around and it has been making
0:25:46 really good progress.
0:25:50 But when this happened in the context of generative methods,
0:25:53 in the context of diffusion models,
0:25:56 suddenly reconstruction and generation
0:25:57 start to really merge.
0:26:01 Now, within really a short period of time
0:26:03 in the field of computer vision,
0:26:05 it’s hard to talk about reconstruction
0:26:07 versus generation anymore.
0:26:12 We suddenly have a moment where if we see something
0:26:15 or if we imagine something,
0:26:18 both can converge towards generating it.
0:26:21 And that’s just, to me, a really important moment
0:26:23 for computer vision, but most people miss that
0:26:26 ’cause we’re not talking about it as much as LLMs.
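As a concrete illustration of the classic reconstruction pipeline Fei-Fei is describing, here is a minimal two-view triangulation using the standard direct linear transform. It assumes the genuinely hard part she points to, finding which pixel in one photo corresponds to which pixel in the other, has already been solved, and that the camera projection matrices are known; it is a textbook sketch, not World Labs code.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Recover a 3D point from matching pixels in two calibrated views (DLT).

    P1, P2: 3x4 camera projection matrices.
    x1, x2: (u, v) pixel coordinates of the same physical point in each image.
    In practice the hard part is finding the x1 <-> x2 correspondences at all.
    """
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The 3D point is the null vector of A (smallest singular value).
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]        # back from homogeneous coordinates
```

Structure-from-motion systems chain this kind of geometry over many views; the generative methods discussed here sidestep explicit correspondence by optimizing a representation that re-renders the observations.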
0:26:28 – Right, so in pixel space, there’s reconstruction
0:26:31 where you reconstruct a scene that’s real
0:26:32 and then if you don’t see the scene,
0:26:34 then you use generative techniques, right?
0:26:35 So these things are kind of very similar.
0:26:37 Throughout this entire conversation,
0:26:38 you’re talking about languages
0:26:40 and you’re talking about pixels.
0:26:42 So maybe it’s a good time to talk about
0:26:44 how spatial intelligence and what you’re working on
0:26:47 contrasts with language approaches,
0:26:48 which, of course, are very popular now.
0:26:49 Is it complementary?
0:26:51 Is it orthogonal?
0:26:52 – I think they’re complementary.
0:26:53 – I don’t mean to be too leading here.
0:26:54 Maybe just contrasting.
0:26:58 Like everybody says I know OpenAI and I know GPT
0:26:59 and I know multimodal models
0:27:00 and a lot of what you’re talking about
0:27:02 is like they’ve got pixels and they’ve got languages
0:27:05 and doesn’t this kind of do what we want to do
0:27:06 with spatial reasoning?
0:27:07 – Yeah, so I think to do that,
0:27:09 you need to open up the black box a little bit
0:27:10 of how these systems work under the hood.
0:27:12 So with language models
0:27:13 and the multimodal language models
0:27:14 that we’re seeing nowadays,
0:27:16 their underlying representation under the hood
0:27:18 is a one-dimensional representation.
0:27:19 We talk about context lengths,
0:27:20 we talk about transformers,
0:27:22 we talk about sequences, attention.
0:27:24 Fundamentally, their representation of the world
0:27:26 is one-dimensional.
0:27:27 So these things fundamentally operate
0:27:29 on a one-dimensional sequence of tokens.
0:27:31 So this is a very natural representation
0:27:32 when you’re talking about language
0:27:34 because written text is a one-dimensional sequence
0:27:36 of discrete letters.
0:27:37 So that kind of underlying representation
0:27:39 is the thing that led to LLMs.
0:27:42 And now the multimodal LLMs that we’re seeing now,
0:27:45 you kind of end up shoehorning the other modalities
0:27:46 into this underlying representation
0:27:48 of a one-D sequence of tokens.
0:27:50 Now, when we move to spatial intelligence,
0:27:52 it’s kind of going the other way,
0:27:54 where we’re saying that the three-dimensional nature
0:27:56 of the world should be front and center
0:27:57 in the representation.
0:27:59 So at an algorithmic perspective,
0:28:01 that opens up the door for us to process data
0:28:02 in different ways,
0:28:04 to get different kinds of outputs out of it,
0:28:06 and to tackle slightly different problems.
0:28:08 So even at a coarse level,
0:28:09 you kind of look at it from the outside and you say,
0:28:12 “Oh, multimodal LLMs can look at images too.”
0:28:13 Well, they can, but I think they don’t have
0:28:15 that fundamental 3D representation
0:28:17 at the heart of their approaches.
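To make the point about representations concrete, here is a toy version of the "shoehorning" Justin is describing: a ViT-style preprocessing step that flattens a 2D image into a 1D sequence of patch tokens so a transformer can consume it like text. The patch size and shapes are illustrative.

```python
import torch

def image_to_token_sequence(image, patch=16):
    """Flatten a (C, H, W) image into a 1D sequence of patch tokens, ViT-style.

    Whatever spatial structure the scene had, the transformer downstream only
    sees positions along a single axis: the 1D representation described above.
    """
    c, h, w = image.shape
    patches = image.unfold(1, patch, patch).unfold(2, patch, patch)   # (C, H/p, W/p, p, p)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)
    return patches            # (num_tokens, token_dim): a flat sequence, like words

tokens = image_to_token_sequence(torch.rand(3, 224, 224))
print(tokens.shape)           # torch.Size([196, 768])
```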
0:28:18 – I totally agree with Justin.
0:28:20 I think talking about the one-D
0:28:23 versus fundamentally 3D representation
0:28:26 is one of the most core differentiation.
0:28:28 The other thing that’s slightly philosophical,
0:28:30 but it’s really important for me at least,
0:28:35 is language is fundamentally a purely generated signal.
0:28:38 There’s no language out there.
0:28:40 You don’t go out in the nature
0:28:43 and there’s words written in the sky for you.
0:28:44 Whatever data you’re feeding,
0:28:49 you pretty much can just somehow regurgitate
0:28:53 with enough generalizability the same data out,
0:28:56 and that’s language to language.
0:28:58 But 3D world is not.
0:29:00 There is a 3D world out there
0:29:02 that follows laws of physics,
0:29:06 that has its own structures due to materials
0:29:07 and many other things.
0:29:11 And to fundamentally back that information out
0:29:13 and be able to represent it
0:29:15 and be able to generate it
0:29:19 is just fundamentally quite a different problem.
0:29:22 We will be borrowing similar ideas
0:29:26 or useful ideas from language and LLMs,
0:29:28 but this is fundamentally philosophically
0:29:30 to me a different problem.
0:29:34 – So language is 1D and probably a bad representation
0:29:35 of the physical world
0:29:36 ’cause it’s been generated by humans
0:29:38 and it’s probably lossy.
0:29:41 There’s a whole nother modality of generative AI models
0:29:44 which are pixels and these are 2D image and 2D video.
0:29:46 And like one could say that if you look at a video
0:29:49 you can see 3D stuff because like you can pan a camera
0:29:50 or whatever it is.
0:29:53 And so like how would like spatial intelligence
0:29:55 be different than say 2D video?
0:29:56 – When I think about this,
0:29:57 it’s useful to disentangle two things.
0:29:59 One is the underlying representation
0:30:02 and then two is kind of the user facing affordances
0:30:03 that you have.
0:30:05 And here’s where you can get sometimes confused
0:30:08 because fundamentally we see 2D, right?
0:30:10 Our retinas are 2D structures in our bodies
0:30:11 and we’ve got two of them.
0:30:15 So fundamentally our visual system perceives 2D images.
0:30:17 But the problem is that depending on what representation
0:30:19 you use there could be different affordances
0:30:21 that are more natural or less natural.
0:30:23 So even if you at the end of the day
0:30:26 you might be seeing a 2D image or a 2D video
0:30:30 your brain is perceiving that as a projection of a 3D world.
0:30:31 So there’s things you might want to do,
0:30:34 move objects around, move the camera around.
0:30:36 In principle, you might be able to do these
0:30:39 with a purely 2D representation and model,
0:30:40 but it’s just not a fit to the problems
0:30:42 that you’re asking the model to do, right?
0:30:46 Modeling the 2D projections of a dynamic 3D world
0:30:48 is a function that probably can be modeled.
0:30:50 But by putting a 3D representation into the heart of a model
0:30:52 there’s just going to be a better fit
0:30:53 between the kind of representation
0:30:55 that the model is working on
0:30:58 and the kind of tasks that you want that model to do.
0:31:01 So our bet is that by threading a little bit more
0:31:03 3D representation under the hood
0:31:06 that’ll enable better affordances for users.
0:31:08 And this also goes back to the North Star.
0:31:11 For me, why is it spatial intelligence?
0:31:15 Why is it not flat pixel intelligence?
0:31:18 It’s because I think the arc of intelligence
0:31:22 has to go to what Justin calls affordances.
0:31:26 And the arc of intelligence, if you look at evolution, right?
0:31:31 The arc of intelligence eventually enables animals and humans,
0:31:34 especially humans as an intelligent animal,
0:31:37 to move around the world, interact with it,
0:31:39 create civilization, create life,
0:31:42 create a piece of sandwich,
0:31:45 whatever you do in this 3D world.
0:31:48 And translating that into a piece of technology
0:31:52 that native 3Dness is fundamentally important
0:31:57 for the flood of possible applications,
0:32:03 even if some of them, the serving of them looks 2D,
0:32:05 but it’s innately 3D to me.
0:32:08 I think this is actually a very subtle
0:32:10 and incredibly critical point.
0:32:11 And so I think it’s worth digging into
0:32:13 and a good way to do this is talking about use cases.
0:32:15 And so just to level-set this,
0:32:17 is when we’re talking about generating a technology,
0:32:20 let’s call it a model, that can do spatial intelligence.
0:32:23 So maybe in the abstract,
0:32:26 what might that look like kind of a little bit more concretely?
0:32:28 There’s a couple different kinds of things we imagine
0:32:32 these spatially intelligent models being able to do over time.
0:32:36 And one that I’m really excited about is world generation.
0:32:38 We're all used to something like a text-to-image generator,
0:32:39 or starting to see text-to-video generators,
0:32:41 where you put in a piece of text,
0:32:45 and out pops an amazing image or an amazing two-second clip.
0:32:47 But I think you could imagine leveling this up
0:32:48 and getting 3D worlds out.
0:32:52 So one thing that we could imagine spatial intelligence
0:32:53 helping us with in the future
0:32:56 are up-leveling these experiences into 3D,
0:32:59 where you’re getting out a full virtual, simulated,
0:33:01 but vibrant and interactive 3D world, right?
0:33:04 Maybe for gaming, maybe for virtual photography, you name it.
0:33:05 Even if you got this to work,
0:33:07 there’d be a million applications for education.
0:33:10 I mean, in some sense, this enables a new form of media, right?
0:33:12 Because we already have the ability
0:33:15 to create virtual interactive worlds,
0:33:18 but it costs hundreds of millions of dollars
0:33:20 and a ton of development time.
0:33:22 And as a result, what are the places
0:33:24 that people drive this technological ability
0:33:26 is video games, right?
0:33:29 But because it takes so much labor to do so,
0:33:31 then the only economically viable use
0:33:33 of that technology in its form today
0:33:36 is games that can be sold for $70 apiece
0:33:38 to millions and millions of people to recoup the investment.
0:33:42 If we had the ability to create these same virtual,
0:33:45 interactive, vibrant 3D worlds,
0:33:47 you could see a lot of other applications of this, right?
0:33:49 Because if you bring down that cost of producing
0:33:50 that kind of content,
0:33:52 then people are going to use it for other things, right?
0:33:56 What if you could have sort of a personalized 3D experience
0:33:58 that’s as good and as rich as detailed
0:33:59 as one of these AAA video games
0:34:02 that cost hundreds of millions of dollars to produce,
0:34:04 but it could be catered to this very niche thing
0:34:06 that only maybe a couple of people
0:34:07 would want that particular thing.
0:34:10 That’s not a particular product or a particular roadmap,
0:34:13 but I think that’s a vision of a new kind of media
0:34:15 that would be enabled by spatial intelligence
0:34:17 in the generative realm.
0:34:18 – If I think about a world,
0:34:18 I actually think about things
0:34:20 that are not just scene generation.
0:34:21 I think about stuff like movement and physics.
0:34:24 And so like in the limit, is that included?
0:34:26 And then if I’m interacting with it,
0:34:28 like, are there semantics?
0:34:30 And I mean, by that, like, if I open a book,
0:34:31 are there like pages and are there words in it?
0:34:33 And do they mean, like, are we talking like
0:34:34 a full-depth experience
0:34:36 or are we talking about like kind of a static scene?
0:34:37 – I think I’ll see a progression
0:34:38 of this technology over time.
0:34:40 This is really hard stuff to build.
0:34:43 So I think the static problem is a little bit easier,
0:34:45 but in the limit, I think we want this to be fully dynamic,
0:34:48 fully interactable, all the things that you just said.
0:34:49 – I mean, that’s the definition
0:34:51 of spatial intelligence, yeah.
0:34:52 So there is going to be a progression.
0:34:55 We’ll start with more static,
0:34:58 but everything you’ve said is in the roadmap
0:35:00 of spatial intelligence.
0:35:01 – I mean, this is kind of in the name
0:35:02 of the company itself, World Labs.
0:35:05 Like the world is about building and understanding worlds.
0:35:07 And this is actually a little bit of inside baseball.
0:35:09 I realized after we told the name to people,
0:35:11 they don’t always get it because in computer vision
0:35:13 and reconstruction and generation,
0:35:15 we often make a distinction or a delineation
0:35:16 about the kinds of things you can do.
0:35:18 And kind of the first level is objects, right?
0:35:20 A microphone, a cup, a chair.
0:35:22 These are discrete things in the world.
0:35:24 And a lot of the ImageNet style stuff
0:35:26 that Fei-Fei worked on was about recognizing
0:35:27 objects in the world.
0:35:30 Then leveling up, the next level above objects,
0:35:30 I think, is scenes.
0:35:32 Scenes are compositions of objects.
0:35:34 Now we’ve got this recording studio with a table
0:35:36 and microphones and people and chairs
0:35:37 at some composition of objects.
0:35:40 But then we envision worlds as a step beyond scenes, right?
0:35:42 Scenes are kind of maybe individual things,
0:35:45 but we want to break the boundaries, go outside the door,
0:35:46 step up from the table, walk out from the door,
0:35:49 walk down the street and see the cars buzzing past
0:35:51 and see the leaves on the trees moving
0:35:53 and be able to interact with those things.
0:35:54 – Another thing that’s really exciting
0:35:57 is just to mention the word new media.
0:36:01 With this technology, the boundary between real world
0:36:04 and virtual imagined world or augmented world
0:36:07 or predicted world is all blurry.
0:36:09 This real world is 3D, right?
0:36:14 So in the digital world, you have to have a 3D representation
0:36:18 to even blend with the real world.
0:36:20 You cannot have a 2D, you cannot have a 1D
0:36:23 to be able to interface with the real 3D world
0:36:25 in an effective way.
0:36:28 With this, it unlocks it so the use cases
0:36:31 can be quite limitless because of this.
0:36:34 – Right, so the first use case that Justin was talking about
0:36:36 would be like the generation of a virtual world
0:36:38 for any number of use cases.
0:36:39 One that you’re just alluding to
0:36:40 would be more of an augmented reality, right?
0:36:44 – Yes, just around the time world lab was being formed,
0:36:47 Vision Pro was released by Apple
0:36:50 and they use the word spatial computing.
0:36:52 It's almost like they almost stole our–
0:36:54 but we’re spatial intelligence.
0:36:57 So spatial computing needs spatial intelligence.
0:36:58 That’s exactly right.
0:37:02 So we don’t know what hardware form it will take.
0:37:04 It’ll be goggles, glasses.
0:37:05 – Contact lenses.
0:37:06 – Contact lenses.
0:37:10 But that interface between the true real world
0:37:12 and what you can do on top of it,
0:37:16 whether it’s to help you to augment your capability
0:37:19 to work on a piece of machine and fix your car,
0:37:22 even if you are not a trained mechanic
0:37:25 or to just be in a Pokemon,
0:37:27 suddenly this piece of technology
0:37:31 is going to be the operating system basically
0:37:33 for AR, VR, mixed reality.
0:37:35 – In the limit, what does an AR device need to do?
0:37:38 It’s this thing that’s always on, it’s with you,
0:37:39 it’s looking out into the world.
0:37:41 So it needs to understand the stuff that you’re seeing
0:37:44 and maybe help you out with tasks in your daily life.
0:37:46 But I’m also really excited about this blend
0:37:49 between virtual and physical that becomes really critical.
0:37:51 If you have the ability to understand what’s around you
0:37:53 in real time, in perfect 3D,
0:37:55 then it actually starts to deprecate
0:37:56 large parts of the real world as well.
0:37:58 Like right now, how many differently sized screens
0:38:00 do we all own for different use cases?
0:38:01 – Too many.
0:38:02 – You’ve got your phone, you’ve got your iPad,
0:38:05 you’ve got your computer monitor, you’ve got your TV,
0:38:05 you’ve got your watch.
0:38:07 Like these are all basically different sized screens
0:38:09 because they need to present information to you
0:38:11 in different contexts and in different positions.
0:38:14 But if you’ve got the ability to seamlessly blend
0:38:16 virtual content with the physical world,
0:38:18 it kind of deprecates the need for all of those.
0:38:20 It just ideally seamlessly blends information
0:38:21 that you need to know in the moment
0:38:24 with the right mechanism of giving you that information.
0:38:28 – Another huge case of being able to blend
0:38:32 the digital virtual world with the 3D physical world
0:38:36 is for AI agents to be able to do things
0:38:37 in the physical world.
0:38:41 And if humans use these mixed reality devices
0:38:43 to do things like I said, I don’t know how to fix a car,
0:38:47 but if I have to, I put on this goggles or a glass
0:38:49 and suddenly I’m guided to do that.
0:38:53 But there are other types of agents, namely robots,
0:38:56 any kind of robots, not just humanoid.
0:39:01 And their interface by definition is the 3D world,
0:39:05 but their compute, their brain by definition
0:39:06 is the digital world.
0:39:11 So what connects the learning to the behaving,
0:39:14 between a robot's brain and the real world?
0:39:17 It has to be spatial intelligence.
0:39:19 – So you’ve talked about virtual worlds,
0:39:22 you’ve talked about kind of more of an augmented reality.
0:39:24 And now you’ve just talked about the purely physical world,
0:39:28 basically, which would be used for robotics for any company.
0:39:31 That would be like a very large charter,
0:39:32 especially if you’re gonna get into,
0:39:35 how do you think about the idea of like deep, deep tech
0:39:37 versus any of these specific application areas?
0:39:40 – We see ourselves as a deep tech company,
0:39:44 as the platform company that provides models
0:39:46 that can serve different use cases.
0:39:48 – Of these three, is there anyone that you think
0:39:50 is kind of more natural early on
0:39:53 that people can kind of expect the company to lean into?
0:39:55 – I think it suffices to say
0:39:58 the devices are not totally ready.
0:40:00 – Actually, I got my first VR headset in grad school.
0:40:02 That’s one of these transformative technology experiences.
0:40:04 You put it on, you’re like, oh my God,
0:40:05 like this is crazy.
0:40:06 And I think a lot of people have that experience
0:40:07 the first time they use VR.
0:40:09 So I’ve been excited about this space for a long time.
0:40:11 And I love the Vision Pro.
0:40:13 Like I stayed up late to order one of the first ones,
0:40:15 like the first day it came out.
0:40:17 But I think the reality is it’s just not there yet
0:40:19 as a platform for mass market appeal.
0:40:22 – So very likely, as a company, we will move into a market
0:40:24 that's more ready first. But, you know,
0:40:27 we are a deep tech company.
0:40:28 – Then I think there can sometimes be simplicity
0:40:30 in generality, right?
0:40:32 We have this notion of being a deep tech company.
0:40:34 We believe that there is some underlying
0:40:37 fundamental problems that need to be solved really well.
0:40:38 And if solved really well,
0:40:40 can apply to a lot of different domains.
0:40:42 We really view this long arc of the company
0:40:44 as building and realizing the dreams
0:40:46 of spatial intelligence writ large.
0:40:49 – So this is a lot of technology to build, it seems to me.
0:40:50 – Yeah, I think it’s a really hard problem.
0:40:53 I think sometimes from people who are not directly
0:40:55 in the AI space, they just see it as AI
0:40:57 as one undifferentiated mass of talent.
0:40:59 And for those of us who have been here for longer,
0:41:01 you realize that there’s a lot of different kinds
0:41:02 of talent that need to come together
0:41:04 to build anything in AI, in particular this one.
0:41:06 We’ve talked a little bit about the data problem.
0:41:08 We’ve talked a little bit about some of the algorithms
0:41:10 that I worked on during my PhD,
0:41:12 but there’s a lot of other stuff we need to do this too.
0:41:14 You need really high quality, large-scale engineering.
0:41:17 You need really deep understanding of the 3D world.
0:41:19 There’s actually a lot of connections with computer graphics
0:41:20 because they’ve been kind of attacking
0:41:22 a lot of the same problems from the opposite direction.
0:41:24 So when we think about team construction,
0:41:27 we think about how do we find like absolute
0:41:29 top of the world best experts in the world
0:41:31 at each of these different sub-domains
0:41:34 that are necessary to build this really hard thing.
0:41:36 – When I thought about how we formed
0:41:39 the best founding team for world labs,
0:41:41 it has to start with a group
0:41:44 of phenomenal multidisciplinary founders.
0:41:48 And of course, Justin is natural for me.
0:41:52 Justin, cover your ears, is one of my best students
0:41:54 and one of the smartest technologists.
0:41:58 But there are two other people I have known by reputation
0:42:00 and one of them Justin even worked with
0:42:02 that I was drooling for, right?
0:42:04 One is Ben Mildenhall.
0:42:07 We talked about his seminal work on NeRF.
0:42:10 But another person is Christoph Lassner,
0:42:15 who has a strong reputation in the computer graphics community,
0:42:20 and especially he had the foresight of working
0:42:23 on a precursor of the Gaussian splat representation
0:42:27 for 3D modeling five years, right,
0:42:29 before Gaussian splats took off.
0:42:31 – Ben and Christoph are legends
0:42:33 and maybe just quickly talk about kind of like
0:42:35 how you thought about the build out of the rest of the team
0:42:37 because again, like there’s a lot to build here
0:42:40 and a lot to work on, not just in kind of AI or graphics
0:42:42 but like systems and so forth.
0:42:46 – Yeah, this is what so far I’m personally most proud of
0:42:48 is the formidable team.
0:42:50 I’ve had the privilege of working
0:42:53 with the smartest young people in my entire career, right?
0:42:57 From the top universities being a professor at Stanford
0:43:01 but the kind of talent that we put together here
0:43:03 at World Labs is just phenomenal.
0:43:05 I’ve never seen the concentration.
0:43:09 And I think the biggest differentiating element here
0:43:13 is that we’re believers of spatial intelligence.
0:43:16 All of the multidisciplinary talents,
0:43:18 whether it’s system engineering, machine learning,
0:43:23 infra, to generative modeling, to data, to graphics,
0:43:28 all of us share that belief, whether it's our personal research journey
0:43:31 or technology journey or even personal hobby.
0:43:35 And that’s how we really found our founding team
0:43:40 and that focus of energy and talent is humbling to me.
0:43:41 I just love it.
0:43:43 – So I know you’ve been guided by a North Star.
0:43:45 So something about North Stars is like
0:43:47 you can’t actually reach them
0:43:49 because they’re in the sky but it’s a great way
0:43:50 to have guidance.
0:43:52 So how will you know when you’ve accomplished
0:43:54 what you’ve set out to accomplish
0:43:56 or is this a lifelong thing
0:43:59 that’s gonna continue kind of infinitely?
0:44:01 – First of all, there’s real North Stars
0:44:03 and virtual North Stars.
0:44:05 Sometimes you can reach virtual North Stars.
0:44:06 – Fair enough, in the world model.
0:44:08 – Exactly. – You can hit North Stars.
0:44:11 – Like I said, the way I thought one of my North Stars
0:44:13 that would take a hundred years
0:44:16 was storytelling of images, and Justin and Andrej,
0:44:18 in my opinion, solved it for me.
0:44:20 So we could get to our North Star.
0:44:24 But I think for me is when so many people
0:44:27 and so many businesses are using our models
0:44:31 to unlock their needs for spatial intelligence.
0:44:33 And that’s the moment I know
0:44:35 we have reached a major milestone.
0:44:37 – Actual deployment, actual impact.
0:44:38 – Yeah, I don’t think we’re ever gonna get there.
0:44:40 I think that this is such a fundamental thing.
0:44:43 The universe is a giant evolving four dimensional structure
0:44:45 and spatial intelligence writ large
0:44:47 is just understanding that in all of its depths
0:44:49 and figuring out all the applications to that.
0:44:53 So I think we have a particular set of ideas in mind today,
0:44:55 but I think this journey is gonna take us places
0:44:57 that we can’t even imagine right now.
0:44:58 – The magic of good technology
0:45:03 is that technology opens up more possibilities and unknowns.
0:45:04 So we will be pushing
0:45:07 and then the possibilities will be expanding.
0:45:08 – Brilliant.
0:45:09 Thank you, Justin.
0:45:10 Thank you, Fei-Fei.
0:45:11 This was fantastic.
0:45:11 – Thank you, Martin.
0:45:12 – Thank you, Martin.
0:45:14 (upbeat music)
0:45:16 – All right, that is all for today.
0:45:19 If you did make it this far, first of all, thank you.
0:45:21 We put a lot of thought into each of these episodes,
0:45:23 whether it's guests, the calendar Tetris,
0:45:25 the cycles with our amazing editor, Tommy,
0:45:27 until the music is just right.
0:45:29 So if you like what we've put together,
0:45:33 consider dropping us a line at ratethispodcast.com/a16z
0:45:36 and let us know what your favorite episode is.
0:45:39 It’ll make my day and I’m sure Tommy’s too.
0:45:40 We’ll catch you on the flip side.
0:45:43 (upbeat music)
0:45:45 (upbeat music)
Fei-Fei Li and Justin Johnson are pioneers in AI. While the world has only recently witnessed a surge in consumer AI, our guests have long been laying the groundwork for innovations that are transforming industries today.
In this episode, a16z General Partner Martin Casado joins Fei-Fei and Justin to explore the journey from early AI winters to the rise of deep learning and the rapid expansion of multimodal AI. From foundational advancements like ImageNet to the cutting-edge realm of spatial intelligence, Fei-Fei and Justin share the breakthroughs that have shaped the AI landscape and reveal what’s next for innovation at World Labs.
If you’re curious about how AI is evolving beyond language models and into a new realm of 3D, generative worlds, this episode is a must-listen.
Resources:
Learn more about World Labs: https://www.worldlabs.ai
Find Fei-Fei on Twitter: https://x.com/drfeifei
Find Justin on Twitter: https://x.com/jcjohnss
Find Martin on Twitter: https://x.com/martin_casado
Stay Updated:
Let us know what you think: https://ratethispodcast.com/a16z
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://twitter.com/stephsmithio
Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.