AI transcript
0:00:16 Hello, and welcome to the NVIDIA AI podcast. I’m your host, Noah Kravitz. Our guest today
0:00:22 is Yash Raj Narang. Yash is Senior Research Manager at NVIDIA and the head of the Seattle
0:00:28 Robotics Lab, which I’m really excited to learn more about along with you today. Yash’s
0:00:33 work focuses on the intersection of robotics, AI, and simulation, and his team conducts
0:00:38 fundamental and applied research across the full robotics stack, including perception, planning,
0:00:43 control, reinforcement learning, imitation learning, simulation, and vision-language action
0:00:49 models. Full robotics stack, like it says. Prior to joining NVIDIA, Yash completed a PhD in
0:00:54 materials science and mechanical engineering from Harvard University and a master’s in mechanical
0:01:00 engineering from MIT. And he’s here now to talk about robots, the field of robotics, robotics
0:01:05 learning, all kinds of awesome stuff. I’m so excited to have you here, Yash. So thank you
0:01:06 for joining the podcast. Welcome.
0:01:07 Thank you so much, Noah.
0:01:13 So maybe first things first, and this is a very selfish question I mentioned before we started,
0:01:18 but I think the listeners will be into it too. I’ve never been to the Seattle Robotics Lab. I don’t
0:01:22 know much about it. Can we start with having you talk a little bit about your own role,
0:01:28 your background, if you like, and give us a little peek into what the Seattle Lab is all about.
0:01:34 Yeah, absolutely. So the Seattle Robotics Lab, it started in, I believe, October of 2017,
0:01:39 and I actually joined the lab in December of 2018. And the lab was started by Dieter Fox,
0:01:43 who’s a professor at the University of Washington. And, you know, at the time,
0:01:48 I believe he had a conversation with Jensen at a conference, Jensen Huang, of course,
0:01:53 the CEO of NVIDIA. And Jensen thinks way far out into the future. And at that point,
0:01:57 he was getting really excited about robotics. And he said, you know, essentially, that we need a
0:02:02 research effort in robotics at NVIDIA. And that’s really how the lab started. So that was kind of the
0:02:07 birth of the lab. And at the beginning, the lab had, and it still does, a very academic focus.
0:02:08 Okay.
0:02:14 So we consistently have really high engagement at conferences. We publish a lot, we do a lot of
0:02:20 fundamental and applied research. And recently, NVIDIA has been developing, especially over the
0:02:25 past few years, a really robust product and engineering effort as well. And so we’re working
0:02:29 more and more closely with them to try to get some of our research out into the hands of the
0:02:34 community. So, you know, fundamental academic mission, but it’s really important for us as well
0:02:38 to transfer our research and get it out there for everyone to use.
0:02:45 Fantastic. And you mentioned Dieter Fox, I believe, at UW, University of Washington. Is the lab,
0:02:46 is there a relationship there?
0:02:52 Yeah. So when Dieter started the lab, we, you know, over a number of years had a very close
0:02:57 relationship with University of Washington, where many students from his lab and others would come
0:03:03 do internships at the Seattle Robotics Lab. We still definitely have that kind of relationship.
0:03:06 I stepped into the leadership role just a few months ago.
0:03:07 Oh, wow. Okay.
0:03:11 And, you know, plan to maintain that relationship because it’s been so productive for us.
0:03:17 Awesome. I have a little bit of bias. Somebody very close to me is a UW alum. Go Huskies.
0:03:22 So, you know, I had to ask. All right. Let’s talk about robots. We’re going to start talking about,
0:03:27 well, really, I’ll leave it to you and I’ll ask at a very high level. How do robots come to life?
0:03:31 What does that mean when we talk about, you know, a robot coming to life? And I think there’s,
0:03:35 I’m going to get into the three computer concept and stuff like that, but I’ll leave it to you
0:03:38 at a high level. What does that mean bringing robots to life?
0:03:45 Yeah, it’s a big question. I think it’s, it’s a real open question too. I think we can even start
0:03:50 with what is a robot. I think this is a subject of debate, but, you know, generally speaking,
0:03:58 a robot is a synthetic system that can perceive the world, can plan out sequences of actions,
0:04:03 can make changes in the world, and it can be programmed. And it typically serves some purpose
0:04:08 of automation. That’s really sort of the essence of a robot. And now there’s the question of if you
0:04:14 have a robot, how can it come to life? So I would say that if most people were, for example,
0:04:20 to step into a factory today, you know, like an automotive manufacturing plant, they would see
0:04:26 lots and lots of robots everywhere. Right. And the motion of these robots and the payloads of these
0:04:32 robots and the speed of these robots, it’s extremely impressive. But those same people that are walking
0:04:38 into these places, they might not feel like these robots are alive because they don’t necessarily react
0:04:42 to you. In fact, you probably want to get out of their way if they’re putting something together,
0:04:48 you know, to be safe. So I think part of robots coming alive is really this additional aspect of
0:04:55 intelligence so that when conditions change, it can adapt, it can be robust to perturbations,
0:05:01 and it can start to learn from experience. Yeah. And I think that’s really kind of the essence of
0:05:07 coming alive. Got it. And what is the three computer concept and how does it relate to robotics?
0:05:12 Yeah, the three computer concept is pretty interesting. I think this was, you know,
0:05:16 I don’t know the exact history of this, but I think this was inspired by, you know, the three body
0:05:23 problem. So the three computer concept, it’s really a formula for today’s robotics, you know,
0:05:29 both on the research side and the industry side. And it has three parts, as the name suggests. So the
0:05:37 first computer is the NVIDIA DGX computer. So this includes things like GB200 systems,
0:05:43 Grace Blackwell superchips, and systems that are composed of those chips. And these are really
0:05:50 ideal for training large AI models and running inference on those models. So getting that fundamental
0:05:56 understanding of the world, being able to process, you know, take images as input, language as input,
0:06:03 and produce meaningful actions, robot actions as output, for example, training these sorts of models,
0:06:08 and then running inference on those. The second computer is Omniverse and Cosmos. It’s a combination
0:06:15 of these things. So Omniverse is really a developer platform that NVIDIA has built for a number of years
0:06:21 with incredible capabilities on rendering, incredible capabilities on simulation, and many, many applications
0:06:28 built on top of this platform. So for example, in the Seattle Robotics Lab, we’re heavy users of Isaac Sim and
0:06:35 Isaac Lab, which are basically robot simulation and robot learning software that is developed on top of
0:06:43 Omniverse. And what you can do with Omniverse is essentially train robots to acquire new behaviors,
0:06:48 for example, using processes like reinforcement learning, which is sort of intelligent trial and error.
0:06:53 You can also use it to evaluate robots. For example, if you have some learned behaviors,
0:06:59 and you want to see how it performs in different scenarios, you can put it into simulation and kind
0:07:06 of see what happens there. Cosmos is essentially a world model for robotics. And world model is this
0:07:11 kind of big term, and many people have different interpretations of it. But just to kind of ground
0:07:18 things a bit here, some of the things that Cosmos has done is actually make video generation models.
0:07:24 So you can have an initial frame of an image, you can have a language command, and then you can predict
0:07:29 sequences of images that come after that. So this is the Cosmos Predict model. There’s also the
0:07:36 Cosmos Transfer model. And the idea here is that you can take an image and you can again take, let’s say,
0:07:41 a language prompt. And you can transform that image to look like a completely different scene,
0:07:49 while maintaining, you know, the shape and semantic relationships of different objects in that image.
0:07:55 And then there’s Cosmos Reason, which is really a VLM, a vision language model. So it can take
0:08:02 images as input, language as input, and it can basically produce language as output. It can answer
0:08:07 questions about images, and it can do a sort of a step-by-step thinking or reasoning process.
0:08:14 Now, just stepping back a little bit, you know, second computer again, Omniverse and Cosmos. And what
0:08:22 they’re really used for is to generate data, to generate experience, and to evaluate robots in simulation.
0:08:29 And so in a sense, this can kind of come either before or after the first computer. You know, you can,
0:08:36 for example, generate a lot of data, and then learn from it using that first computer, these DGX systems.
0:08:40 Or you can train a model on that DGX system and then evaluate it using something like Omniverse
0:08:47 or Cosmos. And the third computer is the AGX. By the way, I looked this up recently. I was curious,
0:08:51 I’ve been here for a while, but still curious: what does the D in DGX stand for? What does the A in
0:08:57 AGX stand for? Oh, yeah. Okay. So, D is apparently for deep learning. And A is apparently for autonomous.
0:09:02 So, it’s kind of a nice way to remember. Interesting. The more you know. Yeah,
0:09:09 exactly. The more you know, right? So, the third computer is the Jetson AGX. Specifically,
0:09:17 the Thor has been recently released. And this is all about running inference on models that are located on
0:09:24 your robot. So, instead of having, you know, separate workstations or, you know, data centers,
0:09:30 this is a chip that actually lives on the robot where you can basically have AI models there and
0:09:34 you can run inference on them in real time. Really powerful.
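At a very high level, the three computers slot into a workflow like the sketch below. Every function here is a hypothetical placeholder standing in for the role of each computer, not an NVIDIA API:

```python
# Conceptual sketch of the three-computer workflow; all functions are
# hypothetical placeholders, not NVIDIA APIs.

def generate_synthetic_experience(num_episodes):
    """Computer 2 (Omniverse/Cosmos role): produce simulated experience."""
    return [{"observations": [], "actions": []} for _ in range(num_episodes)]

def train_policy(episodes):
    """Computer 1 (DGX role): fit a model to the generated experience."""
    return lambda observation: [0.0] * 7  # trivial stand-in policy

def evaluate_in_simulation(policy, num_trials):
    """Computer 2 again: score the learned behavior before deployment."""
    return 0.0  # placeholder success rate

def run_onboard_inference(policy, observation):
    """Computer 3 (Jetson AGX role): real-time inference on the robot itself."""
    return policy(observation)

episodes = generate_synthetic_experience(num_episodes=1000)
policy = train_policy(episodes)
print("sim success rate:", evaluate_in_simulation(policy, num_trials=100))
print("action:", run_onboard_inference(policy, observation=None))
```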
0:09:40 So, before asking to follow up, I feel like I have to plug the podcast real quick because
0:09:45 it was really sort of satisfying in a way to listen to you and think, oh, yeah, we did an
0:09:50 episode with that. Oh, yeah. Sanja talked about that. Oh, yeah. So, I will say, if you would like
0:09:55 to know a little more about the feeling of walking through an automotive factory with a lot of robots
0:10:01 doing amazing things without worrying about getting out of the way, great episode with Siemens from a few
0:10:06 months back, check that out. I mentioned Sanja Fidler recently from NVIDIA.
0:10:12 She spoke around SIGGRAPH, but a lot of stuff related to robots. Of course, from GTC, those are my plugs.
0:10:17 Okay. So, you got into this a little bit, Yash, but, you know, mentioning Thor in particular,
0:10:23 but what’s changed recently in the field and what does that mean for where robotics is headed?
0:10:28 Yeah. I think there have been many changes in the field. I think, for example, the three-computer
0:10:33 solution and three-computer strategy from NVIDIA, that’s been definitely a key enabler. Just the
0:10:40 fact that there is access to more and more compute, more and more powerful compute, and tools like
0:10:46 Omniverse, for example, for rendering and simulation, and Cosmos for world models, and,
0:10:51 of course, you know, better and better onboard compute. I think that’s really, really empowered robotics.
0:10:57 On, let’s say, you know, maybe if we think a little bit about the learning side,
0:11:04 I think since joining the lab in December of 2018, I’ve sort of been lucky to witness different
0:11:10 transformations in robotics over time. So, you know, one thing that I witnessed early on was,
0:11:16 actually, I think this was in 2019 when OpenAI released its Rubik’s Cube manipulation work.
0:11:24 And so, these are, these were basically dexterous hands, human-like hands that learned to manipulate
0:11:29 a Rubik’s Cube and essentially solve it, but it was learned, you know, purely in simulation and
0:11:31 then transferred to the real world. Yeah, I remember that.
0:11:36 So, that was kind of a big moment in the rise of the sim-to-real paradigm,
0:11:41 training and simulation deploying in the real world. I think other things, you know, came after that.
0:11:48 The, you know, transformers were, of course, invented kind of before, but really starting
0:11:53 to see more and more of that model architecture in robotics. I think that was, that was a big moment
0:12:00 or a big series of moments. Another specific moment that was pretty powerful was just, of course,
0:12:08 as everybody in AI knows, ChatGPT. So, I think that was released in late 2022. Most people started to
0:12:15 interact with it early 2023. And then, you know, the world of robotics started thinking about, okay,
0:12:20 how do we actually leverage this for what we do? And, you know, many other fields kind of felt the
0:12:21 same thing. Sure.
0:12:29 So, there was really an explosion of papers starting in 2023 about how to use language models for robotics
0:12:34 and how to use vision language models for robotics. And I think that was, that was quite interesting.
0:12:40 So, there are papers that kind of explored this along every dimension. Like, can you, for example,
0:12:46 give some sort of long-range task to a robot, or, you know, in this case, to a language model,
0:12:50 and have it figure out all the steps you need to accomplish in order to perform that task?
0:12:58 Can you, for example, use a language model to construct rewards? So, when you do, for example,
0:13:05 reinforcement learning, intelligent trial and error, you usually need some sort of signal about how good
0:13:10 your attempt was. You know, you’re trying all of these different things. How good was that sequence
0:13:17 of actions? And that’s typically called a reward. So, you know, these are traditionally hand-coded
0:13:23 things using a lot of human intuition. And there’s some very interesting work, including Eureka from NVIDIA,
0:13:28 about how to use language models to sort of generate those rewards. There was also kind of a
0:13:37 simultaneous explosion in more general generative AI, for example, generating images and generating 3D
0:13:43 assets. A lot of this work came from NVIDIA as well. So, on the image generation side, you know,
0:13:49 there was work, for example, on generating images that describe the goal of your robotic system. So,
0:13:53 where do you want your robot to end up? What do you want the final product to look like? Let’s
0:13:59 generate an image from that and use that to sort of guide the learning process. And then there’s also,
0:14:03 you know, when it comes to simulation, one of the, one of the, and we’ll probably get more into this a
0:14:07 little bit later, but one of the challenges of simulation is you have to build a scene and you
0:14:13 have to build these 3D assets, your meshes. And that can take a lot of time and effort and artistic
0:14:20 ability and so on. So, there’s a lot of work on automatically generating these scenes and generating
0:14:26 these assets. And in a sense, you can kind of view this transformation that we’ve seen over the past
0:14:33 years as kind of taking the human or human ingenuity more and more out of the process or
0:14:39 at higher and higher levels, as opposed to absolutely doing everything and sort of hard
0:14:46 coding things like rewards and final states and, you know, building meshes and assets manually and
0:14:51 describing scenes and so on and so forth. So, we’re, you know, able to automate more and more of that.
0:14:57 Yeah. There’s so much in what you just said. And one of the big things for me, from this perspective,
0:15:04 is thinking about how little I understood about Omniverse, let alone Cosmos, before having the
0:15:09 chance to have some of these conversations, particularly over the past few months and having to do with
0:15:16 robotics, physical AI, and simulation and the idea of creating the world and then the robot is able to
0:15:23 learn and Cosmos. It’s all, it’s just fascinating. It’s so cool to, you know, to, I’m wanting to geek
0:15:28 out on my end. But when you’re talking about the different types of learning and, you know,
0:15:33 I’m sure they go together in the same way that you mix different approaches to anything in solving
0:15:38 complex problems. Can you talk a little bit about, I don’t know if pros and cons is the right way to
0:15:45 describe it, but the difference between imitation and reinforcement learning, not so much in what they
0:15:51 are, but in sort of, you know, effectiveness or how you use them together and that sort of thing.
0:15:56 Yeah, absolutely. I think these, you know, these are two really popular paradigms for robot learning.
0:16:02 And I will, you know, try to kind of ground it in what we do, what we typically do in robotics,
0:16:10 the typical implementations of imitation learning and reinforcement learning. So in a typical imitation
0:16:17 learning pipeline, you’re typically learning from examples. So for example, let’s say I define a task,
0:16:23 I’m trying to pick up my water bottle with a robot. What I might do if I were using an imitation
0:16:29 learning approach is maybe, you know, physically move around the robot and pick up the water bottle,
0:16:34 or I might use my keyboard and mouse to sort of teleoperate the robot and pick up the water bottle,
0:16:41 or I might use other interfaces. But the point is that I am collecting a number of demonstrations of
0:16:46 this behavior. I do it once in one way, I do it, you know, the second time in a different way.
0:16:50 And maybe I move the water bottle around, and I collect a lot of different demonstrations there.
0:16:57 And basically, the purpose of imitation learning is to essentially mimic those demonstrations.
0:17:04 The behaviors would ideally look the way I have demonstrated them, right? Now, reinforcement
0:17:09 learning operates a little bit differently. Reinforcement learning tries to discover the behaviors,
0:17:17 you know, or the sequences of actions that achieve the goal. So, you know, in the most extreme case,
0:17:21 what you might do, if you were to take a reinforcement learning approach, again, intelligent
0:17:27 trial and error, is you might just have proposals of different sequences of actions that are being
0:17:33 generated. And if they happen to pick up the water bottle, I give a reward signal of one.
0:17:40 And if they fail, I might give a reward signal of zero. And the key difference here is that I am not
0:17:47 providing very much guidance on this sequence of actions that the robot needs to use in order to
0:17:52 accomplish the task. I’m letting the robot explore, try out many different things, and then come up with its
0:18:00 own strategy. So, you know, pros and cons. So, imitation learning, you know, one pro is that
0:18:07 you can provide it a lot of guidance. And the behaviors that you learn, for example, if a human,
0:18:12 if a person is demonstrating these behaviors, then the behaviors that you learn would generally
0:18:18 be human-like. They’re trying to essentially mimic those demonstrations. Now, reinforcement learning,
0:18:22 on the other hand, you know, again, in the most extreme case, you’re not necessarily leveraging
0:18:29 any demonstrations. The robot, or agent as it’s often called, has to figure this out on its own.
0:18:34 And so it can be less efficient. Of course, you’re not giving it that guidance. And so it’s trying all
0:18:39 of these sequences of actions. And there are principled ways to do that. But essentially,
0:18:43 it would be less efficient than if you were to give it some demonstrations and, say, learn from that.
0:18:49 Now, the pro is that you often have the capability of doing things that can be
0:18:53 really hard to demonstrate. So one of the things, you know, one of the topics that
0:18:58 I’ve worked on for some time, for example, is assembly, literally teaching robots to put parts
0:19:04 together. And this can actually be really difficult to do via a teleoperation interface. You probably need
0:19:06 to be an expert gamer in order to do that, right?
0:19:06 Yeah.
0:19:12 I hear you talk about assembling things. And I think of, forget the robot. I think of myself
0:19:17 trying to put together like very small parts on something, you know, twisting a screw in. And I
0:19:20 can’t, that makes me cringe, let alone trying to teleoperate a robot. Yeah.
0:19:25 It can be really hard depending on the task. And the second thing is that reinforcement learning
0:19:31 generally has the potential to achieve superhuman performance. So there are things, and I think games
0:19:36 are a great example. Like, you know, one of the domains of reinforcement learning historically
0:19:41 has been in games like Atari games. And that’s kind of where people maybe in recent history got
0:19:48 super excited about reinforcement learning because all of a sudden you could have these AI agents that
0:19:55 can do better at these games than any human ever. The same capabilities apply to robots. So you can
0:20:01 potentially learn, the robot can learn behaviors that are better than, you know, what any person
0:20:07 could possibly demonstrate. And maybe like a simple example of this is speed. So maybe there’s a tricky
0:20:12 problem you’re trying to give your robot where it has to go through a really narrow path and has to do
0:20:17 this very quickly. And if you were to demonstrate this, you might proceed very slowly, you might collide
0:20:23 along the way. But if a reinforcement learning agent is allowed to solve this problem, it could probably
0:20:28 learn these behaviors automatically, these smooth behaviors, and it can start to do this really,
0:20:32 really fast. And, you know, assembling objects is another example. You can start to assemble
0:20:36 objects faster than you could possibly demonstrate. And I think that’s the power.
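A minimal sketch of the two paradigms as described above, assuming a generic PyTorch policy rather than any specific NVIDIA pipeline; the dimensions, helper names, and random stand-in data are illustrative only:

```python
import torch
import torch.nn as nn

# Hypothetical policy: a flat 32-dim observation mapped to a 7-dim action.
policy = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 7))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

# Imitation learning in its simplest form (behavior cloning): regress the
# policy's output onto the demonstrated actions.
def imitation_step(obs_batch, demo_actions):
    loss = nn.functional.mse_loss(policy(obs_batch), demo_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Reinforcement learning in its most extreme (sparse) form: no demonstrations,
# just a success/failure signal the agent has to discover behaviors against.
def sparse_reward(bottle_lifted: bool) -> float:
    return 1.0 if bottle_lifted else 0.0

# Random tensors stand in for real demonstration data here.
obs = torch.randn(16, 32)
demo_actions = torch.randn(16, 7)
print("behavior cloning loss:", imitation_step(obs, demo_actions))
print("sparse reward:", sparse_reward(bottle_lifted=True))
```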
0:20:41 That’s very cool. The thinking about or listening to you talk about different approaches to teaching
0:20:46 and learning brought to mind, I was looking at the NVIDIA YouTube channel just the other day for a
0:20:52 totally different reason and came across the video of Jensen giving the robot a gift and writing the card
0:20:56 that says, you know, dear robot, enjoy your new brain or something along those lines, right?
0:20:57 Yeah.
0:21:03 There’s something I only kind of know by name, modular versus end-to-end brain. What is that about? Is
0:21:05 that, am I along the right lines or is that something totally different?
0:21:11 No, no, that’s, it’s essentially a way to design robotic intelligence. I would say these are two
0:21:17 competing paradigms. Both of these paradigms can leverage the latest and greatest in hardware. I would
0:21:22 say that. Now, the modular approach is an approach that has been developed for a very long time in
0:21:30 robotics. And sort of a classic framing for this is that a robot, you know, in order to perform some
0:21:37 task or set of tasks needs to have the ability to perceive the world. So to take in sensing information
0:21:43 and then come up with an understanding of the world, like where everything is, for example. And it also
0:21:50 needs the ability to plan. So for example, given some sort of model of the world, like a physics model,
0:21:55 for example, or a more abstract model, and maybe some sort of reward signal, you know,
0:22:02 can it actually select a sequence of actions that is likely to accomplish a desired goal?
0:22:09 And then, you know, a third module in this modular approach would be the action module. And that means
0:22:14 that you take in this sequence of actions, maybe these configurations that you’d like the robot
0:22:21 to reach in space. And the action module, also called control, would figure out what are the
0:22:26 motor commands that you want to generate, literally, what are the signals you want to send to the robot’s
0:22:32 motors in order to, you know, move along this path in space. So that’s kind of the perceive,
0:22:38 plan, act framework. It’s been called different things over time, but that’s kind of the classic framing
0:22:44 for a modular approach. And so following that, you would have maybe a perception module and you’d have
0:22:49 some group of people working on that. You’d have a planning module, you’d have some group of people
0:22:54 working on that. You have an action module. And so this is kind of how many robotic systems have
0:23:00 been built over time. Now, the end-to-end approach is something that is definitely newer. And the idea is
0:23:07 that you don’t draw these boundaries, really. You take in your sensor data,
0:23:11 like camera data, you know, maybe force torque data, if you’re interacting with the world,
0:23:18 and then you directly predict the commands that you send to your motors, right? So you kind
0:23:24 of skip these intermediate steps, and you go straight from inputs to outputs. And that’s the end-to-end
0:23:31 approach. And, you know, I would say the modular approaches are extremely powerful. They have
0:23:35 their advantages: there’s a lot of maturity around
0:23:41 developing each of those modules, they can be easy to debug, you know, for teams of engineers, which I was
0:23:45 mentioning with the groups of people earlier. Yeah. It can be easier to certify as well. You know,
0:23:51 if it’s a safety-critical application. The end-to-end approach, the advantage there is that
0:23:58 you’re not relying as much on human ingenuity or human engineering to figure out what exactly are
0:24:03 the outputs I should be producing for my perception module, what exactly are the outputs I should be
0:24:08 producing for my planning module, and so on. That requires a lot of engineering. And if you don’t do it
0:24:11 right, you may not get the desired outcome. Yeah. I was just going to say conceptually,
0:24:18 it made me think of the difference between doing whatever task I’m used to doing and asking a chat
0:24:24 bot just to shoot me the output, you know, and, and yeah, yeah. Right. And I think just another
0:24:30 analogy here would be, I think this has been a really, uh, fruitful debate, um, really vigorous debate
0:24:37 in autonomous driving actually. So in the 2010s, I would say just about every effort in autonomous driving
0:24:43 was focused on the modular paradigm. Again, you know, separate perception, planning, control modules,
0:24:49 and different teams associated with, with each of those things. And then kind of late, uh, like,
0:24:54 let’s say, you know, early in the 2020s, there was a real shift to the end-to-end paradigm, which basically said,
0:25:01 let’s just collect a lot of data and train a model that goes directly from pixels to actions. You know,
0:25:07 actions in this case being steering angle, throttle, brakes, and so on. Yeah. And many things today
0:25:12 kind of look, um, I would say like a hybrid, you know, different companies and strategies, but
0:25:17 most people have converged upon something that has elements of both. I’m speaking with Yash Raj Narang.
0:25:23 Yash is a Senior Research Manager at NVIDIA and the head of the Seattle Robotics Lab. And we’ve been
0:25:29 talking about all things, robots, AI, um, simulation, which we’ll get back to in a second, but we were just
0:25:36 talking about different styles, different approaches to robotics learning. I wanted to go back to earlier
0:25:41 in the conversation when you mentioned, you know, going into the, uh, factory and seeing all these
0:25:47 different robots doing these kinds of things. And even before that, your definition of what a robot
0:25:52 is or is not. And thinking about that, I’m getting to thinking about asking you to define
0:25:57 sort of the difference between traditional and humanoid robots. And I’m thinking traditional,
0:26:04 like robot arms in a factory. You know, I have fuzzy probably images from sci-fi movies
0:26:09 when I was a kid and stuff like that. Right. And humanoid robots. And I mentioned this earlier back
0:26:15 during GTC, I had the chance to sit down with the CEO of 1X, and we talked all about
0:26:22 humanoid robots. So maybe you can talk a little bit about this traditional robots, humanoid robots,
0:26:29 what the difference is, and maybe why we’re now starting to see more robots that look like humans,
0:26:31 and whether or not that has anything to do with functionality.
0:26:37 Yeah, absolutely. So one of your earlier questions too, is kind of how is, you know,
0:26:43 how has robotics changed recently? I think this is just, uh, another fantastic example of that.
0:26:49 It’s been unbelievable over the past few years to see the explosion of interest and progress in
0:26:55 humanoid robotics. And, you know, to be fair, actually companies like Boston Dynamics and, uh,
0:27:00 Agility Robotics, for example, have been working on this since, you know, probably the
0:27:06 mid, maybe even early 2010s. Yeah. And, uh, you know, so they made continuous progress on that.
0:27:12 And everybody was always really, you know, excited and inspired to see their demo videos and so on.
0:27:16 Can I interrupt you to ask a really silly question, but now I need to know, is there a word we say
0:27:22 humanoid robots, right? Is there a word for a robot that looks like a dog? Because the Boston Dynamics
0:27:26 makes me think of those early Atlas, I think those early videos. Yeah. Yeah. Yeah. Um,
0:27:32 and I think Boston Dynamics used to, uh, uh, uh, you know, they, they had a dog-like robot,
0:27:35 which was called BigDog. Um, you know, okay. Yeah. Yeah. Sometime back then, which is, you know,
0:27:41 maybe why that term came about. People typically refer to them as quadrupeds, or four-legged robots, right?
0:27:46 Got it. Got it. Thank you. Yeah. So, um, yeah, where were we? So, uh,
0:27:49 traditional robots versus humanoids. So there’s been an explosion of interest in humanoids,
0:27:54 particularly over the past few years. And I think it was just this, this perfect storm of factors where
0:27:58 there was already a lot of, um, excitement being generated by some of the original players in this
0:28:05 field. Folks like, uh, Tesla got super interested in humanoid robotics, I think 2022, 2023. And it
0:28:11 also coincided with this, um, explosion of advancement in intelligence through LLMs,
0:28:17 VLMs and early signals of that in, in, in robotics. And so I think, you know, there’s a group of people,
0:28:22 you know, forward-thinking people, uh, Jensen very much included. This is near and dear to his heart.
0:28:27 That felt that, um, the time is right for this dream of humanoid robotics to finally be realized,
0:28:32 right? You know, let’s, let’s actually go for it. And, you know, this, this begs the question of why,
0:28:37 why humanoids at all, you know, why have people been so interested in humanoids? Why do people believe in
0:28:43 humanoids? And I think that the most common answer you’ll get to this, which I believe makes a lot of
0:28:49 sense is that the world has been designed for humans. You know, we have built everything for us,
0:28:58 for our form factors, for our hands. And if we want robots to operate alongside us in places that we go to
0:29:05 every day, you know, in our home, in the office and so on, we want these robots to, to have our form.
0:29:10 And in doing so, they can do a lot of things, ideally that we can, we can go up and down stairs
0:29:16 that were really built for the dimensions of our legs. We can open and close doors that are located
0:29:21 at a certain height and have a certain geometry because they’re easy for us to grab. Humanoids
0:29:27 could, you know, manipulate tools like hammers and scissors and screwdrivers and pipettes. If you’re in
0:29:32 a lab, um, these sorts of things, which were built for our hands. And so that’s really the,
0:29:38 the fundamental argument about why humanoids at all. And it’s been amazing to see this iterative
0:29:43 process where there’s advancements in the intelligence and advancements in, in the hardware.
0:29:49 So basically the, the body and the brain and kind of going back and forth and just seeing,
0:29:53 for example, the amount of progress that’s, that’s been happening over the past couple
0:29:58 of years in developing really high quality robotic hand hardware. It’s, it’s kind of amazing.
0:30:02 So that’s really kind of, you know, my understanding of the story and kind of the fundamental argument
0:30:07 behind, uh, behind humanoid robots. Right. But, but I definitely see, I would say,
0:30:11 I see a future where these things actually just coexist, traditional and humanoid. Yeah.
0:30:16 So earlier we were talking about the importance of simulation, creating world environments where
0:30:22 robots can, can explore, can learn all the different approaches to that. And I think we touched on
0:30:29 this a little bit, but can you speak specifically to the role of simulated or synthetic data versus
0:30:33 real world data? It’s something we touched upon. And again, listeners, the more we talk about
0:30:38 it, I feel like all these recent episodes are sort of coming together. We’re talking about, you know,
0:30:46 the increasing role of AI broadly generating tokens for other parts of the system to use and all of that.
0:30:52 So when it comes to the world of robotics, simulated data, real world data, how do they work? How do they
0:30:58 coexist? Yeah. So first I’d like to say that in contrast with a number of other areas like language
0:31:06 and vision, robotics is widely acknowledged to have a data problem. So there is no internet scale corpus of
0:31:14 robotics data. And so that’s really why so many people in robotics are very, very interested in simulation
0:31:19 and specifically using it to generate synthetic data. So that’s, that’s basically the idea is that
0:31:25 simulation can be used to have high fidelity renderings of the world. They can be used to
0:31:31 do really high quality physics simulations, and they can be used as a result to generate a lot of data that
0:31:36 would just be totally intractable to collect in the real world. And real world data is, you know,
0:31:42 generally speaking, your source of ground truth. It doesn’t have any gap with respect to the real world,
0:31:47 because it is the real world, but it tends to be much harder to scale. You know, in contrast with
0:31:52 autonomous vehicles, for example, robotics doesn’t really have a car at the moment. There aren’t fleets
0:31:58 of robots that everybody has access to. You can’t put a dash cam on the, uh, those little food delivery
0:31:59 robots and get the data you need. Yeah.
0:32:05 Even if you could, you know, will it be nearly enough data? The answer is probably no, you know,
0:32:11 to, to train general intelligence. You know, that’s kind of why people are really attracted to the idea
0:32:17 of using simulation to, to generate data. And real world data, um, whenever you can get it, it’s the ideal
0:32:20 source of data. Um, but it’s just really, really difficult to scale.
0:32:26 So you mentioned, you know, using real world data, there’s no gap. We’ve talked about the
0:32:32 sim to real gap in other contexts. How do you close it in robotics? What’s the importance of it? Where
0:32:37 are we at? And you talked about it a little bit, but get into the gap a little more and what we can do
0:32:43 about it. Sure. So sim to real gap. So there are different areas in which simulation is typically
0:32:48 different from the real world. So one is, you know, on the perception side, literally, you know,
0:32:53 the visual qualities of simulation are very different from the real world. Simulation looks
0:32:59 different, um, often from the way the real world does. Uh, so that’s, that’s one source of gap.
0:33:04 Another source of gap is really on the physics side. So, um, for example, in the real world,
0:33:10 you might be, you know, trying to, um, manipulate something, pick up something that is very,
0:33:15 very, very flexible and your simulator might only be able to model rigid objects, you know,
0:33:20 or rigid objects connected by joints. And, you know, even if you had a perfect model in your
0:33:25 simulator of whatever you’re trying to move around or manipulate, you still have to figure out like,
0:33:30 what are the parameters of that model? You know, what, what is the stiffness of this thing that I’m
0:33:35 trying to, to move around? What is the mass? Um, what are the inertia matrices and these other properties?
0:33:40 So physics is, is just another gap. And then there are other factors, things like
0:33:45 latencies. So in the real world, you might have different sensors, um, that are streaming
0:33:52 data at different frequencies. And in simulation, you may not have modeled all of the complexities of
0:33:56 different, again, different sensors coming into different frequencies. Your control loop may be
0:34:01 running at a particular frequency. And these things may have a certain amount of jitter or delay
0:34:04 in the real world, which you may or may not model in simulation.
0:34:05 Right. Okay.
0:34:09 Um, so these are just a few examples of areas where you, you know, it might be quite different
0:34:17 between simulation in the real world. And generally speaking, the ways around this are you either, um,
0:34:23 spend a lot of time modeling the real world, really capturing the visual qualities and the physics
0:34:29 phenomena and the physics parameters and the latencies and putting that in simulation. But that can take a lot of
0:34:36 time and effort. Another approach is, you know, called domain randomization or dynamics randomization.
0:34:43 And the idea is that you can’t possibly identify everything about the real world and put it into
0:34:49 simulation. So whenever I’m doing learning on simulated data, let me just randomize a lot of
0:34:56 these properties. So I want to train a robot that can, um, you know, pick up a mug or, you know,
0:35:01 put two parts together and, um, it should work in any environment. It shouldn’t, shouldn’t really
0:35:06 matter what the background looks like. So let me just take my simulated data and randomize the background
0:35:11 in, in many, many, many different ways. Um, and you can do similar strategies for physics models
0:35:16 as well. You can randomize different parameters of physics models. And then there’s also another
0:35:21 approach, which is really focused on domain adaptation. So I really care about a particular
0:35:27 environment, um, in which I want to deploy my robot. So let me just augment my simulated data
0:35:32 to be reflective of that environment, right? You know, let me make my simulation look like
0:35:36 an industrial work cell, or let me make it look like my home because I know I’m going to have my
0:35:41 robot operate here. And maybe the final approach is kind of, you know, this, this thing called domain
0:35:47 invariance. So there’s randomization adaptation and invariance, um, which is basically the idea that
0:35:53 I’m going to remove a lot of information that is just not necessary for learning. You know, um, if, uh,
0:35:58 maybe if I’m, if I’m picking up certain objects, I only need to know about the edges of these objects.
0:36:04 I don’t need to know what color they are, for example. So, you know, taking that idea, um, and incorporating
0:36:08 it into the learning process and making sure that my, my networks themselves, or my data
0:36:13 might be transformed in a way that it’s no longer reliant on these, these things that don’t matter.
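A minimal sketch of the domain and dynamics randomization idea described above. The parameter names and ranges are illustrative placeholders, not values or APIs from Isaac Sim or any particular simulator:

```python
import random

def sample_randomized_params():
    """Sample one set of randomized simulation parameters for an episode.

    The parameter names and ranges are purely illustrative; in practice they
    would map onto whatever your simulator actually exposes.
    """
    return {
        # Visual randomization: the background shouldn't matter for the task.
        "background_texture": random.choice(["wood", "concrete", "cloth", "noise"]),
        "light_intensity": random.uniform(0.3, 2.0),
        # Dynamics randomization: cover physics parameters we can't identify exactly.
        "object_mass_kg": random.uniform(0.2, 0.6),
        "object_friction": random.uniform(0.4, 1.2),
        "joint_damping_scale": random.uniform(0.8, 1.2),
        # Latency randomization: mimic jitter in real sensing and control loops.
        "action_delay_steps": random.randint(0, 3),
    }

# Re-randomize before every rollout so the learned policy cannot overfit to any
# single visual or physical configuration of the simulator.
for episode in range(3):
    print(episode, sample_randomized_params())  # in practice: apply to the sim, then roll out
```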
0:36:19 Yeah. I’m thinking about all of the data coming in and, you know, all the things that can be
0:36:25 captured by the sensors and using video to train. And earlier you were talking about the problem,
0:36:29 and it made me think of reasoning models, the problem of, you know, can you give a robot a task
0:36:33 and can it break it down and reason its way and then actually execute and do it?
0:36:39 Reasoning VLA models have been talked about a lot recently. I keep hearing about them anyway.
0:36:43 Can you talk a little bit about what they are and how they’re used in robotics?
0:36:48 Yeah, absolutely. So reasoning itself, you know, just stepping back for a second,
0:36:51 reasoning is an interesting term because it means many things to many different people.
0:36:51 Yeah.
0:36:56 I think a lot of people think about things like logic and causality and common sense and so on,
0:37:01 you know, different types of, of reasoning. And you can, you can use those to draw conclusions about,
0:37:07 about the world. Reasoning in the context of LLMs and VLMs and now VLAs. So vision language
0:37:13 action models that produce actions as outputs often means, you know, in, in simple terms,
0:37:19 thinking step by step. In fact, if you go to ChatGPT and you say, here’s my question, you know,
0:37:25 show me your work or think step by step, it will do this form of reasoning. And so that’s the idea is
0:37:31 that you can often have better quality answers or better quality training data if you allow these
0:37:37 models to actually engage in a multi-step thinking process. And that’s kind of the essence of reasoning
0:37:41 models. And reasoning VLAs are no exception to that. Okay.
0:37:47 So I might give a robot a really hard task, like setting a table. And maybe I want my VLA to now
0:37:52 identify what are all the subtasks involved in order to do that. And within those subtasks,
0:37:59 what are all the smaller scale trajectories that I need to generate and so on. So this is kind of the
0:38:04 essence of the reasoning VLA. Got it. Right. So to start to wrap up here,
0:38:08 I was going to ask you, I am going to ask you to sort of, in a way it’s kind of summarizing what we’ve
0:38:14 been talking about, but maybe to put kind of a point on what you think sort of the, the most
0:38:19 important current limitations are to robotic learning that, you know, we’re working, you’re working,
0:38:23 you and your teams and folks in the community are working to overcome. You mentioning setting the
0:38:29 table, though, made me think, you know, a better way to ask that, how far are we from laundry folding
0:38:35 robots? Like, am I, am I going to, I, I’m the worst at folding laundry and I always see demos.
0:38:41 And I heard at some point that, you know, folding laundry sort of represents conceptually
0:38:48 a very difficult task for a robot. Am I going to see it soon before my kids go off to school?
0:38:56 I think you might see it soon. I I’ve seen some really impressive work, uh, coming out recently,
0:39:01 you know, from various companies and demos within NVIDIA on things like laundry folding.
0:39:08 Yeah. And, you know, the general process that people take is to collect a lot of demonstrations
0:39:14 of people actually folding laundry and then use imitation learning paradigms or variants,
0:39:22 try to learn from those demonstrations. And this ends up actually being, if you have the right kind of
0:39:28 data and enough data and the right model architectures, you can actually learn to do these things quite
0:39:34 well. Now, the classic question is how well will it generalize? If I learn to fold, you know,
0:39:38 if I have a robot that can fold my laundry, can it fold your laundry? Right, right.
0:39:43 The typical answer to that is you probably need some amount of data that’s in the setting that you
0:39:48 actually want to deploy the robot in. And then you can, you can fine tune these models. But I would say
0:39:53 we’re actually pretty, we’re getting closer and closer and closer than certainly I’ve ever seen on
0:39:59 tasks like laundry folding. I’m excited. I’m excited. That’s you, you, you’ve, you’ve got me optimistic.
0:40:05 Uh, thank you for that. So perhaps to get back to the more general, uh, conversation of interest,
0:40:12 the current limitations, what do you see them as? And you know, what’s the prognosis on, on getting
0:40:20 past them? Sure. I think one big one is, people feel, I would say the community as a whole is really
0:40:25 optimistic about the role of simulation in robotics, or at least most of the community is. Simulation can
0:40:31 take different forms. It can take kind of the physics simulation approach, or it can take this,
0:40:36 you know, video generation. Like, let me, let me just, um, predict what the world will look like.
0:40:41 And these are really, you know, really thriving paradigms. And I think two questions around that,
0:40:46 one that we just talked about, which is the sim to real gap. So I think the sim to real gap is people
0:40:50 have made a lot of progress on. It’s something we’ve worked very hard on at NVIDIA, but there’s still
0:40:55 a lot more progress to be made, you know, until we can truly generate data and experience
0:41:00 and simulation and have it transferred to the real world without having to, you know, put a lot of
0:41:05 thought and engineering into truly making it work. And conversely, there’s, there’s the real to sim
0:41:09 question. So building simulators is really, really difficult. You again, have to, you know,
0:41:15 design your scenes and design your 3d assets and so on. Wouldn’t it be great if we could just take some
0:41:21 images or take some videos of the real world and instantly have a simulation that also has physics
0:41:25 properties. It doesn’t just have the visual representation of the world, but it has realistic
0:41:31 masses and friction and these other properties. So sim to real and real to sim, I think are two big
0:41:36 challenges and we’re just getting closer and closer, you know, every few months on, on solving those
0:41:39 problems. And then the boundaries between sim and real, I think, will start to be a little bit blurred,
0:41:44 which, which is kind of maybe an interesting possibility. I think that’s one big thing.
0:41:50 And the second big thing I’d say for now is, is the data question. Again, robotics, as we’re talking
0:41:57 about it here, doesn’t have the equivalent of a car. There is no fleet of robots that everybody has access
0:42:04 to that can be used to collect a ton of data. And, uh, until that exists, I think we have to think a
0:42:10 lot more about where we’re going to get that data from. And one thing that the group effort at NVIDIA,
0:42:16 which is around humanoids, has proposed is this idea of the data pyramid, um, where you basically have,
0:42:21 you know, at the base of the pyramid, things like videos, YouTube videos that you’re trying to learn
0:42:25 from. And then maybe a little bit higher in the pyramid, you have things like synthetic data that’s
0:42:28 coming from different types of simulators. And then maybe at the top of the pyramid,
0:42:32 you have something like data that’s actually collected in the real world. And then the
0:42:37 question is, what is the right mixture of these different data sources to give robots this, you
0:42:38 know, general intelligence?
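One way to read the data pyramid is as a sampling mixture over sources of very different scale and cost. The weights below are made-up placeholders, not numbers from NVIDIA's humanoid effort:

```python
import random

# Hypothetical mixture weights over the three tiers of the data pyramid.
# The actual proportions are an open research question; these are placeholders.
DATA_MIXTURE = {
    "web_video": 0.70,      # base: huge, but no robot actions attached
    "synthetic_sim": 0.25,  # middle: simulator-generated trajectories
    "real_robot": 0.05,     # top: scarce, expensive real-world demonstrations
}

def sample_source(mixture=DATA_MIXTURE):
    """Pick which data source the next training batch is drawn from."""
    sources, weights = zip(*mixture.items())
    return random.choices(sources, weights=weights, k=1)[0]

# Sketch of use inside a training loop:
counts = {name: 0 for name in DATA_MIXTURE}
for _ in range(1000):
    counts[sample_source()] += 1
print(counts)  # roughly proportional to the mixture weights
```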
0:42:44 So, Yash, as we’re recording this, uh, CoRL is coming up. Let’s end on that forward-looking note,
0:42:49 and it’ll be a good segue for the audience to go check out what CoRL is all about. But tell us what
0:42:54 it’s about and what, uh, your and NVIDIA’s participation is going to be like this year.
0:42:59 Yeah, absolutely. So, um, CoRL, uh, stands for the Conference on Robot Learning. Um, and it started
0:43:05 out as a small conference. I think, you know, 2017 was maybe the first edition of it.
0:43:10 And it’s grown tremendously. It’s one of the hottest conferences in robotics research now, um,
0:43:16 as learning itself as a paradigm has really taken off. This year, it’s going to be in, uh, in Seoul,
0:43:22 in Korea, which is extremely exciting. Yeah. And it’s going to bring together, uh, the robotics
0:43:26 community, the learning community and the intersection of those two communities. Um, and so, you know,
0:43:31 I think everybody in robotics is looking forward to this. Our participation, you know, the Seattle
0:43:37 Robotics Lab and other research efforts at NVIDIA, for example, the, the GEAR lab, which focuses on
0:43:42 humanoids, will be presenting a wide range of papers. And so we’re going to be giving talks
0:43:47 on those papers, presenting posters on those papers, hopefully some, some demos. And, you know,
0:43:51 we’re just going to be really excited to talk with, uh, with researchers and, uh, you know,
0:43:56 people who are interested in joining us in our missions. Fantastic. Any of those posters and
0:44:00 papers, uh, you’re excited about in particular, maybe you want to share a little teaser with us.
0:44:05 Yeah. I’m, I’m excited about a number of them, but one that I can just call out for now
0:44:10 that I work closely on is this, this project called neural robot dynamics. Um, so that’s the name of the
0:44:16 paper. Um, and we, we have, you know, abbreviated that to nerd. I was going to ask. I’m glad.
0:44:22 Um, so it’s, yeah, it’s just, uh, any RD also kind of inspired by neural radiance fields. Um,
0:44:26 you know, right, right. Of course. Yeah. So we had this, uh, framework and these models,
0:44:33 um, which we call nerd. And the idea is basically, um, that classical simulation. So typical physics
0:44:39 simulators kind of work in this way where they are, you know, performing these explicit computations
0:44:45 about here are my joint torques of the robot. Here’s some external forces. Here’s some contact
0:44:51 forces. Um, and let’s predict the next state of the robot. And the idea behind neural simulation is,
0:44:57 can we capture that all with a neural network? And so that, you know, you might be wondering,
0:45:02 why would you want to do that? And there are some, uh, some advantages to this. So one is that,
0:45:07 you know, neural networks are inherently differentiable. And what that means is that you
0:45:13 can understand if you slightly change the inputs to your simulator, what would be the change in the
0:45:19 outputs. And if you know this, um, then you can perform optimization. You can figure out how do
0:45:25 I optimize my inputs to get the robot to do something interesting. So, um, neural networks
0:45:31 are inherently differentiable. And if you can capture, um, a simulator in this way, um, you can essentially
0:45:36 create a differentiable simulator, um, for free, which is kind of, which is kind of exciting. Another
0:45:42 thing, which is, um, really exciting to us is fine-tunability. So it’s very difficult if you’re
0:45:47 given a simulator and you want, and you have some set of real world data that you collected on that
0:45:52 particular robot that you’re simulating to actually figure out how should I modify the simulator to
0:45:59 better predict that real world data. And neural simulators, um, can kind of do this very,
0:46:04 very naturally. You can fine tune them just like any other neural network. So I can train a neural network
0:46:10 on some, uh, simulated data and then collect some amount of real world data and then fine tune it.
0:46:15 And this process can be continuous. You know, if my robot changes over time or there’s wear and tear,
0:46:21 I can continue fine tuning it and always have this really accurate, you know, simulator of that robot,
0:46:25 which is pretty exciting. Yeah. That’s really cool. Yeah. I think, I think it’s really cool.
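A minimal sketch of the neural-simulation idea described here, not the actual NeRD architecture: an ordinary network maps the current state and applied action to the predicted next state, is pre-trained on simulated transitions, fine-tuned on a much smaller batch of real transitions, and is differentiable end to end. Dimensions and training data are illustrative stand-ins:

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 14, 7  # e.g., joint positions/velocities and joint torques

# Learned dynamics: (state, action) -> predicted next state.
dynamics = nn.Sequential(
    nn.Linear(STATE_DIM + ACTION_DIM, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, STATE_DIM),
)

def fit(model, states, actions, next_states, steps, lr):
    """Fit the model to (state, action, next_state) transitions."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        pred = model(torch.cat([states, actions], dim=-1))
        loss = nn.functional.mse_loss(pred, next_states)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

# 1) Pre-train on plentiful simulated transitions (random tensors stand in here).
sim_s, sim_a, sim_s2 = torch.randn(4096, STATE_DIM), torch.randn(4096, ACTION_DIM), torch.randn(4096, STATE_DIM)
fit(dynamics, sim_s, sim_a, sim_s2, steps=200, lr=1e-3)

# 2) Fine-tune on a much smaller set of real-robot transitions; repeat over time
#    as the hardware wears or changes.
real_s, real_a, real_s2 = torch.randn(256, STATE_DIM), torch.randn(256, ACTION_DIM), torch.randn(256, STATE_DIM)
fit(dynamics, real_s, real_a, real_s2, steps=50, lr=1e-4)

# 3) Differentiability for free: gradients of the predicted next state with
#    respect to the action enable gradient-based optimization of inputs.
state = torch.randn(1, STATE_DIM)
action = torch.randn(1, ACTION_DIM, requires_grad=True)
dynamics(torch.cat([state, action], dim=-1)).sum().backward()
print(action.grad.shape)  # torch.Size([1, 7])
```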
0:46:30 And a third advantage, which we are sort of in the early stages of exploring is really on the speed
0:46:37 side. So, um, a lot of compute, uh, today, as, as many people know, it’s been really optimized for
0:46:44 AI workloads and specific types of, uh, mathematical operations, specific types of matrix multiplications,
0:46:51 for example, that are very common in neural networks. And if you can transform a typical simulator
0:46:56 into a neural network, then you can, you can really take advantage of all of these speed benefits that
0:47:01 come with the latest compute and with the latest software built on top of that. Um, so that’s
0:47:08 really exciting to us. And we sort of did this project in a way that allows these neural models
0:47:13 to really generalize. So for a given robot, if you put it in a new place, you know,
0:47:17 in the world, or you change some aspects of the world, this model can still make accurate
0:47:23 predictions and it can make accurate predictions over a long time scale. Amazing. For listeners who
0:47:29 would like to follow the progress at CoRL in particular, the Seattle Robotics Lab in particular,
0:47:35 NVIDIA more broadly, uh, where are some online places, some resources you might direct them to?
0:47:40 Yeah. I’d say the CoRL website itself is probably your, um, you know,
0:47:46 your primary source of information. So you’ll find the program for CoRL. You’ll find, um,
0:47:52 you know, links to actually watch some of the talks at CoRL. You’ll be able to have links to papers
0:47:56 and you’ll see the range of workshops that are going to be there. And a lot of them, I’m sure,
0:47:59 will post recordings of these workshops. So that’s a great way to get involved.
0:48:01 And that’s just corl.org for the listeners.
0:48:06 Yes. Yes, that’s right. Yeah. Um, you know, your website as well. I’m sure we’ll have updates
0:48:10 on the website and through NVIDIA social media accounts. Noah, you could probably call out to
0:48:15 those. Um, I’m sure there’s going to be plenty of, uh, updates on CoRL over the next, you know,
0:48:19 the next period of time. Can I ask you as a parting shot here,
0:48:24 predict the future for us. What does the future of robotics look like? You can look out a couple of
0:48:28 years, five years, 10 years, whatever timeframe makes the most sense. And you know, we, we won’t
0:48:32 hold you to this, but, but what do you think about when you think about the future of all this?
0:48:37 Yeah, I think it comes down to those fundamental questions. So, you know, one is kind of what,
0:48:42 what will the bodies of robots look like? So this is kind of what you touched on with, uh,
0:48:47 you know, robot arms and factories versus humanoids. And I think what you’ll see is that there’ll be a
0:48:52 place for both. So, you know, robot arms and more traditional looking robots will still operate in
0:48:58 environments that are really built for them or need an extremely high degree of optimality.
0:49:05 And humanoids will really operate in environments where they need to, uh, actually be, you know,
0:49:11 um, alongside humans and, you know, in your household and in your office and so on around
0:49:16 many, many things that have been built for humans. So I, I kind of see that as the future of, of the
0:49:22 body side of things on the brain side of things. There’s also these questions of, you know, modular
0:49:28 versus end to end paradigms. And what I’ve seen in autonomous vehicles is of course, as we talked
0:49:33 about before, starting with modular, um, swinging to end to end, you know, starting to converge on
0:49:38 something in the middle. And I can imagine that robotics, as we’re talking about here, for example,
0:49:45 robotic manipulation will start to follow a similar trajectory where we will explore end to end models,
0:49:50 and then probably converge upon hybrid architectures until we collect enough data that an end to end
0:49:55 model is, is actually all we need. You know, that’s kind of how I see those aspects. There are some
0:50:00 other questions, for example, are we going to have specialized models or are we just going to have
0:50:03 one big model that solves everything? Right.
0:50:06 You know, that one is a little bit hard to predict, but I would say that, again, there’s
0:50:12 probably a role for both, where we’re going to have specialized models for very specific,
0:50:18 domain-specific tasks and where, for example, power or energy limits are very significant.
0:50:24 And you’re going to have sort of these generalist models in other domains where you need to do a
0:50:29 lot of different things and you need a lot of common-sense reasoning to solve tasks. Yeah,
0:50:33 I would say those are some open debates, and that would be my prediction.
0:50:37 And then maybe one other thing that you touched on was simulation versus the real world. And again,
0:50:42 I kind of see this as one of the most exciting things. I’d love to see how this unfolds,
0:50:47 but I really feel that the boundaries between simulation and the real world will start to be blurred.
0:50:53 The sim-to-real problem will be more and more solved. And the real-to-sim problem will also be
0:50:58 more and more solved. And so we’ll be able to capture the complexity of the real world and make
0:51:03 predictions in a very fluid way, perhaps using a combination of physics simulators and
0:51:06 these world models that people have been building, like Cosmos.
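To make the modular-versus-end-to-end contrast from this answer concrete, here is a minimal, purely illustrative Python sketch. Everything in it is hypothetical (the stage functions, the stand-in policy, and the data layout are invented for this example and are not from Isaac, Omniverse, or any NVIDIA API): the modular loop wires explicit perceive, plan, and act stages together, while the end-to-end loop maps raw observations straight to motor commands with a single learned function.

```python
# Illustrative only: a toy contrast between a modular perceive-plan-act loop
# and an end-to-end policy. All names here are hypothetical, not a real API.
from dataclasses import dataclass
from typing import Dict, List
import random


@dataclass
class Observation:
    camera: List[float]  # stand-in for image pixels
    joints: List[float]  # stand-in for current joint angles


# --- Modular paradigm: explicit perception, planning, and control stages ---
def perceive(obs: Observation) -> Dict[str, List[float]]:
    """Turn raw sensing into a world estimate (e.g., an object pose)."""
    return {"object_pose": obs.camera[:3]}


def plan(world: Dict[str, List[float]],
         goal: Dict[str, List[float]]) -> List[List[float]]:
    """Produce a short sequence of target configurations (waypoints)."""
    return [world["object_pose"], goal["target_pose"]]


def act(waypoint: List[float], obs: Observation) -> List[float]:
    """Controller: convert the next waypoint into per-joint motor commands."""
    return [0.1 * (w - j) for w, j in zip(waypoint, obs.joints)]


def modular_step(obs: Observation, goal: Dict[str, List[float]]) -> List[float]:
    world = perceive(obs)
    waypoints = plan(world, goal)
    return act(waypoints[0], obs)


# --- End-to-end paradigm: one learned mapping from observations to commands ---
def end_to_end_policy(obs: Observation) -> List[float]:
    """Stand-in for a trained network mapping pixels and joints to commands."""
    return [random.uniform(-1.0, 1.0) for _ in obs.joints]


if __name__ == "__main__":
    obs = Observation(camera=[0.4, 0.2, 0.1, 0.9], joints=[0.0, 0.0, 0.0])
    goal = {"target_pose": [0.5, 0.3, 0.2]}
    print("modular command:   ", modular_step(obs, goal))
    print("end-to-end command:", end_to_end_policy(obs))
```

The point of the sketch is structural: in the modular loop each stage can be developed, debugged, and certified on its own, while the end-to-end function hides those boundaries inside a single model, which is the trade-off described earlier in the conversation; a hybrid would keep something like the planner while replacing the hand-engineered stages with learned components.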
0:51:11 That’s an amazing future. Yash, thank you so much. This has been an absolute pleasure. And I
0:51:15 know you have plenty to get back to, so we appreciate you taking the time out to come on the podcast.
0:51:21 All the best with everything, and enjoy CoRL. Can’t wait to follow your progress and read all about it.
0:51:23 Thank you so much, Noah. It’s been a pleasure.
Yashraj Narang, head of NVIDIA’s Seattle Robotics Lab, reveals how the three-computer solution (DGX for training, Omniverse and Cosmos for simulation, and Jetson AGX for real-time inference) is transforming modern robotics. From sim-to-real breakthroughs to humanoid intelligence, discover how NVIDIA’s full-stack approach is making robots more adaptive, capable, and ready for real-world deployment.
Learn more at ai-podcast.nvidia.com.