How Two Stanford Students Are Building Robots for Handling Household Chores – Ep. 224

AI transcript
0:00:00 [MUSIC]
0:00:10 Hello, and welcome to the NVIDIA AI podcast.
0:00:13 I’m your host, Noah Kravitz.
0:00:15 We’re recording from NVIDIA GTC24 back live and
0:00:18 in person at the San Jose Convention Center in San Jose, California.
0:00:23 And now we get to talk about robots.
0:00:25 With me are Eric Li and Josiah David Wong, who are here at the conference to help
0:00:29 us all answer the question, what should robots do for us?
0:00:32 They’ve been teaching robots to perform a thousand everyday activities.
0:00:36 And I, for one, cannot wait for a laundry-folding assistant,
0:00:39 maybe with a dry sense of humor, to become a thing in my own household.
0:00:42 So let’s get right into it.
0:00:44 Eric and Josiah, thanks so much for taking the time to join the podcast.
0:00:48 How’s your GTC been so far?
0:00:49 I know that you hosted a session bright and early on Monday morning,
0:00:53 had a couple days since.
0:00:54 How’s the week treating you?
0:00:55 >> Yeah, thanks, Noah.
0:00:57 Our GTC has been going really well.
0:00:59 Thanks for inviting us to the podcast.
0:01:01 We had a really great turnout yesterday.
0:01:04 People have been very engaged and
0:01:05 people also asked a bunch of questions towards the end in the Q&A session.
0:01:08 I guess people joined because they’re really tired of household chores.
0:01:11 >> Who’s not, right?
0:01:12 >> Yeah, common problem.
0:01:13 >> So before we get a little deeper into your session and
0:01:16 what you’re doing with training robots, maybe we can start a little bit of
0:01:20 background about yourselves, who you are, what you’re working on and where, and
0:01:24 we’ll go from there.
0:01:25 >> Yeah, my name is Chengshu Li, and I also go by Eric.
0:01:27 I’m a fourth-year PhD student at the Stanford Vision and Learning Lab,
0:01:30 advised by Professor Fei-Fei Li and Professor Silvio Savarese.
0:01:34 In the past couple of years, I’ve been working on building simulation platforms
0:01:38 and developing robotics algorithms for robots to solve household tasks.
0:01:42 >> Yeah, and I’m Josiah.
0:01:44 Similar to Eric, I’m one year behind him, so I’m a third-year PhD student,
0:01:48 also advised by Fei-Fei Li.
0:01:49 And similar to him, I’ve also been working on this BEHAVIOR project that we’re
0:01:52 going to talk about today for the past couple of years.
0:01:55 And I’m really excited to, I don’t know, see robots working in real life.
0:01:58 And we’re hoping that this is a good milestone towards that goal.
0:02:01 >> Excellent.
0:02:02 Before we dive in, for those out there thinking like these guys are studying
0:02:06 right at the heart of it all and they’re in the lab and
0:02:08 they’ve got these amazing advisors, I’m going to put you on the spot.
0:02:11 One thing people might find surprising or interesting or
0:02:15 fun about the day-to-day life of a PhD student researcher working in the Stanford
0:02:22 Vision and Learning Lab?
0:02:23 >> One interesting thing, I’ll say for me, is that
0:02:28 I didn’t expect it to be this collaborative.
0:02:29 It might be unique to our project, but I sort of imagined that when you do a PhD,
0:02:33 you just grind away on your own in a sad corner of a room with no windows
0:02:37 and never see any sunlight. But our room is really beautiful.
0:02:40 And I think we get to hang out.
0:02:42 So I think I’m really lucky to have people that I can call my friends as well as
0:02:45 lab mates, and we also get to work closely together.
0:02:47 So I think it’s something I wouldn’t have thought that I would have at a place
0:02:51 like Stanford, I guess.
0:02:52 >> Yeah, that’s awesome.
0:02:53 >> How about you?
0:02:54 >> Exactly, I want to echo that.
0:02:55 I think it’s partly because of the nature of our work, which involves a very
0:02:59 complex and immense amount of work, so we had to assemble a team of a few
0:03:03 dozen people, which is very uncommon in an academic lab setup.
0:03:08 So it feels to me that it works very much like a fast-paced
0:03:12 startup where people share the same goal and people have different skillsets
0:03:16 complementing each other.
0:03:17 So yeah, I think we’ve had a great run so far.
0:03:21 >> Very cool.
0:03:21 Community is always a good thing.
0:03:23 So let’s talk about your work.
0:03:24 Should we start with the session, or do you want to start further back with the
0:03:30 work you’re doing in the lab and what led up to the session?
0:03:32 What’s the best way to talk about it?
0:03:34 >> Yeah, I guess we can start maybe two, three years back, when we first
0:03:39 had this preliminary idea of what our project is, which is called BEHAVIOR.
0:03:43 Our professor, Fei-Fei Li, had previously built this amazing benchmark called ImageNet
0:03:49 in the computer vision community, which essentially accelerated the progress in
0:03:53 that field by setting a benchmark where everybody can compete fairly and
0:03:57 in a reproducible way and push the whole vision field forward.
0:04:02 We were seeing that in the robotics field, on the other hand,
0:04:06 because of the involvement of hardware,
0:04:09 each academic paper seems to be a little bit segregated on its own.
0:04:13 They will work on a few tasks that are different from each other, and
0:04:16 it’s really, really hard to compare results and move the field forward.
0:04:20 So we started this project thinking that we should hopefully establish a common
0:04:27 ground, a simulation benchmark that is very accessible, very useful,
0:04:31 and that everybody can use. It has to be large scale so
0:04:34 that if something works on this benchmark, hopefully it shows some general capability.
0:04:39 And it should be human-centered, like the robots should work on tasks that
0:04:43 actually matter to our day-to-day life.
0:04:45 It shouldn’t be some very contrived example that us researchers came up with
0:04:50 and, in fact, maybe nobody cares about.
0:04:52 So that’s very important.
0:04:53 So we set out to build this benchmark that we have been working on for the past couple of years.
0:04:57 Why a simulated environment?
0:05:02 Why not just start working with robots, training them out in the real world?
0:05:05 And as the hardware and the software and the systems that drive the robots
0:05:09 increase in capacity, you can do more.
0:05:12 Why work with simulations?
0:05:13 Yeah, that’s a great question.
0:05:15 I think we get this question all the time, and there are a couple of answers.
0:05:18 I think one is that I think, to Eric’s earlier point,
0:05:21 I think the hardware is not quite where the software is currently.
0:05:23 So we have all these really powerful brains with
0:05:26 ChatGPT and stuff that can generate really generalizable, really rich content.
0:05:30 But you don’t have the hardware to support that yet.
0:05:32 And so I think part of the issue is that it’s expensive to sort of iterate on that.
0:05:35 Sure.
0:05:36 And along those lines, I think there’s the safety component where,
0:05:39 because like Eric was mentioning, a lot of the tasks we want to care about are the ones
0:05:43 that are human-centric.
0:05:44 It’s like your household tasks where you want to fold laundry or do the dishes
0:05:47 or stuff where you would probably have humans or multiple humans in the vicinity
0:05:51 of the robot and you don’t want a researcher to be like trying to hack
0:05:54 together an algorithm and then it just lashes out and hits you in the face.
0:05:57 And that’s just, you know, you’re going to get sued into the ground.
0:05:59 So I think simulation provides a really nice way for us to be able to prototype
0:06:05 and sort of explore the different options similar to how these other foundation
0:06:07 models were developed sort of in the cloud and then deploy them in your life
0:06:11 once you know that they’re stable, once you know that they’re ready to be used.
0:06:13 And like Eric mentioned earlier, like I think there’s this aspect of reproducibility
0:06:17 where if you all are using the sort of same environments,
0:06:19 then you know that the results can transfer and they can be validated
0:06:22 by other labs and other people.
0:06:23 Whereas if you build a bespoke robot and you say it does something,
0:06:26 nobody can really validate it unless they buy the robot and, you know,
0:06:29 completely reproduce the setup.
0:06:30 So yeah, there are a few different benefits that we think are pretty important.
0:06:32 Now, there are existing simulation engines.
0:06:36 I don’t know if you’d call them that, but game engines, Unreal, Unity,
0:06:40 that are used beyond game development, obviously,
0:06:43 and you can simulate things in those environments.
0:06:45 Why not go with one of those?
0:06:48 Right, yeah. Another great question, and a natural follow-up that a lot of people ask.
0:06:52 I think there are a couple of limitations with the current set of simulators that we have.
0:06:57 On the one hand, I think you have sort of the, like you mentioned,
0:07:00 the very well-known game engines like Unity and stuff.
0:07:04 And I think the problem is that you get really hyper realistic visuals.
0:07:08 You know, you get these amazing games that are really immersive,
0:07:11 and it feels like real life.
0:07:13 But I think when it comes down to the actual interactive capabilities,
0:07:16 like what you can actually do with your, you know, PS5 controller or whatnot in the game,
0:07:20 I think they’re definitely curated experiences by the developers.
0:07:23 And so there’s a clear distinction between what you can and can’t do.
0:07:27 And that’s not how real life works, right?
0:07:28 In real life, there could be tape that says, caution, do not enter,
0:07:32 but you can just, you know, walk through that tape.
0:07:33 You’ll have to take the consequences, but you can still do that.
0:07:36 And I think that’s what we want robots to be able to do, where, again,
0:07:40 to Eric’s point, we don’t want to pre-define a set of things that we want to teach it.
0:07:44 We want it to learn a general idea about how the world operates.
0:07:47 And so I think that necessitates a simulator where everything in the world
0:07:51 is interactive, whether it’s a cup on a table, a laptop, or a door.
0:07:55 So there’s no distinction like, OK, we curated this one room,
0:07:59 so this is really realistic and it works really well,
0:08:02 but if you try to walk outside of it, then it’s not going to work.
0:08:04 We want it all to work. Yeah. Yeah.
0:08:05 And so did you build your own simulator?
0:08:08 It’s more accurate to say we built on top of a simulator.
0:08:11 And I think this is where we have to give NVIDIA so much credit, where, you know,
0:08:14 they have this really powerful ecosystem called Omniverse, where it’s sort of
0:08:18 supposed to be this one-stop shop where you can get hyper-realistic rendering.
0:08:22 They have a powerful physics back end.
0:08:24 They can simulate cloth. They can simulate fluid.
0:08:26 They can do all these things where, you know, it’s stuff that we would want to do in real life,
0:08:30 you know, like fold laundry, pour water into a cup, that kind of stuff.
0:08:34 And so they provide sort of the core engine, let’s say, that we build upon.
0:08:38 And then we provide the additional functionality that they don’t support.
0:08:41 And I think together it gives us, you know, a very rich set of features
0:08:44 where we can simulate a bunch of stuff that robots would have to do
0:08:47 if you want to put them in our households. Yeah. Anything to add, Eric?
0:08:51 Yeah, I do. We really want to extend our gratitude to the Omniverse team.
0:08:56 I think they have hundreds of engineers really putting together this, you know,
0:09:00 physics engine and rendering engine that works remarkably well on GPUs
0:09:04 and can also be parallelized, which is actually the next step on our roadmap,
0:09:08 to make our things run even faster, given its powerful capabilities.
0:09:12 And it’s just impossible to do many of these household tasks
0:09:17 without the support of this platform.
0:09:19 You mentioned Omniverse, obviously.
0:09:20 And so there was a simulation environment called Gibson, then iGibson.
0:09:26 And then you extended that to create OmniGibson.
0:09:30 Am I getting it right? Right, yeah.
0:09:32 So iGibson, just for the audience,
0:09:35 is a predecessor of OmniGibson that we developed, you know, three, four years ago.
0:09:39 And at that time, Omniverse hadn’t launched yet.
0:09:42 So we wrote our own renderer, and we were building
0:09:45 on a previous physics engine called PyBullet,
0:09:48 which works very well for rigid body interactions.
0:09:50 And then Omniverse launched; that was about two years ago.
0:09:55 That’s also when we decided to tackle a much larger scale of household tasks.
0:10:01 We decided to work on, for example, one thousand different activities
0:10:04 that we do daily in our homes, and we quickly realized
0:10:07 that this has gone beyond the capability of what our previous physics engine,
0:10:11 PyBullet, can do. It doesn’t handle fluid.
0:10:14 It supports some level of cloth, but it’s not very realistic.
0:10:18 And the speed is sort of slow.
0:10:20 Now we see this brand new toy, I guess, that came right out of the oven.
0:10:24 And then we thought, let’s try this out.
0:10:26 So we pretty much started clean with a new project on top of Omniverse.
0:10:32 Many, many things changed.
0:10:33 We kept some of the design choices that we had already made in iGibson
0:10:36 that had proven to work quite well in our research.
0:10:40 We inherited a lot of ideas, but we also changed a bunch of stuff
0:10:44 to make things more usable and more powerful in OmniGibson.
0:10:48 So let’s talk about robots doing chores.
0:10:51 How does one go about training a robot, whether in a simulation
0:10:55 or in the physical world, to learn how to do household chores?
0:11:00 Can you walk us through a little bit of what that’s like?
0:11:02 Oh, that’s a great question, a very open-ended question.
0:11:05 It’s what I do.
0:11:07 I ask the open-ended questions and sit back.
0:11:09 I think to make it easy for the audience,
0:11:10 you can think of it as maybe two broad approaches that are generally being tried right now.
0:11:14 One is essentially you throw a robot in and you sort of let it do what it wants.
0:11:19 You can think of it as maybe learning a bit from play, where you give rewards.
0:11:24 So think of teaching a child, like, you know, they don’t really know what to do.
0:11:27 And so you, you know, punish them when they do something bad.
0:11:30 You give them like a timeout, and then when they do something good,
0:11:32 you know, you give them like a cookie or, you know, some kind of reward.
0:11:34 And it’s similar for robots, where you throw them in and, naively,
0:11:37 the AI model doesn’t know what to do.
0:11:39 So it just tries random things. It tries touching the table.
0:11:41 It tries, you know, touching a cup or something.
0:11:43 But let’s say what you really want it to do is, you know, pick up the cup
0:11:46 and then pour water into something else.
0:11:48 And so you can reward the things that are closer to what you want it to do.
0:11:51 Like if it touches the cup, you can give it a good reward, like a positive reward.
0:11:55 And if it, say, knocks over the cup and spills the water, you give it a negative reward.
0:11:58 And so that’s one approach I think that researchers are trying.
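To make the reward-shaping idea described above concrete for readers, here is a minimal sketch in Python. The state fields and reward values are illustrative assumptions for a pick-and-pour task, not taken from the BEHAVIOR codebase.

```python
# Minimal sketch of reward shaping for a "pick up the cup and pour" task.
# The state fields and numeric values below are illustrative placeholders.

def shaped_reward(state) -> float:
    reward = 0.0
    if state["gripper_touching_cup"]:
        reward += 0.1    # small bonus for getting closer to the goal
    if state["cup_grasped"]:
        reward += 1.0    # bigger bonus for a successful grasp
    if state["water_spilled"]:
        reward -= 1.0    # penalty for knocking the cup over
    if state["water_in_target_vessel"]:
        reward += 10.0   # large terminal reward for finishing the task
    return reward

# Example: the agent has grasped the cup but has not poured yet.
print(shaped_reward({
    "gripper_touching_cup": True,
    "cup_grasped": True,
    "water_spilled": False,
    "water_in_target_vessel": False,
}))  # 1.1
```

A reinforcement learning algorithm then tries to maximize the sum of these rewards over an episode, which is what nudges the random exploration toward the behavior you actually want.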
0:12:01 And another approach is where we learn directly from humans,
0:12:04 where a human can actually, let’s say, teleoperate.
0:12:07 So like, let’s say you have a video game controller and can control the robot’s arm
0:12:10 to actually just directly pick up the cup, pour some water into something else
0:12:13 and then call it a day.
0:12:14 And then we collect that data and the robot can look at it and train on that.
0:12:17 It sees, okay, I saw that the human moved my arm to here and sort of poured it,
0:12:20 so I’m going to try to reproduce that action.
0:12:22 So it’s these two different approaches: scaffolding directly from scratch
0:12:26 versus scaffolding based on human intelligence.
0:12:29 Right. Yeah.
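The teleoperation approach described above is commonly implemented as behavior cloning: collect observation-action pairs from a human driving the robot, then fit a policy to reproduce those actions. Below is a minimal sketch under assumed shapes and random stand-in data; it is not the project’s actual training code.

```python
# Minimal behavior-cloning sketch: supervised learning on teleoperation data.
# Dimensions, network size, and the random "demonstrations" are placeholders.
import torch
import torch.nn as nn

obs_dim, act_dim = 32, 7                # e.g., robot state + object poses -> 7-DoF arm command

demo_obs = torch.randn(1000, obs_dim)   # observations recorded during teleoperation
demo_act = torch.randn(1000, act_dim)   # actions the human operator took

policy = nn.Sequential(
    nn.Linear(obs_dim, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, act_dim),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for epoch in range(10):
    pred_act = policy(demo_obs)
    loss = nn.functional.mse_loss(pred_act, demo_act)  # imitate the human's actions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

At deployment time the learned policy maps each new observation to an action, which is the "reproduce what the human did" step described above.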
0:12:30 And if you’re stringing together a series of actions, like let’s say,
0:12:34 I mean, even your example of picking up the cup and then pouring the water into a different vessel,
0:12:39 is it one sort of fluid sequence, or are you teaching sort of modular tasks
0:12:46 that you can then string together?
0:12:48 Yeah, that’s another design decision, right?
0:12:50 I think there’s something called task planning, where you can imagine
0:12:53 that every individual step is a different training pipeline.
0:12:56 So like, I’m just going to focus on learning to pick the cup and I’m not going to do anything
0:12:59 with it, but I’m just going to repeat that action over and over.
0:13:01 And then let’s say you can plug it in with something else, which says,
0:13:03 okay, and I’m going to do like a pouring action over and over.
0:13:05 And then if we just string them together, then maybe, let’s say,
0:13:08 you can get the combination of those two skills.
0:13:10 But others have looked at sort of the end-to-end, as we call it, process where, you know,
0:13:14 you look at the task as a whole, like pick up the cup and pour it into another vessel.
0:13:18 And you just try to do it from the very beginning to the very end.
0:13:21 And I think it’s still unclear which way is better.
0:13:24 But again, like it’s a bunch of design decisions and there’s a ton of problems.
0:13:27 I agree, I agree.
0:13:28 There’s not really a consensus. Researchers have been poking here
0:13:32 and there and trying their luck.
0:13:33 And there’s pros and cons on both sides.
0:13:36 For example, if you do the end-to-end approach, if it works, it works really well.
0:13:40 But because the task is longer, it’s more data hungry.
0:13:43 It’s more difficult to converge.
0:13:46 On the other hand, if you do a more modular approach,
0:13:49 then each skill can work really well.
0:13:51 But the transition point is actually very brittle, right?
0:13:54 You might reach some bottleneck where you try to chain a couple of skills together,
0:13:57 and then it breaks in the middle, and it’s very hard to recover from there.
0:14:00 So I think we’re still figuring this out as a community.
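For readers, the modular approach contrasted with end-to-end learning above looks roughly like the sketch below: each skill is trained and run separately, and the brittleness lives at the hand-off points. The environment and skills here are toy placeholders, not real BEHAVIOR code.

```python
# Toy sketch of chaining modular skills, with the brittle transition points
# described above. The environment and skill implementations are placeholders.

class ToyEnv:
    def __init__(self):
        self.state = {"cup_grasped": False, "water_in_bowl": False}

def pick_cup(env):
    env.state["cup_grasped"] = True          # pretend the "pick" policy succeeded
    return env.state["cup_grasped"]

def pour_into_bowl(env):
    if not env.state["cup_grasped"]:         # this skill assumes a grasped cup
        return False
    env.state["water_in_bowl"] = True
    return True

def run_task(env):
    for skill in (pick_cup, pour_into_bowl):
        if not skill(env):
            # If one skill ends in a state the next skill never saw during
            # training, the chain breaks here and recovery is hard.
            return False
    return True

print(run_task(ToyEnv()))   # True
```

An end-to-end policy would instead map raw observations directly to actions for the whole task, with no explicit hand-off in between.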
0:14:04 What were some of the hardest household tasks for the robots to pick up?
0:14:09 Or the easiest ones, or even just the ones that you remember
0:14:14 because the process was sort of unexpected and interesting.
0:14:18 I was going to say the folding laundry example I mentioned is one that,
0:14:22 maybe it’s just, you know, the platforms I hang out on, the algorithms
0:14:28 know that I don’t like folding laundry and I’m terrible at it.
0:14:31 I can’t fold a shirt the same way twice.
0:14:33 Oh, yeah. But every once in a while, you know,
0:14:35 I feel like I’ll see a video of a system that’s like
0:14:38 gotten a little bit closer, but it seems to be a difficult problem.
0:14:42 Yeah, it’s really challenging.
0:14:43 I think to be clear to the audience, we haven’t solved all the thousand tasks.
0:14:47 That’s our eventual goal.
0:14:49 I think the first step is just providing these thousand tasks
0:14:51 in a really reproducible way, and the platform that can actually simulate them.
0:14:55 But for me personally, I think what immediately comes to mind is like one of the top five tasks.
0:14:59 So to give a bit of context, like Eric mentioned,
0:15:01 we don’t want to just predefine tasks.
0:15:03 We want to actually see what people care about.
0:15:05 So we actually polled thousands of people online
0:15:08 and we asked them, you know, what would you want a robot to do for you?
0:15:11 And so we had a bunch of tasks, more than a thousand,
0:15:13 and we whittled them down to a thousand based on the results that people gave.
0:15:16 OK. And one of the top five tasks was clean up after a wild party.
0:15:19 And so the way we visualized it in our simulator was we had this, you know,
0:15:22 living room and just tons of glass bottles, beer bottles,
0:15:25 like just random objects scattered on the floor.
0:15:28 And that’s just a distinct memory in my mind because I think it really
0:15:31 sets the stage for how much disdain we have for certain tasks.
0:15:35 And it was clear that people ranked it very highly because it’s, you know,
0:15:38 very undesirable. Or clean laundry, or, excuse me, fold laundry.
0:15:42 But I’m getting flashbacks now.
0:15:43 I’m wondering if you have taught a robot to patch a hole in the wall.
0:15:47 Oh, God, that’s a story I’m not going to get into.
0:15:49 A hole that the robot maybe made itself when it was trying to do something
0:15:52 in the real world. Exactly, exactly.
0:15:53 Made some mistake. Yeah. Any thoughts, Eric?
0:15:55 What do you think? Yeah.
0:15:56 I guess some of the cooking tasks seem pretty difficult.
0:15:59 Oh, yeah, that’s a shame.
0:16:00 A lot of our household tasks are cooking-related.
0:16:03 And we did spend quite a lot of effort implementing these complex processes.
0:16:08 I guess we tried to do a bit of simplification, but we wanted to capture the
0:16:12 high-level kind of chemical reactions that happen in a cooking process.
0:16:16 For example, baking a pie or making a stew,
0:16:19 those kinds of things, in our platform.
0:16:21 Yeah. And these tasks are pretty challenging too, right?
0:16:24 Like you need to have this kind of common sense knowledge about what does it take,
0:16:27 you know, to cook a specific dish, you know, what are the ingredients,
0:16:30 how much you put in, you don’t want to put in too much salt.
0:16:33 Also not too little salt. Right, that’s right.
0:16:34 You need to understand how much time to put it in the oven,
0:16:37 how long to wait, and make sure you don’t spill anything else.
0:16:39 Yeah, those are some of the longest-horizon tasks.
0:16:42 Yeah. Forgive me, I’m sure there’s a better way to ask this question,
0:16:45 but what’s difficult about, I can imagine it’s incredibly difficult,
0:16:49 but what’s difficult about cooking for the robot to learn?
0:16:54 Is it that there’s so many steps and objects happening?
0:16:58 Is it something about the motions involved?
0:17:01 No, you’re asking it in a brilliant way.
0:17:03 I think both, actually.
0:17:06 They’re both challenging aspects.
0:17:08 One of them is that it involves many concepts, or, symbolically,
0:17:11 you can think of it as involving many types of objects.
0:17:14 And you have to chain them together,
0:17:16 make sure you use the right tool at the right time.
0:17:18 And also the motions are difficult.
0:17:20 Imagine you need to cut an onion into small dice to make,
0:17:26 you know, some sort of dish, to give a good example.
0:17:28 But then the motion itself is very dexterous, right?
0:17:32 Imagine that, sometimes humans cut their fingers when they’re cooking.
0:17:35 First thing I thought of.
0:17:36 Exactly. That’s pretty tough.
0:17:38 Yeah. And then I think also it just needs to have some understanding
0:17:41 of things that aren’t explicit.
0:17:43 Like if I put these two chemicals together,
0:17:45 it actually creates a third thing that you didn’t see before.
0:17:47 And I think a lot of times in current research,
0:17:49 you sort of assume that the robot already knows everything.
0:17:52 And so it can only do things combinatorially with what it’s given.
0:17:55 But I think cooking is an interesting example where,
0:17:58 you know, you put, I don’t know, dough into the oven
0:18:00 and out comes, it just transforms into bread.
0:18:02 And, you know, there’s the joke about, you know,
0:18:04 you put in a piece of bread and out comes toast.
0:18:06 And then there’s a comic where, you know, Calvin from Calvin and Hobbes is like,
0:18:09 oh, I wonder how this machine works.
0:18:10 It just somehow transforms it into this new object.
0:18:13 And I can’t see where it’s stored, right?
0:18:15 And so I think the idea that a robot has to learn that is also quite challenging too.
0:18:18 I’m speaking with Eric Li and Josiah David Wong.
0:18:22 Eric and Josiah are PhD students at Stanford who are here at GTC24.
0:18:28 They presented a session early in the week.
0:18:30 We’re talking about it now.
0:18:31 They’re attempting to teach robots how to do a thousand common household tasks
0:18:36 that humans just don’t want to do if we can help it,
0:18:38 which is just one of the many potential future avenues for robotics in our lives.
0:18:43 But it’s a good one.
0:18:44 I’m looking forward to it.
0:18:45 One thing I want to ask you about, LLMs are everywhere right now.
0:18:49 And, you know, a lot of the recent podcasts and guests I’ve been talking to,
0:18:53 and just people I’ve been talking to at the show,
0:18:55 are talking about LLMs as relates to different fields, right?
0:19:01 Scientific discovery and genome sequencing and drug discovery and all kinds of things.
0:19:06 There’s been some high-profile stuff in the media lately
0:19:11 about robots that have an LLM integrated, ChatGPT integrated,
0:19:17 so you can ask the robot in natural language to do something.
0:19:21 It can interact with you, that sort of thing.
0:19:24 How do you think about something like that?
0:19:26 From the outside, I sort of at the same time can easily imagine what that is.
0:19:32 But then in my brain, it almost stutters when I try to imagine,
0:19:38 like I’ve used enough chatbots and text-to-image models and that kind of thing
0:19:43 to sort of understand, you know, I type in a prompt and it predicts the output that I want.
0:19:48 When we’re talking about equipping, you know, a robot with these capabilities,
0:19:54 is it a similar process?
0:19:56 Is the robot, when we were talking about cooking, I was imagining, you know,
0:20:00 can an LLM in some way give a robot the ability to sort of see the larger picture of, you know,
0:20:07 now remember, when you take the dough out, it’s going to look totally different
0:20:11 because it’s become a pizza.
0:20:12 Is that a thing or is that me in my human brain just trying to make sense
0:20:18 of just this rapid pace of acceleration in this thing we call AI
0:20:22 that’s actually touching so many different, you know, disciplines all at once?
0:20:27 Oh, yeah, I do think the development of large models,
0:20:31 not just large language models, but also large vision-language models,
0:20:36 will really accelerate the progress of robotics.
0:20:39 And researchers in our field have adopted these approaches
0:20:43 over the last two years, and things are moving very fast,
0:20:47 and we’re very excited.
0:20:48 I think one of the challenges is that what these LLMs have been good at
0:20:53 is still at the symbolic level.
0:20:54 You can think of it as, in the virtual world, it has these concepts,
0:20:59 you know, like what ingredients to put in when making a pizza, for example.
0:21:04 But there’s still the low-level, difficult robot skills, motions, you would call them.
0:21:10 How do you roll dough into a flattened thing?
0:21:13 How do you sprinkle stuff on the pizza so that it’s evenly spread out?
0:21:18 All those little details are the crux of a successful pizza.
0:21:22 A top-notch pizza, not just any pizza.
0:21:24 An edible pizza.
0:21:25 Even just an edible one, all right.
0:21:27 I hope the listeners can hear it.
0:21:28 I can see in your face as you’re talking, like, you have to get this right.
0:21:34 And I’m with you.
0:21:35 And so the actual physical implementation of doing those motions is something that,
0:21:41 you know, the robotics field I’m sure is working on, has been working on,
0:21:44 but work in progress.
0:21:46 Exactly.
0:21:46 I think you hit the nail exactly on the head where it is the execution
0:21:51 where you can think of it as, you know, theoretical knowledge.
0:21:54 It’s the same thing for you as a human.
0:21:57 Like, okay, you’re planning in your head.
0:21:59 You’re planning the chores that you could do.
0:22:00 So you list them out and you know exactly what you’re supposed to do.
0:22:02 But then you actually have to go and execute them.
0:22:04 And so I think the LLM, because it’s not plugged in with a physics simulator,
0:22:08 it doesn’t actually know, okay, I think that if I do this, if I, you know,
0:22:12 pick up the cup, then it will not spill any water.
0:22:15 But if the cup has a hole in the bottom that you don’t see, and then you do that
0:22:18 and stuff falls out, then you have to readjust your plan.
0:22:20 And I think if you just have an LLM, you don’t know exactly what the outcomes
0:22:25 are going to be along the way.
0:22:26 And so, like Eric was saying, I think it needs to, as we say, close the
0:22:29 loop, so to speak, where you plan, then you try it out, and then you plan again.
0:22:33 And I think that extra execution step is something that’s still
0:22:38 sort of an open research problem that we’re both hoping to tackle.
0:22:41 Right.
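The plan-execute-replan loop described above, "closing the loop," can be written out roughly like this. The planner and execution functions below are hypothetical stand-ins, not a real LLM API or robot controller; the point is only that the observed outcome of each step feeds back into the next planning call.

```python
# Rough sketch of closing the loop: plan a step, execute it, observe, replan.
# `ask_planner` and `execute` are hypothetical stand-ins.

def ask_planner(goal, observation):
    """Pretend LLM call: pick the next symbolic action given what we see now."""
    if "cup_is_leaking" in observation:
        return "fetch_new_cup"
    if "cup_in_hand" not in observation:
        return "pick_up_cup"
    return "pour_water"

def execute(action):
    """Pretend execution: return what the robot observes afterwards."""
    outcomes = {
        "pick_up_cup": {"cup_in_hand", "cup_is_leaking"},   # surprise: hole in the cup
        "fetch_new_cup": {"cup_in_hand"},
        "pour_water": {"cup_in_hand", "water_poured"},
    }
    return outcomes[action]

goal, observation = "pour water into the bowl", set()
while "water_poured" not in observation:
    action = ask_planner(goal, observation)
    observation = execute(action)      # the outcome may differ from what was planned
    print(action, "->", observation)
```

An open-loop planner would commit to the whole action sequence up front and never notice the leaking cup.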
0:22:42 And so where are you now in the quest for a thousand chores to put it that way?
0:22:47 Is it all in a simulation environment?
0:22:50 Are you having robots in the physical world go out and, you know,
0:22:54 have you gotten to the point where the experiments feel stable enough to try them
0:22:58 out in the physical world?
0:22:59 Where are you on that timeline?
0:23:01 So when we originally published our work, which was a couple of years ago, and we’ve
0:23:05 done, you know, a bunch of work since then, one of our experiments was actually
0:23:08 setting up what we call a digital twin, where we have a real room in the
0:23:13 real world with a real robot.
0:23:15 And we essentially try to replicate it as closely as we can in simulation with
0:23:19 the same robot virtualized.
0:23:20 And I think we were able to show that by training the robot at the level of
0:23:25 telling it, okay, grasp this object, now put it here,
0:23:27 and then having it learn within that loop in simulation, we could actually have
0:23:30 it work in the real world.
0:23:31 So we tested it in the real world and we did see non-zero success.
0:23:34 So I think the task was like putting away trash or something.
0:23:36 I think so.
0:23:37 Yeah.
0:23:37 So we had it throw away a bottle or a red Solo cup into a trash
0:23:40 can.
0:23:40 And so that requires navigation, moving around the room, picking up
0:23:43 stuff, and then also moving it back and dropping it in a specific
0:23:46 way.
0:23:46 And so I think that’s a, it’s a good signal to show that, you know, robots can
0:23:50 learn potentially.
0:23:51 But of course this is, I think, one of the easier tasks. If it’s folding
0:23:54 laundry, which we can only do, you know, not that well, then how much
0:23:57 harder is it going to be for a robot to do?
0:23:58 Yeah.
0:23:58 So I think there are still a lot of open questions to actually hit, you know,
0:24:02 even a hundred of the tasks, much less all thousand of them.
0:24:04 But I think we have seen some progress.
0:24:06 So I hope that we can, you know, start to scaffold up from there.
0:24:09 Yeah.
0:24:09 So what’s the rest of, I don’t know, the semester, of the year, like for
0:24:16 you guys? Is it all heads down on this project?
0:24:19 What’s the timeline?
0:24:20 Yeah.
0:24:21 I think we just had our first official release two days ago, and
0:24:26 things are at a stage where we have all our 1,000 tasks ready for
0:24:30 researchers and scientists to try out.
0:24:33 I think our immediate next steps are to try some of these ourselves, you
0:24:38 know, like what Google calls dogfooding your own products, right?
0:24:41 We’re, you know, robot learning researchers ourselves.
0:24:44 We want to see how the current state-of-the-art robot learning or robotics
0:24:49 models work on these tasks.
0:24:51 What are some of the pitfalls?
0:24:52 So I think that’s number one: essentially, that will tell us where the
0:24:57 low-hanging fruit is that can really significantly improve performance.
0:25:00 And the second is that we’re also thinking about potentially, you know,
0:25:04 hosting a challenge where researchers can compete,
0:25:09 and making everything even more modular so that people can participate from all over
0:25:13 the world to make progress on this benchmark.
0:25:17 And I think that’s also on our roadmap to make happen.
0:25:20 Well, if you need a volunteer to create the wild party mess to clean up,
0:25:25 we know who to ask.
0:25:26 You know, I think along the lines of the challenge, I think a goal is to
0:25:30 sort of democratize this sort of research and allow more people to explore.
0:25:33 And so we’ve actually put together like a demo that anyone can try.
0:25:36 So for the audience listening, you know, if you’re technically inclined,
0:25:39 if you’re a researcher, but even if you’re just, oh, I don’t know,
0:25:41 I don’t want to say a layperson, but, you know, a person that’s normally not
0:25:44 involved with AI but wants to just see what we’re all about,
0:25:46 we do actually have something that you can just try out immediately, and
0:25:49 you can see sort of the robot in the simulation and what it looks like.
0:25:52 So I assume there’ll be some links, hopefully associated with this.
0:25:55 But yeah, if you know the link offhand,
0:25:58 you can say it now, but we’ll include it as well.
0:26:01 Yeah, it’s at
0:26:03 behavior.stanford.edu.
0:26:04 Yeah.
0:26:04 Great.
0:26:04 Okay.
0:26:05 And that reminds me to ask, you mentioned it briefly, but what is BEHAVIOR?
0:26:09 Yeah.
0:26:09 That’s a good distinction.
0:26:10 So OmniGibson, to be clear, is the simulation platform where we simulate
0:26:13 this stuff.
0:26:14 Right.
0:26:14 And I think, overarching that, this whole project is called BEHAVIOR-1K,
0:26:17 representing the thousand tasks that we hope robots can solve in the near future.
0:26:20 Right.
0:26:21 Yeah.
0:26:21 That’s the distinction,
0:26:22 I guess: it’s the all-encompassing thing, which is not just the simulator
0:26:24 but also the tasks, and sort of the whole ethos of the whole project is
0:26:28 called BEHAVIOR.
0:26:28 Okay.
0:26:29 Okay.
0:26:29 Yeah.
0:26:29 All right.
0:26:30 And before I let you go, we always like to wrap these conversations up with a
0:26:34 little bit of a forward looking.
0:26:35 What do you think your work,
0:26:38 you know, how do you think your work will affect the industry?
0:26:40 What do you think the future of robots is, et cetera, et cetera?
0:26:44 So it’s 2024.
0:26:46 Let’s say by 2030.
0:26:48 Are robots in the physical world going to be, you know, to some extent, we’ve got,
0:26:53 you know, Roombas and, you know, vacuum cleaner robots, that kind of thing.
0:26:57 And certainly at a show like this, you see robots out there.
0:27:00 But, you know, there was an NVIDIA robot
0:27:02 I saw yesterday that’s out in children’s hospitals interacting with patients.
0:27:07 And yeah, it’s down on the, I’m pointing.
0:27:09 Nobody can see this on the radio, but I’m pointing out at the show floor.
0:27:11 It’s out there.
0:27:12 You can see it.
0:27:13 What do you think society is going to be like in, you know, five, six years from now as it
0:27:17 relates to the quantity and sort of level of interactions with robots in our
0:27:22 lives? Or maybe it’s more than five, six years.
0:27:25 Maybe it’s 2050 or further down the line.
0:27:27 I think it’s hard to predict because if we look at the autonomous driving industry,
0:27:32 let’s say as a predecessor, I think even though it’s been hyped up as the
0:27:37 next ubiquitous thing for quite a while now, right?
0:27:40 But we’re still not quite there. I mean, it’s become much more commonplace, but we’re
0:27:43 still not at, what is it,
0:27:44 level five autonomy, let’s say.
0:27:46 And so I imagine something similar will happen with, you know, humanoid robots or
0:27:51 the everyday, you know, interactive household robots,
0:27:54 where I can imagine we’ll start seeing them in real life.
0:27:57 But I don’t think they’ll be ubiquitous for, you know, decades.
0:28:00 That’s my take, but I don’t know if you’re more optimistic.
0:28:03 Yeah, I think keeping it a bit pessimistic is actually a good thing, because I think,
0:28:07 in general, exactly the reason why these things are so hard is
0:28:13 because humans have very high standards.
0:28:15 Exactly.
0:28:15 It’s like the self-driving cars.
0:28:16 People are okay drivers, or decent drivers,
0:28:19 so you only have an accident every X number of miles.
0:28:22 You want the robot to be better at it, much better at it actually.
0:28:24 So I think we’re still a bit far away from, you know, household robots that can
0:28:30 be very versatile, meaning they do many things and also do many things reliably,
0:28:35 very consistently.
0:28:36 You don’t want it to break, you know, 20% of the time,
0:28:38 or in 20% of conditions.
0:28:40 Yeah, that’s right.
0:28:41 So I think it’s hard because we have high standards, but hopefully these robots
0:28:46 can come into our lives incrementally, maybe first in more
0:28:50 structured environments like warehouses and so on, doing things like reshuffling
0:28:54 or restocking shelves or, you know, putting Amazon packages here
0:28:58 and there, and then hopefully, in time,
0:29:03 we can have a laundry-folding robot soon.
0:29:06 Excellent.
0:29:07 Good enough.
0:29:07 Eric, Josiah, guys, thank you so much for dropping by and taking time out of your busy
0:29:11 week to join the podcast.
0:29:12 This has been a lot of fun for me,
0:29:14 and I’m sure for the audience. For listeners who want to learn more
0:29:18 about the work you’re doing, more about what’s going on at the lab, read
0:29:22 some published research, that kind of thing.
0:29:24 Are there good starting points online where we can direct them?
0:29:27 Like Eric mentioned earlier, just go to behavior.stanford.edu,
0:29:31 and that’s sort of the entry point where you can see, you know, all this stuff
0:29:34 about this project. But more widely, you can also then, from there,
0:29:37 get to see what else is going on at Stanford
0:29:39 that’s exciting.
0:29:39 So yeah, definitely check it out if you’re so inclined.
0:29:41 Perfect.
0:29:42 All right.
0:29:43 Well guys, thanks again.
0:29:44 Enjoy the rest of your show and good luck with the research.
0:29:47 Thanks Noah.
0:29:47 Thanks for having us.
0:29:49 [MUSIC]

Imagine having a robot that could help you clean up after a party — or fold heaps of laundry. Chengshu Eric Li and Josiah David Wong, two Stanford University Ph.D. students advised by renowned American computer scientist Professor Fei-Fei Li, are making that a dream come true. In this episode of the AI Podcast, host Noah Kravitz spoke with the two about their project, BEHAVIOR-1K, which aims to enable robots to perform 1,000 household chores, including picking up fallen objects or cooking. To train the robots, they’re using the NVIDIA Omniverse platform, as well as reinforcement and imitation learning techniques. Listen to hear more about the breakthroughs and challenges Li and Wong experienced along the way.
