NVIDIA’s Ming-Yu Liu on How World Foundation Models Will Advance Physical AI – Episode 240

AI transcript
0:00:10 [MUSIC]
0:00:14 Hello, and welcome to the NVIDIA AI podcast.
0:00:16 I’m your host, Noah Kravitz.
0:00:21 NVIDIA CEO Jensen Huang recently keynoted CES, the Consumer
0:00:24 Electronics Show, in Las Vegas, Nevada.
0:00:28 Amongst the many exciting announcements Jensen talked about
0:00:30 was NVIDIA Cosmos.
0:00:34 Cosmos is a development platform for world foundation models,
0:00:38 which I think we’re all going to be talking a lot about in the coming months and years.
0:00:40 What is a world foundation model?
0:00:44 Well, thankfully, we’ve got an expert here to tell us all about it.
0:00:47 Ming-Yu Liu is vice president of research at NVIDIA.
0:00:49 He’s also an IEEE fellow.
0:00:53 And he’s here to tell us all about world foundation models,
0:00:57 how they work, what they mean, and why we should care about them going forward.
0:01:03 So without further ado, Ming-Yu, thank you so much for joining the NVIDIA AI podcast, and welcome.
0:01:04 It’s great to be here.
0:01:07 So let’s start with the basics, if you would.
0:01:09 What is a world foundation model?
0:01:09 Sure.
0:01:16 So world foundation models are deep learning-based space-time visual simulators
0:01:19 that can help us look into the future.
0:01:21 It can simulate physics.
0:01:25 It can simulate people’s intentions and activities.
0:01:30 It’s like an imagination engine for AI, imagining many different environments,
0:01:32 and it can simulate the future.
0:01:35 So we can make good decisions based on the simulation.
0:01:38 We can leverage world foundation models’ imagination
0:01:43 and simulation capabilities to help train physical AI agents.
0:01:46 We can also leverage this capability to help the agent
0:01:49 make good decisions during inference time.
0:01:54 You can generate a virtual world based on text prompts, image prompts, video prompts,
0:01:57 action prompts, and their combinations.
0:02:03 So we call it a world foundation model because it can generate many different worlds
0:02:08 and also because it can be customized to different physical AI setups
0:02:10 to become a customized world model.
0:02:14 So different physical AI systems have different numbers of cameras in different locations.
0:02:18 So we want the world foundation model to be customizable for different physical AI setups
0:02:20 so it can be used in their settings.
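To make the customization idea concrete, here is a minimal Python sketch of the kind of interface a customizable world model exposes: the same prediction function serves setups with different camera counts and placements. All names here (`CameraRig`, `predict_future`) are illustrative stand-ins, not the actual Cosmos API.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class CameraRig:
    """One physical AI setup: how many cameras it has and where they sit."""
    num_cameras: int
    positions: List[str]

def predict_future(observation: Dict, action: str, rig: CameraRig) -> Dict:
    """Toy stand-in for a world foundation model: given the current
    observation and an intended action, return a predicted next
    observation with one simulated view per camera in the rig."""
    return {
        "action": action,
        "views": {pos: f"predicted frame after '{action}'" for pos in rig.positions},
    }

# The same "foundation" interface serves two very different robot setups.
arm_rig = CameraRig(num_cameras=2, positions=["head", "wrist"])
car_rig = CameraRig(num_cameras=4, positions=["front", "rear", "left", "right"])
future = predict_future({"t": 0}, "pick up the cup", arm_rig)
```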
0:02:27 So I want to ask you kind of how a world model is similar or different to an LLM
0:02:28 and other types of models.
0:02:32 But I think first I want to back up a step and ask you,
0:02:38 how is a world model similar or different to a model that generates video?
0:02:40 Because my understanding, and please correct me when I’m wrong,
0:02:45 my understanding is that you can prompt a world model to generate a video,
0:02:49 but that video is generated based on the things you were talking about,
0:02:54 based on understanding of physics and other things in the physical world,
0:02:55 and it’s a different process.
0:03:00 So I don’t know what the best way is to kind of unpack it for the listeners,
0:03:02 but one place to start might be,
0:03:07 how does a world model differentiate from an LLM or a generative AI video model?
0:03:15 So a world model is different from an LLM in the sense that an LLM is focused on generating text descriptions.
0:03:17 It generates understanding.
0:03:20 A world model is generating simulation,
0:03:26 and the most common form of simulation is videos, so they are generating pixels.
0:03:29 And so world models and video foundation models, they are related,
0:03:34 and video foundation model is a general model that generates videos.
0:03:39 It can be for creative use cases, it can be for other use cases.
0:03:45 In world models, we are focusing on this aspect of video generation.
0:03:50 Based on your current observation and the intention of the actors in your world,
0:03:52 you roll out the future.
0:03:55 So they are related, but with a different focus.
0:03:56 Gotcha, thank you.
0:03:58 So why do we need the world models?
0:04:01 I mean, I think I know part of the answer to the question,
0:04:05 we’re talking about simulating physical AI and all of these amazing things,
0:04:10 but tell us about the need for world foundation models from your perspective.
0:04:16 So I think world foundation models are important to physical AI developers.
0:04:21 Physical AI systems are AI systems deployed in the real world.
0:04:25 And different from digital AI, these physical AI systems
0:04:29 interact with the environment and can create damage.
0:04:32 So this could cause real harm.
0:04:37 Right, so a physical AI system might be controlling a robotic arm
0:04:41 or some other piece of equipment changing the physical world.
0:04:45 I think there are three major use cases of world models for physical AI.
0:04:47 It’s all around simulation.
0:04:51 The first one is, when you train a physical AI system,
0:04:54 you train a deep learning model, you have thousands of checkpoints.
0:04:57 Do you know which one you want to deploy?
0:05:00 And if you deploy each one individually, it’s going to be very time consuming,
0:05:03 and if one is bad, you’re going to damage your kitchen.
0:05:09 So with a world model, you can do verification in the simulation.
0:05:15 So you can quickly test out this policy in many, many different kitchens.
0:05:19 And before you deploy in the real kitchen,
0:05:24 and after this verification step, you may narrow it down to three checkpoints,
0:05:26 and then you do the real deployments.
0:05:31 So you can have an easier life to deploy your physical AI.
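The verification workflow described above can be sketched in a few lines of Python. The scoring function is a hypothetical stand-in for rolling a policy checkpoint out in many world-model-generated kitchens; the point is the narrowing logic, not the toy scores.

```python
import random

def simulated_success_rate(checkpoint_id: int, num_kitchens: int = 50) -> float:
    """Toy stand-in for rolling out one policy checkpoint inside many
    world-model-generated kitchens and measuring task success."""
    rng = random.Random(checkpoint_id)  # deterministic toy score per checkpoint
    return sum(rng.random() < 0.5 for _ in range(num_kitchens)) / num_kitchens

# Score every training checkpoint in simulation, then keep only the
# three best candidates for real-world deployment.
scores = {ck: simulated_success_rate(ck) for ck in range(1000)}
top3 = sorted(scores, key=scores.get, reverse=True)[:3]
```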
0:05:35 It reminds me of when we’ve had podcasts about drug discovery,
0:05:40 and the guests talking about the ability to simulate experiments
0:05:44 and different molecular combinations and all of that work
0:05:47 so that they can narrow it down to the ones that are worth trying
0:05:49 in the actual, the physical lab, right?
0:05:53 So it sounds like, you know, similar like just being able to simulate everything
0:05:56 and narrow it down must be such a huge advantage to developers.
0:05:59 Yeah, and second application is, you know,
0:06:02 a world model, if you can predict the futures,
0:06:05 you have some kind of understanding of physics.
0:06:10 You might know the action required to drive the world toward that future.
0:06:15 And the policy model, you know, the typical one deployed in physical AI,
0:06:19 is all about outputting the right action, given the observation.
0:06:23 And so a world model can be used as initialization for the policy model,
0:06:27 and then, you know, you can train the policy model with a smaller amount of data
0:06:29 because the world model is already pre-trained
0:06:34 on many different observations that span the data space.
0:06:39 So without a world model, what’s the procedure of training a policy like?
0:06:42 So one procedure is you collect data,
0:06:46 and then you start to do the supervised fine-tuning,
0:06:48 and then you may use, yeah.
0:06:52 So it’s hands on, it’s manual, you have to get all the data, it’s a lot, yeah.
0:06:59 Yeah, and the third one is, when the world model is good enough, highly accurate and fast,
0:07:04 you know, before the robot takes any action, you just simulate different actions.
0:07:08 And you check which one will really achieve your goal, and take that one.
0:07:12 You know, you have a Doctor Strange next to you before you’re making any decision.
0:07:14 Wouldn’t that be great?
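A minimal sketch of this inference-time idea, in the style of model predictive control: simulate each candidate action with the world model and keep the one whose predicted future lands closest to the goal. The one-line dynamics function is a toy stand-in for a real world model, and all numbers are illustrative.

```python
def world_model_step(state: float, action: float) -> float:
    """Toy dynamics stand-in: the world model predicts the next state."""
    return state + action

def pick_best_action(state: float, candidates: list, goal: float) -> float:
    """Simulate each candidate action in the world model and take the one
    whose predicted future lands closest to the goal."""
    return min(candidates, key=lambda a: abs(goal - world_model_step(state, a)))

# From state 7.0 with goal 10.0, the rollout that reaches 9.0 wins,
# so the chosen action is 2.0.
best = pick_best_action(7.0, [-1.0, 1.0, 2.0, 5.0], goal=10.0)
```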
0:07:17 You mentioned accuracy when the models are fast enough and accurate enough,
0:07:20 and I don’t know if it’s a fair question to ask,
0:07:22 so I ask it, interpret it the best way,
0:07:27 but like, how do you determine accuracy or measure accuracy on a world model,
0:07:31 and is there a benchmark that, you know, different benchmarks you need to hit
0:07:34 to deploy in different situations, or how does that work?
0:07:35 Yeah, it’s a great question.
0:07:39 So I think a world model development is still in its infancy.
0:07:40 Right.
0:07:46 So people are still trying to figure out the right way to measure the world model performance.
0:07:49 And I think there are several aspects a world model must have.
0:07:51 One is following the laws of physics.
0:07:54 When you drop a ball, the model should predict it, you know,
0:07:58 in the right position based on the laws of physics, right?
0:08:03 And also in the 3D environment, we have to have object permanence, right?
0:08:06 So when you turn away and come back, you know,
0:08:08 the object should remain there, right,
0:08:11 without any other forces, it should remain in the same location.
0:08:14 So there are many different aspects I think we need to capture.
0:08:17 And I think an important part for the research community
0:08:20 is to come out with the right benchmark.
0:08:24 So that the community can move forward in the right direction
0:08:26 to democratize this important area.
0:08:26 Right.
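As one example of what such a benchmark might test, here is a toy Python check for the object-permanence property mentioned above; the positions and tolerance are illustrative assumptions, not values from any published benchmark.

```python
def object_permanence_ok(before: tuple, after: tuple, tol: float = 1.0) -> bool:
    """Toy benchmark check: after the camera looks away and back, a static
    object should reappear within `tol` of its last known position."""
    dx, dy = before[0] - after[0], before[1] - after[1]
    return (dx * dx + dy * dy) ** 0.5 <= tol

# Ball at (3.0, 4.0) before the pan; the generated video shows it at (3.2, 4.1).
assert object_permanence_ok((3.0, 4.0), (3.2, 4.1))
assert not object_permanence_ok((3.0, 4.0), (8.0, 4.0))  # it teleported
```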
0:08:29 So speaking of moving forward, maybe we can talk a little bit,
0:08:34 or you can talk a little bit, about COSMOS and what was announced at CES.
0:08:41 So in CES, Jensen announced the COSMOS World Model Development Platform.
0:08:45 It’s a developer-first world model platform.
0:08:48 So in this platform, there are several components.
0:08:51 One is pre-trained world foundation models.
0:08:52 Right.
0:08:54 We have two kinds of world foundation models.
0:08:58 One is based on diffusion, the other is based on autoregressive.
0:09:03 And we also have tokenizers for the world foundation models.
0:09:06 Tokenizers compress videos into tokens
0:09:09 so that transformers can consume them for their tasks.
0:09:10 Right, right.
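To illustrate what compressing videos into tokens means in terms of scale, here is a back-of-the-envelope Python sketch; the stride values are illustrative assumptions, not the actual Cosmos tokenizer configuration.

```python
def token_count(frames: int, height: int, width: int,
                t_stride: int = 8, s_stride: int = 16) -> int:
    """Token count for a video tokenizer that compresses by t_stride in
    time and by s_stride along each spatial axis (strides are illustrative,
    not the actual Cosmos tokenizer configuration)."""
    return (frames // t_stride) * (height // s_stride) * (width // s_stride)

raw = 64 * 256 * 256                 # 4,194,304 spatio-temporal pixels
tokens = token_count(64, 256, 256)   # 8 * 16 * 16 = 2,048 tokens
ratio = raw // tokens                # each token stands in for 2,048 pixels
```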
0:09:15 In addition to these two, we also provide post-training scripts
0:09:20 to help physical AI builders fine-tune the pre-trained model
0:09:22 to their physical AI setup.
0:09:24 Some cars have eight cameras, right?
0:09:29 And we rely on our world foundation model to predict eight views.
0:09:35 And lastly, we also have this video curation toolkit.
0:09:40 Processing videos is an accelerated computing task.
0:09:43 There is a lot of data that needs to be processed.
0:09:48 And we gather libraries and ready-to-use computation code together.
0:09:53 We want to help world model developers leverage the library to curate data,
0:09:55 whether they want to build their own world models
0:10:00 or fine-tune one based on our pre-trained world foundation models.
0:10:03 So the models provided as part of COSMOS,
0:10:06 those are open to developers to use.
0:10:09 Are they open to other businesses, enterprises?
0:10:12 Yes, so this is an open-weight development platform,
0:10:15 meaning that the models are open-weight;
0:10:18 the model weights are released for commercial use.
0:10:23 And this is important to physical AI builders, right?
0:10:27 So physical AI builders, they need to solve tons of problems
0:10:32 to build really useful robots and self-driving cars for our society.
0:10:36 There are so many problems, and world model is one of them.
0:10:44 And those companies, they may not have the resources or expertise to build a world model.
0:10:47 And NVIDIA cares about our developers,
0:10:51 and we know many of them are trying to make a huge impact in physical AI.
0:10:53 So we want to help them.
0:10:58 That’s why we create this world model development platform for them to leverage
0:11:01 so that they can handle other problems,
0:11:05 and we can contribute our part to the transformation of our society.
0:11:06 Absolutely.
0:11:07 I wanted to ask you,
0:11:12 can you explain a little bit about the difference between diffusion models
0:11:15 and autoregressive models, particularly in this context?
0:11:20 Why offer both, what are the use cases and pros and cons?
0:11:23 So an autoregressive model, or AR model,
0:11:27 is a model that predicts one token at a time,
0:11:30 conditioned on what has been observed, right?
0:11:35 So GPT is probably the most popular autoregressive model.
0:11:36 It predicts one token at a time.
0:11:38 Diffusion, on the other hand,
0:11:44 is a model that predicts a set of tokens together,
0:11:50 iteratively removing noise from these initial noisy tokens.
0:11:53 And the difference is that for AR models,
0:11:56 with the significant amount of investment in GPT-style models,
0:12:00 there are so many optimizations that they can run very fast.
0:12:04 And for diffusion, because tokens are generated together,
0:12:08 it’s easier to have coherent tokens.
0:12:11 The generation quality tends to be better.
0:12:14 And both of them are useful for physical AI builders.
0:12:18 So some of them need speed, some of them need high accuracy.
0:12:19 So both are good.
0:12:20 Excellent.
0:12:23 So far, the most successful autoregressive model
0:12:27 is based on discrete token prediction, like in GPT.
0:12:31 So you pretty much have a set of integer tokens
0:12:33 and you predict them during training.
0:12:35 And in the case of world foundation models,
0:12:40 it means you have to compress videos into a set of integers.
0:12:44 And you can imagine it’s a challenging compression task.
0:12:46 And because of this compression,
0:12:51 the autoregressive model tends to struggle more on accuracy.
0:12:53 But it has other benefits.
0:12:58 For example, it’s easier to integrate into the physical AI setup.
0:12:59 Got it.
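The two generation styles can be contrasted with a toy Python sketch: the autoregressive loop emits one token at a time, while the diffusion loop initializes every position with noise and refines them all jointly. The "denoiser" here is a deliberately trivial stand-in, not a real diffusion model.

```python
import random

VOCAB = [0, 1, 2, 3]

def ar_generate(n: int, rng: random.Random) -> list:
    """Autoregressive: emit one token at a time, each draw standing in for
    sampling p(token | everything generated so far)."""
    history = []
    for _ in range(n):
        history.append(rng.choice(VOCAB))
    return history

def diffusion_generate(n: int, rng: random.Random, steps: int = 4) -> list:
    """Diffusion: initialize ALL positions with noise, then refine the whole
    set jointly for a fixed number of denoising steps."""
    tokens = [rng.choice(VOCAB) for _ in range(n)]   # noise initialization
    for _ in range(steps):
        tokens = [max(t - 1, 0) for t in tokens]     # toy stand-in denoiser
    return tokens

rng = random.Random(7)
ar = ar_generate(8, rng)           # built left to right, token by token
dn = diffusion_generate(8, rng)    # all 8 positions refined together
```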
0:13:01 I’m speaking with Ming-Yu Liu.
0:13:04 Ming-Yu is vice president of research at NVIDIA,
0:13:07 and he’s been telling us about world foundation models,
0:13:09 including the announcement of NVIDIA Cosmos,
0:13:11 the developer platform for world models
0:13:14 that was announced during Jensen’s CES keynote.
0:13:16 So we’ve been talking a lot about,
0:13:18 you’ve been explaining what a world model is,
0:13:21 how it’s similar and different to other types of AI models,
0:13:24 just now the difference between autoregression and diffusion.
0:13:26 Let’s kind of change gears a little bit
0:13:28 and talk about the applications.
0:13:31 How will Cosmos, how are our world foundation models
0:13:33 going to impact industries?
0:13:36 Yeah, so we believe that, first of all,
0:13:40 the world foundation model can be used as a synthetic data generation
0:13:43 engine to generate different synthetic data.
0:13:45 And like what I said earlier,
0:13:50 the world model can also be used as a policy evaluation tool
0:13:54 to determine which checkpoint or which policy
0:13:58 is a better candidate for you to test out in the physical world.
0:14:01 And also, if you can predict the future,
0:14:04 you can probably reconfigure the model to predict the action
0:14:09 toward that future, so it serves as a policy model initialization.
0:14:14 And also to have a Doctor Strange next to you before any endeavor.
0:14:16 So during inference time,
0:14:20 simulate different rollouts and pick the best decision for each moment.
0:14:22 Are there particular industries?
0:14:26 I know working factories and industrial work,
0:14:27 anything involving robotics,
0:14:32 are there specific industries that you see benefiting from world models
0:14:33 maybe sooner than others?
0:14:38 Yes, I think the self-driving car industry and the humanoid robot industry
0:14:42 will benefit a lot from these world model developments.
0:14:47 It can simulate different environments that would be difficult
0:14:53 to have in the real world, to make sure the agent behaves effectively.
0:14:56 So I think these are two very exciting industries,
0:14:58 where the world models can impact.
0:15:01 And NVIDIA obviously has a long history, as you were saying,
0:15:04 of it’s not just about rolling out the hardware,
0:15:07 there’s the software, the stack, the ecosystem,
0:15:09 all of the work to support developers,
0:15:13 because if the devs aren’t building world-changing things with the products,
0:15:14 then there’s a problem, right?
0:15:18 What are some of the partnerships, the ecosystems,
0:15:20 relative to world foundation models?
0:15:23 And maybe there are some partners who are already doing some interesting stuff
0:15:25 with the tech you can talk about.
0:15:28 Yes, we are working with a couple of humanoid robot companies
0:15:30 and self-driving car companies,
0:15:36 including 1X, Waabi, Dioto, S10, and many others.
0:15:39 So NVIDIA believes in suffering.
0:15:43 We believe that true greatness comes from suffering.
0:15:45 So working with our partners,
0:15:50 we can look at the challenges they are facing to experience their pain
0:15:53 and to help us to build a world model platform
0:15:56 that is really beneficial to them.
0:15:57 Fantastic, yeah.
0:16:01 So I think this is the important part to make the field move faster.
0:16:02 Absolutely.
0:16:06 All right, so you talked about being able to predict the future
0:16:09 and you talked about just now that things are moving faster.
0:16:11 What do you see on the horizon?
0:16:13 What’s next for world foundation models?
0:16:16 Where do you see this going in the next five years
0:16:19 or adjust that time frame to whatever makes sense?
0:16:22 So I’m trying to be a world model now,
0:16:23 try to predict the future.
0:16:25 Exactly, yeah, for now it’s fine.
0:16:28 Yes, I believe we are still in the infancy
0:16:32 of world foundation model development.
0:16:35 The model can do physics to some extent,
0:16:37 but not well or robustly enough.
0:16:42 That’s the critical point to make a huge transformation.
0:16:45 It’s useful, but we need to make it more useful.
0:16:49 So the field of AI advances very fast.
0:16:55 From GPT-3 to ChatGPT, it was just a year or two.
0:16:57 Right, we forget it’s all going so quickly.
0:17:00 Yeah, it’s going so fast.
0:17:04 And I believe physical AI development will be very fast too,
0:17:07 because the infrastructure for large-scale models
0:17:09 has been established
0:17:14 through this large language model transformation.
0:17:17 And there’s a strong need to have physical AI assistants.
0:17:20 So there’s a big push for humanoids.
0:17:22 And there are also a lot of investments.
0:17:25 So we have the great foundation.
0:17:30 And many young researchers want to make a difference.
0:17:33 And we also have great need and investments.
0:17:35 I think this is going to be a very exciting area
0:17:37 and it’s going to move very fast.
0:17:42 I don’t want to say that it will be solved in five years or three years.
0:17:45 So I think it’s still a long way.
0:17:48 And more importantly, we also need to study
0:17:52 how to best integrate these world models
0:17:56 into the physical AI systems in a way that can really benefit them.
0:18:00 Right, and does that come through just working with partners
0:18:02 out in the field, kind of combining research with application
0:18:05 and iterating and learning?
0:18:06 Yeah, I believe so.
0:18:07 I believe in suffering.
0:18:12 So I believe that working hand in hand with our partners
0:18:16 to understand their problems is the best way to make progress.
0:18:19 For folks who would like to learn more
0:18:22 about any aspects of what we’re talking about,
0:18:24 there are obviously resources on the NVIDIA site.
0:18:28 And of course, the coverage of Jensen’s keynote and the announcements.
0:18:30 Are there specific places, maybe a research blog,
0:18:34 maybe your own blog or social media channels,
0:18:39 where people can go to learn more about NVIDIA’s work with world models
0:18:42 and anything else you think the listeners might find interesting?
0:18:49 Yes, so we have a white paper written for the Cosmos world foundation model platform.
0:18:52 And we welcome you to download and take a read
0:18:55 and let me know how, you know, whether it’s useful to you
0:18:59 and let me know the feedback and we will try to do better for the next one.
0:19:03 Excellent. Ming-Yu, it was an absolute pleasure talking to you.
0:19:05 I definitely learned more about world models
0:19:09 and some of the particulars and the applications going forward.
0:19:10 So I thank you for that.
0:19:12 I’m sure the audience did as well.
0:19:14 But, you know, the work that you’re doing, as you said,
0:19:16 it’s early innings and it’s all changing so fast.
0:19:19 So we will all keep an eye on the research that you’re doing
0:19:22 and the applications and best of luck with it.
0:19:24 And I look forward to catching up again
0:19:27 and seeing how quickly things evolve from here on out.
0:19:28 Thank you. Thanks for having me.
0:19:32 It’s been fun and I hope next time I can share more, you know,
0:19:35 maybe more advanced version of the world model.
0:19:38 Absolutely. Well, thank you again for joining the podcast.
0:19:39 Thank you.
0:19:42 (dramatic music)

As AI continues to evolve rapidly, it is becoming more important to create models that can effectively simulate and predict outcomes in real-world environments. World foundation models are powerful neural networks that can simulate physical environments, enabling teams to enhance AI workflows and development. Ming-Yu Liu, vice president of research at NVIDIA and an IEEE Fellow, joined the NVIDIA AI Podcast to talk about world foundation models and how they will impact various industries.
https://blogs.nvidia.com/blog/world-foundation-models-advance-physical-ai/
https://www.nvidia.com/cosmos/
