AI transcript
0:00:15 Hello, and welcome to the NVIDIA AI podcast. I’m your host, Noah Kravitz.
0:00:19 It’s often said that modern cars are computers rolling along on wheels.
0:00:22 From performance and safety systems to in-vehicle infotainment,
0:00:27 computers control and oversee many, many functions in today’s vehicles.
0:00:30 Autonomous driving systems, of course, are no exception.
0:00:36 The quest to build self-driving cars depends on compute power and, yes, data. Lots of data.
0:00:39 Here to delve into the inner workings of autonomous vehicles
0:00:42 and the increasingly vital role of data in modern carmaking
0:00:47 are Tin Son, technical lead for vision language action models at Porsche,
0:00:51 and Brian Moore, CEO and co-founder of Voxel51,
0:00:54 whose visual AI and computer vision data platform, FiftyOne,
0:01:00 is used by customers across a range of industries, including, you guessed it, Porsche.
0:01:04 Tin, Brian, welcome to the NVIDIA AI podcast,
0:01:06 and thank you so much for taking the time to join.
0:01:07 Thanks for having us.
0:01:09 Thank you for having us.
0:01:12 So maybe we can start with each of you introducing yourselves
0:01:17 and just kind of talking a little bit about what you do at your respective companies
0:01:20 and a little bit about how that relates to autonomous systems.
0:01:21 And, of course, we’ll get into it.
0:01:23 So, Tin, maybe you can start.
0:01:25 Right. So, I’m Tin.
0:01:28 I’m a PhD student and tech lead at Porsche AG, like you said.
0:01:31 I’m dealing with vision language action models for autonomous driving.
0:01:37 And in my research, I want to turn cars into embodied agents that can understand space,
0:01:40 time, and physical properties of the real world,
0:01:44 so that they are able to act within it and interact with the driver through natural language
0:01:47 or through pose, facial expressions, or gesture.
0:01:48 Fantastic.
0:01:52 And, Brian, tell us a little bit about Voxel51.
0:01:56 Yeah. So, as you mentioned, I’m Brian, the co-founder and CEO here at Voxel51.
0:01:58 First, my background.
0:02:00 So, I’m a geek, a nerd by background.
0:02:03 I have a PhD in machine learning from the University of Michigan.
0:02:04 Go Blue?
0:02:04 Exactly.
0:02:07 That is where, over 10 years ago, I met my co-founder, Jason,
0:02:09 who is a faculty member at Michigan.
0:02:12 We started off doing some consulting work,
0:02:16 of course, being located in Ann Arbor, just down the road from Detroit, the Motor City.
0:02:20 We had the opportunity and great pleasure to collaborate with a number of automakers
0:02:23 10-plus years ago in early versions of autonomy,
0:02:27 where we kind of reached that key insight that, in theory,
0:02:28 it’s all about models and algorithms.
0:02:32 In practice, it’s all about data and data quality and data strategy,
0:02:38 which led us to the opportunity and need to provide our product, FiftyOne,
0:02:40 to help solve some of those data challenges.
0:02:41 Very cool.
0:02:43 And we’ll get into, as I said, as we talk,
0:02:45 we’ll get into a little bit more about what FiftyOne is
0:02:48 and what Voxel51 does with customers like Porsche.
0:02:52 But maybe, Tin, let’s start with you.
0:02:55 On the podcast, we’ve said, I think it’s actually an Andrew Ng quote originally,
0:02:58 but we like to say that AI is like the new electricity.
0:03:03 It’s sort of there in the background, powering more and more, you know,
0:03:06 everything that we do across industries and research disciplines,
0:03:09 and it’s kind of providing power that we run on.
0:03:12 Why is it so important in the automotive industry,
0:03:15 and when we’re talking about autonomous vehicle systems,
0:03:19 why is it so important to organize and understand these huge amounts of data,
0:03:23 whether they’re coming from real-world capture or generating synthetic data?
0:03:28 Why does that play such a big part in developing autonomous driving systems?
0:03:32 So to deal with this increasing scenario space in the open world,
0:03:36 the industry is shifting from pre-specifying scenarios
0:03:40 and pre-specifying everything which can happen in modular pipelines,
0:03:45 which are only partially supported by AI models towards end-to-end pipelines,
0:03:48 where AI plays the major role or plays the sole role.
0:03:51 And instead of specifying all the scenarios,
0:03:55 we are training on data and directly mapping inputs to outputs,
0:03:57 to actions which the models have to perform.
0:03:59 So we are removing the inductive bias
0:04:02 and removing all the rules and knowledge we have
0:04:04 and training from the data.
0:04:09 This is mainly because we are moving from level two assisted driving
0:04:11 to fully autonomous systems
0:04:15 that have to operate in conditions which are partially also unknown to us,
0:04:18 which we are not able to specify fully.
0:04:22 And in order to act safely under these conditions,
0:04:24 we have to collect lots and lots of data,
0:04:28 so billions of kilometers of driving data which need to be collected.
0:04:33 And not only that we need to explore this whole scenario space
0:04:34 to provide safe driving solutions,
0:04:39 but also we have long-tail distributions of different traffic scenarios,
0:04:41 of different interactions of agents.
0:04:45 And not all data has the same value for the model.
0:04:48 We have different kinds of redundancies and imbalances,
0:04:50 in the training data for these models.
0:04:52 And this is where we need to explore the data,
0:04:55 where we need to capture the data, which is mainly unlabeled.
0:04:58 We need to implement automated labeling pipelines
0:05:02 in order to understand which are new scenarios,
0:05:05 important scenarios having impact on the agents,
0:05:08 and which are scenarios which are less important,
0:05:10 which are redundant, which happen more often.
0:05:14 For instance, we have lots of scenarios with walking pedestrians on zebra crossings,
0:05:22 but there are not so many situations where a helicopter has to perform a landing in the ego lane of an autonomous vehicle, for instance.
0:05:24 Right. But it could happen.
0:05:25 Yeah, exactly.
0:05:32 The other thing which is important in that regard is that there’s not that much ground truth data for different modalities,
0:05:33 for different sensor setups.
0:05:36 So in order to be able to train models,
0:05:40 we need different kinds of data from different modalities.
0:05:44 For instance, spatial data or data from multiple sensor setups, from multiple embodiments.
0:05:51 Or in terms of our research, also agent interactions with the physical world or visual question answering data
0:05:54 are sparsely found in available real world data.
0:06:02 So we need a data curation platform like Voxel51 provides in order to harness different methods,
0:06:06 in order to create meaningful pipelines, meaningful data labeling pipelines,
0:06:10 which are relevant for our training, for the training of the models,
0:06:13 but also for the validation during the operations of the models.
0:06:18 We need to validate and continuously observe because the scenario space is so large
0:06:23 that we have the obligation to observe it during the operation too.
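To make that curation step concrete, here is a minimal sketch of how redundant versus long-tail samples might be surfaced, assuming a FiftyOne-style workflow; the dataset name and tag values are hypothetical placeholders, not anything from the conversation.

```python
# Minimal sketch: separating redundant frames from long-tail ones with FiftyOne.
# "drive-logs-2024" and the tag names are hypothetical placeholders.
import fiftyone as fo
import fiftyone.brain as fob

dataset = fo.load_dataset("drive-logs-2024")  # hypothetical dataset of camera frames

# Score how visually unique each sample is relative to the rest of the dataset
fob.compute_uniqueness(dataset)

# Low-uniqueness samples: over-represented scenes (e.g., yet another empty highway)
redundant = dataset.sort_by("uniqueness").limit(1000)
redundant.tag_samples("candidate-for-downsampling")

# High-uniqueness samples: long-tail scenes worth prioritizing for labeling
rare = dataset.sort_by("uniqueness", reverse=True).limit(1000)
rare.tag_samples("long-tail-review")

# Inspect both slices visually before acting on them
session = fo.launch_app(dataset)
```

The same pattern extends to embedding-based similarity search when the goal is to find more examples of one specific rare scenario rather than long-tail samples in general.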
0:06:28 So to go back to a couple of things you said, first, just kind of the level set.
0:06:32 You mentioned the levels of autonomous vehicle systems
0:06:38 and the difference between level two and then going to a fully autonomous system.
0:06:44 Can you just briefly give an overview of what the levels are and kind of where today’s technology is at?
0:06:44 Yeah.
0:06:51 So in current vehicles, we mostly have level two systems, which are driver assistance systems.
0:06:55 And these systems don’t operate on their own.
0:06:57 They cannot act in the environment.
0:06:59 They just support the driver in certain tasks,
0:07:01 for instance, lateral or longitudinal movement.
0:07:06 And in the future or in the near future, we will have fully autonomous systems.
0:07:08 So these are like the lane keeping systems?
0:07:09 Exactly.
0:07:15 Lane keeping systems and automated distance systems, which keep the distance.
0:07:17 Like following behind another car and…
0:07:18 Following behind.
0:07:18 Got it.
0:07:18 Okay.
0:07:20 And so that’s level two.
0:07:22 And then getting up to…
0:07:25 Is it level five that’s considered fully autonomous?
0:07:25 Yeah.
0:07:27 So level five is fully autonomous.
0:07:30 This is basically like a human agent which can act in the real world.
0:07:40 So if we have an AI which is able to act fully autonomously, with no difference from how a human agent would act in the environment, we have level five.
0:07:40 Right.
0:07:43 We are not striving for level five yet.
0:07:50 We are striving for level four, which is that in most domains, autonomous agents can act and interact.
0:07:56 But there are some borders, some boundaries, which we call operational design domain boundaries, boundaries of the system, which we need to consider.
0:08:06 The system needs to be able to identify these boundaries and to have fallback loops where the driver can engage with the system.
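For readers who want the taxonomy Tin is referencing spelled out, here is a rough sketch of the SAE driving-automation levels; the wording is paraphrased from SAE J3016 rather than quoted from the conversation.

```python
# Rough, paraphrased summary of the SAE J3016 driving-automation levels.
SAE_LEVELS = {
    0: "No automation: the human does all of the driving",
    1: "Driver assistance: steering OR speed support (e.g., adaptive cruise control)",
    2: "Partial automation: steering AND speed support; the driver must supervise",
    3: "Conditional automation: the system drives in limited conditions; the driver must take over on request",
    4: "High automation: the system drives itself within its operational design domain (ODD)",
    5: "Full automation: the system drives anywhere a human could, with no ODD limits",
}

def requires_driver_supervision(level: int) -> bool:
    """Levels 0-2 keep the human responsible for monitoring at all times."""
    return level <= 2
```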
0:08:11 So, Brian, where does Voxel51 fit into this?
0:08:13 How do you work with Porsche?
0:08:20 And then if you like also, you can talk a little bit more about the company, the platform, and how it serves other customers in general.
0:08:21 Yeah, absolutely.
0:08:30 So, you know, like Tin mentioned, our platform, our product, exists to put data at the center of his team’s work, and indeed of all visual AI projects.
0:08:38 For context, what we see today is that models and weights, while very important, are becoming increasingly commoditized or publicly available.
0:08:54 And therefore, the key differentiator between the success and failure, you know, leader versus follower status of a product or company lies in their ability to turn the data they have access to into, you know, what we call actionable insights or intelligence for their systems.
0:09:03 So, in other domains, let’s say more legacy domains and use cases built on structured data, data that fits in spreadsheets and tables and so forth,
0:09:10 there’s quite a lot of mature tooling that exists out in the market to help you make sense of that data, run computations, process that data.
0:09:28 However, in our experience, when it comes to visual data, image data, video data, 3D data, all the associated metadata that needs to be understood and processed by AI systems, there was really a dearth of tools in the market that played that role of being that key platform that puts data at the center of all the development work.
0:09:47 So, we experienced that ourselves through our research and ultimately identified that a product needs to exist in the market that’s open and extensible, that teams like Tin’s at Porsche can adopt, feed their data into, and really facilitate the end-to-end development of their systems.
0:10:07 So, that encompasses data annotation or labeling, data curation, and importantly, evaluating models, and then turning that flywheel, whereby you identify failure modes or gaps in a model’s performance, and turn that into the right decisions about what new data should I go out and gather to address a specific failure mode.
0:10:21 Whether it be real data that’s available through an autonomous fleet of vehicles in the world, or maybe synthetic data that’s increasingly becoming important to fill in gaps that are unusually hard or difficult to get your hands on.
0:10:24 The helicopter landing next to me on my daily commute.
0:10:25 Exactly.
0:10:29 You may not have many examples of that at your fingertips.
0:10:42 However, in order to truly get to L5 systems, we need to have confidence that these, you know, agentic systems are going to be able to respond and react to those kind of extreme, but nonetheless important events that could occur in practice.
0:10:43 Yep, yep.
0:10:45 I mean, less unlikely, right?
0:10:50 But, you know, I always think of kids darting out into the road, you know, on a bike, on foot, whatever.
0:10:55 And, yeah, I want my fully self-driving car to understand how to avoid the kid, the helicopter, 100%.
0:10:56 Exactly, yeah.
0:11:00 Yeah, and so just to sum up what our product offers, it’s exactly that.
0:11:26 If you want to, for example, deep dive into the performance of an autonomous vehicle in situations where there’s a crowded intersection at night with low light, and you want to develop confidence or trust that that system’s going to respond correctly, you need the ability to perform that query, analyze that model’s performance, and if it’s not up to par yet, take the right actions to build better data sets so that you can get to where you need to be in terms of performance.
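As a concrete illustration of the query-and-evaluate loop Brian describes, here is a hedged sketch using FiftyOne; the dataset name and metadata field names are hypothetical and depend entirely on how your scenes are annotated.

```python
# Hedged sketch of "query a slice, evaluate the model on it, act on the gaps".
# Dataset and field names ("scene.time_of_day", "scene.crowd_level", etc.) are
# hypothetical; substitute whatever metadata your recordings actually carry.
import fiftyone as fo
from fiftyone import ViewField as F

dataset = fo.load_dataset("av-perception-eval")  # hypothetical

# 1) Query the slice of interest: crowded intersections at night
night_crowded = dataset.match(
    (F("scene.time_of_day") == "night") & (F("scene.crowd_level") == "high")
)

# 2) Evaluate detections against ground truth on just that slice
results = night_crowded.evaluate_detections(
    "predictions", gt_field="ground_truth", eval_key="night_eval"
)
results.print_report()

# 3) Tag the samples with missed objects to drive targeted data collection
failures = night_crowded.match(F("night_eval_fn") > 0)
failures.tag_samples("needs-more-data")
```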
0:11:27 That’s fantastic.
0:11:27 Fantastic.
0:11:31 So, Tin, in the real world, Brian was talking about building that trust.
0:11:40 How do you, how does Porsche go about making sure that its autonomous driving systems are safe, are road-ready before releasing them?
0:11:42 How do you test?
0:11:45 How do you, how does simulation come into play?
0:11:47 We’re talking about the role of data and everything.
0:11:49 Can you talk a little bit about your process?
0:11:54 So, when it comes to simulation, it is an increasingly important and valuable tool for us.
0:12:03 And we have talked about the increasing complexity of the systems, moving towards end-to-end systems, moving towards maybe embodied agents which interact with the driver.
0:12:10 And we have talked about the increasing complexity of the scenario space, where we have more and more scenarios which have different parameters.
0:12:15 And simulation really gives us the ability to ask the question, what could happen?
0:12:18 So, we don’t only have recorded data.
0:12:22 We also have interaction between agents in the environment.
0:12:35 We have different models where we can evaluate the feasibility in all of the state and action space, not only in the recorded state and action space we have from our real test drives.
0:12:45 And this is really, really, really valuable because this gives us the opportunity to have models which are generalized over the whole environment and everything which can happen.
0:12:59 And simulation enables us to do that by becoming increasingly high fidelity, by providing increasingly realistic environments, increasingly realistic sensor models, and also behavioral models for different agents.
0:13:11 And this is important for us because we really want to have safe systems on the road and only through simulation we can capture this because in the real world there’s so much which can happen.
0:13:14 And there’s also some scenarios which are not so easy to replicate.
0:13:23 For instance, if a helicopter lands on the road, we cannot create a test case in the real world without harming someone or without the danger of harming someone.
0:13:33 So we need simulation and also synthetic data generation to capture these scenarios and we have the ability to do that with an increasing level of realism.
0:13:51 In the latest developments, there also have been additional aspects of simulation like improving the fidelity with generative models, by feeding scenarios into a generative model and getting more realistic outputs which almost resemble real video footage.
0:13:53 And you see that also in NVIDIA Cosmos.
0:14:07 So we are really excited to see the developments which are happening currently in that area, which improve our realism and improve the fidelity of the simulation and therefore also the validity to test in simulation environments.
0:14:16 You mentioned that accounting for the unknown or the unpredictable is a big issue with developing safe autonomous driving systems.
0:14:29 Are there particular known scenarios, you know, things that happen in the real world that are just really difficult problems to solve when it comes to building an autonomous system to deal with it?
0:14:34 You know, you mentioned low light, crowded intersection, nighttime, those kinds of things.
0:14:41 Are there particular scenes or even particular variables that just really pose a challenge for working on these systems?
0:14:46 Definitely. So we have certain physical limitations of sensors.
0:14:50 For instance, radar and LIDAR sensors have certain limitations.
0:14:57 LIDAR sensors are limited by, for instance, rain or water reflections or reflective surfaces.
0:15:04 While radar sensors can be noisy in the presence of certain magnetic fields or metals.
0:15:09 And camera, of course, we know when camera becomes noisy in many weather situations.
0:15:21 So we have different physical limitations, and we also need to model them and to understand in which situations agents can act, and where these agents have physical limitations too.
0:15:31 And simulation also provides, in the latest development through realistic sensor models, also the ability to evaluate virtually in that regard.
0:15:42 When it comes to agent interactions, we have a lot of limitations because of the problem of domain generalization.
0:15:47 We have different types of scenarios where agents need to interact.
0:15:58 And if the data is not present in the training data, we have the issue of learning concepts and applying concepts like human drivers do.
0:16:06 For instance, if I’m a human driver, I see certain entities in the environment, which I never saw, but I can anticipate them.
0:16:16 But an AI agent has difficulties anticipating them, because if it didn’t see them in the training data, it cannot generalize to the unseen event.
0:16:19 And this is also where these methods come into play.
0:16:21 We need to capture also these situations.
0:16:31 One thing I would add on the synthetic data front, one thing we’re seeing is that it’s becoming increasingly important to be able to generate variations of a scene.
0:16:36 For example, you know, we’re excited to integrate with NVIDIA’s Cosmos World Foundation models in our product.
0:16:43 And one thing that enables is you can import a realistic scene, parametrize it, and then tune different things.
0:16:46 What would it look like if that vehicle was a different shade of beige?
0:16:50 Or what would it look like if there was another vehicle or a pedestrian in the same scene?
0:16:52 Or maybe let’s change the weather conditions.
0:17:00 And that ability to kind of play around with a realistic scene is important to kind of develop a trust that the system is going to, you know,
0:17:04 be able to deal with all the different variations that it might see of that scene in practice.
0:17:06 Right, right.
0:17:11 Is that a manual process or are you automatically generating the variations to put them back into the system?
0:17:19 Yeah, that’s one of the exciting things about the software integration we have with the, you know, the Cosmos Foundation models as an example.
0:17:34 Users of our product can, you know, identify a scene and then, you know, click a button to automatically generate all of those variations and pull them back into their training data set and indeed automate the analysis or assessment of all those different variations.
0:17:37 So we think that having humans in the loop is definitely important.
0:17:47 It would be a mistake to sort of, you know, ship a system without having a human’s eyes on the scene to identify or spot check or build trust in performance in key scenarios.
0:17:53 But of course, the volume of data that you need to get to the performance that you need is immense.
0:17:57 And so you have to leverage, you know, automation whenever possible.
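The variation sweep Brian describes could look something like the sketch below. Note that FiftyOne’s actual Cosmos integration has its own interface; the `generate_scene_variation` function here is a made-up placeholder standing in for whatever world-model backend you call.

```python
# Illustrative only: sweeping parametrized variations of one recorded scene.
# `generate_scene_variation` is NOT a real FiftyOne or Cosmos API; it is a
# placeholder for the generation backend you would actually wire in.
from dataclasses import dataclass
from itertools import product
from typing import Optional

@dataclass(frozen=True)
class SceneParams:
    weather: str
    time_of_day: str
    extra_agent: Optional[str]  # e.g., "pedestrian", "cyclist", or None

def generate_scene_variation(source_clip: str, params: SceneParams) -> str:
    """Placeholder for a world-foundation-model call that renders a variation
    of `source_clip` under `params` and returns the path to the new clip."""
    raise NotImplementedError("wire this to your generation backend")

# A small grid of conditions around one recorded intersection scene
variations = [
    SceneParams(weather, time, agent)
    for weather, time, agent in product(
        ["clear", "rain", "fog"], ["day", "dusk", "night"], [None, "pedestrian"]
    )
]

# for params in variations:
#     clip = generate_scene_variation("scenes/intersection_017.mp4", params)
#     # ...add `clip` back into the training set and re-run evaluation
```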
0:18:04 Brian, as Voxel’s worked with Porsche and other leading automakers, what’s been surprising to you along the way?
0:18:11 Maybe unanticipated challenges, maybe getting past a hurdle in a surprising way.
0:18:15 What are some of the things that stand out to you as you think about working with automakers?
0:18:33 Yeah, I think the main thing I would say, you know, as Tin was describing all of the very interesting sort of software and, you know, related systems challenges in bringing autonomy to market, what we’re seeing is that the leaders in the space really are reinventing themselves, not as automakers, but really as software companies.
0:18:34 Right.
0:18:41 And that’s the type of sort of skill and expertise that’s going to be needed to really solve these problems and bring a differentiated product to market.
0:18:54 There’s a trend today where, you know, if you think of the software component of autonomy as something that you can procure off the shelf, maybe through a vendor, then it can get you to a certain level of performance.
0:19:03 But the key, you know, as I argued before, the key to, you know, really industry leading performance is to harness the data that your company has access to.
0:19:11 You really need to bring the development and iteration of that software in-house rather than just outsourcing it to reach kind of leading status.
0:19:23 Yeah, from a consumer’s perspective, when you start talking about software and automakers, I think of infotainment systems just, you know, as a driver, as a passenger, kind of the first light-up thing that I see, right?
0:19:36 And I’ve been following a little bit from afar, but, you know, kind of this almost like dance between the automakers and some non-automotive software makers and, you know, mobile phone makers in particular, as you get into plugging your phone into the car.
0:19:42 And then maybe like with Apple CarPlay or Android Auto, it takes over the in-car infotainment and that kind of thing.
0:19:53 But then listening to you talk about it and thinking, well, okay, let’s move from infotainment to something I can’t even imagine how complex it really is as an autonomous driving system.
0:20:02 And it makes perfect sense to me, Brian, what you’re saying, that to really develop a top-notch system, you can’t just grab something off the shelf and plug it in.
0:20:07 Like you’ve got to be, you know, shaping it to fit your vehicles, all the data you have, all of that kind of stuff.
0:20:12 Yeah. And it’s also important, I think, to not take it all in-house and say that you can do everything.
0:20:22 I think an effective pattern that we’ve seen in the market, and of course, Tin would be an authority on this more than me, is, well, let’s first focus on being able to validate the performance of systems.
0:20:30 So we can truly understand if we’re going to work with a vendor for a certain piece of technology, can we truly understand its performance and develop trust in it?
0:20:37 And then we can evaluate whether it makes sense to bring certain aspects of the system in-house so we can fine-tune it with our own data and so forth.
0:20:43 So that focus on validation and evaluation is definitely important in the short term.
0:20:48 Brian, you may have said this at the beginning, so forgive me, but how long ago was Voxel 51 founded?
0:20:59 Yeah. So Voxel51 started over 10 years ago now as a kind of a consulting partnership between myself and my co-founder, Jason. As a sort of, you know, venture-backed software company,
0:21:02 that journey started in 2018, so about seven years in the market.
0:21:07 Okay. So along that path, almost a decade now, or, you know, seven years to market and a decade,
0:21:20 since you started, have there been particular technological breakthroughs that have really allowed Voxel to just take your practice to the next level and, you know, do things, offer things to customers you couldn’t?
0:21:27 And particularly when it comes to, I mean, this is what you do, but when it comes to wrangling and managing and understanding these vast qualities of data,
0:21:34 are there particular just technological advances along the way that have really, you know, made the work that you do possible?
0:21:40 Certainly there’s been just immense technological innovation on the model and algorithmic side.
0:21:44 Yeah. It’s one of those questions where I’m sort of like, have there been innovations in the past 10 years?
0:21:45 Yeah, maybe a couple, you know.
0:21:52 Yeah. Yeah. So one of the, you know, so there’s been clear advances in, you know, technologies like the transformer architecture,
0:21:56 which kind of leveled up another kind of order of magnitude of performance potential.
0:22:01 And then, of course, we had all of the, you know, advancements in chat GPT and large language models.
0:22:12 And now we have models in vision that are more multimodal in nature, vision language models that can pull in information from text and audio and fuse that with vision.
0:22:21 And the long-term, I think, vision in the space is that we need models that can go directly from pixels or sensor inputs directly to actions or decisions.
0:22:25 And that kind of end-to-end system is kind of the holy grail.
0:22:27 And it’s very exciting to see all of that develop.
0:22:37 The lesson learned for us is that it always requires more data, more data, more data to organize, to understand, to sift through, to find kind of the needles in the haystack.
0:22:47 And so the number one feature request we get from our customers is definitely, hey, you know, as I’m thinking about my plans and my goals for next year, it involves an order of magnitude plus more data,
0:22:52 and the need and ability to connect to more GPUs to compute on that data.
0:23:03 And so we’ve definitely benefited from the rapid pace of innovation and, you know, distributed computing and related technologies that our platform can plug into to help deliver that scale to customers.
0:23:13 To dig in on something you said real quick, and correct me where I’m wrong here, but I think you said that the holy grail, as you put it, is moving to a system where a model can go from pixel to action directly.
0:23:15 Where are we at now?
0:23:17 How do you get from pixel to action currently?
0:23:17 Yeah.
0:23:23 First of all, just to unpack the historical context there, we refer to the space as visual AI.
0:23:26 What you may have referred to it as in the past is computer vision.
0:23:26 Right.
0:23:36 And that historically has represented kind of very low level tasks, like taking an image and classifying it as a certain animal or drawing a box around a certain object.
0:23:38 So that’s a very kind of low level task.
0:23:46 It’s important information and the system needs to understand, you know, the content of an image or a video stream in order to reason about it.
0:24:02 However, I think the lesson that we keep learning, even on sort of the language model side, is that to the extent that we can push the system to be more end to end and, you know, have the authority to do a lot of reasoning itself and go directly from raw inputs to the decision,
0:24:06 there’s a capacity for more sort of intelligence.
0:24:07 Are we at that step yet?
0:24:08 Certainly not.
0:24:14 But I think that’s where the leading edge is in terms of research and the work that Tin and his team at Porsche are doing.
0:24:15 It’s very exciting times.
0:24:16 Gotcha.
0:24:24 A long time, well, not that long ago, but a while ago, I read a novel that was kind of a, you know, not quite cyberpunk, but along those lines, tech heavy.
0:24:38 And one of the little threads in the book was about self-driving cars and near future freeway system in the United States where the cars talk, you know, automatically to each other and to the toll taking systems on the roads and all that kind of stuff.
0:24:48 But one of the themes that came out was, you know, if every car on the road was autonomous, that would be potentially the ideal situation for human safety.
0:25:01 Because if all the vehicles were self-driving and they’re all performing at a good level, they’re going to be able to make much better decisions than human drivers just because of the raw ability to compute and deal with all the data and just everything humans can’t quite do.
0:25:10 Given where we’re at now in reality and level two systems and all the stuff you’ve been talking about, how do we use technology?
0:25:16 How can technology help improve, help build human trust in autonomous vehicles?
0:25:29 So the question you asked is how to establish trust with the human, and humans have the most trust in systems which are explainable and are acting similarly to humans.
0:25:39 So one big limitation also current end-to-end systems have is that they are not able to describe why they are taking certain actions.
0:25:52 They do not derive their actions from basic concepts. Humans can understand from several hours of driving how to drive in the world, while autonomous systems can’t do that.
0:26:19 They have to have big amounts of data in order to derive actions. The shift needs to happen towards systems which are able to derive and generalize based on certain contexts derived from human knowledge, from basic concepts like we learn in driving school: how to act in the environment, and also how to explain and describe these actions.
0:26:42 And only when we are able to do that, and this is also being done in the research on foundation models in the area of autonomous driving, will we have trustworthy systems which are also able to interact with the driver, with the co-driver in that case, and to explain their actions and also to explain when certain boundaries are met.
0:26:46 Like human drivers also have certain boundaries, they also cannot perform in all situations.
0:26:50 So this is the shift we need to make.
0:26:50 Right.
0:26:50 Brian?
0:26:53 Just to add a few other angles to that.
0:26:56 So I try to hold two things together at once.
0:27:17 One is the excitement for the long-term future of fully connected, fully autonomous vehicles, which are orders of magnitude safer than humans and can benefit from information that human drivers don’t have, like the ability to directly communicate with other vehicles, or other sensor types, like LIDAR, that offer information that humans don’t have.
0:27:25 At the same time, I think we’d be remiss not to be very excited about all of the very concrete benefits that are already in the market today, specifically around safety.
0:27:36 You know, the fact that my vehicle today has L2 systems that can automatically detect, you know, maybe I’ve lost focus on the highway and a car in front of me is stopped in traffic.
0:27:44 My vehicle can already react to that to keep me safe, and I think there’s so much more potential to roll out, in more of an incremental way,
0:27:48 AI-enabled advancements that focus specifically on trust and safety.
0:28:05 You know, Tin painted a great picture of what’s needed from an explainability and transparency standpoint on the technology, but I’m also very excited about the very concrete benefits that everyday, you know, drivers can see over the next five years while we continue working towards the grander future.
0:28:13 And the sort of potential of a fully autonomous network is the kind of technological excitement that keeps pushing the pace of innovation forward.
0:28:14 Yeah.
0:28:18 And we’re happy to, you know, help do our part to enable that to happen at Voxel.
0:28:18 Yeah.
0:28:21 But definitely appreciate both aspects and the tension between them.
0:28:21 Right.
0:28:27 Tin, along those lines, what excites you the most about the future of autonomous vehicles at Porsche?
0:28:32 So I’m probably biased because I’m conducting research in that area, but what really excites me…
0:28:33 I hope you are.
0:28:40 What really excites me is turning cars into Knight Rider, if you know the series Knight Rider.
0:28:41 Oh, yeah.
0:28:49 So turning cars into embodied agents, which are able to anticipate the environment similarly to how humans do.
0:29:03 So we can leverage foundation models, leverage the world knowledge and the web context we have, in order not to have direct mappings from inputs to actions like end-to-end models have, but to have some kind of reasoning process in between.
0:29:16 And to generalize over different scenarios, regardless of embodiment. End-to-end approaches are maybe error-prone and sensitive to different embodiments, to different sensor setups.
0:29:23 If we change the sensor setup, we need to retrain the whole system in many cases.
0:29:40 And foundation models promise to have some kind of intermediate representation, which incorporates web context and incorporates world knowledge and gives us the ability to generalize on different kinds of scenes and act in these scenes.
0:29:46 Even if not previously trained on that specific scene, because we have this reasoning process.
0:29:52 And another part of the whole puzzle is that we have completely new types of interactions.
0:29:54 We have vision language navigation.
0:29:57 Me as a driver, I can tell the car what to do.
0:30:05 And the car knows about its abilities and can choose certain actions, which it knows, and create new experiences for the driver.
0:30:25 So for instance, looking for a parking spot in the shade, or any other situational request the driver could have, we can create new actions which are not previously trained by any model, but are derived from the context and from the world knowledge which is pre-trained into those models.
0:30:31 So there’s a lot of potential in there, but there’s also a lot that needs to be done in that regard.
0:30:40 In our research, we have investigated what foundation models need to be capable of in order to fulfill this task of vision language navigation.
0:30:46 And together, in close collaboration with Voxel51, we have identified four areas.
0:30:51 And this is semantic understanding, which is classes, affordances and attributes.
0:30:57 Spatial understanding, which is the locations and the orientations of objects with respect to each other.
0:31:01 Temporal understanding, which is the development over time in the past and the future.
0:31:09 And physical understanding, most importantly, which is the world model, the physical rules like forces or gravity applied to the environment.
0:31:24 And we have identified that current foundation models are very good in semantic understanding, like deriving these basic concepts of scenarios.
0:31:31 But we need to improve them in spatial, temporal and physical understanding in order to really grasp the task of vision language navigation.
0:31:35 And we’re really, really excited about what’s happening currently in this area,
0:31:40 also in research done by NVIDIA and other players in that space.
0:31:42 And we are looking forward to that research.
0:31:54 And on our own, we also train models and create models, which are hopefully able to grasp spatial temporal understanding and also physical rules of the environment in order to interact with it.
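To show how the four capability areas might be used in practice, here is a small, hypothetical scoring sketch; the capability names come from the discussion, while the item format and scoring logic are illustrative only, not the benchmark Tin mentions.

```python
# Hypothetical sketch: reporting benchmark accuracy per capability area so that
# weaknesses in spatial, temporal, or physical understanding stay visible even
# when overall accuracy looks fine. Item format is illustrative only.
from collections import defaultdict
from typing import Dict, List

CAPABILITIES = ("semantic", "spatial", "temporal", "physical")

def score_by_capability(items: List[dict]) -> Dict[str, float]:
    """Each item: {"capability": one of CAPABILITIES, "correct": bool}."""
    totals, hits = defaultdict(int), defaultdict(int)
    for item in items:
        totals[item["capability"]] += 1
        hits[item["capability"]] += int(item["correct"])
    return {cap: hits[cap] / totals[cap] for cap in CAPABILITIES if totals[cap]}

# Example: a model that handles semantics well but misses temporal/physical items
print(score_by_capability([
    {"capability": "semantic", "correct": True},
    {"capability": "semantic", "correct": True},
    {"capability": "spatial", "correct": True},
    {"capability": "temporal", "correct": False},
    {"capability": "physical", "correct": False},
]))
# -> {'semantic': 1.0, 'spatial': 1.0, 'temporal': 0.0, 'physical': 0.0}
```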
0:32:07 It’s good that you guys are the ones working on safety, because, Tin, as soon as you started talking about the driver requesting the car to do something, I immediately thought, man, if I had a Taycan, I could just ask it to get me places fast.
0:32:11 But, you know, I don’t think that’s quite what we’re getting at here.
0:32:23 But, Brian, along the lines of safety, where do you see the biggest opportunities for improving autonomous vehicle safety through simulation, through, you know, access and tools to help you use better data?
0:32:29 Yeah. So for us, it always comes back to data and the important role that data plays in the success of AI.
0:32:29 Sure.
0:32:35 And, you know, fortunately, there’s some pretty exciting technological advancements happening in the data space.
0:32:43 So historically, one of the most onerous and costly and time-consuming aspects of an AI project was data annotation.
0:32:49 The need to gather a data set and have it labeled to sort of teach the model all the information that it needs to know.
0:32:58 Interestingly, you know, that was historically done in sort of an outsourced way where the data was shipped off to human teams to, you know, do the rote work of labeling that data.
0:33:04 You can imagine that’s very expensive and creates an artificial bottleneck on the amount of data that you can feed to your systems.
0:33:19 Actually, these days, with the emergence of these generalized foundation models, on our team we’ve done a study comparing the performance of AI systems developed on human-annotated data versus automatically labeled, or auto-labeled, data from foundation models.
0:33:26 And, of course, the benefit of leveraging automatic labeling is that it’s far more efficient, lower cost, and so forth.
0:33:29 And the question was, well, how did the performance compare?
0:33:47 And interestingly, we found that you can achieve comparable performance replacing human annotation in many situations with auto labeling, which kind of opens up the sort of the valve of the potential quantity of data that you can feed to specific systems in use cases like autonomy.
0:33:53 And so we’re excited to bring auto labeling to our users in the 51 platform.
0:33:55 We call it verified auto labeling.
0:34:03 The verified part is important because it’s not just about feeding tons of data to a system, but being able to verify the correctness of that information.
0:34:17 And so we’ve developed some technology internally that can kind of help you get the most value out of the knowledge that foundation models have while also prioritizing verification and trust and your users understanding the performance.
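As a rough illustration of the auto-label-then-verify pattern (not the actual verified auto labeling workflow in the FiftyOne product), a sketch might look like this; the dataset name, tag, and choice of zoo model are assumptions.

```python
# Sketch of auto-labeling with a zoo model, then checking it against a small
# human-verified subset. Dataset name, tag, and model choice are assumptions.
import fiftyone as fo
import fiftyone.zoo as foz

dataset = fo.load_dataset("drive-logs-2024")  # hypothetical

# Auto-label every frame with an off-the-shelf detector from the model zoo
model = foz.load_zoo_model("faster-rcnn-resnet50-fpn-coco-torch")
dataset.apply_model(model, label_field="auto_labels")

# Verify: compare the auto labels against the human-annotated subset
verified = dataset.match_tags("human-verified")
results = verified.evaluate_detections(
    "auto_labels", gt_field="ground_truth", eval_key="auto_vs_human"
)
results.print_report()  # decide whether the auto labels are good enough to scale
```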
0:34:35 And then, of course, there’s the simulation piece because, you know, if you can’t get your hands on exactly that scene that you know that your model has a weakness or a failure mode that needs to be addressed, tapping into simulation techniques, as we mentioned previously, to fill in those gaps and build higher quality data sets.
0:34:41 That’s very exciting in terms of the potential to push us to the next level of performance.
0:34:47 In the LLM space, we hear today things like, oh, well, we’ve already trained on all of the information available on the public Internet.
0:34:48 Right.
0:34:52 And so now we need to resort to synthetic data because there’s nothing left, right?
0:34:52 Right.
0:35:02 Now, I’m not sure that that’s exactly true, but it’s definitely the case that synthetic data generation techniques have different complementary capabilities to real data.
0:35:12 And increasingly in the future, we see it being an integral tool to the, you know, toolkits of teams that are putting data quality at the center of their development efforts.
0:35:17 As we wrap up, let’s hold on to that future forward mindset for a moment here.
0:35:20 And I’m going to ask you if you can think ahead five years from now.
0:35:22 So 2030.
0:35:22 Wow.
0:35:32 So if we’re in 2030, what do you hope will have changed, will have progressed and developed in the world of autonomous vehicle safety and simulation five years out?
0:35:44 Well, I’m definitely excited in general about the role that autonomous driving plays in shedding light on the powerful capabilities of visual AI and multimodal models.
0:36:02 I think from a, you know, geek standpoint, it’s a perfect testing ground to test the value of multimodal systems that can understand not just the visual inputs or LIDAR inputs, but also things like you mentioned before, a fully connected system where we can take information about the intentions or behaviors of other vehicles
0:36:12 to bring us to that next level of safety, or, you know, feed in sort of audio signals to build a more holistic understanding of a scenario.
0:36:14 I think it’s going to unlock the next level of safety.
0:36:21 When the model can understand the urgency of my swearing behind the wheel, you’ll know how much danger we’re really in.
0:36:21 Exactly.
0:36:22 Yeah.
0:36:25 So maybe I can add on that.
0:36:26 Yeah, please.
0:36:39 So I believe that safety will shift from a holistic view, where we have to really test all of the necessary scenarios for a specific domain, towards more situational safety,
0:36:53 because we can have models which are able to reason about different situations, which are able to derive from specific basic concepts, and which are able to interact with the driver so that there can be safe conditions
0:37:12 even in unsafe situations. The driver, for instance, gets requested to take over by a model which talks to them in natural language, or the model can anticipate different situations, based on the concepts behind them, as unsafe situations where it can act accordingly.
0:37:15 And like Brian said, the multimodality plays a major role.
0:37:25 So we will have models which are not only able to reason on visual data, not only able to reason on camera data, for instance,
0:37:35 but also on spatial data, on point clouds by radar and LIDAR sensors, and also on map information and other information which is available in the environment.
0:37:43 And therefore, we believe that future models should be able to have more of a situational understanding of safety, like humans also have.
0:37:52 They do not need to capture all of the situations, but they are able to anticipate situations and act safely even in an unsafe situation.
0:37:55 Anticipate also unsafe behavior by other agents.
0:37:57 And this is where we have to go.
0:38:02 Yeah, that whole concept of anticipation is such a big part of living life as a human, right?
0:38:06 And so it makes sense, the importance of it within these systems.
0:38:11 But I will leave the complexities of getting it to work to folks like you.
0:38:17 It’s fantastic for us as users or enthusiasts of automobiles that there’s companies like Porsche out there,
0:38:24 because I’m sure that they’ll balance the safety and automation with the fun factor of being a driver behind the wheel.
0:38:29 Whether or not you have to be driving, I’m sure it’ll be a first-class experience with Porsche involved.
0:38:41 No, that’s a great point, because I think anyone who’s driven a car has probably taken at least one drive just for fun, or to relax, or to get away from it all, clear your head.
0:38:46 You know, and so just that driving experience is something that I hope we can hang on to going forward.
0:38:46 Absolutely.
0:38:49 Tin, Brian, this has been a great conversation.
0:38:54 I’ve learned a lot about the current and future of autonomous vehicles, so thank you both for joining.
0:39:02 For listeners who would like to learn more, would like to dig a little deeper into what Porsche is doing with autonomous vehicles and everything else,
0:39:07 into what Voxel51 is all about, where would you point them to go on the web to get started?
0:39:11 Brian, I assume the Voxel51 website, but where can they go?
0:39:12 Definitely check out our website.
0:39:19 For those technologists in the audience, check out our open-source project, FiftyOne, completely free, openly available.
0:39:26 Download it, kick the tires, test out this data-centric view of developing your next visual AI system.
0:39:27 Fantastic.
0:39:30 And, Tin, is there a Porsche research blog?
0:39:32 Is there a part of the website devoted to autonomous vehicles?
0:39:34 Where’s the best place for a listener to start?
0:39:41 Yeah, so we also have a lot of our research open-sourced on the Porsche free and open-source website.
0:39:49 For instance, a benchmark for evaluating foundation models and the task of vision language navigation based on the four capabilities we just mentioned.
0:39:56 And also, a lot of research papers you can grasp and just check out our Google Scholar and our GitHub page,
0:40:00 where you can dive into what we’re doing in foundation models for autonomous driving.
0:40:01 Fantastic.
0:40:04 Again, thank you guys so much.
0:40:08 And, you know, anytime you want to talk cars and autonomous vehicles and all that stuff,
0:40:11 give a call and maybe we can catch up and do it again in the future.
0:40:12 Looking forward to it.
0:40:13 Thanks, Noah.
0:40:14 Looking forward, Noah.
0:40:14 Thank you.
0:00:19 It’s often said that modern cars are computers rolling along on wheels.
0:00:22 From performance and safety systems to in-vehicle infotainment,
0:00:27 computers control and oversee many, many functions in today’s vehicles.
0:00:30 Autonomous driving systems, of course, are no exception.
0:00:36 The quest to build self-driving cars depends on compute power and, yes, data. Lots of data.
0:00:39 Here to delve into the inner workings of autonomous vehicles
0:00:42 and the increasingly vital role of data in modern carmaking
0:00:47 are Tin Son, technical lead for vision language action models at Porsche,
0:00:51 and Brian Moore, CEO and co-founder of Voxel 51,
0:00:54 whose visual AI and computer vision data platform, 51,
0:01:00 is used by customers across a range of industries, including, you guessed it, Porsche.
0:01:04 Tin, Brian, welcome to the NVIDIA AI podcast,
0:01:06 and thank you so much for taking the time to join.
0:01:07 Thanks for having us.
0:01:09 Thank you for having us.
0:01:12 So maybe we can start with each of you introducing yourselves
0:01:17 and just kind of talking a little bit about what you do at your respective companies
0:01:20 and a little bit about how that relates to autonomous systems.
0:01:21 And, of course, we’ll get into it.
0:01:23 So, Tin, maybe you can start.
0:01:25 Right. So, I’m Tin.
0:01:28 I’m a PhD student and tech lead at Porsche AG, like you said.
0:01:31 I’m dealing with vision language action models for autonomous driving.
0:01:37 And in my research, I want to turn cars into embodied agents that can understand space,
0:01:40 time, and physical properties of the real world,
0:01:44 so that they are able to act within it and interact with the driver through natural language
0:01:47 or through pose, facial expressions, or gesture.
0:01:48 Fantastic.
0:01:52 And, Brian, tell us a little bit about Voxel 51.
0:01:56 Yeah. So, as you mentioned, I’m Brian, the co-founder and CEO here at Voxel 51.
0:01:58 First, my background.
0:02:00 So, I’m a geek, a nerd by background.
0:02:03 I have a PhD in machine learning from the University of Michigan.
0:02:04 Co-blue?
0:02:04 Exactly.
0:02:07 That is where, over 10 years ago, I met my co-founder, Jason,
0:02:09 who is a faculty at Michigan.
0:02:12 We started off doing some consulting work,
0:02:16 of course, being located in Ann Arbor, just down the road from Detroit, the Motor City.
0:02:20 We had the opportunity and great pleasure to collaborate with a number of automakers
0:02:23 10-plus years ago in early versions of autonomy,
0:02:27 where we kind of reached that key insight that, in theory,
0:02:28 it’s all about models and algorithms.
0:02:32 In practice, it’s all about data and data quality and data strategy,
0:02:38 which led us to the opportunity and need to provide our product 51
0:02:40 to help solve some of those data challenges.
0:02:41 Very cool.
0:02:43 And we’ll get into, as I said, as we talk,
0:02:45 we’ll get into a little bit more about what 51 is
0:02:48 and what Voxel 51 does with customers like Portia.
0:02:52 But maybe, Tin, let’s start with you.
0:02:55 On the podcast, we’ve said, I think it’s actually an Andrew Ng quote originally,
0:02:58 but we like to say that AI is like the new electricity.
0:03:03 It’s sort of there in the background, powering more and more, you know,
0:03:06 everything that we do across industries and research disciplines,
0:03:09 and it’s kind of providing power that we run on.
0:03:12 Why is it so important in the automotive industry,
0:03:15 and when we’re talking about autonomous vehicle systems,
0:03:19 why is it so important to organize and understand these huge amounts of data,
0:03:23 whether they’re coming from real-world capture or generating synthetic data?
0:03:28 Why does that play such a big part in developing autonomous driving systems?
0:03:32 So to deal with this increasing scenario space in the open world,
0:03:36 the industry is shifting from pre-specifying scenarios
0:03:40 and pre-specifying everything which can happen in modular pipelines,
0:03:45 which are only partially supported by AI models towards end-to-end pipelines,
0:03:48 where AI plays the major role or plays the sole role.
0:03:51 And instead of specifying all the scenarios,
0:03:55 we are training data and directly mapping them to outputs,
0:03:57 to actions which the models have to perform.
0:03:59 So we are removing the inductive bias
0:04:02 and removing all the rules and knowledge we have
0:04:04 and training from the data.
0:04:09 This is mainly because we are moving from level two assisted driving
0:04:11 to fully autonomous systems
0:04:15 that have to operate in conditions which are partially also unknown to us,
0:04:18 which we are not able to specify fully.
0:04:22 And in order to act safely under these conditions,
0:04:24 we have to collect lots and lots of data,
0:04:28 so billions of kilometers of driving data which need to be collected.
0:04:33 And not only that we need to explore this whole scenario space
0:04:34 to provide safe driving solutions,
0:04:39 but also we have long-tail distributions of different traffic scenarios,
0:04:41 of different interactions of agents.
0:04:45 And not all data has the same value for the model.
0:04:48 We have different kinds of redundancies and imbalances,
0:04:50 and the training data for these models.
0:04:52 And this is where we need to explore the data,
0:04:55 where we need to capture the data, which is mainly unlabeled.
0:04:58 We need to implement automated labeling pipelines
0:05:02 in order to understand which are new scenarios,
0:05:05 important scenarios having impact on the agents,
0:05:08 and which are scenarios which are less important,
0:05:10 which are redundant, which happen more often.
0:05:14 For instance, we have lots of scenarios with walking pedestrians on zebra crossings,
0:05:22 but there’s not so much situations where helicopters have to perform a landing on the Eagle Lane of an autonomous vehicle, for instance.
0:05:24 Right. But it could happen.
0:05:25 Yeah, exactly.
0:05:32 The other thing which is important in that regard is that there’s not that much ground truth data for different modalities,
0:05:33 for different sensor setups.
0:05:36 So in order to be able to train models,
0:05:40 we need different kinds of data from different modalities.
0:05:44 For instance, spatial data or data from multiple sensor setups, from multiple embodiments.
0:05:51 Or in terms of our research, also agent interactions with the physical world or visual question answering data
0:05:54 are sparsely found in available real world data.
0:06:02 So we need a data curation platform like Voxel 51 provides in order to harness different methods,
0:06:06 in order to create meaningful pipelines, meaningful data labeling pipelines,
0:06:10 which are relevant for our training, for the training of the models,
0:06:13 but also for the validation during the operations of the models.
0:06:18 We need to validate and continuously observe because the scenario space is so large
0:06:23 that we have the obligation to observe it during the operation too.
0:06:28 So to go back to a couple of things you said, first, just kind of the level set.
0:06:32 You mentioned the levels of autonomous vehicle systems
0:06:38 and the difference between level two and then going to a fully autonomous system.
0:06:44 Can you just briefly give an overview of what the levels are and kind of where today’s technology is at?
0:06:44 Yeah.
0:06:51 So in current vehicles, we mostly have level two systems, which are driver assistance systems.
0:06:55 And these systems don’t operate on their own.
0:06:57 They cannot act in the environment.
0:06:59 They just support the driver in certain tasks,
0:07:01 for instance, lateral or longitudinal movement.
0:07:06 And in the future or in the near future, we will have fully autonomous systems.
0:07:08 So these are like the lane keeping systems?
0:07:09 Exactly.
0:07:15 Lane keeping systems and distance and automated distance systems, which keep the citizens.
0:07:17 Like following behind another car and…
0:07:18 Following behind.
0:07:18 Got it.
0:07:18 Okay.
0:07:20 And so that’s level two.
0:07:22 And then getting up to…
0:07:25 Is it level five that’s considered fully autonomous?
0:07:25 Yeah.
0:07:27 So level five is fully autonomous.
0:07:30 This is basically a human agent which can act in the real world.
0:07:40 So if we have an AI which is able to interact fully autonomous without any separation from what a human agent would act in the environment, we have level five.
0:07:40 Right.
0:07:43 We are not striving for level five yet.
0:07:50 We are striving for level four, which is that in most domains, autonomous agents can act and interact.
0:07:56 But there are some borders, some boundaries, which we call operational design domain boundaries, boundaries of the system.
0:08:06 We need to consider and the system needs to be able to identify these boundaries and to have fallback loops where the driver can engage with the system.
0:08:11 So, Brian, where does Voxel 51 fit into this?
0:08:13 How do you work with Portia?
0:08:20 And then if you like also, you can talk a little bit more about the company, the platform, and how it serves other customers in general.
0:08:21 Yeah, absolutely.
0:08:30 So, you know, like Tim mentioned, our platform, our product exists to put data at the center of all of his team and indeed all visual AI projects work.
0:08:38 For context, what we see today is that models and weights, while very important, are becoming increasingly commoditized or publicly available.
0:08:54 And therefore, the key differentiator between the success and failure, you know, leader versus follower status of a product or company lies in their ability to turn the data they have access to into, you know, what we call actionable insights or intelligence for their systems.
0:09:03 So, in other domains, let’s say more legacy domains, use cases like structured data, data that fits in spreadsheets and tables and so forth.
0:09:10 There’s quite a lot of mature tooling that exists out in the market to help you make sense of that data, run computations, process that data.
0:09:28 However, in our experience, when it comes to visual data, image data, video data, 3D data, all the associated metadata that needs to be understood and processed by AI systems, there was really a dearth of tools in the market that played that role of being that key platform that puts data at the center of all the development work.
0:09:47 So, we experienced that ourselves through our research and ultimately identified that a product needs to exist that’s open and extensible in the market that teams like Tins at Porsche can adopt, feed their data into, and really facilitate the end-to-end development of their systems.
0:10:07 So, that encompasses data annotation or labeling, data curation, and importantly, evaluating models, and then turning that flywheel, whereby you identify failure modes or gaps in a model’s performance, and turn that into the right decisions about what new data should I go out and gather to address a specific failure mode.
0:10:21 Whether it be real data that’s available through an autonomous fleet of vehicles in the world, or maybe synthetic data that’s increasingly becoming important to fill in gaps that are unusually hard or difficult to get your hands on.
0:10:24 The helicopter landing next to me on my daily commute.
0:10:25 Exactly.
0:10:29 You may not have many examples of that at your fingertips.
0:10:42 However, in order to truly get to L5 systems, we need to have confidence that these, you know, agentic systems are going to be able to respond and react to those kind of extreme, but nonetheless important events that could occur in practice.
0:10:43 Yep, yep.
0:10:45 I mean, less unlikely, right?
0:10:50 But, you know, I always think of kids darting out into the road, you know, on a bike, on foot, whatever.
0:10:55 And, yeah, I want my fully self-driving car to understand how to avoid the kid, the helicopter, 100%.
0:10:56 Exactly, yeah.
0:11:00 Yeah, and so just to sum up what our product offers, it’s exactly that.
0:11:26 If you want to, for example, deep dive into the performance of an autonomous vehicle in situations where there’s a crowded intersection at night with low light, and you want to develop confidence or trust that that system’s going to respond correctly, you need the ability to perform that query, analyze that model’s performance, and if it’s not up to par yet, take the right actions to build better data sets so that you can get to where you need to be in terms of performance.
0:11:27 That’s fantastic.
0:11:27 Fantastic.
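For readers who want to see what that kind of scenario query might look like in practice, here is a minimal sketch using the open-source FiftyOne library (the "51" project mentioned here). The dataset name and the field names ("scene_tags", "ground_truth", "predictions") are illustrative assumptions, not Porsche's actual schema.

```python
# Minimal sketch: slicing a driving dataset down to a hard scenario and
# evaluating a detector on just that slice with open-source FiftyOne.
# Dataset and field names below are hypothetical placeholders.
import fiftyone as fo
from fiftyone import ViewField as F

dataset = fo.load_dataset("av-drive-logs")  # assumed existing dataset

# Query the hard case: crowded intersections at night.
night_intersections = (
    dataset
    .match(F("scene_tags").contains("intersection"))
    .match(F("scene_tags").contains("night"))
    .match(F("scene_tags").contains("crowded"))
)

# Evaluate model predictions against ground truth on that slice only.
results = night_intersections.evaluate_detections(
    "predictions", gt_field="ground_truth", eval_key="night_eval"
)
results.print_report()

# Surface the samples with the most missed objects to guide data collection.
worst = night_intersections.sort_by("night_eval_fn", reverse=True).limit(25)
session = fo.launch_app(worst)  # visual review in the FiftyOne App
```

The point of the sketch is the loop itself: query a scenario slice, measure performance there, and let the worst samples drive the next round of data collection.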
0:11:31 So, Tin, in the real world, Brian was talking about building that trust.
0:11:40 How do you, how does Porsche go about making sure that its autonomous driving systems are safe, are road-ready before releasing them?
0:11:42 How do you test?
0:11:45 How do you, how does simulation come into play?
0:11:47 We’re talking about the role of data and everything.
0:11:49 Can you talk a little bit about your process?
0:11:54 So, when it comes to simulation, it is an increasingly important and valuable tool for us.
0:12:03 And we have talked about the increasing complexity of the systems, moving towards end-to-end systems, moving towards maybe embodied agents which interact with the driver.
0:12:10 And we have talked about the increasing complexity of the scenario space, where we have more and more scenarios which have different parameters.
0:12:15 And simulation really gives us the ability to ask the question, what could happen?
0:12:18 So, we don’t only have recorded data.
0:12:22 We also have interaction between agents in the environment.
0:12:35 We have different models where we can evaluate the feasibility in all of the state and action space, not only in the recorded state and action space we have from our real test drives.
0:12:45 And this is really, really valuable because it gives us the opportunity to have models which generalize over the whole environment and everything that can happen.
0:12:59 And simulation enables us to do that by becoming increasingly high fidelity, by providing increasingly realistic environments, increasingly realistic sensor models, and also behavioral models for different agents.
0:13:11 And this is important for us because we really want to have safe systems on the road, and only through simulation can we capture this, because in the real world there's so much that can happen.
0:13:14 And there’s also some scenarios which are not so easy to replicate.
0:13:23 For instance, if a helicopter lands on the road, we cannot create a test case in the real world without harming someone or without the danger of harming someone.
0:13:33 So we need simulation and also synthetic data generation to capture these scenarios and we have the ability to do that with an increasing level of realism.
0:13:51 In the latest developments, there have also been additional aspects of simulation, like improving the fidelity with generative models by feeding scenarios into a generative model and getting more realistic outputs which almost resemble real video scenes.
0:13:53 And you see that also in NVIDIA Cosmos.
0:14:07 So we are really excited to see the developments which are happening currently in that area, which improve our realism and improve the fidelity of the simulation and therefore also the validity to test in simulation environments.
0:14:16 You mentioned that accounting for the unknown or the unpredictable is a big issue with developing safe autonomous driving systems.
0:14:29 Are there particular known scenarios, you know, things that happen in the real world that are just really difficult problems to solve when it comes to building an autonomous system to deal with it?
0:14:34 You know, you mentioned low light, crowded intersection, nighttime, those kinds of things.
0:14:41 Are there particular scenes or even particular variables that just really pose a challenge for working on these systems?
0:14:46 Definitely. So we have certain physical limitations of sensors.
0:14:50 For instance, radar and LIDAR sensors have certain limitations.
0:14:57 LIDAR sensors are limited by, for instance, rain or water reflections or reflective surfaces.
0:15:04 While radar sensors can be noisy in the presence of certain magnetic fields or metals.
0:15:09 And cameras, of course, we know cameras become noisy in many weather situations.
0:15:21 So we have different physical limitations, and we need to model them and understand in which situations agents can act freely, because there is no physical limitation there, and where these agents run into physical limitations too.
0:15:31 And through realistic sensor models, the latest simulation developments also give us the ability to evaluate this virtually.
0:15:42 When it comes to agent interactions, we have a lot of limitations because of the problem of domain generalization.
0:15:47 We have different types of scenarios where agents need to interact.
0:15:58 And if the data is not present in the training data, we have the issue of learning and applying concepts the way human drivers do.
0:16:06 For instance, as a human driver, I see certain entities in the environment which I never saw before, but I can anticipate them.
0:16:16 But an AI agent has difficulties anticipating them, because if it didn't see it in the training data, it cannot generalize to the unseen event.
0:16:19 And this is also where simulation and synthetic data methods come into play.
0:16:21 We need to capture these situations as well.
0:16:31 One thing I would add on the synthetic data front: we're seeing that it's becoming increasingly important to be able to generate variations of a scene.
0:16:36 For example, you know, we’re excited to integrate with NVIDIA’s Cosmos World Foundation models in our product.
0:16:43 And one thing that enables is you can import a realistic scene, parametrize it, and then tune different things.
0:16:46 What would it look like if that vehicle was a different shade of beige?
0:16:50 Or what would it look like if there was another vehicle or a pedestrian in the same scene?
0:16:52 Or maybe let’s change the weather conditions.
0:17:00 And that ability to kind of play around with a realistic scene is important to kind of develop a trust that the system is going to, you know,
0:17:04 be able to deal with all the different variations that it might see of that scene in practice.
0:17:06 Right, right.
0:17:11 Is that a manual process or are you automatically generating the variations to put them back into the system?
0:17:19 Yeah, that’s one of the exciting things about the software integration we have with the, you know, the Cosmos Foundation models as an example.
0:17:34 Users of our product can, you know, identify a scene and then, you know, click a button to automatically generate all of those variations and pull them back into their training data set and indeed automate the analysis or assessment of all those different variations.
0:17:37 So we think that having humans in the loop is definitely important.
0:17:47 It would be a mistake to sort of, you know, ship a system without having a human’s eyes on the scene to identify or spot check or build trust in performance in key scenarios.
0:17:53 But of course, the volume of data that you need to get to the performance that you need is immense.
0:17:57 And so you have to leverage, you know, automation whenever possible.
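As a rough sketch of what that automated variation loop could look like, here is a hypothetical Python snippet. The generate_variation function is a stand-in for whatever world-model interface (Cosmos-style or otherwise) a team has wired up; none of these names are the actual product integration.

```python
# Hypothetical sketch of sweeping scene variations before re-evaluation.
# `generate_variation` is a placeholder, NOT a real library call; a real
# integration would invoke a world foundation model to re-render the scene.
from itertools import product

def generate_variation(scene, weather, extra_agent, time_of_day):
    """Placeholder for a world-model call. Here it only records the request;
    a real generator would return rendered sensor data for the new scene."""
    return {"scene": scene, "weather": weather,
            "extra_agent": extra_agent, "time_of_day": time_of_day}

BASE_SCENE = "logs/intersection_0427"          # assumed recorded scene
WEATHER = ["clear", "rain", "fog"]
EXTRA_AGENTS = [None, "pedestrian", "cyclist"]
TIME_OF_DAY = ["noon", "dusk", "night"]

variations = [
    generate_variation(BASE_SCENE, w, a, t)
    for w, a, t in product(WEATHER, EXTRA_AGENTS, TIME_OF_DAY)
]
print(f"{len(variations)} variations queued")  # 27 parameter combinations

# The rendered variations would then be pulled back into the training and
# evaluation sets, re-scored automatically, and spot-checked by humans.
```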
0:18:04 Brian, as Voxel’s worked with Porsche and other leading automakers, what’s been surprising to you along the way?
0:18:11 Maybe unanticipated challenges, maybe getting past a hurdle in a surprising way.
0:18:15 What are some of the things that stand out to you as you think about working with automakers?
0:18:33 Yeah, I think the main thing I would say, you know, as Tin was describing all of the very interesting software and related systems challenges in bringing autonomy to market, what we're seeing is that the leaders in the space really are reinventing themselves, not as automakers, but really as software companies.
0:18:34 Right.
0:18:41 And that’s the type of sort of skill and expertise that’s going to be needed to really solve these problems and bring a differentiated product to market.
0:18:54 There’s a trend today where, you know, if you think of the software component of autonomy as something that you can procure off the shelf, maybe through a vendor, then it can get you to a certain level of performance.
0:19:03 But the key, you know, as I argued before, the key to, you know, really industry leading performance is to harness the data that your company has access to.
0:19:11 You really need to bring the development and iteration of that software in-house rather than just outsourcing it to reach kind of leading status.
0:19:23 Yeah, from a consumer’s perspective, when you start talking about software and automakers, I think of infotainment systems just, you know, as a driver, as a passenger, kind of the first light-up thing that I see, right?
0:19:36 And I’ve been following a little bit from afar, but, you know, kind of this almost like dance between the automakers and some non-automotive software makers and, you know, mobile phone makers in particular, as you get into plugging your phone into the car.
0:19:42 And then maybe like with Apple CarPlay or Android Auto, it takes over the in-car infotainment and that kind of thing.
0:19:53 But then listening to you talk about it and thinking, well, okay, let’s move from infotainment to something I can’t even imagine how complex it really is as an autonomous driving system.
0:20:02 And it makes perfect sense to me, Brian, what you’re saying, that to really develop a top-notch system, you can’t just grab something off the shelf and plug it in.
0:20:07 Like you’ve got to be, you know, shaping it to fit your vehicles, all the data you have, all of that kind of stuff.
0:20:12 Yeah. And it’s also important, I think, to not take it all in-house and say that you can do everything.
0:20:22 I think an effective pattern that we've seen in the market, and of course, Tin would be an authority on this more than me, is, well, let's first focus on being able to validate the performance of systems.
0:20:30 So we can truly understand if we’re going to work with a vendor for a certain piece of technology, can we truly understand its performance and develop trust in it?
0:20:37 And then we can evaluate whether it makes sense to bring certain aspects of the system in-house so we can fine-tune it with our own data and so forth.
0:20:43 So that focus on validation and evaluation is definitely important in the short term.
0:20:48 Brian, you may have said this at the beginning, so forgive me, but how long ago was Voxel 51 founded?
0:20:59 Yeah. So Voxel started over 10 years ago now as kind of a consulting partnership between myself and my co-founder, Jason.
0:21:02 As a sort of, you know, venture-backed software company, that journey started in 2018, so about seven years in the market.
0:21:07 Okay. So along that path, almost a decade now, or, you know, seven years to market and a decade,
0:21:20 since you started, have there been particular technological breakthroughs that have really allowed Voxel to just take your practice to the next level and, you know, do things, offer things to customers you couldn’t?
0:21:27 And particularly when it comes to, I mean, this is what you do, but when it comes to wrangling and managing and understanding these vast quantities of data,
0:21:34 are there particular just technological advances along the way that have really, you know, made the work that you do possible?
0:21:40 Certainly there’s been just immense technological innovation on the model and algorithmic side.
0:21:44 Yeah. It’s one of those questions where I’m sort of like, have there been innovations in the past 10 years?
0:21:45 Yeah, maybe a couple, you know.
0:21:52 Yeah. So there have been clear advances in, you know, technologies like the transformer architecture,
0:21:56 which kind of leveled up another order of magnitude of performance potential.
0:22:01 And then, of course, we had all of the, you know, advancements in ChatGPT and large language models.
0:22:12 And now we have models in vision that are more multimodal in nature, vision language models that can pull in information from text and audio and fuse that with vision.
0:22:21 And the long-term, I think, vision in the space is that we need models that can go directly from pixels or sensor inputs directly to actions or decisions.
0:22:25 And that kind of end-to-end system is kind of the holy grail.
0:22:27 And it’s very exciting to see all of that develop.
0:22:37 The lesson we've learned is that it always requires more data, more data, more data to organize, to understand, to sift through, to find the needles in the haystack.
0:22:47 And so the number one feature request we get from our customers is definitely, hey, you know, as I'm thinking about my plans and my goals for next year, it's involving an order of magnitude plus more data,
0:22:52 and the need and ability to connect to more GPUs to compute on that data.
0:23:03 And so we’ve definitely benefited from the rapid pace of innovation and, you know, distributed computing and related technologies that our platform can plug into to help deliver that scale to customers.
0:23:03 To dig in on something you said real quick, and correct me where I'm wrong here, but I think you said that the holy grail, as you put it, is moving to a system, a model that can go from pixels to actions directly.
0:23:15 Where are we at now?
0:23:17 How do you get from pixel to action currently?
0:23:17 Yeah.
0:23:23 First of all, just to unpack the historical context there, we refer to the space as visual AI.
0:23:26 What you may have referred to it as in the past is computer vision.
0:23:26 Right.
0:23:36 And that historically has represented kind of very low level tasks, like taking an image and classifying it as a certain animal or drawing a box around a certain object.
0:23:38 So that’s a very kind of low level task.
0:23:46 It’s important information and the system needs to understand, you know, the content of an image or a video stream in order to reason about it.
0:24:02 However, I think the lesson that we keep learning, even on the language model side, is that to the extent that we can push the system to be more end-to-end, have the authority to do a lot of the reasoning itself, and go directly from raw inputs to the decision,
0:24:06 there's a capacity for more intelligence.
0:24:07 Are we at that step yet?
0:24:08 Certainly not.
0:24:14 But I think that’s where the leading edge is in terms of research and the work that Tim and his team at Porsche are doing.
0:24:15 It’s very exciting times.
0:24:16 Gotcha.
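To make "pixels to action" concrete, here is a deliberately tiny, illustrative PyTorch sketch of an end-to-end policy: camera frames in, continuous control commands out. It is a toy that shows the shape of the idea, not anyone's production architecture.

```python
# Toy illustration of an end-to-end "pixels to action" policy: a camera frame
# goes in, continuous control commands (steering, throttle, brake) come out.
# This is a conceptual sketch, not a production autonomous-driving model.
import torch
import torch.nn as nn

class PixelsToActions(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(          # crude visual encoder
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.policy = nn.Sequential(           # maps features to controls
            nn.Linear(32, 64), nn.ReLU(),
            nn.Linear(64, 3),                  # steering, throttle, brake
        )

    def forward(self, frames):                 # frames: (B, 3, H, W)
        return self.policy(self.encoder(frames))

model = PixelsToActions()
actions = model(torch.rand(1, 3, 224, 224))    # dummy camera frame
print(actions.shape)                           # torch.Size([1, 3])
```

In practice, as discussed above, today's systems still decompose this into perception, prediction, and planning stages rather than learning one direct mapping.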
0:24:24 A long time, well, not that long ago, but a while ago, I read a novel that was kind of a, you know, not quite cyberpunk, but along those lines, tech heavy.
0:24:38 And one of the little threads in the book was about self-driving cars and near future freeway system in the United States where the cars talk, you know, automatically to each other and to the toll taking systems on the roads and all that kind of stuff.
0:24:48 But one of the themes that came out was, you know, if every car on the road was autonomous, that would be potentially the ideal situation for human safety.
0:25:01 Because if all the vehicles were self-driving and they’re all performing at a good level, they’re going to be able to make much better decisions than human drivers just because of the raw ability to compute and deal with all the data and just everything humans can’t quite do.
0:25:10 Given where we’re at now in reality and level two systems and all the stuff you’ve been talking about, how do we use technology?
0:25:16 How can technology help improve, help build human trust in autonomous vehicles?
0:25:29 So the question you asked is how to establish trust with the human, and humans have the most trust in systems which are explainable and which act similarly to humans.
0:25:39 So one big limitation that current end-to-end systems have is that they are not able to describe why they are taking certain actions.
0:25:52 They do not derive their actions from basic concepts. Humans can understand from several hours of driving how to drive in the world, while autonomous systems can't do that.
0:26:19 They need big amounts of data in order to derive actions, and the shift that needs to happen is towards systems which are able to derive and generalize based on context from human knowledge, based on basic concepts like the ones we learn in driving school: how to act in the environment and also how to explain and describe these actions.
0:26:42 And only when we are able to do that, and this is also what the research on foundation models in the area of autonomous driving is working towards, will we have trustworthy systems which are able to interact with the driver, or the co-driver in that case, to explain their actions and also to explain when certain boundaries are met.
0:26:46 Like human drivers also have certain boundaries, they also cannot perform in all situations.
0:26:50 So this is the shift we need to make.
0:26:50 Right.
0:26:50 Brian?
0:26:53 Just to add a few other angles to that.
0:26:56 So I try to hold two things together at once.
0:27:17 One is the excitement for the long-term future of fully connected, fully autonomous vehicles, which are orders of magnitude safer than humans and can benefit from information that human drivers don't have, like the ability to directly communicate with other vehicles, or other sensor types, like LIDAR, that offer information that humans don't have.
0:27:25 At the same time, I think we’d be remiss not to be very excited about all of the very concrete benefits that are already in the market today, specifically around safety.
0:27:36 You know, the fact that my vehicle today has L2 systems that can automatically detect, you know, maybe I’ve lost focus on the highway and a car in front of me is stopped in traffic.
0:27:44 My vehicle can already react to that and keep me safe, and I think there's so much more potential to roll out, in more of an incremental way,
0:27:48 AI-enabled advancements that focus specifically on trust and safety.
0:28:05 You know, Tin painted a great picture of what's needed from an explainability and transparency standpoint on the technology, but I'm also very excited about the very concrete benefits that everyday drivers can see over the next five years while we continue working towards the grander future.
0:28:13 And the sort of potential of a fully autonomous network is the kind of excitement, technologically, that keeps pushing the pace of innovation forward.
0:28:14 Yeah.
0:28:18 And we’re happy to, you know, help do our part to enable that to happen at Voxel.
0:28:18 Yeah.
0:28:21 But definitely appreciate both aspects and tension.
0:28:21 Right.
0:28:27 Tin, along those lines, what excites you the most about the future of autonomous vehicles at Porsche?
0:28:32 So I’m probably biased because I’m conducting research in that area, but what really excites me…
0:28:33 I hope you are.
0:28:40 What really excites me is turning cars into Knight Rider, if you know the series Knight Rider.
0:28:41 Oh, yeah.
0:28:49 So turning cars into embodied agents, which are able to anticipate the environment similarly to how humans do.
0:29:03 So we can leverage foundation models, leverage the world knowledge and web context we have, in order to not have direct mappings from inputs to actions like end-to-end models have, but to have some kind of reasoning process in between.
0:29:16 And to generalize over different scenarios regardless of embodiment, whereas end-to-end approaches are maybe error-prone and sensitive to different embodiments, to different sensor setups.
0:29:23 If we change the sensor setup, we need to retrain the whole system in many cases.
0:29:40 And foundation models promise to have some kind of intermediate representation, which incorporates web context and incorporates world knowledge and gives us the ability to generalize on different kinds of scenes and act in these scenes.
0:29:46 Even if not previously trained on that specific scene, because we have this reasoning process.
0:29:52 And another part of the whole puzzle is that we have completely new types of interactions.
0:29:54 We have vision language navigation.
0:29:57 Me as a driver, I can tell the car what to do.
0:30:05 And the car knows about its abilities and can choose certain actions, which it knows, and create new experiences for the driver.
0:30:25 So for instance, looking for a parking spot in the shade, or any other situational request the driver could have, we can create new actions which no model was previously trained on, but which are derived from the context and from the world knowledge that those models are pre-trained on.
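A purely hypothetical sketch of the kind of request handling Tin describes: a natural-language request is grounded against the abilities the car knows it has before anything is executed. Every name here (vlm_propose_action, CAR_CAPABILITIES) is invented for illustration.

```python
# Purely hypothetical sketch of a vision-language-navigation request loop.
# `vlm_propose_action` and `CAR_CAPABILITIES` are invented placeholders to
# illustrate grounding a language request in known abilities before acting.
CAR_CAPABILITIES = {"park", "pull_over", "adjust_speed", "change_lane"}

def vlm_propose_action(request, scene_description):
    """Placeholder for a vision-language model call that turns a natural-
    language request plus the perceived scene into a candidate action."""
    return {"action": "park", "constraint": "spot in the shade"}

def handle_request(request, scene_description):
    proposal = vlm_propose_action(request, scene_description)
    if proposal["action"] not in CAR_CAPABILITIES:
        return "Sorry, I can't do that safely."  # ability check before acting
    return f"Executing {proposal['action']} with constraint: {proposal['constraint']}"

print(handle_request("Find me a parking spot in the shade",
                     "tree-lined street, parked cars, afternoon sun"))
```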
0:30:31 So there's a lot of potential in there, but a lot also needs to be done in that regard.
0:30:40 In our research, we have investigated what foundation models need to be capable of in order to fulfill this task of vision language navigation.
0:30:46 And in close collaboration with Voxel51, we have identified four areas.
0:30:51 And this is semantic understanding, which is classes, affordances and attributes.
0:30:57 Spatial understanding, which is the locations and orientations of objects with respect to each other.
0:31:01 Temporal understanding, which is the development over time in the past and the future.
0:31:09 And physical understanding, most importantly, which is the world model, the physical rules like forces or gravity applied to the environment.
0:31:24 And we have identified that current foundation models are very good in semantic understanding, like deriving these basic concepts of scenarios.
0:31:31 But we need to improve them in spatial, temporal and physical understanding in order to really grasp the task of vision language navigation.
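As an illustration of how those four capability areas could be organized for scoring, here is a small hypothetical sketch; the probe questions and the structure are invented for illustration and are not the actual benchmark.

```python
# Hypothetical sketch: scoring a model per capability area (semantic, spatial,
# temporal, physical). Probe questions are illustrative, not benchmark items.
from collections import defaultdict

PROBES = [
    {"capability": "semantic", "question": "What class of object is ahead, and is it a drivable surface?"},
    {"capability": "spatial",  "question": "Is the cyclist to the left or the right of the parked van?"},
    {"capability": "temporal", "question": "Was the pedestrian moving toward the crosswalk over the last two seconds?"},
    {"capability": "physical", "question": "Can the vehicle stop before the crosswalk at its current speed on wet asphalt?"},
]

def per_capability_accuracy(answer_fn, grade_fn):
    """`answer_fn` maps a question to the model's answer; `grade_fn` maps
    (question, answer) to True/False. Both are placeholders for a real harness."""
    totals, correct = defaultdict(int), defaultdict(int)
    for probe in PROBES:
        cap = probe["capability"]
        totals[cap] += 1
        if grade_fn(probe["question"], answer_fn(probe["question"])):
            correct[cap] += 1
    return {cap: correct[cap] / totals[cap] for cap in totals}
```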
0:31:35 And we're really, really excited about what's currently happening in this area.
0:31:40 Also in research done by NVIDIA and other players in that area.
0:31:42 And we are looking forward to that research.
0:31:54 And on our own, we also train and create models which are hopefully able to grasp spatial and temporal understanding, and also the physical rules of the environment, in order to interact with it.
0:32:07 It's good that you guys are the ones working on safety, because, Tin, as soon as you started talking about the driver requesting the car to do something, I immediately thought, man, if I had a Taycan, I could just ask it to get me places fast.
0:32:11 And, you know, but I don't think that's quite what we're getting at here.
0:32:23 But, Brian, along the lines of safety, where do you see the biggest opportunities for improving autonomous vehicle safety through simulation, through, you know, access and tools to help you use better data?
0:32:29 Yeah. So for us, it always comes back to data and the important role that data plays in the success of AI.
0:32:29 Sure.
0:32:35 And, you know, fortunately, there’s some pretty exciting technological advancements happening in the data space.
0:32:43 So historically, one of the most onerous and costly and time-consuming aspects of an AI project was data annotation.
0:32:49 The need to gather a data set and have it labeled to sort of teach the model all the information that it needs to know.
0:32:58 Interestingly, you know, that was historically done in sort of an outsourced way where the data was shipped off to human teams to, you know, do the rote work of labeling that data.
0:33:04 You can imagine that’s very expensive and creates an artificial bottleneck on the amount of data that you can feed to your systems.
0:33:19 Actually, these days, with the emergence of these generalized foundation models, our team has done a study comparing the performance of AI systems that are developed on human-annotated data versus automatically labeled, or auto-labeled, data from foundation models.
0:33:26 And, of course, the benefit of leveraging automatic labeling is that it’s far more efficient, lower cost, and so forth.
0:33:29 And the question was, well, how did the performance compare?
0:33:47 And interestingly, we found that you can achieve comparable performance replacing human annotation in many situations with auto labeling, which kind of opens up the sort of the valve of the potential quantity of data that you can feed to specific systems in use cases like autonomy.
0:33:53 And so we’re excited to bring auto labeling to our users in the 51 platform.
0:33:55 We call it verified auto labeling.
0:34:03 The verified part is important because it’s not just about feeding tons of data to a system, but being able to verify the correctness of that information.
0:34:17 And so we’ve developed some technology internally that can kind of help you get the most value out of the knowledge that foundation models have while also prioritizing verification and trust and your users understanding the performance.
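Here is a hedged sketch of what an auto-label-then-verify loop can look like with the open-source FiftyOne library; the zoo model name, the "human_verified" tag, and the field names are assumptions for illustration, not the verified auto labeling product itself.

```python
# Hedged sketch of an auto-label-then-verify loop with open-source FiftyOne.
# The zoo model choice, the "human_verified" tag, and the field names are
# illustrative assumptions, not the actual verified auto labeling workflow.
import fiftyone as fo
import fiftyone.zoo as foz

dataset = fo.load_dataset("av-drive-logs")  # assumed existing dataset

# 1) Auto-label every frame with a pretrained detector from the model zoo.
detector = foz.load_zoo_model("faster-rcnn-resnet50-fpn-coco-torch")
dataset.apply_model(detector, label_field="auto_labels")

# 2) Verify: compare auto labels against a small human-annotated subset.
verified = dataset.match_tags("human_verified")
results = verified.evaluate_detections(
    "auto_labels", gt_field="ground_truth", eval_key="auto_vs_human"
)
results.print_report()  # per-class agreement between auto and human labels
```

The key design point is step 2: the auto labels only get trusted, and scaled up, to the extent that they agree with a human-verified sample.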
0:34:35 And then, of course, there’s the simulation piece because, you know, if you can’t get your hands on exactly that scene that you know that your model has a weakness or a failure mode that needs to be addressed, tapping into simulation techniques, as we mentioned previously, to fill in those gaps and build higher quality data sets.
0:34:41 That's very exciting in terms of the potential to push us to the next level of performance.
0:34:47 In the LLM space, we hear today things like, oh, well, we’ve already trained on all of the information available on the public Internet.
0:34:48 Right.
0:34:52 And so now we need to resort to synthetic data because there’s nothing left, right?
0:34:52 Right.
0:35:02 Now, I’m not sure that that’s exactly true, but it’s definitely the case that synthetic data generation techniques have different complementary capabilities to real data.
0:35:12 And increasingly in the future, we see it being an integral tool to the, you know, toolkits of teams that are putting data quality at the center of their development efforts.
0:35:17 As we wrap up, let’s hold on to that future forward mindset for a moment here.
0:35:20 And I’m going to ask you if you can think ahead five years from now.
0:35:22 So 2030.
0:35:22 Wow.
0:35:32 So if we’re in 2030, what do you hope will have changed, will have progressed and developed in the world of autonomous vehicle safety and simulation five years out?
0:35:44 Well, I'm definitely excited in general about the role that autonomous driving plays in shedding light on the powerful capabilities of visual AI and multimodal models.
0:36:02 I think from a, you know, geek standpoint, it's a perfect testing ground to test the value of multimodal systems that can understand not just the visual inputs or LIDAR inputs, but also things like you mentioned before, a fully connected system where we can take information about the intentions or behaviors of other vehicles
0:36:12 to bring us to that next level of safety, or, you know, feed in sort of audio signals to build a more holistic understanding of a scenario.
0:36:14 I think it’s going to unlock the next level of safety.
0:36:21 When the model can understand the urgency of my swearing behind the wheel, you’ll know how much danger we’re really in.
0:36:21 Exactly.
0:36:22 Yeah.
0:36:25 So maybe I can add on that.
0:36:26 Yeah, please.
0:36:39 So I believe that safety will shift from a holistic view, where we have to really test all of the necessary scenarios for a specific domain, towards more situational safety,
0:36:53 because we can have models which are able to reason about different situations, which are able to derive from specific basic concepts, and which are able to interact with the driver so that there can be safe conditions
0:37:12 even in unsafe situations. The driver, for instance, gets asked to take over by a model which talks to the driver in natural language, or the model can anticipate, based on the concepts behind it, different situations as unsafe and act accordingly.
0:37:15 And like Brian said, the multimodality plays a major role.
0:37:25 So we will have models which are not only able to reason on visual data, not only able to reason on camera data, for instance,
0:37:35 but also on spatial data, on point clouds by radar and LIDAR sensors, and also on map information and other information which is available in the environment.
0:37:43 And therefore, we believe that future models should have more of a situational understanding of safety, like humans also have.
0:37:52 They do not need to have captured all of the situations, but they are able to anticipate situations and act safely even in an unsafe situation,
0:37:55 and also to anticipate unsafe behavior by other agents.
0:37:57 And this is where we have to go.
0:38:02 Yeah, that whole concept of anticipation is such a big part of living life as a human, right?
0:38:06 And so it makes sense, the importance of it within these systems.
0:38:11 But I will leave the complexities of getting it to work to folks like you.
0:38:17 It’s fantastic for us as users or enthusiasts of automobiles that there’s companies like Porsche out there,
0:38:24 because I’m sure that they’ll balance the safety and automation with the fun factor of being a driver behind the wheel.
0:38:29 Whether or not you have to be driving, I’m sure it’ll be a first-class experience with Porsche involved.
0:38:41 No, that's a great point, because I think anyone who's driven a car has probably taken at least one drive just for fun, or to relax, or to get away from it all and clear your head.
0:38:46 You know, and so just that driving experience is something that I hope we can hang on to going forward.
0:38:46 Absolutely.
0:38:49 Tin, Brian, this has been a great conversation.
0:38:54 I’ve learned a lot about the current and future of autonomous vehicles, so thank you both for joining.
0:39:02 For listeners who would like to learn more, would like to dig a little deeper into what Porsche is doing with autonomous vehicles and everything else,
0:39:07 into what Voxel 51 is all about, where would you point them to go on the web to get started?
0:39:11 Brian, I assume the Voxel 51 website, but where can they go?
0:39:12 Definitely check out our website.
0:39:19 For those technologists in the audience, check out our open-source project, 51, completely free, openly available.
0:39:26 Download it, kick the tires, test out this data-centric view of developing your next visual AI system.
0:39:27 Fantastic.
0:39:30 And, Tin, is there a Porsche research blog?
0:39:32 Is there a part of the website devoted to autonomous vehicles?
0:39:34 Where’s the best place for a listener to start?
0:39:41 Yeah, so we also have a lot of our research open-sourced on the Porsche free and open-source website.
0:39:49 For instance, a benchmark for evaluating foundation models on the task of vision language navigation, based on the four capabilities we just mentioned.
0:39:56 And also a lot of research papers you can grab; just check out our Google Scholar and our GitHub page,
0:40:00 where you can dive into what we're doing on foundation models for autonomous driving.
0:40:01 Fantastic.
0:40:04 Again, thank you guys so much.
0:40:08 And, you know, anytime you want to talk cars and autonomous vehicles and all that stuff,
0:40:11 give a call and maybe we can catch up and do it again in the future.
0:40:12 Looking forward to it.
0:40:13 Thanks, Noah.
0:40:14 Looking forward, Noah.
0:40:14 Thank you.
Tin Sohn, technical lead for vision-language-action models at Porsche, and Brian Moore, CEO and co-founder of Voxel51, explore how AI, data, and simulation are shaping the future of autonomous vehicles. They share insights on the industry’s transition from rule-based systems to data-driven, end-to-end approaches, the growing use of synthetic and simulated data for safety-critical testing, and how foundation models can enable cars to reason, act, and even interact like human drivers. Learn more at ai-podcast.nvidia.com.


