Google DeepMind Developers: How Nano Banana Was Made


AI transcript
0:00:06 These models are allowing creators to do less tedious parts of the job, right?
0:00:10 They can be more creative and they can spend, you know, 90% of their time being creative
0:00:15 versus 90% of their time like editing things and doing these tedious kind of manual operations.
0:00:19 I’m convinced that this ultimately really empowers artists, right?
0:00:20 It gives you new tools, right?
0:00:24 It’s like, hey, we now have, I don’t know, watercolors for Michelangelo.
0:00:25 Let’s see what he does with it, right?
0:00:26 And amazing things come out.
0:00:31 One of the hardest challenges in AI isn’t language or reasoning, it’s vision.
0:00:36 Getting models to understand, compose, and edit images with the same precision that they process text.
0:00:41 Today, you'll hear a conversation with Oliver Wang and Nicole Brichtova from Google DeepMind
0:00:45 about Gemini 2.5 Flash Image, also known as Nano Banana.
0:00:47 They discuss the architecture behind the model,
0:00:51 how image generation and editing are integrated into Gemini’s multimodal framework,
0:00:54 and what it takes to achieve character consistency,
0:00:57 compositional control, and conversational editing at scale.
0:01:00 They also touch on open questions and model evaluation,
0:01:02 safety, and latency optimization,
0:01:07 and how visual reasoning connects to broader advances in multimodal systems.
0:01:08 Let’s get into it.
0:01:16 Maybe start by telling us about the backstory behind the Nano Banana model.
0:01:17 How did it come to be?
0:01:18 How did you all start working on it?
0:01:23 Sure, so our team has worked on image models for some time.
0:01:25 We developed the Imagen family of models,
0:01:26 which goes back a couple years.
0:01:30 And actually, there was also an image generation model in Gemini before,
0:01:31 the Gemini 2.0 image generation model.
0:01:38 So what happened was the teams kind of started to focus more on the Gemini use cases,
0:01:41 so like interactive, conversational, and editing.
0:01:43 And essentially, what happened was we teamed up,
0:01:46 and we built this model, which became what’s known as Nano Banana.
0:01:48 So yeah, that’s sort of the origin story.
0:01:52 Yeah, and I think maybe just some more background on that.
0:01:56 So our Imagen models were always kind of top of the charts for visual quality,
0:01:59 and we really focused on kind of these specialized generation editing use cases.
0:02:02 And then when 2.0 Flash came out,
0:02:05 that’s when we really started to see some of the magic
0:02:08 of being able to generate images and text at the same time,
0:02:09 so you can maybe tell a story.
0:02:11 Just the magic of being able to talk to images
0:02:13 and edit them conversationally.
0:02:16 But the visual quality was maybe not where we wanted it to be.
0:02:20 And so Nano Banana, or Gemini 2.5 Flash Image.
0:02:22 Nano Banana is way cooler.
0:02:23 It’s easier to say.
0:02:24 It’s a lot easier to say.
0:02:24 It’s the name that stuck.
0:02:26 Yes, it’s the name that stuck.
0:02:29 But it really became kind of the best of both worlds in that sense,
0:02:33 like the Gemini smartness and the multimodal kind of conversational nature of it,
0:02:35 plus the visual quality of Imagen.
0:02:37 And I feel like that’s maybe what resonates a lot with people.
0:02:39 Wow, amazing.
0:02:42 So I guess when you were testing out a model as you were developing it,
0:02:45 what were some wow moments that you found?
0:02:47 I know this is going to go viral.
0:02:49 I know people will love this.
0:02:53 So I actually didn’t feel like it was going to go viral
0:02:55 until we had released on LMArena.
0:03:00 And what we saw was that we budgeted a comparable number of queries per second
0:03:02 as we had for our previous models that were on LMArena.
0:03:08 And we had to keep upping that number as people were going to LMArena to use the model.
0:03:11 And I feel like that was the first time when I was really like,
0:03:14 oh wow, this is something that’s very, very useful to a lot of people.
0:03:16 Like it surprised even me.
0:03:17 I don’t know about the whole team,
0:03:20 but we were trying to make the best conversational editing model possible.
0:03:25 But then it really started taking off when people were like going out of their way
0:03:28 and using a website that would actually only give you the model some percentage of the time.
0:03:31 But even that was worth like going to that website to use the model.
0:03:33 So I think that was really the moment, at least for me,
0:03:35 that I was like, oh wow, this is going to be bigger.
0:03:37 That’s actually the best way to condition people.
0:03:40 Like only give them a reward partially.
0:03:41 Not all the time.
0:03:42 Not by design.
0:03:46 I had a moment earlier, so I’ve been trying some similar queries
0:03:49 on kind of multiple generations of models over time.
0:03:52 And a lot of them have to do with things I wanted to be as a kid.
0:03:56 So like an astronaut, explorer, or put me on the red carpet.
0:04:00 And I tried it on a demo that we had internally before we released the model.
0:04:03 It was the first time when the output actually looked like me.
0:04:06 And you guys play with these models all the time.
0:04:10 The only time that I've seen that before is if you fine-tune a model using LoRA
0:04:13 or some other method to do that and you need multiple images
0:04:16 and it takes a really long time and then you have to like actually serve it somewhere.
0:04:18 So this was the first time when it was like zero shot.
0:04:21 Oh wow, just one image of me and it looks like me.
0:04:22 And I was like, wow.
0:04:26 And then we ended up with these decks that are just, like, covered in my face
0:04:28 as I was trying to convince other people that it was really cool.
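As a rough illustration of the zero-shot personalization described here (one reference photo, no LoRA fine-tuning), a minimal sketch using the google-genai Python SDK might look like the following. The model name, prompt, and response handling are assumptions for illustration, not an official example from the speakers.

```python
from io import BytesIO

from google import genai
from PIL import Image

# Assumes the GEMINI_API_KEY environment variable is set.
client = genai.Client()

# A single reference photo of the subject; no fine-tuning step is needed.
reference = Image.open("me.jpg")

response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",  # "Nano Banana"; exact name may differ
    contents=[reference, "Put this person on the red carpet in an astronaut suit."],
)

# The response can interleave text and image parts; save any returned images.
for i, part in enumerate(response.candidates[0].content.parts):
    if part.inline_data is not None:
        Image.open(BytesIO(part.inline_data.data)).save(f"edit_{i}.png")
    elif part.text:
        print(part.text)
```

The contrast with the LoRA workflow mentioned above is that there is no training loop, no dataset of multiple images, and nothing to host afterwards; the reference image simply goes into the prompt context.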
0:04:31 And really I think the moment more people realized
0:04:35 that it was like a really fun feature to use is when they tried it on themselves.
0:04:37 Because it’s kind of fun when you see it on another person,
0:04:40 but it doesn’t really resonate with people emotionally.
0:04:44 It makes it so personal when it’s like you, your kids, you know, your spouse.
0:04:47 And I think that’s your dog.
0:04:49 And that’s really what started kind of resonating internally.
0:04:52 And then people just started making all these like 80s makeover versions of themselves.
0:04:55 And that’s when we really started to see like a lot of internal activity.
0:04:56 And we were like, okay, we’re on to something.
0:05:00 It’s a lot of fun to test these models when we’re making them
0:05:02 because you see all these amazing creative things that people make.
0:05:04 Oh wow, I never thought that was possible.
0:05:05 So it’s really fun.
0:05:09 No, I mean, we've played with it with the whole family and it's a crazy amount of fun.
0:05:12 So let's think a bit about the long term.
0:05:13 Where does this lead, right?
0:05:18 I mean, we built these new tools that I think will change visual arts forever, right?
0:05:20 We suddenly can transfer style.
0:05:24 We suddenly can generate consistent images of a subject, right?
0:05:27 I have what used to be a very complex manual Photoshop process.
0:05:30 Suddenly I type one command and magically it happens.
0:05:31 What’s the end state of this?
0:05:33 I mean, do we have an idea yet?
0:05:37 How will creative arts be taught in a university in five years from now?
0:05:41 So I think it’s going to be a spectrum of things, right?
0:05:44 I think on the professional side, a lot of what we’re hearing is that
0:05:50 these models are allowing creators to do less tedious parts of the job, right?
0:05:55 They can be more creative and they can spend 90% of their time being creative
0:06:00 versus 90% of their time like editing things and doing these tedious kind of manual operations.
0:06:01 So I’m really excited about that.
0:06:05 I think we’ll see kind of an explosion of creativity like on that side of the spectrum.
0:06:10 And then I think for consumers, there’s sort of like two sides of the spectrum for this probably.
0:06:15 One is you might just be doing some of these fun things like Halloween costumes for my kid, right?
0:06:19 And the goal there is probably just to share it with somebody, right?
0:06:20 Your family or your friends.
0:06:25 On the other side of the spectrum, you might have these tasks like putting together a slide deck, right?
0:06:26 I started out as a consultant.
0:06:27 We talked about it at the beginning.
0:06:32 And you spend a lot of time on like very tedious things like trying to make things look good,
0:06:34 trying to make the story make sense.
0:06:40 I think for those types of tasks, you probably just have an agent who you give the specs of what you’re trying to do.
0:06:43 And then it goes out and like actually lays it out nicely for you.
0:06:47 It creates the right visual for the information that you’re trying to convey.
0:06:51 And it really is going to be this, I think, spectrum depending on what you’re trying to do.
0:06:56 Do you want to be in the creative process and actually tinker with things and collaborate with the model?
0:07:00 Or do you just want the model to like go do the task and be as minimally involved as possible?
0:07:04 So in this new world, then what is art?
0:07:08 I mean, somebody recently said art is if you can create an out-of-distribution sample.
0:07:11 Is that a good definition or is it aiming too high?
0:07:15 Or do you think art is out-of-distribution or in-distribution for the model?
0:07:16 There we go.
0:07:20 I think that out-of-distribution sample, that is a little bit too restrictive.
0:07:24 I think a lot of great art is actually in distribution for art that occurred before it.
0:07:27 So, I mean, what is art?
0:07:29 I think it’s like a very philosophical debate.
0:07:31 And there’s a lot of people that do discuss this.
0:07:34 To me, I think that the most important thing for art is intent.
0:07:39 And so what is generated from these models is a tool to allow people to create art.
0:07:43 And I’m actually not worried about the high-end and the creatives and the professionals.
0:07:46 Because I’ve seen like, if you put me in front of one of these models,
0:07:48 I can’t create anything that anyone wants to see.
0:07:53 But like, I’ve seen what people can do who are creative people and have like intent and these ideas.
0:07:58 And I think that’s the most interesting thing to me is the things they create are really amazing and inspiring for me.
0:08:04 So I feel like the high-end and the professionals and the creatives, like, they’ll always use state-of-the-art tools.
0:08:07 And this is like another tool in the tool belt for people to make cool things.
0:08:14 I think one of the really interesting things that I kept hearing about this model in particular from like creatives and artists was
0:08:22 A lot of them felt like they couldn’t use a lot of AI tools before because it didn’t allow them the level of control that they expected for their art.
0:08:27 On one side, that was like the characters or object consistency.
0:08:32 Like, they really used that to have a compelling narrative for a story.
0:08:37 And so before, when you couldn’t get the same character over and over, it was very difficult.
0:08:44 And then I think the, like, second thing I hear all the time from artists is they love being able to upload multiple images and say,
0:08:54 use the style of this on this character or add this thing to this image, which is something that I think was very hard to do even with previous image edit models.
0:08:59 I guess I’m curious, was that something you guys were really optimizing for when you trained this one?
0:09:01 Or how did you think about that?
0:09:09 I mean, yeah, definitely sort of customizability and character consistency are things that we closely monitored during the development.
0:09:12 And we tried to do the best job we could on them.
0:09:17 I think another thing is also the iterative nature of kind of like an interactive conversation.
0:09:22 And art tends to be iterative as well, where you make lots of changes, you see where it’s going and you make more.
0:09:25 And this is another thing that I think makes the model more useful.
0:09:28 And actually, that’s an area that I also feel like we can improve the model greatly.
0:09:34 Like, I know that once you get into really long conversations, like, it starts to follow your instructions a little bit worse.
0:09:40 But it’s something that we’re planning to improve on and make the model more kind of like a natural conversation partner,
0:09:42 like a creative partner in making something.
0:09:50 One thing that’s so interesting is after you guys launched Nano Banana, we start to hear about editing models all the time, everywhere.
0:09:55 It's like, after you launched, the world woke up and was like, editing models, they're great.
0:09:56 Everyone wants it.
0:10:02 And then obviously, like, it kind of goes into the customizability, the personalization of it.
0:10:04 And then, Oliver, I know you used to be at Adobe.
0:10:08 And then there’s also software where we used to manually edit things.
0:10:13 How do you see the knobs evolve now on the model layer versus what we used to do?
0:10:20 Yeah, I mean, I think that one thing that Adobe has always done and the professional tools generally require
0:10:24 is lots of control, lots of knobs, lots of, so there’s always a balance of,
0:10:29 we want someone to be able to use this on their phone, maybe with just like a voice interface.
0:10:36 And we also want someone who can really, like a really professional or creative to be able to do fine scale adjustments.
0:10:39 I think we haven’t exactly figured out how to enable both of those yet.
0:10:45 But there’s a lot of people that are building really compelling UIs and I think there’s different ways it can be done.
0:10:46 I don’t know.
0:10:47 You have thoughts on this?
0:10:52 Well, I also hope that we get to a point where you don’t have to learn what all these controls mean.
0:10:58 And the model can maybe smartly suggest what you could do next based on the context of what you’ve already done.
0:11:03 And that feels like it's kind of ripe for someone to take on.
0:11:06 So like what do the UIs of the future look like?
0:11:10 In a way where you probably don’t need to learn a hundred things that you had to before,
0:11:15 but like the tools should be smart enough to suggest to you what it can do based on what you’re already doing.
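One way to read the "smartly suggest what you could do next" idea is as a prompting pattern rather than a new API: ask the model itself for follow-up edit ideas given the current image. The sketch below assumes the google-genai SDK and an illustrative model name; it is a guess at the idea, not a documented feature.

```python
from google import genai
from PIL import Image

client = genai.Client()  # assumes GEMINI_API_KEY is set
current = Image.open("work_in_progress.png")

response = client.models.generate_content(
    model="gemini-2.5-flash",  # any multimodal Gemini model; name is an assumption
    contents=[
        current,
        "You are an image-editing assistant. Based on this work-in-progress image, "
        "suggest three short follow-up edits the user might want next, one per line.",
    ],
)
print(response.text)  # e.g. feed these suggestions into UI chips or buttons
```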
0:11:16 That’s such an insightful take.
0:11:19 I definitely had moments when I used Nano Banana.
0:11:24 I was like, I didn’t know I wanted this, but I didn’t even ask for the style.
0:11:28 I don’t even have the words for what that style even, you know, is called.
0:11:34 So this is very telling about how image embeddings and language embeddings are not one-to-one.
0:11:37 Like, we cannot express all the editing tasks with language.
0:11:39 So, oh, go ahead.
0:11:42 Let me start taking a little bit of the counterpoint just to see where this goes.
0:11:50 In the end, how complex the interface can be is limited partly by what we can express in software,
0:11:51 how easy we can make something in software,
0:11:54 and to some degree it's also limited by how much complexity a user is willing to tolerate.
0:11:58 And, you know, if you have a professional, they only care about the result.
0:12:00 They’re willing to tolerate a vast amount of complexity.
0:12:04 They have the training, they have the education, they have the experience to use that, right?
0:12:07 Then we may end up with lots of knobs and dials.
0:12:08 It’s just very different.
0:12:11 But I mean, today, if you use a cursor or so for coding,
0:12:16 it’s not that it has a super easy, you know, single text prompt interface.
0:12:21 It has a good amount of, you know, add context here, different modes and so on.
0:12:28 So will we have like the ultra-sophisticated interface for the power user?
0:12:29 And how would that look like?
0:12:33 So I’m a big fan of Comfy UI and node-based interfaces in general.
0:12:34 And that is complex.
0:12:38 And it’s complex, but it’s also, it’s very robust and you can do a lot of things.
0:12:39 It’s incredible, yeah.
0:12:41 And so, you know, after we released Nano Banana,
0:12:45 we saw people building all these really complicated Comfy UI workflows
0:12:48 where they were combining a bunch of different models together and different tools.
0:12:51 And that’s generated some of the, like, for example, using Nano Banana
0:12:54 as a way to get storyboards or keyframes for video models.
0:12:58 Like, you can plug these things together and get really amazing outputs.
0:13:01 So I think that, like, at the pro or the developer level,
0:13:03 like, these kinds of interfaces are great.
0:13:06 In terms of, like, the prosumer level,
0:13:09 I think it’s very much unknown what it’s going to look like in a couple years.
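The ComfyUI-style chaining described above (image model for storyboards or keyframes, video model for motion) can be sketched in plain code as well. Here generate_keyframe and animate_between are hypothetical stand-ins for whatever image and video backends a given workflow wires together; the structure, not the function names, is the point.

```python
from typing import List, Optional

def generate_keyframe(prompt: str, reference: Optional[bytes] = None) -> bytes:
    """Hypothetical node: return a PNG keyframe for one storyboard beat,
    optionally conditioned on a previous frame for character consistency."""
    raise NotImplementedError

def animate_between(start_frame: bytes, end_frame: bytes, seconds: float) -> bytes:
    """Hypothetical node: return a short clip interpolating between two keyframes."""
    raise NotImplementedError

def storyboard_to_clips(beats: List[str]) -> List[bytes]:
    # One keyframe per story beat, reusing the previous frame as a reference
    # so the same characters carry through shot to shot.
    frames: List[bytes] = []
    for beat in beats:
        frames.append(generate_keyframe(beat, reference=frames[-1] if frames else None))
    # Animate each adjacent pair of keyframes into a clip.
    return [animate_between(a, b, seconds=4.0) for a, b in zip(frames, frames[1:])]
```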
0:13:12 Yeah, I think it just really depends on your audience, right?
0:13:15 Because for the regular consumer, like, I use my parents always as an example,
0:13:17 the chatbot is actually kind of great.
0:13:18 Oh, yeah, totally.
0:13:20 Because you don’t have to learn a new UI.
0:13:23 You just upload your images and then you talk to them, right?
0:13:25 Like, it’s kind of amazing that way.
0:13:28 Then for the pros, I agree that, like, you need so much more control
0:13:30 than, you know, and then there’s somewhere in between, probably,
0:13:33 which are people who may want to be doing this,
0:13:36 but they were too intimidated by the professional tools in the past.
0:13:38 And for them, I do think that there’s a space of, like,
0:13:41 that you need more control than the chatbot gives you,
0:13:44 but you don’t need as much control as what the professional tools give you.
0:13:46 And, like, what’s that kind of in-between state?
0:13:48 There’s a ton of opportunity there.
0:13:49 There’s a ton of opportunity there.
0:13:51 It is interesting you mentioned Comfy UI
0:13:55 because it’s on the other far spectrum of workflow.
0:13:58 Like, a workflow can have hundreds of steps and nodes
0:13:59 and you need to make sure all of them work.
0:14:03 Whereas on the other side of the spectrum, there's Nano Banana.
0:14:05 You kind of describe something with words
0:14:06 and then you get something out.
0:14:08 Like, I don’t know what’s a model architecture, stuff like that,
0:14:14 but I guess, is it your view that the world is moving to
0:14:16 a model hosted by one provider doing it all?
0:14:21 Or do you think the world is moving to more of everyone building a workflow,
0:14:24 where Nano Banana is one of the nodes in a ComfyUI workflow?
0:14:32 I definitely don’t think that the broad amount of use cases
0:14:35 will be fully satisfied by one model at any point.
0:14:38 So I think that there will always be a diversity of models.
0:14:40 Some, like, I’ll give you an example,
0:14:44 but some, you know, we could optimize for instruction following in our models,
0:14:46 make sure it does exactly what you want.
0:14:50 But it might be a worse model for someone who’s looking for ideation
0:14:53 or kind of inspiration where they want the model to kind of take over
0:14:55 and do other things, go crazy.
0:14:57 So, like, I just think there’s so many different use cases
0:14:59 and so many types of people that, like, there’s a lot of space.
0:15:01 There’s a lot of room in this space for multiple models.
0:15:04 So that’s where I see us going.
0:15:08 I don’t think this is going to be like a single model to rule them all.
0:15:09 That makes sense.
0:15:12 Let’s go to the very other end of the spectrum from the professional.
0:15:16 Do you think kindergartners in the future will learn drawing
0:15:18 by sketching something, you know, on a little tablet
0:15:22 and then you have the AI turn that into a beautiful image
0:15:25 and so that's how they get in touch with art on their own?
0:15:29 I don’t know if you always want it to turn into a beautiful image,
0:15:33 but I think there’s something there about the AI being, again,
0:15:36 a partner and a teacher to you in a way that you, like, didn’t have.
0:15:38 So I didn’t know how to draw.
0:15:41 I still don’t have any talent for it, really.
0:15:45 But I think it would be great if we could use these tools
0:15:47 in a way that actually teaches you kind of the step-by-steps
0:15:49 and helps you critique and maybe, again,
0:15:52 shows you kind of like an autocomplete almost for images.
0:15:55 Like, what’s the next step that I could take, right?
0:15:58 Or maybe show me a couple of options and, like, how do I actually do this?
0:16:00 So I hope it’s more that direction.
0:16:02 I don’t think we all want, you know,
0:16:04 every five-year-old’s image to suddenly look perfect.
0:16:09 We would probably lose something in the process.
0:16:14 As someone who struggled the most in high school out of all my classes of the art and the sketching class,
0:16:16 I actually would have preferred it.
0:16:20 But I know a lot of people want their kids to learn to draw, which I understand.
0:16:26 It’s funny because we’ve been trying to get the model to create, like, childlike crayon drawings,
0:16:27 which is actually quite challenging.
0:16:32 Ironically, you know, sometimes the things that are hard to make are,
0:16:34 because the level of abstraction is very large.
0:16:35 Right.
0:16:37 So it’s actually quite difficult to make those types of images.
0:16:39 Do you need a dedicated pre-K fine-tune?
0:16:39 Yeah.
0:16:45 We do have some in our evals right now to try to see if we’re getting better.
0:16:50 In general, I’m very optimistic about AI for education.
0:16:54 And part of the reason is, I think, that most of us are visual learners, right?
0:17:00 So the AI right now, as a tutor, at least all it can do is talk to you or give you text to read.
0:17:02 And that’s definitely not how students learn.
0:17:10 So I think that these models have a lot of potential as a way to help education by giving people sort of visual cues.
0:17:13 Imagine if you could get an explanation for something where you get the text explanation,
0:17:17 but you also get images and figures that kind of, like, help explain how they work.
0:17:21 I think it just, everything would be much more useful, much more accessible for students.
0:17:22 So I’m really excited about that.
0:17:28 On that point, one thing that's very interesting to us is that when Nano Banana came out,
0:17:31 it almost felt like part of the use case is as a reasoning model.
0:17:32 Like, you have a diagram.
0:17:32 Absolutely, yeah.
0:17:33 Right?
0:17:35 Like, you can explain some knowledge visually.
0:17:41 So the model, not just doing an approximation of the visual aspect, there’s the reasoning aspect to it, too.
0:17:43 Do you think that’s where we’re going to?
0:17:51 Do you think all the large models will realize that, oh, like, to be a good LLM or, like, VLM,
0:17:54 we have to have both image and language and audio and so on and so forth?
0:17:56 100%.
0:17:57 I definitely think so.
0:18:05 The future for these AI models that I’m most excited by is where they are tools for people to accomplish more things.
0:18:10 Like, I think if you imagine a future where you have these agentic models that just talk to each other and do all the work,
0:18:13 then it becomes a little bit less necessary that there’s, like, this visual mode of communication.
0:18:19 But as long as there’s people in the loop and as long as the, kind of, the motivation for the task they’re solving comes from people,
0:18:24 I think it makes total sense that visual modality is going to be really critical for any of these AI agents going forward.
0:18:32 Will we get to a point where there’s actually, so, you know, I’m asking you to create an image.
0:18:38 It sits there for two hours, reasons with itself, has drafts, explores different directions, and then comes back with a final answer?
0:18:39 Yeah, absolutely.
0:18:41 If it’s necessary, yeah.
0:18:46 And maybe not just for a single image, but to the point of, you know, maybe you’re redesigning your house.
0:18:50 And maybe you actually really don’t want to be involved in the process, right?
0:18:51 But you’re like, okay, this is what it looks like.
0:18:53 Like, there’s some inspiration that I like.
0:18:57 And then you send it to a model the same way that you would send it to, like, a designer.
0:18:59 It’s the visual deep research.
0:19:01 It’s like visual deep research, basically.
0:19:02 I really like that term.
0:19:06 And then it goes off and does its thing and searches for maybe the furniture that would go with your environment.
0:19:09 And then it comes back to you and maybe presents you with options.
0:19:12 Because maybe you don’t want to sit for two hours and get one thing.
0:19:14 This is a hundred-page art book on a new house.
0:19:17 This is a ten-slide deck.
0:19:23 Also, I think if you think about, like, instruction manuals or, like, IKEA directions or something,
0:19:29 then, like, breaking down a hard problem into many intermediate steps could be really useful as a way to communicate.
0:19:32 So when can we generate Lego sets?
0:19:33 Yeah.
0:19:34 Soon, maybe.
0:19:38 Do we at some point need 3D as part of it?
0:19:39 Right.
0:19:43 I mean, there’s a whole debate around world models and image models and how they fit together.
0:19:46 Enlighten us here.
0:19:48 What is the short summary of where we’ll end up there?
0:19:50 I mean, I don’t know the answer.
0:19:54 I think that, obviously, the real world is in 3D.
0:19:59 So if you have a 3D world model or a world model that has explicit 3D representations,
0:20:00 there’s a lot of advantages.
0:20:03 For example, everything stays consistent all the time.
0:20:07 Now, the main challenge is that we don’t walk around with 3D capture devices in our pocket.
0:20:11 So in terms of, like, the available data for training these models,
0:20:13 it’s largely the projection onto 2D.
0:20:17 So I think that both viewpoints are totally valid for where we’re going.
0:20:19 I come a bit from the projection side.
0:20:22 Like, I think we can solve almost all the problems, if not all the problems,
0:20:24 working on the projection of the 3D world directly
0:20:28 and letting the models learn the latent world representations.
0:20:31 I mean, we see this already, that the video models have very good 3D understanding.
0:20:34 You can run reconstruction algorithms over the videos you generate,
0:20:36 and they’re very accurate.
0:20:39 And in general, if you look at, like, the history of human art,
0:20:42 like, it starts as, like, the projection, right?
0:20:43 People drawing on cave walls.
0:20:45 All of our interfaces are in 2D.
0:20:52 So I think that, like, humans are very well suited for working on this projection of the 3D world into a 2D plane.
0:20:55 And it’s a really natural environment for interfaces and for viewing.
0:20:57 That is very true.
0:21:00 Like, so I’m a cartoonist in my spare time.
0:21:03 And then drawing in 2D is just light and shadow.
0:21:04 And then you present yourself with 3D.
0:21:09 We trick ourselves to believing it’s 3D or it’s, you know, on a piece of paper.
0:21:15 But then what human can do that, you know, like, a drawing or, like, a model can do is we can navigate the world.
0:21:17 Like, we see a table.
0:21:18 We can’t walk past it.
0:21:22 I guess the question becomes, if everything is 2D, how do you solve that problem?
0:21:25 Well, I don’t think, yeah.
0:21:34 So if we’re trying to solve the robotics problems, I think maybe the 2D representation is useful for planning and visualizing kind of at a high level.
0:21:40 Like, I think people navigate by remembering kind of 2D projections of the world.
0:21:42 Like, you don’t build a 3D map in your head.
0:21:43 You’re more like, oh, I know.
0:21:43 I see this building.
0:21:44 I turn left.
0:21:44 Yeah.
0:21:46 So I think that, like, for that kind of planning, it’s reasonable.
0:21:51 But for the actual locomotion around the space, like, definitely 3D is important there.
0:21:53 So robotics, yeah, they probably need 3D.
0:21:56 That’s a saving grace.
0:22:07 So character consistency, which you previously mentioned, I really love the example of, like, when a model feels so personal, like, people are so tempted to try it.
0:22:09 How did you unlock that moment?
0:22:12 The reason why I ask is that character consistency is so hard.
0:22:15 There’s a huge uncanny valley to it.
0:22:22 You know, like, if it’s someone I don’t know, if I see their AI generation, I’m like, okay, it’s maybe the same person.
0:22:30 But if it’s someone I know, if there’s just a little bit of a difference, I actually felt very turned off by it because I’m like, this is not a real person.
0:22:34 So in that case, how do you know where generating is good?
0:22:38 And then is it mostly by user feedback or, like, I love this?
0:22:39 Or is it something else?
0:22:41 You look at faces you know.
0:22:44 But that’s a very small sample size.
0:22:46 Face detection camera, user.
0:22:51 So even before we ever released this, right?
0:22:56 So when we were developing this model, we actually started out doing character consistency evals on faces we didn’t know.
0:22:58 And it doesn’t tell you anything.
0:23:04 And then we started testing it on ourselves and quickly realized, like, okay, this is what you need to do because this is a face that I’m familiar with.
0:23:08 And so there is a lot of sort of eyeballing evaluations that happens.
0:23:18 And just the team testing it on themselves and just generally people they know, like, Oliver probably knows my face at this point enough to be able to tell whether or not it’s actually me when it’s generated.
0:23:20 And so we do do a lot of that.
0:23:24 And then, you know, you ideally tested on different sets of people, different ages, right?
0:23:28 Different kind of groups of folks to make sure that it kind of works across the board.
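As a complement to the eyeballing described here (and purely as an illustration, not a description of the team's eval stack), an automatic identity check can compare face embeddings of the reference photo and the generated edit. This sketch assumes the open-source face_recognition package; embedding distances only roughly track human perception, which is exactly the evaluation difficulty discussed next.

```python
import face_recognition

def identity_similarity(reference_path: str, generated_path: str) -> float:
    """Rough 0-1 score of how well the generated face matches the reference."""
    ref_img = face_recognition.load_image_file(reference_path)
    gen_img = face_recognition.load_image_file(generated_path)

    ref_enc = face_recognition.face_encodings(ref_img)
    gen_enc = face_recognition.face_encodings(gen_img)
    if not ref_enc or not gen_enc:
        return 0.0  # no face detected in one of the images

    # face_distance is a Euclidean distance between 128-d encodings; smaller is
    # more similar. Map it loosely onto a 0-1 similarity score.
    distance = face_recognition.face_distance([ref_enc[0]], gen_enc[0])[0]
    return float(max(0.0, 1.0 - distance))

# Example: flag generations whose identity score drops below a chosen threshold.
# score = identity_similarity("reference.jpg", "red_carpet_edit.png")
# print("identity preserved" if score > 0.4 else "identity drifted")
```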
0:23:30 Yeah, I think that's right.
0:23:35 I mean, that touches a little bit on this bigger issue, which is that, like, evals are really difficult in this space.
0:23:40 Because human perception is very uneven in terms of the things that it cares about.
0:23:45 So really, it’s very hard to know, like, how good is the character consistency of a model?
0:23:48 And is it good enough?
0:23:49 Is it not good enough?
0:23:52 Like, you know, I think there’s still a lot of improvement we can make on character consistency.
0:23:59 But I think that for some use cases, like, we got to a point, and that’s, you know, we weren’t the first edit model by any means.
0:24:06 But I think that, like, once the quality gets above a certain level for character consistency, it can kind of just take off because it becomes useful for so much more.
0:24:10 And I think as it gets better, it’ll be useful for even more things, too.
0:24:22 I think one of the really interesting things we’re seeing across a bunch of modalities, of which image, edit, and generation, obviously, is one, is, like, I think the arenas and benchmarks and everything are awesome.
0:24:35 But especially when you have, like, multidimensional things like image and video, it’s very hard, as all of the models get better and better, to condense every quality of a model into, like, one judgment.
0:24:42 So it’s like, you know, you’re judging, okay, you swap a character into an image and you change the style of the image.
0:24:47 Maybe one did the character swap and consistency much better, and the other did the style much better.
0:24:49 Like, how do you say which output is better?
0:24:55 And it probably comes down to, like, what the person cares most about and what they want to use it for.
0:25:09 Are there, like, certain, you know, characteristics of the model that you guys value more than other things in, like, making those tradeoffs when deciding which version of the model to deploy or, like, what to really focus on during training?
0:25:13 Yes, there are.
0:25:16 One of the things I like about this space is that there is no right answer.
0:25:21 So actually, there’s quite a lot of, I don’t know if it’s taste, but it’s, like, preference that goes into the models.
0:25:26 And I think you can kind of see the difference in preferences of the different research labs in the models that they release.
0:25:36 So, like, when we’re balancing two things, a lot of it comes down to, like, oh, well, I don’t know, I just like this look better or, you know, this feature is more important to us.
0:25:41 I’d imagine it’s hard for you guys, too, because you have so many users, right?
0:25:53 Like, Google, like, being in the Gemini app, like, everyone in the world can use that versus, like, many other AI companies just think about, like, we're only going for the professional creatives or we're only going for the consumer meme makers.
0:26:00 And, like, you guys have the unique and exciting but challenging task of, like, literally anyone in the world can do this.
0:26:02 How do we decide what everyone would want?
0:26:06 Yeah, and it is, sometimes we do make these tradeoffs.
0:26:11 We do have a set of things that are sort of, like, super high priority that we don’t want to regress on, right?
0:26:18 So now, because character consistency was so awesome and so many people are using it, we don’t want our next models to get worse on that dimension, right?
0:26:19 So we pay a lot of attention to it.
0:26:24 We care a lot about images looking photorealistic when you want photos.
0:26:25 And this is important.
0:26:28 One, I think we all prefer that style.
0:26:37 Two, you know, for advertising use cases, for example, like, a lot of it is kind of photorealistic images of products and people.
0:26:39 And so we want to make sure that we can kind of do that.
0:26:43 And then sometimes there are just things that, like, will kind of fall down the wayside.
0:26:49 So for this first release, the model is not as good at text rendering as we would like it to be.
0:26:51 And that’s something that we want to fix in the future.
0:26:58 But it was kind of one of those things where we looked at, okay, the model is good at X, Y, Z, not as good at this, but we still think it’s okay to release.
0:27:01 And it will still be an exciting thing for people to play with.
0:27:14 If you look at the past, right, we had, for previous model generations, a lot of things we did with, like, sidecar models, like ControlNet or something like that, where we basically figured out a way to provide structured data to the model to achieve a particular result.
0:27:21 It seems like with these newer models that has taken a step back, just because they're so incredibly good at just prompting or, you know, giving a reference image and picking things up from there.
0:27:23 Where will this go long-term?
0:27:25 Do you think this will come back to some degree?
0:27:35 You know, like, I mean, from the creator's perspective, right, having, I don't know, open pose information so I can get a pose exactly right, right, for multiple characters, this seems very, very tempting, right?
0:27:41 Is it like, or to rephrase it a little bit, it’s like, does the bitter lesson hold here that at the end of the day everything’s just one big model and you throw things in?
0:27:45 Or is there a little bit of structure we can offer to make this better?
0:27:52 I mean, I think that there will be, there will always be users that want control that the model doesn’t give you out of the box.
0:28:00 But I think we tried to make it so that, you know, because really what an artist wants when they want to do something is they want the intent to be understood.
0:28:06 And I think that these AI models are getting better at understanding the intent of users.
0:28:09 So often when you ask text queries, now the model gets what you’re going for.
0:28:16 So, you know, in that sense, I think we can get pretty far with understanding the intent of our users.
0:28:22 And maybe some of that is personalization, like we need to know information about what you’re trying to do or what you’ve done in the past.
0:28:33 But I think once you can understand the intent, then you can generally do the type of edit, like, is this like a very structure-preserving edit or is this like a freeform kind of, like, we can learn these kinds of effects, I think.
0:28:40 But still, of course, there’s one person who’s going to really care about every pixel and, like, this thing needs to be slightly to the left and a little bit more blue.
0:28:42 And, like, those people will use existing tools to do that.
0:28:48 I mean, I think it’s like, you know, I want an image with 26 people spelling out every letter of the alphabet or something.
0:28:54 That’s sort of the thing where I think we’re still quite a bit away from getting that right, you know, in the first try.
0:28:56 On the other hand, with pose information, potentially.
0:28:59 But then the question, I guess, is like,
0:29:02 do you really want to be the one who’s like extracting the pose
0:29:03 and providing that as information?
0:29:04 That’s a very good question.
0:29:07 Or do you just want to provide some reference image
0:29:09 and say like, this is actually what I want.
0:29:11 [unintelligible]
0:29:12 Go figure this out, right?
0:29:13 There are 26 people.
0:29:13 Yes, yes, yes.
0:29:14 The alphabet.
0:29:15 No, they’re in a different style.
0:29:15 Fair enough.
0:29:16 Yeah.
0:29:22 I think in that case, I wouldn’t spend a ton of time building a custom interface
0:29:24 for making this picture of 26 people.
0:29:27 So it seems like the kind of thing that we can solve.
0:29:28 Just transfer.
0:29:33 Do you think the representation of what the AI images are will change?
0:29:37 So the reason why I ask the questions, as artists, there’s different formats we play with.
0:29:38 There’s the SVGs.
0:29:41 We have anchor points and bezier curves.
0:29:45 And on the other side, there’s, you know, Procreate or like Fresco, what have you.
0:29:48 There’s layers that we can also play with.
0:29:51 There’s the other parameter, which is what’s the brush you use?
0:29:53 Like the brush, the texture of it.
0:29:58 So every one parameter, you can write script and actually do something very personal about it.
0:30:05 Do you think like pixel is the right representation, the end game for image generation model?
0:30:08 Or do you think there’s a net new representation that we haven’t invented yet?
0:30:09 That’s an easy question.
0:30:12 Wow.
0:30:18 I’ll say that everything is a subset of pixels.
0:30:19 That’s true.
0:30:19 Yeah.
0:30:20 So text is a subset of pixels.
0:30:21 Right.
0:30:23 Because I could just render all the text as an image.
0:30:27 So how far can we get with just pixels is an interesting question.
0:30:33 I think, you know, if the model is really responsive and handles multi-turn interactions well,
0:30:34 then I think you can probably get pretty far.
0:30:39 Because the primary reason I think you would want to leave the pixel domain is for editability.
0:30:45 And so, you know, in cases where you need to have your font or you want to change the text
0:30:51 or you want to move things around just like control points, it could be useful to have kind
0:30:55 of mixed generation, which consists of pixels and SVGs and other forms.
0:31:00 But if we can do it all, if we can, if the multi-turn interaction is enough, then I think
0:31:01 you can get pretty far with pixels.
0:31:07 I will say that one of the things that’s exciting about these models that have native capabilities
0:31:11 is that you now have a model that can generate code and it can generate images.
0:31:15 So there’s a lot of interesting things that come in that intersection, right?
0:31:19 Like maybe I wanted to write some code and then make some things be rasterized, some things
0:31:20 be parametric.
0:31:23 Like stick it all together, train it together.
0:31:24 Like this would be very cool.
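A toy illustration of the raster-versus-parametric tradeoff being discussed: the same text can live as pixels (easy to generate, hard to re-edit) or as an SVG whose font, size, and position remain explicit parameters. This uses PIL for the raster side and is not specific to the Gemini models.

```python
from PIL import Image, ImageDraw, ImageFont

# Rasterized: once the text is pixels, changing the font or wording later
# means regenerating or inpainting the image.
canvas = Image.new("RGB", (400, 120), "white")
draw = ImageDraw.Draw(canvas)
draw.text((20, 45), "Hello, Nano Banana", fill="black", font=ImageFont.load_default())
canvas.save("label_raster.png")

# Parametric: the same text kept as SVG stays directly editable.
svg = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="400" height="120">'
    '<text x="20" y="70" font-family="sans-serif" font-size="28">Hello, Nano Banana</text>'
    "</svg>"
)
with open("label_vector.svg", "w") as f:
    f.write(svg)
```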
0:31:29 That's such a good point because I did see a tweet of someone asking Claude Sonnet to replicate
0:31:36 an image on an Excel sheet where every cell is a pixel, which is like a very fun exercise.
0:31:40 It was like a coding model and it doesn’t really know anything about, you know, images.
0:31:40 Yeah, it worked.
0:31:44 Yeah, there’s the classic pelican riding a bicycle test.
0:31:47 Yeah, totally.
0:31:51 I have one on model, like on interfaces, if that’s okay.
0:31:53 I don’t, sorry if I’m bringing up too much product stuff, guys.
0:31:56 I’m just very curious on the product front.
0:32:03 Like, I guess I’m curious how you think about like owning the interface where people are editing
0:32:10 or generating images with Nano Banana versus really just wanting a ton of people to use the
0:32:12 model for different things in the API.
0:32:20 Like we’ve talked about so many different use cases like ads, you know, education, design, like architecture.
0:32:29 Each of those things could be, there could be a standalone product built on top of Nano Banana that prompts the model in the right way or allow certain types of inputs or whatever.
0:32:42 Is your guys’ vision like that the kind of product in the Gemini app is like a playground for people to explore and then developers will build the individual products that are used for certain use cases?
0:32:45 Or is that something you’re also kind of interested in owning?
0:32:47 I think it’s a little bit of everything.
0:32:53 So I definitely think that the Gemini app is an entry point for people to explore.
0:33:07 And the nice thing about Nano Banana is I think it shows that fun is kind of a gateway to utility where, you know, people come to make a figurine image of themselves, but then they stay because it helps them with their math homework or it helps them write something.
0:33:10 Right. And so I think that’s a really powerful kind of transition point.
0:33:15 There’s definitely interfaces that we’re interested in building and exploring as a company.
0:33:25 And so, you know, you may have seen Flow from Josh's team in Labs that's really trying to rethink, like, what's the tool for AI filmmakers, right?
0:33:29 And for AI filmmakers, image is actually a big part of the iteration journey, right?
0:33:30 Because video creation is expensive.
0:33:35 A lot of people kind of think in frames when they initially start creating.
0:33:40 And a lot of them even start in the LLM space for, like, brainstorming and thinking about what they want to create in the first place.
0:33:46 And so there’s definitely a kind of place that we have in that space of just us trying to think about, like, what does this look like?
0:33:53 We have the advantage of it kind of sitting close to the models and the interfaces so we can kind of build that in a tight coupling.
0:33:59 And then there's definitely the, you know, we're probably not going to go build software for an architecture firm.
0:34:02 My dad is an architect and he would probably love that.
0:34:07 But I don’t think that’s something that we will do, but somebody should go and do that.
0:34:12 And that’s why it’s exciting because we do have the developer business and we have the enterprise business.
0:34:20 And so people can go use these models and then figure out, like, what’s the next generation workflow for, like, this specific audience so that I can help them solve a problem.
0:34:24 So I think the answer is kind of like, yes, all three.
0:34:26 Yeah, I brought that up.
0:34:32 I don’t know if you guys have been following the reception of Nano Banana in Japan, but I’m sure you’ve had it.
0:34:33 It’s been insane.
0:34:34 And it’s so funny.
0:34:51 Like, now half of my X feed is these really heavy Nano Banana users in Japan who have created, like, Chrome extensions. There's one called, like, Easy Banana that's specifically for using Nano Banana for, like, manga generation and specific types of anime and things like that.
0:35:06 And, like, they go super deep into basically prompting the model for you and storing the outputs in various places, using, obviously, your underlying model to generate these, like, amazing anime that you would never guess were AI generated.
0:35:15 Because, like, the level of precision and consistency and that sort of thing is just beyond what I’ve seen any single model be able to do today.
0:35:22 I guess, what are some, like, to Justin’s point, what are some force multipliers that you guys have seen in the model?
0:35:31 So what I mean by this is, for example, if you unlock character consistency, you can generate different frames and then you can make a video and then you can make a movie, right?
0:35:39 So these are the things that, if you get it right and get it really well, there’s so much more downstream tasks that can derive from it.
0:35:44 Just curious, like, how do you think about what are the force multipliers that you want to unlock?
0:35:46 So the next…
0:35:46 What’s the next big one?
0:35:52 What’s the next, yeah, big wave of people who can just use Nano Banana as a base model for all the downstream tasks?
0:35:57 So I think one current one, actually, is also the latency point, right?
0:36:05 Because I think it’s also just, like, it makes it really fun to iterate with these models when it just takes 10 seconds to generate the next frame, right?
0:36:08 If you had to sit there and wait for two minutes, like, you would probably just give up and leave.
0:36:09 A very different experience.
0:36:11 So I think that’s one.
0:36:16 Just, like, there has to be some quality bar because if it’s just fast and the quality isn’t there, then it also doesn’t matter, right?
0:36:20 Like, you have to hit a quality bar and then speed becomes, of course, a multiplier.
0:36:28 I think this general idea of just, like, visualizing information to your education point from earlier is sort of another one, right?
0:36:30 And that needs good text.
0:36:32 It needs factuality, right?
0:36:39 Because if you’re going to start making kind of visual explainers about something, it looks nice, but it also needs to be accurate.
0:36:40 Right.
0:36:47 And so I think that’s probably kind of the next level where at some point then you could also just have a personalized textbook to you, right?
0:36:50 Where it’s not just the text that’s different, but it’s also all the visuals.
0:36:50 Yeah.
0:36:51 Diamond Age.
0:36:52 That was basically.
0:36:53 Yeah.
0:36:54 Yeah, basically.
0:36:58 And then it should also internationalize really well, right?
0:37:08 Because a lot of the times today, you might actually be able to find a diagram that explains the thing that you’re trying to learn about on the internet, but it’s maybe not in the language that you actually speak, right?
0:37:15 And so I think that becomes just, like, another way to improve and open up accessibility of information to just a lot more people.
0:37:18 And, again, visually, because a lot of people are visual learners.
0:37:19 Interesting.
0:37:23 How do you think about, like, image generation versus video generation?
0:37:35 So the reason why I ask is that there's another very cool example I've seen of someone making something with Nano Banana, where he wrote a script and then kept prompting the model to say, generate the frame one second after this.
0:37:38 And then it became a video.
0:37:44 So, and then when I saw it, I’m like, well, is every image just one frame in a continuum?
0:37:47 Like, you always know about the continuum in a parallel universe.
0:37:49 You could have, you know, generated any one of them.
0:37:51 It’s one big directed graph.
0:37:52 Right, exactly.
0:37:54 And then maybe it’s video at the end of the day.
0:37:55 So how do you see that?
0:37:57 Where does it, you know, intersect or not intersect?
0:38:03 I think it’s very, yeah, video and images are very closely related.
0:38:15 And also I think what we're seeing in these kind of what-comes-next or sequence-prediction use cases is the generalization and world knowledge of the model as well.
0:38:21 And so where do I think it's going?
0:38:29 I think that we will have, yeah, I think video is an obvious next kind of domain.
0:38:34 I think that like when you have editing, a lot of times what you’re asking is like, you know, what happens if I do this?
0:38:35 And that’s what video has.
0:38:37 It has the time sequence of actions.
0:38:50 So it's like we have a slow frames-per-second video that you can interact with, but obviously making something that's fully interactive and real time is the direction this field is headed.
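The "generate the frame one second after this" trick amounts to a simple rollout loop. In the sketch below, edit_image is a hypothetical wrapper around an image-editing model such as Nano Banana; the frames are saved to disk and can be stitched into a video with a standard tool like ffmpeg.

```python
from typing import List

from PIL import Image

def edit_image(frame: Image.Image, prompt: str) -> Image.Image:
    """Hypothetical wrapper around an image-editing model (e.g. Nano Banana)."""
    raise NotImplementedError

def roll_forward(first_frame: Image.Image, num_frames: int) -> List[Image.Image]:
    frames = [first_frame]
    for _ in range(num_frames - 1):
        # Each step conditions on the previous frame, so the scene and
        # characters stay consistent from one "second" to the next.
        frames.append(edit_image(frames[-1], "Generate the frame one second after this."))
    return frames

# Usage sketch: save the frames, then stitch them at one frame per second with
#   ffmpeg -framerate 1 -i frame_%03d.png story.mp4
# frames = roll_forward(Image.open("opening_shot.png"), num_frames=10)
# for i, frame in enumerate(frames):
#     frame.save(f"frame_{i:03d}.png")
```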
0:38:59 So you're probably in the, I don't know how many zeros, 0.001 percent of the most experienced people in the world using image models.
0:39:03 What are your personal favorite use cases?
0:39:06 How do you use it day to day if you’re not just testing the existing model?
0:39:22 Well, so I'm not sure I am in the very top, but I'll tell you what, I mean, it's like we were saying earlier, the personalization aspect is the thing that totally drives it home for me.
0:39:27 I have two young kids and like the best things that I do in the model are the things I do with my kids.
0:39:31 And like we can make, you know, make their stuffed animals come to life and these types of applications.
0:39:33 And it’s just so personal and gratifying to see.
0:39:40 We all saw a lot of people taking old pictures of their family, for example, and like showing them, restoring them.
0:39:47 And like, so I think that that’s, that’s the, the real beauty of the edit models is that you can, you can make it about the one thing that matters most to you.
0:39:49 So that’s what I use it for is, is my kids, basically.
0:39:54 You’re basically making content that you probably would have never made before.
0:39:56 And it’s like for the consumption of one person, right?
0:39:58 Or, or, or one family.
0:40:00 And so you’re kind of telling these stories that you would have never told before.
0:40:05 So kind of similar, like I do a lot of family holiday cards and birthday cards and whatnot.
0:40:13 Now, anytime I make a slide deck, I like force myself to generate some images that are like contextually relevant and then try to get the text right.
0:40:14 And all of those things.
0:40:18 And then we try to push the boundaries around like, can you make a chart in the pixel space?
0:40:18 Do you want to?
0:40:20 That’s another question, right?
0:40:24 Because you also want the, you want the bars in the bar chart to be accurately positioned relative to one another.
0:40:27 So I think we do a lot of these things.
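For the chart example, one pragmatic pattern (an illustrative assumption, not something the speakers describe) is to render the data deterministically first, so bar heights are guaranteed to match the numbers, and only then hand the image to the model for restyling.

```python
import matplotlib.pyplot as plt

# Render the chart deterministically so the bars are exactly proportional
# to the data, rather than asking the model to draw them from scratch.
labels = ["Q1", "Q2", "Q3", "Q4"]
revenue = [120, 150, 90, 180]

fig, ax = plt.subplots(figsize=(5, 3))
ax.bar(labels, revenue)
ax.set_ylabel("Revenue ($k)")
ax.set_title("Quarterly revenue")
fig.savefig("chart.png", dpi=150)

# chart.png can then go to the image model as a reference with a prompt like
# "restyle this chart to match my deck", keeping the geometry grounded in the
# real numbers while the model handles the look.
```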
0:40:31 I’m actually really impressed with the people we work with on the team who are just like very creative.
0:40:36 We have a team who just works really closely with us on models that we’re developing.
0:40:38 And then they just like push the boundary.
0:40:40 They’ll do like crazy things with the models.
0:40:42 What’s the most surprising thing you’ve seen?
0:42:44 Where you're like, I didn't know our model could do this.
0:40:44 Yeah.
0:40:51 This is even just kind of like simple things where people have been doing like texture transfer.
0:40:55 Like they will take a portrait of a person and then you’re like, what would it look like?
0:40:58 But if it had the texture of this piece of wood.
0:41:03 And I’m like, I would have never, I would have never thought of this being a use case because my brain just doesn’t work that way.
0:41:08 But people like kind of just push the boundaries of what you’re, what you can do with these things.
0:41:16 That is an interesting example of the world knowledge because texture technically is 3D because there’s like a whole 3D aspect of it.
0:41:19 There’s a light and shadow of it, but this is a 2D transfer.
0:41:20 Yeah.
0:41:20 So that’s very cool.
0:41:28 I think for me, the thing I’m most excited by and maybe most impressed by is are the use cases, the test, the reasoning abilities of the models.
0:41:43 So some people on our team figured out you could like give geometry problems to the model and like ask it to kind of, you know, solve for X here or fill in this missing thing or like present this from a slightly different view.
0:41:53 And like these types of, of things that really require world knowledge and the reasoning ability of like a state of the art language model are the things that are making me really go, wow, that’s amazing.
0:41:55 I didn’t think we would be able to do that.
0:41:58 Can it generate compiled code on a blackboard yet?
0:42:06 And like if I take a picture of my, I don’t know, like code on the laptop, would it know if it compiles on the image model?
0:42:14 I’ve seen examples where people give it like an image of HTML code and have the model render the webpage and you can do that.
0:42:15 That’s very cool.
0:42:19 The coolest example I saw, so I came from academia, so I spent a lot of time writing papers and making figures.
0:42:28 And one of our colleagues took a picture of one of the result figures from one of their papers with a method that could do a bunch of different things.
0:42:35 This one, you know, a bunch of different type of applications in the paper and asked the model to, and like sort of erase the, the results.
0:42:41 So you have like the inputs and ask the model to like solve all of these in picture form in a figure of a paper.
0:42:42 And it was able to do that.
0:42:52 So it could actually like figure out what is the problem that this one figure is asking for, find the answer and put it in the image and then do that for a bunch of different applications at the same time, which was really amazing.
0:42:54 That’s very cool.
0:42:58 Has anyone built an application on top of that capability yet?
0:43:01 Like what’s the application that will come out of that?
0:43:11 I think that there are a lot of very interesting, I would say, zero shot transfer capability, like problem solving type things that we don’t even know the boundary of yet.
0:43:13 And some of these are probably quite useful.
0:43:27 Like, you know, if you want to have a method that does solve some problem X, I don't know, like finds the normals of the scene or something, like the surface orientations or something, you probably can prompt the model to give you kind of a reasonable estimate.
0:43:36 So I think there’s lots of problems, like sort of understanding problems and other types of things that we could maybe solve with zero or a few shot prompting that we don’t know yet.
0:43:41 There’s one thing you mentioned I found super interesting, which is the world knowledge transfer.
0:43:47 But in a lot of world models, like, or video models, there always is something that keeps the state.
0:43:54 Like, just because you look at a way doesn’t mean that the chair should disappear or change color because that’s not what the state of the world is.
0:43:55 How do you see that?
0:43:58 Do you think there’s relevance there in image model?
0:44:00 Is that something you even consider optimizing for?
0:44:16 Yeah, I mean, if you think about an image model that has a long context where you can put other things in that context, like text, images, audio, video, then I think it’s definitely like you’re reasoning over the context of things you have to produce a final output image or video.
0:44:24 So, yeah, I think there’s definitely some model capability to do this type of stuff already.
0:44:27 Got it. I haven’t tested it out yet.
0:44:30 For this big use case, I’ll let you know.
0:44:38 That’s one of my favorite things about these models is just finding, and I’m sure it’s really fun for you guys, and you guys probably have much more of a hint than we do about what they can do.
0:44:49 But sometimes you’ll just see some crazy X or Reddit or whatever post about some incredible thing that someone has figured out how to do that you would never expect that the model might be able to do necessarily.
0:45:02 And then other people kind of build on that and say, oh, and then I tried the next iteration of this thing, and suddenly you have this, like, most entirely new space that’s been discovered in terms of what the models are capable of.
0:45:09 It must be fun as people much more deeply involved in kind of building these models and building the interfaces to kind of watch that happen.
0:45:17 So if you talk to visual artists today, you know, I personally love this stuff I post about it on the internet.
0:45:19 You get some very skeptical answers.
0:45:20 People are like, oh, this is terrible.
0:45:25 Like, do you have any idea what triggers this reaction?
0:45:30 I’m convinced that this ultimately really empowers the artists, right?
0:45:31 It gives you new tools, right?
0:45:34 It’s like, hey, we now have, I don’t know, watercolors for Michelangelo.
0:45:35 Let’s see what he does with it, right?
0:45:36 And amazing things come out.
0:45:37 It’s a similar thing.
0:45:41 But what triggers this strong reaction against it?
0:45:46 So I think it’s something to do with the amount of control over the output.
0:45:52 So, you know, in the beginning when we had these kinds of text-to-image models, it would be very much like a one-shot.
0:45:55 You put in some text, you get an output, and people would be like, oh, this is art.
0:45:56 This is this thing I made.
0:46:01 And I think that maybe rubs people a little bit the wrong way who come from the creative community
0:46:09 because, you know, most of the decisions that were made were made by the model, by the data that was used.
0:46:11 You can’t express yourself anymore physically, right?
0:46:12 Yeah, exactly.
0:46:14 So as a creative person, you want to be able to express yourself.
0:46:22 So I think as we make the models more controllable, then a lot of these concerns of like, oh, that’s just the computers doing everything kind of may go away.
0:46:28 And the other thing is I think that there was a period of time where we were all so amazed by the images that these models could create
0:46:34 that like we were pretty like happy to see just like, oh, this stuff comes out of these models.
0:46:37 But I think humans get really bored fast of this type of thing.
0:46:38 So like there was a big rush.
0:46:44 And now if you see an image that you know is just, like, a single prompt and the person didn't think about it much,
0:46:46 you can kind of tell that's an AI-generated image, not that interesting.
0:46:53 So I think like there’s still this boundary of like now you need to be able to make interesting things with the AI tools, which is hard.
0:46:59 But this will, yeah, this will always be, you know, a requirement.
0:47:00 We need someone to be able to do this.
0:47:01 We still need artists.
0:47:02 We still need artists.
0:47:08 And I think artists will be able to also recognize when people have actually like put a lot of control and intent into it.
0:47:08 I would still not be an artist.
0:47:17 But it is, there’s a lot of craft and there’s a lot of taste, right, that you accumulate sometimes over decades, right?
0:47:20 And I don’t think these models really have taste, right?
0:47:24 And so I think a lot of like a lot of the reactions that you mentioned maybe also come from that.
0:47:28 And so we do work with a lot of artists across all the modalities that we work with.
0:47:38 So image, video, music, because we really care about like building the technology step-by-step with them and trying to figure out, they really help us kind of like push the boundary of what’s possible.
0:47:46 A lot of people are really excited, but they really do bring a lot of their knowledge and expertise and kind of like 30 years of design knowledge.
0:47:54 We just worked with Ross Lovegrove on fine-tuning a model on his sketches so that he can then create something new out of that.
0:47:57 And then we designed an actual physical chair that we have a prototype of.
0:48:09 And so there’s a lot of people who want to kind of bring the expertise that they’ve built and kind of like the rich language that they use to describe their work and have that dialogue with the model so that they can push their work kind of to the frontier.
0:48:13 And it is, you know, it doesn’t happen in like one prompt and two minutes.
0:48:21 It does require a lot of that kind of taste and human creation and craft that goes into building something that actually then, you know, becomes art.
0:48:28 At the end, it’s still a tool that requires the human behind it to express the feelings and the emotions and the story.
0:48:29 Yeah, yeah, absolutely.
0:48:32 And that’s what resonates with you when you probably look at it, right?
0:48:39 You will have a different reaction when you know that there’s a human behind it who has spent 30 years thinking about something and then poured that into a piece of art.
0:48:51 Yeah, I think there’s also a bit of this phenomenon that like most people who consume creative content and maybe even ones that care a lot about it, like they don’t know what they’re going to like next.
0:48:55 You need someone who has a vision and can do something that’s interesting and different.
0:48:55 That’s right.
0:48:57 And then you show it to people and like, oh, wow, that’s amazing.
0:49:01 But like they wouldn’t necessarily like think of that on their own.
0:49:01 Right.
0:49:08 So when we’re, you know, when we’re optimizing these models, like one thing we could do is we could optimize for like the average preference of everybody.
0:49:12 But I don’t think you end up with interesting things by doing that.
0:49:16 You end up with something that everyone kind of likes, but you don’t end up with things that people are like, oh, wow, that’s amazing.
0:49:20 Like I’m going to change my whole like perspective of art because I saw that.
0:49:22 There’s the avant-garde edition of the model.
0:49:22 Yeah.
0:49:26 If I may use that term, I don't know.
0:49:27 What’s the other end of the spectrum?
0:49:31 The marketing edition or so, where it’s very predictable and straightforward.
0:49:32 Yeah.
0:49:36 Well, since we’re coming up on time, last couple of questions.
0:49:42 One is, what’s one feature that you know the model is capable of that you wish people ask you more?
0:49:44 Interleave?
0:49:45 Yeah, interleave.
0:49:45 Interleave.
0:49:49 I think we’ve always been amazed that nobody ever posts anything about, so interleave generations
0:49:53 what we call the model’s ability to generate more than one image for a specific prompt.
0:49:58 So you can ask for like, I want a story, like a bedtime story or something, like generate
0:50:00 the same character over these series of images.
0:50:06 And I think that, yeah, people haven’t really found it useful yet or haven’t discovered it.
0:50:06 I don’t know.
0:50:07 Oh, interesting.
0:50:10 Well, if you’re listening to the podcast, go try this out.
0:50:10 Try it.
0:50:11 Yeah.
0:50:11 Yeah.
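(Interleaved generation, as described here, corresponds to asking for both text and image modalities in a single API call and getting back an ordered sequence of parts. Below is a minimal sketch using the google-genai Python SDK; the model id and the exact response handling are assumptions based on Google's public documentation rather than anything specified in this episode.)

```python
# Minimal sketch of an interleaved text + multi-image request (assumed model id).
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

prompt = (
    "Tell a four-scene bedtime story about a fox astronaut. For each scene, "
    "write one or two sentences and generate an illustration, keeping the same "
    "fox character consistent across all four images."
)

response = client.models.generate_content(
    model="gemini-2.5-flash-image",  # assumed model id for Nano Banana
    contents=prompt,
    config=types.GenerateContentConfig(
        response_modalities=["TEXT", "IMAGE"],  # request interleaved output
    ),
)

# The parts come back in order: story text interleaved with inline images.
scene = 0
for part in response.candidates[0].content.parts:
    if part.text:
        print(part.text)
    elif part.inline_data:
        with open(f"scene_{scene}.png", "wb") as out:
            out.write(part.inline_data.data)
        scene += 1
```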
0:50:18 And what’s the most exciting technical challenge that you look forward to tackle in the next,
0:50:19 I don’t know, months, years?
0:50:27 So I think that there’s really a high ceiling in terms of quality for where we’re going.
0:50:30 Like, I think, you know, people look at these images and say, oh, it’s almost perfect.
0:50:30 We must be done.
0:50:35 And for a while, we were in this like cherry pick phase where we would, you know, everyone
0:50:36 would pick their best images.
0:50:37 So you look at those and they’re great.
0:50:39 But actually, what’s more important now is the worst image.
0:50:42 We’re in a lemon picking stage because every model can cherry pick images that look perfect.
0:50:46 So like, now I think the real question is like, how expressible is this model?
0:50:49 And what’s the worst image you would get given what you’re trying to do?
0:50:53 So I think by raising the quality of the worst image, we really open up the amount of use
0:50:55 cases for things we can do.
0:51:00 Like, there’s all kinds of productivity use cases, like, you know, beyond this kind of like
0:51:02 immediate creative tasks that we know the model can do.
0:51:03 And I think that’s a direction we’re headed.
0:51:07 We’re headed to where if these models can do more things reasonably, then they’re just
0:51:09 the use cases will be far greater.
0:51:13 So that’s the moral equivalent of the monkeys on typewriters, basically.
0:51:16 Any model given enough tries will eventually write an amazing adventure.
0:51:18 But the other way around, it’s hard.
0:51:19 Yeah, the other way around is hard.
0:51:21 One monkey writing a book would be very hard.
0:51:22 You’d be a good monkey for that one.
0:51:28 What are the applications you think that would come out when we raise the lower bound?
0:51:34 So the one I’m most interested in, and we mentioned this before, is education factuality.
0:51:40 I have, you know, I don't know how many times a month I want to use these models for
0:51:44 creative purposes, but like, I have way more use cases for information seeking,
0:51:48 factuality, kind of like learning, education-type use cases.
0:51:53 So I think like, once that starts working, then it’ll be opening up all these new areas.
0:51:58 There’s also something about, I think, taking more advantage of the models context window.
0:52:03 So you can input a really large amount of content, right, into these LLMs.
0:52:10 And some companies, you mentioned a few before, they will have like 150 page brand guidelines
0:52:12 on like what you can and cannot do, right?
0:52:14 And they’re like very precise, right?
0:52:16 Like colors, fonts, right?
0:52:19 And like the size of like a Lego brick, maybe.
0:52:24 And so being able to actually like take that in and follow that to a T when you’re doing
0:52:29 generation, that’s like a whole new level of control that we just can’t, we don’t have
0:52:30 today, right?
0:52:33 To make sure that you’re actually kind of like following that to a T.
0:52:36 I think that will build a lot of trust with, you know, very established brands.
0:52:41 So you'd have a second creative compliance review model that then checks what the first one is producing,
0:52:42 and so on.
0:52:44 The model should do it on its own, right?
0:52:46 Like it should kind of have this loop.
0:52:46 Yes.
0:52:48 It should have this loop.
0:52:51 It’s like, okay, I generated this, but then page 52 says that I shouldn’t have, right?
0:52:53 I’m going to go back and try again.
0:52:55 And then two hours later, I'll come back to you with a result.
0:52:55 Yeah.
0:53:00 So we saw with the text models how much this inference-time scaling can help, right?
0:53:01 Being able to critique your own work.
0:53:02 Yep.
0:53:04 So this, this feels really important.
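(The generate-critique-retry loop sketched in this exchange could be approximated today outside the model with two calls per iteration: one to produce a candidate image and one to check it against the brand guidelines. The sketch below is purely illustrative; the function names, model ids, and compliance prompt are assumptions, not a confirmed product feature or DeepMind's approach.)

```python
# Hypothetical generate -> critique -> retry loop (assumed google-genai SDK usage).
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

def generate_image(brief: str) -> bytes:
    """Ask the image model for one candidate and return its raw bytes."""
    resp = client.models.generate_content(
        model="gemini-2.5-flash-image",  # assumed model id
        contents=brief,
        config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
    )
    for part in resp.candidates[0].content.parts:
        if part.inline_data:
            return part.inline_data.data
    raise RuntimeError("no image returned")

def check_compliance(image: bytes, guidelines: str) -> str:
    """Ask a text model to critique the candidate against the brand guidelines."""
    resp = client.models.generate_content(
        model="gemini-2.5-flash",  # assumed model id for the reviewer
        contents=[
            types.Part.from_bytes(data=image, mime_type="image/png"),
            "Check this image against the following brand guidelines.\n"
            f"{guidelines}\nReply with the single word PASS, or list every violation.",
        ],
    )
    return resp.text or ""

def generate_compliant(brief: str, guidelines: str, max_tries: int = 3) -> bytes:
    """Loop until the reviewer passes the candidate or we run out of tries."""
    feedback = ""
    for _ in range(max_tries):
        prompt = brief if not feedback else f"{brief}\nFix these issues: {feedback}"
        image = generate_image(prompt)
        feedback = check_compliance(image, guidelines)
        if feedback.strip().upper().startswith("PASS"):
            break
    return image  # best effort after max_tries
```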
0:53:08 Boy, incredibly, amazingly exciting future.
0:53:10 Yeah.
0:53:12 And congrats on all the amazing work.
0:53:12 Thank you.
0:53:12 Thank you.
0:53:15 Thank you so much for coming on the pod.
0:53:21 Thanks for listening to this episode of the A16Z podcast.
0:53:24 If you liked this episode, be sure to like, comment, subscribe.
0:53:28 Leave us a rating or review and share it with your friends and family.
0:53:32 For more episodes, go to YouTube, Apple Podcasts, and Spotify.
0:53:39 Follow us on X at A16Z and subscribe to our Substack at a16z.substack.com.
0:53:40 Thanks again for listening.
0:53:41 And I’ll see you in the next episode.
0:53:46 As a reminder, the content here is for informational purposes only.
0:53:49 Should not be taken as legal, business, tax, or investment advice,
0:53:51 or be used to evaluate any investment or security,
0:53:55 and is not directed at any investors or potential investors in any
0:53:56 A16Z fund.
0:54:00 Please note that A16Z and its affiliates may also maintain investments
0:54:01 in the companies discussed in this podcast.
0:54:04 For more details, including a link to our investments,
0:54:08 please see a16z.com forward slash disclosures.

Google DeepMind’s new image model Nano Banana took the internet by storm.

In this episode, we sit down with Principal Scientist Oliver Wang and Group Product Manager Nicole Brichtova to discuss how Nano Banana was created, why it’s so viral, and the future of image and video editing.

 

Resources: 

Follow Oliver on X: https://x.com/oliver_wang2

Follow Nicole on X: https://x.com/nbrichtova

Follow Guido on X: https://x.com/appenz

Follow Yoko on X: https://x.com/stuffyokodraws

 

Stay Updated: 

If you enjoyed this episode, be sure to like, subscribe, and share with your friends!

Follow a16z on X: https://x.com/a16z

Subscribe to a16z on Substack: https://a16z.substack.com/

Follow a16z on LinkedIn: https://www.linkedin.com/company/a16z

Listen to the a16z Podcast on Spotify: https://open.spotify.com/show/5bC65RDvs3oxnLyqqvkUYX

Listen to the a16z Podcast on Apple Podcasts: https://podcasts.apple.com/us/podcast/a16z-podcast/id842818711

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.

Follow our host: https://twitter.com/eriktorenberg

 


