AI transcript
0:00:03 Hey, welcome back to the Next Wave Podcast.
0:00:04 I’m Matt Wolf.
0:00:05 I’m here with Nathan Lanz.
0:00:08 And there has been an absolute ton of AI news
0:00:10 that came out recently, especially in the world
0:00:11 of large language models.
0:00:12 We got Grok 3.
0:00:14 We got GPT-4.5.
0:00:16 We got Claude 3.7.
0:00:21 Just so many new big foundation models have been released.
0:00:23 And so for this episode, we wanted
0:00:26 to deep dive into what each one is good at
0:00:28 and what each one is not good at.
0:00:29 So for that, we brought on our good friend
0:00:33 Matthew Berman, who is probably the best person we know
0:00:35 to really compare them all.
0:00:37 Because he deep dives and tests every single one
0:00:40 of these models way deeper than we test them.
0:00:44 So let’s go ahead and just dive right in with Matthew Berman.
0:00:46 Thanks so much for joining us, Matthew.
0:00:48 It’s great to have you back on the show.
0:00:48 Thanks for having me.
0:00:51 I’ve been telling people for a long time–
0:00:53 I think I even mentioned this to you last time we had you
0:00:53 on the show–
0:00:55 that when it comes to large language models
0:00:59 and trying to compare them and talk about which model is best
0:01:01 at this and which model is best at that.
0:01:03 I don’t even do that on my YouTube channel anymore.
0:01:05 I just point people to your channel.
0:01:07 I’m like, yeah, Matthew’s going to test this.
0:01:09 And he’s going to tell you which models do what better
0:01:10 than others.
0:01:12 So you’re like my go-to now when it comes
0:01:15 to comparing large language models.
0:01:17 I appreciate that.
0:01:17 That’s awesome.
0:01:18 Thanks for having me again.
0:01:21 I had a great time last time, so I’m excited to chat again.
0:01:22 Yeah, likewise.
0:01:23 So this will be fun.
0:01:26 I’m trying to figure out where the best place to dig in
0:01:28 is because so much has come out.
0:01:29 Here’s kind of the timeline of events
0:01:31 that I feel are the important events, right?
0:01:32 We got Grok 3.
0:01:36 And then a few days later, we got Claude Sonnet 3.7.
0:01:40 And then a few days after that, we got GPT-4.5, which
0:01:43 actually came out on the day that we’re recording
0:01:45 this episode.
0:01:49 So we’ve all had that news in our heads for three hours now.
0:01:51 And that’s about it.
0:01:53 But maybe we start with Grok.
0:01:56 Matt, what have your thoughts been on Grok 3 so far?
0:01:57 Like how much have you played with it?
0:02:00 And what have you found it’s like really good at so far?
0:02:03 Yeah, so if I could show you my bookmarks bar in Chrome,
0:02:07 you would see that Grok now has a prominent placement
0:02:10 right next to ChatGPT and right next to Perplexity.
0:02:12 So the answer is I use it a lot.
0:02:16 And it has really become my go-to large language model.
0:02:16 Oh, really?
0:02:17 Yeah.
0:02:17 Same here.
0:02:20 Look, I have been a pretty die-hard ChatGPT user.
0:02:21 There’s really two reasons.
0:02:23 Number one is speed, right?
0:02:26 I think speed of these models, speed of the response
0:02:29 is really underappreciated by a lot of people.
0:02:32 But it’s the same reason why you convert at a higher rate
0:02:34 when a web page loads faster.
0:02:37 It just builds trust and you just get the answer more quickly.
0:02:39 And then it’s also the real time information.
0:02:43 Having access to all of the news on X in real time
0:02:45 is such a killer feature.
0:02:47 And of course, it’s a good model.
0:02:49 So I think those two factors, plus it just
0:02:53 being a fantastic model, that has made it my go-to model.
0:02:54 But that might change now that 4.5 is out,
0:02:55 but we’ll get there.
0:02:56 Yeah, yeah, yeah.
0:02:57 One thing I would say, too, is they
0:02:59 say that they’re really improving the model really fast.
0:03:02 And I believe it, because their team actually reached out to me.
0:03:03 And they’re like, what’s your feedback?
0:03:05 And I’ve been going back and forth with them,
0:03:07 giving them my thoughts on what they should be doing.
0:03:08 Oh, that’s cool.
0:03:08 Yeah.
0:03:10 And sorry if Aravind or anyone from Perplexity
0:03:12 hears this, but I’m like, this has replaced Perplexity for me.
0:03:14 And I think you guys should double down on that.
0:03:15 That’s what I was telling them.
0:03:17 I think you guys should be replacing Google,
0:03:19 replacing Wikipedia, replacing Perplexity.
0:03:21 Any question I have, I should be going to Grok.
0:03:23 They’re combining search and X data.
0:03:25 Like, no one has that, and no one ever will have that.
0:03:26 Yeah.
0:03:29 Nathan, does that make you a high-taste tester?
0:03:30 Oh, yeah.
0:03:32 Because like, I’m certainly not.
0:03:33 I get no previews at all.
0:03:34 Yeah.
0:03:35 I don’t either.
0:10:37 I don’t seem to get any early access
0:03:41 to any of the Anthropic models, the OpenAI models,
0:03:42 the xAI models, none of them.
0:03:47 So yeah, I think Nathan probably has the best ins right now.
0:03:49 I was a low-level member of a lot of the different Silicon
0:03:51 Valley mafias.
0:03:52 I was not a high-level member, but I
0:03:55 was the low level of several of them.
0:03:59 The gaming one, the social media one, the crypto one,
0:04:02 the Taiwanese mafia.
0:04:04 I was kind of a member of all of those.
0:04:06 And some of the YC people, as well, I know a lot of them.
0:04:07 A lot of them, too.
0:04:10 You know, I personally think that Grok’s biggest hurdle
0:04:11 is just Elon.
0:04:14 I’m sure you’ve seen this, Matthew, on your YouTube channel,
0:04:16 because you’ve made a couple videos about Grok now.
0:04:18 When Grok came out, I made a news video about it.
0:04:21 And overwhelmingly, all of the comments about Grok are like,
0:04:24 I’m never going to touch anything Elon makes.
0:04:27 And so I almost think that, like, so many people
0:04:29 are throwing the baby out with the bathwater
0:04:32 when it comes to actually using and trying Grok
0:04:35 just because Elon’s name is attached to it.
0:04:39 Yeah, so whenever I post anything about Grok on X,
0:04:43 if it’s negative about Grok or Elon, I get flamed.
0:04:47 If it’s positive, everybody cheers it, shares it,
0:04:48 everything.
0:04:50 The opposite is true on YouTube.
0:04:53 I made a video about Grok 3 being really good,
0:04:54 because it is.
0:04:57 So many people commented that, just like you said,
0:04:58 they won’t touch it.
0:05:03 “Elon’s AI model,” “bias,” “conservative,” “right wing.”
0:05:04 I’m apolitical.
0:05:06 I’m trying to stay out of it as much as I can in the videos.
0:05:09 But anytime I mention anything having
0:05:16 to do with Elon, X, or Grok, I get the meanest comments on YouTube.
0:05:17 It’s wild.
0:05:20 I think so many people are just going to not experience
0:05:24 probably what is the best model available right now
0:05:27 just because of the Elon factor.
0:09:29 Now, when it comes to code, I don’t necessarily
0:05:31 think Grok is the best.
0:05:31 I don’t know.
0:05:34 I actually honestly haven’t tested Grok with code
0:05:36 because I mostly do code with Cursor.
0:05:39 And I don’t believe Grok 3 has pushed its API out yet.
0:05:40 I don’t think.
0:05:41 I don’t think so.
0:05:43 So I’ve been mostly playing with 3.7.
0:05:44 I’ve tested it.
0:05:46 I’ve tested it a little bit.
0:05:46 It’s weird.
0:05:48 It’s not very consistent.
0:05:49 There’s sometimes where it’s amazing at code.
0:05:52 You’re like, oh, wow, that’s like a really creative solution.
0:05:53 And it’s doing something better than Claude
0:05:55 in some very limited circumstances.
0:05:57 But then other times, it just won’t follow my instructions
0:05:58 as well for code.
0:06:01 In terms of reliability, when Claude 3.7 came out,
0:06:03 that’s by far the best now for coding.
0:06:05 But it feels like there is something there.
0:06:07 I wouldn’t discount them forever.
0:06:08 They could improve that.
0:06:10 And all of a sudden, GROC 3 could be the best at coding.
0:06:11 Right.
0:06:12 Matthew, I’m curious.
0:06:14 So when it comes to Grok 3, I know
0:06:17 you have your own internal benchmarks
0:06:20 that you’ve been using on some of these tools.
0:06:22 What have you found Grok 3 is really good at?
0:06:25 And are there any things Grok 3 is just not
0:06:26 going to be your go-to for?
0:06:28 Well, let me say something first.
0:06:30 I had to throw my benchmarks out.
0:06:32 Because they were completely saturated.
0:06:35 They were absolutely annihilated by every single model
0:06:36 that comes out nowadays.
0:06:37 So I threw them out.
0:06:42 I’m currently in the process of creating a new set of benchmarks
0:06:43 and questions.
0:06:46 Grok, again, the thing I go to it for
0:06:49 is real-time information as quickly as I need it.
0:06:51 And that it’s awesome at.
0:06:53 I’ve tested it on some other things,
0:06:56 like quick coding challenges, some math challenges.
0:06:57 And it does really well.
0:07:00 But those aren’t the everyday use cases for me.
0:07:02 So there’s only so much I can test with it.
0:07:05 I’ll just prompt it with one of my benchmark questions,
0:07:05 see if it’s right.
0:07:07 It’s like, OK, yeah, it’s right or no, it’s wrong.
0:07:09 But overall, what I care about and I
0:07:12 think what most people care about is day-to-day usage.
0:07:15 Is it going to solve my problems?
0:07:16 And it does.
0:07:17 Grok 3 is great at that.
0:07:21 And typically, I was using GPT-4o.
0:07:24 And now Grok 3 with thinking kind of replaces that,
0:07:26 although I don’t really need the thinking.
0:07:28 Yeah, it’s just really good at the day-to-day stuff
0:07:29 that I would use it for.
0:07:33 So I’m kind of split between Grok 3, Perplexity,
0:07:34 and ChatGPT.
0:07:35 And now 4.5 came out.
0:07:37 I would say I tested it right before we got on here.
0:07:39 And I would say, OK, 4.5 has a better
0:07:41 vibe for general chat.
0:07:42 But it’s slower.
0:07:43 Oh, it’s about the vibes.
0:07:44 But when you get the response back,
0:07:46 the response itself is actually better.
0:07:47 So that’s impressive.
0:07:50 Its writing is better, but it’s slower.
0:07:51 So there’s a huge trade-off there.
0:07:53 It feels like 4.5 is going to serve
0:07:57 something maybe for creative writing or creative work.
0:07:59 It’s probably the best for now.
0:08:01 But it is interesting that the cost is so much higher.
0:08:02 Really?
0:08:04 Yeah, I think it was $75 per million input tokens,
0:08:06 half of that for cached input.
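As a rough sketch of what those rates imply for the input side of a request (a hypothetical helper using the figures quoted here, not official OpenAI pricing; check the current pricing page before relying on it):

```python
# Input-side cost at the rates quoted above:
# $75 per million fresh input tokens, half that for cached input.
INPUT_USD_PER_M = 75.0
CACHED_USD_PER_M = INPUT_USD_PER_M / 2  # cached input is half price

def request_cost(fresh_tokens: int, cached_tokens: int) -> float:
    """Return the input-side cost in USD for a single request."""
    return (fresh_tokens * INPUT_USD_PER_M
            + cached_tokens * CACHED_USD_PER_M) / 1_000_000

# e.g. roughly 50k tokens of pasted notes, none cached:
print(round(request_cost(50_000, 0), 2))  # 3.75
```

So dumping something like 100 pages of notes into a single prompt can run a few dollars per request at these rates, which is why the cost jump over older models matters.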
0:08:08 It’s funny, Nathan, that you bring up writing.
0:08:12 That is honestly something that I want to use AI for much more
0:08:15 often than I do, but it is so bad.
0:08:16 It’s so bad at writing.
0:08:20 And I haven’t had enough time to test 4.5 for writing,
0:08:21 but I really hope that it’s good.
0:08:22 It’s better.
0:08:25 Because then that’s going to be able to help me with a lot
0:08:26 of things that I do day to day.
0:08:28 I don’t script my videos, but sometimes I
0:08:31 want help writing the bullet points for them.
0:08:34 Or sometimes I want a tweet thread,
0:08:35 the initial drafts created for me.
0:08:38 I gave it probably, I don’t know,
0:08:40 over 100 pages of different notes about my game,
0:08:43 including the story and things like that and the game mechanics.
0:08:46 And it took a long time to respond, actually, like a very long time.
0:08:47 I was kind of surprised.
0:08:49 This feels like o1 pro when you feed it a lot of stuff.
0:08:50 4.5?
0:08:50 4.5.
0:08:52 Took a long time to respond.
0:08:52 I was surprised.
0:08:53 Like very long.
0:08:54 It was very slow.
0:08:56 Like three minutes to respond or something to all that.
0:08:59 But then its notes on the story were perfect.
0:09:02 It gave me amazing critiques of like, I love this part.
0:09:06 And it even had little emojis as color codings.
0:09:10 It had green, orange, blue, and red and different ones.
0:09:11 It was like, like, green’s good.
0:09:12 Love this part.
0:09:14 This part probably could be tweaked.
0:09:15 This part don’t like it.
0:09:16 Here’s why.
0:09:17 And these parts are interesting.
0:09:19 Maybe you keep them, maybe you don’t.
0:09:20 The feedback was good.
0:09:21 Like it was solid.
0:09:22 So I do want to test it more for writing.
0:09:23 But my first impression is,
0:09:26 yeah, it is probably the best model for writing now.
0:09:27 That’s great.
0:09:27 Yeah.
0:09:29 Well, going back to Grok for a second,
0:09:31 have you guys played with the voice mode yet?
0:09:34 I’ve used advanced voice mode a little bit.
0:09:37 I don’t find it to be like super useful throughout my day.
0:09:39 If I’m driving, if I’m walking,
0:09:41 if there’s just not a screen in front of me,
0:09:42 maybe I’ll use it.
0:09:45 But what I found specifically with ChatGPT advanced voice mode
0:09:49 is you’ll ask a question and there’s that delay.
0:09:51 And then sometimes it repeats it and then stops
0:09:53 and then repeats it again.
0:09:56 It’s just such a high friction experience.
0:09:58 It’s not great yet.
0:10:00 I do use voice on perplexity.
0:10:01 I’m a big baseball fan.
0:10:03 Spring Training just started with baseball.
0:10:04 They’ve introduced some like new rules
0:10:06 that I didn’t realize existed.
0:10:08 I was watching one of the Spring Training games.
0:10:09 I opened up perplexity and I’m like,
0:10:11 “Hey, is there a new rule that I didn’t know about?”
0:10:13 And I just will talk to perplexity and ask it questions
0:10:15 and it will like do all the research,
0:10:17 figure out the new rules that are going on in baseball
0:10:19 and then give them back to me.
0:10:21 And I found that pretty helpful.
0:10:24 I still find myself typing my questions more often
0:10:25 than speaking them.
0:10:27 But every once in a while, I’ll be feeling lazy.
0:10:29 I’ll hit the voice button and just ask my question.
0:10:30 I find that helpful.
0:10:34 I know, Nathan, you kind of almost use some of the voice modes,
0:10:36 especially advanced voice in ChatGPT,
0:10:38 for like journaling, right?
0:10:40 I’m a big believer in it long-term,
0:10:41 but I have been kind of disappointed
0:10:42 like in advanced voice mode.
0:10:44 Like the demo they showed seemed amazing
0:10:45 and it seemed like they removed
0:10:46 so many different parts of it.
0:10:47 And some of the things they demoed
0:10:48 when you tried them in real life,
0:10:51 they don’t work as well as in the demos.
0:10:52 My wife’s Japanese.
0:10:53 My Japanese is getting better,
0:10:54 so I don’t have to use it as much,
0:10:56 but for like really hard topics,
0:10:58 we have tried to use it to do translation.
0:11:01 And it just gets confused so easily.
0:11:02 Like as soon as you go from one language to the other,
0:11:04 like sometimes it’ll translate it properly.
0:11:06 So like, “Okay, cool, it worked in that use case.”
0:11:07 But okay, when then she talks back to me,
0:11:09 then sometimes it just totally gets confused.
0:11:10 And instead of translating,
0:11:13 sometimes it’ll like start talking versus translating.
0:11:15 It’ll start saying its own stuff
0:11:16 versus doing a translation.
0:11:18 And it’s like, “Okay, that’s super annoying.”
0:11:19 And then we just turn it off
0:11:21 every single time that’s ever happened.
0:11:23 – Yeah, I know Sam Altman’s even talked about,
0:11:25 he wants the AIs in the future
0:11:28 to be more in line with your own beliefs, right?
0:11:31 So like, it knows your religious beliefs,
0:11:32 your political beliefs.
0:11:34 – So bring your own bias.
0:11:35 – Yeah, your own bias.
0:11:36 It will actually learn your bias
0:11:38 and sort of lean into your bias more
0:11:40 to give you more of what you want.
0:11:42 And I think that’s a really, really scary thing.
0:11:44 There was a little bit of word going around
0:11:45 about this happening on like Facebook, right?
0:11:49 Where Facebook was using like AI bots in the feed
0:11:52 that people didn’t even realize were AI bots.
0:11:54 And so people would post stuff on Facebook.
0:11:56 They would get like a bunch of responses
0:11:57 from these AI bots,
0:11:59 not even realize that they’re AI.
0:12:00 And they’re going, “Oh, cool,
0:12:02 I get great engagement on Facebook.”
0:12:04 And it keeps on bringing them back to Facebook
0:12:06 over and over and over again,
0:12:08 because Facebook’s where they get engagement
0:12:09 and they’re not even realizing
0:12:10 that they’re talking to AIs.
0:12:12 I think that’s going to be a bigger
0:12:13 and bigger problem as well,
0:12:17 where social media is all about getting dopamine hits, right?
0:12:20 We post our tweets because we want to get those likes.
0:12:21 We want to get those retweets.
0:12:22 We want to get those comments.
0:12:23 Every time we see one of those,
0:12:25 we get a little dopamine hit
0:12:27 and we keep on coming back for more.
0:12:30 Well, if AI gets really, really, really good
0:12:33 at giving us those dopamine hits every time we want them,
0:12:36 we’re going to go to wherever we get the most dopamine hits
0:12:38 at the highest frequency.
0:12:41 I think that’s what really sort of worries me
0:12:41 about the future.
0:12:44 And it ties into like the whole population collapse as well.
0:12:46 I think it gets to that point
0:12:47 where people communicate with other humans
0:12:49 less and less and less and less,
0:12:51 because they’re getting their dopamine hits
0:12:53 from fake people on social media.
0:12:55 They’re getting their conversational needs
0:12:57 met by unhinged voice chats
0:13:00 that have the same bias as me.
0:13:04 They love Trump and there’s, you know, the MAGA girl.
0:13:05 And I can talk to the MAGA girl
0:13:07 who has the same belief system that I have
0:13:09 or whatever, I’m not saying that’s my belief system.
0:13:11 I’m just, you know, for example.
0:13:13 And so it’s very, very concerning.
0:13:14 Like, I think you and me, Matthew,
0:13:16 are pretty much on the same page
0:13:19 where we generally lean optimistic on this stuff,
0:13:21 but there’s still quite a few things
0:13:23 that actually do scare me about this as well.
0:13:26 I’m not like the whole accelerationist
0:13:28 where I’m just like push forward as fast as possible.
0:13:30 I’m like, maybe there’s some things
0:13:32 we shouldn’t push forward as fast as possible on.
0:13:34 And that’s definitely one of those areas.
0:13:37 (upbeat music)
0:13:38 – Hey, we’ll be right back to the show.
0:13:40 But first I want to tell you about another podcast
0:13:41 I know you’re gonna love.
0:13:43 It’s called Entrepreneurs on Fire
0:13:45 and it’s hosted by John Lee Dumas
0:13:47 available now on the HubSpot Podcast Network.
0:13:49 Entrepreneurs on Fire stokes inspiration
0:13:53 and shares strategies to fire up your entrepreneurial journey
0:13:54 and create the life you’ve always dreamed of.
0:13:57 The show is jam packed with unlimited energy,
0:13:58 value and consistency.
0:14:00 And really, you know, if you like fast paced
0:14:02 and packed with value stories
0:14:03 and you love entrepreneurship,
0:14:05 this is the show for you.
0:14:07 And recently they had a great episode
0:14:10 about how women are taking over remote sales
0:14:11 with Brooke Triplett.
0:14:13 It was a fantastic episode.
0:14:14 I learned a ton.
0:14:15 I highly suggest you check out the show.
0:14:17 So listen to Entrepreneurs on Fire
0:14:19 wherever you get your podcasts.
0:14:22 (upbeat music)
0:14:24 – Yeah, you know, Elon had said with Grok,
0:14:27 which I find kind of promising hopefully is that, you know,
0:14:29 they want to be maximum truth seeking,
0:14:30 which sounds a lot better to me
0:14:32 than what I’m hearing from OpenAI
0:14:33 of like bring your own bias.
0:14:35 Like we’re gonna allow you to pick your bias
0:14:38 and we’re just gonna serve you up information based on that
0:14:40 because all truth is subjective or whatever.
0:14:41 But I worried that, you know,
0:14:45 as obviously Elon has his own bias now as well, you know,
0:14:47 and, you know, I’m probably more right wing
0:14:48 than anyone on this podcast right now.
0:14:50 I’m not super right wing, but, you know,
0:14:52 after living in San Francisco for a long time,
0:14:53 that kind of did that to me.
0:14:55 And I do worry that he’ll go too far with it.
0:14:58 Grok will become like a right wing AI.
0:15:00 I do think it needs to be unbiased.
0:15:02 I’m not sure how you do that
0:15:03 ’cause there’s bias in everything.
0:15:04 Like if you pull up facts online
0:15:06 or you pull up something from the Wall Street Journal
0:15:09 or CNN or whatever, there’s bias in all of this.
0:15:11 And so since all the models are trained on that,
0:15:15 I’m just not sure how you get around the bias, you know?
0:15:17 – By the way, this is the part where I start
0:17:20 to get flamed on Twitter and YouTube,
0:15:22 but for completely opposite reasons,
0:15:23 but I’m still gonna share anyways,
0:15:26 this is not reflective of my political opinion
0:15:29 or anything like that, but here’s a couple of things.
0:15:34 Yeah, Elon has said, “Maximally truth seeking for Grok 3.”
0:15:37 And great, like in theory, that makes a lot of sense.
0:15:39 But how do you do that?
0:15:43 Ultimately, you have to create systems
0:15:46 and maybe you’re able to create them or maybe you’re not,
0:15:48 but those systems are created by humans.
0:15:51 The original training data is created by humans.
0:15:55 The post-training techniques are created by humans.
0:15:58 The reinforcement learning is also set up by humans.
0:16:00 So there’s like a human in the loop.
0:16:04 So what he claims to be completely maximally truth seeking
0:16:06 is just what he believes.
0:16:08 And maybe not necessarily him,
0:16:09 but he’s trying to give a different perspective,
0:16:10 which, you know, fine.
0:16:13 But did you see what happened just a few days ago?
0:16:17 Grok 3 was caught having custom instructions that said,
0:16:20 don’t cite any sources that say Elon Musk
0:16:23 and Donald Trump are spreaders of disinformation.
0:16:25 And they were caught doing this.
0:16:28 And so I reported on it and of course,
0:16:30 flamed from both sides, but fine.
0:16:33 But the point is, yeah, it’s as simple as somebody
0:16:36 submits a PR and it has a little line
0:16:38 in the system prompt saying, don’t do this.
0:16:40 And then all of a sudden, fingers on the scale, right?
0:16:42 So it’s very possible.
0:16:44 It’s not only possible, it’s happening.
0:16:47 And it doesn’t matter if it’s from OpenAI or Grok
0:16:50 or Google or Anthropic, they all have bias.
0:16:53 I can’t imagine a world in which the bias
0:16:55 is removed completely, although I hope it is.
0:16:58 I just, I don’t see a path towards that.
0:17:00 – Yeah, I mean, I honestly don’t understand
0:17:02 how it’s possible.
0:17:04 Like you guys said, it’s all trained on data
0:17:05 that was created by humans.
0:17:09 It’s all basically a scraping of the entire internet,
0:17:11 which just inherently has bias.
0:17:14 Like I just, I don’t understand how it’s possible, honestly.
0:17:16 And how do you determine what is the truth
0:17:17 and what is not?
0:17:18 – That’s the question.
0:17:20 – I don’t know, maybe if the models get really smart
0:17:22 and can do like real reasoning
0:17:24 and that they can look at both sides of anything,
0:17:25 they can kind of find the middle ground
0:17:27 where there’s some truth, where there’s actual truth,
0:17:30 hopefully, but then some people will perceive
0:17:32 that real truth is not truth.
0:17:34 It’s like, well, people will perceive it to have bias
0:17:35 even if it ends up not having bias.
0:17:38 – No, but I feel like you’re getting into like this world
0:17:40 where now we’re letting AI decide what is
0:17:42 and what isn’t like ethical, right?
0:17:45 Like now we’re letting like AI sort of decide
0:17:49 what is like philosophically correct and what is not.
0:17:50 And to me, that seems weird
0:17:53 to let machines do that for humans, you know?
0:17:55 – Yeah, and, you know, it’s funny you say that,
0:17:58 I mentioned the same thing and Dave Shapiro,
0:18:01 another fellow YouTuber talking about AI said,
0:18:04 you know, actually, I would rather just completely
0:18:07 give it over to AI to make ethical decisions.
0:18:10 I think that’s what he was saying, but it’s interesting.
0:18:12 I can see both sides of it.
0:18:15 If there was a system that was completely unbiased,
0:18:18 assuming, right, yeah, okay, let them decide,
0:18:20 but how do you make that, as you said?
0:18:23 – I think you can’t allow AI to make that decision.
0:18:24 If you allow AI to make that decision,
0:18:26 I mean, that goes towards like eugenics
0:18:27 and crazy things like that.
0:18:30 Like you can’t allow AI to optimize based on
0:18:33 human performance or some crazy metric like that.
0:18:35 That would just lead to, you know, horrible things.
0:18:37 I think I totally disagree with that.
0:18:39 – Yeah, yeah.
0:18:42 Let’s shift over to Claude, ’cause Claude 3.7 came out
0:18:46 a couple of weeks ago and that one has proven to be
0:18:48 really, really good at code, it seems, right?
0:18:50 It seems like it didn’t make huge improvements
0:18:54 in almost any other areas, but it got a lot better at code.
0:18:56 From what I’ve seen so far, I mean, Matt,
0:18:58 you might have some different experiences with it.
0:19:00 I’m using cursor to write code
0:19:04 and it’s definitely gotten better at coding for me.
0:19:06 I’ve definitely noticed that it used to run
0:19:08 into these like loops where it wouldn’t write the code,
0:19:10 couldn’t figure out what
0:19:14 the problem was; now with 3.7, in one prompt,
0:19:16 it figures out a problem that I had gone back
0:19:18 and forth on 10 times prior.
0:19:21 So to me, it seems like it got a lot better at code,
0:19:24 but from most other people’s experiences,
0:19:26 it seems like it didn’t really improve
0:19:29 in almost any other areas other than code.
0:19:30 – Yeah, when I think about coding,
0:19:31 I think about Claude, right?
0:19:35 Claude 3.5 was the go-to model for a lot of coders
0:19:37 using AI assistance to help them code.
0:19:40 Yeah, Cursor plus Claude, Windsurf plus Claude,
0:19:41 that was the model.
0:19:43 Now we had this huge upgrade and I agree,
0:19:44 it is a huge upgrade.
0:19:46 It also has the thinking capabilities.
0:19:49 So I’ve been doing stuff, vibe coding, right?
0:19:51 So I’m using some kind of IDE,
0:19:55 whether Cursor or Windsurf, plus I’m using 3.7 thinking,
0:19:57 and it’s fantastic, right?
0:19:58 It is fantastic.
0:20:01 Now I’ll bring it back to what I said earlier,
0:20:03 the real world use cases,
0:20:05 the stuff that I want to use it for day to day,
0:20:08 one is coding, but a lot of other things are not coding.
0:20:11 And here’s the thing, it’s not that fast
0:20:12 and it doesn’t have web search.
0:20:14 It has no real time information.
0:20:19 So it’s essentially unusable for me outside of coding.
0:20:21 Although it’s fantastic at coding, I’ll give you that.
0:20:23 – Well, the nice thing is if you are using something
0:20:26 like cursor, cursor can actually do the web search for you
0:20:28 and then give that additional context.
0:20:31 So it’s almost like cursor will actually kind of do
0:20:33 like the perplexity thing, right?
0:20:35 Where if it needs to figure something out,
0:20:37 cursor itself goes and does the search
0:20:39 and then provides that information to Claude.
0:20:41 At least when I’ve been using it,
0:20:44 it seems to actually search the web when using cursor.
0:20:48 – Yeah, and Claude 3.7 Sonnet is available in perplexity,
0:20:49 I’ll just mention quickly.
0:20:51 By the way, perplexity just adds
0:20:53 the latest models all the time for free.
0:20:56 And I’ve not been paid by them at all.
0:20:58 I’m just a huge fan of their product.
0:21:00 So yeah, so if you wanted to try any of these new models,
0:21:03 you could go try it if you already have a perplexity account.
0:21:05 – Yeah, and their deep research is really good.
0:21:08 If we’re talking about good things about perplexity,
0:21:10 if we’re praising perplexity,
0:21:12 they just threw deep research in there as well,
0:21:13 and it’s really good.
0:21:15 – But yeah, web search is critical.
0:21:17 – Yeah, I wanted to say about Claude Sonnet.
0:21:19 Like my feeling is, you know, before it was released,
0:21:21 Claude was my go-to for just like general chat
0:21:23 for like discussing anything, right?
0:21:25 Like even like my game design document,
0:21:27 I would share that with Claude and that was my favorite.
0:21:28 Then Grok replaced that,
0:21:30 but Claude was pretty good at code.
0:21:33 Like you said, most engineers were using Claude Sonnet,
0:21:35 but not all, like I was using O1 Pro
0:21:37 and actually it was better, but way slower, right?
0:21:38 And you had to give it tons of context.
0:21:41 You could use something like repo prompt or something else.
0:21:42 But it seems like in this update,
0:21:44 they really doubled down on being the best at coding.
0:21:46 ‘Cause in terms of like general chat,
0:21:47 I feel like it actually stepped backwards.
0:21:49 Like when I use it now to chat,
0:21:51 it actually got worse in this release.
0:21:53 I don’t like its responses as much.
0:21:55 They seem less human-like, but it’s way better at coding.
0:21:57 So it feels like we’re starting to see
0:21:59 that all the AI models are finding their own like specialties
0:22:02 or at least that’s what Anthropic’s doing now with Claude.
0:22:04 And I do kind of wonder if all the models
0:22:05 are gonna have to do that.
0:22:06 You know, when I talked to the people at xAI,
0:22:09 I told them double down on having the best data.
0:22:11 ‘Cause you got the real-time data with X,
0:22:13 you got it with, you know, the search, double down on that.
0:22:15 And I think you’ll probably see that
0:22:16 where like these different models
0:22:17 will be the best at a thing.
0:22:21 But it seems like ChatGPT is still trying to go more broad.
0:22:23 They’re trying to be like the best overall model.
0:22:26 And I’m kind of curious to see how all that ends up playing out.
0:22:27 – Yeah.
0:22:28 – It’s interesting, you know, quickly.
0:22:31 So I just pulled up the Claude 3.7 Sonnet blog post,
0:22:32 the announcement blog post.
0:22:35 And everybody knows Claude is great at coding
0:22:38 and they made this huge jump in coding.
0:22:41 But if you read it, it says,
0:22:42 in developing our reasoning models,
0:22:44 we’ve optimized somewhat less for math
0:22:46 and computer science competition problems.
0:22:50 And instead shifted focus toward real-world tasks
0:22:52 that better reflect how businesses actually use LLMs.
0:22:54 You know, now that I’m reading that,
0:22:57 maybe they met like these kind of benchmarky
0:23:00 computer science problems versus real-world coding problems.
0:23:02 But it just sounds like,
0:23:05 hey, we’re not focused as much on math and coding anymore,
0:23:07 more real-world stuff.
0:23:09 But it still is fantastic at that.
0:23:11 – Yeah.
0:23:12 – Did you guys see that they’re also,
0:23:15 it seems to be that they’re gonna be competing with cursor?
0:23:18 – Oh yeah, ’cause they released that code feature, right?
0:23:20 – Yeah, Claude code, all the top engineers,
0:23:21 I know they’ve tried it.
0:23:23 It’s very expensive to use,
0:23:24 but my understanding is like in some ways
0:23:25 it’s better than cursor.
0:23:27 So like, I’m not sure exactly how it works,
0:23:28 but I think you use it in the terminal,
0:23:31 you get full access to your entire code base,
0:23:33 and then it can just change stuff for you.
0:23:34 – Yeah, I tried it out.
0:23:35 It’s pretty cool.
0:23:35 That is interesting.
0:23:37 I think they really are doubling down on code.
0:23:38 – Yeah.
0:23:39 – I believe when I mentioned the benchmarks stuff,
0:23:40 I think they were probably talking about
0:23:42 more real-world coding versus like–
0:23:43 – I think you’re right.
0:23:44 – Benchmark coding, you know?
0:23:46 – Yeah, I think they know that most people
0:23:48 are using Claude mostly for coding.
0:23:49 – So it kind of sucks for Cursor,
0:23:51 ’cause like right now everyone uses Cursor to use Claude,
0:23:54 and now Anthropic’s basically going to try to kill them.
0:23:55 – I don’t know.
0:23:57 I think people will probably still use cursor a lot,
0:23:59 because Cursor and Windsurf, I believe,
0:24:02 are both forks of Visual Studio Code.
0:24:02 – Yeah, they must be saying like,
0:24:04 oh, most of our usage is in Cursor,
0:24:07 like why would we not like just be Cursor then?
0:24:08 – Yeah, but I don’t know.
0:24:11 Like Visual Studio Code is like pretty universal.
0:24:13 Like people use it for coding a lot,
0:24:15 and it’s what’s familiar to people.
0:24:18 So I think trying to get people to switch to a new IDE
0:24:21 might be a tough ask. Unless Anthropic themselves goes
0:24:24 and makes their own fork of Visual Studio Code,
0:24:31 I have a hard time seeing like people switch over
0:24:34 to like a new IDE that’s like completely different.
0:24:36 – It’s also not an IDE.
0:24:38 It’s literally just sitting in the terminal.
0:24:41 So, you know, some people prefer that.
0:24:44 I prefer seeing the code more visually
0:24:46 and have like a nice interface to deal with.
0:24:49 So I’ve tried both Cursor and Windsurf,
0:24:50 and they’re both great.
0:24:51 – Yeah, yeah.
0:24:53 To me, they feel very same-ish.
0:24:54 Like I have a hard time saying which one’s better.
0:24:56 They feel very, very similar to me.
0:24:59 – I’ve had more success with Windsurf
0:25:03 when trying to iterate on a whole code base.
0:25:05 Now Cursor, I think just released,
0:25:08 they kind of just updated their agent feature,
0:25:09 which makes it a little bit easier
0:25:11 to operate on the whole code base.
0:25:15 And your AI coding assistant agent is able to grok
0:25:17 and search through the code and do different things.
0:25:20 That helps it work on the code base as a whole.
0:25:21 But yeah, you know what?
0:25:22 Competition’s always good.
0:25:23 – Yeah, agreed.
0:25:25 I’m trying to look for some other like really cool examples
0:25:27 here of stuff that people made with Claude.
0:25:28 I mean, these are all really cool examples.
0:25:31 We can spend 30 minutes looking at all these examples.
0:25:32 So I’m looking for the best ones right now.
0:25:35 I mean, lots and lots of really cool examples
0:25:37 of stuff that Claude just did in one shot.
0:25:38 Lots of snake games.
0:25:39 Lots and lots of snake games.
0:25:41 – I made a snake game, yeah.
0:25:43 Of course I did, and then I had–
0:25:44 – First thing I did when I tested Claude
0:25:46 was make a snake game, I still always do.
0:25:47 – Yeah.
0:25:49 – But did you guys see the actually good snake game?
0:25:50 The one where it’s like having a mental breakdown
0:25:51 as it’s escaping?
0:25:52 Did you see that?
0:25:54 – I think you were just showing that, yeah.
0:25:54 Self-aware, there you go.
0:25:56 – Oh, this one, yeah, yeah, yeah.
0:25:58 So the self-aware snake escape?
0:26:00 – Yeah, so that’s the one, yeah.
0:26:01 It’s like freaking out.
0:26:02 (laughing)
0:26:04 – It just said your brain will struggle with this,
0:26:07 and then the little text pops up on the screen
0:26:08 as the snake is moving around.
0:26:10 – Yeah, it’s like, what’s the snake thinking
0:26:11 as it’s trying to break out?
0:26:13 And it starts freaking out that it can’t break out.
0:26:14 (laughing)
0:26:16 – It says this was done with one prompt
0:26:19 plus a request to make special things happen faster.
0:26:21 So I don’t totally know what that means,
0:26:25 but that was apparently one-shotted to get that. Pretty wild.
0:26:27 Yeah, probably using the agent feature.
0:26:29 Actually Cursor, they tested the new Claude model,
0:26:31 and they suggested to use the agent feature.
0:26:33 And I tested it, it is pretty amazing.
0:26:35 I tried it on my game and it fixed a problem
0:26:37 I’d been trying to solve with o1 Pro
0:26:38 and hadn’t been able to,
0:26:39 but then it broke like two things.
0:26:40 So I’m like…
0:26:41 (laughing)
0:26:43 And then it was unable to fix the things that it broke.
0:26:46 So it’s like, there’s still serious limitations with this.
0:26:47 – Yeah, yeah.
0:26:52 – Yeah, I’m building a 2D turn-based strategy game right now.
0:26:54 And I’m a few hours in,
0:26:56 and now that I’m a few hours in
0:26:58 and a few thousand lines of code in,
0:27:00 it’s a little harder, right?
0:27:03 It takes longer for each iteration to add features.
0:27:05 There are more bugs popping up
0:27:07 where if it changes one thing over here, something else changes.
0:27:09 Like I’ll ask it to change something
0:27:12 and then the entire game will look different
0:27:13 on the next turn,
0:27:14 even though I didn’t say anything about that.
0:27:17 So yeah, of course, there are some limitations.
0:27:18 It’s gonna get better,
0:27:20 especially as context windows grow.
0:27:22 That’s why I actually think maybe Google’s
0:27:24 two-million-token context window models
0:27:26 are quite appropriate for coding.
0:27:29 I just, I don’t think anybody uses them for coding.
0:27:31 I could be wrong about that though.
0:27:32 – Well, I’ve tried using it
0:27:35 ’cause you can switch to Google’s models,
0:27:37 like their Gemini or their Gemma models
0:27:39 inside of Cursor, I believe.
0:27:40 And when I tried Google’s models,
0:27:42 they just didn’t perform as well as Claude.
0:27:45 So I always find myself going back to Claude.
0:27:46 I tried o1.
0:27:48 I haven’t really done a lot of coding with o1 Pro
0:27:49 because there’s no API, so…
0:27:51 – o1 Pro is the best in that situation right now.
0:27:52 Like, if you have a lot of context,
0:27:54 like my game project,
0:27:56 even if I remove a lot of other files
0:27:57 and just get to the basic scripts,
0:27:59 it’s like 100K of context.
0:28:00 – Yeah, but they don’t have an API yet.
0:28:02 So you can’t just use it straight in Cursor.
0:28:04 You have to use something like,
0:28:05 what’s it called, Repo Prompt?
0:28:07 – Repo Prompt, and there’s a few other things too
0:28:09 where you can combine your files into one file
0:28:10 that you just copy and paste.
0:28:12 – Yeah, but I mean, I’ve got a PC,
0:28:13 Repo Prompt’s only on Mac,
0:28:15 so I can’t even use Repo Prompt if I wanted to.
0:28:18 But I mean, I could, I have a Mac, I just never use it.
0:28:21 So like, I can’t even use something like Repo Prompt for o1.
0:28:23 The other really good thing about the Google models
0:28:24 is like, I think they’re pretty much
0:28:26 the most inexpensive models.
0:28:27 So if you’re looking for like as cheap
0:28:29 as you can get, Google’s probably there.
0:28:31 I mean, Llama might even be a little bit cheaper,
0:28:34 but Google’s pretty damn cheap.
0:28:36 – Yeah, the people that I really trust,
0:28:38 their opinions on AI stuff,
0:28:40 they are telling me I’m missing out
0:28:42 by not using the Gemini models.
0:28:44 So there’s something there.
0:28:47 I just, you know, there’s only so much time in the day.
0:28:48 I haven’t really had a chance
0:28:50 to extensively test the Gemini models,
0:28:53 but I really should and I really need to get in there.
0:28:55 – I’m gonna give you guys a sneak peek real quick here
0:28:57 of what I’ve been building.
0:29:00 I’m building my video producer app.
0:29:02 Basically what I do is I load
0:29:04 all of these like various like interviews in here.
0:29:05 I’ve made these folders.
0:29:08 So like you’ve got like this interviews folder here
0:29:10 with just like tons of interviews.
0:29:13 Like here’s Rowan who’s interviewed Mark Zuckerberg
0:29:16 and Logan and Mustafa Suleyman and Demis Hassabis.
0:29:19 I’ve basically scraped a whole bunch of like interviews
0:29:23 and panels and launch videos and keynotes
0:29:24 and, you know, Lex Friedman interviews
0:29:26 and all sorts of like interviews
0:29:29 with various like AI leaders and stuff, right?
0:29:31 And inside of each one of these,
0:29:33 I actually use OpenAI’s Whisper
0:29:35 and it transcribes the whole thing for me.
0:29:37 So I have like the entire transcription,
0:29:39 but then I also have like videos
0:29:41 that don’t have any audio in them.
0:29:43 Like I’ve got like some B roll footage and stock footage.
0:29:46 Like here’s stock footage of like a robot.
0:29:48 And for this one, I actually use Google Gemini
0:29:50 and Gemini watches the whole video for me
0:29:52 and writes up a description of everything
0:29:53 that’s going on in the video.
0:29:57 And the idea being I just have this giant database
0:29:59 of videos where I throw the video in
0:30:02 and then I can search out anything I want.
0:30:04 So anytime like Sam Altman is mentioned in a video,
0:30:06 I can search out Sam Altman.
0:30:07 It’ll pull up all of the videos
0:30:09 that either have Sam Altman in them
0:30:10 or they mentioned Sam Altman.
0:30:13 And then I could quickly find exactly in the transcript
0:30:14 where he’s talked about.
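The search flow Matt describes — index every transcript or AI-written description, then pull up any video that mentions a term along with where it appears — can be sketched in a few lines. Everything here (function names, the toy video data) is illustrative, not his actual app:

```python
# Minimal sketch of a searchable video index: each video stores a
# transcript (or an AI-written description), and a search returns
# every video mentioning a term, plus a snippet around the match.

def build_index(videos):
    """videos: {title: transcript_or_description_text}. Lowercase for matching."""
    return {title: text.lower() for title, text in videos.items()}

def search(index, videos, term):
    term = term.lower()
    hits = []
    for title, text in index.items():
        pos = text.find(term)
        if pos != -1:
            # Grab a small snippet around the first match, in original casing
            snippet = videos[title][max(0, pos - 20):pos + len(term) + 20]
            hits.append((title, snippet.strip()))
    return hits

videos = {
    "Rowan x Zuckerberg": "and then Sam Altman said scaling still works",
    "DIGITS b-roll": "Close-up shots of the NVIDIA Project DIGITS device.",
}
index = build_index(videos)
print(search(index, videos, "sam altman"))
```

A real version would store the data in a database and match against timestamps too, but the core lookup is this simple.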
0:30:15 – You’re building this yourself?
0:33:18 – All with Cursor and Claude 3.7.
0:30:20 Dude, I would totally use this.
0:33:23 I’m using Notion for almost the same thing.
0:30:25 I’m basically just anytime I find a clip
0:30:26 that I would find useful in the future,
0:30:27 I’ll throw it in there.
0:30:29 And I just have to remember where it is
0:30:30 and what the context was.
0:30:32 This is super useful, man.
0:30:33 I would pay to use this.
0:30:35 – Yeah. So I’ve got like B roll that I shot.
0:30:37 This is actually B roll that you’re probably
0:30:39 in the background of if you look closely enough
0:30:41 ’cause this is at the NVIDIA event here,
0:30:43 the, you know, the little digits box.
0:30:44 I threw this video in here
0:30:46 and you can see it wrote this description.
0:30:47 This short video showcases
0:30:49 the NVIDIA Project DIGITS prototype,
0:30:50 a compact computing device.
0:30:53 The video primarily focuses on the physical device itself.
0:30:55 So it goes into all of this detail
0:30:59 from a 31-second video of like me getting B-roll of DIGITS.
0:31:00 So now if I ever, I’m like,
0:31:03 “Oh, what was that video I made that had DIGITS in it?”
0:31:04 And I need to pull that up really quick.
0:31:05 I could just search up digits, right?
0:31:08 And it’ll pull up this video as the top video.
0:31:11 So this is like sort of phase one of what I’m building here.
0:31:14 Phase two is I want to toss all of this
0:31:17 into like a RAG, a retrieval-augmented generation model,
0:31:20 where I can say, “Hey, I want to make a video
0:31:21 about Sam Altman.
0:31:23 Compile everything we know about Sam Altman
0:31:25 from all of these videos
0:31:27 and have it actually write like an outline for me
0:31:29 based on all the information that’s in these videos.”
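The "phase two" idea — retrieve the most relevant transcript chunks for a query and hand them to an LLM as context — is standard RAG, and the retrieval half can be shown with toy embeddings. Real systems use learned embeddings from an embedding API; the bag-of-words vectors below are a stand-in so the retrieval step itself is visible, and all names are hypothetical:

```python
# Sketch of RAG retrieval: embed each transcript chunk, rank chunks
# by cosine similarity to the query, return the top k as context.
import math
from collections import Counter

def embed(text):
    # Toy embedding: word-count vector (real systems use a model)
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(chunks, query, k=2):
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)
    return ranked[:k]

chunks = [
    "Sam Altman discussed scaling in this keynote.",
    "B-roll of the NVIDIA DIGITS box on the show floor.",
    "Sam Altman on AGI timelines in a Lex Fridman interview.",
]
context = retrieve(chunks, "everything we know about Sam Altman")
print(context)
```

The retrieved `context` would then be pasted into the outline-writing prompt, so the model only sees the handful of chunks that actually mention the topic.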
0:31:30 – That’s so cool.
0:31:32 – So that’s what I’ve been building.
0:31:34 And this again, I’ve been working on for about two weeks now
0:31:37 and it uses like seven different APIs.
0:31:40 It’s using like the OpenAI Whisper API.
0:31:42 It’s using the Gemini API
0:31:44 ’cause that can actually watch videos
0:31:46 and tell you what’s going on in the video.
0:31:49 It’s using Google’s Cloud Video Intelligence API
0:31:52 that’s actually able to like OCR any text in the video.
0:31:54 So if you’re watching like a slide presentation,
0:31:56 it can actually OCR any of the text
0:31:58 that’s in the slide presentation.
0:32:01 But yeah, it’s been a fun project to build.
0:32:03 But I run into the same kinds of stuff that Matthew,
0:32:05 you mentioned where I will go and ask it
0:32:07 to change like one feature.
0:32:09 I’ll be like, “Hey, the search isn’t working quite right.”
0:32:10 And it’ll be like, “Okay, I just fixed it.
0:32:11 I refreshed the page.”
0:32:13 And it changes the entire styling.
0:32:15 And I’m like, I didn’t ask you to touch the CSS at all.
0:32:17 I just wanted to change how the search functions.
0:32:19 Like what the hell?
0:32:21 But other than those little things,
0:32:22 I use GitHub a lot too.
0:32:25 It’s like every time a little change works,
0:32:27 I push it to GitHub so I know I can bring it back
0:32:28 if I need to.
0:32:29 – Oh, that’s smart, yeah.
0:32:31 – But yeah, it’s been fun.
0:32:35 And I’ve made fewer YouTube videos than I normally make.
0:32:37 I’ve only been putting out one YouTube video a week
0:32:38 for the past like month
0:32:40 because I’ve gotten so addicted
0:32:42 to playing around with AI coding.
0:32:43 It’s so fun.
0:32:44 Like I’m working on the game now.
0:32:46 I’m just like shocked that I can like build a game by myself.
0:32:48 It’s like, you know, in the past,
0:32:50 I never could imagine like one person could build a game
0:32:52 or now you’re building your own software product.
0:32:55 And Matt, you’re kind of like a no-code guy, right?
0:32:59 Now you’re using, you went from no-code to using AI to code
0:33:00 and you’re actually able to build a whole product.
0:33:02 I mean, it’s just, everything’s changing.
0:33:03 – Yeah.
0:33:05 And I mean, the thing is I’m learning as I go too, right?
0:33:07 ‘Cause like when you use these models,
0:33:10 it’ll explain to you what it did, what it changed.
0:33:12 You know, why something broke.
0:33:14 3.7 has been really, really good at that.
0:33:16 I don’t know if it’s 3.7
0:33:18 or if it’s like the agent feature inside of cursor.
0:33:21 But when it fixes stuff, it’ll explain the problem.
0:33:23 It’ll say you were running into this problem
0:33:24 because this, this and this was happening
0:33:26 or there was a conflict with this and this
0:33:29 or you know, it was sending the wrong information
0:33:30 through the API or whatever, right?
0:33:32 It gives you that information.
0:33:34 So although I don’t actually know
0:33:36 how to like type out the code myself,
0:33:39 I’m getting a lot better at troubleshooting
0:33:42 why problems are happening within the code.
0:33:45 I don’t know how to like actually change the code.
0:33:48 Like I don’t know what to write to make it work myself,
0:33:50 but I’m starting to pick up on like,
0:33:52 oh, I think this might conflict with this
0:33:54 as a result of learning
0:33:56 as it explains to me all of these problems.
0:33:59 So anyway, that’s what I’ve been working on.
0:34:01 But there is one last topic I do wanna shift in.
0:34:02 – Let’s talk about it.
0:34:04 – I wanna shift over to GPT 4.5
0:34:08 because as of today’s recording, GPT 4.5 came out.
0:34:10 Before we hit record, I asked Matt,
0:34:11 I’m like, what were your thoughts on that launch today?
0:34:13 And he’s like, I’ll save it for the recording.
0:34:15 I was like, all right, let’s save it for the recording.
0:34:16 So let’s start there.
0:34:20 What are your thoughts on the GPT 4.5 launch?
0:34:21 – So I think it looks cool.
0:34:24 I haven’t obviously tested it extensively.
0:34:25 It came out just a handful of hours,
0:34:28 not even, before this recording.
0:34:30 So Nathan, you mentioned it’s really good at writing.
0:34:32 So I’m excited to test that out.
0:34:34 But let me just talk about a couple of things
0:34:37 that I noticed from the live stream.
0:34:40 So one, it’s the largest model that they’ve ever made.
0:34:44 And it took new innovations on both the pre-training
0:34:47 as well as the serving of it, the inference,
0:34:48 to actually be able to serve this model.
0:34:51 And if you use it, it is pretty slow, right?
0:34:53 So I found that pretty interesting.
0:34:55 It’s a world knowledge model,
0:34:58 meaning it is not a thinking model,
0:35:00 but in terms of just questions and answers
0:35:02 that you would use kind of day to day,
0:35:03 it’s really good at that.
0:35:05 And it’s much better than GPT-4o.
0:35:07 And then the last thing,
0:35:08 and I think this flew under the radar a bit,
0:35:10 and I wanna get your guys’ thoughts on this.
0:35:13 They said that it was such a massive model
0:35:17 that they actually trained it across multiple data centers,
0:35:19 not in a singular location.
0:35:21 So when you think of the Grok model,
0:35:24 it’s the Colossus data center, 100,000,
0:35:27 something like 200,000 GPUs in a single place.
0:35:28 They didn’t have that.
0:35:29 They being open AI, they don’t have that.
0:35:32 And so they had to split it up.
0:35:35 And I don’t think any other company has done
0:35:38 parallel training across multiple data centers
0:35:40 at this level before.
0:35:42 I think that it really flew under the radar
0:35:44 and unlocks the ability for companies
0:35:49 that don’t have the money or resources that an XAI does
0:35:51 to go out and spread their model out
0:35:53 and still get a massive model trained.
0:35:55 And I thought that was just fascinating.
0:35:57 – Yeah, I didn’t catch that.
0:35:58 I mean, it’s not something that I picked up on
0:36:00 until you just mentioned it.
0:36:02 So I mean, that is really interesting.
0:36:05 So you’re saying that like, basically,
0:36:07 there were separate physical locations
0:36:09 where the training was happening simultaneously.
0:36:15 – Yeah, ’cause they probably literally could not get
0:36:19 a large enough data center to train this one model
0:36:23 in a concentrated location like Colossus with Grok 3.
0:36:26 So they had to split it up and do it in parallel.
0:36:28 And again, that really hadn’t been done before
0:36:30 at this level.
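The mechanism Matthew is describing is, at its core, data parallelism: each site computes gradients on its own shard of data, the gradients are combined, and every copy of the model applies the same update. A heavily simplified toy version on a one-dimensional least-squares problem makes the shape of it visible; real cross-datacenter training adds bandwidth limits, fault tolerance, and far more machinery:

```python
# Toy data-parallel training: two "data centers" each hold a data
# shard, each computes a gradient locally, the gradients are averaged
# (the all-reduce step), and the shared weight gets the same update
# everywhere. The model is y = w * x with mean-squared-error loss.

def gradient(w, shard):
    # d/dw of mean squared error for y = w * x over this shard
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def train_distributed(shards, w=0.0, lr=0.05, steps=200):
    for _ in range(steps):
        grads = [gradient(w, s) for s in shards]   # computed in parallel, one per site
        avg = sum(grads) / len(grads)              # all-reduce: average across sites
        w -= lr * avg                              # identical update at every site
    return w

# Data generated from y = 3x, split across two "sites"
site_a = [(1, 3), (2, 6)]
site_b = [(3, 9), (4, 12)]
w = train_distributed([site_a, site_b])
print(round(w, 3))  # converges toward 3.0
```

The point of the sketch: because the averaged gradient equals the gradient over the pooled data, splitting the data across locations doesn't change what the model learns, only where the compute happens.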
0:36:31 – Yeah, well, I mean,
0:36:33 we’ve got Project Stargate coming as well.
0:36:34 That’s gonna be the data center
0:36:36 where eventually they’ll be able to do it all
0:36:38 from one data center, I believe.
0:36:40 But yeah, no, that’s really, really interesting.
0:36:43 I personally found the actual presentation
0:36:45 a little bit underwhelming.
0:36:48 They didn’t really show off any sort of capabilities
0:36:51 that were like, oh, we’ve never seen that before.
0:36:53 – Considering this was like hyped up for like a year
0:36:55 of like, oh, Orion’s coming.
0:36:57 – Well, and then also Sam Altman,
0:36:59 he posted on Twitter a couple of days ago
0:37:02 that a lot of people who have used GPT 4.5
0:37:05 have gotten that “feel the AGI” moment from it, right?
0:37:08 But did anybody really feel an AGI moment from their demo?
0:37:10 I don’t know, I’m reading into it a little too much.
0:37:13 But the other thing I realized too is like,
0:37:15 you can always tell when OpenAI doesn’t see it
0:37:16 as that big of an announcement,
0:37:18 when Sam doesn’t show up for the announcement.
0:37:19 – Right.
0:37:22 You know, I think there’s more there than people realize.
0:37:25 And so many people accuse me of being overly optimistic.
0:37:28 So bear with me, or at least take what I’m gonna say
0:37:29 with a grain of salt.
0:37:30 I really do think that there’s more there.
0:37:34 First of all, this is the first version, right?
0:37:35 And we’re gonna get lots of different upgrades over time.
0:37:37 We’re gonna get a turbo version.
0:37:40 But I think in terms of just a baseline model,
0:37:42 it’s a lot better at the kind of general Q and A.
0:37:45 And so from that, when you have all of that world knowledge
0:37:47 baked into this model,
0:37:49 how do you think the thinking versions,
0:37:52 the o1s, the o3s, how do you think those are created?
0:37:54 It’s taking that foundation model,
0:37:56 using reinforcement learning,
0:37:59 and then kind of eliciting that thinking behavior from it.
0:38:02 So now we’re gonna have this incredible foundation model
0:38:05 to build the thinking models on top of.
0:38:07 Maybe that’s what o3 Pro is.
0:38:09 – And it’ll be crazy expensive.
0:38:11 – Oh, that’s another thing is the cost, yeah.
0:38:11 – You combine those two.
0:38:13 It’s like, okay, this model is already very expensive
0:38:15 before you do the thinking.
0:38:17 And so when you do, it’s like, it makes me wonder,
0:38:20 is that gonna lead to like the $2,000 a month plan
0:38:22 or something like this?
0:38:23 You know, there’s also another angle to think of it,
0:38:26 which is, it’s all about the vibes, right?
0:38:28 That was like a huge theme of the announcement of the vibes.
0:38:30 It’s a warm model.
0:38:32 It’s a high EQ model.
0:38:33 – They were trying to compete with Claude.
0:38:34 – Right, and so–
0:38:35 – You know, ’cause before Claude, everyone said like,
0:38:38 Claude had a better vibe than ChatGPT, you know?
0:38:40 – Right, and so like, if you think about it
0:38:42 from that perspective,
0:38:44 they’re really positioning this model
0:38:48 to be a true kind of AI personal assistant.
0:38:50 And I emphasize the word personal.
0:38:53 It is there to help you.
0:38:55 It is there to know you.
0:38:58 And I think this is really more interesting
0:39:00 than people are giving it credit for.
0:39:03 Because when you think about systems like Siri,
0:39:04 what it could be,
0:39:08 it’s probably based on something like this type of model.
0:39:10 – Yeah.
0:39:11 – One thing I just wanted to point out too
0:39:13 is I don’t know if you guys caught this tweet,
0:39:16 but Sam basically said that the new model
0:39:18 is a giant expensive model,
0:39:19 and they really wanted to launch it,
0:39:21 but they ran out of GPUs, right?
0:39:24 And this kind of comes back to Matthew’s point earlier
0:39:26 of how they had to like train it, you know,
0:39:30 in separate locations with this parallel processing.
0:39:33 They’re literally like reaching the end
0:39:36 of the available GPUs to be able to process this stuff.
0:39:38 And I just found that really, really interesting.
0:39:39 So I wanted to share that real quick.
0:39:41 – OpenAI is GPU poor.
0:39:45 – Yeah, I mean, they’ve got SoftBank now behind them.
0:39:47 – Who would have thought?
0:39:48 – I want to share two things that kind of show
0:39:50 that there is something special here.
0:39:52 So here’s something I saw that’s pretty interesting
0:39:55 from a professor who’s a biomedical scientist.
0:39:56 He said, “It appears to be remarkable
0:39:57 in medical imaging diagnosis.
0:40:00 It was the only model that perfectly diagnosed
0:40:01 this ultrasound image.”
0:40:03 So like in terms of like looking at ultrasound,
0:40:06 apparently it’s by far the best model out there.
0:40:08 So there probably are things that we’re going to find
0:40:09 about this model that are special
0:40:12 that when you first use it are not apparent.
0:40:13 – Yeah, good point.
0:40:13 No, I agree.
0:40:15 I’m sure it’s a lot better model
0:40:18 than I feel like their presentation let on.
0:40:20 Part of the problem was like,
0:40:22 we all watched the live stream, I’m sure.
0:40:25 They seemed like they were kind of nervous.
0:40:26 They were kind of uncomfortable.
0:40:29 Sam wasn’t on the live stream this time.
0:40:31 I just don’t think it was presented very well, honestly.
0:40:32 I think that was probably the problem.
0:40:34 – Yeah, here’s another thing I saw
0:40:35 that was super interesting.
0:40:37 Like the hallucination rate is like way lower
0:40:38 with like 4.5.
0:40:42 So it’s like a 37.1% hallucination rate
0:40:45 versus GPT-4o being 61%, you know,
0:40:47 o3-mini being 80.3%.
0:40:48 It’s like dramatically lower.
0:40:50 So that’s a, yeah, that’s step forward.
0:40:51 ‘Cause like obviously for big companies,
0:40:53 one of the big problems with using AI models
0:40:55 is hallucinations, right?
0:40:56 So yeah.
0:40:56 – Yeah, absolutely.
0:40:57 That’s crazy.
0:41:01 o3-mini on SimpleQA has an 80% hallucination rate.
0:41:03 That seems insane to me.
0:41:04 – Seems right.
0:41:05 ’Cause like o3-mini is like really fast,
0:41:07 but I’ve seen it make lots of weird mistakes.
0:41:09 Like when I tried to use it for coding and stuff,
0:41:10 it’s like, oh, you’re a genius.
0:41:12 And the next minute I’m like, oh, you’re a total moron.
0:41:15 It’s just like, its responses are like all over the place.
0:41:16 – I don’t quite understand that benchmark.
0:41:19 I feel like an 80% hallucination rate
0:41:21 just makes, would make something unusable.
0:41:23 I mean, if you’re only getting accurate information
0:41:25 20% of the time, I don’t know.
0:41:28 I don’t totally understand how they came to this one, I guess.
0:41:29 – I think we would have to see what the actual questions
0:41:33 are in the simple QA benchmark to understand the context
0:41:34 of why it’s scored so poorly.
0:41:36 But then just, you know,
0:41:40 I guess maybe remember it’s mostly for math, science,
0:41:42 like basically STEM, right?
0:41:44 Things that have verifiable answers.
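For readers puzzling over the same numbers: SimpleQA grades each model answer as correct, incorrect, or "not attempted" (the model declined), and the hallucination rate quoted in the episode is roughly the share of attempted answers that are wrong — which is why an 80% rate doesn't mean the model is wrong 80% of the time overall. A simplified sketch of that scoring, with made-up grading logic rather than the actual benchmark code:

```python
# SimpleQA-style scoring sketch: answers are graded correct,
# incorrect, or not_attempted; hallucination rate = incorrect
# answers as a fraction of attempted answers only.

def grade(answer, gold):
    if answer is None:            # model declined to answer
        return "not_attempted"
    return "correct" if answer.strip().lower() == gold.lower() else "incorrect"

def hallucination_rate(results):
    """results: list of (model_answer_or_None, gold_answer) pairs."""
    grades = [grade(a, g) for a, g in results]
    attempted = [x for x in grades if x != "not_attempted"]
    if not attempted:
        return 0.0
    return attempted.count("incorrect") / len(attempted)

results = [
    ("Paris", "Paris"),       # correct
    ("1971", "1969"),         # confidently wrong: a hallucination
    (None, "Niels Bohr"),     # declined, not counted against the model
    ("Kyoto", "Kyoto"),       # correct
]
print(hallucination_rate(results))  # 1 wrong out of 3 attempted
```

Under this definition a model can lower its hallucination rate either by knowing more or by declining more often, which is worth keeping in mind when comparing the headline percentages.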
0:41:45 – Absolutely.
0:41:46 – I think it’s cool.
0:41:47 I think it’s a good start.
0:41:49 – Yeah, no, I’m excited about it.
0:41:51 I mean, like, hey, for content creators like us,
0:41:53 all this news is amazing, right?
0:41:54 ‘Cause we get to talk about it.
0:41:56 We get to, you know, keep on sharing what’s coming out.
0:41:59 We get to play with it all and get access to it
0:42:01 and show what it’s capable of
0:42:03 and put it through its motions and stress test it.
0:42:06 And I think, you know, I could speak for myself.
0:42:07 I think I could speak for Matthew.
0:42:08 That’s what we love doing, right?
0:42:10 Like we love playing with the stuff and stress testing it
0:42:12 and figuring out what it really can do.
0:42:15 And that’s what’s most exciting to me, I think.
0:42:19 You know, GPT-5 is probably only six, eight weeks away.
0:42:22 I mean, it’s not that far off, apparently.
0:42:24 So this is just the beginning.
0:42:27 It’s been a crazy couple of weeks.
0:42:28 – Absolutely.
0:42:28 – So who won?
0:42:29 Who won the last week?
0:42:30 – I don’t know.
0:42:33 I mean, they all have different pros and cons, I guess.
0:42:34 Right?
0:42:35 – I was most surprised by Grok.
0:42:36 Right, like out of all the three,
0:42:39 like I feel slightly disappointed by 4.5.
0:42:40 I would say it would be my general feeling.
0:42:42 Not like majorly disappointed,
0:42:43 but like Orion’s not as big of a deal
0:42:45 as I was hoping it was going to be.
0:42:45 – Yeah.
0:42:47 – Grok really impressed me.
0:42:50 Anthropic’s Claude, you know, 3.7, amazing.
0:42:52 Kind of what I expected in terms of improvement.
0:42:54 So I would say Grok’s the biggest surprise
0:42:55 out of all the three.
0:42:56 – Yeah.
0:42:57 – I mean, personally, I’ve gotten the most use
0:43:01 out of Claude 3.7 because I’ve been really, really going
0:43:03 down the coding rabbit hole lately,
0:43:05 but that’s just a very like anecdotal thing for me, right?
0:43:08 Like that’s the use case that I’ve found the most valuable
0:43:10 in the moment is I’m doing a lot of coding
0:43:12 and 3.7 is great for that for me right now
0:43:14 in this moment in time.
0:43:18 – Yeah, with the caveat that I haven’t tested GPT 4.5 much,
0:43:20 I gotta give the crown to Grok 3
0:43:23 over this last wave, these last couple of weeks.
0:43:25 It went from I never used Grok 2
0:43:28 to now it is my go-to model for a lot
0:43:30 of use cases on my day-to-day tasks.
0:43:32 So definitely have to give it to Grok 3 there.
0:43:33 – Yeah.
0:43:35 Now we just need that damn API.
0:43:36 – Yes, yeah.
0:43:37 Then we’ll really–
0:43:38 – It’ll be interesting to benchmark it
0:43:39 when the API comes out, right?
0:43:41 Like see how it actually compares
0:43:42 to all these other models.
0:43:43 – Yeah, yeah.
0:43:44 Well, cool, Matthew.
0:43:45 This has been amazing.
0:43:47 You know, anybody listening, make sure you check out
0:43:50 Matthew’s YouTube channel over at MatthewBerman.
0:43:51 You’ve got an amazing newsletter.
0:43:52 I’m subscribed to it.
0:43:54 It’s the Forward Future newsletter, I believe.
0:43:55 – Forward Future.
0:43:57 – Everybody needs to go check those out anywhere else.
0:44:00 You want people to go check you out and follow you online?
0:44:04 – Yeah, come check me out on Twitter @MatthewBerman.
0:44:07 Come flame me for my opinions on politics,
0:44:08 even though I don’t share them.
0:44:08 – All right, you’re asking for it?
0:44:09 – Yeah, yeah.
0:44:10 – Well, you called it Twitter.
0:44:11 That’s like, you kind of gave yourself away.
0:44:12 – All right, you’re right.
0:44:13 I already offended one side.
0:44:15 – So like, look, I’m on your side, guys.
0:44:16 – Well, cool.
0:44:17 This has been super, super fun.
0:44:19 – Thank you for having me, guys.
0:44:19 Thank you.
0:44:20 – Great having you back.
0:44:23 I’m sure you’ll be back on if you want to be.
0:44:24 We’d love to have you again.
0:44:26 It’s always fun to chat with you
0:44:29 and really appreciate you spending the time with us today.
0:44:30 – Yeah, thank you, guys.
0:44:33 (upbeat music)
Episode 48: How do the latest updates to large language models stack up against each other? Matt Wolfe (https://x.com/mreflow) and Nathan Lands (https://x.com/NathanLands) are joined by Matthew Berman (https://x.com/MatthewBerman), an expert in deep-diving and testing the nuances of large language models.
In this episode, the trio discusses the recent releases of Grok 3, Claude 3.7, and GPT-4.5, analyzing their strengths, weaknesses, and unique features. Tune in to learn which model might be best for your needs, from coding and real-time information to creative writing and unbiased truth-seeking.
Check out The Next Wave YouTube Channel if you want to see Matt and Nathan on screen: https://lnk.to/thenextwavepd
—
Show Notes:
- (00:00) Exploring New AI Models
- (05:35) Inconsistent AI Code Performance
- (06:26) Redesigning Benchmarks for Modern Models
- (11:33) AI Bias Amplification on Social Media
- (15:11) AI Bias and Human Oversight
- (17:49) Claude 3.7: Improved Coding Abilities
- (20:30) Claude Update: Better Code, Worse Chat
- (23:19) Resistance to Switching IDE from VS Code
- (28:05) Video Producer App Preview
- (29:55) Showcasing Nvidia Digits Prototype
- (34:00) GPT-4.5’s Distributed Training
- (36:31) Optimistic Perspective on Future Upgrades
- (40:59) Excited for GPT-5 Launch
- (42:08) Claude 3.7 Excels in Coding
—
Mentions:
- Matthew Berman: https://x.com/MatthewBerman
- Forward Future: https://www.forwardfuture.ai/
- Grok 3: https://x.ai/blog/grok-3
- Claude 3.7: https://www.anthropic.com/news/claude-3-7-sonnet
- GPT-4.5: https://openai.com/index/introducing-gpt-4-5/
- Perplexity: https://www.perplexity.ai/
- Cursor: https://www.cursor.com/
- Gemini: https://ai.google/updates/
Get the guide to build your own Custom GPT: https://clickhubspot.com/tnw
—
Check Out Matt’s Stuff:
• Future Tools – https://futuretools.beehiiv.com/
• Blog – https://www.mattwolfe.com/
• YouTube- https://www.youtube.com/@mreflow
—
Check Out Nathan’s Stuff:
- Newsletter: https://news.lore.com/
- Blog – https://lore.com/
The Next Wave is a HubSpot Original Podcast // Brought to you by The HubSpot Podcast Network // Production by Darren Clarke // Editing by Ezra Bakker Trupiano