What’s Missing Between LLMs and AGI – Vishal Misra & Martin Casado

Summary & Insights

Imagine the Pentagon recruiting investment bankers from Wall Street with the promise that they will deploy more capital than most people see in a lifetime to buy stakes in private companies. This is the new reality of economic policy under the Trump administration, where the line between statecraft and dealmaking is blurring. The discussion centers on a new 30-person economic defense unit inside the Department of Defense, tasked with identifying and investing up to $200 billion in areas such as mining, drones, and energy to counter China. The initiative marks a sharp pivot in which the government acts as a giant strategic investment fund.

A parallel and equally striking financial move involves the TikTok deal. The US government expects to collect a $10 billion “brokerage fee” for facilitating the sale of the app to a US-led investor group, a figure far larger than any traditional M&A advisory fee and one that raises deep questions about the government’s role in commerce. The discussion digs into the ethics and economics of this fee, asking whether it represents payment for necessary security oversight or a form of leverage over a forced transaction, especially since the deal values TikTok’s US operations at a surprisingly low $14 billion.

The show closes with personal reflections from host Ed Elson on meeting listeners at a live show during South by Southwest. He was struck by the audience’s pervasive ambition and intellectual curiosity, from finance professionals and tech founders to students and professors. That moment of connection served as a reminder of the real people behind the download statistics and reinforced his motivation to raise the quality of the show for such an engaged community.

Surprising Insights

The Pentagon is openly recruiting “coverage bankers” from Wall Street, specialists who know which private equity firms own which assets, to act as deal scouts, in effect building an in-house investment bank focused on national security.

There is significant tension inside the Pentagon between the “capital-light” philosophy of Silicon Valley venture capital, which is already active in defense tech, and the more traditional heavy-asset focus of the private equity veterans now being recruited.

The US government’s $10 billion fee on the TikTok deal is not just large; it is structurally unprecedented, because governments do not normally charge a brokerage fee for approving or facilitating corporate transactions on national security grounds.

There is a clear concern that the Pentagon’s $200 billion budget could turn it into “dumb money” for private equity firms desperate to exit trillions of dollars in assets stuck on their books because of sluggish IPO and M&A markets.

The administration is actively discussing “monetizing the national balance sheet,” which could include ideas such as privatizing the Postal Service or selling stakes in other national assets, moving the US closer to the model of a sovereign wealth fund.

Practical Takeaways

For investors and analysts: Watch the defense tech, mining, and energy sectors closely, since the coming wave of government-directed capital is likely to create new winners and increase valuation volatility.

For finance professionals: Consider a non-traditional career path in public service; the government now offers roles with unique influence and access, even if direct pay may not match Wall Street, for those interested in geopolitical strategy.

For business leaders: Understand that regulatory approval, especially on national security grounds, can now come with a direct and substantial financial cost, fundamentally changing the calculus of mergers and acquisitions in sensitive industries.

For citizens and taxpayers: Scrutinize government equity stakes in private companies by asking two key questions: Is this an area where private capital genuinely would not invest, and what is the tangible public benefit of the investment beyond vague “strategic” gains?

For media observers: Recognize that the relationship between the administration and media companies can increasingly be framed as a series of transactional “deals” (such as “saving” TikTok or changing the ownership of CNN) rather than as matters of principle or policy.

Imagine prompting a language model with just the word “protein.” The model holds a probability distribution over what comes next, for example “synthesis” or “shake.” This split-second branching isn’t random; as further context arrives, the distribution shifts in a mathematical update, a Bayesian revision of belief based on new evidence. This precise, predictable mechanism is at the heart of how large language models actually function, a process Columbia professor Vishal Misra has spent years formally proving.

Vishal Misra’s journey began with an early, successful application of GPT-3 that translated natural-language cricket queries into a custom domain-specific language (DSL) for ESPN, an early form of retrieval-augmented generation (RAG). Stunned that it worked without any access to the model’s internals, he became obsessed with understanding why. His research led to a foundational insight: an LLM can be thought of abstractly as a colossal, sparse matrix in which each row corresponds to a possible prompt and holds a probability distribution over the next token, with one column per token in the vocabulary. Training compresses this impossibly large matrix, and during inference the model performs remarkably accurate Bayesian updates, adjusting its predictions in real time as context is provided, as seen in few-shot learning.
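The prompt-to-distribution picture and the Bayesian update it implies can be made concrete with a toy sketch. Everything here is a hypothetical illustration rather than Misra's construction: the two latent topics, the three candidate tokens, and all of the probabilities are invented for the "protein" example.

```python
# Hypothetical belief over two latent topics after seeing only "protein".
prior = {"biology": 0.5, "fitness": 0.5}

# Hypothetical rows of the abstract matrix: one next-token distribution
# per topic, with one column per candidate token (each row sums to 1).
next_token = {
    "biology": {"synthesis": 0.70, "shake": 0.05, "folding": 0.25},
    "fitness": {"synthesis": 0.05, "shake": 0.80, "folding": 0.15},
}

# Hypothetical likelihood of seeing a context word under each topic.
likelihood = {
    "ribosome": {"biology": 0.9, "fitness": 0.1},
    "gym": {"biology": 0.1, "fitness": 0.9},
}


def update(belief, word):
    """Bayes' rule: posterior is proportional to prior times likelihood."""
    unnormalised = {t: belief[t] * likelihood[word][t] for t in belief}
    total = sum(unnormalised.values())
    return {t: p / total for t, p in unnormalised.items()}


def predictive(belief):
    """Mix the per-topic rows, weighted by the current belief."""
    tokens = next_token["biology"].keys()
    return {
        tok: sum(belief[t] * next_token[t][tok] for t in belief)
        for tok in tokens
    }


before = predictive(prior)
after = predictive(update(prior, "gym"))
print("P(shake) before:", before["shake"])
print("P(shake) after seeing 'gym':", after["shake"])
```

Each extra context word plays the role of one more piece of evidence, which is the same mechanism the summary attributes to few-shot examples.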

However, Misra is adamant that this pattern-matching prowess, however mathematically elegant, is not true intelligence. LLMs operate entirely on correlation, learning statistical patterns from their frozen training data. They lack the ability to build causal models of the world, to simulate outcomes, or to perform interventions. This is the critical gap between current AI and what might be considered Artificial General Intelligence (AGI). To cross that chasm, he argues we need two major advances: models that can learn continually and update their own weights (plasticity) without catastrophic forgetting, and a fundamental shift from learning correlations to discovering causation.
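The contrast between frozen weights and session-local updating can be sketched in a few lines. This is only an analogy under stated assumptions: `WEIGHTS`, `Session`, and the topic likelihoods are hypothetical names and numbers, and a read-only mapping stands in for trained parameters.

```python
from types import MappingProxyType

# Stand-in for frozen weights: a read-only prior that inference cannot modify.
WEIGHTS = MappingProxyType({"biology": 0.5, "fitness": 0.5})


class Session:
    """Holds in-context evidence; beliefs are always rebuilt from WEIGHTS."""

    def __init__(self):
        self.context = []

    def observe(self, likelihoods):
        # Evidence is stored in the context, never written into WEIGHTS.
        self.context.append(likelihoods)

    def belief(self):
        current = dict(WEIGHTS)
        for like in self.context:
            current = {t: current[t] * like[t] for t in current}
        total = sum(current.values())
        return {t: p / total for t, p in current.items()}

    def reset(self):
        # Clearing the context window discards everything learned in-session.
        self.context.clear()


session = Session()
session.observe({"biology": 0.1, "fitness": 0.9})
during = session.belief()
session.reset()
after_reset = session.belief()
print(during, after_reset)
```

A system with the plasticity Misra describes would instead write some of that evidence back into the weights, and would need to do so without erasing what they already encode.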

The conversation culminates in a compelling test for AGI, which Misra calls the “Einstein test.” Could a model, trained only on physics pre-1911, derive the theory of relativity from the anomalous data of the era? A pure correlation engine would struggle, bound by the “data gravity” of prevailing Newtonian thought. True breakthrough intelligence requires creating a new, simpler explanatory model—reducing the Kolmogorov complexity of the world—something humans do but current LLMs cannot.
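Kolmogorov complexity itself is uncomputable, but the intuition that a simpler explanation means a shorter description can be illustrated with an ordinary compressor. The sketch below uses `zlib` as a crude, computable upper bound; the two byte sequences are hypothetical stand-ins for "data with a short generating rule" and "data with no rule to find."

```python
import random
import zlib

random.seed(0)
n = 10_000

# Data produced by a short rule: a compact description exists.
regular = bytes((i * i) % 7 for i in range(n))

# Data with no exploitable rule: each byte is drawn at random from 0..6.
noise = bytes(random.randrange(7) for _ in range(n))

# Compressed length is a rough upper bound on description length.
size_regular = len(zlib.compress(regular, 9))
size_noise = len(zlib.compress(noise, 9))
print("rule-generated:", size_regular, "bytes")
print("random:", size_noise, "bytes")
```

The compressor only finds repetition, not physical law, so this shows the direction of the argument rather than anything about how a new theory would be discovered.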

Surprising Insights

Transformers are mathematically precise Bayesian engines: In controlled “wind tunnel” experiments, small transformer models matched the theoretically correct Bayesian posterior distribution to within 10^-3 bits, strong evidence that they carry out Bayesian inference itself rather than something that merely resembles it.

Architecture, not just data, dictates Bayesian capability: When testing different model architectures on tasks designed to prevent memorization, transformers succeeded perfectly, Mamba performed well, LSTMs only partially succeeded, and MLPs failed completely, showing the inductive bias for Bayesian updating is built into the transformer architecture itself.

The “Einstein Test” frames AGI as compression: The leap to AGI is framed not as processing more data but as finding a new, simpler representation of the world (low Kolmogorov complexity), akin to Einstein deriving E=mc² from disparate clues, rather than just correlating all existing data points.

In-context learning is temporary, not permanent: LLMs update their predictions within a single session (conversation) using Bayesian reasoning, but they completely “forget” this learning once the context window resets; their core weights remain frozen, unlike the lifelong plasticity of the human brain.
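The kind of comparison behind the “wind tunnel” result in the first insight above can be sketched on a task small enough to solve by hand. This is a minimal illustration under assumptions, not the experimental setup from the research: the coin task, the three candidate biases, the use of KL divergence as the gap measure, and the `model_output` stand-in are all hypothetical.

```python
from math import log2

# Hypothetical task: a coin whose bias is one of three values, uniform prior.
biases = [0.2, 0.5, 0.8]
prior = [1 / 3, 1 / 3, 1 / 3]


def posterior(flips):
    """Exact posterior over the candidate biases after a list of 0/1 flips."""
    weights = []
    for bias, p in zip(biases, prior):
        like = 1.0
        for flip in flips:
            like *= bias if flip == 1 else (1 - bias)
        weights.append(p * like)
    total = sum(weights)
    return [w / total for w in weights]


def bayes_predictive(flips):
    """Probability that the next flip is heads under the exact posterior."""
    return sum(w * b for w, b in zip(posterior(flips), biases))


def kl_bits(p, q):
    """KL divergence in bits between Bernoulli(p) and Bernoulli(q)."""
    pairs = [(p, q), (1 - p, 1 - q)]
    return sum(a * log2(a / b) for a, b in pairs if a > 0)


flips = [1, 1, 0, 1]
exact = bayes_predictive(flips)
model_output = exact + 0.01  # stand-in for a trained model's prediction
gap = kl_bits(exact, model_output)
print(f"exact={exact:.4f} model={model_output:.4f} gap={gap:.6f} bits")
```

Because the true posterior is known in closed form, any gap is attributable to the model rather than to uncertainty about the right answer, which is what makes a controlled task of this kind useful.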

Practical Takeaways

Use tools like TokenProbe to demystify model behavior: Observing how next-token probabilities shift in real-time as you build a prompt can provide an intuitive, visual understanding of in-context learning and Bayesian updating.

Design prompts to leverage Bayesian inference: Structure few-shot examples deliberately, knowing that each example acts as evidence that steers the model’s probability distribution for subsequent generations.

Recognize the correlation/causation boundary in applications: Be cautious about using LLMs for tasks requiring true causal reasoning, simulation, or counterfactual planning, as they will only parrot correlated patterns from training data.

Focus on hybrid approaches for complex problem-solving: As demonstrated in Donald Knuth’s recent experiment, use LLMs as powerful associative engines to explore solution spaces, but rely on human (or a higher-level agent’s) causal reasoning to synthesize the findings into a new, coherent model or proof.

Summary & Insights

Imagine prompting a language model with just the word "protein." It holds a distribution of probabilities for what comes next—"synthesis" or "shake." This split-second branching isn't random; it's a mathematical update, a Bayesian revision of belief based on new evidence. This precise, predictable mechanism is at the heart of how large language models actually function, a process Columbia professor Vishal Misra has spent years formally proving.

The conversation culminates in a compelling test for AGI, which Misra calls the "Einstein test." Could a model, trained only on physics pre-1911, derive the theory of relativity from the anomalous data of the era? A pure correlation engine would struggle, bound by the "data gravity" of prevailing Newtonian thought. True breakthrough intelligence requires creating a new, simpler explanatory model—reducing the Kolmogorov complexity of the world—something humans do but current LLMs cannot.

Surprising Insights

Transformers are mathematically precise Bayesian engines: In controlled "wind tunnel" experiments, small transformer models matched the theoretically correct Bayesian posterior distribution to an accuracy of 10^-3 bits, proving they perform exact Bayesian inference, not just something that resembles it.
Architecture, not just data, dictates Bayesian capability: When testing different model architectures on tasks designed to prevent memorization, transformers succeeded perfectly, Mamba performed well, LSTMs only partially succeeded, and MLPs failed completely, showing the inductive bias for Bayesian updating is built into the transformer architecture itself.
The "Einstein Test" frames AGI as compression: The leap to AGI is framed not as processing more data but as finding a new, simpler representation of the world (low Kolmogorov complexity), akin to Einstein deriving E=mc² from disparate clues, rather than just correlating all existing data points.
In-context learning is temporary, not permanent: LLMs update their predictions within a single session (conversation) using Bayesian reasoning, but they completely "forget" this learning once the context window resets; their core weights remain frozen, unlike the lifelong plasticity of the human brain.

Practical Takeaways

Use tools like TokenProbe to demystify model behavior: Observing how next-token probabilities shift in real-time as you build a prompt can provide an intuitive, visual understanding of in-context learning and Bayesian updating.
Design prompts to leverage Bayesian inference: Structure few-shot examples deliberately, knowing that each example acts as evidence that steers the model's probability distribution for subsequent generations.
Recognize the correlation/causation boundary in applications: Be cautious about using LLMs for tasks requiring true causal reasoning, simulation, or counterfactual planning, as they will only parrot correlated patterns from training data.
Focus on hybrid approaches for complex problem-solving: As demonstrated in Donald Knuth’s recent experiment, use LLMs as powerful associative engines to explore solution spaces, but rely on human (or a higher-level agent’s) causal reasoning to synthesize the findings into a new, coherent model or proof.

Hãy tưởng tượng bạn đưa cho một mô hình ngôn ngữ chỉ một từ "protein." Nó nắm giữ một phân phối xác suất cho từ tiếp theo—có thể là "synthesis" (tổng hợp) hoặc "shake" (lắc). Sự phân nhánh trong chớp mắt này không ngẫu nhiên; đó là một sự cập nhật toán học, một sự điều chỉnh niềm tin theo Bayes dựa trên bằng chứng mới. Cơ chế chính xác, có thể dự đoán này là cốt lõi của cách thức hoạt động thực sự của các mô hình ngôn ngữ lớn (LLM), một quá trình mà Giáo sư Vishal Misra từ Đại học Columbia đã dành nhiều năm để chứng minh một cách chính thức.
Hành trình của Vishal Misra bắt đầu với một ứng dụng thành công sớm của GPT-3 để dịch các truy vấn cricket bằng ngôn ngữ tự nhiên thành một ngôn ngữ chuyên biệt (DSL) tùy chỉnh cho ESPN—một dạng sơ khai của RAG. Kinh ngạc vì nó hoạt động mà không cần truy cập nội bộ vào mô hình, ông trở nên ám ảnh với việc tìm hiểu *tại sao*. Nghiên cứu của ông dẫn đến một nhận thức nền tảng: một LLM có thể được coi một cách trừu tượng như một ma trận thưa cực lớn, trong đó mỗi hàng là một gợi ý (prompt) có thể và mỗi cột là một phân phối xác suất cho mã thông báo (token) tiếp theo. Việc đào tạo mô hình nén ma trận bất khả thi này, và trong quá trình suy luận, nó thực hiện các cập nhật Bayes chính xác một cách đáng kinh ngạc, điều chỉnh dự đoán của nó trong thời gian thực khi ngữ cảnh được cung cấp, như được thấy trong học ít mẫu (few-shot learning).
Tuy nhiên, Misra khẳng định rằng khả năng kết hợp mẫu này, dù thanh lịch về mặt toán học, không phải là trí thông minh thực sự. LLM hoạt động hoàn toàn dựa trên tương quan, học các mẫu thống kê từ dữ liệu đào tạo đã đóng băng của chúng. Chúng thiếu khả năng xây dựng các mô hình nhân quả của thế giới, để mô phỏng kết quả hoặc thực hiện các can thiệp. Đây là khoảng cách quan trọng giữa AI hiện tại và những gì có thể được coi là Trí tuệ Nhân tạo Tổng hợp (AGI). Để vượt qua vực thẳm đó, ông cho rằng chúng ta cần hai bước tiến lớn: các mô hình có thể học liên tục và cập nhật trọng số của chính chúng (tính dẻo) mà không quên thảm khốc, và một sự thay đổi cơ bản từ học tương quan sang khám phá quan hệ nhân quả.
Cuộc thảo luận đạt đến đỉnh điểm trong một bài kiểm tra thuyết phục cho AGI, mà Misra gọi là "bài kiểm tra Einstein." Liệu một mô hình, chỉ được đào tạo trên vật lý trước năm 1911, có thể suy ra thuyết tương đối từ dữ liệu bất thường của thời đại đó không? Một cỗ máy tương quan thuần túy sẽ gặp khó khăn, bị ràng buộc bởi "trọng lực dữ liệu" của tư tưởng Newton thống trị. Trí thông minh đột phá thực sự đòi hỏi tạo ra một mô hình giải thích mới, đơn giản hơn—giảm độ phức tạp Kolmogorov của thế giới—điều mà con người làm được nhưng các LLM hiện tại thì không.
### Những Hiểu Biết Đáng Ngạc Nhiên
* **Transformer là những cỗ máy Bayes chính xác về mặt toán học:** Trong các thí nghiệm "ống thổi gió" có kiểm soát, các mô hình transformer nhỏ khớp với phân phối hậu nghiệm Bayes đúng về mặt lý thuyết với độ chính xác 10^-3 bit, chứng minh chúng thực hiện suy luận Bayes chính xác, không chỉ là thứ gì đó giống như vậy.
* **Kiến trúc, không chỉ dữ liệu, quyết định khả năng Bayes:** Khi thử nghiệm các kiến trúc mô hình khác nhau trên các nhiệm vụ được thiết kế để ngăn chặn ghi nhớ, transformer thành công hoàn hảo, Mamba hoạt động tốt, LSTM chỉ thành công một phần và MLP thất bại hoàn toàn, cho thấy thiên kiến quy nạp cho việc cập nhật Bayes được xây dựng ngay trong bản thân kiến trúc transformer.
* **"Bài kiểm tra Einstein" đặt AGI như sự nén:** Bước nhảy vọt tới AGI được đặt không phải là xử lý nhiều dữ liệu hơn mà là tìm ra một biểu diễn mới, đơn giản hơn về thế giới (độ phức tạp Kolmogorov thấp), tương tự như Einstein suy ra E=mc² từ các manh mối rời rạc, thay vì chỉ tương quan tất cả các điểm dữ liệu hiện có.
* **Học trong ngữ cảnh là tạm thời, không vĩnh viễn:** LLM cập nhật dự đoán của chúng trong một phiên duy nhất (cuộc trò chuyện) bằng lập luận Bayes, nhưng chúng hoàn toàn "quên" việc học này một khi cửa sổ ngữ cảnh được thiết lập lại; trọng số cốt lõi của chúng vẫn đóng băng, không giống như tính dẻo suốt đời của não người.
### Những Điểm Thiết Thực
* **Sử dụng các công cụ như TokenProbe để làm sáng tỏ hành vi mô hình:** Quan sát cách xác suất mã thông báo tiếp theo thay đổi theo thời gian thực khi bạn xây dựng gợi ý có thể cung cấp sự hiểu biết trực quan, bằng hình ảnh về việc học trong ngữ cảnh và cập nhật Bayes.
* **Thiết kế gợi ý để tận dụng suy luận Bayes:** Cấu trúc các ví dụ ít mẫu một cách có chủ đích, biết rằng mỗi ví dụ đóng vai trò là bằng chứng định hướng phân phối xác suất của mô hình cho các lần tạo tiếp theo.
* **Nhận biết ranh giới tương quan/quan hệ nhân quả trong ứng dụng:** Hãy thận trọng khi sử dụng LLM cho các nhiệm vụ đòi hỏi lập luận nhân quả thực sự, mô phỏng hoặc lập kế hoạch phản thực tế, vì chúng sẽ chỉ lặp lại các mẫu tương quan từ dữ liệu đào tạo.
* **Tập trung vào các phương pháp tiếp cận kết hợp để giải quyết vấn đề phức tạp:** Như đã được minh họa trong thí nghiệm gần đây của Donald Knuth, hãy sử dụng LLM như những cỗ máy kết hợp mạnh mẽ để khám phá không gian giải pháp, nhưng dựa vào lập luận nhân quả của con người (hoặc của một tác nhân cấp cao hơn) để tổng hợp các phát hiện thành một mô hình hoặc bằng chứng mới, mạch lạc.

驚人洞見

Transformer是數學上精確的貝葉斯引擎：在受控的「風洞」實驗中，小型Transformer模型與理論上正確的貝葉斯後驗分佈的匹配精度達到了10^-3比特，證明它們執行的是精確的貝葉斯推論，而不僅僅是近似。

架構（而不僅是數據）決定了貝葉斯能力：在針對防止記憶而設計的任務上測試不同模型架構時，Transformer完美成功，Mamba表現良好，LSTM僅部分成功，而MLP則完全失敗，這表明貝葉斯更新的歸納偏見是內建於Transformer架構本身的。

「愛因斯坦測試」將AGI視為壓縮：邁向AGI的飛躍並非被視為處理更多數據，而是尋找對世界新的、更簡單的表徵（低柯爾莫哥洛夫複雜度），類似於愛因斯坦從分散線索推導出E=mc²，而不僅僅是關聯所有現有數據點。

上下文學習是暫時的，而非永久的：大型語言模型在單個會話（對話）中使用貝葉斯推理更新其預測，但一旦上下文窗口重置，它們便完全「遺忘」了這種學習；其核心權重保持凍結，這與人腦終生的可塑性不同。

實踐啟示

使用如TokenProbe等工具來揭示模型行為：觀察在構建提示時下一個詞元概率如何實時變化，可以直觀、視覺化地理解上下文學習和貝葉斯更新。

設計提示以利用貝葉斯推論：有意地構建少量樣本示例，須知每個示例都作為證據，引導模型後續生成的概率分佈。

在應用中認清相關性與因果關係的邊界：對於需要真正因果推理、模擬或反事實規劃的任務，使用大型語言模型時需謹慎，因為它們只會模仿訓練數據中的相關模式。

關注混合方法以解決複雜問題：正如高德納（Donald Knuth）最近的實驗所展示，利用大型語言模型作為強大的關聯性引擎來探索解決方案空間，但依賴人類（或更高級代理）的因果推理，將發現綜合成新的、連貫的模型或證明。

Imagina pedirle a un modelo de lenguaje solo la palabra "proteína". Contiene una distribución de probabilidades sobre lo que sigue —"síntesis" o "batido"—. Esta bifurcación en fracciones de segundo no es aleatoria; es una actualización matemática, una revisión bayesiana de creencias basada en nueva evidencia. Este mecanismo preciso y predecible está en el corazón de cómo funcionan realmente los grandes modelos de lenguaje, un proceso que el profesor de Columbia, Vishal Misra, ha pasado años demostrando formalmente.

La conversación culmina en una prueba convincente para la AGI, que Misra llama la "prueba de Einstein". ¿Podría un modelo, entrenado solo con física anterior a 1911, derivar la teoría de la relatividad a partir de los datos anómalos de la época? Un motor de pura correlación tendría dificultades, limitado por la "gravedad de los datos" del pensamiento newtoniano predominante. La verdadera inteligencia innovadora requiere crear un nuevo modelo explicativo más simple—reducir la complejidad de Kolmogorov del mundo—algo que los humanos hacen pero los LLM actuales no pueden.

Surprising Insights

Transformers are mathematically precise Bayesian engines: In controlled "wind tunnel" experiments, small transformer models matched the theoretically correct Bayesian posterior distribution to within 10^-3 bits, proving they perform exact Bayesian inference, not just something that resembles it.

Architecture, not just data, dictates Bayesian capability: When testing different model architectures on tasks designed to prevent memorization, transformers succeeded perfectly, Mamba performed well, LSTMs only partially succeeded, and MLPs failed completely, showing the inductive bias for Bayesian updating is built into the transformer architecture itself.

The "Einstein Test" frames AGI as compression: The leap to AGI is framed not as processing more data but as finding a new, simpler representation of the world (low Kolmogorov complexity), akin to Einstein deriving E=mc² from disparate clues, rather than just correlating all existing data points.

In-context learning is temporary, not permanent: LLMs update their predictions within a single session (conversation) using Bayesian reasoning, but they completely "forget" this learning once the context window resets; their core weights remain frozen, unlike the lifelong plasticity of the human brain.
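To make the "10^-3 bits" figure concrete, here is a minimal sketch of the kind of comparison it implies. Everything specific is an assumption for illustration: the two hypothetical coins, the observation sequence, the use of KL divergence in bits as the gap measure (the summary does not say which metric the experiments used), and the `model` list, which is a hand-made stand-in for a trained transformer's output.

```python
import math


def posterior(prior, likelihoods, observations):
    """Exact Bayes update over a discrete set of hypotheses."""
    post = list(prior)
    for x in observations:
        post = [p * lik[x] for p, lik in zip(post, likelihoods)]
        total = sum(post)
        post = [p / total for p in post]
    return post


def predictive(post, likelihoods):
    """Posterior predictive distribution over the next symbol."""
    n_symbols = len(likelihoods[0])
    return [sum(p * lik[s] for p, lik in zip(post, likelihoods))
            for s in range(n_symbols)]


def kl_bits(p, q):
    """KL divergence D(p || q) measured in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Two hypothetical coins: rows are hypotheses, columns are P(symbol | hypothesis).
likelihoods = [[0.8, 0.2], [0.3, 0.7]]
prior = [0.5, 0.5]
observations = [1, 1, 0, 1]

post = posterior(prior, likelihoods, observations)
exact = predictive(post, likelihoods)

# Stand-in for a model's next-symbol probabilities, slightly off the exact answer.
model = [exact[0] + 0.01, exact[1] - 0.01]
gap = kl_bits(exact, model)

print("posterior over hypotheses:", post)
print("exact predictive:", exact)
print(f"gap to stand-in model: {gap:.6f} bits")
```

In this toy setup even a one-percentage-point error in the predicted probabilities costs only a few ten-thousandths of a bit, which gives a sense of how tight a 10^-3 bit match is.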

Practical Takeaways

Use tools like TokenProbe to demystify model behavior: Observing how next-token probabilities shift in real time as you build a prompt can provide an intuitive, visual understanding of in-context learning and Bayesian updating.

Design prompts to leverage Bayesian inference: Structure few-shot examples deliberately, knowing that each example acts as evidence that steers the model's probability distribution for subsequent generations.

Recognize the correlation/causation boundary in applications: Be cautious about using LLMs for tasks requiring true causal reasoning, simulation, or counterfactual planning, as they will only parrot correlated patterns from training data.

Focus on hybrid approaches for complex problem-solving: As demonstrated in Donald Knuth’s recent experiment, use LLMs as powerful associative engines to explore solution spaces, but rely on a human’s (or a higher-level agent’s) causal reasoning to synthesize the findings into a new, coherent model or proof.
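The takeaway about few-shot examples acting as evidence can be illustrated with a small, idealized calculation. This is a sketch under stated assumptions, not a description of what happens inside a real model: the three candidate tasks, the `NOISE` rate, and the example pairs are all hypothetical. It treats each few-shot example as an observation and tracks a posterior over which task the prompt is demonstrating.

```python
# Candidate tasks the prompt might be demonstrating (hypothetical, for illustration).
TASKS = {
    "uppercase": str.upper,
    "reverse": lambda s: s[::-1],
    "identity": lambda s: s,
}
NOISE = 0.05  # assumed chance that an example disagrees with the true task


def task_posterior(examples):
    """Return the belief over tasks before any example and after each one."""
    belief = {name: 1 / len(TASKS) for name in TASKS}
    history = [dict(belief)]
    for inp, out in examples:
        # Likelihood: high if the task reproduces the example, low otherwise.
        for name, fn in TASKS.items():
            belief[name] *= (1 - NOISE) if fn(inp) == out else NOISE
        total = sum(belief.values())
        belief = {name: p / total for name, p in belief.items()}
        history.append(dict(belief))
    return history


# The first example is ambiguous ("aa" reversed is still "aa"); the rest are not.
shots = [("aa", "aa"), ("ab", "ba"), ("xyz", "zyx")]
for step, belief in enumerate(task_posterior(shots)):
    print(step, {name: round(p, 3) for name, p in belief.items()})
```

The ambiguous first example leaves "reverse" and "identity" tied, while the later, discriminating examples concentrate the belief on "reverse". The practical point is that examples which rule out competing interpretations carry the most evidence.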

Vishal Misra returns to explain his latest research on how LLMs actually work under the hood. He walks through experiments showing that transformers update their predictions in a precise, mathematically predictable way as they process new information, explains why this still doesn’t mean they’re conscious, and describes what’s actually required for AGI: the ability to keep learning after training and the move from pattern matching to understanding cause and effect.

Resources:

Follow Vishal Misra on X: https://x.com/vishalmisra
Follow Martin Casado on X: https://x.com/martin_casado

Stay Updated:

Find a16z on YouTube
Find a16z on X
Find a16z on LinkedIn
Listen to the a16z Show on Spotify
Listen to the a16z Show on Apple Podcasts

Follow our host: https://twitter.com/eriktorenberg

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.

Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.
