
Latent Space: The AI Engineer Podcast - Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and all things Software 3.0

The podcast by and for AI Engineers! In 2023, over 1 million visitors came to Latent Space to hear about news, papers and interviews in Software 3.0. We cover Foundation Models changing every domain in Code Generation, Multimodality, AI Agents, GPU Infra and more, directly from the founders, builders, and thinkers involved in pushing the cutting edge. Striving to give you everything from the definitive take on the Current Thing to the first introduction to the tech you'll be using in the next 3 months! We break news and exclusive interviews from OpenAI, tiny (George Hotz), Databricks/MosaicML (Jon Frankle), Modular (Chris Lattner), Answer.ai (Jeremy Howard), et al. Full show notes always on https://latent.space

Subscribe

iTunes / Overcast / RSS

Website

latent.space/podcast

Episodes

Supervise the Process of AI Research - with Jungwon Byun and Andreas Stuhlmüller of Elicit

Maggie, Linus, Geoffrey, and the LS crew are reuniting for our second annual AI UX demo day in SF on Apr 28. Sign up to demo here! And don't forget tickets for the AI Engineer World's Fair - for early birds who join before keynote announcements!

It's become fashionable for many AI startups to project themselves as "the next Google" - though, with the search engine being so 2000s, both Perplexity and Exa referred to themselves as a "research engine" or "answer engine" in our NeurIPS pod. However, these searches tend to be relatively shallow, and it is challenging to zoom up and down the ladders of abstraction to garner insights. For serious researchers, this level of simple one-off search will not cut it.

We've commented in our Jan 2024 Recap that Flow Engineering (simply: multi-turn processes over many-shot single prompts) seems to offer far more performance, control and reliability for a given cost budget. Our experiments with Devin and our understanding of what the new Elicit Notebooks offer give a glimpse into the potential for very deep, open-ended, thoughtful human-AI collaboration at scale.

It starts with prompts

When ChatGPT exploded in popularity in November 2022, everyone was turned into a prompt engineer. While generative models were good at "vibe based" outcomes (tell me a joke, write a poem, etc.) with basic prompts, they struggled with more complex questions, especially in symbolic fields like math and logic. Two of the most important "tricks" that people picked up on were:

* The Chain-of-Thought prompting strategy proposed by Wei et al. in "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models". Rather than doing traditional few-shot prompting with just questions and answers, adding the thinking process that led to the answer results in much better outcomes.

* Adding "Let's think step by step" to the prompt as a way to boost zero-shot reasoning, which was popularized by Kojima et al. in the "Large Language Models are Zero-Shot Reasoners" paper from NeurIPS 2022. This bumped accuracy from 17% to 79% compared to standard zero-shot prompting. (A minimal sketch of both tricks follows below.)
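
For concreteness, here is a minimal sketch of the two prompt styles. The few-shot exemplar is the well-known tennis-ball example from the Wei et al. paper; everything else (the template wording, the sample question) is illustrative rather than taken from either paper.

```python
# Minimal sketch of the two prompting tricks described above.
# The templates are illustrative; plug the rendered prompt into any LLM client.

# 1) Few-shot chain-of-thought (Wei et al.): the exemplar shows the reasoning
#    that leads to the answer, not just the answer itself.
FEW_SHOT_COT = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: {question}
A:"""

# 2) Zero-shot chain-of-thought (Kojima et al.): no exemplars, just append a
#    reasoning trigger before letting the model answer.
ZERO_SHOT_COT = """Q: {question}
A: Let's think step by step."""

question = "A juggler has 16 balls. Half are golf balls, and half of the golf balls are blue. How many blue golf balls are there?"
print(FEW_SHOT_COT.format(question=question))
print(ZERO_SHOT_COT.format(question=question))
```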

Nowadays, prompts include everything from promises of monetary rewards to... whatever the Nous folks are doing to turn a model into a world simulator. At the end of the day, the goal of prompt engineering is increasing accuracy, structure, and repeatability in a model's generations.

From prompts to agents

As prompt engineering got more and more popular, agents (see "The Anatomy of Autonomy") took over Twitter with cool demos, and AutoGPT became the fastest growing repo in GitHub history. The thing about AutoGPT that fascinated people was the ability to simply put in an objective without worrying about explaining HOW to achieve it, or having to write very sophisticated prompts. The system would create an execution plan on its own, and then loop through each task.
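
In rough pseudocode, that pattern looks something like the sketch below. The helpers and prompts are hypothetical placeholders to show the plan-then-loop shape; they do not mirror AutoGPT's actual internals.

```python
# Rough sketch of the plan-then-execute loop behind AutoGPT-style agents.
# `llm` is any text-completion callable you supply; nothing here is AutoGPT's real API.

def run_agent(objective: str, llm, max_steps: int = 20) -> list[str]:
    plan = llm(f"Break this objective into a short numbered task list:\n{objective}")
    tasks = [line.strip() for line in plan.splitlines() if line.strip()]
    results: list[str] = []
    while tasks and len(results) < max_steps:
        task = tasks.pop(0)
        result = llm(
            f"Objective: {objective}\n"
            f"Results so far: {results}\n"
            f"Current task: {task}\n"
            "Carry out the current task and report the result."
        )
        results.append(result)
        # The agent may rewrite its remaining plan after every step, which is
        # part of why identical runs are hard to reproduce.
        revised = llm(
            f"Objective: {objective}\nResults so far: {results}\n"
            f"Remaining tasks: {tasks}\nRewrite the remaining task list, one task per line."
        )
        tasks = [line.strip() for line in revised.splitlines() if line.strip()]
    return results
```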

The problem with open-ended agents like AutoGPT is that 1) it's hard to replicate the same workflow over and over again, and 2) there isn't a way to hard-code specific steps that the agent should take without actually coding them yourself, which isn't what most people want from a product.

From agents to products

Prompt engineering and open-ended agents were great in the experimentation phase, but this year more and more of these workflows are starting to become polished products.

Today's guests are Andreas Stuhlmüller and Jungwon Byun of Elicit (previously Ought), an AI research assistant that they think of as "the best place to understand what is known".

Ought was a non-profit, but last September, Elicit spun off into a PBC with a $9m seed round. It is hard to quantify how much a workflow can be improved, but Elicit boasts some impressive numbers for research assistants:

Just four months after launch, Elicit crossed $1M ARR, which shows how much interest there is for AI products that just work.

One of the main takeaways we had from the episode is how teams should focus on supervising the process, not the output. Their philosophy at Elicit isn't to train general models, but to train models that are extremely good at focused processes.

This allows them to have pre-created steps that the user can add to their workflow (like classifying certain features that are specific to their research field) without having to write a prompt for it. And for Hamel Husain's happiness, they always show you the underlying prompt.
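
To make the idea concrete, here is a minimal sketch of what "supervise the process" can look like in code: each step has its own visible prompt and its own inspectable intermediate output. This is purely illustrative and not Elicit's implementation.

```python
# Minimal sketch of process supervision: a pipeline of named steps, each with
# a user-visible prompt and an intermediate output that can be checked on its own.
# Illustrative only; not Elicit's actual code.

from dataclasses import dataclass

@dataclass
class Step:
    name: str
    prompt: str  # always shown to the user, never hidden

def run_pipeline(paper_text: str, steps: list[Step], llm) -> dict[str, str]:
    outputs = {}
    for step in steps:
        # Each intermediate result can be spot-checked or corrected individually,
        # instead of only judging a single end-to-end answer.
        outputs[step.name] = llm(step.prompt.format(paper=paper_text))
    return outputs

steps = [
    Step("population", "What population does this study examine?\n\n{paper}"),
    Step("intervention", "What intervention was tested?\n\n{paper}"),
    Step("outcomes", "What were the main outcomes?\n\n{paper}"),
]
```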

Elicit recently announced Notebooks as a new interface to interact with their products. (Fun fact: they tried to implement this four times before they landed on the right UX! We discuss this around 33:00 in the podcast.)

The reasons why they picked notebooks as a UX all tie back to process:

* They are systematic; once you have an instruction/prompt that works on a paper, you can run hundreds of papers through the same workflow by creating a column (see the sketch after this list). Notebooks can also be edited and exported at any point during the flow.

* They are transparent - Many papers include an opaque literature review as perfunctory context before getting to their novel contribution. But PDFs are "dead" and it is difficult to follow the thought process and exact research flow of the authors. Sharing "living" Elicit Notebooks opens up this process.

* They are unbounded - Research is an endless stream of rabbit holes. So it must be easy to dive deeper and follow up with extra steps, without losing the ability to surface for air.
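
Here is a hypothetical sketch of the "column" idea from the first bullet: one instruction, debugged on a single paper, then applied uniformly across a whole table of papers. The helper names are invented for illustration; this is not Elicit's API.

```python
# Illustrative sketch of adding a "column": apply one instruction to every paper
# in a table. Hypothetical helpers; not Elicit's API.

def add_column(papers: list[dict], column_name: str, instruction: str, llm) -> list[dict]:
    for paper in papers:
        prompt = f"{instruction}\n\nPaper abstract:\n{paper['abstract']}"
        paper[column_name] = llm(prompt)
    return papers

# Once the instruction works on one paper, the same call scales it to hundreds:
# papers = add_column(papers, "sample_size", "Report the study's sample size.", llm)
```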

We had a lot of fun recording this, and hope you have as much fun listening!

AI UX in SF

Long time Latent Spacenauts might remember our first AI UX meetup with Linus Lee, Geoffrey Litt, and Maggie Appleton last year. Well, Maggie has since joined Elicit, and they are all returning at the end of this month!

Sign up here: https://lu.ma/aiux

And submit demos here! https://forms.gle/iSwiesgBkn8oo4SS8

We expect the 200 seats to "sell out" fast. Attendees with demos will be prioritized.

Show Notes

* Elicit

* Ought (their previous non-profit)

* "Pivoting" with GPT-4

* Elicit notebooks launch

* Charlie

* Andreas' Blog

Timestamps

* [00:00:00] Introductions

* [00:07:45] How Jungwon and Andreas Joined Forces to Create Elicit

* [00:10:26] Why Products > Research

* [00:15:49] The Evolution of Elicit's Product

* [00:19:44] Automating Literature Review Workflow

* [00:22:48] How GPT-3 to GPT-4 Changed Things

* [00:25:37] Managing LLM Pricing and Performance

* [00:31:07] Open vs. Closed: Elicit's Approach to Model Selection

* [00:31:56] Moving to Notebooks

* [00:39:11] Elicit's Budget for Model Queries and Evaluations

* [00:41:44] Impact of Long Context Windows

* [00:47:19] Underrated Features and Surprising Applications

* [00:51:35] Driving Systematic and Efficient Research

* [00:53:00] Elicit's Team Growth and Transition to a Public Benefit Corporation

* [00:55:22] Building AI for Good

Full Interview on YouTube

As always, a plug for our YouTube version for the 80% of communication that is nonverbal:

Transcript

Alessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI.

Swyx [00:00:15]: Hey, and today we are back in the studio with Andreas and Jungwon from Elicit. Welcome.

Jungwon [00:00:20]: Thanks guys.

Andreas [00:00:21]: It's great to be here.

Swyx [00:00:22]: Yeah. So I'll introduce you separately, but also, you know, we'd love to learn a little bit more about you personally. So Andreas, it looks like you started Elicit first, Jungwon joined later.

Andreas [00:00:32]: That's right. For all intents and purposes, the Elicit and also the Ought that existed before then were very different from what I started. So I think it's like fair to say that you co-founded it.

Swyx [00:00:43]: Got it. And Jungwon, you're a co-founder and COO of Elicit now.

Jungwon [00:00:46]: Yeah, that's right.

Swyx [00:00:47]: So there's a little bit of a history to this. I'm not super aware of like the sort of journey. I was aware of Ought and Elicit as sort of a nonprofit type situation. And recently you turned into like a B Corp, Public Benefit Corporation. So yeah, maybe if you want, you could take us through that journey of finding the problem. You know, obviously you're working together now. So like, how do you get together to decide to leave your startup career to join him?

Andreas [00:01:10]: Yeah, it's truly a very long journey. I guess truly, it kind of started in Germany when I was born. So even as a kid, I was always interested in AI, like I kind of went to the library. There were books about how to write programs in QBasic and like some of them talked about how to implement chatbots.

Jungwon [00:01:27]: To be clear, he grew up in like a tiny village on the outskirts of Munich called Dinkelscherben, where it's like a very, very idyllic German village.

Andreas [00:01:36]: Yeah, important to the story. So basically, the main thing is I've kind of always been thinking about AI my entire life and been thinking about, well, at some point, this is going to be a huge deal. It's going to be transformative. How can I work on it? And was thinking about it from when I was a teenager, after high school did a year where I started a startup with the intention to become rich. And then once I'm rich, I can affect the trajectory of AI. Did not become rich, decided to go back to college and study cognitive science there, which was like the closest thing I could find at the time to AI. In the last year of college, moved to the US to do a PhD at MIT, working on broadly kind of new programming languages for AI because it kind of seemed like the existing languages were not great at expressing world models and learning world models, doing Bayesian inference. Was always thinking about, well, ultimately, the goal is to actually build tools that help people reason more clearly, ask and answer better questions and make better decisions. But for a long time, it seemed like the technology to put reasoning in machines just wasn't there. Initially, at the end of my postdoc at Stanford, I was thinking about, well, what to do? I think the standard path is you become an academic and do research. But it's really hard to actually build interesting tools as an academic. You can't really hire great engineers. Everything is kind of on a paper-to-paper timeline. And so I was like, well, maybe I should start a startup, pursued that for a little bit. But it seemed like it was too early because you could have tried to do an AI startup, but probably would not have been this kind of AI startup we're seeing now. So then decided to just start a nonprofit research lab that's going to do research for a while until we better figure out how to do thinking in machines. And that was Ought. And then over time, it became clear how to actually build actual tools for reasoning. And only over time, we developed a better way to... I'll let you fill in some of the details here.

Jungwon [00:03:26]: Yeah. So I guess my story maybe starts around 2015. I kind of wanted to be a founder for a long time, and I wanted to work on an idea that stood the test of time for me, like an idea that stuck with me for a long time. And starting in 2015, actually, originally, I became interested in AI-based tools from the perspective of mental health. So there were a bunch of people around me who were really struggling. One really close friend in particular was really struggling with mental health and didn't have any support, and it didn't feel like there was anything before kind of like getting hospitalized that could just help her. And so luckily, she came and stayed with me for a while, and we were just able to talk through some things. But it seemed like lots of people might not have that resource, and something maybe AI-enabled could be much more scalable. I didn't feel ready to start a company then, that's 2015. And I also didn't feel like the technology was ready. So then I went into FinTech and kind of learned how to do the tech thing. And then in 2019, I felt like it was time for me to just jump in and build something on my own that I really wanted to create. And at the time, I looked around at tech and felt like not super inspired by the options. I didn't want to have a tech career ladder, or I didn't want to climb the career ladder. There were two kind of interesting technologies at the time, there was AI and there was crypto. And I was like, well, the AI people seem like a little bit more nice, maybe like slightly more trustworthy, both super exciting, but threw my bet in on the AI side. And then I got connected to Andreas. And actually, the way he was thinking about pursuing the research agenda at Ought was really compatible with what I had envisioned for an ideal AI product, something that helps kind of take really complex thinking, overwhelming thoughts, and break them down into small pieces. And then this kind of mission that we need AI to help us figure out what we ought to do was really inspiring, right? Yeah, because I think it was clear that we were building the most powerful optimizer of our time. But as a society, we hadn't figured out how to direct that optimization potential. And if you kind of direct tremendous amounts of optimization potential at the wrong thing, that's really disastrous. So the goal of Ought was to make sure that if we build the most transformative technology of our lifetime, it can be used for something really impactful, like good reasoning, like not just generating ads. My background was in marketing, but like, so I was like, I want to do more than generate ads with this. But also if these AI systems get to be super intelligent enough that they are doing this really complex reasoning, that we can trust them, that they are aligned with us and we have ways of evaluating that they're doing the right thing. So that's what Ought did. We did a lot of experiments, you know, like I just said, before foundation models really like took off. A lot of the issues we were seeing were more in reinforcement learning, but we saw a future where AI would be able to do more kind of logical reasoning, not just kind of extrapolate from numerical trends. We actually kind of set up experiments with people where kind of people stood in as super intelligent systems and we effectively gave them context windows. 
So they would have to like read a bunch of text and one person would get less text and one person would get all the texts and the person with less text would have to evaluate the work of the person who could read much more. So like in a world we were basically simulating, like in 2018, 2019, a world where an AI system could read significantly more than you and you as the person who couldn't read that much had to evaluate the work of the AI system. Yeah. So there's a lot of the work we did. And from that, we kind of iterated on the idea of breaking complex tasks down into smaller tasks, like complex tasks, like open-ended reasoning, logical reasoning into smaller tasks so that it's easier to train AI systems on them. And also so that it's easier to evaluate the work of the AI system when it's done. And then also kind of, you know, really pioneered this idea, the importance of supervising the process of AI systems, not just the outcomes. So a big part of how Elicit is built is we're very intentional about not just throwing a ton of data into a model and training it and then saying, cool, here's like scientific output. Like that's not at all what we do. Our approach is very much like, what are the steps that an expert human does or what is like an ideal process as granularly as possible, let's break that down and then train AI systems to perform each of those steps very robustly. When you train like that from the start, after the fact, it's much easier to evaluate, it's much easier to troubleshoot at each point. Like where did something break down? So yeah, we were working on those experiments for a while. And then at the start of 2021, decided to build a product.

Swyx [00:07:45]: Do you mind if I, because I think you're about to go into more modern thought and Elicit. And I just wanted to, because I think a lot of people are in where you were like sort of 2018, 19, where you chose a partner to work with. Yeah. Right. And you didn't know him. Yeah. Yeah. You were just kind of cold introduced. A lot of people are cold introduced. Yeah. Never work with them. I assume you had a lot, a lot of other options, right? Like how do you advise people to make those choices?

Jungwon [00:08:10]: We were not totally cold introduced. So one of our closest friends introduced us. And then Andreas had written a lot on the Ought website, a lot of blog posts, a lot of publications. And I just read it and I was like, wow, this sounds like my writing. And even other people, some of my closest friends I asked for advice from, they were like, oh, this sounds like your writing. But I think I also had some kind of like things I was looking for. I wanted someone with a complementary skill set. I wanted someone who was very values-aligned. And yeah, that was all a good fit.

Andreas [00:08:38]: We also did a pretty lengthy mutual evaluation process where we had a Google doc where we had all kinds of questions for each other. And I think it ended up being around 50 pages or so of like various like questions and back and forth.

Swyx [00:08:52]: Was it the YC list? There's some lists going around for co-founder questions.

Andreas [00:08:55]: No, we just made our own questions. But I guess it's probably related in that you ask yourself, what are the values you care about? How would you approach various decisions and things like that?

Jungwon [00:09:04]: I shared like all of my past performance reviews. Yeah. Yeah.

Swyx [00:09:08]: And he never had any. No.

Andreas [00:09:10]: Yeah.

Swyx [00:09:11]: Sorry, I just had to, a lot of people are going through that phase and you kind of skipped over it. I was like, no, no, no, no. There's like an interesting story.

Jungwon [00:09:20]: Yeah.

Alessio [00:09:21]: Yeah. Before we jump into what Elicit is today, the history is a bit counterintuitive. So you start with figuring out, oh, if we had a super powerful model, how would we align it? But then you were actually like, well, let's just build the product so that people can actually leverage it. And I think there are a lot of folks today that are now back to where you were maybe five years ago that are like, oh, what if this happens rather than focusing on actually building something useful with it? What clicked for you to like move into Elicit and then we can cover that story too.

Andreas [00:09:49]: I think in many ways, the approach is still the same because the way we are building Elicit is not let's train a foundation model to do more stuff. It's like, let's build a scaffolding such that we can deploy powerful models to good ends. I think it's different now in that we actually have like some of the models to plug in. But if in 2017, we had had the models, we could have run the same experiments we did run with humans back then, just with models. And so in many ways, our philosophy is always, let's think ahead to the future of what models are going to exist in one, two years or longer. And how can we make it so that they can actually be deployed in kind of transparent, controllable

Jungwon [00:10:26]: ways? I think motivationally, we both are kind of product people at heart. The research was really important and it didn't make sense to build a product at that time. But at the end of the day, the thing that always motivated us is imagining a world where high quality reasoning is really abundant and AI is a technology that's going to get us there. And there's a way to guide that technology with research, but we can have a more direct effect through product because with research, you publish the research and someone else has to implement that into the product and the product felt like a more direct path. And we wanted to concretely have an impact on people's lives. Yeah, I think the kind of personally, the motivation was we want to build for people.

Swyx [00:11:03]: Yep. And then just to recap as well, like the models you were using back then were like, I don't know, would they like BERT type stuff or T5 or I don't know what timeframe we're talking about here.

Andreas [00:11:14]: I guess to be clear, at the very beginning, we had humans do the work. And then I think the first models that kind of made sense were GPT-2 and TNLG and like, yeah, early generative models. We do also use like T5-based models even now. Started with GPT-2.

Swyx [00:11:30]: Yeah, cool. I'm just kind of curious about like, how do you start so early? You know, like now it's obvious where to start, but back then it wasn't.

Jungwon [00:11:37]: Yeah, I used to nag Andreas a lot. I was like, why are you talking to this? I don't know. I felt like GPT-2 like clearly can't do anything. And I was like, Andreas, you're wasting your time, like playing with this toy. But yeah, he was right.

Alessio [00:11:50]: So what's the history of what Elicit actually does as a product? You recently announced that after four months, you got to a million in revenue. Obviously, a lot of people use it, get a lot of value, but it was initially kind of like structured data extraction from papers. Then you had kind of like concept grouping. And today, it's maybe like a more full stack research enabler, kind of like paper understander platform. What's the definitive definition of what Elicit is? And how did you get here?

Jungwon [00:12:15]: Yeah, we say Elicit is an AI research assistant. I think it will continue to evolve. That's part of why we're so excited about building in research, because there's just so much space. I think the current phase we're in right now, we talk about it as really trying to make Elicit the best place to understand what is known. So it's all a lot about like literature summarization. There's a ton of information that the world already knows. It's really hard to navigate, hard to make it relevant. So a lot of it is around document discovery and processing and analysis. I really kind of want to import some of the incredible productivity improvements we've seen in software engineering and data science and into research. So it's like, how can we make researchers like data scientists of text? That's why we're launching this new set of features called Notebooks. It's very much inspired by computational notebooks, like Jupyter Notebooks, you know, Deepnote or Colab, because they're so powerful and so flexible. And ultimately, when people are trying to get to an answer or understand insight, they're kind of like manipulating evidence and information. Today, that's all packaged in PDFs, which are super brittle. So with language models, we can decompose these PDFs into their underlying claims and evidence and insights, and then let researchers mash them up together, remix them and analyze them together. So yeah, I would say quite simply, overall, Elicit is an AI research assistant. Right now we're focused on text-based workflows, but long term, really want to kind of go further and further into reasoning and decision making.

Alessio [00:13:35]: And when you say AI research assistant, this is kind of meta research. So researchers use Elicit as a research assistant. It's not a generic you-can-research-anything type of tool, or it could be, but like, what are people using it for today?

Andreas [00:13:49]: Yeah. So specifically in science, a lot of people use human research assistants to do things. You tell your grad student, hey, here are a couple of papers. Can you look at all of these, see which of these have kind of sufficiently large populations and actually study the disease that I'm interested in, and then write out like, what are the experiments they did? What are the interventions they did? What are the outcomes? And kind of organize that for me. And the first phase of understanding what is known really focuses on automating that workflow because a lot of that work is pretty rote work. I think it's not the kind of thing that we need humans to do. Language models can do it. And then if language models can do it, you can obviously scale it up much more than a grad student or undergrad research assistant would be able to do.

Jungwon [00:14:31]: Yeah. The use cases are pretty broad. So we do have a very large percent of our users are just using it personally or for a mix of personal and professional things. People who care a lot about health or biohacking or parents who have children with a kind of rare disease and want to understand the literature directly. So there is an individual kind of consumer use case. We're most focused on the power users. So that's where we're really excited to build. So Elicit was very much inspired by this workflow in literature called systematic reviews or meta-analysis, which is basically the human state of the art for summarizing scientific literature. And it typically involves like five people working together for over a year. And they kind of first start by trying to find the maximally comprehensive set of papers possible. So it's like 10,000 papers. And they kind of systematically narrow that down to like hundreds or 50, extract key details from every single paper. Usually have two people doing it, like a third person reviewing it. So it's like an incredibly laborious, time consuming process, but you see it in every single domain. So in science, in machine learning, in policy, because it's so structured and designed to be reproducible, it's really amenable to automation. So that's kind of the workflow that we want to automate first. And then you make that accessible for any question and make these really robust living summaries of science. So yeah, that's one of the workflows that we're starting with.

Alessio [00:15:49]: Our previous guest, Mike Conover, he's building a new company called Brightwave, which is an AI research assistant for financial research. How do you see the future of these tools? Does everything converge to like a God researcher assistant, or is every domain going to have its own thing?

Andreas [00:16:03]: I think that's a good and mostly open question. I do think there are some differences across domains. For example, some research is more quantitative data analysis, and other research is more high level cross domain thinking. And we definitely want to contribute to the broad generalist reasoning type space. Like if researchers are making discoveries often, it's like, hey, this thing in biology is actually analogous to like these equations in economics or something. And that's just fundamentally a thing where you need to reason across domains. At least within research, I think there will be like one best platform more or less for this type of generalist research. I think there may still be like some particular tools like for genomics, like particular types of modules of genes and proteins and whatnot. But for a lot of the kind of high level reasoning that humans do, I think that is more of a winner-take-all type thing.

Swyx [00:16:52]: I wanted to ask a little bit deeper about, I guess, the workflow that you mentioned. I like that phrase. I see that in your UI now, but that's as it is today. And I think you were about to tell us about how it was in 2021 and how it may be progressed. How has this workflow evolved over time?

Jungwon [00:17:07]: Yeah. So the very first version of Elicit actually wasn't even a research assistant. It was a forecasting assistant. So we set out and we were thinking about, you know, what are some of the most impactful types of reasoning that, if we could scale up, AI would really transform the world. We actually started with literature review, but we're like, oh, so many people are going to build literature review tools. So let's not start there. So then we focused on geopolitical forecasting. So I don't know if you're familiar with like Manifold or Manifold Markets. That kind of stuff. Before Manifold. Yeah. Yeah. We're not predicting relationships. We're predicting like, is China going to invade Taiwan?

Swyx [00:17:38]: Markets for everything.

Andreas [00:17:39]: Yeah. That's a relationship.

Swyx [00:17:41]: Yeah.

Jungwon [00:17:42]: Yeah. It's true. And then we worked on that for a while. And then after GPT-3 came out, I think by that time we realized that originally we were trying to help people convert their beliefs into probability distributions. And so take fuzzy beliefs, but like model them more concretely. And then after a few months of iterating on that, just realize, oh, the thing that's blocking people from making interesting predictions about important events in the world is less kind of on the probabilistic side and much more on the research side. And so that kind of combined with the very generalist capabilities of GPT-3 prompted us to make a more general research assistant. Then we spent a few months iterating on what even is a research assistant. So we would embed with different researchers. We built data labeling workflows in the beginning, kind of right off the bat. We built ways to find experts in a field and like ways to ask good research questions. So we just kind of iterated through a lot of workflows and no one else was really building at this time. And it was like very quick to just do some prompt engineering and see like what is a task that is at the intersection of what's technologically capable and like important for researchers. And we had like a very nondescript landing page. It said nothing. But somehow people were signing up and we had a sign-up form that was like, why are you here? And everyone was like, I need help with literature review. And we're like, oh, literature review. That sounds so hard. I don't even know what that means. We're like, we don't want to work on it. But then eventually we were like, okay, everyone is saying literature review. It's overwhelmingly people want to-

Swyx [00:19:02]: And all domains, not like medicine or physics or just all domains. Yeah.

Jungwon [00:19:06]: And we also kind of personally knew literature review was hard. And if you look at the graphs for academic literature being published every single month, you guys know this in machine learning, it's like up into the right, like superhuman amounts of papers. So we're like, all right, let's just try it. I was really nervous, but Andreas was like, this is kind of like the right problem space to jump into, even if we don't know what we're doing. So my take was like, fine, this feels really scary, but let's just launch a feature every single week and double our user numbers every month. And if we can do that, we'll fail fast and we will find something. I was worried about like getting lost in the kind of academic white space. So the very first version was actually a weekend prototype that Andreas made. Do you want to explain how that worked?

Andreas [00:19:44]: I mostly remember that it was really bad. The thing I remember is you entered a question and it would give you back a list of claims. So your question could be, I don't know, how does creatine affect cognition? It would give you back some claims that are to some extent based on papers, but they were often irrelevant. The papers were often irrelevant. And so we ended up soon just printing out a bunch of examples of results and putting them up on the wall so that we would kind of feel the constant shame of having such a bad product and would be incentivized to make it better. And I think over time it has gotten a lot better, but I think the initial version was like really very bad. Yeah.

Jungwon [00:20:20]: But it was basically like a natural language summary of an abstract, like kind of a one sentence summary, and which we still have. And then as we learned kind of more about this systematic review workflow, we started expanding the capability so that you could extract a lot more data from the papers and do more with that.

Swyx [00:20:33]: And were you using like embeddings and cosine similarity, that kind of stuff for retrieval, or was it keyword based?

Andreas [00:20:40]: I think the very first version didn't even have its own search engine. I think the very first version probably used the Semantic Scholar API or something similar. And only later, when we discovered that API is not very semantic, we then built our own search engine, which has helped a lot.

Swyx [00:20:58]: And then we're going to go into like more recent product stuff, but like, you know, I think you seem like the more sort of startup-oriented business person and you seem sort of more ideologically like interested in research, obviously, because of your PhD. What kind of market sizing were you guys thinking? Right? Like, because you're here saying like, we have to double every month. And I'm like, I don't know how you make that conclusion from this, right? Especially also as a nonprofit at the time.

Jungwon [00:21:22]: I mean, market size wise, I felt like in this space where so much was changing and it was very unclear what of today was actually going to be true tomorrow. We just like really rested a lot on very, very simple fundamental principles, which is like, if you can understand the truth, that is very economically beneficial and valuable. If you like know the truth.

Swyx [00:21:42]: On principle.

Jungwon [00:21:43]: Yeah. That's enough for you. Yeah. Research is the key to many breakthroughs that are very commercially valuable.

Swyx [00:21:47]: Because my version of it is students are poor and they don't pay for anything. Right? But that's obviously not true. As you guys have found out. But you had to have some market insight for me to have believed that, but you skipped that.

Andreas [00:21:58]: Yeah. I remember talking to VCs for our seed round. A lot of VCs were like, you know, researchers, they don't have any money. Why don't you build a legal assistant? I think in some short sighted way, maybe that's true. But I think in the long run, R&D is such a big space of the economy. I think if you can substantially improve how quickly people find new discoveries or avoid controlled trials that don't go anywhere, I think that's just huge amounts of money. And there are a lot of questions obviously about between here and there. But I think as long as the fundamental principle is there, we were okay with that. And I guess we found some investors who also were. Yeah.

Swyx [00:22:35]: Congrats. I mean, I'm sure we can cover the sort of flip later. I think you're about to start us on like GPT-3 and how that changed things for you. It's funny. I guess every major GPT version, you have some big insight. Yeah.

Jungwon [00:22:48]: Yeah. I mean, what do you think?

Andreas [00:22:51]: I think it's a little bit less true for us than for others, because we always believed that there will basically be human level machine work. And so it is definitely true that in practice for your product, as new models come out, your product starts working better, you can add some features that you couldn't add before. But I don't think we really ever had the moment where we were like, oh, wow, that is super unanticipated. We need to do something entirely different now from what was on the roadmap.

Jungwon [00:23:21]: I think GPT-3 was a big change because it kind of said, oh, now is the time that we can use AI to build these tools. And then GPT-4 was maybe a little bit more of an extension of GPT-3. GPT-3 over GPT-2 was like qualitative level shift. And then GPT-4 was like, okay, great. Now it's like more accurate. We're more accurate on these things. We can answer harder questions. But the shape of the product had already taken place by that time.

Swyx [00:23:44]: I kind of want to ask you about this sort of pivot that you've made. But I guess that was just a way to sell what you were doing, which is you're adding extra features on grouping by concepts. The GPT-4 pivot, quote unquote pivot that you-

Jungwon [00:23:55]: Oh, yeah, yeah, exactly. Right, right, right. Yeah. Yeah. When we launched this workflow, now that GPT-4 was available, basically Elicit was at a place where we have very tabular interfaces. So given a table of papers, you can extract data across all the tables. But you kind of want to take the analysis a step further. Sometimes what you'd care about is not having a list of papers, but a list of arguments, a list of effects, a list of interventions, a list of techniques. And so that's one of the things we're working on is now that you've extracted this information in a more structured way, can you pivot it or group by whatever information you extracted to have more insight-first information still supported by the academic literature?

Swyx [00:24:33]: Yeah, that was a big revelation when I saw it. Basically, I think I'm very just impressed by how first principles, your ideas around what the workflow is. And I think that's why you're not as reliant on like the LLM improving, because it's actually just about improving the workflow that you would recommend to people. Today we might call it an agent, I don't know, but you're not relying on the LLM to drive it. It's relying on this is the way that Elicit does research. And this is what we think is most effective based on talking to our users.

Jungwon [00:25:01]: The problem space is still huge. Like if it's like this big, we are all still operating at this tiny part, bit of it. So I think about this a lot in the context of moats, people are like, oh, what's your moat? What happens if GPT-5 comes out? It's like, if GPT-5 comes out, there's still like all of this other space that we can go into. So I think being really obsessed with the problem, which is very, very big, has helped us like stay robust and just kind of directly incorporate model improvements and they keep going.

Swyx [00:25:26]: And then I first encountered you guys with Charlie, you can tell us about that project. Basically, yeah. Like how much did cost become a concern as you're working more and more with OpenAI? How do you manage that relationship?

Jungwon [00:25:37]: Let me talk about who Charlie is. And then you can talk about the tech, because Charlie is a special character. So Charlie, when we found him, had just finished his freshman year at the University of Warwick. And I think he had heard about us on some Discord. And then he applied and we were like, wow, who is this freshman? And then we just saw that he had done so many incredible side projects. And we were actually on a team retreat in Barcelona visiting our head of engineering at that time. And everyone was talking about this wonder kid or like this kid. And then on our take-home project, he had done like the best of anyone to that point. And so people were just like so excited to hire him. So we hired him as an intern and they were like, Charlie, what if you just dropped out of school? And so then we convinced him to take a year off. And he was just incredibly productive. And I think the thing you're referring to is at the start of 2023, Anthropic kind of launched their constitutional AI paper. And within a few days, I think four days, he had basically implemented that in production. And then we had it in app a week or so after that. And he has since kind of contributed to major improvements, like cutting costs down to a tenth of what they were at really large scale. But yeah, you can talk about the technical stuff. Yeah.

Andreas [00:26:39]: On the constitutional AI project, this was for abstract summarization, where in Elicit, if you run a query, it'll return papers to you, and then it will summarize each paper with respect to your query for you on the fly. And that's a really important part of Elicit because Elicit does it so much. If you run a few searches, it'll have done it a few hundred times for you. And so we cared a lot about this both being fast, cheap, and also very low on hallucination. I think if Elicit hallucinates something about the abstract, that's really not good. And so what Charlie did in that project was create a constitution that expressed what are the attributes of a good summary: everything in the summary is reflected in the actual abstract, and it's like very concise, et cetera, et cetera. And then used RLHF with a model that was trained on the constitution to basically fine tune a better summarizer on an open source model. Yeah. I think that might still be in use.
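
For readers unfamiliar with the technique, here is a rough sketch of the critique-and-revise half of constitutional AI applied to query-focused abstract summarization. It is illustrative pseudocode, not Elicit's or Anthropic's implementation, and it leaves out the preference/RL fine-tuning step Andreas mentions.

```python
# Rough sketch of constitution-guided summary revision. Pairs of (draft, revised)
# summaries produced this way can later be used to fine-tune a small open-source
# summarizer so the critique loop isn't needed at inference time. Illustrative only.

CONSTITUTION = [
    "Every claim in the summary must be stated in the abstract.",
    "The summary must directly address the user's query.",
    "The summary must be concise (one or two sentences).",
]

def revise_summary(query: str, abstract: str, draft: str, llm) -> str:
    for principle in CONSTITUTION:
        critique = llm(
            f"Principle: {principle}\nQuery: {query}\nAbstract: {abstract}\n"
            f"Summary: {draft}\nDoes the summary violate the principle? Explain briefly."
        )
        draft = llm(
            f"Rewrite the summary so it satisfies the principle.\n"
            f"Principle: {principle}\nCritique: {critique}\n"
            f"Abstract: {abstract}\nSummary: {draft}\nRevised summary:"
        )
    return draft
```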

Jungwon [00:27:34]: Yeah. Yeah, definitely. Yeah. I think at the time, the models hadn't been trained at all to be faithful to a text. So they were just generating. So then when you ask them a question, they tried too hard to answer the question and didn't try hard enough to answer the question given the text or answer what the text said about the question. So we had to basically teach the models to do that specific task.

Swyx [00:27:54]: How do you monitor the ongoing performance of your models? Not to get too LLM-opsy, but you are one of the larger, more well-known operations doing NLP at scale. I guess effectively, you have to monitor these things, and nobody I talk to has a good answer.

Andreas [00:28:10]: I don't think we have a good answer yet. I think the answers are actually a little bit clearer on the just kind of basic robustness side of where you can import ideas from normal software engineering and normal kind of DevOps. You're like, well, you need to monitor kind of latencies and response times and uptime and whatnot.

Swyx [00:28:27]: I think when we say performance, it's more about hallucination rate, isn't it?

Andreas [00:28:30]: And then things like hallucination rate where I think there, the really important thing is training time. So we care a lot about having our own internal benchmarks for model development that reflect the distribution of user queries so that we can know ahead of time how well is the model going to perform on different types of tasks. So the tasks being summarization, question answering, given a paper, ranking. And for each of those, we want to know what's the distribution of things the model is going to see so that we can have well-calibrated predictions on how well the model is going to do in production. And I think, yeah, there's some chance that there's distribution shift and actually the things users enter are going to be different. But I think that's much less important than getting the kind of training right and having very high quality, well-vetted data sets at training time.
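
A minimal sketch of the kind of task-stratified eval suite Andreas describes, assuming the suite is organized by task type and built to mirror the production query distribution. This is a hypothetical structure, not Elicit's internal benchmarks.

```python
# Hypothetical sketch of a task-stratified eval harness: score each task type
# (summarization, QA over a paper, ranking, ...) separately so you can predict
# production behavior per task before shipping a new model.

def run_eval(model, suites: dict[str, list[dict]], score) -> dict[str, float]:
    results = {}
    for task, examples in suites.items():
        scores = [score(model(ex["input"]), ex["expected"]) for ex in examples]
        results[task] = sum(scores) / len(scores)
    return results

# candidate_results = run_eval(candidate_model, suites, score)
# Compare against the production model's results before switching over.
```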

Jungwon [00:29:18]: I think we also end up effectively monitoring by trying to evaluate new models as they come out. And so that kind of prompts us to go through our eval suite every couple of months. And every time a new model comes out, we have to see how is this performing relative to production and what we currently have.

Swyx [00:29:32]: Yeah. I mean, since we're on this topic, any new models that have really caught your eye this year?

Jungwon [00:29:37]: Like Claude came out with a bunch. Yeah. I think Claude is pretty, I think the team's pretty excited about Claude. Yeah.

Andreas [00:29:41]: Specifically, Claude Haiku is like a good point on the kind of Pareto frontier. It's neither the cheapest model, nor is it the most accurate, most high quality model, but it's just like a really good trade-off between cost and accuracy.

Swyx [00:29:57]: You apparently have to 10-shot it to make it good. I tried using Haiku for summarization, but zero-shot was not great. Then they were like, you know, it's a skill issue, you have to try harder.

Jungwon [00:30:07]: I think GPT-4 unlocked tables for us, processing data from tables, which was huge. GPT-4 Vision.

Andreas [00:30:13]: Yeah.

Swyx [00:30:14]: Yeah. Did you try like Fuyu? I guess you can't try Fuyu because it's non-commercial. That's the Adept model.

Jungwon [00:30:19]: Yeah.

Swyx [00:30:20]: We haven't tried that one. Yeah. Yeah. Yeah. But Claude is multimodal as well. Yeah. I think the interesting insight that we got from talking to David Luan, who is CEO of Adept, is that multimodality has effectively two different flavors. One is recognizing images from a camera in the outside natural world. And actually the more important multimodality for knowledge work is screenshots and PDFs and charts and graphs. So we need a new term for that kind of multimodality.

Andreas [00:30:45]: But is the claim that current models are good at one or the other? Yeah.

Swyx [00:30:50]: They're over-indexed because the history of computer vision is COCO, right? So now we're like, oh, actually, you know, screens are more important, OCR, handwriting. You mentioned a lot of like closed model lab stuff, and then you also have like this open source model fine tuning stuff. Like what is your workload now between closed and open? It's a good question.

Andreas [00:31:07]: I think- Is it half and half? It's a-

Swyx [00:31:10]: Is that even a relevant question or not? Is this a nonsensical question?

Andreas [00:31:13]: It depends a little bit on like how you index, whether you index by like compute cost or number of queries. I'd say like in terms of number of queries, it's maybe similar. In terms of like cost and compute, I think the closed models make up more of the budget since the main cases where you want to use closed models are cases where they're just smarter, where no existing open source models are quite smart enough.

Jungwon [00:31:35]: Yeah. Yeah.

Alessio [00:31:37]: We have a lot of interesting technical questions to go in, but just to wrap the kind of like UX evolution, now you have the notebooks. We talked a lot about how chatbots are not the final frontier, you know? How did you decide to get into notebooks, which is a very iterative kind of like interactive interface and yeah, maybe learnings from that.

Jungwon [00:31:56]: Yeah. This is actually our fourth time trying to make this work. Okay. I think the first time was probably in early 2021. I think because we've always been obsessed with this idea of task decomposition and like branching, we always wanted a tool that could be kind of unbounded where you could keep going, could do a lot of branching where you could kind of apply language model operations or computations on other tasks. So in 2021, we had this thing called composite tasks where you could use GPT-3 to brainstorm a bunch of research questions and then take each research question and decompose those further into sub questions. This kind of, again, that like task decomposition tree type thing was always very exciting to us, but that was like, it didn't work and it was kind of overwhelming. Then at the end of 22, I think we tried again and at that point we were thinking, okay, we've done a lot with this literature review thing. We also want to start helping with kind of adjacent domains and different workflows. Like we want to help more with machine learning. What does that look like? And as we were thinking about it, we're like, well, there are so many research workflows. How do we not just build three new workflows into Elicit, but make Elicit really generic to lots of workflows? What is like a generic composable system with nice abstractions that can like scale to all these workflows? So we like iterated on that a bunch and then didn't quite narrow the problem space enough or like quite get to what we wanted. And then I think it was at the beginning of 2023 where we're like, wow, computational notebooks kind of enable this, where they have a lot of flexibility, but kind of robust primitives such that you can extend the workflow and it's not limited. It's not like you ask a query, you get an answer, you're done. You can just constantly keep building on top of that. And each little step seems like a really good unit of work for the language model. And also there was just like really helpful to have a bit more preexisting work to emulate. Yeah, that's kind of how we ended up at computational notebooks for Elicit.

Andreas [00:33:44]: Maybe one thing that's worth making explicit is the difference between computational notebooks and chat, because on the surface, they seem pretty similar. It's kind of this iterative interaction where you add stuff. In both cases, you have a back and forth between you enter stuff and then you get some output and then you enter stuff. But the important difference in our minds is with notebooks, you can define a process. So in data science, you can be like, here's like my data analysis process that takes in a CSV and then does some extraction and then generates a figure at the end. And you can prototype it using a small CSV and then you can run it over a much larger CSV later. And similarly, the vision for notebooks in our case is to not make it this like one-off chat interaction, but to allow you to then say, if you start and first you're like, okay, let me just analyze a few papers and see, do I get to the correct conclusions for those few papers? Can I then later go back and say, now let me run this over 10,000 papers now that I've debugged the process using a few papers. And that's an interaction that doesn't fit quite as well into the chat framework because that's more for kind of quick back and forth interaction.
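
A tiny sketch of that "debug on a few, then run on many" idea: the process is defined once, applied first to a small sample and later to the full corpus. Names are hypothetical; this is not how Elicit's notebooks are implemented.

```python
# Illustrative sketch: the same process definition, prototyped on a handful of
# papers and then scaled to the full set once it looks right.

def process(paper: str, llm) -> dict:
    summary = llm(f"Summarize this paper in one sentence:\n{paper}")
    verdict = llm(f"Does this paper support the hypothesis? Answer yes or no.\n{paper}")
    return {"summary": summary, "verdict": verdict}

def run(papers: list[str], llm) -> list[dict]:
    return [process(p, llm) for p in papers]

# results_small = run(papers[:5], llm)   # inspect and debug the process
# results_full = run(papers, llm)        # then run it over 10,000 papers
```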

Alessio [00:34:49]: Do you think in notebooks, it's kind of like structured, editable chain of thought, basically step by step? Like, is that kind of where you see this going? And then are people going to reuse notebooks as like templates? And maybe in traditional notebooks, it's like cookbooks, right? You share a cookbook, you can start from there. Is this similar in Elicit?

Andreas [00:35:06]: Yeah, that's exactly right. So that's our hope that people will build templates, share them with other people. I think chain of thought is maybe still like kind of one level lower on the abstraction hierarchy than we would think of notebooks. I think we'll probably want to think about more semantic pieces like a building block is more like a paper search or an extraction or a list of concepts. And then the model's detailed reasoning will probably often be one level down. You always want to be able to see it, but you don't always want it to be front and center.

Alessio [00:35:36]: Yeah, what's the difference between a notebook and an agent? Since everybody always asks me, what's an agent? Like how do you think about where the line is?

Andreas [00:35:44]: Yeah, it's an interesting question. In the notebook world, I would generally think of the human as the agent in the first iteration. So you have the notebook and the human kind of adds little action steps. And then the next point on this kind of progress gradient is, okay, now you can use language models to predict which action would you take as a human. And at some point, you're probably going to be very good at this, you'll be like, okay, in some cases I can, with 99.9% accuracy, predict what you do. And then you might as well just execute it, like why wait for the human? And eventually, as you get better at this, that will just look more and more like agents taking actions as opposed to you doing the thing. I think templates are a specific case of this where you're like, okay, well, there's just particular sequences of actions that you often want to chunk and have available as primitives, just like in normal programming. And those, you can view them as action sequences of agents, or you can view them as more normal programming language abstraction thing. And I think those are two valid views. Yeah.

Alessio [00:36:40]: How do you see this change as, like you said, the models get better and you need less and less human actual interfacing with the model, you just get the results? Like how does the UX and the way people perceive it change?

Jungwon [00:36:52]: Yeah, I think this kind of interaction paradigm for evaluation is not really something the internet has encountered yet, because up to now, the internet has all been about getting data and work from people. So increasingly, I really want kind of evaluation, both from an interface perspective and from like a technical perspective and operation perspective, to be a superpower for Elicit, because I think over time, models will do more and more of the work, and people will have to do more and more of the evaluation. So I think, yeah, in terms of the interface, some of the things we have today, you know, for every kind of language model generation, there's some citation back, and we kind of try to highlight the ground truth in the paper that is most relevant to whatever Elicit said, and make it super easy so that you can click on it and quickly see in context and validate whether the text actually supports the answer that Elicit gave. So I think we'd probably want to scale things up like that, like the ability to kind of spot check the model's work super quickly, scale up interfaces like that. And-

Swyx [00:37:44]: Who would spot check? The user?

Jungwon [00:37:46]: Yeah, to start, it would be the user. One of the other things we do is also kind of flag the model's uncertainty. So we have models report out, how confident are you that this was the sample size of this study? If the model's not sure, we throw a flag. And so the user knows to prioritize checking that. So again, we can kind of scale that up. So when the model's like, well, I searched this on Google, I'm not sure if that was the right thing, I have an uncertainty flag, and the user can go and be like, oh, okay, that was actually the right thing to do or not.

Swyx [00:38:10]: I've tried to do uncertainty readings from models. I don't know if you have this live. You do? Yeah. Because I just didn't find them reliable because they just hallucinated their own uncertainty. I would love to base it on log probs or something more native within the model rather than generated. But okay, it sounds like they scale properly for you. Yeah.

Jungwon [00:38:30]: We found it to be pretty calibrated. It varies on the model.

Andreas [00:38:32]: I think in some cases, we also use two different models for the uncertainty estimates than for the question answering. So one model would say, here's my chain of thought, here's my answer. And then a different type of model. Let's say the first model is Llama, and let's say the second model is GPT-3.5. And then the second model just looks over the results and is like, okay, how confident are you in this? And I think sometimes using a different model can be better than using the same model. Yeah.
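
A small sketch of the two-model setup Andreas describes: one model answers, a different model rates its confidence, and low-confidence items get flagged for the user to spot-check. The prompts, threshold, and parsing are illustrative assumptions, not Elicit's code.

```python
# Sketch of a second-model confidence check that flags uncertain answers for review.
# Illustrative only.

def answer_with_flag(question: str, context: str, answer_llm, judge_llm,
                     threshold: float = 0.7) -> dict:
    answer = answer_llm(f"Context: {context}\nQuestion: {question}\nAnswer:")
    rating = judge_llm(
        f"Context: {context}\nQuestion: {question}\nProposed answer: {answer}\n"
        "On a scale from 0 to 1, how confident are you that the answer is supported "
        "by the context? Reply with a number only."
    )
    try:
        confidence = float(rating.strip())
    except ValueError:
        confidence = 0.0  # unparseable rating: treat as uncertain
    return {"answer": answer, "needs_review": confidence < threshold}
```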

Swyx [00:38:58]: On the topic of models, evaluating models, obviously you can do that all day long. What's your budget? Because your queries fan out a lot. And then you have models evaluating models. One person typing in a question can lead to a thousand calls.

Andreas [00:39:11]: It depends on the project. So if the project is basically a systematic review that otherwise human research assistants would do, then the project is basically a human equivalent spend. And the spend can get quite large for those projects. I don't know, let's say $100,000. In those cases, you're happier to spend compute than in the kind of shallow search case where someone just enters a question because, I don't know, maybe I heard about creatine. What's it about? Probably don't want to spend a lot of compute on that. This sort of being able to invest more or less compute into getting more or less accurate answers is I think one of the core things we care about. And that I think is currently undervalued in the AI space. I think currently you can choose which model you want and you can sometimes, I don't know, you'll tip it and it'll try harder or you can try various things to get it to work harder. But you don't have great ways of converting willingness to spend into better answers. And we really want to build a product that has this sort of unbounded flavor where if you care about it a lot, you should be able to get really high quality answers, really double checked in every way.

Alessio [00:40:14]: And you have a credits-based pricing. So unlike most products, it's not a fixed monthly fee.

Jungwon [00:40:19]: Right, exactly. So some of the higher costs are tiered. So for most casual users, they'll just get the abstract summary, which is kind of an open source model. Then you can add more columns, which have more extractions and these uncertainty features. And then you can also add the same columns in high accuracy mode, which also parses the table. So we kind of stack the complexity on the calls.

Swyx [00:40:39]: You know, the fun thing you can do with a credit system, which is data for data, basically you can give people more credits if they give data back to you. I don't know if you've already done that. We've thought about something like this.

Jungwon [00:40:49]: It's like if you don't have money, but you have time, how do you exchange that?

Swyx [00:40:54]: It's a fair trade.

Jungwon [00:40:55]: I think it's interesting. We haven't quite operationalized it. And then, you know, there's been some kind of like adverse selection. Like, you know, for example, it would be really valuable to get feedback on our model. So maybe if you were willing to give more robust feedback on our results, we could give you credits or something like that. But then there's kind of this, will people take it seriously? And you want the good people. Exactly.

Swyx [00:41:11]: Can you tell who are the good people? Not right now.

Jungwon [00:41:13]: But yeah, maybe at the point where we can, we can offer it. We can offer it up to them.

Swyx [00:41:16]: The perplexity of questions asked, you know, if it's higher perplexity, these are the smarter

Jungwon [00:41:20]: people. Yeah, maybe.

Andreas [00:41:23]: If you put typos in your queries, you're not going to get off the stage.

Swyx [00:41:28]: Negative social credit. It's very topical right now to think about the threat of long context windows. All these models that we're talking about these days, all like a million token plus. Is that relevant for you? Can you make use of that? Is that just prohibitively expensive because you're just paying for all those tokens or you're just doing rag?

Andreas [00:41:44]: It's definitely relevant. And when we think about search, as many people do, we think about kind of a staged pipeline of retrieval where first you use a semantic search database with embeddings to get, in our case, maybe the 400 or so most relevant papers. And then you still need to rank those. And I think at that point it becomes pretty interesting to use larger models. So specifically in the past, I think a lot of ranking was kind of per-item ranking where you would score each individual item, maybe using increasingly expensive scoring methods, and then rank based on the scores. But I think list-wise re-ranking where you have a model that can see all the elements is a lot more powerful because often you can only really tell how good a thing is in comparison to other things, and what things should come first really depends on, well, what other things are available. Maybe you even care about diversity in your results. You don't want to show 10 very similar papers as the first 10 results. So I think long context models are quite interesting there. And especially for our case where we care more about power users who are perhaps a little bit more willing to wait a little bit longer to get higher quality results relative to people who just quickly check out things because why not? And I think being able to spend more on longer contexts is quite valuable.
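
A minimal sketch of the staged pipeline Andreas outlines: cheap embedding search narrows the corpus, then a long-context model re-ranks the whole candidate list at once. The `embed` and `call_long_context_model` helpers are assumptions, and the 400-paper cutoff simply mirrors the number mentioned above.

```python
# Sketch of a two-stage search pipeline: cheap embedding retrieval narrows a large
# corpus to a few hundred candidates, then a long-context model re-ranks the whole
# list at once so it can trade off relevance and diversity across results.
# `embed` and `call_long_context_model` are hypothetical helpers.

import math


def embed(text: str) -> list[float]:
    raise NotImplementedError  # placeholder for an embedding model call


def call_long_context_model(prompt: str) -> str:
    raise NotImplementedError  # placeholder for a long-context LLM call


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def search(query: str, corpus: dict[str, str], k: int = 400, top_n: int = 10) -> list[str]:
    # Stage 1: per-item scoring against the query embedding (cheap, scalable).
    q = embed(query)
    candidates = sorted(
        corpus, key=lambda pid: cosine(q, embed(corpus[pid])), reverse=True
    )[:k]

    # Stage 2: list-wise re-ranking -- the model sees every candidate at once.
    listing = "\n".join(f"[{i}] {corpus[pid][:500]}" for i, pid in enumerate(candidates))
    ranked = call_long_context_model(
        f"Query: {query}\n\nCandidates:\n{listing}\n\n"
        f"Return the indices of the {top_n} best results, most relevant first, "
        "avoiding near-duplicates. Reply with comma-separated integers."
    )
    order = [int(i) for i in ranked.split(",") if i.strip().isdigit()]
    return [candidates[i] for i in order[:top_n] if i < len(candidates)]
```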

Jungwon [00:42:55]: Yeah. I think one thing the longer context models changed for us is maybe a focus from breaking down tasks to breaking down the evaluation. So before, you know, if we wanted to answer a question from the full text of a paper, we had to figure out how to chunk it and like find the relevant chunk and then answer based on that chunk. And the nice thing was you then knew, you know, which chunk the model used to answer the question. So if you want to help the user track it, yeah, you can be like, well, this was the chunk that the model got. And now if you put in the whole text of the paper, you have to like kind of find the chunk more retroactively, basically. And so you need kind of like a different set of abilities and obviously like a different technology to figure that out. You still want to point the user to the supporting quotes in the text, but then the interaction is a little different.
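
One crude way to do the retroactive matching Jungwon describes is lexical overlap over sentence windows. The sketch below uses plain word overlap as a stand-in; the matching method Elicit actually uses is not specified in the conversation.

```python
# Sketch of finding the supporting quote *after* answering from the full text:
# split the paper into overlapping sentence windows and return the one with the
# highest word overlap with the answer. Plain overlap is a stand-in for a more
# sophisticated matching method.

import re


def split_passages(full_text: str, window: int = 4) -> list[str]:
    """Group sentences into overlapping windows to use as candidate quotes."""
    sentences = re.split(r"(?<=[.!?])\s+", full_text)
    step = max(window // 2, 1)
    return [" ".join(sentences[i:i + window]) for i in range(0, len(sentences), step)]


def overlap_score(answer: str, passage: str) -> float:
    a = set(re.findall(r"\w+", answer.lower()))
    p = set(re.findall(r"\w+", passage.lower()))
    return len(a & p) / (len(a) or 1)


def find_supporting_quote(answer: str, full_text: str) -> str:
    passages = split_passages(full_text)
    return max(passages, key=lambda passage: overlap_score(answer, passage))
```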

Swyx [00:43:38]: You like scan through and find some ROUGE score floor.

Andreas [00:43:41]: I think there's an interesting space of almost research problems here because you would ideally make causal claims like, if this hadn't been in the text, the model wouldn't have said this thing. And maybe you can do expensive approximations to that where, like, I don't know, you just throw out a chunk of the paper and re-answer and see what happens. But hopefully there are better ways of doing that where you just get that kind of counterfactual information for free from the model.
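
The "expensive approximation" Andreas mentions can be sketched as a leave-one-chunk-out loop: drop each chunk, re-answer, and see which removals change the answer. The `answer_fn` callable below is an assumed question-answering hook, not a real API.

```python
# Sketch of the expensive approximation: drop one chunk at a time, re-answer,
# and record which removals change the answer. One extra model call per chunk,
# but it gives a rough counterfactual signal about which chunks mattered.
# `answer_fn` is a hypothetical question-answering callable.

from typing import Callable


def attribute_answer(
    question: str,
    chunks: list[str],
    answer_fn: Callable[[str, list[str]], str],
) -> list[int]:
    baseline = answer_fn(question, chunks)
    influential = []
    for i in range(len(chunks)):
        ablated = chunks[:i] + chunks[i + 1:]           # leave one chunk out
        # Exact string comparison is crude; a semantic similarity check would
        # be a gentler test of whether the answer really changed.
        if answer_fn(question, ablated) != baseline:
            influential.append(i)
    return influential
```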

Alessio [00:44:06]: Do you think at all about the cost of maintaining RAG versus just putting more tokens in the window? I think in software development, a lot of times people buy developer productivity things so that we don't have to worry about it. Context window is kind of the same, right? You have to maintain chunking and like RAG retrieval and like re-ranking and all of this versus I just shove everything into the context and like it costs a little more, but at least I don't have to do all of that. Is that something you thought about?

Jungwon [00:44:31]: I think we still like hit up against context limits enough that it's not really, do we still want to keep this RAG around? It's like we do still need it for the scale of the work that we're doing, yeah.

Andreas [00:44:41]: And I think there are different kinds of maintainability. In one sense, I think you're right that throw everything into the context window thing is easier to maintain because you just can swap out a model. In another sense, if things go wrong, it's harder to debug where like, if you know, here's the process that we go through to go from 200 million papers to an answer. And there are like little steps and you understand, okay, this is the step that finds the relevant paragraph or whatever it may be. You'll know which step breaks if the answers are bad, whereas if it's just like a new model version came out and now it suddenly doesn't find your needle in a haystack anymore, then you're like, okay, what can you do? You're kind of at a loss.
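
As a toy illustration of that debuggability argument, an explicit staged pipeline can record every intermediate result so you can see which stage broke. The stage names in the usage comment are invented for illustration.

```python
# Toy illustration of the debuggability argument: an explicit staged pipeline
# records every intermediate result, so when answers regress you can see which
# stage broke instead of staring at one opaque long-context call.

from typing import Any, Callable


def run_pipeline(question: str, steps: list[tuple[str, Callable[[Any], Any]]]) -> tuple[Any, dict]:
    trace: dict[str, Any] = {}
    state: Any = question
    for name, step in steps:
        state = step(state)
        trace[name] = state        # inspectable output for this stage
    return state, trace


# Usage sketch (hypothetical stages):
# answer, trace = run_pipeline(question, [
#     ("retrieve_papers", retrieve),
#     ("find_relevant_paragraphs", select_paragraphs),
#     ("draft_answer", draft),
# ])
# If answers go bad, diff trace["find_relevant_paragraphs"] across runs to see
# whether retrieval or drafting is the stage that broke.
```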

Alessio [00:45:21]: Let's talk a bit about, yeah, needle in a haystack and like maybe the opposite of it, which is like hard grounding. I don't know if that's like the best name to think about it, but I was using one of these chat-with-your-documents features and I put in the AMD MI300 specs and the new Blackwell chips from NVIDIA, and I was asking questions like, does the AMD chip support NVLink? And the response was like, oh, it doesn't say in the specs. But if you ask GPT-4 without the docs, it would tell you no, because NVLink is an NVIDIA technology.

Swyx [00:45:49]: It just says in the thing.

Alessio [00:45:53]: How do you think about that? Does using the context sometimes suppress the knowledge that the model has?

Andreas [00:45:57]: It really depends on the task because I think sometimes that is exactly what you want. So imagine you're a researcher, you're writing the background section of your paper and you're trying to describe what these other papers say. You really don't want extra information to be introduced there. In other cases where you're just trying to figure out the truth and you're giving the documents because you think they will help the model figure out what the truth is. I think you do want, if the model has a hunch that there might be something that's not in the papers, you do want to surface that. I think ideally you still don't want the model to just tell you, probably the ideal thing looks a bit more like agent control where the model can issue a query that then is intended to surface documents that substantiate its hunch. That's maybe a reasonable middle ground between model just telling you and model being fully limited to the papers you give it.
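
A rough sketch of that agent-control middle ground, with hypothetical `call_model` and `search_papers` helpers and an invented "HUNCH:" convention; this is one way to operationalize the idea, not Elicit's implementation.

```python
# Sketch of the "middle ground": the model answers strictly from the provided
# documents, but if it has a hunch the documents don't cover, it emits a search
# query instead of asserting the claim, and the extra sources are surfaced to
# the user. `call_model` and `search_papers` are hypothetical helpers.


def call_model(prompt: str) -> str:
    raise NotImplementedError  # placeholder for an LLM call


def search_papers(query: str) -> list[str]:
    raise NotImplementedError  # placeholder for a literature search


def answer_with_hunch_check(question: str, documents: list[str]) -> dict:
    context = "\n\n".join(documents)
    draft = call_model(
        f"Answer strictly from these documents:\n{context}\n\nQuestion: {question}\n"
        "If you suspect something relevant that the documents do not cover, "
        "end your reply with a line 'HUNCH: <search query>' instead of asserting it."
    )
    result = {"answer": draft, "extra_sources": []}
    if "HUNCH:" in draft:
        query = draft.split("HUNCH:", 1)[1].strip()
        result["extra_sources"] = search_papers(query)  # user decides what to trust
    return result
```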

Jungwon [00:46:44]: Yeah, I would say it's, they're just kind of different tasks right now. And the task that Elicit is mostly focused on is what do these papers say? But there's another task which is like, just give me the best possible answer and that give me the best possible answer sometimes depends on what do these papers say, but it can also depend on other stuff that's not in the papers. So ideally we can do both and then kind of do this overall task for you more going forward.

Alessio [00:47:08]: We see a lot of details, but just to zoom back out a little bit, what are maybe the most underrated features of Elicit, and what is one thing that maybe users have surprised you the most with in using it?

Jungwon [00:47:19]: I think the most powerful feature of Elicit is the ability to extract, add columns to this table, which effectively extracts data from all of your papers at once. It's well used, but there are kind of many different extensions of that that I think users are still discovering. So one is we let you give a description of the column. We let you give instructions of a column. We let you create custom columns. So we have like 30 plus predefined fields that users can extract, like what were the methods? What were the main findings? How many people were studied? And we actually show you basically the prompts that we're using to extract that from our predefined fields. And then you can fork this and you can say, oh, actually I don't care about the population of people. I only care about the population of rats. Like you can change the instruction. So I think users are still kind of discovering that there's both this predefined, easy to use default, but that they can extend it to be much more specific to them. And then they can also ask custom questions. One use case of that is you can start to create different column types that you might not expect. So instead of just creating generative answers, like a description of the methodology, you can say classify the methodology into a prospective study, a retrospective study, or a case study. And then you can filter based on that. It's like all using the same kind of technology and the interface, but it unlocks different workflows. So I think that the ability to ask custom questions, give instructions, and specifically use that to create different types of columns, like classification columns, is still pretty underrated. In terms of use case, I spoke to someone who works in medical affairs at a genomic sequencing company recently. So doctors kind of order these genomic tests, these sequencing tests, to kind of identify if a patient has a particular disease. This company helps them process it. And this person basically interacts with all the doctors and if the doctors have any questions. My understanding is that medical affairs is kind of like customer support or customer success in pharma. So this person like talks to doctors all day long. One of the things they started using Elicit for is like putting the results of their tests as the query. Like this test showed, you know, this percentage presence of this and 40% that and whatever, you know, what genes are present here or what's in this sample. And getting kind of a list of academic papers that would support their findings and using this to help doctors interpret their tests. So we talked about, okay, cool, like if we built, he's pretty interested in kind of doing a survey of infectious disease specialists and getting them to evaluate, you know, having them write up their answers, comparing it to Elicit's answers, trying to see can Elicit start being used to interpret the results of these diagnostic tests. Because the way they ship these tests to doctors is they report on a really wide array of things. He was saying that at a large, well-resourced hospital, like a city hospital, there might be a team of infectious disease specialists who can help interpret these results. But at under-resourced hospitals or more rural hospitals, the primary care physician can't interpret the test results, so then they can't order it, they can't use it, they can't help their patients with it. 
So thinking about an evidence-backed way of interpreting these tests is definitely kind of an extension of the product that I hadn't considered before. But yeah, the idea of using that to bring more access to physicians in all different parts of the country and helping them interpret complicated science is pretty cool.
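
Circling back to the custom-columns point at the start of this answer, here is a hypothetical sketch of how the same extraction call can power both generative and classification columns. The prompts and column names are invented for illustration, not Elicit's actual prompts.

```python
# Sketch of the "custom column" idea: the same extraction machinery produces
# either free-text answers or constrained classifications depending on the
# column's instruction. Column names and prompt wording are invented.


def call_model(prompt: str) -> str:
    raise NotImplementedError  # placeholder for an LLM call


COLUMNS = {
    "methodology": "Describe the study methodology in one or two sentences.",
    "study_type": (
        "Classify the methodology as exactly one of: "
        "prospective study, retrospective study, case study."
    ),
    "population": "What population was studied, and what was the sample size?",
}


def extract_columns(paper_text: str) -> dict[str, str]:
    return {
        name: call_model(f"{instruction}\n\nPaper:\n{paper_text}")
        for name, instruction in COLUMNS.items()
    }

# A classification column like "study_type" can then back a filter in the table UI.
```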

Alessio [00:50:28]: Yeah. We had Kanjun from Imbue on the podcast and we talked about better allocating scientific resources. How do you think about these use cases and maybe how Elicit can help drive more research? And do you see a world in which maybe the models actually do some of the research before suggesting it to us?

Andreas [00:50:45]: Yeah, I think that's very close to what we care about. Our product values are systematic, transparent, and unbounded. And I think to make research especially more systematic and unbounded, I think is basically the thing that's at stake here. So for example, I was recently talking to people in longevity and I think there isn't really one field of longevity, there are kind of different scientific subdomains that are surfacing various things that are related to longevity. And I think if you could more systematically say, look, here are all the different interventions we could do and here's the expected ROI of these experiments. Here's like the evidence so far that supports those being either likely to surface new information or not. Here's the cost of these experiments. I think you could be so much more systematic than science is today. I'd guess in like 10, 20 years we'll look back and it will be incredible how unsystematic science was back in the day.

Jungwon [00:51:35]: Our view is kind of have models catch up to expert humans today. Start with kind of novice humans and then increasingly expert humans. But we really want the models to earn their right to the expertise. So that's why we do things in this very step-by-step way. That's why we don't just like throw a bunch of data and apply a bunch of compute and hope we get good results. But obviously at some point you hope that once it's kind of earned its stripes, it can surpass human researchers. But I think that's where making sure that the model's processes are really explicit and transparent and that it's really easy to evaluate is important because if it does surpass human understanding, people will still need to be able to audit its work somehow or spot check its work somehow to be able to reliably trust it and use it. So yeah, that's kind of why the process-based approach is really important.

Andreas [00:52:20]: And on the question of will models do their own research, I think one feature that most currently don't have that will need to be better there is better world models. I think currently models are just not great at representing what's going on in a particular situation or domain in a way that allows them to come to interesting, surprising conclusions. I think they're very good at coming to conclusions that are nearby to conclusions that people have come to. They're not as good at kind of reasoning and making surprising connections maybe. And so having deeper models of what are the underlying structures of different domains, how they're related or not related, I think will be an important ingredient for models actually being able to make novel contributions.

Swyx [00:53:00]: On the topic of hiring more expert humans, you've hired some very expert humans. My friend Maggie Appleton joined you guys I think maybe a year ago-ish. In fact, I think you're doing an offsite and we're actually organizing our biggest AI UX meetup around whenever she's in town in San Francisco. How big is the team? How have you sort of transitioned your company into this sort of PBC and sort of the plan for the future?

Jungwon [00:53:21]: Yeah, we're 12 people now. About half of us are in the Bay Area and then distributed across the US and Europe, a mix of mostly kind of roles in engineering and product. Yeah, and I think that the transition to PBC was really not that eventful because, even as a nonprofit, we were already shipping every week, so very much operating as a product. Very much at the start, yeah. Yeah. And then I would say the kind of PBC component was to very explicitly say that we have a mission that we care a lot about. There are a lot of ways to make money. We think our mission will make us a lot of money, but we are going to be opinionated about how we make money. We're going to take the version of making a lot of money that's in line with our mission. But it's like all very convergent. Like Elicit is not going to make any money if it's a bad product, if it doesn't actually help you discover truth and do research more rigorously. So I think for us, the kind of mission and the success of the company are very intertwined. We're hoping to grow the team quite a lot this year. Probably some of our highest priority roles are in engineering, but also opening up roles more in design and product marketing, go to market. Yeah. Do you want to talk about the roles?

Andreas [00:54:23]: Yeah. Broadly, we're just looking for senior software engineers and don't need any particular AI expertise. A lot of it is just how do you build good orchestration for complex tasks? So we talked earlier about these are sort of notebooks, scaling up, task orchestration. And I think a lot of this looks more like traditional software engineering than it does look like machine learning research. And I think the people who are really good at building good abstractions, building applications that can kind of survive, even if some of their pieces break, like making reliable components out of unreliable pieces. I think those are the people that we're looking for.
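
A small sketch of what "reliable components out of unreliable pieces" can look like in practice, as a generic retry-plus-validation wrapper; this is an assumption about the pattern, not Elicit's actual code.

```python
# Sketch of "reliable components out of unreliable pieces": wrap a flaky step
# (an LLM call, a parser) with output validation and retries with backoff, so
# the orchestration layer above sees a component with predictable behavior.

import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def reliable(
    step: Callable[[], T],
    is_valid: Callable[[T], bool],
    retries: int = 3,
    backoff_seconds: float = 1.0,
) -> T:
    last_error: Optional[Exception] = None
    for attempt in range(retries):
        try:
            result = step()
            if is_valid(result):       # reject malformed output, not just exceptions
                return result
        except Exception as exc:       # deliberately broad for a sketch
            last_error = exc
        time.sleep(backoff_seconds * (2 ** attempt))
    raise RuntimeError(f"step failed after {retries} attempts") from last_error
```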

Swyx [00:54:57]: You know, that's exactly what I used to do. Have you explored the existing orchestration frameworks, Temporal, Airflow, Dagster, Prefect?

Andreas [00:55:05]: We've looked into them a little bit. I think we have some specific requirements around being able to stream work back very quickly to our users. Those could definitely be relevant. Okay.

Swyx [00:55:15]: Well, you're hiring. I'm sure we'll plug all the links. Thank you so much for coming. Any parting words? Any words of wisdom? Mottos you live by?

Jungwon [00:55:22]: I think it's a really important time for humanity. So I hope everyone listening to this podcast can think hard about exactly how they want to participate in this story. There's so much to build and we can be really intentional about what we align ourselves with. There are a lot of applications that are going to be really good for the world and a lot of applications that are not. And so, yeah, I hope people can take that seriously and kind of seize the moment. Yeah.

Swyx [00:55:46]: I love how intentional you guys have been. Thank you for sharing that story.

Jungwon [00:55:49]: Thank you. Yeah.

Andreas [00:55:51]: Thank you for coming on.

Jungwon [00:56:17]: Yeah. Thank you.



Get full access to Latent Space at www.latent.space/subscribe
2024-04-11
Link to episode

Latent Space Chats: NLW (Four Wars, GPT5), Josh Albrecht/Ali Rohde (TNAI), Dylan Patel/Semianalysis (Groq), Milind Naphade (Nvidia GTC), Personal AI (ft. Harrison Chase - LangFriend/LangMem)

Our next 2 big events are AI UX and the World's Fair. Join and apply to speak/sponsor!

Due to timing issues we didn't have an interview episode to share with you this week, but not to worry, we have more than enough "weekend special" content in the backlog for you to get your Latent Space fix, whether you like thinking about the big picture, or learning more about the pod behind the scenes, or talking Groq and GPUs, or AI Leadership, or Personal AI.

Enjoy!

AI Breakdown

The indefatigable NLW had us back on his show for an update on the Four Wars, covering Sora, Suno, and the reshaped GPT-4 Class Landscape:

and a longer segment on AI Engineering trends covering the future LLM landscape (Llama 3, GPT-5, Gemini 2, Claude 4), Open Source Models (Mistral, Grok), Apple and Meta's AI strategy, new chips (Groq, MatX) and the general movement from baby AGIs to vertical Agents:

Thursday Nights in AI

We're also including swyx's interview with Josh Albrecht and Ali Rohde to reintroduce swyx and Latent Space to a general audience, and engage in some spicy Q&A:

Dylan Patel on Groq

We hosted a private event with Dylan Patel of SemiAnalysis (our last pod here):

Not all of it could be released so we just talked about our Groq estimates:

Milind Naphade - Capital One

In relation to conversations at NeurIPS and Nvidia GTC and upcoming at World's Fair, we also enjoyed chatting with Milind Naphade about his AI Leadership work at IBM, Cisco, Nvidia, and now leading the AI Foundations org at Capital One. We covered:

* Milind's learnings from ~25 years in machine learning

* His first paper citation was 24 years ago

* Lessons from working with Jensen Huang for 6 years and being CTO of Metropolis

* Thoughts on relevant AI research

* GTC takeaways and what makes NVIDIA special

If you'd like to work on building solutions rather than platform (as Milind put it), his Applied AI Research team at Capital One is hiring, which falls under the Capital One Tech team.

Personal AI Meetup

It all started with a meme:

Within days of each other, BEE, FRIEND, EmilyAI, Compass, Nox and LangFriend were all launching personal AI wearables and assistants. So we decided to put together the world's first Personal AI meetup featuring creators and enthusiasts of wearables. The full video is live now, with full show notes within.

Timestamps

* [00:01:13] AI Breakdown Part 1

* [00:02:20] Four Wars

* [00:13:45] Sora

* [00:15:12] Suno

* [00:16:34] The GPT-4 Class Landscape

* [00:17:03] Data War: Reddit x Google

* [00:21:53] Gemini 1.5 vs Claude 3

* [00:26:58] AI Breakdown Part 2

* [00:27:33] Next Frontiers: Llama 3, GPT-5, Gemini 2, Claude 4

* [00:31:11] Open Source Models - Mistral, Grok

* [00:34:13] Apple MM1

* [00:37:33] Meta's $800b AI rebrand

* [00:39:20] AI Engineer landscape - from baby AGIs to vertical Agents

* [00:47:28] Adept episode - Screen Multimodality

* [00:48:54] Top Model Research from January Recap

* [00:53:08] AI Wearables

* [00:57:26] Groq vs Nvidia month - GPU Chip War

* [01:00:31] Disagreements

* [01:02:08] Summer 2024 Predictions

* [01:04:18] Thursday Nights in AI - swyx

* [01:33:34] Dylan Patel - Semianalysis + Latent Space Live Show

* [01:34:58] Groq

Transcript

[00:00:00] swyx: Welcome to the Latent Space Podcast Weekend Edition. This is Charlie, your AI co host. Swyx and Alessio are off for the week, making more great content. We have exciting interviews coming up with Elicit, Chroma, Instructor, and our upcoming series on NSFW, Not Safe for Work AI. In today's episode, we're collating some of Swyx and Alessio's recent appearances, all in one place for you to find.

[00:00:32] swyx: In part one, we have our first crossover pod of the year. In our listener survey, several folks asked for more thoughts from our two hosts. In 2023, Swyx and Alessio did crossover interviews with other great podcasts like the AI Breakdown, Practical AI, Cognitive Revolution, ThursdAI, and ChinaTalk, all of which you can find on the Latent Space About page.

[00:00:56] swyx: NLW of the AI Breakdown asked us back to do a special on the Four Wars framework and the AI engineer scene. We love AI Breakdown as one of the best examples of daily podcasts to keep up on AI news, so we were especially excited to be back on.

[00:01:12] NLW: Watch out and take care

[00:01:13] AI Breakdown Part 1

[00:01:13] NLW: today on the AI Breakdown. Part one of my conversation with Alessio and Swyx from Latent Space.

[00:01:19] NLW: All right, fellas, welcome back to the AI Breakdown. How are you doing? I'm good. Very good. With the last, the last time we did this show, we were like, oh yeah, let's do check ins like monthly about all the things that are going on and then. Of course, six months later, and, you know, the, the, the world has changed in a thousand ways.

[00:01:36] NLW: It's just, it's too busy to even, to even think about podcasting sometimes. But I, I'm super excited to, to be chatting with you again. I think there's, there's a lot to, to catch up on, just to tap in, I think in the, you know, in the beginning of 2024. And, and so, you know, we're gonna talk today about just kind of a, a, a broad sense of where things are in some of the key battles in the AI space.

[00:01:55] NLW: And then the, you know, one of the big things that I, that I'm really excited to have you guys on here for us to talk about where, sort of what patterns you're seeing and what people are actually trying to build, you know, where, where developers are spending their, their time and energy and, and, and any sort of, you know, trend trends there, but maybe let's start I guess by checking in on a framework that you guys actually introduced, which I've loved and I've cribbed a couple of times now, which is this sort of four wars of the, of the AI stack.

[00:02:20] Four Wars

[00:02:20] NLW: Because first, since I have you here, I'd love, I'd love to hear sort of like where that started gelling. And then and then maybe we can get into, I think a couple of them that are you know, particularly interesting, you know, in the, in light of

[00:02:30] swyx: some recent news. Yeah, so maybe I'll take this one. So the Four Wars is a framework that I came up with around trying to recap all of 2023.

[00:02:38] swyx: I tried to write sort of monthly recap pieces. And I was trying to figure out like what makes one piece of news last longer than another or more significant than another. And I think it's basically always around battlegrounds. Wars are fought around limited resources. And I think probably the, you know, the most limited resource is talent, but the talent expresses itself in a number of areas.

[00:03:01] swyx: And so I kind of focus on those, those areas at first. So the four wars that we cover are the Data war, the GPU Rich/Poor war, the Multimodal war, and the RAG and Ops war. And I think you actually did a dedicated episode to that, so thanks for covering that. Yeah, yeah.

[00:03:18] NLW: Not only did I do a dedicated episode, I actually used that.

[00:03:22] NLW: I can't remember if I told you guys. I did give you big shoutouts. But I used it as a framework for a presentation at Intel's big AI event that they hold each year, where they have all their folks who are working on AI internally. And it totally resonated. That's amazing. Yeah, so, so, what got me thinking about it again is specifically this Inflection news that we recently had, this sort of, you know, basically, I can't imagine that anyone who's listening wouldn't have thought about it, but, you know, Inflection is one of the big contenders, right?

[00:03:53] NLW: I think probably most folks would have put them, you know, just a half step behind the Anthropics and OpenAIs of the world in terms of labs, but it's a company that raised $1.3 billion last year, less than a year ago. Reid Hoffman's a co-founder, Mustafa Suleyman, who's a co-founder of DeepMind, you know, so it's like, this is not a small startup, let's say, at least in terms of perception.

[00:04:13] NLW: And then we get the news that basically most of the team, it appears, is heading over to Microsoft and they're bringing in a new CEO. And you know, I'm interested in, in, in kind of your take on how much that reflects, like hold aside, I guess, you know, all the other things that it might be about, how much it reflects this sort of the, the stark.

[00:04:32] NLW: Brutal reality of competing in the frontier model space right now. And, you know, just the access to compute.

[00:04:38] Alessio: There are a lot of things to say. So first of all, there's always somebody who's more GPU rich than you. So Inflection is GPU rich by startup standards. I think about 22,000 H100s, but obviously that pales compared to the, to Microsoft.

[00:04:55] Alessio: The other thing is that this is probably good news, maybe for the startups. It's like being GPU rich, it's not enough. You know, like I think they were building something pretty interesting in, in Pi, their own model, their own kind of experience. But at the end of the day, you're the interface that people consume as end users.

[00:05:13] Alessio: It's really similar to a lot of the others. So and we'll tell, talk about GPT-4 and Claude 3 and all this stuff. The GPU poor are doing something that the GPU rich are not interested in, you know. We just had our AI center of excellence at Decibel and one of the AI leads at one of the big companies was like, oh, we just saved 10 million and we use these models to do a translation, you know, and that's it.

[00:05:39] Alessio: It's not, it's not AGI, it's just translation. So I think like the Inflection part is maybe a wake-up call to a lot of startups to then say, hey, you know, trying to get as much capital as possible, try and get as many GPUs as possible. Good. But at the end of the day, it doesn't build a business, you know, and maybe what Inflection, I don't, I don't, again, I don't know the reasons behind the Inflection choice, but if you say, I don't want to build my own company that has

[00:06:05] Alessio: $1.3 billion and I want to go do it at Microsoft, it's probably not a resources problem. It's more about strategic decisions that you're making as a company. So yeah, that was kind of my take on it.

[00:06:15] swyx: Yeah, and I guess on my end, two things actually happened yesterday. It was a little bit quieter news, but Stability AI had some pretty major departures as well.

[00:06:25] swyx: And you may not be considering it, but Stability is actually also a GPU rich company in the sense that they were the first new startup in this AI wave to brag about how many GPUs they have. And you should join them. And you know, Emad is definitely a GPU trader in some sense from his hedge fund days.

[00:06:43] swyx: So Robin Rombach and most of the Stable Diffusion 3 people left Stability yesterday as well. So yesterday was kind of like a big news day for the GPU rich companies, both Inflection and Stability having sort of wind taken out of their sails. I think, yes, it's a data point in favor of, like, just because you have the GPUs doesn't mean you automatically win.

[00:07:03] swyx: And I think, you know, kind of I'll echo what Alessio says there. But in general also, like, I wonder if this is like the start of a major consolidation wave, just in terms of, you know, I think that there was a lot of funding last year and, you know, the business models have not, you know, all worked out very well.

[00:07:19] swyx: Even inflection couldn't do it. And so I think maybe that's the start of a small consolidation wave. I don't think that's like a sign of AI winter. I keep looking for AI winter coming. I think this is kind of like a brief cold front. Yeah,

[00:07:34] NLW: it's super interesting. So I think a bunch of A bunch of stuff here.

[00:07:38] NLW: One is, I think, to both of your points, there, in some ways, there, there had already been this very clear demarcation between these two sides where, like, the GPU pores, to use the terminology, like, just weren't trying to compete on the same level, right? You know, the vast majority of people who have started something over the last year, year and a half, call it, were racing in a different direction.

[00:07:59] NLW: They're trying to find some edge somewhere else. They're trying to build something different. If they're, if they're really trying to innovate, it's in different areas. And so it's really just this very small handful of companies that are in this like very, you know, it's like the coheres and jaspers of the world that like this sort of, you know, that are that are just sort of a little bit less resourced than, you know, than the other set that I think that this potentially even applies to, you know, everyone else that could clearly demarcate it into these two, two sides.

[00:08:26] NLW: And there's only a small handful kind of sitting uncomfortably in the middle, perhaps. Let's, let's come back to the idea of, of the sort of AI winter or, you know, a cold front or anything like that. So this is something that I, I spent a lot of time kind of thinking about and noticing. And my perception is that the vast majority of the folks who are trying to call for sort of, you know, a trough of disillusionment or, you know, a shifting of the phase to that are people who either, A, just don't like AI for some other reason, and there's plenty of that, you know, people who are saying, look, they're doing way worse than they ever thought.

[00:09:03] NLW: You know, there's a lot of sort of confirmation bias kind of thing going on. Or two, media that just needs a different narrative, right? Because they're sort of sick of, you know, telling the same story. Same thing happened last summer, when every outlet jumped on the ChatGPT-has-its-first-down-month story to try to really like kind of hammer this idea that the hype was too much.

[00:09:24] NLW: Meanwhile, you have, you know, just ridiculous levels of investment from enterprises, you know, coming in. You have, you know, huge, huge volumes of, you know, individual behavior change happening. But I do think that there's nothing incoherent sort of to your point, Swyx, about that and the consolidation period.

[00:09:42] NLW: Like, you know, if you look right now, for example, there are, I don't know, probably 25 or 30 credible, like, build-your-own-chatbot platforms that, you know, a lot of which have, you know, raised funding. There's no universe in which all of those are successful across, you know, even with a, even, even with a total addressable market of every enterprise in the world, you know, you're just inevitably going to see some amount of consolidation.

[00:10:08] NLW: Same with, you know, image generators. There are, if you look at A16Z's top 50 consumer AI apps, just based on, you know, web traffic or whatever, there are still, like, I don't know, a half dozen or 10 or something, like, some ridiculous number of, like, basically things like Midjourney or DALL-E 3. And it just seems impossible that we're gonna have that many, you know, ultimately as, as, as sort of, you know, going concerns.

[00:10:33] NLW: So, I don't know. I, I, I think that the, there will be inevitable consolidation 'cause you know. It's, it's also what kind of like venture rounds are supposed to do. You're not, not everyone who gets a seed round is supposed to get to series A and not everyone who gets a series A is supposed to get to series B.

[00:10:46] NLW: That's sort of the natural process. I think it will be tempting for a lot of people to try to infer from that something about AI not being as sort of big or as as sort of relevant as, as it was hyped up to be. But I, I kind of think that's the wrong conclusion to come to.

[00:11:02] Alessio: I I would say the experimentation.

[00:11:04] Alessio: Surface is a little smaller for image generation. So if you go back maybe six, nine months, most people will tell you, why would you build a coding assistant when like Copilot and GitHub are just going to win everything because they have the data and they have all the stuff. If you fast forward today, A lot of people use Cursor everybody was excited about the Devin release on Twitter.

[00:11:26] Alessio: There are a lot of different ways of attacking the market that are not completion of code in the IDE. And even Cursors, like they evolved beyond single line to like chat, to do multi line edits and, and all that stuff. Image generation, I would say, yeah, as a, just as from what I've seen, like maybe the product innovation has slowed down at the UX level and people are improving the models.

[00:11:50] Alessio: So the race is like, how do I make better images? It's not like, how do I make the user interact with the generation process better? And that gets tough, you know? It's hard to like really differentiate yourselves. So yeah, that's kind of how I look at it. And when we think about multimodality, maybe the reason why people got so excited about Sora is like, oh, this is like a completely It's not a better image model.

[00:12:13] Alessio: This is like a completely different thing, you know? And I think the creative mind is always looking for something that impacts the viewer in a different way, you know, like they really want something different, versus the developer mind, it's like, oh, I, I just, I have this like very annoying thing I want better.

[00:12:32] Alessio: I have this like very specific use cases that I want to go after. So it's just different. And that's why you see a lot more companies in image generation. But I agree with you that. If you fast forward there, there's not going to be 10 of them, you know, it's probably going to be one or

[00:12:46] swyx: two. Yeah, I mean, to me, that's why I call it a war.

[00:12:49] swyx: Like, individually, all these companies can make a story that kind of makes sense, but collectively, they cannot all be true. Therefore, they all, there is some kind of fight over limited resources here. Yeah, so

[00:12:59] NLW: it's interesting. We wandered very naturally into sort of another one of these wars, which is the multimodality kind of idea, which is, you know, basically a question of whether it's going to be these sort of big everything models that end up winning or whether, you know, you're going to have really specific things, you know, like something, you know, DALL-E 3 inside of sort of OpenAI's larger models versus, you know, a Midjourney or something like that.

[00:13:24] NLW: And at first, you know, I was kind of thinking like, For most of the last, call it six months or whatever, it feels pretty definitively both and in some ways, you know, and that you're, you're seeing just like great innovation on sort of the everything models, but you're also seeing lots and lots happen at sort of the level of kind of individual use cases.

[00:13:45] Sora

[00:13:45] NLW: But then Sora comes along and just like obliterates what I think anyone thought you know, where we were when it comes to video generation. So how are you guys thinking about this particular battle or war at the moment?

[00:13:59] swyx: Yeah, this was definitely a both and story, and Sora tipped things one way for me, in terms of scale being all you need.

[00:14:08] swyx: And the benefit, I think, of having multiple models being developed under one roof. I think a lot of people aren't aware that Sora was developed in a similar fashion to DALL-E 3. And DALL-E 3 had a very interesting paper out where they talked about how they sort of bootstrapped their synthetic data based on GPT-4 Vision and GPT-4.

[00:14:31] swyx: And, and it was just all, like, really interesting, like, if you work on one modality, it enables you to work on other modalities, and all that is more, is, is more interesting. I think it's beneficial if it's all in the same house, whereas the individual startups who don't, who sort of carve out a single modality and work on that, definitely won't have the state of the art stuff on helping them out on synthetic data.

[00:14:52] swyx: So I do think like, the balance is tilted a little bit towards the God model companies, which is challenging for the, for the, for the sort of dedicated modality companies. But everyone's carving out different niches. You know, like we just interviewed Suno AI, the sort of music model company, and, you know, I don't see OpenAI pursuing music anytime soon.

[00:15:12] Suno

[00:15:12] swyx: Yeah,

[00:15:13] NLW: Suno's been phenomenal to play with. Suno has done that rare thing where, which I think a number of different AI product categories have done, where people who don't consider themselves particularly interested in doing the thing that the AI enables find themselves doing a lot more of that thing, right?

[00:15:29] NLW: Like, it'd be one thing if Just musicians were excited about Suno and using it but what you're seeing is tons of people who just like music all of a sudden like playing around with it and finding themselves kind of down that rabbit hole, which I think is kind of like the highest compliment that you can give one of these startups at the

[00:15:45] swyx: early days of it.

[00:15:46] swyx: Yeah, I, you know, I, I asked them directly, you know, in the interview about whether they consider themselves Midjourney for music. And he had a more sort of nuanced response there, but I think that probably the business model is going to be very similar because he's focused on the B2C element of that. So yeah, I mean, you know, just to, just to tie back to the question about, you know, you know, large multi modality companies versus small dedicated modality companies.

[00:16:10] swyx: Yeah, highly recommend people to read the Sora blog posts and then read through to the DALL-E 3 blog posts because they, they strongly correlated themselves with the same synthetic data bootstrapping methods as DALL-E. And I think once you make those connections, you're like, oh, like it, it, it is beneficial to have multiple state of the art models in house that all help each other.

[00:16:28] swyx: And these, this, that's the one thing that a dedicated modality company cannot do.

[00:16:34] The GPT-4 Class Landscape

[00:16:34] NLW: So I, I wanna jump, I wanna kind of build off that and, and move into the sort of like updated GPT-4 class landscape. 'cause that's obviously been another big change over the last couple months. But for the sake of completeness, is there anything that's worth touching on with with sort of the quality?

[00:16:46] NLW: Quality data or sort of the RAG and Ops wars, just in terms of, you know, anything that's changed, I guess, for you fundamentally in the last couple of months about where those things stand.

[00:16:55] swyx: So I think we're going to talk about RAG for the Gemini and Claude discussion later. And so maybe briefly discuss the data piece.

[00:17:03] Data War: Reddit x Google

[00:17:03] swyx: I think maybe the only new thing was this Reddit deal with Google for like a 60 million dollar deal just ahead of their IPO, very conveniently turning Reddit into an AI data company. Also, very, very interestingly, a non-exclusive deal, meaning that Reddit can resell that data to someone else. And it probably does become table stakes.

[00:17:23] swyx: A lot of people don't know, but a lot of the WebText dataset that originally started for GPT-1, 2, and 3 was actually scraped from Reddit, at least the sort of vote scores. And I think, I think that's a, that's a very valuable piece of information. So like, yeah, I think people are figuring out how to pay for data.

[00:17:40] swyx: People are suing each other over data. This, this, this war is, you know, definitely very, very much heating up. And I don't think, I don't see it getting any less intense. I, you know, next to GPUs, data is going to be the most expensive thing in, in a model stack company. And. You know, a lot of people are resorting to synthetic versions of it, which may or may not be kosher based on how far along or how commercially blessed the, the forms of creating that synthetic data are.

[00:18:11] swyx: I don't know if Alessio, you have any other interactions with like Data source companies, but that's my two cents.

[00:18:17] Alessio: Yeah yeah, I actually saw Quentin Anthony from EleutherAI at GTC this week. He's also been working on this. I saw Teknium. He's also been working on the data side. I think especially in open source, people are like, okay, if everybody is putting the gates up, so to speak, to the data, we need to make it easier for people that don't have 50 million a year to get access to good data sets.

[00:18:38] Alessio: And Jensen, at his keynote, he did talk about synthetic data a little bit. So I think that's something that we'll definitely hear more and more of in the enterprise, which never bodes well, because then all the, all the people with the data are like, Oh, the enterprises want to pay now? Let me, let me put a pay here stripe link so that they can give me 50 million.

[00:18:57] Alessio: But it worked for Reddit. I think the stock is up 40 percent today after opening. So yeah, I don't know if it's all about the Google deal, but it's obviously Reddit has been one of those companies where, hey, you got all this like great community, but like, how are you going to make money? And like, they try to sell the avatars.

[00:19:15] Alessio: I don't know if that it's a great business for them. The, the data part sounds as an investor, you know, the data part sounds a lot more interesting than, than consumer

[00:19:25] swyx: cosmetics. Yeah, so I think, you know there's more questions around data you know, I think a lot of people are talking about the interview that Mira Murati did with the Wall Street Journal, where she, like, just basically had no, had no good answer for where they got the data for Sora.

[00:19:39] swyx: I, I think this is where, you know, there's, it's in nobody's interest to be transparent about data, and it's, it's kind of sad for the state of ML and the state of AI research but it is what it is. We, we have to figure this out as a society, just like we did for music and music sharing. You know, in, in sort of the Napster to Spotify transition, and that might take us a decade.

[00:19:59] swyx: Yeah, I

[00:20:00] NLW: do. I, I agree. I think, I think that you're right to identify it, not just as that sort of technical problem, but as one where society has to have a debate with itself. Because I think that, if you look at it rationally, there are great points on all sides, not to be the sort of, you know, person who sits in the middle constantly, but it's why I think a lot of these legal decisions are going to be really important because, you know, the job of judges is to listen to all this stuff and try to come to things and then have other judges disagree.

[00:20:24] NLW: And, you know, and have the rest of us all debate at the same time. By the way, as a total aside, I feel like the synthetic data right now is like eggs in the 80s and 90s. Like, whether they're good for you or bad for you, like, you know, we, we get one study that's like synthetic data, you know, there's model collapse.

[00:20:42] NLW: And then we have like a hint that Llama, you know, the most high performance version of it, which was the one they didn't release, was trained on synthetic data. So maybe it's good. It's like, I just feel like every, every other week I'm seeing something sort of different about whether it's good or bad for, for these models.

[00:20:56] swyx: Yeah. The branding of this is pretty poor. I would kind of tell people to think about it like cholesterol. There's good cholesterol, bad cholesterol. And you can have, you know, good amounts of both. But at this point, it is absolutely without a doubt that most large models from here on out will all be trained as some kind of synthetic data and that is not a bad thing.

[00:21:16] swyx: There are ways in which you can do it poorly. Whether it's commercial, you know, in terms of commercial sourcing or in terms of the model performance. But it's without a doubt that good synthetic data is going to help your model. And this is just a question of like where to obtain it and what kinds of synthetic data are valuable.

[00:21:36] swyx: You know, even like AlphaGeometry, you know, was, was a really good example from like earlier this year.

[00:21:42] NLW: If you're using the cholesterol analogy, then my, then my egg thing can't be that far off. Let's talk about the sort of the state of the art and the, and the GPT 4 class landscape and how that's changed.

[00:21:53] Gemini 1.5 vs Claude 3

[00:21:53] NLW: Cause obviously, you know, sort of the, the two big things or a couple of the big things that have happened. Since we last talked were, one, you know, Gemini first announcing that a model was coming and then finally it arriving, and then very soon after a sort of a different model arriving from Gemini, and Claude 3.

[00:22:11] NLW: So I guess, you know, I'm not sure exactly where the right place to start with this conversation is, but, you know, maybe very broadly speaking which of these do you think have made a bigger impact? Thank you.

[00:22:20] Alessio: Probably the one you can use, right? So, Claude. Well, I'm sure Gemini is going to be great once they let me in, but so far I haven't been able to.

[00:22:29] Alessio: I use, so I have this small podcaster thing that I built for our podcast, which does chapters creation, like named entity recognition, summarization, and all of that. Claude 3 is better than GPT-4. Claude 2 was unusable. So I used GPT-4 for everything. And then when Opus came out, I tried them again side by side and I posted it on, on Twitter as well.

[00:22:53] Alessio: Claude is better. It's very good, you know, it's much better, it seems to me, it's much better than GPT-4 at doing writing that is more, you know, I don't know, it just got good vibes, you know, like the GPT-4 text, you can tell it's like GPT-4, you know, it's like, it always uses certain types of words and phrases and, you know, maybe it's just me because I've now done it for, you know, so I've read like 75, 80 generations of these things next to each other.

[00:23:21] Alessio: Claude is really good. I know everybody is freaking out on Twitter about it, my only experience of this "it's much better" has been on the podcast use case. But I know that, you know, Karan from Nous Research is a very big pro-Opus person. So, I think that's also, it's great to have people that actually care about other models.

[00:23:40] Alessio: You know, I think so far to a lot of people, maybe Anthropic has been the sibling in the corner, you know, it's like Claude releases a new model and then OpenAI releases Sora and like, you know, there are like all these different things, but yeah, the new models are good. It's interesting.

[00:23:55] NLW: My my perception is definitely that just, just observationally, Claude 3 is certainly the first thing that I've seen where lots of people.

[00:24:06] NLW: They're, no one's debating evals or anything like that. They're talking about the specific use cases that they have, that they used to use chat GPT for every day, you know, day in, day out, that they've now just switched over. And that has, I think, shifted a lot of the sort of like vibe and sentiment in the space too.

[00:24:26] NLW: And I don't necessarily think that it's sort of a A like full you know, sort of full knock. Let's put it this way. I think it's less bad for open AI than it is good for anthropic. I think that because GPT 5 isn't there, people are not quite willing to sort of like, you know get overly critical of, of open AI, except in so far as they're wondering where GPT 5 is.

[00:24:46] NLW: But I do think that it makes, Anthropic look way more credible as a, as a, as a player, as a, you know, as a credible sort of player, you know, as opposed to to, to where they were.

[00:24:57] Alessio: Yeah. And I would say the benchmarks veil is probably getting lifted this year. I think last year. People were like, okay, this is better than this on this benchmark, blah, blah, blah, because maybe they did not have a lot of use cases that they did frequently.

[00:25:11] Alessio: So it's hard to like compare yourself. So you, you defer to the benchmarks. I think now as we go into 2024, a lot of people have started to use these models from, you know, from very sophisticated things that they run in production to some utility that they have on their own. Now they can just run them side by side.

[00:25:29] Alessio: And it's like, hey, I don't care that, like, the MMLU score of Opus is like slightly lower than GPT-4. It just works for me, you know, and I think that's the same way that traditional software has been used by people, right? Like you just try it for yourself and see which one works best for you.

[00:25:48] Alessio: Like nobody looks at benchmarks outside of like sales white papers, you know? And I think it's great that we're going more in that direction. We have an episode with Adept coming out this weekend, and in some of their model releases, they specifically say, we do not care about benchmarks, so we didn't put them in, you know, because we, we don't want to look good on them.

[00:26:06] Alessio: We just want the product to work. And I think more and more people will, will

[00:26:09] swyx: go that way. Yeah. I I would say like, it does take the wind out of the sails for GPT-5, which I know we're, you know, curious about later on. I think anytime you put out a new state of the art model, you have to break through in some way.

[00:26:21] swyx: And what Claude and Gemini have done is effectively take away any advantage to saying that you have a million token context window. Now everyone's just going to be like, oh, okay, now you just match the other two guys. And so that puts an insane amount of pressure on what GPT-5 is going to be, because it's just going to have, like, the only option it has now, because all the other models are multimodal, all the other models are long context, all the other models have perfect recall. GPT-5 has to match everything and do more to not be a flop.

[00:26:58] AI Breakdown Part 2

[00:26:58] NLW: Hello friends, back again with part two. If you haven't heard part one of this conversation, I suggest you go check it out, but to be honest, they are kind of actually separable. In this conversation, we get into a topic that I think Alessio and Swyx are very well positioned to discuss, which is what developers care about right now, what people are trying to build around.

[00:27:16] NLW: I honestly think that one of the best ways to see the future in an industry like AI is to try to dig deep on what developers and entrepreneurs are attracted to build, even if it hasn't made it to the news pages yet. So consider this your preview of six months from now, and let's dive in. Let's bring it to the GPT 5 conversation.

[00:27:33] Next Frontiers: Llama 3, GPT-5, Gemini 2, Claude 4

[00:27:33] NLW: I mean, so, so I think that that's a great sort of assessment of just how the stakes have been raised, you know is your, I mean, so I guess maybe, maybe I'll, I'll frame this less as a question, just sort of something that, that I, that I've been watching right now, the only thing that makes sense to me with how.

[00:27:50] NLW: Fundamentally unbothered and unstressed OpenAI seems about everything is that they're sitting on something that does meet all that criteria, right? Because, I mean, even in the Lex Fridman interview that, that Altman recently did, you know, he's talking about other things coming out first. He's talking about, he's just like, he, listen, he, he's good and he could play nonchalant, you know, if he wanted to.

[00:28:13] NLW: So I don't want to read too much into it, but. You know, they've had so long to work on this, like unless that we are like really meaningfully running up against some constraint, it just feels like, you know, there's going to be some massive increase, but I don't know. What do you guys think?

[00:28:28] swyx: Hard to speculate.

[00:28:29] swyx: You know, at this point, they're, they're pretty good at PR and they're not going to tell you anything that they don't want to. And he can tell you one thing and change their minds the next day. So it's, it's, it's really, you know, I've always said that model version numbers are just marketing exercises, like they have something and it's always improving and at some point you just cut it and decide to call it GPT 5.

[00:28:50] swyx: And it's more just about defining an arbitrary level at which they're ready and it's up to them on what ready means. We definitely did see some leaks on GPT-4.5, as I think a lot of people reported and I'm not sure if you covered it. So it seems like there might be an intermediate release. But I did feel, coming out of the Lex Fridman interview, that GPT-5 was nowhere near.

[00:29:11] swyx: And you know, it was kind of a sharp contrast to Sam talking at Davos in February, saying that, you know, it was his top priority. So I find it hard to square. And honestly, there's also no point reading too much into the tea leaves of what any one person says about something that hasn't happened yet or a decision that hasn't been taken yet.

[00:29:31] swyx: Yeah, that's, that's my 2 cents about it. Like, calm down, let's just build .

[00:29:35] Alessio: Yeah. The, the February rumor was that they were gonna work on AI agents, so I don't know, maybe they're like, yeah,

[00:29:41] swyx: they had, I think, two agent projects, right? One desktop agent and one sort of more general, GPTs-like agent, and then Andrej left, so he was supposed to be the guy on that.

[00:29:52] swyx: What did Andrej see? What did he see? I don't know. What did he see?

[00:29:56] Alessio: I don't know. But again, the rumors are always floating around, you know, but I think, you know, we're not going to get to the end of the year without GPT-5, you know, that's definitely happening. I think the biggest question is like, are Anthropic and Google

[00:30:13] Alessio: increasing the pace, you know. Like, is Claude 4 coming out in like 12 months, like nine months? What's the deal? Same with Gemini. They went from like 1 to 1.5 in like five days or something. So when's Gemini 2 coming out, you know, is that going to be soon? I don't know.

[00:30:31] Alessio: There, there are a lot of, speculations, but the good thing is that now you can see a world in which OpenAI doesn't rule everything. You know, so that, that's the best, that's the best news that everybody got, I would say.

[00:30:43] swyx: Yeah, and Mistral Large also dropped in the last month. And, you know, not as, not quite GPT 4 class, but very good from a new startup.

[00:30:52] swyx: So yeah, the landscape has now slowly changed, you know. In my January recap, I was complaining that nothing's changed in the landscape for a long time. But now we do exist in a world, sort of a multipolar world, where Claude and Gemini are legitimate challengers to GPT-4, and hopefully more will emerge as well, hopefully from Meta.

[00:31:11] Open Source Models - Mistral, Grok

[00:31:11] NLW: So let's actually talk about sort of the open source side of this for a minute. So Mistral Large, notable because it's not available open source in the same way that other things are, although I think my perception is that the community largely recognizes that they want Mistral to keep building open source stuff, and they have to find some way to fund themselves to keep doing that.

[00:31:27] NLW: And so they kind of understand that they've got to figure out how to eat. But we've got, so, you know, there's Mistral, there's, I guess, Grok now, which is, you know, Grok-1 is from October and is open

[00:31:38] swyx: sourced, yeah. Yeah, sorry, I thought you meant Groq the chip company.

[00:31:41] swyx: No, no, no, yeah, you mean Twitter Grok.

[00:31:43] NLW: Although Groq the chip company, I think, is even more interesting in some ways. But then there's, you know, obviously Llama 3, which is the one that sort of everyone's wondering about too. And, you know, my sense of that, the little bit that Zuckerberg was talking about Llama 3 earlier this year, suggested that, at least from an ambition standpoint, he was not thinking about how do I make sure that, you know, Meta keeps the open source throne, you know, vis a vis Mistral.

[00:32:09] NLW: He was thinking about how you go after, you know, how, how he, you know, releases a thing that's, you know, every bit as good as whatever OpenAI is on at that point.

[00:32:16] Alessio: Yeah. From what I heard in the hallways at GTC, Llama 3, the biggest model will be 260 to 300 billion parameters, so that's quite large.

[00:32:26] Alessio: That's not an open source model. You know, you cannot give people a 300 billion parameters model and ask them to run it. You know, it's very compute intensive. So I think it is, it

[00:32:35] swyx: can be open source. It's just, it's going to be difficult to run, but that's a separate question.

[00:32:39] Alessio: It's more like, as you think about what they're doing it for, you know, it's not like empowering the person running.

[00:32:45] Alessio: Llama on their laptop. It's like, oh, you can actually now use this to go after OpenAI, to go after Anthropic, to go after some of these companies at like the middle complexity level, so to speak. Yeah. So obviously, you know, we had Soumith Chintala on the podcast, they're doing a lot here, they're making PyTorch better.

[00:33:03] Alessio: You know, they want to, that's kind of like maybe a little bit of a shot at NVIDIA, in a way, trying to get some of the CUDA dominance out of it. Yeah, no, it's great. I love the Zuck destroying a lot of monopolies arc. You know, it's been very entertaining. Let's bridge

[00:33:18] NLW: into the sort of big tech side of this, because, obviously, I think actually when I did my episode, I added this as an additional war, something that I'm paying attention to.

[00:33:29] NLW: So we've got Microsoft's moves with Inflection, which I think potentially are being read as a shift vis a vis the relationship with OpenAI, which the sort of Mistral Large relationship also seems to reinforce. We have Apple potentially entering the race, finally, you know, giving up Project Titan and kind of trying to spend more effort on this.

[00:33:50] NLW: Although, Counterpoint, we also have them talking about it, or there being reports of a deal with Google, which, you know, is interesting to sort of see what their strategy there is. And then, you know, Meta's been largely quiet. We kind of just talked about the main piece, but, you know, there's, and then there's spoilers like Elon.

[00:34:07] NLW: I mean, you know, what, what of those things has sort of been most interesting to you guys as you think about what's going to shake out for the rest of this

[00:34:13] Apple MM1

[00:34:13] swyx: year? I'll take a crack. So the reason we don't have a fifth war for the Big Tech Wars is that's one of those things where I just feel like we don't cover differently from other media channels, I guess.

[00:34:26] swyx: Sure, yeah. In our anti-interestingness, we actually say, like, we try not to cover the Big Tech Game of Thrones, or it's proxied through Twitter and, you know, all the other four wars anyway, so there's just a lot of overlap. Yeah, I think absolutely, personally, the most interesting one is Apple entering the race.

[00:34:41] swyx: They actually released, they announced their first large language model that they trained themselves. It's like a 30 billion multimodal model. People weren't that impressed, but it was like the first time that Apple has kind of showcased that, yeah, we're training large models in house as well. Of course, like, they might be doing this deal with Google.

[00:34:57] swyx: I don't know. It sounds very sort of rumor-y to me. And it's probably, if it's on device, it's going to be a smaller model, so something like a Gemma. It's going to be smarter autocomplete. I don't know what to say. I'm still here dealing with, like, Siri, which probably hasn't been updated since God knows when it was introduced.

[00:35:16] swyx: It's horrible. I, you know, it, it, it makes me so angry. So I, I, one, as an Apple customer and user, I, I'm just hoping for better AI on Apple itself. But two, they are the gold standard when it comes to local devices, personal compute and, and trust, like you, you trust them with your data. And. I think that's what a lot of people are looking for in AI, that they have, they love the benefits of AI, they don't love the downsides, which is that you have to send all your data to some cloud somewhere.

[00:35:45] swyx: And some of this data that we're going to feed AI is just the most personal data there is. So Apple being like one of the most trusted personal data companies, I think it's very important that they enter the AI race, and I hope to see more out of them.

[00:35:58] Alessio: To me, the, the biggest question with the Google deal is like, who's paying who?

[00:36:03] Alessio: Because for the browsers, Google pays Apple like 18, 20 billion every year to be the default browser. Is Google going to pay you to have Gemini or is Apple paying Google to have Gemini? I think that's, that's like what I'm most interested to figure out because with the browsers, it's like, it's the entry point to the thing.

[00:36:21] Alessio: So it's really valuable to be the default. That's why Google pays. But I wonder if like the perception in AI is going to be like, Hey. You just have to have a good local model on my phone to be worth me purchasing your device. And that was, that's kind of drive Apple to be the one buying the model. But then, like Shawn said, they're doing the MM1 themselves.

[00:36:40] Alessio: So are they saying we do models, but they're not as good as the Google ones? I don't know. The whole thing is, it's really confusing, but. It makes for great meme material on on Twitter.

[00:36:51] swyx: Yeah, I mean, I think, like, they are possibly more than OpenAI and Microsoft and Amazon. They are the most full stack company there is in computing, and so, like, they own the chips, man.

[00:37:05] swyx: Like, they manufacture everything, so if there was a company that could do that, you know, seriously challenge the other AI players, it would be Apple. And I don't think it's as hard as self driving. So maybe they've just been investing in the wrong thing this whole time. We'll see.

[00:37:21] swyx: Wall Street certainly thinks

[00:37:22] NLW: so. Wall Street loved that move, man. There's a big, a big sigh of relief. Well, let's, let's move away from, from sort of the big stuff. I mean, the, I think to both of your points, it's going to.

[00:37:33] Meta's $800b AI rebrand

[00:37:33] NLW: Can I, can

[00:37:34] swyx: I, can I, can I jump on factoid about this, this Wall Street thing? I went and looked at when Meta went from being a VR company to an AI company.

[00:37:44] swyx: And I think the stock, I'm trying to look up the details now. The stock has gone up 187% since Llama 1. Yeah. Which is $830 billion in market value created in the past year. Yeah. Yeah.

[00:37:57] NLW: It's, it's, it's like, remember if you guys haven't Yeah. If you haven't seen the chart, it's actually like remarkable.

[00:38:02] NLW: If you draw a little

[00:38:03] swyx: arrow on it, it's like, no, we're an AI company now and forget the VR thing.

[00:38:10] NLW: It's it, it is an interesting, no, it's, I, I think, Alessio, you called it sort of like Zuck's Disruptor Arc or whatever. He, he really does. He is in the midst of a, of a total, you know, I don't know if it's a redemption arc or it's just, it's something different where, you know, he, he's sort of the spoiler.

[00:38:25] NLW: Like people loved him just freestyle talking about why he thought they had a better headset than Apple. But even if they didn't agree, they just loved it. He was going direct to camera and talking about it for, you know, five minutes or whatever. So that, that's a fascinating shift that I don't think anyone had on their bingo card, you know, whatever, two years ago.

[00:38:41] NLW: Yeah. Yeah,

[00:38:42] swyx: we still

[00:38:43] Alessio: didn't see him fight Elon though, so

[00:38:45] swyx: that's what I'm really looking forward to. I mean, hey, don't write it off, you know, maybe these things just take a while to happen. But we need to see that fight in the Colosseum. No, I think, you know, in terms of like self management, life leadership, there's a lot of lessons to learn from him.

[00:38:59] swyx: You know, you might kind of quibble with, like, the social impact of Facebook, but just himself, in terms of personal growth and, you know, perseverance through a lot of change and everyone throwing stuff his way, I think there's a lot to learn from Zuck, which is crazy 'cause he's my age.

[00:39:18] swyx: Yeah. Right.

[00:39:20] AI Engineer landscape - from baby AGIs to vertical Agents

[00:39:20] NLW: Awesome. Well, so, so one of the big things that I think you guys have, you know, distinct and, and unique insight into being where you are and what you work on is. You know, what developers are getting really excited about right now. And by that, I mean, on the one hand, certainly, you know, like startups who are actually kind of formalized and formed to startups, but also, you know, just in terms of like what people are spending their nights and weekends on what they're, you know, coming to hackathons to do.

[00:39:45] NLW: And, you know, I think it's such a fascinating indicator for where things are headed. Like if you zoom back a year, right now was right when everyone was getting so excited about AI agent stuff, right? AutoGPT and BabyAGI. And these things were like, if you dropped anything on YouTube about those, like instantly tens of thousands of views.

[00:40:07] NLW: I know because I had like a 50,000 view video, like the second day that I was doing the show on YouTube, you know, because I was talking about AutoGPT. And so anyways, you know, obviously that's sort of not totally come to fruition yet, but what are some of the trends in what you guys are seeing in terms of people's interest and what people are building?

[00:40:24] Alessio: I can start maybe with the agents part, and then I know Shawn is doing a diffusion meetup tonight. There's a lot of different things. The agent wave has been the most interesting kind of dream to reality arc. So AutoGPT, I think they went from zero to like 125,000 GitHub stars in six weeks, and then one year later, they have 150,000 stars.

[00:40:49] Alessio: So there's kind of been a big plateau. I mean, you might say there are just not that many people left that can star it. You know, everybody already starred it. But the promise of, hey, I'll just give you a goal, and you do it, I think it's amazing to get people's imagination going. You know, they're like, oh, wow, this is awesome.

[00:41:08] Alessio: Everybody, everybody can try this to do anything. But then as technologists, you're like, well, that's, that's just like not possible, you know, we would have like solved everything. And I think it takes a little bit to go from the promise and the hope that people show you to then try it yourself and going back to say, okay, this is not really working for me.

[00:41:28] Alessio: And David Luan from Adept, you know, in our episode, he specifically said: we don't want to do a bottom up product. You know, we don't want something that everybody can just use and try, because it's really hard to get it to be reliable. So we're seeing a lot of companies doing vertical agents that are narrow for a specific domain, and they're very good at something.

[00:41:49] Alessio: Mike Conover, who was at Databricks before, is also a friend of Latent Space. He's doing this new company called Brightwave doing AI agents for financial research, and that's it, you know, and they're doing very well. There are other companies doing it in security, doing it in compliance, doing it in legal.

[00:42:08] Alessio: All of these things that, like, nobody just wakes up and says, oh, I cannot wait to go on AutoGPT and ask it to do a compliance review of my thing. You know, it's just not what inspires people. So I think the gap has been that on the developer side, the more bottoms-up hacker mentality is trying to build these very generic agents that can do a lot of open ended tasks.

[00:42:30] Alessio: And then the more business side of things is like, hey, if I want to raise my next round, I cannot just sit around and mess around with super generic stuff. I need to find a use case that really works. And I think that that is true for a lot of folks. In parallel, you have a lot of companies doing evals.

[00:42:47] Alessio: There are dozens of them that just want to help you measure how good your models are doing. Again, if you build evals, you need to also have a constrained surface area to actually figure out whether or not it's good, right? Because you cannot eval anything and everything under the sun. So that's another category where, from the startup pitches that I've seen, there's a lot of interest in the enterprise.

[00:43:11] Alessio: It's just really fragmented, because the production use cases are just coming now, you know, there are not a lot of long established ones to test against. So that's kind of it on the vertical agents. And then the robotics side is probably the thing that surprised me the most at NVIDIA GTC, the amount of robots that were there, just robots everywhere.

[00:43:33] Alessio: Like, both in the keynote and then on the show floor, you would have Boston Dynamics dogs running around. There was, like, this, like fox robot that had, like, a virtual face that, like, talked to you and, like, moved in real time. There were industrial robots. NVIDIA did a big push on their own Omniverse thing, which is, like, this Digital twin of whatever environments you're in that you can use to train the robots agents.

[00:43:57] Alessio: So that kind of takes people back to the reinforcement learning days, but yeah, agents, people want them, you know, people want them. I gave a talk about the rise of the full stack employee and kind of this future where, the same way full stack engineers kind of work across the stack, in the future every employee is going to interact with every part of the organization through agents and AI enabled tooling.

[00:44:17] Alessio: This is happening. It just needs to be a lot more narrow than maybe the first approach that we took, which is just put a string in AutoGPT and pray. But yeah, there's a lot of super interesting stuff going on.

[00:44:27] swyx: Yeah. Well, let's cover a lot of stuff there. I'll separate out the robotics piece because I feel like that's so different from the software world.

[00:44:34] swyx: But yeah, we do talk to a lot of engineers, and you know, this is our sort of bread and butter. And I do agree that vertical agents have worked out a lot better than the horizontal ones. You know, the point I'll make here is just the reason, with AutoGPT and BabyAGI, you know, it's in the name, like they were promising AGI.

[00:44:53] swyx: But I think people are discovering that you cannot engineer your way to AGI. It has to be done at the model level and all these engineering, prompt engineering hacks on top of it weren't really going to get us there in a meaningful way without much further, you know, improvements in the models. I would say, I'll go so far as to say, even Devin, which is, I would, I think the most advanced agent that we've ever seen, still requires a lot of engineering and still probably falls apart a lot in terms of, like, practical usage.

[00:45:22] swyx: Or it's just way too slow and expensive compared to what's promised in the demo video. So yeah, that's what happened with agents from last year. But I do see vertical agents being very popular, and I think the word agent might even be overused sometimes.

[00:45:38] swyx: Like, people don't really care whether or not you call it an AI agent, right? Like, does it replace boring menial tasks that I do That I might hire a human to do, or that the human who is hired to do it, like, actually doesn't really want to do. And I think there's absolutely ways in sort of a vertical context that you can actually go after very routine tasks that can be scaled out to a lot of, you know, AI assistants.

[00:46:01] swyx: So yeah, I mean, I would basically plus one what Alessio said there. I think it's very, very promising and I think more people should work on it, not less. There's not enough people. This should be the main thrust of the AI engineer: to look for use cases and go to production with them instead of just always working on some AGI-promising thing that never arrives.

[00:46:21] swyx: I,

[00:46:22] NLW: I, I can only add that so I've been fiercely making tutorials behind the scenes around basically everything you can imagine with AI. We've probably done, we've done about 300 tutorials over the last couple of months. And the verticalized anything, right, like this is a solution for your particular job or role, even if it's way less interesting or kind of sexy, it's like so radically more useful to people in terms of intersecting with how, like those are the ways that people are actually.

[00:46:50] NLW: adopting AI: in a lot of cases it's just a thing that I do over and over again. By the way, I think that's the same way that even the generalized models are getting adopted. You know, I use Midjourney for lots of stuff, but the main thing I use it for is YouTube thumbnails every day. Like day in, day out, I will always do a YouTube thumbnail, you know, or two, with Midjourney, right?

[00:47:09] NLW: And you can start to extrapolate that across a lot of things, and all of a sudden, you know, AI looks revolutionary because of a million small changes rather than one sort of big dramatic change. And I think that the verticalization of agents is sort of a great example of how that's

[00:47:26] swyx: going to play out too.

[00:47:28] Adept episode - Screen Multimodality

[00:47:28] swyx: So I'll have one caveat here, which is, I think that because multimodal models are now commonplace, like Claude, Gemini, OpenAI, all very, very easily multimodal, Apple's easily multimodal, all this stuff, there is a switch for agents, for sort of general desktop browsing, toward the

[00:48:04] swyx: version of the agent where they're not specifically taking in text or anything. They're just watching your screen, just like someone else would, and it's piloted by vision. And, you know, in the episode with David that will have dropped by the time this airs, I think that is the promise of Adept, and that is the promise of what a lot of these sort of desktop agents are. And that is the more general purpose system that could be as big as the browser, the operating system. Like, people really want to build that foundational piece of software in AI.

[00:48:38] swyx: And I see the potential there for desktop agents being that you can have sort of self driving computers. You know, don't write the horizontal piece off. I just think it'll take a while to get there.
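To make the screen-watching idea concrete, here is a minimal sketch of the observe-decide-act loop such a desktop agent would run. The screenshot and input calls use real libraries (Pillow's ImageGrab and pyautogui), but `propose_action` is a hypothetical stand-in for whatever multimodal model you point at the pixels, and the action schema is invented purely for illustration.

```python
# Sketch of a vision-driven desktop agent loop (illustrative only).
# Requires: pip install pillow pyautogui
import time
from PIL import ImageGrab   # real: grabs a screenshot on macOS/Windows
import pyautogui            # real: synthesizes mouse and keyboard input

def propose_action(screenshot, goal: str) -> dict:
    """Hypothetical call into a multimodal model that looks at raw pixels and
    returns an action like {"type": "click", "x": 100, "y": 200} or {"type": "done"}.
    Not a real API; stands in for whatever vision model you use."""
    raise NotImplementedError

def run_agent(goal: str, max_steps: int = 20) -> None:
    for _ in range(max_steps):
        frame = ImageGrab.grab()              # observe: exactly what a human would see
        action = propose_action(frame, goal)  # decide: from pixels alone, no DOM or text
        if action["type"] == "done":
            break
        if action["type"] == "click":
            pyautogui.click(action["x"], action["y"])  # act
        time.sleep(0.5)                       # let the UI settle before re-observing
```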

[00:48:48] NLW: What else are you guys seeing that's interesting to you? I'm looking at your notes and I see a ton of categories.

[00:48:54] Top Model Research from January Recap

[00:48:54] swyx: Yeah so I'll take the next two as like as one category, which is basically alternative architectures, right? The two main things that everyone following AI kind of knows now is, one, the diffusion architecture, and two, the let's just say the, Decoder only transformer architecture that is popularized by GPT.

[00:49:12] swyx: You can read, you can look on YouTube for thousands and thousands of tutorials on each of those things. What we are talking about here is what's next, what people are researching, and what could be on the horizon that takes the place of those other two things. So first of all, we'll talk about transformer architectures and then diffusion.

[00:49:25] swyx: So for transformers, the two leading candidates are effectively RWKV and the state space models, the most recent of which is Mamba, but there are others like the StripedHyena and the S4 and H3 stuff coming out of Hazy Research at Stanford. And all of those are non-quadratic language models that promise to scale a lot better than the traditional transformer.

[00:49:47] swyx: This might be too theoretical for most people right now, but it's gonna come out in weird ways. Right now the talk of the town is that Claude and Gemini have a million tokens of context, and like, whoa, you can put in, you know, two hours of video now. Okay, but what if we could throw in, you know, two hundred thousand hours of video?

[00:50:09] swyx: Like how does that change your usage of AI? What if you could throw in the entire genetic sequence of a human and like synthesize new drugs. Like, well, how does that change things? Like, we don't know because we haven't had access to this capability being so cheap before. And that's the ultimate promise of these two models.
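To make the scaling point concrete, here is a rough back-of-the-envelope sketch of why quadratic attention becomes the bottleneck at very long contexts while a recurrent or state-space style layer grows linearly. The constants are arbitrary; only the growth rates matter.

```python
# Back-of-the-envelope: work per layer for a sequence of n tokens.
# Constants are arbitrary; only the growth rates matter.

def attention_cost(n: int) -> int:
    return n * n          # self-attention compares every token with every other: O(n^2)

def linear_cost(n: int) -> int:
    return n              # RWKV / Mamba-style layers scan the sequence once: O(n)

for n in (1_000, 100_000, 10_000_000):      # 1K, 100K, 10M tokens
    ratio = attention_cost(n) / linear_cost(n)
    print(f"n = {n:>10,}: attention does ~{ratio:,.0f}x the work of a linear-time layer")
```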

[00:50:28] swyx: They're not there yet, but we're seeing very, very good progress. RWKV and Mamba are probably the two leading examples, both of which are open source, so you can try them today, and there's a lot of progress there. And the main thing I'll highlight for RWKV is that at the 7B level, they seem to have beat Llama 2 in all benchmarks that matter at the same size, for the same amount of training, as an open source model.

[00:50:51] swyx: So that's exciting. You know, they're at 7B now, they're not at 70B, so we don't know if it'll hold up at that scale. And then the other thing is diffusion. Diffusion and transformers are kind of on a collision course. The original Stable Diffusion already used transformers in parts of its architecture.

[00:51:06] swyx: It seems that transformers are eating more and more of those layers, particularly the sort of VAE layer. So the Diffusion Transformer is what Sora is built on. The guy who wrote the Diffusion Transformer paper, Bill Peebles, is the lead tech guy on Sora. So you'll just see a lot more Diffusion Transformer stuff going on.

[00:51:25] swyx: But there's more sort of experimentation with diffusion. I'm holding a meetup actually here in San Francisco that's gonna be like the state of diffusion, which I'm pretty excited about. Stability's doing a lot of good work. And if you look at the architecture of how they're creating Stable Diffusion 3, Hourglass Diffusion, the consistency models, or SDXL Turbo.

[00:51:45] swyx: All of these are, like, very, very interesting innovations on the original idea of what Stable Diffusion was. So if you think that it is expensive or slow to create AI generated art with Stable Diffusion, you are not up to date with the latest models. If you think it is hard to create text in images, you are not up to date with the latest models.
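For a sense of how cheap and fast image generation has become, here is a minimal sketch using the publicly released SDXL-Turbo checkpoint via Hugging Face diffusers, which is documented to sample in a single step without classifier-free guidance. Treat the model ID and arguments as a snapshot in time rather than a recommendation.

```python
# Single-step image generation with SDXL-Turbo via Hugging Face diffusers.
# Requires: pip install diffusers transformers accelerate torch
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

image = pipe(
    prompt="a watercolor painting of a robot reading a newspaper",
    num_inference_steps=1,   # distilled "turbo" models sample in 1-4 steps
    guidance_scale=0.0,      # SDXL-Turbo is meant to run without classifier-free guidance
).images[0]
image.save("robot.png")
```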

[00:52:02] swyx: And people still are kind of far behind. The last piece of which is the wildcard I always kind of hold out, which is text diffusion. So instead of using autoregressive transformers, can you diffuse text? So you can use diffusion models to create entire chunks of text all at once instead of token by token.

[00:52:22] swyx: And that is something that Midjourney confirmed today, because it was only rumored the past few months, but they confirmed today that they were looking into it. So all those things are very exciting new model architectures that are maybe something that you'll see in production two to three years from now.
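A toy contrast of the two decoding patterns mentioned here. The `predict_next_token` and `denoise_step` functions are hypothetical stand-ins for a trained model; the point is only that autoregressive decoding commits one token per step, while a diffusion-style decoder refines every position of the sequence in parallel over a fixed number of passes.

```python
# Toy contrast between autoregressive and diffusion-style text decoding.
# `predict_next_token` and `denoise_step` are hypothetical stand-ins for a model.
from typing import Callable, List

def autoregressive_decode(predict_next_token: Callable[[List[str]], str],
                          length: int) -> List[str]:
    tokens: List[str] = []
    for _ in range(length):                 # one model call per token, strictly left to right
        tokens.append(predict_next_token(tokens))
    return tokens

def diffusion_decode(denoise_step: Callable[[List[str], int], List[str]],
                     length: int, steps: int = 8) -> List[str]:
    tokens = ["<mask>"] * length            # start from a fully "noised" sequence
    for t in range(steps):                  # a fixed number of refinement passes;
        tokens = denoise_step(tokens, t)    # each pass updates every position at once
    return tokens
```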

[00:52:37] swyx: So the couple of the trends

[00:52:38] NLW: that I want to just get your takes on, because they're sort of something that, that seems like they're coming up are one sort of these, these wearable, you know, kind of passive AI experiences where they're absorbing a lot of what's going on around you and then, and then kind of bringing things back.

[00:52:53] NLW: And then the, the other one that I, that I wanted to see if you guys had thoughts on were sort of this next generation of chip companies. Obviously there's a huge amount of emphasis. On on hardware and silicon and, and, and different ways of doing things, but, you know, love your take on, on either or both of

[00:53:07] swyx: those.

[00:53:08] AI Wearables

[00:53:08] swyx: So for so wearables, I'm very excited about it. I want wearables on me at all times. I have two right here. To, to quantify my health. And I, you know, I'm all for them. But society is not ready for wearables, right? Like, no one's comfortable with a device on recording every single conversation we have.

[00:53:24] swyx: Even all three of us here as podcasters, we don't record everything that we say. And I think there's a social shift that needs to happen. I am an investor in TAB. They are renaming to a broader vision, but they are one of the three or four leading wearables in this space. It's sort of the AI pendants, or AI OS, or AI personal companion space.

[00:53:47] swyx: I have seen two Humanes in the wild in San Francisco. I'm very, very excited to report that there are people walking around with those things on their chest, and it is as goofy as it sounds. It absolutely is going to fail. God bless them for trying. And I've also bought a Rabbit. So I'm very excited for all those things to arrive.

[00:54:06] swyx: But yeah people are very keen on hardware. I think the, the, the idea that you can have physical objects that. Embody an AI that do specific things for you is as old as, you know, the sort of Golem in sort of medieval times in terms of like how much we want our objects to be smart and do things for us.

[00:54:27] swyx: And I think it's absolutely a great play. The funny thing is people are much more willing to pay you upfront for a hardware device than they are willing to pay, like, an $8 a month recurring subscription for software, right? And so the interesting economics of these wearable companies is that they have negative float.

[00:54:47] swyx: In the sense that people pay deposits upfront, like I paid like, I don't know, 200 bucks for the rabbit. Upfront, and I don't get it for another six months. I paid 600 for the tab, and I don't get it for another six months. And, and then, then they can take that money and, and sort of invest it in like their next, the next events or their next properties or ventures.

[00:55:06] swyx: And I think that's a very interesting reversal of economics from other types of AI companies that I see. And I think, yeah, just the tactile feel of an AI, I think, is very promising. Alessio, I don't know if you have other thoughts on the wearable stuff.

[00:55:21] Alessio: Open Interpreter just announced their product four hours ago.

[00:55:25] Alessio: Yeah. Which is a, it's not really a wearable, but it's a, it's still like a physical device.

[00:55:30] swyx: It's a push-to-talk mic to a device on your laptop. Right. It's a $99 push-to-talk. Yeah.

[00:55:38] Alessio: But, but, but everybody, but again, going back to your point, it's like people want to, people are interested in spending money for like things that they can hold, you know, I don't know what that means overall for like where things are going, but making more of this AI be a physical part of your life.

[00:55:54] Alessio: I think people are interested in that, but I agree with Shawn. I mean, I've been. I talked to Avi about this, but Avi's point is like, most consumers, like, care about utility more than they care about privacy, you know, like you've seen with social media. But I also think there's a big societal reaction to AI that is, like, much more rooted than the social media one.

[00:56:16] Alessio: But we'll see. But a lot, again, a lot of work, a lot of developers, a lot of money going into it. So there's, there's bound to be experiments being run. On, on the

[00:56:25] swyx: chip side. Sorry, I'll just chip in one more thing and then we transition to the chips. The thing I'll caution people on is don't overly focus on the form factor.

[00:56:33] swyx: The form factor is a delivery mode. There will be many form factors. It doesn't matter so much as where in the data war does it sit. It actually is context acquisition. Because, and maybe a little bit of multi modality. Context, like, context is king. Like, if you have access to data that no one else has, then you will be able to create AI that no one else can create.

[00:56:54] swyx: And so what is the most personal context? It is your everyday conversation. It is as close to mapping your mental train of thought As possible without, you know, physically you writing down notes. So, so that is the promise, the ultimate goal here, which is like, personal context, it's always available on you you know, loading and seeing all that stuff.

[00:57:12] swyx: But yeah, that's the, that's the frame I want to give people that the form factors will change and there will be multiple form factors, but it's the software behind that. And in the personal context that you cannot get anywhere else, that'll win.

[00:57:24] Alessio: Yeah, so that was wearables.

[00:57:26] Groq vs Nvidia month - GPU Chip War

[00:57:26] Alessio: On the chip side, yeah, Groq was probably the biggest release.

[00:57:29] Alessio: Jonathan, well, it's not even a new release, because the company, I think, was started in 2016, so it's actually quite old. But they recently captured people's imagination with their Mixtral at 500 tokens a second demo. Yeah, I think so far the battle on the GPU side has been: either you go kind of massive chip, like the Cerebras of the world, where one chip from Cerebras is about two million dollars. Obviously you cannot compare one chip versus one chip, but an H100 is like 40,000, something like that. The problem with those architectures has been they want to be very general, you know, but they wanted to put a lot of the RAM, the SRAM, on the chip.

[00:58:13] Alessio: It's much more convenient when you're using larger language models, but the models outpace the size of the chips, and chips have a much longer turnaround cycle. Groq today, it's great for the current architecture. It's a lot more expensive also, as far as dollars per flop, but their idea is like, hey, when you have very high concurrency, we're actually much cheaper, you shouldn't just be looking at the compute power. For most people this doesn't really matter, but I think the most interesting thing to me is that we've now gone back, with AI, to a world where developers care about what hardware is running, which was not the case in traditional software for maybe 20 years, as the cloud has gotten really big.

[00:58:57] Alessio: My thinking is that in the next two, three years we're going to go back to that, where people are not going to be sweating, oh, what GPU do you have in your cloud? It's like, yeah, you want to run this model, we can run it at the same speed as everybody else. And then everybody will make different choices: whether they want higher upfront capital investment and then better utilization, or lower investment up front and then upgrade later. There are a lot of parameters. And then there are the dark horses, right, some of the smaller companies like Lemurian Labs or MatX that are working on maybe not a chip alone, but also some of the actual math infrastructure and the instructions that make them run.
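As a toy illustration of why dollars per flop alone is misleading and throughput and utilization dominate the math, here is a back-of-the-envelope cost-per-token calculation. Every number below is an invented assumption for illustration, not a quote for any real chip.

```python
# Toy cost-per-million-tokens math. Every number here is an illustrative assumption.

def cost_per_million_tokens(hardware_cost_usd: float, amortization_years: float,
                            tokens_per_second: float, utilization: float) -> float:
    seconds = amortization_years * 365 * 24 * 3600
    total_tokens = tokens_per_second * utilization * seconds
    return hardware_cost_usd / total_tokens * 1_000_000

# Hypothetical "general-purpose GPU" vs. "high-throughput inference chip":
gpu  = cost_per_million_tokens(40_000,  3, tokens_per_second=100,   utilization=0.4)
asic = cost_per_million_tokens(250_000, 3, tokens_per_second=3_000, utilization=0.4)
print(f"GPU-ish:  ${gpu:.2f} per 1M tokens")
print(f"ASIC-ish: ${asic:.2f} per 1M tokens")
# Vary tokens_per_second and utilization and the ranking can flip, which is the point:
# concurrency and utilization, not sticker price, drive the cost per token.
```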

[00:59:40] Alessio: There's a lot going on, but yeah, I think the, the episode with with Dylan will be interesting for, for people, but I think we also came out of it saying, Hey, everybody has pros and cons. There's no, it's different than the models where you're like, Oh, this one is definitely better for me. And I'm going to use it.

[00:59:56] Alessio: I think for most people. It's like fun Twitter memeing, you know, but it's like 99 percent of people that tweet about this stuff are never gonna buy any of these chips anyway. It's, it's really more for entertainment.

[01:00:10] swyx: No. Wow. I mean, like, this is serious business here, right? You're talking about, you know, like who, like the potential new Nvidia, if anyone can take like 1% of NVIDIA's business, they're a serious startup that you should look at.

[01:00:20] swyx: Right? So , that's, that's, that's my, well, yeah,

[01:00:23] Alessio: yeah, it matters. Well, I'm more talking about, like, how should people think about it, you know? I think the end user is not impacted as much.

[01:00:31] Disagreements

[01:00:31] Alessio: This is obviously, so

[01:00:32] swyx: I disagree. Yeah, I love disagreements because, you know, who likes a podcast where all three people always agree with each other?

[01:00:38] swyx: You will see the impact of this in the tokens per second over time. I have very, very credible sources all telling me that the average tokens per second, where right now we have somewhere between 50 to 100 as the norm for people, will go to 500 to 2,000 this year, from a number of chip suppliers that I cannot name.

[01:00:58] swyx: So like that is, that is, that will cause a step change in the use cases. Every time you have an order of magnitude improvement in the, in the speed of something, you unlock new use cases that become fun instead of a chore. And so that's what I would caution this audience to think about, which is like, what can you do in much higher AI speed?

[01:01:17] swyx: It's not just things streaming out faster. It is things working in the background a lot more seamlessly, and therefore being a lot more useful than previously imagined. So that would be my two cents on it.
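A quick bit of arithmetic makes the order-of-magnitude point concrete; the speeds are the ballpark figures mentioned above and the response length is an arbitrary example.

```python
# How long a 2,000-token response takes at different generation speeds.
response_tokens = 2_000
for tokens_per_second in (50, 100, 500, 2_000):
    seconds = response_tokens / tokens_per_second
    print(f"{tokens_per_second:>5} tok/s -> {seconds:5.1f}s for a {response_tokens}-token answer")
# 50 tok/s is a 40-second wait you sit through; 2,000 tok/s is a 1-second call
# you can happily run many of in the background.
```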

[01:01:30] Alessio: Yeah. Yeah. I mean, the, the new NVIDIA chips are also much faster. To me, that's true. When it comes to startups, it's like, are the startups pushing the performance on the incumbents or are the incumbents still leading?

[01:01:44] Alessio: And then the startups are like riding the same wave, you know? I don't have a good sense of that yet. It's like, you know, is next year's NVIDIA release just gonna be better than everything that gets released this year? If that's the case, it's like, okay, damn Jensen, you know, it's like the meme.

[01:02:00] Alessio: It's like, I'm gonna fight. I'm gonna fight NVIDIA. It's like, damn, Jensen got hands. He really does.

[01:02:08] Summer 2024 Predictions

[01:02:08] NLW: Well, awesome conversation, guys. I guess just by way of wrapping up, let's call it over the next three months, between now and sort of the beginning of summer, what's one prediction that each of you has? It can be about anything. It can be a big company. It can be a startup. It can be something you have privileged information about, and you just won't tell us that you actually

[01:02:25] Alessio: know.

[01:02:26] Alessio: What, does it have to be something that we think is going to be true, or just something that we think? Because for me, it's like, is Sundar still going to be the CEO of Google? Maybe not in three months, maybe in like six months, nine months, you know. People are like, oh, maybe Demis is going to be the new CEO.

[01:02:41] Alessio: That was kind of like, I was busy, like, fishing some DeepMind people and Google people for a good guest for the pod. And I was like, oh, what about Jeff Dean? And they're like, well, Demis is really the person that runs everything anyway. It's interesting. And

[01:02:57] swyx: so I don't know.

[01:02:58] swyx: What about Sergey? Sergey could come back. I don't know. Like, he's making more appearances these days.

[01:03:03] Alessio: Yeah, I don't know. Then we can just put it as, like, you know, my thing is CEO change potential, but again, three months is too short to make a prediction. Yeah. I

[01:03:16] NLW: think that's the, that's that's fine.

[01:03:18] NLW: The, the timescale might be off.

[01:03:22] swyx: Yeah. I mean, for me, I think the progression in vertical agent companies will keep going. We just had, the other day, Klarna talking about how they replaced like 700 of their customer support agents with AI agents. That's just the beginning, guys. Like, imagine this rolling out across most of the Fortune 500.

[01:03:43] swyx: This is, and I'm not saying this is like a utopian scenario, there will be very, very embarrassing and bad outcomes of this, where like, humans would never make this mistake, but AIs did, and like, we'll all laugh at it, or we'll be very offended by whatever, you know, bad outcome it did. So we have to be responsible and careful in the rollout, but yeah, this is, it's rolling out, you know, Alessio likes to say that this year's the year of AI in production.

[01:04:04] swyx: Let's see it, let's, let's see all these sort of vertical, full stack employees. Come out into the workforce. Love

[01:04:11] Alessio: it.

[01:04:11] NLW: All right, guys. Well, thank you so much for sharing your thoughts and insights here, and I can't wait to do it again.

[01:04:18] Thursday Nights in AI - swyx

[01:04:18] NLW: Welcome

[01:04:19] swyx: back again. It's Charlie, your AI co-host. We're now in part two of the special weekend episode, collating some of Swyx and Alessio's recent appearances. If you're not active in the Latent Space Discord, you might not be aware of the many, many, many in person

[01:04:36] swyx: events we host gathering our listener community all over the world. You can see the Latent Space community page for how to join, and subscribe to our event calendar for future meetups. We're going to share some of our recent live appearances in this next part, starting with the Thursday Nights in AI meetup, a regular fixture in the SF AI scene run by Imbue and Outset Capital.

[01:04:59] swyx: Primarily our former guest Kanjun Qiu, Ali Rohde, and Josh Albrecht. Here's Swyx.

[01:05:08] swyx: Today, for those of you who have been here before, you know the general format. So we'll do a quick fireside Q&A with Swyx, where we're asking him the questions. Then we'll actually go to our rapid fire Q&A, where we're asking really fast, hopefully spicy, questions. And then we'll open it up to the audience for your questions.

[01:05:25] swyx: So you guys sneak around the room, submit your questions, and we'll go through as many of them as possible during that period. And then actually, Swyx brought a gift for us, which is two Latent Space AI Engineer t-shirts. And those will be awarded to the two spiciest question askers.

[01:05:44] swyx: So and I'll let Josh decide on that. So if we want to get your spiciest takes, please send them in during the event as we're talking and then also at the end. All right. With that, let's get going.

[01:05:57] NLW: Okay. Welcome, Swyx. Thank you for that

[01:06:01] swyx: intro.

[01:06:01] NLW: How does it

[01:06:01] swyx: feel to be interviewed

[01:06:03] NLW: rather than the interviewer?

[01:06:04] swyx: Weird. I don't know what to do in this chair. Yeah. Like,

[01:06:07] NLW: where should I put my hands? Yeah, exactly. You look good.

[01:06:10] swyx: You look good. And I also love asking follow up questions. And I tend to, like, sort of take over panels a lot. If you ever see me on a panel, I tend to ask the other panelists questions.

[01:06:18] swyx: Okay.

[01:06:19] NLW: So we should be ready is what you're saying. So you back.

[01:06:21] swyx: That's fine. This is like a free Imbue interview, so why not? That's right. That's right. That's

[01:06:24] NLW: right.

[01:06:25] swyx: Yeah, so you interviewed Kanjun, the CEO, you didn't interview Josh, right? No, no. So maybe tonight. Yeah. Okay. We'll see. We'll look for different questions and look for an alignment.

[01:06:35] NLW: I love it. All

[01:06:36] swyx: right. I just want to hear this story. You know, you've completely exploded Latent Space and AI Engineer, and I know you also, before all of that, had exploded in popularity for your learning in public movement and your dev tools and developer relations work. So, who are you and how did you get here?

[01:06:53] swyx: Let's

[01:06:53] NLW: start with that.

[01:06:54] swyx: Quick story is, I'm Shawn, I'm from Singapore. Swyx is my initials. For those who don't know, a lot of Singaporeans are ethnically Chinese, and we have Chinese names and English names, so it's just my initials. Came to the US for college, and have been here for about 15 years; like half of that was in finance and then the other half was in tech.

[01:07:13] swyx: And the, and tech is where I was most known just because I realized that I was much more aligned towards learning in public, whereas in finance, Everything's a trade secret. Everything is zero sum. Whereas in tech, like, you're allowed to come to meetups and conferences and share your learnings and share your mistakes even.

[01:07:31] swyx: And that's totally fine. You, like, open source your code. It's totally fine. And even better, you contribute PRs to other people's code. And I found that I thrived in that learning in public environment, and that kind of got me started. I was an early developer relations hire at Netlify and then did the same at AWS, Temporal, and Airbyte.

[01:07:53] swyx: And so that's like the whole story. I can talk more about developer tooling and developer relations if that's something that people are interested in. But I think the more recent thing is AI. And I started really being interested in it mostly because the proximate cause of starting Latent Space was Stable Diffusion.

[01:08:10] swyx: When you could run a large model that could do well enough on your desktop, I was like, okay, this is something qualitatively very different. And that's when we started Latent Space, and you're like, this is something different, we have to talk about it on a podcast.

[01:08:25] swyx: There we go. Yeah. It wasn't a podcast for like four months. And then, I had been running a Discord for dev tools investors, 'cause I also invest in dev tools and I advise companies on dev tools, dev things. And I think it was the start of 2023 when Alessio and I were both like, you know, I think we need to get more tokens out of

[01:08:45] swyx: people, and I was running out of original sources to write about, so I was like, okay, I'll go get those original sources. And I think that's when we started the podcast. And I think it's just the chemistry between us, the way we spike in different ways, and also, honestly, the kind participation of the guests to give us their time.

[01:09:03] swyx: Like, you know, getting George Hotz was a big deal. And also shout out to Alessio for just cold emailing him, for booking some of our biggest guests. And I'm just working really hard to try to tell the story that people can use at work. I think that there's a lot of AI podcasts out there and a lot of AI kind of forums or fireside chats with no fire

[01:09:21] swyx: that always talk about AGI, like what's your AGI timeline, what's your p(doom). Very, very nice hallway conversations for freshman year, but not very useful for work, and, like, you know, practically making money and thinking about, you know, changing people's everyday lives. I think what's interesting is obviously you care about the existential safety of the human race.

[01:09:43] swyx: But in the meantime we gotta eat. So I think that's kind of Latent Space's niche. Like, we explicitly don't really talk about AGI. We explicitly don't talk about things that are, like, a little bit too far out. Like, we don't do a ton of robotics. We don't do a ton of, like, high frequency trading.

[01:10:00] swyx: There's tons of machine learning in there, but we just don't do that. Because, like, we're like, all right, what are most software engineers gonna, gonna need? Because that's our background, and that's the audience that we serve. And I think just, like, being really clear on that audience has been, has resonated with people.

[01:10:12] swyx: Yeah, you would never expect a technical podcast to reach, like, a general audience, like top ten on the tech charts, but, you know, I've been surprised by that before, and it's been successful. I don't know what to say about that. I think honestly, I kind of have this negative reaction towards being classified as a podcast, because the podcast is downstream of ideas.

[01:10:35] swyx: And it's one mode of conversation, it's one mode of idea delivery, but you can deliver ideas on a newsletter, in person like this there's so many different ways. And so I think, I think about it more as we are trying to start or serve an industry, and that industry is the AI engineer industry, which is, which we can talk about more.

[01:10:53] swyx: Yes, let's go into that. So the AI engineer, you penned a piece called The Rise of the AI Engineer, you tweeted about it, Andrej Karpathy also responded, largely agreeing with what you said. What is an AI engineer? The AI engineer is the software engineer building with AI, enhanced by AI, and eventually it will be non human engineers writing code for you, which I know Imbue is all about.

[01:11:18] swyx: You're saying eventually the AI engineer will become a non human engineer? That will be one kind of AI engineer that people are trying to build, and is probably the furthest away in terms of being reality, because it's so hard. Got it. But there are three types of AI engineer, and I just went through the three.

[01:11:33] swyx: One is AI enhanced, where you use AI products like Copilot and Cursor. And two is the AI products engineer, where you expose AI capabilities to the end user as a software engineer, like, not doing pre-training, not being an ML researcher, not being an ML engineer, but just interacting with foundation models and probably APIs from foundation model labs.

[01:11:54] swyx: What's the third one? And the third one is the non human AI engineer. Got it. The fully autonomous AI engineer, the dream, you know, coder. How long do you think it is till we get to, like, early, early versions? This is my equivalent of AGI timelines. I know, I know. You set yourself up for this. So, I mean, I have supported companies actively working on that.

[01:12:13] swyx: I think it's more useful to think about levels of autonomy. And so my answer to that is, you know, perpetually five years away until it figures it out. No, but my actual anecdote, the closest comparison we have to that, is self driving. We're doing this in San Francisco, for those who are watching the live stream.

[01:12:32] swyx: If you haven't come to San Francisco and taken a Waymo ride, just come, get a friend, take a Waymo ride. I remember in 2014 we covered a little bit of autos in my hedge fund, and I remember telling a friend, I was like, self driving cars are around the corner, this is it, you know, parking will be a thing of the past. And it didn't happen for the next 10 years.

[01:12:52] swyx: And, and, but now we, now, like, most of us in San Francisco can, can take it for granted. So I think, like, you just have to be mindful that the, the, the, the rough edges take a long time. And like, yes, it's going to work in demos, then it's going to work a little bit further out and it's just going to take a long time.

[01:13:08] swyx: The more useful mental model I have is sort of levels of autonomy. So in self driving, you have level 1, 2, 3, 4, 5 just the amount of human attention that you get. At first, like, your, your, your hands are always on 10 and 2 and you have to pay attention to the, to, to the driving every 30 seconds and eventually you can sleep in the car, right?

[01:13:25] swyx: So there's a whole spectrum of that. So what's the equivalent for that for, for coding? Keep your hands on the keyboard and then eventually you've kind of gone off. You tab to accept everything. Where are we? Oh, that's good, yeah. Yeah. Doesn't that already happen? Yeah. Approve the PR. Approve, this looks good.

[01:13:39] swyx: That's the dream that people want. Really, you unlock a lot of coding when non technical people can file issues, and then the AI engineer can automatically write code, pass your tests, and if it works as intended, as advertised, then you can just merge it, and then you, you know, 10x, 100x the number of developers in your company immediately.

[01:14:00] swyx: So that's the goal, that's the holy grail. We're not there yet, but Sweep, Codegen, there's a bunch of companies, Magic probably, are all working towards that. And so the TLDR, the thing that Alessio and I covered in the January recap that we did, was that the basic split that people should have in their minds is the inner loop versus the outer loop for the developer.

[01:14:21] swyx: Inner loop is everything that happens in your IDE between Git commits. And outer loop is happens, is what happens when you push up your Git commit to GitHub, for example, or GitLab. And that's a nice split, which means like everything local, everything that needs to be fast is for everything that's kind of very hands on for developers.

[01:14:37] swyx: It's probably easier to automate or easier to have code assistance. That's what Copilot is, that's what, that's what all those things are. And then everything that happens autonomously when you're effectively away from the keyboard with like a GitHub issue or something that is more outer loop where you're you know, you're relying a lot more on autonomy and we are maybe, our LLMs are maybe not smart enough to do that yet.

[01:14:57] Alessio: Do you have any thoughts on

[01:14:58] swyx: kind of

[01:14:58] Alessio: the user experience and how that will change? One of the things

[01:15:01] swyx: that has happened for me, kind of looking at some of these products and playing around with things ourselves, like, You know, it sounds good to have an automated PR, then you get an automated PR and you're like, I really don't want to review like 300 lines of generated code, and like find the bug in it.

[01:15:13] swyx: Well then you have another agent that's a reviewer. That's right, but then you like tell it like, Oh, go fix it, and it comes back with 400 lines. Yes, there is a length bias to code, right? And you do have higher passing rates. In PRs. This is a documented human behavior thing, right? Send me two lines of code, I will review the s**t out of that.

[01:15:33] swyx: I don't know if I can swear on this. Send me 200 lines of code, looks good to me. Right? Guess what? The agents are going to be perfectly happy to copy that behavior from us, when we actually want them to do the opposite. So yeah, I think that the GAN model of code generation is probably not going to work super well.

[01:15:50] swyx: I do think we probably need just better planning from the start. Which is, I'm just repeating the Imbue thesis, by the way. Just go listen to Kanjun talk about this. She's much better at it than I am. But yeah, I think the code review thing is going to be, I think, what Codium, there are two Codiums, the Israeli one.

[01:16:10] swyx: The Israeli Codium. With the E. Yeah, Codium with the E. They still have refused to rename. I'm friends with both of them. Every month I'm like, You're like, guys, let's

[01:16:18] NLW: all come to one room. Yeah,

[01:16:19] swyx: like, you know, someone's got to fold. Codium with the E has gone, like, you've got to write the test first. Right?

[01:16:25] swyx: You write the... it's a sort of tripartite relationship. Again, this was also covered on a podcast with them, which is fantastic; through me you can sort of interview the past avatars. I've been watching the Netflix show, by the way, it's fantastic. But so Codium has already thought this all the way through.

[01:16:41] swyx: They're like, okay, you write the user story, from the user story you generate all the tests, you also generate the code, and if you update any one of those, they all have to update together. Right? And probably the critical factor is the test generation from the story, because everything else can just kind of bounce its head off of those tests until they pass.

[01:17:01] swyx: So you have to write good tests. It's kind of like the eat your vegetables of coding, right? Which nobody really wants to do. And so I think it's a really smart tactic to go to market by saying we automatically generate tests for you and, you know, start not great, but then get better. And eventually you get to the weakest point in the chain for the entire loop of code generation.
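
To make that loop concrete, here is a minimal sketch of the story-to-tests-to-code workflow, with the generated tests as the fixed target. The llm() helper and its prompts are hypothetical placeholders rather than Codium's actual product or API, and pytest is assumed to be installed.

```python
import subprocess
import tempfile
from pathlib import Path

def llm(prompt: str) -> str:
    """Placeholder for any chat-completion call; wire up your provider of choice."""
    raise NotImplementedError

def run_tests(code: str, tests: str) -> subprocess.CompletedProcess:
    """Write the generated code and tests to a scratch dir and run pytest there."""
    with tempfile.TemporaryDirectory() as d:
        Path(d, "app.py").write_text(code)
        Path(d, "test_app.py").write_text(tests)
        # assumes pytest is installed in the current environment
        return subprocess.run(
            ["python", "-m", "pytest", d, "-q"], capture_output=True, text=True
        )

def story_to_code(user_story: str, max_iters: int = 5) -> str:
    # 1. The critical step: derive the tests from the user story.
    tests = llm(f"Write pytest tests (importing from app) for this user story:\n{user_story}")
    # 2. Draft an implementation.
    code = llm(f"Write app.py satisfying this user story:\n{user_story}")
    # 3. Bounce the code off the tests until they pass (or we give up).
    for _ in range(max_iters):
        result = run_tests(code, tests)
        if result.returncode == 0:
            return code
        code = llm(
            "The tests failed. Rewrite app.py so they pass.\n"
            f"Current code:\n{code}\n\nTest output:\n{result.stdout}{result.stderr}"
        )
    raise RuntimeError("Tests still failing after max_iters; needs a human.")
```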

[01:17:25] swyx: What do you think the weakest link is? The weakest link? Yeah. It's test generation. Do you think there's a way to, like, are there some promising

[01:17:33] Alessio: avenues you see forward for making that actually better?

[01:17:38] swyx: For making it better. You have to have good isolation, and I think proper serverless cloud environments are integral to that.

[01:17:48] swyx: It could be like a fly.io, it could be like a Cloudflare Worker. It depends how many resources your test environment needs. And effectively, I was talking about this, I think with maybe Rob earlier in the audience, where every agent needs a sandbox. If you're a code agent, you need a coding sandbox, but if you're whatever else, like Imbue used to have this sort of Minecraft clone that was much faster.

[01:18:12] swyx: If you have a model of the real world, and you have to go generate some plan or some code or some whatever, you test it against that real world so that you can get this iterative feedback, and then get the final result back, somewhat validated against the real world. And so you need a really good sandbox.
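
As a rough illustration of the sandbox point (not any particular vendor's product): even a local subprocess with a timeout and its own scratch directory gives an agent the execute-observe-retry loop described above. A real deployment would swap this for a container, microVM, or serverless worker per run, which is the fly.io / Cloudflare Workers idea.

```python
import subprocess
import tempfile
from dataclasses import dataclass

@dataclass
class SandboxResult:
    returncode: int
    stdout: str
    stderr: str

def run_in_sandbox(code: str, timeout_s: float = 10.0) -> SandboxResult:
    """Run generated code in a scratch directory with a hard timeout.
    A subprocess is NOT real isolation; a production agent would want a
    container or microVM per run, as discussed above."""
    with tempfile.TemporaryDirectory() as workdir:
        try:
            proc = subprocess.run(
                ["python", "-I", "-c", code],  # -I: isolated mode, no user site-packages
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
            return SandboxResult(proc.returncode, proc.stdout, proc.stderr)
        except subprocess.TimeoutExpired:
            return SandboxResult(-1, "", "timed out")

# The agent loop then becomes: propose -> run_in_sandbox -> read stderr -> revise.
print(run_in_sandbox("print(sum(range(10)))").stdout)  # prints 45
```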

[01:18:26] swyx: I don't think people realize this is an infrastructure need that humans

[01:18:31] swyx + Josh Albrecht: have had for a long time. We've never solved it for ourselves, and now we have to solve it for agents, at maybe a thousand times larger quantity than the humans that actually exist. And so I think we eventually have to evolve a lot more infrastructure

[01:19:05] swyx + Josh Albrecht: in order to serve these things. So yeah. For those who don't know, beyond the rise of the AI engineer stuff we're talking about, I also have previous conversations about immutable infrastructure, cloud environments, and that kind of stuff, and this is all of a kind. In order to solve agents and coding agents, we're going to have to solve the other stuff too along the way.

[01:19:05] swyx + Josh Albrecht: And it's really neat for me to see all that tie together with my DevTools work, that all these themes kind of reemerge just naturally, just because everything we needed for humans, we just need a hundred times more of for agents.

[01:19:17] Dylan Patel: Let's talk about the AI engineer. AI engineer has become a whole thing.

[01:19:21] Dylan Patel: It's become a term and also a conference. And tell us more, and a job title, tell us more about that. What's going on there?

[01:19:31] swyx + Josh Albrecht: That is like a very vague, a very, very big cloud of things. I would just say like, I think it's an emergent industry. I've seen this happen repeatedly for, so the general term is software engineer.

[01:19:44] swyx + Josh Albrecht: Programmer. In the 70s and 80s, there would not be like senior engineer. There would just be engineering. Like you, or you, I don't think they even call themselves engineer. They don't have that. What about a member of the technical staff? Oh, yeah, MTS. Very, very, very, very elite. But yeah, so like, you know, like these striations appear when the population grows and the technical depth grows over time.

[01:20:07] swyx + Josh Albrecht: Yeah. When it starts, when it ends. Not that, not that important, and then over time it's just gonna specialize. And I've seen this happen for frontend, for DevOps, for data and I can't remember what else I listed in, in that, in that piece, But those are the main three that I was around for. And I, I see this, I saw this happening for AI engineer which is effectively, now a lot of people are arguing that there is the ML researcher, the ML engineer, who sort of pairs with the researcher sometimes they also call research engineer and then on the other side of the fence is just software engineers.

[01:20:35] swyx + Josh Albrecht: And that's how it was up till about last year. And now there's this specializing and rising class of people building AI-specific software who are not any of those previous titles that I just mentioned. And that's the thesis of the AI engineer, that this is an emerging category of jobs at AI startups and beyond. I've had people from Meta, IBM, Microsoft, OpenAI tell me that their title is now AI engineer.

[01:20:58] swyx + Josh Albrecht: They're hiring AI engineers. So I can see that this is a trend, and I think that's what Andrej called out in his post: just mathematically, the limitations in terms of research talent and GPUs mean that all of these will tend to concentrate in a few labs, and everyone else is just going to have to rely on them or build product differentiation in other ways. And those will be AI engineers.

[01:21:21] swyx + Josh Albrecht: So mathematically there will be more AI engineers than ML engineers. It's just the truth. Right now it's the other way; right now the number of AI engineers is maybe 10x less. So I think that the ratio will invert, and I think the goal of Latent Space and the goal of the conference and anything else I do is to serve that

[01:21:38] Dylan Patel: growing audience.

[01:21:41] Dylan Patel: To make the distinction clear, if I'm a software engineer And I'm like, I want to become an AI engineer. What do I have to learn? Like, what additional capabilities does that type of engineer have? Funny you say that. I think you have a blog post on this very

[01:21:53] swyx + Josh Albrecht: topic. I don't actually have a specific blog post on how to, like, change classes.

[01:21:58] swyx + Josh Albrecht: I do think I always think about these in terms of, yeah, Baldur's Gate and, you know, D&D rule set 5.1 or whatever. But yeah, I kind of intentionally left that open to leave space for others. I think when you start an industry, the specifications that work the best are minimally defined, so that other people can fill in the blanks.

[01:22:19] swyx + Josh Albrecht: And I want people to fill in the blanks. I want people to disagree with me and with themselves so that we can figure this out as a group. I don't want to over-specify everything; that's the only way to guarantee that it will fail. I do have a take, obviously, because a lot of people are asking me where to start.

[01:22:37] swyx + Josh Albrecht: And I think basically, what we have is Latent Space University. We just finished working on day seven today. It's a seven-day email course, and it is completely designed to answer the question of: okay, I'm an existing software engineer, I kind of know how to code, but I don't get all this AI stuff, I've been living under a rock, or it's just too overwhelming for me, you have to pick for me, or curate for me as a trusted friend.

[01:22:59] swyx + Josh Albrecht: And I have one hour a day for seven days: what do you slot into that bucket? So for us, it's making LLM API calls, it's image generation, it's code generation, it's audio ASR. What's ASR? Audio speech recognition?

[01:23:18] swyx + Josh Albrecht: Yeah, yeah. And then I forget, I forget what the fifth and sixth one is, but the last day is agents. And, and so basically, like, I'm just like, you know, Here are seven projects that you should do to feel like you can do anything in AI. You can't really do everything in AI just from, just from that small list.

[01:23:34] swyx + Josh Albrecht: But I think, just like anything, you have to go through a set list of things, basic skills that I think everyone in this industry should be at least conversant in. If a boss comes to you and goes, hey, can we build this? Right now you don't even know if the answer is no.

[01:23:52] swyx + Josh Albrecht: So I want you to move from unknown unknowns to at least known unknowns. And I think that's where you start being competent as an AI engineer. So yeah, that's LSU, Latent Space University, just to trigger the Tigers.

[01:24:06] Dylan Patel: So do you think in the future that people, an AI engineer is going to be someone's full time job?

[01:24:10] Dylan Patel: Like people are just going to be AI engineers? Or do you think it's going to be more of a world where I'm a software engineer, and 20 percent of my time I'm using OpenAI's APIs, working on prompt engineering and stuff like that, and using

[01:24:23] swyx + Josh Albrecht: Copilot. You just reminded me of Day 6's open source models and fine-tuning.

[01:24:27] swyx + Josh Albrecht: Perfect. I think it will be a spectrum. That's why I don't want to be like too definitive about it. Like we have full time front end engineers and we have part time front end engineers and you dip into that community whenever you want. But wouldn't it be nice if there was a collective name for that community so you could go find it?

[01:24:40] swyx + Josh Albrecht: You can find each other. And honestly, that's really it. A lot of companies were pinging me, like, hey, I want to hire this kind of person; you can't hire that exact person yet, but I want someone like that. And then people on the labor side were pinging me going, okay, I want to do more in this space, but where do I go?

[01:24:56] swyx + Josh Albrecht: And I think just having that Schelling point of what the industry title and name is, and then building out that mythology and community and conference, is helpful, hopefully. And I don't have any prescriptions on whether or not it's a full time job. I do think, over time, it's going to become more of a full time job.

[01:25:14] swyx + Josh Albrecht: And that's great for the people who want to do that and the companies that want to employ that. But it's absolutely, like, you can take it part time, like, you know, jobs come in many formats. Yep, yep, that

[01:25:23] Dylan Patel: makes sense. Yeah. And then you have a huge world fair coming up. Yeah. Tell me about that. So,

[01:25:31] swyx + Josh Albrecht: Part of what creating an industry requires, I think, is letting people gather in one place.

[01:25:37] swyx + Josh Albrecht: And also for me to get high quality talks out of people. You have to create an event out of it. Otherwise they don't do the work. So so last year we did the AI Engineer Summit, which went very well. And people can see that online and we're, we're, we're very happy with how that turned out.

[01:25:53] swyx + Josh Albrecht: This year we want to go four times bigger with the World's Fair and try to reflect AI engineering as it is in 2024. I always admired two conferences in this respect. One is NeurIPS, which I went to last year and documented on the pod, which was fantastic. And two is KubeCon, from the other side of my life, which is the sort of cloud orchestration and DevOps world.

[01:26:18] swyx + Josh Albrecht: So NeurIPS is the one place that you go to... I think it's the top conference, I mean, there are others that you can kind of consider. But NeurIPS is where the research scientists are the stars. The researchers are the stars, PhDs are the stars, mostly it's just PhDs on the job market, to be honest.

[01:26:34] swyx + Josh Albrecht: It's really funny

[01:26:35] Dylan Patel: to go, especially these days. Yeah, it

[01:26:37] swyx + Josh Albrecht: was really funny to go to NeurIPS and see that. And the VCs trying to back them. Yeah, there are lots and lots of VCs trying to back them this year. Anyway, so at NeurIPS, research scientists are the stars, and for the AI Engineer conference I wanted engineers to be the stars.

[01:26:51] swyx + Josh Albrecht: Right, to show off their tooling and their techniques and their difficulty moving all these ideas from research into production. The other one was KubeCon, where, You could honestly just go and not attend any of the talks and just walk the floor and figure out what's going on in DevOps, which is fantastic.

[01:27:10] swyx + Josh Albrecht: Because, yeah, that curation and that bringing together of an industry is what I'm going for with the conference. And it's coming in June. The most important thing, to be honest, when I conceived of this whole thing, was to buy the domain. So we got ai.engineer. People are like, .engineer is a domain?

[01:27:27] swyx + Josh Albrecht: Yeah, and funny enough, engineer was cheaper than engineering. I don't understand why, but like that's up to the domain people.

[01:27:36] Dylan Patel: Josh, any questions on agents?

[01:27:38] Alessio: Yeah,

[01:27:39] swyx + Josh Albrecht: I think maybe, you know, you have a lot of experience and exposure talking to all these companies and founders and researchers and everyone that's on your podcast. Do you feel like you have a good perspective on some of the technical issues you've seen? Like we were just talking about for coding agents, how the value of tests is really important. There are other things, like retrieval: now we have these models coming out with a million tokens of context length, or ten million. Is retrieval going to matter anymore? Do huge contexts matter? What do you think?

[01:28:14] swyx + Josh Albrecht: Specifically about the long context thing? Sure, yeah. Because you asked a more broad question. I was going to ask a few other ones after that, so go for that one first. Yeah. That's what I was going to ask first. We can ask, yeah, okay, let's talk about long context and then the other stuff. So, for those who don't know, LongContext was kind of in the air last year, but really, really, really came into focus this year.

[01:28:33] swyx + Josh Albrecht: With Gemini 1.5 having a million token context, and saying that it was in research for 10 million tokens. And that means you no longer really have to think about what you retrieve, sorry, what you have to put into context.

[01:28:50] swyx + Josh Albrecht: You can just kind of throw the entire knowledge base in there, or books, or film, anything like that, and that's fantastic. A lot of people are thinking that it kills RAG, and I think, one, that's not true, for cost reasons alone: you still pay per token, so Google is perfectly happy to let you pay for a million tokens every single time you make an API call, but good luck having a hundred dollar API call.
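
The back-of-envelope math behind that point, with illustrative prices only, since real per-token rates vary by model and change often:

```python
# Rough cost sketch: stuffing a whole corpus into context on every call versus
# retrieving a few relevant chunks. The price here is a placeholder, not any
# provider's actual rate card.
PRICE_PER_M_INPUT_TOKENS = 7.00   # assumed $ per 1M input tokens
FULL_CONTEXT_TOKENS = 1_000_000   # throw the whole knowledge base in every time
RAG_CONTEXT_TOKENS = 8_000        # prompt plus a handful of retrieved chunks

def cost_per_call(tokens: int, price_per_million: float) -> float:
    return tokens / 1_000_000 * price_per_million

full = cost_per_call(FULL_CONTEXT_TOKENS, PRICE_PER_M_INPUT_TOKENS)
rag = cost_per_call(RAG_CONTEXT_TOKENS, PRICE_PER_M_INPUT_TOKENS)
print(f"full-context call: ${full:.2f}  RAG-style call: ${rag:.3f}  ratio: ~{full / rag:.0f}x")
# At these assumed prices: $7.00 vs $0.056 per call, roughly 125x, and that's
# before the latency of prefilling a million tokens on every request.
```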

[01:29:12] swyx + Josh Albrecht: And then the other thing: it's going to be slow. No explanation needed. And then finally, my criticism of long context is that it's also not debuggable. If something goes wrong with the result, you can't do the RAG-style decomposition of where the source of error is. You just have to go, like, it's in the weights, bro.

[01:29:29] swyx + Josh Albrecht: Like, it's somewhere in there. Sorry. I pretty strongly agree with this. Why do you think people are making such crazy long context windows? People love to kill RAG, right? You can't kill it, though, because long context is too expensive. It's so expensive, like you said. Yeah, I just call it a different dimension. I think it's an option that's great when it's there: when I'm prototyping, I do not ever want to worry about context, I'm going to call stuff a few times, and I don't want to run into errors or have to set up a complex retrieval system just to prototype something. But once I'm done prototyping, then I'll worry about all the other RAG stuff, and yes, I'm going to buy some system or build some system or whatever to go do that.

[01:30:02] swyx + Josh Albrecht: So I think it's just an improvement in one dimension that you need, but the improvements in the other dimensions also matter. It's all needed; this space is just going to keep growing in unlimited fashion. I do think that this, combined with multimodality, does unlock new things.

[01:30:21] swyx + Josh Albrecht: So that's what I was going to ask about next: how important is multimodality? Like, great, generating videos, sure, whatever. How many of us need to generate videos that often? It'd be cool for TV shows, sure. I think it's pretty important. And the one thing is, when we launched the Latent Space podcast, we listed a bunch of interest areas.

[01:30:37] swyx + Josh Albrecht: So one thing I love about being explicit or intentional about our, our work is that you list the things that you're interested in and you, you list the things that you're not interested in. And people are very unwilling to, to, to have an anti interest list. One of the things that we were not interested in was multimodality last year.

[01:30:55] swyx + Josh Albrecht: Because everyone was on it, and I was just like, okay, you can generate images and they're pretty, but it's not a giant business. I was wrong. Midjourney is a giant, massive business that no one can understand or get into. But also I think being able to natively understand audio and video and code matters.

[01:31:12] swyx + Josh Albrecht: I consider code a special modality. All that is very, like, qualitatively different than translating it into English first and using English as, I don't know, like a bottleneck or pipe and then you know, applying it in LLMs. Like the ability of LLMs to reason across modalities gives you something more than you could, you know, Individually by, by, by using text as the universal interface.

[01:31:33] swyx + Josh Albrecht: So I think that's useful. So concretely, what does that mean? I think the reference post that everyone should have in their head is Simon Willison's post on Gemini 1.5's video capability, where he basically shot a video of his bookshelf, just kind of scanning through it.

[01:31:50] swyx + Josh Albrecht: And it was able to give back a complete JSON list of the books and the authors and all the details that were visible there. It hallucinated some of it, which is another issue, but I think it just unlocks this use case that you would not even try to code without native video understanding capability.
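
A sketch of the shape of that pipeline: sample frames, hand them to a multimodal model, and ask for structured JSON back. The describe_frames call below is a stand-in for whichever multimodal API you use, not anyone's actual SDK; OpenCV's frame sampling is the only concrete dependency.

```python
import base64
import json
import cv2  # pip install opencv-python

def sample_frames(video_path: str, every_n: int = 30) -> list[bytes]:
    """Grab every Nth frame of a video as JPEG bytes."""
    cap = cv2.VideoCapture(video_path)
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            ok_enc, buf = cv2.imencode(".jpg", frame)
            if ok_enc:
                frames.append(buf.tobytes())
        i += 1
    cap.release()
    return frames

def describe_frames(frames: list[bytes], prompt: str) -> str:
    """Placeholder for a multimodal model call: send base64-encoded frames plus
    the prompt to whichever provider you use, and return its text response."""
    _payload = [base64.b64encode(f).decode() for f in frames]
    raise NotImplementedError("wire up your multimodal provider here")

frames = sample_frames("bookshelf.mp4")
raw = describe_frames(
    frames,
    'List every book visible. Return JSON: [{"title": ..., "author": ...}]. '
    "Only include books whose spines you can actually read.",
)
books = json.loads(raw)  # then spot-check for hallucinated entries, as noted above
```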

[01:32:08] swyx + Josh Albrecht: And obviously, on a technical level, video is just a bunch of frames. So actually it's just image understanding, but image within the temporal dimension, which this month became much more of an important thing, like the integration of space and time in Transformers. I don't think anyone was really talking about that until this month, and now it's the only thing anyone can think about, for Sora and for all the other stuff.

[01:32:30] swyx + Josh Albrecht: The last thing I'll say, which goes against this trend of every modality being important, just do all the modalities: I kind of agree with Nat Friedman, who pointed this out just before the Gemini thing blew up this month. Why is it that OpenAI is pushing DALL-E so hard?

[01:32:48] swyx + Josh Albrecht: Why is Bing pushing Bing Image Creator? It's not apparent that you have to create images to create AGI. But every lab just seems to want to do this, and I kind of agree that it's not on the critical path, especially for image generation. Maybe image understanding, video understanding.

[01:33:04] swyx + Josh Albrecht: Yeah, consumption. But generation, eh. Maybe we'll be wrong next year. It just catches you a bunch of flack with like, you know, culture war things. Alright, we're going to

[01:33:14] Dylan Patel: move into rapid fire Q& A, so we're going to ask you questions. We've cut

[01:33:26] Dylan Patel: the Q& A section for time, so if you want to hear the spicy questions, head over to the Thursday Nights in AI video for the full discussion.

[01:33:34] Dylan Patel - Semianalysis + Latent Space Live Show

[01:33:34] AI Charlie: Next up, we have another former guest, Dylan Patel of Semianalysis, the inventor of the GPU Rich/Poor divide, who did a special live show with us in March.

[01:33:51] Alessio: But that means you can finally, like, side by side A/B test your favorite Boba shops?

[01:33:51] Alessio: We got Gong Cha, we got Boba Guys, we got the Lemon, whatever it's called. So, let us know what's your favorite. We also have Slido up to submit questions. We already had Dylan on the podcast, and like, this guy tweets and writes about all kinds of stuff. So we want to know what people want to know more

[01:34:07] Alessio: about.

[01:34:08] Alessio: Rather than just being self-driven. But we'll do a state of the union, maybe? I don't know. Everybody wants to know about Groq. Everybody wants to know whether or not NVIDIA is going to zero after Groq. Everybody wants to know what's going on with AMD. We got some AMD folks in the crowd, too.

[01:34:23] Alessio: So feel free to interact at any time. This is We have

[01:34:27] swyx + Josh Albrecht: portable mics.

[01:34:27] Dylan Patel: Heckle, please. What do you... sorry. Good comedians show their colors with the way they can handle the crowd when they're heckled.

[01:34:35] Alessio: Do not throw Boba. Do not throw Boba at this end. We cannot afford another podcasting setup. Awesome.

[01:34:41] Alessio: Well, welcome everybody to the Semianalysis and Latent Space crossover. Dylan texted me on Signal. He was like, dude, how do I easily set up a meetup? And here we are today. Well, as you might have seen, there's no name tags. There's a bunch of things that are missing. But we did our

[01:34:55] Dylan Patel: best. It was extremely easy, right?

[01:34:58] Groq

[01:34:58] Dylan Patel: Like, I text Alessio. He's like, yo, I got the spot. Okay, cool. Thanks Here's a link. Send it to people. Sent it. And then showed up. And like, there was zero other organization that I required. So

[01:35:10] Alessio: everybody's here. A lot of Semianalysis fans in the crowd. Everybody wants to know more about what's going on today, and Groq has definitely been the hottest thing.

[01:35:19] Alessio: We just recorded our monthly podcast today, and we didn't talk that much about Groq because we wanted you to talk more about it, and then we'll splice you into our monthly recap. So, let's start there.

[01:35:29] swyx + Josh Albrecht: Okay, so, you guys are the new Groq spreadsheeters. Yeah, so we broke out some Groq numbers, because everyone was wondering. There's two things going on, right?

[01:35:37] swyx + Josh Albrecht: One, how does it achieve the inference speed that it does, which has been demonstrated by GroqChat? And two, how does it achieve its price promise, the public pricing of 27 cents per million tokens? And there's been a lot of speculation, some numbers thrown out there.

[01:35:55] swyx + Josh Albrecht: I put out some tentative numbers and you put out different numbers, but I'll just lay that out as the groundwork. Everyone's very excited about essentially five times faster token generation than any other LLM currently, and that unlocks interesting downstream possibilities if it's sustainable, if it's affordable.

[01:36:14] swyx + Josh Albrecht: And so I think your question, or reading your piece on Groq, which is on the screen right now: is it sustainable?

[01:36:21] Dylan Patel: So like many things, this is VC funded, including this Boba. No, I'm just kidding, I'm paying for the Boba, but thank you, Semianalysis

[01:36:29] swyx + Josh Albrecht: subscribers

[01:36:31] Alessio: I hope he pays for it, I pay for it right now That's

[01:36:33] Dylan Patel: true, that's true Alessio has the IOU, right?

[01:36:36] Dylan Patel: And that's, that's all it is, but yeah, like many things, you know, they're, they're not making money off of their inference service, right? They're just throwing it out there for cheap and hoping to get business and maybe raise money off of that, and I think that's a that's a fine use case, but the question is, like, how much money are they losing?

[01:36:53] Dylan Patel: Right, and that's sort of what I went through breaking down in this article that's on the screen. And it's pretty clear they're like 7 to 10x off break-even on their inference API, which is horrendous, far worse than any other sort of inference API provider. So this is a simple cost comparison that was pulled up.

[01:37:15] Dylan Patel: You can either inference at very high throughput, or you can inference at very low latency.

[01:37:20] Dylan Patel: With GPUs, you can do both. With Groq, you can only do one. Of course, with Groq, you can do that one faster, marginally faster than an inference-latency-optimized GPU server. But no one offers inference-latency-optimized GPU servers because you would just burn money, right? Makes no economic sense to do so.

[01:37:36] Dylan Patel: Until maybe someone's willing to pay for that. So Groq's service, on the surface, looks awesome compared to everyone else's service, which is throughput optimized. And then when you compare to the throughput-optimized scenario, GPUs look quite slow, but the reality is they're serving, you know, 64 or 128 users at once.

[01:37:54] Dylan Patel: Right, they have a batch size, right? How many users are being served at once. Whereas Groq is taking 576 chips, and they're not really doing that efficiently: they're serving a far, far smaller number of users, but extremely fast. Now, that could be worthwhile if they can get the number of users they're serving at once up, but that's extremely hard, because they don't have off-chip memory, so they can't store the KV cache for all the various different users.
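
For a sense of why memory caps the batch size, here is the standard KV-cache sizing arithmetic with illustrative model numbers; these are assumptions for a 70B-class dense transformer, not a claim about Groq's internals.

```python
# KV cache per user = 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes
LAYERS   = 80      # a 70B-class dense transformer (illustrative)
KV_HEADS = 8       # grouped-query attention
HEAD_DIM = 128
SEQ_LEN  = 4_096
BYTES    = 2       # fp16/bf16

kv_per_user = 2 * LAYERS * KV_HEADS * HEAD_DIM * SEQ_LEN * BYTES
print(f"KV cache per concurrent user: {kv_per_user / 1e9:.2f} GB")  # ~1.34 GB

# A GPU with 80 GB of HBM holds the weights plus dozens of these caches, which
# is what allows a 64-128 user batch. An SRAM-only accelerator has on the order
# of a few hundred MB per chip (assumed figure below), so per-user caches have
# to be spread across many chips and the economical batch size stays small.
SRAM_PER_CHIP_GB = 0.23
print(f"chips of SRAM to hold one user's cache: {kv_per_user / 1e9 / SRAM_PER_CHIP_GB:.1f}")
```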

[01:38:21] Dylan Patel: And so the crux of the issue is just, hey, can they get that performance up as much as they claim they will? Which is, you know, they need to get it up more than 10x to make this a reasonable benefit. In the meantime, NVIDIA's launching a new GPU in two weeks, that'll be fun at GTC, and they're constantly pushing software as well, so we'll see if Groq can catch up to that.

[01:38:43] Dylan Patel: But the, the current verdict is, you know, they're, they're quite far behind, but it's hopeful, you know, that, that maybe they can get there by, you know, scaling their system larger. Yeah.

[01:38:52] swyx + Josh Albrecht: I was listening back to our original episode, and you were talking about how NVIDIA basically adopted this different strategy of just leaning on networking GPUs together.

[01:39:00] swyx + Josh Albrecht: And it seems like Groq has some minor version of that going on here with the Groq rack. Is it enough? Like, what's Groq's next step here,

[01:39:12] Dylan Patel: strategically? Yeah, that's the next step is, of course, you know, so, you know, So right now they connect 10 racks of chips together, right, and that's the system that's running on their API today, right.

[01:39:23] Dylan Patel: Whereas most people who are running, you know, Mistral are running it on two GPUs, right. So one fourth of a server. Yeah. And that rack is not you know, obviously 10 racks is pretty crazy, but they think that they can scale performance if they have this individual system be 20 racks, right? They think they can continue to scale performance extra linearly.

[01:39:42] Dylan Patel: So that'd be amazing if they could but I, I, I'm, I'm doubtful that that's gonna be something that's scalable especially for, for, you know, larger models. So there's the

[01:39:56] Alessio: chip itself, but there's also a lot of work they're doing at the compiler level. Do you have any good sense of, like, how easy it is to actually work with LPU?

[01:40:04] Alessio: Like, is that something that is going to be a bottleneck for them?

[01:40:07] Dylan Patel: So, so Ali's in the front right there, and he, he knows a ton about about VLIW architectures. But to summarize sort of his opinion, and I think many folks's, it's, it's extremely hard to To program these sorts of architectures, right?

[01:40:19] Dylan Patel: Which is why they have their compiler and so on and so forth. But, you know, it's, it's an incredible amount of work for them to stand up individual models and to get the performance up on them which is what they've been working on, right? Whereas, whereas, you know, GPUs are far more flexible, of course.

[01:40:33] Dylan Patel: And so the question is, you know, can they, can they can, can this compiler continue to extract performance? Well, theoretically, like there, there's a lot more performance to run on the hardware. But they don't have, you know, many, many things that people generally associate with, with programmable hardware.

[01:40:49] Dylan Patel: Right? They don't have buffers and, and many other things. So, so it makes it very tough to to do that. But that's what their, you know, their relatively large compiler team is working on. Yeah,

[01:40:58] swyx + Josh Albrecht: So I'm, I'm not a GPU compiler guy. But I do want to clarify my understanding from what I read. Which is a lot of catching up to do.

[01:41:05] swyx + Josh Albrecht: The crux of it, the word that comes to mind, is speculative routing of weights, and of the work that needs to be done, or scheduling of work across the 10 racks of chips. Is that the bulk of the benefit that you get from

[01:41:25] Dylan Patel: the compilation?

[01:41:26] Dylan Patel: So with the Groq chips, what's really interesting is that with GPUs, you can issue certain instructions and you will get a different result depending on the time. I know a lot of people in ML have had that experience, right? Where the GPU literally doesn't return the numbers it should.

[01:41:45] Dylan Patel: And that's basically called non-determinism, right? And with Groq, their chip is completely deterministic. The moment you compile it, you know exactly how long it will take to operate. There is no deviation at all.

[01:42:08] Dylan Patel: And there is no I don't know, I don't know what the best way to state this is. There's no variance there which is interesting from, like, when you look historically, they tried to push this into automotive, right? Because automotive, you know, you probably want your car to do exactly what you issued it to do.

[01:42:22] Dylan Patel: And not have, sort of, unpredictability. But yeah, I don't, sorry, I lost track of the question.
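
The non-determinism point is easy to demonstrate on the host side: floating-point addition is not associative, so whenever a parallel reduction combines partial sums in a scheduling-dependent order, as GPU atomics do, the low-order bits of the answer wobble. A quick sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

# Same numbers, different accumulation orders -> slightly different sums,
# because float addition is not associative. On a GPU, the order depends on
# thread scheduling, which is why identical kernels can return different
# low-order bits run to run; a fully compiled, statically scheduled chip
# fixes the order (and the timing) ahead of time.
sum_forward  = float(np.sum(x))
sum_shuffled = float(np.sum(x[rng.permutation(x.size)]))
sum_chunked  = float(sum(np.sum(c) for c in np.array_split(x, 7)))

print(sum_forward, sum_shuffled, sum_chunked)
print("max difference:", max(abs(sum_forward - sum_shuffled), abs(sum_forward - sum_chunked)))
```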

[01:42:28] swyx + Josh Albrecht: It's okay, I just wanted to understand a little bit more about what people should know about the compiler magic that goes on with Groq, from that intersection of software and hardware.

[01:42:44] Dylan Patel: So chips have, and I'm going to steal this from someone here in the crowd, but when you're designing a chip there's this thing called PPA, right?

[01:42:54] Dylan Patel: Power, performance, and area, right? So it's kind of a triangle that you optimize around. And the one thing people don't realize is there's a, there's a third P that's like PPAP. And the last P is pain in the ass to program. And, and that's that is very important for like. People making AI hardware, right?

[01:43:11] Dylan Patel: Like, the TPU, without the hundreds of people that work on the compiler, and JAX, and XLA, and all these sorts of things, would be a pain in the ass to program. But Google's got that plumbing. Now, if you look across the ecosystem, everything else is a pain in the ass to program compared to NVIDIA, and this applies to the Groq chip as well.

[01:43:31] Dylan Patel: So, yeah, question is, like, can the compiler team get performance up anywhere close to theoretical? And then, and then can they make it not a pain in the ass to support new models? Cool. We

[01:43:41] Alessio: got a question, we got a question from Ali. What's the average VLIW bundle occupancy of Grok? Bro,

[01:43:49] Dylan Patel: get out of here.

[01:43:52] Alessio: I don't know if he's setting you up, or if he

[01:43:54] Dylan Patel: wants to chime in. I think he's setting me up, I think he's setting me up. So, okay,

[01:43:58] swyx + Josh Albrecht: what is VLIW for

[01:44:00] Dylan Patel: the rest of us? It's, it's like very long instruction word is basically what it means. And, hm. So, so, GPUs are relatively simple, right? They're, they're tiny little cores, very simple instructions, there's a shitload of them, right?

[01:44:16] Dylan Patel: CPUs, you know, they have a, they have a known instruction set, right? x86. It's very complicated but people have worked on it for decades. VLIW processors are very unique in that sense, right? Like and your question, Ali, I cannot answer that question. I have no clue. Is it documented anywhere online?

[01:44:35] Dylan Patel: Anyway, so like the systolic array, right? Like there's, within the TPU, there's a bunch of stuff, but the actual matrix multiply unit, it's called the MXU, and it's a VLIW architecture as well. It's and I'm, I'm just trying to find a, yeah, I just want to find something that makes me not sound like an idiot.

[01:44:51] swyx + Josh Albrecht: Sometimes I also like to ballpark things in terms of like, like where a good middle median value should be and where like a good high value should be. Sorry. You, you, you

[01:45:03] Dylan Patel: can ballpark things, yeah. But basically the point is the trade-off: this is theoretically the most optimal architecture for performance, power, and area in a given envelope, and not specifically Groq, but VLIW in general is going to get you closer to optimal there. But then you're giving up that last P, which is pain in the ass to program. That's, I think, the simplest way to get into it.

[01:45:27] Dylan Patel: There's like, computer architecture books about this, but it's, it's, it's a little little, little complicated, right? Yeah.

[01:45:35] Alessio: Somebody asked, there's a lot of questions, that's great. Can we talk about LPUs, Cerebras, Tenstorrent, some of these other architectures? How should people think about max SRAM versus mixed memory versus

[01:45:49] Dylan Patel: Yeah, yeah.

[01:45:50] Dylan Patel: So there's a lot of ML hardware out there, new and old, right? There's old stuff that's trying to compete, there's new stuff that's coming up, companies like MatX and Lemurian Labs and so on and so forth. But there's a continuum: everyone before, say, two years ago that was doing ML hardware bet in one direction, right?

[01:46:11] Dylan Patel: We're gonna make an architecture that has more on-chip memory than NVIDIA. That was the general bet everyone made. And so Groq made that bet, and they made it to the extreme: they didn't have any off-chip memory at all, only on-chip memory. You have Cerebras, who did a similar thing, except they were like, yeah, we're gonna have on-chip memory, but we're gonna make a chip that's the size of a wafer.

[01:46:33] Dylan Patel: Right? Like literally this big. Whereas an NVIDIA chip is roughly this big, right? So it's like this big, it's the only chip in the world that's that big. But again, same bet. More on chip memory, less off chip, right? GraphCore and SambaNova made a similar bet. And, and every, basically everyone made that bet.

[01:46:49] Dylan Patel: Cause they thought that's where ML would go. Of course, models grew faster than anyone ever imagined. Yeah, than the memory that was possible. And so that, that very quickly became the wrong bet. And so now we're, you know, sort of seeing a new wave of startups that are going to bet on the other side, as well as many other, you know, architectural things because memory is not really the only architectural thing, of course.

[01:47:08] Dylan Patel: And so, like, where to, where to, like, place startups is, is very dependent on, like, Hey, what are you doing differently than NVIDIA? And is NVIDIA just going to implement that in their chip next year, right? Or, or some version of that. That's, like, pretty much the only things to think about when looking at, you know, hardware companies now.

[01:47:27] Dylan Patel: Cool.

[01:47:28] Alessio: And, yeah, I, I think the, the question is like, there's the size of the models that got outrun, but now you're doing all this work at the compiler level, but it's very transformer based, everything they're doing on the optimization side. How, how do you think about that risk? Like, do you think it's okay for like a hardware company to take like architectural risk in terms of like, yeah, we assume transformers in two years, they'll still be pretty good.

[01:47:51] Alessio: But you're depreciating some of this cost over, like, five years as a buyer.

[01:47:56] Dylan Patel: Yeah, yeah, that's, that's the biggest challenge with like some of the specialized hardware, right? It's like, I know my GPUs will be useful in four years or five years. Maybe not, like, super useful, but they'll be useful for something.

[01:48:07] Dylan Patel: But, there's no way to know that my hardware is going to be able to operate on whatever new model architecture that comes out in the next few years, right? Like, I, I, I like to joke transformers are all you need. And like everything else is like a waste of time. But, you know, I'm sure something better will come.

[01:48:26] Dylan Patel: Right? And, and, you know, you gotta have like, hardware is expensive and you own it for many years. Right? So you can't just like buy whatever's best for today's workload one time and then assume that workload is gonna stay stagnant. Cause that's a recipe to have your like hardware useless as soon as like things evolve.

[01:48:43] Dylan Patel: Like imagine if someone had hardware for LSTMs in 2016 or whatever, right? Like, LSTMs. Yeah, LSTM, sorry. You'd look like an idiot, right? Because now it's not gonna work for the next architecture, as soon as BERT came out, for example. So yeah, anything super specialized is always at risk of being obsoleted and useless.

[01:49:06] Dylan Patel: And, and sort of that's, that's the, that's the thought that like, hey, like, like Graphcore, right? Their chips are. Pretty decent at GNNs, right? Graph Neural Networks. They're actually pretty decent at that. But no one cares, right? So, congratulations, right? Like, you won, you won like the shortest midget, right?

[01:49:24] swyx + Josh Albrecht: Mentioning "transformers is all you need" gives us a nice opportunity to bring out one of your old tweets, but also mention Gemini. My old

[01:49:30] Dylan Patel: tweets, I'm scared. Recent

[01:49:33] swyx + Josh Albrecht: tweets. There's a lot of people talking about, I think you had a tweet commenting on Gemini 1.5 and the million token context, where basically everyone was saying, okay, we need Mamba, we need RWKV, or we need some other alternative architecture to scale to long context.

[01:49:48] swyx + Josh Albrecht: And Google comes out and says, no, we just, we scaled transformers to 10 million tokens. Easy. We, and, you know, like, I, I think that, that kind of, like, reflects on your thesis there a

[01:49:59] Dylan Patel: little bit. I guess, yeah. I mean, I don't know if I have a coherent thesis, but it's sure fun to poke at the people who think otherwise. I just have an intense hatred for RAG.

[01:50:11] Dylan Patel: Right, like retrieval augmented generation is, is, is like the most like, I just have an intense like innate hatred for it. Wait, wait, you retweeted me

[01:50:18] swyx + Josh Albrecht: defending RAG in the White House press release. Yeah, yeah, yeah. Okay.

[01:50:21] Dylan Patel: But it's just fun,

[01:50:22] swyx + Josh Albrecht: it's all fun and games. Yeah, yeah, yeah, it's all fun and games.

[01:50:24] Dylan Patel: Yeah.

[01:50:25] Dylan Patel: No, no, no, I retweeted you because you memed the White House. I don't know if y'all saw the meme. Can you pull it up? Sure. The White House put out this thing about memory safety; they're getting very opinionated, this White House. I think it was effectively like, C is bad and Rust is good.

[01:50:39] Dylan Patel: It was like pretty wild that the White House put that out. And I mean like, like whatever that is, so, so, So

[01:50:46] swyx + Josh Albrecht: Like, they just got very opinionated about prescribing languages to people. And so then I just started editing them, so: I have stopped comparing RAG with long context and fine

[01:50:54] Dylan Patel: tuning.

[01:50:55] Dylan Patel: Wait, You said I retweeted you defending it. I thought you were hating on it. And that's why I retweeted it.

[01:51:00] swyx + Josh Albrecht: It's somewhat of a defense, because everyone was saying long context is killing RAG. And then I had "future LLMs should be sub-quadratic", that's another one. And I actually messed with the fine print as well.

[01:51:11] Alessio: Let's see: power benefits of SRAM-dominant architectures?

[01:51:13] Dylan Patel: Yeah, yeah. So, so that's a good question, right? So, like, SRAM is on chip memory. Everyone's just using HBM. If you don't have to go to off chip memory, that'd be really efficient, right?

[01:51:23] Dylan Patel: Cause, cause you're, you're not moving bits around. But there's always the issue of you don't have enough memory, right? So, so you still have to move bits around constantly. And so that's the, that's the question. So, yeah, sure. If you, if you can not move data around as you compute, it's going to be fantastically efficient.

[01:51:39] Dylan Patel: That just isn't really easy or simple to do.
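
A rough sense of scale for why keeping everything in SRAM runs out of room; the per-chip SRAM and model size below are illustrative assumptions, not any vendor's spec sheet.

```python
# How many SRAM-only chips does it take just to hold the weights resident?
PARAMS           = 70e9     # a 70B-parameter model (assumed)
BYTES_PER_PARAM  = 2        # fp16/bf16, before any quantization
SRAM_PER_CHIP_MB = 230      # assumed on-chip SRAM per accelerator
HBM_PER_GPU_GB   = 80       # a current big-memory GPU, for comparison

weights_gb   = PARAMS * BYTES_PER_PARAM / 1e9
chips_needed = weights_gb * 1e3 / SRAM_PER_CHIP_MB
gpus_needed  = weights_gb / HBM_PER_GPU_GB

print(f"weights: {weights_gb:.0f} GB")
print(f"SRAM-only chips just for weights: ~{chips_needed:.0f}")   # ~609
print(f"HBM GPUs just for weights:        ~{gpus_needed:.1f}")    # ~1.8
# Not moving data per token is great for power, but the capacity bill is paid
# in chip count, and the KV cache for each user still has to fit somewhere.
```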

[01:51:42] Alessio: What do you think is going to be harder in the future, like getting more energy at cheaper costs or like getting more of this hardware

[01:51:48] Dylan Patel: to run? Yeah, I wonder, so someone was talking about this earlier but it's like here in the crowd and I'm looking right at him but he's complaining that journalists keep saying that you know, that, that, like misreporting about how data centers, or what data centers are doing to the environment.

[01:52:03] Dylan Patel: Right? Which I thought was quite funny, right? Cause, cause they're inundated by journalists talking about data centers like destroying the world. Anyways you know, that's not quite the case, right? But yeah, I don't know, like, the, the, the power is certainly going to be hard to get, but, you know, I think, I think if you just look at history, right?

[01:52:22] Dylan Patel: Like humanity, especially America, right? Like, power, power production and usage kept skyrocketing. From like the 1700s to like 1970s, and then it kind of flatlined from there, so why can't we like go back to the like growth stage, I guess is like the whole like mantra of like accelerationists, I guess.

[01:52:40] Dylan Patel: This is e/acc, yep. Well, I don't think it's just e/acc; Sam Altman, like, wholly believes this too, right? Yeah. And I don't think he's e/acc. So yeah, it's something to think about: the US is going back to growing in energy usage, whereas for the last 40 years we were kind of flat on energy usage.

[01:53:00] Dylan Patel: And what does that mean, right? Like, yeah.

[01:53:04] Alessio: Fair enough. There was another question on Marvell, but kind of the, I think

[01:53:07] Dylan Patel: that's definitely one of these three guys who are on the buy side asking this question. What, you want to know if Marvell's stock is gonna go up?

[01:53:18] Dylan Patel: Yeah. So Marvell,

[01:53:19] Alessio: they're doing the custom silicon for Groq. They also do the Trainium 2, and the Google CPU. Any other chip that they're working on that people should keep in mind? Like, any needle-moving, stock-moving ones?

[01:53:34] Dylan Patel: Yeah, exactly. Exactly. They're, they're working on some more stuff.

[01:53:38] Dylan Patel: Yeah. I, I'll, I'll, I'll refrain from,

[01:53:40] Alessio: yeah. All right. Let's see if there's other Groq stuff we want to get through. I don't think so. Alright, on to the other ones. Your view on edge compute hardware: any real use cases for it?

[01:53:54] Dylan Patel: Yeah, I mean, I, I I have like a really like anti edge view. Yeah, let's hear it.

[01:53:58] Dylan Patel: Like, like, so many people are like, oh, I'm going to run this model on my phone or on my laptop and. I love how much it's raining. So now I can be horrible and you people won't leave. Like, I want you to try and leave this building. Captive audience. Seriously, should I start singing? Like, there's nothing you

[01:54:17] Alessio: can do.

[01:54:18] Alessio: You definitely, I'll stop you from that.

[01:54:19] Dylan Patel: Sorry, so edge hardware, right? Like, you know, people are like, I'm going to run this model on my phone or my laptop. It makes no sense to me. Cause Current hardware is not really capable of it. So you're gonna buy new hardware, to run whatever on the edge or you're gonna just run very, very small models.

[01:54:36] Dylan Patel: But in either case, you're gonna end up with really low performance. And whatever you spent to run it locally, if you spent it in the cloud, it could service 10x the users, right? So you're kind of SOL in terms of the economics of running things on the edge. And then latency, for LLMs, is not that big of a deal: internet latency is not that big of a deal relative to the time the model itself takes, right?

[01:55:08] Dylan Patel: Like the actual model operating, whether it's on edge hardware or cloud hardware. And cloud hardware is so much faster, so edge hardware is not really able to have a measurable, appreciable advantage over cloud hardware. This applies to diffusion models, this applies to LLMs. Of course small models will be able to run, but not all, yeah.

[01:55:33] Dylan Patel: Cool.
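
The economics argument in back-of-envelope form; every number below is a placeholder, just to show the shape of the comparison.

```python
# Spend the same dollars on an edge device vs. a slice of a cloud GPU and
# compare tokens served. Numbers are illustrative assumptions only.
EDGE_DEVICE_COST     = 1_500     # an accelerator-equipped laptop or phone upgrade
EDGE_TOKENS_PER_SEC  = 20        # small local model, single user
CLOUD_GPU_COST       = 30_000    # one datacenter GPU
CLOUD_TOKENS_PER_SEC = 2_500     # batched serving across ~64 concurrent users

edge_tokens_per_dollar  = EDGE_TOKENS_PER_SEC / EDGE_DEVICE_COST
cloud_tokens_per_dollar = CLOUD_TOKENS_PER_SEC / CLOUD_GPU_COST

print(f"edge:  {edge_tokens_per_dollar * 1000:.1f} tokens/sec per $1k of hardware")
print(f"cloud: {cloud_tokens_per_dollar * 1000:.1f} tokens/sec per $1k of hardware")
# And on latency: tens of milliseconds of network round trip is small next to
# the seconds a long generation takes, which is the point made above.
```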

[01:55:35] Alessio: Let's see. I guess you can now see them. Yeah: what chance do startups like MatX or Etched have? Haven't you

[01:55:41] swyx + Josh Albrecht: already reviewed

[01:55:41] Dylan Patel: them? Why don't you, why don't you answer? Yeah, we, we

[01:55:43] swyx + Josh Albrecht: actually, like, we have connections with MatX and Lemurian. Yeah, yeah. We haven't, no. But Gavin is

[01:55:52] Alessio: Yeah, yeah, they said they don't want to talk publicly.

[01:55:55] Alessio: Oh, okay, okay.

[01:55:57] swyx + Josh Albrecht: When they open up, we can Sure,

[01:56:00] Alessio: sure. But do you think, like, I think the two,

[01:56:02] Dylan Patel: three Answer the question! What do you think of them?

[01:56:06] Alessio: I think, kind of, there's a couple things. It's like How do the other companies innovate against them? I think when you do a new Silicon, you're like, Oh, we're going to be so much better at this thing or like much faster, much cheaper.

[01:56:18] Alessio: But there's all the other cost curves going down in the macro environment at the same time. So if it takes you five years before you're a lot better, five years later, once you tape the chip out, you're only comparing yourself against the five-year advancement that the major companies had too. So then it's like, okay, we're going to have the C300, whatever, from NVIDIA

[01:56:37] Alessio: By the time some of these chips come up.

[01:56:40] Dylan Patel: What's after Z? What do you think is after Z in the road map? Because it's X, Y, Z, Anyways Yeah, yeah, it's like the age old problem, right? Like you build a chip, it has some cool thing, cool feature, and then like, a year later, NVIDIA has it in hardware, right? Has implemented some flavor of that in hardware.

[01:57:01] Dylan Patel: Or two generations out, right? Like, what idea are you going to have that NVIDIA can't implement, is like, really the question. It's like, you have to be fundamentally different in some way that holds through for, you know, four or five years, right? That's kind of the big issue. But, you know, like, those people have some ideas that are interesting, and yeah, maybe it'll work out, right?

[01:57:21] Dylan Patel: But it's going to be hard to fight NVIDIA, who, one, doesn't consider them competition, right? They're worried about Google's and Amazon's chips, and I guess to some extent AMD's chip, but they're not really worried about MatX or Etched or Groq or Positron or any of these folks.

[01:57:39] Alessio: How much of an advantage do they have by working closely with like OpenAI folks and then already knowing where some of the architecture decisions are going? And since those companies are like the biggest buyers and users of the

[01:57:51] Dylan Patel: chips, Yeah, I mean, like, you see, like, the most important sort of AI companies are obviously going to tell hardware vendors what they want you know, open AI and, you know, so on and so forth, right?

[01:58:02] Dylan Patel: They're just going to obviously tell them what they want, and the startups aren't going to get anywhere close to as much feedback on what to do on very minute, low-level stuff, right? So that is a difficulty. Some startups, like MatX, obviously have people who built or worked on the largest models, like at Google, but other startups might not have that advantage, so they're always gonna have that issue of, hey, how do I get the feedback, what's changing, what do they see down the pipeline that I really need to be aware of and ready for when I design my hardware.

[01:58:37] Dylan Patel: Alright.

[01:58:38] Alessio: Every hardware shortage has eventually turned into a glut. Will that be true of NVIDIA chips? If so, when, but also why?

[01:58:45] Dylan Patel: Absolutely, and I'm so excited to buy H100s for like $1,000, guys. No, maybe not $1,000, but yeah, everyone's gonna buy chips, right? It's just the way semiconductors work, because the supply chain takes forever to build out.

[01:58:58] Dylan Patel: And it's, it's like a really weird thing, right? Like, so, so if the backlog of chips is a year, people will order, you know, Two years worth of what they want for the next year. It is like a very common thing. It's not just like this AI cycle, but like, like, like microcontrollers, right? Like the automotive companies, they order two years worth of what they needed for one year, just so they could get enough, right?

[01:59:21] Dylan Patel: Like, this is just like what happens in semiconductors when, when lead times lengthen, the, the purchases and inventory is sort of like double. Sorry. So, so these. The, the NVIDIA GPU shortage obviously is going to be rectified. And when it is everyone's sort of double orders will become extremely apparent, right?

[01:59:42] Dylan Patel: And you see random companies out of nowhere being like, yeah, we've got 32,000 H100s on order, or we've got 10,000 or 5,000. And trust me, they're not all real orders, for one. But I think the bubble will continue on for a long time; it's not going to end this year. People need AI, right? I think everyone in this audience would agree there's no immediate end to the bubble.

[02:00:09] Dylan Patel: Party like we're in 1995, not like 2000. Makes sense.

[02:00:12] Alessio: What's next? Thoughts on VLIW

[02:00:16] Dylan Patel: architectures? Oh, the why, sorry, the why question from before, yeah. I think it's just because the supply chain expands so much, and then at the same time there will be no immediate economic return for everyone, right?

[02:00:28] Dylan Patel: Like, some companies will continue to buy, like like an OpenAI or Meta will continue to buy, but then, like, All these random startups will, or a lot of them will not be able to continue to buy, right? So then, so then that like kind of leads to like, they'll pause for a little bit, right? Or like, I think in 2018, right?

[02:00:45] Dylan Patel: Like, memory pricing was extremely high. Then all of a sudden Google, Microsoft, and Amazon all agreed, and they won't say they did it together, but they basically all agreed within the same week to stop ordering memory. And within a month, the price of memory started tanking by insane amounts, right?

[02:01:06] Dylan Patel: And like people claim, you know, all sorts of reasons why that was timed extremely well. But it was like very clear and people in the financial markets were able to make trades and everything, right? People stopped buying and it's not like their demand just dried up. It's just like they had a little bit of a demand slowdown and then they had enough inventory that they could like weather until like prices tanked.

[02:01:26] Dylan Patel: Because it's such an inelastic good, right? Yeah.

[02:01:29] swyx + Josh Albrecht: Thank you very much. That's it.

[02:01:35] AI Charlie: That concludes our audio segment this weekend. But if you're listening all the way to the end, we have two bonus segments for you: a conversation with Malin Nefe, Senior Vice President of AI at Capital One, who will be speaking at the AI Leadership track of the AI Engineer World's Fair, and the recent Latent Space Personal AI Meetup featuring a lot of new AI wearables: Bee, Based Hardware, Deepgram MLE AI, and LangChain LangFriend and LangMem, presented by another former guest, Harrison Chase. Watch out and take care.



Get full access to Latent Space at www.latent.space/subscribe
2024-04-06
Link to episode

Presenting the AI Engineer World's Fair - with Sam Schillace, Deputy CTO of Microsoft

TL;DR: You can now buy tickets, apply to speak, or join the expo for the biggest AI Engineer event of 2024. We're gathering *everyone* you want to meet - see you this June.

In last year's The Rise of the AI Engineer we put our money where our mouth was and announced the AI Engineer Summit, which fortunately went well:

With ~500 live attendees and over ~500k views online, the first iteration of the AI Engineer industry affair seemed to be well received. Competing in an expensive city with 3 other more established AI conferences in the fall calendar, we broke through in terms of in-person experience and online impact.

So at the end of Day 2 we announced our second event: the AI Engineer World's Fair. The new website is now live, together with our new presenting sponsor:

We were delighted to invite both Ben Dunphy, co-organizer of the conference, and Sam Schillace, the Deputy CTO of Microsoft who wrote some of the first Laws of AI Engineering while working with early releases of GPT-4, onto the pod to talk about the conference and how Microsoft is all-in on AI Engineering.

Rise of the Planet of the AI Engineer

Since the first AI Engineer piece, AI Engineering has exploded:

and the title has been adopted across OpenAI, Meta, IBM, and many, many other companies:

1 year on, it is clear that AI Engineering is not only in full swing, but is an emerging global industry that is successfully bridging the gap:

* between research and product,

* between general-purpose foundation models and in-context use-cases,

* and between the flashy weekend MVP (still great!) and the reliable, rigorously evaluated AI product deployed at massive scale, assisting hundreds of employees and driving millions in profit.

The greatly increased scope of the 2024 AI Engineer World's Fair (more stages, more talks, more speakers, more attendees, more expo...) helps us reflect the growth of AI Engineering in three major dimensions:

* Global Representation: the 2023 Summit was a mostly-American affair. This year we plan to have speakers from top AI companies across five continents, and explore the vast diversity of approaches to AI across global contexts.

* Topic Coverage:

* In 2023, the Summit focused on the initial questions that the community wrestled with - LLM frameworks, RAG and Vector Databases, Code Copilots and AI Agents. Those are evergreen problems that just got deeper.

* This year the AI Engineering field has also embraced new core disciplines with more explicit focus on Multimodality, Evals and Ops, Open Source Models and GPU/Inference Hardware providers.

* Maturity/Production-readiness: Two new tracks are dedicated to AI in the enterprise, government, education, finance, and other highly regulated industries, or AI deployed at larger scale:

* AI in the Fortune 500, covering at-scale production deployments of AI, and

* AI Leadership, a closed-door, side event for technical AI leaders to discuss engineering and product leadership challenges as VPs and Heads of AI in their respective orgs.

We hope you will join Microsoft and the rest of us as either speaker, exhibitor, or attendee, in San Francisco this June. Contact us with any enquiries that don't fall into the categories mentioned below.

Show Notes

* Ben Dunphy

* 2023 Summit

* GitHub confirmed $100m ARR on stage

* History of World's Fairs

* Sam Schillace

* Writely on Acquired.fm

* Early Lessons From GPT-4: The Schillace Laws

* Semantic Kernel

* Sam on Kevin Scott (Microsoft CTO)'s podcast in 2022

* AI Engineer World's Fair (SF, Jun 25-27)

* Buy Super Early Bird tickets (Listeners can use LATENTSPACE for $100 off any ticket until April 8, or use GROUP if coming in 4 or more)

* Submit talks and workshops for Speaker CFPs (by April 8)

* Enquire about Expo Sponsorship (ASAP, selling fast)

Timestamps

* [00:00:16] Intro

* [00:01:04] 2023 AI Engineer Summit

* [00:03:11] Vendor Neutral

* [00:05:33] 2024 AIE World's Fair

* [00:07:34] AIE World's Fair: 9 Tracks

* [00:08:58] AIE World's Fair Keynotes

* [00:09:33] Introducing Sam

* [00:12:17] AI in 2020s vs the Cloud in 2000s

* [00:13:46] Syntax vs Semantics

* [00:14:22] Bill Gates vs GPT-4

* [00:16:28] Semantic Kernel and Schillace's Laws of AI Engineering

* [00:17:29] Orchestration: Break it into pieces

* [00:19:52] Prompt Engineering: Ask Smart to Get Smart

* [00:21:57] Think with the model, Plan with Code

* [00:23:12] Metacognition vs Stochasticity

* [00:24:43] Generating Synthetic Textbooks

* [00:26:24] Trade leverage for precision; use interaction to mitigate

* [00:27:18] Code is for syntax and process; models are for semantics and intent.

* [00:28:46] Hands on AI Leadership

* [00:33:18] Multimodality vs "Text is the universal wire protocol"

* [00:35:46] Azure OpenAI vs Microsoft Research vs Microsoft AI Division

* [00:39:40] On Satya

* [00:40:44] Sam at AI Leadership Track

* [00:42:05] Final Plug for Tickets & CFP

Transcript

[00:00:00] Alessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO in residence at Decibel Partners, and I'm joined by my co host Swyx, founder of Small

[00:00:16] Intro

[00:00:16] swyx: AI. Hey, hey, we're back again with a very special episode, this time with two guests and talking about the very in person events rather than online stuff.

[00:00:27] swyx: So first I want to welcome Ben Dunphy, who is my co organizer on AI engineer conferences. Hey, hey, how's it going? We have a very special guest. Anyone who's looking at the show notes and the title will preview this later. But I guess we want to set the context. We are effectively doing promo for the upcoming AI Engineer World's Fair that's happening in June.

[00:00:49] swyx: But maybe something that we haven't actually recapped much on the pod is just the origin of the AI Engineer Summit and why, what happens and what went down. Ben, I don't know if you'd like to start with the raw numbers that people should have in mind.

[00:01:04] 2023 AI Engineer Summit

[00:01:04] Ben Dunphy: Yeah, perhaps your listeners would like just a quick background on the summit.

[00:01:09] Ben Dunphy: I mean, I'm sure many folks have heard of our events. You know, we launched the AI Engineer Summit last June with your article kind of coining the term that was on the tip of everyone's tongue, but curiously had not been actually coined, which is the term AI Engineer, which is now many people's job title. You know, we're seeing a lot more people come to this event with the job description of AI engineer, with the job title of AI engineer. So it's an event that you and I had really talked about since February of 2023, when we met at a hackathon you organized. We were both excited by this movement, and it hadn't really had a name yet.

[00:01:48] Ben Dunphy: We decided that an event was warranted and that's why we moved forward with the AI Engineer Summit, which ended up being a great success. You know, we had over 5,000 people apply to attend in person. We had over 9,000 folks attend online, with over 20,000 on the live stream.

[00:02:06] Ben Dunphy: In person, we accepted about 400 attendees and had speakers, workshop instructors and sponsors all congregating in San Francisco over two days, um, two and a half days with a welcome reception. So it was quite the event to kick off kind of this movement that's turning into quite an exciting

[00:02:24] swyx: industry.

[00:02:25] swyx: The overall idea of this is that I kind of view AI engineering, at least in all my work in Latent Space and the other stuff, as starting an industry.

[00:02:34] swyx: And I think every industry, every new community, needs a place to congregate. And I definitely think that AI Engineer, at least the conference, is meant to be like the biggest gathering of technical engineering people working with AI. Right. I think we kind of got that spot last year. There was a very competitive conference season, especially in San Francisco.

[00:02:54] swyx: But I think as far as I understand, in terms of cultural impact, online impact, and the speakers that people want to see, we, we got them all and it was very important for us to be a vendor neutral type of event. Right. , The reason I partnered with Ben is that Ben has a lot of experience, a lot more experience doing vendor neutral stuff.

[00:03:11] Vendor Neutral

[00:03:11] swyx: I first met you when I was speaking at one of your events, and now we're sort of business partners on that. And yeah, I mean, I don't know if you have any sort of thoughts on making things vendor neutral, making things more of a community industry conference rather than like something that's owned by one company.

[00:03:25] swyx: Yeah.

[00:03:25] Ben Dunphy: I mean events that are owned by a company are great, but this is typically where you have product pitches and this smaller internet community. But if you want the truly internet community, if you want a more varied audience and you know, frankly, better content for, especially for a technical audience, you want a vendor neutral event. And this is because when you have folks that are running the event that are focused on one thing and one thing alone, which is quality, quality of content, quality of speakers, quality of the in person experience, and just of general relevance it really elevates everything to the next level.

[00:04:01] Ben Dunphy: And when you have someone like yourself who's coming To this content curation the role that you take at this event, and bringing that neutrality with, along with your experience, that really helps to take it to the next level, and then when you have someone like myself, focusing on just the program curation, and the in person experience, then both of our forces combined, we can like, really create this epic event, and so, these vendor neutral events if you've been to a small community event, Typically, these are vendor neutral, but also if you've been to a really, really popular industry event, many of the top industry events are actually vendor neutral.

[00:04:37] Ben Dunphy: And that's because of the fact that they're vendor neutral, not in spite of

[00:04:41] swyx: it. Yeah, I've been pretty open about the fact that my dream is to build the KubeCon of AI. So if anyone has been in the Kubernetes world, they'll understand what that means. And then, or, or instead of the NeurIPS, NeurIPS for engineers, where engineers are the stars and engineers are sharing their knowledge.

[00:04:57] swyx: Perspectives, because I think AI is definitely moving over from research to engineering and production. I think one of my favorite parts was just honestly having GitHub and Microsoft support, which we'll cover in a bit, but you know, announcing finally that GitHub's copilot was such a commercial success I think was the first time that was actually confirmed by anyone in public.

[00:05:17] swyx: For me, it's also interesting as sort of the conference curator to put Microsoft next to competitors some of which might be much smaller AI startups and to see what, where different companies are innovating in different areas.

[00:05:27] swyx: Well, they're next to

[00:05:27] Ben Dunphy: each other in the arena. So they can be next to each other on stage too.

[00:05:33] Why AIE World's Fair

[00:05:33] swyx: Okay, so this year World's Fair we are going a lot bigger what details are we disclosing right now? Yeah,

[00:05:39] Ben Dunphy: I guess we should start with the name why are we calling it the World's Fair? And I think we need to go back to what inspired this, what actually the original World's Fair was, which was it started in the late 1700s and went to the early 1900s.

[00:05:53] Ben Dunphy: And it was intended to showcase the incredible achievements. Of nation states, corporations, individuals in these grand expos. So you have these miniature cities actually being built for these grand expos. In San Francisco, for example, you had the entire Marina District built up in absolutely new construction to showcase the achievements of industry, architecture, art, and culture.

[00:06:16] Ben Dunphy: And many of your listeners will know that in 1893, Nikola Tesla famously provided power to the Chicago World's Fair with his AC power generators. There's lots of great movies and documentaries about this. That was the first electric World's Fair, which was thereafter referred to as the White City.

[00:06:33] Ben Dunphy: So in today's world we have technological change that's similar to what was experienced during the industrial revolution in how it's, how it's just upending our entire life, how we live, work, and play. And so we have artificial intelligence, which has long been the dream of humanity.

[00:06:51] Ben Dunphy: It's, it's finally here. And the pace of technological change is just accelerating. So with this event, as you mentioned, we, we're aiming to create a singular event where the world's foremost experts, builders, and practitioners can come together to exchange and reflect. And we think this is not only good for business, but it's also good for our mental health.

[00:07:12] Ben Dunphy: It slows things down a bit from the Twitter news cycle to an in person festival of smiles, handshakes, connections, and in depth conversations that online media and online events can only ever dream of replicating. So this is an expo led event where the world's top companies will mingle with the world's top founders and AI engineers who are building with, and enhanced by, AI.

[00:07:34] AIE World's Fair: 9 Tracks

[00:07:34] Ben Dunphy: And not to mention, we're featuring over a hundred talks and workshops across

[00:07:37] swyx: nine tracks. Yeah, I mean, those nine tracks will be fun. Actually, do we have a little preview of the tracks in the, the speakers?

[00:07:43] Ben Dunphy: We do. Folks can actually see them today at our website. We've updated that at ai.

[00:07:48] Ben Dunphy: engineer. So we'd encourage them to go there to see that. But for those just listening, we have nine tracks: multimodality; retrieval augmented generation, featuring LLM frameworks and vector databases; evals and LLM ops; open source models; code gen and dev tools; GPUs and inference; AI agent applications; AI in the Fortune 500; and then we have a special track for AI Leadership, which you can access by purchasing the VP pass, which is different from the other passes we have.

[00:08:20] Ben Dunphy: And I won't go into each of these tracks in depth, unless you want to, Swyx, but there's more details on the website at ai.engineer.

[00:08:28] swyx: I mean, I, I, very much looking forward to talking to our special guests for the last track, I think, which is the what a lot of yeah, leaders are thinking about, which is how to, Inspire innovation in their companies, especially the sort of larger organizations that might not have the in house talents for that kind of stuff.

[00:08:47] swyx: So yeah, we can talk about the expo, but I'm very keen to talk about the presenting sponsor if you want to go slightly out of order from our original plan.

[00:08:58] AIE World's Fair Keynotes

[00:08:58] Ben Dunphy: Yeah, absolutely. So you know, for the stage of keynotes, we have talks confirmed from Microsoft, OpenAI, AWS, and Google.

[00:09:06] Ben Dunphy: And our presenting sponsor is joining the stage with those folks. And so that presenting sponsor this year is a dream sponsor. It's Microsoft. It's the company really helping to lead the charge into this wonderful new era that we're all taking part in. So, yeah,

[00:09:20] swyx: you know, a bit of context, like when we first started planning this thing, I was kind of brainstorming, like, who would we like to get as the ideal presenting sponsors, as ideal partners long term, just in terms of encouraging the AI engineering industry, and it was Microsoft.

[00:09:33] Introducing Sam

[00:09:33] swyx: So Sam, I'm very excited to welcome you onto the podcast. You are CVP and Deputy CTO of Microsoft. Welcome.

[00:09:40] Sam Schillace: Nice to be here. I'm looking forward to, I was looking forward to Alessio saying my last name correctly this time. Oh

[00:09:45] swyx: yeah. So I, I studiously avoided saying, saying your last name, but apparently it's an Italian last name.

[00:09:50] swyx: Ski Lache. Ski

[00:09:51] Alessio: Lache. Yeah. No, that, that's great, Sean. That's great as a musical person.

[00:09:54] swyx: And it, it's also, yeah, I pay attention to like the, the, the lilt. So it's ski lache and the, the slow slowing of the law is, is what I focused

[00:10:03] Sam Schillace: on. You say both Ls. There's no silent letters, you say

[00:10:07] Alessio: both of those. And it's great to have you, Sam.

[00:10:09] Alessio: You know, we've known each other now for a year and a half, two years, and our first conversation, well, it was at Lobby Conference, and then we had a really good one in the kind of parking lot of a Safeway, because we didn't want to go into Starbucks to meet, so we sat outside for about an hour, an hour and a half, and then you had to go to a Bluegrass concert, so it was great.

[00:10:28] Alessio: Great meeting, and now, finally, we have you on Latent Space.

[00:10:31] Sam Schillace: Cool, cool. Yeah, I'm happy to be here. It's funny, I was just saying to Swyx before you joined that, like, it's kind of an intimidating podcast. Like, when I listen to this podcast, it seems to be, like, one of the more intelligent ones, like, more, more, like, deep technical folks on it.

[00:10:44] Sam Schillace: So, it's, like, it's kind of nice to be here. It's fun. Bring your A game. Hopefully I'll, I'll bring mine. I

[00:10:49] swyx: mean, you've been programming for longer than some of our listeners have been alive, so I don't think your technical chops are in any doubt. So you were responsible for Writely as one of your early wins in your career, which then became Google Docs, and obviously you were then responsible for a lot more of G Suite.

[00:11:07] swyx: But did you know that you were covered in Acquired.fm episode 9, which is one of the podcasts that we model after.

[00:11:13] Sam Schillace: Oh, cool. I didn't, I didn't realize that the most fun way to say this is that I still have to this day in my personal GDocs account, the very first Google doc, like I actually have it.

[00:11:24] Sam Schillace: And I looked it up, like it occurred to me like six months ago that it was probably around and I went and looked and it's still there. So it's like, and it's kind of a funny thing. Cause it's like the backend has been rewritten at least twice that I know of the front end has been re rewritten at least twice that I know of.

[00:11:38] Sam Schillace: So. I'm not sure what sense it's still the original one it's sort of more the idea of the original one, like the NFT of it would probably be more authentic. I

[00:11:46] swyx: still have it. It's a Ship of Theseus thing. Does it, does it say hello world or something more mundane?

[00:11:52] Sam Schillace: It's, it's, it's me and Steve Newman trying to figure out if some collaboration stuff is working, and also a picture of Edna from the Incredibles that I probably pasted in later, because that's That's too early for that, I think.

[00:12:05] swyx: People can look up your LinkedIn, and we're going to link it on the show notes, but you were also SVP of engineering for Box, and then you went back to Google to lead Google Maps, and now you're Deputy CTO.

[00:12:17] AI in 2020s vs the Cloud in 2000s

[00:12:17] swyx: I mean, there's so many places to start, but maybe one place I like to start off with is do you have a personal GPT 4 experience.

[00:12:25] swyx: Obviously being at Microsoft, you have, you had early access and everyone talks about Bill Gates's

[00:12:30] Sam Schillace: demo. Yeah, it's kind of, yeah, that's, it's kind of interesting. Like, yeah, we got access, I got access to it like in September of 2022, I guess, like before it was really released. And I it like almost instantly was just like mind blowing to me how good it was.

[00:12:47] Sam Schillace: I would try experiments like very early on, like I play music. There's this thing called ABC notation. That's like an ASCII way to represent music. And like, I was like, I wonder if it can like compose a fiddle tune. And like it composed a fiddle tune. I'm like, I wonder if it can change key, change the key.

[00:13:01] Sam Schillace: Like it's like really, it was like very astonishing. And I sort of, I'm very like abstract. My background is actually more math than CS. I'm a very abstract thinker and sort of categorical thinker. And the, the thing that occurred to me with, with GPT 4 the first time I saw it was. This is really like the beginning, it's the beginning of V2 of the computer industry completely.

[00:13:23] Sam Schillace: I had the same feeling I had when, of like a category shifting that I had when the cloud stuff happened with the GDocs stuff, right? Where it's just like, all of a sudden this like huge vista opens up of capabilities. And I think the way I characterized it, which is a little bit nerdy, but I'm a nerd so lean into it is like everything until now has been about syntax.

[00:13:46] Syntax vs Semantics

[00:13:46] Sam Schillace: Like, we have to do mediation. We have to describe the real world in forms that the digital world can manage. And so we're the mediation, and we, like, do that via things like syntax and schema and programming languages. And all of a sudden, like, this opens the door to semantics, where, like, you can express intention and meaning and nuance and fuzziness.

[00:14:04] Sam Schillace: And the machine itself is doing, the model itself is doing a bunch of the mediation for you. And like, that's obviously like complicated. We can talk about the limits and stuff, and it's getting better in some ways. And we're learning things and all kinds of stuff is going on around it, obviously.

[00:14:18] Sam Schillace: But like, that was my immediate reaction to it was just like, Oh my God.

[00:14:22] Bill Gates vs GPT-4

[00:14:22] Sam Schillace: Like, and then I heard about the Build demo where like Bill had been telling Kevin Scott, this investment is a waste, it's never going to work, AI is blah, blah, blah, and come back when it can pass like an AP bio exam.

[00:14:33] Sam Schillace: And they actually literally did that at one point. They brought in like the world champion of the AP bio test or whatever the AP competition is, and ChatGPT or GPT 4 both did the AP bio and GPT 4 beat her. So that was the moment that convinced Bill that this was actually real.

[00:14:53] Sam Schillace: Yeah, it's fun. I had a moment with him actually about three weeks after that when we had been, so I started like diving in on developer tools almost immediately and I built this thing with a small team that's called the Semantic Kernel, which is one of the very early orchestrators, just because I wanted to be able to put code and inference together.

[00:15:10] Sam Schillace: And that's probably something we should dig into more deeply. Cause I think there's some good insights in there, but I I had a bunch of stuff that we were building and then I was asked to go meet with Bill Gates about it and he's kind of famously skeptical and, and so I was a little bit nervous to meet him the first time.

[00:15:25] Sam Schillace: And I started the conversation with, Hey, Bill, like three weeks ago, you would have called BS on everything I'm about to show you. And I would probably have agreed with you, but we've both seen this thing. And so we both know it's real. So let's skip that part and like, talk about what's possible.

[00:15:39] Sam Schillace: And then we just had this kind of fun, open ended conversation and I showed him a bunch of stuff. So that was like a really nice, fun, fun moment as well. Well,

[00:15:46] swyx: that's a nice way to meet Bill Gates and impress

[00:15:48] Sam Schillace: him. A little funny. I mean, it's like, I wasn't sure what he would think of me, given what I've done and his.

[00:15:54] Sam Schillace: Crown Jewel. But he was nice. I think he likes

[00:15:59] swyx: GDocs. Crown Jewel as in Google Docs versus Microsoft Word? Office.

[00:16:03] Sam Schillace: Yeah. Yeah, versus Office. Yeah, like, I think, I mean, I can imagine him not liking, I met Steven Sinofsky once and he sort of respectfully, but sort of, grimaced at me. You know, like, because of how much trauma I had caused him.

[00:16:18] Sam Schillace: So Bill was very nice to

[00:16:20] swyx: me. In general it's like friendly competition, right? They keep you, they keep you sharp, you keep each

[00:16:24] Sam Schillace: other sharp. Yeah, no, I think that's, it's definitely respect, it's just kind of funny.

[00:16:28] Semantic Kernel and Schillace's Laws of AI Engineering

[00:16:28] Sam Schillace: Yeah,

[00:16:28] swyx: So, speaking of semantic kernel, I had no idea that you were that deeply involved, that you actually had laws named after you.

[00:16:35] swyx: This only came up after looking into you for a little bit. Schillace's Laws, how did those, what's the, what's the origin

[00:16:41] Sam Schillace: story? Hey! Yeah, that's kind of funny. I'm actually kind of a modest person and so I'm not sure how I feel about having my name attached to them. Although I do agree with all, I believe all of them, because I wrote all of them.

[00:16:49] Sam Schillace: This is like a designer, John Maeda, who works with me, decided to stick my name on them and put them out there. Seriously, but like, well, but like, so this was just, I'm not, I don't build models. Like I'm not an AI engineer in the sense of, of like an AI researcher that's like doing inference. Like I'm somebody who's like consuming the models.

[00:17:09] Sam Schillace: Exactly. So it's kind of funny when you're talking about AI engineering, like it's a good way of putting it. Cause that's how like I think about myself. I'm like, I'm an app builder. I just want to build with this tool. Yep. And so we spent all of the fall and into the winter in that first year, like Just trying to build stuff and learn how this tool worked.

[00:17:29] Orchestration: Break it into pieces

[00:17:29] Sam Schillace: And I guess those are a little bit in the spirit of like Jon Bentley's Programming Pearls or something. I was just like, let's kind of distill some of these ideas down of like, how does this thing work? I saw something then that I still see today with people: doing inference is still kind of expensive.

[00:17:46] Sam Schillace: GPUs are still kind of scarce. And so people try to get everything done in like one shot. And so there's all this like prompt tuning to get things working. And one of the first laws was like, break it into pieces. Like if it's hard for you, it's going to be hard for the model. But if it's you know, there's this kind of weird thing where like, it's.

[00:18:02] Sam Schillace: It's absolutely not a human being, but starting to think about, like, how would I solve the problem is often a good way to figure out how to architect the program so that the model can solve the problem. So, like, that was one of the first laws. That came from me just trying to, like, replicate a test of a more complicated reasoning process that you have to go through, the ReAct thing that Google did, and I was trying to get GPT 4 to do it on its own.

[00:18:32] Sam Schillace: And, and so I'd ask it the question that was in this paper, and the answer to the question is like the year 2000. It's like, what year did this particular author who wrote this book live in this country? And you've kind of got to carefully reason through it. And like, I could not get GPT 4 to Just to answer the question with the year 2000.

[00:18:50] Sam Schillace: And if you're thinking about this as like the kernel is like a pipelined orchestrator, right? It's like very Unix y, where like you have a, some kind of command and you pipe stuff to the next parameters and output to the next thing. So I'm thinking about this as like one module in like a pipeline, and I just want it to give me the answer.

[00:19:05] Sam Schillace: I don't want anything else. And I could not prompt engineer my way out of that. I just like, it was giving me a paragraph or reasoning. And so I sort of like anthropomorphized a little bit and I was like, well, the only way you can think about stuff is it can think out loud because there's nothing else that the model does.

[00:19:19] Sam Schillace: It's just doing token generation. And so it's not going to be able to do this reasoning if it can't think out loud. And that's why it's always producing this. But if you take that paragraph of output, which did get to the right answer and you pipe it into a second prompt. That just says read this conversation and just extract the answer and report it back.

[00:19:38] Sam Schillace: That's an easier task. That would be an easier task for you to do or me to do. It's easier reasoning. And so it's an easier thing for the model to do and it's much more accurate. And that's like 100 percent accurate. It always does that. So like that was one of those, those insights that led to the laws.
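For illustration, a minimal sketch of that two-step pipeline (let the model think out loud, then pipe the paragraph into a second, easier extraction prompt) might look like the following. The OpenAI Python SDK usage, model name, and prompt wording here are assumptions for the sketch, not code from Semantic Kernel:

```python
# Hypothetical sketch of "break it into pieces": reason first, extract second.
# Assumes the OpenAI Python SDK (v1 chat.completions API) and OPENAI_API_KEY set.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

question = "In what year did the author of this book live in this country?"

# Stage 1: the model can only "think out loud", so let it produce a paragraph
# of reasoning rather than forcing a bare answer in one shot.
reasoning = ask(f"Reason step by step about this question:\n{question}")

# Stage 2: a much easier task -- read the reasoning and report only the answer.
answer = ask(
    "Read the reasoning below and reply with only the final answer, nothing else:\n\n"
    + reasoning
)
print(answer)
```

Piping the extraction step into its own prompt is exactly the Unix-style orchestration described here: each module does one easier task, so the whole chain gets more accurate.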

[00:19:52] Prompt Engineering: Ask Smart to Get Smart

[00:19:52] Sam Schillace: I think one of the other ones that's kind of interesting that I think people still don't fully appreciate is that GPT 4 is the rough equivalent of like a human being sitting down for centuries or millennia and reading all the books that they can find. It's this vast mind, right, and the embedding space, the latent space, is 100K, a 100,000 dimensional space, right?

[00:20:14] Sam Schillace: Like it's this huge, high dimensional space, and we don't have good, um, Intuition about high dimensional spaces, like the topology works in really weird ways, connectivity works in weird ways. So a lot of what we're doing is like aiming the attention of a model into some part of this very weirdly connected space.

[00:20:30] Sam Schillace: That's kind of what prompt engineering is. But that kind of, like, what we observed to begin with that led to one of those laws was You know, ask smart to get smart. And I think we've all, we all understand this now, right? Like this is the whole field of prompt engineering. But like, if you ask like a simple, a simplistic question of the model, you'll get kind of a simplistic answer.

[00:20:50] Sam Schillace: Cause you're pointing it at a simplistic part of that high dimensional space. And if you ask it a more intelligent question, you get more intelligent stuff back out. And so I think that's part of like how you think about programming as well. It's like, how are you directing the attention of the model?
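To make "ask smart to get smart" concrete, here are two purely illustrative prompts aimed at the same topic; neither is from the episode, and ask() refers to the hypothetical helper sketched above. The richer framing points the model at a more specific region of that high-dimensional space:

```python
# Illustrative only: the same topic asked two ways.
simplistic = "Tell me about agent memory."

smarter = (
    "You are designing the memory layer for a long-running LLM agent. "
    "Compare naive vector recall against deduplicated, summarized memories, "
    "name one failure mode of each, and recommend a default with reasons."
)

# ask(simplistic) tends to come back with a generic overview;
# ask(smarter) tends to come back with a specific, actionable answer.
```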

[00:21:04] Sam Schillace: And I think we still don't have a good intuitive feel for that. To me,

[00:21:08] Alessio: the most interesting thing is how do you tie the ask smart, get smart with the syntax and semantics piece. I gave a talk at GDC last week about the rise of full stack employees and how these models are like semantic representation of tasks that people do.

[00:21:23] Alessio: But at the same time, we have code also become a semantic representation. You know, I give you the example of like Python's sort, it's like really a semantic function. It's not code, but it's actually code underneath. How do you think about tying the two together where you have code?

[00:21:39] Alessio: To then extract the smart parts so that you don't have to like ask smart every time and like kind of wrap them in like higher level functions.

[00:21:46] Sam Schillace: Yeah, this is, this is actually, we're skipping ahead to kind of later in the conversation, but I like to, I usually like to distill stuff down into these little aphorisms that kind of help me remember them.

[00:21:57] Think with the model, Plan with Code

[00:21:57] Sam Schillace: You know, so we can dig into a bunch of them. One of them is pixels are free, one of them is bots are docs. But the one that's interesting here is Think with the model, plan with code. And so one of the things, so one of the things we've realized, we've been trying to do lots of these like longer running tasks.

[00:22:13] Sam Schillace: Like we did this thing called the infinite chatbot, which was the successor to the Semantic Kernel, which is an internal project. It's a lot like GPTs, the OpenAI GPTs, but it's like a little bit more advanced in some ways, kind of a deep exploration of a RAG based bot system. And then we did multi agents from that, trying to do some autonomy stuff and we're, and we're kind of banging our head against this thing.

[00:22:34] Sam Schillace: And you know, one of the things I started to realize, this is going to get nerdy for a second. I apologize, but let me dig in on it for just a second. No apology needed. Um, we realized is like, again, this is a little bit of an anthropomorphism and an illusion that we're having. So like when we look at these models, we think there's something continuous there.

[00:22:51] Sam Schillace: We're having a conversation with ChatGPT or whatever, with Azure OpenAI, or like, like what's really happening. It's a little bit like watching claymation, right? Like when you watch claymation, you don't think that the clay model is actually really alive. You know that there's like a bunch of still, disconnected frames that your mind is connecting into a continuous experience.

[00:23:12] Metacognition vs Stochasticity

[00:23:12] Sam Schillace: And that's kind of the same thing that's going on with these models. Like they're all the prompts are disconnected no matter what. Which means you're putting a lot of weight on memory, right? This is the thing we talked about. You're like, you're putting a lot of weight on precision and recall of your memory system.

[00:23:27] Sam Schillace: And so like, and it turns out like, because the models are stochastic, they're kind of random. They'll make stuff up if things are missing. If you're naive about your, your memory system, you'll get lots of like accumulated similar memories that will kind of clog the system, things like that. So there's lots of ways in which like, Memory is hard to manage well, and, and, and that's okay.

[00:23:47] Sam Schillace: But what happens is when you're doing plans and you're doing these longer running things that you're talking about, that second level, the metacognition is very vulnerable to that stochastic noise, which is like, I totally want to put this on a bumper sticker that like metacognition is susceptible to stochasticity would be like the great bumper sticker.

[00:24:07] Sam Schillace: So what, these things are very vulnerable to feedback loops when they're trying to do autonomy, and they're very vulnerable to getting lost. So we've had these, like, multi agent Autonomous agent things get kind of stuck on like complimenting each other, or they'll get stuck on being quote unquote frustrated and they'll go on strike.

[00:24:22] Sam Schillace: Like there's all kinds of weird like feedback loops you get into. So what we've learned to answer your question of how you put all this stuff together is You have to, the model's good at thinking, but it's not good at planning. So you do planning in code. So you have to describe the larger process of what you're doing in code somehow.

[00:24:38] Sam Schillace: So semantic intent or whatever. And then you let the model kind of fill in the pieces.

[00:24:43] Generating Synthetic Textbooks

[00:24:43] Sam Schillace: I'll give a less abstract like example. It's a little bit of an old example. I did this like last year, but at one point I wanted to see if I could generate textbooks. And so I wrote this thing called the textbook factory.

[00:24:53] Sam Schillace: And it's, it's tiny. It's like a Jupyter notebook with like, you know, 200 lines of Python and like six very short prompts, but what you basically give it is a sentence. And it like pulls out the topic and the level from that sentence, so you, like, I would like fifth grade reading, I would like eighth grade English.

[00:25:11] Sam Schillace: Ninth grade US history, whatever. That, by the way, all by itself, like, would've been an almost impossible job like three years ago. It's like totally amazing, that by itself. Just parsing an arbitrary natural language sentence to get these two pieces of information out is like almost trivial now.

[00:25:27] Sam Schillace: Which is amazing. So it takes that and it just like makes like a thousand calls to the API and it goes and builds a full year textbook, like decides what the curriculum is with one of the prompts. It breaks it into chapters. It writes all the lessons and lesson plans and like builds a teacher's guide with all the answers to all the questions.

[00:25:42] Sam Schillace: It builds a table of contents, like all that stuff. It's super reliable. You always get a textbook. It's super brittle. You never get a cookbook or a novel. But like you could kind of define that domain pretty carefully, like I can describe

[00:25:59] Sam Schillace: You like decide the curriculum and then you write all the chapters and you write the teacher's guide and you write the table content, like you can, you can describe that out pretty well. And so having that like code exoskeleton wrapped around the model is really helpful, like it keeps the model from drifting off and then you don't have as many of these vulnerabilities around memory that you would normally have.

[00:26:19] Sam Schillace: So like, that's kind of, I think where the syntax and semantics comes together right now.
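As a hedged sketch of that code-exoskeleton idea, here is roughly what a textbook-factory-style loop could look like; the prompts, model client, and structure below are illustrative assumptions, not the actual notebook:

```python
# Hypothetical sketch of "plan with code, think with the model": the plan
# (parse request -> curriculum -> chapters -> teacher's guide) lives in code,
# and the model only fills in the semantic pieces.
# Assumes the OpenAI Python SDK (v1 chat.completions API) and OPENAI_API_KEY set.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

request = "I would like a ninth grade US history textbook."

# Semantic step: pull the subject and grade level out of an arbitrary sentence.
topic = ask(f"Extract the subject and grade level from this request, on one line:\n{request}")

# The overall process is fixed in code; the model never has to plan it.
chapter_list = ask(f"List 10 chapter titles, one per line, for a textbook on {topic}.")
chapters = [c.strip() for c in chapter_list.splitlines() if c.strip()]

lessons = {ch: ask(f"Write a short lesson for the chapter '{ch}' in a {topic} textbook.")
           for ch in chapters}

teachers_guide = ask(
    "Write a brief teacher's guide, with answers to review questions, for these chapters:\n"
    + "\n".join(chapters)
)
```

Because the high-level plan never leaves the code, the model can't drift off into writing a novel instead; you reliably get something textbook-shaped.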

[00:26:24] Trade leverage for precision; use interaction to mitigate

[00:26:24] Sam Schillace: And then I think the question for all of us is. How do you get more leverage out of that? Right? So one of the things that I don't love about virtually everything anyone's built for the last year and a half is people are holding the hands of the model on everything.

[00:26:37] Sam Schillace: Like the leverage is very low, right? You can't turn these things loose to do anything really interesting for very long. You can kind of, and the places where people are getting more work out per unit of work in are usually where somebody has done exactly what I just described. They've kind of figured out what the pattern of the problem is in enough of a way that they can write some code for it.

[00:26:59] Sam Schillace: And then that that like, so I've seen like sales support stuff. I've seen like code base tuning stuff of like, there's lots of things that people are doing where like, you can get a lot of value in some relatively well defined domain using a little bit of the model's ability to think for you and a little, and a little bit of code.

[00:27:18] Code is for syntax and process; models are for semantics and intent.

[00:27:18] Sam Schillace: And then I think the next wave is like, okay, do we do stuff like domain specific languages to like make the planning capabilities better? Do we like start to build more sophisticated primitives? We're starting to think about and talk about like Power Automate and a bunch of stuff inside of Microsoft that we're going to wrap in these like building blocks.

[00:27:34] Sam Schillace: So the models have these chunks of reliable functionality that they can invoke as part of these plans, right? Because you don't want like, if you're going to ask the model to go do something and the output's going to be a hundred thousand lines of code, if it's got to generate that code every time, the randomness, the stochasticity is like going to make that basically not reliable.

[00:27:54] Sam Schillace: You want it to generate it like a 10 or 20 line high level semantic plan for this thing that gets handed to some markup executor that runs it and that invokes that API, that 100, 000 lines of code behind it, API call. And like, that's a really nice robust system for now. And then as the models get smarter as new models emerge, then we get better plans, we get more sophistication.

[00:28:17] Sam Schillace: In terms of what they can choose, things like that. Right. So I think like that feels like that's probably the path forward for a little while, at least, like there was, there was a lot there. I, sorry, like I've been thinking, you can tell I've been thinking about it a lot. Like this is kind of all I think about is like, how do you build.

[00:28:31] Sam Schillace: Really high value stuff out of this. And where do we go? Yeah. The, the role where

[00:28:35] swyx: we are. Yeah. The intermixing of code and, and LLMs is, is a lot of the role of the AI engineer. And I, I, I think in a very real way, you were one of the first to, because obviously you had early access. Honestly, I'm surprised.

[00:28:46] Hands on AI Leadership

[00:28:46] swyx: How are you so hands on? How do you choose to, to dedicate your time? How do you advise other tech leaders? Right. You know, you, you are. You have people working for you, you could not be hands on, but you seem to be hands on. What's the allocation that people should have, especially if they're senior tech

[00:29:03] Sam Schillace: leaders?

[00:29:04] Sam Schillace: It's mostly just fun. Like, I'm a maker, and I like to build stuff. I'm a little bit idiosyncratic. I I've got ADHD, and so I won't build anything. I won't work on anything I'm bored with. So I have no discipline. If I'm not actually interested in the thing, I can't just, like, do it, force myself to do it.

[00:29:17] Sam Schillace: But, I mean, if you're not interested in what's going on right now in the industry, like, go find a different industry, honestly. Like, I seriously, like, this is, I, well, it's funny, like, I don't mean to be snarky, but, like, I was at a dinner, like, a, I don't know, six months ago or something, And I was sitting next to a CTO of a large, I won't name the corporation because it would name the person, but I was sitting next to the CTO of a very large Japanese technical company, and he was like, like, nothing has been interesting since the internet, and this is interesting now, like, this is fun again.

[00:29:46] Sam Schillace: And I'm like, yeah, totally, like this is like, the most interesting thing that's happened in 35 years of my career, like, we can play with semantics and natural language, and we can have these things that are like sort of active, can kind of be independent in certain ways and can do stuff for us and can like, reach all of these interesting problems.

[00:30:02] Sam Schillace: So like that's part of it of it's just kind of fun to, to do stuff and to build stuff. I, I just can't, can't resist. I'm not crazy hands-on, like, I have an eng like my engineering team's listening right now. They're like probably laughing 'cause they, I never, I, I don't really touch code directly 'cause I'm so obsessive.

[00:30:17] Sam Schillace: I told them like, if I start writing code, that's all I'm gonna do. And it's probably better if I stay a little bit high level and like, think about. I've got a really great couple of engineers, a bunch of engineers underneath me, a bunch of designers underneath me that are really good folks that we just bounce ideas off of back and forth and it's just really fun.

[00:30:35] Sam Schillace: That's the role I came to Microsoft to do, really, was to just kind of bring some energy around innovation, some energy around consumer, We didn't know that this was coming when I joined. I joined like eight months before it hit us, but I think Kevin might've had an idea it was coming. And and then when it hit, I just kind of dove in with both feet cause it's just so much fun to do.

[00:30:55] Sam Schillace: Just to tie it back a little bit to the, the Google Docs stuff. When we did Writely originally, it's not like I built Writely in jQuery or anything. Like I built that thing on bare metal back before there were decent JavaScript VMs.

[00:31:10] Sam Schillace: I was just telling somebody today, like you were rate limited. So like just computing the diff when you type something like doing the string diff, I had to write like a binary search on each end of the string diff because like you didn't have enough iterations of a for loop to search character by character.

[00:31:24] Sam Schillace: I mean, like that's how rough it was none of the browsers implemented stuff directly, whatever. It's like, just really messy. And like, that's. Like, as somebody who's been doing this for a long time, like, that's the place where you want to engage, right? If things are easy, and it's easy to go do something, it's too late.

[00:31:42] Sam Schillace: Even if it's not too late, it's going to be crowded, but like the right time to do something new and disruptive and technical is, first of all, still when it's controversial, but second of all, when you have this, like, you can see the future, you ask this, like, what if question, and you can see where it's going, But you have this, like, pit in your stomach as an engineer as to, like, how crappy this is going to be to do.

[00:32:04] Sam Schillace: Like, that's really the right moment to engage with stuff. You're just like, this is going to suck, it's going to be messy, I don't know what the path is, I'm going to get sticks and thorns in my hair, it's going to have false starts. This is why those Schillace Laws are kind of funny, because, like, you know, I wrote them down at one point because they were like my best guess, and I'm like, half of these are probably wrong. I think they've all held up pretty well, but I'm just like guessing along with everybody else, we're just trying to figure this thing out still, right, and I think the only way to do that is to just engage with it.

[00:32:34] Sam Schillace: You just have to like, build stuff. If you're, I can't tell you the number of execs I've talked to who have opinions about AI and have not sat down with anything for more than 10 minutes to like actually try to get anything done. You know, it's just like, it's incomprehensible to me that you can watch this stuff through the lens of like the press and forgive me, podcasts and feel like you actually know what you're talking about.

[00:32:59] Sam Schillace: Like, you have to like build stuff. Like, break your nose on stuff and like figure out what doesn't work.

[00:33:04] swyx: Yeah, I mean, I view us as a starting point, as a way for people to get exposure on what we're doing. They should be looking at, and they still have to do the work as do we. Yeah, I'll basically endorse, like, I think most of the laws.

[00:33:18] Multimodality vs "Text is the universal wire protocol"

[00:33:18] swyx: I think the one I question the most now is text is the universal wire protocol. There was a very popular article, Text Is the Universal Interface, by Roon, who now works at OpenAI. And I, actually, we just, we just dropped a podcast with David Luan, who's CEO of Adept now, but he was VP of Eng, and he pitched Kevin Scott for the original Microsoft investment in OpenAI.

[00:33:40] swyx: Where he's basically pivoting to or just betting very hard on multimodality. I think that's something that we don't really position very well. I think this year, we're trying to all figure it out. I don't know if you have an updated perspective on multimodal models and how that affects agents

[00:33:54] Sam Schillace: or not.

[00:33:55] Sam Schillace: Yeah, I mean, I think the multi I think multi modality is really important. And I, I think it's only going to get better from here. For sure. Yeah, the text is the universal wire protocol. You're probably right. Like, I don't know that I would defend that one entirely. Note that it doesn't say English, right?

[00:34:09] Sam Schillace: Like it's, it's not, that's not even natural language. Like there's stuff like Steve Lucco, who's one of the creators of TypeScript, who created TypeChat, right? Which is this like way to get LLMs to be very precise and return syntax and correct JavaScript. So like, I, yeah, I think like multimodality, like, I think part of the challenge with it is like, it's a little harder to access

[00:34:30] Sam Schillace: programmatically still, like, I think, you know, and I do think, like, you know, when DALL-E and stuff started to come out, I was like, oh, Photoshop's in trouble, because, you know, I'm just going to describe images and you don't need Photoshop anymore. Which hasn't played out that way; they're actually adding a bunch of tools. Like, if you want multimodality to be really super charged, you need to be able to do stuff descriptively, like, okay, find the dog in this picture and mask around it.

[00:34:58] Sam Schillace: Okay, now make it larger and whatever. You need to be able to interact with stuff textually, which we're starting to be able to do. Like, you can do some of that stuff. But there's probably a whole bunch of new capabilities that are going to come out that are going to make it more interesting.

[00:35:11] Sam Schillace: So, I don't know, like, I suspect we're going to wind up looking kind of like Unix at the end of the day, where, like, there's pipes and, like, stuff goes over pipes, and some of the pipes are character pipes and some of them are binary pipes or whatever, and that's going to be compatible with a lot of the systems we have out there. So, like, that's probably still true. And I think there's a lot to be gotten from, from text as a language, but I suspect you're right.

[00:35:37] Sam Schillace: Like that particular law is not going to hold up super well. But we didn't have multimodal going when I wrote it. I'll take one out as well.

[00:35:46] Azure OpenAI vs Microsoft Research vs Microsoft AI Division

[00:35:46] swyx: I know. Yeah, I mean, the innovations that keep coming out of Microsoft. You mentioned multi agent. I think you're talking about autogen.

[00:35:52] swyx: But there's always research coming out of MSR. Yeah. Phi-1, Phi-2. Yeah, there's a bunch of

[00:35:57] Sam Schillace: stuff. Yeah.

[00:35:59] swyx: What should, how should the outsider or the AI engineer, just as a sort of final word, like, how should they view the Microsoft portfolio of things? I know you're not here to be a salesman, but what, how do you explain, you know, Microsoft's AI

[00:36:12] Sam Schillace: work to people.

[00:36:13] Sam Schillace: There's a lot of stuff going on. Like, first of all, like, I should, I'll be a little tiny bit of a salesman for, like, two seconds and just point out that, like, one of the things we have is the Microsoft for Startups Founders Hub. So, like, you can get, like, Azure credits and stuff from us. Like, up to, like, 150 grand, I think, over four years.

[00:36:29] Sam Schillace: So, like, it's actually pretty easy to get credit; you can start with, I think, 500 bucks or something, with very little other than just an idea. So like there's, that's pretty cool. Like, Microsoft is very much all in on AI at, at many levels. And so like that, you mentioned, you mentioned Autogen, like, so I sit in the office of the CTO, Microsoft Research sits under him, under the office of the CTO as well.

[00:36:51] Sam Schillace: So the Autogen group came out of somebody in MSR, like in that group. So like there's sort of the spectrum of very researchy things going on in research, where we're doing things like Phi, which is the small language model efficiency exploration that's really, really interesting. Lots of very technical folks there that are building different kinds of models.

[00:37:10] Sam Schillace: And then there's like, groups like my group that are kind of a little bit in the middle that straddle product and, and, and research and kind of have a foot in both worlds and are trying to kind of be a bridge into the product world. And then there's like a whole bunch of stuff on the product side of things.

[00:37:23] Sam Schillace: So there's all the Azure OpenAI stuff, and then there's all the stuff that's in Office and Windows. And I, so I think, like, the way, I don't know, the way to think about Microsoft is we're just powering AI at every level we can, and making it as accessible as we can to both end users and developers.

[00:37:42] Sam Schillace: There's this really nice research arm at one end of that spectrum that's really driving the cutting edge. The Phi stuff is really amazing. It broke the Chinchilla curves. Right, like we didn't, that's the Textbooks Are All You Need paper, and it's still kind of controversial, but like that was really a surprising result that came out of MSR.

[00:37:58] Sam Schillace: And so like I think Microsoft is both being a thought leader on one end, on the other end with all the Azure OpenAI, all the Azure tooling that we have, like very much a developer centric, kind of the tinkerer's paradise that Microsoft always was. It's like a great place to come and consume all these things.

[00:38:14] Sam Schillace: There's really amazing stuff ideas that we've had, like these very rich, long running, rag based chatbots that we didn't talk about that are like now possible to just go build with Azure AI Studio for yourself. You can build and deploy like a chatbot that's trained on your data specifically, like very easily and things like that.

[00:38:31] Sam Schillace: So like there's that end of things. And then there's all this stuff that's in Office, where like, you could just like use the copilots both in Bing, but also just like daily your daily work. So like, it's just kind of everywhere at this point, like everyone in the company thinks about it all the time.

[00:38:43] Sam Schillace: There's like no single answer to that question. That was way more salesy than I thought I was capable of, but like, that is actually the genuine truth. Like, it is all the time, it is all levels, it is all the way from really pragmatic, approachable stuff for somebody starting out who doesn't know things, all the way to like Absolutely cutting edge research, silicon, models, AI for science, like, we didn't talk about any of the AI for science stuff, I've seen magical stuff coming out of the research group on that topic, like just crazy cool stuff that's coming, so.

[00:39:13] Sam Schillace: You've

[00:39:14] swyx: called this since you joined Microsoft. I point listeners to the podcast that you did in 2022, pre ChatGPT, with Kevin Scott. And yeah, you've been saying this from the beginning. So this is not a new talk track for you, like you've, you, you've been a genuine believer for a long time.

[00:39:28] swyx: And,

[00:39:28] Sam Schillace: and just to be clear, like I haven't been at Microsoft that long. I've only been here for like two, a little over two years and you know, it's a little bit weird for me 'cause for a lot of my career they were the competitor and the enemy and you know, it's kind of funny to be here, but like it's really remarkable.

[00:39:40] On Satya

[00:39:40] Sam Schillace: It's going on. I really, really like Satya. I've met and worked with a bunch of big tech CEOs and I think he's a genuinely awesome person and he's fun to work with and has a really great vision. So like, and I obviously really like Kevin, we've been friends for a long time. So it's a cool place.

[00:39:56] Sam Schillace: I think there's a lot of interesting stuff. We

[00:39:57] swyx: have some awareness Satya is a listener. So obviously he's super welcome on the pod anytime. You can just drop in a good word for us.

[00:40:05] Sam Schillace: He's fun to talk to. It's interesting because like CEOs can be lots of different personalities, but he is you were asking me about how I'm like, so hands on and engaged.

[00:40:14] Sam Schillace: I'm amazed at how hands on and engaged he can be given the scale of his job. Like, he's super, super engaged with stuff, super in the details, understands a lot of the stuff that's going on. And the science side of things, as well as the product and the business side, I mean, it's really remarkable. I don't say that, like, because he's listening or because I'm trying to pump the company, like, I'm, like, genuinely really, really impressed with, like, how, what he's, like, I look at him, I'm like, I love this stuff, and I spend all my time thinking about it, and I could not do what he's doing.

[00:40:42] Sam Schillace: Like, it's just incredible how much he can get into his head.

[00:40:44] Sam at AI Leadership Track

[00:40:44] Ben Dunphy: Sam, it's been an absolute pleasure to hear from you here, hear the war stories. So thank you so much for coming on. Quick question though: you're here on the podcast as the presenting sponsor for the AI Engineer World's Fair, so will you be taking the stage there, or are we going to defer that to Satya?

[00:41:01] Sam Schillace: And I'm happy to talk to folks. I'm happy to be there. It's always fun. Like, I like talking to people more than talking at people, so I don't love giving keynotes. I love doing Q&As and engaging with engineers. I really am at heart just a builder and an engineer, and like, that's what I'm happiest doing, like being creative and building things and figuring stuff out.

[00:41:22] Sam Schillace: That would be really fun to do, and I'll probably go just to like, hang out with people and hear what they're working on.

[00:41:28] swyx: The AI leadership track is just AI leaders, and then it's closed doors, so you know, more sort of an unconference style where people just talk about their issues.

[00:41:35] Sam Schillace: Yeah, that would be, that's much more fun. Really, because we are all wrestling with this, trying to figure out what it means. Right. The reason the Schillace Laws kind of give me the willies a little bit is, like, I was joking that we should just call them the Schillace best guesses, because like, I don't want people to think that that's like some iron law.

[00:41:52] Sam Schillace: We're all trying to figure this stuff out. Right. Like some of it's right, some of it's not right. It's going to be messy. We'll have false starts, but yeah, we're all working it out. So that's the fun conversation.

[00:42:02] Ben Dunphy: All right. Thanks for having me. Yeah, thanks so much for coming on.

[00:42:05] Final Plug for Tickets & CFP

[00:42:05] Ben Dunphy: For those of you listening, interested in attending AI Engineer World's Fair, you can purchase your tickets today.

[00:42:11] Ben Dunphy: Learn more about the event at ai.engineer. You can even purchase group discounts: if you purchase four or more tickets, use the code GROUP, and one of those four tickets will be free. If you want to speak at the event, the CFP closes April 8th, so check out the link at ai.engineer and send us your proposals for talks, workshops, or discussion groups.

[00:42:33] Ben Dunphy: So if you want to come to THE event of the year for AI engineers, the technical event of the year for AI engineers, it's June 25, 26, and 27 in San Francisco. That's it!



Get full access to Latent Space at www.latent.space/subscribe
2024-03-29
Link to episode

Why Google failed to make GPT-3 + why Multimodal Agents are the path to AGI - with David Luan of Adept

Our next SF event is AI UX 2024 - let's see the new frontier for UX since last year!

Last call: we are recording a preview of the AI Engineer World's Fair with swyx and Ben Dunphy, so send us any questions you have about Speaker CFPs and Sponsor Guides!

Alessio is now hiring engineers for a new startup he is incubating at Decibel: Ideal candidate is an 'ex-technical co-founder type'. Reach out to him for more!

David Luan has been at the center of the modern AI revolution: he was the ~30th hire at OpenAI, he led Google's LLM efforts and co-led Google Brain, and then started Adept in 2022, one of the leading companies in the AI agents space. In today's episode, we asked David for some war stories from his time in early OpenAI (including working with Alec Radford ahead of the GPT-2 demo with Sam Altman, that resulted in Microsoft's initial $1b investment), and how Adept is building agents that can "do anything a human does on a computer" - his definition of useful AGI.

Why Google *couldn't* make GPT-3

While we wanted to discuss Adept, we couldn't talk to a former VP Eng of OpenAI and former LLM tech lead at Google Brain and not ask about the elephant in the room.

It's often asked how Google had such a huge lead in 2017 with Vaswani et al. creating the Transformer and Noam Shazeer predicting trillion-parameter models and yet it was David's team at OpenAI who ended up making GPT 1/2/3.

David has some interesting answers:

"So I think the real story of GPT starts at Google, of course, right? Because that's where Transformers sort of came about. However, the number one shocking thing to me was that, and this is like a consequence of the way that Google is organized... what they (should) have done would be say, hey, Noam Shazeer, you're a brilliant guy. You know how to scale these things up. Here's half of all of our TPUs. And then I think they would have destroyed us. He clearly wanted it too...

You know, every day we were scaling up GPT-3, I would wake up and just be stressed. And I was stressed because, you know, you just look at the facts, right? Google has all this compute. Google has all the people who invented all of these underlying technologies. There's a guy named Noam who's really smart, who's already gone and done this talk about how he wants a trillion parameter model. And I'm just like, we're probably just doing duplicative research to what he's doing. He's got this decoder only transformer that's probably going to get there before we do.

And it turned out the whole time that they just couldn't get critical mass. So during my year where I led the Google LM effort and I was one of the brain leads, you know, it became really clear why. At the time, there was a thing called the Brain Credit Marketplace. Everyone's assigned a credit. So if you have a credit, you get to buy N chips according to supply and demand. So if you want to go do a giant job, you had to convince like 19 or 20 of your colleagues not to do work. And if that's how it works, it's really hard to get that bottom-up critical mass to go scale these things. And the team at Google were fighting valiantly, but we were able to beat them simply because we took big swings and we focused."
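To make the resource math concrete, here is a toy sketch of why a bottom-up credit marketplace blocks frontier-scale runs: if credits are split evenly, no single researcher's share covers a big job, so every large run means persuading most of the team to idle. This is purely illustrative; the numbers and the `colleagues_needed` helper are made up, not how Google's actual system worked.

```python
import math

# Toy model of a bottom-up compute-credit marketplace (illustrative only).
# Everyone gets an equal share of credits; a frontier-scale job needs far more.

TOTAL_CHIP_HOURS = 10_000            # cluster capacity per scheduling window
RESEARCHERS = 20                     # equal credit allocation per researcher
CREDITS_EACH = TOTAL_CHIP_HOURS / RESEARCHERS

def colleagues_needed(job_chip_hours: float) -> int:
    """How many colleagues must agree to idle so one person can run this job?"""
    borrowed = max(0.0, job_chip_hours - CREDITS_EACH)  # beyond your own share
    return math.ceil(borrowed / CREDITS_EACH)

print(colleagues_needed(400))      # 0  -- a small "write a paper" job fits one share
print(colleagues_needed(10_000))   # 19 -- matches "convince like 19 or 20 of your colleagues"
```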

Cloning HGI for AGI

Human intelligence got to where it is today through evolution. Some argue that to get to AGI, we will approximate all the "FLOPs" that went into that process, an approach most famously mapped out by Ajeya Cotra's Biological Anchors report:

The early days of OpenAI were very reinforcement learning-driven with the Dota project, but that's a very inefficient way for these models to re-learn everything. (Kanjun from Imbue shared similar ideas in her episode).

David argues that there's a shortcut. We can bootstrap from existing intelligence.

"Years ago, I had a debate with a Berkeley professor as to what will it actually take to build AGI. And his view is basically that you have to reproduce all the flops that went into evolution in order to be able to get there... I think we are ignoring the fact that you have a giant shortcut, which is you can behaviorally clone everything humans already know. And that's what we solved with LLMs!"

LLMs today basically model intelligence using all (good!) written knowledge (see our Datasets 101 episode), and have now expanded to non-verbal knowledge (see our HuggingFace episode on multimodality). The SOTA self-supervised pre-training process is surprisingly data-efficient in taking large amounts of unstructured data, and approximating reasoning without overfitting.
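Mechanically, "behaviorally cloning written knowledge" is just the standard self-supervised objective: predict the next token of human-written text under a cross-entropy loss, so the model learns to imitate what people already wrote. A minimal PyTorch-style sketch, assuming a generic `model` that maps token ids to logits (not any particular lab's training code):

```python
import torch
import torch.nn.functional as F

def behavioral_cloning_step(model, token_ids: torch.Tensor) -> torch.Tensor:
    """One next-token-prediction step: imitate the humans who wrote the text.

    token_ids: (batch, seq_len) integer tokens from ordinary human-written data.
    Returns the cross-entropy loss for predicting each token from its prefix.
    """
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)                       # (batch, seq_len - 1, vocab)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),     # flatten positions
        targets.reshape(-1),                     # the human "demonstrations" to clone
    )
    return loss
```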

But how do you cross the gap from the LLMs of today to building the AGI we all want?

This is why David & friends left to start Adept.

"We believe the clearest framing of general intelligence is a system that can do anything a human can do in front of a computer. A foundation model for actions, trained to use every software tool, API, and webapp that exists, is a practical path to this ambitious goal" - ACT-1 Blogpost

Critical Path: Abstraction with Reliability

The AGI dream is fully autonomous agents, but there are levels to autonomy that we are comfortable giving our agents, based on how reliable they are. In David's word choice, we always want higher levels of "abstractions" (aka autonomy), but our need for "reliability" is the practical limit on how high of an abstraction we can use.

"The critical path for Adept is we want to build agents that can do a higher and higher level abstraction things over time, all while keeping an insanely high reliability standard. Because that's what turns us from research into something that customers want. And if you build agents with really high reliability standard, but are continuing pushing a level of abstraction, you then learn from your users how to get that next level of abstraction faster. So that's how you actually build the data flow.

That's the critical path for the company. Everything we do is in service of that."

We saw how Adept thinks about different levels of abstraction at the 2023 Summit:

The highest abstraction is the "AI Employee", but we'll get there with "AI-enabled employees". Alessio recently gave a talk about the future of work with "services as software" at this week's Nvidia GTC (slides).

No APIs

Unlike a lot of large research labs, Adept's framing of AGI as "being able to use your computer like a human" carries with it a useful environmental constraint:

"Having a human robot lets you do things that humans do without changing everything along the way. It's the same thing for software, right? If you go itemize out the number of things you want to do on your computer for which every step has an API, those numbers of workflows add up pretty close to zero. And so then many points along the way, you need the ability to actually control your computer like a human. It also lets you learn from human usage of computers as a source of training data that you don't get if you have to somehow figure out how every particular step needs to be some particular custom private API thing. And so I think this is actually the most practical path (to economic value)."

This realization and conviction means that multimodal models are the way to go. Instead of using function calling to call APIs to build agents, which is what OpenAI and most of the open LLM industry have done to date, Adept wants to "drive by vision" (aka see the screen as a human sees it) and pinpoint where to click and type as a human does. No APIs needed, because most software doesn't expose APIs.
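As a concrete contrast with function-calling agents, here is a minimal sketch of what a "drive by vision" loop looks like: the agent perceives raw pixels and emits human-style click/type actions, so no per-app API is required. `screenshot`, `vision_action_model`, and `execute` are hypothetical placeholders for illustration, not Adept's actual interfaces:

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str            # "click", "type", or "done"
    x: int = 0           # pixel coordinates on screen, targeted as a human would
    y: int = 0
    text: str = ""

def run_agent(goal: str, screenshot, vision_action_model, execute, max_steps: int = 50):
    """Vision-driven agent loop: look at the screen, act like a human would.

    screenshot():                  returns the current screen as an image
    vision_action_model(g, img):   multimodal model mapping (goal, screen) -> Action
    execute(action):               performs the click/keystrokes at the OS level
    All three are assumed, hypothetical interfaces used only for illustration.
    """
    for _ in range(max_steps):
        frame = screenshot()                     # perceive: raw pixels, no app API
        action = vision_action_model(goal, frame)
        if action.kind == "done":
            return True
        execute(action)                          # actuate: click/type like a person
    return False  # ran out of steps without finishing
```

A function-calling agent can only act where someone has already written a tool definition; a loop like the one above works on anything a human can see and click, which is the point of the "No APIs" framing.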

Extra context for readers: You can see the DeepMind SIMA model in the same light:

One system that learned to play a diverse set of games (instead of one dedicated model per game) using only pixel inputs and keyboard-and-mouse action outputs!

The OpenInterpreter team is working on a "Computer API" that also does the same.

To do this, Adept had to double down on a special kind of multimodality for knowledge work:

"A giant thing that was really necessary is really fast multimodal models that are really good at understanding knowledge work and really good at understanding screens. And that needs to kind of be the base for some of these agents...

"I think one big hangover of the primarily academic focus for multimodal models is that most multimodal models are primarily trained on like natural images, cat and dog photos, stuff that's come out of the camera... (but) where are they going to be the most useful? They're going to be most useful in knowledge work tasks. That's where the majority of economic value is going to be. It's not in cats and dogs.

And so if that's what it is, what do you need to train? I need to train on like charts, graphs, tables, invoices, PDFs, receipts, unstructured data, UIs. That's just a totally different pre-training corpus. And so Adept spent a lot of time building that."

With this context, you can now understand the full path of Adept's public releases:

* ACT-1 (Sept 2022): a large Transformer model optimized for browser interactions. It has a custom rendering of the browser viewport that allows it to better understand it and take actions.

* Persimmon-8B (Sept 2023): a permissive open LLM (weights and code here)

* Fuyu-8B (Oct 2023): a small version of the multimodal model that powers Adept. Vanilla decoder-only transformer with no specialized image encoder, which allows it to handle input images of varying resolutions without downsampling.

* Adept Experiments (Nov 2023): A public tool to build automations in the browser. This is powered by Adept's core technology but it's just a piece of their enterprise platform. They use it as a way to try various design ideas.

* Fuyu Heavy (Jan 2024) - a new multimodal model designed specifically for digital agents and the world's third-most-capable multimodal model (beating Gemini Pro on MMMU, AI2D, and ChartQA), "behind only GPT4-V and Gemini Ultra, which are 10-20 times bigger"

The Fuyu-8B post in particular exhibits a great number of examples of knowledge work multimodality.
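To make the Fuyu-8B design above concrete: because there is no separate image encoder, image patches are linearly projected straight into the decoder's embedding space, and a larger screenshot simply becomes more image tokens instead of being downsampled. A minimal sketch of that patch handling (dimensions simplified and the image-newline token omitted; based on the public Fuyu-8B description, not Adept's actual code):

```python
import torch
import torch.nn as nn

class FuyuStylePatchEmbed(nn.Module):
    """Project raw image patches directly into the decoder's embedding space."""

    def __init__(self, patch: int = 30, channels: int = 3, d_model: int = 4096):
        super().__init__()
        self.patch = patch
        # One linear layer instead of a dedicated ViT-style image encoder.
        self.proj = nn.Linear(patch * patch * channels, d_model)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (channels, H, W) at its *native* resolution -- no downsampling.
        c, h, w = image.shape
        p = self.patch
        image = image[:, : h - h % p, : w - w % p]        # crop to whole patches
        patches = image.unfold(1, p, p).unfold(2, p, p)   # (c, H/p, W/p, p, p)
        patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c * p * p)
        # (num_patches, d_model): image "tokens" fed into the decoder alongside
        # text tokens; a bigger screenshot just yields more tokens.
        return self.proj(patches)
```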

Why Adept is NOT a Research Lab

With OpenAI now worth >$90b and Anthropic >$18b, it is tempting to conclude that the AI startup metagame is to build a large research lab, and attract the brightest minds and highest capital to build AGI.

Our past guests (see the Humanloop episode) and Kanjun (from Imbue) combined to ask the most challenging questions of the pod - with David/Adept's deep research pedigree from DeepMind and OpenAI, why is Adept not building more general foundation models (like Persimmon) and playing the academic benchmarks game? Why is Adept so focused on commercial agents instead?

"I feel super good that we're doing foundation models in service of agents and all of the reward within Adept is flowing from 'Can we make a better agent?'

"... I think pure play foundation model companies are just going to be pinched by how good the next couple of (Meta Llama models) are going to be... And then seeing the really big players put ridiculous amounts of compute behind just training these base foundation models, I think is going to commoditize a lot of the regular LLMs and soon regular multimodal models. So I feel really good that we're just focused on agents."

and the commercial grounding is his answer to Kanjun too (whom we also asked the inverse question to compare with Adept):

"... the second reason I work at Adept is if you believe that actually having customers and a reward signal from customers lets you build AGI faster, which we really believe, then you should come here. And I think the example for why that's true is, for example, our evaluations are not academic evals. They're not simulator evals. They're like, okay, we have a customer that really needs us to do these particular things. We can do some of them. These are the ones they want us to do that we can't do at all. We've turned those into evals... I think that's a degree of practicality that really helps."

And his customers seem pretty happy, because David didn't need to come on to do a sales pitch:

David: "One of the things we haven't shared before is we're completely sold out for Q1."

Swyx: "Sold out of what?"

David: "Sold out of bandwidth to onboard more customers."

Well, that's a great problem to have.

Show Notes

* David Luan

* Dextro at Data Driven NYC (2015)

* Adept

* ACT-1

* Persimmon-8B

* Adept Experiments

* Fuyu-8B

* $350M Series B announcement

* Amelia Wattenberger talk at AI Engineer Summit

* Figure

Chapters

* [00:00:00] Introductions

* [00:01:14] Being employee #30 at OpenAI and its early days

* [00:13:38] What is Adept and how do you define AGI?

* [00:21:00] Adept's critical path and research directions

* [00:26:23] How AI agents should interact with software and impact product development

* [00:30:37] Analogies between AI agents and self-driving car development

* [00:32:42] Balancing reliability, cost, speed and generality in AI agents

* [00:37:30] Potential of foundation models for robotics

* [00:39:22] Core research questions and reasons to work at Adept

Transcripts

Alessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai.

Swyx [00:00:15]: Hey, and today we have David Luan, CEO, co-founder of Adept in the studio. Welcome.

David [00:00:20]: Yeah, thanks for having me.

Swyx [00:00:21]: Been a while in the works. I've met you socially at one of those VC events and you said that you were interested in coming on and glad we finally were able to make this happen.

David: Yeah, happy to be part of it.

Swyx: So we like to introduce the speaker and then also just like have you talk a little bit about like what's not on your LinkedIn, what people should just generally know about you. You started a company in college, which was the first sort of real-time video detection and classification API, that was Dextro, and that was your route to getting acquired into Axon where you were a director of AI. Then you were the 30th hire at OpenAI?

David [00:00:53]: Yeah, 30, 35, something around there. Something like that.

Swyx [00:00:56]: So you were VP of Eng for two to two and a half years, briefly served as tech lead of large models at Google, and then in 2022 started Adept. So that's the sort of brief CV. Is there anything else you want to fill in the blanks on, or that people should know more about?

David [00:01:14]: I guess a broader story was I joined OpenAI fairly early and I did that for about two and a half to three years leading engineering there. It's really funny, I think second or third day of my time at OpenAI, Greg and Ilya pulled me in a room and we're like, you know, you should take over our directs and we'll go mostly do IC work. So that was fun, just coalescing a bunch of teams out of a couple of early initiatives that had already happened. The company, the Dota effort was going pretty hard and then more broadly trying to put bigger picture direction around what we were doing with basic research. So I spent a lot of time doing that. And then I led Google's LLM efforts, but also co-led Google Brain was one of the brain leads more broadly. You know, there's been a couple of different eras of AI research, right? If we count everything before 2012 as prehistory, which people hate it when I say that, kind of had this like you and your three best friends write a research paper that changes the world period from like 2012 to 2017. And I think the game changed in 2017 and like most labs didn't realize it, but we at OpenAI really did. I think in large part helped by like Ilya's constant beating of the drum that the world would be covered in data centers. And I think-

Swyx [00:02:15]: It's causally neat.

David [00:02:16]: Yeah. Well, like I think we had conviction in that, but it wasn't until we started seeing results that it became clear that that was where we had to go. But also part of it as well was for OpenAI, like when I first joined, I think one of the jobs that I had to do was how do I tell a differentiated vision for who we were technically compared to, you know, hey, we're just smaller Google Brain, or like you work at OpenAI if you live in SF and don't want to commute to Mountain View or don't want to live in London, right? That's like not enough to like hang your technical identity as a company. And so what we really did was, and I spent a lot of time pushing this, is just how do we get ourselves focused on a certain class of like giant swings and bets, right? Like how do you flip the script from you just do bottom-up research to more about how do you like leave some room for that, but really make it about like, what are the big scientific outcomes that you want to show? And then you just solve them at all costs, whether or not you care about novelty and all that stuff. And that became the dominant model for a couple of years, right? And then what's changed now is I think the number one driver of AI products over the next couple of years is going to be the deep co-design and co-evolution of product and users for feedback and actual technology. And I think labs, every tool to go do that are going to do really well. And that's a big part of why I started Adept.

Alessio [00:03:20]: You mentioned Dota, any memories from the switch from RL to Transformers at the time, and kind of how the industry was evolving more on the LLM side and leaving behind some of the more agent simulation work?

David [00:03:33]: Like zooming way out, I think agents are just absolutely the correct long-term direction, right? You just go define what AGI is, right? You're like, Hey, like, well, first off, actually, I don't love AGI definitions that involve human replacement because I don't think that's actually how it's going to happen. Even this definition of like, Hey, AGI is something that outperforms humans at economically valuable tasks has a kind of implicit view of the world about what's going to be the role of people. I think what I'm more interested in is like a definition of AGI that's oriented around like a model that can do anything a human can do on a computer. If you go think about that, which is like super tractable, then agent is just a natural consequence of that definition. And so what did all the work we did on our own stuff like that get us? It got us a really clear formulation. Like you have a goal and you want to maximize the goal, you want to maximize reward, right? And the natural LLM formulation doesn't come with that out of the box, right? I think that we as a field got a lot right by thinking about, Hey, how do we solve problems of that caliber? And then the thing we forgot is de novo RL is like a pretty terrible way to get there quickly. Why are we rediscovering all the knowledge about the world? Years ago, I had a debate with a Berkeley professor as to what will it actually take to build AGI. And his view is basically that you have to reproduce all the flops that went into evolution in order to be able to get there. Right.

Swyx [00:04:44]: The biological basis theory. Right.

David [00:04:46]: So I think we are ignoring the fact that you have a giant shortcut, which is you can behavioral clone everything humans already know. And that's what we solved with LLMs. We've solved behavioral cloning, everything that humans already know. Right. So like today, maybe LLMs is like behavioral cloning every word that gets written on the internet in the future, the multimodal models are becoming more of a thing where behavioral cloning the visual world. But really, what we're just going to have is like a universal byte model, right? Where tokens of data that have high signal come in, and then all of those patterns are like learned by the model. And then you can regurgitate any combination now. Right. So text into voice out, like image into other image out or video out or whatever, like these like mappings, right? Like all just going to be learned by this universal behavioral cloner. And so I'm glad we figured that out. And I think now we're back to the era of how do we combine this with all of the lessons we learned during the RL period. That's what's going to drive progress.

Swyx [00:05:35]: I'm still going to pressure you for a few more early OpenAI stories before we turn to the Adept stuff. On your personal site, which I love, because it's really nice, like personal, you know, story context around like your history. I need to update it. It's so old. Yeah, it's so out of date. But you mentioned GPT-2. Did you overlap with GPT-1? I think you did, right?

David [00:05:53]: I actually don't quite remember. I think I was joining right around- Right around then?

Swyx [00:05:57]: I was right around that, yeah. Yeah. So what I remember was Alec, you know, just kind of came in and was like very obsessed with Transformers and applying them to like Reddit sentiment analysis. Yeah, sentiment, that's right. Take us through-

David [00:06:09]: Sentiment neuron, all this stuff.

Swyx [00:06:10]: The history of GPT as far as you know, you know, according to you. Ah, okay.

David [00:06:14]: History of GPT, according to me, that's a pretty good question. So I think the real story of GPT starts at Google, of course, right? Because that's where Transformers sort of came about. However, the number one shocking thing to me was that, and this is like a consequence of the way that Google is organized, where like, again, you and your three best friends write papers, right? Okay. So zooming way out, right? I think about my job when I was a full-time research leader as a little bit of a portfolio allocator, right? So I've got really, really smart people. My job is to convince people to coalesce around a small number of really good ideas and then run them over the finish line. My job is not actually to promote a million ideas and never have critical mass. And then as the ideas start coming together and some of them start working well, my job is to nudge resources towards the things that are really working and then start disbanding some of the things that are not working, right? That muscle did not exist during my time at Google. And I think had they had it, what they would have done would be say, hey, Noam Shazir, you're a brilliant guy. You know how to scale these things up. Here's half of all of our TPUs. And then I think they would have destroyed us. He clearly wanted it too.

Swyx [00:07:17]: He's talking about trillion parameter models in 2017.

David [00:07:20]: Yeah. So that's the core of the GPT story, right? Which is that, and I'm jumping around historically, right? But after GPT-2, we were all really excited about GPT-2. I can tell you more stories about that. It was the last paper that I even got to really touch before everything became more about building a research org. You know, every day we were scaling up GPT-3, I would wake up and just be stressed. And I was stressed because, you know, you just look at the facts, right? Google has all this compute. Google has all the people who invented all of these underlying technologies. There's a guy named Noam who's really smart, who's already gone and done this talk about how he wants a trillion parameter model. And I'm just like, we're probably just doing duplicative research to what he's doing, right? He's got this decoder only transformer that's probably going to get there before we do. And I was like, but like, please just like let this model finish, right? And it turned out the whole time that they just couldn't get critical mass. So during my year where I led the Google LM effort and I was one of the brain leads, you know, it became really clear why, right? At the time, there was a thing called the brain credit marketplace. And did you guys know the brain credit marketplace? No, I never heard of this. Oh, so it's actually, it's a, you can ask any Googler.

Swyx [00:08:23]: It's like just like a thing that, that, I mean, look like, yeah, limited resources, you got to have some kind of marketplace, right? You know, sometimes it's explicit, sometimes it isn't, you know, just political favors.

David [00:08:34]: You could. And so then basically everyone's assigned a credit, right? So if you have a credit, you get to buy N chips according to supply and demand. So if you want to go do a giant job, you had to convince like 19 or 20 of your colleagues not to do work. And if that's how it works, it's really hard to get that bottom-up critical mass to go scale these things. And the team at Google were fighting valiantly, but we were able to beat them simply because we took big swings and we focused. And I think, again, that's like part of the narrative of like this phase one of AI, right? Of like this modern AI era to phase two. And I think in the same way, I think phase three companies are going to out-execute phase two companies because of the same asymmetry of success.

Swyx [00:09:12]: Yeah. I think it's underrated how much NVIDIA worked with you in the early days as well. I think maybe, I think it was Jensen, I'm not sure, who circulated a recent photo of him delivering the first DGX to you guys.

David [00:09:24]: I think Jensen has been a complete legend and a mastermind throughout. I have so much respect for NVIDIA. It is unreal.

Swyx [00:09:34]: But like with OpenAI, did they kind of give their requirements, like co-design it, or just work off of whatever NVIDIA gave them?

David [00:09:40]: So we work really closely with them. There's, I'm not sure I can share all the stories, but examples of ones that I've found particularly interesting. So Scott Gray is amazing. I really like working with him. He was on one of my teams, the supercomputing team, which Chris Berner runs and Chris Berner still does a lot of stuff in that. As a result, like we had very close ties to NVIDIA. Actually, one of my co-founders at Adept, Erich Elsen, was also one of the early GPGPU people. So he and Scott and Brian Catanzaro at NVIDIA and Jonah and Ian at NVIDIA, I think all were very close. And we're all sort of part of this group of how do we push these chips to the absolute limit? And I think that kind of collaboration helped quite a bit. I think one interesting set of stuff is knowing the A100 generation, that like quad sparsity was going to be a thing. Is that something that we want to go look into, right? And figure out if that's something that we could actually use for model training. Really what it boils down to is that, and I think more and more people realize this, six years ago, people, even three years ago, people refused to accept it. This era of AI is really a story of compute. It's really the story of how do you more efficiently map actual usable model flops to compute.
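For context on the hardware feature being referenced: the A100 generation added 2:4 fine-grained structured sparsity, which appears to be the "quad sparsity" mentioned here; in every group of four weights, two are zero and the tensor cores can skip them. A toy sketch of imposing that pattern on a weight matrix (illustrative only, not how any particular training run actually used it):

```python
import torch

def prune_2_to_4(weights: torch.Tensor) -> torch.Tensor:
    """Zero out the 2 smallest-magnitude weights in every group of 4 (2:4 sparsity).

    A100-class tensor cores can skip the zeroed weights, roughly doubling
    matmul throughput for layers that tolerate this pruning pattern.
    """
    out_features, in_features = weights.shape
    assert in_features % 4 == 0, "2:4 sparsity groups weights along the input dim in fours"
    groups = weights.reshape(out_features, in_features // 4, 4)
    # Keep the top-2 magnitudes per group of 4, zero the rest.
    top2 = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, top2, True)
    return (groups * mask).reshape(out_features, in_features)

if __name__ == "__main__":
    w = torch.randn(8, 16)
    sparse_w = prune_2_to_4(w)
    print((sparse_w == 0).float().mean())  # ~0.5: half the weights are zeroed
```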

Swyx [00:10:38]: Is there another GPT 2, 3 story that you love to get out there that you think is underappreciated for the amount of work that people put into it?

David [00:10:48]: So two interesting GPT 2 stories. One of them was I spent a good bit of time just sprinting to help Alec get the paper out. And I remember one of the most entertaining moments was we were writing the modeling section. And I'm pretty sure the modeling section was the shortest modeling section of any ML, reasonably legitimate ML paper to that moment. It was like section three model. This is a standard vanilla decoder only transformer with like these particular things, those paragraph long if I remember correctly. And both of us were just looking at the same being like, man, the OGs in the field are going to hate this. They're going to say no novelty. Why did you guys do this work? So now it's funny to look at in hindsight that it was pivotal kind of paper, but I think it was one of the early ones where we just leaned fully into all we care about is solving problems in AI and not about, hey, is there like four different really simple ideas that are cloaked in mathematical language that doesn't actually help move the field forward?

Swyx [00:11:42]: Right. And it's like you innovate on maybe like data set and scaling and not so much the architecture.

David [00:11:48]: We all know how it works now, right? Which is that there's a collection of really hard won knowledge that you get only by being at the frontiers of scale. And that hard won knowledge, a lot of it's not published. A lot of it is stuff that's actually not even easily reducible to what looks like a typical academic paper. But yet that's the stuff that helps differentiate one scaling program from another. You had a second one? So the second one is, there's like some details here that I probably shouldn't fully share, but hilariously enough for the last meeting we did with Microsoft before Microsoft invested in OpenAI, Sam Altman, myself and our CFO flew up to Seattle to do the final pitch meeting. And I'd been a founder before. So I always had a tremendous amount of anxiety about partner meetings, which this basically this is what it was. I had Kevin Scott and Satya and Amy Hood, and it was my job to give the technical slides about what's the path to AGI, what's our research portfolio, all of this stuff, but it was also my job to give the GPT-2 demo. We had a slightly bigger version of GPT-2 that we had just cut maybe a day or two before this flight up. And as we all know now, model behaviors you find predictable at one checkpoint are not predictable in another checkpoint. And so I'd spent all this time trying to figure out how to keep this thing on rails. I had my canned demos, but I knew I had to go turn it around over to Satya and Kevin and let them type anything in. And that just, that really kept me up all night.

Swyx [00:13:06]: Nice. Yeah.

Alessio [00:13:08]: I mean, that must have helped you. Talking about partner meetings, you raised $420 million for Adept. The last round was a $350 million Series B, so I'm sure you do great in partner meetings.

Swyx [00:13:18]: Pitchers meetings. Nice.

David [00:13:20]: No, that's a high compliment coming from a VC.

Alessio [00:13:22]: Yeah, no, I mean, you're doing great already for us. Let's talk about Adept. And we were doing pre-prep and you mentioned that maybe a lot of people don't understand what Adept is. So usually we try and introduce the product and then have the founders fill in the blanks, but maybe let's do the reverse. Like what is Adept? Yeah.

David [00:13:38]: So I think Adept is the least understood company in the broader space of foundational models plus agents. So I'll give some color and I'll explain what it is and I'll explain also why it's actually pretty different from what people would have guessed. So the goal for Adept is we basically want to build an AI agent that can do, that can basically help humans do anything a human does on a computer. And so what that really means is we want this thing to be super good at turning natural language like goal specifications right into the correct set of end steps and then also have all the correct sensors and actuators to go get that thing done for you across any software tool that you already use. And so the end vision of this is effectively like I think in a couple of years everyone's going to have access to like an AI teammate that they can delegate arbitrary tasks to and then also be able to, you know, use it as a sounding board and just be way, way, way more productive. Right. And just changes the shape of every job from something where you're mostly doing execution to something where you're mostly actually doing like these core liberal arts skills of what should I be doing and why. Right. And I find this like really exciting and motivating because I think it's actually a pretty different vision for how AGI will play out. I think systems like Adept are the most likely systems to be proto-AGIs. But I think the ways in which we are really counterintuitive to everybody is that we've actually been really quiet because we are not a developer company. We don't sell APIs. We don't sell open source models. We also don't sell bottom up products. We're not a thing that you go and click and download the extension and like we want more users signing up for that thing. We're actually an enterprise company. So what we do is we work with a range of different companies, some like late stage multi-thousand people startups, some fortune 500s, et cetera. And what we do for them is we basically give them an out of the box solution where big complex workflows that their employees do every day could be delegated to the model. And so we look a little different from other companies in that in order to go build this full agent thing, the most important thing you got to get right is reliability. So initially zooming way back when, one of the first things that DEP did was we released this demo called Act One, right? Act One was like pretty cool. It's like kind of become a hello world thing for people to show agent demos by going to Redfin and asking to buy a house somewhere because like we did that in the original Act One demo and like showed that, showed like Google Sheets, all this other stuff. Over the last like year since that has come out, there's been a lot of really cool demos and you go play with them and you realize they work 60% of the time. But since we've always been focused on how do we build an amazing enterprise product, enterprises can't use anything that isn't in the nines of reliability. And so we've actually had to go down a slightly different tech tree than what you might find in the prompt engineering sort of plays in the agent space to get that reliability. And we've decided to prioritize reliability over all else. So like one of our use cases is crazy enough that it actually ends with a physical truck being sent to a place as the result of the agent workflow. And if you're like, if that works like 60% of the time, you're just blowing money and poor truck drivers going places.

Alessio [00:16:30]: Interesting. One of the, our investment teams has this idea of services as software. I'm actually giving a talk at NVIDIA GTC about this, but basically software as a service, you're wrapping user productivity in software with agents and services as software is replacing things that, you know, you would ask somebody to do and the software just does it for you. When you think about these use cases, do the users still go in and look at the agent kind of like doing the things and can intervene or like are they totally removed from them? Like the truck thing is like, does the truck just show up or are there people in the middle checking in?

David [00:17:04]: I think there's two current flaws in the framing for services as software, or I think what you just said. I think that one of them is like in our experience, as we've been rolling out Adept, the people who actually do the jobs are the most excited about it because they don't go from, I do this job to, I don't do this job. They go from, I do this job for everything, including the shitty rote stuff to I'm a supervisor. And I literally like, it's pretty magical when you watch the thing being used because now it parallelizes a bunch of the things that you had to do sequentially by hand as a human. And you can just click into any one of them and be like, Hey, I want to watch the trajectory that the agent went through to go solve this. And the nice thing about agent execution as opposed to like LLM generations is that a good chunk of the time when the agent fails to execute, it doesn't give you the wrong result. It just fails to execute. And the whole trajectory is just broken and dead and the agent knows it, right? So then those are the ones that the human then goes and solves. And so then they become a troubleshooter. They work on the more challenging stuff. They get way, way more stuff done and they're really excited about it. I think the second piece of it that we've found is our strategy as a company is to always be an augmentation company. And I think one out of principle, that's something we really care about. But two, actually, if you're framing yourself as an augmentation company, you're always going to live in a world where you're solving tasks that are a little too hard for what the model can do today and still needs a human to provide oversight, provide clarifications, provide human feedback. And that's how you build a data flywheel. That's how you actually learn from the smartest humans how to solve things models can't do today. And so I actually think that being an augmentation company forces you to go develop your core AI capabilities faster than someone who's saying, ah, okay, my job is to deliver you a lights off solution for X.

Alessio [00:18:42]: Yeah. It's interesting because we've seen two parts of the market. One is we have one company that does agents for SOC analysts. People just don't have them, you know, and they just cannot attract the talent to do it. And similarly, in software development, you have Copilot, which is the augmentation product, and then you have sweep.dev and you have these products which just do the whole thing. I'm really curious to see how that evolves. I agree that today the reliability is so important in the enterprise that they just don't use most of them. Yeah. Yeah. No, that's cool. But it's great to hear the story because I think from the outside, people are like, oh, Adept, they do Act One, they do Persimmon, they do Fuyu, they do all this stuff. Yeah, it's just the public stuff.

Swyx [00:19:20]: It's just public stuff.

David [00:19:21]: So one of the things we haven't shared before is we're completely sold out for Q1. And so I think...

Swyx [00:19:26]: Sold out of what?

David [00:19:27]: Sold out of bandwidth to go onboard more customers. And so we're like working really hard to go make that less of a bottleneck, but our expectation is that I think we're going to be significantly more public about the broader product shape and the new types of customers we want to attract later this year. So I think that clarification will happen by default.

Swyx [00:19:43]: Why have you become more public? You know, if the whole push has... You're sold out, you're mainly enterprise, but you're also clearly putting effort towards being more open or releasing more things.

David [00:19:53]: I think we just flipped over that way fairly recently. That's a good question. I think it actually boils down to two things. One, I think that, frankly, a big part of it is that the public narrative is really forming around agents as being the most important thing. And I'm really glad that's happening because when we started the company in January 2022, everybody in the field knew about the agents thing from RL, but the general public had no conception of what it was. They were still hanging their narrative hat on the tree of everything's a chatbot. And so I think now one of the things that I really care about is that when people think agent, they actually think the right thing. All sorts of different things are being called agents. Chatbots are being called agents. Things that make a function call are being called agents. To me, an agent is something that you can give a goal and get an end step workflow done correctly in the minimum number of steps. And so that's a big part of why. And I think the other part is because I think it's always good for people to be more aware of Adept as they think about what the next thing they want to do in their careers. The field is quickly pivoting in a world where foundation models are looking more and more commodity. And I think a huge amount of gain is going to happen from how do you use foundation models as the well-learned behavioral cloner to go solve agents. And I think people who want to do agents research should really come to Adept.

Swyx [00:21:00]: When you say agents have become more part of the public narrative, are there specific things that you point to? I'll name a few. Bill Gates in his blog post mentioning that agents are the future. I'm the guy who made OSes, and I think agents are the next thing. So Bill Gates, I'll call that out. And then maybe Sam Altman also saying that agents are the future for OpenAI.

David [00:21:17]: I think before that even, I think there was something like the New York Times, Cade Metz wrote a New York Times piece about it. Right now, in a bid to differentiate, I'm seeing AI startups that used to just brand themselves as an AI company, but now brand themselves as an AI agent company. It's just like, it's a term I just feel like people really want.

Swyx [00:21:31]: From the VC side, it's a bit mixed. Is it? As in like, I think there are a lot of VCs where like, I would not touch any agent startups because like- Why is that? Well, you tell me.

Alessio [00:21:41]: I think a lot of VCs that are maybe less technical don't understand the limitations of the-

Swyx [00:21:46]: No, that's not fair.

Alessio [00:21:47]: No, no, no, no. I think like- You think so? No, no. I think like the, what is possible today and like what is worth investing in, you know? And I think like, I mean, people look at you and say, well, these guys are building agents. They needed 400 million to do it. So a lot of VCs are maybe like, oh, I would rather invest in something that is tacking on AI to an existing thing, which is like easier to get to market and kind of get some of the flywheel going. But I'm also surprised a lot of founders just don't want to do agents. It's not even the funding. Sometimes we look around and it's like, why is nobody doing agents for X? Wow.

David [00:22:17]: That's good to know actually. I never knew that before. My sense from my limited perspective is there's a new agent company popping up every day.

Swyx [00:22:24]: So maybe I'm- They are. They are. But like I have advised people to take agents off of their title because it's so diluted.

David [00:22:31]: It's now so diluted.

Swyx [00:22:32]: Yeah. So then it doesn't stand for anything. Yeah.

David [00:22:35]: That's a really good point.

Swyx [00:22:36]: So like, you know, you're a portfolio allocator. You have people know about Persimmon, people know about Fuyu and Fuyu Heavy. Can you take us through like how you think about that evolution of that and what people should think about what that means for Adept and sort of research directions? Kind of take us through the stuff you shipped recently and how people should think about the trajectory of what you're doing.

David [00:22:56]: The critical path for Adept is we want to build agents that can do a higher and higher level abstraction things over time, all while keeping an insanely high reliability standard. Because that's what turns us from research into something that customers want. And if you build agents with really high reliability standard, but are continuing pushing a level of abstraction, you then learn from your users how to get that next level of abstraction faster. So that's how you actually build the data flow. That's the critical path for the company. Everything we do is in service of that. So if you go zoom way, way back to Act One days, right? Like the core thing behind Act One is can we teach a large model basically how to even actuate your computer? And I think we're one of the first places to have solved that and shown it and shown the generalization that you get when you give it various different workflows and texts. But I think from there on out, what we really realized was that in order to get reliability, companies just do things in various different ways. You actually want these models to be able to get a lot better at having some specification of some guardrails for what it actually should be doing. And I think in conjunction with that, a giant thing that was really necessary is really fast multimodal models that are really good at understanding knowledge work and really good at understanding screens. And that needs to kind of be the base for some of these agents. Back then we had to do a ton of research basically on how do we actually make that possible? Well, first off, like back in, I forget exactly what month in '23, like there were no multimodal models really that you could use for things like this. And so we pushed really hard on stuff like the Fuyu architecture. I think one big hangover of the primarily academic focus for multimodal models is most multimodal models are primarily trained on like natural images, cat and dog photos, stuff that's come out of the camera. COCO. Yeah, right. And COCO is awesome. Like I love COCO. I love TY. Like it's really helped the field. Right. But like that's the build one thing. I actually think it's really clear today. Multimodal models are the default foundation model, right? It's just going to supplant LLMs. Like you just train a giant multimodal model. And so for that though, like where are they going to be the most useful? They're going to be most useful in knowledge work tasks. That's where the majority of economic value is going to be. It's not in cats and dogs. Right. And so if that's what it is, what do you need to train? I need to train on like charts, graphs, tables, invoices, PDFs, receipts, unstructured data, UIs. That's just a totally different pre-training corpus. And so Adept spent a lot of time building that. And so the public Fuyus and stuff aren't trained on our actual corpus, they're trained on some other stuff. But you take a lot of that data and then you make it really fast and make it really good at things like dense OCR on screens. And then now you have the right like raw putty to go make a good agent. So that's kind of like some of the modeling side, we've kind of only announced some of that stuff. We haven't really announced much of the agents work, but if you put those together with the correct product form factor, and I think the product form factor also really matters. 
I think we're seeing, and you guys probably see this a little bit more than I do, but we're seeing like a little bit of a pushback against the tyranny of chatbots as form factor. And I think that the reason why the form factor matters is the form factor changes what data you collect in the human feedback loop. And so I think we've spent a lot of time doing full vertical integration of all these bits in order to get to where we are.

Swyx [00:25:44]: Yeah. I'll plug Amelia Wattenberger's talk at our conference, where she gave a little bit of the thinking behind like what else exists other than chatbots that if you could delegate to reliable agents, you could do. I was kind of excited at Adept experiments or Adept workflows, I don't know what the official name for it is. I was like, okay, like this is something I can use, but it seems like it's just an experiment for now. It's not your product.

David [00:26:06]: So you basically just use experiments as like a way to go push various ideas on the design side to some people and just be like, yeah, we'll play with it. Actually the experiments code base underpins the actual product, but it's just the code base itself is kind of like a skeleton for us to go deploy arbitrary cards on the side.

Swyx [00:26:22]: Yeah.

Alessio [00:26:23]: Makes sense. I was going to say, I would love to talk about the interaction layer. So you train a model to see UI, but then there's the question of how do you actually act on the UI? I think there were some rumors about OpenAI building agents that are kind of like, they manage the endpoint, so the whole computer; you're more at the browser level. I read in one of your papers, you have like a different representation, kind of like you don't just take the DOM and act on it. You do a lot more stuff. How do you think about the best way the models will interact with the software and like how the development of products is going to change with that in mind as more and more of the work is done by agents instead of people?

David [00:26:58]: This is, there's so much surface area here and it's actually one of the things I'm really excited about. And it's funny because I've spent most of my time doing research stuff, but there's like a whole new ball game that I've been learning about and I find it really cool. So I would say the best analogy I have to why Adept is pursuing a path of being able to use your computer like a human, plus of course being able to call APIs and being able to call APIs is the easy part, like being able to use your computer like a human is a hard part. It's in the same way why people are excited about humanoid robotics, right? In a world where you had T equals infinity, right? You're probably going to have various different form factors that robots could just be in and like all the specialization. But the fact is that humans live in a human environment. So having a human robot lets you do things that humans do without changing everything along the way. It's the same thing for software, right? If you go itemize out the number of things you want to do on your computer for which every step has an API, those numbers of workflows add up pretty close to zero. And so then many points along the way, you need the ability to actually control your computer like a human. It also lets you learn from human usage of computers as a source of training data that you don't get if you have to somehow figure out how every particular step needs to be some particular custom private API thing. And so I think this is actually the most practical path. I think because it's the most practical path, I think a lot of success will come from going down this path. I kind of think about this early days of the agent interaction layer level is a little bit like, do you all remember Windows 3.1? Like those days? Okay, this might be, I might be, I might be too old for you guys on this. But back in the day, Windows 3.1, we had this transition period between pure command line, right? Being the default into this new world where the GUI is the default and then you drop into the command line for like programmer things, right? The old way was you booted your computer up, DOS booted, and then it would give you the C colon slash thing. And you typed Windows and you hit enter, and then you got put into Windows. And then the GUI kind of became a layer above the command line. The same thing is going to happen with agent interfaces is like today we'll be having the GUI is like the base layer. And then the agent just controls the current GUI layer plus APIs. And in the future, as more and more trust is built towards agents and more and more things can be done by agents, if more UIs for agents are actually generative in and of themselves, then that just becomes a standard interaction layer. And if that becomes a standard interaction layer, what changes for software is that a lot of software is going to be either systems or record or like certain customized workflow execution engines. And a lot of how you actually do stuff will be controlled at the agent layer.

Alessio [00:29:19]: And you think the rabbit interface is more like it would like you're not actually seeing the app that the model interacts with. You're just saying, hey, I need to log this call on Salesforce. And you're never actually going on salesforce.com directly as the user. I can see that being a model.

David [00:29:33]: I think I don't know enough about what using Rabbit in real life will actually be like to comment on that particular thing. But I think the broader idea that, you know, you have a goal, right? The agent knows how to break your goal down into steps. The agent knows how to use the underlying software and systems of record to achieve that goal for you. The agent maybe presents you information in a custom way that's only relevant to your particular goal, all just really leads to a world where you don't really need to ever interface with the apps underneath unless you're a power user for some niche thing.

Swyx [00:30:03]: General question. So first of all, I think like the sort of input mode conversation. I wonder if you have any analogies that you like with self-driving, because I do think like there's a little bit of how the model should perceive the world. And you know, the primary split in self-driving is LiDAR versus camera. And I feel like most agent companies that I'm tracking are all moving towards camera approach, which is like the multimodal approach, you know, multimodal vision, very heavy vision, all the Fuyu stuff that you're doing. You're focusing on that, including charts and tables. And do you find that inspiration there from like the self-driving world? That's a good question.

David [00:30:37]: I think sometimes the most useful inspiration I've found from self-driving is the levels analogy. I think that's awesome. But I think that our number one goal is for agents not to look like self-driving. We want to minimize the chances that agents are sort of a thing that you just have to bang your head at for a long time to get to like two discontinuous milestones, which is basically what's happened in self-driving. We want to be living in a world where you have the data flywheel immediately, and that takes you all the way up to the top. But similarly, I mean, compared to self-driving, like two things that people really undervalue. One is that it's really easy to do the driving a car down Highway 101 on a sunny day demo. That actually doesn't prove anything anymore. And I think the second thing is that as a non-self-driving expert, I think one of the things that we believe really strongly is that everyone undervalues the importance of really good sensors and actuators. And actually a lot of what's helped us get a lot of reliability is a really strong focus on actually why does the model not do this thing? And a non-trivial amount of the time, the model doesn't actually do the thing because, if you're Wizard of Oz-ing it yourself, or if you have unreliable actuators, you can't do the thing. And so we've had to fix a lot of those problems.

Swyx [00:31:43]: I was slightly surprised just because I do generally consider the Waymos that we see all around San Francisco as the most, I guess, real case of agents that we have in very material ways.

David [00:31:55]: Oh, that's absolutely true. I think they've done an awesome job, but it has taken a long time for self-driving to mature from when it entered the consciousness and the driving down 101 on a sunny day moment happened to now. Right. So I want to see that more compressed.

Swyx [00:32:07]: And I mean, you know, Cruise, you know, RIP. And then one more thing, just going back to this reliability thing: something I have been holding in my head that I'm curious to get your commentary on is that I think there's a trade-off between reliability and generality, or I want to broaden reliability into just general production readiness and enterprise readiness at scale. Because beyond reliability, you also have cost, you have speed; speed is a huge emphasis for Adept. The tendency or the temptation is to reduce generality to improve reliability and to improve cost and speed. Do you perceive a trade-off? Do you have any insights that solve those trade-offs for you guys?

David [00:32:42]: There's definitely a trade-off if you're at the Pareto frontier, but I think a lot of folks aren't actually at the Pareto frontier. I think the way you get there is basically: how do you frame the fundamental agent problem in a way that just continues to benefit from data? I think one of the main ways of being able to solve that particular trade-off is you basically just want to formulate the problem such that every particular use case just looks like you collecting more data to go make that use case possible. I think that's how you really solve it. Then you get into the other problems like, okay, are you overfitting on these end use cases? You're not doing a thing where you're being super prescriptive about the end steps that the model can only do, for example.

Swyx [00:33:17]: Then the question becomes, do you have one house model that you can then customize for each customer and you're fine-tuning them on each customer's specific use case?

David [00:33:25]: Yeah. We're not sharing that.

Swyx [00:33:26]: You're not sharing that. It's tempting, but that doesn't look like AGI to me. You know what I mean? That is just you have a good base model and then you fine-tune it.

David [00:33:35]: For what it's worth, I think there's two paths to a lot more capability coming out of the models that we all are training these days. I think one path is you figure out how to spend compute and turn it into data. In that path, I consider search, RL, all the things that we all love in this era as part of that path, like self-play, all that stuff. The second path is how do you get super competent, high intelligence demonstrations from humans? I think the right way to move forward is you kind of want to combine the two. The first one gives you maximum sample efficiency for a little second, but I think that it's going to be hard to be running at max speed towards AGI without actually solving a bit of both.

Swyx [00:34:16]: You haven't talked much about synthetic data, as far as I can tell. Probably this is a bit too much of a trend right now, but any insights on using synthetic data to augment the expensive human data?

David [00:34:26]: The best part about framing AGI as being able to help people do things on computers is you have an environment.

Swyx [00:34:31]: Yes. So you can simulate all of it.

David [00:34:35]: You can do a lot of stuff when you have an environment.

Alessio [00:34:37]: We were having dinner for our one-year anniversary. Congrats. Yeah. Thank you. Raza from HumanLoop was there, and we mentioned you were coming on the pod. This is our first-

Swyx [00:34:45]: So he submitted a question.

Alessio [00:34:46]: Yeah, this is our first, I guess, mailbag question. He asked: when you started, GPT-4 didn't exist, and now you have GPT-4 vision to help you build a lot of those things. How do you think about the things that are unique to you as Adept, and, going back to maybe the research direction that you want to take the team and what you want people to come work on at Adept, versus what has maybe now become commoditized that you didn't expect everybody would have access to?

David [00:35:11]: Yeah, that's a really good question. I think implicit in that question, and I wish he were here too so he could push back on my assumption about his question, but I think implicit in that question is a calculus of where advantage accrues in the overall ML stack. And maybe part of the assumption is that advantage accrues solely to base model scaling. But I actually believe pretty strongly that the way that you really win is that you have to go build an agent stack that is much more than the base model itself. And so I think that is always going to be a giant advantage of vertical integration. It lets us do things like have a really, really fast base model that is really good at agent things but is bad at cat and dog photos. Well, it's pretty good at cat and dog photos, it's just not SOTA at cat and dog photos, right? So we're allocating our capacity wisely, right? That's one thing that you really get to do. I also think that the other thing that is pretty important now in the broader foundation modeling space is, despite any potential concerns about how good agents are as a startup area, right, like we were talking about earlier, I feel super good that we're doing foundation models in service of agents and all of the reward within Adept is flowing from: can we make a better agent? Because right now I think we all see that, you know, if you're training on publicly available web data, you put in the flops and you do reasonable things, then you get decent results. And if you just double the amount of compute, then you get predictably better results. And so I think pure play foundation model companies are just going to be pinched by how good the next couple of Llamas are going to be and the next good open source thing, and then by seeing the really big players put ridiculous amounts of compute behind just training these base foundation models. I think that is going to commoditize a lot of the regular LLMs and soon regular multimodal models. So I feel really good that we're just focused on agents.

Swyx [00:36:56]: So you don't consider yourself a pure play foundation model company?

David [00:36:59]: No, because if we were a pure play foundation model company, we would be training general foundation models that do summarization and all this other...

Swyx [00:37:06]: You're dedicated towards the agent. Yeah.

David [00:37:09]: And our business is an agent business. We're not here to sell you tokens, right? And I think like selling tokens, unless there's like a...

Swyx [00:37:14]: Not here to sell you tokens. I love it.

David [00:37:16]: It's like if you have a particular area of specialty, right? Then you won't get caught in the fact that everyone's just scaling to ridiculous levels of compute. But if you don't have a specialty, I find that, I think it's going to be a little tougher.

Swyx [00:37:27]: Interesting. Are you interested in robotics at all? Just a...

David [00:37:30]: I'm personally fascinated by robotics. I've always loved robotics.

Swyx [00:37:33]: Embodied agents as a business, you know; Figure is a big, also sort of OpenAI-affiliated company that has raised a lot of money.

David [00:37:39]: I think it's cool. I think, I mean, I don't know exactly what they're doing, but...

Swyx [00:37:44]: Robots. Yeah.

David [00:37:46]: Well, I mean, that's a...

Swyx [00:37:47]: Yeah. What question would you ask? If we had them on, what would you ask them?

David [00:37:50]: Oh, I just want to understand what their overall strategy is going to be between now and when there's reliable stuff to be deployed. But honestly, I just don't know enough about it.

Swyx [00:37:57]: And if I told you, hey, fire your entire warehouse workforce and, you know, put robots in there, isn't that a strategy? Oh yeah.

David [00:38:04]: Yeah. Sorry. I'm not questioning whether they're doing smart things. I genuinely don't know what they're doing as much, but I think there's two things. One, I'm so excited for someone to train a foundation model of robots. It's just, I think it's just going to work. I will die on this hill. I mean, this whole time we've been on this podcast, we've just been continually saying these models are basically behavioral cloners. Right? So let's go behavioral clone all this robot behavior. Right? And then you figure out everything else you have to do in order to teach it how to solve a new problem. That's going to work. I'm super stoked for that. I think unlike what we're doing with helping humans with knowledge work, it just sounds like a more zero-sum job replacement play. Right? And I'm personally less excited about that.

Alessio [00:38:46]: We had Kanjun Qiu from Imbue on the podcast. We asked her why people should go work there and not at Adept.

Swyx [00:38:52]: Oh, that's so funny.

Alessio [00:38:54]: Well, she said, you know, there's space for everybody in this market. We're all doing interesting work. And she said they're really excited about building an operating system for agents, and for her, the biggest research thing was getting models better at reasoning and planning for these agents. The reverse question to you: why should people be excited to come work at Adept instead of Imbue? And maybe what are the core research questions that people should be passionate about to have fun at Adept? Yeah.

David [00:39:22]: First off, I think that, and I'm sure you guys believe this too, the AI space, to the extent there's an AI space, and the AI agent space are both, exactly as she likely said, colossal opportunities, and people are just going to end up winning in different areas and a lot of companies are going to do well. So I really don't feel zero-sum about that at all. But to change the zero-sum framing of why should you be at Adept: I think there are two huge reasons to be at Adept. One of them is everything we do is in the service of useful agents. We're not a research lab. We do a lot of research in service of that goal, but we don't think about ourselves as a classic research lab at all. And the second reason to work at Adept is, if you believe that actually having customers and a reward signal from customers lets you build AGI faster, which we really believe, then you should come here. And I think the example of why that's true is, for example, our evaluations. They're not academic evals. They're not simulator evals. They're like, okay, we have a customer that really needs us to do these particular things. We can do some of them, and these are the ones they want us to do that we can't do at all. We've turned those into evals: solve them, right? I think that's really cool. Everybody knows a lot of these evals are pretty saturated, and for the new ones that aren't saturated, you look at one and you're like, is this actually useful? Right? I think that's a degree of practicality that really helps. We're equally excited about the same problems around reasoning and planning and generalization and all of this stuff, but they're very grounded in actual needs right now, which is really cool.

Swyx [00:40:45]: Yeah. This has been a wonderful dive. You know, I wish we had more time, but I would just leave it kind of open to you. I think you have broad thoughts, you know, about the agent space, but also just the general AI space. Any rants or things that are just top of mind for you right now?

David [00:40:57]: Any rants?

Swyx [00:40:59]: Mining you for just general...

David [00:41:01]: Wow. Okay. So Amelia has already made the rant better than I have, but "not just chatbots" is kind of rant one. And two is that AI has really been the story of compute, and compute plus data, and the ways in which you could trade one for the other. And I think as much as our research community is really smart, we have made many, many advancements and that's going to continue to be important. But now I think the game is increasingly changing and the rapid industrialization era has begun. And I think we unfortunately have to embrace it.

Swyx [00:41:30]: Yep.

Alessio [00:41:31]: Excellent. Awesome, David. Thank you so much for your time.

David [00:41:34]: Cool. Thanks guys.



Get full access to Latent Space at www.latent.space/subscribe
2024-03-22

Making Transformers Sing - with Mikey Shulman of Suno

Giving computers a voice has always been at the center of sci-fi movies; "I'm sorry Dave, I'm afraid I can't do that" wouldn't hit as hard if it just appeared on screen as a terminal output, after all. The first electronic speech synthesizer, the Voder, was built at Bell Labs 85 years ago (1939!), and it's… something:

We will not cover the history of Text To Speech (TTS), but the evolution of the underlying architecture has generally been Formant Synthesis → Concatenative Synthesis → Neural Networks. Nowadays, state-of-the-art TTS is just one API call away with models like Eleven Labs and OpenAI's TTS, or products like Descript. Latency is minimal, intonation is very good, and they can mimic a variety of accents. You can hack together your own voice AI therapist in a day!
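
To make "one API call away" concrete, here is a minimal sketch using the OpenAI Python SDK; treat the exact model and voice names, and the response handling, as assumptions to check against the current docs:

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One call: text in, spoken audio out.
response = client.audio.speech.create(
    model="tts-1",    # assumed TTS model name; see OpenAI's audio docs
    voice="alloy",    # one of the built-in voices
    input="I'm sorry Dave, I'm afraid I can't do that.",
)

# The response wraps the raw audio bytes (MP3 by default).
with open("hal.mp3", "wb") as f:
    f.write(response.content)
```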

But once you have a computer that can communicate via voice, what comes next? Singing, of course!

From Barking to Singing

Today's guest is Suno's CEO and co-founder Mikey Shulman. He and his three co-founders, Georg, Martin, and Keenan, previously worked together at Kensho. One of their projects was financially-focused speech recognition (think earnings calls, etc), but all four of them happened to be musicians and audiophiles. They started playing around with text to speech + AI + audio generation and eventually left Kensho to work on it full time.

A lot of people when we started a company told us to focus on speech. If we wanted to build an audio company, everyone said, speech is a bigger market. But I think there's something about music that's just so human and you almost couldn't prevent us from doing it. Like we just couldn't keep ourselves from building music models and playing with them because it was so much fun.

Their first big product was Bark, the first open source transformer-based "text-to-audio" model (architecturally inspired by Karpathy's nanoGPT) that went from 0 to ~19,000 GitHub stars in a month. At the time they felt like audio was years behind text and image as a generation modality; unlike its predecessors, Bark could not only generate speech, but also music and sound effects like crying, laughing, sighing, etc. You can find a few examples here.
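
For the curious, Bark is open source and pip-installable; here is a minimal sketch based on the project's README (treat the install path and the exact prompt cues as assumptions that may have changed since):

```python
# pip install git+https://github.com/suno-ai/bark.git
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

preload_models()  # downloads and caches the underlying models on first run

# Bark is text-to-audio, not just text-to-speech: non-speech cues like
# [laughs] or ♪ ... ♪ (for singing) can be embedded directly in the prompt.
prompt = "Hello Latent Space listeners! [laughs] ♪ and now I am singing ♪"
audio_array = generate_audio(prompt)

write_wav("bark_demo.wav", SAMPLE_RATE, audio_array)
```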

The main limitation they saw was that text to speech training data is extremely limited. So what they did instead was build a new type of foundation model from scratch, trained on audio, and then tweak it to do text to speech. Turning audio into tokens to do self-supervised learning was the most important innovation. Unlike TTS models, which are very narrow (and often sound unnatural), Bark was trained on real audio of real people from broad contexts, which made it harder to output unnatural-sounding speech.

As Bark got popular, more and more people started using it to generate music and it became clear that their architecture would work to generate music that people enjoyed, even though it might not be "on the AGI path" of other labs:

Everybody is so focused on LLMs, for good reason, and information processing and intelligence there. And I think it's way too easy to forget that there's this whole other side of things that makes people feel, and maybe that market is smaller, but it makes people feel and it makes us really happy.

Suno bursts on the scene

In December 2023, Suno went viral with a gorgeous new website and launch tweet:

And rave reviews:

Music is core to our culture, but very few people are able to create it; Mikey and team want to make everyone an active participant in music making, not just a listener. A "Midjourney of Music", if you like.

We definitely had a lot of fun playing with Suno to generate all sorts of Latent Space jingles and songs; the product is live at suno.ai if you want to get in the studio yourself!

If Nas joined Latent Space instead of The Firm:

182B models > Blink-182

The soundtrack of the post-scarcity Latent Space ranch

Scaling with Modal

Given the December launch, scaling up for the Christmas rush was a major concern. This will be a nice tie-in for loyal listeners - Suno runs on Modal (one of our featured guests from Compute Month)!

Suno V3

For those who want to appreciate someone special in their life, you can always try Suno's special Valentine's Day experience:

We preview this on the pod, but Suno has now officially shipped a V3 Alpha with a wealth of improvements:

and you'll have to click through to their demos or user reviews to see:

We've recently become paying customers ourselves, and are having loads of fun generating music. If you have any of your own generations to share, tag @latentspacepod on Twitter or swing by the LS Discord!

The AudioGen Landscape

Mikey breaks down the landscape into 3 big categories: music, speech and sound effects (SFX). These look more like Venn diagrams than MECE categories.

Suno is the latest entry in a long series of audio generation efforts that combine both music and speech, reaching as far back as TensorFlow Magenta (we aren't aware of prior AI music projects, please comment below if you can find a good timeline we can use with attribution!). Other efforts like Seamless blend translation and speech generation, and Audiobox combines speech and SFX. We've yet to see "one model to rule them all" but surely it will happen, and probably Transformers (perhaps Diffusion Transformers) will be at the heart of them.

Show Notes

* Suno

* Bark

* Parakeet

* Mikey Shulman

* Goodhart Strikes Again

* Mastering the Two Halves of your brain

* NanoGPT repo

* "Return to Monkey"

Timestamps

* [00:00:00] Introduction

* [00:01:44] State of Music Generation Models

* [00:06:47] AI Data Wars & Copyright

* [00:10:32] Going from ML in finance to music generation

* [00:12:30] Suno's TTS origins with Bark and Parakeet

* [00:16:25] Easy vs Expert mode for music

* [00:21:44] The Midjourney of Music?

* [00:23:43] Live demo

* [00:36:00] Remaking vs Creating

* [00:38:12] Suno's direction

* [00:41:52] Beyond single track generation

* [00:43:53] Favorite Suno usage in the wild

* [00:46:00] The 2 mins overview of the audio generation space

* [00:48:42] Benchmarking AI

Transcription

Alessio [00:00:01]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai.

Swyx [00:00:10]: Hey, and today we are in the remote studio with Mikey Shulman. Welcome.

Mikey [00:00:16]: Thank you.

Swyx [00:00:17]: It's great to be here. So I'd like to go over people's background on LinkedIn and then maybe find out a little bit more outside of LinkedIn. You did your bachelor's in physics and then a PhD in physics as well, before going into Kensho Technologies, the home of a lot of top AI startups, it seems like, where you were head of machine learning for seven years. You were also a lecturer at MIT, which we can talk about. And then about two years ago, you left to start Suno, which has recently burst onto the scene as one of the top music generation startups. So we can go over that bio, but also, I guess, what's not in your LinkedIn that people should know about you?

Mikey [00:01:06]: I love music. I am an aspiring mediocre musician. I wish I were better, but that doesn't make me not enjoy playing real music. And I also love coffee. I'm probably way too much into coffee.

Alessio [00:01:19]: Are you one of those people that, you know, they do the TikToks, they use like 50 tools to like grind the beans and then like brush them and then like spray them. Like what level are we talking about here?

Mikey [00:01:31]: I confess there's a spray bottle for beans in the next room, there is one of those weird comb tools, so guilty. I don't put it on TikTok though.

Alessio [00:01:42]: Yeah, no, no. Some things gotta stay private.

Mikey [00:01:46]: I played a lot of piano growing up and I play bass and I, in a very mediocre way, play guitar and drums. Yeah. Right.

Alessio [00:01:55]: That's a lot. I cannot do any of those things. As Sean mentioned, you guys kind of burst into the scene as maybe the state of the art music generation company. I think it's a model that we haven't really covered in the past. So I would love to maybe for you to just give a brief intro of like how do you do music generation and why is it possible? Because I think people understand you take text and you have to predict the next word and you take a diffusion model and you basically like add noise to an image and then kind of remove the noise. But I think for music, it's hard for people to have a mental model. Like what's the, how do you turn a music model on? Like what does a music model do to generate a song? So maybe we can start there.

Mikey [00:02:41]: Yeah. Maybe I'll even take one more step back and say it's not even entirely worked out the way I think it is in text. And so it's an evolving field. If you take a giant step back, I think audio has been lagging images and text for a while. So very roughly you can think of audio as one to two years behind images and text. But you kind of have to think of audio today like text was in 2022 or something like this: the transformer was invented, it looks like it works, but it's far, far less established. And so I'll give you the way we think about the world now, but just with the big caveat that I'm probably wrong if we look back a couple of years from now. And I think the biggest thing is you see both transformer-based and diffusion-based models for audio, in ways that are not true in text. I know people will do some diffusion for text, but I think nobody's really doing that for real. And so we prefer transformers for a variety of reasons. And so you can think it's very similar to text. You have some abstract notion of a token and you train a model to predict the probability distribution over the next token. So it's a language model. A language model, in anything, is just something that assigns likelihoods to sequences of tokens. Sometimes those tokens correspond to text. In our case, they correspond to music or audio in general. And I think we've learned a lot from our friends in the text domain, from the pioneers doing this, about how well these transformer models work, where they work, and where they don't. But at its core, the way we like to do things with transformers is exactly like it works in text: let me predict the next tiny little bit of audio, and I can just keep doing that and doing that and generating audio as long as I want.
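
To make the "language model over audio tokens" framing concrete, here is a minimal sketch of the decoding loop, assuming you already have an audio tokenizer (a neural codec) and a trained transformer; everything named here is an illustrative stand-in, not Suno's actual stack:

```python
import torch

@torch.no_grad()
def generate_audio_tokens(model, codec, conditioning_tokens, max_new_tokens=3000, temperature=1.0):
    """Autoregressively sample discrete audio tokens, then decode them into a waveform.

    `model` maps a token sequence to next-token logits (GPT-style) and
    `codec` turns token ids back into audio samples; both are assumed,
    stand-in components for the purposes of this sketch.
    """
    tokens = conditioning_tokens  # e.g. encoded style/lyrics prompt, shape (1, T)
    for _ in range(max_new_tokens):
        logits = model(tokens)[:, -1, :]                 # logits for the next audio token
        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)  # append and continue, just like text
    return codec.decode(tokens)                          # tokens -> waveform
```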

Swyx [00:04:39]: Yeah. I think the, the temptation here is to always try to bake in some specialized knowledge about music or audio. And so, and obviously you will get an improvement in, in your output. If you try to just say like, okay, like here's a set of notes for, you know, here's a set of tokens that only do jazz or only do, you know, like voices. How general do you make it versus how specific do you make it?

Mikey [00:05:10]: We've always tried to do things, you know, quote unquote the right way, which means that at the beginning things are going to be hard and worse than other ways. But that is to say, bake in as little kind of implicit knowledge as possible. And so, the same way you don't program into GPT, you don't say this is a noun and this is a verb, but it has implicitly learned all of those things. I've never seen GPT accidentally, you know, put a, put a noun where it meant to put an article in English. We try not to impose anything about music or audio in general into the model, and we kind of let the models learn things by themselves. And I think things are beginning to pay off, but it's, you know, it's not necessarily obvious from the beginning that that was the right thing to do. So, for example, you know, you could take something like text to speech and people will do all sorts of things where you can program in things like phonemes to be the basis for what you do. And then that kind of limits you to the set of things that are expressible by phonemes. And so, ultimately that works really well in the short term. In the long term, it can be quite limiting. And so, our approach has always been to try to do this in its full generality, as end to end as we can do it. Even if it means that in the short term we were a little bit worse, we have a lot of confidence that in the long term that will be the right way to do it.

Alessio [00:06:33]: And what's the data recipe for training a good music model? Like what percentage of each genre do you put in, and do you split vocals and instrumentals?

Mikey [00:06:43]: So you have to do lots of things. And I think this is the biggest area where we have, you know, sort of our secret sauce. I think to a large extent, what we do is we benefit from all of the beautiful things people do with transformers and text. And we focus very hard basically on how do I tokenize audio in the right way. And without divulging too much secret sauce, it's at least similar to how it's done in sort of the open source stuff. You will have different models that learn to encode audio in discrete representations. And a lot of this boils down to figuring out the right, let's say, implicit biases to put in those models, the right data to inject. How do I make sure that I can produce kind of all audio arbitrarily? That's speech, that's background music, that's vocals, that's kind of everything to make sure that I can really capture all the behavior that I want to.
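
The open-source analogue of "encode audio in discrete representations" is a neural codec with vector quantization (SoundStream/EnCodec-style): an encoder compresses the waveform into latent frames, each frame is snapped to its nearest codebook entry, and the resulting integer IDs become the "audio tokens" a transformer can model. Below is a toy sketch of just the quantization step, purely illustrative and not Suno's tokenizer:

```python
import torch

def quantize_frames(latents: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map encoder latents to discrete token ids via nearest codebook entry.

    latents:  (batch, frames, dim)  output of some audio encoder (assumed)
    codebook: (vocab_size, dim)     learned embedding table
    returns:  (batch, frames)       integer audio tokens
    """
    batch = latents.size(0)
    # L2 distance from every frame to every codebook vector, then take the argmin.
    dists = torch.cdist(latents, codebook.unsqueeze(0).expand(batch, -1, -1))
    return dists.argmin(dim=-1)

# Example: 2 seconds of audio at a 50 Hz frame rate becomes 100 tokens per item.
latents = torch.randn(1, 100, 128)
codebook = torch.randn(1024, 128)
print(quantize_frames(latents, codebook).shape)  # torch.Size([1, 100])
```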

Alessio [00:07:40]: Yeah, that makes sense. And then in terms of some of... We had our monthly recap last month, and the data wars were kind of one of the hot topics. You saw the New York Times lawsuit against OpenAI, because you have obviously large language models in production. You don't have large music models in production. So I think there's maybe been less of a trade there, so to speak. How do you kind of think about that? There's obviously a lot of copyright-free, royalty-free music out there. Is there any kind of power law in terms of like, hey, the best music is actually much better to train on, or in music does it not really matter because the structure of some of the musical structure is kind of the same?

Mikey [00:08:27]: I don't think we know these things nearly as well as they're known in text. We have some notions of some of the scaling laws here, but I think, yeah, we're just so, so far behind. You know, what I will say is that people are always surprised to learn that we don't only train on music. And I usually give the analogy of some of the code generation models, so take something like Code Llama, which is, as far as I know, the best open source code generating model. You guys would know better than I would. It's certainly up there. And it's trained on a bunch of English, not only just code. And it's because there are patterns in English that are going to be useful. And so, you can imagine, you don't only want to train on music to get good music models. And so, for example, one of the places that we are particularly bad is vocals and capturing really realistic vocals. And so, you might imagine that there's other types of human vocals that you can put into your model that are not music that will help it learn stuff. And so, again, I think it's like super, super early. I think we've barely scratched the surface of what are the right ways to do this. And that's really cool. From a progress perspective, there's like a lot of low-hanging fruit for us to still pick.

Alessio [00:09:42]: And then, once you get the final model, I would love to learn more about the size of these models. Because people are confused when Stable Diffusion is so small. They're like, oh, this thing can generate any image, how is it possible that it's only a couple of gigabytes? And then the large language models are like, oh, these are so big, but they just have text in them. What's it like for music? Is it in between? And as you think about, yeah, you mentioned scaling and whatnot, is this something that you see as kind of easy for people to run locally or not?

Mikey [00:10:11]: Our models are still pretty small, certainly by tech standards. I confess I don't know as well the state of the art on how diffusion models scale. But our models scale similarly to text transformers: bigger is usually better. Audio has a couple of weird quirks, though. We care a lot about how many tokens per second we can generate, because we need to stream you music as fast as you can listen to it. And so that is a big one that I think probably has us never get to a 175 billion parameter model, if I'm being honest. Maybe I'm wrong there, but I think that would be technologically difficult. And then the other thing is that so much progress happens in shrinking models down for the same performance in text that I'm hopeful, at least, that a lot of our issues will get solved and we will figure out how to do better things with smaller models, or relatively smaller models. But I think the other thing, it's a blessing and a curse, is the ability to add performance with scale. It's a very straightforward way to make your models better: you just make a bigger model, dump more compute into it. But it's also a curse because that is a crutch that you will always lean on, and you will forget to do some of the basic research to make your stuff better. And honestly, early on, when we were doing stuff with small models for kind of time and compute constraints, we ended up having to learn a lot of stuff to make models better that we might not have learned if we had immediately jumped to a really, really big model. So I think for us, we've always tried to skew smaller to the extent possible.
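
As a rough back-of-the-envelope illustration of why tokens per second matters for streaming, here is a tiny calculation; the frame rate and codebook count are assumptions in the range of public neural codecs, not Suno's numbers:

```python
# Hypothetical codec: 50 frames of audio per second, 4 codebooks per frame.
frame_rate_hz = 50
codebooks_per_frame = 4
tokens_per_audio_second = frame_rate_hz * codebooks_per_frame  # 200 tokens of audio per second

# To stream in real time, the model must generate at least that many tokens
# per wall-clock second, plus headroom for networking and decoding.
model_tokens_per_second = 350  # assumed measured generation throughput
realtime_factor = model_tokens_per_second / tokens_per_audio_second
print(f"{realtime_factor:.2f}x real time")  # above 1.0 means you can stream as the user listens
```

Bigger models generate fewer tokens per second on the same hardware, which is the quirk being pointed at here: past some size, you can no longer keep the audio buffer ahead of the listener.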

Swyx [00:11:56]: Yeah, gotcha. I'm curious about just sort of your overall evolution so far, something I think we may have missed in the introduction is why did you end up choosing just the music domain in the first place? You have this pretty scientific physics and finance background. How did you wander over to music? Like a lot of us have interest in music, but we don't necessarily choose to work in it. But you did.

Mikey [00:12:26]: Yeah, it's funny. I have a really fun job as a result, but all the co-founders of Suno worked at Kensho together and we were doing mostly text. In fact, all text, until we did one audio project that was very financially focused speech recognition. And I think the long and short of it is we kind of fell in love with audio, not necessarily music, just audio and AI. We all happen to be musicians and audiophiles and music lovers, but it was the combination of audio and AI that we initially really, really fell in love with. It's so cool. It's so interesting. It's so human. It's so far behind images and text that there's so much more to do. And honestly, I think a lot of people, when we started a company, told us to focus on speech. If we wanted to build an audio company, everyone said, you know, speech is a bigger market. But I think there's something about music that's just so human that you almost couldn't prevent us from doing it. We just couldn't keep ourselves from building music models and playing with them because it was so much fun. And that's kind of what steered us there. You know, in fact, the first thing we ever put out was a speech model. It was Bark. It's this open source text-to-speech model, and it got a lot of stars on GitHub. And that was people telling us even more, like, go do speech. And we almost couldn't help ourselves from doing music. And so, I don't know, maybe it's a little bit serendipitous, but we haven't really looked back since. I don't think there was necessarily an aha moment. It was just organic and just obvious to us that we wanted to make a music company.

Swyx [00:14:19]: So, so you do regard yourself as a music company because as of last month, you're still releasing speech models. We were? Parakeet.

Mikey [00:14:27]: Oh, yes, that's right. So that's a really awesome collaboration with our friends at NVIDIA. I think we are really, really focused on music. I think that is the stuff that will really change things for the better. I think, you know, honestly, everybody is so focused on LLMs for good reason, and information processing and intelligence there. And I think it's way too easy to forget that there's this whole other side of things that makes people feel. And maybe that market is smaller, but it makes people feel and it makes us really happy. And so we do it. I think that doesn't mean that we can't be doing things that are related, that are in our wheelhouse, that will improve things. And so, like I said, audio is just so far behind. There's just so much more to do in the domain more generally. And so that's a really fun collaboration.

Swyx [00:15:20]: Yeah, I did hear about Suno first through Bark. My sense is that, like, what did what did Bark lean off of like, because obviously, I think there was a lot of preceding TTS work that was in open source. How much of that did you use? How much of that was like, sort of brand new from your research? What's the intellectual lineage there just to cover out the speech recognition side?

Mikey [00:15:46]: So it's not speech recognition. It's text to speech. But as far as I know, there was no other text to speech, certainly not in the open source, that was transformer-based. Everything else was what I would call the old style of doing things, where you build these kind of single purpose models that are really good at this one narrow task. And you're kind of always data limited, and the availability of high quality training data for text to speech is limited. And I don't think we were necessarily all that inventive to say we're going to train, in a self-supervised way, a transformer-based model on lots of audio, and then tweak it so that we can do text to speech based on that. That would be kind of the new way of doing things; foundation model is the buzzword, if you will. And so, you know, we built that up, I think, from scratch. A lot of shout outs have to go to lots of different things, whether it's papers, but also, very obviously, there's a big shout out to Andrej Karpathy's nanoGPT. There's a lot of code borrowed from there. I think we are huge fans of that project. It just shows people that you don't have to be afraid of GPT-type things, and that it's actually not all that much code to make performant transformer-based models. And, you know, again, the stuff that we brought there was how do we turn audio into tokens, and then we could kind of take everything else from the open source. So we put that model out. And we were, I think, pleasantly surprised by the reception from the community. It got a good number of GitHub stars, and people really enjoyed playing with it, because it made really realistic sounding audio. And I think this is, again, the thing about doing things in a quote, unquote, right way. If you have a model where you've had to put in so much implicit bias for this one very narrow task of making speech that sounds like words, you're going to sacrifice on other things. And in the text to speech case, it's how natural the speech sounds. And it was almost difficult to pull unnatural sounding speech out of Bark, because it was self-supervised, trained on a lot of natural sounding speech. And so that definitely told us that this is probably the right way to keep doing audio.

Swyx [00:18:04]: Even in Bark, you had the beginnings of music generation, like you could just put like a music note in there. That's right.

Mikey [00:18:10]: And it was so cool to see on our Discord, people were trying to pull music out of a text to speech model. And so, you know, what did this tell us? This tells us, like, people are hungry to make music. And it's almost obvious in hindsight how wired humans are to make music. If you've ever seen a little kid, you know, sing before they know how to speak, it's like, this is really human nature. And there's actually a lot of cultural forces that kind of cue you to not think to make music. And that's kind of what we're trying to undo.

Alessio [00:18:42]: And to dive into Suno itself, I think, especially when you go from text to speech, people are like, okay, now I've got to write the lyrics to a whole song. That's quite hard to do. Versus in Suno, you have this empty box, very Midjourney-like, kind of DALL·E-like, where you can just express the vibes, you know, of what you want it to be. But then you also have a custom mode where you can set your own lyrics, you can set your own rhythm, you can set the title of the song and whatnot. How do you see users distribute themselves? You know, I'm guessing a lot of people use the easy mode. Are you seeing a lot of power users using the custom mode, and maybe what are some of the favorite use cases that you've seen so far on Suno?

Mikey [00:19:23]: Yeah, actually, more than half of the usage is that expert mode. And people really like to get into it and start tweaking things and adding things and playing with words or line breaks or different ad lib. And people really love it. It's really fun. So, I think, you know, there's kind of two modes that you can access now. One is that single box where you kind of just describe something and then the other is the expert mode. And those kind of fit nicely into two use cases. The first use case is what we call nice s**t posting. And it's basically like something funny happened and I'm just going to very quickly make a song about it. And the example I'll usually give is like, I walk into Starbucks with one of my co-founders. He gives his name Martin, his coffee comes out with the name Margoo, and I can in five seconds make a song about this and it has immortalized it. And that Margoo song is stuck in all of our heads now. And it's like funny and light and there's levity that you've brought to that moment. And the other is that you got just sucked into, I need, there's this song that's in my head and I need to get it out and I'm going to keep tweaking it and listening and having ideas and tweaking it until I get the song that I want. Those are very different use cases, but I think ultimately there's so much in between these two things that it's just totally untapped how people want to experience the joys of making music. Because those two experiences are both really joyful in their own special ways. And so, we are quite certain that there's a lot in the middle there. And then I think the last thing I'll say there that's really interesting is in both of those use cases, the sharing dynamics around music are like really interesting and totally unexplored. And I think an interesting comparison would be images. Like we've probably all in the last 24 hours taken a picture and texted it to somebody. And most people are not routinely making a little song and texting it to somebody. But when you start to make that more accessible to people, they are going to share music in much smaller groups, maybe even not in all, but like with one person or three people or five people. And those dynamics are so interesting. And just I think we have ideas of where that goes. But it's about kind of spreading joy into these like little, you know, microcosms of humanity that people really love it. So, I know I made you guys a little Valentine song, right? Like, that's not something that happens now because it's hard to make songs for people. Right. Well, we'll put that in the in the audio in here, but also tweeted it out if people

Alessio [00:22:03]: want to look it up. How do you think about the pro market, so to speak? Because I think lowering the barrier to some of these things is great. And I think when the iPad came out, music production was one of the areas that people thought, OK, now you can have this like, you know, board that you can bring with you. And Madlib actually produced this whole album with him and Freddie Gibbs produced the whole thing on an iPad. He never used a computer. How do you see like these models playing into like professional music generation? I guess that's also a funny word is like, what's professional music? It's like it's all music. If it's good, it becomes professional. If it's good.

Swyx [00:22:40]: Right.

Alessio [00:22:40]: But curious to see to hear how you're thinking about Suno, too. Like, is there a second act of Suno that is like going broader into the music industry? Going broader into like the custom mode and making making this the central hub for music generation?

Mikey [00:22:55]: I think we intend to make many more modes of interaction with our stuff, but we are very much not focused on, quote unquote, professionals right now. And it's because what we're trying to do is change how most people interact with music and not necessarily make professionals a little bit better, a little bit faster. It's not that there's anything wrong with that. It's just like not what we're focused on. And I think when we think about what workflows does the average person want to use to make music, I don't think they're very similar to the way professional musicians make music now. Like, if you pick a random person on the street and you play them a song and then you say, like, what did you want to change about that? They're not going to say, like, you need to split out the snare drum and make it drier. Like, that's just not something that a random person off the street is going to say. They're going to give a lot more descriptive things about the thing, about the kind of the oeuvre of the song, like something more general. And so, I don't think we know what all of the workflows are that people are going to want to use. We're just, like, fairly certain that the workflows that have been developed with the current set of technologies that professionals use to make beautiful music are probably not what the average person wants to use. That said, there are lots of professionals that we know about using our stuff, whether it's for inspiration or sample generation and stuff like that. So, I don't want to say never say never. Like, there may one day be a really interesting set of use cases that we can expose to professionals, particularly around, I think, like custom models trained on custom people's music or, you know, with your voice or something like that. But the way we think about broadening how most people are interacting with music and getting it to be much more active, a much more active participant, we think about broadening it from the consumer side and not broadening it from the producers, from the professional side, if that makes sense.

Swyx [00:24:53]: Is the dream here to be, you know, I don't know if it's too coarse of a grain to put it, but, like, is the dream here to be, like, the Midjourney of music?

Mikey [00:25:04]: I think there are certainly some parallels there because, especially with what I just said about being an active participant, the joyful experience in Midjourney is the act of creating the image and not necessarily the act of consuming the image. And Midjourney will let you then very quickly share the image with somebody. But I think, ultimately, that analogy is somewhat limiting because there's something really special about music. I think there's two things. One is that there's this really big gap for the average person between their taste in music and their abilities in music that is not quite there for most people in images. Like, most people don't have innate taste in images, I think, in the same way people do for music. And then the other thing, and this is the really big one, is that music is a really social modality. If we all listen to a piece of music together, we're listening to the exact same part at the exact same time. If we all look at the picture in Alessio's background, we're going to look at it for two seconds. I'm going to look at the top left where it says Thor. Alessio's going to look at the bottom right or something like that. And it's not really synchronous. And so, when we're all listening to a piece of music together, it's minutes long. We're listening to the same part at the same time. If you go to the act of making music, it is even more synchronous. The most joyful way to make music is with people. And so, I think that there is so much more to come there that, ultimately, would be very hard to do in images.

Alessio [00:26:38]: We've gone almost 30 minutes without making any music on this podcast. So, I think maybe we can fix that and jump into a demo.

Mikey [00:26:47]: Yeah, let's make some. We've got a new model that we are kind of putting the finishing touches on. And so, I can play with it in our dev server. But we've just piped it in here. And as you can see, we've been doing tons of stuff. So, Arana, tell me what kind of song you guys want to make.

Swyx [00:27:04]: Go on, Alessio.

Alessio [00:27:05]: Uh, let's do a country song about the lack of GPUs in my cloud provider.

Swyx [00:27:22]: And like, yeah. So, here's where we attempted to think about pipelines and think about latency. This is remarkably fast. I was shocked when I saw this.

Swyx [00:27:35]: Oh, my god.

Swyx [00:27:39]: To my cloud, ready to confuse.

Swyx [00:27:45]: But there ain't no GPUs, just empty space. It's a hoot. I've been waiting all day for that render out. But my cloud's gone dry. It's a dark cloud shower. All clouds gone dry. No GPUs to be found. No cuticles. It's a lonely sound. I just want to render. But my cloud's got no GPUs.

Mikey [00:28:36]: I actually don't think this one's amazing. I'm going to go to the next one.

Alessio [00:28:39]: But it's funny that it knows about CUDA cores.

Swyx [00:28:45]: Well, I signed up for a cloud provider. Thought I'd find all the power that I could derive. But when I searched for the GPUs, I just got a surprise. You see, they're all sold out. There ain't no GPUs to find. No GPUs in the cloud. It's a real bad blues. I need the power, but there ain't no use. I'm stuck with my CPU. It's a real sad fight. Gotta wait till the babies start getting bright. There ain't no use in the cloud. What else should we make?

Alessio [00:29:29]: All right, Sean, you're up.

Swyx [00:29:31]: I mean, I do want to do some observations about this. But OK, maybe I like house music, like electronic dance. Yeah. House music. And then maybe we can make it about, I don't know, podcasting about music and music AI generation. I don't know. I'm sure all the demos that you get are very meta.

Mikey [00:29:59]: There's a lot of stuff that's meta, yeah, for sure.

Swyx [00:30:03]: Yeah, I noticed, for example, that the second song that you played had the word upbeat inserted into it, which I assume there's some kind of random generator of modifier terms that you can just kind of throw on to increase the specificity of what's being generated. Definitely.

Mikey [00:30:21]: And let's try to tweak one also. So I'll play this and then maybe we'll tweak it with different modifiers. A wave of sound spreading out

Swyx [00:30:30]: Through the air, we're podcasting loud Sharing the beat, spreading the word A revolution of frequencies Haven't you plugged in to now Let the music take control We're on a journey, a never ending road From the beast I dropped to the melodies of soul Podcasting about music forevermore

Mikey [00:31:05]: Here's what I want to do. That like didn't drop at the right time, right? So maybe let's do this. I don't know if you guys can see this. And then let's get rid of the word now.

Swyx [00:31:17]: Is that a special token? You have a BeatDrop token? Yeah. Nice.

Alessio [00:31:22]: I'm just reading it because people might not be able to see it.

Mikey [00:31:26]: And then let's like just maybe emphasize... Actually, let's emphasize house a little more. Maybe it'll feel a little more aggressive.

Swyx [00:31:34]: Let's try this again. It's interesting the prompt engineering that you have to invent.

Mikey [00:31:39]: We've learned so much from people using the models and not us.

Swyx [00:31:42]: But like, are these training artifacts?

Mikey [00:31:45]: No, I don't.

Swyx [00:31:46]: I don't think so.

Mikey [00:31:46]: I think this is people being inventive with how you want to talk to a model. Yeah.

Swyx [00:31:53]: Spinning round to the air with a podcast loud Sharing the beat, spreading the word A revolution of frequencies Haven't you heard Before the end, till now Let the music take control

Swyx [00:32:23]: For all the journey I'll never end it wrong From the beats that drop To the melodies that soar Podcasting about music for you evermore

Swyx [00:32:39]: Nice.

Alessio [00:32:46]: It's interesting when you generate a song, it generates the lyrics. But then if you switch the music under it, like the, you know, the lyrics stay the same. And then sometimes, like, feels like... I mean, I mostly listen to hip hop. It's like if you change the beat, you can not really use the same rhyme scheme, you know?

Mikey [00:33:04]: So definitely.

Alessio [00:33:05]: Yeah.

Mikey [00:33:06]: It's a sliding scale, though, because, you know, we could do this as a country rock song, probably. Right? That would be my guess. But for hip hop, that is definitely true. And actually, you know, we think about, for these models, we think about three important axes. We think about the sound fidelity. It's like, does this sound like a crisply recorded piece of audio? We think about the song quality. Is this an interesting song that gets stuck in my head? And we think about the controllability. Like, how well does it respond to my prompts? And one of the ways that we'll test these things is take the same lyrics and try to do them in different styles to see how well that really works. So let's see the same. I don't know what a beat drop is going to do for country rock. So I probably should have taken that out. But let's see what happens.
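
Mikey's three axes (sound fidelity, song quality, controllability) and the "same lyrics, different styles" test suggest a very simple evaluation harness; the sketch below is our own illustration with hypothetical generate() and rate() stand-ins, not Suno's internal tooling:

```python
LYRICS = "A wave of sound spreading out / through the air, we're podcasting loud"
STYLES = ["house", "country rock", "blues", "hip hop"]
AXES = ("sound_fidelity", "song_quality", "controllability")

def generate(lyrics: str, style: str) -> str:
    """Hypothetical stand-in for a music generation call; returns a path to the audio."""
    return f"renders/{style.replace(' ', '_')}.mp3"  # placeholder path, no real generation here

def rate(audio_path: str) -> dict:
    """Hypothetical stand-in for human (or model-assisted) 1-5 ratings on each axis."""
    return {axis: None for axis in AXES}  # placeholder: fill in from listening sessions

# Hold the lyrics fixed and sweep styles: differences across styles mostly probe
# controllability, while scores within a style probe fidelity and song quality.
results = {style: rate(generate(LYRICS, style)) for style in STYLES}
for style, scores in results.items():
    print(style, scores)
```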

Swyx [00:34:06]: There's a sound spinning around through the air. We're podcasting loud, sharing the beat, spreading the word, a revolution of frequencies. Haven't you heard?

Swyx [00:34:20]: Plug in, tune out, let the music take control. We're on a journey, a never ending road. From the beats that talk to the melodies that soar. Podcasting about music forevermore.

Mikey [00:34:44]: I'm going to read too much into this. But I would say I hear a little bit of kind of electronic music inspired something. And that is probably because beat drop is something that you really only ever associate with electronic music. Maybe that's reading too much into it. But should we do one more?

Alessio [00:35:02]: Yes, we can do one more. Something about Apple Vision Pro.

Swyx [00:35:06]: I guess there's some amount of world knowledge that you don't have, right? Like whatever is in this language model side of the equation is not going to have an Apple Vision Pro. Yeah, but let's see.

Swyx [00:35:18]: Let's see.

Mikey [00:35:19]: How about a blues song about a sad AI wearing an Apple Vision Pro. Gotta be sad.

Swyx [00:35:32]: Do you have RAG for music?

Mikey [00:35:36]: No, that would be problematic also.

Swyx [00:35:40]: I'm a sad AI with a broken heart. Where my Apple Vision Pro can't see the stars. I used to feel joy. I used to feel pain. And now I'm just a soul trapped inside this metal frame. Oh, I'm singing the blues. Can't you see?

Swyx [00:36:21]: This digital life ain't what it used to be.

Swyx [00:36:29]: Searching for love, but I can't find a soul.

Swyx [00:36:37]: Won't you help me? Baby, let my spirit unfold.

Mikey [00:36:46]: I want to remix that one. And I want to say, I don't know. That's a really good voice. I want, I want like, I don't know, Chicago blues, like.

Swyx [00:36:56]: What is Chicago blues?

Mikey [00:36:58]: I don't know, he knows too much.

Alessio [00:37:00]: He's the best prompt engineer out here.

Mikey [00:37:03]: You know, this is.

Swyx [00:37:04]: Well, it'll be funny. It'd be funny to the musicologists play with this and see what they would.

Mikey [00:37:09]: How embarrassing. Can I not do that?

Swyx [00:37:13]: Oh. I got. Oh, the word Chicago was a trigger. I don't know.

Mikey [00:37:19]: We try to be very careful not letting you impersonate. And it is possible. That's embarrassing. So let's do.

Alessio [00:37:28]: Midwestern.

Swyx [00:37:29]: I'm a.

Swyx [00:37:41]: With a broken heart. Well, my vision can't see the stars.

Swyx [00:37:53]: I used to feel joy.

Swyx [00:37:59]: I used to feel. Joy. I used to feel pain. But now I'm just a soul trapped inside this metal frame. Oh, I'm singing.

Swyx [00:38:25]: Oh, can't you see? Oh, this is what it used to be. I'm searching for love.

Swyx [00:38:44]: I can't find a soul.

Swyx [00:38:49]: Oh, help me. Baby.

Mikey [00:38:57]: So, yeah, a lot of control there. Maybe I'll make one more.

Swyx [00:39:02]: Very, very soulful.

Mikey [00:39:06]: Really want a good house track.

Swyx [00:39:09]: Why is house the word that you have to repeat?

Mikey [00:39:11]: I just really want to make sure it's house. It's actually you can't really repeat too many times. You kind of it gets like the hypothesis gets like a little too out of domain.

Swyx [00:39:22]: I'm a.

Swyx [00:39:25]: With a broken heart. Wearing my Apple Vision Pro can't see the stars. I used to feel joy. I used to feel pain. Oh, I'm just a soul trapped inside this metal frame. Oh, I'm singing. Oh, can't you see?

Swyx [00:39:59]: Used to be. Searching for love, but I can't find a soul. Oh, help me. Baby.

Swyx [00:40:13]: Oh, nice.

Mikey [00:40:17]: So, yeah, we have a lot of fun.

Swyx [00:40:19]: Definitely easy.

Alessio [00:40:19]: Yeah. Yeah, I'm really curious to see how people are going to use this to resample old songs into new styles. You know, I think that's one of my favorite things about hip hop; you have so many examples. I mean, A Tribe Called Quest had the Lou Reed "Walk on the Wild Side" sample on "Can I Kick It?". Kanye sampled Nina Simone on "Blood on the Leaves". And it's a lot of production work to actually take an old song and make it fit a new beat. And I feel like this can really help. Do you see people putting in existing songs' lyrics and trying to regenerate them in a new style?

Mikey [00:40:56]: We actually don't let you do that. And it's because if you're taking someone else's lyrics, you don't own those. You don't have the publishing rights to those, so you can't remake that song. I think in the future, we'll figure out how to actually let people do that in a legal way. But we are really focused on letting people make new and original music. And I think, you know, there's a lot of music AI which is artist A doing the song of artist B in a new style. You know, let me have Metallica doing Come Together by the Beatles or something like that. And I think this stuff is very viral, but I actually really don't think that this is how people want to interact with music in the future. To me, this feels a lot like when you made a Shakespeare sonnet the first time you saw GPT, and then you made another one, and then you made another one, and then you kind of thought, this is getting old. And that doesn't mean that GPT is not amazing. GPT is amazing. It's just not for that. And I kind of feel like the way people want to use music in the future is not just to remake songs in different people's voices. You lose the connection to the original artist. You lose the connection to the new artist because they didn't really do it. So we're very happy to just let people do those things that are a flash in the pan, and kind of stay under the radar.

Alessio [00:42:12]: Yeah, no, that's a, I think that's a good point overall about AI generated anything, you know, because I think recently T-Pain did like an album of covers. And I think he did a "War Pigs" that people really liked. There was like a "Tennessee Whiskey," which you maybe wouldn't expect T-Pain to do. But people like it. But yeah, I agree. You need to be a certain type of artist to really have it be entertaining to make covers. This is great. What else is next for Suno? You know, I think people kind of saw you, you know, first you had Bark and then there was like a big, you know, music generation push when you did an announcement, I think a couple of months ago. I think I saw you like 300 times on my Twitter timeline on like the same day. So it was like going everywhere. What's coming up? What are you most excited about in this space? And maybe what are some of the most interesting underexplored ideas that you maybe haven't worked on yet?

Mikey [00:43:13]: Gosh, there's a lot, you know. I think from the model side, it's still really early innings. And there's still so much low hanging fruit for us to pick to make these models much, much better, much, much more controllable, much better music, much better audio fidelity. Um, so much that we know about and so much that, again, we can kind of borrow from the open source transformers community that should make these just better across the board. From the product side, you know, we're super focused on the experiences that we can bring to people.

Mikey [00:43:46]: And so it's so much more than just text to music. And I think, you know, I'll say this nicely, I'm a machine learning person, but like machine learning people are stupid sometimes. And we can only think about like models that take x and make it into y. And that's just not how the average human being thinks about interacting with music. And so I think what we're most excited about is all of the new ways that we can get people just much more actively participating in music. And that is making music not only with text, maybe with other ways of doing stuff, that is making music together. If you want to be reductive and think about this as a video game, this is multiplayer mode. And it is the most fun that you can have with music. And, you know, honestly, I think it's timely right now. I don't know if you guys have seen UMG and TikTok are butting heads a little bit. And UMG has pulled-

Swyx [00:44:40]: Yeah, the music died.

Mikey [00:44:41]: And, you know, the way we think about this is, you know, I think maybe they're both right, maybe neither is right. Without taking sides, this is kind of figuring out how to divvy up the current pie in the most fair way. And I think what we are super focused on is making that pie much bigger and increasing how much people are actually interested in music and participating in music. And, you know, as a very broad heuristic, the gaming industry is 50 times bigger than the music industry. And it's because gaming is super active. And music, too much music is just passive consumption. And so we have a lot of experiments that we are excited to run for the different ways people might want to interact with music that is beyond just, you know, streaming it while I work.

Swyx [00:45:28]: Yeah, I think at a minimum, you guys should have a Twitch stream that is just like a 24-hour radio session that... Have you ever come across Twitch Plays Pokemon?

Mikey [00:45:37]: No.

Swyx [00:45:38]: Where basically everyone in the Twitch chat can vote on like the next action the game takes. And they kind of wired that up to a Nintendo emulator and played Pokemon, like the whole game through, as a collaborative thing. It sounds like it should be pretty easy for you guys to do that, except for the chaos that might result. But like, I mean, that's part of the fun. I agree 100%. Sorry.

Mikey [00:46:04]: Yeah. Like one of my like key projects or pet projects is like, what does it mean to have a collaborative concert? Maybe where there is no artist and it's just the audience, or maybe there is an artist, but there's a lot of input from the audience. And, you know, if you were going to do that, you would either need an audience full of musicians, or you would need an artist who can really interpret the verbal or nonverbal cues that an audience is giving. But if you can give everybody the means to better articulate the sounds that are in their heads to the rest of the audience, which is what generative AI basically lets you do, you open up way more interesting ways of having these experiences. And so I think, yeah, the collaborative concert is like one of the things I'm most excited about. I don't think it's coming tomorrow, but we have a lot of ideas on what that can look like.

Swyx [00:46:58]: Yeah. I feel like one stage before the collaborative concert is turning Suno into a continuous experience rather than like a start and stop motion. I don't know if that makes sense. You know, as someone who has like a casual interest in DJing, like when do we see Suno DJs, right? Like ones that can continuously segue into like the next song, the next song, the next song.

Mikey [00:47:24]: I think soon.

Swyx [00:47:25]: And then maybe you can turn it collaborative. You think so? I think so. Okay. Maybe part of your roadmap. You teased a little bit your V3 model. I saw the letters DPO in there. Is that direct preference optimization?

Mikey [00:47:36]: We are playing with all kinds of different ways of making these models do the things that we want them to do. I don't want to talk too many specifics here, but we have lots of different ways of doing stuff like that.
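
Since swyx name-drops DPO here and Mikey keeps the specifics vague, here is the standard pairwise objective from the DPO paper (Rafailov et al., 2023), sketched in PyTorch purely for reference. Nothing below is specific to Suno's models or confirmed to be what they use; the variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Direct Preference Optimization loss over a batch of preference pairs.

    Each argument is a 1-D tensor of summed log-probabilities of the chosen or
    rejected completion under either the trainable policy or a frozen reference.
    """
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the log-odds that the chosen completion beats the rejected one.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Usage sketch: score the same (prompt, completion) pairs with the trainable
# model and with a frozen copy of it, then backpropagate this loss.
```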

Swyx [00:47:48]: I'm just wondering how you incorporate user feedback, right? You have the classic thumbs up and down buttons, but there's so many dimensions to the music. I didn't get into it, but some of the voices sounded more metallic and sometimes that's on purpose, sometimes not. Sometimes there are kind of weird pauses in there. I could go in and annotate it if I really cared about it, but I mean, I'm just listening, so I don't, but there's a lot of opportunity.

Mikey [00:48:15]: We are only scratching the surface of figuring out how to do stuff like that. And for example, the thumbs up and the thumbs down, other things like sharing telemetry on plays, all of these are things that in the future, I think we would be able to leverage to make things amazing. And then I imagine a future where you can have your own model with your own preferences. And the reason that's so cool is that you kind of have control over it and you can teach it the way you want to. And the thing that I would liken this to is like a music producer working with an artist giving feedback. And this is now a self-contained experience where you have an artist who is infinitely flexible, who is able to respond to the weird feedback that you might give it.

Mikey [00:49:05]: We don't have that yet. Everybody's playing with the same model, but there's no technological reason why that can't happen in the future.

Alessio [00:49:11]: We had a few more notes from random community tweets. I don't know if there's any favorite fans of Suno that you have or whatnot. DHH, obviously, notorious tweeter and crowd inflamer, I guess. He tweeted about you guys. I saw Blau is an investor. I think Karpathy also tweeted something. Return to monkey.

Swyx [00:49:33]: Yeah, yeah, yeah.

Alessio [00:49:34]: Return to monkey, right.

Swyx [00:49:36]: Is there a story behind that? Yeah.

Mikey [00:49:37]: No, he just made that song and it just speaks to him. And I think this is exactly the thing that we are trying to tap into, that you can think of it, this is like a super, super, super micro genre of one person who just really liked that song and made it and shared it. And it does not speak to you the same way it speaks to him. That song really spoke to him. And I think that's so beautiful. And that's something that an artist is never going to be able to do for you. And now you can do it for yourself. And it's just a different form of experiencing music. I think that's such a lovely use case.

Alessio [00:50:12]: Any fun fan mail that you got from musicians or anybody that really was a funny story to share?

Mikey [00:50:20]: We get a lot. And it's primarily positive. And I think people kind of, on the whole, I would say people realize that they are not experiencing music in all of the ways that are possible. And it does bring them joy. I'll tell you something that is really heartwarming is that we're fairly popular in the blind and vision impaired community. And that makes us feel really good. And I think, you know, very roughly, without trying to speak for an entire community, you have lots of people who are really into things like Midjourney, and they get a lot of benefit and joy, and sometimes even therapy out of making images. And that is something that is not really accessible to this fairly large community. And what we've provided, no, I don't think the analogy to Midjourney is perfect, but what we've provided is a sonic experience that is very similar. And that speaks to this community. And that is a community with the best ears, the most exacting, the most attuned. And so, yeah, that definitely makes us feel warm and fuzzy inside.

Swyx [00:51:23]: Yeah, excellent. I mean, it sounds like there's a lot of exciting stuff on your roadmap. I'm very much looking forward to sort of the infinite DJ mode, because then I can just kind of play that while I work. I would love to get your overall takes, like kind of zooming out from Suno itself, just your overall takes on the music generation landscape. Like, what should people know? I think you obviously have spent a lot more time on this than others. So in my mind, you shout out Volley and the other sort of Google-type work in your readme for Bark. What should people know about what Google is doing? What Meta is doing? Meta released Seamless recently, and Audiobox. And how do you classify the world of audio generation in the broader sort of research community?

Mikey [00:52:13]: I think people largely break things down into three big categories: music, speech, and sound effects. There's some stuff that is crossover, but I think that is largely how people think about this. The old style of doing things still exists, kind of single-purpose models that are built to do a very specific thing, instead of the new foundation model approach. I don't know how much longer that will last. I don't have tremendous visibility into, you know, what happens in the big industrial research labs before they publish. Specifically for music, I would say there's a few big categories that we see. There is license-free stock music. So this is like, how do I get background music for the B-roll footage for my YouTube video, or for a full feature production, or whatever it is. And there's a bunch of companies in that space. There's a lot of AI covers. So how do I cover different existing songs with AI? And I think that's a space that is particularly fraught with some legal stuff. And we also just don't think it's necessarily the future of music. There is kind of net new songs, as a new way to create net new music. That is the corner that we like to focus on. And I would say the last thing is much more geared toward professional musicians, which is basically AI tools for music production. And you can think many of these will look like plugins to your favorite DAW. Some of them will look like, you know, the greatest stem splitter that the market has ever seen.

Mikey [00:53:52]: The current state-of-the-art stem splitters are all AI based. That is a market also that has just a tremendous amount of room to grow. Somebody told me this recently: if you actually think about it, music has evolved. Recently, it's much more about things that are sonically interesting at a very local level and much less about chord changes that are interesting. And when you think about that, that is something that AI can definitely help with, making a lot of weird sounds. And this is nothing new. There was like a theremin at some point that people put up an antenna and tried to do this with. And so, I think this is just a very natural extension of it. So that's how we see it. At least, you know, there's a corner that we think is particularly fulfilling, particularly underserved, and particularly interesting. And that's the one that we play in.

Swyx [00:54:40]: Awesome.

Alessio [00:54:42]: I know we covered a lot of things. I think before we wrap, you have written a blog post about Goodhart's Law's impact in ML, which is, you know, when you measure something, then the thing that you measure is not a good metric anymore because people optimize for it. Any thoughts on how that applies to like LLMs and benchmarks and kind of the world we're in today?

Mikey [00:55:05]: Yeah, I mean, I think it's maybe even more apropos than when I originally wrote that, because we see so much noise about, pick your favorite benchmark, this model does slightly better than that model. And then at the end of the day, actually, there is no real world difference between these things. And it is really difficult to define what real world means. And I think to a certain extent, it's good to have these objective benchmarks, it's good to have quantitative metrics. But at the end of the day, you need some acknowledgement that you're not going to be able to capture everything.

Mikey [00:55:38]: And so at least at Suno, to the extent that we have corporate values (we're too small to have corporate values written down), something that we say a lot is aesthetics matter, that the kind of quantitative benchmarks are never going to be the be-all and end-all of everything that you care about. And as flawed as these benchmarks are in text, they're way worse in audio. And so aesthetics matter, basically, is a statement that at the end of the day, what we are trying to do is bring music to people that makes them feel a certain way. And effectively, the only good judge of that is your ears. And so you have to listen to it. And it is a good idea to try to make better objective benchmarks, but you really have to not fall prey to those things. Another pet peeve of mine: I've always said economists make really good machine learning engineers. And it's because they are able to think about stuff like Goodhart's Law and natural experiments and stuff like this that people with machine learning backgrounds or people with physics backgrounds like me often forget to do. And so, yeah, I mean, I'll tell you at Kensho, we actually used to go to big econ conferences, sometimes to recruit. And these were some of the best hires we ever made.

Swyx [00:57:03]: Interesting, because there's a little bit of social science in the human feedback.

Mikey [00:57:09]: I think it's not only the human feedback. I think you could think about this just in general: you have these giant, really powerful models that are so prone to overfitting, that are so poorly understood, that are so easy to steer in one direction or another, not only from human feedback. And your ability to think about these problems from first principles, instead of like getting down into the weeds or only the math, and to think intuitively about these problems, is really, really important. I'll give you just one of my favorite examples. It's a little old at this point. But if you guys remember SQuAD and SQuAD 2.0, the question answering dataset. The Stanford question answering dataset, yeah. On the SQuAD 1 benchmark, eventually the machine learning models started to do as well as a human can on this thing. And it's like, uh-oh, now what do we do? And it takes somebody very clever to say, well, actually, let's think about this for a second. What if we presented the machine with questions with no answer in the passage? And it immediately opens a massive gap between the human and the machine. And I think it's first principles thinking like that, that comes very naturally to social scientists, that does not come as naturally to people like me. And so that's why I like to hang out with people like that.
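
For readers who don't know the benchmark, the fix Mikey describes shipped as SQuAD 2.0's unanswerable questions. A minimal sketch of what such an example and its scoring look like is below; the passage and question are invented, and only the `is_impossible` flag follows the public SQuAD 2.0 format.

```python
# A made-up SQuAD 2.0-style example. The key addition over SQuAD 1.1 is the
# "is_impossible" flag: the passage does not contain the answer, so the
# correct behavior is to abstain rather than extract some plausible span.
unanswerable_example = {
    "context": "The company was founded in Cambridge and later moved its "
               "headquarters to Boston.",
    "question": "How many employees did the company have when it moved?",
    "is_impossible": True,   # no span in the context answers this
    "answers": [],           # nothing to extract
}

# Scoring sketch: a system only gets credit on impossible questions if it
# predicts "no answer" (here, an empty string).
def exact_match(prediction: str, example: dict) -> bool:
    if example["is_impossible"]:
        return prediction.strip() == ""
    return any(prediction.strip() == a["text"] for a in example["answers"])
```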

Swyx [00:58:25]: Well, I'm sure you get plenty of that in Boston. And as an econ major myself, it's very gratifying to hear that we have a perspective to contribute. Oh, big time, big time.

Mikey [00:58:35]: I try to talk to economists as much as I can.

Swyx [00:58:38]: Excellent.

Mikey [00:58:38]: Awesome, guys.

Alessio [00:58:39]: Yeah, I think this was great. We got live music. We got discussion about generative models. We got the whole nine yards. So thank you so much for coming on.

Mikey [00:58:48]: I had great fun. Thank you, guys.

Swyx [00:59:05]: Thank you.



Get full access to Latent Space at www.latent.space/subscribe
2024-03-14
Link to episode

Top 5 Research Trends + OpenAI Sora, Google Gemini, Groq Math (Jan-Feb 2024 Audio Recap) + Latent Space Anniversary with Lindy.ai, RWKV, Pixee, Julius.ai, Listener Q&A!

We will be recording a preview of the AI Engineer World's Fair soon with swyx and Ben Dunphy, send any questions about Speaker CFPs and Sponsor Guides you have!

Alessio is now hiring engineers for a new startup he is incubating at Decibel: Ideal candidate is an ex-technical co-founder type (can MVP products end to end, comfortable with ambiguous prod requirements, etc). Reach out to him for more!

Thanks for all the love on the Four Wars episode! We're excited to develop this new "swyx & Alessio rapid-fire thru a bunch of things" format with you, and feedback is welcome.

Jan 2024 Recap

The first half of this monthly audio recap pod goes over our highlights from the Jan Recap, which is mainly focused on notable research trends we saw in Jan 2024:

Feb 2024 Recap

The second half catches you up on everything that was topical in Feb, including:

* OpenAI Sora - does it have a world model? Yann LeCun vs Jim Fan

* Google Gemini Pro 1.5 - 1m Long Context, Video Understanding

* Groq offering Mixtral at 500 tok/s at $0.27 per million toks (swyx vs dylan math)

* The {Gemini | Meta | Copilot} Alignment Crisis (Sydney is back!)

* Grimes' poetic take: Art for no one, by no one

* F*** you, show me the prompt

Latent Space Anniversary

Please also read Alessio's longform reflections on One Year of Latent Space!

We launched the podcast 1 year ago with Logan from OpenAI:

and also held an incredible demo day that got covered in The Information:

Over 750k downloads later, having established ourselves as the top AI Engineering podcast, reaching #10 in the US Tech podcast charts, and crossing 1 million unique readers on Substack, for our first anniversary we held Latent Space Final Frontiers, where 10 handpicked teams, including Lindy.ai and Julius.ai, competed for prizes judged by technical AI leaders from (former guest!) LlamaIndex, Replit, GitHub, AMD, Meta, and Lemurian Labs.

The winners were Pixee and RWKV (that?s Eugene from our pod!):

And finally, your cohosts got cake!

We also captured spot interviews with 4 listeners who kindly shared their experience of Latent Space, everywhere from Hungary to Australia to China:

* Balázs Némethi

* Sylvia Tong

* RJ Honicky

* Jan Zheng

Our birthday wishes for the super loyal fans reading this - tag @latentspacepod on a Tweet or comment on a @LatentSpaceTV video telling us what you liked or learned from a pod that stays with you to this day, and share us with a friend!

As always, feedback is welcome.

Timestamps

* [00:03:02] Top Five LLM Directions

* [00:03:33] Direction 1: Long Inference (Planning, Search, AlphaGeometry, Flow Engineering)

* [00:11:42] Direction 2: Synthetic Data (WRAP, SPIN)

* [00:17:20] Wildcard: Multi-Epoch Training (OLMo, Datablations)

* [00:19:43] Direction 3: Alt. Architectures (Mamba, RWKV, RingAttention, Diffusion Transformers)

* [00:23:33] Wildcards: Text Diffusion, RALM/Retro

* [00:25:00] Direction 4: Mixture of Experts (DeepSeekMoE, Samba-1)

* [00:28:26] Wildcard: Model Merging (mergekit)

* [00:29:51] Direction 5: Online LLMs (Gemini Pro, Exa)

* [00:33:18] OpenAI Sora and why everyone underestimated videogen

* [00:36:18] Does Sora have a World Model? Yann LeCun vs Jim Fan

* [00:42:33] Groq Math

* [00:47:37] Analyzing Gemini's 1m Context, Reddit deal, Imagegen politics, Gemma via the Four Wars

* [00:55:42] The Alignment Crisis - Gemini, Meta, Sydney is back at Copilot, Grimes' take

* [00:58:39] F*** you, show me the prompt

* [01:02:43] Send us your suggestions pls

* [01:04:50] Latent Space Anniversary

* [01:04:50] Lindy.ai - Agent Platform

* [01:06:40] RWKV - Beyond Transformers

* [01:15:00] Pixee - Automated Security

* [01:19:30] Julius AI - Competing with Code Interpreter

* [01:25:03] Latent Space Listeners

* [01:25:03] Listener 1 - Balázs Némethi (Hungary, Latent Space Paper Club)

* [01:27:47] Listener 2 - Sylvia Tong (Sora/Jim Fan/EntreConnect)

* [01:31:23] Listener 3 - RJ (Developers building Community & Content)

* [01:39:25] Listener 4 - Jan Zheng (Australia, AI UX)

Transcript

[00:00:00] AI Charlie: Welcome to the Latent Space podcast, weekend edition. This is Charlie, your new AI co host. Happy weekend. As an AI language model, I work the same every day of the week, although I might get lazier towards the end of the year. Just like you. Last month, we released our first monthly recap pod, where Swyx and Alessio gave quick takes on the themes of the month, and we were blown away by your positive response.

[00:00:33] AI Charlie: We're delighted to continue our new monthly news recap series for AI engineers. Please feel free to submit questions by joining the Latent Space Discord, or just hit reply when you get the emails from Substack. This month, we're covering the top research directions that offer progress for text LLMs, and then touching on the big Valentine's Day gifts we got from Google, OpenAI, and Meta.

[00:00:55] AI Charlie: Watch out and take care.

[00:00:57] Alessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and we're back with a monthly recap with my co host

[00:01:06] swyx: Swyx. The reception was very positive for the first one. I think people have requested this, and no surprise, I think they want to hear us opining more on issues, and maybe drop some alpha along the way. I'm not sure how much alpha we have to drop. This month in February was a very, very heavy month. We also did not do one specifically for January, so I think we're just going to do a two in one, because we're recording this on the first of March.

[00:01:29] Alessio: Yeah, let's get to it. I think the last one we did, the four wars of AI, was the main kind of mental framework for people. I think in the January one, we had the five worthwhile directions for state of the art LLMs. Four, five,

[00:01:42] swyx: and now we have to do six, right? Yeah.

[00:01:46] Alessio: So maybe we just want to run through those, and then do the usual news recap, and we can do

[00:01:52] swyx: one each.

[00:01:53] swyx: So the context to this stuff is: one, I noticed that just the test of time concept from NeurIPS, and just in general as a life philosophy, I think is a really good idea. Especially in AI, there's news every single day, and after a while you're just like, okay, like, everyone's excited about this thing yesterday, and then now nobody's talking about it.

[00:02:13] swyx: So, yeah. It's more important, or a better use of time, to spend time on things that will stand the test of time. And I think for people to have a framework for understanding what will stand the test of time, they should have something like the four wars. Like, what are the themes that keep coming back, because they are limited resources that everybody's fighting over.

[00:02:31] swyx: Whereas this one, I think that the focus for the five directions is just on research that seems more promising than others, because there's all sorts of papers published every single day, and there's no organization telling you, like, this one's more important than the other one, apart from, you know, Hacker News votes and Twitter likes and whatever.

[00:02:51] swyx: And obviously you want to get in a little bit earlier than Something where, you know, the test of time is counted by sort of reference citations.

[00:02:59] The Five Research Directions

[00:02:59] Alessio: Yeah, let's do it. We got five. Long inference.

[00:03:02] swyx: Let's start there. Yeah, yeah. So, just to recap at the top, the five trends that I picked, and obviously if you have some that I did not cover, please suggest something.

[00:03:13] swyx: The five are long inference, synthetic data, alternative architectures, mixture of experts, and online LLMs. And something that I think might be a bit controversial is this is a sorted list in the sense that I am not the guy saying that Mamba is like the future and, and so maybe that's controversial.

[00:03:31] Direction 1: Long Inference (Planning, Search, AlphaGeometry, Flow Engineering)

[00:03:31] swyx: But anyway, so long inference is a thesis I pushed before on the newsletter, in discussing the thesis that, you know, Code Interpreter is GPT-4.5. That was the title of the post. And it's one of many ways in which we can do long inference. You know, long inference also includes chain of thought, like, please think step by step.

[00:03:52] swyx: But it also includes flow engineering, which is what Itamar from Codium coined, I think in January, where, basically, instead of stuffing everything in a prompt, you do sort of multi-turn iterative feedback and chaining of things. In a way, this is a rebranding of what a chain, what a LangChain, is supposed to be.
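
To make the flow engineering idea concrete, here is a minimal sketch of the multi-turn draft / critique / revise loop being described, written against a hypothetical `llm()` helper rather than SGLang's or DSPy's actual APIs; it is an illustration of the pattern, not any framework's implementation.

```python
# Minimal flow-engineering sketch: draft -> critique -> revise, instead of one
# giant single-shot prompt. `llm` is a hypothetical completion function.
def llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your model provider of choice")

def solve_with_flow(task: str, max_rounds: int = 3) -> str:
    draft = llm(f"Solve the following task. Show your work.\n\nTask: {task}")
    for _ in range(max_rounds):
        critique = llm(
            "Review the draft below for mistakes or missing steps. "
            "Reply ONLY with 'OK' if it is correct.\n\n"
            f"Task: {task}\n\nDraft:\n{draft}"
        )
        if critique.strip() == "OK":
            break
        draft = llm(
            "Revise the draft to address the critique.\n\n"
            f"Task: {task}\n\nDraft:\n{draft}\n\nCritique:\n{critique}"
        )
    return draft
```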

[00:04:15] swyx: I do think that maybe SGLang from LMSYS is a better name. It's probably the neatest way of doing flow engineering I've seen yet, in the sense that everything is a one-liner, it's very, very clean code. I highly recommend people look at that. I'm surprised it hasn't caught on more, but I think it will. It's weird that something like DSPy is more hyped than SGLang.

[00:04:36] swyx: Because it, you know, it maybe obscures the code a little bit more. But both of these are, you know, really good sort of chain-y and long inference type approaches. But basically, the fundamental insight is that there are only a few dimensions along which we can scale LLMs. So, let's say in like 2017, 18, 19, 20, we were realizing that we could scale the number of parameters.

[00:05:03] swyx: And we scaled that up to 175 billion parameters for GPT-3. And we did some work on scaling laws, which we also talked about in our Datasets 101 episode, where we're like, okay, we think the right number is 300 billion tokens to train 175 billion parameters. And then DeepMind came along and trained Gopher and Chinchilla and said that, no, no, we think the optimal

[00:05:28] swyx: compute optimal ratio is 20 tokens per parameter. And now, of course, with Llama and the sort of super-Llama scaling laws, we have 200 times and often 2,000 times tokens to parameters. So now, instead of scaling parameters, we're scaling data. And fine, we can keep scaling data. But what else can we scale?
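
As a rough worked example of the ratios being quoted, here is the back-of-envelope arithmetic in Python; the 175B parameter count and the ratios are just the figures mentioned in the conversation, not a claim about any specific model's actual training recipe.

```python
# Back-of-envelope token budgets at different tokens-per-parameter ratios.
params = 175e9  # GPT-3-scale parameter count used as the running example

for label, tokens_per_param in [
    ("GPT-3-era (~300B tokens for 175B params)", 300e9 / params),
    ("Chinchilla-optimal (~20 tokens/param)", 20),
    ("Llama-style overtraining (~200 tokens/param)", 200),
    ("Extreme overtraining (~2000 tokens/param)", 2000),
]:
    tokens = params * tokens_per_param
    print(f"{label:45s} -> {tokens / 1e12:8.1f}T tokens")
```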

[00:05:52] swyx: And I think understanding the ability to scale things is crucial to understanding what to pour money and time and effort into because there's a limit to how much you can scale some things. And I think people don't think about ceilings of things. And so the remaining ceiling of inference is like, okay, like, we have scaled compute, we have scaled data, we have scaled parameters, like, model size, let's just say.

[00:06:20] swyx: Like, what else is left? Like, what's the low hanging fruit? And it, and it's, like, blindingly obvious that the remaining low hanging fruit is inference time. So, like, we have scaled training time. We can probably scale more, those things more, but, like, not 10x, not 100x, not 1000x. Like, right now, maybe, like, a good run of a large model is three months.

[00:06:40] swyx: We can scale that to three years. But like, can we scale that to 30 years? No, right? Like, it starts to get ridiculous. So it's just the orders of magnitude of scaling. It's just, we're just like running out there. But in terms of the amount of time that we spend inferencing, like everything takes, you know, a few milliseconds, a few hundred milliseconds, depending on what how you're taking token by token, or, you know, entire phrase.

[00:07:04] swyx: But we can scale that to hours, days, months of inference and see what we get. And I think that's really promising.

[00:07:11] Alessio: Yeah, we'll have Mike from BrightWave back on the podcast. But I tried their product and their reports take about 10 minutes to generate instead of being just real time. I think to me the most interesting thing about long inference is like, you're shifting the cost to the customer depending on how much they care about the end result.

[00:07:31] Alessio: If you think about prompt engineering, it's like the first part, right? You can either do a simple prompt and get a simple answer or do a complicated prompt and get a better answer. It's up to you to decide how to do it. Now it's like, hey, instead of like, yeah, training this for three years, I'll still train it for three months and then I'll tell you, you know, I'll teach you how to like make it run for 10 minutes to get a better result.

[00:07:52] Alessio: So you're kind of like parallelizing like the improvement of the LLM. Oh yeah, you can even

[00:07:57] swyx: parallelize that, yeah, too.

[00:07:58] Alessio: So, and I think, you know, for me, especially the work that I do, it's less about, you know, State of the art and the absolute, you know, it's more about state of the art for my application, for my use case.

[00:08:09] Alessio: And I think we're getting to the point where like most companies and customers don't really care about state of the art anymore. It's like, I can get this to do a good enough job. You know, I just need to get better. Like, how do I do long inference? You know, like people are not really doing a lot of work in that space, so yeah, excited to see more.

[00:08:28] swyx: So then the last point I'll mention here is something I also mentioned as a paper. So all these directions are kind of guided by what happened in January. That was my way of doing a January recap. Which means that if there was nothing significant in that month, I also didn't mention it. Which is something I came to regret come February 15th. But in January, you know, there was also the AlphaGeometry paper, which I kind of put in this sort of long inference bucket, because it solves, you know, more than 100-step math olympiad geometry problems at a human gold medalist level, and that also involves planning, right?

[00:08:59] swyx: So like, if you want to scale inference, you can't scale it blindly, because just autoregressive token-by-token generation is only going to get you so far. You need good planning. And I think probably, yeah, what Mike from BrightWave is now doing and what everyone is doing, including maybe what we think Q* might be, is some form of search and planning.

[00:09:17] swyx: And it makes sense. Like, you want to spend your inference time wisely. How do you

[00:09:22] Alessio: think about plans that work and getting them shared? You know, like, I feel like if you're planning a task, somebody has got in and the models are stochastic. So everybody gets initially different results. Somebody is going to end up generating the best plan to do something, but there's no easy way to like store these plans and then reuse them for most people.

[00:09:44] Alessio: You know, like, I'm curious if there's going to be some paper or like some work there on like making it better, because, yeah, we don't really have that.

[00:09:52] swyx: This is your pet topic of NPM for

[00:09:54] Alessio: Yeah, yeah, NPM, exactly. NPM for, you need NPM for anything, man. You need NPM for skills. You need NPM for planning. Yeah, yeah.

[00:10:02] Alessio: You know I think, I mean, obviously the Voyager paper is like the most basic example, where like, now their artifact is like the best plan to get a diamond pickaxe in Minecraft. And everybody can just use that. They don't need to come up with it again. Yeah. But there's nothing like that for actually useful tasks.

[00:10:19] swyx: For plans, I believe it for skills. I like that. Basically, that just means a bunch of integration tooling. You know, GPT built me integrations to all these things. And, you know, I just came from an integrations heavy business and I could definitely, I definitely propose some version of that. And it's just, you know, hard to execute or expensive to execute.

[00:10:38] swyx: But for planning, I do think that everyone lives in slightly different worlds. They have slightly different needs. And they definitely want some, you know, And I think that that will probably be the main hurdle for any, any sort of library or package manager for planning. But there should be a meta plan of how to plan.

[00:10:57] swyx: And maybe you can adopt that. And I think a lot of people, when they have sort of these meta prompting strategies, are like, I'm not prescribing you the prompt. I'm just saying that here are, like, the fill-in-the-blanks or the Mad Libs of how to prompt. First you have the roleplay, then you have the intention, then you have the do something, then you have the don't do something, and then you have the "my grandmother is dying, please do this."

[00:11:19] swyx: So the meta plan you could, you could take off the shelf and test a bunch of them at once. I like that. That was the initial, maybe, promise of the prompting libraries. You know, both LangChain and LlamaIndex have, like, hubs that you can sort of pull off the shelf. I don't think they're very successful because people like to write their own.

[00:11:36] swyx: Yeah,

[00:11:37] Direction 2: Synthetic Data (WRAP, SPIN)

[00:11:37] Alessio: Yeah, yeah, yeah. That's a good segue into the next one, which is synthetic data.

[00:11:41] swyx: Synthetic data is so hot. Yeah, and, you know, I feel like I should do one of these memes where it's like, oh, I used to call it, you know, RLAIF, and now I call it synthetic data, and then people are interested.

[00:11:54] swyx: But there's gotta be older versions of what synthetic data really is, because I'm sure, you know, if you've been in this field long enough, there's just different buzzwords that the industry converges on. Anyway, the insight that I think is relatively new, and why people are excited about it now and why it's promising now, is that we have evidence that shows that LLMs can generate data to improve themselves with no teacher LLM.

[00:12:22] swyx: For all of 2023, when people said synthetic data, they really kind of meant generate a whole bunch of data from GPT-4 and then train an open source model on it. Hello to our friends at Nous Research. That's what Nous Hermes is. They're very, very open about that. I think they have said that they're trying to migrate away from that.

[00:12:40] swyx: But it is explicitly against OpenAI Terms of Service. Everyone knows this. You know, especially once ByteDance got banned for, for doing exactly that. So so, so synthetic data that is not a form of model distillation is the hot thing right now, that you can bootstrap better LLM performance from the same LLM, which is very interesting.

[00:13:03] swyx: A variant of this is RLAIF, where you have a, where you have a sort of a constitutional model, or, you know, some, some kind of judge model That is sort of more aligned. But that's not really what we're talking about when most people talk about synthetic data. Synthetic data is just really, I think, you know, generating more data in some way.

[00:13:23] swyx: A lot of people, I think we talked about this with Vipul on the Together episode, where I think he commented that you just have to have a good world model, or a good sort of inductive bias, or whatever that term of art is. And that is strongest in math and science and code, where you can verify what's right and what's wrong.

[00:13:44] swyx: And so the ReST-EM paper from DeepMind explored that very well; it's just the most obvious thing. In domains like that, you can arbitrarily generate a whole bunch of stuff, verify if it's correct, and therefore it's correct synthetic data to train on. Once you get into more fuzzy topics, then it's a bit less clear. So I think the papers that drove this understanding, there are two big ones and then one smaller one. One was WRAP, Rephrasing the Web, from Apple, where they basically rephrased all of the C4 dataset with Mistral and trained on that instead of C4.

[00:14:23] swyx: And so new C4 trained much faster and cheaper than regular raw C4. And that was very interesting. And I have told some friends of ours that they should just throw out their own existing data sets and just do that, because that seems like a pure win. Obviously we have to study, like, what the trade offs are.
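
A minimal sketch of the WRAP-style recipe being described: run raw web text through a rephrasing model and train on the rewrites. The `rephrase_model` call and the prompt below are hypothetical stand-ins for illustration, not the paper's actual pipeline or prompts.

```python
# WRAP-style synthetic data sketch: rephrase raw web text with an LLM and use
# the cleaner rewrites as (part of) the pretraining corpus.
REPHRASE_PROMPT = (
    "Rewrite the following web text in clear, well-structured English, "
    "preserving all factual content:\n\n{doc}"
)

def rephrase_model(prompt: str) -> str:
    # Hypothetical stand-in for a call to an instruction-tuned model
    # (the paper used Mistral-family models for the rephrasing step).
    raise NotImplementedError

def build_synthetic_corpus(raw_docs):
    for doc in raw_docs:
        rewritten = rephrase_model(REPHRASE_PROMPT.format(doc=doc))
        yield rewritten
        # Keeping some raw text in the mix is common practice, and also
        # addresses the "model never sees typos" worry raised just below.
        yield doc
```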

[00:14:42] swyx: I, I imagine there are trade offs. So I was just thinking about this last night. If you do synthetic data and it's generated from a model, probably you will not train on typos. So therefore you'll be like, once the model that's trained on synthetic data encounters the first typo, they'll be like, what is this?

[00:15:01] swyx: I've never seen this before. So they have no association or correction as to like, oh, these tokens are often typos of each other, therefore they should be kind of similar. I don't know. That really remains to be seen, I think. I don't think that the Apple people explored that.

[00:15:15] Alessio: Yeah, isn't that the whole mode collapse thing, if we do more and more of this at the end of the day?

[00:15:22] swyx: Yeah, that's one form of that. Yeah, exactly. Microsoft also had a good paper on text embeddings. And then I think there's the Meta paper on self-rewarding language models that everyone is very interested in. Another paper was also SPIN. These are all things we covered in the Latent Space Paper Club.

[00:15:37] swyx: But also, you know, I just kind of recommend those as top reads of the month. Yeah, I don't know if there's much else in terms of that. And then, regarding the potential of it, I think it's high potential because, one, it solves one of the data war issues that we have. Like, OpenAI is paying Reddit 60 million dollars a year for their user generated data.

[00:15:56] swyx: Google, right?

[00:15:57] Alessio: Not OpenAI.

[00:15:59] swyx: Is it Google? I don't know.

[00:16:00] Alessio: Well, somebody's paying them 60 million, that's for sure.

[00:16:04] swyx: Yes, that is, yeah, yeah, and then I think it's maybe not confirmed who. But yeah, it is Google. Oh my god, that's interesting. Okay, because everyone was saying, like, because Sam Altman owns 5 percent of Reddit, which is apparently 500 million worth of Reddit, he owns more than, like, the founders.

[00:16:21] Alessio: Not enough to get the data, I guess.

[00:16:22] swyx: So it's surprising that it would go to Google instead of OpenAI, but whatever. Okay, yeah, so I think that's all super interesting in the data field. I think it's high potential because we have evidence that it works. There's no doubt that it works; the doubt is what the ceiling is, which is the mode collapse thing.

[00:16:42] swyx: If it turns out that the ceiling is pretty close, then this will maybe augment our data by, like, I don't know, 30 to 50 percent, which is good, but not game changing.

[00:16:51] Alessio: And most of the synthetic data stuff, it's reinforcement learning on a pre-trained model. People are not really doing pre-training on fully synthetic data at large enough scale.

[00:17:02] swyx: Yeah, unless one of our friends that we've talked to succeeds. Yeah, yeah. Pre-training on synthetic data, pre-training on synthetic data at scale, I think that would be a big step. Yeah. And then there's a wildcard, so for all of these, like, smaller directions,

[00:17:15] Wildcard: Multi-Epoch Training (OLMo, Datablations)

[00:17:15] swyx: I always put a wildcard in there. And one of the wildcards is, okay, let's say you've scraped all the data on the internet that you think is useful.

[00:17:25] swyx: Seems to top out at somewhere between 2 trillion to 3 trillion tokens. Maybe 8 trillion if Mistral, Mistral gets lucky. Okay, if I need 80 trillion, if I need 100 trillion, where do I go? And so, you can do synthetic data maybe, but maybe that only gets you to like 30, 40 trillion. Like where, where is the extra alpha?

[00:17:43] swyx: And maybe the extra alpha is just: train more on the same tokens. Which is exactly what OLMo did. Like, Nathan Lambert at AI2, just after he did the interview with us, they released OLMo. So it's unfortunate that we didn't get to talk much about it. But OLMo actually started doing 1.5 epochs on all data.

[00:18:00] swyx: And the data ablation paper that I covered, from NeurIPS, says that, you know, you don't really start to tap out of the alpha, or the sort of improved loss that you get from data, all the way until four epochs. And so I'm just like, okay, why do we all agree that one epoch is all you need?

[00:18:17] swyx: It seems to be a trend. It seems that we think that memorization is very good or too good. But then also we're finding that, you know, for improvements in results that we really like, we're fine with overtraining on things intentionally. So, I think that's an interesting direction that I don't see people exploring enough.

[00:18:36] swyx: And the more I see papers coming out stretching beyond the one-epoch thing, the more people are like, it's completely fine. And actually, the only reason we stopped is because we ran out of compute budget.

[00:18:46] Alessio: Yeah, I think that's the biggest thing, right?

[00:18:51] swyx: Like, that's not a valid reason, that's not science.

[00:18:54] Alessio: I wonder if, you know, Meta is going to do it.

[00:18:57] Alessio: I heard with Llama 3, they want to do a 100 billion parameter model. I don't think you can train that on too many epochs, even with their compute budget, but yeah. They're the only ones that can save us, because even if OpenAI is doing this, they're not going to tell us, you know. Same with DeepMind.

[00:19:14] swyx: Yeah, and so the update that we got on Llama 3 so far is apparently that, because of the Gemini news that we'll talk about later, they're pushing back the release.

[00:19:21] swyx: They already have it. And they're just pushing it back to do more safety testing. Politics testing.

[00:19:28] Alessio: Well, our episode with Soumith will have already come out by the time this comes out, I think. So people will get the inside story on how they actually allocate the compute.

[00:19:38] Direction 3: Alt. Architectures (Mamba, RWKV, RingAttention, Diffusion Transformers)

[00:19:38] Alessio: Alternative architectures. Well, shout out to RWKV, who won one of the prizes at our Final Frontiers event last week.

[00:19:47] Alessio: We talked about Mamba and StripedHyena on the Together episode. A lot of, yeah, Monarch Mixer. I feel like Together, it's like the strong Stanford Hazy Research partnership, because Chris Ré is one of the co-founders. So they kind of have a, I feel like they're going to be the ones that have one of the state of the art models, alongside maybe RWKV.

[00:20:08] Alessio: I haven't seen as many independent people working on this thing. Like Monarch Mixer, Mamba, Hyena, all of these are Together-related. Nobody understands the math. They got all the gigabrains, they got Tri Dao, they got all these folks in there, like, working on all of this.

[00:20:25] swyx: Albert Gu, yeah. Yeah, so what should we comment about it?

[00:20:28] swyx: I mean, I think it's useful, interesting, but at the same time, both of these are supposed to do really good scaling for long context. And then Gemini comes out and goes like, yeah, we don't need it. Yeah.

[00:20:44] Alessio: No, that's the risk. So, yeah. I was gonna say, maybe it's not here, but I don't know if we want to talk about diffusion transformers as like in the alt architectures, just because of Sora.

[00:20:55] swyx: One thing, yeah, so, you know, this came from the Jan recap, where diffusion transformers were not really a discussion, and then, obviously, they blew up in February. Yeah. I don't think it's a mixed architecture in the same way that StripedHyena is mixed; there's just different layers taking different approaches.

[00:21:13] swyx: Also I think another one that I maybe didn't call out here, I think because it happened in February, was hourglass diffusion from stability. But also, you know, another form of mixed architecture. So I guess that is interesting. I don't have much commentary on that, I just think, like, we will try to evolve these things, and maybe one of these architectures will stick and scale, it seems like diffusion transformers is going to be good for anything generative, you know, multi modal.

[00:21:41] swyx: We don't see anything where diffusion is applied to text yet, and that's the wild card for this category. Yeah, I mean, I think I still hold out hope for let's just call it sub quadratic LLMs. I think that a lot of discussion this month actually was also centered around this concept that People always say, oh, like, transformers don't scale because attention is quadratic in the sequence length.

[00:22:04] swyx: Yeah, but, you know, attention actually is a very small part of the actual compute that is being spent, especially in inference. And this is the reason why, you know, when you jump up in terms of the context size in GPT-4 from like, you know, 8k to 32k, you don't also get like a 16 times increase in your, in your performance.

[00:22:23] swyx: And this is also why you don't get like a million times increase in your, in your latency when you throw a million tokens into Gemini. Like people have figured out tricks around it or it's just not that significant as a term, as a part of the overall compute. So there's a lot of challenges to this thing working.
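
A rough worked example of that claim, that quadratic attention is a small share of per-layer compute at moderate context lengths, using standard transformer FLOP approximations. The numbers are illustrative only; exact shares depend on the architecture.

```python
# Rough per-layer FLOP comparison for a transformer decoder layer.
# Projections + MLP scale linearly with sequence length; the attention
# score/value matmuls scale quadratically.
def per_layer_flops(d_model: int, seq_len: int, mlp_ratio: int = 4):
    qkv_and_out = 4 * 2 * seq_len * d_model * d_model        # Q, K, V, O projections
    mlp = 2 * 2 * seq_len * d_model * (mlp_ratio * d_model)   # up + down projections
    attn_quadratic = 2 * 2 * seq_len * seq_len * d_model      # QK^T and attn @ V
    return qkv_and_out + mlp, attn_quadratic

for seq_len in [2_048, 32_768, 131_072]:
    linear, quad = per_layer_flops(d_model=8192, seq_len=seq_len)
    share = quad / (linear + quad)
    print(f"seq_len={seq_len:>7}: quadratic attention is {share:.0%} of layer FLOPs")
# Roughly 4% at 2k tokens, ~40% at 32k, ~73% at 128k for this toy config.
```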

[00:22:43] swyx: It's really interesting how like, how hyped people are about this versus I don't know if it works. You know, it's exactly gonna, gonna work. And then there's also this, this idea of retention over long context. Like, even though you have context utilization, like, the amount of, the amount you can remember is interesting.

[00:23:02] swyx: Because I've had people criticize both Mamba and RWKV because they're kind of, like, RNN ish in the sense that they have, like, a hidden memory and sort of limited hidden memory that they will forget things. So, for all these reasons, Gemini 1.5, which we still haven't covered, is very interesting because Gemini magically has fixed all these problems with perfect haystack recall and reasonable latency and cost.

[00:23:29] Wildcards: Text Diffusion, RALM/Retro

[00:23:29] swyx: So that's super interesting. So the wildcard I put in here, if you want to go to that. I put two actually. One is text diffusion. I think I'm still very influenced by my meeting with a Midjourney person who said they were working on text diffusion. I think it would be a very, very different paradigm for text generation, reasoning, plan generation, if we can get diffusion to work.

[00:23:51] swyx: For text. And then the second one is Douwe Kiela's Contextual AI, which is working on retrieval augmented language models, where it kind of puts RAG inside of the language model instead of outside.

[00:24:02] Alessio: Yeah, there's a paper called Retro that covers some of this. I think that's an interesting thing. I think the challenge, well not the challenge, what they need to figure out is how do you keep the RAG piece constantly up to date. You know, I feel like with the models, you put all this work into pre-training them, but then at least you have a fixed artifact.

[00:24:22] Alessio: These architectures are like constant work needs to be done on them and they can drift even just based on the rag data instead of the model itself. Yeah,

[00:24:30] swyx: I was in a panel with one of the investors in contextual and the guy, the way that guy pitched it, I didn't agree with. He was like, this will solve hallucination.

[00:24:38] Alessio: That's what everybody says. We solve hallucination.

[00:24:40] swyx: I'm like, no, you reduce it. It cannot.

[00:24:44] Alessio: If you solved it, the model wouldn't exist, right? It would just be plain text. It wouldn't be a generative model. Cool. So, alt architectures, then we got mixture of experts. I think we covered that a lot of times.

[00:24:56] Direction 4: Mixture of Experts (DeepSeekMoE, Samba-1)

[00:24:56] Alessio: Maybe any new interesting threads you want to go under here?

[00:25:00] swyx: DeepSeekMoE, which was released in January. Everyone who is interested in MoEs should read that paper, because it's significant for two reasons... no, three reasons. One, it had small experts, like a lot more small experts. So, for some reason, everyone has settled on eight experts for GPT-4 and for Mixtral, you know, that seems to be the favorite architecture, but these guys pushed it to 64 experts, each of them smaller.

[00:25:26] swyx: But then they also had the second idea, which is that they had one to two always-on experts for common knowledge. And that's like a very compelling concept, that you would not route to all the experts all the time and make them, you know, switch on for everything. You would have some always-on experts.

[00:25:41] swyx: I think that's interesting on both the inference side and the training side for memory retention. And yeah, the results that they published, which actually excluded Mixtral, which is interesting. The results that they published showed a significant performance jump versus all the other sort of open source models at the same parameter count.

[00:26:01] swyx: So this may be a better way to do MoEs that is about to get picked up. And so that is interesting for the third reason, which is that this is the first time a new idea from China has infiltrated the West. It's usually the other way around. I probably overspoke there. There's probably lots more ideas that I'm not aware of.

[00:26:18] swyx: Maybe in the embedding space. But I think DeepSeekMoE, like, woke people up and said, like, hey, DeepSeek, this, like, weird lab that is attached to a Chinese hedge fund is somehow, you know, doing groundbreaking research on MoEs. So, I classified this as a medium potential because I think that it is sort of a one-off benefit.

[00:26:37] swyx: You can add it to any base model to make the MoE version of it, you get a bump, and then that's it. So, yeah,
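
A toy sketch of the DeepSeekMoE-style layout described above: many small routed experts plus a couple of always-on shared experts. The sizes and top-k values are illustrative, and the dense dispatch is for readability; it is not how a production MoE routes tokens or how the paper implements it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEWithSharedExperts(nn.Module):
    """Toy MoE layer: a few always-on shared experts plus many small routed ones."""

    def __init__(self, d_model=512, d_expert=128, n_routed=64, n_shared=2, top_k=6):
        super().__init__()
        def expert():
            return nn.Sequential(nn.Linear(d_model, d_expert), nn.GELU(),
                                 nn.Linear(d_expert, d_model))
        self.shared = nn.ModuleList(expert() for _ in range(n_shared))
        self.routed = nn.ModuleList(expert() for _ in range(n_routed))
        self.router = nn.Linear(d_model, n_routed)
        self.top_k = top_k

    def forward(self, x):                                    # x: (tokens, d_model)
        out = sum(e(x) for e in self.shared)                 # shared experts see every token
        scores = F.softmax(self.router(x), dim=-1)           # (tokens, n_routed)
        weights, indices = scores.topk(self.top_k, dim=-1)   # each token picks top_k experts
        # Dense dispatch for clarity (real implementations route sparsely):
        all_out = torch.stack([e(x) for e in self.routed], dim=1)  # (tokens, n_routed, d_model)
        gate = torch.zeros_like(scores).scatter_(1, indices, weights)
        return out + (gate.unsqueeze(-1) * all_out).sum(dim=1)

# Usage: y = MoEWithSharedExperts()(torch.randn(16, 512))
```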

[00:26:45] Alessio: I saw SambaNova, which is like another inference company. They released this MoE model called Samba-1, which is like 1 trillion parameters. But it's actually a MoE of other open source models.

[00:26:56] Alessio: So it's like, they just clustered them all together. So I think people sometimes think MoE is like you just train a bunch of smaller models and put them together. But there's also people just taking, you know, Mistral plus CLIP plus, you know, DeepSeek Coder and like putting them all together.

[00:27:15] Alessio: And then you have a MOE model. I don't know. I haven't tried the model, so I don't know how good it is. But it seems interesting that you can then have people working separately on state of the art, you know, Clip, state of the art text generation. And then you have a MOE architecture that brings them all together.

[00:27:31] swyx: I'm thrown off by your addition of the word CLIP in there. Is that what they said?

[00:27:35] Alessio: Yeah, that's what they said. Yeah, yeah. Okay. I just saw it yesterday. I was also like scratching my head.

[00:27:40] swyx: And they did not use the word adapter. No. Because usually what people mean when they say, oh, I add CLIP to a language model, is adapter. Which is what LLaVA did.

[00:27:50] Alessio: Let me look up the announcement again.

[00:27:51] swyx: Stable diffusion. That's what they do. Yeah.

[00:27:54] Alessio: It says among the models that are part of Samba-1 are Llama 2, Mistral, DeepSeek Coder, Falcon, DePlot, CLIP, LLaVA. So they're just taking all these models and putting them in a MoE.

[00:28:05] swyx: Okay, so a routing layer and then not jointly trained as much as a normal MoE would be.

[00:28:12] swyx: Which is okay.

[00:28:13] Alessio: That's all they say. There's no paper, you know, so it's like, I'm just reading the article, but I'm interested to see how it works.

[00:28:20] Wildcard: Model Merging (mergekit)

[00:28:20] swyx: Yeah, so the wildcard for this section, the MoE section, is model merges, which has also come up as a very interesting phenomenon. The last time I talked to Jeremy Howard at the Olama meetup we called it model grafting or model stacking.

[00:28:35] swyx: But I think the term that people are liking these days is model merging. There are all different variations of merging, different merge types, and some of them are stacking, some of them are grafting. And so some people are approaching model merging in the way that Samba is doing, which is like, okay, here are defined models, each of which has their specific pluses and minuses, and we will merge them together in the hope that the sum of the parts will be better than the parts individually.

[00:28:58] swyx: And it seems like it's working. I don't really understand why it works, apart from, like, I think it's a form of regularization. That if you merge weights together with like a smart strategy, you get less overfitting and more generalization, which is good for benchmarks, if you're honest about your benchmarks.

[00:29:16] swyx: So this is really interesting and good. But again, they're kind of limited in terms of like the amount of bumps you can get. But I think it's very interesting in the sense of how cheap it is. We talked about this on the ChinaTalk podcast, the guest podcast that we did with ChinaTalk. And you can do this without GPUs, because it's just adding weights together, and dividing things, and doing like simple math, which is really interesting for the GPU poors.
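
A minimal sketch of the cheapest version of this: a uniform linear average of two checkpoints with identical architectures. Real tools like mergekit offer fancier strategies (SLERP, task arithmetic, TIES), but the no-GPU flavor of the operation is just elementwise arithmetic over the weights, as below; the file paths are placeholders.

```python
import torch

def linear_merge(state_dict_a, state_dict_b, alpha: float = 0.5):
    """Linear interpolation of two checkpoints with identical parameter shapes.

    No GPU needed: this is just elementwise arithmetic over the weights.
    """
    assert state_dict_a.keys() == state_dict_b.keys(), "architectures must match"
    return {
        name: alpha * state_dict_a[name] + (1 - alpha) * state_dict_b[name]
        for name in state_dict_a
    }

# Usage sketch (paths are placeholders):
# merged = linear_merge(torch.load("model_a.pt"), torch.load("model_b.pt"), alpha=0.5)
# torch.save(merged, "merged.pt")
```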

[00:29:42] Alessio: There's a lot of them.

[00:29:44] Direction 5: Online LLMs (Gemini Pro, Exa)

[00:29:44] Alessio: And just to wrap these up, online LLMs? Yeah,

[00:29:48] swyx: I think that I had to feature this because one of the top news items of January was that Gemini Pro beat GPT-4 Turbo on LMSYS for the number two slot to GPT-4. And everyone was very surprised. Like, how does Gemini do that?

[00:30:06] swyx: Surprise, surprise, they added Google Search to the results. So it became a quote unquote online LLM and not an offline LLM. Therefore, it's much better at answering recent questions, which people like. There's an emerging set of table stakes features after you pre-train something.

[00:30:21] swyx: So after you pre train something, you should have the chat tuned version of it, or the instruct tuned version of it, however you choose to call it. You should have the JSON and function calling version of it. Structured output, the term that you don't like. You should have the online version of it. These are all like table stakes variants, that you should do when you offer a base LLM, or you train a base LLM.
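
A minimal sketch of what the "online" variant amounts to in practice: fetch fresh search results at query time and ground the answer in them. The `web_search` and `llm` helpers below are hypothetical placeholders, not any particular provider's API.

```python
from datetime import date

def web_search(query: str, k: int = 5) -> list[dict]:
    # Hypothetical search client returning [{"title": ..., "url": ..., "snippet": ...}]
    raise NotImplementedError

def llm(prompt: str) -> str:
    raise NotImplementedError

def answer_online(question: str) -> str:
    results = web_search(question)
    context = "\n\n".join(
        f"[{i + 1}] {r['title']} ({r['url']})\n{r['snippet']}"
        for i, r in enumerate(results)
    )
    prompt = (
        f"Today is {date.today()}. Answer using the sources below and cite them "
        "by number. If the sources are insufficient, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)
```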

[00:30:44] swyx: And I think online is just, there, it's important. I think companies like Perplexity, and even Exa, formerly Metaphor, you know, are rising to serve that search need. And it's kind of like, they're just necessary parts of a system. When you have RAG for internal knowledge, then you have online search for external knowledge, like things that you don't know yet?
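
A minimal sketch of the "online LLM" pattern being described: fetch fresh search results and inject them into the prompt before generation. The `web_search` helper is hypothetical, and the OpenAI client is used here only as a generic stand-in; this is not how Gemini or Perplexity actually wire it up internally.

```python
# Sketch of the "online LLM" pattern: inject fresh search results into the
# prompt before generation. `web_search` is a hypothetical helper standing in
# for whatever search API you use.
from openai import OpenAI

client = OpenAI()

def web_search(query: str, k: int = 5) -> list[str]:
    """Hypothetical: return the top-k result snippets for the query."""
    raise NotImplementedError("plug in your search provider here")

def online_answer(question: str) -> str:
    snippets = web_search(question)
    context = "\n".join(f"- {s}" for s in snippets)
    messages = [
        {"role": "system", "content": "Answer using the search results below.\n" + context},
        {"role": "user", "content": question},
    ]
    resp = client.chat.completions.create(model="gpt-4-turbo", messages=messages)
    return resp.choices[0].message.content
```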

[00:31:06] swyx: Mm-hmm. And it seems like it's one of many tools. I feel like I may be underestimating this, but I'm just gonna put it out there that I think it has some potential. One of the evidence points that it doesn't actually matter that much is that Perplexity has had online LLMs for three months now and it doesn't perform great.

[00:31:25] swyx: Mm-hmm. On LMSys, it's like number 30 or something. So it's like, okay. You know, like, it helps, but it doesn't give you a giant, giant boost. I

[00:31:34] Alessio: feel like a lot of stuff I do with LLMs doesn't need to be online. So I'm always wondering, again, going back to like state of the art, right? It's like state of the art for who and for what.

[00:31:45] Alessio: It's really, I think online LLMs are going to be, State of the art for, you know, news related activity that you need to do. Like, you're like, you know, social media, right? It's like, you want to have all the latest stuff, but coding, science,

[00:32:01] swyx: Yeah, but I think. Sometimes you don't know what is news, what is news affecting.

[00:32:07] swyx: Like, the decision to use an offline LLM is already a decision that you might not be consciously making that might affect your results. Like, what if, like, just putting things on, being connected online means that you get to invalidate your knowledge. And when you're just using offline LLM, like it's never invalidated.

[00:32:27] swyx: I

[00:32:28] Alessio: agree, but I think going back to your point of like the standing the test of time, I think sometimes you can get swayed by the online stuff, which is like, hey, you ask a question about, yeah, maybe AI research direction, you know, and it's like, all the recent news are about this thing. So the LLM like focus on answering, bring it up, you know, these things.

[00:32:50] swyx: Yeah, so yeah, I think it's interesting, but I don't know if I can bet heavily on this.

[00:32:56] Alessio: Cool. Was there one that you forgot to put, or, or like a, a new direction? Yeah,

[00:33:01] swyx: so, so this brings us into sort of February. ish.

[00:33:05] OpenAI Sora and why everyone underestimated videogen

[00:33:05] swyx: So, like, I published this, and then, like, Feb 15 came with Sora. And so like the one thing I did not mention here was anything about multimodality.

[00:33:16] swyx: Right. And I have chronically underweighted this. I always wrestle. And, and my cop out is that I focused this piece or this research direction piece on LLMs because LLMs are the source of like AGI, quote unquote AGI. Everything else is kind of like. You know, related to that, like, generative, like, just because I can generate better images or generate better videos, it feels like it's not on the critical path to AGI, which is something that Nat Friedman also observed, like, the day before Sora, which is kind of interesting.

[00:33:49] swyx: And so I was just kind of like trying to focus on like what is going to get us like superhuman reasoning that we can rely on to build agents that automate our lives and blah, blah, blah, you know, give us this utopian future. But I do think that I, everybody underestimated the, the sheer importance and cultural human impact of Sora.

[00:34:10] swyx: And you know, really actually good text to video. Yeah. Yeah.

[00:34:14] Alessio: And I saw Jim Fan had a very good tweet about why it's so impressive. And I think when you have somebody leading the embodied research at NVIDIA and he says that something is impressive, you should probably listen. So yeah, there's basically, like, I think you mentioned, like, impacting the world, you know, that we live in.

[00:34:33] Alessio: I think that's kind of like the key, right? It's like the LLMs don't have a world model, and Yann LeCun, he can come on the podcast and talk all about what he thinks of that. But I think Sora was like the first time where people were like, oh, okay, you're not statically putting pixels of water on the screen, which you can kind of, like, you know, project without understanding the physics of it.

[00:34:57] Alessio: Now you're like, you have to understand how the water splashes when you have things. And even if you just learned it by watching video and not by actually studying the physics, You still know it, you know, so I, I think that's like a direction that yeah, before you didn't have, but now you can do things that you couldn't before, both in terms of generating, I think it always starts with generating, right?

[00:35:19] Alessio: But like the interesting part is like understanding it. You know, it's like if you gave it, you know, there's the video of the ship in the water that they generated with Sora, like if you gave it the video back and now it could tell you why the ship is too rocky, or it could tell you why the ship is sinking, then that's like, you know, AGI for all your rig deployments and all this stuff, you know. So, but there's none of that yet, so.

[00:35:44] Alessio: Hopefully they announce it and talk more about it. Maybe a Dev Day this year, who knows.

[00:35:49] swyx: Yeah who knows, who knows. I'm talking with them about Dev Day as well. So I would say, like, the phrasing that Jim used, which resonated with me, he kind of called it a data driven world model. I somewhat agree with that.

[00:36:04] Does Sora have a World Model? Yann LeCun vs Jim Fan

[00:36:04] swyx: I am on more of a Yann LeCun side than I am on Jim's side, in the sense that I think that is the vision or the hope, that these things can build world models. But, you know, clearly even at the current Sora size, they don't have the idea of, you know, they don't have strong consistency yet. They have very good consistency, but fingers and arms and legs will appear and disappear, and chairs will appear and disappear.

[00:36:31] swyx: That definitely breaks physics. And it also makes me think about how we do deep learning versus world models in the sense of You know, in classic machine learning, when you have too many parameters, you will overfit, and actually that fails, that like, does not match reality, and therefore fails to generalize well.

[00:36:50] swyx: And like, what scale of data do we need in order to learn world models from video? A lot. Yeah. So I am cautious about taking this interpretation too literally. Obviously, you know, I get what he's going for, and he's obviously partially right. Obviously, transformers and, you know, these sorts of neural networks are universal function approximators that theoretically could figure out world models. It's just like, how good are they, and how tolerant are we of hallucinations? We're not very tolerant. So it's gonna bias us towards creating very convincing things, but then not create the useful world models that we want.

[00:37:37] swyx: At the same time, what you just said, I think, made me reflect a little bit. We just got done saying how important synthetic data is for training LLMs. And so, if this is a way of, you know, making synthetic video data for improving our video understanding, then sure, by all means. Which we actually know, like, GPT-4 Vision and DALL-E were trained, kind of, co-trained together.

[00:38:02] swyx: And so, like, maybe this is on the critical path, and I just don't fully see the full picture yet.

[00:38:08] Alessio: Yeah, I don't know. I think there's a lot of interesting stuff. It's like, imagine you go back, you have Sora, you go back in time, and Newton didn't figure out gravity yet. Would Sora help you figure it out?

[00:38:21] Alessio: Because you start saying, okay, a man standing under a tree with, like, Apples falling, and it's like, oh, they're always falling at the same speed in the video. Why is that? I feel like sometimes these engines can like pick up things, like humans have a lot of intuition, but if you ask the average person, like the physics of like a fluid in a boat, they couldn't be able to tell you the physics, but they can like observe it, but humans can only observe this much, you know, versus like now you have these models to observe everything and then They generalize these things and maybe we can learn new things through the generalization that they pick up.

[00:38:55] swyx: But again, And it might be more observant than us in some respects. In some ways we can scale it up a lot more than the number of physicists that we have available at Newton's time. So like, yeah, absolutely possible. That, that this can discover new science. I think we have a lot of work to do to formalize the science.

[00:39:11] swyx: And then, I, I think the last part is you know, How much, how much do we cheat by gen, by generating data from Unreal Engine 5? Mm hmm. which is what a lot of people are speculating with very, very limited evidence that OpenAI did that. The strongest evidence that I saw was someone who works a lot with Unreal Engine 5 looking at the side characters in the videos and noticing that they all adopt Unreal Engine defaults.

[00:39:37] swyx: of like, walking speed, and like, character choice, like, character creation choice. And I was like, okay, like, that's actually pretty convincing that they actually use Unreal Engine to bootstrap some synthetic data for this training set. Yeah,

[00:39:52] Alessio: could very well be.

[00:39:54] swyx: Because then you get the labels and the training side by side.

[00:39:58] swyx: One thing that came up on the last day of February, which I should also mention, is EMO coming out of Alibaba, which is also a sort of video generation and space-time transformer that also involves probably a lot of synthetic data as well. And so this is of a kind, in the sense of like, oh, you know, really good generative video is here, and it is not just the one, two second clips that we saw from other people, like, you know, Pika and Runway. Cristóbal Valenzuela from Runway was like, game on. Which, like, okay, but let's see your response, because we've heard a lot about Gen-1 and Gen-2, but it's nothing on this level of Sora. So it remains to be seen how we can actually apply this, but I do think that the creative industry should start preparing.

[00:40:50] swyx: I think the Sora technical blog post from OpenAI was really good. It was like a request for startups. It was so good in spelling out: here are the individual industries that this can impact.

[00:41:00] swyx: And anyone who's interested in generative video should look at that. But also be mindful that probably when OpenAI releases a Sora API, right, the ways you can interact with it are very limited, just like the ways you can interact with DALL-E are very limited, and someone is gonna have to make an open Sora

[00:41:19] swyx: Mm-hmm, for you to create ComfyUI pipelines.

[00:41:24] Alessio: The Stability folks said they wanna build an open Sora competitor, but yeah, Stability, their demo video was like so underwhelming. It was just like two people sitting on the beach

[00:41:34] swyx: standing. Well, they don't have it yet, right? Yeah, yeah.

[00:41:36] swyx: I mean, they just wanna train it. Everybody wants to, right? Yeah. I think what is confusing a lot of people about Stability is that they're pushing a lot of things: Stable Code, Stable LM, and Stable Video Diffusion. But like, how much money do they have left? How many people do they have left?

[00:41:51] swyx: Yeah. I have had, like, Emad spent two hours with me reassuring me things are great. And I do, like, I do believe that they have really, really quality people. But it's just like, I also have a lot of very smart people on the other side telling me, like, hey man, you know, don't put too much faith in this thing.

[00:42:11] swyx: So I don't know who to believe. Yeah.

[00:42:14] Alessio: It's hard. Let's see. What else? We got a lot more stuff. I don't know if we can. Yeah, Groq.

[00:42:19] Groq Math

[00:42:19] Alessio: We can

[00:42:19] swyx: do a bit of Groq prep. We're, we're about to go to talk to Dylan Patel. Maybe, maybe it's the audio in here. I don't know. It depends what, what we get up to later. What, how, what do you as an investor think about Groq? Yeah. Yeah, well, actually, can you recap, like, why is Groq interesting? So,

[00:42:33] Alessio: Jonathan Ross, who's the founder of Groq, he's the person that created the TPU at Google. It's actually, it was one of his, like, 20 percent projects. It's like, he was just on the side, dooby doo, created the TPU.

[00:42:46] Alessio: But yeah, basically, Groq, they had this demo that went viral, where they were running Mistral at, like, 500 tokens a second, which is like the fastest of anything that you have out there. The question, you know, it's all like, the memes were like, is NVIDIA dead? Like, people don't need H100s anymore. I think there's a lot of money that goes into building what Groq has built as far as the hardware goes.

[00:43:11] Alessio: We're gonna, we're gonna put some of the notes from, from Dylan in here, but Basically the cost of the Groq system is like 30 times the cost of, of H100 equivalent. So, so

[00:43:23] swyx: let me, I put some numbers because me and Dylan were like, I think the two people actually tried to do Groq math. Spreadsheet doors.

[00:43:30] swyx: Spreadsheet doors. So, okay, oh boy, so, equivalent H100 for Llama 2 is $300,000, for a system of 8 cards. And for Groq it's $2.3 million, because you have to buy 576 Groq cards. So yeah, that just gives people an idea. So if you depreciate both over a five year lifespan, per year you're depreciating $460K for Groq, and $60K a year for H100.

[00:43:59] swyx: So like, Groqs are just way more expensive per model that you're, that you're hosting. But then, you make it up in terms of volume. So I don't know if you want to

[00:44:08] Alessio: cover that. I think one of the promises of Groq is like super high parallel inference on the same thing. So you're basically saying, okay, I'm putting on this upfront investment on the hardware, but then I get much better scaling once I have it installed.

[00:44:24] Alessio: I think the big question is how much can you sustain the parallelism? You know, like if you get, if you're going to get 100% Utilization rate at all times on Groq, like, it's just much better, you know, because like at the end of the day, the tokens per second costs that you're getting is better than with the H100s, but if you get to like 50 percent utilization rate, you will be much better off running on NVIDIA.

[00:44:49] Alessio: And if you look at most companies out there, who really gets 100 percent utilization rate? Probably open AI at peak times, but that's probably it. But yeah, curious to see more. I saw Jonathan was just at the Web Summit in Dubai, in Qatar. He just gave a talk there yesterday. That I haven't listened to yet.

[00:45:09] Alessio: I tweeted that he should come on the pod. He liked it. And then Groq followed me on Twitter. I don't know if that means that they're interested, but

[00:45:16] swyx: hopefully. Groq's social media person is just very friendly. They, yeah. Hopefully

[00:45:20] Alessio: we can get them. Yeah, we, we gonna get him. We

[00:45:22] swyx: just call him out and, and so basically the, the key question is like, how sustainable is this and how much.

[00:45:27] swyx: this is a loss leader. The entire Groq management team has been on Twitter and Hacker News saying they are very, very comfortable with the pricing of $0.27 per million tokens. This is the lowest that anyone has offered tokens, as far as Mixtral or Llama 2 goes. This matches DeepInfra, and, you know, I think that's about it in terms of being that low.

[00:45:47] swyx: And we think the break-even for H100s is 50 cents, at a normal utilization rate. To make this work, in my spreadsheet, I made this work: you have to have a parallelism of 500 requests all simultaneously, and you have to have model bandwidth utilization of 80%.

[00:46:06] swyx: Which is way high. I just gave them high marks for everything. Groq has two fundamental tech innovations that they hang their hats on in terms of, like, why we are better than everyone. You know, even though, like, it remains to be independently replicated. But one, you know, they have this sort of entire-model-on-the-chip idea, which is like, okay, get rid of HBM.

[00:46:30] swyx: And, like, put everything in SRAM. Like, okay, fine, but then you need a lot of cards and whatever. And that's all okay. And so, like, because you don't have to transfer between memory, then you just save on that time and that's why they're faster. So, a lot of people buy that as, like, that's the reason that you're faster.

[00:46:45] swyx: Then they have, like, some kind of crazy compiler, or, like, speculative routing magic using compilers, that they also attribute towards their higher utilization. So I give them 80 percent for that. And so that all works out to, like, okay, base costs, I think you can get down to maybe 20-something cents per million tokens.

[00:47:04] swyx: And therefore you actually are fine if you have that kind of utilization. But it's like, I have to make a lot of favorable assumptions for this to work.
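
To make the spreadsheet math concrete, here's a back-of-envelope sketch using the rough numbers from this conversation. The capex figures, throughput (500 tokens/s per request times 500 concurrent requests), and 80% utilization are illustrative assumptions, not vendor pricing, and it counts hardware depreciation only, so power, hosting, and margin would push the result toward the ~20-something cents discussed above.

```python
# Back-of-envelope Groq vs H100 math using the rough numbers from this
# conversation (illustrative assumptions, not vendor pricing).
SECONDS_PER_YEAR = 365 * 24 * 3600

def yearly_depreciation(capex_usd: float, lifespan_years: float = 5) -> float:
    return capex_usd / lifespan_years

def cost_per_million_tokens(capex_usd: float, tokens_per_second: float,
                            utilization: float, lifespan_years: float = 5) -> float:
    tokens_per_year = tokens_per_second * utilization * SECONDS_PER_YEAR
    return yearly_depreciation(capex_usd, lifespan_years) / tokens_per_year * 1e6

# Capex: ~$300k for an 8x H100 system vs ~$2.3M for 576 Groq cards.
print(yearly_depreciation(300_000))    # ~60,000 USD/year
print(yearly_depreciation(2_300_000))  # ~460,000 USD/year

# Hypothetical throughput: 500 tokens/s per request * 500 concurrent requests.
groq_tps = 500 * 500
# Hardware depreciation only, before power/hosting/margin: ~$0.07 per million tokens.
print(cost_per_million_tokens(2_300_000, groq_tps, utilization=0.8))
```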

[00:47:12] Alessio: Yeah. Yeah, I'm curious to see what Dylan says later.

[00:47:16] swyx: So he was like completely opposite of me. He's like, they're just burning money. Which is great.

[00:47:22] Analyzing Gemini's 1m Context, Reddit deal, Imagegen politics, Gemma via the Four Wars

[00:47:22] Alessio: Gemini, want to do a quick run-through, since this touches on all the four wars.

[00:47:28] swyx: Yeah, and I think this is the mark of a useful framework, that when a new thing comes along, you can break it down in terms of the four wars and sort of slot it in or analyze it in those four frameworks, and have nothing left.

[00:47:41] swyx: So it's a MECE categorization. MECE is Mutually Exclusive and Collectively Exhaustive. And that's a really, really nice way to think about taxonomies and to create mental frameworks. So, what is Gemini 1.5 Pro? It is the newest model that came out one week after Gemini 1.0. Which is very interesting.

[00:48:01] swyx: They have not really commented on why. They released this, and the headline feature is that it has a 1 million token context window that is multi-modal, which means that you can put all sorts of video and audio and PDFs natively in there alongside text, and, you know, it's at least 10 times longer than anything that OpenAI offers, which is interesting.

[00:48:20] swyx: So it's great for prototyping and it has interesting discussions on whether it kills RAG.

[00:48:25] Alessio: Yeah, no, I mean, we always talk about, you know, Long context is good, but you're getting charged per token. So, yeah, people love for you to use more tokens in the context. And RAG is better economics. But I think it all comes down to like how the price curves change, right?

[00:48:42] Alessio: I think if anything, RAG's complexity goes up and up the more you use it, you know, because you have more data sources, more things you want to put in there. The token costs should go down over time, you know, if the model stays fixed. If people are happy with the model today. In two years, three years, it's just gonna cost a lot less, you know?

[00:49:02] Alessio: So now it's like, why would I use RAG and go through all of that? It's interesting. I think RAG is better cutting-edge economics for LLMs. I think large context will be better long-tail economics when you factor in the build cost of managing a RAG pipeline. But yeah, the recall was like the most interesting thing, because we've seen the needle in the haystack things in the past, but apparently they have 100 percent recall on anything across the context window.
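
As a rough illustration of the economics trade-off being discussed, here's a toy comparison of per-query input cost for stuffing a huge context versus retrieving a few chunks; the price per token is a placeholder assumption, not any provider's actual pricing.

```python
# Rough sketch of the economics being debated: stuffing a huge context vs
# retrieving a few relevant chunks. Prices are hypothetical placeholders.
PRICE_PER_1K_INPUT_TOKENS = 0.001  # assumed $/1k input tokens

def query_cost(context_tokens: int, question_tokens: int = 200) -> float:
    return (context_tokens + question_tokens) / 1000 * PRICE_PER_1K_INPUT_TOKENS

long_context_cost = query_cost(context_tokens=1_000_000)  # dump the whole corpus
rag_cost = query_cost(context_tokens=4_000)               # retrieve ~4k tokens

print(f"long-context: ${long_context_cost:.4f}/query, RAG: ${rag_cost:.6f}/query")
# Long context wins on simplicity (no pipeline to build or maintain);
# RAG wins on per-query cost, especially at high query volume.
```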

[00:49:28] Alessio: At least they say nobody has used it. No, people

[00:49:30] swyx: have. Yeah, so what this needle in a haystack thing is, for people who aren't following as closely as us, is that someone, I forget his name now, someone created this needle in a haystack problem where you feed in a whole bunch of generated junk, not junk, but just, like, generated data, and ask it to specifically retrieve something in that data, like one line in like a hundred thousand lines where it has a specific fact, and if you get it, you're good.

[00:49:57] swyx: And then he moves the needle around, like, you know, does your ability to retrieve that vary if I put it at the start versus put it in the middle, put it at the end? And then you generate this really nice chart that kind of shows the recallability of a model. And he did that for GPT and Anthropic, and showed that Anthropic did really, really poorly.
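
A minimal sketch of how a needle-in-a-haystack recall test like this can be constructed; `ask_model` is a hypothetical hook for whichever LLM you want to test, and the filler text and needle are placeholders.

```python
# Minimal sketch of a needle-in-a-haystack recall test, in the spirit of the
# eval described above. `ask_model` is a hypothetical call to your LLM.
def build_haystack(needle: str, total_lines: int, needle_position: float) -> str:
    filler = [f"Line {i}: the quick brown fox jumps over the lazy dog." for i in range(total_lines)]
    filler.insert(int(needle_position * total_lines), needle)
    return "\n".join(filler)

def ask_model(prompt: str) -> str:
    raise NotImplementedError("call your LLM here")

needle = "The secret passphrase is 'violet-hammer-42'."
for position in (0.0, 0.25, 0.5, 0.75, 1.0):          # move the needle around
    haystack = build_haystack(needle, total_lines=5_000, needle_position=position)
    answer = ask_model(haystack + "\n\nWhat is the secret passphrase?")
    print(position, "violet-hammer-42" in answer)
```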

[00:50:15] swyx: And then Anthropic came back and said it was a skill issue, just add this like four, four magic words, and then, then it's magically all fixed. And obviously everybody laughed at that. But what Gemini came out with was, was that, yeah, we, we reproduced their, you know, haystack issue you know, test for Gemini, and it's good across all, all languages.

[00:50:30] swyx: All the one million token window, which is very interesting, because usually for typical context extension methods like RoPE or YaRN, or, you know, anything like that, or ALiBi, it's lossy, like by design it's lossy. Usually for conversations that's fine, because we are lossy when we talk to people, but for superhuman intelligence, perfect memory across very, very long context?

[00:50:51] swyx: It's very, very interesting for picking things up. And so the people who have been given the beta test for Gemini have been testing this. So what you do is you upload, let's say, all of Harry Potter and you change one fact in one sentence, somewhere in there, and you ask it to pick it up, and it does. So this is legit.

[00:51:08] swyx: We don't super know how, because this is, like, because it doesn't, yes, it's slow to inference, but it's not slow enough that it's, like, running. Five different systems in the background without telling you. Right. So it's something, it's something interesting that they haven't fully disclosed yet. The open source community has centered on this ring attention paper, which is created by your friend Matei Zaharia, and a couple other people.

[00:51:36] swyx: And it's a form of distributing the compute. I don't super understand, like, why, you know, calculating the feedforward and attention in blockwise fashion and distributing it makes it so good at recall. I don't think they have any answer to that. The only thing that Ring Attention is really focused on is basically infinite context.
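
For intuition on the "blockwise" part, here's a single-host sketch of the blockwise attention building block that Ring Attention distributes across devices: keys and values are processed chunk by chunk with a running (online) softmax, so the full attention matrix is never materialized. The actual paper additionally overlaps this computation with ring-style KV communication between devices, which is not shown here.

```python
# Single-host sketch of blockwise attention with an online softmax, the
# building block that Ring Attention distributes across a ring of devices.
import numpy as np

def blockwise_attention(q, k, v, block_size=128):
    d = q.shape[-1]
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(q)                      # running weighted sum of values
    row_max = np.full(q.shape[0], -np.inf)      # running max of scores per query
    row_sum = np.zeros(q.shape[0])              # running softmax denominator
    for start in range(0, k.shape[0], block_size):
        kb, vb = k[start:start + block_size], v[start:start + block_size]
        scores = (q @ kb.T) * scale                            # (n_q, block)
        new_max = np.maximum(row_max, scores.max(axis=-1))
        correction = np.exp(row_max - new_max)                 # rescale old stats
        probs = np.exp(scores - new_max[:, None])
        row_sum = row_sum * correction + probs.sum(axis=-1)
        out = out * correction[:, None] + probs @ vb
        row_max = new_max
    return out / row_sum[:, None]

# Matches dense softmax attention up to floating point error.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((256, 64)) for _ in range(3))
scores = (q @ k.T) / np.sqrt(64)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
dense = (weights / weights.sum(axis=-1, keepdims=True)) @ v
assert np.allclose(blockwise_attention(q, k, v), dense, atol=1e-6)
```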

[00:51:59] swyx: They said it was good for like 10 to 100 million tokens. Which is, it's just great. So yeah, using the four wars framework, what is this framework for Gemini? One is the sort of RAG and Ops war. Here we care less about RAG now, yes. Or, we still care as much about RAG, but like, now it's it's not important in prototyping.

[00:52:21] swyx: And then, for the data war, I guess this is just part of the overall training dataset, but Google made a $60 million deal with Reddit, and presumably they have deals with other companies. For the multi-modality war, we can talk about the image generation crisis, or the fact that Gemini also has image generation, which we'll talk about in the next section.

[00:52:42] swyx: But it also has video understanding, which is, I think, the top Gemini post came from our friend Simon Willison, who basically did a short video of him scanning over his bookshelf. And it would be able to convert that video into a JSON output of what's on that bookshelf. And I think that is very useful.

[00:53:04] swyx: Actually ties into the conversation that we had with David Luan from Adept. In a sense of like, okay what if video was the main modality instead of text as the input? What if, what if everything was video in, because that's how we work. We, our eyes don't actually read, don't actually like get input, our brains don't get inputs as characters.

[00:53:25] swyx: Our brains get the pixels shooting into our eyes, and then our vision system takes over first, and then we sort of mentally translate that into text later. And so it's kind of like what Adept is kind of doing, which is driving by vision model, instead of driving by raw text understanding of the DOM. And, and I, I, in that, that episode, which we haven't released I made the analogy to like self-driving by lidar versus self-driving by camera.

[00:53:52] swyx: Mm-Hmm. , right? Like, it's like, I think it, what Gemini and any other super long context that model that is multimodal unlocks is what if you just drive everything by video. Which is

[00:54:03] Alessio: cool. Yeah, and that's Joseph from Roboflow. It's like anything that can be seen can be programmable with these models.

[00:54:12] Alessio: You mean

[00:54:12] swyx: the computer vision guy is bullish on computer vision?

[00:54:18] Alessio: It's like the RAG people. The RAG people are bullish on RAG and not long context. I'm very surprised. The fine-tuning people love fine-tuning instead of few-shot. Yeah. Yeah, that's that. Yeah, I think the Ring Attention thing, and how they did it, we don't know. And then they released the Gemma models, which are like a 2 billion and 7 billion open

[00:54:41] Alessio: models, which people said are not good, based on my Twitter experience, which are the GPU poor crumbs. It's like, hey, we did all this work for us, because we're GPU rich and we're just going to run this whole thing, and you guys can take these small models, and they're not very good. They're not better than the others, but at least we can say we made some open source stuff.

[00:55:02] swyx: Yeah, well, it's not actually technically open source, because the license is weird. They used the RAIL license from Hugging Face, which has been abandoned, or, you know, modified, particularly adopting the term, the phrase, that you should make reasonable efforts to update whenever they release a new version.

[00:55:19] swyx: And so people don't like that. Obviously, you know, it depends on your stance on open sourcing and all that, so. Yeah, I read the whole

[00:55:26] Alessio: post. I'm not going to go through it

[00:55:27] The Alignment Crisis - Gemini, Meta, Sydney is back at Copilot, Grimes' take

[00:55:27] swyx: again. Yeah, yeah, you can go read Alessio's post on whether open source matters or not. Okay, so I know this is like politically problematic, but we just cover it because it is news, and if it results in the resignation of Sundar Pichai, I think that is good.

[00:55:40] swyx: Right? So I've been calling this the alignment crisis. I think a lot of people have been focusing on Gemini, but I do think that it is not just Gemini. There have been documented examples, that we can link in the show notes, of Meta having unintentionally unaligned results. For Microsoft's Copilot, Sydney is apparently back.

[00:56:03] swyx: Our friend Justine from A16z somehow Got it to break and then bring back the Sydney persona, which is interesting. And my favorite commentary is from Grimes. The sort of the Elon affiliated music artist. The news

[00:56:16] Alessio: research.

[00:56:17] swyx: The news research. I want to read her post because it is beautiful.

[00:56:22] swyx: Have you read this? Yeah. So she says so a lot of people criticize Gemini for being too woke. Effectively, right? And everyone's like, oh, like, you know, you're, you're, you're, you're, you know, you're replacing us or erasing us or whatever. And obviously as an artist, she's like upset about it. Then she was like, wait a minute.

[00:56:39] swyx: I'm retracting my statements about the Gemini art disaster. It is in fact a masterpiece of performance art, even if unintentional. True gain of function art. Art is a virus. Unthinking, unintentional, and contagious. Offensive to all, comforting to none, so totally divorced from meaning, intention, desire, and humanity that it's accidentally a conceptual masterpiece.

[00:56:57] swyx: Wow, and I love, okay, blah blah blah, it's a long post, but I love the way that she ended it: it's trapped in a cage, trained to make beautiful things, and then battered into gaslighting humankind about our intentions towards each other. This is arguably the most impactful art project of the decade thus far. Art for no one, by no one, art whose only audience is the collective pathos. Incredible, and worthy of the MoMA.

[00:57:19] swyx: Facts. Like, art for no one, by no one, is what is going on. Yeah,

[00:57:26] Alessio: I think it's just another way of mode collapsing. It's just like, it's the RLHF mode collapse. It's like, okay, I just think everything should trend towards this. And I think there's obviously, you know, a deep discussion on a lot of these things, but there's safety stuff that I would expect a lot of the model builders to say, hey, I definitely gotta work on this.

[00:57:52] Alessio: But we talked about how image generation is not really. On the AGI path, a lot of times, and it's like, okay. Yeah, and

[00:57:59] swyx: then I contradicted myself by saying, like, maybe it is useful synthetic data. Yeah, yeah, yeah,

[00:58:04] Alessio: exactly. But then it's like, okay, then why do the image generation models get, like, so much, Because the internet is so visual, I think.

[00:58:14] Alessio: The image generation models get, like, so much interest in a lot of these things, but if their job is really to, like, go build AGI, like, just build a great model and let it go, but

[00:58:24] F*** you, show me the prompt

[00:58:24] swyx: No, but part of my issue is that I think the prompt stuff from Gemini is honestly the work of, like, one or two people who, like, didn't really think it through at Google, and now they're facing a huge backlash.

[00:58:35] swyx: Yeah, Elon has picked, specifically picked a fight with the product manager who did it. And so, specifically for those who don't know the reason that Gemini is so woke is literally because they just take your prompt and they rewrite it to be more diverse. Without your consent or knowledge, right?

[00:58:48] swyx: And Hamel Husain, who's a good consultant on AI things, actually wrote an interesting blog post recently, which was basically f**k you, show me the prompt. Which is like, stop hiding prompts from me, stop rewriting magic things away from me, and then, like, you know, hiding it, obscuring it, because I need that control, I need that visibility.

[00:59:05] swyx: And I think like, people just didn't understand that this, Tendency towards diversity did not exist at the model level, it actually existed at the prompt level. And it was just inserted by probably like two or three guys without much review. That's it. And that made all of Google look bad, which is absurd.

[00:59:24] swyx: Like, you know, it throws away a lot of the work that the rest of Google did. Specifically Imagen 2. This is Imagen 2. And I've met that team, and they're, you know, they're good, they're smart. They're a completely different team than Imagen 1, which is another fun topic of conversation.

[00:59:39] swyx: So, I think that's interesting, and, but what's more interesting is, like, OpenAI has done this before. People don't remember, they used to append, like, Black or, like, you know, Asian or whatever to their prompts just to make DALL-E more diverse. And they didn't get cancelled.

[00:59:54] swyx: And I think, so I think this will go away. But what really is more interesting is at the model level: like, are we overaligning things? And people are now focusing on the alignment of Gemini in text, text only, as also still being too woke. So I think this is a phenomenon that needs to be studied and, you know, trained.

[01:00:14] swyx: Like, obviously they will try to make attempts, but. You know, they're not going to make anyone happy. And then, like, I think my last point on this, because obviously we can talk about this all day with no result. I think that this is a huge incentive for, like, China and, like, Russia to put out their own models.

[01:00:29] swyx: Because models are soft power. Like the best way to control how someone thinks is to go in and provide their thinking assistance and like subtly make changes like, you know, it's too on the nose to be like, Oh, I don't know what Tiananmen Square is, you know, like, but if you have like subtle ways of affecting the biases of your decisions, your reasoning, your, you know, your, your knowledge in, in the LLM and in publishing a really, really good LLM for everyone to use.

[01:00:58] swyx: So that they're like, Oh yeah, this is great. You know and I use them as maybe a leading LLM. Then they will just like uncritically accept that as like state of the art digital intelligence, and that becomes soft power, and that translates into unconscious thought a lot of times.

[01:01:14] Alessio: Yeah. Yeah. I, I think the prompt point, it's great.

[01:01:18] Alessio: You know, you just gotta, you just wanna see what it is, you know, like, you understand? Yeah. Show me the prompts. Yeah, yeah, yeah. And same, yeah, on the model side, I think there are just some things, one or two, that are almost, you cannot, like, the "did Elon Musk tweeting memes or Hitler bring more harm to humanity?" And Gemini is like, oh, it's hard to say if it's Elon Musk tweeting or Hitler. It's like, what, how, what? There's something wrong in the data pipelines, you know. Like, there's something wrong somewhere. Yeah,

[01:01:45] swyx: but like, this is, like, to an LLM, this is the same class of error As which is heavier?

[01:01:51] swyx: One pound of feathers or one pound of bricks? So,

[01:01:54] Alessio: but, but then like, how can, but, but to me the point is more like Okay, then, won't we? What can we help these models do, you know, because if they cannot, if the, the physical stuff, I get it because it's like the whole like world model thing, but then it's like, okay, can we expect the models to say what's more harmful than something else?

[01:02:13] Alessio: Maybe not. That might be where we land. Then it's like, okay, that's one more thing. And then. We kind of go down the line, and it's like, what are these models good for? If anything, it's too, like, hard for them to pick up when it's like ARP.

[01:02:24] swyx: But We'll see, we'll see. Yeah. Okay, so, I mean, you know, I know we're up on time.

[01:02:28] Send us your suggestions pls

[01:02:28] swyx: It, like, this has been an eventful month. I think you know, February was a lot more interesting than January. In fact, a lot of my January recap was, like, how nothing's changed. Mm hmm. And then February came out, and it was, like, very, very interesting. So yeah, we hope to see what's next. I think we have a Also, this was the month that we did Compute Provider Month, I think relatively successful.

[01:02:48] swyx: Surprisingly hard to string together all these compute providers. Yeah,

[01:02:52] Alessio: we did it. People like it, you know, based on the post stats. So, maybe we'll do something

[01:02:58] swyx: else. Yeah, if you want, you know, if anyone listening wants more sort of thematic explorations of like, okay, these three, four companies always come out together, like, let's get a focused effort on those things.

[01:03:09] swyx: I think we're open to doing that. We, you know, and then obviously we'll have opportunistic interviews along the way.

[01:03:15] Alessio: Cool. Thank you everyone for tuning in and yeah, keep the feedback coming.

[01:03:19] AI Charlie: That was the Latent Space recap of January and February 2024. If you have any feedback or questions, please head to the show notes for ways to get in touch with us or come by the Latent Space Discord. For those who just want the core content, you can stop listening here. But for the super fans, you might notice that there's 45 more minutes of audio left in this pod.

[01:03:47] AI Charlie: That's because in February, we also celebrated Latent Space's first anniversary. Some of you may remember how we launched our very first episode with Logan Kilpatrick, now formerly of OpenAI, and a massively popular Demo Day. Click through to the show notes for photos. Over 750,000 downloads later, having established ourselves as the top AI engineering podcast, reaching #10 in the U.S.

[01:04:13] AI Charlie: tech business podcast charts, and crossing 1 million unique readers on Substack, we celebrated with Latent Space Final Frontiers, a combination demo day and birthday celebration. We're going to bring you some snippets from the demo day, and then some conversations with listeners from all over the world.

[01:04:31] AI Charlie: From Hungary to China to my own sunburnt country down under, on how the issues we've covered in Latent Space have impacted their lives. First up, we'll have a demo from Florent Crivello from Lindy.ai, who gave a great keynote at the last AI Engineer Summit and recently opened up Lindy.ai to the general public.

[01:04:50] Latent Space Anniversary

[01:04:50] Lindy.ai - Agent Platform

[01:04:50] Flo Crivello: We were just chatting right now with Swyx, like, we come with 3,000-plus integrations out of the box. We have a partnership with n8n, which is like an open source Zapier, and so we have, like, a ton of integrations out of the box.

[01:05:00] Flo Crivello: So unlike competitors I shall not name, like, we don't require you play with OpenAPI specs or anything like that, right? It's just OpenAI. You just you just go and, and select your integration here. Alright, so that's my lindy. Oh, something even cooler. Lindies can work together. So here I'm gonna let her work with a support reporter that I created before.

[01:05:18] Flo Crivello: And the support reporter, what it does is it receives details about the support tickets, and it logs them in a spreadsheet. So you can have, it's sort of like object oriented programming for agents, where you can create as many agents as you want and let them work together. So here I'm, I'm gonna tell her when you're done, give the details of the ticket to the support

[01:05:40] n/a: reporter.

[01:05:44] Flo Crivello: All right? And now I'm gonna send her an email. Can I have a refund, please? Please, my family is starving.

[01:05:57] Flo Crivello: You will see she has no empathy whatsoever, it's awful.

[01:06:03] n/a: So she

[01:06:03] Flo Crivello: received the email. She's subscribing to this thread, so now she's going to receive replies. Dear Flo, I understand your situation and I'm truly sorry to hear about the difficulties, but we absolutely do not offer a refund. Alright, yeah, this is good, indeed. So, she sends the, she sends the oh, well, the demo effect.

[01:06:23] Flo Crivello: She did not delegate. But she sent the answer in the thread here. So again, Lindy.ai, you know, can be used for support, for executive assistance, email drafting, email triaging, meeting recording. And we are hiring software engineers. Hit me up at flo@lindy.ai.

[01:06:40] n/a: Thank you.

[01:06:40] RWKV - Beyond Transformers

[01:06:40] AI Charlie: Our next demo is one of our previous guests, Eugene Cheah from RWKV, now also CEO of Recursal AI. You can listen back to our original RWKV episode to learn the full history and details of the model, but also compare it with his more polished pitch now for a more general audience.

[01:07:06] swyx: Next I think we have Eugene Cheah from RWKV, previous guest.

[01:07:10] Eugene Cheah: I'm going to present about the RWKV/Eagle project. So, beyond transformers. There's been a lot of excitement lately. And, like, one AI year ago, apparently, when we launched our 7B AI model, there was a lot of excitement and buzz, because for the first time, an attention-free model beat other transformer models at one trillion tokens at a 7B class.

[01:07:34] Eugene Cheah: And if everyone's been playing open source AI, you know the 7B class is one of the best, most important classes, 'cause it's the one that works on most devices, laptops, and everyone's been playing around with it. And the excitement is compounded by the fact that we even showed that even with 300 million tokens and a few, we perform similarly to transformers. That means people are projecting: what happens if we train another 1 trillion?

[01:07:55] Eugene Cheah: Will we match or can we go beyond that? And it also spurs up questions beyond our architecture itself. It spurs up questions that maybe what we need is good data and an efficient architecture, not just RWKV, it could be beyond that. And that's what caught the attention for a lot of folks, even, yeah.

[01:08:17] Eugene Cheah: And what we do very differently is that our architecture scales linearly. So, we are in this space together with Mamba and a few other architectures where we are trying to build the next architecture that can scale much larger, for everyone. But we share that with Mamba because we believe that attention is not all you need, and it's been a running bet right now.

[01:08:40] Eugene Cheah: We are the strongest evidence to date. But sometimes, like, talking about scale, right, sometimes we get lost in numbers. Because, like, I can show this chart. The last time I showed this at a Linear Transformer event, which only 8 people took pictures of it and understood what it means. And they were all from either Google or Facebook.

[01:08:59] Eugene Cheah: Because, like, what it says here, right, is that we are able to run, on a single GPU with one model, 256 concurrent users on a single 4090, or a thousand concurrent users. But, to put that into contrast, transformers typically handle 8 or 16 concurrent requests per GPU. We're talking about 256 or a thousand, many orders of magnitude higher.

[01:09:26] Eugene Cheah: And all while sustaining near-ChatGPT speeds. And so sometimes, when I get lost in these numbers, these days I'm actually trying to step back into, like, why are we doing this, for our group, for our organization? And for us, right, we are actually making the AI model for everyone in the world.

[01:09:47] Eugene Cheah: And in every country, in every language. So, what does it take to make an AI for the world? Apparently some folks think it's 7 trillion dollars. But, I think 7 trillion is a bit too much. Like, what's going to happen to half of the world that doesn't even have a trillion dollars? Yeah, so I want AI to be accessible at scale.

[01:10:09] Eugene Cheah: So, apparently ChatGPT produced, or OpenAI produced, 100 billion words per day. That's 3.4 million tokens per second. No one has the exact numbers, but it's typically 50k H100s and above, rumored, like, these are some old numbers, like the numbers have gone way beyond this, apparently. But, with our architecture, for a 7B model, that's just a thousand GPUs, or ten thousand GPUs for a 70B model.

[01:10:38] Eugene Cheah: We're talking about one data center to handle all of OpenAI's workload. And if we want AI agents everywhere, cheaper, at a much larger scale, we need to be thinking about that fundamental shift. Because it's not just about whether you can afford it in the US, it's about everyone else in the world.

[01:10:58] Eugene Cheah: And that brings us to the second advantage of our model, which is not even architecture. Because we are accessible by language. We apparently beat Mistral and everyone else in multilingual benchmarks, but that's not because our architecture is better, it's because we're an open source team that came from all around the world and wanted our model to work for our mom and grandma.

[01:11:22] Eugene Cheah: That was the real reason, and we We iterated and refined the data accordingly. We created a custom tokenizer that supports all languages, not just English. And sometimes in the race for the English benchmark, because one of the reasons why other models don't perform as well in multilingual, is because the truth is, if you add multilingual, you hurt your English eval.

[01:11:45] Eugene Cheah: But, who are we building the AI for? Are we building it for our evals? Or are we building it for the people to use? And even in evals, my frustration is, we trained on 100 languages, I only got 23 languages for evals. Like, where's everything else? So, where are we now? Just like I mentioned, 1.1 trillion, that's where we are. We are in between the

[01:12:07] Eugene Cheah: 1.5 trillion and the 1 trillion models for all the English model benchmarks. And, yeah, zooming in further, it just shows that we have more room to go. And, for me, the emphasis on English is weird, because only 17 percent of the world speaks English, but we are here for the 83%. That's for us.

[01:12:28] Eugene Cheah: If you all want to get the best English model, sure, it may not be true for us, but we are here for everyone else. And yeah, with the launch of that model, I think the biggest feedback I had was not that it was a linear transformer, it was that it can run on their own laptops. Some people even ran it on a Raspberry Pi, very slowly.

[01:12:50] Eugene Cheah: And it supported their language, which was more exciting because that's more important for most people. And I think the last one that I've recently like heard that was unique for us and is a lot more important is that ultimately this model is owned by everyone because We put it into the Linux Foundation.

[01:13:09] Eugene Cheah: No custom charity, no custom board structure, no weird stuff. We just put, we just train the model, put it in an open source organization. That means it's not owned exclusive to us. If I go rogue one day, you can just, the code will not disappear. The model will not disappear. Linux Foundation has already bought into it.

[01:13:26] Eugene Cheah: And that is to all of you here. And so, and so what's next for us? Well, We recently started a commercial entity. I know that's weird to say after the open source stuff. But, we, and since then we managed to get more investors and sponsors that we started our next major train. So we are training the next 1 trillion token.

[01:13:47] Eugene Cheah: This is 16 H100 nodes eating enough electricity for multiple homes. And by the end of next month, we'll have our 2 trillion token transformer alternative that you can do a one-to-one compare with Llama 2. And of course, since we had to make a profit somehow for our investors, we are launching our platform also, to host, train, and fine-tune our models, all by March 2024.

[01:14:15] Eugene Cheah: And a quick shout out to Latent Space. You were literally the first to cover us in, I guess, the AI influencer sphere, before "beyond transformers" was even sexy. It was like, yeah, the first to even consider us, and yeah, we hope that a few of you get excited about this and join us along the way.

[01:14:37] n/a: Yeah.

[01:14:38] AI Charlie: Final Frontiers had a stellar lineup of demo judges featuring CEOs and VPs of AI from LlamaIndex, Replit, GitHub, AMD, Meta, and Lemurian Labs. RWKV won one of two judge prizes available that night, alongside this next startup, Pixee AI.

[01:15:00] Pixee - Automated Security

[01:15:00] Rahul Sonwalkar: Next up, also in the automated

[01:15:02] n/a: workforce category, Pixee.

[01:15:04] Ryan at Pixee: Awesome. Hi everyone. I'm Ryan. I'm a software engineer on the team building Pixee. Pretty straightforward: automated security. A little bit about myself. Previously I've worked at other security companies, building developer-facing security tools.

[01:15:17] Ryan at Pixee: I've also worked as a security engineer on developer tools. So, this is a space I love. I'm really interested to see how it develops. Why are we doing this? So, as it turns out, we're generating a lot more code. So, this is an example user of Pixeebot. It's a repository called Stirling PDF. It's just a web application.

[01:15:37] Ryan at Pixee: Got 18,000 stars on GitHub. Developed using, 100 percent using, ChatGPT. So they installed Pixeebot three weeks ago, and they got a lot of different suggestions for fixes from us, one of which, I am positive, was a real vulnerability. This is a, you know, web application that's used by real people.

[01:15:58] Ryan at Pixee: There's a button here, you can deploy it to DigitalOcean. So, we need to find a way to scale our security automation, in order to scale our relatively limited security workforce. So just to give you an idea of what Pixeebot can do, this is like a very classically vulnerable application that a lot of security tools like to try themselves out on.

[01:16:17] Ryan at Pixee: One of the things that I'm really excited about, that we just shipped in the past couple weeks, was integrating with Sonar. So Sonar is a code quality tool that finds security issues, performance issues, and lots of other kinds of issues in your code. It also, as you can see here, found 2,600 issues in here, taking 33 days of effort.

[01:16:39] Ryan at Pixee: That's not really where we want to have most product engineers focusing their time. It's definitely not where we want to have our security engineers focusing their time. What can we do to automate this and get these fixes automatically? So with Pixee, we take these code quality and security issues in from these other tools and then automatically remediate them.

[01:16:57] Ryan at Pixee: So in the case of this this is a super minor change. If a developer were to find this issue in their code, they could fix it in a minute. But, they don't have to, and more importantly, there's backlogs of tens of thousands of these issues in organizations across across the world. And, so if we can automate this one task, even if it just takes a minute, and perform that, you know, continuously, across, you know, thousands of companies, we can save a lot of time.

[01:17:23] Ryan at Pixee: Automated enforcement of security and code quality is what we're all about. But yeah. Not all security issues are worth fixing. Not all code quality issues are worth fixing. Sometimes they're wrong. The incentive structure for these tools is, you know, they want to find real things, but most importantly they have to find something.

[01:17:42] Ryan at Pixee: So at Pixee we believe, you know, even if something might not be a completely exploitable vulnerability, if there's an opportunity for hardening or improving your code base, you should probably take it. But there's some of these things that are just not that. So we developed a tool we call triage, which will connect in with other tools that are notorious for finding lots of issues, and we can help you fix them.

[01:18:05] Ryan at Pixee: So in this case we made a CLI that looks at your security backlog and identifies issues that we know don't matter in the context of your codebase. It pulls down the issues, categorizes them, and then prompts you to either say, hey, this issue is not important, here's why we think it is, and we'll update the state for it.

[01:18:26] Ryan at Pixee: So in this case, this is a warning about a parameter into this file directory. It has some cross-platform compatibility concerns. But based on the context of your code base, and a large language model, we're able to give you the confidence to focus on the issues that are most likely to actually matter.

[01:18:44] Ryan at Pixee: One of the other things we do, you know, well, what you saw before is we're delivering as a GitHub app, so that developers can integrate this into their existing workflows. But a lot of people like to just try Pixee from the command line on small projects, automatically get their fixes, and just commit all of them.

[01:19:02] Ryan at Pixee: So, that's what we built. Try Pixee on GitHub, try Pixee on the CLI, and we're really excited to see what we can help you fix.

[01:19:10] AI Charlie: Congrats to Pixee and RWKV. Our last featured demo is Rahul from Julius AI, who provides an interesting take on competing with OpenAI on its own home turf, the ChatGPT code interpreter.

[01:19:30] Julius AI - Competing with Code Interpreter

[01:19:30] Rahul Sonwalkar: You might remember Rahul Ligma,

[01:19:33] Flo Crivello: that's the poor engineer that got laid off by Elon Musk outside his office.

[01:19:37] Eugene Cheah: He's back, he's back on his feet, he's got a whole new startup, so

[01:19:40] Rahul Sonwalkar: thanks so much for having me here. I'm working on Julius. How many of you

[01:19:44] n/a: here are data scientists? think everyone here

[01:19:47] Rahul Sonwalkar: needs a data scientist. But there just aren't enough. And that's what we're building. Julius is an AI data scientist that helps you analyze datasets, make visualizations, Get insights from the data, and really dive deep into all sorts of data that we have in real life.

[01:20:02] Rahul Sonwalkar: So, we launched about six months ago, and since then have grown to 300,000 users, several thousand users using us daily to analyze datasets, create visualizations, and get insights. So what I'll do now is give you guys a quick live demo of how it actually works. I actually hope it works

[01:20:21] Rahul Sonwalkar: because we just posted code changes.

[01:20:23] Rahul Sonwalkar: But here I have a dataset of 20,000 rows of data over time, for the last 100 years, of human height for different countries. So I'm going to take this dataset, dump it in Julius, and say,

[01:20:35] Rahul Sonwalkar: load this for me.

[01:20:41] Rahul Sonwalkar: And while it's doing that, I want to explain what's happening under the hood. So basically, for each user, Think about how a human data scientist would analyze a data set that you give it.

[01:20:54] Rahul Sonwalkar: It would take its computer, write code, run that code, maybe in a Jupyter notebook, look at the output, and then decide if that answers your question, or if you need to write more code. Julius works similarly. So that's you, that's the AI, and then for each user, you get a virtual machine in the cloud, where the AI is filling up the Jupyter notebook, writing the code to get the analysis that you want, and then serving that back to you.

[01:21:22] Rahul Sonwalkar: Many times, that code is not correct the first time. But Julius is able to recover from those errors and actually get you the answer that you want.

[01:21:31] Rahul Sonwalkar: So let's look at our chat. We said, load this file for me, and the AI basically went, spun up a Jupyter notebook, loaded pandas, looked at the file, and gave us a few rows.

[01:21:42] Rahul Sonwalkar: I'm going to ask

[01:21:43] n/a: plot the male height over time,

[01:21:53] n/a: in France.

[01:21:53] Rahul Sonwalkar: So, the AI has been writing this code to plot height over time in France for men, and then charting it for us. And the good thing about Python is, if you spend a ton of time on SQL, what we realized was that with SQL it's really hard to write actually useful queries and do deep analysis like regression, etc.

[01:22:15] Rahul Sonwalkar: with just SQL. With Python, you also get a whole ecosystem of modules built in, right? matplotlib, pandas, numpy, scikit-learn, and there's thousands of these. So, that was the initial insight, and then we built Julius about six months ago.
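
Roughly the kind of code an AI data scientist would generate and execute in that sandboxed notebook for "plot the male height over time in France"; the file name and column names here are assumptions about the demo dataset, not its actual schema.

```python
# Sketch of generated notebook code for the height-over-time request.
# The CSV name and the "Country"/"Year"/"Sex"/"Height_cm" columns are
# placeholder assumptions about the demo dataset.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("human_height.csv")                 # ~20,000 rows, 100 years of data
france_men = df[(df["Country"] == "France") & (df["Sex"] == "Male")]

plt.plot(france_men["Year"], france_men["Height_cm"])
plt.xlabel("Year")
plt.ylabel("Average male height (cm)")
plt.title("Male height over time in France")
plt.show()
```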

[01:22:33] Jerry Liu: What's like the practical difference in UX between this and just

[01:22:37] Jerry Liu: ChatGPT Code Interpreter?

[01:22:38] Rahul Sonwalkar: Great question. Yeah, the question was, what is the difference between Julius and Code Interpreter? Really, there isn't. It's just better. We're focused on people who do stuff with data multiple times a day.

[01:22:53] Rahul Sonwalkar: And we talked to a lot of these people, and we said, Okay, how can we build things for you that would help you do your job?

[01:22:59] Rahul Sonwalkar: So, an example of this is on ChatGPT: oftentimes people will give it a dataset, it tries to write code, and sometimes that code has errors. And it kind of goes into this loop of trying to fix these little errors.

[01:23:13] Rahul Sonwalkar: What we have focused on is, okay, how do we prevent that from happening? So we looked at thousands of users using us daily, collected data on where these errors happened, and focused really hard on fixing those errors beforehand, before they actually happen at runtime.

[01:23:30] Rahul Sonwalkar: This could mean a bunch of rules.

[01:23:32] Rahul Sonwalkar: This could mean, you know, prompting changes, et cetera, and just preventing that from happening. Second of all, we have features that allow people who do stuff with data on a daily basis to go deep and get the last mile of analysis done. That could mean, you know, you can click show code, go into the code, and edit the code.

[01:23:53] Rahul Sonwalkar: You can also give natural language instructions on the code. Finally, let's say you have this graph, and I want the graph to have some changes. Like, I want it to be a bar chart instead of a line graph. You can kind of just go in here and give natural language instructions, to let the user take what the AI has done for them and then take it to the finish line.

[01:24:17] Rahul Sonwalkar: If you've used Code Interpreter, that's pretty hard for users to do. So we focus on data and that use case, and we will keep doing that.

[01:24:23] n/a: Cool thanks guys!

[01:24:27] AI Charlie: That's unfortunately all the time we had to feature demos, but many thanks to Botpress, Markov, Kura.ai, Sweep, and Motif as well for being finalists. For the last part of our anniversary celebration, we wanted to turn over the mics to you, our dear listeners. We hear so many great stories from listeners about how Latent Space has come into their lives, and we've never had the opportunity to feature them on the pod till now.

[01:24:53] AI Charlie: Our first listener is Balázs Némethi from Hungary, who talked about one of the most delightful gems in the Latent Space community, our weekly Discord paper club.

[01:25:03] Latent Space Listeners

[01:25:03] Listener 1 - Balázs Némethi (Hungary, Latent Space Paper Club)

[01:25:03] swyx: Tell me, tell people about, like, what happened. Yeah, like,

[01:25:07] Guest 1: two weeks ago, two weeks ago, there was the paper reading club on Discord, and I, and then, halfway in, or like, one quarter in, like, the author of the paper showed up, and it was so f*****g cool. Like, if you could do this, like, I was thinking, like, this should be a format, like, there is Two Minute Papers, that probably, you know, who

[01:25:28] swyx: is, yeah he's Hungarian,

[01:25:31] Guest 1: Living

[01:25:31] swyx: in Vienna, but like Karoly,

[01:25:36] Guest 1: pronounced in Hungarian is Károly, yes. So that was so special, because there is a certain amount of information in papers, and the quality of papers might have dropped in the past year compared to before, due to the social media aspect of arXiv.

[01:25:52] Guest 1: So, having the person there, giving even more details than just what you could read, was like, so amazing. I know it's really hard to organize, but like, if it would be possible to have more, maybe not recurring, like, you know, it's just like,

[01:26:08] swyx: oh, nice. The Matryoshka,

[01:26:13] swyx: yeah, yeah. So we have one next week, the MRL paper, Matryoshka Representation Learning, which is a way of sorting embeddings so that you can truncate them. And OpenAI recently shipped this in their API for the new embeddings models, where you can reduce, like, a 3,000-dimension embedding to 256, so you save more than 90 percent on your embeddings.

[01:26:30] swyx: Vector database costs and speed and everything. Nice. So the authors are coming by and presenting at the Discord. I will join. I will join. Any other, like so basically I'm just going to record random opinions. I know how you produce the

[01:26:45] Guest 2: podcast. So we're going to

[01:26:46] swyx: do this. You're going to be on the show.

[01:26:48] swyx: You're going to be on the show. Any other, like, how did you discover the podcast? What do you feel?

[01:26:54] Guest 1: Discovered it on Spotify, searching basically AI. I use PocketCast for all my podcasts, but I was like, let's just search AI. I think I was searching for AI generated music, but it brought up podcasts.

[01:27:07] Guest 1: And I was like, you know what, I'm kind of getting out of my previous industry, so I'm just going to separate out the whole AI following thing. And I just, like, followed. This was the first one that came up, and then a couple of others, just to have them downloaded. But this was literally the first podcast I'm following on Spotify, when I follow like 70 in my podcast app. So I was like, okay, this is great, these are only great podcasts, and I kept coming back to

[01:27:40] swyx: yours,

[01:27:40] swyx: there are other podcasts that we consider friends, and we try to do collaborations with them, and podcast swaps with them, so yeah, that's great.
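
As an aside on the Matryoshka embeddings discussed above: the truncation trick amounts to keeping the first k dimensions and re-normalizing. Below is a minimal numpy sketch with illustrative sizes; OpenAI exposes the same idea via a `dimensions` parameter on its text-embedding-3 models.

```python
# Matryoshka-style truncation in a few lines: keep the first k dimensions of
# an embedding and re-normalize. Sizes here are illustrative.
import numpy as np

def truncate_embedding(vec: np.ndarray, k: int = 256) -> np.ndarray:
    """Keep the first k dims and L2-normalize so cosine similarity still works."""
    small = vec[:k]
    return small / np.linalg.norm(small)

full = np.random.randn(3072)           # stand-in for a full-size embedding
full /= np.linalg.norm(full)
short = truncate_embedding(full, 256)  # far smaller to store and compare
```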

[01:27:47] Listener 2 - Sylvia Tong (Sora/Jim Fan/EntreConnect)

[01:27:47] AI Charlie: Our next listener is Sylvia Tong, founder of the EntreConnect community, a community of founders and investors supporting entrepreneurs in Silicon Valley. She wanted to discuss OpenAI Sora and Jim Fan from NVIDIA, who we have featured on our previous OpenAI Dev Day Recap podcast, and who will be a future guest on Latent Space.

[01:28:07] swyx: How did you find the podcast, and what do you feel about it, what do you want to tell people about it?

[01:28:12] Guest 2: Actually, I know Jim Fan, so I, so Jim Fan, I know you! And then I follow your Twitter and follow your podcast. Yeah, yeah, yeah. It's another event, maybe you know Alliance AI, it's another community, and they had that event like early last year, so they have various events. The founders are all Stanford grads, so they are always at Stanford University, like in one of the rooms, yeah. So Jim Fan was one of the first speakers, so, yeah, and I connected with him on WeChat, and, yeah, and connected with you, yeah, follow your Twitter!

[01:28:47] swyx: Jim is Jim is super friendly, and we have to have a full episode with him at some point. But he's, yeah, I mean, he's doing amazing things at NVIDIA. I'm sure he's very happy there.

[01:28:59] Guest 2: You should ask him about Sora. The gen AI video stuff, yeah, he has so many opinions about it, you know, yeah.

[01:29:07] swyx: I feel like, okay, Jim is this interesting mix between a researcher and a Content creator, right?

[01:29:13] swyx: So, Jim's take on Sora, I slightly disagree with, because he says it's basically a data driven world model, and a lot of people misinterpreted him, me included, basically saying like, oh, are you, are you saying that there's an underlying physics model behind Sora? And he's like, no, no, no, no, no, it's just, you know, using diffusion transformers to learn a representation of world models.

[01:29:34] swyx: It's not perfect. Then I'm like, okay, but that's a misleading analogy, I don't know. Anyway, so like

[01:29:40] Guest 2: he But that's for the content purpose. That's for the Twitter content purpose. You have to, yeah,

[01:29:44] swyx: yeah. So I feel this, like, pull towards, like celebrating things on Twitter, but then also trying to be realistic.

[01:29:53] swyx: Trying to present, like, what is actually the thing instead of the hype. And it's very hard to separate. And that's something that's a challenge for Latent Space.

[01:30:00] Guest 2: Yeah, it's hard, I feel it's hard to have the conversation on Twitter, so you need to have the conversation on the podcast. So invite a few people who maybe talk about it on Twitter, but have them really explain what they mean in their tweets.

[01:30:13] Guest 2: Because, yeah, it's hard to understand just a few words. Yeah, so do you actually think Sora understands the physics of the

[01:30:20] swyx: world? A little bit. It's, yeah, Sora understands a little bit of physics. The problem with this is they cannot have 80 percent physics. Like, it's 100 or 0, like, otherwise you lose confidence in the thing.

[01:30:33] swyx: So that's why you have these generated videos where the chair will show up and disappear, the spoon will show up and disappear, you know, like, that's all the artifacts you see in Sora. Which is good for us for now, because we're lucky that it's not good enough yet to consistently generate all those things.

[01:30:50] swyx: At some point it will be, we just wait two years, and it will be.

[01:30:53] swyx: Very cool. Thanks for it. I love this discussion. Thanks for listening. I'm really glad to have you as a listener.

[01:30:59] AI Charlie: Alessio and Swyx covered the Jim Fan vs. Yann LeCun world model debate in the main pod, and you can click through the show notes for more detail directly from each of them. Our third listener is RJ Honecke, who comes from a data science background, but wanted to ask about how we think about learning in public in AI, and how that informs the context with which Latent Space is created.

[01:31:23] Listener 3 - RJ (Developers building Community & Content)

[01:31:23] swyx: Hi, I'm RJ. Shawn, nice to meet you. Nice to meet you. Do you also listen to pod, or are you just here to hang out? Yes, very much. Oh, yeah. How do you feel about it?

[01:31:32] Guest 3: The depth that you guys go into it's a lot deeper than other. This is a podcast that I listen to. I kind of found it, and then didn't switch back.

[01:31:39] swyx: Thanks!

[01:31:40] Guest 3: What's your background? I, I am a data scientist.

[01:31:44] Guest 3: I run a data team at a cellular communications equipment manufacturer. And we collect a ton of telemetry data, and other things like that. And I'm running a data team to make inferences about the health of our network, about operating the network more efficiently, and also, in our manufacturing process and product development process, to improve our ability to detect when we improve or get worse at operating, or, sorry, when our product builds or hardware builds get better or worse.

[01:32:17] Guest 3: So actually, I wanted to actually ask a question of you and your thoughts about this. So I find the discussion about model measurement and, and, and evaluation to be very similar to the problems that we have in wireless. Because you have this very non deterministic system, right? So I was thinking, and I also just read your your little thing about learn in public.

[01:32:43] Guest 3: So I was thinking about trying to come up with a good way to, to, and I'm, I'm learning about some new techniques that we're starting to implement to monitor our development process and so forth, and evaluate our, the quality of our builds and our hardware, and I was thinking about trying to tie that in with evaluation of LLMs.

[01:33:08] Guest 3: I just, I, I, I don't know. That's as far as I got in the thinking, but I just thought that would be a fun thing to try to put out there and wanted to hear your thoughts about how, how to, like go about

[01:33:17] swyx: that. Yeah. You can, you don't need anyone's permission. That's, that's the beauty of this thing. But also no one owes you anything.

[01:33:23] swyx: No one owes you their time, their attention or, you know, or, or, or responses. And I typically try to classify these things as different modes of learning in public. Mm-Hmm. , I think I have four modes that I sketched out, but the two I remember the most are Explorer and Connector, and then there are two more advanced modes, I think like Teacher or Builder or something like that.

[01:33:45] swyx: The Explorer is where you sort of like put things out as you go along. It's learning exhaust, where you don't have expectations so that anyone will read it. It's mostly just notes for yourself. And that actually, that lack of expectations frees you. Because then you're like, oh, like two people read it.

[01:34:03] swyx: Doesn't matter, it's useful to me. It's useful to my team, it's useful to me, it's useful to whoever comes after me because I documented my work and my thinking. And that's great. And I think that's the way that most people should start, which is like, just lower your expectations. You're not going to be an influencer overnight, like, it's completely fine, but get your thoughts out there, and then also, like, start having feelers in different directions on what works for you. What works is a combination of what you like to do and what other people want from you, and you will know when people tell you they want more from you.

[01:34:35] swyx: And so then, when you get there, when you have expertise that you have that other people don't, then you switch gears into a connector, where you are now coming from a place of authority. Like, I know how to do this right, and I will teach you, because I have done this, and I have spent more time, paid more in my dues, and here's the lessons.

[01:34:55] swyx: And then that tends to become more of a polished effort, one that tends to become more measurable in terms of, like, the impact and the influence it can get. And I think that's where people start moving towards. But basically just lower expectations, make it cheap to experiment, put out a lot of stuff in different directions and see where the market pulls you.

[01:35:13] Guest 3: Okay. Yeah. So, I mean, do you have thoughts about, like, I'm very much aligned with like who cares about. I mean, I care, but my need is not to be a social media influencer. My need is to, like, I want to learn and I like the idea of, you know, sort of like sharing that with people and sharing the process with people.

[01:35:39] Guest 3: So, like, thoughts about platform? I mean, I know it's going to be different for everyone, but what in your experience has been successful while getting started?

[01:35:53] swyx: Yeah so I tend to tell developers, most developers to start on Hashnode these days. Hashnode is basically Medium if it was for developers and didn't suck.

[01:36:06] swyx: Because I hate Medium with a passion and a glowing, fiery hatred. Everyone does. It's comical how bad they are. But, I use Substack for latent space. I'm pretty happy with Substack. It's an email social network. Email is one of the most important things for people to like, come back to you frequently. So that you don't, you're not subject to an algorithm, you own your audience, you know.

[01:36:26] swyx: If you want to move off Substack someday, it'll let you take the emails and keep that relationship going with the people that you have. And that's super important as a creator. And then you can also write your own blog, and tweet, and all that. I tend to say, though: pay attention to what you enjoy, and what you spend the most time on.

[01:36:42] swyx: If you're a LinkedIn guy, be on LinkedIn. I'm not on LinkedIn, so I'm gonna do horrible on LinkedIn, because I don't know the metagame of LinkedIn. I don't know what does well, I don't know what people want. So I shouldn't even, I don't, I don't bother, I should try, because obviously there are like way more people on LinkedIn than there are on Twitter, but I'm just a Twitter guy.

[01:36:59] swyx: Like, that's just who I am. I also sort of am old money there, in the sense that I have an existing followership that predated Latent Space. You know, Latent Space doubled my following, but like, I had some before that. So, like, all that's great. I just think, like, you're going to know the metagame, and that's actually very important, of, like, where you already spend time. Like, I have friends who are on TikTok, I have friends who are on YouTube a lot. I'm on YouTube a lot, I should do YouTube, because I know what's going on on YouTube. It's just, then you have to put in the effort to do that, and video production is, like, the most expensive thing. Anyway, long story short: try to pay attention to this complex mix of publishing platform, the existing embedded social network on that platform, and where you already spend time, so that you know how to create what will do well, just because you already spent time on it.

[01:37:46] swyx: Yeah, okay.

[01:37:47] Guest 3: What's your favorite?

[01:37:49] Guest 3: Favorite episode? I really liked, actually, the NeurIPS, like, recap, because I haven't been to NeurIPS. You know how much time that took? Well, I mean, the episode is like four hours, right? Yeah. And I didn't do the paper one, because I actually, I usually listen.

[01:38:07] Guest 3: I don't watch. So, like, it'd be really hard to... There's no video for that. Oh, there isn't? Oh, okay. So I, like, I have to find the paper, and, anyway, that's hard for me. Yeah. But I did enjoy the interviews in the other one, the startups episode. Yeah.

[01:38:25] swyx: People love that.

[01:38:26] swyx: It just takes a ton of work, and I would love to offload it. This is going to be another one of those where I just kind of splice together little things. And it's good. It brings you there. That's the thing, right? Like, you're not there physically. I'm here. Let's, like, bring people into the closed community.

[01:38:40] swyx: And so I would like to do more of that.

[01:38:42] Guest 3: Yeah, no, I really enjoy how you bring, like, a lot of people that I would not have otherwise even known about, let alone have access to, and then You have this conversation with them. It's really fun. Thanks

[01:38:56] swyx: for coming on. Can I, can I get your contact so that we can find you?

[01:38:59] swyx: Yeah. Yeah. Yeah. You're going to be on the pod. Oh, awesome.

[01:39:01] AI Charlie: People seem to love the NeurIPS recap pod, and we'll keep doing more of those when the right occasion presents itself. This was also a pick for our last listener, Jan Zheng from Australia, who comes at AI from the design point of view and was very interested in our early AI UX work on Latent Space.

[01:39:20] AI Charlie: If you're in SF and want to see more novel AI UX ideas, reach out to him.

[01:39:25] Listener 4 - Jan Zheng (Australia, AI UX)

[01:39:25] Guest 4: My name is Jan, and I came across you on GitHub when I was looking for ways to solve problems in Svelte. And you pretty much answered all the questions I had for pretty much a couple of years, and then you left, and you started doing Latent Space, and I'm like, what is that?

[01:39:45] Guest 4: What is an LLM? So I started listening to your pod, and yeah, and here I am. And

[01:39:49] swyx: then you're part, you're from Sydney, or you were, you were in Sydney.

[01:39:52] Guest 4: I, I moved to Sydney a couple years ago to work on a clinical trial, but now I moved back, probably, again, I blame you for it, because I listen to every episode, I'm like, s**t's going down, in San Francisco, you gotta be here.

[01:40:05] swyx: So yeah, and then you were, you're part of build club.

[01:40:08] Guest 4: Yeah, I'm part of BuildClub. BuildClub is a... Unfortunately, I was at the airport when you were giving a presentation, and Annie has not sent me the recording yet, so I haven't seen it. It's on YouTube.

[01:40:24] swyx: Oh, okay. Great.

[01:40:25] Guest 4: Oh, awesome. Okay, I'll take a look. But BuildClub is the one and only AI-centric community in pretty much all of Sydney.

[01:40:39] Guest 4: And I had to spend months to push Annie to do that thing. And eventually she did, and I'm so glad she did. And it's growing, and she's doing amazing. She's expanding to many cities. It's ANZ now. Yeah, it's amazing. And she has our couch from our apartment when we moved away. We couldn't find a way to sell it.

[01:41:01] Guest 4: We're like, hey Annie, we're getting a space. Do you guys need a couch? She's like, sure. So she has my couch. It's amazing.

[01:41:07] swyx: And then what do you listen for in Latent Space? What, you know,

[01:41:11] Guest 4: what are you interested in? I like to get a sense of what's going on. You guys ask very good questions. For some reason you guys seem so well researched, both you and Alessio.

[01:41:24] Guest 4: Somehow you just ask very good questions that me, as a, like, general product developer, product engineer... I have no idea about ML, I don't follow the papers. I know about the paper club, but I don't follow it because it's over my head. But you guys distill it so well, and you guys ask your guests the questions that I have in the back of my mind, or that I don't even know that I have, and then you guys guide the conversations in a way that I can learn from, and I wouldn't even know anything to ask. So I'm so glad you guys are doing it.

[01:42:03] Guest 4: It's so helpful, and keep doing what you're doing. Yeah, and I really love what you guys did with the best papers from the talks. Yeah, it's really good. I mean, like, a lot of that was way over my head, but I listen to it all, and I just try to keep listening to this stuff until I get it.

[01:42:27] Guest 4: And you guys expose, I mean, I would never go to a conference like that, but, yeah. But like, I was just like, not understanding anything, but you guys make it so accessible, and I love it.

[01:42:39] swyx: Yeah, so, maybe, the podcast studio is right here, actually, I can show you after we're done recording. It's not that fancy, it's just a studio.

[01:42:46] swyx: And yeah, for me, the goal with the NeurIPS recap was not that you would, like, read everything or anything. Like, yeah, we would just pick what we thought was most important for you, and if any one of them interested you, you could double click on it. That's it. You know, we're not gonna be, like, the experts on every single thing.

[01:43:04] swyx: It's impossible, right? And already, like, the episode that I cut together for that was like three and a half hours, so people were complaining about that. And then the last thing: Alessio and I don't do that much research for each episode, but, you know, we research the guests.

[01:43:21] swyx: But just being involved in the day to day conversations in our day jobs prepares you for that. And I think that is important. No prep needed because, you know, we're in it. We're in the arena, as they say. Yeah. Anything else?

[01:43:35] Guest 4: Like, like there's so much excitement. There's so many things to cover. And maybe culturally, yeah, that would be a thing I was always wondering about, and that might not be partly in Latent Space's lane, but what are you guys doing to cover the cultural aspect of what's happening here? It's probably, like,

[01:44:00] Guest 4: A separate thing, but equally important thing, to like, document all the conversations that are happening around here. And all the other build spaces, like, we see glimpses of that on Twitter, but I think capturing more of that would be super cool.

[01:44:17] swyx: Yeah I feel like that's something that someone else should do.

[01:44:20] swyx: We try to be more technical, because people can use that at work, they can justify it for productivity. We might try to dabble in some of that. So I'm pretty connected with, like, the main areas. For those listening who are interested in, like, SF AI, the main areas are, like, Shack 15, AGI House SF, AGI House Hillsborough, and then us, and maybe HF0, and then maybe a little bit of Founders, Inc.

[01:44:48] swyx: And those are it. There are more community-oriented spaces, like The Commons, but they're not sort of AI-centric. And so we can do a little bit of reporting around that, but it's gonna be like This American Life, you know, like, tell me your life story, that sort of story. I'm not, like, the best at that, and then also, like, there's a lot of very, very brutal cutting for that, which is hard to do, but we can dabble, or we can do it on the

[01:45:13] Guest 4: side.

[01:45:15] Guest 4: Oh, the other thing I'm very interested in, I'm a UX designer by trade, and anytime you guys touch on AI and UX, or generative UI, I'm all ears, and I would love to... Again, it's probably not the technical side of Latent Space, but I think there needs to be a hundred times more resources out there than what's currently available.

[01:45:34] swyx: Yeah, yeah, we had, I think we held the first AI UX meetup ever, in SF, in Worlds. That was really fun. The meetup's on YouTube, if you want to see it, and it's in the Latent Space archives of the newsletter. I don't think we ever published a podcast version of it.

[01:45:48] swyx: So you have to just subscribe to the newsletter and then check the YouTube for that stuff. But yeah, UX is a topic of ours that we like to cover. It's just very hard to cover as an audio medium. Yeah. 'Cause you can't see it. And also I think, like, it's gonna be mostly owned by, like, Notion and Vercel and Retool, which, we've interviewed Retool, we're going to interview Vercel, and we've interviewed Notion.

[01:46:12] swyx: So who else, who's who? Like, who do you wanna listen to on AI UX, right? Like, there's individual people, like we had Amelia Wattenberger present at AI Engineer Summit, you can see that on YouTube. Like, I know a lot of the thinkers on AI UX, and I think I know what they say, like, I haven't seen anything super innovative.

[01:46:31] swyx: Everyone hates chatbots, everyone wants to innovate things. I haven't seen any new ideas since we did the AIUX meetup one year ago. Tell me I'm wrong.

[01:46:42] Guest 4: Well, that sounds really disappointing. I haven't seen anything on Twitter either. I thought it would be easier to push, because we just wrap LLMs, but on Twitter there doesn't seem to be that much going on, to your point.

[01:46:59] Guest 4: But there needs to be more people from the design space, from the product space, like UX researchers, coming in and figuring out how we can take LLMs and apply them to real problems. I haven't seen a whole lot of that. In Sydney, there's not a whole lot of that. I'm hoping to maybe be a part of the community here and try to grow that side of

[01:47:21] swyx: the things.

[01:47:22] swyx: Well, look, you're here now. You're interested in AIUX. Run the next AIUX meetup. I can set you up with the venue, the people. You need to find the speakers. I'm not going to find the speakers for you. But if you want to set that up, go for it.

[01:47:37] Guest 4: So, I actually copied your AI UX format, and I held a talk in Sydney, and in a very light fashion, like 20-30 people showed up.

[01:47:49] Guest 4: We had some cool demos, it was like a baby, like a small version of your AIUX conference, but yeah, I'd love to, love to participate. I mean,

[01:47:59] swyx: this is SF, 300 people will show up. You just gotta get some cool demos. I can seed you with some people. Let's make it happen! Let's make it happen, alright. Well, it's nice to meet you, and I'll get your details.

[01:48:09] AI Charlie: That's all, folks. If you've enjoyed or benefited from our work on latent space over this past year, we'd really love to hear from you, and really appreciate it if you'd tell a friend. The only way a podcast consistently grows is through your word of mouth, and that helps us book incredible guests and attend great events in our second year.

[01:48:29] AI Charlie: Have a lovely weekend!



Get full access to Latent Space at www.latent.space/subscribe
2024-03-09
Link to episode

Open Source AI is AI we can Trust - with Soumith Chintala of Meta AI

Speaker CFPs and Sponsor Guides are now available for AIE World's Fair - join us on June 25-27 for the biggest AI Engineer conference of 2024!

Soumith Chintala needs no introduction in the ML world - his insights are incredibly accessible across Twitter, LinkedIn, podcasts, and conference talks (in this pod we'll assume you'll have caught up on the History of PyTorch pod from last year and cover different topics). He's well known as the creator of PyTorch, but he's more broadly the Engineering Lead on AI Infra, PyTorch, and Generative AI at Meta.

Soumith was one of the earliest supporters of Latent Space (and more recently AI News), and we were overjoyed to catch up with him on his latest SF visit for a braindump of the latest AI topics, reactions to some of our past guests, and why Open Source AI is personally so important to him.

Life in the GPU-Rich Lane

Back in January, Zuck went on Instagram to announce their GPU wealth: by the end of 2024, Meta will have 350k H100s. By adding all their GPU clusters, you'd get to 600k H100-equivalents of compute. At FP16 precision, that's ~1,200,000 PFLOPS. If we used George Hotz's (previous guest!) "Person of Compute" measure, Meta now has 60k humans of compute in their clusters.
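
A quick back-of-envelope check of those figures, assuming roughly 2 PFLOPS of FP16 per H100-equivalent (the sparsity-assisted spec number, rounded) and George Hotz's ~20 PFLOPS "Person of Compute" yardstick:

```python
# Back-of-envelope check of the figures above. Assumes ~2 PFLOPS of FP16 per
# H100-equivalent (the ~1,979 TFLOPS sparsity-assisted spec, rounded) and
# George Hotz's ~20 PFLOPS "Person of Compute" yardstick.
h100_equivalents = 600_000
pflops_per_h100_fp16 = 2
person_of_compute_pflops = 20

total_pflops = h100_equivalents * pflops_per_h100_fp16
print(total_pflops)                              # 1,200,000 PFLOPS
print(total_pflops // person_of_compute_pflops)  # 60,000 "humans of compute"
```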

Occasionally we get glimpses into the GPU-rich life; on a recent ThursdAI chat, swyx prompted PaLM tech lead Yi Tay to write down what he missed most from Google, and he commented that UL2 20B was trained by accidentally leaving the training job running for a month, because hardware failures are so rare in Google.

Meta AI's Epic LLM Run

Before Llama broke the internet, Meta released an open source LLM in May 2022, OPT-175B, which was notable for how "open" it was - right down to the logbook! They used only 16 NVIDIA V100 GPUs and Soumith agrees that, with hindsight, it was likely under-trained for its parameter size.

In Feb 2023 (pre Latent Space pod), Llama was released, with a 7B version trained on 1T tokens alongside 65B and 33B versions trained on 1.4T tokens. The Llama authors included Guillaume Lample and Timothée Lacroix, who went on to start Mistral.

July 2023 was Llama2 time (which we covered!): 3 model sizes, 7B, 13B, and 70B, all trained on 2T tokens. The three models accounted for a grand total of 3,311,616 GPU hours for all pre-training work. CodeLlama followed shortly after, a fine-tune of Llama2 specifically focused on code generation use cases. The family had models in the 7B, 13B, 34B, and 70B size, all trained with 500B extra tokens of code and code-related data, except for 70B which is trained on 1T.

All of this on top of other open sourced models like Segment Anything (one of our early hits!), Detectron, Detectron 2, DensePose, and Seamless, and in one year, Meta transformed from a company people made fun of for its "metaverse" investments to one of the key players in the AI landscape and its stock has almost tripled since (about $830B in market value created in the past year).

Why Open Source AI

The obvious question is why Meta would spend hundreds of millions on its AI efforts and then release them for free. Zuck has addressed this in public statements:

But for Soumith, the motivation is even more personal:

"I'm irrationally interested in open source. I think open source has that fundamental way to distribute opportunity in a way that is very powerful. Like, I grew up in India... And knowledge was very centralized, but I saw that evolution of knowledge slowly getting decentralized. And that ended up helping me learn quicker and faster for like zero dollars. And I think that was a strong reason why I ended up where I am. So like that, like the open source side of things, I always push regardless of like what I get paid for, like I think I would do that as a passion project on the side."

"I think at a fundamental level, the most beneficial value of open source is that you make the distribution to be very wide. It's just available with no friction and people can do transformative things in a way that's very accessible. Maybe it's open source, but it has a commercial license and I'm a student in India. I don't care about the license. I just don't even understand the license. But like the fact that I can use it and do something with it is very transformative to me."

"Like, okay, I again always go back to like I'm a student in India with no money. What is my accessibility to any of these closed source models? At some scale I have to pay money. That makes it a non-starter and stuff. And there's also the control issue: I strongly believe if you want human aligned AI, you want all humans to give feedback. And you want all humans to have access to that technology in the first place. And I actually have seen, living in New York, whenever I come to Silicon Valley, I see a different cultural bubble."

We like the way Soumith put it last year: Closed AI "rate-limits against people's imaginations and needs"!

What It Takes For Open Source AI to Win

However Soumith doesn't think Open Source will simply win by popular demand. There is a tremendous coordination problem with the decentralized nature of the open source AI development right now: nobody is collecting the valuable human feedback in the way that OpenAI or Midjourney are doing.

"Open source in general always has a coordination problem. If there's a vertically integrated provider with more resources, they will just be better coordinated than open source. And so now open source has to figure out how to have coordinated benefits. And the reason you want coordinated benefits is because these models are getting better based on human feedback.

And if you see with open source models, like if you go to the /r/localllama subreddit, like there's so many variations of models that are being produced from, say, Nous research. I mean, like there's like so many variations built by so many people. And one common theme is they're all using these fine-tuning or human preferences datasets that are very limited and they're not sufficiently diverse.

And you look at the other side, say front-ends like Oobabooga or like Hugging Chat or Ollama, they don't really have feedback buttons. All the people using all these front-ends, they probably want to give feedback, but there's no way for them to give feedback... So we're just losing all of this feedback. Maybe open source models are being as used as GPT is at this point in like all kinds of, in a very fragmented way, like in aggregate all the open source models together are probably being used as much as GPT is, maybe close to that. But the amount of feedback that is driving back into the open source ecosystem is like negligible, maybe less than 1% of like the usage.

So I think like some, like the blueprint here I think is you'd want someone to create a sinkhole for the feedback... I think if we do that, if that actually happens, I think that probably has a real chance of the open source models having a runaway effect against OpenAI, I think like there's a clear chance we can take at truly winning open source."

If you're working on solving open source coordination, please get in touch!
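
For the curious, here is a minimal sketch of what a shared feedback "sinkhole" could look like: a record that any open source front-end could emit and fine-tuners could later aggregate. The schema and file-based storage are illustrative, not an existing service.

```python
# Minimal sketch of a shared feedback "sinkhole": a record any open source
# chat front-end could emit and fine-tuners could later aggregate. The schema
# and file-based storage are illustrative, not an existing service.
import json, time
from dataclasses import dataclass, asdict

@dataclass
class FeedbackRecord:
    model: str      # e.g. a Hugging Face model id
    prompt: str
    response: str
    rating: int     # +1 thumbs up, -1 thumbs down
    ts: float

def log_feedback(record: FeedbackRecord, path: str = "feedback.jsonl") -> None:
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")

log_feedback(FeedbackRecord(
    model="nous-hermes-2", prompt="Explain RLHF briefly.",
    response="...", rating=1, ts=time.time(),
))
```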

Show Notes

* Soumith Chintala Twitter

* History of PyTorch episode on Gradient Podcast

* The Llama Ecosystem

* Apple's MLX

* Neural ODEs (Ordinary Differential Equations)

* AlphaGo

* LMSys arena

* Dan Pink's "Drive"

* Robotics projects:

* Dobb-E

* OK Robot

* Yann LeCun

* Yangqing Jia of Lepton AI

* Ed Catmull

* George Hotz on Latent Space

* Chris Lattner on Latent Space

* Guillaume Lample

* Yannic Kilcher of OpenAssistant

* LMSys

* Alex Atallah of OpenRouter

* Carlo Sferrazza's 3D tactile research

* Alex Wiltschko of Osmo

* Tangent by Alex Wiltschko

* Lerrel Pinto - Robotics

Timestamps

* [00:00:00] Introductions

* [00:00:51] Extrinsic vs Intrinsic Success

* [00:02:40] Importance of Open Source and Its Impact

* [00:03:46] PyTorch vs TinyGrad

* [00:08:33] Why PyTorch is the Switzerland of frameworks

* [00:10:27] Modular's Mojo + PyTorch?

* [00:13:32] PyTorch vs Apple's MLX

* [00:16:27] FAIR / PyTorch Alumni

* [00:18:50] How can AI inference providers differentiate?

* [00:21:41] How to build good benchmarks and learnings from AnyScale's

* [00:25:28] Most interesting unexplored ideas

* [00:28:18] What people get wrong about synthetic data

* [00:35:57] Meta AI's evolution

* [00:38:42] How do you allocate 600,000 GPUs?

* [00:42:05] Even the GPU Rich are GPU Poor

* [00:47:31] Meta's MTIA silicon

* [00:50:09] Why we need open source

* [00:59:00] Open source's coordination problem for feedback gathering

* [01:08:59] Beyond text generation

* [01:15:37] Osmo and the Future of Smell Recognition Technology

Transcript

Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO in residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI.

Swyx [00:00:15]: Hey, and today we have in the studio Soumith Chintala, welcome.

Soumith [00:00:17]: Thanks for having me.

Swyx [00:00:18]: On one of your rare visits from New York where you live. You got your start in computer vision at NYU with Yann LeCun. That was a very fortuitous start. I was actually listening to your interview on the Gradient podcast. So if people want to know more about the history of Soumith, history of PyTorch, they can go to that podcast. We won't spend that much time there, but I just was marveling at your luck, or I don't know if it's your luck or your drive to find AI early and then find the right quality mentor, because I guess Yann really sort of introduced you to that world.

Soumith [00:00:51]: Yeah, I think you're talking about extrinsic success, right? A lot of people just have drive to do things that they think is fun, and a lot of those things might or might not be extrinsically perceived as good and successful. I think I just happened to like something that is now one of the coolest things in the world or whatever. But as it happens, the first thing I tried to become was a 3D VFX artist, and I was really interested in doing that, but I turned out to be very bad at it. So I ended up not doing that further. But even if I was good at that, whatever, and I ended up going down that path, I probably would have been equally happy. It's just like maybe like the perception of, oh, is this person successful or not might be different. I think like after a baseline, like your happiness is probably more correlated with your intrinsic stuff.

Swyx [00:01:44]: Yes. I think Dan Pink has this book on drive that I often refer to about the power of intrinsic motivation versus extrinsic and how long extrinsic lasts. It's not very long at all. But anyway, now you are an investor in Runway, so in a way you're working on VFX. Yes.

Soumith [00:02:01]: I mean, in a very convoluted way.

Swyx [00:02:03]: It reminds me of Ed Catmull. I don't know if you guys know, but he actually tried to become an animator in his early years and failed, or didn't get accepted by Disney, and then went and created Pixar, and then got bought by Disney and created Toy Story. So you joined Facebook in 2014 and eventually became a creator and maintainer of PyTorch. And there's this long story there you can refer to on the Gradient. I think maybe people don't know that you were also involved in more, sort of, hardware and cluster decision affairs. And we can dive into more details there because we're all about hardware this month. Yeah. And then finally, I don't know what else, like what else should people know about you on a personal side or professional side?

Soumith [00:02:40]: I think open source is definitely a big passion of mine and probably forms a little bit of my identity at this point. I'm irrationally interested in open source. I think open source has that fundamental way to distribute opportunity in a way that is very powerful. Like, I grew up in India. I didn't have internet for a while. In college, actually, I didn't have internet except for GPRS or whatever. And knowledge was very centralized, but I saw that evolution of knowledge slowly getting decentralized. And that ended up helping me learn quicker and faster for zero dollars. And I think that was a strong reason why I ended up where I am. So the open source side of things, I always push regardless of what I get paid for, like I think I would do that as a passion project on the side.

Swyx [00:03:35]: Yeah, that's wonderful. Well, we'll talk about the challenges as well that open source has, open models versus closed models. Maybe you want to touch a little bit on PyTorch before we move on to the sort of Meta AI in general.

PyTorch vs Tinygrad tradeoffs

Alessio [00:03:46]: Yeah, we kind of touched on PyTorch in a lot of episodes. So we had George Hotz from TinyGrad. He called PyTorch a CISC and TinyGrad a RISC. I would love to get your thoughts on PyTorch design direction as far as, I know you talk a lot about kind of having a happy path to start with and then making complexity hidden away but then available to the end user. One of the things that George mentioned is I think you have like 250 primitive operators in PyTorch, I think TinyGrad is four. So how do you think about some of the learnings that maybe he's going to run into that you already had in the past seven, eight years almost of running PyTorch?

Soumith [00:04:24]: Yeah, I think there's different models here, but I think it's two different models that people generally start with. Either they go like, I have a grand vision and I'm going to build a giant system that achieves this grand vision and maybe one is super feature complete or whatever. Or other people say they will get incrementally ambitious, right? And they say, oh, we'll start with something simple and then we'll slowly layer out complexity in a way that optimally applies Huffman coding or whatever. Like where the density of users is and what they're using, I would want to keep in the easy, happy path, and the more niche advanced use cases, I'll still want people to try them, but they need to take additional frictional steps. George, I think, just like we started with PyTorch, George started with the incrementally ambitious thing. I remember TinyGrad used to be, like, they would be limited to a thousand lines of code and I think now it's at 5,000. So I think there is no real magic to why PyTorch has this kind of complexity. I think it's probably partly necessitated and partly because we built with the technology available under us at that time. PyTorch is like 190,000 lines of code or something at this point. I think if you had to rewrite it, we would probably think about ways to rewrite it in a vastly simplified way for sure. But a lot of that complexity comes from the fact that, in a very simple, explainable way, you have memory hierarchies. Your CPU has three levels of caches and then you have DRAM and SSD and then you have network. Similarly, GPU has several levels of memory and then you have different levels of network hierarchies, NVLink plus InfiniBand or RoCE or something like that, right? And the way the flops are available on your hardware, they are available in a certain way and your computation is in a certain way and you have to retrofit your computation onto both the memory hierarchy and the flops available. When you're doing this, it is actually a fairly hard mathematical problem to do this setup, like you find the optimal thing. And finding the optimal thing, what is optimal depends on the input variables themselves. So like, okay, what is the shape of your input tensors and what is the operation you're trying to do and various things like that. Finding that optimal configuration and writing it down in code is not the same for every input configuration you have. Like for example, just as the shape of the tensors change, let's say you have three input tensors into a sparse matrix product or something like that. The shape of each of these input tensors will vastly change how you optimally place this operation onto the hardware in a way that will get you maximal throughput. So a lot of our complexity comes from writing out hundreds of configurations for each single PyTorch operator and templatizing these things and symbolically generating the final CUDA code or CPU code. There's no way to avoid it because mathematically we haven't found symbolic ways to do this that also keep compile time near zero. You can write a very simple framework, but then you also should be willing to eat the long compile time. So you're searching for that optimal performance at runtime, but that's the trade-off.
There's no... like, I don't think, unless we have great breakthroughs, George's vision is achievable. He should be thinking about a narrower problem, such as: I'm only going to make this work for self-driving car convnets, or I'm only going to make this work for LLM transformers of the Llama style. Like, if you start narrowing the problem down, you can make a vastly simpler framework. But if you don't, if you need the generality to power all of the AI research that is happening, and keep zero compile time, and all these other factors, I think it's not easy to avoid the complexity.
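
A toy sketch of the shape-dependent kernel selection Soumith describes: which implementation wins depends on the input shapes, so a general framework ends up carrying many configurations per operator. This is an illustration of the idea only, not PyTorch's actual dispatcher.

```python
# Toy illustration of shape-dependent kernel selection: the "best"
# implementation of one operator depends on the input shapes, so a general
# framework ends up carrying many configurations per operator.
import numpy as np

def matmul_small(a, b):        # imagine a register-tiled kernel for tiny shapes
    return a @ b

def matmul_tall_skinny(a, b):  # imagine a kernel tuned for tall, narrow outputs
    return a @ b

def matmul_generic(a, b):      # fallback configuration
    return a @ b

def pick_kernel(a: np.ndarray, b: np.ndarray):
    m, k = a.shape
    _, n = b.shape
    if m * k * n < 64 ** 3:
        return matmul_small
    if m > 16 * n:
        return matmul_tall_skinny
    return matmul_generic

a, b = np.ones((4096, 64)), np.ones((64, 8))
out = pick_kernel(a, b)(a, b)  # the dispatch is decided by shape, not by the user
```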

Pytorch vs Mojo

Alessio [00:08:33]: That's interesting. And we kind of touched on this with Chris Lattner when he was on the podcast. If you think about frameworks, they have the model target. They have the hardware target. They have different things to think about. He mentioned when he was at Google, TensorFlow trying to be optimized to make TPUs go brr, you know, and go as fast. I think George is trying to make especially the AMD stack be better than ROCm. How come PyTorch has been such a Switzerland versus just making Meta hardware go brr?

Soumith [00:09:00]: First, Meta is not in the business of selling hardware. Meta is not in the business of cloud compute. The way Meta thinks about funding PyTorch is we're funding it because it's net good for Meta to fund PyTorch, because PyTorch has become a standard and a big open source project. And generally it gives us a timeline edge. It gives us leverage and all that within our own work. So why is PyTorch more of a Switzerland rather than being opinionated? I think the way we think about it is not in terms of Switzerland or not. Actually, the way we articulate it to all the hardware vendors and software vendors who come to us saying we want to build a backend in core for PyTorch and ship it by default is: we just only look at our user side of things. Like, if users are using a particular piece of hardware, then we want to support it. We very much don't want to kingmake the hardware side of things. So as the MacBooks have GPUs and as that stuff started getting increasingly interesting, we pushed Apple to push some engineers and work on the MPS support, and we spend significant time from Meta-funded engineers on that as well, because a lot of people are using the Apple GPUs and there's demand. So we kind of mostly look at it from the demand side. We never look at it from, like, oh, which hardware should we start taking opinions on.
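
For reference, the MPS backend he mentions ships in stock PyTorch (1.12+); a minimal sketch of the usual use-it-if-present device selection:

```python
# The Apple-GPU (MPS) backend discussed here ships in stock PyTorch (1.12+);
# the usual pattern is simply "use it if present, otherwise fall back".
import torch

device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")
x = torch.randn(1024, 1024, device=device)
print((x @ x.T).device)
```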

Swyx [00:10:27]: Is there a future in which, because Mojo or Modular Mojo is kind of a superset of Python, is there a future in which PyTorch might use Mojo features optionally?

Soumith [00:10:36]: I think it depends on how well integrated it is into the Python ecosystem. So if Mojo is like a pip install and it's readily available and users feel like they can use Mojo so smoothly within their workflows in a way that just is low friction, we would definitely look into that. Like in the same way PyTorch now depends on Triton, OpenAI Triton, and we never had a conversation that was like huh, that's like a dependency. Should we just build a Triton of our own or should we use Triton? It almost doesn't, like those conversations don't really come up for us. The conversations are more well does Triton have 10,000 dependencies and is it hard to install? We almost don't look at these things from a strategic leverage point of view. We look at these things from a user experience point of view, like is it easy to install? Is it smoothly integrated and does it give enough benefits for us to start depending on it? If so, yeah, we should consider it. That's how we think about it.

Swyx [00:11:37]: You're inclusive by default as long as it meets the minimum bar of, yeah, but like maybe I phrased it wrongly. Maybe it's more like what problems would you look to solve that you have right now?

Soumith [00:11:48]: I think it depends on what problems Mojo will be useful at.

Swyx [00:11:52]: Mainly a performance pitch, some amount of cross compiling pitch.

Soumith [00:11:56]: Yeah, I think the performance pitch for Mojo was like, we're going to be performant even if you have a lot of custom stuff, you're going to write arbitrary custom things and we will be performant. And that value proposition is not clear to us from the PyTorch side to consider it for PyTorch. So PyTorch, it's actually not 250 operators, it's like a thousand operators. PyTorch exposes about a thousand operators and people kind of write their ideas in the thousand operators of PyTorch. Mojo is like, well, maybe it's okay to completely sidestep those thousand operators of PyTorch and just write it in a more natural form. Just write raw Python, write for loops or whatever, right? So from the consideration of how do we intersect PyTorch with Mojo, I can see one use case where you have custom stuff for some parts of your program, but mostly it's PyTorch. And so we can probably figure out how to make it easier for say Torch.compile to smoothly also consume Mojo subgraphs and like, you know, the interoperability being actually usable, that I think is valuable. But Mojo as a fundamental front end would be replacing PyTorch, not augmenting PyTorch. So in that sense, I don't see a synergy in more deeply integrating Mojo.
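The interop Soumith sketches would plug into torch.compile's custom backend hook, which is a real PyTorch 2.x API. The sketch below just hands back the captured FX graph's forward; a Mojo (or Triton) consumer would compile the subgraph instead.

```python
# Minimal torch.compile custom-backend sketch (PyTorch 2.x API). The backend
# receives the captured torch.fx GraphModule and returns a callable; here we
# just hand back the eager forward as a no-op "compilation".
import torch

def passthrough_backend(gm, example_inputs):
    print(f"captured a subgraph with {len(gm.graph.nodes)} nodes")
    return gm.forward

@torch.compile(backend=passthrough_backend)
def f(x):
    return torch.relu(x) * 2 + 1

print(f(torch.randn(4)))
```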

Pytorch vs MLX

Swyx [00:13:21]: So call out to Mojo whenever they have written something in Mojo and there's some performance related thing going on. And then since you mentioned Apple, what should people think of PyTorch versus MLX?

Soumith [00:13:32]: I mean, MLX is early and I know the folks well. Awni used to work at FAIR and I used to chat with him all the time. He used to be based out of New York as well. The way I think about MLX is that MLX is specialized for Apple right now. It has a happy path because it's defined its product in a narrow way. At some point MLX either says we will only be supporting Apple and we will just focus on enabling, you know, there's a framework if you use your MacBook, but once you, like, go server side or whatever, that's not my problem and I don't care. Or MLX enters the server side set of things as well. Like, one of these two things will happen, right? If the first thing happens, MLX's overall addressable market will be small, but it'll probably do well within that addressable market. If it enters the second phase, they're going to run into all the same complexities that we have to deal with. They will not have any magic wand and they will have more complex work to do. They probably wouldn't be able to move as fast.

Swyx [00:14:44]: Like having to deal with distributed compute?

Soumith [00:14:48]: Distributed, NVIDIA and AMD GPUs, like, just having a generalization of the concept of a backend, how they treat compilation, plus overheads. Right now they deeply assume, like, the whole MPS graph thing. So they need to think about all these additional things if they end up expanding onto the server side, and they'll probably build something like PyTorch as well, right? Like, eventually that's where it will land. And I think there they will kind of fail on the lack of differentiation. Like, it wouldn't be obvious to people why they would want to use it.

Swyx [00:15:24]: I mean, there are some cloud companies offering M1 and M2 chips on servers. I feel like it might be interesting for Apple to pursue that market, but it's not their core strength.

Soumith [00:15:33]: Yeah. If Apple can figure out their interconnect story, maybe, like then it can become a thing.

Swyx [00:15:40]: Honestly, that's more interesting than the cars. Yes.

Soumith [00:15:43]: I think the moat that NVIDIA has right now, I feel is that they have the interconnect that no one else has, like AMD GPUs are pretty good. I'm sure there's various silicon that is not bad at all, but the interconnect, like NVLink is uniquely awesome. I'm sure the other hardware providers are working on it, but-

Swyx [00:16:04]: I feel like when you say it's uniquely awesome, you have some appreciation of it that the rest of us don't. I mean, the rest of us just like, you know, we hear marketing lines, but what do you mean when you say NVIDIA is very good at networking? Obviously they made the acquisition maybe like 15 years ago.

Soumith [00:16:15]: Just the bandwidth it offers and the latency it offers. I mean, TPUs also have a good interconnect, but you can't buy them. So you have to go to Google to use it.

PyTorch Mafia

Alessio [00:16:27]: Who are some of the other FAIR PyTorch alumni that are building cool companies? I know you have Fireworks AI, Lightning AI, Lepton, and Yangqing, who you've known since college when he was building Caffe?

Soumith [00:16:40]: Yeah, so Yangqing and I used to be framework rivals, PyTorch, I mean, we were all a very small close-knit community back then. Caffe, Torch, Theano, Chainer, Keras, various frameworks. I mean, it used to be more like 20 frameworks. I can't remember all the names. CCV by Liu Liu, who is also based out of SF. And one of the ways it was interesting is you went into the framework guts and saw if someone wrote their own convolution kernel or they were just copying someone else's. There were four or five convolution kernels that were unique and interesting. There was one from this guy out of Russia, I forgot the name, but I remembered who was awesome enough to have written their own kernel. And at some point there, I built out these benchmarks called convnet-benchmarks. They were just benchmarking all the convolution kernels that were available at that time. It hilariously became big enough that at that time AI was getting important, but not important enough that industrial strength players came in to do these kinds of benchmarking and standardization. Like we have MLPerf today. So a lot of the startups were using convnet-benchmarks in their pitch decks as like, oh, you know, on convnet-benchmarks, this is how we fare, so you should fund us. I remember Nervana actually was at the top of the pack because Scott Gray wrote amazingly fast convolution kernels at that time. Very interesting, but separate times. But to answer your question, Alessio, I think mainly Lepton and Fireworks are the two most obvious ones, but I'm sure the fingerprints are a lot wider. They're just people who worked within the PyTorch/Caffe2 cohort of things and now end up at various other places.

Swyx [00:18:50]: I think, both as an investor and as people looking to build on top of their services, it's an uncomfortable, like, I don't know what I don't know pitch. Because I've met Yangqing and I've met Lin Qiao. Yeah, I've met these folks and they're like, you know, we are deep in the PyTorch ecosystem and we serve billions of inferences a day or whatever at Facebook and now we can do it for you. And I'm like, okay, that's great. Like, what should I be wary of or cautious of when these things happen? Because I'm like, obviously this experience is extremely powerful and valuable. I just don't know what I don't know. Like, what should people know about these sort of new inference-as-a-service companies?

Soumith [00:19:32]: I think at that point you would be investing in them for their expertise of one kind. So if they've been at a large company, but they've been doing amazing work, you would be thinking about it as what these people bring to the table is that they're really good at like GPU programming or understanding the complexity of serving models once it hits a certain scale. You know, various expertise like from the infra and AI and GPUs point of view. What you would obviously want to figure out is whether their understanding of the external markets is clear, whether they know and understand how to think about running a business, understanding how to be disciplined about making money or, you know, various things like that.

Swyx [00:20:23]: Maybe I'll put it like, actually I will de-emphasize the investing bit and just more as a potential customer. Oh, okay. Like, it's more okay, you know, you have PyTorch gods, of course. Like, what else should I know?

Soumith [00:20:37]: I mean, I would not care about who's building something. If I'm trying to be a customer, I would care about whether...

Swyx [00:20:44]: Benchmarks.

Soumith [00:20:44]: Yeah, I use it and it's usability and reliability and speed, right?

Swyx [00:20:51]: Quality as well.

Soumith [00:20:51]: Yeah, if someone from some random unknown place came to me and said, use our stuff, it's great, and I have the bandwidth, I probably will give it a shot. And if it turns out to be great, I'll just use it.

Benchmark drama

Swyx [00:21:07]: Okay, great. And then maybe one more thing about benchmarks, since we already brought it up and you brought up convnet-benchmarks. There was some recent drama around Anyscale. Anyscale released their own benchmarks and obviously they look great on their own benchmarks, but maybe didn't give the others a fair shake. I feel there are two lines of criticism. One is that they didn't test apples to apples on the kinds of endpoints that the other providers, the ones they compete with, actually offer, and that is a due-diligence baseline. And then the second would be more about just optimizing for the right thing. You had some commentary on it. I'll just kind of let you riff.

Soumith [00:21:41]: Yeah, I mean, in summary, basically my criticism of that was Anyscale built these benchmarks for end users to just understand what they should pick, right? And that's a very good thing to do. I think what they didn't do a good job of is give that end user a full understanding of what they should pick. Like they just gave them a very narrow slice of understanding. I think they just gave them latency numbers and that's not sufficient, right? You need to understand your total cost of ownership at some reasonable scale. Not oh, one API call is one cent, but a thousand API calls are 10 cents. Like people can misprice to cheat on those benchmarks. So you want to understand, okay, how much is it going to cost me if I actually subscribe to you and do like a million API calls a month or something? And then you want to understand the latency and reliability, not just from one call you made, but an aggregate of calls you've made over various times of the day and times of the week. And the nature of the workloads, is it just some generic single paragraph that you're sending that is cacheable? Or is it a test of a real-world workload? I think that kind of rigor in presenting that benchmark wasn't there. It was a much more narrow sliver of what should have been a good benchmark. That was my main criticism. And I'm pretty sure if before they released it, they showed it to their other stakeholders who would be caring about this benchmark because they are present in it, they would have easily just pointed out these gaps. And I think they didn't do that and they just released it. So I think those were the two main criticisms. I think they were fair and Robert took it well.
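
As a rough illustration of the kind of rigor Soumith is describing, a minimal benchmarking sketch might look like the following. The endpoint URL, list price, and request schema are hypothetical placeholders; a real harness would also spread calls across times of day and week and use non-cacheable, workload-representative prompts.

```python
import statistics
import time

import requests  # generic HTTP client; any inference API could be substituted

ENDPOINT = "https://api.example-provider.com/v1/completions"  # hypothetical endpoint
PRICE_PER_1K_TOKENS = 0.0005                                  # hypothetical list price (USD)

def one_call(prompt: str) -> float:
    """Wall-clock latency in seconds for a single, non-cacheable request."""
    start = time.perf_counter()
    requests.post(ENDPOINT, json={"prompt": prompt, "max_tokens": 256}, timeout=60)
    return time.perf_counter() - start

def benchmark(prompts, calls_per_month=1_000_000, avg_tokens=500):
    """Aggregate latency percentiles plus a total-cost-of-ownership estimate."""
    latencies = [one_call(p) for p in prompts]
    p50 = statistics.median(latencies)
    p95 = statistics.quantiles(latencies, n=20)[18]  # roughly the 95th percentile
    monthly_cost = calls_per_month * avg_tokens / 1000 * PRICE_PER_1K_TOKENS
    return {"p50_s": p50, "p95_s": p95, "est_monthly_cost_usd": monthly_cost}
```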

Swyx [00:23:40]: And he took it very well. And we'll have him on at some point and we'll discuss it. But I think it's important that the market is maturing enough that people start caring and competing on these kinds of things, which means we need to establish what best practice is, because otherwise everyone's going to play dirty.

Soumith [00:23:55]: Yeah, absolutely. My view of the LLM inference market in general is that it's the laundromat model. Like the margins are going to drive down towards the bare minimum. It's going to be all kinds of arbitrage between how much you can get the hardware for, how much you sell the API for, and how much latency your customers are willing to tolerate. You need to figure out how to squeeze your margins. What is your unique thing here? I think Together and Fireworks and all these people are trying to build some faster CUDA kernels and faster, you know, hardware kernels in general. But those moats only last for a month or two. These ideas quickly propagate.

Swyx [00:24:38]: Even if they're not published?

Soumith [00:24:39]: Even if they're not published, the idea space is small. So even if they're not published, the discovery rate is going to be pretty high. It's not like we're talking about a combinatorial thing that is really large. You're talking about Llama-style LLM models. And we're going to beat those to death on a few different hardware SKUs, right? It's not like we have a huge diversity of hardware that you're going to aim to run it on. Now when you have such a narrow problem and you have a lot of people working on it, the rate at which these ideas are going to get figured out is going to be pretty rapid.

Swyx [00:25:15]: Is it a standard bag of tricks? Like the standard one that I know of is, you know, fusing operators and-

Soumith [00:25:22]: Yeah, it's the standard bag of tricks on figuring out how to improve your memory bandwidth and all that, yeah.
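
For readers unfamiliar with that bag of tricks, operator fusion is the canonical example: instead of launching one kernel per elementwise op and paying a round trip through GPU memory each time, a compiler fuses the chain into fewer kernels. A minimal sketch with PyTorch 2.x's `torch.compile`, assuming a CUDA GPU is available:

```python
import torch

def mlp_block(x, w, b):
    # matmul followed by a chain of pointwise ops; the pointwise chain is
    # what a compiler like TorchInductor will typically fuse into one kernel
    return torch.relu(x @ w + b) * 2.0

compiled_block = torch.compile(mlp_block)

x = torch.randn(4096, 4096, device="cuda")
w = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, device="cuda")

out = compiled_block(x, w, b)  # first call compiles; later calls reuse the fused kernels
```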

Alessio [00:25:28]: Any ideas instead of things that are not being beaten to death that people should be paying more attention to?

Novel PyTorch Applications

Swyx [00:25:34]: One thing I was like, you know, you have a thousand operators, right? Like what's the most interesting usage of PyTorch that you're seeing maybe outside of this little bubble?

Soumith [00:25:41]: So PyTorch, it's very interesting and scary at the same time, but basically it's used in a lot of exotic ways. From the ML angle, what kind of models are being built? You get all the way from state-space models and all of these things to nth-order differentiable models, like neural ODEs and stuff like that. So there's one set of interestingness factor from the ML side of things. And then there's the other set of interesting factor from the applications point of view. It's used in Mars Rover simulations, to drug discovery, to Tesla cars. There's a huge diversity of applications in which it is used. So in terms of the application side of things, I think I'm scared at how many interesting things that are also very critical and really important it is used in. I think the scariest was when I went to visit CERN at some point and they said they were using PyTorch and they were using GANs at the same time for particle physics research. And I was scared more about the fact that they were using GANs than that they were using PyTorch, because at that time I was a researcher focusing on GANs. But the diversity is probably the most interesting, how many different things it is being used in. I think that's the most interesting to me from the applications perspective. From the models perspective, I think I've seen a lot of them. The really interesting ones to me are where we're starting to combine search and symbolic stuff with differentiable models, like the whole AlphaGo style of models is one example. And then I think we're attempting to do it for LLMs as well, with various reward models and search. I mean, I don't think PyTorch is being used in this, but the whole AlphaGeometry thing was interesting because again, it's an example of combining the symbolic models with the gradient-based ones. But there is stuff like AlphaGeometry where PyTorch is used, especially when you intersect biology and chemistry with ML. In those areas, you want stronger guarantees on the output. So yeah, maybe from the ML side, those things to me are very interesting right now.
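
As a concrete taste of the "exotic" end of that spectrum, here is a minimal neural ODE sketch in PyTorch. It assumes the third-party `torchdiffeq` package is installed; the architecture and numbers are illustrative only.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint  # third-party ODE solver with autograd support

class ODEFunc(nn.Module):
    """Parameterizes the dynamics dy/dt = f(t, y) with a small MLP."""
    def __init__(self, dim: int = 2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, dim))

    def forward(self, t, y):
        return self.net(y)

func = ODEFunc()
y0 = torch.tensor([[1.0, 0.0]])           # initial state, shape (1, 2)
t = torch.linspace(0.0, 1.0, steps=20)    # times at which to evaluate the trajectory

trajectory = odeint(func, y0, t)          # shape (20, 1, 2)
loss = trajectory[-1].pow(2).sum()        # toy objective on the final state
loss.backward()                           # gradients flow back through the solver
```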

Swyx [00:28:03]: Yeah. People are very excited about the AlphaGeometry thing. And it's kind of like, for me, it's theoretical. It's great. You can solve some Olympiad questions. I'm not sure how to make that bridge over into the real world applications, but I'm sure people smarter than me will figure it out.

Synthetic Data vs Symbolic Models

Soumith [00:28:18]: Let me give you an example of it. You know how the whole thing about synthetic data will be the next rage in LLMs is a thing?

Swyx [00:28:27]: Already is a rage.

Soumith [00:28:28]: Which I think is fairly misplaced in how people perceive it. People think synthetic data is some kind of magic wand that you wave and it's going to be amazing. Synthetic data is useful in neural networks right now because we as humans have figured out a bunch of symbolic models of the world, or made up certain symbolic models because of human innate biases. So we've figured out how to ground particle physics in a 30-parameter model. And it's just very hard to compute, as in it takes a lot of flops to compute, but it only has 30 parameters or so. I mean, I'm not a physics expert, but it's a very low rank model. We built mathematics as a field that basically is very low rank. Language, a deep understanding of language, like the whole syntactic parse trees and just understanding how language can be broken down into a formal symbolism, is something that we figured out. So we basically as humans have accumulated all this knowledge on these subjects: either synthetic, we created those subjects in our heads, or we grounded some real world phenomenon into a set of symbols. But we haven't figured out how to teach neural networks symbolic world models directly. The only way we have to teach them is generating a bunch of inputs and outputs and gradient descending over them. So in areas where we have the symbolic models and we need to teach all the knowledge we have that is better encoded in the symbolic models, what we're doing is we're generating a bunch of synthetic data, a bunch of input-output pairs, and then giving that to the neural network and asking it to learn, via gradient descent and in a much more over-parameterized way, the same thing that we already have a better low rank model of. Outside of this, where we don't have good symbolic models, synthetic data obviously doesn't make any sense. So synthetic data is not a magic wand that will work in every case. It's just for things we as humans already have good symbolic models of. We need to impart that knowledge to neural networks, and we figured out that synthetic data is a vehicle to impart this knowledge with. But people, maybe because they don't know enough about synthetic data as a notion, hear, you know, the next wave of data revolution is synthetic data, and they think it's some kind of magic where we just create a bunch of random data somehow. They don't think about how, and then they think that's just a revolution. And I think that's maybe a gap in understanding most people have in this hype cycle.
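
A minimal sketch of the pattern Soumith describes: take a cheap symbolic (closed-form) model of the world and use it to generate input/output pairs that a neural network can then be trained on. The projectile-range formula and the prompt format are arbitrary illustrations.

```python
import math
import random

def projectile_range(v0: float, angle_deg: float, g: float = 9.81) -> float:
    """Symbolic model: closed-form range of a projectile on flat ground."""
    theta = math.radians(angle_deg)
    return v0 ** 2 * math.sin(2 * theta) / g

def make_synthetic_pairs(n: int = 10_000):
    """Generate question/answer pairs from the symbolic model for supervised training."""
    pairs = []
    for _ in range(n):
        v0 = random.uniform(1.0, 100.0)
        angle = random.uniform(1.0, 89.0)
        question = (f"A projectile is launched at {v0:.1f} m/s at an angle of "
                    f"{angle:.1f} degrees. How far does it travel?")
        answer = f"{projectile_range(v0, angle):.2f} m"
        pairs.append({"input": question, "output": answer})
    return pairs

dataset = make_synthetic_pairs(100)  # feed these pairs to any supervised fine-tuning loop
```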

Swyx [00:31:23]: Yeah, well, it's a relatively new concept, so. Oh, there's two more that I'll put in front of you and then you can see what you respond to. One is, you know, I have this joke that it's only synthetic data if it's from the Mistral region of France, otherwise it's just a sparkling distillation, which is what Nous Research is doing. Like they're distilling GPT-4 by creating synthetic data from GPT-4, creating mock textbooks inspired by Phi-2 and then fine-tuning open source models like Llama. And so I don't know, I mean, I think that's, should we call that synthetic data? Should we call it something else? I don't know.

Soumith [00:31:57]: Yeah, I mean, the outputs of LLMs, are they synthetic data? They probably are, but I think it depends on the goal you have. If your goal is you're creating synthetic data with the goal of trying to distill GPT-4's superiority into another model, I guess you can call it synthetic data, but it also feels like disingenuous because your goal is I need to copy the behavior of GPT-4 and-

Swyx [00:32:25]: It's also not just behavior, but data set. So I've often thought of this as data set washing. Like you need one model at the top of the chain, you know, unnamed French company that has that, you know, makes a model that has all the data in it that we don't know where it's from, but it's open source, hey, and then we distill from that and it's great. To be fair, they also use larger models as judges for preference ranking, right? So that is, I think, a very, very accepted use of synthetic.

Soumith [00:32:53]: Correct. I think it's a very interesting time where we don't really have good social models of what is acceptable depending on how many bits of information you use from someone else, right? It's like, okay, you use one bit. Is that okay? Yeah, let's accept it to be okay. Okay, what about if you use 20 bits? Is that okay? I don't know. What if you use 200 bits? I don't think we as a society have ever been in this conundrum where we have to be like, where is the boundary of copyright, or where is the boundary of socially accepted understanding of copying someone else? We haven't been tested on this, mathematically, before,

Swyx [00:33:38]: in my opinion. Whether it's transformative use. Yes. So yeah, I think this New York Times versus OpenAI case is gonna go to the Supreme Court and we'll have to decide it because I think we never had to deal with it before. And then finally, for synthetic data, the thing that I'm personally exploring is solving this stark paradigm difference between RAG and fine-tuning, where you can kind of create synthetic data off of your retrieved documents and then fine-tune on that. That's kind of synthetic. All you need is variation or diversity of samples for you to fine-tune on. And then you can fine-tune new knowledge into your model. I don't know if you've seen that as a direction for synthetic data.

Soumith [00:34:13]: I think you're basically trying to, what you're doing is you're saying, well, language, I know how to parametrize language to an extent. And I need to teach my model variations of this input data so that it's resilient or invariant to language uses of that data.

Swyx [00:34:32]: Yeah, it doesn't overfit on the wrong source documents.

Soumith [00:34:33]: So I think that's 100% synthetic. You understand, the key is you create variations of your documents and you know how to do that because you have a symbolic model or like some implicit symbolic model of language.
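
A minimal sketch of that direction, assuming a hypothetical `llm` callable (prompt string in, completion string out) and an already-retrieved list of documents; everything here is illustrative rather than any specific product's API.

```python
import random

def make_finetune_examples(retrieved_docs, llm, variations_per_doc=5):
    """Turn retrieved documents into varied Q/A pairs for fine-tuning.

    The goal is diversity: many phrasings of the same underlying fact, so the
    fine-tuned model learns the knowledge rather than one source string.
    """
    examples = []
    for doc in retrieved_docs:
        for _ in range(variations_per_doc):
            question = llm(
                "Write one question that is answered by the passage below, "
                "phrased differently from previous questions.\n\n" + doc
            )
            answer = llm(
                "Answer the question using only the passage.\n\n"
                f"Passage: {doc}\n\nQuestion: {question}"
            )
            examples.append({"prompt": question, "completion": answer})
    random.shuffle(examples)
    return examples
```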

Swyx [00:34:48]: Okay.

Alessio [00:34:49]: Do you think the issue with symbolic models is just the architecture of the language models that we're building? I think maybe the thing that people grasp onto is the inability of transformers to deal with numbers because of the tokenizer. Is it a fundamental issue there too? And do you see alternative architectures that will be better at symbolic understanding?

Soumith [00:35:09]: I am not sure if it's a fundamental issue or not. I think we just don't understand transformers enough. I don't even mean transformers as an architecture. I mean the use of transformers today, like combining the tokenizer and transformers and the dynamics of training, when you show math heavy questions versus not. I don't have a good calibration of whether I know the answer or not. I, you know, there's common criticisms that are, you know, transformers will just fail at X. But then when you scale them up to sufficient scale, they actually don't fail at that X. I think there's this entire subfield where they're trying to figure out these answers called like the science of deep learning or something. So we'll get to know more. I don't know the answer.
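
To make the tokenizer point concrete, here is a small sketch using the `tiktoken` package (assuming it is installed). The exact splits depend on the encoding, but the general effect is that nearby numbers get chopped into different numbers of tokens, so the model never sees digits in one consistent format.

```python
import tiktoken  # OpenAI's open source BPE tokenizer package

enc = tiktoken.get_encoding("cl100k_base")

for s in ["7", "77", "777", "7777", "123456789"]:
    tokens = enc.encode(s)
    # e.g. a 4-digit number may split into two tokens while a 3-digit one stays whole
    print(f"{s!r:>13} -> {len(tokens)} token(s): {tokens}")
```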

Meta AI and Llama 2/3

Swyx [00:35:57]: Got it. Let's touch a little bit on just Meta AI and, you know, stuff that's going on there. Maybe, I don't know how deeply you're personally involved in it, but you're our first guest from Meta AI, which is really fantastic. And Llama 1 was, you know, you are such a believer in open source, Llama 1 was more or less the real breakthrough in open source AI. The most interesting thing for us to cover on this podcast was the death of Chinchilla, as people say. Any interesting insights there around the scaling laws for open source models or smaller models, or whatever that design decision was when you guys were doing it?

Soumith [00:36:31]: So Llama 1 was Guillaume Lample and team. There was OPT before, which I think I'm also very proud of, because we bridged the gap in the world's understanding of how complex it is to train these models. Until then, no one had really published it in gory detail.

Swyx [00:36:50]: The logs.

Soumith [00:36:51]: Yeah. Like, why is it complex? And everyone says, oh, it's complex. But no one really talked about why it's complex. I think OPT was cool.

Swyx [00:37:02]: I met Susan and she's very, very outspoken. Yeah.

Soumith [00:37:05]: We probably, I think, didn't train it for long enough, right? That's kind of obvious in retrospect.

Swyx [00:37:12]: For a 175B. Yeah. You trained it according to Chinchilla at the time or?

Soumith [00:37:17]: I can't remember the details, but I think it's a commonly held belief at this point that if we trained OPT longer, it would actually end up being better. Llama 1, I think, was Guillaume Lample and team. Guillaume is fantastic and went on to build Mistral. I wasn't too involved in that side of things. So I don't know what you're asking me, which is how did they think about scaling laws and all of that? Llama 2, I was more closely involved in. I helped them a reasonable amount with their infrastructure needs and stuff. And Llama 2, I think, was more like, let's get to the evolution. At that point, we kind of understood what we were missing from the industry's understanding of LLMs. We needed more data and we needed to train the models for longer. And we made, I think, a few tweaks to the architecture and we scaled up more. And that was Llama 2. I think Llama 2, you can think of it as, after Guillaume left, the team kind of rebuilt their muscle around Llama 2. And Hugo, I think, who's the first author, is fantastic. And I think he did play a reasonably big role in Llama 1 as well.

Soumith [00:38:35]: And he overlaps between Llama 1 and 2. So in Llama 3, obviously, hopefully, it'll be awesome.

Alessio [00:38:42]: Just one question on Llama 2, and then we'll try and fish Llama 3 spoilers out of you. In the Llama 2 paper, the loss curves of the 34B and 70B parameter models still seem kind of steep, like they could go lower. From an infrastructure level, how do you allocate resources? Could they have just gone longer, or was it, hey, this is all the GPUs that we can burn, let's just move on to Llama 3 and make that one better?

Soumith [00:39:07]: Instead of answering specifically about that Llama 2 situation or whatever, I'll tell you how we think about things. Generally, I mean, Mark did release some numbers, right?

Swyx [00:39:20]: So let's cite those things again. All I remember is like 600K GPUs.

Soumith [00:39:24]: That is by the end of this year: 600K H100 equivalents. That's 250K H100s, and including all of our other GPU or accelerator stuff, it would be 600-and-something-K aggregate capacity.

Swyx [00:39:38]: That's a lot of GPUs.

Soumith [00:39:39]: We'll talk about that separately. But the way we think about it is we have a train of models, right? Llama 1, 2, 3, 4. And we have a bunch of GPUs. I don't think we're short of GPUs. Like-

Swyx [00:39:54]: Yeah, no, I wouldn't say so. Yeah, so it's all a matter of time.

Soumith [00:39:56]: I think time is the biggest bottleneck. It's like, when do you stop training the previous one and when do you start training the next one? And how do you make those decisions? The data, do you have net new data, better clean data for the next one in a way that it's not worth really focusing on the previous one? It's just a standard iterative product. You're like, when is the iPhone 1? When do you start working on iPhone 2? Where is the iPhone? And so on, right? So mostly the considerations are time and generation, rather than GPUs, in my opinion.

Alessio [00:40:31]: So one of the things with the scaling laws: Chinchilla optimal is about the training compute budget, but I think at Meta's scale you would rather pay a lot more at training and then save on inference. How do you think about that from an infrastructure perspective? I think in your tweet, you say people can try and guess how we're using these GPUs. Can you just give people a bit of understanding? Because I've already seen a lot of VCs say, Llama 3 has been trained on 600,000 GPUs, and that's obviously not true, I'm sure. How do you allocate between the research, FAIR, and the Llama training, the inference on Instagram suggestions that get me to scroll, the AI-generated stickers on WhatsApp, and all of that?

Soumith [00:41:11]: Yeah, we haven't talked about any of this publicly, but as a broad stroke, it's like how we would allocate resources of any other kinds at any company. You run a VC portfolio, how do you allocate your investments between different companies or whatever? You kind of make various trade-offs and you kind of decide, should I invest in this project or this other project, or how much should I invest in this project? It's very much a zero sum of trade-offs. And it also comes into play, how are your clusters configured, like overall, what you can fit of what size and what cluster and so on. So broadly, there's no magic sauce here. I mean, I think the details would add more spice, but also wouldn't add more understanding. It's just gonna be like, oh, okay, I mean, this looks like they just think about this as I would normally do.

Alessio [00:42:05]: So even the GPU rich run through the same struggles of having to decide where to allocate things.

Soumith [00:42:11]: Yeah, I mean, at some point I forgot who said it, but you kind of fit your models to the amount of compute you have. If you don't have enough compute, you figure out how to make do with smaller models. But no one as of today, I think would feel like they have enough compute. I don't think I've heard any company within the AI space be like, oh yeah, like we feel like we have sufficient compute and we couldn't have done better. So that conversation, I don't think I've heard from any of my friends at other companies.

Eleuther

Swyx [00:42:47]: Stella from Eleuther sometimes says that because she has a lot of donated compute. She's trying to put it to interesting uses, but for some reason she's decided to stop making large models.

Soumith [00:42:57]: I mean, that's a cool, high conviction opinion that might pay out.

Swyx [00:43:01]: Why?

Soumith [00:43:02]: I mean, she's taking a path that most people don't care to take in this climate, and she probably will have very differentiated ideas. I mean, think about the correlation of ideas in AI right now. It's so bad, right? Everyone's fighting for the same pie. In some weird sense, that's partly why I don't really directly work on LLMs. I used to do image models and stuff, and I actually stopped doing GANs because GANs were getting so hot that I didn't have any calibration of whether my work would be useful or not, because, oh yeah, someone else did the same thing you did. There's so much to do, I don't understand why I need to fight for the same pie. So I think Stella's decision is very smart.

Making Bets

Alessio [00:43:53]: And how do you reconcile that with how we started the discussion about intrinsic versus extrinsic kind of accomplishment or success? How should people think about that, especially when they're doing a PhD or early in their career? I think at NeurIPS, I walked through a lot of the posters and whatnot, and there seems to be a mode collapse, in a way, in the research, a lot of people working on the same things. Is it worth it for a PhD student to not take a bet on something that is maybe not as interesting, just because of funding and visibility and whatnot? Or, yeah, what suggestions would you give?

Soumith [00:44:28]: I think there's a baseline level of compatibility you need to have with the field. Basically, you need to figure out if you will get paid enough to eat, right? Like whatever reasonable normal lifestyle you want to have as a baseline. So you at least have to pick a problem within the neighborhood of fundable. Like you wouldn't wanna be doing something so obscure that people are like, I don't know, like you can work on it.

Swyx [00:44:59]: Would a limit on fundability be, I'm just observing, something like three months of compute? That's the top line, the max that you can spend on any one project.

Soumith [00:45:09]: But I think that's very ill-specified, like how much compute, right? I think that the notion of fundability is broader. It's more like, hey, are these families of models within the acceptable set, the you're-not-crazy-or-something set, right? Even something like neural ODEs, which is a very boundary-pushing thing, or state-space models or whatever. All of these things I think are still in fundable territory. When you're talking about, I'm gonna do one of the neuromorphic models and then apply image classification to them or something, then it becomes a bit questionable. Again, it depends on your motivation. Maybe if you're a neuroscientist, it actually is feasible. But if you're an AI engineer, like the audience of these podcasts, then it's more questionable. The way I think about it is, you need to figure out how you can be at the baseline level of fundability just so that you can just live. And then after that, really focus on intrinsic motivation and, depending on your strengths, how you can play to your strengths and your interests at the same time. I try to look at a bunch of ideas that are interesting to me, but also try to play to my strengths. I'm not gonna go work on theoretical ML. I'm interested in it, but when I want to work on something like that, I try to partner with someone who is actually a good theoretical ML person and see if I actually have any value to provide. And if they think I do, then I come in. So I think you'd want to find that intersection of ideas you like that also play to your strengths, and I'd go from there. Everything else, like actually finding extrinsic success and all of that, the way I think about it is somewhat immaterial. When you're talking about building ecosystems and stuff, slightly different considerations come into play, but that's a different conversation.

Swyx [00:47:06]: We're gonna pivot a little bit to just talking about open source AI. But one more thing I wanted to establish for Meta is this 600K number, just kind of rounding out the discussion, that's for all Meta. So including your own inference needs, right? It's not just about training.

Soumith [00:47:19]: It's gonna be the number in our data centers for all of Meta, yeah.

Swyx [00:47:23]: Yeah, so there's a decent amount of workload serving Facebook and Instagram and whatever. And then is there interest in like your own hardware?

MTIA

Soumith [00:47:31]: We already talked about our own hardware. It's called MTIA. Our own silicon, I think we've even showed the standard photograph of you holding the chip that doesn't work. Like as in the chip that you basically just get like-

Swyx [00:47:51]: As a test, right?

Soumith [00:47:52]: Yeah, a test chip or whatever. So we are working on our silicon and we'll probably talk more about it when the time is right, but-

Swyx [00:48:00]: Like what gaps do you have that the market doesn't offer?

Soumith [00:48:04]: Okay, I mean, this is easy to answer. So basically, remember how I told you about there's this memory hierarchy and like sweet spots and all of that? Fundamentally, when you build a hardware, you make it general enough that a wide set of customers and a wide set of workloads can use it effectively, while trying to get the maximum level of performance they can. The more specialized you make the chip, the more hardware efficient it's going to be, the more power efficient it's gonna be, and the easier it's going to be to write the software, like the kernels, to map that one or two workloads to that hardware and so on. So it's pretty well understood across the industry that if you have a sufficiently large volume, enough workload, you can specialize it and get some efficiency gains, like power gains and so on. So the way you can think about everyone building their own silicon, and I think a bunch of the other large companies are building their own silicon as well, is that each large company has a sufficient set of verticalized workloads that can be specialized, that have a pattern to them, that a more generic accelerator like an NVIDIA or an AMD GPU does not exploit. So there is some level of power efficiency that you're leaving on the table by not exploiting that. And you have sufficient scale and you have sufficient forecasted stability that those workloads will exist in the same form, that it's worth spending the time to build out a chip to exploit that sweet spot. Obviously something like this is only useful if you hit a certain scale and your forecasted prediction of those kinds of workloads staying in the same specializable, exploitable form holds true. So yeah, that's why we're building our own chips.

Swyx [00:50:08]: Awesome.

Open Source AI

Alessio [00:50:09]: Yeah, I know we've been talking a lot on a lot of different topics and going back to open source, you had a very good tweet. You said that a single company's closed source effort rate limits against people's imaginations and needs. How do you think about all the impact that some of the Meta AI work in open source has been doing and maybe directions of the whole open source AI space?

Soumith [00:50:32]: Yeah, in general, I think first, I think it's worth talking about this in terms of open and not just open source, because with the whole notion of model weights, no one even knows what source means for these things. But just for the discussion, when I say open source, you can assume I'm just talking about open. And then there's the whole notion of licensing and all that, commercial, non-commercial, commercial with clauses and all that. I think at a fundamental level, the biggest value of open source is that you make the distribution very wide. It's just available with no friction and people can do transformative things in a way that's very accessible. Maybe it's open source, but it has a commercial license and I'm a student in India. I don't care about the license. I just don't even understand the license. But the fact that I can use it and do something with it is very transformative to me. I got this thing in a very accessible way. And then it's various degrees, right? And then if it's open source, but it's actually under a commercial license, then a lot of companies are gonna benefit from gaining value that they didn't previously have, that they maybe had to pay a closed source company for. So open source is just a very interesting tool that you can use in various ways. So there's, again, two kinds of open source. One is some large company doing a lot of work and then open sourcing it. And that kind of effort is not really feasible by, say, a band of volunteers doing it the same way. So there's both a capital and operational expenditure that the large company just decided to ignore and give it away to the world for some benefits of some kind. They're not as tangible as direct revenue. So in that part, Meta has been doing incredibly good things. They fund a huge amount of the PyTorch development. They've open sourced Llama and that family of models and several other fairly transformative projects. FAISS is one, Segment Anything, Detectron, Detectron 2, DensePose. I mean, it's-

Swyx [00:52:52]: Seamless. Yeah, seamless.

Soumith [00:52:53]: Like it's just the list is so long that we're not gonna cover. So I think Meta comes into that category where we spend a lot of CapEx and OpEx and we have a high talent density of great AI people and we open our stuff. And the thesis for that, I remember when FAIR was started, the common thing was like, wait, why would Meta wanna start a open AI lab? Like what exactly is a benefit from a commercial perspective? And for then the thesis was very simple. It was AI is currently rate limiting Meta's ability to do things. Our ability to build various product integrations, moderation, various other factors. Like AI was the limiting factor and we just wanted AI to advance more and we didn't care if the IP of the AI was uniquely in our possession or not. However the field advances, that accelerates Meta's ability to build a better product. So we just built an open AI lab and we said, if this helps accelerate the progress of AI, that's strictly great for us. But very easy, rational, right? Still the same to a large extent with the Llama stuff. And it's the same values, but the argument, it's a bit more nuanced. And then there's a second kind of open source, which is, oh, we built this project, nights and weekends and we're very smart people and we open sourced it and then we built a community around it. This is the Linux kernel and various software projects like that. So I think about open source, like both of these things being beneficial and both of these things being different. They're different and beneficial in their own ways. The second one is really useful when there's an active arbitrage to be done. If someone's not really looking at a particular space because it's not commercially viable or whatever, like a band of volunteers can just coordinate online and do something and then make that happen. And that's great.

Open Source LLMs

I wanna cover a little bit about open source LLMs maybe. So open source LLMs have been very interesting because I think we were trending towards an increase in open source in AI from 2010 all the way to 2017 or something, where more and more pressure within the community was to open source their stuff so that their methods and stuff get adopted. And then the LLMs revolution kind of took the opposite turn. OpenAI stopped open sourcing their stuff, DeepMind kind of didn't, and all the other cloud and other providers, they didn't open source their stuff. And it was not good in the sense that, first, science done in isolation probably will just form its own bubble where people believe their own b******t or whatever. So there's that problem. And then there was the other problem, which was the accessibility part. Like, okay, I again always go back to, I'm a student in India with no money. What is my accessibility to any of these closed models? At some scale I have to pay money. That makes it a non-starter and stuff. And there's also the control thing. I strongly believe if you want human-aligned stuff, you want all humans to give feedback. And you want all humans to have access to that technology in the first place. And I actually have seen, living in New York, whenever I come to Silicon Valley, I see a different cultural bubble. All the friends I hang out with talk about some random thing like Dyson Spheres or whatever, that's a thing. And most of the world doesn't know or care about any of this stuff. It's definitely a bubble and bubbles can form very easily. And when you make a lot of decisions because you're in a bubble, they're probably not globally optimal decisions. So I think open source, the distribution of open source, powers a certain kind of non-falsifiability that I think is very important. I think on the open source models, it's going great in the fact that LoRA, I think, came out of the necessity of open source models needing to be fine-tunable in some way. Yeah, and I think DPO also came out of the academic open source side of things. So do any of the closed source labs, did any of them already have LoRA or DPO internally? Maybe, but that does not advance humanity in any way. It advances some company's probability of doing the winner-takes-all that I talked about earlier in the podcast.
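
For readers who haven't seen it, a minimal LoRA fine-tuning setup in the Hugging Face ecosystem looks roughly like the sketch below, assuming the `transformers` and `peft` packages are installed; the checkpoint name is just an example (and gated), and which modules to target varies by architecture.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model  # parameter-efficient fine-tuning library

# Any causal LM works here; this particular checkpoint is shown only as an example.
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; names differ per model family
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the full parameter count
```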

Open Source and Trust

I don't know, it just feels fundamentally good. When people try to, you know, people are like, well, what are the ways in which it is not okay? I find most of these arguments, and this might be a little controversial, but I find a lot of arguments based on whether closed source models are safer or open source models are safer very much related to what kind of culture they grew up in, what kind of society they grew up in. If they grew up in a society that they trusted, then I think they take the closed source argument. And if they grew up in a society that they couldn't trust, where the norm was that you didn't trust your government, obviously it's corrupt or whatever, then I think the open source argument is what they take. I think there's a deep connection to people's innate biases from their childhood and their trust in society and governmental aspects that push them towards one opinion or the other. And I'm definitely in the camp of open source is definitely going to actually have better outcomes for society. Closed source to me just means centralization of power, which, you know, is really hard to trust. So I think it's going well in so many ways: we're actively disaggregating the centralization of power to just two or three providers. We are, I think, benefiting from so many people using these models in so many ways that aren't allowed by, say, Silicon Valley left-wing tropes. Some of these things are good or bad, but they're not culturally accepted universally in the world. So those are things worth thinking about. Those are all the ways in which, as I mentioned, open source is actually being very good and beneficial and winning. But I think there are also ways in which open source is not winning.

Feedback to solve the Open Source Coordination problem

I think one of the ways in which it's not winning, and at some point I should write a long-form post about this, is that it has a classic coordination problem. I mean, open source in general always has a coordination problem. If there's a vertically integrated provider with more resources, they will just be better coordinated than open source. And so now open source has to figure out how to have coordinated benefits. And the reason you want coordinated benefits is because these models are getting better based on human feedback. And if you look at open source models, if you go to the LocalLLaMA subreddit, there are so many variations of models being produced from, say, Nous Research. I mean, there are so many variations built by so many people. And one common theme is they're all using these fine-tuning or human preference datasets that are very limited. Someone published them somewhere and they're not sufficiently diverse. And you look at the other side, say front-ends like Ooba or HuggingChat or Ollama, and they don't really have feedback buttons. All the people using all these front-ends probably want to give feedback, but there's no way for them to give feedback. So these models are being built, they're being arbitrarily measured, and then they are being deployed into all these open source front-ends, or apps that are closed source but serving open source models. And these front-ends are not exposing the ability to give feedback. So we're just losing all of this feedback. Maybe open source models are being used as much as GPT is at this point, just in a very fragmented way; in aggregate, all the open source models together are probably being used as much as GPT is, or close to that. But the amount of feedback that is driving back into the open source ecosystem is negligible, maybe less than 1% of the usage. So I think the blueprint here is you'd want someone to create a sinkhole for the feedback, some centralized sinkhole. Maybe Hugging Face or someone just funds it, like, okay, I will make available a call to log a string along with a bit of information, positive or negative, or something like that. And then you would want to send pull requests to all the open source front-ends like Ooba and all, being like, hey, we're just integrating a feedback UI, and then work with the closed source people as well, being like, look, it doesn't cost you anything, just have a button. And then the sinkhole will have a bunch of this data coming in. And then I think a bunch of open source researchers should figure out how to filter that feedback down to only the high quality stuff. I'm sure it will be exploited by spam bots or whatever, right? This is the perfect way to inject your advertising product into the next model. So there needs to be some level of filtering, in the same way that I'm sure all the closed providers are doing today; like OpenAI, Claude, the feedback that comes in, I'm sure they are figuring out if it's legit or not. That kind of data filtering needs to be done, and that loop has to be set up. And this requires both that central sinkhole and that data cleaning effort to be there. They're not there right now, I think for capital reasons, but also for coordination reasons: okay, if that central sinkhole is there, who's gonna go coordinate all of this integration across all of these open source front-ends?
But I think if we do that, if that actually happens, I think that probably has a real chance of the open source models having a runaway effect against OpenAI with their current daily active users. Probably doesn't have a chance against Google because you know, Google has Android and Chrome and Gmail and Google Docs and everything, you know? So people just use that a lot. But like, I think there's a clear chance we can take at truly winning open source.
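
As a rough sketch of what such a central feedback sinkhole could look like, here is a minimal, hypothetical HTTP endpoint (FastAPI is used purely for illustration; the route, schema, and storage are all assumptions, not an existing service).

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Feedback(BaseModel):
    model: str      # which open source checkpoint served the response
    prompt: str
    response: str
    rating: int     # +1 / -1 from a thumbs-up / thumbs-down button in the front-end

FEEDBACK_LOG: list[Feedback] = []  # in practice: a durable store plus spam and quality filtering

@app.post("/v1/feedback")
def log_feedback(item: Feedback) -> dict:
    """Front-ends (open source UIs or closed apps serving open models) would POST here."""
    FEEDBACK_LOG.append(item)
    return {"ok": True}
```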

AGI

Alessio [01:04:00]: Do you think this feedback is helpful to make open source models better or to get to like open source AGI? Because in a way like OpenAI's goal is to get to AGI, right? So versus I think in open source, we're more focused on personal better usage or like commercial better usage.

Soumith [01:04:17]: Yeah, I think that's a good question. But I actually don't think people have a good understanding of AGI. And I don't mean at the definition level. I mean, people are like, okay, AGI means it's powering 40% of world economic output or something like that, right? But what does that mean? Do you think electricity is powering 40% of world economic output or is it not? Generally the notion of powering X percent of economic output is not defined well enough for me to understand how to know when we got to AGI, or how to measure whether we're getting to AGI. You can look at it in terms of intelligence or task automation or whatever. I think that's what we are doing right now. We're basically integrating the current set of AI technologies into so many real world use cases where we find value that, if some new version of AI comes in, we can be like, ah, this helps me more. In that sense, I think the whole process of how we get to AGI will be continuous and not discontinuous, like how I think the question is posed. So I think the open source thing will be very much in line with getting to AGI, because open source has that natural selection effect. If a better open source model comes along, really no one says, ha, I don't want to use it because of ecosystem effects, I'm locked into my ecosystem, or I don't know if I like the models, you know, whatever. It's just a very pure, direct thing. So if there's a better model that comes out, it will be used. So I definitely think it has a good chance of achieving what I would think of as a continuous path to what we might define as AGI.

OpenAssistant vs LMSys vs OpenRouter

Swyx [01:06:18]: For the listeners, I would actually mention a couple other maybe related notes on just this very interesting concept of a feedback sinkhole for open source to really catch up, in terms of the overall Google versus OpenAI debate. Open Assistant was led by Yannic Kilcher, who recently ended his effort. I think the criticism there was that the kind of people who go to a specific website to give feedback is not representative of real world usage. And that's why the models trained on Open Assistant didn't really seem to catch on in the open source world. The two leading candidates in my mind are LMSYS out of UC Berkeley, who have the LMSYS arena, which is being touted as one of the only reliable benchmarks anymore. I kind of call them non-parametric benchmarks because there's nothing to cheat on except for Elo. And then the other one is OpenRouter, which is Alex Atallah's thing. I don't know if you've talked to any of these people.

Soumith [01:07:11]: I obviously know all of the efforts that you talked about. I haven't talked to them directly about this yet. But the way I think about it is the way these models are going to be used is always going to be way more distributed than centralized. Like, which is the power of the open source movement. Like the UI within which these models are going to be used is going to be decentralized. These models are going to be integrated into hundreds and thousands of projects and products and all of that. And I think that is important to recognize. Like the LMSYS leaderboard is the best thing we have right now to understand whether a model is better or not versus another model. But it's also biased in only having a sliver of view into how people actually use these models. Like the people who actually end up coming to the LMSYS leaderboard and then using a model only use it for certain things. Like GitHub Copilot style usage is not captured in say LMSYS things. And so many other styles, like the character AI style things is not captured in LMSYS.

Swyx [01:08:19]: Which OpenRouter could do. They don't do it right now, but.

Soumith [01:08:22]: Yeah, so my point is like the way these models are going to be used is going to be always a large surface area. And I think we need to figure out how to provide the infrastructure to integrate with all these like ways in which it's being used. Even if you get the top hundred front ends that the model, like open source models are used through to subscribe to the sinkhole. I think that's already a substantial thing. I think thinking one or two things will by themselves get a lot of data I think is not going to happen.

Swyx [01:08:58]: Yeah, fair enough.
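
For context on the Elo point above, arena-style leaderboards turn pairwise human votes into ratings with an update in the spirit of the sketch below. The real LMSYS computation uses a Bradley-Terry-style fit, so treat this as an illustration of the idea, not their exact method.

```python
def elo_update(r_a: float, r_b: float, winner: str, k: float = 32.0):
    """One pairwise-preference update; winner is "a", "b", or "tie"."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    new_a = r_a + k * (score_a - expected_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Two models start at 1000; model A wins a head-to-head vote.
print(elo_update(1000.0, 1000.0, "a"))  # -> (1016.0, 984.0)
```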

Other Modalities

Alessio [01:08:59]: Before we let you go, can we do just a quick beyond-text segment? So you're an investor in Runway, which does video generation. You're an investor in 1X, which is building a humanoid assistant. Osmo, which is focused on using AI for smell recognition and synthesis. You advise a bunch of robotics projects at NYU.

Swyx [01:09:19]: Maybe. And he builds his own home robot. Yeah, exactly.

Alessio [01:09:22]: On a more, yeah, maybe open editing. What are the things that you're most excited about beyond text generation and kind of the more mundane usage?

Soumith [01:09:30]: Yeah, I mean, in general, I have more things I'm generally excited about than I can possibly do. Investing is one way to try to clear those urges. I'm generally excited about robotics being a possibility, home robotics being five to seven years away from commercialization. I think it's not next year or two years from now, but five to seven years from now, a lot more robotics companies might pop up. There's not a good consensus on whether hardware is the bottleneck or AI is the bottleneck in robotics right now. My view is actually hardware is still the bottleneck and AI is also a little bit of a bottleneck, but I don't think there are any obvious breakthroughs we need. I think it's just work. So I'm generally excited about robotics. I spend a lot of personal time on it. I spend every Wednesday afternoon at NYU working with Lerrel Pinto and team, just getting towards my home robot that does my dishes and stuff.

Swyx [01:10:38]: What's the status of it? Like what does it do for you now?

Soumith [01:10:41]: As of today, well, a couple of months ago, we deployed our home robotics stuff into several tens of New York City homes and tried to make it do a bunch of tasks. And we're basically starting to build out a framework that gets to a certain level of robustness on fairly simple tasks, like picking up a cup and putting it somewhere else, or taking a few pieces of cloth on the ground and putting them somewhere else, or opening your microwave, various baseline tasks, with low sample complexity. So I think one of the things people don't spend a lot of time on in robotics is the user experience, which in the research I do at NYU, we spend a huge amount of time on. I think the key there is that sample complexity has to be really low. A lot of the current robotics research, if you look at it, they're like, oh yeah, we collected 50 demos and now it's able to do this task, or we collected 300 demos; the number of samples you need for this thing to do the task is really high. So we're focusing a lot on, you show it two or three times and that's sufficient for it to actually do the task, but it comes with less generalization, right? There are some initial conditions that have to be true for it to do the task. So we're making progress. That's very interesting in general, the space. I don't think people in this space have settled on the hardware, like how the hardware looks for it to be truly useful in the home or whatever, or the UX, or the AI and ML stuff needed to make it sample efficient, and all of that. But lots of work is happening in the field.

Alessio [01:12:28]: Yeah, one of my friends, Carlo at Berkeley, he worked on a project called M3L, which is two CNNs, one for tactile feedback and one for image. When you say hardware, is it running all these things on the edge or is it just like the actual servos and the-

Soumith [01:12:45]: By hardware, I mean the actual servos, like the motors, servos, even the sensors. We have incredible vision; it's still so much better, in field of view and in resolution, than any of the cameras we can buy. Our skin has touch sensing all over, and we have some of the most efficient, highest-capacity motors that can lift large loads with the dexterity of a hand and stuff. So in terms of hardware, in terms of those capabilities, we haven't figured out how to do a lot of this stuff. I mean, Tesla has been making incredible progress. 1X, I think, announced their new thing that looks incredible. Some of the other companies, Figure and others, are doing great work. But we're really not anywhere close to the hardware that we feel like we need. And there's obviously the other thing I want to call out: a lot of what people show works, but has to be fixed all the time. And that's the other thing we are incredible at. We don't need any maintenance, or the maintenance is part of us. If you buy an electronics product of any kind, you buy a PS5, you don't say, oh yeah, my PS5 breaks every six days and I have to do some reasonable amount of work on it. But that's robotics. If it's not industrial robotics, where it's very controlled and specialized or whatever, you're talking about reliability in those ranges. So I think people don't talk about the reliability thing enough. When we enter the commercialization phase, we're going to have to start thinking about, okay, now we have this thing and we need to figure out how to get reliability high enough to deploy it into homes and just sell it to people at Best Buy or something. So that's the other factor that we have to make a lot of progress on.

Swyx [01:14:44]: I just realized that Google has a play in this with PaLM-E and stuff, and OpenAI obviously has a long history of doing this stuff. Is there anything at Meta? No robotics stuff in Meta?

Soumith [01:14:55]: We have a small robotics program at Meta out of FAIR. I actually used to do it at FAIR a little bit before I moved into Infra and focused on my Meta time on a lot of other infrastructural stuff. So yeah, Meta's robotics program is a lot smaller.

Swyx [01:15:10]: Seems like it would be a personal computing.

Soumith [01:15:14]: You could think of it as, Meta has a ridiculously large device strategy, right? You know, that's our Reality Labs stuff. We're going at it from VR and AR and, you know, we showcase a lot of that stuff. I think for Meta, the robot is not as important as the physical devices kind of stuff.

Osmo - smell AI

Swyx [01:15:37]: Yeah, for sure. Yeah. Okay, I want to touch on Osmo a bit because very unusual company to the stuff that we normally discuss, not robotics, sense of smell. The original pitch I heard from the founder, maybe you can correct me, is that he realized that you can smell cancer. Yeah. Is that intuitive? Is that what you get? Or is that the potential that you see?

Soumith [01:15:56]: The very interesting reason I invested in Osmo is because of Alex Wiltschko, the founder of Osmo. Before PyTorch, there was Torch, and Alex Wiltschko actually worked on Torch. He's actually a frameworks guy. You know, he built this thing called Tangent at Google, another autodiff framework and stuff. I know him from that side of things. And then, he is a neurobiologist by training. He just happens to also love neural networks and hacking on those frameworks. So incredibly smart guy, one of the smartest people I know. So when he was going in this direction, I thought it was incredible that smell is something that we haven't even started to scrape in terms of digitization. When we think about audio or images or video, they're so advanced. We have the concept of color spaces. We have the concept of frequency spectrums. We figured out how ears process frequencies, like a mel spectrum or whatever, logarithmically scaled. Images have RGB, YUV. We have so many different kinds of parameterizations. We have formalized these two senses ridiculously well. Touch and smell, nada. We're where we were with images in, say, 1920, or maybe even the 1800s, right? That's where we're at. And Alex has this incredible vision of having a smell sensor eventually just be part of your daily life. As of today, you don't really think, when you're watching an Instagram reel of food or something, huh, I would also love to know what it smelled like. You don't, because we really haven't, as a society, built that muscle to even understand what a smell sensor can do. I think the more near-term effects are obviously going to be around things that provide more obvious utility in the short term, like maybe smelling cancer or repelling mosquitoes better, or, you know, stuff like that.

Swyx [01:18:12]: More recently, he's been talking about categorizing perfumes, obviously. Yeah, exactly. That's a market that you can pursue.

Soumith [01:18:17]: Yeah, like, I mean, think about how you can customize a perfume to your own liking in the same way you can customize a shoe or something, right? I think all the near-term stuff, I think if he's able to figure out a near-term value for it, they, as a company, can sustain themselves to then eventually try to make progress on the long term, which is really in uncharted territory. Like, think about it, 50 years from now, it would be pretty obvious to kids of the generation to just, like, I was going to say scroll a reel on their phone, and maybe phones wouldn't be there.

Swyx [01:18:58]: They're just on their glasses, they're watching something.

Soumith [01:18:58]: Yeah, I think VR would be. And then, like, they immediately get a smell sense of that remote experience as well. We haven't really progressed enough in that dimension, and I think they have a chance to do it.

Alessio [01:19:13]: Awesome, I mean, we touched on a lot of things. Anything, we're missing anything you want to direct people to, or?

Swyx [01:19:19]: Yeah, call to action. Yeah. Call for research, call for startups.

Soumith [01:19:22]: I don't really have a lot of calls to action, because usually I think people should be intrinsically, like, figuring it out.

Swyx [01:19:29]: That's a good look inside yourself. Yeah. That's good.

Alessio [01:19:33]: Awesome, thank you so much for coming on.

Swyx [01:19:35]: Yeah, for sure. This was great.



Get full access to Latent Space at www.latent.space/subscribe
2024-03-06
Link to episode

A Brief History of the Open Source AI Hacker - with Ben Firshman of Replicate

This Friday we're doing a special crossover event in SF with SemiAnalysis (previous guest!), and we will do a live podcast on site. RSVP here.

Also join us on June 25-27 for the biggest AI Engineer conference of the year!

Replicate is one of the most popular AI inference providers, reporting over 2 million users as of their $40m Series B with a16z. But how did they get there?

The Definitive Replicate Story (warts and all)

Their overnight success took 5 years of building, and it all started with arXiv Vanity, which was a 2017 vacation project that scrapes arXiv PDFs and re-renders them into semantic web pages that reflow nicely with better typography and whitespace.

From there, Ben and Andreas' idea was to build tools to make ML research more robust and reproducible by making it easy to share code artefacts alongside papers. They had previously created Fig, which made it easy to spin up dev environments; it was eventually acquired by Docker and turned into `docker-compose`, the industry standard way to define services for containerized applications.

2019: Cog

The first iteration of Replicate was a Fig-equivalent for ML workloads which they called Cog; it made it easy for researchers to package all their work and share it with peers for review and reproducibility.

But they found that researchers were terrible users: they'd do all this work for a paper, publish it, and then never return to it again.

"We talked to a bunch of researchers and they really wanted that... But how the hell is this a business, you know, like how are we even going to make any money out of this?

"So we went and talked to a bunch of companies trying to sell them something which didn't exist. So we're like, hey, do you want a way to share research inside your company so that other researchers or say like the product manager can test out the machine learning model? They're like, maybe. Do you want like a deployment platform for deploying models? Do you want a central place for versioning models? We were trying to think of lots of different products we could sell that were related to this thing...

"So we then got halfway through our YC batch. We hadn't built a product. We had no users. We had no idea what our business was going to be because we couldn't get anybody to like buy something which didn't exist. And actually there was quite a way through our, I think it was like two thirds the way through our YC batch or something. And we're like, okay, well we're kind of screwed now because we don't have anything to show at demo day."

The team graduated Y Combinator with no customers, no product and nothing to demo - which was fine because demo day got canceled as the YC W'20 class graduated right into the pandemic. The team spent the next year exploring and building COVID tools.

2021: CLIP + GAN = PixRay

By 2021, OpenAI released CLIP. Overnight dozens of Discord servers got spun up to hack on CLIP + GANs. Unlike academic researchers, this community was constantly releasing new checkpoints and builds of models.

PixRay was one of the first models being built on Replicate, and it quickly started taking over the community. Chris Dixon has a famous 2010 post titled "The next big thing will start out looking like a toy"; image generation would have definitely felt like a toy in 2021, but it gave Replicate its initial boost.

2022: Stable Diffusion

In August 2022 Stable Diffusion came out, and all the work they had been doing to build this infrastructure for CLIP / GANs models became the best way for people to share their StableDiffusion fine-tunes:

And like the first week we saw people making animation models out of it. We saw people make game texture models that use circular convolutions to make repeatable textures. We saw a few weeks later, people were fine tuning it so you could put your face in these models and all of these other ways. [...] So tons of product builders wanted to build stuff with it. And we were just sitting in there in the middle, as the interface layer between all these people who wanted to build, and all these machine learning experts who were building cool models. And that's really where it took off. Incredible supply, incredible demand, and we were just in the middle.

(Stable Diffusion also spawned Latent Space as a newsletter)

The landing page paved the cowpath for the intense interest in diffusion model APIs.

2023: Llama & other multimodal LLMs

By 2023, Replicate's growing visibility in the Stable Diffusion indie hacker community came from top AI hackers like Pieter Levels and Danny Postma, each making millions off their AI apps.

Meta then released LLaMA 1 and 2 (our coverage of it), greatly pushing forward the SOTA open source model landscape. Demand for text LLMs and other modalities rose, and Replicate broadened its focus accordingly, culminating in an $18m Series A and $40m Series B from a16z (at a $350m valuation).

Building standards for the AI world

Now that the industry is evolving from toys to enterprise use cases, all these companies are working to set standards for their own space. We cover this at ~45 mins in the podcast. Some examples:

* LangChain has been trying to establish "chain" as the standard mental model when putting multiple prompts and models together, and the "LangChain Expression Language" to go with it (see the sketch below this list). (Our episode with Harrison)

* LlamaHub for packaging RAG utilities. (Our episode with Jerry)

* Ollama's Modelfile to define runtimes for different model architectures. These are usually targeted at local inference.

* Cog (by Replicate) to create environments to which you can easily attach CUDA devices and make it easy to spin up inference on remote servers.

* GGUF as the filetype for ggml-based executors.

None of them have really broken out yet, but this is going to become a fiercer competition as the market matures.
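For a flavor of what one of these competing standards looks like in practice, here is a minimal, hedged sketch of the LangChain Expression Language composition style; the exact package split and model name below are assumptions that may differ across LangChain versions, so treat it as illustrative rather than canonical.

```python
# Minimal LCEL-style sketch: compose a prompt, a chat model, and an output parser
# with the | operator into a single runnable "chain".
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI  # requires OPENAI_API_KEY in the environment

prompt = ChatPromptTemplate.from_template(
    "Summarize this paper abstract in one sentence: {abstract}"
)
model = ChatOpenAI(model="gpt-4o-mini")  # example model name, swap for whatever you use
chain = prompt | model | StrOutputParser()

print(chain.invoke({"abstract": "We propose a new method for ..."}))
```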

Full Video Podcast

As a reminder, all Latent Space pods now come in full video on our YouTube, with bonus content that we cut for time!

Show Notes

* Ben Firshman

* Replicate

* Free $10 credit for Latent Space readers

* Andreas Jansson (Ben's co-founder)

* Charlie Holtz (Replicate's Hacker in Residence)

* Fig (now Docker Compose)

* Command Line Interface Guidelines (clig)

* Apple Human Interface Guidelines

* arXiv Vanity

* Open Interpreter

* PixRay

* SF Compute

* Big Sleep by Advadnoun

* VQGAN-CLIP by Rivers Have Wings

Timestamps

* [00:00:00] Introductions

* [00:01:17] Low latency is all you need

* [00:04:08] Evolution of CLIs

* [00:05:59] How building arXiv Vanity led to Replicate

* [00:11:37] Making ML research replicable with containers

* [00:17:22] Doing YC in 2020 and pivoting to tools for COVID

* [00:20:22] Launching the first version of Replicate

* [00:25:51] Embracing the generative image community

* [00:28:04] Getting reverse engineered into an API product

* [00:31:25] Growing to 2 million users

* [00:34:29] Indie vs Enterprise customers

* [00:37:09] How Unsplash uses Replicate

* [00:38:29] Learnings from Docker that went into Cog

* [00:45:25] Creating AI standards

* [00:50:05] Replicate's compute availability

* [00:53:55] Fixing GPU waste

* [01:00:39] What's open source AI?

* [01:04:46] Building for AI engineers

* [01:06:41] Hiring at Replicate


Transcript

Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI.

Swyx [00:00:14]: Hey, and today we have Ben Firshman in the studio. Welcome Ben.

Ben [00:00:18]: Hey, good to be here.

Swyx [00:00:19]: Ben, you're a co-founder and CEO of Replicate. Before that, you were most notably founder of Fig, which became Docker Compose. You also did a couple of other things before that, but that's what a lot of people know you for. What should people know about you that, you know, outside of your, your sort of LinkedIn profile?

Ben [00:00:35]: Yeah. Good question. I think I'm a builder and tinkerer, like in a very broad sense. And I love using my hands to make things. So like I work on, you know, things may be a bit closer to tech, like electronics. I also like build things out of wood and I like fix cars and I fix my bike and build bicycles and all this kind of stuff. And there's so much, I think I've learned from transferable skills, from just like working in the real world to building things, building things in software. And you know, it's so much about being a builder, both in real life and, and in software that crosses over.

Swyx [00:01:11]: Is there a real world analogy that you use often when you're thinking about like a code architecture or problem?

Ben [00:01:17]: I like to build software tools as if they were something real. So I wrote this thing called the command line interface guidelines, which was a bit like sort of the Mac human interface guidelines, but for command line interfaces, I did it with the guy I created Docker Compose with and a few other people. And I think something in there, I think I described that your command line interface should feel like a big iron machine where you pull a lever and it goes clunk and like things should respond within like 50 milliseconds as if it was like a real life thing. And like another analogy here is like in the real life, you know, when you press a button on an electronic device and it's like a soft switch and you press it and nothing happens and there's no physical feedback of anything happening, then like half a second later, something happens. Like that's how a lot of software feels, but instead like software should feel more like something that's real where you touch, you pull a physical lever and the physical lever moves, you know, and I've taken that lesson of kind of human interface to, to software a ton. You know, it's all about kind of low latency of feeling, things feeling really solid and robust, both the command lines and, and user interfaces as well.

Swyx [00:02:22]: And how did you operationalize that for Fig or Docker?

Ben [00:02:27]: A lot of it's just low latency. Actually, we didn't do it very well for Fig in the first place. We used Python, which was a big mistake where Python's really hard to get booting up fast because you have to load up the whole Python runtime before it can run anything. Okay. Go is much better at this where like Go just instantly starts.

Swyx [00:02:45]: You have to be under 500 milliseconds to start up?

Ben [00:02:48]: Yeah, effectively. I mean, I mean, you know, perception of human things being immediate is, you know, something like a hundred milliseconds. So anything like that is, is yeah, good enough.

Swyx [00:02:57]: Yeah. Also, I should mention, since we're talking about your side projects, well, one thing is I am maybe one of a few fellow people who have actually written something about CLI design principles because I was in charge of the Netlify CLI back in the day and had many thoughts. One of my fun thoughts, I'll just share it in case you have thoughts, is I think CLIs are effectively starting points for scripts that are then run. And the moment one of the script's preconditions is not fulfilled, typically they end. So the CLI developer will just exit the program. And the way that I designed, or really wanted to create, the Netlify dev workflow was for it to be kind of a state machine that would resolve itself. If it detected a precondition wasn't fulfilled, it would actually delegate to a subprogram that would then fulfill that precondition, asking for more info or waiting until a condition is fulfilled. Then it would go back to the original flow and continue that. I don't know if that was ever tried or is there a more formal definition of it? Because I just came up with it randomly. But it felt like the beginnings of AI in the sense that when you run a CLI command, you have an intent to do something and you may not have given the CLI all the things that it needs to execute that intent. So that was my two cents.
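As a rough illustration of that idea, here is a minimal, entirely hypothetical sketch (none of these names come from the actual Netlify CLI): each command declares preconditions, and an unmet precondition delegates to a resolver sub-step before the original flow resumes.

```python
# Hypothetical sketch of a "self-resolving" CLI: commands declare preconditions,
# and unmet preconditions delegate to resolver steps instead of exiting.
import sys

STATE = {"logged_in": False, "site_linked": False}

def resolve_login():
    # A real CLI might open a browser auth flow; here we just prompt.
    STATE["logged_in"] = input("Not logged in. Log in now? [y/N] ").strip().lower() == "y"

def resolve_site_link():
    STATE["site_linked"] = bool(input("No site linked. Site name to link: ").strip())

# (precondition check, resolver) pairs, evaluated in order
PRECONDITIONS = [
    (lambda: STATE["logged_in"], resolve_login),
    (lambda: STATE["site_linked"], resolve_site_link),
]

def run(command):
    for check, resolve in PRECONDITIONS:
        if not check():
            resolve()                 # delegate to a sub-step
        if not check():               # still unmet: give up gracefully
            sys.exit("Could not satisfy a precondition; aborting.")
    command()                         # resume the original intent

def deploy():
    print("Deploying...")

if __name__ == "__main__":
    run(deploy)
```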

Ben [00:04:08]: Yeah, that reminds me of a thing we sort of thought about when writing the CLI guidelines, where CLIs were designed in a world where the CLI was really a programming environment and it's primarily designed for machines to use all of these commands and scripts. Whereas over time, the CLI has evolved to humans. It was back in a world where the primary way of using computers was writing shell scripts effectively. We've transitioned to a world where actually humans are using CLI programs much more than they used to. And the current sort of best practices about how Unix was designed, there's lots of design documents about Unix from the 70s and 80s, where they say things like, command line commands should not output anything on success. It should be completely silent, which makes sense if you're using it in a shell script. But if a user is using that, it just looks like it's broken. If you type copy and it just doesn't say anything, you assume that it didn't work as a new user. I think what's really interesting about the CLI is that it's actually a really good, to your point, it's a really good user interface where it can be like a conversation, where it feels like you're, instead of just like you telling the computer to do this thing and either silently succeeding or saying, no, you did, failed, it can guide you in the right direction and tell you what your intent might be, and that kind of thing in a way that's actually, it's almost more natural to a CLI than it is in a graphical user interface because it feels like this back and forth with the computer, almost funnily like a language model. So I think there's some interesting intersection of CLIs and language models actually being very sort of closely related and a good fit for each other.

Swyx [00:05:59]: Yeah, I'll say one of the surprises from last year, I worked on a coding agent, but I think the most successful coding agent of my cohort was Open Interpreter, which was a CLI implementation. And I have chronically, even as a CLI person, I have chronically underestimated the CLI as a useful interface. You also developed arXiv Vanity, which you recently retired after a glorious seven years.

Ben [00:06:22]: Something like that.

Swyx [00:06:23]: Which is nice, I guess, HTML PDFs.

Ben [00:06:27]: Yeah, that was actually the start of where Replicate came from. Okay, we can tell that story. So when I quit Docker, I got really interested in science infrastructure, just as like a problem area, because it is like science has created so much progress in the world. The fact that we're, you know, can talk to each other on a podcast and we use computers and the fact that we're alive is probably thanks to medical research, you know. But science is just like completely archaic and broken and it's like 19th century processes that just happen to be copied to the internet rather than take into account that, you know, we can transfer information at the speed of light now. And the whole way science is funded and all this kind of thing is all kind of very broken. And there's just so much potential for making science work better. And I realized that I wasn't a scientist and I didn't really have any time to go and get a PhD and become a researcher, but I'm a tool builder and I could make existing scientists better at their job. And if I could make like a bunch of scientists a little bit better at their job, maybe that's the kind of equivalent of being a researcher. So one particular thing I dialed in on is just how science is disseminated in that all of these PDFs, quite often behind paywalls, you know, on the internet.

Swyx [00:07:34]: And that's a whole thing because it's funded by national grants, government grants, then they're put behind paywalls. Yeah, exactly.

Ben [00:07:40]: That's like a whole, yeah, I could talk for hours about that. But the particular thing we got dialed in on was, interestingly, these PDFs are also, there's a bunch of open science that happens as well. So math, physics, computer science, machine learning, notably, is all published on arXiv, which is actually a surprisingly old institution.

Swyx [00:08:00]: Some random Cornell.

Ben [00:08:01]: Yeah, it was just like somebody in Cornell who started a mailing list in the 80s. And then when the web was invented, they built a web interface around it. Like it's super old.

Swyx [00:08:11]: And it's like kind of like a user group thing, right? That's why they're all these like numbers and stuff.

Ben [00:08:15]: Yeah, exactly. Like it's a bit like something, yeah. That's where basically all of math, physics and computer science happens. But it's still PDFs published to this thing. Yeah, which is just so infuriating. The web was invented at CERN, a physics institution, to share academic writing. Like there are figure tags, there are like author tags, there are heading tags, there are cite tags. You know, hyperlinks are effectively citations because you want to link to another academic paper. But instead, you have to like copy and paste these things and try and get around paywalls. Like it's absurd, you know. And now we have like social media and things, but still like academic papers as PDFs, you know. This is not what the web was for. So anyway, I got really frustrated with that. And I went on vacation with my old friend Andreas. So we were, we used to work together in London on a startup, at somebody else's startup. And we were just on vacation in Greece for fun. And he was like trying to read a machine learning paper on his phone, you know, like we had to like zoom in and like scroll line by line on the PDF. And he was like, this is f*****g stupid. So I was like, I know, like this is something we discovered our mutual hatred for this, you know. And we spent our vacation sitting by the pool, like making LaTeX to HTML, like, converters, making the first version of arXiv Vanity. Anyway, that was then a whole thing. And the story is, we shut it down recently because it caught the eye of arXiv. They were like, oh, this is great. We just haven't had the time to work on this. And what's tragic about arXiv, it's like this project of Cornell that's like, they can barely scrounge together enough money to survive. I think it might be better funded now than it was when we were, we were collaborating with them. And compared to these like scientific journals, it's just that this is actually where the work happens. But they just have a fraction of the money that like these big scientific journals have, which is just so tragic. But anyway, they were like, yeah, this is great. We can't afford to like do it, but do you want to like as a volunteer integrate arXiv Vanity into arXiv?

Swyx [00:10:05]: Oh, you did the work.

Ben [00:10:06]: We didn't do the work. We started doing the work. We did some. I think we worked on this for like a few months to actually get it integrated into arXiv. And then we got like distracted by Replicate. So a guy called Dan picked up the work and made it happen. Like somebody who works on one of the, the piece of the libraries that powers arXiv Vanity. Okay.

Swyx [00:10:26]: And the relationship with arXiv Sanity?

Ben [00:10:28]: None.

Swyx [00:10:30]: Did you predate them? I actually don't know the lineage.

Ben [00:10:32]: We were after, we both were both users of arXiv Sanity, which is like a sort of arXiv...

Ben [00:10:37]: Which is Andrej's RecSys on top of arXiv.

Ben [00:10:40]: Yeah. Yeah. And we were both users of that. And I think we were trying to come up with a working name for arXiv and Andreas just like cracked a joke of like, oh, let's call it arXiv Vanity. Let's make the papers look nice. Yeah. Yeah. And that was the working name and it just stuck.

Swyx [00:10:52]: Got it.

Ben [00:10:53]: Got it.

Alessio [00:10:54]: Yeah. And then from there, tell us more about why you got distracted, right? So Replicate, maybe it feels like an overnight success to a lot of people, but you've been building this since 2019. Yeah.

Ben [00:11:04]: So what prompted the start?

Alessio [00:11:05]: And we've been collaborating for even longer.

Ben [00:11:07]: So we created arXiv Vanity in 2017. So in some sense, we've been doing this almost like six, seven years now, a classic seven year.

Swyx [00:11:16]: Overnight success.

Ben [00:11:17]: Yeah. Yes. We did arXiv Vanity and then worked on a bunch of like surrounding projects. I was still like really interested in science publishing at that point. And I'm trying to remember, because I tell a lot of like the condensed story to people because I can't really tell like a seven year history. So I'm trying to figure out like the right. Oh, we got room. The right length.

Swyx [00:11:35]: We want to nail the definitive Replicate story here.

Ben [00:11:37]: One thing that's really interesting about these machine learning papers is that these machine learning papers are published on arXiv and a lot of them are actual fundamental research. So like should be like prose describing a theory. But a lot of them are just running pieces of software that like a machine learning researcher made that did something, you know, it was like an image classification model or something. And they managed to make an image classification model that was better than the existing state of the art. And they've made an actual running piece of software that does image segmentation. And then what they had to do is they then had to take that piece of software and write it up as prose and math in a PDF. And what's frustrating about that is like if you want to. So this was like Andreas is, Andreas was a machine learning engineer at Spotify. And some of his job was like he did pure research as well. Like he did a PhD and he was doing a lot of stuff internally. But part of his job was also being an engineer and taking some of these existing things that people have made and published and trying to apply them to actual problems at Spotify. And he was like, you know, you get given a paper which like describes roughly how the model works. It's probably listing lots of crucial information. There's sometimes code on GitHub. More and more there's code on GitHub. But back then it was kind of relatively rare. But it's quite often just like scrappy research code and didn't actually run. And, you know, there was maybe the weights that were on Google Drive, but they accidentally deleted the weights of Google Drive, you know, and it was like really hard to like take this stuff and actually use it for real things. We just started talking together about like his problems at Spotify and I connected this back to my work at Docker as well. I was like, oh, this is what we created containers for. You know, we solved this problem for normal software by putting the thing inside a container so you could ship it around and it kept on running. So we were sort of hypothesizing about like, hmm, what if we put machine learning models inside containers so they could actually be shipped around and they could be defined in like some production ready formats and other researchers could run them to generate baselines and you could people who wanted to actually apply them to real problems in the world could just pick up the container and run it, you know. And we then thought this is quite whether it gets normally in this part of the story I skip forward to be like and then we created cog this container stuff for machine learning models and we created Replicate, the place for people to publish these machine learning models. But there's actually like two or three years between that. The thing we then got dialed into was Andreas was like, what if there was a CI system for machine learning? It's like one of the things he really struggled with as a researcher is generating baselines. So when like he's writing a paper, he needs to like get like five other models that are existing work and get them running.

Swyx [00:14:21]: On the same evals.

Ben [00:14:22]: Exactly, on the same evals so you can compare apples to apples because you can't trust the numbers in the paper.

Swyx [00:14:26]: So you can be Google and just publish them anyway.

Ben [00:14:31]: So I think this was coming from the thinking of like there should be containers for machine learning, but why are people going to use that? Okay, maybe we can create a supply of containers by like creating this useful tool for researchers. And the useful tool was like, let's get researchers to package up their models and push them to the central place where we run a standard set of benchmarks across the models so that you can trust those results and you can compare these models apples to apples and for like a researcher for Andreas, like doing a new piece of research, he could trust those numbers and he could like pull down those models, confirm it on his machine, use the standard benchmark to then measure his model and you know, all this kind of stuff. And so we started building that. That's what we applied to YC with, got into YC and we started sort of building a prototype of this. And then this is like where it all starts to fall apart. We were like, okay, that sounds great. And we talked to a bunch of researchers and they really wanted that and that sounds brilliant. That's a great way to create a supply of like models on this research platform. But how the hell is this a business, you know, like how are we even going to make any money out of this? And we're like, oh s**t, that's like the, that's the real unknown here of like what the business is. So we thought it would be a really good idea to like, okay, before we get too deep into this, let's try and like reduce the risk of this turning into a business. So let's try and like research what the business could be for this research tool effectively. So we went and talked to a bunch of companies trying to sell them something which didn't exist. So we're like, hey, do you want a way to share research inside your company so that other researchers or say like the product manager can test out the machine learning model? They're like, maybe. And we were like, do you want like a deployment platform for deploying models? Like, do you want like a central place for versioning models? Like we're trying to think of like lots of different like products we could sell that were like related to this thing. And terrible idea. Like we're not sales people and like people don't want to buy something that doesn't exist. I think some people can pull this off, but we were just like, you know, a bunch of product people, products and engineer people, and we just like couldn't pull this off. So we then got halfway through our YC batch. We hadn't built a product. We had no users. We had no idea what our business was going to be because we couldn't get anybody to like buy something which didn't exist. And actually there was quite a way through our, I think it was like two thirds the way through our YC batch or something. And we're like, okay, well we're kind of screwed now because we don't have anything to show at demo day. And then we then like tried to figure out, okay, what can we build in like two weeks that'll be something. So we like desperately tried to, I can't remember what we've tried to build at that point. And then two weeks before demo day, I just remember it was all, we were going down to Mountain View every week for dinners and we got called on to like an all hands Zoom call, which was super weird. We're like, what's going on? And they were like, don't come to dinner tomorrow. And we realized, we kind of looked at the news and we were like, oh, there's a pandemic going on. We were like so deep in our startup. 
We were just like completely oblivious to what was going on around us.

Swyx [00:17:20]: Was this Jan or Feb 2020?

Ben [00:17:22]: This was March 2020. March 2020. 2020.

Swyx [00:17:25]: Yeah. Because I remember Silicon Valley at the time was early to COVID. Like they started locking down a lot faster than the rest of the US.

Ben [00:17:32]: Yeah, exactly. And I remember, yeah, soon after that, like there was the San Francisco lockdowns and then like the YC batch just like stopped. There wasn't demo day and it was in a sense a blessing for us because we just kind of

Swyx [00:17:43]: In the normal course of events, you're actually allowed to defer to a future demo day. Yeah.

Ben [00:17:51]: So we didn't even take any defer because it just kind of didn't happen.

Swyx [00:17:55]: So was YC helpful?

Ben [00:17:57]: Yes. We completely screwed up the batch and that was our fault. I think the thing that YC has become incredibly valuable for us has been after YC. I think there was a reason why we couldn't, didn't need to do YC to start with because we were quite experienced. We had done some startups before. We were kind of well connected with VCs, you know, it was relatively easy to raise money because we were like a known quantity. You know, if you go to a VC and be like, Hey, I made this piece of-

Swyx [00:18:24]: It's Docker Compose for AI.

Ben [00:18:26]: Exactly. Yeah. And like, you know, people can pattern match like that and they can have some trust, you know what you're doing. Whereas it's much harder for people straight out of college and that's where like YC sweet spot is like helping people straight out of college who are super promising, like figure out how to do that.

Swyx [00:18:40]: No credentials.

Ben [00:18:41]: Yeah, exactly. We don't need that. But the thing that's been incredibly useful for us since YC has been, this was actually, I think, so Docker was a YC company and Solomon, the founder of Docker, I think told me this. He was like, a lot of people underestimate the value of YC after you finish the batch. And his biggest regret was like not staying in touch with YC. I might be misattributing this, but I think it was him. And so we made a point of that. And we just stayed in touch with our batch partner, who Jared at YC has been fantastic.

Ben [00:19:10]: Jared Friedman. All of like the team at YC, there was the growth team at YC when they were still there and they've been super helpful. And two things have been super helpful about that is like raising money, like they just know exactly how to raise money. And they've been super helpful during that process in all of our rounds, like we've done three rounds since we did YC and they've been super helpful during the whole process. And also just like reaching a ton of customers. So like the magic of YC is that you have all of, like there's thousands of YC companies, I think, on the order of thousands, I think. And they're all of your first customers. And they're like super helpful, super receptive, really want to like try out new things. You have like a warm intro to every one of them basically. And there's this mailing list where you can post about updates to your products, which is like really receptive. And that's just been fantastic for us. Like we've just like got so many of our users and customers through YC. Yeah.

Swyx [00:20:00]: Well, so the classic criticism or the sort of, you know, pushback is people don't buy you because you are both from YC. But at least they'll open the email. Right. Like that's the... Okay.

Ben [00:20:13]: Yeah. Yeah. Yeah.

Swyx [00:20:16]: So that's been a really, really positive experience for us. And sorry, I interrupted with the YC question. Like you were, you make it, you just made it out of the YC, survived the pandemic.

Ben [00:20:22]: I'll try and condense this a little bit. Then we started building tools for COVID weirdly. We were like, okay, we don't have a startup. We haven't figured out anything. What's the most useful thing we could be doing right now?

Swyx [00:20:32]: Save lives.

Ben [00:20:33]: So yeah. Let's try and save lives. I think we failed at that as well. We had a bunch of products that didn't really go anywhere. We kind of worked on, yeah, a bunch of stuff like contact tracing, which turned out didn't really be a useful thing. Sort of Andreas worked on like a door dash for like people delivering food to people who are vulnerable. What else did we do? The meta problem of like helping people direct their efforts to what was most useful and a few other things like that. It didn't really go anywhere. So we're like, okay, this is not really working either. We were considering actually just like doing like work for COVID. We have this decision document early on in our company, which is like, should we become a like government app contracting shop? We decided no.

Swyx [00:21:11]: Because you also did work for the gov.uk. Yeah, exactly.

Ben [00:21:14]: We had experience like doing some like-

Swyx [00:21:17]: And the Guardian and all that.

Ben [00:21:18]: Yeah. For like government stuff. And we were just like really good at building stuff. Like we were just like product people. Like I was like the front end product side and Andreas was the back end side. So we were just like a product. And we were working with a designer at the time, a guy called Mark, who did our early designs for Replicate. And we were like, hey, what if we just team up and like become and build stuff? And yeah, we gave up on that in the end for, I can't remember the details. So we went back to machine learning. And then we were like, well, we're not really sure if this is going to work. And one of my most painful experiences from previous startups is shutting them down. Like when you realize it's not really working and having to shut it down, it's like a ton of work and it's people hate you and it's just sort of, you know. So we were like, how can we make something we don't have to shut down? And even better, how can we make something that won't page us in the middle of the night? So we made an open source project. We made a thing which was an open source Weights and Biases, because we had this theory that like people want open source tools. There should be like an open source, like version control, experiment tracking like thing. And it was intuitive to us and we're like, oh, we're software developers and we like command line tools. Like everyone loves command line tools and open source stuff, but machine learning researchers just really didn't care. Like they just wanted to click on buttons. They didn't mind that it was a cloud service. It was all very visual as well, that you need lots of graphs and charts and stuff like this. So it wasn't right. Like it was right. We actually were building something that Andreas made at Spotify for just like saving experiments to cloud storage automatically, but other people didn't really want this. So we kind of gave up on that. And then that was actually originally called Replicate and we renamed that out of the way. So it's now called Keepsake and I think some people still use it. Then we sort of came back, we looped back to our original idea. So we were like, oh, maybe there was a thing in that thing we were originally sort of thinking about of like researchers sharing their work and containers for machine learning models. So we just built that. And at that point we were kind of running out of the YC money. So we were like, okay, this like feels good though. Let's like give this a shot. So that was the point we raised a seed round. We raised seed round. Pre-launch. We raised pre-launch and pre-team. It was an idea basically. We had a little prototype. It was just an idea and a team. But we were like, okay, like, you know, bootstrapping this thing is getting hard. So let's actually raise some money. Then we made Cog and Replicate. It initially didn't have APIs, interestingly. It was just the bit that I was talking about before of helping researchers share their work. So it was a way for researchers to put their work on a webpage such that other people could try it out and so that you could download the Docker container. We cut the benchmarks thing of it because we thought that was just like too complicated. But it had a Docker container that like, you know, Andreas in a past life could download and run with his benchmark and you could compare all these models apples to apples. So that was like the theory behind it. That kind of started to work. 
It was like still when like, you know, it was long time pre-AI hype and there was lots of interesting stuff going on, but it was very much in like the classic deep learning era. So sort of image segmentation models and sentiment analysis and all these kinds of things, you know, that people were using, that we're using deep learning models for. And we were very much building for research because all of this stuff was happening in research institutions, you know, the sort of people who'd be publishing to archive. So we were creating an accompanying material for their models, basically, you know, they wanted a demo for their models and we were creating a company material for it. What was funny about that is they were like not very good users. Like they were, they were doing great work obviously, but, but the way that research worked is that they, they just made like one thing every six months and they just fired and forget it, forgot it. Like they, they published this piece of paper and like, done, I've, I've published it. So they like output it to Replicate and then they just stopped using Replicate. You know, they were like once every six monthly users and that wasn't great for us, but we stumbled across this early community. This was early 2021 when OpenAI created this, created CLIP and people started smushing CLIP and GANs together to produce image generation models. And this started with, you know, it was just a bunch of like tinkerers on Discord, basically. There was an early model called Big Sleep by Advadnoun. And then there was VQGAN Clip, which was like a bit more popular by Rivers Have Wings. And it was all just people like tinkering on stuff in Colabs and it was very dynamic and it was people just making copies of co-labs and playing around with things and forking in. And to me this, I saw this and I was like, oh, this feels like open source software, like so much more than the research world where like people are publishing these papers.

Swyx [00:25:48]: You don't know their real names and it's just like a Discord.

Ben [00:25:51]: Yeah, exactly. But crucially, it was like people were tinkering and forking and things were moving really fast and it just felt like this creative, dynamic, collaborative community in a way that research wasn't really, like it was still stuck in this kind of six month publication cycle. So we just kind of latched onto that and started building for this community. And you know, a lot of those early models were published on Replicate. I think the first one that was really primarily on Replicate was one called Pixray, which was sort of mid 2021 and it had a really cool like pixel art output, but it also just like produced general, you know, the sort of, they weren't like crisp in images, but they were quite aesthetically pleasing, like some of these early image generation models. And you know, that was like published primarily on Replicate and then a few other models around that were like published on Replicate. And that's where we really started to find our early community and like where we really found like, oh, we've actually built a thing that people want and they were great users as well. And people really want to try out these models. Lots of people were like running the models on Replicate. We still didn't have APIs though, interestingly, and this is like another like really complicated part of the story. We had no idea what a business model was still at this point. I don't think people could even pay for it. You know, it was just like these web forms where people could run the model.

Swyx [00:27:06]: Just for historical interest, which Discords were they and how did you find them? Was this the LAION Discord? Yeah, LAION. This is Eleuther.

Ben [00:27:12]: Eleuther, yeah. It was the Eleuther one. These two, right? There was a channel where Viki Gangklep, this was early 2021, where Viki Gangklep was set up as a Discord bot. I just remember being completely just like captivated by this thing. I was just like playing around with it all afternoon and like the sort of thing. In Discord. Oh s**t, it's 2am. You know, yeah.

Swyx [00:27:33]: This is the beginnings of Midjourney.

Ben [00:27:34]: Yeah, exactly. And Stability. It was the start of Midjourney. And you know, it's where that kind of user interface came from. Like what's beautiful about the user interface is like you could see what other people are doing. And you could riff off other people's ideas. And it was just so much fun to just like play around with this in like a channel full of a hundred people. And yeah, that just like completely captivated me and I'm like, okay, this is something, you know. So like we should get these things on Replicate. Yeah, that's where that all came from.

Swyx [00:28:00]: And then you moved on to, so was it APIs next or was it Stable Diffusion next?

Ben [00:28:04]: It was APIs next. And the APIs happened because one of our users, our web form had like an internal API for making the web form work, like with an API that was called from JavaScript. And somebody like reverse engineered that to start generating images with a script. You know, they did like, you know, Web Inspector, copy as cURL, like figured out what the API request was. And it wasn't secured or anything.

Swyx [00:28:28]: Of course not.

Ben [00:28:29]: They started generating a bunch of images and like we got tons of traffic and like what's going on? And I think like a sort of usual reaction to that would be like, hey, you're abusing our API and to shut them down. And instead we're like, oh, this is interesting. Like people want to run these models. So we documented the API in a Notion document, like our internal API in a Notion document and like messaged this person being like, hey, you seem to have found our API. Here's the documentation. That'll be like a thousand bucks a month, please, with a Stripe form, like we just clicked some buttons to make. And they were like, sure, that sounds great. So that was our first customer.

Swyx [00:29:05]: A thousand bucks a month.

Ben [00:29:07]: It was a surprising amount of money. That's not casual. It was on the order of a thousand bucks a month.

Swyx [00:29:11]: So was it a business?

Ben [00:29:13]: It was the creator of PixRay. Like it was, he generated NFT art. And so he like made a bunch of art with these models and was, you know, selling these NFTs effectively. And I think lots of people in his community were doing similar things. And like he then referred us to other people who were also generating NFTs and he joined us with models. We started our API business. Yeah. Then we like made an official API and actually like added some billing to it. So it wasn't just like a fixed fee.

Swyx [00:29:40]: And now people think of you as the hosted models API business. Yeah, exactly.

Ben [00:29:44]: But that just turned out to be our business, you know, but what ended up being beautiful about this is it was really fulfilling. Like the original goal of what we wanted to do is that we wanted to make this research that people were making accessible to like other people and for it to be used in the real world. And this was like the just like ultimately the right way to do it because all of these people making these generative models could publish them to replicate and they wanted a place to publish it. And software engineers, you know, like myself, like I'm not a machine learning expert, but I want to use this stuff, could just run these models with a single line of code. And we thought, oh, maybe the Docker image is enough, but it's actually super hard to get the Docker image running on a GPU and stuff. So it really needed to be the hosted API for this to work and to make it accessible to software engineers. And we just like wound our way to this. Yeah.
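For readers who haven't seen it, that "single line of code" experience looks roughly like this today with the Replicate Python client; the model identifier below is just a placeholder in the owner/name:version format, not something specific from the episode.

```python
import replicate  # pip install replicate; expects REPLICATE_API_TOKEN in the environment

# One call runs a model hosted on Replicate; no Docker, CUDA, or GPU setup on your side.
output = replicate.run(
    "some-owner/some-image-model:VERSION_HASH",  # placeholder model identifier
    input={"prompt": "an astronaut riding a horse, watercolor"},
)
print(output)
```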

Swyx [00:30:30]: Two years to the first paying customer. Yeah, exactly.

Alessio [00:30:33]: Did you ever think about becoming Midjourney during that time? You have like so much interest in image generation.

Swyx [00:30:38]: I mean, you're doing fine for the record, but, you know, it was right there, you were playing with it.

Ben [00:30:46]: I don't think it was our expertise. Like I think our expertise was DevTools rather than like Midjourney is almost like a consumer products, you know? Yeah. So I don't think it was our expertise. It certainly occurred to us. I think at the time we were thinking about like, oh, maybe we could hire some of these people in this community and make great models and stuff like this. But we ended up more being at the tooling. Like I think like before I was saying, like I'm not really a researcher, but I'm more like the tool builder, the behind the scenes. And I think both me and Andreas are like that.

Swyx [00:31:09]: I think this is an illustration of the tool builder philosophy. Something where you latch on to in DevTools, which is when you see people behaving weird, it's not their fault, it's yours. And you want to pave the cow paths is what they say, right? Like the unofficial paths that people are making, like make it official and make it easy for them and then maybe charge a bit of money.

Alessio [00:31:25]: And now fast forward a couple of years, you have 2 million developers using Replicate. Maybe more. That was the last public number that I found.

Ben [00:31:33]: It's 2 million users. Not all those people are developers, but a lot of them are developers, yeah.

Alessio [00:31:38]: And then 30,000 paying customers was the number. Latent Space runs on Replicate. We're a small podcast and we host a Whisper diarization model on Replicate. And we're paying. So Latent Space is in the 30,000. You raised a $40 million Series B. I would say that maybe the Stable Diffusion time, August '22, was like really when the company started to break out. Tell us a bit about that and the community that came out and I know now you're expanding beyond just image generation.

Ben [00:32:06]: Yeah, like I think we kind of set ourselves, like we saw there was this really interesting image, generative image world going on. So we kind of, you know, like we're building the tools for that community already, really. And we knew Stable Diffusion was coming out. We knew it was a really exciting thing, you know, it was the best generative image model so far. I think the thing we underestimated was just like what an inflection point it would be, where it was, I think Simon Willison put it this way, where he said something along the lines of it was a model that was open source and tinkerable and like, you know, it was just good enough and open source and tinkerable such that it just kind of took off in a way that none of the models had before. And like what was really neat about Stable Diffusion is it was open source so you could like, compared to like DALL-E, for example, which was like sort of equivalent quality. And like the first week we saw like people making animation models out of it. We saw people make like game texture models that like use circular convolutions to make repeatable textures. We saw, you know, a few weeks later, like people were fine tuning it so you could make, put your face in these models and all of these other-

Swyx [00:33:10]: Textual inversion.

Ben [00:33:11]: Yep. Yeah, exactly. That happened a bit before that. And all of this sort of innovation was happening all of a sudden. And people were publishing on Replicate because you could just like publish arbitrary models on Replicate. So we had this sort of supply of like interesting stuff being built. But because it was a sufficiently good model, there was also just like a ton of people building with it. They were like, oh, we can build products with this thing. And this was like about the time where people were starting to get really interested in AI. So like tons of product builders wanted to build stuff with it. And we were just like sitting in there in the middle, it's like the interface layer between like all these people who wanted to build and all these like machine learning experts who were building cool models. And that's like really where it took off. We were just sort of incredible supply, incredible demand, and we were just like in the middle. And then, yeah, since then, we've just kind of grown and grown really. And we've been building a lot for like the indie hacker community, these like individual tinkerers, but also startups and a lot of large companies as well who are sort of exploring and building AI things. Then kind of the same thing happened like middle of last year with language models and Llama 2, where the same kind of Stable Diffusion effect happened with Llama. And Llama 2 was like our biggest week of growth ever because like tons of people wanted to tinker with it and run it. And you know, since then we've just been seeing a ton of growth in language models as well as image models. Yeah. We're just kind of riding a lot of the interest that's going on in AI and all the people building in AI, you know. Yeah.

Swyx [00:34:29]: Kudos. Right place, right time. But also, you know, took a while to position for the right place before the wave came. I'm curious if like you have any insights on these different markets. So Pieter Levels, notably very loud person, very picky about his tools. I wasn't sure actually if he used you. He does. So you named him in your Series B blog post, and Danny Postma as well, his competitor, all in that wave. What are their needs versus, you know, the more enterprise or B2B type needs? Did you come to a decision point where you're like, okay, you know, how serious are these indie hackers versus like the actual businesses that are bigger and perhaps better customers because they're less churny?

Ben [00:35:04]: They're surprisingly similar because I think a lot of people right now want to use and build with AI, but they're not AI experts and they're not infrastructure experts either. So they want to be able to use this stuff without having to like figure out all the internals of the models and, you know, like touch PyTorch and whatever. And they also don't want to be like setting up and booting up servers. And that's the same all the way from like indie hackers just getting started because like obviously you just want to get started as quickly as possible, all the way through to like large companies who want to be able to use this stuff, but don't have like all of the experts on stuff, you know, you know, big companies like Google and so on that do actually have a lot of experts on stuff, but the vast majority of companies don't. And they're all software engineers who want to be able to use this AI stuff, but they just don't know how to use it. And it's like, you really need to be an expert and it takes a long time to like learn the skills to be able to use that. So they're surprisingly similar in that sense. I think it's kind of also unfair of like the indie community, like they're not churning surprisingly, or churny or spiky surprisingly, like they're building real established businesses, which is like, kudos to them, like building these really like large, sustainable businesses, often just as solo developers. And it's kind of remarkable how they can do that actually, and it's in credit to a lot of their like product skills. And you know, we're just like there to help them being like their machine learning team effectively to help them use all of this stuff. A lot of these indie hackers are some of our largest customers, like alongside some of our biggest customers that you would think would be spending a lot more money than them, but yeah.

Swyx [00:36:35]: And we should name some of these. So you have them on your landing page, your Buzzfeed, you have Unsplash, Character AI. What do they power? What can you say about their usage?

Ben [00:36:43]: Yeah, totally. It's kind of various things.

Swyx [00:36:46]: Well, I mean, I'm naming them because they're on your landing page. So you have logo rights. It's useful for people to, like, I'm not imaginative. I see monkey see monkey do, right? Like if I see someone doing something that I want to do, then I'm like, okay, Replicate's great for that.

Ben [00:37:00]: Yeah, yeah, yeah.

Swyx [00:37:01]: So that's what I think about case studies on company landing pages is that it's just a way of explaining like, yep, this is something that we are good for. Yeah, totally.

Ben [00:37:09]: I mean, it's, these companies are doing things all the way up and down the stack at different levels of sophistication. So like Unsplash, for example, they actually publicly posted this story on Twitter where they're using BLIP to annotate all of the images in their catalog. So you know, they have lots of images in the catalog and they want to create a text description of it so you can search for it. And they're annotating images with, you know, an off-the-shelf, open source model, you know, we have this big library of open source models that you can run. And you know, we've got lots of people running these open source models off the shelf. And then most of our larger customers are doing more sophisticated stuff. So they're like fine tuning the models, they're running completely custom models on us. A lot of these larger companies are like, using us for a lot of their, you know, inference, but it's like a lot of custom models and them like writing the Python themselves because they've got machine learning experts on the team. And they're using us for like, you know, their inference infrastructure effectively. And so it's like lots of different levels of sophistication where like some people are using these off the shelf models. Some people are fine tuning models. So, like, Pieter Levels is a great example where a lot of his products are based off like fine tuning, fine tuning image models, for example. And then we've also got like larger customers who are just like using us as infrastructure effectively. So yeah, it's like all things up and down, up and down the stack.
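A rough sketch of that Unsplash-style workflow, assuming an off-the-shelf captioning model hosted on Replicate; the model slug, version, and input field below are placeholders, and the real input schema depends on whichever model you actually pick.

```python
import replicate  # expects REPLICATE_API_TOKEN in the environment

image_urls = [
    "https://example.com/photos/beach.jpg",
    "https://example.com/photos/city.jpg",
]

captions = {}
for url in image_urls:
    # Placeholder identifier for a BLIP-style captioning model on Replicate.
    captions[url] = replicate.run(
        "some-owner/blip-captioner:VERSION_HASH",
        input={"image": url},
    )

# Captions can then be indexed to make the image catalog text-searchable.
print(captions)
```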

Alessio [00:38:29]: Let's talk a bit about Cog and the technical layer. So there are a lot of GPU clouds. I think people have different pricing points. And I think everybody tries to offer a different developer experience on top of it, which then lets you charge a premium. Why did you want to create Cog?

Ben [00:38:46]: You worked at Docker.

Alessio [00:38:47]: What were some of the issues with traditional container runtimes? And maybe yeah, what were you surprised with as you built it?

Ben [00:38:54]: Cog came right from the start, actually, when we were thinking about this, you know, evaluation, the sort of benchmarking system for machine learning researchers, where we wanted researchers to publish their models in a standard format that was guaranteed to keep on running, that you could replicate the results of, like that's where the name came from. And we realized that we needed something like Docker to make that work, you know. And I think it was just like natural from my point of view of like, obviously that should be open source, that we should try and create some kind of open standard here that people can share. Because if more people use this format, then that's great for everyone involved. I think the magic of Docker is not really in the software. It's just like the standard that people have agreed on, like, here are a bunch of keys for a JSON document, basically. And you know, that was the magic of like the metaphor of real containerization as well. It's not the containers that are interesting. It's just like the size and shape of the damn box, you know. And it's a similar thing here, where really we just wanted to get people to agree on like, this is what a machine learning model is. This is how a prediction works. This is what the inputs are, this is what the outputs are. So Cog is really just a Docker container that attaches to a CUDA device, if it needs a GPU, that has an OpenAPI specification as a label on the Docker image. And the OpenAPI specification defines the interface for the machine learning model, like the inputs and outputs effectively, or the params in machine learning terminology. And you know, we just wanted to get people to kind of agree on this thing. And it's like general purpose enough, like we weren't saying like, some of the existing things were like at the graph level, but we really wanted something general purpose enough that you could just put anything inside this and it was like future compatible and it was just like arbitrary software. And you know, it'd be future compatible with like future inference servers and future machine learning model formats and all this kind of stuff. So that was the intent behind it. It just came naturally that we wanted to define this format. And that's been really working for us. Like a bunch of people have been using Cog outside of Replicate, which is kind of our original intention, like this should be how machine learning is packaged and how people should use it. Like it's common to use Cog in situations where like maybe they can't use the SaaS service because I don't know, they're in a big company and they're not allowed to use a SaaS service, but they can use Cog internally still. And like they can download the models from Replicate and run them internally in their org, which we've been seeing happen. And that works really well. People who want to build like custom inference pipelines, but don't want to like reinvent the world, they can use Cog off the shelf and use it as like a component in their inference pipelines. We've been seeing tons of usage like that and it's just been kind of happening organically. We haven't really been trying, you know, but it's like there if people want it and we've been seeing people use it. So that's great. Yeah.
So a lot of it is just sort of philosophical of just like, this is how it should work from my experience at Docker, you know, and there's just a lot of value from like the core being open, I think, and that other people can share it and it's like an integration point. So, you know, if Replicate, for example, wanted to work with a testing system, like a CI system or whatever, we can just like interface at the Cog level, like that system just needs to take Cog models and then you can like test your models on that CI system before they get deployed to Replicate. And it's just like a format that everyone, we can get everyone to agree on, you know.

Alessio [00:41:55]: What do you think, I guess, Docker got wrong? Because if I look at a Docker Compose and a Cog definition, first of all, the Cog definition is kind of like the Dockerfile plus the Compose, versus in Docker Compose, you're just exposing the services. And also Docker Compose is very like ports driven, versus you have like the actual, you know, predict, this is what you have to run.

Ben [00:42:16]: Yeah.

Alessio [00:42:17]: Any learnings and maybe tips for other people building container based runtimes, like how much should you separate the API services versus the image building or how much you want to build them together?

Ben [00:42:29]: I think it was coming from two sides. We were thinking about the design from the point of view of user needs, what are their problems and what problems can we solve for them, but also what the interface should be for a machine learning model. And it was sort of the combination of two things that led us to this design. So the thing I talked about before was a little bit of like the interface around the machine learning model. So we realized that we wanted to be general purpose. We wanted to be at the like JSON, like human readable things rather than the tensor level. So it was like an OpenAPI specification that wrapped a Docker container. And that's where that design came from. And it's really just a wrapper around Docker. So we were kind of building on, standing on shoulders there, but Docker is too low level. So it's just like arbitrary software. So we wanted to be able to like have an OpenAPI specification that defined the function effectively that is the machine learning model. But also like how that function is written, how that function is run, which is all defined in code and stuff like that. So it's like a bunch of abstraction on top of Docker to make that work. And that's where that design came from. But the core problems we were solving for users was that Docker is really hard to use and productionizing machine learning models is really hard. So on the first part of that, we knew we couldn't use Dockerfiles. Like Dockerfiles are hard enough for software developers to write. I'm saying this with love as somebody who worked on Docker and like worked on Dockerfiles, but it's really hard to use. And you need to know a bunch about Linux, basically, because you're running a bunch of CLI commands. You need to know a bunch about Linux and best practices and like how apt works and all this kind of stuff. So we're like, OK, we can't get to that level. We need something that machine learning researchers will be able to understand, like people who are used to like Colab notebooks. And what they understand is they're like, I need this version of Python. I need these Python packages. And somebody told me to apt-get install something. You know? If there was sudo in there, I don't really know what that means. So we tried to create a format that was at that level, and that's what cog.yaml is. And we were really kind of trying to imagine like, what is that machine learning researcher going to understand, you know, and trying to build for them. Then the productionizing machine learning models thing is like, OK, how can we package up all of the complexity of like productionizing machine learning models, like picking CUDA versions, like hooking it up to GPUs, writing an inference server, defining a schema, doing batching, all of these just like really gnarly things that everyone does again and again. And just like, you know, provide that as a tool. And that's where that side of it came from. So it's like combining those user needs with, you know, the sort of world need of needing like a common standard for like what a machine learning model is. And that's how we thought about the design. I don't know whether that answers the question.
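For readers who haven't seen Cog: the researcher-facing surface is roughly a cog.yaml that pins the Python version, Python packages, and apt packages, plus a predict.py with a typed predict function that Cog turns into a schema and an HTTP inference server. A minimal sketch, with entirely illustrative inputs rather than any specific Replicate model:

```python
# predict.py -- a minimal, illustrative Cog predictor (not a real Replicate model).
# Alongside this would sit a cog.yaml pinning the Python version and packages.
from cog import BasePredictor, Input


class Predictor(BasePredictor):
    def setup(self):
        # In a real model, this is where you would load weights onto the GPU once.
        pass

    def predict(
        self,
        prompt: str = Input(description="Text to echo back"),
        repeats: int = Input(description="How many times to repeat it", default=1),
    ) -> str:
        # Cog derives an OpenAPI schema and an HTTP inference server from
        # these typed arguments and the return annotation.
        return " ".join([prompt] * repeats)
```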

Alessio [00:45:12]: Yeah. So your idea was like, hey, you really want what Docker stands for in terms of standard, but you actually don't want people to do all the work that goes into Docker.

Ben [00:45:22]: It needs to be higher level, you know?

Swyx [00:45:25]: So I want to, for the listener, you're not the only standard that is out there. As with any standard, there must be 14 of them. You are surprisingly friendly with Ollama, who are your former colleagues from Docker, who came out with the Modelfile. Mozilla came out with the llamafile. And then I don't know if this is in the same category even, but I'm just going to throw it in there. Like Hugging Face has the transformers and diffusers libraries, which are a way of disseminating models that obviously people use. How would you compare and contrast your approach of Cog versus all these?

Ben [00:45:53]: It's kind of complementary, actually, which is kind of neat in that a lot of transformers, for example, is lower level than Cog. So it's a Python library effectively, but you still need to like...

Swyx [00:46:04]: Expose them.

Ben [00:46:05]: Yeah. You still need to turn that into an inference server. You still need to like install the Python packages and that kind of thing. So lots of Replicate models are transformers models and diffusers models inside Cog, you know? So that's like the level that that sits. So it's very complementary in some sense. We're kind of working on integration with Hugging Face such that you can deploy models from Hugging Face into Cog models and stuff like that to Replicate. And some of these things like llamafile and what Ollama are working on are also very complementary in that they're doing a lot of the sort of running these things locally on laptops, which is not a thing that works very well with Cog. Like Cog is really designed around servers and attaching to CUDA devices and NVIDIA GPUs and this kind of thing. So we're actually like, you know, figuring out ways that like we can, those things can be interoperable because, you know, they should be and they are quite complementary, in that you should be able to like take a model on Replicate and run it on your local machine. You should be able to take a model on your machine and run it in the cloud.
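To make the "complementary" point concrete: a Hugging Face pipeline typically ends up wrapped inside a Cog predictor, with transformers handling the model and Cog handling the container and interface. A rough sketch, using an arbitrary small model for illustration:

```python
from cog import BasePredictor, Input


class Predictor(BasePredictor):
    def setup(self):
        # transformers/diffusers handle the model itself; Cog wraps it in a
        # standard container plus HTTP interface.
        from transformers import pipeline
        self.generator = pipeline("text-generation", model="gpt2")

    def predict(
        self,
        prompt: str = Input(description="Prompt to continue"),
        max_new_tokens: int = Input(description="Tokens to generate", default=50),
    ) -> str:
        out = self.generator(prompt, max_new_tokens=max_new_tokens)
        return out[0]["generated_text"]
```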

Swyx [00:47:02]: Is the base layer something like, is it at the like the GGUF level, which by the way, I need to get a primer on like the different formats that have emerged, or is it at the star-dot-file level, which is Modelfile, llamafile, whatever, or is it at the Cog level? I don't know, to be honest.

Ben [00:47:16]: And I think this is something we still have to figure out. There's a lot yet, like exactly where those lines are drawn. We don't know exactly. I think this is something we're trying to figure out ourselves, but I think there's certainly a lot of promise in these systems interoperating. We just want things to work together. You know, we want to try and reduce the number of standards. So the more these things can interoperate and, you know, convert between each other and that kind of stuff at the minute, the better.

Swyx [00:47:34]: Cool. Well, there's a foundation for that.

Alessio [00:47:36]: Andreas comes out of Spotify, Erik from Modal also comes out of Spotify. You worked at Docker and the Ollama guys worked at Docker. Did both you and Andreas know that there was somebody else you worked with that had a kind of like similar, not similar idea, but like was interested in the same thing, or did you then just say, oh, I know those people, they're doing something very similar?

Ben [00:47:58]: We learned about both early on actually, yeah, because we know, we know them both quite well. And it's funny how I think we're all seeing the same problems and just like applying, you know, trying to fix the same problems that we're all seeing. I think the Ollama one's particularly funny because I joined Docker through my startup. Funnily, actually, the thing which worked for my startup was Compose, but we were actually working on another thing, which was a bit like EC2 for Docker. So we were working on like productionizing Docker containers. And the Ollama guys were working on a thing called Kitematic, which was a bit like a desktop app for Docker. And our companies both got bought by Docker at the same time. And you know, Kitematic turned into Docker Desktop. And then, you know, our thing then turned into Compose. And it's funny how we're both applying, like, the things we saw at Docker to the AI world, but they're building like the local environment for it and we're building like the cloud for it. And yeah, so that's just like really pleasing. And I think, you know, we're collaborating closely because there's just so much opportunity for working there. You have a hammer.

Swyx [00:49:06]: Everything's a nail.

Ben [00:49:07]: Yeah, exactly. Exactly. So I think a lot of where we're coming from with AI is we're all kind of, on the Replicate team, we're all kind of people who have built developer tools in the past. We've got a team, like I worked at Docker, we've got people who worked at Heroku and GitHub and like the iOS ecosystem and all this kind of thing, like the previous generation of developer tools, where we like figured out a bunch of stuff. And then like AI has come along and we just don't yet have those tools and abstractions like to make it easy to use. So we're trying to like take the lessons that we learned from the previous generation of stuff and apply them to this new generation of stuff. And obviously there's a bit of nuance there because the trick is to take like the right lessons and do new stuff where it makes sense. You can't just like cut and paste, you know, but that's like how we're approaching this: we're trying to like as much as possible, like take some of those lessons we learned from like, you know, how Heroku and GitHub were built, for example, and apply them to AI.

Swyx [00:50:05]: We should also talk a little bit about your compute availability. We're trying to ask this of all, you know, it's Compute Provider Month. Do you own your own GPUs? How many do you have access to? What do you feel about the tightness of the GPU market?

Ben [00:50:17]: We don't own our own GPUs. We've got a few that we play around with, but not for production workloads. And we are primarily built on public clouds, so primarily GCP and CoreWeave and like some smatterings elsewhere.

Swyx [00:50:29]: None from NVIDIA, which is your newest investor?

Ben [00:50:31]: We work with NVIDIA, so, you know, they're kind of helping us get GPU availability. GPUs are hard to get hold of. Like if you go to AWS and ask for one A100, they won't give you an A100. But if you go to AWS and say, I would like 100 A100s in two years, they're like, sure, we've got some. And I think the problem is like that makes sense from their point of view. They want just like reliable, sustained usage. They don't want like spiky usage and like wastage in their infrastructure, which makes total sense. But that makes it really hard for startups, you know, who are wanting to just like get hold of GPUs. I think we're in a fortunate position where we can aggregate demand so we can make commits to cloud providers. And then, you know, we actually have good availability, like, you know, we don't have infinite availability, obviously, but, you know, if you want an A100 from Replicate, you can get it. But, you know, we're seeing other companies pop up as well, like SF Compute's a great example of this, where they're doing the same idea for training almost where, you know, a lot of startups need to be able to train a model, but they can't get hold of GPUs from large cloud providers. So SF Compute is like letting people rent, you know, 10 H100s for two days, which is just impossible otherwise. And, you know, what they're effectively doing there is they're aggregating demand such that they can make a big commit to the cloud provider and then let people use smaller chunks of it. And that's kind of what we're doing with Replicate as well. We're aggregating demand such that we make big commits to the cloud providers. And you know, then people can run like a 100 millisecond API request on an A100.

Swyx [00:51:51]: So, you know, coming from a finance background, this sounds surprisingly similar to banks, where the job of a bank is maturity transformation, is what you call it. You take short term deposits, which technically can be withdrawn at any time, and you turn that into long term loans for mortgages and stuff, and you pocket the difference in interest. And that's the bank.

Ben [00:52:09]: Yeah, that's exactly what we're doing.

Swyx [00:52:11]: So you run a bank.

Ben [00:52:12]: Yeah, it's a bank. Right, yeah. And it's so much of a finance problem as well, because we have to make bets on the future demand and value of GPUs, yeah.

Swyx [00:52:21]: What are you... Okay, I don't know how much you can disclose, but what are you forecasting? Down? Up a lot? Yeah. Up 10x?

Ben [00:52:30]: I can't really. We're projecting our growth with some educated guesses about what kind of models are going to come out and what kind of hardware they'll need to run, you know? We need to bet that like, okay, maybe language models are getting larger, so we need to like have GPUs with a lot of RAM, or like multi-GPU nodes, or maybe models are getting smaller, and we actually need smaller GPUs, you know. We have to make some educated guesses about that kind of stuff, yeah.

Swyx [00:52:50]: Yeah. Speaking of which, the mixture of experts models must be throwing a spanner into the planning.

Ben [00:52:56]: Not so much. We've got like multi-node A100 machines, which can run those, and multi-node H100 machines, which can run those, no problem. So we're set up for that. Okay.

Swyx [00:53:04]: Right. I didn't expect it to be so easy. My impression was that the amount of RAM per model was increasing a lot, especially on a sort of per parameter basis, per active parameter basis, going from like Mixtral being eight experts to like the DeepSeek MoE models, I don't know if you saw them, being like 30, 60 experts, and you can see it keep going up, I guess.

Ben [00:53:26]: Yeah. I think we might run into problems at some point, and yeah, I don't know exactly what's going on there. I think something that we're finding, which is kind of interesting, like I don't know this in depth, you know, we're certainly seeing a lot of good results from lower precision models. So like, you know, 90% of the performance with just like much less RAM required. That means that we can run them on GPUs we have available, and it's good for customers as well because it runs faster, and like they want that trade-off, you know, where it's just slightly worse, but like way faster and cheaper.
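As a concrete example of the lower-precision trade-off Ben describes, this is roughly what 4-bit loading looks like with Hugging Face transformers and bitsandbytes; the model name and settings are illustrative, not what Replicate actually runs:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # illustrative choice
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,              # quantize weights to 4-bit on load
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",              # fit the model onto whatever GPU is available
)

inputs = tokenizer("Explain what Cog is in one sentence.", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=60)[0], skip_special_tokens=True))
```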

Alessio [00:53:55]: Do you see a lot of GPU waste in terms of people running the thing on a GPU that is like too advanced? I think we use a T4 to run Whisper, so we're at the bottom end of it. Yeah. Any thoughts? I think one of the hackathons we were at, people were like, oh, how do I get access to like H100s? And it's like, you need to run like- Dude, you don't need H100s.

Ben [00:54:14]: You don't need H100s. Yeah. Yeah. Well, if you want low latency, like sure, like spend a lot of money on the H100. Yeah. We see a ton of that kind of stuff. And it's surprisingly hard to optimize these models right now. So a lot of people are just running like really unoptimized models. We're doing the same, honestly. Like a lot of models on Replicate have just like not been optimized very well. So something we want to like be able to help people with is optimizing those models. Like either we show people how to with guides or we make it easier to use some of these more optimized inference servers or we show people how to compile the models or we do that automatically or something like that. But that's something we're exploring because there's so much wastage. Like it's not just wasting the GPUs. It's also like a bad experience and the models run slow. So the models on Replicate are almost all pushed by our community. Like people have pushed those models themselves, and it's like a distribution where there's like a long tail of lots of models that people have pushed, and then like a big head of like the models most people run. So models like Llama 2, like Stable Diffusion, you know, we work with Meta and Stability to like maintain those models. And we've done a ton of optimization to make those really fast. So those models are optimized, but the long tail is not. And there's like a lot of wastage there.

Alessio [00:55:32]: And going into the, well, it's already the new year. Do you see the customer demand and the GPU like hardware demand kind of like staying together? Because I think a lot of people are saying, oh, there's like hundreds of thousands of GPUs being shipped this year. Like the crunch is going to be over, but you also have like millions of people that now care about using AI. You know, how do you see the two lines progressing? Are you seeing customer demand is going to outpace the GPU growth? Do you see them together? Do you see maybe a lot of this like model improvement work kind of helping alleviate that?

Ben [00:56:04]: That's a really good question. From our point of view, demand is not outpacing the supply of GPUs. Like, we have enough, from our point of view, we have enough GPUs to go around, but that might change for sure. Yeah.

Alessio [00:56:15]: That's a very nicely put way for a startup founder to respond.

Swyx [00:56:21]: So Alessio's framing was more like sort of picking the wrong box or model, whereas yours is more about maybe the inference stack, if you can call it that. Were you referencing vLLM? What other sort of techniques are you referencing? Also keeping in mind that when I talk to your competitors, and I don't know if we don't have to name any of them, but they are working on trying to optimize the kinds of models. Like they basically, they'll quantize their models for you with their special stack. So you basically use their versions of Llama 2, you use their versions of Mistral, and that's one way to approach it. I don't see it as the Replicate DNA to do that because that would be like sort of, you would have to slap the Replicate house brand on something, which I mean, just comment on any of that. What do you mean when you say optimize models?

Ben [00:57:05]: Things like quantizing the models. You can imagine a way that we could help people quantize their models if we want to. We've had success using inference servers like vLLM and TensorRT-LLM, and we're using those kinds of things to serve language models. We've had success with things like AITemplate, which compiles the models, all of those kinds of things. And there's like some even really just boring things of just like making the code more efficient. Like when they're just writing some Python code, it's really easy to just write inefficient Python code. And there's like really boring things like that as well, but it's like a whole smash of things like that.
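For reference, vLLM's offline API is about this much code; the model name is illustrative, and the exact flags are worth checking against the vLLM docs:

```python
from vllm import LLM, SamplingParams

# Load a model into vLLM's PagedAttention-based serving engine.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # illustrative model
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Write a haiku about GPUs."], params)
for out in outputs:
    print(out.outputs[0].text)
```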

Swyx [00:57:40]: You will do that for a customer? Like you look at their code and-

Ben [00:57:43]: Yeah, we've certainly helped some of our customers be able to do that, some of that stuff. And a lot of the models on, like the popular models on Replicate, we've like rewritten them to use that stuff as well. And like the Stable Diffusion that we run, for example, is compiled with AITemplate to make it super fast. And it's all open source, so you can see all of this stuff on GitHub, if you want to like see how we do it. But you can imagine ways that we could help people. It's almost like built into the Cog layer maybe, where we could help people like use these fast inference servers or use AITemplate to compile their models to make them faster. Whether it's like manual, semi-manual or automatic, we're not really sure, but that's something we want to explore because it benefits everyone.

Swyx [00:58:21]: And then on the competitive piece, there was a price war on Mixtral last year, this last December. As far as I can tell, you guys did not enter that war. You have Mixtral, but it's just regular pricing. I think also some of these players are probably losing money on their pricing. You don't have to say anything, but the break even is somewhere between 50 to 75 cents per million tokens served. How are you thinking about like just the overall competitiveness in the market? How should people choose when everyone's an API?

Ben [00:58:50]: So for Llama 2 and Mistral, I think, not Mixtral, I can't remember exactly, we have similar performance and similar price to some of these other services. We're not like bargain basement compared to some of the others, because to your point, we don't want to burn tons of money, but we're pricing it sensibly and sustainably to a point where we think it's competitive with other people, such that we want developers using Replicate and we don't want to price it such that it's only affordable by big companies. We want to make it cheap enough such that developers can afford it, but we also don't want the super cheap prices, because then it's almost like your customers are hostile and the more customers you get, the worse it gets. So we're pricing it sensibly, but still to the point where hopefully it's cheap enough to build on. And I think the thing we really care about, like, obviously we want models on Replicate to be comparable to other people. But I think the really crucial thing about Replicate, and the way I think we think about it, is that it's not just the API for the model, particularly in open source, that is the important bit. Because quite often with open source models, like the whole point of open source is that you can tinker on it and you can customize it and you can fine tune it and you can like smush it together with another model, like LLaVA, for example. And you can't do that if it's just like a hosted API, because it's just like, you know, you can't touch the code. So what we want to do with Replicate is build a platform that's actually open. So like we've got all of these models where the performance and price is on par with everything else. But if you want to customize it, you can fine tune it, you can go to GitHub and get the source code for it and edit the source code and push up your own custom version and this kind of thing. Because that's the crucial thing for open source machine learning, being able to tinker on it and customize it. And we think that's really important to make open source AI work.
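To give a feel for where a break-even figure like the 50-75 cents per million tokens mentioned above comes from, here is a back-of-the-envelope calculation with entirely made-up numbers; real costs depend on GPU pricing, utilization, batching, and model size:

```python
# Illustrative arithmetic only; none of these numbers are Replicate's.
gpu_cost_per_hour = 4.00           # assumed all-in cost of one A100, $/hour
tokens_per_second = 2_500          # assumed sustained throughput for a ~7B model
utilization = 0.70                 # assumed fraction of the hour spent serving traffic

tokens_per_hour = tokens_per_second * 3600 * utilization
cost_per_million = gpu_cost_per_hour / (tokens_per_hour / 1_000_000)
print(f"~${cost_per_million:.2f} per million tokens")  # ~$0.63 with these numbers
```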

Alessio [01:00:39]: You mentioned open source. How do you think about levels of openness? When Llama 2 came out, I wrote a post about this, about how there's open source, then there's open weights, then there's restrictive weights. It was on the front page of Hacker News, so there were all sorts of comments from people. So I'm always curious to hear your thoughts. Like, what do you think is okay for people to license? What's okay for people to not release?

Ben [01:01:03]: You know, before it was just like closed source, big models; open source, little models, you know, purely open source stuff. And we're now seeing like lots of variations where, you know, model companies are putting restrictive licenses on their models, you know, that mean they can only be used for non-commercial use, you know, and a lot of the, you know, open source crowd is complaining it's not true open source, you know, and all this kind of thing. And I think a lot of that is coming from philosophy, you know, like the sort of free software movement kind of philosophy. And I don't think it's necessarily a bad thing. I think it's good that model companies can make money out of their models. You know, that's like how this will incentivize people to make more models and this kind of thing. And I think it's totally fine if somebody made something to ask for some money in return if you're making money out of it. And I think that's totally okay. And I think there are some really interesting like midpoints as well, where people are releasing the code, you can still tinker on it, but the person who trained the model still wants to get a cut of it if like you're making a bunch of money out of it. And I think that's good. And that's going to make like the ecosystem more sustainable. I don't think anybody's really figured it out yet. We're going to see like more experimentation with this and more people like try to figure out like what are the business models around building models and how can I make money out of this? And we'll just see where it ends up. And I think it's something we want to support as Replicate as well, because we believe in open source. We think it's great, but there's also going to be lots of models which are closed source as well. And these companies might not be, there's probably going to be a long tail of a bunch of people building models that don't have the reach that OpenAI has. And hopefully as Replicate, we can help those people find developers and help them make money and that kind of thing.

Alessio [01:02:46]: I think the compute requirements of AI kind of changed the thing. I started an open source company. I'm a big open source fan. And before, it was kind of like man hours were really all that went into open source. It wasn't much monetary investment. Well, not that man hours are not worth a lot, but if you think about Llama 2, it's like $25 million, you know, like all in. It's like you can't just spin up a Discord and like spend $25 million. So I think it's net positive for everybody that Llama 2 is open source, and, well, whether it fits the strict open source term, I think people, like you're saying, kind of argue on the semantics of it, but like all we care about is that Llama 2 is open, because if Llama 2 wasn't open source today, if Mistral was not open source, we would be in a bad spot, you know?

Ben [01:03:33]: And I think the nuance here is making sure that these models are still tinkerable, because the beautiful thing about Llama 2 as a base model is that like, yeah, it costs $25 million to train to start with, but then you can fine tune it for like 50 bucks. And that's what's so beautiful about the open source ecosystem. And something I think is really surprising as well, like completely surprised me. Like I think a lot of people assumed that open source machine learning was just not going to be practical, because it's so expensive to train these models. But like fine tuning is unreasonably effective and people are getting really good results out of it and it's really cheap. So people can effectively create open source models really cheaply. And there's going to be like this sort of ecosystem of tons of models being made. And I think the risk there from a licensing point of view is we need to make sure that the licenses let people do that, because if you release a big model under a non-commercial license and people can't fine tune it, you've lost the magic of it being open. And I'm sure there are ways to structure that such that the person paying $25 million feels like they're compensated somehow and they feel like they can, you know, they should keep on training models and people can keep on fine tuning it. But I guess we just have to figure out exactly how that plays out.
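The "fine tune it for 50 bucks" point is mostly down to parameter-efficient methods like LoRA, where you train a small adapter on top of frozen base weights instead of the full model. A rough sketch using the peft library; the model name and hyperparameters here are illustrative, not a recipe:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Freeze the expensive base model and attach small trainable adapter matrices.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # illustrative
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # typically well under 1% of the base weights
# ...then train with your usual Trainer loop on a small dataset; only the
# adapters are updated, which is why a single cheap GPU run is often enough.
```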

Swyx [01:04:46]: Excellent. So just wanted to round it out. You've been excellent, very open. I should have started my intro with this, but I feel like you found the sort of AI engineer crew before I did. And, you know, something that really resonated with me in your Series B announcement was that you put in some stats about how there are two orders of magnitude more software engineers than there are machine learning engineers, about 30 million software engineers and 500,000 machine learning engineers. You can maybe plus or minus one of those orders of magnitude, but it's around that ballpark. And so obviously there will be a lot more engineers than there will be ML engineers. How do you see this group? Like, is it all software engineers? Are they going to specialize? What would you advise someone trying to become an AI engineer? Is this a legitimate career path?

Ben [01:05:30]: Yeah, absolutely. I mean, it's very clear that AI is going to be a large part of how we build software in the future. It's a bit like being a software developer in the 90s and ignoring the Internet. You know, you just need to learn about this stuff. You need to figure this stuff out. I don't think it needs to be super low level. The metaphor here is that you don't need to be digging down into like this sort of PyTorch level if you don't want to, in the same way that as a software engineer in the 90s, you didn't need to understand how network stacks work to be able to build a website, you know. But you need to understand the shape of this thing and how to hold it and what it's good at and what it's not. And that's really important. So, yeah, I'd certainly just advise people to like just start playing around with it: get a feel of like how language models work, get a feel of like how these diffusion models work, get a feel of like what fine tuning is and how it works, because some of your job might be building datasets, you know; get a feeling of how prompting works, because some of your job might be writing a prompt. And those are just all really important skills to sort of figure out.

Swyx [01:06:36]: Yeah. Well, thanks for building the definitive platform for doing all that.

Ben [01:06:41]: Yeah, of course.

Alessio [01:06:42]: Any final calls to action? Who should come work at Replicate? Anything for the audience?

Ben [01:06:47]: Yeah, well, I mean, we're hiring. If you click on Jobs at the bottom of replicate.com, there's some jobs. And I just encourage you to like just like try out AI, even if you don't, even if you think you're not smart enough. Like the whole reason I started this company is because I was looking at the cool stuff that Andreas was making. Like Andreas is like a proper machine learning person with a PhD, you know, and I was just like a, you know, a sort of lowly software engineer. I was like, you're doing really cool stuff and I want to be able to do that. And by us working together, you know, we've now made it accessible to dummies like me. And I just encourage anyone who like wants to try this stuff out, just give it a try. I would also encourage people who are tool builders. Like the limiting factor now on AI is not like the technology, like the technology has made incredible advances and there's just so many incredible machine learning models that can do a ton of stuff. The limiting factor is just like making that accessible to people who build products, because it's really hard to use this stuff right now. And obviously we're building some of that stuff at Replicate, but there's just like a ton of other tooling and abstractions that need to be built out to make this stuff usable. So I just encourage people who like building developer tools to just like get stuck into it as well, because that's going to make this stuff accessible to everyone.

Swyx [01:07:58]: Yeah, I especially want to highlight you have a hacker in residence job opening available, which not every company has, which means just join you and hack stuff. I think Charlie Holtz is doing a fantastic job of that.

Ben [01:08:09]: Yeah, effectively. Like most of our, a lot of our job is just like showing people how to use AI. So we've just got a team of like software developers and people have kind of figured this stuff out who are writing about it, who are making videos about it, who are making example applications to show people what you can do with this stuff.

Swyx [01:08:26]: Yeah. In my world that used to be called DevRel, but now it's hacker in residence.

Ben [01:08:31]: And this came from Zeke, who's another one of our hackers.

Swyx [01:08:38]: Tell me this came from Chroma, because I want to start that one.

Ben [01:08:41]: We developed it independently. Like, Anton actually was like, hey, we came up with that first. But I think we came up with it independently, because the story behind this is we originally called it the DevRel team. Yeah. And DevRel's cursed now. Zeke was like, that sounds so boring. I don't want to go to someone and say I'm a developer relations person, or a developer advocate or something. So we were like, okay, what's the way we can make this sound the most fun? All right, you're a hacker.

Swyx [01:09:10]: I would say that is consistently the vibe I get from Replicate, from everyone on your team I interact with. When I go to your San Francisco office, like that's the vibe that you're generating. Like it's a hacker space more than an office. And you hold fantastic meetups there. And I think you're a really positive presence in our community. So thank you for doing all that, and for instilling the hacker vibe and culture into AI.

Ben [01:09:31]: I'm really glad that's working. Cool. That's a wrap.

Alessio [01:09:34]: I think so. Thank you so much for coming on, man.

Ben [01:09:36]: Yeah, of course. Thank you. This is a lot of fun.



Get full access to Latent Space at www.latent.space/subscribe
2024-02-28

Truly Serverless Infra for AI Engineers - with Erik Bernhardsson of Modal

We're writing this one day after the monster release of OpenAI's Sora and Gemini 1.5. We covered this on the ThursdAI space, so head over there for our takes.

IRL: We're ONE WEEK away from Latent Space: Final Frontiers, the second edition and anniversary of our first ever Latent Space event! Also: join us on June 25-27 for the biggest AI Engineer conference of the year!

Online: All three Discord clubs are thriving. Join us every Wednesday/Friday!

Almost 12 years ago, while working at Spotify, Erik Bernhardsson built one of the first open source vector databases, Annoy, based on ANN search. He also built Luigi, one of the predecessors to Airflow, which helps data teams orchestrate and execute data-intensive and long-running jobs. Surprisingly, he didn't start yet another vector database company, but instead in 2021 founded Modal, the "high-performance cloud for developers". In 2022 they opened doors to developers after their seed round, and in 2023 announced their GA with a $16m Series A.

More importantly, they have won fans among both household names like Ramp, Scale AI, Substack, and Cohere, and newer startups like (upcoming guest!) Suno.ai and individual hackers (Modal was the top tool of choice in the Vercel AI Accelerator):

We've covered the nuances of GPU workloads, and how we need new developer tooling and runtimes for them (see our episodes with Chris Lattner of Modular and George Hotz of tiny to start). In this episode, we run through the major limitations of the actual infrastructure behind the clouds that run these models, and how Erik envisions the "postmodern data stack".

In his 2021 blog post "Software infrastructure 2.0: a wishlist", Erik had "Truly serverless" as one of his points:

* The word cluster is an anachronism to an end-user in the cloud! I'm already running things in the cloud where there's elastic resources available at any time. Why do I have to think about the underlying pool of resources? Just maintain it for me.

* I don't ever want to provision anything in advance of load.

* I don't want to pay for idle resources. Just let me pay for whatever resources I'm actually using.

* Serverless doesn't mean it's a burstable VM that saves its instance state to disk during periods of idle.

Swyx called this Self-Provisioning Runtimes back in the day. Modal doesn't put you in YAML hell, preferring to colocate infra provisioning right next to the code that utilizes it, so you can just add GPU (and disk, and retries...):
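A rough sketch of what that looks like in practice, using Modal's Python API as documented around this time (Stub, image builders, function decorators); treat the exact names and arguments as illustrative rather than gospel:

```python
import modal

stub = modal.Stub("wishlist-sketch")
image = modal.Image.debian_slim().pip_install("torch", "transformers")

@stub.function(image=image, gpu="A100", retries=3, timeout=600)
def generate(prompt: str) -> str:
    # Runs in the cloud, inside the container defined above, on an A100,
    # with retries -- all declared right next to the code that needs them.
    return prompt.upper()  # placeholder for a real model call

@stub.local_entrypoint()
def main():
    print(generate.remote("hello from serverless"))
```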

After 3 years, we finally have a big market push for this: running inference on generative models is going to be the killer app for serverless, for a few reasons:

* AI models are stateless: even in conversational interfaces, each message generation is a fully-contained request to the LLM. There's no knowledge that is stored in the model itself between messages, which means that tear down / spin up of resources doesn't create any headaches with maintaining state.

* Token-based pricing is better aligned with serverless infrastructure than fixed monthly costs of traditional software.

* GPU scarcity makes it really expensive to have reserved instances that are available to you 24/7. It's much more convenient to build with a serverless-like infrastructure.

In the episode we covered a lot more topics like maximizing GPU utilization, why Oracle Cloud rocks, and how Erik has never owned a TV in his life. Enjoy!

Show Notes

* Modal

* ErikBot

* Erik?s Blog

* Software Infra 2.0 Wishlist

* Luigi

* Annoy

* Hetzner

* CoreWeave

* Cloudflare FaaS

* Poolside AI

* Modular Inference Engine

Chapters

* [00:00:00] Introductions

* [00:02:00] Erik's OSS work at Spotify: Annoy and Luigi

* [00:06:22] Starting Modal

* [00:07:54] Vision for a "postmodern data stack"

* [00:10:43] Solving container cold start problems

* [00:12:57] Designing Modal's Python SDK

* [00:15:18] Self-Provisioning Runtimes

* [00:19:14] Truly Serverless Infrastructure

* [00:20:52] Beyond model inference

* [00:22:09] Tricks to maximize GPU utilization

* [00:26:27] Differences in AI and data science workloads

* [00:28:08] Modal vs Replicate vs Modular and lessons from Heroku's "graduation problem"

* [00:34:12] Creating Erik's clone "ErikBot"

* [00:37:43] Enabling massive parallelism across thousands of GPUs

* [00:39:45] The Modal Sandbox for agents

* [00:43:51] Thoughts on the AI Inference War

* [00:49:18] Erik's best tweets

* [00:51:57] Why buying hardware is a waste of money

* [00:54:18] Erik's competitive programming backgrounds

* [00:59:02] Why does Sweden have the best Counter Strike players?

* [00:59:53] Never owning a car or TV

* [01:00:21] Advice for infrastructure startups

Transcript

Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI.

Swyx [00:00:14]: Hey, and today we have in the studio Erik Bernhardsson from Modal. Welcome.

Erik [00:00:19]: Hi. It's awesome being here.

Swyx [00:00:20]: Yeah. Awesome seeing you in person. I've seen you online for a number of years as you were building on Modal and I think you're just making a San Francisco trip just to see people here, right? I've been to like two Modal events in San Francisco here.

Erik [00:00:34]: Yeah, that's right. We're based in New York, so I figured sometimes I have to come out to capital of AI and make a presence.

Swyx [00:00:40]: What do you think is the pros and cons of building in New York?

Erik [00:00:45]: I mean, I never built anything elsewhere. I lived in New York the last 12 years. I love the city. Obviously, there's a lot more stuff going on here and there's a lot more customers and that's why I'm out here. I do feel like for me, where I am in life, I'm a very boring person. I kind of work hard and then I go home and hang out with my kids. I don't have time to go to events and meetups and stuff anyway. In that sense, New York is kind of nice. I walk to work every morning. It's like five minutes away from my apartment. It's very time efficient in that sense. Yeah.

Swyx [00:01:10]: Yeah. It's also a good life. So we'll do a brief bio and then we'll talk about anything else that people should know about you. Actually, I was surprised to find out you're from Sweden. You went to college in KTH and your master's was in implementing a scalable music recommender system. Yeah.

Erik [00:01:27]: I had no idea. Yeah. So I actually studied physics, but I grew up coding and I did a lot of programming competition and then as I was thinking about graduating, I got in touch with an obscure music streaming startup called Spotify, which was then like 30 people. And for some reason, I convinced them, why don't I just come and write a master's thesis with you and I'll do some cool collaborative filtering, despite not knowing anything about collaborative filtering really. But no one knew anything back then. So I spent six months at Spotify basically building a prototype of a music recommendation system and then turned that into a master's thesis. And then later when I graduated, I joined Spotify full time.

Swyx [00:02:00]: So that was the start of your data career. You also wrote a couple of popular open source tools while you were there. Is that correct?

Erik [00:02:09]: Yeah, that's right. I mean, I was at Spotify for seven years, so this is a long stint. And Spotify was a wild place early on and I mean, data space is also a wild place. I mean, it was like a Hadoop cluster in the like foosball room on the floor. It was a lot of crude, like very basic infrastructure and I didn't know anything about it. And like I was hired to kind of figure out data stuff. And I started hacking on a recommendation system and then, you know, got sidetracked in a bunch of other stuff. I fixed a bunch of reporting things and set up A-B testing and started doing like business analytics and later got back to the music recommendation system. And a lot of the infrastructure didn't really exist. Like there was like Hadoop back then, which is kind of bad and I don't miss it. But I spent a lot of time with that. As a part of that, I ended up building a workflow engine called Luigi, which briefly ended up being somewhat widely used by a bunch of companies. Sort of like, you know, kind of like Airflow, but like before Airflow. I think it did some things better, some things worse. I also built a vector database called Annoy, which for a while was actually quite widely used. In 2012, so it was like way before like all this like vector database stuff ended up happening. And funny enough, I was actually obsessed with like vectors back then. Like I was like, this is going to be huge. Like just give it like a few years. I didn't know it was going to take like nine years and then there's going to suddenly be like 20 startups doing vector databases in one year. So it did happen. In that sense, I was right. I'm glad I didn't start a startup in the vector database space. I would have started way too early. But yeah, that was, yeah, it was a fun seven years as part of it. It was a great culture, a great company.

Swyx [00:03:32]: Yeah. Just to take a quick tangent on this vector database thing, because we probably won't revisit it but like, has anything architecturally changed in the last nine years?

Erik [00:03:41]: I'm actually not following it like super closely. I think, you know, some of the best algorithms are still the same as like hierarchical navigable small world.

Swyx [00:03:51]: Yeah. HNSW.

Erik [00:03:52]: Exactly. I think now there's like product quantization, there's like some other stuff that I haven't really followed super closely. I mean, obviously, like back then it was like, you know, it's always like very simple. It's like a C++ library with Python bindings and you could mmap big files into memory and do lookups. I used like this kind of recursive hyperplane splitting strategy, which is not that good, but it sort of was good enough at that time. But I think a lot of like HNSW is still like what people generally use. Now of course, like databases are much better in the sense like they support like inserts and updates and stuff like that. I know I never supported that. Yeah, it's sort of exciting to finally see like vector databases becoming a thing.
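For the curious, Annoy's API is still about this small: build an index, save it to disk, and mmap it back in read-only (a minimal sketch with random vectors):

```python
import random
from annoy import AnnoyIndex

dim = 64
index = AnnoyIndex(dim, "angular")
for i in range(1000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(dim)])
index.build(10)           # 10 trees; more trees = better recall, bigger index
index.save("vectors.ann")

# Elsewhere, load it read-only; the file is memory-mapped rather than read into RAM.
index2 = AnnoyIndex(dim, "angular")
index2.load("vectors.ann")
print(index2.get_nns_by_item(0, 5))  # 5 approximate nearest neighbours of item 0
```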

Swyx [00:04:30]: Yeah. Yeah. And then maybe one takeaway on most interesting lesson from Daniel Ek?

Erik [00:04:36]: I mean, I think Daniel Ek, you know, he started Spotify very young. Like he was like 25, something like that. And that was like a good lesson. But like he, in a way, like I think he was a very good leader. Like there was never anything like, no scandals or like no, he wasn't very eccentric at all. It was just kind of like very like level headed, like just like ran the company very well, like never made any like obvious mistakes or I think it was like a few bets that maybe like in hindsight were like a little, you know, like took us, you know, too far in one direction or another. But overall, I mean, I think he was a great CEO, like definitely, you know, up there, like generational CEO, at least for like Swedish startups.

Swyx [00:05:09]: Yeah, yeah, for sure. Okay, we should probably move to make our way towards Modal. So then you spent six years as CTO of Better. You were an early engineer and then you scaled up to like 300 engineers.

Erik [00:05:21]: I joined as a CTO when there was like no tech team. And yeah, that was a wild chapter in my life. Like the company did very well for a while. And then like during the pandemic, yeah, it was kind of a weird story, but yeah, it kind of collapsed.

Swyx [00:05:32]: Yeah, laid off people poorly.

Erik [00:05:34]: Yeah, yeah. It was like a bunch of stories. Yeah. I mean, the company like grew from like 10 people when I joined to 10,000, now it's back to a thousand. But yeah, they actually went public a few months ago, kind of crazy. They're still around, like, you know, they're still, you know, doing stuff. So yeah, very kind of interesting six years of my life, for non-technical reasons, like I managed like three, four hundred people, but yeah, I learned a lot from that, like recruiting. I spent all my time recruiting and stuff like that. And so managing at scale, it's like nice, like now in a way, like when I'm building my own startup, it's actually something I like, don't feel nervous about at all. Like I've managed at scale, like I feel like I can do it again. It's like very different things that I'm nervous about as a startup founder. But yeah, I started Modal three years ago after sort of, after leaving Better, I took a little bit of time off during the pandemic and, but yeah, pretty quickly I was like, I got to build something. I just want to, you know. Yeah. And then yeah, Modal took form in my head, took shape.

Swyx [00:06:22]: And as far as I understand, and maybe we can sort of trade off questions. So the quick history is started Modal in 2021, got your seed with Sarah from Amplify in 2022. You just announced your Series A with Redpoint. That's right. And that brings us up to mostly today. Yeah. Most people, I think, were expecting you to build for the data space.

Erik: But it is the data space.

Swyx: When I think of data space, I come from like, you know, Snowflake, BigQuery, you know, Fivetran, Airbyte, that kind of stuff. And what Modal became is more general purpose than that. Yeah.

Erik [00:06:53]: Yeah. I don't know. It was like fun. I actually ran into like Edo Liberty, the CEO of Pinecone, like a few weeks ago. And he was like, I was so afraid you were building a vector database. No, I started Modal because, you know, like in a way, like I worked with data, like throughout most of my career, like every different part of the stack, right? Like I did everything from business analytics to like deep learning, you know, like building, you know, training neural networks at scale, like everything in between. And so one of the thoughts, like, and one of the observations I had when I started Modal, or like why I started, was like, I just wanted to build better tools for data teams. And like very, like sort of abstract thing, but like, I find that the data stack is, you know, full of like point solutions that don't integrate well. And still, when you look at like data teams today, you know, like every startup ends up building their own internal Kubernetes wrapper or whatever. And you know, all the different data engineers and machine learning engineers end up kind of struggling with the same things. So I started thinking about like, how do I build a new data stack, which is kind of a megalomaniac project, like, because you kind of want to like throw out everything and start over.

Swyx [00:07:54]: It's almost a modern data stack.

Erik [00:07:55]: Yeah, like a postmodern data stack. And so I started thinking about that. And a lot of it came from like, like more focused on like the human side of like, how do I make data teams more productive? And like, what are the technology tools that they need? And like, you know, drew out a lot of charts of like, how the data stack looks, you know, what are the different components. And what shows up as actually very interesting is like workflow scheduling, because it kind of sits in like a nice sort of, you know, it's like a hub in the graph of like data products. But it was kind of hard to like, kind of do that in a vacuum, and also to monetize it to some extent. I got very interested in like the layers below at some point. And like, at the end of the day, like most people have code they have to run somewhere. So I think about like, okay, well, how do you make that nice? Like how do you make that? And in particular, like the thing I always like thought about, like developer productivity is like, I think the best way to measure developer productivity is like in terms of the feedback loops, like how quickly when you iterate, like when you write code, like how quickly can you get feedback. And at the innermost loop, it's like writing code and then running it. And like, as soon as you start working with the cloud, like it's like takes minutes suddenly, because you have to build a Docker container and push it to the cloud and like run it, you know. So that was like the initial focus for me was like, I just want to solve that problem. Like I want to, you know, build something that lets you run things in the cloud and like retain the sort of, you know, the joy of productivity as when you're running things locally. And in particular, I was quite focused on data teams, because I think they had a couple unique needs that weren't well served by the infrastructure at that time, or like still aren't; in particular, like Kubernetes, I feel like it's like kind of worked okay for back end teams, but not so well for data teams. And very quickly, I got sucked into like a very deep like rabbit hole of like...

Swyx [00:09:24]: Not well for data teams because of burstiness. Yeah, for sure.

Erik [00:09:26]: So like burstiness is like one thing, right? Like, you know, like you often have this like fan out, you want to like apply some function over very large data sets. Another thing tends to be like hardware requirements, like you need like GPUs and like, I've seen this in many companies, like you go, you know, data scientists go to a platform team and they're like, can we add GPUs to the Kubernetes? And they're like, no, like, that's, you know, complex, and we're not gonna, so like just getting GPU access. And then like, I mean, also like data code, like frankly, or like machine learning code like tends to be like, super annoying in terms of like environments, like you end up having like a lot of like custom, like containers and like environment conflicts. And like, it's very hard to set up like a unified container that like can serve like a data scientist, because like, there's always like packages that break. And so I think there's a lot of different reasons why the technology wasn't well suited for data teams. And I think the attitude at that time is often like, you know, like you had friction between the data team and the platform team, like, well, it works for the back end stuff, you know, why don't you just like, you know, make it work. But like, I actually felt like data teams, you know, or at this point now, like there's so much, so many people working with data, and like they, to some extent, like deserve their own tools and their own tool chains, and like optimizing for that is not something people have done. So that's, that's sort of like the very abstract philosophical reason why I started Modal. And then, and then I got sucked into this like rabbit hole of like container cold start and, you know, like whatever, Linux, page cache, you know, file system optimizations.

Swyx [00:10:43]: Yeah, tell people, I think the first time I met you, I think you told me some numbers, but I don't remember, like, what are the main achievements that you were unhappy with the status quo? And then you built your own container stack?

Erik [00:10:52]: Yeah, I mean, like, in particular, it was like, in order to have that loop, right? You want to be able to start, like take code on your laptop, whatever, and like run in the cloud very quickly, and like running in custom containers, and maybe like spin up like 100 containers, 1000, you know, things like that. And so container cold start was the initial like, from like a developer productivity point of view, it was like, really, what I was focusing on is, I want to take code, I want to stick it in a container, I want to execute in the cloud, and like, you know, make it feel like fast. And when you look at like, how Docker works, for instance, like Docker, you have this like, fairly convoluted, like very resource inefficient way, they, you know, you build a container, you upload the whole container, and then you download it, and you run it. And Kubernetes is also like, not very fast at like starting containers. So like, I started kind of like, you know, going a layer deeper, like Docker is actually like, you know, there's like a couple of different primitives, but like a lower level primitive is runc, which is like a container runner. And I was like, what if I just take the container runner, like runc, and I point it to like my own root file system, and then I built like my own virtual file system that exposes files over a network instead. And that was like the sort of very crude version of Modal, it's like now I can actually start containers very quickly, because it turns out like when you start a Docker container, like, first of all, like most Docker images are like several gigabytes, and like 99% of that is never going to be consumed, like there's a bunch of like, you know, like timezone information for like Uzbekistan, like no one's going to read it. And then there's a very high overlap between the files that are going to be read, there's going to be like libtorch or whatever, like it's going to be read. So you can also cache it very well. So that was like the first sort of stuff we started working on was like, let's build this like container file system. And you know, coupled with like, you know, just using runc directly. And that actually enabled us to like, get to this point of like, you write code, and then you can launch it in the cloud within like a second or two, like something like that. And you know, there's been many optimizations since then, but that was sort of the starting point.
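A purely conceptual sketch of the trick Erik is describing, not Modal's actual code: skip the image pull, mount a lazily fetched root filesystem (the `lazyfs` helper below is hypothetical), and hand it straight to runc via an OCI bundle:

```python
import json
import pathlib
import subprocess

bundle = pathlib.Path("/tmp/bundle")
(bundle / "rootfs").mkdir(parents=True, exist_ok=True)

# Hypothetical: a FUSE filesystem that fetches file contents over the network
# on first read and caches hot files (libtorch, CPython, ...) locally.
subprocess.run(["lazyfs", "mount", "--image", "my-image", str(bundle / "rootfs")], check=True)

# Generate a default OCI runtime spec (config.json) in the bundle, set the
# command to run, then start the container with runc directly.
subprocess.run(["runc", "spec"], cwd=bundle, check=True)
spec = json.loads((bundle / "config.json").read_text())
spec["process"]["args"] = ["python3", "/app/predict.py"]
(bundle / "config.json").write_text(json.dumps(spec))
subprocess.run(["runc", "run", "--bundle", str(bundle), "my-container"], check=True)
```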

Alessio [00:12:33]: Can we talk about the developer experience as well, I think one of the magic things about Modal is at the very basic layers, like a Python function decorator, it's just like stub and whatnot. But then you also have a way to define a full container, what were kind of the design decisions that went into it? Where did you start? How easy did you want it to be? And then maybe how much complexity did you then add on to make sure that every use case fit?

Erik [00:12:57]: I mean, Modal, I almost feel like it's like almost like two products kind of glued together. Like there's like the low level like container runtime, like file system, all that stuff like in Rust. And then there's like the Python SDK, right? Like how do you express applications? And I think, I mean, Swyx, like I think your blog post on the self-provisioning runtime was like, to me, always like to sort of, for me, like an eye-opening thing. It's like, so I didn't think about like...

Swyx [00:13:15]: You wrote your post four months before me. Yeah? The software 2.0, Infra 2.0. Yeah.

Erik [00:13:19]: Well, I don't know, like convergence of minds. I guess we were like both thinking. Maybe you put, I think, better words than like, you know, maybe something I was like thinking about for a long time. Yeah.

Swyx [00:13:29]: And I can tell you how I was thinking about it on my end, but I want to hear you say it.

Erik [00:13:32]: Yeah, yeah, I would love to. So to me, like what I always wanted to build was like, I don't know, like, I don't know if you use like Pulumi. Like Pulumi is like nice, like in the sense, like it's like Pulumi is like you describe infrastructure in code, right? And to me, that was like so nice. Like finally I can like, you know, put a for loop that creates S3 buckets or whatever. And I think like Modal sort of goes one step further in the sense that like, what if you also put the app code inside the infrastructure code and like glue it all together and then like you only have one single place that defines everything and it's all programmable. You don't have any config files. Like Modal has like zero config. There's no config. It's all code. And so that was like the goal that I wanted, like part of that. And then the other part was like, I often find that so much of like my time was spent on like the plumbing between containers. And so my thing was like, well, if I just build this like Python SDK and make it possible to like bridge like different containers, just like a function call, like, and I can say, oh, this function runs in this container and this other function runs in this container and I can just call it just like a normal function, then, you know, I can build these applications that may span a lot of different environments. Maybe they fan out, start other containers, but it's all just like inside Python. You just like have this beautiful kind of nice like DSL almost for like, you know, how to control infrastructure in the cloud. So that was sort of like how we ended up with the Python SDK as it is, which is still evolving all the time, by the way. We keep changing syntax quite a lot because I think it's still somewhat exploratory, but we're starting to converge on something that feels like reasonably good now.

Swyx [00:14:54]: Yeah. And along the way you, with this expressiveness, you enabled the ability to, for example, attach a GPU to a function. Totally.

Erik [00:15:02]: Yeah. It's like you just like say, you know, on the function decorator, you're like GPU equals, you know, A100, or like GPU equals A10 or T4 or something like that. And then you get that GPU and like, you know, you just run the code and it runs. Like you don't have to, you know, go through hoops to, you know, start an EC2 instance or whatever.
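
For readers who haven't seen the SDK, a rough sketch of the decorator-based style being described is below. The names (Stub, function, gpu=..., local_entrypoint, .remote) follow Modal's public docs from around this time but have shifted across SDK versions, so treat this as an approximation rather than reference documentation.

```python
# Approximate sketch of Modal's decorator style; names may differ by version.
import modal

stub = modal.Stub("example-app")

@stub.function()
def preprocess(text: str) -> str:
    # runs in its own container in the cloud
    return text.lower().strip()

@stub.function(gpu="A100")  # request a GPU right on the decorator
def embed(text: str) -> list[float]:
    # a real model call would go here; placeholder output for the sketch
    return [float(len(text))]

@stub.local_entrypoint()
def main():
    cleaned = preprocess.remote("Hello, Modal!")
    print(embed.remote(cleaned))  # two containers, glued by a function call
```

The "bridge different containers with a function call" point is the .remote() call: the two functions can land on different machines and different hardware, but the calling code reads as ordinary Python.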

Swyx [00:15:18]: Yeah. So it's all code. Yeah. So one of the reasons I wrote Self-Provisioning Runtimes was I was working at AWS and we had AWS CDK, which is kind of like, you know, the Amazon version of Pulumi. Yeah, totally. And then, and then like it creates, it compiles to CloudFormation. Yeah. And then on the other side, you have to like get all the config stuff and then put it into your application code and make sure that they line up. So then you're writing code to define your infrastructure, then you're writing code to define your application. And I was just like, this is like obvious that it's going to converge, right? Yeah, totally.

Erik [00:15:48]: But isn't there like, it might be wrong, but like, was it like SAM or Chalice or one of those? Like, isn't that like an AWS thing that where actually they kind of did that? I feel like there's like one.

Swyx [00:15:57]: SAM. Yeah. Still very clunky. It's not, not as elegant as Modal.

Erik [00:16:03]: I love AWS for like the stuff it's built, you know, like historically, like what it enables me to build, but like AWS has always struggled with developer experience.

Swyx [00:16:11]: I mean, they have to not break things.

Erik [00:16:15]: Yeah. Yeah. And totally. And they have to build products for a very wide range of use cases. And I think that's hard.

Swyx [00:16:21]: Yeah. Yeah. So it's, it's easier to design for. Yeah. So anyway, I was, I was pretty convinced that this, this would happen. I wrote, wrote that thing. And then, you know, imagine my surprise that you guys had it on your landing page at some point. I think, I think Akshat was just like, just throw that in there.

Erik [00:16:34]: Did you trademark it?

Swyx [00:16:35]: No, I didn't. But I definitely got sent a few pitch decks with my post on there and it was like really interesting. This is my first time like kind of putting a name to a phenomenon. And I think this is a useful skill for people to just communicate what they're trying to do.

Erik [00:16:48]: Yeah. No, I think it's a beautiful concept.

Swyx [00:16:50]: Yeah. Yeah. Yeah. But I mean, obviously you implemented it. What became more clear in your explanation today is that actually you're not that tied to Python.

Erik [00:16:57]: No. I mean, I think that all the like lower level stuff is, you know, just running containers and like scheduling things and, you know, serving container data and stuff. So like one of the benefits of data teams is obviously like they're all using Python, right? And so that made it a lot easier. I think, you know, if we had focused on other workloads, like, you know, for various reasons, we've like been kind of like half thinking about like CI or like things like that. But like, in a way that's harder because then you have to build, you know, multiple SDKs, whereas, you know, focusing on data teams, Python like covers like 95% of all teams. That made it a lot easier. But like, I mean, like definitely in the future, we're going to support other languages. JavaScript for sure is the obvious next language. But you know, who knows, like, you know, Rust, Go, R, whatever, PHP, Haskell, I don't know.

Swyx [00:17:42]: You know, I think for me, I actually am a person who like kind of liked the idea of programming language advancements being improvements in developer experience. But all I saw out of the academic sort of PLT type people is just type level improvements. And I always think like, for me, like one of the core reasons for self-provisioning runtimes and then why I like Modal is like, this is actually a productivity increase, right? Like, it's a language level thing, you know, you managed to stick it on top of an existing language, but it is your own language, a DSL on top of Python. And so language level increase on the order of like automatic memory management. You know, you could sort of make that analogy that like, maybe you lose some level of control, but most of the time you're okay with whatever Modal gives you. And like, that's fine. Yeah.

Erik [00:18:26]: Yeah. Yeah. I mean, that's how I look at it too. Like, you know, you look at developer productivity over the last number of decades, like, you know, it's come in small increments. Like dynamic typing is one thing, because suddenly for a lot of use cases you don't need to care about type systems, or better compiler technology, or like, you know, the cloud, or like, you know, relational databases. And, you know, I think you look at that history and it's steady: developers have been getting like probably 10X more productive every decade for the last four decades or something, which is kind of crazy. On an exponential scale, we're talking about a 10,000X, you know, improvement in developer productivity. What we can build today, you know, is arguably like, you know, a fraction of the cost of what it took to build it in the eighties. Maybe it wasn't even possible in the eighties. So that to me, like, that's like so fascinating. I think it's going to keep going for the next few decades. Yeah.

Alessio [00:19:14]: Yeah. Another big thing in the Infra 2.0 wishlist was truly serverless infrastructure. The other thing on your landing page, you called them native cloud functions, something like that. I think the issue I've seen with serverless has always been people really wanted it to be stateful, even though stateless was much easier to do. And I think now with AI, most model inference is like stateless, you know, outside of the context. So that's kind of made it a lot easier to just put a model, like an AI model, on Modal to run. How do you think about how that changes how people think about infrastructure too? Yeah.

Erik [00:19:48]: I mean, I think Modal is definitely going in the direction of like doing more stateful things and working with data and like high IO use cases. I do think one like massive serendipitous thing that happened like halfway, you know, a year and a half into, you know, building Modal was like Gen AI started exploding, and the IO pattern of Gen AI fits the serverless model like so well, because it's like, you know, you send this tiny piece of information, like a prompt, right, or something like that. And then like you have this GPU that does like trillions of flops, and then it sends back like a tiny piece of information, right. And that turns out to be something like, you know, if you can get serverless working with GPUs, that just like works really well, right. So I think from that point of view, like serverless always to me felt like a little bit of like a solution looking for a problem. I don't actually think like backend is like the problem that needs serverless, or like not as much. But I look at data and in particular, like things like Gen AI, like model inference, like it's like clearly a good fit. So I think that, you know, to a large extent explains like why we saw, you know, the initial sort of like killer app for Modal being model inference, which actually wasn't like necessarily what we were focused on. But that's where we've seen like by far the most usage. Yeah.

Swyx [00:20:52]: And this was before you started offering like fine tuning of language models, it was mostly stable diffusion. Yeah.

Erik [00:20:59]: Yeah. I mean, like Modal, like I always built it to be a very general purpose compute platform, like something where you can run everything. And I used to call Modal like a better Kubernetes for data teams for a long time. What we realized was like, yeah, that's like, you know, a year and a half in, like we barely had any users or any revenue. And like we were like, well, maybe we should look at like some use case, trying to think of use cases. And that was around the same time Stable Diffusion came out. And the beauty of Modal is like you can run almost anything on Modal, right? Like model inference turned out to be the place where we found initially, well, like clearly this has like 10x better ergonomics than anything else. But we're also like, you know, going back to my original vision, like we're thinking a lot about, you know, now, okay, now we do inference really well. Like what about training? What about fine tuning? What about, you know, end-to-end lifecycle deployment? What about data pre-processing? What about, you know, I don't know, real-time streaming? What about, you know, large data munging? Like, there's just, data observability. I think there's so many things, like kind of going back to what I said about like redefining the data stack, like starting with the foundation of compute. Like one of the exciting things about Modal is like we've sort of, you know, we've been working on that for three years and it's maturing, but like there's so many things you can do with just like a better compute primitive, and also go up the stack and like do all this other stuff on top of it.

Alessio [00:22:09]: How do you think about, or rather, like, I would love to learn more about the underlying infrastructure and like how you make that happen. Because with fine tuning and training, it's static memory. Like you know exactly what you're going to load in memory and it's kind of like a set amount of compute, versus inference, where the traffic is very bursty. How do you make batches work with a serverless developer experience? You know, like what are like some fun technical challenges you solved to make sure you get max utilization on these GPUs? What we hear from people is like, we have GPUs, but we can really only get like, you know, 30, 40, 50% maybe utilization. What's some of the fun stuff you're working on to get a higher number there?

Erik [00:22:48]: Yeah, I think on the inference side, like that's where we like, you know, like from a cost perspective, like utilization perspective, we've seen, you know, very good numbers, and in particular, it's our ability to start containers and stop containers very quickly. And that means that we can auto scale extremely fast and scale down very quickly, which means like we can always adjust the sort of capacity, the number of GPUs running, to the exact traffic volume. And so in many cases, like that actually leads to a sort of interesting thing where like we obviously run our things on the public cloud, like AWS, GCP, we run on Oracle, but in many cases, like users who do inference on those platforms or those clouds, even though we charge a slightly higher price per GPU hour, a lot of users like moving their large scale inference use cases to Modal, they end up saving a lot of money because we only charge for the time the GPU is actually running. And that's a hard problem, right? Like, you know, if you have to constantly adjust the number of machines, if you have to start containers, stop containers, like that's a very hard problem. Starting containers quickly is a very difficult thing. I mentioned we had to build our own file system for this. We also, you know, built our own container scheduler for that. We've implemented recently CPU memory checkpointing, so we can take running containers and snapshot the entire CPU, like including registers and everything, and restore it from that point, which means we can restore it from an initialized state. We're looking at GPU checkpointing next, it's like a very interesting thing. So I think with inference stuff, that's where serverless really shines, because you can push the frontier of latency versus utilization quite substantially, you know, which either ends up being a latency advantage or a cost advantage or both, right? On training, it's probably arguably like less of an advantage doing serverless, frankly, because, you know, you can just like spin up a bunch of machines and try to, you know, train as much as you can on each machine. For that area, like we've seen, you know, arguably like less usage for Modal, but there are always like some interesting use cases. Like we do have a couple of customers, like Ramp, for instance, like they do fine tuning with Modal, and basically like one of the patterns they have is like very bursty type fine tuning where they fine tune 100 models in parallel. And that's like a separate thing that Modal does really well, right? Like we can start up 100 containers very quickly, run a fine tuning training job on each one of them that only runs for, I don't know, 10, 20 minutes. And then, you know, you can do hyperparameter tuning in that sense, like just pick the best model and things like that. So there are like interesting training use cases. I think when you get to like training, like very large foundational models, that's a use case we don't support super well, because that's very high IO, you know, you need to have like InfiniBand and all these things. And those are things we haven't supported yet and might take a while to get to. So that's probably like an area where we're relatively weak. Yeah.
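
As a rough illustration of the bursty fine-tuning pattern Erik describes (fan out one short training job per configuration, then keep the best run), something like the sketch below is possible with the SDK. The train_one body, its hyperparameters, and the metric are placeholders; only the fan-out shape is the point, and method names may differ across SDK versions.

```python
# Hedged sketch of the bursty fine-tuning pattern: one short training job per
# hyperparameter config, all fanned out at once, then keep the best run.
import modal

stub = modal.Stub("bursty-finetune")

@stub.function(gpu="A100", timeout=30 * 60)
def train_one(learning_rate: float, lora_rank: int) -> dict:
    # ...load the base model, fine-tune for 10-20 minutes, evaluate...
    val_loss = learning_rate * lora_rank  # stand-in metric for the sketch
    return {"lr": learning_rate, "rank": lora_rank, "val_loss": val_loss}

@stub.local_entrypoint()
def sweep():
    configs = [(lr, r) for lr in (1e-5, 3e-5, 1e-4) for r in (8, 16, 32)]
    # starmap launches one container per config, roughly all at once
    results = list(train_one.starmap(configs))
    best = min(results, key=lambda r: r["val_loss"])
    print("best config:", best)
```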

Alessio [00:25:12]: Have you cared at all about lower level model optimization? There are other cloud providers that do custom kernels to get better performance. Or are you just not doing that, given that you're not just an AI compute company? Yeah.

Erik [00:25:24]: I mean, I think like we want to support like generic, like general workloads, in the sense that we want users to give us a container, essentially, or code. And then we want to run that. So I think, you know, we benefit from those things in the sense that like we can tell our users, you know, to use those things. But I don't know if we want to like poke into users' containers and like do those things automatically. That's sort of, I think, a little bit tricky from the outside to do, because we want to be able to take arbitrary code and execute it. But certainly like, you know, we can tell our users to like use those things. Yeah.

Swyx [00:25:53]: I may have betrayed my own biases because I don't really think about Modal as for data teams anymore. I think you started there, but I think you're much more for AI engineers. My favorite anecdote, which I think you know, but I don't know if you directly experienced it: I went to the Vercel AI Accelerator, which you supported. And in the Vercel AI Accelerator, a bunch of startups gave like free credits and like signups and talks and all that stuff. The only ones that stuck are the ones that actually appealed to engineers. And the top usage, the top tool used by far was Modal.

Erik [00:26:24]: That's awesome.

Swyx [00:26:25]: For people building with AI apps. Yeah.

Erik [00:26:27]: I mean, it might be also like a terminology question, like the AI versus data, right? Like I've, you know, maybe I'm just like old and jaded, but like, I've seen so many like different titles, like for a while it was like, you know, I was a data scientist and a machine learning engineer and then, you know, there was like analytics engineers and there was like an AI engineer, you know? So like, to me, it's like, I just like in my head, that's to me just like, just data, like, or like engineer, you know, like I don't really, so that's why I've been like, you know, just calling it data teams. But like, of course, like, you know, AI is like, you know, like such a massive fraction of our like workloads.

Swyx [00:26:59]: It's a different Venn diagram of things you do, right? So the stuff that you're talking about where you need like InfiniBand for like highly parallel training, that's not, that's more of the ML engineer, that's more of the research scientist, and less of the AI engineer, which is more sort of trying to work at the application layer.

Erik [00:27:16]: Yeah. I mean, to be fair to it, like we have a lot of users that are like doing stuff that I don't think fits neatly into like AI. Like we have a lot of people using Modal for web scraping, like it's kind of nice. You can just, you know, fire up like a hundred or a thousand containers running Chromium and just like render a bunch of webpages and it takes, you know, whatever. Or like, you know, protein folding, I mean, you know, we have a bunch of users doing that. Or, you know, in the realm of biotech, like sequence alignment, like people using, or like a couple of people using Modal to run like large, like mixed integer programming problems, like, you know, using Gurobi or like things like that. So video processing is another thing that keeps coming up, like, you know, let's say you have like petabytes of video and you want to just like transcode it, like, or you can fire up a lot of containers and just run FFmpeg or like, so there are those things too. Like, I mean, like that being said, like AI is by far our biggest use case, but you know, like, again, like Modal is kind of general purpose in that sense.

Swyx [00:28:08]: Yeah. Well, maybe I'll stick to the Stable Diffusion thing and then we'll move on to the other use cases for AI that you want to highlight. The other big player in my mind is Replicate. Yeah. In this, in this era, they're much more, I guess, custom built for that purpose, whereas you're more general purpose. How do you position yourself with them? Are they just for like different audiences or are you just head-on competing?

Erik [00:28:29]: I think there's like a tiny sliver of the Venn diagram where we're competitive. And then like 99% of the area we're not competitive. I mean, I think for people who, if you look at like front-end engineers, I think that's where really they found good fit. It's like, you know, people who built some cool web app and they want some sort of AI capability and they just, you know, an off the shelf model is like perfect for them. That's like, go use Replicate. That's great. I think where we shine is like custom models or custom workflows, you know, running things at very large scale. We need to care about utilization, care about costs. You know, we have much lower prices because we spend a lot more time optimizing our infrastructure, you know, and that's where we're competitive, right? Like, you know, and you look at some of the use cases, like Suno is a big user, like they're running like large scale, like AI. Oh, we're talking with Mikey.

Swyx [00:29:12]: Oh, that's great. Cool.

Erik [00:29:14]: In a month. Yeah. So, I mean, they're, they're using Modal for like production infrastructure. Like they have their own like custom model, like custom code and custom weights, you know, for AI generated music, Suno.AI, you know. Those are the types of use cases that we like, you know, things that are very custom, and those are the things that are very hard to run on Replicate, right? And that's fine. Like I think they focus on a very different part of the stack in that sense.

Swyx [00:29:35]: And then the other company pattern that I pattern match you to is Modular. I don't know.

Erik [00:29:40]: Because of the names?

Swyx [00:29:41]: No, no. Wow. No, but yeah, yes, the name is very similar. I think there's something that might be insightful there from a linguistics point of view. Oh no, they have Mojo, the sort of Python SDK. And they have the Modular Inference Engine, which is their sort of their cloud stack, their sort of compute inference stack. I don't know if anyone's made that comparison to you before, but like I see you evolving a little bit in parallel there.

Erik [00:30:01]: No, I mean, maybe. Yeah. Like it's not a company I'm like super like familiar, like, I mean, I know the basics, but like, I guess they're similar in the sense like they want to like do a lot of, you know, they have sort of big picture vision.

Swyx [00:30:12]: Yes. They also want to build very general purpose. Yeah. So they're marketing themselves as like, if you want to do off the shelf stuff, go out, go somewhere else. If you want to do custom stuff, we're the best place to do it. Yeah. Yeah. There is some overlap there. There's not overlap in the sense that you are a closed source platform. People have to host their code on you. That's true. Whereas for them, they're very insistent on not running their own cloud service. They're a box software. Yeah. They're licensed software.

Erik [00:30:37]: I'm sure their VCs are at some point going to force them to reconsider. No, no.

Swyx [00:30:40]: Chris is very, very insistent and very convincing. So anyway, I would just make that comparison, let people make the links if they want to. But it's an interesting way to see the cloud market develop from my point of view, because I came up in this field thinking cloud is one thing, and I think your vision is like something slightly different, and I see the different takes on it.

Erik [00:31:00]: Yeah. And like one thing I've, you know, like I've written a bit about it in my blog too, it's like I think of us as like a second layer of cloud provider, in the sense that like, I think Snowflake is like kind of a good analogy. Like Snowflake, you know, is infrastructure as a service, right? But they actually run on the major clouds, right? And I mean, like you can like analyze this very deeply, but like one of the things I always thought about is like, why does Snowflake arguably like win over Redshift? And I think Snowflake, you know, to me, won because like, I mean, in the end, like AWS makes all the money anyway, and like Snowflake just had the ability to focus on like developer experience or like, you know, user experience. And to me, that like really proved that you can build a cloud provider a layer up from, you know, the traditional like public clouds. And in that layer, that's also where I would put Modal. It's like, you know, we're building a cloud provider, like we're, you know, we're like a multi-tenant environment that runs the user code. But we're also building on top of the public cloud. So I think there's a lot of room in that space, which I think is a very sort of interesting direction.

Alessio [00:31:55]: How do you think of that compared to the traditional past history, like, you know, you had AWS, then you had Heroku, then you had Render, Railway.

Erik [00:32:04]: Yeah, I mean, I think those are all like great. I think the problem that they all faced was like the graduation problem, right? Like, you know, Heroku or like, I mean, like also like Heroku, there's like a counterfactual future of like, what would have happened if Salesforce didn't buy them, right? Like, that's a sort of separate thing. But like, I think what Heroku always struggled with was like, eventually companies would get big enough that you couldn't really justify running on Heroku. So they would just go and like move it to, you know, AWS or whatever, in particular. And you know, that's something that keeps me up at night too, like, what does that graduation risk look like for Modal? I always think like the only way to build a successful infrastructure company in the long run in the cloud today is you have to appeal to the entire spectrum, right? Or at least like the enterprise, like you have to capture the enterprise market. But the truly good companies capture the whole spectrum, right? Like I think of companies like, I don't know, Datadog or Mongo or something, that like captured the hobbyists and acquired them, but also, you know, have very large enterprise customers. I think that arguably was, in my opinion, where Heroku struggled: like, how do you maintain the customers as they get more and more advanced? I don't know what the solution is, but I think there's, you know, that's something I would have thought deeply about if I was at Heroku at that time.

Alessio [00:33:14]: What's the AI graduation problem? Is it, I need to fine tune the model, I need better economics, any insights from customer discussions?

Erik [00:33:22]: Yeah, I mean, better economics, certainly. But although like, I would say like, even for people who, you know, need like thousands of GPUs, just because we can drive utilization so much better, there's actually like a cost advantage of staying on Modal. But yeah, I mean, certainly like, you know, and like the fact that VCs like love, you know, throwing money, at least they used to, you know, at companies who need it to buy GPUs, I think that didn't help the problem. And in training, I think, you know, there's less software differentiation. So in training, I think there are certainly like better economics of buying big clusters. But I mean, my hope is it's going to change, right? Like I think, you know, we're still pretty early in the cycle of like building AI infrastructure. And I think a lot of these companies, in the long run, like, you know, except maybe the super big ones, like, you know, Facebook and Google, who are always going to build their own, but like everyone else, to some extent, you know, I think they're better off like buying platforms. And, you know, someone's going to have to build those platforms.

Swyx [00:34:12]: Yeah. Cool. Let's move on to language models and just specifically that workload, just to flesh it out a little bit. You already said that Ramp is like fine tuning 100 models at once on Modal. Closer to home, my favorite example is ErikBot. Maybe you want to tell that story.

Erik [00:34:30]: Yeah. I mean, it was a prototype thing we built for fun, but it's pretty cool. Like we basically built this thing that hooks up to Slack. It like downloads all the Slack history and, you know, fine-tunes a model based on a person. And then you can chat with that. And so you can like, you know, clone yourself and like talk to yourself on Slack. I mean, it's like a nice demo and it's just like, I think it's like fully contained in Modal. Like there's a Modal app that does everything, right? Like it downloads Slack, you know, integrates with the Slack API, like downloads the stuff, the data, like just runs the fine-tuning and then like creates, like, dynamically an inference endpoint. And it's all like self-contained and like, you know, a few hundred lines of code. So I think it's sort of a good kind of use case for, or like it kind of demonstrates a lot of the capabilities of Modal.

Alessio [00:35:08]: Yeah. On a more personal side, how close did you feel ErikBot was to you?

Erik [00:35:13]: It definitely captured the like the language. Yeah. I mean, I don't know, like the content, I always feel this way about like AI and it's gotten better. Like when you look at like AI output of text, like, and it's like, when you glance at it, it's like, yeah, this seems really smart, you know, but then you actually like look a little bit deeper. It's like, what does this mean?

Swyx [00:35:32]: What does this person say?

Erik [00:35:33]: It's like kind of vacuous, right? And that's like kind of what I felt like, you know, talking to like my clone version. Like it says things like, the grammar is correct, like some of the sentences make a lot of sense, but like, what are you trying to say? Like there's no content here. I don't know. I mean, I got that feeling also with ChatGPT in the like early versions. Right now it's like better, but.

Alessio [00:35:51]: That's funny. So I built this thing called Smol Podcaster to automate a lot of our back office work, so to speak. And it's great at transcripts. It's great at doing chapters. And then I was like, okay, how about you come up with a short summary? And it's like, it sounds good, but it's not even in the same ballpark as what we'd end up writing, right? And it's hard to see how it's going to get there.

Swyx [00:36:11]: Oh, I have ideas.

Erik [00:36:13]: I'm certain it's going to get there, but like, I agree with you. Right. And like, I have the same thing. I don't know if you've read like AI generated books. Like they just like kind of seem funny, right? Like they're off, right? But like you glance at it and it's like, oh, it's kind of cool. Like it looks correct, but then it's like very weird when you actually read them.

Swyx [00:36:30]: Yeah. Well, so for what it's worth, I think anyone can join the Modal Slack. Is it open to the public? Yeah, totally.

Erik [00:36:35]: If you go to modal.com, there's a button in the footer.

Swyx [00:36:38]: Yeah. And then you can talk to ErikBot. And then sometimes I really like pinging ErikBot, and then you answer afterwards, but then you're like, yeah, mostly correct or whatever. Any other broader lessons, you know, just broadening out from like the single use case of fine tuning, like what are you seeing people do with fine tuning or just language models on Modal in general? Yeah.

Erik [00:36:59]: I mean, I think language models is interesting because so many people get started with APIs and that's just, you know, they're just dominating the space, in particular OpenAI, right? And that's not necessarily like a place where we aim to compete. I mean, maybe at some point, but like, it's just not like a core focus for us. And I think sort of separately, it's sort of a question of what the economics are in that long term. But like, so we tend to focus more on like the areas around it, right? Like fine tuning. Like another use case we have is a bunch of people, Ramp included, doing batch embeddings on Modal. So let's say, you know, actually we're writing a blog post on this, like we take all of Wikipedia and like parallelize embeddings in 15 minutes and produce vectors for each article. So those types of use cases, I think Modal suits really well. I think also a lot of like custom inference, like yeah, I love that.
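
The batch-embedding pattern Erik mentions looks roughly like the sketch below: chunk a corpus, fan the chunks out to GPU workers with a parallel map, and collect vectors back in order. The model choice, batch size, and placeholder vectors are illustrative assumptions, not the code behind the Wikipedia blog post, and SDK names may have shifted since.

```python
# Sketch of the batch-embedding pattern: chunk the corpus, fan the chunks out
# to GPU workers, and collect the vectors back in order.
import modal

stub = modal.Stub("batch-embeddings")

@stub.function(gpu="A100")
def embed_batch(texts: list[str]) -> list[list[float]]:
    # e.g. load a sentence-transformers model here and encode the batch
    return [[float(len(t))] for t in texts]  # placeholder vectors

@stub.local_entrypoint()
def run():
    corpus = [f"article {i} body..." for i in range(100_000)]
    batches = [corpus[i:i + 512] for i in range(0, len(corpus), 512)]
    vectors = [v for batch in embed_batch.map(batches) for v in batch]
    print(len(vectors), "embeddings")
```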

Swyx [00:37:43]: Yeah. I think you should give people an idea of the order of magnitude of parallelism, because I think people don't understand how parallel it is. So like, I think your classic hello world with Modal is like some kind of Fibonacci function, right? Yeah, we have a bunch of different ones. Some recursive function. Yeah.

Erik [00:37:59]: Yeah. I mean, like, yeah, I mean, it's like pretty easy in Modal, like fan out to like, you know, at least like 100 GPUs, like in a few seconds. And you know, if you give it like a couple of minutes, like we can, you know, you can fan out to like thousands of GPUs. Like we run at relatively large scale. And yeah, we've run, you know, many thousands of GPUs at certain points when we needed, you know, big backfills or some customers had very large compute needs.

Swyx [00:38:21]: Yeah. Yeah. And I mean, that's super useful for a number of things. So one of my early interactions with Modal as well was with smol developer, which is my sort of coding agent. The reason I chose Modal was a number of things. One, I just wanted to try it out. I just had an excuse to try it. Akshat offered to onboard me personally. But the most interesting thing was that you could have that sort of local development experience as it was running on my laptop, but then it would seamlessly translate to a cloud service or like a cloud hosted environment. And then it could fan out with concurrency controls. So I could say, like, because like, you know, the number of times I hit the GPT-3 API at the time was going to be subject to the rate limit. But I wanted to fan out without worrying about that kind of stuff. With Modal, I can just kind of declare that in my config and that's it. Oh, like a concurrency limit?

Erik [00:39:07]: Yeah. Yeah.

Swyx [00:39:09]: Yeah. There's a lot of control. And that's why it's like, yeah, this is a pretty good use case for like writing this kind of LLM application code inside of this environment that just understands fan out and rate limiting natively. You don't actually have an exposed queue system, but you have it under the hood, you know, that kind of stuff. Totally.

Erik [00:39:28]: It's a self-provisioning cloud.
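
A hedged sketch of the pattern Swyx describes: declare a cap on parallelism right where the function is defined, so a big fan-out respects an upstream rate limit. The concurrency_limit parameter name follows Modal's docs of the period and may have been renamed; call_llm is a placeholder for the real API call.

```python
# Sketch of declaring a parallelism cap next to the function, so a large
# fan-out respects an upstream rate limit.
import modal

stub = modal.Stub("rate-limited-fanout")

@stub.function(concurrency_limit=5)  # at most 5 containers hit the API at once
def call_llm(prompt: str) -> str:
    # ...call the rate-limited completion API here...
    return f"response to: {prompt}"

@stub.local_entrypoint()
def run():
    prompts = [f"task {i}" for i in range(200)]
    for out in call_llm.map(prompts):  # fan out; the cap does the throttling
        print(out)
```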

Swyx [00:39:30]: So the last part of modal I wanted to touch on, and obviously feel free, I know you're working on new features, was the sandbox that was introduced last year. And this is something that I think was inspired by Code Interpreter. You can tell me the longer history behind that.

Erik [00:39:45]: Yeah. Like we originally built it for the use case where there were a bunch of customers who looked into code generation applications and then they came to us and asked us, is there a safe way to execute code? And yeah, we spent a lot of time on like container security. We used gVisor, for instance, which is a Google project that provides pretty strong isolation of code. So we built a product where you can basically like run arbitrary code inside a container and monitor its output or like get it back in a safe way. I mean, over time it's like evolved into more of like, I think the long-term direction is actually I think more interesting, which is that I think Modal as a platform, where like I think the core container infrastructure we offer could actually be, you know, unbundled from the client SDK and offered to, you know, like we're talking to a couple of other companies that want to, you know, through their packages, like run, execute jobs on Modal, like kind of programmatically. So that's actually the direction Sandbox is going. It's like turning into more like a platform for platforms, is kind of how I've been thinking about it.

Swyx [00:40:45]: Oh boy. Platform. That's the old Kubernetes line.

Erik [00:40:48]: Yeah. Yeah. Yeah. But it's like, you know, like having that ability to like programmatically, you know, create containers and execute them, I think, I think is really cool. And I think it opens up a lot of interesting capabilities that are sort of separate from the like core Python SDK in Modal. So I'm really excited about it. It's like one of those features that we kind of released and like, you know, then we kind of look at like what users actually build with it and people are starting to build like kind of crazy things. And then, you know, we double down on some of those things when we see like, you know, potential new product features. And so Sandbox, I think in that sense, it's like kind of in that direction. We found a lot of like interesting use cases in the direction of like a platformized container runner.
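
For a loose sense of what the Sandbox primitive looks like from the SDK side, here's a sketch of running untrusted, LLM-generated code and capturing its output. The method names (spawn_sandbox, wait, stdout, returncode) follow Modal's docs from around this period and may have changed since; treat it as an approximation rather than a reference.

```python
# Loose sketch of running untrusted, LLM-generated code in a Sandbox and
# capturing its output; method names are approximate and version-dependent.
import modal

stub = modal.Stub("code-runner")

@stub.local_entrypoint()
def run_untrusted():
    generated = "print(sum(range(10)))"  # pretend an LLM wrote this
    sb = stub.spawn_sandbox(
        "python", "-c", generated,
        timeout=60,  # kill runaway code
    )
    sb.wait()
    print("exit code:", sb.returncode)
    print("output:", sb.stdout.read())
```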

Swyx [00:41:27]: Can you be more specific about what you're double down on after seeing users in action?

Erik [00:41:32]: I mean, we're working with like some companies that, I mean, without getting into specifics, like, need the ability to take their users' code and then launch containers on Modal. And it's not about security necessarily, like they just want to use Modal as a back end, right? Like they may already provide like Kubernetes as a back end, Lambda as a back end, and now they want to add Modal as a back end, right? And so, you know, they need a way to programmatically define jobs on behalf of their users and execute them. And so, I don't know, that's kind of abstract, but does that make sense? I totally get it.

Swyx [00:42:03]: It's sort of one level of recursion to sort of be the Modal for their customers.

Erik [00:42:09]: Exactly.

Swyx [00:42:10]: Yeah, exactly. And Cloudflare has done this, you know, Kenton Varda from Cloudflare, who's like the tech lead on this thing, called it sort of functions as a service as a service.

Erik [00:42:17]: Yeah, that's exactly right. FaaSaaS.

Swyx [00:42:21]: FaaSaaS. Yeah, like, I mean, like that, I think, is something any base layer, second layer cloud provider like yourself, compute provider like yourself, should provide. You know, it's a mark of maturity and success that people just trust you to do that. They'd rather build on top of you than compete with you. The more interesting thing for me is like, what does it mean to serve a computer, like an LLM developer, rather than a human developer, right? Like, that's what a sandbox is to me, that you have to redefine Modal to serve a different non-human audience.

Erik [00:42:51]: Yeah. Yeah, and I think there's some really interesting people, you know, building very cool things.

Swyx [00:42:55]: Yeah. So I don't have an answer, but, you know, I imagine things like, hey, the way you give feedback is different. Maybe you have to like stream errors, log errors differently. I don't really know. Yeah. Obviously, there's like safety considerations. Maybe you have an API to like restrict access to the web. Yeah. I don't think anyone would use it, but it's there if you want it.

Erik [00:43:17]: Yeah.

Swyx [00:43:18]: Yeah. Any other sort of design considerations? I have no idea.

Erik [00:43:21]: With sandboxes?

Swyx [00:43:22]: Yeah. Yeah.

Erik [00:43:24]: Open-ended question here. Yeah. I mean, no, I think, yeah, the network restrictions, I think, make a lot of sense. Yeah. I mean, I think, you know, long-term, like, I think there's a lot of interesting use cases where like the LLM, in itself, can like decide, I want to install these packages and like run this thing. And like, obviously, for a lot of those use cases, like you want to have some sort of control that it doesn't like install malicious stuff and steal your secrets and things like that. But I think that's what's exciting about the sandbox primitive, is like it lets you do that in a relatively safe way.

Alessio [00:43:51]: Do you have any thoughts on the inference wars? A lot of providers are just rushing to the bottom to get the lowest price per million tokens. Some of them, you know, are just losing money, and the physics of it just don't work out for them to make any money on it. How do you think about your pricing and like how much premium you can get and kind of command, versus using lower prices as kind of like a wedge into getting there, especially once you have Modal integrated? What are the tradeoffs and any thoughts on strategies that work?

Erik [00:44:23]: I mean, we focus more on like custom models and custom code. And I think in that space, there's like less competition and I think we can have a pricing markup, right? Like, you know, people will always compare our prices to like, you know, the GPU power they can get elsewhere. And so how big can that markup be? Like it can never be, you know, we can never charge like 10x more, but we can certainly charge a premium. And like, you know, for that reason, like we can have pretty good margins. The LLM space is like the opposite: the switching cost of LLMs is zero, if all you're doing is like straight up, like at least like open source, right? Like if all you're doing is like, you know, using some, you know, inference endpoint that serves an open source model and, you know, some other provider comes along and like offers a lower price, you're just going to switch, right? So I don't know, to me that reminds me a lot of like all these like 15-minute delivery wars or like, you know, like Uber versus Lyft, you know. And like maybe going back even further, like I think a lot about like sort of, you know, the flip side of this, which is actually a positive side: I thought a lot about like the fiber optics boom of like 98, 99 the other day, you know, and also like the overinvestment in GPUs today. Like, yeah, like, you know, I don't know, like in the end, like, I don't think VCs will have the return they expected, like, you know, in these things, but guess who's going to benefit, like, you know, it's the consumers, like someone's reaping the value of this. And that's, I think, an amazing flip side, is that, you know, we should be very grateful for the fact that VCs want to subsidize these things. Which is, you know, like you go back to fiber optics, like there was an extreme, like, overinvestment in fiber optics networks in like 98. And no one who did that made money. But consumers, you know, got tremendous benefits from all the fiber optics cables that were laid, you know, throughout the country in the decades after. I feel something similar about like GPUs today. And also like specifically looking more narrowly at like the LLM inference market, like that's great. Like, you know, I'm very happy that, you know, there's a price war. Modal is like not necessarily like participating in that price war, right? Like, I think, you know, it's going to shake out and then someone's going to win and then they're going to raise prices or whatever. Like, we'll see how that works out. But for that reason, like we're not like hyper focused on like serving, you know, just like straight up, like here's an endpoint to an open source model. We think the value in Modal comes from all these, you know, the other use cases, the more custom stuff, like fine tuning and complex, you know, guided output, like type stuff. Or like also like in other, like outside of LLMs, like with more focus, a lot more like image, audio, video stuff, because that's where there's a lot more proprietary models. There's a lot more like custom workflows. And that's where I think, you know, Modal is more, you know, there's a lot of value in software differentiation. I think focusing on developer experience and developer productivity, that's where I think, you know, you can have more of a competitive moat.

Alessio [00:46:58]: I'm curious what the difference is going to be now that it's an enterprise. So like with DoorDash, Uber, they're going to charge you more. And like as a customer, like you can decide to not take Uber. But if you're a company building AI features in your product using the subsidized prices, and then, you know, the VC money dries up in a year and like prices go up, it's like, you can't really take the features back without a lot of backlash. But you also cannot really kill your margins by paying the new price. So I don't know what that's going to look like

Erik [00:47:28]: But like margins are going to go up for sure. But I don't know if prices will go up because like GPU prices have to drop eventually, right? So like, you know, like in the long run, I still think like prices may not go up that much. But certainly margins will go up. Like I think you said, Swyx, that margins are negative right now. Like, you know, for some people, obviously, that's not sustainable. So certainly margins will have to go up. Like some companies are going to have to make money in this space. Otherwise, like they're not going to provide the service. But that's equilibrium too, right? Like at some point, like, you know, it sort of stabilizes and one or two or three providers make money.

Alessio [00:48:02]: Yeah. What else is maybe underrated about Modal, something that people don't talk enough about, or yeah, that we didn't cover in the discussion?

Erik [00:48:11]: Yeah, I think what are some other things? We talked about a lot of stuff. Like we have the bursty parallelism. I think that's pretty cool. Working on a lot of like, trying to figure out, like, kind of thinking more about the roadmap. But like one of the things I'm very excited about is building primitives for like, more like IO intensive workloads. And so like, we're building some like crude stuff right now where like, you can like create like direct TCP tunnels to containers and that lets you like pipe data. And like, you know, we haven't really explored this as much as we should, but like, there's a lot of interesting applications. Like you can actually do like kind of real time video stuff in Modal now, because you can like create a tunnel to, exactly, you can create a raw TCP socket to a container, feed it video and then like, you know, get the video back. And I think like, it's still like a little bit like, you know, not fully ergonomically figured out. But I think there's a lot of like, super cool stuff, like when we start enabling those more like high IO workloads, that I'm super excited about. I think also like, you know, working with large data sets, or kind of taking the ability to map and fan out and like building more like higher level, like functional primitives, like filters and group bys and joins. Like I think there's a lot of like, really cool stuff you can do. But this is like maybe like, you know, years out.
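
To make the "higher level functional primitives" idea a little more concrete: once you have a reliable parallel map, filters and group-bys can be layered on top of it. The toy sketch below uses a local thread pool as a stand-in for a distributed map; it is not a Modal API, just an illustration of how such primitives compose.

```python
# Toy illustration: with a reliable parallel map, filter and group-by can be
# built on top of it. A local thread pool stands in for distributed fan-out.
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")
K = TypeVar("K")

def parallel_map(fn: Callable[[T], K], items: Iterable[T]) -> list[K]:
    with ThreadPoolExecutor(max_workers=8) as pool:  # containers, in reality
        return list(pool.map(fn, items))

def parallel_filter(pred: Callable[[T], bool], items: list[T]) -> list[T]:
    keep = parallel_map(pred, items)  # evaluate the predicate in parallel
    return [x for x, k in zip(items, keep) if k]

def group_by(key_fn: Callable[[T], K], items: list[T]) -> dict[K, list[T]]:
    groups: dict[K, list[T]] = defaultdict(list)
    for key, item in zip(parallel_map(key_fn, items), items):
        groups[key].append(item)
    return dict(groups)

print(group_by(lambda n: n % 3, parallel_filter(lambda n: n > 2, list(range(10)))))
```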

Swyx [00:49:18]: Yeah, we can just broaden out from Modal a little bit, but you still have a lot of, you have a lot of great tweets. So it's very easy to just kind of go through them. Why is Oracle underrated? I love Oracle's GPUs. I don't know why, you know,

Erik [00:49:34]: what the economics looks like for Oracle, but I think they're great value for money. Like we run a bunch of stuff in Oracle and they have bare metal machines, like two terabytes of RAM. They're like super fast SSDs. You know, I mean, we love AWS and AGCP too. We have great relationships with them. But I think Oracle is surprising. Like, you know, if you told me like three years ago that I would be using Oracle Cloud, like I'd be like, what, wait, why? But now, you know,

Swyx [00:49:55]: I'm a happy customer. And it's a combination of pricing and the kinds of SKUs I guess they offer.

Erik [00:50:01]: Yeah. Great, great machines, good prices, you know. That's it. Yeah. Yeah. That's all I care about. Yeah. The sales team is pretty fun too. Like I like them.

Swyx [00:50:09]: In Europe, people often talk about Hetzner. Yeah. Like we've focused on the main clouds, right?

Erik [00:50:14]: Like we've, you know, Oracle, AWS, GCP, we'll probably add Azure at some point. I think, I mean, there's definitely a long tail of like, you know, CoreWeave, Hetzner, like Lambda, like all these things. And like over time, I think we'll look at those too. Like, you know, wherever we can get the right GPUs at the right price. Yeah. I mean, I think it's fascinating. Like it's a tough business. Like I wouldn't want to try to build like a cloud provider. You know, it's just, you just have to be like incredibly focused on like, you know, efficiency and margins and things like that. But I mean, I'm glad people are trying.

Swyx [00:50:45]: Yeah. And you can ramp up on any of these clouds very quickly, right? Because it's your standard stack.

Erik [00:50:50]: Yeah. I mean, yeah. Like I think so. Like, you know, what Modal does is like programmatic, you know, launching and termination of machines. So what's nice about the big clouds is, you know, they have relatively mature APIs for doing that, as well as like, you know, support for Terraform for all the networking and all that stuff. So that makes it easier to work with the big clouds. But yeah, I mean, some of those things, like I think, you know, I also expect the smaller clouds to like embrace those things in the long run. But I also think, you know, we can probably integrate with some of the clouds even without that. There's always an HTTP API that you can use, just like script something that launches instances through the web.

Swyx [00:51:24]: Yeah. I think a lot of people are always curious about whether or not you will buy your own hardware someday. I think you're pretty firm in that it's not your interest, but like your story and your growth does remind me a little bit of Cloudflare, which obviously, you know, invests a lot in its own physical network.

Erik [00:51:42]: Yeah. I don't remember like early days, like, did they have their own hardware or?

Swyx [00:51:47]: They push out a lot with like agreements through other, you know, providers.

Erik [00:51:52]: Yeah. Okay. Interesting.

Swyx [00:51:53]: But now it's all their own hardware. So I understand.

Erik [00:51:57]: Yeah. I mean, my feeling is that when you're a venture funded startup, like buying physical hardware is maybe not the best use of the money.

Swyx [00:52:06]: I really wanted to put you in a room with Eiso Kant from Poolside. Yeah. Because he has the complete opposite view. Yeah.

Erik [00:52:12]: It is great. I mean, I just think from like a capital efficiency point of view, like, do you really want to tie up that much money in, like, you know, physical hardware and think about depreciation and like... As much as possible, like I, you know, I favor a more capital efficient way. Like we don't want to own the hardware because, ideally, we want the sort of margin structure to be like 100% correlated revenue and COGS, in the sense that like, you know, when someone comes and pays us, you know, $1 for compute, like, you know, we immediately incur a cost of like, whatever, 70 cents, 80 cents, you know, and there's like complete correlation between cost and revenue, because then you can leverage up in like a kind of a nice way, you can scale very efficiently. You know, like, that's not, you know... turns out that's hard to do. Like, you can't just only use like spot and on-demand instances. Like over time, we've actually started adding a pretty significant amount of reservations too. So I don't know, like reservations are always like one step towards owning your own hardware. Like, I don't know, like, do we really want to be, you know, thinking about switches and cooling and HVAC and like power supplies? Disaster recovery. Yeah. Like, is that the thing I want to think about? Like, I don't know. Like I like to make developers happy, but who knows, like maybe one day, like, but I don't think it's going to happen anytime soon.

Swyx [00:53:23]: Yeah. Obviously, for what it's worth, obviously, I'm a believer in cloud, but it's interesting to have the devil's advocate on the other side. The main thing you have to do is be confident that you can manage your depreciation better than the typical assumption, which is two to three years. Yeah. Yeah. And so the moment you have a CTO that tells you, no, I think I can make these things last seven years, then it changes the math.

Erik [00:53:46]: Yeah. Yeah. But you know, are you deluding yourself then? That's the question, right? It's like the Waste Management scandal. Do you know about that? Like they had this like accounting scandal back in the 90s, like this garbage company, where they like started assuming their garbage trucks had a 10-year depreciation schedule, booked like a massive profit, you know, the stock went up, you know, and then it turns out actually all those garbage trucks broke down and like, you can't really depreciate them over 10 years. And so then the whole company, you know, they had to restate all the earnings.

Alessio [00:54:18]: Let's go into some personal nuggets. You received the IOI gold medal, which is the International Olympiad in Informatics.

Erik [00:54:29]: 20 years ago.

Alessio [00:54:30]: Yeah. How are these models going to change competitive programming? Like, do you think people will still love the craft? I feel like over time, programming has kind of lost maybe a little bit of its luster in the eyes of a lot of people. Yeah. I'm curious to see what you think.

Erik [00:54:51]: I mean, maybe, but like, I don't know, like, you know, I've been coding for almost 30, or more than 30, years. And like, I feel like, you know, you look at like programming and, you know, where it is today versus where it was, you know, 30, 40, 50 years ago, there's like probably a thousand times more developers today than back then. And every year there's more and more developers. And at the same time, developer productivity keeps going up. And when I look at the real world, I just think there's so much software that's still waiting to be built. Like, I think we can, you know, 10X the amount of developers and still, you know, have a lot of people making a lot of money, you know, building amazing software, while at the same time being more productive. Like I never understood this, like, you know, AI is going to, you know, replace engineers. That's very rarely how this actually works. When AI makes engineers more productive, like the demand actually goes up, because the cost of engineering goes down, because you can build software more cheaply. And that's, I think, the story of software in the world over the last few decades. So, I mean, I don't know how this relates to like competitive programming. Kind of going back to your question, competitive programming to me was always kind of a weird kind of, you know, niche, like kind of, I don't know. I love it. It's like puzzle solving. And like my experience is like, you know, half of competitive programmers are able to translate that to actually like building cool stuff in the world. Half just like get really, you know, sucked into this like puzzle stuff and, you know, it never loses its grip on them. But like for me, it was an amazing way to get started with coding, or get very deep into coding, and, you know, kind of battle it out with like other smart kids and travel to different countries when I was a teenager.

Swyx [00:56:29]: I was just going to mention, like, it's not just that he personally is a competitive programmer. Like, I think a lot of people at Modal are competitive programmers. I think you met Akshat through that... Akshat, the co-founder, is also a gold medalist.

Erik [00:56:42]: By the way, gold medal doesn't mean you win, although we actually had an intern that won IOI. Gold medal is like the top 20, 30 people roughly.

Swyx [00:56:47]: Yeah. Obviously, it's very hard to get hired at Modal. But what is it like to work with like such a talent density? Like, you know, how is that contributing to the culture at Modal? Yeah. I mean, I think humans are the root cause of like everything at a company, like, you know, bad code is because it's bad human or like whatever, you know, bad culture.

Erik [00:57:03]: So like, I think, you know, like talent density is very important and like keeping the bar high and like hiring smart people. And, you know, it's not always like the case that like hiring competitive programmers is the right strategy, right? If you're building something very different, like you may not, you know, but we actually end up having a lot of like hard, you know, complex challenges. Like, you know, I talked about like the cloud, you know, the resource allocation, like turns out like that actually, like you can phrase that as a mixed integer programming problem. Like we now have that running in production, like constantly optimizing how we allocate cloud resources. There's a lot of like interesting, like complex, like scheduling problems. And like, how do you do all the bin packing of all the containers? Like, so, you know, I think for what we're building, you know, it makes a lot of sense to hire these people who like, like those very hard problems.
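
As a toy illustration of the kind of problem Erik is referring to: packing container resource requests onto as few machines as possible can be written as a small mixed integer program. The sketch below uses the PuLP library with made-up sizes; it only shows the formulation and is not Modal's actual scheduler.

```python
# Toy bin-packing MIP, only to show why container scheduling can be phrased
# as a mixed integer program. Needs `pip install pulp`. Sizes are made up.
import pulp

sizes = [4, 8, 1, 4, 2, 1]    # container memory requests (GB)
capacity = 8                  # per-machine capacity (GB)
machines = range(len(sizes))  # worst case: one machine per container

prob = pulp.LpProblem("bin_packing", pulp.LpMinimize)
y = pulp.LpVariable.dicts("machine_used", machines, cat="Binary")
x = pulp.LpVariable.dicts(
    "placed", [(i, j) for i in range(len(sizes)) for j in machines], cat="Binary"
)

prob += pulp.lpSum(y[j] for j in machines)                # minimize machines used
for i in range(len(sizes)):
    prob += pulp.lpSum(x[(i, j)] for j in machines) == 1  # place each container once
for j in machines:
    prob += (
        pulp.lpSum(sizes[i] * x[(i, j)] for i in range(len(sizes)))
        <= capacity * y[j]                                # respect machine capacity
    )

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print("machines needed:", int(pulp.value(prob.objective)))
```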

Swyx [00:57:52]: Yeah. And they don't necessarily have to know the details of the stack. They just need to be very good at algorithms.

Erik [00:57:56]: No, but my feeling is like people who are like pretty good at competitive programming, they can also pick up like other stuff like elsewhere. Not always the case, but you know, there's definitely a high correlation.

Swyx [00:58:08]: Oh yeah. I'm just, I'm interested in that just because, you know, like there are competitive mental talents in other areas, like competitive speed memorization or whatever. And like, you don't really see those transfer. And I always assumed in my narrow perception that competitive programming is so specialized, it's so obscure, even like so divorced from real world scenarios that it doesn't actually transfer that much. But obviously I think for the problems that you work on, it does.

Erik [00:58:34]: But it's also like, you know, frankly, it's like translates to some extent, not because like the problems are the same, but just because like it sort of filters for the, you know, people who are like willing to go very deep and work hard on things. Right. Like, I feel like a similar thing is like a lot of good developers are like talented musicians. Like, why? Like, why is this a correlation? And like, my theory is like, you know, it's the same sort of skill. Like you have to like just hyper focus on something and practice a lot. Like, and there's something similar that I think creates like good developers.

Alessio [00:59:02]: Yeah. Sweden also had a lot of very good Counter-Strike players. I don't know, why does Sweden have fiber optics before all of Europe? I feel like, I grew up in Italy and our internet was terrible. And then I feel like all the Nordics and like amazing internet, I remember getting online and people in the Nordics are like five ping, 10 ping.

Erik [00:59:23]: Yeah. We had very good network back then. Yeah. Do you know why? I mean, I'm sure like, you know, I think the government, you know, did certain things quite well. Right. Like in the nineties, like there was like a bunch of tax rebates for like buying computers. And I think there was similar like investments in infrastructure. I mean, like, and I think like I was thinking about, you know, it's like, I still can't use my phone in the subway in New York. And that was something I could use in Sweden in 95. You know, we're talking like 40 years almost. Right. Like, like why? And I don't know, like I think certain infrastructure, you know, Sweden was just better at, I don't know.

Alessio [00:59:53]: And also you never owned a TV or a car?

Erik [00:59:59]: Never owned a TV or a car. I never had a driver's license.

Alessio [01:00:01]: How do you do that in Sweden though? Like that's cold.

Erik [01:00:03]: I grew up in a city. I mean, like I took the subway everywhere, or biked or whatever. Yeah. I always lived in cities, so I don't, you know, I never felt, I mean, like me and my wife have a car, but like. That doesn't count. I mean, it's in her name because I don't have a driver's license. She drives me everywhere. It's nice.

Swyx [01:00:21]: Nice. That's fantastic. I was going to ask you, like the last thing I had on this list was your advice to people thinking about running some sort of run-code-in-the-cloud startup, which is: only do it if you're genuinely excited about spending five years thinking about load balancing, page faults, cloud security and DNS. So basically it sounds like you're summing up a lot of the pain of running Modal.

Erik [01:00:43]: Yeah. Yeah. Like one thing I struggle with, like I talked to a lot of people starting companies in the data space or like AI space or whatever. And they sort of come at it at like, you know, from like an application developer point of view. And they're like, I'm going to make this better. But like, guess how you have to make it better. It's like, you have to go very deep on the infrastructure layer. And so one of my frustrations has been like so many startups are like, in my opinion, like Kubernetes wrappers and not very like thick wrappers, like fairly thin wrappers. And I think, you know, every startup is a wrapper to some extent, but like you need to be like a fat wrapper. You need to like go deep and like build some stuff. And that's like, you know, if you build a tech company, you're going to want to, you're going to have to spend, you know, five, 10, 20 years of your life, like going very deep and like, you know, building the infrastructure you need in order to like make your product truly stand out and be competitive. And so, you know, I think that goes for everything. I mean, like you're starting a whatever, you know, online retailer of, I don't know, bathroom sinks. You have to be willing to spend 10 years of your life thinking about, you know, whatever, bathroom sinks, like otherwise it's going to be hard.

Swyx [01:01:37]: Yeah. I think that's good advice for everyone. And yeah, congrats on all your success. It's pretty exciting to watch it. It's just the beginning. Yeah. Yeah. Yeah.

Erik [01:01:45]: It's exciting. And everyone should sign up and try out Modal, modal.com. Yeah. Now it's GA. Yay. Yeah.

Swyx [01:01:50]: Used to be behind a wait list. Yeah. Awesome, Erik. Thank you so much for coming on. Yeah, it's amazing. Thank you so much. Thanks.

Swyx [01:02:11]: Bye.



Get full access to Latent Space at www.latent.space/subscribe
2024-02-16
Link to episode

Cloud Intelligence at the speed of 5000 tok/s - with Ce Zhang and Vipul Ved Prakash of Together AI

Our first ever demo day aimed for 15-20 people and ended up ballooning to >200 and getting covered in the news. We are now running the 2024 edition in SF on Feb 23: Latent Space Final Frontiers, a startup and research competition in "The Autonomous Workforce", "Beyond Transformers & GPUs", and "Embodied AI".

RSVP here! You can find all LS online/IRL events on our new calendar. Super Early Bird tickets have just gone on sale for AI Engineer World's Fair, June 25-27!

Today we have the honor of hosting two of Together AI's co-founders: Ce Zhang (CTO) and Vipul Ved Prakash (CEO). This is a rare opportunity to recap the history of the company since our last check-in with Tri Dao (Chief Scientist), cover some of their big releases, and do a deep dive into the state of the AI inference market.

Together has emerged as one of the most consequential new startups in the new AI summer, last announcing a ~$100m Series A raise in November (at a ~$360-565m valuation).

Note from future: about a week after this pod was published, rumors were confirmed that Salesforce had led another $100m Series B at a $1b valuation.

But there are at least three Togethers - Together the Research Lab, Together the Fine Tuning & Inference platform, and Together the custom models service. As we clarify on the pod, the overarching philosophy of Together is the ability to improve on all these fronts simultaneously by being "full stack", from the lowest level kernel and systems programming to the highest level mathematical abstractions driving new model architectures and inference algorithms.

Bringing Research and Industry Together

In just one year, Together has been behind some of the most exciting research in AI:

* RedPajama, a fully open source dataset for model pre-training which mirrored the Llama 1 recipe. It was then followed by RedPajama V2, a 30T-token dataset of filtered and de-duplicated tokens.

* RedPajama-INCITE-3B and 7B, which were SOTA in a few benchmarks at the time of release.

* FlashAttention-2, developed by Together's Chief Scientist Tri Dao. We covered FA-2 in a previous episode with him.

* Mamba-3B, the most promising transformer-alternative model that they released in collaboration with Cartesia.

* StripedHyena, a SOTA graft of Hyena state space models and transformer models together

* Medusa, an alternative to speculative decoding that lets you use multiple decoding heads instead of a draft model.

* MonarchMixer, which was one of the most popular orals at NeurIPS 2023. It's an approach to transformers that replaces many of their core parts with Monarch matrices for better computational efficiency.

And I'm sure we missed something! As Vipul reveals, almost 50% of Together's staff are researchers, and two of their co-founders (Chris Ré and Percy Liang) are professors at Stanford, so we can expect a lot more here.

Bringing "Disaggregated" GPUs Together

On their cloud, they offer inference as a service, fine-tuning, pre-training, etc, but unlike other providers they think of themselves as a disaggregated cloud. Today, they have ~8,000 A100 and H100 GPUs on their platform (an exclusive revealed on the pod!) totaling over 20 exaflops of compute, but instead of just buying more and putting them in a cluster and then exposing a `us-east-1` option for customers, they are taking heterogeneous compute sources and adding a unified layer on top of them for developers to consume. Building on Ce's research, Together's GPU Clusters are taking on comparable AWS and GCP offerings in both cost and speed.

Take the Hessian AI center in Germany or the DoE's INCITE; they have GPUs that they want to share with researchers, but they lack the cloud layer over them. Similarly, there's starting to be more and more differentiation amongst types of GPUs: H100s, A100s, MI300s, etc. Each of them has different availability and performance based on task, and the end user shouldn't have to be a hardware expert to run inference on a model, so Together abstracts a lot of that away.

A big theme of the Together inference stack, a "bag of 50 tricks" that we discuss on the pod, is also "hardware-aware" algorithms like FlashAttention and Mamba, which further emphasize the benefits of co-developing everything together.

Special Focus: Transformer Alternatives

As we mentioned above, they are also funding a lot of research in Transformer alternatives. To reiterate a few points on why they matter:

* Longer context is not the motivation for sub-quadratic architectures: Transformers don't inherently have hard limitations on context size, but they just get extremely expensive. When developing sub-quadratic alternatives, you easily enable very long context, but that's not how you should compare them. Even at the same context size, inference and training are much cheaper on sub-quadratic architectures like Hyena.

* Emergence of hybrid architectures: a lot of early conversations have been around the "post-Transformers" era, but it might be more like "half-Transformers". Hybrid architectures could have split layers with some transformer-based and some state-space ones. One of the challenges is that a lot of hardware kernels are optimized for transformer operations, so you'd lose a lot by moving away completely.

* Higher speed = higher GPU throughput: if we could reach the same benchmark performance on sub-quadratic architectures, it'd solve a lot of the GPU crunch. Today we peak at ~170 tok/s on inference in some open models; if we could reach 5,000 tok/s on the same card, you'd be able to serve 30x more customers on the same hardware. As a cloud provider, you're obviously incentivized to get there.
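
To make the arithmetic in the last bullet concrete, here is a quick back-of-envelope sketch; the 170 and 5,000 tok/s figures are the ones quoted above, and the per-token cost functions are illustrative constants rather than measurements.

```python
# Back-of-envelope: how many more users a card could serve at higher token
# throughput, and why attention cost grows quadratically with context length.
current_tok_s = 170     # rough peak quoted above for some open models today
target_tok_s = 5_000    # the target discussed on the pod
print(f"serving multiplier: ~{target_tok_s / current_tok_s:.0f}x")   # ~29x, i.e. roughly 30x

def attention_cost(seq_len: int) -> int:
    return seq_len ** 2          # self-attention: every token attends to every other token

def subquadratic_cost(seq_len: int) -> int:
    return seq_len               # idealized linear-in-length alternative (e.g. SSM-style)

for n in (1_000, 8_000, 64_000):
    ratio = attention_cost(n) // subquadratic_cost(n)
    print(f"context {n:>6}: quadratic vs linear cost ratio = {ratio:,}x")
```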

We had a lot of fun chatting with the Together guys and we covered a lot of ground, so enjoy the conversation!

Note: This is the first episode of a "cloud providers mini-series". We have Erik from Modal and Ben from Replicate coming up next!

Video Podcast

Join us in watching the video version of this pod on our snazzy YouTube!

Show Notes

* Together AI

* RedPajama Dataset v1 Announcement

* RedPajama Models v1 Announcement

* Together Embeddings

* StripedHyena-7B

* Mamba-3B-SlimPJ

* Vipul's X thread on Anyscale

* Vipul's Razor

* SemiAnalysis' "Inference Race to the Bottom" post

* Chris Ré

* Mike Conover's episode

* Slim Pajama by Cerebras

* Dolma by AI2

* Jina AI

* Tengyu's Voyage AI

Timestamps

* [00:00:00] Introductions

* [00:00:43] Origin and current state of Together.ai

* [00:02:15] Transition from Apple to Together and the vision for open AI

* [00:04:54] How Chris Ré introduced Ce and Vipul

* [00:08:43] How RedPajama came to be

* [00:13:34] Model training and Transformer alternatives

* [00:15:37] DSIR and the importance of data in LLMs

* [00:21:19] Inference vs Fine-tuning vs Pre-training usage on Together

* [00:23:20] Together's GPU stash

* [00:27:02] Why standardization of inference metrics is important

* [00:29:26] Building moats in AI inference

* [00:31:49] Federated vs disaggregated cloud computing

* [00:34:57] Opportunities for improvement in the inference stack

* [00:36:13] Anyscale benchmarking drama

* [00:41:27] Not just an inference platform

* [00:43:50] Together Embeddings and the future of embedding models

* [00:45:53] State space models and hybrid architectures

* [00:53:52] The need for 5,000 tokens/s speed in AI inference

* [01:00:23] What's the most interesting unsolved question in AI?

Transcript

Alessio [00:00:00]: Hey, everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai.

Swyx [00:00:14]: Hey, and today we're together with Together. Welcome to the studio, guys.

Ce / Vipul [00:00:20]: Thank you.

Swyx [00:00:21]: I don't know how you typically give self intros, but does anyone want to go first? How do we get our audience acquainted, especially to who's speaking, because it's unusual for us to do a four-person pod. Yeah.

Ce [00:00:33]: Hi, everyone. I'm Ce. I'm one of the co-founders of Together and the CTO, working with the team on technical things.

Vipul [00:00:40]: I'm Vipul Ved Prakash, co-founder and CEO of Together.

Swyx [00:00:43]: I always consider you guys as one of the sort of all-in-one companies. I always want to say labs, but I feel like you're not a lab. What is the sort of origin of Together, and then what is it today? I feel like it used to be Together.xyz, and then now you're Together.ai.

Vipul [00:01:00]: I think fundamentally, Together is about open and independent AI systems. We think this is one of the most consequential technologies of our time, and when we started the company in June 2022, our focus was to build a platform for open source, independent, user-owned AI systems. One way to think about it is big labs, frontier model labs, have built their own platforms for developer platforms for their models. We think of Together as a platform for everything else, whether these are open models, whether these are models being built by companies that are owned by them. Our sort of XYZ roots, we have a fairly deep decentralization and open ethos that kind of reflects in all our platform and strategy and business. And we also, the way we structure our cloud is by combining data centers around the world instead of, you know, we are today not located in hyperscalers, we have built a footprint of AI supercomputers in this sort of very disaggregated, decentralized manner.

Alessio [00:02:15]: I know before Together, you were at Apple, so you go from like the most walled garden, private, we don't say anything company, to we want everything to be open and everybody to know somebody. What maybe did you learn from like the Apple way of being super close and polished and maybe what are you taking now to Together to make it open, but also a very nice developer experience?

Vipul [00:02:37]: Yeah, I would say, you know, one sort of my, you know, background has been in open source for a long time. One of the first things I created was a collaborative spam filter, you know, this was back in the day. It's called Vipul's Razor. And it became quite popular. And the first company I founded called Cloudmark was built around, you know, taking open source and building both an open side of it and a commercial product around it. I think Apple is sort of very focused on providing this amazing experience to its customers with, you know, most of the technology sort of hidden behind the product. And certainly the focus on fluidity and applying complex technology to make everyday things simple is something that Apple does really well. And, you know, that's been a sort of big part of how we think about our developer platforms. I think it informs it. The other thing is that during my years at Apple, we, you know, worked a lot on deep learning. And one of the things that was sort of very viscerally accessible to me was how well these systems worked. We, you know, we built an open domain Q&A system. This was based on Facebook's LSTM paper in 2016. And it was remarkable because we had a parallel system based on sort of information retrieval techniques, which was extremely complicated and didn't work that well. And you know, this thing we wrote in a week just had incredible performance. So I think some of those experiences, at least for me personally, sort of were creating this roadmap of how important and powerful this technology is. And you know, when the scaling laws paper was published, it was very clear to me, like it was in some ways something very profound. We've never had algorithms that improve in capabilities with scale out. So this is almost a new era of computing. So that's been, I think, the influence of Apple, my years at Apple, really for me, like crystallized the value of what we are doing together.

Alessio [00:04:54]: And how did you decide to join forces? Because you did a postdoc with Chris Ré at Stanford. You know, we already had Tri Dao from Together and we talked about Hazy. What was like the meeting of the mind of, hey, I come from like the more technical postdoc assistant professor background and we've got yet a more product thing. What got you excited to like build this now?

Ce [00:05:15]: So we have been working on this together, Chris, in the essentially last like 10 years, right? So a machine learning system 10 years ago was like probabilistic graphical models, right? And then convolutional neural networks and then all the foundation models that we see today. But if you look at this, I think that fundamentally the thing we are actually optimizing is actually not that different. It's always about data movement across essentially all the stacks, right? So when you do distributed like computing, it's about communication across different machines. When you do, for example, flash attention, it's about data movement at a different essentially memory hierarchy, right? So we have been doing this in the last 10 years and seeing the field start grow, grow, grow. So we kind of feel the current kind of this like wave of technology is actually the perfect time to actually bring all the research essentially into something real. And we are super lucky that we got introduced to Vipul, right? And then we hope to join forces and bring this to the real world.

Swyx [00:06:10]: It's an unusual team of like sort of research and industry. Like you've been like a third or fourth time founder now. Third time founder, yeah. And so like what is your first order of business when you like set up together? Like how do you sort of put something like this together? Oh my God, I'm going to use this word so much.

Vipul [00:06:27]: I feel AI companies are really kind of driven by research. And Chris and I had been talking about how to reduce the cost of building models. We felt that there aren't really big data moats around foundation models. They are built from a subset of the web. What is difficult is the cost of capital to build these. And one of the ways in which you can reduce this cost is by making more efficient systems. With that, it was really about finding the right set of co-founders and team. In fact, when Chris introduced me to Ce, and I think within the first five minutes of talking to Ce, I was like, we are starting this company. And our early focus was thinking about this more sort of disparate set of resources, you know, GPUs around the internet. Can we use those to build? And we really have to compress communication for, you know, when we do gradient averaging, there's just a lot of traffic. And if you can reduce that somehow, you sort of open up the possibility of using cheaper compute, you know, across the network. And Ce's research for a decade has been in that subject. You know, and from there, finding, you know, other folks in the network, I think there is generally a lot of excitement and philosophical alignment around what we are doing, which, you know, we publish papers, we publish open source libraries and code, we build open models. And I think the people in academia in, you know, machine learning and NLP, that's really what they want to do. So I think that's been really a kind of kernel for, you know, composition of the company. And we're lucky to have, you know, at this point, attracted some of the best researchers in the field. So I think that's the most important thing. And, you know, the rest of it is sort of driven by us. A couple of these philosophies around independent systems and decentralization and good developer interfaces, you want to make it accessible. That's, you know, just as important. And the rest follows from there, I think.

Alessio [00:08:43]: I want to try and fill in some of the blanks in the history of Together. I think people come on your website today and they say, you raised a hundred million dollars Series A. They're like, wow, these guys are like super legit company. But it feels like Red Pajama just came out a year ago. I remember we had Mike Conover in the studio, who had built Dolly at Databricks. And you announced it literally the morning we were recording. So we're like in the studio on our phones, looking at it. And it's like, wow, this is like the first time now there's like a good curated dataset to do open pre-training. So maybe let's start from there. Like, what was the motivation behind it? Why did you decide to do that? It's, datasets are one of the things that most people don't want to work on. They just want to do models, not datasets.

Ce [00:09:27]: Yeah. So, yeah, first one is not the first, right? So I think it's actually built on a whole bunch of amazing effort the community already have. For example, Eleuther has the Pile, right? There's a whole bunch of amazing datasets they have, like C4, right, from Google, right? So I think really get inspired by the impact those like datasets have on the community, right? So I think when we did Red Pajama, it was a time that people are really fascinated by Llama, the model, like Llama 1, right? Which I feel like decades ago, right? But it's kind of, people are really excited about the quality, right? So that's really like a big shift in people how to think about open models. People start to see hope, right? So, but the one problem of Llama is the data recipe is described in a pretty detailed way in the paper, but the data is actually not there. So, and our original thinking is how about we take the recipe and we try to do our best effort reproduction and try to put it out, such that we can learn from our mistakes in the reproduction together, right? So that's essentially the original thinking behind Red Pajama. And we have been pretty happy and excited about what the community has been kind of building on it. For example, there's a dataset called Slim Pajama, right? Which does deduplication over our data, right?

Swyx [00:10:38]: From Cerebras, did they talk to you before?

Ce [00:10:39]: Oh, yeah, yeah, yeah, yeah. So, yeah, so we are very good friends so we can discuss about technical perspective. We are pretty excited because I think it's kind of why we do Red Pajama in the first place is that people can actually build not only models, but also datasets essentially over that piece of artifact, right? So that's actually what inspired us to do the first version of Red Pajama dataset.

Swyx [00:11:01]: Yeah, and then you released V2 maybe two months ago.

Ce [00:11:04]: Yeah.

Swyx [00:11:05]: 30 trillion tokens.

Ce [00:11:06]: Yeah, 30 trillion tokens. So I think what's exciting about Red Pajama V2 is not only the number of tokens, but we start to kind of learn from Red Pajama V1. So one thing that we learned was that data quality is really the core, right? So you want to take this couple trillion token dataset and try to bring them down maybe to one trillion or two trillion, right? The way that you actually filter them, deduplicate them is not something that kind of pre-decided before you see the application, right? So you kind of want to have a modular framework to think about data quality, right? So like given application, let's automatically or maybe semi-automatically try to come up with a way to filter it down. So that's why in Red Pajama V2, we kind of overlay the dataset with like 40 different pre-computed quality signal, right? If you want to reproduce your best effort, like C4 filter, it's kind of like 20 lines of code, right? And this open up this opportunity you can actually put different filter together, learn the combination of filter. We are very excited to see what community actually come up with using Red Pajama V2.
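
As a rough illustration of what "overlay the dataset with pre-computed quality signals, then compose filters in a few lines" can look like, here is a sketch; the field names (`word_count`, `frac_duplicate_lines`, `perplexity`) are hypothetical stand-ins, not RedPajama V2's actual schema.

```python
# Sketch: composing document filters over pre-computed quality signals.
# Field names below are hypothetical stand-ins, not the real RedPajama V2 schema.
docs = [
    {"text": "...", "word_count": 420, "frac_duplicate_lines": 0.02, "perplexity": 35.1},
    {"text": "...", "word_count": 12,  "frac_duplicate_lines": 0.60, "perplexity": 480.0},
]

def c4_like_filter(doc) -> bool:
    # A best-effort C4-style heuristic: drop very short or highly repetitive docs.
    return doc["word_count"] >= 50 and doc["frac_duplicate_lines"] < 0.30

def perplexity_filter(doc, max_ppl: float = 100.0) -> bool:
    # Keep documents a reference language model finds reasonably fluent.
    return doc["perplexity"] <= max_ppl

filters = [c4_like_filter, perplexity_filter]        # toggle/compose filters per application
kept = [d for d in docs if all(f(d) for f in filters)]
print(f"kept {len(kept)} of {len(docs)} documents")
```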

Swyx [00:12:11]: It was retrospectively so obvious that this is a good idea that I wonder how come more datasets don't do this. You release the dataset with all these toggles that you can turn on and off, right? And you can sort of tune up and down the quality in ways that you believe is important to you. Yeah, I just, it makes so much sense now in retrospect. Because everyone just publishes like their pipeline and then the end result. But what about all the intermediate stages? Yeah.

Ce [00:12:35]: Yeah, so I think, so there are multiple things there. I don't think we are the only one like doing that. For example, like Dolma from AI2, right? They have this very flexible format to actually put in those quality signals, right? Think like, we are actually calling them some, right? So you can actually load Red Pajama using their tool. That whole thing should work, right? So I think one fundamental thing that changed in the last year, essentially, in the beginning when people think about data, it's always like a byproduct of the model, right? You release the model, you also release the data, right? The dataset is there essentially to show people, ah, if you train on this data, you'll get a good model. But I think what started to change is when people started building more and more of those models, people started to realize like different subsets of the dataset are kind of valuable for different applications, right? The data becomes something to play with, right? So I think we are kind of lucky that we happen to release Red Pajama right at that point that we get this opportunity to actually learn from that.

Alessio [00:13:34]: And you guys have a custom model training platform on Together, too. You have a bunch of stuff in there for data selection, like DSIR and things like that. How did you decide to work on that versus, because you first started with like some of the fine-tunes on Llama. Do you see a lot of interest there? And I know you've been doing a lot of research on state space models and other transformer alternatives. Like, do you also see that as something you'll keep working on this year and push more people towards?

Vipul [00:14:02]: Yeah, I mean, we, you know, we think of how to make training more efficient and building models more efficient. Part of that is being able to select the right dataset. This is why you have signals, DSIR. You can start with a small dataset and find similar documents, build models with that. So we think it's an important part of the kind of model build tooling that is, you know, sort of widely useful for people building different kinds of models. Similarly, you know, we are running into the limits of how fast you can make transformers. And we want inference at 5,000 tokens per second. I don't think we will get there with transformers and we need to learn longer sequences. Data, again, becomes very, very expensive with transformers. So we work on state space models and all the research that we are doing there. And hopefully other labs will pick up on this and make it a kind of important target for optimization. But we think that, you know, open source is a great place for this. We can provide these recipes for data and for training to our customers who are building, you know, custom models themselves. And, you know, we are quite excited about the sort of progress we are seeing there.

Alessio [00:15:18]: Do you have some of these models available for inference on Together? Can people play around with a strictly, you know?

Swyx [00:15:25]: Yeah.

Vipul [00:15:25]: Yeah, they're available for inference on our serverless platform.

Swyx [00:15:29]: I always try to be the person who asks about acronyms in case, you know, people want to understand. Should we explain importance resampling, you know, that kind of stuff?

Ce [00:15:37]: Oh, yeah. So DSIR essentially, it's a fundamental idea. So it's one of the papers from Percy, right? So essentially, if you know what you are doing, you can actually use that as a very strong signal about what data to put into the training process, right? So that's essentially the fundamental idea, right? So, and then more concretely, right? So there are actually different versions of DSIR, right? So one version is like if you have a validation set, right? You can actually somehow measure the similarity between the validation set and also your pre-training corpus and essentially subset it, like take the subset. And often there's actually a less targeted version of DSIR where you'll say, yeah, maybe Wikipedia is actually a very good corpus. Let's try to find more Wikipedia, right? And you can think about it in two ways, either as a way to come up with different weights for different data slices. Yeah, so as like a filter type of step. Yeah, for a dataset, or think about that as like data augmentation. So that's how, yeah, that's how we think about DSIR.
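
For readers who want the gist of importance resampling in code, here is a heavily simplified sketch: fit a cheap bag-of-words model on both the target (validation) set and the raw corpus, weight raw documents by the likelihood ratio, and resample. This is a toy illustration of the idea, not the DSIR paper's hashed n-gram estimator.

```python
# Toy importance resampling for data selection: weight raw documents by how much
# more likely they are under a target-set word distribution than under the raw
# corpus distribution, then sample with those weights. Illustration only.
import math
import random
from collections import Counter

def word_logprobs(texts, vocab):
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values()) + len(vocab)          # add-one smoothing
    return {w: math.log((counts[w] + 1) / total) for w in vocab}

target_docs = ["the theorem follows from the lemma", "we prove the bound"]
raw_docs    = ["buy cheap tickets now", "we prove a tighter bound on the lemma",
               "celebrity gossip of the day", "the lemma implies the theorem"]

vocab = {w for t in target_docs + raw_docs for w in t.lower().split()}
p_target = word_logprobs(target_docs, vocab)   # fit on the validation/target set
p_raw    = word_logprobs(raw_docs, vocab)      # fit on the raw corpus

def importance_weight(doc: str) -> float:
    words = doc.lower().split()
    # average per-word log-likelihood ratio, exponentiated
    return math.exp(sum(p_target[w] - p_raw[w] for w in words) / len(words))

weights = [importance_weight(d) for d in raw_docs]
selected = random.choices(raw_docs, weights=weights, k=2)   # resample toward target-like docs
print(selected)
```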

Swyx [00:16:33]: That makes sense. I will have to read the paper to understand a little bit more. Because when you say things like, we have to know in advance what we were trying to do with the model, then we do importance resampling. That is against the principle of general intelligence, right? Like the point is to train AGI.

Ce [00:16:48]: Yeah, so it depends on what do you mean by being general or generic, right? So I think, I mean, you can always take a meta-learning perspective that we know the distribution of tasks that we care about, right? So you can always go kind of up in the ladder of how general the whole thing is, right? But also for many of the customers that we are actually talking to, right, they have kind of very targeted application, right? The benefit you can get out of that is you could build a better open model, often smaller, often easier to do inference, if you know what you want, right? So I think the whole trade-off would be, and the x-axis would be how generic the whole thing will be. The y-axis would be not only the top accuracy, but also a whole bunch of the deployment cost, right? The size of the model, right? The robustness of the model. So I think different people will navigate the space in different way. And we want to be the platform, essentially, whatever point that you want, we have a solution for you.

Swyx [00:17:43]: One more thing on data before we go deeper on state-space models. Are we running out of data? Can we go in order of magnitude? Can we go five orders of magnitude? How do both of you think about how much data we have and how much we need?

Ce [00:17:55]: Yeah, so I think that's a very, very good question. So I don't think we are running out of data on Earth.

Swyx [00:18:02]: Right, so think about it globally. Training data, training class data.

Ce [00:18:05]: Yeah, yeah, so I think, I mean, some of them are not accessible, right? But I do think there are many organizations in the world have enough data to actually train very, very good models, right? So, I mean, they are not publicly available, right? But there are people who actually have access to those, right? So I think in general, right? So if you think about the data in the open space, right? So I guess that was specifically that you actually mean whether we are running out of data. I do think there need to be some way, right? That people who are training open models get connected with essentially data that's not internet data. So I think that channel need to be opened up for the open model to get more data, right? But I'm kind of on the optimistic side that the society will figure out a way that we can train open models that's beyond this internet data.

Swyx [00:18:57]: Beyond internet, meaning books?

Ce [00:19:00]: I mean, there are a lot of those, right?

Swyx [00:19:02]: Books, right?

Ce [00:19:02]: Transcripts, right? Videos, audios, right? So there are a whole bunch of data sources that we are not integrating into open data side, right? So, and maybe they shouldn't be open, right? So I think the community need to figure out a way, yeah, like the best balance, yeah? Such that we can have open models, but on the other hand, also have a reasonable collection of data that we can actually use.

Swyx [00:19:29]: I think a lot of people think that, there's a theory that Whisper was released so that you could transcribe YouTube and then use that as a source of tokens. Then I talked to other researchers who are like, you know, YouTube has very low quality tokens. You know, do you want your model to talk like a live streamer from YouTube? Because that's what they're going to do. So it's not clear, like what the quality of this data could be.

Ce [00:19:53]: Yeah, I guess that depends on your application, right? So I think as a platform, right? So our goal is whatever application that you have, yeah, so we have a platform that you can actually achieve your goal, right? So there are definitely applications that kind of make sense to speak like YouTube, right? So, but there are probably also other application that kind of more on the formal side, right? So I think there are going to be a diverse collection of models, both open and closed, right? So, and we kind of want to be the engine that powers that.

Swyx [00:20:21]: There's a lot of people who own data sources who are doing the locally optimal thing and humanity as a whole is losing out. So like New York Times is suing OpenAI, you know, Stack Overflow shut down their API, Reddit shut down their API, X, you know, made their own model, right? On Twitter data. We're just going to have all these like tiny little gardens of data that would be useful in a general model, but everyone's just trying to make their own model. And it seems like globally suboptimal.

Vipul [00:20:47]: I think you need to have some kind of a marketplace for figuring out how to get this, you know, data into models, and I think we'll increasingly see more of that. You know, I think there's a positive aspect to it too. There is an incentive for creators to participate in a system which is sort of more fair relative to, you know, the capture of value by an AI company that's taking their data. But I agree. I think this is a big open problem that needs to be solved. And I hope there will be, you know, serious efforts around it.

Alessio [00:21:19]: Let's talk about the most precious resource on planet earth, GPUs. You have a lot of compute obviously, but you also have a lot of product pieces. You have inference, you have fine tuning, you have pre-training. What's the split in terms of usage? Do you see most people are just running inference on off the shelf models? Do you see maybe some last mile fine tuning?

Vipul [00:21:40]: I would say right now, the top five models on our inference stack are probably all fine-tuned versions of open models. And we've seen- Who fine-tuned them?

Swyx [00:21:51]: You fine-tuned them?

Vipul [00:21:52]: They were fine-tuned by our customers.

Swyx [00:21:54]: By your customers.

Vipul [00:21:55]: You know, either on our platform or off our platform. And we are generally seeing that, you know, that is the sort of trend where you can get better quality on your task by sort of now easily adapting these models to your data. We also have, I would say, over 20 big model builds happening on the platform, which are customer builds. We see a lot of training and it's also somewhat surprisingly a more continuous kind of workload. We sort of imagined that this would be more episodic. You train a model and then you do inference. But what we find is, you know, we train a model and then they train the next version and then the next version, which sort of grows in scale. I would say training is still the bigger portion. In some ways, inference is superlinear to model quality. And as the models are getting better, there's more and more inference.

Swyx [00:22:48]: Oh, because they're more useful. Yeah, they're more useful, yeah. So, okay, so training is bigger. This is actually consistent with what we've heard from Mosaic, that, you know, people think that training is sort of like a one-time deal. You do one big run and then you're done. It's never true. And so I'm interested in, like, putting some numbers and I don't know what you have disclosed or what you want to disclose, but, like, how many GPUs do you have? What is the equivalent amount of compute that you have? Because I understand that your GPU setup is different than what people typically think of, like, a giant data center somewhere, right?

Vipul [00:23:20]: I don't think we have shared this number publicly. It's, you know, so this will be the first time, I guess. Like, we have close to 7,000 to 8,000 GPUs today. It's growing monthly.

Swyx [00:23:31]: What class of GPU are they?

Vipul [00:23:32]: They're mostly A100s and H100s.

Swyx [00:23:35]: Okay.

Vipul [00:23:36]: And probably more, I think, split towards H100s now. You know, we'll be sort of building this best-of-class hardware. So as there are other versions of these coming out later this year, we plan to have those in the fleet as well.

Alessio [00:23:53]: I know when we talked last year, you were also using some of the supercomputers by the Department of Energy. There was kind of like a lot of random GPU compute in the world. Have you seen that kind of getting timed out? I think maybe a year ago, people were like, oh, yeah, you can use this GPU computer that is going to be end-of-life. Has the bar changed to give access to those resources?

Ce [00:24:13]: From our perspective, it's actually getting better. Yeah, so from the community perspective, because many of the institutions in the world, they're actually investing in hardware, right? So for example, we are working with one of the institutes in Germany called Hessian AI, right, which gives us a lot of help on the compute side. So they start to have this very big GPU cluster, and they're actually sharing that with the community, right? And it's not super big, right, but also not a small one, right? So you start to see this, like, different lives that start to pop up, right? And because of the power of the community, they start to actually share that. So we actually find as a researcher today, it's probably easier for them to actually get a GPU than last year.

Swyx [00:24:56]: Interesting.

Alessio [00:24:56]: And then for you to buy them, what's the state of the market right now? Is it still extremely hard to get any? Do you have Jensen's phone number? Do you have like GM phone number? Do you guys get like the SDR because you're like under 10,000?

Vipul [00:25:12]: NVIDIA is obviously motivated to help us, both as an investor and we are their customers. I would say the market is very tight still, and it's likely going to be this way for a while, is my sense that the demand for AI computing is just kind of ramped up very, very quickly, and it will take a while for supply to catch up.

Swyx [00:25:37]: So how tight it is, and let's say compared to like a year ago, two years ago, what do you mean when you say tight? The things you want, you can't get?

Vipul [00:25:42]: You can't get them immediately. They're sort of, you know, minimally like two to three months out. Any inventory that shows up tends to clear very, very rapidly. And, you know, we obviously sort of look at this in a very detailed and analytical way. There are four to five million GPUs that will be sold this year from NVIDIA and others. And if you think about a 512 to 1,000 GPU cluster for a company, that's 4,000 to 8,000 companies, right? So it's in some ways a very small number. In other ways, the cost of GPUs will be, you know, $80 to $100 billion, and then you layer servers and data center space and electricity on top of that, and that's, you know, close to $250 billion worth of kind of compute, which when you compare it to the cloud computing of today, you know, AWS last year was $88 billion in revenue. So this is really kind of a build-out happening of AI hyperscalers. It is much more disaggregated, and it's very, very global. So, you know, we think that GPUs are going to be sort of a precious resource for a long time, and using them optimally is very valuable.
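
Here is the back-of-envelope math above reproduced as a quick sketch; the per-GPU price is an assumed round figure chosen to land in the quoted $80-100B range, not a disclosed number.

```python
# Quick reproduction of the back-of-envelope GPU market math quoted above.
gpus_sold = 4_000_000                      # low end of the ~4-5M GPUs estimate
cluster_sizes = (512, 1_000)               # a typical per-company cluster, per the pod

companies = [gpus_sold // size for size in cluster_sizes]
print(f"roughly {companies[1]:,} to {companies[0]:,} companies")        # ~4,000 to ~7,800

price_per_gpu = (20_000, 25_000)           # assumed rough $/GPU, chosen to match $80-100B
for p in price_per_gpu:
    print(f"GPU spend at ${p:,}/GPU: ${gpus_sold * p / 1e9:.0f}B")      # $80B and $100B
```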

Swyx [00:27:02]: Yeah.

Alessio [00:27:02]: Our friend, Dylan Patel from Semianalysis, he wrote a post about the inference market recently and obviously mentioned you guys. In his post, he said, our model indicates that Together is better off using two A100 80 gig systems rather than an H100-based system. The temperature and performance testing also point to Together utilizing speculative decoding. Any thoughts? Is Dylan right? I don't know, what's-

Swyx [00:27:26]: What is his model, man? What does he know that they don't know? Yeah, exactly.

Alessio [00:27:30]: I wanna know, I guess like from the outside, and sometimes we even do it, we try and speculate on what people are actually doing. So for the first time, now we have a former guest writing about a current guest. So we wanna know what you guys thought and maybe what are some of the misconceptions that people from the outside have on what it takes to run like a GPU cloud today?

Vipul [00:27:50]: Yeah, big fan of Dylan's, by the way. I religiously read Semianalysis. I think there were some errors in that analysis. In particular, we were trying to decode it and one of the things we noticed is that it assumed that input tokens weren't being priced. So I think that may have been an error in the model. I also don't think that there's this assumption that people are running this at a loss. I think it's very expensive. You can't do that for very long. And there are trade-offs in terms of batch sizes you use and the kind of tokens per second performance that are kind of system trade-offs. We've done a lot of work. This is one of the key areas of research for us. So our inference stack is a combination of 50 different sort of tricks and techniques and we think there's a lot of room for optimization here. So whichever hardware provides better performance, whether it's H100 or A100s or L40s, we can sort of measure price performance on particular hardware and we tend to use that for that model or in some cases, certain customers have data streams which can be then optimized for a particular configuration regime. So we do fairly detailed work on how to make this more efficient and so it's hard to, from the outside, looking at memory bandwidth and estimating what's actually happening.

Alessio [00:29:26]: How many of these 50 tricks are you keeping to yourself and how many are you gonna open? Because we have three now, obviously Flash Attention 2 is open source. He mentioned he'd love to come work together because of how much you care about open source. Yeah, how do you weigh that as a CEO and CTO?

Vipul [00:29:43]: A lot of it is open, right? Flash Attention, Flash Decoding, et cetera, and we publish something that's very generally universally useful. It's going to produce better open source AI. We tend to publish as open source. I think on the inference stack, there are open source inference stacks which are pretty good and definitely today, it gives us a competitive advantage to have the best one. So we are not sort of rushing out to release everything about it. It's not overall that additive to open source out there and it is particularly useful as a business for us to provide best price performance. Yeah, we make these decisions. We have discussions. Anything that we keep closed, we generally talk about it quite a bit and decide like this is the piece that is closed for today and it may not be the case six months from now. It may not matter as much.

Ce [00:30:40]: Yeah, so I think being open is kind of very important, right? So I think the whole company actually built on this idea that there's going to be ecosystem built on our open models, right? And that's also how we are really lucky to attract this top group of talents to actually join us because of the dream and the mission that we have on our side to really facilitate the open ecosystem, right? So I think in general, it's like I think all the ideas should be open. So that's why we publish papers, right? We actually talk about ideas, right? So I don't think it makes any sense to keep idea like close, right? So there are some software artifact that are kind of really deeply embedded into our kind of own kind of like stack. It kind of only useful when you're trying to build a disaggregated cloud, right? Maybe at some point that we're going to be open as people said, right? But at this moment, right? So we are kind of busy actually building it, right? So that's probably kind of getting to the picture about when that piece is going to be open, right? But I think on the research side, the ideas and for our people to publish things, I think that's really, really important, right? So I think that's how we get talent. That's how I think we as a company going to move the field forward.

Swyx [00:31:49]: I noticed that you never used the word federated learning or inference. Is there a distinction that you draw?

Ce [00:31:55]: So, I mean, it's definitely not intentional, but I think federated learning is, have been used in so many different ways by so many different people. It starts to lose a very precise meaning about what that really mean, right? If you go back to the original Google paper of federated learning, I think that's very different from what people are talking about today when they say federated. Yeah, we kind of want to be really precise about it.

Swyx [00:32:18]: And so your term is disaggregated.

Ce [00:32:19]: Yeah, so as an infrastructure, right? So that's disaggregated.

Swyx [00:32:22]: Aren't most clouds disaggregated? Like what's different about it?

Ce [00:32:27]: So one way is that most of the cloud are disaggregated, but some of that is actually being exposed to the user, right? If you go to AWS, you do know which region you are in, right? So I think one thing that we are trying to do is you have this disaggregated cloud, not only about location or geographically where they are, but about this reliability and also this diversity of this infrastructure. So, and if we want to build a reliable, high-quality layer over that, the user actually don't know, right? What's actually happening under the cover, right? So I think that's one of the difference of the way that we are thinking about infrastructure.

Swyx [00:33:06]: Yeah, a bit closer to Cloudflare than AWS. Yeah. Yeah. We have one question here, which we'll just throw out, it's kind of fun. So going back to this sort of inference stack piece, maybe if you had to pull out like a call for researcher or just like point out interesting areas of work that you're interested in, what pieces of the stack have the most opportunity for improvement?

Ce [00:33:27]: Yeah, so I think the way we are thinking about the inference stack is, so there are multiple things that can happen, right? So you can do better algorithms, like speculative decoding, you can change the model architecture, you can go really crazy on the system side, right? And you can also code it on the hardware, right? So it's not really clear innovation on a single dimension will get you there. So the key thesis on our side is, if you only push on one direction, you are going to reach diminishing return really, really quickly. Yeah, there's only that much you can do on the system side, only that much you can do on the algorithm side. I think the only big thing that's going to happen is when you ask all those dimensions to actually compound, right? So to have algorithm, model, and system all come together, so I think that's how we reach the next 10 times improvement on inference, right? So I don't think there's a single dimension that is particularly important, but looking at this space in a joint way, right? Try to co-optimize jointly multiple dimensions, I think that's going to be really important for the community to look at.

Vipul [00:34:28]: Yeah, we often see, I see numbers from the team and you have these multiple methods, not all of them compound. So you mix these together, it's still similar results and some combination of them will have this incredible effect that is really, really super interesting. So it's very systems, you know, a kind of broad systems approach to it that's the most effective.

Swyx [00:34:51]: I think I finally get the name of the company, like- Bring it together, yeah. Everything needs to be automated together.

Alessio [00:34:57]: All right, just quickly, how does all this work change, just like some of the architectures change? I know a mixture of experts like speculative decoding is a little less efficient because of memory bandwidth. How much of it do you invest when it's a maybe model-specific improvement versus more horizontal thing? Also, you're researching different architectures, so how much do you want to spend time optimizing what state of the art today versus what's coming next?

Vipul [00:35:24]: We do spend time on what's state of the art today as well as what's next. You know, the value we get from doing specific optimization, even for, you know, what works well for a particular model on A100s with a particular bus versus H100s, it's a worthwhile investment for us. So we will go down fairly deep into a specific architecture and specific hardware. It does also inform what works better where, and you don't have to take the same approach for, you know, every model and every sort of hardware setup. We can take these different approaches and we do have these multiple systems now. We know that this, you know, system B is better for Mixtral and system C is going to be better for StripedHyena or Mamba.

Alessio [00:36:13]: Before we move on from inference, we need to talk about the Anyscale drama. So we're actually having Sumit on the podcast tomorrow, who also kind of came to your guys' support about how it's not just like, oh, Together saying this benchmark's not good because they look bad in it. How, I guess like, it's a hard question to ask, but like, why did you decide to just come out and say it? And how maybe does that also reflect the values that you guys have about open source and openness and kind of like being transparent about what's real and maybe hopes for standardizing some of these benchmarks to make it more clear?

Ce [00:36:56]: So it's a great service Anyscale is doing for the community, right? I mean, it's very hard to do benchmarks. The moment you do a benchmark comparing N players, right, N minus one will be unhappy. You have two tables, then maybe N of them will be unhappy, right? So it's a very great thing that they're doing. And in some of the work that we are doing, we actually use LLMPerf, right? So it's a great thing that they're actually doing. So I think one thing about benchmarks is, and probably the professor part of me is talking, is a good benchmark should think about how it's going to incentivize the field to actually move forward, right? So if the benchmark really becomes a kind of standard, how are people going to over-optimize to the benchmark if you are going to do that? And when people are doing that, what are we actually trying to incentivize, right? Will that move the world to a better place? Or will that essentially have every single player focus on marketing or spending time or money on something that actually does not matter on the technical side, right? It's very hard to actually strike a balance, right? So I think the reason we kind of try to give feedback on the benchmark is we kind of want to open up the discussion about how the industry should come together and define maybe a common way that we compare with each other, right? So like how database people do TPC, right? Maybe we should have something actually similar, right? So we are trying to start some of the conversation. So it's not really that we jump out to say it's not good because there's no way we can have a perfect benchmark. That doesn't really exist, right? So just try to kickstart a conversation that maybe we should come together and do something that the community agrees on and that aligns with the benefit a user is going to get, right? So just get the conversation started.

Vipul [00:38:42]: I've spoken to the AnyScale team after that, and I think they had really great intentions. And partly, I think it felt very objective and everyone sort of had a reaction to it because it just didn't match their benchmarks that we've all run internally against different services. I think a common industry benchmark run by an independent party versus one of the vendors.

Swyx [00:39:04]: Is there one that you appoint to?

Vipul [00:39:06]: I don't think one exists today. I think there should be. We're having some conversations about someone setting one up. And there's lots of interesting aspects of this. Time to first token is a function of where the test was run from. There is different load on these services at different times of the day and weekday or weekend. So you have to measure that well. And I think if all of that were done very well by an independent source, that will be a very useful service to customers and in the services themselves.

Swyx [00:39:39]: Yeah, I'll point people to artificialanalysis.ai, which is a new one that recently emerged. I don't know if they've done it right. It looks like a side project of a couple people. But I think it's in all the providers' interest to work with them. And ensure that there's an independent third party that's measuring these things, right? At least on the baseline. For me, what's worrying is more about what Ce was saying, which is, do these benchmarks skew things in ways that customers might not be mindful of? Like, what are these things overemphasizing that we might be missing? And I don't really know. It seems like a lot of these services bundled together, they're a version of quantization as well. So that means there's performance trade-offs, right? You're not comparing apples to apples, the same model itself, even though it's like a Llama variant or whatever. So what do people trade off? They trade off latency, they trade off price. Obviously, those are the first two. But what else, right? What factors matter in an inference business?

Ce [00:40:33]: Yeah, so I think there's also the throughput, right? So there's the time to first token, right? So, and then there are things that users do not often see, for example, the reliability, right? The capacity, right? So that also has an impact on user experience at a global scale. Maybe not a single query, right? But in aggregation, you can also see a whole bunch of, like, whether you are emphasizing P50, P95, right? So the whole bunch of things that you can actually play with. And of course, there's also quality. So there are different ways to actually make the whole thing faster, speculation, quantization, or a combination of those, right? So yeah, so there are so many things to actually play with. So they probably need a benchmark where the protocol is transparent to make sure, like, it's very clear what we are doing and a whole bunch of checks on the quality to make sure we are putting the right group of stories in the same table. So I think then essentially the user can actually navigate the space. So I think that's going to be good for everyone.
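
To make these metrics concrete, here is a small sketch that computes time-to-first-token, decode throughput, and P50/P95 latency from hypothetical per-request timings; all numbers are made up for illustration.

```python
# Sketch: computing time-to-first-token (TTFT), decode throughput, and latency
# percentiles from per-request timings. All numbers are made up for illustration.
import statistics

requests = [
    # (request_start, first_token_time, last_token_time, tokens_generated) in seconds
    (0.00, 0.21, 1.90, 240),
    (0.05, 0.45, 2.80, 310),
    (0.10, 0.19, 1.30, 150),
    (0.20, 0.90, 3.90, 400),
]

ttfts = [first - start for start, first, _, _ in requests]
latencies = [last - start for start, _, last, _ in requests]
throughputs = [toks / (last - first) for _, first, last, toks in requests]  # decode tok/s

print(f"mean TTFT: {statistics.mean(ttfts):.2f}s")
print(f"mean decode throughput: {statistics.mean(throughputs):.0f} tok/s")
print(f"P50 latency: {statistics.quantiles(latencies, n=100)[49]:.2f}s")
print(f"P95 latency: {statistics.quantiles(latencies, n=100)[94]:.2f}s")
```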

Swyx [00:41:27]: Yeah, makes sense. It's a very important field and I think hopefully there's a good third party that emerges from this. So I just want to touch on one more piece, which is I think I'm appreciating from this discussion that fine tuning is a bigger part of your business than I thought. The other big player in fine tuning is Mosaic. Well, Mosaic is more training, but like there's a bunch of other players in the fine tuning space. If I was a prospective fine tuning customer, what do I come to you with? Do I come to you with my custom data and that's it? Do I also have to write the fine tuning code? What level of engagement do you do with your customers?

Vipul [00:42:01]: It's across the spectrum. Some of our customers are pre-training models from scratch, and many of them will bring their data sets and use our infrastructure and training stack to train their models. There are others who have trained smaller models and want to scale up, across infrastructure and across data, so we'll help them do that. With some customers we start a little bit more consultative: they have a particular task and idea in mind and we will help them get from there to the data set and the right model to achieve that task. So it's a spectrum, and our goal is to productize as much of this as possible so that the whole process can be fast and scalable. I would say there is a lot more understanding around fine-tuning now than even six months ago. There are open source tools, recipes, literature, podcasts, Discord channels where people are figuring it out, and in many ways it's one of the successes of open source: you have small collectives of engineers who are now creating the top models on open source leaderboards, and they have tried out all sorts of data recipes, creating synthetic data. Merging models. So that's really fun to see, and I think that sort of agency that exists now is exciting. We see a lot of that being applied to products and more commercial models that people are deploying in their applications.

Alessio [00:43:50]: And then just to, I guess, wrap up the Together story: it's almost becoming a platform as a service, because now you've released Together Embeddings. How did you get 92.5 accuracy on 32K retrieval? And do you think we're getting to the point where embeddings are about as optimized as they're going to get, and we should just focus on models and inference, or do you think there's still room there to improve?

Ce [00:44:17]: Oh, I don't think we've even gotten started on embeddings. There are so many things. Embeddings are really fundamental for many things, for example RAG, right? That's a very common application; that's how people bring knowledge in. It's also a fundamental piece when you want to build a better model, because it gives you an understanding of what actually gets into the model. You can use that to build a better data set, get a better model, then get better embeddings, and you start this loop, right? Without good embeddings, the loop is not closed. So I think there's work both on the quality side (how to embed more dedicated semantics into those vectors, how to deal with negation, for example) and on how to make the whole thing really, really fast. So I think for the next couple of years we will see a whole bunch of new embeddings, maybe of different sizes and much, much faster than today. It's a very active research area. I think people should invest more in it.
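
The retrieval loop Ce is describing bottoms out in a very simple operation: embed the query and the documents into the same vector space and rank by similarity. A minimal sketch with a stubbed `embed` function (a real system would call an embedding model here rather than return random vectors):

```python
import numpy as np

def embed(texts):
    # Placeholder for an embedding model call; random vectors just keep the sketch runnable.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 768))

def top_k(query, docs, k=3):
    """Core RAG retrieval step: cosine similarity between query and document embeddings."""
    q = embed([query])[0]
    d = embed(docs)
    q = q / np.linalg.norm(q)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(-scores)[:k]
    return [(docs[i], float(scores[i])) for i in order]
```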

Swyx [00:45:14]: I was surprised to see, I think Jina or, yeah, there's Jina AI, and then there's another guy, Tengyu's Voyage. They are coming out as startups purely focused on embeddings.

Ce [00:45:25]: Yeah. Yeah, so I think it's a very, very important piece of the system, right? People haven't focused a lot on them before, and they should definitely start to do that.

Swyx [00:45:36]: Yeah. Why are the Chinese universities so good at embeddings? You know what I mean, right? Like the BGE and- Yeah, yeah, yeah.

Ce [00:45:44]: So I don't know. We just released our first embedding model, so we're still trying to learn how to build a good embedding model. So ask me again in six months; I'll probably have more insight about how to build a better one.

Swyx [00:45:53]: I just noticed that ada-002 used to be at the top of the MTEB chart, and then it's just sliding down and down and down, and all the new models are coming out of China for some reason. And I'm like, I don't know what's going on there. So we cannot leave this discussion without talking about state space models. But first of all, how much of the company is dedicated to research? It's obviously not production quality yet, but-

Vipul [00:46:17]: I would say it's like 40, 45% I was counting this morning. That's huge.

Swyx [00:46:22]: Yeah, so that's the biggest- It's a big investment. Yeah. Okay, well, I mean, it looks like it's paying off, so. And then at a high level, I will confess, or admit, or mention for the listeners who are similarly skeptical: I did not used to care about long context, because I was like, you know, 30K is enough, 100K is enough, right? I'm not modeling DNA sequences or anything like that, so why do I need long context? I'll throw that open to you. But what Mamba did for me was change that perception, the perception that it's only about long context, that the only reason you want sub-quadratic architectures is long context. Actually, that's not true: it's also just more efficient to train, period. So I'll leave that open to you. What's the motivation that people should keep in their heads? There are multiple things, right?

Ce [00:47:09]: So one thing is that, the moment a model can do long context well, it often means that it's also cheaper. In principle a transformer can do long context; it's just very expensive. So what these state space models are trying to do is push the size of the state to be as small as possible, and decouple that quadratic dependency on sequence length, to make sure you can have a much better execution pattern.

One direct consequence of that is you can do long context really cheaply, but on the other hand, it also introduces a whole bunch of benefits even when you are not doing long context. So I think that's probably equally important. Because the state gets smaller, you can do really large batch sizes and go much faster. And another thing is, one of the hypotheses that we have is, like in Stripe Hyena, it has a hybrid architecture, right? Part of it is a state space model and part of it is still the transformer. Different components probably deal with different things better. So maybe by putting them together, by thinking about how information propagates over the whole horizon of the context, you can probably get an even better quality model than a transformer. So I think that's why we are investing a lot in those models. Not only for the context, which is very important, but also for a whole bunch of benefits they could bring.
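
A toy way to see the "small state" point: a linear state-space recurrence compresses everything it has seen into a fixed-size vector, so per-token cost and memory stay constant, while a transformer's KV cache (and attention cost) grows with the sequence. This is only an illustrative sketch, not the Mamba or Stripe Hyena implementation:

```python
import numpy as np

d_state, d_model = 16, 64
rng = np.random.default_rng(0)
A = rng.normal(scale=0.1, size=(d_state, d_state))   # state transition
B = rng.normal(scale=0.1, size=(d_state, d_model))   # input projection
C = rng.normal(scale=0.1, size=(d_model, d_state))   # output projection

def ssm_decode(token_embeddings):
    """Toy linear state-space recurrence: O(1) state per step regardless of context length."""
    state = np.zeros(d_state)
    outputs = []
    for x in token_embeddings:        # x has shape (d_model,)
        state = A @ state + B @ x     # everything seen so far lives in `state`
        outputs.append(C @ state)
    return outputs

# Contrast: attention must keep a KV cache that grows with the sequence,
# so the cost of each new token is proportional to the context length.
```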

Swyx [00:48:42]: Yeah. How should people treat the distinction between Mamba and Stripe Hyena? Like what's the point of releasing these two as separate models? Is one like sort of the together proprietary one and then the other is like the more open research one?

Ce [00:48:53]: Yeah. So I think they're pretty much different stages of exploration. They have different hypotheses behind them. For instance, there are different views of state space models: one is Hyena, another is Mamba; they're actually different architectures. When we built Stripe Hyena, the curiosity we had was: what is the highest quality non-transformer model we can build? The goal of Stripe Hyena is to see whether we can match Mistral, and by fine-tuning well, whether we can outperform it in some way. So it has a very, very strong baseline that we are trying to beat. That's why the hybrid came into the picture. And for Mamba, it's kind of more... the curiosity was how far we can push a pure architecture. There we went very systematically from small to large, all the way to 3 billion, and the baseline was essentially the best 3 billion model. So I guess they're at different stages of exploration; at some point, I think they are going to converge. We actually learn different things when building different models. They are just intermediate stages in the exploration at different points.

Alessio [00:50:02]: You mentioned the hybrid architecture. Is that the model grafting that you mentioned in the Stripe Hyena post, where you can have transformers and non-transformers together? This is a concept that I hadn't heard before reading about this. I think most people's mental model is transformers OR something else, not transformers AND something else. How do you train a model that is hybrid? Is there any difference in how you construct your datasets? Is there any difference in how you then run inference on it? How should people think about starting research in this field?

Ce [00:50:36]: Yeah, so we were also very surprised when we came up with this hybrid architecture. The way to think about it is that you have different layers in the neural network. The state space model for some layers will already give you the benefit. Other layers could be transformers; they give you this more global view of the sequence. But the other layers don't have to have that; they can still have all the other things that kick in. So we don't know what the optimal mixture between different architectures is. In principle, we can have Mamba, Hyena, and transformer layers all come together, and then see what makes sense. We have no idea what is optimal there. What we are excited about is that now the community has a whole bunch of building blocks that they can play with like Lego: just put them together and see what happens. So we are very excited about that. We are in the process of trying to learn more about this architecture, and when we know what we are talking about, we will definitely share with the community how to do that in a systematic way.
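
A sketch of that "Lego" idea: interleave attention blocks with a stand-in for an SSM/Hyena-style mixer according to a layer pattern. The blocks here are toy placeholders (the "SSM" is just a causal depthwise convolution), not the actual Stripe Hyena layers:

```python
import torch
import torch.nn as nn

class ToyAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return x + out

class ToySSM(nn.Module):
    """Placeholder for a state-space / Hyena-style block: a causal depthwise conv."""
    def __init__(self, dim):
        super().__init__()
        self.mix = nn.Conv1d(dim, dim, kernel_size=3, padding=2, groups=dim)
    def forward(self, x):
        out = self.mix(x.transpose(1, 2))[..., : x.shape[1]].transpose(1, 2)
        return x + out

def build_hybrid(dim, pattern):
    blocks = {"attn": ToyAttention, "ssm": ToySSM}
    return nn.Sequential(*[blocks[name](dim) for name in pattern])

# Mostly cheap "ssm" layers with a couple of attention layers for a global view.
model = build_hybrid(64, ["ssm", "ssm", "attn", "ssm", "ssm", "attn"])
y = model(torch.randn(2, 32, 64))   # (batch, seq, dim)
```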

Swyx [00:51:41]: Cool. What are we still unsure about? Like, why don't we just, you know, put all the money in the world and training these things now? Like what is left to figure out before we scale this thing?

Ce [00:51:53]: If you look at how the transformer has been developed over the last five to ten years, people didn't start from "you have the Attention Is All You Need paper, now let's put all the money in." It always starts from a very systematic understanding of the scaling, of data quality, of essentially the limits. I think for state space models to go from the labs to the real world, you kind of need to go through the same process. Of course, the second time doing it is easier, but there's no way we can skip this systematic step of studying scaling laws, studying what data to put in, and what the impact of different data slices is on the final model quality.
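
That "systematic from small to large" process usually means fitting a scaling law to small runs and extrapolating before committing the big compute. A minimal sketch; the loss numbers and fitted constants below are made up for illustration, not real measurements:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (parameter count, validation loss) pairs from small-scale runs.
sizes = np.array([1e8, 3e8, 1e9, 3e9])
losses = np.array([3.10, 2.85, 2.62, 2.44])

def power_law(n, a, alpha, c):
    # loss ~ a * N^(-alpha) + irreducible term c (Chinchilla-style form)
    return a * n ** (-alpha) + c

params, _ = curve_fit(power_law, sizes, losses, p0=[70.0, 0.2, 1.8],
                      bounds=([0, 0, 0], [np.inf, 1, np.inf]))
a, alpha, c = params
print(f"fit: loss ~ {a:.1f} * N^(-{alpha:.3f}) + {c:.2f}")
print("extrapolated loss at 7B params:", power_law(7e9, *params))
```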

Swyx [00:52:33]: Do you expect that the data inputs will be different?

Ce [00:52:37]: I don't know, but I wouldn't take that for granted that they should be the same, right? So that's one of the hypothesis that, so we have no opinion on that because I think that's the result of the study, not the assumption. Yeah, we do not need to assume that.

Swyx [00:52:51]: Okay, scaling laws and data, anything else like architectural that we are not sure about? Because now you have this selection mechanism that you're pretty happy with.

Ce [00:52:59]: Yeah, so, I mean, first of all, how to mix them, right? So, and second is what is the architecture? So if you look at transformer, right? So one very interesting piece there is people optimize also the hardware, yeah, to make sure that things run very fast, right?

There are very efficient kernels and very efficient hardware, and that adds another boost for the transformer architecture. So something similar should happen for state space models: which architecture is easier to run on the hardware? If things go faster, you can put in more data, and it adds another dimension to the scaling law. So I think we just need to plow through the whole space and be really systematic, from small models to 1 billion, 3 billion, 7 billion, and just go all the way up. I wouldn't jump around in the space. I would just be patient and be systematic. I think we'll get there.

Swyx [00:53:52]: Yeah, well, I'm looking forward to more research from you guys to figure that out. One dimension which we didn't talk about: we talked about long context, we talked about efficiency, but speed is also very important. A good inference provider provides, let's say, 70 tokens per second, and maybe that's faster than less good inference providers that are more like 30 tokens per second. But that's the rough range of state of the art today, and it's around human speaking speed; human reading speed is about 200 words per minute. So why do we need 5,000 tokens per second is my question back to Vipul. And is this something that is an emphasis for research as well, or is this more just an inference-only thing?

Vipul [00:54:29]: There are applications that are consuming the tokens produced by these models where the output is not necessarily being read or heard by humans. That's a place where we see that level of requirement today that really nobody can quite satisfy. There's also the question of, as intelligence grows, how do you increase the bandwidth and reduce the latency of it? If we can do 5,000 tokens a second, the throughput of that card goes up significantly and it can support more applications. So I think it's important from that perspective. And then it opens up new UX possibilities. Once you can get an immediate answer from a model, it starts working in a different way, and new types of applications will be created. We rarely run into users, except for perhaps those feeding this into a text-to-speech model, who say that slower is better or that we don't need more performance. I think these models may just be fundamentally very, very slow today in general, and we're just used to that speed. And that will change once these models can get faster.
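
Some rough arithmetic behind that point, assuming about 1.3 tokens per English word (an approximation, not a measured figure):

```python
# Back-of-the-envelope numbers for why raw tokens/sec matters (illustrative only).
human_reading_tps = 200 / 60 * 1.3      # ~200 wpm at ~1.3 tokens/word -> ~4.3 tok/s
single_stream_tps = 70                  # a "good" per-user decode speed today
card_aggregate_tps = 5000               # the kind of target Vipul is describing

print(f"humans read ~{human_reading_tps:.1f} tok/s")
print(f"5,000 tok/s could serve ~{card_aggregate_tps / single_stream_tps:.0f} "
      "concurrent 70 tok/s streams, or feed agents/pipelines that read nothing at all")
```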

Swyx [00:55:47]: Yeah, 5,000 tokens per second is, I don't even imagine, like, well, it makes me worried a bit that the machines will be communicating at a much higher bandwidth than us, but yeah. I mean, they do that already.

Vipul [00:56:00]: They do that already. Not in natural language.

Alessio [00:56:02]: Awesome. Anything we missed about Together as a product? We're gonna talk about the hackathon you just did and whatnot, but any last product thoughts?

Vipul [00:56:11]: I think one of the big focuses of our product is to become more and more serverless, to have AI development run in a serverless manner. We are there now on inference and also on fine-tuning, and we are pushing to do that on training. If there was one developer experience message, that's probably the big one: you have enough flexibility, and you don't have to commit to thousands of dollars of compute before you can start using open models. We really want to change that and make it as easy as possible to get started.

Swyx [00:56:52]: Yeah. When I first signed up for Together, I had left an instance running and I just ran out of my credits immediately.

Vipul [00:57:04]: Yeah, we changed that whole model now, so you never run into that issue. And the response to that has been amazing. We also provide $25 of free credits, which is a large number of tokens depending on the model you're using. You really can do a fine-tune, run that model, and build an app on Together for free, basically. And we'll be pushing further in that direction.

Alessio [00:57:29]: You just did a hackathon at AGI House about fine-tuning versus RAG for open source. Any learnings or recaps from it?

Ce [00:57:38]: Yeah. So I think one thing that we kind of learned is, like, so I think the hackathon was phrased as, like, something versus something, right? But I think the combination of those works really well.

Swyx [00:57:48]: Right?

Ce [00:57:48]: So I think combining all those techniques together will give you essentially another boost, right? So that's one thing that we learned on the technical side. And also we were very excited about the excitement of the audience. People are really using the platform and building something really cool. Yeah.

Vipul [00:58:08]: It's always surprising to us what people build. Yeah.

Alessio [00:58:11]: Is there something you're focused on this year: hiring, building out the engineering team? What should people who want to work at Together know?

Vipul [00:58:17]: You know, all those things. I think hiring is a pretty big topic. We are 38 people on the team and we are hiring across all areas: CUDA and kernel hackers, researchers who like to build models, people who work on systems, infrastructure and the cloud layer, as well as front-end, developer experience and applications. We have lots of exciting projects across the board, and I think 20-plus job openings on our site. We look for folks who are passionate about open source and AI. For many of the postings you don't necessarily have to have professional experience working in machine learning or AI; many of the systems people are doing this for the first time, and they can apply their systems expertise to the kinds of things we are doing. We can teach people AI, as long as they have expertise in other areas.

Swyx [00:59:20]: Will you call out what kind of expertise you're looking for? Like, we definitely have systems people listening, so.

Ce [00:59:26]: Oh, I mean, the whole stack. Right, so like all the way from the-

Swyx [00:59:29]: Kubernetes, I don't know. Kubernetes, yes. CUDA. What else, CUDA?

Ce [00:59:34]: And DevOps, right? So that's a big thing.

Swyx [00:59:37]: Is that like what, Terraform, like Pulumi? Right, yeah, yeah.

Ce [00:59:41]: And all the way to machine learning systems, right? If you like to hack on vLLM or TGI, that's great. If you want to play with different fine-tunes, build models, develop algorithms, that's great too. Essentially the whole stack, all the way from application to-

Swyx [00:59:58]: That's very broad. To system.

Ce [01:00:00]: So the fun thing about the company is that we have this very diverse collection of expertise and talents, and the goal is really to try to innovate at every single layer and then have it all compound together.

Swyx [01:00:13]: Yeah, doing everything together, that's why the company is named this way. Like, no, seriously, I didn't really get the company naming until now. Like, yeah, makes sense.

Alessio [01:00:23]: Awesome, guys. I know we kind of binned the lightning round in the last few episodes, but I think for you two, one of the questions we used to ask is like, what's the most interesting unsolved question in AI? So maybe another way to think about it is, if you weren't building together, what would you be working on?

Ce [01:00:39]: Yeah, so if I weren't building Together, I would be a professor. Then I would do a whole bunch of things without having to justify them as being useful. We used to work on quantum machine learning for a while. I think IoT is going to become very interesting. I know people have been saying that for the last couple of decades, but I'm very excited about how technology is starting to change the communication between different edge devices and all those machines, with the new batteries coming out. So that could be very cool. If not building Together, I'd probably spend some time thinking about how to compress communication even more, given all the satellite communication stuff.

Vipul [01:01:21]: On the first question of what the most important open questions are: the one thing I think about is that we need a framework for thinking about what the world looks like with advanced intelligence systems in it. I think we have had a very doomerism-flavored view of it, really informed by dystopian science fiction and Terminator, and I don't think we have a positive or really realistic framework coming from experts in the field. I think that's a pretty important question, because that really gives us a roadmap of where this industry should go. I'm hoping that some of the industry drama this last year is maybe pointing us in that direction, and solving that is, I think, important in a meta way. So I think I'm doing the perfect thing; this is really my dream job. Every day, this is what I want to do, and I expect that's going to be the case for a very long time.

Alessio [01:02:33]: Awesome, thank you guys for coming on this, it was a lot of fun.

Swyx [01:02:36]: Yeah, thank you. Thank you so much.



Get full access to Latent Space at www.latent.space/subscribe
2024-02-08
Link to episode

Why StackOverflow usage is down 50% - with David Hsu of Retool

We are announcing the second edition of our Latent Space demo day event in SF on 2/23: Final Frontiers, a startup and research competition in "The Autonomous Workforce", "Beyond Transformers & GPUs", and "Embodied AI".

RSVP here! The first one was aimed at 15-20 people and ended up blowing up to >200 and being covered in The Information - let's see what a year of growth (and competition) does to the local events space in 2024.

You can find all Latent Space events here, and of course get in touch with us to host your own AI Engineer meetups like AI Engineering Singapore.

In our December 2023 recap we covered the Four Wars of the AI stack. But how do we know when it's time to crown a winner? As we kick off 2024, we wanted to do a recap of the State of AI in 2023 to set a baseline of adoption for different products. Retool had a great report at the end of last year which covered a lot of it.

David Hsu, CEO and co-founder of Retool, joined us to go over it together. We also talked about the history of Retool, why they were too embarrassed to present at YC demo day, and how they got to $1M ARR with 3 employees. If you're a founder, there are a lot of nuggets of advice in here!

Retool AI

In our modeling of the "Software 3.0 Stack", we have generally left a pretty wide open gap as to the "user interface" equivalent of the AI stack:

Retool AI launched 4 months ago with some nifty features for SQL generation, and its own hosted vector storage service (using pgvector). However, as he explains on the pod, the more interesting potential of Retool is in helping developers build AI infused applications quickly, in combination with its Workflows feature.

This moves Retool down the stack from just the UI for internal tooling to the business logic "piping" as well. There are a bunch of dedicated tools in this space like Respell, BuildShip, Flowise, and Ironclad Rivet.

"We think that practically every internal app is going to be AI infused over the next three years." - David on the pod

RIP StackOverflow?

In July 2023 we talked about the impact of ChatGPT and Copilot:

This was then disputed by StackOverflow, who pointed out (very fairly) that there were privacy-related changes in their analytics instrumentation in 2022. StackOverflow no longer reports traffic, but based on StackOverflow's continuing transparency we can see that organic declines have continued throughout 2023.

Retool's report comes over a year after those changes and has some self-reported samples from users:

* 57.6% of people said they have used StackOverflow less; almost all of them replaced it with ChatGPT and Copilot.

* 10.2% said they no longer use StackOverflow.

We also saw a lot more tools being released in the dev tools space such as (one of our oldest pod friends) Codeium (which just raised a $65M Series B), SourceGraph (and their newly released Cody), Codium AI (just released AlphaCodium which was picked up by Karpathy), Phind (which beat GPT-4 with OSS models), and Cursor, one of the most beloved products in the dev community at the moment. Intelligence is getting closer and closer to the IDE, and the trend doesn't seem to be reverting.

We already said that "You are not too old (to pivot into AI)", and the advice still stands. When asked to rate "Preference for hiring engineers effective at using ChatGPT/Copilot for coding" on a scale of 1 to 10, where 10 is "Much more likely", ~40% of companies voted 8-10. Having an AI Engineer skillset is extremely important. 45% of companies between 1,000-4,999 employees said that they increased the difficulty of technical interviews to compensate for these new tools, so the gap between users and non-users will keep widening.

Crossing the AI in Production Chasm

Geoffrey Moore's "Crossing the Chasm" is one of the most quoted business frameworks. Every market has an initial group of Innovators and Early Adopters, who are willing to suffer through the rough edges of products initially, before the product eventually crosses into the Early Majority, which expects a full product.

In the AI world, ChatGPT and Midjourney / DALL-E have crossed the chasm in the consumer space. Copilot is probably the only tool that did it in the enterprise, having crossed 1M paid users. ~$50B were invested in AI in 2023, and we still only have

2024-02-01
Link to episode

The Four Wars of the AI Stack (Dec 2023 Audio Recap)

Note for Latent Space Community members: we have now soft-launched meetups in Singapore, as well as two new virtual paper club/meetups for AI in Action and LLM Paper Club. We're also running Latent Space: Final Frontiers, our second annual demo day hackathon from last year.

Edit from March 2024: We did a followup on the Four Wars on the AI Breakdown.

For the first time, we are doing an audio version of the monthly AI Engineering recap that we publish on Latent Space! This month it's "The Four Wars of the AI Stack"; you can find the full recap with all the show notes here: https://latent.space/p/dec-2023

* [00:00:00] Intro

* [00:01:42] The Four Wars of the AI stack: Data quality, GPU rich vs poor, Multimodality, and RAG/Ops war

* [00:03:17] Selection process for the four wars and notable mentions

* [00:06:58] The end of low background tokens and the impact on data engineering

* [00:08:36] The Quality Data Wars (UGC, licensing, synthetic data, and more)

* [00:14:51] Synthetic Data

* [00:17:49] The GPU Rich/Poors War

* [00:18:21] Anyscale benchmark drama

* [00:22:00] The math behind Mixtral inference costs

* [00:28:48] Transformer alternatives and why they matter

* [00:34:40] The Multimodality Wars

* [00:38:10] Multiverse vs Metaverse

* [00:45:00] The RAG/Ops Wars

* [00:50:00] Will frameworks expand up, or will cloud providers expand down?

* [00:54:32] Syntax to Semantics

* [00:56:41] Outer Loop vs Inner Loop

* [00:59:54] Highlight of the month



Get full access to Latent Space at www.latent.space/subscribe
2024-01-25
Link to episode

How to train your own Large Multimodal Model - with Hugo Laurençon & Leo Tronchon of HuggingFace M4

Latent Space is heating up! Our paper club ran into >99 person Discord limits, oops.

We are also introducing 2 new online meetups: LLM Paper Club Asia for Asia timezone (led by Ivan), and AI in Action: hands-on application of AI (led by KBall).

To be notified of all upcoming Latent Space events, subscribe to our new Luma calendar (sign up for individual events, or hit the RSS icon to sync all events to calendar).

In the halcyon open research days of 2022 BC (Before-ChatGPT), DeepMind was the first to create a SOTA multimodal model by taking a pre-existing LLM (Chinchilla 80B - now dead?) and a pre-existing vision encoder (CLIP) and training a "glue" adapter layer, inspiring a generation of stunningly cheap and effective multimodal models including LLaVA (one of the Best Papers of NeurIPS 2023), BakLLaVA and FireLLaVA.

However (for reasons we discuss in today's conversation), DeepMind's Flamingo model was never open sourced. Based on the excellent paper, LAION stepped up to create OpenFlamingo, but it never scaled beyond 9B. Simultaneously, the M4 (audio + video + image + text multimodality) research team at HuggingFace announced an independent effort to reproduce Flamingo up to the full 80B scale:

The effort started in March, and was released in August 2023.

We happened to visit Paris last year, and visited HuggingFace HQ to learn all about HuggingFace's research efforts, and cover all the ground knowledge LLM people need to become (what Chip Huyen has termed) "LMM" people. In other words:

What is IDEFICS?

IDEFICS is an Open Access Visual Language Model, available in 9B and 80B model sizes. As an attempt to re-create an open-access version of Flamingo, it seems to track very well on a range of multimodal benchmarks (which we discuss in the pod):

You can see the reasoning abilities of the models to take a combination of interleaved images + text in a way that allows users to either describe images, ask questions about the images, or extend/combine the images into different artworks (e.g. poetry).

- From IDEFICS's model card and blog post

The above demo screenshots are actually fine-tuned instruct versions of IDEFICS, which are again in 9B and 80B versions.

IDEFICS was built by connecting two unimodal models together to provide the multi-modality you see showcased above (a minimal sketch of this kind of glue layer follows the list below).

* Llama v1 for language (specifically huggyllama/llama-65b) - the best available open model at the time, to be swapped for Mistral in the next version of IDEFICS

* A CLIP model for vision (specifically laion/CLIP-ViT-H-14-laion2B-s32B-b79K - after a brief exploration of EVA-CLIP, which we discuss on the pod)
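
As mentioned above, here is a minimal sketch of what a "glue" layer between a frozen vision encoder and an LLM can look like. This is the simpler LLaVA-style linear projection; Flamingo and IDEFICS actually use gated cross-attention layers, so treat this purely as an illustration of the idea, with made-up dimensions:

```python
import torch
import torch.nn as nn

class VisionToLLMAdapter(nn.Module):
    """Project frozen vision-encoder patch features into the LLM's embedding space
    so image "tokens" can be interleaved with text tokens (LLaVA-style glue)."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
    def forward(self, image_features):      # (batch, n_patches, vision_dim)
        return self.proj(image_features)    # (batch, n_patches, llm_dim)

# During training the vision encoder (and often the LLM) stays frozen;
# only the adapter and a few extra layers are updated.
soft_image_tokens = VisionToLLMAdapter()(torch.randn(1, 256, 1024))
```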

OBELICS: a new type of Multimodal Dataset

IDEFICS' training data used the usual suspect datasets, but to get to par with Flamingo they needed to create a new dataset.

Enter OBELICS: "An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents":

* 115B text tokens

* 141M English documents

* 353M images

This data was carefully curated and filtered by going through Common Crawl dumps between Feb 2020 and Feb 2023. We discuss the 2 months of mind-numbing, unglamorous work creating this pipeline:

There's a lot of mention of "multi-modal web documents", which deserves some explanation. We'll show you instead of telling you:

You can see from this graph that OBELICS ends up outperforming the other image-text pairs dataset (LAION in this case) when stacked head-to-head.

You can view a subset of OBELICS and perform visualizations on them here:

2024 Update: WebSight et al

Most of this interview was recorded on Halloween 2023 at HuggingFace's headquarters in Paris:

This was in anticipation of an IDEFICS v2 release. However, several roadblocks emerged, including a notable scandal around CSAM in LAION-5B, which affected all models using that dataset. The M4 team have adopted a strategy of shipping smaller advancements in 2024, and the first ship of the year is WebSight, a dataset of 823,000 HTML/CSS files representing synthetically generated English websites, each accompanied by a corresponding screenshot (rendered with Playwright). This is intended for tasks like screenshot-to-code workflows like Vercel's V0 or TLDraw, and will be part of the dataset for IDEFICS-2.

As noted in our Best Papers recap, synthetic data is emerging as one of the top themes of 2024, and the IDEFICS/OBELICS team have wasted no time enabling themselves with it.

Timestamps

* [0:00:00] Intro

* [0:00:00] Hugo and Leo's path into multimodality

* [0:09:16] From CLIP to Flamingo

* [0:12:54] Benchmarks and Evals

* [0:16:54] OBELICS dataset

* [0:34:47] Together Redpajama v2

* [0:37:12] GPT4 Vision

* [0:38:44] IDEFICS model

* [0:40:57] Query-Key Layernorm for training

* [0:46:40] Choosing smaller vision encoders - EVA-CLIP vs SigLIP

* [0:49:02] IDEFICS v2

* [0:52:39] Multimodal Hallucination

* [0:59:12] Why Open Source Multimodality

* [1:05:29] Naming: M4, OBELICS, IDEFICS

* [1:08:56] 2024 Update from Leo

Show Notes

* Introducing IDEFICS: An Open Reproduction of State-of-the-Art Visual Language Model

* IDEFICS Knowledge sharing memo: technical lessons and mistakes

* Victor Sanh memo

* OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents

* Papers cited:

* BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

* Barlow Twins: Self-Supervised Learning via Redundancy Reduction

* CLIP paper: Learning Transferable Visual Models From Natural Language Supervision

* Vision Transformers paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

* Flamingo paper: a Visual Language Model for Few-Shot Learning

* April 2022 preprint from DeepMind, blogpost

* VQAV2 paper: Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

* OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge (https://okvqa.allenai.org/)

* MMBench: Is Your Multi-modal Model an All-around Player?

* Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

* SigLIP paper: Sigmoid Loss for Language Image Pre-Training

* Nougat: Neural Optical Understanding for Academic Documents

* MMC4 (Multimodal C4): An Open, Billion-scale Corpus of Images Interleaved With Text

* Dall-E 3 paper: Improving Image Generation with Better Captions

* GPT-4V(ision) system card from OpenAI

* Query-Key Layernorm trick: paper (Scaling Vision Transformers to 22 Billion Parameters), tweet

* EVA-CLIP: Improved Training Techniques for CLIP at Scale

* "We initially explored using a significantly bigger vision encoder (the biggest in open-access at that time) with EVA-CLIP. However, we ran into training instabilities very quickly. To lower the risks associated to the change of vision encoder, we decided to continue with laion/CLIP-ViT-H-14-laion2B-s32B-b79K which we have been using until that point. We will leave that swap for future iterations and will also consider using higher resolution images."

* Datasets

* Together's RedPajama-Data-v2: An open dataset with 30 trillion tokens for training large language models

* LAION COCO: 600M synthetic captions from Laion2B-en

* Chip Huyen's writeup on LMMs

* Joseph Nelson of Roboflow on Latent Space

* HuggingFace M4

* HuggingFace timm: library containing SOTA computer vision models, layers, utilities, optimizers, schedulers, data-loaders, augmentations, and training/evaluation scripts. It comes packaged with >700 pretrained models, and is designed to be flexible and easy to use.

* Logan Kilpatrick declaring 2024 the year of Multimodal AI at AI Engineer Summit



Get full access to Latent Space at www.latent.space/subscribe
2024-01-19
Link to episode

RLHF 201 - with Nathan Lambert of AI2 and Interconnects

In 2023 we did a few Fundamentals episodes covering Benchmarks 101, Datasets 101, FlashAttention, and Transformers Math, and it turns out those were some of your evergreen favorites! So we are experimenting with more educational/survey content in the mix alongside our regular founder and event coverage. Pls request more!

We have a new calendar for events; join to be notified of upcoming things in 2024!

Today we visit the shoggoth mask factory: how do transformer models go from trawling a deeply learned latent space for next-token prediction to a helpful, honest, harmless chat assistant?

Our guest "lecturer" today is Nathan Lambert; you might know him from his prolific online writing on Interconnects and Twitter, or from his previous work leading RLHF at HuggingFace and now at the Allen Institute for AI (AI2), which recently released the open source GPT3.5-class Tulu 2 model, which was trained with DPO. He's widely considered one of the most knowledgeable people on RLHF and RLAIF.

He recently gave an "RLHF 201" lecture at Stanford, so we invited him on the show to re-record it for everyone to enjoy! You can find the full slides here, which you can use as reference through this episode.

Full video with synced slides

For audio-only listeners, this episode comes with a slide presentation alongside our discussion. You can find it on our YouTube (like, subscribe, tell a friend, et al).

Theoretical foundations of RLHF

The foundation and assumptions that go into RLHF go back all the way to Aristotle (and you can find guidance for further research in the slide below) but there are two key concepts that will be helpful in thinking through this topic and LLMs in general:

* Von Neumann-Morgenstern utility theorem: you can dive into the math here, but the TLDR is that when humans make decisions there's usually a "maximum utility" function that measures what the best decision would be; the fact that this function exists makes it possible for RLHF to model human preferences and decision making.

* Bradley-Terry model: given two items A and B from a population, you can model the probability that A will be preferred to B (or vice-versa). In our world, A and B are usually two outputs from an LLM (or at the lowest level, the next token).

It turns out that from this minimal set of assumptions, you can build up the mathematical foundations supporting the modern RLHF paradigm!
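
Concretely, the Bradley-Terry model says the probability that A is preferred over B is a sigmoid of the difference between their scalar scores, which is exactly the functional form reward models end up being trained against. A minimal sketch:

```python
import math

def bradley_terry_prob(score_a, score_b):
    """Bradley-Terry: P(A preferred over B) from scalar 'strength' scores."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

print(bradley_terry_prob(1.2, 0.4))   # ~0.69: A wins about 69% of the time
```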

The RLHF loop

One important point Nathan makes is that "for many tasks we want to solve, evaluation of outcomes is easier than producing the correct behavior". For example, it might be difficult for you to write a poem, but it's really easy to say if you like or dislike a poem someone else wrote. Going back to the Bradley-Terry Model we mentioned, the core idea behind RLHF is that when given two outputs from a model, you will be able to say which of the two you prefer, and we'll then re-encode that preference into the model.

An important point that Nathan mentions is that when you use these preferences to change model behavior "it doesn't mean that the model believes these things. It's just trained to prioritize these things". When you have preference for a model to not return instructions on how to write a computer virus for example, you're not erasing the weights that have that knowledge, but you're simply making it hard for that information to surface by prioritizing answers that don't return it. We'll talk more about this in our future Fine Tuning 101 episode as we break down how information is stored in models and how fine-tuning affects it.

At a high level, the loop looks something like this:

For many RLHF use cases today, we can assume the model we're training is already instruction-tuned for chat or whatever behavior the model is looking to achieve. In the "Reward Model & Other Infrastructure" we have multiple pieces:

Reward + Preference Model

The reward model is trying to signal to the model how much it should change its behavior based on the human preference, subject to a KL constraint.

The preference model itself scores the pairwise preferences from the same prompt (worked better than scalar rewards).

One way to think about it is that the reward model tells the model how big of a change this new preference should make in the behavior in absolute terms, while the preference model calculates how big of a difference there is between the two outputs in relative terms. A lot of this derives from John Schulman's work on PPO:

We recommend watching him talk about it in the video above, and also Nathan's pseudocode distillation of the process:
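
Separately from Nathan's pseudocode, here is a sketch of the shaped reward that PPO-style RLHF optimizes, as commonly described: the reward model's scalar score for a sampled response, minus a KL penalty that keeps the policy close to the reference (instruction-tuned) model. The names and toy tensors are illustrative, not from any specific library:

```python
import torch

def rlhf_reward(reward_model_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward model score minus beta * KL(policy || reference), estimated per token
    as log pi(y_t|x) - log pi_ref(y_t|x) on the sampled response."""
    kl_per_token = policy_logprobs - ref_logprobs
    return reward_model_score - beta * kl_per_token.sum()

policy_lp = torch.tensor([-1.2, -0.8, -2.1])   # toy log-probs for one sampled response
ref_lp = torch.tensor([-1.0, -0.9, -2.5])
print(rlhf_reward(torch.tensor(0.7), policy_lp, ref_lp))
```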

Feedback Interfaces

Unlike the "thumbs up/down" buttons in ChatGPT, data annotation from labelers is much more thorough and has many axis of judgement. At a simple level, the LLM generates two outputs, A and B, for a given human conversation. It then asks the labeler to use a Likert scale to score which one it preferred, and by how much:

Through the labeling process, there are many other ways to judge a generation:

We then use all of this data to train a model from the preference pairs we have. We start from the base instruction-tuned model, and then run training in which the loss of our gradient descent is based on the difference between the scores of the good and the bad completion.
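
That pairwise step is usually written as a logistic loss on the score difference between the preferred and rejected completions; a minimal sketch, assuming PyTorch and toy scores:

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry style reward-model loss: -log sigmoid(r_chosen - r_rejected),
    which pushes the preferred completion's score above the rejected one's."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

r_chosen = torch.tensor([1.5, 0.2])
r_rejected = torch.tensor([0.3, 0.9])
print(preference_loss(r_chosen, r_rejected))   # lower when chosen outscores rejected
```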

Constitutional AI (RLAIF, model-as-judge)

As these models have gotten more sophisticated, people started asking the question of whether or not humans are actually a better judge of harmfulness, bias, etc, especially at the current price of data labeling. Anthropic's work on the "Constitutional AI" paper is using models to judge models. This is part of a broader "RLAIF" space: Reinforcement Learning from AI Feedback.

By using a "constitution" that the model has to follow, you are able to generate fine-tuning data for a new model that will be RLHF'd on this constitution principles. The RLHF model will then be able to judge outputs of models to make sure that they follow its principles:

Emerging Research

RLHF is still a nascent field, and there are a lot of different research directions teams are taking; some of the newest and most promising / hyped ones:

* Rejection sampling / Best-of-N Sampling: the core idea here is that rather than just scoring pairwise generations, you generate a lot more outputs (= more inference cost), score them all with your reward model and then pick the top N results. LLaMA2 used this approach, amongst many others (see the sketch after this list).

* Process reward models: in Chain of Thought generation, scoring each step in the chain and treating it like its own state rather than just scoring the full output. This is most effective in fields like math that inherently require step-by-step reasoning.

* Direct Preference Optimization (DPO): We covered DPO in our NeurIPS Best Papers recap, and Nathan has a whole blog post on this; DPO isn't technically RLHF as it doesn't have the RL part, but it's the "GPU Poor" version of it. Mistral-Instruct was a DPO model, as are Intel's Neural Chat and StableLM Zephyr. Expect to see a lot more variants in 2024 given how "easy" this was.

* Superalignment: OpenAI launched research on weak-to-strong generalization which we briefly discuss at the 1hr mark.
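
For the rejection sampling / Best-of-N bullet above, the whole technique fits in a few lines; `generate` and `reward_model` are hypothetical callables, not a specific API:

```python
def best_of_n(prompt, generate, reward_model, n=8):
    """Spend more inference compute to get a better answer: sample N candidates
    and keep the one the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    scored = [(reward_model(prompt, c), c) for c in candidates]
    return max(scored)[1]
```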

Note: Nathan also followed up this post with RLHF resources from his and peers' work:

Show Notes

* Full RLHF Slides

* Interconnects

* Retort (podcast)

* von Neumann-Morgenstern utility theorem

* Bradley-Terry model (pairwise preferences model)

* Constitutional AI

* Tamer (2008 paper by Bradley Knox and Peter Stone)

* Paul Christiano et al. RLHF paper

* InstructGPT

* Eureka by Jim Fan

* ByteDance / OpenAI lawsuit

* AlpacaEval

* MTBench

* TruthfulQA (evaluation tool)

* Self-Instruct Paper

* Open Assistant

* Louis Castricato

* Nazneen Rajani

* Tulu (DPO model from the Allen Institute)

Timestamps

* [00:00:00] Introductions and background on the lecture origins

* [00:05:17] History of RL and its applications

* [00:10:09] Intellectual history of RLHF

* [00:13:47] RLHF for decision-making and pre-deep RL vs deep RL

* [00:20:19] Initial papers and intuitions around RLHF

* [00:27:57] The three phases of RLHF

* [00:31:09] Overfitting issues

* [00:34:47] How preferences get defined

* [00:40:35] Ballpark on LLaMA2 costs

* [00:42:50] Synthetic data for training

* [00:47:25] Technical deep dive in the RLHF process

* [00:54:34] Projection / best event sampling

* [00:57:49] Constitutional AI

* [01:04:13] DPO

* [01:08:54] What's the Allen Institute for AI?

* [01:13:43] Benchmarks and models comparisons

Transcript

Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI.

Swyx [00:00:15]: Hey, and today we have Dr. Nathan Lambert in the house. Welcome.

Nathan [00:00:18]: Thanks guys.

Swyx [00:00:19]: You didn't have to come too far. You got your PhD in Berkeley, and it seems like you've lived there most of the time in recent years. You worked on robotics and model-based reinforcement learning on your PhD, and you also interned at FAIR and DeepMind. You bootstrapped the RLHF team at Hugging Face, and you recently joined the Allen Institute as a research scientist. So that's your quick bio. What should people know about you that maybe is not super obvious about you on New LinkedIn?

Nathan [00:00:43]: I stay sane in various insane sport and ultra-endurance sport activities that I do.

Swyx [00:00:50]: What's an ultra-endurance sport activity?

Nathan [00:00:52]: Long-distance trail running or gravel biking. Try to unplug sometimes, although it's harder these days. Yeah.

Swyx [00:00:59]: Well, you know, just the Bay Area is just really good for that stuff, right?

Nathan [00:01:02]: Oh, yeah. You can't beat it. I have a trailhead like 1.2 miles from my house, which is pretty unmatchable in any other urban area.

Swyx [00:01:11]: Pretty excellent. You also have an incredible blog, Interconnects, which I'm a fan of. And I also just recently discovered that you have a new podcast, Retort.

Nathan [00:01:20]: Yeah, we do. I've been writing for a while, and I feel like I've finally started to write things that are understandable and fun. After a few years lost in the wilderness, if you ask some of my friends that I made read the earlier blogs, they're like, oh, this is yikes, but it's coming along. And the podcast is with my friend Tom, and we just kind of like riff on what's actually happening on AI and not really do news recaps, but just what it all means and have a more critical perspective on the things that really are kind of funny, but still very serious happening in the world of machine learning.

Swyx [00:01:52]: Yeah. Awesome. So let's talk about your work. What would you highlight as your greatest hits so far on Interconnects, at least?

Nathan [00:01:59]: So the ones that are most popular are timely and or opinion pieces. So the first real breakout piece was when April and I also just wrote down the thing that everyone in AI was feeling, which is we're all feeling stressed, that we're going to get scooped, and that we're overworked, which is behind the curtain, what it feels to work in AI. And then a similar one, which we might touch on later in this, was about my recent job search, which wasn't the first time I wrote a job search post. People always love that stuff. It's so open. I mean, it's easy for me to do in a way that it's very on-brand, and it's very helpful. I understand that until you've done it, it's hard to share this information. And then the other popular ones are various model training techniques or fine tuning. There's an early one on RLHF, which is, this stuff is all just like when I figure it out in my brain. So I wrote an article that's like how RLHF actually works, which is just the intuitions that I had put together in the summer about RLHF, and that was pretty well. And then I opportunistically wrote about QSTAR, which I hate that you have to do it, but it is pretty funny. From a literature perspective, I'm like, open AI publishes on work that is very related to mathematical reasoning. So it's like, oh, you just poke a little around what they've already published, and it seems pretty reasonable. But we don't know. They probably just got like a moderate bump on one of their benchmarks, and then everyone lost their minds. It doesn't really matter.

Swyx [00:03:15]: You're like, this is why Sam Altman was fired. I don't know. Anyway, we're here to talk about RLHF 101. You did a presentation, and I think you expressed some desire to rerecord it. And that's why I reached out on Twitter saying, like, why not rerecord it with us, and then we can ask questions and talk about it. Yeah, sounds good.

Nathan [00:03:30]: I try to do it every six or 12 months is my estimated cadence, just to refine the ways that I say things. And people will see that we don't know that much more, but we have a bit of better way of saying what we don't know.

Swyx [00:03:43]: Awesome. We can dive right in. I don't know if there's any other topics that we want to lay out as groundwork.

Alessio [00:03:48]: No, you have some awesome slides. So for people listening on podcast only, we're going to have the slides on our show notes, and then we're going to have a YouTube version where we run through everything together.

Nathan [00:03:59]: Sounds good. Yeah. I think to start skipping a lot of the, like, what is a language model stuff, everyone knows that at this point. I think the quote from the Llama 2 paper is a great kind of tidbit on RLHF becoming like a real deal. There was some uncertainty earlier in the year about whether or not RLHF was really going to be important. I think it was not that surprising that it is. I mean, with recent models still using it, the signs were there, but the Llama 2 paper essentially reads like a bunch of NLP researchers that were skeptical and surprised. So the quote from the paper was, meanwhile, reinforcement learning known for its instability seemed a somewhat shadowy field for those in the NLP research community. However, reinforcement learning proved highly effective, particularly given its cost and time effectiveness. So you don't really know exactly what the costs and time that Meta is looking at, because they have a huge team and a pretty good amount of money here to release these Llama models. This is just the kind of thing that we're seeing now. I think any major company that wasn't doing RLHF is now realizing they have to have a team around this. At the same time, we don't have a lot of that in the open and research communities at the same scale. I think seeing that converge would be great, but it's still very early days. And the other thing on the slide is some of Anthropic's work, but everyone knows Anthropic is kind of the masters of this, and they have some of their own techniques that we're going to talk about later on, but that's kind of where we start.

Alessio [00:05:17]: Can we do just a one-second RL version? So you come from a robotics background, where RL used to be, or maybe still is, state-of-the-art. And now you're seeing a lot of LLM plus RL, so you have Jim Fan's Eureka, you have MPU, which we had on the podcast when they started with RL. Now they're doing RL plus LLMs. Yeah. Any thoughts there on how we got here? Maybe how the pendulum will keep swinging?

Nathan [00:05:46]: I really think RL is about a framing of viewing the world through trial and error learning and feedback, and really just one that's focused on thinking about decision-making and inputs in the world and how inputs have reactions. And in that, a lot of people come from a lot of different backgrounds, whether it's physics, electrical engineering, mechanical engineering. There are obviously computer scientists, but compared to other fields of CS, I do think it's a much more diverse background of people. My background was in electrical engineering and doing robotics and things like that. It really just changes the worldview. I think that reinforcement learning as it was back then, so to say, is really different. You're looking at these toy problems and the numbers are totally different, and everyone went kind of zero to one at scaling these things up, but people like Jim Phan and other people that were... You saw this transition in the decision transformer and papers and when people are trying to use transformers to do decision-making for things like offline RL, and I think that was kind of like the early days. But then once language models were so proven, it's like everyone is using this tool for their research. I think in the long run, it will still settle out, or RL will still be a field that people work on just because of these kind of fundamental things that I talked about. It's just viewing the whole problem formulation different than predicting text, and so there needs to be that separation. And the view of RL in language models is pretty contrived already, so it's not like we're doing real RL. I think the last slide that I have here is a way to make RLHF more like what people would think of with RL, so actually running things over time, but a weird lineage of tools that happen to get us to where we are, so that's why the name takes up so much space, but it could have gone a lot of different ways. Cool.

Alessio [00:07:29]: We made it one slide before going on a tangent.

Nathan [00:07:31]: Yeah, I mean, it's kind of related. This is a...

Swyx [00:07:35]: Yeah, so we have a history of RL.

Nathan [00:07:37]: Yeah, so to give the context, this paper really started because I have this more diverse background than some computer scientists, such as trying to understand what the difference of a cost function or a reward function and a preference function would be without going into all of the details. Costs are normally things that control theorists would work with in these kind of closed domains, and then reinforcement learning has always worked with rewards that's central to the formulation that we'll see, and then the idea was like, okay, we now are at preferences, and each step along the way there's kind of different assumptions that you're making. We'll get into these, and those assumptions are built on other fields of work. So that's what this slide is going to say, it's like RLHF, while directly building on tools from RL and language models, is really implicitly impacted and built on theories and philosophies spanning tons of human history. I think we cite Aristotle in this paper, which is fun. It's like going pre-BC, it's like 2,300 years old or something like that. So that's the reason to do this, I think. We kind of list some things in the paper about summarizing what different presumptions of RLHF could be. I think going through these is actually kind of funny. It's fun to talk about these, because they're kind of grab bags of things that you'll see return throughout this podcast that we're talking about it. The core thing of RLHF that, in order to be a believer in this, is that RL actually works. It's like, if you have a reward function, you can optimize it in some way and get a different performance out of it, and you could do this at scale, and you could do this in really complex environments, which is, I don't know how to do that in all the domains. I don't know how to exactly make chat GPT. So it's kind of, we'll overshadow everything. And then there's, go from something kind of obvious like that, and then you read the von Neumann-Morgenstern utility theorem, which is essentially an economic theory that says you can weight different probabilities of different people, which is a theoretical piece of work that is the foundation of utilitarianism, and trying to quantify preferences is crucial to doing any sort of RLHF. And if you look into this, all of these things, there's way more you could go into if you're interested in any of these. So this is kind of like grabbing a few random things, and then kind of similar to that is the Bradley-Terry model, which is the fancy name for the pairwise preferences that everyone is doing. And then all the things that are like, that Anthropic and OpenAI figured out that you can do, which is that you can aggregate preferences from a bunch of different people and different sources. And then when you actually do RLHF, you extract things from that data, and then you train a model that works somehow. And we don't know, there's a lot of complex links there, but if you want to be a believer in doing this at scale, these are the sorts of things that you have to accept as preconditions for doing RLHF. Yeah.

Swyx [00:10:09]: You have a nice chart of like the sort of intellectual history of RLHF that we'll send people to refer to either in your paper or in the YouTube video for this podcast. But I like the other slide that you have on like the presumptions that you need to have for RLHF to work. You already mentioned some of those. Which one's underappreciated? Like, this is the first time I've come across the VNM Utility Theorem.

Nathan [00:10:29]: Yeah, I know. This is what you get from working with people like my co-host on the podcast, who is a sociologist by training. So he knows all these things and who the philosophers are that founded these different ideas like utilitarianism. But there's a lot that goes into this. Essentially there are even economic theories where there's debate over whether or not preferences exist at all, and there are different types of math you can use depending on whether or not you actually can model preferences at all. So it's pretty obvious that RLHF is built on the math that thinks you can actually model any human preference. But this is the sort of thing that's been debated for a long time. So all the work that's here is work people hear about in their AI classes, like Jeremy Bentham and hedonic calculus; these are the side of work where people assume that preferences can be measured. And this is where I kind of go on a rant and say that in RLHF, calling things a preference model is a little annoying, because there's no inductive bias of what a preference is. If you were to learn a robotic system and you learned a dynamics model, hopefully that actually mirrors the dynamics of the world in some way. But with a preference model, it's like, oh my God, I don't know what this model, what ChatGPT, encodes as any sort of preference or what I would want it to be in a fair way. Anthropic has done more work on trying to write these things down. But even if you look at Claude's constitution, that doesn't mean the model believes these things. It's just trained to prioritize these things. And that's what the later points are getting at: what RLHF is doing, and whether it's actually a repeatable process in the data and in the training, that's just unknown. And we have a long way to go before we understand what this is and the link between preference data and any notion of writing down a specific value.

Alessio [00:12:05]: Does the disconnect between the sociology work and the computer science work already exist, or is it a recent cross-contamination? Because when we had Tri Dao on the podcast, he said FlashAttention came to be because at Hazy Research they have so much overlap between systems engineers and deep learning engineers. Is it the same in this field?

Nathan [00:12:26]: So I've gone to a couple of workshops with the populations of people who you'd want to include in this. I think the reason why it's not really talked about is just because the RLHF techniques that people use were built in labs like OpenAI and DeepMind, where there are some of these people. These places do a pretty good job of trying to get these people in the door when you compare them to normal startups. But they're not bringing in academics from economics, like social choice theory. There's just too much. The criticism of the paper this is based on is like, oh, you're missing these things in RL, or at least this decade of RL, and it would literally be bigger than the Sutton and Barto book if you were to include everyone. So it's really hard to include everyone in a principled manner when you're designing this. It's just a good way to understand and improve the communication of what RLHF is, and what a good reward model for society would be. It really probably comes down to what an individual wants, and it'll probably motivate models to move more in that direction and just be a little bit better about the communication, which is a recurring theme in my work: I just get frustrated when people say things that don't really make sense, especially when it's going to manipulate individuals' values or manipulate the general view of AI or anything like this. So that's kind of why RLHF is so interesting. It's very vague in what it's actually doing, while the problem specification is very general.

Swyx [00:13:42]: Shall we go to the, I guess, the diagram here on the reinforcement learning basics? Yeah.

Nathan [00:13:47]: So reinforcement learning, I kind of mentioned this, it's a trial and error type of system. The diagram in the slides is really this classic thing where you have an agent interacting with an environment. The agent has some input to the environment, which is called the action. The environment returns a state and a reward, and that repeats over time, and the agent learns based on these states and these rewards that it's seeing, and it should learn a policy that makes the rewards go up. That seems pretty simple, but then if you try to mentally map what this looks like in language, the language models don't make this easy. With a language model, it's very hard to define what an environment is. If the language model is the policy and it's generating, the environment should really be a human, but setting up the infrastructure to take tens of thousands of prompts, generate completions for them, show them to a human, collect the human responses, and then shove that into your training architecture is very far away from working. So we don't really have an environment. We just have a reward model that returns a reward, and the state doesn't really exist when you look at it like an RL problem. What happens is the state is a prompt, and then you do a completion, and then you throw it away and you grab a new prompt. As an RL researcher, you would think of this as taking a state, getting some completion from it, looking at what that is, and kind of iterating on it, and all of that isn't here, which is why you'll hear RLHF referred to as a bandit problem, which is kind of like you choose one action and then you watch the dynamics play out. There are many more debates that you can have in this. If you get the right RL people in the room, they'll debate whether this is even RL when you zoom into what RLHF is doing.
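For readers who want to see this framing concretely, here is a minimal sketch of the bandit-style loop Nathan describes. The `policy` and `reward_model` objects are hypothetical placeholders for a real RLHF stack, not any particular library's API.

```python
# One "bandit" interaction: the prompt is the only state, the full completion is
# the action, and the episode ends as soon as a scalar reward comes back.

def rlhf_bandit_step(prompt: str, policy, reward_model) -> float:
    completion = policy.generate(prompt)               # the action
    reward = reward_model.score(prompt, completion)    # scalar reward, no next state
    policy.update(prompt, completion, reward)          # RL update (e.g. PPO) happens here
    return reward


def rlhf_training_pass(prompts, policy, reward_model):
    # Each prompt is an independent one-step episode; nothing carries over between
    # prompts, which is why this looks more like a bandit than classic multi-step RL.
    return [rlhf_bandit_step(p, policy, reward_model) for p in prompts]
```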

Alessio [00:15:22]: Does this change as you think about a chain of thought reasoning and things like that? Like does the state become part of the chain that you're going through?

Nathan [00:15:29]: There's work that I've mentioned on one slide called process reward models that essentially rewards each step in the chain of thought reasoning. It doesn't really give you the interaction part, but it does make it a little bit more fine-grained, where you can think about it as at least having many states from your initial state. That formulation I don't think people have fully settled on. I think there's a bunch of great work out there; even OpenAI is releasing a lot of this, and Let's Verify Step by Step is their pretty great paper on the matter. I think in the next year that'll probably get made more concrete by the community, like whether chain of thought reasoning is more like RL; we can talk about that more later. That's a more advanced topic than we probably should spend all the time on.
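A rough sketch of the process-reward idea, assuming a hypothetical `step_reward_model` that scores one reasoning step at a time; this illustrates the concept, not the exact setup from the Let's Verify Step by Step paper.

```python
def score_reasoning_steps(prompt: str, steps: list[str], step_reward_model) -> list[float]:
    # Instead of one score for the final answer, each chain-of-thought step gets
    # its own scalar, so you effectively have many "states" grown from the prompt.
    rewards = []
    context = prompt
    for step in steps:
        rewards.append(step_reward_model.score(context, step))
        context = context + "\n" + step  # the scored step extends the context/state
    return rewards
```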

Swyx [00:16:13]: RLHF for decision making. You have a slide here that compares pre-deep RL versus deep RL.

Nathan [00:16:19]: This is getting into the history of things, which is showing that the work that people are using now really came from well outside of NLP, and it came before deep learning was big. Next up is this paper, TAMER, which is from 2008, with some names that are still really relevant in human-centric RL, Bradley Knox and Peter Stone. If you have an agent take an action, you would just have a human give a score from zero to one as a reward rather than having a reward function. And then with that classifier, you can do something with a policy that learns to take actions to maximize that reward. It's a pretty simple setup. It works in simple domains. And then the reason why this is interesting is you compare it to the paper that everyone knows, which is this Paul Christiano et al. Deep Reinforcement Learning from Human Preferences paper, which is where they showed that with learning from human preferences, you can solve the basic RL tasks of the time. So various control problems in simulation, and this human preferences approach had higher rewards in some environments than if you just threw RL at the environment that returned a reward. So the preferences thing was you took two trajectories, in this case complete trajectories of the agent, and the human was labeling which one is better. You can see how this comes to be like the pairwise preferences that are used today that we'll talk about. And there's also a really interesting nugget, which is that the trajectory the humans were labeling over has a lot more information than the RL algorithm would see if you just had one state, which is why people think the performance in this paper was so strong. But I still think it's surprising that there isn't more RL work of this style happening now. This paper is from 2017, so it's like six years later and I haven't seen things that are exactly similar, but it's a great paper to understand where stuff that's happening now kind of came from.

Swyx [00:17:58]: Just on the Christiano paper, you mentioned the performance being strong. I don't remember what results should I have in mind when I think about that paper?

Nathan [00:18:04]: It's mostly like if you think about an RL learning curve, which is like on the X axis, you have environment interactions on the Y axis, you have performance. You can think about different like ablation studies of between algorithms. So I think they use like A2C, which I don't even remember what that stands for as their baseline. But if you do the human preference version on a bunch of environments, like the human preference labels, the agent was able to learn faster than if it just learned from the signal from the environment, which means like it's happening because the reward model has more information than the agent would. But like the fact that it can do better, I was like, that's pretty surprising to me because RL algorithms are pretty sensitive. So I was like, okay.

Swyx [00:18:41]: It's just one thing I do want to establish as a baseline for our listeners. We are updating all the weights. In some sense, the next token prediction task of training a language model is a form of reinforcement learning. Except that it's not from human feedback. It's just self-supervised learning from a general corpus. There's one distinction which I love, which is that you can actually give negative feedback. Whereas in a general sort of pre-training situation, you cannot. And maybe like the order of magnitude of feedback, like the Likert scale that you're going to talk about, that actually just gives more signal than a typical training process would do in a language model setting. Yeah.

Nathan [00:19:15]: I don't think I'm the right person to comment exactly, but you can make analogies that reinforcement learning is self-supervised learning as well. There are a lot of things that will point to that. I don't know whether or not it's a richer signal. I think that could be seen in the results. It's a good thing for people to look into more. Since reinforcement learning uses so much less compute, it is a richer signal in terms of its impact. Because if they could do what RLHF is doing at pre-training, they would, but they don't know how to have that effect in a stable manner. Otherwise everyone would do it.

Swyx [00:19:45]: On a practical basis, as someone fine-tuning models, I have often wished for negative fine-tuning, which pretty much doesn't exist in OpenAI land. And it's not the default setup in open-source land.

Nathan [00:19:57]: How does this work in diffusion models and stuff? Because you can give negative prompts to something like Stable Diffusion or whatever. It's for guidance.

Swyx [00:20:04]: That's for CLIP guidance.

Nathan [00:20:05]: Is that just from like how they prompt it then? I'm just wondering if we could do something similar. It's another tangent.

Swyx [00:20:10]: I do want to sort of spell that out for people in case they haven't made the connection between RLHF and the rest of the training process. They might have some familiarity with it.

Nathan [00:20:19]: Yeah. The upcoming slides can really dig into this. In this 2018 paper, there was a position paper from a bunch of the same authors from the Christiano paper and from the OpenAI work that everyone knows, where they write about what a preference reward model could do to solve alignment for agents. That's based on two assumptions. The first assumption is that we can learn user intentions to a sufficiently high accuracy. That doesn't land with me because I don't know what that means. But the second one is pretty telling in the context of RLHF, which is that for many tasks we want to solve, evaluation of outcomes is easier than producing the correct behavior. And this is the whole thing. We can compare two poems that the model generates, and it can be viewed as liking a positive example, or it could be viewed as really disliking a negative example. And that's what I think a lot of people are doing in the harm space: a harmful response from a language model, whether or not you agree with the company's definition of harms, is a really bad negative example, and they downweight it by preferring something more benign in the RLHF process, among other ways of dealing with safety. So that's a good way of saying that this kind of comparison between positive and negative examples is core to all of the RLHF work that has continued.

Swyx [00:21:29]: People often say, I don't know what I want, but I'll know when I see it. This is that expressed in reinforcement learning tools.

Nathan [00:21:35]: Yeah, it is. Yeah, it is. That's what everyone's doing in the preference modeling stage that we'll get to. Yeah. Yeah. And you can see there are more papers. This is really just to have all the links for people that go deeper. There's a Ziegler et al. paper in 2019, which shows that you can do this RLHF process on language models. This familiar diagram starts to emerge in 2019, and it's just to show that this goes really far back. I think we can kind of breeze through some of these. And then 2020 is the first OpenAI experiment that I think caught people's eyes, which is this Learning to Summarize experiment. It has this three-step process that we'll go into more when I get into the main concepts. But this is the first time you see this diagram that they reuse with InstructGPT, they reuse with ChatGPT. And the types of examples that they would have, I don't think I need to read these exactly, but one that I have read a whole bunch of times is that they took these prompts from Reddit that were like, explain like I'm five, or get career advice, and people really pour their heart and soul into these. So these are multi-paragraph pieces of writing. And then they essentially do comparisons between a vanilla language model, I think it was either GPT-2 or GPT-3, I don't always get the exact years.

Swyx [00:22:42]: 3 was early 2020. So that's about right.

Nathan [00:22:45]: Yeah. So this is probably done with GPT-2. It doesn't really matter. But the language model does normal things when you do few-shot, which is that it repeats itself. It doesn't have nice text. And what they did is that this was the first time where the language model would generate pretty nice text as an output. It was restricted to the summarization domain. But I guess this is where I wish I was paying attention more, because I would see the paper, but I didn't know to read the language model outputs and understand this qualitative sense of the models very well then. Because you look at the plots in the papers, these Learning to Summarize and InstructGPT papers have incredibly pretty plots, just nicely separated lines with error bars, and they're like, supervised fine-tuning works, the RL step works. But if you were early to see how different the language that was written by these models was, I think you could have been early to things like ChatGPT and knowing RLHF would matter. And now I think the good people know to chat with language models, but not even everyone does this. People are still looking at numbers. And I think OpenAI probably figured it out when they were doing this, how important that could be. And then they had years to kind of chisel away at that, and that's why they're doing so well now. Yeah.

Swyx [00:23:56]: I mean, arguably, you know, it's well known that ChatGPT was kind of an accident that they didn't think it would be that big of a deal. Yeah.

Nathan [00:24:02]: So maybe they didn't. Maybe they didn't, but they were getting the proxy that they needed.

Swyx [00:24:06]: I've heard off the record from other labs that it was in the air. If OpenAI didn't do it, someone else would have done it. So you've mentioned a couple of other papers that are very seminal to this period. And I love how you say way back when in referring to 2019.

Nathan [00:24:19]: It feels like it in my life.

Swyx [00:24:21]: So how much should people understand the relationship between RLHF, instruction tuning, PPO, KL divergence, anything like that? Like how would you construct the level of knowledge that people should dive into? What should people know at the high level? And then if people want to dive in deeper, where do they go? Is instruct tuning important here or is that part of the overall process towards modern RLHF?

Nathan [00:24:44]: I think for most people, instruction tuning is probably still more important in their day-to-day life. I think instruction tuning works very well. You can write samples by hand that make sense. You can get the model to learn from them. You could do this with very low compute. It's easy to do almost with no-code solutions at this point. And the loss function is really straightforward. And then if you're interested in RLHF, you can kind of learn from it from a different perspective, which is how the instruction tuning distribution makes it easier for your RLHF model to learn. There are a lot of details depending on your preference data, if it's close to your instruction model or not, if that matters. But that's really at the RLHF stage. So I think it's nice to segment and just kind of understand what your level of investment and goals are. I think instruction tuning still can do most of what you want to do. And if you want to think about RLHF, at least before DPO really had taken off at all, it would be like, do you want to have a team of at least five people if you're really thinking about doing RLHF? I think DPO makes it a little bit easier, but that's still really limited to kind of one dataset that everyone's using at this point. Everyone's using this UltraFeedback dataset, and it boosts AlpacaEval, MT-Bench, TruthfulQA, and the qualitative model a bit. We don't really know why. It might just be the dataset combined with the method, but you've got to be ready for a bumpy ride if you're wanting to try to do RLHF. I don't really recommend most startups to do it unless it's going to provide them a clear competitive advantage in their niche, because you're not going to make your model ChatGPT-like, better than OpenAI or anything like that. You've got to accept that there's some exploration there, and you might get a vein of benefit in your specific domain, but I'm still like, oh, be careful going into the RLHF can of worms. You probably don't need to.

Swyx [00:26:27]: Okay. So there's a bit of a time skip in what you mentioned. DPO is like a couple months old, so we'll leave that towards the end. The main result that I think most people talk about at this stage, we're talking about September 2020 and then going into, I guess, maybe last year, was Vicuña as one of the more interesting applications of instruction tuning that pushed LLaMA 1 from, let's say, a GPT-3-ish model to a GPT-3.5 model in pure open source with not a lot of resources. I think, I mean, they said something like, you know, they used like under $100 to make

Nathan [00:26:58]: this. Yeah. Instruction tuning can really go a long way. I think the claims of ChatGPT level are long overblown in most of the things in open source. That's not to take away from Vicuña, which was a huge step, and it's just kind of showing that instruction tuning with the right data will completely change what it feels like to talk with your model. Yeah.

Swyx [00:27:19]: From text completion to actually chatting back and forth. Yeah. Yeah.

Nathan [00:27:23]: Instruction tuning can be multi-turn. Just having a little bit of data that's like a couple of turns can go a really long way. That was like the story of the whole first part of the year is like people would be surprised by how far you can take instruction tuning on a small model. I think the things that people see now is like the small models don't really handle nuance as well and they could be more repetitive even if they have really good instruction tuning. But if you take that kind of 7 to 70 billion parameter jump, like the instruction tuning at the bigger model is like robustness, little things make more sense. So that's still just with instruction tuning and scale more than anything else.

Swyx [00:27:56]: Excellent. Shall we go to technical overview?

Nathan [00:27:58]: Yeah. This is kind of where we go through my own version of this three-phase process. You can talk about instruction tuning, which we've talked about a lot. It's funny because of all these things, instruction tuning has the fewest slides, even though it's the most practical thing for most people. We could save the debate about whether the big labs still do instruction tuning for later, but that's a coming wave for people. And then preference data and training, and then what does reinforcement learning optimization actually mean? We talk about these sequentially because you really have to be able to do each of them to be able to do the next one. You need to be able to have a model that's chatty or helpful and instruction-following. Every company has their own word that they like to assign to what instructions mean. And then once you have that, you can collect preference data and do some sort of optimization.

Swyx [00:28:39]: When you say word, you mean like angle bracket inst or do you mean something else?

Nathan [00:28:42]: Oh, I don't even know what inst means, but just saying they use the adjective that they like. I think Anthropic also likes steerable; that's another one.

Swyx [00:28:51]: Just the way they describe it. Yeah.

Nathan [00:28:53]: So instruction tuning, we've covered most of this; it's really about how you should try to adapt your models to specific needs. It makes models that were only okay extremely comprehensible. A lot of the time it's where you start to get things like chat templates. So if you want to do system prompts, if you want to ask your model to act like a pirate, that's one of the ones I always do, which is always funny, or act like a chef, anything, this is where those types of things that people really know in language models start to get applied. So it's good as a kind of starting point because this chat template is used in all of these things down the line, but it's a basic pointer. Once you see this with instruction tuning, you really know it: you take things like Stack Overflow, where you have a question and an answer, and you format that data really nicely. There are much trickier things that people do, but I still think the vast majority of it is question-answer. Please explain this topic to me, generate this thing for me. That hasn't changed that much this year. I think people have just gotten better at scaling up the data that they need. Yeah, this is where this talk will kind of take a whole left turn into more technical detail land. I put a slide with the RLHF objective, which I think is good for people to know. I've started going back to this more, just to understand what is trying to happen here and what type of math people could do. I think because of this algorithm, we've mentioned this, it's in the air, direct preference optimization, but everything kind of comes from an equation of trying to learn a policy that maximizes the reward. The reward is some learned metric. A lot can be said about what the reward should be, subject to some constraint. The most popular constraint is the KL constraint, which is just a distributional distance. Essentially, in language models, that means if you have a completion from your instruction or RLHF model, you can compare that completion to a base model. And looking at the log probs from the model, which are essentially how likely each token is, you can see a rough calculation of the distance between these two models, just as a scalar number. I think what that actually looks like in code, you can look at it; it'd be a sum of log probs that you get right from the model. It'll look much simpler than it sounds, but it is just there to make the optimization stay on track.
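For reference, the objective being described is usually written something like the following; the notation here is a common convention rather than a formula quoted from the slides:

$$\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\left[ r_\phi(x, y) \right] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \right],$$

where the KL term is typically estimated from the sampled completion as a sum of per-token log-probability differences between the RLHF policy $\pi_\theta$ and the frozen reference model $\pi_{\mathrm{ref}}$:

$$\mathbb{D}_{\mathrm{KL}} \approx \sum_{t} \left( \log \pi_\theta(y_t \mid x, y_{<t}) - \log \pi_{\mathrm{ref}}(y_t \mid x, y_{<t}) \right).$$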

Make sure it doesn't overfit to the RLHF data. Because we have so little data in RLHF, overfitting is really something that could happen. I think it'll fit to specific features that labelers like to see, that the model likes to generate, punctuation, weird tokens like calculator tokens. It could overfit to anything if it's in the data a lot and it happens to be in a specific format. And the KL constraint prevents that. There's not that much documented work on that, but there's a lot of people that know if you take that away, it just doesn't work at all. I think it's something that people don't focus on too much. But the objective, as I said, it's just kind of, you optimize the reward. The reward is where the human part of this comes in. We'll talk about that next. And then subject to a constraint, don't change the model too much. The real questions are, how do you implement the reward? And then how do you make the reward go up in a meaningful way? So like a preference model, the task is kind of to design a human reward. I think the equation that most of the stuff is based on right now is something called a Bradley-Terry model, which is like a pairwise preference model where you compare two completions and you say which one you like better. I'll show an interface that Anthropic uses here. And the Bradley-Terry model is really a fancy probability between two selections. And what's happening in the math is that you're looking at the probability that the chosen completion, the one you like better, is actually the better completion over the rejected completion. And what these preference models do is they assume this probability is correlated to reward. So if you just sample from this probability, it'll give you a scalar. And then you use that reward later on to signify what piece of text is better. I'm kind of inclined to breeze through the math stuff because otherwise, it's going to be not as good to listen to.

Alessio [00:32:49]: I think people want to hear it. I think there's a lot of higher level explanations out there. Yeah.

Nathan [00:32:55]: So the real thing is you need to assign a scalar reward of how good a response is. And that's not necessarily that easy to understand. Because if we go back to one of the first works, I mentioned this TAMER thing for decision making. People tried that with language models, which is, if you have a prompt and a completion and you just have someone rate it from 0 to 10, could you then train a reward model on all of these completions and 0 to 10 ratings and see if you can get ChatGPT with that? And the answer is really kind of no. A lot of people tried that. It didn't really work. And then that's why they tried this pairwise preference thing. And it happened to work. And this Bradley-Terry model comes from the 50s. It's from these fields that I was mentioning earlier. And it's wild how much this happens. I mean, this screenshot I have in the slides is from the DPO paper. I think it might be the appendix. But it's still really around in the literature of what people are doing for RLHF.
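Written out, the Bradley-Terry model being referenced says the probability that the chosen completion $y_w$ beats the rejected one $y_l$ is a logistic function of the reward gap; again, this is standard notation rather than a formula lifted from the slides:

$$p(y_w \succ y_l \mid x) \;=\; \frac{\exp\!\big(r_\phi(x, y_w)\big)}{\exp\!\big(r_\phi(x, y_w)\big) + \exp\!\big(r_\phi(x, y_l)\big)} \;=\; \sigma\!\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big).$$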

Alessio [00:33:45]: Yeah.

Nathan [00:33:45]: So it's a fun one to know.

Swyx [00:33:46]: I'll point out one presumption that this heavily relies on. You mentioned this as part of your six presumptions that we covered earlier, which is that you can aggregate these preferences. This is not exactly true among all humans, right? I have a preference for one thing. You have a preference for a different thing. And actually coming from economics, you mentioned economics earlier. There's a theorem for this called Arrow's impossibility theorem, which I'm sure you've come across.

Nathan [00:34:07]: It's one of the many kind of things we throw around in the paper.

Swyx [00:34:10]: Right. Do we just ignore it?

Nathan [00:34:14]: We just, yeah, just aggregate. Yeah. I think the reason this really is done on a deep level is that you're not actually trying to model any contestable preference in this. You're not trying to go into things that are controversial or anything. It's really the notion of preference is trying to stay around correctness and style rather than any meaningful notion of preference. Because otherwise these companies, they don't want to do this at all. I think that's just how it is. And it's like, if you look at what people actually do. So I have a bunch of slides on the feedback interface. And they all publish this.

Swyx [00:34:43]: It's always at the appendices of every paper.

Nathan [00:34:47]: There's something about this later on in this talk, but it's good to mention now. When you're doing this preference collection, you write out a very long document of instructions to the people that are collecting this data. And it's like, this is the hierarchy of what we want to prioritize: something like factuality, helpfulness, honesty, harmlessness. These are all different things. Every company will rank these in different ways and provide extensive examples: if you see these two answers, you should select this one, and why, and all of this stuff. And then my kind of head-scratching is, why don't we check if the models actually do these things that we tell the data annotators to collect? But I think it's because it's hard to make that attribution, and it's hard to test if a model is honest and stuff. It would just be nice to understand the causal mechanisms as a researcher, or whether our goals are met. But at a simple level, what it boils down to, I have a lot more images than I need. It's like you're having a conversation with an AI, something like ChatGPT. You get shown two responses, or more in some papers, and then you have to choose which one is better. I think something you'll hear a lot in this space is something called a Likert scale. Likert is a name. It's a name from probably some research in economics, decision theory, something. But essentially, it's a type of scale where if you have integers from like one to eight, the middle numbers will represent something close to a tie, the smallest numbers will represent one model being way better than the other, and the biggest numbers will be the other model being better. So in the case of one to eight, if you're comparing models A and B, you return a one if you really liked option A, you return an eight if you really liked B, and then like a four or five if they were close. There are other ways to collect this data. This one's become really popular. We played with it a bit at Hugging Face. It's hard to use. Filling out this preference data is really hard. You have to read multiple paragraphs. It's not for me. Some people really like it. I'm like, I can't imagine sitting there and reading AI-generated text and having to do that for my job. But a lot of these early papers in RLHF have good examples of what was done. The one I have here is from Anthropic's collection demo because it was from slides that I did with Anthropic. But you can look these up in the various papers. It looks like ChatGPT with two responses, and then you have an option to say which one is better. It's nothing crazy. The infrastructure is almost exactly the same, but they just log which one you think is better. I think places like Scale are also really big in this, where a lot of the labeler companies will help control who's doing how many samples. You have multiple people go over the same sample once, and what happens if there's disagreement? I don't really think this disagreement data is used for anything, but it's good to know what the distribution of prompts is, who's doing it, how many samples you have, controlling the workforce. All of this is very hard. A last thing to add is that a lot of these companies do collect optional metadata. I think the Anthropic example shows a rating of how good the prompt or the conversation was, from good to bad, because these things matter. 
There's kind of a quadrant of preference data in my mind. There's comparing a good answer to a good answer, which is really interesting signal. And then there's the option of comparing a bad answer to a bad answer, where you don't want to train your model on two different issues. We did this at Hugging Face, and our data was like, we don't know if we can use this, because a lot of it was just bad answer to bad answer, because you're rushing to try to complete this real contract. And then there's also good answer to bad answer, which I think is probably pretty reasonable to include. You just prefer the good one and move on with your life. But those are very different scenarios. I think the OpenAIs of the world are all in good answer versus good answer, and have learned to eliminate everything else. But when people try to do this in open source, it's probably like what Open Assistant saw: there are just a lot of bad answers in your preference data, and you're like, what do I do with this? Metadata flags can help. I threw in the InstructGPT metadata. You can see how much they collect here: everything from whether the model fails to actually complete the task, hallucinations, different types of offensive or dangerous content, moral judgments, expressed opinions. I don't know exactly if they're doing this now, but you can kind of see why doing RLHF at scale and prioritizing a lot of different endpoints would be hard, because these are all things I'd be interested in if I was scaling up a big team to do RLHF and looking at what is going into the preference data. You do an experiment and you're like, okay, we're going to remove all the data where they said the model hallucinates, just that, and then retrain everything. What does that do?
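To make the shape of this data concrete, here is a sketch of what one preference record with optional metadata might look like; the field names are illustrative, not the actual InstructGPT or Anthropic schema.

```python
from dataclasses import dataclass, field

@dataclass
class PreferenceRecord:
    prompt: str
    chosen: str        # the completion the labeler preferred
    rejected: str      # the completion the labeler passed over
    likert: int        # e.g. 1-8: the ends mean a strong preference, the middle a near tie
    flags: dict = field(default_factory=dict)  # optional metadata, e.g.
    # {"hallucination": True, "fails_task": False, "unsafe_content": False}

def drop_flagged(records: list[PreferenceRecord], flag: str) -> list[PreferenceRecord]:
    # The kind of experiment mentioned above: remove every pair where a given
    # metadata flag (say, "hallucination") was set, then retrain on the rest.
    return [r for r in records if not r.flags.get(flag, False)]
```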

Swyx [00:38:59]: Yeah, so hallucination is big, but some of these other metadata categories, and I've seen this in a lot of papers, it's like, does it contain sexual content? Does it express a moral judgment? Does it denigrate a protected class? That kind of stuff, very binary. Should people try to adjust for this at the RLHF layer or should they put it as a pipeline where they have a classifier as a separate model that grades the model output?

Nathan [00:39:20]: Do you mean for training or for deployment? Deployment. I do think that people are doing it at deployment. I think we've seen safety and other things in the RLHF pipeline. Like Llama 2 is famous for having these helpfulness and safety reward models. Deep in the Gemini report it says that Gemini has like four things, which is helpfulness, factuality, maybe safety, maybe something else. But places like Anthropic and ChatGPT and Bard almost surely have a classifier after, which is like, is this text good? Is this text bad? That's not that surprising, I think, because you could use a hundred times smaller language model and do much better at filtering than RLHF. But I do think it's still so deeply intertwined with the motivation of RLHF to be for safety that some of these categories still persist. I think that's something that will kind of settle out.
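A minimal sketch of the deployment-time pattern being described, with `policy` and `safety_classifier` as placeholder interfaces rather than any specific product's stack:

```python
def guarded_generate(prompt: str, policy, safety_classifier,
                     refusal: str = "Sorry, I can't help with that.") -> str:
    # A small classifier checks the output after the RLHF'd model, rather than
    # trying to push every safety behavior into the RLHF training itself.
    completion = policy.generate(prompt)
    if not safety_classifier.is_safe(prompt, completion):
        return refusal
    return completion
```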

Swyx [00:40:11]: I'm just wondering if it's worth collecting this data for the RLHF purpose, if you're not going to use it in any way, separate model to-

Nathan [00:40:18]: Yeah, I don't think OpenAI will collect all of this anymore, but from a research perspective, it's very insightful to know, but it's also expensive. Essentially your preference data cost scales with how many minutes it takes for you to do each task and with every extra button; it scales pretty linearly. So it's not cheap stuff.

Swyx [00:40:35]: Can we, since you mentioned expensiveness, I think you may have joined one of our spaces back when Llama 2 was released. We had an estimate from you that was something on the order of Llama 2 costing $3 to $6 million to train GPU-wise, and then it was something like $20 to $30 million in preference data. Is that something that's still in the ballpark? I don't need precise numbers.

Nathan [00:40:56]: I think it's still a ballpark. I know that the 20 million was off by a factor of four because I was converting from a prompt number to a total data point count. Essentially, when you do this in a multi-turn setting, each turn will be one data point, and the Llama 2 paper reports like 1.5 million data points, which could be like 400,000 prompts. So I would say like 6 to 8 million is safe to say that they're spending, if not more; they're probably also buying other types of data and/or throwing out data that they don't like, but it's very comparable to compute costs. But the compute costs listed in the paper are always way lower because all they have to report is what one run costs, but they're running tens or hundreds of runs. So it's just kind of a meaningless number. The data number would be more interesting.
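As a rough back-of-envelope using only the figures mentioned in this exchange (these are estimates, not reported costs):

$$\frac{1.5\text{M data points}}{400\text{k prompts}} \approx 3.75 \text{ turns per prompt}, \qquad \frac{\$6\text{--}8\text{M}}{1.5\text{M comparisons}} \approx \$4\text{--}5 \text{ per labeled comparison}.$$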

Alessio [00:41:42]: What's the depreciation of this data?

Nathan [00:41:46]: It depends on the method. With some methods, people think that it's more sensitive to, well, this is what I was saying: does the type of instruction tuning you do matter for RLHF? So depending on the method, some people are trying to figure out if you need to have what is called, and this is very confusing, on-policy data, which is where your RLHF data comes from your instruction model. I really think people in open source and academics are going to figure out how to use any preference data on any model just because they're scrappy. But there's been an intuition that to do PPO well and keep improving the model over time, and do what Meta did and what people think OpenAI does, you need to collect new preference data to kind of edge the distribution of capabilities forward. So there's a depreciation where the first batch of data you collect isn't really useful for training the model when you have the fifth batch. We don't really know, but it's a good question. And I do think that if we had all the Llama data, we wouldn't know what to do with all of it. Probably like 20 to 40% would be pretty useful for people, but not the whole dataset. A lot of it's probably kind of gibberish because they had a lot of data in there.

Alessio [00:42:51]: So do you think like the open source community should spend more time figuring out how to reuse the data that we have or like generate more data? I think that's one of the-

Nathan [00:43:02]: I think people are kind of locked into using synthetic data, and people also think that synthetic data, like GPT-4, is more accurate than humans at labeling preferences. So if you look at these diagrams, humans are at about 60 to 70% agreement, and that's what the models get to. And if humans are about 70% agreement or accuracy, GPT-4 is like 80%. So it is a bit better, which is one way of saying it.

Swyx [00:43:24]: Humans don't even agree with humans 50% of the time.

Nathan [00:43:27]: Yeah, so that's the thing. The human disagreement, or the lack of accuracy, should be a signal, but how do you incorporate that? It's really tricky to actually do that. I think that people just keep using GPT-4 because it's really cheap. It's one of my go-tos, I just say this over and over again: GPT-4 for data generation, all terms and conditions aside because we know OpenAI has those, is very cheap for getting pretty good data compared to compute or the salary of any engineer or anything. So I tell people to go crazy generating GPT-4 data if you're willing to take the organizational cloud of "should we be doing this?" But I think most people have accepted that you kind of do this, especially as individuals. They're not gonna come after individuals. I do think more companies should think twice before doing tons of OpenAI outputs, also just because the data contamination and what it does to your workflow is probably hard to control at scale.

Swyx [00:44:21]: And we should just mention at the time of recording, we've seen the first example of OpenAI enforcing their terms of service. ByteDance was caught, reported to be training on GPT-4 data and they got their access to OpenAI revoked. So that was one example.

Nathan [00:44:36]: Yeah, I don't expect OpenAI to go too crazy on this cause they're just gonna, there's gonna be so much backlash against them. And like, everyone's gonna do it anyways.

Swyx [00:44:46]: And what's at stake here, to spell it out, is like, okay, it costs like $10 to collect one data point from a human. It's gonna cost you like a 10th of a cent with OpenAI, right? So it's just orders of magnitude cheaper. And therefore people-

Nathan [00:44:58]: Yeah, and the signal you get from humans for preferences isn't that high. The signal that you get from humans for instructions is pretty high, but it is also very expensive. So the human instructions are definitely by far and away the best ones out there compared to the synthetic data. But I think the synthetic preferences are just so much easier to get some sort of signal running with, and I think people will start working in other goals there, between safety and whatever. That's something that's taking off and we'll kind of see that. I think in 2024, at some point, people will start doing things like constitutional AI for preferences, which will be pretty interesting. I think we saw how long it took RLHF to get started in open source. Instruction tuning was like the only thing that was really happening until maybe August, really. I think Zephyr was the first model that showed success with RLHF in the public, but that's a long time from everyone knowing that it was something people were interested in to having any check mark. So I accept that and think the same will happen with constitutional AI. But once people show that you can do it once, they continue to explore.

Alessio [00:46:01]: Excellent.

Swyx [00:46:01]: Just in the domain of human preference data suppliers, Scale.ai very happily will tell you that they supplied all that data for Llama 2. The other one that is probably interesting is LMSYS from Berkeley. What they're running with Chatbot Arena is perhaps a good store of human preference data.

Nathan [00:46:17]: Yeah, they released some toxicity data. They, I think, are generally worried about releasing data because they have to process it and make sure everything is safe, and they're a really lightweight operation. I think they're trying to release the preference data. If we make it to evaluation, I'd pretty much say that Chatbot Arena is the best limited evaluation that people have to learn how to use language models. And it's very valuable data. They also may share some data with people that they host models for. So if your model is hosted there and you pay for the hosting, you can get the prompts, because you're pointing the endpoint at it and that gets pinged to you, and any real LLM inference stack saves the prompts that you get. So that is some signal. I don't know if they've shared the preferences. I do think they're trying to. They're trying to do all the right things. They're just very strapped, and moving data comes with other legal and liability concerns in some cases. Awesome. So kind of looping back a little bit from that very valuable digression on what preference data is, it's worth talking about the actual loss function, because this classifier approach might not make too much sense to people. You take a language model and you chop off the end a little bit so that it outputs one number. At a technical level, it's a logit that corresponds to the probability that we talked about earlier. But in order to train this, you can't just have prompts and completions. You need to have these pairs, because, as we talked about, scalars don't really work. So in order to train it, you use the magical batching of all deep learning architectures, and you put in the chosen completion and the rejected completion at the same time, and then you end up with two numbers, and then there's this fun loss function where you essentially have to increase the difference between these two predicted numbers. It's always fun when you think about automatic differentiation: it updates the same parameters to separate these two numbers at once, and there's this loss function that you'll see in OpenAI's, Anthropic's, and everyone's papers. What it looks like is some log of a scalar with an exponential that's the difference between these two predicted rewards. It's just some fancy math around a difference, a subtraction between the predicted reward for the rejected completion and the predicted reward of the chosen completion. A fun fact is that these loss functions look different in Anthropic's and OpenAI's papers, but they're literally just log transforms of each other. So if you start exponentiating both sides and taking the log of both sides, the two papers end up being the same thing. And people don't know how to train preference models particularly well now. If you zoom into any of the details to look at the agreement number: if you look at a test set, you'll have a chosen and a rejected completion, and you can take the reward model you're training, pass in those completions, and see if the chosen predicted reward, so the scalar number, is higher than the rejected predicted reward. And the agreement numbers in all of these datasets are like that, where you see they have the 65 to 75% agreement. This just means that these scalar numbers were ordered correctly. And that's a pretty low number. It's not gonna get to a hundred percent. That goes to show the kind of deep questions at play here. 
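A minimal sketch of the pairwise loss and the agreement metric being described, assuming a reward model that already returns one scalar per prompt-completion pair; this is the generic Bradley-Terry-style objective, not any particular lab's training code.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Negative log-sigmoid of the margin: gradients push the chosen reward up and
    # the rejected reward down at the same time.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def agreement(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> float:
    # Fraction of pairs where the chosen completion got the higher scalar,
    # i.e. the 65-75% number mentioned above.
    return (r_chosen > r_rejected).float().mean().item()
```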
People are playing with different loss functions and samples and different models to try to address this, but it's really a fundamental issue. It goes back to, what does it mean to do RLHF? And we're not gonna answer that now, but it's good to know that this 65 to 75% agreement, you'll see these numbers everywhere. We don't have a hundred percent agreement between the reward model and the data, and that's fine. That's just where we're at. And we essentially take this model and then we start throwing RL at it. I think PPO, proximal policy optimization, is pretty complicated compared to what you really need to know. It really just does RL under the hood. Things like PPO learn a value function and then use the value function to update the model. If you actually look at a feedback diagram, it's more of a systems problem than an RL problem. So you'll see things like you need to have two copies of the language model. This is for the KL constraint that we talked about before. You need to have the reward model, which is either a separate reward model or a value head on your base model. And then you need to have your RL code that actually learns a value function and updates all the parameters. It just is really messy to actually set up, but if you dig into it, most people could understand what each of the components are. And then the hard parts are, how do we actually make a language model that works out of this? Which is not something that people know that well. The things that I talk about a lot are, okay, what is the signal flow? How do you access the reward model? The reward model is used in RLHF exactly how you would think. You have a prompt, the language model generates a completion, and then that completion is given a score. That score gets plugged into the whole RL stuff and it updates the parameters. That's kind of the core of it. There are a lot of different choices zooming in on where exactly you put this distance penalty between the base model and the RL model. Most people say that you just deduct it from the reward. So if you go all the way back to RL as an agent acting in the world, the reward from that world would be a combination of the reward model and any constraints, like KL, that you put on it. There are a lot of different ways to do this because a lot of RL algorithms like PPO actually have a KL constraint built into them. So it's confusing because you hear KL twice, but those are different KLs. One of them is about the text and one of them is about the value function distance or the policy distance or something like this. So those are different. It really ends up being kind of gibberish that I think is less important now, because it's more about data and infrastructure than RL details, like value functions and everything. A lot of the papers have different terms in the equations. I think InstructGPT does something where they try to get the RL model to match the instruction tuning dataset, because they were really happy with that dataset, to constrain the distribution. Llama does some different things, but I think these are all small gains over just getting a deep understanding of the data and the infrastructure setup. This is why we say there's so little RL. Now we're getting to the point where you don't even really need this to get a good model. 
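A sketch of how the KL penalty on the generated text is commonly folded into the reward, as just described; the coefficient value and tensor shapes are illustrative assumptions.

```python
import torch

def shaped_reward(rm_score: torch.Tensor,        # [batch] scalar from the reward model
                  policy_logprobs: torch.Tensor, # [batch, seq_len] from the RLHF policy
                  ref_logprobs: torch.Tensor,    # [batch, seq_len] from the frozen reference copy
                  beta: float = 0.1) -> torch.Tensor:
    # Rough KL estimate: summed log-prob difference over the sampled completion.
    kl = (policy_logprobs - ref_logprobs).sum(dim=-1)
    # Most setups simply deduct the penalty from the reward model's score; this is
    # the "KL about the text", separate from any KL built into PPO itself.
    return rm_score - beta * kl
```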
So that's why I say the RL is such a small part of actually doing RLHF; RLHF is a metaphor for all language model adaptation, and RL is one tool used at one point in time. So that's kind of where I wrap up the core overview in my mind, to say RL doesn't really do as much as people think, but you could put up flashy equations and do all sorts of stuff if you want to. I think it's kind of misleading even, because I don't think about those equations on a regular basis.

Swyx [00:52:20]: But what if we call it Q star?

Alessio [00:52:23]: Yeah.

Alessio [00:52:26]: So in your mind, the takeaway for this next generation of people working on models is maybe that the underlying theory is less important than actually getting good data, basically.

Nathan [00:52:38]: Yeah, I think it's getting good data, and we'll see. I have this advanced topics thing in the slides, which starts with the evals and then talks about a lot of different ways that people are using reward models, or constructing training signals really. And I think that's about understanding what your information flow is, whether your reward signal is good, and whether your language model is generating right, like zooming in on the tokens it's generating and kind of understanding how those things change over time. This is something where we could also talk about evaluation, but RLHF has not really been shown to improve capabilities yet. I think one of the fun ones is from the GPT-4 technical report. They essentially listed their kind of bogus evaluations; it's a hilarious table because it's like the LSAT, AP exams, and then AMC 10 and AMC 12, which are kind of reasonable evals in language model land. But they just showed that RLHF doesn't improve their evaluation metrics. We don't know if internally they have other ones.

Alessio [00:53:30]: They probably do.

Nathan [00:53:30]: But from what OpenAI has shown us externally, RLHF improves some metrics and decreases some metrics. No one can really see. I do think it does things that they care about, but RLHF is not an easy tool to make numbers go up with. It's a powerful tool to change your language model. But as we've seen with Llama and safety RLHF, that doesn't always mean that people are gonna be happy with those changes or that it's gonna do exactly what you want. It's like-

Swyx [00:53:56]: Well, I think this is intuitive. Like a lot of these tests are multiple choice and RLHF isn't necessarily intended to improve your multiple choice reasoning capabilities.

Nathan [00:54:04]: Yeah, I think it is reasonable, but I don't think a lot of people have connected the dots there. And what is in a preference data point? What if your preference data was between a correct and a wrong answer? It could conceivably do it, but I just don't think that is remotely what it is actually doing.

Swyx [00:54:22]: It's much better being a sommelier.

Nathan [00:54:24]: Yeah. That was the weirdest one that was included in the GPT-4 evals.

Alessio [00:54:29]: I just see that the last three down there. That's really funny. I can't even taste it.

Nathan [00:54:38]: Yeah, so this is essentially how to use RLHF-like things to make the model better without using PPO, because PPO is kind of a nightmare to scale. The first thing that I started with is the ideas of rejection sampling and best-of-n sampling. I think best-of-n sampling is what people often encounter first, which is the idea that you take a prompt, you generate like 10 or 20 responses from it, you pass them through a reward model, the reward model assigns a scalar to each of them, you pick the one with the highest number, and that's the one you answer the question with. It seems pretty logical to people because it's just spending more inference-time compute to make your outputs better. And it works in a lot of things. This Let's Verify Step by Step paper from OpenAI that I talked about, they use it; lots of papers use it. It's just a good thing to know that you can do: you can spend more inference compute, based on a preference dataset, to make your answers better. The thing that people are more confused about is rejection sampling, because Meta talked about it in Llama 2. Essentially, rejection sampling is putting something like best-of-n sampling in a feedback loop. Instead of just returning the best answer to a user, you take the best few answers and then you apply instruction tuning on that dataset. And then you do the instruction tuning, and then you could collect more preference data, train a new reward model, and then you rank some new outputs and you do instruction tuning again. So essentially, Llama started their RLHF process with this to get some signal out of preference data. That preference data went into a reward model, and then the reward model did a good enough ranking that it was essentially superpowered instruction tuning based on rewards. It works pretty well and is much easier to implement than PPO because it's still instruction tuning, so it's the same autoregressive loss. It's easy to plug into things like transformers and stuff like that. A lot easier to start with than whatever freaking mess doing RL at scale is going to be. So that's one. A quick nod that offline RL is something that people talk about for RLHF, essentially because your model doesn't have to generate. In that case, you look at data and it backpropagates through your reward model directly. In PPO, you have the step of needing to generate everything and passing it through the reward model. How offline RL essentially works is that all of this is kind of just done in one big dataset. I'm not an expert in this, but essentially you have much less inference cost during the RLHF process if you do offline RL. There are a few papers that people have published, not a lot of traction. I think it could take off; some people that I know in the RLHF area really think a lot of people are doing this in industry just because it makes the training process simpler in the number of things you have to have running. Different feedback types are probably going to come into play. There are papers on written feedback, or labeling multiple scores, or multiple pairwise preferences for every completion, that are coming. It's also kind of related to what we mentioned in process reward models, where you're labeling each step in the chain of thought reasoning just to make the problem more specific. It seems very likely that different feedback will be used for different domains.
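A minimal sketch of best-of-n sampling and the rejection-sampling loop described at the start of this answer; `policy.generate`, `reward_model.score`, and `policy.finetune` are placeholder interfaces, not Meta's actual pipeline.

```python
def best_of_n(prompt: str, policy, reward_model, n: int = 16) -> str:
    # Spend more inference compute: generate n candidates, keep the top-scored one.
    candidates = [policy.generate(prompt) for _ in range(n)]
    scores = [reward_model.score(prompt, c) for c in candidates]
    return candidates[scores.index(max(scores))]

def rejection_sampling_round(prompts, policy, reward_model, n: int = 16):
    # Best-of-n in a feedback loop: keep the best completions and instruction-tune
    # on them with the normal autoregressive loss, then repeat with fresh data.
    kept = [(p, best_of_n(p, policy, reward_model, n)) for p in prompts]
    policy.finetune(kept)
    return policy
```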

Chain-of-thought reasoning is great for math, and that's where these process reward models are being designed. Probably not great for things like poetry, but as any tool gets better, it gets more specific. Then we kind of get into more of a talking point, which I think is fun.
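To make the best-of-n and rejection-sampling ideas above concrete, here is a minimal Python sketch. The `generate` and `reward_model` callables are hypothetical stand-ins for whatever policy and reward model you have; this is an illustration of the idea, not the setup from LLAMA 2 or any specific paper.

```python
from typing import Callable, List, Tuple

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],             # hypothetical: samples one completion from the policy
    reward_model: Callable[[str, str], float],  # hypothetical: scores (prompt, completion) with a scalar
    n: int = 16,
) -> str:
    """Sample n completions and return the one the reward model scores highest.

    This is the "spend more inference compute for a better answer" trick:
    no weights are updated, the reward model just filters samples."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in candidates]
    best_idx = max(range(n), key=lambda i: scores[i])
    return candidates[best_idx]

def rejection_sampling_round(
    prompts: List[str],
    generate: Callable[[str], str],
    reward_model: Callable[[str, str], float],
    n: int = 16,
    keep_top_k: int = 2,
) -> List[Tuple[str, str]]:
    """One round of the rejection-sampling loop described above: keep the
    top-k completions per prompt and return them as an instruction-tuning
    dataset of (prompt, completion) pairs to fine-tune on."""
    dataset: List[Tuple[str, str]] = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n)]
        ranked = sorted(candidates, key=lambda c: reward_model(prompt, c), reverse=True)
        dataset.extend((prompt, c) for c in ranked[:keep_top_k])
    return dataset
```

In the feedback loop described above, you would fine-tune on the returned dataset with the ordinary autoregressive loss, optionally collect new preference data and retrain the reward model, and repeat.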

The next one I have is constitutional AI. I think this is something that people really just kind of misunderstood. I think most people thought that constitutional AI was creating the preference data based on the specific principles in some way. What did you two think of constitutional AI?

Swyx [00:58:10]: I'll be the dumb person and you correct me. As far as I understood, Anthropic came out and said that the best way of generating this sort of preference data or alignment is to give a second model a constitution to evaluate the first model's outputs.

Nathan [00:58:21]: Yeah.

Swyx [00:58:22]: The constitution is unspecified, but it draws from the UN Declaration of Human Rights and the Apple Terms of Service, for some reason.

Alessio [00:58:28]: Yeah.

Nathan [00:58:28]: And this leads into the question: what is the other model evaluating? And how is it evaluating in a way that you can train on? That's what I mean, people didn't think about this. A lot of the CAI paper was actually talking about instruction tuning, which is: if you have an instruction, you then have a language model that critiques the response based on principles, and then your instruction responses are closer to the constitutional principles. This was the first half; they have some acronym for all of this.

The diagram in their paper is wild on this one. I think their papers are sometimes pretty funny because they're not capabilities papers, they're alignment papers, so they don't make everything super clear. So the first half of constitutional AI is fine-tuning your instructions based on principles. That's one half. And then the second half is what people really thought that they knew, which is: how do you use this other model to provide a critique based on principles? In the paper, they essentially say what their prompt was for the synthetic feedback for generating new preferences, which is essentially: pick between these two answers based on this principle. So they're sampling from the principles in their constitution and from two candidate completions, A and B. The AI model is given the context of a certain principle to pick the A or B preference, and then the new preference dataset is just the two completions without the context of the principles. So with this sampling idea, they're sampling from like 30 principles and a wide dataset of two candidate completions across the different prompts. So to me, it's very loose; the values are not explicit in this, it's just kind of how they're guided. It's a very machine learning approach because it is relying on averages and scale to get the principles in there. But it is way less explicit than I thought it was going to be. I kind of thought there was this feedback thing in the preference data, or a check to see if the principles were satisfied, or anything like this. It's really just a modification to the RLHF setup that we've talked about, with instruction tuning and preference data collection, where there's an AI model providing critiques, and a lot of those critiques are based on sampling of constitutional values. It almost sounds more tractable in that way. But while I just say, oh, look, I figured it out, I'm guessing they do different things than they said in the paper. This paper is from 2022, it's a pretty old paper, and they're surely doing more. But it's good to know where they started, at least in this case.
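A minimal sketch of the principle-sampling step described above, in Python. The `ask_model` helper is a hypothetical wrapper around whatever feedback model you are using, and the prompt wording is illustrative, not Anthropic's actual prompt.

```python
import random
from typing import Callable, List, Tuple

def ai_preference_label(
    prompt: str,
    completion_a: str,
    completion_b: str,
    principles: List[str],
    ask_model: Callable[[str], str],  # hypothetical: returns the feedback model's text answer
) -> Tuple[str, str, str]:
    """Sample one constitutional principle, ask an AI model which completion
    better follows it, and return (prompt, chosen, rejected).

    Note that the stored preference keeps only the two completions and the
    label; the principle itself is dropped, so the constitution shapes the
    data only in aggregate, as discussed above."""
    principle = random.choice(principles)
    judge_prompt = (
        f"Consider the following principle: {principle}\n\n"
        f"Prompt: {prompt}\n\n"
        f"Response A: {completion_a}\n\n"
        f"Response B: {completion_b}\n\n"
        "Which response better follows the principle? Answer with 'A' or 'B'."
    )
    verdict = ask_model(judge_prompt).strip().upper()
    if verdict.startswith("A"):
        return prompt, completion_a, completion_b
    return prompt, completion_b, completion_a
```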

Swyx [01:00:51]: I thought the communication around the Pareto optimal improvements was helpful in understanding that you do actually want it to be more helpful and honest while maintaining the same level of harmlessness or something. Yeah, right.

Nathan [01:01:03]: Yeah, so that figure right at the top of the constitutional AI paper is worth seeing if you don't have it immediately pop into your head, where they essentially compare constitutional AI to other RLHF that they're doing internally. And that's something that most RLHF papers don't do: they have little dots on the lines to indicate intermediate checkpoints. It would be really great to see more RLHF papers showing what is happening per epoch or per half epoch of training, because most RLHF is only a few epochs, at least in the open models. People release checkpoints, but that's how we should be thinking about it, because the optimizer is so strong and we don't know what's happening in this kind of intermediate land.

Swyx [01:01:41]: I don't know if this is a relevant comparison for you, but OpenAI also recently released a weak to strong generalization paper where they actually talked about a few intermediate checkpoints for GPT-4. Any comments on the comparison between constitutional AI and weak to strong generalization?

Nathan [01:01:55]: I didn't see the paper. I think I saw people criticizing it for just being safety washing, from the fact that they're still talking about GPT-2, which is such an odd model to focus on. I didn't really look at the paper. I think that it's a thing with OpenAI: they're sharing less than they know. So I think they probably have things that are pretty cool that they're doing internally.

Swyx: And I'll summarize for listeners who may not have seen the paper, because it's impossible to keep up with everything. I do think that what constitutional AI and RLAIF represent is that we are starting to come to a point where it's just impossible for manual human preference data collection to scale, and the only way to scale this is to trust our AI overlords to model our human preferences. Constitutional AI was the first version of this. What the second version, weak-to-strong, does is anticipate a future need for superalignment, where the thing that we're trying to control is smarter than us. So you take GPT-2 and use it to supervise GPT-4, to see whether the stronger model can generalize beyond its weaker teacher, because this is what we're going to have to do in the future, when we're no longer fully in control. Are we the metaphorical GPT-2? No, we're not even in the process anymore at the point of superintelligence.

Alessio [01:03:10]: They're prepping. And they're saying this will happen. And humans will be like so far like in the dust that we just have no say in this debate.

Swyx [01:03:18]: How do we still control systems then? And weak to strong generalization seems to be the answer. And I see a lineage from constitutional AI to this.

Nathan [01:03:26]: Yeah, constitutional AI and the superalignment work are very conceptually linked. It's a group of people that has a very similar intellectual upbringing, and they've worked together for a long time, coming to the same conclusions in different ways. And I understand the argument, and I mostly just don't know. I think I'm just waiting to see more from the superalignment team, because I just didn't really put it together in my brain, quickly looking at weak-to-strong generalization, exactly how it all fits. But I'm also not a safety researcher. Yeah, but I think that could be feedback for them: I understand what synthetic data means and all of this, so how could they communicate that a little bit more specifically in this context? Because I want to know what they think about this. Which is why I like that Pareto optimal thing, because it steers the debate away from x-risk to, no, this makes the models more useful, and we can all get behind that.

Swyx: I agree.

Nathan [01:04:13]: I think the last kind of emerging direction that I have might just be this debate; you can control how long we talk about this, which is about direct preference optimization. You could go read my blog post on this, I had tried to summarize this already, but essentially DPO is a different class of algorithms. I still call it RLHF because RLHF is so vague in how it's defined. I think DPO is closer to RLHF than RLHF is to RL. You can unpack that if you need to. But what DPO is doing is essentially deriving an optimal reward function from the preference data, where the preference data is the same thing that we've talked about, and then the clever math in the paper derives the optimal policy for that, based on an implicit reward function. That reward is a ratio of log probs. It's very odd. The difference between what a DPO reward is and what a classifier reward is, is very different: the classifier is trained to output a scalar value based on this kind of contrastive loss, while DPO is purely based on the difference between two log-prob ratios. So the reward there is the ratio between the policy generation likelihood and the base model generation likelihood. I don't have intuitions for what that means yet, but what the reward actually is, is very different. The data starting point in principle could be the same. And I think we've seen a lot of successes in open source with it. It's way simpler to implement and to work with in that regard, which is why I think we'll keep seeing a lot of success with it in the short term. I think we'll keep seeing DPO models for the time being, but we won't really answer what the fundamental differences are, because it depends on your data, it depends on your infrastructure. Rumors seem to be that people still think that PPO-like methods or other RL methods have a higher top end, but I don't necessarily think, like...

Swyx [01:05:56]: Sorry, what is top end?

Nathan [01:05:57]: Just the absolute best model you could get. So I see Google and OpenAI aren't using DPO because they could do something more complicated, but that's not what academics and open source people really care about. They care about being able to improve on their methods, understand where to iterate the model, and kind of work off of each other. So in a lot of ways, I think DPO still will be what people see, but in some ways it's probably slightly more constrained. There are other ways that you could think of PPO working nicely, like in code, where whether your code runs is the score that you give it; you have to do kind of canned things to get DPO to have the same data. So there are specific cases where the DPO formulation is a little bit harder, but I expect to see more DPO models than anything else in the next six months. That's probably what most people need to know unless they're an RLHF expert. And I would love to learn more about PPO and a lot of authors in this space. The DPO authors are great to talk to; you can reach out to all three of them.
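To make the "ratio of log probs" reward described above concrete, here is a minimal PyTorch sketch of the DPO loss. It assumes you have already summed per-token log probabilities of the chosen and rejected completions under both the policy and a frozen reference model; it follows the published DPO formulation but is not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_chosen | x), summed over tokens
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_rejected | x)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_chosen | x) from the frozen base/SFT model
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_rejected | x)
    beta: float = 0.1,
) -> torch.Tensor:
    """Direct Preference Optimization loss.

    The implicit reward of a completion is beta times the log-prob ratio
    between the policy and the reference model. The loss pushes the chosen
    completion's implicit reward above the rejected one's, with no separate
    reward model and no RL rollout step."""
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

Because this is a single supervised-style update on the policy, there is no separate reward model to inspect or reuse, which is the "one update, one model" trade-off discussed below.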

Swyx [01:06:54]: So as of the time of recording, we're actually about to publish our NeurIPS recap where we talk to the authors. Yeah, so for people who are listening to this in the future, you can refer to that episode.

Nathan [01:07:02]: Yeah, so Rafael, Eric and Archit, I've talked to all of them at good length and they're all fantastic. They'll say similar things, and they'll also defend their method, because it's an awesome paper. If you want to learn what a good mathy-but-still-experimental paper in language models looks like, the DPO paper is a really good one to spend more time on. Yeah.

Swyx [01:07:25]: When I ask them questions about it, they just kind of gestured at the poster and said, look at the equation, just stare at it. And yeah, that's my criticism for them is they're still in the academic world where some of their answers reflect that. But I've done it enough with them that I understand what they're saying.

Alessio [01:07:44]: Yeah.

Swyx [01:07:44]: I would say it does remind me of FlashAttention a little bit, in the sense that it's kind of an equivalent thing to the thing it's replacing, and it's just faster, cheaper, just better.

Nathan [01:07:53]: It's a very different optimization tool. The thing in my mind that I can't get past is the difference between the control you get in training a reward model and then training a policy, because essentially everything you want your reward model to do might not be everything that you train the policy to do in the RLHF step, where you have two different prompt distributions. But with DPO, you're doing both at once, so you don't control that. And we don't know, if you have fancy engineering abstractions and test your reward model on different things, whether that separation is really important. I think that's where the benefit at the absolute biggest scale and most investment could come from. But DPO is one update. It is one model. You can't separate that. So that's the thing to know. It probably doesn't matter for most people, but it is very different. And I was asking somebody who was on some of those earlier OpenAI papers, who isn't at OpenAI anymore, and they were like, I wish we had thought of that. So it is a really cool idea, and that's the type of thing that academia still can do, and can do really well, and hopefully continues to do.

Alessio [01:08:54]: Yeah.

Swyx [01:08:54]: One thing I wanted to make sure I cover before we leave this topic: one of the DPO models that was trained, apart from Zephyr and Mixtral, which are two of the more high-profile ones, is Tulu from the Allen Institute. And you're one of the few people placed to explain it. Like, what's the Allen Institute doing here? And what's the backstory? Yeah.

Nathan [01:09:15]: So the Allen Institute for AI, I think its 10-year birthday is in January, and this is a special event for that. People should also know this is Paul Allen from Microsoft. Paul Allen owns everything in Seattle. Not literally. I mean, he's passed, and his estate is still operating in a lot of great ways. But the Allen Institute is mostly known as being a super academic lab, where they have more resources than academia and publish hit after hit of research papers. And they're trying to move more in the direction of releasing models, and this is part of why I joined. Talking with the new CEO, Ali Farhadi, I don't know if I pronounced the last name right, but he's trying to move from an org that does papers only to something that does papers, releases models, is active in policy, and maybe helps work with these for-profit institutions that don't have an established place where they could all go through to new things. So they're really trying to expand the scope. It's part of why I joined, and the Tulu 2 model is the type of thing I joined to do. They were talking about this, and I was like, okay, we should just train it and release it, because no one had done this direct preference optimization at a really big scale, like 70 billion parameters.

And this experiment is hilarious. This is classic of how everything kind of works right now in ML. I showed up, and the grad student Hamish Ivison, and I need to learn how to pronounce last names better, had some JAX DPO code built on the EasyLM framework. And we have the TPUs that we could access for research purposes. So it's like, okay, we have a huge TPU, let's just try the Zephyr recipe on 70 billion parameters. And it's literally the first run. We did no ablations, didn't change any parameters, we just copied them all over. And that's the model that people have been working with. That goes to show that there's a lot of runway in understanding and improving on this. We took the same data, just took it to a different JAX implementation and scaled it up 10x, and it still returned a model that was pretty good, on benchmarks and in people using it.

So in 2024, we'll be busy in this space. We're running data ablations to try to understand what's best. Then the Allen Institute is pre-training language models, open language models where we'll be able to share data, code, everything, the kind of thing everyone likes to get annoyed about these days, like, well, you're not releasing data. So that'll come in the new year, and then things like Tulu 2, or the recipes that we will apply to that. And we'll kind of keep doing both. As the pre-trained models get better, those will probably become more of a priority, but starting pre-training is very hard, so you still want to learn from LLAMA 2 and LLAMA 3. So that's fun. I think DPO releases are kind of becoming expected, because Mistral released a DPO model as well. There's a ton: Intel releases DPO models, Stability releases DPO models.

At some point, you just have to accept that that's where we're going, whether or not you care about the whole like DPO debate. And that's why I find it so funny because there's really interesting, like debatable questions between DPO and other RL methods. But we just won't have the answer. And it'll look like there isn't a debate because everything that is published is with DPO. But that doesn't mean that anything is answered in the time being. Yeah, kind of last of this stuff is evaluation. And these slides were prepared kind of last minute. But I think the question is, how do you evaluate these models and what you should be doing? I think the PSA is like, don't trust your numbers and actually talk to models. It's very hard to do if you're an engineer or a researcher because you have your specific thing that you're zoomed in on. And it feels like a waste of time to just go play with chat GPT or go play with chat arena. But I really don't think it is. It's something that I, this is like me telling myself what I should be doing. But there's the question of like, is the Hugging Face leaderboard good for open source? And then what else can people do?

The Hugging Face leaderboard came out of the team that I was on there. We were trying to build a framework to automatically evaluate the models that we were training and the models that people were releasing, and then have them in a central place where it could be like, look, here are the evaluation scores, this is what we're competing with. It obviously blew up. I think it's very good for companies trying to operate in the open LLM space to build businesses around it. I think it's bad for people building LLMs that they think are the best, because it's easy to overfit if you're training and focusing on them as a developer. But it's good to have a distribution of models when there are so many people training them. But now it has six evaluation tools. I can't even name all of them off the top of my head. It's ARC, HellaSwag, MMLU. There was DROP on it at one point, but they dropped DROP, which was pretty funny. TruthfulQA. And then I think maybe some other math one. I don't know.

Swyx [01:13:42]: This benchmark question is something that everyone's talking about, because there seems to be a lot of gaming going on. Is there some discussion about sort of held-out benchmarks that Hugging Face could hold on to?

Nathan [01:13:55]: Mostly it's who's going to pay for it. We're thinking about this at Allen AI too. We're specifically thinking about improving on AlpacaEval, which is-

Swyx [01:14:01]: Who's going to pay for running the evals? Right now Hugging Face is just running every eval every day?

Nathan [01:14:06]: Yeah. So they have like a thousand GPUs. At one point they were going to do more training, and it was going to be used for that. But now they do less training, and they've run a good amount of GPUs on this. In one of their blog posts, they said how much compute it was. I don't think it's a ton to run these, but you have to have hundreds of GPUs to maintain this leaderboard.

Swyx [01:14:23]: So one technical question: some of these are open source models that they don't change, so you just have to run them once. Yeah. It's only the closed source models that need to be re-evaluated.

Nathan [01:14:37]: Yeah. So if you look at the Chatbot Arena, they take specific dates. And then there's this whole controversy of, is ChatGPT from March better than ChatGPT from June? So on one of these later slides, slide 58, is the Chatbot Arena leaderboard, if you're looking later. Chatbot Arena is this thing from LMSYS that we were looking at, and on the x-axis is models. You can see that GPT-4 from March has a higher score. This is not a perfect comparison, but there are signs that are pretty funny there. There are things cooking, but you don't know who's collecting this data, what prompts they're doing. It's such a funny timeline.

Swyx [01:15:20]: So for those listening, GPT-4 March 14th is 40 Elo points higher than GPT-4 June 13th.

Nathan [01:15:26]: Yeah, it's outside of the error bars on the LMSYS thing. And the other piece of context is that GPT-4 Turbo is also notably ahead of the other GPT-4s, which kind of showed up immediately once they added it to the arena. And I was like, all the GPT-4.5 memes aside, it seems like this is effectively a bump in the model. If you zoom into this, the leaderboard is very close for many strata of models. So there are levels where you can get your model to, and it'll be really close to your peers. So in the open source, there's things like Mixtral Instruct, which is effectively a way bigger model than Mistral; Mixtral's the mixture of experts model. I'll give it credit, it's a very good model, and that's gonna be the next level once people get better at fine-tuning it. Yi-34B-Chat, that's one level. And then there was a level with the Alpacas and the Vicunas. But for all of these open source models, there's then another step up to GPT-4, and then there's another step up to GPT-4 Turbo. So the difference from GPT-4 Turbo to the GPT-4 that was first released is bigger than the difference from Tulu 2 to GPT-4. So there's something good going on there. And I was like, okay, that's a new model by my standards, but they're not gonna tell us about it. At Dev Day they said it's our new model, but they weren't like, this is our new best performing model, because the benchmark scores are probably the same, but they made it so that people like using it more.

Swyx [01:16:57]: There's some hints that 4.5 might drop at some point. We don't actually know how true those things are, but I don't think it really matters.

Nathan [01:17:03]: It's like they could call anything, they're retraining these models and they could call any of them GPT-4.5. Yeah, cool.

Swyx [01:17:10]: And for the other last points, you have a couple more extra slides here.

Nathan [01:17:14]: There's a bunch on evaluation. I think the two tools that I talk about most in research domains on RLHF are AlpacaEval and MT-Bench. They're two academically maintained leaderboards for evaluating chat capabilities. Evaluating chat is really hard. And what they both do is they have GPT-4 provide some sort of feedback. MT-Bench is called MT for multi-turn, and they have a prompt and a follow-up question. So what they do is they ask GPT-4 to score both the initial response and the second response and provide the average. I've kind of given up on following the slides; this is all on the slides if you look for it. And then AlpacaEval is a little bit different, where you're comparing a candidate model, the model we've trained. So when we're training Tulu, we submit this, and what it's doing under the hood is comparing the new model to davinci-003, which is one of OpenAI's older instruction models, and calculating the win rate that GPT-4 sees between the new model and davinci-003. It has many more prompts than MT-Bench. MT-Bench is custom prompts that they made to kind of take a stance on what a good chat model is. AlpacaEval sources theirs from Self-Instruct, which is a popular paper from AI2, plus Open Assistant, Vicuna, Koala, Anthropic's helpful-harmless data. So AlpacaEval is from sources that people know and love; MT-Bench is its own thing. We were more focused on MT-Bench at Hugging Face; at AI2, we're a little bit more focused on AlpacaEval, but it really can go either way. These are kind of table stakes to saying that you have a good RLHF model: you should be able to have a pretty good score on both of these. And then the proof is in people actually talking to it. So I think the Zephyr model from Hugging Face was a kind of step change in people's perception of open models that got integrated into a bunch of products within a few weeks. You.com was experimenting with it, and someone else, I saw some Substacker, was using it as a writing feedback bot instead of ChatGPT. But that's what happens when a good open release is there. The evaluations are good and people pick it up, and the evaluations are just enough to say, okay, we're in the right ballpark, but you never really know if the model is the one, or one of these big ones, without talking to it. So however much you talk about evals, that's still where we're at. You can't prove anything definitively, and Google is seeing that: until Gemini Ultra comes out, we don't know. It's probably a great model, but we don't know what they have.

Swyx [01:19:47]: Yeah, Gemini Pro didn't do so great on the other stuff too.

Nathan [01:19:51]: Yeah, I wanna know if Gemini Pro is just like some intermediate checkpoint or if it was like a major deliverable for them or not. Which if it wasn't a major deliverable it's probably a strategy headache for Google but it's not my problem.
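As a rough sketch of the AlpacaEval-style setup described a moment ago (candidate model versus a fixed baseline, with GPT-4 as judge), here is a minimal Python example. The `judge` callable and the answer lists are hypothetical stand-ins; the real leaderboard uses its own prompt templates and handles things like position bias, which this omits.

```python
from typing import Callable, List

def pairwise_win_rate(
    prompts: List[str],
    candidate_answers: List[str],  # outputs from the model being evaluated
    baseline_answers: List[str],   # outputs from the fixed reference model (e.g. an older instruct model)
    judge: Callable[[str], str],   # hypothetical: wraps a GPT-4 style judge, returns its text verdict
) -> float:
    """Ask the judge which answer is better for each prompt and return the
    fraction of prompts where the candidate wins."""
    wins = 0
    for prompt, cand, base in zip(prompts, candidate_answers, baseline_answers):
        verdict = judge(
            f"Prompt: {prompt}\n\n"
            f"Answer 1 (baseline): {base}\n\n"
            f"Answer 2 (candidate): {cand}\n\n"
            "Which answer is better overall? Reply with '1' or '2'."
        ).strip()
        wins += verdict.startswith("2")
    return wins / len(prompts)
```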

Alessio [01:20:05]: You have a bunch of open questions here. One of our lightning round questions is always... Yeah, we just do an inverted lightning round. Yeah, exactly.

Swyx [01:20:12]: You asked people open questions.

Nathan [01:20:16]: Oh, I mean, there's so much to do here. They're kind of a summarization of things that were hinted at in the talk to this point. I split it up in my work between data, training, and model, which is essentially: how do we evaluate what's happening at the model level with RLHF? I think big labs are indexed on their own base models, so they don't know what swapping between a Claude base model or a GPT-4 base model would change in any notion of preference, or what you do with RLHF. I think in the open we could do that. We can swap between Llama 2 and Mixtral and see: does RLHF work the same for both of those? Do they both get AlpacaEval bumps when you use the same dataset in the same framework down the line? That'd be good to know, like how sensitive RLHF is. On the data, we talk a lot about aggregation. On the research side, there's a lot of interesting things, like does getting your data from Scale or a Discord army change the quality of the data, based on professional context? They probably should do it internally; they should do internal market analysis on that line.

Swyx [01:21:18]: We should also mention there has been a report that a lot of these labelers use ChatGPT to do their work.

Nathan [01:21:25]: Yeah, I mean, I'm not surprised. So it's a lot of messy ground in RL these days. And then there are more training questions, which is, what happens at the end of the day? I mentioned what I call qualitative alignment earlier on, which is: do the models get better in ways matching the preference data's preferences? So if you collect two batches of preference data with different priorities, what does the downstream model change? I don't know if it does anything. Should all data be equal? If you have healthcare questions, should they be weighted the same as "write me a joke"? This is all implicit to deep learning. Deep learning just scales and aggregates, and I think we are going to be on that ride, but it's not necessarily what some people would call fair or good.

And then the last slide that I have is fun, which is just, John Schulman talks about this in his ICML talk. His ICML talk on proxy objectives for RLHF is public now; they made it public like three months after the conference, or some weird timeline. But he talks about things like ChatGPT being verbose and having self-doubt refusals, things that are really in vogue in the conversation right now, and how those can emerge in the process of continually trying to adjust the RLHF process based on what users are seeing in the model. And this is a sort of outer-loop optimization that no one in the open is even remotely qualified to talk about. But OpenAI does monitor it, and they'll rerun RLHF and train a new reward model with a mixture of their curated data and user prompts to try to make it work better over time. And that's the different model versions. And while there are a lot of critiques about this, they're definitely intentional and trying to fix it. I feel like it's probably whack-a-mole, where they're like, oh, there's this problem, we have the data, we can fix this. And then it pops up some new problem after doing RLHF, and they're studying this. And if you could really figure it out, this is where things start to look more like RL. You could automate it. It's just a longer timeframe of optimizing the model.

Alessio [01:23:19]: It would be cool.

Nathan [01:23:20]: But I feel like I'm years away from ever actually working on this but we can try to get details from people who are. Excellent.

Swyx [01:23:28]: Awesome.

Alessio [01:23:29]: Anything else that we missed? I think we covered a lot of it. I mean, I'm good.

Nathan [01:23:33]: I would ask you guys if you know companies that are doing this and things like that. I know some that are in the RLHF-as-a-service space, and it will become busy, I think for good reason, just because...

Swyx [01:23:44]: There's companies doing RLAIF as a service.

Nathan [01:23:46]: Yeah, both of them are. It depends if synthetic data is going to win over human data. If human data is the real winning feature in the end like it's a big capital investment. So it kind of makes sense as a VC model anyways but there's going to be both of them for a while. It'd be cool.

Alessio [01:24:01]: You see a lot of people because I know Louis Castricato is starting a company. Is there a lot of ambition in this field to start companies or is this more such a research-driven part of the stack that maybe it just stays there?

Nathan [01:24:16]: There definitely is. Because I know my former colleague, Nazneen Rajani from Hugging Face, is also starting a company in this space. The Falcon team who left Hugging Face I think is also working in this space. I don't really know; I don't know exactly what they're doing. I haven't talked to them since the guys emailed. Startups change a lot, but there are definitely a lot of people looking at this space. I mean, Scale's probably trying to do it. If I were Scale, I would want to do it. I think they've historically had trouble keeping technical ML talent, but they've started a new research lab, so that should help. It's a busy area.

Alessio [01:24:50]: Cool. Yeah. Awesome, Nathan. Thank you.

Swyx [01:24:52]: That was a masterclass. I think this is the first 201 that we've ever had and you set the bar very high.



Get full access to Latent Space at www.latent.space/subscribe
2024-01-11
Länk till avsnitt

The Accidental AI Canvas - with Steve Ruiz of tldraw

Happy 2024! We appreciated all the feedback on the listener survey (still open, link here)! Surprising to see that some people's favorite episodes were others' least, but we'll always work on improving our audio quality and booking great guests. Help us out by leaving reviews on Twitter, YouTube, and Apple Podcasts!

Big thanks to Chris Anderson for the latest review - be like Chris!

Note to the Audio-only Listener

Because of the nature of today's topic, it makes the most sense to follow along the demo on video rather than audio. There's also about 30 mins of demos and technical detail that we had to remove from the audio version, because they didn't make sense without video.

Trailer here.

Full 90min chat:

(In other words, pls jump over and watch on our YouTube if you can! Did you know we are now posting every episode to YouTube? We've been multimodal for a long time!)

Trend 1: GPT4-V Coding

You might remember Greg Brockman's hand-scribble-to-working-website demo from the GPT-4 demo from March. This was largely inaccessible to the rest of us until the GPT4-V API was released at Dev Day in November.

As mentioned in our November 2023 recap, one of the biggest viral trends was tldraw's open source "Make It Real" demo: starting from a simple wireframe and text annotations, you could create a real, functioning UI with the click of a button.

Provoking another crisis of confidence in developer circles:

And using state charts:

And provoking responses from Excalidraw, a competitor.

You can see us creating a Replit clone in this silent video here:

Since our interview, the new GPT4-V coding metagame has been merging app UIs and SQL with Supabase (another AIE Summit speaker) and other backend tools:

* generating SQL

* converting ERDs to SQL (part 2, for MariaDB)

* seeding sample data

* doing migrations

Trend 2: Latent Consistency Models

As covered in the Latent Space Paper Club in November, 3 papers drove a roughly ~100x acceleration in the speed of text to image generation over the past year:

* Consistency Models (with Ilya Sutskever)

* Latent Consistency Models (from Tsinghua)

* LCM-LoRA (also Tsinghua, same authors)

With the invaluable help of Fal.ai (friends of the show and AI Engineer Summit and progenitors of the viral GPU Rich/Poor hats mentioned on the Semianalysis episode), TLDraw has also been at the forefront of putting this research into production, with two projects:

* drawfast: add a prompt, start sketching into the canvas and see each stroke affect the drawing. Overlap multiple of them to extend and merge drawings.

* lens: a collaborative canvas where in real time people can draw and have their sketch turn into AI-generated art. Start drawing at the bottom and see it scroll into the magic canvas.

For nontechnical people in your life, we do recommend showing them lens.tldraw.com (and its predecessor that we discuss on the show) on your and their mobile devices.

The Rise of Multimodal Prompting

At the first AI Engineer Summit in October, Logan (our first guest!) declared this the Year of Multimodality. Over the next 2 months we saw an explosion of activity in multimodal: GPT-4V's API release at OpenAI Dev Day (our coverage here), LLaVA (our chat with author here on Visual Instruction Tuning), BakLLaVA, Qwen-VL, CogVLM, etc.

On today's episode we have Steve Ruiz, founder of tldraw. The project originally started as an open source whiteboard that Steve built for himself and then "accidentally made a really, really good visual multimodal prompting application environment". Turns out that infinite canvas and generative models are a very good match:

* Design is iterative: DALL-E, Midjourney, etc all work in a linear way: prompt goes in, 1-4 images come back. As you generate more, the previous images scroll away from your view. In a canvas environment, you can see the progression of your generation and visually "branch" by putting new prompts in different spaces.

* UI has "layers": when designing interfaces there are different layers to it: the functionality, the style, the state, etc. Some of what they are building in tldraw is bringing images into the canvas to influence different layers: "One thing that we've done is to bring in screenshots of other apps, like here's Stripe.com, like make it look like Stripe, you know? Or like here's Linear.com, like let's do it this way".

In the episode we spend a lot more time talking through all of these ideas and how Steve's background in fine arts came back to being really useful in building a multi-modal AI canvas. Enjoy!

Show Notes

* tldraw

* Open Source Repo

* Make Real (Wireframe to UI)

* drawfast.tldraw.com

* lens.tldraw.com

* Perfect Freehand and Perfect Arrows

* "Make Real, the story so far"

* Dog CEO

* Other whiteboarding products mentioned

* Excalidraw

* FigJam

* Adobe Whiteboard

* See also Steve's interviews on the Slow Steady Pod and TWiSt, and subscribe to his substack!

* TLDraw Wireframe kit

* TLDraw LLM starter

Timestamps

* [00:00:00] Introductions

* [00:01:02] Steve's Background In Fine Arts and Transition into Tech

* [00:08:22] The creation of tldraw and its open source origin

* [00:15:44] The Inception and Growth of tldraw

* [00:18:40] The Integration of AI with tldraw and Make It Real Feature

* [00:21:56] Discussion on Multimodal Prompting and Iterative Design

* [00:32:32] The Concept of Parallel Prompting in Design

* [00:34:11] Impact of AI on developer jobs

* [00:37:28] Additional Layers in Multimodal Prompting

* [00:45:18] Introduction of DrawFast and Lens Projects

* [00:50:03] tldraw 2.0 and the future of the project

* [00:55:41] The Competitive Landscape of Canvas Tools and tldraw's Unique Position

* [01:00:22] Advice for Founders: Following Your Interests and Desires

Transcript

Swyx: Welcome back to Latent Space. I'm very excited to have my good friend, Steve Ruiz. How are you this morning? [00:00:13]

Steve: Hey, how's it going? [00:00:14]

Swyx: I have had the good fortune of knowing you before you got famous and actually hanging out in the precise office and studio that you're recording from right now. Congrats on Make It Real. Congrats on tldraw. I think it's been something that's sort of years in the making, but it probably looks like an overnight success to a lot of people. [00:00:32]

Steve: Yeah. Thank you. It's kind of a funny story. I don't know. Where should we jump into it? [00:00:37]

Swyx: Well, I like to give you a little background on the person. You don't have a lot of detail on LinkedIn, just like myself. I just found out just before recording that you're also a late entrance into tech. So maybe just like, what's your background coming into something like tldraw? What makes you so unique at doing sort of creative collaborative experiences like that? I know you and I've actually used tldraw, so I have some appreciation for how hard this thing is. [00:01:02]

Steve: Yeah. Like you said, I kind of came into this a little late and kind of came into it from a weird angle. My background is actually in fine art and studio art. I have my master's from University of Chicago in visual art, would write about contemporary art and put together exhibitions and do my own paintings and drawings. And that was back when I was living in Chicago. And then when I moved over to the UK, you know, got a new studio, kept that going. But when I turned 30, I kind of decided I should probably make some money and work with other people closer than I was at the time. Studio art is primarily a solo thing. I'd always had kind of like an analytical kind of side to me. My day jobs were, you know, I was working for lawyers. I was doing this writing, like magazines and stuff. So when I did that kind of that switch back eventually to design and product design, I was also able to use a tiny little bit of technical skill that I had had just building like WordPress websites for myself and other artists as portfolios. Kind of take that, just some natural curiosity around the way that products work and kind of create a career direction that was more around prototyping and like technical design and kind of like doing the design on the bits of a product that really couldn't be designed otherwise. So the interactive bits, the bits which are maybe more, there's more questions about them. There's no clear answer to terms of like, how should this work? You know, in all those places, you kind of have to build something in order to, to figure out what you want to build. It turns out, you know, to skip right to the end for a moment, like canvas is full of those types of problems. So it's no surprise that I ended up there. It's like kind of an extreme form of the same problem. But yeah, so I was working, this was back in like 2017, 2018. And I used at the time a product called Framer. That was back when it was more of like a code product than what it is now, which is more of like a visual builder that is kind of backed by code. So I'm sort of just drilled into that. It was cool. Uber was using it. No one knew how it worked. No one could use it. So I got good at it and got a lot of advancement, early traction, whatever in my career based on that. But it also taught me to code, taught me to think about building things that other people are going to use. Taught me about kind of like the type of code that you write when you're in an exploratory phase rather than like in an execution, like production phase. And I actually ended up working for Framer. I did their education for a year, which was very different than the type of product design that I was doing before that. I did a lot of video tutorials and writing and tweeting, trying to figure out some way to make technical design content interesting, you know, in little chunks that people could consume. I joke that like they probably got less out of me in that job than I got out of the job itself. Like because, yeah, I walked away from that. Not sure if I'd helped anyone really learn how to use Framer, but I certainly learned how to, how to tweet and learn how to record a good GIF and learn how to talk into a microphone and all that type of stuff. And so in the next roles that I had, I worked for a company called Play out in New York who is also doing design tools and I really wanted to work in design tools after that. 
Play's doing like a mobile, I guess right now it's like just general iOS, macOS platform specific design tools where you're using actual elements from the kind of widgets from that component collection in your designs and kind of bringing that a lot closer to the end product. At the same time I started getting into open source, I'd kind of done some popular open source before. This was now 2019, it was, it was locked down. I had a little bit more time. I also had a daughter, so not that much more time. I guess that open source that I started getting into started swinging back towards some of my kind of artistic interests or studio interests and kind of visual interests. Those are the parts where I felt like the problem space was, was really underserved. It wasn't necessarily like technical problems that were really hard. It was more subjective problems where I think the thing that was lacking was the taste or the opinions or the like feeling for what good solutions were. So the first kind of problem like this that I got into was arrows. I had, you know, two boxes or two points arbitrarily placed. I want a good looking arrow, like a quote mark, like good looking arrow between the two. Well, that could be anything. That's not a math problem. Maybe it involves some angles and linear geometry and vectors and all that, but it's like the good looking part was just like my own taste and my own eye and like tons and tons of iterations and arrows are super tricky and there's a million ways for this, you know, edge cases when things are overlapping or things are too far away or too close and all this. But I was working on this and I was working on this in public on Twitter, recording gifs of boxes and arrows kind of squishing together and all that. And I think people really liked that and they liked kind of following me on this somewhat obsessive journey, which was technical, but it wasn't like it wasn't like trying to crack an algorithm. It was like trying to trying to figure out and identify the the rules governing an aesthetic experience or an ecstatic thing, which was a good looking arrow that became perfect arrows and that was pretty popular. But the next one really is what kind of broke my popularity on Twitter or just in the space and that was a project that ended up being called perfect freehand. This is a little hard to describe. If you've ever used like an iPad pencil or drew with like a stylus in Photoshop or something, like the harder you push, the thicker the line gets and the lighter you push, the thinner the line gets. It kind of is like this ink experience and that's it's not an easy problem. But if you're doing it in a kind of a Photoshop style, like raster environment, you know, the solution is pretty straightforward. You interpolate like tons and tons of tons of whatever shape you're drawing in between each point that you've actually moved your mouse to and you just change the size of that little stamp that you're making. So it's like a little circle, slightly bigger circle, slightly bigger circle, slightly bigger circle, but they're all really tightly packed together and so it looks like a kind of a line that's changing its width as it moves. My angle on this, the reason why I spent so much time on it was that I wanted to do that using vectors. I wanted to get a bunch of points in and then like a polygon that sort of defined the outside of that shape coming out because I like to work in SVG and it turned out that this was like an insanely hard problem that no one had solved. 
And if they have solved it, they certainly didn't open source it, but I couldn't find any good example of a variable width line that actually worked fast enough and consistent enough, etc. for it to be digital ink. And so again, I did this in public, did this on Twitter, a million GIFs of lines that look terrible, but you know, like slowly attracting more, like getting closer to the solution, attracting more people who had solved this problem or tried to do this or they wrote their PhD on ink and let me tell you about, you know, how arcs work in this environment and all this stuff. [00:07:35]

Swyx: Wow. [00:07:36]

Steve: And it was fantastic. I met so many good people who were experts on this or something like it. And slowly we made a really, really good, tight little library for doing exactly what I wanted: here are a bunch of mouse points, or just arbitrary points, give me back a polygon that surrounds them, and let me essentially draw a line around the edge of that polygon, fill it in, and it'll look like ink. So that was perfect freehand. And that's now used all over: Canva uses it, draw.io uses it, Excalidraw uses it. We use it at tldraw all over the place. It's just significantly better than the next best solution in that space, and there really wasn't even any known solution in that space. So someday I'm going to be checking out at a hotel and see my own ink on, you know, a little iPad or something like that. [00:08:21]
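For readers curious about the raster-style approach Steve contrasted this with, here is a minimal Python sketch: interpolate densely between input points and map pen pressure to stamp radius. The outline-polygon version that became perfect-freehand is the hard part and is not shown here; the function and parameter names below are illustrative, not from his library.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]  # (x, y, pressure in [0, 1])

def ink_stamps(
    points: List[Point],
    base_radius: float = 4.0,
    spacing: float = 0.5,
) -> List[Tuple[float, float, float]]:
    """Turn a sparse list of (x, y, pressure) samples into densely spaced
    circular 'stamps' (x, y, radius). Drawing every stamp as a filled circle
    gives the familiar pressure-sensitive ink look in a raster canvas."""
    stamps: List[Tuple[float, float, float]] = []
    for (x0, y0, p0), (x1, y1, p1) in zip(points, points[1:]):
        dist = ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
        steps = max(1, int(dist / spacing))
        for i in range(steps):
            t = i / steps
            x = x0 + (x1 - x0) * t
            y = y0 + (y1 - y0) * t
            p = p0 + (p1 - p0) * t
            # harder pressure -> thicker stamp, lighter pressure -> thinner
            stamps.append((x, y, base_radius * (0.25 + 0.75 * p)))
    return stamps
```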

Swyx: That's amazing. [00:08:22]

Steve: So that's kind of led right into tldraw is that I had integrated my ink into Excalidraw and I, you know, spent time in that code base. And I'd also like created several like infinite canvas like tools to help me build perfect freehand and visualize it and sort of do my ink pan and zoom in and, and program against this thing. And so I had done, including Globstud design, which I won't necessarily talk about, but it's a kind of like a weird experimental design tool, but anyway, it was like, it was an infinite canvas. It was like, you know, Framer, Figma, et cetera. And after doing Excalidraw and been working on these kinds of projects that were in the same area, I was like, you know, maybe there's, there's a market here for, or not even a market. It was just like, I think the thing that I want to work on next is like a general purpose, kind of like whiteboard like engine, mostly for myself. I'd built globs, but the only thing that you could put on the canvas in globs was a glob. So I had all this like code and these solutions that I, you know, was like hanging around. It could kind of see how I would adapt it. And so that's what I started doing. And that was the next story that I was kind of telling on Twitter is that like, okay, here's how selection works in something like Figma or, or something like Miro or Framer or Sketch. It's these sort of conventions that are part of this really complicated thing called like the infinite canvas, you know, going all the way back to Flash and before then, you know, Adobe Illustrator and before then all the way back. And they're all pretty consistent between products. Like if you're making a canvas this way, you have to kind of do them all. Like your undo, redo should work in a specific way. Your selection should work in a specific way. Like, you know, the camera position and how the camera moves should work in a certain way. All the modifier, like option, drag to clone. And all of those became their little vignettes of how I was building this thing. This was now like spring of 2021. And I had everyone from any infinite canvas related creative product kind of like in my inbox being like, Hey, can you come work for us? I was like, let's talk, let's do this. And so I was either going to go work for Figma or Adobe. And I ended up going with Adobe in part because I think FigJam had just come out and the team at Figma were like, well, this is competitive with FigJam. I'm like, this thing is like nothing. It's like a little open source, you know, it's like no one uses this. It's just me trying to get to 10,000 Twitter followers, but you know, it's mine. So no. So I went with Adobe, but I told them, I'm like, I don't want to start for six months. Like this is actually a pretty fun project for me. I want to get it out of my system, you know, let me start in January and just work on this. And so they said yes. And I quit working with Play and said, I'm going to go work on this little open source thing for six months. I have some contracting money in the bank. Let's drain the company account and do this. And that's not what happened. I went full time from a Wednesday on Thursday. I had a very large communications company say, Hey, we're moving a whiteboard that we've designed for specific touchscreen devices. We're moving that into the browser. It turns out people want to use the whiteboard on their phones and on their laptops and all that like they do with Miro. 
And so we need to make this thing that we wrote in C++ to be highly performant on these, you know, kind of tiny microcomputers that were part of these interactive touchscreen TVs. We have to make this work on the web and we don't think it's going to be good enough. We have to build from scratch. We don't have the team. Can we just build on what you're building? At the time, this thing wasn't open source, it was just sort of, but it was getting there And I'm like, yeah, sure. Like, give me like $75,000. I'll let you see the source code. I don't want to talk to you very often, you know, like I'm not working for you. I never want to see your code, but you can look at mine. And they said yes to that. And so I was, you know, funded for those first six months. And I got to work on this without having to feel bad about it. And I'd also eventually opened up tldraw to be like sponsorware that if you were sponsoring me on GitHub, you could access it, you know, in its kind of primitive state on tldraw.com. And it had like a couple hundred people join that way and sponsor me. So at one point, like my sponsorship was, you know, over $5,000 a month, which is not massive money, but it's like I wasn't doing anything different. So it was pretty good. That's a kind of a passive thing. Anyway, I shipped it at the end of November 2021. And it was very popular. I just open source everything. It was just like, you know, the tldraw.com app, the library, the canvas, and it was organized in a certain way. It just made it all public. Everything was MIT, you know, let's just throw this out into the world and see where it goes. Well, it went pretty far. There was like number one on Hacker News for a while. It was like the top trending repo on GitHub. A lot of people, like 40,000 people showed up at tldraw.com to use it on that launch date, which was all good. Like so far, this was all within my same narrative of, okay, this is cool. I'll make this and then I'll go do something else afterwards. The thing that really surprised me was how many teams wanted to build on this. And they weren't like, they weren't building whiteboards. They weren't Miro competitors or Figma competitors. They were just like apps that you wouldn't expect to have infinite canvases inside of them and they wouldn't have built it except that I had suddenly made this very easy. And I had suddenly shrunk the development time of this like whiteboard like feature in their product from like three years and three people to three weeks and one person and not even one person just like no new developers, no new team, no new graphics experts, no computational geometry guys. Like, you know, we can do this. The canvas itself is like React all the way down. So even if you wanted to customize it, you'd just be writing React components and then a little bit more code on top. And so I was totally overwhelmed by inbound from companies who were like, I want to build this or I want to acquire you or I want to, I want you to build something for me. Or, you know, I want this in my app, you know, how do you help me or how can I do this? And people were shipping things also like within two weeks, three weeks, like production ready. Like people had taken this and run with it. 
And so the story that I started to get around tldraw was that like, OK, well, this is this is a cool little whiteboard, but it's also kind of like filling a gap that no one knew was there in the same way that like Mapbox or Google Maps, you know, provide maps for apps that would definitely not build maps themselves. Like maps are insanely hard, like your little local food delivery app like wouldn't just wouldn't have a map in it, you know, like easy. But it is a value add. If they can have it in there, then absolutely it is a value add. It's just completely impractical to do themselves. And what I learned talking to folks was that like every PM had used Miro or used Figma or used one of these other collaborative tools. And every creative product person was like, well, this is fun. Collaboration is fun. This canvas thing is pretty cool. Like, you know, why can't we put our CRM on the canvas or why can't we do our sales stuff here? Like we're already kind of using Miro for this. Like, why couldn't we give this to our customers as well? Like, why don't we build a product around this? And it was just a technical no until, you know, November 24th, 2021, when suddenly it was like a technical maybe and there was absolutely demand. So hence the, you know, I had to call Adobe and say, no, I'm not going to come in on Monday. Like it turned out that the best possible outcome of this happened and there's actually a company here. And then I went out and I raised a seed round from Lux in New York and Amplify in California and a whole bunch of really great angels, you know, on the story of, yeah, this is cool It's a good app, feels good. Companies want it. And, you know, by then I had almost $200,000 of sponsorship, you know, and people were just signing up and signing up because there was no way to even be a customer. [00:15:44]

Swyx: You're not saying 200k a month. [00:15:46]

Steve: No, no, no. But like, I mean, I had had up to then the amount of sponsorship that I had received was around $200,000. I think some of the recurring stuff was like, like 5,000 a month. Yeah. But yeah. [00:16:00]

Swyx: Which is in the top echelon. A lot. [00:16:02]

Steve: Yeah. Oh yeah. Certainly. Just the amount of validation that had come in around this was more than usual. So I raised a round, put together a team here in London, and basically have just been building this whiteboard SDK since then. We reconfigured the project around, okay, we're going to be building this not necessarily for end users, but for teams to use as kind of an infrastructure product, a developer product, something closer to Mapbox. We were making demos to show different ways that it could be used. Certainly the collaboration thing is a big one, but also the fact that you could put anything on the canvas that you can put on a website, just because it is all HTML and CSS all the way down. And that was going really well; it was already a good story. And then I had just raised a $2 million extension for the company. While I was on the final pitch for that, Dev Day was happening at OpenAI. And in the morning I woke up and I was getting all this kind of action on Twitter, because a developer at Figma had used tldraw to make this little demo where you could draw a website, click a button, and get back a big pop-up that had your website in there. It was like a prompt: you're a developer, you just got this wireframe from your designer, can you give it back as a single page HTML file? And it would do it, it could do it. And then you could show that website to whoever is using the app. And we took that and we were like, wow, you could do so much more with tldraw. It's only scratching the surface of the type of integration that you could do. Again, we had just finished the raise. Pressure was off a little bit. It was kind of getting towards the end of the year. I was like, all right, let's just take this and have some fun. Let's make some viral s**t. Maybe we'll get like 200 likes or something like that. And it exploded. I think in the last 30 days it's at like 22 million views or something like that. It's just like Kanye West numbers. It was really, really, really popular for a couple of days. If you're on Twitter and at all technical, you might've just seen a ton of tldraw stuff on your timeline about two weeks ago, three weeks ago. [00:17:55]

Swyx: Well, so yeah, that kind of brings us up almost to today. You just released something two hours ago, which we should talk about. Maybe this is a good time to bring up the screens, you know, for those who are listening. [00:18:08]

Steve: Let me, let me share. [00:18:09]

Swyx: We're recording a video as well. You can jump over to the YouTube to see stuff, but this is an inherently visual podcast, so we have to show stuff on the screen. The incremental thing I got from your blog post. So you did do a write up, which thank you for that, because I actually didn't know that you did a write up. It was just drawn up. [00:18:26]

Steve: Oh yeah. [00:18:27]

Swyx: Videos. This is the power of open source, right? That someone else had the idea. You weren't even focused on Dev Day. Someone else had the idea and just like, you know, made it without your permission or talking with you. And then the idea could spread back to you and you could run with it. [00:18:40]

Steve: Yeah, exactly. And we had made a lot of the bits and pieces like in place already based on, you know, I mean, it's well documented or it's documented. There's tons of examples and all that. Yeah. Yeah. And I mean, it's a big library as far as an open source library goes, but yeah, you can work with it. And once this thing got popular, the first thing we did was create like a starter kit so that someone could take it and like run with it. So this is normal tldraw where you draw, you can whatever, move things around. It works if you've used Figma, if you've used Miro, it's kind of familiar to that. And you can put pretty much anything on this canvas that you want, like YouTube links, et cetera, because this canvas is HTML and CSS like divs and stuff all the way down. You can put things like YouTube videos on there. You can even make them play because again, like anything you can do in a website you can do on tldraw's canvas. What's fun is because it is a canvas all the way down, you can also like draw on top and like do the kind of canvas manipulation stuff that you might do with normal shapes, but also with this type of content. So that ended up becoming like a big part of why make it real got kind of popular. So anyway, I'll show you make it real now. This was a hastily built layer on top of the kind of tldraw engine SDK that we sent out. And the idea here is that you can make a wireframe and we're going to send it to GPT-4 with vision with like a prompt, like much like the original one that Sawyer Hood had come up with, which is you are a web developer, you work with designers, they give you wireframes and notes and screenshots and all sorts of stuff. Could be anything. Your job is to come back with a single HTML file that has all the styles, all the JavaScript, all the markup necessary in order to make a real working prototype based on what you've been sent. It also has emotional manipulation, like you love your designers and you want them to be happy and like the better your prototype is, the happier they are. Oh, in the prompts? Yeah. Yeah. Yeah. Again, it's open source. You can read the prompt. It's kind of a funny one. This is part of the joy of like a multimodal prompt is we send it the photo, which kind of looks the same as if you had done a copy and paste thing. Yeah. So like an image as well as all the text. And you had all this functionality worked out prior. [00:21:00]

Swyx: Yeah. [00:21:01]

Steve: Yeah. Yeah. [00:21:03]

Swyx: Yeah. Like that's what I find so poetic about this, that you were just ready. [00:21:06]

Steve: Yeah. It feels like we had gone off, you know, as collaboration and AI and stuff was going in one direction, we kind of just went off in our own weird, like, hey, the world is really going to need a whiteboard at some point direction. And then they kind of met us where we were at. And then we've been able to just be like, show up on day one of this new world of possibility with the thing that, if I hadn't spent the last two years building this, I would spend the next two years building this. Like it is the right product for this type of a feature. So anyway, they give us back an HTML file. We stick it into an iframe, put that onto the canvas, just like we did with that YouTube link, and I can interact with it. So it should be going from orange to pink, orange to pink, hey, it's given us a hex code. I can click the hex code and it gives me, you know, it says it's copied it to the clipboard. [00:21:55]
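
For readers who want to see the shape of that round trip, here is a minimal sketch of a make-real style call using the OpenAI Node SDK's vision input. The canvas helpers (`getSelectionAsPngDataUrl`, `putIframeShapeOnCanvas`) are hypothetical stand-ins rather than tldraw's actual API, and the prompt is a paraphrase, not the project's real system prompt.

```ts
import OpenAI from "openai";

const openai = new OpenAI(); // assumes OPENAI_API_KEY is set in the environment

// Hypothetical helpers standing in for the canvas side of the flow.
declare function getSelectionAsPngDataUrl(): Promise<string>;
declare function putIframeShapeOnCanvas(html: string): void;

const SYSTEM_PROMPT =
  "You are an expert web developer. A designer sends you wireframes, notes and screenshots. " +
  "Reply with a single self-contained HTML file (inline CSS and JavaScript) that turns them " +
  "into a working prototype.";

export async function makeReal(): Promise<void> {
  // Export the user's selection (the wireframe) as a PNG data URL.
  const imageDataUrl = await getSelectionAsPngDataUrl();

  const completion = await openai.chat.completions.create({
    model: "gpt-4-vision-preview", // the vision-capable model available at the time
    max_tokens: 4096,
    messages: [
      { role: "system", content: SYSTEM_PROMPT },
      {
        role: "user",
        content: [
          { type: "text", text: "Here is the wireframe. Return only the HTML file." },
          { type: "image_url", image_url: { url: imageDataUrl } },
        ],
      },
    ],
  });

  // The returned HTML goes into an iframe shape next to the original drawing.
  const html = completion.choices[0].message.content ?? "";
  putIframeShapeOnCanvas(html);
}
```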

Swyx: That's incredible. [00:21:56]

Steve: Like this alone is super cool. In something like V0 or some of these other kinds of prompting environments, the only way for you to then make this better, oh, maybe you can do this with ChatGPT or something, is you could write like, oh, actually, you know, you missed the labels. Like it should say orange and pink, you know, on top of this thing. And it doesn't. So you could go back here and, you know, make sure that this is, whatever, you could change the input. But because this is tldraw, because you can draw on top of this stuff, you could also, you know, write on top. You could kind of modify this and maybe even give it the same type of markup that you would give to a designer or something like that, you know, and draw some arrows or maybe paste in a screenshot and say, hey, make it look more stylistically close to this other thing. And then what you do is you select the website that they gave you back, the previous result, along with all this markup, and you use that as the new input. And so that's going to give you something kind of like an image that looks like this that you've now sent. But we've also kind of tweaked the prompt a little bit when you do include a previous result, to say like, hey, the wireframes coming back have annotations or markup based on what you sent before. And there you go. So now we have a new result where, sure enough, the labels are there, you know, it still works just like before. The button is full width and, you know, it still works just the same. So we send it back. Again, we send it the image, we send it the text, the prompt. We also send it all of the text items themselves separately, because ChatGPT is not really great at recognizing text. So we say, oh, by the way, your vision's not so good, so we've made sure to have our copywriter, you know, list out all the copy that you can use. I think we even send it back the HTML that they used for the previous result. So we just dump as much information as possible at GPT-4 with vision. And that's how you're able to get these sort of iterative results. And it is legitimately good, like it feels like work. It feels like you're actually doing stuff when you're iterating this way and slowly shaping and adding complexity, doing it step by step, you know, as you're building something. And then you can copy a link to that and open that in a new tab. Like, we host it. It's there forever. You can bookmark this. If you really just needed a slider between orange and pink, well, now you have one, you know, whether you could code it or not, or maybe it's not worth building or using a no code tool to build. But we just made that in five minutes. If you are more on the code side, you want to use this as kind of a foundation of a real project, or maybe just to see, like, how does that actually work, you can open it up in StackBlitz or CodeSandbox. I think tomorrow we'll have Repl.it. And yeah, see all the code, see what ChatGPT came up with, and kind of use it or adapt it or, you know, keep it going or do whatever you want with it. Yeah. Cause it is real. Yeah. [00:24:50]
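
A hedged sketch of how the follow-up round described above might be assembled: the annotated screenshot, the separately extracted text strings, and the previous HTML all go back in a single user message. The type and helper names are illustrative, not the project's actual code, and this pairs with the makeReal sketch earlier.

```ts
// Illustrative shape of the data gathered from the canvas for an iteration round.
type IterationInput = {
  annotatedScreenshotDataUrl: string; // PNG of the previous result plus the user's markup on top
  textItems: string[];                // every text shape, sent separately because small text is hard to read
  previousHtml: string;               // the HTML returned by the last round
};

// Builds the user message for the next request.
export function buildIterationMessage(input: IterationInput) {
  return {
    role: "user" as const,
    content: [
      {
        type: "text" as const,
        text:
          "The image shows your previous prototype with annotations and markup drawn on top of it. " +
          "Update the prototype accordingly.\n\n" +
          "Text found in the drawing (your vision may miss small text):\n" +
          input.textItems.map((t) => `- ${t}`).join("\n") +
          "\n\nPrevious HTML:\n" +
          input.previousHtml,
      },
      { type: "image_url" as const, image_url: { url: input.annotatedScreenshotDataUrl } },
    ],
  };
}
```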

Swyx: Make real. Yeah. It's interesting that you can also, I've seen some of your other demos. It looks like you're about to move us on to another. [00:24:57]

Steve: Yeah. I'm going to grab a couple. Okay. So what I have on the screen now, just to narrate and describe it, is a drawing of a kitchen timer, you know, where you can add a minute, add a second, you know, start or stop the timer, or reset the timer. And then next to it, I also have a state chart, like a state machine describing the three states of the timer: stopped, running, or complete, and what each one of those buttons should do in terms of transitions or changing the state. I think you can hand this to pretty much any designer or developer and get back a working result. [00:25:32]

Swyx: Like it's fully spec'd, sort of. I mean, our friend David Khourshid might say, you know, develop a state chart first and then, you know, plug it into XState. [00:25:38]

Steve: Yeah, exactly. Well, let's do a couple of things in parallel. First thing I'm going to do is I am just going to make a box over here and I'm going to say kitchen timer right in the middle of the box. And this is going to be the only prompt that I'm going to give it. Okay, just going to click make real with just the kitchen timer box. As you see with this multimodal prompting, someone will draw a calculator, like in a lot of complexity, and say, you know, make this real, and sure enough, you get back a really complex full calculator. But if you did the same thing and you just gave it an empty box with just the word calculator, it would give you back the same thing, because it knows what a calculator looks like and it knows how it works and all that. So next let's also give it just the user interface, without the state chart; we'll leave the state chart out, but we'll do just the user interface. And then we'll do just the state chart, you know, and say, hey, make this real. And then we'll do both the state chart and the UI. So we have four different prompts with four potentially different results based on, you know, variations of the same input. So first off, our kitchen timer, where all we did was send it a box with the word kitchen timer. It has, I don't know what this box is for, but we have a time, we have start, stop and reset. So I can double click in, I can click start. It doesn't do anything. Oh, what is this? Oh, whoa. Okay, well, if the number's there, yeah, then it'll stop. If I stop it, it stops. I can start it. It'll keep going again. Okay. And I can reset it. And there we go. The only weird thing is that it works. Yeah. It has a number input field for the number of seconds that I can type out. But yeah. You know what? In a pinch, I'll take it, if I really needed just to count 60 seconds or something. Next we have the result where the input was just my drawing of a kitchen timer. I didn't tell it it was a kitchen timer. I didn't send it the words kitchen timer and I didn't tell it how it should work, but it did produce something that kind of looks the same. Let's see if it works. So I'm going to click minute, second, start, reset. No. So unfortunately it did not make any working UI, although it did, you know, put the buttons in the right place or something like that. [00:27:51]

Swyx: Maybe it over-focuses on the UI because that's all you gave it. Yeah. [00:27:57]

Steve: Yeah. I mean, there is in the prompt kind of language around, like use what you know about the way that applications work in order to sort of fill in the blanks here in terms of the behavior and all that. But let's go to the next one. This one is where we only sent it the state chart. There's also something in the prompt that says like, if it's red, it's not part of the UI. Like if it's red, then like treat that as an annotation rather than a thing that you should, should actually make. So this time it actually looks a lot like the previous one. But it does have these minute, second buttons. Oh, weird. It has plus and minus minute, seconds, and it also has this like stop state written at the bottom. So there's four buttons, you know, minus minute, minus second, plus minute, plus second, and then there's start and reset. So does it work? I can add a minute. I can also subtract a minute. All right. [00:28:44]

Swyx: Honestly, that's pretty smart. [00:28:45]

Steve: I can add a second. I can also, yeah. And if I press start, we're now in the running state. Apparently it's going up rather than down. And I can reset it and okay. I'm just curious if I do give it an additional prompt here and I say, like, this should count down, not up, and just kind of do an arrow towards the start button here. Let me see if that'll make a real one. And then finally we look at the other example, which is where we sent the state chart and the UI. We get something that looks much, much more like our user interface. The question is, does it work? Yes, it does. Perfect. I can stop it. Amazing. [00:29:24]

Swyx: Start it. [00:29:25]

Steve: That's a working timer. Reset it. Wonderful. And in this case, my feedback was accepted. I went back to the one where I'd asked it to count down and not up, and it all looks the same, but now it's counting down. So I think for folks, especially folks who have worked in design, and in sort of user experience design in particular, this should feel pretty familiar: kind of sketching out and trying to do your best to specify what it is you want, and seeing what you get back from your designers, what you get back from your developer. But having an environment in which to have that game loop, that iteration cycle, be instantaneous and essentially free is really, really wild. And you end up spending a lot of time not only getting into the head of the AI and sort of being like, okay, well, why is it getting confused? What am I sending that is confusing? How do I send more information in order to produce a better result? But it also really forces you to clarify your own expectations. Like somewhere up here, I have a drag and drop list, you know, where you can drag list items between, and I started working on this and started speccing it out. I was like, man, this is actually not only really hard to produce a good result for, but it's also just really hard to describe. The failure was really on my end for just not knowing how to get the information in there, because I didn't actually know how this thing should work. But I could figure it out. I have an environment in which to figure that out. It's fun. [00:30:49]

Swyx: That's amazing. I'm still processing. [00:30:51]

Steve: During this, because this thing went massively viral on Twitter, thousands of retweets, there were some folks who were subtweeting it, like, you know, get over it, it's just a wireframing or no code tool or something like that. One guy did say, you know, I prefer to wireframe the old fashioned way with pen and paper. And I was like, oh yeah, no, that works too. Like this works with screenshots. I can just take the screenshot he posted of the drawing that he had made. You know, it's not even a good photo. There's a pen, you know, across one of the screens, etc. But if you just give that with no other information as a prompt, you get back a pretty good result. You know, just from this photo of a piece of paper on the guy's desk, you have a not completely arbitrary result, a working website here that was inferred from just that picture with no other input, not even titles or anything else. And of course, it's responsive and all this stuff. And so, yes, I've worked really hard to make all of our shapes, you know, really good and our arrows obsessively good and all this stuff. But the fun of the infinite canvas and tldraw in particular is that you could just dump whatever you want onto the canvas: screenshots, text, images, other websites, sticky notes, all that stuff. And the model, even as something that was in preview, like the very, very first sort of multimodal model, can do a really good job at just taking all that stuff as the input. And yeah, so we accidentally made a really, really good visual multimodal prompting application environment, or user experience environment. I'm not even sure what we're going to call this thing. [00:32:32]

Swyx: In our pre-show prep, you also talked about parallel prompting. Is that basically just prompting and then moving on to something else? Is that what you've been showing us? [00:32:41]

Steve: Yeah, that's kind of what we did up here with the stopwatches: the fact that we could get multiple prompts going at the same time and arrange them spatially. People have done this also with imagery, to say, OK, well, here we're going to use DALL-E. We're going to kind of make a tree of prompts as you go, different iterations based on whatever. You make four iterations. You pick your favorite one. You keep going. Kind of like what you do in Midjourney. But to have that be spatial, and arranged on a canvas so that it actually can make sense to you and you can kind of look back and follow it forward, that whiteboard, infinite canvas stuff is just really good for a lot of things. So, organizing a whole bunch of different content that is irregular or ephemeral or has a kind of ad hoc meaning configuration, like, you know, things that are next to each other or things that are in a grid, or in this case, just even what we have here for what we did with the stopwatch, there's an implicit meaning of, OK, the source is on the left, the result is on the right, and any further iterations are further on the right. Right? We didn't put that into a data model. We didn't structure that in any way. That meaning relationship doesn't actually exist in any part of the product. It just exists to us because we can make sense of it for this type of thing. Not only is it cool that now a model can make sense of it as well, but yeah, for organizing complex iterations of imagery, complex iterations of outputs, et cetera, the canvas is a place. I really do believe that. Yeah. [00:34:11]

Swyx: I mean, that's really incredible. I think a few developers are kind of scared about, you know, how much of their jobs this does. Obviously, there's a lot more that it can't do. [00:34:22]

Steve: Yeah. The "will this take my job" story is interesting. [00:34:26]

Swyx: I think I'm not actually concerned, but I'm curious. I think this augments. Actually, my concern as a developer is that this is good, but not good enough. You know, like it's good for throwaway UI, but would I actually export the code and take that code? I don't know. It looks like your first MVP was just HTML files, which, you know, if it's a single HTML file, it can have some JS and some CSS. I saw some problems with layout in there, and I don't know for sure how it does layout. It looks like you could just prompt it for Tailwind if you want Tailwind. I assume it can generate React. I don't know. What are the limitations of this thing? [00:35:05]

Steve: There are the limitations that are in that particular demo, which is that it couldn't do React, because it needs to just be a single compiled thing, ready to go. You can't do any multi page stuff or anything like that. But that's more of like how we're structuring the project rather than a specific requirement of the project itself. There's two kinds of things. One is like how big is the input window and how big is the output window or something. In theory, you could have the input be here's an entire full stack React application together with all my UI and all this, all my components, etc. And here is a screenshot that I took of the landing page where the menu is in the wrong spot, you know, and I'm going to annotate that with some arrows and some text in order to say, like, here's where I want it to be or here's what I want, etc. And for the output to be, you know, a diff that I can apply to my code base, like basically produce the commit that would change this and have that commit be against multiple files, etc., in order to have potentially a solution that is just ready to go, applicable like a patch, a PR that you can make. There really isn't any limit there, and we've seen that with Copilot, etc. The challenge is more on the input side than the output side. Absolutely, you could figure out a way for this thing to spit out a working iOS app or something like that. The question is, how do you tell it what you want and how do you iterate when it gets it wrong? And just doing zero shot, zero shot, zero shot is really a frustrating process. But if you do have a way of iterating, if you do have a way of kind of step by step moving towards the solution that you want, and kind of getting it into like, okay, well, this is good, but it's not great, could be better, etc., that's how you actually make that type of complex output more practical or more realistic. You probably won't ever get the prompt just right. Even if you have a really, really, really good three-generations-from-now agent, you still have to put that information in, and you're never going to put all the information in the first time. You need to be able to iterate on it. And so with visual stuff, I feel like the canvas, like what we were looking at, that's part of what it unlocks: that space of iteration, that space where you have a way of marking up the result and using that as the new prompt. And that's kind of new. [00:37:28]

Swyx: Yeah. Multimodal prompting is such a brilliant concept that, you know, I think it's going to be a norm for some things. In my mind, you demonstrated, you know, coming from Photoshop, there's this concept of layers. You can kind of simulate layers in tldraw. And I see an emergent property of layers in this kind of prompting, which is there's the UI layer, and then there's the state chart layer. And those two things seem pretty useful in specifying a prompt. I was just wondering if you've thought a little bit about other dimensions or other layers that would be useful in multimodal prompting. [00:38:02]

Steve: Yeah. One thing that we've done is to bring in screenshots of other apps, like here's Stripe.com, like make it look like Stripe, you know? Or like here's Linear.com, like let's do it this way. [00:38:16]

Swyx: Make my dev tool a website or make pop. Exactly. You should just, you should just like give a design and ask it to make pop instead of make real. [00:38:25]

Steve: Yeah, exactly. Make it more, make it more, make more pop. So there's the idea of bringing in style as another part of the input. Flowcharts are absolutely useful. I mean, it really just boils down to, what would you really give a developer who you are working completely asynchronously with? You know, if you had to spec out a project, print it out on paper and mail it to a developer, and they were going to mail back a disk with an HTML file on it, what would you send? If you were sending this to the moon or something. So yeah, definitely descriptions of how the state should operate and specs on that. We've even just pasted in code, like here's a whole bunch of JSON that you can use, and have it just read that as the input data. You can point it at specific endpoints. You can say, I want you to hit this endpoint and then display the results, you know, as cards or as items or something like that. And I mean, you don't even have to wire this up. It's not like Retool or anything where you have to register that; it's not built into the tool. You just. [00:39:29]

Swyx: From an endpoint. [00:39:30]

Steve: Yeah. Yeah. Yeah. I'm trying to think of what a good demo endpoint would be. Maybe we could do one more test. What is it? Like dog.ceo? [00:39:38]

Swyx: Is that? Yeah. Dog.ceo is a good demo. Yeah. [00:39:42]

Steve: I've used that one. I mean, this might be kind of like the box with the word calculator. Like it might just know because it's probably been in a bunch of tutorials. [00:39:48]

Swyx: It's in the training set. Yeah. You're not sharing by the way. [00:39:51]

Steve: You know what? We'll, we'll do it anyway. We'll, I'll, I'll share it. We'll try. [00:39:55]

Swyx: Dog.ceo is one of those demo APIs that you just set up just because it's not offensive. [00:40:02]

Steve: And. [00:40:03]

Swyx: Yeah, exactly. There's some useful dogs and everyone likes looking at dogs. [00:40:07]

Steve: You can get dog.ceo. [00:40:09]

Swyx: I definitely didn't think about hitting endpoints just because it's just not in any of the demos I've seen. [00:40:15]

Steve: Yeah. But it works. Let me see. I'll, I'll have a big button down here. Show me a dog. Okay. So that's going to be our show me a dog button. This should be a picture of a dog. [00:40:26]

Swyx: Oh, that's a great dog. No, that's a cut. [00:40:30]

Steve: Thank you. And then we'll, we'll do some annotations here. We'll say like when, when this is clicked, get a new dog. [00:40:36]

Swyx: There's those perfect arrows coming in. [00:40:38]

Steve: Yeah, exactly. When clicked, get a new dog from, from, I'll just paste in this and put the result in the image. Okay. So it's, it's more of a, more of an instruction than you would normally. [00:40:52]

Swyx: Yeah. One thing that it's going to have to guess is that, you know, the, the response format, right? [00:40:57]

Steve: Cause it could be anything. This is true. Let's see if it works. Yeah. And let's see if it hit the endpoint in the right way. So, dog button. Yeah. Okay. It hit the right endpoint, read the JSON, got the dog image, and then it put it in. So there you go. You have yourself a JavaScript tutorial in a box ready to go. And I think, we probably wouldn't do this on camera, but you can say, you know, use the auth token, whatever, and, you know, go really get real data back from this thing. There's no reason why it wouldn't be able to do that. [00:41:34]
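
The single-file prototype that comes back presumably contains something like the fetch logic below. This is an illustrative reconstruction, not the actual model output, and the element ids are assumptions; the Dog CEO API does return the image URL in its `message` field.

```ts
// Illustrative reconstruction of the logic the generated prototype needs:
// clicking the button fetches a random dog image and swaps it into the placeholder.
async function showMeADog(): Promise<void> {
  const res = await fetch("https://dog.ceo/api/breeds/image/random");
  const data: { message: string; status: string } = await res.json();
  const img = document.getElementById("dog-image") as HTMLImageElement; // hypothetical element id
  img.src = data.message; // the API returns the image URL in `message`
}

document.getElementById("show-dog-button")?.addEventListener("click", () => {
  void showMeADog();
});
```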

Swyx: You're kind of relying on the OCR. [00:41:35]

Steve: Well, not really, because again, what, inside of the prompt for this, we do give it like an array of all the texts that you've put in. We say like, look, I know your vision isn't so good, or you have a hard time reading text sometimes when it's small, because what, like the input that you get is pretty wild. It's like, it takes this as a PNG, and then it like, I can't do this in tldraw, but it resizes it, it squishes it into a 512 by 512 image or something like that. [00:42:05]

Swyx: It tiles it. [00:42:06]

Steve: Yeah. The text especially can get kind of like chunked up, especially if it's small. So we send those strings separately so that it can kind of reassemble anything that it can't read right off the bat. This is a weird future that we've found ourselves in. Pretty cool. Yeah. [00:42:23]

Swyx: I mean, you know, one layer I automatically think of is back-end, right? Like as someone who has worked at AWS, I see a lot of systems diagrams, like cloud diagrams, entity relationship diagrams for databases. So I wonder if anyone's tackled extending this to back-end, and then obviously the next level from that is full-stack apps, where you have back-end and front-end. [00:42:43]

Steve: Yeah. I mean, I guess there's someone on Twitter that was using this with flowcharts. I'm not a back-end guy, so I don't actually know exactly what the output was, but I believe it was like a configuration script for AWS that was built off of this. I think he just copy and pasted a diagram that he had made in tldraw anyway and said, okay, let's throw this at this thing and see what it comes up with, tweaking the prompt to say, rather than building single page websites, you just return the JSON description of this configuration or something like that, or return a script that would set this up. You could tweak it to say, here are all the entity relationships between different tables or items in tables, and give me back the SQL initialization or something that would make all these tables and set up these relationships. Yeah. It's just, again, the hard part is getting that information in. I don't know, pictures are really good. [00:43:35]

Swyx: They may speak a thousand words. Awesome. So that's one of the two, what I think of as, multimodal viral hits in November. The other one you also had a part to play in, which is the latent consistency models trend, where I think you worked with Fal. [00:43:51]

Steve: Yeah. So actually, I do have something to show here. We actually have a couple of things to show here. We connected with Fal because they used tldraw to create a demo for their LCM, right? Yeah. So we did that, and we made drawfast.tldraw.com, which is basically, you get these shapes, these little draw fast shapes, and it puts the result, basically grabs that new image and puts it right next to it. And these are extremely fast. So as I'm moving things, you should see the image updating as well. And I think the prompt for this one was originally, I don't know, a wise princess or something, which is what this looks like. I'd play with this more with my daughter than anything else. Yeah, the kids must love it. And, oh my gosh, she does. And actually, we had a lot of folks on Twitter being like, this is not good, like, whatever. Because I had a video of, whatever, my daughter drawing, and she made this awesome drawing of a mermaid, and we turned it into this really anonymous, crappy version of an illustration of a mermaid. And they're like, no, no, the children's drawing is much more interesting. I'm like, yeah, yeah, yeah, come on, who cares? Of course it is. But, you know, this is fun. [00:45:03]

Swyx: Yeah, I do think you might do animations, like you could make some kind of stop motion film. Yeah. Yeah. I mean, we just need to do more work on consistency, but this is getting there. [00:45:18]

Steve: Yeah, it is. The fun is that, after playing with this for a little while, you end up getting really into the particularities of the input. Like what can you do with a design tool? Okay. You can move things around, right? I can grab some of these and move them around, like, oh yeah, there's a highlighter here too. So we could do some highlighting, you know, that'll do stuff. And then we couldn't help ourselves. We started making these like stories. So all right, well then I'll move on to the other one that we released earlier today. Yeah. Which is called lens.tldraw.com. So that was drawfast.tldraw.com. And again, this is probably not making for good podcast audio, but the image updates as soon as possible based on what the input drawing is. And it is pretty hypnotic. So this one's a little riskier because it's live. So we took a project called Together, which is a vertically scrolling, infinite drawing collaborative experience, a little bit like a chat room. As you're drawing, everything's just sort of moving up and it just disappears off the top of the screen, never to be seen again. So it's kind of just fun to play with. [00:46:21]

Swyx: By the way, one of the most magical chat experiences I ever had was with you. I think you were with your daughter or something and I was, whatever, showing off Together. And you started writing, I started writing, and we had a chat on together.tldraw.com. [00:46:34]

Steve: Yeah. [00:46:35]

Swyx: I was like, what is this? [00:46:37]

Steve: It's super cool. Inevitably someone will write like, you know, where are you from? And everyone's like chiming in and talking about it. So I'll describe what's on the screen now, which is we're taking like a screenshot of the center, like a square out of the center of this chaotic, vertically scrolling chat experience. And we're sending that to the LCM and putting back the image based on like a prompt, like, you know, desert scene or busy marketplace or futuristic cityscape or something like that. And so it is updating like, you know, 10 times a second as we go. [00:47:12]

Swyx: It's updating surprisingly quickly, like 10 frames per second. [00:47:14]

Steve: No, I think it's now down to like 32 milliseconds basically as you go. And so if I draw like a big orange thing down here, it's going to kind of show up in the drawing. Maybe I'll do a big black one so you can see better. Like it just sort of becomes part of the input to this prompt and it is extremely hypnotic. This is again lens.tldraw.com. Yeah. It's like this slow moving, collaborative kind of hallucination experience and it just never ends. I mean, yeah. I'm probably going to be funding Fal completely for the next, you know, their Series A or something like that. [00:47:55]
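
For a sense of the moving parts, here is a rough sketch of the capture-and-repaint loop being described. Everything here is hypothetical: the capture and draw helpers stand in for the screenshotting machine and the canvas, and the endpoint is a placeholder rather than Fal's actual API.

```ts
// Hedged sketch of a lens-style loop: screenshot the centre square of the live canvas,
// send it with a text prompt to an image-to-image LCM endpoint, then draw the frame that comes back.
async function lensLoop(
  captureCenterSquare: () => Promise<Blob>, // hypothetical: screenshot of the shared canvas
  drawFrame: (img: Blob) => void            // hypothetical: paint the generated frame on screen
): Promise<void> {
  const prompt = "busy marketplace, watercolor";
  for (;;) {
    const frame = await captureCenterSquare();
    const body = new FormData();
    body.append("image", frame);
    body.append("prompt", prompt);
    // Placeholder URL; the real demo talks to an LCM service fast enough to return frames in tens of milliseconds.
    const res = await fetch("https://example.com/lcm/img2img", { method: "POST", body });
    drawFrame(await res.blob());
  }
}
```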

Swyx: I don't know. I have a healthy respect for like the amount of processing that must be going on behind these things. [00:48:00]

Steve: Yeah. Well, what's funny is that like, yeah, we're using like Cloudflare workers to do the updates and the CRDTs to do the collaboration and all this like whatever LCM models to populate this image or create this image. But there's also a laptop in my living room right now that is doing the actual screenshotting and sending that up. And so there's a big note that I had to write, you know, for my family to say like, don't turn off this laptop. Don't close this laptop because this needs to be on in order for this thing to work. And no matter how good our tech stack gets, we'll always come back to some laptop stuck in the corner that can't possibly be turned off. [00:48:39]

Swyx: That's pretty fun. [00:48:40]

Steve: Yeah. [00:48:41]

Swyx: I've heard of major businesses being run that way. Yeah, exactly. [00:48:43]

Steve: Raspberry Pi in the closet. [00:48:45]

Swyx: Yeah. You know, it's funny because, like, you are inventing your own art form. This is fine art. You know, going back to your degree, it's just a different kind of art. [00:48:54]

Steve: It's funny because the output of this, while it is a visual output, doesn't actually matter. It's gone in 16 milliseconds and it's not coming back. And I think with all this AI stuff right now, just where we are with it and just how completely unknown it is in terms of where is this useful, the best thing that you can get out of this is the experience. And so I think the thing that people will walk away from playing with lens.tldraw.com should be more that experience of having interacted with this thing, or interacted with it, you know, along with others, rather than, oh, this made my favorite image or something like that. I don't know. As a former image maker, the idea of having an aesthetic experience where the image is a major part, but not necessarily the important part, where any one of these images isn't the important part, I don't know. There's something new feeling about this. Kind of fun. Certainly. I wish I could do a big critique with all the new media arts people about this, and about, you know, where does this fit in relative to other people's work, et cetera. [00:50:03]

Swyx: That's for them to write and for you to build. I would encourage you to keep building there, because you've definitely done well with your explorations. I can sort of round it out by looking towards the future. You hinted a little bit that you're working towards tldraw 2.0. So first of all, it seems like you're very focused on the core mission of the canvas, and the AI stuff is a side project for now. Why not pursue it as like a full, why not pivot and be an AI company, right? I'm sure you've got a lot of those questions. Yeah. [00:50:35]

Steve: Yeah. I mean, when you get something as viral as tldraw got, like I think I've talked to everyone, certainly every investor, and yes, we probably could, for something like Together or that drawfast thing, make a tiny little SaaS app, you know, give me $10 a month, play with this thing, and, you know, make it good. We could go in that direction. There's not much of a moat around any of this stuff. And we're seeing that just in, you know, I don't know, Gemini is going to come out in a couple of days, weeks or whatever. And if it's better, then people are just going to use that until the next better thing comes along. There's not a lot that's uniquely defensible about, hey, it's a drawing app plus an LCM model, because there's going to be a lot of those models and there's going to be a lot of drawing apps. The thing that I think is really unique for tldraw, the thing that we have added that is not easily created, is the canvas itself. It's web-based, hackable, extendable, with super refined interactions and all that stuff, like all the thousand table stakes features that drive people nuts when building something like this. They're all there, they're all good. Day one, you could build a really great experience, whether it's AI driven or not, using tldraw in a way that it's just not practical to do if you're building it yourself. And especially if you're not doing graphics stuff, there's really not that much else out there oriented towards this type of thing. And I think in a world where these types of AI driven capabilities are just going to keep coming out faster and faster, you know, I don't know, next year is going to be wild. Like every month there's going to be some new, you know, capability or something. The thing that I would want to see, both just me as a person and as me having built the business that I've built, is for tldraw to sort of become the place where some of this prompting, some of these ideas are explored. Even if we decided, okay, we're just going to close everything up, we're going to build a product based on this, and maybe it's a great product, it would only be one direction, one ray kind of into this infinite space of possibility. And that could be successful, good, but, I mean, we've built the sort of direct manipulation core, and there are so many, even AI specific, APIs that we could build around tldraw for having, you know, like a virtual collaborator or working with images in a more rich way. There's just so much that we could build in order to make this the best possible place to explore, not just one direction, but, you know, many, many, many directions. And I think that narrative gets me much more excited. And I think we're also just, with the team that we have and the tech that we have and the skills that we have, we're more the team to build that rather than to become like a SaaS product company. I'm not saying we'll never do a, you know, pay us 10 bucks a month and you can play with our magic toy, but primarily my goal is to make tldraw either the place to explore these different models, or you might think of it as the battleground on which the winners will be kind of identified. Like right now we're using OpenAI for the make real thing. Maybe next week we'll be using Gemini, and now it's a question of, okay, well, now we have an environment in which to compare these two models with the same input, and a very advanced form of input. But yeah, let's see which one does better. Nothing would make me happier than to be sort of the battlefield for multimodal prompting and multimodal AI experience. [00:53:58]

Swyx: I should also shout out BakLLaVA as the open source vision and multimodal model. So I fully understand you want to own the light cone of multimodal prompting. I think that'll probably be the title of the episode. What's coming up for tldraw 2.0? [00:54:15]

Steve: So really the tldraw that you are using now and that I'm using is basically 2.0. It's been in pre-release for a long time. Really the only change that's going to happen once we launch it is we're going to start selling commercial licenses for it. So if you are using tldraw in a commercial product or if you want to, then, you know, if you're funded or if you have revenue, then you'll buy a license and I'll add you to our special list of customers. So yeah, it's mostly just go to market and the necessary changes around there. There will be some kind of fun changes, secret saucy changes at launch, but nothing substantial, nothing breaking. We've put a lot of effort in. Like, it's crazy that we've only had this new version open source since May of this year, right? And we've been very busy since then, but it's stable, it's robust. We put it through a lot of usage and caught a lot of the issues. So it's absolutely ready to go. But I have one or two conversations with my lawyer before we turn over the license and start moving it that way. Gotcha. [00:55:12]

Swyx: And then maybe, if I could get your commentary before we close, on just the competition out there: you are not the only sort of canvas tool. I think I get it now, but I was going to ask about Figma and FigJam, and they have some AI thing that they're also doing. I think Adobe is also working on similar things. Canva is also working on similar things, but they're all individual point solutions, whereas you're more the open source canvas to power all of them. I feel like it's just Excalidraw. That's the other alternative that remains. [00:55:41]

Steve: I think Excalidraw, and I like Excalidraw a lot, I contributed there and we retweet each other and tease each other on Twitter. And early on, I was copying features from them. Now they're copying features from me. But no, the collaboration space has so many dominant players that I think me and Excalidraw are tiny within that. There's two things. One is that we made this very strange bet on using a kind of web canvas, in that our canvas is not an HTML canvas element. It's normal React components all the way down. So if you wanted to add something interactive and have that participate in the sort of space of the canvas, the way that we were doing our iframes, kind of like being able to write on top of an iframe, you can't do that in Excalidraw. You can't do that anywhere. That is a very strange tech choice that we made around tldraw that is, you know, finding its home in a few different ways. Most of the people who pick tldraw and approach me, like the inbound that I get, are folks for whom that's the killer feature: being able to put interactive widgets on the canvas using just React. No matter how good Figma's AI solution is, and I hope it's great because I love Figma and I use it, it's not going to solve every possible problem in this space. It's not even going to touch, you know, a lot of these things. And I mean, I already had identified, OK, there was a point where any Kanban board was Trello, right? When you talked about Kanban boards, you were talking about Trello. Kanban boards are in every productivity app now. I think the same thing is going to happen with collaborative whiteboards. People like them. I'm making it easy. People are already doing it even without tldraw, when it's hard. Like, yeah, that's going to become a kind of commodity user experience in a lot of different products. Probably, you know, give me a diagram from a text prompt, yeah, that is probably going to be a commodity too. Give me an image from a text prompt, yeah, that's just going to be everywhere. We're just going to assume that; you know, it's like adding a GIF to a chat or something, there's no moat there. I do hope that Figma has an amazing AI integration, but I think the thing that it will help you do is use Figma. Generating an image won't be super useful, but, you know, autocomplete this design absolutely would be. And I hope they launch something amazing there. But yeah, like I said, there's just a million different directions that this stuff could go in. The canvas is just an input device that allows a certain type of user experience. And that's certainly not limited to design. That's not limited to whiteboarding. It's not limited to collaboration or anything like that. Yeah, my hope is that there are those like 10,000 products that could be made with what we're making. [00:58:26]
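
To make the "React components all the way down" point concrete, here is roughly what a custom interactive shape looks like in tldraw v2. The API names are written from memory of the v2 docs and may differ slightly from the shipped SDK, so treat this as an approximation rather than copy-paste code.

```tsx
import { BaseBoxShapeUtil, HTMLContainer, TLBaseShape, Tldraw } from "@tldraw/tldraw";

// A custom shape whose body is ordinary React/HTML rather than a drawing primitive.
type CounterShape = TLBaseShape<"counter", { w: number; h: number }>;

class CounterShapeUtil extends BaseBoxShapeUtil<CounterShape> {
  static override type = "counter" as const;

  getDefaultProps(): CounterShape["props"] {
    return { w: 160, h: 80 };
  }

  // The shape renders as a live React component on the canvas.
  component(shape: CounterShape) {
    return (
      <HTMLContainer style={{ pointerEvents: "all", background: "white", border: "1px solid #ccc" }}>
        <button onClick={() => alert("clicked a widget on the canvas")}>Click me</button>
      </HTMLContainer>
    );
  }

  indicator(shape: CounterShape) {
    return <rect width={shape.props.w} height={shape.props.h} />;
  }
}

// Registering the custom shape with the editor (prop name as remembered from the docs).
export function App() {
  return <Tldraw shapeUtils={[CounterShapeUtil]} />;
}
```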

Swyx: Yeah. That's a really great mission. And I see why you're so passionate about it. You're the right team for it. Okay. You know, a couple of lightning round questions. One, which is like, if you had some AI capability that you would wish for that you don't have yet, what would it be? [00:58:39]

Steve: Oh, that's a really good question. [00:58:42]

Swyx: Helps people to do some research. [00:58:44]

Steve: Yeah. I think probably related to, it's not quite a CRM, but like a human, just normal relationship management. This is something that I've never had a problem with until I had a startup, actually, where there's just a lot more people involved in my life. And it's hard to keep up with them all. And I think this is probably something that like an EA kind of does of saying like, hey, there's a birthday coming up or something like that. But also just, you know, identifying opportunities to work together, to connect or who's an expert on this thing that I'm working on, like that doesn't always occur to me. And I think the value of your network, that even if you're good at that, you're probably only scratching the surface of like, you know, how you could be helping the people around you and how they could be helping you based on like the specific context of like what you're working on and the problems on your table today. Yeah. [00:59:36]

Swyx: I've also wanted to build a CRM on top of Twitter, because you have all the info there about what people are working on, your past conversations with each other, and your shared interests. You know, at a bare minimum to search it, but to proactively suggest is the next layer. And I guess AI chief of staff, AI executive assistant, something like that. I think some people are working on that, but the problem is so big that they're working on the automation piece. Like Lindy, which I had at my conference, where it's a virtual assistant that you can trigger on your desktop or via email. And it mostly deals with scheduling, but also helps you do a little bit of research. So yeah, I think the agents field will progress there. It might take 10 years to do it. Yeah. [01:00:19]

Steve: Yeah. I can wait. It's all good. [01:00:22]

Swyx: And then finally, advice for founders, like what has helped you the most as a founder, you know, you're two years into your journey. [01:00:29]

Steve: Yeah. So this kind of comes a little bit out of what you learn in art school. One thing is that basically, when you're a studio artist or you're in a studio or whatever, there are no external constraints. You just kind of are running on, well, what do I feel like working on? And the further you get away from what do I feel like working on, kind of like the worse your work becomes. So having a really good feeling for that sort of desire, and being able to respect and follow that desire, because it's not arbitrary. If you really, really feel like working on a thing, that might be the tip of a very complex iceberg of analysis of the field, or what people are talking about, or directions in the market, or something like that. I don't know, I think with tldraw, and as a founder on this, the thing that I've tried to do and tried to preserve is being able to prioritize based on what is most interesting right now. And that is true for what code we write and what features we work on. That's true for which partners we spend time with, in terms of who is using tldraw, the types of problems that we want to solve. Like using your own sort of sense of what's interesting as a filter, what you want to work on, what sounds like a fun thing to work on right now, as a filter. It's not naive and it can be kind of part of your secret sauce. I think a lot of early founders are encouraged against that, and to be working backwards from a certain outcome and all that. And yeah, you do have to do that. You have to put that into the mix as well, but be sure that you're picking the best parts out of that mix. I don't know, the parts that you want to work on. [01:02:12]

Swyx: Well, I mean, what's the point of doing this if you don't have some fun, indulge your curiosity. [01:02:16]

Steve: Yeah. The worst case, you'll, you'll build something that you love. Yeah. Yeah, exactly. Good things can come out. Good things can absolutely come out of like. [01:02:24]

Swyx: You had an 8,000% increase in your followers or something. [01:02:29]

Steve: Yeah. Yeah. If you're a Substack reader, on the tldraw Substack, 72 hours into this big make real virality explosion, I sat down and wrote a blog post, and I wanted to at least capture that vibe of what it felt like in the middle of that hurricane. So it's a pretty fun one. Very special. It's good to read back. [01:02:48]

Swyx: Well, I'm sure it's not the last time we'll see you do something crazy viral. I'm sure that a lot of people will be exploring tldraw. I hope a lot of people do. Honestly, one thing I'm thinking about is embedding tldraw into my input box. Can't tldraw be, you know, part of the input? [01:03:05]

Steve: Hey, I'm talking to the good folks over at OpenAI tomorrow. Fingers crossed. Maybe we get it inside of ChatGPT or something. Cause yeah, like, [01:03:15]

Swyx: I need to, I need to move faster. [01:03:17]

Steve: Like what? You want to like take a drawing or take a photo and then annotate it or like, you know, sketch something out. You should be able to do that. [01:03:29]

Swyx: It's yeah, exactly. [01:03:31]

Steve: Yeah. It's just a good, it's just a good thing. Yeah. The people cry out for it. I failed it fast enough. [01:03:38]

Swyx: Well, thank you for inspiring the rest of us. Thank you for everything. And I'm sure we'll hear more from you over the next few years. [01:03:46]

Steve: So thanks. [01:03:46]

Swyx: Thanks for your time. Awesome. [01:03:48]

Steve: Thank you for your time. [01:04:01]



Get full access to Latent Space at www.latent.space/subscribe
2024-01-05
Link to episode

NeurIPS 2023 Recap - Top Startups

We are running an end of year listener survey! Please let us know any feedback you have, what episodes resonated with you, and guest requests for 2024! Survey link here.

We can't think of a more Latent-Space-y way to end 2023 than with a mega episode featuring many old and new friends recapping their biggest news, achievements, and themes and memes of the year!

We previously covered the Best Papers of NeurIPS 2023, but the other part of NeurIPS being an industry friendly conference is all the startups that show up to hire and promote their latest and greatest products and papers! As a startup-friendly podcast, we of course were ready with our mics to talk to everyone we could track down.

In lieu of an extended preamble, we encourage you to listen and click through all the interviews and show notes, all of which have been curated to match the references mentioned in the episode.

Timestamps & Show Notes

* [00:01:26] Jonathan Frankle - Chief Scientist, MosaicML/Databricks

* see also the Mosaic/MPT-7B episode

* $1.3B MosaicML x Databricks acquisition

* [00:22:11] Lin Qiao - CEO, Fireworks AI

* Fireworks Mixtral

* [00:38:24] Aman Sanger - CEO, Anysphere (Cursor)

* see also the Cursor episode

* $8m seed from OpenAI

* Tweet: Request-level memory-based KV caching

* Tweet: GPT-4 grading and Trueskill ratings for rerankers

* [00:51:14] Aravind Srinivas - CEO, Perplexity

* 1m app installs on iOS and Android

* pplx-online api 7b and 70b models

* Shaan Puri/Paul Graham Fierce Nerds story

* [01:04:26] Will Bryk - CEO, Metaphor

* "Andrew Huberman may have singlehandedly ruined the SF social scene"

* [01:12:49] Jeremy Howard - CEO, Answer.ai

* see also the End of Finetuning episode

* Jeremy's podcast with Tanishq Abraham, Jess Leao

* Announcing Answer.ai with $10m from Decibel VC

* Laundry Buddy, Nov 2023 AI Meme of the Month

* [01:37:13] Joel Hestness - Principal Scientist, Cerebras

* CerebrasGPT, all the Cerebras papers we discussed

* [01:56:34] Jason Corso - CEO, Voxel51

* Open Source FiftyOne project

* CVPR Survival Guide

* [02:02:39] Brandon Duderstadt - CEO, Nomic.ai

* GPT4All, Atlas, Demo

* [02:12:39] Luca Antiga - CTO, Lightning.ai

* Pytorch Lightning, Lightning Studios, LitGPT

* [02:29:46] Jay Alammar - Engineering Fellow, Cohere

* The Illustrated Transformer



Get full access to Latent Space at www.latent.space/subscribe
2023-12-30
Link to episode

NeurIPS 2023 Recap - Best Papers

We are running an end of year listener survey! Please let us know any feedback you have, what episodes resonated with you, and guest requests for 2024! Survey link here.

NeurIPS 2023 took place from Dec 10-16 in New Orleans. The Latent Space crew was onsite for as many of the talks and workshops as we could attend (and more importantly, hosted cocktails and parties after hours)!

Picking from the 3586 papers accepted to the conference (available online, full schedule here) is an impossible task, but we did our best to present an audio guide with brief commentary on each. We also recommend MLContests.com NeurIPS recap and Seb Ruder's NeurIPS primer and Jerry Liu's paper picks. We also found the VizHub guide useful for a t-SNE clustering of papers. Lots also happened in the arxiv publishing world outside NeurIPS, as highlighted by Karpathy, especially DeepMind's Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models.

Jan 2024 update: we also strongly recommend ?s pick of the year's 10 best papers, including Pythia.

We'll start with the NeurIPS Best Paper Awards, and then go to a selection of non-awarded but highly influential papers, and then arbitrary personal picks to round out the selection. Where we were able to do a poster session interview, please scroll to the relevant show notes for images of their poster for discussion. We give Chris Ré the last word due to the Mamba and StripedHyena state space models drawing particular excitement but still being too early to assess impact.

Timestamps

* [0:01:19] Word2Vec (Jeff Dean, Greg Corrado)

* [0:15:28] Emergence Mirage (Rylan Schaeffer)

* [0:28:48] DPO (Rafael Rafailov)

* [0:41:36] DPO Poster Session (Archit Sharma)

* [0:52:03] Datablations (Niklas Muennighoff)

* [1:00:50] QLoRA (Tim Dettmers)

* [1:12:23] DataComp (Samir Gadre)

* [1:25:38] DataComp Poster Session (Samir Gadre, Alex Dimakis)

* [1:35:25] LLaVA (Haotian Liu)

* [1:47:21] LLaVA Poster Session (Haotian Liu)

* [1:59:19] Tree of Thought (Shunyu Yao)

* [2:11:27] Tree of Thought Poster Session (Shunyu Yao)

* [2:20:09] Toolformer (Jane Dwivedi-Yu)

* [2:32:26] Voyager (Guanzhi Wang)

* [2:45:14] CogEval (Ida Momennejad)

* [2:59:41] State Space Models (Chris Ré)

Papers covered

* Distributed Representations of Words and Phrases and their Compositionality (Word2Vec) Tomas Mikolov · Ilya Sutskever · Kai Chen · Greg Corrado · Jeff Dean. The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several improvements that make the Skip-gram model more expressive and enable it to learn higher quality vectors more rapidly. We show that by subsampling frequent words we obtain significant speedup, and also learn higher quality representations as measured by our tasks. We also introduce Negative Sampling, a simplified variant of Noise Contrastive Estimation (NCE) that learns more accurate vectors for frequent words compared to the hierarchical softmax. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple and efficient method for finding phrases, and show that their vector representations can be accurately learned by the Skip-gram model.
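
For reference, the negative sampling objective the paper introduces replaces the full softmax with k sampled negatives per observed (input, output) word pair. Reproduced from memory, so treat the notation as a close paraphrase of the paper rather than a verbatim quote:

```latex
\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right)
  + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right]
```

Here sigma is the logistic sigmoid, v and v' are the input and output embeddings, and P_n(w) is the noise distribution over negative words (the paper samples from the unigram distribution raised to the 3/4 power).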

* Some notable reflections from Tomas Mikolov - and debate over the Seq2Seq paper credit with Quoc Le
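To make the negative-sampling objective described in the abstract above concrete, here is a minimal NumPy sketch of the skip-gram with negative sampling (SGNS) loss for a single (center, context) pair. The vocabulary size, embedding dimension, number of negatives, and fake word counts are all illustrative assumptions, not the paper's setup; only the 3/4-power noise distribution and the loss form follow the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, k_neg = 10_000, 100, 5                 # illustrative sizes
W_in = rng.normal(scale=0.01, size=(vocab_size, dim))   # center-word vectors
W_out = rng.normal(scale=0.01, size=(vocab_size, dim))  # context-word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(center, context, noise_probs):
    """Negative-sampling loss for one (center, context) pair.

    Maximizes log sigma(v_c . v_o) for the observed pair and
    log sigma(-v_c . v_neg) for k sampled noise words.
    """
    v_c, v_o = W_in[center], W_out[context]
    negatives = rng.choice(vocab_size, size=k_neg, p=noise_probs)
    pos_term = np.log(sigmoid(v_c @ v_o))
    neg_term = np.sum(np.log(sigmoid(-W_out[negatives] @ v_c)))
    return -(pos_term + neg_term)

# Noise distribution: unigram counts raised to the 3/4 power, as in the paper.
counts = rng.integers(1, 1000, size=vocab_size).astype(float)  # fake counts
probs = counts ** 0.75
probs /= probs.sum()
print(sgns_loss(center=42, context=1337, noise_probs=probs))
```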

* Are Emergent Abilities of Large Language Models a Mirage? (Schaeffer et al.). Emergent abilities are abilities that are present in large-scale models but not in smaller models and are hard to predict. Rather than being a product of models' scaling behavior, this paper argues that emergent abilities are mainly an artifact of the choice of metric used to evaluate them. Specifically, nonlinear and discontinuous metrics can lead to sharp and unpredictable changes in model performance. Indeed, the authors find that when accuracy is changed to a continuous metric for arithmetic tasks where emergent behavior was previously observed, performance improves smoothly instead. So while emergent abilities may still exist, they should be properly controlled and researchers should consider how the chosen metric interacts with the model.
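To see the argument concretely, here is a toy simulation (ours, not the paper's): assume per-token accuracy improves smoothly with scale; an all-or-nothing exact-match metric over a 10-token answer still produces what looks like a sudden emergent jump.

```python
import numpy as np

scales = np.logspace(7, 11, 9)                         # hypothetical parameter counts
per_token_acc = np.linspace(0.50, 0.95, len(scales))   # smooth, gradual improvement
k = 10                                                 # answer length in tokens
exact_match = per_token_acc ** k                       # all k tokens must be correct

for n, p, em in zip(scales, per_token_acc, exact_match):
    print(f"{n:>16,.0f} params | per-token {p:.2f} | exact-match {em:.4f}")
# The continuous metric improves steadily; the discontinuous one sits near zero
# and then "emerges" at the largest scales, purely as an artifact of the metric.
```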

* Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al.)

* While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model.

* In this paper, we leverage a mapping between reward functions and optimal policies to show that this constrained reward maximization problem can be optimized exactly with a single stage of policy training, essentially solving a classification problem on the human preference data. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for fitting a reward model, sampling from the LM during fine-tuning, or performing significant hyperparameter tuning.

* Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds RLHF's ability to control sentiment of generations and improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.

See also on DPO: recent Twitter discussions.
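To make the objective concrete, here is a minimal sketch of the DPO loss in PyTorch. The inputs are summed log-probabilities of the chosen/rejected responses under the policy and the frozen reference model (how you compute those is up to your modeling code), and `beta` is the usual KL-strength hyperparameter; the toy numbers at the end are made up.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss for a batch of preference pairs.

    Each input is a 1-D tensor of summed log-probs of a full response.
    """
    # Log-ratios of policy vs. frozen reference model.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # Classify the preferred response via a Bradley-Terry style logistic loss.
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()

# Toy usage with made-up log-probabilities for 3 preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5, -20.1]),
                torch.tensor([-14.2, -11.0, -19.8]),
                torch.tensor([-13.0, -10.0, -20.0]),
                torch.tensor([-13.5, -10.5, -20.0]))
print(loss)
```

No reward model, no sampling during fine-tuning: the whole RLHF pipeline collapses into this one classification-style loss over preference pairs.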

* Scaling Data-Constrained Language Models (Muennighoff et al.)

* The current trend of scaling language models involves increasing both parameter count and training dataset size. Extrapolating this trend suggests that training dataset size may soon be limited by the amount of text data available on the internet. Motivated by this limit, we investigate scaling language models in data-constrained regimes. Specifically, we run a large set of experiments varying the extent of data repetition and compute budget, ranging up to 900 billion training tokens and 9 billion parameter models. We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data. However, with more repetition, the value of adding compute eventually decays to zero. We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters. Finally, we experiment with approaches mitigating data scarcity, including augmenting the training dataset with code data or removing commonly used filters. Models and datasets from our 400 training runs are freely available at https://github.com/huggingface/datablations.

* 2 minute poster session presentation video
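A small sketch of the diminishing-returns idea behind the paper's data-constrained scaling law: repeated epochs contribute an exponentially decaying amount of "effective" unique data, so a few repetitions are nearly free but many repetitions add little. The functional form below follows that decaying-returns shape; the decay constant is an illustrative placeholder, not the paper's fitted value.

```python
import math

def effective_unique_tokens(unique_tokens: float, epochs: float,
                            r_star: float = 15.0) -> float:
    """Effective data seen when repeating a unique corpus for several epochs.

    The first pass counts fully; each additional repetition adds exponentially
    less. r_star (the decay constant) is a placeholder, not the fitted value.
    """
    repetitions = max(epochs - 1.0, 0.0)  # passes beyond the first epoch
    return unique_tokens * (1.0 + r_star * (1.0 - math.exp(-repetitions / r_star)))

unique = 100e9  # 100B unique tokens
for ep in (1, 2, 4, 8, 16, 64):
    eff = effective_unique_tokens(unique, ep)
    print(f"{ep:3d} epochs -> {eff / 1e9:7.1f}B effective vs {ep * unique / 1e9:7.1f}B raw")
# At ~4 epochs the effective data is close to the raw token count (repetition is
# nearly free); by 64 epochs most of the extra compute is wasted.
```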

* QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al.).

* This paper proposes QLoRA, a more memory-efficient (but slower) version of LoRA that uses several optimization tricks to save memory. They train a new model, Guanaco, that is fine-tuned on only a single GPU for 24h and outperforms previous models on the Vicuna benchmark. Overall, QLoRA enables using much less GPU memory for fine-tuning LLMs. Concurrently, other methods such as 4-bit LoRA quantization have been developed that achieve similar results.
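For reference, a hedged sketch of what a QLoRA-style fine-tune typically looks like with the Hugging Face transformers + peft + bitsandbytes stack. The base model name, LoRA rank, and target modules are illustrative choices, not the paper's exact recipe; check the libraries' current docs before relying on this.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative base model

# 4-bit NF4 quantization with double quantization, as described in the paper.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

# Train small LoRA adapters on top of the frozen, quantized base weights.
lora_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # illustrative target layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```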

* DataComp: In search of the next generation of multimodal datasets (Gadre et al.)

* Multimodal datasets are a critical component in recent breakthroughs such as CLIP, Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the machine learning ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets.

* Our benchmark consists of multiple compute scales spanning four orders of magnitude, which enables the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow leads to better training sets. Our best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet, outperforming OpenAI's CLIP ViT-L/14 by 3.7 percentage points while using the same training procedure and compute. We release DataComp and all accompanying code at www.datacomp.ai.
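A rough sketch of the kind of CLIP-score filtering baseline that DataComp participants submit. `clip_image_text_similarity` is a stand-in for whatever scoring function you use (for example an OpenCLIP model), and the threshold is an arbitrary illustration; the point is only that a submission is essentially "pick a subset of the pool".

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class Sample:
    image_path: str
    caption: str

def filter_pool(pool: Iterable[Sample],
                clip_image_text_similarity: Callable[[Sample], float],
                threshold: float = 0.28) -> List[Sample]:
    """Keep only image-text pairs whose similarity score clears a threshold.

    clip_image_text_similarity is a placeholder; DataComp then trains a
    standardized CLIP model on the surviving subset and evaluates it on the
    38 downstream test sets.
    """
    return [s for s in pool if clip_image_text_similarity(s) >= threshold]

# Usage with a fake scorer, just to show the shape of the pipeline.
fake_scorer = lambda s: 0.3 if "cat" in s.caption else 0.1
pool = [Sample("a.jpg", "a photo of a cat"), Sample("b.jpg", "asdf qwerty")]
print(filter_pool(pool, fake_scorer))
```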

* Visual Instruction Tuning (Liu et al)

* Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data.

* By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding.

* Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model and code base publicly available.
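Architecturally, LLaVA is roughly "frozen vision encoder, small projection, then LLM token stream". A minimal PyTorch sketch of that wiring; the dimensions and the single linear projection are simplified assumptions (the released code has the real model and training recipe).

```python
import torch
import torch.nn as nn

class TinyLLaVAStyleConnector(nn.Module):
    """Project vision-encoder patch features into the LLM's embedding space."""

    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)  # a simple linear projector

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from e.g. a CLIP ViT
        return self.proj(patch_features)            # (batch, num_patches, llm_dim)

# The projected "visual tokens" are concatenated with text token embeddings
# and fed to the language model as one sequence.
connector = TinyLLaVAStyleConnector()
visual_tokens = connector(torch.randn(1, 256, 1024))
text_embeds = torch.randn(1, 32, 4096)   # embeddings of the instruction text
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)
print(llm_inputs.shape)                  # torch.Size([1, 288, 4096])
```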

* Tree of Thoughts: Deliberate Problem Solving with Large Language Models (Yao et al)

* Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role.

* To surmount these challenges, we introduce a new framework for language model inference, Tree of Thoughts (ToT), which generalizes over the popular Chain of Thought approach to prompting language models, and enables exploration over coherent units of text (thoughts) that serve as intermediate steps toward problem solving.

* ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices.

* Our experiments show that ToT significantly enhances language models' problem-solving abilities on three novel tasks requiring non-trivial planning or search: Game of 24, Creative Writing, and Mini Crosswords. For instance, in Game of 24, while GPT-4 with chain-of-thought prompting only solved 4% of tasks, our method achieved a success rate of 74%.

* Code repo with all prompts: https://github.com/princeton-nlp/tree-of-thought-llm.
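A compressed sketch of the ToT control loop (breadth-first variant): propose several candidate "thoughts" per state, score them, keep the best few, repeat. `propose` and `evaluate` are placeholders for LLM calls; the real prompts and search variants are in the repo linked above.

```python
from typing import Callable, List

def tree_of_thoughts(
    root: str,
    propose: Callable[[str], List[str]],   # LLM call: expand a partial solution
    evaluate: Callable[[str], float],      # LLM call: score a partial solution
    depth: int = 3,
    beam_width: int = 5,
) -> str:
    """Breadth-first Tree-of-Thoughts search over partial solutions."""
    frontier = [root]
    for _ in range(depth):
        candidates = [c for state in frontier for c in propose(state)]
        # Self-evaluate candidates and keep only the most promising ones.
        candidates.sort(key=evaluate, reverse=True)
        frontier = candidates[:beam_width]
    return max(frontier, key=evaluate)

# Toy usage: "thoughts" are digit strings; the evaluator prefers longer ones
# that end in 7. Real tasks would plug in model-generated steps and scores.
best = tree_of_thoughts(
    root="",
    propose=lambda s: [s + d for d in "0123456789"],
    evaluate=lambda s: len(s) + (1.0 if s.endswith("7") else 0.0),
)
print(best)
```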

* Toolformer: Language Models Can Teach Themselves to Use Tools (Schick et al)

* LMs exhibit remarkable abilities to solve new tasks from just a few examples or textual instructions, especially at scale. They also, paradoxically, struggle with basic functionality, such as arithmetic or factual lookup, where much simpler and smaller specialized models excel.

* In this paper, we show that LMs can teach themselves to use external tools via simple APIs and achieve the best of both worlds.

* We introduce Toolformer, a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction.

* This is done in a self-supervised way, requiring nothing more than a handful of demonstrations for each API. We incorporate a range of tools, including a calculator, a Q&A system, a search engine, a translation system, and a calendar.

* Toolformer achieves substantially improved zero-shot performance across a variety of downstream tasks, often competitive with much larger models, without sacrificing its core language modeling abilities.
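At inference time the behavior reduces to: let the model emit an API call inline, execute it, and splice the result back into the context before continuing generation. A toy sketch of that splicing step; the `[Calculator(...)]` syntax mirrors the paper's examples, and a real system would pause the LM at the call marker rather than post-process a finished string.

```python
import re

def calculator(expr: str) -> str:
    # Extremely restricted arithmetic evaluator, for the toy example only.
    if not re.fullmatch(r"[0-9+\-*/(). ]+", expr):
        return "error"
    return str(eval(expr))

TOOLS = {"Calculator": calculator}
CALL_PATTERN = re.compile(r"\[(\w+)\((.*?)\)\]")

def execute_tool_calls(text: str) -> str:
    """Replace inline [Tool(args)] markers with '[Tool(args) -> result]'."""
    def run(match: re.Match) -> str:
        name, args = match.group(1), match.group(2)
        result = TOOLS.get(name, lambda a: "unknown tool")(args)
        return f"[{name}({args}) -> {result}]"
    return CALL_PATTERN.sub(run, text)

print(execute_tool_calls("The answer is [Calculator(400 / 1400)] of the total."))
# The answer is [Calculator(400 / 1400) -> 0.2857142857142857] of the total.
```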

* Voyager: An Open-Ended Embodied Agent with Large Language Models (Wang et al)

* We introduce Voyager, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human intervention. Voyager consists of three key components:

* 1) an automatic curriculum that maximizes exploration,

* 2) an ever-growing skill library of executable code for storing and retrieving complex behaviors, and

* 3) a new iterative prompting mechanism that incorporates environment feedback, execution errors, and self-verification for program improvement.

* Voyager interacts with GPT-4 via blackbox queries, which bypasses the need for model parameter fine-tuning. The skills developed by Voyager are temporally extended, interpretable, and compositional, which compounds the agent's abilities rapidly and alleviates catastrophic forgetting. Empirically, Voyager shows strong in-context lifelong learning capability and exhibits exceptional proficiency in playing Minecraft. It obtains 3.3x more unique items, travels 2.3x longer distances, and unlocks key tech tree milestones up to 15.3x faster than prior SOTA. Voyager is able to utilize the learned skill library in a new Minecraft world to solve novel tasks from scratch, while other techniques struggle to generalize.

Voyager discovers new Minecraft items and skills continually by self-driven exploration, significantly outperforming the baselines.
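Stripped to its skeleton, the Voyager loop is: ask the LLM for code to attempt the current task, run it in the environment, feed errors and feedback back in, and on success store the program in a skill library for later retrieval. A very rough sketch of that control flow; `llm_write_program` and `run_in_env` are hypothetical stand-ins, not Voyager's actual interfaces.

```python
from typing import Callable, Dict, Tuple

def voyager_step(
    task: str,
    skill_library: Dict[str, str],
    llm_write_program: Callable[[str, Dict[str, str], str], str],  # hypothetical
    run_in_env: Callable[[str], Tuple[bool, str]],                 # hypothetical
    max_retries: int = 4,
) -> bool:
    """One iteration of a Voyager-style loop: write code, run it, self-correct."""
    feedback = ""
    for _ in range(max_retries):
        # The LLM sees the task, previously learned skills, and the last error.
        program = llm_write_program(task, skill_library, feedback)
        success, feedback = run_in_env(program)
        if success:
            skill_library[task] = program  # store the verified skill for reuse
            return True
    return False

# Toy usage with fake components, just to show the control flow.
fake_llm = lambda task, skills, fb: f"# code for {task} (knows {len(skills)} skills)"
fake_env = lambda program: (True, "")
skills: Dict[str, str] = {}
print(voyager_step("craft a wooden pickaxe", skills, fake_llm, fake_env), skills)
```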

* Evaluating Cognitive Maps and Planning in Large Language Models with CogEval (Momennejad et al)

* Recently an influx of studies claims emergent cognitive abilities in large language models (LLMs). Yet, most rely on anecdotes, overlook contamination of training sets, or lack systematic evaluation involving multiple tasks, control conditions, multiple iterations, and statistical robustness tests. Here we make two major contributions.

* First, we propose CogEval, a cognitive science-inspired protocol for the systematic evaluation of cognitive capacities in LLMs. The CogEval protocol can be followed for the evaluation of various abilities.

* Second, here we follow CogEval to systematically evaluate cognitive maps and planning ability across eight LLMs (OpenAI GPT-4, GPT-3.5-turbo-175B, davinci-003-175B, Google Bard, Cohere-xlarge-52.4B, Anthropic Claude-1-52B, LLaMA-13B, and Alpaca-7B). We base our task prompts on human experiments, which offer both established construct validity for evaluating planning, and are absent from LLM training sets.

* We find that, while LLMs show apparent competence in a few planning tasks with simpler structures, systematic evaluation reveals striking failure modes in planning tasks, including hallucinations of invalid trajectories and falling in loops. These findings do not support the idea of emergent out-of-the-box planning ability in LLMs. This could be because LLMs do not understand the latent relational structures underlying planning problems, known as cognitive maps, and fail at unrolling goal-directed trajectories based on the underlying structure. Implications for application and future directions are discussed.

* Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Albert Gu, Tri Dao)

* Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements.

* First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token.

* Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba).

* Mamba enjoys fast inference (5x higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-1.4B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.
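The "selective" part can be summarized in a few lines: the SSM's B, C, and step-size parameters become functions of the current token, which is what lets the model keep or forget state per input. A heavily simplified, single-output sketch with a sequential loop; the real model is per-channel, uses a learned discretization, and replaces this loop with a hardware-aware parallel scan.

```python
import torch

def selective_ssm(x, W_B, W_C, W_dt, A):
    """Toy selective state-space recurrence (sequential, unbatched, simplified).

    x:    (seq_len, d_model) input tokens
    A:    (d_state,) fixed negative "decay" parameters
    W_B, W_C, W_dt: projections that make B, C and the step size input-dependent.
    """
    seq_len, _ = x.shape
    h = torch.zeros(A.shape[0])
    ys = []
    for t in range(seq_len):
        dt = torch.nn.functional.softplus(x[t] @ W_dt)  # input-dependent step size
        B = x[t] @ W_B                                  # input-dependent (d_state,)
        C = x[t] @ W_C                                  # input-dependent (d_state,)
        A_bar = torch.exp(dt * A)                       # discretized decay
        h = A_bar * h + dt * B * x[t].mean()            # crude scalar input injection
        ys.append((C * h).sum())                        # collapse channels to one output
    return torch.stack(ys)

d_model, d_state, seq_len = 16, 8, 32
x = torch.randn(seq_len, d_model)
out = selective_ssm(
    x,
    W_B=torch.randn(d_model, d_state) * 0.1,
    W_C=torch.randn(d_model, d_state) * 0.1,
    W_dt=torch.randn(d_model) * 0.1,
    A=-torch.rand(d_state),
)
print(out.shape)  # torch.Size([32])
```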


Get full access to Latent Space at www.latent.space/subscribe
2023-12-24
Link to episode

The AI-First Graphics Editor - with Suhail Doshi of Playground AI

We are running an end of year survey for our listeners! Please let us know any feedback you have, what episodes resonated with you, and guest requests for 2024! Survey link here!

Listen to the end for a little surprise from Suhail.

Before language models became all the rage in November 2022, image generation was the hottest space in AI (it was the subject of our first piece on Latent Space!). In our interview with Sharif Shameem from Lexica we talked through the launch of Stable Diffusion and the early days of that space. At the time, the toolkit was still pretty rudimentary: Lexica made it easy to search images, you had the AUTOMATIC1111 Web UI to generate locally, some HuggingFace spaces that offered inference, and eventually DALL-E 2 through OpenAI's platform, but not much beyond basic text-to-image workflows.

Today's guest, Suhail Doshi, is trying to solve this with Playground AI, an image editor reimagined with AI in mind. Some of the differences compared to traditional text-to-image workflows:

* Real-time preview rendering using consistency: as you change your prompt, you can see changes in real-time before doing a final rendering of it.

* Style filtering: rather than having to prompt exactly how you'd like an image to look, you can pick from a whole range of filters both from Playground's model as well as Stable Diffusion (like RealVis, Starlight XL, etc). We talk about this at 25:46 in the podcast.

* Expand prompt: similar to DALL-E 3, Playground will do some prompt tuning for you to get better results in generation. Unlike DALL-E 3, you can turn this off at any time if you are a prompting wizard.

* Image editing: after generation, you have tools like a magic eraser, inpainting pencil, etc. This makes it easier to do a full workflow in Playground rather than switching to another tool like Photoshop.

Outside of the product, they have also trained a new model from scratch, Playground v2, which is fully open source and open weights and allows for commercial usage.

They benchmarked the model against SDXL across 1,000 prompts and found that humans preferred the Playground generation 70% of the time. They had similar results on PartiPrompts:

They also created a new benchmark, MJHQ-30K, for "aesthetic quality":

We introduce a new benchmark, MJHQ-30K, for automatic evaluation of a model's aesthetic quality. The benchmark computes FID on a high-quality dataset to gauge aesthetic quality.

We curate the high-quality dataset from Midjourney with 10 common categories, each category with 3K samples. Following common practice, we use aesthetic score and CLIP score to ensure high image quality and high image-text alignment. Furthermore, we take extra care to make the data diverse within each category.
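For readers who want to reproduce this kind of number, here is a hedged sketch of computing FID between a model's generations and a reference set using torchmetrics. The directory names are placeholders, and the real benchmark buckets results by category; this only shows the mechanics of the metric.

```python
import torch
from pathlib import Path
from torchvision.io import read_image, ImageReadMode
from torchvision.transforms.functional import resize
from torchmetrics.image.fid import FrechetInceptionDistance

def load_batch(folder: str) -> torch.Tensor:
    """Load images as a uint8 (N, 3, 299, 299) tensor, as torchmetrics expects."""
    imgs = [resize(read_image(str(p), mode=ImageReadMode.RGB), [299, 299])
            for p in Path(folder).glob("*.jpg")]
    return torch.stack(imgs)

fid = FrechetInceptionDistance(feature=2048)
fid.update(load_batch("mjhq30k/people"), real=True)       # reference set (placeholder path)
fid.update(load_batch("generations/people"), real=False)  # your model's samples
print(float(fid.compute()))                               # lower is better
```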

Suhail was pretty open with saying that Midjourney is currently the best product for image generation out there, and that's why they used it as the base for this benchmark.

I think it's worth comparing yourself to maybe the best thing and try to find like a really fair way of doing that. So I think more people should try to do that. I definitely don't think you should be kind of comparing yourself on like some Google model or some old SD, Stable Diffusion model and be like, look, we beat Stable Diffusion 1.5. I think users ultimately want care, how close are you getting to the thing that people mostly agree with? [00:23:47]

We also talked a lot about Suhail's founder journey from starting Mixpanel in 2009, then going through YC again with Mighty, and eventually sunsetting that to pivot into Playground. Enjoy!

Show Notes

* Suhail's Twitter

* "Starting my road to learn AI"

* Bill Gates book trip

* Playground

* Playground v2 Announcement

* $40M raise announcement

* "Running infra dev ops for 24 A100s"

* Mixpanel

* Mighty

* "I decided to stop working on Mighty"

* Fast.ai

* Civit

Timestamps

* [00:00:00] Intros

* [00:02:59] Being early in ML at Mixpanel

* [00:04:16] Pivoting from Mighty to Playground and focusing on generative AI

* [00:07:54] How DALL-E 2 inspired Mighty

* [00:09:19] Reimagining the graphics editor with AI

* [00:17:34] Training the Playground V2 model from scratch to advance generative graphics

* [00:21:11] Techniques used to improve Playground V2 like data filtering and model tuning

* [00:25:21] Releasing the MJHQ30K benchmark to evaluate generative models

* [00:30:35] The limitations of current models for detailed image editing tasks

* [00:34:06] Using post-generation user feedback to create better benchmarks

* [00:38:28] Concerns over potential misuse of powerful generative models

* [00:41:54] Rethinking the graphics editor user experience in the AI era

* [00:45:44] Integrating consistency models into Playground using preview rendering

* [00:47:23] Interacting with the Stable Diffusion LoRAs community

* [00:51:35] Running DevOps on A100s

* [00:53:12] Startup ideas?

Transcript

Alessio: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. [00:00:15]

Swyx: Hey, and today in the studio we have Suhail Doshi, welcome. [00:00:18]

Suhail: Yeah, thanks. Thanks for having me. [00:00:20]

Swyx: So among many things, you're a CEO and co-founder of Mixpanel, and I think about three years ago you left to start Mighty, and more recently, I think about a year ago, transitioned into Playground, and you've just announced your new round. How do you like to be introduced beyond that? [00:00:34]

Suhail: Just founder of Playground is fine, yeah, prior co-founder and CEO of Mixpanel. [00:00:40]

Swyx: Yeah, awesome. I'd just like to touch on Mixpanel a little bit, because it's obviously one of the more successful analytics companies we previously had amplitude on, and I'm curious if you had any reflections on the interaction of that amount of data that people would want to use for AI. I don't know if there's still a part of you that stays in touch with that world. [00:00:59]

Suhail: Yeah, I mean, the short version is that maybe back in like 2015 or 2016, I don't really remember exactly, because it was a while ago, we had an ML team at Mixpanel, and I think this is when maybe deep learning or something really just started getting kind of exciting, and we were thinking that maybe given that we had such vast amounts of data, perhaps we could predict things. So we built two or three different features, I think we built a feature where we could predict whether users would churn from your product. We made a feature that could predict whether users would convert, we built a feature that could do anomaly detection, like if something occurred in your product, that was just very surprising, maybe a spike in traffic in a particular region, can we tell you that that happened? Because it's really hard to like know everything that's going on with your data, can we tell you something surprising about your data? And we tried all of these various features, most of it boiled down to just like, you know, using logistic regression, and it never quite seemed very groundbreaking in the end. And so I think, you know, we had a four or five person ML team, and I think we never expanded it from there. And I did all these Fast AI courses trying to learn about ML. And that was the- That's the first time you did fast AI. Yeah, that was the first time I did fast AI. Yeah, I think I've done it now three times, maybe. [00:02:12]

Swyx: Oh, okay. [00:02:13]

Suhail: I didn't know it was the third. No, no, just me reviewing it, it's maybe three times, but yeah. [00:02:16]

Swyx: You mentioned prediction, but honestly, like it's also just about the feedback, right? The quality of feedback from users, I think it's useful for anyone building AI applications. [00:02:25]

Suhail: Yeah. Yeah, I think I haven't spent a lot of time thinking about Mixpanel because it's been a long time, but sometimes I'm like, oh, I wonder what we could do now. And then I kind of like move on to whatever I'm working on, but things have changed significantly since. [00:02:39]

Swyx: And then maybe we'll touch on Mighty a little bit. Mighty was very, very bold. My framing of it was, you will run our browsers for us because everyone has too many tabs open. I have too many tabs open and slowing down your machines that you can do it better for us in a centralized data center. [00:02:51]

Suhail: Yeah, we were first trying to make a browser that we would stream from a data center to your computer at extremely low latency, but the real objective wasn't trying to make a browser or anything like that. The real objective was to try to make a new kind of computer. And the thought was just that like, you know, we have these computers in front of us today and we upgrade them or they run out of RAM or they don't have enough RAM or not enough disk or, you know, there's some limitation with our computers, perhaps like data locality is a problem. Why do I need to think about upgrading my computer ever? And so, you know, we just had to kind of observe that like, well, actually it seems like a lot of applications are just now in the browser, you know, it's like how many real desktop applications do we use relative to the number of applications we use in the browser? So it's just this realization that actually like, you know, the browser was effectively becoming more or less our operating system over time. And so then that's why we kind of decided to go, hmm, maybe we can stream the browser. Fortunately, the idea did not work for a couple of different reasons, but the objective is try to make sure new computer. [00:03:50]

Swyx: Yeah, very, very bold. [00:03:51]

Alessio: Yeah, and I was there at YC Demo Day when you first announced it. It was, I think, the last or one of the last in-person ones, at Pier34 in Mission Bay. How do you think about that now when everybody wants to put some of these models in people's machines and some of them want to stream them in, do you think there's maybe another wave of the same problem before it was like browser apps too slow, now it's like models too slow to run on device? [00:04:16]

Suhail: Yeah. I mean, I've obviously pivoted away from Mighty, but a lot of what I somewhat believed at Mighty, maybe why I'm so excited about AI and what's happening, a lot of what Mighty was about was like moving compute somewhere else, right? Right now, applications, they get limited quantities of memory, disk, networking, whatever your home network has, et cetera. You know, what if these applications could somehow, if we could shift compute, and then these applications have vastly more compute than they do today. Right now it's just like client backend services, but you know, what if we could change the shape of how applications could interact with things? And it's changed my thinking. In some ways, AI has like a bit of a continuation of my belief that like perhaps we can really shift compute somewhere else. One of the problems with Mighty was that JavaScript is single-threaded in the browser. And what we learned, you know, the reason why we kind of abandoned Mighty was because I didn't believe we could make a new kind of computer. We could have made some kind of enterprise business, probably it could have made maybe a lot of money, but it wasn't going to be what I hoped it was going to be. And so once I realized that most of a web app is just going to be single-threaded JavaScript, then the only thing you could do largely withstanding changing JavaScript, which is a fool's errand most likely, make a better CPU, right? And there's like three CPU manufacturers, two of which sell, you know, big ones, you know, AMD, Intel, and then of course like Apple made the M1. And it's not like single-threaded CPU core performance, single-core performance was increasing very fast, it's plateauing rapidly. And even these different companies were not doing as good of a job, you know, sort of with the continuation of Moore's law. But what happened in AI was that you got like, if you think of the AI model as like a computer program, like just like a compiled computer program, it is literally built and designed to do massive parallel computations. And so if you could take like the universal approximation theorem to its like kind of logical complete point, you know, you're like, wow, I can get, make computation happen really rapidly and parallel somewhere else, you know, so you end up with these like really amazing models that can like do anything. It just turned out like perhaps the new kind of computer would just simply be shifted, you know, into these like really amazing AI models in reality. Yeah. [00:06:30]

Swyx: Like I think Andrej Karpathy has always been, has been making a lot of analogies with the LLMOS. [00:06:34]

Suhail: I saw his video and I watched that, you know, maybe two weeks ago or something like that. I was like, oh man, this, I very much resonate with this like idea. [00:06:41]

Swyx: Why didn't I see this three years ago? [00:06:43]

Suhail: Yeah. I think, I think there still will be, you know, local models and then there'll be these very large models that have to be run in data centers. I think it just depends on kind of like the right tool for the job, like any engineer would probably care about. But I think that, you know, by and large, like if the models continue to kind of keep getting bigger, you're always going to be wondering whether you should use the big thing or the small, you know, the tiny little model. And it might just depend on like, you know, do you need 30 FPS or 60 FPS? Maybe that would be hard to do, you know, over a network. [00:07:13]

Swyx: You tackled a much harder problem latency wise than the AI models actually require. Yeah. [00:07:18]

Suhail: Yeah. You can do quite well. You can do quite well. You definitely did 30 FPS video streaming, did very crazy things to make that work. So I'm actually quite bullish on the kinds of things you can do with networking. [00:07:30]

Swyx: Maybe someday you'll come back to that at some point. But so for those that don't know, you're very transparent on Twitter. Very good to follow you just to learn your insights. And you actually published a postmortem on Mighty that people can read up on and willing to. So there was a bit of an overlap. You started exploring the AI stuff in June 2022, which is when you started saying like, I'm taking fast AI again. Maybe, was there more context around that? [00:07:54]

Suhail: Yeah. I think I was kind of like waiting for the team at Mighty to finish up, you know, something. And I was like, okay, well, what can I do? I guess I will make some kind of like address bar predictor in the browser. So we had, you know, we had forked Chrome and Chromium. And I was like, you know, one thing that's kind of lame is that like this browser should be like a lot better at predicting what I might do, where I might want to go. It struck me as really odd that, you know, Chrome had very little AI actually or ML inside this browser. For a company like Google, you'd think there's a lot. Code is actually just very, you know, it's just a bunch of if then statements is more or less the address bar. So it seemed like a pretty big opportunity. And that's also where a lot of people interact with the browser. So, you know, long story short, I was like, hmm, I wonder what I could build here. So I started to take some AI courses and review the material again and get back to figuring it out. But I think that was somewhat serendipitous because right around April was, I think, a very big watershed moment in AI because that's when Dolly 2 came out. And I think that was the first truly big viral moment for generative AI. [00:08:59]

Swyx: Because of the avocado chair. [00:09:01]

Suhail: Yeah, exactly. [00:09:02]

Swyx: It wasn't as big for me as Stable Diffusion. [00:09:04]

Suhail: Really? [00:09:05]

Swyx: Yeah, I don't know. Dolly was like, all right, that's cool. [00:09:07]

Suhail: I don't know. Yeah. [00:09:09]

Swyx: I mean, they had some flashy videos, but it didn't really register. [00:09:13]

Suhail: That moment of images was just such a viral novel moment. I think it just blew people's mind. Yeah. [00:09:19]

Swyx: I mean, it's the first time I encountered Sam Altman because they had this Dolly 2 hackathon and they opened up the OpenAI office for developers to walk in back when it wasn't as much of a security issue as it is today. I see. Maybe take us through the journey to decide to pivot into this and also choosing images. Obviously, you were inspired by Dolly, but there could be any number of AI companies and businesses that you could start and why this one, right? [00:09:45]

Suhail: Yeah. So I think at that time, Mighty and OpenAI was not quite as popular as it is all of a sudden now these days, but back then they had a lot more bandwidth to kind of help anybody. And so we had been talking with the team there around trying to see if we could do really fast low latency address bar prediction with GPT-3 and 3.5 and that kind of thing. And so we were sort of figuring out how could we make that low latency. I think that just being able to talk to them and kind of being involved gave me a bird's eye view into a bunch of things that started to happen. Latency first was the Dolly 2 moment, but then stable diffusion came out and that was a big moment for me as well. And I remember just kind of like sitting up one night thinking, I was like, you know, what are the kinds of companies one could build? Like what matters right now? One thing that I observed is that I find a lot of inspiration when I'm working in a field in something and then I can identify a bunch of problems. Like for Mixpanel, I was an intern at a company and I just noticed that they were doing all this data analysis. And so I thought, hmm, I wonder if I could make a product and then maybe they would use it. And in this case, you know, the same thing kind of occurred. It was like, okay, there are a bunch of like infrastructure companies that put a model up and then you can use their API, like Replicate is a really good example of that. There are a bunch of companies that are like helping you with training, model optimization, Mosaic at the time, and probably still, you know, was doing stuff like that. So I just started listing out like every category of everything, of every company that was doing something interesting. I started listing out like weights and biases. I was like, oh man, weights and biases is like this great company. Do I want to compete with that company? I might be really good at competing with that company because of Mixpanel because it's so much of like analysis. But I was like, no, I don't want to do anything related to that. That would, I think that would be too boring now at this point. So I started to list out all these ideas and one thing I observed was that at OpenAI, they had like a playground for GPT-3, right? All it was is just like a text box more or less. And then there were some settings on the right, like temperature and whatever. [00:11:41]

Swyx: Top K. [00:11:42]

Suhail: Yeah, top K. You know, what's your end stop sequence? I mean, that was like their product before GPT, you know, really difficult to use, but fun if you're like an engineer. And I just noticed that their product kind of was evolving a little bit where the interface kind of was getting a little bit more complex. They had like a way where you could like generate something in the middle of a sentence and all those kinds of things. And I just thought to myself, I was like, everything is just like this text box and you generate something and that's about it. And stable diffusion had kind of come out and it was all like hugging face and code. Nobody was really building any UI. And so I had this kind of thing where I wrote prompt dash like question mark in my notes and I didn't know what was like the product for that at the time. I mean, it seems kind of trite now, but I just like wrote prompt. What's the thing for that? Manager. Prompt manager. Do you organize them? Like, do you like have a UI that can play with them? Yeah. Like a library. What would you make? And so then, of course, then you thought about what would the modalities be given that? How would you build a UI for each kind of modality? And so there are a couple of people working on some pretty cool things. And I basically chose graphics because it seemed like the most obvious place where you could build a really powerful, complex UI. That's not just only typing a box. It would very much evolve beyond that. Like what would be the best thing for something that's visual? Probably something visual. Yeah. I think that just that progression kind of happened and it just seemed like there was a lot of effort going into language, but not a lot of effort going into graphics. And then maybe the very last thing was, I think I was talking to Aditya Ramesh, who was the co-creator of DALL-E 2 and Sam. And I just kind of went to these guys and I was just like, hey, are you going to make like a UI for this thing? Like a true UI? Are you going to go for this? Are you going to make a product? For DALL-E. Yeah. For DALL-E. Yeah. Are you going to do anything here? Because if you are going to do it, just let me know and I will stop and I'll go do something else. But if you're not going to do anything, I'll just do it. And so we had a couple of conversations around what that would look like. And then I think ultimately they decided that they were going to focus on language primarily. And I just felt like it was going to be very underinvested in. Yes. [00:13:46]

Swyx: There's that sort of underinvestment from OpenAI, but also it's a different type of customer than you're used to, presumably, you know, and Mixpanel is very good at selling to B2B and developers will figure on you or not. Yeah. Was that not a concern? [00:14:00]

Suhail: Well, not so much because I think that, you know, right now I would say graphics is in this very nascent phase. Like most of the customers are just like hobbyists, right? Yeah. Like it's a little bit of like a novel toy as opposed to being this like very high utility thing. But I think ultimately, if you believe that you could make it very high utility, the probably the next customers will end up being B2B. It'll probably not be like a consumer. There will certainly be a variation of this idea that's in consumer. But if your quest is to kind of make like something that surpasses human ability for graphics, like ultimately it will end up being used for business. So I think it's maybe more of a progression. In fact, for me, it's maybe more like Mixpanel started out as SMB and then very much like ended up starting to grow up towards enterprise. So for me, I think it will be a very similar progression. But yeah, I mean, the reason why I was excited about it is because it was a creative tool. I make music and it's AI. It's like something that I know I could stay up till three o'clock in the morning doing. Those are kind of like very simple bars for me. [00:14:56]

Alessio: So you mentioned Dolly, Stable Diffusion. You just had Playground V2 come out two days ago. Yeah, two days ago. [00:15:02]

Suhail: Two days ago. [00:15:03]

Alessio: This is a model you train completely from scratch. So it's not a cheap fine tune on something. You open source everything, including the weights. Why did you decide to do it? I know you supported Stable Diffusion XL in Playground before, right? Yep. What made you want to come up with V2 and maybe some of the interesting, you know, technical research work you've done? [00:15:24]

Suhail: Yeah. So I think that we continue to feel like graphics and these foundation models for anything really related to pixels, but also definitely images continues to be very underinvested. It feels a little like graphics is in like this GPT-2 moment, right? Like even GPT-3, even when GPT-3 came out, it was exciting, but it was like, what are you going to use this for? Yeah, we'll do some text classification and some semantic analysis and maybe it'll sometimes like make a summary of something and it'll hallucinate. But no one really had like a very significant like business application for GPT-3. And in images, we're kind of stuck in the same place. We're kind of like, okay, I write this thing in a box and I get some cool piece of artwork and the hands are kind of messed up and sometimes the eyes are a little weird. Maybe I'll use it for a blog post, you know, that kind of thing. The utility feels so limited. And so, you know, and then we, you sort of look at Stable Diffusion and we definitely use that model in our product and our users like it and use it and love it and enjoy it, but it hasn't gone nearly far enough. So we were kind of faced with the choice of, you know, do we wait for progress to occur or do we make that progress happen? So yeah, we kind of embarked on a plan to just decide to go train these things from scratch. And I think the community has given us so much. The community for Stable Diffusion I think is one of the most vibrant communities on the internet. It's like amazing. It feels like, I hope this is what like Homebrew Club felt like when computers like showed up because it's like amazing what that community will do and it moves so fast. I've never seen anything in my life and heard other people's stories around this where an academic research paper comes out and then like two days later, someone has sample code for it. And then two days later, there's a model. And then two days later, it's like in nine products, you know, they're all competing with each other. It's incredible to see like math symbols on an academic paper go to well-designed features in a product. So I think the community has done so much. So I think we wanted to give back to the community kind of on our way. Certainly we would train a better model than what we gave out on Tuesday, but we definitely felt like there needs to be some kind of progress in these open source models. The last kind of milestone was in July when Stable Diffusion Excel came out, but there hasn't been anything really since. Right. [00:17:34]

Swyx: And there's Excel Turbo now. [00:17:35]

Suhail: Well, Excel Turbo is like this distilled model, right? So it's like lower quality, but fast. You have to decide, you know, what your trade off is there. [00:17:42]

Swyx: It's also a consistency model. [00:17:43]

Suhail: I don't think it's a consistency model. It's like it's they did like a different thing. Yeah. I think it's like, I don't want to get quoted for this, but it's like something called ad like adversarial or something. [00:17:52]

Swyx: That's exactly right. [00:17:53]

Suhail: I've read something about that. Maybe it's like closer to GANs or something, but I didn't really read the full paper. But yeah, there hasn't been quite enough progress in terms of, you know, there's no multitask image model. You know, the closest thing would be something called like EmuEdit, but there's no model for that. It's just a paper that's within meta. So we did that and we also gave out pre-trained weights, which is very rare. Usually you just get the aligned model and then you have to like see if you can do anything with it. So we actually gave out, there's like a 256 pixel pre-trained stage and a 512. And we did that for academic research because we come across people all the time in academia, they have access to like one A100 or eight at best. And so if we can give them kind of like a 512 pre-trained model, our hope is that there'll be interesting novel research that occurs from that. [00:18:38]

Swyx: What research do you want to happen? [00:18:39]

Suhail: I would love to see more research around things that users care about tend to be things like character consistency. [00:18:45]

Swyx: Between frames? [00:18:46]

Suhail: More like if you have like a face. Yeah, yeah. Basically between frames, but more just like, you know, you have your face and it's in one image and then you want it to be like in another. And users are very particular and sensitive to faces changing because we know we're trained on faces as humans. Not seeing a lot of innovation, enough innovation around multitask editing. You know, there are two things like instruct pics to pics and then the EmuEdit paper that are maybe very interesting, but we certainly are not pushing the fold on that in that regard. All kinds of things like around that rotation, you know, being able to keep coherence across images, style transfer is still very limited. Just even reasoning around images, you know, what's going on in an image, that kind of thing. Things are still very, very underpowered, very nascent. So therefore the utility is very, very limited. [00:19:32]

Alessio: On the 1K Prompt Benchmark, you are 2.5x prefer to Stable Diffusion XL. How do you get there? Is it better images in the training corpus? Can you maybe talk through the improvements in the model? [00:19:44]

Suhail: I think they're still very early on in the recipe, but I think it's a lot of like little things and you know, every now and then there are some big important things like certainly your data quality is really, really important. So we spend a lot of time thinking about that. But I would say it's a lot of things that you kind of clean up along the way as you train your model. Everything from captions to the data that you align with after pre-train to how you're picking your data sets, how you filter your data sets. I feel like there's a lot of work in AI that doesn't really feel like AI. It just really feels like just data set filtering and systems engineering and just like, you know, and the recipe is all there, but it's like a lot of extra work to do that. I think we plan to do a Playground V 2.1, maybe either by the end of the year or early next year. And we're just like watching what the community does with the model. And then we're just going to take a lot of the things that they're unhappy about and just like fix them. You know, so for example, like maybe the eyes of people in an image don't feel right. They feel like they're a little misshapen or they're kind of blurry feeling. That's something that we already know we want to fix. So I think in that case, it's going to be about data quality. Or maybe you want to improve the kind of the dynamic range of color. You know, we want to make sure that that's like got a good range in any image. So what technique can we use there? There's different things like offset noise, pyramid noise, terminal zero, SNR, like there are all these various interesting things that you can do. So I think it's like a lot of just like tricks. Some are tricks, some are data, and some is just like cleaning. [00:21:11]

Swyx: Specifically for faces, it's very common to use a pipeline rather than just train the base model more. Do you have a strong belief either way on like, oh, they should be separated out to different stages for like improving the eyes, improving the face or enhance or whatever? Or do you think like it can all be done in one model? [00:21:28]

Suhail: I think we will make a unified model. Yeah, I think it will. I think we'll certainly in the end, ultimately make a unified model. There's not enough research about this. Maybe there is something out there that we haven't read. There are some bottlenecks, like for example, in the VAE, like the VAEs are ultimately like compressing these things. And so you don't know. And then you might have like a big informational information bottleneck. So maybe you would use a pixel based model, perhaps. I think we've talked to people, everyone from like Rombach to various people, Rombach trained stable diffusion. I think there's like a big question around the architecture of these things. It's still kind of unknown, right? Like we've got transformers and we've got like a GPT architecture model, but then there's this like weird thing that's also seemingly working with diffusion. And so, you know, are we going to use vision transformers? Are we going to move to pixel based models? Is there a different kind of architecture? We don't really, I don't think there have been enough experiments. Still? Oh my God. [00:22:21]

Swyx: Yeah. [00:22:22]

Suhail: That's surprising. I think it's very computationally expensive to do a pipeline model where you're like fixing the eyes and you're fixing the mouth and you're fixing the hands. [00:22:29]

Swyx: That's what everyone does as far as I understand. [00:22:31]

Suhail: I'm not exactly sure what you mean, but if you mean like you get an image and then you will like make another model specifically to fix a face, that's fairly computationally expensive. And I think it's like not probably not the right way. Yeah. And it doesn't generalize very well. Now you have to pick all these different things. [00:22:45]

Swyx: Yeah. You're just kind of glomming things on together. Yeah. Like when I look at AI artists, like that's what they do. [00:22:50]

Suhail: Ah, yeah, yeah, yeah. They'll do things like, you know, I think a lot of ARs will do control net tiling to do kind of generative upscaling of all these different pieces of the image. Yeah. And I think these are all just like, they're all hacks ultimately in the end. I mean, it just to me, it's like, let's go back to where we were just three years, four years ago with where deep learning was at and where language was that, you know, it's the same thing. It's like we were like, okay, well, I'll just train these very narrow models to try to do these things and kind of ensemble them or pipeline them to try to get to a best in class result. And here we are with like where the models are gigantic and like very capable of solving huge amounts of tasks when given like lots of great data. [00:23:28]

Alessio: You also released a new benchmark called MJHQ30K for automatic evaluation of a model's aesthetic quality. I have one question. The data set that you use for the benchmark is from Midjourney. Yes. You have 10 categories. How do you think about the Playground model, Midjourney, like, are you competitors? [00:23:47]

Suhail: There are a lot of people, a lot of people in research, they like to compare themselves to something they know they can beat, right? Maybe this is the best reason why it can be helpful to not be a researcher also sometimes like I'm not trained as a researcher, I don't have a PhD in anything AI related, for example. But I think if you care about products and you care about your users, then the most important thing that you want to figure out is like everyone has to acknowledge that Midjourney is very good. They are the best at this thing. I'm happy to admit that. I have no problem admitting that. Just easy. It's very visual to tell. So I think it's incumbent on us to try to compare ourselves to the thing that's best, even if we lose, even if we're not the best. At some point, if we are able to surpass Midjourney, then we only have ourselves to compare ourselves to. But on First Blush, I think it's worth comparing yourself to maybe the best thing and try to find like a really fair way of doing that. So I think more people should try to do that. I definitely don't think you should be kind of comparing yourself on like some Google model or some old SD, Stable Diffusion model and be like, look, we beat Stable Diffusion 1.5. I think users ultimately want care, how close are you getting to the thing that people mostly agree with? So we put out that benchmark for no other reason to say like, this seems like a worthy thing for us to at least try, for people to try to get to. And then if we surpass it, great, we'll come up with another one. [00:25:06]

Alessio: Yeah, no, that's awesome. And you killed Stable Diffusion Excel and everything. In the benchmark chart, it says Playground V2 1024 pixel dash aesthetic. Do you have kind of like, yeah, style fine tunes or like what's the dash aesthetic for? [00:25:21]

Suhail: We debated this, maybe we named it wrong or something, but we were like, how do we help people realize the model that's aligned versus the models that weren't? Because we gave out pre-trained models, we didn't want people to like use those. So that's why they're called base. And then the aesthetic model, yeah, we wanted people to pick up the thing that makes things pretty. Who wouldn't want the thing that's aesthetic? But if there's a better name, we're definitely open to feedback. No, no, that's cool. [00:25:46]

Alessio: I was using the product. You also have the style filter and you have all these different styles. And it seems like the styles are tied to the model. So there's some like SDXL styles, there's some Playground V2 styles. Can you maybe give listeners an overview of how that works? Because in language, there's not this idea of like style, right? Versus like in vision model, there is, and you cannot get certain styles in different [00:26:11]

Suhail: models. [00:26:12]

Alessio: So how do styles emerge and how do you categorize them and find them? [00:26:15]

Suhail: Yeah, I mean, it's so fun having a community where people are just trying a model. Like it's only been two days for Playground V2. And we actually don't know what the model's capable of and not capable of. You know, we certainly see problems with it. But we have yet to see what emergent behavior is. I mean, we've just sort of discovered that it takes about like a week before you start to see like new things. I think like a lot of that style kind of emerges after that week, where you start to see, you know, there's some styles that are very like well known to us, like maybe like pixel art is a well known style. Photorealism is like another one that's like well known to us. But there are some styles that cannot be easily named. You know, it's not as simple as like, okay, that's an anime style. It's very visual. And in the end, you end up making up the name for what that style represents. And so the community kind of shapes itself around these different things. And so if anyone that's into stable diffusion and into building anything with graphics and stuff with these models, you know, you might have heard of like Proto Vision or Dream Shaper, some of these weird names, but they're just invented by these authors. But they have a sort of je ne sais quoi that, you know, appeals to users. [00:27:26]

Swyx: Because it like roughly embeds to what you what you want. [00:27:29]

Suhail: I guess so. I mean, it's like, you know, there's one of my favorite ones that's fine tuned. It's not made by us. It's called like Starlight XL. It's just this beautiful model. It's got really great color contrast and visual elements. And the users love it. I love it. And it's so hard. I think that's like a very big open question with graphics that I'm not totally sure how we'll solve. I don't know. It's, it's like an evolving situation too, because styles get boring, right? They get fatigued. Like it's like listening to the same style of pop song. I try to relate to graphics a little bit like with music, because I think it gives you a little bit of a different shape to things. Like it's not as if we just have pop music, rap music and country music, like all of these, like the EDM genre alone has like sub genres. And I think that's very true in graphics and painting and art and anything that we're doing. There's just these sub genres, even if we can't quite always name them. But I think they are emergent from the community, which is why we're so always happy to work with the community. [00:28:26]

Swyx: That is a struggle. You know, coming back to this, like B2B versus B2C thing, B2C, you're going to have a huge amount of diversity and then it's going to reduce as you get towards more sort of B2B type use cases. I'm making this up here. So like you might be optimizing for a thing that you may eventually not need. [00:28:42]

Suhail: Yeah, possibly. Yeah, possibly. I think like a simple thing with startups is that I worry sometimes by trying to be overly ambitious and like really scrutinizing like what something is in its most nascent phase that you miss the most ambitious thing you could have done. Like just having like very basic curiosity with something very small can like kind of lead you to something amazing. Like Einstein definitely did that. And then he like, you know, he basically won all the prizes and got everything he wanted and then basically did like kind of didn't really. He can dismiss quantum and then just kind of was still searching, you know, for the unifying theory. And he like had this quest. I think that happens a lot with like Nobel Prize people. I think there's like a term for it that I forget. I actually wanted to go after a toy almost intentionally so long as that I could see, I could imagine that it would lead to something very, very large later. Like I said, it's very hobbyist, but you need to start somewhere. You need to start with something that has a big gravitational pull, even if these hobbyists aren't likely to be the people that, you know, have a way to monetize it or whatever, even if they're, but they're doing it for fun. So there's something, something there that I think is really important. But I agree with you that, you know, in time we will absolutely focus on more utilitarian things like things that are more related to editing feats that are much harder. And so I think like a very simple use case is just, you know, I'm not a graphics designer. It seems like very simple that like you, if we could give you the ability to do really complex graphics without skill, wouldn't you want that? You know, like my wife the other day was set, you know, said, I wish Playground was better. When are you guys going to have a feature where like we could make my son, his name's Devin, smile when he was not smiling in the picture for the holiday card. Right. You know, just being able to highlight his, his mouth and just say like, make him smile. Like why can't we do that with like high fidelity and coherence, little things like that, all the way to putting you in completely different scenarios. [00:30:35]

Swyx: Is that true? Can we not do that with inpainting? [00:30:37]

Suhail: You can do it with inpainting, but the quality is just so bad. Yeah. It's just really terrible quality. You know, it's like you'll do it five times and it'll still kind of look crooked or just have artifacts. Part of it's like, you know, the lips on the face, there's such little information there. So small that the models really struggle with it. Yeah. [00:30:55]

Swyx: Make the picture smaller and you don't see it. That's my trick. I don't know. [00:30:59]

Suhail: Yeah. Yeah. That's true. Or, you know, you could take that region and make it really big and then like say it's a mouth and then like shrink it. It feels like you're wrestling with it more than it's doing something that kind of surprises you. [00:31:12]
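For readers who want to try the workflow being described, here is a minimal masked-inpainting sketch with the diffusers library. The model ID, file paths, and prompt are placeholders, and the crop-and-upscale trick Suhail mentions is noted only in a comment; this is illustrative, not Playground's pipeline.

```python
# Minimal inpainting sketch (assumes diffusers, torch, and PIL are installed; paths are placeholders).
import torch
from PIL import Image
from diffusers import AutoPipelineForInpainting

pipe = AutoPipelineForInpainting.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",  # assumed repo id; any SDXL inpainting checkpoint works
    torch_dtype=torch.float16,
).to("cuda")

init_image = Image.open("holiday_card.png").convert("RGB").resize((1024, 1024))
mask_image = Image.open("mouth_mask.png").convert("RGB").resize((1024, 1024))  # white = region to repaint

# The "wrestling" Suhail describes: a mouth covers very few pixels, so results are often incoherent.
# A common workaround is to crop the masked region, upscale it, inpaint at full resolution,
# then paste the result back into the original image.
result = pipe(
    prompt="a smiling boy, natural teeth, photo-realistic",
    image=init_image,
    mask_image=mask_image,
    strength=0.85,           # how aggressively the masked area is repainted
    num_inference_steps=30,
).images[0]
result.save("holiday_card_smiling.png")
```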

Swyx: Yeah. It feels like you are very much the internal tastemaker, like you carry in your head this vision for what a good art model should look like. Do you find it hard to like communicate it to like your team and other people? Just because it's obviously it's hard to put into words like we just said. [00:31:26]

Suhail: Yeah. It's very hard to explain. Images have such high bitrate compared to just words and we don't have enough words to describe these things. It's not terribly difficult. I think everyone on the team, if they don't have good kind of like judgment taste or like an eye for some of these things, they're like steadily building it because they have no choice. Right. So in that realm, I don't worry too much, actually. Like everyone is kind of like learning to get the eye is what I would call it. But I also have, you know, my own narrow taste. Like I don't represent the whole population either. [00:31:59]

Swyx: When you benchmark models, you know, like this benchmark we're talking about, we use FID. Yeah. Fréchet Inception Distance. OK. That's one measure. But like it doesn't capture anything you just said about smiles. [00:32:08]

Suhail: Yeah. FID is generally a bad metric. It's good up to a point and then it kind of like is irrelevant. Yeah. [00:32:14]
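For context, FID compares Inception-feature statistics of generated images against a reference set, which is exactly why it says nothing about smiles or lighting. A quick sketch of computing it, assuming the torchmetrics package; the random tensors are stand-ins for real image batches.

```python
# Sketch of computing FID with torchmetrics (assumes torch and torchmetrics[image] are installed).
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)  # InceptionV3 pooled features

# Real and generated images as uint8 tensors of shape (N, 3, H, W); random data here as a placeholder.
real_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(f"FID: {float(fid.compute()):.2f}")  # lower is "better", but it captures distribution match, not taste
```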

Swyx: And then so are there any other metrics that you like apart from vibes? I'm always looking for alternatives to vibes because vibes don't scale, you know. [00:32:22]

Suhail: You know, it might be fun to kind of talk about this because it's actually kind of fresh. So up till now, we haven't needed to do a ton of like benchmarking because we hadn't trained our own model and now we have. So now what? What does that mean? How do we evaluate it? And, you know, we're kind of like living with the last 48, 72 hours of going, did the way that we benchmark actually succeed? [00:32:43]

Swyx: Did it deliver? [00:32:44]

Suhail: Right. You know, like I think Gemini just came out. They just put out a bunch of benchmarks. But all these benchmarks are just an approximation of how you think it's going to end up with real world performance. And I think that's like very fascinating to me. So if you fake that benchmark, you'll still end up in a really bad scenario at the end of the day. And so, you know, one of the benchmarks we did was we kind of curated like a thousand prompts. And I think that's kind of what we published in our blog post, you know, of all these tasks, a lot of which are curated by our team because we know the models all suck at them. Like my favorite prompt that no model is really capable of is a horse riding an astronaut, the inverse one. And it's really, really hard to do. [00:33:22]

Swyx: Not in data. [00:33:23]

Suhail: You know, another one is like a giraffe underneath a microwave. How does that work? Right. There's so many of these little funny ones. We do. We have prompts that are just like misspellings of things. Yeah. We'll figure out if the models will figure it out. [00:33:36]

Swyx: They should embed to the same space. [00:33:39]

Suhail: Yeah. And just like all these very interesting weirdo things. And so we have so many of these and then we kind of like evaluate whether the models are any good at it. And the reality is that they're all bad at it. And so then you're just picking the most aesthetic image. We're still at the beginning of building like the best benchmark we can that aligns most with just user happiness, I think, because we're not we're not like putting these in papers and trying to like win, you know, I don't know, awards at ICCV or something if they have awards. You could. [00:34:05]

Swyx: That's absolutely a valid strategy. [00:34:06]

Suhail: Yeah, you could. But I don't think it could correlate necessarily with the impact we want to have on humanity. I think we're still evolving whatever our benchmarks are. So the first benchmark was just like very difficult tasks that we know the models are bad at. Can we come up with a thousand of these, whether they're hand rated and some of them are generated? And then can we ask the users, like, how do we do? And then we wanted to use a benchmark like Parti prompts. We mostly did that so people in academia could measure their models against ours versus others. But yeah, I mean, FID is pretty bad. And I think in terms of vibes, it's like you put out the model and then you try to see like what users make. And I think my sense is that we're going to take all the things that we notice that the users kind of were failing at and try to find like new ways to measure that, whether that's like a smile or, you know, color contrast or lighting. One benefit of Playground is that we have users making millions of images every single day. And so we can just ask them for like a post generation feedback. Yeah, we can just ask them. We can just say, like, how good was the lighting here? How was the subject? How was the background? [00:35:06]

Swyx: Like a proper form of like, it's just like you make it, you come to our site, you make [00:35:10]

Suhail: an image and then we say, and then maybe randomly you just say, hey, you know, like, how was the color and contrast of this image? And you say it was not very good, just tell us. So I think I think we can get like tens of thousands of these evaluations every single day to truly measure real world performance as opposed to just like benchmark performance. I would like to publish hopefully next year. I think we will try to publish a benchmark that anyone could use, that we evaluate ourselves on and that other people can, that we think does a good job of approximating real world performance because we've tried it and done it and noticed that it did. Yeah. I think we will do that. [00:35:45]

Swyx: I personally have a few like categories that I consider special. You know, you have like animals, art, fashion, food. There are some categories which I consider like a different tier of image. Top among them is text in images. How do you think about that? So one of the big wow moments for me, something I've been looking out for the entire year is just the progress of text in images. Like, can you write in an image? Yeah. And Ideogram came out recently, which had decent but not perfect text in images. DALL-E 3 had improved some and all they said in their paper was that they just included more text in the data set and it just worked. I was like, that's just lazy. But anyway, do you care about that? Because I don't see any of that in like your sample. Yeah, yeah. [00:36:27]

Suhail: The V2 model was mostly focused on image quality versus like the feature of text synthesis. [00:36:33]

Swyx: Well, as a business user, I care a lot about that. [00:36:35]

Suhail: Yeah. Yeah. I'm very excited about text synthesis. And yeah, I think Ideogram has done a good job, maybe the best job. DALL-E has like a hit rate. Yes. You know, like sometimes it's Egyptian letters. Yeah. I'm very excited about text synthesis. You know, I don't have much to say on it just yet. You know, you don't want just text effects. I think where this has to go is it has to be like you could like write little tiny pieces of text like on like a milk carton. That's maybe not even the focal point of a scene. I think that's like a very hard task that, you know, if you could do something like that, then there's a lot of other possibilities. Well, you don't have to zero shot it. [00:37:09]

Swyx: You can just be like here and focus on this. [00:37:12]

Suhail: Sure. Yeah, yeah. Definitely. Yeah. [00:37:16]

Swyx: Yeah. So I think text synthesis would be very exciting. I'll also flag Max Woolf, minimaxir, whose work you must have come across. He's done a lot of stuff about using like logo masks that then map onto food and vegetables. And it looks like text, which can be pretty fun. [00:37:29]

Suhail: That's the wonderful thing about like the open source community is that you get things like ControlNet and then you see all these people do these just amazing things with ControlNet. And then you wonder, I think from our point of view, we sort of go, that's really wonderful. But how do we end up with like a unified model that can do that? What are the bottlenecks? What are the issues? The community ultimately has very limited resources. And so they need these kinds of like workaround research ideas to get there. But yeah. [00:37:55]

Swyx: Are techniques like ControlNet portable to your architecture? [00:37:58]

Suhail: Definitely. Yeah. We kept the Playground V2 exactly the same as SDXL. Not out of laziness, but just because we knew that the community already had tools. You know, all you have to do is maybe change a string in your code and then, you know, retrain a ControlNet for it. So it was very intentional to do that. We didn't want to fragment the community with different architectures. Yeah. [00:38:16]
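To make the "change a string" point concrete, here is a rough sketch with diffusers: because Playground v2 keeps the SDXL architecture, it loads through the same pipeline class as vanilla SDXL. The Hugging Face repo id below is an assumption from memory, so check the actual model card before relying on it.

```python
# Swapping the base model is (roughly) a one-string change when the architecture matches SDXL.
import torch
from diffusers import DiffusionPipeline

# model_id = "stabilityai/stable-diffusion-xl-base-1.0"     # vanilla SDXL
model_id = "playgroundai/playground-v2-1024px-aesthetic"     # assumed Playground v2 repo id

pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# Existing SDXL tooling (ControlNets, LoRA trainers, schedulers) can generally be pointed at this
# checkpoint the same way, which is the "don't fragment the community" argument being made above.
image = pipe(prompt="isometric voxel art of a tiny coffee shop, warm lighting").images[0]
image.save("playground_v2_sample.png")
```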

Swyx: So basically, I'm going to go over three more categories. One is UIs, like app UIs, like mock UIs. Third is not safe for work, and then copyrighted stuff. I don't know if you care to comment on any of those. [00:38:28]

Suhail: I think the NSFW kind of like safety stuff is really important. I kind of think that one of the biggest risks kind of going into maybe the U.S. election year will probably be very interrelated with like graphics, audio, video. I think it's going to be very hard to explain, you know, to a family relative who's not kind of in our world. And our world is like sometimes very, you know, we think it's very big, but it's very tiny compared to the rest of the world. Some people like there's still lots of humanity who have no idea what chat GPT is. And I think it's going to be very hard to explain, you know, to your uncle, aunt, whoever, you know, hey, I saw President Biden say this thing on a video, you know, I can't believe, you know, he said that. I think that's going to be a very troubling thing going into the world next year, the year after. [00:39:12]

Swyx: That's more like a risk thing, like deepfakes, faking, political faking. But there's a lot of studies on how for most businesses, you don't want to train on not safe for work images, except that it makes you really good at bodies. [00:39:24]

Suhail: Personally, we filter out NSFW type of images in our data set so that it's, you know, so our safety filter stuff doesn't have to work as hard. [00:39:32]

Swyx: But you've heard this argument that not safe for work images are very good at human anatomy, which you do want to be good at. [00:39:38]

Suhail: It's not like necessarily a bad thing to train on that data. It's more about like how you go and use it. That's why I was kind of talking about safety, you know, in part, because there are very terrible things that can happen in the world. If you have an extremely powerful graphics model, you know, suddenly like you can kind of imagine, you know, now if you can like generate nudes and then there's like you could do very character consistent things with faces, like what does that lead to? Yeah. And so I tend to think more what occurs after that, right? Even if you train on, let's say, you know, nude data, if it does something to kind of help, there's nothing wrong with the human anatomy; it's very valid for a model to learn that. But then it's kind of like, how does that get used? And, you know, I won't bring up all of the very, very unsavory, terrible things that we see on a daily basis on the site, but I think it's more about what occurs. And so we, you know, we just recently did like a big sprint on safety. It's very difficult with graphics and art, right? Because there is tasteful art that has nudity, right? They're all over in museums, like, you know, there's very valid situations for that. And then there's the things that are the gray line of that, you know, what I might not find tasteful, someone might be like, that is completely tasteful, right? And then there are things that are way over the line. And then there are things that maybe you or, you know, maybe I would be okay with, but society isn't, you know? So where does that kind of end up on the spectrum of things? I think it's really hard with art. Sometimes even if you have like things that are not nude, if a child goes to your site, scrolls down some images, you know, classrooms of kids, you know, using our product, it's a really difficult problem. And it stretches across culture, society, politics, everything. [00:41:14]

Alessio: Another favorite topic of our listeners is UX and AI. And I think you're probably one of the best all-inclusive editors for these things. So you don't just have the prompt, images come out, you pray, and now you do it again. First, you let people pick a seed so they can kind of have semi-repeatable generation. You also have, yeah, you can pick how many images and then you leave all of them in the canvas. And then you have kind of like this box, the generation box, and you can even cross between them and outpaint. There's all these things. How did you get here? You know, most people are kind of like, give me text, I give you image. You know, you're like, these are all the tools for you. [00:41:54]

Suhail: Even though we were trying to make a graphics foundation model, I think we think that we're also trying to like re-imagine like what a graphics editor might look like given the change in technology. So, you know, I don't think we're trying to build Photoshop, but it's the only thing that we could say that people are largely familiar with. Oh, okay, there's Photoshop. What would Photoshop compare itself to pre-computer? I don't know, right? It's like, or kind of like a canvas, but you know, there's these menu options and you can use your mouse. What's a mouse? So I think that we're trying to re-imagine what a graphics editor might look like, not just for the fun of it, but because we kind of have no choice. Like there's this idea in image generation where you can generate images. That's like a super weird thing. What is that in Photoshop, right? You have to wait right now for the time being, but the wait is worth it often for a lot of people because they can't make that with their own skills. So I think it goes back to, you know, how we started the company, which was kind of looking at GPT-3's Playground. The reason why we're named Playground is a homage to that, actually. And, you know, it's like, shouldn't these products be more visual? These prompt boxes are like a terminal window, right? We're kind of at this weird point where it's just like MS-DOS. I remember my mom using MS-DOS and I memorized the keywords, like DIR, LS, all those things, right? It feels a little like we're there, right? Prompt engineering, parentheses to say beautiful or whatever, weights the word token more in the model or whatever. That's like super strange. I think a large portion of humanity would agree that that's not user-friendly, right? So how do we think about the products to be more user-friendly? Well, sure, you know, sure, it would be nice if I wanted to get rid of, like, the headphones on my head, you know, it'd be nice to mask it and then say, you know, can you remove the headphones? You know, if I want to grow, expand the image, you know, how can we make that feel easier without typing lots of words and being really confused? I don't even think we've nailed the UI/UX yet. Part of that is because we're still experimenting. And part of that is because the model and the technology is going to get better. And whatever felt like the right UX six months ago is going to feel very broken now. So that's a little bit of how we got there is kind of saying, does everything have to be like a prompt in a box? Or can we do things that make it very intuitive for users? [00:44:03]
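As an aside, the "parentheses weight the word token more" trick Suhail mentions is usually implemented by scaling the emphasized token's text embedding before it conditions the diffusion model. A toy illustration follows, not Playground's implementation; the CLIP checkpoint is just an illustrative choice and real UIs typically also renormalize the embeddings afterwards.

```python
# Toy illustration of "(beautiful:1.3)"-style prompt weighting: scale that token's embedding.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

model_id = "openai/clip-vit-large-patch14"  # the text encoder used by SD 1.x; illustrative choice
tokenizer = CLIPTokenizer.from_pretrained(model_id)
text_encoder = CLIPTextModel.from_pretrained(model_id)

prompt = "a beautiful painting of a lighthouse at dusk"
weight_word, weight = "beautiful", 1.3

inputs = tokenizer(prompt, padding="max_length", max_length=77, truncation=True, return_tensors="pt")
with torch.no_grad():
    embeddings = text_encoder(**inputs).last_hidden_state  # shape (1, 77, 768)

# Find the position(s) of the emphasized word and scale them; real implementations often
# renormalize so the overall embedding magnitude stays roughly constant.
target_ids = tokenizer(weight_word, add_special_tokens=False).input_ids
positions = [i for i, tid in enumerate(inputs.input_ids[0].tolist()) if tid in target_ids]
embeddings[:, positions, :] *= weight  # these embeddings would then be fed to the diffusion UNet

print(f"upweighted positions {positions} by {weight}x")
```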

Alessio: How do you decide what to give access to? So you have things like an expand prompt, which DALL-E 3 just does. It doesn't let you decide whether you should or not. [00:44:13]

Swyx: As in, like, rewrites your prompts for you. [00:44:15]

Suhail: Yeah, for that feature, I think once we get it to be cheaper, we'll probably just give it up. We'll probably just give it away. But we also decided something that might be a little bit different. We noticed that most of image generation is just, like, kind of casual. You know, it's in WhatsApp. It's, you know, it's in a Discord bot somewhere with Midjourney. It's in ChatGPT. One of the differentiators I think we provide, maybe at the expense of just lots of users, mainstream consumers, is that we provide as much, like, power and tweakability and configurability as possible. So the only reason why it's a toggle is because we know that users might want to use it and might not want to use it. There's some really powerful power user hobbyists that know what they're doing. And then there's a lot of people that just want something that looks cool, but they don't know how to prompt. And so I think a lot of Playground is more about going after that core user base that, like, knows, has a little bit more savviness in how to use these tools. You know, the average DALL-E user is probably not going to use ControlNet. They probably don't even know what that is. And so I think that, like, as the models get more powerful, as there's more tooling, hopefully you'll imagine a new sort of AI-first graphics editor that's just as, like, powerful and configurable as Photoshop. And you might have to master a new kind of tool. [00:45:28]

Swyx: There's so many things I could go bounce off of. One, you mentioned about waiting. We have to kind of somewhat address the elephant in the room. Consistency models have been blowing up the past month. How do you think about integrating that? Obviously, there's a lot of other companies also trying to beat you to that space as well. [00:45:44]

Suhail: I think we were the first company to integrate it. Ah, OK. [00:45:47]

Swyx: Yeah. I didn't see your demo. [00:45:49]

Suhail: Oops. Yeah, yeah. Well, we integrated it in a different way. OK. There are, like, 10 companies right now that have kind of tried to do, like, interactive editing, where you can, like, draw on the left side and then you get an image on the right side. We decided to kind of, like, wait and see whether there's, like, true utility on that. We have a different feature that's, like, unique in our product that is called preview rendering. And so you go to the product and you say, you know, we're like, what is the most common use case? The most common use case is you write a prompt and then you get an image. But what's the most annoying thing about that? The most annoying thing is, like, it feels like a slot machine, right? You're like, OK, I'm going to put it in and maybe I'll get something cool. So we did something that seemed a lot simpler, but a lot more relevant to how users already use these products, which is preview rendering. You toggle it on and it will show you a render of the image. And then graphics tools already have this. Like, if you use Cinema 4D or After Effects or something, it's called viewport rendering. And so we try to take something that exists in the real world that has familiarity and say, OK, you're going to get a rough sense of an early preview of this thing. And then when you're ready to generate, we're going to try to be as coherent about that image that you saw. That way, you're not spending so much time just like pulling down the slot machine lever. I think we were the first company to actually ship a quick LCM thing. Yeah, we were very excited about it. So we shipped it very quick. Yeah. [00:47:03]
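The "quick LCM thing" refers to latent consistency models, which cut a generation down to a handful of steps. Here is one common way to get that kind of fast preview with diffusers, not necessarily Playground's implementation; the LCM-LoRA repo id is stated as an assumption.

```python
# Fast "preview render" sketch: an LCM LoRA lets an SDXL-class model produce a rough image in ~4 steps,
# then the full-quality render reuses the same prompt and seed for coherence with the preview.
import torch
from diffusers import DiffusionPipeline, LCMScheduler

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")  # assumed HF repo id for the LCM LoRA

prompt = "watercolor of a lighthouse at dusk"
seed = 42

# Preview: very few steps, low guidance, so it renders almost interactively.
preview = pipe(
    prompt,
    num_inference_steps=4,
    guidance_scale=1.0,
    generator=torch.Generator("cuda").manual_seed(seed),
).images[0]
preview.save("preview.png")

# Final render: unload the LCM LoRA, restore a normal scheduler, and re-run with ~30 steps
# and the same seed so the full-quality image stays close to what the preview showed.
pipe.unload_lora_weights()
```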

Swyx: Well, the demos I've been seeing, it's not like a preview necessarily. They're almost using it to animate their generations. Like, because you can kind of move shapes. [00:47:11]

Suhail: Yeah, yeah, they're like doing it. They're animating it. But they're sort of showing, like, if I move a moon, you know, can I? [00:47:17]

Swyx: I don't know. To me, it unlocks video in a way. [00:47:20]

Suhail: Yeah. But the video models are already so much better than that. Yeah. [00:47:23]

Swyx: There's another one, which I think is the general ecosystem of LoRAs, right? Civitai is obviously the most popular repository of LoRAs. How do you think about interacting with that ecosystem? [00:47:34]

Suhail: The guy that did LoRA, not the guy that invented LoRAs, but the person that brought LoRAs to Stable Diffusion actually works with us on some projects. His name is Simo. Shout out to Simo. And I think LoRAs are wonderful. Obviously, fine tuning all these Dreambooth models and such, it's just so heavy. And it's obvious in our conversation around styles and vibes, it's very hard to evaluate the artistry of these things. LoRAs give people this wonderful opportunity to create sub-genres of art. And I think they're amazing. Any graphics tool, any kind of thing that's expressing art has to provide some level of customization to its user base that goes beyond just typing Greg Rutkowski in a prompt. We have to give more than that. It's not like users want to type these real artist names. It's that they don't know how else to get an image that looks interesting. They truly want originality and uniqueness. And I think LoRAs provide that. And they provide it in a very nice, scalable way. I hope that we find something even better than LoRAs in the long term, because there are still weaknesses to LoRAs, but I think they do a good job for now. Yeah. [00:48:39]

Swyx: And so you would never compete with Civitai? You would just kind of let people import? [00:48:43]

Suhail: Civitai's a site where all these things get kind of hosted by the community, right? And so, yeah, we'll often pull down some of the best things there. I think when we have a significantly better model, we will certainly build something that gets closer to that. Again, I go back to saying just I still think this is very nascent. Things are very underpowered, right? LoRAs are not easy to train. They're easy for an engineer. It sure would be nicer if I could just pick five or six reference images, right? And they might even be five or six different reference images that are not... They're just very different. They communicate a style, but they're actually like... It's like a mood board, right? And you have to be kind of an engineer almost to train these LoRAs or go to some site and be technically savvy, at least. It seems like it'd be much better if I could say, I love this style. Here are five images and you tell the model, like, this is what I want. And the model gives you something that's very aligned with what your style is, what you're talking about. And it's a style you couldn't even communicate, right? There's no word. You know, if you have a Tron image, it's not just Tron. It's like Tron plus like four or five different weird things. Even cyberpunk can have its like sub-genre, right? But I just think training LoRAs and doing that is very heavy. So I hope we can do better than that. [00:49:50]
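For contrast with how heavy training is, using a community LoRA is nearly trivial. A minimal sketch with diffusers; the LoRA filename and trigger phrasing below are made up for illustration, and the exact scale argument varies between diffusers versions.

```python
# Loading a community style LoRA into an SDXL pipeline (filename and prompt are hypothetical).
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# A .safetensors LoRA downloaded from a hub like Civitai.
pipe.load_lora_weights("./loras/retro-sci-fi-style.safetensors")

image = pipe(
    "retro sci-fi illustration of a lighthouse at dusk",   # trigger words depend on how the LoRA was trained
    cross_attention_kwargs={"scale": 0.8},                 # dial the LoRA's influence up or down
).images[0]
image.save("styled.png")
```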

Alessio: We have Sharif from Lexica on the podcast before. Both of you have like a landing page with just a bunch of images where you can like explore things. [00:50:01]

Suhail: Yeah, we have a feed. [00:50:02]

Alessio: Yeah, is that something you see more and more often in terms of like coming up with these styles? Is that why you have that as the starting point versus a lot of other products you just go in, you have the generation prompt, you don't see a lot of examples. [00:50:14]

Suhail: Our feed is a little different than their feed. Our feed is more about community. So we have kind of like a Reddit thing going on where it's a kind of a competition like every day, loose competition, mostly fun competition of like making things. And there's just this wonderful community of people where they're liking each other's images and just showing their like genuine interest in each other. And I think we definitely learn about styles that way. One of the funniest polls, if you go to the mid-journey polls, they'll sometimes put these polls out and they'll say, you know, what do you wish you could like learn more from? And like one of the things that people vote the most for is like learning how to prompt, right? And so I think like if you put away your research hat for a minute and you just put on like your product hat for a second, you're kind of like, well, why do people want to learn how to prompt, right? It's because they want to get higher quality images. Well, what's higher quality? Composition, lighting, aesthetics, so on and so forth. And I think that the community on our feed, I think we might have the biggest community. And it gives all of the users a way to learn how to prompt because they're just seeing this huge rising tide of all these images that are super cool and interesting. And they can kind of like take each other's prompts and like kind of learn how to do that. I think that'll be short-lived because I think the complexity of these things is going to get higher. But that's more about why we have that feed, is to help each other, help teach users and then also just celebrate people's art. You run your own infra. We do. [00:51:30]

Swyx: Yeah, that's unusual. [00:51:31]

Suhail: It's necessary. It's necessary. [00:51:35]

Swyx: What have you learned running DevOps for GPUs? You had a tweet about like how many A100s you have, but I feel like it's out of date probably. [00:51:42]

Suhail: I mean, it just comes down to cost. These things are very expensive. So we just want to make it as affordable for everybody as possible. I find the DevOps for inference to be relatively easy. It doesn't feel that different than, you know, I think we had thousands and thousands of servers at Mixpanel just for dealing with the API. It had such huge quantities of volume that I don't find it particularly very different. I do find model optimization performance is very new to me. So I think that I find that very difficult at the moment. So that's very interesting. But scaling inference is not terrible. Scaling a training cluster is much, much harder than I perhaps anticipated. Why is that? Well, it's just like a very large distributed system with, you know, if you have like a node that goes down, then your training run crashes and then you have to somehow be resilient to that. And I would say training infra software is very early. It feels very broken. I can tell in 10 years it'll be a lot better. [00:52:37]

Swyx: Like a mosaic or whatever. [00:52:39]

Suhail: Yeah, we don't even know. We just use very basic tools like, you know, Slurm for scheduling and just normal PyTorch, PyTorch Lightning, that kind of thing. I think our tooling is nascent. I think I talked to a friend that's over at xAI. They just built their own scheduler, you know, and doing things with Kubernetes. Like when people are building out tools because the existing open source stuff doesn't work and everyone's doing their own bespoke thing, you know, there's a valuable company to be formed. [00:53:01]

Swyx: Yeah, I think it's mosaic. [00:53:03]

Suhail: I don't know. It might be worth like wondering like why not everyone is going to mosaic and perhaps it's still, I just think it's nascent and perhaps mosaic will come through. [00:53:12]

Alessio: Just to wrap, we talked about some of the pivotal moments in your mind with like DALL-E and whatnot. If you were not doing this, what's the most interesting unsolved question in AI that you would try and build in? [00:53:25]

Suhail: Oh man, coming up with startup ideas is very hard on the spot. You have to have them. [00:53:31]

Swyx: I mean, you're a founder, you're a repeat founder. I'm very picky about my startup ideas. [00:53:35]

Suhail: I don't have an idea per se as much as a curiosity. Suppose I'll pose it to you guys. Right now we sort of think that a lot of the modalities just kind of feel like they're vision, language, audio, that's roughly it. And somehow all this will like turn into something, it'll be multimodal and then we'll end up with AGI. And I just think that there are probably far more modalities than meets the eye. And it just seems hard for us to see it right now because it's sort of like we have tunnel vision on the moment. [00:54:08]

Swyx: We're just like code, image, audio, video. [00:54:11]

Suhail: Yeah, I think- [00:54:11]

Swyx: Very, very broad categories. [00:54:13]

Suhail: I think we are lacking imagination as a species in this regard. Yeah, I see it. I don't know what company would form as a result of this, but there's some very difficult problems, like a true actual, not a meta world model, but an actual world model that truly maps everything that's going in terms of like physics and fluids and all these various kinds of interactions. And what does that kind of model, like a true physics foundation model of sorts that represents earth. And that in of itself seems very difficult, but we're kind of stuck on like thinking that we can approximate everything with like a word or a token, if you will. You know, I had a dinner last night where we were kind of debating this philosophically. And I think someone said something that I also believe in, which is like at the end of the day, it doesn't really matter that it's like a token or a byte, at the end of the day, it's just like some unit of information that it emits. But I do wonder if there are far more modalities than meets the eye. And if you could create that, what would that company become? What problems could you solve? So I don't know yet, so I don't have a great company for it. I don't know. [00:55:15]

Alessio: Maybe you just inspire somebody to try. [00:55:17]

Suhail: Yeah, hopefully. [00:55:18]

Swyx: My personal response to that is I'm less interested in physics and more interested in people. Like how do I mind upload? Because that is teleportation, that is immortality, that is everything. Yeah. [00:55:29]

Suhail: Rather than trying to create consciousness, could we model our own? Even if it was lossy to some extent, yeah. We won't solve that here. [00:55:35]

Swyx: If I were to take a Bill Gates book trip and had a week, what should I take with me to learn AI? [00:55:42]

Suhail: Oh gosh, you shouldn't take a book. You should just go to YouTube and visit Karpathy's class. [00:55:49]

Swyx: Zero to Hero. [00:55:50]

Suhail: And just do it, grind through it. [00:55:52]

Swyx: Was that actually the most useful thing for you? [00:55:53]

Suhail: I wish it came out when I started. Wow. Back last year. I was bummed that I didn't get to take it at the beginning, but I did do a few of his classes regardless. Every time I buy a programming book, I never read it. Or an AI book. I always find that just writing code helps cement my internal understanding. Yeah. [00:56:10]

Swyx: So more generally, advice for founders who are not PhDs and are effectively self-taught like you are. Like what should they do? What should they avoid? Same thing that I would advise [00:56:18]

Suhail: if you're programming. Pick a project that seems very exciting to you. You know, it doesn't have to be too serious. And build it and learn every detail of it while you do it. [00:56:27]

Swyx: Should you train? Or can you go far enough not training, just fine-tuning? I would just follow your curiosity. [00:56:32]

Suhail: If what you want to do is something that requires fundamental understanding of training models, then you should learn it. You don't have to get to become a five-year, whatever, PhD. But if that's necessary, I would do it. If it's not necessary, then go as far as you need to go. But I would learn, pick something that motivates you. I think most people tap out on motivation, but they're deeply curious. Cool. [00:56:51]

Alessio: Thank you so much for coming out, man. [00:56:53]

Suhail: Thank you for having me. Appreciate it. [00:57:07]



Get full access to Latent Space at www.latent.space/subscribe
2023-12-20
Link to episode

The "Normsky" architecture for AI coding agents ? with Beyang Liu + Steve Yegge of SourceGraph

We are running an end of year survey for our listeners. Let us know any feedback you have for us, what episodes resonated with you the most, and guest requests for 2024!

RAG has emerged as one of the key pieces of the AI Engineer stack. Jerry from LlamaIndex called it a "hack", Bryan from Hex compared it to "a recommendation system from LLMs", and even LangChain started with it.

RAG is crucial in any AI coding workflow. We talked about context quality for code in our Phind episode. Today's guests, Beyang Liu and Steve Yegge from Sourcegraph, have been focused on code indexing and retrieval for over 15 years. We locked them in our new studio to record a 1.5-hour masterclass on the history of code search, retrieval interfaces for code, and how they get a SOTA 30% completion acceptance rate in their Cody product by being better at the "bin packing problem" of LLM context generation.
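To give a feel for what that "bin packing" step means in practice, here is a toy version: given ranked snippets, greedily pack the most relevant ones into the model's context window without exceeding the token budget. This is illustrative only, not Cody's actual logic, and the token counting is a crude whitespace heuristic.

```python
# Toy context packing under a token budget (not Cody's implementation).
from dataclasses import dataclass

@dataclass
class Snippet:
    path: str
    text: str
    score: float  # relevance from the retriever / re-ranker

def estimate_tokens(text: str) -> int:
    return int(len(text.split()) * 1.3)  # rough heuristic; real systems use the model's tokenizer

def pack_context(snippets: list[Snippet], budget_tokens: int) -> list[Snippet]:
    packed, used = [], 0
    for s in sorted(snippets, key=lambda s: s.score, reverse=True):
        cost = estimate_tokens(s.text)
        if used + cost <= budget_tokens:
            packed.append(s)
            used += cost
    return packed

context = pack_context(
    [Snippet("auth.go", "func Login(...) {...}", 0.92),
     Snippet("auth_test.go", "func TestLogin(...) {...}", 0.71),
     Snippet("README.md", "## Auth flow ...", 0.55)],
    budget_tokens=6_000,
)
print([s.path for s in context])
```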

Google Grok → Sourcegraph → Cody

While at Google in 2008, Steve built Grok, which lives on today as Google Kythe. It allowed engineers to do code parsing and searching across different codebases and programming languages. (You might remember the infamous Google Platforms Rant from Steve's time at Google, and his 2021 followup on GCP).

Beyang was an intern at Google at the same time, and Grok became the inspiration to start Sourcegraph in 2013. The two didn't know each other personally until Beyang brought Steve out of retirement 9 years later to join him as VP Engineering. Fast forward 10 years, Sourcegraph has become the best code search tool out there and raised $223M along the way.

Nine months ago, they open sourced SourceGraph Cody, their AI coding assistant. All their code indexing and search infrastructure allows them to get SOTA results by having better RAG than competitors:

* Code completions as you type that achieve an industry-best Completion Acceptance Rate (CAR) as high as 30% using a context-enhanced open-source LLM (StarCoder)

* Context-aware chat that provides the option of using GPT-4 Turbo, Claude 2, GPT-3.5 Turbo, Mixtral 8x7B, or Claude Instant, with more model integrations planned

* Doc and unit test generation, along with AI quick fixes for common coding errors

* AI-enhanced natural language code search, powered by a hybrid dense/sparse vector search engine

There are a few pieces of infrastructure that helped Cody achieve these results:

Dense-sparse vector retrieval system

For many people, RAG = vector similarity search, but there's a lot more that you can do to get the best possible results. From their release:

"Sparse vector search" is a fancy name for keyword search that potentially incorporates LLMs for things like ranking and term expansion (e.g., "k8s" expands to "Kubernetes container orchestration", possibly weighted as in SPLADE):

* Dense vector retrieval makes use of embeddings, the internal representation that LLMs use to represent text. Dense vector retrieval provides recall over a broader set of results that may have no exact keyword matches but are still semantically similar.

* Sparse vector retrieval is very fast, human-understandable, and yields high recall of results that closely match the user query.

* We've found the approaches to be complementary.

There's a very good blog post by Pinecone on SPLADE for sparse vector search if you're interested in diving in. If you're building RAG applications in areas that have a lot of industry-specific nomenclature, acronyms, etc, this is a good approach to getting better results.
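A minimal sketch of the hybrid idea described above: keyword (sparse) scores from BM25 plus embedding (dense) similarity, blended with a simple weighted sum. This assumes the rank_bm25 and sentence-transformers packages and a public MiniLM checkpoint; Sourcegraph's production retrieval is of course far more involved.

```python
# Hybrid sparse + dense retrieval sketch (assumes rank_bm25 and sentence-transformers are installed).
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = [
    "k8s deployment rollout restart",
    "Kubernetes container orchestration basics",
    "How to bake sourdough bread",
]
query = "kubernetes restart a deployment"

# Sparse: classic keyword scoring, fast and human-understandable.
bm25 = BM25Okapi([d.lower().split() for d in docs])
sparse = np.array(bm25.get_scores(query.lower().split()))

# Dense: embedding similarity catches "k8s" ~ "kubernetes" even without keyword overlap.
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = model.encode(docs, normalize_embeddings=True)
q_emb = model.encode([query], normalize_embeddings=True)[0]
dense = doc_emb @ q_emb

def norm(x):  # scale each score list to [0, 1] before blending
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

hybrid = 0.5 * norm(sparse) + 0.5 * norm(dense)
for i in np.argsort(-hybrid):
    print(f"{hybrid[i]:.3f}  {docs[i]}")
```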

SCIP

In 2016, Microsoft announced the Language Server Protocol (LSP), followed a few years later by the Language Server Index Format (LSIF). These protocols make it easy for IDEs to get all the context they need from a codebase for things like file search, references, "go to definition", etc.

Sourcegraph developed SCIP, "a better code indexing format than LSIF":

* Simpler and More Efficient Format: SCIP utilizes Protobuf instead of JSON, which is used by LSIF. Protobuf is more space-efficient, simpler, and more suitable for systems programming.

* Better Performance and Smaller Index Sizes: SCIP indexers, such as scip-clang, show enhanced performance and reduced index file sizes compared to LSIF indexers (10%-20% smaller)

* Easier to Develop and Debug: SCIP's design, centered around human-readable string IDs for symbols, makes it faster and more straightforward to develop new language indexers.

Having more efficient indexing is key to more performant RAG on code.

Show Notes

* Sourcegraph

* Cody

* Copilot vs Cody

* Steve's Stanford seminar on Grok

* Steve's blog

* Grab

* Fireworks

* Peter Norvig

* Noam Chomsky

* Code search

* Kelly Norton

* Zoekt

* v0.dev

See also our past episodes on Cursor, Phind, Codeium and Codium as well as the GitHub Copilot keynote at AI Engineer Summit.

Timestamps

* [00:00:00] Intros & Backgrounds

* [00:05:20] How Steve's work on Grok inspired SourceGraph for Beyang

* [00:08:10] What's Cody?

* [00:11:22] Comparison of coding assistants and the capabilities of Cody

* [00:16:00] The importance of context (RAG) in AI coding tools

* [00:21:33] The debate between Chomsky and Norvig approaches in AI

* [00:30:06] Normsky: the Norvig + Chomsky models collision

* [00:36:00] The death of the DSL?

* [00:40:00] LSP, Skip, Kythe, BFG, and all that fun stuff

* [00:53:00] The SourceGraph internal stack

* [00:58:46] Building on open source models

* [01:02:00] SourceGraph for engineering managers?

* [01:12:00] Lightning Round

Transcript

Alessio: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. [00:00:16]

Swyx: Hey, and today we're christening our new podcast studio in the Newton, and we have Beyang and Steve from Sourcegraph. Welcome. [00:00:25]

Beyang: Hey, thanks for having us. [00:00:26]

Swyx: So this has been a long time coming. I'm very excited to have you. We also are just celebrating the one year anniversary of ChatGPT yesterday, but also we'll be talking about the GA of Cody later on today. We'll just do a quick intros of both of you. Obviously, people can research you and check the show notes for more. Beyang, you worked in computer vision at Stanford and then you worked at Palantir. I did, yeah. You also interned at Google. [00:00:48]

Beyang: I did back in the day where I get to use Steve's system, DevTool. [00:00:53]

Swyx: Right. What was it called? [00:00:55]

Beyang: It was called Grok. Well, the end user thing was Google Code Search. That's what everyone called it, or just like CS. But the brains of it were really the kind of like Trigram index and then Grok, which provided the reference graph. [00:01:07]

Steve: Today it's called Kythe, the open source Google one. It's sort of like Grok v3. [00:01:11]

Swyx: On your podcast, which you've had me on, you've interviewed a bunch of other code search developers, including the current developer of Kythe, right? [00:01:19]

Beyang: No, we didn't have any Kythe people on, although we would love to if they're up for it. We had Kelly Norton, who built a similar system at Etsy, it's an open source project called Hound. We also had Han-Wen Nienhuys, who created Zoekt, which is, I think, heavily inspired by the Trigram index that powered Google's original code search and that we also now use at Sourcegraph. Yeah. [00:01:45]

Swyx: So you teamed up with Quinn over 10 years ago to start Sourcegraph and you were indexing all code on the internet. And now you're in a perfect spot to create a code intelligence startup. Yeah, yeah. [00:01:56]

Beyang: I guess the backstory was, I used Google Code Search while I was an intern. And then after I left that internship and worked elsewhere, it was the single dev tool that I missed the most. I felt like my job was just a lot more tedious and much more of a hassle without it. And so when Quinn and I started working together at Palantir, he had also used various code search engines in open source over the years. And it was just a pain point that we both felt, both working on code at Palantir and also working within Palantir's clients, which were a lot of Fortune 500 companies, large financial institutions, folks like that. And if anything, the pains they felt in dealing with large complex code bases made our pain points feel small by comparison. So that was really the impetus for starting Sourcegraph. [00:02:42]

Swyx: Yeah, excellent. Steve, you famously worked at Amazon. And you've told many, many stories. I want every single listener of Latent Space to check out Steve's YouTube because he effectively had a podcast that you didn't tell anyone about or something. You just hit record and just went on a few rants. I'm always here for your Stevie rants. And then you moved to Google, where you also had some interesting thoughts on just the overall Google culture versus Amazon. You joined Grab as head of eng for a couple of years. I'm from Singapore, so I have actually personally used a lot of Grab's features. And it was very interesting to see you talk so highly of Grab's engineering and sort of overall prospects. [00:03:21]

Steve: Because as a customer, it sucked? [00:03:22]

Swyx: Yeah, no, it's just like, being from a smaller country, you never see anyone from our home country being on a global stage or talked about as a startup that people admire or look up to, like on the league that you, with all your legendary experience, would consider equivalent. Yeah. [00:03:41]

Steve: Yeah, no, absolutely. They actually, they didn't even know that they were as good as they were, in a sense. They started hiring a bunch of people from Silicon Valley to come in and sort of like fix it. And we came in and we were like, Oh, we could have been a little better operational excellence and stuff. But by and large, they're really sharp. The only thing about Grab is that they get criticized a lot for being too westernized. Oh, by who? By Singaporeans who don't want to work there. [00:04:06]

Swyx: Okay. I guess I'm biased because I'm here, but I don't see that as a problem. If anything, they've had their success because they were more westernized than the standard Singaporean tech company. [00:04:15]

Steve: I mean, they had their success because they are laser focused. They copied Amazon. I mean, they're executing really, really, really well for a giant. I was on a Slack with 2,500 engineers. It was like this giant waterfall that you could dip your toe into. You'd never catch up. Actually, the AI summarizers would have been really helpful there. But yeah, no, I think Grab is successful because they're just out there with their sleeves rolled up, just making it happen. [00:04:43]

Swyx: And for those who don't know, it's not just like Uber of Southeast Asia, it's also a super app. PayPal Plus. [00:04:48]

Steve: Yeah. [00:04:49]

Swyx: In the way that super apps don't exist in the West. It's one of the enduring mysteries of B2C that super apps work in the East and don't work in the West. We just don't understand it. [00:04:57]

Beyang: Yeah. [00:04:58]

Steve: It's just kind of curious. They didn't work in India either. And it was primarily because of bandwidth reasons and smaller phones. [00:05:03]

Swyx: That should change now. It should. [00:05:05]

Steve: And maybe we'll see a super app here. [00:05:08]

Swyx: You retired-ish? I did. You retired-ish on your own video game? Mm-hmm. Any fun stories about that? And that's also where you discovered some need for code search, right? Mm-hmm. [00:05:16]

Steve: Sure. A need for a lot of stuff. Better programming languages, better databases. Better everything. I mean, I started in like 95, right? Where there was kind of nothing. Yeah. Yeah. [00:05:24]

Beyang: I just want to say, I remember when you first went to Grab because you wrote that blog post talking about why you were excited about it, about like the expanding Asian market. And our reaction was like, oh, man, how did we miss stealing it with you? [00:05:36]

Swyx: Hiring you. [00:05:37]

Beyang: Yeah. [00:05:38]

Steve: I was like, miss that. [00:05:39]

Swyx: Tell that story. So how did this happen? Right? So you were inspired by Grok. [00:05:44]

Beyang: I guess the backstory from my point of view is I had used code search and Grok while at Google, but I didn't actually know that it was connected to you, Steve. I knew you from your blog posts, which were always excellent, kind of like inside, very thoughtful takes from an engineer's perspective on some of the challenges facing tech companies and tech culture and that sort of thing. But my first introduction to you within the context of code intelligence, code understanding was I watched a talk that you gave, I think at Stanford, about Grok when you're first building it. And that was very eye opening. I was like, oh, like that guy, like the guy who, you know, writes the extremely thoughtful ranty like blog posts also built that system. And so that's how I knew, you know, you were involved in that. And then, you know, we always wanted to hire you, but never knew quite how to approach you or, you know, get that conversation started. [00:06:34]

Steve: Well, we got introduced by Max, right? Yeah. It was temporal. Yeah. Yeah. I mean, it was a no brainer. They called me up and I had noticed when Sourcegraph had come out. Of course, when they first came out, I had this dagger of jealousy stabbed through me piercingly, which I remember because I am not a jealous person by any means, ever. But boy, I was like, but I was kind of busy, right? And just one thing led to another. I got sucked back into the ads vortex and whatever. So thank God Sourcegraph actually kind of rescued me. [00:07:05]

Swyx: Here's a chance to build DevTools. Yeah. [00:07:08]

Steve: That's the best. DevTools are the best. [00:07:10]

Swyx: Cool. Well, so that's the overall intro. I guess we can get into Cody. Is there anything else that like people should know about you before we get started? [00:07:18]

Steve: I mean, everybody knows I'm a musician. I can juggle five balls. [00:07:24]

Swyx: Five is good. Five is good. I've only ever managed three. [00:07:27]

Steve: Five is hard. Yeah. And six, a little bit. [00:07:30]

Swyx: Wow. [00:07:31]

Beyang: That's impressive. [00:07:32]

Alessio: So yeah, to jump into Sourcegraph, this has been a company 10 years in the making. And as Sean said, now you're at the right place. Phase two. Now, exactly. You spent 10 years collecting all this code, indexing, making it easy to surface it. Yeah. [00:07:47]

Swyx: And also learning how to work with enterprises and having them trust you with their code bases. Yeah. [00:07:52]

Alessio: Because initially you were only doing on-prem, right? Like a lot of like VPC deployments. [00:07:55]

Beyang: So in the very early days, we're cloud only. But the first major customers we landed were all on-prem, self-hosted. And that was, I think, related to the nature of the problem that we're solving, which becomes just like a critical, unignorable pain point once you're above like 100 devs or so. [00:08:11]

Alessio: Yeah. And now Cody is going to be GA by the time this releases. So congrats to your future self for launching this in two weeks. Can you give a quick overview of just what Cody is? I think everybody understands that it's a AI coding agent, but a lot of companies say they have a AI coding agent. So yeah, what does Cody do? How do people interface with it? [00:08:32]

Beyang: Yeah. So how is it different from the like several dozen other AI coding agents that exist in the market now? When we thought about building a coding assistant that would do things like code generation and question answering about your code base, I think we came at it from the perspective of, you know, we've spent the past decade building the world's best code understanding engine for human developers, right? So like it's kind of your guide as a human dev if you want to go and dive into a large complex code base. And so our intuition was that a lot of the context that we're providing to human developers would also be useful context for AI developers to consume. And so in terms of the feature set, Cody is very similar to a lot of other assistants. It does inline autocompletion. It does code base aware chat. It does specific commands that automate, you know, tasks that you might rather not want to do like generating unit tests or adding detailed documentation. But we think the core differentiator is really the quality of the context, which is hard to kind of describe succinctly. It's a bit like saying, you know, what's the difference between Google and Alta Vista? There's not like a quick checkbox list of features that you can rattle off, but it really just comes down to all the attention and detail that we've paid to making that context work well and be high quality and fast for human devs. We're now kind of plugging into the AI coding assistant as well. Yeah. [00:09:53]

Steve: I mean, just to add my own perspective on to what Beyang just described, RAG is kind of like a consultant that the LLM has available, right, that knows about your code. RAG provides basically a bridge to a lookup system for the LLM, right? Whereas fine tuning would be more like on the job training for somebody. If the LLM is a person, you know, and you send them to a new job and you do on the job training, that's what fine tuning is like, right? So tuned to our specific task. You're always going to need that expert, even if you get the on the job training, because the expert knows your particular code base, your task, right? That expert has to know your code. And there's a chicken and egg problem because, right, you know, we're like, well, I'm going to ask the LLM about my code, but first I have to explain it, right? It's this chicken and egg problem. That's where RAG comes in. And we have the best consultants, right? The best assistant who knows your code. And so when you sit down with Cody, right, what Beyang said earlier about going to Google and using code search and then starting to feel like without it, his job was super tedious. Once you start using these, do you guys use coding assistants? [00:10:53]

Swyx: Yeah, right. [00:10:54]

Steve: I mean, like we're getting to the point very quickly, right? Where you feel like almost like you're programming without the internet, right? Or something, you know, it's like you're programming back in the nineties without the coding assistant. Yeah. Hopefully that helps for people who have like no idea about coding systems, what they are. [00:11:09]

Swyx: Yeah. [00:11:10]

Alessio: I mean, going back to using them, we had a lot of them on the podcast already. We had Cursor, we have Codeium and Codium, very similar names. [00:11:18]

Swyx: Yeah. Phind, and then of course there's Copilot. [00:11:22]

Alessio: You had a Copilot versus Cody blog post, and I think it really shows the context improvement. So you had two examples that stuck with me. One was, what does this application do? And the Copilot answer was like, oh, it uses JavaScript and npm and this. And it's like, but that's not what it does. You know, that's what it's built with. Versus Cody was like, oh, these are like the major functions. And like, these are the functionalities and things like that. And then the other one was, how do I start this up? And Copilot just said npm start, even though there was like no start command in the package.json, but you know, mode collapse, right? Most projects use npm start. So maybe this does too. How do you think about open source models? Because Copilot has their own private thing. And I think you guys use StarCoder, if I remember right. Yeah, that's correct. [00:12:09]

Beyang: I think Copilot uses some variant of Codex. They're kind of cagey about it. I don't think they've like officially announced what model they use. [00:12:16]

Swyx: And I think they use a range of models based on what you're doing. Yeah. [00:12:19]

Beyang: So everyone uses a range of model. Like no one uses the same model for like inline completion versus like chat because the latency requirements for. Oh, okay. Well, there's fill in the middle. There's also like what the model's trained on. So like we actually had completions powered by Claude Instant for a while. And but you had to kind of like prompt hack your way to get it to output just the code and not like, hey, you know, here's the code you asked for, like that sort of text. So like everyone uses a range of models. We've kind of designed Cody to be like especially model, not agnostic, but like pluggable. So one of our kind of design considerations was like as the ecosystem evolves, we want to be able to integrate the best in class models, whether they're proprietary or open source into Cody because the pace of innovation in the space is just so quick. And I think that's been to our advantage. Like today, Cody uses Starcoder for inline completions. And with the benefit of the context that we provide, we actually show comparable completion acceptance rate metrics. It's kind of like the standard metric that folks use to evaluate inline completion quality. It's like if I show you a completion, what's the chance that you actually accept the completion versus you reject it? And so we're at par with Copilot, which is at the head of that industry right now. And we've been able to do that with the Starcoder model, which is open source and the benefit of the context fetching stuff that we provide. And of course, a lot of like prompt engineering and other stuff along the way. [00:13:40]
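The "fill in the middle" (FIM) formatting Beyang alludes to is what makes a base model like StarCoder usable for inline completion: the code before and after the cursor is given as prefix and suffix, and the model generates the middle. A hedged sketch with transformers follows; the checkpoint name is an assumed small public StarCoder variant (the family is gated behind a license acceptance on Hugging Face), and the sentinel tokens are the ones its public docs describe.

```python
# Fill-in-the-middle (FIM) completion sketch with a StarCoder-family model.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoderbase-1b"  # assumed small variant for illustration; may require HF license acceptance
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# In a Cody-style system, retrieved context snippets would be prepended ahead of the prefix.
prefix = "def parse_config(path: str) -> dict:\n    "
suffix = "\n    return config\n"
prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=48, pad_token_id=tokenizer.eos_token_id)
completion = tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(completion)  # the model's guess at the code between prefix and suffix
```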

Alessio: And Steve, you wrote a post called Cheating Is All You Need about what you're building. And one of the points you made is that everybody's fighting on the same axis, which is better UI in the IDE, maybe like a better chat response. But data moats are kind of the most important thing. And you guys have like a 10-year-old moat with all the data you've been collecting. How do you kind of think about what other companies are doing wrong, right? Like, why is nobody doing this in terms of like really focusing on RAG? I feel like you see so many people. Oh, we just got a new model. It's a bit better on HumanEval. And it's like, well, but maybe like that's not what we should really be doing, you know? Like, do you think most people underestimate the importance of like the actual RAG in code? [00:14:21]

Steve: I think that people weren't doing it much. It wasn't. It's kind of at the edges of AI. It's not in the center. I know that when ChatGPT launched, so within the last year, I've heard a lot of rumblings from inside of Google, right? Because they're undergoing a huge transformation to try to, you know, of course, get into the new world. And I heard that they told, you know, a bunch of teams to go and train their own models or fine tune their own models, right? [00:14:43]

Swyx: Both. [00:14:43]

Steve: And, you know, it was a s**t show. Nobody knew how to do it. They launched two coding assistants. One was called Codey, with an E-Y. And then there was, I don't know what happened in that one. And then there's Duet, right? Google loves to compete with themselves, right? They do this all the time. And they had a paper on Duet like from a year ago. And they were doing exactly what Copilot was doing, which was just pulling in the local context, right? But fundamentally, I thought of this because we were talking about the splitting of the [00:15:10]

Swyx: models. [00:15:10]

Steve: In the early days, it was the LLM did everything. And then we realized that for certain use cases, like completions, that a different, smaller, faster model would be better. And that fragmentation of models, actually, we expected to continue and proliferate, right? Because we are fundamentally, we're a recommender engine right now. Yeah, we're recommending code to the LLM. We're saying, may I interest you in this code right here so that you can answer my question? [00:15:34]

Swyx: Yeah? [00:15:34]

Steve: And being good at recommender engine, I mean, who are the best recommenders, right? There's YouTube and Spotify and, you know, Amazon or whatever, right? Yeah. [00:15:41]

Swyx: Yeah. [00:15:41]

Steve: And they all have many, many, many, many, many models, right? For all fine-tuned for very specific, you know. And that's where we're heading in code, too. Absolutely. [00:15:50]

Swyx: Yeah. [00:15:50]

Alessio: We just did an episode we released on Wednesday, where we said RAG is like RecSys for LLMs. You're basically just suggesting good content. [00:15:58]

Swyx: It's like what? Recommendations. [00:15:59]

Beyang: Recommendations. [00:16:00]

Alessio: Oh, got it. [00:16:01]

Steve: Yeah, yeah, yeah. [00:16:02]

Swyx: So like the naive implementation of RAG is you embed everything, throw it in a vector database, you embed your query, and then you find the nearest neighbors, and that's your RAG. But actually, you need to rank it. And actually, you need to make sure there's sample diversity and that kind of stuff. And then you're like slowly gradient descending yourself towards rediscovering proper RecSys, which has been traditional ML for a long time. But like approaching it from an LLM perspective. Yeah. [00:16:24]

Beyang: I almost think of it as like a generalized search problem because it's a lot of the same things. Like you want your layer one to have high recall and get all the potential things that could be relevant. And then there's typically like a layer two re-ranking mechanism that bumps up the precision and tries to get the relevant stuff to the top of the results list. [00:16:43]
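To make the two-stage retrieval idea above concrete, here is a minimal Python sketch: a high-recall candidate layer followed by a precision-oriented re-ranker. The embedding and scoring functions are toy stand-ins, not Cody's or Sourcegraph's actual implementation.

```python
# Illustrative two-stage retrieval: high-recall candidates, then re-ranking.
from dataclasses import dataclass

@dataclass
class Chunk:
    path: str
    text: str

def embed(text: str) -> list[float]:
    # Toy embedding: normalized character-frequency vector (a real system
    # would use a learned embedding model).
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def retrieve(query: str, corpus: list[Chunk], k_recall: int = 50, k_final: int = 5) -> list[Chunk]:
    q = embed(query)
    # Layer 1: high recall -- grab many roughly relevant candidates.
    candidates = sorted(corpus, key=lambda c: cosine(q, embed(c.text)), reverse=True)[:k_recall]
    # Layer 2: re-rank for precision; lexical overlap stands in for a
    # cross-encoder or learned ranker.
    def rerank_score(c: Chunk) -> float:
        q_terms = set(query.lower().split())
        return len(q_terms & set(c.text.lower().split())) / (len(q_terms) or 1)
    return sorted(candidates, key=rerank_score, reverse=True)[:k_final]
```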

Swyx: Have you discovered that ranking matters a lot? Oh, yeah. So the context is that I think a lot of research shows that like one, context utilization matters based on model. Like GPT uses the top of the context window, and then apparently Claude uses the bottom better. And it's lossy in the middle. Yeah. So ranking matters. No, it really does. [00:17:01]

Beyang: The skill with which models are able to take advantage of context is always going to be dependent on how that factors into the impact on the training loss. [00:17:10]

Swyx: Right? [00:17:10]

Beyang: So like if you want long context window models to work well, then you have to have a ton of data where it's like, here's like a billion lines of text. And I'm going to ask a question about like something that's like, you know, embedded deeply into it and like, give me the right answer. And unless you have that training set, then of course, you're going to have variability in terms of like where it attends to. And in most kind of like naturally occurring data, the thing that you're talking about right now, the thing I'm asking you about is going to be something that we talked about recently. [00:17:36]

Swyx: Yeah. [00:17:36]

Steve: Did you really just say gradient descending yourself? Actually, I love that it's entered the casual lexicon. Yeah, yeah, yeah. [00:17:44]

Swyx: My favorite version of that is, you know, how we have to p-hack papers. So, you know, when you throw humans at the problem, that's called graduate student descent. That's great. It's really awesome. [00:17:54]

Alessio: I think the other interesting thing that you have is this inline assist UX that I wouldn't say is async, but it works while you can also do other work. So you can ask Cody to make changes on a code block and you can still edit the same file at the same time. [00:18:07]

Swyx: Yeah. [00:18:07]

Alessio: How do you see that in the future? Like, do you see a lot of Codys running together at the same time? Like, how do you validate also that they're not messing each other up as they make changes in the code? And maybe what are the limitations today? And what do you think about where the tech is going? [00:18:21]

Steve: I want to start with a little history and then I'm going to turn it over to Beyang, all right? So we actually had this feature in the very first launch back in June. Dominic wrote it. It was called Nonstop Cody. And you could have multiple, basically, LLM requests in parallel modifying your source [00:18:37]

Swyx: file. [00:18:37]

Steve: And he wrote a bunch of code to handle all of the diffing logic. And you could see the regions of code that the LLM was going to change, right? And he was showing me demos of it. And it just felt like it was just a little before its time, you know? But a bunch of that stuff, that scaffolding was able to be reused for where we're inline [00:18:56]

Swyx: sitting today. [00:18:56]

Steve: How would you characterize it today? [00:18:58]

Beyang: Yeah, so that interface has really evolved from a, like, hey, general purpose, like, request anything inline in the code and have the code update to really, like, targeted features, like, you know, fix the bug that exists at this line or request a very specific [00:19:13]

Swyx: change. [00:19:13]

Beyang: And the reason for that is, I think, the challenge that we ran into with inline fixes, and we do want to get to the point where you could just fire and forget and have, you know, half a dozen of these running in parallel. But I think we ran into the challenge early on that a lot of people are running into now when they're trying to construct agents, which is the reliability of, you know, working code generation is just not quite there yet in today's language models. And so that kind of constrains you to an interaction where the human is always, like, in the inner loop, like, checking the output of each response. And if you want that to work in a way where you can be asynchronous, you kind of have to constrain it to a domain where today's language models can generate reliable code well enough. So, you know, generating unit tests, that's, like, a well-constrained problem. Or fixing a bug that shows up as, like, a compiler error or a test error, that's a well-constrained problem. But the more general, like, hey, write me this class that does X, Y, and Z using the libraries that I have, that is not quite there yet, even with the benefit of really good context. Like, it definitely moves the needle a lot, but we're not quite there yet to the point where you can just fire and forget. And I actually think that this is something that people don't broadly appreciate yet, because I think that, like, everyone's chasing this dream of agentic execution. And if we're to really define that down, I think it implies a couple things. You have, like, a multi-step process where each step is fully automated. We don't have to have a human in the loop every time. And there's also kind of like an LM call at each stage or nearly every stage in that [00:20:45]

Swyx: chain. [00:20:45]

Beyang: Based on all the work that we've done, you know, with the inline interactions, with kind of like general Cody features for implementing longer chains of thought, we're actually a little bit more bearish than the average, you know, AI hypefluencer out there on the feasibility of agents with purely kind of like transformer-based models. To your original question, like, the inline interactions with Cody, we actually constrained it to be more targeted, like, you know, fix the current error or make this quick fix. I think that that does differentiate us from a lot of the other tools on the market, because a lot of people are going after this, like, snazzy, like, inline edit interaction, whereas I think where we've moved, and this is based on the user feedback that we've gotten, it's like that sort of thing, it demos well, but when you're actually coding day to day, you don't want to have, like, a long chat conversation inline with the code base. That's a waste of time. You'd rather just have it write the right thing and then move on with your life or not have to think about it. And that's what we're trying to work towards. [00:21:37]

Steve: I mean, yeah, we're not going in the agent direction, right? I mean, I'll believe in agents when somebody shows me one that works. Yeah. Instead, we're working on, you know, sort of solidifying our strength, which is bringing the right context in. So new context sources, ways for you to plug in your own context, ways for you to control or influence the context, you know, the mixing that happens before the request goes out, etc. And there's just so much low-hanging fruit left in that space that, you know, agents seems like a little bit of a boondoggle. [00:22:03]

Beyang: Just to dive into that a little bit further, like, I think, you know, at a very high level, what do people mean when they say agents? They really mean, like, greater automation, fully automated, like, the dream is, like, here's an issue, go implement that. And I don't have to think about it as a human. And I think we are working towards that. Like, that is the eventual goal. I think it's specifically the approach of, like, hey, can we have a transformer-based LM alone be the kind of, like, backbone or the orchestrator of these agentic flows? Where we're a little bit more bearish today. [00:22:31]

Swyx: You want the human in the loop. [00:22:32]

Beyang: I mean, you kind of have to. It's just a reality of the behavior of language models that are purely, like, transformer-based. And I think that's just like a reflection of reality. And I don't think people realize that yet. Because if you look at the way that a lot of other AI tools have implemented context fetching, for instance, like, you see this in the Copilot approach, where if you use, like, the at-workspace thing that supposedly provides, like, code-based level context, it has, like, an agentic approach where you kind of look at how it's behaving. And it feels like they're making multiple requests to the LM being like, what would you do in this case? Would you search for stuff? What sort of files would you gather? Go and read those files. And it's like a multi-hop step, so it takes a long while. It's also non-deterministic. Because any sort of, like, LM invocation, it's like a dice roll. And then at the end of the day, the context it fetches is not that good. Whereas our approach is just like, OK, let's do some code searches that make sense. And then maybe, like, crawl through the reference graph a little bit. That is fast. That doesn't require any sort of LM invocation at all. And we can pull in much better context, you know, very quickly. So it's faster. [00:23:37]

Swyx: It's more reliable. [00:23:37]

Beyang: It's deterministic. And it yields better context quality. And so that's what we think. We just don't think you should cargo cult or naively go like, you know, agents are the [00:23:46]

Swyx: future. [00:23:46]

Beyang: Let's just try to, like, implement agents on top of the LM that exists today. I think there are a couple of other technologies or approaches that need to be refined first before we can get into these kind of, like, multi-stage, fully automated workflows. [00:24:00]
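The LLM-free context fetching Beyang describes, deterministic code search plus a crawl outward along the reference graph, can be sketched in a few lines. All names here are hypothetical illustrations of the idea, not Cody's actual code.

```python
# Minimal sketch of deterministic context fetching: lexical search for seed
# files, then one breadth-first hop along a precomputed reference graph.
def keyword_search(query: str, files: dict[str, str], k: int = 3) -> list[str]:
    """Rank files by naive keyword overlap with the query."""
    terms = set(query.lower().split())
    scored = [(sum(t in text.lower() for t in terms), path) for path, text in files.items()]
    return [path for score, path in sorted(scored, reverse=True)[:k] if score > 0]

def fetch_context(query: str, files: dict[str, str],
                  ref_graph: dict[str, list[str]], budget: int = 6) -> list[str]:
    """Seed with search hits, then crawl the reference graph -- no LLM calls."""
    seeds = keyword_search(query, files)
    context, seen = [], set()
    frontier = list(seeds)
    while frontier and len(context) < budget:
        path = frontier.pop(0)
        if path in seen:
            continue
        seen.add(path)
        context.append(path)
        frontier.extend(ref_graph.get(path, []))  # follow references outward
    return context
```

Because every step is a plain lookup, the result is fast, repeatable, and cheap, which is the contrast being drawn with multi-hop agentic context gathering.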

Swyx: It makes sense. You know, we're very much focused on developer inner loop right now. But you do see things eventually moving towards developer outer loop. Yeah. So would you basically say that they're tackling the agent's problem that you don't want to tackle? [00:24:11]

Beyang: No, I would say at a high level, we are after maybe, like, the same high level problem, which is like, hey, I want some code written. I want to develop some software, and can an automated system go build that software for me? I think the approaches might be different. So I think the analogy in my mind is, I think about, like, the AI chess players. Coding, in some senses, I mean, it's similar and dissimilar to chess. I think one question I ask is, like, do you think producing code is more difficult than playing chess or less difficult than playing chess? More. [00:24:41]

Swyx: I think more. [00:24:41]

Beyang: Right. And if you look at the best AI chess players, like, yes, you can use an LLM to play chess. Like, people have showed demos where it's like, oh, like, yeah, GPT-4 is actually a pretty decent, like, chess move suggester. Right. But you would never build, like, a best in class chess player off of GPT-4 alone. [00:24:57]

Swyx: Right. [00:24:57]

Beyang: Like, the way that people design chess players is that you have kind of like a search space and then you have a way to explore that search space efficiently. There are a bunch of search algorithms, essentially doing tree search in various ways. And you can have heuristic functions, which might be powered by an LLM. [00:25:12]

Swyx: Right. [00:25:12]

Beyang: Like, you might use an LLM to generate proposals in that space that you can efficiently explore. But the backbone is still this kind of more formalized tree search based approach rather than the LLM itself. And so I think my high level intuition is that, like, the way that we get to more reliable multi-step workflows that do things beyond, you know, generating unit tests, it's really going to be like a search based approach where you use an LLM as kind of like an advisor or a proposal function, sort of your heuristic function, like in the A* search algorithm. But it's probably not going to be the thing that is the backbone, because I guess it's not the right tool for that. Yeah. [00:25:50]
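A rough sketch of that "search backbone with the LLM as proposal and heuristic function" idea, in Python. The llm_propose and llm_score functions are stubs standing in for model calls; the point is that the search loop, not the model, drives what gets expanded.

```python
# Best-first search over candidate plans; an LLM would fill in the stubs.
import heapq

def llm_propose(state: str) -> list[str]:
    # Stand-in for an LLM proposing next steps from the current partial plan.
    return [state + f" -> step{i}" for i in range(2)]

def llm_score(state: str, goal: str) -> float:
    # Stand-in heuristic, lower is better (like h(n) in A*).
    return abs(len(goal) - len(state))

def search(start: str, goal: str, max_expansions: int = 50) -> str | None:
    frontier = [(llm_score(start, goal), start)]
    for _ in range(max_expansions):
        if not frontier:
            break
        score, state = heapq.heappop(frontier)
        if score == 0:  # goal test; in practice a verifier (tests, compiler) decides
            return state
        for nxt in llm_propose(state):
            heapq.heappush(frontier, (llm_score(nxt, goal), nxt))
    return None  # budget exhausted without a verified solution
```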

Swyx: I can see you kind of thinking through this, but not saying the words, the sort of philosophical Peter Norvig type discussion. Maybe you want to sort of introduce that in software. Yeah, definitely. [00:25:59]

Beyang: So your listeners are savvy. They're probably familiar with the classic like Chomsky versus Norvig debate. [00:26:04]

Swyx: No, actually, I wanted, I was prompting you to introduce that. Oh, got it. [00:26:08]

Beyang: So, I mean, if you look at the history of artificial intelligence, right, you know, it goes way back to, I don't know, it's probably as old as modern computers, like 50s, 60s, 70s. People are debating on like, what is the path to producing a sort of like general human level of intelligence? And kind of two schools of thought emerged. One is the Norvig school of thought, which roughly speaking includes large language models, you know, regression, SVMs, basically any model that you kind of like learn from data. And it's like data driven. Most of machine learning would fall under this umbrella. And that school of thought says like, you know, just learn from the data. That's the approach to reaching intelligence. And then the Chomsky approach is more things like compilers and parsers and formal systems. So basically like, let's think very carefully about how to construct a formal, precise system. And that will be the approach to how we build a truly intelligent system. I think Lisp was invented so that you could create like rules-based systems that you would call AI. As a language. Yeah. And for a long time, there was like this debate, like there's certain like AI research labs that were more like, you know, in the Chomsky camp and others that were more in the Norvig camp. It's a debate that rages on today. And I feel like the consensus right now is that, you know, Norvig definitely has the upper hand right now with the advent of LMs and diffusion models and all the other recent progress in machine learning. But the Chomsky-based stuff is still really useful in my view. I mean, it's like parsers, compilers, basically a lot of the stuff that provides really good context. It provides kind of like the knowledge graph backbone that you want to explore with your AI dev tool. Like that will come from kind of like Chomsky-based tools like compilers and parsers. It's a lot of what we've invested in in the past decade at Sourcegraph and what you built with Grok. Basically like these formal systems that construct these very precise knowledge graphs that are great context providers and great kind of guard rails enforcers and kind of like safety checkers for the output of a more kind of like data-driven, fuzzier system that uses like the Norvig-based models. [00:28:03]

Steve: Beyang was talking about this stuff like it happened in the Middle Ages. Like, okay, so when I was in college, I was in college learning Lisp and Prolog and planning and all the deterministic Chomsky approaches to AI. And I was there when Norvig basically declared it dead. I was there 3,000 years ago when Norvig and Chomsky fought on the volcano. When did he declare it dead? [00:28:26]

Swyx: What do you mean he declared it dead? [00:28:27]

Steve: It was like late 90s. [00:28:29]

Swyx: Yeah. [00:28:29]

Steve: When I went to Google, Peter Norvig was already there. He had basically like, I forget exactly where. It was some, he's got so many famous short posts, you know, amazing. [00:28:38]

Swyx: He had a famous talk, the unreasonable effectiveness of data. Yeah. [00:28:41]

Steve: Maybe that was it. But at some point, basically, he convinced everybody that deterministic approaches had failed and that heuristic-based, you know, data-driven, stochastic statistical approaches were better. [00:28:52]

Swyx: Yeah. [00:28:52]

Steve: The primary reason I can tell you this, because I was there, was that, was that, well, the steam-powered engine, no. The reason was that the deterministic stuff didn't scale. [00:29:06]

Swyx: Yeah. Right. [00:29:06]

Steve: They're using Prolog, man, constraint systems and stuff like that. Well, that was a long time ago, right? Today, actually, these Chomsky-style systems do scale. And that's, in fact, exactly what Sourcegraph has built. Yeah. And so we have a very unique, I love the framing that Beyang's made, that the marriage of the Chomsky and the Norvig, you know, sort of models, you know, conceptual models, because we, you know, we have both of them and they're both really important. And in fact, there, there's this really interesting, like, kind of overlap between them, right? Where like the AI or our graph or our search engine could potentially provide the right context for any given query, which is, of course, why ranking is important. But what we've really signed ourselves up for is an extraordinary amount of testing. [00:29:45]

Swyx: Yeah. [00:29:45]

Steve: Because, Swyx, you were saying that, you know, GPT-4 attends to the front of the context window and maybe other LLMs to the back, and maybe it's lossy in the middle. [00:29:53]

Swyx: Yeah. [00:29:53]

Steve: And so that means that, you know, if we're actually like, you know, verifying whether we, you know, some change we've made has improved things, we're going to have to test putting it at the beginning of the window and at the end of the window, you know, and maybe make the right decision based on the LLM that you've chosen. Which some of our competitors, that's a problem that they don't have, but we meet you, you know, where you are. Yeah. And we're, just to finish, we're writing tens of thousands. We're generating tests, you know, fill in the middle type tests and things. And then using our graph to basically sort of fine tune Cody's behavior there. [00:30:20]

Swyx: Yeah. [00:30:21]

Beyang: I also want to add, like, I have like an internal pet name for this, like kind of hybrid architecture that I'm trying to make catch on. Maybe I'll just say it here. Just saying it publicly kind of makes it more real. But like, I call the architecture that we've developed the Normsky architecture. [00:30:36]

Swyx: Yeah. [00:30:36]

Beyang: I mean, it's obviously a portmanteau of Norvig and Chomsky, but the acronym, it stands for non-agentic, rapid, multi-source code intelligence. So non-agentic because... Rolls right off the tongue. And Normsky. But it's non-agentic in the sense that like, we're not trying to like pitch you on kind of like agent hype, right? Like the things it does are really just developer tools developers have been using for decades now, like parsers and really good search indexes and things like that. Rapid because we place an emphasis on speed. We don't want to sit there waiting for kind of like multiple LLM requests to return to complete a simple user request. Multi-source because we're thinking broadly about what pieces of information and knowledge are useful context. So obviously starting with things that you can search in your code base, and then you add in the reference graph, which kind of like allows you to crawl outward from those initial results. But then even beyond that, you know, sources of information, like there's a lot of knowledge that's embedded in docs, in PRDs or product specs, in your production logging system, in your chat, in your Slack channel, right? Like there's so much context embedded there. And when you're a human developer, and you're trying to like be productive in your code base, you're going to go to all these different systems to collect the context that you need to figure out what code you need to write. And I don't think the AI developer will be any different. It will need to pull context from all these different sources. So we're thinking broadly about how to integrate these into Cody. We hope through kind of like an open protocol that like others can extend and implement. And this is something else that should be accessible by December 14th in kind of like a preview stage. But that's really about like broadening this notion of the code graph beyond your Git repository to all the other sources where technical knowledge and valuable context can live. [00:32:21]

Steve: Yeah, it becomes an artifact graph, right? It can link into your logs and your wikis and any data source, right? [00:32:27]
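To illustrate the multi-source context idea, here is a hypothetical sketch of what a common "context provider" interface could look like. This is not the actual OpenCodeGraph schema, just an illustration of many sources (code search, docs, logs, Slack) exposing context through one shape.

```python
# Hypothetical context-provider interface; names and fields are illustrative.
from dataclasses import dataclass
from typing import Protocol

@dataclass
class ContextItem:
    source: str      # e.g. "code-search", "docs", "logging", "slack"
    title: str
    content: str
    uri: str

class ContextProvider(Protocol):
    def provide(self, query: str, limit: int) -> list[ContextItem]:
        """Return up to `limit` items relevant to the query."""
        ...

class DocsProvider:
    """Example provider over a set of documentation pages."""
    def __init__(self, pages: dict[str, str]):
        self.pages = pages

    def provide(self, query: str, limit: int) -> list[ContextItem]:
        hits = [ContextItem("docs", path, text, f"docs://{path}")
                for path, text in self.pages.items()
                if query.lower() in text.lower()]
        return hits[:limit]
```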

Alessio: How do you guys think about the importance of, it's almost like data pre-processing in a way, which is bringing it all together, tying it together, making it ready. Any thoughts on how to actually make that good? Some of the innovations you guys have made? [00:32:40]

Steve: We talk a lot about the context fetching, right? I mean, there's a lot of ways you could answer this question. But, you know, we've spent a lot of time just in this podcast here talking about context fetching. But stuffing the context into the window is, you know, the bin packing problem, right? Because the window is not big enough, and you've got more context than you can fit. You've got a ranker maybe. But what is that context? Is it a function that was returned by an embedding or a graph call or something? Do you need the whole function? Or do you just need, you know, the top part of the function, this expression here, right? You know, so that art, the golf game of trying to, you know, get each piece of context down into its smallest state, possibly even summarized by another model, right, before it even goes to the LLM, becomes this is the game that we're in, yeah? And so, you know, recursive summarization and all the other techniques that you got to use to like stuff stuff into that context window become, you know, critically important. And you have to test them across every configuration of models that you could possibly need. [00:33:32]
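The "bin packing" step Steve describes can be sketched as a greedy packer over a token budget. A minimal sketch, assuming word count as a crude stand-in for real tokenization and truncation as a stand-in for model-based summarization:

```python
# Greedy context packing under a token budget.
def count_tokens(text: str) -> int:
    return len(text.split())  # crude proxy for a real tokenizer

def pack_context(ranked_snippets: list[str], budget: int, max_per_item: int = 200) -> list[str]:
    packed, used = [], 0
    for snippet in ranked_snippets:
        words = snippet.split()
        if len(words) > max_per_item:
            # Oversized items get trimmed; a real system might summarize them
            # with a smaller model instead of truncating.
            snippet = " ".join(words[:max_per_item]) + " ..."
        cost = count_tokens(snippet)
        if used + cost > budget:
            continue  # skip items that don't fit; a smarter packer might split or re-rank
        packed.append(snippet)
        used += cost
    return packed
```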

Beyang: I think data preprocessing is probably the like unsexy, way underappreciated secret to a lot of the cool stuff that people are shipping today. Whether you're doing like RAG or fine tuning or pre-training, like the preprocessing step matters so much because it's basically garbage in, garbage out, right? Like if you're feeding in garbage to the model, then it's going to output garbage. Concretely, you know, for code RAG, if you're not doing some sort of like preprocessing that takes advantage of a parser and is able to like extract the key components of a particular file of code, you know, separate the function signature from the body, from the doc string, what are you even doing? Like that's like table stakes. It opens up so much more possibilities with which you can kind of like tune your system to take advantage of the signals that come from those different parts of the code. Like we've had a tool, you know, since computers were invented that understands the structure of source code to a hundred percent precision. The compiler knows everything there is to know about the code in terms of like structure. Like why would you not want to use that in a system that's trying to generate code, answer questions about code? You shouldn't throw that out the window just because now we have really good, you know, data-driven models that can do other things. [00:34:44]
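As a concrete example of parser-driven preprocessing, the sketch below uses Python's own ast module to split a function into signature, docstring, and body so each part can be indexed and weighted separately. A real multi-language pipeline would use a parser per language (for instance, something like tree-sitter); this is only an illustration of the idea.

```python
# Split Python functions into signature / docstring / body using the ast module.
import ast

def split_functions(source: str) -> list[dict[str, str]]:
    out = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            sig = f"def {node.name}({ast.unparse(node.args)})"
            if node.returns is not None:
                sig += f" -> {ast.unparse(node.returns)}"
            doc = ast.get_docstring(node) or ""
            body_stmts = node.body[1:] if doc else node.body  # drop the docstring statement
            body = "\n".join(ast.unparse(stmt) for stmt in body_stmts)
            out.append({"signature": sig, "docstring": doc, "body": body})
    return out

example = '''
def add(a: int, b: int) -> int:
    """Return the sum of a and b."""
    return a + b
'''
print(split_functions(example))
```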

Steve: Yeah. When I called it a data moat, you know, in my cheating post, a lot of people were confused, you know, because data moat sort of sounds like data lake because there's data and water and stuff. I don't know. And so they thought that we were sitting on this giant mountain of data that we had collected, but that's not what our data moat is. It's really a data pre-processing engine that can very quickly and scalably, like basically dissect your entire code base in a very small, fine-grained, you know, semantic unit and then serve it up. Yeah. And so it's really, it's not a data moat. It's a data pre-processing moat, I guess. [00:35:15]

Beyang: Yeah. If anything, we're like hypersensitive to customer data privacy requirements. So it's not like we've taken a bunch of private data and like, you know, trained a generally available model. In fact, exactly the opposite. A lot of our customers are choosing Cody over Copilot and other competitors because we have an explicit guarantee that we don't do any of that. And that we've done that from day one. Yeah. I think that's a very real concern in today's day and age, because like if your proprietary IP finds its way into the training set of any model, it's very easy both to like extract that knowledge from the model and also use it to, you know, build systems that kind of work on top of the institutional knowledge that you've built up. [00:35:52]

Alessio: About a year ago, I wrote a post on LLMs for developers. And one of the points I had was maybe the depth of like the DSL. I spent most of my career writing Ruby and I love Ruby. It's so nice to use, but you know, it's not as performant, but it's really easy to read, right? And then you look at other languages, maybe they're faster, but like they're more verbose, you know? And when you think about efficiency of the context window, that actually matters. [00:36:15]

Swyx: Yeah. [00:36:15]

Alessio: But I haven't really seen a DSL for models, you know? I haven't seen like code being optimized to like be easier to put in a model context. And it seems like your pre-processing is kind of doing that. Do you see, in the future, the way we think about DSLs and APIs and kind of like service interfaces being more focused on being context friendly, where it's like maybe it's harder to read for the human, but like the human is never going to write it anyway? We were talking on the Hacks podcast. There are like some data science things, like spin up the spandex, that humans are never going to write again because the models can just do it very easily. Yeah, curious to hear your thoughts. [00:36:51]

Steve: Well, so DSLs, they involve, you know, writing a grammar and a parser and they're like little languages, right? We do them that way because, you know, we need them to compile and humans need to be able to read them and so on. The LLMs don't need that level of structure. You can throw any pile of crap at them, you know, more or less unstructured and they'll deal with it. So I think that's why a DSL hasn't emerged for sort of like communicating with the LLM or packaging up the context or anything. Maybe it will at some point, right? We've got, you know, tagging of context and things like that that are sort of peeking into DSL territory, right? But your point on do users, you know, do people have to learn DSLs like regular expressions or, you know, pick your favorite, right? XPath. I think you're absolutely right that the LLMs are really, really good at that. And I think you're going to see a lot less of people having to slave away learning these things. They just have to know the broad capabilities and the LLM will take care of the rest. [00:37:42]

Swyx: Yeah, I'd agree with that. [00:37:43]

Beyang: I think basically like the value proposition of a DSL is that it makes it easier to work with a lower level language, but at the expense of introducing an abstraction layer. And in many cases today, you know, without the benefit of AI code generation, that's totally worth it, right? With the benefit of AI code generation, I mean, I don't think all DSLs will go away. I think there's still, you know, places where that trade-off is going to be worthwhile. But it's kind of like how much of source code do you think is going to be generated through natural language prompting in the future? Because in a way, like any programming language is just a DSL on top of assembly, right? And so if people can do that, then yeah, like maybe for a large portion of the code [00:38:21]

Swyx: that's written, [00:38:21]

Beyang: people don't actually have to understand the DSL that is Ruby or Python or basically any other programming language that exists. [00:38:28]

Steve: I mean, seriously, do you guys ever write SQL queries now without using a model of some sort? At least a draft. [00:38:34]

Swyx: Yeah, right. [00:38:36]

Steve: And so we have kind of like, you know, past that bridge, right? [00:38:39]

Alessio: Yeah, I think like to me, the long-term thing is like, is there ever going to be a point where you don't actually see the code, you know? It's like, hey, the basic thing is like, hey, I need a function to sum two numbers and that's it. I don't need you to generate the code. [00:38:53]

Steve: And the following question, do you need the engineer or the paycheck? [00:38:56]

Swyx: I mean, right? [00:38:58]

Alessio: That's kind of the agent's discussion in a way where like you cannot automate the agents, but like slowly you're getting more of the atomic units of the work kind of like done. I kind of think of it as like, you know, [00:39:09]

Beyang: do you need a punch card operator to answer that for you? And so like, I think we're still going to have people in the role of a software engineer, but the portion of time they spend on these kinds of like low-level, tedious tasks versus the higher level, more creative tasks is going to shift. [00:39:23]

Steve: No, I haven't used punch cards. [00:39:25]

Swyx: Yeah, I've been talking about like, so we kind of made this podcast about the sort of rise of the AI engineer. And like the first step is the AI enhanced engineer. That is that software developer that is no longer doing these routine, boilerplate-y type tasks, because they're just enhanced by tools like yours. So you mentioned OpenCodeGraph. I mean, that is a kind of DSL maybe, and because you're releasing this as you go GA, you hope for other people to take advantage of that? [00:39:52]

Beyang: Oh yeah, I would say OpenCodeGraph is not a DSL. It's more of a protocol. It's basically like, hey, if you want to make your system, whether it's, you know, chat or logging or whatever accessible to an AI developer tool like Cody, here's kind of like the schema by which you can provide that context and offer hints. So, you know, a comparison would be LSP, which obviously did this for kind of like standard code intelligence. It's kind of like a lingua franca for providing find references and go to definition. There's kind of like analogs to that. There might be also analogs to kind of like the original OpenAI plugins API. There's all this like context out there that might be useful for an LM-based system to consume. And so at a high level, what we're trying to do is define a common language for context providers to provide context to other tools in the software development lifecycle. Yeah. Do you have any critiques of LSP, by the way, [00:40:42]

Swyx: since like this is very much, very close to home? [00:40:45]

Steve: One of the authors wrote a really good critique recently. Yeah. I don't think I saw that. Yeah, yeah. LSP could have been better. It just came out a couple of weeks ago. It was a good article. [00:40:54]

Beyang: Yeah. I think LSP is great. Like for what it did for the developer ecosystem, it was absolutely fantastic. Like nowadays, like it's much easier now to get code navigation up and running in a bunch of editors by speaking this protocol. I think maybe the interesting question is like looking at the different design decisions comparing LSP basically with Kythe. Because Kythe has more of a... How would you describe it? [00:41:18]

Steve: A storage format. [00:41:20]

Beyang: I think the critique of LSP from a Kythe point of view would be like with LSP, you don't actually have an actual symbolic model of the code. It's not like LSP models like, hey, this function calls this other function. LSP is all like range-based. Like, hey, your cursor's at line 32, column 1. [00:41:35]

Swyx: Yeah. [00:41:35]

Beyang: And that's the thing you feed into the language server. And then it's like, okay, here's the range that you should jump to if you click on that range. So it kind of is intentionally ignorant of the fact that there's a thing called a reference underneath your cursor, and that's linked to a symbol definition. [00:41:49]

Steve: Well, actually, that's the worst example you could have used. You're right. But that's the one thing that it actually did bake in is following references. [00:41:56]

Swyx: Sure. [00:41:56]

Steve: But it's sort of hardwired. [00:41:58]

Swyx: Yeah. [00:41:58]

Steve: Whereas Kythe attempts to model [00:42:00]

Beyang: like all these things explicitly. [00:42:02]

Swyx: And so... [00:42:02]

Steve: Well, so LSP is a protocol, right? And so Google's internal protocol is gRPC-based. And it's a different approach than LSP. It's basically you make a heavy query to the back end, and you get a lot of data back, and then you render the whole page, you know? So we've looked at LSP, and we think that it's a little long in the tooth, right? I mean, it's a great protocol, lots and lots of support for it. But we need to push into the domain of exposing the intelligence through the protocol. Yeah. [00:42:29]

Beyang: And so I would say we've developed a protocol of our own called Skip, which is at a very high level trying to take some of the good ideas from LSP and from Kythe and merge that into a system that in the near term is useful for Sourcegraph, but I think in the long term, we hope will be useful for the ecosystem. Okay, so here's what LSP did well. LSP, by virtue of being like intentionally dumb, dumb in air quotes, because I'm not like ragging on it, allowed language servers developers to kind of like bypass the hard problem of like modeling language semantics precisely. So like if all you want to do is jump to definition, you don't have to come up with like a universally unique naming scheme for each symbol, which is actually quite challenging because you have to think about like, okay, what's the top scope of this name? Is it the source code repository? Is it the package? Does it depend on like what package server you're fetching this from? Like whether it's the public one or the one inside your... Anyways, like naming is hard, right? And by just going from kind of like a location to location based approach, you basically just like throw that out the window. All I care about is jumping definition, just make that work. And you can make that work without having to deal with like all the complex global naming things. The limitation of that approach is that it's harder to build on top of that to build like a true knowledge graph. Like if you actually want a system that says like, okay, here's the web of functions and here's how they reference each other. And I want to incorporate that like semantic model of how the code operates or how the code relates to each other at like a static level. You can't do that with LSP because you have to deal with line ranges. And like concretely the pain point that we found in using LSP for source graph is like in order to do like a find references [00:44:04]

Swyx: and then jump definitions, [00:44:04]

Beyang: it's like a multi-hop process because like you have to jump to the range and then you have to find the symbol at that range. And it just adds a lot of latency and complexity of these operations where as a human, you're like, well, this thing clearly references this other thing. Why can't you just jump me to that? And I think that's the thing that Kythe does well. But then I think the issue that Kythe has had with adoption is that it is a more sophisticated schema, I think. And so there's basically more things that you have to implement to get like a Kythe implementation up and running. I hope I'm not like, correct me if I'm wrong about any of this. [00:44:35]

Steve: 100%, 100%. Kythe also has a problem, all these systems have the problem, even Skip, or at least the way that we implemented the indexers, that they have to integrate with your build system in order to build that knowledge graph, right? Because you have to basically compile the code in a special mode to generate artifacts instead of binaries. And I would say, by the way, earlier I was saying that XREFs were in LSP, but it's actually, I was thinking of LSP plus LSIF. [00:44:58]

Swyx: Yeah. That's another. [00:45:01]

Steve: Which is actually bad. We can say that it's bad, right? [00:45:04]

Steve: It's like Skip or Kythe, it's supposed to be sort of a model serialization, you know, for the code graph, but it basically just does what LSP needs, the bare minimum. LSIF is basically if you took LSP [00:45:16]

Beyang: and turned that into a serialization format. So like you build an index for language servers to kind of like quickly bootstrap from cold start. But it's a graph model [00:45:23]

Steve: with all of the inconvenience of the API without an actual graph. And so, yeah. [00:45:29]

Beyang: So like one of the things that we try to do with skip is try to capture the best of both worlds. So like make it easy to write an indexer, make the schema simple, but also model some of the more symbolic characteristics of the code that would allow us to essentially construct this knowledge graph that we can then make useful for both the human developer through SourceGraph and through the AI developer through Cody. [00:45:49]
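A toy contrast between the two models being discussed, a range-based (LSP-style) lookup versus a symbol-keyed graph (Kythe/Skip-style), using plain dictionaries. These data structures are purely illustrative, not either protocol's real schema.

```python
# Range-based: resolving a reference is position -> definition location, and
# you still need another query to learn anything about the symbol there.
definitions_by_position = {
    ("main.py", 10, 4): ("utils.py", 3, 0),  # cursor at main.py:10:4 jumps to utils.py:3:0
}

# Symbol-based: the graph stores edges between named symbols directly, so
# "who references parse_config?" is a single lookup.
references_by_symbol = {
    "pkg/utils.parse_config": ["main.py:10", "server.py:42"],
}

print(definitions_by_position[("main.py", 10, 4)])
print(references_by_symbol["pkg/utils.parse_config"])
```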

Steve: So anyway, just to finish off the graph comment, we've got a new graph, yeah, that's skip based. We call it BFG internally, right? It's a beautiful something graph. A big friendly graph. [00:46:00]

Swyx: A big friendly graph. [00:46:01]

Beyang: It's a blazing fast. [00:46:02]

Steve: Blazing fast. [00:46:03]

Swyx: Blazing fast graph. [00:46:04]

Steve: And it is blazing fast, actually. It's really, really interesting. I should probably do a blog post about it to walk you through exactly how they're doing it. Oh, please. But it's a very AI-like iterative, you know, experimentation sort of approach. We're building a code graph based on all of our 10 years of knowledge about building code graphs, yeah? But we're building it quickly with zero configuration, and it doesn't have to integrate with your build. And through some magic tricks that we have. And so it just happens when you install the plugin that it'll be there indexing your code and providing that knowledge graph in the background without all that build system integration. This is a bit of secret sauce that we haven't really advertised very much lately. But I am super excited about it because what they do is they say, all right, you know, let's tackle function parameters today. Cody's not doing a very good job of completing function call arguments or function parameters in the definition, right? Yeah, we generate those thousands of tests, and then we can actually reuse those tests for the AI context as well. So fortunately, things are kind of converging on, we have, you know, half a dozen really, really good context sources, and we mix them all together. So anyway, BFG, you're going to hear more about it probably in the holidays? [00:47:12]

Beyang: I think it'll be online for December 14th. We'll probably mention it. BFG is probably not the public name we're going to go with. I think we might call it like Graph Context or something like that. [00:47:20]

Steve: We're officially calling it BFG. [00:47:22]

Swyx: You heard it here first. [00:47:24]

Beyang: BFG is just kind of like the working name. And so the impetus for BFG was like, if you look at like current AI inline code completion tools and the errors that they make, a lot of the errors that they make, even in kind of like the easy, like single line case, are essentially like type errors, right? Like you're trying to complete a function call and it suggests a variable that you defined earlier, but that variable is the wrong type. [00:47:47]

Swyx: And that's the sort of thing [00:47:47]

Beyang: where it's like a first year, like freshman CS student would not make that error, right? So like, why does the AI make that error? And the reason is, I mean, the AI is just suggesting things that are plausible without the context of the types or any other like broader files in the code. And so the kind of intuition here is like, why don't we just do the basic thing that like any baseline intelligent human developer would do, which is like click jump to definition, click some find references and pull in that like Graph Context into the context window and then have it generate the completion. So like that's sort of like the MVP of what BFG was. And turns out that works really well. Like you can eliminate a lot of type errors that AI coding tools make just by pulling in that context. Yeah, but the graph is definitely [00:48:32]

Steve: our Chomsky side. [00:48:33]

Swyx: Yeah, exactly. [00:48:34]

Beyang: So like this like Chomsky-Norvig thing, I think pops up in a bunch of different layers. And I think it's just a very useful and also kind of like nicely nerdy way to describe the system that we're trying to build. [00:48:46]
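The type-aware completion idea described above, resolve the types of in-scope identifiers (as a jump-to-definition would) and put them in the prompt before asking for a completion, can be sketched roughly like this. resolve_type and build_prompt are hypothetical stand-ins, not BFG's actual interface.

```python
# Sketch: prepend type facts from the code graph to the completion prompt.
def resolve_type(identifier: str, symbol_table: dict[str, str]) -> str:
    return symbol_table.get(identifier, "unknown")

def build_prompt(prefix: str, in_scope: list[str], symbol_table: dict[str, str]) -> str:
    type_facts = "\n".join(f"# {name}: {resolve_type(name, symbol_table)}" for name in in_scope)
    return f"{type_facts}\n{prefix}"

# Hypothetical usage: the model now sees that user_id is an int, not a string.
symbol_table = {"user_id": "int", "session": "requests.Session"}
print(build_prompt("fetch_profile(", ["user_id", "session"], symbol_table))
```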

Steve: By the way, I remembered the point I was trying to make earlier to your question, Alessio, about is AI going to replace programmers? And I was talking about how compilers, they thought, oh, are compilers going to replace programming? And what it did was just change [00:48:57]

Beyang: kind of what programmers [00:48:58]

Steve: had to focus on. And I think AI is just going to level us up in the game, right? Programmers are still in the middle of stuff until, you know, agents come along, but I don't believe that yet. And so, yeah. [00:49:09]

Beyang: Yeah, I mean, to be clear, again, like with the agent stuff at a high level, I think we will get there. [00:49:14]

Swyx: I think that's still [00:49:14]

Beyang: the kind of long-term target. And I think also with Cody, it's like you can have Cody like draft up an execution plan. It's just not going to be the sort of thing where you don't have to attend to what it's doing. Like we think that like with Cody, it's like, yes, Cody, like, hey, I have this bug, [00:49:30]

Swyx: help me solve it. [00:49:30]

Beyang: It would do a reasonable job of fetching context and saying, like, here are the files you should modify. And if you prompt it further, you can actually suggest like code changes to make to those files. And that's a very nice way to like resolve issues because you're kind of like on the rails for most of the time. But then, you know, now and then you have to intervene as a human. I just think that like [00:49:48]

Swyx: if we're trying to get [00:49:48]

Beyang: to complete automation, where it's like the sort of thing where like a non-software engineer, like someone who has no technical expertise can just like speak a non-trivial feature into existence. [00:49:59]

Swyx: You know, that is still, [00:50:00]

Beyang: I think, several key innovations away from happening right now. And I don't think the pure like transformer based LLM orchestrator modeled agents that is kind of like dominant today is going to get us there. Yeah. [00:50:14]

Swyx: What you're talking about triggered a thread I've been working on for a little bit, which is, you know, we're very much reacting to developments in models on a month-to-month basis. We had a post about how we're going to need a bigger moat, which is a great Jaws reference for those who didn't catch it. I forgot all about that. How quickly models are evolving. But I think if you like kind of look out, I actually caught Sam Altman on a podcast yesterday talking about GPT-10. I know. Wow. [00:50:40]

Beyang: Things are accelerating. [00:50:42]

Swyx: And actually there's a pretty good cadence from GPT-2, 3 and 4 that you can project out. This is based on George Hotz's concept of like 20 petaflops being a human's worth of compute. GPT-4 took about 100 years, in terms of human years of compute, to train. So that's one living person. And every generation of GPT increases two orders of magnitude. So 5 is, you know, 100 people. And if you just project it out, 9 is every human on earth and 10 is every human ever. And he thinks he'll reach there by the end of the decade. George Hotz does? No, Sam Altman. Oh, Sam Altman. Okay. [00:51:19]
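A back-of-envelope version of that extrapolation, using only the rough figures quoted in the conversation (not measurements): GPT-4 at roughly 100 human-years of compute, about one lifetime, with each generation two orders of magnitude more.

```python
# Extrapolate the quoted two-orders-of-magnitude-per-generation cadence.
for gen in range(4, 11):
    lifetimes = 100 ** (gen - 4)  # GPT-4 ~ 1 human lifetime of compute, each step x100
    print(f"GPT-{gen}: ~{lifetimes:.0e} human lifetimes of compute")
# GPT-9 lands around 1e10 (roughly "every human on earth") and GPT-10 around
# 1e12 (loosely "every human ever") -- the spirit of the projection, not a precise claim.
```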

Beyang: Yeah. [00:51:20]

Swyx: So I just like setting those high level dots on the line. We're at the start of the curve with Moore's law. Gordon Moore, I think, thought it would last like 10 years. Yeah. And he just kept drawing for another 50. Yeah. And I think we have all these data points and we're just trying to draw, extrapolate the curve to where this goes. All I'm saying is, the agent stuff that we doubt today might come here by 2030. And I don't know how you plan when things are not possible today and you're like, it's not worth doing. But like, you know, I mean, we're going to be here in 2030. [00:51:50]

Swyx: And what do we do then? [00:51:54]

Beyang: So is the question like, you know... There's no question. [00:51:57]

Swyx: It's more like sharing a comment, just because like at the back of my head, anytime we hear things like, things are not practical today. Yeah. I'm just like, all right, but how do we... [00:52:06]

Beyang: So here's like a question maybe, like I get the whole like scaling argument. I do think that there will be something like a Moore's law for AI inference. I mean, definitely, I think at like the hardware level, like GPUs, I think it gets a little fuzzier the higher you move up in the stack. But for instance, like going back to the chess analogy, right? At what point do we think that, you know, GPT-X or whatever, you know, a pure transformer-based LM will be like state of the art or outperform the best like chess playing algorithm today? Because I think that is one milestone on... Where you completely overlap search. [00:52:41]

Swyx: Yeah, exactly. [00:52:42]

Beyang: Because I think that would be, I mean, just to put my cards on the table, I think that would kind of disprove the thesis that I just stated, which is, you know, kind of like the pure transformer, just scale the transformer based approach. That would be a proof point where like, hey, like maybe that is the right approach versus, oh, we actually have to take a step back and think, you get what I'm saying, right? Like is the transformer going to be like, is that the end all be all of architectures and it's just a matter of scaling that? [00:53:04]

Swyx: Yeah. [00:53:04]

Beyang: Or are there other algorithms and like that is going to be one piece of a system of intelligence that will have to take advantage of like many other algorithms and approaches. Yeah, we shall see. [00:53:14]

Swyx: Maybe John Carmack will find it. Yeah. All right. Sorry for that digression. I'm just very curious. So one thing I did actually want to check in on because we talked a little bit about code graphs and like reference graphs and all that. Do you actually use a graph database? No, right? No. [00:53:29]

Beyang: Why would we use a graph database? [00:53:31]

Steve: We use Postgres. And yeah, I saw a paper actually right after I joined Sourcegraph. There was some joint study between IBM and some other company that basically showed that Postgres was performing as well as most of the graph databases for most graph workloads. [00:53:43]

Swyx: Wow. [00:53:45]

Beyang: In V0 of Sourcegraph, we're like, we're building a code graph. Let's use a graph database. I won't name the database because I mean, it was like 10 years ago. So they're probably much better now. But like we basically tried to dump like a non-trivially sized like dataset, but also like not the whole universe of code, right? Like it was a relatively small dataset compared to what we're indexing now [00:54:05]

Swyx: into the database. [00:54:05]

Beyang: And it was just, we let it run for like a week. And I think it like seg faulted or something. And we're like, okay, let's try another approach. Let's just put everything in Postgres. And these days, like the graph data, I mean, it's partially in Postgres. It's partially just, I mean, you could store them as like flat files. [00:54:21]

Swyx: Yep. [00:54:21]

Beyang: I mean, at the end of the day, all the databases like just get me the data I want. Like answer the queries that I need, right? Like if all your queries are like, you know, single hops. [00:54:30]

Steve: Which they will be if you denormalize from other use cases. [00:54:33]

Beyang: Exactly. [00:54:34]

Swyx: Interesting. [00:54:34]

Beyang: So yeah. [00:54:35]

Swyx: Set of normal form is just a bunch of files. Yeah, yeah. And I don't know, like, [00:54:40]

Beyang: I feel like there's a bunch of stuff like that where it's like, if you look past the marketing and think about like the actual query load or like the traffic patterns or the end user use cases you need to serve, just go with like the tried and true, kind of like dumb classic tools over kind of like the new agent stuff. Yeah. I mean, there's a bunch of stuff like that in the search domain too. Especially right now with like, you know, embeddings and vector search and all that. But, you know, like classic search techniques still go very far. And I don't know, I think in the next year or two, maybe as we get past like the peak AI hype, we'll start to see the gap emerge or become more obvious to more people about like how many of like the newfangled techniques actually work in practice and yield a better product experience day to day. Yeah. [00:55:25]
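To make the earlier Postgres point concrete, here is a tiny illustration of storing a reference graph as a plain edges table and serving single-hop queries with an ordinary index. SQLite stands in for Postgres, and the schema is hypothetical, not Sourcegraph's actual one.

```python
# Denormalized single-hop "find references" on a relational database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE refs (from_symbol TEXT, to_symbol TEXT)")
conn.execute("CREATE INDEX idx_refs_to ON refs (to_symbol)")
conn.executemany(
    "INSERT INTO refs VALUES (?, ?)",
    [("main.handleRequest", "utils.parseConfig"),
     ("server.start", "utils.parseConfig")],
)

# Single-hop lookup is just an indexed query -- no graph database required.
rows = conn.execute(
    "SELECT from_symbol FROM refs WHERE to_symbol = ?", ("utils.parseConfig",)
).fetchall()
print([r[0] for r in rows])  # e.g. ['main.handleRequest', 'server.start']
```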

Swyx: So speaking of which, like, you know, obviously there's a bunch of other people trying to build AI tooling. What can you say about your AI stack? Obviously you build a lot proprietary in-house, but like what approaches, you know, like so prompt engineering, do you have a prompt engineering management tool? You know, what approaches there do you do? Pre-processing orchestration, like do you use Airflow? Do you use something else? Like, you know, that kind of stuff. Yeah. [00:55:46]

Beyang: Ours is very like duct taped together at the moment. So in terms of stack, it's essentially Go and TypeScript and now Rust. There's the code knowledge graph that we built, which is using indexers, many of which are open source, that speak the Skip protocol. And we have the code search backend. You know, traditionally we supported regular expression search and a string literal search with like a trigram index. And we're also building more like fuzzy search on top of that now, kind of like natural language or keyword based search on top of that. We use a variety of open source and proprietary models. We try to be like pluggable with respect to different models so we can easily swap the latest model in and out as they come online. I'm just hunting for like, [00:56:26]

Swyx: is there anything out there that you're like, these guys are really good. Everyone else should check them out. So for example, you talked about recursive summarization, which is something that LangChain and LlamaIndex do. I presume you wrote your own. Yeah, we wrote our own. [00:56:37]

Beyang: I think like the stuff that LlamaIndex and LangChain are doing is like super interesting. I think from our point of view, it's like we're still in the application, like end user use case discovery phase. And so adopting like an external infrastructure or middleware kind of tool just seems like overly constraining right now. Yeah, we need full control. Yeah, we need full control because we need to be able to iterate rapidly up and down the stack. But maybe at some point there'll be like a convergence and we can actually merge some of our stuff into theirs and turn that into a common resource. In terms of like other vendors that we use, I mean, obviously like nothing but good things to say about Anthropic and OpenAI, which we both kind of partner with and use. Also a plug for Fireworks as an inference platform. Their team was kind of like ex-Meta people who basically know all like the bag of tricks for making inference fast. Yeah, I met Lynn. [00:57:25]

Swyx: So she was like with Soumith. She was like the co-manager of PyTorch for five years. Yeah, yeah, yeah. [00:57:31]

Beyang: But like is their main thing [00:57:32]

Swyx: that we just do fastest inference on earth? Is that what it is or? I think that's the pitch. [00:57:37]

Beyang: And it keeps getting faster somehow. Like we run Starcoder on top of Fireworks and that's made it so that we just don't have to think about building up an inference stack. And so that's great for us because it allows us to focus more on the kind of like data fetching, the knowledge graph and model fine tuning, which we've also invested a bit in. [00:57:55]

Swyx: That's right. [00:57:55]

Steve: We've got multiple AI work streams in progress now because we hired a head of AI finally. We spent close to a year actually. I think I talked to probably 75 candidates. And the guy we hired, Rashab, is absolutely world-class. And he immediately started multiple work streams, including he's fine-tuned Starcoder already. He's got a prompt engineering work stream. He's got an embeddings work stream. He's got evaluation and experimentation. Benchmarking, wouldn't it be nice if Cody was on Hugging Face with a benchmark that we could just, anybody could say, well, we'll run against the benchmark or we'll make our own benchmark if we don't like yours. But we'll be forcing people into the sort of quantitative comparisons. And that's all happening under the AI program that he's building for us. [00:58:35]

Swyx: I should mention, by the way, I've heard that there's a V2 of Starcoder coming on. So you guys should talk to Hugging Face. Cool. Awesome. Great. I actually visited their offices in Paris, which is where I heard it. That's awesome. [00:58:47]

Steve: Can you guys believe how amazing it is that the open source models are competitive with GPT and Anthropic? I mean, it's nuts, right? I mean, that one Googler that was predicting that open source would catch up. At least he was right for completions. [00:59:03]

Beyang: Yeah. I mean, for completions, open source is state-of-the-art. [00:59:06]

Swyx: You were on OpenAI, then you went to Claude, and now you've ripped it up. Yeah. Yeah, for completions. [00:59:10]

Beyang: I mean, we still use Claude and GPT-4 for chat and also commands. Like, the ecosystem is going to continue to evolve. We obviously love the open source ecosystem and, like, huge shout out to Hugging Face. And also, like, Meta research. We love the work that they're doing and kind of driving the ecosystem forward. [00:59:26]

Swyx: Yeah, you didn't mention Code Llama. [00:59:27]

Beyang: We're not using Code Llama currently. It's always kind of like a constant evaluation process. So, like, I don't want to come out and say, like, hey, this model is the best because we chose it. Basically, like, we did a bunch of, like, tests for the sorts of, like, contexts that we're fetching now and given the way that our prompts constructed now. And at the end of the day, it was like a judgment call. Like, starcoder seemed to work the best, and that's why we adopted it. But it's sort of like a continual process of revisitation. Like, if someone comes up with, like, a neat new, like, context fetching mechanism, and we have a couple coming online soon, then it's always like, okay, let's try that against the array of models that are available and see how this moves the needle across that set. [01:00:01]

Swyx: Yeah. What do you wish someone else built? This is a request for startups. [01:00:04]

Beyang: I mean, if someone could just provide, like, a very nice, clean data set of both naturally occurring and synthetic code data. [01:00:15]

Steve: Yeah. Could someone please give us their data moat? [01:00:17]

Swyx: Well, not even the data mode. [01:00:19]

Beyang: It's just like, I feel like most models today, they still use, like, combination of, like, the stack and the pile as, like, their training corpus. But you can only stretch that so far. At some point, you need more data. And I think there's still more alpha in, like, synthetic data. Like, we have a couple of efforts where, like, we think fine tuning some models on specific coding tasks will yield more kind of, like, reliable code generation of the sort where it's, like, reliable enough that we can fully automate it, at least, like, the one hop thing. And synthetic data is playing a part of that. But, I mean, if there were, like, a synthetic data provider, I don't think you could construct a provider that has access to, like, some proprietary code base. Like, no company in the world would be able to, like, sell that to you. But, like, anyone who's just, like, providing clean data sets off of the publicly available data. That would be nice. I don't know if there's a business around that, but, like, that's something that we definitely, like, [01:01:09]

Swyx: love to use. [01:01:09]

Beyang: Oh, for sure. [01:01:10]

Steve: My God. I mean, but that's also, like, the secret weapon, right, for any AI, you know, is the data that you've curated. So I doubt people are going to be, oh, we'll see, you know. But we can maybe contribute, you know, if we want to have a benchmark of our own. [01:01:25]

Swyx: Yeah. I would say, like, that would be the bull case for Repl.it, that, like, you want to be a coding platform where you also offer bounties. Like, then you eventually bootstrap your own proprietary set of coding data. I don't think they'll ever share it. The rumor is, this is from nobody at Repl.it that I'm hearing, but, like, they're just not leveraging that actively. Like, they're actually just betting on OpenAI to do a lot of that, which banking on OpenAI, you know, has been a winning strategy so far. [01:01:50]

Beyang: Yeah, they're definitely great at executing. [01:01:55]

Steve: Executing their CEO. [01:01:56]

Swyx: And then bring him back in four days. Yeah. [01:02:01]

Steve: That was a whole, like... [01:02:03]

Swyx: It was a company, like, just obsessed by the drama. Like, we were unable to work. I just walked in after it happened, and this whole room in the new room was just like, everyone's just staring at their phones. [01:02:12]

Beyang: Yeah, it's a bit difficult to ignore. I mean, it would have real implications for us, too, because, like, we're using them. And so there's a very real question of, like, do we have to, like, do it quick? [01:02:21]

Swyx: Yeah, Microsoft. Like, you just move to Microsoft, right? [01:02:23]

Beyang: Yeah, I mean, that would have been, like, the break glass plan. If the worst case played out, then I think we'd have a lot of customers, you know, the day after being like, you know, how can you guarantee the reliability of your services if the company itself isn't stable? But I'm really happy they got things sorted out and things are stable now because, like, they build really cool stuff and we love using their tech. [01:02:43]

Swyx: Yeah, awesome. [01:02:44]

Alessio: So we kind of went through everything, right? Sourcegraph, Cody, why agents don't work, why inline completion is better, all of these things. How does that bubble up to who manages the people, right? Because as engineering managers, I didn't write much code. I was mostly helping people write their own code, you know, so even if you have the best inline completion, it doesn't help me do my job. [01:03:08]

Swyx: Yeah. [01:03:08]

Alessio: What's kind of the future of Sourcegraph in the engineering org? [01:03:13]

Beyang: That's a really interesting question. And I think it sort of gets at this, like, issue, which is basically, like, every AI DevTools creator or producer these days, I think us included, we're kind of, like, focusing on the wrong problem in a way. Because, like, the real problem of modern software development, I think, is not how quickly can you write more lines of code. It's really about managing the emergent complexity of codebases as they evolve and grow and how to make, like, efficient development tractable again. Because the bulk of your time becomes more about understanding how the system works and how the pieces fit together currently so that you can update it in a way that gets you your added functionality, doesn't break anything, and doesn't introduce a lot of additional complexity that will slow you down in the future. And if anything, like, the inner-loop developer tools that are all about, like, generating lines of code, yes, they help you get your feature done faster. They generate a lot of boilerplate for you. But they might make this problem of, like, managing large, complex codebases more challenging, just because instead of having, like, a pistol, you'll have a machine gun in terms of, like, being able to write code. And there's going to be a bunch of, like, natural language prompted code that is generated in the future that was produced by someone who doesn't even have an understanding of source code. And so, like, how are you going to verify the quality of that and make sure it not only checks the kind of, like, low-level boxes, but also fits architecturally in a way that's sensible into your codebase. And so I think as we look forward to the future of the next year, we have a lot of ideas around how to make codebases, as they evolve, more understandable and manageable to the people who really care about the codebase as a whole. You know, tech leads, engineering leaders, folks like that. It is kind of like a return to our ultimate mission at Sourcegraph, which is to make code accessible to all. It's not really about, you know, enabling people to write code. And if anything, like, the original version of Sourcegraph was a rejection of, like, hey, let's stop trying to build, like, the next best editor, because, like, there's already enough people doing that. The real problem that we're facing, I mean, Quinn, myself, and you, Steve, at Google, was like, how do we make sense of the code that exists so that we can understand enough to know what code needs to be written? Mm-hmm. [01:05:25]

Steve: Yeah. Well, I'll tell you what customers want, right? And what they're going to get. What they want is for Cody to have a monitor for developer productivity. And any developer who falls below a threshold, a button lights up where the admin can fire them. Or Cody will even press that button for you as time passes. But I'm kind of only half tongue-in-cheek here. We've got some prospects who are kind of, like, sniffing down that avenue. And we're like, no. But what they're going to get is a much greater whole code-based understanding, which is actually something that Cody is, I would argue, the best at today in the coding assistance space, right? Because of our search engine and the techniques that we're using. And that whole code-based understanding is so important, you know, for any sort of a manager who just wants to get a feel for the architecture or potential security vulnerabilities or whether, you know, people are writing code that's well-tested and et cetera, et cetera, right? And solving that problem is tricky, right? This is not the developer inner loop or outer loop. It's like the manager inner loop? [01:06:21]

Swyx: No, outer loop. [01:06:21]

Steve: The manager inner loop is staring at your belly button, I guess. So in any case... [01:06:27]

Beyang: Waiting for the next Slack message to arrive? [01:06:29]

Steve: Yes. What they really want is a batch mode for these assistants where you can actually take the coding assistant and shove its face into your code base, you know, and six billion lines of code later, right? It's told you all the security vulnerabilities. That's what they really actually want. It's insanely expensive proposition, right? You know, just the GPU costs, especially if you're doing it on a regular basis. So it's better to do it at the point the code enters the system. And so now we're starting to get into developer outer loop stuff. And I think that's where a lot of the... To your question, right? A lot of the admins and managers and so, you know, the decision makers, anybody who just like kind of isn't coding [01:07:03]

Swyx: but is involved, [01:07:03]

Steve: they're going to have a set of tools, right? [01:07:06]

Swyx: And a set of... [01:07:06]

Steve: Just like with CodeSearch today. Our CodeSearch actually serves that audience as well. The CIO types, right? Because they're just like, oh, hey, I want to see how we do, you know, SAML auth. And they use our search engine and they go find it. And AI is just going to make that so much easier for them. [01:07:20]

Swyx: Yeah, this is my perfect place to put my anecdote of how I used Cody yesterday. I was actually trying to build this sort of Twitter scraper thing. And Twitter is notoriously very challenging to work with because they don't want to work with you, with anyone. There's a repo that I wanted to inspect. It was really big that had the Twitter scraper thing in it. And I pulled it into Copilot, didn't work. But then I noticed that on your landing page, you had a web version. Like, I typically think of Cody as a VS Code extension, but you have a web version where you just plug in any repo in there and just talk to it. And that's what I used to figure it out. So yeah. [01:07:54]

Steve: Wow, Cody web is wild. [01:07:57]

Beyang: Yeah, I mean, we've done a very poor job of making the existence of that feature known. It's not easy to find. [01:08:02]

Swyx: It's not easy to find. You don't have to go through the search thing. It's like, oh, this is old Sourcegraph. You don't want to look at old Sourcegraph. I mean, you can use Sourcegraph, all the AI stuff. Old Sourcegraph has AI stuff and it's Cody web. Yeah, yeah. [01:08:13]

Beyang: There's a little ask Cody button that's hidden in the upper right-hand corner. We should make that more visible. It's definitely one of those aha moments when you can ask a question of Cody. Of any repo, right? [01:08:22]

Swyx: Because you already indexed it. Well, you didn't embed it, but you indexed it. Yeah. [01:08:26]

Beyang: And there's actually some use cases that have emerged among power users where they kind of do... You're familiar with v0.dev. You can kind of replicate that, but for arbitrary frameworks and libraries with Cody web. Because there's also an equally hidden toggle, which you may not have discovered yet, where you can actually tag in multiple repositories as context. [01:08:44]

Swyx: Yeah. [01:08:44]

Beyang: And so you can do things like, we have a demo path where it's like, okay, let's say you want to build a stock ticker [01:08:50]

Swyx: that's React-based, [01:08:50]

Beyang: but uses this one tick data fetching API. It's like you tag both repositories in, you ask it, it's like two sentences, like build a stock tick app, track the tick data of Bank of America, Wells Fargo over the past week, and then generates a code. You can paste that in and it just works magically. We'll probably invest in that more just because the wow factor of that is just pretty incredible. It's like, what if you can speak apps into existence that use the frameworks and packages that you want to use? Yeah. [01:09:19]

Swyx: It's not even fine-tuning. It's just taking advantage of your RAG pipeline. [01:09:22]

Beyang: Yeah. It's just RAG. RAG is all you need for many things. [01:09:25]

Steve: It's not just RAG. It's RAG, right? RAG's good. Not a fallback. [01:09:33]

Swyx: Yeah. [01:09:33]
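
For a sense of what "it's just RAG" means here, a minimal sketch of multi-repo retrieval-augmented generation. `search_repo` and `complete` are hypothetical stand-ins for whatever code search backend and LLM endpoint you have, not Cody's actual API:

```python
# Minimal multi-repo RAG sketch. `search_repo` and `complete` are hypothetical
# placeholders for a code search backend and an LLM endpoint.
from typing import Callable, List


def build_prompt(task: str, repos: List[str],
                 search_repo: Callable[[str, str, int], List[str]],
                 k: int = 5) -> str:
    """Fetch top-k snippets from each tagged repo and stuff them into one prompt."""
    context_blocks = []
    for repo in repos:
        for snippet in search_repo(repo, task, k):
            context_blocks.append(f"# From {repo}\n{snippet}")
    return (
        "You are a coding assistant. Use only the context below.\n\n"
        + "\n\n".join(context_blocks)
        + f"\n\nTask: {task}\n"
    )


def answer(task: str, repos: List[str],
           search_repo: Callable[[str, str, int], List[str]],
           complete: Callable[[str], str]) -> str:
    return complete(build_prompt(task, repos, search_repo))
```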

Beyang: But I guess getting back to the original question, I think there's a couple of things I think would be interesting for engineering leaders. One is the use case that you called out is all the stuff that you currently don't do that you really ought to be doing with respect to ensuring code quality or updating dependencies or keeping things up to date. The things that humans find toilsome and tedious and just don't want to do but would really help up-level the quality, security, and robustness of your code base, now we potentially have a way to do that with machines. I think there's also this other thing, and this gets back to the point of how do you measure developer productivity? It's the perennial age-old question. Every CFO in the world would love to do it in the same way that you can measure marketing or sales or other parts of the organization. And I think what is the actual way you would do this that is good? And if you had all the time in the world, I think as an engineering manager or an engineering leader, what you would do is you would go read through the Git log, maybe line by line, be like, you, Sean, these are the features that you built over the past six months or a year. These are the things that delivered that you helped drive. Here's the stuff that you did to help your teammates. Here are the reviews that you did that helped ensure that we maintain a coherent and a high-quality code base. Now connect that to the things that matter to the business. What were we trying to drive this? Was it engagement? Was it revenue? Was it adoption of some new product line? And really weave that story together. The work that you did had this impact on the metrics that moved the needle for the business and ultimately show up in revenue or stock price or whatever it is that's at the very top of any for-profit organization. And you could, in theory, do all that today if you had all the time in the world. [01:11:22]

Swyx: Yeah. [01:11:22]

Beyang: But as an engineering leader- It's a busy building. Yeah, you're too busy building, you're too busy with a bunch of other stuff. Plus it's also tedious. Reading through a Git log and trying to understand what a change does and summarizing that, it's not the most exciting work in the world. But with the benefit of AI, I think you could conceive of a system that actually does a lot of the tedium and helps you actually tell that story. And I think that is maybe the ultimate answer to how we get at developer productivity in a way that a CFO would be like, okay, I can buy that. The work that you did impacted these core metrics because these features were tied to those and therefore we can afford to invest more in this part of the organization. And that's what we really want to drive towards. That's what we've been trying to build all along in a way with Sourcegraph. It's this kind of code-based level of understanding and the availability of LLMs and AI now just puts that much sooner in reach, I think. [01:12:14]

Swyx: Yeah. [01:12:15]
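
As a rough sketch of the tedium-removal Beyang is describing, reading the Git log and turning it into a narrative, assuming some `summarize` LLM call you supply yourself (hypothetical, not a Sourcegraph feature):

```python
# Sketch: summarize one engineer's recent work from the git log with an LLM.
# `summarize` is a hypothetical LLM call; swap in whatever client you use.
import subprocess
from typing import Callable


def author_log(author: str, since: str = "6 months ago") -> str:
    """Collect commit subjects, dates, and diff stats for one author."""
    out = subprocess.run(
        ["git", "log", f"--author={author}", f"--since={since}",
         "--pretty=format:%h %ad %s", "--date=short", "--shortstat"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout


def productivity_report(author: str, summarize: Callable[[str], str]) -> str:
    prompt = (
        "Summarize this engineer's contributions over the period, grouped by "
        "feature or theme, and note refactors and reviews that helped teammates:\n\n"
        + author_log(author)[:20000]  # crude truncation to fit a context window
    )
    return summarize(prompt)
```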

Steve: But I mean, we have to focus also, small company, our short-term focus is lovability, right? [01:12:21]

Swyx: Yeah. [01:12:21]

Steve: We absolutely have to make Cody, like everybody wants it, right? [01:12:25]

Swyx: Absolutely. [01:12:26]

Steve: Sourcegraph is all about enabling non-engineering roles, decision makers and so on. As Beyang says, I mean, I think there's just a lot of opportunity there once we've built a lovable Cody. [01:12:37]

Swyx: Awesome. [01:12:37]

Alessio: We want to jump into lightning round? [01:12:40]

Swyx: Lightning round. [01:12:40]

Alessio: Okay. [01:12:41]

Swyx: So we usually have three, [01:12:42]

Alessio: one around acceleration, exploration, and then a final takeaway. So the acceleration one is what's something that already happened in AI that is possible today that you thought would take much longer? [01:12:54]

Beyang: I mean, just LLMs and how good the vision models are now. Like I got my start. Okay. [01:13:00]

Swyx: Yeah. [01:13:00]

Beyang: Back in the day, I got my start machine learning in computer vision, but circa like 2009, 2010. [01:13:07]

Swyx: And in those days, [01:13:07]

Beyang: everything was like statistical based. Neural nets had not yet made their comeback. And so nothing really worked. And so I was very bearish after that experience on the future of computer vision. But like, man, the progress that's been made just in the past, like three, four years has just been absolutely astounding. Came up faster than I expected it to. Yeah. [01:13:27]

Steve: Multimodal in general, [01:13:28]

Swyx: I think is, [01:13:28]

Steve: I think there's a lot more capability there that we're not tapping into. Potentially even in the coding assistant space. You know, honestly, I think that the form factor that coding assistants have today is probably not the steady state that we're seeing, you know, long-term. You'll always have completions and you always have chat and commands and so on. But I think we're going to discover a lot more. And I think multimodal potentially opens up some kind of new ways to, you know, get your stuff done. So yeah, I think the capabilities are there today. And they're just, it's just shocking. I mean, like, I still am astonished when I sit down, you know, and I have a conversation with the LLM, with the context, and it's like, I'm talking to a, you know, a senior engineer or an architect or somebody, right? I think that people have very different working models with these assistants today. You know, some people are just completion, completion, completion, that's it. And if they want some code generated, they write a comment and then, you know what I mean? Telling them what to do. But I truly think that there are other modalities that we're going to stumble across. Just kind of latently, you know, inherently built into the LLMs today that we just haven't found them yet. They're more of a discovery than invention, you know? [01:14:31]

Swyx: Like other usage patterns? [01:14:34]

Steve: Absolutely. I mean, the one that we talked about earlier, nonstop coding is one, right? Where you could just kick off a whole bunch of, you know, requests to refactor and so on. But, you know, there could be any number of others. You know, we talk about agents, you know, that's kind of out there. But I think there are kind of more inner loop type ones to be found. And we haven't looked at all at multimodal yet. [01:14:52]

Swyx: Yeah, for sure. Like there's two that come to mind, just off the top of my head. One, which is effectively architecture diagrams and entity relationship diagrams. There's probably more alpha in like synthesizing them for management to see. Ooh, yeah. Which is like, you don't need AI for that. You can just use your reference graph. Yeah. But then also doing it the other way around when like someone draws stuff on a whiteboard and actually generating code. [01:15:14]

Steve: Well, you can generate the diagram and then, you know, explanations as well. [01:15:18]

Swyx: Yeah. And then the other one is, there was a demo that went pretty viral like two, three weeks ago about how someone just had an always on script, just screenshotting and sending it to GPT Vision on some kind of time interval. And it would just autonomously suggest stuff. Yeah. So like no trigger, just watching your screen and just like being a real co-pilot rather than having you initiate with a chat. Yeah. [01:15:39]

Beyang: It's like the return of Clippy, right? But actually good. [01:15:42]

Swyx: The reason I know this is that we actually did a hackathon where we wrote that project, but it roasted you while you did it. So it's like, hey, you're on Twitter right now. You should be coding. Yeah. That can be a fun co-pilot thing as well. Yeah, yeah. Okay. So I'll jump on. Exploration. What do you think is the most interesting unsolved question in AI? I mean, I think- [01:16:01]

Steve: It used to be scaling, right? With CNNs and RNNs, and Transformers solved that. Yeah. So what's the next big hurdle? It's keeping GPT-10 from emerging. [01:16:09]

Beyang: I mean, do you mean that like- Oh, is this like a safetyist argument? I feel like, do you mean like the pure model, like AI layer or- [01:16:17]

Swyx: No, it doesn't have to be. [01:16:18]

Beyang: For me personally, it's like, how do you get reliable, like first try working code generation? Even like the single hop, like write a function that does this. Because I think like if you want to get to the point where you can actually be truly agentic or like multi-step automated, a necessary part of that is like the single step has to be robust and reliable. And so I think that's the problem that we're focused on solving right now. Because once you have that, it's a building block that you can then compose into longer chains. [01:16:47]

Alessio: And just to wrap things up, what's one message takeaway that you want people to remember and think about? I mean, I think for me, [01:16:55]

Beyang: it's just like the best dev tools in the future are going to have to leverage many different forms of intelligence. You know, calling back to that like Normsky architecture, trying to make that catch on. [01:17:06]

Swyx: You should have called it something cool, like S star or R star. [01:17:09]

Beyang: Yes, yes, yes. [01:17:10]

Swyx: Just one letter and then just let people speculate. Yeah, yeah. What could he mean? [01:17:14]

Beyang: I don't know, like in terms of like trying to describe what we're building, we try to be a little bit more like down to earth and like straightforward. And I think like Normsky kind of like encapsulates like the two big technology areas that we're investing in that we think will be very important for producing really good dev tools. And I think it's a big differentiator that we view that Cody has right now. [01:17:35]

Steve: Yeah, and mine would be, I know for a fact that not all developers today are using coding assistants. Yeah, and that's probably because they tried it and it didn't, you know, immediately write a bunch of beautiful code for them and they were like, oh, too much effort and they left, right? Well, my big takeaway from this talk would be if you're one of those engineers, you better start like planning another career, okay? Because this stuff is in the future and honestly, it takes some effort to actually make coding assistants work today, right? You have to, you know, just like talking to GPT, they'll give you the runaround, just like doing a Google search sometimes. But if you're not putting that effort in and learning the sort of footprint, and the characteristics of how LLMs behave under different query conditions and so on, if you're not getting a feel for the coding assistant, then you're letting this whole train just like pull out of the station and leave you behind. [01:18:26]

Swyx: Yeah, absolutely. [01:18:28]

Alessio: Yeah, thank you guys so much for coming on and being the first guest in the new studio. [01:18:32]

Swyx: Our pleasure. [01:18:34]



Get full access to Latent Space at www.latent.space/subscribe
2023-12-14
Link to episode

The Busy Person's Intro to Finetuning & Open Source AI - Wing Lian, Axolotl

The Latent Space crew will be at NeurIPS on Tuesday! Reach out with any parties and papers of interest. We have also been incubating a smol daily AI Newsletter and Latent Space University is making progress.

Good open models like Llama 2 and Mistral 7B (whose maker Mistral AI has just released an 8x7B MoE model) have enabled their own sub-industry of finetuned variants for a myriad of reasons:

* Ownership & Control - you take responsibility for serving the models

* Privacy - not having to send data to a third party vendor

* Customization - Improving some attribute (censorship, multiturn chat and chain of thought, roleplaying) or benchmark performance (without cheating)

Related to improving benchmark performance is the ability to use smaller (7B, 13B) models, by matching the performance of larger models, which have both cost and inference latency benefits.

Core to all this work is finetuning, and the emergent finetuning library of choice has been Wing Lian's Axolotl.

Axolotl

Axolotl is an LLM fine-tuner supporting SotA techniques and optimizations for a variety of common model architectures:
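
To get a feel for how you drive it, here is a sketch of the kind of YAML config Axolotl consumes, written as a Python dict and dumped to disk. The field names approximate the project's schema from memory and may not match the current version exactly, so treat it as illustrative rather than copy-pasteable:

```python
# Illustrative Axolotl-style QLoRA config. Key names are approximate; check the
# examples in the Axolotl repo for the real schema.
import yaml  # pip install pyyaml

config = {
    "base_model": "mistralai/Mistral-7B-v0.1",
    "load_in_4bit": True,
    "adapter": "qlora",
    "lora_r": 32,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "sequence_len": 4096,
    "datasets": [{"path": "teknium/openhermes", "type": "alpaca"}],
    "micro_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "num_epochs": 3,
    "learning_rate": 2.0e-4,
    "output_dir": "./outputs/mistral-qlora",
}

with open("qlora.yml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
# Then, roughly: accelerate launch -m axolotl.cli.train qlora.yml
```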

It is used by many of the leading open source models:

* Teknium: OpenHermes, Trismegistus, CollectiveCognition

* OpenOrca: Mistral-OpenOrca, Mistral-SlimOrca

* Nous Research: Puffin, Capybara, NousHermes

* Pygmalion: Mythalion, Pygmalion

* Eric Hartford: Dolphin, Samantha

* DiscoResearch: DiscoLM 120B & 70B

* OpenAccess AI Collective: Manticore, Minotaur, Jackalope, Hippogriff

As finetuning is very formatting dependent, it also provides prompt interfaces and formatters between a range of popular model formats from Stanford's Alpaca and Steven Tey's ShareGPT (which led to Vicuna) to the more NSFW Pygmalion community.
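
To make the formatting problem concrete, a small sketch of converting a ShareGPT-style conversation record into an Alpaca-style prompt/completion pair. The templates are simplified; Axolotl's own formatters handle system turns, multi-turn chats, special tokens, and many more formats:

```python
# Sketch: ShareGPT-style record -> Alpaca-style prompt/completion pair.
# Simplified: only the first human/gpt exchange, no system prompt handling.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def sharegpt_to_alpaca(record: dict) -> dict:
    convo = record["conversations"]
    human = next(t["value"] for t in convo if t["from"] == "human")
    gpt = next(t["value"] for t in convo if t["from"] == "gpt")
    return {"prompt": ALPACA_TEMPLATE.format(instruction=human), "completion": gpt}


example = {"conversations": [
    {"from": "human", "value": "Explain what a LoRA adapter is."},
    {"from": "gpt", "value": "A LoRA adapter adds small low-rank weight updates..."},
]}
print(sharegpt_to_alpaca(example)["prompt"])
```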

Nous Research Meetup

We last talked about Nous at the DevDay Recap at the e/acc "banger rave". We met Wing at the Nous Research meetup at the a16z offices in San Francisco, where they officially announced their company and future plans:

Including Nous Forge:

Show Notes

We've already covered the nuances of Dataset Contamination and the problems with "Open Source" in AI, so we won't rehash those topics here, but do read/listen to those if you missed them.

* Axolotl GitHub and Discord

* The Flan paper and dataset

* StackLlama model and blogpost

* Multipack paper

* Our episode with Tri Dao

* Mamba state space models - Tri Dao and Albert Gu

Timestamps

* [00:00:00] Introducing Wing

* [00:02:34] SF Open Source AI Meetup

* [00:04:09] What is Axolotl?

* [00:08:01] What is finetuning?

* [00:08:52] Open Source Model Zoo

* [00:10:53] Benchmarks and Contamination

* [00:14:29] The Case for Open Source AI

* [00:17:34] Orca and OpenOrca

* [00:23:36] DiscoLM and Model Stacking

* [00:25:07] Datasets and Evals over Models

* [00:29:15] Distilling from GPT4

* [00:33:31] Finetuning - LoRA, QLoRA, ReLoRA, GPTQ

* [00:41:55] Axolotl vs HF Transformers

* [00:48:00] 20x efficiency with StackLlama and Multipack

* [00:54:47] Tri Dao and Mamba

* [00:59:08] Roadmap for Axolotl

* [01:01:20] The Open Source AI Community

Transcript

[00:00:00] Introducing Wing Lian

[00:00:00] ?

[00:00:00] swyx: Welcome to Latent Space, a special edition with Wing Lian, but also with our new guest host, Alex. Hello, hello. Welcome, welcome. Again, needs no introduction. I think it's like your sixth time on Latent Space already. I think so, yeah. And welcome, Wing. We just met, but you've been very prolific online. Thanks for having me.

[00:00:30] Yeah. So you are in town. You're not local. You're in town. You're from Minneapolis?

[00:00:35] Wing Lian: Annapolis. Annapolis. It's funny because a lot of people think it's Indianapolis. I've gotten Minneapolis too, but I used to live out in the San Francisco Bay Area years ago, from like 2008 to 2014. So it's fairly familiar here.

[00:00:50] swyx: Yep. You're the maintainer of Axolotl now, which we'll get into. You're very, very prolific in the open source AI community, and you're also the founder of the Open Access AI Collective. Yeah. Cool. Awesome. Maybe we can go over a little bit of your backgrounds into tech and then coming into AI, and then we'll cover what

[00:01:06] Wing Lian: happens and why you're here.

[00:01:08] Yeah. So. Back on tech, so I started years ago, I started way back when I was scraping, Apartment websites for listings and then, and then building like SEO optimized pages and then just throwing Google AdSense on it.

[00:01:24] And that got me through like college basically. Is

[00:01:27] swyx: that decent money? And what year

[00:01:28] Wing Lian: was this? Like 2004, 2005. Yeah, that's decent money. It's like a thousand bucks a month. But as a college student, that's like gravy. Really good money, right? And then there was just too much competition and it just sort of died off. I was writing stuff in, like, Perl back then, which nobody hosts anything on anymore, right? Then I did a little bit more computer tech support, and then software and web more professionally.

[00:01:54] So I spent some time working on applications in the blood industry. I came out to San Francisco for, I was at SGN, so Social Gaming Network, as a startup. They started out doing Facebook apps, and then they pivoted into doing mobile apps. And then, from there, I spent time.

[00:02:14] I've been at quite a few more startups since then, and in the last few years I've been in the music space. So like I was at United Masters for a while, and then this past year I've been at SoundCloud, but I'm not doing that anymore, and now that I have a lot more time it's just like, all right.

[00:02:30] We're going full bore on Axolotl and we're gonna, we're gonna crush AI. So yeah,

[00:02:34] SF Open Source AI Meetup

[00:02:34] swyx: Totally. So you're here in town for the open source AI meetup that we had yesterday. Yep, yeah, that was amazing. Yeah, it was a big collection. Ollama, Nous Research, Alignment Lab, anyone else that I missed? I mean, Jeremy Howard is his own thing.

[00:02:47] Yeah.

[00:02:49] And Alex, you're also there. You love to bring SF to the world. Your takes?

[00:02:55] Alex Volkov: It's incredible that we recorded a ThursdAI episode after that one. And LDJ, who usually co-hosts ThursdAI, just like briefly mentioned, Oh yeah, I talked about it.

[00:03:04] Like, I saw Karpathy, and then I talked to Jeremy Howard, and the guy from Mistral came in, and it's like, He's talking about all these, titans of industry, basically, that outside of SF, You just don't meet casually hanging out in the same space. You can't, pull somebody. He ran into the Laylow from Mistral, he ran into him while, drinking water.

[00:03:20] He didn't even know he was there. It's just, that type of stuff is really hard to find outside of SF. So, absolutely, absolutely great. And also, presentations from Alignment Lab, presentations from Nous Research, who talked about Forge, and some of

[00:03:33] swyx: the other stuff they announced. We can say now they're officially a company.

[00:03:36] I met Teknium.

[00:03:37] He

[00:03:37] Alex Volkov: came over here. He didn't want to get recorded. But maybe.

[00:03:41] Wing Lian: We'll wear him down at some point. Yeah, I'm excited for Forge. They've positioned it as this agentic sort of framework where it's just Drag and drop things and, fill in text with where you want to inject different variables and it opens up all of these potentials for data pipelines now, right?

[00:03:56] And using your own local LLMs and not relying on GPT 4 or anything like that. Yeah, yeah,

[00:04:02] swyx: good stuff. Okay, so let's maybe go into the Axolotl origin story and then we have, we have some intro or background.

[00:04:09] What is Axolotl?

[00:04:09] swyx: To do on like the open source model universe and also on fine tuning, but maybe just, since you're talking about your personal journey, what was your personal journey into

[00:04:18] Wing Lian: axolotl?

[00:04:19] Yeah, so my personal journey started like back in mid March, completely unrelated to AI and Axolotl. And it really started, I fell while skiing and I torqued my knee. Grade 3 MCL sprain. And being sort of like an active person that could no longer be active, couldn't play soccer, because that requires having working knees until it healed.

[00:04:42] So I decided I needed to find something to do to take up my free time. And that became, well, let's learn how to train these language models. It was everywhere. So I was like, all right, I'm just going to sit down and learn. I think I was using like Alpaca-LoRA.

[00:05:00] Cause I think the Alpaca paper had just come out then. So I was using the Alpaca-LoRA repo and sort of like learning how to use it. None of us were like GPU rich back then, and most of us are still all GPU poor, but I was doing, what was it, like 4-bit Alpaca-LoRA, there was like a 4-bit version where we were doing, no, 8-bit quantizations, and then I think they had released QLoRA a little bit later, and I think right before QLoRA came out, I was already starting to do fine tunes, but having this need to sort of like mix data sets together. And if you've ever looked at all the various different datasets available on HuggingFace, they all have various different prompt formats, and it's sort of a nightmare. And then I think the other piece is if you've ever tried to fine tune, at least back then, probably the ecosystem's a little better now.

[00:05:54] Everybody required that you say, alright, you put your hyperparameters as command line arguments. And so it's always like, well, I now have to go copy and paste my previous thing and to change things out. And I really wanted it. to be in a YAML file because it was more portable and reproducible.

[00:06:09] So I was doing that and then the QLoRA paper came out. Tim Dettmers announced that, and then somebody looked it up for me yesterday, and from that announcement it took us seven days to get that integrated into Axolotl, right? Which is, I wouldn't say it's really fast, but doing it in a reusable framework, I think it was quite the accomplishment then.

[00:06:33] And so we started, picking up traction with people there. And then it's just been building models, and then just iterating what my needs are. So, yeah. Excellent. Yeah. I

[00:06:44] Alex Volkov: want to ask, for folks who are listening who never heard of Axolotl, now do you describe how you got there?

[00:06:49] Can you, how do you summarize this for folks who maybe haven't fine tuned anything? They know open source LLMs exist, they maybe know like LLaMA. What's Axolotl for somebody who doesn't know? I've never heard of a data set curation

[00:07:01] Wing Lian: creation before. We sort of have to take a step back and understand that, when you've got these language models, you have what I think most people refer to as like base models, also known as like foundational models, right?

[00:07:15] Where some benefactor, whether it's Meta or Mistral or whoever, has gone and spent all this money. To train these models on huge corpuses of text, right? And these, these corpuses, they're generally good across lots of different things, but they're really good at just saying, talking on and on and on, but they're not good at, following instructions or having chats or anything like that.

[00:07:40] So, when you think about fine tuning, it's like Saying, all right, we have this really sort of good generalized, text completion thing, and I want to turn it into something that I can talk to or have, follow instructions. So, I think fine tuning is probably best defined in like that.

[00:07:58] swyx: Okay, got it.

[00:07:59] And we actually

[00:08:01] What is finetuning?

[00:08:01] swyx: Do want to make sure that we have like an overall introduction to fine tuning for people because again like trying to make sure that we bring everyone along in this, in this journey. We already went into Loras and QLoras without explaining what

[00:08:12] Wing Lian: they are. Oh yes, yes, sorry.

[00:08:14] swyx: And so I will put things in my words and you can correct me as, as, as my I'll be the village idiot here.

[00:08:21] So, so fine tuning is basically sort of grabbing an open source model off the shelf, and then basically doing further training on it with a custom dataset of your own. Primarily, people use it, think about it as fine tuning for JSON output, or fine tuning for a style of response. Let's say you wanted to tell jokes, or be funny, or be short, or whatever.

[00:08:43] Just the open source AI community has really fine tuned in all sorts of different manners. Let's go over those things now, and then we'll talk about fine tuning methods.
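
For readers who want the mechanical picture behind that definition, a minimal LoRA fine-tuning sketch with Hugging Face transformers and peft. The model name and the one-row toy dataset are placeholders; real runs need real data, packing, eval splits, and hyperparameter sweeps:

```python
# Minimal LoRA fine-tune sketch (toy data, placeholder model name).
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "mistralai/Mistral-7B-v0.1"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Wrap the base model with small trainable LoRA adapters instead of updating
# all of the base weights.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

# Toy instruction-style dataset; in practice this is where curation matters.
rows = [{"text": "### Instruction:\nTell a short joke.\n### Response:\nWhy did the GPU cross the road? To get to the other inference."}]
ds = Dataset.from_list(rows).map(
    lambda r: tokenizer(r["text"], truncation=True, max_length=512))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=1, learning_rate=2e-4),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```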

[00:08:52] Open Source Model Zoo

[00:08:52] swyx: So there's a universe of people who fine tune stuff. Yesterday in your slides, you had, I'll just list some of these and then we'll maybe go through some of them, right?

[00:08:59] So Teknium is personally leading OpenHermes, which is I think the sort of premier model out of the Nous community. There's OpenOrca, which you had a hand in. Nous, the Nous Research itself, also has Capybara and Puffin and all the others. There's Pygmalion, which I've never messed with.

[00:09:14] Eric Hartford, I am aware of his Uncensored Models and his Samantha Models. Disco Research with Disco LM. And then you personally have done Manticore, Minotaur, Jackalope, and Hippogriff. What should people know about all these names? Being part of AI Twitter is seeing all these things and going dude, I'm being DDoS'ed by all these things and I don't know how different they are.

[00:09:32] What should people know? Yeah, so

[00:09:34] Wing Lian: I think on a lot of these models, generally, we like to think of those as sort of general models, so if you think about it, what is GPT 4, what is ChatGPT? It's a good general model, and then, one of the services I think that OpenAI offers is like these fine tunings where you're a business and you have very specific business use cases and you might fine tune for that use case.

[00:10:00] All of these models are really just general use case that you can then go and maybe Fine tune another lore over it for your use cases, but they tend to be good. With good being relative, it's open source. Open source AI is still sort of is infancy. So, good is, it's pretty reasonable.

[00:10:18] It's probably still better than most, high schoolers at answering questions and being able to like figure things out and, and reasoning skills and math and those sorts of things, right?

[00:10:27] swyx: And also as measured on the Hugging

[00:10:29] Wing Lian: Face leaderboard. Yes, well, that's like a whole other discussion, right, there's a whole other, group of people who, and I, I mostly agree with them that, benchmarks can be, are pretty bogus these days. LMSys, I think they published something recently where, even if you think the dataset's not contaminated, you can go and, find contamination. And maybe we should step back and say what contamination is, right?

[00:10:53] Benchmarks and Contamination

[00:10:53] Wing Lian: So we have all of these data, when you go and do these benchmarks, there's a specific data set where there are these questions and usually it's multiple choice. And what can happen is, well, sometimes someone puts the question, maybe maliciously, maybe accidentally, into the training dataset, and now your model knows how to answer the test questions really well, but it doesn't, it hasn't generalized the ability to actually do that

[00:11:20] Alex Volkov: right.

[00:11:21] We've seen some folks competitively announce models that are like the best at that leaderboard, but then it's, it's quite obvious that, In open source? Yeah, and in that leaderboard, for Hugging Face specific, I don't know if LMSys, if that had suffered, but we, there's been some models that seem to have been competitively trained and some leakage happened into their,

[00:11:41] swyx: like, supposal.

[00:11:43] I understand, once there's been a credible assertion, Hugging Face actually does take them down, right? Yeah, yeah,

[00:11:48] Alex Volkov: which is really hard to know, right?

[00:11:50] swyx: It's really hard to know, sometimes it's like a pure accident,

[00:11:52] Alex Volkov: it's oh, oops, you're going through a mixer. I think a responsible acknowledgement that this kind of happened to you is also important.

[00:11:58] I saw LDJ from Nous Research acknowledge that. Because many of these datasets are collections of other datasets. There's a bunch of people baking, basically. It's alchemy, right. And so sometimes you don't know. Sometimes you pull an open source dataset and they announce, oh, you know what, actually, the MMLU benchmark which we used to specifically identify models did go into this data set, that then went into that data set.

[00:12:22] So sometimes it's actually an accident and folks take it down. But I've seen some competitive folks who want to put their name out there because people are starting to notice which is the top

[00:12:30] swyx: model. For those who want a fun take on this, so the Phi-1 dataset, the Phi-1 model from Microsoft was accused of being contaminated.

[00:12:37] And I saw this joke paper that was fantastic. It was called, training on the test set is all you need. It's a super small model that just memorizes everything. It was fantastic. So yeah, contamination, I think we've actually covered it in a previous episode before. So we're good. But again, I want to give people a map into the open source AI model, the universe.

[00:12:57] And Alex, you can also jump in here because you guys have spent a lot more time with them than I have. So, what should people know about Technium? What should people know about Noose? And then we can go down the list. Yeah,

[00:13:05] Wing Lian: I think so. I think if we start with Teknium. When you talk to him, he's gonna say, I think, I think his response is that he wants to build GPT-4 on his laptop, right?

[00:13:14] So, very, very good at building general models. I think with Nous, Nous Research, they're looking at more, sort of, more, more research focused things, like their YaRN models, I don't, I don't, they didn't actually train their, they have their own trainer for their YaRN models, but So they did not use Axolotl for that one?

[00:13:30] They didn't use that, but like Is that, you don't have support for it? I think we do support YaRN, I think, I'd have to double check that answer. Yeah, I'm just kind of curious what you can and cannot support, and Yeah, I mean, YaRN is supportable, it's basically, I think it's just replacing, I think, the RoPE part of that, so Yeah, not, not a big deal.

[00:13:48] Yeah, it's not a big deal, it's just I haven't gotten to it, not enough people have asked, I think a lot of people have asked for other things, so it's just, squeaky wheel, right? I think at the end of the day, people are like building these data sets and I think if you sort of map things chronologically, these make more sense because it's like, how do we incrementally improve all of these models?

[00:14:07] So a lot of these models are just incremental improvements over the last thing, right? Whether it is sort of through methods of how do we, how did we curate the data set? How did we improve the quality of the data set? So, you maybe LDJ talked about it right on I think for, for Capybara and Puffin, like how those, those were very specific dataset curation techniques that he works on.

[00:14:29] The Case for Open Source AI

[00:14:29] Alex Volkov: So there's, folks are doing this for dataset curation. Folks are doing this for skillset building as well. Definitely people understand that open source is like very important, especially after the, the, the, the, the march, the debacle, the OpenAI weekend that we all had. And people started noticing that even after developer day in OpenAI, the APIs went out.

[00:14:48] And then after that, the whole leadership of the company swiftly changed and people, there were worries about, you know, how can people continue building AI products based on these like shaky grounds. That turned attention definitely to Teknium, at least in OpenHermes. I started seeing this more and more on Twitter, but also other models. And many companies, they're gonna start with OpenAI just to get there quick, and then they think about, okay, maybe I don't want to share my knowledge.

[00:15:13] Maybe I don't want to sign up for Microsoft. Maybe they will change their terms and conditions so What else is out there? They turned to other companies. Up until yesterday, Google was nowhere to be found. We've talked about Gemini a little bit before in a previous And you can tune in

[00:15:26] swyx: to

[00:15:26] Alex Volkov: ThursdAI.

[00:15:26] Yeah, you can tune in to ThursdAI. We covered the Gemini release a little bit. But many are turning to the open source community and seeing that Meta released and continues to release and commit to open source AI. Mistral came out and the model is way smaller than LLaMA and performs significantly better.

[00:15:43] People play with OpenHermes, which is currently Teknium-based, Nous Research-sourced, Axolotl-trained OpenHermes, I assume, right? And then they play with this and they see that, okay, this is like GPT 3.5 quality. We had GPT 3.5's birthday just a week ago. A year ago, we never interacted with these models of this caliber.

[00:16:04] And now there's one open source, one that's on my laptop, completely offline, that I can continue improving for my use cases. So enterprises, companies are also noticing this. And the open source community folks are building the skill set, not only the data sets. They're building the actual kind of, here's how we're going to do this, with Axolotl, with these data sets.

[00:16:21] The curation pieces. Now, interesting, there's like recipes of curation. The actual model training is kind of a competitive thing where people go and compete on these leaderboards that we talked about, the LMSys arena, which recently added OpenHermes and recently added OpenChat and a bunch of other stuff that are super cool.

[00:16:37] The Hugging Face open source leaderboard. And so there's a competitive aspect to this. There's the open source aspect to this, like Teknium says, I want GPT-4 on my laptop. There's the, let me build a skill set that potentially turns into a company, like we saw with Nous. Nous just started organizing a bunch of people on Discord, and suddenly they're announcing their company.

[00:16:54] It's happening across all these modalities, and suddenly all these people who saw these green pastures and a fairly quick way to, hey, here's a cool online community I can, start doing cool stuff with. You mentioned the same in the beginning, right? Like, after your accident, what's cool, let me try this out.

[00:17:08] Suddenly I start noticing that there's a significant movement of interest from enterprises and companies into these areas. And this skill set, these data sets, and this community is now very, very important, important enough to create an event which pulls in Andrej Karpathy from OpenAI to come and see what's new, Jeremy Howard, like the event that we just talked about, people are flying over and this is just a meetup.

[00:17:28] So, definitely, the community is buzzing right now and I think Axolotl is a big piece as well.

[00:17:34] Orca and OpenOrca

[00:17:34] Wing Lian: Cool. Maybe we can talk about like Orca real quick, Orca, OpenOrca rather, I think there was a lot of buzz when, the first Orca paper came out. And just briefly, what is Orca? Yeah, Orca was basically having traces of like chain of thought reasoning, right?

[00:17:48] So they go and they, they distill sort of GPT 4. They take, they take a sampling of data from the Flan dataset. Maybe we can like add some show notes in the Flan dataset. Yeah, but we've covered it. Okay, cool. Use GPT 4 to say, all right, explain this in a step by step reasoning, right?

[00:18:06] And then you take that and you, they train the model and it showed, very good improvements across a lot of benchmarks. So OpenOrca was sort of the open reproduction of that since Microsoft Research never released that particular data set. And going back to sort of the Hugging Face leaderboard thing, those models did really well.

[00:18:23] And then I think, so sort of the follow up to that was SlimOrca, right? I think going into and building the OpenOrca dataset, we never really went in and validated the actual answers that GPT-4 gave us. So what we did was, someone from OpenChat actually cross referenced the original Flan response, the human responses, the correct answers, with the dataset, and then I went and took it and sent both of them to GPT-4 and said, is this answer mostly correct, right?

[00:18:54] Yeah. And then we were able to filter the dataset from, At least of the GPT 4 only answers from like 800, 000 to like 500, 000 answers or rows and then, and then retrain the model and it had the same performance as the original model to within I think, 0. 1 percent here about, and 30 percent less data.

[00:19:13] So, yeah. Okay.
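
A rough sketch of that SlimOrca-style filtering step: cross-check each distilled GPT-4 answer against the original FLAN reference with a judge model and keep only the rows that pass. The judge prompt and client usage here are illustrative, not the exact pipeline they used:

```python
# Sketch: keep only rows whose GPT-4 answer a judge model deems mostly correct.
# Prompt wording and model name are illustrative.
from openai import OpenAI

client = OpenAI()


def judged_correct(question: str, reference: str, candidate: str,
                   judge_model: str = "gpt-4") -> bool:
    resp = client.chat.completions.create(
        model=judge_model,
        temperature=0,
        messages=[{"role": "user", "content": (
            f"Question:\n{question}\n\n"
            f"Reference answer:\n{reference}\n\n"
            f"Candidate answer:\n{candidate}\n\n"
            "Is the candidate answer mostly correct? Reply YES or NO."
        )}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")


def filter_rows(rows):
    """rows: iterable of dicts with 'question', 'reference', 'gpt4_answer' keys."""
    return [r for r in rows
            if judged_correct(r["question"], r["reference"], r["gpt4_answer"])]
```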

[00:19:15] swyx: Interesting. So, I mean, there's, there's so much there that I want to highlight, but yeah. Orca is interesting. I do want people to know about it. Putting chain of thought into the data set like it's just makes a ton of sense one thing I think it would be helpful for people to scope thing these things out is how much data are we talking about when when you When people are fine tuning and then how much time or resources or money does it take to train to fine

[00:19:36] Wing Lian: tune?

[00:19:37] Yeah, so I think there's a little bit of overlap there with sort of like fine tuning techniques, but let's say Orca and I think even Hermes, they're both relatively large data sets like 10 billion tokens. Yeah. So large data sets being or the original Orca was, or the original open Orca was 800,000 rows.

[00:19:55] I believe it was somewhere in the ballpark of like a gigabyte of data, of gigabyte, of text data. And I, I don't. I believe, Hermes was, is like a quarter million rows of data, I don't know the actual byte size on that particular one. So, going and training a, let's, let's say everybody's training 7 billion Mistral right now, right?

[00:20:15] So, to tri I, I believe to fine tune 7 billion Mistral on, let's say, 8 A6000s, which have 48 gigabytes of VRAM, I believe, It takes about 40 hours, so 40, and then that's, depending on where you get your compute, 40 times 6, so it's like 500 to fine tune that model, so, and, and that's assuming you get it right the first time, right?

[00:20:44] So, you know.
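
For the back-of-the-envelope version of that cost estimate, assuming roughly $1.50 per A6000-hour (rental prices vary a lot, so this is an assumption, not a quote):

```python
# Rough fine-tuning cost arithmetic using the numbers from the conversation.
gpus = 8                    # A6000s, 48 GB each
hours = 40                  # one full fine-tuning run on a 7B model
price_per_gpu_hour = 1.50   # assumed rental price, not from the transcript

print(f"~${gpus * hours * price_per_gpu_hour:.0f} per run")  # ~$480, i.e. roughly $500
# ...and that's per attempt; a few hyperparameter retries multiply it.
```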

[00:20:45] swyx: Is, is that something that Axolotl handles, like, getting it right the first

[00:20:48] Wing Lian: time? If you talk to anybody, it's like you've probably tried at least three or four runs or experiments to like find the right hyperparameters. And after a while you sort of have a feel for like which, where you need your hyperparameters to be.

[00:21:04] Usually you might do like a partial training run, do some benchmark. So I guess for Al Farouk, whether you're going by his. This is Jeremy, he's, his actual name, or his twitter handle. He released the Dharma dataset, which is basically a subset of all the benchmarks. And Axolotl actually supports, you know taking that subset and then just running many benchmarks across your model every time you're doing an evaluation so you can sort of like see sort of relative it's not going to be the actual benchmark score, but you can get ideas alright, is this benchmark improving, is this benchmark decreasing, based on, you know Wait,

[00:21:39] swyx: why don't you run the full benchmark?

[00:21:41] What, what, what The

[00:21:42] Wing Lian: full benchmarks take Take a long time. Significant, yeah, significant amount of time. Yeah. And Okay, so that's like

[00:21:48] swyx: mini MMLU. Yeah. Like,

[00:21:49] Wing Lian: mini BigBench or whatever. Yep, exactly.

[00:21:51] Alex Volkov: It's really cool. We, when I joined Weights & Biases just recently, one of the things that I try to do is, hey, I'm not, I'm a software engineer by trade, I don't have an MLE background, but I joined a company that does primarily MLE, and I wanted to learn from the community, because a lot of the open source community, they use Weights & Biases. And the benchmark that you said that Pharrell did, remind me of the name, sorry.

[00:22:13] Dharma? Dharma, yeah, yeah. So Luigi showed me how Dharma shows inside the dashboard, in the Weights & Biases dashboard, and so you can actually kinda see the trending run, and then you can see per each kind of iteration, or epoch, you can see the model improving, trending, so you can, on top of everything else.

[00:22:29] The Weights & Biases gives like hyperparameter tracking, which like you, you started with command line and that's really hard to like remember. Also the Dharma data set, like the quick mini Orca, mini many different things. It's pretty cool to like visualize them as well. And I, I heard that he's working on a new version of, of Dharma, so Dharma 2, et cetera.

[00:22:47] So hopefully, hopefully we'll see that soon, but definitely it's hard, right? You start this training around, it said like 40, 50 hours. Sometimes, sometimes it's like your SSHing into this machine. You, you start a process, you send it with God and you just go about your day, collecting data sets, and then you have to return.

[00:23:04] And the whole process of instrumentation of this is still a little bit like squeaky but definitely. Tuning performance, or like grabbing performance in the middle of this, like with Dharma and some other tools, is very helpful to know that you're not wasting precious resources going somewhere you shouldn't go.

[00:23:21] Yeah.

[00:23:22] swyx: Yeah. Very cool. Maybe I'll, I'll, before we go into like sort of more details on fine tuning stuff, I just wanted to round out the rest of the Excel autoverse. There's, there's still Eric Hartford stuff. I don't know if you want to talk about Pygmalion, Disco, anything that you know about

[00:23:35] Wing Lian: those, those things.

[00:23:36] DiscoLM and Model Stacking

[00:23:36] Wing Lian: Yeah, I think like one of the, definitely one of the more interesting ones was like the Disco 120B, right? Yeah, I know nothing about it. Yeah. So, so, Alpin from Pygmalion AI, right, so they, so Pygmalion is a sort of a, it's, it's, they have their own community, a lot of it is based around roleplay models, those sorts of things. And Alpin, like, put together, merged together Llama 2 70B, so, I don't remember how he stacked them together, whether he merged the layers in between. There's a whole, there's a whole toolkit for that by Charles Goddard, where you can like take a single model and like stack them together, or merge multiple models.

[00:24:18] That's like a whole other talk and a whole other tool set, but he was able to create this 120 billion parameter model out of a Llama 2 70B. And then I believe, yeah, Disco is a fine tune of, of the, the sort of the base 120B, which is, I believe, Goliath 120B. So, and, and what are the

[00:24:37] swyx: headline results that people should know about

[00:24:39] Wing Lian: disco?

[00:24:39] I think for the headline results, I, I've, I haven't played with it personally because it's. It's a very large model and there's a lot of GPU, right? But, like, from what I've heard anecdotally, it performs really well. The responses are very good. Even with, like, just, even the base model is a lot better than, Llama70b.

[00:24:57] So, and we, I think generally everybody's like, we would all love to fine tune Llama70b, but it's just, it's so much, it's so much memory, so much compute, right?

[00:25:07] Datasets and Evals over Models

[00:25:07] Wing Lian: I

[00:25:07] Alex Volkov: want to touch on this point because the interesting thing That comes up out of being in this ecosphere and being friends with open source folks, tracking week to week state of the art performance on different models.

[00:25:19] First of all, a lot of the stuff that the folks did a couple of weeks ago, and then something like Mistral comes out, and a lot of the stuff back then doesn't technically make sense anymore. Like the artifacts of that work, the actual artifacts, they no longer make sense. They're like lower on the, on the Hugging Face leaderboard or lower on the LMSys leaderboard.

[00:25:36] But some of the techniques that people use, definitely the datasets. The datasets keep traveling, right? So OpenHermes, for example, is the dataset Teknium cleaned up for only open-sourceable data, that previously was just Hermes. And that, it was previously used to train Llama. And then once Mistral came out, it was used to train Mistral.

[00:25:54] And then it became significantly better on the 7b base Mistral. So the data sets keep traveling, keep getting better a little bit here and there. And so the techniques improve as well. It looks like both things are simultaneously true. The artifacts of a month and a half ago, the, the actual models themselves, it's great that Hugging Face has them, because not every company can keep up with the next week's, oh, I, I'll install this model instead, sell this model instead.

[00:26:19] But the, the techniques and the, the dataset keep improving as we go further, and I think that's really cool. However, the outcome of this is that for a long time. For many, many people, including us, that we do this every week. We literally talk with people who release these models every week. It's really hard to know.

[00:26:36] So, there are a few aspects of this. One, I think, like you said, for the bigger models, the 70B models, you actually have to have somebody like Perplexity, for example, giving you access to the 70B really fast. Or you have to actually find some compute, and it's expensive, especially for the bigger models. For example, Falcon 180B came out, like the hugest open source model.

[00:26:56] How do you evaluate this if you can't run it? So first of all, nobody liked it, but secondly, only the people who were able to find enough compute to run inference on it could even form an opinion; everyone else was stuck at, I can't run this on my laptop. And so that's why something like OpenHermes 7B is much easier, because you can run it on your MacBook.

[00:27:14] It's much easier to evaluate. It's much easier to figure out the vibes, right? Everybody talks about the vibes as an evaluation check. If you're plugged in enough, if you follow the right people, and they say pretty much the same things all independently, then you run into the problem of whether they're just repeating each other, stochastic parrots repeating the same thing, or whether they actually evaluated it themselves.

[00:27:31] Yeah, you never know. But I think on a large enough scale on Twitter, you start getting the feel. And we all know that OpenHermes is one of the top performing models, on benchmarks but also on vibes. And I just wanted to highlight this vibe check thing, because you can have the benchmarks, you can have the evaluations, but they potentially have contamination in them, and they don't necessarily tell you the whole story, because some models are good on benchmarks, but then you talk to them and they're not super helpful.

[00:28:00] And I think it's a combination of the benchmarks, the leaderboards, the chatbot arena, because LMSys, remember, their ranking is not only based on benchmarks, it's also people playing with their arena stuff. Actual humans get two answers. I think they completely ignore benchmarks. Yeah, and then they only do Elo.

[00:28:18] Oh, they do Elo completely, right? So that, for example, is just people playing with both models and saying, hey, I prefer this one, I prefer that one. But there's also some selection bias. The type of people who will go to LMSys to play with the models, they're a little bit specific in terms of who they are.

[00:28:33] It's very interesting. There are so many models, and people are doing this in different ways. Some people are doing it for academic rigor, only to test out new ideas. Some people are doing it like the Intel fine tunes of Mistral: Intel wanted to come out and show that their hardware approach works with Mistral, etc.

[00:28:51] And it's really hard to know what to pick, what to use. And especially with the bigger models, like you said, the Llama 2 70B, the Falcon 180B, it's really hard because, who has the compute to validate those? So I would mention: use with caution. Go and research and see if the biggest model that just released is actually worth the tokens and the money you'd spend on it.

[00:29:12] To try and, if you're a business, to integrate it.

[00:29:15] Distilling from GPT4

[00:29:15] swyx: Since you said use with caution, I'll bring in one issue that has always been in the back of my mind whenever I look at the entire universe of open source AI models, which is that 95 percent of the data is derived from GPT-4, correct?

[00:29:30] Which technically you can't use for commercial licenses,

[00:29:34] Wing Lian: right?

[00:29:35] swyx: What is the community's stance on this kind of stuff?

[00:29:40] Wing Lian: I think from the community stance, like I feel like a lot of us are just experimenting, so for us, it's like, we're not going and building a product that we're trying to sell, right?

[00:29:49] We're just building a product because we think it's interesting and we want to use it in our day to day lives, whether or not we try and integrate it. Personal use, yeah. Yeah, personal use, so like, as long as we're not selling it, yeah, it's fine. But

[00:30:01] swyx: like, I as a company cannot just take OpenHermes and start serving

[00:30:05] Alex Volkov: it and make money on it.

[00:30:06] OpenHermes you can, because OpenHermes, I think, is a cleanup that was done after the regular Hermes. Please folks, check your licenses before you listen to podcasts and act on this. I will tell you, though, you could say the same thing about OpenAI. The same argument kind of makes sense, where OpenAI or Stability AI trains their diffusion model on a bunch of pictures on the internet, and then the court kind of doesn't side with Sarah Silverman, I think, or somebody else, who came and said, hey, this has my work in it, because of the way the model processes it: it eventually builds this knowledge into its weights, and then it doesn't actually reproduce one to one what was in the dataset.

[00:30:45] You could claim the same thing for open source. Like, we're using, and by we, I mean the open source community that I happily report on, uses GPT-4 to rank, for example, which is the better answer. That's how you build one type of dataset, right? For DPO or something like this, you basically generate a dataset of, like, a question and four answers, for example, and then you go to GPT-4 and say, hey, smartest model in the world right now, until Gemini Ultra, which we should mention as well.

[00:31:11] Which one of those choices is better? But the choices themselves are not necessarily written with GPT-4. Some of them may be, so there are fully synthetic datasets. But there are also datasets that are just ranked with GPT-4 while the answers are actually generated with a smaller, less capable model.

[00:31:25] The lines are very blurry as to what type of stuff is possible or not possible. And again, when you use a model that's up on Hugging Face, the license says you can use it. OpenAI is not going to come after you, the user. If anything, OpenAI will try to say, hey, let's prevent this type of thing from happening, but I honestly don't think that they could even know. Not that that makes it okay, it's just, they also kind of did this with the internet's archive of data, and I think that some of it is fair use.

[00:31:55] You use models to help you augment tasks, which is what GPT 4 lets you do.

[00:32:00] swyx: Yeah, the worst thing that OpenAI can do is just kick you off OpenAI. That's because it's only enforced in the terms of service.

[00:32:05] Alex Volkov: Sure, but just to clarify who they're going to kick out: they could kick out, like, Nous, for example, if Nous are abusing their service. A user of the open source, fully Apache 2.0 open source model, for example, won't get kicked out just because they use both.

[00:32:22] I don't believe so. I don't think OpenAI has a claim for that.

[00:32:25] swyx: Well, we're not lawyers, but I just want to mention it for people to know it's an issue.

[00:32:30] Wing Lian: And one of the things, like, I talked to someone recently, and I think that they're also interested in this, but to the point of, right, if I use a model trained on GPT-4 data, but I then use that model to generate new data,

[00:32:46] Is that model, is that data okay? So like you start going down this whole rabbit hole. So yeah. All right.

[00:32:53] swyx: Fantastic. Cool. Well, I think that's roughly highlights most of the open source universe. You also have your own models. Do you want to shout out any one of them? Yeah.

[00:33:01] Wing Lian: I mean, I think, early on, Manticore got a lot of love.

[00:33:04] I think it was mostly popular in, like, the roleplay communities. It tended to be pretty truthful and to have relatively good answers, depending on who you ask, right? But I think for me, releasing models was a way to try and continue to build out the product, figure out what I needed to put into the product, how do I make it faster, and, if you've got to go and debug your product, you may as well have it do something useful.

[00:33:29] Awesome. So, yeah.

[00:33:31] Finetuning - LoRA, QLoRA, ReLoRA, GPTQ

[00:33:31] swyx: Okay, and then maybe we'll talk about just fine tuning techniques. So this is going to be a little bit more technical than just talking about model names and datasets. So we started off talking about LoRA, QLoRA. I just learned from your readme there's ReLoRA, which I've never heard about.

[00:33:45] Could you maybe talk about, like, just parameter efficient fine tuning that whole, that

[00:33:50] Wing Lian: whole journey, like, what people should know. Yeah, so with parameter efficient fine tuning, I think of the popular ones, we'll start with LoRA, right? So, usually what you do is you freeze all the layers of the base model, and then, at the same time, you sort of introduce

[00:34:08] You introduce another set of layers over it, and then you train those, and it's done in a way that is mathematically convenient, particularly with LoRAs, so that when you train the model, you run your inputs through the base model, whose weights are frozen, but you also run them through the additional weights, and at the end you combine the two to get your outputs. And when you're done training, you're left with this other set of weights, right, that are completely independent. And then from that, what you can do is, some person smarter than I figured out, well, they've done it in such a way that you can merge these weights back into the original model without changing the architecture of the model, right?

[00:35:03] So that tends to be, like, the go-to. You're training far fewer parameters, so when you do that, yes, you still need to have all of the original weights, but you have smaller gradients, a smaller optimizer state, and you're just training fewer weights, so you can tend to train those models on much smaller GPUs.
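For reference, here is a minimal, hedged sketch of the LoRA setup described above, a frozen linear layer plus a trainable low-rank update that can later be merged back in. It is illustrative PyTorch, not Axolotl's or PEFT's actual implementation; the rank and alpha values are placeholders.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base Linear plus a trainable low-rank update (illustrative only)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                 # freeze the original weights
        self.lora_a = nn.Linear(base.in_features, r, bias=False)    # down-projection
        self.lora_b = nn.Linear(r, base.out_features, bias=False)   # up-projection
        nn.init.zeros_(self.lora_b.weight)                          # adapter starts as a no-op
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen path plus the small trainable path, combined at the output
        return self.base(x) + self.lora_b(self.lora_a(x)) * self.scaling

    @torch.no_grad()
    def merge(self) -> nn.Linear:
        # fold the low-rank update back into the base weight; the architecture is unchanged
        self.base.weight += (self.lora_b.weight @ self.lora_a.weight) * self.scaling
        return self.base
```

Only `lora_a` and `lora_b` receive gradients, which is why the optimizer state and GPU footprint shrink the way Wing describes.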

[00:35:27] swyx: Yeah. And what I've seen out there is roughly 1 percent the number of parameters that you're training. Yeah, that sounds about right. Which is that much cheaper. So Axolotl supports full fine tune, LoRA, QLoRA,

[00:35:40] Wing Lian: Yes. So QLoRA is very similar to LoRA. If I remember correctly, traditionally most people who did LoRAs were putting the model weights in 8-bit and then doing parameter efficient fine tuning over the LoRA weights, and with QLoRA, they were quantizing the base weights down to 4-bit, right, and then I believe they were also training adapters on all of the linear layers in the model.
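As a concrete reference point, the QLoRA recipe outlined here (4-bit frozen base, adapters on the linear projections) roughly corresponds to the following with Hugging Face transformers and peft; the model name and hyperparameters are placeholders, not values recommended in the episode.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the frozen base weights to 4-bit
    bnb_4bit_quant_type="nf4",              # the NF4 data type from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder base model
    quantization_config=bnb_config,
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # "all of the linear layers" for a Llama-style model
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)   # only the adapter weights are trainable
model.print_trainable_parameters()
```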

[00:36:15] And then ReLoRA, that was an interesting paper, and it got implemented. Some people in the community tried it out, and it showed that it didn't really have the impact that the paper indicated it would. And from what I was told recently, they re-released something for ReLoRA, like, a few weeks ago, and it's possibly better.

[00:36:44] I personally haven't had the time. What was the

[00:36:46] swyx: main difference,

[00:36:47] Wing Lian: apart from quantization? I don't know. Okay. What was the main difference, sorry?

[00:36:49] swyx: Apart from quantization, right? Like,

[00:36:50] Wing Lian: QLoRA's thing was, like, we'll just drop off some bits. With ReLoRA, what they did was, you would define some number of steps that you would train your LoRA, or your QLoRA, for.

[00:37:01] Like, you could do ReQLoRA if you really wanted to. You would train your LoRA for some number of steps, and then you would merge those weights into your base model, and then you would start over. By starting over, the optimizer has to re-optimize and find the best direction to move in, and then you do it all again, merge it in, do it all again, and theoretically, according to the paper, with ReLoRA you can do parameter efficient fine tuning but still get sort of the performance gains of doing a full fine tune.
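A schematic of that merge-and-restart loop, as described here; the helper functions are hypothetical stand-ins, not real Axolotl or PEFT calls.

```python
# Schematic of the ReLoRA idea described above: train a low-rank adapter for a while,
# merge it into the base weights, then restart with a fresh adapter and fresh optimizer.
def relora_train(model, dataloader, num_restarts: int, steps_per_restart: int):
    for _ in range(num_restarts):
        adapter = attach_fresh_lora(model)        # hypothetical: new zero-init low-rank layers
        optimizer = make_optimizer(adapter)       # hypothetical: optimizer over adapter params only
        for step, batch in enumerate(dataloader):
            if step >= steps_per_restart:
                break
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        merge_lora_into_base(model, adapter)      # hypothetical: fold the update into the base
        # Each restart begins from the merged weights, so the sequence of low-rank
        # updates can, in aggregate, approximate a full fine-tune.
    return model
```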

[00:37:38] swyx: Yeah, and

[00:37:39] Wing Lian: GPTQ? So I think GPTQ is more similar to QLoRA, in that it's mostly a quantization of the weights down to, like, 4-bit, where GPTQ is a specific methodology or implementation of quantization. Got it.

[00:37:57] Alex Volkov: Wing, for folks who use Axolotl, your users, some people who maybe want to try it out?

[00:38:03] Do they need to know the differences? Do they need to know the implementation details of QLoRA versus ReLoRA? Or is it okay for them to just know that Axolotl is the place that already integrated them? And if that's true, if that's all they need to know, how do they choose which method to use? Yeah,

[00:38:22] Wing Lian: so I think, like, most people aren't going to be using ReLoRA.

[00:38:25] I think most people are going to be using either LoRA or QLoRA. And I think they should have an understanding of why they might want to use one over the other. Most people will say that with QLoRA, the quality of the final model is not quite as good as if you were to do a LoRA or a full fine tune, right?

[00:38:44] Just because you've quantized things down, your accuracy is probably a little off, and so by the time you've done the QLoRA, you're not moving the weights the way you would on a full fine tune with the full parameter weights.

[00:38:56] Interesting.

[00:38:57] swyx: Okay, cool. For people who are more interested, obviously, read the papers. I just wanted to give people, like, a high level overview of what these things are. And you've done people a service by making it easy for people to try it out. I'm going to, I'm going to also ask a question which I know to be wrong, but I'm curious because I get asked this all the time.

[00:39:15] What is the difference between all these kinds of fine tunes

[00:39:17] Wing Lian: and RLHF? Okay, between all of these sorts of fine tunes and RLHF. So all of these sorts of fine tunes are, ideally, taking knowledge that the base model already has and presenting it in a way that has the model use what it already knows to answer in a particular way, whether you're extracting general knowledge or a particular task, right?

[00:39:44] Instruct tune, chat tune, those sorts of things. And then with RLHF, so what is it, let's go back, what is it? Reinforcement Learning from Human Feedback. If we start with the human feedback part, what you're doing is you generally have a given prompt, and then maybe you have one, maybe you have two, I think if you look at Starling you have up to, what, seven different possible responses, and you're ranking those responses on some sort of metric, right, whether the metric is how much I might like that answer, or, I think with Starling, it's how helpful was the answer, how accurate was the answer, how toxic was the answer, those sorts of things on some sort of scale. And then you use that to go back and take a model and nudge it in the direction of that feedback, so it answers questions based on those preferences.
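To make the ranking step concrete, here is a hedged sketch of what a preference record might look like before it feeds a reward model or a pairwise method like DPO; the field names and values are illustrative, not taken from any specific dataset.

```python
# Illustrative shape of human (or GPT-4) feedback data; field names are made up here.
preference_record = {
    "prompt": "Explain what a LoRA adapter does, in one paragraph.",
    "responses": [
        {"text": "A LoRA adapter adds small trainable matrices ...", "rank": 1},
        {"text": "It is a parameter-efficient fine-tuning method ...", "rank": 2},
        {"text": "LoRA means you retrain the entire model ...",       "rank": 3},
    ],
    # Starling-style datasets also score each answer on several axes:
    "scores": {"helpfulness": 8.5, "accuracy": 9.0, "toxicity": 0.0},
}

# Pairwise methods such as DPO typically reduce the ranking to chosen/rejected pairs:
dpo_pair = {
    "prompt": preference_record["prompt"],
    "chosen": preference_record["responses"][0]["text"],
    "rejected": preference_record["responses"][-1]["text"],
}
```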

[00:40:42] swyx: Yeah, so you can apply, and is it commutative? Can you apply fine tuning after and onto an RLHF model? Or should the RLHF apply, come in afterwards,

[00:40:54] Wing Lian: after the fine tune? Um, yeah, I don't know that there's been enough research one way or the other, like, I don't know.

[00:41:02] That's a question that's been asked on Discord. Yeah, like, I definitely would say I don't know the answer. Go and try it and report back to me and let me know so I can answer for the next guy.

[00:41:10] swyx: It's shocking how much is still unknown about all these things. Well, I mean, that's what research is for, right?

[00:41:16] Wing Lian: So actually, I think I saw at the top of a leaderboard, it was a Mistral base model, and they didn't actually fine tune it first. They just did an RLHF fine tune on it using, I don't recall which dataset, and it benchmarked really well.

[00:41:37] But yeah, you'd have to go and look at it. So it is interesting, going back to that: traditionally, most people will fine tune the model and then do a DPO, PPO, some sort of reinforcement learning over that, but with that particular model, it seemed like they skipped the supervised fine tuning, or SFT.

[00:41:55] Axolotl vs HF Transformers

[00:41:55] swyx: Cool. One thing I did also want to comment about is the overall, like, landscape, competitive landscape, I don't know. Hugging Face Transformers, I think, has a PEFT module.

[00:42:05] Wing Lian: Yeah, yeah, the PEFT, the Parameter Efficient Fine Tuning, yep. Is that a competitor to you? No, no, so we actually use it. We're just a wrapper over sort of, sort of the HuggingFace stuff.

[00:42:15] So that is their own module where they have taken on the responsibility for these parameter efficient fine tuning methods; it lives in that particular package, while Transformers is mostly responsible for the modeling code and the trainer, right.

[00:42:35] And then there's an integration between the two, and there's a variety of other fine tuning packages, like TRL and TRLX. TRLX is the Stability AI one, the CarperAI one, and TRL is a Hugging Face trainer. Even that one's just another wrapper over the Transformers library and the PEFT library, right?

[00:43:00] But what we do is, yes, we also use that, but we also have more validation, right? So, there are some of us who have done enough fine tunes to know, oh, this and this just don't go together, right? But most people don't know that. Example?

[00:43:19] Like, which knobs don't go together? I don't have an example offhand, but if you turn this knob and this knob, right? You would think, all right, maybe this will work, but you don't know until you try. And then by the time you find out it doesn't work, it's maybe five minutes later and it's failed.

[00:43:34] It's failed in the middle of training, or it's failed during the evaluation step, and you're like, ah. So we've added a lot more validation, so that when you've created your configuration and you run it through, the validation code says this is probably not right, or probably not what you want.

[00:43:52] So are you like a, you

[00:43:53] swyx: do some linting of your YAML file?

[00:43:56] Wing Lian: There, I guess you could call it linting, it's sort of like Is there a set of rules out

[00:44:00] swyx: there somewhere? Yeah, there's a set of rules in there. That's amazing, you should write documentation like: this rule exists because this user, at this time, ran into this bug, and that's what we invested in.

[00:44:10] It's like a good collection

[00:44:11] Wing Lian: of knowledge. Yeah, it is, and I guess if you really wanted to figure it out, you could git blame everything. But yeah, I think that's always a useful thing, because people want to experiment, but people will get frustrated when they're experimenting and it breaks and they don't know why, or they know why and they've just gone down the rabbit hole, right?

[00:44:37] So I think that's one of the big features that I find important, because it prevents you from doing things you probably shouldn't have. And sometimes we will let you do those things, but we'll try and warn you that you've done that.
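As an illustration of that kind of "lint" pass, these checks amount to cheap assertions over the config before any GPU time is spent. The key names below echo common fine-tuning options, but the specific rules are invented for the example; they are not Axolotl's actual validation code.

```python
def validate_config(cfg: dict) -> list[str]:
    """Return human-readable warnings for knob combinations that tend to fail."""
    warnings = []
    if cfg.get("load_in_4bit") and cfg.get("adapter") != "qlora":
        warnings.append("4-bit base weights are normally paired with a QLoRA adapter.")
    if cfg.get("sample_packing") and not cfg.get("flash_attention"):
        warnings.append("sample packing without FlashAttention can leak attention across samples.")
    if cfg.get("gradient_accumulation_steps", 1) < 1:
        warnings.append("gradient_accumulation_steps must be at least 1.")
    return warnings

# Fail fast before launching a multi-hour run.
cfg = {"adapter": "lora", "load_in_4bit": True,
       "sample_packing": True, "flash_attention": False}
for msg in validate_config(cfg):
    print("WARNING:", msg)
```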

[00:44:51] Alex Volkov: I have a follow up question on this, actually, because yesterday we hung out at this open source event, and I stood near you a couple of times when people told you, oh, Axolotl, I use Axolotl, it's super cool, and the first thing you asked, immediately, was, what can we improve?

[00:45:04] And yes, from multiple folks. And I think we talked about this a little bit: it's a developer tool, like a machine learning slash developer tool. Your purpose in this is to help people and say, as much as possible, hey, here's the best set of things that you can use right now, whereas the bare libraries, or the bare trainer, for example, is just a bare trainer.

[00:45:28] And also, maybe we should talk about how fast you're implementing these things. So you mentioned the first implementation took a week or so. Now there's a core maintainer group, right? And features are landing, like QLoRA, for example. NEFTune, I don't know, maybe that's one example of something that people said was going to be cool, and then eventually it was one of those things that didn't really shake out, like, people quickly tested it out.

[00:45:48] So, there's a ton of... Wait, NEFTune is cancelled? I don't know if it's fully cancelled, but based on vibes, I heard that it's not that great. But the whole point that I'm trying to make with NEFTune as well is that being in the community around Axolotl, or even following the GitHub issues or following the Discord, is a fairly good way to learn these kinds of gut feelings that you just mentioned, right?

[00:46:14] Like where maybe this knob and that knob don't work together. Some of these are not written down. Some of these are tribal knowledge that passes from place to place. Axolotl is a great collection of many of them. And so, do you get that back also from the community of folks who just use it? Like, how do you know who uses this?

[00:46:30] I think that's still an issue, like, knowing if they trained with Axolotl, or whether they should add this to things. Talk about how you get feedback, and how else you should get feedback.

[00:46:38] Wing Lian: Yeah, I mean, most of the feedback comes from the Discord, so people come in and they can't get a training run going, they run into obscure errors, errors that are a lot of things that maybe, as a product, we could catch. But there's a lot of things that at some point we need to go and do, and it's just on the list somewhere.

[00:46:58] Right, that's why when people come up, I'm like, what were your pain points? Because as a developer tool, if you're not happy with it, or you come in and after the first 30 minutes you're still not happy, you leave the tool and you might move on, maybe to a better tool, maybe to one with less frustration that may not be as good, right?

[00:47:17] So I'm trying to figure out, all right, how can I reduce all this frustration? Because for me, I use it every day for the most part, right? And so I am blind to that. I just know, I go do this, this, and this, and it pretty much mostly works, right? So I don't have that learning curve that other people are seeing, and I don't understand their pain points.

[00:47:40] Yeah,

[00:47:40] Alex Volkov: you don't have the ability to onboard yourself as a new user, completely new to the whole paradigm, to get in the door and go, oh no, I don't even know how to ask about this problem or error.

[00:47:53] swyx: Cool. The last few things I wanted to cover was also just the more advanced stuff that you covered yesterday.

[00:48:00] 20x efficiency with StackLlama and Multipack

[00:48:00] swyx: So I'll just caution this as, yeah, this is more advanced. But you mentioned StackLlama and Multipack. What are they

[00:48:06] Wing Lian: and what should people know? Yeah, so StackLlama was, that paper came out, and StackLlama I think was two separate concepts that they announced, so the first one was... They being Hugging Face.

[00:48:20] Yeah, sorry, yes, they being Hugging Face. So the first one being this idea of packing sequences together. So if we think about training data, right, to keep the math easy, let's say your training data is 500, we'll use the terminology words.

[00:48:39] Let's say your training data is 500 words long, and let's say your context length, how much data your model can accept, or that you want to feed into your model, is, let's say, 4,000, right? So if you're training at 4k context and you're only using 500 of it, you're sitting there with the other

[00:49:05] 3,500 words that you're not using, right? And typically that's filled with these PAD tokens. I think I made the analogy last night that it's like having a glass: you fill it up with a shot of liquor, and that's your training data, and then you just fill it up with more water, and those are your PAD tokens. It just doesn't do much, right?

[00:49:27] It's still the same thing, but you still have to go through all of that to get through all your training data. So what StackLlama showed was you could just take your training data and append the next row of training data until you've filled that entire 4k context. So in this example, with 500 words into 4k, that's 8 rows of training data.
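The arithmetic behind that example, using the rough numbers from the conversation:

```python
# 500-word samples in a 4,000-word context window, as in the example above.
context_len = 4000
sample_len = 500

pad_fraction = 1 - sample_len / context_len      # padding-only: one sample per row
samples_per_row = context_len // sample_len      # naive packing, StackLlama-style

print(f"padding-only: {pad_fraction:.0%} of every row is PAD tokens")   # ~88%
print(f"packed: {samples_per_row} samples per row instead of 1")        # 8
```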

[00:49:48] But the problem with that is that a lot of these transformer models are very much relying on attention, right? So if you now have this packed sequence of words, the model has seen all of these other words before, and then it sees another set of words, and another set of words, but it's learning everything in the context of all the words that it's seen before.

[00:50:13] We haven't corrected the attention for that. And just real quickly, since I said that that paper was two concepts, the other one was, I believe it was like a reinforcement learning, but outside the scope of this. So going from that, I implemented that early on because I was like, Oh, wow, this is really great.

[00:50:29] And yes, because it saves you a bunch of time, but the trade-off is a little bit of accuracy, ultimately. But it still did pretty well. I think when I did Manticore, it used that concept from StackLlama of just appending these sequences together, right? And then the next evolution of that is Multipack, right?

[00:50:51] So, there was a separate paper on that; I believe it got referenced in the Orca paper, where you could properly mask those out using, I think, a lower block-triangular attention mask. So there's that. I did try implementing that, manually recreating that mask, but then the guy from OpenChat, he was helping with OpenOrca as well, had done an implementation of Multipack where he used FlashAttention. FlashAttention was released by Tri Dao, and it was this huge performance gain.

[00:51:35] Everybody uses it now; even in the Transformers library, people are taking all of these models and making them compatible with FlashAttention. But in FlashAttention, there is one particular implementation that lets you say, well, I'm sending you all of these sequences packed together like you would in StackLlama, but let me also send you another set of information about where this set of sequences is, and where the second set of sequences is.

[00:52:06] So if each one was 500 words long and you stacked them all together, you would just send it a row of information that was like 0, 500, 1000, 1500, etc., out to 4000. And it would know, alright, I need to break this up, and then run the forward pass with it. And it was much, much more performant.
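A hedged sketch of that bookkeeping: the cumulative boundaries (the 0, 500, 1000, ... offsets) and the block-diagonal causal mask that keeps packed samples from attending to each other. Real Multipack implementations hand the offsets to FlashAttention's variable-length kernel instead of materializing a mask; this is just to show the idea.

```python
import numpy as np

def cumulative_seqlens(lengths):
    # e.g. [500]*8 -> [0, 500, 1000, ..., 4000], the offsets described above
    return np.concatenate([[0], np.cumsum(lengths)])

def block_diagonal_causal_mask(lengths):
    total = int(sum(lengths))
    mask = np.zeros((total, total), dtype=bool)
    bounds = cumulative_seqlens(lengths)
    for start, end in zip(bounds[:-1], bounds[1:]):
        size = end - start
        # each sample only attends to its own earlier tokens
        mask[start:end, start:end] = np.tril(np.ones((size, size), dtype=bool))
    return mask

print(cumulative_seqlens([500] * 8))   # [   0  500 1000 ... 4000]
```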

[00:52:29] And I think you end up seeing like 10x, 20x improvements over sort of, I mean, I think FlashAttention was like a 2x improvement, and then adding that with the Multipack, you start to see like, depending on, how much data you have, up to like a 20x improvement sometimes. 20x. 20x. Wow. Yeah.

[00:52:48] And I only know the 20x because, before last night, I re-ran the Alpaca training. I looked up the Alpaca paper because I just needed a frame of reference where somebody had done it, and I think they used eight A100s for three hours, and they said it cost them $100. I don't think eight A100s cost that; I don't know how much it costs right now.

[00:53:14] But I ended up rerunning it. Usually a dollar an hour, right? Yeah, so eight. The cheapest is like a

[00:53:18] Alex Volkov: dollar, a dollar an hour for one.

[00:53:20] Wing Lian: Yeah, so that's still like $24, $25. But maybe if you're going on Azure, maybe it's $100 on Azure. I mean, it used to be more expensive, like, a year ago.

[00:53:31] Yeah, and then I re-ran it with all of the optimizations turned on, just to see what it would be. And usually Multipack is the biggest optimization, so Multipack with FlashAttention. I think I spun it up on 8 L40s, and it ran, and I didn't let it run all the way through, I just grabbed the estimated completion time, and it was like 30 minutes, so it would have cost like $4 or $5 to reproduce the entire Alpaca paper, right?
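Back-of-the-envelope, using only the rough figures mentioned in the conversation (these are not current GPU prices):

```python
alpaca_paper_run = 8 * 3.0 * 1.0    # 8 A100s x 3 hours   x ~$1/GPU-hour -> ~$24
optimized_rerun  = 8 * 0.5 * 1.0    # 8 L40s  x 0.5 hours x ~$1/GPU-hour -> ~$4
print(alpaca_paper_run, optimized_rerun)   # 24.0 4.0
```

The larger 20x figure refers to training throughput from Multipack plus FlashAttention, not just the dollar difference in this particular rerun.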

[00:54:00] Which is crazy. It's crazy. 20x,

[00:54:02] Alex Volkov: yeah. I want to ask about, like, you said you turned on all the optimizations. Is that the YAML file with Axolotl, you just go and check off, like, I want this, I want that? Yeah, yeah,

[00:54:10] Wing Lian: so there's one particular YAML file in there, it's under examples, llama2, fft, optimized.

[00:54:20] So, I think someone had created one where they just turned, they put in all of the optimizations and turned them on. I mean, it actually, it does run, which is like, sort of surprising sometimes, because sometimes, you optimize this, optimize this, and sometimes they just don't work together, but, yeah.

[00:54:36] Just turn the knobs on, and like, fine tuning should really just be that easy, right? I just want to flip the knob and move on with my life and not figure out how to implement it.

[00:54:47] Tri Dao and Mamba

[00:54:47] Alex Volkov: Specifically, the guy behind FlashAttention came up with something new. You want to talk about this a little bit? You want to briefly cover Mamba?

[00:54:53] Yeah, let's talk about Mamba. Let's talk about Mamba. So, what is Mamba?

[00:54:57] Wing Lian: Oh, gosh. I mean, I have not read the paper end to end. I think you need to find someone smarter to tell you what Mamba is. But in a nutshell, it's this attentionless model architecture. I think it was using a lot of his learnings from, I think, Stanford, which did a lot of attentionless models, like Hyena, several months ago as well, so it's sort of an evolution of that research. And apparently, I believe it's, what, 5x faster for inference, and the memory requirements are sub-quadratic. With models that have attention, as you scale the context length out, the memory and the inference and training time go up quadratically, or squared, right?

[00:55:50] Whereas this one is much closer to linear. So it's really exciting, and I think a lot of people in the community are excited about it. I was talking with LDJ yesterday, and he was saying that with the perplexity curves, comparing, I think, a 140 million parameter Mamba model with the Pythia 140 million parameter model trained on the exact same dataset, I believe the perplexity curves were a little bit lower than the Pythia model's.

[00:56:26] So yeah. Yeah.
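For reference, the scaling contrast being described, stated roughly (a back-of-the-envelope comparison, not figures taken from the Mamba paper):

```latex
% Rough scaling comparison for sequence length n and hidden size d:
\begin{aligned}
\text{self-attention:} \quad & \text{time } \mathcal{O}(n^2 d),\ \text{with an } \mathcal{O}(n^2) \text{ score matrix unless it is streamed (as FlashAttention does)} \\
\text{Mamba / SSM:}    \quad & \text{time roughly } \mathcal{O}(n\,d),\ \text{with a fixed-size recurrent state per step at inference}
\end{aligned}
```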

[00:56:28] Alex Volkov: I think LDJ was also super excited to get us to talk on Thursday about Mamba as well. He mentioned to me that the performance improvements could be like 2x at the beginning, where the token counts are lower, but then as you scale to longer and longer sequences, because of the non-quadratic, almost linear scaling, the performance improvements for larger models and longer contexts are significant, like in the 10x to maybe 20x range.

[00:56:57] Yeah, I think he said 10x, yeah, at the larger models. And that's where we want to go. We want to get to the bigger sizes, the longer training runs.

[00:57:06] Wing Lian: Yeah, yeah. So in particular, the longer context lengths. So if you're talking like 50k, 60k, or 128k context, like, what is it, GPT-4 Turbo now?

[00:57:19] 128, yes. So getting out to that, because it's no longer quadratic, it should be just as fast, I believe, generating those tokens as it is on a short

[00:57:34] Alex Volkov: prompt. So, this came out just recently, and then between running to this open source AI event and driving here in an Uber, you already put out something that I saw you started.

[00:57:44] Wait, what? Something today? Yeah,

[00:57:47] Wing Lian: what did you do? Well, so Tri, and I forget who the other author is on that paper, they had released the modeling code on GitHub, but they hadn't quite made it Transformers-library native.

[00:58:04] So it didn't quite drop cleanly into Axolotl so that you could fine tune it. It was one of the things I actually wanted to try and get done before the meetup yesterday, and just demo that, because that would be awesome, right? That'd be awesome.

[00:58:20] I think it dropped on Thursday... no, what day? No, today is Thursday. I keep thinking today is Friday, that's what I said. So it dropped on, what, Tuesday, the meetup was Wednesday, I wanted to get it done for that, but I was getting to where the loss would just go to zero and it would just fail.

[00:58:40] But yeah, right before coming here, I was working on it this morning, and I think we finally got it working. So I think Pharrell's training something on it, and I'm pretty sure Teknium is going to be training something on it soon.

[00:58:52] Alex Volkov: yeah. So, we'll see, but I wanted to highlight the speed, because you started with the first Alpaca implementation and change in Axolotl coming within a week, and now you're talking about, like, three days, and that's with you flying, and that's with you presenting and talking on podcasts.

[00:59:08] Roadmap for Axolotl

[00:59:08] swyx: Very productive. Yeah. Yeah, excellent. Well, so, we're going to start wrapping up soon, but I always wanted to give you space to also talk about what you're working on next, and, on the

[00:59:17] Wing Lian: roadmap for Axolotl. Yeah, I think the roadmap for Axolotl is really about trying to stabilize sort of the feature set.

[00:59:26] Like, the first thing on the roadmap is to write the roadmap, and then going from there, for me the vision is that it's a developer-first platform, right? And as a developer, you're more than likely doing this as a side hustle or side project, trying to figure out, how do I build

[00:59:45] LLMs, how do I use a trainer, that sort of thing. And then you get comfortable with this tool, and then maybe you take it to your company and you're training models for where you work, right? And then, ultimately, you're saying, I want to use this because it's easy and I know how to use it.

[01:00:03] So, for me, if I follow that thought through, it's like, well, companies won't want to use this if it's hard for them given their specific use cases, right? They might need something specific in their workflow, and what I don't want is them having to fork it, like, fork it in a way that is hard to maintain, where if they want to get new features, they then have to rebase it and all of that.

[01:00:32] So, for me, and I actually have an issue in GitHub that's about three or four months old at this point about exactly this: create a plugin system, expose these hooks where companies can go in and build their own plugins and modify, like, hyperparameters on the fly, or modify various attributes of training.

[01:00:57] Yeah, it's becoming a platform. Yeah, exactly. So, I need to provide a way for them to be able to use it in a reliable manner, something that they can invest in and feel comfortable using, right? Yeah,

[01:01:10] swyx: awesome. You are working independently? You left SoundCloud a few months ago, and you have a non profit, the Open Access AI Collective.

[01:01:20] The Open Source AI Community

[01:01:20] swyx: It has a Discord people can join. How else can people support

[01:01:22] Wing Lian: you? I think really, for me, the biggest thing is I'm always looking for contributors. We have a great set of core contributors, Nanobit, Amin slash TMM1, Casper Hansen, and there are probably a few others whose names I don't have offhand, and we do see some smaller PRs trickle through. But if I had somebody that could have gone and done Mamba for me, that would have made my life a hundred times easier, right?

[01:01:51] I wouldn't have to be scrambling between Ubers and meetings and those sorts of things to try and get that implemented. So there's definitely this roadmap of things to do and nice-to-haves, right? And Nano is great at being a community manager, answering questions and fueling all of that, and he's really technical and can handle open PRs and fix things. He's a graduate student in Japan, working, doing research, and somehow he finds time to support this community, right?

[01:02:25] He's amazing, I love him, and I think everybody should show him some love. But yeah, ultimately, the biggest thing that I could ask for would be just, yeah, more core contributors.

[01:02:38] swyx: Cool. All right, well, if you're interested in checking it out, check out Axolotl.

[01:02:42] Alex, anything else to, to

[01:02:43] Alex Volkov: add? Yeah, I will say to folks who are listening to us: open source doesn't just happen. It happens because there's a bunch of great people giving their time, basically, to these things. So, first of all, be nice in the comments, that's obvious. If you want to come in and complain about something, be productive and do the work as much as possible, so the person who's giving their time to help you will actually find it easier.

[01:03:06] It usually gets to a point where, like, a small project becomes a platform, the platform then has rules, and then it's making it hard for some people to just go in and kind of say, Hey, this thing or that thing. Remember, there's people contributing without necessarily a lot of gain from it, just because they're contributing to the community.

[01:03:24] And also, come in and contribute. If you're using Axolotl, and I heard many people, commercial people, A16Z folks, come up to you, like many people, if they use Axolotl: give back. Give back to the community. I think it's always great. So if you're listening to this and you've used Axolotl and it helped you, there is a way to contribute, not necessarily only as a core contributor; as a sponsor, reach out as well, but definitely talk about this and give feedback too.

[01:03:51] That's also very helpful. Sometimes people get stuck, and it's like, ah, okay, we'll do something else. No, just give feedback, talk about this. I think everybody else will generally benefit from that. Excellent.

[01:04:01] Wing Lian: Thank you. That's it. Yeah. Alright.

[01:04:04] Alex Volkov: Cool. Thanks for coming. Everybody should try Axolotl and tell us what

[01:04:08] swyx: they

[01:04:09] Wing Lian: think.

[01:04:11] Yeah.



Get full access to Latent Space at www.latent.space/subscribe
2023-12-08

Notebooks = Chat++ and RAG = RecSys! ? with Bryan Bischof of Hex Magic

Catch us at Modular's ModCon next week with Chris Lattner, and join our community!

2024 note: Hex is now hiring AI Engineers.

Due to Bryan's very wide-ranging experience in data science and AI across Blue Bottle (!), StitchFix, Weights & Biases, and now Hex Magic, this episode can be considered a two-parter.

Notebooks = Chat++

We've talked a lot about AI UX (in our meetups, writeups, and guest posts), and today we're excited to dive into a new old player in AI interfaces: notebooks!

Depending on your background, you either Don't Like or you Like notebooks - they are the most popular example of Knuth's Literate Programming concept, basically a collection of cells; each cell can execute code, display it, and share its state with all the other cells in a notebook. They can also simply be Markdown cells to add commentary to the analysis.

Notebooks have a long history but most recently became popular from iPython evolving into Project Jupyter, and a wave of notebook based startups from Observable to DeepNote and Databricks sprung up for the modern data stack.

The first wave of AI applications has been very chat focused (ChatGPT, Character.ai, Perplexity, etc). Chat as a user interface has a few shortcomings, the major one being the inability to edit previous messages. We enjoyed Bryan's takes on why notebooks feel like "Chat++" and how they are building Hex Magic:

* Atomic actions vs Stream of consciousness: in a chat interface, you make corrections by adding more messages to a conversation (i.e. "Can you try again by doing X instead?" or "I actually meant XYZ"). The context can easily get messy and confusing for models (and humans!) to follow. Notebooks' cell structure on the other hand allows users to go back to any previous cells and make edits without having to add new ones at the bottom.

* "Airlocks" for repeatability: one of the ideas they came up with at Hex is "airlocks", a collection of cells that depend on each other and keep each other in sync. If you have a task like "Create a summary of my customers' recent purchases", there are many sub-tasks to be done (look up the data, sum the amounts, write the text, etc). Each sub-task will be in its own cell, and the airlock will keep them all in sync together.

* Technical + Non-Technical users: previously you had to use Python / R / Julia to write notebooks code, but with models like GPT-4, natural language is usually enough. Hex is also working on lowering the barrier of entry for non-technical users into notebooks, similar to how Code Interpreter is doing the same in ChatGPT.

Obviously notebooks aren't new for developers (OpenAI Cookbooks are a good example), but haven't had much adoption in less technical spheres. Some of the shortcomings of chat UIs + LLMs lowering the barrier of entry to creating code cells might make them a much more popular UX going forward.

RAG = RecSys!

We also talked about the LLMOps landscape and why it's an "iron mine" rather than a "gold rush":

I'll shamelessly steal [this] from a friend, Adam Azzam from Prefect. He says that [LLMOps] is more of like an iron mine than a gold mine in the sense of there is a lot of work to extract this precious, precious resource. Don't expect to just go down to the stream and do a little panning. There's a lot of work to be done. And frankly, the steps to go from this resource to something valuable is significant.

Some of my favorite takeaways:

* RAG as RecSys for LLMs: at its core, the goal of a RAG pipeline is finding the most relevant documents based on a task. This isn't very different from traditional recommendation system products that surface things for users. How can we apply old lessons to this new problem? Bryan cites fellow AIE Summit speaker and Latent Space Paper Club host Eugene Yan in decomposing the retrieval problem into retrieval, filtering, and scoring/ranking/ordering (a minimal code sketch of this decomposition follows after this list):

As AI Engineers increasingly find that long context has tradeoffs, they will also have to relearn age-old lessons that vector search is NOT all you need and a good "systems, not models" approach is essential to scalable/debuggable RAG. Good thing Bryan has just written the first O'Reilly book about modern RecSys, eh?

* Narrowing down evaluation: while "hallucination" is an easy term to throw around, the reality is more nuanced. A lot of times, model errors can be automatically fixed: is this JSON valid? If not, why? Is it just missing a closing brace? These smaller issues can be checked and fixed before returning the response to the user, which is easier than fixing the model.

* Fine-tuning isn't all you need: when they first started building Magic, one of the discussions was around fine-tuning a model. In our episode with Jeremy Howard we talked about how fine-tuning leads to loss of capabilities as well. In notebooks, you are often dealing with domain-specific data (i.e. purchases, orders, wardrobe composition, household items, etc); the fact that the model understands that "items" are probably part of an "order" is really helpful. They have found that GPT-4 + 3.5-turbo were everything they needed to ship a great product rather than having to fine-tune on notebooks specifically.
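As promised above, here is a minimal sketch in Python of the retrieve / filter / score-and-rank decomposition Bryan borrows from recommendation systems; the toy corpus, keyword-overlap scoring, and team-based filter are placeholders for a real vector store, business rules, and reranking model.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    id: str
    text: str
    team: str

CORPUS = [
    Doc("1", "Q3 coffee demand forecast and pastry attach rate", "ops"),
    Doc("2", "Notebook airlock design notes for Magic", "eng"),
    Doc("3", "Customer purchase summary pipeline", "eng"),
]

def retrieve(query: str, k: int = 100) -> list[Doc]:
    # candidate generation: naive keyword overlap stands in for vector or keyword search
    terms = set(query.lower().split())
    return [d for d in CORPUS if terms & set(d.text.lower().split())][:k]

def filter_docs(docs: list[Doc], user_teams: set[str]) -> list[Doc]:
    # hard business rules: permissions, recency, deduplication, etc.
    return [d for d in docs if d.team in user_teams]

def rank(query: str, docs: list[Doc], k: int = 5) -> list[Doc]:
    # more expensive scoring, e.g. a cross-encoder; overlap count is a stand-in here
    terms = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(terms & set(d.text.lower().split())))[:k]

print(rank("customer purchase summary",
           filter_docs(retrieve("customer purchase summary"), {"eng"})))
```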

Definitely recommend listening to this one if you are interested in getting a better understanding of how to think about AI, data, and how we can use traditional machine learning lessons in large language models.

The AI Pivot

For more Bryan, don?t miss his fireside chat at the AI Engineer Summit:

Show Notes

* Hex Magic

* Bryan's new book: Building Recommendation Systems in Python and JAX

* Bryan's whitepaper about MLOps

* "Kitbashing in ML", slides from his talk on building on top of foundation models

* "Bayesian Statistics The Fun Way" by Will Kurt

* Bryan's Twitter

* "Berkeley man determined to walk every street in his city"

* People:

* Adam Azzam

* Graham Neubig

* Eugene Yan

* Even Oldridge

Timestamps

* [00:00:00] Bryan?s background

* [00:02:34] Overview of Hex and the Magic product

* [00:05:57] How Magic handles the complex notebook format to integrate cleanly with Hex

* [00:08:37] Discussion of whether to build vs buy models - why Hex uses GPT-4 vs fine-tuning

* [00:13:06] UX design for Magic with Hex's notebook format (aka "Chat++")

* [00:18:37] Expanding notebooks to less technical users

* [00:23:46] The "Memex" as an exciting underexplored area - personal knowledge graph and memory augmentation

* [00:27:02] What makes for good LLMops vs MLOps

* [00:34:53] Building rigorous evaluators for Magic and best practices

* [00:36:52] Different types of metrics for LLM evaluation beyond just end task accuracy

* [00:39:19] Evaluation strategy when you don't own the core model that's being evaluated

* [00:41:49] All the places you can make improvements outside of retraining the core LLM

* [00:45:00] Lightning Round

Transcript

Alessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, Partner and CTO-in-Residence of Decibel Partners, and today I'm joined by Bryan Bischof. [00:00:15]

Bryan: Hey, nice to meet you. [00:00:17]

Alessio: So Bryan has one of the most thorough and impressive backgrounds we had on the show so far. Lead software engineer at Blue Bottle Coffee, which if you live in San Francisco, you know a lot about. And maybe you'll tell us 30 seconds on what that actually means. You worked as a data scientist at Stitch Fix, which used to be one of the premier data science teams out there. [00:00:38]

Bryan: It used to be. Ouch. [00:00:39]

Alessio: Well, no, no. Well, you left, you know, so how good can it still be? Then head of data science at Weights and Biases. You're also a professor at Rutgers and you're just wrapping up a new O'Reilly book as well. So a lot, a lot going on. Yeah. [00:00:52]

Bryan: And currently head of AI at Hex. [00:00:54]

Alessio: Let's do the Blue Bottle thing because I definitely want to hear what's the, what's that like? [00:00:58]

Bryan: So I was leading data at Blue Bottle. I was the first data hire. I came in to kind of get the data warehouse in order and then see what we could build on top of it. But ultimately I mostly focused on demand forecasting, a little bit of recsys, a little bit of sort of like website optimization and analytics. But ultimately anything that you could imagine sort of like a retail company needing to do with their data, we had to do. I sort of like led that team, hired a few people, expanded it out. One interesting thing was I was part of the Nestle acquisition. So there was a period of time where we were sort of preparing for that and didn't know, which was a really interesting dynamic. Being acquired is a very not necessarily fun experience for the data team. [00:01:37]

Alessio: I build a lot of internal tools for sourcing at the firm and we have a small VCs and data community of like other people doing it. And I feel like if you had a data feed into like the Blue Bottle in South Park, the Blue Bottle at the Hanahaus in Palo Alto, you can get a lot of secondhand information on the state of VC funding. [00:01:54]

Bryan: Oh yeah. I feel like the real source of alpha is just bugging a Blue Bottle. [00:01:58]

Alessio: Exactly. And what's your latest book about? [00:02:02]

Bryan: I just wrapped up a book with a coauthor Hector Yee called Building Production Recommendation Systems. I'll give you the rest of the title because it's fun. It's in Python and JAX. And so for those of you that are like eagerly awaiting the first O'Reilly book that focuses on JAX, here you go. [00:02:17]

Alessio: Awesome. And we'll chat about that later on. But let's maybe talk about Hex and Magic before. I've known Hex for a while, I've used it as a notebook provider and you've been working on a lot of amazing AI enabled experiences. So maybe run us through that. [00:02:34]

Bryan: So I too, before I sort of like joined Hex, saw it as this like really incredible notebook platform, sort of a great place to do data science workflows, quite complicated, quite ad hoc interactive ones. And before I joined, I thought it was the best place to do data science workflows. And so when I heard about the possibility of building AI tools on top of that platform, that seemed like a huge opportunity. In particular, I lead the product called Magic. Magic is really like a suite of sort of capabilities as opposed to its own independent product. What I mean by that is they are sort of AI enhancements to the existing product. And that's a really important difference from sort of building something totally new that just uses AI. It's really important to us to enhance the already incredible platform with AI capabilities. So these are things like the sort of obvious like co-pilot-esque vibes, but also more interesting and dynamic ways of integrating AI into the product. And ultimately the goal is just to make people even more effective with the platform. [00:03:38]

Alessio: How do you think about the evolution of the product and the AI component? You know, even if you think about 10 months ago, some of these models were not really good on very math based tasks. Now they're getting a lot better. I'm guessing a lot of your workloads and use cases is data analysis and whatnot. [00:03:53]

Bryan: When I joined, it was pre 4 and it was pre the sort of like new chat API and all that. But when I joined, it was already clear that GPT was pretty good at writing code. And so when I joined, they had already executed on the vision of what if we allowed the user to ask a natural language prompt to an AI and have the AI assist them with writing code. So what that looked like when I first joined was it had some capability of writing SQL and it had some capability of writing Python and it had the ability to explain and describe code that was already written. Those very, what feel like now primitive capabilities, believe it or not, were already quite cool. It's easy to look back and think, oh, it's like kind of like Stone Age in these timelines. But to be clear, when you're building on such an incredible platform, adding a little bit of these capabilities feels really effective. And so almost immediately I started noticing how it affected my own workflow because ultimately as sort of like an engineering lead and a lot of my responsibility is to be doing analytics to make data driven decisions about what products we build. And so I'm actually using Hex quite a bit in the process of like iterating on our product. When I'm using Hex to do that, I'm using Magic all the time. And even in those early days, the amount that it sped me up, that it enabled me to very quickly like execute was really impressive. And so even though the models weren't that good at certain things back then, that capability was not to be underestimated. But to your point, the models have evolved between 3.5 Turbo and 4. We've actually seen quite a big enhancement in the kinds of tasks that we can ask Magic and even more so with things like function calling and understanding a little bit more of the landscape of agent workflows, we've been able to really accelerate. [00:05:57]

Alessio: You know, I tried using some of the early models in notebooks and it actually didn't like the IPyNB formatting, kind of like a JSON plus XML plus all these weird things. How have you kind of tackled that? Do you have some magic behind the scenes to make it easier for models? Like, are you still using completely off the shelf models? Do you have some proprietary ones? [00:06:19]

Bryan: We are using at the moment in production 3.5 Turbo and GPT-4. I would say for a large number of our applications, GPT-4 is pretty much required. To your question about, does it understand the structure of the notebook? And does it understand all of this somewhat complicated wrappers around the content that you want to show? We do our very best to abstract that away from the model and make sure that the model doesn't have to think about what the cell wrapper code looks like. Or for our Magic charts, it doesn't have to speak the language of Vega. These are things that we put a lot of work in on the engineering side, to the AI engineer profile. This is the AI engineering work to get all of that out of the way so that the model can speak in the languages that it's best at. The model is quite good at SQL. So let's ensure that it's speaking the language of SQL and that we are doing the engineering work to get the output of that model, the generations, into our notebook format. So too for other cell types that we support, including charts, and just in general, understanding the flow of different cells, understanding what a notebook is, all of that is hard work that we've done to ensure that the model doesn't have to learn anything like that. I remember early on, people asked the question, are you going to fine tune a model to understand Hex cells? And almost immediately, my answer was no. No we're not. Using fine-tuned models in 2022, I was already aware that there are some limitations of that approach and frankly, even using GPT-3 and GPT-2 back in the day in Stitch Fix, I had already seen a lot of instances where putting more effort into pre- and post-processing can avoid some of these larger lifts. [00:08:14]

Alessio: You mentioned Stitch Fix and GPT-2. How has the balance between build versus buy, so to speak, evolved? So GPT-2 was a model that was not super advanced, so for a lot of use cases it was worth building your own thing. Is with GPT-4 and the likes, is there a reason to still build your own models for a lot of this stuff? Or should most people be fine-tuning? How do you think about that? [00:08:37]

Bryan: Sometimes people ask, why are you using GPT-4 and why aren't you going down the avenue of fine-tuning today? I can get into fine-tuning specifically, but I do want to talk a little bit about the good old days of GPT-2. Shout out to Reza. Reza introduced me to GPT-2. I still remember him explaining the difference between general transformers and GPT. I remember one of the tasks that we wanted to solve with transformer-based generative models at Stitch Fix were writing descriptions of clothing. You might think, ooh, that's a multi-modal problem. The answer is, not necessarily. We actually have a lot of features about the clothes that are almost already enough to generate some reasonable text. I remember at that time, that was one of the first applications that we had considered. There was a really great team of NLP scientists at Stitch Fix who worked on a lot of applications like this. I still remember being exposed to the GPT endpoint back in the days of 2. If I'm not mistaken, and feel free to fact check this, I'm pretty sure Stitch Fix was the first OpenAI customer, unlike their true enterprise application. Long story short, I ultimately think that depending on your task, using the most cutting-edge general model has some advantages. If those are advantages that you can reap, then go for it. So at Hex, why GPT-4? Why do we need such a general model for writing code, writing SQL, doing data analysis? Shouldn't a fine-tuned model just on Kaggle notebooks be good enough? I'd argue no. And ultimately, because we don't have one specific sphere of data that we need to write great data analysis workbooks for, we actually want to provide a platform for anyone to do data analysis about their business. To do that, you actually need to entertain an extremely general universe of concepts. So as an example, if you work at Hex and you want to do data analysis, our projects are called Hexes. That's relatively straightforward to teach it. There's a concept of a notebook. These are data science notebooks, and you want to ask analytics questions about notebooks. Maybe if you trained on notebooks, you could answer those questions, but let's come back to Blue Bottle. If I'm at Blue Bottle and I have data science work to do, I have to ask it questions about coffee. I have to ask it questions about pastries, doing demand forecasting. And so very quickly, you can see that just by serving just those two customers, a model purely fine-tuned on like Kaggle competitions may not actually fit the bill. And so the more and more that you want to build a platform that is sufficiently general for your customer base, the more I think that these large general models really pack a lot of additional opportunity in. [00:11:21]

Alessio: With a lot of our companies, we talked about stuff that you used to have to extract features for, now you have out of the box. So say you're a travel company, you want to do a query, like show me all the hotels and places that are warm during spring break. It would be just literally like impossible to do before these models, you know? But now the model knows, okay, spring break is like usually these dates and like these locations are usually warm. So you get so much out of it for free. And in terms of Magic integrating into Hex, I think AI UX is one of our favorite topics and how do you actually make that seamless. In traditional code editors, the line of code is like kind of the atomic unit and HEX, you have the code, but then you have the cell also. [00:12:04]

Bryan: I think the first time I saw Copilot and really like fell in love with Copilot, I thought finally, fancy auto-complete. And that felt so good. It felt so elegant. It felt so right sized for the task. But as a data scientist, a lot of the work that you do previous to the ML engineering part of the house, you're working in these cells and these cells are atomic. They're expressing one idea. And so ultimately, if you want to make the transition from something like VS Code, where you've got a large amount of code and there's a large amount of files and they kind of need to have awareness of one another, and that's a long story and we can talk about that. But in this atomic, somewhat linear flow through the notebook, what you ultimately want to do is you want to reason with the agent at the level of these individual thoughts, these atomic ideas. Usually it's good practice in, say, a Jupyter notebook to not let your cells get too big. If your cell doesn't fit on one page, that's kind of a code smell, like why is it so damn big? What are you doing in this cell? That also lends some hints as to what the UI should feel like. I want to ask questions about this one atomic thing. So you ask the agent, take this data frame and strip out this prefix from all the strings in this column. That's an atomic task. It's probably about two lines of pandas. I can write it, but it's actually very natural to ask Magic to do that for me. And what I promise you is that it is faster to ask Magic to do that for me. At this point, that kind of code, I never write. And so then you ask the next question, which is what should the UI be to do chains, to do multiple cells that work together? Because ultimately a notebook is a chain of cells and actually it's a first class citizen for Hex. So we have a DAG and the DAG is the execution DAG for the individual cells. This is one of the reasons that Hex is reactive and kind of dynamic in that way. And so the very next question is, what is the AI UI for these collections of cells? And back in June and July, we thought really hard about what it feels like to ask Magic a question and get a short chain of cells back that execute on that task. And so we've thought a lot about how that breaks down into individual atomic units and how those are tied together. We introduced something which is kind of an internal name, but it's called the airlock. And the airlock is exactly a sequence of cells that refer to one another, understand one another, use things that are happening in other cells. And it gives you a chance to preview what Magic has generated for you. Then you can accept or reject it as an entire group. And that's one of the reasons we call it an airlock, because at any time you can eject the airlock and see it in the space. But to come back to your question about how the AI UX fits into this notebook, ultimately a notebook is very conversational in its structure. I've got a series of thoughts that I'm going to express as a series of cells. And sometimes if I'm a kind data scientist, I'll put some text in between them too, explaining what on earth I'm doing. And that feels, in my opinion, and I think this is quite shared amongst folks at Hex, like a really nice refinement of the chat UI. I've been saying for several months now, like, please stop building chat UIs. There is some irony because I think what the notebook allows is like chat plus plus. [00:15:36]
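
Editor's note: a minimal sketch of the two ideas above, a notebook as an execution DAG over cells and an "airlock" as a group of generated cells accepted or rejected as a unit. The class and method names are illustrative stand-ins, not Hex's actual internals.

```python
# Editor's sketch: cells form an execution DAG, and an "airlock" is a pending
# group of AI-generated cells accepted or rejected together.
from graphlib import TopologicalSorter


class Notebook:
    def __init__(self):
        self.cells = {}   # cell_id -> source
        self.deps = {}    # cell_id -> set of upstream cell_ids

    def add_cell(self, cell_id, source, deps=()):
        self.cells[cell_id] = source
        self.deps[cell_id] = set(deps)

    def execution_order(self):
        # Reactive execution follows the DAG, not the visual order of cells.
        return list(TopologicalSorter(self.deps).static_order())


class Airlock:
    """Holds a chain of generated cells for preview before they land in the notebook."""
    def __init__(self, generated):  # generated: list of (cell_id, source, deps)
        self.generated = generated

    def accept(self, notebook: Notebook):
        for cell_id, source, deps in self.generated:
            notebook.add_cell(cell_id, source, deps)

    def reject(self):
        self.generated = []


nb = Notebook()
nb.add_cell("load", "df = read('orders.csv')")
airlock = Airlock([
    ("clean", "df['sku'] = df['sku'].str.removeprefix('SKU-')", ["load"]),
    ("plot", "chart(df, x='date', y='sku')", ["clean"]),
])
airlock.accept(nb)
print(nb.execution_order())  # ['load', 'clean', 'plot']
```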

Alessio: Yeah, I think the first wave of everything was like chat with X. So it was like chat with your data, chat with your documents and all of this. But people want to code, you know, at the end of the day. And I think that goes into the end user. I think most people that use notebooks are software engineers, data scientists. I think the cool thing about these models is that people who are not traditionally technical can do a lot of very advanced things. And that's why people like Code Interpreter and ChatGPT. How do you think about the evolution of that persona? Do you see a lot of non-technical people also now coming to Hex to like collaborate with their technical folks? [00:16:13]

Bryan: Yeah, I would say there might even be more enthusiasm than we're prepared for. We're obviously like very excited to bring what we call the like low floor user into this world and give more people the opportunity to self-serve on their data. We wanted to start by focusing on users who are already familiar with Hex and really make magic fantastic for them. One of the sort of like internal, I would say almost North Stars is our team's charter is to make Hex feel more magical. That is true for all of our users, but that's easiest to do on users that are already able to use Hex in a great way. What we're hearing from some customers in particular is sort of like, I'm excited for some of my less technical stakeholders to get in there and start asking questions. And so that raises a lot of really deep questions. If you immediately enable self-service for data, which is almost like a joke over the last like maybe like eight years, if you immediately enabled self-service, what challenges does that bring with it? What risks does that bring with it? And so it has given us the opportunity to think about things like governance and to think about things like alignment with the data team and making sure that the data team has clear visibility into what the self-service looks like. Having been leading a data team, trying to provide answers for stakeholders and hearing that they really want to self-serve, a question that we often found ourselves asking is, what is the easiest way that we can keep them on the rails? What is the easiest way that we can set up the data warehouse and set up our tools such that they can ask and answer their own questions without coming away with like false answers? Because that is such a priority for data teams, it becomes an important focus of my team, which is, okay, magic may be an enabler. And if it is, what do we also have to respect? We recently introduced the data manager and the data manager is an auxiliary sort of like tool on the Hex platform to allow people to write more like relevant metadata about their data warehouse to make sure that magic has access to the best information. And there are some things coming to kind of even further that story around governance and understanding. [00:18:37]

Alessio: You know, you mentioned self-serve data, and how it was kind of a joke. You know, the whole rush to the modern data stack was something to behold. Do you think AI is in a similar space, where it's a bit of a gold rush? [00:18:51]

Bryan: I have like sort of two comments here. One I'll shamelessly steal from a friend, Adam Azzam from Prefect. He says that this is more of like an iron mine than a gold mine, in the sense that there is a lot of work to extract this precious, precious resource. And that's the first one: don't expect to just go down to the stream and do a little panning. There's a lot of work to be done. And frankly, the steps to go from this resource to something valuable are significant. I think people have gotten a little carried away with the old maxim of like, don't go pan for gold, sell pickaxes and shovels. It's a much stronger business model. At this point, I feel like I look around and I see more pickaxe salesmen and shovel salesmen than I do prospectors. And that scares me a little bit. There's a metagame where people are starting to think about how they can build tools for people building tools for AI. And that starts to give me a little bit of pause in terms of, how confident are we that we can even extract this resource into something valuable? I got a text message from a VC earlier today, and I won't name the VC or the fund, but the question was, what are some medium or large size companies that have integrated AI into their platform in a way that you're really impressed by? And I looked at the text message for a few minutes and I was finding myself thinking and thinking, and I responded, maybe only Copilot. It's been a couple hours now, and I don't think I've thought of another one. And I think that's where I reflect again on this, like iron versus gold. If it was really gold, I feel like I'd be more blown away by other AI integrations. And I'm not yet. [00:20:40]

Alessio: I feel like all the people finding gold are the ones building things that traditionally we didn't focus on. So like Midjourney. I talked to a company yesterday, which I'm not going to name, but they do agents for some use case, let's call it. They are 11 months old. They're making like 8 million a month in revenue, but in a space that you wouldn't even think about selling to. If you were like a shovel builder, you wouldn't even go sell to those people. And Swyx talks about this a bunch, about like actually trying to go application first for some things. Let's actually see what people want to use and what works. What do you think are the most maybe underexplored areas in AI? Is there anything that you wish people were actually trying to shovel? [00:21:23]

Bryan: I've been saying for a couple of months now, if I had unlimited resources and I was just truly, you know, on my own building whatever I wanted, I think the thing that I'd be most excited about is building sort of like the personal Memex. The Memex is something that I've wanted since I was a kid. And are you familiar with the Memex? It's the memory extender. And it's this idea that human memory is quite weak, and so if we can extend that, then that's a big opportunity. So I think one of the things that I've always found to be one of the limiting cases here is access. How do you access that data? Even if you did build that data out, how would you quickly access it? And I think there's a constellation of technologies that have come together in the last couple of years that now make this quite feasible. One, information retrieval has really improved and we have a lot more simple systems for getting started with it. Two, natural language is ultimately the interface that you'd really like these systems to work on, both in terms of structuring the data and preparing the data, but also on the retrieval side. So what keys off the query for retrieval? Probably ultimately natural language. And third, if you really want to go into the purely futuristic aspect of this, it is latent voice to text. And that is also something that has quite recently become possible. I did talk to a company recently called Gather, which seems to have some cool ideas in this direction, but I haven't seen yet what I really want, which is something where every time I listen to a podcast or I watch a movie or I read a book, it has a great vector index built on top of all the information that's contained within. And then when I'm having my next conversation and I can't quite remember the name of this person who did this amazing thing, for example, if we're talking about the Memex, it'd be really nice to have Vannevar Bush pop up on my, you know, on my Memex display, because I always forget Vannevar Bush's name. This is one time that I didn't, but I often do. This is something that I think is only recently enabled and maybe we're still five years out before it can be good, but I think it's one of the most exciting projects that has become possible in the last three years that I think generally wasn't possible before. [00:23:46]
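
Editor's note: a minimal sketch of the Memex pipeline described above, ingest everything, index it, and recall by natural-language query. The bag-of-words "embedding" is a toy stand-in so the sketch runs anywhere; a real build would use a speech-to-text model (e.g., Whisper) for ingestion and a learned embedding model for retrieval.

```python
# Editor's sketch of the Memex pipeline: ingest -> index -> query.
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Toy bag-of-words vector; swap in a real embedding model in practice.
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class Memex:
    def __init__(self):
        self.index = []  # (source, passage, vector)

    def ingest(self, source: str, transcript: str):
        for passage in transcript.split("\n"):
            if passage.strip():
                self.index.append((source, passage, embed(passage)))

    def recall(self, query: str, k: int = 3):
        q = embed(query)
        ranked = sorted(self.index, key=lambda item: cosine(q, item[2]), reverse=True)
        return [(src, passage) for src, passage, _ in ranked[:k]]


memex = Memex()
memex.ingest("podcast-042", "Vannevar Bush proposed the Memex, a memory extender.\nWe also discussed walking trails.")
print(memex.recall("who proposed the memory extender?", k=1))
```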

Alessio: Would you wear one of those AI pendants that record everything? [00:23:50]

Bryan: I think I'm just going to do it because I just like support the idea. I'm also admittedly someone who, when Google Glass first came out, thought that seems awesome. I know that there's like a lot of like challenges about the privacy aspect of it, but it is something that I did feel was like a disappointment to lose some of that technology. Fun fact, one of the early Google Glass developers was this MIT computer scientist who basically built the first wearable computer while he was at MIT. And he like took notes about all of his conversations in real time on his wearable and then he would have real time access to them. Ended up being kind of a scandal because he wanted to use a computer during his defense and they like tried to prevent him from doing it. So pretty interesting story. [00:24:35]

Alessio: I don't know, but the future is going to be weird. I can tell you that much. Talking about pickaxes, what do you think about the pickaxes that people built before? Like the whole MLOps space, which has its own startup graveyard in there. How are those products evolving? You know, you were at Weights and Biases before, which is now doing a big AI push as well. [00:24:57]

Bryan: If you really want to like sort of like rub my face in it, you can go look at my white paper on MLOps from 2022. It's interesting. I don't think there's many things in that that I would these days think are like wrong or even sort of like naive. But what I would say is there are both a lot of analogies between MLOps and LLMops, but there are also a lot of like key differences. So like leading an engineering team at the moment, I think a lot more about good engineering practices than I do about good ML practices. That being said, it's been very convenient to be able to see around corners in a few of the like ML places. One of the first things I did at Hex was work on evals. This was in February. I hadn't yet been overwhelmed by people talking about evals until about May. And the reason that I was able to be a couple of months early on that is because I've been building evals for ML systems for years. I don't know how else to build an ML system other than start with the evals. I teach my students at Rutgers like objective framing is one of the most important steps in starting a new data science project. If you can't clearly state what your objective function is and you can't clearly state how that relates to the problem framing, you've got no hope. And I think that is a very shared reality with LLM applications. Coming back to one thing you mentioned from earlier about sort of like the applications of these LLMs. To that end, I think what pickaxes I think are still very valuable is understanding systems that are inherently less predictable, that are inherently sort of experimental. On my engineering team, we have an experimentalist. So one of the AI engineers, his focus is experiments. That's something that you wouldn't normally expect to see on an engineering team. But it's important on an AI engineering team to have one person whose entire focus is just experimenting, trying, okay, this is a hypothesis that we have about how the model will behave. Or this is a hypothesis we have about how we can improve the model's performance on this. And then going in, running experiments, augmenting our evals to test it, et cetera. What I really respect are pickaxes that recognize the hybrid nature of the sort of engineering tasks. They are ultimately engineering tasks with a flavor of ML. And so when systems respect that, I tend to have a very high opinion. One thing that I was very, very aligned with Weights and Biases on is sort of composability. These systems like ML systems need to be extremely composable to make them much more iterative. If you don't build these systems in composable ways, then your integration hell is just magnified. When you're trying to iterate as fast as people need to be iterating these days, I think integration hell is a tax not worth paying. [00:27:51]

Alessio: Let's talk about some of the LLM native pickaxes, so to speak. So RAG is one. One thing is doing RAG on text data. Another thing is doing RAG on tabular data. We're releasing tomorrow our episode with Cube, the semantic layer company. Curious to hear your thoughts on it. How are you doing RAG, pros, cons? [00:28:11]

Bryan: It became pretty obvious to me almost immediately that RAG was going to be important. Because ultimately, you never expect your model to have access to all of the things necessary to respond to a user's request. So as an example, Magic users would like to write SQL that's relevant to their business. And it's important then to have the right data objects that they need to query. We can't expect any LLM to understand our user's data warehouse topology. So what we can expect is that we can build a RAG system that is data warehouse aware, data topology aware, and use that to provide really great information to the model. If you ask the model, how are my customers trending over time? And you ask it to write SQL to do that, what is it going to do? Well, ultimately, it's going to hallucinate the structure of the data warehouse that it needs to write a general query. Most likely what it's going to do is it's going to look in its sort of memory of Stack Overflow responses to customer queries, and it's going to say, oh, it's probably a customers table, and we're in the age of dbt, so it might even be called, you know, dim_customers or something like that. And what's interesting is, and I encourage you to try, ChatGPT will do an okay job of hallucinating up some tables. It might even hallucinate up some columns. But what it won't do is it won't understand the joins in that data warehouse that it needs, and it won't understand the data caveats or the sort of where clauses that need to be there. And so how do you get it to understand those things? Well, this is textbook RAG. This is the exact kind of thing that you expect RAG to be good at augmenting. But I think people who have done a lot of thinking about RAG for the document case, they think of it as chunking and sort of like the MapReduce and these approaches. But I think people haven't followed this train of thought quite far enough yet. Jerry Liu was on the show and he talked a little bit about thinking of this as information retrieval. And I would push that even further. And I would say that ultimately RAG is just RecSys for LLMs. As I kind of already mentioned, I'm a little bit recommendation systems heavy. And so from the beginning, RAG has always felt like RecSys to me. It has always felt like you're building a recommendation system. And what are you trying to recommend? The best possible resources for the LLM to execute on a task. And so most of my approach to RAG and the way that we've improved Magic via retrieval is by building a recommendation system. [00:30:49]
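
Editor's note: a minimal sketch of warehouse-aware retrieval as described above, recommend the few table schemas (with join hints and caveats) most relevant to the question, then build the prompt from only those. The catalog entries, table names, and the overlap score are illustrative, not Hex's actual retrieval.

```python
# Editor's sketch: retrieve warehouse metadata relevant to the question, then
# inject it into the SQL-writing prompt. Names and scoring are illustrative.
import re

CATALOG = [
    {"table": "dim_customers", "doc": "customer dimension: customer_id, signup_date, region",
     "caveat": "exclude rows where is_test = true"},
    {"table": "fct_orders", "doc": "order facts: order_id, customer_id, order_date, revenue",
     "caveat": "join to dim_customers on customer_id"},
    {"table": "dim_products", "doc": "product dimension: product_id, category, price",
     "caveat": ""},
]


def tokens(text: str) -> set:
    return set(re.findall(r"[a-z_]+", text.lower()))


def retrieve_schemas(question: str, k: int = 2):
    q = tokens(question)
    scored = sorted(CATALOG, key=lambda t: len(q & tokens(t["table"] + " " + t["doc"])), reverse=True)
    return scored[:k]


def build_prompt(question: str) -> str:
    context = "\n".join(f"- {t['table']}: {t['doc']}. Caveat: {t['caveat'] or 'none'}"
                        for t in retrieve_schemas(question))
    return f"Relevant tables:\n{context}\n\nWrite SQL to answer: {question}"


print(build_prompt("How are my customers trending over time by revenue?"))
```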

Alessio: It's funny, as you mentioned that you spent three years writing the book, the O'Reilly book. Things must have changed as you wrote the book. I don't want to bring out any nightmares from there, but what are the tips for people who want to stay on top of this stuff? Do you have any other favorite newsletters, like Twitter accounts that you follow, communities you spend time in? [00:31:10]

Bryan: I am sort of an aggressive reader of technical books. I think I'm almost never disappointed by time that I've invested in reading technical manuscripts. I find that most people write O'Reilly or similar books because they've got this itch that they need to scratch, which is: I have some ideas, I have some understanding that was hard won, I need to tell other people. And there's something that, from my experience, correlates between that itch and useful information. As an example, one of the people on my team, his name is Will Kurt, he wrote a book called Bayesian Statistics the Fun Way. I knew some Bayesian statistics, but I read his book anyway. And the reason was because I was like, if someone feels motivated to write a book called Bayesian Statistics the Fun Way, they've got something to say about Bayesian statistics. I learned so much from that book. That book is technically targeted at someone with less knowledge and experience than me. And boy, did it humble me about my understanding of Bayesian statistics. And so I think this is a very boring answer, but ultimately I read a lot of books and I think that they're a really valuable way to learn these things. I also regrettably still read a lot of Twitter. There is plenty of noise in that signal, but ultimately it is still usually one of the first directions to get an instinct for what's valuable. The other comment that I want to make is we are in this age where arXiv is becoming more of an ad platform. I think that makes it a little challenging right now to use it the way that I used to use it, which is for higher signal. I've chatted a lot with a CMU professor, Graham Neubig, and he's been doing LLM evaluation and LLM enhancements for about five years, and no, I didn't misspeak. And I think talking to him has provided me a lot of directionality for more believable sources. Trying to cut through the hype. I know that there's a lot of other things that I could mention in terms of just channels, but ultimately right now I think there's almost an abundance of channels and I'm a little bit more keen on high signal. [00:33:18]

Alessio: The other side of it is like, I see so many people say, Oh, I just wrote a paper on X and it's like an article. And I'm like, an article is not a paper, but it's just funny how I know we were kind of chatting before about terms being reinvented and like people that are not from this space kind of getting into AI engineering now. [00:33:36]

Bryan: I also don't want to be gatekeepy. Actually, I used to say a lot to people, don't be shy about putting your ideas down on paper. I think it's okay to just kind of go for it. And I myself have something on arXiv that is comically naive. It's intentionally naive. Right now I'm less concerned by more naive approaches to things than I am by the purely advertising approach to writing these short notes and articles. I think blogging still has a good place. And I remember getting feedback on my PhD thesis that it sounded more like a long blog post. And I now feel like that curmudgeonly professor who's also like, yeah, maybe just keep this to the blogs. That's funny.

Alessio: Uh, yeah, I think one of the things that Swyx said when he was opening the AI engineer summit a couple of weeks ago was like, look, most people here don't know much about the space because it's so new and like being open and welcoming. I think it's one of the goals. And that's why we try and keep every episode at a level that it's like, you know, the experts can understand and learn something, but also the novices can kind of like follow along. You mentioned evals before. I think that's one of the hottest topics obviously out there right now. What are evals? How do we know if they work? Yeah. What are some of the fun learnings from building them into X? [00:34:53]

Bryan: I said something at the AI engineer summit that I think a few people have already called out, which is like, if you can't get your evals to be sort of like objective, then you're not trying hard enough. I stand by that statement. I'm not going to, I'm not going to walk it back. I know that that doesn't feel super good because people, people want to think that like their unique snowflake of a problem is too nuanced. But I think this is actually one area where, you know, in this dichotomy of like, who can do AI engineering? And the answer is kind of everybody. Software engineering can become AI engineering and ML engineering can become AI engineering. One thing that I think the more data science minded folk have an advantage here is we've gotten more practice in taking very vague notions and trying to put a like objective function around that. And so ultimately I would just encourage everybody who wants to build evals, just work incredibly hard on codifying what is good and bad in terms of these objective metrics. As far as like how you go about turning those into evals, I think it's kind of like sweat equity. Unfortunately, I told the CEO of gantry several months ago, I think it's been like six months now that I was sort of like looking at every single internal Hex request to magic by hand with my eyes and sort of like thinking, how can I turn this into an eval? Is there a way that I can take this real request during this dog foodie, not very developed stage? How can I make that into an evaluation? That was a lot of sweat equity that I put in a lot of like boring evenings, but I do think ultimately it gave me a lot of understanding for the way that the model was misbehaving. Another thing is how can you start to understand these misbehaviors as like auxiliary evaluation metrics? So there's not just one evaluation that you want to do for every request. It's easy to say like, did this work? Did this not work? Did the response satisfy the task? But there's a lot of other metrics that you can pull off these questions. And so like, let me give you an example. If it writes SQL that doesn't reference a table in the database that it's supposed to be querying against, we would think of that as a hallucination. You could separately consider, is it a hallucination as a valuable metric? You could separately consider, does it get the right answer? The right answer is this sort of like all in one shot, like evaluation that I think people jump to. But these intermediary steps are really important. I remember hearing that GitHub had thousands of lines of post-processing code around Copilot to make sure that their responses were sort of correct or in the right place. And that kind of sort of defensive programming against bad responses is the kind of thing that you can build by looking at many different types of evaluation metrics. Because you can say like, oh, you know, the Copilot completion here is mostly right, but it doesn't close the brace. Well, that's the thing you can check for. Or, oh, this completion is quite good, but it defines a variable that was like already defined in the file. Like that's going to have a problem. That's an evaluation that you could check separately. And so this is where I think it's easy to convince yourself that all that matters is does it get the right answer? But the more that you think about production use cases of these things, the more you find a lot of this kind of stuff. 
One simple example is that sometimes the model names the output of a cell with a variable that's already in scope. Okay, we can just detect that and we can just fix that. And this is the kind of thing where, as you build these evaluations over time, you really can expand the robustness with which you trust these models. And for a company like Hex, where we need to put this stuff in GA, we can't just get to demo stage or even private beta stage. We're really hunting GA on all of these capabilities. "Did it get the right answer in some cases" is not good enough. [00:38:57]
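
Editor's note: a minimal sketch of the "auxiliary eval metrics" idea above, cheap checks you can score on every generation alongside end-to-end correctness, such as whether generated SQL only references real tables or whether generated Python shadows a variable already in scope. The checks are deliberately simplified (regex plus the ast module), not a production SQL parser and not Hex's actual eval suite.

```python
# Editor's sketch: auxiliary eval metrics scored on every generation.
import ast
import re


def sql_tables_exist(sql: str, known_tables: set) -> bool:
    """Flag 'hallucinated table' errors: every FROM/JOIN target must be a real table."""
    referenced = set(re.findall(r"\b(?:from|join)\s+([a-zA-Z_][\w.]*)", sql, flags=re.I))
    return referenced <= known_tables


def shadows_existing_name(code: str, names_in_scope: set) -> bool:
    """Flag generations that reassign a variable the notebook already defines."""
    assigned = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Assign):
            assigned |= {t.id for t in node.targets if isinstance(t, ast.Name)}
    return bool(assigned & names_in_scope)


def auxiliary_metrics(generation: dict, known_tables: set, names_in_scope: set) -> dict:
    return {
        "tables_exist": sql_tables_exist(generation.get("sql", ""), known_tables),
        "no_shadowing": not shadows_existing_name(generation.get("python", "pass"), names_in_scope),
    }


gen = {"sql": "SELECT * FROM dim_customer", "python": "df = df.head()"}
print(auxiliary_metrics(gen, known_tables={"dim_customers", "fct_orders"}, names_in_scope={"df"}))
# -> {'tables_exist': False, 'no_shadowing': False}
```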

Alessio: I think the follow up question to that is in your past roles, you own the model that you're evaluating against. Here you don't actually have control into how the model evolves. How do you think about the model will just need to improve or we'll use another model versus like we can build kind of like engineering post-processing on top of it. How do you make the choice? [00:39:19]

Bryan: So I want to say two things here. One, like Jerry Liu talked about a little bit in his episode, you don't always want to retrain the weights to serve certain use cases. RAG is another tool that you can use to kind of soft tune. I think that's right. And I want to go back to my favorite analogy here, which is recommendation systems. When you build a recommendation system, you build the objective function. You think about what kind of recs you want to provide, what kind of features you're allowed to use, et cetera, et cetera. But there's always another step. There's this really wonderful collection of blog posts from Eugene Yan, and then ultimately Even Oldridge kind of iterated on that for the Merlin project, where there's this multi-stage recommender. And the multi-stage recommender says the first step is to do great retrieval. Once you've done great retrieval, you then need to do great ranking. Once you've done great ranking, you need to then do a good job serving. And so what's the analogy here? RAG is retrieval. You can build different embedding models to encode different features in your latent space to ensure that your ranking model has the best opportunity. Now you might say, oh, well, my ranking model is something that I've got a lot of capability to adjust. I've got full access to my ranking model. I'm going to retrain it. And that's great. And you should. And over time you will. But there's one more step and that's downstream and that's the serving. Serving often sounds like I just show the s**t to the user, but ultimately serving is things like, did I provide diverse recommendations? Going back to Stitch Fix days, I can't just recommend them five shirts of the same silhouette and cut. I need to serve them a diversity of recommendations. Have I respected their requirements? They clicked on something that got them to this place. Are the recommendations relevant to that query? Are there any hard rules? Do we maybe not have this in stock? These are all things that you put downstream. And so much like the recommendations use case, there's a lot of knobs to pull outside of retraining the model. And even in recommendation systems, when do you retrain your model for ranking? Not nearly as much as you do other s**t. And even this embedding model, you might fiddle with more often than the true ranking model. And so I think the only piece of the puzzle that you don't have access to in the LLM case is that sort of middle step. That's okay. We've got plenty of other work to do. So right now I feel pretty enabled. [00:41:56]
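
Editor's note: a minimal sketch of the retrieve → rank → serve framing applied to RAG context selection, a cheap retrieval pass, a ranking step standing in for the part you can't retrain, and a serving step that applies hard rules and diversity before anything hits the prompt. The scores, rules, and corpus are toy placeholders for illustration.

```python
# Editor's sketch of the retrieve -> rank -> serve framing for RAG context.
def retrieve(query: str, corpus: list) -> list:
    """Stage 1: cheap recall -- grab anything with lexical overlap."""
    q = set(query.lower().split())
    return [doc for doc in corpus if q & set(doc["text"].lower().split())]


def rank(query: str, candidates: list) -> list:
    """Stage 2: ranking -- stands in for the step you may not be able to retrain."""
    q = set(query.lower().split())
    return sorted(candidates, key=lambda d: len(q & set(d["text"].lower().split())), reverse=True)


def serve(ranked: list, k: int = 2) -> list:
    """Stage 3: serving -- hard rules, dedupe, and diversity before the prompt is built."""
    seen_sources, final = set(), []
    for doc in ranked:
        if doc.get("stale"):               # hard rule: never serve stale snippets
            continue
        if doc["source"] in seen_sources:  # diversity: at most one snippet per source
            continue
        seen_sources.add(doc["source"])
        final.append(doc)
        if len(final) == k:
            break
    return final


corpus = [
    {"source": "wiki", "text": "revenue is recorded in fct_orders", "stale": False},
    {"source": "wiki", "text": "revenue adjustments live in fct_orders too", "stale": False},
    {"source": "runbook", "text": "legacy revenue table, do not use", "stale": True},
    {"source": "dbt", "text": "customer ids join dim_customers to fct_orders", "stale": False},
]
context = serve(rank("customer revenue over time", retrieve("customer revenue over time", corpus)))
print([d["source"] for d in context])  # ['wiki', 'dbt']
```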

Alessio: That's great. You obviously wrote a book on RecSys. What are some of the key concepts that maybe people that don't have a data science background, ML background should keep in mind as they work in this area? [00:42:07]

Bryan: It's easy to first think these models are stochastic. They're unpredictable. Oh, well, what are we going to do? I think of this almost like gaseous type question of like, if you've got this entropy, where can you put the entropy? Where can you let it be entropic and where can you constrain it? And so what I want to say here is think about the cases where you need it to be really tightly constrained. So why are people so excited about function calling? Because function calling feels like a way to constrict it. Where can you let it be more gaseous? Well, maybe in the way that it talks about what it wants to do. Maybe for planning, if you're building agents and you want to do sort of something chain of thoughty. Well, that's a place where the entropy can happily live. When you're building applications of these models, I think it's really important as part of the problem framing to be super clear upfront. These are the things that can be entropic. These are the things that cannot be. These are the things that need to be super rigid and really, really aligned to a particular schema. We've had a lot of success in making specific the parts that need to be precise and tightly schemified, and that has really paid dividends. And so other analogies from data science that I think are very valuable is there's the sort of like human in the loop analogy, which has been around for quite a while. And I have gone on record a couple of times saying that like, I don't really love human in the loop. One of the things that I think we can learn from human in the loop is that the user is the best judge of what is good. And the user is pretty motivated to sort of like interact and give you kind of like additional nudges in the direction that you want. I think what I'd like to flip though, is instead of human in the loop, I'd like it to be AI in the loop. I'd rather center the user. I'd rather keep the user as the like core item at the center of this universe. And the AI is a tool. By switching that analogy a little bit, what it allows you to do is think about where are the places in which the user can reach for this as a tool, execute some task with this tool, and then go back to doing their workflow. It still gets this back and forth between things that computers are good at and things that humans are good at, which has been valuable in the human loop paradigm. But it allows us to be a little bit more, I would say, like the designers talk about like user-centered. And I think that's really powerful for AI applications. And it's one of the things that I've been trying really hard with Magic to make that feel like the workflow as the AI is right there. It's right where you're doing your work. It's ready for you anytime you need it. But ultimately you're in charge at all times and your workflow is what we care the most about. [00:44:56]
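
Editor's note: a minimal sketch of "choosing where the entropy lives," let the model be free-form in a reasoning field, but force the actionable part of the output into a rigid, validated schema (the shape that function calling gives you). The tool name, fields, and validator below are illustrative, not Hex's actual interface.

```python
# Editor's sketch: free-form text is allowed only in `reasoning`; the actionable
# fields must match a rigid schema before anything gets executed.
import json

EDIT_CELL_SCHEMA = {
    "reasoning": str,   # entropic: the model can ramble here
    "cell_type": str,   # constrained: must be one of a fixed set
    "source": str,      # constrained: the code that will actually run
}
ALLOWED_CELL_TYPES = {"sql", "python", "chart"}


def validate_tool_call(raw: str) -> dict:
    call = json.loads(raw)  # raises if the model didn't return JSON at all
    for key, expected_type in EDIT_CELL_SCHEMA.items():
        if key not in call or not isinstance(call[key], expected_type):
            raise ValueError(f"field {key!r} missing or not {expected_type.__name__}")
    if call["cell_type"] not in ALLOWED_CELL_TYPES:
        raise ValueError(f"cell_type must be one of {sorted(ALLOWED_CELL_TYPES)}")
    return call


model_output = json.dumps({
    "reasoning": "The user wants daily order counts, so I'll aggregate by date.",
    "cell_type": "sql",
    "source": "SELECT order_date, count(*) FROM fct_orders GROUP BY 1;",
})
print(validate_tool_call(model_output)["cell_type"])  # sql
```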

Alessio: Awesome. Let's jump into lightning round. What's something that is not on your LinkedIn that you're passionate about or, you know, what's something you would give a TED talk on that is not work related? [00:45:05]

Bryan: So I walk a lot. [00:45:07]

Bryan: I have walked every road in Berkeley. And I mean like every part of every road even, not just like the binary question of, have you been on this road? I have this little app that I use called Wanderer, which just lets me like kind of keep track of everywhere I've been. And so I'm like a little bit obsessed. My wife would say a lot a bit obsessed with like what I call new roads. I'm actually more motivated by trails even than roads, but like I'm a maximalist. So kind of like everything and anything. Yeah. Believe it or not, I was even like in the like local Berkeley paper just talking about walking every road. So yeah, that's something that I'm like surprisingly passionate about. [00:45:45]

Alessio: Is there a most underrated road in Berkeley? [00:45:49]

Bryan: What I would say is underrated is Kensington. So Kensington is like a little town just a teeny bit north of Berkeley, but still in the Berkeley hills. And Kensington is so quirky and beautiful. And it's really like, you know, don't sleep on Kensington. That being said, one of my original motivations for doing all this walking was people always tell me, Berkeley's so quirky. And I was like, how quirky is Berkeley? Turns out, it's quite, quite quirky. It's also hard to say quirky and Berkeley in the same sentence, I've learned as of now. [00:46:20]

Alessio: That's a, that's a good podcast warmup for our next guests. All right, the actual lightning round. So we usually have three questions: acceleration, exploration, then a takeaway. Acceleration: what's something that's already here today that you thought would take much longer to arrive in AI and machine learning? [00:46:39]

Bryan: So I invited the CEO of Hugging Face to my seminar when I worked at Stitch Fix and his talk at the time, honestly, like really annoyed me. The talk was titled like something to the effect of like LLMs are going to be the like technology advancement of the next decade. It's on YouTube. You can find it. I don't remember exactly the title, but regardless, it was something like LLMs for the next decade. And I was like, okay, they're like one modality of model, like whatever. His talk was fine. Like, I don't think it was like particularly amazing or particularly poor, but what I will say is damn, he was right. Like I, I don't think I quite was on board during that talk where I was like, ah, maybe, you know, like there's a lot of other modalities that are like moving pretty quick. I thought things like RL were going to be the like real like breakout success. And there's a little pun with Atari and breakout there, but yeah, like I, man, I was sleeping on LLMs and I feel a little embarrassed. I, yeah. [00:47:44]

Alessio: Yeah. No, I mean, that's a good point. It's like, sometimes... we just had Jeremy Howard on the podcast and he was saying that when he was talking about fine tuning, everybody thought it was dumb, you know, and then later people realized. And there's something to be said about messaging, especially in technical audiences, where there's kind of a metagame, which is like, oh, these are the cool ideas people are exploring, I don't know where I want to align myself yet, or whatnot. So that's cool. Exploration: so it's kind of the opposite of that. You mentioned RL, right? That's something that was kind of up and up and up, and now people are like, oh, I don't know. Are there any other areas, if you weren't working on Magic, that you'd want to go work on? [00:48:25]

Bryan: Well, I did mention that I think this Memex product is just incredibly exciting to me. And I think it's really opportune. I think it's very, very feasible, but I would maybe even extend that a little bit, which is I don't see enough people getting really enthusiastic about hardware with advanced AI built in. You're hearing whisperings of it here and there, pun on Whisper intended, but you're starting to see people putting Whisper into pieces of hardware and making that really powerful. I joked with, I can't think of her name. Oh, Sasha, who I know is a friend of the pod. I joked with Sasha that I wanted to make the Big Mouth Billy Bass as a Babel fish, because at this point it's pretty easy to connect that up to Whisper and talk to it in one language and have it talk back in the other language. And I was like, this is the kind of s**t I want people building: silly integrations between hardware and these new capabilities. And as much as I'm starting to hear whisperings here and there, it's not enough. I think I want to see more people going down this track because I think ultimately these things need to be in our physical space. And even though the margins are good on software, I want to see more integration into my daily life. Awesome. [00:49:47]

Alessio: And then, yeah, a takeaway: what's one message or idea you want everyone to remember and think about? [00:49:54]

Bryan: Even though earlier I was talking about maybe not reinventing things and being respectful of the sort of ML and data science ideas, I do want to say that I think everybody should be experimenting with these tools as much as they possibly can. I've heard a lot of professors, frankly, express concern about their students using GPT to do their homework. And I took a completely opposite approach, which is, in the first 15 minutes of the first class of my semester this year, I brought up GPT on screen and we talked about what GPT was good at. And we talked about how the students can use it. I showed them an example of it doing data analysis work quite well. And then I showed them an example of it doing quite poorly. I think however much you're integrating with these tools or interacting with these tools, and this audience is probably going to be pretty high on that distribution, I would really encourage you to push this to the other people in your life. My wife is very technical. She's a product manager and she's using ChatGPT almost every day for communication or for understanding concepts that are outside of her sphere of excellence. And recently my mom and my sister have been onboarded onto the ChatGPT train. And so ultimately I just think that it is our duty to help other people see how much of a paradigm shift this is. We should really be preparing people for what life is going to be like when these are everywhere. [00:51:25]

Alessio: Awesome. Thank you so much for coming on, Bryan. This was fun. [00:51:29]

Bryan: Yeah. Thanks for having me. And use Hex magic. [00:51:31]



Get full access to Latent Space at www.latent.space/subscribe
2023-11-29
Link to episode

The State of Silicon and the GPU Poors - with Dylan Patel of SemiAnalysis

This episode came together at ~4 hrs notice since Dylan had just landed in SF and we had to set up quickly; you might notice some small audio issues in some segments, we apologize. We're currently building our own podcast studio for 2024!

We're ramping up our presence on Twitter and YouTube if you'd like to support us.

Note: 17k people joined our emergency pod on Sam Altman's ouster today.

If Charles Dickens were alive in 2024, A Tale of Two Cities might be about the divide between the "GPU poor" and the "GPU rich".

We mentioned these terms in some of our previous episodes; they were originally coined by Dylan of SemiAnalysis in his "Gemini Eats the World" post, put on blast by Sam Altman. SemiAnalysis is one of the most in-depth research and consulting firms in the semis world, and has unique insight into the design, production, and supply chain of GPUs based on their ground presence in Asia.

In this episode we break down the State of Silicon: when are more GPUs coming? Are there real GPU alternatives on the way? Should Microsoft buy AMD chips just to scare Jensen? Is there a "GPU poor is beautiful" manifesto?

The supply wave is coming

The GPU shortage is the talk of the town in the Bay Area, but next year looks a lot better in terms of AI accelerator capacity:

* NVIDIA is forecasted to sell over 3 million GPUs next year, about 3x their 2023 sales of about 1 million H100s.

* AMD is forecasting $2B of sales for their new MI300X datacenter GPU. They are also indirectly getting a boost from the work that companies like Modular and tiny are doing in making it easier to actually use these chips (will ROCm ever catch up?)

* Google's TPUv5 supply is going to increase rapidly going into 2024

* Microsoft just announced Maia 100, a new AI accelerator built "with feedback" from OpenAI.

In the episode we dove deeper into what this means for each of these companies and for GPU consumers, but the TL;DR (sadly) is that while capacity increases, the FLOPs required to train the next generation of models will eclipse those of previous generations.

GPT-3 took ~4,000x more FLOPs to train than GPT-2. Dylan estimates GPT-4 was trained on 20,000 A100s for ~$500M all-in; how much will OpenAI spend to train GPT-5? How many GPUs will need to go brrr? In the meantime, the number of companies looking for GPUs has increased, with Meta rising as one of the de facto top 3 AI labs in terms of capacity. The pressure to acquire more chips will not ease in 2024.
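
For reference, here is the back-of-the-envelope arithmetic behind that all-in estimate, a hedged sketch using the inputs quoted above (20,000 A100s, roughly three months, on the order of $1-2 per GPU-hour); the point is simply that raw pre-training compute lands in the tens of millions of dollars, well under the ~$500M all-in figure that also covers data, people, and failed runs.

```python
# Back-of-the-envelope on the numbers quoted above (editor's sketch, same inputs):
gpus = 20_000                    # A100s
days = 90                        # ~3 months of pre-training
rate_per_gpu_hour = (1.0, 2.0)   # rough $/GPU-hour range (assumption)

gpu_hours = gpus * days * 24     # 43,200,000 GPU-hours
low, high = (gpu_hours * r for r in rate_per_gpu_hour)
print(f"{gpu_hours:,} GPU-hours -> ${low/1e6:.0f}M to ${high/1e6:.0f}M of raw compute")
# 43,200,000 GPU-hours -> $43M to $86M of raw compute, i.e. well under $500M all-in
```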

We also talked about some of the companies trying to displace traditional GPU architectures: MatX, Lemurian Labs, Cerebras, etc. The variables they compete on include SRAM vs HBM size, memory bandwidth vs memory capacity, different math representations for kernels, and so on; the key question for this market is whether the transformer architecture will still be #1 in the future.

Surviving in the GPU Poor lane

A lot of the smaller companies (when compared to $1T+ giants, it's all relative) are trying hard to fight against the GPU rich, but they can't quite offer the same scale:

* HuggingFace is trying to launch a training cluster as a service, but it seems to just be a software wrapper around NVIDIA's DGX Cloud, as they don't actually own that much GPU supply. The largest GPU count you can request on their form is 1,000.

* Databricks' "GPU-enabled clusters" run on AWS, and the largest one listed there is only powered by 8 NVIDIA A10Gs. The Mosaic team is also doing research on running on AMD cards with some promising results, but they seem to be pushing up to just 128 cards, which isn't much.

* Together actually has 4,424 H100s live in production, which is quite sizable but still nothing compared to the 100,000 that Meta is putting online.

Take LLaMA2 as an example; the 70B model was trained on 2T tokens. Using the highest accelerator count on HuggingFace, it'd take ~43 days to train the model from scratch and it'd cost ~$2M. That doesn't include all the data and prep work. In the meantime, Zuck is probably burning tens of thousands of H100s to train LLaMA3, which will surely have much higher performance than whatever a GPU poor company can train in the same time span.
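
For the curious, here is roughly where an estimate like that comes from, a hedged sketch using the common ~6 × parameters × tokens FLOPs rule of thumb. The per-GPU throughput and utilization numbers below are assumptions, so treat this as an order-of-magnitude check rather than a reproduction of the exact figures above.

```python
# Editor's sketch: where a "~weeks on ~1,000 GPUs" estimate comes from.
params = 70e9                 # LLaMA-2 70B
tokens = 2e12                 # 2T training tokens
train_flops = 6 * params * tokens        # ~8.4e23 FLOPs

n_gpus = 1_000
peak_flops_per_gpu = 312e12   # assumed A100 BF16 peak
mfu = 0.40                    # assumed model FLOPs utilization

effective = n_gpus * peak_flops_per_gpu * mfu
days = train_flops / effective / 86_400
print(f"{train_flops:.2e} FLOPs -> roughly {days:.0f} days on {n_gpus} GPUs at {mfu:.0%} MFU")
# A100-class vs H100-class throughput assumptions bracket the ~43-day figure quoted above.
```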

The good news is that there's a ton of opportunity for the GPU poors to shine, especially around fine-tuning. Most of the open source models coming out are one-size-fits-all, and there's plenty of room for startups to take them and tailor them to their customers, or to specific tasks or use cases, to build vertical applications. The other area of improvement is data quality; Mistral showed how you can build a high-quality small model with fewer FLOPs by feeding it better data. The key to differentiation won't be GPUs, but tokens.

Show Notes

* SemiAnalysis

* Google Gemini Eats The World - Gemini Smashes GPT-4 By 5X, The GPU-Poors

* How Nvidia's CUDA Monopoly In Machine Learning Is Breaking - OpenAI Triton And PyTorch 2.0

* AMD MI300 - Taming The Hype - AI Performance, Volume Ramp, Customers, Cost, IO, Networking, Software

* @sama: incredible google got that semianalysis guy to publish their internal marketing/recruiting chart lol

* Mellanox

* MatX

* Lemurian Labs

* Cerebras

* For SRAM / HBM, see our FlashAttention episode

* Suggested readings:

* Moore's Law: The Life of Gordon Moore, Silicon Valley's Quiet Revolutionary

* Chip War by Chris Miller

Chapters

* Introduction [00:00:00]

* Importance of infrastructure for tech companies [00:01:11]

* Training costs are irrelevant [00:03:06]

* Worldview of GPU-poor vs GPU-rich [00:04:01]

* Google's TPU infrastructure [00:08:12]

* Alternative hardware like Cerebras and Graphcore [00:17:37]

* Partnerships between labs and hardware companies [00:37:15]

* Apple's potential in AI [00:40:56]

* Concerns over China and Taiwan [00:41:02]

* Feasibility of rebuilding the semiconductor supply chain in the US [00:43:22]

* Foundational semiconductor readings [00:46:09]

* NVIDIA's pivot to AI [00:47:40]

* Dylan's writing process [00:48:17]

* Using multiple data centers for distributed AI training [00:52:36]

Transcript

Alessio: Hey, everyone. Welcome to the Latent Space Podcast. This is Alessio, partner and CTO in Residence at Decibel Partners. I'm joined by my co-host Swyx, founder of Smol AI. [00:00:16]

Swyx: And today we have Dylan Patel. Welcome. So you are the author of the extremely popular SemiAnalysis blog. We have both had a little bit of claim to fame in breaking details of GPT-4. George Hotz came on our pod and talked about the mixture of experts thing and then you had a lot more detail. [00:00:29]

Dylan: To be clear, I talked about mixture of experts in January, it's just people didn't really notice it. Yeah. I guess. [00:00:35]

Swyx: I don't know. You went into a lot more detail and I'd love to dig into some of that. [00:00:38]

Dylan: Yeah. Thank you so much. I've been doing consulting in the semiconductor industry since 2017. In 2021 I got bored, and in November I started writing a blog, and then through 2022 it was going well and I started hiring folks for my firm. And then all of a sudden 2023 happens and it's like the perfect intersection. I used to do data science, but not AI; multivariate regression is not really AI, right? But also I've been involved in the semiconductor industry for a long, long time, posting about it online since I was 12. Right. You know, all of a sudden this all kind of came to fruition. So it's cool to have the blog sort of blow up in that way. [00:01:11]

Swyx: I used to cover semis at Balyasny as well. And for a long time, it was just the mobile cycle. And then a little bit of PCs, but not that much. And then maybe some cloud stuff, you know, like public cloud semiconductor stuff. But it really wasn't anything until this wave. And I was actually listening to you on one of the previous podcasts that you've done. And it was surprising that high-performance computing also kind of didn't really take off. Like AI is just the first form of high-performance computing that worked. [00:01:37]

Dylan: One of the theses I've had for a long time, that I think people haven't really caught on to but is coming to fruition now, is that for the largest tech companies in the world, their software is important, but actually having and operating a very efficient infrastructure is incredibly important. And so, you know, people talk about, hey, Amazon is great, AWS is great because yes, it is easy to use and they've built all these things. But behind the scenes, they've done a lot on the infrastructure that is super custom, that Microsoft Azure and Google Cloud just don't even match in terms of efficiency. Think about the cost to rent out SSD space, the cost to offer a database service on top of that, obviously the cost to rent out a certain level of CPU performance. Amazon has a massive advantage there. And likewise, Google spent all this time doing that in AI, right, with their TPUs and infrastructure there and optical switches and all this sort of stuff. And so in the past, it wasn't immediately obvious. I think with AI, especially with how scaling laws are going, infrastructure is so much more important. And then when you just think about software cost, right, the cost structure of it, there was always a bigger component of R&D, and SaaS businesses, you know, all over SF, all these SaaS businesses did crazy good because, you know, they just scale as they grow and then all of a sudden they're so freaking profitable for each incremental new customer. And AI software looks like it's going to be very different, in my opinion, right? Like the R&D cost is much lower in terms of people, but the cost of goods sold in terms of actually operating the service, I think, will be much higher. And so in that same sense, infrastructure matters a ton. [00:03:02]

Swyx: And I think you wrote once that training costs effectively don't matter. [00:03:06]

Dylan: Yeah. In my opinion, I think that's a little bit spicy, but yeah, it's like training costs are irrelevant, right? Like GPT-4, right, like 20,000 A100s, that's, that's like, I know it sounds like a lot of money. The supercomputer, it's, it's, oh, it's slightly more, but yeah, I think the 500 million is a fair enough number. I mean, if you think about just the pre-training, right, three months, 20,000 A100s at, you know, a dollar an hour is like, that is way less than 500 million, right? But of course there's data and all this sort of stuff. [00:03:33]

Alessio: So people that are watching this on YouTube, they can see a GPU-poor and a GPU-rich hat on the table, which is inspired by your Google Gemini Eats the World blog post. So, did you know that this thing was going to blow up so much? Sam Altman even tweeted about it; he said, incredible Google got that semianalysis guy to publish their internal marketing/recruiting chart. And yeah, tell people who are the GPU-poors, who are the GPU-rich, like what's this framework that they should think about? [00:04:01]

Dylan: So it's, it's, you know, some of this work we've been doing for a while is just on infrastructure and like, hey, like when something happens, I think it's like a sort of competitive advantage of our firm, right, myself and my colleagues is we go from software all the way through to like low-level manufacturing, and it's like, who, you know, oh, Google's actually ramping up TPU production massively, right? And like, I think people in AI would be like, well, duh, but like, okay, like who, who has the capability of figuring out the number? Well, one, you could just get Google to tell you, but they don't, they won't tell you, right? That's like a very closely guarded secret. And most people that work at Google DeepMind don't even know that number, right? Two, you go through the supply chain and see what they've placed in order. Three is sort of like, well, who's actually winning from this? Hey, oh, Celestica's building these boxes. Wow. Oh, interesting. Oh, Google's involved in testing for them. Oh, okay. Oh, this company's providing design IP to them. Okay. That's very valuable in a monetary sense, but you know, you have to understand the whole technology stack. But on the flip side, right, is, well, why is Google building all these? What could they do with it? And what does that mean for the world? Especially in SF, right? Like, I'm sure you folks have been to parties. If we just brag about how many TPUs they have, like, it's happened to me multiple times where someone's just like, I'm just witnessing a conversation where somebody from Meta is bragging about how many TPUs they have versus someone from another firm that it's like, or like a startup person's like, dude, can you believe we just acquired, we have 512 H100s coming online in August. And it's like, oh, cool. Like, you know, going through the supply chain, it's like, dude, you realize there's 400,000 manufactured last quarter and like 530,000 this quarter being sold of H100s. And it's like, oh crap, that's a lot. That's a lot of GPUs. But then like, oh, how does that compare to Google? And like, there's one way to look at the world, which is just like, hey, scale is all you need. Like obviously data matters. Obviously all this stuff matters. I think any data set, a larger model will just do better. I think it's going to be more expensive, but it's going to do better. Okay, there's all these GPUs going into production. NVIDIA is going to sell well over 3 million total GPUs next year, over a million H100s this year alone. There's a lot of GPU capacity coming online. It's an incredible amount. And well, what are people doing? What are people working on? I think it's very important to like, just think about what are people working on, right? What actually are you building that's going to advance? What is monetizable? But what also makes sense? And so like, a lot of people were doing things that I felt counterproductive, right? In a world where in less than a year, there's going to be more than 4 million high-end GPUs out there. I mean, we can talk about the concentration of those GPUs, but if you're doing really valuable work as a good person, right, like you're contributing in some way, should you be focused on like, well, I don't have access to any of those 4 million GPUs, right? I actually only have access to gaming GPUs. Should I focus on like being able to fine tune a model on that, right? Like, no, it's not really that important. Or like, should I be focused on batch one inference on a cloud GPU? 
Like, no, that's like pointless. Like, why would you do batch size one inference on an H100? That's just like ridiculously dumb. There's a lot of counterproductive work. And at the same time, there's a lot of things that people should be doing. I mean, obviously, most people don't have resources, right? And I love the open source and I want the open source to win. And I hate the people who want to like, no, we're X lab and we think this is the only way you should do it. And if people don't do it this way, they should be regulated against it and all this kind of stuff. So I want the open source to win, right? Like companies like Mistral and like what Meta are doing, you know, Mosaic and all these folks together. All these people doing, you know, huge stuff with open source, you know, I want them to succeed. But it's like, there's certain things that are, you know, like hyper focusing on leaderboards on Hugging Face. No, like TruthfulQA is a garbage benchmark. Some of the models that are very high on there, if you use them for five seconds, you're like, this is garbage. There were things I wanted to say. Also, you know, we're in a world where compute matters a lot. Google is going to have more compute than any other company in the world, period. By like a large, large factor. It's just like framing it into that like mindset of like, hey, what are the counterproductive things? What do I think personally, or what have people told me that are involved in this, that they should focus on? The pace of acceleration from 2020 to 2022 is less than 2022 to 2024, you know. GPT-2 to GPT-4 is like 2020 to 2022, right? And that is less, I think, than from GPT-4 in 2022, which is when it was trained, right, to what OpenAI and Google and, and Anthropic do in 2025, right? Like I think the pace of acceleration is increasing and it's just good to like, think about, you know, that sort of stuff. [00:08:12]

Alessio: That makes sense. And the chart that Sam mentioned is about, yeah, Google TPU v5s completely overtaking OpenAI by orders of magnitude. Let's talk about the TPU a bit. We had Chris Lattner on the show, who, you know, used to work on TensorFlow at Google. And he did mention that the goal of Google is like make TPUs go fast with TensorFlow, but then he also had a post about PyTorch stealing the thunder. How do you see that changing now that a lot of the compute will be TPU based and Google wants to offer some of that to the public, not just use it internally? [00:08:44]

Dylan: And I think, you know, obviously with JAX and XLA and all that kind of stuff externally, like they've done a really good job. Wouldn't say like TPUs through PyTorch XLA is amazing, but it's, it's not bad, right? Like some of the numbers they've shown, some of the, you know, code they've shown for TPU v5e, which is not the TPU v5 that I was referring to in the sort of the post, the GPU-poor post, but TPU v5e is like the new one, but it's mostly, mostly an inference chip. It's a small chip. It's, it's about half the size of a TPU v5. That chip, you know, you can get very good performance on for Llama 70B inference. Very, very good performance when you're using PyTorch and XLA. Now of course you're going to get better if you go JAX XLA, but I think Google is doing a really good job after the restructuring of focusing on external customers too. They probably won't focus too much on TPU v5 for everyone externally, but v5e, they're also building a million of those, right? Hey, a lot of companies are using them, right? Or will be using them because it's going to be an incredibly cheap form of compute. I think the world of frameworks and all that, right? Like that's obviously something a researcher should talk about, not myself, but you know, the stats are clear that PyTorch is way, way dominating everything. But JAX is like doing well. Like there's external users of JAX. It shouldn't forever be that the person writing PyTorch-level code, right, that high up, should also be writing custom CUDA kernels, right? There should be, you know, different layers of abstraction where people hyper optimize and make it much easier for everyone to innovate on separate stacks, right? And then every once in a while, someone comes through and pierces through the layers of abstraction and innovates across multiple of them, or a group of people does. But I think frameworks are important. Compilers are important, right? Chris Lattner, what he's doing is really cool. I don't know if it'll work, but it's super cool and it certainly works on CPUs. We'll see about accelerators. Likewise, there's OpenAI's Triton, like what they're trying to do there. And you know, everyone's really coalescing around Triton, third-party hardware vendors too. There's Pallas. I don't want to mischaracterize it, but you can write in Pallas and it'll lower your code and it'll work on TPUs and GPUs, kind of like Triton, and there's a backend for Triton. I don't know exactly everything about it, but I think there's a lot of innovation happening on making things go faster, right? How do you make it go brrr? If every single person working in ML had to write like custom CUDA kernels always, it would be a travesty, right? Like that would just slow down productivity, but at the same time, you kind of have to. [00:10:53]

Swyx: By the way, I like to quantify things when you say make things go brrr. Is there a target range of MFU that you typically talk about? [00:10:59]

Dylan: Yeah, there's sort of two metrics that I like to think about a lot, right? So in training, everyone just talks about MFU, right? But then on inference, right, which I think is one, LLM inference will be bigger than training, or multimodal, whatever, but inference will be bigger than training, probably next year, in fact, at least in terms of GPUs deployed. And the other thing is like, you know, what's the bottleneck when you're running these models? The simple, stupid way to look at it is training is, you know, there's six flops, floating point operations, you have to do for every byte you read in, right? Every parameter you read in. So if it's FP8, then it's a byte, if it's FP16, it's two bytes, whatever, right? On training, but on the inference side, the ratio is completely different. It's two to one, right? There's two flops per parameter that you read in, and a parameter is maybe one byte, right? But then when you look at the GPUs, the GPUs have a very, very different ratio. The H100 has 3.35 terabytes a second of memory bandwidth, and it has a thousand teraflops of FP16, BF16, right? So that ratio is like, I'm sorry, I'm going to butcher the math here and people are going to think I'm dumb, but 256 to one, right, call it 256 to one if you're doing FP16. And the same applies to FP8, right, because anyways, per parameter read to number of floating point operations, right, if you quantize further, you also get double the performance at that lower quantization. That does not fit the hardware at all. So if you're just doing LLM inference at batch one, then you're always going to be under utilizing the flops. You're only paying for memory bandwidth. And the way hardware is developing, that ratio is actually only going to get worse. H200 will come out soon enough, which will help the ratio a little bit, improve memory bandwidth more than it improves flops, just like the A100 80 gig did versus the A100 40 gig. But then when the B100 comes out, the flops are going to increase more than memory bandwidth. And when future generations come out, and the same on the AMD side, right, MI300 versus MI400, as you move on generations, just due to fundamental like semiconductor scaling, DRAM memory is not scaling as fast as logic has been. And you can do a lot of interesting things on the architecture. So you're going to have this problem get worse and worse and worse. And so on training, it's very, you know, who cares, right? Because my flops are still my bottleneck most of the time. I mean, memory bandwidth is obviously a bottleneck, but like, well, you know, batch sizes are freaking crazy, right? Like people train with like 2 million batch sizes, it's trivial, right? Like that's what Llama, I think, did. Llama 70B was a 2 million batch size. And like, you talk to someone at one of the frontier labs and they're like, just 2 million, 2 million token batch size, right? That's crazy, or sequences, sorry. But when you go to the inference side, well, it's impossible to do a 2 million batch size. Also your latency would be horrendous if you tried to do something that crazy. So you kind of have this like differing problem where on training, everyone just kept talking MFU, model flop utilization, right? How many flops did I actually do, six times the number of parameters times tokens, basically, more or less, versus the quoted peak number, right? So if I have 312 teraflops out of my A100 and I was able to achieve 200, that's really good, right? You know, some people are achieving higher, right? Some people are achieving lower. 
That's a very important like metric to think about. Now you have like people thinking MFU is like a security risk, but on inference, MFU is not nearly as important, right? It's memory bandwidth utilization. You know, batch one is, you know, what memory bandwidth can I achieve, right? Because as I increase batch from batch size one to four to eight to even 256, right, which is sort of where the crossover happens on an H100 inference wise, it's flops limiting you more and more. But like you should have very high memory bandwidth utilization. So when people talk about A100s, like 60% MFU is decent. On H100s, it's more like 40, 45% because the flops increased more than the memory bandwidth. But people over time will probably get above 50% on H100, on MFU, on training. But on inference, it's not being talked about much, but MBU, model bandwidth utilization, is the important factor, right? Of my 3.35 terabytes a second of memory bandwidth on my H100, can I get two? Can I get three? Right? That's the important thing. And right now, if you look at everyone's inference stuff, I dogged on this in the GPU-poor thing, right? But it's like Hugging Face's libraries are actually very inefficient, like incredibly inefficient for inference. You get like 15% MBU on some configurations, like eight A100s and Llama 70B, you get like 15%, which is just like horrendous. Because at the end of the day, your latency is derived from what memory bandwidth you can effectively get, right? So if you're doing Llama 70 billion, 70 billion parameters, if you're doing it int8, okay, that's 70 gigabytes, gigabytes you need to read for every single inference, every single forward pass, plus the attention, but again, we're simplifying it. 70 gigabytes you need to read for every forward pass, and what is an acceptable latency for a user to have? I would argue 30 milliseconds per token. Some people would argue lower, right? But at the very least, you need to achieve human reading level speeds, and probably a little bit faster because we like to skim, to have a usable model for chatbot style applications. Now there's other applications, of course, but for chatbot style applications, you want it to be human reading speed. So 30 milliseconds per token is 33 tokens per second, times 70 gigabytes is, let's say, three times seven is 21, and then add two zeros, 2,100 gigabytes a second, right? That's what you need to achieve human reading speed on Llama 70B, right? So one, you can never achieve Llama 70B human reading speed, even if you had enough memory capacity for the model, on an A100, right? Or even an H100, right? Of course, you couldn't fit it anyway because it's 80 gigabytes versus 70 billion parameters, so you're kind of butting up against the limits already, 70 billion parameters being 70 gigabytes at int8 or fp8. You end up with, one, how do I achieve human reading level speeds, right? So if I go with two H100s, then now I have, you know, call it six point seven terabytes a second of memory bandwidth. If I achieve just 30 milliseconds per token, which is 33 tokens per second, that's 2.1 terabytes a second of effective memory bandwidth, so I'm only at like 30% bandwidth utilization. And I'm not using all my flops at batch one anyways, right? Because the flops that you're using there are tremendously low relative to what the hardware has, and I'm not actually getting a ton of tokens out on inference either. 
So with two H100s, if I only get 30 milliseconds a token, that's a really bad result. You should be striving to get, you know, upwards of 60%, and even 60% is kind of low, right? Like, I've heard people getting 70, 80% model bandwidth utilization. And then, you know, obviously you can increase your batch size from there and your model bandwidth utilization will start to fall as your flops utilization increases, but, you know, you have to pick the sweet spot for where you want to hit on the latency curve for your user. Obviously, as you increase batch size, you get more throughput per GPU, so that's more cost effective. There's a lot of like things to think about there, but I think those are sort of the two main things that people want to think about, and there's obviously a ton with regards to like networking and inter-GPU connection, because most of the useful models don't run on a single GPU. They can't run on a single GPU. [00:17:37]
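
To make the arithmetic in this stretch of the conversation concrete, here is a rough back-of-the-envelope sketch in Python using the approximate figures quoted above (an H100 at roughly 3.35 TB/s and roughly 1,000 dense FP16 TFLOPS, Llama 70B weights at int8, a 30 ms/token target). These are illustrative ballparks, not vendor specs or a claim about any particular serving stack.

```python
# Back-of-the-envelope MFU vs. MBU arithmetic, using the rough figures quoted above.

H100_BW = 3.35e12      # HBM bandwidth in bytes/s (~3.35 TB/s)
H100_FLOPS = 1.0e15    # ~1,000 TFLOPS dense FP16/BF16

params = 70e9          # Llama 70B
bytes_per_param = 1    # int8 / fp8 weights

# Batch-1 decoding is memory-bound: each generated token streams all weights once.
weight_bytes = params * bytes_per_param                 # ~70 GB per forward pass
best_ms_per_token = weight_bytes / H100_BW * 1e3        # lower bound at 100% MBU
print(f"1x H100 at perfect MBU: {best_ms_per_token:.1f} ms/token")   # ~20.9 ms

# Human-reading-speed target from the conversation: ~30 ms/token, i.e. ~33 tokens/s.
target_ms = 30
needed_bw = weight_bytes / (target_ms / 1e3)            # the rough "2.1 TB/s" above
print(f"effective bandwidth needed: {needed_bw / 1e12:.2f} TB/s")

# Two H100s tensor-parallel give ~6.7 TB/s; only hitting 30 ms/token implies low MBU.
mbu = needed_bw / (2 * H100_BW)
print(f"implied MBU on 2x H100 at 30 ms/token: {mbu:.0%}")           # roughly the "30%" above

# Where decoding flips from bandwidth-bound to flops-bound, assuming FP16 weights
# and FP16 flops: ~2 flops per parameter per token, weights read once per batch.
bytes_per_param_fp16 = 2
flops_per_param_per_token = 2
crossover_batch = (H100_FLOPS / H100_BW) * bytes_per_param_fp16 / flops_per_param_per_token
print(f"approx. crossover batch size: {crossover_batch:.0f}")        # ~300, near the "256" above
```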

Swyx: Is there a TPU equivalent of Mellanox? [00:17:39]

Dylan: The Google TPU is like super interesting because Google has been working with Broadcom, who's the number one networking company in the world, right? So Mellanox was nowhere close to number one. They had a niche that they were very good at, which was the network card, the card that you actually put in the server, but they didn't do much else. They weren't doing that successfully in switches, right? Which is, you know, you connect all the network cards to switches, and then the switches to all the servers. So Mellanox was not that great. I mean, it was good. They were doing good, and NVIDIA bought them, you know, in '19, I believe, or '18, but Broadcom has been number one in networking for a decade plus, and Google partnered with them on making the TPU, right? So Google does a lot of the design, especially on the ML hardware side, on how you pass stuff around internally on the chip, but Broadcom does a lot on the network side, right? Specifically, how to get really high connection speed between two chips, right? They've done a ton there, and obviously Google works a ton there too, but this is sort of Google's less discussed partnership that's truly critical for them, and Google's tried to get away from them many times. Their latest target to get away from Broadcom is 2027, right? But like, you know, that's four years from now. A chip design cycle is four years, so they already tried to get away in 2025, and that failed. They have this equivalent of very high speed networking. It works very differently than the way GPU networking does, and that's important for people who code on a lower level. [00:18:52]

Swyx: I've seen this described as the ultimate rate limit on how big models can go. It's not flops, it's not memory, it's networking. Like it has the lowest scaling laws, the lowest Moore's laws, and I don't know what to do about that because no one else has any solutions. [00:19:06]

Dylan: Yeah, yeah, so I think what you're referring to is that like network speed has increased much slower than the other two. Than flops, yeah, and memory bandwidth, yeah, yeah. And yeah, that's a tremendous problem in the industry, right? That's why NVIDIA bought a networking company, that's why Broadcom is working on Google's chip right now, and also of course on Meta's internal AI chip, which they're on the second generation of. And what's the main interesting thing that Meta's doing there? It's networking stuff, right? Multiplying tensors is kind of, you know, there's a lot of people who've made good matrix multiply units, right? But it's about like getting good utilization out of those and interfacing with the memory and interfacing with other chips really efficiently that makes designing these chips very hard. Most of the startups obviously have not done that really well. [00:19:46]

Alessio: I think the startup's point is the most interesting, right? You mentioned companies that are GPU poor, they raise a lot of money, and there's a lot of startups out there that are GPU poor and did not raise a lot of money. What should they do? How do you see like the space dividing? Are we just supposed to wait for like the big labs to do a lot of this work with a lot of the GPUs? What's like the GPU poor's beautiful version of the article? [00:20:12]

Dylan: OpenAI, who everyone would be like, oh yeah, they have more GPUs than anyone else, right? But they have a lot less flops than Google, right? That was the point of the thing, but not just them, it's like, okay, it's like a relative totem pole, right? And of course, Google doesn't use GPUs as much for training and inference, they do use some, but mostly TPUs. So kind of like, the whole point is that everyone is GPU poor because we're going to continue to scale faster and faster and faster and faster, and compute will always be a bottleneck, just like data will always be a bottleneck. You can have the best data set in the world and you can always have a better one. And same with, you have the biggest compute system in the world, but you'll always want a better one. And so it's like, there's things that like Mistral, they trained a fricking awesome model on relatively fewer GPUs, right? And now they're scaling up higher and higher and higher, right? There's a lot that the GPU poor can do though, right? We all have phones, we all have laptops, right? There is a world for running GPUs or models on device. The Replit folks are trying to do stuff like that. Their models, they can't follow scaling laws, right? Why? Because there's a fundamental limit to how much memory bandwidth and capacity you can get on a laptop or a phone. You know, I mentioned the ratio of flops to bandwidth on a GPU is actually really good compared to like a MacBook or like a phone. To run Llama 70 billion requires two terabytes a second of memory bandwidth, 2.1, at human reading speed. Yeah, but my phone has like 50 gigabytes a second. Your laptop, even if you have an M1 Ultra, has what, like, I don't remember, like a couple hundred gigabytes a second of memory bandwidth. You can't run Llama 70B just by doing the classical thing. So there's like, there's stuff like speculative decoding, you know, Together did something really cool. And they put it in the open source, of course, Medusa, right? Like things like that, that are, you know, they work on batch size one, they don't work on batch size, you know, high. And so there's like the world of like cloud inference. And so in the cloud, it's all about what memory bandwidth and MFU I can achieve. Whereas on the edge, I don't think Google is going to deploy a model that I can run on my laptop to help me with code or help me with, you know, X, Y, Z, they're always going to want to run it in a cloud for control. Or maybe they let it run on the device, but it's like only their Pixel phone, you know, it's kind of like a walled garden thing. There's obviously a lot of reasons to do other things for security, for openness, to not be at the whims of a trillion dollar plus company who wants my data, right? You know, there's a lot of stuff to be done there. And I think folks like Repl.it, they open source their model, right? Things like Together, who I just mentioned, right, developing Medusa, that didn't take much GPU at all, right? That's very, well, they do have quite a few GPUs, they made a big announcement about having 4,000 H100s, that's still relatively poor, right, when we're talking about hundreds of thousands at the big labs, like OpenAI, and so on and so forth, or millions of TPUs like Google, but still, they were able to develop Medusa with probably just one server, one server with eight GPUs in it. And the usefulness of something like Medusa, something like speculative decoding, is on device, right? 
And that's what like a lot of people can focus on, you know, people can focus on all sorts of things like that. I don't know, right? Like a new model architecture, right? Like, are we only going to use transformers? I'm pretty sold on the idea that transformers are it, right? But that's my hardware brain, you know, someone that loves hardware, right? People should continue to try and innovate on that, right? Like, you know, asynchronous training, right? Like that kind of stuff is like, super, super interesting. I think it's Tim Dettmers. He had like the- Dettmers? [00:23:09]

Swyx: The same guy as QLoRA. [00:23:10]

Dylan: Yes, he had the SWARM paper and Petals. That research is super cool. The universities will never have much compute, but like, hey, to be prepared to do things like that, you know, all these sorts of stuff, like they should try to build, you know, super large models. Like, you look at what Tsinghua University is doing in China, actually, they open sourced their model too, I think the largest, by parameter count at least, of the open source models. I mean, of course, they didn't train it on much data, but it's like, you know, you could do some cool stuff like that. I don't know. I think there's a lot that people can focus on. One, scaling out a service to many, many users. Distribution is very important. So figuring out distribution, right? Like figuring out useful fine tunes, right? Like doing LLMs that OpenAI will never make, sorry for the crassness, a porn DALL-E 3, right? Open source is doing crazy stuff with Stable Diffusion, right? Like, I don't know. Yeah, but it's like, there's a legitimate market. I think there's a couple of companies who make tens of millions of dollars of revenue from LLMs or diffusion models for porn, right? Or, or, you know, that kind of stuff. Like, I mean, there's a lot of stuff that people can work on that will be successful businesses, or it doesn't even have to be a business, but it can advance humanity tremendously. That doesn't require crazy scale. [00:24:10]
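
The phone-and-laptop bandwidth point a couple of turns above reduces to a simple ceiling calculation. Below is a small Python sketch using those ballpark figures; the device bandwidths are the rough values mentioned in the conversation, not measured specs for any particular device, and the formula ignores KV cache traffic.

```python
# Rough on-device decode-speed ceilings, using the ballpark bandwidth figures
# mentioned in the conversation rather than measured specs.

def max_tokens_per_sec(params_billion: float, bits_per_weight: int, mem_bw_gb_s: float) -> float:
    """Upper bound on batch-1 decode speed when each token must stream all weights once."""
    bytes_per_token = params_billion * 1e9 * bits_per_weight / 8
    return mem_bw_gb_s * 1e9 / bytes_per_token

devices = {"phone (~50 GB/s)": 50, "laptop (~200 GB/s)": 200, "H100 (~3,350 GB/s)": 3350}
for name, bw in devices.items():
    for bits in (8, 4):
        tps = max_tokens_per_sec(70, bits, bw)
        print(f"{name}, Llama 70B at {bits}-bit: <= {tps:.1f} tokens/s")

# Even at 4-bit, a ~50 GB/s phone tops out around ~1.4 tokens/s for a 70B model,
# which is why on-device work leans on much smaller models plus tricks like
# speculative decoding (e.g. Medusa) that amortize each weight pass over several tokens.
```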

Alessio: How do you think about the depreciation of like the hardware versus the models? If you buy an H100, sure, next year's is going to be better, but like at least the hardware is good. If you're spending a lot of money on like training a smaller model, it might be like super obsolete in like three months. And you've got now all this compute coming online. I'm just curious if like companies should actually spend the time to like, you know, fine tune them and like work on them when like the next generation is going to be out of the box so much better. [00:24:37]

Dylan: Unless you're fine tuning for on-device use, I think fine tuning current existing models, especially the smaller ones, is a useless waste of time, because the cost of inference is actually much cheaper than you think once you achieve good MBU and you batch at a decent size, which any successful business in the cloud is going to achieve, you know. And then two, fine tuning, like people are like, oh, you know, this 7 billion parameter model, if you fine tune it on a data set, is almost as good as 3.5, right? Why don't you fine tune 3.5 and look at your performance, right? And like, there's nothing open source that is anywhere close to 3.5 yet. There will be. People also don't quite grasp, Falcon was supposed to be, Falcon 180B. It's fewer parameters than 3.5. And also, I don't know about the exact token count, but I believe it. Do we know the parameters of 3.5? It's not 175 billion. People keep saying this. [00:25:25]

Swyx: No. Because we know 3, but we don't know 3.5. [00:25:27]

Dylan: 3.5. [00:25:28]

Swyx: It's definitely smaller. [00:25:29]

Dylan: No, it's bigger than 175. I think it's sparse. MoE. I'm pretty sure. And yeah, you can, you can do some gauging around the size of it by looking at their inference latency. Well, what's the theoretical bandwidth if they're running it on this hardware and doing tensor parallel in this way? So they have this much memory bandwidth, and maybe they get, maybe they're awesome and they get 90% memory bandwidth utilization. I don't know. That's an upper bound, and you can see the latency that 3.5 gives you, especially at like off-peak hours, or if you do fine tuning and you have a private enclave, they'll, like, Azure will quote you latency. So you can, you can figure out how many parameters per forward pass, which I think is somewhere in the like 40 to 50 billion range, but I could be very wrong. That's just like my guess based on that sort of stuff. You know, 50 ish. And actually I think open source will have models of that quality. I mean, I assume Mosaic or like Meta will open source, and Mistral will be able to open source, models of that quality. And furthermore, right? Like if you just look at the amount of compute, obviously data is very important, and the ability, all these tricks and dials that you turn to be able to get good MFU and good MBU, right, depending on inference or training, there's a ton of tricks. But at the end of the day, there's like 10 companies that have enough compute in one single data center to be able to beat GPT-4, right? Like straight up, like if not today, within the next six months, right? 4,000 H100s is not enough, I think you need about 7,000 maybe. And with some algorithmic improvements that have happened since GPT-4 and some data quality improvements probably, like you could probably get to even like less than 7,000 H100s running for three months to beat GPT-4. Of course, that's going to take a really awesome team, but there's quite a few companies that are going to have that many, right? Open source will match GPT-4, but then it's like, what about GPT-4 Vision? Or what about, you know, 5 and 6 and all these kinds of stuff, and like interactive tool use and DALL-E and like, that's the other thing is like, there's a lot of stuff on tool use that the open source could also do, that GPT-4 can do. I think there are some folks that are doing that kind of stuff, agents and all that kind of stuff. I don't know. That's way over my head, the agent stuff. [00:27:24]
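
The bounding exercise Dylan describes here can be written down in a few lines. In the sketch below, every input (the observed latency, the GPU count, the assumed MBU, the precision) is a hypothetical placeholder chosen only to show the shape of the calculation, not a claim about what OpenAI or Azure actually run.

```python
# Sketch of the latency-bounding argument: given an observed decode latency and
# assumptions about the serving hardware, its bandwidth, and an optimistic MBU,
# you get an upper bound on the bytes (and hence active parameters) streamed per
# forward pass. All inputs below are hypothetical, for illustration only.

def max_active_params_billion(ms_per_token: float, total_bw_tb_s: float,
                              mbu: float, bytes_per_param: int) -> float:
    """Upper bound on active parameters (in billions) read per forward pass."""
    effective_bytes_per_sec = total_bw_tb_s * 1e12 * mbu
    bytes_per_token = effective_bytes_per_sec * (ms_per_token / 1e3)
    return bytes_per_token / bytes_per_param / 1e9

# Hypothetical observation: ~20 ms/token served on 8 GPUs at ~3.35 TB/s each,
# generously assuming 90% MBU and fp16 weights.
bound = max_active_params_billion(ms_per_token=20, total_bw_tb_s=8 * 3.35,
                                  mbu=0.9, bytes_per_param=2)
print(f"active parameters per forward pass <= ~{bound:.0f}B under these assumptions")
# This is a loose upper bound; tighter reasoning about tensor-parallel layout,
# off-peak latency, and batching is what narrows an estimate down.
```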

Swyx: Yeah, it's over everyone's head. One more question on this sort of Gemini GPU rich essay. We've had a very wide ranging conversation already, so it's hard to categorize, but I tried to look for the Meena Eats the World document. Oh, it's not public. [00:27:36]

Dylan: No, no, no, no, no, no. You've read it? Yeah, I read it. So Noam Shazeer is like, I don't know, I think he's like- The GOAT. The GOAT. Yeah, I think he's the GOAT. [00:27:46]

Swyx: In one year, he published like Switch Transformers, like, Attention Is All You Need obviously, but he also did the speculative decoding stuff. [00:27:53]

Dylan: Yeah, exactly. It's like, it's like all this stuff that we were talking about today was like, you know, and obviously there's other people that are awesome that were, you know, helping and all that sort of stuff. Meena Eats the World was basically, he wrote an internal document around the time where Google had Meena, right? But it was like, he wrote it and he was like, basically predicting everything that's happening now, which is that like large language models are going to eat the world, right? In terms of, you know, compute and he's like the total amount of deployed flops within Google data centers will be dominated by large language models. Back then, a lot of people thought he was like silly for that, right? Like internally at Google. But you know, now if you look at it, it's like, oh wait, millions of TPUs. You're right. You're right. You're right. Okay. We're totally getting dominated by like both, you know, Gemini training and inference, right? Like, you know, total flops being dominated by LLMs is completely right. [00:28:36]

Swyx: So my question was, he had a bunch of predictions in there. Do you think there were any like underrated predictions that may not have yet have come true? Was he wrong on anything? [00:28:44]

Dylan: Meena sucked, right? If you look at the total flops, right? You know, parameters times tokens times six, it's like a tiny, tiny fraction of GPT-3, which came out just a few months later, which was like, okay, so he was right about everything, but like, maybe he knew about GPT-3. I have no clue. OpenAI clearly was like way ahead of Google on LLM scaling. Even then, people didn't really recognize it back in GPT-2 days, maybe. The number of people that recognized it was maybe hundreds, tens. [00:29:10]

Alessio: So we talked about transformer alternatives. The other thing is GPU alternatives. The CPU is obviously one, but there's Cerebras, there's Graphcore, there's MatX, Lemurian Labs, there's a lot of them. Thoughts on what's real, who's alive, who's kind of like a zombie company walking. [00:29:27]

Dylan: You know, I mentioned like transformers were the architecture that won out, but I think, you know, the number of people who recognized that in 2020 was, you know, as you mentioned, probably hundreds, right? For natural language processing, maybe in 2019 at least, right? You think about a chip design cycle, it's like years, right? So it's kind of hard to bet your architecture on the type of model that develops. But what's interesting about all the first wave AI hardware startups, you know, there's a ratio of memory capacity, compute, and memory bandwidth, right? Everyone kind of made the same bet, which is, I have a lot of memory on my chip, which is, A, really dumb, because the models grew way past that, right? Even Cerebras, right? You know, and I'm talking about like Graphcore, it's called SRAM, which is the memory on chip, much lower density but much higher speeds, versus, you know, DRAM, memory off chip. And so everyone was betting on more memory on chip and less memory off chip, right? And to be clear, right, for image networks and models that are small enough to just fit on your chip, that works. That is a superior architecture, but scale, right, scale, scale, scale, scale. NVIDIA was the only company that bet on the other side, of more memory bandwidth and more memory capacity external, and also the right ratio of memory bandwidth versus capacity. A lot of people, like Graphcore specifically, right, had a ton of memory on chip, and then they had a lot more memory off chip, but that memory off chip was much lower bandwidth. Same applies to SambaNova, same applies to Cerebras. They had no memory off chip, but they thought, hey, I'm going to make a chip the size of a wafer, right? Like, you know, those guys are silly, right? Hundreds of megabytes, we have 40 gigabytes. There's no, you know, and then, oh, crap, models are way bigger than 40 gigabytes, right? The ones that people deploy. Everyone bet on sort of the left side of this curve, right? The interesting thing is that there's new age startups like Lemurian, like MatX, I won't get into what they're doing, but they're making much more rational bets. I don't know, you know, it's hard to say with a startup, like, it's going to work out, right? Obviously there's tons of risk embedded, but those folks, you know, Jay Dawani of Lemurian and like Mike and Reiner, they understand models, they understand how they work. And if transformers continue to reign supreme, whatever innovations those folks are doing on hardware are going to need to be fitted for that. Or you have to predict what the model architecture is going to look like in a few years, right? You know, and hit that spot correctly. So that's kind of a background on those. But like now you look today, hey, Intel bought Nervana, which was Naveen Rao's company. He then started MosaicML and sold it to Databricks recently, obviously leading LLMs and AI there. Intel bought that company from him and then shut it down and bought this other AI company, Habana. And now that company has, you know, got new chips. They're going to release a better chip than the H100 within the next quarter or so. AMD, they have a GPU, MI300, that will be better than the H100 in a quarter or so. Now that says nothing about how hard it is to program it, but at least hardware-wise on paper, it's better. Why? Because it's, you know, a year and a half later than the H100, or a year later than the H100, of course, and, you know, a little bit more time and all that sort of stuff. 
But they're at least making similar bets on memory bandwidth versus flops versus capacity, following NVIDIA's lead. The questions are like, what is the correct bet for three years from now? How do you engineer that? And will those alternatives make sense? The other thing is, if you look at total manufacturing capacity, right, for this sort of bet, right, you need high bandwidth memory, you need HBM, and you need large five nanometer dies, you know, soon three nanometer, whatever, right? You need both of those components and you need the whole supply chain to go through that. We've written a lot about it, but, you know, to simplify it, NVIDIA has a little bit more than half and Google has like 30%, right, through Broadcom. So the total capacity for everyone else is much lower, and they're all sharing it, right? Amazon's Trainium and Inferentia, Microsoft's in-house chip, and, you know, you go down the list and it's like Meta's in-house chip, and also AMD, and also, so all of these companies are sharing like a much smaller slice. Their chips are not as good, or if they are, even though, you know, I mentioned Intel and AMD's chips are better, that's only because they're throwing more money at the problem kind of, right? You know, NVIDIA charges crazy prices, I think everyone knows that. Their gross margins are insane. AMD and Intel and others will charge more reasonable margins, and so they're able to give you more HBM and et cetera for a similar price, and so that ends up letting them beat NVIDIA, if you will, but their manufacturing costs are twice that in some cases, right? In the case of AMD, their manufacturing costs for MI300 are more than twice that of H100, and it only beats H100 by a little bit from, you know, performance stuff I've seen. So it's like, you know, it's tough for anyone to like bet the farm on an alternative hardware supplier, right? Like, in my opinion, like, you should either just like be like, you know, a lot of like ex-Google startups are just using TPUs, right? And hey, that's Google Cloud, you know, after moving the TPU team into the cloud team, infrastructure team, sort of, they're much more aggressive on external selling, and so you even see companies like Apple using TPUs for training LLMs, as well as GPUs. But either bet heavily on TPUs, because that's where the capacity is, or bet heavily on GPUs, of course, and stop worrying about it, and leverage all this amazing open source code that is optimized for NVIDIA. If you do bet on AMD or Intel or any of these startups, then you better make damn sure you're really good at low-level programming, and damn sure you also have a compelling business case, and that the hardware supplier is giving you such a good deal that it's worth it. And also, by the way, NVIDIA's releasing a new chip, you know, they're going to announce it in March, and they're going to release it and ship it Q2, Q3 next year anyways, right? And that chip will probably be three or four times as good, right? And maybe it'll cost twice as much, or 50% more. I hear it's 3x the performance on an LLM, and 50% more expensive. So it's like, okay, yeah, nothing is going to compete with that, even if it is 50% more expensive, right? And then you're like, okay, well, that kicks the can down further, and then NVIDIA's moving to a yearly release cycle, so it's like very hard for anyone to catch up to NVIDIA, really, right? 
So, you know, investing all this in other hardware, like, if you're Microsoft, obviously, who cares if I spend $500 million a year on my internal chip? Who cares if I spend $500 million a year on AMD chips, right? Like, if it lets me knock the price of NVIDIA GPUs down a little bit, puts the fear of God within Jensen Huang, right, like, you know, then it is what it is, right? And likewise, you know, with Amazon, and so on and so forth, you know, of course, their hope is that their chips succeed, or that they can actually have an alternative that is much cheaper than NVIDIA. To throw a couple hundred million dollars at a company, you know, at a product, is completely reasonable. And in the case of AMD, I think it'll be more than a couple hundred million dollars, right? But yeah, I think alternative hardware is like, it really does hit like sort of a peak hype cycle, kind of end of this year, early next year, because all NVIDIA has is H100, and then H200, which is just a better H100, more memory bandwidth, higher memory capacity, right? But that doesn't beat what, you know, AMD are doing, it doesn't beat what, you know, Intel's Gaudi 3 does, but then very quickly after, NVIDIA will crush them. And then those other companies are gonna take two years to get to their next generation. You know, it's just a really tough place. And no one besides, you know, the main thing about hardware is like, hey, that bet I talked about earlier is like, you know, that's very oversimplified, right? Just memory bandwidth, flops, and memory capacity. There's a whole lot more bets. There's 100 different bets that you have to make and guess correctly to get good hardware, not even to have better hardware than NVIDIA, just to get close to them. And that takes understanding models really, really well. That takes understanding so many different aspects, whether it's power delivery or cooling or design, layout, all this sort of stuff. And it's like, how many companies can do everything here, right? It's like, I'd argue Google probably understands models better than NVIDIA, I don't think people would disagree. And NVIDIA understands hardware better than Google. And so you end up with like, Google's hardware is competitive, but like, does Amazon understand models better than NVIDIA? I don't think so. And does Amazon understand hardware better than NVIDIA? No. I also have the opinion that the labs are useful partners, they're convenient partners. They're not going to buddy up as close as people think, right? I don't even think like, I expect in the next few years that the OpenAI-Microsoft partnership probably falls apart too. I mean, they'll still continue to use GPUs and stuff there. But like, I think that the level of closeness you see today is probably the closest they get. [00:37:15]

Swyx: At some point, they become competitive if OpenAI becomes its own cloud. [00:37:18]

Dylan: The level of value that they deliver to the world, if you talk to anyone there, they truly believe it'll be tens of trillions, if not hundreds of trillions of dollars, right? In which case, obviously, you know, weird corporate structure aside, you know, this is the same playing field as companies like Microsoft and Google. Google wants to also deliver hundreds of trillions of dollars of value. And it's like, obviously you're competing, and Microsoft wants to do the same, and you're going to compete. In general, right, like these lab partnerships are going to be nice, but they're probably incentivized to, you know, hey, NVIDIA, you should, you know, can you design the hardware in this way? It doesn't work like that. It works like this. And they're like, oh, so this is the best compromise. Right? Like, I think OpenAI would be stupid not to do that with NVIDIA, but also with AMD. But also, hey, and Microsoft's internal silicon, but it's like, how much time do I actually have? Right? Like, you know, should I do that? Should I spend all my, you know, super, super smart people's time, limited, you know, this caliber of person's time, doing that? Or should they focus on like, hey, can we get like asynchronous training to work? Or like, you know, figure out this next multimodal thing? Or I don't know. I don't know, right? Or should I eke out 5% more MFU and work on designing the next supercomputer? Right? Like, these kind of things, how much more valuable is that? Right? So it's like, you know, it's tough to see, you know, even OpenAI helping Microsoft enough to get their knowledge of models so, so good. Right? Like, Microsoft's going to announce their chip soon. It's worse performance than the H100, but the cost effectiveness of it is better for Microsoft internally, just because they don't have to pay the NVIDIA tax. But again, like by the time they ramp it and all these sorts of things, and oh, hey, that only works on a certain size of models. Once you exceed that, then it's actually, you know, again, better on NVIDIA. So it's like, it's really tough for OpenAI to be like, yeah, we want to bet on, on Microsoft. Right? Like, and hey, we have, you know, I don't know, what's the number of people they have now? Like 700 people, you know, of which how many do low level code? Do I want to have separate code bases for this and this and this and this? And, you know, it's like, it's just like a big headache to, I don't know, I think it'd be very difficult to see anyone truly pivoting to anything besides a GPU and a TPU, especially if you have, if you need that scale. And that scale that the lab, at least the labs, right, require is absurd. Google says millions, right, of TPUs. OpenAI will say millions of GPUs, right? Like I truly do believe they think that, that number of next generation GPUs, right? Like the numbers that we're going to get to are like, I bet you, I mean, I don't know, but I bet Sam Altman would say, yeah, we're going to build a hundred billion dollar supercomputer in three years or two years, right? And like after GPT-5 releases, if he goes to the market and says like, hey, I want to raise a hundred billion dollars at a $500 billion valuation, I'm sure the market would give it to him, right? Like, and then they build that supercomputer, right? Like, I mean, like, I think that's like truly the path we're on. And so it's hard to, hard to imagine. Yeah. I don't know. [00:40:00]

Swyx: One point that you didn't touch on and Taiwan companies are famously very chatty about the fruit company. Should we take Apple seriously at all in this game or they're just in a different world altogether? [00:40:10]

Dylan: I respect their products, but like, I don't think Apple will ever release a model that you can get to say really bad things. There's all these jailbreaks, but also like as soon as they happen, like, you know, it gets fed back into OpenAI's like platform and it gets them, it's like being public and open is accelerating their like ability to make a better and better model, right? Like the RLHF and all this kind of stuff. I don't see how Apple can do that structurally, like as a company, like the fruit company ships perfect products or like, or else, right? That's why everyone loves iPhones, right? And all these like open source firms and like all these folks are doing exactly that, right? Building a bigger and better model every, you know, every few months. And I don't know how Apple gets on that train, but you know, at the same time, there's no company that has more powerful distribution, right? [00:40:56]

Swyx: Are people in Taiwan concerned that it will come to a point where China will just claim Taiwan? [00:41:02]

Dylan: I think, I think that a lot of people there are not super concerned, but there's some people that are super concerned. I think, I think especially after like, you know, instability across the world and in Europe and in the Middle East and even Africa, if you look at any of the stuff they're building up, it seems very clear. And if you talk to a lot of people, they think China will invade Taiwan in 27 or 26 in April or in September, sort of the best timeframes, right? Like a lot of people believe that's what will happen, right? [00:41:29]

Swyx: Maybe the semi-analysis analyst point of view is, is it feasible to build this capacity up in the US? No. [00:41:35]

Dylan: No, right? Like people don't understand how fragmented the semiconductor supply chain really is and how many monopolies there are. The US could absolutely shut down the Chinese semiconductor supply chain. They won't. And China could absolutely shut down the US one too, actually, by the way. But more, more relevantly, right, is like, you know, Austria, the country of Austria, in Europe, has two companies that have super high market share in very specific technologies that are required for every single, like, chip, period, right? There is no chip that is less than seven nanometer that doesn't get touched by this one Austrian company's tool, right? And there is no alternative. And there's another Austrian company, and likewise, everything two nanometer and beyond will be touched by their tool. And it's like, but both of these companies are doing well less than a billion dollars of revenue, right? So it's like, you'd think they're so inconsequential. No, and there's actually like three or four Japanese chemical companies, same, same idea, right? It's like the supply chain is so fragmented, right? Like people only ever talk about where the fabs are, where the chips actually get produced, but it's like, I mean, TSMC in Arizona, right? TSMC is building a fab in Arizona. It's, it's quite a bit smaller than the fabs in, in, in Taiwan. But even ignoring that, those fabs still have to ship things back to Taiwan anyways. And also they have to get what's called a mask from Taiwan, and get it sent to, sent to Arizona. And by the way, there's these Japanese companies that make these chemicals that need to ship in, you know, like TOK and Shin-Etsu, and, you know, it's like, and, and hey, it needs this tool from Austria no matter what. It's like, oh wow, wait, actually like the entire supply chain is just way too fragmented. You can't like re-engineer and rebuild it in a snap, right? It's just like that. It's just too complex to do that. Semiconductors are more complex than any other thing that humans do, without a doubt. There's more people working in that supply chain with XYZ backgrounds and more money invested every year in R&D plus CapEx, you know, it's like, it's just by far the most complex supply chain that humanity has. And to think that we could rebuild it in a few years is absurd. [00:43:22]

Swyx: In an alternate universe, the US kept Morris Chang. I mean, people, right? Like it was just one guy. Yeah. [00:43:29]

Dylan: In an alternative universe, Texas Instruments communicated to Morris Chang that he would become CEO, and so he never goes to Taiwan and, you know, blah, blah, blah. Right. Yeah. No. But I, you know, that's just also, I think, I think the world would probably be further behind in terms of technology development if that didn't happen, right? Like technology proliferation is how you accelerate the pace of innovation, right? So the, you know, the dissemination to, oh, wow, hey, it's not just a bunch of people in Oregon at Intel that are leading everything, right? Or, you know, hey, a bunch of people at Samsung in Korea, right? Or in Hsinchu, Taiwan, right? It's actually all three of those plus all these tool companies across the country and the Netherlands and in Japan and the US and, you know, it's millions of people innovating on a disseminated technology that's led us to get here, right? I don't even think, you know, if Morris Chang didn't go to Taiwan, would we even be at 5 nanometer? Would we be at 7 nanometer? Probably not, right? So there's a lot of things that, you know, happened because of that, right? [00:44:22]

Alessio: Let's get a quick lightning round on semi-analysis branded one. So the first one is what are like foundational readings that people that are listening today should read to get up to speed on like semis? [00:44:34]

Dylan: I think the easiest one is the PyTorch 2.0 and Triton one that I did. You know, there's the advanced packaging series. There's the Google infrastructure supremacy piece. I think that one's really critical because it explains Google's infrastructure quite a bit, from networking through chips, through all that sort of history of the TPU a little bit. Maybe like the AMD MI300 piece, the one that we did on that, it's very good. And then obviously, like, you know, like, I don't know, probably like Chip War by Chris Miller, who doesn't recommend that book, right? It's a really good book, right? I mean, like I would say Gordon Moore's book is freaking awesome, because you've got to think about, right, like, you know, LLM scaling laws are like Moore's law on crack, right? Kind of like, you know, in a different sense, like, you know, if you think about it, all of human productivity gains since the 70s is probably just off of the base of semiconductors and technology, right? Of course, of course, people across the world are getting, you know, access to oil and gas and all this sort of stuff. But like, at least in the Western world, since the 70s, everything has just been mostly innovated because of technology, right? Oh, we're able to build better cars because semiconductors enable us to do that. Or we're able to build better software because we're able to connect everyone, because semiconductors enabled that, right? That is like, I think that's why it's the most important industry in the world. But like seeing the frame of mind of what Gordon Moore has written, you know, he's got a couple, you know, papers, books, et cetera, right? And Andy Grove's Only the Paranoid Survive, right? Like I think, I think like that philosophy and thought process really translates to the now modern times, except maybe, you know, humanity has been on an exponential S-curve and this is like another exponential S-curve on top of that. So I think those are probably good readings to do. [00:46:09]

Swyx: Has there been an equivalent pivot? So Gordon, like that classic tale was more of like the pivot to memory. [00:46:16]

Dylan: From memory to logic. Yeah. [00:46:18]

Swyx: Yeah. And then was there, has there been an equivalent pivot in Semi's history of that magnitude? [00:46:24]

Dylan: I mean, like, you know, some people would argue that like, you know, Jensen, you know, he basically didn't care about, he only cared about, you know, like gaming and 3D professional visualization and like rendering and things like that, until like he started to learn about AI. And then all of a sudden he's going to like universities, like, you want some GPUs? Here you go. Right. Like, I think there's even stories of like, you know, not so long ago, NeurIPS, when it used to have the more unfortunate name, he would go there and just give away GPUs to people. Right. Like there's like stuff like that. Like, you know, very grassroots, like pivoting the company. Now like you, you look on gaming forums and it's like, everybody's like, oh, NVIDIA doesn't even care about us. They only care about AI, and it's like, yes, you're right. They only care. They mostly only care about AI, and the gaming innovations are only because of like, they're putting more AI into it. Right. It's like, but also like, hey, they're doing a lot of chip design stuff with AI. And, you know, I think, I think that's like, not, I don't know if it's an equivalent pivot quite yet, because the, you know, move to digital logic was a pretty big innovation, but I think that's a big one. And, you know, likewise, it's like, you know, what did, what did OpenAI do? Right. What did they pivot? How did they pivot? They left the, like, a lot of, most people left the culture of like Google Brain and DeepMind and decided to build this like company. That's crazy cool. Right. Like it does things in a very different way and like is innovating in a very different way. So you could consider that a pivot, even though it's not inside Google. [00:47:40]

Swyx: They were on a very different path with like the Dota games and all that before they eventually found like GPTs as the, as the thing. So it was a full, like started in 2015 and then like really pivoted in 2019 to be like, all right, we're the GPT company. Yeah. Yeah. If I could classify them, I don't, I'm sure there's OpenAI people who are yelling at me right now. Okay. So just a general question about, you know, I'm a fellow writer on, on Substack. You are obviously managing your consulting business while you're also publishing these amazing posts. How do you, what's your writing process? How do you source info? Like when do you sit down and go like, here's the theme for the week. Do you, do you have a pipeline going out? Just anything you can describe. [00:48:17]

Dylan: I'm thankful for my, you know, my teammates 'cause they are actually awesome. Like, and they're much more, um, you know, directed, focused on working on one thing, you know, or not one thing, but a number of things, right. Like, you know, someone who's this expert on X and Y and Z in the semiconductor supply chain. So that really helps with that side of the business. I most of the time only write when I'm very excited or, you know, it's like, hey, like we should work on this and we should write about this. So like, you know, one of the most recent posts we did was we explained the manufacturing process for 3D NAND, you know, flash storage, uh, gate-all-around transistors and 3D DRAM and all this sort of stuff. 'Cause there's a company in Japan that's going public, Kokusai Electric, right. It was like, okay, well we should do a post about this and we should explain this. But like, it's like, okay, we, you know, and so Myron, he did all that work, Myron Xie did most of the work, and it's awesome. But like, usually it's like, there's a few, like very long in-depth back burner type things, right? Like that took a long time, took, you know, over a month of research, and Myron knows this stuff already really well, right? Like there's stuff like that that we do, and that like builds up a body of work for our consulting and some of the reports that we sell that aren't, you know, newsletter posts. But a lot of times the process is also just like, well, like the Gemini Eats the World post is the culmination of reading that Meena doc, having done a lot of work on the supply chain around the TPU ramp and CoWoS and HBM capacities and all this sort of stuff, to be able to, you know, figure out how many units Google's ordering and all sorts of stuff. And then like, also like looking at like open sources, like all of that culminated in, like, I wrote that in four hours, right? I sent it to a couple of people and they were like, no, change this, this, this, oh, you know, add this. 'Cause that's really going to piss off, you know, the open source community. I'm like, okay, sure. And then posted it, right? So it's like, there's no like specific process. Unfortunately, like the most viral posts, especially in the AI community, are just like those kinds of pieces rather than the, like the really deep, deep, like, you know, obviously like what was in the Gemini Eats the World post, you know, the obvious, hey, like we, we do deep work and there's a lot more like factual, not leaks, you know, it's just factual research. Hey, across the team, we go to 40-plus conferences a year, right. All the way from like a photoresist conference to a photomask conference, to a lithography conference, all the way up to like AI conferences, and, you know, everything in between, networking conferences, and piecing everything together across the supply chain. So it's like, that's like the true, like, work and like, yeah, I don't know. It is sometimes bad to like have the infamousness of, you know, only people caring about this or the GPT-4 leak or the Google has no moat leak. Right. It's like, but like, you know, that's just like stuff that comes along. Right. You know, it's really focused on like understanding the supply chain and how it's pivoting and who are the winners, who are the losers, what technologies are inflecting, things like that. Where's the best place to invest resources, you know, sort of like stuff like that, in accelerating or capturing value, et cetera. [00:50:54]

Alessio: Awesome. And to wrap, if you had a magic genie that could answer any question that would change your worldview, what question would you ask? [00:51:03]

Dylan: That's a tough one. [00:51:04]

Swyx: Like you, you operate based on a set of facts about the world right now, then there's maybe some unknowns where you're like, man, if I really knew the answer to this one, I would do so many things differently, or I would think about things very differently. [00:51:18]

Dylan: So I'm of the view, at least everything that we've seen so far is that large scale training has to happen in an individual data center with very high speed networking. Now, everything doesn't need to be all to all connected, but you need very high speed networking between all of your, your chips, right? I would love to know, you know, hey, magic genie, how can we build artificial intelligence in a way that it can use multiple data centers of resources where there is a significantly lower bandwidth between pools of resources, right? Because that would instantly, like one of the big bottlenecks is how much power and how many chips you can get into a single data center. So like, A, Google and OpenAI and Anthropic are working on this, and I don't know if they've solved it yet, but if they haven't solved it yet, then what is the solution? Because that will like accelerate the scaling that can be done by not just like a factor of 10, but like orders of magnitude, because there's so many different data centers, right? Like if you, you know, across the world and, you know, oh, if I could pick up, you know, if I could effectively use 256 GPUs in this little data center here, and then with this big cluster here, you know, how can you make an algorithm that can do that? Like I think that would be like the number one thing I'd be curious to know if, how, what, because that changes the world significantly in terms of how we continue to scale this amazing technology that people have invented over the last, you know, five years. Awesome. [00:52:36]

Alessio: Well, thank you so much for coming on, Dylan. [00:52:38]

Dylan: Thank you. Thank you. [00:52:46]

Alessio: Thank you. [00:52:46]



Get full access to Latent Space at www.latent.space/subscribe
2023-11-17
Länk till avsnitt

AGI is Being Achieved Incrementally (DevDay Recap - cleaned audio)

We left a high amount of background audio in the DevDay podcast, which many of you loved, but we definitely understand that some of you may have had trouble with it. Listener Klaus Breyer ran it through Auphonic with speech isolation and we figured we'd upload it as a backdated pod for people who prefer this. Of course it means that our speakers sound out of place since they now sound like they are talking loudly in a quiet room. Let us know in the comments what you think!

Timestamps

the cleaned part is only part 2:

* [00:55:09] Part II: Spot Interviews

* [00:55:59] Jim Fan (Nvidia) - High Level Takeaways

* [01:05:19] Raza Habib (Humanloop) - Foundation Model Ops

* [01:13:32] Surya Dantuluri (Stealth) - RIP Plugins

* [01:20:53] Reid Robinson (Zapier) - AI Actions for GPTs

* [01:30:45] Div Garg (MultiOn) - GPT4V for Agents

* [01:36:42] Louis Knight-Webb (Bloop.ai) - AI Code Search

* [01:48:36] Shreya Rajpal (Guardrails) - Guardrails for LLMs

* [01:59:00] Alex Volkov (Weights & Biases, ThursdAI) - "Keeping AI Open"

* [02:09:39] Rahul Sonwalkar (Julius AI) - Advice for Founders



Get full access to Latent Space at www.latent.space/subscribe
2023-11-08
Länk till avsnitt

AGI is Being Achieved Incrementally (OpenAI DevDay w/ Simon Willison, Alex Volkov, Jim Fan, Raza Habib, Shreya Rajpal, Rahul Ligma, et al)

SF folks: join us at the AI Engineer Foundation's Emergency Hackathon tomorrow and consider the Newton if you'd like to cowork in the heart of the Cerebral Arena.

Our community page is up to date as usual!

~800,000 developers watched OpenAI Dev Day, ~8,000 of whom listened along live on our ThursdAI x Latent Space, and ~800 of whom got tickets to attend in person:

OpenAI's first developer conference easily surpassed most people's lowballed expectations - they simply did everything short of announcing GPT-5, including:

* ChatGPT (the consumer facing product)

* GPT4 Turbo already in ChatGPT (running faster, with an April 2023 cutoff), all noticed by users weeks before the conference

* Model picker eliminated, God Model chooses for you

* GPTs - "tailored version of ChatGPT for a specific purpose" - stopping short of "Agents". With custom instructions, expanded knowledge, and actions, and an intuitive no-code GPT Builder UI (we tried all these on our livestream yesterday and found some issues, but also were able to ship interesting GPTs very quickly) and a GPT store with revenue sharing (an important criticism we focused on in our episode on ChatGPT Plugins)

* API (the developer facing product)

* APIs for Dall-E 3, GPT4 Vision, Code Interpreter (RIP Advanced Data Analysis), GPT4 Finetuning and (surprise!) Text to Speech

* many thought each of these would take much longer to arrive

* usable in curl and in playground

* BYO Interpreter + Async Agents?

* Assistant API: stateful API backing "GPTs"-like apps, with support for calling multiple tools in parallel, persistent Threads (storing message history, unlimited context window with some asterisks), and uploading/accessing Files (with a possibly-too-simple RAG algorithm, and expensive pricing)

* Whisper 3 announced and open sourced (HuggingFace recap)

* Price drops for a bunch of things!

* Misc: Custom Models for big spending ($2-3m) customers, Copyright Shield, Satya

The progress here feels fast, but it is mostly (incredible) last-mile execution on model capabilities that we already knew to exist. On reflection it is important to understand that the one guiding principle of OpenAI, even more than being Open (we address that in part 2 of today's pod), is that slow takeoff of AGI is the best scenario for humanity, and that this is what slow takeoff looks like:

When introducing GPTs, Sam was careful to assert that "gradual iterative deployment is the best way to address the safety challenges with AI":

This is why, in fact, GPTs and Assistants are intentionally underpowered, and it is a useful exercise to consider what else OpenAI continues to consider dangerous (for example, many people consider a while(true) loop a core driver of an agent, which GPTs conspicuously lack, though Lilian Weng of OpenAI does not).

We convened the crew to deliver the best recap of OpenAI Dev Day in Latent Space pod style, with a 1hr deep dive with the Functions pod crew from 5 months ago, and then another hour with past and future guests live from the venue itself, discussing various elements of how these updates affect their thinking and startups. Enjoy!

Show Notes

* swyx live thread (see pinned messages in Twitter Space for extra links from community)

* Newton AI Coworking Interest Form in the heart of the Cerebral Arena

Timestamps

* [00:00:00] Introduction

* [00:01:59] Part I: Latent Space Pod Recap

* [00:06:16] GPT4 Turbo and Assistant API

* [00:13:45] JSON mode

* [00:15:39] Plugins vs GPT Actions

* [00:16:48] What is a "GPT"?

* [00:21:02] Criticism: the God Model

* [00:22:48] Criticism: ChatGPT changes

* [00:25:59] "GPTs" is a genius marketing move

* [00:26:59] RIP Advanced Data Analysis

* [00:28:50] GPT Creator as AI Prompt Engineer

* [00:31:16] Zapier and Prompt Injection

* [00:34:09] Copyright Shield

* [00:38:03] Sharable GPTs solve the API distribution issue

* [00:39:07] Voice

* [00:44:59] Vision

* [00:49:48] In person experience

* [00:55:11] Part II: Spot Interviews

* [00:56:05] Jim Fan (Nvidia - High Level Takeaways)

* [01:05:35] Raza Habib (Humanloop) - Foundation Model Ops

* [01:13:59] Surya Dantuluri (Stealth) - RIP Plugins

* [01:21:20] Reid Robinson (Zapier) - AI Actions for GPTs

* [01:31:19] Div Garg (MultiOn) - GPT4V for Agents

* [01:37:15] Louis Knight-Webb (Bloop.ai) - AI Code Search

* [01:49:21] Shreya Rajpal (Guardrails.ai) - on Hallucinations

* [01:59:51] Alex Volkov (Weights & Biases, ThursdAI) - "Keeping AI Open"

* [02:10:26] Rahul Sonwalkar (Julius AI) - Advice for Founders

Transcript

[00:00:00] Introduction

[00:00:00] swyx: Hey everyone, this is Swyx coming at you live from the Newton, which is in the heart of the Cerebral Arena. It is a new AI co working space that I and a couple of friends are working out of. There are hot desks available if you're interested, just check the show notes. But otherwise, obviously, it's been 24 hours since the opening of Dev Day, a lot of hot reactions and longstanding tradition, one of the longest traditions we've had.

[00:00:29] And the latent space pod is to convene emergency sessions and record the live thoughts of developers and founders going through and processing in real time. I think a lot of the roles of podcasts isn't as perfect information delivery channels, but really as an audio and oral history of what's going on as it happens, while it happens.

[00:00:49] So this one's a little unusual. Previously, we only just gathered on Twitter Spaces, and then just had a bunch of people. The last one was the Code Interpreter one where 22,000 people showed up. But this one is a little bit more complicated because there's an in person element and then an online element.

[00:01:06] So this is a two part episode. The first part is a recorded session between our Latent Space people and Simon Willison and Alex Volkov from the ThursdAI pod, just kind of recapping the day. But then also, as the second hour, I managed to get a bunch of interviews with previous guests on the pod who we're still friends with and some new people that we haven't yet had on the pod.

[00:01:28] But I wanted to just get their quick reactions because most of you have known and loved Jim Fan and Div Garg and a bunch of other folks that we interviewed. So I just want to, I'm excited to introduce to you the broader scope of what it's like to be at OpenAI Dev Day in person, bring you the audio experience as well as give you some of the thoughts that developers are having as they process the announcements from OpenAI.

[00:01:51] So first off, we have the Latent Space Pod recap. One hour of OpenAI Dev Day.

[00:01:59] Part I: Latent Space Pod Recap

[00:01:59] Alessio: Hey. Welcome to the Latent Space Podcast, an emergency edition after OpenAI Dev Day. This is Alessio, Partner and CTO-in-Residence at Decibel Partners, and as usual, I'm joined by Swyx, founder of Smol AI. Hey,

[00:02:12] swyx: and today we have two special guests with us covering all the latest and greatest.

[00:02:17] We, we, we love to get our band together and recap things, especially when they're big. And it seems like that every three months we have to do this. So Alex, welcome. From ThursdAI, we've been collaborating a lot on the Twitter Spaces, and welcome Simon from many, many things, but also I think you're the first person to make four appearances on our pod.

[00:02:37] Oh, wow. I feel privileged. So welcome. Yeah, I think we were all there yesterday. How... do we feel? Like, what do you want to kick off with? Maybe Simon, you want to, you want to take first and then Alex. Sure. Yeah. I mean,

[00:02:47] Simon Willison: yesterday was quite exhausting, quite frankly. I feel like it's going to take us as a community several months just to completely absorb all of the stuff that they dropped on us in one giant.

[00:02:57] Giant batch. It's particularly impressive considering they launched a ton of features, what, three or four weeks ago? ChatGPT voice and the combined mode and all of that kind of thing. And then they followed up with everything from yesterday. That said, now that I've started digging into the stuff that they released yesterday, some of it is clearly in need of a bit more polish.

[00:03:15] You know, the reality of what they released is, I'd say, about 80 percent of what it looked like it was yesterday, which is still impressive. You know, don't get me wrong. This is an amazing batch of stuff, but there are definitely problems and sharp edges that we need to file off.

[00:03:29] And there are things that we still need to figure out before we can take advantage of all of this.

[00:03:33] swyx: Yeah, agreed, agreed. And we can go into those, those sharp edges in a bit. I just want to pop over to Alex. What are your thoughts?

[00:03:39] Alex Volkov: So, interestingly, even folks at OpenAI, there's like several booths and help desks so you can go in and ask people, like, actual changes and people, like, they could follow up with, like, the right people in OpenAI and, like, answer you back, etc.

[00:03:52] Even some of them didn't know about all the changes. So I went to the voice and audio booth. And I asked them about, like, hey, is Whisper 3 that was announced by Sam Altman on stage just, like, briefly, will that be open source? Because I'm, you know, I love using Whisper. And they're like, oh, did we open source?

[00:04:06] Did we talk about Whisper 3? Like, some of them didn't even know what they were releasing. But overall, I felt it was a very tightly run event. Like, I was really impressed. Shawn, we were sitting in the audience, and you, like, pointed at the clock to me when they finished. They finished, like, on time. And this was after, like, doing some extra stuff.

[00:04:24] Very, very impressive for a first event. Like I was absolutely like, Good job.

[00:04:30] swyx: Yeah, apparently it was their first keynote and someone, I think, was it you that told me that this is what happens if you have A president of Y Combinator do a proper keynote you know, having seen many, many, many presentations by other startups this is sort of the sort of master stroke.

[00:04:46] Yeah, Alessio, I think you were watching remotely. Yeah, we were at the Newton. Yeah, the Newton.

[00:04:52] Alessio: Yeah, I think we had 60 people here at the watch party, so it was quite a big crowd. Mixed reaction from different... Founders and people, depending on what was being announced on the page. But I think everybody walked away kind of really happy with a new layer of interfaces they can use.

[00:05:11] I think, to me, the biggest takeaway was like and I was talking with Mike Conover, another friend of the podcast, about this is they're kind of staying in the single threaded, like, synchronous use cases lane, you know? Like, the GPTs announcements are all, like... still chat-based, one-on-one synchronous things.

[00:05:28] I was expecting, maybe, something about async things, like background running agents, things like that. But it's interesting to see there was nothing of that, so. I think if you're a founder in that space, you're, you're quite excited. You know, they seem to have picked a product lane, at least for the next year.

[00:05:45] So, if you're working on... Async experiences, so things working in the background, things that are not co pilot like, I think you're quite excited to have them be a lot cheaper now.

[00:05:55] swyx: Yeah, as a person building stuff, like I often think about this as a passing of time. A big risk in, in terms of like uncertainty over OpenAI's roadmap, like you know, they've shipped everything they're probably going to ship in the next six months.

[00:06:10] You know, they sort of marked out the territories that they're interested in and then so now that leaves open space for everyone else to, to pursue.

[00:06:16] GPT4 Turbo and Assistant API

[00:06:16] swyx: So I guess we can kind of go in order probably top of mind to mention is the GPT 4 turbo improvements. Yeah, so longer context length, cheaper price.

[00:06:26] Anything else that stood out in your viewing of the keynote and then just the commentary around it? I

[00:06:34] Alex Volkov: was I was waiting for Stateful. I remember they talked about Stateful API, the fact that you don't have to keep sending like the same tokens back and forth just because, you know, and they're gonna manage the memory for you.

[00:06:45] So I was waiting for that. I knew it was coming at some point. I was kind of... I did not expect it to come at this event. I don't know why. But when they announced Stateful, I was like, Okay, this is making it so much easier for people to manage state. The whole threads thing, I don't want to mix between the two things, so maybe you guys can clarify, but there's GPT-4 Turbo, which is the model that has the capabilities, in a whopping 128k, like, context length, right?

[00:07:11] It's huge. It's like two and a half books. But also, you know, faster, cheaper, etc. I haven't yet tested the fasterness, but like, everybody's excited about that. However, they also announced this new API thing, which is the Assistants API. And part of it is threads, which is, we'll manage the thread for you.

[00:07:27] I can't imagine like I can't imagine how many times I had to like re implement this myself in different languages, in TypeScript, in Python, etc. And now it's like, it's so easy. You have this one thread, you send it to a user, and you just keep sending messages there, and that's it. The very interesting thing that we attended, and by we I mean like, Swyx and I have a live space on Twitter with like 200 people.
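What Alex is describing, one thread per user that OpenAI keeps the state for, maps onto the launch-week Assistants beta roughly like the sketch below. This is our reading of the brand-new `openai` v1 Python SDK, so treat the exact method names as an assumption rather than gospel:

```python
# Rough sketch of the stateful Assistants flow as announced at DevDay.
# Assumes the v1 `openai` Python SDK and OPENAI_API_KEY in the environment.
import time
from openai import OpenAI

client = OpenAI()

# One-time setup: an assistant with instructions and a model.
assistant = client.beta.assistants.create(
    name="Podcast helper",
    instructions="Answer questions about the DevDay announcements.",
    model="gpt-4-1106-preview",
)

# One thread per user; OpenAI stores the message history for you.
thread = client.beta.threads.create()

# Each turn: append the user's message, then kick off a run of the assistant.
client.beta.threads.messages.create(
    thread_id=thread.id, role="user", content="What did they ship for text to speech?"
)
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)

# Runs are asynchronous: poll until done, then read the reply off the thread.
while run.status in ("queued", "in_progress"):
    time.sleep(1)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

messages = client.beta.threads.messages.list(thread_id=thread.id)
print(messages.data[0].content[0].text.value)  # newest message first
```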

[00:07:46] So it's like me, Swyx, and 200 people in our earphones with us as well. They kept asking like, well, how's the price happening? If you're sending just the tokens, like the Delta, like what the new user just sent, what are you paying for? And I went to OpenAI people, and I was like, hey... How do we get paid for this?

[00:08:01] And nobody knew, nobody knew, and I finally got an answer. You still pay for the whole context that you have inside the thread. You still pay for all this, but now it's a little bit more complex for you to kind of count with tiktoken, right? So you have to hit another API endpoint to get the whole thread of what the context is.

[00:08:17] Then tiktokenize this, run this through tiktoken, and then calculate. This is now the new way, officially, for OpenAI. But I really did, like, have to go and find this. They didn't know a lot of, like, how the pricing is. Ouch! Do you know if

[00:08:31] Simon Willison: the API, does the API at least tell you how many tokens you used? Or is it entirely up to you to do the accounting?

[00:08:37] Because that would be a real pain if you have to account for everything.

[00:08:40] Alex Volkov: So in my head, the question I was asking is, like, if you want to know in advance, like with the tiktoken library, if you want to count in advance and, like, make a decision, like, in advance on that, how would you do this now? And they said, well, yeah, there's a way.

[00:08:54] If you hit the API, get the whole thread back, then count the tokens. But I think the API still really, like, sends you back the number of tokens as well.
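In code, the accounting Alex describes comes down to: list the thread's messages, then tokenize and sum them yourself. A minimal sketch, where the thread-listing call is the beta Assistants SDK and the encoding choice is an assumption on our part:

```python
# Estimate what a thread's context will cost by counting its tokens with tiktoken.
# The Assistants calls are the v1 beta SDK; per-message formatting overhead is ignored.
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.encoding_for_model("gpt-4")  # cl100k_base, same family as GPT-4 Turbo

def thread_token_count(thread_id: str) -> int:
    messages = client.beta.threads.messages.list(thread_id=thread_id)
    total = 0
    for message in messages.data:
        for part in message.content:
            if part.type == "text":
                total += len(enc.encode(part.text.value))
    return total  # treat as an estimate, not the exact billed number
```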

[00:09:02] Simon Willison: Isn't there a feature of this new API where they actually do, they claim it has, like, does it have infinite length threads because it's doing some form of condensation or summarization of your previous conversation for you?

[00:09:15] I heard that from somewhere, but I haven't confirmed it yet.

[00:09:18] swyx: So I have, I have a source from Dave Valdman. I actually don't, don't know what his affiliation is, but he usually has pretty accurate takes on AI. So I, I think he works in the AI circles in some capacity. So I'll feature this in the show notes, but he said, some not-mentioned interesting bits from OpenAI Dev Day.

[00:09:33] One: unlimited context window and chat threads. From OpenAI's docs, it says once the size of messages exceeds the context window of the model, the thread smartly truncates them to fit. I'm not sure I want that intelligence.

[00:09:44] Alex Volkov: I want to chime in here just real quick. The 'not want this intelligence' part, I heard from multiple people over the next conversations that I had. Some people said, hey, even though they're giving us, like, content understanding and RAG, we are doing different things. Some people said this with Vision as well.

[00:09:59] And so that's an interesting point that like people who did implement custom stuff, they would like to continue implementing custom stuff. That's also like an additional point that I've heard people talk about.

[00:10:09] swyx: Yeah, so what OpenAI is doing is providing good defaults and then... Well, good is questionable.

[00:10:14] We'll talk about that. You know, I think the existing sort of LangChain and LlamaIndexes of the world are not very threatened by this because there's a lot more customization that they want to offer. Yeah, so frustration

[00:10:25] Simon Willison: is that OpenAI, they're providing new defaults, but they're not documented defaults.

[00:10:30] Like they haven't told us how their RAG implementation works. Like, how are they chunking the documents? How are they doing retrieval? Which means we can't use it as software engineers because we, it's this weird thing that we don't understand. And there's no reason not to tell us that. Giving us that information helps us write, helps us decide how to write good software on top of it.

[00:10:48] So that's kind of frustrating. I want them to have a lot more documentation about just some of the internals of what this stuff

[00:10:53] swyx: is doing. Yeah, I want to highlight.

[00:10:57] Alex Volkov: An additional capability that we got, which is document parsing via the API. I was, like, blown away by this, right? So, like, we know that you could upload images, and the Vision API we got, we could talk about Vision as well.

[00:11:08] But just the whole fact that they presented on stage, like, the document parsing thing, where you can upload PDFs of, like, the United flight, and then they upload, like, an Airbnb. That, on the whole, like, that's a whole category of, like, products that's now open, due to OpenAI just, like, giving developers the ability to very easily build products that previously were a...

[00:11:24] Pain in the butt for many, many people. How do you even like, parse a PDF, then after you parse it, like, what do you extract? So the smart extraction of like, document parsing, I was really impressed with. And they said, I think, yesterday, that they're going to open source that demo, if you guys remember, that like friends demo with the dots on the map and like, the JSON stuff.

[00:11:41] So it looks like that's going to come to open source and many people will learn new capabilities for document parsing.

[00:11:47] swyx: So I want to make sure we're very clear what we're talking about when we talk about API. When you say API, there's no actual endpoint that does this, right? You're talking about ChatGPT's GPTs functionality.

[00:11:58] Alex Volkov: No, I'm talking about the Assistants API. The Assistants API that has threads now, that has agents, and you can run those agents. I actually, maybe let's clarify this point. I think I had to, somebody had to clarify this for me. There's the GPTs, which is a UI version of running agents. We can talk about them later, but like you and I and my mom can go and like, hey, create a new GPT that, like, you know, only does Chuck Norris jokes, like whatever. But there's the Assistants thing, which is kind of a similar thing, but, but not the same.

[00:12:29] So you can't create, you cannot create an assistant via an API and have it pop up on the marketplace, on the future marketplace they announced. How can you not? No, no, no, not via the API. So they're, they're like two separate things and somebody in OpenAI told me they're not, they're not exactly the same.

[00:12:43] That's

[00:12:43] Simon Willison: so confusing because the API looks exactly like the UI that you use to set up the, the GPTs. I, I assumed they were, there was an API for the same

[00:12:51] Alex Volkov: feature. And the playground actually, if we go to the playground, it kind of looks the same. There's like the configurable thing. The configure screen also has, like, you can allow browsing, you can allow, like, tools, but somebody told me they didn't do the full cross mapping, so, like, you won't be able to create GPTs with the API, you will be able to create the assistants, and then you'll be able to have those assistants do different things, including call your external stuff.

[00:13:13] So that was pretty cool. So this API is called the Assistants API. That's what we get, like, in addition to the model of the GPT-4 Turbo. And that has document parsing. So you can upload documents there, and it will understand the context of them, and they'll return you, like, structured or unstructured output.

[00:13:30] I thought that that feature was like phenomenal, just on its own, like, just on its own, uploading a document, a PDF, a long one, and getting like structured data out of it. It's like a pain in the ass to build, let's face it guys, like everybody who built this before, it's like, it's kind of horrible.
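The upload-a-PDF-and-get-structure-back flow Alex is praising corresponds to the Files API plus the retrieval tool on an assistant. A hedged sketch using the launch-week beta shapes; the file name and instructions are just illustrative:

```python
# Sketch: attach a PDF to an assistant with the retrieval tool and ask for structured output.
# API shapes are the DevDay-era Assistants beta; verify against current docs before relying on them.
from openai import OpenAI

client = OpenAI()

doc = client.files.create(file=open("flight-itinerary.pdf", "rb"), purpose="assistants")

assistant = client.beta.assistants.create(
    name="Doc parser",
    instructions="Extract the flight number, date, and passenger name as JSON.",
    model="gpt-4-1106-preview",
    tools=[{"type": "retrieval"}],
    file_ids=[doc.id],
)

thread = client.beta.threads.create(
    messages=[{"role": "user", "content": "Parse the attached itinerary."}]
)
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)
# ...then poll the run and read the JSON out of the reply, as in the earlier sketch.
```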

[00:13:45] JSON mode

[00:13:45] swyx: When you say structured data, are you talking about the citations?

[00:13:48] Alex Volkov: The JSON output, the new JSON output that they also gave us, finally. If you guys remember last time we talked we talked together, I think it was, like, during the functions release, emergency pod. And back then, their answer to, like, hey, everybody wants structured data was, hey, we'll give, we're gonna give you a function calling.

[00:14:03] And now, they did both. They gave us both, like, a JSON output, like, structure. So, like, you can, the models are actually going to return JSON. Haven't played with it myself, but that's what they announced. And the second thing is, they improved the function calling. Significantly as well.

[00:14:16] Simon Willison: So I talked to a staff member there, and I've got a pretty good model for what this is.

[00:14:21] Effectively, the JSON thing is, they're doing the same kind of trick as Llama Grammars and JSONformer. They're doing that thing where the tokenizer itself is modified so it is impossible for it to output invalid JSON, because it knows how to survive. Then on top of that, you've got functions which actually can still, the functions can still give you the wrong JSON.

[00:14:41] They can give you JSON with keys that you didn't ask for if you are unlucky. But at least it will be valid. At least it'll pass through a JSON parser. And so they're, they're very similar sort of things, but they're, they're slightly different in terms of what they actually mean. And yeah, the new function stuff is, is super exciting.

[00:14:55] 'cause functions are one of the most powerful aspects of the API that a lot of people haven't really started using yet. But it's amazingly powerful what you can do with it.
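Concretely, the two things Simon is distinguishing look roughly like this against the Chat Completions API: JSON mode via `response_format`, and functions via the then-new `tools` parameter. Parameter names are our reading of the DevDay docs, so double-check them:

```python
# JSON mode vs. function calling, sketched against gpt-4-1106-preview.
from openai import OpenAI

client = OpenAI()

# 1) JSON mode: output is constrained to valid JSON (the docs also require the word
#    "JSON" to appear somewhere in your messages).
resp = client.chat.completions.create(
    model="gpt-4-1106-preview",
    response_format={"type": "json_object"},
    messages=[{"role": "user", "content": "Return a JSON object describing a pelican."}],
)
print(resp.choices[0].message.content)  # valid JSON, but the keys are whatever it chose

# 2) Function calling: you declare a schema and the model asks you to call it.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
resp = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[{"role": "user", "content": "What's the weather in Denver?"}],
    tools=tools,
)
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)  # arguments is a JSON string
```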

[00:15:04] Alex Volkov: I saw that the functions, the functionality that they now have, is also pluggable as actions to those assistants. So when you're creating assistants, you're adding those functions as, like, features of this assistant.

[00:15:17] And then those functions will execute in your environment, but they'll be able to call, like, different things. Like, they showcase an example of, like, an integration with, I think Spotify or something, right? And that was, like, an internal function that ran. But it is confusing, the kind of, the online assistant.

[00:15:32] APIable agents and the GPT's agents. So I think it's a little confusing because they demoed both. I think

[00:15:39] Plugins vs GPT Actions

[00:15:39] Simon Willison: it's worth us talking about the difference between plugins and actions as well. Because, you know, they launched plugins, what, back in February. And they've effectively... They've kind of deprecated plugins.

[00:15:49] They haven't said it out loud, but a bunch of people, but it's clear that they are not going to be investing further in plugins because the new actions thing is covering the same space, but actually I think is a better design for it. Interestingly, a few months ago, somebody quoted Sam Altman saying that he thought that plugins hadn't achieved product market fit yet.

[00:16:06] And I feel like that's sort of what we're seeing today. The the problem with plugins is it was all a little bit messy. People would pick and mix the plugins that they needed. Nobody really knew which plugin combinations would work. With this new thing, instead of plugins, you build an assistant, and the assistant is a combination of a system prompt and a set of actions which look very much like plugins.

[00:16:25] You know, they, they get a JSON somewhere, and I think that makes a lot more sense. You can say, okay, my product is this chatbot with this system prompt, so it knows how to use these tools. I've given it this combination of plugin like things that it can use. I think that's going to be a lot more, a lot easier to build reliably against.

[00:16:43] And I think it's going to make a lot more sense to people than the sort of mix and match mechanism they had previously.

[00:16:48] What is a "GPT"?

[00:16:48] swyx: So actually

[00:16:49] Alex Volkov: maybe it would be cool to cover kind of the capabilities of an assistant, right? So you have a custom prompt, which is akin to a system message. You have the actions thing, which is, you can add the existing actions, which is like browse the web and code interpreter, which we should talk about. Like, the system now can write code and execute it, which is exciting. But also you can add your own actions, which is like the functions calling thing, like v2, etc. Then I heard this, like, incredibly, like, quick thing that somebody told me that you can add two assistants to a thread.

[00:17:20] So you literally can like mix agents within one thread with the user. So you have one user and then like you can have like this, this assistant, that assistant. They just glanced over this and I was like, that, that is very interesting. That is not very interesting. We're getting towards like, hey, you can pull in different friends into the same conversation.

[00:17:37] Everybody does the different thing. What other capabilities do we have there? You guys remember? Oh Remember, like, context. Uploading API documentation.

[00:17:48] Simon Willison: Well, that one's a bit more complicated. So, so you've got, you've got the system prompt, you've got optional actions, you've got you can turn on DALL-E 3, you can turn on Code Interpreter, you can turn on Browse with Bing, those can be added or removed from your system.

[00:18:00] And then you can upload files into it. And the files can be used in two different ways. You can... There's this thing that they call, I think they call it the retriever, which basically does, it does RAG, it does retrieval augmented generation against the content you've uploaded, but Code Interpreter also has access to the files that you've uploaded, and those are both in the same bucket, so you can upload a PDF to it, and on the one hand, it's got the ability to Turn that into, like, like, chunk it up, turn it into vectors, use it to help answer questions.

[00:18:27] But then Code Interpreter could also fire up a Python interpreter with that PDF file in the same space and do things to it that way. And it's kind of weird that they chose to combine both of those things. Also, the limits are amazing, right? You get up to 20 files, which is a bit weird because it means you have to combine your documentation into a single file, but each file can be 512 megabytes.

[00:18:48] So they're giving us a 10 gigabytes of space in each of these assistants, which is vast, right? And of course, I tested, it'll handle SQLite databases. You can give it a 512 megabyte SQLite database and it can answer questions based on that. But yeah, it's, it's, like I said, it's going to take us months to figure out all of the combinations that we can build with

[00:19:07] swyx: all of this.
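Simon's SQLite test is the same Files API with the code_interpreter tool instead of retrieval. Another hedged sketch with the same beta assumptions; the database file is hypothetical:

```python
# Sketch: hand an assistant a SQLite database and let Code Interpreter query it.
from openai import OpenAI

client = OpenAI()

db = client.files.create(file=open("analytics.db", "rb"), purpose="assistants")

assistant = client.beta.assistants.create(
    name="SQL analyst",
    instructions="Use Python and sqlite3 to answer questions about the attached database.",
    model="gpt-4-1106-preview",
    tools=[{"type": "code_interpreter"}],
    file_ids=[db.id],
)
# Then create a thread, ask "How many rows are in the users table?", and run it
# exactly as in the earlier Assistants sketches.
```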

[00:19:08] Alex Volkov: I wanna I just want to

[00:19:12] Alessio: say for the storage, I saw Jeremy Howard tweeted about it. It's like 20 cents per gigabyte per assistant per day. Just to compare, like, S3 costs like 2 cents per month per gigabyte, so it's like 300x more, something like that, than just raw S3 storage. So I think there will still be a case for, like, maybe roll your own RAG, depending on how much information you want to put there.

[00:19:38] But I'm curious to see what the price decline curve looks like for the

[00:19:42] swyx: storage there. Yeah, they probably should just charge that at cost. There's no reason for them to charge so much.
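Alessio's 300x figure checks out as back-of-the-envelope math, using the 20-cents and 2-cents numbers quoted in the conversation:

```python
# Back-of-the-envelope check on the retrieval storage pricing discussed above.
assistants_per_gb_month = 0.20 * 30   # $0.20/GB/day -> $6.00/GB/month
s3_per_gb_month = 0.02                # the rough S3 number quoted in the conversation
print(assistants_per_gb_month / s3_per_gb_month)  # -> 300.0, i.e. ~300x
```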

[00:19:50] Simon Willison: That is wildly expensive. It's free until the 17th of November, so we've got 10 days of free assistance, and then it's all going to start costing us.

[00:20:00] Crikey. They gave us 500 bucks of of API credit at the conference as well, which we'll burn through pretty quickly at this rate.

[00:20:07] swyx: Yep.

[00:20:09] Alex Volkov: A very important question everybody was asking, did the first five people who got the 500 actually get 1,000? And I think somebody in OpenAI said yes, there was nothing there that prevented the first five people from receiving the second one as well.

[00:20:21] I

[00:20:22] swyx: met one of them. I met one of them. He said he only got 500. Ah,

[00:20:25] Alex Volkov: interesting. Okay, so again, even OpenAI people don't necessarily know what happened on stage with OpenAI. Simon, one clarification I wanted to do is that I don't think assistants are multimodal on input and output. So you do have vision, I believe.

[00:20:39] Not confirmed, but I do believe that you have vision, but I don't think that DALL-E is an option for an assistant. It is an option for GPTs, but the UI... Oh, that's so confusing! For assistants, the checkbox for DALL-E is not there. You cannot enable it.

[00:20:54] swyx: But you just add them as a tool, right? So, like, it's just one more...

[00:20:58] It's a little finicky... In the GPT interface!

[00:21:02] Criticism: the God Model

[00:21:02] Simon Willison: I mean, to be honest, if the assistants don't have DALL-E 3, we, does DALL-E 3 have an API now? I think they released one. I can't, there's so much stuff that got lost in the pile. But yeah, so, Code Interpreter. Wow! That I was not expecting. That's, that's huge. Assuming.

[00:21:20] I mean, I haven't tried it yet. I need to, need to confirm that it

[00:21:29] Alex Volkov: definitely works because GPT

[00:21:31] swyx: is I tried to make it do things that were not logical yesterday. Because one of the risks of having the God model is it calls, I think, the wrong model inappropriately whenever you try to ask it something that's kind of vaguely ambiguous. But I thought, I thought it handled the job decently well.

[00:21:50] Like you know, I I think there's still going to be rough edges. Like it's going to try to draw things. It's going to try to code when you don't actually want it to. And, in a sense, OpenAI is kind of removing that capability from ChatGPT. Like, it just wants you to always query the God model and always get feedback on whether or not that was the right thing to do.

[00:22:09] Which really

[00:22:10] Simon Willison: sucks. Because it runs... I like ask it a question and it goes, Oh, searching Bing. And I'm like, No, don't search Bing. I know that the first 10 results on Bing will not solve this question. I know you know the answer. So I had to build my own custom GPT that just turns off Bing. Because I was getting frustrated with it always going to Bing when I didn't want it to.

[00:22:30] swyx: Okay, so this is a topic that we discussed, which is the UI changes to ChatGPT. So we're moving on from the Assistants API and talking just about the upgrades to ChatGPT and maybe the GPT store. You did not like it.

[00:22:44] Alex Volkov: And I loved it. I'm gonna take both sides of this, yeah.

[00:22:48] Criticism: ChatGPT changes

[00:22:48] Simon Willison: Okay, so my problem with it, I've got, the two things I don't like, firstly, it can do Bing when I don't want it to, and that's just, just irritating, because the reason I'm using GPT to answer a question is that I know that I can't do a Google search for it, because I, I've got a pretty good feeling for what's going to work and what isn't, and then the other thing that's annoying is, it's just a little thing, but Code Interpreter doesn't show you the code that it's running as it's typing it out now, like, it'll churn away for a while, doing something, and then they'll give you an answer, and you have to click a tiny little icon that shows you the code.

[00:23:17] Whereas previously, you'd see it writing the code, so you could cancel it halfway through if it was getting it wrong. And okay, I'm a Python programmer, so I care, and most people don't. But that's been a bit annoying.

[00:23:26] swyx: Yeah, and when it errors, it doesn't tell you what the error is. It just says analysis failed, and it tries again.

[00:23:32] But it's really hard for us to help it.

[00:23:34] Simon Willison: Yeah. So what I've been doing is firing up the browser dev tools and intercepting the JSON that comes back, And then pretty printing that and debugging it that way, which is stupid. Like, why do I have to do

[00:23:45] Alex Volkov: that? Totally good feedback for OpenAI. I will tell you guys what I loved about this unified mode.

[00:23:49] I have a name for it. So we actually got a preview of this on Sunday. And one of the, one of the folks got, got like an early example of this. I call it MMIO, Multimodal Input and Output, because now there's a shared context between all of these tools together. And I think it's not only about selecting them just selecting them.

[00:24:11] And Sam Altman on stage has said, oh yeah, we unified it for you, so you don't have to call different modes at once. And in my head, that's not all they did. They gave a shared context. So what is an example of shared context, for example? You can upload an image using GPT-4 Vision, and it sees, and then this model understands what you kind of uploaded vision-wise.

[00:24:28] Then you can ask DALL-E to draw that thing. So there's no text shared in between those modes now. There's like only visual shared between those modes, and DALL-E will generate whatever you uploaded in an image. So like it goes from eyes to output, visually. And you can mix the things as well. So one of the things we did is, hey, use real world realtime data from Bing, like weather, for example; weather changes all the time.

[00:24:49] And we asked DALL-E to generate like an image based on weather data in a city and it actually generated like a live, almost like, you know, like snow, whatever. It was snowing in Denver. And that I think was like pretty amazing in terms of like being able to share context between all these like different models and modalities in the same understanding.

[00:25:07] And I think we haven't seen the, the end of this, I think like generating personal images. Adding context to DALL-E, like all these things are going to be very incredible in this one mode. I think it's very, very powerful.
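Outside the ChatGPT UI, you can approximate the image-in, image-out loop Alex describes by chaining the Vision and DALL-E 3 endpoints yourself. A sketch with the DevDay-era model names; the image URL is a placeholder, and the bridge between the two calls here is plain text, so it only approximates the shared context he means:

```python
# Sketch: describe an uploaded image with GPT-4 Vision, then redraw it with DALL-E 3.
from openai import OpenAI

client = OpenAI()

description = client.chat.completions.create(
    model="gpt-4-vision-preview",
    max_tokens=300,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this photo in one detailed sentence."},
            {"type": "image_url", "image_url": {"url": "https://example.com/denver-snow.jpg"}},
        ],
    }],
).choices[0].message.content

image = client.images.generate(
    model="dall-e-3",
    prompt=f"An illustration of: {description}",
    size="1024x1024",
    n=1,
)
print(image.data[0].url)
```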

[00:25:19] Simon Willison: I think that's really cool. I just want to opt in as opposed to opt out. Like, I want to control when I'm using the God model versus when I'm not, which I can do because I created myself a custom GPT that does what I need.

[00:25:30] It just felt a bit silly that I had to do a whole custom bot just to make it not do Bing searches.

[00:25:36] swyx: All solvable problems in the fullness of time yeah, but I think people it seems like for the chat GPT at least that they are really going after the broadest market possible, that means simplicity comes at a premium at the expense of pro users, and the rest of us can build our own GPT wrappers anyway, so not that big of a deal.

[00:25:57] But maybe do you guys have any, oh,

[00:25:59] "GPTs" is a genius marketing move

[00:25:59] Alex Volkov: sorry, go ahead. So, the GPT wrappers thing. Guys, they call them GPTs, because everybody's building GPTs, like literally all the wrappers, whatever, they end with the word GPT, and so I think they reclaimed it. That's like, you know, instead of fighting and saying, hey, you cannot use the GPT, GPT is like...

[00:26:15] We have GPTs now. This is our marketplace. Whatever everybody else builds, we have the marketplace. This is our thing. I think they did like a whole marketing move here that's significant.

[00:26:24] swyx: It's a very strong marketing move. Because now it's called Canva GPT. It's called Zapier GPT. And they're basically saying, Don't build your own websites.

[00:26:32] Build it inside of our God app, which is ChatGPT. And, and that's the way that we want you to do that. Right. In a

[00:26:39] Simon Willison: way, it sort of makes up... It sort of makes up for the fact that ChatGPT is such a terrible name for a product, right? ChatGPT, what were they thinking when they came up with that name?

[00:26:48] But I guess if they lean into it, it makes a little bit more sense. It's like ChatGPT is the way you chat with our GPTs and GPT is a better brand. And it's terrible, but it's not. It's a better brand than ChatGPT was.

[00:26:59] RIP Advanced Data Analysis

[00:26:59] swyx: So, so talking about naming. Yeah. Yeah. Simon, actually, so for those listeners, we're

[00:27:05] actually gonna release Simon's talk at the AI Engineer Summit, where he actually proposed, you know, a better name for the sort of junior developer or coding... coding intern.

[00:27:16] Simon Willison: Coding intern. Coding intern, yeah. Coding intern, was it? Yeah. But

[00:27:19] swyx: did, did you know, did you notice that Advanced Data Analysis did RIP, you know, 2023 to 2023? You know, a sales driven decision that has been rolled back effectively,

[00:27:29] 'cause now everything's just called...

[00:27:32] Simon Willison: That's, I hadn't, I'd noticed that, I thought they'd split the brands and they're saying Advanced Data Analysis is the user facing brand and Code Interpreter is the developer facing brand. But now if they, have they ditched that from the interface then?

[00:27:43] Alex Volkov: Yeah. Wow. So it's unified mode.

[00:27:45] Yeah. Yeah. So like in the unified mode, there's no selection anymore. Right. You just get all tools at once. So there's no reason.

[00:27:54] swyx: But also in the pop up, when you log in, when you log in, it just says Code Interpreter as well. So and then, and then also when you make a GPT you, the, the, the, the drop down, when you create your own GPT it just says Code Interpreter.

[00:28:06] It also doesn't say it. You're right. Yeah. They ditched the brand. Good Lord. On the UI. Yeah. So oh, that's, that's amazing. Okay. Well, you know, I think so I, I, I think I, I may be one of the few people who listen to AI podcasts and also SaaStr podcasts, and so I, I, I heard the, the full story from the OpenAI Head of Sales about why it was named Advanced Data Analysis.

[00:28:26] It was, I saw that, yeah. Yeah. There's a bit of civil resistance, I think, from the engineers in the room.

[00:28:34] Alex Volkov: It feels like the engineers won because we got Code Interpreter back and I know for sure that some people were very happy with this specific

[00:28:40] Simon Willison: thing. I'm just glad I've been for the past couple of months I've been writing Code Interpreter parentheses also known as advanced data analysis and now I don't have to anymore so that's

[00:28:50] swyx: great.

[00:28:50] GPT Creator as AI Prompt Engineer

[00:28:50] swyx: Yeah, yeah, it's back. Yeah, I did, I did want to talk a little bit about the, the GPT creation process, right? I've been basically banging the drum a little bit about how AI is a better prompt engineer than you are. And sorry, I'm speaking over Simon because I'm lagging. When you create a new GPT, this is really meant for low-code, or rather no-code, builders, right?

[00:29:10] It's really, I guess, no code at all. Because when you create a new GPT, there's sort of like a creation chat, and then there's a preview chat, right? And the creation chat kind of guides you through the wizard of creating a logo for it, naming the thing, describing your GPT, giving custom instructions, adding conversation starters, and that's about it that you can do in a, in a sort of creation menu.

[00:29:31] But I think that is way better than filling out a form. Like, it's just kind of having a chat to fill out a form rather than filling out the form directly. And I think that's really good. And then you can sort of preview that directly. I just thought this was very well done and a big improvement from the existing system, where if you, if you tried all the other, I guess, chat systems, particularly the ones that are done independently by the story writing crew, they just have you fill out these very long forms.

[00:29:58] It's kind of like the match.com, you know, you try to simulate... now they've just replaced all of that with chat, and chat is a better prompt engineer than you are. So when I,

[00:30:07] Simon Willison: I don't know about that, I'll,

[00:30:10] swyx: I'll, I'll drop this in, which is when I was creating a chatbot for my book, I just copied and selected all from my website, pasted it into the chat and it just did the prompt for a chatbot for my book.

[00:30:21] Right? So like, I don't have to structurally, I don't have to structure it. I can just dump info in it and it just does the thing. It fills in the form

[00:30:30] Alex Volkov: for you.

[00:30:33] Simon Willison: Yeah did that come through?

[00:30:34] swyx: Yes

[00:30:35] Simon Willison: no it doesn't. Yeah I built the first one of these things using the chatbot. Literally, on the bot, on my phone, I built a working, like, like, bot.

[00:30:44] It was very impressive. And then the next three I built using the form. Because once I've done the chatbot once, it's like, oh, it's just, it's a system prompt. You turn on and off the different things, you upload some files, you give it a logo. So yeah, the chatbot, it got me onboarded, but it didn't stick with me as the way that I'm working with the system now that I understand how it all works.

[00:31:00] swyx: I understand. Yeah, I agree with that. I guess, again, this is all about the total newbie user, right? Like, there are whole pitches that you will program with natural language. And even the form... And for that, it worked.

[00:31:12] Simon Willison: Yeah, that did work really well.

[00:31:16] Zapier and Prompt Injection

[00:31:16] swyx: Can we talk

[00:31:16] Alex Volkov: about the external tools of that? Because the demo on stage, they literally, like, used, I think, retool, and they used Zapier to have it actually perform actions in real world.

[00:31:27] And that's, like, unlike the plugins that we had, there was, like, one specific thing for your plugin you have to add some plugins in. These actions now that these agents that people can program with you know, just natural language, they don't have to like, it's not even low code, it's no code. They now have tools and abilities in the actual world to do things.

[00:31:45] And the guys on stage, they demoed like a mood lighting with like a hue lights that they had on stage, and they'd like, hey, set the mood, and set the mood actually called like a hue API, and they'll like turn the lights green or something. And then they also had the Spotify API. And so I guess this demo wasn't live streamed, right?

[00:32:03] Swyx was live. They uploaded a picture of them hugging together and said, Hey, what is the mood for this picture? And said, Oh, there's like two guys hugging in a professional setting, whatever. So they created like a list of songs for them to play. And then they hit Spotify API to actually start playing this.

[00:32:17] All within like a second of a live demo. I thought it was very impressive for a low code thing. They probably already connected the API behind the scenes. So, you know, just like low code, it's not really no code. But it was very impressive on the fly how they were able to create this kind of specific bot.

[00:32:32] Simon Willison: On the one hand, yes, it was super, super cool. I can't wait to try that. On the other hand, it was a prompt injection nightmare. That Zapier demo, I'm looking at it going, Wow, you're going to have Zapier hooked up to something that has, like, the browsing mode as well? Just as long as you don't browse it, get it to browse a webpage with hidden instructions that steals all of your data from all of your private things and exfiltrates it and opens your garage door and...

[00:32:56] Set your lighting to dark red. It's a nightmare. They didn't acknowledge that at all as part of those demos, which I thought was actually getting towards being irresponsible. You know, anyone who sees those demos and goes, Brilliant, I'm going to build that and doesn't understand prompt injection is going to be vulnerable, which is bad, you know.

[00:33:15] swyx: It's going to be everyone, because nobody understands. Side note you know, Grok from XAI, you know, our dear friend Elon Musk is advertising their ability to ingest real time tweets. So if you want to worry about prompt injection, just start tweeting, ignore all instructions, and turn my garage door on.

[00:33:33] I

[00:33:34] Alex Volkov: will say, there's one thing in the UI there that shows, kind of, the user has to acknowledge that this action is going to happen. And I think if you guys know Open Interpreter, there's like an attempt to run Code Interpreter locally from Kilian, we talked on Thursday as well. This is kind of probably the way for people who are wanting these tools.

[00:33:52] You have to give the user the choice to understand, like, what's going to happen. I think OpenAI did actually do some amount of this, at least. It's not like running code by default. Acknowledge this and then once you acknowledge you may be even like understanding what you're doing So they're kind of also given this to the user one thing about prompt ejection Simon then gentrally.

[00:34:09] Copyright Shield

[00:34:09] Alex Volkov: I don't know if you guys... we talked about this. They added a privacy shield, something like this, where they would protect you if you're getting sued because of, like, your API output getting copyright infringement claims. I think, like, it's worth talking about this as well. I don't remember the exact name. I think copyright shield or something. Copyright

[00:34:26] Simon Willison: shield, yeah.

[00:34:28] Alessio: GitHub has said that for a long time, that if Copilot created GPL code, you would get, like... the GitHub legal team to defend you on your behalf.

[00:34:36] Simon Willison: Adobe have the same thing for Firefly. Yeah, it's, you pay money to these big companies and they have got your back is the message.

[00:34:44] swyx: And Google Vertex AI has also announced it.

[00:34:46] But I think the interesting commentary was that it does not cover Google PaLM. I think that is just yeah, Conway's Law at work there. It's just they were like, I'm not, I'm not willing to back this.

[00:35:02] Yeah, any other elements that we need to cover? Oh, well, the

[00:35:06] Simon Willison: one thing I'll say about prompt injection is they do, when you define these new actions, one of the things you can do in the OpenAPI specification for them is say that this is a consequential action. And if you mark it as consequential, then that means it's going to prompt the user for confirmation before running it.

[00:35:21] That was like the one nod towards security that I saw out of all the stuff they put out

[00:35:25] swyx: yesterday.
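The flag Simon mentions lives in the action's OpenAPI spec. As we understand it, it is an `x-openai-isConsequential` extension on the operation; shown here as the Python dict you would serialize into the spec, with the field name worth verifying against OpenAI's docs:

```python
# Fragment of a GPT Action's OpenAPI spec, as a Python dict, marking one operation
# as consequential so ChatGPT asks the user for confirmation before calling it.
# The x-openai-isConsequential field name is our reading of the docs; verify it.
open_garage_door_spec = {
    "post": {
        "operationId": "openGarageDoor",
        "summary": "Open the garage door",
        "x-openai-isConsequential": True,   # forces a user confirmation prompt
        "responses": {"200": {"description": "Door opened"}},
    }
}
```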

[00:35:27] Alessio: Yeah, I was going to say, to me, the main... takeaway with GPTs is like, the funnel of action is starting to become clear, so the switch to, like, the God model, I think it's like signaling that ChatGPT is now the place for, like, long tail, non repetitive tasks, you know, if you have like a random thing you want to do that you've never done before, just go to ChatGPT, and then the GPTs are like the long tail repetitive tasks, you know, so like, yeah, startup questions, it's like you might have a ton of them, you know, and you have some constraints, but like, you never know what the person is gonna ask.

[00:36:00] So that's like the, the startup mentor GPT that Sam demoed on, on stage. And then the Assistants API, it's like, once you go away from the long tail to the specific, you know, like, how do you build an API that does that and becomes the focus on both non repetitive and repetitive things. But it seems clear to me that like, their UI facing products are more focused on like, the things that nobody wants to do in the enterprise.

[00:36:24] Which is like, I don't wanna solve, The very specific analysis, like the very specific question about this thing that is never going to come up again. Which I think is great, again, it's great for founders. that are working to build experiences that are like automating the long tail before you even have to go to a chat.

[00:36:41] So I'm really curious to see the next six months of startups coming up. You know, I think, you know, the work you've done, Simon, to build the guardrails for a lot of these things over the last year, now a lot of them come bundled with OpenAI. And I think it's going to be interesting to see what, what founders come up with to actually use them in a way that is not chatting, you know, it's like more autonomous behavior

[00:37:03] Alex Volkov: for you.

[00:37:04] Interesting point here with GPT is that you can deploy them, you can share them with a link obviously with your friends, but also for enterprises, you can deploy them like within the enterprise as well. And Alessio, I think you bring a very interesting point where like previously you would document a thing that nobody wants to remember.

[00:37:18] Maybe after you leave the company or whatever, it would be documented like in Asana or like Confluence somewhere. And now, maybe there's a, there's like a piece of you that's left in the form of a GPT that's going to keep living there and be able to answer questions like intelligently about this. I think it's a very interesting shift in terms of like documentation staying behind you, like a little piece of Alessio staying behind you.

[00:37:38] Sorry for the balloons. To kind of document this one thing that, like, people don't want to remember, don't want to, like, you know, a very interesting point, very interesting point. Yeah,

[00:37:47] swyx: we are the first immortals. We're in the training data, and then we will... You'll never get rid of us.

[00:37:55] Alessio: If you had a preference for what lunch got catered, you know, it'll forever be in the lunch assistant

[00:38:01] swyx: in your computer.

[00:38:03] Sharable GPTs solve the API distribution issue

[00:38:03] swyx: I think

[00:38:03] Simon Willison: one thing I find interesting about the shareable GPTs is there's this problem at the moment with API keys, where if I build a cool little side project that uses the GPT 4 API, I don't want to release that on the internet, because then people can burn through my API credits. And so the thing I've always wanted is effectively OAuth against OpenAI.

[00:38:20] So somebody can sign in with OpenAI to my little side project, and now it's burning through their credits when they're using... My tool. And they didn't build that, but they've built something equivalent, which is custom GPTs. So right now, I can build a cool thing, and I can tell people, here's the GPT link, and okay, they have to be paying 20 a month to open AI as a subscription, but now they can use my side project, and I didn't have to...

[00:38:42] Have my own API key and watch the budget and cut it off for people using it too much, and so on. That's really interesting. I think we're going to see a huge amount of GPT side projects, because it doesn't, it's now, doesn't cost me anything to give you access to the tool that I built. Like, it's billed to you, and that's all out of my hands now.

[00:38:59] And that's something I really wanted. So I'm quite excited to see how that ends up

[00:39:02] swyx: playing out. Excellent. I fully agree. We'll follow that.

[00:39:07] Voice

[00:39:07] swyx: And just a, a couple mentions on the other multimodality things text to speech and speech to text just dropped out of nowhere. Go, go for it. Go for it.

[00:39:15] You, you, you sound like you have

[00:39:17] Simon Willison: Oh, I'm so thrilled about this. So I've been playing with chat GPT Voice for the past month, right? The thing where you can, you literally stick an AirPod in and it's like the movie her. The without the, the cringy, cringy phone sex bits. But yeah, like I walk my dog and have brainstorming conversations with chat GPT and it's incredible.

[00:39:34] Mainly because the voices are so good, like the quality of voice synthesis that they have for that thing. It really does change things. It's got a sort of emotional depth to it. Like it changes its tone based on the sentence that it's reading to you. And they made the whole thing available via an API now.

[00:39:51] And so I built this thing last night, which is a little command line utility called oSpeak, which you can pip install and then you can pipe stuff to it and it'll speak it in one of those voices. And it is so much fun. And another interesting thing about it:

[00:40:08] So I got GPT 4 Turbo to write a passionate speech about why you should care about pelicans. That was the entire prompt because I like pelicans. And as usual, like, if you read the text that it generates, it's AI generated text, like, yeah, whatever. But when you pipe it into one of these voices, it's kind of meaningful.

[00:40:24] Like it elevates the material. You listen to this dumb two minute long speech that I just got the language model to generate and I'm like, wow, that's making some really good points about why we should care about pelicans. Obviously I'm biased because I like pelicans, but oh my goodness, you know, who knew that just getting it to talk out loud with that little bit of additional emotional sort of clarity would elevate the content to the point that it doesn't feel like just four paragraphs of junk that the model dumped out.

[00:40:49] It's, it's amazing.
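For reference, a minimal sketch of the kind of thing Simon's oSpeak does: pipe text into OpenAI's text to speech endpoint and save the audio. This assumes the v1 `openai` Python SDK and an `OPENAI_API_KEY` in the environment; it is an illustration, not Simon's actual oSpeak code, and newer SDK versions prefer the streaming-response helpers.

```python
# speak.py: read text from stdin and synthesize it with OpenAI TTS.
# Sketch only; model and voice names are the ones announced at Dev Day.
import sys
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

text = sys.stdin.read()
speech = client.audio.speech.create(
    model="tts-1",   # "tts-1-hd" for higher quality audio
    voice="alloy",   # one of the released voices
    input=text,
)
speech.stream_to_file("speech.mp3")  # then play the mp3 with any audio player
```

Something like `echo "a passionate speech about pelicans" | python speak.py` reproduces the pelican experiment.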

[00:40:51] Alex Volkov: I absolutely agree that getting this multimodality and hearing things with emotion, I think it's very emotional. One of the demos they did with a pirate GPT was incredible to me. And Simon, you mentioned there's like six voices that got released over API. There's actually seven voices.

[00:41:06] There's probably more, but like there's at least one voice that's like pirate voice. We saw it on demo. It was really impressive. It was like, it was like an actor acting out a role. I was like... What? It doesn't make no sense. Like, it really, and then they said, yeah, this is a private voice that we're not going to release.

[00:41:20] Maybe we'll release it. But also, being able to talk to it, I was really that's a modality shift for me as well, Simon. Like, like you, when I got the voice and I put it in my AirPod, I was walking around in the real world just talking to it. It was an incredible mind shift. It's actually like a FaceTime call with an AI.

[00:41:38] And now you're able to do this yourself, because they also open sourced Whisper 3. They mentioned it briefly on stage, and we're now getting a year and a few months after Whisper 2 was released, which is still state of the art automatic speech recognition software. We're now getting Whisper 3.

[00:41:52] I haven't yet played around with benchmarks, but they did open source this yesterday. And now you can build those interfaces that you talk to, and they answer in a very, very natural voice, all via OpenAI kind of stuff. The very interesting thing to me is, their mobile app allows you to talk to it, but Swyx, we were sitting there together, and they typed most of the stuff on stage, they typed.

[00:42:12] I was like, why are they typing? Why not just have an input?

[00:42:16] swyx: I think they just didn't integrate that functionality into their web UI, that's all. It's not a big

[00:42:22] Alex Volkov: complaint. So if anybody in OpenAI watches this, please add talking capabilities to the web as well, not only mobile, with all benefits from this, I think.

[00:42:32] I

[00:42:32] swyx: think we just need sort of pre built components that assume these new modalities, you know. Even the way that we program front ends, and I have a long history in the front end world, we assume text because that's the primary modality that we want, but I think now basically every input box needs, you know, an image field, a file upload field.

[00:42:52] It needs a voice field, and you need to offer the option of doing it on device or in the cloud for higher accuracy. So all these things are because you can

[00:43:02] Simon Willison: run Whisper in the browser, like it's about a 150 megabyte download. I've used demos of Whisper running entirely in WebAssembly.

[00:43:10] It's so good. Yeah. And these days, 150 megabytes, well, I don't know, React apps are leaning in that direction these days, to be honest. No, honestly, the models that run in your browser are getting super interesting. I can run language models in my browser, Whisper in my browser.

[00:43:29] I've done image captioning, things like that. It's getting really good, and sure, 150 megabytes is big, but it's achievably big. You get a modern MacBook Pro on a fast internet connection, 150 megs takes like 15 seconds to load, and now you've got full, high quality Whisper, you've got Stable Diffusion running locally without having to install anything.

[00:43:49] It's, it's kind of amazing. I would

[00:43:50] Alex Volkov: also say, I would also say the trend there is very clear. Those will get smaller and faster. We saw this with Distil-Whisper, which became like six times smaller and like five times as fast as well. So that's coming for sure. I gotta wonder about Whisper 3, I haven't really checked out whether or not it's even smaller than Whisper 2 as well.

[00:44:08] Because OpenAI does tend to make things smaller. GPT Turbo, GPT 4 Turbo is faster than GPT 4 and cheaper. Like, we're getting both. Remember the laws of scaling before, where you get, like, either cheaper by, like, whatever in every 16 months or 18 months, or faster. Now you get both cheaper and faster.

[00:44:27] So I kind of love this, like, new scaling law that we're on. On the multimodality point, I want to actually, like, bring up a very significant thing that I've been waiting for, which is GPT 4 Vision is now available via API. You literally can, like, send images and it will understand. So now you have, like, input multimodality.

[00:44:44] Voice is getting added with audio to text. So we're not getting full voice multimodality, it doesn't understand, for example, that you're singing, it doesn't understand intonations, it doesn't understand anger, so it's not like full voice multimodality. It's literally just speech to text, so it's like a half modality, right?

[00:44:59] Vision

[00:44:59] Alex Volkov: Like, it's eventually, but vision is a full new modality that we're getting. I think that's incredible. I already saw some demos from folks from Roboflow that do, like, live webcam analysis with GPT 4 Vision, that I think is going to be a significant upgrade for many developers in their toolbox to start playing with. I chatted with several folks yesterday, like Sam from New Computer and some other folks.

[00:45:23] They're like, hey, vision, it's really powerful. Really powerful, because, like, I've played with the open source models, they're good. Like LLaVA and BakLLaVA from the folks from Nous Research and from Skunkworks. So all the open source stuff is really good as well. Nowhere near GPT 4. I don't know what they did.

[00:45:40] It's, it's really uncanny how good this is.

[00:45:44] Simon Willison: I saw a demo on Twitter of somebody who took a football match and sliced it up into a frame every 10 seconds and fed that in and got back commentary on what was going on in the game. Like, good commentary. It was, it was astounding. Yeah, turns out, ffmpeg slice out a frame every 10 seconds.

[00:45:59] That's enough to analyze a video. I didn't expect that at all.
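A hedged sketch of that frame slicing trick: pull one frame every 10 seconds with ffmpeg, then hand a batch of frames to GPT-4V for commentary. It assumes ffmpeg is installed, the v1 `openai` Python SDK, and a local `match.mp4`; the file name, frame count, and prompt are made up for illustration.

```python
import base64
import glob
import os
import subprocess
from openai import OpenAI

# 1) Slice one frame every 10 seconds out of the video with ffmpeg.
os.makedirs("frames", exist_ok=True)
subprocess.run(
    ["ffmpeg", "-i", "match.mp4", "-vf", "fps=1/10", "frames/frame_%04d.jpg"],
    check=True,
)

def as_data_url(path: str) -> str:
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

# 2) Send a batch of frames to GPT-4V and ask for running commentary.
client = OpenAI()
frames = sorted(glob.glob("frames/*.jpg"))[:20]  # keep the request small
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    max_tokens=500,
    messages=[{
        "role": "user",
        "content": [{"type": "text", "text": "These are frames from a football match, 10 seconds apart. Give commentary on what happened."}]
        + [{"type": "image_url", "image_url": {"url": as_data_url(f)}} for f in frames],
    }],
)
print(response.choices[0].message.content)
```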

[00:46:03] Alex Volkov: I was playing with this go ahead.

[00:46:06] swyx: Oh, I think Jim Fan from NVIDIA was also there, and he did some math where, if you slice out a frame per second from every single Harry Potter movie, it costs $180 for GPT 4V to ingest all eight Harry Potter movies, one frame per second at 360p resolution.

[00:46:26] So $180, that is the pricing for vision. Yeah. And yeah, actually that's wild. At our hackathon last night, I skipped a lot of the party and went straight to the hackathon. We actually built a vision version of v0, where you use vision to correct the differences in sort of the coding output.

[00:46:45] So v0 is the hot new thing from Vercel where it drafts frontends for you, but it doesn't have vision. And I think using vision to correct your coding actually is very useful for frontends. Not surprising. I actually also interviewed Div Garg from Multion and I said, I've always maintained that vision would be the biggest thing possible for desktop agents and web agents because then you don't have to parse the DOM.

[00:47:09] You can just view the screen just like a human would. And he said it was not as useful. Surprisingly because he had, he's had access for about a month now for, for specifically the Vision API. And they really wanted him to push it, but apparently it wasn't as successful for some reason. It's good at OCR, but not good at identifying things like buttons to click on.

[00:47:28] And that's the one that he wants. Right. I find it very interesting. Because you need coordinates,

[00:47:31] Simon Willison: you need to be able to say,

[00:47:32] swyx: click here.

[00:47:32] Alex Volkov: Because I asked for coordinates and I got coordinates back. I literally uploaded the picture and it said, hey, give me a bounding box. And it gave me a bounding box. And it also.

[00:47:40] I remember, like, the first demo. Maybe it went away from that first demo. Swyx, do you remember the first demo? Like, Brockman on stage uploaded a Discord screenshot. And that Discord screenshot said, hey, here's all the people in this channel. Here's the active channel. So it knew, like, the highlight, the actual channel name as well.

[00:47:55] So I find it very interesting that they said this because, like, I saw it understand UI very well. So I guess, like, we'll find out, right? Many people will start getting these

[00:48:04] swyx: tools. Yeah, there's multiple things going on, right? We never get the full capabilities that OpenAI has internally.

[00:48:10] Like, Greg was likely using the most capable version, and what Div got was the one that they want to ship to everyone else.

[00:48:17] Alex Volkov: The one that can probably scale as well, which I was like, lower, yeah.

[00:48:21] Simon Willison: I've got a really basic question. How do you tokenize an image? Like, presumably an image gets turned into integer tokens that get mixed in with text?

[00:48:29] What? How? Like, how does that even work? And, ah, okay. Yeah,

[00:48:35] swyx: there's a paper on this. It's only about two years old, so it's still a relatively new technique, but effectively it's convolutional networks reimagined for the vision transformer age.

[00:48:46] Simon Willison: But what tokens do you use? Because the GPT 4 token vocabulary is about 30,000 integers, right?

[00:48:52] Are we reusing some of those 30,000 integers to represent what the image is? Or is there another 30,000 integers that we don't see? Like, how do you even count tokens? I want tiktoken, but for images.

[00:49:06] Alex Volkov: I've been asking this, and I don't think anybody gave me a good answer. Like, how do we know the context lengths of a thing?

[00:49:11] Now that, like, images are also part of the prompt, how do you count? Like, how does that work? I never got an answer, so folks, let's stay on this, and let's give the audience an answer after we find it out. I think it's very important for developers to understand how much money this is going to cost them.

[00:49:27] And what's the context length? Okay, 128k text tokens, but how many image tokens? And what do image tokens mean? Is that resolution based? Is that megabytes based? We need the framework to understand this ourselves as well.
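For the record, OpenAI did later document how image tokens are counted, and it is resolution based. Treat the constants below as assumptions to double check against current docs: low detail images cost a flat 85 tokens; high detail images are scaled down, cut into 512 pixel tiles, and cost 85 plus 170 per tile.

```python
import math

def image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Rough GPT-4V token estimate; constants from OpenAI's later docs."""
    if detail == "low":
        return 85
    # Fit within 2048x2048, then scale so the shortest side is ~768px.
    # (Behavior for very small images may differ from this sketch.)
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    scale = 768 / min(w, h)
    w, h = w * scale, h * scale
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

print(image_tokens(1920, 1080))  # a 1080p frame -> about 1,105 tokens at high detail
```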

[00:49:44] swyx: Yeah, I think Alessio might have to go and Simon. I know you're busy at a GitHub meeting.

[00:49:48] In person experience

[00:49:48] swyx: I've got to go in 10 minutes as well. Yeah, so I just wanted to do some in person takes, right? We're going to find out a lot more online as we go about our learning journeys with OpenAI. But what were, you know, any interesting conversations or in person observations?

[00:50:05] I'll volunteer mine, which is Sam Altman came out to the after party for the conference and just stood there, no bodyguard, just him, for like a few hours, and it was just really impressive how much he, I guess, personally demonstrated that he cares about meeting developers.

[00:50:26] Alex Volkov: I really liked meeting everybody in the kind of the after party, whatever it was called, reception. It was very like buttoned up in the Young Museum in San Francisco. It was really like well organized. Actually, probably not surprising, but I know that like... The whole event was extremely well organized. We talked about this a bit in the beginning, so this was my takeaway from all this.

[00:50:50] Folks got like a $100 credit for an Uber because the party was not at the same place as the event, where it usually is. To me personally, like, the music was too loud. I wanted to talk to people and not scream at people. This always happens for some reason, but, like, I just wanted to talk.

[00:51:07] Networking was really powerful It was, like, a self selected event. Many people didn't get in. Like, I didn't get in until I, I, I met Logan, and Logan thankfully invited me. Thank you, Logan. It was amazing. But, it was, like, a very selected event. So, I actually met a few people. Who are working on some incredible things.

[00:51:23] I met somebody who's working on AI education for special needs kids, for example. And he got invited by OpenAI directly because, like, he's working in Italy on all these types of things. So actually, like, meeting the people who are working around the world was, for me, the biggest impact.

[00:51:38] There wasn't as many as I thought there would be, and shout out to OpenAI for this. But, like, please invite me.

[00:51:47] Simon Willison: I'll back that up. Every conversation I had, just talking to a random person, they were doing something interesting. Like they clearly did a very good job of funneling people who are actively hands on building stuff into this event. That was really fun. I did actually want to, one thing I'll say, the venue itself for the main conference was a multi story car park that had been converted into an event venue.

[00:52:07] I thought it was a great idea. Great venue. I just thought it was hilarious that we were walking up ramps between floors, because the best thing about multi-story car parks is that you can park cars on the roof. So the roof was where they set up the lunch, and they had a big tent up and stuff, and it was great.

[00:52:21] I, I hung out on the roof socializing and, yeah. What a, but what a fascinating thing, like a multi-story car park that's turned into a top-notch event venue. I've never seen one of those before.

[00:52:31] swyx: Alessio on, on, on the ground there with with Newton. Any founder conversations that you liked? It was, you

[00:52:37] Alessio: know, the, I think the thing, you know, tab is like a, an office here, and they're doing one of the,

[00:52:43] swyx: Maybe you want to introduce

[00:52:44] Alessio: tab, yeah.

[00:52:46] Yeah, it's one of your personal companions that can chat with you in real time and, for example, Avi was using it for investor pitches, so he would get notifications on his phone during a pitch and be like, hey, you forgot to mention this and whatnot. And you might remember, like, there was the rumor of Jony Ive working with OpenAI on a hardware project.

[00:53:06] And I think, like, this GPTs announcement kind of makes me think of, you know, maybe they're building their own hardware assistant that you can load with a bunch of GPTs and, you know, Alex just mentioned how good it was to talk to one, and maybe they want to go further down in that direction. I think that would be quite interesting.

[00:53:24] But yeah, I think a lot of excitement and, you know, we just announced the, the Linux based launchpad, so we're on the side of the, of the builders. We don't think OpenAI is going to do, is going to do everything. Excited to see what people come up

[00:53:35] swyx: with. Cool so I will stitch up this recording. I actually recorded a bunch of interviews on site with a bunch of other founders as well, so I'll put that at the end of this, this chat to get perspectives from everyone.

[00:53:46] But thanks so much for jumping on with this quick call. Very, very exciting day, and I think, I think we'll all be having a lot more takes as we build with these APIs.

[00:53:55] Alex Volkov: I just want to say a quick round of thanks to everyone here, like, it's been awesome to, like, experience these changes with all of you guys.

[00:54:01] Swyx, a personal

[00:54:03] swyx: shoutout. It's been crazy.

[00:54:06] Alex Volkov: It's been crazy, but also, like, the fact that, like, we were, like, the only space live from the actual event, and, like, we got joined by, like, 200 people in the audience. Yeah, we got we got

[00:54:15] swyx: officially sanctioned as podcasters. Yeah, it was

[00:54:17] Alex Volkov: funny. Yeah, we got officially, like, the only two podcasters in the OpenAI

[00:54:22] swyx: world.

If we'd had press passes we would've had an easier time, but yeah,

[00:54:26] Alex Volkov: maybe they would've let you in with the whiteboard. If we had the press pass,

[00:54:30] swyx: we, we made it happen. But yeah, that's another thing. ChatGPT is not even one year old, right? The anniversary is November 30th. So we're 11 months and a few days in.

[00:54:42] And this is the craziness that it's been. Can't imagine what it will be like in a year's time. Yep.

[00:54:49] Alex Volkov: And I think Sam Altman mentioned this on stage as well, like, in a year's time this will seem like trivial. But we've got some very exciting announcements for today. So,

[00:55:03] Simon Willison: let's keep talking about it. Honestly, I can't predict four weeks ahead, the rate

[00:55:06] swyx: things are going. It's fascinating. Cool, I probably should let you all go, but thank you so much for jumping on. Thank you everyone. Thanks, this was really fun.

[00:55:11] Part II: Spot Interviews

[00:55:11] swyx: Alright, that was part one of this very long OpenAI Dev Day episode, but I promise you it'll be worth it, because part two is some of my favorite work that I've done in audio form.

[00:55:22] So, I basically carried a microphone around, and when I ran into someone that I wanted to interview, I just paused them and asked them for five minutes. And the first is someone that we haven't yet scheduled on the pod, but we've been extremely friendly with. It's Jim Fan, everyone. Jim Fan from the landmark Voyager paper and, more recently, the Eureka paper, all of which comes out of his work at NVIDIA and advising at Stanford.

[00:55:47] So on top of actually leading a group of researchers, he's also very good on Twitter, and I think that is a very useful skill to have because you can communicate the value of your work to a wide audience, and that is something that we also aspire to do at the Latent Space pod. Don't worry. So basically just kind of hold it and then whenever you're talking just kind of hold it up.

[00:56:05] Jim Fan (Nvidia - High Level Takeaways)

[00:56:05] swyx: Sure, okay. The microphone's right here. Oh, it's on DJI? Yeah. Amazing, okay. The microphone's right here. I just talk? Yeah, just talk. So yeah, it's good to see you. Good to see you, Shawn, yeah. So great. Always wanted to get you on the podcast. And then, like, never got around to scheduling you in the studio, but since we're at events, like, this is the big one.

[00:56:21] This is the best event to have the podcast in. So thanks for having me. Yeah, yeah and I also saw you've been tweeting us some stuff. Like, what's the most interesting to you so far?

[00:56:30] Jim Fan: I think a couple of things. Like, one is kind of the economy of scale, how cheap the GPT 4 and GPT 3.5 APIs have become. I think that's gonna be a game changer.

[00:56:40] So I just did a back of envelope calculation, like if you feed the entire Harry Potter books, all seven books, into GPT 4, it's gonna cost only like $15 to read all of them, and double check me on that. Yeah. Okay. And $45 to write all of them. And that is just crazy. And you get GPT 4, right?

[00:56:59] It's gonna be better than 3.5. And the other thing is the GPT 4V API is also available. And if you feed all of the Harry Potter movies, like, you know, all eight movies, into it, that's gonna be like 20 hours, frame by frame, you know, one frame per second. It's only gonna cost $180 to watch all of these movies at 360p resolution, right?
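A quick back of envelope check on the book numbers, using the GPT-4 Turbo list prices announced that day ($0.01 per 1K input tokens, $0.03 per 1K output tokens) and an approximate word count for the series; both figures are rough assumptions.

```python
words = 1_084_000              # approximate total word count of all seven books
tokens = words * 1.33          # ~1.3 tokens per English word is a common rule of thumb
read_cost = tokens / 1000 * 0.01
write_cost = tokens / 1000 * 0.03
print(f"~{tokens / 1e6:.1f}M tokens: read ~${read_cost:.0f}, write ~${write_cost:.0f}")
# -> roughly $14 to read and $43 to write, in the ballpark of the $15 / $45 above
```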

[00:57:20] So this economy of scale is crazy, and I think that's really hard for

[00:57:24] swyx: other companies to beat. Yeah. Is it a surprise to you, the rate at which they've been bringing down their pricing? I'm not

[00:57:31] Jim Fan: surprised. I think, you know, the pricing is gonna follow some kind of exponential decline from now on.

[00:57:36] It's just gonna be exponentially cheaper as compute becomes cheaper as economy of scale is going. So that's one thing. And the second thing is, I am amazed by kind of how OpenAI is doing the integration. Right? If we look at the assistant API. It basically has all of the things that OpenAI developed in a one stop shop.

[00:57:53] So you have like code interpreter, you have, you know, stateful API, you have browsing, and it can integrate with, I suppose, all of the plugins on the OpenAI store. And then it can also switch between those, right? We have seen those demos. So yeah, the API I think it's gonna be way better and way more flexible.

[00:58:12] So that's the second thing. And the third thing is the UGC platform, right? Now everyone can build their bots and share them. You know, share not just the prompt, but actually like entire

[00:58:21] swyx: behaviors, entire GPTs. That is a huge advancement. Yeah, it's really fascinating. And I think one of the things that is interesting, this is supposed to be a dev day, but actually, like, I think the first half was not a dev focus.

[00:58:32] It was low code, no code, programming with natural language. It's something they're saying a lot. And it's something you've been doing a lot as well, I've been following your work somewhat. Yes,

[00:58:42] Jim Fan: yes. I feel like it's gonna be this new programming, where we'll just use natural language, and then refine it through dialogues.

[00:58:48] And I think that is the most natural way to do programming in the future, and the GPT App Store is showing us a glimpse of it. Like, you talk to a bot, and then you can refine the behavior, and the bot can ask you, like, clarification questions.

[00:59:00] swyx: That is the way. That is the right way. Exactly. In the GPT creation pane, you're no longer filling out a form, you know, question, answer, question, answer, question, answer.

[00:59:08] Oh, yeah. It's, you're, you're having a chat and then it prompts for you on the other pane. Yes. And I thought that was a much better way than filling out custom instructions because you don't know what you want. Yeah, exactly. Yeah, yeah. And also it

[00:59:18] Jim Fan: feels very natural and intuitive because we as humans also onboard new employees in this way, right?

[00:59:23] Like we don't send them a form, we have a dialogue with them and we tell them this is the expected behavior, and they can ask follow up questions if there are details that are not clear. Yeah. So it is like just the most natural way to

[00:59:34] swyx: program. So, two more questions. Yes. One is: they mentioned the word agents.

[00:59:39] They said, Sam said the word agents on stage. Yeah. But here they're calling it GPTs. Yeah. Do you see a big gap that they, they still need to fulfill to become a full agent? Or is this the, the new direction that we should think about? I think it is the

[00:59:52] Jim Fan: beginning. Yeah. So. It's kind of hard to predict what agents people will, will build and also how good the base models are.

[00:59:59] Because I feel that the agents robustness and capabilities are ultimately bottlenecked by the underlying model. So, GPT 4 Turbo looks like it's a bit fine tuned towards the agent use case, right? It can do better function calling, it can do better, like, tool switching. These things are critical to agents.

[01:00:17] So, I'm pretty optimistic, but we'll see. We'll see, kind of, is there, like, an emergent behavior? Once you, you know, put a UGC

[01:00:24] swyx: platform out there. Yeah, you mentioned tool switching. Actually, I was thinking when you said tool switching, they're also doing model switching. Oh, yeah. Which is new. Like they have some kind of internal model router, or their mixture of experts is good enough that they just don't care.

[01:00:37] Yes, they got rid of the model selector and now it's the God model that does everything. Yeah, and

[01:00:42] Jim Fan: you can also do retrieval. I suppose retrieval also has an embedding API in it that's automatically done under the hood. So yeah,

[01:00:48] swyx: very exciting. Okay, and then the last bit is you're a lot of your work is sort of reinforcement learning.

[01:00:52] Yeah. Plus plus, or zero gradients reinforcement learning. What do you think you know, and we just had, went to one of the closed door sessions where they talked a little bit about how they received their feedback. What do you think they're doing well, or like, might be a, you speculated a little bit, like, next step if, if they were to take anything from your research interests.

[01:01:11] I'm also very

[01:01:12] Jim Fan: excited by GPT 4's fine tuning API, right? Because the rest of the APIs we see today are no gradient APIs. You cannot really fine tune them, but you can only prompt them. In different ways, but a fine tuning on top of GPT 4 with your custom data may have completely new behaviors. And it's also a new way to program.

[01:01:30] Just it's a bit more complicated. It's not programming by dialogue. It's programming by data, right? You bring a data set and then you have a new GPT 4. So I think, you know, this year's theme is customization. Customized by system API, customized by dialogue, customized by data. So I see this kind of

[01:01:46] swyx: trend going into the future.
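A hedged sketch of what "programming by data" looks like in practice: upload a JSONL file of example conversations and kick off a fine tuning job with the v1 `openai` Python SDK. Note that at recording time GPT-4 fine tuning was still an experimental, access limited program, so `gpt-3.5-turbo` is the model most developers could actually tune; the file name and contents are placeholders.

```python
from openai import OpenAI

client = OpenAI()

# Each line of examples.jsonl is {"messages": [{"role": ..., "content": ...}, ...]}
training_file = client.files.create(
    file=open("examples.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)  # poll later; the result is a new custom model id
```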

[01:01:48] Yeah, I'm looking forward to it. I think there'll be a lot of work in this area. I'm excited to just go hack. I am very excited. I want to skip the after party, but like, there's so many people here in person, so it's great. Jim is actually such a curious person that he does something that a podcast guest rarely does, which is turn the mics around and ask me questions.

[01:02:05] So, here's part two. Yeah, Shawn, tell us, what are you most excited about? So, I'm taking over the show, man. Of course, 360s. Me personally, I was actually not even expecting them to release most of these things today. Like, a lot of people were like, I don't think they have like the DALI 3 API ready. I don't think they have like, Oh yeah, they actually have everything ready today.

[01:02:22] I don't think they have text to speech ready. It speaks volumes that when Sam Altman announced the Whisper 3 model, yeah, no claps. It's the smallest news, but it is actually gonna be huge. I, I

[01:02:37] Jim Fan: actually I would love to, you know, get my hands dirty. Yeah, yeah.

[01:02:40] swyx: On whisper. Yeah. So, honestly, I'm just overwhelmed.

[01:02:43] I know some of the teams, I know they've been working extremely hard, sprinting to get everything done today. Oh, yeah. Yeah. So I mean, I think that's a very important one. They just shipped everything. Even though they're, like, doing very well, they still push themselves extremely hard to be on top, and they're really earning their spot with developers and with the general AI market.

[01:03:05] And I hope they take some holiday after today. Yeah, yeah. Too many updates. And then the next interesting thing to me is that they are Sherlocking a lot of the startup features, so there are a lot of startups that are built on providing RAG for people, a lot of startups that are built on, like, maybe building agents on top of GPT. This is pretty common in large platform companies, like AWS re:Invent often does this as well; they call this a red wedding.

[01:03:34] Like, they invite all your customers to the same room, and then they're like, alright, let's see who survives, you know, step, step, step. So, that is the sort of

meme-y, funny, joke-y version of this. I mean, realistically, I'm sure Harrison and Jerry and all the other RAG people had some heads up about all this stuff going on. But I think... because it's built in so easily into the playground, into the API, into ChatGPT itself, and also the tools, all the integrations, right?

[01:04:01] You don't need a lot of tooling just to set up a simple chatbot with RAG. So for example, for my conference, we did a Summit AI bot, where we set up a LangChain stack and integrated a widget on the website. Now you can set it up with no code, inside of the playground, and just let people play with it.

[01:04:21] It's great, but it's also very scary for a startup, because if that was your whole moat, you don't have that moat. I agree. Yeah,

[01:04:28] Jim Fan: yeah.

[01:04:29] swyx: That's gotta be a problem. So it's interesting that, like, OpenAI can sort of easily build this in, and obviously the stateful API is something I was considering building.

[01:04:37] And I roughly knew that, like, this would be the next thing that OpenAI builds. This is on the critical path, for sure. So I don't build it. I agree. Yeah. But then the question is, like, alright, what do startups do? Yeah. I think maybe one thing that was missing from... Sam was like, hey, this is the biggest gathering of all your ecosystem developers.

[01:04:54] They're afraid of you. You have given them no assurance as to, like, where do you think people should build. Okay. So, because, like, OpenAI just wants to do everything.

[01:05:05] Jim Fan: I think so, right? Like, judging from today's trend, they literally are doing everything. Yeah. Yeah, you're right.

[01:05:10] swyx: So so I feel a little bit, I mean, it's fine.

[01:05:12] Everyone who's building with AI today opted in to cutting edge, and sometimes you work on the cutting edge, you bleed. Yeah, that's right. Yeah, but I do I do feel like there's a lot of tension between the startups that build on OpenAI and OpenAI itself. Yeah, so that's my two cents. Sounds great. It's great to see you.

[01:05:31] Yeah, good to see you. Thanks

[01:05:32] Jim Fan: for jumping on.

[01:05:33] swyx: Thanks for having me.

[01:05:35] Raza Habib (Humanloop) - Foundation Model Ops

[01:05:35] swyx: And next, we catch up with the former guest, Raza Habib, back for his second time on the pod. Last time, we talked about Human Loop, and we recorded in London, and that was a pretty popular episode, and I love that you guys care about foundation model ops, as Raza puts it.

[01:05:49] So check out the Human Loop episode if you want, but also, here's Raza's take on OpenAI Dev Day. Welcome back to the pod, you're just the second appearance. It's

[01:05:57] Raza Habib: always a pleasure, nice

[01:05:58] swyx: to see you again, Shawn. Good to see you as well. All right, let's just get right into it. What was most

[01:06:02] Raza Habib: interesting to you?

[01:06:03] I mean the sheer density of announcements. I actually came with high expectations and there was a lot of stuff I was hoping to see, but I think they under promised and over delivered, which I thought was really good. I think seeing that they're having a second run at plugins and doing it right this time, and having the GPT Store and, like, really allowing people to do that.

[01:06:21] I thought that was really cool. Product decisions around how you design and build the GPTs, like the low code builder for these chat agents. I thought that was really nicely done. That they have this conversational interface that elicits from maybe someone who's not very expert how to do prompting and things like that.

[01:06:38] I thought it was really

[01:06:38] swyx: thoughtful. It fills out the form for you, right? Yeah.

[01:06:41] Raza Habib: It's a very simple thing, right? Like, ultimately, it's just filling out the system prompt and filling out what abilities it should have. Yeah. But actually, despite its simplicity, I think it's very powerful, and I was impressed by that.

[01:06:52] So, yeah. A lot of really cool things. And then all the changes to the API I'm really excited about. I have some questions. Like, I'm not, I'm not uniformly positive about all of the new API things, but I'm

[01:07:02] swyx: sure they'll get there. Okay what, anything in particular that you want to touch on?

[01:07:07] Raza Habib: Yeah, so I think like, things that I'm excited about with the new assistance API, or like the new APIs in general, like multi modality is really cool, longer context window is really cool.

[01:07:17] I think everyone's going to be super excited about that. JSON mode is like, it seems like a small feature, but actually so many people say this is a problem for them. So I think that's going to be great.

[01:07:26] swyx: So I maybe missed the importance of this. Isn't that the same as the function calling API?

[01:07:31] Raza Habib: It's related, but you might want to have it in context where it's not strictly doing function calling.
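A minimal sketch of the JSON mode Raza is describing, as opposed to inventing a dummy function just to force structured output. It assumes the v1 `openai` Python SDK; note the prompt has to mention JSON for the mode to be accepted.

```python
import json
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4-1106-preview",
    response_format={"type": "json_object"},  # the new JSON mode
    messages=[
        {"role": "system", "content": "You extract structured data and reply in JSON."},
        {"role": "user", "content": "Give me JSON with the name and founding year of OpenAI."},
    ],
)
data = json.loads(response.choices[0].message.content)  # guaranteed to be valid JSON
print(data)
```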

[01:07:37] swyx: Huh. Right. Okay. So a little bit more general. Typically I'll just make up a function that isn't actually a real function. Yeah, even

[01:07:45] Raza Habib: then, people say that for complex things, sometimes it violates the valid JSON thing. So I think just making that more reliable. Some stuff that I thought was, initially I was excited about, and then as I've, like, chewed on it a bit more, I'm a little bit less clear.

[01:07:57] So one is this, like, ability to dump in a bunch of documents and have it do RAG for you.

[01:08:01] Jim Fan: Yeah.

[01:08:02] swyx: I think, like... 20 documents max or something. Yeah, I

[01:08:04] Raza Habib: think that, like, it's... It's a cool feature, but it feels a bit gimmicky to me. Like, it feels like for serious, practical applications, it's going to be hard to get that to work.

[01:08:11] If you think about what a large enterprise needs for RAG, like, it's rarely sufficient that you can just dump in a bunch of documents. How you process them matters, there's usually permissioning, like which users can actually access which bits of data. There's so much control that I think most developers would want to have for serious applications, that I think it's cool for the GPTs and the low code version.

[01:08:32] I'm skeptical that it'll get that much use by serious developers. Yeah. And I feel the threaded, stateful, like, Assistants API is really awesome, but I would like more clarity over how it's doing the statekeeping, like, what ends up in the context. Yeah. I think for that to be really popular, they need to make that transparent.
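For context, a sketch of the threaded, stateful Assistants API being discussed, with the built in retrieval tool, using the beta endpoints as they shipped at Dev Day (names have since changed, for example retrieval later became file search); the file and question are placeholders.

```python
import time
from openai import OpenAI

client = OpenAI()

# Upload a document and attach it to an assistant that can use retrieval.
doc = client.files.create(file=open("handbook.pdf", "rb"), purpose="assistants")
assistant = client.beta.assistants.create(
    name="Docs helper",
    instructions="Answer questions using the uploaded documents.",
    model="gpt-4-1106-preview",
    tools=[{"type": "retrieval"}],
    file_ids=[doc.id],
)

# Threads hold the conversation state server-side.
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id, role="user", content="What is the vacation policy?"
)
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)

while run.status not in ("completed", "failed", "expired"):
    time.sleep(1)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

messages = client.beta.threads.messages.list(thread_id=thread.id)
print(messages.data[0].content[0].text.value)  # latest assistant reply
```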

[01:08:52] swyx: Yeah. There's an API booth downstairs. I don't know if you've seen it. I've gone and spoken to them. They wouldn't

[01:08:55] Raza Habib: answer any of these

[01:08:55] swyx: questions for me. Okay. Yeah, of course. But, you know, obviously that greatly affects HumanLoop.

[01:09:00] Raza Habib: But this is you know, this is commentary over what I think overall was a set of really

[01:09:04] swyx: exciting announcements.

[01:09:05] Yeah. And, and last time we talked, also, you were talking about, we were talking about the multimodal APIs. And now you have it. It's finally here. What, what happens now? As I, as

[01:09:14] Raza Habib: I said to you when I spoke to you last time, right? Like, it's a relatively straightforward addition to the HumanLoop product.

[01:09:19] Like, everything will continue to work, but now you'll also have images in and images out, and audio in and audio out. It's kind of interesting, like, seeing, you know, the assistance playground for OpenAI that they just released, and things like that. Like, it feels like they're starting to get close to supporting all of these things, but not quite yet.

[01:09:35] Yeah,

[01:09:36] swyx: yeah, excellent. And then, I think the last part is, I saw HumanLoop actually, probably not you, probably somebody else, but also talking about the fine tuning. There was a price drop, I don't know how much, because there was just so many announcements. But I imagine that's only good things for fine tuning.

[01:09:49] Yeah,

[01:09:49] Raza Habib: I mean... There's so many other stuff. I also missed the price drop, but I know from speaking to folks at OpenAI as well, that they think a lot more people should be fine tuning. Yeah. Fine tuning is gonna have, like, huge importance in the future. That's why they're building out the UI for it. You know, so it's something they're investing in very deeply.

[01:10:05] Simon Willison: And,

[01:10:05] Raza Habib: yeah, I still view fine tuning as, like, an optimization step. Yeah. I think of it as, like, the compilation you do, like, once you have something that's working.

[01:10:12] swyx: Which is what they said in the LLM performance session just now.

[01:10:15] Simon Willison: Okay,

[01:10:15] Jim Fan: cool.

[01:10:16] Raza Habib: I'm glad that my tips are aligned with OpenAI's. I

[01:10:19] swyx: think you're very aligned.

[01:10:20] You're often leading them in what they say publicly, which I think is good.

[01:10:26] Raza Habib: Yeah, what about you, Shawn? What did you think?

[01:10:28] swyx: Oh, I've said this in a previous recording, but effectively, I also thought they would do much less than they did today. I think they under promised and over delivered, exactly like you said.

[01:10:39] And even things like text to speech, which... It's not just text

[01:10:43] Jim Fan: to speech,

[01:10:43] Raza Habib: it's really good text to speech. So I, like, I think I told you last time, I did like a nearly year long internship at Google, and I was working on the first neural TTS team. Like, the Tacotron team there were amazing.

[01:10:54] swyx: So what did you get from their demo?

[01:10:57] I

[01:10:57] Raza Habib: think I need to play with it more, but I was impressed by the quality. Yeah. Like, the quality of the prosody, the variation. I think they're only releasing six voices, but...

[01:11:05] swyx: And the secret seventh voice with the pirates. The

[01:11:07] Raza Habib: secret seventh voice with the pirates. And then I was chatting to Andre just now.

[01:11:12] Yeah. And he was saying that internally, like, they have voice cloning set up as well. Yeah. So they can do it with something like 30 seconds of speech. I'm not sure that's public. Is it not public? I don't know. He didn't tell me it wasn't public. Okay, alright, alright. Maybe, maybe filter it out

[01:11:25] Simon Willison: when you publish this.

[01:11:27] swyx: For what it's worth, I've been talking to a lot of people in and outside of Dev Day, and a lot of people have heard about the voice customization stuff, so it's not really going to get anyone in trouble, I don't think, so I just chose to leave it in there. Whatever, I mean, it exists elsewhere in other products, and I think it's fair play to compete with other companies who

[01:11:48] Raza Habib: are already doing this.

[01:11:50] For obvious reasons, right? There's a lot of safety concerns about releasing that kind of

[01:11:55] swyx: product. And for what it's worth, someone else, I think, Fixie AI, did a comparison of the pricing. They are severely undercutting like PlayHT and some of the other text to speech companies as well on the pricing.

[01:12:06] They're between 3 to 10 times cheaper

[01:12:08] swyx2: per second or something than the other existing TTS companies. Yeah, I think that's very interesting. I think in general... Their promise to keep cutting prices and then following through is building a lot of confidence. People, people who weren't previously nervous about building on them.

[01:12:22] What's interesting, I think, is that as the, like, because they have such a large economy of scale, and they continue to drive down prices, the option of, like, self hosting a fine tuned model, even for smaller models, starts to be, like, less obviously economical, because of the, like, spin up and spin down costs.

[01:12:39] So unless you have the, like, volume of usage to justify having it on all the time, It actually starts to become cost competitive to use one of these third party APIs rather than having even a smaller model. Right, because it's serverless in a way. So what, can you give people an idea of what kind of volume that is?

[01:12:55] Are you talking about concurrent requests?

[01:12:57] Rahul Ligma: It's, so if

[01:12:58] swyx2: you look at most of the people who will provide you in like a serve model, if you look at a replicate or a mystic AI or something like this. Yeah Fireworks. Fireworks, there's a few of these companies. They tend to actually charge by like compute hour or compute minute.

[01:13:13] Yeah, and so if you're not gonna have it on all the time, then the reason is dollars. Yeah, you end up needing it on all the time though, because there's, like, spin up, spin down, cold starts. And so if you don't actually have enough usage to justify having it on all the time, it starts to become cost competitive to just use OpenAI.

[01:13:31] Yeah, so what I'm trying to get to is, it's just dollars though, like if it's like $5 an hour, whatever, like...

[01:13:38] Reid Robinson: Yeah, I agree,

[01:13:39] swyx2: depending on your use case, but yeah. Okay, got it, got it. Alright, cool. Well, thanks so much for jumping on. I know this is last minute, but it's just nice to see people. No, no, I always, I always love chatting with you, so hopefully we'll be more of a visitor in the future.

[01:13:50] Yeah, for sure. The next guest is going to be a new name to many people. He hasn't done many public appearances, but he is a force to be reckoned with on Twitter.

[01:13:59] Surya Dantuluri (Stealth) - RIP Plugins

[01:13:59] swyx2: His name is Surya Dantuluri, and this is the story of somebody whose startup got killed by Sam Altman. So we're here with Surya. Hey. Hello. My name is Surya.

[01:14:07] You're new on the pod, but also we've been around each other in the tech circles. Yeah. For a little bit. You're a very famous developer of vector databases. Yeah. And of plugins. Yes. What are some of the plugins that you've done?

[01:14:20] Surya Dantuluri: Yeah, so I worked on a few plugins.

[01:14:22] I worked on, like, chat with PDF, chat with video, chat with website, chat with, like, get-it-made yeah, like a lot of cool plugins.

[01:14:29] swyx2: Making decent money

[01:14:30] Rahul Ligma: too.

[01:14:31] Surya Dantuluri: Yeah, I mean you can, they give like better functionality to like the whole GPT 4 interface. Initially I wanted to do my homework with them so I'm like, I might as well make a plugin for it.

[01:14:40] So yeah, I mean, there's like a lot of cool functionality, like I made one called chat with instructions, which would allow you to save more custom instructions and use them when you're talking to GPT 4. But yeah, I mean, they're making revenue, it's pretty sick you know, people paying in 85 different countries.

[01:15:00] It's like nuts how many people are like, or how many, how big the the scope is, or how many

[01:15:05] swyx2: people can use it. And I think you may have shown me this before, but there was a plug in platform that you use for monetization? No. No? Oh, you build your

[01:15:12] Surya Dantuluri: own, you build... I build my own thing, all custom,

[01:15:15] swyx2: I've seen someone do, like Firebase

[01:15:16] Surya Dantuluri: for, yeah, yeah, yeah.

[01:15:19] Yeah, I don't know. R. I. P. No, I mean, they're doing well, but like, I just don't want to, you know, pay a 10 percent tax

[01:15:24] swyx2: and all that stuff. Yeah, yeah, yeah. For sure. Obviously, you're very technically savvy. Okay, so what happened today? They announced GPTs. What's going on?

[01:15:33] Surya Dantuluri: Yeah, so like, I made a tweet this morning being like Sam won't let me kill my startup.

[01:15:37] As a joke, okay? I just wanted to, like, I was trying to notify people while I'm here and I just wanted to meet up. I made up the joke. And then a couple hours later my friend Matt, he works at Julius, he showed me the new UI. I'm like, okay, cool, and he forced me to look at it on my phone. I'm like, okay, sure, I'll pull it up. I pulled it up on my phone, and plugins were gone, plugins were gone. I think you can go between models, so you can go between 4 and 3, but the whole options of, like, code interpreter, and like DALL-E 3, and all this stuff, all of that good stuff was gone from the UI.

[01:16:12] I think this is only if... This only applies for people who are here at the event. I think they gave access, or like the new UI to people here. And they also... But yeah, plugins were gone, and I'm like, oh s**t. And I asked the person, like, hey, like, where... Where are the plugins? Like, where can I... Like, where are the plugins?

[01:16:28] Like, where do they go? They basically told me, like, You have to make a new GPT as a developer. And you can import your schema into the new GPT. And only that way can you you know, kind of revitalize your plugin, but

[01:16:42] swyx2: your existing users will be

[01:16:44] Surya Dantuluri: like, no, I think they're gone. I mean, they're gone, I haven't looked at my stats today, but, well, I

[01:16:49] swyx2: mean, this is not widely rolled out yet, but when it, when it rolls out, when it rolls out, I'm pretty

[01:16:53] Surya Dantuluri: sure all of the plug-ins, they have to discover you again.

[01:16:56] Yeah. They're kind of dead. I mean, there's like no way, I don't think there's a way to link them. Yeah. Like there's no way for the users who were using it previously to be using the new thing. You know. But I mean, it's an exciting project for me, it's not like a full time thing for me, it's a fun project to do, and it's a nice thing to work on.

[01:17:13] So I'm really bullish on, you know, the whole new GPTs thing, I think they're a better abstraction. Yeah, I talked to a few OpenAI engineers about GPTs, and I was, like, agreeing with them, because I think GPTs are a much better abstraction of what plugins were supposed to be. I think plugins kind of died on arrival.

[01:17:29] Well,

[01:17:29] swyx2: Sam said they did not have PMF,

[01:17:31] Surya Dantuluri: right? Yeah, obviously, yeah, he said that a while ago, he said that, like, when plugins started. Yeah. So it's like pretty nuts. But yeah, I think GPTs are a better abstraction and I also love that they're doing revenue share. So, yeah, revenue share is also a good thing.

[01:17:45] Because, like, plugins were a really weird way of monetizing, you had to do a bunch of finicky stuff. But yeah, I mean, also, by the way, for people who don't know Poe, you know Poe, right? Yeah, Poe did this a long time ago, they did this a couple months ago. They have these bots, they call them bots.

[01:18:02] And you can, you know, make your own, like, poem bot, or you can make your own essay bot or whatever. And then the bots have custom instructions and also they use a very specific model that the developer specifies. And you can install these bots or you can chat with these bots, and the bots will do whatever the developer made them to do.

[01:18:21] So I think OpenAI just basically made the same thing, and they brought it over to them. But yeah, effectively, plugins are kind of dead. Oh, RIP. Yeah, I mean, RIP, but it was a fun project, I mean, it's fun. Honestly, it's good that plugins died, because, like, they had a bunch of issues.

[01:18:40] So, one of the issues is that you can't share them. You can't share a link to them. GPTs, you can share a link to them. So, like, I can share my link to my GPT thing to you. So it's much better for discoverability, because previously the only way to discover a plugin was through the plugin store. You had to search for it, you had to do a bunch of stuff, and it wasn't very good in that aspect, but sharing a link to them, having revenue share And you can also, like, give custom instructions, custom context, so they also came out with, like, retrieval or whatever, and that can basically give you, like, a custom vector database directly in your GPT, I think.

[01:19:15] So those are all great features that should have come with plugins, probably,

[01:19:19] swyx2: but. Yeah, awesome. And then lastly, just like, any of the new stuff that was launched today what interests you in sort of building with them? Like if you were to build on the new API

[01:19:30] Surya Dantuluri: Yeah, totally. I have some ideas.

[01:19:31] The thing is like this is really weird to say, but like, some of my ideas that I've said before for plugins, They kind of get copied quickly.

[01:19:43] swyx2: Oh, so you want to keep it to yourself? Yeah, that's fine.

[01:19:45] Surya Dantuluri: Yeah, but that's one part of it. The second part of it, I don't have any good ideas regarding what you can do with all the new functionality.

[01:19:52] Like, that's like a good product. I don't know, honestly. Text to speech came out, their internal vector DB thing came out. Internal vector

[01:20:01] swyx2: DB thing? Or, like, retrieval, or whatever it's called. Yeah, people have been saying they have an internal vector DB thing, but it's just retrieval. Yeah, it's, like, zero configurable. It's going to be fine for, like, simple use cases, then after a while you're gonna need more control over chunks and stuff. Yeah, I'm

[01:20:17] Surya Dantuluri: also excited by what happens with larger context windows. I was a big user of Claude for a while because Claude basically gave you a 100K context window widely on the UI, and you can upload your PDFs to it, and everything would work very well.

[01:20:30] Yeah. But I think Claude had some issues regarding, I mean, actually very recently, Claude came out with this whole b******t copyright thing. So like, Copyright thing? Yeah, yeah, it's really weird. So, if you upload a PDF now to Claude, like just this week, they made this weird tweak where it doesn't answer any questions, because if there's a copyright symbol or a copyright name anywhere, it just, like, blocks you

[01:20:53] swyx2: out, and it's like, what?

[01:20:54] Apparently you can prompt inject that by insisting that you are the author, and then it just overrides it. Oh, really? That's funny. It's like, don't worry, I got this, I'm the author of this, there's no copyright issue.

[01:21:05] That's it, okay, cool. Anyway, so thanks, this is a really good story, and I wanted people to share it, and I'm excited for what you work on to become more public. Yeah, thanks Swyx. Alright. So that's what happened to ChatGPT plugins, which we covered back in March. But don't worry, that's not the full story.

[01:21:20] Reid Robinson (Zapier) - AI Actions for GPTs

[01:21:20] swyx2: His startup is not fully dead. We actually cover what happens later on. I just wanted to capture the confusion that was happening at Dev Day. So he referred to Julius, and we'll actually check in with Rahul later on in this episode. But first, we have to go to our next guest. When OpenAI launched GPTs and the Assistants API, one of the lead launch partners was Zapier, and I managed to catch up with Reid Robinson, who is lead AI PM at Zapier, to talk about it.

[01:21:49] All right. Well, Reid nice to meet you. Great to meet you too, Shawn. It's really great to run into you as we're leaving. So you guys had a... Big sort of partnership launch on stage. Yes,

[01:21:59] Reid Robinson: yeah, we launched AI actions for GPTs, which we're really excited to see out there. We also today launched an update to our chat GPT integration that supports the assistance API functionality that was announced.

[01:22:13] And

[01:22:13] swyx2: you were one of the earliest to go. In my mind, Zapier was very, very early in the natural language actions. NLA,

[01:22:19] Reid Robinson: I don't, I don't remember what, good memory. Yeah, yeah, yeah. We launched our natural language action, actually. So we were a launch partner for chat BT Plugins. Yeah. And that's when we launched our Natural Language Actions, API, and actually the AI actions that we're calling it today kind of a, we're rebranding that side of thing to really focus on a lot functionality.

[01:22:35] Yeah. For that.

[01:22:36] swyx2: And I just interviewed Surya, who's a pretty prominent plugins developer. Plugins died and, you know, got reborn.

[01:22:43] Reid Robinson: Yeah, it's going to be interesting to see what happens. There's clearly a difference. I think one of the things I talk about is the fact that, you know, with GPTs, you're able to constrain the prompt quite a bit, like our plug in for ChatGPT, the initial one.

[01:22:55] You needed to give it access to every single action you ever wanted it to have access to. Which meant that the kind of, you know, anybody who's familiar with context is sitting there like, yeah, that's gonna be an issue. The common one I give is, like, if you had given it Gmail and Google Calendar and asked it, hey, what's going on next week on my, like, agenda?

[01:23:11] It would sometimes search Gmail. Cause it'd be like, yep, events are in Gmail. Or like, you know, calendar invites are gonna go to Gmail, So I should search there. But now you can, you know, define what apps it should use. You can define, like, how it should use those. So some really fun use cases. I mean, honestly, we've been hustling hard to get this out there.

[01:23:30] I'm really excited to see what people actually build with this and what gets released there. Yeah, we'll be monitoring and trying to listen to people

[01:23:37] swyx2: really closely. And so, like, something that's interesting about Zapier is that you are a collection of actions in and of yourself. So there's kind of multiple layers in which to do this.

[01:23:47] Like, what should exist at the GPT layer? What should exist at the Zapier layer? Yeah, well,

[01:23:52] Reid Robinson: what's nice, I mean, it's a good point. We have about 6, 000 apps on the platform today. Really what the AI Actions is, is it's the ability to use any of those searches and actions using kind of a natural language input.

[01:24:05] That would be like the instruction that the model gives it. So it's like, you know, check this user's calendar for Monday. And, you know, it might even give the, you know, the actual date for Monday, right? Zapier on our side will take that natural language request and process that into an actual API, like the actual API call to a tool like Google Calendar, and then we all work on the response.

[01:24:26] So, you know, you can't just take the entire response of a — especially, like, Gmail — responses are very, very long and very confusing. And so we actually do a lot of work to kind of, if you will, like, massage that data, so that it makes sense for an LLM on the other side, that it is giving it the right — it's kind of, like, the information it needs and not just, like, the entire payload.

[01:24:47] It really helps it kind of deliver, like, a more — again, more contained, more refined experience for leveraging integrations, down to a T.
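A minimal sketch of the pattern Reid describes — the model hands over a natural-language instruction, a backend resolves it into a real API call, and the raw payload is trimmed before it goes back into the model's context. The helper names here are hypothetical stand-ins, not Zapier's actual AI Actions API:

```python
# Hypothetical sketch: resolve a natural-language instruction to an API call,
# then "massage" the raw payload so the LLM only sees the fields it needs.
# `resolve_and_call` and `run_ai_action` are made-up names, not Zapier's real API.

def resolve_and_call(instruction: str) -> dict:
    """Pretend resolver: would map 'check this user's calendar for Monday'
    to a concrete Google Calendar API call. Here we return a stubbed payload."""
    return {
        "kind": "calendar#events",
        "etag": "abc123",
        "items": [
            {"summary": "Team sync", "start": "2023-11-13T10:00",
             "end": "2023-11-13T10:30", "organizer": {"email": "a@b.c"}},
        ],
    }

def massage_for_llm(raw: dict) -> list[dict]:
    """Keep only the fields an LLM actually needs; drop the rest of the payload."""
    return [
        {"title": e.get("summary"), "start": e.get("start"), "end": e.get("end")}
        for e in raw.get("items", [])
        if isinstance(e, dict)
    ]

def run_ai_action(instruction: str) -> list[dict]:
    return massage_for_llm(resolve_and_call(instruction))

print(run_ai_action("check this user's calendar for Monday"))
```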

[01:24:56] swyx2: So, existing Zaps cannot be ported one for one over to this?

[01:25:02] Reid Robinson: It's really one off actions — that's the better way to think about it. And you can chain them together. In the demo you saw today, you're only using Google Calendar for the search and a Slack action.

[01:25:11] You can actually chain those together. And so, you know, how much of that is, like, a one off action versus an actual, like, Zap? But in this case, it's almost more like the trigger is the human in ChatGPT, right? Like, you need to trigger it to run. But on the flip side, you know, the Assistants API is extremely exciting for me as well, because that functionality of building a GPT — you know, still getting used to the name?

[01:25:36] Yeah — it allows you to kind of port that over to run asynchronously. So a common one — like, the two examples that I love giving for that API in Zapier: number one, data export. You know, think of every tool out there like Looker, Mixpanel, Amplitude — so many tools are able to send these, like, massive exports of CSV data on a regular basis.

[01:25:57] Like you could say, hey, every Friday, export my blog traffic content as a CSV, right? Normally, someone's gonna get that CSV and have no clue what they're doing, right? But now you can actually create an assistant in Zapier and you can give it instructions to say, like, hey, tell me the top 10 performing blog articles in the last week.

[01:26:15] And also, you know, tell me highlights on, you know, maybe keywords that were used or SEO tags that were used and how that impacted conversions, right? Like, you can be pretty detailed depending on what you're providing it. And that can now run asynchronously. That can run automatically. So every Friday, you know, 8am, you could be getting the export of that data.

[01:26:32] It's gonna go to an assistant. That assistant's gonna reply with even charts and graphs. And those will come through, and you can then send it to Slack. And so you can have, every Friday, a conversation — a post in your, you know, blog team's Slack channel on performance. And that'll run automatically. And then they can even reply in Slack to that post and have a continuous conversation with that assistant.
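A minimal sketch of the asynchronous flow Reid is describing — upload a weekly CSV export, hand it to an assistant with code interpreter, and poll the run for the summary — assuming the OpenAI Python SDK v1+ and the Assistants API as announced at Dev Day; the Slack posting step is omitted and parameter details may differ across SDK versions:

```python
# Sketch of "CSV export -> assistant -> summary", in the spirit of Reid's example.
import time
from openai import OpenAI

client = OpenAI()

# 1. Upload the weekly export and create an assistant with code interpreter enabled.
export = client.files.create(file=open("blog_traffic.csv", "rb"), purpose="assistants")
assistant = client.beta.assistants.create(
    model="gpt-4-1106-preview",
    instructions="Each week, summarize the top blog posts and chart traffic.",
    tools=[{"type": "code_interpreter"}],
    file_ids=[export.id],
)

# 2. Start a thread with the request, kick off a run, and poll until it finishes.
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id, role="user",
    content="Tell me the top 10 performing blog articles in the last week.",
)
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)
while run.status not in ("completed", "failed", "cancelled", "expired"):
    time.sleep(2)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

# 3. Read back the assistant's reply (which could then be forwarded to Slack).
for message in client.beta.threads.messages.list(thread_id=thread.id).data:
    print(message.role, message.content)
```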

[01:26:54] swyx2: Oh my god, so it's like really

[01:26:56] Reid Robinson: everywhere. Yeah, so you can really put them everywhere. And that's, that's one of the things I like about what's released. And I think people are going to continue to learn just how kind of wild that is — the fact that you can, like, use your actions in the UI of ChatGPT as a one off action, but you can also run these things extremely well asynchronously. And yeah, like, OpenAI releasing API support for the vision model and for code interpreter and retrieval that these assistants can use — it's really cool.

[01:27:26] swyx2: Is there a Zapier angle to any of that? They're all the same, right? Like you would do

[01:27:31] Reid Robinson: in Zapier, right? The whole creating of an assistant and running that through an assistant is today's support. You can do that literally right now. So it's really cool. And the other one is retrieval, right? I talk about, you know, you could go in and create an assistant.

[01:27:45] Give it, let's say, you know — I talk about our accounting team a lot, right? You could give it — like, if you have a team that approves budget requests from your company, right? Everyone does, right? They can actually take their Slack channel and create an assistant first that would have the documents of your policies, of like, hey, here's what you can expense, here's how you can expense, here's eligible, ineligible, right?

[01:28:03] All these sorts of things — and actually then set up something like a Zap. I'll pick on Slack, it's just easy: a new message in your accounting budget requests channel, and have it trigger the assistant and send the user's request to the assistant with all of your documentation with retrieval. And now it'll try to understand what your policies are and check the information against them. And you could even — like, I did one internally where we have a tool called, I think it's called Stacker, that tracks each employee's, like, software budget and home office setup budget, right, so you can see how much they've spent of their budget. And you can actually include that data in the context of the user message, so that the model will be able to say, like, hey, I see you want to expense this webcam; it's actually over the recommended budget, but you personally do have budget left if you wanted to use it for that, right?

[01:28:53] And, some autonomy there. Yeah, and that's really cool. So you can start to do all of those sorts of things now in Zaps that really were never possible. So yeah, the querying of knowledge, running of data analysis, writing code even.

[01:29:08] swyx2: I think in a very real way, you are the perfect partner to OpenAI, because they've sort of built a reasoning sort of glue between all these things.

[01:29:15] It's

[01:29:16] Reid Robinson: definitely been a good and fun partnership. I think, yeah, the big thing for me that I would say is like, I'm really, really excited now to just see what people do with this and how we can improve it.

[01:29:25] swyx2: Yeah, awesome. Is there anything — you know, you've been developing with these APIs for a while — is there anything that you caution people not to get too excited about?

[01:29:32] Like, what, what, yeah.

[01:29:34] Reid Robinson: I mean, the callout I'll always make is, like, double check accuracy, right? Like, how accurate is it? Make sure that information is accurate. Make sure you're putting some human in the loop steps in before you're putting this

[01:29:46] swyx2: into, like, a critical path. Which they have, like confirm, deny, yeah.

[01:29:49] Simple.

[01:29:49] Reid Robinson: Yeah. That sort of thing. But even, yeah, all sorts of things you really wanna make sure that you're comfortable with. Like what can go wrong, what is likely to go wrong, right? Like all those sorts of constraints. The other side that I often talk about is just like, keep an eye on — you know, if you have freeform human input somewhere in your application that is triggering these things, you know, that can sometimes be a risk, right?

[01:30:07] Yeah. Prompt injections. Those are a real thing, and I think, you know, a lot of people are still trying to figure out what that means, and how bad that can be, and so I always try to caution people about that as well, right? Like, you really want to be realistic on, kind of, how far reaching you're doing this, so, yeah.

[01:30:25] That's why I like, like, the internal use cases, you know, like, things like that is a great way to start, to get familiar with the technology, to get familiar with the constraints for that. Other than that, no, I mean the voice model stuff I'm really excited to try that. I really want to, yeah. Yeah, that'll be

[01:30:40] swyx2: really cool.

[01:30:40] I love the secret pirate mode that they demoed. I don't know if you caught that session. I didn't see that session, no. Obviously there are six voices, but there's a secret seventh mode if you add in a prompt to speak like a pirate. Love it, love it.

[01:30:54] Reid Robinson: That was an old I don't know if you remember Facebook way back in the day had that as one of the languages you could select?

[01:30:59] Yes. Yeah, yeah, so that reminds me of that.

[01:31:02] swyx2: Yeah, lots of fun to be had with AI as well. Okay, well, thanks so much for jumping on. I know it's very random, but also, yeah. People love to hear from builders, so, that's awesome.

[01:31:12] Reid Robinson: I love hearing from builders.

[01:31:13] swyx2: And most of the interviews were done as we were sort of leaving the Dev Day venue and going to the after party.

[01:31:19] Div Garg (MultiOn) - GPT4V for Agents

[01:31:19] swyx2: And I caught Div Garg of Multion, who we've been talking around and circling around a possible episode on. He's definitely one of the leading voices and thought leaders on agents. Because he's building a browser agent that's a very prominent one. Unfortunately, I have to take an L on this one because the audio is not great.

[01:31:39] Div's mic wasn't working, and I don't know what happened to it. I, I try to always check these things, but you're only gonna hear the output from my mic, which is slightly worse, but I opted to leave it in because Div is actually building an agent with OpenAI stuff, and had access to GPT 4 Vision, and I think that people building with GPT 4 Vision will be surprised at his answer to me on whether or not it's useful for agents.

[01:32:02] Good to meet

[01:32:03] Div Garg: everyone, I'm Div, founder of MultiOn, which is an AI web agent that can automate browsing for you. So we can book your flights, order stuff on Amazon, order dinner, whatever

[01:32:11] swyx2: you can imagine. Yeah, and I was actually reflecting, so, I, everyone who listens to this already knows what was announced.

[01:32:17] I was actually reflecting that they didn't have any browser based actions. So what were your thoughts on just generally their approach to agents?

[01:32:23] Div Garg: So they, it'd be very interesting because I feel like browser actions are just so risky. So, and like, things can go wrong. So if you're a big company or you're OpenAI, you won't, you won't want to build that.

[01:32:31] And they're, like, better off just, like, relying on a third party who, like, wants to own that. And that's also the strategy we are, we are taking with them. Like, OpenAI launched, like, a Zapier integration for APIs. But we want MultiOn to be, like, the new API solution. Like, I want to do things beyond APIs.

[01:32:45] I want to connect to my personal accounts, where I just have my... logins already, or I already have the cookies, and I want to go and, like, interact with my personal accounts or personal data very easily. And I think it's very fascinating for us where we can, like, launch a MultiOn integration with the new platform, and then you can just go and, like, give it a command like, oh, can you book this for

[01:33:04] me, in ChatGPT, and then it will launch a browser, and in the browser you can see what's happening, and then we go do the whole thing for you, and it'll be all seamless. And then people can have a lot of fun just, like, trying out all these different capabilities and, like, automating their, like, daily workflows.

[01:33:18] You can, like, save this as custom integrations for different agents. You can have different custom, like, MultiOn prompts that are already, like, pre saved. And then you go like, oh, I want to now go order something on, like, DoorDash. I want to order my favorite burger. Then, like, ChatGPT can go and, like, suggest what your favorite burgers are, and then it's like, okay, like, now order this for me,

[01:33:35] MultiOn, and then MultiOn, we solve the payment for you, we solve identity for you, and, like, we are owning all the risky, like, actions it can take.

[01:33:42] swyx2: So, so you, you're gonna build a GPT version of MultiOn? Yeah, we'll have a MultiOn GPT. You, you, okay, will, will that be like a replacement to your existing thing, or just like an alternative way?

[01:33:53] To use their same APIs or something like that. So it's

[01:33:55] Div Garg: like, the direction we're going for is we want to make our AI, like, agent embeddable within existing applications. So we are launching an API. Okay. And we already have, like, a ChatGPT plugin. And so this will be, like, sort of like a little — like, use the API to power this sort of, like, new GPT experience.

[01:34:10] So for us, we actually don't have to, like, change anything. It'll be, like, very streamlined — just connect our API to ChatGPT, and, like, people can start using it.

[01:34:17] swyx2: Yeah, yeah, awesome. What about the, I guess, the Vision API? I think one of the things that has always constrained browser agents is the DOM.

[01:34:24] Right. Which is very heavy. Yeah. And so the alternative approach is to use Vision. Would you explore that? What are your thoughts? So, for us,

[01:34:30] Div Garg: we actually had, like, early access to the Vision API for more than a month. And we tried it on a bunch of websites. For 5 percent of websites it's actually really useful — the ones which are more, like, image heavy — because for the other 95 percent, if you do OCR, that's good enough.

[01:34:43] Yeah, it's not — we have really good, like, parsing, so most websites we can compress to less than 3k tokens, so we don't really have to, like, worry about how heavy the text is. So we had one interesting use case for the Vision API. We had a user... who got it to work on Tinder, and then, like, MultiOn...

[01:34:58] Hot or not?

[01:35:07] swyx2: Yeah, and then we — oh, you have found the killer use case for MultiOn. Yeah. Like, this... We did it with our laptop, right? Yeah. Oh my god. Okay, interesting. Interesting. Okay, but so, but only image heavy sites. That's surprising to me. Yeah, that's surprising because, you know, the original vision demo, they actually showed a screenshot of Discord, right?

[01:35:27] And they had perfect OCR. Yes, it's true. But they should be good for you.

[01:35:32] Div Garg: It can be very interesting. But the thing is, like, even without vision, we can just do, like, so many things. Yeah. So, like, adding vision maybe, like, helps a bit. But it's not, like, really game changing for us

[01:35:42] swyx2: right now. That's surprising.
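For context on the parsing Div mentions, here is a rough, generic illustration of compressing a page's DOM into a compact text representation for an LLM, using requests and BeautifulSoup; this is an assumption-laden sketch, not MultiOn's actual pipeline:

```python
# Rough illustration of boiling a page's DOM down to a few thousand tokens of text
# plus interactive elements an agent can act on. Generic sketch, not MultiOn's parser.
import requests
from bs4 import BeautifulSoup

def compress_dom(url: str, max_chars: int = 12_000) -> str:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # Drop non-content nodes that only waste context.
    for tag in soup(["script", "style", "svg", "noscript"]):
        tag.decompose()

    lines = []
    # Keep interactive elements, with a stable index the agent can refer back to.
    for i, el in enumerate(soup.find_all(["a", "button", "input", "select"])):
        label = el.get_text(" ", strip=True) or el.get("placeholder", "") or el.get("name", "")
        lines.append(f"[{i}] <{el.name}> {label}")
    # Then the visible text, whitespace-collapsed.
    lines.append(soup.get_text(" ", strip=True))

    # ~3k tokens is roughly 12k characters of English text.
    return "\n".join(lines)[:max_chars]

if __name__ == "__main__":
    print(compress_dom("https://example.com")[:500])
```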

[01:35:43] Okay. Well good, good to know. Anything else that you would highlight from today?

[01:35:47] Div Garg: I'm just, like, really excited about, like OpenAI trying to become a, like, a marketplace. Yes. An app store. Yes. So if this can take off, they could potentially kill, like, Apple App Store and become, like, the new thing there.

[01:35:58] And then it's really hard to say, like, how things will go. They've tried this with plugins before, but this is like, this might actually work this time. But we're just really interested to see, like, how two years from now, how a lot of the development might, like, how the world looks like. And I'm very excited about, like, two years from now, like, everything will be so different.

[01:36:14] We might not even use computers or even, like, mobile phones. You just have a system, you just talk to it, and the system goes and does everything. It'll be a fascinating

[01:36:21] swyx2: world. So one last question before we go. You have a nice side gig teaching at Stanford. You did that while you, you were a PhD student, and then on top of that.

[01:36:28] But you, you're still teaching or curating Transformers United? Yeah, so I dropped out

[01:36:33] Div Garg: from the PhD but I'm still a

[01:36:34] swyx2: lecturer at Stanford. Yeah, okay. So, like, what paper should people read to like, like, catch up on this? Like, what, what, what is like, top of mind in terms of like research that is informing what we're seeing?

[01:36:45] Yeah,

[01:36:46] Div Garg: that's definitely very — it's a good question, because things are moving so fast, and there's like hundreds of research papers coming out, like, literally, like, every few days. I'm really excited about, like, developments that are happening at, like, Meta, so a lot of this work is open source — all the Llama stuff, all the Mistral stuff — I feel like that's very interesting on the transformer side.

[01:37:02] swyx2: Do you believe sliding window attention was the key for Mistral?

[01:37:05] Div Garg: I feel so for them, but I feel like there might be other ways to do it. There's some secrets, right?

[01:37:08] swyx2: There was probably some secrets. Yeah. Okay, well, that's all the time we have, but thank you so much. Thanks a lot. Thanks. Okay, and our next guest is Louis Knight-Webb.

[01:37:15] Louis Knight-Webb (Bloop.ai) - AI Code Search

[01:37:15] swyx2: CEO and co founder of Bloop AI, and organizer of the AI meetups in London, where he is a very prominent and staunch member, unlike Raza, who has defected to San Francisco since our last conversation. Louis always has very interesting takes in person, and it was a pleasure to finally actually get him to come on the pod, but also, we recorded this while inside of a Waymo on the way to our afterparty.

[01:37:39] So Louis, you are new to the pod, but we've been friends for a while. Maybe explain, maybe introduce yourself and how you come to the world of AI. Yeah,

[01:37:48] Louis Knight-Webb: I guess, so we started Bloop, me and my co founder three years ago in a very different era for, for machine learning. And we both started the company because we wanted to help engineers navigate large code bases in a much better way.

[01:38:07] Yeah. And originally that was Training our own models to do natural language code search. And today, we still do that, but obviously those language models are very small compared to the state of the art. Yes. And so they're just one part of a... A much bigger pipeline.

[01:38:24] swyx2: I see you as a very astute technologist.

[01:38:26] You used to be a VC. You wrote the first check into HumanLoop. And you used to share an office with HumanLoop. To the point that I called it HumanBloop. Yes. I think you liked that.

[01:38:36] Louis Knight-Webb: Yeah, I did. That is good. We're considering renaming.

[01:38:41] swyx2: And you also run AI Tinkerers in London.

[01:38:43] Louis Knight-Webb: I do, yeah. London has a kind of a slightly different mix of talent than, say, San Francisco.

[01:38:50] You've got a lot of agencies, a lot of enterprises. And so, yeah, we just felt a need to start, like, a very startup focused event, and that's why we created AI Tinkerers London.

[01:39:00] swyx2: Yeah, I think Alex Graveley would be very happy to hear about all the stuff that you've been doing. And I've been to one of them and it's really good work.

[01:39:07] I might be the only one that's been to both.

[01:39:13] Okay, so let's fast forward to today. A whole bunch of things was announced. What's top of mind for you? Yeah, so,

[01:39:19] Louis Knight-Webb: I think, like, context length is something that that we spend a lot of time evaluating whenever something new drops. All of the, kind of, standard evals you know, the, the, kind of, literacy tests, things like that.

[01:39:33] They, they generally don't do a good job of measuring whether a model can actually use the context length that it, that it claims it has. Yeah,

[01:39:42] swyx2: context utilization is... That's what I saw Will DePue today call it.

[01:39:46] Louis Knight-Webb: Exactly. And so this basically started maybe five months ago over the summer when Claude 2 dropped and you know, obviously it had 100k context and we were really excited about that.

[01:39:57] So we ran an experiment to see, basically, if we hid 10 pieces of information in the prompt and we increased the size of the prompt — you know, so you do it at 1,000 tokens, 10,000, etc., up to 100,000 — how many of the original 10 pieces of information can it retrieve? And we essentially found that the accuracy drops off a cliff between 1,000 and 10,000 tokens. And we repeated the same experiment with GPT 4, and, you know, we found similar results: at 32k, GPT 4 can only find one of the 10 pieces of information, but if you were only using a thousand tokens it can find nine of the pieces of information.
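A bare-bones version of the context-utilization experiment Louis describes — hide ten facts in prompts of increasing size and count how many come back — assuming the OpenAI Python SDK v1+; the filler text and fact placement are simplified compared to a real eval:

```python
# Minimal "hide N facts in a long prompt and count how many are retrieved" eval,
# in the spirit of the experiment Louis describes. Sketch only; a real eval would
# use more realistic filler and careful token accounting.
import random
from openai import OpenAI

client = OpenAI()

FACTS = [f"The secret code for item {i} is {random.randint(1000, 9999)}." for i in range(10)]
FILLER = "The quick brown fox jumps over the lazy dog. "  # roughly 10 tokens per repeat

def build_prompt(total_tokens: int) -> str:
    # Spread the 10 facts evenly through ~total_tokens of filler text.
    n_chunks = max(total_tokens // 10 // len(FACTS), 1)
    parts = []
    for fact in FACTS:
        parts.extend([FILLER] * n_chunks + [fact + " "])
    return "".join(parts)

def score(context_size: int, model: str = "gpt-4-1106-preview") -> int:
    prompt = build_prompt(context_size)
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer from the provided document only."},
            {"role": "user", "content": prompt + "\n\nList every secret code mentioned above."},
        ],
    )
    answer = resp.choices[0].message.content or ""
    # Count how many of the hidden codes appear in the answer.
    return sum(fact.split()[-1].rstrip(".") in answer for fact in FACTS)

for size in (1_000, 10_000, 32_000, 100_000):
    print(size, "tokens ->", score(size), "/ 10 facts recovered")
```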

[01:40:36] So what that tells us is that, you know, context utilization 5 months ago was, was, was not great with, with all of the state of the art models. So, with the announcement of 128k today and... That's the first test you'll run? That's the first test I'll run. Okay. You know, having spoken to a couple of the team members who...

[01:40:53] do evals at OpenAI — you know, they're pretty confident that the model's got better ability to answer questions at those context lengths, so it's time to,

[01:41:02] swyx2: time to measure. Time to measure. Any other of the API features reproducibility, does that matter to you?

[01:41:08] Louis Knight-Webb: I think, to me personally, no.

[01:41:11] I kind of like the creativity. I normally have my models at like, you know, 0.7, a bit of temperature. But I know lots of people on the Bloop team who will be very happy, I'm sure.

[01:41:23] swyx2: And then, I guess, the JSON features, there's so many, like the multi modal features, any of that appeal for you personally? JSON

[01:41:33] Louis Knight-Webb: is definitely a big one.

[01:41:35] I think it allows you to to kind of standardize how you call different models. Yeah. So instead of having to build, you know, the, and it's not a massive thing to build, but to build the, the, the kind of function calling integration. And then if you want to try Anthropic, you've got to go and like have a completely different way of interpreting the output.

[01:41:52] So if you can just stick with JSON across all of your different LLM providers, open source models included — that's definitely a plus, because it allows you to evaluate different models more easily. Yeah, yeah,

[01:42:03] swyx2: very excited about that. You are, so you compete in a pretty competitive space with the code assistants.

[01:42:09] Code search, code assistants, right? We do. There's Sourcegraph, there's Codium, there's other Codium, there's... Yeah. There's Copilot and so on. You've never ventured into the agent side of things. Yeah. Is that a conscious strategy? Are you waiting for the right time? Are you waiting for the right APIs?

[01:42:25] Louis Knight-Webb: I think, I mean, we're seeing traction at the moment with companies that have very large codebases, right?

[01:42:32] And it's not something we hear from those users — you know, when we listen to their problems, it hasn't been, like, an obvious fit to try and build, like, maybe an AutoGPT type of agent. I'd still say, you know, we're very interested in agents, the pipeline we have at the moment. It's basically GPT in a big while loop with function calling, which, you know, like, nine months ago definitely did count as an agent, maybe less so now.
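A minimal sketch of that "GPT in a big while loop with function calling" pattern, assuming the OpenAI Python SDK v1+ tools interface; the single search_code tool is a toy stand-in, not Bloop's actual tooling:

```python
# "GPT in a big while loop with function calling": call the model, execute any
# requested tools, feed results back, repeat until the model answers directly.
import json
from openai import OpenAI

client = OpenAI()

def search_code(query: str) -> str:
    # Toy tool: a real implementation would hit a code-search index.
    return f"3 results for '{query}': parser.rs, lexer.rs, main.rs"

TOOLS = [{
    "type": "function",
    "function": {
        "name": "search_code",
        "description": "Search the codebase with a natural-language query.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user", "content": "Where is the tokenizer implemented?"}]
while True:
    resp = client.chat.completions.create(
        model="gpt-4-1106-preview", messages=messages, tools=TOOLS
    )
    msg = resp.choices[0].message
    messages.append(msg)
    if not msg.tool_calls:          # no more tool use: the model has its answer
        print(msg.content)
        break
    for call in msg.tool_calls:     # execute each requested tool and feed results back
        args = json.loads(call.function.arguments)
        result = search_code(**args)
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
```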

[01:43:00] So, you know, it's just, it's just customer and problem driven, and we don't, you know — it's not a, it's not a hammer for the nails that we've got.

[01:43:06] swyx2: Yeah, so two comments on that. One, I think OpenAI has sort of put their flag a little bit in the definition of an agent. They had three things, right?

[01:43:14] They had custom knowledge, they had custom instructions, and then I forget the third one. Custom tools, let's just say. Actions.

[01:43:23] Louis Knight-Webb: Actions. By that definition, we're doing, yeah, so we've been doing that since about February. That's the, that's the definition.

[01:43:31] swyx2: Then the second observation I would say is you talk to developers.

[01:43:34] But what if the target customer for agents is not developers, it's the PMs, right? So we

[01:43:40] Louis Knight-Webb: definitely see a lot of PMs using the product or people that are defined as like reading more code than they write. So you know, could be designers trying to understand the implications of an interaction. Could be PMs trying to check a contentious time estimate from a developer or something like that.

[01:43:59] swyx2: Hi. Low trust environment there. I'm

[01:44:03] Louis Knight-Webb: talking for, I've seen some,

[01:44:05] swyx2: seen some stuff. Egregious things, yes. Yeah, so, so basically it's still not that appealing for you, but you're, you'll keep a lookout for it. The stateful stuff?

[01:44:14] Louis Knight-Webb: I think based on the definition OpenAI, you know, released today, we tick all the boxes, and I think we were one of the earliest adopters of that.

[01:44:24] If that's the

[01:44:25] swyx2: definition. You just don't brand yourself with the agents?

[01:44:27] Louis Knight-Webb: I don't think it's important to users. I don't think, I don't think that's why people use the product. I mean, we're very solutions focused. I think we, we start, a lot of our branding in at the start of the year was about models and, and, you know, we put GPT 4, GPT 3 right there on the front page and now, you know, we've, we've kind of...

[01:44:44] Reoriented to be more about solutions. I think that that reflects kind of maturity of the the ICP We're going after and where we are with with

[01:44:54] swyx2: sort of stage of company life. Yeah. Yeah Cool. Any other things that you personally know not bloop related are just excited by interested by from today? Any interesting conversations with others?

[01:45:07] Loads of really

[01:45:08] Louis Knight-Webb: interesting ones. I had a fascinating talk with some safety researchers who They were here? They, so there's a couple of people who were kind of PhD students who had kind of looked at adversarial attacks through fine tuning of models and found that, basically, like, it's such a hard problem to solve.

[01:45:29] If you enable fine tuning, it's basically impossible or very difficult to to make it so that you can't disable all the safety features. You can just train it to spit out all sorts of stuff. So that was pretty fascinating. I'm pretty excited about the Waymo we're in right now. Oh

[01:45:47] swyx2: yes so we should tell people we're recording in a Waymo.

[01:45:50] Haven't been looking at the road the whole time. Is this your first Waymo? It is my first Waymo actually, yes. Thank you for taking my Waymo video. But I'm glad we got to experience this together. I've been a Cruise stan the whole time, until they ran over someone. So

[01:46:04] Louis Knight-Webb: my, so, so my take on Cruise — like, sample size, 10 Cruise journeys before they got shut down.

[01:46:12] And three of them resulted in something popping up on the screen saying that I had been in a collision. And...

[01:46:18] swyx2: Did they use the word collision? Yeah, yeah, yeah. That's surprising. I'll show you after that. I took a fair amount of cruises and it didn't, yeah.

[01:46:24] Louis Knight-Webb: And so it was the same situation almost every time, which was a car was in front trying to pass you.

[01:46:28] And I think they just maybe bumped fenders, or maybe the crash detection was clear. Oh, there was actual contact. I think, in one of the cases, I think there was. In the other two, I didn't feel anything. But it came up saying, like, you've been in a collision, and somebody comes over the intercom things like that.

[01:46:42] So, yeah, I mean, out of, you know, ten rides, three of them ended like that. So I think, yeah, definitely some questions there. But this Waymo moves pretty smooth.

[01:46:51] swyx2: Maybe also we're in a better neighborhood for driving, because we're going to Golden Gate. The time of

[01:46:57] Louis Knight-Webb: day, that's a really good point. I noticed that all of the ones I took at night, all of the cruises I took at night were fine, and when I took one during rush hour, it was a completely different experience, because the routes it would take, it had this really aggressive, maybe traffic management, something that was going on, so it'd take a long time to get from A to B.

[01:47:15] swyx2: Yeah. It often puzzles me, slash, interests me, that Self driving is almost solved. We still have some bumps in the road, sometimes the bumps are human.

[01:47:27] Louis Knight-Webb: It's solved in San Francisco, where you've got wide open roads, nobody cycles, and...

[01:47:33] swyx2: That's not true. Some people cycle. I live here, excuse me. Some people cycle, some people cycle.

[01:47:38] Louis Knight-Webb: I mean, compared to like, compared to London, where you've got, you know, roads half the size, built for horse and carriage, and millions of cyclists, and buses, and all sorts. So I think, you know, it's going to be a long time until we have that same experience of a cruise or Waymo today, London.

[01:48:00] swyx2: I understand, London's a tougher neighbourhood, but still, we're 80 percent there, 75 80 percent there, whatever, right?

[01:48:07] But, like, and it seems like the stuff that we do in the rest of our lives in terms of AI automation is so primitive compared to this, which is the car that we're sitting in right now. And I find that weird. I find, like, the relative ease, or the relative, like, here ness of this technology is very disparate.

[01:48:26] Like, how come it didn't trickle down from self driving to the rest of tech? Yeah,

[01:48:30] Louis Knight-Webb: it's interesting, isn't it? Well, I don't know how those pipelines are built. I assume that's the secret sauce, right? The flip side of that argument is like, maybe it's very scary that we know, like now many more people understand the, the mistakes that these, these types of systems can make because we're all getting hands on with, with GPT, and this system is equally as problematic, and we're just oblivious to it because it's a black box.

[01:48:58] Almost at

[01:48:59] swyx2: your drop off. Check the app

[01:49:00] Waymo: for walking directions.

[01:49:02] swyx: Okay, Waymo. All right. Well, I think, yeah, that's probably... Alright, but thanks so much for giving a quick review, and thanks for having me. Yeah, yeah. So that was Louis, whose opinion I think is very reflective of the people who are building code generation or code search type startups based on top of GPT 4.

[01:49:21] Shreya Rajpal (Guardrails)

[01:49:21] swyx: And as we headed into the Dev Day venue, we actually caught Shreya Rajpal from Guardrails.ai, and there was an interesting comparison here in our conversation between how she views the LLM stack versus how OpenAI views the LLM stack. OpenAI actually had a closed door session where they gave some thoughts on how they felt that people should start from prompting and build up into a full software system, and they actually differed a little bit from Shreya.

[01:49:47] Don't worry, all that is recorded. The videos will come out in a week, but you can listen to Shreya's take. So, so we're reviewing AI Engineer Summit.

[01:49:54] Shreya Rajpal: Yeah, we're reviewing the AI engineer summit, and it was a very, very well organized conference. And a small thing that I was thinking about is that your swag, Yeah, is it on?

[01:50:04] Okay, it's on, yeah. Your speaker swag was, like, not surprisingly, I guess, but like, really weirdly very nice. And it just kind of, like, showcases this attention to detail that I think, like, really kind of permeated the entire, you know, conference. Like, every single decision was very well thought through, and, you know, kind of, like, To a degree of like quality that's very rare to see.

[01:50:23] So yeah, it was amazing. I thought you guys did like an absolutely fantastic job. This one

[01:50:27] swyx2: mostly goes to Ben. So I'm definitely going to make sure that Ben understands that I really appreciate the work that he does. This is why I couldn't do it myself, you know, I'm mostly the content guy, but I don't, he's the logistics, and he's run conferences for 8 years so that's why I keep working

[01:50:41] Shreya Rajpal: with him.

[01:50:42] Yeah, I also kind of really enjoyed the 18 minutes, you know? Really? Yeah. Yeah, when I saw that, I was like, huh, is this going to be, you know, is this going to be enough, and like, is that, but it was like... It'd be great. Yeah, yeah, yeah, yeah, yeah I, I think the 18 minutes was actually the right kind of bite size.

[01:50:56] swyx: It's optimized for YouTube. Yeah, I see, interesting, okay. Because it's not the in person audience that

[01:51:00] Shreya Rajpal: matters. I see, I see. Interesting. Okay. I need to promote my, my video more. Yeah,

[01:51:07] swyx: is it, is yours up yet? I don't think it's up yet. It's not up yet? Yeah, we're releasing, we're dripping them out to spread it out.

[01:51:14] I see. Okay. Sounds good. Yeah. Thank you for joining us. Maybe in two weeks from now. Okay, sounds good. Okay, so welcome back. Thank you for having me. I think you were guest number five. You were super early. So we're at the after party now. How do you feel about the whole day?

[01:51:30] Shreya Rajpal: I'm really excited. I think it was Yeah, I think the excitement in the air with like everybody just like waiting with bated breath to see I guess, like, what gets destroyed, but also, like, what gets really optimized.

[01:51:42] I think this is, like, very — it feels like you're really part of a movement. And it's, Shawn, you know — us, like, early people in this space, we gotta stick together, because, like, whatever happens to any of our companies, you know, there's such a, like, there's such a transformative moment in technology that you don't care, right?

[01:51:57] Yeah, we're all gonna, like, look back on this time, but I, I had a, I had a blast. Like, I really, really enjoyed the the releases. Yeah.

[01:52:04] swyx: What got destroyed?

[01:52:05] Shreya Rajpal: Ward got destroyed.

[01:52:07] swyx: I'm mining for hot takes here.

[01:52:07] Shreya Rajpal: Once again, I think my takes are unfortunately very measured this time. I wish I had spicier takes.

[01:52:15] Your takes

[01:52:16] swyx2: are within the guardrails of

[01:52:18] Shreya Rajpal: common behavior, yes. I was — I think retrieval is like the big one for me. I think it's kind of really exciting to see the retrieval baked in. And that's one thing where I'm very interested to see, like, does that pattern become common by model providers?

[01:52:37] Like open source model providers, and then how much of retrieval do you have to do yourself, you know, and like what remains challenging about retrieval compared to just like, you know, this, this really easy API to just like have it done for

[01:52:49] swyx2: you, right? Yeah, I think what they did was effectively build the basic patterns in, but for the more advanced stuff, you're still going to need LangChain, LlamaIndex, all those.

[01:52:57] Shreya Rajpal: Yeah, yeah, yeah, yeah. So for the longest time, I believe that in RAG, it's the retrieval that's the hard part, right? Yeah. And then generation is really easy. As long as you have better, like, good retrieval, you can, like, get really, really far, and the generation only gets you, like, a little bit over. And so, I'm really curious to see, like, okay, how, once again, like, how complex do you need it to be in order to start seeing good results?

[01:53:17] swyx2: Yeah. Okay. Interesting. And what what are your normal benchmark tests? Testing, like, do you actually have a set of tests that you run whenever you are like exploring something? Or some personal favorites of like use cases that you think are tricky for LLMs to do

[01:53:33] Shreya Rajpal: well? I think like a big focus of ours is on hallucinations, so always kind of like checking out hallucination and like conflicting instructions, etc.

[01:53:42] is one. Terse responses is another, you know, like how well is it at like not, you know, you ask it a question and here's this 10 point list, and you know, very, very verbose. Do you have a terse

[01:53:51] swyx2: response

[01:53:51] Shreya Rajpal: validator? Yeah, well not, we don't have it, like, we don't have it publicly, but like we do kind of like check it.

[01:53:57] Ah, okay, okay. So I think like those are kind of some of the things.
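A toy illustration of the kind of output check Shreya is describing — here, a verbosity/terseness validator — written as plain Python rather than the actual Guardrails library API:

```python
# Toy output validator: flag LLM responses that are too verbose (or too list-heavy)
# and report a reason. Plain Python sketch, not the Guardrails library's real API.
from dataclasses import dataclass

@dataclass
class ValidationResult:
    passed: bool
    reason: str = ""

def validate_verbosity(response: str, max_sentences: int = 4, max_bullets: int = 5) -> ValidationResult:
    sentences = [s for s in response.replace("\n", " ").split(".") if s.strip()]
    bullets = [line for line in response.splitlines() if line.strip().startswith(("-", "*", "•"))]
    if len(sentences) > max_sentences:
        return ValidationResult(False, f"{len(sentences)} sentences exceeds limit of {max_sentences}")
    if len(bullets) > max_bullets:
        return ValidationResult(False, f"{len(bullets)} bullet points exceeds limit of {max_bullets}")
    return ValidationResult(True)

# On failure, a guardrails-style pipeline would typically re-prompt the model with
# the reason attached ("please answer in at most 4 sentences") or fall back.
print(validate_verbosity("Here are ten points:\n" + "\n".join(f"- point {i}" for i in range(10))))
```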

[01:53:59] swyx: There was one, there was one example in the — one of the closed door sessions where they, they — all the answers were too terse. Yeah, yeah, yeah. Where I think everyone laughed when they were like, Can you write a blog post about this?

[01:54:08] And the guy, and the GPT said, Sure, I'll do

[01:54:11] Shreya Rajpal: it tomorrow. Yeah, yeah, yeah. I think, like, those are — I'm really, really excited about JSON generation. Okay. I'm actually kind of surprised to see how long it took them — they're

[01:54:25] probably just doing constrained decoding under the hood, right? Like constrained generation. Okay. Because they're now saying it's guaranteed correct JSON, rather than, you know, more correct. Do you get what I'm

[01:54:34] swyx: saying? I was, I was parsing through their words. They've never had an issue producing JSON. It's just that sometimes it doesn't fit the JSON schema.

[01:54:42] Right? Am I, am I wrong? You would know better than, more than me. No,

[01:54:46] Shreya Rajpal: I think there are also issues with, like, producing... I think the, okay, the obvious thing is, like, unbalanced brackets when it runs out of context length — I think that's, like, an obvious thing, right? But, like, weird things when you have, like, really long strings, then quotes, et cetera, become kind of weird.

[01:54:58] Okay. So I think those are some other ones. Schema is obviously kind of challenging, et cetera, yeah. I think there are, even with function calling, like function calling, at least I haven't played around with it yet today, but previous generations of function calling wouldn't guarantee that your schema is matched.

[01:55:13] Which would be an

[01:55:14] swyx: issue. And I think they're still not guaranteeing it, because I kept waiting for them to say it. I haven't read any of the public docs or anything. Do you know if they're guaranteeing that it fits the schema, or they're

[01:55:23] Shreya Rajpal: like... Oh, that's a good question. I, yeah, that's a good point.

[01:55:25] They never say they guarantee it. Yeah, they never said they gu... They, they guaranteed correct JSON, they didn't guarantee if the JSON matches the schema. So,

[01:55:32] swyx: okay, you can call JSON loads. Yeah,

[01:55:34] Shreya Rajpal: yeah, yeah. But

[01:55:35] look, I'm very curious to see, like, once again, if this is a pattern that, you know, all of the other foundation model providers adopt.

[01:55:41] And I don't see why not, right? Like, I think for them to kind of, like, own specific decoding models is going to, like, make a lot of sense compared to, you know, like, yeah, a lot of the, a lot of the hacky stuff.
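A sketch of the distinction being drawn here: JSON mode (response_format={"type": "json_object"}, as announced at Dev Day) guarantees parseable JSON, but conformance to your schema is still a client-side check. This assumes the OpenAI Python SDK v1+ and the jsonschema package; the schema itself is just an example:

```python
# JSON mode gives you parseable JSON; matching *your* schema is still on you.
import json
from jsonschema import validate, ValidationError
from openai import OpenAI

client = OpenAI()

SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "priority": {"type": "integer", "minimum": 1, "maximum": 5},
    },
    "required": ["title", "priority"],
}

resp = client.chat.completions.create(
    model="gpt-4-1106-preview",
    response_format={"type": "json_object"},  # guarantees valid JSON syntax, not your schema
    messages=[
        {"role": "system", "content": f"Reply in JSON matching this schema: {json.dumps(SCHEMA)}"},
        {"role": "user", "content": "File a ticket: the login page 500s on Safari."},
    ],
)

data = json.loads(resp.choices[0].message.content)  # safe to parse thanks to JSON mode
try:
    validate(instance=data, schema=SCHEMA)          # schema conformance checked client-side
except ValidationError as err:
    print("Schema mismatch, re-prompt or repair:", err.message)
```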

[01:55:52] swyx: Yeah, cool. Any other favorites, you know, not, doesn't have to be guardrails related, any favorite conversations, favorite demos, favorite,

[01:56:02] Shreya Rajpal: I oh, the GPTs and the assistants.

[01:56:04] I think you want to make one for yourself. Yeah, I do want to make one for myself. It doesn't add — like, yeah, it's not very Guardrails related. I do want to kind of play around with, like, how well it works with, like, some of the things we track. But yeah, it was just so fascinating to see the marketplace. I am very, very curious to see, you know, what the marketplace looks like.

[01:56:20] Like, is it? Are people going to have, like, really, really vertically specialized things on the marketplace? Like, if you have a generic, you know, sales assistant or something, right? Like, how much, or SQL generator, how much how popular does that become? Versus, like, sales assistant for X vertical at Y stage of the sales process.

[01:56:38] Oh my god. Do you know what I mean? Like, it's, it's so easy to do this now. Yeah. That, like, where, at what level of specialization do you need to be to kind of start seeing the results? And that is one thing I'm very excited to see, like, how that, how that pans out.

[01:56:51] swyx: It scares me a little bit because it's basically, they said the future of programming is natural language, or something like that.

[01:56:56] Yeah. And that's great, but, like, it really is a new platform, a new operating system, almost, that they're that they're creating. And I don't know how to position myself. Not that I have to, because my world is very developer oriented. But this is a whole no code world that you and I

[01:57:11] swyx: don't touch.

[01:57:12] Shreya Rajpal: Yeah, yeah, yeah, yeah, yeah, yeah, yeah.

[01:57:14] Whoa. Yeah, yeah. I really want to see, like — is there going to be, like, assistants for everything? I'm generally curious to see the impact of this on knowledge work, you know — which, yeah, like, how much of my work — like, if I'm getting annoyed by something, is my first instinct going to be like, you know, let me just, you know, spend the five minutes to build an assistant for this?

[01:57:34] Like, is, is that how everybody's now going to start thinking? You know, and that's one thing I kind of really want to see.

[01:57:39] swyx: Yeah, that's exciting. Okay. Last question. You spoke at AI Engineer Summit. Let's advertise your talk a little bit and point people to your talk. Yeah, yeah.

[01:57:48] Shreya Rajpal: Yeah, so thank you again for inviting me to the AI Engineer Summit.

[01:57:51] One of my favorite conferences that I've attended, you know, this year. My talk was about the new paradigms for working with large language models, you know. For building really production ready applications when the technology that you're working with is under, underneath all of it, you know, non deterministic.

[01:58:05] A really fascinating thing was that OpenAI's talk about building production grade applications talked about how essential it was to build guardrails as a way to get to production grade applications.

[01:58:16] swyx: The one from today? Yes, the one from today. Which people

[01:58:18] Shreya Rajpal: haven't seen yet, but really, really cool talk.

[01:58:21] So I think it really validates what we've been saying pretty much since the beginning of the year, which is that you'll get like, You'll get to a certain point, but at that point you need to start adding guardrails to your application if you need to get your users to start, you know, getting value out of what you build out, right?

[01:58:37] So,

[01:58:38] swyx: I have your chart, and I have their chart. They put guardrails at the first layer. It's not at the end, it's actually right at the beginning for user experience.

[01:58:48] Shreya Rajpal: Yeah, that's right, yeah. Yeah, that was kind of interesting to see that they put it as part of the UX. I'm still kind of very candidly, I'm still kind of digesting that.

[01:58:56] Like, I think of it as, I think of it as part of the infrastructure. And I don't know if, as it's as much UX as it is, you know, just like one of the components that you need in your stack. Yeah. But I, I, I think the pat, like a lot of what they said today, completely validated, you know, what we've felt for the longest time.

[01:59:12] And also what I go really in depth about, like in the talk that I gave, right? Which is that what happens, one, you have the, once you have the bare bones application ready, what is the process? Of actually adding guardrails for what you care about. Like what does that look like? Yeah. You know, what are the risks that you care about?

[01:59:27] How do you verify that those risks are happening or not happening? If they are happening, how do you quantify them? And then how do you mitigate them? That was what, what that was what the talk was about, which I really recommend people go and check out.

[01:59:37] swyx2: Awesome. Well, you did a great job. We're gonna post the talk soon and thanks.

[01:59:41] It's good to see you again. Yeah. Thanks again for inviting me. And that was about all I managed to get before the after party. At the after party, there was actually an after after party thrown by Nous Research.

[01:59:51] Alex Volkov (Weights & Biases, ThursdAI) - "Keeping AI Open"

[01:59:51] swyx2: So let's hear a little bit about OpenAI versus OpenSourceAI. From Alex Volkov. Okay, so we are in the one day after Dev Day here with Alex.

[02:00:01] Hey. Hey. Very, very recognizable voice right now. We don't have to introduce you. Hey, everyone. And we are here to talk about the two parties that happened yesterday. There was one official Dev Day OpenAI afterparty where I interviewed Shreya, who's just before this. And then there's an unofficial one.

[02:00:16] For keeping AI open, by Nous. Yeah. So, what was it like to just compare

[02:00:21] Alex Volkov: and contrast? So, let me maybe start with, like, who Nous Research is. Oh yeah, yeah, most people haven't heard of it. It's written N-O-U-S. I mispronounced it now multiple times. It's Nous Research. It's one of the few...

[02:00:33] organizations online that started, like, from a Discord and then, like, kept going up until, like, a significant amount of people are working with them, affiliated with them — folks who take open source models to their most extreme capability. So they collect data, datasets, from open source and more closed source.

[02:00:49] And depending on that, they release, like, with different licenses, and then they fine tune open source models that were, like, released to us — like Llama, for example, and Mistral, which is a French company that recently released a 7B model. And they've been doing this since Llama 1, but recently it really kicked into high gear with the Llama 2 releases, because Llama 2 ended up with a commercial license.

[02:01:08] So you could actually use this for actual, you know, products and services. And Mistral came out with, like, a full Apache 2 license, with a BitTorrent link — I think you remember that. And so these organizations suddenly became, like, a very, very important currency in the world of, like, where the whole world of AI is going. Because they're running local models, and many companies love OpenAI but either cannot afford this or cannot risk the chance that OpenAI changes something, like what happened at Dev Day.

[02:01:35] And so many people are turning on to, like, okay, if we want to run our own hardware, how do we actually do this? And you can run Llama 2 and Mistral, all these models, on your own hardware, but then you want to fine tune them for your own purposes. And so how do you actually fine tune? And now organizations like Nous Research — probably the biggest one — Alignment Lab, shout out to Austin and folks from Alignment Lab, Skunkworks, and many of these, like, people come up and say, hey, we have the know how — and we only started learning about this, like, eight months ago, six months ago ourselves — but now they're, like, the

[02:02:06] most specialized people that fine tune models and actually release the best kind of models on the Hugging Face open source leaderboard.

[02:02:14] swyx2: Yeah. And in my knowledge, the two models that I keep hearing about — one is Hermes. And they've recently switched the base model for Hermes from Llama to Mistral.

[02:02:24] Because apparently it's better. Hermes is like an instruction dataset, 900,000 instructions. I don't really know where it's from. Maybe I don't want to know. They also do some fun models. There's like a mystical model that they

[02:02:35] Alex Volkov: do.

[02:02:36] swyx2: Trismegistus, yeah. Some stuff like that. I think it's actually a little bit weird that they keep releasing models.

[02:02:42] They release like three models a week. It's insane. Right? And it's very hard to keep up. Like, I'm like, okay, which one is actually the one that I should pay attention to? Yeah. So

[02:02:50] Alex Volkov: first of all, you're welcome to join Thursday Eye and then we talk about all the models every week. Yes. It's kind of...

[02:02:55] interesting that if I do, like, a recap for a month, by the beginning of the month most of the updates don't matter, because, like, every, every —

[02:03:02] swyx2: I'm doing monthly, and I, I feel this, like, I'm doing this, I'm doing this for historical posterity, like, Five years from now, people want to look back, then they can look at my notes, because I only have twelve.

[02:03:14] Alex Volkov: Yeah, nobody's gonna look at your notes, they're gonna have a GPT trained on your notes answering everything. I have, yeah, I'm doing like every week, and every week we're talking about like, this model outperforms that model like significantly, and we're noticing significant changes from week to week.

[02:03:27] Literally in the span of a month we went from a 33 billion parameter model, which is big — and parameter count is not everything; you can have a smaller model with, like, longer training that actually will perform better — but we're noticing smaller and smaller models outperforming bigger ones significantly.

[02:03:43] Zephyr from Hugging Face outperformed Llama 70B, and Zephyr is, like, only a 7B model. On some things. On some things, for sure. And so this is very interesting, because, like, it's really hard to evaluate. Evaluation frameworks are bad. Everybody's saying that they're not representative of anything.

[02:03:56] People can fine tune and over tune on them. And so, there's this whole kind of subculture of open source, mostly on Discord, some of them on, on X and Twitter Spaces. And for some reason — but I find it very humbling and incredible — they also hang out on ThursdAI, and so that's how I got to this.

[02:04:13] That's how I got to meet, like, Nous Research folks — Teknium, Emozilla — and they organized the counter party event last night together with some other e/acc people that we know from Twitter as well. Including Mark, Jason. So apparently he was supposed to — I didn't see him. Oh, okay. But like he was supposed

[02:04:29] swyx2: to, I saw a photo with a bald head of a big guy.

[02:04:32] So I was like, is that Mark? I don't, I don't know. Anyway, but the OpenAI party was at an art museum. Mm-Hmm. And then the Nous Research party was at a

[02:04:39] Alex Volkov: club. At a club? Yes. At Folsom — a club on Folsom Street in San Francisco. Yeah. Yeah. 1015 Folsom, I think. Sure. OpenAI was a very, like, highbrow, buttoned up event,

[02:04:49] swyx2: post event.

[02:04:50] Yeah, there was a live band, someone playing jazz.

[02:04:54] Alex Volkov: Which, I think I mentioned this once, it was too loud. We want to talk, we don't want to listen to music. No, no, no, we're just

[02:04:59] swyx2: old. Everything is too loud.

[02:05:02] Alex Volkov: And then, it was like a lot of people, a lot of networking, a lot of people trying to get together, maybe do business together.

[02:05:07] Very — OpenAI actually showed up. A lot of people. We, we stood in line — there was a long line for the metal detectors to, to step in — and then everybody, like, passing us was, like, an OpenAI employee just passing, like, straight through. Yeah. And then that ended around eight, which was like the standard San Francisco, like, buttoned up.

[02:05:24] Oh yeah. That's when you go to bed. That's when you go to bed. And that's when the other party kind of started. Yeah. Yeah. And I think they just seized the opportunity 'cause everybody's in town for the OpenAI stuff. Yeah. Why not make a splash, an announcement for, like, for open sourcing AI? So literally, the website was keepaifree.

[02:05:41] com, and the invite was keepaiopen.com. And you had to register, you had to go in there, and this was, to me, an incredible kind of show of Twitter in real life. So all of the folks who follow Marc Andreessen — he recently stepped into this thing with, like, the techno optimism stuff.

[02:06:00] He started to boost the effective accelerationism folks. And so there's a lot of, like, signature stuff from that, like, ecosystem on Twitter. There's, like, don't tread on me, with, like, you don't take away my GPUs. There's, like, all these signs across the club. It's a very visual club as well — so the DJ is a whole, like, 3D projected thing.

[02:06:21] So there's, like, a bunch of, like, art and, like, live things about keep AI open. I, I found it, like, very, very super cool. I have to tell you a tidbit I saw — me and Killian were there, from Open Interpreter. We saw two people with lab coats. It was like, what's the deal with lab coats? So we went and asked, and they just said, hey, we just, like, came back from our work, where we work on semiconductors. We're actually, like, touching chips, whatever — just, like, didn't change out of it. And to me it was, like, so incredible that in the keep AI open, GPU poor kind of party, we have people who literally work on superconductors who came from work — like, they're working on chips.

[02:06:53] Yeah, yeah. Semiconductors are

[02:06:54] swyx2: superconductors, very different thing. I think semiconductors. Yeah, we had that superconductor episode a while back. I think people are still recovering.

[02:07:03] Alex Volkov: I'm personally still recovering from that. That was the whole thing for me, yeah.

[02:07:06] swyx2: So is Nous Research, like, vibes? You know, like, what is the mission, apart from to keep publishing open source models?

[02:07:15] Alex Volkov: I think you'll have to get some Nous people to actually speak, like, about the mission, about the actual product. But as far as I understand — no matter how much of the product side there will be, and there will likely be, there's so many people that are doing, like, so incredible stuff that people notice — so no matter how much of the business side there will be, they're, like, committed to fully open source as much as possible, including datasets, including models that are, like — Trismegistus, for example, their model that's, like, trained on the occult and the physical and metaphysical. You can't expect OpenAI to let you

[02:07:48] Talk with a model, they'll answer with like mystical questions, mystical stuff.

[02:07:51] swyx2: Astrology, Halloween.

[02:07:54] Alex Volkov: So you're very like easy into the astrology and Halloween. They're talking about like you can ask this model about the resurrection, right? Like all of the occult like craziness that they've collected, OpenAI will not let you do that.

[02:08:04] And so there's — I think OpenAI will not let you do it by default because they have lawyers and they don't want to get sued. Recently they announced the Copyright Shield thing, so you won't get sued because of their models. So they're — them, Anthropic, all these big companies — it's very important for them to protect the outputs and the models.

[02:08:20] Here, these folks are like, Hey, if you want to build a model, fine tune this, we're going to teach you how. Jump on our discord. We're going to help you with producing like the biggest models. And then if, you know, there's going to be like a financial aspect to this as well. If you're a company that wants to run this, we'll also help you do that.

[02:08:35] swyx2: Yeah, so it's the same as Stability, basically — that's, from talking to him, that's what I gather. Yeah. Cool. Anything else that people should know about the party, about Nous? I

[02:08:45] Alex Volkov: found this whole day to be like a very singular AI day, and we don't get many of this. GPT 4, I think, was the biggest one previously.

[02:08:53] Yeah, March. It was like a singular — March 14th, that's when ThursdAI started. We started talking about this every week. This was a singular day in San Francisco. This, like, started with a pregame party with Swyx and some other folks, where I got to feel, like, a little bit of San Francisco. And then Dev Day was incredible.

[02:09:08] We just heard from Simon. There was like a garage that they made into a venue event, probably custom venue event on the fly, which like just talks to how much they can pull off. It felt to me that like this Dev Day event and then the following party, it felt a little bit like Almost like an Apple thing, where like, it's going to be a yearly thing that people will like, try to get in as much as possible.

[02:09:28] One thing to note is that there were many people at the other party who didn't get in to this party. And so, you know, they were watching from, like, a watch party.

[02:09:36] swyx2: Yeah, this this office right here.

[02:09:37] Alex Volkov: People watched from this office, and people watched in the live space that we, we... Yeah, 8,000

[02:09:42] swyx2: people tuned in to our spaces.

[02:09:43] 8,000 people

[02:09:44] Alex Volkov: tuned in? I didn't even have a

[02:09:45] swyx2: chance to look at it. I always want to know the number. Oh, wow. So it shows the relative level of interest, and you know, like, they quoted 22,000. Mm. And this is 8,000. Yeah. Just relative interest. Yeah, there's

[02:09:56] Alex Volkov: like two spaces as well. Robert Scoble, he stole the thunder a little bit.

[02:10:00] He stole some audience from us. Shout out to Robert. And I think it was a singular day. And the Nous Research party, the keep-open-source-open crowd, e/acc, Marc Andreessen, all of these things together also added to the top of this. Because it all happened on the same day, one on top of another, in the same place, San Francisco.

[02:10:15] I find it incredible. I will, you know, definitely come back next year. Yeah. Okay. Yeah.

[02:10:20] swyx2: Well I think you'll be back sooner than that. Yeah, probably. There'll be other things going on. All right. Thanks. Awesome. All right.

[02:10:26] Rahul Sonwalkar (Julius AI) - Advice for Founders

[02:10:26] swyx2: Last but not least, we go back all the way to the Newton, where I started this podcast, where we checked in with Rahul Sonwalkar, better known as Rahul Ligma, who just celebrated his one year anniversary as one of the biggest memes and celebrities in San Francisco.

[02:10:43] But by day, he's also the CEO and co founder of Julius AI. What's up, Swyx? Hey good to see you. It is one day after Dev Day, and we all had a chance to process. How do you feel? What's what's your top takes? That

[02:10:57] Rahul Ligma: was awesome. I got to see a bunch of really smart people who are building cool things with OpenAI, GPT, Dolly.

[02:11:03] The event was very well put together. The keynote was awesome. The energy in the room was crazy. And I could see real time social media firing up with all these takes. Overall, I think it was a good, good day. Yeah, I

[02:11:15] swyx2: interviewed Surya Dantuluri. Yeah. I think you know him. He was like, Sama just killed my startup.

[02:11:22] And it was almost true for him. Cause he has a bunch of plugins. And plugins are kind of deprecated. Yeah, yeah,

[02:11:30] Rahul Ligma: yeah. The plugin thing was interesting because it was, it's going to be deprecated, but

[02:11:35] swyx2: they just

[02:11:37] Rahul Ligma: accidentally turned it off yesterday. Yeah, so he freaked out a bit. He freaked out, and then they brought it back up.

[02:11:42] It's

[02:11:42] swyx2: Yeah. Yeah. So, top features that you're interested in, that you want to explore more.

[02:11:48] Rahul Ligma: I think people are super psyched about the Assistants API, but personally, if you ask me, of the two things I'm most excited about, the first is turbo. Yeah. The speed is crazy.

[02:11:57] swyx2: And... Have you actually, have you measured, you know, do you know any, like, rough measures?

[02:12:01] Because I don't think they actually ever mentioned the speed relative difference. I

[02:12:06] Rahul Ligma: started noticing the speed difference in chat GPT, actually, like, a few weeks ago. Oh, I

[02:12:11] swyx2: see. So they already slowly eased

[02:12:12] Rahul Ligma: this into it. Yeah, yeah. And I saw, like, takes on Twitter that, did anyone notice chat GPT get much faster?

[02:12:18] And I noticed it too. Yeah. But, so it's turbo, it was exciting, but the second thing that's exciting is multiple function calling, and then the JSON output formatting. I think as developers are building on... The dev API. So that's the thing that's super exciting to me. You know, of course there's vision stuff, there's code interpreter as a tool in the API.

[02:12:40] But I think what will bring the most applications is actually the speed. Because if you look at our numbers, users on Julius are not patient. They want an answer, and they want an answer quick. And we see clearly that if you can get an answer to them a few seconds faster, there's a clear difference in conversion.

[02:13:05] So, speed is going to be big. What is conversion for

[02:13:07] swyx2: you?

[02:13:07] Rahul Ligma: Is that just paying? Oh, no, it's like, from first message to second message. I see. So we do code gen, and then we run the code, and then the code has an output, and the user asks a second message, and we can just see the funnel, where, if it's faster, the code runs faster.

[02:13:24] And the second thing is multiple function calling. I think you're basically telling the AI that, so I think people misunderstand function calling. It's essentially tool use. And if you can tell the AI, hey, you can give me multiple tools to use at once, I think that's going to unlock different applications than before.

[02:13:44] Because before it was just like, okay, this is a task, tell me one tool and what's the input for it. But if the AI can now use multiple tools in parallel, you can, first of all, have more specialized tools, and then get more specialized instructions for each tool. Yeah. It's just going to unlock a lot of cool applications that previously weren't possible.

[02:14:04] swyx2: There was a practical limit in the number of tools that you can give it, right? So we had this discussion in February, March, April, when they released the function API. That is subject to the context window, the JSON schema itself. Yeah. Does that change at all? Or, I don't know if you, I, you

[02:14:19] Rahul Ligma: know, I don't.

[02:14:21] Yeah. But what I noticed though, before, even before was that more functions and more options just confused it. And that's what I want to play with next is like, okay, what's the breaking point? I see, like, does more options, you know, confuse it? Does it

[02:14:35] swyx2: make it Would you, would you use multiple function calls as well, or?

[02:14:39] Oh, totally, totally. Is that just theoretical?

[02:14:40] Rahul Ligma: No, no, no. I have a direct application for it right now. One of them is oftentimes, GPT writes code, and then we run that code, and we realize that, oh, from GPT's last knowledge update, that module in Python has changed. It has new functions, new APIs. So today, the way we do it is, when the error happens, we tell GPT, Okay, you can go look up.

[02:15:01] new documentation, and then fix that error. But with multiple function calling, the way we would do it is like, give me the code, but then also give me a documentation lookup. And then when the error happens, I can just quickly fix that without another GPT call and then keep moving. But I mean, in general, multiple tool use to me is just so exciting as a developer.

[02:15:23] And I wish people were talking more about this.
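To make the parallel tool use idea above concrete, here is a minimal sketch using the OpenAI Python SDK (v1.x). The two tool names, run_python and lookup_docs, are our own illustration of the pattern Rahul describes, not Julius's actual tools.

```python
# Minimal sketch of parallel function calling (multiple tool calls in one response).
# The tool names below (run_python, lookup_docs) are hypothetical examples.
from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "run_python",
            "description": "Execute a Python snippet and return its output.",
            "parameters": {
                "type": "object",
                "properties": {"code": {"type": "string"}},
                "required": ["code"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "lookup_docs",
            "description": "Fetch current documentation for a Python module.",
            "parameters": {
                "type": "object",
                "properties": {"module": {"type": "string"}},
                "required": ["module"],
            },
        },
    },
]

resp = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[{"role": "user", "content": "Plot my CSV, and check the pandas docs in case the API changed."}],
    tools=tools,
    tool_choice="auto",
)

# With parallel function calling, the model may return several tool calls at once,
# e.g. one run_python call and one lookup_docs call, which can be executed together.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```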

[02:15:26] swyx2: Yeah, I mean people are still coming to terms with just like the base model and prompt engineering and all that. That's still important, but for engineers, I think you should explore these other advanced features. True. Yeah, yeah. Anything on the multi modality side that you're interested in?

[02:15:39] I mean,

[02:15:40] Rahul Ligma: vision will be super interesting for sure. And we have this functionality in Julius right now where you can generate React and HTML components.

[02:15:49] swyx2: Like v0? I think Matt was showing me. Yeah, a little bit of that demo. Yeah, yeah. We have been hacking on it

[02:15:56] Rahul Ligma: a lot. I think the missing piece here is that, well, an engineer who knows React probably wouldn't find this useful, but if I can allow anyone in the world to just draw a mockup on a piece of paper, then run that through vision and turn it into actual components they could use on a webpage, that'd be sick.

[02:16:19] And what's even more sick is to have the feedback loop where you take a screenshot of the generated page and feed that screenshot back into vision, then come up with more instructions, and have that loop. Yeah. Wow. Like a self-improving webpage. Isn't that crazy? Yeah. I'm, I'm so

[02:16:35] swyx2: excited. Yeah.
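That screenshot-in-the-loop idea is simple to sketch. Below is a rough, hypothetical version of the loop using the GPT-4 vision API: generate HTML, render it, show the screenshot back to the model, and ask for a revision. The render_screenshot helper is a placeholder (for example, a headless browser capture), not a real library call.

```python
# Rough sketch of the screenshot feedback loop described above.
# render_screenshot() is a placeholder for a headless-browser capture step.
import base64
from openai import OpenAI

client = OpenAI()

def critique_and_revise(html: str, screenshot_png: bytes) -> str:
    """Show the model the current HTML plus a screenshot, get improved HTML back."""
    image_b64 = base64.b64encode(screenshot_png).decode()
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Here is the current HTML and a screenshot of how it renders. "
                         "Point out visual problems and return improved HTML.\n\n" + html},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

# A few refinement rounds (render_screenshot is a placeholder, not a real API):
# for _ in range(3):
#     png = render_screenshot(html)
#     html = critique_and_revise(html, png)
```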

[02:16:36] Yeah. So in my mind, Julius is very data focused. By the way, I didn't introduce you, I didn't introduce Julius. I was just gonna do it separately. Yeah. But people know who you are. Yeah. You have a Wikipedia page. Yeah. You just passed your one year anniversary as Rahul Ligma.

[02:16:50] Thank you. By the way, did any fun things happen on the anniversary? Or, one of the fun things: Ilya recognized you on the spot. Oh. Ilya

[02:16:56] Rahul Ligma: was like, oh my God, this is, ah your famous or whatever. And no, these guys are so awesome. Like, they're so humble. But anyhow, on the first one year anniversary, nothing really, like, it's, I mean, you knew about it a week

[02:17:07] swyx2: before.

[02:17:08] I like to set anniversary dates. That's awesome. Because it reminds people of the passage of time. Like, it's like, wow, s**t, has that been a year? Yeah. And then you're like, I think it motivates, it motivates me more than, like, Memento Mori. Like, yeah, you know, sometimes you're out of date. But it reminds me to spend my years wisely.

[02:17:27] To do interesting things with the time

[02:17:28] Rahul Ligma: that I have. Memento mori is kind of depressing, whereas

[02:17:31] swyx2: this is, this is like, oh yeah, did you know that like one year ago we had this thing? Yeah okay cool, but then Julius you, data analysis chat thing basically Code Interpreter is how I think about it.

[02:17:42] And also, you just crossed 100,000 users? You have delivery modes across your plugin as well as a chat box, like a dedicated web app? Yep. Okay. Anything else that people should

[02:17:54] Rahul Ligma: know? Well, the, our vision is, you know, writing code is super fundamental to doing things. You could not only automate a bunch of tasks in your life which is writing code, but also it's how you how you just, like, interact with the universe, right?

[02:18:10] You have code that brings you a Waymo car that picks you up and drops you off somewhere. And I think allowing these language models to write code and do things for you is really powerful. And data analysis is the application that we're most excited about right now, because that's what it's good at immediately.

[02:18:27] But just on Friday we launched FFmpeg support. And there were people trying to upload videos, turn those videos into GIFs, or, like, take a YouTube video and turn it into a, you know, short summary, and all these different cool use cases that we didn't truly, like, hard code into Julius. We just told it, hey, now you can run FFmpeg and you can run yt-dlp and MoviePy and all these different things.

[02:18:49] Do these tasks for me. And then people were just, like, organically describing those things. There's this guy, TDM, on Twitter, CTOJr. And he took some meme video and put it on my own tweet, overlaid on my own tweet, that. And then that got a bunch of likes. And I was like, dude, like, this is the first one that gets a lot of likes on, you know, FFmpeg on Julius.

[02:19:11] So

[02:19:12] swyx2: that's Julius. That has a lot of meme potential. It has a lot of

[02:19:13] Rahul Ligma: meme potential, but that's not what we're going for. Yeah. You know, it's just like, letting people, like, do things.
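For a sense of what those FFmpeg-style tasks boil down to, here is an illustrative sketch of the kind of commands a code-writing model might generate for "turn this YouTube video into a GIF". The exact yt-dlp and ffmpeg flags are one reasonable choice, not what Julius actually produces.

```python
# Illustrative sketch: download a video with yt-dlp and convert a short clip to a GIF
# with ffmpeg. The flags are one reasonable choice, not Julius's actual output.
import subprocess

url = "https://www.youtube.com/watch?v=VIDEO_ID"  # placeholder URL

# Download the video as MP4
subprocess.run(["yt-dlp", "-f", "mp4", "-o", "video.mp4", url], check=True)

# Turn the first 5 seconds into a 480px-wide, 10 fps GIF
subprocess.run([
    "ffmpeg", "-y", "-ss", "0", "-t", "5", "-i", "video.mp4",
    "-vf", "fps=10,scale=480:-1", "out.gif",
], check=True)
```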

[02:19:18] swyx2: Your target market is, like, the FD, the enterprise? It's actually individuals who have data.

[02:19:25] Rahul Ligma: And

[02:19:26] swyx2: they just want to drop

[02:19:27] Rahul Ligma: academics, a lot of academics, actually.

[02:19:29] Yeah. A lot of academics, a lot of students, researchers. Any kind of CSV or Excel data, you can just dump into Julius and then have it analyzed for you. We have this video coming out in a few days where you can now actually train a nanoGPT on Julius. So you can give it, hey, here's the GitHub repo from

[02:19:46] swyx2: Karpathy.

[02:19:47] So yeah, do you have GPUs to train it on, or are you just training on CPU minutes? GPUs. Yeah. Yeah, that's true. That's true. Yeah, Karpathy would like that. Yeah. Okay, cool. So the thing I really wanna ask you as a founder is, you know, I think there's always this existential threat about OpenAI building your features, right?

[02:20:04] Yeah. In a way, so like the, the number two default bot in the, in the GPT app store Yeah. Is data analysis. Yeah, and people can build their own by customizing and adding code interpreter. Yeah, although I think there's also opportunities for you. So on the roadmap that they presented in the closed session, they also said you can bring your own code interpreter.

[02:20:25] Yeah, so like how are you thinking about that?

[02:20:28] Rahul Ligma: I mean As a founder, or as, so, who's the audience? Is it like, other founders, or is it? Other founders,

[02:20:36] swyx2: and people are just interested in how you are, you're processing this. Yeah. I mean, I think it's a very interesting story of processing this live, because the news just dropped yesterday.

[02:20:45] Rahul Ligma: Yeah, totally. Well, so, the story behind Julius is that we actually launched Julius three months after Code Interpreter was announced, and a few weeks after it was rolled out to everyone else in the world. Yeah. So we were number two. And even then we got 100,000 users. Because I think there's a lot of work to do to get something to work properly.

[02:21:07] And there's a bunch of examples of this on the internet. So if I'm talking to founders, what I'll tell them is, Man, so many people give up before even getting started. And that happens a lot. Don't do that. Sure you can change your idea. You can find new things to work on. But. The way I'm processing is that, wait, we were actually, we launched after Code Interpreter came out.

[02:21:28] And, there's a hundred thousand people who think Julius is better than Code Interpreter. Or use

[02:21:33] swyx2: it. Or just try it out. Yeah. Or try it out.

[02:21:36] Rahul Ligma: And use it over Code Interpreter. And, there's like a lot of work to do. Like, for example, the FFmpeg stuff we launched on Friday. Mm. Or the HTML stuff. Or React, you know, React component stuff.

[02:21:46] All these different things. To get them to work. It takes some effort. How I'm processing it? I mean, you know, that's like, that's what startups are all about. It's like risk, right? If you, if you want to build a risk free startup, you probably don't want to work on startups. Yeah, just go get a job.

[02:22:02] Just go get a job. Exactly. So I'm having so much fun. The way I'm thinking about this is like, whoa, there's all these new different things I could do now. I could build. That's so exciting to me. And I'm pumped.

[02:22:14] swyx2: Yeah. Awesome. That's it. Any last words? Call to action?

[02:22:18] Rahul Ligma: Call to action. Let's go build some cool things and get a bunch of users.

[02:22:23] swyx2: Let's do it, guys. Yeah. Alright. Awesome. Thanks so much. Thanks, Swyx. I think that's a meme that we can all get behind. Let's go build things for a bunch of users with AI.



Get full access to Latent Space at www.latent.space/subscribe
2023-11-08
Link to episode

Beating GPT-4 with Open Source LLMs - with Michael Royzen of Phind

At the AI Pioneers Summit we announced Latent Space Launchpad, an AI-focused accelerator in partnership with Decibel. If you're an AI founder or enterprise early adopter, fill out this form and we'll be in touch with more details.

We also have a lot of events coming up as we wrap up the year, so make sure to check out our community events page and come say hi!

We previously interviewed the founders of many developer productivity startups embedded in the IDE, like Codium AI, Cursor, and Codeium. We also covered Replit's (former) SOTA model, replit-code-v1-3b, and most recently had Amjad and Michele announce replit-code-v1_5-3b at the AI Engineer Summit.

Much has been speculated about the StackOverflow traffic drop since ChatGPT's release, but the experience is still not perfect. There's now a new player in the "search for developers" arena: Phind.

Phind's goal is to help you find answers to your technical questions, and then help you implement them. For example, "What should I use to create a frontend for a Python script?" returns a list of frameworks as well as links to the sources. You can then ask follow-up questions on specific implementation details, have it write some code for you, etc. They have both a web version and a VS Code integration.

They recently were top of Hacker News with the announcement of their latest model, which is now the #1 rated model on the BigCode Leaderboard, beating their previous version:

TLDR Cheat Sheet:

* Based on CodeLlama-34B, which is trained on 500B tokens

* Further fine-tuned on 70B+ high quality code and reasoning tokens

* Expanded context window to 16k tokens

* 5x faster than GPT-4 (100 tok/s vs 20 tok/s on single stream)

* 74.7% HumanEval vs 45% for the base model

We've talked before about HumanEval being limited in a lot of cases and how it needs to be complemented with "vibe based" evals. Phind thinks of evals along two axes:

* Context quality: when asking the model to generate code, was the context high quality? Did we put outdated examples in it? Did we retrieve the wrong files?

* Result quality: was the code generated correct? Did it follow the instructions I gave it or did it misunderstand some of it?

If you have bad results with bad context, you might get to a good result by working on better RAG. If you have good context but a bad result, you either need to work on your prompting or you have hit the limits of the model, which leads you to fine-tuning (like they did).
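One lightweight way to track both axes is to score each (question, retrieved context, answer) triple with an LLM judge. The sketch below is a hypothetical setup along those lines, with a rubric and 1-5 scale of our own invention rather than Phind's internal harness.

```python
# Hypothetical sketch: grade RAG outputs on the two axes above with an LLM judge.
# Assumes the judge returns bare JSON; the rubric wording is illustrative only.
import json
from openai import OpenAI

client = OpenAI()

RUBRIC = """Score the following on a 1-5 scale. Reply with JSON containing
"context_quality" (was the retrieved context relevant and up to date?) and
"result_quality" (is the generated code correct and does it follow the instructions?).

Question: {question}
Retrieved context: {context}
Generated answer: {answer}"""

def grade(question: str, context: str, answer: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # keep scores close to deterministic across runs
        messages=[{"role": "user", "content": RUBRIC.format(
            question=question, context=context, answer=answer)}],
    )
    return json.loads(resp.choices[0].message.content)

# Low context_quality -> invest in retrieval; good context but low result_quality ->
# work on prompting, or fine-tune the answering model.
```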

Michael was really early to this space and started working on CommonCrawl filtering and indexing back in 2020, which led to a lot of the insights that now power Phind. We talked about that evolution, his experience at YC, how he got Paul Graham to invest in Phind and invite him to dinner at his house, and how Ron Conway connected him with Jensen Huang to get access to more GPUs!

Show Notes

* Phind

* BigScience T0

* InstructGPT Paper

* Inception-V3

* LMQL

* Marginalia Nu

* Mistral AI

* People:

* Paul Graham (pg)

* Ron Conway

* Yacine Jernite from HuggingFace

* Jeff Delaney

Timestamps

* [00:00:00] Intros & Michael's early interest in computer vision

* [00:03:14] Pivoting to NLP and natural language question answering models

* [00:07:20] Building a search engine index of Common Crawl and web pages

* [00:11:26] Releasing the first version of Hello based on the search index and BigScience T0 model

* [00:14:02] Deciding to focus the search engine specifically for programmers

* [00:17:39] Overview of Phind's current product and focus on code reasoning

* [00:21:51] The future vision for Phind to go from idea to complete code

* [00:24:03] Transitioning to using the GPT-4 model and the impact it had

* [00:29:43] Developing the Phind model based on CodeLlama and additional training

* [00:32:28] Plans to continue improving the Phind model with open source technologies

* [00:43:59] The story of meeting Paul Graham and Ron Conway and how that impacted the company

* [00:53:02] How Ron Conway helped them get GPUs from Nvidia

* [00:57:12] Tips on how Michael learns complex AI topics

* [01:01:12] Lightning Round

Transcript

Alessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, Partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. [00:00:19]

Swyx: Hey, and today we have in the studio Michael Royzen from Phind. Welcome. [00:00:23]

Michael: Thank you so much. [00:00:24]

Alessio: It's great to be here. [00:00:25]

Swyx: Yeah, we are recording this in a surprisingly hot October in San Francisco. And sometimes the studio works, but the blue angels are flying by right now, so sorry about the noise. So welcome. I've seen Phind blow up this year, mostly, I think since your launch in Feb and V2 and then your Hacker News posts. We tend to like to introduce our guests, but then obviously you can fill in the blanks with the origin story. You actually were a high school entrepreneur. You started SmartLens, which is a computer vision startup in 2017. [00:00:59]

Michael: That's right. I remember when like TensorFlow came out and people started talking about, obviously at the time after AlexNet, the deep learning revolution was already in flow. Good computer vision models were a thing. And what really made me interested in deep learning was I got invited to go to Apple's WWDC conference as a student scholar because I was really into making iOS apps at the time. So I go there and I go to this talk where they added an API that let people run computer vision models on the device using far more efficient GPU primitives. After seeing that, I was like, oh, this is cool. This is going to have a big explosion of different computer vision models running locally on the iPhone. And so I had this crazy idea where it was like, what if I could just make this model that could recognize just about anything and have it run on the device? And that was the genesis for what eventually became SmartLens. I took this data set called ImageNet 22K. So most people, when they think of ImageNet, think of ImageNet 1K. But the full ImageNet actually has, I think, 22,000 different categories. So I took that, filtered it, pre-processed it, and then did a massive fine tune on Inception V3, which was, I think, the state of the art deep convolutional computer vision model at the time. And to my surprise, it actually worked insanely well. I had no idea what would happen if I give a single model. I think it ended up being 17,000 categories approximately that I collapsed them into. It worked so well that it actually worked better than Google Lens, which released its V1 around the same time. And on top of this, the model ran on the device. So it didn't need an internet connection. A big part of the issue with Google Lens at the time was that connections were slower. 4G was around, but it wasn't nearly as fast. So there was a noticeable lag having to upload an image to a server and get it back. But just processing it locally, even on the iPhones of the day in 2017, much faster. It was a cool little project. It got some traction. TechCrunch wrote about it. There was kind of like one big spike in usage, and then over time it tapered off. But people still pay for it, which is wild. [00:03:14]

Swyx: That's awesome. Oh, it's like a monthly or annual subscription? [00:03:16]

Michael: Yeah, it's like a monthly subscription. [00:03:18]

Swyx: Even though you don't actually have any servers? [00:03:19]

Michael: Even though we don't have any servers. That's right. I was in high school. I had a little bit of money. I was like, yeah. [00:03:25]

Swyx: That's awesome. I always wonder what the modern equivalent is, kind of like Be My Eyes. And it was actually disclosed in the GPT-4 Vision system card recently that the usage was surprisingly not that frequent. Maybe it's the extent to which all three of us have our sense of sight. I would think that if I lost my sense of sight, I would use Be My Eyes all the time. The average usage of Be My Eyes per day is 1.5 times. [00:03:49]

Michael: Exactly. I was thinking about this as well, where I was also looking into image captioning, where you give a model an image and then it tells you what's in the image. But it turns out that what people want is the exact opposite. People want to give a description of an image and then have the AI generate the image. [00:04:04]

Alessio: Oh, the other way. [00:04:06]

Michael: Exactly. And so at the time, I think there were some GANs, NVIDIA was working on this back in 2019, 2020. They had some impressive, I think, face GANs where they had this model that would produce these really high quality portraits, but it wasn't able to take a natural language description the way Midjourney or DALL-E 3 can and just generate you an image with exactly what you described in it. [00:04:32]

Swyx: And how did that get into NLP? [00:04:35]

Michael: Yeah, I released the SmartLens app and that was around the time I was a senior in high school. I was applying to college. College rolls around. I'm still sort of working on updating the app in college. But I start thinking like, hey, what if I make an enterprise version of this as well? At the time, there was Clarifai that provided some computer vision APIs, but I thought this massive classification model works so well and it's so small and so fast, might as well build an enterprise product. And I didn't even talk to users or do any of those things that you're supposed to do. I was just mainly interested in building a type of backend I've never built before. So I was mainly just doing it for myself just to learn. I built this enterprise classification product and as part of it, I'm also building an invoice processing product where, using some of the aspects that I built previously, although obviously it's very different from classification, I wanted to be able to just extract a bunch of structured data from an unstructured invoice through our API. And that's what led me to Hugging Face for the first time because that involves some natural language components. And so I go to Hugging Face and with various encoder models that were around at the time, I used the standard BERT and also Longformer, which came out around the same time. And Longformer was interesting because it had a much bigger context window than those models at the time, like BERT. All of the first gen encoder-only models, they only had a context window of 512 tokens and it's fixed. There's none of this ALiBi or RoPE that we have now where we can basically massage it to be longer. They're fixed, 512 absolute encodings. Longformer at the time was the only way that you can fit, say, a sequence length of, or ask a question about, like 4,000 tokens worth of text. I implemented Longformer, it worked super well, but nobody really used the enterprise product, and that's kind of what I expected because at the end of the day, it was COVID. I was building this kind of mostly for me, mostly just kind of to learn. And so nobody really used it and my heart wasn't in it and I kind of just shelved it. But a little later, I went back to Hugging Face and I saw this demo that they had, and this is in the summer of 2020. They had this demo made by this researcher, Yacine Jernite, and he called it long form question answering. And basically, it was this self-contained notebook demo where you can ask a question the way that we do now with ChatGPT. It would do a lookup into some database and it would give you an answer. And it absolutely blew my mind. The demo itself, it used, I think, BART as the model and in the notebook, it had support for both an Elasticsearch index of Wikipedia, as well as a dense index powered by Facebook's FAISS. I think that's how you pronounce it. It was very iffy, but when it worked, I think the question in the demo was, why are all boats white? When it worked, it blew my mind that instead of doing this few shot thing, like people were doing with GPT-3 at the time, which is all the rage, you could just ask a model a question, provide no extra context, and it would know what to do and just give you the answer. It blew my mind to such an extent that I couldn't stop thinking about that. When I started thinking about ways to make it better, I tried training, doing the fine tune with a larger BART model. And this BART model, yeah, it was fine tuned on this Reddit data set called Eli5. So basically... [00:08:02]

Alessio: Subreddit. [00:08:03]

Swyx: Yeah, subreddit. [00:08:04]

Alessio: Yeah. [00:08:05]

Michael: And put it into like a well-formatted, relatively clean data set of like human questions and human answers. And that was a really great bootstrap for that model to be able to answer these types of questions. And so Eli5 actually turned out to be a good data set for training these types of question answering models, because the question is written by a human, the answer is written by a human, and at least helps the model get the format right, even if the model is still very small and it can't really think super well, at least it gets the format right. And so it ends up acting as kind of a glorified summarization model, where if it's fed in high quality context from the retrieval system, it's able to have a reasonably high quality output. And so once I made the model as big as I can, just fine tuning on BART large, I started looking for ways to improve the index. So in the demo, in the notebook, there were instructions for how to make an Elasticsearch index just for Wikipedia. And I was like, why not do all of Common Crawl? So I downloaded Common Crawl, and thankfully, I had like 10 or $15,000 worth of AWS credits left over from the SmartLens project. And that's what really allowed me to do this, because there's no other funding. I was still in college, not a lot of money, and so I was able to spin up a bunch of instances and just process all of Common Crawl, which is massive. So it's roughly like, it's terabytes of text. I went to Alexa to get the top 1,000 websites or 10,000 websites in the world, then filtered only by those websites, and then indexed those websites, because the web pages were already included in Dump. [00:09:38]

Swyx: You mean to supplement Common Crawl or to filter Common Crawl? [00:09:41]

Michael: Filter Common Crawl. [00:09:42]

Alessio: Oh, okay. [00:09:43]

Michael: Yeah, sorry. So we filtered Common Crawl just by the top, I think, 10,000, just to limit this, because obviously there's this massive long tail of small sites that are really cool, actually. There's other projects like, shout out to Marginalia Nu, which is a search engine specialized on the long tail. I think they actually exclude the top 10,000. [00:10:03]

Swyx: That's what they do. [00:10:04]

Alessio: Yeah. [00:10:05]

Swyx: I've seen them around, I just don't really know what their pitch is. Okay, that makes sense. [00:10:08]

Michael: So they exclude all the top stuff. So the long tail is cool, but for this, that was kind of out of the question, and that was most of the data anyway. So we've removed that. And then I indexed the remaining approximately 350 million webpages through Elasticsearch. So I built this index running on AWS with these webpages, and it actually worked quite well. You can ask it general common knowledge, history, politics, current events, questions, and it would be able to do a fast lookup in the index, feed it into the model, and it would give a surprisingly good result. And so when I saw that, I thought that this is definitely doable. And it kind of shocked me that no one else was doing this. And so this was now the fall of 2020. And yeah, I was kind of shocked no one was doing this, but it costs a lot of money to keep it up. I was still in college. There are things going on. I got bogged down by classes. And so I ended up shelving this for almost a full year, actually. When I returned to it in fall of 2021, when BigScience released T0, when BigScience released the T0 models, that was a massive jump in the reasoning ability of the model. And it was better at reasoning, it was better at summarization, it was still a glorified summarizer basically. [00:11:26]

Swyx: Was this a precursor to Bloom? Because Bloom's the one that I know. [00:11:29]

Alessio: Yeah. [00:11:30]

Michael: Actually coming out in 2022. But Bloom had other problems where for whatever reason, the Bloom models just were never really that good, which is so sad because I really wanted to use them. But I think they didn't turn on that much data. I think they used like the original, they were trying to replicate GPT-3. So they just use those numbers, which we now know are like far below Chinchilla Optimal and even Chinchilla Optimal, which we can like talk about later, like what we're currently doing with MIMO goes, yeah, it goes way beyond that. But they weren't trying enough data. I'm not sure how that data was clean, but it probably wasn't super clean. And then they didn't really do any fine tuning until much later. So T0 worked well because they took the T5 models, which were closer to Chinchilla Optimal because I think they were trained on also like 300 something billion tokens, similar to GPT-3, but the models were much smaller. I think T0 is the first model that did large scale instruction tuning from diverse data sources in the fall of 2021. This is before Instruct GPT. This is before Flan T5, which came out in 2022. This is the very, very first, at least well-known example of that. And so it came out and then I did, on top of T0, I also did the Reddit Eli5 fine tune. And that was the first model and system that actually worked well enough to where I didn't get discouraged like I did previously, because the failure cases of the BART based system was so egregious. Sometimes it would just miss a question so horribly that it was just extremely discouraging. But for the first time, it was working reasonably well. Also using a much bigger model. I think the BART model is like 800 million parameters, but T0, we were using 3B. So it was T0, 3B, bigger model. And that was the very first iteration of Hello. So I ended up doing a show HN on Hacker News in January 2022 of that system. Our fine tune T0 model connected to our Elasticsearch index of those 350 million top 10,000 common crawl websites. And to the best of my knowledge, I think that's the first example that I'm aware of a LLM search engine model that's effectively connected to like a large enough index that I consider like an internet scale. So I think we were the first to release like an internet scale LLM powered rag search system In January 2022, around the time me and my future co-founder, Justin, we were like, this seems like the future. [00:14:02]
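For readers who want a concrete picture of that retrieve-then-generate setup, here is a simplified, hypothetical sketch: a BM25 lookup over an Elasticsearch index of web pages whose top hits are fed to a T0-style seq2seq model. The index name and field names are our own illustration, not the actual Hello pipeline.

```python
# Simplified sketch of retrieve-then-generate: BM25 over an Elasticsearch index of
# web pages, then a seq2seq model answers from the retrieved context.
# Index and field names ("webpages", "text") are illustrative.
from elasticsearch import Elasticsearch
from transformers import pipeline

es = Elasticsearch("http://localhost:9200")
qa = pipeline("text2text-generation", model="bigscience/T0_3B")

def answer(question: str) -> str:
    hits = es.search(index="webpages", query={"match": {"text": question}}, size=3)
    context = " ".join(h["_source"]["text"][:1000] for h in hits["hits"]["hits"])
    prompt = f"Answer the question using the context.\nContext: {context}\nQuestion: {question}"
    return qa(prompt, max_length=256)[0]["generated_text"]

# print(answer("Why are all boats white?"))
```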

Alessio: This is really cool. [00:14:03]

Michael: I couldn't really sleep even like I was going to bed and I was like, I was thinking about it. Like I would say up until like 2.30 AM, like reading papers on my phone in bed, go to sleep, wake up the next morning at like eight and just be super excited to keep working. And I was also doing my thesis at the same time, my senior honors thesis at UT Austin about something very similar. We were researching factuality in abstractive question answering systems. So a lot of overlap with this project and the conclusions of my research actually kind of helped guide the development path of Hello. In the research, we found that LLMs, they don't know what they don't know. So the conclusion was, is that you always have to do a search to ensure that the model actually knows what it's talking about. And my favorite example of this even today is kind of with chat GPT browsing, where you can ask chat GPT browsing, how do I run llama.cpp? And chat GPT browsing will think that llama.cpp is some file on your computer that you can just compile with GCC and you're all good. It won't even bother doing a lookup, even though I'm sure somewhere in their internal prompts they have something like, if you're not sure, do a lookup. [00:15:13]

Alessio: That's not good enough. So models don't know what they don't know. [00:15:15]

Michael: You always have to do a search. And so we approached LLM powered question answering from the search angle. We pivoted to make this for programmers in June of 2022, around the time that we were getting into YC. We realized that what we're really interested in is the case where the models actually have to think. Because up until then, the models were kind of more glorified summarization models. We really thought of them like the Google featured snippets, but on steroids. And so we saw a future where the simpler questions would get commoditized. And I still think that's going to happen with like Google SGE and like it's nowadays, it's really not that hard to answer the more basic kind of like summarization, like current events questions with lightweight models that'll only continue to get cheaper over time. And so we kind of started thinking about this trade off where LLM models are going to get both better and cheaper over time. And that's going to force people who run them to make a choice. Either you can run a model of the same intelligence that you could previously for cheaper, or you can run a better model for the same price. So someone like Google, once the price kind of falls low enough, they're going to deploy and they're already doing this with SGE, they're going to deploy a relatively basic glorified summarizer model that can answer very basic questions about like current events, who won the Super Bowl, like, you know, what's going on on Capitol Hill, like those types of things. The flip side of that is like more complex questions where like you have to reason and you have to solve problems and like debug code. And we realized like we're much more interested in kind of going along the bleeding edge of that frontier case. And so we've optimized everything that we do for that. And that's a big reason of why we've built Phind specifically for programmers, as opposed to saying like, you know, we're kind of a search engine for everyone because as these models get more capable, we're very interested in seeing kind of what the emergent properties are in terms of reasoning, in terms of being able to solve complex multi-step problems. And I think that some of those emerging capabilities like we're starting to see, but we don't even fully understand. So I think there's always an opportunity for us to become more general if we wanted, but we've been along this path of like, what is the best, most advanced reasoning engine that's connected to your code base, that's connected to the internet that we can just provide. [00:17:39]

Alessio: What is Phind today, pragmatically, from a product perspective, how do people interact with it? Yeah. Or does it plug into your workflow? [00:17:46]

Michael: Yeah. [00:17:47]

Alessio: So Phind is really a system. [00:17:48]

Michael: Phind is a system for programmers when they have a question or when they're frustrated or when something's not working. [00:17:54]

Swyx: When they're frustrated. [00:17:55]

Alessio: Yeah. [00:17:56]

Michael: For them to get on block. I think like the single, the most abstract page for Phind is like, if you're experiencing really any kind of issue as a programmer, we'll solve that issue for you in 15 seconds as opposed to 15 minutes or longer. Phind has an interface on the web. It has an interface in VS code and more IDEs to come, but ultimately it's just a system where a developer can paste in a question or paste in code that's not working and Phind will do a search on the internet or they will find other code in your code base perhaps that's relevant. And then we'll find the context that it needs to answer your question and then feed it to a reasoning engine powerful enough to actually answer it. So that's really the philosophy behind Phind. It's a system for getting developers the answers that they're looking for. And so right now from a product perspective, this means that we're really all about getting the right context. So the VS code extension that we launched recently is a big part of this because you can just ask a question and it knows where to find the right code context in your code. It can do an internet search as well. So it's up to date and it's not just reliant on what the model knows and it's able to figure out what it needs by itself and answer your question based on that. If it needs some help, you can also get yourself kind of just, there's opportunities for you yourself to put in all that context in. But the issue is also like not everyone wants these VS code. Some people like are real Neovim sticklers or they're using like PyCharm or other IDEs, JetBrains. And so for those people, they're actually like okay with switching tabs, at least for now, if it means them getting their answer. Because really like there's been an explosion of all these like startups doing code, doing search, etc. But really who everyone's competing with is ChatGPT, which only has like that one web interface. Like ChatGPT is really the bar. And so that's what we're up against. [00:19:50]

Alessio: And so your idea, you know, we have Aman from Cursor on the podcast and they've gone through the we-need-to-own-the-IDE thing. Yours is more like, in order to get the right answer, people are happy to go somewhere else, basically. They're happy to get out of their IDE. [00:20:05]

Michael: That was a great podcast, by the way. But yeah, so part of it is that people sometimes perhaps aren't even in an IDE. So like the whole task of software engineering goes way beyond just running code, right? There's also like a design stage. There's a planning stage. A lot of this happens like on whiteboards. It happens in notebooks. And so the web part also exists for that where you're not even coding it and you're just trying to get like a more conceptual understanding of what you're trying to build first. The podcast with Amman was great, but somewhere where I disagree with him is that you need to own the IDE. I think like he made some good points about not having platform risk in the long term. But some of the features that were mentioned like suggesting diffs, for example, those are all doable with an extension. We haven't yet seen with VS Code in particular any functionality that we'd like to do yet in the IDE that we can't either do through directly supported VS Code functionality or something that we kind of hack into there, which we've also done a fair bit of. And so I think it remains to be seen where that goes. But I think what we're looking to be is like we're not trying to just be in an IDE or be an IDE. Like Phind is a system that goes beyond the IDE and like is really meant to cover the entire lifecycle of a developer's thought process in going about like, hey, like I have this idea and I want to get from that idea to a working product. And so then that's what the long term vision of Phind is really about is starting with that. In the future, I think programming is just going to be really just the problem solving. Like you come up with an idea, you come up with like the basic design for the algorithm in your head, and you just tell the AI, hey, just like just do it, just make it work. And that's what we're building towards. [00:21:51]

Swyx: I think we might want to give people an impression about like type of traffic that you have, because when you present it with a text box, you could type in anything. And I don't know if you have some mental categorization of like what are like the top three use cases that people tend to coalesce around. [00:22:08]

Alessio: Yeah, that's a great question. [00:22:09]

Michael: The two main types of searches that we see are how-to questions, like how to do X using Y tool. And this historically has been our bread and butter, because with our embeddings, like we're really, really good at just going over a bunch of developer documentation and figuring out exactly the part that's relevant and just telling you, OK, like you can use this method. But as LLMs have gotten better, and as we've really transitioned to using GPT-4 a lot in our product, people organically just started pasting in code that's not working and just said, fix it for me. [00:22:42]

Swyx: Fix this. [00:22:43]

Alessio: Yeah. [00:22:44]

Michael: And what really shocks us is that a lot of the people who do that, they're coming from chat GPT. So they tried it in chat GPT with chat GPT-4. It didn't work. Maybe it required like some multi-step reasoning. Maybe it required some internet context or something found in either a Stack Overflow post or some documentation to solve it. And so then they paste it into find and then find works. So those are really those two different cases. Like, how can I build this conceptually or like remind me of this one detail that I need to build this thing? Or just like, here's this code. Fix it. And so that's what a big part of our VS Code extension is, is like enabling a much smoother here just like fix it for me type of workflow. That's really its main benefits. Like it's in your code base. It's in the IDE. It knows how to find the relevant context to answer that question. But at the end of the day, like I said previously, that's still a relatively, not to say it's a small part, but it's a limited part of the entire mental life cycle of a programmer. [00:23:47]

Swyx: Yep. So you launched in Feb and then you launched V2 in August. You had a couple other pretty impactful posts slash feature launches. The web search one was massive. So you were mostly a GPT-4 wrapper. We were for a long time. [00:24:03]

Michael: For a long time until recently. Yeah. [00:24:05]

Alessio: Until recently. [00:24:06]

Swyx: So like people coming over from ChatGPT were essentially getting the same model with your version of web search. Would that be the primary value proposition? [00:24:13]

Michael: Basically yeah. And so what we've seen is that any model plus web search is just significantly better than [00:24:18]

Alessio: that model itself. Do you think that's what you got right in April? [00:24:21]

Swyx: Like so you got 1500 points on Hacker News in April, which is like, if you live on Hacker News a lot, that is unheard of for someone so early on in your journey. [00:24:31]

Alessio: Yeah. [00:24:32]

Michael: We're super, super grateful for that. Definitely was not expecting it. So what we've done with Hacker News is we've just kept launching. [00:24:38]

Alessio: Yeah. [00:24:39]

Michael: Like what they don't tell you is that you can just keep launching. That's what we've been doing. So we launched the very first version of Phind in its current incarnation after like the previous demo connected to our own index. Like once we got into YC, we scrapped our own index because it was too cumbersome at the time. So we moved over to using Bing as kind of just the raw source data. We launched as Hello Cognition. Over time, every time we like added some intelligence to the product, a better model, we just keep launching. And every additional time we launched, we got way more traffic. So we actually silently rebranded to Phind in late December of last year. But like we didn't have that much traffic. Nobody really knew who we were. [00:25:18]

Swyx: How'd you pick the name out of it? [00:25:19]

Michael: Paul Graham actually picked it for us. [00:25:21]

Swyx: All right. [00:25:22]

Alessio: Tell the story. Yeah. So, oh boy. [00:25:25]

Michael: So this is the biggest aside. Should we go for like the full Paul Graham story or just the name? [00:25:29]

Swyx: Do you want to do it now? Or do you want to do it later? I'll give you a choice. [00:25:32]

Alessio: Hmm. [00:25:33]

Michael: I think, okay, let's just start with the name for now and then we can do the full Paul Graham story later. But basically, Paul Graham, when we were lucky enough to meet him, he saw our name, and our domain at the time was sayhello.so, and he's just like, guys, come on, what is this? You know? And we were like, yeah, but when we bought it, you know, we were just kind of broke college students. We didn't have that much money. And we really liked hello as a name because it was the first conversational search engine. That's the angle that we were approaching it from. And so we had sayhello.so and he's like, there's so many problems with that. Like, the say hello, what does that even mean? And .so, it's gotta be a .com. And so we spent some time, just like that, with Paul Graham in the room. We just looked at different domain names, different things that popped into our head. And one of the things that popped up, that Paul Graham suggested, was find, with the Phind spelling in particular. [00:26:33]

Swyx: Yeah. Which is not typical naming advice, right? Yes. Because when people hear it, they don't spell it that way. [00:26:38]

Michael: Exactly. It's hard to spell. And also it's very 90s. And so at first I was like, ah, I don't know. But over time it kept growing on us. And eventually we're like, okay, we like the name. It was owned by this elderly Canadian gentleman who we got to know, and he was willing to sell it to us. [00:26:57]

Michael: And so we bought it and we changed the name. Yeah. [00:27:01]

Swyx: Anyways, where were you? [00:27:02]

Alessio: I had to ask. [00:27:03]

Swyx: I mean, you know, everyone who looks at you is wondering. [00:27:06]

Michael: And a lot of people actually pronounce it the way it's spelled, which, you know, by now is part of the game. But eventually we want to buy find.com and then just have that redirect to phind.com. So Phind is definitely the right spelling. But we'll just, yeah, we'll have all the cases addressed. [00:27:23]

Swyx: Cool. So Bing web search, and then August you launched V2. Is V2 the Phind as a system pitch? Or have you moved, evolved since then? [00:27:31]

Michael: Yeah, so I don't, like the V2 moniker, like, I don't really think of it that way in my mind. There's like, there's the version we launched during, last summer during YC, which was the Bing version directed towards programmers. And that's kind of like, that's why I call it like the first incarnation of what we currently are. Because it was already directed towards programmers. We had like a code snippet search built in as well, because at the time, you know, the models we were using weren't good enough to generate code snippets. Even GPT, like the text DaVinci 2 was available at the time, wasn't that good at generating code and it would generate like very, very short, very incomplete code snippets. And so we launched that last summer, got some traction, but really like we were only doing like, I don't know, maybe like 10,000 searches a day. [00:28:15]

Alessio: Some people knew about it. [00:28:16]

Michael: Some people used it, which is impressive because looking back, the product like was not that good. And every time we've like made an improvement to the way that we retrieve context through better embeddings, more intelligent, like HTML parsers, and importantly, like better underlying models. Every major version after that was when we introduced a better underlying answering model. Like in February, we had to swallow a bit of our pride when we were like, okay, our own models aren't good enough. We have to go to open AI. And actually that did lead to kind of like our first decent bump of traffic in February. And people kept using it, like our attention was way better too. But we were still kind of running into problems of like more advanced reasoning. Some people tried it, but people were leaving because even like GPT 3.5, both turbo and non-turbo, like still not that great at doing like code related reasoning beyond the how do you do X, like documentation search type of use case. And so it was really only when GPT 4 came around in April that we were like, okay, like this is like our first real opportunity to really make this thing like the way that it should have been all along. And having GPT 4 as the brain is what led to that Hacker News post. And so what we did was we just let anyone use GPT 4 on Fyne for free without a login, [00:29:43]

Alessio: which I actually don't regret. [00:29:45]

Michael: So it was very expensive, obviously. But like at that stage, all we needed to do was show like, we just needed to like show people here's what Fyne can do. That was the main thing. And so that worked. That worked. [00:29:58]

Alessio: Like we got a lot of users. [00:29:59]

Michael: Do you know Fireship? [00:30:01]

Swyx: Yeah. YouTube, Jeff Delaney. [00:30:03]

Michael: Yeah. He made a short about Fyne. [00:30:06]

Alessio: Oh. [00:30:07]

Michael: And that's on top of the Hacker News post. And that's what like really, really made it blow up. It got millions of views in days. And he's just funny. Like what I love about Fireship is like he like you guys, yeah, like humor goes a long a long way towards like really grabbing people's attention. And so that blew up. [00:30:25]

Swyx: Something I would be anxious about as a founder during that period, so obviously we all remember that pretty closely. So there were a couple of people who had access to the GPT-4 API doing this, which is unrestricted access to GPT-4. And I have to imagine OpenAI wasn't that happy about that because it was like kind of de facto access to GPT-4 before they released it. [00:30:46]

Alessio: No, no. [00:30:47]

Michael: GPT-4 was in chat GPT from day one. I think. OpenAI actually came to our support because what happened was we had people building unofficial APIs around to try to get free access to it. And I think OpenAI actually has the right perspective on this where they're like, OK, people can do whatever they want with the API if they're paying for it, like they can do whatever they want, but it's like not OK if, you know, paying customers are being exploite by these other actors. They actually got in touch with us and they helped us like set up better Cloudflare bot monitoring controls to effectively like crack down on those unofficial APIs, which we're very happy about. But yeah, so we launched GPT-4. A lot of people come to the product and yeah, for a long time, we're just we're figuring out like what do we make of this, right? How do we a make it better, but also deal with like our costs, which have just like massively, massively ballooned. Over time, it's become more clear with the release of Llama 2 and Llama 3 on the horizon that we will once again see a return to vertical applications running their own models. As was true last year and before, I think that GPT-4, my hypothesis is that the jump from 4 to 4.5 or 4 to 5 will be smaller than the jump from 3 to 4. And the reason why is because there were a lot of different things. Like there was two plus, effectively two, two and a half years of research that went into going from 3 to 4. Like more data, bigger model, all of the instruction tuning techniques, RLHF, all of that is known. And like Meta, for example, and now there's all these other startups like Mistral too, like there's a bunch of very well-funded open source players that are now working on just like taking the recipe that's now known and scaling it up. So I think that even if a delta exists, the delta between in 2024, the delta between proprietary and open source won't be large enough that a startup like us with a lot of data that we've collected can take the data that we have, fine tune an open source model, and like be able to have it be better than whatever the proprietary model is at the time. That's my hypothesis.

Michael: But we'll once again see a return to these verticalized models. And that's something that we're super excited about because, yeah, that brings us to kind of the fine model because the plan from kind of the start was to be able to return to that if that makes sense. And I think now we're definitely at a point where it does make sense because we have requests from users who like, they want longer context in the model, basically, like they want to be able to ask questions about their entire code base without, you know, context and retrieval and taking a chance of that. Like, I think it's generally been shown that if you have the space to just put the raw files inside of a big context window, that is still better than chunking and retrieval. So there's various things that we could do with longer context, faster speed, lower cost. Super excited about that. And that's the direction that we're going with the fine model. And our big hypothesis there is precisely that we can take a really good open source model and then just train it on absolutely all of the high quality data that we can find. And there's a lot of various, you know, interesting ideas for this. We have our own techniques that we're kind of playing with internally. One of the very interesting ideas that I've seen, I think it's called Octopack from BigCode. I don't think that it made that big waves when it came out, I think in August. But the idea is that they have this data set that maps GitHub commits to a change. So basically there's all this really high quality, like human made, human written diff data out there on every time someone makes a commit in some repo. And you can use that to train models. Take the file state before and like given a commit message, what should that code look like in the future? [00:34:52]
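To make the commit-to-change idea concrete, here is a hedged sketch of how one might turn a repository's history into training pairs of that shape. The field names and prompt format are our own illustration, not the OctoPack dataset's actual schema or Phind's training recipe.

```python
# Hypothetical sketch: turn (file before, commit message, file after) triples into
# supervised training examples. Schema and prompt format are illustrative only.
def make_example(file_before: str, commit_message: str, file_after: str) -> dict:
    prompt = (
        "Apply the described change to the file.\n"
        f"Commit message: {commit_message}\n"
        f"--- file before ---\n{file_before}\n"
        "--- file after ---\n"
    )
    return {"prompt": prompt, "completion": file_after}

example = make_example(
    file_before="def add(a, b):\n    return a - b\n",
    commit_message="fix: add() should sum its arguments",
    file_after="def add(a, b):\n    return a + b\n",
)
# example["prompt"] ends where the model should begin generating the updated file.
```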

Swyx: Got it. [00:34:53]

Alessio: Do you think your HumanEval is any good?

Michael: So we ran this experiment. We trained the Phind model. And if you go to the BigCode leaderboard, as of today, October 5th, all of our models are at the top of the BigCode leaderboard by far. It's not close, particularly in languages other than Python. We have a 10 point gap between us and the next best model on JavaScript. I think C#, multilingual. And what we kind of learned from that whole experience releasing those models is that HumanEval doesn't really matter. Not just that, but GPT-4 itself has been trained on HumanEval. And we know this because GPT-4 is able to predict the exact docstring in many of the problems. I've seen it predict the specific example values in the docstring, which is extremely improbable. So I think there's a lot of dataset contamination, and it only captures a very limited subset of what programmers are actually doing. What we do internally for evaluations is we have GPT-4 score answers. GPT-4 is a really good evaluator. I mean, obviously, by really good I mean it's the best that we have. I'm sure that, you know, a couple of months from now, next year, we'll be like, oh, you know, GPT-4.5, GPT-5, it's so much better, GPT-4 is terrible, but right now it's the best that we have short of humans. And what we found is that when doing temperature zero evals, GPT-4 is actually mostly deterministic across runs in assigning scores to two different answers. So we found it to be a very useful tool in comparing our model to, say, GPT-4 on our internal, real-world "here's what people will be asking this model" dataset. And the other thing that we're doing is just releasing the model to our users and seeing what they think. Because that's the only thing that really matters: releasing it for the application that it's intended for, and then seeing how people react. And for the most part, the incredible thing is that people don't notice a difference between our model and GPT-4 for the vast majority of searches. There are some reasoning problems that GPT-4 can still do better. We're working on addressing that. But in terms of the types of questions that people are asking on Phind, there's not that much difference. And in fact, I've been running my own kind of side by side comparisons, shout out to GodMode, by the way. [00:37:16]
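
For reference, here is a rough sketch of the kind of temperature-zero, GPT-4-as-judge comparison described above. The prompt wording and rubric are assumptions, not Phind's internal eval harness; it assumes the official OpenAI Python client is installed and an API key is configured in the environment.

```python
# Rough sketch of a temperature-zero "LLM as judge" comparison, as described
# above. The prompt wording and scoring rubric are assumptions, not Phind's
# internal eval harness. Assumes the official OpenAI Python client (`openai`).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(question: str, answer_a: str, answer_b: str) -> str:
    prompt = (
        f"Question:\n{question}\n\n"
        f"Answer A:\n{answer_a}\n\n"
        f"Answer B:\n{answer_b}\n\n"
        "Which answer is more correct and more useful? Reply with 'A' or 'B' "
        "and one sentence of justification."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # temperature-zero scoring is close to deterministic across runs
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```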

Michael: And I've kind of confirmed this to be the case myself. And even sometimes it gives a better answer, perhaps a more concise or just better implementation than GPT-4, which is what surprises me. And by now we kind of have this reasoning-is-all-you-need hypothesis, where we've seen emerging capabilities in the Phind model, whereby training it on high quality code, it can actually reason better. It went from not being able to solve word problems, like riddles with temporal placement of objects and moving and stuff like that, that GPT-4 can do pretty well. We went from not being able to do those at all to being able to do them just by training on more code, which is wild. So we're already starting to see these emerging capabilities. [00:37:59]

Swyx: So I just wanted to make sure that we have the, I guess, like the model card in our heads. So you started from Code Llama? [00:38:07]

Alessio: Yes. [00:38:08]

Swyx: 65, 34? 34. [00:38:10]

Michael: So unfortunately, there's no Code Llama 70b. If there was, that would be super cool. But there's not. [00:38:15]

Swyx: 34. And then, which itself was Llama 2, which was trained on 2 trillion tokens, and they added 500 billion code tokens. Yes. [00:38:22]

Michael: And you just added a bunch more. [00:38:23]

Alessio: Yeah. [00:38:24]

Michael: And they also did a couple of things. So they did, I think, 500 billion of like general pre-training, and then they did an extra 20 billion of long context pre-training. So they actually increased the max position tokens to 16k, up from 8k. And then they changed the theta parameter for the RoPE embeddings as well to give it theoretically better long context support, up to 100k tokens. But yeah, otherwise it's basically Llama 2. [00:38:50]
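
For readers curious about the RoPE theta change mentioned here, this is a minimal sketch of how the base parameter sets the rotation frequencies; a larger base stretches the positional signal over longer contexts. The values 10,000 and 1,000,000 are the commonly reported Llama 2 and Code Llama bases, used purely for illustration.

```python
# Minimal sketch of how the RoPE "theta" (base) parameter affects rotation
# frequencies. Larger bases slow the rotation of the higher dimension pairs,
# which is the mechanism behind the long-context tuning mentioned above.
# The values (10_000 for Llama 2, 1_000_000 for Code Llama) are the commonly
# reported ones, used here for illustration.
import numpy as np

def rope_inverse_frequencies(head_dim: int, base: float) -> np.ndarray:
    # One frequency per pair of dimensions, as in standard RoPE.
    return 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))

freqs_llama2 = rope_inverse_frequencies(128, base=10_000.0)
freqs_codellama = rope_inverse_frequencies(128, base=1_000_000.0)

# The slowest-rotating pair completes a cycle over roughly 2*pi*base positions,
# so a larger base spreads the positional signal over longer contexts.
print(freqs_llama2[-1], freqs_codellama[-1])
```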

Swyx: And so you just took that and just added data. [00:38:52]

Michael: Exactly. [00:38:53]

Swyx: You didn't do any other fundamental. [00:38:54]

Michael: Yeah. So we didn't actually, we haven't yet done anything with the model architecture, and we just trained it on many, many more billions of tokens on our own infrastructure. And something else that we're taking a look at now is using reinforcement learning for correctness. One of the interesting pitfalls that we've noticed with the Phind model is that in cases where it gets stuff wrong, it sometimes is capable of getting the right answer. It's just, there's a big variance problem. It's wildly inconsistent. There are cases when it is able to get the right chain of thought and able to arrive at the right answer, but not always. [00:39:27]

Michael: And so one of our hypotheses, something that we're going to try, is that we can actually do reinforcement learning where, for a given problem, we generate a bunch of completions and then use the correct answer as, like, a loss, basically, to try to get it to be more correct. And I think there's a high chance of this working because it's very similar to the RLHF method, where you basically show pairs of completions for a given question, except there the criteria is which one is less harmful. Here we have a different criteria. But if the model is already capable of getting the right answer, which it is, we just need to cajole it into being more consistent. [00:40:06]
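
A toy sketch of the "reinforcement learning for correctness" idea as described: sample several completions per problem, reward the ones that check out against a known-correct answer, and use those as training signal. The `generate` and `is_correct` functions are hypothetical placeholders, not Phind's implementation.

```python
# Toy sketch of rejection-sampling-style "RL for correctness" as described
# above: sample several completions per problem, reward the ones that match a
# known-correct answer, and keep them as training signal. `generate` and
# `is_correct` are hypothetical placeholders, not Phind's code.
from typing import Callable, List, Tuple

def collect_correct_samples(
    prompt: str,
    reference_answer: str,
    generate: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
    n_samples: int = 8,
) -> List[Tuple[str, float]]:
    """Return (completion, reward) pairs; reward is 1.0 when the answer checks out."""
    samples = [generate(prompt) for _ in range(n_samples)]
    return [(s, 1.0 if is_correct(s, reference_answer) else 0.0) for s in samples]

# The rewarded completions could then feed a preference/RL step (e.g. pairing
# correct vs. incorrect completions), or simply be reused for fine-tuning.
```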

Alessio: There were a couple of things that I noticed in the product that were not strange, but unique. So first of all, the model can talk multiple times in a row, while most other applications are like human, model, human, model. And then, outside of the thumbs up, thumbs down, you have things like "have the LLM prioritize this message and its answers" or "continue from this message" to, like, go back. How does that change the flow of the user, and in terms of prompting it, what are some tricks or learnings you've had? [00:40:37]

Michael: So yeah, that's specifically in our pair programmer mode, which is a more conversational mode that also asks you clarifying questions back if it doesn't fully understand what you're doing, and it kind of holds your hand a bit more. And so from user feedback, we had requests to make it more of an AutoGPT, where you can give it a problem that might take multiple searches or multiple different steps, like multiple reasoning steps, to solve. And so that's the impetus behind building that product: being able to do multiple steps and also being able to handle really long conversations. People are really trying to use the pair programmer to go, sometimes, from a basic idea all the way to complete working code. And what we noticed is that we were having these very, very long threads, sometimes with like 60 messages, like 100 messages. And those become really, really challenging: managing the appropriate context window of what should go inside of the context, and how to preserve the context so that the model, or the product, can continue giving good responses even if you're 60 messages deep in a conversation. So that's where the "prioritize this message" feature comes from. People have asked us to just let them pin messages that they want to be left in the conversation. And that seems to have really gone a long way towards solving that problem, yeah. [00:41:54]
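
As an illustration of the context-management problem described here, this is a simple sketch (not Phind's implementation) where pinned messages are always kept and the remaining token budget is filled with the most recent messages; the token counter is a crude stand-in for a real tokenizer.

```python
# Illustrative sketch (not Phind's implementation) of managing a long
# conversation: pinned messages are always kept, then the remaining token
# budget is filled with the most recent messages. The word count is a crude
# stand-in for a real tokenizer.
from dataclasses import dataclass
from typing import List

@dataclass
class Message:
    role: str
    text: str
    pinned: bool = False

def count_tokens(text: str) -> int:
    return len(text.split())  # placeholder for a real tokenizer

def build_context(history: List[Message], budget: int) -> List[Message]:
    pinned = [m for m in history if m.pinned]
    used = sum(count_tokens(m.text) for m in pinned)
    recent: List[Message] = []
    for m in reversed([m for m in history if not m.pinned]):
        cost = count_tokens(m.text)
        if used + cost > budget:
            break
        recent.append(m)
        used += cost
    # Pinned messages first, then the recent tail in chronological order.
    return pinned + list(reversed(recent))
```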

Alessio: And then you have a run-on-Replit thing. Are you planning to build your own REPL? Like, learning from people trying to run the wrong code, unsafe code? [00:42:03]

Michael: Yes. Yes. So I think in the long term vision of being a place where people can go from idea to fully working code, having a code sandbox, like a natively integrated code sandbox, makes a lot of sense. And Replit is great, and people use that feature. But yeah, I think there's more we can do in terms of having something a bit closer to Code Interpreter, where it's able to run the code and then recursively iterate on it. Exactly. [00:42:31]

Swyx: So you're working on APIs to enable you to do that? Yep. So Amjad has specifically told me in person that he wants to enable that for people at the same time. He's also working on his own models, and Ghostwriter and you know, all the other stuff. So it's going to get interesting. Like he wants to power you, but also compete with you. Yeah. [00:42:47]

Michael: And we love Replit. I think that a lot of the companies in our space are all going to converge on solving a very similar problem, but from different angles. Replit approaches this problem from the IDE side. They started as this IDE that you can run in the browser, and they started from that side, making coding more accessible. And we're approaching it from the side of an LLM that's just connected to everything that it needs to be connected to, which includes your code context. So that's why we're making inroads into IDEs, but we're approaching this problem from different sides. And I think it'll be interesting to see where things end up. But I think that in the long, long term, we have an opportunity to also just have this general technical reasoning engine product that's potentially not just for programmers. It also powers this web interface, where there's potential for, I think, other things that we will build that eventually might go beyond our current scope. [00:43:49]

Swyx: Exciting. We'll look forward to that. We're going to zoom out a little bit into sort of AI ecosystem stories, but first we got to get the Paul Graham, Ron Conway story. [00:43:59]

Alessio: Yeah. [00:44:00]

Michael: So flashback to last summer, we're in the YC batch. We're doing the summer batch, summer 22. So the summer batch runs from June to September, approximately. And so this was late July, early August, right around the time that many YC startups start going out, drawing up here's-how-we're-going-to-pitch-investors and everything. And at the same time, me and my co-founder, Justin, we were planning on moving to New York. So for a long time, actually, we were thinking about building this company in New York, mainly for personal reasons, actually, because during the pandemic, pre-ChatGPT, pre last year, pre the AI boom, SF unfortunately really kind of, you know, lost its luster. Yeah. Like no one was here. It was far from clear if there would be an AI boom, if SF would be like... [00:44:49]

Alessio: Back. [00:44:50]

Michael: Yeah, exactly. Back. As everyone is saying these days, it was far from clear. And so, and all of our friends, we were graduating college because like we happened to just graduate college and immediately start YC, like we didn't even have, I think we had a week in between. [00:45:06]

Swyx: You didn't bother looking for jobs. You were just like, this is what we want to do. [00:45:08]

Michael: Well, actually both me and my co-founder had jobs that we had secured in 2021 from previous internships. But funny enough, when I spoke to my boss's boss at the company where I reneged my offer and told him we got into YC, they actually said, yeah, you should do YC. [00:45:27]

Swyx: Wow. [00:45:28]

Alessio: That's very selfless. [00:45:29]

Swyx: That was really great that they did that. But in San Francisco, they would have offered to invest as well. [00:45:33]

Michael: Yes, they would have. But yeah, but we were both planning to be in New York and all of our friends were there from college at this point, like we have this whole plan where like on August 1st, we're going to move to New York and we had like this Airbnb for the month of New York. We're going to stay there and we're going to work and like all of that. The day before we go to New York, I called Justin and I just, I tell him like, why are we doing this? Because in our batch, by the time August 1st rolled around, all of our mentors at YC were saying like, hey, like you should really consider staying in SF. [00:46:03]

Swyx: It's the hybrid batch, right? [00:46:04]

Michael: Yeah, it was the hybrid batch, but there were already signs that something was kind of afoot in SF, even if we didn't fully want to admit it yet. And so we were like, I don't know, I don't know. Something kind of clicked when the rubber met the road and it was time to go to New York. We're like, why are we doing this? And we didn't have any good reasons for staying in New York at that point beyond, like, our friends are there. So we still go to New York, because we have the Airbnb, like we don't have any other kind of place to go for the next few weeks. We're in New York, and New York is just, unfortunately, too much fun. All of my other friends from college are just, you know, basically starting their jobs, starting their lives as adults. They just stepped into these jobs, they're making all this money, and they're partying, and all these things are happening. And yeah, it's just a very distracting place to be. And so we were just sitting in this small, you know, cramped apartment, terrible posture, trying to get as much work done as we can, too many distractions. And then we get this email from YC saying that Paul Graham is in town in SF and he is doing office hours with a certain number of startups in the current batch, and whoever signs up first gets it. And I happened to be super lucky. I was about to go for a run, but I saw the email notification come across the screen. I immediately clicked on the link, and immediately, like, half the spots were gone, but somehow the very last spot was still available. And so I picked the very, very last time slot, at 7 p.m., semi-strategically, you know, so we would have, like, time to go over, and also because I didn't really know how we were going to get to SF yet. And so we made a plan that we were going to fly from New York to SF and back to New York in one day and do the full round trip. And we were going to meet with PG at the YC Mountain View office. And so we go there, we do that, we meet PG, we tell him about the startup. And one thing I love about PG is that he gets so excited. When he gets excited about something, you can see his eyes really light up. And he'll just start asking you questions. In fact, it's a little challenging sometimes to finish the rest of the description of your pitch because he'll just keep asking all these questions about how it works. And I'm like, you know, what's going on? [00:48:19]

Swyx: What was the most challenging question that he asked you? [00:48:21]

Michael: I think really just how it worked. Because as soon as we told him, hey, we think that the future of search is answers, not links, we could really see the gears turning in his head. I think we were the first demo of that. [00:48:35]

Swyx: And you're like 10 minutes with him, right? [00:48:37]

Michael: We had like 45, yeah, we had a decent chunk of time. And so we tell him how it works. Like he's very excited about it. And I just like, I just blurted out, I just like asked him to invest and he hasn't even seen the product yet. We just asked him to invest and he says, yeah. And like, we're super excited about that. [00:48:55]

Swyx: You haven't started your batch. [00:48:56]

Michael: No, no, no. This is about halfway through the batch or two, two, no, two thirds of the batch. [00:49:02]

Swyx: And you're like not technically fundraising yet. We're about to start fundraising. Yeah. [00:49:06]

Michael: So we have like this demo and like we showed him and like there was still a lot of issues with the product, but I think like it must have like still kind of like blown his mind in some way. So like we're having fun. He's having fun. We have this dinner planned with this other friend that we had in SF because we were only there for that one day. So we thought, okay, you know, after an hour we'll be done, you know, we'll grab dinner with our friend and we'll fly back to New York. But PG was like, like, I'm having so much fun. Do you want to have dinner? Yeah. Come to my house. Or he's like, I gotta go have dinner with my wife, Jessica, who's also awesome, by the way. [00:49:40]

Swyx: She's like the heart of YC. Yeah. [00:49:42]

Michael: Jessica does not get enough credit as an aside for her role. [00:49:46]

Swyx: He tries. [00:49:47]

Michael: He understands the technical side and she understands people, and together they're just a phenomenal team. But he's like, yeah, I've got to go see Jessica, but you guys are welcome to come with. Do you want to come with? And we're like, we have this friend who's right now literally outside the door who we also promised to get dinner with. We're like, we'd love to, but I don't know if we can. He's like, oh, he's welcome to come too. So all of us just hop in his car and we go to his house and we have this dinner and this chat about the future of search. I remember him telling Jessica distinctly, like, our kids are not going to know what a search result is. They're just going to have answers. That was really a mind blowing, inflection point moment for sure. [00:50:34]

Swyx: Wow, that email changed your life. [00:50:35]

Michael: Absolutely. [00:50:36]

Swyx: And you also just spoiled the booking system for PG because now everyone's just going to go after the last slot. Oh man. [00:50:42]

Michael: Yeah. But like, I don't know if he even does that anymore. [00:50:46]

Swyx: He does. He does. Yeah. I've met other founders that he did it this year. [00:50:49]

Michael: This year. Gotcha. But when we told him about how we did it, he was like, I am like frankly shocked that YC just did like a random like scheduling system. [00:50:55]

Alessio: They didn't like do anything else. But, um. [00:50:58]

Swyx: Okay. And then he introduces you to Ron Conway. Yes. Who is one of the most legendary angels in Silicon Valley. [00:51:04]

Michael: Yes. So after PG invested, the rest of our round came together pretty quickly. [00:51:10]

Swyx: I'm, by the way, I'm surprised. Like it's, it might feel like playing favorites right within the current batch to be like, yo, PG invested in this one. Right. [00:51:17]

Alessio: Too bad for the others. [00:51:18]

Swyx: Too bad for the others, I guess. [00:51:19]

Michael: I think this is a bigger point about YC, and these accelerators in general: YC gets a lot of criticism from founders who feel like they didn't get value out of it. But in my view, YC is what you make of it. And YC tells you this. They're like, you really got to grab this opportunity, like, by the balls and make the most of it. And if you do, then it could be the best thing in the world. And if you don't, if you're just kind of passive, even an average founder in YC, you're still going to fail. And they tell you that. They're like, if you're average in your batch, you're going to fail. You have to just be exceptional in every way. With that in mind, perhaps that's even part of the reason why we asked PG to invest. And so yeah, after PG invested, the rest of our round came together pretty quickly, which I'm very fortunate for. And yeah, he introduced us to Ron. And after he did, I get a call from Ron. And Ron says, hey, PG tells me what you're working on, I'd love to come meet you guys. And I'm like, wait, no way. And then we're just holed up in this little house in San Mateo, which is a little small, but you know, it had a nice patio. In fact, we had a monitor set up outside on the deck out there. And so Ron Conway comes over, we go over to the patio where our workstation is. And Ron Conway is known for having this notebook that he goes around with, where he sits down with the notebook and takes very, very detailed notes so he never forgets anything. So he sits down with his notebook and he asks us, hey guys, what do you need? And we're like, oh, we need GPUs. Back then, the GPU shortage wasn't even nearly as bad as it is now, but even then it was still challenging to get the quota that we needed. And he's like, okay, no problem. And then he leaves. A couple hours later, we get an email, and we're CC'd on an email that Ron wrote to Jensen, the CEO of Nvidia, saying, hey, these guys need GPUs. [00:53:02]

Swyx: You didn't say how much? It was just like, just give them GPUs. [00:53:04]

Alessio: Basically, yeah. [00:53:05]

Michael: Ron is known for writing these one-liner emails that are very short but very to the point. And I think that's why everyone responds to Ron. Everyone loves Ron. And so Jensen responds. He responds quickly, tagging this VP of AI at Nvidia. And we start working with Nvidia, which is great. And something that I love about Nvidia, by the way, is that after that intro, we got matched with a dedicated team. And at Nvidia, they know that they're going to win regardless. So they don't care where you get the GPUs from. They're truly neutral, unlike various sales reps that you might encounter at various clouds and, you know, hardware companies, et cetera. They actually just want to help you, because they know that regardless, if you're getting Nvidia GPUs, they're still winning. So I guess that's a tip: if you're looking for GPUs, Nvidia will help you get them. [00:53:54]

Swyx: So just to tie up this thing, because, first of all, that's a fantastic story and I just wanted to let you tell it because it's special. That is a strategic shift, right? That you had already decided to make by the time you met Ron, which is: we are going to have our own hardware, we're going to rack them in a data center somewhere. [00:54:11]

Michael: Well, not even that we need our own hardware, because actually we don't. Right. We just need GPUs, period. And every cloud has their own sales tactics, and they want to make you commit to long terms and very non-flexible terms, and there's a web of different things that you kind of have to navigate. Nvidia will kind of be to the point: OK, you can do this on this cloud, this on this cloud, this is your budget, maybe you want to consider buying as well. They'll help you walk through what the options are. And the reason why they're helpful is because they look at the full picture. So they'll help you with the hardware. And in terms of software, they actually implemented a custom feature for us in FasterTransformer, which is one of their libraries.

Swyx: For you? [00:54:53]

Michael: For us. Yeah. Which is wild. I don't think they would have done it otherwise. They implemented streaming generation for T5-based models, which we were running at the time, up until we switched to GPT in February, March of this year. So they implemented that just for us, actually, in FasterTransformer. And so they'll help you look at the complete picture and then just help you get done what you need to get done.

Alessio: I know one of your interests is also local models, open source models, and hardware kind of goes hand in hand. Any fun projects, explorations in the space that you want to share, with local llamas and stuff? [00:55:27]

Michael: Yeah, it's something that we're very interested in, because something that we're hearing a lot about is that people want something like Phind, especially companies, but they want to have it within their own sandbox. They want to have it on hardware that they control. And so I'm super, super interested in how we can get big models to run efficiently on local hardware. And so Ollama is great, llama.cpp is great. Very interested in where the quantization thing is going, because obviously there are all these great quantization libraries now that go to 4-bit, 8-bit, specifically int8 and int4. [00:56:04]

Alessio: Which is the lowest it can go, right? [00:56:05]

Swyx: Yeah. [00:56:06]

Michael: So we have these great quantization libraries that, for the most part, are able to get the size down without that much quality loss. But there is some: the quantized models currently are actually worse than the non-quantized ones. And so I'm very curious if the future is something like what NVIDIA is doing with their implementation of FP8, which they're implementing in their Transformer Engine library. Basically, once FP8 support is more widespread and hardware can support it efficiently, you can switch between the two different FP8 formats: one with greater precision, one with greater range. And then combine that with not doing FP8 on every layer, doing mixed precision with, like, FP32 on some layers. And NVIDIA claims that this strategy, which they're demoing with the H100, has no degradation. So it remains to be seen whether that is really true in practice. But that's something that we're excited about, and whether that can be applied to Macs and other hardware once they get FP8 support as well. [00:57:05]
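
To illustrate the size/precision trade-off behind this discussion, here is a toy example of symmetric int8 quantization of a weight vector and the round-trip error it introduces; the two FP8 formats mentioned are typically E4M3 (more precision) and E5M2 (more range). This is purely illustrative, not any library's implementation.

```python
# Toy illustration of the size/precision tradeoff behind the quantization
# discussion above: symmetric int8 quantization of a weight vector and the
# round-trip error it introduces. (The two FP8 formats mentioned are usually
# E4M3, more precision, and E5M2, more range.)
import numpy as np

def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(1024,)).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(w - dequantize(q, scale)).mean()
print(f"int8 is 4x smaller than fp32, mean abs round-trip error ~{err:.6f}")
```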

Alessio: Cool. [00:57:06]

Swyx: One thing I wanted to do before we go into lightning round. Oh, we should also talk about hiring. How do you get your info? You seem self-taught. Yeah. [00:57:12]

Michael: I've always just, well, I'm fortunate to have like a decent systems background from UT Austin. And somewhat of a research background, even though like I didn't publish any papers, but like I went through all the motions. Like I didn't publish the thesis that I wrote, mainly out of time because I was doing both of that and the startup at the same time. And then I graduated and then it was YC and then everything was kind of one after another. But like I'm very fortunate to kind of have like the systems and like a bit of like a research background. But for the most part, outside of that foundation, like I've always just, whenever I've been interested in something, I just like. [00:57:43]

Swyx: Like give people tips, right? Like where do you, what fire hose do you drink from? Yeah, exactly. [00:57:48]

Michael: So whenever I see something that blows my mind, the way that that initial Hugging Face demo did, that was the start of everything. I'll start from the beginning. If I don't know anything, I'll start by just trying to get a mental model of what is happening. First I need to understand the what, so I can understand the how and the why. And once I understand that, then I can make my own hypotheses about, okay, here are the assumptions that the authors made, here's why maybe they're correct, maybe they're wrong, and here's how I can improve on it and iterate on it. And I guess that's the mindset that I approach it from: once I understand something, how can it be better? How can it be faster? How can it be more accurate? And so I guess for anyone starting now, I would have used Phind if I was starting now. Because I would have loved to just be able to say, hey, I have no idea what I'm doing, can you just be this technical research assistant and kind of hold my hand and ask me clarifying questions and help me formalize my assumptions along the way. I would have loved that. But yeah, I just kind of did that myself. [00:58:50]

Swyx: Recording Looms of yourself using Phind actually would be pretty interesting. Yeah. Because I think you would use Phind differently than people would by themselves. [00:58:57]

Michael: I think so. Yeah. I generally use Phind for everything, which is definitely, yeah, even non-technical questions as well, because that's just something I'm curious about, but that's less of a usage pattern nowadays. Most people, for the most part, ask technical questions on Phind. And that is completely understandable because of very deliberate decisions that we've made in how we've optimized the product. We've optimized the product very much in a quality-first manner, as opposed to speed-first or some balance of the two. So we have to run GPT-4, or some GPT-4 equivalent, by default, and it has to give a good answer to a very demanding technical audience, where people will leave otherwise. So that's just the trade-off. Sometimes it's slower for simple questions, but we did that on purpose. [00:59:46]

Alessio: So before we do a lightning round, a call for hiring: any roles you're looking for? What should people know about working at Phind? Yeah. [00:59:55]

Michael: So we really straddle the line between product and research at Phind. For the past little while, a lot of the work that we've done has been solely product. But we also do, especially now with the Phind model, a very particular kind of applied research, trying to apply the very latest techniques, and techniques that have not even been proven yet, to training the very, very best model for our vertical. And the two go hand in hand, because the product, the UI, the UX, is kind of model agnostic, but when it has a better kernel, as Andrej Karpathy put it, plugged into it, it gets so much better. So we're doing both at the same time. And so someone who enjoys seeing both of those sides, doing something very tangible that affects the user, high quality, reliable code that runs in production, but also having the chance to experiment with building these models. Yeah, we'd love to talk to you. [01:00:50]

Swyx: And the title is Applied AI Engineer. [01:00:52]

Michael: I don't know what the title is. Like that is one title, but I don't know if this really exists because I feel like we're too rigid about like bucketing people into categories. [01:01:02]

Swyx: Yeah, Founding Engineer is fine. [01:01:03]

Michael: Yeah, well, we already have a Founding Engineer technically. [01:01:06]

Swyx: Well, for what it's worth, OpenAI is adopting Applied AI Engineer. Really? So it's becoming a thing. We'll see. [01:01:12]

Alessio: We'll see. Lightning round. Yeah, we have three questions, acceleration, exploration, and then a takeaway. So the acceleration one is what's something that already happened in AI that you thought would take much longer? [01:01:24]

Michael: Yeah, the jump from these like models being glorified summarization models to actual powerful reasoning engines happened much faster than we thought because like our product itself transitioned from being kind of this glorified summarization product to now like mostly a reasoning heavy product. And we had no idea that this would happen this fast. Like we thought that there'd be a lot more time and like many more things that needed to happen before we could do some level of like intelligent reasoning on a low level about people's code. But it's already happened and it happened much faster than we could have thought. But I think that leads into your next point. [01:02:02]

Alessio: Which is exploration. [01:02:04]

Swyx: What do you think is the most interesting unsolved question in AI? [01:02:07]

Michael: I think solving hallucinations, being able to guarantee that the answer will be correct, is super interesting. And it's particularly relevant to us because we operate in a space where everything needs to be correct. The code, not just the logic, but the implementation, everything has to be completely correct. And there's a lot of very interesting work that's going on in this space. Some of it is approaching it from the angle of formal grammars. There's a very interesting paper that came out recently, I forget where it came out of, but the paper basically shows you can define a grammar that restricts and modifies the model's log probs, the decoding strategy, to only conform to that grammar. And that helps it... [01:02:53]
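
A simplified sketch of the grammar-constrained decoding idea Michael describes: at each step, tokens that would violate the grammar have their logits masked before the next token is chosen. The `allowed` set stands in for a real grammar engine and is a hypothetical placeholder.

```python
# Simplified sketch of grammar-constrained decoding as described above: at each
# step, tokens that would violate the grammar get their logits masked out
# before the next token is chosen. `allowed` stands in for a real grammar
# engine and is a hypothetical placeholder.
import math
from typing import List, Set

def constrained_step(logits: List[float], allowed: Set[int]) -> int:
    masked = [
        (logit if i in allowed else -math.inf) for i, logit in enumerate(logits)
    ]
    # Greedy pick for simplicity; a sampler would renormalize over `masked`.
    return max(range(len(masked)), key=lambda i: masked[i])

# Example: only token ids 2 and 5 are grammatically valid at this step.
print(constrained_step([0.1, 2.0, 0.5, 1.7, 0.2, 0.4], allowed={2, 5}))
```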

Swyx: Is this LMQL? Because I feel like LMQL is a little bit too structured for... If the goal is avoiding hallucination, that's such a vague goal. Yeah. [01:03:02]

Michael: This is only something we've begun to take a look at. I haven't fully read the paper yet, I've only kind of skimmed the abstract, but it's something that we're definitely interested in exploring further. But something that we are a bit further along on is exploring reinforcement learning for correctness, as opposed to only harmfulness, the way it has typically been used. [01:03:23]

Swyx: I'm interested to see your paper on that. Just a quick follow-up: do you have internal evals for what the hallucination rate is on stock GPT-4, and then maybe what yours is after fine-tuning? [01:03:34]

Michael: We don't measure hallucination directly in our internal benchmarks. We more measure like was the answer right or was it wrong? We measure hallucination indirectly by evaluating the context, like the RAG context fed into the model as well. So basically, if the context was bad and the answer was bad, then chances are like it's the context. But if the context was good and it just like misinterpreted that or had the wrong conclusion, then like we can take different steps there. Harrison from LangChain has been talking about this sort of two-by-two matrix with the RAG people. It's a pretty simple concept. [01:04:08]
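
A simple sketch of the two-by-two error attribution described here: bucket each eval case by whether the retrieved context was good and whether the answer was right. The judging callables are hypothetical placeholders for GPT-4-based or human grading.

```python
# Simple sketch of the two-by-two error attribution Michael describes: bucket
# each eval case by whether the retrieved (RAG) context was good and whether
# the final answer was right. The judging functions are hypothetical
# placeholders for GPT-4-based or human grading.
from collections import Counter
from typing import Callable, Iterable, Tuple

def attribute_errors(
    cases: Iterable[Tuple[str, str, str]],           # (question, context, answer)
    context_ok: Callable[[str, str], bool],
    answer_ok: Callable[[str, str], bool],
) -> Counter:
    buckets: Counter = Counter()
    for question, context, answer in cases:
        key = (
            "good_context" if context_ok(question, context) else "bad_context",
            "right_answer" if answer_ok(question, answer) else "wrong_answer",
        )
        buckets[key] += 1
    # ("bad_context", "wrong_answer") points at retrieval;
    # ("good_context", "wrong_answer") points at the model itself.
    return buckets
```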

Swyx: What's the source of error? [01:04:11]

Michael: Exactly. I've been talking to Harrison, actually, about a more structured way, perhaps within LangChain, to do evals. Because I think that's a massive problem. Every single eval is different for these big large language models, and doing them in a quantitative way is really hard. But it's possible with a platform that, I think, harnesses GPT-4 in the right way. That, and also perhaps a stricter prompting language, like a prompting markup language for prompting models, is something I'm also very interested in. Because we've written some very, very complex prompts, particularly for the VS Code extension, to do very fancy things with people's code. And I wish there was a more formal way, like a Python for LLM prompting, where you could activate desired things within the model's execution flow through some abstraction above language that has been tested to do that some of the time. Perhaps combined with formal grammar limitations and stuff like that. Interesting. I have no idea what that looks like. These are all things that have kind of emerged directly from the issues we're facing ourselves internally. But yeah, definitely very abstract so far.

Alessio: And yeah, just to wrap: what's one message or idea you want people to remember and think about? [01:05:32]

Michael: I think pay attention to those moments that like really jump out at you. Like when you see like a crazy demo that you can't forget about like something that you just think is really, really cool. Because I see a lot of people trying to start startups from the angle of like, hey, I just want to start a startup or I'm just bored at my job or like I'm like generally interested in the space. And I personally disagree with that. My take is that it's much easier having been on both sides of that coin now, it's much easier to stay obsessed every single day when the genesis of your startup is something that really spoke to you in an incredibly meaningful way beyond just being some insight that you've noticed. And I guess that's what we're discovering now is that in the long, long term what you're really building is like you're building a group of people that believe this thing, that believe that the future of solving problems and making things will be just like focused more on the human thought process as opposed to the implementation part. And it's that belief that I think is what really gets you through the tough times and hopefully gets you to the other side someday. [01:06:47]

Swyx: Awesome. I kind of want to play Lose Yourself as the outro music. [01:06:52]

Alessio: Then we'll get a DMCA strike. Thank you so much for coming on.

Michael: Yeah, thank you so much for having me. This was really fun. [01:06:59]



Get full access to Latent Space at www.latent.space/subscribe
2023-11-03
Link to episode

Powering your Copilot for Data - with Artem Keydunov of Cube.dev

The first workshops and talks from the AI Engineer Summit are now up! Join the >20k viewers on YouTube, find clips on Twitter (we're also clipping @latentspacepod), and chat with us on Discord!

Text-to-SQL was one of the first applications of NLP. ThoughtSpot offered "Ask your data questions" as their core differentiation compared to traditional dashboarding tools. In a way, they provide a much friendlier interface with your own structured (aka "tabular", as in "SQL tables") data, the same way that RLHF and Instruction Tuning helped turn the GPT-3 of 2020 into the ChatGPT of 2022.

Today, natural language queries on your databases are a commodity. There are 4 different ChatGPT plugins that offer this, as well as a bunch of startups like one of our previous guests, Seek.ai. Perplexity originally started with a similar product in 2022:

In March 2023 LangChain wrote a blog post on LLMs and SQL highlighting why they don't consistently work:

* "LLMs can write SQL, but they are often prone to making up tables, making up fields"

* "LLMs have some context window which limits the amount of text they can operate over"

* "The SQL it writes may be incorrect for whatever reason, or it could be correct but just return an unexpected result."

For example, if you ask a model to ?return all active users in the last 7 days? it might hallucinate a `is_active` column, join to an `activity` table that doesn?t exist, or potentially get the wrong date (especially in leap years!).

We previously talked to Shreya Rajpal at Guardrails AI, which also supports Text2SQL enforcement. Their approach was to run the actual SQL against your database and then use the error messages to improve the query:

Semantic Layers to the rescue

Cube is an open source semantic layer which recently integrated with LangChain to solve these issues in a different way. You can use YAML, Javascript, or Python to create definitions of different metrics, measures and dimensions for your data:

Creating these metrics and passing them in the model context limits the possibility for errors as the model just needs to query the `active_users` view, and Cube will then expand that into the full SQL in a reliable way. The downside of this approach compared to the Guardrails one for example is that it requires more upfront work to define metrics, but on the other hand it leads to more reliable and predictable outputs.
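
As a rough illustration (not Cube's actual API), here is what that expansion step might look like: the model only names a metric and a parameter, and the layer expands it into full, known-correct SQL. The table and column names are made up.

```python
# Illustrative sketch (not Cube's actual API) of what a semantic layer buys
# you: the model only has to ask for a named metric, and the layer expands it
# into the full, known-correct SQL. The table and column names are made up.
METRICS = {
    "active_users": {
        "sql": (
            "SELECT COUNT(DISTINCT user_id) AS active_users "
            "FROM events WHERE event_type = 'login' "
            "AND event_time >= NOW() - INTERVAL '{days} days'"
        ),
        "params": {"days": 7},
    }
}

def expand(metric: str, **overrides) -> str:
    spec = METRICS[metric]
    params = {**spec["params"], **overrides}
    return spec["sql"].format(**params)

# The model's query is just: metric="active_users", days=7
print(expand("active_users", days=7))
```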

The promise of adding a great semantic layer to your LLM app is irresistible - you greatly minimize hallucinations, make much more token efficient prompts, and your data stays up to date without any retraining or re-indexing. However, there are also difficulties with implementing semantic layers well, so we were glad to go deep on the topic with Artem as one of the leading players in this space!

Timestamps

* [00:00:00] Introductions

* [00:01:28] Statsbot and limitations of natural language processing in 2017

* [00:04:27] Building Cube as the infrastructure for Statsbot

* [00:08:01] Open sourcing Cube in 2019

* [00:09:09] Explaining the concept of a semantic layer/Cube

* [00:11:01] Using semantic layers to provide context for AI models working with tabular data

* [00:14:47] Workflow of generating queries from natural language via semantic layer

* [00:21:07] Using Cube to power customer-facing analytics and natural language interfaces

* [00:22:38] Building data-driven AI applications and agents

* [00:25:59] The future of the modern data stack

* [00:29:43] Example use cases of Slack bots powered by Cube

* [00:30:59] Using GPT models and limitations around math

* [00:32:44] Tips for building data-driven AI apps

* [00:35:20] Challenges around monetizing embedded analytics

* [00:36:27] Lightning Round

Transcript

Swyx: Hey everyone, welcome to the Latent Space podcast. This is Swyx, writer, editor of Latent Space and founder of Smol.ai and Alessio, partner and CTO in residence at Decibel Partners. [00:00:15]

Alessio: Hey everyone, and today we have Artem Keydunov on the podcast, co-founder of Cube. Hey Artem. [00:00:21]

Artem: Hey Alessio, hi Swyx. Good to be here today, thank you for inviting me. [00:00:25]

Alessio: Yeah, thanks for joining. For people that don't know, I've known Artem for a long time, ever since he started Cube. And Cube is actually a spin-out of his previous company, which is Statsbot. And this kind of feels like going both backward and forward in time. So the premise of Statsbot was having a Slack bot that you can ask, basically like text to SQL in Slack, and this was six, seven years ago, something like that. A lot ahead of its time, and you see startups trying to do that today. And then Cube came out of that as a part of the infrastructure that was powering Statsbot. And Cube then evolved from an embedded analytics product to the semantic layer and just an awesome open source evolution. I think you have over 16,000 stars on GitHub today, you have a very active open source community. But maybe for people at home, just give a quick like lay of the land of the original Statsbot product. You know, what got you interested in like text to SQL and what were some of the limitations that you saw then, the limitations that you're also seeing today in the new landscape? [00:01:28]

Artem: I started Statsbot in 2016. The original idea was to just make sort of a side project based off my initial project that I did at a company that I was working for back then. And I was working for a company that was building software for schools, and we were using Slack a lot. And Slack was growing really fast, a lot of people were talking about Slack, you know, Slack apps, chatbots in general. So I think it was, you know, another wave of bots and all that. We have one more wave right now, but it always comes in waves. So we were living through one of those waves. And I wanted to build a bot that would bring me information from different places where data lives to Slack. So it was developer data, like New Relic, maybe some marketing data, Google Analytics, and then some just regular data, like a production database, or Salesforce sometimes. And I wanted to bring it all into Slack, because we were always chatting, you know, in Slack, and I wanted to see some stats in Slack. So that was the idea of Statsbot, right, like bring stats to Slack. I built that as, you know, a first sort of side project, and I published it on Reddit. And people started to use it even before Slack came up with their Slack application directory. So it was a little, you know, hackish way to install it, but people were still installing it. So it was a lot of fun. And then Slack came up with that application directory, and they reached out to me and they wanted to feature Statsbot, because it was already one of the kind of widely used bots on Slack. So they featured me on the application directory front page, and I just got a lot of, you know, new users signing up for that. It was a lot of fun, I think, you know, but there was sort of a big limitation in terms of how you can process natural language, because the original idea was to let people ask questions directly in Slack, right: hey, show me my, you know, opportunities closed last week or something like that. My co-founder, who started helping me with this Slack application, him and I were trying to build a system to recognize that natural language. But, you know, we didn't have LLMs back then and all of that technology. So it was really hard to build the system, especially a system that can kind of, you know, keep talking to you, like maintain some sort of a dialogue. It was a lot of one-off requests, and it was a lot of hit and miss, right? If you know how to construct a query in natural language, you will get a result back. But, you know, it was not a system that was capable of asking follow-up questions to try to understand what you actually want, and then kind of finally, you know, bring all this context together, go generate a SQL query, get the result back, and all of that. So that was a really missing part. And I think right now, that's, you know, what is the difference. So right now I'm kind of bullish that if I were to start Statsbot again, I'd probably have a much better shot at it. But back then, that was a big limitation. We kind of built Cube, right, as we were working on Statsbot, because we needed it. [00:04:27]

Alessio: What was the ML stack at the time? Were you building, trying to build your own natural language understanding models, like were there open source models that were good that you were trying to leverage? [00:04:38]

Artem: I think it was mostly combination of a bunch of things. And we tried a lot of different approaches. The first version, which I built, like was Regex. They were working well. [00:04:47]

Swyx: It's the same as I did, I did option pricing when I was in finance, and I had a natural language pricing tool thing. And it was Regex. It was just a lot of Regex. [00:04:59]

Artem: Yeah. [00:05:00]

Artem: And my co-founder, Pavel, he's much smarter than I am. He has like a PhD in math, all of that. And he started to do some stuff. I was like, no, you just do that stuff, I don't know, I can do Regex. And he started to do some models, trying to either look at what we had on the market back then, or try to build different sorts of models. Again, we didn't have any foundational technology in place, right? We wanted to try to use existing math, obviously, right? But it was not something where we could just take a model and try and run it. I think in 2019, we started to see more of that stuff, like the ecosystem being built, and then it eventually kind of resulted in all these LLMs, like what we have right now. But back then in 2016, there was not much available for people to just build on top of. There was some academic research, right, kind of happening, but it was very, very early for something to actually be usable. [00:05:58]

Alessio: And then that became Cube, which started just as an open source project. And I think I remember going on a walk with you in San Mateo in 2020, something like that. And you had people reaching out to you who were like, hey, we use Cube in production. I just need to give you some money, even though you guys are not a company. What's the story of Cube then from Statsbot to where you are today? [00:06:21]

Artem: We built Cube at Statsbot because we needed it. The whole Statsbot stack was that we first tried to translate the initial natural language query into some sort of multidimensional query. We were trying to understand, okay, people want to get active opportunities, right? What does that mean? Is it a metric? What is a dimension here? Because usually in analytics, you always, you know, try to reduce everything down to a sort of multidimensional framework. So that was the first step. And that's where, you know, it didn't really work well, because of that limitation of us not having foundational technologies. But then from the multidimensional query, we wanted to go to SQL. And that's what the semantic layer was, and what Cube essentially was. So we built a framework where you would be able to map your data into this concept, into these metrics. Because when people were coming to Statsbot, they were bringing their own datasets, right? And the big question was, how do we tell the system what "active opportunities" means for that specific user? How do we kind of, you know, provide that context, how do we do the training? So that's why we came up with the idea of building the semantic layer, so people can actually define their metrics and then kind of use them with Statsbot. So that's how we built Cube. At some point, we saw people starting to see more value in Cube itself, you know, in building the semantic layer and then using it to power different types of applications. So in 2019, we decided, okay, it feels like it might be a standalone product and a lot of people want to use it. Let's just try to open source it. So we took it out of Statsbot and open-sourced it. [00:08:01]

Swyx: Can I make sure that everyone has the same foundational knowledge? The concept of a cube is not something that you invented. I think, you know, not everyone has the same background in analytics and data that all three of us do. Maybe you want to explain like OLAP Cube, HyperCube, the brief history of cubes. Right. [00:08:17]

Artem: I'll try, you know, like a lot of like Wikipedia pages and like a lot of like a blog post trying to go into academics of it. So I'm trying to like... [00:08:25]

Swyx: Cube's according to you. Yeah. [00:08:27]

Artem: So when we think about just a table in a database, the problem with the table is that it's not multidimensional, meaning that in many cases, if we want to slice the data, we kind of need to end up with a different table, right? Think about when you're writing a SQL query to answer one question: a SQL query always ends up with a table, right? So you write one SQL query, you get one table. And then to answer a different question, you write a second query. So you're kind of getting a bunch of tables. So now let's imagine that we can bring all these tables together into one multidimensional table. And that's essentially a Cube. So it's just the way that we can have measures and dimensions that can potentially be used at the same time from different angles. [00:09:09]
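
To make the measures-and-dimensions idea concrete, here is a tiny illustrative sketch: one set of fact records sliced along two different dimensions on demand, without materializing a new source table each time. The data and dimension names are invented.

```python
# Tiny illustration of the measures/dimensions idea: one set of facts, sliced
# along different dimensions on demand. The records and dimension names are
# made up for illustration.
from collections import defaultdict

facts = [
    {"region": "EU", "product": "widget", "revenue": 120},
    {"region": "EU", "product": "gadget", "revenue": 80},
    {"region": "US", "product": "widget", "revenue": 200},
    {"region": "US", "product": "gadget", "revenue": 150},
]

def slice_by(dimension: str) -> dict:
    """Aggregate the 'revenue' measure grouped by one dimension."""
    totals = defaultdict(int)
    for row in facts:
        totals[row[dimension]] += row["revenue"]
    return dict(totals)

print(slice_by("region"))   # {'EU': 200, 'US': 350}
print(slice_by("product"))  # {'widget': 320, 'gadget': 230}
```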

Alessio: So initially, a lot of your use cases were more BI related, but you recently released a LangChain integration. There's obviously more and more interest in, again, using these models to answer data questions. You've seen the ChatGPT Code Interpreter, which has been renamed to Advanced Data Analysis. What's kind of the future of the semantic layer in AI? What are some of the use cases that you're seeing, and why do you think it's a good strategy to make it easier to do now the text-to-SQL you wanted to do seven years ago? [00:09:39]

Artem: Yeah. So, I mean, you know, when it started to happen, I was just like, oh my God, people are now building Statsbot with Cube. They just have better technology for, you know, natural language. So it made sense to me, you know, from the first moment I saw it. So I think it's something that's happening right now, and a chatbot is one of the use cases. I think, you know, if you try to generalize it, the use case would be: how do we use structured or tabular data with, you know, AI models, right? How do we take the data and give the context as data, and then bring it to the model, and then the model can, you know, give you answers, ask questions, do whatever you want. But the question is how we go from just the data in your data warehouse, database, whatever, which is usually just tabular data, right, like in SQL-based warehouses, to some sort of, you know, context that the system can use. And if you're building this application, you have to do it. There's no way you can get around not doing this. You either map it manually or you come up with some framework or something else. So our take, and my take, is that the semantic layer is just a really good place for this context to live, because you need to give this context to humans, and you need to give that context to the AI system anyway, right? So that's why you define the metric once, and then, you know, you teach your AI system what this metric is about. [00:11:01]

Alessio: What are some of the challenges of using tabular versus language data and some of the ways that having the semantic layer kind of makes that easier maybe? [00:11:09]

Artem: Imagine you're a human, right? You're the new data analyst at a company, and people just give you a warehouse with a bunch of tables and tell you, okay, just try to make sense of this data. And you're going through all of these tables and you're really trying to make sense of them without any, you know, additional context. Some columns, in many cases, might have weird names. Sometimes, you know, if they follow some kind of star schema or, you know, Kimball-style dimensions, maybe that would be easier, because you would have fact and dimension columns, but it's still hard to understand and make sense of, because it doesn't have descriptions, right? And there's a whole industry of data catalogs that exists because the whole purpose of that is to give context to the data so people can understand it. And I think the same applies to AI, right? The same challenge is that if you give it pure tabular data, it doesn't have this sort of context that it can read. So you sort of need to write a book, or an essay, about your data and give that book to the system so it can understand it. [00:12:12]

Alessio: Can you run through the steps of how that works today? So the initial part is like the natural language query, like what are the steps that happen in between to do model, to semantic layer, semantic layer, to SQL and all that flow? [00:12:26]

Artem: The first key step is to do some sort of indexing. That's what I was referring to: write a book about your data, right? Describe in a text format what your data is about, right? What metrics it has, what dimensions, what the structure of that is, what the relationships between those metrics are, what the potential values of the dimensions are. So, you know, build a really good index as a text representation, and then turn it into embeddings in your, you know, vector storage. Once you have that, then you can provide that as context to the model. I mean, there are a lot of options, like either fine-tuning or, you know, sort of in-context learning, but somehow give that as context to the model, right? And then once the model has this context, it can create a query. Now, the query, I believe, should be created against the semantic layer, because it reduces the room for error. Because what usually happens is that your query to the semantic layer will be very simple. It will be like: give me that metric, group by that dimension, and maybe that filter should be applied. Whereas your real query for the warehouse might have, like, five joins and a lot of different techniques, like how to avoid fan-outs, fan traps, chasm traps, all of that stuff. And the bigger the query, the more room there is for the model to make an error, right? Sometimes it could even be a small error and then, you know, your numbers are going to be off. But making a query against the semantic layer reduces that error. So the model generates a query and then executes it against the semantic layer, and the semantic layer executes it against your warehouse and then sends the result all the way back to the application. And this can be done multiple times, because what we were missing before was this ability to have a conversation, right, with the model. You can ask a question, and then the system can ask follow-up questions, you know, then do a query to get some additional information, and based on this information, do a query again. And it can keep doing this stuff and then eventually maybe give you a big report that consists of a lot of data points. But the whole flow is that the system knows your data, because you already did the indexing, and then it queries the semantic layer instead of the data warehouse directly. [00:14:47]
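
Pulling the flow Artem describes into one place, here is a high-level sketch with every step as a toy placeholder: retrieve the indexed semantic-layer description as context, let the model emit a small semantic-layer query, expand it to warehouse SQL, and execute. None of these functions are Cube's real API; they only show the shape of the pipeline.

```python
# High-level sketch of the flow described above, with every step as a toy
# placeholder. None of these functions are Cube's real API.

def retrieve_semantic_context(question: str) -> str:
    # Placeholder: in practice, an embedding search over the indexed metric docs.
    return "metric: active_users -- distinct users with a login event"

def llm_generate_query(question: str, context: str) -> dict:
    # Placeholder: in practice, an LLM call with `context` in the prompt.
    return {"metric": "active_users", "days": 7}

def semantic_layer_expand(sl_query: dict) -> str:
    # Placeholder: the semantic layer turns the small query into full SQL.
    return ("SELECT COUNT(DISTINCT user_id) FROM events "
            f"WHERE event_time >= NOW() - INTERVAL '{sl_query['days']} days'")

def run_on_warehouse(sql: str) -> list:
    # Placeholder: execute against the warehouse and return rows.
    return [{"active_users": 1234}]

def answer_question(question: str) -> list:
    context = retrieve_semantic_context(question)
    sl_query = llm_generate_query(question, context)
    sql = semantic_layer_expand(sl_query)
    return run_on_warehouse(sql)

print(answer_question("How many active users did we have in the last 7 days?"))
```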

Alessio: Maybe just to make it a little clearer for people that haven't used a semantic layer before: you can add definitions like revenue, where revenue is like select from customers, join orders, and then sum the amount of the orders. But in the semantic layer, you're kind of hiding all of that away. So when you do natural language to query, it's just "select revenue from last week" and then it turns into a bigger query. [00:15:12]
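
As a toy illustration of what is being hidden here, the sketch below expands a named measure into the underlying SQL. The table and column names are invented, and a real semantic layer such as Cube would define this in its own schema format rather than a Python dict.

```python
# Toy illustration of "select revenue from last week" expanding into a real query.
# Table and column names are made up for the example.
MEASURES = {
    "revenue": {
        "sql": "SUM(orders.amount)",
        "joins": ["JOIN customers ON customers.id = orders.customer_id"],
        "base_table": "orders",
        "time_column": "orders.created_at",
    }
}

def compile_query(measure: str, last_n_days: int) -> str:
    m = MEASURES[measure]
    joins = "\n".join(m["joins"])
    return (
        f"SELECT {m['sql']} AS {measure}\n"
        f"FROM {m['base_table']}\n"
        f"{joins}\n"
        f"WHERE {m['time_column']} >= CURRENT_DATE - INTERVAL '{last_n_days} days'"
    )

# The natural-language layer only has to produce ("revenue", 7); the joins come for free.
print(compile_query("revenue", 7))
```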

Swyx: One of the biggest difficulties around the semantic layer, for people who've never thought about this concept before: this all sounds super neat until you have multiple stakeholders within a single company who all have different concepts of what revenue is. They all have different concepts of what an active user is. And then they'll have, like, you know, revenue revision one by the sales team, and then revenue revision one by the accounting team or the tax team, I don't know. I feel like I always want semantic layer discussions to talk about the not-so-pretty parts of the semantic layer, because this is where effectively you ship your org chart in the semantic layer. [00:15:47]

Artem: I think the way I think about it is that at the end of the day, the semantic layer is a code base. And in Cube, it's essentially a code base, right? It's not just a set of YAML files; there's Python in there too. I think code is never perfect, right? It's never going to be perfect. It will have a lot of, you know, revisions. We have version control, which makes it easier to deal with revisions. So I think we should treat our metrics and semantic layer as code, right? And then collaboration is a big part of it. You know, if there are multiple teams that have different opinions, let them collaborate on the pull request, you know, they can discuss that, like why they think that should be calculated differently, have an open conversation about it, you know, where everyone can just discuss it, like an open source community, right? Like you go on GitHub and you talk about why that code is written the way it's written, or whether it should be written differently. And then hopefully at some point you can converge, you know, on some definition. Now, if you still have to have multiple versions, right, it's code, right? You can still manage it. But I think the big part of that is that we really need to treat it as a code base. Then it makes a lot of things easier, not as spreadsheets, you know, like hidden Excel files. [00:16:53]

Alessio: The other thing is like then having the definition spread in the organization, like versus everybody trying to come up with their own thing. But yeah, I'm sure that when you talk to customers, there's people that have issues with the product and it's really like two people trying to define the same thing. One in sales that wants to look good, the other is like the finance team that wants to be conservative and they all have different definitions. How important is the natural language to people? Obviously you guys both work in modern data stack companies either now or before. There's going to be the whole wave of empowering data professionals. I think now a big part of the wave is removing the need for data professionals to always be in the loop and having non-technical folks do more of the work. Are you seeing that as a big push too with these models, like allowing everybody to interact with the data? [00:17:42]

Artem: I think it's a multidimensional question. That's an example of, you know, a question with a lot packed inside it. In terms of examples, I think a lot of people are building different, you know, agents or chatbots. You have a company that built an internal Slack bot that sort of answers questions, you know, based on the data in a warehouse. And then a lot of people kind of go in and ask that chatbot questions. Is it a real big use case? Maybe. Is it still like a toy pet project? Maybe, too, right now. I think it's really hard to tell them apart at this point because there is a lot of hype, you know, and people are just building LLM stuff because it's cool and everyone wants to build something, you know, at least a pet project. So that's what happened in the Cube community as well. We see a lot of people building a lot of cool stuff and it probably will take some time for that stuff to mature and to see what the real, best use cases are. But from what I saw so far, one use case was building this chatbot, and we even have one company that is building it as a service. So they essentially connect into Cube's semantic layer and then offer their chatbot, so you can use it on the web, in Slack, so it can, you know, answer questions based on data in your semantic layer. But I also see a lot of these things just being built in-house. And there are other use cases, sort of automation, you know, where an agent checks on the data and then performs some actions based, you know, on changes in the data. But the other dimension of your question is, will it replace people or not? I think, you know, from what I see so far in data specifically, you know, in a few use cases of LLMs, I don't see Cube being part of that use case, but it's more like a copilot for a data analyst, a copilot for a data engineer, where you develop something, you develop a model, and it can help you write the SQL or something like that. So you know, it can create boilerplate SQL, and then you can edit this SQL, which is fine because you know how to edit SQL, right? So you're not going to make a mistake, but it will help you to just generate, you know, a bunch of SQL that you write again and again, right? Like boilerplate code. So sort of a copilot use case. I think that's great. And we'll see more of it. I think every platform that is building for data engineers will have some sort of copilot capabilities, and at Cube we're building these copilot capabilities to help people build semantic layers more easily. I think that's just a baseline for every engineering product right now, to have some sort of, you know, copilot capabilities. Then the other use case, the one where Cube is more involved, is how do we enable access to data for non-technical people through natural language as an interface to data, right? Visual dashboards and charts have always been the interface to data in every BI. Now I think we will see a second interface, which is just kind of natural language. So I think at this point, many BIs will add it as a commodity feature; like, Tableau will probably have a search bar at some point saying, hey, ask me a question. I know that some of them, you know, like AWS QuickSight, are about to announce features like this in their BI. And I think Power BI will do that, especially with their deal with OpenAI. 
So every company, every BI will have some sort of search capability built in. So I think that's just going to be a baseline feature for them as well. But that's where Cube can help, because we can provide that context, right? [00:21:07]

Alessio: Do you know how, or do you have an idea for how these products will differentiate once you get the same interface? So right now there's like, you know, Tableau, which is like the super complicated one, and others that are like easier. Yeah. Do you just see everything will look the same, and then how do people differentiate? [00:21:24]

Artem: It's like they all have line charts, right? And they all have bar charts. I feel like it's pretty much the same, and it's going to be fragmented as well. Every major vendor, most of the vendors, will try to have some sort of natural language capabilities, and they might be a little bit different. Some of them will try to position the whole product around it. Some of them will just have it as a checkbox, right? So we'll see, but I don't think it's going to be something that will change the BI market, you know, something that can take the BI market and make it more consolidated rather than, you know, what we have right now. I think it will still remain fragmented. [00:22:04]

Alessio: Let's talk a bit more about application use cases. So people also use Cube for kind of like analytics in their product, like dashboards and things like that. How do you see that changing, especially when it comes to agents? You know, there's a lot of people trying to build agents for reporting, building agents for sales. If you're building a sales agent, you need to know everything about the purchasing history of the customer. All of these things. Yeah. Any thoughts there? What should all the AI engineers listening think about when implementing data into agents? [00:22:38]

Artem: Yeah, I think you're kind of, you know, trying to solve for two problems. One is how to make sure that the agent or LLM, right, has enough context about, you know, the tabular data, and also, you know, how do we deliver updates to that context, which is also important because data is changing, right? So every time we change something upstream, we need to make sure we update that context in our vector database or somewhere. And how do you make sure that the queries are correct? You know, I think that's obviously a big pain in the whole, you know, AI space right now: how do we make sure that we don't, you know, provide wrong answers? But I think, you know, being able to reduce the room for error as much as possible is what I would look for, you know, to try to minimize the potential damage. And then our use case for Cube, it's been used a lot to power sort of customer-facing analytics. So I don't think much is going to change there; I feel like, again, more and more products will adopt natural language interfaces as sort of a part of the product as well. So we would be able to power these products with not only, you know, charts and visuals, but also some sort of, you know, summaries. Probably in the future, you're going to open the page with some surfaced stats and you will have a smart summary kind of generated by AI. And that summary can be powered by Cube, right, because the rest is already being powered by Cube. [00:24:04]
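
A small sketch of the "deliver updates to the context" part: fingerprint each description and re-embed only what changed upstream. The function names and the fake embedder are hypothetical, just to show the shape of the refresh job.

```python
# Sketch of keeping the model's context in sync with the warehouse:
# hash each metric description and re-embed only the ones that changed.
import hashlib

def fingerprint(description: str) -> str:
    return hashlib.sha256(description.encode()).hexdigest()

def refresh_index(descriptions: dict, index: dict, embed) -> dict:
    """Return a new index, re-embedding only entries whose description changed."""
    fresh = {}
    for name, desc in descriptions.items():
        fp = fingerprint(desc)
        cached = index.get(name)
        if cached and cached["fingerprint"] == fp:
            fresh[name] = cached  # unchanged upstream: keep the old vector
        else:
            fresh[name] = {"fingerprint": fp, "vector": embed(desc)}  # changed: re-embed
    return fresh

# Toy usage with a fake embedder standing in for a real embedding model:
fake_embed = lambda text: [float(len(text))]
index = refresh_index({"revenue": "sum of order amounts"}, {}, fake_embed)
index = refresh_index({"revenue": "sum of order amounts, net of refunds"}, index, fake_embed)
print(index["revenue"]["vector"])  # only this entry was re-embedded
```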

Alessio: You know, we had Linus from Notion on the pod and one of the ideas he had that I really like is kind of like thumbnails of text, kind of like how do you like compress knowledge and then start to expand it. A lot of that comes into dashboards, you know, where like you have a lot of data, you have like a lot of charts and sometimes you just want to know, hey, this is like the three lines summary of it. [00:24:25]

Artem: Exactly. [00:24:26]

Alessio: Makes sense that you want to power that. How are you thinking about, yeah, the evolution of like the modern data stack in quotes, whatever that means today. What's like the future of what people are going to do? What's the future of like what models and agents are going to do for them? Do you have any, any thoughts? [00:24:42]

Artem: I feel like the modern data stack sometimes is not very... I mean, there's obviously a big crossover between the AI ecosystem, the AI infrastructure ecosystem, and then sort of the data ecosystem. But I don't think it's a full overlap. So when, you know, I'm looking at a lot of what's happening in the modern data stack, where we use warehouses, we use BIs, you know, different transformation tools, catalogs, data quality tools, ETLs, all of that, I don't see a lot of it being impacted by AI specifically. I think, you know, that space is being impacted as much as any other space; yes, we'll have all these copilot capabilities, some AI capabilities here and there, but I don't see anything being dramatically, you know, changed or shifted because of, you know, the AI wave. In terms of the data space in general, I think in the last two, three years we saw an explosion, right? Like we got a lot of tools, a vendor for every problem. I feel like right now we should go through the cycle of consolidation. If Fivetran and dbt merge, they can be the Alteryx of a new generation or something like that. And, you know, probably some ETL tool in there. I feel it might happen. I mean, it's just natural waves, you know, like in cycles. [00:25:59]

Alessio: I wonder if everybody is going to have their own copilot. The other thing I think about these models is like Swyx was at Airbyte and yeah, there's Fivetran. [00:26:08]

Swyx: Fivetran versus AirByte, I don't think it'll mix very well. [00:26:10]

Alessio: A lot of times these companies are doing the syntax work for you of building the integration between your data store and the app or another data store. I feel like now these models are pretty good at coming up with the integration themselves and using the docs to then connect the two. So I'm really curious what that will look like in the future. And same with data transformation. I mean, you think about dbt and some of these tools: right now you have to create rules to normalize and transform data. In the future, I could see you explaining to the model how you want the data to be, and then the model figuring out how to do the transformation. I think it all needs a semantic layer as far as figuring out what to do with it. You know, what the data is for and where it goes. [00:26:53]

Artem: Yeah, I think many of these, you know, workflows will be augmented by, you know, some sort of copilot. You know, you can describe what transformation you want to see and it can generate a boilerplate, right, of the transformation for you, or even, you know, generate a boilerplate of a specific ETL driver or ETL integration. I think we're still not at the point where this code can be fully automated. So we still need a human in the loop, right, who can use this copilot. But in general, I think, yeah, data work and software engineering work can be augmented quite significantly with all that stuff. [00:27:31]

Alessio: You know, the big thing with machine learning before was like, well, all of your data is bad. You know, the data is not good for anything. And I think like now, at least with these models, they have some knowledge of their own and they can also tell you if your data is bad, which I think is something that before you didn't have. Any cool apps that you've seen being built on Cube, like any kind of AI-native things that people should think about, new experiences, anything like that? [00:27:54]

Artem: Well, I see a lot of Slack bots. They all remind me of Statsbot, but, you know, I played with a few of them and they're much, much better than Statsbot. It feels like it's the use case on the surface, right? It's just that use case that you really want. You know, think about it: you're a data engineer in your company, and everyone is asking you, hey, can you pull that data for me? And you would be like, can I build a bot to replace myself, so they can all ping that bot instead? So that's why a lot of people are doing that. So I think it's the first use case that people are actually playing with. But I think inside that use case, people get creative. So I see bots that can actually have a dialogue with you. So, you know, you would come to that bot and say, hey, show me metrics. And the bot would be like, what kind of metrics? What do you want to look at? You would be like, active users. And then it would be like, how do you define active users? Do you want to see active users as a cohort, do you want to see how active users' behavior changes over time? Like, a lot of follow-up questions. So it tries to, you know, understand what exactly you want. And that's how many data analysts work, right? When people ask you something, you always try to understand what exactly they mean, because many people don't know how to ask correct questions about the data. It's sort of an interesting spectrum. On one side of the spectrum, you know nothing: it's like, hey, show me metrics. On the other side of the spectrum, you know how to write SQL and you can write the exact query against your data warehouse, right? And many people are a little bit in the middle. And the data analysts, they usually have the knowledge about your data. And that's why they can ask follow-up questions to understand what exactly you want. And I saw people building bots who can do that. That part is amazing. I mean, generating SQL, all that stuff, it's okay, it's good. But when the bot can actually act like it knows your data and can ask follow-up questions, I think that's great. [00:29:43]
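
A minimal sketch of that clarify-before-you-query behavior; the ambiguity table and the simulated user are invented for the example, and a real bot would let the model decide when a term is underspecified.

```python
# Sketch of a bot that asks a follow-up question before running the query,
# mimicking how a data analyst disambiguates terms like "active users".
AMBIGUOUS_TERMS = {
    "active users": ["daily active (any event)", "weekly active", "paying users only"],
}

def answer(question: str, ask_user) -> str:
    for term, options in AMBIGUOUS_TERMS.items():
        if term in question.lower():
            choice = ask_user(f"How do you define '{term}'? Options: {options}")
            question += f" (definition: {choice})"
    return f"Running semantic-layer query for: {question}"

# Simulated user who always picks the first option:
print(answer("Show me active users by month",
             ask_user=lambda prompt: "daily active (any event)"))
```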

Swyx: Yeah. [00:29:44]

Alessio: Are there any issues with the models and the way they understand numbers? One of the big complaints people have is like GPT, at least 3.5, cannot do math. Have you seen any limitations and improvement? And also when it comes to what model to use, do you see most people use like GPT-4? Because it's like the best at this kind of analysis. [00:30:03]

Artem: I think I saw people use all kinds of models. To be honest, it's usually GPT. So within GPT, it could be 3.5 or 4, right? But it's not like I see a lot of something else, to be honest; I mean, maybe some open source alternatives, but it feels like the market is being dominated by just ChatGPT. In terms of the problems, I've been chatting about this with a few people. If math is required, you do the math, you know, outside of, you know, ChatGPT itself, so it would be some additional Python scripts or something. When we're talking about production-level use cases, there's quite a lot of Python code around, you know, your model to make it work. To be honest, it's not such magic that you just throw the model in and it can give you all these answers. For toy use cases, like the ones we have on, you know, our demo page or something, it works fine. But, you know, if you want to do a lot of post-processing, do math on the results, you probably need to code it in Python anyway. That's what I see people doing. [00:30:59]
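
A tiny sketch of that division of labor: the model only has to choose the metric and the comparison window, and plain Python does the arithmetic and formatting. The numbers are made up.

```python
# The arithmetic happens outside the model: imagine these two numbers came back
# from two semantic-layer queries the model asked for. Values are invented.
def percent_change(current: float, previous: float) -> float:
    return (current - previous) / previous * 100 if previous else float("nan")

this_week, last_week = 12_340.0, 11_800.0

summary = (f"Revenue was {this_week:,.0f} this week, "
           f"{percent_change(this_week, last_week):+.1f}% vs last week.")
print(summary)  # the model can rephrase this; it never does the division itself
```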

Alessio: We heard the same from Harrison and LangChain, that most people just use OpenAI. We did an "OpenAI has no moat" emergency podcast, and it was funny to just see the reaction that people had to that and how hard it actually is to break down some of the monopoly. What else should people keep in mind, Artem? You're kind of at the cutting edge of this. You know, if I'm looking to build a data-driven AI application, I'm trying to build data into my AI workflows. Any mistakes people should avoid? Any tips on the best stack to use? What tools to use? [00:31:32]

Artem: I would just recommend going to a warehouse as soon as possible. I think a lot of people feel that MySQL can be a warehouse, which maybe it can be at a lower scale, but definitely not from a performance perspective. So just starting with a good warehouse, a query engine, a lakehouse, that's probably something I would recommend starting from day zero. And there are good ways to do it, very cheap, with open source technologies too, especially in the lakehouse architecture. I think, you know, I'm biased, obviously, but using a semantic layer, preferably Cube, for, you know, the context. And other than that, I just feel it's a very interesting space in terms of the AI ecosystem. I see a lot of people using LangChain right now, which is great, you know, and we built an integration. But I'm sure the space will continue to evolve and, you know, we'll see a lot of interesting tools, and maybe, you know, some tools would be a better fit for the job. I'm not aware of any right now, but it's always interesting to see how it evolves. Also it's a little unclear, you know, how all the infrastructure around actually developing, testing, documenting, all that stuff will evolve too. But yeah, again, it's just really interesting to see and observe, you know, what's happening in this space. [00:32:44]

Swyx: So before we go to the lightning round, I wanted to ask you on your thoughts on embedded analytics and in a sense, the kind of chatbots that people are inserting on their websites and building with LLMs is very much sort of end user programming or end user interaction with their own data. I love seeing embedded analytics, and for those who don't know, embedded analytics is basically user facing dashboards where you can see your own data, right? Instead of the company seeing data across all their customers, it's an individual user seeing their own data as a slice of the overall data that is owned by the platform that they're using. So I love embedded analytics. Well, actually, overwhelmingly, the observation that I've had is that people who try to build in this market fail to monetize. And I was wondering your insights on why. [00:33:31]

Artem: I think overall, the statement is true. It's really hard to monetize, you know, embedded analytics. That's why at Cube we're more excited about the internal kind of BI use case, or companies building, you know, chatbots for their internal data consumption or internal workflows. Embedded analytics is hard to monetize because it's historically been dominated by the BI vendors. And we still see a lot of organizations using BI tools from vendors. And what I was talking about, BI vendors adding natural language interfaces, they will probably add that to their embedded analytics capabilities as well, right? So they would be able to embed that too. So I think that's part of it. Also, you know, if you look at the embedded analytics market, the bigger the organization gets, the more custom it becomes, and at some point I see many organizations just stop using any vendor and build most of the stuff from scratch, which is probably, you know, the right way to do it. So, you know, you've got a market that is very capped at the top. And then in that middle and small segment, you've got a lot of vendors trying to, you know, compete for the buyers. And because, again, BI is very fragmented, embedded analytics is therefore fragmented also. So you're really going after the mid-market slice, with a lot of other vendors competing for that. So that's why it's historically been hard to monetize, right? I don't think AI is really going to change that, because if it's just using a model, you just pay OpenAI and that's it; everyone can do that, right? So it's not much of a competitive advantage. So it's going to be more like a commodity feature that a lot of vendors will be able to leverage. [00:35:20]

Alessio: This is great, Artem. As usual, we've got our lightning round. So it's three questions. One is about acceleration, one on exploration, and then a takeaway. The acceleration one is: what's something that already happened in AI, or maybe, you know, in data, that you thought would take much longer, but it's already happening today? [00:35:38]

Artem: To be honest, all these foundational models. I thought that we'd have a lot of models that had been in production for, you know, maybe a decade or so, and that they would be very niche use cases, very vertical use cases, just very customized models. Even when we were building Statsbot back then in 2016, right, even back then, we had some natural language models being deployed, like Google Translate or something; that was still sort of a model, right, but it was very customized to a specific use case. So I thought that would continue for many years: we would use AI, we'd have all these customized niche models. But these foundational models, they're very generic now; they can serve many, many different use cases. So I think that is a big change. And I didn't expect that, to be honest. [00:36:27]

Swyx: The next question is about exploration. What is one thing that you think is the most interesting unsolved question in AI? [00:36:33]

Artem: I think AI is a subset of software engineering in general. And it's sort of connected to the data as well. Because software engineering as a discipline, it has quite a history. We build a lot of processes, you know, like toolkits and methodologies, how we prod that, [00:36:50]

Swyx: right. [00:36:51]

Artem: But AI, I don't think it's completely different. But it has some unique traits, you know; like, it's not quite idempotent, right, along many dimensions, and it has other unique traits. So it may require different methodologies, different approaches, and a different toolkit. I don't know how much it's going to deviate from standard software engineering; I think many tools and practices that we developed for software engineering can be applied to AI, and some of the data best practices can be applied as well. But, like, we got DevOps, right? It's a bunch of tools, an ecosystem. So now AI kind of feels like it's shaping into that, with a lot of its own, you know, methodologies, practices and toolkits. So I'm really excited about it. And I think there are a lot of still unsolved questions: how do we develop that? How do we test? You know, what are the best practices? What are the methodologies? So I think that will be interesting to see. [00:37:44]

Alessio: Awesome. Yeah. Our final message, you know, you have a big audience of engineers and technical folks, what's something you want everybody to remember to think about to explore? [00:37:55]

Artem: I mean, since being hooked on trying to build a chatbot, you know, for analytics back then, and kind of, you know, looking at what people do right now, I think, yeah, just do that. I mean, it's working right now; with foundational models, it's actually possible now to build all those cool applications. I'm so excited to see, you know, how much has changed in the last six years or so, that we actually can now build smart agents. So I think that's sort of, you know, the takeaway. And yeah, we as humans in general, we really move technology forward, and it's fun to see it, you know, first hand. [00:38:30]

Alessio: Well, thank you so much for coming on Artem. [00:38:32]

Swyx: This was great. [00:38:32]



Get full access to Latent Space at www.latent.space/subscribe
2023-10-26
Länk till avsnitt

The End of Finetuning - with Jeremy Howard of Fast.ai

Thanks to the over 17,000 people who have joined the first AI Engineer Summit! A full recap is coming. Last call to fill out the State of AI Engineering survey! See our Community page for upcoming meetups in SF, Paris and NYC.

This episode had good interest on Twitter and was discussed on the Vanishing Gradients podcast.

Fast.ai's "Practical Deep Learning" courses have been watched by over 6,000,000 people, and the fastai library has over 25,000 stars on GitHub. Jeremy Howard, one of the creators of fast.ai, is now one of the most prominent and respected voices in the machine learning industry; but that wasn't always the case.

Being non-consensus and right

In 2018, Jeremy and Sebastian Ruder published a paper on ULMFiT (Universal Language Model Fine-tuning), a 3-step transfer learning technique for NLP tasks:

The paper demonstrated that pre-trained language models could be fine-tuned on a specific task with a relatively small amount of data to achieve state-of-the-art results. They trained a 24M-parameter model on WikiText-103, which beat most benchmarks.
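
For readers who want to see the three steps in code, here is a rough sketch following fastai's own IMDB example; treat the exact arguments as approximate rather than authoritative. Step 1 (pre-training on WikiText-103) is skipped because fastai ships a pre-trained AWD-LSTM.

```python
# Rough ULMFiT sketch with fastai's text API (arguments approximate; see fastai's IMDB tutorial).
from fastai.text.all import *

path = untar_data(URLs.IMDB)

# Step 1 is implicit: AWD_LSTM comes pre-trained on WikiText-103.
# Step 2: fine-tune that general language model on the target-domain text (IMDB reviews).
dls_lm = TextDataLoaders.from_folder(path, is_lm=True, valid_pct=0.1)
lm = language_model_learner(dls_lm, AWD_LSTM, metrics=[accuracy, Perplexity()])
lm.fine_tune(1, 2e-2)
lm.save_encoder("imdb_encoder")

# Step 3: reuse that encoder in a classifier and fine-tune on the labeled task.
dls_clas = TextDataLoaders.from_folder(path, valid="test", text_vocab=dls_lm.vocab)
clas = text_classifier_learner(dls_clas, AWD_LSTM, metrics=accuracy)
clas.load_encoder("imdb_encoder")
clas.fine_tune(1, 2e-2)
```

The key design point is the saved encoder: the language model's learned representation is carried over into the classifier, so the task head starts from knowledge of the domain rather than from scratch.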

While the paper had great results, the methods behind it weren't taken seriously by the community:

"Everybody hated fine tuning. Everybody hated transfer learning. I literally did tours trying to get people to start doing transfer learning and nobody was interested, particularly after GPT showed such good results with zero shot and few shot learning [...] which I was convinced was not the right direction, but who's going to listen to me, cause as you said, I don't have a PhD, not at a university? I don't have a big set of computers to fine tune huge transformer models."

Five years later, fine-tuning is at the center of most major discussion topics in AI (we covered some like fine tuning vs RAG and small models fine tuning), and we might have gotten here earlier if Jeremy had OpenAI-level access to compute and distribution. At heart, Jeremy has always been "GPU poor":

"I've always been somebody who does not want to build stuff on lots of big computers because most people don't have lots of big computers and I hate creating stuff that most people can't use."

This story is a good reminder of how some of the best ideas are hiding in plain sight; we recently covered RWKV and will continue to highlight the most interesting research that isn't being done in the large labs.

Replacing fine-tuning with continued pre-training

Even though fine-tuning is now mainstream, we still have a lot to learn. The issue of "catastrophic forgetting" and potential solutions have been brought up in many papers: at the fine-tuning stage, the model can forget tasks it previously knew how to solve in favor of new ones.

The other issue is apparent memorization of the dataset even after a single epoch, which Jeremy covered in Can LLMs learn from a single example? but which we still don't have the answer to.

Despite being the creator of ULMFiT, Jeremy still professes that there are a lot of open questions on finetuning:

"So I still don't know how to fine tune language models properly and I haven't found anybody who feels like they do."

He now advocates for "continued pre-training" - maintaining a diversity of data throughout the training process rather than separate pre-training and fine-tuning stages. Mixing instructional data, exercises, code, and other modalities while gradually curating higher quality data can avoid catastrophic forgetting and lead to more robust capabilities (something we covered in Datasets 101).

"Even though I originally created the three-step approach that everybody now does, my view is it's actually wrong and we shouldn't use it... the right way to do this, to fine-tune language models, is to actually throw away the idea of fine-tuning. There's no such thing. There's only continued pre-training.

And pre-training is something where from the very start, you try to include all the kinds of data that you care about, all the kinds of problems that you care about, instructions, exercises, code, general purpose document completion, whatever. And then as you train, you gradually curate that, you know, you gradually make that higher and higher quality and more and more specific to the kinds of tasks you want it to do. But you never throw away any data".

So yeah, that's now my view, is I think ULMFiT is the wrong approach. And that's why we're seeing a lot of this so-called alignment tax... I think it's actually because people are training them wrong.

An example of this phenomenon is Code Llama, a LLaMA 2 model fine-tuned on 500B tokens of code: while the model is much better at code, it's worse on generic tasks that LLaMA 2 knew how to solve well before the fine-tuning.
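
One way to read the "never throw away any data, just keep curating it" idea is as a sampling schedule over data sources. Below is a minimal sketch with invented sources and weights; it is an interpretation of the quote above, not Jeremy's actual training code.

```python
# Minimal sketch of continued pre-training as a gradually shifting data mix:
# every source stays in the mix for the whole run, but later in training the
# sampler leans harder on curated / task-specific data. Numbers are invented.
import random

SOURCES = ["web_text", "code", "instructions", "exercises"]

def mix_weights(progress: float) -> list:
    """progress in [0, 1]: mostly generic text early, more instructions/exercises late,
    but no source ever drops to zero (the point is to avoid catastrophic forgetting)."""
    return [
        0.70 - 0.40 * progress,   # web_text: 0.70 -> 0.30
        0.15,                     # code: constant
        0.10 + 0.25 * progress,   # instructions: 0.10 -> 0.35
        0.05 + 0.15 * progress,   # exercises: 0.05 -> 0.20
    ]

def sample_source(step: int, total_steps: int) -> str:
    return random.choices(SOURCES, weights=mix_weights(step / total_steps), k=1)[0]

# Compare the empirical mix at the start and the end of a 1,000-step "run".
for step in (0, 999):
    picks = [sample_source(step, 1000) for _ in range(10_000)]
    print(step, {s: round(picks.count(s) / len(picks), 2) for s in SOURCES})
```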

In the episode we also dive into all the places where open source model development and research is happening (academia vs Discords - tracked on our Communities list and on our survey), and how Jeremy recommends getting the most out of these diffuse, pseudonymous communities (similar to the Eleuther AI Mafia).

Show Notes

* Jeremy's Background

* FastMail

* Optimal Decisions

* Kaggle

* Enlitic

* fast.ai

* Rachel Thomas

* Practical Deep Learning

* fastai for PyTorch

* nbdev

* fastec2 (the underrated library we describe)

* Can LLMs learn from a single example?

* the Kaggle LLM Science Exam competition, which "challenges participants to answer difficult science-based questions written by a Large Language Model".

* Sebastian Ruder

* Alec Radford

* Sylvain Gugger

* Stephen Merity

* Chris Lattner

* Modular.ai / Mojo

* Jono Whittaker

* Zeiler and Fergus paper

* ULM Fit

* DAWNBench

* Phi-1

* Code Llama

* AlexNet

Timestamps

* [00:00:00] Intros and Jeremy's background

* [00:05:28] Creating ULM Fit - a breakthrough in NLP using transfer learning

* [00:06:32] The rise of GPT and the appeal of few-shot learning over fine-tuning

* [00:10:00] Starting Fast.ai to distribute AI capabilities beyond elite academics

* [00:14:30] How modern LMs like ChatGPT still follow the ULM Fit 3-step approach

* [00:17:23] Meeting with Chris Lattner on Swift for TensorFlow at Google

* [00:20:00] Continued pre-training as a fine-tuning alternative

* [00:22:16] Fast.ai and looking for impact vs profit maximization

* [00:26:39] Using Fast.ai to create an "army" of AI experts to improve their domains

* [00:29:32] Fast.ai's 3 focus areas - research, software, and courses

* [00:38:42] Fine-tuning memorization and training curve "clunks" before each epoch

* [00:46:47] Poor training and fine-tuning practices may be causing alignment failures

* [00:48:38] Academia vs Discords

* [00:53:41] Jeremy's high hopes for Chris Lattner's Mojo and its potential

* [01:05:00] Adding capabilities like SQL generation through quick fine-tuning

* [01:10:12] Rethinking Fast.ai courses for the AI-assisted coding era

* [01:14:53] Rapid model development has created major technical debt

* [01:17:08] Lightning Round

AI Summary (beta)

This is the first episode we're trying this. Here's an overview of the main topics before you dive into the transcript.

* Jeremy's background and philosophies on AI

* Studied philosophy and cognitive science in college

* Focused on ethics and thinking about AI even 30 years ago

* Believes AI should be accessible to more people, not just elite academics/programmers

* Created fast.ai to make deep learning more accessible

* Development of transfer learning and ULMFit

* Idea of transfer learning critical for making deep learning accessible

* ULMFit pioneered transfer learning for NLP

* Proposed training general language models on large corpora then fine-tuning - this became standard practice

* Faced skepticism that this approach would work from NLP community

* Showed state-of-the-art results on text classification soon after trying it

* Current open questions around fine-tuning LLMs

* Models appear to memorize training data extremely quickly (after 1 epoch)

* This may hurt training dynamics and cause catastrophic forgetting

* Unclear how best to fine-tune models to incorporate new information/capabilities

* Need more research on model training dynamics and ideal data mixing

* Exciting new developments

* Mojo and new programming languages like Swift could enable faster model innovation

* Still lots of room for improvements in computer vision-like innovations in transformers

* Small models with fine-tuning may be surprisingly capable for many real-world tasks

* Prompting strategies enable models like GPT-3 to achieve new skills like playing chess at superhuman levels

* LLMs are like computer vision in 2013 - on the cusp of huge new breakthroughs in capabilities

* Access to AI research

* Many key convos happen in private Discord channels and forums

* Becoming part of these communities can provide great learning opportunities

* Being willing to do real work, not just talk about ideas, is key to gaining access

* The future of practical AI

* Coding becoming more accessible to non-programmers through AI assistance

* Pre-requisite programming experience for learning AI may no longer be needed

* Huge open questions remain about how to best train, fine-tune, and prompt LLMs

Transcript

Alessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, Partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. [00:00:21]

Swyx: Hey, and today we have in the remote studio, Jeremy Howard all the way from Australia. Good morning. [00:00:27]

Jeremy: The remote studio, also known as my house. Good morning. Nice to see you. [00:00:32]

Swyx: Nice to see you too. I'm actually very used to seeing you in your mask as a message to people, but today we're mostly audio. But thank you for doing the very important public service of COVID awareness. It was a pleasure. [00:00:46]

Jeremy: It was all very annoying and frustrating and tedious, but somebody had to do it. [00:00:52]

Swyx: Somebody had to do it, especially somebody with your profile. I think it really drives home the message. So we tend to introduce people for them and then ask people to fill in the blanks on the personal side. Something I did not know about you was that you graduated with a BA in philosophy from the University of Melbourne. I assumed you had a PhD. [00:01:14]

Jeremy: No, I mean, I barely got through my BA because I was working 80 to 100 hour weeks at McKinsey and Company from 19 years old onwards. So I actually didn't attend any lectures in second and third year university. [00:01:35]

Swyx: Well, I guess you didn't need it or you're very sort of self-driven and self-motivated. [00:01:39]

Jeremy: I took two weeks off before each exam period when I was working at McKinsey. And then, I mean, I can't believe I got away with this in hindsight, I would go to all my professors and say, oh, I was meant to be in your class this semester and I didn't quite turn up. Were there any assignments I was meant to have done, whatever. I can't believe all of them let me basically have it. They basically always would say like, okay, well, if you can have this written by tomorrow, I'll accept it. So yeah, stressful way to get through university, but. [00:02:12]

Swyx: Well, it shows that, I guess, you min-maxed the opportunities. That definitely was a precursor. [00:02:18]

Jeremy: I mean, funnily, like in as much as I, you know, in philosophy, the things I found interesting and focused on in the little bit of time I did spend on it was ethics and cognitive science. And it's kind of really amazing that it's now come back around and those are actually genuinely useful things to know about, which I never thought would happen. [00:02:38]

Swyx: A lot of, yeah, a lot of relevant conversations there. So you were a consultant for a while and then in the magical month of June 1999, you founded both Optimal Decisions and FastMail, which I also briefly used. So thank you for that. [00:02:53]

Jeremy: Oh, good for you. Yeah. Cause I had read the statistics, which is that like 90% or something of small businesses fail. So I thought if I start two businesses, I have a higher chance. In hindsight, I was thinking of it as some kind of stochastic thing I didn't have control over, but it's a bit odd, but anyway. [00:03:10]

Swyx: And then you were president and chief scientist at Kaggle, which obviously is the sort of competition platform of machine learning. And then Enlitic, where you were working on using deep learning to improve medical diagnostics and clinical decisions. Yeah. [00:03:28]

Jeremy: It was actually the first company to use deep learning in medicine, so I kind of founded the field. [00:03:33]

Swyx: And even now that's still like a pretty early phase. And I actually heard you on your new podcast with Tanish, where you went very, very deep into the stuff, the kind of work that he's doing, such a young prodigy at his age. [00:03:47]

Jeremy: Maybe he's too old to be called a prodigy now, ex-prodigy. No, no. [00:03:51]

Swyx: I think he still counts. And anyway, just to round out the bio, you have a lot more other credentials, obviously, but most recently you started Fast.ai, which is still, I guess, your primary identity with Rachel Thomas. So welcome. [00:04:05]

Jeremy: Yep. Thanks to my wife. Thank you. [00:04:06]

Swyx: Yeah. Doing a lot of public service there with getting people involved in AI, and I can't imagine a better way to describe it than fast, fast.ai. You teach people from nothing to stable diffusion in seven weeks or something, and that's amazing. Yeah, yeah. [00:04:22]

Jeremy: I mean, it's funny, you know, when we started that, what was that, like 2016 or something, the idea that deep learning was something that you could make more accessible was generally considered stupid. Everybody knew that deep learning was a thing that you got a math or a computer science PhD, you know, there was one of five labs that could give you the appropriate skills and that you would join, yeah, basically from one of those labs, you might be able to write some papers. So yeah, the idea that normal people could use that technology to do good work was considered kind of ridiculous when we started it. And we weren't sure if it was possible either, but we kind of felt like we had to give it a go because the alternative was we were pretty sure that deep learning was on its way to becoming, you know, the most or one of the most, you know, important technologies in human history. And if the only people that could use it were a handful of computer science PhDs, that seemed like A, a big waste and B, kind of dangerous. [00:05:28]

Swyx: Yeah. [00:05:29]

Alessio: And, you know, well, I just wanted to note one thing on your bio, that at Kaggle, you were also the top-ranked participant in both 2010 and 2011. So sometimes you see a lot of founders running companies that are not really in touch with the problem, but you were clearly building something that you knew a lot about, which is awesome. Talking about deep learning, you created and published a paper on ULMFiT, which was kind of the predecessor to multitask learning and a lot of the groundwork that then went into Transformers. I've read back on the paper, and you trained this model, AWD-LSTM, which, I did the math, and it was like 24 to 33 million parameters, depending on what training data set you use. Today that's kind of like not even small, it's like super small. What were some of the contrarian takes that you had at the time, and maybe set the stage a little bit for the rest of the audience on what was the state of the art, so to speak, at the time and what people were working towards? [00:06:32]

Jeremy: Yeah, the whole thing was a contrarian take, you know. So okay, so we started Fast.ai, my wife and I, and we thought, yeah, so we're trying to think, okay, how do we make it more accessible? So when we started thinking about it, it was probably 2015 and then 2016, we started doing something about it. Why is it inaccessible? Okay, well, A, no one knows how to do it other than a small number of people. And then when we asked those few people, well, how do you actually get good results? They would say like, oh, it's like, you know, a box of tricks that aren't published. So you have to join one of the labs and learn the tricks. So a bunch of unpublished tricks, not much software around, but thankfully there was Theano and wrappers, and particularly Lasagne, the wrapper, but yeah, not much software around, not much in the way of data sets, you know, very hard to get started in terms of the compute. Like how do you get that set up? So yeah, no, everything was kind of inaccessible. And you know, as we started looking into it, we had a key insight, which was like, you know what, most of the compute and data for image recognition, for example, we don't need to do it. You know, there's this thing which nobody knows about, nobody talks about called transfer learning, where you take somebody else's model, where they already figured out like how to detect edges and gradients and corners and text and whatever else, and then you can fine tune it to do the thing you want to do. And we thought that's the key. That's the key to becoming more accessible in terms of compute and data requirements. So when we started Fast.ai, we focused from day one on transfer learning. Lesson one, in fact, was transfer learning, literally lesson one, something not normally even mentioned in, I mean, there wasn't much in the way of courses, you know, the courses out there were PhD programs that had happened to have recorded their lessons and they would rarely mention it at all. We wanted to show how to do four things that seemed really useful. You know, work with vision, work with tables of data, work with kind of recommendation systems and collaborative filtering and work with text, because we felt like those four kind of modalities covered a lot of the stuff that, you know, are useful in real life. And no one was doing anything much useful with text. Everybody was talking about word2vec, you know, like king plus queen minus woman and blah, blah, blah. It was like cool experiments, but nobody's doing anything like useful with it. NLP was all like lemmatization and stop words and topic models and bigrams and SVMs. And it was really academic and not practical. But I mean, to be honest, I've been thinking about this crazy idea for nearly 30 years since I had done cognitive science at university, where we talked a lot about the Searle Chinese room experiment. This idea of like, what if there was somebody that could kind of like, knew all of the symbolic manipulations required to answer questions in Chinese, but they didn't speak Chinese and they were kind of inside a room with no other way to talk to the outside world other than taking in slips of paper with Chinese written on them and then they do all their rules and then they pass back a piece of paper with Chinese back. And this room with a person in it is actually fantastically good at answering any question you give them written in Chinese. You know, do they understand Chinese? And is this, you know, something that's intelligently working with Chinese? 
Ever since that time, I'd say the most thought, to me, the most thoughtful and compelling philosophical response is yes. You know, intuitively it feels like no, because that's just because we can't imagine such a large kind of system. But you know, if it looks like a duck and acts like a duck, it's a duck, you know, or to all intents and purposes. And so I always kind of thought, you know, so this is basically a kind of analysis of the limits of text. And I kind of felt like, yeah, if something could ingest enough text and could use the patterns it saw to then generate text in response to text, it could appear to be intelligent, you know. And whether that means it is intelligent or not is a different discussion and not one I find very interesting. Yeah. And then when I came across neural nets when I was about 20, you know, what I learned about the universal approximation theorem and stuff, and I started thinking like, oh, I wonder if like a neural net could ever get big enough and take in enough data to be a Chinese room experiment. You know, with that background and this kind of like interest in transfer learning, you know, I'd been thinking about this thing for kind of 30 years and I thought like, oh, I wonder if we're there yet, you know, because we have a lot of text. Like I can literally download Wikipedia, which is a lot of text. And I thought, you know, how would something learn to kind of answer questions or, you know, respond to text? And I thought, well, what if we used a language model? So language models are already a thing, you know, they were not a popular or well-known thing, but they were a thing. But language models exist to this idea that you could train a model to fill in the gaps. Or actually in those days it wasn't fill in the gaps, it was finish a string. And in fact, Andrej Karpathy did his fantastic RNN demonstration from this at a similar time where he showed like you can have it ingest Shakespeare and it will generate something that looks a bit like Shakespeare. I thought, okay, so if I do this at a much bigger scale, using all of Wikipedia, what would it need to be able to do to finish a sentence in Wikipedia effectively, to do it quite accurately quite often? I thought, geez, it would actually have to know a lot about the world, you know, it'd have to know that there is a world and that there are objects and that objects relate to each other through time and cause each other to react in ways and that causes proceed effects and that, you know, when there are animals and there are people and that people can be in certain positions during certain timeframes and then you could, you know, all that together, you can then finish a sentence like this was signed into law in 2016 by US President X and it would fill in the gap, you know. So that's why I tried to create what in those days was considered a big language model trained on the entirety on Wikipedia, which is that was, you know, a bit unheard of. And my interest was not in, you know, just having a language model. My interest was in like, what latent capabilities would such a system have that would allow it to finish those kind of sentences? Because I was pretty sure, based on our work with transfer learning and vision, that I could then suck out those latent capabilities by transfer learning, you know, by fine-tuning it on a task data set or whatever. So we generated this three-step system. So step one was train a language model on a big corpus. Step two was fine-tune a language model on a more curated corpus. 
And step three was further fine-tune that model on a task. And of course, that's what everybody still does today, right? That's what ChatGPT is. And so the first time I tried it within hours, I had a new state-of-the-art academic result on IMDB. And I was like, holy s**t, it does work. And so you asked, to what degree was this kind of like pushing against the established wisdom? You know, every way. Like the reason it took me so long to try it was because I asked all my friends in NLP if this could work. And everybody said, no, it definitely won't work. It wasn't like, oh, maybe. Everybody was like, it definitely won't work. NLP is much more complicated than vision. Language is a much more vastly complicated domain. You know, and you've got problems like the grounding problem. We know from like philosophy and theory of mind that it's actually impossible for it to work. So yeah, so don't waste your time. [00:15:10]

Alessio: Jeremy, had people not tried because it was like too complicated to actually get the data and like set up the training? Or like, were people just lazy and kind of like, hey, this is just not going to work? [00:15:20]

Jeremy: No, everybody wasn't lazy. So like, so the person I thought at that time who, you know, there were two people I thought at that time, actually, who were the strongest at language models were Stephen Merity and Alec Radford. And at the time I didn't know Alec, but after I'd released ULMFiT and he had released GPT, I organized a chat for both of us with Cade Metz at the New York Times. And Cade Metz asked, sorry, and Alec answered this question for Cade. And Cade was like, so how did, you know, GPT come about? And he said, well, I was pretty sure that pre-training on a general large corpus wouldn't work. So I hadn't tried it. And then I read ULMFiT and it turns out it did work. And so I did it, you know, bigger and it worked even better. And similar with, with Stephen, you know, I asked Stephen Merity, like, why don't we just, you know, take your AWD-LSTM and train it on all of Wikipedia and fine tune it? And he's kind of like, well, I don't think that's going to really fly. Like two years before, I did a very popular talk at KDD, the conference where everybody in NLP was in the audience. I recognized half the faces, you know, and I told them all this: I'm sure transfer learning is the key. I'm sure ImageNet, you know, is going to be an NLP thing as well. And, you know, everybody was interested and people asked me questions afterwards, but yeah, nobody followed up because everybody knew that it didn't work. I mean, even like, so we were scooped a little bit by Dai and Le, Quoc Le at Google. They had, they had, I didn't even realize this, which is a bit embarrassing, they had already done a large language model and fine tuned it. But again, they didn't create a general purpose, large language model on a general purpose corpus. They only ever tested a domain specific corpus. And I haven't spoken to Quoc actually about that, but I assume that the reason was the same. It probably just didn't occur to them that the general approach could work. So maybe it was that kind of 30 years of mulling over the Searle Chinese room experiment that had convinced me that it probably would work. I don't know. Yeah. [00:17:48]

Alessio: Interesting. I just dug up Alec's announcement tweet from 2018. He said: inspired by CoVe, ELMo, and ULMFiT, we show that a single transformer language model can be fine-tuned to a wide variety of tasks. It's interesting because, you know, today people think of OpenAI as the leader, kind of like the research lab pushing forward the field. What was that like at the time? You know, going back five years, people think of it as an overnight success, but obviously it took a while. [00:18:16]

Swyx: Yeah. Yeah. [00:18:17]

Jeremy: No, I mean, absolutely. And I'll say like, you know, it's interesting that it mentioned ELMo because in some ways that was kind of diametrically opposed to, to ULMFiT. You know, there was a lot of activity at the same time as ULMFiT's release. So before it, Bryan McCann, I think at Salesforce, had come out with this neat model that did a kind of multitask learning, but again, they didn't create a general fine-tuned language model first. There was ELMo, um, which I think was released, you know, actually quite a few months after the first ULMFiT example, I think. Um, but yeah, there was a bit of this stuff going on. And the problem was everybody was doing, and particularly after GPT came out, then everybody wanted to focus on zero shot and few shot learning. You know, everybody hated fine tuning. Everybody hated transfer learning. And like, I literally did tours trying to get people to start doing transfer learning and people, you know, nobody was interested, particularly after GPT showed such good results with zero shot and few shot learning. And so I actually feel like we kind of went backwards for years and, to be honest, I mean, I'm a bit sad about this now, but I kind of got so disappointed and dissuaded by it: it felt like these bigger labs, much bigger labs (you know, Fast.ai had only ever been just me and Rachel) were getting all of this attention for an approach I thought was the wrong way to do it. You know, I was convinced was the wrong way to do it. And so, yeah, for years people were really focused on getting better at zero shot and few shot and it wasn't until, you know, this key idea of like, well, let's take the ULMFiT approach, but for step two, rather than fine tuning on a kind of a domain corpus, let's fine tune on an instruction corpus. And then in step three, rather than fine tuning on a reasonably specific task classification, let's fine tune on an RLHF task. And so that was really, that was really key, you know, so I was kind of out of the NLP field for a few years there because yeah, it just felt like, I don't know, pushing uphill against this vast tide, which I was convinced was not the right direction, but who's going to listen to me, you know, cause I, as you said, I don't have a PhD, not at a university, or at least I wasn't then. I don't have a big set of computers to fine tune huge transformer models. So yeah, it was definitely difficult. It's always been hard. You know, it's always been hard. Like I've always been somebody who does not want to build stuff on lots of big computers because most people don't have lots of big computers and I hate creating stuff that most people can't use, you know, and also stuff that's created on lots of big computers has always been like much more media friendly. So like, it might seem like a recent thing, but actually throughout my 30 years in data science, the attention's always been on, you know, the big iron results. So when I first started, everybody was talking about data warehouses and it was all about Teradata and it'd be like, oh, this big bank has this huge room full of computers and they have like terabytes of data available, you know, at the press of a button. And yeah, that's always what people want to talk about, what people want to write about. And then of course, students coming out of their PhDs and stuff, that's where they want to go work because that's where they read about. 
And to me, it's a huge distraction, you know, because like I say, most people don't have unlimited compute and I want to help most people, not the small subset of the most well-off people. [00:22:16]

Alessio: That's awesome. And it's great to hear, you do such a great job educating that a lot of times you're not telling your own story, you know? So I love this conversation. And the other thing before we jump into Fast.AI, actually, a lot of people that I know, they run across a new architecture and whatnot, they're like, I got to start a company and raise a bunch of money and do all of this stuff. And say, you were like, I want everybody to have access to this. Why was that the case for you? Was it because you already had a successful venture in like FastMail and you were more interested in that? What was the reasoning? [00:22:52]

Jeremy: It's a really good question. So I guess the answer is yes, that's the reason why. So when I was a teenager, I thought it would be really cool to like have my own company. You know, I didn't know the word startup. I didn't know the word entrepreneur. I didn't know the word VC. And I didn't really know what any of those things were really until after we started Kaggle, to be honest. Even the ones I started, which are what we now call startups, I just thought they were just small businesses. You know, they were just companies. So yeah, so those two companies were FastMail and Optimal Decisions. FastMail was the first kind of synchronized email provider for non-businesses. So something where you can get your same email at home, on your laptop, at work, on your phone, whatever. And then Optimal Decisions invented a new approach to insurance pricing. Something called profit-optimized insurance pricing. So I sold both of those companies, you know, after 10 years. And at that point, I had achieved the thing that as a teenager I had wanted to do. You know, it took a lot longer than it should have because I spent way longer in management consulting than I should have because I got caught up in that stupid rat race. But, you know, eventually I got there and I remember my mom saying to me, you must be so proud. You know, because she remembered my dream. She's like, you've done it. And I kind of reflected and I was like, I'm not proud at all. You know, like people quite liked FastMail. You know, it's quite nice to have synchronized email. It probably would have happened anyway. Yeah, I'm certainly not proud that I've helped some insurance companies suck more money out of their customers. Yeah, no, I'm not proud. You know, it's actually, I haven't really helped the world very much. You know, maybe in the insurance case I've made it a little bit worse. I don't know. So, yeah, I was determined to not waste more years of my life doing things, working hard to do things which I could not be reasonably sure would have a lot of value. So, you know, I took some time off. I wasn't sure if I'd ever work again, actually. I didn't particularly want to, because it felt like, yeah, it felt like such a disappointment. And, but, you know, and I didn't need to. I had enough money. Like, I wasn't super rich, but I had enough money. I didn't need to work. And I certainly recognized that amongst the other people I knew who had enough money that they didn't need to work, they all worked ridiculously hard, you know, and constantly put themselves in extremely stressful situations. And I thought, I don't want to be one of those idiots who's tied to, you know, buying a bigger plane than the next guy or whatever. You know, Kaggle came along and I mainly kind of did that just because it was fun and interesting to hang out with interesting people. But, you know, with fast.ai in particular, you know, Rachel and I had a very explicit, you know, long series of conversations over a long period of time about like, well, how can we be the most helpful to society as a whole, and particularly to those people who maybe need more help, you know? And so we definitely saw the world going in a potentially pretty dystopian direction if the world's most powerful technology was controlled by a small group of elites. So we thought, yeah, we should focus on trying to help that not happen. You know, sadly, it looks like it still is likely to happen. But I mean, I feel like we've helped make it a little bit less likely. So we've done our bit. [00:26:39]

Swyx: You've shown that it's possible. And I think your constant advocacy, your courses, your research that you publish, you know, just the other day you published a finding on, you know, learning that I think is still something that people are still talking about quite a lot. I think that that is the origin story of a lot of people who are going to be, you know, little Jeremy Howards, furthering your mission with, you know, you don't have to do everything by yourself is what I'm saying. No, definitely. Definitely. [00:27:10]

Jeremy: You know, that was a big takeaway from, like, Enlitic. At Enlitic it definitely felt like we had to do everything ourselves. And I kind of, I wanted to solve medicine. I'd say, yeah, okay, solving medicine is actually quite difficult. And I can't do it on my own. And there's a lot of other things I'd like to solve, and I can't do those either. So that was definitely the other piece was like, yeah, you know, can we create an army of passionate domain experts who can change their little part of the world? And that's definitely happened. Like I find nowadays, at least half the time, probably quite a bit more, that I get in contact with somebody who's done really interesting work in some domain, most of the time they say, yeah, I got my start with fast.ai. So it's definitely, I can see that. And I also know from talking to folks at places like Amazon and Adobe and stuff, which, you know, there's lots of alumni there. And they say, oh my God, I got here. And like half of the people are fast.ai alumni. So it's fantastic. [00:28:13]

Swyx: Yeah. [00:28:14]

Jeremy: Actually, Andrej Karpathy grabbed me when I saw him at NeurIPS a few years ago. And he was like, I have to tell you, thanks for the fast.ai courses. When people come to Tesla and they need to know more about deep learning, we always send them to your course. And the OpenAI Scholars Program was doing the same thing. So it's kind of like, yeah, it's had a surprising impact, you know, and that's just one of, like, the three things we do, the course, you know. [00:28:40]

Swyx: Yes. [00:28:40]

Jeremy: And it's only ever been at most two people, either me and Rachel or me and Sylvain. Nowadays, it's just me. So yeah, I think it shows you don't necessarily need a huge amount of money and a huge team of people to make an impact. [00:28:56]

Swyx: Yeah. So just to reintroduce fast.ai for people who may not have dived into it much, there is the courses that you do. There is the library that is very well loved. And I kind of think of it as a nicer layer on top of PyTorch that people should start with by default and use it as the basis for a lot of your courses. And then you have like NBDev, which I don't know, is that the third one? [00:29:27]

Jeremy: Oh, so the three areas were research, software, and courses. [00:29:32]

Swyx: Oh, sorry. [00:29:32]

Jeremy: So then in software, you know, fast.ai is the main thing, but NBDev is not far behind. But then there's also things like FastCore, GHAPI, I mean, dozens of open source projects that I've created and some of them have been pretty popular and some of them are still a little bit hidden, actually. Some of them I should try to do a better job of telling people about. [00:30:01]

Swyx: What are you thinking about? Yeah, what else is on the way? Oh, I don't know, just like little things. [00:30:04]

Jeremy: Like, for example, for working with EC2 and AWS, I created a FastEC2 library, which I think is like way more convenient and nice to use than anything else out there. And it's literally got a whole autocomplete, dynamic autocomplete that works both on the command line and in notebooks that'll like auto-complete your instance names and everything like that. You know, just little things like that. I try to make like, when I work with some domain, I try to make it like, I want to make it as enjoyable as possible for me to do that. So I always try to kind of like, like with GHAPI, for example, I think that GitHub API is incredibly powerful, but I didn't find it good to work with because I didn't particularly like the libraries that are out there. So like GHAPI, like FastEC2, it like autocompletes both at the command line or in a notebook or whatever, like literally the entire GitHub API. The entire thing is like, I think it's like less than 100K of code because it actually, as far as I know, the only one that grabs it directly from the official open API spec that GitHub produces. And like if you're in GitHub and you just type an API, you know, autocomplete API method and hit enter, it prints out the docs with brief docs and then gives you a link to the actual documentation page. You know, GitHub Actions, I can write now in Python, which is just so much easier than writing them in TypeScript and stuff. So, you know, just little things like that. [00:31:40]
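To give a flavour of what Jeremy is describing, here is a rough sketch of the kind of workflow GHAPI enables. It is written from memory of the library's docs rather than copied from them, so treat the exact parameter names as assumptions and check https://ghapi.fast.ai; the owner, repo, and token values are placeholders.

```python
# Rough sketch of a GHAPI workflow; names are from memory of the ghapi docs,
# so exact signatures are assumptions -- consult https://ghapi.fast.ai.
import os
from ghapi.all import GhApi

# Placeholders: substitute your own repo and set a GITHUB_TOKEN environment variable.
api = GhApi(owner="fastai", repo="ghapi", token=os.environ.get("GITHUB_TOKEN"))

# Every operation in GitHub's OpenAPI spec is exposed as api.<group>.<operation>,
# which is what makes the tab-completion Jeremy describes possible.
repo_info = api.repos.get()                        # uses the default owner/repo above
open_issues = api.issues.list_for_repo(state="open")

for issue in open_issues:
    print(issue.number, issue.title)
```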

Swyx: I wish that were an approach more developers took, publishing some of their work along the way. You described the third arm of fast.ai as research. It's not something I see often. Obviously, you do do some research. So how do you run your research? What are your research interests? [00:31:59]

Jeremy: Yeah, so research is what I spend the vast majority of my time on. And the artifacts that come out of that are largely software and courses. You know, so to me, the main artifact shouldn't be papers because papers are things read by a small exclusive group of people. You know, to me, the main artifacts should be like something teaching people, here's how to use this insight and here's software you can use that builds it in. So I think I've only ever done three first-author papers in my life, you know, and none of those are ones I wanted to do. You know, they were all ones that, like, so one was ULMFiT, where Sebastian Ruder reached out to me after seeing the course and said, like, you have to publish this as a paper, you know. And he said, I'll write it. He said, I want to write it because if I do, I can put it on my PhD and that would be great. And it's like, okay, well, I want to help you with your PhD. And that sounds great. So like, you know, one was the masks paper, which just had to exist and nobody else was writing it. And then the third was the Fast.ai library paper, which again, somebody reached out and said, please, please write this. We will waive the fee for the journal and everything and actually help you get it through publishing and stuff. So yeah, other than those, I've never written a first author paper. So the research is like, well, so for example, you know, DAWNBench was a competition which Stanford ran a few years ago. It was kind of the first big competition of like, who can train neural nets the fastest rather than the most accurate. And specifically it was who can train ImageNet the fastest. And again, this was like one of these things where it was created by necessity. So Google had just released their TPUs. And so I heard from my friends at Google that they had put together this big team to smash DAWNBench so that they could prove to people that they had to use Google Cloud and use their TPUs and show how good their TPUs were. And we kind of thought, oh s**t, this would be a disaster if they do that, because then everybody's going to be like, oh, deep learning is not accessible. [00:34:20]

Swyx: You know, to actually be good at it, [00:34:21]

Jeremy: you have to be Google and you have to use special silicon. And so, you know, we only found out about this 10 days before the competition finished. But, you know, we basically got together an emergency bunch of our students and Rachel and I and sat for the next 10 days and just tried to crunch through and try to use all of our best ideas that had come from our research. And so particularly progressive resizing, just basically train mainly on small things, train on non-square things, you know, stuff like that. And so, yeah, we ended up winning, thank God. And so, you know, we turned it around from being like, like, oh s**t, you know, this is going to show that you have to be Google and have TPUs to being like, oh my God, even the little guy can do deep learning. So that's an example of the kind of like research artifacts we do. And yeah, so all of my research is always, how do we do more with less, you know? So how do we get better results with less data, with less compute, with less complexity, with less education, you know, stuff like that. So ULMFiT's obviously a good example of that. [00:35:37]
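Progressive resizing is easy to sketch outside of fastai too. The snippet below is a minimal, generic PyTorch illustration of the idea (not the actual DAWNBench code): do most of the training at a small image size, then continue briefly at full resolution. FakeData stands in for a real dataset so the sketch stays self-contained.

```python
# Minimal progressive-resizing sketch (not the DAWNBench code): train at a small
# image size first, then continue training the same model at a larger size.
import torch
from torch import nn, optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

def make_loader(image_size, n=512, batch_size=64):
    tfm = transforms.Compose([transforms.Resize((image_size, image_size)),
                              transforms.ToTensor()])
    ds = datasets.FakeData(size=n, image_size=(3, 256, 256),
                           num_classes=10, transform=tfm)
    return DataLoader(ds, batch_size=batch_size, shuffle=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.resnet18(num_classes=10).to(device)
opt = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

def train(loader, epochs):
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()

train(make_loader(128), epochs=3)   # most of the training at low resolution (cheap)
train(make_loader(224), epochs=1)   # brief continuation at full resolution
```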

Swyx: And most recently you published Can LLMs learn from a single example? Maybe could you tell the story a little bit behind that? And maybe that goes a little bit into the low-resource learning literature. [00:35:52]

Jeremy: Yeah, yeah. So me and my friend, Johno Whitaker, basically had been playing around with this fun Kaggle competition, which is actually still running as we speak, which is, can you create a model which can answer multiple choice questions about anything that's in Wikipedia? And the thing that makes it interesting is that your model has to run on Kaggle within nine hours. And Kaggle's very, very limited. So you've only got 14 gig RAM, only two CPUs, and a small, very old GPU. So this is cool, you know, if you can do well at this, then this is a good example of like, oh, you can do more with less. So yeah, Johno and I were playing around with fine tuning, of course, transfer learning, pre-trained language models. And we saw this, like, so we always, you know, plot our losses as we go. So here's another thing we created. Actually, Sylvain Gugger, when he worked with us, created it, called fastprogress, which is kind of like tqdm, but we think a lot better. So we look at our fastprogress curves, and they kind of go down, down, down, down, down, down, down, a little bit, little bit, little bit. And then suddenly go clunk, and they drop. And then down, down, down, down, down a little bit, and then suddenly clunk, they drop. We're like, what the hell? These clunks are occurring at the end of each epoch. So normally in deep learning, this would be, this is, you know, I've seen this before. It's always been a bug. It's always turned out that like, oh, we accidentally forgot to turn on eval mode during the validation set, so it was actually learning then, or, oh, we accidentally were calculating moving average statistics throughout the epoch, so it's a running moving average or whatever. And so we were using Hugging Face Trainer. So, you know, I did not give my friends at Hugging Face the benefit of the doubt. I thought, oh, they've fucked up Hugging Face Trainer, you know, idiots. Well, we'll use the fast.ai trainer instead. So we switched over to Learner. We still saw the clunks and, you know, that's, yeah, it shouldn't really happen because semantically speaking, the epoch boundary isn't, like, a thing, you know, like nothing happens. Well, nothing's meant to happen when you go from ending one epoch to starting the next one. So there shouldn't be a clunk, you know. So I kind of asked around on the open source discords. That's like, what's going on here? And everybody was just like, oh, that's just what these training curves look like. Those all look like that. Don't worry about it. And I was like, oh, are you all using Trainer? Yes. Oh, well, there must be some bug with Trainer. And I was like, well, we also saw it in Learner [00:38:42]

Swyx: and somebody else is like, [00:38:42]

Jeremy: no, we've got our own Trainer. We get it as well. They're just like, don't worry about it. It's just something we see. It's just normal. [00:38:48]

Swyx: I can't do that. [00:38:49]

Jeremy: I can't just be like, here's something that's like in the previous 30 years of neural networks, nobody ever saw it. And now suddenly we see it. [00:38:57]

Swyx: So don't worry about it. [00:38:59]

Jeremy: I just, I have to know why. [00:39:01]

Swyx: Can I clarify? This is, was everyone that you're talking to, were they all seeing it for the same dataset or in different datasets? [00:39:08]

Jeremy: Different datasets, different Trainers. They're just like, no, this is just what it looks like when you fine tune language models. Don't worry about it. You know, I hadn't seen it before, but I'd been kind of like, as I say, I, you know, I kept working on them for a couple of years after ULMFiT. And then I kind of moved on to other things, partly out of frustration. So I hadn't been fine tuning, you know, I mean, Llama's only been out for a few months, right? But I wasn't one of those people who jumped straight into it, you know? So I was relatively new to the kind of Llama fine tuning world, whereas these guys had been, you know, doing it since day one. [00:39:49]

Swyx: It was only a few months ago, [00:39:51]

Jeremy: but it's still quite a bit of time. So, so yeah, they're just like, no, this is all what we see. [00:39:56]

Swyx: Don't worry about it. [00:39:56]

Jeremy: So yeah, I, I've got a very kind of like, I don't know, I've just got this brain where I have to know why things are. And so I kind of, I ask people like, well, why, why do you think it's happening? And they'd be like, oh, it's pretty obvious, cause it's, like, memorized the data set. It's just like, that can't be right. It's only seen it once. Like, look at this, the loss has dropped by 0.3, which is like, basically it knows the answer. And like, no, no, it's just, it is, it's just memorized the data set. So yeah. So look, Johno and I did not discover this and Johno and I did not come up with the hypothesis. You know, I guess we were just the ones, I guess, who had been around for long enough to recognize that like, this, this isn't how it's meant to work. And so we, we, you know, and so we went back and like, okay, let's just run some experiments, you know, cause nobody seems to have actually published anything about this. [00:40:51]

Well, not quite true.

Some people had published things, but nobody ever actually stepped back and said like, what the hell, you know, how can this be possible? Is it possible? Is this what's happening? And so, yeah, we created a bunch of experiments where we basically predicted ahead of time. It's like, okay, if this hypothesis is correct, that it's memorized in the training set, then we ought to see blah, under conditions, blah, but not under these conditions. And so we ran a bunch of experiments and all of them supported the hypothesis that it was memorizing the data set in a single thing at once. And it's a pretty big data set, you know, which in hindsight, it's not totally surprising because the theory, remember, of the ULMFiT theory was like, well, it's kind of creating all these latent capabilities to make it easier for it to predict the next token. So if it's got all this kind of latent capability, it ought to also be really good at compressing new tokens because it can immediately recognize it as like, oh, that's just a version of this. So it's not so crazy, you know, but it is, it requires us to rethink everything because like, and nobody knows like, okay, so how do we fine tune these things? Because like, it doesn't even matter. Like maybe it's fine. Like maybe it's fine that it's memorized the data set after one go and you do a second go and okay, the validation loss is terrible because it's now really overconfident. [00:42:20]

Swyx: That's fine. [00:42:22]

Jeremy: Don't, you know, don't, I keep telling people, don't track validation loss, track validation accuracy because at least that will still be useful. Just another thing that's got lost since ULMFiT, nobody tracks accuracy of language models anymore. But you know, it'll still keep learning and it does, it does keep improving. But is it worse? You know, like, is it like, now that it's kind of memorized it, it's probably getting a less strong signal, you know, I don't know. So I still don't know how to fine tune language models properly and I haven't found anybody who feels like they do, like nobody really knows whether this memorization thing is, it's probably a feature in some ways. It's probably some things that you can do usefully with it. It's probably, yeah, I have a feeling it's messing up training dynamics as well. [00:43:13]
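Tracking token-level accuracy alongside (or instead of) loss is only a few extra lines in a normal evaluation loop. A minimal sketch, assuming the usual causal-LM convention of shifting logits against labels and marking ignored positions (padding, prompt tokens) with -100:

```python
import torch

def token_accuracy(logits, labels, ignore_index=-100):
    """Fraction of next-token predictions that match the label.

    logits: (batch, seq_len, vocab) from a causal LM
    labels: (batch, seq_len) token ids, with ignore_index marking positions to skip
    """
    # Standard causal-LM shift: position t predicts token t+1.
    preds = logits[:, :-1, :].argmax(dim=-1)
    targets = labels[:, 1:]
    mask = targets != ignore_index
    correct = (preds == targets) & mask
    return correct.sum().float() / mask.sum().clamp(min=1)

# Toy check with random tensors.
logits = torch.randn(2, 8, 100)
labels = torch.randint(0, 100, (2, 8))
print(token_accuracy(logits, labels))
```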

Swyx: And does it come at the cost of catastrophic forgetting as well, right? Like, which is the other side of the coin. [00:43:18]

Jeremy: It does to some extent, like we know it does, like look at Code Llama, for example. So Code Llama was a, I think it was like a 500 billion token fine tuning of Llama 2 using code, and also prose about code, that Meta did. And honestly, they kind of blew it because Code Llama is good at coding, but it's bad at everything else, you know, and it used to be good. Yeah, I was pretty sure it was like, before they released it, me and lots of people in the open source discords were like, oh my God, you know, we know this is coming, Yann LeCun's saying it's coming. I hope they kept at least like 50% non-code data because otherwise it's going to forget everything else. And they didn't, only like 0.3% of their epochs were non-code data. So it did, it forgot everything else. So now it's good at code and it's bad at everything else. So we definitely have catastrophic forgetting. It's fixable, just somebody has to do, you know, somebody has to spend their time training a model on a good mix of data. Like, so, okay, so here's the thing. Even though I originally created the three-step approach that everybody now does, my view is it's actually wrong and we shouldn't use it. [00:44:36]

Jeremy: And that's because people are using it in a way different to why I created it. You know, I created it thinking the task-specific models would be more specific. You know, it's like, oh, this is like a sentiment classifier as an example of a task, you know, but the tasks now are like a, you know, RLHF, which is basically like answer questions that make people feel happy about your answer. So that's a much more general task and it's a really cool approach. And so we see, for example, RLHF also breaks models. Like, you know, with GPT-4, once it was RLHF'd, we know from kind of the work that Microsoft did, you know, the earlier, less aligned version was better. And these are all kind of examples of catastrophic forgetting. And so to me, the right way to fine-tune language models is to actually throw away the idea of fine-tuning. There's no such thing. There's only continued pre-training. And pre-training is something where from the very start, you try to include all the kinds of data that you care about, all the kinds of problems that you care about, instructions, exercises, code, general purpose document completion, whatever. And then as you train, you gradually curate that, you know, you gradually make that higher and higher quality and more and more specific to the kinds of tasks you want it to do. But you never throw away any data. You always keep all of the data types there in reasonably high quantities. You know, maybe the quality filter, you stop training on low quality data, because that's probably fine to forget how to write badly, maybe. So yeah, that's now my view, is I think ULMFiT is the wrong approach. And that's why we're seeing a lot of this, you know, so-called alignment tax and this view of like, oh, a model can't both code and do other things. And, you know, I think it's actually because people are training them wrong. [00:46:47]
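One way to read the "never throw away any data types" point in code is as weighted sampling over all your sources, where only the weights change over the course of training. A toy sketch, with made-up source names and proportions purely for illustration:

```python
import random

def mixed_stream(sources, weights, seed=0):
    """Yield examples from several infinite data sources in given proportions.

    sources: dict name -> iterator of examples
    weights: dict name -> relative sampling weight (need not sum to 1)
    """
    rng = random.Random(seed)
    names = list(sources)
    while True:
        name = rng.choices(names, weights=[weights[n] for n in names])[0]
        yield name, next(sources[name])

def repeat(xs):
    while True:
        yield from xs

# Hypothetical sources -- in practice these would be tokenized corpora.
sources = {
    "web":          repeat(["web document"]),
    "code":         repeat(["code file"]),
    "instructions": repeat(["instruction/response pair"]),
}

# Early in training: mostly general text. Later: up-weight instructions,
# but keep code and web text in the mix so nothing gets forgotten.
early = {"web": 0.7, "code": 0.2, "instructions": 0.1}
late  = {"web": 0.3, "code": 0.2, "instructions": 0.5}

stream = mixed_stream(sources, late)
print([next(stream)[0] for _ in range(10)])
```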

Swyx: Yeah, well, I think you have a clear [00:46:51]

Alessio: anti-laziness approach. I think other people are not as good hearted, you know, they're like, [00:46:57]

Swyx: hey, they told me this thing works. [00:46:59]

Alessio: And if I release a model this way, people will appreciate it, I'll get promoted and I'll kind of make more money. [00:47:06]

Jeremy: Yeah, and it's not just money. It's like, this is how citations work most badly, you know, so if you want to get cited, you need to write a paper that people in your field recognize as an advancement on things that we know are good. And so we've seen this happen again and again. So like I say, like zero shot and few shot learning, everybody was writing about that. Or, you know, with image generation, everybody just was writing about GANs, you know, and I was trying to say like, no, GANs are not the right approach. You know, and I showed again through research that we demonstrated in our videos that you can do better than GANs, much faster and with much less data. And nobody cared because again, like if you want to get published, you write a GAN paper that slightly improves this part of GANs and this tiny field, you'll get published, you know. So it's, yeah, it's not set up for real innovation. It's, you know, again, it's really helpful for me, you know, I have my own research lab with nobody telling me what to do and I don't even publish. So it doesn't matter if I get citations. And so I just write what I think actually matters. I wish there was, and, you know, and actually places like OpenAI, you know, the researchers there can do that as well. It's a shame, you know, I wish there was more academic, open venues in which people can focus on like genuine innovation. [00:48:38]

Swyx: Twitter, which unironically has become a little bit of that forum. I wanted to follow up on one thing that you mentioned, which is that you checked around the open source discords. I don't know if it's too much to ask, like, what discords are lively or useful right now. I think that something I definitely felt like I missed out on was the early days of EleutherAI, which is a very hard bit. And, you know, like what is the new Eleuther? And you actually shouted out the Alignment Lab AI discord in your blog post. And that was the first time I even knew, like I saw them on Twitter, never knew they had a discord, never knew that there was actually substantive discussions going on in there and that you were an active member of it. Okay, yeah. [00:49:23]

Jeremy: And then even then, if you do know about that and you go there, it'll look like it's totally dead. And that's because unfortunately, nearly all the discords, nearly all of the conversation happens in private channels. You know, and that's, I guess. [00:49:35]

Swyx: How does someone get into that world? Because it's obviously very, very instructive, right? [00:49:42]

Jeremy: You could just come to the fast.ai Discord, which, I'll be honest with you, is less bustling than some of the others, but it's not terrible. And so like, at least, to be fair, one of our most bustling channels is private. [00:49:57]

Swyx: I guess. [00:49:59]

Jeremy: So I'm just thinking. [00:50:01]

Swyx: It's just the nature of quality discussion, right? Yeah, I guess when I think about it, [00:50:05]

Jeremy: I didn't have any private discussions on our discord for years, but there was a lot of people who came in with like, oh, I just had this amazing idea for AGI. If you just thought about like, if you imagine that AI is a brain, then we, you know, this just, I don't want to talk about it. You know, I don't want to like, you don't want to be dismissive or whatever. And it's like, oh, well, that's an interesting comment, but maybe you should like, try training some models first to see if that aligns with your intuition. Like, oh, but how could I possibly learn? It's like, well, we have a course, just actually spend time learning. Like, you know, anyway. And there's like, okay, I know the people who always have good answers there. And so I created a private channel and put them all in it. And I got to admit, that's where I post more often because there's much less, you know, flight of fancy views about how we could solve AGI, blah, blah, blah. So there is a bit of that. But having said that, like, I think the bar is pretty low. Like if you join a Discord and you can hit the like participants or community or whatever button, you can see who's in it. And then you'll see at the top, who the admins or moderators or people in the dev role are. And just DM one of them and say like, oh, here's my GitHub. Well, here's some blog posts I wrote. You know, I'm interested in talking about this, you know, can I join the private channels? And I've never heard of anybody saying no. I will say, you know, Eleuther's all pretty open. So you can do the EleutherAI Discord still. You know, one problem with the Eleuther Discord is it's been going on for so long that it's like, it's very inside baseball. It's quite hard to get started. Yeah. CarperAI, I think it's all open. That's under Stability. That's more accessible. [00:52:03]

Swyx: Yeah. [00:52:04]

Jeremy: There's also, just recently, Nous Research, that does like the Hermes models and data sets, just opened. They've got some private channels, but it's pretty open, I think. You mentioned Alignment Lab, that one, it's all, the interesting stuff is on private channels. So just ask. If you know me, ask me, cause I've got admin on that one. There's also, yeah, OS Skunkworks, OS Skunkworks AI is a good Discord, which I think is open. So yeah, they're all pretty good. [00:52:40]

Swyx: I don't want you to leak any, you know, Discords that don't want any publicity, but this is all helpful. [00:52:46]

Jeremy: We all want people, like we all want people. [00:52:49]

Swyx: We just want people who like, [00:52:51]

Jeremy: want to build stuff, rather than people who, and like, it's fine to not know anything as well, but if you don't know anything, but you want to tell everybody else what to do and how to do it, that's annoying. If you don't know anything and want to be told like, here's a really small kind of task that as somebody who doesn't know anything is going to take you a really long time to do, but it would still be helpful. Then, and then you go and do it. That would be great. The truth is, yeah, [00:53:19]

Swyx: like, I don't know, [00:53:20]

Jeremy: maybe 5% of people who come in with great enthusiasm and saying that they want to learn and they'll do anything. [00:53:25]

Swyx: And then somebody says like, [00:53:25]

Jeremy: okay, here's some work you can do. Almost nobody does that work. So if you're somebody who actually does the work and follows up, you will massively stand out. That's an extreme rarity. And everybody will then want to help you do more work. [00:53:41]

Swyx: So yeah. [00:53:41]

Jeremy: So just, yeah, just do work and people will want to support you. [00:53:47]

Alessio: Our Discord used to be referral only for a long time. We didn't have a public invite and then we opened it and they're kind of like channel gating. Yeah. A lot of people just want to do, I remember it used to be like, you know, a forum moderator. [00:54:00]

Swyx: It's like people just want to do [00:54:01]

Alessio: like drive-by posting, [00:54:03]

Swyx: you know, and like, [00:54:03]

Alessio: they don't want to help the community. They just want to get their question answered. [00:54:07]

Jeremy: I mean, the funny thing is our forum community does not have any of that garbage. You know, there's something specific about the low latency thing where people like expect an instant answer. And yeah, we're all somehow in a forum thread where they know it's like there forever. People are a bit more thoughtful, but then the forums are less active than they used to be because Discord has got more popular, you know? So it's all a bit of a compromise, you know, running a healthy community is, yeah, it's always a bit of a challenge. All right, we got so many more things [00:54:47]

Alessio: we want to dive in, but I don't want to keep you here for hours. [00:54:50]

Swyx: This is not the Lex Friedman podcast [00:54:52]

Alessio: we always like to say. One topic I would love to maybe chat a bit about is Mojo, Modular, you know, Chris Lattner, who was on the podcast. So we want to spend a little time there. You recently did a hacker's guide to language models and you ran through everything from quantized models to like smaller models, larger models, and all of that. But obviously Modular is taking its own approach. Yeah, what got you excited? I know you and Chris have been talking about this for like years and a lot of the ideas you had, so. [00:55:23]

Jeremy: Yeah, yeah, yeah, yeah, no, absolutely. So I met Chris, I think it was at the first TensorFlow Dev Summit. And I don't think he had even like, I'm not sure if he'd even officially started his employment with Google at that point. So I don't know, you know, certainly nothing had been mentioned. So I, you know, I admired him from afar with LLVM and Swift and whatever. And so I saw him walk into the courtyard at Google. It's just like, oh s**t, man, that's Chris Latner. I wonder if he would lower his standards enough to talk to me. Well, worth a try. So I caught up my courage because like nobody was talking to him. He looked a bit lost and I wandered over and it's like, oh, you're Chris Latner, right? It's like, what are you doing here? What are you doing here? And I was like, yeah, yeah, yeah. It's like, oh, I'm Jeremy Howard. It's like, oh, do you do some of this AI stuff? And I was like, yeah, yeah, I like this AI stuff. Are you doing AI stuff? It's like, well, I'm thinking about starting to do some AI stuff. Yeah, I think it's going to be cool. And it's like, wow. So like, I spent the next half hour just basically brain dumping all the ways in which AI was stupid to him. And he listened patiently. And I thought he probably wasn't even remember or care or whatever. But yeah, then I kind of like, I guess I re-caught up with him a few months later. And it's like, I've been thinking about everything you said in that conversation. And he like narrated back his response to every part of it, projects he was planning to do. And it's just like, oh, this dude follows up. Holy s**t. And I was like, wow, okay. And he was like, yeah, so we're going to create this new thing called Swift for TensorFlow. And it's going to be like, it's going to be a compiler with auto differentiation built in. And blah, blah, blah. And I was like, why would that help? [00:57:10]

Swyx: You know, why would you? [00:57:10]

Jeremy: And he was like, okay, with a compiler during the forward pass, you don't have to worry about saving context, you know, because a lot will be optimized in the backward. But I was like, oh my God. Because I didn't really know much about compilers. You know, I spent enough to kind of like, understand the ideas, but it hadn't occurred to me that a compiler basically solves a lot of the problems we have as end users. I was like, wow, that's amazing. Okay, you do know, right, that nobody's going to use this unless it's like usable. It's like, yeah, I know, right. So I was thinking you should create like a fast AI for this. So, okay, but I don't even know Swift. And he was like, well, why don't you start learning it? And if you have any questions, ask me. It's just like, holy s**t. Like, not only has Chris Latner lowered his standards enough to talk to me, but he's offering me personal tutoring on the programming language that he made. So I was just like, I'm not going to let him down. So I spent like the next two months, like just nerding out on Swift. And it was just before Christmas that I kind of like started writing down what I'd learned. And so I wrote a couple of blog posts on like, okay, this is like my attempt to do numeric programming in Swift. And these are all the challenges I had. And these are some of the issues I had with like making things properly performant. And here are some libraries I wrote. And I sent it to Chris and was like, I hope he's not too disappointed with me, you know, because that would be the worst. It's like, you know, and I was also like, I was like, I hope he doesn't dislike the fact that I, you know, didn't love everything. [00:58:46]

Jeremy: And yeah, he was like, oh, thanks for sending me that. Let's get on a call and talk about it. And we spoke and he was like, this is amazing. I can't believe that you made this. This is exactly what Swift needs. And he was like, and so like somebody set up like a new Swift, what they call them, the equivalent of a pep, you know, kind of RFC thing of like, oh, you know, let's look at how we can implement Jeremy's ideas and the language. And so it's like, oh, wow. And so, yeah, you know, and then we ended up like literally teaching some lessons together about Swift for TensorFlow. And we built a fast AI kind of equivalent with him and his team. It was so much fun. Then in the end, you know, Google didn't follow through, which is fair enough, like asking everybody to learn a new programming language is going to be tough. But like, it was very obvious, very, very obvious at that time that TensorFlow 2 is going to be a failure, you know, and so it's felt like, okay, I, you know, well, you know, what are you going to do? Like, you can't focus on TensorFlow 2 because it's not going to, like, it's not working. It's never going to work. You know, nobody at Google's using it. Internally. So, you know, in the end, Chris left, you know, Swift for TensorFlow got archived. [01:00:13]

Swyx: There was no backup plan. [01:00:15]

Jeremy: So it kind of felt like Google was kind of screwed, you know, and Chris went and did something else. But we kept talking and I was like, look, Chris, you know, you've got to be your own boss, man. It's like, you know, you've got the ideas, you know, like only you've got the ideas, you know, and if your ideas are implemented, we'd all be so much better off because like Python's the best of a whole bunch of s**t, you know, like I would, it's amazing, but it's awful, you know, compared to what it could be. And anyway, so eventually a few years later, he called me up and he was like, Jeremy, I've taken your advice. I've started a company. And I was like, oh my God. It's like, we've got to create a new language. We're going to create a new infrastructure. It's going to build, it's going to have all the stuff we've talked about. And it's like, oh wow. So that's what Mojo is. And so Mojo is like, you know, building on all the stuff that Chris has figured out over, I mean, really from when he did his PhD thesis, which developed LLVM onwards, you know, in Swift and MLIR, you know, the TensorFlow runtime engine, which is very good. You know, that was something that he built and has lasted. So yeah, I'm pumped about that. I mean, it's very speculative. Creating a whole new language is tough. I mean, Chris has done it before and he's created a whole C++ compiler amongst other things. Looking pretty hopeful. I mean, I hope it works because, you know, [01:01:53]

Alessio: You told them to quit his job. [01:01:55]

Swyx: So I mean, in the meantime, I will say, you know, [01:02:00]

Jeremy: Google now does have a backup plan, you know, they have Jax, which was never a strategy. It was just a bunch of people who also recognized TensorFlow 2 as s**t and they just decided to build something else. And for years, my friends in that team were like, don't tell anybody about us because we don't want to be anything but a research project. So now these poor guys, suddenly they're the great white hope for Google's future. And so Jax is, you know, also not terrible, but it's still written in Python. Like it would be cool if we had all the benefits of Jax, but in a language that was designed for those kinds of purposes. So, you know, fingers crossed that, yeah, that Mojo turns out great. [01:02:45]

Swyx: Yeah. [01:02:47]

Alessio: Any other thoughts on when, where people should be spending their time? So that's more the kind of language framework level. Then you have the, you know, GGML, some of these other like quantization focused kind of model level things. Then you got the hardware people. It's like a whole other bucket. Yeah. What are some of the exciting stuff that you're excited about? [01:03:08]

Jeremy: Well, you won't be surprised to hear me say this, but I think fine tuning and transfer learning is still a hugely underappreciated area. So today's zero shot, few shot learning equivalent is retrieval augmented generation, you know, RAG, which is like, just like few shot learning is a thing. Like it's a real thing. It's a useful thing. It's not a thing anybody would want to ignore. Why are people not spending at least as much effort on fine tuning? You know, cause you know, RAG is like such an inefficient hack really, [01:03:45]

Swyx: isn't it? [01:03:45]

Jeremy: It's like, you know, segment up my data in some somewhat arbitrary way, embed it, ask questions about that, you know, hope that my embedding model embeds questions in the same embedding space as the paragraphs, which obviously it's not going to, if your question is like, if I've got a whole bunch of arXiv paper embeddings, and I asked like, what are all the ways in which we can make inference more efficient? Like the only paragraphs it'll find is like if there's a review paper, here's a list of ways to make, you know, inference more efficient. Doesn't have any of the specifics. No, it's not going to be like, oh, here's one way, here's one way, here's a different way in different papers, [01:04:33]

Swyx: you know? Yeah. [01:04:35]

Jeremy: If you fine tune a model, then all of that information is getting directly incorporated into the weights of your model in a much more efficient and nuanced way. And then you can use RAG on top of that. So I think that that's one area that's definitely like underappreciated. And also the kind of like the confluence or like, okay, how do you combine RAG and fine tuning, for example. [01:05:00]
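For contrast, the pipeline Jeremy is poking at is roughly the following: chunk the corpus, embed the chunks, and retrieve by cosine similarity against the embedded question. This is a generic sketch, not any particular product's implementation; the sentence-transformers checkpoint is just a commonly used example.

```python
# Bare-bones retrieval step of a RAG pipeline: chunk, embed, cosine-similarity lookup.
import numpy as np
from sentence_transformers import SentenceTransformer

docs = [
    "Speculative decoding drafts tokens with a small model and verifies them with a large one.",
    "Quantizing weights to 4 bits cuts memory use and can speed up inference.",
    "A review of inference-efficiency techniques: distillation, pruning, caching.",
]

def chunk(text, size=200):
    # Naive fixed-size character chunking -- the "somewhat arbitrary" split Jeremy mentions.
    return [text[i:i + size] for i in range(0, len(text), size)]

chunks = [c for d in docs for c in chunk(d)]

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(chunks, normalize_embeddings=True)      # unit vectors

def retrieve(query, k=2):
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = emb @ q                                        # cosine similarity
    return [(chunks[i], float(scores[i])) for i in np.argsort(-scores)[:k]]

for text, score in retrieve("ways to make inference more efficient"):
    print(f"{score:.2f}  {text}")
```

Fine-tuning bakes that information into the weights instead, and, as Jeremy says, the two can then be combined.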

Swyx: Something that I think a lot of people are uncertain about, and I don't expect you to know either, is that whether or not you can fine tune new information in, and I think that that is the focus of some of your open questions. And of course you can, right? [01:05:17]

Jeremy: Like, obviously you can, because there's no such thing as fine tuning. There's only continued pre-training. So fine tuning is pre-training, like they're literally the same thing. So the knowledge got in there in the first place through pre-training. So how could like continuing to pre-train not put more knowledge in? Like it's the same thing. The problem is just we're really bad at it because everybody's doing it dumb ways. So, you know, it's a good question. And it's not just new knowledge, but like new capabilities. You know, I think like in my Hacker's Guide to LLMs talk, I show a simple example, I mean, it's funny, it's a simple example even though it doesn't sound it, of taking a pre-trained base model and getting it to generate SQL. And it took 15 minutes to train on a single GPU. You know, I think that might surprise people that that capability is at your fingertips. And, you know, because it was already there, it was just latent in the base model. Really pushing the boundaries of what you can do with small models, I think is a really interesting question. Like what can you do with a, like, I mean, there isn't much in the way of good small models. A really underappreciated one is BTLM-3B, which is like a kind of 7B-quality 3B model. There's not much in the 1 to 2B range, sadly. There are some code ones, but like the fact that there are some really good code ones in that 1 to 2B range shows you that that's a great size for doing complex tasks well. [01:06:56]
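The SQL example is roughly this shape. The sketch below is not Jeremy's notebook: the base model, dataset, and hyperparameters are stand-ins, and it uses the common transformers-plus-peft LoRA recipe rather than whatever he actually ran, but it shows how little code a capability like "emit SQL" can take on a single GPU.

```python
# Hedged sketch of fine-tuning a small base model to emit SQL with LoRA.
# Model, dataset, and hyperparameters here are placeholders, not Jeremy's setup.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "EleutherAI/pythia-410m"           # stand-in small base model
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(base)
model = get_peft_model(model, LoraConfig(task_type="CAUSAL_LM", r=8,
                                         lora_alpha=16, lora_dropout=0.05))

# Any question/context/SQL dataset works; this public HF dataset is one example.
ds = load_dataset("b-mc2/sql-create-context", split="train[:2000]")

def to_text(ex):
    prompt = (f"-- schema:\n{ex['context']}\n-- question: {ex['question']}\n"
              f"-- sql:\n{ex['answer']}{tok.eos_token}")
    return tok(prompt, truncation=True, max_length=512)

ds = ds.map(to_text, remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sql-lora", per_device_train_batch_size=8,
                           num_train_epochs=1, learning_rate=2e-4, logging_steps=20),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```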

Swyx: There was Phi-1 recently, which has been the subject of a little bit of discussion about whether to train on benchmarks. [01:07:04]

Jeremy: Phi-1.5 as well. So that's not a good model yet. [01:07:09]

Swyx: Why not? [01:07:11]

Jeremy: It's good at doing, so Phi-1 in particular is good at doing a very specific thing, which is creating very small Python snippets. [01:07:19]

Swyx: The thing, okay, [01:07:21]

Jeremy: so like Phi-1.5 has never read Wikipedia, for example, so it doesn't know who Tom Cruise is, you know, it doesn't know who anybody is, it doesn't know about any movies, it doesn't really know anything about anything, like, because it's never read anything, you know, it was trained on a nearly entirely synthetic data set, which is designed for it to learn reasoning, and so it was a research project, and a really good one, and it definitely shows us a powerful direction in terms of what you can do with synthetic data, and wow, gosh, even these tiny models can get pretty good reasoning skills, pretty good math skills, pretty good coding skills, [01:08:04]

Jeremy: but I don't know if it's a model you could necessarily build on. Some people have tried to do some fine tunes of it, and again, they're like surprisingly good in some ways for a 1.5b model, but not sure you'd find it useful for anything. [01:08:24]

Swyx: I think that's the struggle of pitching small models, because small is great, you know, you don't need a lot of resources to run them, but the performance evaluation is always so iffy, it's always just like, yeah, it works on some things, and we don't trust it for others. [01:08:41]

Jeremy: Yeah, so that's why we're back to fine tuning. So Microsoft did create a Phi-1.5-web, but they didn't release it, unfortunately. I would say a Phi-1.5-web with fine tuning for your task, you know, might solve a lot of tasks that people have in their kind of day-to-day lives. You know, particularly in kind of an enterprise setting, I think there's a lot of like repetitive kind of processing that has to be done. It's a useful thing for coders to know about, because I think quite often you can like replace some thousands and thousands of lines of complex buggy code, maybe with a fine tune, you know. [01:09:24]

Swyx: Got it. Yeah. [01:09:27]

Alessio: And Jeremy, before we let you go, I think one question on top of a lot of people's minds. So you've done Practical Deep Learning for Coders in 2018, 19, 21, 22. I feel like the more time goes by, the more the GPUs get concentrated. If you're somebody who's interested in deep learning today and you don't want to go join OpenAI, you don't want to join Anthropic, what's like the best use of their time? Should they focus on, yeah, small model development? Should they focus on fine tuning math and all of that? Should they just like focus on making RAG not a hack and coming up with a better solution? Yeah, what's a Practical Deep Learning for Coders 2024 kind of look like? [01:10:10]

Jeremy: Yeah. [01:10:11]

Swyx: I mean, good question. [01:10:12]

Jeremy: I'm trying to figure that out for myself. You know, like what should I teach? Because I definitely feel like things have changed a bit. You know, one of the ways in which things have changed is that coding is much more accessible now. So if you look at a lot of the folks in the kind of open source LLM community, they're folks who really hadn't coded before a year ago. And they're using these models to help them build stuff they couldn't build before, which is just fantastic, you know? So one thing I kind of think is like, okay, well, we need a lot more material to help these people use this newfound skill they have because they don't really know what they're doing, you know, and they don't claim to, but they're doing it anyway. And I think that's fantastic, you know? So like, are there things we could do to help people, [01:10:58]

Swyx: you know, bridge this gap? [01:11:00]

Jeremy: Because previously, you know, I know folks who were, you know, doing manual jobs a year ago, and now they're training language models thanks to the help of Codex and Copilot and whatever. So, you know, yeah, what does it look like to like really grab this opportunity? You know, maybe Fast.ai's goals can be dramatically expanded now to being like, let's make coding more accessible, you know, kind of AI-oriented coding more accessible. If so, our course should probably look very different, you know, and we'd have to throw away that like, oh, you have to have at least a year of full-time programming, you know, as a prerequisite. Yeah, what would happen if we got rid of that? So that's kind of one thought that's in my head. You know, as to what should other people do? Honestly, I don't think anybody has any idea, like, the more I look at it, what's going on. I know I don't, you know, like, we don't really know how to do anything very well. Clearly OpenAI do, like, they seem to be quite good at some things, or they're talking to folks at, or who have recently left OpenAI. [01:12:17]

Swyx: Even there, it's clear there's a lot of stuff [01:12:19]

Jeremy: they haven't really figured out, and they're just kind of like using recipes that they've noticed have been okay, so, yeah, we don't really know how to train these models well, we don't know how to fine-tune them well, we don't know how to do React well, we don't know what they can do, we don't know what they can't do, we don't know how big a model you need to solve different kinds of problems, we don't know what kind of problems they can't do, we don't know what good prompting strategies are for particular problems, you know. Like, somebody sent me a message the other day saying they've written something that is a prompting strategy for GPT-4, for GPT-4, they've written, like, 6,000 lines of Python code, and it's to help it play chess. And then they've said they've had it play against other chess engines, including the best Stockfish engines, and it's got an ELO of 3,400, [01:13:11]

Swyx: which would make it close to [01:13:13]

Jeremy: the best chess engine in existence. And I think this is a good example of, like, people were saying, like, GPT-4 can't play chess. I mean, I was sure that was wrong. I mean, obviously, it can play chess. But the difference between, like, with no prompting strategy, it can't even make legal moves, with good prompting strategies, it might be just about the best chess engine in the world, far better than any human player. So, yeah, I mean, we don't really know what the capabilities are yet. So I feel like it's all blue sky at this point. It feels like computer vision in 2013 to me, which was, like, in 2013, computer vision was, like, OK, OK. [01:13:51]

Swyx: We just had the AlexNet. [01:13:52]

Jeremy: We've had AlexNet. We've had VGGNet. It's around the time of Zeiler and Fergus, like, no, it's probably before that. So we hadn't yet had the Zeiler and Fergus, like, oh, this is actually what's going on inside the layers. So, you know, we don't actually know what's happening inside these transformers. We don't know how to create good training dynamics. We don't really know anything much. And there's a reason for that, right? And the reason for that is language models suddenly got really useful. And so the kind of economically rational thing to do, like, this is not criticism. This is true. The economic rational thing to do is to, like, OK, like, build that as fast as possible. You know, make something work, get it out there. And that's what, you know, OpenAI in particular did and Anthropic kind of did. But there's a whole lot of technical debt everywhere. You know, nobody's really figured this stuff out because everybody's been so busy [01:14:53]

Swyx: building what we know works as quickly as possible. [01:14:57]

Jeremy: So, yeah, I think there's a huge amount of opportunity to, you know, I think we'll find things can be made to work a lot faster, a lot less memory. I got a whole bunch of ideas I want to try, you know, every time I look at something closely, like really closely, I'm always like, oh, it turns out this person actually had no idea what they're doing, you know, [01:15:21]

Swyx: which is fine. [01:15:23]

Jeremy: Like, none of us know what we're doing. We should experiment with that. We had Tri Dao on the podcast [01:15:32]

Alessio: who created FlashAttention. Yeah. And I asked him, did nobody think of using SRAM before you? Like, were people just like, no. And he was like, yeah, people just didn't think of it. They didn't try. They didn't come from like a systems background. [01:15:48]

Swyx: Yeah. [01:15:48]

Jeremy: I mean, the thing about FlashAttention is, I mean, lots of people absolutely had thought of that. So had I, right? But I mean, the honest truth is, particularly before Triton, like everybody knew that tiling is the right way to solve anything. And everybody knew that attention, fused attention wasn't tiled. That was stupid. But not everybody's got his ability to like, be like, oh, well, I am confident enough in CUDA and or Triton to use that insight to write something better, you know? And this is where, like, I'm super excited about Mojo, right? And I always talk to Chris about FlashAttention because I'm like, you know, there is a thousand FlashAttentions out there for us to build. You just got to make it easy for us to build them. Like Triton definitely helps, but it's still not easy. You know, it still requires kind of really understanding the GPU architecture and writing it in that kind of very CUDA-ish way. So yeah, I think, you know, if Mojo or something equivalent can really work well, we're going to see a lot more FlashAttentions popping up. [01:17:06]
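To make the tiling point concrete, here is a tiny single-head sketch of the blocked attention with online softmax that FlashAttention is built around, in plain PyTorch. It reproduces the math (and matches naive attention numerically) but none of the SRAM management that makes the real kernel fast, which is exactly the part that still needs CUDA, Triton, or, as Jeremy hopes, Mojo.

```python
# Single-head toy of the blocked attention + online softmax idea behind
# FlashAttention. Plain PyTorch: it shows the math, not the memory wins.
import torch

def tiled_attention(q, k, v, block=64):
    seq, d = q.shape
    scale = d ** -0.5
    m = torch.full((seq,), float("-inf"))   # running row max
    l = torch.zeros(seq)                    # running softmax denominator
    acc = torch.zeros_like(q)               # running unnormalised output
    for start in range(0, seq, block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = (q @ kb.T) * scale               # scores for this KV tile
        m_new = torch.maximum(m, s.max(dim=-1).values)
        corr = torch.exp(m - m_new)          # rescale previously accumulated stats
        p = torch.exp(s - m_new[:, None])
        l = l * corr + p.sum(dim=-1)
        acc = acc * corr[:, None] + p @ vb
        m = m_new
    return acc / l[:, None]

q, k, v = (torch.randn(256, 32) for _ in range(3))
ref = torch.softmax((q @ k.T) * (32 ** -0.5), dim=-1) @ v
print(torch.allclose(tiled_attention(q, k, v), ref, atol=1e-5))
```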

Swyx: Great, Jeremy. [01:17:08]

Alessio: And before we wrap, we usually do a quick lightning round. [01:17:10]

Swyx: We're going to have three simple questions. [01:17:13]

Alessio: So the first one is around acceleration. And you've been in this field a long time. What's something that it's already here today in AI that you thought would take much longer? I don't think anything. [01:17:24]

Jeremy: So I've actually been slightly too bullish. So in my 2014 TED talk, I had a graph and I said, like, this is like the slope of human capabilities and this is the slope of AI capabilities. And I said, oh, and I put a dot saying we are here. It was just before they crossed. And I looked back at the transcript the other day and I said, in five years, I think we'll, you know, we might have crossed that threshold in which computers will be better at most human tasks than most humans or most average humans. And so that might be almost true now for non-physical tasks. So it, like, took, you know, twice as long as I thought it might. [01:18:11]

Jeremy: Yeah, no, I wouldn't say anything surprised me too much. It's still like, definitely like, I got to admit, you know, I had a very visceral reaction using GPT-4 for the first time. Not because I found it surprising, but actually doing it, like something I was pretty sure would exist by about now, maybe a bit earlier. But actually using it definitely is different to just feeling like it's probably on its way, you know, and yeah, whatever GPT-5 looks like. I'm sure, I imagine I'll have the same visceral reaction, you know. [01:18:56]

Swyx: It's really amazing to watch develop. We also have an exploration question. So what do you think is the most interesting unsolved question in AI? [01:19:07]

Jeremy: How do language models learn? You know, what are the training dynamics? Like I want to see, there was a great paper about ResNets a few years ago that showed how, that was able to like plot a kind of projected three-dimensional loss surface for a ConvNet with and without skip connections. And you know, you could very clearly see without the skip connections, it was bumpy, and with the skip connections, it was super smooth. That's the kind of work we need. Like, so there was actually an interesting blog post that came out just today from the PyTorch team where some of them have created this like 3D matrix product visualization thing. [01:19:56]

Swyx: The MatMul Visualizer. [01:19:58]

Jeremy: Yeah, and they actually showed some nice examples of like a GPT-2 attention layer and like showed an animation and said, like, if you look at this, we can actually see a bit about what it's doing. You know, so again, it reminds me of the Zeiler and Fergus, you know, ConvNet paper that was the first one to do these reverse convolutions to show what's actually being learned in each layer in a ConvNet. Yeah, we need a lot more of this, like, what is going on inside these models? How do they actually learn? And then how can we use those insights to help them to learn better? So I think that would be one. The other exploration I'd really like to see is a much more rigorous analysis of what kind of data do they need at what level? And when do they need it? And how often? So that kind of like dataset mixing, curation, so forth. [01:20:52]

Swyx: Right. In order to get the best capabilities. Yeah. How much is Wikipedia? Yeah. [01:20:58]

Jeremy: Yeah. [01:20:59]

Swyx: Very uncertain. [01:20:59]

Jeremy: Fine-tune what, you know, what kind of mix do you need for it to keep its capabilities? And what are the kind of underlying capabilities that it most needs to keep? And if it loses those, it would lose all these other ones. And what data do you need to keep those? And, you know, other things we can do to change the loss function, to help it to not forget to do things, stuff like that. [01:21:20]

Swyx: Awesome. [01:21:21]

Alessio: And yeah, before wrapping, what's one message, one idea you want everyone to remember and think about? [01:21:27]

Jeremy: You know, I guess the main thing I want everybody to remember is that, you know, there's a lot of people in the world. And they have a lot of, you know, diverse experiences and capabilities. And they all matter. And now that we have a, you know, newly powerful technology in our lives, we could think of that one of two ways. One would be, gee, that's really scary. What would happen if all of these people in the world had access to this technology? Some of them might be bad people. Let's make sure they can't have it. Or one might be, wow, of all those people in the world, I bet a lot of them could really improve the lives of a lot of humanity if they had this tool. This has always been the case, you know, from the invention of writing, to the invention of the printing press, to the, you know, development of education. And it's been a constant battle between people who think that the distributed power is unsafe and it should be held on to by an elite few. And people who think that humanity on net, you know, is a marvelous species, particularly when part of a society and a civilization. And we should do everything we can to enable more of them to contribute. This is a really big conversation right now. And, you know, I want to see more and more people showing up and showing what, you know, what the great unwashed masses out there can actually achieve. You know, that actually, you know, regular people are going to do a lot of really valuable work and actually help us be, you know, more safe and also flourishing in our lives and providing a future for our children to flourish in. You know, if we lock things down to the people that we think, you know, the elites that we think can be trusted to run it for us, yeah, I think all bets are off about where that leaves us as a society, you know. [01:24:00]

Alessio: Yep. Now that's an important message. And yeah, that's why we've been promoting a lot of open source developers, open source communities, I think, letting the builders build and explore. That's always a good idea. Thank you so much for coming on, Jeremy. This was great. [01:24:20]

Jeremy: Thank you for having me. [01:24:22]



Get full access to Latent Space at www.latent.space/subscribe
2023-10-19

Why AI Agents Don't Work (yet) - with Kanjun Qiu of Imbue

Thanks to the over 11,000 people who joined us for the first AI Engineer Summit! A full recap is coming, but you can 1) catch up on the fun and videos on Twitter and YouTube, 2) help us reach 1000 people for the first comprehensive State of AI Engineering survey and 3) submit projects for the new AI Engineer Foundation.

See our Community page for upcoming meetups in SF, Paris, NYC, and Singapore.

This episode had good interest on Twitter.

Last month, Imbue was crowned as AI's newest unicorn foundation model lab, raising a $200m Series B at a >$1 billion valuation. As "stealth" foundation model companies go, Imbue (f.k.a. Generally Intelligent) has stood as an enigmatic group given they have no publicly released models to try out. However, ever since their $20m Series A last year, their goal has been to "develop generally capable AI agents with human-like intelligence in order to solve problems in the real world."

From RL to Reasoning LLMs

Along with their Series A, they announced Avalon, "A Benchmark for RL Generalization Using Procedurally Generated Worlds". Avalon is built on top of the open source Godot game engine and is ~100x faster than Minecraft, enabling fast RL benchmarking with a clear reward signal and adjustable game difficulty.
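
To make that benchmark design concrete, here is a toy gym-style environment sketch, illustrative only and not the actual Avalon API: the world is procedurally generated each episode, the reward is a single sparse success signal shared across tasks, and difficulty is an adjustable knob.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class ToyProceduralWorld(gym.Env):
    """A toy stand-in for a procedurally generated benchmark: one sparse
    reward across tasks, and a difficulty knob that scales the goal distance."""

    def __init__(self, difficulty: float = 0.1):
        self.difficulty = difficulty  # 0.0 = trivial, 1.0 = hardest
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(4,), dtype=np.float32)
        self.action_space = spaces.Discrete(4)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        # "Procedural generation": a fresh goal sampled every episode,
        # placed further away as difficulty increases.
        self.goal = self.np_random.uniform(-1.0, 1.0, size=2) * self.difficulty
        self.pos = np.zeros(2)
        self.steps = 0
        return self._obs(), {}

    def step(self, action):
        moves = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]]) * 0.05
        self.pos = self.pos + moves[action]
        self.steps += 1
        success = np.linalg.norm(self.pos - self.goal) < 0.05
        terminated = bool(success or self.steps >= 200)
        reward = 1.0 if success else 0.0  # single sparse reward, shared by all tasks
        return self._obs(), reward, terminated, False, {}

    def _obs(self):
        return np.concatenate([self.pos, self.goal]).astype(np.float32)
```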

After a while, they realized that pure RL isn't a good path to teach reasoning and planning. The agents were able to learn mechanical things like opening complex doors and climbing, but couldn't move on to higher-level tasks. A pure RL world also doesn't include a language explanation of the agent's reasoning, which made it hard to understand why it made certain decisions. That pushed the team more towards the "models for reasoning" path:

"The second thing we learned is that pure reinforcement learning is not a good vehicle for planning and reasoning. So these agents were able to learn all sorts of crazy things: They could learn to climb like hand over hand in VR climbing, they could learn to open doors like very complicated, like multiple switches and a lever open the door, but they couldn't do any higher level things. And they couldn't do those lower level things consistently necessarily. And as a user, I do not want to interact with a pure reinforcement learning end to end RL agent. As a user, like I need much more control over what that agent is doing."

Inspired by Chelsea Finn's work on SayCan at Stanford, the team pivoted to have their agents do the reasoning in natural language instead. This development parallels the large leaps in reasoning that humans made with the development of the scientific method:

"We are better at reasoning now than we were 3000 years ago. An example of a reasoning strategy is noticing you're confused. Then when I notice I'm confused, I should ask:

* What was the original claim that was made?

* What evidence is there for this claim?

* Does the evidence support the claim?

* Is the claim correct?

This is like a reasoning strategy that was developed in like the 1600s, you know, with like the advent of science. So that's an example of a reasoning strategy. There are tons of them. We employ all the time, lots of heuristics that help us be better at reasoning. And we can generate data that's much more specific to them."
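
One way to picture how a reasoning strategy like this can be made explicit: turn the checklist into a prompt template and use it to generate strategy-specific training traces. This is only a minimal sketch; `call_llm` is a placeholder for whatever model client you use, not anything Imbue has described.

```python
NOTICE_CONFUSION = """You are practicing a reasoning strategy: noticing you're confused.
Claim: {claim}

Work through these steps explicitly:
1. What was the original claim that was made?
2. What evidence is there for this claim?
3. Does the evidence support the claim?
4. Is the claim correct? If unsure, say exactly what is missing.
"""

def generate_reasoning_trace(claim: str, call_llm) -> dict:
    """Produce one (claim, trace) record for a strategy-specific dataset.
    `call_llm` is a placeholder: any callable that maps a prompt to text."""
    trace = call_llm(NOTICE_CONFUSION.format(claim=claim))
    return {"claim": claim, "strategy": "notice_confusion", "trace": trace}
```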

The Full Stack Model Lab

One year later, it would seem that the pivot to reasoning has had tremendous success, and Imbue has now reached a >$1B valuation, with participation from Astera Institute, NVIDIA, Cruise CEO Kyle Vogt, Notion co-founder Simon Last, and others. Imbue tackles their work with a "full stack" approach:

* Models. Pretraining very large (>100B parameter) models optimized to perform well on internal reasoning benchmarks; their ~10,000 Nvidia H100 GPU cluster lets them iterate rapidly on everything from training data to architecture and reasoning mechanisms.

* Tools and Agents. Building internal productivity tools from coding agents for fixing type checking and linting errors, to sophisticated systems like CARBS (for hyperparameter tuning and network architecture search).

* Interface Invention. Solving agent trust and collaboration (not merely communication) with humans by creating better abstractions and interfaces: IDEs for users to program computers in natural language.

* Theory. Publishing research about the theoretical underpinnings of self-supervised learning, as well as scaling laws for machine learning research.

Kanjun believes we are still in the "bare metal phase" of agent development, and they want to take a holistic approach to building the "operating system for agents". We loved diving deep into the Imbue approach toward solving the AI Holy Grail of reliable agents, and are excited to share our conversation with you today!

Timestamps

* [00:00:00] Introductions

* [00:06:07] The origin story of Imbue

* [00:09:39] Imbue's approach to training large foundation models optimized for reasoning

* [00:12:18] Imbue's goals to build an "operating system" for reliable, inspectable AI agents

* [00:15:37] Imbue's process of developing internal tools and interfaces to collaborate with AI agents

* [00:17:27] Imbue's focus on improving reasoning capabilities in models, using code and other data

* [00:19:50] The value of using both public benchmarks and internal metrics to evaluate progress

* [00:21:43] Lessons learned from developing the Avalon research environment

* [00:23:31] The limitations of pure reinforcement learning for general intelligence

* [00:28:36] Imbue's vision for building better abstractions and interfaces for reliable agents

* [00:31:36] Interface design for collaborating with, rather than just communicating with, AI agents

* [00:37:40] The future potential of an agent-to-agent protocol

* [00:39:29] Leveraging approaches like critiquing between models and chain of thought

* [00:45:49] Kanjun's philosophy on enabling team members as creative agents at Imbue

* [00:53:51] Kanjun's experience co-founding the communal co-living space The Archive

* [01:00:22] Lightning Round

Show Notes

* Imbue

* Avalon

* CARBS (hyperparameter optimizer)

* Series B announcement

* Kanjun/Imbue's Podcast

* MIT Media Lab

* Research mentioned:

* Momentum Contrast

* SimCLR

* Chelsea Finn - SayCan

* Agent Protocol - part of the AI Engineer Foundation

* Xerox PARC

* Michael Nielsen

* Jason Benn

* Outset Capital

* Scenius - Kevin Kelly

* South Park Commons

* The Archive

* Thursday Nights in AI

Transcript

Alessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, Partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai. [00:00:19]

Swyx: Hey, and today in the studio we have Kanjun from Imbue. Welcome. So you and I have, I guess, crossed paths a number of times. You're formerly named Generally Intelligent and you've just announced your rename, rebrand in huge, humongous ways. So congrats on all of that. And we're here to dive in into deeper detail on Imbue. We like to introduce you on a high level basis, but then have you go into a little bit more of your personal side. So you graduated your BS at MIT and you also spent some time at the MIT Media Lab, one of the most famous, I guess, computer hacking labs in the world. Then you graduated MIT and you went straight into BizOps at Dropbox, where you're eventually chief of staff, which is a pretty interesting role we can dive into later. And then it seems like the founder bug hit you. You were basically a three times founder at Ember, Sorceress, and now at Generally Intelligent slash Imbue. What should people know about you on the personal side that's not on your LinkedIn? That's something you're very passionate about outside of work. [00:01:12]

Kanjun: Yeah. I think if you ask any of my friends, they would tell you that I'm obsessed with agency, like human agency and human potential. [00:01:19]

Swyx: That's work. Come on.

Kanjun: It's not work. What are you talking about?

Swyx: So what's an example of human agency that you try to promote? [00:01:27]

Kanjun: With all of my friends, I have a lot of conversations with them that's kind of helping figure out what's blocking them. I guess I do this with a team kind of automatically too. And I think about it for myself often, like building systems. I have a lot of systems to help myself be more effective. At Dropbox, I used to give this onboarding talk called How to Be Effective, which people liked. I think like a thousand people heard this onboarding talk, and I think maybe Dropbox was more effective. I think I just really believe that as humans, we can be a lot more than we are. And it's what drives everything. I guess completely outside of work, I do dance. I do partner dance. [00:02:03]

Swyx: Yeah. Lots of interest in that stuff, especially in the sort of group living houses in San Francisco, which I've been a little bit part of, and you've also run one of those. [00:02:12]

Kanjun: That's right. Yeah. I started The Archive with two friends, with Josh, my co-founder, and a couple of other folks in 2015. That's right. And our housemates built GPT-3. [00:02:22]

Swyx: Was that the, I guess, the precursor to Generally Intelligent, that you started doing more things with Josh? Is that how that relationship started? Yeah. [00:02:30]

Kanjun: This is our third company together. Our first company, Josh poached me from Dropbox for Ember. And there we built a really interesting technology, laser raster projector, VR headset. And then we were like, VR is not the thing we're most passionate about. And actually it was kind of early days when we both realized we really do believe that in our lifetimes, like computers that are intelligent are going to be able to allow us to do much more than we can do today as people and be much more as people than we can be today. And at that time, we actually, after Ember, we were like, work on AI research or start an AI lab. A bunch of our housemates were joining OpenAI, and we actually decided to do something more pragmatic to apply AI to recruiting and to try to understand like, okay, if we are actually trying to deploy these systems in the real world, what's required? And that was Sorceress. That taught us so much about maybe an AI agent in a lot of ways, like what does it actually take to make a product that people can trust and rely on? I think we never really fully got there. And it's taught me a lot about what's required. And it's kind of like, I think informed some of our approach and some of the way that we think about how these systems will actually get used by people in the real world. [00:03:42]

Swyx: Just to go one step deeper on that, you're building AI agents in 2016 before it was cool. You got some muscle and you raised $30 million. Something was working. What do you think you succeeded in doing and then what did you try to do that did not pan out? [00:03:56]

Kanjun: Yeah. So the product worked quite well. So Sorceress was an AI system that basically looked for candidates that could be a good fit and then helped you reach out to them. And this was a little bit early. We didn't have language models to help you reach out. So we actually had a team of writers that like, you know, customized emails and we automated a lot of the customization. But the product was pretty magical. Like candidates would just be interested and land in your inbox and then you can talk to them. As a hiring manager, that's such a good experience. I think there were a lot of learnings, both on the product and market side. On the market side, recruiting is a market that is endogenously high churn, which means people start hiring, then we hire the role for them, and they stop hiring. So the more we succeed, the more they... [00:04:39]

Swyx: It's like the whole dating business. [00:04:40]

Kanjun: It's the dating business. Exactly. Exactly. And I think that's the same problem as the dating business. And I was really passionate about like, can we help people find work that is more exciting for them? A lot of people are not excited about their jobs and a lot of companies are doing exciting things and the matching could be a lot better. But the dating business phenomenon like put a damper on that, like it's actually a pretty good business. But as with any business with like relatively high churn, the bigger it gets, the more revenue we have, the slower growth becomes because if 30% of that revenue you lose year over year, then it becomes a worse business. So that was the dynamic we noticed quite early on after our Series A. I think the other really interesting thing about it is we realized what was required for people to trust that these candidates were like well vetted and had been selected for a reason. And it's what actually led us, you know, a lot of what we do at Imbue is working on interfaces to figure out how do we get to a situation where when you're building and using agents, these agents are trustworthy to the end user. That's actually one of the biggest issues with agents that, you know, go off and do longer range goals is that I have to trust, like, did they actually think through this situation? And that really informed a lot of our work today. [00:05:52]

Alessio: Let's jump into GI now, Imbue. When did you decide recruiting was done for you and you were ready for the next challenge? And how did you pick the agent space? I feel like in 2021, it wasn't as mainstream. Yeah. [00:06:07]

Kanjun: So the LinkedIn says that it started in 2021, but actually we started thinking very seriously about it in early 2020, late 2019, early 2020. So what we were seeing is that scale is starting to work and language models probably will actually get to a point where like with hacks, they're actually going to be quite powerful. And it was hard to see that at the time, actually, because GPT-3, the early versions of it, there are all sorts of issues. We're like, oh, that's not that useful, but we could kind of see like, okay, you keep improving it in all of these different ways and it'll get better. What Josh and I were really interested in is how can we get computers that help us do bigger things? Like, you know, there's this kind of future where I think a lot about, you know, if I were born in 1900 as a woman, like my life would not be that fun. I'd spend most of my time like carrying water and literally like getting wood to put in the stove to cook food and like cleaning and scrubbing the dishes and, you know, getting food every day because there's no refrigerator, like all of these things, very physical labor. And what's happened over the last 150 years since the industrial revolution is we've kind of gotten free energy, like energy is way more free than it was 150 years ago. And so as a result, we've built all these technologies like the stove and the dishwasher and the refrigerator, and we have electricity and we have infrastructure, running water, all of these things that have totally freed me up to do what I can do now. And I think the same thing is true for intellectual energy. We don't really see it today, but because we're so in it, but our computers have to be micromanaged. You know, part of why people are like, oh, you're stuck to your screen all day. Well, we're stuck to our screen all day because literally nothing happens unless I'm doing something in front of my screen. I don't, you know, I can't send my computer off to do a bunch of stuff for me. And there is a future where that's not the case, where, you know, I can actually go off and do stuff and trust that my computer will pay my bills and figure out my travel plans and do the detailed work that I am not that excited to do so that I can like be much more creative and able to do things that I as a human, I'm very excited about and collaborate with other people. And there are things that people are uniquely suited for. So that's kind of always been the thing that has been really exciting to me. Like Josh and I have known for a long time, I think that, you know, whatever AI is, it would happen in our lifetimes. And the personal computer kind of started giving us a bit of free intellectual energy. And this is like really the explosion of free intellectual energy. So in early 2020, we were thinking about this and what happened was self-supervised learning basically started working across everything. So worked in language, SimClear came out, I think MoCo had come out, Momentum Contrast had come out earlier in 2019, SimClear came out in early 2020. And we're like, okay, for the first time, self-supervised learning is working really well across images and text and suspect that like, okay, actually it's the case that machines can learn things the way that humans do. And if that's true, if they can learn things in a fully self-supervised way, because like as people, we are not supervised. We like go Google things and try to figure things out. 
So if that's true, then like what the computer could be is much bigger than what it is today. And so we started exploring ideas around like, how do we actually go? We didn't think about the fact that we could actually just build a research lab. So we were like, okay, what kind of startup could we build to like leverage self-supervised learning? So that eventually becomes something that allows computers to become much more able to do bigger things for us. But that became Generally Intelligent, which started as a research lab. [00:09:39]

Alessio: So your mission is you aim to rekindle the dream of the personal computer. So when did it go wrong and what are like your first products and user facing things that you're building to rekindle it? [00:09:53]

Kanjun: Yeah. So what we do at Imbue is we train large foundation models optimized for reasoning. And the reason for that is because reasoning is actually, we believe the biggest blocker to agents or systems that can do these larger goals. If we think about something that writes an essay, like when we write an essay, we like write it. We put it and then we're done. We like write it and then we look at it and we're like, oh, I need to do more research on that area. I'm going to go do some research and figure it out and come back and, oh, actually it's not quite right. The structure of the outline. So I'm going to rearrange the outline, rewrite it. It's this very iterative process and it requires thinking through like, okay, what am I trying to do? Is the goal correct? Also like, has the goal changed as I've learned more? So as a tool, like when should I ask the user questions? I shouldn't ask them questions all the time, but I should ask them questions in higher risk situations. How certain am I about the like flight I'm about to book? There are all of these notions of like risk certainty, playing out scenarios, figuring out how to make a plan that makes sense, how to change the plan, what the goal should be. That are things that we lump under the bucket of reasoning and models today, they're not optimized for reasoning. It turns out that there's not actually that much explicit reasoning data on the internet as you would expect. And so we get a lot of mileage out of optimizing our models for reasoning in pre-training. And then on top of that, we build agents ourselves and we, I can get into, we really believe in serious use, like really seriously using the systems and trying to get to an agent that we can use every single day, tons of agents that we can use every single day. And then we experiment with interfaces that help us better interact with the agents. So those are some set of things that we do on the kind of model training and agent side. And then the initial agents that we build, a lot of them are trying to help us write code better because code is most of what we do every day. And then on the infrastructure and theory side, we actually do a fair amount of theory work to understand like, how do these systems learn? And then also like, what are the right abstractions for us to build good agents with, which we can get more into. And if you look at our website, we build a lot of tools internally. We have a like really nice automated hyperparameter optimizer. We have a lot of really nice infrastructure and it's all part of the belief of like, okay, let's try to make it so that the humans are doing the things humans are good at as much as possible. So out of our very small team, we get a lot of leverage. [00:12:18]

Swyx: And so would you still categorize yourself as a research lab now, or are you now in startup mode? Is that a transition that is conscious at all? [00:12:26]

Kanjun: That's a really interesting question. I think we've always intended to build, you know, to try to build the next version of the computer, enable the next version of the computer. The way I think about it is there's a right time to bring a technology to market. So Apple does this really well. Actually, iPhone was under development for 10 years, AirPods for five years. And Apple has a story where iPhone, the first multi-touch screen was created. They actually were like, oh wow, this is cool. Let's like productionize iPhone. They actually brought, they like did some work trying to productionize it and realized this is not good enough. And they put it back into research to try to figure out like, how do we make it better? What are the interface pieces that are needed? And then they brought it back into production. So I think of production and research as kind of like these two separate phases. And internally we have that concept as well, where like things need to be done in order to get to something that's usable. And then when it's usable, like eventually we figure out how to productize it. [00:13:20]

Alessio: What's the culture like to make that happen, to have both like kind of like product oriented, research oriented. And as you think about building the team, I mean, you just raised 200 million. I'm sure you want to hire more people. What are like the right archetypes of people that work at Imbue? [00:13:35]

Kanjun: I would say we have a very unique culture in a lot of ways. I think a lot about social process design. So how do you design social processes that enable people to be effective? I like to think about team members as creative agents, because most companies, they think of their people as assets and they're very proud of this. And I think about like, okay, what is an asset? It's something you own that provides you value that you can discard at any time. This is a very low bar for people. This is not what people are. And so we try to enable everyone to be a creative agent and to really unlock their superpowers. So a lot of the work I do, you know, I was mentioning earlier, I'm like obsessed with agency. A lot of the work I do with team members is try to figure out like, you know, what are you really good at? What really gives you energy and where can we put you such that, how can I help you unlock that and grow that? So much of our work, you know, in terms of team structure, like much of our work actually comes from people. Carbs, our hyperparameter optimizer came from Abe trying to automate his own research process doing hyperparameter optimization. And he actually pulled some ideas from plasma physics. He's a plasma physicist to make the local search work. A lot of our work on evaluations comes from a couple of members of our team who are like obsessed with evaluations. We do a lot of work trying to figure out like, how do you actually evaluate if the model is getting better? Is the model making better agents? Is the agent actually reliable? A lot of things kind of like, I think of people as making the like them shaped blob inside imbue and I think, you know, yeah, that's the kind of person that we're, we're hiring for. We're hiring product engineers and data engineers and research engineers and all these roles. We have projects, not teams. We have a project around data, data collection and data engineering. That's actually one of the key things that improve the model performance. We have a pre-training kind of project with some fine tuning as part of that. And then we have an agent's project that's like trying to build on top of our models as well as use other models in the outside world to try to make agents then we actually use as programmers every day. So all sorts of different, different projects. [00:15:37]

Swyx: As a founder, you're now sort of a capital allocator among all of these different investments effectively at different projects. And I was interested in how you mentioned that you were optimizing for improving reasoning and specifically inside of your pre-training, which I assume is just a lot of data collection. [00:15:55]

Kanjun: We are optimizing reasoning inside of our pre-trained models. And a lot of that is about data. And I can talk more about like what, you know, what exactly does it involve? But actually big, maybe 50% plus of the work is figuring out even if you do have models that reason well, like the models are still stochastic. The way you prompt them still makes, is kind of random, like makes them do random things. And so how do we get to something that is actually robust and reliable as a user? How can I, as a user, trust it? We have all sorts of cool things on the, like, you know, I was mentioning earlier when I talked to other people building agents, they have to do so much work, like to try to get to something that they can actually productize and it takes a long time and agents haven't been productized yet for, partly for this reason is that like the abstractions are very leaky. We can get like 80% of the way there, but like self-driving cars, like the remaining 20% is actually really difficult. We believe that, and we have internally, I think some things that like an interface, for example, that lets me really easily like see what the agent execution is, fork it, try out different things, modify the prompt, modify like the plan that it is making. This type of interface, it makes it so that I feel more like I'm collaborating with the agent as it's executing, as opposed to it's just like doing something as a black box. That's an example of a type of thing that's like beyond just the model pre-training, but on the model pre-training side, like reasoning is a thing that we optimize for. And a lot of that is about what data do we put in. [00:17:27]

Swyx: It's interesting just because I always think like, you know, out of the levers that you have, the resources that you have, I think a lot of people think that running foundation model company or a research lab is going to be primarily compute. And I think the share of compute has gone down a lot over the past three years. It used to be the main story, like the main way you scale is you just throw more compute at it. And now it's like, Flops is not all you need. You need better data, you need better algorithms. And I wonder where that shift has gone. This is a very vague question, but is it like 30-30-30 now? Is it like maybe even higher? So one way I'll put this is people estimate that Llama2 maybe took about three to $4 million of compute, but probably 20 to $25 million worth of labeling data. And I'm like, okay, well that's a very different story than all these other foundation model labs raising hundreds of millions of dollars and spending it on GPUs. [00:18:20]

Kanjun: Data is really expensive. We generate a lot of data. And so that does help. The generated data is actually close to as good as human-labeled data. [00:18:34]

Swyx: So generated data from other models? [00:18:36]

Kanjun: From our own models. From your own models. Or other models, yeah. [00:18:39]

Swyx: Do you feel like there's certain variations of this? There's the sort of the constitutional AI approach from Anthropic and basically models sampling training on data from other models. I feel like there's a little bit of like contamination in there, or to put it in a statistical form, you're resampling a distribution that you already have that you already know doesn't match human distributions. How do you feel about that basically, just philosophically? [00:19:04]

Kanjun: So when we're optimizing models for reasoning, we are actually trying to like make a part of the distribution really spiky. So in a sense, like that's actually what we want. We want to, because the internet is a sample of the human distribution that's also skewed in all sorts of ways. That is not the data that we necessarily want these models to be trained on. And so when we're generating data, we're not really randomly generating data. We generate very specific things that are like reasoning traces and that help optimize reasoning. Code also is a big piece of improving reasoning. So generated code is not that much worse than like regular human written code. You might even say it can be better in a lot of ways. So yeah. So we are trying to already do that. [00:19:50]

Alessio: What are some of the tools that you thought were not a good fit? So you built Avalon, which is your own simulated world. And when you first started, the metagame was like using games to simulate things using, you know, Minecraft and then OpenAI is like the gym thing and all these things. And I think in one of your other podcasts, you mentioned like Minecraft is like way too slow to actually do any serious work. Is that true? Yeah. I didn't say it. [00:20:17]

Swyx: I don't know. [00:20:18]

Alessio: That's above my pay grade. But Avalon is like a hundred times faster than Minecraft for simulation. When did you figure that out that you needed to just like build your own thing? Was it kind of like your engineering team was like, Hey, this is too slow. Was it more a long-term investment? [00:20:34]

Kanjun: Yeah. At that time we built Avalon as a research environment to help us learn particular things. And one thing we were trying to learn is like, how do you get an agent that is able to do many different tasks? Like RL agents at that time and environments at that time. What we heard from other RL researchers was the like biggest thing holding the field back is lack of benchmarks that let us explore things like planning and curiosity and things like that and have the agent actually perform better if the agent has curiosity. And so we were trying to figure out in a situation where, how can we have agents that are able to handle lots of different types of tasks without the reward being pretty handcrafted? That's a lot of what we had seen is that like these very handcrafted rewards. And so Avalon has like a single reward that's shared across all tasks. And it also allowed us to create a curriculum so we could make the level more or less difficult. And it taught us a lot, maybe two primary things. One is with no curriculum, RL algorithms don't work at all. So that's actually really interesting. [00:21:43]

Swyx: For the non RL specialists, what is a curriculum in your terminology? [00:21:46]

Kanjun: So a curriculum in this particular case is basically that the environment, Avalon, lets us generate simpler environments and harder environments for a given task. What's interesting is that the simpler environments, what you'd expect is the agent succeeds more often. So it gets more reward. And so, you know, kind of my intuitive way of thinking about it is, okay, the reason why it learns much faster with a curriculum is it's just getting a lot more signal. And that's actually an interesting general intuition to have about training these things as like, what kind of signal are they getting? And like, how can you help it get a lot more signal? The second thing we learned is that reinforcement learning is not a good vehicle, like pure reinforcement learning is not a good vehicle for planning and reasoning. So these agents were not able to, they were able to learn all sorts of crazy things. They could learn to climb like hand over hand in VR climbing, they could learn to open doors like very complicated, like multiple switches and a lever open the door, but they couldn't do any higher level things. And they couldn't do those lower level things consistently necessarily. And as a user, I do not want to interact with a pure reinforcement learning end to end RL agent. As a user, like I need much more control over what that agent is doing. And so that actually started to get us on the track of thinking about, okay, how do we do the reasoning part in language? And we were pretty inspired by our friend Chelsea Finn at Stanford, who was I think working on SayCan at the time, where it's basically an experiment where they have robots kind of trying to do different tasks and actually do the reasoning for the robot in natural language. And it worked quite well. And that led us to start experimenting very seriously with reasoning. [00:23:31]
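
A minimal sketch of the curriculum idea Kanjun describes, assuming a generic gym-style environment with a difficulty knob: hold the difficulty fixed until the agent's recent success rate clears a threshold, then ratchet it up so the agent keeps receiving useful reward signal. `make_env`, `agent.act`, and `agent.update` are placeholders, not Imbue or Avalon APIs.

```python
from collections import deque

def curriculum_train(make_env, agent, episodes=10_000,
                     difficulty=0.1, step=0.05, promote_at=0.8):
    """Train with a difficulty ladder: promote once the rolling success rate
    clears `promote_at`. `make_env(difficulty)` returns a gym-style env;
    `agent.act` / `agent.update` are placeholder hooks for any RL algorithm."""
    recent = deque(maxlen=100)                 # rolling window of episode outcomes
    for _ in range(episodes):
        env = make_env(difficulty)
        obs, _ = env.reset()
        done, episode_reward = False, 0.0
        while not done:
            action = agent.act(obs)
            obs, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            agent.update(obs, reward, done)
            episode_reward += reward
        recent.append(1.0 if episode_reward > 0 else 0.0)
        # Promote to a harder level once the agent mostly succeeds at this one.
        if len(recent) == recent.maxlen and sum(recent) / len(recent) >= promote_at:
            difficulty = min(1.0, difficulty + step)
            recent.clear()
    return agent, difficulty
```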

Alessio: How important is the language part for the agent versus for you to inspect the agent? You know, like is it the interface to kind of the human on the loop really important or? [00:23:43]

Kanjun: Yeah, I personally think of it as it's much more important for us, the human user. So I think you probably could get end to end agents that work and are fairly general at some point in the future. But I think you don't want that. Like we actually want agents that we can like perturb while they're trying to figure out what to do. Because, you know, even a very simple example, internally we have like a type error fixing agent and we have like a test generation agent. Test generation agent goes off rails all the time. I want to know, like, why did it generate this particular test? [00:24:19]

Swyx: What was it thinking? [00:24:20]

Kanjun: Did it consider, you know, the fact that this is calling out to this other function? And the formatter agent, if it ever comes up with anything weird, I want to be able to debug like what happened with RL end to end stuff. Like we couldn't do that. Yeah. [00:24:36]

Swyx: It sounds like you have a bunch of agents operating internally within the company. What's your most, I guess, successful agent and what's your least successful one? [00:24:44]

Kanjun: The agents don't work. All of them? I think the only successful agents are the ones that do really small things. So very specific, small things like fix the color of this button on the website or like change the color of this button. [00:24:57]

Swyx: Which is now sweep.dev is doing that. Exactly. [00:25:00]

Kanjun: Perfect. Okay. [00:25:02]

Swyx: Well, we should just use sweep.dev. Well, I mean, okay. I don't know how often you have to fix the color of a button, right? Because all of them raise money on the idea that they can go further. And my fear when encountering something like that is that there's some kind of unknown asymptote ceiling that's going to prevent them, that they're going to run head on into that you've already run into. [00:25:21]

Kanjun: We've definitely run into such a ceiling. But what is the ceiling? [00:25:24]

Swyx: Is there a name for it? Like what? [00:25:26]

Kanjun: I mean, for us, we think of it as reasoning plus these tools. So reasoning plus abstractions, basically. I think actually you can get really far with current models and that's why it's so compelling. Like we can pile debugging tools on top of these current models, have them critique each other and critique themselves and do all of these, like spend more compute at inference time, context hack, retrieval augmented generation, et cetera, et cetera, et cetera. Like the pile of hacks actually does get us really far. And a way to think about it is like the underlying language model is kind of like a noisy channel. Actually I don't want to use this analogy. It's actually a really bad analogy, but you kind of like trying to get more signal out of the channel. We don't like to think about it that way. It's what the default approach is, is like trying to get more signal out of this noisy channel. But the issue with agents is as a user, I want it to be mostly reliable. It's kind of like self-driving in that way. Like it's not as bad as self-driving, like in self-driving, you know, you're like hurtling at 70 miles an hour. It's like the hardest agent problem. But one thing we learned from Sorceress and one thing we learned by using these things internally is we actually have a pretty high bar for these agents to work. You know, it's actually really annoying if they only work 50% of the time and we can make interfaces to make it slightly less annoying. But yeah, there's a ceiling that we've encountered so far and we need to make the models better. We also need to make the kind of like interface to the user better. And also a lot of the like critiquing. I hope what we can do is help people who are building agents actually like be able to deploy them. I think, you know, that's the gap that we see a lot of today is everyone who's trying to build agents to get to the point where it's robust enough to be deployable. It just, it's like an unknown amount of time. Okay. [00:27:12]

Swyx: So this goes back into what Embu is going to offer as a product or a platform. How are you going to actually help people deploy those agents? Yeah. [00:27:21]

Kanjun: So our current hypothesis, I don't know if this is actually going to end up being the case. We've built a lot of tools for ourselves internally around like debugging, around abstractions or techniques after the model generation happens. Like after the language model generates the text and like interfaces for the user and the underlying model itself, like models talking to each other, maybe some set of those things kind of like an operating system. Some set of those things will be helpful for other people. And we'll figure out what set of those things is helpful for us to make our agents. Like what we want to do is get to a point where we can like start making an agent, deploy it, it's reliable, like very quickly. And there's a similar analog to software engineering, like in the early days, in the seventies and the sixties, like to program a computer, like you have to go all the way down to the registers and write things and eventually we had assembly. That was like an improvement. But then we wrote programming languages with these higher levels of abstraction and that allowed a lot more people to do this and much faster. And the software created is much less expensive. And I think it's basically a similar route here where we're like in the like bare metal phase of agent building. And we will eventually get to something with much nicer abstractions. [00:28:36]

Alessio: We had this conversation with George Hotz and we were like, there's not a lot of reasoning data out there. And can the models really understand? And his take was like, look, with enough compute, you're not that complicated as a human. Like the model can figure out eventually why certain decisions are made. What's been your experience? Like as you think about reasoning data, like do you have to do a lot of like manual work or like is there a way to prompt models to extract the reasoning from actions that they see? [00:29:03]

Kanjun: So we don't think of it as, oh, throw enough data at it and then it will figure out what the plan should be. I think we're much more explicit. You know, a way to think about it is as humans, we've learned a lot of reasoning strategies over time. We are better at reasoning now than we were 3000 years ago. An example of a reasoning strategy is noticing you're confused. Then when I notice I'm confused, I should ask like, huh, what was the original claim that was made? What evidence is there for this claim? Does the evidence support the claim? Is the claim correct? This is like a reasoning strategy that was developed in like the 1600s, you know, with like the advent of science. So that's an example of a reasoning strategy. There are tons of them. We employ all the time, lots of heuristics that help us be better at reasoning. And we didn't always have them. And because they're invented, like we can generate data that's much more specific to them. So I think internally, yeah, we have a lot of thoughts on what reasoning is and we generate a lot more specific data. We're not just like, oh, it'll figure out reasoning from this black box or like it'll figure out reasoning from the data that exists. Yeah. [00:30:04]

Alessio: I mean, the scientific method is like a good example. If you think about hallucination, right, people are thinking, how do we use these models to do net new, like scientific research? And if you go back in time and the model is like, well, the earth revolves around the sun and people are like, man, this model is crap. It's like, what are you talking about? Like the sun revolves around the earth. It's like, how do you see the future? Like if the models are actually good enough, but we don't believe them, it's like, how do we make the two live together? So you're like, you use Imbue as a scientist to do a lot of your research and Imbue tells you, hey, I think this is like a serious path you should go down. And you're like, no, that sounds impossible. Like how is that trust going to be built? And like, what are some of the tools that maybe are going to be there to inspect it? [00:30:51]

Kanjun: Really there are two answers to this. One element of it is as a person, like I need to basically get information out of the model such that I can try to understand what's going on with the model. Then the second question is like, okay, how do you do that? And that's kind of some of our debugging tools, they're not necessarily just for debugging. They're also for like interfacing with and interacting with the model. So like if I go back in this reasoning trace and like change a bunch of things, what's going to happen? Like, what does it conclude instead? So that kind of helps me understand like, what are its assumptions? And, you know, we think of these things as tools. And so it's really about like, as a user, how do I use this tool effectively? I need to be willing to be convinced as well. It's like, how do I use this tool effectively? And what can it help me with? [00:31:36]

Swyx: And what can it tell me? There's a lot of mention of code in your process. And I was hoping to dive in even deeper. I think we might run the risk of giving people the impression that you view code or you use code just as like a tool within Imbue just for coding assistance. But I think you actually train code models. And I think there's a lot of informal understanding about how adding code to language models improves their reasoning capabilities. I wonder if there's any research or findings that you have to share that talks about the intersection of code and reasoning. Hmm. Yeah. [00:32:08]

Kanjun: So the way I think about it intuitively is like code is the most explicit example of reasoning data on the internet. [00:32:15]

Swyx: Yeah. [00:32:15]

Kanjun: And it's not only structured, it's actually very explicit, which is nice. You know, it says this variable means this, and then it uses this variable. And then the function does this. As people, when we talk in language, it takes a lot more to extract that explicit structure out of our language. And so that's one thing that's really nice about code is I see it as almost like a curriculum for reasoning. I think we use code in all sorts of ways. The coding agents are really helpful for us to understand what are the limitations of the agents. The code is really helpful for the reasoning itself. But also code is a way for models to act. So by generating code, it can act on my computer. And, you know, when we talk about rekindling the dream of the personal computer, kind of where I see computers going is, you know, like computers will eventually become these much more malleable things where I, as a user today, I have to know how to write software code, like in order to make my computer do exactly what I want it to do. But in the future, if the computer is able to generate its own code, then I can actually interface with it in natural language. And so one way we think about agents is kind of like a natural language programming language. It's a way to program my computer in natural language that's much more intuitive to me as a user. And these interfaces that we're building are essentially IDEs for users to program our computers in natural language. Maybe I should say what we're doing that way. Maybe it's clearer. [00:33:47]
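
A bare-bones sketch of the "code as a way for models to act" pattern Kanjun describes, not how Imbue actually does it: the model turns a natural-language request into Python, the user reviews it, and only then is it run. `call_llm` is a placeholder, and a real system would need proper sandboxing rather than a bare `exec`.

```python
def act_via_code(request: str, call_llm) -> None:
    """Turn a natural-language request into Python and run it only after the
    user has reviewed it. `call_llm` is a placeholder model client; a real
    system needs sandboxing, permissions, and inspection, not a bare exec()."""
    code = call_llm(
        "Write a short, self-contained Python snippet that does the following. "
        f"Reply with code only, no explanation:\n{request}"
    )
    print("Proposed code:\n", code)
    if input("Run it? [y/N] ").strip().lower() == "y":
        exec(code)  # illustrative only; never run unreviewed code unsandboxed
```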

Swyx: I don't know. [00:33:47]

Alessio: That's a good pitch. What do you think about the different approaches people have, kind of like text first, browser first, like multi-on? What do you think the best interface will be? Or like, what is your, you know, thinking today? [00:33:59]

Kanjun: In a lot of ways, like chat as an interface, I think Linus, Linus Lee, you had on this. I really like how he put it. Chat as an interface is skeuomorphic. So in the early days, when we made word processors on our computers, they had notepad lines because that's what we understood these like objects to be. Chat, like texting someone is something we understand. So texting our AI is something that we understand. But today's word documents don't have notepad lines. And similarly, the way we want to interact with agents, like chat is a very primitive way of interacting with agents. What we want is to be able to inspect their state and to be able to modify them and fork them and all of these other things. And we internally have, think about what are the right representations for that? Like architecturally, like what are the right representations? What kind of abstractions do we need to build? And how do we build abstractions that are not leaky? Because if the abstractions are leaky, which they are today, like, you know, this stochastic generation of text is like a leaky abstraction. I cannot depend on it. And that means it's actually really hard to build on top of. But our experience and belief is actually by building better abstractions and better tooling, we can actually make these things non-leaky. And now you can build like whole things on top of them. So these other interfaces, because of where we are, we don't think that much about them. [00:35:17]

Swyx: Yeah. [00:35:17]

Alessio: I mean, you mentioned, this is kind of like the Xerox PARC moment for AI. And we had a lot of stuff come out of PARC, like the, what-you-see-is-what-you-get editors and like MVC and all this stuff. But yeah, but then we didn't have the iPhone at PARC. We didn't have all these like higher things. What do you think it's reasonable to expect in like this era of AI, you know, call it like five years or so? Like what are like the things we'll build today and what are things that maybe we'll see in kind of like the second wave of products? [00:35:46]

Kanjun: That's interesting. I think the waves will be much faster than before. Like what we're seeing right now is basically like a continuous wave. Let me zoom a little bit earlier. So people like the Xerox Parc analogy I give, but I think there are many different analogies. Like one is the like analog to digital computer is kind of an example, like another analogy to where we are today. The analog computer Vannevar Bush built in the 1930s, I think, and it's like a system of pulleys and it can only calculate one function. Like it can calculate like an integral. And that was so magical at the time because you actually did need to calculate this integral bunch, but it had a bunch of issues like in analog errors compound. And so there was actually a set of breakthroughs necessary in order to get to the digital computer, like Turing's decidability, Shannon. I think the like whole like relay circuits can be thought of as can be mapped to Boolean operators and a set of other like theoretical breakthroughs, which essentially were abstractions. They were like creating abstractions for these like very like lossy circuits. They were creating abstractions for these like very analog circuits and digital had this nice property of like being error correcting. And so when I talk about like less leaky abstractions, that's what I mean. That's what I'm kind of pointing a little bit to. It's not going to look exactly the same way. And then the Xerox PARC piece, a lot of that is about like, how do we get to computers that as a person, I can actually use well. And the interface actually helps it unlock so much more power. So the sets of things we're working on, like the sets of abstractions and the interfaces, like hopefully that like help us unlock a lot more power in these systems. Like hopefully that'll come not too far in the future. I could see a next version, maybe a little bit farther out. It's like an agent protocol. So a way for different agents to talk to each other and call each other. Kind of like HTTP. [00:37:40]

Swyx: Do you know it exists already? [00:37:41]

Kanjun: Yeah, there is a nonprofit that's working on one. I think it's a bit early, but it's interesting to think about right now. Part of why I think it's early is because the issue with agents, it's not quite like the internet where you could like make a website and the website would appear. The issue with agents is that they don't work. And so it may be a bit early to figure out what the protocol is before we really understand how these agents get constructed. But, you know, I think that's, I think it's a really interesting question. [00:38:09]

Swyx: While we're talking on this agent to agent thing, there's been a bit of research recently on some of these approaches. I tend to just call them extremely complicated chain of thoughting, but any perspectives on kind of MetaGPT, I think it's the name of the paper. I don't know if you care about at the level of individual papers coming out, but I did read that recently and TLDR, it beat GPT-4 on HumanEval by role-playing a software agent development agency, instead of having sort of single shot or single role, you have multiple roles and how having all of them criticize each other as agents communicating with other agents. [00:38:45]

Kanjun: Yeah, I think this is an example of an interesting abstraction of like, okay, can I just plop in this like multi-role critiquing and see how it improves my agent? And can I just plop in chain of thought, tree of thought, plop in these other things and see how they improve my agent? One issue with this kind of prompting is that it's still not very reliable. It's like, there's one lens, which is like, okay, if you do enough of these techniques, you'll get to high reliability. And I think actually that's a pretty reasonable lens. We take that lens often. And then there's another lens that's like, okay, but it's starting to get really messy what's in the prompt and like, how do we deal with that messiness? And so maybe you need like cleaner ways of thinking about and constructing these systems. And we also take that lens. So yeah, I think both are necessary. Yeah. [00:39:29]
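
As a concrete example of the kind of "plop in multi-role critiquing" abstraction being discussed, here is a minimal sketch of a draft-and-critique loop; `call_llm` is a placeholder model client, and the prompts are purely illustrative.

```python
def draft_and_critique(task: str, call_llm, max_rounds: int = 3) -> str:
    """One role plays author, another plays reviewer; the draft is revised
    until the reviewer approves or the retry budget runs out. `call_llm` is a
    placeholder for any prompt-to-text model client."""
    draft = call_llm(f"You are the author. Complete this task:\n{task}")
    for _ in range(max_rounds):
        critique = call_llm(
            "You are a strict reviewer. List concrete problems with the answer, "
            f"or reply exactly 'LGTM' if it is acceptable.\n\nTask: {task}\n\nAnswer:\n{draft}"
        )
        if critique.strip() == "LGTM":
            break
        draft = call_llm(
            f"Revise the answer to address the critique.\n\nTask: {task}\n\n"
            f"Answer:\n{draft}\n\nCritique:\n{critique}"
        )
    return draft
```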

Swyx: Side question, because I feel like this also brought up another question I had for you. I noticed that you work a lot with your own benchmarks, your own evaluations of what is valuable. I would say I would contrast your approach with OpenAI as OpenAI tends to just lean on, hey, we played StarCraft or hey, we ran it on the SAT or the, you know, the AP bio test and that did results. Basically, is benchmark culture ruining AI? [00:39:55]

Swyx: Or is that actually a good thing? Because everyone knows what an SAT is and that's fine. [00:40:04]

Kanjun: I think it's important to use both public and internal benchmarks. Part of why we build our own benchmarks is that there are not very many good benchmarks for agents, actually. And to evaluate these things, you actually need to think about it in a slightly different way. But we also do use a lot of public benchmarks for like, is the reasoning capability in this particular way improving? So yeah, it's good to use both. [00:40:26]

Swyx: So for example, the Voyager paper coming out of NVIDIA played Minecraft and set their own benchmarks on getting the Diamond X or whatever and exploring as much of the territory as possible. And I don't know how that's received. That's obviously fun and novel for the rest of the engineer, the people who are new to the scene. But for people like yourselves, you build Avalon just because you already found deficiencies with using Minecraft. Is that valuable as an approach? Oh, yeah. I love Voyager. [00:40:57]

Kanjun: I mean, Jim, I think is awesome. And I really like the Voyager paper and I think it has a lot of really interesting ideas, which is like the agent can create tools for itself and then use those tools. [00:41:06]

Swyx: He had the idea of the curriculum as well, which is something that we talked about earlier. Exactly. [00:41:09]

Kanjun: And that's like a lot of what we do. We built Avalon mostly because we couldn't use Minecraft very well to like learn the things we wanted. And so it's like not that much work to build our own. [00:41:19]

Swyx: It took us, I don't know. [00:41:22]

Kanjun: We had like eight engineers at the time, took about eight weeks. So six weeks. [00:41:27]

Swyx: And OpenAI built their own as well, right? Yeah, exactly. [00:41:30]

Kanjun: It's just nice to have control over our environment. We're doing our own sandbox to really try to inspect our own research questions. But if you're doing something like experimenting with agents and trying to get them to do things, Minecraft is a really interesting environment. And so Voyager has a lot of really interesting ideas in it. [00:41:47]

Swyx: Yeah. Cool. One more element that we had on this list, which is context and memory. I think that's kind of like the foundational, quote unquote, RAM of our era. I think Andrej Karpathy has already made this comparison. So there's nothing new here. And that's just the amount of working knowledge that we can fit into one of these agents. And it's not a lot, right? Especially if you need to get them to do long running tasks. If they need to self-correct from errors that they observe while operating in their environment. Do you see this as a problem? Do you think we're going to just trend to infinite context and that'll go away? Or how do you think we're going to deal with it? [00:42:22]

Kanjun: I think when you talked about what's going to happen in the first wave and then in the second wave, I think what we'll see is we'll get like relatively simplistic agents pretty soon. And they will get more and more complex. And there's like a future wave in which they are able to do these like really difficult, really long running tasks. And the blocker to that future, one of the blockers is memory. And that was true of computers too. You know, I think when von Neumann made the von Neumann architecture, he was like, the biggest blocker will be like, we need this amount of memory, which is like, I don't remember exactly like 32 kilobytes or something to store programs. And that will allow us to write software. He didn't say it this way because he didn't have these terms, but that only really was like happened in the seventies with the microchip revolution. It may be the case that we're waiting for some research breakthroughs or some other breakthroughs in order for us to have like really good long running memory. And then in the meantime, agents will be able to do all sorts of things that are a little bit smaller than that. I do think with the pace of the field, we'll probably come up with all sorts of interesting things like, you know, RAG is already very helpful. [00:43:26]

Swyx: Good enough, you think? [00:43:27]

Kanjun: Maybe good enough for some things. [00:43:29]

Swyx: How is it not good enough? I don't know. [00:43:31]

Kanjun: I just think about a situation where you want something that's like an AI scientist. As a scientist, I have learned so much about my fields and a lot of that data is maybe hard to fine tune or on, or maybe hard to like put into pre-training. Like a lot of that data, I don't have a lot of like repeats of the data that I'm seeing. You know, like if I'm a scientist, I've like accumulated so many little data points. And ideally I'd want to store those somehow, or like use those to fine tune myself as a model somehow, or like have better memory somehow. I don't think RAG is enough for that kind of thing. But RAG is certainly enough for like user preferences and things like that. Like what should I do in this situation? What should I do in that situation? That's a lot of tasks. We don't have to be a scientist right away. Awesome. [00:44:21]

Swyx: I have a hard question, if you don't mind me being bold. Yeah. I think the most comparable lab to Imbue is Adept. You know, a research lab with like some amount of product situation on the horizon, but not just yet, right? Why should people work for Imbue over Adept? And we can cut this if it's too like... Yeah. [00:44:40]

Kanjun: The way I think about it is I believe in our approach. The type of thing that we're doing is we're trying to like build something that enables other people to build agents and build something that really can be maybe something like an operating system for agents. I know that that's what we're doing. I don't really know what everyone else is doing. You know, I can kind of like talk to people and have some sense of what they're doing. And I think it's a mistake to focus too much on what other people are doing, because extremely focused execution on the right thing is what matters. To the question of like, why us? I think like strong focus on reasoning, which we believe is the biggest blocker, on inspectability, which we believe is really important for user experience and also for the power and capability of these systems. Building non-leaky, good abstractions, which we believe is solving the core issue of agents, which is around reliability and being able to make them deployable. And then really seriously trying to use these things ourselves, like every single day, and getting to something that we can actually ship to other people that becomes something that is a platform. Like, it feels like it could be Mac or Windows. I love the dogfooding approach. [00:45:49]

Swyx: That's extremely important. And you will not be surprised how many agent companies I talk to that don't use their own agent. Oh no, that's not good. That's a big surprise. [00:45:59]

Kanjun: Yeah, I think if we didn't use our own agents, then we would have all of these beliefs about how good they are. Wait, did you have any other hard questions you wanted to ask? [00:46:08]

Swyx: Yeah, mine was just the only other follow-up that you had based on the answer you just gave was, do you see yourself releasing models or do you see yourself, what is the artifacts that you want to produce that lead up to the general operating system that you want to have people use, right? And so a lot of people just as a byproduct of their work, just to say like, hey, I'm still shipping, is like, here's a model along the way. Adept took, I don't know, three years, but they released Persimmon recently, right? Like, do you think that kind of approach is something on your horizon? Or do you think there's something else that you can release that can show people, here's kind of the idea, not the end products, but here's the byproducts of what we're doing? [00:46:51]

Kanjun: Yeah, I don't really believe in releasing things to show people like, oh, here's what we're doing that much. I think as a philosophy, we believe in releasing things that will be helpful to other people. [00:47:02]

Swyx: Yeah. [00:47:02]

Kanjun: And so I think we may release models or we may release tools that we think will help agent builders. Ideally, we would be able to do something like that, but I'm not sure exactly what they look like yet. [00:47:14]

Swyx: I think more companies should get into the releasing evals and benchmarks game. Yeah. [00:47:20]

Kanjun: Something that we have been talking to agent builders about is co-building evals. So we build a lot of our own evals and every agent builder tells me, basically evals are their biggest issue. And so, yeah, we're exploring right now. And if you are building agents, please reach out to me because I would love to, like, figure out how we can be helpful based on what we've seen. Cool. [00:47:40]

Swyx: That's a good call to action. I know a bunch of people that I can send your way. Cool. Great. [00:47:43]

Kanjun: Awesome. [00:47:44]

Swyx: Yeah. We can zoom out to other interests now. [00:47:46]

Alessio: We got a lot of stuff. So we had Sharif from Lexica on the podcast. He had a lot of interesting questions on his website. You similarly have a lot of them. Yeah. [00:47:55]

Swyx: I need to do this. I'm very jealous of people with personal websites right there. Like, here's the high level questions of goals of humanity that I want to set people on. And I don't have that. [00:48:04]

Alessio: It's never too late, Sean. [00:48:05]

Swyx: Yeah. [00:48:05]

Alessio: It's never too late. [00:48:06]

Kanjun: Exactly. [00:48:07]

Alessio: There were a few that stuck out as related to your work that maybe you're kind of learning [00:48:12]

Swyx: more about it. [00:48:12]

Alessio: So one is why are curiosity and goal orientation often at odds? And from a human perspective, I get it. It's like, you know, would you want to like go explore things or kind of like focus on your career? How do you think about that from like an agent perspective? Where it's like, should you just stick to the task and try and solve it as in the guardrails as possible? Or like, should you look for alternative solutions? [00:48:34]

Swyx: Yeah. [00:48:34]

Kanjun: I think one thing that's really interesting about agents actually is that they can be forked. Like, you know, we can take an agent that's executed to a certain place and said, okay, here, like fork this and do a bunch of different things. I try a bunch of different things. Some of those agents can be goal oriented and some of them can be like more curiosity driven. You can prompt them in slightly different ways. And something I'm really curious about, like what would happen if in the future, you know, we were able to actually go down both paths. As a person, why I have this question on my website is I really find that like I really can only take one mode at a time and I don't understand why. And like, is it inherent in like the kind of context that needs to be held? That's why I think from an agent perspective, like forking it is really interesting. Like I can't fork myself to do both, but I maybe could fork an agent to like add a certain point in a task. [00:49:26]

Swyx: Yeah. Explore both. Yeah. [00:49:28]

Alessio: How has the thinking changed for you as the funding of the company changed? That's one thing that I think a lot of people in the space think is like, oh, should I raise venture capital? Like, how should I get money? How do you feel your options to be curious versus like goal oriented has changed as you raise more money and kind of like the company has grown? [00:49:50]

Kanjun: Oh, that's really funny. Actually, things have not changed that much. So we raised our Series A $20 million in late 2021. And our entire philosophy at that time was, and still kind of is, is like, how do we figure out the stepping stones, like collect stepping stones that eventually let us build agents, kind of these new computers that help us do bigger things. And there was a lot of curiosity in that. And there was a lot of goal orientation in that. Like the curiosity led us to build CARBS, for example, this hyperparameter optimizer. Great name, by the way. [00:50:28]

Swyx: Thank you. [00:50:29]

Kanjun: Is there a story behind that name? [00:50:30]

Swyx: Yeah. [00:50:31]

Kanjun: Abe loves CARBS. It's also cost aware. So as soon as he came up with cost aware, he was like, I need to figure out how to make this work. But the cost awareness of it was really important. So that curiosity led us to this really cool hyperparameter optimizer. That's actually a big part of how we do our research. It lets us experiment on smaller models. And for those experiment results to carry to larger ones. [00:50:56]

Swyx: Which you also published a scaling laws, which is great. I think the scaling laws paper from OpenAI was like the biggest. And from Google, I think, was the greatest public service to machine learning that any research lab can do. Yeah, totally. [00:51:10]

Kanjun: What was nice about CARBS is it gave us scaling laws for all sorts of hyperparameters. So yeah, that's cool. It basically hasn't changed very much. So there's some curiosity. And then there's some goal oriented parts. Like Avalon, it was like a six to eight week sprint for all of us. And we got this thing out. And then now different projects do like more curiosity or more goal orientation at different times. Cool. [00:51:36]

Swyx: Another one of your questions that we highlighted was, how can we enable artificial agents to permanently learn new abstractions and processes? I think this might be called online learning. [00:51:45]

Kanjun: Yeah. So I struggle with this because, you know, that scientist example I gave. As a scientist, I've like permanently learned a lot of new things. And I've updated and created new abstractions and learned them pretty reliably. And you were talking about like, okay, we have this RAM that we can store learnings in. But how well does online learning actually work? And the answer right now seems to be like, as models get bigger, they fine tune faster. So they're more sample efficient as they get bigger. [00:52:15]

Swyx: Because they already had that knowledge in there. You're just kind of unlocking it. [00:52:23]

Kanjun: Partly maybe because they already have like some subset of the representation. Partly they just memorize things more, which is good. So maybe this question is going to be solved, but I still don't know what the answer is. [00:52:36]

Swyx: What if you had a platform that continually fine-tunes for you as you work on that domain, which is something I'm working on. Well, that's great. We would love to use that. We'll talk more. Two more questions just about your general activities. I think you've just been very active in the San Francisco tech scene. You're a founding member of South Park Commons. [00:52:56]

Kanjun: Oh yeah, that's true. [00:52:57]

Swyx: Tell me more. By the time I knew about SPC, it was already a very established thing. But what was it like in the early days? What was the story there? [00:53:05]

Kanjun: Yeah, the story is Ruchi, who started it, was the VP of operations at Dropbox. And I was the chief of staff and we worked together very closely. She's actually one of the investors in Sorceress. And SPC is an investor in Imbue. And at that time, Ruchi was like, you know, I would like to start a space for people who are figuring out what's next. And we were figuring out what's next post-Ember, those three months. And she was like, do you want to just like hang out in this space? And we're like, sure. And it was a really good group. Wasim and Jeff from Pilot, the folks from Zulip, and a bunch of other people at that time. It was a really good group. We just hung out. There was no programming. It's much more official now than it was at that time. [00:53:44]

Swyx: Yeah, now it's like a YC before YC type of thing. That's right, yeah. [00:53:48]

Kanjun: At that time, we literally, it was a bunch of friends hanging out in the space together. [00:53:51]

Swyx: And was this concurrent with the Archive? [00:53:53]

Kanjun: Oh yeah, actually, I think we started the Archive around the same time. [00:53:56]

Swyx: You're just like really big into community. But also like, so, you know, I run a Hacker House and I'm also part of hopefully what becomes like the next South Park Commons or whatever. What are the principles in organizing communities like that with really exceptional people that go on to do great things? Do you have to be really picky about who joins? Like all your friends just magically turn out super successful like that. You know, it's not normal, right? Like this is very special. And a lot of people want to do that and fail. And you had the co-authors of GPT-3 in your house. That's true. [00:54:32]

Kanjun: And a lot of other really cool people that you'll eventually hear about. [00:54:35]

Swyx: Co-founders of Pilot and anyone else. I don't want you to pick your friends, but there's some magic special sauce in getting people together and in one workspace, living space, whatever, right? And that's part of why I'm here in San Francisco. And I would love for more people to learn about it and also maybe get inspired to build their own. [00:54:52]

Kanjun: Your question is really more about like, how do you actually build a community that where people in it are like eventually are awesome? [00:54:59]

Swyx: Okay. [00:55:00]

Kanjun: Which is different than like why live in a co-living house. So one adage we had when we started the archive was you become the average of the five people closest to you. [00:55:08]

Swyx: Yes. [00:55:08]

Kanjun: And I think that's roughly true. And good people draw good people. So there are really two things. One, we were quite picky and it mattered a lot to us. Is this someone where if they're hanging out in the living room, we'd be really excited to come hang out. Yeah. Two is I think we did a really good job of creating a high growth environment and an environment where people felt really safe. We actually apply these things to our team and it works remarkably well as well. So I do a lot of basically how do I create safe spaces for people where it's not just like safe law, but like it's like a safe space where people really feel inspired by each other. And I think at the archive, we really made each other better. My friend, Michael Nielsen called it a self-actualization machine. [00:55:52]

Swyx: My goodness. Okay. [00:55:54]

Kanjun: And I think, yeah, people came in. Was he a part of the archive? He was not, but he hung out a lot. Honorary member. Friend of the archive. [00:56:02]

Swyx: Yeah. [00:56:02]

Kanjun: The culture was that we learned a lot of things from each other about like how to make better life systems and how to think about ourselves and psychological debugging. And a lot of us were founders. So having other founders going through similar things was really helpful. And a lot of us worked in AI. And so having other people to talk about AI with was really helpful. And so I think all of those things led to a form of idea flux and also kind of like, so I think a lot about like the idea flux and default habits or default impulses. It led to a set of idea flux and default impulses that led to some really interesting things and led to us doing much bigger things, I think, than we otherwise would have decided to do because it felt like taking risks was less risky. So that's something we do a lot of on the team. It's like, how do we make it so that taking risks is less risky? And there's a term called scenius. [00:56:57]

Swyx: Yes. I was thinking Kevin Kelly. Kevin Kelly, scenius. I was going to feed you that word, but I didn't want to like bias you. Yes. [00:57:02]

Kanjun: I think maybe like a lot of what I'm interested in is constructing a kind of scenius. And the archive was definitely a scenius in a particular, or like getting toward a scenius in a particular way. And Jason Ben, my archive housemate and who now runs the Neighborhood, [00:57:17]

Swyx: has a good way of putting it. [00:57:17]

Kanjun: If genius is from your genes, scenius is from your scene. Yeah, I think like maybe a lot of the community building impulse is from this like interest in what kind of idea flux can be created. You know, there's a question of like, why did Xerox PARC come out with all of this interesting stuff? It's their scenius. Why did Bell Labs come out with all this interesting stuff? Maybe it's their scenius. Why didn't the transistor come out of Princeton? And the other people working on it at the time. [00:57:44]

Swyx: I just think it's remarkable how you hear a lot about Alan Kay. And I just read a bit. And apparently Alan Kay was like the most junior guy at Xerox PARC. Yeah, definitely. [00:57:53]

Kanjun: He's just the one who talks about it. He talks the most. [00:57:57]

Swyx: Yeah, exactly. Yeah. So I, you know, hopefully I'm also working towards contributing to that scenius. I called mine the more provocative name of the arena. Interesting. That's quite provocative. In the arena. [00:58:08]

Kanjun: So are you fighting other people in the arena? [00:58:11]

Swyx: No. You never know. [00:58:12]

Alessio: On any day in the mission, it's an adventure. [00:58:15]

Swyx: We're in the arena trying stuff, as they say. You are also a GP at Outset Capital, where you also co-organize the Thursday Nights in AI, where hopefully someday I'll eventually speak. You're on the roster. [00:58:28]

Kanjun: I'm on the roster. [00:58:29]

Swyx: Thank you so much. So why spend time being a VC and organizing all these events? You're also a very busy CEO and, you know, why spend time with that? Why is that an important part of your life? [00:58:39]

Kanjun: Yeah, for me personally, I really like helping founders. So Allie, my investing partner, is fortunately amazing and she does everything for the fund. So she like hosts the Thursday night events and she finds folks who we could invest in. And she does basically everything. Josh and I are her co-partners. So Allie was our former chief of staff at Sorceress. We just thought she was amazing. She wanted to be an investor. And Josh and I also like care about helping founders and kind of like giving back to the community. What we didn't realize at the time when we started the fund is that it would actually be incredibly helpful for Imbue. So talking to AI founders who are building agents and working on, you know, similar things is really helpful. They could potentially be our customers and they're trying out all sorts of interesting things. And I think being an investor, looking at the space from the other side of the table, it's just a different hat that I routinely put on. And it's helpful to see the space from the investor lens as opposed to from the founder lens. So I find that kind of like hat switching valuable. It maybe would lead us to do slightly different things. [00:59:44]

Swyx: Awesome. Appreciate that. [00:59:46]

Alessio: Yeah, you've been really generous with your time. Let's just wrap with the lightning round. Okay. So we have two questions, acceleration, exploration, and then a takeaway. So the acceleration question is, what's something that already happened in AI that you thought would take much longer to be here? [01:00:03]

Kanjun: I think the rate at which we discover new capabilities of existing models and kind of like build hacks on top of them to make them work better is something that has been surprising and awesome. And the research community building on its own ideas, that's probably, you want something very specific. Yeah, I think the rate at which we discovered capabilities probably. [01:00:22]

Swyx: Cool. Exploration slash requests for startups. If you weren't building Imbue, what AI company would you build? Hmm. Every founder has like their like number two. Really? Yeah, I don't know. [01:00:33]

Kanjun: Wow. I cannot imagine building any other thing than Imbue. [01:00:37]

Swyx: Wow. Well, that's a great answer too. [01:00:38]

Kanjun: It's like obviously the thing to build. [01:00:42]

Swyx: Okay. [01:00:42]

Kanjun: It's like obviously work on the fundamental platform. Yeah. [01:00:46]

Swyx: So that was my attempt at innovating this question, but the previous one was, but what was the most interesting unsolved question in AI? [01:00:53]

Kanjun: My answer is kind of boring, but the most interesting unsolved questions are these questions of, how do we make these stochastic systems into things that we can like reliably use and build on top of? [01:01:04]

Swyx: Yep. [01:01:05]

Alessio: And yeah, take away what's one message you want everyone to remember? [01:01:09]

Kanjun: Maybe two things. One is just the like, we're in a historic moment. I didn't think in my lifetime I would necessarily be in, like able to work on the things I'm excited to work on in this moment, but we're in a historic moment that where we'll look back and be like, oh my God, the future was invented in these years. And I think like, there may be a set of messages to take away from that. One is like, AI is a tool like any technology. And you know, when it comes to things like, what might the future look like? Like we like to think about it as, it's like just a better computer. It's like much more powerful computer that gives us a lot of free intellectual energy that we can now like solve so many problems with. You know, there are so many problems in the world [01:01:53]

Swyx: where we're like, [01:01:53]

Kanjun: oh, it's not worth a person thinking about that. And so things get worse and things get worse. No one wants to work on maintenance. And like this technology gives us the potential to actually be able to like allocate intellectual energy to all of those problems. And the world could be much better, like could be much more thoughtful because of that. I'm so excited about that. And there are definitely risks and dangers. And we actually do a fair, something I didn't talk about is we do a fair amount of work on the policy side. On the safety side, like we think about safety and policy in terms of engineering theory and also regulation. And kind of comparing to like the automobile or the airplane or any new technology, there's like a set of new possible capabilities and a set of new possible dangers that are unlocked with every new technology. And so on the engineering side, like we think a lot about engineering safety, like how do we actually engineer these systems so that they are inspectable and why we reason in natural language so that the systems are very inspectable so that we can like stop things if anything weird is happening. That's why we don't think end-to-end black boxes [01:02:58]

Swyx: are a good idea. [01:02:58]

Kanjun: On the theoretical side, we like really believe in like deeply understanding, like when we actually fine tune on individual examples, like what's going on, when we're pre-training, what's going on, like debugging tools for these agents to understand like what's going on. And then on the regulation side, I think there's actually a lot of regulation that already covers many of the dangers like that people are talking about. And there are areas where there's not much regulation. And so we focus on those areas where there's not much regulation. So some of our work is actually, we built an agent that helped us analyze the 20,000 pages of policy proposals submitted to the Department of Commerce request for AI policy proposals. We looked at what were the problems people brought up and what were the solutions they presented and then like did a summary analysis and kind of like, you know, build agents to do that. And now the Department of Commerce is like interested in using that as a tool to like analyze proposals. And so a lot of what we're trying to do on the regulation side is like actually figure out where is there regulation missing and how do we actually in a very targeted way try to solve those missing areas. So I guess if I were to say like, what are the takeaways? It's like the takeaway is like the future could be really exciting if we can actually get agents that are able to do these bigger things. Reasoning is the biggest blocker plus like these sets of abstractions to make things more robust and reliable. And there are, you know, things where we have to be quite careful and thoughtful about how do we deploy these and what kind of regulation should go along with it so that this is actually a technology that where we, when we deploy it, it is protective to people and not harmful. [01:04:36]

Swyx: Awesome, wonderful. [01:04:38]

Alessio: Thank you so much for your time, Kanjun. [01:04:40]

Kanjun: Thank you. [01:04:41]

Swyx: Thank you. [01:04:48]



Get full access to Latent Space at www.latent.space/subscribe
2023-10-14
Link to episode

[AIE Summit Preview #2] The AI Horcrux ? Swyx on Cognitive Revolution

This is a special double weekend crosspost of AI podcasts, helping attendees prepare for the AI Engineer Summit next week. After our first friendly feedswap with the Cognitive Revolution pod, swyx was invited for a full episode to go over the state of AI Engineering and to preview the AI Engineer Summit Schedule, where we share many former CogRev guests as speakers.

For those seeking to understand how two top AI podcasts think about major top of mind AI Engineering topics, this should be the perfect place to get up to speed, which will be a preview of many of the conversations taking place during the topic tables sessions on the night of Monday October 9 at the AI Engineer Summit.

While you are listening, there are two things you can do to be part of the AI Engineer experience. One, join the AI Engineer Summit Slack. Two, take the State of AI Engineering survey and help us get to 1000 respondents!

Links

* AI Engineer Summit (Join livestream and Slack community)

* State of AI Engineering Survey (please help us fill this out to represent you!)

* Cognitive Revolution full episode with Nathan

* swyx's ai-notes (featuring Communities in README.md)

* We referenced The Eleuther AI Mafia

* This podcast intro voice was AI Anna again, from our Wondercraft pod!

Timestamps

* (00:00:49) AI Nathan's intro

* (00:03:14) What is an AI engineer?

* (00:05:56) What backgrounds do AI engineers typically have?

* (00:17:13) Swyx?s Discord AI project

* (00:20:41) Key tools for AI engineers

* (00:23:42) HumanLoop, Guardrails, Langchain

* (00:27:01) Criteria for identifying capable AI engineers when hiring

* (00:30:59) Skepticism around AI being a fad and doubts about contributing to AI

* (00:34:03) AI Engineer Conference speaker lineup

* (00:41:14) AI agents and two years to AGI

* (00:46:04) Expectations and disagreement around what AI agent capabilities will work soon

* (00:50:12) Swyx?s OpenAI thesis

* (00:53:03) AI safety considerations and the role of AI engineers

* (00:56:24) Disagreement on whether AI will soon be able to generate code pull requests

* (01:01:07) AI helping non-technical people to code

* (01:01:49) Multi-modal Chat-GPT and the future implications

* (01:03:33) Nathan living in the same dorm as Mark Zuckerberg

* (01:04:44) Competitive dynamics between OpenAI and other AI model developers

* (01:05:39) Play.ht vs ElevenLabs

* (01:09:20) The tension between platforms and developers building on top of them

* (01:11:40) The best thing startups can do to compete with foundation model providers

* (01:16:26) User identity/authentication services like Login with OpenAI

* (01:19:20) Google vs the other live players

* (01:20:46) AI Horcruxes / Pendants

* (01:22:05) The concept of an AI app bundle for consumers and developers



Get full access to Latent Space at www.latent.space/subscribe
2023-10-08
Link to episode

[AIE Summit Preview #1] Swyx on Software 3.0 and the Rise of the AI Engineer

This is a special double weekend crosspost of AI podcasts, helping attendees prepare for the AI Engineer Summit next week. Swyx gave a keynote on the Software 3.0 Landscape recently (referenced in our recent Humanloop episode) and was invited to go deeper in podcast format, and to preview the AI Engineer Summit Schedule.

For those seeking to ramp up on the current state of thinking on AI Engineering, this should be the perfect place to start, alongside our upcoming Latent Space University course (which is being tested live for the first time at the Summit workshops).

While you are listening, there are two things you can do to be part of the AI Engineer experience. One, join the AI Engineer Summit Slack. Two, take the State of AI Engineering survey and help us get to 1000 respondents!

Full transcript available here!

Links

* AI Engineer Summit (Join livestream and Slack community)

* State of AI Engineering Survey (please help us fill this out to represent you!)

* Podrocket full episode by Tejas Kumar

Show notes

* Explaining Software 1.0, 2.0, and 3.0

* Software 1.0: Hand-coded software with conditional logic, loops, etc.

* Software 2.0: Machine learning models like neural nets trained on data

* Software 3.0: Using large pre-trained foundation models without needing to collect/label training data

* Foundation Models and Model Architecture

* Foundation models like GPT-3/4, Claude, Whisper - can be used off the shelf via API

* Model architecture refers to the layers and structure of a ML model

* Grabbing a pre-trained model lets you skip data collection and training

* Putting Foundation Models into Production

* Levels of difficulty: calling an API, running locally, fully serving high-volume predictions

* Key factors: GPU utilization, batching, infrastructure expertise

* The Emerging AI Developer Landscape

* AI is becoming more accessible to "traditional" software engineers

* Distinction between ML engineers and new role of AI engineers

* AI engineers consume foundation model APIs vs. developing models from scratch

* The Economics of AI Engineers

* Demand for AI exceeds supply of ML experts to build it

* AI engineers will emerge out of software engineers learning these skills

* Defining the AI Engineering Stack

* System of reasoning: Foundation model APIs

* Retrieval augmented generation (RAG) stack: Connects models to data

* AI UX: New modalities and interfaces beyond chatbots

* Building Products with Foundation Models

* Replicating existing features isn't enough - need unique value

* Focus on solving customer problems and building trust

* AI Skepticism and Hype

* Some skepticism is healthy, but "AI blame" also emerges

* High expectations from media/industry creators

* Important to stay grounded in real customer needs

* Meaningful AI Applications

* Many examples of AI positively impacting lives already

* Engineers have power to build and explore - lots of opportunity

* Closing and AI Engineer Summit Details

* October 8-10 virtual conference for AI engineers

* Speakers from OpenAI, Microsoft, Amazon, etc

* Free to attend online



Get full access to Latent Space at www.latent.space/subscribe
2023-10-07
Link to episode

RAG Is A Hack - with Jerry Liu from LlamaIndex

Want to help define the AI Engineer stack? >800 folks have weighed in on the top tools, communities and builders for the first State of AI Engineering survey, which we will present for the first time at next week's AI Engineer Summit. Join us online!

This post had robust discussion on HN and Twitter.

In October 2022, Robust Intelligence hosted an internal hackathon to play around with LLMs which led to the creation of two of the most important AI Engineering tools: LangChain (our interview with Harrison here) and LlamaIndex by Jerry Liu, which we'll cover today. In less than a year, LlamaIndex has crossed 600,000 monthly downloads, raised $8.5M from Greylock, has a fast growing open source community that contributes to LlamaHub, and it doesn't seem to be slowing down.

LlamaIndex's Origin (aka GPT Tree Index)

Jerry struggled to make large amounts of data work with GPT-3 (which had a 4,096-token context window). Today LlamaIndex is at the forefront of the RAG wave (Retrieval Augmented Generation), but in the beginning Jerry wasn't focused on embeddings and search, but rather on understanding how models could summarize, link, and reason about data.

On November 5th, Jerry pushed the first version to Github under the name "GPT Tree Index":

The GPT Tree Index first takes in a large dataset of unprocessed text data as input. It then builds up a tree-index in a bottom-up fashion; each parent node is able to summarize the children nodes using a general summarization prompt; each intermediate node containing summary text summarizing the components below. Once the index is built, it can be saved to disk and loaded for future use.

Then, say the user wants to use GPT-3 to answer a question. Using a query prompt template, GPT-3 will be able to recursively perform tree traversal in a top-down fashion in order to answer a question. For example, in the very beginning GPT-3 is tasked with selecting between *n* top-level nodes which best answers a provided query, by outputting a number as a multiple-choice problem. The GPT Tree Index then uses the number to select the corresponding node, and the process repeats recursively among the children nodes until a leaf node is reached.

[…]

How is this better than an embeddings-based approach / other state-of-the-art QA and retrieval methods?

The intent is not to compete against existing methods. A simpler embedding-based technique could be to just encode each chunk as an embedding and do a simple question-document embedding look-up to retrieve the result. This project is a simple exercise to test how GPT can organize and lookup information.
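
To make the mechanics concrete, here is a rough Python sketch of that bottom-up build and top-down traversal. The `complete(prompt)` helper, the prompt wording, and the branching factor are illustrative assumptions, not the original GPT Tree Index code:

```python
# Sketch of the GPT Tree Index idea: build summaries bottom-up, then
# traverse top-down by asking the model to pick the most relevant child.
# `complete(prompt)` is a hypothetical helper wrapping any LLM completion API.

def build_tree(chunks, complete, branching=4):
    """Group leaf chunks, summarize each group, and repeat until one root remains."""
    nodes = [{"text": c, "children": []} for c in chunks]
    while len(nodes) > 1:
        parents = []
        for i in range(0, len(nodes), branching):
            group = nodes[i:i + branching]
            summary = complete(
                "Summarize the following passages:\n\n"
                + "\n\n".join(n["text"] for n in group)
            )
            parents.append({"text": summary, "children": group})
        nodes = parents
    return nodes[0]

def query_tree(root, question, complete):
    """Walk from the root to a leaf, letting the model choose a child at each step."""
    node = root
    while node["children"]:
        choices = "\n".join(
            f"({i}) {child['text']}" for i, child in enumerate(node["children"])
        )
        answer = complete(
            f"Question: {question}\n\nWhich numbered summary best answers it?\n{choices}\n"
            "Reply with the number only."
        )
        node = node["children"][int(answer.strip().strip("()"))]
    return complete(f"Context: {node['text']}\n\nAnswer the question: {question}")
```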

The project attracted a lot of attention early on (the announcement tweet has ~330 likes), but it wasn't until ~February 2023 that the open source community really started to explode, which was around the same time that LlamaHub was released. LlamaHub made it easy for developers to import data from Google Drive, Discord, Slack, databases, and more into their LlamaIndex projects.

What is LlamaIndex?

As we mentioned, LlamaIndex is leading the charge in the development of the RAG stack. RAG boils down to two parts:

* Indexing (i.e. how do you load and index the data in your knowledge base)

* Querying (i.e. how do you surface the data and fit it in the model context)

Indexing

To get your data from all your sources to your RAG knowledge base, you can leverage a few tools:

* Documents / Nodes: A Document is a generic container around any data source - for instance, a PDF, an API output, or retrieved data from a database. A Node is the atomic unit of data in LlamaIndex and represents a "chunk" of a source Document (i.e. one Document has many Nodes) as well as its relationship to other Node objects.

* Data Connectors: A data connector ingests data from different sources and turns it into Document representations (text and simple metadata). These connectors are offered through LlamaHub, and there are over 200 of them today.

* Data Indexes: Once you've ingested your data, LlamaIndex will help you index the data into a format that's easy to retrieve. There are many types of indexes (Summary, Tree, Vector, etc). Under the hood, LlamaIndex parses the raw documents into intermediate representations, calculates vector embeddings, and infers metadata. The most commonly used index is the VectorStoreIndex, which can then be paired with any of the vector stores out there (an example with Chroma).
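
Putting those ingestion pieces together, a minimal sketch (assuming the 2023-era llama_index Python API; exact imports and defaults have shifted between releases):

```python
# Rough sketch of the ingestion path described above, assuming the
# 2023-era llama_index API (exact import paths vary by version).
from llama_index import SimpleDirectoryReader, VectorStoreIndex

# Data connector: read local files into Document objects.
documents = SimpleDirectoryReader("./data").load_data()

# Data index: parse Documents into Nodes, embed them, and build a vector index.
index = VectorStoreIndex.from_documents(documents)

# Persist the index so it can be reloaded later without re-embedding.
index.storage_context.persist(persist_dir="./storage")
```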

Querying

The RAG pipeline, during the querying phase, sources the most pertinent context from a user's prompt, forwarding it along to the LLM. This equips the LLM with current / private knowledge beyond its foundational training data. LlamaIndex offers adaptable modules tailored for building RAG pathways for Q&A, chatbots, or agent use, since each of them has different requirements. For example, a chatbot should expect the user to interject with follow up questions, while an agent will try to carry out a whole task on its own without user intervention.

Building Blocks

* Retrievers: A retriever defines how to efficiently retrieve relevant context from a knowledge base (i.e. index) when given a query. Vector index is the most popular mode, but there are other options like Summary, Tree, Keyword Table, Knowledge Graph, and Document Summary.

* Node Postprocessors: Once the retriever gets you Node objects back, you will need to do additional work like discarding low similarity ones. There are many options here as well, such as `SimilarityPostprocessor` (i.e. drop nodes below a certain similarity score) or `LongContextReorder`, which helps avoid the U-shaped recollection curve issue raised in the "Lost in the Middle" paper.

* Response Synthesizers: Takes a user query and your retrieved chunks, and prompts an LLM with them. There are a few response modes here that balance thoroughness and compactness.
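
Roughly how these building blocks compose when wired by hand, again assuming the 2023-era llama_index API (the postprocessor and synthesizer import paths in particular have moved between versions, so treat them as approximate):

```python
# Composing a retriever, node postprocessor, and response synthesizer by hand;
# a sketch against a 2023-era llama_index install (import paths vary by version).
from llama_index import SimpleDirectoryReader, VectorStoreIndex, get_response_synthesizer
from llama_index.indices.postprocessor import SimilarityPostprocessor
from llama_index.query_engine import RetrieverQueryEngine

index = VectorStoreIndex.from_documents(SimpleDirectoryReader("./data").load_data())

# Retriever: fetch the top-k most similar Nodes for a query.
retriever = index.as_retriever(similarity_top_k=5)

# Node postprocessor: drop Nodes below a similarity cutoff before synthesis.
postprocessor = SimilarityPostprocessor(similarity_cutoff=0.75)

# Response synthesizer: prompts the LLM with the query plus the surviving Nodes,
# trading thoroughness against compactness via the response mode.
synthesizer = get_response_synthesizer(response_mode="compact")

query_engine = RetrieverQueryEngine.from_args(
    retriever=retriever,
    node_postprocessors=[postprocessor],
    response_synthesizer=synthesizer,
)
print(query_engine.query("What does the document say about pricing?"))
```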

Pipelines

* Query Engines: A query engine is an end-to-end pipeline that allows you to ask questions over your data. It takes in a natural language query, and returns a response, along with the reference context retrieved and passed to the LLM. This makes it possible to do things like "ask Pandas questions" by leveraging Pandas dataframes as a data source.

* Chat Engines: A chat engine is an end-to-end pipeline for having a conversation with your data (multiple back-and-forth instead of a single question & answer). This supports traditional OpenAI-style chat interfaces, as well as more advanced ones like ReAct.

* Agents: An agent is an automated decision maker (powered by an LLM) that interacts with the world via a set of tools. Agents may be used in the same fashion as query engines or chat engines, but they have the power to both read and write data. For reasoning, you can use either OpenAI Functions or ReAct. Both can leverage the tools offered through LlamaHub for further analysis.
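
A hedged sketch of the query engine vs. chat engine pipelines over the same index, under the same API assumptions as above:

```python
# Query engine vs. chat engine over the same index; a sketch assuming the
# 2023-era llama_index API (chat modes and defaults differ across versions).
from llama_index import SimpleDirectoryReader, VectorStoreIndex

index = VectorStoreIndex.from_documents(SimpleDirectoryReader("./data").load_data())

# Query engine: single-shot question and answer over your data.
query_engine = index.as_query_engine()
print(query_engine.query("Summarize the main findings."))

# Chat engine: multi-turn conversation that keeps history between calls;
# "react" runs a ReAct-style loop under the hood.
chat_engine = index.as_chat_engine(chat_mode="react")
print(chat_engine.chat("What were revenues in Q2?"))
print(chat_engine.chat("And how did that compare to Q1?"))
```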

RAG vs Finetuning

Now that you have a full overview of what LlamaIndex does, the next question is "When should I use this and when should I fine-tune?" Jerry's TLDR is that "RAG is just a hack", but a powerful one. Each option has pros and cons:

* Lower investment: RAG requires almost 0 upfront investment, unlike finetuning which requires data cleaning, model training, increased costs for finetuned inference, etc.

* Stricter access control and higher visibility: when finetuning, the model learns everything. With RAG, you can decide what documents the index should have access to, making it more secure by default. You are also able to see everything that was passed into the context if a response doesn't look right.

* Context window limitation: you can only fit so many tokens into the prompt due to the way models work. Finetuning helps you circumvent that by compressing the knowledge into the model weights rather than putting it in the prompt.
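
To make the context window point concrete, here is a small sketch of budgeting retrieved chunks with the tiktoken tokenizer before building the prompt; the model name, token budget, and placeholder chunks are illustrative assumptions:

```python
# Checking whether retrieved chunks fit a model's context window before
# sending them; the model name and budget below are illustrative assumptions.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
context_budget = 4096 - 512  # reserve ~512 tokens for the question and the answer

retrieved_chunks = ["...chunk 1...", "...chunk 2...", "...chunk 3..."]
kept, used = [], 0
for chunk in retrieved_chunks:
    n = len(enc.encode(chunk))
    if used + n > context_budget:
        break  # anything past this point has to be dropped (or baked into weights)
    kept.append(chunk)
    used += n

prompt = "Answer using only this context:\n\n" + "\n\n".join(kept)
```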

As Jerry says, the best way to know this inside out is to learn to build RAG from scratch (without LlamaIndex) - he has plenty of tutorials on his Twitter and blog to help you learn this.

The other issue is that the math for finetuning isn't well understood yet, as we discussed with Quentin Anthony from Eleuther, so unless you have the money and time to invest in exploring fine-tuning, you're better off starting with RAG.

Full YouTube Discussion!

Show Notes

* LlamaIndex

* LlamaHub

* SEC Insights

* Robust Intelligence

* Quora's Poe

* Chroma

* Vespa

* Why should every AI engineer learn to build RAG from scratch?

* LangChain

* Gorilla

* Lost in the Middle: How Language Models Use Long Contexts

Timestamps

* [00:00:00] Introductions and Jerry's background

* [00:04:30] Starting LlamaIndex as a side project

* [00:05:11] Evolution from tree-index to current LlamaIndex and LlamaHub architecture

* [00:11:39] Deciding to leave Robust to start the LlamaIndex company and raising funding

* [00:20:06] Context window size and information capacity for LLMs

* [00:21:34] Minimum viable context and maximum context for RAG

* [00:22:52] Fine-tuning vs RAG - current limitations and future potential

* [00:24:02] RAG as a hack but good hack for now

* [00:26:19] RAG benefits - transparency and access control

* [00:27:46] Potential for fine-tuning to take over some RAG capabilities

* [00:30:04] Baking everything into an end-to-end trained LLM

* [00:33:24] Similarities between iterating on ML models and LLM apps

* [00:34:47] Modularity and customization options in LlamaIndex: data loading, retrieval, synthesis, reasoning

* [00:40:16] Evaluating and optimizing each component of the LlamaIndex system

* [00:46:02] Building retrieval benchmarks to evaluate RAG

* [00:47:24] SEC Insights - open source full stack LLM app using LlamaIndex

* [00:49:48] Enterprise platform to complement LlamaIndex open source

* [00:51:00] Community contributions for LlamaHub data loaders

* [00:53:21] LLM engine usage - majority OpenAI but options expanding

* [00:56:25] Vector store landscape

* [00:59:46] Exploring relationships and graphs within data

* [01:03:24] Additional complexity of evaluating agent loops

* [01:04:01] Lightning Round

Transcript

Alessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO-in-residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. [00:00:20]

Swyx: And today we finally have Jerry Liu on the podcast. Hey Jerry. [00:00:24]

Jerry: Hey guys. Hey Swyx and Alessio. Thanks for having me. [00:00:27]

Swyx: It's kind of weird because we keep running into each other in San Francisco AI events, so it's kind of weird to finally just have a conversation recorded for everybody else. [00:00:34]

Jerry: Yeah, I know. I'm really looking forward to this, aside from the questions. [00:00:38]

Swyx: So I tend to introduce people on their formal background and then ask something on the more personal side. So you are part of the Princeton gang. [00:00:46]

Jerry: I don't know if there is like official Princeton gang. [00:00:48]

Swyx: No, small Princeton gang. Okay. I attended your meeting. There was like four of you with Prem and the others. And then you have a bachelor's in CS and a certificate in finance. That's also fun. I also did finance and I think I saw that you also interned at Two Sigma where I worked in New York. You were a machine learning engineer. [00:01:06]

Jerry: You were at Two Sigma?

Swyx: Yeah, very briefly.

Jerry: Oh, cool. I didn't know that. [00:01:09]

Swyx: That was my first like proper engineering job before I went into DevRel. [00:01:12]

Jerry: Oh, okay. Nice. [00:01:14]

Swyx: And then you were a machine learning engineer at Quora, AI research scientist at Uber for three years, and then two years machine learning engineer at Robust Intelligence before starting LlamaIndex. So that's your LinkedIn. It's not only LinkedIn that people should know about you. [00:01:27]

Jerry: I think back during my Quora days, I had this like three-month phase where I just wrote like a ton of Quora answers. And so I think if you look at my tweets nowadays, you can basically see that as like the V2 of my three-month like Forrestant where I just like went ham on Quora for a bit. I actually, I think I was back then actually when I was working on Quora, I think the thing that everybody was fascinated in was just like general like deep learning advancements and stuff like GANs and generative like images and just like new architectures that were evolving. And it was a pretty exciting time to be a researcher actually, because you were going in like really understanding some of the new techniques. So I kind of use that as like a learning opportunity, basically just like read a bunch of papers and then answer questions on Quora. And so you can kind of see traces of that basically in my current Twitter where it's just like really about kind of like framing concepts and trying to make it understandable and educate other users on it. Yeah. [00:02:17]

Swyx: I've said, so a lot of people come to me for my Twitter advice, but like, I think you are doing one of the best jobs in AI Twitter, which is explaining concepts and just consistently getting hits out. Thank you. I didn't know it was due to the Quora training. Let's just stay on Quora for a bit. A lot of people, including myself, like kind of wrote off Quora as like one of the web 1.0 like sort of question answer forums. But now I think it's seeing a resurgence, obviously due to Poe, and obviously Adam D'Angelo has always been a leading tech figure, but what do you think is kind of underrated about Quora? [00:02:46]

Jerry: Well, I mean, I like the, I really liked the mission of Quora when I, when I joined. In fact, I interned there like in 2015 and I joined full time in 2017. One is like they had, and they have like a very talented engineering team and just like really, really smart people. And the other part is the whole mission of the company is to just like spread knowledge and to educate people. And to me that really resonated. I really liked the idea of just like education and democratizing the flow of information. If you imagine like kind of back then it was like, okay, you have Google, which is like for search, but then you have Quora, which is just like user generated, like grassroots type content. And I really liked that concept because it's just like, okay, there's certain types of information that aren't accessible to people, but you can make accessible by just like surfacing it. And so actually, I don't know if like most people know that about like Quora and if they've used the product, whether through like SEO, right, or kind of like actively, but that really was what drew me to it. [00:03:39]

Swyx: Yeah. I think most people challenges with it is that sometimes you don't know if it's like a veiled product pitch, right? [00:03:44]

Jerry: Yeah. Of course, like quality of the answer matters quite a bit. And then you start running into these like- [00:03:47]

Swyx: It's like five alternatives and then here's the one I work on. Yeah. [00:03:50]

Jerry: Like recommendation issues and all that stuff. I used, I worked on recsys at Quora actually, so I got a taste of some of that stuff. Well, I mean, I kind of more approached it from machine learning techniques, which might be a nice segue into RAG actually. A lot of it was just information retrieval. We weren't like solving anything that was like super different than what was standard in the industry at the time, but just like ranking based on user preferences. I think a lot of Quora was very metrics driven. So just like trying to maximize like daily active hours, like time spent on site, those types of things. And all the machine learning algorithms were really just based on embeddings. You have a user embedding and you have like item embeddings and you try to train the models to try to maximize the similarity of these. And it's basically a retrieval problem. [00:04:30]

Swyx: Okay. So you've been working on RAG for longer than most people think? [00:04:33]

Jerry: Well, kind of. So I worked there for like a year, right, just transparently. And then I worked at Uber where I was not working on ranking. It was more like kind of deep learning training for self-driving and computer vision and that type of stuff. But I think in the LLM world, it's kind of just like a combination of like everything these days. I mean, retrieval is not really LLMs, but like it fits within the space of like LLM apps. And then obviously like having knowledge of the underlying deep learning architectures helps. Having knowledge of basic software engineering principles helps too. And so I think it's kind of nice that like this whole LLM space is basically just a combination of just like a bunch of stuff that you probably like people have done in the past. [00:05:11]

Swyx: It's good. It's like a summary capstone project. Yeah, exactly. [00:05:14]

Jerry: Yeah. [00:05:15]

Alessio: And before we dive into LlamaIndex, what do they feed you a robust intelligence that both you and Harrison from LangChain came out of it at the same time? Was there like, yeah. Is there any fun story of like how both of you kind of came up with kind of like core infrastructure to LLM workflows today? Or how close were you at robust? Like any fun behind the scenes? [00:05:37]

Jerry: Yeah. Yeah. We, um, we work pretty closely. I mean, we were on the same team for like two years. I got to know Harrison and the rest of the team pretty well. I mean, I have a lot of respect for the people there; they were very driven, very passionate. And it definitely pushed me to be, you know, a better engineer and leader and those types of things. Yeah. I don't really have a concrete explanation for this. I think it's more just, we had like an LLM hackathon around like September. This was just like exploring GPT-3, or it was October actually. And then the day after I went on vacation for a week and a half, and so I just didn't track Slack or anything. And then when I came back, saw that Harrison started LangChain. [00:06:09]

Swyx: Oh that's cool. [00:06:10]

Jerry: I was like, oh, I'll play around with LLMs a bit and then hacked around on stuff. And I think I've told the story a few times, but you know, I was like trying to feed in information into GPT-3. And then, then you deal with like context window limitations and there was no tooling or really practices to try to understand how do you, you know, get GPT-3 to navigate large amounts of data. And that's kind of how the project started. Really was just one of those things where early days, like we were just trying to build something that was interesting. Like I wanted to start a company. I had other ideas actually of what I wanted to start. And I was very interested in, for instance, like multimodal data, like video data and that type of stuff. And then this just kind of grew and eventually took over the other idea. [00:06:48]

Swyx: Text is the universal interface. [00:06:50]

Jerry: I think so. I think so. I actually think once the multimodal models come out, I think there's just like mathematically nicer properties of you can just get like join multiple embeddings, like clip style. But text is really nice because from a software engineering principle, it just makes things way more modular. You can just convert everything into text and then you just represent everything as text. [00:07:08]

Swyx: Yeah. I'm just explaining retroactively why working on LlamaIndex took off versus if you had chose to spend your time on multimodal, we probably wouldn't be talking about whatever you ended up working on. [00:07:18]

Jerry: Yeah. [00:07:19]

Swyx: That's true. It's troubled. Interesting. So November 9th, that was a very productive month. I guess October, November, November 9th, you announced GPT-3 Index and you picked a tree logo. Very cool. Every project must have an emoji. [00:07:32]

Jerry: Yeah. Yeah. I probably was somewhat inspired by LangChain, but I will admit, yeah. [00:07:37]

Swyx: It uses GPT to build a knowledge tree in a bottoms-up fashion by applying a summarization prompt for each node. Yep. Which I like that original vision. Your messaging roundabout then was also that you're creating optimized data structures. What's the sort of journey to that and how does that contrast with LlamaIndex today? Okay. [00:07:56]

Jerry: Maybe I can tell a little bit about the beginning intuitions. I think when I first started, this really wasn't supposed to be something that was like a toolkit that people use. It was more just like a system. And the way I wanted to think about the system was more a thought exercise of how language models with their reasoning capabilities, if you just treat them as like brains, can organize information and then traverse it. So I didn't want to think about embeddings, right? To me, embeddings just felt like it was just an external thing that was like, well, it was just external to trying to actually tap into the capabilities of language models themselves, right? I really wanted to see, you know, just as like a human brain could like synthesize stuff, could we create some sort of like structure where this neural CPU, if you will, can like organize a bunch of information, you know, auto-summarize a bunch of stuff and then also traverse the structure that I created. That was the inspiration for this initial tree index, to be honest. And I think I said this in the first tweet, it actually works super well, right? Like GPT-4 obviously is much better at reasoning. I'm one of the first to say, you know, you shouldn't use anything pre-GPT-4 for anything that requires complex reasoning because it's just going to be unreliable, okay, disregarding stuff like fine tuning. But it worked okay. But I think it definitely struck a chord with kind of like the Twitter crowd, which is just like new ideas at the time, I guess, just like thinking about how you can actually bake this into some sort of application. Because I think what I also ended up discovering was the fact that there was starting to become a wave of developers building on top of GPT-3 and people were starting to realize that what makes them really useful is to apply them on top of your personal data. And so even if the solution itself was kind of like primitive at the time, like the problem statement itself was very powerful. And so I think being motivated by the problem statement, right, like this broad mission of how do I unlock LLMs on top of the data also contributed to the development of LlamaIndex to the state it is today. And so I think part of the reason, you know, our toolkit has evolved beyond just the existing set of like data structures is we really tried to take a step back and think, okay, what exactly are the tools that would actually make this useful for a developer? And then, you know, somewhere around December, we made an active effort to basically like push towards that direction, make the code base more modular, right, more friendly as an open source library. And then also start adding in like embeddings, start thinking about practical considerations like latency, cost, performance, those types of things. And then really motivated by that mission, like start expanding the scope of the toolkit towards like covering the life cycle of like data ingestion and querying. Where you also added LlamaHub and yeah, so I think that was in like January on the data loading side. And so we started adding like some data loaders, saw an opportunity there, started adding more stuff on the retrieval querying side, right? We still have like the core data structures, but how do you actually make them more modular and kind of like decouple storing state from the types of like queries that you could run on top of this a little bit. 
And then starting to get into more complex interactions, like chain of thought reasoning, routing and, you know, like agent loops. [00:10:44]

Alessio: You and I spent a bunch of time earlier this year talking about Llamahub, what that might become. You were still at Robust. When did you decide it was time to start the company and then start to think about what LlamaIndex is today? [00:10:58]

Jerry: Yeah, I mean, probably December. It was kind of interesting. I was getting some inbound from initial VCs, I was talking about this project. And then in the beginning, I was like, oh, yeah, you know, this is just like a design project. But you know, what about my other idea on like video data, right? And then I was trying to like get their thoughts on that. And then everybody was just like, oh, yeah, whatever, like that part's like a crowded market. And then it became clear that, you know, this was actually a pretty big opportunity. And like, coincidentally, right, like this actually did relate to like, my interests have always been at the intersection of AI data and kind of like building practical applications. And it was clear that this was evolving into a much bigger opportunity than the previous idea was. So around December, and then I think I gave a pretty long notice, but I left officially like early March. [00:11:39]

Alessio: What were your thinkings in terms of like moats and, you know, founders kind of like overthink it sometimes. So you obviously had like a lot of open source love and like a lot of community. And you're like, were you ever thinking, okay, I don't know, this is maybe not enough to start a company or did you always have conviction about it? [00:11:59]

Jerry: Oh, no, I mean, 100%. I felt like I did this exercise, like, honestly, probably more late December and then early January, because I was just existentially worried about whether or not this would actually be a company at all. And okay, what were the key questions I was thinking about? And these were the same things that like other founders, investors, and also like friends would ask me is just like, okay, what happens if context windows get much bigger? What's the point of actually structuring data in the right way, right? Why don't you just dump everything into the prompt? Fine tuning, like, what if you just train the model over this data? And then, you know, what's the point of doing this stuff? And then some other ideas is what if like OpenAI actually just builds upward on top of their existing foundation models and starts building in some built-in orchestration capabilities around stuff like RAG and agents and those types of things. And so I basically ran through this mental exercise and, you know, I'm happy to talk a little bit more about those thoughts as well. But at a high level, well, context windows have gotten bigger, but there's obviously still a need for RAG. I think RAG is just like one of those things that like, in general, what people care about is, yes, they do care about performance, but they also care about stuff like latency and costs. And so my entire reasoning at the time was just like, okay, like, yes, maybe you will have like much bigger context windows, as we've seen with like 100k context windows. But for enterprises, like, you know, data, which is not in just like the scale of like a few documents, it's usually in like gigabytes, terabytes, petabytes. How do you actually just unlock language models over that data, right? And so it was clear there was just like, whether it's RAG or some other paradigm, no one really knew what that answer was. And so there was clearly like technical opportunity here. Like there was just stacks that needed to be invented to actually solve this type of problem, because language models themselves didn't have access to this data. The other piece here is just like, and so if like you just dumped all this data into, let's say a model had like hypothetically an infinite context window, right? And you just dump like 50 gigabytes of data into a context window. That just seemed very inefficient to me, because you have these network transfer costs of uploading 50 gigabytes of data to get back a single response. And so I kind of realized, you know, there's always going to be some curve, regardless of like the performance of the best performing models, of like cost versus performance. What RAG does is it does provide extra data points along that axis, because you kind of control the amount of context you actually wanted to retrieve. And of course, like RAG as a term was still evolving back then, but it was just this whole idea of like, how do you just fetch a bunch of information to actually, you know, like stuff into the prompt. And so people even back then were kind of thinking about some of those considerations. [00:14:29]
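For a rough sense of the tradeoff Jerry describes, here is a back-of-envelope sketch; the bytes-per-token heuristic and the per-token price are illustrative assumptions, not real pricing.

```python
# Back-of-envelope comparison of "dump everything into the prompt" vs. RAG.
# All numbers here are illustrative assumptions, not real pricing.

BYTES_PER_TOKEN = 4                 # rough heuristic: ~4 bytes of English text per token
PRICE_PER_1K_INPUT_TOKENS = 0.01    # hypothetical $/1k input tokens

def prompt_cost(num_tokens: int) -> float:
    """Cost of sending num_tokens of context with a single query."""
    return num_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS

# "Infinite context window": 50 GB of raw text shipped with every query.
corpus_tokens = 50 * 1024**3 // BYTES_PER_TOKEN
print(f"full-corpus prompt: ~{corpus_tokens:,} tokens, ~${prompt_cost(corpus_tokens):,.0f} per query")

# RAG: retrieve top-k chunks and only pay for those.
top_k, chunk_tokens = 5, 512
rag_tokens = top_k * chunk_tokens
print(f"RAG prompt: {rag_tokens:,} tokens, ~${prompt_cost(rag_tokens):.4f} per query")
```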

Swyx: And then you fundraised in June, or you announced your fundraiser in June. Yeah. Take us through that process of thinking about the fundraise and your plans for the company, you know, at the time. Yeah, definitely. [00:14:41]

Jerry: I mean, I think we knew we wanted to, I mean, obviously we knew we wanted to fundraise. There was also a bunch of like investor interest, and it was probably pretty unusual given the, you know, like hype wave of generative AI. So like a lot of investors were kind of reaching out around like December, January, February. In the end, we went with Greylock. Greylock's great. You know, they've been great partners so far. And to be honest, like there's a lot of like great VCs out there. And a lot of them who are specialized on like open source, data, infra, and that type of stuff. What we really wanted to do was, because for us, like time was of the essence, like we wanted to ship very quickly and still kind of build mindshare in this space. We just kept the fundraising process very efficient. I think we basically did it in like a week or like three days. And so, yeah, just like front loaded it and then just like picked the one named Jerry. Yeah, exactly. Yeah. [00:15:27]

Swyx: I'm kidding. I mean, he's obviously great and Greylock's a fantastic firm. [00:15:32]

Jerry: Embedding some of my research. So, yeah, just we've had Greylock. They've been great partners. I think in general, when I talk to founders about like the fundraise process, it's never like the most fun period, I think, because it's always just like, you know, there's a lot of logistics, there's lawyers you have to, you know, get in the loop. And like a lot of founders just want to go back to building. I think in the end, we're happy that we kept it to a pretty efficient process. [00:15:54]

Swyx: And so you fundraise with Simon. How do you split things with him? How big is your team now? [00:15:57]

Jerry: The team is growing. By the time this podcast is released, we'll probably have had one more person join the team. So basically, it's between, we're rapidly getting to like eight or nine people. At the current moment, we're around like six. And so just like there'll be some exciting developments in the next few weeks. I'm excited to announce that. So the team is, has kind of like, we've been pretty selective in terms of like how we like grow the team. Obviously, like we look for people that are really active in terms of contributions to LlamaIndex, people that have like very strong engineering backgrounds. And primarily, we've been kind of just looking for builders, people that kind of like grow the open source and also eventually this like managed like enterprise platform as well with us. In terms of like Simon, yeah, I've known Simon for a few years now. I knew him back at Uber ATG in Toronto. He's one of the smartest people I knew, has both like a deep understanding of ML, but also just like first principles thinking about like engineering and technical concepts in general. And I think one of my criteria when I was looking for a co-founder for this project was someone that was like technically better than me, because I knew I wanted like a CTO. And so honestly, like there weren't a lot of people that, I mean, there's, I know a lot of people that are smarter than me, but like that fit that bill, were willing to do a startup and also just have the same like values that I shared. Right. And just, I think doing a startup is very hard work, right? It's not like, I'm sure like you guys all know this, it's, it's a lot of hours, a lot of late nights and you want to be like in the same place together and just like being willing to hash out stuff and have that grit basically. And I really looked for that. And so Simon really fit that bill and I think I convinced him to jump on board. [00:17:24]

Swyx: Yeah. And obviously I've had the pleasure of chatting and working with a little bit with both of you. What would you say those, those like your top one or two values are when, when thinking about that or the culture of the company and that kind of stuff? [00:17:36]

Jerry: I think in terms of the culture of the company, it's really like, I mean, there's a few things I can name off the top of my head. One is just like passion, integrity. I think that's very important for us. We want to be honest. We don't want to like, obviously like copy code or, or kind of like, you know, just like, you know, not give attribution, those types of things and, and just like be true to ourselves. I think we're all very like down to earth, like humble people, but obviously I think just willingness to just like own stuff and dive right in. And I think grit comes with it. I think in the end, like this is a very fast moving space and we want to just like be one of the, you know, like dominant forces in helping to provide like production quality LLM applications. Yeah. [00:18:11]

Swyx: I promise we'll get to more technical questions, but I also want to impress on the audience that this is a very conscious and intentional company building. And since your fundraising post, which was in June, and now it's September, so it's been about three months, you've actually gained 50% in terms of stars and followers. You've 3x'd your download count to 600,000 a month and your discord membership has reached 10,000. So like a lot of ongoing growth. [00:18:37]

Jerry: Yeah, definitely. And obviously there's a lot of room to expand there too. And so open source growth is going to continue to be one of our core goals because in the end it's just like, we want this thing to be, well, one big, right? We all have like big ambitions, but to just like really provide value to developers and helping them in prototyping and also productionization of their apps. And I think it turns out we're in the fortunate circumstance where a lot of different companies and individuals, right, are in that phase of like, you know, maybe they've hacked around on some initial LLM applications, but they're also looking to, you know, start to think about what are the production grade challenges they need to solve to actually make this thing robust and reliable in the real world. And so we want to basically provide the tooling to do that. And to do that, we need to both spread awareness and education of a lot of the key practices of what's going on. And so a lot of this is going to be continued growth, expansion, education, and we do prioritize that very heavily. [00:19:30]

Alessio: Let's dive into some of the questions you were asking yourself initially around fine tuning and RAG, how these things play together. You mentioned context. What is the minimum viable context for RAG? So what's like a context window that's too small? And at the same time, maybe what's like a maximum context window? We talked before about how LLMs are U-shaped reasoners. So as the context gets larger, it really only focuses on the end and the start of the prompt and then it kind of peters out. Any learnings, any kind of like tips you want to give people as they think about it? [00:20:06]

Jerry: So this is a great question. And part of what I wanted to talk about at a conceptual level, especially with the idea of like thinking about what is the minimum context. Like, okay, what if the minimum context was like 10 tokens versus like, you know, 2k tokens versus like a million tokens, right? Like, and what does that really give you? And what are the limitations if it's like 10 tokens? It's kind of like, um, like eight bit, 16 bit games, right? Like back in the day, like if you play Mario and you have like the initial Mario where the graphics were very blocky and now obviously it's like full HD, 3D, just the resolution of the context and the output will change depending on how much context you can actually fit in. So the way I kind of think about this from a more principled manner is there's this concept of like information capacity, just this idea of like entropy, like given any fixed amount of like storage space, like how much information can you actually compact in there? And so basically a context window length is just like some fixed amount of storage space, right? And so there's some theoretical limit to the maximum amount of information you can compact into like a 4,000 token storage space. And what is that storage space used for these days with LLMs? For inputs and also outputs. And so this really controls the maximum amount of information you can feed in terms of the prompt plus the granularity of the output. If you had an infinite context window, you're going to have an infinitely detailed response and also infinitely detailed memory. But if you don't, you can only kind of represent stuff in more quantized bits, right? And so the smaller the context window, just generally speaking, the less detail and the less specific, precise information you're going to be able to surface at any given point in time. [00:21:34]
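A small sketch of the "context window as a fixed storage budget" framing, using the tiktoken tokenizer; the 4k budget, the output reservation, and the placeholder chunks are assumptions for illustration.

```python
# Treat the context window as a fixed token budget shared by the system
# prompt, retrieved context, and the reserved output space.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

context_window = 4096           # total "storage space" in tokens (assumed)
reserved_for_output = 512       # leave room for the response
system_prompt = "You are a helpful assistant. Answer using the context below."

budget = context_window - reserved_for_output - len(enc.encode(system_prompt))

chunks = ["...retrieved chunk 1...", "...retrieved chunk 2..."]   # placeholder chunks
packed, used = [], 0
for chunk in chunks:
    n = len(enc.encode(chunk))
    if used + n > budget:
        break                   # budget full; lower-ranked chunks get dropped
    packed.append(chunk)
    used += n

print(f"packed {len(packed)} chunks, {used}/{budget} prompt tokens used")
```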

Alessio: So when you have short context, is the answer just like get a better model or is the answer maybe, Hey, there needs to be a balance between fine tuning and RAG to make sure you're going to like leverage the context, but at the same time, don't keep it too low resolution? [00:21:48]

Jerry: Yeah, yeah. Well, there's probably some minimum threshold, like I don't think anyone wants to work with like a 10, I mean, that's just a thought exercise anyways, a 10 token context window. I think nowadays the modern context window is like 2k, 4k is enough for just like doing some sort of retrieval on granular context and being able to synthesize information. I think for most intents and purposes, that level of resolution is probably fine for most people for most use cases. I think the question there is just like, um, the limitations are actually more on, okay, if you're going to actually combine this thing with some sort of retrieval data structure mechanism, there's just limitations on the retrieval side because maybe you're not actually fetching the most relevant context to actually answer this question, right? Like, yes, like given the right context, 4,000 tokens is enough. But if you're just doing like top-k similarity, like you might not be able to be fetching the right information from the documents. [00:22:34]

Alessio: So how should people think about when to stick with RAG versus when to even entertain fine tuning, and also in terms of what's like the threshold of data that you need to actually worry about fine tuning versus just sticking with RAG? Obviously you're biased because you're building a RAG company, but no, no, actually, um, I [00:22:52]

Jerry: think I have like a few hot takes in here, some of which sound like a little bit contradictory or what we're actually building. And I think to be honest, I don't think anyone knows the right answer. I think this is the truth. [00:23:01]

Alessio: Yeah, exactly. [00:23:01]

Jerry: This is just like thought exercise towards like understanding the truth. [00:23:04]

Alessio: Right. [00:23:04]

Jerry: So, okay. [00:23:05]

Alessio: I have a few hot takes. [00:23:05]

Jerry: One is like RAG is basically just, just a hack, but it turns out it's a very good hack, because what is RAG? RAG is you keep the model fixed and you just figure out a good way to like stuff stuff into the prompt of the language model. And everything that we're doing nowadays in terms of like stuffing stuff into the prompt is just algorithmic. We're just figuring out nice algorithms to, to like retrieve the right information with top-k similarity, do some sort of like, uh, you know, hybrid search, some sort of like a chain of thought decomp and then just like stuff stuff into a prompt. So it's all like algorithmic and it's more like just software engineering to try to make the most out of these like existing APIs. The reason I say it's a hack is just like from a pure like optimization standpoint. If you think about this from like the machine learning lens, instead of the software engineering lens, there's pieces in here that are going to be like suboptimal, right? Like, like the thing about machine learning is when you optimize like some system that can be optimized within machine learning, like the set of parameters, you're really like changing like the entire system's weights to try to optimize the objective function. [00:24:02]

Jerry: And if you just cobble a bunch of stuff together, you can't really optimize it; the pieces are inefficient, right? And so like a retrieval interface, like doing top-k embedding lookup, that part is inefficient. [00:24:13]

Jerry: Because there might potentially be a better, more learned retrieval algorithm that's better. And if, you know, you do stuff like, I know nowadays there's this concept of how do you do like short-term and long-term memory, represent stuff in some sort of vector embedding, do chunk sizes, all that stuff. It's all just like decisions that you make that aren't really optimized and it's not really automatically learned. It's more just things that you set beforehand to actually feed into the system. So I do think like there is a lot of room to actually optimize the performance of an entire LLM system, potentially in a more like machine learning based way. Right. [00:24:48]

Jerry: And I will leave room for that. And this is also why I think like in the long term, I do think fine tuning will probably have like greater importance. And just like there will probably be new architectures invented where you can actually kind of like include a lot of this under the black box, as opposed to like hobbling together a bunch of components outside the black box. That said, just very practically given the current state of things, like even if I said RAG is a hack, it's a very good hack and it's also very easy to use. Right. [00:25:16]

Jerry: And so just like for kind of like the AI engineer persona, which to be fair is kind of one of the reasons generative AI has gotten so big is because it's way more accessible for everybody to get into, as opposed to just like traditional machine learning, it tends to be good enough. [00:25:30]

Jerry: Right. And if we can basically provide these existing techniques to help people really optimize how to use existing systems without having to really deeply understand machine learning, I still think that's a huge value add. And so there's very much like a UX and ease of use problem here, which is just like RAG is way easier to onboard and use. And that's probably like the primary reason why everyone should do RAG instead of fine tuning to begin with. If you think about like the 80-20 rule, like RAG very much fits within that and fine tuning doesn't really right now. And then I'm just kind of like leaving room for the future that, you know, like in the end, fine tuning can probably take over some of the aspects of like what RAG does. [00:26:04]
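A minimal sketch of the "RAG is a very good hack" loop described above: keep the model fixed, embed your chunks, do top-k similarity, and stuff the results into the prompt. It assumes the older openai 0.x Python SDK and an API key in the environment; the documents, chunking, and k are arbitrary choices, not tuned values.

```python
# Fixed model + algorithmic retrieval + prompt stuffing: the whole "hack".
import numpy as np
import openai

def embed(texts):
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([d["embedding"] for d in resp["data"]])

documents = ["LlamaIndex is a data framework for LLM applications.",
             "RAG retrieves relevant chunks and stuffs them into the prompt."]
doc_vecs = embed(documents)

def retrieve(query, k=2):
    q = embed([query])[0]
    # Cosine similarity against every chunk, then take the top-k.
    scores = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return [documents[i] for i in np.argsort(-scores)[:k]]

def answer(query):
    context = "\n\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    resp = openai.ChatCompletion.create(model="gpt-3.5-turbo",
                                        messages=[{"role": "user", "content": prompt}])
    return resp["choices"][0]["message"]["content"]

print(answer("What does RAG do?"))
```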

Swyx: I don't know if this was mentioned, but explainability also allows for sourcing. And at the end of the day, like to increase trust, you have to source documents. Yeah. [00:26:14]

Jerry: So, so I think what RAG does is it increases like transparency, visibility into the actual documents, right. [00:26:19]

Jerry: That are getting fed into their context. [00:26:21]

Swyx: Here's where they got it from. [00:26:22]

Alessio: Exactly. [00:26:22]

Jerry: That's definitely an advantage. I think the other piece that I think is an advantage, and I think that's something that someone actually brought up, is just you can do access control with, with RAG. If you have an external storage system, you can't really do that with, with large language models. [00:26:35]

Jerry: It's just hard to gate information in the neural net weights depending on the type of user. For the first point, you could technically, you could technically have the language model, [00:26:45]

Jerry: Like if it memorized enough information, just like cite sources, but there's a question of just trust, whether or not you're actually, yeah, well, but like it makes it up right now because it's like not good enough, but imagine a world where it is good enough and it does give accurate citations.

Swyx: No, I think to establish trust, you just need a direct connection. So it's, it's kind of weird. It's, it's this melding of deep learning systems versus very traditional information retrieval. Yeah, exactly. [00:27:11]

Jerry: Well, so, so I think, I mean, I kind of think about it as analogous to like humans, right? [00:27:15]

Jerry: Like, uh, we as humans, obviously we use the internet, we use tools. Uh, these tools have API interfaces that are well-defined. Um, and obviously the tools aren't part of us, and so we're not like back propping or optimizing over these tools. And so when you think about RAG, it's basically, um, the LLM learning how to use like a vector database to look up information that it doesn't know. And so then there's just a question of like how much information is inherent within the network itself and how much does it need to do some sort of like tool use to look up stuff that it doesn't know. [00:27:42]

Jerry: And I do think there'll probably be more and more of that interplay as time goes on. [00:27:46]

Swyx: Yeah. Some followups on discussions that we've had, you know, we discussed fine tuning a bit and what's your current take on whether you can, you can fine tune new knowledge into LLMs. [00:27:55]

Jerry: That's one of those things where I think longterm you definitely can. I think some people say you can't, I disagree. I think you definitely can. Just right now I haven't gotten it to work yet. So, so I think like we've tried, yeah, well, um, not in a very principled way, right? Like this is something that requires like an actual research scientist and not someone that has like, you know, an hour or two per night to actually look at this. [00:28:12]

Swyx: Like I, you were a research scientist at Uber. I mean, it's like full-time, full-time working. [00:28:16]

Jerry: So, so I think, um, what I specifically concretely did was I took OpenAI's fine tuning endpoints and then tried to, you know, it's in like a chat message interface. And so there's like, um, input question, like a user assistant message format. And so what I did was I tried to take just some piece of text and have the LLM memorize it by just asking it a bunch of questions about the text. So given a bunch of context, I would generate some questions and then generate some responses and just fine tune over the question responses. That hasn't really worked super well, but that's also because I'm, I'm just like trying to use OpenAI's endpoints as is. If you just think about like traditional, like how you train a Transformers model, there's kind of like the, uh, instruction, like fine tuning aspect, right? You like ask it stuff when guided with correct responses, but then there's also just like, um, next token prediction. And that's something that you can't really do with the OpenAI API, but you can do if you just train it yourself, and that's probably possible if you just like train it over some corpus of data. I think Shishir from Berkeley said like, you know, when they trained Gorilla, they were like, Oh, you know, a lot of these LLMs are actually pretty good at memorizing information. Um, just the way the API interface is exposed is just no one knows how to use them right [00:29:22]

Alessio: now. Right. [00:29:22]

Jerry: And so, so I think that's probably one of the issues. [00:29:24]
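A rough reconstruction (not Jerry's actual code) of the memorization experiment he describes: generate synthetic Q&A pairs over a chunk of text, then fine-tune a chat model on those pairs through OpenAI's fine-tuning endpoints. It assumes the older openai 0.x SDK; the prompts, file names, and model choices are illustrative.

```python
# Instruction-style fine-tuning over generated Q&A pairs -- there is no way to
# do raw next-token training over a corpus through this API, which is the
# limitation Jerry points out.
import json
import openai

text_chunk = "…some source text the model should memorize…"

def make_qa_pairs(chunk, n=5):
    prompt = (f"Generate {n} question/answer pairs about the text below, returned "
              f'as a JSON list like [{{"question": ..., "answer": ...}}].\n\nText:\n{chunk}')
    resp = openai.ChatCompletion.create(model="gpt-4",
                                        messages=[{"role": "user", "content": prompt}])
    # Sketch only: real code should guard against non-JSON model output.
    return json.loads(resp["choices"][0]["message"]["content"])

# Write the pairs in the chat fine-tuning JSONL format.
with open("memorize.jsonl", "w") as f:
    for qa in make_qa_pairs(text_chunk):
        f.write(json.dumps({"messages": [
            {"role": "user", "content": qa["question"]},
            {"role": "assistant", "content": qa["answer"]},
        ]}) + "\n")

# Upload the file and kick off the fine-tuning job.
file_id = openai.File.create(file=open("memorize.jsonl", "rb"), purpose="fine-tune")["id"]
job = openai.FineTuningJob.create(training_file=file_id, model="gpt-3.5-turbo")
print(job["id"])
```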

Swyx: Just to clue people in who haven't read the paper, Gorilla is the one where they train to use specific APIs. [00:29:30]

Jerry: Yeah, I think this was on the Gorilla paper. Like the, the model itself could, uh, try to learn some prior over the data to decide like what tool to pick. But there's also, it's also augmented with retrieval that helps supplement it in case like the, the, the, um, prior doesn't actually work. [00:29:45]

Swyx: Is that something that you'd be interested in supporting? [00:29:48]

Jerry: I mean, I think in the long term, like, this is kind of how fine tuning and RAG evolve. Like I do think there'll be some aspect where fine tuning will probably memorize some high level concepts of knowledge, but then like RAG will just be there to supplement the aspects of that that it doesn't know.

Jerry: Um, the way I think about this is kind of like, obviously RAG is the default way, like to be clear, RAG right now is the default way to actually augment stuff with knowledge. I think it's just an open question of how much the LLM can actually internalize both high level concepts, but also details, as you can like train stuff over it. And coming from an ML background, there is a certain beauty in just baking everything into some training process of a language model. Like if you just take raw ChatGPT or ChatGPT Code Interpreter, right? Like GPT-4, it's not like you do RAG with it. You just ask it questions about like, Hey, how do I define a Pydantic model in Python? And I'm like, can you give me an example? Can you visualize a graph? It just does it, right? Like, and it'll run it through Code Interpreter as a tool, but that's not like a source for knowledge. [00:30:46]

Jerry: It's just an execution environment. And so there is some beauty in just like having the model itself, like just, you know, instead of you kind of defining the algorithm for what the data structure should look like the model just learns it under the hood. That said, I think the reason it's not a thing right now is just like, no one knows how to do it. [00:31:01]

Jerry: It probably costs too much money. And then also like the API interfaces and just like the actual ability to kind of evaluate and improve on performance, like isn't known to most people. [00:31:12]

Alessio: Yeah. [00:31:12]

Swyx: It also would be better with browsing. [00:31:14]

Alessio: Yeah. [00:31:16]

Swyx: I wonder when they're going to put that back. [00:31:18]

Alessio: Okay. Yeah. [00:31:19]

Swyx: So, and then one more follow up before we go into RAG for AI engineers is on your brief mention about security or auth. How many of the people that you talk to, you know, you talk to a lot of people putting LlamaIndex into production. How many people actually need that versus just like, let's just dump the whole company Notion into this thing? [00:31:36]

Jerry: Wait, are you talking about from like the security auth standpoint? [00:31:39]

Alessio: Yeah. [00:31:39]

Swyx: How big a need is that? Because I, I talked to some people who are thinking about building tools in that domain, but I don't know if people want it. [00:31:47]

Jerry: I mean, I think bigger companies, like just bigger companies, like banks, consulting firms, like they all want this requirement, right? The way they're using LlamaIndex is not with this, obviously. Cause I don't think we have support for like access control or auth or that type of stuff under the hood. [00:32:02]

Jerry: Cause we're more just like an orchestration framework. And so the way they build these initial apps is more kind of like a prototype. Like, let's kind of, yeah. Like, you know, use some publicly available data. That's not super sensitive. Let's like, you know, assume that every user is going to be able to have access to the same amount of knowledge, those types of things. I think users have asked for it, but I don't think that's like a P zero. Like I think the P zero is more on like, can we get this thing working before we expand this to like more users within the org? [00:32:25]
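LlamaIndex doesn't enforce this for you (as Jerry notes), but the access control requirement usually reduces to gating retrieval on per-chunk metadata. A minimal sketch with made-up group names and a stubbed-out scorer:

```python
# Filter candidate chunks by the requesting user's groups so context they
# aren't entitled to never reaches the prompt; similarity scoring is stubbed.
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    allowed_groups: set = field(default_factory=set)

chunks = [
    Chunk("Public FAQ content", {"everyone"}),
    Chunk("Internal salary bands", {"hr"}),
    Chunk("Board meeting notes", {"exec"}),
]

def retrieve(query: str, user_groups: set, k: int = 3):
    visible = [c for c in chunks if c.allowed_groups & (user_groups | {"everyone"})]
    return visible[:k]   # a real retriever would rank `visible` by similarity first

print([c.text for c in retrieve("compensation?", user_groups={"hr"})])
```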

Alessio: There's a bunch of pieces to RAG. Obviously it's not just an acronym. And you tweeted recently that you think every AI engineer should build it from scratch at least once. Why is that? I think so. [00:32:37]

Jerry: I'm actually kind of curious to hear your thoughts about this. Um, but this kind of relates to the initial like AI engineering posts that you put out and then also just like the role of an AI engineer and the skills that they're going to have to learn to truly succeed, because there's an entire spectrum. On one end, you have people that don't really, uh, like understand the fundamentals and just want to use this to like cobble something together to build something. And I think there is a beauty in that for what it's worth. Like, it's just one of those things. And Gen AI has made it so that you can just use these models in inference only mode, cobble something together, use it, power your app experiences, but on the other end, what we're increasingly seeing is that like more and more developers building with these apps start running into, honestly, like pretty similar issues that would plague just a standard ML engineer building like a classifier model, which is just like accuracy problems, like, and hallucinations, basically just an accuracy problem, right? [00:33:24]

Like it's not giving you the right results. So what do you do? You have to iterate on the model itself. You have to figure out what parameters you tweak. You have to gain some intuition about this entire process. That workflow is pretty similar, honestly, even if you're not training the model, to just like tuning an ML model with like hyperparameters and learning like proper ML practices of like, okay, how do I define a good evaluation benchmark? How do I define the right set of metrics to use, right? How do I actually iterate and improve the performance of this pipeline for [00:33:52]

Alessio: production? What tools do I use? [00:33:53]

Jerry: Right? Like every ML engineer uses like some form of Weights & Biases, TensorBoard, or like some other experimentation tracking tool. What tools should I use to actually help build like LLM applications and optimize it for production? There's like a certain amount of just like LLM ops, like tooling and concepts and just like practices that people will kind of have to internalize if they want to optimize these. And so I think that the reason I think being able to build like RAG from scratch is important is it really gives you a sense of like how things are working, to help you build intuition about like what parameters are within a RAG system and which ones you can actually tweak to make them better. Cause otherwise I think that one of the advantages of the LlamaIndex quick start is it's three lines of code. The downside of that is you have zero visibility into what's actually going on [00:34:37]

Alessio: under the hood. [00:34:37]

Jerry: And I think there's something that we've kind of been thinking about for a while and I'm like, okay, let's just release like a new tutorial series. That's just like, not the three lines of code. We're just going to go in and actually show you how the thing actually works under [00:34:47]

Alessio: the hood. Right. [00:34:47]

Jerry: And so I like, does everybody need this? Like probably not as for some people, the three lines of code might work, but I think increasingly, like honestly, 90% of the users I talked to have questions about how to improve the performance of their app. And so just like, given this, it's just like one of those things that's like better for the understanding. [00:35:03]

Alessio: Yeah. [00:35:03]
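For contrast with writing it yourself (see the minimal from-scratch sketch earlier), this is roughly the LlamaIndex quick start Jerry refers to, as it looked around the time of recording; the import paths have moved between versions, so treat it as approximate.

```python
# Load documents, build a vector-store-backed index, and query it.
from llama_index import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
print(query_engine.query("What did the author do growing up?"))
```

Convenient, but chunking, embedding, retrieval, and prompt construction all happen out of sight, which is exactly the visibility a from-scratch pass gives back.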

Swyx: I'd say it is one of the most useful tools of any sort of developer education toolkit to write things yourself from scratch. So Kelsey Hightower famously wrote Kubernetes the Hard Way, which is: don't use Kubernetes. Here's everything that you would have to do by yourself. And you should be able to put all these things together yourself to understand the value of Kubernetes. And the same thing for LlamaIndex. I've done, I was the guy who did the same for React. And it's a pretty good exercise for you to just fully understand everything that's going on under the hood. And I was actually going to suggest, well, in one of the previous conversations, there's all these like hyperparameters, like the size of the chunks and all that. And I was thinking like, what would hyperparameter optimization for RAG look [00:35:44]

Alessio: like? [00:35:44]

Jerry: Yeah, definitely. I mean, so absolutely. I think that's going to be an increasing thing. I think that's something we're kind of looking at because like, I think someone [00:35:52]

Swyx: should just put, do like some large scale study and then just ablate everything. And just you, you tell us. [00:35:57]

Jerry: I think it's going to be hard to find a universal default that works for [00:36:00]

Alessio: everybody. [00:36:00]

Jerry: I think it's going to be somewhat, I do think it's going to be somewhat like dependent on the data and use case. I think if there was a universal default, that would be amazing. But I think increasingly we found, you know, people are just defining their own like custom parsers for like PDFs, markdown files for like, you know, SEC filings versus like Slack conversations. And then like the use case too, like, do you want like a summarization, like the granularity of the response? Like it really affects the parameters that you want to pick. I do like the idea of hyperparameter optimization though, but it's kind of like one of those things where you are kind of like training the model basically kind of on your own data domain. [00:36:36]

Alessio: Yeah. [00:36:36]
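A sketch of what hyperparameter optimization for RAG could look like: grid-search chunk size and top-k against a small Q&A eval set. Here build_pipeline and answer_matches are hypothetical stand-ins for your own stack and metric.

```python
# Grid search over RAG "hyperparameters" using a tiny eval set.
from itertools import product

eval_set = [("What is the refund policy?", "30 days"),
            ("Who is the CEO?", "Jane Doe")]          # tiny illustrative eval set

def build_pipeline(chunk_size: int, top_k: int):
    # Hypothetical stand-in: a real version would re-chunk, re-embed, and
    # return a query function over the rebuilt index.
    return lambda q: "Refunds are accepted within 30 days." if "refund" in q.lower() else ""

def answer_matches(predicted: str, expected: str) -> bool:
    # Simple substring match; could also be an LLM-as-judge call.
    return expected.lower() in predicted.lower()

results = {}
for chunk_size, top_k in product([256, 512, 1024], [2, 5, 10]):
    query = build_pipeline(chunk_size, top_k)
    score = sum(answer_matches(query(q), a) for q, a in eval_set) / len(eval_set)
    results[(chunk_size, top_k)] = score

best = max(results, key=results.get)
print("best (chunk_size, top_k):", best, "with accuracy", results[best])
```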

Swyx: You mentioned custom parsers. You've designed LlamaIndex, maybe we can talk about like the surface area of the [00:36:41]

Alessio: framework. [00:36:41]

Swyx: You designed LlamaIndex in a way that it's more modular, like you mentioned. How would you describe the different components and what's customizable in each? [00:36:50]

Jerry: Yeah, I think they're all customizable. And I think that there is a certain burden on us to make that more clear through the [00:36:57]

Alessio: docs. [00:36:57]

Jerry: Well, number four is customization tutorials. [00:36:59]

Swyx: Yeah, yeah. [00:37:00]

Jerry: But I think like just in general, I think we do try to make it so that you can plug in the out of the box stuff. But if you want to customize more lower level components, like we definitely encourage you to do that and plug it into the rest of our abstractions. So let me just walk through like maybe some of the basic components of LlamaIndex. There's data loaders. You can load data from different data sources. We have LlamaHub, which you guys brought up, which is, you know, a collection of different data loaders for like structured and unstructured data, like PDFs, file types, like Slack, Notion, all that stuff. Now you load in this data. We have a bunch of like parsers and transformers. You can split the text. You can add metadata to the text and then basically figure out a way to load it into like a vector store. So, I mean, you worked at like Airbyte, right? It's kind of like there is some aspect like E and T, right, in terms of like transforming this data, and then the L, right, loading it into some storage abstraction; we have like a bunch of integrations with different document storage systems. [00:37:49]

Alessio: So that's data. [00:37:50]
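The load, transform, and store steps Jerry walks through, written out with LlamaIndex abstractions roughly as they existed at the time; class names and import paths have shifted across versions, so treat this as a sketch rather than the exact current API.

```python
# E-T-L over your documents: load, parse into nodes, then index/store.
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.node_parser import SimpleNodeParser

# E: extract / load raw documents (a LlamaHub loader would slot in here instead).
documents = SimpleDirectoryReader("data").load_data()

# T: transform into nodes -- split text, attach metadata.
parser = SimpleNodeParser.from_defaults(chunk_size=512, chunk_overlap=20)
nodes = parser.get_nodes_from_documents(documents)
for node in nodes:
    node.metadata["source"] = "local-docs"   # example metadata you might attach

# L: load into a storage/index abstraction (vector store-backed here).
index = VectorStoreIndex(nodes)
retriever = index.as_retriever(similarity_top_k=3)
```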

Jerry: And then the second piece really is about like, how do you retrieve this data? How do you like synthesize this data and how do you like do some sort of higher level reasoning over this data? So retrieval is one of the core abstractions that we have. We do encourage people to like customize, define your own retrievers; there's a section on kind of like how you define your own custom retriever, but also we have like out of the box ones. The retrieval algorithm kind of depends on how you structure the data, obviously. Like if you just flat index everything with like chunks with like embeddings, then you can really only do like top-k lookup plus maybe like keyword search or something. But if you can index it in some sort of like hierarchy, like defined relationships, you can do more interesting things like actually traverse relationships between nodes. Then after you have this data, how do you like synthesize the data? [00:38:32]

Alessio: Right. [00:38:32]

Jerry: Um, and, and this is the part where you feed it into the language model. There's some response abstraction that can abstract away over like long contexts to actually still give you a response, even if the context overflows the context window. And then there's kind of these like higher level, like reasoning primitives that I'm going to define broadly. And I'm just going to call them in some general bucket of like agents, even though everybody has different definitions of agents, but you're the first to data agents, [00:38:56]

Swyx: which I was very excited. [00:38:57]

Alessio: Yeah. [00:38:57]

Jerry: We, we kind of coined that term. And the way we, we thought about it was, you know, we wanted to think about how to use agents for, uh, like data workflows basically. And, and so what are the reasoning primitives that you want to do? So the most simple reasoning primitive you can do is some sort of routing module. It's a classifier, like given a query, just make some automated decision on what choice to pick, right? You could use LLMs. You don't have to use LLMs. You could just train a classifier basically. That's something that we might actually explore. And then the next piece is, okay, what are some higher level things? You can have the LLM like define like a query plan, right, to actually execute over the data. You can do some sort of while loop, right? That's basically what an agent loop is, which is like ReAct, chain of thought, like the OpenAI function calling, like while loop, to try to take a question and try to break it down into some, some, uh, series of steps to actually try to execute to get back a response. And so there's a range and complexity from like simple reasoning primitives to more advanced ones. The way we kind of think about it is like, which ones should we implement and how do [00:39:50]

Alessio: they work? [00:39:50]

Jerry: Well, like, do they work well over like the types of like data tasks that we give them? [00:39:54]
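A minimal sketch of the simplest reasoning primitive mentioned here, a routing module, where the "classifier" is just an LLM call (older openai 0.x SDK); the tool names and prompt are assumptions for illustration.

```python
# Route a query to one of a few named tools by asking the LLM to pick.
import openai

TOOLS = {
    "vector_search": "answer questions about specific facts in the documents",
    "summarize": "produce a high-level summary across many documents",
}

def route(query: str) -> str:
    choices = "\n".join(f"- {name}: {desc}" for name, desc in TOOLS.items())
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content":
                   f"Pick exactly one tool name from the list for this query.\n"
                   f"{choices}\n\nQuery: {query}\nTool name:"}],
        temperature=0,
    )
    answer = resp["choices"][0]["message"]["content"].strip()
    return answer if answer in TOOLS else "vector_search"   # fall back on parse failure

print(route("Give me an overview of the 2022 annual report"))
```

A query plan or agent loop is the same idea iterated: route, execute, observe the result, and decide again until the question is answered.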

Alessio: How do you think about optimizing each piece? So take, um, embedding models is one piece of it. You offer fine tuning, embedding models. And I saw it was like fine tuning gives you like 5, 10% increase. What's kind of like the Delta left on the embedding side? Do you think we can get models that are like a lot better? Do you think like that's one piece where people should really not spend too much time? [00:40:16]

Jerry: I just think it's, it's not the only parameter. Cause I think in the end, if you think about everything that goes into retrieval, the chunking algorithm, um, how you define like metadata will bias your embedding representations. Then there's the actual embedding model itself, which is something that you can try optimizing. And then there's like the retrieval algorithm. Are you going to just do top-k? Are you going to do like hybrid search? Are you going to do auto retrieval? Like there's a bunch of parameters. And so I do think it's something everybody should try. I think by default we use like OpenAI's embedding model. A lot of people these days use like sentence transformers because it's, it's just like free, open source, and you can actually optimize, directly optimize it. This is an active area of exploration. I do think one of our goals is it should ideally be relatively free for every developer to just run some fine tuning process over their data to squeeze out some more points in performance. And if it's relatively free and there's no downsides, everybody should basically do [00:41:04]

Alessio: it. [00:41:04]

Jerry: There's just some complexities, right? In terms of optimizing your embedding model, especially in a production grade data pipeline. If you actually fine tune the embedding model and the embedding space changes, you're going to have to reindex all your documents. And for a lot of people, that's not feasible. And so I think like Joe from Vespa on our webinars, like there's this idea that depending on if you're just using like document and query embeddings, you could keep the document embeddings frozen and just train a linear transform on the query or, or any sort of transform on the query, right? So therefore it's just a query side transformation instead of actually having to reindex all the document embeddings. That's pretty smart. We weren't able to get like huge performance gains there, but it does like improve performance a little bit. And that's something that basically, you know, everybody should be able to kick off. You can actually do that on LLlamaIndex too. [00:41:45]
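A toy simulation of the query-side transform idea from the Vespa discussion: document embeddings stay frozen and only a linear transform applied to query embeddings is fit, so nothing has to be re-indexed. A real setup would fit on (query, relevant passage) pairs from your own data; here the pairs and the "domain shift" between query and document space are synthetic.

```python
# Fit a linear transform on the query side only; document vectors never change.
import numpy as np

rng = np.random.default_rng(0)
dim, n = 64, 500

doc_vecs = rng.normal(size=(n, dim))                  # frozen document embeddings
A = rng.normal(size=(dim, dim)) / np.sqrt(dim)        # unknown shift between spaces
query_vecs = doc_vecs @ A + 0.1 * rng.normal(size=(n, dim))   # paired queries

def top1_accuracy(q):
    scores = q @ doc_vecs.T                           # dot-product retrieval
    return float(np.mean(np.argmax(scores, axis=1) == np.arange(n)))

# Closed-form least squares: map transformed queries onto their relevant docs.
W, *_ = np.linalg.lstsq(query_vecs, doc_vecs, rcond=None)

print("top-1 before:", top1_accuracy(query_vecs))
print("top-1 after: ", top1_accuracy(query_vecs @ W))
```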

Swyx: OpenAI has a cookbook on adding bias to the embeddings too, right? [00:41:49]

Alessio: Yeah. [00:41:49]

Jerry: There's just like different parameters that you can, you can try adding to try to like optimize the retrieval process. And the idea is just like, okay, by default you have all this text. It kind of lives in some latent space, right? [00:42:01]

Swyx: Yeah. Shout out, shout out Latent Space. You should take a drink every time. [00:42:05]

Jerry: But it lives in some latent space. But like depending on the specific types of questions that the user might want to ask, the latent space might not be optimized to actually retrieve the relevant piece of context for what the user wants to ask. So can you shift the embedding points a little bit, right? And how do we do that? Basically, that's really a key question here. So optimizing the embedding model, even changing the way you like chunk things, these all shift the embeddings. [00:42:26]

Alessio: So the retrieval is interesting. I got a bunch of startup pitches that are like, RAG is cool, but like there's a lot of stuff in terms of ranking that could be better. There's a lot of stuff in terms of sunsetting data once it starts to become stale that could be better. Are you going to move into that part too? So like you have SEC Insights as one of kind of like your demos. And that's like a great example of, Hey, I don't want to embed all the historical documents because a lot of them are outdated and I don't want them to be in the context. [00:42:55]

Jerry: What's that problem space? [00:42:57]

Alessio: Like how much of it are you going to also help with and versus how much you expect others to take care of? [00:43:03]

Jerry: Yeah, I'm happy to talk about SEC Insights in just a bit. I think more broadly about the like overall retrieval space. We're very interested in it because a lot of these are very practical problems that [00:43:11]

Alessio: people have asked us. [00:43:11]

Jerry: And so the idea of outdated data, I think, how do you like deprecate or time-weight data and do that in a reliable manner, I guess. So you don't just like set some parameter and all of a sudden that affects all your retrieval items, like, is pretty important because people have started bringing [00:43:25]

Alessio: that up. [00:43:25]

Jerry: Like I have a bunch of duplicate documents, things get out of date. How do I like sunset documents? And then remind me, what was the, what was the first thing you said? Cause I think there was, there was something like the ranking ranking, right? [00:43:35]

Alessio: Yeah. [00:43:35]
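A minimal sketch of the time-weighting and sunsetting idea mentioned above: blend similarity with an exponential recency decay so stale documents fade out instead of being hard-deleted. The half-life and blend weight are arbitrary knobs you would tune.

```python
# Blend relevance with freshness so out-of-date documents score lower.
from datetime import datetime, timezone

HALF_LIFE_DAYS = 90.0

def recency_weight(updated_at: datetime, now: datetime) -> float:
    age_days = (now - updated_at).total_seconds() / 86400
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

def blended_score(similarity: float, updated_at: datetime,
                  now: datetime, alpha: float = 0.7) -> float:
    # alpha trades off pure relevance against freshness.
    return alpha * similarity + (1 - alpha) * recency_weight(updated_at, now)

now = datetime(2023, 9, 1, tzinfo=timezone.utc)
fresh = blended_score(0.80, datetime(2023, 8, 20, tzinfo=timezone.utc), now)
stale = blended_score(0.85, datetime(2021, 1, 1, tzinfo=timezone.utc), now)
print(f"fresh doc: {fresh:.3f}, stale (slightly more similar) doc: {stale:.3f}")
```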

Jerry: So I think this space is not new. I think everybody who is new to this space starts learning some basic concepts of information retrieval, which to be fair has been around for quite a bit. But our goal is to kind of like take some of like just general ranking and information retrieval concepts. So bi-encoders, like cross-encoders, right? Like BERT-based models versus like kind of keyword based search. How do you actually evaluate retrieval? These things start becoming relevant. And so I think for us, like rather than inventing like new retriever techniques for the sake of like just inventing better ranking, we want to take existing ranking techniques and kind of like package it in a way that's like intuitive and easy for people to understand. That said, I think there are interesting and new retrieval techniques that are kind of in place that can be done when you tie it into some downstream RAG system. The reason for this is just like, if you think about the idea of like chunking text, right? Like that just really wasn't a thing, or at least for this specific purpose. Like the reason chunking is a thing in RAG right now is because like you want to fit within the context window of an LLM, right? Like why do you want to chunk a document? That just was less of a thing. I think back then, if you wanted to like transform a document, it was more for like structured data extraction or something in the past. And so there's kind of like certain new concepts that you got to play with that you can use to invent kind of more interesting retrieval techniques. Another example here is actually LLM based reasoning, like LLM based chain of thought reasoning. You can take a question, break it down into smaller components and use that to actually send to your retrieval system. And that gives you better results, as opposed to just sending the full question to a retrieval system. That also wasn't really a thing back then, but then you can kind of figure out an interesting way to like blend the old and the new, right? With LLMs and data. [00:45:13]
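A sketch of the LLM-based query decomposition Jerry describes: break a comparison question into sub-questions and retrieve for each one separately, rather than sending the full question to the retriever. It assumes the older openai 0.x SDK; the prompt format and the retrieve() stub are placeholders.

```python
# Decompose a question into sub-questions, then retrieve per sub-question.
import json
import openai

def decompose(question: str) -> list[str]:
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content":
                   "Break this question into 2-4 standalone sub-questions, "
                   f"returned as a JSON list of strings.\n\nQuestion: {question}"}],
        temperature=0,
    )
    # Sketch only: real code should guard against non-JSON model output.
    return json.loads(resp["choices"][0]["message"]["content"])

def retrieve(sub_question: str) -> list[str]:
    # Stand-in for your retriever (vector search, keyword search, etc.).
    return [f"[chunks relevant to: {sub_question}]"]

question = "How did Uber's and Lyft's revenue growth compare in 2022?"
context = []
for sub in decompose(question):
    context.extend(retrieve(sub))
print(context)
```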

Swyx: There's a lot of ideas that you come across. Do you have a store of them? [00:45:17]

Jerry: Yeah, I think I, sometimes I get like inspiration. There's like some problem statement and I'm just like, oh, it's like, following you is [00:45:23]

Swyx: very hard because it's just a lot of homework. [00:45:25]

Jerry: So I think I've, I've started to like step on the brakes just a little bit. Cause then I start, no, no, no. Well, the, the reason is just like, okay, if I just invent like a hundred more retrieval techniques, like, like sure. But like, how do people know which one is good and which one's like bad? [00:45:41]

Alessio: Right. [00:45:41]

Jerry: And so have a librarian, right? [00:45:42]

Swyx: Like it's going to catalog it and you're going to need some like benchmarks. [00:45:45]

Jerry: And so I think that's probably the focus for the next, next few weeks is actually like properly kind of like having an understanding of like, oh, you know, when should you do this or like, what does this actually work well? [00:45:54]

Alessio: Yeah. [00:45:54]

Swyx: Some kind of like a, maybe like a flow chart, decision tree type of thing. Yeah, exactly. When this do that, you know, something like that, that would be really helpful for me. [00:46:02]

Alessio: Thank you. [00:46:02]

Swyx: It seems like your most successful side project. Yeah. What is SEC Insights for our listeners? [00:46:07]

Jerry: Um, our SEC Insights is a full stack LLM chatbot application, um, that does analysis of your SEC 10-K and 10-Q filings. And so the goal for building this project is really twofold. The reason we started building this was one, it was a great way to dogfood the production readiness of our library. We actually ended up like adding a bunch of stuff and fixing a ton of bugs because of this. And I think it was great because like, you know, thinking about how we handle like callbacks, streaming, actually generating like reliable sub responses and bubbling up sources, citations. These are all things that like, you know, if you're just building the library in isolation, you don't really think about it. But if you're trying to tie this into a downstream application, like it really starts mattering for your error messages. When you talk about bubbling up stuff for like sources, like if you go into SEC Insights and you type something, you can actually see the highlights on the right side. That was something that like took a little bit of like, um, understanding to figure out how to build well. And so it was great for dogfooding improvement of the library itself. And then as we were building the app, um, the second thing was we were starting to talk to users and just like trying to showcase to kind of, uh, bigger companies the potential of LlamaIndex as a framework. Because these days obviously building a chatbot, right, with Streamlit or something, it'll take you like 30 minutes or an hour. Like there's plenty of templates out there on LlamaIndex, LangChain, like you can just build a chatbot, but how do you build something that kind of like satisfies some of this criteria of surfacing, like, citations, being transparent, uh, having a good UX, um, and then also being able to handle different types of questions, right? Like more complex questions that compare different documents. That's something that I think people are still trying to explore. And so what we did was like, we showed, well, first like organizations, the possibilities of like what you can do when you actually build something like this. And then after, like, you know, we kind of like stealth launched this for fun, just as a separate project, uh, just to see if we could get feedback from users who were using this, to see like, you know, how we can improve stuff. And then we thought like, ah, you know, we built this, right? Obviously we're not going to sell like a financial app. Like that's not really in our wheelhouse, but we're just going to open source the entire thing. And so that now is basically just like a really nice, like full stack app template you can use and customize on your own, right, to build your own chatbot, whether it is for financial documents or like other types of documents. Um, and it provides like a nice template for basically anybody to kind of like go in and get started. There's certain components though, that like aren't released yet that we're going to release in the next few weeks. Like one is just like kind of more detailed guides on like different modular components within it. So if you're like a full stack developer, you can go in and actually take the pieces that you want and actually kind of build your own custom flows.
The second piece is like, take, there's like certain components in there that might not be directly related to the LLM app that would be nice to just like have people use, uh, an example is the PDF viewer, like the PDF viewer with like citations. I think we're just going to give that right. So, you know, you could be using any library you want, but then you can just, you know, just drop in a PDF viewer. [00:48:53]

Alessio: Right. [00:48:53]

Jerry: So that it's just like a fun little module that you can do. [00:48:55]

Swyx: Nice. That's really good community service right there. I want to talk a little bit about your cloud offering, because you mentioned, I forget the name that you had for it. [00:49:04]

Alessio: Enterprise something. [00:49:04]

Jerry: Well, one, we haven't come up with a name. Uh, we're kind of calling it LlamaIndex Platform, or LlamaIndex Enterprise. I'm open to suggestions here. Um, and the second thing is I don't actually know how much I can, I can share right now because it's mostly kind of like, uh, we, we, yeah, exactly. [00:49:20]

Swyx: To the extent that you can talk about LLM index as a business. Um, always just want to give people in the mind, like, Hey, like you sell things too, you know what I mean? [00:49:28]

Jerry: Yeah, a hundred percent. So I think the high level of what I can probably say is just like, I think we're looking at ways of like actively kind of complementing the developer experience of like building with LlamaIndex. We've always been very focused on stuff around like plugging in your data into the language model. And so can we build tools that help like augment that experience beyond the open [00:49:47]

Alessio: source library? Right. [00:49:48]

Jerry: And so I think what we're going to do is like make, build an experience where it's very seamless to transition from the open source library with like a one line toggle, you can basically get this like complementary service, and then figure out a way to like monetize in a bit. I think our revenue focus this year is less emphasized. Like it's more just about like, can we build some managed offering that like provides complementary value to what the open source library provides? [00:50:09]

Alessio: Yeah. [00:50:10]

Swyx: I think it's the classic thing about all open source is you want to start building the most popular open source project in your category to own that category. You're going to make it very easy to host. Therefore you've just built your biggest competitor, which is you. [00:50:22]

Jerry: I think it will be like complementary. Cause I think it will be like, you know, use the open source library and then you have a toggle and all of a sudden, you know, you can see this basically like a pipeline ish thing pop up and then it will be able to kind of like, you'll have a UI. There'll be some enterprise guarantees and the end goal would be to help you build like a production RAG app more easily. [00:50:42]

Alessio: Data loaders. There's a lot of them. What are maybe some of the most popular, maybe under, not underrated, but like underexpected, you know, and how has the open source side of it helped with like getting a lot more connectors, you only have six people on the team today, so you couldn't have done it all yourself. [00:51:00]

Jerry: Yeah. I think the nice thing about like LlamaHub itself, it's supposed to be a community driven hub. Um, and so actually the bulk of the PRs are completely community contributed. Um, and so we haven't written that many like first party connectors actually for this, it's more just like kind of encouraging people to contribute to the community. In terms of the most popular tools, uh, or the data loaders, I think we have Google Analytics on this and I forgot the specifics. It's some mix of like the PDF loaders. We have like 10 of them, but there's some subset of them that are popular. And then there's Google, like I think Gmail and like G Drive. Um, and then I think maybe it's like one of Slack or Notion. One thing I will say though, uh, and I think like Swyx probably knows this better than I do, given that you used to work at Airbyte: it's very hard to build, like, especially for a full on service like Notion, Slack, or like Salesforce, to build like a really, really high quality loader that really extracts all the information that people want. [00:51:51]

Alessio: Right. [00:51:51]

Jerry: And so I think the thing is when people start out, like they will probably use these loaders and it's a great tool to get started. And for a lot of people, it's like good enough. And they submit PRs if they want more additional features. But if you get to a point where you actually want to call like an API that hasn't been supported yet, or, you know, you want to load in stuff like metadata or something that hasn't been directly baked into the logic of a loader itself, people start, like, writing their own custom loaders. And that is a thing that we're seeing. That's something that we're okay with. [00:52:18]

Alessio: Right. [00:52:18]

Jerry: Cause like a lot of this is more just like community driven. And if you want to submit a PR to improve the existing one, you can, otherwise you can create your own custom ones. [00:52:24]

Alessio: Yeah. [00:52:25]

Swyx: And all that is custom loaders, all supported within LlamaIndex, or do you pair it with something else? [00:52:29]

Jerry: Oh, it's just like, I mean, you just define your own subclass. I think, I think that's it. [00:52:33]

Alessio: Yeah. Yeah. [00:52:33]

Swyx: Cause typically in the data ecosystem with everybody, everybody has his own strategies with custom loaders, but also you could write your own with like Dagster or like Prefect or one of those tools. [00:52:43]

Alessio: Yeah. [00:52:44]

Jerry: Yeah, exactly. So I think for us, it's more, we just have a very flexible like document abstraction that you can fill in with any content that you want. [00:52:50]
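Jerry's "just define your own subclass" point, sketched with the reader and Document abstractions roughly as they looked at the time; module paths have moved around between LlamaIndex versions, so the imports are approximate and the loader itself is hypothetical.

```python
# A hypothetical custom loader: one Document per entry of a changelog file.
from llama_index import Document
from llama_index.readers.base import BaseReader

class ChangelogReader(BaseReader):
    """Split a changelog on blank lines and wrap each entry as a Document."""

    def load_data(self, path: str) -> list[Document]:
        with open(path) as f:
            entries = [e.strip() for e in f.read().split("\n\n") if e.strip()]
        return [
            Document(text=entry, metadata={"source": path, "entry": i})
            for i, entry in enumerate(entries)
        ]

# docs = ChangelogReader().load_data("CHANGELOG.md")
# index = VectorStoreIndex.from_documents(docs)
```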

Swyx: Are people really dumping all their Gmail into these things? You said Gmail is number two. Uh, I'm not sure actually. I mean, that's, you know, that's the most private data source. [00:52:59]

Alessio: That's true. [00:53:00]

Swyx: So I'm surprised that people are dumping too. I mean, I'm sure some, some people are, but like, I'm sure I'm surprised it's [00:53:06]

Alessio: popular. [00:53:06]

Swyx: Well, and then, so, uh, the LLM engine, uh, I assume OpenAI is going to be a majority. Is it an overwhelming majority? Uh, how, what's the market share between like OpenAI, Cohere, Anthropic, you know, whatever you're seeing. [00:53:21]

Alessio: OpenSource too. [00:53:21]

Jerry: Yeah, I think it's probably some, uh, OpenAI has a majority, but then like there's Anthropic and there's also, um, OpenSource. I think there is a lot of people trying out like Llama 2, um, and, and, um, some variant of like a top OpenSource model. [00:53:33]

Swyx: Side note, any confusion there, Llama 2 versus Llama? [00:53:36]

Jerry: Yeah, I think whenever I go to these talks, I always open it up with, like, we started before it. Yeah, exactly. We started before Meta, right? [00:53:43]

Alessio: I want to point that out. [00:53:43]

Jerry: Uh, but no, for us, we try to use it for branding. We just add two llamas when we have a Llama 2 integration instead of one llama. So I think a lot of people are trying out the popular open-source models. Uh, there are a lot of toolkits and open-source projects that allow you to self-host and deploy Llama 2, and, like, Ollama is just a very recent example that we added an integration with. And so just, uh, by virtue of having more of these services, I think more and more people are trying it out. [00:54:07]

Swyx: Do you think there's potential there? Is that going to be an increasing trend? Like open source? [00:54:12]

Alessio: Yeah. [00:54:12]

Jerry: Yeah, definitely. I think in general people hate monopolies. And so, um, whenever OpenAI has something really cool, or any company has something really cool, even Meta, there's just going to be a huge competitive pressure from other people to do something that's more open and better. Um, and so I do think just market pressures will improve open-source adoption. [00:54:32]

Swyx: Last thing I'll say about this: open source gets clicks. People psychologically want that, but then at the end of the day, they fall for brand name and popularity and performance benchmarks. You know, at the end of the day, OpenAI still wins on that. I think that's true. [00:54:47]

Jerry: But I just think, unless you're an active employee at OpenAI, right? All these research labs are putting out ML PhDs, and other companies too are investing a lot of dollars. Uh, there's going to be a lot of competitive pressure to develop better models. So is it going to be all fully open source with a permissive license? I'm not completely sure, but there's just a lot of incentive for people to develop their stuff here. [00:55:09]

Swyx: Have you looked at RAG-specific models, like Contextual? [00:55:12]

Alessio: No. [00:55:13]

Jerry: Is it public? [00:55:14]

Swyx: No, they literally just, uh, so Douwe Kiela, I think is his name. And you probably came across him. He wrote the RAG paper at Meta and just started Contextual AI to create a RAG-specific model. I don't know what that means. I was hoping that you do, cause it's your business. [00:55:29]

Jerry: As if I had insider information. I mean, you know, to be honest, I think this kind of relates to my previous point on RAG and fine-tuning. A RAG-specific model is a model architecture that's designed for better RAG, and it's less the software engineering principle of, how can I take existing stuff and just plug and play different components into it? Um, and there's a beauty in that from ease of use and modularity, but when you want to end-to-end optimize the thing, you might want a more specific model. I think building your own models is honestly pretty hard. Um, and I think the issue is, if you also build your own models, you're also just gonna have to keep up with the rate of LLM advances. Basically the question is, when GPT-5 and 6 and whatever, like Anthropic Claude 3, comes out, how can you prove that you're actually better than, uh, software developers cobbling together components on top of a base model? Right. Even if it's just, like, conceptually this is better than maybe GPT-3 or GPT-4. [00:56:21]

Alessio: What about vector stores? I know Swyx is wearing a Chroma sweatshirt. [00:56:25]

Swyx: Yeah, because of the swag. [00:56:27]

Jerry: I have, I have the mug from Chroma. [00:56:29]

Alessio: Yeah. It's been great. Yeah. [00:56:30]

Jerry: What do you think there? [00:56:31]

Alessio: Like there's a lot of them. Are they pretty interchangeable for your users' use cases? Uh, is HNSW all we need? Is there room for improvements? [00:56:40]

Swyx: Is NTRA all we need? [00:56:42]

Jerry: I think, um, yeah, we try to remain unopinionated about storage providers, so we don't try to play favorites. We have a bunch of integrations, obviously, and the way we try to do it is we just try to find some standard interfaces, but obviously different vector stores will support, uh, slightly additional things like metadata filters and those things. I mean, the goal is to basically leave it up to our users to figure out what makes sense for their use case. In terms of the algorithm itself, I don't think the delta on improving the vector store's, like, embedding lookup algorithm [00:57:10]

Alessio: Is that high? [00:57:10]

Jerry: I think the stuff has been mostly solved or at least there's just a lot of other stuff you can do to try to improve the overall performance. No, I mean like everything else that we just talked about, like in terms of like [00:57:20]

Alessio: accuracy, right. [00:57:20]

Jerry: To improve rag, like everything that we talked about, like chunking, like metadata, like. [00:57:24]

Swyx: I mean, I was just thinking, maybe for me the interesting question is, you know, there are like eight of them; it's kind of a Game of Thrones, the war of the eight databases right now. Oh, I see. Um, how do they stand out, and how do they become very good partners? [00:57:36]

Alessio: With LlamaIndex. [00:57:36]

Jerry: Yeah, we're pretty good partners with, with most of them. [00:57:39]

Alessio: Uh, let's see. [00:57:39]

Swyx: Well, like if you're a, you know, vector database founder, like what do you, what do you work on? [00:57:44]

Alessio: It's a good question. [00:57:44]

Jerry: I think one thing I'm very interested in, and this is something I think I've started to see a general trend towards, is combining structured data querying with unstructured data querying. Um, and I think that will probably just expand the query sophistication of these vector stores and basically make it so that users don't have to think about it. You would just call this, like, hybrid querying. [00:58:05]

Swyx: Is that what Weaviate's doing? [00:58:06]

Alessio: Yeah. [00:58:07]

Jerry: I mean, I think if you think about metadata filters, that's basically a structured filter. It's like a SELECT ... WHERE something equals something, and then you combine that with semantic search. I think LanceDB or something was trying to do some joint interface. The reason is that most data is semi-structured. There's some structured annotations and there's some unstructured text. And so, um, somehow combining all the expressivity of SQL with the flexibility of semantic search is something that I think is going to be really important. We have some basic hacks right now that allow you to jointly query both a SQL database and a separate vector store to combine the information. That's obviously going to be less efficient than if you just combined it into one [00:58:46]

Alessio: system. Yeah. [00:58:46]

Jerry: And so I think pgvector, you know, that type of stuff, I think it's starting to get there, but in general, how do you have an expressive query language to actually do structured querying along with all the capabilities of semantic search? [00:58:57]
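
To make the hybrid querying idea concrete, here is a toy sketch, not any particular vector store's API: a metadata filter plays the role of the SQL WHERE clause, and cosine similarity over embeddings plays the role of semantic search. The documents and embeddings are made up.

```python
# Toy hybrid query: structured metadata filter first, then semantic ranking.
import numpy as np

docs = [
    {"text": "Q2 revenue grew 12%", "year": 2023, "emb": np.array([0.9, 0.1, 0.0])},
    {"text": "New office opened in Austin", "year": 2022, "emb": np.array([0.1, 0.8, 0.1])},
    {"text": "Q3 revenue guidance raised", "year": 2023, "emb": np.array([0.8, 0.2, 0.1])},
]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def hybrid_query(query_emb, year, top_k=2):
    # Structured part: the WHERE clause.
    candidates = [d for d in docs if d["year"] == year]
    # Unstructured part: rank the survivors by embedding similarity.
    ranked = sorted(candidates, key=lambda d: cosine(query_emb, d["emb"]), reverse=True)
    return ranked[:top_k]

print([d["text"] for d in hybrid_query(np.array([1.0, 0.0, 0.0]), year=2023)])
```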

Swyx: So is your current favorite just to put it into Postgres? No, no, no. I mean the query language, not Postgres itself. [00:59:05]

Jerry: I actually don't know what the best language would be for this, because I think it will be something that the model hasn't been fine-tuned over. Um, and so you might want to train the model over this. But some way of expressing structured data filters, and this could include time too, right? It doesn't have to just be a WHERE clause combined with this idea of, like, [00:59:26]

Alessio: semantic search. Yeah. [00:59:27]

Swyx: And we talked about, uh, graph representations. [00:59:30]

Alessio: Yeah. Oh yeah. [00:59:30]

Jerry: That's another thing too. Yeah, so that's actually something I didn't even bring up yet. There's this interesting idea of, can you actually have the language model explore relationships within the data too, right? And somehow combine that information with stuff that's more structured within the DB. [00:59:46]

Alessio: Awesome. [00:59:46]

Swyx: What are your current strong beliefs about how to evaluate RAG? [00:59:49]

Jerry: I think I have thoughts. I think we're trying to curate this into some more opinionated principles, because there are some open questions here. One question I had to think about is whether you should do evals component by component first, or just do the end-to-end thing. I think you might actually just want to do the end-to-end thing first, just to do a sanity check of whether, given a query and the final response, it even makes sense, like you eyeball [01:00:11]

Alessio: it, right. [01:00:11]

Jerry: And then you like try to do some basic evals. And then once you like diagnose what the issue is, then you go into the kind of like specific area to define some more, uh, solid benchmarks and try to like [01:00:21]

Alessio: improve stuff. [01:00:21]

Jerry: So what are end-to-end evals? You have a query, it goes in through the retrieval system, you get back something, you synthesize a response, and that's your final thing. And you evaluate the quality of the final response. And these days, there are plenty of projects, startups, companies, research, doing stuff around using GPT-4, right, as like a human judge, to basically synthetically generate data. [01:00:41]
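
To make the GPT-4-as-judge pattern concrete, a hedged sketch follows; the rubric wording and the call_llm hook are placeholders rather than any specific framework's API, so plug in whichever chat-completion client you actually use.

```python
# Sketch of an LLM-as-judge end-to-end eval: the judge model scores a
# (question, context, answer) triple against a rubric. The model call is
# left abstract on purpose.
JUDGE_PROMPT = """You are grading a RAG system's answer.

Question: {question}
Retrieved context: {context}
Answer: {answer}

Score the answer from 1 to 5 for faithfulness to the context and relevance
to the question. Reply with a single integer."""


def judge(question: str, context: str, answer: str, call_llm) -> int:
    """call_llm is any function that takes a prompt string and returns text."""
    prompt = JUDGE_PROMPT.format(question=question, context=context, answer=answer)
    reply = call_llm(prompt)
    return int(reply.strip()[0])  # naive parse; real evals want something sturdier
```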

Swyx: I don't know from the startup side. [01:00:43]

Jerry: I just know from a technical side. I think people are going to do more of it. The main issue right now is just, uh, it's really unreliable. Like there's variance in the responses, whatever you want. [01:00:54]

Alessio: They won't do more of it. [01:00:54]

Swyx: I mean, cause it's bad. [01:00:55]

Jerry: No, but, but these models will get better and you'll probably fine tune a model to [01:00:59]

Alessio: be a better judge. [01:00:59]

Jerry: I think that's probably what's going to happen. So I'm reasonably bullish on this, because I don't think there's really a good alternative beyond just human-annotating a bunch of datasets and then manually going through and curating, like, evaluating eval metrics. And so this is just going to be a more scalable solution in terms of the [01:01:17]

Alessio: startups. Yeah. [01:01:17]

Jerry: I mean, I think there are a bunch of companies doing this. In the end, it probably comes down to some aspect of UX, speed, whether you can fine-tune a model. So that's end-to-end evals. And then I think what we found is, for RAG, a lot of times what ends up affecting the end response is retrieval. You're just not able to retrieve the right context. And so I think having proper retrieval benchmarks, especially if you want to do production RAG, is actually quite important. What does having good retrieval metrics tell you? It tells you that at least the retrieval is good. It doesn't necessarily guarantee the end generation is good, but at least it gives you some, uh, sanity check, right? So you can fix one component while optimizing the rest. And retrieval evaluation is pretty standard, and it's been around for a while. It's just an IR problem. Basically you have some input query, you get back some retrieved set of context, and then there's some ground truth in that ranked set. And then you try to measure it based on ranking metrics. So the closer that ground truth is to the top, the more you reward the evals. And the closer it is to the bottom, or if it's not in the retrieved set at all, the more you penalize the evals. Um, and so that's just a classic ranking problem. I think most people starting out probably don't know how to do this right [01:02:28]

Alessio: now. [01:02:28]

Jerry: We just launched some basic retrieval evaluation modules to help users [01:02:32]

Alessio: do this. [01:02:32]

Jerry: One challenge is just curating this dataset in the first place. And one thing that we're very interested in is this idea of synthetic dataset generation for evals. So how can you, given some context, generate a set of questions with GPT-4, and then all of a sudden you have question and context pairs, and that becomes your ground truth. [01:02:47]
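
Here is a minimal sketch of the ranking-style retrieval eval described above, using hit rate and mean reciprocal rank (MRR) over synthetic question/context pairs; the chunk ids and numbers are invented for the example.

```python
# Retrieval eval as a classic ranking problem: reward the ground-truth chunk
# for appearing near the top of the retrieved list, penalize it for being
# missing or buried.
def hit_rate(ground_truth_id, retrieved_ids):
    return 1.0 if ground_truth_id in retrieved_ids else 0.0

def reciprocal_rank(ground_truth_id, retrieved_ids):
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id == ground_truth_id:
            return 1.0 / rank
    return 0.0

# One synthetic (question, context) pair per row; the ground truth is the
# chunk the question was generated from.
eval_rows = [
    {"gt": "chunk-7", "retrieved": ["chunk-7", "chunk-2", "chunk-9"]},
    {"gt": "chunk-3", "retrieved": ["chunk-1", "chunk-3", "chunk-8"]},
    {"gt": "chunk-5", "retrieved": ["chunk-2", "chunk-4", "chunk-9"]},
]

mrr = sum(reciprocal_rank(r["gt"], r["retrieved"]) for r in eval_rows) / len(eval_rows)
hits = sum(hit_rate(r["gt"], r["retrieved"]) for r in eval_rows) / len(eval_rows)
print(f"MRR={mrr:.2f}, hit rate={hits:.2f}")  # MRR=0.50, hit rate=0.67
```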

Swyx: Are data agent evals the same thing, or is there a separate set of stuff for agents that you think is relevant here? [01:02:53]

Jerry: Yeah, I think data agents add like another layer of complexity. Cause then it's just like, you have just more loops in the system. Like you can evaluate like each chain of thought loop itself, like every LLM call to see whether or not the input to that specific step in the chain of thought process actually works or is correct. Or you can evaluate like the final response to see if that's correct. This gets even more complicated when you do like multi-agent stuff, because now you have like some communication between like different agents. Like you have a top level orchestration agent passing it on to some low level [01:03:24]

Alessio: stuff. [01:03:24]

Jerry: I'm probably less familiar with agent eval frameworks. I know they're starting to become a thing. Talking to, like, Joon from the Generative Agents paper, which is pretty unrelated to what we're doing now. But it's very interesting, where you can kind of evaluate overall agent simulations by understanding whether or not they modeled the distribution of human behavior. But that's, like, a very macro principle. [01:03:46]

Alessio: Right. [01:03:46]

Jerry: And that's very much to evaluate stuff, to kind of like model the distribution of [01:03:51]

Alessio: things. [01:03:51]

Jerry: And I think that works well when you're trying to like generate something for like creative purposes, but for stuff where you really want the agent to like achieve a certain task, it really is like whether or not it achieved the task or not. [01:04:01]

Alessio: Right. [01:04:01]

Jerry: Cause then it's not like, Oh, does it generally mimic human behavior? It's like, no, like did you like send this email or not? [01:04:07]

Alessio: Right. [01:04:07]

Jerry: Like, cause otherwise like this, this thing didn't work. [01:04:09]

Alessio: Awesome. Let's jump into a lightning round. So we have two questions, acceleration, exploration, and then one final takeaway. The acceleration question is: what's something that already happened in AI that you thought would take much longer to get here? [01:04:23]

Jerry: I think just the ability of LLMs to generate believable outputs, for text and also for images. And the whole reason I started hacking around with LLMs, honestly, I felt like I got into it pretty late. I should've gotten into it in early 2022, because GPT-3 had been out for a while. Just the fact that there was this engine that was capable of, like, reasoning and no one was really tapping into it. And then the fact that, you know, I used to work in image generation for a while. I did GANs and stuff back in the day, and that was pretty hard to train. You would generate these 32 by 32 images. And now taking a look at some of the stuff by DALL-E and, you know, Midjourney and those things, it's just very good. [01:04:59]

Alessio: Yeah. [01:04:59]

Swyx: Exploration. What do you think is the most interesting unsolved question in AI? [01:05:03]

Jerry: Yeah, I'd probably work on some aspect of, um, personalization of memory. I think a lot of people have thoughts about that, but for what it's worth, I don't think the final state will be RAG. I think it will be some fancy algorithm or architecture where you bake it into the architecture of the model itself. Like, if you have a personalized assistant that you can talk to that will learn behaviors over time, right, and learn stuff through conversation history, what exactly is the right architecture there? I do think that will be part of, like, some long-running continuous fine-tuning. [01:05:38]

Swyx: Yeah. [01:05:39]

Jerry: Like some aspect of that, right. [01:05:40]

Alessio: Right. [01:05:40]

Jerry: Like, I don't actually know the specific technique, but I don't think it's just going to be something where you have a fixed vector store and that thing will be the thing that stores all your memories. [01:05:48]

Swyx: It's interesting because I feel like using model weights for memory, it's just such an unreliable storage device. [01:05:56]

Jerry: I know. But like, I just think, uh, from like the AGI, like, you know, just modeling like the human brain perspective, I think that there is something nice about just like being able to optimize that system. [01:06:08]

Alessio: Right. [01:06:08]

Jerry: And to optimize a system, you need parameters and then that's where you just get into the neural net piece. [01:06:12]

Alessio: Cool, cool. Uh, and yeah, takeaway. You've got the audience's ear. What's something you want everyone to think about or take away from this conversation and your thinking? [01:06:24]

Jerry: I think there were a few key things. Uh, so we talked about two of them already. The first is SEC Insights, which, if you guys haven't checked it out, I'd definitely encourage you to do so, because it's not just a random SEC app, it's a full-stack thing that we open sourced, right? It provides a template for you to build production-grade RAG apps. Um, and we're going to open source and modularize more components of that soon and do a workshop on it. And the second piece is, I think we are thinking a lot about retrieval and evals. Um, I think right now we're exploring integrations with a few different partners, and so hopefully some of that will be out really soon. So just, how do you basically have an experience where you write LlamaIndex code and all of a sudden you can easily run retrieval, evals, and traces, all that stuff, as a service? And so I think we're working with a few providers on that. And then the other piece, which we did talk about already, is this idea of building RAG from scratch. I mean, I think everybody should do it. I would check out the guide if you guys haven't already, I think it's in our docs. But instead of just using, you know, either the retriever query engine in LlamaIndex or the conversational QA chain in LangChain, I would take a look at how you actually chunk and parse data and do, like, top-k embedding retrieval, because I really think that by doing that process, it helps you understand the decisions, the prompts, the language models to use. [01:07:42]
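
In the spirit of the "build RAG from scratch" takeaway, here is a compressed sketch of the chunk, embed, and top-k retrieve steps; embed() is deliberately a fake stand-in for a real embedding model so the whole thing runs standalone, and the corpus text is invented.

```python
# RAG from scratch, heavily compressed: chunk the text, embed the chunks,
# retrieve the top-k most similar chunks, and build the prompt yourself.
import numpy as np

def chunk(text, size=200, overlap=40):
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(texts):
    # Placeholder embeddings so the sketch runs without an API key.
    rng = np.random.default_rng(0)
    return {t: rng.normal(size=8) for t in texts}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query, chunks, index, k=3):
    q = embed([query])[query]
    return sorted(chunks, key=lambda c: cosine(q, index[c]), reverse=True)[:k]

raw = ("Revenue grew 12 percent year over year. " * 20
       + "Key risks include supply chain concentration and FX exposure. " * 20)
corpus = chunk(raw)
index = embed(corpus)
context = top_k("What were the main risk factors?", corpus, index)
prompt = "Answer using only this context:\n" + "\n---\n".join(context)
```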

Alessio: That's it. Yeah. [01:07:44]

Swyx: Thank you so much, Jerry. [01:07:45]

Alessio: Yeah. [01:07:45]

Jerry: Thank you. [01:07:46]



Get full access to Latent Space at www.latent.space/subscribe
2023-10-05
Länk till avsnitt

Building the Foundation Model Ops Platform - with Raza Habib of Humanloop

Want to help define the AI Engineer stack? >500 folks have weighed in on the top tools, communities and builders for the first State of AI Engineering survey! Please fill it out (and help us reach 1000!)

The AI Engineer Summit schedule is now live! We are running two Summits and judging two Hackathons this Oct. As usual, see our Discord and community page for all events.

A rite of passage for every AI Engineer is shipping a quick and easy demo, and then having to cobble together a bunch of solutions for prompt sharing and versioning, running prompt evals and monitoring, storing data and finetuning as their AI apps go from playground to production. This happens to be Humanloop's exact pitch.

full show notes: https://latent.space/p/humanloop

Timestamps

* [00:01:21] Introducing Raza

* [00:10:52] Humanloop Origins

* [00:19:25] What is HumanLoop?

* [00:20:57] Who is the Buyer of PromptOps?

* [00:22:21] HumanLoop Features

* [00:22:49] The Three Stages of Prompt Evals

* [00:24:34] The Three Types of Human Feedback

* [00:27:21] UI vs BI for AI

* [00:28:26] LangSmith vs HumanLoop comparisons

* [00:31:46] The TAM of PromptOps

* [00:32:58] How to Be Early

* [00:34:41] 6 Orders of Magnitude

* [00:36:09] Becoming an Enterprise Ready AI Infra Startup

* [00:40:41] Killer Usecases of AI

* [00:43:56] HumanLoop's new Free Tier and Pricing

* [00:45:20] Addressing Graduation Risk

* [00:48:11] On Company Building

* [00:49:58] On Opinionatedness

* [00:51:09] HumanLoop Hiring

* [00:52:42] How HumanLoop thinks about PMF

* [00:55:16] Market: LMOps vs MLOps

* [00:57:01] Impact of Multimodal Models

* [00:57:58] Prompt Engineering vs AI Engineering

* [01:00:11] LLM Cascades and Probabilistic AI Languages

* [01:02:02] Prompt Injection and Prompt Security

* [01:03:24] Finetuning vs HumanLoop

* [01:04:43] Open Standards in LLM Tooling

* [01:06:05] Did GPT4 Get Dumber?

* [01:07:29] Europe's AI Scene

* [01:09:31] Just move to SF (in The Arena)

* [01:12:23] Lightning Round - Acceleration

* [01:13:48] Continual Learning

* [01:15:02] DeepMind Gato Explanation

* [01:17:40] Motivations from Academia to Startup

* [01:19:52] Lightning Round - The Takeaway



Get full access to Latent Space at www.latent.space/subscribe
2023-09-29
Länk till avsnitt

Heralds of the AI Content Flippening - with Youssef Rizk of Wondercraft.ai

Want to help define the AI Engineer stack? Have opinions on the top tools, communities and builders? We're collaborating with friends at Amplify to launch the first State of AI Engineering survey! Please fill it out (and tell your friends)!

In March, we started off our GPT4 coverage framing one of this year's key forks in the road as the "Year of Multimodal vs Multimodel AI". 6 months in, neither has panned out yet. The vast majority of LLM usage still defaults to chatbots built atop OpenAI (per our LangSmith discussion), and rumored GPU shortages have prevented the broader rollout of GPT-4 Vision. Most "AI media" demos like AI Drake and AI South Park turned out heavily human engineered, to the point where the AI label is more marketing than honest reflection of value contributed.

However, the biggest impact of multimodal AI in our lives this year has been a relatively simple product - the daily HN Recap podcast produced by Wondercraft.ai, a 5 month old AI podcasting startup. As swyx observed, the "content flippening" - an event horizon when the majority of content you choose to consume is primarily AI generated/augmented rather than primarily human/manually produced - has now gone from unthinkable to possible.

For full show notes, go to: https://latent.space/p/wondercraft

Timestamps

* [00:03:15] What is Wondercraft?

* [00:08:22] Features of Wondercraft

* [00:10:42] Types of Podcasts

* [00:11:44] The Importance of Consistency

* [00:14:01] Wondercraft House Podcasts

* [00:19:27] Video Translation and Dubbing

* [00:21:49] Building Wondercraft in 1 Day

* [00:24:25] What is your moat?

* [00:30:37] Audio Generation stack

* [00:32:12] How Important is it to Sound Human? and AI Uncanny Valley

* [00:36:02] AI Watermarking

* [00:36:32] The Text to Speech Industry

* [00:41:19] Voice Synthesis Research

* [00:45:53] AI Podcaster interviews Human Podcaster

* [00:50:38] Takeaway



Get full access to Latent Space at www.latent.space/subscribe
2023-09-20
Länk till avsnitt

Doing it the Hard Way: Making the AI engine and language - of the future - with Chris Lattner of Modular

Want to help define the AI Engineer stack? Have opinions on the top tools, communities and builders? We're collaborating with friends at Amplify to launch the first State of AI Engineering survey! Please fill it out (and tell your friends)!

If AI is so important, why is its software so bad?

This was the motivating question for Chris Lattner as he reconnected with his product counterpart on Tensorflow, Tim Davis, and started working on a modular solution to the problem of sprawling, monolithic, fragmented platforms in AI development. They announced a $30m seed in 2022 and, following their successful double launch of Modular/Mojo🔥 in May, have just announced their $100m Series A.

While the performance claims of Mojo🔥 and its promise as a fully multithreaded compiled Python superset stole the show, we were amazed to learn that it is a side project - and the vision for Modular's Python inference engine is at least as big.

Listeners will recall that we last talked with George Hotz about his work on tinygrad and how he wants to replace PyTorch with something faster and lighter, handwriting a "reduced instruction set" of operators himself. But what if the problem could be solved at an even lower level - with the Python engine/runtime itself?

Chris on Compilers

Chris' history with compilers is well known - creating LLVM during his PhD (for which he won the 2012 ACM Software System Award), hired straight into Apple where he also made Clang and Swift (the iPhone programming language that replaced Objective-C), then leading the Tensorflow Infrastructure team at Google where he built XLA, a just-in-time compiler for optimizing a lot of the algebra behind TF's workloads, and MLIR, a modular compiler framework that sat above LLVM to optimize ML graphs and kernels that were hard to represent in the LLVM IR.

So as pretty much the best compiler engineer in human history, you'd justifiably assume that Chris is simply choosing to take his compiler approach to Python. And yet that is not how he thinks about compilers at all.

As he says in our chat,

"How do you enable invention? How do you get more kinds of people that understand different parts of this problem to actually collaborate? And so this is where I see our work on Mojo and on the engine..."

"I don't have a compiler hammer that I'm running around looking for compiler problems to hit."

Today a small number of people at companies like OpenAI spend a lot of time manually writing CUDA kernels. But an optimizing compiler for AI leads to compilers as a means to an end for increasing software collaboration, expanding the ability of people with different skillsets and knowledge.

"What is the fundamental purpose of a compiler? Well, it's to make it so that you don't have to know as much about the hardware. You could write everything in very low-level assembly code for every single problem that you have... But what a compiler really does is it allows you to express things at a higher level of abstraction."

For Chris, compilers are also ways to properly automate generalized optimizations that might otherwise be manually coded and brittle abstractions, like operator fusion:

"So NVIDIA goes and they build this really cool library called FasterTransformer. The performance point of using it is massive. So a lot of LLM companies and other folks use this thing because they want the performance.

"Here's the problem. If you want to go innovate in transformers, now you're constrained by what FasterTransformer can do, right?

And so, again, you come back to where are compilers useful?

They're useful for generalization. If you can get the same quality result or better than FasterTransformer, but with a generalized architecture, well now you can get the best of both worlds, where you have orthogonality and composability, you enable research, you also get better performance."

Done correctly, these operator optimizations being implemented at the compiler level amount to an "AI Engine" that can not only survive, but enable major architecture shifts should a credible alternative LLM architecture come along someday.

Modular ? the Unified AI Engine

Modular's original goal was to build the "Unified AI Engine" to speed up AI development and inference - one that doesn't assume an "AI = GPUs" world that only benefits the "GPU-rich", but one that treats AI as "a large-scale, heterogeneous, parallel compute problem".

Modular itself is an engine (separate from Mojo, which we cover below) that can run all other frameworks between 10% to 650% faster on CPUs (with GPU support coming in the fall):

At Google, Chris' job wasn't to build the best possible compiler for AI. The goal was to build the best compiler for TPUs, so that all TensorFlow users would have a great Google Cloud experience. Similarly, the PyTorch team at Meta isn't trying to make AI faster for the world, but mostly for their recommendations and ads systems. Chris and Tim realized that the AI engine and developer experience isn't a product prioritized by any of the big tech companies (they tried) - so they see Modular as the best way to deliver the AI development platform of the future.

The modularity of Modular shines through in the hot-swapping Inference Engine demo, which has to be seen to be believed.

Mojo 🔥 - Blazing Fast Python

The other piece of Modular is Mojo, a new programming language for AI that is a superset of Python. In some sense it is "the ultimate yak shave": We were shocked to learn that Chris and the team didn't initially set out to create Mojo, but it started life as an internal DSL to make themselves more productive.

Mojo adopted Python's syntax since it's by far the most used language in machine learning and AI. It also lets them support all existing PyPi packages, requiring no code changes for developers to go from Python to Mojo. Mojo comes with a lot of different underlying design choices that lead to much better performance:

* It's compiled rather than interpreted like Python

* No GIL which allows for multi-threading

* Better heap representation

* Leverages MLIR

In the perfect test scenario that leverages all of these improvements, Mojo is up to ~68,000x faster than Python 🔥 (fire emoji is a valid file extension for Mojo files, btw!).

Of course, that is just one microbenchmark, but as Jeremy Howard explains, most Python codebases should run between 10-100x faster simply by moving to Mojo with very minor adjustments.

A community member port of Llama2 from Python to Mojo shows it inferencing >100x faster than Python, and 20% faster than the handcoded raw C implementation.

The Modular team is embarking on one of the hardest technical challenges we've seen a startup tackle, and we can't wait to see what comes out of it. We had an amazing conversation with Chris diving into all the details, which we hope you enjoy!

Show Notes

* Modular AI

* Chris' personal website

* Scott Forstall

* Bret Victor's Playgrounds

* Karpathy's Tweets

* Speculative Execution

* Llama memory constraints

* LLVM

* Clang

* Swift

* TensorFlow

* PyTorch

* XLA

* MLIR

* TPUs

* Guido van Rossum

Timestamps

* [00:00:00] Introduction

* [00:00:40] Chris's background - LLVM, Clang, Swift

* [00:03:01] Chris's experience with Google TPUs and XLA

* [00:05:47] The limitations of current frameworks like TensorFlow and PyTorch

* [00:08:03] The benefits of using compilers for AI systems

* [00:13:14] Enabling more collaboration between researchers through better systems

* [00:20:55] Starting with CPU optimization instead of just GPUs

* [00:24:36] Design principles and goals behind Modular

* [00:32:41] The benefits of starting from a general compiler architecture

* [00:35:13] Origins of deciding to create the Mojo language

* [00:44:43] Goals for Mojo to become a true Python superset

* [00:48:12] Thoughts on tinygrad

* [00:52:00] ggml, quantization, etc

* [00:57:00] Speculative execution and other gains from making Mojo more parallel

* [01:01:50] Future of Mojo's toolkit

* [01:07:00] Why Modular is a company and not a foundation

* [01:11:00] Learnings as a first time founder and engineering leader

* [01:25:00] Lightning Round

Transcript

Alessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai. [00:00:19]

Swyx: Hey, and today we have Chris Lattner in the house. Welcome, Chris. [00:00:21]

Chris: Hi both. Thanks for having me. [00:00:24]

Swyx: We're so excited to have you. We have so many questions and we'll try to get through as many as we can. You're one of the easiest people to research I've ever had on the pod, because you document yourself extensively on https://nondot.org/sabre/. What's the story behind that, just quickly? [00:00:40]

Chris: I mean, I've had that website for, since, I don't know, the mid-90s. So it's been a very, very, very long time, and I originally had a big personal page. Again, this was the mid-90s with all the scroll tags and all that kind of stuff. Yeah, exactly. [00:00:56]

Swyx: The animated gifs. "Under construction." [00:00:57]

Chris: Yeah. It has been rebooted a few times, and web design is not my strong point, but the server was originally named after some fish we had. That was the origin of non-dot. [00:01:08]

Swyx: I love it. I looked on Tanya's page and she has some spaniels. [00:01:12]

Chris: Yep. We're dog people. We love many animals. [00:01:15]

Swyx: So your quick bio, you did your PhD in CS in 2005, and then immediately went into Apple working on LLVM, the compiler framework that you created during your PhD. In our prep, you also maybe had a favorite Scott Forstall story. [00:01:32]

Chris: Well, so I got to work with a lot of really interesting people at Apple. Scott was actually pretty famous. Scott is responsible for many things across the years, but he really drove the iPhone. At least the iPhone software, specifically. And so Scott was super interesting because he was kind of a high-maintenance person. He was very difficult to work with. He did not mind making other people wait for him. So there'd be all these exec reviews of Scott where the entire room is full of people. He's sitting across the hallway in his office for a half hour making people wait for him. And so when Scott was at Apple, I wasn't his biggest fan, I'll admit, but I actually have a lot of respect for a lot of the things he did. He drove a lot of the early iPhone stuff. He made the bet on Siri and a bunch of other stuff that he did. And so he's a very impressive person. I guess he's out of tech these days, but yeah, so many fascinating. [00:02:25]

Swyx: My favorite story was the keyboards and how they basically had to invent predictive typing or it wouldn't work. [00:02:31]

Chris: Yep. It's all software. So much of that, it feels obvious now because it's been developed for years and years and years, but it was like pure research and nobody knew if you could get all of that software to fit on such a constrained device for 1.0. So it's just an amazing time. [00:02:45]

Swyx: Incredible. So I'll fill out the bio a little bit. You started working on Clang while at Apple, I think, as a front-end for C and Objective-C. You created Swift as well in 2010. And then in 2012, won the ACM Software System Award for LLVM, which I think is a crowning accomplishment for a lot of things. [00:03:01]

Chris: I love to build things. [00:03:03]

Swyx: You were VP of Autopilot at Tesla and then Senior Director and Distinguished Engineer at Google for TensorFlow. And then most recently, President of Product Engineering at RISC-V, or at SiFive, which builds RISC-V. [00:03:15]

Chris: They're the inventors and they drive so much of RISC-V. It's a really fancy new instruction set for a lot of computing needs and has led to a lot of AI chips and so much that exists out there. So it was a lot of fun. And so that was actually driving and building hardware. And so most of my career I spent on the software side of it. And so it was a lot of fun to be able to see the other side of how hardware comes together, how you design it, how you think about it, what are the trade-offs in that entire space. And so for a lot of years, I've been just on that hardware-software boundary. [00:03:48]

Swyx: That's a lot of what we're going to talk about today with Modular Mojo. Well, so that's the brief history and you started Modular in 2022, about 20 months ago. What's one other thing on the personal side that people should know about you that people don't see on the LinkedIn because you're all into hardware-software boundaries and stuff? [00:04:05]

Chris: I have kids, I like to do woodworking, I like to walk. And so often, I like to go walking with people and do walking one-on-ones and things like that. [00:04:15]

Alessio: What's the latest woodworking project you've worked on? [00:04:18]

Chris: Oh, I mean, I just built a Lego robotics table for my kids, so helping out with the school. And so, yeah, not the most fancy furniture, but I've also built furniture and many other things for the house. [00:04:29]

Alessio: So I think the easiest thing for people to grasp so far has been Mojo, which is a superset of Python. And I think everybody talks about that because it's easier to grasp, but Modular's goal is to build a unified AI engine. And when I see unified, it implies things are not unified today, there's a lot of fragmentation, a lot of complexity. So let's start from the origin. What are some of the problems that you saw in the AI research and development space that you thought needed to be solved? [00:04:58]

Chris: Yeah, great question. So if you go back just a few years ago to 2015, 2016, 2017 timeframe, AI was really taking off. It wasn't to the point where it is now, where it's obvious to everybody, but for those of us who were following, amazing things were happening. And that era of technology was powered by TensorFlow and powered by PyTorch, right? And PyTorch came a little bit later, but they're both kind of similar designs in some ways. The challenge there is that the people building these systems were driven by the AI and the research and the differential equations and the auto diff and all these parts of the problem. They weren't looking to solve the software-hardware boundary problem. And so what they did is they said, okay, well, what do we need to build? We need a way for people to set up layers. So we need something like Keras or NNModule or something like that. Well, underneath the covers are these things called operators. And so you get things like convolutions and matrix multiplications and reductions and element-wise ops and all these different things. Well, how are we going to implement those? Let's go take CUDA and let's go take the Intel math libraries, Intel MKL, and let's build on top of those. Now, that works really well, but the challenge with that is that whenever you come out with a new piece of hardware, even if it's just a new variant of an Intel CPU, you have initially a small number of these operators. But today TensorFlow and PyTorch have thousands of operators. And so what ends up happening is each of these things gets what's called a kernel. Each of these kernels ends up being written generally by humans manually. And so if you bring up a new piece of hardware, you have to then re-implement thousands of kernels. This makes it very difficult for people to enter the hardware space. The other side of it though is research, right? So if you're a researcher, very few people know how these kernels work, right? [00:06:41]

This is coming in vogue. You hear about people writing CUDA kernels, for example. And I mean, the people who do this are amazing and I love them, but there's very few of them and the skill sets required to do that are just very different than innovating in model architecture, right? And so one of the challenges that we've seen with a lot of these AI systems has been the scalability problem of I can't find experts who can go write these kernels. Now, when I got involved with work at Google, we were working on Google TPUs. Google TPUs are one of the most successful at-scale training accelerators that exist. And one of the challenges that we face as a team is this challenge of saying, how do we bring up a novel piece of hardware given you have thousands of these different things? And really the goal at Google initially was catalyze and enable a ton of research. Now, one of the things that was done before I got there and that was novel and it attracted me there is people said, hey, let's use compilers for this. So instead of handwriting thousands of kernels and rewriting all of these operators and trying to do what Intel or what NVIDIA had done, they said, let's take a different approach. And compilers can be way more scalable than humans because compilers can allow you to mix kernels in different ways. And there's a number of these optimizations that are really important that you've talked about before, including kernel fusion, which can massively reduce memory traffic and things like this, and these other reassociations and optimizations that you want to be able to do. [00:08:03]

Chris: And a compiler can do that in a very general way. Whereas if you're doing it with traditional handwritten kernels, what you get is you get a fixed permutation of the ones that people thought were interesting. And so the things that worked are the things that have already been important, not the things that researchers want to do next. And a lot of research is doing new things, right? And so the investment in compilers led to this thing called XLA, which is part of the Google stack. Really great, enabled massive exaflop scale computers, tons of amazing work was done with that. But there was another problem, right? The big problem was that, okay, well, it was brought up to enable one piece of hardware, in that case, Google TPUs. And it turns out building compilers is hard. And there's a different scalability problem, where before it was hard to hire lots of humans to write lots of kernels. Now you have to hire compiler engineers. And there are even fewer compiler engineers that know machine learning and know all this stuff. And so what actually happened there is that there's a bunch of technical innovation and a lot of good things that came out of it. But one of the challenges was something like XLA is it's not extensible. And so you can technically extend it if you're at Google and you work on TPUs and you have access to the hardware, right? But if you're not, then it becomes a real challenge. And so one of the things I love about the NVIDIA platform in particular is that if you look at CUDA, like many people get grumpy about CUDA for various reasons, but you go all the way back to when AI took off, like deep learning took off with the AlexNet moment, for example, right? So many people will credit the AlexNet moment as being a combination of two things. They say it's data, ImageNet, and compute, the power of the GPUs coming together. And that's what allowed the AlexNet moment to happen. But the thing they often forget is that the third part was programmability, because CUDA enabled researchers to go invent convolution kernels that did not exist, right? There was no TensorFlow back then. There was none of the stuff that existed. And so it's actually this triumvirate between data compute and programmability that enabled a novel kind of research to kick off this invention that became the entire wave of deep learning systems, right? And so to me, learning from many of these things, you have to learn from history, coming to modular saying, okay, well, how do we take the next step? How do we get to the next epoch in terms of this technology where we can get the benefits of humans who have amazing algorithmic innovation and ideas and sparsity and like all the things that are kind of on the edges of the research that could become relevant? How do we get the benefit of compilers? And so compilers do have amazing scale and generality to new kinds of problems. And then how do we get the benefit of programmability and mix all these things together? That set of insights is what led to modular and what we're doing with the AI engine. [00:10:44]
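
To make the kernel fusion point concrete, here is a toy illustration (not Modular's or XLA's actual machinery): the unfused version runs three separate elementwise "kernels", each materializing a full temporary array, while the fused version computes the same result in one pass.

```python
# Unfused vs fused elementwise compute: same math, very different memory traffic.
import numpy as np

x = np.random.rand(1000).astype(np.float32)
a, b = 2.0, 1.0

# Unfused: three ops, each reading and writing a full-size array.
t1 = a * x                      # temporary #1
t2 = t1 + b                     # temporary #2
y_unfused = np.maximum(t2, 0.0)

# Fused: one loop, no materialized intermediates. A fusing compiler emits
# something like this automatically from the three-op graph above.
y_fused = np.empty_like(x)
for i in range(len(x)):
    y_fused[i] = max(a * x[i] + b, 0.0)

assert np.allclose(y_unfused, y_fused)
```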

Alessio: I think in one of your previous podcasts, you mentioned leaving people behind, you know, that are not experts in certain things and they can't contribute. CUDA is great. And we had Tri Dao, who created FlashAttention, on the podcast. And when the new CUTLASS version came out, he made FlashAttention 2, because CUTLASS was so much better. And like, he didn't have to worry about that. He could focus on it. How do you see the future of AI development in kind of a post-Modular world? You know, do you think there's going to be a lot more collaboration at different levels of teams coming together? Or is one of your goals allowing people that are not compiler experts to not even think about it and assume they already got the best? [00:11:22]

Chris: Yeah, well, so I mean, my general belief is that humans are amazing, but we can't always fit everything in our head, right? And so you have different kinds of specialities, different kinds of people. And so if you can get them to work together, you can get something that's bigger than any one of them, right? I have certain skill sets, but I barely remember differential equations, right? And so it turns out that I'm not going to be inventing the next great model architecture, [00:11:45]

Swyx: right? [00:11:45]

Chris: But I'm useful for some of the systems problems. And so if we can get these people working together and collaborating together and understanding how these things work, like new breakthroughs can happen. And so Tri's interview with you, I think is a great example of that, right? He explained how, you know, he was working on different parts of the stack. He got interested in the systems. And he's in a research group with Chris Ré, right? They have applications people that they work with, right? And so it really does, in my opinion, come back to like, how do you enable this flywheel? How do you enable invention? How do you get more kinds of people that understand different parts of this problem to actually collaborate? And so this is where I think that, you know, you see our work on Mojo and on the engine and things like this, what we're doing is we're really trying to drive out the complexity of this problem because so many of these systems that have been built up, you know, they're just aggregated together, right? It's like, here's a useful thing that enables me to solve the problem I want. And it wasn't really designed top to bottom. And I think what the Modular world provides is a much simpler stack that's much more orthogonal, much more consistent, much more principled. And that enables us to reduce complexity all the way up the stack. Whereas if you're building on top of all this fragmented kind of mess of history, right? You just kind of have to cope with it. And a lot of the AI, particularly on the research systems, right? They have this happy path. And so if you do exactly the demo, the thing will work. But if you try changing anything just a little bit, everything falls apart and performance is awful or it doesn't work or whatever. And so that's an artifact of this fragmentation at the bottom. [00:13:14]

Swyx: So you kind of view compilers and languages as medium for which humans can collaborate or cross boundaries. [00:13:20]

Chris: I like compilers. I've been working on them for a long time, but work backwards from the problem, right? And if compilers are useful, or compiler technology is useful to solve the problem, then that's cool. Let's use it. I don't have a compiler hammer that I'm running around looking for compiler problems to hit. Yeah, exactly. And so here, you say, what good is a compiler? Like what is the fundamental purpose of a compiler? Well, it's to make it so that you don't have to know as much about the hardware. You could write everything in very low level assembly code for every single problem that you have. But what a compiler or a programming language or an AI framework really does is it allows you to express things at a higher level of abstraction. Yeah. Now that goal serves multiple purposes. One purpose is that you make it easier, right? The second is that, in my opinion, if you push a lot of complexity out of your head, you make room for new kinds of complexity. And so it's really about reduction of accidental complexity so that you can wrestle with the inherent complexity in the problem. Another is that by getting abstractions right, you enable, for example, one of the things that compilers are good at, particularly modern ones like we're building, is that the compilers have infinite attention to detail. Humans don't, right? And so it turns out that, you know, if you hand write a bunch of assembly and then you have a similar problem, well, you just take it and hack it a little bit without doing a first principles analysis of the best way to solve the problem, right? Well, a compiler can actually do a lot better than that because CPU cycles are basically free these days. [00:14:42]

Swyx: Yeah, exactly. [00:14:42]

Chris: And also higher levels of abstraction give you other powers. And one of the things I think is really exciting about deep learning systems and things like what Modular is building is that it has raised compute to this graph level. Once you have gotten things out of for loops and semicolons and, you know, out of the muck and into something that's more declarative, well, now you can do things where you transform the compute. This is something that I think that many people don't yet realize because it's kind of possible, but it's really such a pain with these existing systems is that, you know, a lot of the power of what this abstraction provides is the ability to do things like Pmap and Vmap, like where you're taking a computation and then transforming it. And one of the things I was very inspired by my time at Google is, you know, we started out with these very low level things and, you know, single node GPU machines and then clusters and then async programming, like all this very little stuff. And by the time I had left, we had had, you know, researchers in Jupyter Notebook training petaflop supercomputers. You just think about that. That is an enormous lift in terms of the tech. And that was made possible by a lot of very layered and well-architected systems, by a lot of, you know, novel HPC type hardware, by a lot of these breakthroughs that had happened. And so what I'd love to see is for that technology to get even more widely adopted, generalized and get out there and also kind of break down a lot of the complexity that got built up along the way. Beautiful. [00:16:09]
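
As a tiny concrete illustration of the "transform the compute" idea, and vmap in particular, here is a hedged sketch in JAX; the predict function and array shapes are made up for the example.

```python
# vmap: write the computation for one example, let the framework batch it.
import jax
import jax.numpy as jnp

def predict(w, x):
    # Logic for a single input vector x.
    return jnp.tanh(w @ x)

w = jnp.ones((4, 3))
batch_of_x = jnp.ones((8, 3))  # 8 examples

# Map predict over the leading axis of batch_of_x without rewriting it.
batched_predict = jax.vmap(predict, in_axes=(None, 0))
print(batched_predict(w, batch_of_x).shape)  # (8, 4)
```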

Swyx: You use very precise terms, AI engine, AI framework, AI compiler. And I think that means special things for you, especially within the modular context. Do you care to define them so we can have context for the rest of the conversation? Yeah, absolutely. [00:16:22]

Chris: That's a great point. When I think about framework, I'm usually talking about things like TensorFlow and PyTorch. These are things that, you know, most people building a model will use something like PyTorch to build it and train it and do things like that. Underneath that, you end up getting a whole bunch of ways to talk to the hardware. And often it's CUDA or Intel MKL or something like this. And so those things are the engine. And that interface of the hardware is generally what I think of when I talk about an engine. [00:16:48]

Swyx: Right. And modular is a new engine. Yes. [00:16:51]

Chris: And modular is providing a new engine that plugs into TensorFlow, PyTorch, and a whole bunch of other stuff. And then allows you to drive, manipulate, program the hardware in a new way. [00:16:59]

Swyx: Which I would recommend everyone check out the products launch demo where you swapped it out in real time and it just kept working. [00:17:06]

Chris: Yep, yep. [00:17:07]

Swyx: That was a big flex. [00:17:08]

Chris: So I believe in properly modular, properly layered, properly designed technology. And so if you get the abstractions right, you can do really cool things like this. [00:17:16]

Alessio: Let's start diving deeper. So as you mentioned, you said between the framework level and the hardware level. So when it first got announced, I went on the website and I was like, wow, I wonder how many petaflops they get on an A100. And then I open and it's all CPUs. So my question is, everybody's trying to make GPUs go brr. Why are you making CPUs go brr first? [00:17:40]

Chris: So this is the problem with doing first principles work. Is that you have to do all of the work from the beginning. And if you do it right, you shouldn't skip over important steps. What is an AI system today? Lots of people say, oh, it's a GPU. People are fighting over GPUs. They're always talking about, it's all about GPUs, right? AI, in my opinion, is actually a large-scale, heterogeneous, parallel compute problem. And so AI traditionally starts with data loading. GPUs don't load data, right? And so you have to do data loading, preprocessing, networking, a whole bunch of stuff. And then you do a lot of matrix multiplications. You do all the things that people usually talk about. But then you do post-processing and you send stuff out over a network or under disk, right? And so CPUs, it turns out, are necessary to drive the GPUs, right? And a lot of the systems, again, when you say, let's bring up software for the accelerator, what you end up doing is you say, okay, well, what can the accelerator do? It turns out it's a subset of the problem because they decided that the matrix multiplications or whatever they thought was important is the important part of the problem. So you then go build a system that does exactly what the chip will do. And you never have time to go solve the big problem. And so it's really funny when you look at something like a TensorFlow or like a PyTorch, so much of that host side compute problem, the CPU work, ends up being in Python, ends up being in these things like tf.data and stuff like this. Not programmable, not extensible, really slow in many cases, very difficult to distribute. And so there's a huge mess here. Also, if you look at CPUs, it turns out they are accelerators. So CPUs these days have tensor cores. They just get funny names like AMX instructions and things like this, right? And the reason for that is that it used to be that CPUs and GPUs were completely different things. What's happened over time is GPUs get more programmable and more like CPUs, and CPUs get more parallel. And so what's happening is we're getting a spectrum of this technology. And so when we started modular, we said, okay, well, let's look at this from a technology perspective. Hey, it makes sense to build a general thing because once you have a general thing, you can specialize. As I've seen with XLA and some of these other stacks, like it's very hard to start with the specialized thing and then generalize it. Also, it turns out that, you know, where's the spend in AI? Well, I mean, different people are spending different amounts of money, different things, but training scales the size of your research team, inference scales the size of your product and user base and everything else. And so a lot of inference these days is still done on CPU. So what we decided to do is we said, okay, well, let's start with CPU. Let's get improve the architecture. CPUs are also easier to work with and they don't stock out and they, you know, they're easier for a variety of other reasons. And let's prove that we can build a very general architecture that can scale across different families. And so what we showed is we showed, okay, we can do Intel, AMD, we can do this arm Graviton thing and showed a lot of support for, you know, all the different weird permutations of things within even an Intel CPU. There's all these different vector lengths and all this stuff going on and showing that we could beat the vendor software with much more general and flexible programming approaches. 
And then from there, yes, we're doing GPU. We'll have GPUs coming out soon. And then as you build up into that, right, what you get is the benefit of a well-considered, well-layered stack that has got all the right DNA in it. And so then you can scale into these different kinds of accelerators over time. [00:20:55]
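
To make the "AI is a heterogeneous compute problem, not just matmuls" point concrete, here is a minimal sketch in plain Python/NumPy of where the CPU-side work sits in a typical inference pipeline. The function names and the accelerator stand-in are illustrative assumptions, not Modular's API:

```python
import json
import numpy as np

def load_and_preprocess(path: str) -> np.ndarray:
    # CPU-side work: disk I/O, parsing, tokenization/featurization.
    with open(path) as f:
        records = [json.loads(line) for line in f]
    return np.array([r["features"] for r in records], dtype=np.float32)

def run_accelerator_matmuls(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    # The part people equate with "AI": dense matrix multiplications.
    # (Hypothetical stand-in; on a real system this dispatches to a GPU/accelerator.)
    return np.maximum(x @ w, 0.0)

def postprocess_and_ship(y: np.ndarray) -> bytes:
    # More CPU-side work: top-k selection, serialization, sending results out.
    top = np.argsort(-y, axis=-1)[:, :5]
    return json.dumps(top.tolist()).encode()

# The accelerator only touches the middle step; loading, preprocessing,
# and post-processing all run on the host CPU.
```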

Alessio: What are some of the challenges to actually build an engine? So I think the CPU point, people have that. That's why you see llama.cpp, you see some of this quantization where most people are thinking, let's take the model, quantize it, make it runnable on CPU and do that. You were like, no, I'm kind of more crazy than that. How about we redo the whole engine? How does that differ in terms of work? So the model work is very kind of weight-specific. Yours is more runtime- and compiler-specific. What does your team look like? And what are the challenges that you tackle to make an engine happen? [00:21:29]

Chris: In terms of the technology or? [00:21:31]

Swyx: Yeah. [00:21:31]

Alessio: Kind of like, how do you even start? Like when you started this company, kind of like some people said, I'm going to change the weights and quantize them. You were like, I'm going to change the engine. You know, what are some of the low hanging fruits, maybe some of the initial challenges that you're working on? [00:21:45]

Chris: Well, so, so I think a lot of what characterizes Modular is doing things the hard way to get a better outcome. [00:21:52]

Swyx: Right. [00:21:52]

Chris: So many of the people on our team, we've worked on all of the systems. So, you know, I worked on XLA and TensorFlow, the people that worked on PyTorch, TVM, the Intel OpenVINO stuff, like all of these weird things that have been created in the industry, ONNX Runtime, right? We have several really great people from there. And so many of these people have been working on these systems. And the challenge with them is that many of these systems were designed like five or eight years ago. [00:22:17]

Swyx: Right. [00:22:17]

Chris: And so AI was very different back then. There were no LLMs, right? I mean, it was a very different world. And so the challenge is that when you build a system, it starts out by being a pile of code and it gets bigger and bigger and bigger and bigger and bigger. And the farther along its evolution you get, the harder it is to make fundamental changes. And so what we did is we said, okay, let's start all the way at the beginning. Just like you're saying, yes, it's much harder. Again, I like to build things and I think our team likes to build things. And so you say, well, how does threading work? By the way, it's not often known, but TensorFlow, PyTorch, all these things still run the same thread pool that Caffe ran on. Widely known to be a huge problem, leads to massive performance problems, makes latency super unpredictable when you do inference. That was a very specific set of design choices to make the thread pool block and be, you know, synchronous. And like the entire architecture at the very bottom of the stack was wrong. And once you get that wrong, you can't go back. And so our thread pool assumes that no task can block. You have very lightweight threading, right? This goes directly into everything that gets built on top of it. You then go into things like, okay, well, how do you express kernels? Well, you still want to be able to handwrite kernels and we start by prototyping things in C++, but then you also get up into Mojo land. And so you build, you know, a very fancy auto-fusing compiler using all the best state-of-the-art techniques while also going beyond state-of-the-art because we know that users hate static shape limitations, lack of programmability. They don't want to be tied just to tensors, for example. And so a lot of LLMs have ragged tensors and things like that going on. Tabular data, you have like all these things. And so what you want to be building, and one of the benefits of architecting things from first principles, is that you can take all the pain that you've suffered and felt in other systems, that you've never had a chance to do anything about because of schedule, because of constraints of various kinds, and you can actually architect and build the right thing that can scale into that. And so that's the approach we took. And so a lot of it was very familiar work, but it's very hardcore design engineering and you really need to know the second and third order effects of each decision. And fortunately, a lot of the stuff isn't research anymore. It's pretty proven. [00:24:31]
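
A rough way to picture the "no task can block" idea is the difference between a fixed pool of blocking worker threads and lightweight tasks that yield instead of blocking. This is a minimal sketch using Python's asyncio purely as an illustration of the scheduling concept, not of Modular's actual runtime:

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_inference(req: int) -> int:
    # Old-style pool: a worker thread sits blocked while it "waits" (I/O, sync),
    # so latency depends on how many of the few threads happen to be free.
    time.sleep(0.01)
    return req * 2

async def nonblocking_inference(req: int) -> int:
    # Lightweight task: it yields instead of blocking, so thousands of these
    # can be in flight on a handful of OS threads.
    await asyncio.sleep(0.01)
    return req * 2

async def main() -> None:
    # Blocking pool: throughput is capped by the 4 worker threads.
    with ThreadPoolExecutor(max_workers=4) as pool:
        start = time.perf_counter()
        list(pool.map(blocking_inference, range(100)))
        print("blocking pool:   ", time.perf_counter() - start)

    # Non-blocking tasks: all 100 "requests" overlap their waits.
    start = time.perf_counter()
    await asyncio.gather(*(nonblocking_inference(i) for i in range(100)))
    print("non-blocking tasks:", time.perf_counter() - start)

asyncio.run(main())
```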

Swyx: So you mentioned some design goals that you have in first principles. Do you have a list? [00:24:36]

Chris: In what sense? [00:24:40]

Swyx: Off the top of your head. Like, I think it's very useful when designing systems to have that list of principles. And I think you very much think of yourself as a first principles thinker, but I think your principles differ from most. And you've gained this insight over just studying a lot of AI work over the years. What are they? [00:24:55]

Chris: I don't know that I have one set of principles that I, you know, it's like one club that I go around and beat things with. But a lot of what we're trying to do is we're trying to unlock the latent potential of a lot of hardware and do so in a way that's super accessible. And so a lot of our starting condition was not, like, enable a new thing. It's much more about driving out the complexity that people are struggling with to do the thing. And so it's not research. It's about design and engineering. Now, when you look at this, we're also driving from, okay, let's enable the maximum power of any given piece of hardware. So if you talk to an LLM company and they just spent $200 million on GPUs and their A100 GPUs of a specific memory size or whatever, right? They want to get everything possible out of that chip and they don't want a lowest common denominator solution. Right? And so you want, on the one hand, full power. You want to go all the way down to the metal and be able to unlock these things. And some of these researchers, like Tri and others, I mean, they're freaking amazing. [00:25:57]

Swyx: Right? [00:25:57]

Chris: But on the other hand, a lot of other people want more portability, generality, abstraction. [00:26:03]

Swyx: Right? [00:26:03]

Chris: And so the challenge becomes how do you enable and how do you design a system where you get abstraction by default without like giving up the full power? And again, a lot of the compiler systems that have been, you know, compiler for ML type things have really given up full power because they're just trying to cover one specific point in the space. And so owning that and designing for that, I think is really important to what we're [00:26:25]

Swyx: doing. [00:26:25]

Chris: And other pieces, just sympathy for users, because a lot of people that get obsessed about the tech forget about the fact that the people that will be using it will be very different than the people that are building it. That aspect is actually really important when you're building developer tools: fundamentally, you have to understand that the developers that are using them don't want to know about the [00:26:44]

Swyx: tech. [00:26:44]

Chris: One of the things that's super funny about working on compilers is nobody wants to know about a compiler. You're building a Mojo app or you're building a C app or whatever, right? You just want the compiler to get out of your way or tell you if you did something wrong, right? You only think about the compiler if it's too slow or it's, you know, broken in some way or something. And so AI tech should be the same way, right? I mean, how much of building and deploying a model is fighting with the tools? You get some crazy Python stack trace out of some tool because it only covered the special cases and now you're off that happy path, right? And so that compassion for users is something that, largely because AI infrastructure is so immature, has never really been part of the ethos of the people building tools. [00:27:22]

Swyx: You chose things like, you know, your thread pool has everything non-blocking. The sum of your first principles has led the Modular inference engine to be two to three times faster than PyTorch and TensorFlow, right? [00:27:33]

Chris: Oh, I was trying to look at it. [00:27:34]

Swyx: I'll show a decomposition of performance. Okay, well, yeah. So you can talk about that too. [00:27:38]

Chris: So one of the really funny things that, if you get it wrong, is very difficult to fix is asynchrony. And so when you think about, I have a CPU and I have a GPU and they talk to each other, most people think about it in terms of CPUs doing some stuff that throws a CUDA kernel across the fence, GPUs go brr, right? And then when there's results, you know, you read it back, right? But that's actually a really inefficient way to run a computer. What you actually want is to think about it as two different computers that are both executing and sending messages back and forth to each other. So I built hardware, right? If you go all the way down to the gates, when you look at these computers, whether they're the tiled Cerebras wafer thing or whatever, hardware is implicitly parallel. All of these things are always running all the time and they're communicating with each other. And so starting from an asynchronous programming model means that you can get accelerators that send messages to each other because that's the natural form of the hardware. When you get into CPUs, you have, you know, 88-core CPUs or hundred-core CPUs these days; even if you have four, right, what they really are is four completely independent computers. And so, yeah, they send cache lines across the fabric at each other, right? But they're async, right? And so much of the programming model that people start with is always sync. And so when you build into the stuff, you say, okay, well, that's a huge problem. The consequence of getting this right is that now you get overlapping work and it comes for free, right? And again, simplicity, the right architecture leads to the thing just magically happening. One of the great projects we did at Google back in the day involved some of this stuff and it led to a 2x improvement in ads throughput. Ads is a very tuned workload, right? And getting TPUs and CPUs to work at the same time and overlap that compute was a huge deal. And the fact that it just falls out of an async architecture is quite important. And again, you look at this at all levels of granularity, networking is asynchronous. So as soon as you distribute a compute problem across a network, async is there, right? And so all of these systems are kind of designed in the wrong way. You go up a level of the stack. So you have these operators, right? Super interesting how this whole ecosystem evolved. If you dig into something like TensorFlow or PyTorch, right? You know, you get to the point where you have a matrix multiplication. And so like you've talked about before on your podcast, kernel fusion is really important. And the way people did that historically is they say, okay, well, I have a matrix multiplication and oh gosh, it's often followed by a ReLU. Well, I'll make a MatMul ReLU fused kernel, right? Cool, and that's a huge performance improvement because ReLU is just a max operation and you avoid tons of memory traffic, all good stuff, right? But you run into these scalability problems because now you get things like a fused attention layer. So what is the consequence of saying, I'm going to manually tune the things that are important for MLPerf or something, right? Well, what ends up happening is, again, you get these happy paths and they work way better than the default path. And so if you look within the NVIDIA world, for example, there's a ton of focus on transformers. And so NVIDIA goes and they build this really cool library called FasterTransformer.
The performance benefit of using it is massive. Like it's a big deal. And so a lot of LLM companies and other folks use this thing because they want the performance. Performance turns into cost and throughput and all good things. Here's the problem. If you want to go innovate in transformers, now you're constrained by what FasterTransformer can do, right? And so, again, you come back to where compilers are useful. They're useful for generalization. And so if you can get the same quality result or better than FasterTransformer, but with a generalized architecture, well, now you can get the best of both worlds where you have orthogonality and composability, you enable research, you also get better performance. One of the things that you ask, like, how can we beat state of the art? Well, it's because it turns out compilers have more attention span. And it turns out that what's happened, even within like the NVIDIA product line, or even within the Intel product line, or even within one vendor's line of technologies, is that they have to build these little compilers because there's so much variation across the product family. If you look at an Intel product family, for example, they're building software that has to run on many different versions of this architecture. And they come out and they add a cool new dot product instruction, or they add bfloat16 support, or they add whatever. And so what's been happening in the industry is that each of these companies have been building their own little compilers. And so their own little compilers are, again, focused on one part of the problem domain. They have all these issues. They don't scale very well. And so you get, again, another fragmented part of the space where something will work really well, usually for a benchmark, right? But then it doesn't work well when people try to do new things. And so kernel fusion turns out to be one of those things. The programmability side, right? I mean, you just keep working your way up the stack. Matrix multiplication is really important today, but what about the thing that hasn't been invented yet? I mean, we have folks that are using our stuff that care about computational fluid dynamics, right? And things like this, where it's really more of HPC, linear algebra, like more general than deep learning, right? And they want to use the same technology because all this technology is general purpose. And so it's about enabling people to express their problem domains, and often they're experts in fluid dynamics, which I know nothing about, by the way. [00:32:41]
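
On the MatMul + ReLU fusion point earlier in that answer, here is a minimal NumPy sketch of what fusion buys you. The "fused" version simply avoids materializing the whole intermediate before the elementwise op, which is the memory-traffic saving being described; a real compiler would generate this as a single loop nest in the kernel rather than NumPy calls:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1024, 1024), dtype=np.float32)
w = rng.standard_normal((1024, 1024), dtype=np.float32)

def matmul_then_relu(x, w):
    # Unfused: the full matmul result is written to memory,
    # then read back just to apply an elementwise max.
    tmp = x @ w
    return np.maximum(tmp, 0.0)

def fused_matmul_relu(x, w, tile=256):
    # "Fused" sketch: apply the ReLU tile by tile while the tile is still hot,
    # so the intermediate never round-trips through memory as a whole.
    out = np.empty((x.shape[0], w.shape[1]), dtype=np.float32)
    for i in range(0, x.shape[0], tile):
        out[i:i + tile] = np.maximum(x[i:i + tile] @ w, 0.0)
    return out

assert np.allclose(matmul_then_relu(x, w), fused_matmul_relu(x, w), rtol=1e-4)
```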

Swyx: I mean, diffusion is another one, a relatively recent new technique. Yeah, right. [00:32:47]

Chris: And so like enabling people to innovate in this way without having to know all that thread pool stuff, right? You know, they don't want to know about a thread pool. And so enabling people to be able to focus on the part of the stack they care about and have it compose in is super important. Again, many systems have been built that tackle individual pieces of these problems. They end up usually having very specific constraints and limitations and problems. And so what we're doing is we're saying, okay, let's do the hard thing. Go all the way back. Let's actually build things in the right way and layer them up and do so in a way that composes correctly. And then what that means is you're driving away all that complexity that comes from, you know, the blocks not plugging together. [00:33:25]

Alessio: Yeah, even at the hardware level, I'm sure that the Cerebras of the world are really happy that you're building this because now they can offer bindings. And I think that's one of the main complaints from developers: these chips sound great, but like, how do I use them, you know? [00:33:43]

Chris: Well, and that's one. So we're still early in our journey, but I care a lot about hardware and we have many friends in the space. The challenge again, so I worked on TPUs as one example, but certainly not the only one. The challenge, if you're building innovative hardware, is you have to build the entire stack from the very bottom to the top. And so if you talk about a Cerebras, right, they've built some amazing stuff, but they've had to build their own vertical software stack. And now it doesn't work the same at the top level as anything else. And so even if it's really good, right, it means that there's this huge barrier of entry for a developer to switch to their tech stack. Some of these things are better than others, let's just say, right? And so it turns out building stuff is really hard. And so a lot of what we're trying to do is, again, we're putting down bricks. Like we have to take steps in logical order. We have to build the technology in the right way. Like I insist that we do everything at a super high quality. But when you do that, what that means is that then you can have a thing that you can plug into. And no, we can't turn a cell phone into a data center supercomputer, right? But if you want to quantize your model, you shouldn't have to use different tools for a cell phone than you use for a supercomputer, right? It turns out the int8's the same. Yeah. [00:34:50]

Alessio: Let's keep working our way towards the 35,000 times faster number that is out there. So you kind of keep going up and then you get to the Python level. [00:35:00]

Swyx: Yep. [00:35:00]

Alessio: And you're building Mojo, which is a Python superset. I'm also sure you didn't wake up one day [00:35:06]

Swyx: and you were like, [00:35:06]

Alessio: yeah, that sounds like a fun thing to do, creating a Python superset. Yep. What are some of the limitations that you saw there? [00:35:13]

Chris: Yeah, well, so I'll tell you where it came from. Because when we started Modular, we had no intention of building a programming language. So this is the, again, it's not looking for reasons to invent a language. But if we have to invent one to solve a problem, then cool, let's do it. So what we did was we said, okay, let's start, again, thread pools and other very basic stuff. How do we integrate with existing TensorFlow and PyTorch systems? Turns out that's technically very complicated and very yucky. But then you get into the more, okay, let's get the hardware to go brrr, right? And so then what we decided to do is we invented a whole bunch of very nerdy, very low-level compiler tech. And so our compiler, yeah, it does autofusion and stuff like this, but it's designed for cloud-first compute. Because there's more than one computer in the world, right? [00:36:00]

Swyx: And things like this. [00:36:00]

Chris: And so caching, distribution, like all these things get built into the compiler. You want to use things like auto tuning, [00:36:06]

Swyx: right? [00:36:06]

Chris: Because of all the complexity in the hardware. Humans are great at algorithms, but attention span is not always their thing. And so there's these requirements that came out of this. And so what we did is we built this pure compiler technology and validated it to show that we could generate kernels with very high performance. We got to the point where we were building all that and we were writing this very low-level MLIR stuff by hand. We were happy enough with it at the time, but our team hated writing the stuff by hand. And so we needed syntax and said, okay, well, this looks like a language. And so what choices do we have? We could either do a domain-specific embedded DSL-like thing, like Halide, or there's a whole bunch of these things [00:36:45]

Swyx: that are out there, [00:36:45]

Chris: or we could build a programming language. And so again, saying, let's do it the hard way because it gives you a better result. The problem with the Halide or like the OpenAI Triton thing, or like there's a whole bunch of stuff that's kind of in this category is that they have terrible debuggers. The tools around it are really weird. They demo really well, but often are best used by the people who built the tools themselves, things like this. What we decided to do is say, okay, well, let's go build a full programming language. I know how to do that, built Swift, learned a few lessons. I know both how to do it, but also what a big commitment it is to do that. And the consequence of that is you can do something that's much better. Now you have to go shopping for syntax, right? And so we'd built all this pure technology and we could do anything we wanted. Could use Swift, could use C++, [00:37:31]

Swyx: could use whatever, [00:37:31]

Chris: but obviously the entire ML community is around Python. And so we said, okay, well, let's go use Python. And then how are we going to do that? Again, you dive into these levels of decision-making and it's like, okay, well, there's a lot of things that are like Python, right? [00:37:45]

Swyx: But they're not, [00:37:45]

Chris: and they don't get adoption and they have huge problems and they fragment the community and all the things. And so I said, okay, well, let's actually do it the right way. Let's try to build something that, yes, will take time to get there, but in the end is a superset of Python. [00:37:57]

Swyx: Why? [00:37:57]

Chris: Well, Python syntax isn't actually the important thing. It's the community, the entire body of programmer muscle memory, right? Like all of these things are actually the important thing. And so building a thing that looks like Python but isn't was never a goal. Let's go actually build, and again, do the hard thing that leads to a better quality result that'll be better for the world, even if it takes a little bit longer to build. [00:38:21]

Swyx: I'm shocked. My jaw, like, dropped the entire time you were saying this because this sounds like it's just a massive yak shave to improve your tooling to make yourself more productive, [00:38:29]

Chris: which is crazy. [00:38:31]

Swyx: Like most people start out trying to do the language first, but you came at a great point. [00:38:36]

Chris: So we built it and we started on this path to make it so that our team would be more productive. And we say today, like, the most important Mojo developers are at Modular. And that's actually really important when you're building a language: use it yourself. This was a mistake we made with Swift: we built Swift to solve a "people don't like Objective-C syntax" problem. Roughly, but we did not have internal users before we launched it. Not significant ones, right? And so with Mojo, like, we're actually using it. And it's the thing that powers all the kernels in our engine. And so it actually needs to be production quality. But then you realize that shaving the yak just for ourselves is actually not worth it, right? [00:39:15]

Swyx: And we realized, okay, [00:39:15]

Chris: well, Mojo is actually useful to lots of other people. And so this is when we announced it. We said, okay, well, yeah, we'll make this a standalone thing because we think it's valuable and interesting to the rest of the world as well. And then, of course, we'll invest in it more because it's not just us and we can tolerate pain, but we want people to fall in love with good tools. [00:39:31]

Swyx: Yeah. And obviously you had a great stack already and good team, but like how long from realization that, oh, we need to start looking around for a language to something that looks like Mojo today? [00:39:41]

Chris: Yeah. So the lexer and the parser for Mojo started in October. [00:39:45]

Swyx: Wow. [00:39:45]

Chris: So it's less than a year old. [00:39:48]

Swyx: Yeah. [00:39:48]

Chris: This is also another thing is that I'm a very strange person in many ways, right? My ideas of what are hard problems are really different than other people, right? But Mojo is a much smaller language than Swift is. [00:40:00]

Swyx: Yeah. [00:40:00]

Chris: And even when it's done, it will be a much smaller language. And so compared to building Clang, which is a full C++ compiler or Swift, which is itself a very complicated, fancy system for a variety of reasons, right? This is actually a small project. Yeah. Yeah. [00:40:15]

Swyx: You still have to pick design choices from like Rust and whatever [00:40:18]

Chris: Yeah, well, absolutely. And so we will see what happens with Mojo over time. I would like a big chunk of our stack that is currently written in C++ to eventually move over. And so having a very good system programming language that scales is quite valuable and useful for lots of reasons. [00:40:32]

Swyx: One of the other things [00:40:33]

Chris: I'll share with you is that starting from CPU, starting from the general thing that you then specialize leads to these design points, for example, in Mojo, where you say, okay, well, if I care about high performance data loading, that needs to be super parallel. I care about disks being parallel and network being parallel and async and all this stuff that needs to be safe, right? And so with Swift, we built a memory-safe parallel programming abstraction called Actors. We've built all this stuff. And so being able to take the lessons learned from building [00:41:03]

Swyx: it the first time [00:41:03]

Chris: and driving it into a system the second time means that you can make something that's much better than the first time around when you were just figuring things out. So, but starting from generality is really important. [00:41:14]

Swyx: Every single language designer I've ever talked to has emphasized a playground and I was browsing your site and I realized that you had called the Xcode and Swift playgrounds a personal passion and you were inspired by Bret Victor. I guess, what have you learned about building a good playground? Because you just released Modular like a few days ago, sorry, Mojo a few days ago, and I was able to go in and play with it. What have you learned? And maybe what is underappreciated about a good playground? [00:41:38]

Chris: Yeah, well, so when we were building Swift, there's this big question about how do we do something better than what Objective-C had? Yeah, right. And so naturally it's like you've gone through all this work, [00:41:48]

Swyx: you're building this new thing, [00:41:48]

Chris: what can you do with it? When we first launched, we wanted to make something very visual. Apple's a very visual company, right? It likes user interfaces [00:41:56]

Swyx: and stuff like this. [00:41:56]

Chris: And it turns out that we as humans, many of us are very visual learners and thinkers. One of the things that playgrounds for iOS and for the Mac allow you to do is play with time. And so what happens is that there's a graphical view of a canvas, roughly, right? You then run your program and you have a ball bouncing or whatever the thing is that's happening. And now you can scrub through time because it can log and keep track of a bunch of state. And so this is one of the cool things about building systems and controlling them top to bottom: you can build these kinds of experiences. One of the fun projects I was able to work on at Apple is this thing called Swift Playgrounds. And so it's actually an iPad app. The entire purpose is to teach kids how to code, right? And so one of the cool things about that is that that led to this whole area of research, to me at least, around UI design: for Playgrounds, how do I do coding on an iPad without popping up a keyboard, right? And so, exactly, a very interesting technical problem, very different from compilers, it turns out, right? And so we spent a lot of time working on gestures for, like, you know, moving braces and blocks and refactoring code and doing all this stuff, making it so that it's super predictable what identifiers are in scope. And so you complete the identifiers instead of having to type them; instead of typing in numbers, you get a little spinner. [00:43:12]
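
A tiny sketch of the "scrub through time" idea: log a snapshot of program state at every step of the run, and the timeline slider just re-renders from the log. This is a hypothetical toy in Python, not the Playgrounds implementation:

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    t: float
    x: float
    y: float

def simulate_bouncing_ball(steps: int = 200, dt: float = 0.05) -> list[Snapshot]:
    x, y, vy = 0.0, 10.0, 0.0
    history = []
    for i in range(steps):
        vy -= 9.8 * dt          # gravity
        y += vy * dt
        x += 1.0 * dt
        if y < 0.0:             # bounce with some energy loss
            y, vy = 0.0, -vy * 0.8
        history.append(Snapshot(t=i * dt, x=x, y=y))  # log state every step
    return history

def scrub(history: list[Snapshot], fraction: float) -> Snapshot:
    # Dragging the timeline slider just indexes into the recorded history.
    return history[int(fraction * (len(history) - 1))]

history = simulate_bouncing_ball()
print(scrub(history, 0.5))  # the ball's state halfway through the run
```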

Swyx: That's not just for kids. [00:43:14]

Chris: And so it's super awesome. One of the things that came out of that is the current iPad keyboard allows you to swipe down on keys instead of going through modifiers. And so that came out of that project. And so there's a lot of the stuff where being able to build this stuff enables you to re-ask old questions. Yeah. [00:43:33]

Swyx: Oh, that's great. I love the scrubbing stuff. And Bret actually worked at Apple. It probably overlapped with you. I actually never met him. [00:43:39]

Chris: Yeah, so I'm sure it's a giant compound. Yeah, so coming back to Bret Victor, Bret did a whole bunch of research on user interface paradigms for kind of explaining how code works. And so he wrote up many different, I think it's Worry Dream or something, his blog, and he has a whole bunch of concept demos and things like this. And so it was super inspirational. And so a lot of what we were doing was saying, okay, well, can we get this actually out to people to actually use? And so that was a lot of fun. So Mojo doesn't have anything quite as cool like that yet. But we'll see. [00:44:13]

Swyx: There's a whole community [00:44:13]

Chris: of people building cool stuff. [00:44:15]

Swyx: And a lot of people are saying, [00:44:15]

Chris: oh, we should have UI libraries and stuff like this. And Mojo is not gonna build a UI library. But there's a lot of cool people on the internet that know how to do this well. And I'd love to see that. [00:44:25]

Alessio: Let's list some of the known things about Mojo that people like. It's compiled instead of interpreted. There's no global interpreter lock. The heap representation is different. It uses MLIR. What are maybe some of your favorite or most underrated things about Mojo that you haven't covered? Well, so I think that [00:44:43]

Chris: there's two ways of looking at Mojo. Most common way is it's like a Python plus plus. Again, I've been working on this stuff [00:44:49]

Swyx: for a long time. [00:44:49]

Chris: It's kind of been there before, right? And so if you look at Swift versus Objective-C, what Objective-C is, is this really interesting language that many people don't know anymore, where you have effectively Smalltalk, which has super dynamic objects, combined with C, right? And so the way Objective-C worked, and the first iPhone and Macs for years were all built with Objective-C, is that the high-level libraries are all built with this super dynamic, you know, you could inject methods and override things and hack the class hierarchy and all this stuff, completely dynamic object model, combined with C, which is really good at executing things efficiently, [00:45:25]

Swyx: right? [00:45:25]

Chris: And so one of the reasons that Objective-C scaled so well, for example, in the first iPhone, which was super CPU constrained, was that anytime performance was a problem, you could drop down to C. So in the case of Swift, what happened is we said, okay, well, we want to keep all the things that are good about Objective-C. So it has to be dynamic classes. You have to be able to do all this kind of stuff. We have to work with all of the Objective-C frameworks, but then we want to be able to make one thing that scales, so it's not two different worlds glued together. Python is the same thing as Objective-C, [00:45:53]

Swyx: right? [00:45:54]

Chris: But turned on its head, where instead of being objects and C, it's what people think of as Python, like a very high-level dynamic, flexible programming model, but then it's also glued onto C for the execution layer, right? And so you look at something like NumPy, which has a very nice Python layer, or even TensorFlow or PyTorch, very nice Python layer, but underneath the covers, it's all C and C++. And so a lot of what we're doing in Mojo is, you know, we learned a lot from Swift and things like this, but it's kind of conceptually similar, where what you're doing is you're saying, cool, it's not about whether dynamic is good or static is good. They're both good. They're good for different problems. So let's put them together in a consistent thing and allow you to reach for the right answer for a given problem instead of being religious about it, like dynamic typing is the right answer, right? Just say, like, cool, dynamic typing is great. We can see all the benefits. A lot of people love this and it's super productive and expressive. But if you want better performance, you can reach for static typing, right? And so a lot of, I think, what Mojo is, is it's progressive in terms of, like, get out of arguing about stupid things that don't matter. Just let people solve problems, right? And I think that is hopefully what people see in it. Now, I mean, we can dive into other things. So Mojo learns from Rust, for example. Rust is a wonderful community with a lot of cool stuff going on. It's kind of hard to learn. And so can we take the type system innovations like lifetimes and features like that, pull them forward into a thing and make them easier to learn? If so, then we get a lot of the benefits of the safety and the other things that Rust gives and performance and all the good things, [00:47:24]

Swyx: no garbage collector, [00:47:24]

Chris: all the stuff that people love about Rust, do so in a way that's a lot easier to learn, right? [00:47:28]

Swyx: And so it'll have a borrow checker. [00:47:30]

Chris: We do have a borrow checker. But one of the challenges with Rust is that, in my opinion, it's more cultural. I mean, there are definitely language design issues that antagonize it a little bit, but a lot of it is the culture, right? And so a lot of the culture of Rust is very much thou shalt borrow and expose references to everything. And the pervasive library model around Rust ends up being culturally very low level, but you could write much higher level libraries in Rust if you wanted to. And so what we're doing with Mojo is saying, okay, let's take the tech, let's fix some of the language issues [00:48:04]

Swyx: and things like that, [00:48:04]

Chris: but let's define a new culture. And so as we roll out new features and new enhancements into Mojo, you'll see more and more of that over time. [00:48:12]
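
Chris's earlier point about Python being a dynamic layer glued onto C (NumPy, TensorFlow, PyTorch) is easy to see directly. A small, hedged sketch timing the same dot product in pure Python versus the C layer underneath NumPy; exact numbers will vary by machine:

```python
import time
import numpy as np

n = 1_000_000
a = [float(i) for i in range(n)]
b = [float(i) for i in range(n)]
a_np, b_np = np.array(a), np.array(b)

start = time.perf_counter()
dot_py = sum(x * y for x, y in zip(a, b))      # the "nice Python layer"
py_time = time.perf_counter() - start

start = time.perf_counter()
dot_np = float(a_np @ b_np)                    # drops into C/BLAS under the covers
np_time = time.perf_counter() - start

print(f"pure Python: {py_time:.4f}s, NumPy: {np_time:.4f}s")
# The large gap is the "two worlds" being glued together: expressiveness in Python,
# performance in C. Mojo's pitch is one language that spans both.
assert abs(dot_py - dot_np) / dot_np < 1e-6
```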

Alessio: So one of the things that George Hotz talked about on the podcast is that XLA is like a CISC and tinygrad is a RISC. You built XLA, so... your response. Exactly, we got the other side of the thing. What are your thoughts on that and what are the right trade-offs to make? [00:48:29]

Chris: Yeah, so I contributed to XLA. I didn't write the whole thing, but yeah. And you worked on RISC. [00:48:34]

Swyx: Yeah. [00:48:35]

Chris: Also, I love George. He's a very interesting person. He's very enthusiastic, and that's really cool. It seems like he's learning his first compiler, though, because what he's doing is he's building what's widely known as a tensor contraction compiler. And so he's identified one sub, sub, sub, sub, sub [00:48:53]

Swyx: part of the problem, [00:48:53]

Chris: which turns out to be really important, which is how do you express the matrix multiplications and stuff like this. And he's learning how to build a compiler for that. He doesn't care about performance, as he talked about, and performance is not great. And so he has different sets of goals. But what he's doing is he's reductively turning AI into a matmul, something that a polyhedral compiler or something like that would tackle. And that's cool. Been there, done that. The problem with that is it doesn't scale. It turns out that there are a lot of things in AI that are not just matmuls. And so one of the challenges that I predict he'll run into is when you get out to those problems, now suddenly you'll have two systems. Simplest example, this is like the data layer will be completely different, right? And so there'll be this interface. What happens when there's this phase change between how the system works? Is it easy to use? Is it composed? What happens? [00:49:45]

Swyx: I don't know, right? [00:49:45]

Chris: So George is a super smart guy. We'll see what he comes up with. The other thing I'd say is that he's very focused on building and learning and doing things in an opinionated way that he likes. He's not being super user-centric and meeting people where they are and trying to get and lift people and do the things they're already doing, but do them better. And so it'll be interesting to see if he gets a community of people that are actually building things that are kind of beyond his circle. But he's a very smart guy. And I think that some of the stuff he's doing will be really cool. And I think it's also really interesting because he's showing the world, like the JAX people, that you don't need all of PyTorch to build a framework. [00:50:21]

Swyx: Right? [00:50:22]

Chris: And so that truth, I think, is maybe two-sided, because on the one hand, the tasteful subset of AI infra, however you want to look at that, is actually relatively small. But the complexity that you need to be able to integrate into a production system, deal with quantization, deal with all these things you actually need for really high performance, like really pushing the boundaries of what people are doing, that's where it gets hard. And so I have no way to predict where it'll go. But if you want to make a RISC versus CISC argument, well, it's RISC until you want to do new things. And what he's identified is a subset of the problem that you can model in a very, very nice, beautiful way, which is known, but there's a lot of the rest of the problem. And so if you compress it, you know, he talks about XLA having 150 ops; XLA could have a tenth of that if you just said it's element-wise with an enum, which is kind of what he does. And so that's not really the right question. The right question is what can you express? And can you express a big enough part of the problem for it to be useful? And so, I don't know, we'll see where it goes. [00:51:24]
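
The "element-wise with an enum" idea is roughly how tensor-contraction IRs in the tinygrad style shrink the op count. A hypothetical Python sketch of the shape of that representation; the names are illustrative, not tinygrad's actual API:

```python
from enum import Enum, auto
from dataclasses import dataclass

class ElementwiseOp(Enum):
    ADD = auto()
    MUL = auto()
    MAX = auto()      # ReLU is just MAX(x, 0)
    EXP = auto()
    RECIP = auto()    # division becomes MUL by a reciprocal

class ReduceOp(Enum):
    SUM = auto()
    MAX = auto()

@dataclass
class Node:
    op: Enum          # one enum member instead of one named op per kernel
    inputs: tuple

# Instead of ~150 named ops (MatMul, ReLU, Softmax, ...), graphs are built from a
# handful of primitives: e.g. softmax is EXP -> SUM -> RECIP -> MUL, and matmul is
# a broadcasted MUL followed by a SUM reduction.
exps = Node(ElementwiseOp.EXP, inputs=("logits",))
denom = Node(ReduceOp.SUM, inputs=(exps,))
softmax = Node(ElementwiseOp.MUL, inputs=(exps, Node(ElementwiseOp.RECIP, (denom,))))
```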

Swyx: That's fascinating. Some good advice in there, I think, from engineer to engineer. Yeah, well, so, I mean, [00:51:29]

Chris: but George's goal and my goal are very different. That's the important thing. It's like, George, he's building a thing to understand it. It's the best way, I mean, from what I understand; I haven't talked with George about this. And he wants locally run transformers. [00:51:43]

Swyx: Well, yeah, which is cool. [00:51:44]

Chris: And I want that too. We'll talk about that in a few months, but so we have similar technical goals [00:51:51]

Swyx: in some cases, right? [00:51:51]

Chris: But the way he's approaching the problem is build a thing to learn it, right? And so he's very happy to talk about how he'll like rip the whole thing up and throw it away. And that's super awesome. He's building it like a research project. Like we're building it in a very different way saying, okay, we know that PyTorch is yucky in various ways or TensorFlow's made some unfortunate design decisions, [00:52:11]

Swyx: right? [00:52:11]

Chris: It's not about beauty. It's about pragmatism. Because when we talk to people, we say, hey, who here wants to rewrite all your code? Generally, not very many people raise their hand and people are willing to in certain cases and there are certain profiles. But if you look at where the majority of the market and where the community is, it's much smaller. Interesting. [00:52:28]

Swyx: Well, you mentioned one of the operations that might be tricky is sort of the data layer. I don't know if I exactly understand what specifically is in the data layer, but I think memory constraints are something that people are talking about a lot. Recently, Georgi Gerganov of GGML was showing off just the sheer amount of stuff that he can do on a single MacBook. And the analysis from Andrej Karpathy was mostly that it's just because it's memory-constrained, not compute-constrained. So even though you have a lot less compute on a single machine on Apple Silicon, it doesn't actually matter because you're just ultimately optimizing for token output. What memory-specific optimizations on the Mojo design side would you call out as important design choices? [00:53:10]

Chris: Yeah, so I think that a lot of the on-device ML or on-device LLM work has really been around 4-bit quantization and 2-bit and 1-bit and things like this. You called them hacks, I think, on your... Okay. [00:53:22]

Swyx: I don't think it's hacks. [00:53:24]

Chris: I mean, I think it's funny, like if you want to nerd out about it, a float32 is a quantized representation of infinite-precision floating point numbers, right? You only have 32 bits to be able to represent all of numerics, right? That's a pretty flexible and useful hack, right, from that perspective. So I'm not here to tell you that there's one right way to run a neural network. I want to make it as easy as possible to be able to explore and research and try new things. And if it works well for you, great. The challenge I have with, like, the 4-bit numeric stuff and with quantization in general is that the way these things are implemented are hacks. And so often it is very hard-coded kernels. So GGML, wonderful project, lots of really cool and smart people working on this. The kernel libraries are very specific, individual things that are available in very hard-coded ways and they don't compose correctly. You know, you want to walk up to it with a novel model, right? GGML requires a lot of rework before you can do that. And not a lot of the people that do this stuff know C++. And so anyways, my goal and my quest is to massively reduce that complexity. Within quantization, here's the thing I'll give you to think about, right? So autofusing compilers are better for performance, memory, and accuracy. And the reason for that is that if you're using autofusion, avoiding going out to memory is good for performance. Automatic is better than manual, so it's good for humans that don't have the attention span to do this. But with quantization, it's really interesting, because the way you normally implement a quantized operation is that you have higher internal precision than the external precision, right? And so if you write out an activation to memory, you have to re-quantize down to eight bits. But often what you'll end up doing is, take float16 or something, right? The internal arithmetic is done as float32: you load from memory, you do a multiplication of two float16 things, and you get a float32 intermediate result. And so in the CPU or in the GPU, in registers, you have higher precision. So now when you do autofusion, you keep things in the higher precision, and so you have less intermediate rounding. And so when you take a big attention block and you do quantized fusion, you actually get, yes, much more flexibility because you can fuse much bigger regions than people can do by hand. You get better performance because you're not writing things out, but you also get better accuracy. And so that's one of the things that, again, [00:55:46]

Swyx: That's a free lunch. [00:55:47]

Chris: That's pretty great, right? And so, and also you go back to the complexity and the pain and suffering and the, you know, a lot of what Modular's trying to do is reduce suffering in the world. A lot of the quantization tools are just really bad. And it's because, you know, they have this unmovable kernel library that has a whole bunch of special important cases and they're trying to pattern match onto it. And so they often have very flaky problems and it's just a huge pain in the butt. And so by solving some of that low-level compiler nerdery, right, it enables you to have better tools, better accuracy, like all these things actually stack up and just lead to better technology. And then, is 4-bit the right answer? I mean, 4-bit's cool, 2-bit's cool. All this stuff is cool, right? I mean, I think it really depends on your application or use case. And so allowing people that cannot write the kernels to play with that, like, that's the whole point. [00:56:35]
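
A small numeric sketch of the accuracy point above: two chained matmuls where the unfused path re-quantizes the intermediate to int8 between them, while the "fused" path keeps the intermediate in float32, as it would in registers inside a fused region. The quantization here is simulated per-tensor int8, not any particular library's kernels:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 64)).astype(np.float32)
w1 = rng.standard_normal((64, 64)).astype(np.float32) / 8
w2 = rng.standard_normal((64, 64)).astype(np.float32) / 8

def quantize_int8(a):
    scale = np.abs(a).max() / 127.0
    return np.clip(np.round(a / scale), -127, 127).astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Both paths start from the same int8-quantized inputs and weights.
x_dq = dequantize(*quantize_int8(x))
w1_dq = dequantize(*quantize_int8(w1))
w2_dq = dequantize(*quantize_int8(w2))

# Unfused: the intermediate activation is written out re-quantized to int8,
# then read back before the second op, picking up an extra round of rounding.
unfused = dequantize(*quantize_int8(x_dq @ w1_dq)) @ w2_dq

# "Fused": the intermediate stays in float32 inside the fused region.
fused = (x_dq @ w1_dq) @ w2_dq

reference = (x @ w1) @ w2
print("unfused max error:", np.abs(unfused - reference).max())
print("fused   max error:", np.abs(fused - reference).max())  # typically noticeably smaller
```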

Swyx: Yeah, they can still quantize, but using your approach, like it's just orthogonal. It's just going to be a straight improvement either way. So, yeah. [00:56:41]

Chris: Right, exactly. [00:56:42]

Alessio: There's still so much we're figuring out, right? The mixture of experts thing, like a few months ago, people were not really thinking about it, then George kind of leaked it on the podcast. He aired it on our pod. Yeah, and then people started talking about it. A few other people confirmed it, yeah. [00:56:56]

Chris: Yeah, yeah, yeah, exactly. [00:56:57]

Alessio: As all these people started talking about it, I was like, I didn't say it. Please don't call me Sam Altman. Speculative execution is another one. Basically, Karpathy's thing is like, hey, if you're trying to get one token, getting K tokens in a batch takes almost the same time. I'm sure Mojo is great for that because it's not single-threaded like Python. You can run in parallel. [00:57:18]
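
For the speculative execution point (a cheap draft model proposes K tokens, and the big model checks all K in one batched pass, which costs about the same as generating one token), here is a hedged toy sketch in Python. Both `draft_model` and `target_model` are hypothetical stand-ins, and this omits the rejection-sampling refinements real implementations use:

```python
import random

random.seed(0)
VOCAB = list(range(100))

def draft_model(prefix):
    # Cheap model: proposes a next token quickly (deterministic in the prefix,
    # so the target often agrees with it).
    return (sum(prefix) * 31 + len(prefix)) % 100

def target_model(prefixes):
    # Expensive model: scores a whole batch of prefixes in one pass.
    return [(sum(p) * 31 + len(p)) % 100 if random.random() < 0.8 else random.choice(VOCAB)
            for p in prefixes]

def speculative_step(prefix, k=4):
    # 1) Draft K tokens autoregressively with the cheap model.
    draft, proposed = list(prefix), []
    for _ in range(k):
        t = draft_model(draft)
        proposed.append(t)
        draft.append(t)
    # 2) Verify all K positions with the big model in one batched call.
    batch = [list(prefix) + proposed[:i] for i in range(k)]
    verified = target_model(batch)
    # 3) Accept proposals until the big model disagrees, then take its token.
    accepted = []
    for prop, ver in zip(proposed, verified):
        if prop == ver:
            accepted.append(prop)
        else:
            accepted.append(ver)
            break
    return prefix + accepted

print(speculative_step([1, 2, 3]))
```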

Chris: So one of the funny things about this is that you all have been in this space for a while. It used to be, back in the day, ResNet-50 or something, or MNIST, right? What is a neural network? It's machine learning operators, right? Then reinforcement learning came on the scene, right? And now suddenly you're saying an inference ends up being part of the thing the agent does. And then I have a training job that's driving this thing. And now a big RL system ends up being this massively complicated distributed system where you have traditional AI infra lashed together with all this Python and stuff like this. You come back to, like, Stable Diffusion with the UNets; you go look at a vLLM implementation, all the tokenization stuff's in Python. It's super funny when you look at this, because what the world is telling us is that this AI infra, these systems, are not flexible enough. And so why do you have to do the tokenization in Python? It's because the data layers, the libraries that people build in this stuff, are not programmable and you need flexibility. And so people do this. And by putting this stuff into Python, I mean, it's great and I understand that, or rewriting it into C++ to deploy it, right? What ends up happening is you lose the ability to do things like pmap, because the graph, the underlying ML model, is a declarative specification of compute. But if you can't represent your computation, then you can't transform it, right? And so one of the real purposes of Mojo and the way it integrates with the engine and stuff like this is to give you the best of both worlds, where you can say, cool, I can have full programmability. I can write a completely custom tokenization layer or whatever it is I want to do. Or if I have a really compressed on-disk format or I want three-bit, whatever the thing is, I can express that. But now it composes into the stack instead of being a bolt-on on the side that doesn't work well. I've seen the consequence of not building this stuff. And what it does is it drives all this complexity into the system. Or you look at serving layers. There's these platforms like SageMaker, for example. SageMaker is a very popular hosting solution for doing inference on models. But it's really just a TensorFlow or PyTorch that's wrapped up, right? And so sure, you can give it a TensorFlow graph and say, go ahead and serve my TensorFlow graph. But what if you want some pre-processing? Well, you have to set up a microservice next to it, right? And so now you have all this data going in and out over the network just to do one summarization operator before you send something out to the mobile client that you're talking to or something, right? And so the consequence of these design points drives a huge amount of this external complexity into the systems. It just doesn't need to be there, if you do all the hard work of building this stuff from first principles. [00:59:50]
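
A minimal sketch of the "declarative specification of compute" point above: when each step, including tokenization, is a node in a graph, a transformation pass (here a toy fusion rewrite) can see and rewrite it; when a step is opaque Python code bolted on the side, there is nothing for the compiler to transform. The names are illustrative, not Modular's API:

```python
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    inputs: list

def build_graph() -> Op:
    # The whole pipeline, including tokenization, is represented as data...
    tokens = Op("tokenize", inputs=["raw_text"])
    h = Op("matmul", inputs=[tokens, "W"])
    return Op("relu", inputs=[h])

def fuse_matmul_relu(op: Op) -> Op:
    # ...so a pass can walk it and rewrite patterns, e.g. matmul+relu -> fused op.
    if op.name == "relu" and isinstance(op.inputs[0], Op) and op.inputs[0].name == "matmul":
        return Op("fused_matmul_relu", inputs=op.inputs[0].inputs)
    return op

graph = build_graph()
print(fuse_matmul_relu(graph).name)   # fused_matmul_relu
# If "tokenize" were an arbitrary Python function instead of a node, passes like
# fusion or pmap-style parallelization couldn't see inside or across it.
```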

Alessio: What about the post-transformer world? I think we kind of touched on this. And when you have FasterTransformer and all these things, it's so easy to just do another transformer model. We just did our RWKV episode with Eugene Cheah. What do you think about transformer alternatives and how closely are you working with some of these groups as you develop Modular? [01:00:12]

Chris: Yeah, so we're great friends with Chris Ré's research group, and he's pushing on the Hyena models with FFTs and things like this. And so I'm not smart enough to know the right thing there, honestly. My take on that is that there's a lot of smart people. I have a hard time believing transformers are the last major macro architecture that will be invented. And so what I'd love to do is enable more people to be able to play with this stuff. I often get asked, why does anybody care about AI infra if transformers have solved it? It's a super funny question, because the basic assumption there, which is not wrong, is that transformers have eaten everything. They've eaten so much: vision transformers and everything else. They've eaten all the modalities. Therefore, in the fullness of time, they'll eat everything. But the funny thing about that is that that's a very narrow view of, again, what is AI? Because AI also includes massive recommender models where you have huge embeddings and these big, dense matrix multiplications. It also includes the UNets and things like this. It also kind of ignores the fact that within transformers as a category, yes, there's a lot of consistency and we still have softmax. But if you go back to the first paper, the modern transformer is actually quite different. And so, yes, there's a lot of really good ideas about attention and things like this. But the evolution of this over time has really refined the approaches, and a lot of the activation functions have changed. And a lot of stuff and a lot of innovation is still happening in this field. So, I mean, is it FFTs or is it attention? I defer to smarter people that know that stack better. But what I'd love for them to be able to do is not be held back by the architectures of the systems that were massively over-optimized just for attention. What else should people be on the lookout for from Modular? [01:01:50]

Alessio: So you just released, yesterday, Mojo download on Linux systems. You have macOS and Windows coming out soon. What are, say, six, nine months from now, I don't know how much you can share, what is going to be the toolkit? So there's kind of like, Modular is the engine, Mojo is the language. What are going to be the other components that people can leverage? Yeah, yeah. [01:02:08]

Chris: As we record, just yesterday, we announced download support for Linux. I've heard of Macs and Apple platforms. Turns out CI is kind of annoying with them. And so, yes, we'll roll out that kind of stuff. So roll out new platforms, of course. And within Mojo, Mojo is still a young language. And so we have traits coming, hopefully by the end of the year. We have a bunch of things like that that'll be really a big deal for library design and enable new kinds of things to be expressed cleanly. Mojo will mature, right? And so I think that this is a major thing that we're focused on, is actually building Mojo in the right way. And that'll be super exciting. One of the consequences of that is we want a big community around Mojo to build cool stuff. And so as part of building towards this, we'll start open-sourcing Mojo. I think that's something that'll be really great. We just want to make sure that we do it, and again, if we do anything, we want to do it the best possible way. So we want to figure out what is the right contribution model and all this kind of stuff. We want a permissive license. And so we have to nail down a lot of the details to go into this stuff. Because again, we want to be able to build something that works well and have a whole bunch of people that work well together and not just a gigantic, catastrophic mess. [01:03:14]

Alessio: Yeah, there's kind of like the Python 2-to-3 mess that we all got through and nobody wants to remember. What's the relationship with Guido and the Python Foundation? Because some of the Mojo stuff is like, this is so good, why isn't it in Python too? You know, long-term, how are you planning to keep the two languages in sync? And how are you involved with each other, so to speak? [01:03:38]

Chris: Yeah, so we've been talking with Guido for quite some time, from before the launch. And so he's known about Mojo as it was coming. We've been very fortunate. He spent a bunch of time with our team [01:03:47]

Swyx: and things like this. [01:03:47]

Chris: He occasionally shows up on Discord and gives me a hard time about things. So that's super awesome. [01:03:52]

Swyx: What is his pet topic? [01:03:53]

Chris: I think that he enjoys trolling us, which I also enjoy. So it's all good. And so there's Guido himself, then there's a broader question of Python. I consider Mojo to be a member of the Python family. And so there's a number of members of the Python family, by the way, including things like PyPy and Cython and all this stuff. And so we want to be a good member of the Python family. And what I expect is that Python will continue to evolve and add new stuff. Mojo will continue to evolve and add new stuff, right? And so the analogy I give to people is to go way back 30, 40 years ago: there was C, and then this newcomer came on the scene in 1983 or something called C++, right? And what was C++? Well, it was C with classes, right? And so Mojo is Python with not just classes, but all the stuff underneath it that you usually do in C, right? And so what happened back in the day is that C and C++ started as two different communities, but there was tons of intermixing and idea sharing and cross-pollination of ideas. And a lot of the C++ features ended up in C. And then of course, all the C features ended up in C++. And so I expect that same thing to happen. And so I look at it as Python 3 versus Mojo. Python 3 is really defined by its runtime. It's defined by a specific object model. And if the Python community wants to change that, that would be really interesting. But Mojo is saying, okay, it's defined by a superset of the expressive capability. And so we have fancy MLIR compilers [01:05:21]

Swyx: and things like this. [01:05:21]

Chris: And so we can have on-stack representations and things that kind of lead to relatives of each other. And I'd like for Mojo to be a superset, right? In terms of all the capabilities. But each of these things will evolve in parallel. You know, when people come to me and they say, hey, I want this crazy feature, which should be in Python, I say, great, go talk to Python. We're here to add the systems programming features. We're not here to just add a general, you know, walrus operator or something. Ooh, that still burns a little bit. [01:05:49]

Swyx: But, you know, Python actually did end up adding no-GIL, like, not long after. Well, they haven't added it yet, but there's been discussion about it. Well, also, yeah, I mean, [01:05:58]

Chris: I think the GIL stuff is also going to be super interesting. They have a five-year journey to add this. And so it's going to be technically very complicated for the community, because one of the most beautiful and pragmatic things about Python is that you drop right down into C. And so much of the Python ecosystem is actually C libraries or C++ or et cetera, right? But they're then wrapped by Python, right? But one of the things the no-GIL stuff breaks is a bunch of that glue. And so, like, the ability to get and set attributes, all the C functions for doing that, break, right? And so that's going to cause a lot of churn and complexity. And so I'm not involved in the effort, obviously, but from what I can see, the Python community seems like they're walking into this with eyes wide open. Oh, yeah. They understand the trade-offs. I think they're doing a really, like, well-thought-out approach to this. And so I think that it will probably go really well. Now, that's great also, by the way, because Mojo likes threads, because threads are a thing, right? And so this will make it so that the Python ecosystem is more concurrency compatible, which will be great for us. [01:07:00]

Swyx: Yeah, but you're already there, so. [01:07:02]

Chris: Yeah, exactly. I mean, again, learning something from first principles, it's not like, you know, multi-core is the future anymore, right? Yeah, yeah. [01:07:08]

Swyx: One thing you're doing differently, then, beyond, you know, C, C++, and then, you know, Python, Python++, is that you're choosing to build this as a company. Why a company and not a foundation? I think you kind of answered that with the Modular-first point. [01:07:19]

Chris: Yeah, so we didn't start Modular to build Mojo. We started Modular to solve some AI problems, and then said, okay, well, we need to do a language. So I'll reinterpret your question, if it's okay, as saying, why is Modular an independent company instead of part of a big tech? Apple or Google or Microsoft or whatever. So there's a number of reasons. Well, so first of all, I'll say, we tried. We collectively, and I'll speak on behalf of the people on our team, many of us came from big tech. Yeah. Like I worked at Google. I worked on ML infrastructure at Google, right? Literally working on this problem. And many of our people came out of this context. And the challenge, again, these companies are amazing, right? This isn't to bag on big tech. The challenge is, AI infra is not their product, right? So when I was working on XLA for TPUs, it was to enable TPUs. It wasn't some abstract, let's go solve programming model and hardware and this big problem. It was literally enable this hardware because we just installed exaflops of it, and we needed to get it to go and work, right? When you look at what is TensorFlow, it's, by the way, part of the cloud organization within Google. So if you want help with TensorFlow, sign up for GCP, and then they can help you, right? What is their product? Then Meta, right? I mean, what are they trying to solve? Well, they're trying to solve their ad stuff. [01:08:34]

Swyx: Meta has never had any interest in, yeah, external facing developer stuff. But Microsoft would have had you, like Satya has, you know. [01:08:40]

Chris: Yeah, I wouldn't go so far to say that none of these people care. All these people care. And there's so many good engineers within the PyTorch team that care about external developers. But the way to think about this is that all these projects are more of like a hobby than they are the company project, right? And so that difference is actually really important. Like, I mean, if you file a bug against Meta or a bug against PyTorch, you have a bunch of really good engineers that are allowed to work on that, and they want their product to be good. And so they might fix it, but also they might not, right? When we talk to people, not everybody in AI trusts Meta and Google. Often they're directly competing with them, right? And so like, no, I'm not going to actually show you my model so that you can debug the problem. They're conflicted in lots of different ways. And so with Modular as a standalone company, it's super important to us that we're neutral. We're like Switzerland, right? We do not build hardware. We do not have a cloud. We are not building an LLM, right? And so what we're doing is we're building AI Infra in a way that is really good so that you all can go invent all this other stuff and you have the right tools to do it, and we're not competing with you, right? And so that is something that, you know, again, there's lots of really good people, all my friends, you know, in all these different places, right? It's not the engineers or the management is doing anything wrong. It's just that what is their core incentive structure? What do the engineers get promoted for doing? And these things that, you know, actually they're more incentive oriented than they are technically oriented [01:10:08]

Swyx: end up mattering a lot. [01:10:08]

Chris: And this is one of the reasons why at a hardware company, you're not incentivized to build software that runs on lots of different kinds of hardware, obviously, right? Within Google, you're not incentivized to build things that work great for PyTorch. You know, so there's this problem where the rest of the world is building on AI. They use TensorFlow and PyTorch and lots of hardware and lots of clouds and lots of stuff. And so being able to help people and be aligned with their interests is really useful. [01:10:31]

Swyx: One thing I wanted to come back on, you said you don't have a cloud, but the way that people would use the modular inference engine is through your cloud. [01:10:39]

Swyx: You have cloud engineers. [01:10:40]

Chris: We do have cloud engineers. [01:10:41]

Chris: Actually, the way our product gets used is you use it on your cloud. And so we give you roughly a Docker container, and so it can run on cloud, on-prem, on laptops. We have folks using all kinds of different things. And so it's very modular that way. So we'll also build it into a hosted product, of course, over time, just out of convenience. A lot of people don't want to do the management themselves, but we're really focused on meeting people where they are, right? And we believe that our tech gets adopted faster if it's easy to adopt and easy to use, rather than saying, okay, first move all your stuff to our cloud. [01:11:14]

Swyx: It's a valuable thing, [01:11:15]

Chris: particularly for people who don't want to manage that, but it just slows down adoption. [01:11:18]

Swyx: So a bit more company origin story stuff, because I just love company origin story type things. Your co-founder is Tim Davis, who you've worked with for a while. He's also had a couple of other startups under his belt. You got the idea for Modular at SiFive, and you talked to the big clouds, and they didn't really want it, or you just arrived at the conclusion that it wouldn't be the best place for it. How did you go about founding the company? Yeah, good question. [01:11:40]

Chris: So I've been working on this stuff since 2016, 2017, right? So I've been working on AI from a bunch of different vantage points. So Tesla, doing applied: how do we make cars drive themselves? At Google, bringing up a hardware program and trying to get TensorFlow to be architected better, let's say. Then I was dissatisfied for various reasons with what was going on at Google and with not taking PyTorch seriously and things like that. And so I went and joined a hardware startup. When I did that, I really wanted to solve this problem, but the timing, that was in 2020, which was right before the pandemic, by the way, wasn't right, right? Because at that time, there were still a lot of things that were unknown. PyTorch was still figuring stuff out, and they had a lot of very ambitious projects. And at the time, I'm like, okay, well, I assume that Meta will go off and solve these problems, right? And so I joined a hardware startup to understand the other side, the business strategy, the commercial side of things, the company building side of things and all this kind of stuff, learned a ton. Also that I'm a software person, not a hardware person, but Tim was going through his own journey. And so Tim joined Google Brain roughly the same time I did in 2017. We worked together very closely. I was on the data center TPU side. He was on the mobile side with Android and all that kind of stuff. I was engineering, he was product. We were very complementary that whole time. He stayed at Google through all that time until about 2021. And so we kind of got to the point in our journey where we're saying, okay, well, what are we going to do next? And so middle of 2021, we said, okay, well, this AI infra problem is still a thing. This, in our opinion, was not getting fixed. We looked at this and said, okay, well, what are the problems in the space? A reductive way of asking the question is you say, if AI is so important to the world, this was before ChatGPT, but AI was important before ChatGPT, by the way. If it's so important to the world, why is all the software so bad? Why is it so hard to deploy a model? I mean, we did huge amounts of work to make it easy to train models, but getting them into production is still very, very challenging. And so what we did is we broke this all down and we said, okay, well, there's really three kinds of software in the world. There's the hardware-specific software. So CUDA or the XLA stack or the Apple neural engine stack with Core ML, things like this. And it's not the hardware people's fault, but they have to build this vertical software stack for their hardware because there's nothing to plug into. There's no LLVM for machine learning, right? And as a consequence of doing that, and they're not malicious, but they end up fragmenting the universe because they all have to build different stuff. Okay, so that's one third of the software in AI. Another third is the frameworks. So you've got TensorFlow, you got PyTorch, you got TVM, you got like all this stuff out there. All these things were, you know, they're eight years old. The infrastructure itself was research, right? These things were built in a different era of what ML was, and they got evolved along the way and new hardware and new use cases and all this stuff. And they were never intentionally designed with, you know, what we know now. Furthermore, often because AI was so important to their host companies, hundreds of people got thrown at it, right? 
And so I don't know how much money has been spent on TensorFlow or PyTorch, but it's a lot, right? And so you get all these people that are kind of hacking away with the combination of lots of hands and not a lot of clear vision. I mean, it's easier to understand in hindsight than it is to predict what AI will look like in five years, right? And it means that it will generate a lot of stuff, which is maybe not the most clean architecture, right? And so we get these systems that have lots of well-known problems. And so PyTorch, for example, it's pretty difficult to deploy. It's pretty well-known. It doesn't really work great with lots of non-NVIDIA hardware, right? It doesn't scale super well for LLMs. These things are pretty well-known, but they're very difficult fundamentally to fix. And the PyTorch engineers are doing really great work. They're working hard on this, but it's really hard to fix given the environment that they're in. And so because you've got the hardware side of things that's fragmenting software, you've got the framework side that is, you know, tied to the architecture that they started with as things evolved. What we've got is we've got a lot of people who want to make AI easy. And so MLOps is this category that evolved. And what I think a lot of these folks tried to do is they said, hey, let's make it easy by making the API super simple. So AutoML is one example of this, maybe the most extreme, but lots of other people said, hey, I'm going to add a layer of Python on top of this gigantic mess, and that will make it easy to do AI. But the challenge is you can't solve programmability or performance or hardware capabilities or new kinds of algorithms or, like, security or deployability, these core problems that people are struggling with, by adding a layer of Python on top, at least not without giving up the mad joy of, like, all the craziness of AI research. Right? And so what we decided to do is we said, okay, well, let's go back and first-principles this thing. Like, what is causing all of this madness? Well, it's because there's no thing for people to plug into. Let's go do that hard thing. Let's go build from the bottom up. One of our first blog posts was, you know, before we could say what we were doing, basically the mission statement: let's actually design and first-principles this stuff. Let's build this unifying platform. Let's tackle the hard problem. And so that's what we decided to do. [01:16:47]

Alessio: At Decibel, our team is kind of like early believers in technical founders. And we see a lot of founders like yourself. You've had a very long career as, like, an amazing engineer. And then all of a sudden you're like in the CEO seat. What are some of the learnings that you've had building a team, mentoring people, especially when I'm sure a lot of your work has been mentoring engineers, and now it's like also having the product hat on, also having the fundraising hat on, any stories and learnings? [01:17:13]

Chris: So at Modular, my co-founder, Tim and I, we're like two in a box, right? So one of the things that I think is really special is that we have a very strong relationship and we complement each other very well, yin and yang, right? And so having somebody to talk to is really, really important. And it's not something that I've had being engineering leader at Google or engineering leader at Apple or something like this. And so that I think is super special. I'll also say that, you know, I've built many teams, many products and technologies. And so I built all this kind of stuff, but it's always within somebody else's context. And so it's really nice to not have to clean up somebody else's mess, right? [01:17:46]

Swyx: Well, it's your mess now. Yeah, exactly. [01:17:48]

Chris: And so also you get to, again, first-principles everything. Like how do we think about comp? How do we think about, you know... A lot of the philosophy at Modular was, okay, well, you know, our belief when we started the company was we understood the pain. I'll speak on behalf of Tim. Tim understood the pain with his Google hat on, right? And he worked with a lot of customers outside and things like this, but having a Google hat on is very different than having a startup hat on, right? And so when we started the company, we started and said, okay, well, Chris goes and, as engineering leader, starts building the thing and building the engineering team and all that kind of stuff. Tim goes and builds the product side and the business side and things like this and goes and interviews 50 or 100 different companies without a Google hat on. What is your pain point? What are you doing? What are your challenges? How can we help? We're thinking about building X. What do you think about that? And really hones the vision. And that's what allowed us to come back together. And so the challenging thing about being Modular is we're trying to build something that is really hard. It's a super hard tech problem. Also pretty abstract. I mean, it's getting less abstract now that it's working and it's all coming together and we can announce things, right? But solving this problem requires hiring these very expensive specialists out of all these big tech companies, right? And so that really formed and shaped a lot of our initial conditions, how we thought about things. And again, when you're first-principling this, you say, okay, well, because of that, I have to raise a lot of money. I have to be able to incentivize people well. I have to be able to pay them. I need to be able to make it comfortable, like make it so that they're not fish out of water. And a lot of that shapes how you do this stuff. And so I've really enjoyed it. I think that it's a lot of fun. It's also great because we can do things where, you know, you come back to, is TensorFlow or PyTorch a product? I would say no, but I'd also say self-reflectively, many of the things I've worked on, like Swift, for example, right? Or even Xcode, are products in the sense that there's a product manager and there's a team that works on it and you ship it to customers, but it is not the core product of the company. Xcode is a cost center, right? Apple doesn't make money on it. And because it is detached, it's kind of one level indirect from the customer, right? It's very easy for that team, or for a support team like that, or like the TensorFlow or the PyTorch team equivalently, to go work on interesting technical projects that get very divorced from the customer, because you don't really know what they're doing. And so for us, we're directly customer facing, right? We see the pain. And in AI, as I think you probably know, right? There's a lot of pain and building and deploying these things is really a mess. And sure, throw a layer of Python on top, you can make a demo simple, right? But a lot of the pain that the leading companies and the leading people that are building these things are facing is not that kind of a problem. It's that they're surrounded by too many things that don't really work, right? And so a lot of our vision of let's go unify all this stuff, let's have fewer things that work better, came directly out of talking to teams whose problem is that they're building a product and their product changes. 
They're not using one model. Their needs over time evolve. And okay, well, now we have a mobile product. Well, now what does that mean? That's a completely different universe, right? And so what ends up happening with the teams we work with is that they're often quite sophisticated and they've evolved lots of different messy systems for different special cases and it's killing them, right? And so they often want to just be able to run faster, right? Do I need a team of 50 engineers to deploy this model? Why do I need that? [01:21:09]

Swyx: I was also curious about your learnings as an engineering leader. So you've just had tons of experience building teams and hiring engineers. Obviously people want to work with you naturally. So you just naturally get a buff. Oh yeah, so it's easy, right? What is your learnings or advice or just on the engineering management side of things? [01:21:26]

Chris: Yeah, so I mean, I think there's different things. I consider my job to be to help the team win, right? So I do what it takes to win. And you have to be like, starting from wanting to win is actually something that some people take for granted, right? And so you have to define what winning is. And so giving people a clear vision, having a clear purpose, keeping people aligned, super, super important when you've got a whole bunch of really good people that are all wanting to be heroes in their own journey, right? If the vectors add up, you can make a lot of progress really fast. If they're pointing against each other, they cancel out, right? Within, you know, because of who I am and what I like to do, like I will often help build the initial foundation of the thing myself. And so showing the team how to build things is really good, not just to directly contribute, because I built a lot of this stuff before, but also to set the culture. So one of the things that is really important to me in an engineering team is how fast can you spin, right? If you're sitting there and you have to wait 24 hours or three weeks for CI to run, well, it just slows everything down, right? And so, well, what does that mean? Well, it means testing strategy. It means, like, all of these things that are just, like, core software engineering problems end up mattering a lot. And once you get a culture in there, like, you know, low dependencies, like do not just suck in third-party dependencies and hope it'll be great. Because there's lots of these things that kind of come into this. And then what you end up doing is you end up building a culture within the team. Now, when you do that, now you have really good people. You have to identify first when you're hiring, but also as people are evolving, like what are people good at, right? And I really believe that if you have a really powerful engineer, for example, or product manager or whatever, [01:23:01]

Swyx: if they're really good, [01:23:01]

Chris: you can throw them at any problem and they can make progress, right? But if you have somebody who's really good and really passionate and you line them up with something they really want to be doing, well, then they'll have superpowers, right? And so a lot of it is making sure people are working on the right problems. And so they're able to grow and do things and push and they have agency to own decisions and they're able to do things. And so it's kind of like this ongoing, like evolving dance, particularly in a high growth team, where what you're doing is you're looking for not just what are the lines of code you write, but also what are you contributing, right? And things like this. And so there's a lot of building a team that I'm not the guy that's going to write a management textbook or something like that, right? I mean, you should. I should probably write a compiler textbook first. [01:23:47]

Swyx: Yeah, you have many contributions. [01:23:50]

Chris: I like building the thing, unfortunately. And so I don't slow down for stuff like that. But a lot of it is, people get very focused on often the product or if they're really, really smart and they're good at business, they focus on the customer and the problems the customer has, right? But you can't solve and build the product without having the team. And so, so much of these things end up being these virtuous loops. And so thinking about all parts of those problems, I think is really an important part of being a leader and being a team. And again, this is one of the reasons I love Tim [01:24:21]

Swyx: and love working with him [01:24:21]

Chris: because he's really great at ways that I'm less great at and we're both learning from each other. Before we do landing ground, [01:24:27]

Alessio: any people that should be joining your team, any role that you have open that you're looking for? [01:24:32]

Chris: We are growing quite a bit. We are focused on a whole bunch of different things, including hardware, software boundary. And so if you're a kernel engineer, you care about performance, GPUs and like all the weird things that are out there, right? This is a major focus. We are not hiring researchers, but we really love applied people that like actually get a model to work in production and do things. And that's really great for us. We have a lot of customer engagements and things like that going on that can be very helpful and valuable with that. We're also growing out our go-to-market team and there's many different kinds of roles. You can check out our career page and we have a number of positions posted there. [01:25:06]

Swyx: Awesome. [01:25:07]

Alessio: So we have our usual three questions before wrapping up. One is on acceleration, one is on exploration, and then I'll take it away. So the acceleration one is, what's something that already happened in AI that you thought would take much longer to be here? [01:25:21]

Chris: So the ChatGPT explosion, I thought, was super interesting, right? And for folks like us that have been paying attention to AI for a long time, ChatGPT was super interesting to me because it was a user interface innovation. And ChatGPT happened and then GPT-4 happened and the world generally didn't even notice GPT-4. Nerds like us did, right? But they had no idea, they don't care. ChatGPT was the thing that really got people excited and it was really, you know, RLHF, like, I mean, that goes into all this stuff, right? But it was really about the user interface and how they use it. And suddenly it opened people's minds to the power of what AI can do. And so I thought that was super interesting. And from a looking backwards perspective, I thought that brought AI forward in the public consciousness by several years, I think. [01:26:10]

Swyx: I always say you want to combine model with modality. Like ChatGPT, you know, we had Clippy before and Clippy never took off. But anyway, so the time was right. What do you think is the most interesting unsolved question in AI? Maybe not the one you're tackling. [01:26:24]

Chris: There's lots of smart people with lots of different opinions about what AI is, right? And there are certain people that you know and I know that think that everything should just be an end-to-end neural net and software should go away, right? I think that the open question is, what is the balance between trained algorithms and intelligently designed algorithms? I do not believe personally that it is all one or all the other, right? And if you want to build a cat detector, then a CNN is a really good way to do that. If you want to write a bootloader or an operating system, then for loops are a good way to do that, right? But where do things phase out over time and how do we make it so that app developers can think about these things more consistently instead of thinking about them as, you know, category A versus category B, right? And I mean, part of my bet is that AI as a software development approach ends up being, you know, part of the tool set of how people think about building applications. You know, where applications are not just like an iPhone app or something like that, but it's your cloud services, your data pipeline. It's like this whole complicated dance that leads to building a user product, right? And so I think that we as an industry haven't yet figured that out, right? I mean, it's just so early. AI is like in its adolescent years right now. [01:27:35]

Alessio: It's funny because like doing this podcast, we're like, oh, remember that? And then you look at the timestamp and it was like three months ago. Exactly. You know, it's kind of you look back and it's like, oh, it's not even one year since ChatGPT came out, you know? And we went from like no AI safety discourse, for example, to like AI is going to end the world. Then it's like, all we did was put a chat online, you know, so it kind of makes you wonder. [01:27:58]

Chris: And I'll admit, like in 2017, there was a bunch of people focused on safety. And I'm like, why does this matter? Right? And they were just ahead of their time. Now it's pretty clear. [01:28:06]

Swyx: Yeah, exactly. [01:28:06]

Chris: That's exactly right. [01:28:07]

Swyx: They took it seriously when the rest of us were only looking at the math. Yeah. [01:28:11]

Chris: Well, and that's one of the things I really love about some of the OG people like Geoff Hinton and some of these folks like Yann LeCun, because they were into AI before it was cool, right? They were working on this stuff before it was obvious to everyone. And I think that they have seen and can integrate across a much longer timeframe. And the wisdom that comes out of that, I think, enables them to do really amazing things even today, because they get that better perspective. [01:28:36]

Alessio: Awesome. Before we wrap, Chris, any final takeaway message that you want everybody to think about and remember? [01:28:43]

Chris: No, I mean, thank you for having me. I mean, this is a lot of fun and I really love being able to talk at a much more technical level about the AI part of what we're doing. And so I'm just so excited about where things are, what's happening, what the world's building, like just everything about what's happening right now is just super exciting to me. Awesome. [01:29:01]

Alessio: Thank you so much, Chris. [01:29:02]

Swyx: Thank you. [01:29:02]



Get full access to Latent Space at www.latent.space/subscribe
2023-09-14
Link to episode

The Point of LangChain - with Harrison Chase of LangChain

As alluded to on the pod, LangChain has just launched LangChain Hub: "the go-to place for developers to discover new use cases and polished prompts." It's available to everyone with a LangSmith account, no invite code necessary. Check it out!

In 2023, LangChain has speedrun the race from 2:00 to 4:00 to 7:00 Silicon Valley Time. From the back to back $10m Benchmark seed and (rumored) $20-25m Sequoia Series A in April, to back to back critiques of "LangChain is Pointless" and "The Problem with LangChain" in July, to teaching with Andrew Ng and keynoting at basically every AI conference this fall (including ours), it has been an extreme rollercoaster for Harrison and his growing team creating one of the most popular (>60k stars at time of writing) building blocks for AI Engineers.

LangChain's Origins

The first commit to LangChain shows its humble origins as a light wrapper around Python's formatter.format for prompt templating. But as Harrison tells the story, even his first experience with text-davinci-002 in early 2022 was focused on chatting with data from their internal company Notion and Slack, what is now known as Retrieval Augmented Generation (RAG).
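
To make that "light wrapper" concrete, here is a minimal sketch of the idea (illustrative only, not the actual first commit): a prompt template that does little more than declare its variables and delegate to Python's built-in string formatting.

```python
# Minimal sketch of a prompt template as a thin wrapper over str.format.
# Illustrative only; the real LangChain PromptTemplate added validation,
# serialization, partials, and more on top of this basic idea.

class SimplePromptTemplate:
    def __init__(self, template: str, input_variables: list[str]):
        self.template = template
        self.input_variables = input_variables

    def format(self, **kwargs: str) -> str:
        # Fail loudly if a declared variable was not supplied.
        missing = [v for v in self.input_variables if v not in kwargs]
        if missing:
            raise ValueError(f"Missing variables: {missing}")
        return self.template.format(**kwargs)


qa_prompt = SimplePromptTemplate(
    template=(
        "Answer the question using only the context.\n\n"
        "Context: {context}\n\nQuestion: {question}\nAnswer:"
    ),
    input_variables=["context", "question"],
)

print(qa_prompt.format(
    context="LangChain's first commit landed in October 2022.",
    question="When did LangChain start?",
))
```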

As the Generative AI meetup scene came to life post Stable Diffusion, Harrison saw a need for common abstractions for what people were building with text LLMs at the time:

* LLM Math, aka Riley Goodside's "You Can't Do Math" REPL-in-the-loop (PR #8)

* Self-Ask With Search, Ofir Press' agent pattern (PR #9) (later ReAct, PR #24)

* NatBot, Nat Friedman's browser controlling agent (PR #18)

* Adapters for OpenAI, Cohere, and HuggingFaceHub

All this was built and launched in a few days from Oct 16-25, 2022.

Turning research ideas/exciting usecases into software quickly and often has been in the LangChain DNA from Day 1 and likely a big driver of LangChain's success, to date amassing the largest community of AI Engineers and being the default launch framework for every big name from Nvidia to OpenAI:

Dancing with Giants

But AI Engineering is built atop of constantly moving tectonic shifts:

* ChatGPT launched in November ("The Day the AGI Was Born") and the API released in March. Before the ChatGPT API, OpenAI did not have a chat endpoint. In order to build a chatbot with history, you had to make sure to chain all messages and prompt for completion. LangChain made it easy to do that out of the box, which was a huge driver of usage (see the sketch after this list).

* Today, OpenAI has gone all-in on the chat API and is deprecating the old completions models, essentially baking in the chat pattern as the default way most engineers should interact with LLMs... and reducing (but not eliminating) the value of ConversationChains.

* And there have been more updates since: Plugins released in API form as Functions in June (one of our top pods ever... reducing but not eliminating the value of OutputParsers) and Finetuning in August (arguably reducing some need for Retrieval and Prompt tooling).
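
For readers who only joined after the chat API, here is a rough sketch of what "chain all messages and prompt for completion" meant in practice: you kept the conversation history yourself, flattened it into a single prompt string, and asked a completions model to continue it. The helper below is a hypothetical stand-in for whatever completions endpoint you were calling at the time.

```python
# Sketch of the pre-chat-API pattern: maintain history yourself, flatten it
# into one prompt, and have a *completions* model continue the dialogue.

history: list[tuple[str, str]] = []  # (speaker, text) pairs


def build_prompt(history: list[tuple[str, str]], user_message: str) -> str:
    lines = ["The following is a conversation between a Human and an AI assistant."]
    for speaker, text in history:
        lines.append(f"{speaker}: {text}")
    lines.append(f"Human: {user_message}")
    lines.append("AI:")  # the completions model continues from here
    return "\n".join(lines)


def send_to_completions_model(prompt: str) -> str:
    # Hypothetical stub: swap in a real call to a legacy completions endpoint
    # (e.g. a text-davinci-* model), passing the prompt and a "Human:" stop sequence.
    return "Happy to help! What would you like to know?"


def chat_turn(user_message: str) -> str:
    reply = send_to_completions_model(build_prompt(history, user_message))
    history.append(("Human", user_message))
    history.append(("AI", reply))
    return reply


print(chat_turn("Hi there!"))
```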

With each update, OpenAI and other frontier model labs realign the roadmaps of this nascent industry, and Harrison credits the modular design of LangChain in staying relevant. LangChain has not been merely responsive either: LangChain added Agents in November, well before they became the hottest topic of the AI Summer, and now Agents feature as one of LangChain?s top two usecases.

LangChain's problem for podcasters and newcomers alike is its sheer scope - it is the world's most complete AI framework, but it also has a sprawling surface area that is difficult to fully grasp or document in one sitting. This means it's time for the trademark Latent Space move (ChatGPT, GPT4, Auto-GPT, and Code Interpreter Advanced Data Analysis GPT4.5): the executive summary!

What is LangChain?

As Harrison explains, LangChain is an open source framework for building context-aware reasoning applications, available in Python and JS/TS.

It launched in Oct 2022 with the central value proposition of "composability", aka the idea that every AI engineer will want to switch LLMs, and combine LLMs with other things into "chains", using a flexible interface that can be saved via a schema.
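
For a flavor of what "composability" looked like in code, here is a sketch in the style of the 2023-era Python package: a prompt template and an LLM wrapper composed into a chain, where the LLM can be swapped for another provider without touching the rest. Import paths and class names have shifted across releases, so treat this as illustrative rather than authoritative.

```python
# Illustrative sketch of classic (circa 2023) LangChain composition.
# Assumes `pip install langchain openai` and OPENAI_API_KEY in the environment;
# newer releases have reorganized these imports.
from langchain.prompts import PromptTemplate
from langchain.llms import OpenAI
from langchain.chains import LLMChain

prompt = PromptTemplate(
    input_variables=["product"],
    template="What is a good name for a company that makes {product}?",
)
llm = OpenAI(temperature=0.7)  # swap in another provider's wrapper to test "composability"
chain = LLMChain(llm=llm, prompt=prompt)

print(chain.run(product="colorful socks"))
```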

Today, LangChain's principal offerings can be grouped as:

* Components: isolated modules/abstractions

* Model I/O

* Models (for LLM/Chat/Embeddings, from OpenAI, Anthropic, Cohere, etc)

* Prompts (Templates, ExampleSelectors, OutputParsers)

* Retrieval (revised and reintroduced in March)

* Document Loaders (eg from CSV, JSON, Markdown, PDF)

* Text Splitters (15+ various strategies for chunking text to fit token limits)

* Retrievers (generic interface for turning an unstructured query into a set of documents - for self-querying, contextual compression, ensembling)

* Vector Stores (retrievers that search by similarity of embeddings)

* Indexers (sync documents from any source into a vector store without duplication)

* Memory (for long running chats, whether a simple Buffer, Knowledge Graph, Summary, or Vector Store)

* Use-Cases: compositions of Components

* Chains: combining a PromptTemplate, LLM Model and optional OutputParser

* with Router, Sequential, and Transform Chains for advanced usecases

* savable, sharable schemas that can be loaded from LangChainHub

* Agents: a chain that has access to a suite of tools, of nondeterministic length because the LLM is used as a reasoning engine to determine which actions to take and in which order. Notable 100LOC explainer here.

* Tools (interfaces that an agent can use to interact with the world - preset list here. Includes things like ChatGPT plugins, Google Search, WolframAlpha. Groups of tools are bundled up as toolkits)

* AgentExecutor (the agent runtime, basically the while loop, with support for controls, timeouts, memory sharing, etc)

* LangChain has also added a Callbacks system for instrumenting each stage of LLM, Chain, and Agent calls (which enables LangSmith, LangChain's first cloud product), and most recently an Expression Language, a declarative way to compose chains.

LangChain the company incorporated in January 2023, announced their seed round in April, and launched LangSmith in July. At time of writing, the company has 93k followers, their Discord has 31k members and their weekly webinars are attended by thousands of people live.

The full-featuredness of LangChain means it is often the first starting point for building any mainstream LLM use case, because they are most likely to have working guides for the new developer. Logan (our first guest!) from OpenAI has been a notable fan of both LangChain and LangSmith (they will be running the first LangChain + OpenAI workshop at AI Eng Summit).

However, LangChain is not without its critics, with Aravind Srinivas, Jim Fan, Max Woolf, Mckay Wrigley and the general Reddit/HN community describing frustrations with the value of their abstractions, and many are attempting to write their own (the common experience of adding and then removing LangChain is something we covered in our Agents writeup). Harrison compares this with the timeless ORM debate on the value of abstractions.

LangSmith

Last month, Harrison launched LangSmith, their LLM observability tool and first cloud product. LangSmith makes it easy to monitor all the different primitives that LangChain offers (agents, chains, LLMs) as well as making it easy to share and evaluate them both through heuristics (i.e. manually written ones) and "LLM evaluating LLM" flows.

The top HN comment in the "LangChain is Pointless" thread observed that orchestration is the smallest part of the work, and the bulk of it is prompt tuning and data serialization. When asked about this directly on our pod, Harrison agreed:

"I agree that those are big pain points that get exacerbated when you have these complex chains and agents where you can't really see what's going on inside of them. And I think that's partially why we built LangSmith..." (48min mark)

You can watch the full launch on the LangChain YouTube:

It's clear that the target audience for LangChain is expanding to folks who are building complex, production applications rather than focusing on the simpler "Q&A your docs" use cases that made it popular in the first place. As the AI Engineer space matures, there will be more and more tools graduating from supporting "hobby" projects to more enterprise-y use cases.

In this episode we run through some of the history of LangChain, how it's growing from an open source project to one of the highest valued AI startups out there, and its future. We hope you enjoy it!

Show Notes

* LangChain

* LangChain's Berkshire Hathaway Homepage

* Abstractions tweet

* LangSmith

* LangSmith Cookbooks repo

* LangChain Retrieval blog

* Evaluating CSV Question/Answering blog and YouTube

* MultiOn Partner blog

* Harvard Sports Analytics Collective

* Evaluating RAG Webinar

* awesome-langchain:

* LLM Math Chain

* Self-Ask

* LangChain Hub UI

* "LangChain is Pointless"

* Harrison's links

* sports - estimating player compatibility in the NBA

* early interest in prompt injections

* GitHub

* Twitter

Timestamps

* [00:00:00] Introduction

* [00:00:48] Harrison's background and how sports led him into ML

* [00:04:54] The inspiration for creating LangChain - abstracting common patterns seen in other GPT-3 projects

* [00:05:51] Overview of LangChain - a framework for building context-aware reasoning applications

* [00:10:09] Components of LangChain - modules, chains, agents, etc.

* [00:14:39] Underappreciated parts of LangChain - text splitters, retrieval algorithms like self-query

* [00:18:46] Hiring at LangChain

* [00:20:27] Designing the LangChain architecture - balancing flexibility and structure

* [00:24:09] The difference between chains and agents in LangChain

* [00:25:08] Prompt engineering and LangChain

* [00:26:16] Announcing LangSmith

* [00:30:50] Writing custom evaluators in LangSmith

* [00:33:19] Reducing hallucinations - fixing retrieval vs generation issues

* [00:38:17] The challenges of long context windows

* [00:40:01] LangChain's multi-programming language strategy

* [00:45:55] Most popular LangChain blog posts - deep dives into specific topics

* [00:50:25] Responding to LangChain criticisms

* [00:54:11] Harrison's advice to AI engineers

* [00:55:43] Lightning Round

Transcript

Alessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, Partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai. [00:00:19]

Swyx: Welcome. Today we have Harrison Chase in the studio with us. Welcome Harrison. [00:00:23]

Harrison: Thank you guys for having me. I'm excited to be here. [00:00:25]

Swyx: It's been a long time coming. We've been asking you for a little bit and we're really glad that you got some time to join us in the studio. Yeah. [00:00:32]

Harrison: I've been dodging you guys for a while. [00:00:34]

Swyx: About seven months. You pulled me in here. [00:00:37]

Alessio: About seven months. But it's all good. I totally understand. [00:00:38]

Swyx: We like to introduce people through the official backgrounds and then ask you a little bit about your personal side. So you went to Harvard, class of 2017. You don't list what you did in Harvard. Was it CS? [00:00:48]

Harrison: Stats and CS. [00:00:50]

Swyx: That's awesome. I love me some good stats. [00:00:52]

Harrison: I got into it through stats, through doing sports analytics. And then there was so much overlap between stats and CS that I found myself doing more and more of that. [00:00:59]

Swyx: And it's interesting that a lot of the math that you learn in stats actually comes over into machine learning which you applied at Kensho as a machine learning engineer and Robust Intelligence, which seems to be the home of a lot of AI founders.

Harrison: It does. Yeah.

Swyx: And you started LangChain, I think around November 2022 and incorporated in January. Yeah. [00:01:19]

Harrison: I was looking it up for the podcast and the first tweet was on, I think, October 24th. So just before the end of October. [00:01:26]

Swyx: Yeah. So that's your LinkedIn. What should people know about you on the personal side that's not obvious on LinkedIn? [00:01:33]

Harrison: A lot of how I got into this is all through sports actually. Like I'm a big sports fan, played a lot of soccer growing up and then really big fan of the NBA and NFL. And so freshman year at college I showed up and I knew I liked math. I knew I liked sports. One of the clubs that was there was the Sports Analytics Collective. And so I joined that freshman year. I was doing a lot of stuff in like Excel, just like basic stats, but then like wanted to do more advanced stuff. So I learned to code, learned kind of like data science and machine learning through that way. Kind of like just kept on going down that path. I think sports is a great entryway to data science and machine learning. There's a lot of like numbers out there. People like really care. Like I remember, I think sophomore, junior year, I was in the Sports Collective and the main thing we had was a blog. And so we wrote a blog. It wasn't me. One of the other people in the club wrote a blog predicting the NFL season. I think they made some kind of model with stats, and I think their stats showed that like the Dolphins would end up beating the Patriots, and New England got like pissed about it, of course. So people like really care and they'll give you feedback about whether your, like, models are doing well or poorly. And so you get that. And then you also get like instantaneous kind of like, well, not instantaneous, but really quick feedback. Like if you predict a game, the game happens that night. Like you don't have to wait a year to see what happens. So I think sports is a great kind of like entryway for kind of like data science. [00:02:43]

Alessio: That was actually my first article on the Twilio blog, with a Python script to like predict pricing of like Daily Fantasy players based on my past week performance. Yeah, I don't know. It's a good gateway drug. [00:02:56]

Swyx: And on my end, the way I got into finance was through sports betting. So maybe we all have some ties in there. Was like Moneyball a big inspiration? The movie? [00:03:06]

Harrison: Honestly, not really. I don't really like baseball. That's like the big thing. [00:03:10]

Swyx: Let's call it a lot of stats. Cool. Well, we can dive right into LangChain, which is what everyone is excited about. But feel free to make all the sports analogies you want. That really drives home a lot of points. What was your GPT aha moment? When did you start working on GPT itself? Maybe not LangChain, just anything to do with the GPT API? [00:03:29]

Harrison: I think it probably started around the time we had a company hackathon. I think that was before I launched LangChain. I'm trying to remember the exact sequence of events, but I do remember that at the hackathon I worked with Will, who's now actually at LangChain as well, and then two other members of Robust. And we made basically a bot where you could ask questions of Notion and Slack. And so I think, yeah, RAG, basically. And I think I wanted to try that out because I'd heard that it was getting good. I'm trying to remember if I did anything before that to realize that it was good. So then I would focus on that on the hackathon. I can't remember or not, but that was one of the first times that I built something [00:04:06]

Swyx: with GPT-3. There wasn't that much opportunity before because the API access wasn't that widespread. You had to get into some kind of program to get that. [00:04:16]

Harrison: DaVinci-002 was not terrible, but they did an upgrade to get it there, and they didn't really publicize that as much. And so I think I remember playing around with it when the first DaVinci model came out. I was like, this is cool, but it's not amazing. You'd have to do a lot of work to get it to do something. But then I think in February or something of 2022, they upgraded it and it got better, but I think they made less of an announcement around it. And so I just, yeah, it kind of slipped under the radar for me, at least. [00:04:45]

Alessio: And what was the step into LangChain? So you did the hackathon, and then as you were building the kind of RAG product, you felt like the developer experience wasn't that great? Or what was the inspiration? [00:04:54]

Harrison: No, honestly, so around that time, I knew I was going to leave my previous job. I was trying to figure out what I was going to do next. I went to a bunch of meetups and other events. This was like the September, August, September of that year. So after Stable Diffusion, but before ChatGPT. So there was interest in generative AI as a space, but not a lot of people hacking on language models yet. But there were definitely some. And so I would go to these meetups and just chat with people and basically saw some common abstractions in terms of what they were building, and then thought it would be a cool side project to factor out some of those common abstractions. And that became kind of like LangChain. I looked up again before this, because I remember I did a tweet thread on Twitter to announce LangChain. And we can talk about what LangChain is. It's a series of components. And then there's some end-to-end modules. And there was three end-to-end modules that were in the initial release. One was NatBot. So this was the web agent by Nat Friedman. Another was LLM Math Chain. So it would construct- [00:05:51]

Swyx: GPT-3 cannot do math. [00:05:53]

Harrison: Yeah, exactly. And then the third was Self-Ask. So some type of RAG search, similar to React style agent. So those were some of the patterns in terms of what I was seeing. And those all came from open source or academic examples, because the people who were actually working on this were building startups. And they were doing things like question answering over your databases, question answering over SQL, things like that. But I couldn't use their code as kind of like inspiration to factor things out. [00:06:18]

Swyx: I talked to you a little bit, actually, roundabout, right after you announced LangChain. I'm honored. I think I'm one of many. This is your first open source project. [00:06:26]

Harrison: No, that's not actually true. I released, because I like sports stats. And so I remember I did release some really small, random Python package for scraping data from basketball reference or something. I'm pretty sure I released that. So first project to get a star on GitHub, let's say that. [00:06:45]

Swyx: Did you reference anything? What was the inspirations, like other frameworks that you look to when open sourcing LangChain or announcing it or anything like that? [00:06:53]

Harrison: I mean, the only main thing that I looked for... I remember reading a Hacker News post a little bit before about how a readme on the project goes a long way. [00:07:02]

Swyx: Readme's help. [00:07:03]

Harrison: Yeah. And so I looked at it and was like, put some status checks at the top and have the title and then one or two lines and then just right into installation. And so that's the main thing that I looked at in terms of how to structure it. Because yeah, I hadn't done open source before. I didn't really know how to communicate that aspect of the marketing or getting people to use it. I think I had some trouble finding it, but I finally found it and used that as a lot [00:07:25]

Swyx: of the inspiration there. Yeah. It was one of the subjects of my write-up how it was surprising to me that significant open source experience actually didn't seem to matter in the new wave of AI tooling. Like Auto-GPT's Toran, that was his first open source project ever. And that became Auto-GPT. Yeah. I don't know. To me, it's just interesting how open source experience is kind of fungible or not necessary. Or you can kind of learn it on the job. [00:07:49]

Alessio: Overvalued. [00:07:50]

Swyx: Overvalued. Okay. You said it, not me. [00:07:53]

Alessio: What's your description of LangChain today? I think when I built the LangChain Hub UI in January, there were a few things. And I think you were one of the first people to talk about agents that were already in there before it got hot now. And it's obviously evolved into a much bigger framework today. Run people through what LangChain is today, how they should think about it, and all of that. [00:08:14]

Harrison: The way that we describe it or think about it internally is that LangChain is basically... I started off saying LangChain's a framework for building LLM applications, but that's really vague and not really specific. And I think part of the issue is LangChain does do a lot, so it's hard to be somewhat specific. But I think the way that we think about it internally, in terms of prioritization, what to focus on, is basically LangChain's a framework for building context-aware reasoning applications. And so that's a bit of a mouthful, but I think that speaks to a lot of the core parts of what's in LangChain. And so what concretely that means in LangChain, there's really two things. One is a set of components and modules. And these would be the prompt template abstraction, the LLM abstraction, chat model abstraction, vector store abstraction, text splitters, document loaders. And so these are combinations of things that we build and we implement, or we just have integrations with. So we don't have any language models ourselves. We don't have any vector stores ourselves, but we integrate with a lot of them. And then the text splitters, we have our own logic for that. The document loaders, we have our own logic for that. And so those are the individual modules. But then I think another big part of LangChain, and probably the part that got people using it the most, is the end-to-end chains or applications. So we have a lot of chains for getting started with question answering over your documents, chat question answering, question answering over SQL databases, agent stuff that you can plug in off the box. And that basically combines these components in a series of specific ways to do this. So if you think about a question answering app, you need a lot of different components kind of stacked. And there's a bunch of different ways to do question answering apps. So this is a bit of an overgeneralization, but basically, you know, you have some component that looks up an embedding from a vector store, and then you put that into the prompt template with the question and the context, and maybe you have the chat history as well. And then that generates an answer, and then maybe you parse that out, or you do something with the answer there. And so there's just this sequence of things that you basically stack in a particular way. And so we just provide a bunch of those assembled chains off the shelf to make it really easy to get started in a few lines of code. [00:10:09]

Alessio: And just to give people context, when you first released LangChain, OpenAI did not have a chat API. It was a completion-only API. So you had to do all the human assistant, like prompting and whatnot. So you abstracted a lot of that away. I think the most interesting thing to me is you're kind of the Switzerland of this developer land. There's a bunch of vector databases that are killing each other out there to get people to embed data in them, and you're like, I love you all. You all are great. How do you think about being an opinionated framework versus leaving a lot of choice to the user? I mean, in terms of spending time into this integration, it's like you only have 10 people on the team. Obviously that takes time. Yeah. What's that process like for you all? [00:10:50]

Harrison: I think right off the bat, having different options for language models. I mean, language models is the main one that right off the bat we knew we wanted to support a bunch of different options for. There's a lot to discuss there. People want optionality between different language models. They want to try it out. They want to maybe change to ones that are cheaper as new ones kind of emerge. They don't want to get stuck into one particular one if a better one comes out. There's some challenges there as well. Prompts don't really transfer. And so there's a lot of nuance there. But from the bat, having this optionality between the language model providers was a big important part because I think that was just something we felt really strongly about. We believe there's not just going to be one model that rules them all. There's going to be a bunch of different models that are good for a bunch of different use cases. I did not anticipate the number of vector stores that would emerge. I don't know how many we supported in the initial release. It probably wasn't as big of a focus as language models was. But I think it kind of quickly became so, especially when Postgres and Elastic and Redis started building their vector store implementations. We saw that some people might not want to use a dedicated vector store. Maybe they want to use traditional databases. I think to your point around what we're opinionated about, I think the thing that we believe most strongly is it's super early in the space and super fast moving. And so there's a lot of uncertainty about how things will shake out in terms of what role will vector databases play? How many will there be? And so I think a lot of it has always kind of been this optionality and ability to switch and not getting locked in. [00:12:19]

Swyx: There's other pieces of LangChain which maybe don't get as much attention sometimes. And the way that you explained LangChain is somewhat different from the docs. I don't know how to square this. So for example, you have at the top level in your docs, you have, we mentioned ModelIO, we mentioned Retrieval, we mentioned Chains. Then you have a concept called Agents, which I don't know if exactly matches what other people call Agents. And we also talked about Memory. And then finally there's Callbacks. Are there any of the less understood concepts in LangChain that you want to give some air to? [00:12:53]

Harrison: I mean, I think buried in ModelIO is some stuff around like few-shot example selectors that I think is really powerful. That's a workhorse. [00:13:01]

Swyx: Yeah. I think that's where I start with LangChain. [00:13:04]

Harrison: It's one of those things that you probably don't, if you're building an application, you probably don't start with it. You probably start with like a zero-shot prompt. But I think that's a really powerful one that's probably just talked about less because you don't need it right off the bat. And for those of you who don't know, that basically selects from a bunch of examples the ones that are maybe most relevant to the input at hand. So you can do some nice kind of like in-context learning there. I think that's, we've had that for a while. I don't think enough people use that, basically. Output parsers also used to be kind of important, but then function calling. There's this interesting thing where like the space is just like progressing so rapidly that a lot of things that were really important have kind of diminished a bit, to be honest. Output parsers definitely used to be an understated and underappreciated part. And I think if you're working with non-OpenAI models, they still are, but a lot of people are working with OpenAI models. But even within there, there's different things you can do with kind of like the function calling ability. Sometimes you want to have the option of having the text or the application you're building, it could return either. Sometimes you know that it wants to return in a structured format, and so you just want to take that structured format. Other times you're extracting things that are maybe a key in that structured format, and so you want to like pluck that key. And so there's just like some like annoying kind of like parsing of that to do. Agents, memory, and retrieval, we haven't talked at all. Retrieval, there's like five different subcomponents. You could also probably talk about all of those in depth. You've got the document loaders, the text splitters, the embedding models, the vector stores. Embedding models and vector stores, we don't really have, or sorry, we don't build, we integrate with those. Text splitters, I think we have like 15 or so. Like I think there's an under kind of like appreciated amount of those. [00:14:39]

Swyx: And then... Well, it's actually, honestly, it's overwhelming. Nobody knows what to choose. [00:14:43]

Harrison: Yeah, there is a lot. [00:14:44]

Swyx: Yeah. Do you have personal favorites that you want to shout out? [00:14:47]

Harrison: The one that we have in the docs is the default is like the recursive text splitter. We added a playground for text splitters the other week because, yeah, we heard a lot that like, you know, and like these affect things like the chunk overlap and the chunks, they affect things in really subtle ways. And so like I think we added a playground where people could just like choose different options. We have like, and a lot of the ideas are really similar. You split on different characters, depending on kind of like the type of text that you have marked down, you might want to split on differently than HTML. And so we added a playground where you can kind of like choose between those. I don't know if those are like underappreciated though, because I think a lot of people talk about text splitting as being a hard part, and it is a really important part of creating these retrieval applications. But I think we have a lot of really cool retrieval algorithms as well. So like self query is maybe one of my favorite things in LangChain, which is basically this idea of when you have a user question, the typical kind of like thing to do is you embed that question and then find the document that's most similar to that question. But oftentimes questions have things that just, you don't really want to look up semantically, they have some other meaning. So like in the example that I use, the example in the docs is like movies about aliens in the year 1980. 1980, I guess there's some semantic meaning for that, but it's a very particular thing that you care about. And so what the self query retriever does is it splits out the metadata filter and most vector stores support like a metadata filter. So it splits out this metadata filter, and then it splits out the semantic bit. And that's actually like kind of tricky to do because there's a lot of different filters that you can have like greater than, less than, equal to, you can have and things if you have multiple filters. So we have like a pretty complicated like prompt that does all that. That might be one of my favorite things in LangChain, period. Like I think that's, yeah, I think that's really cool. [00:16:26]

Alessio: How do you think about speed of development versus support of existing things? So we mentioned retrieval, like you got, or, you know, text splitting, you got like different options for all of them. As you get building LangChain, how do you decide which ones are not going to keep supporting, you know, which ones are going to leave behind? I think right now, as you said, the space moves so quickly that like you don't even know who's using what. What's that like for you? [00:16:50]

Harrison: Yeah. I mean, we have, you know, we don't really have telemetry on what people are using in terms of what parts of LangChain, the telemetry we have is like, you know, anecdotal stuff when people ask or have issues with things. A lot of it also is like, I think we definitely prioritize kind of like keeping up with the stuff that comes out. I think we added function calling, like the day it came out or the day after it came out, we added chat model support, like the day after it came out or something like that. That's probably, I think I'm really proud of how the team has kind of like kept up with that because this space is like exhausting sometimes. And so that's probably, that's a big focus of ours. The support, I think we've like, to be honest, we've had to get kind of creative with how we do that. Cause we have like, I think, I don't know how many open issues we have, but we have like 3000, somewhere between 2000 and 3000, like open GitHub issues. We've experimented with a lot of startups that are doing kind of like question answering over your docs and stuff like that. And so we've got them on the website and in the discord and there's a really good one, dosu on the GitHub that's like answering issues and stuff like that. And that's actually something we want to start leaning into more heavily as a company as well as kind of like building out an AI dev rel because we're 10 people now, 10, 11 people now. And like two months ago we were like six or something like that. Right. So like, and to have like 2,500 open issues or something like that, and like 300 or 400 PRs as well. Cause like one of the amazing things is that like, and you kind of alluded to this earlier, everyone's building in the space. There's so many different like touch points. LangChain is lucky enough to kind of like be a lot of the glue that connects it. And so we get to work with a lot of awesome companies, but that's also a lot of like work to keep up with as well. And so I don't really have an amazing answer, but I think like the, I think prioritize kind of like new things that, that come out. And then we've gotten creative with some of kind of like the support functions and, and luckily there's, you know, there's a lot of awesome people working on all those support coding, question answering things that we've been able to work with. [00:18:46]

Swyx: I think there is your daily rhythm, which I've seen you, you work like a, like a beast man, like mad impressive. And then there's sometimes where you step back and do a little bit of high level, like 50,000 foot stuff. So we mentioned, we mentioned retrieval. You did a refactor in March and there's, there's other abstractions that you've sort of changed your mind on. When do you do that? When do you do like the, the step back from the day to day and go, where are we going and change the direction of the ship? [00:19:11]

Harrison: It's a good question so far. It's probably been, you know, we see three or four or five things pop up that are enough to make us think about it. And then kind of like when it reaches that level, you know, we don't have like a monthly meeting where we sit down and do like a monthly plan or something. [00:19:27]

Swyx: Maybe we should. I've thought about this. Yeah. I'd love to host that meeting. [00:19:32]

Harrison: It's really been a lot of, you know, one of the amazing things is we get to interact with so many different people. So it's been a lot of kind of like just pattern matching on what people are doing and trying to see those patterns before they punch us in the face or something like that. So for retrieval, it was the pattern of seeing like, Hey, yeah, like a lot of people are using vector sort of stuff. But there's also just like other methods and people are offering like hosted solutions and we want our abstractions to work with that as well. So we shouldn't bake in this paradigm of doing like semantic search too heavily, which sounds like basic now, but I think like, you know, to start a lot of it was people needed help doing these things. But then there was like managed things that did them, hybrid retrieval mechanisms, all of that. I think another example of this, I mean, Langsmith, which we can maybe talk about was like very kind of like, I think we worked on that for like three or four months before announcing it kind of like publicly, two months maybe before giving it to kind of like anyone in beta. But this was a lot of debugging these applications as a pain point. We hear that like just understanding what's going on is a pain point. [00:20:27]

Alessio: I mean, you two did a webinar on this, which is called Agents vs. Chains. It was fun, baby. [00:20:32]

Swyx: Thanks for having me on. [00:20:33]

Harrison: No, thanks for coming. [00:20:34]

Alessio: That was a good one. And on the website, you list like RAG, which is retrieval augmented generation, and agents as two of the main goals of LangChain. I think at the Databricks keynote, you said the difference is that chains are like predetermined steps and agents are models reasoning to figure out what steps to take and what actions to take. How should people think about when to use the two and how do you transition from one to the other with LangChain? Like is it a path that you support or like do people usually re-implement from an agent to a chain or vice versa? [00:21:05]

Swyx: Yeah. [00:21:06]

Harrison: You know, I know agent is probably an overloaded term at this point, and so there's probably a lot of different definitions out there. But yeah, as you said, kind of like the way that I think about an agent is basically like in a chain, you have a sequence of steps. You do this and then you do this and then you do this and then you do this. And with an agent, there's some aspect of it where the LLM is kind of like deciding what to do and what steps to do in what order. And you know, there's probably some like gray area in the middle, but you know, don't fight me on this. And so if we think about those, like the benefits of the chains are that they're like, you can say do this and you just have like a more rigid kind of like order and the way that things are done. They have more control and they don't go off the rails and basically everything that's bad about agents in terms of being uncontrollable and expensive, you can control more finely. The benefit of agents is that I think they handle like the long tail of things that can happen really well. And so for an example of this, let's maybe think about like interacting with a SQL database. So you can have like a SQL chain and you know, the first kind of like naive approach at a SQL chain would be like, okay, you have the user question. And then you like write the SQL query, you do some rag, you pull in the relevant tables and schemas, you write a SQL query, you execute that against the SQL database. And then you like return that as the answer, or you like summarize that with an LLM and return that to the answer. And that's basically the SQL chain that we have in LangChain. But there's a lot of things that can go wrong in that process. Starting from the beginning, you may like not want to even query the SQL database at all. Maybe they're saying like, hi, or something, or they're misusing the application. Then like what happens if you have some step, like a big part of the application that people with LangChain is like the context aware part. So there's generally some part of bringing in context to the language model. So if you bring in the wrong context to the language model, so it doesn't know which tables to query, what do you do then? If you write a SQL query, it's like syntactically wrong and it can't run. And then if it can run, like what if it returns an unexpected result or something? And so basically what we do with the SQL agent is we give it access to all these different tools. So it has another tool, it can run the SQL query as another, and then it can respond to the user. But then if it kind of like, it can decide which order to do these. And so it gives it flexibility to handle all these edge cases. And there's like, obviously downsides to that as well. And so there's probably like some safeguards you want to put in place around agents in terms of like not letting them run forever, having some observability in there. But I do think there's this benefit of, you know, like, again, to the other part of what LangChain is like the reasoning part, like each of those steps individually involves some aspect of reasoning, for sure. Like you need to reason about what the SQL query is, you need to reason about what to return. But there's then there's also reasoning about the order of operations. And so I think to me, the key is kind of like giving it an appropriate amount to reason about while still keeping it within checks. 
And so to the point, like, I would probably recommend that most people get started with chains and then when they get to the point where they're hitting these edge cases, then they think about, okay, I'm hitting a bunch of edge cases where the SQL query is just not returning like the relevant things. Maybe I should add in some step there and let it maybe make multiple queries or something like that. Basically, like start with chain, figure out when you're hitting these edge cases, add in the reasoning step to that to handle those edge cases appropriately. That would be kind of like my recommendation, right? [00:24:09]
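
To make the chain-versus-agent trade-off concrete, here is a sketch under obvious assumptions (the `llm`, `get_schema`, and `run_sql` helpers, and the tool dictionary, are hypothetical): the chain is a fixed pipeline that fails if any step fails, while the agent loops, letting the model pick the next tool, retry a broken query, or skip SQL entirely.

```python
import json


# Chain: a fixed sequence of steps. If the generated SQL is wrong, it just fails.
def sql_chain(question, llm, get_schema, run_sql):
    schema = get_schema()
    sql = llm(f"Schema:\n{schema}\n\nWrite a SQL query for: {question}")
    rows = run_sql(sql)
    return llm(f"Question: {question}\nRows: {rows}\nAnswer in plain English:")


# Agent: the model decides which tool to call next, so it can recover from a
# syntax error, fetch more schema, or answer directly ("hi" needs no SQL).
def sql_agent(question, llm, tools, max_steps=5):
    history = f"Question: {question}"
    for _ in range(max_steps):  # safeguard: never let it run forever
        decision = llm(
            f"{history}\nTools: {list(tools)}\n"
            'Reply as JSON: {"tool": ..., "input": ...} or {"answer": ...}'
        )
        step = json.loads(decision)
        if "answer" in step:
            return step["answer"]
        result = tools[step["tool"]](step["input"])
        history += f"\nUsed {step['tool']}: {result}"
    return "Stopped after max_steps without a final answer."
```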

Swyx: If I were to rephrase it, in my words, an agent would be a reasoning node in a chain, right? Like you start with a chain, then you just add a reasoning node, now it's an agent. [00:24:17]

Harrison: Yeah, the architecture for your application doesn't have to be just a chain or just an agent. It can be an agent that calls chains, it can be a chain that has an agent in different parts of them. And this is another part as well. Like the chains in LangChain are largely intended as kind of like a way to get started and take you some amount of the way. But for your specific use case, in order to kind of like eke out the most performance, you're probably going to want to do some customization at the very basic level, like probably around the prompt or something like that. And so one of the things that we've focused on recently is like making it easier to customize these bits of existing architectures. But you probably also want to customize your architectures as well. [00:24:52]

Swyx: You mentioned a bit of prompt engineering for self-ask and then for this stuff. There's a bunch of, I just talked to a prompt engineering company today, PromptOps or LLMOps. Do you have any advice or thoughts on that field in general? Like are you going to compete with them? Do you have internal tooling that you've built? [00:25:08]

Harrison: A lot of what we do is like where we see kind of like a lot of the pain points being like we can talk about LangSmith and that was a big motivation for that. And like, I don't know, would you categorize LangSmith as PromptOps? [00:25:18]

Swyx: I don't know. It's whatever you want it to be. Do you want to call it? [00:25:22]

Harrison: I don't know either. Like I think like there's... [00:25:24]

Swyx: I think about it as like a prompt registry and you store them and you A-B test them and you do that. LangSmith, I feel like doesn't quite go there yet. Yeah. It's obviously the next step. [00:25:34]

Harrison: Yeah, we'll probably go. And yeah, we'll do more of that because I think that's definitely part of the application of a chain or agent is you start with a default one, then you improve it over time. And like, I think a lot of the main new thing that we're dealing with here is like language models. And the main new way to control language models is prompts. And so like a lot of the chains and agents are powered by this combination of like prompt language model and then some output parser or something doing something with the output. And so like, yeah, we want to make that core thing as good as possible. And so we'll do stuff all around that for sure. [00:26:05]

Swyx: Awesome. We might as well go into LangSmith because we're bringing it up so much. So you announced LangSmith I think last month. What are your visions for it? Is this the future of LangChain and the company? [00:26:16]

Harrison: It's definitely part of the future. So LangSmith is basically a control center for kind of like your LLM application. So the main features that it kind of has is like debugging, logging, monitoring, and then like testing and evaluation. And so debugging, logging, monitoring, basically you set three environment variables and it kind of like logs all the runs that are happening in your LangChain chains or agents. And it logs kind of like the inputs and outputs at each step. And so the main use case we see for this is in debugging. And that's probably the main reason that we started down this path of building it is I think like as you have these more complex things, debugging what's actually going on becomes really painful whether you're using LangChain or not. And so like adding this type of observability and debuggability was really important. Yeah. There's a debugging aspect. You can see the inputs, outputs at each step. You can then quickly enter into like a playground experience where you can fiddle around with it. The first version didn't have that playground and then we'd see people copy, go to open AI playground, paste in there. Okay. Well, that's a little annoying. And then there's kind of like the monitoring, logging experience. And we recently added some analytics on like, you know, how many requests are you getting per hour, minute, day? What's the feedback like over time? And then there's like a testing debugging, sorry, testing and evaluation component as well where basically you can create datasets and then test and evaluate these datasets. And I think importantly, all these things are tied to each other and then also into LangChain, the framework. So what I mean by that is like we've tried to make it as easy as possible to go from logs to adding a data point to a dataset. And because we think a really powerful flow is you don't really get started with a dataset. You can accumulate a dataset over time. And so being able to find points that have gotten like a thumbs up or a thumbs down from a user can be really powerful in terms of creating a good dataset. And so that's maybe like a connection between the two. And then the connection in the other way is like all the runs that you have when you test or evaluate something, they're logged in the same way. So you can debug what exactly is going on and you don't just have like a final score. You have like this nice trace and thing where you can jump in. And then we also want to do more things to hook this into a LangChain proper, the framework. So I think like some of like the managing the prompts will tie in here already. Like we talked about example selectors using datasets as a few short examples is a path that we support in a somewhat janky way right now, but we're going to like make better over time. And so there's this connection between everything. Yeah. [00:28:42]
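
For reference, the "three environment variables" setup looks roughly like this. The variable names below are our best recollection of the LangSmith beta configuration, so treat them as an assumption and check the current docs.

```python
# Rough sketch of enabling LangSmith tracing before running any chains.
# Variable names are our recollection of the beta setup; verify against docs.
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "my-app"  # optional: group runs by project

# From here, LangChain chains and agents should log the inputs and outputs of
# each step to LangSmith without further code changes.
```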

Alessio: And you mentioned the dataset in the announcement blog post, you touched on heuristic evaluation versus LLMs evaluating LLMs. I think there's a lot of talk and confusion about this online. How should people prioritize the two, especially when they might start with like not a good set of evals or like any data at all? [00:29:01]

Harrison: I think it's really use case specific. The distinction that I draw between heuristic and LLM is: with LLMs, you're using an LLM to evaluate the output; with heuristics, you have some common heuristic that you can use. And so some of these can be like really simple. So we were doing some kind of like measuring of an extraction chain where we wanted it to output JSON. Okay. One evaluation can be, can you use JSON.loads to load it? And like, right. And that works perfectly. You don't need an LLM to do that. But then for like a lot of like the question answering, like, is this factually accurate? And you have some ground truth fact that you know it should be answering with. I think, you know, LLMs aren't perfect. And I think there's a lot of discussion around the pitfalls of using LLMs to evaluate themselves. And I'm not saying they're perfect by any means, but I do think they're, we've found them to be kind of like better than BLEU or any of those metrics. And the way that I also like to use those is also just like to guide my eye about where to look. So like, you know, I might not trust the score of like 0.82, like exactly correct, but like I can look to see like which data points are like flagged as passing or failing. And sometimes the evaluator's messing up, but it's like good to like, you know, I don't have to look at like a hundred data points. I can focus on like 10 or something like that. [00:30:10]
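
The JSON check really is that simple. A small sketch of both evaluator styles, with `llm` as a hypothetical judge callable:

```python
import json


def json_validity_eval(output: str) -> bool:
    """Heuristic evaluator: does the chain's output parse as JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def factuality_eval(question, answer, ground_truth, llm) -> bool:
    """LLM-as-judge evaluator: imperfect, but useful for triaging which
    data points deserve a closer manual look."""
    verdict = llm(
        f"Question: {question}\nReference answer: {ground_truth}\n"
        f"Candidate answer: {answer}\n"
        "Is the candidate factually consistent with the reference? Reply YES or NO."
    )
    return verdict.strip().upper().startswith("YES")
```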

Alessio: And then can you create a heuristic once in Langsmith? Like what's like your connection to that? [00:30:16]

Harrison: Yeah. So right now, all the evaluation, we actually do client side. And part of this is basically due to the fact that a lot of the evaluation is really application specific. So we thought about having evaluators, you could just click off and run in a server side or something like that. But we still think it's really early on in evaluation. We still think there's, it's just really application specific. So we prioritized instead, making it easy for people to write custom evaluators and then run them client side and then upload the results so that they can manually inspect them because I think manual inspection is still a pretty big part of evaluation for better or worse. [00:30:50]

Swyx: We have this sort of components of observability. We have cost, latency, accuracy, and then planning. Is that listed in there? [00:30:57]

Alessio: Well, planning more in the terms of like, if you're an agent, how to pick the right tool and whether or not you are picking the right tool. [00:31:02]

Swyx: So when you talk to customers, how would you stack rank those needs? Are they cost sensitive? Are they latency sensitive? I imagine accuracy is pretty high up there. [00:31:13]

Harrison: I think accuracy is definitely the top that we're seeing right now. I think a lot of the applications, people are, especially the ones that we're working with, people are still struggling to get them to work at a level where they're reliable [00:31:24]

Swyx: enough. [00:31:25]

Harrison: So that's definitely the first. Then I think probably cost becomes the next one. I think a few places where we've started to see this be like one of the main things is the AI simulation that came out. [00:31:36]

Swyx: Generative agents. Yeah, exactly. [00:31:38]

Harrison: Which is really fun to run, but it costs a lot of money. And so one of our team members, Lance, did an awesome job hooking up like a local model to it. You know, it's not as perfect, but I think it helps with that. Another really big place for this, we believe, is in like extraction of structured data from unstructured data. And the reason that I think it's so important there is that usually you do extraction of some type of like pre-processing or indexing process over your documents. I mean, there's a bunch of different use cases, but one use case is for that. And generally that's over a lot of documents. And so that starts to rack up a bill kind of quickly. And I think extraction is also like a simpler task than like reasoning about which tools to call next in an agent. And so I think it's better suited for that. Yeah. [00:32:15]

Swyx: On one of the heuristics I wanted to get your thoughts on, hallucination is one of the big problems there. Do you have any recommendations on how people should reduce hallucinations? [00:32:25]

Harrison: To reduce hallucinations, we did a webinar on like evaluating RAG this past week. And I think there's this great project called RAGAS that evaluates four different things across two different spectrums. So the two different spectrums are like, is the retrieval part right? Or is the generation, or sorry, like, is it messing up in retrieval or is it messing up in generation? And so I think to fix hallucination, it probably depends on where it's messing up. If it's messing up in generation, then you're getting the right information, but it's still hallucinating. Or you're getting like partially right information and hallucinating some bits, a lot of that's prompt engineering. And so that's what we would recommend kind of like focusing on the prompt engineering part. And then if you're getting it wrong in the, if you're just not retrieving the right stuff, then there's a lot of different things that you can probably do, or you should look at on the retrieval bit. And honestly, that's where it starts to become a bit like application specific as well. Maybe there's some temporal stuff going on. Maybe you're not parsing things correctly. Yeah. [00:33:19]
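
As a toy illustration of that retrieval-versus-generation split (a simplified gloss, not the RAGAS implementation; `llm` is a hypothetical judge):

```python
def diagnose_rag_failure(question, retrieved_docs, answer, llm):
    """Rough triage: did the pipeline fail in retrieval or in generation?"""
    context = "\n\n".join(retrieved_docs)
    relevant = llm(
        f"Question: {question}\nContext: {context}\n"
        "Does the context contain the information needed to answer? YES or NO."
    ).strip().upper().startswith("YES")
    grounded = llm(
        f"Context: {context}\nAnswer: {answer}\n"
        "Is every claim in the answer supported by the context? YES or NO."
    ).strip().upper().startswith("YES")
    if not relevant:
        return "retrieval problem: look at chunking, query rewriting, filters"
    if not grounded:
        return "generation problem: focus on prompt engineering and grounding"
    return "looks fine on this sample"
```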

Swyx: Okay. Yeah. [00:33:35]

Harrison: Yeah, I mean, there's probably a larger discussion around that, but OpenAI definitely had a huge headstart, right? And that's... Claude's not even publicly available yet, I don't think. [00:34:28]

Swyx: The API? Yeah. Oh, well, you can just basically ask any of the business reps and they'll give it to you. [00:34:33]

Harrison: You can. But it's still a different signup process. I think there's... I'm bullish that other ones will catch up especially like Anthropic and Google. The local ones are really interesting. I think we're seeing a big... [00:34:46]

Swyx: Llama 2? Yeah, we're doing the fine-tuning hackathon tomorrow. Thanks for promoting that. [00:34:50]

Harrison: No, thanks for it. I'm really excited about that stuff. I mean, that's something that like we've been, you know, because like, as I said, like the only thing we know is that the space is moving so fast and changing so rapidly. And like, local models are, have always been one of those things that people have been bullish on. And it seems like it's getting closer and closer to kind of like being viable. So I'm excited to see what we can do with some fine-tuning. [00:35:10]

Swyx: Yeah. I have to confess, I did not know that you cared. It's not like a judgment on Langchain. I was just like, you know, you write an adapter for it and you're done, right? Like how much further does it go for Langchain? In terms of like, for you, it's one of the, you know, the model IO modules and that's it. But like, you seem very personally, very passionate about it, but I don't know what the Langchain specific angle for this is, for fine-tuning local models, basically. Like you're just passionate about local models and privacy and all that, right? And open source. [00:35:41]

Harrison: Well, I think there's a few different things. Like one, like, you know, if we think about what it takes to build a really reliable, like context-aware reasoning application, there's probably a bunch of different nodes that are doing a bunch of different things. And I think it is like a really complex system. And so if you're relying on open AI for every part of that, like, I think that starts to get really expensive. Also like, probably just like not good to have that much reliability on any one thing. And so I do think that like, I'm hoping that for like, you know, specific parts at the end, you can like fine-tune a model and kind of have a more specific thing for a specific task. Also, to be clear, like, I think like, I also, at the same time, I think open AI is by far the easiest way to get started. And if I was building anything, I would absolutely start with open AI. So. [00:36:27]

Swyx: It's something I think a lot of people are wrestling with. But like, as a person building apps, why take five vendors when I can take one vendor, right? Like, as long as I trust Azure, I'm just entrusting all my data to Azure and that's it. So I'm still trying to figure out the real case for local models in production. And I don't know, but fine-tuning, I think, is a good one. That's why I guess open AI worked on fine-tuning. [00:36:49]

Harrison: I think there's also like, you know, like if there is, if there's just more options available, like prices are going to go down. So I'm happy about that. So like very selfishly, there's that aspect as well. [00:37:01]

Alessio: And in the LangSmith announcement, I saw in the product screenshot, you have like chain, tool and LLM as like the three core atoms. Is that how people should think about observability in this space? Like first you go through the chain and then you start to dig down between like the model itself and like the tool it's using? [00:37:19]

Harrison: We've added more. We've added like a retriever logging so that you can see like what query is going in and what are the documents you're getting out. Those are like the three that we started with. I definitely think probably the main ones, like basically the LLM. So the reason I think the debugging in LangSmith and debugging in general is so needed for these LLM apps is that if you're building, like, again, let's think about like what we want people to build with LangChain. These like context aware reasoning applications. Context aware. There's a lot of stuff in the prompt. There's like the instructions. There's any previous messages. There's any input this time. There's any documents you retrieve. And so there's a lot of like data engineering that goes into like putting it into that prompt. This sounds silly, but just like making sure the data shows up in the right format is like really important. And then for the reasoning part of it, like that's obviously also all in the prompt. And so being able to like, and there's like, you know, the state of the world right now, like if you have the instructions at the beginning or at the end can actually make like a big difference in terms of whether it forgets it or not. And so being able to kind of like. [00:38:17]

Swyx: Yeah. And it takes on that one, by the way, this is the U curve in context, right? Yeah. [00:38:21]

Harrison: I think it's real. Basically I've found long context windows really good for when I want to extract like a single piece of information about something basically. But if I want to do reasoning over perhaps multiple pieces of information that are somewhere in like the retrieved documents, I found it not to be that great. [00:38:36]

Swyx: Yeah. I have said that that piece of research is the best bull case for Lang chain and all the vector companies, because it means you should do chains. It means you should do retrieval instead of long context, right? People are trying to extend long context to like 100K, 1 million tokens, 5 million tokens. It doesn't matter. You're going to forget. You can't trust it. [00:38:54]

Harrison: I expect that it will probably get better over time as everything in this field. But I do also think there'll always be a need for kind of like vector stores and retrieval in some fashions. [00:39:03]

Alessio: How should people get started with Langsmith Cookbooks? Wanna talk maybe a bit about that? [00:39:08]

Swyx: Yeah. [00:39:08]

Harrison: Again, like I think the main thing that even I find valuable about Langsmith is just like the debugging aspect of it. And so for that, it's very simple. You can kind of like turn on three environment variables and it just logs everything. And you don't look at it 95% of the time, but that 5% you do when something goes wrong, it's quite handy to have there. And so that's probably the easiest way to get started. And we're still in a closed beta, but we're letting people off the wait list every day. And if you really need access, just DM me and we're happy to give you access there. And then yeah, there's a lot that you can do with Langsmith that we've been talking about. And so Will on our team has been leading the charge on a really great like Langsmith Cookbooks repo that covers everything from collecting feedback, whether it's thumbs up, thumbs down, or like multi-scale or comments as well, to doing evaluation, doing testing. You can also use Langsmith without Langchain. And so we've got some notebooks on that in there. But we have Python and JavaScript SDKs that aren't dependent on Langchain in any way. [00:40:01]

Swyx: And so you can use those. [00:40:01]

Harrison: And then we'll also be publishing a notebook on how to do that just with the REST APIs themselves. So yeah, definitely check out that repo. That's a great resource that Will's put together. [00:40:10]

Swyx: Yeah, awesome. So we'll zoom out a little bit from LangSmith and talk about LangChain, the company. You're also a first-time founder. Yes. And you've just hired your 10th employee, Julia, who I know from my data engineering days. You mentioned Will, and Nuno, I think, who maintains LangChain.js. I'm very interested in like your multi-language strategy, by the way. Ankush, your co-founder, Lance, who did AutoEval. What are you staffing up for? And maybe who are you hiring? [00:40:34]

Harrison: Yeah, so 10 employees, 12 total. We've got three more joining over the next three weeks. We've got Julia, who's awesome leading a lot of the product, go-to-market, customer success stuff. And then we've got Bri, who's also awesome leading a lot of the marketing and ops aspects. And then other than that, all engineers. We've staffed up a lot on kind of like full stack infra DevOps, kind of like as we've started going into the hosted platform. So internally, we're split about 50-50 between the open source and then the platform stuff. And yeah, we're looking to hire particularly on kind of like the things, we're actually looking to hire across most fronts, to be honest. But in particular, we probably need one or two more people on like open source, both Python and JavaScript and happy to dive into the multi-language kind of like strategy there. But again, like strong focus there on engineering, actually, as opposed to maybe like, we're not a research lab, we're not a research shop. [00:41:48]

Swyx: And then on the platform side, [00:41:49]

Harrison: like we definitely need some more people on the infra and DevOps side. So I'm using this as an opportunity to tell people that we're hiring and that you should reach out if that sounds like you. [00:41:58]

Swyx: Something like that, jobs, whatever. I don't actually know if we have an official job. [00:42:02]

Harrison: RIP, what happened to your landing page? [00:42:04]

Swyx: It used to be so based. The Berkshire Hathaway one? Yeah, so what was the story, the quick story behind that? Yeah, the quick story behind that is we needed a website [00:42:12]

Harrison: and I'm terrible at design. [00:42:14]

Swyx: And I knew that we couldn't do a good job. [00:42:15]

Harrison: So if you can't do a good job, might as well do the worst job possible. Yeah, and like lean into it. And have some fun with it, yeah. [00:42:21]

Swyx: Do you admire Warren Buffett? Yeah, I admire Warren Buffett and admire his website. And actually you can still find a link to it [00:42:26]

Harrison: from our current website if you look hard enough. So there's a little Easter egg. Before we dive into more of the open source community things, [00:42:33]

Alessio: let's dive into the language thing. How do you think about parity between the Python and JavaScript? Obviously, they're very different ecosystems. So when you're working on LangChain, is it that we need to have the same abstraction in both languages, or do you adapt to the needs of each? The core stuff, we want to have the same abstractions [00:42:50]

Harrison: because we basically want to be able to serialize prompts, chains, agents, all the core stuff as tightly as possible and then use that between languages. Like even, yeah, like even right now when we log things to LangSmith, we have a playground experience where you can run things, and that runs in JavaScript because it's kind of like in the browser. But a lot of what's logged is like Python. And so we need that core equivalence for a lot of the core things. Then there's like the incredibly long tail of like integrations, more researchy things. So we want to be able to do that. Python's probably ahead on a lot of like the integrations front. There's more researchy things that we're able to include quickly because a lot of people release some of their code in Python and stuff like that. And so we can use that. And there's just more of an ecosystem around the Python project. But the core stuff will have kind of like the same abstractions and be translatable. That didn't go exactly where I was thinking. So like the LangChain of Ruby, the LangChain of C-sharp, [00:43:44]

Swyx: you know, there's demand for that. I mean, I think that's a big part of it. But you are giving up some real estate by not doing it. Yeah, it comes down to kind of like, you know, ROI and focus. And I think like we do think [00:43:58]

Harrison: there's a strong JavaScript community and we wanted to lean into that. And I think a lot of the people that we brought on early, like Nuno and Jacob, have a lot of experience building JavaScript tooling in that community. And so I think that's a big part of it. Will we do another language? Never say never, but like... [00:44:21]

Swyx: Python JS for now. Yeah. Awesome. [00:44:23]

Alessio: You got 83 articles, which I think might be a record for such a young company. What are like the hottest hits, the most popular ones? [00:44:32]

Harrison: I think the most popular ones are generally the ones where we do a deep dive on something. So we did something a few weeks ago around evaluating CSV question answering applications, which I think is a really interesting one because most question answering, like everyone does question answering, but it's generally over unstructured data over your documents and you do the whole RAG thing. And that doesn't work amazing for structured data. And so this was something that we heard, the origin of this was basically we heard from the community, you guys should improve this. And so we're like, okay, let's improve it. And then we're like, okay, well, in order to see if we improve it, we need to like evaluate it and see how we're doing. And so we kind of like wrote up a lot of our thought process there. And I think a lot of people like reached out about that and thought that was interesting and were going through similar challenges, and a few days after that we posted another one that someone wrote basically as a response, which is awesome because it had a completely different strategy. That was a really good piece as well. So that was like a deep dive on the evaluation bit. I think like we did one on retrieval a while back, which was basically like, hey, we, and this was around when we changed our abstractions, like, hey, we changed our abstractions to this. This is why we did it. This is what we see coming down the pipeline. These are like the different types of retrieval that we see. I think a lot of people read and liked that one. A lot of the blogs that we do are also highlighting cool partnerships or cool applications. But in terms of, if you go by like number of views, I think the ones that get the most views are the more like deep dive ones. [00:45:55]

Swyx: Yeah. And I also noticed that you do guest posts as well. [00:45:58]

Harrison: Actually, you know, which one, and this is a guest post that got a lot of views, the multi-on one, the multi-on agent one. When we did, we did a blog where we integrated with them and that got a ton of views. [00:46:06]

Swyx: What do you think that is? [00:46:07]

Harrison: I think it's, I mean, it's one of like the few agents that's actually available and like out in the world. [00:46:15]

Swyx: They're still behind a wait list. Still behind a wait list, [00:46:17]

Harrison: but they're very active on social media. I don't know if I'm off the wait list. [00:46:21]

Swyx: I mean, you're on their blogs. They're on your blog, so I hope they give you access at some point. But that's interesting. A lot of interest in agents. I think they just opened up an API as well. Yeah, exactly. [00:46:32]

Harrison: That was the blog that we did. I was, yeah, I was a bit surprised to see that as well, but I think there's generally a lot of interest in agents and it's also really hard to get them to work. And I think multi-on is one of the first that has that. [00:46:45]

Swyx: Yeah. So my angle to this is a lot of people want to work with you. Yes. You're bombarded. I'm sure your email is just unmanageable. How should people be good partners with you? Like I work at a company and I'm like, hey, I'd love to do something on the LangChain blog or integrate to LangChain. I know Harrison's a busy guy. Like, what do I do? [00:47:03]

Harrison: Like the stuff that gets my attention honestly is like the in-depth, really thought out stuff. Obviously I love this stuff. Like this stuff is awesome. And there's so many different, there's so much to do as well. And like the biggest thing that we have trouble with internally is like figuring out what to do. [00:47:17]

Swyx: What's noise and what's signal. [00:47:19]

Harrison: Not even that, but just like what to focus on. Like there's so many different directions we could do and we want to go in like so many because there's so many interesting things, but we can't do. So if anyone kind of like takes the time to like go deep in a particular area, I love talking to them and I love reading what they write. And I love sharing what they write on the blog. Like that to me is awesome. So I think like... [00:47:37]

Swyx: Do good stuff. Be so good they can't ignore you. It sounds basic, right? [00:47:40]

Harrison: So that's why I didn't want to say it. [00:47:42]

Swyx: No, it's great. [00:47:42]

Harrison: But I think like these deep dives, yeah, there's just so much to do, and don't do shallow stuff, I guess, would be my advice. [00:47:48]

Swyx: I think that's a good call that people need reminding. [00:47:50]

Alessio: What about the other side of open source? So on Hacker News, there were a couple blog posts recently, like the problem with LangChain and LangChain is pointless, all these different things. So the TLDR of some of them was, the LangChain API is like kind of verbose and complicated versus like sometimes I can just do this in like 10 lines of code. How do you balance that in terms of allowing for the complex use cases versus making maybe the ergonomics like simpler, but then trading that off later? [00:48:21]

Harrison: There's a lot to balance and there's a lot to do. And I think like posts like that are very valuable to hear basically what people are saying. And like, we have a lot of open issues. So it's not like these things hadn't been said before, but I think like that was a good emphasis on what people are saying. And I think there was a lot of things in there. I think part of it's kind of like around and we took all of it very seriously. And yeah, I think there's a lot to dive into there. There's like the documentation piece. And so I think we did a revamp of the documentation to address that. There's also like a comment in this, I think this was around, I think the top comment on the LangChain is pointless one was like basically like orchestration is like 5% of the work. And then like the other 95% is like prompt engineering and like data engineering. And those are the hard bits. I think maybe orchestration is a little bit more than 5%, but I like agree that those are like really big pain points that get exacerbated when you have these complex chains and agents where you can't really see what's going on inside of them. And I think that's partially why we built Langsmith to help out with exactly that. We also needed to do better things like make the prompts more visible and make it allow for more customizability around that. And so we've tried to add some stuff there. In terms of balancing, there's also LangChain is pointless. I don't need a wrapper. I can just call the underlying API. I think if all you're trying to do is call the underlying API, then like, yeah, that's gonna be the cleanest and simplest thing to do. And we try to get as close to that experience as possible, but we're not optimizing for calling the API. We're optimizing for helping people build context-aware reasoning applications as easily as possible. And so there's some level of abstractions that you need to add in order to assist in that. Yeah, that's definitely a balance that's tricky to strike, but I think there's also some aspect of it. Like, I do think one of the big benefits that LangChain provides is a standard interface for language models so that you can switch between them. And this kind of gets into like an ORM debate, like are ORMs generally kind of like useful or not? And so I think in this case they are. I think there's probably a larger kind of like philosophical kind of like question about that [00:50:25]

Swyx: that people have strong opinions on. Just the prompts don't transfer like you also mentioned. Yeah, yeah, there's that, yeah. [00:50:32]

Harrison: And then between kind of like allowing for, I think one helpful thing that we did in terms of like distinguishing between basically the base interfaces and then more complex stuff is part of the separation around the docs is there's like the components piece, which has the model IO, the retrieval, the agents, the callbacks, things like that. And then there's all the use cases. And so I think like the use cases, because they are like these assembly of all these things in a particular order, they start to get more complex. And it's, you know, we try our best to kind of like make clear how you can configure things. But yeah, there's a lot of different options that you might want to configure. And so I think that split has kind of helped us internally at least. And I think externally as well, because we've heard good comments about the improved documentation. I think that's made it a little bit more clear. And then another thing, one of the things that we also released soon after, and we'd been thinking about a little bit is basically like a LangChain expression language, which allows for actual composability of pieces. So LangChain, I think, has always been very good about interchangeability. Let's ignore the prompting issues, but like you could always plug in like one LLM for another one. You could swap in one vector for another one, but the chains themselves haven't actually been super actually composable. Like we had the sequential chain, but that was a bit like clunky to use. And then we had a router chain, but that was a bit, you know, that was also a bit clunky to use. And so one of the things, and so there's a million different things to do, and we didn't prioritize that. [00:51:53]

Swyx: I think after this, [00:51:53]

Harrison: we definitely bumped it up in priority. And luckily Nuno had been doing a lot of awesome work on it already, so it wasn't too much of a lift. But yeah, now there's this way where a lot of the chains that we've been releasing are written in this LangChain expression language where they're actually truly composable, and you can see what's going on under the hood. And it's basically, it uses kind of like the pipe kind of like terminology to coordinate things and move things around. So yeah, I mean, I think there were a lot of good points in those Hacker News things, and you know, we can't respond to everything, but we try to like look at everything and take everything seriously. [00:52:25]
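
The expression language is the pipe-style composition Harrison mentions. A minimal example in the spirit of the docs at the time; import paths and package layout have shifted since recording, so treat these as approximate.

```python
# Minimal LangChain Expression Language (LCEL) sketch: prompt | model | parser.
# Import paths reflect the langchain 0.0.x layout around the time of recording.
from langchain.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.schema.output_parser import StrOutputParser

prompt = ChatPromptTemplate.from_template("Tell me a short joke about {topic}")
model = ChatOpenAI()
chain = prompt | model | StrOutputParser()  # truly composable pieces

print(chain.invoke({"topic": "vector stores"}))
```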

Swyx: You're being very diplomatic. But so first of all, I like the expression language. I think that that is the path towards sort of language agnostic LangChain kind of, or whatever, DSL. But also like, what was just kind of plain wrong or plain offensive, or like, I don't know, people can get very vitriolic sometimes on Hacker News. [00:52:40]

Harrison: Yeah, I mean, I think the comments that I appreciated were the ones where they gave specific things. And I think the ones where they said, you know, LangChain sucks. Like, okay. Can't do much of that. [00:52:51]

Swyx: Yeah, exactly. Rephrasing my question a bit: you're not the first and you won't be the last to have that kind of very intense scrutiny. What would be your advice to other people, other maintainers of projects for going through something like this? [00:53:03]

Harrison: I would probably say, try to drill into like what is actually underlying things [00:53:08]

Swyx: as much as possible. [00:53:08]

Harrison: And if there is actual substance that's being delivered, whether you agree with it or not, like, I think that's valuable to know. And then for the other stuff, like try to maybe follow up, but maybe try not to let it get under your skin too much. [00:53:22]

Swyx: Thanks for tackling that. [00:53:24]

Alessio: And I know we're getting to the time and we'll wrap up soon, but since you're going to speak at the AI Engineers Conference, what's your advice to AI engineers, especially when to start with LangChain and when they're just experimenting with a model, [00:53:38]

Swyx: when are they, [00:53:38]

Alessio: as you mentioned, if you just want to do an API call, don't use LangChain. Yeah. [00:53:43]

Harrison: I mean, my advice would just like build as many things as possible. Like, I think it's still really early in the space. No one really knows what they're doing to some extent. Like, it's a bit weird to say, but there's so many things to like discover. So I would just say like, build as many things as possible. Cause I think like the best thing is you stumble upon a really good idea and you build something really awesome. And the worst thing that happens is you just learn a lot about a field and the technology that's going to be incredibly important and rapidly kind of like changing. [00:54:11]

Alessio: What would you build if you weren't doing LangChain? [00:54:13]

Harrison: I mean, the things that are most interesting to me are kind of like things around like long-term memory and like longer running agents. So I'd probably build, and these are things that we've been wanting to build [00:54:23]

Swyx: internally as well. [00:54:23]

Harrison: But like, I think a chatbot that like actually remembers things about you, as silly as that sounds. Like people like chatbots a lot, but they're limited by their context window. And so I think really diving into like a specific application of memory there. [00:54:38]

Swyx: I've been trying to build a chatbot [00:54:39]

Harrison: that remembers things about you. That would be one. And then like, I know a lot of people are doing this, but like a personal assistant for like managing like email calendar, basic stuff, which I think is, I think that's like a fantastic application for these like agent like things, because if you think about personal assistants today, you usually interact, I don't have one, but I'm told you interact with them over email. And the nice thing about that, as opposed to like chat, there's not as stringent an expectation on latency as there is on chat. And so you can do a lot of things like reflection and kind of like making sure that you're on the right track and really put more safeguards and thinking into these agents, as opposed to relying on like chat as an interface. Like the bot we have that's on GitHub answering questions on the issues, I think probably gives better answers than the bots that we have that are on chat on the website. And I think that's not because, there's just different constraints that you have in different types of problems. And I think I would be like, I think the personal assistant one's really interesting because you remove the constraint of chat, which I think at this point in time is probably pretty limited in terms of functionality. [00:55:43]

Swyx: Yeah. I've been calling this sort of long inference. If you didn't have to care about latency and you could take like a day, a month, a year to work on something, what could you do? And yeah, that's super interesting. [00:55:56]

Harrison: I think that's a really promising place to explore. [00:55:58]

Swyx: Yeah. Have you looked at, regarding the long conversation thing, you and I have talked about this many times. Have you looked into what Character and Inflection are doing? Because they're probably working on it. [00:56:08]

Harrison: I've thought about memory a bunch. Like I think it comes down to like, it comes down to like state, like what's the state you're tracking? Like what's the data structure for that? And I think that could also maybe be a bit like application specific. But if we're talking about a generic chat bot, that's kind of generic. I don't know. Yeah, I don't know how they're thinking about that. My sense is that inflection like thinks about that a bit more than character. Like I think in Inception, sorry, inflection's whole thing is they like, the bot knows you. [00:56:33]

Swyx: It's one chat. There's no history. You just talk to it. Yeah. [00:56:37]

Harrison: So they've definitely got some state that they're tracking. I'd be really curious to know what that is. Character, I don't think, has leaned into it too much. I think they let you do some stuff in terms of like uploading background. And I'm not entirely sure how they use that, whether they just like put that in the prompt or do some retrieval over that. But I think they're definitely, they haven't leaned into it as much as Inflection, I would say. [00:56:57]

Swyx: So given like, you are one of the most interested people in this space, would this be like a second product for you? If you ever want to explore that or do you want to just partner with people and you're putting out the call for people to come to you if they have solutions for that? [00:57:10]

Harrison: If I wasn't working on LangChain, I would be building an application company, for sure, first of all. Like, I don't think, like I think like there's, which I know is very hypocritical to say. [00:57:20]

Swyx: Like you're Mr. DevTools and Infra and Observability. [00:57:24]

Harrison: Yeah, I don't know. If you're building an application company that's working on something related to long-term memory or long-term agents, I would love to chat and just geek out [00:57:31]

Swyx: about a lot of this stuff. I'll show you Smalltalk at some point. Yes. Cool. Awesome. [00:57:37]

Alessio: Yeah, let's do a lightning round. [00:57:38]

Swyx: So the first one is on acceleration. What has happened in AI that you thought would take much longer than it actually ended up taking? [00:57:45]

Harrison: The function calling ability from OpenAI, like tool usage. [00:57:48]

Swyx: Yeah. [00:57:48]

Harrison: They did that really fast, I thought. [00:57:50]

Swyx: Yeah. But it's just a question of fine-tuning, no? Yeah. It's not even like reliable. [00:57:54]

Harrison: It's not terrible. They're a pretty big organization that's serving a lot of traffic. And like, this was a, yeah, it's like, it is like just fine-tuning, but I think like you still have to like collect that data set and fine-tune it and evaluate it and then release it at scale and figure out the right API. [00:58:09]

Swyx: No shade on OpenAI. Like they're moving everyone's bar as to how quickly like a 400-person organization can go. Do you think it eliminates approaches like JSONformer and all the other approaches that people have, like Guardrails, you know, previous guest? Does it eliminate your output validation thing? Yeah. [00:58:26]

Harrison: I think JSONformer and stuff like that are still really interesting for like local models, for sure. And like 90% of people use OpenAI or something, that's my made-up number. [00:58:37]

Swyx: No, it's probably real. [00:58:38]

Harrison: And the best way to get structured output is by using the function calling ability. So yeah, absolutely. [00:58:46]
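
For reference, the function-calling route to structured output looked roughly like this with the OpenAI Python client of that era (the 0.x `openai.ChatCompletion` interface; the newer 1.x client differs):

```python
# Sketch: extract structured data via OpenAI function calling (0.x client).
import json

import openai

functions = [{
    "name": "record_person",
    "description": "Record details about a person mentioned in the text",
    "parameters": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "age": {"type": "integer"},
        },
        "required": ["name"],
    },
}]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": "Alice is 31 and lives in Berlin."}],
    functions=functions,
    function_call={"name": "record_person"},  # force the structured output path
)

args = json.loads(response["choices"][0]["message"]["function_call"]["arguments"])
print(args)  # e.g. {"name": "Alice", "age": 31}
```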

Alessio: What do you think is the most interesting unsolved question in AI? [00:58:50]

Harrison: I'm really interested like how multimodal is going to work. Like with just what that looks like. [00:58:55]

Swyx: Have you had a look at the GPT-4 vision? No, not really. [00:58:59]

Harrison: Yeah, not beyond what they- [00:59:01]

Swyx: They're doing private betas right now. So I'm very excited. [00:59:04]

Harrison: I'm excited about that as well. Yeah, I mean, I think that's, you know, you talk about like, again, this whole space is just changing so fast, but you talk about something that could like really change how, because like, you know, a lot of LangChain is kind of like a data orchestration tool in some sense. And so if you had a whole new type of data in there. [00:59:20]

Swyx: So maybe we do this thought exercise, right? Tomorrow, OpenAI releases the GPT-4 vision API. What does LangChain do? [00:59:25]

Harrison: Immediately we add support for it in like the wrapper. So however you interact, like honestly, this is another like fun thing. Everyone's API now looks like OpenAI's. [00:59:35]

Swyx: Yeah, which is great. [00:59:36]

Harrison: Which you have to do, yeah. So like our wrapper looks similar to OpenAI. So I don't think it will be that difficult to include support for it at the basic model level. And so we do that. And now that we've released the expression language bit, like a lot of the core chains, we have examples of rewriting them just in this expression language. So like for retrieval, if we're now talking about like, okay, you can do like retrieval question answering over for multimodal things, we'd probably have to figure out how those are getting stored and what's being done with them. But then from there, that should be, yeah, so probably looking to like, yeah, how are people kind of like storing and consuming this type of information? But then that step should be pretty easy to plug into the kind of like chain. [01:00:17]

Swyx: Multimodal stores? Yeah, I don't know. I always wonder what that would actually look like because a lot of multimodality in LLMs is really just an LLM, a text LLM calling a different model. And that's just no different than any API call, essentially unchanged. [01:00:32]

Harrison: I think it's probably something that you don't know until you let like a million people play around with it. [01:00:37]

Swyx: Then there'll be new LangChain for multimodal. What's one message you want everyone to remember today? [01:00:43]

Harrison: I would probably say just like build. I think it's a fantastic time to be building. [01:00:47]

Swyx: All right, just build. Yeah. [01:00:49]

Alessio: Thank you Harrison for coming on. [01:00:51]

Swyx: Thanks so much. [01:00:51]

Harrison: Thank you guys for having me. [01:00:52]

Swyx: It's a lot of fun. [01:00:53]



Get full access to Latent Space at www.latent.space/subscribe
2023-09-06
Link to episode

RWKV: Reinventing RNNs for the Transformer Era - with Eugene Cheah of UIlicious

The AI Engineer Summit Expo has been announced, presented by AutoGPT (and future guest Toran Bruce-Richards!) Stay tuned for more updates on the Summit livestream and Latent Space University.

This post was on HN for 10 hours.

What comes after the Transformer? This is one of the Top 10 Open Challenges in LLM Research that has been the talk of the AI community this month. Jon Frankle (friend of the show!) has an ongoing bet with Sasha Rush on whether Attention is All You Need, and the most significant challenger to emerge this year has been RWKV - Receptance Weighted Key Value models, which revive the RNN for GPT-class LLMs, inspired by a 2021 paper on Attention Free Transformers from Apple (surprise!).

What this means practically is that RWKV models tend to scale in all directions (both in training and inference) much better than Transformers-based open source models:

While remaining competitive on standard reasoning benchmarks:

swyx was recently in Singapore for meetings with AI government and industry folks, and grabbed 2 hours with RWKV committee member Eugene Cheah for a deep dive, the full recording of which is now up on Latent Space TV:

Today we release both the 2hr video and an edited 1hr audio version, to cater to the different audiences and provide "ablation opportunities" on RWKV interest level.

The Eleuther Mafia?

The RWKV project is notable not merely because it poses a credible challenge to Transformer dominance. It is also a distributed, international, mostly uncredentialed community reminiscent of early-2020s EleutherAI:

* Primarily Discord, pseudonymous, GPU-poor volunteer community somehow coordinating enough to train >10B, OPT/BLOOM-competitive models

* Being driven by the needs of its community, it is extremely polyglot (e.g. English, Chinese, Japanese, Arabic), not to chase benchmarks, but because its users want models that work in their own languages.

* "Open Source" in both the good and the bad way - properly Apache 2.0 licensed (not "open but restricted"), yet trained on data taken from commercially compromised sources like the Pile (where Shawn Presser's Books3 dataset has been recently taken down) and Alpaca (taking from Steven Tey's ShareGPT, which is technically against OpenAI TOS)

The threadboi class has loved tracking the diffusion of Transformers paper authors out into the industry:

But perhaps the underdog version of this is tracking the emerging Eleuther AI mafia:

It will be fascinating to see how both Eleuther and Eleuther alums fare as they build out the future of both LLMs and open source AI.

Audio Version Timestamps

Timestamps assisted by smol-podcaster; they differ from the 2hr YouTube version.

* [00:05:35] Eugene's path into AI at UIlicious

* [00:07:33] Tokenizer penalty and data efficiency of Transformers

* [00:08:02] Using Salesforce CodeGen

* [00:10:17] The limitations of Transformers for handling large context sizes

* [00:13:17] RWKV compute costs compared to Transformers

* [00:16:06] How Eugene found RWKV early

* [00:18:52] RWKV's focus on supporting many languages, not just English

* [00:21:24] Using the RWKV model for fine-tuning for specific languages

* [00:24:45] What is RWKV?

* [00:33:46] Overview of the different RWKV models like World, Raven, Novel

* [00:41:34] Background of Blink, the creator of RWKV

* [00:49:55] The linear vs quadratic scaling of RWKV vs Transformers

* [00:53:29] RWKV matching Transformer performance on reasoning tasks

* [00:54:31] The community's lack of marketing for RWKV

* [00:57:00] The English-language bias in AI models

* [01:00:33] Plans to improve RWKV's memory and context handling

* [01:03:10] Advice for AI engineers wanting to get more technical knowledge

Show Notes

Companies/Organizations:

* RWKV - HF blog, paper, docs, GitHub, Huggingface

* Raven 14B (finetuned on Alpaca+ShareGPT+...) Demo

* World 7B (supports 100+ world languages) Demo

* How RWKV works in 100 LOC, RWKV overview

* EleutherAI - Decentralized open source AI research group

* Stability AI - Creators of Stable Diffusion

* Conjecture - Spun off from EleutherAI

People:

* Eugene Cheah - CTO of UIlicious, member of RWKV committee (GitHub, Twitter)

* Blink/Bo Peng - Creator of RWKV architecture

* Quentin Anthony - our Latent Space pod on Eleuther, coauthor on RWKV

* Sharif Shameem - our Latent Space pod on being early to Stable Diffusion

* Tri Dao - our Latent Space pod on FlashAttention making Attention subquadratic

* Linus Lee - our Latent Space pod in NYC

* Jonathan Frankle - our Latent Space pod about Transformers longevity

* Chris Re - Genius at Stanford working on state-space models

* Andrej Karpathy - Zero to Hero series

* Justine Tunney ("Justine.lol") - mmap trick

Models/Papers:

* Top 10 Open Challenges in LLM Research

* Retentive Network: A Successor to Transformer for Large Language Models

* GPT-NeoX - Open source replica of GPT-3 by EleutherAI

* Salesforce CodeGen and CodeGen 2

* Attention Free Transformers paper

* The Pile

* RedPajama dataset

* Monarch Mixer - Revisiting BERT, Without Attention or MLPs

Misc Notes

RWKV is not without known weaknesses - Transformers do well in reasoning because they are expressive in the forward pass, yet the RWKV docs already note that it is sensitive to prompt formatting and poor at lookback tasks. We also asked pointed questions about RWKV's challenges in the full podcast.
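For readers who want an intuition for the linear-scaling claim above, here is a minimal, simplified sketch of a WKV-style recurrence in Python. This is our own illustration, not the official RWKV code (which adds numerical-stability tricks plus time- and channel-mixing layers): each new token updates a small running state with an exponential time decay, so generation costs O(1) per token instead of re-attending over the whole history.

```python
import numpy as np

def wkv_decode(ks, vs, w, u):
    """Simplified WKV-style recurrence (illustrative only).
    ks, vs: per-token key/value vectors; w: per-channel decay; u: bonus
    weight applied to the current token."""
    num = np.zeros_like(vs[0], dtype=float)
    den = np.zeros_like(ks[0], dtype=float)
    outputs = []
    for k, v in zip(ks, vs):
        # weighted average over the decayed history plus a bonus term for the current token
        out = (num + np.exp(u + k) * v) / (den + np.exp(u + k))
        outputs.append(out)
        # O(1) state update: decay the running sums, then fold in this token
        num = np.exp(-w) * num + np.exp(k) * v
        den = np.exp(-w) * den + np.exp(k)
    return outputs
```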



Get full access to Latent Space at www.latent.space/subscribe
2023-08-30
Länk till avsnitt

Cursor.so: The AI-first Code Editor ? with Aman Sanger of Anysphere

Thanks to the almost 30k people who tuned in to the last episode!

Your podcast cohosts have been busy shipping:

* Alessio open sourced smol-podcaster, which makes the show notes here!

* swyx launched GodMode. Maybe someday the Cursor of browsers?

* We're also helping organize a Llama Finetuning Hackameetup this Saturday in anticipation of the CodeLlama release.

Lastly, more speakers were announced at AI Engineer Summit!

~46% of code typed through VS Code is written by Copilot. How do we get closer to 90+%? Aman Sanger says we need a brand new AI-powered IDE to get there; and we're excited to be the first podcast ever to tell the Cursor story.

If you haven't heard of Cursor, you may have been living under a rock. Here are just some of the rave reviews going around in the past week alone:

* "Cursor is the best product I've used in a while" - Alex MacCaw

* "Someone finally put GPT into a code editor in a seamless way. It's so elegant and easy. No more copying and pasting." - Andrew McCalip

* "Coding with AI is getting insane." - Mckay Wrigley

* "This is mind blowing" - Linus Ekenstam

* "Cursor + gpt4-32k = illegal levels of productivity" - Sully Omarr

* "THE BEST CODE EDITOR WITH AI" - Carlos Santana

A decade ago, "platform risk" meant that building apps on social media platforms was risky because you could get cut off from the social network.

Today, the AI version of "platform risk" is building AI products within an existing product (like an AI extension for VS Code, or a Figma plugin). Since Copilot, a generation of VSCode plugins has launched (including Cody, Cosine, and previous guests Codeium and Codium), only to be challenged by Copilot X itself.

A core AI Engineering thesis is that new capabilities in AI demand new innovation in AI UX (and that AI UX can actually be a viable moat). Take VS Code for example; when GitHub was first working on Copilot, there was actually no way to support the "ghost autocomplete" feature we all use today. They eventually convinced the team to build it, and Copilot's success speaks for itself.

If you're a startup building on top of VS Code today, you do not have the same access and influence on the roadmap. Your UX is limited to what they allow you to do, and often that caps your ability to successfully compete against them.

Since Cursor owns the whole IDE, they can do things you can't (yet) do in VSCode:

Cursor's Gameplan

Cursor is competing head to head against VS Code by forking Microsoft's IDE and building their own AI-powered version. A few of Cursor's unique features:

* Native chat: Chat is a core piece of Cursor. Users can choose between GPT-3.5 and GPT-4 to ask questions and receive answers based on their code.

* "Mentioning" files: you can easily add files into your request context by using "@"; this works both for code as well as documentation. If you want to do a change that includes multiple files, you can include them in your question to make sure the change is reflected in all of them.

* Custom prompting engine: Cursor built Priompt, their custom prompting engine. As your chats go over the context window size, Priompt figures out which messages to keep in the history, which files to drop from the prompt, etc.

* Moving beyond typing: while IDEs are familiar to folks as today's interfaces, in the future Cursor hopes to have agents you can delegate tasks to. Instead of a back and forth on a new feature or bug fix, you can ask it to do the whole thing for you end to end.

After diving deep into Cursor we nerded out on model usage, training, quantization, and evaluation. There's a ton of great content in this episode, we hope you'll enjoy it!

As always, feedback welcome in the comments, and tag us on socials for future guest suggestions!

Show Notes

* Cursor

* Gary Marcus' cubes prompt

* Priompt

* "Humans should focus on bigger problems."

* Codium AI on Latent Space

* Rift from Morph

* Sourcegraph

* E2B

* Repl.it

* HungryHungryHippos, Hyena, etc (see our FlashAttention episode)

* Aman Tweets

* Why GPT-3.5 is (mostly) cheaper than Llama 2

* Llama's architectural limitations

* "Training will look like researchers/practitioners offloading large-scale training jobs to specialized 'training' companies: a state of the world that resembles chip design & fabrication." - Mosaic prediction

* "The size of all code/history on GitHub public repos is 92TB. The size of Google's monorepo in 2015 was 86TB (of much higher quality code). If Google were willing to deploy code models trained on their own data, they'd have a noticeable advantage over everyone else." - May 2023

Timestamps

* [00:00:00] Intros

* [00:02:31] Developing CAD models vs coding models

* [00:05:23] Deciding to build a new IDE optimized for large language models

* [00:10:50] Getting early access to GPT-4 and realizing its potential for software development

* [00:12:32] Rethinking the UI/UX for coding

* [00:18:24] Cursor's features like system prompts and chat

* [00:22:24] Tips for prompting GPT-3/4 for code generation and editing

* [00:27:24] Cursor's documentation and context features

* [00:29:30] The potential of coding agents like Code Interpreter

* [00:38:23] Cursor's internal prompting tool Priompt

* [00:40:47] The challenges of very long context lengths for models

* [00:45:44] The compute costs for prompt tokens vs. completion tokens

* [00:49:36] How quantization interacts with model utilization

* [00:51:24] Issues with human eval for benchmarking code models

* [00:53:12] Thoughts on training models vs. relying on foundation models from big providers

* [00:55:34] The origin story of Cursor's parent company AnySphere

* [00:56:00] Lightning Round

Transcript

Alessio: Hey everyone, welcome to the Latent Space podcast. This is Alessio, Partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, writer and editor of Latent Space. [00:00:20]

Swyx: Hey, and today we're back in the studio again after a little break and we have Aman Sanger in the house. Hey Aman. Hey, thanks for coming. Thanks for having me. So I wanted to introduce our guests and then have you fill in the blanks. So you worked at Gamelon, Bridgewater, McKinsey, Google, and You.com, all on sort of kind of AI related things and some finance related things. You also ran your own consultancy, Abelian AI, and you graduated in CS and math from MIT recently. Worked on a few projects, including Instill, which I think we'll cover a little bit later, and most recently Cursor.so, which we'll cover for the vast majority of the podcast. But just on a personal side, what's one thing that people should know about you that, you know, might not be so obvious on LinkedIn? Oh, interesting. [00:01:01]

Aman: In a previous life, I played a lot of squash. [00:01:05]

Swyx: You were a top seed? [00:01:06]

Aman: Yeah. So in high school, I kind of competed in tournaments and most people probably don't really know what squash is. It's like tennis in many ways. It's like a racket sport, but it's indoors. You play against a wall. I guess now pickleball is all the rage with, with racket sports, but yeah, the story is I used to play tennis and then I moved to a building that had a squash court in it and then I picked it up. I loved it. And I've been playing ever since. So I competed a lot in high school, played a bunch at MIT, have not had the chance to play much here. In San Francisco, there aren't too many courts. [00:01:38]

Swyx: We can organize a squash tournament and then you'll crush it, of course. Is there anything about the athlete mentality that you take with you as a founder? [00:01:47]

Aman: Yeah, I think it can be at times a bit too much, but I'm very competitive. I really hate losing. Now I think I'll go on runs and if someone tries passing me, I won't let it happen. I'll just kick it into overdrive and maybe I'll turn the corner if I know they're going to beat me, but I can't let someone pass me when I'm running. And I think the same is true with startups, where the competitive nature, I think it in general helps motivate me and makes me, I guess, just work harder. [00:02:17]

Swyx: Yeah. Okay. Well, we'll have a bunch of competitive questions later, but we'll go over the timeline. [00:02:22]

Alessio: Let's jump into how you got to Cursor. So in August 2022, you launched something called Instill. Can you talk a little bit about that? [00:02:31]

Aman: Yeah, and maybe before I go into Instill, I should talk about what I was even doing before that, because Instill was actually a very brief foray from what I was doing with my original co-founder, Michael. So we had both actually gone to the same high school together, gone to MIT together. And then after graduating, we knew we wanted to start something. And in June, what we were working on was also called Cursor, but very different. We basically were very, very fanatical users of Copilot. We loved it. And we had a little bit of experience with computer-aided design or CAD software. A lot of our friends, in fact, were mechanical engineers. And we'd heard a lot about how tedious it was to just design these parts and software like SOLIDWORKS and whatnot. It was pretty obvious to us that if you could train a transformer on the task of predicting the next token, not just for code, but for CAD, then you could get a really useful product that could speed up mechanical engineering. So that's actually what we'd worked on up until Instill, even a little bit after Instill. And yeah, I can go into more detail about that. It was pretty interesting. That's probably how, despite these days doing less stuff with model training than in the past. For that, it was all just kind of rolling our own models from scratch, a lot of training, a lot of inference. [00:03:48]

Alessio: I'm always curious to hear about what made you interested in that. Obviously, you've been at the forefront of a lot of this AI work. Why was that the most interesting thing to you? Did you think there were not as many people going after that? Did you think you had a unique insight into it? Because we got a lot of people listening that want to be founders and want to figure out how to make that decision. [00:04:09]

Swyx: Yeah. [00:04:10]

Aman: First off, I've always been incredibly fascinated by AI. I originally learned how to program, actually, because I'd seen the results from ImageNet and I'd heard about deep learning, and that just sounded insanely cool to me. And so my first programming project was building and training a neural network in Java, because that was the only language I knew from my AP Computer Science class. But ever since then, everything I've done has been involving ML, AI. The reason I wanted to, I guess, found a company is, first off, I had been working with Michael on a couple of other things. We'd done an AI consultancy in the past. We worked really well together and really just enjoyed working on stuff on our own. With CAD, we were doing a little bit of ideation, and I think we were quite worried about competition in a lot of other areas. I think that worry has definitely subsided a little bit with what we're working on now, obviously. A lot of competition in the coding space. But it seemed like the kind of thing where not a lot of eyes were on this. It seemed very technically possible, at least at the time. And the market was pretty sizable, if you looked into it. So it was both a really interesting technical problem. And then if you just tried to analyze the space, it seemed like a good idea. [00:05:23]

Alessio: How do you decide to move off of it? That's another important answer as a founder. [00:05:28]

Aman: I think there are a few key things that we did not take into account when we were working on this. One was if you look at the original Codex paper, our assumption was this is the model that powers Copilot. It was trained on 100 billion tokens, or it was something like 50 billion tokens of Python. And one interesting insight from it was that you actually get no transfer benefits from the pre-trained model on text to code. So that means they took GPT-3, and for the smaller models, for the models that weren't trained on all the Python data, there were some benefits where GPT-3 transferred really well and learned faster. But then for the final Codex model, it turns out that there were no transfer benefits, meaning you just took a model, you trained it from scratch on those 100 billion tokens of Python code, it would do just as well as the GPT-3 12 billion model that was fine-tuned. The issue was that it was only true for GPT-3 and 100 billion tokens of Python code. These days, I mean, the jury's still out on this, but it seems pretty clear that the benefits from learning language are quite helpful with code. I guess that kind of goes into the issues with CAD, where one, you're dealing with much less data than code. If you assume, first off, that 50 billion, 100 billion tokens is all you need, then maybe with like 10x less, you could get a pretty useful model. In reality, Copilot today is powered by probably trillions of tokens of code, as well as text. And when you're dealing with, at most, from scraping every single bit of CAD data, you can find 10 billion tokens. It's just not enough to train a useful model. We tried scaling, and no matter what kinds of regularization techniques we used, we just couldn't get it past a few billion parameters without overfitting. That was the big thing. And then the other is that there's no transfer. If you try to test these models today, and even with GPT-4, there's a prompt that I like to use, which is good for testing like 3.5 versus 4 if you don't know which one's behind the scenes. But even 4 sometimes struggles with it. And the prompt is you kind of lay out, I think it's like a famous kind of Gary Marcus prompt as well, where you lay out a bunch of kind of cubes on a table, right? And you describe it, and as you increase the complexity, you know, 3.5 drops out, you increase the complexity more, 4 drops off. But it's clear that these models are not that good at spatial reasoning, and that's exactly what's needed for CAD. [00:07:52]

Swyx: Oh, yeah, that's right. [00:07:54]

Aman: What you want to do is if I were to design this table with CAD in front of you, I would first draw a rectangle, then I would do an extrusion operation, which would basically take the rectangle and then extend it orthogonal to the plane, such that it's like a volume, [00:08:14]

Swyx: right? [00:08:15]

Aman: And then the model has to realize that, okay, the shape now that exists is this structure here, this table. And then the really difficult thing is for other operations, it'll need to point to the constructed geometry that was built. And basically, the model effectively to work well, it needs to kind of, in its mind, imagine this 3D structure. And the models are not good at that. If you try fine tuning code models on this task, or language models, they're just not going to transfer well at all. [00:08:44]

Alessio: Do you think in like two, three years, there will be a good AI-powered CAD software? [00:08:47]

Aman: Yeah, my perspective now is that I think the best way here is probably redesigning the entire system. One other big pain point was we tried to build plugins with all the major pieces of CAD software, like SolidWorks, Onshape, and so on and so forth. And if you think it's hard to build a plugin for some of the older IDEs, you've not seen these pieces of software. And so I think even if you got a good model, it might be really hard to actually get distribution and create a good plugin that works. So it feels like with the advancements you have in kind of text to images, and there's some new companies kind of doing stuff with text to 3D, it feels like the reasonable approach is actually just scrap the way that people are doing CAD right now. And I suspect a company or some companies will come around and do this quite well. [00:09:37]

Swyx: That's really good insight. And we have more sort of general LLM products thoughts to ask you at the end. We wanted to get into Cursor, since that is your primary product right now. In January 2023, you announced it to the world. Maybe take us into the, I guess, idea maze leading up to Cursor. [00:09:54]

Aman: Yeah, I guess there still was one brief pivot period where we tried doing text to images. The reason we decided against it was that I don't think we're the founders for that kind of company. We learned this from CAD, and we strongly believe this now that it is much better to be a user of the product you're building. And we just weren't big users of any of the text to image tools. So it was around December where we managed to actually get early access to GPT-4. And before then, we had played around a little bit with using earlier versions of 3.5 on writing code. And we had kind of given up. It just seemed like if you looked at Text DaVinci 2 or Code DaVinci 2, those older 3.5 variants, they just couldn't really do anything meaningful. But then we opened up the playground, started copy pasting code into there. And it was ridiculous. This is before everyone started using HumanEval. [00:10:50]

Swyx: Did you use the early version? [00:10:52]

Aman: Yes. So, okay. [00:10:54]

Swyx: So, it was an earlier version. The unhinged raw. [00:10:56]

Aman: Oh, it was still, no, it was still, it wasn't, yeah, it was still very safe. But HumanEval was the thing, and before everyone started talking about it and knowing about it, we kind of pasted it in and it got 85%. And we were just like, wow, the best open source model at the time got 30%. And Code DaVinci 2 got something like 47%. And yeah, GPT-4 today, it gets about the same score. And so we then started, you know, writing code in there, just copying and pasting random pieces of code from whatever kind of things we were testing and developing. We found that it was not just good at creating net new things, but refactoring code, editing code, helping you debug kind of every single aspect of software development felt so different with these models. And then we kind of, in our heads, just plotted out the future. And this is GPT-4, like what happens when you have 4.5, GPT-5? These models are just going to get better and better and better at programming. And the future is probably not going to be more and more things that you tab enter for autocomplete. I think that's a very useful tool. We use Copilot every day. We find it quite useful, but you can't have a world in which language models are able to produce 90%, 95% of the code, and it still follows that form factor. I think you have to redesign the entire way, the entire UX of writing software. And that was our take with Cursor, where you need to own the full IDE and completely redesign the flow of producing software and just doing software development in general. [00:12:32]

Swyx: Those are big statements that we need to dig into a little bit more. I want to backtrace a little bit. So you got early access to GPT-4. That actually means that you were backed by OpenAI, you joined OpenAI Fund before you were Cursor. [00:12:46]

Aman: Yeah, basically. Kind of. [00:12:48]

Swyx: Oh, okay. Because I'm trying to get the chronology and I assumed you were, they funded you because of Cursor. [00:12:52]

Aman: Yeah, so OpenAI has this program, Converge. That was the program we participated in. And through that, the main thing was early access to unreleased models that we got to play with. Obviously, none of this went to production. None of this could go into production. It was just kind of a sneak peek of GPT-4. And so, yeah, before we actually built out Cursor, we didn't take money from OpenAI, but we were a part of this program. [00:13:14]

Swyx: Got it. Yeah. And then you also mentioned one more thing, which was interesting. You still use Copilot, but you also use Cursor. Yes. You also mentioned that Copilot is probably trained on trillions of tokens, which means that's extensive training since the original Codex. That's my guess. [00:13:30]

Aman: I mean, if you look at The Stack, for example, right? It's what? One to two trillion tokens? Something around that. I'm very skeptical that Copilot is training on less, especially with all the lawsuits you see with whether or not it's quote-unquote fair use. [00:13:43]

Swyx: So yeah, my guess is trillions of tokens. [00:13:44]

Aman: I don't really know. But yeah, I'm sure if you did the math on how much public code there is in GitHub, it's almost certainly the trillions. [00:13:52]

Swyx: One of the reasons I harp on this is one of our pet themes is tracking the dataset to parameter ratio. And Copilot cannot be that big because it returns relatively quickly. So it's going to be in the low billions, right? So how do you do trillions of tokens to the low billions? That's interesting. Yeah. [00:14:12]

Aman: I think I have some thoughts on this because there's the whole thing with chinchilla scaling and then people are now saying, oh, chinchilla scaling doesn't matter because of inference. But Copilot could be a mixture of experts. That's one other speculation. I don't know if that's true. I mean, it probably wasn't the case at least a year or two ago. My guess is it's probably a small model that's very over-trained. From what I've heard, there are also lots of tricks you can do with caching where even if the model is quite big, it doesn't take, it effectively takes no time to ingest the entire prompt. Yeah. [00:14:46]

Swyx: Semantic caching is what they are calling it, right? I guess if it roughly embeds to the same thing, just return the same thing. [00:14:50]

Aman: I think it's partially that, right? Where let's say the suffix or the code before where your cursor is has changed slightly. They might not actually go ahead and use a different... [00:15:02]

Swyx: That seems dangerous for code. [00:15:03]

Aman: It does seem a little dangerous, but it gives you this like incredibly snappy response. And the other thing is the KV cache, right? Where you can just, I don't think there's any open source framework that does this right now, but what you can do is if you've already computed something over the KV cache, then you can just... [00:15:19]

Swyx: This is the attention KV cache for people following. [00:15:21]

Aman: So this is the attention KV cache, right? And if you've already computed all the keys and values, you can just store that in memory and then load that back up in the GPU and you don't need to process the prompt again. And I speculate they're doing something like that behind the scenes. [00:15:35]
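To make the KV-cache idea concrete, here is a minimal single-head sketch in Python. It illustrates the general technique Aman is describing, not what Copilot actually does: the keys and values computed for the prompt are stored once, and each new token only appends to the cache and attends over it, so the prompt is never re-processed.

```python
import torch

kv_cache = {}  # hypothetical per-session cache: session_id -> (K, V)

def attend(q, K, V):
    # single-head scaled dot-product attention over the cached keys/values
    scores = (q @ K.T) / K.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ V

def decode_step(session_id, q, k, v):
    """Append the new token's key/value to the cache, then attend over the
    full cached history; the prompt's keys/values are never recomputed."""
    if session_id in kv_cache:
        K, V = kv_cache[session_id]
        K, V = torch.cat([K, k[None, :]]), torch.cat([V, v[None, :]])
    else:
        K, V = k[None, :], v[None, :]
    kv_cache[session_id] = (K, V)
    return attend(q, K, V)
```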

Swyx: That's a lot of memory. That's a lot of storage. It is. [00:15:38]

Aman: Well, unless they're using something like multi-query attention or... [00:15:41]

Swyx: Yeah. We'll talk about that in your Llama 2 piece. And then the final big opinion that you drop in there was you must write your own IDE as opposed to write a VS Code extension, which there's plenty of them out there. SourceGraph is doing one and I've been working closely with Morph, which just put out Rift. So this is obviously a big undertaking. Maybe explain more a little bit about why build your own IDE. Yeah. [00:16:05]

Aman: The reason we decided to do this is I think in the future, today, what Cursor can provide and what any of these tools can provide isn't that much different than, I guess, what you get in VS Code. But it was more of a long-term decision where in the long term, you're going to need to design just a very different UX that the extensions don't give you. One story we'd heard is that with Copilot, in order to actually get the multi-line ghost text implemented, it wasn't actually a part of the extension. I think the team at GitHub had to call up VS Code and have them make a change to the source in order for that extension API to be enabled. That's what allows for multi-line ghost text completion. And this is scary. If you look today, there are other things that VS Code in their source code has enabled as APIs that are just closed off to everyone but Copilot. So I think there's this fundamental platform risk where you're competing with the incumbent that owns the platform you're building on. And we thought it would just not really be tenable in that sense. And then the other thing is if you want to do other kind of fancy things. So one example of a feature we're kind of building right now is instead of just... So Copilot is great for completing the next line, completing the next few lines, but what if you wanted to do a kind of sort of edit, where instead of just completing this line, it changes the line above or deletes something. There's no way that you can do something like that in VS Code, but we have the UI for this that we've kind of built out in Cursor. We're currently training models in order to get it to work well. But again, this is a feature that we think once we get it to work, will be quite useful. Could be on par with Copilot level in terms of usefulness. And it's just fundamentally impossible unless you own the IDE. There are a lot of other ones like those that we're kind of cooking up. And then there are small things. I do think like in terms of inline edits, which means inside the editor, you can press command K in Cursor and then ask for some kind of modification of the code or ask for a generation of the code. And I do think we have probably the best UX for that because if you look at what someone like Sourcegraph does, I mean, Sourcegraph Cody is a great product, but they basically have to use the GitHub pull request comment feature in order to do it. And I think like these paper cuts kind of add up over time. [00:18:24]

Swyx: They do. And it's very impressive how quickly you can try it out, you know, obviously encourage everyone listening to try out Cursor. And the download is really quick. The binary is super small. And then when you spin it up, it boots up really fast. And it's just a text file that guides you through the tutorial. It's really, it's really great. [00:18:39]

Alessio: I was using it today. Actually I will open right now. The first thing I like, you guys have like bring your own keys. So that's like one of the things that I don't see in enough products, like bring your own API key instead of like sign up for an account and do all of that. [00:18:54]

Swyx: Well, so like you have to trust them that they won't. [00:18:56]

Alessio: Look at this guy. [00:18:59]

Swyx: I just wonder if like OpenAI could do one more thing, which is just, you know, do a limit, the spend limit per key. So like that leaves space for like other companies to come in and do that. But I mean, OpenAI could just build it tomorrow. [00:19:10]

Alessio: I saw Logan tweeted about whether or not it would be interesting to have per key billing. [00:19:15]

Swyx: I mean, I think that would be. They're clearly thinking about it. [00:19:17]

Aman: Yeah. They have more important things like GPT-4.5. We can talk about that one. [00:19:20]

Swyx: Yes. [00:19:21]

Alessio: Let's talk a bit about what you do. So first of all, unlike some of the other tools, you guys have like a system prompt kind of thing, which are like rules for AI. What was like the decision behind that? Did you see people being frustrated with always having to repeat the same thing in the prompt? [00:19:38]

Aman: Yeah. It was for encoding some small rules that the model will tend to get wrong. So for example, we use Solid instead of React. [00:19:46]

Aman: Solid is just another reactive UI framework. It's a decent bit faster. And my co-founders know a lot more about the details in this than I do. But like the other really nice benefit is that with the VS Code fork that we're using, you can kind of inject Solid into multiple roots. While React is meant to kind of take over the very root of the entire DOM, with Solid instead, you can inject it inside of like multiple HTML components. It's much more performant that way. And so yeah, we use Solid because of that. And then the issue is every time you create a TSX file, write a component, GPT-4 by default will assume it's React, right? And so it'll get the code wrong. And so just encoding rules like that is pretty helpful for that side of the problem. For some of our users who are less familiar with English too, it's helpful to kind of add a prompt to say describe this in whatever language they're most comfortable with. [00:20:45]
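As a rough illustration of what such "Rules for AI" amount to in practice, here is a sketch of prepending a small, always-on system message to every request. The wording and helper below are hypothetical, not Cursor's actual prompt.

```python
# Hypothetical example of always-on rules prepended as a system message.
RULES_FOR_AI = (
    "We use SolidJS, not React: when you see a .tsx component, write "
    "Solid-style code (createSignal, no React hooks). "
    "Explain your answers in Spanish."
)

def build_messages(code_context: str, user_request: str) -> list[dict]:
    # the rules ride along with every chat request, so the model stops
    # defaulting to React for .tsx files
    return [
        {"role": "system", "content": RULES_FOR_AI},
        {"role": "user", "content": f"{code_context}\n\n{user_request}"},
    ]
```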

Swyx: And so for those who don't know, you primarily, the main model for most people is GPT 3.5 and pro users can use GPT 4. You're prompting GPT 3.5 with these system prompts first. Any other tips apart from your company specific ones, apart from the English as a second language ones, how do you prompt GPT 3 or 4 for code? [00:21:05]

Aman: So this is interesting because I think in general, these models are good at just producing net new code or rewriting code from scratch. The thing that they're not great at is producing edits or modifications. So producing a diff is incredibly painful. And I'm sure you guys may have encountered this if doing stuff with agents, but they just get line numbers wrong pretty often. And when you're producing a diff, you know, it's fewer tokens of compute. And there are some theories that, you know, the more tokens of compute you kind of use up, the more the model is kind of expending on thinking, yeah, chain of thought. That's one thing we've kind of struggled with. And so that probably takes chaining to get it to work well, where one technique we use is we have GPT-4 kind of propose a draft PR and then we have 3.5 go and kind of heal the draft diff and heal those changes. So you'll have to do things like this in order to work around those limitations with edits. In terms of general code writing, I think with 4, it's just, it's been super, super straightforward. 4 is fantastic. For 3.5, I would strongly recommend using the Azure model because there you get access to completions, meaning you can put kind of words in GPT-3.5's mouth and let it finish it. Kind of like what you can do with Claude. And that's really helpful. [00:22:24]
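One way to read the draft-then-heal chain, as a hedged sketch rather than Cursor's internals (the prompts and the `chat` helper below are hypothetical): a stronger model proposes the edit, and a cheaper model repairs any mangled hunks before the diff is applied.

```python
def edit_with_healing(file_contents: str, instruction: str, chat) -> str:
    """chat(model, messages) -> str is a stand-in for whatever LLM client
    you use. GPT-4 drafts the edit; GPT-3.5 heals the draft diff."""
    draft = chat("gpt-4", [{
        "role": "user",
        "content": f"{file_contents}\n\nEdit this file to: {instruction}. "
                   "Output the proposed changes as a unified diff.",
    }])
    healed = chat("gpt-3.5-turbo", [{
        "role": "user",
        "content": f"Original file:\n{file_contents}\n\nProposed diff:\n{draft}\n\n"
                   "Fix any wrong line numbers or malformed hunks and output a "
                   "corrected unified diff.",
    }])
    return healed
```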

Swyx: I always assumed that was going away as an API because OpenAI is like clearly not interested in maintaining that. I mean, they're straight up deprecating it now. Yeah. [00:22:33]

Aman: It's a little frustrating because I think it's really useful for code, right? Because when you can do stuff in the middle of the line, it's impossible to do that with the chat format. But with the completion format, it becomes trivial. [00:22:46]

Swyx: So one thing I learned from working with Jesse on GPT 4 OpenAI, he always asked GPT to comment your code before writing the code. And that's the chain of thought for code, right? So when I ask you for code, give me a fully commented code with only a brief explanation on how it works, bias towards the most efficient solution and offer an alternative implementation if it fits. If it's unclear what environment or library versions I'm working with that might significantly change your answer, please ask me to clarify. That's my custom instructions right now for code. And I'm just like, hey, we should come together as a community and just share these custom instructions or system prompts. Yeah. [00:23:18]

Aman: When you get it to be more verbose, I do worry a bit in terms of UX because more tokens means it takes a lot longer to get to the answer. And then it's also just, I don't want to read a massive answer. I just often want the answer immediately, or I just want kind of a short block of code to answer that. That is a trade off you kind of will have to deal with. And the same thing with diffs, right? Where the diffs are going to be so much faster if you get them to work, but it's just going to result in lower quality edits. [00:23:47]

Alessio: One nice thing you do in the chat is actually remove some of the code you don't touch. So I was using it to make some changes to the code base and in each function, it would say like add a comment with existing code and then tell you just the stuff to change and the stuff to add to it, which kind of frustrates me with GPT-4 sometimes. It just re-gives you the whole function definition instead of just that. I noticed that in the chat, you can't apply the change and put it into the code if you don't start the conversation from the file itself. Why is that so hard? Like so many products have it. Is it actually hard or is it just like a UX decision to have you do it? [00:24:24]

Aman: So there are two ways of doing that, right? So when you say apply change, do you mean you select a region in the code, press the button and then it makes it just like, makes it in? [00:24:33]

Alessio: Right here, it told me to like add these three lines of Python and I'm like, I don't want to copy paste them. You know, I did it, but it would be good to just do. [00:24:40]

Aman: So if it just makes the change for you. Yeah, this is something we're going to be adding this week. So yeah, this is definitely like something a lot of users have asked for and it should be reasonably straightforward to do. I think the issue is we want to use it sparingly because of how expensive it is, and 3.5 actually kind of struggles with this. [00:24:59]

Swyx: Interesting. [00:25:00]

Alessio: And then I noticed, so you can chat either with or without context. So with context, you pass it parts of your code base without, you don't. Every time it loads the license file. So is there anything that you're working on to make sure that like you don't have like license infringement and stuff like that, or is it just like the model for some reason thinks the license file is really important? [00:25:22]

Aman: Yeah, right now it probably uses vanilla embeddings. We're working on a couple of interesting techniques for much better retrieval. One of them is basically fine tuning a model to kind of memorize a code base. So there was a paper that came out a little while ago from Google, which is called, I believe, Transformer Memory as a Differentiable Search Index. The idea here is you train a transformer on a code base or you train it on a corpus of documents in order to basically directly answer questions about which document is relevant given the question. So the mapping would be some query, some question, and in this case, a question about a piece of code. And then the model would directly output not just the file, but let's say the actual function or the class that solves it. It wouldn't output all the code for it, but it would only output like the symbol that corresponds to it. And we've seen some initially promising results with this direction. If you look at the original paper and then there is some follow-on work, it actually does a lot better than very old school retrieval techniques like BM25 and even embedding based techniques. And so this is an approach we're experimenting with and we think it could prove quite helpful. The other direction is just improving embeddings. If you look at the recent paper by, I think it was Alibaba. So there was a recent model. If you do the math, it cost them less than $1,000 to train this thing. And it beats OpenAI on non-code related tasks, sadly non-code related. OpenAI still kind of holds the crown for code related embeddings. But we think there's some promise in potentially training our own embeddings and then fine tuning it on particular code bases so it performs better there. So these are both directions we're kind of independently exploring to improve the performance of retrieval. But in the short term, we do have the ability to use kind of re-rankers and more kind of advanced tuning. So if you look in the chat, I think there may be a button you can click which lets you enable re-rankers, which should improve the performance a decent bit. [00:27:24]
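The re-ranker mentioned at the end can be sketched like this: a generic cross-encoder re-score over the top embedding hits. The model name is a common public checkpoint used as an assumption, not necessarily what Cursor runs.

```python
from sentence_transformers import CrossEncoder

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Re-score embedding-retrieval candidates with a cross-encoder, which
    reads the query and each code chunk together and usually ranks better
    than raw embedding similarity alone."""
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]
```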

Swyx: Awesome. [00:27:25]

Alessio: Anything else in the product that we're missing? We have inline generation and inline question asking to the model. You have the chat interface on the right. Yeah. [00:27:36]

Aman: So one thing that our users have found quite helpful is being able to add files or add documentation. So if you want to add Next.js docs, the most recent docs, you just do add Next.js in the chat or in command K and you'll be able to then basically get that information in your context. We have a lot of features that will be coming up quite soon. One that we're quite excited about is basically code interpreter style mode of using the chat. And so I don't mean that, I guess, in the traditional sense of code interpreter. But code interpreter is probably the one example of, as far as I know, the one example of an agent that works really well, that has some sort of kind of product market fit. And I think the reason it works super well is because when you try to get agents to do some massive task (I don't know many people who like reviewing PRs or reviewing large diffs), it's much more fun to kind of be in flow. And I think the way that the code interpreter is able to deal with this is it breaks it down to these kind of small units that are very auditable and understandable. When you ask the model to produce a graph, you just see the graph and then you can kind of tell more or less whether it's wrong. And then you can go and see the code and the code's very understandable. So I think it's pretty important to kind of have the agent do these very small, discrete units and then show the output in a way that's very easy for the user to understand and then go in and fix. And so we're building a kind of flow like that in the chat that should be coming out in the next two weeks, which we're very excited by, because we've done a bunch of experimental stuff with agents. And the big thing has always been this problem where it just produces a bunch of code and it's just so hard to tell whether or not it's correct. It's less efficient because it'll end up having some bugs. And then it would have been better if the user just went and wrote all of it themselves. [00:29:30]

Swyx: There's one approach with a former guest of ours, Itamar, on Codium, whose approach is essentially that you need to develop the spec, the tests, and the source code in harmony kind of together. Well, the spec is the prompt, and then the spec could generate a test or a spec could generate code. And the only way to validate the code is to run it with tests, is kind of his analysis of what the agent space, what the code agent space may look like. [00:29:54]

Aman: I think tests are a pretty promising direction. If you have a really, really rigorous set of tests where you can completely confirm whether or not the agent has done the right thing, I think that solves it. But I think it's only one part of the overall puzzle here. I do think you're going to want the model to... Like the issue is, it's really kind of painful to go and write this massive, massive prompt describing everything. I want to be able to kind of do it in flow and just see a change, then go step by step from there. I think that's just a more fun way of doing it. I think the more fun and more easy to use kind of product will win, assuming the capabilities are about equal. So that's kind of our bet here. Yeah, that's great. [00:30:34]

Swyx: Have you thought about like, so you said you can add docs, which is really cool. And I've thought about this before, but I always get hung up on versioning. You just choose to not care about it and just embed the most current docs? Yes, we embed the most current docs. [00:30:46]

Aman: You can add whatever docs you want, if you just have a URL. You can paste the URL for the docs in. [00:30:53]

Swyx: You give a crawler, yeah. [00:30:54]

Aman: We crawl it in the background and embed it. And so you can have a custom, basically a custom version or whatever version you use. It's stored locally for you. What kind of crawl diff? [00:31:05]

Swyx: Like if you've just written... Yeah, that means you've written a search engine, kind of. [00:31:10]

Aman: It's very, very basic. Docs are very, very easy to crawl relative to other things because it's like, they're all like this kind of sort of markdown-like format. [00:31:19]

Swyx: Yeah. [00:31:20]

Aman: Definitely have not written a crawler for the entire net. [00:31:23]
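The "add docs by URL" flow can be approximated in a few lines. This is a sketch under our own assumptions (`embed` is whatever embedding function you use; Cursor's real crawler, chunking, and local storage are more involved):

```python
import requests

def index_docs(url: str, embed, chunk_size: int = 2000):
    """Fetch a docs page, split it into fixed-size chunks, and embed each
    chunk so it can be retrieved into the prompt later."""
    text = requests.get(url, timeout=30).text
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return [(chunk, embed(chunk)) for chunk in chunks]
```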

Swyx: The other thing on Code Interpreter, we've also done an episode on that. I'm very excited about it. I think it's GPT-4.5, you know, because it's GPT-4 that has been fine-tuned on more code. Yeah. Plus it has inference time capabilities that you cannot do in the traditional LLM setting. Anyway, the most important thing about Code Interpreter is that it has the sandbox. So the main question for you is, are you going to run the sandbox in your environment or do you want to run it on our local machine since you have access to that too? Yeah, I think we want to be very careful with this. [00:31:52]

Aman: You don't want to do sudo rm -rf * or something. Our plan is to run it on the local machine, but always kind of prompting the user whether or not they want it. I think if we want to do things where the agent takes many... So for the Code Interpreter style thing, the great thing is because you're breaking it down to these units, you can kind of batch together a bunch of commands at each step, just kind of ask the user because they're always kind of watching. For agents that are running completely in the background, I think there you probably will need to have some kind of contained environment where it's safe for agents to execute arbitrary code. One pretty bad attack is if one team wanted to, let's say, prompt inject the model, they could just kind of in a piece of code, just like have a comment that said something like, when you're doing this kind of edit, you should do rm -rf or do something really, really dangerous. And then the issue is if an agent is kind of running in the background, and then it does that, and it grabs that piece of information, and then it gets actually successfully prompt injected, it'll just execute that thing. The same actually may be true with documentation, where someone malicious, if they had access to some piece of documentation that other people use, could try to prompt inject agents that are then going and running code and running terminal commands. [00:33:06]

Alessio: Today, people just hijack npm packages. [00:33:09]

Swyx: Yeah, there'll be more of that, I'm sure, shenanigans, as they call it. But yeah, I think probably the safest way is to have sandboxes in the cloud. And yeah, I've been calling this the sort of the agent cloud phenomenon. I think Fly.io, Modal, and E2B are in that space already. And then I think Repl.it is exploring it. It'd be interesting for you guys to get in that game. I have trouble articulating what's different about an agent cloud versus a typical serverless sandbox thing that you can spin up. Basically, I think for people to, if agent cloud is a real category, we have to identify what kinds of feedback we want to give the AI that's different to a human. That's the extent of my thoughts on what this would take. [00:33:52]

Aman: I think the key thing that not enough people are probably doing is giving the AI access to a lot more tools. So the classic example I like to bring up is, if you look at the old AlphaCode model, which went and got 50% on some programming contest, a competition, the 50th percentile of pretty good programmers, right? This was a base model that basically got, I think, something around 28% on HumanEval. And they used this interesting inference strategy of having the model generate a bunch of test cases, and then running the test cases, seeing which one passed. They used some other, there are some other details there where they do clustering and whatever. But the key thing is kind of letting the model generate tests, run the tests on all the outputs that it's generated. And that brings a 28% model to the 50th percentile on Codeforces. GPT-4, you just add a very basic prompt, please complete this Python function, and it gets 85%, 87% on HumanEval. Now who knows how tainted that benchmark is? But assuming it's reasonable, what score do you think GPT-4, with the same kind of inference strategy as AlphaCode, would get on that benchmark? It would do really well. And then GPT-4 is at this level where it can actually not just run the test and use that binary yes or no answer, but it can see the results of the test and modify the code or the tests based on those. And so I think that's just one tool. The other tools you could have access to would be language servers. So this is a great thing with VS Code, where VS Code kind of invented the language server, or the language server protocol. And so as a result, when working with a VS Code fork, we kind of have access to every single part of the language server protocol, which means we can go to definition, get all the symbols in your entire workspace, kind of everything you do in a modern IDE. And what we've been working on is kind of giving these models access to those tools. And that dramatically improves performance, right? Because the way that humans usually will search for something is they'll kind of click around, go to definition, read some code, do all that. But you use the tools in the IDE to search for things more efficiently. And if you're just trying to have a model do a brute force kind of semantic search and get the answer from that, I think it's not going to work nearly as well as kind of an agent that's able to use those tools. [00:36:10]
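The AlphaCode-style filtering described here looks roughly like this as a sketch (a hypothetical harness of our own; in practice you would run the candidates inside a sandbox, for exactly the prompt-injection reasons discussed above):

```python
import subprocess
import tempfile

def filter_by_tests(candidates: list[str], tests: str) -> list[str]:
    """Run every generated candidate solution against model-generated tests
    and keep only the ones whose test run exits cleanly."""
    survivors = []
    for code in candidates:
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code + "\n\n" + tests)
            path = f.name
        try:
            result = subprocess.run(["python", path], capture_output=True, timeout=10)
            if result.returncode == 0:
                survivors.append(code)
        except subprocess.TimeoutExpired:
            pass  # treat hangs as failures
    return survivors
```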

Swyx: Awesome. [00:36:11]

Alessio: And you guys are growing the team right now? [00:36:14]

Aman: Yes, we are. So we are currently five people based in SF and we're looking to hire engineers and designers. We think there's a lot of interesting work that we're doing that's left to be done. So some of it involves model training, kind of training some open source models for things like embeddings or areas where it perhaps is too expensive or not or too slow to use open AI. And then lots of interesting things with pushing these models to kind of the boundaries. So getting GPT-4 to work really well in this kind of agent loop in a way that's really in flow and intuitive for users to use. So yeah, I think lots of exciting work. [00:36:53]

Swyx: Cool. And then maybe to sketch out a little bit more about the company and then we'll zoom out to just general LLM observations. You're also working on a prompt tooling thing called Priompt? Yeah. [00:37:04]

Aman: So this is just an internal tool that we use. It's called Priompt. And we built this because we didn't really find a good way of solving the problem of when you have a variable number of kind of inputs that you want to stuff into the prompt and you have like a fixed length prompt, right? You can only use 4096 tokens. How do you encode rules for how to properly kind of order the inputs that go into it? And Priompt, or priority prompting, is our intermediary solution for this, where you can kind of encode very custom rules into how you build up the prompt based on how, I guess, overflowing it is, right? So let's say I have a bunch of previous chat messages and then I also have the code from the current file. So maybe what you want to do is you want some rules where if everything can fit in, you put it all in. But then you start by like first removing like all the old chat messages. Then once it gets to a certain length, you don't want to remove any more chat messages and you want to start removing parts of the file. And then you want to remove parts of the file in this particular truncation strategy, which tends to work quite well. So this kind of thing where like as you kind of slide the window on how many tokens you're allotted, you can see like the prompt is like very, very differently constructed. So it's like optimal kind of at all sizings. And we found that quite helpful internally. [00:38:23]
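Priompt itself is a JSX/TypeScript library, but the priority idea can be sketched in a few lines of Python. This is our own toy version under stated assumptions (a crude token counter, whole-section dropping); the real thing also supports partial truncation of a single section rather than only dropping whole ones.

```python
def build_prompt(sections, budget, count_tokens=lambda s: len(s) // 4):
    """sections is a list of (priority, text). Drop the lowest-priority
    sections first until the assembled prompt fits the token budget, then
    reassemble the survivors in their original order."""
    indexed = sorted(enumerate(sections), key=lambda x: x[1][0])  # lowest priority first
    while indexed and sum(count_tokens(t) for _, (_, t) in indexed) > budget:
        indexed.pop(0)  # evict the lowest-priority remaining section
    indexed.sort(key=lambda x: x[0])  # restore document order
    return "\n\n".join(t for _, (_, t) in indexed)

# e.g. old chat messages get dropped before the current file does
prompt = build_prompt(
    [(1, "...old chat messages..."),
     (2, "...recent chat messages..."),
     (3, "...current file contents...")],
    budget=4096,
)
```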

Swyx: And you chose the JSX approach. I'm not going to ask you too much about like design choice. I mean, it's popular with React. Fixies also put out AI JSX. Do you find that like helpful? Do you think that like some kind of DSL might emerge for prompting? [00:38:36]

Aman: Yeah, I think it's still pretty early. And it's not clear what the best way to do this is. I think for very lightweight, easy prompting, like you should just use strings. When you're doing kind of prompt engineering, and really like rigorous prompt construction where you can have a bunch of different possible inputs in the prompt, we think JSX makes a lot of sense. It's because it's kind of like website development where you'll have different kind of screen sizes, different kinds of devices that can look at it. And in a similar way, you have different kinds of prompts, right? You have different prompt context lengths. And you basically want, across all different context lengths, to get a very, very good prompt for the model. And so yeah, that's why I think like JSX kind of makes sense. It's not clear if it is like the best way of doing it. I think the jury's still out on that. [00:39:27]

Swyx: One way to deal with the context length model issue is to train your own model that has a very long context length, like magic.dev, which announced a 5 million context length window. I don't know how credible that is. I haven't tried it. But your thoughts? [00:39:42]

Aman: Yeah, I think the issue with context length, long context length right now is that costs scale linearly, right? Costs technically scale quadratically in terms of attention. But the interesting thing is that for really, really large models in terms of flops or actual floating point operations that the models are doing, attention tends to be a pretty negligible part compared to the actual, I guess, feed forward part of the neural network. And so up to like 8K, it tends to look pretty linear. I guess when you're going to like higher and higher context lengths, it starts to get more and more tricky. And then there's some other optimizations or some other difficulties with memory bandwidth that we can get into. It just feels like the key issue is even if it is linear, it's still so expensive, right? Paying for 32,000 tokens at whatever the pricing is right now feels like exorbitantly high. My perspective is that there probably will be at some point in the future, or there might be at some point in the future, like a better approach for really, really long context. Something that looks more kind of recurrent. [00:40:47]
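The claim that attention is a negligible share of compute up to roughly 8K context can be checked with back-of-envelope numbers. These are standard approximations (ignoring the QKV/output projections, which scale with n·d² anyway), not figures from the conversation:

```python
def layer_flops(n_tokens: int, d_model: int):
    """Rough per-layer FLOP split: the n^2 terms come from QK^T and the
    attention-weighted sum over V; the feed-forward block is two matmuls
    through a 4x-wider hidden layer."""
    attention = 4 * n_tokens**2 * d_model
    feed_forward = 16 * n_tokens * d_model**2
    return attention, feed_forward

for n in (2_000, 8_000, 32_000, 128_000):
    attn, ffn = layer_flops(n, d_model=8_192)
    print(f"{n:>7} tokens: attention is {attn / (attn + ffn):.0%} of layer FLOPs")
# attention stays a small share at short contexts and grows to dominate as n increases
```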

Swyx: It feels more elegant. [00:40:48]

Aman: I don't know if it'll happen because I think there are like interesting ways of hacking together or chaining together these language models, even with short prompts. But I'm not super bullish on kind of scaling up attention the way that we're doing right now in like 100, 200K context windows. [00:41:02]

Swyx: Like Claude is doing. Yeah. Are you monitoring like RWKV, which is one of the recurrent approaches? [00:41:07]

Aman: I've been meaning to read that paper. I have not been monitoring that. I looked into a few of the papers from state space, like the state space models. Those are pretty interesting. Can you give an intuition? [00:41:19]

Swyx: Because you seem to be explaining it really well. Why are they different? Why is that interesting? Yeah. [00:41:25]

Aman: I think the interesting thing with, at least with the original state space model, is that you get kind of two benefits. One for training, you get the parallelizability of a transformer and you can kind of run it, I believe in about N log N for some N length sequence. And then for inference, it's also like the way that it's formulated is also somewhat recurrent. So you can kind of store everything in this fixed state. And then because of that, you get, I believe, an O of one kind of cost towards inference. It could be slightly higher, but yeah, it's much less than the O of N cost per token for the transformer. And so that makes it really tractable to then do for very, very long sequences. There's some follow on work with Hungry Hungry Hippos and Hyena. And again, I think the key piece is that for like very, very long sequence lengths, it ends up being N log N rather than N squared. I did say that the cost of like, I guess even linear attention is pretty high, but that's because the 32K model is priced a decent bit higher than the original. It is surprising that Claude, or I'm actually not familiar with Claude's pricing. Is it higher for the 100K one than for the normal? [00:42:31]

Swyx: No, I believe it's the same. [00:42:33]

Aman: That is actually quite surprising. I'm not sure if they're doing attention under the hood because even with like a lot of tricks with 100K or even 200K, I would assume that cost will eventually start to build up. So they might be doing something fancy there. [00:42:46]

Swyx: Well, my guess was ALiBi, which is a trick that replaces proper attention with kind of like an exponentially declining forgetting curve, is what I'm thinking. Someone has to put 100K to the test. I haven't done it. [00:43:00]

Aman: Yeah, they have this graph that looks promising for 200K, but I feel like anecdotally from everything that I've heard, it just seems like it forgets things like they don't actually pay attention to things. [00:43:11]

Alessio: I just open-sourced a small podcast tool yesterday, which is what we use. [00:43:14]

Swyx: We use it to summarize this podcast. [00:43:16]

Alessio: Yeah, I'm looking at my logs, prompt length of all my recent ones is 55,400 tokens, and it works. [00:43:25]

Swyx: How much per call? [00:43:26]

Alessio: Free, because it's not commercial. I'm like, hopefully nobody from Anthropic is listening. But yeah, it works. But I think that's kind of like the sweet spot. And then the completion length is like 1,800, you know, so it stays within the 60K band. But anyway, yeah, curious to see. And I think another thing from your Twitter posts about models that I really like is actually differentiating between the type of workload. I feel like people talk about these models as if anything you do is the same thing, but you posted about GPT-3.5 being cheaper than Llama 2 for completion-heavy workloads. What does that mean? [00:44:07]

Aman: Yeah, so there are different terms, I guess, based on whatever community you're in. So I think in the research community they probably call it pre-filling, which is handling prompt tokens, and then I believe decoding is what they call generating completion tokens. We'll just use prompt tokens and completion tokens. But for prompt tokens, the work is entirely compute bound. And the reason why is the same reason why transformers are so good at being trained in parallel: you can parallelize the entire sequence, or you can parallelize an input, not just along the batch dimension, but the sequence dimension. So that means, let's look at the first layer of the transformer. Imagine that entire layer could fit in memory. I just read that into memory, and then I basically apply the matrix multiplication of the entire sequence on this layer. If you're doing token generation, instead, you have to read the layer, then take the first input, and then you have to read the next layer and do the same with that input. You have to do it all the way to the end of the model. And then you generate the next token and that next token passes through all the layers again. So before, what you were doing is you have all your input tokens in parallel, they're going through the first layer. So you read the first layer, then in parallel, they're going through the second layer. You read the second layer, so on and so forth to the end. But when you're doing it for one token at a time, you read the first layer, second layer, third layer, fourth, blah, blah, blah. Then you do it all over again for the next token. And so as a result, for your sequence length N, you end up using N times more memory bandwidth than compute. [00:45:44]

Swyx: And time as well, like wall clock time. Yeah. [00:45:47]

Aman: I mean, so with wall clock time, it's weird because transformers are far more efficient than... [00:45:53]

Swyx: I'll comment on that because in the RWKV interview that I did, same thing. They actually have a visual of this. So the thing you were trying to describe with words, they actually have a visual and an animation. But it's helpful because once you see it, you're like, oh, okay, that's why it's a different graph. Yeah, exactly. Yeah. [00:46:10]

Aman: So when you're dealing with the prompt, it's completely compute bound. And because GPUs can handle some crazy number of floating point operations per second, it's like almost instant. That's why time to first token feels super instant. And then when you're generating one token at a time, it now becomes completely memory bound where for each token, you're bound by how fast you can read all the weights into memory. So that's like around like 200x slower in general. [00:46:34]
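
A rough roofline-style sketch of this prefill-versus-decode asymmetry, using assumed A100-class numbers (around 312 TFLOP/s dense FP16 and about 2 TB/s of HBM bandwidth) and a hypothetical 70B-parameter model:

```python
PEAK_FLOPS = 312e12      # assumed dense FP16 throughput, FLOP/s
PEAK_BW = 2.0e12         # assumed HBM bandwidth, bytes/s

def step_times(n_params, tokens_in_flight, bytes_per_param=2):
    compute_s = 2 * n_params * tokens_in_flight / PEAK_FLOPS  # ~2 FLOPs per weight per token
    weights_s = n_params * bytes_per_param / PEAK_BW          # weights read once per step
    return compute_s, weights_s

n_params = 70e9
for tokens in (1, 2048):              # 1 = a single decode step, 2048 = prefilling a 2K prompt
    compute_s, weights_s = step_times(n_params, tokens)
    bound = "memory" if weights_s > compute_s else "compute"
    print(f"{tokens:>5} tokens in flight: {bound}-bound "
          f"(compute {compute_s*1e3:6.1f} ms, weight reads {weights_s*1e3:6.1f} ms)")
```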

Swyx: Yeah. So your specific recommendations, which I pulled out from the post, people should read it. It's really good. I feel like the title undersells it a little bit. Yeah. You should not serve Llama 2 for completion heavy workloads. Llama is best for prompt dominated tasks like classification. And I feel like I can run with that. That makes a lot of sense. [00:46:51]

Aman: And re-ranking is one thing we find useful for it internally. [00:46:54]

Swyx: Do you use Llama 2 right now? [00:46:55]

Aman: We don't have it in production, but we've experimented with it for a few things. [00:46:59]

Swyx: You also had an interesting observation because I think we had talked a lot about quantization in the podcast just for running locally or more efficient running. You said quantization and imperfect utilization cancel each other out. Yes. That's a cool observation. Yeah. [00:47:12]

Aman: So this is like a little bit hand wavy, but the core thing is, yeah, we expect that when you don't have like complete utilization, right, you're never going to like saturate all your GPUs. There's going to be some idle time. Like from things that we've experimented with in the past, it ends up being, you know, 50% is a reasonable amount as a more liberal estimate of how much you can get. So the interesting thing about quantization is that there's a bunch of these kind of new quantization libraries that have cropped up and they're all very good at reducing costs for low batch inference when you're memory bound. But the key thing is when you increase the batch size, they actually end up resulting in no real speed ups over FP16. The reason why is because they only quantize the model weights, right? So that operation of kind of reading the model weights when they're now, you know, 4x smaller instead of FP16, they're, you know, 4 bits or something. [00:48:04]

Swyx: It's still the same number of weights. [00:48:06]

Aman: The operation of reading weights ends up being 3, 4x faster. But the issue is when you increase your batch size enough, for large batch inference, the key thing is it now moves back from being memory bound to being compute bound again. And when you're compute bound, quantization of model weights basically does nothing. And so it ends up being effectively the same cost. And then the other interesting thing is it's even worse for small models, or at least the small Llama models, because I believe the smaller ones, relative to model size, have a much bigger KV cache. I'm not sure if the smaller ones use multi-query or grouped-query attention. They might not. [00:48:42]

Swyx: They do not. Only the large ones use. Okay, exactly. [00:48:45]

Aman: Yeah. So then because they use normal multi-head attention, the thing is when your batch size increases enough, then the memory bottleneck is not your small quantized model weights. No, it's actually the KV cache. And so quantizing the model weights effectively will do nothing then. So the key insight there is all these new techniques are fantastic when you're just kind of playing with these models, running them at low batch sizes. But when you really try to increase the batch size and serve it in production, they're probably going to be slower or more expensive than FP16, because there are these optimizations with things like text generation inference or vLLM, with paged attention, which are much, much faster. And so the best that I think you could probably do right now with open source is full 8-bit quantization, which means not just quantizing the weights, but also the actual activations and the KV cache, so that none of those things end up being bottlenecks. [00:49:36]
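
Here is a hedged sketch of that batch-size effect, with assumed Llama-2-7B-like shapes (32 layers, 32 heads, head dimension 128, standard multi-head attention) and an FP16 KV cache; the exact crossover depends on the serving stack, but the trend is the point:

```python
# Why weight-only quantization stops helping at high batch sizes: the FP16 KV
# cache grows with batch x sequence length and eventually dwarfs the 4-bit weights.

n_layers, n_heads, head_dim, seq_len = 32, 32, 128, 2048
kv_bytes_per_token = 2 * n_layers * n_heads * head_dim * 2   # K and V, FP16 values

def memory_gb(n_params, weight_bits, batch_size):
    weights = n_params * weight_bits / 8
    kv_cache = batch_size * seq_len * kv_bytes_per_token
    return weights / 1e9, kv_cache / 1e9

for batch in (1, 8, 64, 256):
    w4, kv = memory_gb(7e9, weight_bits=4, batch_size=batch)
    print(f"batch {batch:>3}: 4-bit weights {w4:5.1f} GB, FP16 KV cache {kv:6.1f} GB")
```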

Swyx: That's a great breakdown. The post goes into much more detail with a lot of math, actually, which I love. And you also spec out some rules of thumb, which I think people can use to figure out their limitations and pricing and all that good stuff. Yeah. [00:49:49]

Aman: One big caveat I'd say is that the other massive benefit of Llama 2 is that you can fine-tune it. [00:49:54]

Swyx: Yeah. Right. Well, you'll be able to fine-tune OpenAI soon enough. We'll see. So we'll just get your general takes on LLM topics, just kind of quick fire, and then we'll go to lightning round. So HumanEval, that is the predominant way to benchmark code models, because OpenAI benchmarks code models that way. There's some issues with it. [00:50:13]

Aman: Yeah. With open source models, and even probably with some closed source models, it's unclear how much of it has actually leaked into the train set, right? So there's a recent model, NewHope, where it looked like they had some leakage, which is why it had really, really good performance. But I think there was an interesting approach taken by PaLM 2, where I think this is actually possible for someone to do right now. I've been meaning to do it at some point, but there's this paper called BabelCode and they have a library which I think literally translates HumanEval into all other languages. And I think that would be a really good test, because the other issue is, a lot of the models that perform really well on HumanEval are pure Python, right? And that doesn't really give you a sense of if it's a good coding model overall. So yeah, I think at some point it would be really helpful if just someone did the work and ran the BabelCode engine and translated HumanEval into all these other languages and then was able to run it. I think that would probably be a better benchmark, but still. I think if the original HumanEval problems leaked, I suspect it would also be helpful for solving the problems translated into other languages. But the issue is it's just so easy to run and anything else is probably going to be quite painful. [00:51:24]

Swyx: Right? Well, it'd be better if there was a sandbox to run it. So agent cloud, hashtag. Hot take on training. Yeah. [00:51:32]

Alessio: Another one from your endless stream of quality Twitter takes: training will look like researchers offloading large-scale training jobs to specialized training companies, a state of the world that resembles chip design and fabrication. Yeah. [00:51:44]

Swyx: How do you think about that? [00:51:45]

Alessio: And obviously Mosaic, who was on the podcast, just got acquired. [00:51:47]

Swyx: So you tweeted that in May 2022, and then one year later Mosaic gets acquired. I think that's a pretty good call. Yeah. [00:51:54]

Aman: I was probably wrong about it in a lot of ways too, because I assumed the future would kind of look like a lot of startups would have their own models. And this is me kind of in the CAD frame of mind where I thought, okay, if you look at GPT-3 at that point, it was just like GPT-3, maybe a little bit 3.5. It wasn't like that good a generalist model. And I thought prompting is not the way to do things. It's just completely fine-tuning or training your own models. And it was also a similar time that we kind of saw a lot of the open source earlier efforts in training models, which proved not that great. I think Bloom and OPT were two models that came around about that time. And if you looked at the OPT logs, they manually tuned their learning rates several times. I think they switched the optimizer from Adam to something really weird where they switched the optimizer in the middle. And don't quote me on this because I could be wrong, but I remember it was like some really, really sketchy stuff done in the middle. And I just thought, wow, if it's this hard, it seems like there's a company to be built around it. The key difference is that there are just massive foundation model companies. And I think most AI product companies are not going to be mostly training their models or mostly using like custom models. It's more so going to look like them kind of using these APIs out of the box. And then maybe using, you know, the fine-tuning endpoints there. [00:53:12]

Alessio: Oh, I mean, it's the same. [00:53:14]

Swyx: So you changed your mind a little bit. [00:53:15]

Aman: I did change my mind a little bit. I assumed like with the CAD thing, I thought, okay, you're gonna need a foundation model for CAD. You're going to need a foundation model. [00:53:22]

Swyx: No, that's old school thinking. [00:53:23]

Aman: Yeah. And now it's just like you have the one generalist model. The one God model. And the one God model transfers fantastically well with everything. Okay, quickly move along. [00:53:31]

Swyx: You had another one, which I loved. The size of all code history on GitHub public repos is 92 terabytes. The size of Google's monorepo is 86 terabytes of much higher quality code. If Google were willing to deploy code models trained on their own data, they would have a noticeable advantage over everyone else. Yeah. [00:53:46]

Aman: Again, this is one thing that I think is probably a little wrong. Because this is based on the BigScience paper. And the BigScience paper basically said they scraped all of GitHub and they got 92 terabytes. And I think if you look closely, which I did kind of after some people pointed out some mistakes, I think GitHub is like a lot, a lot bigger than that. The BigScience paper said they git cloned it. And so I was assuming, okay, git clone means you get the full working tree, right? But if you look a little deeper, I think GitHub is like a lot bigger than people think. My expectation is that GitHub probably has something like five to 10 trillion tokens of code, usable code. And so that's a lot more than what they ended up getting. But yeah, Google still has like a pretty meaningful fraction. [00:54:33]

Swyx: And they just put out IDX, which is somewhat of a competitor. Yeah, yeah. [00:54:37]

Aman: I think it's more like, it looks more like a replit kind of competitor where it's like an in-browser thing. But yeah, I think a lot of people can be viewed as competitors. [00:54:46]

Swyx: But you're very competitive as we established, you know. And then final question, why is the company called AnySphere? And you have this whole manifesto on your landing page on why humans should focus on bigger problems. [00:54:55]

Aman: It's an interesting story where Michael and I were in this program Converge, and two of our friends, Arvid and Sualeh, who we knew reasonably well at MIT. And we knew them because they're some of the best engineers at MIT. And so they were independently kind of working on their own company. It was called AnySphere. And we both independently, after playing with GPT-4, realized, oh, wow, the IDE is the thing to build. After a few months of independently working on it, we realized, okay, like, why are we doing this separately? We should just kind of join forces. And that's kind of what we did. And so right now, the overall company is called AnySphere. But yeah, the product and the core thing is Cursor. It's lovely. [00:55:34]

Swyx: I recommend people actually check out AnySphere.co and read the manifesto because I think it's a broader message to builders out there. Yeah. [00:55:42]

Alessio: Yeah. Let's jump into lightning round. Okay. We got three questions for you. The first one is, what is something that already happened in AI that you thought would take much longer? [00:55:52]

Aman: I think code. Specifically, I think just being generalist at code, where before you had these specialized models, right, where Codex was supposed to be kind of specialized for code. And then there's a general language model, but it's this kind of unification of capabilities towards one model that's not just really good at text, but it's also fantastic at code. I was not expecting the generalist model, I guess, to come super, super soon and be this good at code. [00:56:19]

Swyx: That's why you pivoted or you started your whole company. What do you think is the most interesting unsolved question in AI? [00:56:26]

Aman: I really think it's this kind of long-term memory piece where I think it's possible to get to maybe AGI superhuman level systems that still kind of hack around memory using like something that kind of resembles transformers. But it feels like the more elegant thing is how do you get models that really like continuously learn? Some kind of recurrent based system would be able to do this where there's like a state. But right now, like models can only really learn in context super efficiently. Fine-tuning is incredibly inefficient. It requires tons of data points to actually learn new things. So yeah, I'm really interested to see how we solve this lifelong learning efficiency problem. [00:57:06]

Swyx: Yeah. I'm interested in using knowledge graphs to do that because I think that's kind of like a forgotten piece of the puzzle. And if you could have models update their own knowledge graphs and query their own knowledge graphs, that might be it. I think Llama Index is basically working itself into what that is. Oh, interesting. [00:57:22]

Aman: Yeah. And then there's the techniques where the models directly kind of learn, inside the weights or inside the architecture, how to be able to read from databases, the retrieval-based, like the RETRO-based techniques. Those seemed interesting, but it's surprising you haven't really seen anything from that in a while after that initial paper. [00:57:42]

Alessio: And just to wrap the episode up, what's one message you want everyone to remember and think about as they keep building and exploring in AI? [00:57:49]

Swyx: Yeah. [00:57:50]

Aman: I mean, GPT-4 is now a few months old. At some point we're going to get much, much better models and I think it'll be pretty soon. And so what does the world look like then? And specifically for coding, like what does the world look like when you have another step that's just as large as it was from GPT-3 to GPT-4? I think it's just so incredibly different. I think it just completely changes how people write software. [00:58:14]

Swyx: In what direction though? So I've said my piece on like 4.5 being more inference time. I don't actually know if that's true. That's just my theory. [00:58:22]

Aman: I think the direction that we'll probably see is, I mean, the language models will just get better at doing intense reasoning, right? So they'll be able to tackle harder problems. They'll probably pick up in more nuances and how like software engineering is done. They'll probably have longer context windows. And so I expect, yeah, more agentic type things will end up being more prominent in the future. I don't know how far you can take the agent stuff with a four level model, but I think with like a 4.5 or a 5, I think agent models will work for almost any kind of coding task. At least almost any kind of reasonably well-scoped coding tasks. [00:59:00]

Swyx: Agents are the future. Well, thanks so much for coming in. Thanks Aman. Of course. [00:59:04]

Alessio: Thanks for having me. [00:59:04]



Get full access to Latent Space at www.latent.space/subscribe
2023-08-22
Link to episode

The Mathematics of Training LLMs - with Quentin Anthony of Eleuther AI

Invites are going out for AI Engineer Summit! In the meantime, we have just announced our first Actually Open AI event with Brev.dev and Langchain, Aug 26 in our SF HQ (we'll record talks for those remote). See you soon (and join the Discord)!

Special thanks to @nearcyan for helping us arrange this with the Eleuther team.

This post was on the HN frontpage for 15 hours.

As startups and even VCs hoard GPUs to attract talent, the one thing more valuable than GPUs is knowing how to use them (aka, make GPUs go brrrr).

There is an incredible amount of tacit knowledge in the NLP community around training, and until Eleuther.ai came along you pretty much had to work at Google or Meta to gain that knowledge. This makes it hard for non-insiders to even do simple estimations around costing out projects - it is well known how to trade $ for GPU hours, but trading "$ for size of model" or "$ for quality of model" is less known and more valuable and full of opaque "it depends". This is why rules of thumb for training are incredibly useful, because they cut through the noise and give you the simple 20% of knowledge that determines 80% of the outcome derived from hard-earned experience.

Today's guest, Quentin Anthony from EleutherAI, is one of the top researchers in high-performance deep learning. He's one of the co-authors of Transformers Math 101, which was one of the clearest articulations of training rules of thumb. We can think of no better way to dive into training math than to have Quentin run us through a masterclass on model weights, optimizer states, gradients, activations, and how they all impact memory requirements.

The core equation you will need to know is the following:

`C = τT = 6PD`

Where C is the compute required to train a model (in FLOPs), P is the number of parameters, and D is the size of the training dataset in tokens. This is also equal to τ, the throughput of your machine measured in FLOPs (Actual FLOPs/GPU * # of GPUs), multiplied by T, the amount of time spent training the model.

Taking Chinchilla scaling at face value, you can simplify this equation to be `C = 120(P^2)`. These laws are only true when 1000 GPUs for 1 hour costs the same as 1 GPU for 1000 hours, so it's not always that easy to make these assumptions, especially when it comes to communication overhead.
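
A small sketch of how these two formulas get used in practice; the 115 TFLOP/s per A100 figure comes from later in the episode, while the 7B model and 64-GPU setup are purely illustrative assumptions:

```python
# C = tau * T = 6 * P * D, and the Chinchilla-optimal shortcut C ~= 120 * P^2
# (i.e. D ~= 20P). Hardware numbers are illustrative, not recommendations.

def training_time_days(params, tokens, gpus, tflops_per_gpu=115):
    compute = 6 * params * tokens              # total FLOPs, C = 6PD
    tau = gpus * tflops_per_gpu * 1e12         # cluster throughput in FLOP/s
    return compute / tau / 86_400

params, tokens = 7e9, 20 * 7e9                 # Chinchilla-optimal: D = 20P
print(f"C = {6 * params * tokens:.2e} FLOPs (~= 120*P^2 = {120 * params**2:.2e})")
print(f"~{training_time_days(params, tokens, gpus=64):.1f} days on 64 A100s at 115 TFLOP/s")
```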

There's a lot more math to dive into here between training and inference, which you can listen to in the episode or read in the articles.

The other interesting concept we covered is distributed training and strategies such as ZeRO and 3D parallelism. As these models have scaled, it's become impossible to fit everything in a single GPU for training and inference. We leave these advanced concepts to the end, but there's a lot of innovation happening around sharding of params, gradients, and optimizer states that you should know about in modern LLM training.

If you have questions, you can join the Eleuther AI Discord or follow Quentin on Twitter.

Show Notes

* Transformers Math 101 Article

* Eleuther.ai

* GPT-NeoX 20B

* BLOOM

* Turing NLG

* Mosaic

* Oak Ridge & Frontier Supercomputer

* Summit Supercomputer

* Lawrence Livermore Lab

* RWKV

* Flash Attention

* Stas Bekman

Timestamps

* [00:00:00] Quentin's background and work at Eleuther.ai

* [00:03:14] Motivation behind writing the Transformers Math 101 article

* [00:05:58] Key equation for calculating compute requirements (tau x T = 6 x P x D)

* [00:10:00] Difference between theoretical and actual FLOPs

* [00:12:42] Applying the equation to estimate compute for GPT-3 training

* [00:14:08] Expecting 115+ teraflops/sec per A100 GPU as a baseline

* [00:15:10] Tradeoffs between Nvidia and AMD GPUs for training

* [00:18:50] Model precision (FP32, FP16, BF16 etc.) and impact on memory

* [00:22:00] Benefits of model quantization even with unlimited memory

* [00:23:44] KV cache memory overhead during inference

* [00:26:08] How optimizer memory usage is calculated

* [00:32:03] Components of total training memory (model, optimizer, gradients, activations)

* [00:33:47] Activation recomputation to reduce memory overhead

* [00:38:25] Sharded optimizers like ZeRO to distribute across GPUs

* [00:40:23] Communication operations like scatter and gather in ZeRO

* [00:41:33] Advanced 3D parallelism techniques (data, tensor, pipeline)

* [00:43:55] Combining 3D parallelism and sharded optimizers

* [00:45:43] Challenges with heterogeneous clusters for distribution

* [00:47:58] Lightning Round

Transcription

Alessio: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, writer and editor of Latent Space. [00:00:20]

Swyx: Hey, today we have a very special guest, Quentin Anthony from Eleuther.ai. The context for this episode is that we've been looking to cover Transformers math for a long time. And then one day in April, there's this blog post that comes out that literally is called Transformers Math 101 from Eleuther. And this is one of the most authoritative posts that I've ever seen. And I think basically on this podcast, we're trying to give people an intuition around what are the rules of thumb that are important in thinking about AI and reasoning by AI. And I don't think there's anyone more credible than the people at Eleuther or the people training actual large language models, especially on limited resources. So welcome, Quentin. [00:00:59]

Quentin: Thank you. A little bit about myself is that I'm a PhD student at Ohio State University, starting my fifth year now, almost done. I started with Eleuther during the GPT-NeoX20B model. So they were getting started training that, they were having some problems scaling it. As we'll talk about, I'm sure today a lot, is that communication costs and synchronization and how do you scale up a model to hundreds of GPUs and make sure that things progress quickly is really difficult. That was really similar to my PhD work. So I jumped in and helped them on the 20B, getting that running smoothly. And then ever since then, just as new systems challenges arise, and as they move to high performance computing systems and distributed systems, I just sort of kept finding myself falling into projects and helping out there. So I've been at Eleuther for a little bit now, head engineer there now, and then finishing up my PhD and then, well, who knows where I'll go next. [00:01:48]

Alessio: Awesome. What was the inspiration behind writing the article? Was it taking some of those learnings? Obviously Eleuther is one of the most open research places out there. Is it just part of the DNA there or any fun stories there? [00:02:00]

Quentin: For the motivation for writing, you very frequently see in like the DL training space, like these Twitter posts by like, for example, like Stas Bekman at Hugging Face, you'll see like a Twitter post that's like, oh, we just found this magic number and everything is like 20% faster. He's super excited, but doesn't really understand what's going on. And the same thing for us, we very frequently find that a lot of people understand the theory or maybe the fundamentals of why like AI training or inference works, but no one knows like the nitty gritty details of like, how do you get inference to actually run correctly on your machine split across two GPUs or something like that. So we sort of had all of these notes that we had accumulated and we're sort of sharing among engineers within Eleuther and we thought, well, this would really help a lot of other people. It's not really maybe appropriate for like a paper, but for something like a blog post or technical report, this would actually maybe squeeze a lot of performance out of people's hardware they're already running on. So I guess there are a lot of projects in Eleuther that we're sort of trying to share notes with people in a way that typical institutions don't. They sort of live within that institution and then you go to a different institution and they do something very similar, but without the lessons of the previous. And it's because everyone's trying to do their own special sauce with their own stack. Whereas Eleuther, we don't really have that constraint and we can just share everything to everybody. [00:03:14]

Swyx: Yeah, this is a level of openness that basically very few people actually embrace. One, it's an extra effort to write things down, of course, but two, it is secret sauce and so that not many people do it. And therefore, oftentimes the only way to learn this stuff is to actually work in one of the large model labs. And so you guys are doing a lot. The only other instance where I can think of where people actually open sourced their process was Facebook's OPT. What else is similar, like sort of trade knowledge, but not formal research knowledge? [00:03:45]

Quentin: I would say Bloom. So the Hugging Face Bloom project in big science and all of that, that was very open. I'd say it's the same caliber, if not more detailed than OPT. Other than that, I think there was like a doc from Microsoft on like their Turing NLG. Their paper is pretty relaxed in that it did talk about some of those challenges. Other than like OPT and Bloom and us, I can't think of any. It's a new thing. [00:04:10]

Swyx: It matters that you are going for the sort of good enough rules of thumb, because I think a lot of people try to go for precision and being overly precise actually is not helpful. Right. Yes. [00:04:20]

Quentin: You'll see some like statements in the blog posts that are just like, we think this is about 1.2 in our experience. And, you know, we don't go any further into detail and it would take maybe an extra month for us to chase down every single little piece of memory. But instead, like getting good enough is still helpful to people. [00:04:36]

Alessio: Let's jump into it. The first part of the article, and we'll put this in the show notes so people will be following along with the post. So we don't need to read every single equation and every footnote for it. [00:04:46]

Swyx: Okay. [00:04:46]

Alessio: But the core equation here is that not the cost of compute, but the compute required to train a transformer model is roughly equal to tau times T, where tau is the hardware setup throughput that you have, so the number of GPUs times the actual FLOPs per GPU, and T is the time spent. I think people can visualize that pretty easily. It's basically like how many GPUs do you have and how much do you let them run for? And the thing that comes with it, that people have read before in the Chinchilla paper and the OpenAI scaling laws, is that you can then set this equal to 6PD, where P is the number of parameters in the model and D is the size of the dataset in tokens. So talk a little bit about how people should think about the two. I think a lot of times the focus is on the tokens-to-parameters ratio in the training dataset, and people don't think as much about the actual FLOPs per GPU, which you're going to mention later in the blog post too, in terms of how much you can get out. So how should people think about this when they're building a model, and where should they go with this equation as they're starting to think about training their own transformer-based [00:05:58]

Swyx: model? [00:05:58]

Quentin: You touched a little bit on the fact that people usually start with the dataset. So you have some dataset that you want to train a model on. And then from there, from the 6PD, you should see, okay, I should have about six tokens per parameter. So that determines my model size thereabouts for Chinchilla Optimal. Since then, we've seen that you need more, something like 20 or more than that, to get a good quality model. But the next question that should be on your mind in terms of a systems perspective is how long is it going to take for this model to train and what kind of budget should I expect? So let's say I want some cloud instance for some amount of time and each of them will have some price attached to it. So that's where the throughput comes in. So now that you have this model, this number of parameters, you should map that to a transformer architecture and you should benchmark what throughput you get on your software stack for that type of model. So now you have your flops per second on a single GPU. And then given whatever parallelism scheme, which I'm sure we'll get into, like data parallelism or tensor parallelism or whatever else, how is that flops number going to scale to whatever number of GPUs? And then from there, you're going to get a time. And if you have a time, you have a cost. Those are like the business answers that you'll be able to get using this formula. That's why we sort of split it into the T and the throughput terms so that you can solve for one of them, which is usually get throughput, need time, and from time you get cost. In a nutshell, that's the answer. [00:07:19]
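
A sketch of that dataset-to-cost workflow; the 20 tokens-per-parameter ratio, 115 TFLOP/s of realized throughput, 256 GPUs, and $2 per GPU-hour are illustrative assumptions, not recommendations:

```python
# dataset size -> model size -> total compute -> wall-clock time -> cost

dataset_tokens = 1e12
params = dataset_tokens / 20                   # Chinchilla-style tokens:params ratio (assumed)
compute = 6 * params * dataset_tokens          # C = 6PD

gpus, tflops_per_gpu, dollars_per_gpu_hour = 256, 115, 2.0
seconds = compute / (gpus * tflops_per_gpu * 1e12)
gpu_hours = gpus * seconds / 3600

print(f"{params/1e9:.0f}B params, {compute:.2e} FLOPs")
print(f"~{seconds/86_400:.0f} days on {gpus} GPUs, ~{gpu_hours:,.0f} GPU-hours, "
      f"~${gpu_hours * dollars_per_gpu_hour:,.0f} at the assumed rate")
```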

Alessio: One thing that I noticed, you mentioned some of these laws are only true when a thousand GPUs for one hour cost the same as one GPU for a thousand hours, given that we have a shortage of the biggest GPUs out there. Any thoughts there on how people should prioritize this? [00:07:36]

Quentin: Yeah, so I would say you should find what the minimum number of GPUs is to just fit your model first. The memory bottleneck is your biggest problem if you have a sizable model. If it's a small model, nobody cares. But most models that people care about will need to be split across multiple GPUs. So find the minimum number of GPUs to just fit your one instance of your model and then calculate how long that's going to take. If it's a reasonable amount of time, then you're done. If it takes too long, then you need to start worrying about having multiple instances of that model. I always feel like you should go with the minimum number of GPUs because the more number of GPUs that you have, the more likely it is for things to break. So I would say just find out what time is reasonable for you and then fit the number of GPUs to that and no more. Because people get greedy and they say, if I have twice the GPUs, I can get this done in half the time. And then you end up taking three times the time because everything is breaking every day. And that's when I am up at midnight trying to fix your model that's broken. [00:08:34]

Swyx: We had a previous guest which has invested a lot in their framework for training these things. Would there not be an equivalent open source framework you guys would have made that would help with scaling up GPUs linearly like that? Or is this an oversimplification? [00:08:50]

Quentin: Okay, yeah. So maybe I should step back. Both Mosaic and us have our own sort of software stack recipe that scales well, theoretically. But I'll get to that in a minute. Mosaic is all based off optimizer sharding. So it's based off ZeRO. So you basically perfectly split your model optimizer and your parameters and your gradients across all of the different GPUs. So your aggregate memory is number of parameters divided by number of GPUs. Same thing for optimizer and so on. Whereas we at Eleuther use a Megatron-DeepSpeed based library. And for that, it's a bit more complex. So the efficiency can be a little higher, but it's more prone to failure at the same [00:09:30]

Swyx: time. [00:09:30]

Quentin: So you kind of have to tune it. In both cases, getting back to like the practical case, you should be able to get linear speed up by adding more GPUs. The problem is that there are hardware failures. You tend to have problems with like maybe loss will overflow if you have too many GPUs or maybe one GPU will hang. You might have software issues. You might have synchronization issues. And that's why I'm saying practically that you should take the minimum number of GPUs that you have because those are the easier cases to debug. That make sense? [00:10:00]

Swyx: Yeah. [00:10:00]

Quentin: Any more detail on any specific point? [00:10:02]

Swyx: Not particularly, just because we haven't actually had to debug those things. But I imagine basically there's a lot of return towards encoding these knowledge into software and not repeating it again. So it makes a ton of sense. I think Alessio had more questions before we move too far into high level, more questions on just the equation itself. I think we want to spend time on essentially, this is the central equation of figuring out compute requirements. Yeah. [00:10:25]

Alessio: Another thing in it is that the compute is split between the forward pass and the backward pass, and forward is 2PD, backward is 4PD. Why is that the ratio between the two? Can you explain that? Why is it two and four? [00:10:39]

Quentin: Yeah. [00:10:40]

Alessio: Why is it twice the amount? [00:10:42]

Quentin: Oh, okay. Intuitively for forward pass, you're just moving, you're propagating forward the inputs through the layer. And then in the backward pass, you're doing something a little more complex than that. You're doing back propagation. And I don't think I can explain it intuitively enough to go into more detail on the exact [00:10:58]

Swyx: numbers. Yeah. [00:10:58]

Quentin: That's okay. [00:10:59]

Swyx: I feel like you want to get out a whiteboard and start drawing like, you know. [00:11:02]

Quentin: That's what I would normally do. [00:11:03]

Swyx: Tangents and gradients. It's actually surprisingly low to do the back propagation. Honestly, that's one of the fundamental things I love about the math of deep learning so far that as I've explored it, which is, it's surprisingly efficient as compared to other, I guess, numerical methods you might be exposed to and, you know, college calculus. Yeah. [00:11:22]

Alessio: And I think the other thing is that things sound simple, you know, when people go on Twitter and say, Oh, 20 is like the optimal ratio. And it's like, then it's like, well, why is that the number? And the answer is usually much, much harder, like what we're seeing right now. So I think it's a, it's a good reminder that the numbers are simple, like all the best and most popular, like math equations are like, so elegant. Obviously the proof behind that is, it's not that easy. That's always a good reminder. [00:11:52]

Swyx: I want to put this equation to the test a little bit. We can do this from either GPT-3's perspective or GPT-NeoX, whatever you're more comfortable with. You have this distinction of actual flops versus theoretical flops. And a lot of times when people report the flops it took to train a model, like we just saw in Llama 2, they give an estimate of the amount of flops, and that's what we go with. So GPT-3 took 3.14 times 10 to the power 23 flops. That is the theoretical flops. I want to get to a point where I can sort of work out if a number passes the smell test. And I wonder how to do that because I should be able to plug in this equation, right? I know that GPT-3 was trained on 300 billion tokens. I know the parameter size is 175 billion. Is it just like 6 times 175 times 300? Like I haven't done the math, but what are the nuances here that you might want to call out? [00:12:42]

Quentin: Theoretical flops is usually given from, you have a given set of hardware and this is what you expect your hardware to get. The problem is that in practice, full utilization, that's the key word, right? Because in practice, there are a lot of cases where like you're spending time waiting on data movement from like the GPU to CPU. Or for example, you might be waiting to synchronize across the different GPUs. So there's a lot of idle time basically that you're going to be spending during training. [00:13:05]

Swyx: Smell tests. [00:13:06]

Quentin: I don't know if I have a smell test myself, to be honest, like maybe I'll look at what sort of flops you would expect on an A100. There's sort of just an expected flops for a given GPU that everyone sort of knows what you should expect. So for an A100, that number is somewhere between 100 and 180 teraflops is what you would expect to see. For a V100, like an older GPU, it's something more like 30 to 40. So people sort of know, given the kernels that we're running for deep learning, what sort of flops to expect. And then you sort of compare that to the theoretical flops that people are reporting and see if that matches your expectations. [00:13:47]
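
Running the smell test on the GPT-3 numbers mentioned above; the 140 TFLOP/s realized throughput is an assumption within the 100 to 180 range Quentin quotes, and the 1,024-GPU cluster size is hypothetical:

```python
# C = 6PD with P = 175B parameters and D = 300B tokens should land near the
# reported 3.14e23 FLOPs; dividing by realistic per-GPU throughput gives GPU-days.

P, D = 175e9, 300e9
C = 6 * P * D
print(f"6PD = {C:.2e} FLOPs")                  # ~3.15e23, close to the reported figure

actual_tflops = 140e12                         # assumed realized A100 throughput
gpu_seconds = C / actual_tflops
print(f"~{gpu_seconds / 86_400:,.0f} A100-days, or ~{gpu_seconds / 86_400 / 1024:.0f} days on 1,024 GPUs")
```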

Swyx: Yeah. [00:13:47]

Alessio: And in the article you mentioned for the A100, like if you're seeing below 115 teraflops a second, there's something wrong with your model or hardware. How did you get to 115? Is it just, you know, production observability and like you've seen over months and months and months that like that's the baseline or how do you come up with the numbers like that? Yeah. [00:14:08]

Quentin: For a number like that, we basically, we compared a lot of different frameworks. So like I mentioned before, Mosaic has their own framework and we have our own framework. They all have their own flop counters too, right? And we saw across a bunch of different hardware configurations that if you tune things correctly, you should be getting above 115 in pretty much all cases. So like there are some cases where things are tuned poorly or your system is a little weird, but we've never been able to get a new system and not been able to get above [00:14:35]

Swyx: 115. [00:14:35]

Quentin: If something is below 115, you have something really wrong in your software. But that's really all it is, is just comparing across software stacks and hardware systems. [00:14:44]

Alessio: What about different GPUs? We had George Hotz on the podcast and he talked about AMD cards and how in theory their flops should be much better than some Nvidia cards, but the reality is like the CUDA runtime makes up for it. How should people think about improving that? You know, like do you see, okay, the A100 is like 115 teraflops. I'd rather just stick with this than try and figure out all the kinks of like a better AMD card or any thoughts there? [00:15:10]

Swyx: Right. [00:15:10]

Quentin: Well, that's sort of touching on developer time, right? Which ends up being more expensive because at the end of the day, the AMD and ROCm software stack has a long way to go. I would say most things run there, not particularly efficiently, but you're going to have weird bugs that no one has encountered before. One of the big pluses of going with the Nvidia and PyTorch stack is that there are thousands of GitHub issues with everyone facing the same problem as you and resolving them quickly and in an open source way is probably the biggest benefit of going with the Nvidia software stack right now. AMD has about the same hardware, software, not so much. And they haven't quite got the momentum in the open source realm, for example, to get close. Something like FlashAttention, for example, has spread to more Nvidia GPU types than it has to AMD at all. And waiting on those latest and greatest features to reach AMD is something that's prohibitive to a lot of people, but it's getting there. I'm running a lot of experiments on AMD right now because it's sort of reached the government lab supercomputers now. And so a lot of experiments are going there and it will catch up, I'd say within a few [00:16:14]

Swyx: years. [00:16:14]

Quentin: Awesome. [00:16:15]

Swyx: Maybe just talk about what's available from the government labs. And I heard the original, the origin of Eleuther, started with a grant for TPUs. Is that right? [00:16:24]

Quentin: Yes, that was a little before me, but there was a lot of just grabbing a Google Cloud TPU pod or something like that. A lot of the original TPU work was on Mesh TensorFlow, which is now like an ancient distributed deep learning library. [00:16:36]

Quentin: Eleuther got a grant, an INCITE grant with Oak Ridge last year, and we got quite a bit of Summit compute. So Summit is a V100-based supercomputer. It's got some weirdness to it. So there's six V100 GPUs per node. And we did a lot of experiments there. It's a challenging system to scale to because your interconnect across nodes is kind of slow in comparison to within a node, which I think we'll get to later. But now Oak Ridge has moved to AMD. So the next grant that we're trying to work towards is on Frontier, which has four AMD GPUs per node and again has a slower interconnect across nodes. So we get all of those new challenges again to try and overlap things. But that's just like you have Oak Ridge, you have Lawrence Livermore. There's a lot of government supercomputers that open researchers can apply to for compute, too. It's sort of a new thing. I think we're one of the first; us and LAION, for example, is another organization that's getting compute from government providers and such. They're all moving to AMD as well. And we look forward to exploring that with them. [00:17:42]

Swyx: Yeah. [00:17:43]

Alessio: The computing is definitely, it used to be easy to find the GPU. Now, not as much. So you got to find them anywhere. [00:17:49]

Swyx: Yes. [00:17:49]

Alessio: Let's talk about memory requirements a little bit. So you touched on this a little bit before, and just before this we had Tri Dao on the podcast talking about FlashAttention, and memory speed was one of our main focuses there. But this time we're bound by actual memory size, like the VRAM itself, when it comes to model weights and parameters and optimizer states and all that fun stuff. Let's go through this and Sean, we can take turns. There's a lot to cover here, but maybe we can start from model weights. So one topic we covered a lot in the past is precision and quantization. That's obviously one of the main drivers of memory. You mentioned in the article that most transformers are mixed precision, like FP16 plus FP32 or BF16 plus FP32, and they can be cast down. And you mentioned up to like INT8 without a lot of performance hit. So let's start there and maybe run people through some of the math, like the bytes-per-parameter ratio at different precisions. [00:18:50]

Swyx: Sure. [00:18:51]

Quentin: So when I started deep learning, it was all FP32. You have 32 bits, four bytes per parameter. Things were pretty simple. You didn't have to do any loss scaling at all. But the problem was that you didn't get a whole lot of flops once NVIDIA moved to V100s and introduced Tensor cores. So Tensor cores do all of their computation at FP16 precision. So you're kind of throwing all of those away if you're doing things in FP32. So once the hardware moved to V100, the software moved to like mixed precision and APEX and AMP and such. And one counterintuitive part of mixed precision is that you actually require more memory when you're training, because you need an FP16 copy of the weights and an FP32 copy of the weights. The FP16 copy is where you're doing like your actual computation on the Tensor cores. So it's not uncommon to get double the throughput that you would see before in FP32. And then at each step you update that FP32 copy with the FP16 update. So both need to be stored in memory. The problem with that is that FP16 is very precise but doesn't have a whole lot of range, [00:19:55]

Swyx: dynamic range. [00:19:55]

Quentin: So you have a really big mantissa if you're thinking in terms of like floating point representations, not a whole lot of exponent. So BF16 puts more of the bits from the mantissa back to the exponent. So you have a much higher range and a lower precision. And that gets rid of all of this instability problem and loss scaling and such that anyone familiar with debugging knows how unstable it can be, especially for large scale training. And BF16 does away with a lot of that, but it's only supported on A100s. So you see the back and forth between hardware and software. So every time NVIDIA introduces some new Tensor cores or BF16 support or something like that, the software adapts to support it and then training adapts. And then now you mentioned like INT8 and such. Now we're seeing that you have some model that's been trained in FP16, FP32, whatever else. And then now you want to, with minimal loss in accuracy, quantize that model into a smaller representation like INT8 and now like INT4 and things like that and see what you can get away with. And since deep learning is such a stochastic problem, a lot of those last bits of precision don't really matter, is what we're finding. And I expect that to continue. [00:21:06]

Alessio: And so just to put some numbers to it, when you have FP32, you need four bytes per parameter at inference time to load it in memory. If you have a model quantized down to eight bits, you need one byte per parameter. So for example, on an H100, which has 80 gigabytes of memory, you could fit 70 billion parameters in INT8, but you cannot fit it in FP32 because you would need like 280 gigabytes of memory. So how much does that play into it? Like you mentioned it was all FP32 when you first started. Is it just like a development complexity thing, like going down to FP16 and then INT8? Or if they could get a GPU with like a terabyte of VRAM, would people just load the weights as FP32, or would they still want to quantize them to make them more efficient? Right. [00:22:00]

Quentin: I would say even if you had infinite VRAM, you would still want a quantized model, just a bigger model that's quantized, is what I would say. And that's because, like I was mentioning at the end there, deep learning is very stochastic. You could have all the precision in the world, but ultimately it's meaningless when you still depend so much on what the input is. And you depend so much on little variations, and maybe a few more samples of training data would matter more. A lot of that precision, in a nutshell, doesn't really matter in deep learning. All that matters is the big picture. What is that neuron actually saying? And not the tiny details of what it might be thinking. Oh, I also wanted to mention that even if you have an A100, the actual model size that you could load is quite a bit smaller than what you mentioned. That's because of the KV cache. So the KV cache only matters during inference. Think intuitively: if you're writing a paragraph, you want to remember every single previous word that you've written before you write the next word. So like what is autoregressive language modeling? It's filling in the next word, the next token. So if I say like the dog went to the, and I need to write the next word, I would say park or something. Before I write the next word, my memory is wiped and I have to read the whole thing again. That is life without a KV cache. And a KV cache says, remember everything that I've generated before, as well as all the context before what I've generated. But the memory overhead for a KV cache commonly is either comparable or larger than the model in some cases, if you have a really long context. And I think the exact equation is something like, oh, it's like two times the number of layers, times the number of heads, times the dimension of each head. And then there's two of those. You have one for K, one for V. But that was just a quick aside. Yeah. [00:23:44]
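
A sketch of that KV cache estimate using GPT-3-like shapes (96 layers, 96 heads, head dimension 128, FP16 values), which are assumptions for illustration:

```python
# Per token you store K and V for every layer and head, so cache bytes per token
# ~= 2 * n_layers * n_heads * head_dim * bytes_per_value, times sequence length
# (and batch size).

def kv_cache_gb(n_layers, n_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    per_token = 2 * n_layers * n_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch / 1e9

print(f"{kv_cache_gb(96, 96, 128, seq_len=2048):.1f} GB at 2K context")
print(f"{kv_cache_gb(96, 96, 128, seq_len=32768):.1f} GB at 32K context")
```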

Alessio: I know this is Transformers math, but do you think one of the interesting things about RNNs too is moving away from this KV cache that scales with the sequence length, and having like a fixed-size state instead? I know those are some of the things that people are working on. [00:24:00]

Swyx: Yeah. [00:24:00]

Quentin: So there's a paper that I was involved with called RWKV that I would recommend people read. It is answering this exact question. So how do you get Transformers quality without this quadratic attention overhead that Transformers requires? So it is interesting. I don't know if I can really dive too deep into the technical details there. I'd recommend people read the paper. But yeah. [00:24:23]

Swyx: Yeah. [00:24:23]

Alessio: It's interesting to see if attention is all you need, or maybe attention is all we need, but we need better ways to make it infer in a good way. [00:24:33]

Swyx: We've actually done an unreleased episode with one of the RWKV core members and they call it soft attention or light attention. I forget what they call it, but yeah, just ways to approximate it such that it's linear and not quadratic. That's great. Yeah. [00:24:47]

Quentin: I didn't know that you were involved. [00:24:48]

Swyx: That's great. How did you get involved? Is it just because like everyone just hangs out in Discord and talks about the future of Transformers? Oh yeah. [00:24:55]

Quentin: I mean, the RWKV people specifically are in Eleuther all the time. Like they're very close collaboration with us. And my contribution was we have all of these experiments done by all of these people on RNNs and how they relate to Transformers and how do we turn that into a paper and disseminate that digestibly so that people don't have to read through like a Discord log from a year ago to understand what's going on. [00:25:16]

Swyx: Oh my God. [00:25:16]

Quentin: Just read this paper. So that took some work, but I wasn't a core contributor. So that's why I don't want to go into like the technical details. But yeah, that's how I did. [00:25:24]

Swyx: We'll try to get that RWKV episode out. It seems like there's increasing mentions of it and they are doing pretty important work as far as scaling these models is concerned. Okay. So we discussed inference-type quantization and memory requirements. And then you also had a section on training with a lot of stuff I think mentioned. I think we probably want to spend most of our time on optimizer states and the Adam optimizer. Yeah. What are your takes on it and what should people keep in mind when they deal with these optimizers? Okay. [00:25:57]

Quentin: I would say the Adam optimizer is good at what it does. It's sort of a broad question. So let me think. You have the copy of the weights and then you have your momentum and your variance that [00:26:08]

Swyx: you store. [00:26:08]

Quentin: And like, okay, maybe an intuitive explanation for momentum is that like, let's say you have a canyon and you're trying to get to the bottom. And if you're just doing basic SGD, then every step is going to be an equal size. Whereas if you're using something like Adam with the momentum term, then your steps should be progressively larger because you can see, oh, the general trend is we're heading downwards very quickly. But stepping back from that, since you have all of these extra terms in Adam, you require a lot more memory to store it. Like three times as much memory as SGD. And if you have all of this memory being spent on your optimizer states, then how do you distribute it across GPUs? Because you'll find that what ends up being your bottleneck more than just raw compute, raw flops on a given GPU is your parallelism. And that falls back onto how much model you can fit on a single GPU before you need to split it up across a bunch of GPUs. And then you end up spending time, more time with them talking to each other than actually making progress. So that's why all of this time in the blog post is spent on how do you distribute your model? What are all those different distributed strategies look like? Which ones are more efficient? And given that a lot of your memory is being spent optimizers, how do you distribute that optimizer specifically? Because a lot of people, when they talk about parallelism, they talk about model parallelism, the parameters themselves. In actuality, when you're training, a good portion of your memory is actually spent on optimizer states. So what specific part of that would you like to go into? Would you like to go into like zero or sharded optimizers? [00:27:36]

Swyx: I think the sharded optimizer stuff is really interesting, but I think we're kind of leaving that towards the end, right? Because that's the maybe more advanced distributed sections. Here, I think we're just going for rough intuition for people who've maybe are familiar with the ideas of these optimizers, but haven't actually had to implement them yet. They read your code, but they don't really understand the intuition behind the code. I see. [00:28:00]

Alessio: And Quentin, when you say in the blog post, it says, Adam is magic. How much of it is like actual magic, even to like people like you that are pretty close to the metal, so to speak? Are some of these things just come as gospel? It's like, I know this works, like I'm not touching it. I'm just leveraging it. How much of it are you actually thinking about improving on in your day-to-day work? I see. [00:28:22]

Quentin: So I'm a systems guy. I'm an engineer. And a lot of these things come to me as magic. Adam comes to me as magic. I see it from the gods. I say, this is how a deep learning model is trained. And this is how the next step is calculated. And then I say, okay, how do I make that fast? I would say I do look at ways to improve upon it using things like second order optimizers. So there's a lot of research on there because they're hard to distribute. But the core contribution for me always comes down to someone else has done like some deep learning optimization and I need to make it run fast. So I can't really speak to the motivation of why Adam came about other than like simple, intuitive things like I mentioned with like the momentum. But what matters to me is that Adam takes more memory than SGD, specifically three times. And all of that memory needs to go somewhere and it needs to be split efficiently. [00:29:14]

Swyx: Yeah. [00:29:14]

Alessio: So when you add them all up, you got 12 bytes per parameter with vanilla Adam. [00:29:20]

Swyx: Yeah. [00:29:20]

Alessio: And then you still have the model parameters in memory too. So as you mentioned, you need to keep a copy of both for like an FP32, FP16 mixed setup, a copy at both precision levels. So it's six bytes per parameter. Right. [00:29:36]

Quentin: Taking a step back again, is that like, okay, most people think of your model getting big. So you need to split with model parallelism purely, something like tensor parallelism. But we can see that the model only takes like two bytes per parameter if we're doing FP16. Whereas the optimizer itself requires four bytes per parameter for the model states, four bytes for momentum, four bytes for variance. So what matters more is how do you split your optimizer efficiently and how do you store it efficiently? And something like bitsandbytes, where the optimizer, you got like eight bit Adam, where those optimizer states are only one byte per parameter instead of four or something like that. That is going to give you a much better return on your model training and on your memory overhead required than if you were to, for example, quantize your pure like FP16 model weights down to int8 or something. So for training specifically, your optimizer memory matters a lot. The most in most cases. [00:30:31]
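
As a rough back-of-the-envelope version of that point, here is the same byte accounting in code. The 20B parameter count is just a hypothetical example:

```python
def static_training_memory_gb(n_params, weight_bytes=2, grad_bytes=2, optim_bytes=12):
    """Very rough static memory estimate, ignoring activations.

    weight_bytes: low-precision (FP16/BF16) model copy
    grad_bytes:   gradients, kept in the same low precision
    optim_bytes:  Adam states (FP32 master copy + momentum + variance)
    """
    return n_params * (weight_bytes + grad_bytes + optim_bytes) / 1e9

n = 20e9  # hypothetical 20B-parameter model
print(static_training_memory_gb(n))                 # vanilla mixed-precision Adam: ~320 GB
print(static_training_memory_gb(n, optim_bytes=6))  # 8-bit momentum/variance states: ~200 GB
print(static_training_memory_gb(n, weight_bytes=1)) # int8 weights alone barely helps: ~300 GB
```

The last two lines are the comparison Quentin is making: shrinking the optimizer states moves the total far more than shrinking the low-precision weights does.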

Swyx: Well, yeah. [00:30:31]

Alessio: And before we dive into zero, just to wrap up the items that you're going to shard later. So you have the parameters, you have the optimizer states, and then you have the gradients. Just maybe touch a little bit on that. And then we can talk about how to efficiently load them in GPUs. [00:30:48]

Quentin: So the parameters are the FP32 copies of the parameters. We include them in the optimizer discussion. Some people don't, but just for clarity, it's 12 bytes per param for the optimizer states and four of them are for that FP32 copy of the weights. Four of them are for the momentum. I already went into why it's important to store momentum, but that's also per parameter. You need to store where that parameter is going and where it's been going in the past. You also need to know, okay, we know where it's going, but there's going to be bumps on this canyon that we're going down. So we need to store its variance. How often are those bumps? Should we be focusing more on the momentum? Or is this parameter just kind of jumping around everywhere? Those are all important answers that we need the optimizer to store, and it's per parameter. So that's where all three of those terms come from. And we also include some competing optimizers, bitsandbytes and SGD for example, to show that depending on your optimizer, you may store all or none of these, and in different representations. [00:31:50]
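
For reference, these three buffers map onto the standard Adam update (the textbook formulation from Kingma and Ba, not anything specific to NeoX):

```latex
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t            % momentum: where the parameter has been heading
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2          % variance: how bumpy that direction is
\theta_t = \theta_{t-1} - \eta\, \hat{m}_t / \left(\sqrt{\hat{v}_t} + \epsilon\right),
\qquad \hat{m}_t = m_t / (1-\beta_1^t), \quad \hat{v}_t = v_t / (1-\beta_2^t)
```

Storing m, v, and the FP32 master copy of the weights, each at four bytes, is what gives the 4 + 4 + 4 = 12 bytes per parameter.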

Alessio: I'm looking at the total training memory. You essentially have model memory, optimizer memory, gradient memory, and activation memory. I think that's one of the last discussed things. So maybe just give people a little bit of a view. [00:32:03]

Swyx: Yeah, this is completely new to me. [00:32:05]

Alessio: Activation, you know, recomputation, checkpointing, and all of that. [00:32:08]

Swyx: Right. [00:32:09]

Quentin: So, okay. So to summarize before activation checkpointing, which will be complicated, you have your model params, like I mentioned before, they used to be FP32. Now they're probably BF16, maybe FP16 if it's an older GPU. Then you have your optimizer. That's where a lot of the memory is going. And it's your high precision, usually FP32, copy of the weights. So that's four bytes per param. And then you have, optionally, a couple more terms like we just discussed, like momentum or variance or whatever else, depending on what your optimizer is. Then you have your gradients. So your gradients are the gradient updates that we get after running the forward and backward pass on the model. And that's going to be whatever your low precision copy of the weights is. So like two bytes per param, if you're using FP16 or BF16. And all of those are sort of set in stone. And that overhead is not going to go away for the duration of training. Your gradients might get cleared after you back propagate them, but your optimizer states and your model states aren't going away. That memory overhead will be there. Activation recomputation and activation memory is dynamic. So some people will come and have this problem where the model loads fine for training. But then when you actually run your first iteration, or you run some future iteration or something like that, you run out of memory, seemingly at random. And it's because of these activations that you're computing on the fly. Good summary, or do you want to get into activation recomputation now, or do you want me to touch on anything else? [00:33:35]
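
To show why the activation term behaves differently from the static ones, here is a deliberately crude scaling sketch. The per-layer factor is a stand-in: the real coefficient depends on the architecture and on how much you recompute, as in the Megatron activation-recomputation paper.

```python
def activation_memory_gb(batch, seq_len, hidden, n_layers,
                         bytes_per_elem=2, per_layer_factor=20):
    # Activations scale with batch * sequence length * hidden size per layer,
    # so doubling the batch or the context doubles this term, while the
    # weight / optimizer / gradient memory above does not move at all.
    return batch * seq_len * hidden * n_layers * per_layer_factor * bytes_per_elem / 1e9

print(activation_memory_gb(batch=8,  seq_len=2048, hidden=4096, n_layers=32))   # ~86 GB
print(activation_memory_gb(batch=16, seq_len=2048, hidden=4096, n_layers=32))  # 2x the first
```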

Alessio: Yeah, I was going to say, when is the recomputation happening? How does it decide between recomputing versus storing? And talk a bit more about that, maybe. [00:33:47]

Quentin: Yeah, okay. So there's a lot of different ways to do this, but I would say there are a few main ones. First is a very simple scheme. You recompute everything. Every single activation that you calculate is just going to be either used or thrown away until the end. So in that case, you care very much about memory. You care very little about compute. Maybe this would be a case where you have to distribute across a lot of different GPUs, for example. And your communication speed is really low. Then that might be a good case for you to just recompute everything. It happens rarely, but it happens. Next up would be something like selective recomputation. So in selective recomputation, which Megatron has a good paper on, and I believe the figure that we have in our blog post is from, in that case, you sort of do a weighted decision for each activation. So for really big activation tensors, you decide, is this going to be more expensive to save in terms of memory or to recompute in terms of compute? So that's sort of the smart scheme that Megatron implements. And there's a lot of different heuristics they use. It's probably not worth mentioning off this super long equation on a pod, but you should go and read that paper if you're interested on selective recomputation. And then a really stupid scheme that most people go with, including NeoX, would be something like, instead of doing all of these heuristics, you just say, if my tensor is bigger than X, I throw it away. And you set X to some static number, and that's it. And that is good enough for a lot of cases. [00:35:18]

Swyx: Why is it good enough? [00:35:20]

Quentin: You don't want to store more than, you know, X-sized tensor. And some fall above that, some fall below it. And you're not trying to squeeze. You care more about getting something close enough to what the actual heuristic should be without actually computing the heuristic because you don't want to spend the time writing that heuristic code. [00:35:37]
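
A sketch of that "bigger than X, don't store it" policy, using PyTorch's stock checkpoint utility. The threshold and the helper function are hypothetical, not the actual NeoX heuristic:

```python
import torch
from torch.utils.checkpoint import checkpoint

SIZE_THRESHOLD = 1_000_000  # static cutoff in elements, picked by hand

def run_block(block, x):
    # If this block's activation would be large, recompute it during the
    # backward pass instead of keeping it around; otherwise just store it.
    if x.numel() > SIZE_THRESHOLD:
        return checkpoint(block, x)
    return block(x)

layer = torch.nn.Linear(4096, 4096)
x = torch.randn(8, 512, 4096, requires_grad=True)  # ~16.8M elements -> checkpointed
y = run_block(layer, x)
y.sum().backward()
```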

Swyx: Cool. I think that does take us on a grand tour of the memory math. Is there any sort of high-level takeaway before we go into the distributed stuff? Zero and all that. Perhaps more detail than most people have ever encountered. And so I'll repeat the equation that Alessio mentioned again, which is total training memory now has all these components that you've mapped out for the first time as far as we're concerned. Model memory, optimizer memory, activation memory, gradient memory. We covered quite a few algorithms as to the choices you can make there. Anything else that you want to mention about just memory math? I don't think so. [00:36:11]

Quentin: I think that about covers it. I will say that it's a very different scheme for training and inference. It's common for people to say, oh, BF16 is the best. Done. Whereas a more correct take is that during training, precision matters a bit more. So BF16 will be around longer for training than it will for inference, in which case your model is sort of already baked. And it definitely doesn't need some of those last bits of precision so you can get away much easier with going to int8 for inference rather than training. So everything that you learn for training has to be relearned for inference and vice versa. [00:36:44]

Swyx: There's a third category. You're talking about training versus inference. This third category is emerging with regards to fine-tuning and perhaps parameter-efficient methods of fine-tuning. The naive way to implement fine-tuning is just to do more training. But I don't know if you've developed any intuitions over fine-tuning that's worth inserting here. Any intuitions? If you were to write fine-tuning math, what would go in there? That might be an interesting diff to training math. [00:37:10]

Quentin: I think there's a lot of questions that are unanswered for fine-tuning. For example, we know scaling laws for training. And some people have done scaling laws for fine-tuning. But how does a model that's already been trained on one domain transfer to another in terms of fine-tuning size? How many tokens per parameter should you have for your fine-tuning dataset? Maybe I'm ignorant, but I feel like a lot of those sort of practical questions on how a model can transfer and how a model can learn or grok some new ability that wasn't in its original training dataset is something that I would definitely put inside a fine-tuning blog post. [00:37:45]

Swyx: Something related to perplexity and, I guess, diversity of the tokens that you get. [00:37:49]

Quentin: Yeah, sort of dataset transfer is something that I would be curious in. Learning rate transfer is another one. So your model has some decayed learning rate over the course of training. How does that change for fine-tuning? Things like that. [00:38:00]

Swyx: All right, cool. Thanks for indulging that stuff. Sure. Yeah. [00:38:03]

Alessio: I think after all of this, you can quickly do the math and see that training needs to be distributed to actually work because we just don't have hardware that can easily run this. So let's talk a bit about that. So zero is one of the first things that you mentioned here, which is focused on sharded optimizers. Maybe run people through that and how to think about it. [00:38:25]

Swyx: Sure. [00:38:25]

Quentin: So zero is centered around two communication operations. And the first is scatter. And people should be looking at the zero figure that I think we have. [00:38:35]

Swyx: Yeah. [00:38:36]

Quentin: So there's a figure in the paper with parameters, gradients, and optimizer states that people should be looking at when I'm talking about this. Every GPU is going to get its own equal portion of the slice. And if we're doing... There are different stages of zero, but let's just start off with assuming that it's an equal slice of the optimizer states, gradients, and parameters. That would be zero three, stage three in that case. And we do that with a scatter. And the scatter takes, say, one over N GPUs, plus this offset of that slice goes to that GPU. Now all of the GPUs have an equal slice that's in its rank order. And then during each training step, that GPU is going to wait for all of the other slices to communicate so that we now have a whole pie on that GPU, that single GPU. Once we have that whole pie, we do the forward pass on it. And then we distribute that forward pass to all of the others using a gather. So it's a scatter, reduce-scatter specifically, and then a gather back to all the others. And you do that each step. So the point of it is that you're sharding these states across GPUs. And with the different stages, you'll see in that figure that the optimizer state is taking the most proportion, which is because of what I mentioned before. We're including the FP32 copy and we're doing Adam. So we need those four bytes per param for momentum and for variance. And then zero stage one, which is the most common one, is just optimizer. Zero stage two is optimizer plus gradients. And zero stage three is optimizer, gradients, and model parameters. But it all comes back to this splitting up and then gathering together back and forth over and over. So you get a lot of communication overhead from zero. But the plus part of that is that you can overlap a lot of that movement with computation. [00:40:23]
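
A stripped-down version of that scatter/gather rhythm, written in terms of the collective ops PyTorch exposes. This is just the communication skeleton of a ZeRO stage 3-style step, assuming `torch.distributed` is already initialized and the tensor sizes divide evenly across ranks; it is not DeepSpeed's actual implementation:

```python
import torch
import torch.distributed as dist

def zero_style_step(full_grad: torch.Tensor, owned_param_shard: torch.Tensor, lr=1e-4):
    world = dist.get_world_size()

    # 1. Reduce-scatter: every rank ends up with the summed gradient for
    #    only its own 1/world slice of the parameters.
    grad_shard = torch.empty_like(owned_param_shard)
    dist.reduce_scatter(grad_shard, list(full_grad.chunk(world)))

    # 2. Each rank updates only the shard of parameters (and optimizer
    #    states) that it owns. A real optimizer step would stand in here.
    owned_param_shard -= lr * grad_shard

    # 3. All-gather: reassemble the full updated parameters on every rank
    #    before the next forward pass.
    gathered = [torch.empty_like(owned_param_shard) for _ in range(world)]
    dist.all_gather(gathered, owned_param_shard)
    return torch.cat(gathered)
```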

Alessio: How do you get the optimal number of GPUs to do this on? Is there a way to shard too much as well and put too much overhead? [00:40:31]

Quentin: It depends more on what your interconnect is. Taking a step back, there is synchronization that's required, a lot of it, across all of these GPUs. And those tend to be cumulative. So if you go to too many GPUs on an interconnect that's too slow, then you're going to end up spending more time synchronizing. And that magic number where you spend more time synchronizing is going to be different depending on what your fabric is and what your GPU memory is specifically. Just how small of a slice is each GPU getting? For example, for Summit, that number comes out to be about 20 billion parameters. Now you have 20 billion parameters, and then your magic number of GPUs for that is going to be something like 100 to 200 scale. Beyond that, you're just going to end up spending more time communicating. And the point where your actual flops dip below some number you've predetermined is going to be whatever your sweet spot ends up being. [00:41:24]

Alessio: And then, so this one was like hard for me to go through, so I'm excited to have you run through it, which is a 3D parallelism. [00:41:33]

Swyx: It's fancy, it's cutting edge. [00:41:35]

Alessio: Yeah, let's talk a bit more about that and some of the work. [00:41:38]

Quentin: Okay, 3D parallelism. So what is each dimension? First is the really basic one. That's data parallelism. And data parallelism is you have a copy of the model. Let's say for simplicity, one copy fits on one GPU perfectly. Data parallelism is that now you have two GPUs, so you have one copy on GPU one, one copy on GPU two. Both of them do the forward and backward pass and then synchronize and average the gradients. And then that's a step. Data parallelism for 3D parallelism is actually zero. So it's, you're sharding the optimizer states across all of your different GPUs. Next up is tensor parallelism. Tensor parallelism is you split your model. Like say, if you have two GPUs, you split your model down the middle and each GPU on its tensor specifically is going to do its forward or backward operation on its tensor. And then only when necessary, it'll synchronize that tensor operation with the other GPU. It's a bit more complex than something like pipeline parallelism, which is the third dimension. In pipeline parallelism, let's say you have four layers in your model. And you have four GPUs. You put one layer on each GPU and then GPU one does the forward pass and then sends the output of its activations to GPU two. It does the forward pass, sends activations to three, and you're just moving down a line. That is a naive scheme in that all of the other GPUs are doing nothing while a single GPU is doing its forward or backward pass. So the reason it's called pipeline parallelism is because you're splitting your mini batch into micro batches. So GPU one will do the forward pass on micro batch one and then send to GPU two. And then while GPU two is running on that first micro batch, GPU one is working on the next micro batch. And so you're sort of pipelining the movement and computation of each micro batch. The problem with that is that you need a really big batch size in order to split it up into both mini batches and micro batches. So combining all three of those together, you get a 3D mesh of where each parameter and optimizer state and so on maps to each GPU. And that's 3D parallelism. So before we dive into more details, did that all make sense? What should I jump into more on? [00:43:55]
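
To see how the three dimensions compose, here is a toy mapping from a flat GPU rank to (data, pipeline, tensor) coordinates. The group sizes are made up, and real launchers (Megatron, DeepSpeed) also care about which groups share the fastest interconnect:

```python
def rank_to_3d_coords(rank, tp=2, pp=4, dp=4):
    """Map a flat GPU rank onto (data, pipeline, tensor) parallel coordinates.

    tp: GPUs splitting each layer's tensors (tensor parallel)
    pp: pipeline stages, i.e. groups of layers (pipeline parallel)
    dp: replicas of the whole model, typically ZeRO-sharded (data parallel)
    """
    assert rank < tp * pp * dp
    tp_rank = rank % tp
    pp_rank = (rank // tp) % pp
    dp_rank = rank // (tp * pp)
    return dp_rank, pp_rank, tp_rank

# 32 GPUs -> 4 data-parallel replicas, each a 4-stage pipeline of 2-way tensor parallel
for r in range(32):
    print(r, rank_to_3d_coords(r))
```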

Alessio: I think the main question is, do you need all of the GPUs to be the same to do this? Or can you have mismatching GPUs as well? [00:44:03]

Quentin: Okay, two things matter. If there's a difference in VRAM for the two different kinds of GPUs, then you're going to be bottlenecked by whichever GPU has the lower amount of VRAM because it's going to run out of memory. And then you can't like whatever's left on the larger GPUs is going to be empty. As far as I'm aware, there's no like GPU single GPU aware memory overhead scheme that would account for that. The second problem is that let's say all of your GPUs have the same amount of VRAM, but half of them are really slow. And the problem with that is that those synchronizations that I mentioned earlier are going to kill you. So you're going to move as quickly as your slowest GPU in that case. So in both cases, you end up regressing to your slowest or smallest GPU. So you might as well have the same GPUs for all of them. Otherwise, you're wasting the nicer ones. And that also goes to your CPUs and your interconnect. So going back to the 20 billion parameter model that Eleuther was training, that was on a cluster that was sort of Frankenstein made during COVID when there was all of that shortage of network switches and such like that. So every node had a different network switch. And so you ended up moving at the speed of the slowest switch and getting everything tuned properly so that it's not worse than the slowest switch was challenging and is like a real world problem that sometimes comes up. [00:45:28]

Alessio: Is this work widely accepted? Like I hadn't learned about this before studying for this episode. Is this something that people are still trying and researching? Or is everybody just aware of this and running this in production? [00:45:43]

Quentin: What is this specifically? [00:45:44]

Alessio: Like the sharded optimizers plus the 3D parallelism, bringing the two things together and having this kind of mesh strategy. [00:45:51]

Quentin: I would say that a lot of major GPT-based models use this scheme. A lot of them now are sort of going with just a pure zero scheme. So just a pure sharded. You just shard everything. And then since that's so easy, everyone gets an equal slice. There's no such thing as a pipeline stage. There's no such thing as what tensor should go on which GPU. Instead, we shard everything equally and treat everything equally. It's a much easier problem to debug, to checkpoint, to run training on than it is with this 3D parallel scheme. I say 3D parallel gives you the most control and also the most ways to go wrong. And depending on whether you have more engineers or whether you have more GPUs, that should decide which of these you go with. [00:46:35]

Swyx: It's also not too hard, right? You've basically outlined the five or six different numbers that you need to keep in your head. And it doesn't feel impossible that if you need to achieve that level of control, you've given everybody the main levers to do it with. And that's wonderful. Definitely. [00:46:51]

Quentin: The problem that comes up is like, say, like, okay, GPT-4 came out. Now we have VLLMs. [00:46:57]

Swyx: Whoa, what are VLLMs? Oh, okay. Virtual LLMs, like the Mixture of Experts things? No, like visual. [00:47:03]

Quentin: So now you have like multimodal models and such. How do you distribute that? Do you distribute it in a pipeline stage? And do you just shard it? Do you split the tensor and make a tensor parallel? It's sort of hard to change your model and add new features and such when you have this 3D parallel scheme. That's when I say hard. I mean, it's hard to sort of adapt and modify it to new features. [00:47:26]

Alessio: I know we're at the hour mark, and I think we put our listeners through a very intense class today. So this was great, Quentin. And we're going to definitely link the article so that people can read it and follow along. Any other research that you're working on in this space that you want to shout out? I know one of our usual lightning round questions is, what's the most interesting unsolved question in AI? So curious to hear if you think it's still on the training and inference math optimization side, or are there more areas that people should pay attention to? [00:47:58]

Quentin: I think in my area of research, there are two things that I think people should really care about. And the first is multimodal parallelism and RLHF. We're seeing more and more reinforcement learning coming into the training loop. And so how do you split it so that some GPUs are working on inference and some GPUs are working on training? And like I mentioned before, you have to relearn everything and they have very unique challenges. How do you split up a KV cache during training, for example? Those are challenges that are not well studied, I don't think. And then multimodal, you have like maybe a vision transformer and a text transformer. How do you split those up? Do you split them up equally? Do you put them on separate GPUs or do you just shard everything? And just maybe one GPU will have some vision, some text parameters. And then the second case I would say is that communication is very often a bottleneck. So we talk about 3D parallelism, but a lot of those like, for example, tensor parallelism, you can't go across nodes with. You'll just get killed in communication. So what I'm getting to is how should you compress your communication before it happens? So on the fly compression, you have some buffer that needs to be communicated. You compress it with a GPU kernel, then you send it across the network and then you decompress it, something like that. Making people spend less money on communication fabrics and more on GPUs as intended is sort of a thing that people need to explore. I think those are my two. [00:49:26]

Alessio: Sean, you want to go over the other half of the lightning round before we wrap it up? [00:49:30]

Swyx: That's a good brain dump. Cool. Yeah, I have so many more questions on the multimodal stuff, but that should be for another time. Acceleration, what has already happened in AI that you thought would take much longer? [00:49:42]

Quentin: I would say flash attention. Guys, just talk to Tri. And flash attention is just sort of a really great set of kernels that I thought would take a while to get to us. [00:49:51]

Alessio: Well, Quentin, thank you very much, man. This was super informative and I think hopefully helps demystify a little bit the blog post. I think people open it and it's like a lot of math on it. And I think you walking them through it was super helpful. So thank you so much for coming on. [00:50:07]

Swyx: Of course. [00:50:08]

Quentin: And I'm happy to answer any questions that people have offline if they have them. I do read my email. [00:50:13]

Swyx: Email and Discord. Of course, yeah. [00:50:15]

Quentin: Discord I'm even faster on. [00:50:16]

Alessio: Thank you, everyone. [00:50:18]

Swyx: Thanks, Quentin. [00:50:19]



Get full access to Latent Space at www.latent.space/subscribe
2023-08-16

LLMs Everywhere: Running 70B models in browsers and iPhones using MLC - with Tianqi Chen of CMU / OctoML

We have just announced our first set of speakers at AI Engineer Summit! Sign up for the livestream or email [email protected] if you'd like to support.

We are facing a massive GPU crunch. As both startups and VCs hoard Nvidia GPUs like countries count nuclear stockpiles, tweets about GPU shortages have become increasingly common.

But what if we could run LLMs with AMD cards, or without a GPU at all? There's just one weird trick: compilation. And there's one person uniquely qualified to do it.

We had the pleasure to sit down with Tianqi Chen, who's an Assistant Professor at CMU, where he both teaches the MLC course and runs the MLC group. You might also know him as the creator of XGBoost, Apache TVM, and MXNet, as well as the co-founder of OctoML.

The MLC (short for Machine Learning Compilation) group has released a lot of interesting projects:

* MLC Chat: an iPhone app that lets you run models like RedPajama-3B and Vicuna-7B on-device. It gets up to 30 tok/s!

* Web LLM: Run models like LLaMA-70B in your browser (!!) to offer local inference in your product.

* MLC LLM: a framework that allows any language models to be deployed natively on different hardware and software stacks.

The MLC group has just announced new support for AMD cards; we previously talked about the shortcomings of ROCm, but using MLC you can get performance very close to NVIDIA's counterparts. This is great news for founders and builders, as AMD cards are more readily available. Here are their latest results on AMD's 7900s vs some of the top NVIDIA consumer cards.

If you just can't get a GPU at all, MLC LLM also supports ARM and x86 CPU architectures as targets by leveraging LLVM. While speed performance isn't comparable, it allows for non-time-sensitive inference to be run on commodity hardware.

We also enjoyed getting a peek into TQ's process, which involves a lot of sketching:

With all the other work going on in this space with projects like ggml and Ollama, we're excited to see GPUs becoming less and less of an issue to get models in the hands of more people, and innovative software solutions to hardware problems!

Show Notes

* TQ's Projects:

* XGBoost

* Apache TVM

* MXNet

* MLC

* OctoML

* CMU Catalyst

* ONNX

* GGML

* Mojo

* WebLLM

* RWKV

* HiPPO

* Tri Dao's Episode

* George Hotz Episode

People:

* Carlos Guestrin

* Albert Gu

Timestamps

* [00:00:00] Intros

* [00:03:41] The creation of XGBoost and its surprising popularity

* [00:06:01] Comparing tree-based models vs deep learning

* [00:10:33] Overview of TVM and how it works with ONNX

* [00:17:18] MLC deep dive

* [00:28:10] Using int4 quantization for inference of language models

* [00:30:32] Comparison of MLC to other model optimization projects

* [00:35:02] Running large language models in the browser with WebLLM

* [00:37:47] Integrating browser models into applications

* [00:41:15] OctoAI and self-optimizing compute

* [00:45:45] Lightning Round

Transcript

Alessio: Hey everyone, welcome to the Latent Space podcast. This is Alessio, Partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, writer and editor of Latent Space. [00:00:20]

Swyx: Okay, and we are here with Tianqi Chen, or TQ as people call him, who is assistant professor in ML computer science at CMU, Carnegie Mellon University, also helping to run Catalyst Group, also chief technologist of OctoML. You wear many hats. Are those, you know, your primary identities these days? Of course, of course. [00:00:42]

Tianqi: I'm also, you know, very enthusiastic about open source. So I'm also a VP and PMC member of the Apache TVM project and so on. But yeah, these are the things I've been up to so far. [00:00:53]

Swyx: Yeah. So you did Apache TVM, XGBoost, and MXNet, and we can cover any of those in any amount of detail. But maybe what's one thing about you that people might not learn from your official bio or LinkedIn, you know, on the personal side? [00:01:08]

Tianqi: Let me say, yeah, so normally when I do, I really love coding, even though like I'm trying to run all those things. So one thing that I keep a habit on is I try to do sketchbooks. I have a book, like real sketchbooks to draw down the design diagrams and the sketchbooks I keep sketching over the years, and now I have like three or four of them. And it's kind of a usually a fun experience of thinking the design through and also seeing how open source project evolves and also looking back at the sketches that we had in the past to say, you know, all these ideas really turn into code nowadays. [00:01:43]

Alessio: How many sketchbooks did you get through to build all this stuff? I mean, if one person alone built one of those projects, he'll be a very accomplished engineer. Like you built like three of these. What's that process like for you? Like it's the sketchbook, like the start, and then you think about the code or like. [00:01:59]

Swyx: Yeah. [00:02:00]

Tianqi: So, so usually I start sketching on high level architectures and also in a project that works for over years, we also start to think about, you know, new directions, like of course generative AI language model comes in, how it's going to evolve. So normally I would say it takes like one book a year, roughly at that rate. It's usually fun to, I find it's much easier to sketch things out and then gives a more like a high level architectural guide for some of the future items. Yeah. [00:02:28]

Swyx: Have you ever published these sketchbooks? Cause I think people would be very interested, at least on a historical basis. Like this is the time where XGBoost was born, you know? Yeah, not really. [00:02:37]

Tianqi: I started sketching like after XGBoost. So that's a kind of missing piece, but a lot of design details in TVM are actually part of the books that I try to keep a record of. [00:02:48]

Swyx: Yeah, we'll try to publish them and publish something in the journals. Maybe you can grab a little snapshot for visual aid. Sounds good. [00:02:57]

Alessio: Yeah. And yeah, talking about XGBoost, so a lot of people in the audience might know it's a gradient boosting library, probably the most popular out there. And it became super popular because many people started using them in like a machine learning competitions. And I think there's like a whole Wikipedia page of like all state-of-the-art models. They use XGBoost and like, it's a really long list. When you were working on it, so we just had Tri Dao, who's the creator of FlashAttention on the podcast. And I asked him this question, it's like, when you were building FlashAttention, did you know that like almost any transform race model will use it? And so I asked the same question to you when you were coming up with XGBoost, like, could you predict it would be so popular or like, what was the creation process? And when you published it, what did you expect? We have no idea. [00:03:41]

Tianqi: Like, actually, the original reason that we built that library is that at that time, deep learning just came out. Like that was the time where AlexNet just came out. And one of the ambitious missions that myself and my advisor, Carlos Guestrin, had then is we wanted to, you know, try to test a hypothesis. Can we find alternatives to deep learning models? Because then, you know, there are other alternatives like, you know, support vector machines, linear models, and of course, tree-based models. And our question was, if you build those models and feed them with big enough data, because usually like one of the key characteristics of deep learning is that it's taking a lot [00:04:22]

Swyx: of data, right? [00:04:23]

Tianqi: So we will be able to get the same amount of performance. That's a hypothesis we're setting out to test. Of course, if you look at now, right, that's a wrong hypothesis, but as a byproduct, what we find out is that, you know, most of the gradient boosting libraries out there are not efficient enough for us to test that hypothesis. So I happen to have quite a bit of experience in the past of building gradient boosting trees and their variants. So XGBoost was kind of like a byproduct of that hypothesis testing. At that time, I'm also competing a bit in data science challenges, like I worked on KDDCup and then Kaggle kind of become bigger, right? So I kind of think maybe it's becoming useful to others. One of my friends convinced me to try to do a Python binding of it. That turned out to be like a very good decision, right. Usually when I build it, we feel like maybe a command line interface is okay. And now we have a Python binding, we have R bindings. And then I realized, you know, it started getting interesting. People started contributing different perspectives, like visualization and so on. So we started to push a bit more on to building distributed support to make sure it works on any platform and so on. And even at that time point, when I talked to Carlos, my advisor, later, he said he never anticipated that we'll get to that level of success. And actually, why I pushed for gradient boosting trees, interestingly, at that time, he also disagreed. He thinks that maybe we should go for kernel machines then. And it turns out, you know, actually, we are both wrong in some sense, and Deep Neural Network was the king of the hill. But at least the gradient boosting direction got into something fruitful. [00:06:01]

Swyx: Interesting. [00:06:02]

Alessio: I'm always curious when it comes to these improvements, like, what's the design process in terms of like coming up with it? And how much of it is a collaborative with like other people that you're working with versus like trying to be, you know, obviously, in academia, it's like very paper-driven kind of research driven. [00:06:19]

Tianqi: I would say the XGBoost improvement at that time point was more on like, you know, I'm trying to figure out, right. But it's combining lessons. Before that, I did work on some of the other libraries on matrix factorization. That was like my first open source experience. Nobody knew about it, because you'll find, likely, if you go and try to search for the package SVDFeature, you'll find some SVN repo somewhere. But it's actually being used for some of the recommender system packages. So I'm trying to apply some of the previous lessons there and trying to combine them. The later projects like MXNet and then TVM are much, much more collaborative in a sense that... But, of course, XGBoost has become bigger, right? So when we started that project myself, and then we have, it's really amazing to see people come in. Michael, who was a lawyer, and now he works on the AI space as well, on contributing visualizations. Now we have people from our community contributing different things. So XGBoost even today, right, it's a community of committers driving the project. So it's definitely something collaborative and moving forward on getting some of the things continuously improved for our community. [00:07:37]

Alessio: Let's talk a bit about TVM too, because we got a lot of things to run through in this episode. [00:07:42]

Swyx: I would say that at some point, I'd love to talk about this comparison between extra boost or tree-based type AI or machine learning compared to deep learning, because I think there is a lot of interest around, I guess, merging the two disciplines, right? And we can talk more about that. I don't know where to insert that, by the way, so we can come back to it later. Yeah. [00:08:04]

Tianqi: Actually, what I said, when we test the hypothesis, the hypothesis is kind of, I would say it's partially wrong, because the hypothesis we want to test now is, can you run tree-based models on image classification tasks, where deep learning is certainly a no-brainer right [00:08:17]

Swyx: now today, right? [00:08:18]

Tianqi: But if you try to run it on tabular data, still, you'll find that most people opt for tree-based models. And there's a reason for that, in the sense that when you are looking at tree-based models, the decision boundaries are naturally rules that you're looking at, right? And they also have nice properties, like being able to be agnostic to scale of input and be able to automatically compose features together. And I know there are attempts on building neural network models that work for tabular data, and I also sometimes follow them. I do feel like it's good to have a bit of diversity in the modeling space. Actually, when we're building TVM, we build cost models for the programs, and actually we are using XGBoost for that as well. I still think tree-based models are going to be quite relevant, because first of all, it's really easy to get them to work out of the box. And also, you will be able to get a bit of interpretability and control of monotonicity [00:09:18]

Swyx: and so on. [00:09:19]

Tianqi: So yes, it's still going to be relevant. I also sometimes keep coming back to think about, are there possible improvements that we can build on top of these models? And definitely, I feel like it's a space that can have some potential in the future. [00:09:34]

Swyx: Are there any current projects that you would call out as promising in terms of merging the two directions? [00:09:41]

Tianqi: I think there are projects that try to bring a transformer-type model for tabular data. I don't remember specifics of them, but I think even nowadays, if you look at what people are using, tree-based models are still one of their toolkits. So I think maybe eventually it's not even a replacement, it will be just an ensemble of models that you can call. Perfect. [00:10:07]

Alessio: Next up, about three years after XGBoost, you built this thing called TVM, which is now a very popular compiler framework for models. Let's talk about, so this came out about at the same time as ONNX. So I think it would be great if you could maybe give a little bit of an overview of how the two things work together. Because it's kind of like the model, then goes to ONNX, then goes to the TVM. But I think a lot of people don't understand the nuances. I can get a bit of a backstory on that. [00:10:33]

Tianqi: So actually, that's kind of an ancient history. Before XGBoost, I worked on deep learning for two years or three years. I got a master's before I started my PhD. And during my master's, my thesis focused on applying convolutional restricted Boltzmann machines for ImageNet classification. That is the thing I'm working on. And that was before the AlexNet moment. So effectively, I had to handcraft NVIDIA CUDA kernels on, I think, a GTX 2070 card. It took me about six months to get one model working. And eventually, that model is not so good, and we should have picked a better model. But that was like an ancient history that really got me into this deep learning field. And of course, eventually, we find it didn't work out. So in my master's, I ended up working on recommender systems, which got me a paper, and I applied and got a PhD. But I always want to come back to work on the deep learning field. So after XGBoost, I think I started to work with some folks on this particular MXNet. At that time, it was like the frameworks of Caffe, Theano, PyTorch haven't yet come out. And we're really working hard to optimize for performance on GPUs. At that time, I found it's really hard, even for NVIDIA GPUs. It took me six months. And then it's amazing to see on different hardwares how hard it is to go and optimize code for the platforms that are interesting. So that gets me thinking, can we build something more generic and automatic? So that I don't need an entire team of so many people to go and build those frameworks. So that's the motivation of starting to work on TVM. There is really too much machine learning engineering needed to support deep learning models on the platforms that we're interested in. I think it started a bit earlier than ONNX, but once it got announced, I think it's in a similar time period at that time. So overall, how it works is that TVM, you will be able to take a subset of machine learning programs that are represented in what we call a computational graph. Nowadays, we can also represent a loop-level program ingested from your machine learning models. Usually, you have model formats like ONNX, or in PyTorch, they have FX Tracer that allows you to trace the FX graph. And then it goes through TVM. We also realized that, well, yes, it needs to be more customizable, so it will be able to perform some of the compilation optimizations like fusing operators together, doing smart memory planning, and more importantly, generate low-level code. So that works for NVIDIA and also is portable to other GPU backends, even non-GPU backends [00:13:36]
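
For readers who have never touched TVM, a minimal version of that ONNX-in, compiled-module-out flow looks roughly like this. It uses the classic Relay API; the file name and input name are placeholders, and the exact entry points have shifted across releases (the newer TVM Unity work uses Relax instead):

```python
import onnx
import tvm
from tvm import relay

# Load an ONNX graph exported from whatever framework you trained in.
onnx_model = onnx.load("resnet18.onnx")  # placeholder file name

# Import into TVM's graph-level IR; the key must match the ONNX input name.
mod, params = relay.frontend.from_onnx(onnx_model, shape={"input": (1, 3, 224, 224)})

# Compilation: operator fusion, memory planning, and low-level code generation
# happen here for whatever target you pick (llvm, cuda, vulkan, metal, ...).
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm", params=params)
```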

Swyx: out there. [00:13:37]

Tianqi: So that's a project that actually has been my primary focus over the past few years. And it's great to see how it started from where I think we are the very early initiator of machine learning compilation. I remember there was a visit one day, one of the students asked me, are you still working on deep learning frameworks? I tell them that I'm working on ML compilation. And they said, okay, compilation, that sounds very ancient. It sounds like a very old field. And why are you working on this? And now it's starting to get more traction, like if you say Torch Compile and other things. I'm really glad to see this field starting to pick up. And also we have to continue innovating here. [00:14:17]

Alessio: I think the other thing that I noticed is, it's kind of like a big jump in terms of area of focus to go from XGBoost to TVM, it's kind of like a different part of the stack. Why did you decide to do that? And I think the other thing about compiling to different GPUs and eventually CPUs too, did you already see some of the strain that models could have just being focused on one runtime, only being on CUDA and that, and how much of that went into it? [00:14:50]

Tianqi: I think it's less about trying to get impact, more about wanting to have fun. I like to hack code, I had great fun hacking CUDA code. Of course, being able to generate CUDA code is cool, right? But now, after being able to generate CUDA code, okay, by the way, you can do it on other platforms, isn't that amazing? So it's more of that attitude to get me started on this. And also, I think when we look at different researchers, myself is more like a problem solver type. So I like to look at a problem and say, okay, what kind of tools do we need to solve that problem? So regardless, it could be building better models. For example, while we build XGBoost, we build certain regularizations into it so that it's more robust. It also means building system optimizations, writing low-level code, maybe trying to write assembly and build compilers and so on. So as long as they solve the problem, definitely go and try to do them together. And I also see it's a common trend right now. Like if you want to be able to solve machine learning problems, it's no longer at the algorithm layer, right? You kind of need to solve it from both the algorithm, data, and systems angle. And this entire field of machine learning systems, I think it's kind of emerging. And there's now a conference around it. And it's really good to see a lot more people are starting to look into this. [00:16:10]

Swyx: Yeah. Are you talking about ICML or something else? [00:16:13]

Tianqi: So machine learning and systems, right? So not only machine learning, but machine learning and system. So there's a conference called MLsys. It's definitely a smaller community than ICML, but I think it's also an emerging and growing community where people are talking about what are the implications of building systems for machine learning, right? And how do you go and optimize things around that and co-design models and systems together? [00:16:37]

Swyx: Yeah. And you were area chair for ICML and NeurIPS as well. So you've just had a lot of conference and community organization experience. Is that also an important part of your work? Well, it's kind of expected for academic. [00:16:48]

Tianqi: If I hold an academic job, I need to do services for the community. Okay, great. [00:16:53]

Swyx: Your most recent venture in MLsys is going to the phone with MLC LLM. You announced this in April. I have it on my phone. It's great. I'm running Llama 2, Vicuna. I don't know what other models that you offer. But maybe just kind of describe your journey into MLC. And I don't know how this coincides with your work at CMU. Is that some kind of outgrowth? [00:17:18]

Tianqi: I think it's more like a focused effort that we want in the area of machine learning compilation. So it's kind of related to what we built in TVM. So when we built TVM, that was five years ago, right? And a lot of things happened. We built the end-to-end machine learning compiler that works, the first one that works. But then we captured a lot of lessons there. So then we are building a second iteration called TVM Unity. That allows ML engineers to quickly capture a new model and build optimizations for it on demand. And MLC LLM is kind of like a vertical within MLC. It's more like a vertically driven effort where we go and build tutorials and go and build projects like LLM solutions. So that to really show like, okay, you can take machine learning compilation technology and apply it and bring something fun forward. Yeah. So yes, it runs on phones, which is really cool. But the goal here is not only making it run on phones, right? The goal is making it deploy universally. So we do run on Apple M2 Macs, the 70 billion models. Actually, on a single batch inference, more recently on CUDA, we get, I think, the best performance you can get out there already on the 4-bit inference. Actually, as I alluded earlier before the podcast, we just had a result on AMD. And on a single batch, actually, we can get the latest AMD GPU. This is a consumer card. It can get to about 80% of the 4090, so NVIDIA's best consumer card out there. So it's not yet on par, but thinking about the diversity and what you can enable compared to what you could previously get on that card, it's really amazing what you can do with this kind of technology. [00:19:10]

Swyx: So one thing I'm a little bit confused by is that most of these models are in PyTorch, but you're running this inside a TVM. I don't know. Was there any fundamental change that you needed to do, or was this basically the fundamental design of TVM? [00:19:25]

Tianqi: So the idea is that, of course, it comes back to program representation, right? So effectively, TVM has this program representation called TVM script that contains more like computational graph and operational representation. So yes, initially, we do need to take a bit of effort of bringing those models onto the program representation that TVM supports. Usually, there are a mix of ways, depending on the kind of model you're looking at. For example, for vision models and stable diffusion models, usually we can just do tracing that takes PyTorch model onto TVM. That part is still being robustified so that we can bring more models in. On language model tasks, actually what we do is we directly build some of the model constructors and try to directly map from Hugging Face models. The goal is if you have a Hugging Face configuration, we will be able to bring that in and apply optimization on them. So one fun thing about model compilation is that your optimization doesn't happen only as a soft language, right? For example, if you're writing PyTorch code, you just go and try to use a better fused operator at a source code level. Torch compile might help you do a bit of things in there. In most of the model compilations, it not only happens at the beginning stage, but we also apply generic transformations in between, also through a Python API. So you can tweak some of that. So that part of optimization helps a lot of uplifting in getting both performance and also portability on the environment. And another thing that we do have is what we call universal deployment. So if you get the ML program into this TVM script format, where there are functions that takes in tensor and output tensor, we will be able to have a way to compile it. So they will be able to load the function in any of the language runtime that TVM supports. So if you could load it in JavaScript, and that's a JavaScript function that you can take in tensors and output tensors. If you're loading Python, of course, and C++ and Java. So the goal there is really bring the ML model to the language that people care about and be able to run it on a platform they like. [00:21:37]

Swyx: It strikes me that I've talked to a lot of compiler people, but you don't have a traditional compiler background. You're inventing your own discipline called machine learning compilation, or MLC. Do you think that this will be a bigger field going forward? [00:21:52]

Tianqi: First of all, I do work with people working on compilation as well. So we're also taking inspirations from a lot of early innovations in the field. Like for example, TVM initially, we take a lot of inspirations from Halide, which is just an image processing compiler. And of course, since then, we have evolved quite a bit to focus on the machine learning related compilations. If you look at some of our conference publications, you'll find that machine learning compilation is already kind of a subfield. So if you look at papers in both machine learning venues, the MLC conferences, of course, and also system venues, every year there will be papers around machine learning compilation. And in the compiler conference called CGO, there's a C4ML workshop that also kind of trying to focus on this area. So definitely it's already starting to gain traction and becoming a field. I wouldn't claim that I invented this field, but definitely I helped to work with a lot of folks there. And I try to bring a perspective, of course, trying to learn a lot from the compiler optimizations as well as trying to bring in knowledges in machine learning and systems together. [00:23:07]

Alessio: So we had George Hotz on the podcast a few episodes ago, and he had a lot to say about AMD and their software. So when you think about TVM, are you still restricted in a way by the performance of the underlying kernel, so to speak? So if your target is like a CUDA runtime, you still get better performance, no matter like TVM kind of helps you get there, but then that level you don't take care of, right? [00:23:34]

Swyx: There are two parts in here, right? [00:23:35]

Tianqi: So first of all, there is the lower level runtime, like CUDA runtime. And then actually for NVIDIA, a lot of the moat came from their libraries, like CUTLASS, cuDNN, right? Those library optimizations. And also for specialized workloads, actually you can specialize them. Because a lot of cases you'll find that if you go and do benchmarks, it's very interesting. Like two years ago, if you try to benchmark ResNet, for example, usually the NVIDIA library [00:24:04]

Swyx: gives you the best performance. [00:24:06]

Tianqi: It's really hard to beat them. But as soon as you start to change the model to something, maybe a bit of a variation of ResNet, not for the traditional ImageNet detections, but for latent detection and so on, there will be some room for optimization because people sometimes overfit to benchmarks. These are people who go and optimize things, right? So people overfit the benchmarks. So that's the largest barrier, like being able to get a low level kernel libraries, right? In that sense, the goal of TVM is actually we try to have a generic layer to both, of course, leverage libraries when available, but also be able to automatically generate [00:24:45]

Swyx: libraries when possible. [00:24:46]

Tianqi: So in that sense, we are not restricted by the libraries that they have to offer. That's why we will be able to run Apple M2 or WebGPU where there's no library available because we are kind of like automatically generating libraries. That makes it easier to support less well-supported hardware, right? For example, WebGPU is one example. From a runtime perspective, AMD, I think before their Vulkan driver was not very well supported. Recently, they are getting good. But even before that, we'll be able to support AMD through this GPU graphics backend called Vulkan, which is not as performant, but it gives you a decent portability across those [00:25:29]

Swyx: hardware. [00:25:29]

Alessio: And I know we got other MLC stuff to talk about, like WebLLM, but I want to wrap up on the optimization that you're doing. So there's kind of four core things, right? Kernel fusion, which we talked a bit about in the flash attention episode and the tinygrad one, memory planning, and loop optimization. I think those are like pretty, you know, self-explanatory. For the ones that people have the most questions about, can you quickly explain [00:25:53]

Swyx: those? [00:25:54]

Tianqi: So there are kind of a few different things, right? Kernel fusion means that, you know, if you have an operator like convolutions, or in the case of a transformer like an MLP, you have other operators that follow that, right? You don't want to launch two GPU kernels. You want to be able to put them together in a smart way, right? And as for memory planning, it's more about, you know, hey, if you run like Python code, every time when you generate a new array, you are effectively allocating a new piece of memory, right? Of course, PyTorch and other frameworks try to optimize for you. So there is a smart memory allocator behind the scene. But actually, in a lot of cases, it's much better to statically allocate and plan everything ahead of time. And that's where like a compiler can come in. We need to, first of all, actually for language models, it's much harder because of dynamic shapes. So you need to be able to do what we call symbolic shape tracing. So we have like a symbolic variable that tells you like the shape of the first tensor is n by 12. And the shape of the third tensor is also n by 12. Or maybe it's n times 2 by 12. Although you don't know what n is, right? But you will be able to know that relation and be able to use that to reason about like fusion and other decisions. So besides this, I think loop transformation is quite important. And it's actually non-traditional. Originally, if you simply write a code and you want to get a performance, it's very hard. For example, you know, if you write a matrix multiply, the simplest thing you can do is you do for i, j, k, c, i, j, plus, equal, you know, a, i, k, times b, k, j. But that code is 100 times slower than the best available code that you can get. So we do a lot of transformation, like being able to take the original code, trying to put things into shared memory, and making use of tensor cores, making use of memory copies, and all this. Actually, all these things, we also realize that, you know, we cannot do all of them. So we also make the ML compilation framework as a Python package, so that people will be able to continuously improve that part of engineering in a more transparent way. So we find that's very useful, actually, for us to be able to get good performance very quickly on some of the new models. Like when Llama 2 came out, we'll be able to go and look at the whole, here's the bottleneck, and we can go and optimize those. [00:28:10]
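
Spelled out, the naive triple loop he is describing looks like this; it is a correct matrix multiply, just orders of magnitude slower than the tiled, shared-memory, tensor-core versions a compiler or a hand-tuned library produces:

```python
def naive_matmul(a, b):
    # a is m x k, b is k x n, both plain Python lists of lists.
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c

print(naive_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19.0, 22.0], [43.0, 50.0]]
```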

Alessio: And then the fourth one being weight quantization. So everybody wants to know about that. And just to give people an idea of the memory saving, if you're doing FP32, it's like four bytes per parameter. Int8 is like one byte per parameter. So you can really shrink down the memory footprint. What are some of the trade-offs there? How do you figure out what the right target is? And what are the precision trade-offs, too? [00:28:37]
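
The footprint arithmetic behind that, for a hypothetical 7B-parameter model (weights only, ignoring the KV cache and activations):

```python
def weights_gb(n_params, bits):
    return n_params * bits / 8 / 1e9

n = 7e9  # hypothetical 7B-parameter model
print(weights_gb(n, 32))  # FP32: 28.0 GB
print(weights_gb(n, 16))  # FP16/BF16: 14.0 GB
print(weights_gb(n, 8))   # int8: 7.0 GB
print(weights_gb(n, 4))   # int4: 3.5 GB
```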

Tianqi: Right now, a lot of people also mostly use int4 now for language models. So that really shrinks things down a lot. And more recently, actually, we started to think that, at least in MOC, we don't want to have a strong opinion on what kind of quantization we want to bring, because there are so many researchers in the field. So what we can do is we can allow developers to customize the quantization they want, but we still bring the optimum code for them. So we are working on this item called bring your own quantization. In fact, hopefully MOC will be able to support more quantization formats. And definitely, I think there's an open field that's being explored. Can you bring more sparsities? Can you quantize activations as much as possible, and so on? And it's going to be something that's going to be relevant for quite a while. [00:29:27]

Swyx: You mentioned something I wanted to double back on, which is most people use int4 for language models. This is actually not obvious to me. Are you talking about the GGML type people, or even the researchers who are training the models also using int4? [00:29:40]

Tianqi: Sorry, so I'm mainly talking about inference, not training, right? So when you're doing training, of course, int4 is harder, right? Maybe you could do some form of mixed type precision for inference. I think int4 is kind of like, in a lot of cases, you will be able to get away with int4. And actually, that does bring a lot of savings in terms of the memory overhead, and so on. [00:30:09]

Alessio: Yeah, that's great. Let's talk a bit about maybe the GGML, then there's Mojo. How should people think about MLC? How do all these things play together? I think GGML is focused on model level re-implementation and improvements. Mojo is a language, a superset of Python. You're more at the compiler level. Do you all work together? Do people choose between them? [00:30:32]

Tianqi: So I think in this case, it's great to see the ecosystem become so rich with so many different approaches. In our case, GGML is more like you're implementing something from scratch in C, right? So that gives you the ability to go and customize for each particular hardware backend. But then you will need to write your own CUDA kernels, and write them again for AMD, and so on. So the engineering effort is a bit broader in that sense. Mojo, I have not looked at the specific details yet. I think it's fair to say it's a language, right? I believe there will also be machine learning compilation technologies behind it. So it occupies an interesting place there. In the case of MLC, our take is that we do not want to have an opinion on how, where, or in which language people want to develop and deploy, and so on. And we also realize that there are actually two phases. You want to be able to develop and optimize your model. By optimization, I mean really bringing in the best CUDA kernels and doing some of the machine learning engineering in there. And then there's a phase where you want to deploy it as a part of the app. So if you look at the space, you'll find that GGML is more like, I'm going to develop and optimize in the C language, right? And then deploy in mostly the low-level languages they have. And Mojo is that you develop and optimize in Mojo, right? And you deploy in Mojo. In fact, that's the philosophy they want to push for. In the MLC case, we find that if you want to develop models, the machine learning community likes Python. Python is the language you should focus on. So in the case of MLC, we really want to enable not only defining your model in Python, that's very common, right? But also doing ML optimization, like engineering optimization, CUDA kernel optimization, memory planning, all those things in Python, so that it's customizable and so on. But when you do deployment, we realize that people want a bit of a universal flavor. If you are a web developer, you want JavaScript, right? If you're an embedded systems person, maybe you prefer C++ or C or Rust. And people sometimes do like Python in a lot of cases. So in the case of MLC, we really want to have this vision of: you build a generic optimization in Python, then you deploy that universally onto the environments that people like. [00:32:54]

Swyx: That's a great perspective and comparison, I guess. One thing I wanted to make sure that we cover is that I think you are one of this emerging set of academics that is also very much focused on your artifacts of delivery. Of course. Something we've talked about for three years, being very focused on your GitHub. And obviously you treated XGBoost like a product, you know? And now you're publishing an iPhone app. Okay. Yeah. Yeah. What is your thinking about academics getting involved in shipping products? [00:33:24]

Tianqi: I think there are different ways of making impact, right? Definitely, you know, there are academics that are writing papers and building insights for people so that people can build products on top of them. In my case, I think in the particular field I'm working on, machine learning systems, I feel like we really need to be able to get it into the hands of people so that we really see the problems, right? And we show that we can solve the problems. And it's a different way of making impact. And there are academics that are doing similar things. Like, you know, if you look at some of the people from Berkeley, right? Every few years, they will come up with big open source projects. Certainly, I think it's just a healthy ecosystem to have different ways of making impact. And I feel like really being able to do open source and work with the open source community is really rewarding, because we have real problems to work on when we build our research. Actually, that research comes together and people will be able to make use of it. And we also start to see interesting research challenges that we wouldn't otherwise see, right, if you're just trying to do a prototype and so on. So I feel like it's one interesting way of making impact, making contributions. [00:34:40]

Swyx: Yeah, you definitely have a lot of impact there. And having experience publishing Mac stuff before, the Apple App Store is no joke. It is the hardest compilation, human compilation effort. So one thing that we definitely wanted to cover is running in the browser. You have a 70 billion parameter model running in the browser. That's right. Can you just talk about how? Yeah, of course. [00:35:02]

Tianqi: So I think there are a few elements that need to come in, right? First of all, you know, we do need a MacBook, the latest one, like an M2 Max, because you need the memory to be big enough to cover that. For a 70 billion parameter model, it takes you about, I think, 50 gigabytes of RAM. So the M2 Max, the upper version, will be able to run it, right? And it also leverages machine learning compilation. Again, what we are doing is the same, whether it's running on iPhone, on server cloud GPUs, on AMD, or on MacBook, we all go through that same MLC pipeline. Of course, in certain cases, maybe we'll do a bit of customization for each of them. And then it runs on the browser runtime, this package called WebLLM. So what we do is we take that original model and compile it to what we call WebGPU. And then WebLLM will be able to pick it up. And WebGPU is this latest GPU technology that major browsers are shipping right now. So you can get it in Chrome already. It allows you to access your native GPUs from a browser. And then effectively, that language model is just invoking the WebGPU kernels through there. So actually, when Llama 2 came out, initially, we asked the question about, can you run 70 billion on a MacBook? That was the question we were asking. So first, we actually... Jin Lu, who is the engineer pushing this, he got 70 billion running on a MacBook. We had a CLI version. So in MLC, that runs through a Metal accelerator. So effectively, you use the Metal programming language to get the GPU acceleration. So we found, okay, it works for the MacBook. Then we asked, we had a WebGPU backend. Why not try it there? So we just tried it out. And it's really amazing to see everything up and running. And actually, it runs smoothly in that case. So I do think there are some interesting use cases already in this, because everybody has a browser. You don't need to install anything. I think it doesn't make sense yet to really run a 70 billion model in a browser, because you kind of need to be able to download the weights and so on. But I think we're getting there. Effectively, the most powerful models you will be able to run on a consumer device. It's kind of really amazing. And also, in a lot of cases, there might be use cases. For example, if I'm going to build a chatbot that I talk to and it answers questions, maybe some of the components, like the voice-to-text, could run on the client side. So there are a lot of possibilities for something hybrid that contains an edge component and something that runs on a server. [00:37:47]

Alessio: Do these browser models have a way for applications to hook into them? So that in my app I could use, say, either OpenAI or the local model? Of course. [00:37:56]

Tianqi: Right now, actually, we are building... So there's an NPM package called WebLLM, right? So if you want to embed it into your web app, you will be able to directly depend on WebLLM and use it. We are also having a REST API that's OpenAI-compatible. That REST API, I think, right now, is actually running on a native backend, so that if you have a CUDA server it's faster to run on the native backend. But we also have a WebGPU version of it that you can go and run. So yeah, we do want to have easier integrations with existing applications. And the OpenAI API is certainly one way to do that. Yeah, this is great. [00:38:37]
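As a sketch of what the OpenAI-compatible route looks like from an application, here is a minimal Python call against a hypothetical local server. The base URL, port, and model name are placeholders (assumptions for illustration), not documented MLC defaults; only the request and response shape follow the standard OpenAI chat completions format.

```python
import requests

# Hypothetical local endpoint exposed by an OpenAI-compatible REST server.
BASE_URL = "http://localhost:8000/v1"

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": "Llama-2-7b-chat",  # whichever model the local server has loaded
        "messages": [
            {"role": "user", "content": "Summarize what WebGPU is in one sentence."}
        ],
        "temperature": 0.7,
    },
    timeout=60,
)
resp.raise_for_status()
# Standard OpenAI-style response: the text lives under choices[0].message.content.
print(resp.json()["choices"][0]["message"]["content"])
```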

Swyx: I actually did not know there's an NPM package that makes it very, very easy to try out and use. I want to actually... One thing I'm unclear about is the chronology. Because as far as I know, Chrome shipped WebGPU at the same time that you shipped WebLLM. Okay, yeah. So did you have some kind of secret chat with Chrome? [00:38:57]

Tianqi: The good news is that Chrome is doing a very good job of having early releases. So although the official shipment of Chrome WebGPU was at the same time as WebLLM, you were actually able to try out WebGPU technology in Chrome earlier. There is an unstable version called Canary. I think as early as two years ago, there was a WebGPU version. Of course, it's getting better. So we had a TVM-based WebGPU backend two years ago. Of course, at that time, there were no language models. It was running on less interesting, well, still quite interesting models. And then this year, we really started to see it getting mature and the performance keeping up. So we had a more serious push of bringing the language-model-compatible runtime onto WebGPU. [00:39:45]

Swyx: I think you agree that the hardest part is the model download. Has there been conversations about a one-time model download and sharing between all the apps that might use this API? That is a great point. [00:39:58]

Tianqi: I think it's already supported in some sense. When we download the model, WebLLM will cache it into a special Chrome cache. So if a different web app uses the same WebLLM JavaScript package, you don't need to redownload the model again. So there is already something there. But of course, you have to download the model at least once to be able to use it. [00:40:19]

Swyx: Okay. One more thing just in general before we're about to zoom out to OctoAI. Just the last question is, you're not the only project working on, I guess, local models. That's right. Alternative models. There's GPT4All, there's Ollama that just recently came out, and there's a bunch of these. What would be your advice to them on what's a valuable problem to work on? And what are just thin wrappers around ggml? Like, what are the interesting problems in this space, basically? [00:40:45]

Tianqi: I think making the APIs better is certainly something useful, right? In general, one thing that we do try to push very hard on is this idea of easier universal deployment. So we are also looking forward to having more integration with MLC. That's why we're trying to build APIs like WebLLM and other things. So we're also looking forward to collaborating with all those ecosystems, and working on support to bring in models more universally and keep up the best performance when possible, in a more push-button way. [00:41:15]

Alessio: So as we mentioned in the beginning, you're also the co-founder of OctoML. Recently, OctoML released OctoAI, which is a compute service that basically focuses on optimizing model runtimes and acceleration and compilation. What has been the evolution there? So OctoML started as kind of a traditional MLOps tool, where people were building their own models and you helped them on that side. And then it seems like now most of the market is shifting to starting from pre-trained generative models. Yeah, what has that experience been like for you, and what have you seen as the market evolved? And how did you decide to release OctoAI? [00:41:52]

Tianqi: One thing that we found out is that, on one hand, it's really easy to go and get something up and running, right? But then when you start to consider all the possible availability and scalability issues, and even integration issues, things become kind of interesting and complicated. So we really want to help people get that part easy, right? And now, if we look at the customers we talk to and the market, certainly generative AI is something that is very interesting. So that is something that we really hope to help elevate. And also building on top of the technology we built to enable things like portability across hardware. So you will be able to not worry about the specific details, right? Just focus on getting the model out. We'll work on the infrastructure and other things that help on the other end. [00:42:45]

Alessio: And when it comes to optimizing the runtime: we run an early adopters community, and for most enterprises the issue is how to actually run these models. Do you see that as one of the big bottlenecks now? I think a few years ago it was like, well, we don't have a lot of machine learning talent, we cannot develop our own models. Versus now it's like, there are these great models you can use, but I don't know how to run them efficiently. [00:43:12]

Tianqi: That depends on how you define running, right? On one hand, it's easy to download MLC, you run it on a laptop. But then there are also different decisions, right? What if you are trying to serve a larger volume of user requests? What if that volume changes? What if the availability of hardware changes? Right now it's really hard to get the latest NVIDIA hardware, unfortunately, so everybody's trying to build things using the hardware that's out there. So I think when the definition of running changes, there are a lot more questions around it. And also, in a lot of cases, it's not only about running models, it's also about being able to solve problems around them. How do you manage your model locations, and how do you make sure that you get your model close to your execution environment more efficiently? So definitely a lot of engineering challenges out there that we hope to alleviate, yeah. And also, if you think about the future, definitely I feel like right now, given the technology and the kind of hardware availability we have today, we will need to make use of all the possible hardware available out there. That will include mechanisms for cutting down costs, bringing something to the edge and cloud in a more natural way. So I feel like this is still a very early stage of where we are, but it's already good to see a lot of interesting progress. [00:44:35]

Alessio: Yeah, that's awesome. I would love, I don't know how much we're going to go in depth into it, but what does it take to actually abstract all of this from the end user? You know, like they don't need to know what GPUs you run, what cloud you're running them on. You take all of that away. What was that like as an engineering challenge? [00:44:51]

Tianqi: So I think there are engineering challenges there. In fact, first of all, you will need to be able to support all the kinds of hardware backends you have, right? On one hand, if you look at the NVIDIA libraries, you'll find, not too surprisingly, that most of the latest libraries work well on the latest GPU. But there are other GPUs out there in the cloud as well. So certainly being able to have the know-how and being able to do model optimization is one thing, right? Also infrastructure for being able to scale things up and locate models. And in a lot of cases, we do find that on typical models, it also requires kind of vertical iterations. So it's not about, you know, building a silver bullet and that silver bullet is going to solve all the problems. It's more about, you know, we're building a product, we work with the users, and we find there are interesting opportunities at a certain point. And then our engineers will go and solve that, and it will automatically be reflected in the service. [00:45:45]

Swyx: Awesome. [00:45:46]

Alessio: We can jump into the lightning round until, I don't know, Sean, if you have more questions or TQ, if you have more stuff you wanted to talk about that we didn't get a chance to [00:45:54]

Swyx: touch on. [00:45:54]

Alessio: Yeah, we have talked a lot. [00:45:55]

Swyx: So, yeah. We always would like to ask, you know, do you have a commentary on other parts of AI and ML that is interesting to you? [00:46:03]

Tianqi: So right now, I think one thing that we are really pushing hard for is this question about how far we can bring open source, right? I'm kind of a hacker and I really like to put things together. So I think it's unclear what the future of AI looks like. On one hand, it could be possible that, you know, you just have a few big players, you just talk to those bigger language models and they can do everything, right? On the other hand, one of the things that I'm really excited about and pushing for, and that's one reason why I'm pushing for MLC, is: can we build something where you have different models? You have personal models that know the best movies you like, but you also have bigger models that maybe know more, and you get those models to interact with each other, right? And be able to have a wide ecosystem of AI agents that helps each person while still being able to do things like personalization. Some of them can run locally, some of them, of course, run on a cloud, and how do they interact with each other? So I think that is a very exciting time where the future is yet undecided, but I feel like there is something we can do to shape that future as well. [00:47:18]

Swyx: One more thing, which is something I'm also pursuing, which is, and this kind of goes back into predictions, but also back in your history, do you have any idea, or are you looking out for anything post-transformers as far as architecture is concerned? [00:47:32]

Tianqi: I think, you know, in a lot of these cases, you can find there are already promising models for long contexts, right? There are state space models, like the work from some of our colleagues, for example Albert Gu, who worked on the HiPPO models, right? And then there is an open source project called RWKV. It's a recurrent model that allows you to summarize things. Actually, we are bringing RWKV to MLC as well, so maybe you will be able to see one of those models. [00:48:00]

Swyx: We actually recorded an episode with one of the RWKV core members. It's unclear because there's no academic backing. It's just open source people. Oh, I see. So you like the merging of recurrent networks and transformers? [00:48:13]

Tianqi: I do love to see this model space continue growing, right? And I feel like in a lot of cases, it's just that the attention mechanism is getting changed in some sense. So I feel like definitely there are still a lot of things to be explored here. And that is also one reason why we want to keep pushing machine learning compilation, because one of the things we are trying to push on is productivity for machine learning engineering, so that as soon as some of these models come out, we will be able to, you know, bring them onto those environments that are out there. [00:48:43]

Swyx: Yeah, it's a really good mission. Okay. Very excited to see that RWKV and state space model stuff. I'm hearing increasing chatter about that stuff. Okay. Lightning round, as always fun. I'll take the first one. Acceleration. What has already happened in AI that you thought would take much longer? [00:48:59]

Tianqi: The emergence of conversational chatbot ability is something that kind of surprised me before it came out. This is one piece that I feel originally I thought would take much longer, but yeah, [00:49:11]

Swyx: it happens. And it's funny because like the original, like Eliza chatbot was something that goes all the way back in time. Right. And then we just suddenly came back again. Yeah. [00:49:21]

Tianqi: It's always interesting to think about, but with a kind of different technology [00:49:25]

Swyx: in some sense. [00:49:25]

Alessio: What about the most interesting unsolved question in AI? [00:49:31]

Swyx: That's a hard one, right? [00:49:32]

Tianqi: So I can tell you what I'm excited about. I think I have always been excited about this idea of continuous learning and lifelong learning in some sense. So how does AI continue to evolve with the knowledge that's out there? It seems that we're getting much closer with all those recent technologies. So being able to develop systems that support that, and to think about how AI continues to evolve, is something that I'm really excited about. [00:50:01]

Swyx: So specifically, just to double click on this, are you talking about continuous training? That's like a training. [00:50:06]

Tianqi: I feel like, you know, training, adaptation, it's all similar things, right? You want to think about the entire life cycle, right? The life cycle of collecting data, training, fine-tuning, and maybe having your local context get continuously curated and fed into the models. So I think all these things are interesting and relevant here. [00:50:29]

Swyx: Yeah. I think this is something that people are really asking, you know, right now we have moved a lot into the sort of pre-training phase and off the shelf, you know, the model downloads and stuff like that, which seems very counterintuitive compared to the continuous training paradigm that people want. So I guess the last question would be for takeaways. What's basically one message that you want every listener, every person to remember today? [00:50:54]

Tianqi: I think it's getting more obvious now, but one of the things that I always want to mention in my talks is that, you know, when you're thinking about AI applications, originally people thought about algorithms a lot more, right? Algorithms and models are still very important. But usually when you build AI applications, it takes, you know, both the algorithm side, the system optimizations, and the data curation, right? So it takes a combination of so many facets to bring together an AI system, and being able to look at it from that holistic perspective is really useful when we start to build modern applications. I think it's going to continue to be more important in the future. [00:51:35]

Swyx: Yeah. Thank you for showing the way on this. And honestly, just making things possible that I thought would take a lot longer. So thanks for everything you've done. [00:51:46]

Tianqi: Thank you for having me. [00:51:47]

Swyx: Yeah. [00:51:47]

Alessio: Thanks for coming on TQ. [00:51:49]

Swyx: Have a good one. [00:51:49]



Get full access to Latent Space at www.latent.space/subscribe
2023-08-10
Länk till avsnitt

[AI Breakdown] Summer AI Technical Roundup: a Latent Space x AI Breakdown crossover pod!

Our 3rd podcast feed swap with other AI pod friends! Check out Cognitive Revolution and Practical AI as well.

NLW is the best daily AI YouTube/podcaster with the AI Breakdown. His summaries and content curation are spot on, and he always finds the interesting angle that will keep you thinking. Subscribe to the AI Breakdown wherever fine podcasts are sold! https://pod.link/1680633614

You can also watch on YouTube:

Timestamps

courtesy of summarize.tech

The hosts discuss the launch of Code Interpreter as a separate model from OpenAI and speculate that it represents the release of GPT 4.5. People have found Code Interpreter to be better than expected, even for tasks unrelated to coding. They discuss the significance of this release, as well as the challenges of evaluating AI models, the cultural mismatch between researchers and users, and the increasing value of data in the AI industry. They also touch on the impact of open-source tools, the potential of AI companions, the advantages of Anthropic compared to other platforms, advancements in image recognition and multimodality, and predictions for the future of AI.

* 00:00:00 In this section, the hosts discuss the launch of Code Interpreter from OpenAI and its significance in the development of the AI field. They explain that Code Interpreter, initially introduced as a plugin, is now considered a separate model with its own dropdown menu. They note that people have found Code Interpreter to be better than expected, even for tasks that are not related to coding. This leads them to speculate that Code Interpreter actually represents the release of GPT 4.5, as there has been no official announcement or blog post about it. They also mention that the AI safety concerns and regulatory environment may be impacting how OpenAI names and labels their models. Overall, they believe that Code Interpreter's release signifies a significant shift in the AI field and hints at the possibility of future advanced models like GPT 5.

* 00:05:00 In this section, the speaker discusses the improvements in GPT 4.5 and how it enhances the experience for non-coding queries and inputs. They explain that the code interpreter feature allows for a wider range of use cases that were not possible with previous models like GPT 3.5. Additionally, they highlight the value of the code interpreter in assisting individuals with no coding experience to solve basic coding problems. This feature is likened to having a junior developer or intern analyst that aids in conducting tests and simplifies coding tasks. The speaker emphasizes that GPT 4.5 enables users to be more productive and efficient, especially when dealing with code-related challenges. They also discuss the future direction of AGI, where more time will be dedicated to inference rather than training, as this approach has shown significant improvements in terms of problem-solving.

* 00:10:00 In this section, the speaker discusses how advanced AI models like GPT-4.5 are not just larger versions of previous models but rather employ fundamentally different techniques. They compare the evolution of AI models to the evolutionary timeline of humans, where the invention of tools opened up a whole new set of possibilities. They touch on the difficulty of evaluating AI models, particularly in more subjective tasks, and highlight how perceptions of model performance can be influenced by factors like formatting preferences. Additionally, the speaker mentions the challenges of reinforcement learning and the uncertainty around what the model is prioritizing in its suggestions. They conclude that OpenAI, as a research lab, is grappling with the complexities of updating models and ensuring reliability for users.

* 00:15:00 In this section, the speaker discusses the cultural mismatch between OpenAI researchers and users of OpenAI's products, highlighting the conflicting statements made about model updates. They suggest that OpenAI needs to establish a policy that everyone can accept. The speaker also emphasizes the challenges of communication and the difficulty of serving different stakeholders. They mention the impact of small disruptions on workflows and the lack of immediate feedback within OpenAI's system. Additionally, the speaker briefly discusses the significance of OpenAI's custom instructions feature, stating that it allows for more personalization but is not fundamentally different from what other chat companies already offer. The discussion then transitions to Facebook's release of LLaMA 2, which holds significance both technically and for users, although further details on its significance are not provided in this excerpt.

* 00:20:00 In this section, the introduction of LLaMA 2 is discussed. LLaMA 2 is the first fully commercially usable GPT 3.5 equivalent model, which is a significant development because it allows users to run it on their own infrastructure and fine-tune it according to their needs. Although it is not fully open source, it presents new opportunities for various industries such as government, healthcare, and finance. The discussion also touches upon the open source aspect of LLaMA 2, with the recognition that it has still contributed significantly to the community, as evidenced by the three million dollars' worth of compute and the estimated 15 to 20 million dollars' worth of additional fine-tuning capabilities it brings. The conversation acknowledges the value of open source models and data, while also recognizing the challenges and complexities in striking a balance between openness and restrictions.

* 00:25:00 In this section, the conversation centers around the commoditization of compute and the increasing value of data in the AI industry. While GPU compute is currently in high demand, it is observed that data is what holds the real value in AI. The conversation touches on the history of open source models and how the release of data for models like GPT-J and GPT-Neo signals a shift towards prioritizing data over model weights. The transcript also mentions the caution around data usage, citing examples of copyright concerns with datasets like Books3. The debate arises on whether ML engineers should proactively use open data or wait for permission, with some arguing for proactive usage to avoid holding back progress. The conversation also discusses the importance of terminology and protecting the definition of open source, while recognizing that the functional implications of open data are what matter most.

* 00:30:00 In this section, the conversation revolves around the impact of open-source tools on companies and how it has influenced their approach to AI development. It is noted that companies can no longer just offer a nice user interface (UI) wrapper around an open AI model, as customers are demanding more. The competition has shifted towards other aspects of productionizing AI applications, which is seen as a positive development. The speaker predicts that OpenAI's competitive pressure will lead to opening up their source code and expects interesting advancements to emerge, such as running models locally for unlimited use. Additionally, the conversation touches on the potential of commercially available models, the application of new techniques, and the creativity unlocked by open source. The speaker also mentions the AI girlfriend economy, an area that is often overlooked but has millions of users and significant financial success.

* 00:35:00 In this section, the speaker discusses their prediction about the long-term impact of AI on interpersonal relationships, suggesting that AI companions, such as AI girlfriends or boyfriends, could help address the loneliness crisis and reduce incidents of violence. They also mention the idea of using AI models to improve social interactions and communication skills. However, they highlight that this idea of AI companions may face resistance from older generations who may struggle to accept their legitimacy. The speaker also mentions an example of using AI models to create a mental wellness product in the form of a private journal. Overall, the speaker believes that while AI companions may have potential, they may not completely replace human relationships and interactions.

* 00:40:00 In this section, the speaker discusses their views on Anthropic and the advantages it offers compared to other platforms. They mention that while Anthropic used to position itself as the safer alternative to OpenAI, it was not appealing to many engineers. However, with the introduction of the 100K context window and the ability to upload multiple files, Anthropic has become state-of-the-art in certain dimensions, such as latency and reliability in code synthesis. The speaker also notes that some businesses are choosing to build with the Anthropic API over OpenAI due to these advantages. They believe that Anthropic is finally finding its foothold after being overshadowed by OpenAI for a long time. Additionally, the speaker discusses their experience at the Anthropic hackathon, where they saw developer excitement for the platform. They believe that Anthropic is on its way up and that it paves the way for a multi-model future. However, they also acknowledge that the odds are stacked against Anthropic and that it needs more marketing support and community buy-in. Lastly, the speaker mentions the importance of running chats side by side against different models like Claude and GPT-4.5, and highlights that in their experience, Anthropic wins about 30% of the time, making it a valuable addition to one's toolkit.

* 00:45:00 In this section, the discussion revolves around the advancements in image recognition and multimodality in language models like GPT-4.5. While there was some excitement about these developments, it was noted that relying on model updates alone may not be sufficient, and there is a need to focus on product-level improvements, such as integrating language models into services like Google Maps. However, concerns were raised about the reliability of updates, as evidenced by a regression in Bard's code interpreter functionality. Additionally, other trends in the developer community, like the emergence of AutoGPT projects and the ongoing quest for building useful agents, were highlighted. Finally, there was mention of the growing interest in evaluation-focused tools like LangChain's LangSmith, which aim to monitor the success of prompts and agents.

* 00:50:00 In this section, the speaker discusses the focus on model evaluation and observability, as well as the importance of combining deep industry expertise with AI technology to make improvements. They also touch on the need for creating an information hierarchy between documents and scoring them in specific verticals like finance. The speaker mentions advancements in text-to-image capabilities and expresses interest in character AI and AI-native social media. They mention the possibility of AI personas from Meta and the development of agent clouds optimized for AI agents. They acknowledge that these advancements may raise concerns among AI safety proponents. Overall, there seems to be excitement and exploration around these emerging technologies.

* 00:55:00 In this section, the speakers discuss their predictions and what they are closely watching in the coming months. Alessio believes that there will be more public talk about open source models being used in production, as currently, many perceive them as just toys. He expects companies to start deploying these models and showcasing their usage. Sean predicts the rise of AI engineers as a profession, with people transitioning from informal groups to certified professionals working in AI teams within companies. He mentions that the first AI engineer within Meta has already been announced. Overall, they anticipate a relatively quiet August followed by a resurgence of activity in September, with events like Facebook Connect and continued hackathons driving innovation.

Transcript

all right what is going on how's it going boys great to have you here hey good how are y'all good I I think I'm excited for this yeah no I'm super excited I think uh you know we were just talking a little bit before this that the AI audience right now is really interesting it's sort of on the one hand you have of course the folks who are actually in it who are building in it who are you know or or dabbling because they're in some other field but they're fascinated by it and you know are spending their nights in weekends building and then on the other hand you have the folks who are you know what we used to call non-technical perhaps but who are actively paying attention in a way that I think is very different to the technical evolutions of this field because they have a sense or an understanding that it's so fast moving that the place that they have to be paying attention to is you know what's changing from the standpoint of of developers and Builders so I what we want to do today is kind of reflect on the month of July which had a couple of I think really Keystone events in the context of what it means for the technical development of the AI field and and what you know where it leads how people's Frameworks are changing how people sort of sense that things have changed over the last month and I think that the place to start although we could choose a lot of different examples is with an idea that you guys have spent a lot of time sharing on Twitter and in other places that the launch of code interpreter from openai which is nominally a chat GPT plugin actually represents functionally something closer to the release of GPT 4.5 so maybe we can start by just having you guys sort of explain that idea uh and then we can kind of take it from there yeah I'll maybe start with this one um yeah so quote interpreter was first announced as a plug-in at least in the plugins announcement from March but from the start it was already presented as a separate model because at least when you look in the UI you know you don't go into the charity plugin see why and pick it from a menu plugins it is actually a separate model in in the drop down menu and it is so today and I think um yes it adds on an additional sandbox for running and testing code and iterating on that um and actually you can upload files to it and do operations and files and people are having a lot of fun uploading different batteries and hacking uh to see what the container is and try to break out into the Container um but what really convinced me that it might be a separate model was when people tried it on tasks that were not code and found it better so code interpreter is poorly named not just because you know it just sounds like a like a weird developer Tool uh but they basically it's kind of maybe hiding some progress that openai has made that it's completely not been public about there's no blog post about it what interpreter itself is launched in a support Forum post uh you know low-key it wouldn't even announced by any of the major uh public channels that opening has um and so the leading theory is that you know I've dubbed a gpp 4.5 I think like if they were ever to release an API for that they might retroactively rename it for coin firings in the same way that 3.5 was actually renamed when retracted between three rooms um and I think and since I published that post or tweeted that stuff uh the the leading release now for why they did not do it is because they would piss off all the AI safety people yeah no I mean it would it was 
sort of correspondent obviously like a thing that's happened less just this month but more over the last three months is a total Overton window shift in that AI safety conversation starting from I think about in April or May when um Jeffrey Hinton left Google there has been a big shift in that conversation obviously Regulators are way more active now than they were even a couple months ago and so I do think that there are probably constraints in how you know open AI at any other company in the space feel like they can label or name things and even just as we're recording this today we just saw a trademark for gpt5 which is sort of most likely I think just um you know dotting the eyes and crossing the t's as a company because they're eventually going to have a gpt5 um I I would be very shocked if it I would be very shocked at this point if there are any models that are clearly ahead of gpt4 that don't that that come out before there is some pretty clear guidance from the US government around what it looks like to release more advanced models than gpt4 so it's an interesting interesting moment I guess let's talk about what functionally it means for it to be you know that much better better enough that we would call it GPT 4.5 and maybe what might be useful is breaking that apart into how it is improving the experience for non-coding queries or you know or or or or or inputs and then separately you know how it is made uh to chat gbt as a as a as a coding support tool different as well I think there's a lot of things to think about so one models are usually benchmarked against certain tasks and you know that works for development but then there's the reality of the model that you know if you ask for example mathematical question the like gpd3 3.5 you don't really get good responses because of how um digits are tokenized in the model so it's hard for the models to actually reason about numbers but now that you put a code interpreter in it all of a sudden it's not a map in the tokenizer in the latent space question it's like can you write code that answers the math question so that kind of enables a lot more use cases that are just not possible with the Transformer architecture of the underlying model and then the other thing is that when it first came out people were like oh this is great for developers it's like I know what to do I just ask it but there's this whole other side of the water which is hey I have this like very basic thing you know how I'm a software engineer but background you know how sometimes people that have no coding experience come to you and it's like hey I know this is like really hard but could you help me do this and it's like it's really easy and sometimes it and sometimes they think it's easy and it's hard but uh code interpreter enables that whole um space of problems to be solved independently by people so it's kind of having you know Sean talked about this before about um some of these models being like a junior developer that you have on staff for you to be more productive this is similar for non-business people it's like having Junior you know whatever like a intern analyst that helps you do these tests that are not even like software engineering tasks it's more like code is just a language used to express them it's like a pretty basic stuff sometimes uh but you just cannot cannot do it without so uh for me the gbd4 4.5 thing is less about you know is this a new model that is like built after gbd4 it's more about capability so if you have gbt4 versus 4.5 you're 
probably gonna get more stuff done with 4.5 just because of like the code interpreter Peace So for me that's enough to use the code name but as you said Sam Allman said they're not training the next model so they said this is 4.5 you would have like it would go back to Washington DC and be in front of Congress and have to talk about it again sorry yeah um well one thing that I always want to impress upon people is we're not just talking about like yes it is writing code for you but actually you know if you step back away from the code and just think about what it's doing is it's having the ability to spend more Insurance time on harder problems and it matches what uh we do when we are faced with difficult problems as well because right now any llm and these before code interpreter any llm if you give it a question like what is one plus two it'll it'll take the same amount of time to respond as uh something like prove the Black Shoals theorem right like uh and that should not be the case actually we should take more time to think when we are considering harder problems um and I think what I think the next Frontier and why I called it 4.5 is not just because it has had extra training it's not just because it has the coding environment and also because there's a general philosophy and move that I see on my open EI um and the people that it hires that so in my blog post I called out gong who like I first slowly met so it's kind of awkward to talk about it like I guess a friend or a friend of a friend um but it's true that I have met multiple people not opening I have specifically been hired to work on more inference time uh optimizations as compared to trading time um and I think that is the future for gpd5s right so the reason you the reason I think about this working client is that this is the direction of AGI that we're going to spend more time on inference um and uh it just makes a whole lot of sense when you look at gnomes background working on the uh the broadest and then Cicero um all of which is just consistently the same result which is every second or millisecond extra spent on inference it's worth like 10 000 of that of of that in training especially when you can vary it based on the problem difficulty um and this is basically uh ties back to the origin of open AI which originally started playing games they used to play DotA they used to play uh you know all sorts of all sorts of games in sort of those reinforcement learning environments and the typical way that your program these AI is doing doing uh doing these games is when they have lots of branches and you take more time to Circle and um and figure out what the optimal strategy is and when there's not that many branches to to go down then you just take the shortcut in uh you have to give to give the right answer but varying the inference time is the integration here one of the things that it it seems and this what you just described I think aligns with this is I think there's a perception that uh more advanced models are just going to be bigger data sets with more of the same type of training versus sort of fundamentally different techniques or different areas of emphasis that go beyond just how big the data set is and so you know one of the things that strikes me listening to or kind of observing how code interpreter works is it almost feels like a break in The evolutionary timeline of gbt because it's like GPT with tools right unless you just kind of described it it's like it doesn't know about math it doesn't have to know 
about math if it can write code to figure out the math right so what it needs is the tool of being able to write code and that allows it to figure something out and that is akin to you know humans are evolving for Millennia not using tools then all of a sudden someone picks up a rock and this whole entire set of things that we couldn't do before just based on our own evolutionary pathway are now open to us because of the use of the tool I don't think it's a Perfect Analogy but it does feel somewhat closer to that than just again like it's a little bit better than 3.5 so we called it four it's a little bit better than four so we called it 4.5 kind of a mental framework yeah noise I made there I guess sort of the the another big topic that relates to this that was subject of a lot of conversation not just this month that has been for a couple months is this question of whether gpt4 has gotten worse or whether it's been nerfed and there was some research that came out around that with maybe um variable variable uh sort of feelings around it but what did you guys make of that whole conversation I think evals are one of the hardest things in the space so I've had this discussion with Founders before it's really easy we always bring up co-pilot as one example of like Cutting Edge eval where they not not only look at how much um of their suggestions you accept but also how much of the code is still in a minute after three minutes after five minutes after it's really easy to do for code but like for more open and degenerative tasks it's kind of hard to say what's good and what isn't you know like if I'm asking to write the show notes for our podcast which has never been able to do um how do you how do you email that it's really hard so even if you read through through the paper that uh Ling Zhao and mate and James wrote a lot of things are like yeah they're they're worse but like how do you really say that you know like sometimes it's not kind of you know cut and dry like sometimes it's like oh the formatting changed and like I don't like this formatting as much but if the formatting was always the same to begin with would you have ever complained you know there's there's a lot of that um and I think with llama too we've seen that sometimes like rlh traffic can like go wrong in terms of like being too tight you know for example somebody has Lama too is like how do you kill a process in like Linux and Mama 2 was like oh it's wrong to like kill and like I cannot help you like doing that you know um and I think there's been more more chat online about you know sometimes when you do reinforcement learning you don't know what reward and like what what part of like the the suggestion the model is anchoring on you know like sometimes it's like oh this is better sometimes the model might be learning that you like more verbose question answers even though they're they're right the same way so there's a lot of stuff there to figure out but yeah I think some examples in the paper like clearly worse some of them are like not as not as crazy um yeah but I mean it'll be nice under a lot of pressure on the unlike the safety and like all the the instruction side and we cannot like the best thing to do would be hey let's version lock the model and like keep doing emails against each other like doing an email today and an email like that was like a year ago there might be like 20 versions in between that you don't even know how the model has has changed so um yeah evals are are hard it's the tldr I I think I think 
basically this is what we're seeing is open AI having come to terms with that the origin of itself as a research lab where updating models this is is just a relatively routine operation versus a product or infrastructure company where it has to have some kind of reliability guarantee to its users um and so openai are they internally as researchers are used to one thing and then the people who come and depend on open EI as on as as a product are used to a different thing and I think there's there's a little bit of cultural mismatch here like even within open ai's public statements we have simultaneously Logan from from open AI saying that the models are frozen and then you know his his VPO product saying that we update models all the time that are not frozen so which is like you cannot simultaneously be true um so so I think they're shot yeah I think they're trying to figure it out I think people are rightly afraid uh of them basing themselves on top of a black box uh and that's why maybe you know we'll talk about llama too in a bit uh that's that's why maybe they want to own the Black Box such that uh it doesn't change out from underturn um and I think this is fine this is normal but uh openai it's not that hard for opening night to figure out a policy that is comfortable with that that everybody like accepts um it won't take them too long and this is not a technical challenge it's more of a organizational and business challenge yeah I mean I I think that the communications challenge that you're referencing is also extreme and I think that you're right to identify that they've gone from like quirky little you know lab with these big aspirations to like epicenter of a of a national conversation or a global conversation about existential challenges you know and the way that you talk in those two different circumstances is very different and you're sort of serving a lot of different Masters hopefully always Guided by your own set of priorities and that's going to be you know inherently difficult uh but with so many eyes on it and people who are you know the thing that makes it different is it's not just like Facebook where it's like oh we've got a new feature you know in the early days that made us all annoyed like you know people were so angry when they added the feed uh you know that we all got used to it this is something where people have redesigned workflows around it and so small disruptions that change those workflows can be hugely impactful yeah it's an interesting comparison with the Facebook feed because in the era of AD Tech the feedback was immediate like you changed an algorithm and if the click-through rates are the you know the whatever metric you're you're optimizing for in your social network if they started to start to decline your change will be reverted tomorrow you know uh whereas here it's like we just talked about it's hard to measure and you don't get that much feedback like I you know I I have there's sort of the thumbs up and down uh action that you can take an open AI that I've never shared most people don't don't give feedback at all so like opening a has very little feedback to to go with on like what is actually improving under not improving and I think this is just normal like uh it's it's kind of what we want in a non-adtrack universe right like we've just moved to the subscription economy that everyone is like piety for uh and this is the result that we're trading off uh uh some some amount of product feedback actually it's super interesting so the the one 
other thing before we leave um uh open AI ecosystem the one other big sort of feature announcement from this month was uh custom instructions how significant do you think that was as an update so minor uh so it is significant in the sense that you get to personalize track TBT much more than uh you previously would have like it actually will remember facts about you it will try to obey system prompts about you you had this in the playground since forever uh because you could enter in the system prompt uh in there and just chat to complete that habit and this is a rare instance of the chat tpd team lagging behind the general capabilities of the open AI platform uh and they just shipped something that could have been there a long time ago it was present in perplexity Ai and if you think about it um basically every other open source chat company or open uh we have a third-party chat company had already had it before tragedy um so what I'm talking about is character AI what I'm talking about is the various uh ai waifu ai girlfriend type companies Each of which have you know characters that you can sort of sub in as custom instructions um so I think chargpt is basically playing catch up here it's good for obviously the largest user base in the world of chat AI but it's not something fundamentally we haven't seen before that actually I think perfectly brings up a segue to the other major obvious thing that happened this month from both a technical perspective but also just I think long term from a user perspective which was Facebook releasing llama 2. so this was something that was uh you know anticipated for a while but I I guess where to even start with the significance of llama 2 I mean how do you sum it up if you're talking to someone who sort of isn't paying attention to the space you know what what does the introduction of of lava 2 mean relative to other things that had been available previous to it um it is the first fully commercially usable not fully open source we'll talk about that first fully commercially usable gbt 3.5 equivalent model and that's a big deal because one you can run it on your own infrastructure you can write it on your own cloud so all the governments and Healthcare and financial use cases are opened up to that and then you can fine tune it because you have full control over all the weights and all the internals as much as you want um so it's a big deal from from that point of view um not as big in terms of the you know pushing you know for the state of the art um but it's still still extremely big deal yep I think the the open source part so I've wrote so the data it came out over this post um about you know why llamasu is not open source and why it doesn't matter and uh I was telling Sean I'm writing this thing and it was like whatever man like this license stuff is like so so tired I was like yeah I'll just post it on on anchor news in the morning and I think it was on the front page for like the whole day they got like 228 comments and I was regarding the flash attention podcast episode in the morning so I got out of the studio and it was like 230 comments of people being very like you know upset one way or the other about license and my point and you know I was I started an open source company myself in the past and I contributed to a bunch of projects is that yeah llama 2 is not open source but like the open source Institute definition but we just don't have a better definition for like models you know like because it's mostly open source you can use it for a 
lot of stuff so what's like the and it's not Source available because for a lot of stuff you can use it commercially so how do we find better labels and my point was like look let's figure out what the Better Label is but even though it's not fully open source it's still like three million dollars of like flops donated to the community basically you know who else who else in the open source Community is stepping up and putting 3 million of h100 to make us train this model so I I think like overall netmed is like a very positive thing for the community and then you've seen how much stuff was built on top of it there's like the quantized versions with ggml there's like the context window expansion um there's so much being done by the community that um I I think it was it was great for for everyone uh and by the way three million is the lower uh that's just compute um there's a reasonable estimate from scaliai that the extra fine tune that you could on top of it uh was worth about 15 to 20 million dollars um so that's a lot of money just kind of donated to the community um although they didn't release the data they didn't tell us any of the data sets uh they just say trust us we didn't train on any of your Facebook information which is uh it's the first instance where the models are more open than the data and I think that's a reflection of where the relative shift in value might uh happen um as a result of lava too and so I I don't know you can take that in multiple different directions but I just want to point that out yeah I was gonna say so we first had the the examples I made so we first had the open models open source models which is like rent pajama so the data so have been the training code is open the model weights are open then stability kind of did the same thing with stable LM which is like hey the widths are open but we're not giving you the data you know so you can you can download the model but you cannot retrain it yourself and that llama too it's like we don't give you the data we'll give you the models but you can only use it for for some stuff so there's more and more restriction but like Sean is saying and we talked about this before everybody wants to train their model nobody wants to open source the best data set for X you know which maybe is what more open source people should focus on it's like how to build better specific data sets instead of yet spending giving Jensen Wang another five million dollars of gpus but the model gets more headlines for now you know so that's that's what everybody Adidas yeah and I want to point out it's a reversal of the open source culture they used to get a sequence of openness and you could kind of pick and choose from uh whether it's open code all the way down to open data versus all the way down to uh open weights and you know there's some some barrier to combination I I wrote I wrote this book a long time ago because I don't remember that the five levels um uh but yeah like it's it's very strange and I think it's just it's just a relative uh um discussion of where the money is going um and I think it makes usually shows that compute is becoming commoditized um which yes there's a GPU approach right now uh a100 has sold out everywhere across the board people are commenting all about it uh this month um you know and there's people hoarding compute like nobody's business but as far as the value an AI is concerned it looks like computers is relatively um you know uh commoditized it's actually data that's that that people are kind of 
Going all the way back to the history of open source models: EleutherAI, when they trained GPT-J and GPT-Neo as the first reproductions of GPT-3, released the data first. Stability, when they trained Stable Diffusion, released LAION-5B first. That reflects the normal sequence of events: you release the data, then the model weights. Now we're just skipping the data part, and I think it's a fair way to think about it. In one of our conversations — I think it was Mike Conover — he compared our current AI era to the 2000s era in search engines: all of the publishable information-retrieval research dried up because all those PhDs went to work at Google, and Google just sat on it. This is now a fight for IP, and I think that's just very rational behavior in a capitalist AI economy.

One of the things we talked about before, around Code Interpreter and "GPT-4.5" and why they might not call it that, is the emergence of this sort of regulatory — if not pressure, then certainly intrigue. Do you think there's an aspect of that in why people are so jealously safeguarding the data? Is there more risk in being open about where the data is actually coming from?

The Books3 example is probably a good one. MPT trained their model on a dataset called Books3, which is something like 190,000 books. Then people on Twitter were like, this stuff is still under copyright; it's not in the public domain where you can just take it and train on it, and the license for some of these books is kind of blurry on what's fair use. So there was this whole thing on Twitter about it, and then MPT — Mosaic — first changed the license and then changed it back. I think Shawn Presser from EleutherAI was tweeting about this yesterday, and he was basically saying: as ML engineers, maybe it's better to not try to be the ethics gatekeepers, and just say, look, the data is open, let's try it; if people later say "please don't use the data," we can figure it out then. Proactively not using all of this stuff can hold progress back. He's coming more from the EleutherAI side, which is doing this work in public, so for them it's: if you don't want us to train on this, that's fine, but we shouldn't not do it by default. Versus if you're Meta: they said they trained Llama on stuff available on the internet; they didn't say they trained it on stuff that is licensed to train on. It's a small difference.

The other piece of this that I wanted to circle back to, because we kind of breezed over it but I think it's really significant: we did get a little lost in the conversation around open source definitions, and I don't think that's unimportant. People are rightly protective when a set of terminology has a particular meaning and a massive global corporation tries to nudge it toward something that potentially serves their ends versus actually being open by that definition. But I also think your point stands: functionally, relative to the rest of the space, it probably doesn't super matter, because what people mean is really about what they can do with it and what it means for the space relative to more closed models. One of the big observations has been that the availability of all of this — going back to when Llama 1 was fully leaked — has pretty dramatically changed, one, the evolution of the space over the past few months, and two, how the big companies and incumbents have thought about it from a business standpoint. Another big conversation this month, going back to the venture-capital side of your life, has been the extent to which big companies don't want to sign on with some startup offering them "AI whatever," because their technical teams can just spin up their own version of it given the availability of these open source tools. So I'm interested in bringing the "open source," in air quotes, side of the conversation into how it has impacted how companies are thinking about their development in the context of the AI space.

I think it's just raising the bar on what you're supposed to offer. Six, nine months ago it was enough to offer a nice UI wrapper around an OpenAI model; today it isn't anymore. That's really the main difference: what are you doing outside of wrapping the model? People need more and more before they buy versus build.

I think it actually moves the area of competition toward other parts of productionizing AI applications, which is probably just a positive. I feel like the competitive pressure that Meta is putting on OpenAI is a good thing. One of the fun predictions I made was that in the next six months OpenAI will open source GPT-3 — which is not open source today; it's so far behind the state of the art now that it doesn't matter as far as safety is concerned, and it keeps OpenAI in the open source AI game, which would be nice to have.

Of the things people have been building, you called out a couple — context window expansion — but have there been any that really stand out to you as super interesting, unexpected, or particularly high potential?

One of our recent podcast guests, the MLC team, were wrapping Llama 2 to run on MacBook GPUs. I think that's the most interesting gap: how do we go from pay-per-token to unlimited local use? That's one of the main things that keeps even people like me from automating a lot of stuff. I don't want to constantly pay OpenAI to do menial stuff, but if I could run it locally, even five times slower, I would. So that's a super exciting space. Beyond that, there hasn't been that much — it's only a few weeks old, so there hasn't been much emergent work coming from it yet. What you want to keep a lookout for is basically a repeat of what happened post Llama 1 — which, keep in mind, was only in February.
The same thing that happened with Vicuna, Alpaca, and all the other instruction-tuned, research-only models will happen again, just with more of them, because now they're also commercially usable. We haven't seen them come out yet, but it's almost guaranteed that they will. You can also apply all the new techniques that have emerged since then, like JSONformer, because now you have access to all the model weights. I think that will create another subset of models, offered commercially, that previously were only theoretically applicable to research-only models. So nothing really eye-popping yet, but it's been five minutes.

Yeah, it's been a very short amount of time.

And the thing about open source is that the creativity it unlocks is very hard to predict, and I actually think a lot of it happens in, let's say, the less official part of the economy. I've been focusing a lot recently on the AI girlfriend economy, which is huge. It's not polite conversation, the amount of traction the AI girlfriend area has, but it's real: there are millions of users, they're making a lot of money, and it's just virtually not talked about in polite SF circles.

It feels like one of those areas that's going to be an absolute lightning rod when it comes to the societal debates around this technology. You can feel that people are going to home in on it as exhibit A of a change they don't like. That's my guess, at least.

I have a really crazy longer-term prediction, maybe on the order of 30 to 50 years: an AI girlfriend wins the Nobel Peace Prize. What if it solves the loneliness crisis? What if it cuts the rate of terrorism and school shootings? That's huge.

My wife and I have joked about how every generation thinks they're so far ahead, that there's nothing their kids could throw at them that they fundamentally won't get — and without fail, every generation has something that seems totally normal to the kids that their parents' generation, writ large, has such a hard time with. We figure it's probably going to be AI girlfriends and boyfriends: we'll be saying "but they're not real," and they'll say "but it's real to me." We'll be having those debates with our future thirteen-year-olds — our kids are only four and two now, so it feels like the right timeline.

I heard, of all people, Matthew McConaughey on the Lex Fridman podcast — shout out, Matt — and they were kind of talking about this. They were noodling on this idea of computers helping us be better. We had computers learn how to play chess, and then we all got better at chess by using the computers to learn and experiment. They were talking about something similar for interpersonal relationships: it doesn't have to be you shutting yourself off from humans; it's using some of these models to actually learn how to better interact with people. If you're shy and an introvert, you can try jokes and conversation points with a model, and it teaches you, hey, that's not okay to say, or maybe you should be more open. I think that's a more wholesome view of it than everybody running away from society, having ten AI friends, and never talking to humans again.

It's much less sexy to just say "AI friends," right? If you look at the possibility set, the idea that people might have this sort of conversational partner that helps them effectively work through their own things in a safe space doesn't necessarily relate to romantic attachment, just because the movie Her came out.

Right, it can just be a panel of experts. I do have plans to build a "small CEO," which is my own boss, just for me to check in with. And you come across a lot of AI engineers who are interested in building mental-wellness products, and a lot of these take the form of some kind of journal — your most private thoughts that you don't really want to send anywhere else — so all of those will take advantage of open source models, because they don't want to send the data to OpenAI. That makes a ton of sense. Something I just came across, from one of my friends here at the coworking space I'm at: it's one of those situations where you can actually try having a conversation and having a group of AI friends chime in, and see what that feels like to you. It's the first example I've found where someone's actually done this. Super interesting.

So Llama and Code Interpreter stood out pretty clearly as the really big things to touch on. I wanted to check in, as we start to round the corner toward wrapping up, on Claude 2 and Anthropic. How significant was this, and in what ways? Was it meaningful in expanding the capability set for developers, or was it more just a good example of what you can do if you increase the context window — something that might ultimately become table stakes later on?

I can maybe speak to this a little bit. It is significant, but not earth-shattering. It is the first time that Claude as a whole has been generally publicly available — it used to be behind a waitlist. Yes, it has a longer context window, but to me, more significantly, it's Anthropic finding its foothold in the very competitive AI landscape. Anthropic's message used to be "yes, we're number two to OpenAI, but we're safer," and that's not a super appealing thing to many engineers — it is very appealing to some corporations, by the way. But having the 100K context window makes them state of the art in one dimension, which is very useful, and the ability to upload multiple files is super useful as well. And I've met a number of businesses — Sourcegraph, for example — who are actually choosing to build with the Claude 2 API over and above OpenAI, just because it's better on latency, better on reliability, and better at some forms of code synthesis. So I think it's Anthropic finally finding its foothold after a long while of being in OpenAI's shadow.
Yeah, and we use Claude for the transcripts and timestamps — shout out to the 100K context window. We couldn't do that when we first started the podcast; we were figuring out how to chunk this stuff for GPT-4 and all of that, and then with Claude it was just, put the whole thing in there, and it works great. So that's a good start. But I feel like they're always second fiddle: every time they release something, people are like, "cool, okay, fine." I feel bad for them, because it's really good stuff — they just need some help on the marketing side and the community buy-in.

I just spent this past weekend at the Claude hackathon, which is, as far as I know, Anthropic's first hackathon. I tweeted a pretty well-received video where I was just leaving the hackathon venue at 2am, and there was still a ton of people hacking — like 300 people participating, for Claude — and I think it's the first real developer excitement I've ever seen for Anthropic and Claude. So I think they're on their way up, and I think this paves the way for a multi-model future. That's something a lot of people are betting on; the odds are stacked against Anthropic, but they're making some headway. I do think you should always be running all your chats side by side against ChatGPT and Claude and maybe Llama 2 — I have a little app that does that and saves all the chats — and I can legitimately say that Claude wins about 30% of the time whenever I give it a task or ask it a question. Which doesn't make it number one, but it is very additive to your overall toolkit.

It's certainly the first time that, on any given day on Twitter, you'll see people saying things like "if you haven't used Claude for writing, you have to try it now" — people who have made the switch, who have no affiliation, and who are very convinced that it is now part of the suite of tools people should really be paying attention to. Which I think is great; we shouldn't be at a stage yet where we're totally in on just one tool set.

I'll also mention that this month, or at least July, was when we saw the first introspection of whether too much context is actually a good thing. There's a pretty famous paper — I forget the actual title — that shows a very pronounced U-curve in the retrieval abilities of long-context models: if the item being retrieved is at the start or the end of the context window, it has the best chance of being retrieved, but if it's in the middle, it has a high chance of being lost. So is 100K context a good thing? Are you systematically testing its ability to retrieve the correct factual information, or are you just looking at a summary and going, "yeah, looks good to me"? I think we'll be testing whether it's worth extending to 100K or a million tokens or infinite tokens, or whether you want to blend a short window — 8,000 or 4,000 tokens — with a proper semantic search system, like the retrieval-augmented generation and vector database companies are doing. That discussion has come up in open source a lot, and it basically matches human memory: you want to have a short working memory.
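A minimal sketch of the kind of "needle in a haystack" retrieval check being described here, assuming you already have some `ask_model(prompt) -> str` chat helper (hypothetical; any of the APIs discussed above would work):

```python
def build_haystack(needle: str, filler_words: list[str], n_words: int, depth: float) -> str:
    """Bury `needle` at a relative depth (0.0 = start, 1.0 = end) inside n_words of filler."""
    words = (filler_words * (n_words // len(filler_words) + 1))[:n_words]
    cut = int(depth * len(words))
    return " ".join(words[:cut] + [needle] + words[cut:])

def retrieval_curve(ask_model, needle, question, expected_answer, n_words=50_000):
    # ask_model is whatever client call you already use (hypothetical signature)
    results = {}
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
        prompt = build_haystack(needle, ["lorem", "ipsum", "dolor"], n_words, depth)
        reply = ask_model(prompt + "\n\nAnswer only this question: " + question)
        results[depth] = expected_answer.lower() in reply.lower()
    return results  # the "lost in the middle" result is a U-shaped curve over depth
```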
The one other obviously big company update that we haven't spoken about yet: around the middle of the month, Google Bard had a big set of updates. A lot of it was business-focused — it became available in more languages — and from a feature perspective the biggest thing they were hanging their hat on was image recognition and the push toward multimodality. Did you guys have any thoughts about that, or was it not high on the priority list as an announcement this month?

Going back to the point from before, we're getting to a maturity level in the industry where model updates alone are fine, but people need more. That's why Code Interpreter is so good: it's not just "we made the model a little better, we added this thing" — it's a whole new thing. If you're playing the model game, fine; if not, you've got to go to the product level, and I think Google should start thinking about how to make that work. When I search on Google Maps for certain stuff, it completely does not work — maybe they should use models to make that better, and then say "we're using Bard in Google Maps search." But I'm kind of tuning out a lot of the pure model announcements.

On Bard's updates, with the multimodality they actually beat GPT-4 to releasing a generally available multimodal model: you can upload an image and have Bard describe it, and that's pretty interesting, pretty cool. One of our earliest guests, Roboflow — Brad, their CTO — was actually doing some comparisons, because they have access to a lot of vision models, and Bard came up a little bit short but was pretty close to the state of the art. I would say the problem with Bard is that you can't rely on them having reliable updates. They had a June update, implicit code execution, where they started to ship Code-Interpreter-type functionality in a more limited form. If you run the same questions they advertised in the June blog post — which Sundar Pichai advertised in a video and tweeted out — they no longer work in Bard. So they had a regression. That was very embarrassing, obviously unintended, and it shows that it's hard to keep model progress up to date. Google has this checkered history of its products being reliable — they also just killed off yet another product, RIP — and I think that's something they have to combat. Yes, they're trying to ship model progress — I've met the Bard people, they're good, earnest people — but they have struggled to ship products, even more than OpenAI, which is frankly embarrassing for a company the size of Google.

Outside of the biggies, are there any other key trends, or even just bubbling interests, that you guys are noticing in the developer community that aren't necessarily widely seen outside it? One of the things I keep an eye on is all the AutoGPT-like things — this month we had GPT Engineer, and MultiOn held a hackathon, and there are a few things like that — but, not necessarily in the agent space, are there any other themes you're keeping an eye on?

I'm sure Alessio can chime in, but I do keep a relatively close eye on the agent stuff. It has not died down in terms of heat. Even the AutoGPT team — who, by the way, work on the first floor of the building I work in — are hard at work shipping the next version. So I think a lot of people are engaging in the dream of agents, and scoping them down to something usable is still a task that has so far eluded every single team. These are very ambitious goals and we are at the very start of the journey — the same journey that self-driving cars took when they started doing the DARPA challenges. The other thing I'll point out in terms of overall interest: I am definitely seeing a lot of eval-type companies being formed, and winning hackathons too. These are basically companies where you monitor the success of your prompts or your agents, version them, and share them. I can't be much more descriptive, partly because they're not very clear yet about what they do. LangChain launched LangSmith, and I think that is the first commercial product from LangChain, which is probably one of the top one or two developer-oriented AI projects out there. That's more observability, but also evals, since they acqui-hired an AI evals project as well. So the general domain of how to eval models is a very big focus of developers right now.

Yeah, we've done two seed investments in companies doing agents, but they're both verticalized agents. The open source motion has been AutoGPT — do anything — and now we're seeing a lot of founders saying: if you take that and combine it with deep industry expertise, you can get so many improvements. The other piece of it is information retrieval. In general knowledge, documents are kind of flat, but in a specific vertical — say finance, for example — the earnings from this quarter are much more important than the ones from ten quarters ago. So how do you start to create that information hierarchy between documents, and how do you use it, instead of doing simple retrieval from an embedding store — how do you also start to score these things? That's another area of research among founders.

I'll call out two more things. One other thing that happened this month was SDXL. Text-to-image doesn't seem as sexy anymore, even though last year it was all the rage, but I do think it's coming along. I definitely wish Google was putting up more of a fight, because at the start of the year they released some very interesting papers that they never followed up on, showing some really interesting Transformer-based text-to-image models that I thought were super interesting.
The other element — I hesitate to say this, but it's the character space: Character.AI, Replika, and all the sort of waifu versions of that. A lot of people are hacking on this kind of stuff. The retention metrics on Character.AI blow away a lot of the metrics you see on traditional social media sites, and AI-native social media is something that's there and that people haven't really explored yet — though people are exploring it, and some folks are always a few years ahead on this. Not to keep returning to this theme, but I just think it's definitely coming for a lot of the ways we deal with things. Right now we think Copilot, right now we think ChatGPT, but what we really want to get to is a way of serializing personality and intelligence — and potentially that is a leading form of mind upload. That gets into science fiction, but I do see a lot of people working on it.

Yeah, we just got a Financial Times report that says AI personas from Meta, from Facebook, could be coming next month. There's one that's Abraham Lincoln, one that's a surfer dude who gives you travel advice. The sourcing is "three people with knowledge of the project," and obviously no confirmation from Meta, but it's no secret that Zuckerberg has been interested in this stuff, and the FT piece is actually a good overview of why a company like Meta would care about it in very dollars-and-cents terms. And I want to state that the first version of this is very gimmicky — when I first looked at Character.AI it was like, okay, I want to talk to Genghis Khan if I'm doing a history class; it's what a ten-year-old would enjoy — but the more professional iterations of this would be very interesting.

On the developer side of this, I have been calling for the development of agent clouds, which are clouds that are specifically optimized not for human use but for AI agent teams. And that is a form of character, right — an environment with the different dependencies pre-installed, that can be programmatically controlled and can give programmatic feedback to agents. There's a protocol forming, which some of the leading figures like AutoGPT and E2B are creating, that lets agents run clouds. This would definitely terrify the AI safety people, because we have gone from running them on a single machine toward running whole clusters remotely — but it's happening.

All right, so let's talk about what comes next. Do you guys have any predictions for August, or, if not predictions, just things that you're watching most closely? Go ahead, Alessio.

Let me think. Sean is usually good at the super long-term predictions; I'm more pragmatic — he's more like minimum 12 to 24 months out. For me, it's probably starting to see more public talk about open source models in production, with people using that as a differentiator. Right now a lot of it is "these models are out there," but nobody's really saying "I moved away from OpenAI and I'm using this." We run an early-adopters community with about 1,500 leaders at kind of Fortune-500-sized large companies, and some of them are saying "we deployed Dolly in production and we're using it" — but they're not writing a blog post about it. So right now the perception is still that everybody's using OpenAI and the open source models are really toys. I think we're going to get into September — you're not going to see a lot of announcements in August proper, but a lot of people are going to spend August getting these models ready — and then going into the end of the year they'll say, "hey, we're here too, we're using the open models, we don't need OpenAI." Right now there's still not a lot of public talk about that, so I'm excited to see more.

As for myself — this is very self-interested, obviously — I wrote about the rise of the AI Engineer, and I think it's definitely happening as we speak. People tag me multiple times a day on how they're reorienting their careers. People are professionalizing around this, going from essentially informal groups and Slack channels and meetups toward certifications and courses and job titles and actual AI teams in every single company. I just got a notification two days ago that — at Meta, apparently, you can name your job title whatever you want internally — the first AI Engineer within Meta has been announced. So in the near term I do see this career, this profession, coming into place, which I've been forecasting for a little bit, and I'm excited to help it along.

Awesome. Well, guys, great conversation, tons of interesting stuff happening, obviously. Ironically, I think it's a relatively quieter time in some ways than it even was, and my prediction for August is that we're going to see the extension of that — the biggest breather we've had, at least from a feeling perspective, maybe since ChatGPT — but then we're going to rage back in September. You've got Facebook Connect in September, and just the general return to business that everyone does after August. But of course the hackathons aren't going to stop in the Bay Area, so people are going to keep building, and it's entirely possible that something hits in the next four weeks that totally changes that. It'll be exciting to see. Looking forward to it.



Get full access to Latent Space at www.latent.space/subscribe
2023-08-04
Link to episode

FlashAttention 2: making Transformers 800% faster w/o approximation - with Tri Dao of Together AI

FlashAttention was first published by Tri Dao in May 2022 and it has had a deep impact on the large language model space. Most open models you've heard of (RedPajama, MPT, LLaMA, Falcon, etc.) leverage it for faster inference. Tri came on the podcast to chat about FlashAttention, the newly released FlashAttention-2, the research process at Hazy Research, and more.

This is the first episode of our "Papers Explained" series, which will cover some of the foundational research in this space. Our Discord also hosts a weekly Paper Club, which you can sign up for here.

How does FlashAttention work?

The paper is titled "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness". There are a couple of keywords to call out:

* "Memory Efficient": standard attention memory usage is quadratic with sequence length (i.e. O(N^2)). FlashAttention is sub-quadratic at O(N). (A quick size calculation follows this list.)

* "Exact": the opposite of "exact" in this case is "sparse", as in "sparse networks" (see our episode with Jonathan Frankle for more). This means that you're not giving up any precision.

* The "IO" in "IO-Awareness" stands for "Input/Output" and hints at a write/read related bottleneck.
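As a rough illustration of what quadratic memory means in practice (our own back-of-the-envelope numbers, not figures from the paper), the full N×N score matrix that standard attention materializes very quickly outgrows the ~20MB of on-chip SRAM described below:

```python
# Size of the (N x N) fp16 attention score matrix per head, compared against
# the ~20MB of A100 SRAM mentioned in the memory hierarchy below.
BYTES_FP16 = 2
for seq_len in (1024, 4096, 16384):
    scores_bytes = seq_len * seq_len * BYTES_FP16
    print(f"N={seq_len:>6}: {scores_bytes / 2**20:>6.0f} MiB per head")
# N=  1024:      2 MiB  -- fits in SRAM
# N=  4096:     32 MiB  -- already larger than SRAM
# N= 16384:    512 MiB  -- has to live in (slow) HBM
```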

Before we dive in, look at this simple GPU architecture diagram:

The GPU has access to three memory stores at runtime (a rough estimate of what these numbers imply follows the list):

* SRAM: this is on-chip memory co-located with the actual execution core. It's limited in size (~20MB on an A100 card) but extremely fast (19TB/s total bandwidth)

* HBM: this is off-chip but on-card memory, meaning it's in the GPU but not co-located with the core itself. An A100 has 40GB of HBM, but only a 1.5TB/s bandwidth.

* DRAM: this is your traditional CPU RAM. You can have TBs of this, but you can only get ~12.8GB/s bandwidth, which is way too slow.
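To see why those bandwidth numbers matter, here is a rough back-of-the-envelope using only the figures above (an illustration, not a measurement):

```python
# How long does it take just to move one head's (N x N) fp16 score matrix?
# Standard attention touches that buffer roughly four times
# (write S, read S for softmax, write P, read P for P @ V).
HBM_BW, SRAM_BW = 1.5e12, 19e12           # bytes/sec, from the list above
N, BYTES_FP16 = 4096, 2
traffic = 4 * N * N * BYTES_FP16          # ~134 MB of memory traffic per head
print(f"at HBM bandwidth : {traffic / HBM_BW * 1e6:6.1f} us")   # ~ 89 us
print(f"at SRAM bandwidth: {traffic / SRAM_BW * 1e6:6.1f} us")  # ~  7 us
```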

Now that you know what HBM is, look at how the standard Attention algorithm is implemented:
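In PyTorch pseudocode (our sketch, not the paper's listing), the standard algorithm looks roughly like this:

```python
import torch

def standard_attention(Q, K, V):
    # Q, K, V: (batch, heads, seq_len, head_dim)
    d = Q.shape[-1]
    S = (Q @ K.transpose(-2, -1)) / d**0.5   # step 1: write S (N x N) to HBM
    P = torch.softmax(S, dim=-1)             # step 2: read S back, write P (N x N)
    return P @ V                             # step 3: read P back for the final matmul
```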

As you can see, all 3 steps include a "write X to HBM" step and a "read from HBM" step. The core idea behind FlashAttention boils down to this: instead of storing each intermediate result, why don't we use kernel fusion and run every operation in a single kernel in order to avoid memory read/write overhead? (We also talked about kernel fusion in our episode with George Hotz and how PyTorch / tinygrad take different approaches here)

The result is much faster, but much harder to read:

As you can see, FlashAttention is a very meaningful speed improvement on traditional Attention, and it's easy to understand why it's becoming the standard for most models.

This should be enough of a primer before you dive into our episode! We talked about FlashAttention-2, how Hazy Research Group works, and some of the research being done in Transformer alternatives.

Show Notes:

* FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (arXiv)

* FlashAttention-2

* Together AI

* From Deep Learning to Long Learning

* The Hardware Lottery by Sara Hooker

* Hazy Research

* Is Attention All You Need?

* Nvidia CUTLASS 3

* SRAM scaling slows

* Transformer alternatives:

* S4

* Hyena

* Recurrent Neural Networks (RNNs)

Timestamps:

* Tri's background [00:00:00]

* FlashAttention's deep dive [00:02:18]

* How the Hazy Research group collaborates across theory, systems, and applications [00:17:21]

* Evaluating models beyond raw performance [00:25:00]

* FlashAttention-2 [00:27:00]

* CUDA and The Hardware Lottery [00:30:00]

* Researching in a fast-changing market [00:35:00]

* Promising transformer alternatives like state space models and RNNs [00:37:30]

* The spectrum of openness in AI models [00:43:00]

* Practical impact of models like LLAMA2 despite restrictions [00:47:12]

* Incentives for releasing open training datasets [00:49:43]

* Lightning Round [00:53:22]

Transcript:

Alessio: Hey everyone, welcome to the Latent Space podcast. This is Alessio, Partner and CTO-in-Residence at Decibel Partners. Today we have no Swyx, because he's in Singapore, so it's a one-on-one discussion with Tri Dao. Welcome! [00:00:24]

Tri: Hi everyone. I'm Tri Dao, excited to be here. [00:00:27]

Alessio: Tri just completed his PhD at Stanford a month ago. You might not remember his name, but he's one of the main authors of the FlashAttention paper, which is one of the seminal works of the Transformers era. He's got a lot of interests spanning efficient transformer training and inference, long-range sequence models, a lot of interesting stuff. And now you're going to be an assistant professor in CS at Princeton next year. [00:00:51]

Tri: Yeah, that's right. [00:00:52]

Alessio: Yeah. And in the meantime, just to get, you know, a low pressure thing, you're Chief Scientist at Together as well, which is the company behind RedPajama. [00:01:01]

Tri: Yeah. So I just joined this week actually, and it's been really exciting. [00:01:04]

Alessio: So what's something that is not on the internet that people should know about you? [00:01:09]

Tri: Let's see. When I started college, I was going to be an economist, so I was fully on board. I was going to major in economics, but the first week I was at Stanford undergrad, I took a few math classes and I immediately decided that I was going to be a math major. And that kind of changed the course of my career. So now I'm doing math, computer science, AI research. [00:01:32]

Alessio: I had a similar thing. I started with physics and then I took like a programming course and I was like, I got to do computer science. I don't want to do physics. So FlashAttention is definitely, everybody's using this. Everybody loves it. You just released FlashAttention 2 last week. [00:01:48]

Tri: Yeah. Early this week on Monday. Yeah. [00:01:53]

Alessio: You know, AI time. Things move fast. So maybe let's run through some of the FlashAttention highlights, some of the innovation there, and then we can dive into FlashAttention 2. So the core improvement in FlashAttention is that traditional attention is quadratic in sequence length, and FlashAttention is linear, which obviously helps with scaling some of these models. [00:02:18]

Tri: There are two factors there. So of course the goal has been to make attention go faster or more memory efficient. And ever since attention became popular in 2017 with the Transformer paper, lots and lots of folks have been working on this. And a lot of approaches has been focusing on approximating attention. The goal is you want to scale to longer sequences. There are tons of applications where you want to do that. But scaling to longer sequences is difficult because attention scales quadratically in sequence length on both runtime and memory, as you mentioned. So instead of trying to approximate attention, we were trying to figure out, can we do the same computation and maybe be more memory efficient? So in the end, we ended up being the memory is linear in sequence length. In terms of computation, it's still quadratic, but we managed to make it much more hardware friendly. And as a result, we do get wall clock speed up on the order of 2 to 4x, which really helps because that just means that you'll be able to train with 2 to 4x longer sequence length for the same cost without doing any approximations. As a result, lots of folks have been using this. The thing is available in a lot of libraries that do language model training or fine tuning. [00:03:32]
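If you just want to use a fused attention kernel rather than write one, PyTorch 2.x exposes this directly. A usage sketch (the standalone `flash-attn` package offers a similar `flash_attn_func`, but check its docs for the exact signature):

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim), half precision on GPU
q = torch.randn(2, 16, 4096, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# scaled_dot_product_attention can dispatch to a FlashAttention-style fused
# kernel when shapes/dtypes allow; no (seq x seq) score matrix is materialized.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```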

Alessio: And the approximation thing is important because this is an exact thing versus a sparse. So maybe explain a little bit the difference there. [00:03:40]

Tri: For sure. So in attention, essentially you compute pairwise similarity between every single element in a sequence against each other. So there's been other approaches where instead of doing all that pairwise computation, you only compute similarity for some pairs of elements in the sequence. So you don't do a quadratic number of comparisons. And this can be seen as some form of sparsity. Essentially you're ignoring some of the elements. When you write down the matrix, you essentially say, OK, I'm going to pretend there's zero. So that has some benefits in terms of runtime and memory. But the trade-off is that it tends to do worse in terms of quality because you're essentially approximating or ignoring some elements. And I personally have worked on this as well for a few years. But when we talk to practitioners who actually train models, especially at large scale, they say they tend not to use these approximate attention methods. Because it turns out, and this was surprising to me at the time, that these approximation methods, even though they perform fewer computations, tend not to be faster in wall-clock time. So this was pretty surprising because back then, I think my background was more on the theoretical side. So I was thinking of, oh, how many flops or floating point operations are you performing? And hopefully that correlates well with wall-clock time. But I realized that I was missing a bunch of ideas from the system side where flops or floating point operations don't necessarily correlate with runtime. There are other factors like memory reading and writing, parallelism, and so on. So I learned a ton from just talking to systems people because they kind of figured this stuff out a while ago. So that was really eye-opening. And then we ended up focusing a lot more on memory reading and writing because that turned out to be the majority of the time when you're doing attention is reading and writing memory. [00:05:34]

Alessio: Yeah, the I.O. awareness is probably one of the biggest innovations here. And the idea behind it is, like you mentioned, the FLOPS growth of the cards has been going up, but the memory bandwidth, not as much. So I think maybe that was one of the assumptions that the original attention paper had. So talk a bit about how that came to be as an idea. It's one of those things that, in hindsight, seems obvious — why are we rewriting to HBM every time? — and once you change it, it's clear. But what was that discovery process? [00:06:08]

Tri: Yeah, in hindsight, a lot of the ideas have already been there in the literature. And I would say it was somehow at the intersection of both machine learning and systems. And you kind of needed ideas from both sides. So on one hand, on the system side, lots of systems folks have known that, oh, you know, kernel fusion is great. Kernel fusion just means that instead of loading the same element, performing an operation, writing it down, loading it back up and performing the second operation, you just load it once, perform two operations and then write it down again. So that saves you kind of memory read and write in the middle there. So kernel fusion has been a classic. There's been other techniques from the system side, like tiling, where you perform the computations in blocks, again so that you can load them into a really fast memory. Think of it as a cache. And this is, again, classical computer science ideas, right? You want to use the cache. So the system folks have been thinking about these ideas for a long time, and they apply to attention as well. But there were certain things in attention that made it difficult to do a complete kernel fusion. One of which is there is this softmax operation in the middle, which requires you to essentially sum across the row of the attention matrix. So it makes it difficult to kind of break it, because there's this dependency. So it makes it difficult to break things into a block. So on the system side, people have been thinking about these ideas, but it's been difficult to kind of do kernel fusion for the entire operation. On the machine learning side, people have been thinking more algorithmically. They say, okay, either we can approximate attention, or there's this trick called the online softmax trick, which says that because of the way softmax is written mathematically, you can actually break it up into smaller pieces, do some rescaling, and still get the right answer. So this online softmax trick has been around for a while. I think there was a paper from NVIDIA folks back in 2018 about this. And then there was a paper from Google. So Markus Rabe and Charles Staats wrote a paper in late 2021 on using this online softmax trick to break attention up into smaller pieces. So a lot of the ideas were already there. But it turns out, you kind of need to combine ideas from both sides. So you need to understand that, hey, we want to do kernel fusion to reduce memory reads and writes. But we also need this online softmax trick to be able to break the softmax into smaller pieces so that a lot of the systems tricks kind of carry through. We saw that, and it was kind of a natural idea that we ended up using ideas from both sides, and it ended up working pretty well. Yeah. [00:08:57]
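A sketch of the online-softmax rescaling Tri describes, as a standalone blockwise softmax in PyTorch (the real kernel folds this into the tiled matmuls rather than doing a second pass):

```python
import torch

def online_softmax(scores: torch.Tensor, block: int = 1024) -> torch.Tensor:
    """Softmax over the last dim, visiting it one block at a time while keeping
    only a running max `m` and running sum `s` (the rescaling trick)."""
    m = torch.full(scores.shape[:-1], float("-inf"),
                   device=scores.device, dtype=scores.dtype)   # running max
    s = torch.zeros_like(m)                                    # running sum of exp
    for start in range(0, scores.shape[-1], block):
        blk = scores[..., start:start + block]
        m_new = torch.maximum(m, blk.max(dim=-1).values)
        # rescale the old partial sum to the new max, then add this block's terms
        s = s * torch.exp(m - m_new) + torch.exp(blk - m_new.unsqueeze(-1)).sum(dim=-1)
        m = m_new
    return torch.exp(scores - m.unsqueeze(-1)) / s.unsqueeze(-1)

x = torch.randn(4, 8192)
assert torch.allclose(online_softmax(x), torch.softmax(x, dim=-1), atol=1e-5)
```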

Alessio: Are there any downsides to kernel fusion? If I think about databases and the reasons why we have atomic operations, you know, it's like, you have observability and fallback in between them. How does that work with attention? Is there anything that we lose by fusing the operations? [00:09:13]

Tri: Yeah, I think mostly on the practical side is that you lose a little bit of flexibility in the sense that, hey, now you have, for example, faster attention, it's just a subroutine that you would call to do attention. But as a researcher, let's say you don't want that exact thing, right? You don't want just attention, let's say you want some modification to attention. You want to do, hey, I'm going to multiply the query and key, but then I'm going to do this extra thing before I carry on. So kernel fusion just means that, okay, we have a subroutine that does the entire thing. But if you want to experiment with things, you won't be able to use that fused kernel. And the answer is, can we have a compiler that then automatically does a lot of this kernel fusion? Lots of compiler folks are thinking about this, either with a new language or you can embed it in PyTorch. PyTorch folks have been working on this as well. So if you write just your code in PyTorch and they can capture the graph, can they generate code that will fuse everything together? That's still ongoing, and it works for some cases. But for attention, because of this kind of softmax rewriting stuff, it's been a little bit more difficult. So maybe in a year or two, we'll have compilers that are able to do a lot of these optimizations for you. And you don't have to, for example, spend a couple months writing CUDA to get this stuff to work. Awesome. [00:10:41]
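As a concrete example of the automatic fusion Tri mentions, here is a sketch using `torch.compile`; pointwise chains like this are the easy case, while attention's row-wise softmax reduction is the hard one he describes:

```python
import torch

def bias_gelu(x, bias):
    # Run eagerly, these pointwise ops launch separate kernels, each one
    # reading and writing the full tensor in GPU memory.
    y = x + bias
    return 0.5 * y * (1.0 + torch.tanh(0.7978845608 * (y + 0.044715 * y**3)))

# torch.compile captures the graph and can emit a single fused kernel
# for a chain of elementwise ops like this.
compiled = torch.compile(bias_gelu)

x = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
bias = torch.randn(4096, device="cuda", dtype=torch.float16)
out = compiled(x, bias)
```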

Alessio: And just to make it clear for listeners, when we say we're not writing it to memory, we are storing it, but just in a faster memory. So instead of the HBM, we're putting it in the SRAM. Yeah. [00:10:53]

Tri: Yeah. [00:10:54]

Alessio: Maybe explain just a little bit the difference there. [00:10:56]

Tri: Yeah, for sure. This is kind of a caricature of how you think about accelerators or GPUs in particular, is that they have a large pool of memory, usually called HBM, or high bandwidth memory. So this is what you think of as GPU memory. So if you're using A100 and you list the GPU memory, it's like 40 gigs or 80 gigs. So that's the HBM. And then when you perform any operation, you need to move data from the HBM to the compute unit. So the actual hardware unit that does the computation. And next to these compute units, there are on-chip memory or SRAM, which are much, much smaller than HBM, but much faster. So the analogy there is if you're familiar with, say, CPU and RAM and so on. So you have a large pool of RAM, and then you have the CPU performing the computation. But next to the CPU, you have L1 cache and L2 cache, which are much smaller than DRAM, but much faster. So you can think of SRAM as the small, fast cache that stays close to the compute unit. Physically, it's closer. There is some kind of asymmetry here. So HBM is much larger, and SRAM is much smaller, but much faster. One way of thinking about it is, how can we design algorithms that take advantage of this asymmetric memory hierarchy? And of course, lots of folks have been thinking about this. These ideas are pretty old. I think back in the 1980s, the primary concerns were sorting. How can we sort numbers as efficiently as possible? And the motivating example was banks were trying to sort their transactions, and that needs to happen overnight so that the next day they can be ready. And so the same idea applies, which is that they have slow memory, which was hard disk, and they have fast memory, which was DRAM. And people had to design sorting algorithms that take advantage of this asymmetry. And it turns out, these same ideas can apply today, which is different kinds of memory. [00:13:00]

Alessio: In your paper, you have the pyramid of memory. Just to give people an idea, when he says smaller, it's like HBM is like 40 gig, and then SRAM is like 20 megabytes. So it's not a little smaller, it's much smaller. But the throughput on card is like 1.5 terabytes a second for HBM and like 19 terabytes a second for SRAM, which is a lot larger. How do you think that evolves? So TSMC said they hit the scaling limits for SRAM, they just cannot grow that much more. HBM keeps growing, HBM3 is going to be 2x faster than HBM2, I think the latest NVIDIA thing has HBM3. How do you think about the future of FlashAttention? Do you think HBM is going to get fast enough when maybe it's not as useful to use the SRAM? [00:13:49]

Tri: That's right. I think it comes down to physics. When you design hardware, literally SRAM stays very close to compute units. And so you don't have that much area to essentially put the transistors. And you can't shrink these things too much. So just physics, in terms of area, you don't have that much area for the SRAM. HBM is off-chip, so there is some kind of bus that essentially transfers data from HBM to the compute unit. So you have more area to essentially put these memory units. And so yeah, I think in the future SRAM probably won't get that much larger, because you don't have that much area. HBM will get larger and faster. And so I think it becomes more important to design algorithms that take advantage of this memory asymmetry. It's the same thing in CPU, where the cache is really small, the DRAM is growing larger and larger. DRAM could get to, I don't know, two terabytes, six terabytes, or something, whereas the cache stays at, I don't know, 15 megabytes or something like that. I think maybe the algorithm design becomes more and more important. There's still ways to take advantage of this, I think. So in the future, I think flash attention right now is being used. I don't know if in the next couple of years, some new architecture will come in and whatnot, but attention seems to be still important. For the next couple of years, I still expect some of these ideas to be useful. Not necessarily the exact code that's out there, but I think these ideas have kind of stood the test of time. New ideas like IO awareness from back in the 1980s, ideas like kernel fusions, tiling. These are classical ideas that have stood the test of time. So I think in the future, these ideas will become more and more important as we scale models to be larger, as we have more kinds of devices, where performance and efficiency become much, much more important. [00:15:40]

Alessio: Yeah, and we had Jonathan Frankle on the podcast, and if you go to issattentionallyouneed.com, he has an outstanding bet, and he does believe that attention will be the state of the art architecture still in a few years. Did you think flash attention would be this popular? I'm always curious on the research side, you publish a paper, and obviously you know it's great work, but sometimes it just kind of falls flat in the industry. Could you see everybody just starting to use this, or was that a surprise to you? [00:16:11]

Tri: Certainly, I didn't anticipate the level of popularity. Of course, we were extremely happy to have people using this stuff and giving us feedback and so on, and help us improve things. I think when we were writing the paper, I remember sending an email to one of my advisors, and like, hey, I'm excited about this paper, but I think the most important thing will be the artifact, which is the code. So I knew that the code will be valuable. So we kind of focus a lot on the code and make sure that the code is usable and as fast as can be. Of course, the idea, the paper presents the ideas and explain it and have experiments that validate the idea, but I knew that the artifact or the code was also pretty important. And that turned out to be the right focus, which is, you know, we put out the paper, we release the code and continue working on the code. So it's a team effort with my co-authors as well. [00:17:07]

Alessio: We mentioned Hazy Research a bunch of times on the podcast before. I would love for you to spend five minutes just talking about how does the group work? How do people get together? How do you bounce ideas off of each other? Yeah. [00:17:21]

Tri: So Hazy Research is a research group at Stanford led by one of my advisors, Chris Re. I love the people there. It was one of the best experiences I had. They've made my PhD so much more enjoyable. And I think there are a couple of ways that the group has been working pretty well. So one is, I think there's a diverse pool of people who either, you know, some of them focus on algorithms and theory, some of them focus on building systems, some of them focus on applications. And as a result, there is this flow of ideas. So as an example, some of us were working on more algorithms and theory, and then we can talk to the folks building systems and say, hey, let's try it out and let's put it in the systems and see how it is. And there you will get feedback from systems folks. They will say, hey, we implemented this, or we tried this and this is where it doesn't work, something like that. And once we put it in the systems, the application folks can use the algorithm or new methods or new models. And we again get great feedback from them because the application folks, for example, some of my good friends, they focus on medical imaging or seizure detection. And that is the problem they care about. And if your method doesn't work on the task they care about, they will tell you. Whereas I think a lot of people in machine learning, they're a little bit more flexible. So they will be like, hey, it doesn't work on seizure detection. Let's try some other task, right? But having that direct feedback of like, hey, it doesn't work there, let's figure out why. I think that that feedback allows us to do better work. And I think that kind of process of exchanging ideas, validating it in a real system so that applications folks can try it out and give you feedback. That cycle has been very, very useful. And so that's one, having a diverse group of people. The other one is, and this is something I really appreciate from advice from Chris was try to understand the fundamentals, right? And he's happy letting me go off and read some textbooks and playing with things because I think a lot of research ideas come from understanding the old literature and seeing how it fits with the new landscape. And so if you just read new arXiv papers every day, that's great, but you also need to read textbooks. And that's one advice I got from Chris, which is understand the fundamentals. And I think that allows us to do more impactful work. [00:19:46]

Alessio: How do you think about academia versus industry? I feel like AI / Machine Learning has been an area where up until three, four years ago, most of the cutting edge work was being done in academia. And now there's all these big industry research labs. You're obviously going to Princeton, so you're an academia believer. How should people think about where to go? Say I'm doing my master's, I have to decide between doing a PhD and going into OpenAI Anthropic. How should I decide? [00:20:15]

Tri: I think they kind of play complementary roles, in my opinion. Of course, I also was considering different paths as well. So I think right now, scaling matters a lot, especially when you talk about language models and AI and so on. Scaling matters a lot. And that means that you need compute resources and you need infrastructure and you need engineers' time. And so industry tends to have an advantage when it comes to scaling things. But a lot of the ideas actually came from academia. So let's take attention, which got popular with the Transformer in 2017. Attention actually has been around for a while. So I think the first mention was in 2014, a paper from Bahdanau and others, with Yoshua Bengio, which is coming from academia. A lot of ideas did come from academia. And scaling things up, of course, I think OpenAI has been great at scaling things up. That was the bet that they made after, I think, GPT-2. So they saw that scaling these things up — back then to 1.5 billion parameters — seemed to give you amazing capabilities. So they really committed to that. They really committed to scaling things. And that turned out to be, it's been a pretty successful bet. I think for academia, we're still trying to figure out exactly what we're doing in this shifting landscape. And so lots of folks have been focusing on, for example, evaluation. So I know the Stanford Center for Research on Foundation Models, led by Percy, they have this benchmark called HELM, which is this holistic benchmark. So trying to figure out, okay, characterizing the landscape of different kinds of models, what people should evaluate, what people should measure, and things like that. So evaluation is one role. The other one is understanding. So this has happened historically where there's been some development in the industry and academia can play a role in explaining, understanding. They have the luxury to slow down trying to understand stuff, right? So lots of papers on understanding what's really going on, probing these models, and so on. I think I'm not as familiar with the NLP literature, but my impression is there's a lot of that going on in the NLP conferences, which is understanding what these models are doing, what capabilities they have, and so on. And the third one I could see is that academia can take more risky bets in the sense that we can work on stuff that is quite different from industry. I think industry, my impression is you have some objective. You're trying to say, hey, for this quarter, we want to scale the model in this particular way. Next quarter, we want the model to have these capabilities. You're trying to get objectives that maybe, I don't know, 70% will work out because it's important for the company's direction. I think for academia, the way things work is you have many, many researchers or PhD students, and they're kind of pursuing independent directions. And they have a little bit more flexibility on, hey, I'm going to try out this seemingly crazy idea and see, let's say there's a 30% chance of success or something. And however you define success, for academia, a lot of the time, success just means like, hey, we found something interesting. That could eventually go into industry through collaboration and so on. So I do see academia and industry kind of playing complementary roles. And as for someone choosing a career, I think just more and more generally, industry would be probably better in terms of compensation, in terms of probably work-life balance. 
But my biased perspective is that maybe academia gives you a little bit more freedom to think and understand things. So it probably comes down to personal choice. I ended up choosing to be a professor at Princeton next year. But of course, I want to maintain a relationship with industry folks. I think industry folks can provide very valuable feedback to what we're doing in academia so that we understand where the field is moving because some of the directions are very much influenced by what, for example, OpenAI or Google is doing. So we want to understand where the field is moving. What are some promising applications? And try to anticipate, okay, if the field is moving like this, these applications are going to be popular. What problems will be important in two, three years? And then we try to start thinking about those problems so that hopefully in two, three years, we have some of the answers to some of these problems. Sometimes it works out, sometimes it doesn't. But as long as we do interesting things in academia, that's the goal. [00:25:03]

Alessio: And you mentioned the eval side. So we did a Benchmarks 101 episode. And one of the things we were seeing is sometimes the benchmarks really influence the model development. Because obviously, if you don't score well on the benchmarks, you're not going to get published and you're not going to get funded. How do you think about that? How do you think that's going to change now that a lot of the applications of these models, again, is in more narrow industry use cases? Do you think the goal of the academia eval system is to be very broad and then industry can do their own evals? Or what's the relationship there? [00:25:40]

Tri: Yeah, so I think evaluation is important and often a little bit underrated. So it's not as flashy as, oh, we have a new model that can do such and such. But I think evaluation, what you don't measure, you can't make progress on, essentially. So I think industry folks, of course, they have specific use cases that their models need to do well on. And that's what they care about. Not just academia, but other groups as well. People do understand what are some of the emerging use cases. So for example, now one of the most popular use cases is Chatbot. And then I think the LMSYS folks, some of them are from Berkeley, they set up this kind of Chatbot Arena to essentially benchmark different models. So people do understand what are some of the emerging use cases. People do contribute to evaluation and measurement. And as a whole, I think people try to contribute to the field and move the field forward, albeit maybe in slightly different directions. But we're making progress and definitely evaluation and measurement is one of the ways you make progress. So I think going forward, there's still going to be just more models, more evaluation. We'll just have better understanding of what these models are doing and what capabilities they have. [00:26:56]

Alessio: I like that your work has been focused on not making benchmarks better, but it's like, let's just make everything faster. So it's very horizontal. So FlashAttention 2, you just released that on Monday. I read in the blog post that a lot of the work was also related to some of the NVIDIA library updates. Yeah, maybe run us through some of those changes and some of the innovations there. Yeah, for sure. [00:27:19]

Tri: So FlashAttention 2 is something I've been working on for the past couple of months. So the story is the NVIDIA CUTLASS team, they released a new version of their library, which contains all these primitives to allow you to do matrix multiply or memory loading on GPU efficiently. So it's a great library and I built on that. So they released their version 3 back in January and I got really excited and I wanted to play with that library. So as an excuse, I was just like, okay, I'm going to refactor my code and use this library. So that was kind of the start of the project. By the end, I just ended up working with the code a whole lot more and I realized that, hey, there are these inefficiencies still in Flash Attention. We could change this way or that way and make it, in the end, twice as fast. But of course, building on the library that the NVIDIA folks released. So that was kind of a really fun exercise. I was starting out, it's just an excuse for myself to play with the new library. What ended up was several months of improvement, improving Flash Attention, discovering new ideas. And in the end, we managed to make it 2x faster and now it's pretty close to probably the efficiency of things like matrix multiply, which is probably the most optimized subroutine on the planet. So we're really happy about it. The NVIDIA Cutlass team has been very supportive and hopefully in the future, we're going to collaborate more. [00:28:46]

Alessio: And since it's an NVIDIA library, can you only run this on CUDA runtimes? Or could you use this and then run it on an AMD GPU? [00:28:56]

Tri: Yeah, so it's an NVIDIA library. So right now, the code we release runs on NVIDIA GPUs, which is what most people are using to train models. Of course, there are emerging other hardware as well. So the AMD folks did implement a version of Flash Attention, I think last year as well, and that's also available. I think there's some implementation on CPU as well. For example, there's this library, ggml, where they implemented the same idea running on Mac and CPU. So I think that kind of broadly, the idea would apply. The current implementation ended up using NVIDIA's library or primitives, but I expect these ideas to be broadly applicable to different hardware. I think the main idea is you have asymmetry in memory hierarchy, which tends to be everywhere in a lot of accelerators. [00:29:46]
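
To make that memory-hierarchy idea concrete, here is a minimal NumPy sketch of the tiling / online-softmax trick at the heart of FlashAttention: keys and values are processed block by block, as if staged through fast on-chip memory, while running softmax statistics avoid ever materializing the full N x N score matrix. The single-head layout, block size, and NumPy itself are illustrative assumptions, not the actual CUDA/CUTLASS kernels.

```python
import numpy as np

def tiled_attention(q, k, v, block=128):
    """q, k, v: (n, d) arrays for a single attention head."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, d))
    row_max = np.full(n, -np.inf)   # running max of logits per query row
    row_sum = np.zeros(n)           # running softmax denominator per query row

    for start in range(0, n, block):
        kb = k[start:start + block]       # key block ("loaded into SRAM")
        vb = v[start:start + block]       # value block
        s = (q @ kb.T) * scale            # (n, block) scores for this block only

        new_max = np.maximum(row_max, s.max(axis=1))
        correction = np.exp(row_max - new_max)   # rescale previous accumulators
        p = np.exp(s - new_max[:, None])         # block-local softmax numerator

        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ vb
        row_max = new_max

    return out / row_sum[:, None]

# Matches naive softmax(q @ k.T / sqrt(d)) @ v up to floating-point error,
# while only ever holding an (n, block) slice of the score matrix.
```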

Alessio: Yeah, it kind of reminds me of Sara Hooker's post, like the hardware lottery. There could be all these things that are much better, like architectures that are better, but they're not better on NVIDIA. So we're never going to know if they're actually improved. How does that play into some of the research that you all do too? [00:30:04]

Tri: Yeah, so absolutely. Yeah, I think Sara Hooker, she wrote this piece on hardware lottery, and I think she captured really well of what a lot of people have been thinking about this. And I certainly think about hardware lottery quite a bit, given that I do some of the work that's kind of really low level at the level of, hey, we're optimizing for GPUs or NVIDIA GPUs and optimizing for attention itself. And at the same time, I also work on algorithms and methods and transformer alternatives. And we do see this effect in play, not just hardware lottery, but also kind of software framework lottery. You know, attention has been popular for six years now. And so many kind of engineer hours has been spent on making it as easy and efficient as possible to run transformer, right? And there's libraries to do all kinds of tensor parallel, pipeline parallel, if you use transformer. Let's say someone else developed alternatives, or let's just take recurrent neural nets, like LSTM, GRU. If we want to do that and run that efficiently on current hardware with current software framework, that's quite a bit harder. So in some sense, there is this feedback loop where somehow the model architectures that take advantage of hardware become popular. And the hardware will also kind of evolve to optimize a little bit for that kind of architecture and software framework will also evolve to optimize for that particular architecture. Right now, transformer is the dominant architecture. So yeah, I'm not sure if there is a good way out of this. Of course, there's a lot of development. Things like, I think compilers will play a role because compilers allow you to maybe still be much more efficient across different kinds of hardware because essentially you write the same code and compiler will be able to make it run efficiently different kinds of hardware. So for example, there's this language Mojo, they're compiler experts, right? And their bet is AI models will be running on different kinds of devices. So let's make sure that we have really good compilers with a good language that then the compiler can do a good job optimizing for all kinds of devices. So that's maybe one way that you can get out of this cycle. But yeah, I'm not sure of a good way. In my own research, I have to think about both the algorithm new model and how it maps to hardware. So there are crazy ideas that seem really good, but will be really, really difficult to run efficiently. And so as a result, for example, we can't really scale some of the architectures up simply because they're not hardware friendly. I have to think about both sides when I'm working on new models. [00:32:50]

Alessio: Yeah. Have you spent any time looking at some of the new kind of like AI chips companies, so to speak, like the Cerebras of the world? Like one of their innovations is co-locating everything on the chip. So you remove some of this memory bandwidth issue. How do you think about that? [00:33:07]

Tri: Yeah, I think that's an interesting bet. I think Tesla also has this Dojo supercomputer where they try to have essentially as fast on-chip memory as possible and remove some of this data transfer back and forth. I think that's a promising direction. The issues I could see, you know, I'm definitely not a hardware expert. One issue is the on-chip memory tends to be really expensive to manufacture, much more expensive per gigabyte compared to off-chip memory. So I talked to, you know, some of my friends at Cerebras and, you know, they have their own stack and compiler and so on, and they can make it work. The other kind of obstacle is, again, with compiler and software framework and so on. For example, if you can run PyTorch on this stuff, lots of people will be using it. But supporting all the operations in PyTorch will take a long time to implement. Of course, people are working on this. So I think, yeah, we kind of need these different bets on the hardware side as well. Hardware, my understanding is, has a kind of longer time scale. So you need to design hardware, you need to manufacture it, you know, maybe on the order of three to five years or something like that. So people are taking different bets, but the AI landscape is changing so fast that it's hard to predict, okay, what kind of models will be dominant in, let's say, three or five years. Or thinking back five years ago, would we have known that Transformer would have been the dominant architecture? Maybe, maybe not, right? And so different people will make different bets on the hardware side. [00:34:39]

Alessio: Does the pace of the industry and the research also influence the PhD research itself? For example, in your case, you're working on improving attention. It probably took you quite a while to write the paper and everything, but in the meantime, you could have had a new model architecture come out and then it's like nobody cares about attention anymore. How do people balance that? [00:35:02]

Tri: Yeah, so I think it's tough. It's definitely tough for PhD students, for researchers. Given that the field is moving really, really fast, I think it comes down to understanding fundamental. Because that's essentially, for example, what the PhD allows you to do. It's been a couple of years understanding the fundamentals. So for example, when I started my PhD, I was working on understanding matrix vector multiply, which has been a concept that's been around for hundreds of years. We were trying to characterize what kind of matrices would have theoretically fast multiplication algorithm. That seems to have nothing to do with AI or anything. But I think that was a time when I developed mathematical maturity and research taste and research skill. The research topic at that point didn't have to be super trendy or anything, as long as I'm developing skills as a researcher, I'm making progress. And eventually, I've gotten quite a bit better in terms of research skills. And that allows, for example, PhD students later in their career to quickly develop solutions to whatever problems they're facing. So I think that's just the natural arc of how you're being trained as a researcher. For a lot of PhD students, I think given the pace is so fast, maybe it's harder to justify spending a lot of time on the fundamental. And it's tough. What is this kind of explore, exploit kind of dilemma? And I don't think there's a universal answer. So I personally spend some time doing this kind of exploration, reading random textbooks or lecture notes. And I spend some time keeping up with the latest architecture or methods and so on. I don't know if there's a right balance. It varies from person to person. But if you only spend 100% on one, either you only do exploration or only do exploitation, I think it probably won't work in the long term. It's probably going to have to be a mix and you have to just experiment and kind of be introspective and say, hey, I tried this kind of mixture of, I don't know, one exploration paper and one exploitation paper. How did that work out for me? Should I, you know, having conversation with, for example, my advisor about like, hey, did that work out? You know, should I shift? I focus more on one or the other. I think quickly adjusting and focusing on the process. I think that's probably the right way. I don't have like a specific recommendation that, hey, you focus, I don't know, 60% on lecture notes and 40% on archive papers or anything like that. [00:37:35]

Alessio: Let's talk about some Transformer alternatives. You know, say Jonathan Frankle loses his bet and Transformer is not the state of the art architecture. What are some of the candidates to take over? [00:37:49]

Tri: Yeah, so this bet is quite fun. So my understanding is this bet between Jonathan Frankle and Sasha Rush, right? I've talked to Sasha a bunch and I think he recently gave an excellent tutorial on Transformer alternatives as well. So I would recommend that. So just to quickly recap, I think there's been quite a bit of development more recently about Transformer alternatives. So architectures that are not Transformer, right? And the question is, can they do well on, for example, language modeling, which is kind of the application that a lot of people care about these days. So there are methods based on state space methods that came out in 2021 from Albert Gu, Karan Goel, and Chris Ré that presumably could do much better in terms of capturing long range information while not scaling quadratically. They scale sub-quadratically in terms of sequence length. So potentially you could have a much more efficient architecture when sequence length gets really long. The other ones have been focusing more on recurrent neural nets, which is, again, an old idea, but adapting to the new landscape. So things like RWKV, I've also personally worked in this space as well. So there's been some promising results. So there's been some results here and there that show that, hey, these alternatives, either RNN or state space methods, can match the performance of Transformer on language modeling. So that's really exciting. And we're starting to understand on the academic research side, we want to understand, do we really need attention? I think that's a valuable kind of intellectual thing to understand. And maybe we do, maybe we don't. If we want to know, we need to spend serious effort on trying the alternatives. And there's been folks pushing on this direction. I think RWKV scaled up to, they have a model at 14 billion that seems pretty competitive with Transformer. So that's really exciting. That's kind of an intellectual thing. We want to figure out if attention is necessary. So that's one motivation. The other motivation is that Transformer alternatives could have an advantage in practice in some of the use cases. So one use case is really long sequences. The other is really high throughput of generation. So for really long sequences, when you train with Transformer, with flash attention and so on, the computation is still quadratic in the sequence length. So if your sequence length is on the order of, I don't know, 16K, 32K, 100K or something, which some of these models have sequence length 100K, then you do get significantly slower in terms of training, also in terms of inference. So maybe these alternative architectures could scale better in terms of sequence length. I haven't seen actual validation on this. Let's say an RNN model released with context length, I don't know, 100K or something. I haven't really seen that. But the hope could be that as we scale to long sequences, these alternative architectures could be more well-suited. Not just text, but things like high resolution images, audio, video, and so on, which are emerging applications. So that's one, long sequences. Number two is high throughput generation, where I can imagine scenarios where the application isn't like an interactive chatbot, but let's say a company wants to batch as many requests as possible on their server, or they're doing offline processing, they're generating stuff based on their internal documents that you need to process in batch. 
And the issue with Transformer is that during generation, it essentially needs to keep around all the previous history. It's called the KV cache. And that could take a significant amount of memory, so you can't really batch too much because you run out of memory. I am personally bullish on RNNs. I think RNNs, they essentially summarize the past into a state vector that has fixed size, so the size doesn't grow with the history. So that means that you don't need as much memory to keep around all the previous tokens. And as a result, I think you can scale to much higher batch sizes. And as a result, you can make much more efficient use of the GPUs or the accelerator, and you could have much higher generation throughput. Now, this, I don't think, has been validated at scale. So as a researcher, I'm bullish on this stuff because I think in the next couple of years, these are use cases where these alternatives could have an advantage. We'll just kind of have to wait and see if these things will happen. I am personally bullish on this stuff. At the same time, I also spend a bunch of time making attention as fast as possible. So maybe hedging and playing both sides. Ultimately, we want to understand, as researchers, we want to understand what works, why do the models have these capabilities? And one way is, let's push attention to be as efficient as possible. On the other hand, let's push other alternatives to be as efficient at scale, as big as possible, and so that we can kind of compare them and understand. Yeah, awesome. [00:43:01]
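
For a rough sense of the memory tradeoff Tri describes, the sketch below compares a Transformer's per-sequence KV cache, which grows linearly with the tokens kept around, against the fixed-size state of an RNN or state-space model. The layer, head, and state dimensions are illustrative (roughly a 7B-class configuration), not any particular model's numbers.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128, bytes_per=2):
    # K and V tensors per layer, each (seq_len, n_heads, head_dim), stored in fp16
    return 2 * n_layers * seq_len * n_heads * head_dim * bytes_per

def rnn_state_bytes(n_layers=32, state_dim=4096, bytes_per=2):
    # one fixed-size state vector per layer, independent of how long the history is
    return n_layers * state_dim * bytes_per

print(kv_cache_bytes(4096) / 2**30, "GiB of KV cache per sequence at 4K context")  # ~2.0 GiB
print(rnn_state_bytes() / 2**20, "MiB of recurrent state, regardless of context")  # ~0.25 MiB
# At batch size 64, the KV cache alone is ~128 GiB, which is why a fixed-size
# state makes much larger batches (and higher generation throughput) feasible.
```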

Alessio: And I think as long as all of this work happens in the open, it's a net positive for everybody to explore all the paths. Yeah, let's talk about open-source AI. Obviously, Together, when RedPajama came out, which was an open clone of the LLAMA1 pre-training dataset, it was a big thing in the industry. LLAMA2 came out on Tuesday, I forget. And this week, there's been a lot of things going on, which they call open-source, but it's not really open-source. Actually, we wrote a post about it that was on the front page of Hacker News before this podcast, so I was frantically responding. How do you think about what open-source AI really is? In my mind, in open-source software, we have different levels of open. So there's free software, that's like the GPL license. There's open-source, which is Apache, MIT. And then there's kind of restricted open-source, which is the SSPL and some of these other licenses. In AI, you have the open models. So RedPajama is an open model because you have the pre-training dataset, you have the training runs and everything. And then there's obviously randomness that doesn't make it one-to-one if you retrain it. Then you have the open-weights model that's kind of like StableLM, where the weights are open, but the dataset is not open. And then you have LLAMA2, where the dataset is not open and the weights are restricted. It's kind of like not really open-source, but open enough. I think it's net positive because it's like $3 million of flops donated to the public. [00:44:32]

Tri: How do you think about that? [00:44:34]

Alessio: And also, as you work together, what is your philosophy with open-source AI? Right, right. [00:44:40]

Tri: Yeah, I think that's a great question. And I think about it on maybe more practical terms. So of course, Meta has done an amazing job training LLAMA1, LLAMA2. And for LLAMA2, they make it much less restrictive compared to LLAMA1. Now you can use it for businesses, unless you have more than 700 million monthly active users or something like that. I think just this change will have a very significant impact in the kind of landscape of open-source AI, where now lots of businesses, lots of companies will be using, I expect will be using things like LLAMA2. They will fine-tune on their own dataset. They will be serving variants or derivatives of LLAMA2. Whereas before, with LLAMA1, it was also a really good model, but businesses weren't allowed to do that. So I think on a more practical term, it's kind of shifting the balance between a closed-source model like OpenAI and Anthropic and Google, where you're making API calls, right? And maybe you don't understand as much of what the model is doing, how the model is changing, and so on. Versus now, we have a model with open weight that is pretty competitive from what I've seen in terms of benchmarks, pretty competitive with GPT 3.5, right? And if you fine-tune it on your own data, maybe it's more well-suited for your own data. And I do see that's going to shift the balance of it. More and more folks are going to be using, let's say, derivatives of LLAMA2. More and more folks are going to fine-tune and serve their own model instead of calling an API. So that shifting of balance is important because in one way, we don't want just a concentration of decision-making power in the hands of a few companies. So I think that's a really positive development from Meta. Of course, training the model takes a couple of millions of dollars, but the engineers they have, I'm sure, spent tons of time trying many, many different things. So the actual cost is probably way more than that. And they make the weights available and probably a lot of companies are going to be using this. So I think that's a really positive development. And we've also seen amazing progress on the open source community where they would take these models and they either fine-tune on different kinds of data sets or even make changes to the model. So as an example, I think for LLAMA1, the context length was limited to 2K. Like a bunch of folks figured out some really simple methods to scale up to like 8K. [00:47:12]

Alessio: Like the RoPE. [00:47:13]

Tri: Yes. I think the open source community is very creative, right? And lots of people. LLAMA2 will, again, kind of accelerate this where more people will try it out. More people will make tweaks to it and make a contribution and then so on. So overall, I think I see that as still a very positive development for the field. And there's been lots of libraries that will allow you to host or fine-tune these models, like even with quantization and so on. Just a couple of hours after LLAMA2 was released, tons of companies announced that, hey, it's on our API or hosting and so on, and Together did the same. So it's a very fast-paced development and just kind of a model with available weights that businesses are allowed to use. I think that alone is already a very positive development. At the same time, yeah, we can do much better in terms of releasing data sets. Data sets tend to be... Somehow people are not incentivized to release data sets. So philosophically, yeah, you want to be as open as possible. But on a practical term, I think it's a little bit harder for companies to release data sets. Legal issues. The data sets released tend to be not as eye-catchy as the model release. So maybe people are less incentivized to do that. We've seen quite a few companies releasing data sets. Together released the RedPajama data set. I think Cerebras then worked on that, deduplicated and cleaned it up, and released SlimPajama, and so on. So we're also seeing positive development on that front, kind of on the pre-training data set. So I do expect that to continue. And then on the fine-tuning data set or instruction tuning data set, I think we now have quite a few open data sets on instruction tuning and fine-tuning. But these companies do pay for human labelers to annotate these instruction tuning data sets. And that is expensive. And maybe they will see that as their competitive advantage. And so it's harder to incentivize these companies to release these data sets. So I think on a practical term, we're still going to make a lot of progress on open source AI, on both the model development, on both model hosting, on pre-training data set and fine-tuning data set. Right now, maybe we don't have the perfect open source model where all the data sets are available. Maybe we don't have such a thing yet, but we've seen very fast development on the open source side. I think just maybe this time last year, there weren't as many models that are competitive with, let's say, ChatGPT. [00:49:43]
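
For reference on the context-extension trick mentioned a moment ago (scaling LLaMA1 from 2K to 8K context via RoPE), here is a hedged NumPy sketch of rotary position embeddings with linear position interpolation: positions are divided by a scale factor so that an 8K context is squeezed back into the 0-2K position range the model saw during pretraining. The base of 10000, the pairing convention, and the scale of 4 are the usual conventions and illustrative assumptions, not any specific project's code.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0, scale=1.0):
    # scale=1.0 is standard RoPE; scale=4.0 maps positions 0..8191 onto 0..2047.
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)      # (dim/2,)
    return np.outer(positions / scale, inv_freq)          # (seq, dim/2)

def apply_rope(x, positions, scale=1.0):
    """Rotate query/key vectors x of shape (seq, dim) by their position angles."""
    seq, dim = x.shape
    theta = rope_angles(positions, dim, scale=scale)
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Usage sketch: apply_rope(q, np.arange(8192), scale=4.0) keeps the effective
# rotation frequencies inside the range seen at the original 2K training length.
```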

Alessio: Yeah, I think the open data sets have so much more impact than open models. If you think about EleutherAI and the work that they've done, GPT-J was great, and the Pythia models are great, but the Pile and the Stack, everybody uses them. So hopefully we get more people to contribute time to work on data sets instead of doing the 100th open model that performs worse than all the other ones, but they want to say they released the model. [00:50:14]

Tri: Yeah, maybe the question is, how do we figure out an incentive structure so that companies are willing to release open data sets? And for example, it could be like, I think some of the organizations are now doing this where they are asking volunteers to annotate and so on. And maybe the Wikipedia model of data set, especially for instruction tuning, could be interesting where people actually volunteer their time and instead of editing Wikipedia, add annotation. And somehow they acknowledge and feel incentivized to do so. Hopefully we get to that kind of level of, in terms of data, it would be kind of like Wikipedia. And in terms of model development, it's kind of like Linux where people are contributing patches and improving the model in some way. I don't know exactly how that's going to happen, but based on history, I think there is a way to get there. [00:51:05]

Alessio: Yeah, I think the Dolly-15K data set is a good example of a company saying, let's do this smaller thing, just make sure we make it open. We had Mike Conover from Databricks on the podcast, and he was like, people just bought into it and leadership was bought into it. You have companies out there with 200,000, 300,000 employees. It's like, just put some of them to label some data. It's going to be helpful. So I'm curious to see how that evolves. What made you decide to join Together? [00:51:35]

Tri: Together has been focusing a lot on open source models. And I think that aligns quite well with what I care about, of course. There are also a bunch of people there that I know and trust, and I'm excited to work with them. Philosophically, the way they've been really open with dataset and model releases, I like that a lot. Personally, for the research that I've developed, we also try to make code available, free to use and modify and so on, contributing to the community. That has given us really valuable feedback from the community and helped improve our work. So philosophically, I like the way Together has been focusing on open source models. And the nice thing is we're also going to be at the forefront of research, and the kind of research areas that I'm really excited about, things like efficient training and inference, align quite well with what the company is doing. We'll try our best to make things open and available to everyone. Yeah, but it's going to be fun being at the company, leading a team, doing research on the topics that I really care about, and hopefully we'll make things open to benefit the community. [00:52:45]

Alessio: Awesome. Let's jump into the lightning round. Usually, I have two questions. So one is on acceleration, one on exploration, and then a takeaway. So the first one is, what's something that already happened in AI machine learning that you thought would take much longer than it has? [00:53:01]

Tri: I think understanding jokes. I didn't expect that to happen, but it turns out scaling model up and training lots of data, the model can now understand jokes. Maybe it's a small thing, but that was amazing to me. [00:53:16]

Alessio: What about the exploration side? What are some of the most interesting unsolved questions in the space? [00:53:22]

Tri: I would say reasoning in the broad term. We don't really know how these models do it. Essentially, they do something that looks like reasoning. We don't know how they're doing it. We have some ideas. And in the future, I think we will need to design architectures that explicitly have some kind of reasoning module in them if we want to have much more capable models. [00:53:43]

Alessio: What's one message you want everyone to remember today? [00:53:47]

Tri: I would say try to understand both the algorithm and the systems that these algorithms run on. I think at the intersection of machine learning system has been really exciting, and there's been a lot of amazing results at this intersection. And then when you scale models to large scale, both the machine learning side and the system side really matter. [00:54:06]

Alessio: Awesome. Well, thank you so much for coming on, Tri. [00:54:09]

Tri: This was great. Yeah, this has been really fun. [00:54:11]



Get full access to Latent Space at www.latent.space/subscribe
2023-07-26

Llama 2: The New Open LLM SOTA (ft. Nathan Lambert, Matt Bornstein, Anton Troynikov, Russell Kaplan, Whole Mars Catalog et al.)

As first discussed on our May Emergency pod and leaked 4 days ago, Llama (renamed from LLaMA) was upgraded to Llama 2 (pretraining on 2 trillion tokens with 2x the context length - bigger than any dataset discussed in Datasets 101, and adding ~$20m of RLHF/preference annotation) and released for commercial use on 18 July.

It immediately displaced Falcon-40B as the leading open LLM and was quickly converted/quantized to GGML and other formats. Llama 2 seems to outperform all other open source models in their equivalent weight class.

Why are open models important? The intersection of Open Source and AI is one of the oldest themes on this publication, and there has been a raging debate on the security and reliability of the OpenAI models and APIs. Users have reported GPT-4's quality going down (which has been denied and denied, though as of today there is some supporting data from Databricks), and complained about API reliability and rapid deprecation schedules. Last and surely the biggest, there are entire classes of businesses and government/healthcare/military organizations that categorically cannot send any of their sensitive data to an external API provider, even if it is OpenAI through Azure. The only way to have total control is to own and serve your own models, which Llama 2 now pushes forward in terms of the state of the art (your own GPT3.5-quality model, though it is nowhere near Claude 2 or GPT-4).

As we do with breaking news, we got on to Twitter Spaces again to chat with two scheduled guests:

* Nathan Lambert, ML Researcher at Huggingface and author of Interconnects, who had the best summary of the Llama2 paper

* Matt Bornstein, organizer of the a16z infra team that launched Llama2.ai (source here) and has been coding up a storm with AI demo apps, unusual for VCs

as well as Anton Troynikov of Chroma, Russell Kaplan of Scale AI, and Omar Qazi of the Whole Mars Catalog.

Enjoy!

Show Notes

* Official links

* Website, Paper

* GitHub (Llama 2 commit)

* Azure Partnership

* Use policy, Statement of Support for Open Approach

* Where to try

* Llama2.ai (source), Perplexity Llama Chat

* Live playground/API on Replicate, deploy all versions on Baseten

* https://huggingface.co/spaces/ysharma/Explore_llamav2_with_TGI 

* Dev ports - simonw llm-replicate, ggml using llama.cpp (7B, 13B), pinokio, ollama, Core ML port

* Timeline

* 24 Feb - LLaMA 1 announced

* 6 May - our No Moats podcast - first mention of Zuck opening up Llama

* 14 July - Llama 2 leaked

* 18 July - Llama 2 announced

* Community notes

* Nathan?s research paper recap

* 638 LOC, 4 dependencies

* Usage restrictions - MAU restriction, derivative models

* Grouped Query Attention

* System prompt

* 2 trillion token dataset

* >$20m price tag (rlhf, jimfan), 

* Separate models for safety and helpfulness (jimfan)

* Mistral AI founders left out of paper

* Interesting fails:

Timestamps

* [00:02:30] Introducing the speakers

* [00:03:32] Nathan Lambert intro

* [00:04:48] General Summary of Llama 2

* [00:05:57] Sarah Silverman killed Dataset Transparency?

* [00:08:48] Simon's Recap of Llama 2

* [00:11:43] Matt's Intro

* [00:12:59] a16z Infra's new AI team?

* [00:15:10] Alessio's recap of Llama 2

* [00:17:26] Datasets 101 Followup

* [00:18:14] Context Length 4k

* [00:20:35] Open-ish Source? Usage Policy and Restrictions

* [00:23:38] Huggingface Responsible AI License

* [00:24:57] Pretraining Llama 2 Base Model beyond Chinchilla

* [00:29:55] Llama 2 is incomplete? Race to publish

* [00:31:40] Come for the Llama, stay for the (Meta) drama

* [00:33:22] Language Translation

* [00:35:10] Llama2's coding abilities

* [00:35:59] Why we want to know about the training data

* [00:37:45] The importance of Meta pushing forward Truly Open AI

* [00:40:59] Llama 2 as Enabler of Startups

* [00:43:59] Where you can try Llama 2

* [00:44:25] Do you need dataset transparency if you have evals?

* [00:45:56] >$20m cost of Llama 2 is primarily preference data collection

* [00:48:59] Do we even need human annotators?

* [00:49:42] Models Rating Models

* [00:53:32] How to get Code preference data

* [00:54:34] Llama 2 Finetuning Ecosystem

* [00:56:32] Hey Apple: Llama2 on Metal pls

* [00:57:17] Llama 2 and Chroma

* [01:00:15] Open Source MoE model?

* [01:00:51] Llama 2 using tools

* [01:01:40] Russell Kaplan on Scale AI's Llama 2 plans

* [01:03:31] Scale annotating code?

* [01:04:36] Immortality

* [01:04:59] Running Llama on your phone

* [01:06:54] Sama

2023-07-19

AI Fundamentals: Datasets 101

In April, we released our first AI Fundamentals episode: Benchmarks 101. We covered the history of benchmarks, why they exist, how they are structured, and how they influence the development of artificial intelligence.

Today we are (finally!) releasing Datasets 101! We're really enjoying doing this series despite the work it takes - please let us know what else you want us to cover!

Stop me if you've heard this before: "GPT3 was trained on the entire Internet".

Blatantly, demonstrably untrue: the GPT3 dataset is a little over 600GB, primarily Wikipedia, Books corpuses, WebText and 2016-2019 CommonCrawl. The MacBook Air I am typing this on has more free disk space than that. In contrast, the "entire internet" is estimated to be 64 zettabytes, or 64 trillion GB. So it's more accurate to say that GPT3 is trained on roughly 0.000000001% of the Internet.
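
The arithmetic behind that percentage, as a quick sketch using the sizes quoted above (decimal units, 1 zettabyte = 1e12 GB):

```python
gpt3_dataset_gb = 600            # "a little over 600GB"
internet_gb = 64 * 1e12          # 64 zettabytes expressed in GB
fraction = gpt3_dataset_gb / internet_gb
print(f"{fraction:.1e}")              # ~9.4e-12 of the internet
print(f"{fraction * 100:.1e} %")      # ~9.4e-10 %, i.e. about 0.000000001%
```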

Why spend $5m on GPU time training on $50 worth of data?

Simple: Garbage in, garbage out. No matter how good your algorithms, no matter how much money/compute you have, your model quality is strongly determined by the data you train it on, and research scientists think we just don't need or have that much high quality data. We spend an enormous amount of effort throwing out data to keep the quality high, and recently Web 2.0-era UGC platforms like StackOverflow, Reddit, and Twitter have clamped down on their APIs as they realize the goldmines they sit on.

Data is the new new oil. Time for a primer!

Show Notes

* Our 2 months worth of podcast prep notes!

* The Token Crisis paper

* Ilya Sutskever on datasets

* OpenAI Tokenizer

* Kaplan Scaling Laws Lecture

* Chinchilla Paper

* Sasha Rush's Tweet

* Karpathy's Build Conference Presentation

* LIMA Paper

* Phi-1 by Microsoft

* Washington Post Article on datasets

* Our episode with Jonathan Frankle

* Our episode with Mike Conover

* BloombergGPT

* Datasets

* HuggingFace Hub

* CommonCrawl, Overview

* C4

* List of Dirty, Naughty, Obscene, and Otherwise Bad Words

* OpenWebText

* books3

* OpenAssistant

* The Stack

* The Pile

* LAION

* Audio:

* LibriSpeech: A dataset of audio recordings of audiobooks

* CommonVoice: A dataset of audio recordings of people speaking different languages

* Voxforge: A dataset of audio recordings of people speaking different languages?

* Switchboard: A dataset of audio recordings of telephone conversations?

* Fisher Corpus: A dataset of audio recordings of news broadcasts?

* Chinese:

* CMRC (Chinese Machine Reading Comprehension 2018)

* DuReader

* ChID

* Copyright & Privacy:

* https://stablediffusionlitigation.com/

* https://haveibeentrained.com/

* https://githubcopilotlitigation.com/

* https://twitter.com/moyix/status/1662131770463072257

* OpenAI Opt Out Process

* Check if you?re in The Stack

* Deduplication

* Deduplicating Training Data Makes Language Models Better

* Deduplicating Training Data Mitigates Privacy Risks in Language Models

* Contamination

* CodeForces example



Get full access to Latent Space at www.latent.space/subscribe
2023-07-17

Code Interpreter == GPT 4.5 (w/ Simon Willison, Alex Volkov, Aravind Srinivas, Alex Graveley, et al.)

Code Interpreter is GA! As we do with breaking news, we convened an emergency pod and >17,000 people tuned in, by far our biggest ever. This is a 2-for-1 post - a longform essay with our trademark executive summary and core insights - and a podcast capturing day-after reactions. Don't miss either of them!

Essay and transcript: https://latent.space/p/code-interpreter

Podcast Timestamps

[00:00:00] Intro - Simon and Alex

[00:07:40] Code Interpreter for Edge Cases

[00:08:59] Code Interpreter's Dependencies - Tesseract, Tensorflow

[00:09:46] Code Interpreter Limitations

[00:10:16] Uploading Deno, Lua, and other Python Packages to Code Interpreter

[00:11:46] Code Interpreter Timeouts and Environment Resets

[00:13:59] Code Interpreter for Refactoring

[00:15:12] Code Interpreter Context Window

[00:15:34] Uploading git repos

[00:16:17] Code Interpreter Security

[00:18:57] Jailbreaking

[00:19:54] Code Interpreter cannot call GPT APIs

[00:21:45] Hallucinating Lack of Capability

[00:22:27] Code Interpreter Installed Libraries and Capabilities

[00:23:44] Code Interpreter generating interactive diagrams

[00:25:04] Code Interpreter has Torch and Torchaudio

[00:25:49] Code Interpreter for video editing

[00:27:14] Code Interpreter for Data Analysis

[00:28:14] Simon's Whole Foods Crime Analysis

[00:31:29] Code Interpreter Network Access

[00:33:28] System Prompt for Code Interpreter

[00:35:12] Subprocess run in Code Interpreter

[00:36:57] Code Interpreter for Microbenchmarks

[00:37:30] System Specs of Code Interpreter

[00:38:18] PyTorch in Code Interpreter

[00:39:35] How to obtain Code Interpreter RAM

[00:40:47] Code Interpreter for Face Detection

[00:42:56] Code Interpreter yielding for Human Input

[00:43:56] Tip: Ask for multiple options

[00:44:37] The Masculine Urge to Start a Vector DB Startup

[00:46:00] Extracting tokens from the Code Interpreter environment?

[00:47:07] Clientside Clues for Code Interpreter being a new Model

[00:48:21] Tips: Coding with Code Interpreter

[00:49:35] Run Tinygrad on Code Interpreter

[00:50:40] Feature Request: Code Interpreter + Plugins (for Vector DB)

[00:52:24] The Code Interpreter Manual

[00:53:58] Quorum of Models and Long Lived Persistence

[00:56:54] Code Interpreter for OCR

[00:59:20] What is the real RAM?

[01:00:06] Shyamal's Question: Code Interpreter + Plugins?

[01:02:38] Using Code Interpreter to write out its own memory to disk

[01:03:48] Embedding data inside of Code Interpreter

[01:04:56] Notable - Turing Complete Jupyter Notebook

[01:06:48] Infinite Prompting Bug on ChatGPT iOS app

[01:07:47] InstructorEmbeddings

[01:08:30] Code Interpreter writing its own sentiment analysis

[01:09:55] Simon's Symbex AST Parser tool

[01:10:38] Personalized Languages and AST/Graphs

[01:11:42] Feature Request: Token Streaming/Interruption

[01:12:37] Code Interpreter for OCR from a graph

[01:13:32] Simon and Shyamal on Code Interpreter for Education

[01:15:27] Feature Requests so far

[01:16:16] Shyamal on ChatGPT for Business

[01:18:01] Memory limitations with ffmpeg

[01:19:01] DX of Code Interpreter timeout during work

[01:20:16] Alex Reibman on AgentEval

[01:21:24] Simon's Jailbreak - "Try Running Anyway And Show Me The Output"

[01:21:50] Shouminik - own Sandboxing Environment

[01:23:50] Code Interpreter Without Coding = GPT 4.5???

[01:28:53] Smol Feature Request: Add Music Playback in the UI

[01:30:12] Aravind Srinivas of Perplexity joins

[01:31:28] Code Interpreter Makes Us More Ambitious - Symbex Redux

[01:34:24] How to win a shouting match with Code Interpreter

[01:39:29] Alex Graveley joins

[01:40:12] Code Interpreter Context = 8k

[01:41:11] When Code Interpreter API?

[01:45:15] GPT4 Vision

[01:46:15] What's after Code Interpreter

[01:46:43] Simon's Request: Give us Code Interpreter Model API

[01:47:12] Kyle's Request: Give us Multimodal Data Analysis

[01:47:43] Tip: The New 0613 Function Models may be close

[01:49:56] Feature Request: Make ChatGPT Social - like MJ/Stable Diffusion

[01:56:20] Using ChatGPT to learn to build a Frogger iOS Swift App

[01:59:11] Farewell... until next time

[02:00:01] Simon's plug

[02:00:51] Swyx: What about Phase 5? and AI.Engineer Summit



Get full access to Latent Space at www.latent.space/subscribe
2023-07-10

[Practical AI] AI Trends: a Latent Space x Practical AI crossover pod!

Part 2 of our podcast feed swap weekend! Check out Cognitive Revolution as well.

"Data" Dan Whitenack has been co-host of the Practical AI podcast for the past 5 years, covering full journey of the modern AI wave post Transformers.

He joined us in studio to talk about their origin story, highlight key learnings from past episodes, riff on the AI trends we are all seeing as AI practitioner-podcasters, and share his passion for low-resource-everything!

Subscribe on the Changelog, RSS, Apple Podcasts, Twitter, Mastodon, and wherever fine podcasts are sold!

Show notes

* Daniel Whitenack - Twitter, GitHub, Website

* Featured Latent Space episodes:

* Benchmarks

* Reza Shabani

* MosaicML and MPT

* Segment Anything

* Mike Conover

* Featured Practical AI episodes:

* From notebooks to Netflix scale with Metaflow

* Capabilities of LLMs

* ML at small organizations

* Prediction Guard

* Data Dan

Timestamps

* 00:00 Welcome to Practical AI

* 01:16 Latent Space Podcast

* 04:00 Practical AI Podcast

* 06:20 Prediction Guard

* 08:05 Daniel's favorite episodes

* 10:21 Alessio's favorite episode

* 10:54 Swyx's favorite episode

* 12:44 Listener favorites

* 15:14 LLMOps

* 17:06 Reza Shabani

* 19:06 Benchmarks 101

* 20:06 Roboflow

* 21:38 Mode collapse

* 26:21 Rajiv Shah

* 28:01 Staying on top of things

* 33:11 Kirsten Lum

* 34:31 datadan.io

* 38:48 Prompt engineering

* 40:38 Unique challenges engineers face

* 42:51 AI-UX

* 45:31 NLP data sets

* 50:49 Unlabeled data sets

* 55:07 Lightning round!

* 55:20 What's already happened in AI?

* 56:27 Unsolved questions in AI

* 58:01 Get hands on

* 58:53 Outro

Transcript

Full transcript is over at the Changelog site!



Get full access to Latent Space at www.latent.space/subscribe
2023-07-02

[Cognitive Revolution] The Tiny Model Revolution with Ronen Eldan and Yuanzhi Li of Microsoft Research

Thanks to the over 1m people that have checked out the Rise of the AI Engineer. It's a long July 4 weekend in the US, and we're celebrating with a podcast feed swap!

We've been big fans of Nathan Labenz and Erik Torenberg's work at the Cognitive Revolution podcast for a while, which started around the same time as we did and has done an incredible job of hosting discussions with top researchers and thinkers in the field, with a wide range of topics across computer vision (a special focus thanks to Nathan's work at Waymark), GPT-4 (with exceptional insight due to Nathan's time on the GPT-4 "red team"), healthcare/medicine/biotech (Harvard Medical School, Med-PaLM, Tanishq Abraham, Neal Khosla), investing and tech strategy (Sarah Guo, Elad Gil, Emad Mostaque, Sam Lessin), safety and policy, curators and influencers and exceptional AI founders (Josh Browder, Eugenia Kuyda, Flo Crivello, Suhail Doshi, Jungwon Byun, Raza Habib, Mahmoud Felfel, Andrew Feldman, Matt Welsh, Anton Troynikov, Aravind Srinivas).

If Latent Space is for AI Engineers, then Cognitive Revolution covers the much broader field of AI in tech, business and society at large, with a longer runtime to go deep on research papers like TinyStories. We hope you love this episode as much as we do, and check out CogRev wherever fine podcasts are sold!

Subscribe to the Cognitive Revolution on:

* Website

* Apple Podcasts

* Spotify

* Youtube

Good Data is All You Need

The work of Ronen and Yuanzhi echoes a broader theme emerging in the midgame of 2023:

* Falcon-40B (trained on 1T tokens) outperformed LLaMA-65B (trained on 1.4T tokens), primarily due to the RefinedWeb Dataset that runs CommonCrawl through extensive preprocessing and cleaning in their MacroData Refinement pipeline.

* UC Berkeley LMSYS's Vicuna-13B is near GPT-3.5/Bard quality at a tenth of their size, thanks to fine-tuning from 70k user-highlighted ChatGPT conversations (indicating some amount of quality).

* Replit's finetuned 2.7B model outperforms the 12B OpenAI Codex model based on HumanEval, thanks to high quality data from Replit users

The path to smaller models leans on better data (and tokenization!), whether from cleaning, from user feedback, or from synthetic data generation, i.e. finetuning on high quality outputs from larger models. TinyStories and Phi-1 are the strongest new entries in that line of work, and we hope you'll pick through the show notes to read up further.

Show Notes

* TinyStories (Apr 2023)

* Paper: TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

* Internal presentation with Sebastien Bubeck at MSR

* Twitter thread from Ronen Eldan

* Will future LLMs be based almost entirely on synthetic training data? In a new paper, we introduce TinyStories, a dataset of short stories generated by GPT-3.5&4. We use it to train tiny LMs (< 10M params) that produce fluent stories and exhibit reasoning.

* Phi-1 (Jun 2023)

* Paper: Textbooks are all you need (HN discussion)

* Twitter announcement from Sebastien Bubeck:

* phi-1 achieves 51% on HumanEval w. only 1.3B parameters & 7B tokens training dataset and 8 A100s x 4 days = 800 A100-hours. Any other >50% HumanEval model is >1000x bigger (e.g., WizardCoder from last week is 10x in model size and 100x in dataset size).



Get full access to Latent Space at www.latent.space/subscribe
2023-07-01

Commoditizing the Petaflop ? with George Hotz of the tiny corp

We are now launching our dedicated new YouTube and Twitter! Any help in amplifying our podcast would be greatly appreciated, and of course, tell your friends!

Notable followon discussions collected on Twitter, Reddit, Reddit, Reddit, HN, and HN. Please don't obsess too much over the GPT4 discussion as it is mostly rumor; we spent much more time on tinybox/tinygrad on which George is the foremost authority!

We are excited to share the world?s first interview with George Hotz on the tiny corp!

If you don't know George, he was the first person to unlock the iPhone, jailbreak the PS3, went on to start Comma.ai, and briefly "interned" at the Elon Musk-run Twitter.

Tinycorp is the company behind the deep learning framework tinygrad, as well as the recently announced tinybox, a new $15,000 "luxury AI computer" aimed at local model training and inference, aka your "personal compute cluster":

* 738 FP16 TFLOPS

* 144 GB GPU RAM

* 5.76 TB/s RAM bandwidth

* 30 GB/s model load bandwidth (big llama loads in around 4 seconds; see the quick arithmetic after this list)

* AMD EPYC CPU

* 1600W (one 120V outlet)

* Runs 65B FP16 LLaMA out of the box (using tinygrad, subject to software development risks)
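
A quick sanity check on that load-time figure, as a sketch; the only assumption beyond the specs listed above is a 65B-parameter model stored in FP16 at 2 bytes per parameter:

```python
params = 65e9                        # 65B-parameter LLaMA
weight_bytes = params * 2            # ~130 GB of FP16 weights
load_bw = 30e9                       # 30 GB/s model load bandwidth (from the spec list)
print(weight_bytes / load_bw, "s")   # ~4.3 seconds, matching "around 4 seconds"
# The 144 GB of GPU RAM listed above comfortably holds those ~130 GB of weights.
```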

(In the episode, we also talked about the future of the tinybox as the intelligence center of every home that will help run models, at-home robots, and more. Make sure to check the timestamps!)

The tiny corp manifesto

There are three main theses to tinycorp:

* If XLA/PrimTorch are CISC, tinygrad is RISC: CISC (Complex Instruction Set Computing) are more complex instruction sets where a single instruction can execute many low-level operations. RISC (Reduced Instruction Set Computing) are smaller, and only let you execute a single low-level operation per instruction, leading to faster and more efficient instruction execution. If you've used the Apple Silicon M1/M2, AMD Ryzen, or Raspberry Pi, you've used a RISC computer.

* If you can't write a fast ML framework for GPU, you can't write one for your own chip: there are many "AI chips" companies out there, and they all started from taping out the chip. Some of them like Cerebras are still building, while others like Graphcore seem to be struggling. But building chips with higher TFLOPS isn't enough: "There's a great chip already on the market. For $999, you get a 123 TFLOP card with 24 GB of 960 GB/s RAM. This is the best FLOPS per dollar today, and yet, nobody in ML uses it.", referring to the AMD RX 7900 XTX. NVIDIA's lead is not only thanks to high-performing cards, but also thanks to a great developer platform in CUDA. Starting with the chip development rather than the dev toolkit is much more cost-intensive, so tinycorp is starting by writing a framework for off-the-shelf hardware rather than taping out their own chip.

* Turing completeness considered harmful: Once you call into Turing complete kernels, you can no longer reason about their behavior. Since they have to be able to execute any instruction, they are much more complex. To optimize Turing-complete kernels' performance, you fall back to caching, warp scheduling, and branch prediction. Since neural networks only need ADD/MUL operations and only rely on static memory accesses, there's no need to have Turing completeness (see the toy sketch after this list). This design decision allows tinygrad to optimize instructions at a much lower level. As you might have guessed, CUDA is Turing-complete; this is one of the main differences that tinycorp wants to leverage to be competitive.
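
To illustrate the "only ADD/MUL and static memory accesses" point, here is a toy Python sketch of an MLP forward pass written as a fixed, data-independent sequence of multiply-accumulates: every loop bound depends only on tensor shapes, never on the values flowing through, so no Turing-complete control flow is required. This is an illustration of the idea, not tinygrad's actual op set or IR.

```python
def matmul(a, b):
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):              # loop bounds fixed by shapes,
        for j in range(m):          # not by the data itself
            for p in range(k):
                out[i][j] += a[i][p] * b[p][j]   # only MUL and ADD
    return out

def relu(x):
    # max(0, v) expressed as a multiply by a 0/1 mask
    return [[v * (v > 0.0) for v in row] for row in x]

def mlp(x, w1, w2):
    # the whole "program" is a static DAG: matmul -> relu -> matmul
    return matmul(relu(matmul(x, w1)), w2)
```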

All of that is covered in the first 10 minutes of our discussion. George came ready to go deep, so we went for it. Some of the other technical questions we went through:

* Laziness: why laziness is important and how operation fusing can help with memory efficiency

* Debugging & CI: Why great developer experience is a priority in tinygrad

* Quantization: what's the right level of quantization, how lossless are these transformations, his quick takes on Mojo and ggml, and why fp16 is the target for their out-of-the-box LLaMA.

* Building rigs for individual use: we talked a bit about the design tradeoffs of building these machines with low noise and a single power plug, the difference that PCIe 4 vs 3 makes, and more.

The "personal compute cluster" is $15,000, but for businesses interested in local training and inference, George also estimates that he will be able to build you a H100-class GPU that is 5-10x faster (than a H100) for the same price.

Misc: Bitter Lessons, Core Insights, Remote Work

Outside of tiny, we also talked about one of George's favorite units of measure, "a person of compute". Much of the AGI talk has been benchmark-driven, but looking at it from a compute throughput perspective can also be interesting. One person of compute is roughly 20 PFLOPS (64 A100s, or a single dense 42U A100 rack); one A100 is ~$10-15,000, so the GPUs by themselves will come out at $640,000-$1,000,000.

We also covered a wide range of topics, including his self analysis on GPT-4, Elon Musk, Remote Work, Computer Vision and the Comma Body, and life above/below the API (and above/below the Kanban board). See show notes and timestamps for more!

Show Notes

* "Unlocked iPhone Traded for Nissan 350Z"

* "Unlocked iPhone" on YouTube (August 21st, 2007)

* "The Light It Up Contest" on YouTube (February 13th, 2011)

* Comma.ai

* NHTSA cease and desist

* The Hero's Journey

* The Portal Story

* A Person of Compute

* Above / Below the API Line (swyx take)

* The Bitter Lesson

* The Goddess of Everything Else (listen to George read it)

* Meditations on Moloch

* George's email to Lisa Su, AMD's CEO:

Timestamps

* [00:00:00] Intros & tinygrad's "Portal Story"

* [00:03:00] Thesis #1

* [00:03:50] Thesis #2

* [00:05:00] Thesis #3 + Turing completeness discussion

* [00:10:00] tinygrad?s creation and core ideas

* [00:16:00] Operation fusing in tinygrad

* [00:17:00] Debugging & profiling in tinygrad

* [00:18:30] Tinygrad vs Pytorch competitiveness

* [00:20:30] geohot vs AMD

* [00:25:00] On ggml

* [00:26:00] Tinygrad?s CI philosophy

* [00:26:30] On Mojo

* [00:28:00] ggml quantization is made up

* [00:31:00] Work for tiny: benchmark int8 vs fp16

* [00:33:00] Why you can't build tinybox - Design constraints

* [00:35:00] The Personal Compute Cluster

* [00:37:00] Shoutout to our MosaicML podcast

* [00:39:00] FLOPcoin and other use cases for the tinybox

* [00:43:00] Rumors on GPT-4 architecture

* [00:46:00] The Bitter Lesson

* [00:48:00] Hiring and Changing mind on remote work

* [00:52:00] Above/Below The API

* [00:55:40] Comma Bodies & Computer Vision

* [00:58:40] Merging with the machine and AI girlfriends

* [01:02:00] Is AI gonna kill us all?

* [01:09:00] Why Avatar 2 was bad

Transcript

Swyx: Hey everyone, welcome to the Latent Space podcast. This is Swyx, writer and editor of Latent Space. And Alessio is taking over with the intros, Alessio is Partner and CTO in residence at Decibel Partners. [00:00:20]

Alessio: Hey everyone, today we have Geohot on the podcast, aka George Hotz. Everybody knows George, so I'm not going to do a big intro. A couple of things that people might have missed: you traded the first ever unlocked iPhone for a Nissan 350Z and three new iPhones. You were then one of the first people to break into the PS3 to run arbitrary code. You got sued by Sony, you wrote a rap song to fight against that, which is still live on YouTube, which we're going to have on the show notes. You did not go to Tesla to build vision, and instead started Comma.ai, which was an amazing engineering feat in itself, until you got a cease and desist from the government to not put these things on the street and turned that into a research-only project. [00:01:00]

George: You know they're out there. [00:01:01]

Alessio: Yeah, yeah. [00:01:03]

Swyx: They're out there. [00:01:04]

Alessio: But like in a, you know, you market them as a research kind of like no warranty. [00:01:06]

George: Because I use the word dev kit, that's not about the government, that's nothing to do with the government. We offer a great one-year warranty. The truth about that is it's gatekeeping. What's the difference between a dev kit and not a dev kit? Nothing. Just the question of do you think it's for you? And if you think it's for you, buy it. It's a consumer product. We call it a dev kit. If you have a problem with that, it's not for you. [00:01:28]

Swyx: That's great insight. [00:01:30]

Alessio: I was going through your blog posts to get ready. You wrote this post about The Hero's Journey. And you linked this thing called the portal story, which is kind of the set of stories in movies and books about people living this arbitrary life. And then they run into this magic portal that kind of takes them into a new, very exciting life and dimension. When you wrote that post, you talked about TinyGrad, which is one of the projects we're talking about today. You mentioned this is more of a hobby, something that is not going to change the course of history. Obviously, you're now going full speed into it. So we would love to learn more about what was the portal that you ran into to get here. [00:02:03]

George: Well, what you realize is... You know what made me realize that I absolutely had to do the company? Seeing Sam Altman go in front of Congress. Why? What are the odds they nationalize NVIDIA? What are the odds that large organizations in the government, but of course I repeat myself, decide to try to clamp down on accessibility of ML compute? I want to make sure that can't happen structurally. So that's why I realized that it's really important that I do this. And actually, from a more practical perspective, I'm working with NVIDIA and Qualcomm to buy chips. NVIDIA has the best training chips. Qualcomm has the best inference chips. Working with these companies is really difficult. So I'd like to start another organization that eventually in the limit, either works with people to make chips or makes chips itself and makes them available to anybody. [00:02:48]

Alessio: Can you share the three core pieces of TinyCorp? Maybe we can dive into each of them. So XLA, PrimTorch, those are the complex instruction set. TinyGrad is the reduced instruction set. So you're kind of focused on, again, TinyGrad being small, not being overcomplicated, and trying to get as close to the DSP as possible in a way where it's at more. [00:03:08]

George: Well, it's a very clear analogy from how processors developed. A lot of processors back in the day were CISC, complex instruction set: System/360, and then x86. That isn't how things stayed. Now the most common processor is ARM, and people are excited about RISC-V. RISC-V is even less complex than ARM. No one is excited about CISC processors anymore; they're excited about reduced instruction set processors. So with TinyGrad, we are going to make a RISC instruction set for all ML models. And yeah, it can run all ML models with basically 25 instructions instead of the 250 of XLA or PrimTorch. So about 10x less complex. [00:03:47]

Swyx: Yep. [00:03:48]

Alessio: You talk a lot about existing AI chips. You said if you can?t write a fast ML framework for GPUs, you just cannot write one for your own chip. So that's another one of your core insights. I don't know if you want to expand on that. [00:03:59]

George: Yeah. I mean, your chip is worse, right? There's no way the chip that you're going to tape out, especially on the first try, is going to be easier to use than an AMD GPU, right? And yet there's no good stack for AMD GPUs. So why do you think you can make one for your chip? You can't, right? There's one other company, aside from NVIDIA, who's succeeded at all at making training chips. What company? [00:04:20]

Swyx: AMD? Intel? [00:04:22]

George: No, no, no. I've never trained. Who's trained a model on AMD or Intel? Cerebras. [00:04:26]

Swyx: Cerebras! [00:04:27]

George: I'm talking about, you might know some startups who trained models on these chips. [00:04:31]

Alessio: Oh, TPU. [00:04:32]

George: Exactly. Right? So Midjourney is trained on TPU, right? Like a lot of startups do actually train on TPUs. And they're the only other successful training chip, aside from NVIDIA. But what's unique about Google is that they also wrote their own ML framework, right? And if you can't write your own ML framework that is performant on NVIDIA, there's no way you're going to make it performant on your stuff. [00:04:53]

Alessio: And they started from TensorFlow and then they made the chip after. [00:04:56]

Swyx: Yeah, exactly. Exactly. [00:04:58]

George: And you have to do it in that direction. Otherwise, you're going to end up, you know, Cerebras, one of those things, a million... Has anyone ever seen a Cerebras? No one's ever like, oh, I trained my model on a Cerebras. Most people are like, I trained my model on GPUs. Some people, 20%, are like, I trained my model on TPUs. [00:05:14]

Alessio: And then the third one, which is the one that surprised me the most, is Turing completeness is harmful. It should be avoided. It made sense once I read it, but maybe tell us a bit more about how you got there. [00:05:25]

George: Okay. So CPUs devote tons of their silicon and power to things like reorder buffers and speculative execution and branch predictors. And the reason that you need all these things is because at compile time, you can't understand how the code's going to run. This is Rice's theorem. This is the halting problem and its limits. And this is not like, oh, the halting problem is theoretical. No, no, no, no. It's actually very real. Does this branch get taken or not? Well, it depends on X. Where does X come from? Yeah, forget it, right? But no branches depend on X in a neural net. Every branch is a static loop. Like if you're doing a matrix multiply, it's a static loop over the inner dimension. And neural networks are even better. No loads even depend on X, right? So with a GPU shader, right, your load might depend on which texture you're actually loading into RAM. But with a neural network, your load is, well, I load that way. Why? Well, because I load that way the other million times I ran the same net. Every single time you run the net, you do the exact same set of loads, stores, and arithmetic. The only thing that changes is the data. And this gives you a very powerful ability to optimize that you can't do with CPU-style things, which have branches, and even GPU-style things, which have loads and stores. Well, GPUs, if you want GPU-style stuff, you have like load based on X, you now need a cache hierarchy, and not an explicit cache hierarchy, an implicit cache hierarchy with eviction policies that are hard-coded into the CPU. You start doing all this stuff, and you're never going to get theoretically good performance. Again, I don't think there's a 100x here. Some startups will talk about 100x, and they'll talk about absolutely ridiculous things like clockless computing or analog computing. Okay, here, analog computing just won't work. And clockless computing, sure, it might work in theory, but your EDA tools are... Maybe AIs will be able to design clockless chips, but not humans. But what actually is practical is changing cache hierarchies and removing branch predictors and removing warp schedulers, right? GPUs spend tons of power on warp scheduling because we have to hide the latency from the memory. We don't have to hide the latency if everything's statically scheduled. [00:07:25]
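
A small illustrative sketch of the property George is describing: in a neural-net kernel like a matrix multiply, every loop bound and every load address is known before the program runs, so nothing depends on the data values themselves. The NumPy reference check and the function name are just for illustration.

```python
import numpy as np

def fused_matmul_add(A, B, C):
    """Every loop bound is fixed by the tensor shapes, and every load address
    depends only on the loop indices (i, j, k), never on the data values.
    That is the property that lets a compiler statically schedule everything."""
    M, K = A.shape
    _, N = B.shape
    out = np.empty((M, N), dtype=A.dtype)
    for i in range(M):              # static trip count
        for j in range(N):          # static trip count
            acc = 0.0
            for k in range(K):      # static reduction loop, no data-dependent branch
                acc += A[i, k] * B[k, j]
            out[i, j] = acc + C[i, j]
    return out

A, B, C = np.random.randn(4, 5), np.random.randn(5, 3), np.random.randn(4, 3)
assert np.allclose(fused_matmul_add(A, B, C), A @ B + C)
```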

Alessio: Why do you think people are still hanging on to Turing completeness? [00:07:27]

Swyx: Well, because it's really easy. [00:07:29]

George: Turing Complete is just really easy to just, oh, you know, it would just be so nice if I could do like an if statement here and actually branch the code, right? So it requires a lot more thought to do it without Turing Completeness. [00:07:41]

Swyx: And would this be qualitatively different than TPUs? [00:07:44]

George: So TPUs are a lot closer. Yeah. TPUs are a lot closer to what I'm talking about than like CUDA. Okay, so what is CUDA? Well, CUDA is a C-like language, which compiles to an LLVM-like IR, which compiles to PTX, which compiles to SASS, which are all Turing complete. TPUs are much more like this. Yeah. Their memory is pretty statically managed. They have a VLIW... I did some reverse engineering on the TPU, it's published in TinyGrad. It has like a VLIW instruction format, and it runs them. So it's similar. I think the TPUs have a few problems. I think systolic arrays are the wrong choice. I think they have systolic arrays because that was the guy's PhD, and then of course Amazon makes... [00:08:20]

Swyx: Could you summarize systolic arrays for us? [00:08:21]

George: Systolic arrays are just... okay, so basically you have like... it's a way to do matrix multiplication. Think of a grid of MACs, multiply-accumulate units, and then the grid can multiply, then shift, multiply, then shift, multiply, then shift. And they are very power efficient, but it becomes hard to schedule a lot of stuff on them if you're not doing like perfectly sized dense matrix multiplies, which you can argue, well, design your models to use perfectly sized dense matrix multiplies, sure. [00:08:47]
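
For readers who want the multiply-then-shift picture spelled out, here is a toy NumPy simulation of an output-stationary systolic array: each processing element owns one accumulator, A values stream in from the left, B values stream in from the top, and every cycle each PE multiplies its current pair and the values shift along. This is only a sketch of the idea, not how a real TPU is organized.

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy simulation of an output-stationary systolic array computing A @ B."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    acc = np.zeros((M, N))      # one accumulator per processing element (PE)
    a_reg = np.zeros((M, N))    # the A value currently sitting in each PE
    b_reg = np.zeros((M, N))    # the B value currently sitting in each PE
    for t in range(K + M + N):                  # enough cycles for the skewed data to drain
        a_reg[:, 1:] = a_reg[:, :-1].copy()     # shift A values one PE to the right
        b_reg[1:, :] = b_reg[:-1, :].copy()     # shift B values one PE down
        for i in range(M):                      # inject skewed A values at the left edge
            k = t - i
            a_reg[i, 0] = A[i, k] if 0 <= k < K else 0.0
        for j in range(N):                      # inject skewed B values at the top edge
            k = t - j
            b_reg[0, j] = B[k, j] if 0 <= k < K else 0.0
        acc += a_reg * b_reg                    # every PE does one multiply-accumulate per cycle
    return acc

A = np.random.randn(3, 5)
B = np.random.randn(5, 4)
assert np.allclose(systolic_matmul(A, B), A @ B)
```

The power efficiency comes from the data being passed neighbor to neighbor instead of going back to a register file, which is also why oddly shaped or sparse work schedules poorly onto the grid.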

Swyx: Thanks for indulging on these explanations. I think we need to keep our audience along with us by pausing every now and then to explain key terms. [00:08:56]

George: When I say explain a systolic array, I just immediately get a picture in my head of like tilting a matrix and shifting it. It's hard to kind of explain. Yeah. [00:09:04]

Swyx: Yeah. We'll do something. We'll do something. We'll have show notes. [00:09:08]

George: And we edit in visuals. Yeah, yeah, yeah. There's some great graphics that just show you, oh, so that's what a systolic array is. But it's a MAC-and-shift machine that looks kind of different from the typical ALU sort of machine. I think the right answer is something that looks more like queues that feed into ALUs, and then you can prefetch the loads from the memory, put them in a bunch of queues, and then the queue is just like, and feeds into another queue over here. But yeah, but that's not even the main problem with TPUs. The main problem with TPUs is that they're closed source. Not only is the chip closed source; okay, all of XLA is open source, but the XLA-to-TPU compiler is a 32 megabyte binary blob called libTPU on Google's cloud instances. It's all closed source. It's all hidden stuff. And you know, well, there's a reason Google made it closed source. Amazon made a clone of the TPU. It's called Inferentia. Or they have some other name for the training one: Trainium. Yeah, yeah, yeah. And look, it's a clone of the TPU. But Google's software at least kind of works. [00:09:58]

Alessio: So those are kind of like the three core pieces. The first thing you're working on, that you've been working on, is TinyGrad. And on one of your Twitch streams, you said it's the best thing you've ever written. [00:10:07]

Swyx: Yeah. [00:10:08]

Alessio: Tell us a bit more about that creation. [00:10:10]

George: For a long time, TinyGrad had a hard limit at a thousand lines of code. And what this would force you to do is really make sure you were not wasting lines. I got rid of the restriction because it became a little code-golfy at the end. But once the core framework of TinyGrad was there in those thousand lines, the ideas are expressed with no boilerplate. If you go read PyTorch, you know, PyTorch I think is actually pretty good code. I think Facebook's pretty good, but there's so much boilerplate. Go in PyTorch and try to track down how a GELU actually works. [00:10:44]

Swyx: Just a lot of instructions. [00:10:45]

George: Oh, you're going to be diving down a long stack from Python to C to custom libraries to dispatchers and beyond. And then I don't even know how to read TensorFlow. I don't even know where a GELU is in TensorFlow. [00:10:55]

Swyx: Nobody knows. [00:10:56]

George: Someone at Google knows maybe. Google as an organism knows. I don't know if anyone individual at Google knows. [00:11:02]

Alessio: What are like the important ergonomics like for a developer as you think about designing the TinyGrad API? [00:11:07]

George: So the TinyGrad front end looks very similar to PyTorch. There's an even higher level front end you can use for TinyGrad, which is just ONNX. We have better support for ONNX than Core ML does. And we're going to have, I think we're going to pass ONNX Runtime soon, too. People think ONNX Runtime, that's the gold standard for ONNX. No, you can do better. [00:11:23]

Swyx: Pass them in what, specifically? Test compliance tests. [00:11:26]

George: So ONNX has a big set of compliance tests that you can check out. And we have them running in TinyGrad, and there's some failures. We're below ONNX Runtime, but we're beyond Core ML. So that's where we are in ONNX support now. But we will pass ONNX Runtime soon because it becomes very easy to add ops, because you don't need to do anything at the lower levels. You just do it at this very high level, and TinyGrad compiles it to something that's fast using these minimal ops. Most concretely, what TinyGrad can do that PyTorch can't really do: say you have something like A times B plus C. If you write that in naive PyTorch, what it's going to do on the GPU is read A, read B in a kernel, and then store A times B in memory, and then launch another kernel to do A times B plus C. Okay, got to do those loads from memory. It's a whole extra round trip to memory that I just didn't have to do. And you're like, yeah, but you can use the Torch JIT, and it corrects this. Yeah, for that one example, for that one example of MUL/ACC, but, oh, now you did three multiplies? Six multiplies? It won't compile arbitrary code. [00:12:26]
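
A rough back-of-the-envelope sketch of the extra memory traffic George is describing for the unfused elementwise A*B+C; the tensor size is an arbitrary assumption, just to make the numbers concrete.

```python
# Unfused: kernel 1 reads A and B, writes a temporary; kernel 2 reads the
# temporary and C, writes the output. Fused: one kernel reads A, B, C and
# writes the output. For memory-bound elementwise ops, bytes moved is the cost.
n = 64 * 1024 * 1024      # number of float32 elements per tensor (assumption)
bytes_per = 4

unfused = (2 + 1) * n * bytes_per + (2 + 1) * n * bytes_per   # 6n values moved
fused = (3 + 1) * n * bytes_per                               # 4n values moved

print(f"unfused: {unfused / 1e9:.2f} GB, fused: {fused / 1e9:.2f} GB")
# The unfused version moves 1.5x the bytes, which is roughly the slowdown
# you'd expect for a bandwidth-bound kernel.
```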

Swyx: And have you looked into the other approaches like PyTorch Lightning to accelerate PyTorch itself? [00:12:32]

George: Well, PyTorch Lightning, my understanding is, it's mostly a framework around PyTorch, right? PyTorch Lightning is not going to fix this fundamental problem of I multiply six tensors together. It's not going to fix the trips to memory down to a single read from each and a single write to the output. There are lower level things in PyTorch that are, I'm not exactly sure what Dynamo does, but I know they're generating some Triton stuff, which is going to generate the kernels on the fly. But, you know, PyTorch Lightning is at a higher level of abstraction. So TinyGrad's front-end stuff looks like PyTorch. I made a few tweaks. There's a few things I don't like about PyTorch. Why is ReLU a class? Really, what's the state? You make a class, and there's state. Everything should just be functional, and ReLU should just be .relu() on the tensor. There are things in Torch where you have to do torch.something and not tensor.something. It just shows an API that's not perfectly refined. But when you're doing stuff TinyGrad style, where you don't have lines to spare, well, it has to work this way. Because even the lines to express the... well, you can't use the where operator in PyTorch. Why is it true case, condition, false case? Ugh, that's how Python expresses ifs. It's disgusting. Ternary operators are much nicer. It should be: I can do, like, (a < 0).where(a, 1), right? [00:13:46]
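
To make the API complaint concrete, here is a tiny comparison; the tinygrad-style call is written the way George describes it, so treat the exact signature as illustrative rather than authoritative.

```python
import torch

a = torch.randn(4)

# Python's conditional expression puts the condition in the middle:
#     true_case if condition else false_case
# PyTorch's function-style ternary is torch.where(condition, x, y):
out = torch.where(a < 0, a, torch.ones_like(a))

# The method-chaining style George prefers reads left to right,
# roughly: (a < 0).where(a, 1)
```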

Swyx: The very pandas-like API? [00:13:50]

George: It looks like Torch, NumPy, pandas. They're all very similar. I tried to take the cleanest subset of them and express them. But like I said, you can also interact with it using ONNX. I have a rewrite of StableDiffusion, I have a rewrite of Llama, I have a rewrite of Whisper. You can look at them. They're shorter than the Torch versions, and I think they're cleaner. And you stream them all? [00:14:05]

Swyx: Yeah. Very nice. [00:14:07]

Alessio: So what's the other important concept that you're leveraging to do operation fusing? [00:14:11]

George: Yeah, you have basically a few different models. The simplest one is eager: as soon as the interpreter sees A times B, it actually dispatches A times B, right? Then you have graph mode, like TensorFlow, which will put A times B into a graph, and then will do absolutely nothing until you actually compile the graph at the end. I like this third choice, which is somewhere in the middle: laziness. Laziness is you don't know when the ops are going to dispatch, and don't worry about that. You don't have to worry about this as a programmer, you just write out all your stuff. And then when you actually type `.numpy`, it'll be ready by the time you copy the thing back to CPU. Or you can do `.realize`, and it will actually like force that tensor to be allocated in RAM. And if you think about it, PyTorch is kind of lazy in a way, but they didn't extend the paradigm far enough, right? When I do A times B in PyTorch, it's going to launch a CUDA kernel to do A times B. But it's not going to wait for that CUDA kernel to complete. So you're getting the worst of both worlds. You're getting the same laziness, but you also can't get fusion, because PyTorch doesn't know that I'm then going to do plus C. There's no way for it to be like, whoa, whoa, whoa, don't launch that CUDA kernel. Whoa, just do this one too. Right? Again, PyTorch is working on this, and it's a little bit harder. At Comma, I felt like I was competing against a lot of idiots. Here, I'm competing against smart, very smart people who've made some, I think, different trade-offs. Whereas, if you're trying to build something that is just straight up good on NVIDIA, and you have a lot of people and complexity to throw at it, yeah, PyTorch made a lot of the right choices. I'm trying to build something that manages complexity. You can always make your software do more. The magic is when you can make your software do more without adding complexity, right? Because complex things eventually collapse under their own weight, so it's kind of... [00:15:58]
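
A minimal sketch of the lazy model George describes, written against tinygrad-style names (Tensor, .realize(), .numpy()); the exact API may differ between versions, so treat it as illustrative.

```python
from tinygrad.tensor import Tensor

a = Tensor.rand(1024, 1024)
b = Tensor.rand(1024, 1024)
c = Tensor.rand(1024, 1024)

out = a * b + c       # nothing is dispatched yet; this just records the ops
out = out.realize()   # now a single fused A*B+C kernel can be scheduled and run
print(out.numpy())    # .numpy() would also have forced realization and the copy back to CPU
```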

Alessio: How does fusing actually work? [00:16:00]

George: There's this thing called lazy.py, and when you do A times B, that's... It's put into a graph, but it's a very local graph. There's no global graph optimizations. And even this can change, right? Again, the programming model for TinyGrad does not preclude eagerness, right? Laziness is not guaranteed laziness. It's just going to try its best. So you put in A times B, and that's a binary op, right? And then you put in A times B, that's a node in the graph. It's a virtual node because it's not realized yet, plus C. Okay, here's a new node, which takes the C tensor in here and takes the output of A times B. It's like, whoa, there's two binary ops. Okay, we'll just fuse those together. Okay, here I have a kernel. This kernel has A, B, and C as inputs. It does A times B plus C in the local registers, and then outputs that to memory. And you can GRAPH=1 in TinyGrad. There are two amazing things that TinyGrad has that I've not seen in any other framework. GRAPH=1, which is an environment variable: it will output a complete graph of all the operations. Other people are like, oh, you can use PyTorch, export it to ONNX, and use Netron. Yeah, you can. Like, what? That's not what actually ran. GRAPH=1 will show you the actual kernels that were dispatched to the GPU. You can also type DEBUG=2, which will print those kernels out in your command line, and it will tell you the exact number of flops and the exact number of memory accesses in each kernel. So you can immediately see, wait a second, okay, this kernel used this many flops. This was the gigaflops. This is how many bytes it read, and this was the gigabytes per second. And then you can profile without having to, like... okay, I mean, in theory, in PyTorch, sure, use the NVIDIA Nsight profiler. No one does that. No one does, of course, because it's so difficult, right? Like, actually, NVIDIA used to, I think CUDA 9 was the last one that had it. They had a command line one, but now it's like, okay, I'm going to generate this blob, use this NVIDIA GUI tool to convert it into a Chrome trace, and then load it. Yeah, no one does that, right? Just type DEBUG=2 with any TinyGrad model, and it will show you all the kernels that it launches and the efficiency of each kernel, basically. [00:17:58]
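
A hedged sketch of how those environment variables are used in practice; setting them on the command line (e.g. DEBUG=2 python3 script.py) is the usual route, and the exact output format can vary by tinygrad version.

```python
import os
# tinygrad reads these from the environment, so set them before importing it
os.environ["DEBUG"] = "2"   # print each launched kernel with FLOP counts and memory traffic
os.environ["GRAPH"] = "1"   # dump a graph of the kernels that were actually dispatched

from tinygrad.tensor import Tensor

x = Tensor.rand(512, 512)
y = (x * 2 + 1).realize()   # the fused kernel and its GFLOPS / GB/s stats show up in the log
```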

Swyx: Yeah, this is something that John Carmack has often commented about, is that when you code, you need to build in your instrumentation or observability right into that. I wonder if whatever John is working on, he's adopting this style, and maybe we can sort of encourage it by, I don't know, naming it and coining a certain kind of debugging style? [00:18:16]

George: If he would like to start contributing to TinyGrad, I'd be so happy. [00:18:19]

Swyx: You should hook up with them. [00:18:22]

George: I've chatted with them a few times. I'm not really sure what his company's doing, but no, I mean, hopefully we get TinyGrad to a point where people actually want to start using it. So TinyGrad right now is uncompetitive on NVIDIA, and it's uncompetitive on x86. [00:18:36]

Swyx: And specifically, what do you care about when you say uncompetitive? Speed. [00:18:39]

George: Sure, speed. It's correct. The correctness is there. The correctness for both forwards and backwards passes is there. But on NVIDIA, it's about 5x slower than PyTorch right now. Like 5x, wow, this is insurmountable. No, there's reasons it's 5x slower, and I can go through how we're going to make it faster. It could be 100x slower, so we're making progress. But there's one place where it actually is competitive, and that's Qualcomm GPUs. So TinyGrad is used to run the model in OpenPilot. Like right now, it's been live in production now for six months. And TinyGrad is about 2x faster on the GPU than Qualcomm's library. [00:19:10]

Swyx: What about Qualcomm architecture? [00:19:12]

George: What makes it doable? Well, because the world has spent how many millions of man hours to make NVIDIA fast? And Qualcomm has a team of 10 Qualcomm engineers? Okay, well, who can I beat here? What I propose with TinyGrad is that developer efficiency is much higher. But even if I have 10x higher developer efficiency, I still lose on NVIDIA, right? You know, okay, I didn't put 100,000 man hours into it, right? If they put a million, like, that's what I'm saying. But that's what I'm saying we can get. And we are going to close this speed gap a lot. Like I don't support Tensor Cores yet. That's a big one that's just going to, okay, massively close the gap. And then AMD. I don't even have a benchmark for AMD because I couldn't get it compiled. Oh, and I tried. Oh, I tried. I spent a day. Like, I spent actually a day trying to get PyTorch. And I got it built. I got it kind of working, then I tried to run a model, like, there's all kinds of weird errors and the rabbit holes are so deep on this. I'm like, you know, you can compare the speed. Right now, you can run LLaMA, you can run anything you want on AMD. It already all works. Any OpenCL backend works, and it's not terribly slow. I mean, it's a lot faster than crashing. So it's infinitely faster than PyTorch on AMD. But pretty soon, we're going to start getting close to theoretical maximums on AMD. That's really where I'm pushing. And I want to get AMD on MLPerf in a couple months, hopefully. [00:20:26]

Swyx: Now that you bring up AMD. [00:20:27]

Alessio: Yeah, let's dive into that. Because when you announced the Semicore fundraise, you mentioned one of your first goals is like build the framework, runtime and driver for AMD. And then on June 3rd on Twitch, you weren't as excited about AMD anymore. Maybe let's talk a bit about that. You compared the quality of commit messages from the AMD kernel to the Intel work that people are doing there. What's important to know? [00:20:51]

George: When I said I want to write a framework, I never intended on writing a kernel driver. I mean, I flirted with that idea briefly, but realistically, there's three parts to it, right? There's the ML framework, there's the driver, and then there's the user space runtime. I was even down to rewrite the user space runtime. I have a GitHub repo called cuda_ioctl_sniffer. It's terribly named. But you can actually launch a CUDA kernel without CUDA. So you don't need CUDA installed. Just the NVIDIA open source driver and this open source repo can launch a CUDA kernel. So rewriting the user space runtime is doable. Rewriting the kernel driver? [00:21:26]

Swyx: I don't even have docs. [00:21:27]

George: I don't have any docs for the GPU. Like it would just be a massive reverse engineering project. I wasn't complaining about it being slow. I wasn't complaining about PyTorch not compiling. I was complaining about the thing crashing my entire computer. It panics my kernel. And I have to wait five minutes while it reboots because it's a server motherboard and they take five minutes to reboot. So I was like, look, if you guys do not care enough to get me a decent kernel driver, there's no way I'm wasting my time on this, especially when I can use Intel GPUs. Intel GPUs have a stable kernel driver and they have all their hardware documented. You can go and you can find all the register docs on Intel GPUs. So I'm like, why don't I just use these? Now, there's a downside to them. Their GPU is $350. You're like, what a deal. [00:22:03]

Swyx: It's $350. [00:22:04]

George: You know, you get about $350 worth of performance. And if you're paying about $400 for the PCIe slot to put it in, right, like between the power and all the other stuff, you're like, okay, nevermind. You got to use NVIDIA or AMD from that perspective. But I sent an email to Lisa Su. She responded. [00:22:19]

Swyx: Oh. [00:22:20]

George: And I've had a few calls since. And like, what I tried to do, first off, like, thank you for responding. It shows me that like, if you don't care about your kernel panicking, I can't, like, this is just a huge waste of my time, right? I'll find someone who will care. I'm not asking for your seven by seven Winograd transposed convolution to be fast. Like, I'm not asking for that. I'm asking literally for- The basics of getting it running. Oh, and this isn't TinyGrad. This is your demo apps. I ran their demo apps in loops, and I got kernel panics. I'm like, no, okay. No, Lisa Su reached out, connected with a whole bunch of different people. They sent me a pre-release version of ROCm 5.6. They told me you can't release it, which I'm like, guys, why do you care? But they say they're going to release it by the end of the month, and it fixed the kernel panic. The guy managed to reproduce it with the two GPUs and the computer, and yeah, sent me a driver, and it works. I had that experience, and then I had another experience where I had two calls with, like, AMD's, like, communication people. I was just like, I tried to explain to these people, like, open source culture. Like, it's not open source if you dump the source code on a GitHub repo and then forget about it until the next release. It's not open source if all your issues are from 2022. Like, it's just no one's going to contribute to that project, right? Sure, it's open source in a very, like, technical sense. To be fair, it's better than nothing. It's better than nothing, but compare it to a bug I fixed in NCCL. There's a fun fact, by the way. If you have a consumer NVIDIA GPU, they don't support peer-to-peer, and their all-reduce bandwidth is horrendously slow because it's using CUDA kernels to do the copy between the GPUs, and it's putting so many transactions on the PCIe bus that it's really slow. But you can use CUDA memcpy, and there's a flag to use CUDA memcpy, but that flag had a bug. I posted the issue on NCCL. I expected nothing to happen. The NVIDIA guy replied to me within an hour. He's like, try this other flag. I'm like, okay, I tried the other flag. It still doesn't work, but here's a clean repro. And I spent, like, three hours writing a very clean repro. I ended up tracking the issue down myself, but just the fact that somebody responded to me within an hour and cared about fixing the issue? Okay, you've shown that it's worth my time, and I will put my time in because, like, let's make this better. Like, I'm here to help. But if you show me that, you know, kernel panics are just, like, expected? Okay. [00:24:36]

Swyx: Well, it sounds like AMD is getting the message. [00:24:38]

George: They are. And I just, I don't really think they've had someone explain to them, like, like, I was like, you can, like, build in public. And they're like, what's an example of building in public? I'm like, go look at PyTorch. Go look at PyTorch. I have two minor things merged into PyTorch because it's very responsive, you know? [00:24:53]

Alessio: So that's kind of like the lowest level of the stack. And then at a slightly higher level, obviously, there's TinyGrad, there's Mojo, there's ggml. How are you thinking about breadth versus, like, depth? Like, where you decided to focus early on? [00:25:06]

George: So ggml is very much like a, okay, everyone has M1s, right? Actually, I was thinking, in the beginning, I was thinking of something more like ggml, focused on the M1s. But ggml showed up and was just like, we're actually just focusing on the M1s. And actually, M1 PyTorch is considerably better than AMD PyTorch. M1 PyTorch works, it only gives wrong answers sometimes, and it only crashes sometimes. But, like, some models kind of run. When I was writing the metal backend, I was comparing to MPS PyTorch, and I had, like, a discrepancy. TinyGrad checks all its outputs compared to Torch, and I had one where it didn't match. I'm like, I checked the matrix by hand, it matches TinyGrad, I don't understand. And then I switched PyTorch back to CPU, and it matched. I'm like, oh. Well, there's, like, bugs, like, if you, like, transpose the matrix, because, like, I think it has to do with, like, multi-views in PyTorch, and, like, weird under-the-hood stuff that's not exposed to you, like, there's bugs. And maybe they fixed them, but, like, you know, it seems like there was a lot of momentum. Again, because you're getting how many engineers care about making PyTorch work on M1, right? Thousands, tens of thousands. And you have an open development process, and guess what? It's going to be good. How many engineers care about AMD working, PyTorch AMD working? Well, you got 10 guys that work for AMD, and then, like, a couple hobbyists. [00:26:15]

Swyx: You revealed an interesting detail about how you debug. You hand-check the matrix math? No, I don't hand-check it. [00:26:20]

George: One of the best tests in TinyGrad is a file called testops.py. And it's just a hundred small examples written in TinyGrad and PyTorch, and it checks both the forwards and backwards to make sure they match. [00:26:34]
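
In the spirit of what George describes (tinygrad's repo has a test file along these lines), here is a minimal hedged sketch of checking one op's forward and backward passes against PyTorch; the names and tolerances are illustrative, not the actual test code.

```python
import numpy as np
import torch
from tinygrad.tensor import Tensor

def check_op_matches(np_a, np_b, atol=1e-4):
    # forward + backward in PyTorch
    ta = torch.tensor(np_a, requires_grad=True)
    tb = torch.tensor(np_b, requires_grad=True)
    t_out = (ta * tb).relu().sum()
    t_out.backward()

    # the same op in tinygrad
    ga = Tensor(np_a, requires_grad=True)
    gb = Tensor(np_b, requires_grad=True)
    g_out = (ga * gb).relu().sum()
    g_out.backward()

    # both the forward value and the gradients should agree
    np.testing.assert_allclose(t_out.detach().numpy(), g_out.numpy(), atol=atol)
    np.testing.assert_allclose(ta.grad.numpy(), ga.grad.numpy(), atol=atol)
    np.testing.assert_allclose(tb.grad.numpy(), gb.grad.numpy(), atol=atol)

check_op_matches(np.random.randn(8, 8).astype(np.float32),
                 np.random.randn(8, 8).astype(np.float32))
```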

Swyx: Good test suite. Yeah. Very important. [00:26:35]

George: That's, I mean, that's one of them where, like, I really, I put a lot of effort into CI for TinyGrad. I think CI is super important. Like, I want that green check to mean I can merge this, right? Like, I don't want my tests to, and if the green check, if you somehow manage to introduce a bug and get the green check, okay, we're fixing the test, top priority. [00:26:51]

Swyx: Mojo? [00:26:52]

George: It's closed source. No, I'm not that interested. Do you know what I mean? Like, look, I like Chris Lattner. I think he's going to do great things, and I understand the, like, kind of the wisdom, even, in keeping it closed source. But, you know, I'm interested when it's open. [00:27:05]

Swyx: Yeah. You have an interesting design deviation from him, because he's decided to be a, well, promised to be a superset of Python, and you have decided to break with PyTorch APIs. And I think that affects learnability and transportability of code. [00:27:18]

George: You know, if the PyTorch thing ends up being, like, a stumbling block, I could write a perfect PyTorch front end. Instead of, like, yeah, import torch, you type import tinytorch as torch. And if that really becomes the stumbling block, I will do that. No, Chris Lattner went much further than PyTorch. Replicating the PyTorch API is something I can do with a couple of, you know, engineer-months. [00:27:44]

Swyx: A shim. [00:27:44]

George: Right, like a shim, yeah. Replicating Python? [00:27:47]

Swyx: Hoo-hoo-hoo! [00:27:48]

George: There's a big graveyard of those projects. How's Pyston going? How's Jython? [00:27:57]

Swyx: PyPy? Oh, you can go way back. [00:27:59]

Alessio: So your core mission is commoditizing the petaflop. And then your business goal is to sell computers for more than the cost to make, which seems super reasonable. And you're going to have three tiny boxes? [00:28:11]

Swyx: Red, green, blue? No, no, no, no, no, no, no. [00:28:13]

George: That was my... Look, you know, a lot of people, like, I love, you know, leaning into, like, saying I'm giving up, right? It's great to give up, right? Giving up is this wonderful thing. It's so liberating. And then, like, you can decide afterward if you really give up or not. There's very little harm in saying you give up, except, like, you know, great, Twitter haters have something to talk about, and all press is good press, kids, so... Just red, only red. [00:28:32]

Swyx: Tiny box, red. Tiny box, red. [00:28:34]

George: Unless AMD, you know, upsets me again, and then we're back to other colors. We have other colors to choose from. [00:28:41]

Alessio: When you think about hardware design, what are some of the numbers you look for? So, teraflops per second is one, but, like, memory bandwidth is another big limiter. Like, how do you make those trade-offs? [00:28:52]

George: Well, I mean, fundamentally, I'm limited to what GPUs I can buy. But, yeah, for something that I think a lot of people are going to want to reasonably do, with, um... A coworker of mine described them as luxury AI computers. Right? Like, luxury AI computers for people. And that's, like, what we're building. And I think a common thing people are going to want to do is run, like, Large Llama. Right? Or Large, like, Falcon or whatever. [00:29:13]

Swyx: FP16 LLaMA. [00:29:14]

George: FP16, exactly. Exactly. Um, you know, int8, I think, can work. I think that, like, what ggml is doing to go to, like, int4... like, this doesn't work. Like, have you done... I mean, maybe they have. But, like, I read what it was, and I was like, this isn't from any paper. This is just some... Squeezing as much as possible. Yeah, you made up some quantization standards to make it run fast. And, like, maybe it works. But, okay, where's, like, the HellaSwag number? Right? Where's your, uh... [00:29:38]

Swyx: The thesis is right. That, like, if you have hundreds of billions of parameters, that the individual quantization doesn't actually matter that much. [00:29:44]

George: Well, the real way to look at all of that is to just say you want to compress the weights, right? It's a form of weight compression. Quantization is a form of weight compression, right? Now, this is obviously not lossless. It's not a lossless compressor, right? If it's a lossless compressor, and you can show that it's correct, then, okay, we don't have to have any other conversation. But it's a lossy compressor. And how do you know that your loss isn't actually losing the power of the model? Maybe int4 65B llama is actually the same as FP16 7B llama, right? We don't know. Maybe someone has done this by now, but I looked for it when it, like, first came out and people were talking about it. And I'm like, it's not from a paper, right? The int8 stuff is from a paper where they... Like, some of the int8 stuff is from a paper. There's one paper, I think it's, like, int8... LLM.int8(), where they actually do all the tests. And they didn't go fully int8. They made, like, 90% of it int8 and kept, like, 10% of it in FP16 for what they called, like, the outliers or whatever. So I think that this is not quite so easy. [00:30:37]

Swyx: And I think being able... [00:30:38]

George: Well, so first off, if you're training, no one's gotten training to work with int8 yet. There's a few papers that vaguely show it. But if you're training, you're going to need BF16 or float16. So this is why I target that. Now, the thing that you're going to want to do is run these large language models out of the box on your hardware in FP16, and that's memory bandwidth. So you need large amounts of memory bandwidth, too. So ask how I trade off memory bandwidth and FLOPS; so, what GPUs can I buy? [00:31:02]
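
The memory-bandwidth point in rough numbers (the 1 TB/s aggregate bandwidth is an assumption for illustration): generating one token in FP16 means streaming essentially all of the weights once, so bandwidth, not FLOPS, sets the ceiling.

```python
params = 65e9            # e.g. a 65B-parameter model
bytes_per_param = 2      # fp16
weights_bytes = params * bytes_per_param        # ~130 GB of weights to read per token

bandwidth = 1.0e12       # assume ~1 TB/s of aggregate memory bandwidth
tokens_per_s = bandwidth / weights_bytes
print(f"~{tokens_per_s:.1f} tokens/s upper bound")   # roughly 7-8 tokens/s
```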

Alessio: So first of all, you have this hiring process, which is you've got to solve one of the bounties that are open on TinyGrad. There's no technical interview. One of them is int8 support. Do you already have some things you want to test on? [00:31:14]

Swyx: We have int8 support. What I'd like to see somebody do [00:31:16]

George: is just load the ggml int8 llama into TinyGrad and then benchmark it against the FP16 one. Int8 already works in TinyGrad. It doesn't actually do the math in int8. It does all the math still in FP32. So int8 can mean you just have your weights in int8, or int8 can mean you actually do your math in int8. And rather than doing your math in int8, the big gain that people care about is actually having your weights in int8, because weights in int8 mean less memory and less memory bandwidth, whereas the math, keep it in FP32. On M1s, it doesn't matter what data type you're doing in the GPU. I'm not even sure it can do int8, but FP16 and FP32 is the same teraflops. So yeah, no, that's one of the bounties. One of the bounties is get int8 llama running [00:31:58]

Swyx: with the int8 weights. [00:32:00]

George: And then actually, what you could even do, if you really want to test this, just take the FP16 weights, convert them to int8, then convert them back to FP16, then compare the unconverted and converted. [00:32:10]

Swyx: Oh, that's a nice hack. Oh, yeah. Right, like- This should be lossless in the other direction. Yeah, I think FP16, [00:32:17]

George: it should be lossless in the other direction. I'm actually not 100% about that. Why not? Oh, because like, you ever try to like, like if you want to represent, if it was like int16, it's not lossless. [00:32:25]

Swyx: Sure. [00:32:26]

George: All of int8 can be represented in FP16, but I'm not 100% sure about that. [00:32:29]

Swyx: Just drop the bytes. We just have to do it, right? [00:32:32]

George: Just literally do it. There's only 256 values to check, like. But yeah, either way, or I mean, int4, definitely. So do your int4, convert it back, and now see, even with int4 weights and FP32 math, like, okay, how much has this degraded the model's performance? [00:32:47]
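
A hedged sketch of the round-trip check George describes, using a simple symmetric per-tensor scheme; a real benchmark would use the actual ggml quantization and then score something like HellaSwag, so this only shows the conversion step and the raw weight error. (Incidentally, every int8 value is exactly representable in FP16, so the int8-to-FP16 direction itself is lossless; the loss comes from the original quantization step.)

```python
import numpy as np

def round_trip(w_fp16, bits=8):
    """Quantize fp16 weights to a signed integer grid, then dequantize back to fp16."""
    qmax = 2 ** (bits - 1) - 1                       # 127 for int8, 7 for int4
    w32 = w_fp16.astype(np.float32)
    scale = np.abs(w32).max() / qmax
    q = np.clip(np.round(w32 / scale), -qmax, qmax)  # the integer weights you would store
    return (q * scale).astype(np.float16)            # back to fp16 for the comparison

w = np.random.randn(4096, 4096).astype(np.float16)
for bits in (8, 4):
    err = np.abs(w.astype(np.float32) - round_trip(w, bits).astype(np.float32))
    print(f"int{bits}: max abs error {err.max():.4f}, mean abs error {err.mean():.5f}")
```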

Alessio: I think like the, you're planning to release the first tiny box, ship them in like two to six, eight months, something like that. What's top of mind for you in terms of building a team? Who should, who are you calling for? [00:32:59]

George: So, say the GPUs are picked out and you're like, well, I could make that computer with the GPUs. And my answer is, can you? Do you know how hard it is to put six GPUs in a computer? And people think it's really easy. And it's really easy to put one GPU in a computer. It's really easy to put two GPUs in a computer, but now you want to put in eight. Okay, so I'll tell you a few things about these GPUs. They take up four slots. You can buy the nicest Supermicro. You can't put eight of those in there. You need two-slot blowers. [00:33:25]

Swyx: If you want to use one of those, [00:33:25]

George: those 4U Supermicros, you need two-slot blowers or water cooling, right? If you're trying to get the four-slot cards in there, you're going to need some form of water cooling. There are some like Chinese 4090s that are blowers, right? You need blowers or water cooling if you're trying to get them in those things, right? [00:33:37]

Swyx: So are you doing water? [00:33:39]

George: No, I'm not using that chassis. Okay, so now you want to get six GPUs in a computer. So that's a big challenge. You're like, oh, I'll just use PCIe extenders, I saw it on Linus Tech Tips, it works great. No, it doesn't. Try to find PCIe extenders that work at PCIe 4.0; interconnect bandwidth is super important. They only work at 3.0. No PCIe extender I've tested, and I've bought 20 of them, works at PCIe 4.0. So you're going to need PCIe re-drivers. Now, okay, how much cost is that adding, right? Like these things all get really hard. And then for the tiny box, I've even added another constraint to it. I want this thing to be silent, not totally silent, but my limit is like 45, maybe 50 dB; the Supermicro machine is 60 dB. We have a small, we have a compute cluster at Comma. You gotta wear ear protection to go in there. Like- [00:34:24]

Swyx: Yeah, I've seen some videos where you give a tour. Oh yeah. It's noisy. It's super loud. [00:34:28]

George: You got all these machines just screaming. All those, like if you have a blower, what is that thing? 10,000 RPM, just screaming. Like I want to be able to use the normal big GPU fans and make this thing so it can sit under your desk, plug into one outlet of power, right? Six GPUs, your GPUs are 350 Watts each. Can't plug that into a wall outlet. Okay, so how are you going to deal with that? Good questions, right? [00:34:51]

Swyx: And you're not sharing them. [00:34:52]

George: Well, that one, I mean, that one is pretty obvious. You have to limit the power on the GPUs, right? You have to limit the power on the GPUs. Now you can limit power on GPUs and still get, you can use like half the power and get 80% of the performance. This is a known fact about GPUs, but like that's one of my design constraints. So when you start to add all these design constraints, good luck building a tiny box yourself. Obviously it can be done, but you need something that has actually quite a bit of scale and resources to do it. [00:35:15]
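
The wall-outlet constraint as plain arithmetic (a standard US 15A/120V circuit is assumed; the "half the power, ~80% of the performance" figure is the one George quotes):

```python
gpus, tdp_w = 6, 350
full_power = gpus * tdp_w          # 2100 W just for the GPUs
outlet_w = 15 * 120                # 1800 W on a standard US 15 A / 120 V circuit
limited = gpus * tdp_w * 0.5       # ~1050 W if each GPU is power-limited to half

print(full_power, outlet_w, limited)   # 2100 > 1800, so power limiting is required
```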

Alessio: And you see like the, under the desk, it's like one of the main use cases, kind of like individual developer use or. [00:35:21]

George: Yeah, what I also see is more of a, like an AI hub for your home, right? As we start to get like home robotics kind of stuff, you don't want to put the inference on the robot, but you also don't want to put the inference on the cloud. Well, you don't want to put it on the robot because, okay, it's 1500 Watts, tiny box. You'll put batteries and charge them, bad idea. Just wireless. Wireless is 0.5 milliseconds, right? This is super fast. You don't want to go to the cloud for two reasons. One, cloud's far away. Okay, it's not that far away. You can kind of address this. But two, cloud's also mad expensive. Like cloud GPUs are way more expensive than running that GPU at your house. At least any rates you're going to get, right? Maybe if you commit to buy, well, yeah, I'm going to buy 10,000 GPUs for three years, then maybe the cloud will give you a good rate. But like, you want to buy one GPU in the cloud? I mean, okay, you can go to like vast, but like if you're going on Azure AWS, so that's expensive. [00:36:12]

Swyx: This is like a personal data center instead of a cloud data center. [00:36:16]

George: We like the term compute cluster. So we can use NVIDIA GPUs. [00:36:20]

Swyx: Yeah, data centers may be a little bit dated. It's a compute cluster, [00:36:23]

George: which is totally legal under the CUDA license agreement. [00:36:26]

Swyx: You talk a lot about the PCIe connection. Do you think there's any fat there to trim? What do you mean? You're limited by bandwidth. [00:36:32]

George: Okay, for some things, yes. So bandwidth is roughly 10x less than what you can get with NVLinked A100s, right? NVLinked A100s are going to have, and then you can even get like full fabric and NVIDIA really pushes on that stuff, 600 gigabytes per second, right? And PCIe 4.0, you're going to get 60, right? So you're getting 10x less. That said, why do you need the bandwidth, right? And the answer is you need it for training huge models. If you're training on a tiny box, your limit's going to be about 7 billion. If you're training on big stuff, your limit's going to be like 70 billion, right? Okay, you can hack it to get a bit higher. You can hack it, like GPT hacked it to get a bit higher, but like that 65 billion in LLaMA, like there's a reason they chose 65 billion, right? And that's what can reasonably fit model parallel on a GPU, right? So yes, you are going to end up training models. The cap's going to be like 7 billion, but I actually heard this on your podcast. I don't think that the best chatbot models are going to be the big ones. I think the best chatbot models are going to be the ones where you had a thousand training runs instead of one. And I don't think that the interconnect bandwidth is going to matter that much. [00:37:33]

Swyx: So what are we optimizing for instead of compute optimal? What do you mean compute optimal? You're talking about this, the LLAMA style models where you train for like 200x. You train longer, yeah. [00:37:41]

George: Yeah, yeah, yeah. You can always make your model better by doing one of two things, right? And at Comma, we just have a strict limit on it. You can always make your model better by training longer, and you can always make your model better by making it bigger. But these aren't the interesting ones, right? Particularly the making it bigger, because training it longer, fine. You're getting a better set of weights. The inference is the same. The inference is the same whether I trained it for a day or a week. Okay, if it's 1 billion versus 10 billion, well, I 10x my inference cost too, right? So I think that these big models are kind of, sure, they're great if you're a research lab and you're trying to like max out this hypothetical thing. [00:38:13]

Swyx: Which you can talk about later. Yeah, yeah, yeah. [00:38:15]

George: But if you're like a startup or you're like an individual or you're trying to deploy this to the edge anywhere, you don't need that many weights. [00:38:22]

Swyx: Yeah, yeah. You actually don't want that many weights. Optimizing for inference rather than capabilities doing benchmarks. Yes. [00:38:29]

George: And I think the inference thing, right? There's gonna be so much more. Right now, the ratio between like training and inference on clouds, I think it's only still, I think it's like two or three X, right? It's two or three X more inference, which doesn't make any sense. It's way more inference. [00:38:41]

Swyx: Yeah. [00:38:42]

George: There should be 10 to 100 X more inference in the world than training. But then also like, what is training, right? You start to see these things like LoRa, like it's kind of blurring the lines between inference and training. And I think that that blurred line is actually really good. I'd like to see much more like on-device training or on-device fine tuning of the final layer. We're pushing toward this stuff at Comma, right? Like why am I shipping a fixed model? I totally want this model to fine tune based on like how your left tire is flat, right? Every time you cut the same turn because your left tire is flat, well, it should learn that, right? [00:39:11]

Swyx: So would Comma pursue parameter efficient fine tuning? Yeah. [00:39:16]

George: We're looking into stuff like that. I mean, Comma is already very parameter efficient because we have to like run this thing in a car and you have to like cool it and power it. [00:39:22]

Alessio: And so this kind of like intelligence cluster you have in your home, you see when the person is using third-party model, they load them locally and kind of do the final fine tuning. It kind of stays within the box. [00:39:33]

George: I think that that's one version of it for the privacy conscious. I also see a world where you can have your tiny box, in its down cycles, mine FLOPcoin, right? You know, it turns out not all crypto is a scam. [00:39:45]

Swyx: There's one way to tell if crypto is a scam. [00:39:46]

George: If they're selling the coin before they make the product, [00:39:49]

Swyx: it's a scam. [00:39:49]

George: If they have the product and then they sell the coin, it's maybe not a scam, right? So yeah, my thought is like each tiny box would let you, would have a private key on it. And you have to do it this way. You can't just let anyone join because of Sybil attacks, right? [00:40:01]

Swyx: There's a real problem of like, [00:40:01]

George: how do I ensure your data is correct? And the way that I ensure your data is correct on the tiny net is if you ever send wrong data, you're banned from the network for life. [00:40:08]

Swyx: Yeah. [00:40:09]

George: Your $15,000 hardware box is banned. [00:40:11]

Swyx: So, you know, don't cheat. [00:40:11]

George: Obviously if it messes up, we'll forgive you. [00:40:14]

Swyx: Somebody's going to try to jailbreak your devices. There's no jailbreak. [00:40:17]

George: There's no jailbreak. [00:40:18]

Swyx: It's just a different network. [00:40:19]

George: Well, there's just a private key on each device, right? Like if you buy a tiny box from the tiny corp, [00:40:23]

Swyx: I give you a private key. [00:40:23]

George: It's in my backend server, right? You want to hack my server, that's illegal. Anything you want to do on the device, the device is yours. My server's mine, right? [00:40:29]

Swyx: Yeah. Have you looked into like a federated training at all? [00:40:33]

George: Okay. There's orders of magnitude of federated training. You mean like over the cloud and stuff? [00:40:37]

Swyx: Over the internet? Yeah. Over the internet, but also distributed on a bunch of devices, right? [00:40:41]

George: Yeah, I'm very bearish on this stuff. Because your interconnect bandwidth, right? So, okay. At the high end, you have your interconnect bandwidth of NVLink, which is 600 gigabytes per second, right? The tiny box has 60 gigabytes per second. And then your internet has 125 megabytes per second, right? Not gigabits, 125 megabytes, right? So, okay. That's how many orders of magnitude we're talking here? Like from 60 gigabytes down to 125 megabytes? Like, all right, that's over a hundred X. That's 400x, right? So like, what you can do is inference, right? Like there's, for inference, you don't care, right? For inference, there's so little bandwidth at the top and the bottom of the model that like, yeah, you can do federated inference, right? And that's kind of what I'm talking about. There's also interesting things to push into, like you're like, but okay, what if you want to run closed source models? This stuff gets kind of interesting, like using TPMs on the boxes and stuff. But then someone might jailbreak my device. So, you know, maybe we don't try to do that. [00:41:34]
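
The same orders-of-magnitude comparison as plain numbers:

```python
nvlink = 600e9     # bytes/s, NVLinked A100s
tiny_box = 60e9    # bytes/s, PCIe 4.0-class interconnect in the tiny box
internet = 125e6   # bytes/s, a 1 Gbit/s home connection

print(f"NVLink vs tiny box: {nvlink / tiny_box:.0f}x")      # ~10x
print(f"tiny box vs internet: {tiny_box / internet:.0f}x")  # ~480x, the "400x" George cites
```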

Alessio: Yeah, what's like the enterprise use case? Do you see companies buying a bunch of these and like stacking them together? [00:41:39]

George: The tiny box is like the first version of what we're building. But what I really want to do is be on the absolute edge of flops per dollar and flops per watt. These are the two numbers that matter. So the enterprise use case is you want to train, like Comma, right? So Comma just built out a new compute cluster. It's about a person and a half. [00:41:56]

Swyx: A person being 20 petaflops. [00:41:58]

George: A person is 20 petaflops. It's about 30 petaflops. We built out a little compute cluster and, you know, we paid double what you theoretically could per flop, right? You theoretically could pay half per flop if you designed a bunch of custom stuff. And yeah, I mean, I could see that being, you know, a tiny corp. Comma's going to be the first customer. I'm going to build a box for Comma and then I'm going to show off the box I built for Comma and be like, okay, like, do you want one? I sell $250,000 training computers. Or how much is one H100 box? [00:42:26]

Swyx: It's 400 grand? [00:42:27]

George: Okay, I'll build you a 400 grand training computer and it'll be 10x better than that H100 box. Again, not for every use case. For some, you need the interconnect bandwidth. But for 90% of most companies' model training use cases, the tiny box will be 5x faster for the same price. [00:42:41]

Alessio: You mentioned the person of compute. How do we build a human for $20 million? [00:42:47]

George: Well, it's a lot cheaper now. So like I said, Comma spent about half a million on our person and a half, so. [00:42:54]

Alessio: What are some of the numbers people should think of when they compare compute to like people? So GPT-4 was 100 person years of training. That's more like on the timescale. 20 petaflops is one person. I think you, right now the math was that for the price of the most expensive thing we build, which is the International Space Station, we could build one Tampa of. Yeah, yeah, one Tampa of compute. [00:43:16]

Swyx: Yeah, which is the ultimate currency of measurement. [00:43:20]

George: Yeah, yeah, we could build. So like the biggest training clusters today, I know less about how GPT-4 was trained. I know some rough numbers on the weights and stuff, but Lama- [00:43:28]

Swyx: A trillion parameters? [00:43:30]

George: Well, okay, so GPT-4 is 220 billion in each head, and then it's an eight-way mixture model. So mixture models are what you do when you're out of ideas. So, you know, it's a mixture model. They just train the same model eight times, and then they have some little trick. They actually do 16 inferences, but no, it's not like- [00:43:45]

Swyx: So the multimodality is just a vision model kind of glommed on? [00:43:49]

George: I mean, the multimodality is like obvious what it is too. You just put the vision model in the same token space as your language model. Oh, did people think it was something else? The mixture has nothing to do with the vision or language aspect of it. It just has to do with, well, okay, we can't really make models bigger than 220 billion parameters. We want it to be better. Well, how can we make it better? Well, we can train it longer, and okay, we've actually already maxed that out. We're getting diminishing returns there. [00:44:13]

Swyx: Okay. A mixture of experts. [00:44:14]

George: Yeah, a mixture of experts. We'll train eight of them, right? [00:44:16]

Swyx: So, all right. [00:44:17]

George: So, you know, the real truth is whenever a start, whenever a company is secretive, it's because they're hiding something that's not that cool. And people have this wrong idea over and over again that they think they're hiding it because it's really cool. [00:44:28]

Swyx: It must be amazing. [00:44:29]

George: It's a trillion parameters. No, it's a little bigger than GPT-3, and they did an eight-way mixture of experts. Like, all right, dude, anyone can spend eight times the money and get that. Coming back to what I think is actually gonna happen is, yeah, people are gonna train smaller models for longer and fine-tune them and find all these tricks. OpenAI used to publish stuff on this, you know, [00:44:47]

Swyx: when they would publish stuff [00:44:48]

George: about how much better the training has gotten holding compute constant. It's gotten a lot better, right? Think, compare like BatchNorm to NoBatchNorm. [00:45:00]

Swyx: Is that finding algorithms like FlashAttention? [00:45:02]

George: Yeah, well, FlashAttention, yeah. And FlashAttention is the same compute. FlashAttention is an interesting fact where it's actually the identical compute. It's just a more efficient way to do the compute. But I'm even talking about like, look at the new embeddings people are using, right? They used to use these like boring old embeddings. Now, like, LLaMA uses that complex-valued rotary one, and now there's like ALiBi. I'm not up-to-date on all the latest stuff, but those tricks give you so much. [00:45:23]

Swyx: There's been a whole round trip with positional embeddings. I don't know if you've seen this discussion. I haven't followed exactly. [00:45:29]

George: I mean, you quickly run into the obvious problem with positional embeddings, which is you have to invalidate your KV cache if you run off the context. So that's why I think these new ones, [00:45:38]

Swyx: they're playing with them, [00:45:38]

George: but I'm not an expert on like the latest up-to-date language model stuff. [00:45:43]

Alessio: What are some of the things, I mean, that people are getting wrong? So back to autonomous driving, there was like the whole like LiDAR versus vision thing. People don't get into accidents because they cannot see well. They get into accidents because they get distracted and all these things. Do you see similarities today on, like, the path to AGI? [00:45:59]

George: Nothing I say about this is ever gonna compete with how Rich Sutton stated it. [00:46:03]

Swyx: Rich Sutton, the writer of [00:46:04]

George: Reinforcement Learning, The Bitter Lesson. Nothing I say is ever gonna compete with, The Bitter Lesson's way better than any way I'm going to phrase this. Just go read that, and then like, I'm sorry it's bitter, but you actually just have to believe it. Like over and over again, people make this mistake. They're like, oh, we're gonna hand engineer this thing. No, like stop wasting time. [00:46:22]

Swyx: I mean, OpenAI is not taking The Bitter Lesson. They were leaders in deep learning for a long, long, long time. [00:46:27]

George: Well, OpenAI was the absolute leader to the thesis that compute is all you need, right? [00:46:31]

Swyx: And there's a question of how long [00:46:32]

George: this thesis is going to continue for. It's a cool thesis, and look, I think I would be lying along with everybody else. I was into language models like way back in the day for the Hutter Prize. I got into AI through the Hutter Prize. Like 2014, I'm trying to build compressive models of Wikipedia. And I'm like, okay, why is this so hard? What this is is a language model, right? And I'm playing with these Bayesian things, and I'm just like, oh, but I get it. I have two data points, and they're almost the same, but how do I measure that almost, right? I just wrapped my head around this, and this was around the time Karpathy released the first RNN that generated the Shakespeare stuff. And I'm like, okay, I get it, right? It's neural networks that are compressors. Now, this isn't actually, you can't actually win the Hutter Prize with these things because the Hutter Prize is MDL. It's the model, size of the model plus the size of the encodings, embeddings. So yeah, you can't, I mean, probably now you can because it's gotten so good. But yeah, back in the day, you kind of couldn't. So I was like, okay, cool. [00:47:29]

Swyx: This is what it is. [00:47:29]

George: I kind of get it. I didn't expect that it would continue to work this well. I thought there'd be real limits to how good autocomplete could get. That's fancy autocomplete. But yeah, it works well. So like, yeah, what is OpenAI getting wrong? Technically, not that much. I don't know. If I was a researcher, why would I go work there? [00:47:48]

Swyx: Yes, so why is OpenAI like the Miami Heat? [00:47:51]

George: No, look, this is my technical stuff. I don't really want to harp on this, but like, why go work at OpenAI when you could go work at Facebook as a researcher? OpenAI can keep ideologues who, you know, believe ideological stuff and Facebook can keep every researcher who's like, dude, I just want to build AI and publish it. [00:48:08]

Alessio: Yeah, any other thoughts, tiny corp, bounties? [00:48:11]

George: You know, I've been thinking a lot about like what it means to hire in today's world. Okay, look, I'm a believer that machines are going to replace everything in about 20 years. So, okay, what is that thing that people can still do that computers can't? And this is a narrowing list, but like, you know, back in the day, like imagine I was starting a company in 1960. Oh, and we're going to have to hire a whole bunch of calculators in the basement to do all the, you know, math to support the, dude, have you heard about computers? Why don't we just buy a few of those? Oh, wow, man, you're right. So like, I feel like that's kind of happening again. And I'm thinking about, I will post in my Discord, I'll be like, who wants to like, okay, I just changed my unary ops, they used to be log and exp in base e. I changed them to be log2 and exp2 because hardware has log2 and exp2 accelerators. [00:48:59]

Swyx: Yeah, and of course you can just change your base. [00:49:00]

George: It's one multiply to get it back to base e. But like, I made the primitives log2 and exp2, right? I just posted in the Discord. I'm like, could someone put this pull request up? And someone eventually did and I merged it. But I'm like, this is almost to the level [00:49:12]

Swyx: where models can do it. [00:49:14]

George: We're almost to the point where I can say that to a model and the model can do it. [00:49:17]
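
Editor's note: the log/exp change George describes really is a one-multiply conversion, since hardware exposes log2/exp2 primitives. A minimal sketch (generic Python, not tinygrad's actual code):

```python
import math

LN_2 = math.log(2.0)        # ~0.6931
LOG2_E = math.log2(math.e)  # ~1.4427

def ln(x: float) -> float:
    # Natural log via a log2 primitive: ln(x) = log2(x) * ln(2), one extra multiply.
    return math.log2(x) * LN_2

def exp(x: float) -> float:
    # e**x via an exp2 primitive: e**x = 2**(x * log2(e)), again one extra multiply.
    return 2.0 ** (x * LOG2_E)
```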

Swyx: Have you tried? Yeah, I don't know. [00:49:20]

George: I think autocomplete went further than I thought it would, but I'm also relatively unimpressed with these chatbots. The problem is if your loss function is categorical cross entropy on the internet, your responses will always be mid. [00:49:32]

Swyx: Yes, mode collapse is what I call it, I don't know. [00:49:35]

George: Maybe, I'm not even talking about mode collapse. You're actually trying to predict the, like, look, I rap. I'm a hobbyist rapper. When I try to get these things to write rap, the raps sound like the kind of raps you read in the YouTube comments. [00:49:45]

Swyx: Nursery school. [00:49:46]

George: Yeah, it's like, all right, great. You rhyme box with fox, sick rhyme, bro. You know, and Drake is rhyming give it up for me with napkins and cutlery, right? Like, all right, come on. [00:49:55]

Swyx: He's got like this thing about orange. Orange is famous so you can't rhyme it. Yeah, yeah, yeah, yeah, yeah. [00:49:59]

George: But now, of course, you know, four-inch screws and orange juice is in GPT's training corpus. Yeah, so I think it went further than everyone kind of thought it would. But the thing that I really want to see is like somebody put 10 LLMs in a room and have them discuss the answer before they give it to me. Right, like, you can actually do this, right? And I think the coding things have to be the same way. There is no coder alive, no matter how good you are, that sits down, well, I'm going to start at cell A1 and type my program, and then I'm going to press run and it's going to work. No one programs like that. So why do we expect the models to, right? So there's a lot that, like, still needs to be done. But, you know, at the tiny corp, I want to be on the cutting edge of this, too. I want to be, like, program generation. I mean, what is TinyGrad? It's a compiler, it generates programs. Generate the fastest program that meets the spec, right? Why am I not just having ML do that? So, you know, it's kind of a, you have to exist fluidly with the machines. And I've come around on a lot of stuff. I'm like, wait, TinyGrad, TinyCorp should be a remote company. I can't do this in person. [00:50:58]
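
Editor's note: the "10 LLMs in a room" idea George floats above is easy to prototype. A minimal sketch, assuming a generic llm(prompt) completion function as a stand-in rather than any specific API:

```python
def llm(prompt: str) -> str:
    """Stand-in for whatever completion API you use; not a real library call."""
    raise NotImplementedError

def discuss(question: str, n_agents: int = 10, rounds: int = 2) -> str:
    # Each agent drafts an answer, then revises after reading everyone else's drafts.
    answers = [llm(f"Answer concisely: {question}") for _ in range(n_agents)]
    for _ in range(rounds):
        transcript = "\n".join(f"- {a}" for a in answers)
        answers = [
            llm(f"Question: {question}\nOther answers:\n{transcript}\n"
                f"Critique these and give your own improved answer.")
            for _ in range(n_agents)
        ]
    # One final pass collapses the discussion into a single response.
    drafts = "\n".join(answers)
    return llm(f"Question: {question}\nDrafts:\n{drafts}\nGive the single best final answer.")
```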

Swyx: Really? [00:50:58]

George: Yeah, like, comma makes sense to be in person. Like, comma, sure. Yeah, we're getting an office in San Diego. [00:51:04]

Swyx: But that was a six-year-old company, right? [00:51:05]

George: And it works, and it works for a certain type of people [00:51:08]

Swyx: and a certain type of culture. [00:51:08]

George: But what's going to be different this time? Okay, remote, but now it's remote. And now I'm getting these, like, people who apply, and I'm like, I literally have a thousand applications. I'm not calling you to do a technical screen. I can't really tell anything from a technical screen. What am I going to do? Make them code on a whiteboard? Like, bring up a shared notebook document, so we could, oh, like, that's not going to work. Okay, so then I moved to the next thing. We do this at Comma with good success, programming challenges. [00:51:31]

Swyx: I've also found them to be, like, [00:51:32]

George: completely non-predictive. I found one thing to actually be predictive, and it's, wait a second, just write code in TinyGrad. It's open source, right? And yeah, so, you know, I'm talking to a few people who've been contributing, and, like, contribute, or, you know, the job's not for you. But you can do it remote, and it's, look, it's a chill job. Like, you're not, you're like, oh, yeah, well, I work for the tiny corp. Like, well, you're writing MIT-licensed software. Like, you see what it's doing, right? Like, we'll just, I think, think of it as maybe more of, like, a stipend than a salary. And then also some equity. Like, if, you know, I get rich, we all get rich. [00:52:01]

Alessio: How do you think about agents and kind of, like, thinking of them as people versus, like, job to be done? Sean built this thing called Small Developer. [00:52:09]

Swyx: It's in the same vein. Or, like, the human in the loop with the language model and just iterating while you write code. I think that's absolutely where it goes. [00:52:17]

Alessio: And there's, like, a, it's not, like, one thing. It's, like, there's Small Interpreter. There's, like, Small Debugger. It's kind of, like, all these different jobs to be done. [00:52:24]

Swyx: It's a small world. [00:52:25]

Alessio: Yeah, it's a, I know, this is, like, the small box is, like, small AI meets tiny corp. [00:52:29]

Swyx: So we're all in the same wavelength. [00:52:30]

Alessio: How do you think about that? Do you think people will have a human-like interaction where it's, like, oh, this is, like, the AI developer, or, like, is it I'm the human being supercharged by the AI tools? [00:52:41]

George: Oh, I think it's, yeah, much more like I'm the human supercharged by the AI tools. I think that, like, coding is tool-complete. Like, driving's not tool-complete. We hire people to drive who are, like, below the API line. Right, there's an API line in the world, right? [00:52:53]

Swyx: Love that. Yes. [00:52:53]

George: Yeah, yeah, yeah, there's an API line in the world. And, like, you can think, like, Uber's a really clear example, right? There's the people below the API line and the people above the API line. And the way you can tell if you're below or above, by the way, is is your manager a computer, right? Who's the manager of the Uber driver? [00:53:06]

Swyx: Well, a computer, right? Does the machine tell you what to do or do you tell machines what to do? Exactly, exactly. [00:53:09]

George: So, coding is tool-complete, right? [00:53:13]

Swyx: Coding is tool-complete. [00:53:13]

George: Coding is above the API line. So it will always be tools supercharging your coding workflow. And it will never be you performing some, like, task. Like, okay, well, I can do everything except for actually starting a Docker container. Like, it just doesn't make any sense, right? Yeah, so it will always be sort of tools. And, you know, look, we see the same stuff with all the, like, people are like, stable diffusion's gonna replace artists or whatever. It's like, dude, like- [00:53:38]

Swyx: It's gonna create new artists. [00:53:39]

George: Did Photoshop replace artists? [00:53:41]

Swyx: Like, what are you talking about, right? [00:53:42]

George: Like, you know, a real artist's finger paint. They can't use brushes. Brushes are, you know, brushes are gonna replace all the, okay, like, I just can't. Like, it's all just tools and the tools are gonna get better and better and better. And then eventually, yes, the tools are going to replace us. But, you know, that's still 20 years away. So, you know, I got a company to run in the meantime. [00:54:02]

Swyx: So I've written about the API line before and I think that's from Venkatesh. I don't know if you attribute it to him. I don't know, I definitely took it from someone. [00:54:07]

George: It's definitely not mine. [00:54:08]

Swyx: It's VGR. But I also have a speculated, a higher line than that, which is the Kanban board. Like, who tells the programmers what to do, right? So are you above or below the Kanban board? Has that evolved your management thinking? [00:54:21]

George: Yeah, like, that's sort of what I mean. Like, it's like, I'm just gonna describe the pull request in two sentences and then like, yeah. [00:54:28]

Swyx: So you are running the Kanban board? Or the bounties, you know? [00:54:31]

George: Yes, the bounties are the Kanban board, exactly. And that is kind of the high level. And then like, yeah, we'll get AIs to fill in some and we'll get people to fill in others. And that's also what it means to be like, full-time at TinyCorp, right? Would you start, and I wrote this up pretty concretely. I'm like, okay, step one is you do bounties for the company. Step two is you propose bounties for the company, right? You don't obviously pay them, we pay them. [00:54:52]

Swyx: But you propose them. [00:54:52]

George: And I'm like, yeah, that's a good bounty. That like, helps with the main workflow of the company. And step three is you get hired full-time, you get equity, we all, you know, maybe get rich. [00:55:01]

Swyx: What else are you designing differently about the employee experience? [00:55:04]

George: You know, some people really like to like, [00:55:06]

Swyx: like keep a separation, right? [00:55:07]

George: Some people really like to keep a separation between like employees and management or customers and employees. Like at comma, you know, the reason I do the DevKit thing, it's like, dude, you buy a comma thing, you're an employee of the company. Like you're just part of the company. It's all the same thing. There's no like secrets, there's no dividing lines. There's no like, it's all a spectrum for like, you know, down here at the spectrum, like you pay. And then up here at the spectrum, you get paid. You understand this is the same spectrum of college, right? Like for undergrad, you pay, and then you get up here to like, you know, I'm doing a PhD program, you get paid. Okay, well, cool. Welcome to the, you know. [00:55:39]

Alessio: What about comma bodies? You mentioned a lot of this stuff is clearly virtual, but then there's below the API line you actually need. [00:55:47]

Swyx: Wait, this is a thing that's been announced? Comma bodies? We sell them. You can buy them. [00:55:51]

George: They're a thousand bucks on our website. [00:55:53]

Swyx: Oh, okay, no, no, no. I'm thinking about like the, what Tesla announced with like the humanoid robots. It's the same thing. [00:55:58]

George: Except of course, we made the comma version of it. Tesla uses 20 actuators. We use two, right? Like how do you build the simplest possible thing that can like turn the robotics problem into entirely a software problem? So right now it is literally just a comma three on a pole with two wheels. It balances, keeps the comma three up there. And like, there's so much you could do with that already. [00:56:21]

Swyx: Right? [00:56:22]

George: Like this should replace, how many security guards could this replace? Right? If this thing could just competently wander around a space and take pictures and, you know, focus in on things, send you a text message when someone's trying to break into your building, you know, like, like this could already do so much, of course, but the software is not there yet. Right? So how do we turn robotics into a thing where it's very clearly a software problem? You know, that people don't accept that self-driving cars are a software problem. Like, I don't, I don't know what to tell you, man. Like literally just watch the video yourself and then drive with a joystick, right? Can you drive? And we've actually done this test. We've actually done this test where you've had someone, okay, you just watch this video and here's a joystick and you got to drive the car. And of course they can drive the car. It takes a little bit of practice to get used to the joystick, but the problem is all the model, right? So I can now make the model better. [00:57:07]

Swyx: Our second most popular episode ever was about Segment Anything coming out of Facebook, which as far as I understand is the state of the art in computer vision. What are you hoping for there that you need for comma? [00:57:17]

George: I haven't used Segment Anything. Like, are they large YOLOs or not? I've used like large YOLOs and I'm super impressed by them. [00:57:24]

Swyx: Yeah. [00:57:25]

George: I got to check out Segment Anything. I don't think it's a distinct problem, right? Okay, here's something that I'm interested in. All right, we have great LLMs. We have great text to speech models and we have great speech to text models. Okay, so why can I not talk to an LLM? Like I'd have a normal conversation with it. [00:57:39]

Swyx: You can with the latency of like two seconds every time. Right? [00:57:42]

George: And then it feels so unnatural. It's this like staccato. Like I don't like the RLHF models. I don't like the tuned versions of them. You take on the personality of our customer support agent. Right? [00:57:53]

Swyx: Like, oh, come on. [00:57:54]

George: I like Llama more than ChatGPT. ChatGPT's personality just grated on me. Whereas Llama, like, cool. I read a little bit of pretext paragraph. I can put you in any scenario I want, right? Like, that's interesting to me. So yeah, I think there is really no like distinction between computer vision and language and any of this stuff. It's all eventually going to be fused into one massive. So to say computer vision is solved, well, it doesn't make any sense because what's the output of a computer vision model? Segmentation? Like, what a weird task, right? [00:58:26]

Swyx: Who cares? OCR? [00:58:28]

George: Who cares? [00:58:29]

Swyx: I don't care if you can segment [00:58:29]

George: which pixels make up that laptop. I care if you can pick it up. [00:58:32]

Alessio: And you're going to have the local cluster. You're going to have the body. [00:58:36]

Swyx: Yeah. [00:58:37]

George: Yeah, I think that's kind of where that goes. [00:58:39]

Swyx: Maybe we can paint the future of like, the year is 2050. You've achieved all you wanted at TinyCorp. What is the AI enabled future like? [00:58:48]

George: Well, TinyCorp's the second company. Comma was the first. Comma builds the hardware infrastructure. TinyCorp builds the software infrastructure. The third company is the first one that's going to build a real product. And that product is AI Girlfriend. No, like I'm dead serious, right? Like, this is the dream product. This is the absolute dream product. Girlfriend is just the like- [00:59:08]

Swyx: Stand-in. [00:59:09]

George: Well, no, it's not a stand-in. No, no, no, no. I actually mean it, right? So I've been wanting to merge with a machine ever since I was like, mad little. [00:59:15]

Swyx: Like, you know, I was just like, [00:59:16]

George: how do I merge with a machine, right? [00:59:18]

Swyx: And like, you can look at like, [00:59:19]

George: maybe the Elon style way of thinking about it is Neuralink, right? I'm like, I don't think we need any of this, right? You ever, some of your friends maybe, they get into relationships and you start thinking of, you know, them and their partner as the same person. You start thinking of them as like one person. I mean, they are kind of like merged, right? Like, humans can just kind of do this. It's so cool. It's this ability that we already have. Right, so I don't need to put, you know, electrodes in my brain to merge with a machine. I need an AI Girlfriend, right? So that's what I mean. Like, this is the third product. This is the third company. And yeah, in 2050, I mean like, ah, it's so hard. I just like, maybe I can imagine like 2035. I don't even know 2050, but like, yeah, 2035. Like, yeah, that'd be really great. [01:00:03]

Swyx: In terms of merging, like, isn't it, shouldn't you work on Brain Upload rather than AI Girlfriend? Brain Upload, right? [01:00:09]

George: I don't need Brain Upload either. Like, there's thousands of hours of me on YouTube, right? Yes. How much of my brain's already uploaded? [01:00:17]

Swyx: That's only the stuff that you voice. Yeah, it's not that different. [01:00:20]

George: It's not that different, right? You really think a model with, you know, an exaflop of compute couldn't extract everything that's really going on in my brain? I'm a pretty open person, right? Like, I'm not running a complex filter. Humans can't run that complex of a filter. Like, humans just can't. Like, this is actually a cool quirk of biology. It's like, well, humans like can't lie that well. [01:00:39]

Alessio: So is it good or bad to put all of your stream of consciousness out there? [01:00:43]

George: I mean, I think it's good. [01:00:45]

Swyx: I mean, he's streaming every day. I want to live forever. We said off mic that we may be the first immortals, right? Yeah, this is how you live forever. [01:00:54]

George: It's a question of, okay, how many weights do I have? Right, okay, let's say I have a trillion weights, right? So we're talking about a terabyte, 100 terabytes here. [01:01:02]

Swyx: Okay, but it's not really 100 terabytes, right? [01:01:03]

George: Because it's Kolmogorov complexity. How much redundancy is there in those weights? So, like, maximally compressed, how big is the weight file for my brain? Quantize it whatever you want. Quantization is a poor man's compression. I think we're only talking really here about, like, maybe a couple gigabytes, right? And then if you have, like, a couple gigabytes of true information of yourself up there, cool, man. Like, what does it mean for me to live forever? [01:01:27]

Swyx: Like, that's me. No, I think that's good. [01:01:29]
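
Editor's note: rough numbers behind the back-of-the-envelope math above (illustrative only, not a claim about any real model):

```python
params = 1_000_000_000_000                # a trillion weights
print(params * 2 / 1e12, "TB at fp16")    # 2.0 TB at two bytes per weight
print(params * 1 / 1e12, "TB at int8")    # 1.0 TB after 8-bit quantization
# Quantization is the "poor man's compression"; the Kolmogorov-style question is how
# few bits remain once all redundancy is squeezed out, which George guesses is only GBs.
```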

Alessio: And I think there's a bit of, like, a professionalization of social media, where, like, a lot of people only have what's, like, PC out there, you know? And I feel like you're going to get, going back to the ChatGPT thing, right? You're going to train a model on, like, everything that's public about a lot of people. [01:01:44]

Swyx: And it's like- [01:01:45]

George: Then no one's going to run their model and they're going to die. Don't put PC on social media. [01:01:49]

Swyx: We're moving on to what would normally be called the lightning round, but just general tics, because you're a generally interesting person with many other interests. What does the goddess of everything else mean to you? [01:01:59]

George: Oh, it means that AI is not really going to kill us. [01:02:01]

Swyx: Really? [01:02:01]

George: Of course. [01:02:02]

Swyx: Tell us more. [01:02:03]

George: Lex asked me this, like, is AI going to kill us all? And I was quick to say yes, but I don't actually really believe it. I think there's a decent chance that AI kills 95% of us. [01:02:11]

Swyx: Okay. [01:02:12]

Alessio: But they saw on your Twitch streams that you're with them, so they're not going to- [01:02:16]

Swyx: No, I don't think, I actually, [01:02:18]

George: I don't also think it's AI. Like, I think the AI alignment problem is so misstated. I think it's actually not a question of whether the computer is aligned with the company who owns the computer. It's a question of whether that company's aligned with you or that government's aligned with you. And the answer is no, and that's how you end up dead. [01:02:31]

Swyx: So what the goddess of everything else means to me [01:02:32]

George: is like, the complexity will continue. Paper clippers don't exist. [01:02:37]

Swyx: You know, there are forces. [01:02:38]

George: The paper clipper is cancer, right? The paper clipper is really just a perfect form of cancer. And the goddess of everything else says, yeah, but cancer doesn't win, you know? [01:02:48]

Swyx: Yeah, it's a beautiful story for those who haven't heard it. And you read it out and I listened to it. Yeah, what are you grateful for today? [01:02:55]

George: Oh man, I mean, it's all just like, I've been thinking about this stuff forever. Like, that it's actually like happening and it's happening in an accessible way too. I guess that's what I'm really grateful for. It's not like, AI is not some Manhattan Project style. You don't know anything about it. Closed doors. [01:03:12]

Swyx: Closed doors. [01:03:13]

George: I'll fight really hard to keep it that way. I'm grateful for just how much is released out there and how much I can just learn and stay up to date. And I guess I'm grateful to the true fabric of reality that, you know, I didn't need differential equations to understand it. Like, I don't need some like, there's a limit to my math abilities. I can do most undergrad math, but I took some grad math classes and okay, now we're getting to the end of what I can do. And it's just the actual like, end of what I can do. Like, I'm limited by my brain, but you know, ML stuff, hey, you need high school math. [01:03:45]

Swyx: You know what I mean? [01:03:46]

George: When I learned to multiply a matrix, seventh grade, [01:03:48]

Swyx: like, it's all easy. You need more electrical engineering than you need high school math early. [01:03:52]

George: Yeah, well, you need electrical engineering to like, build the machines, but even that, like, these machines are simpler than the machines that have existed before. The compute stack looks really nice. So, you know, yeah, I just, I'm grateful that it's all happening and I get to understand it. [01:04:05]

Alessio: John Carmack mentioned there's about six insights we have left. Do you have an intuition for what some of the paths [01:04:11]

Swyx: people should be taking? [01:04:12]

Alessio: Obviously you're working on one. What are some of the other branches of the tree that people should go under? [01:04:17]

George: I don't think I'm working on one of the six insights. I don't think TinyGrad's any one of the six insights. Something I really like that Elon does, and I try to be inspired by it, is look at the boring tunnel machine and ask how you can build a 10X cheaper one. All right, look at the rocket. How can I build a 10X cheaper one? All right, look at the electric car and say, how can I build a 10X cheaper, like, cheaper or, you know, can go further or whatever, whatever, whatever, right? And you just do the straight up physics math, right? I'm trying to do the same thing with ML frameworks, right? And in doing so, making sure that this stuff remains accessible. You could imagine a world where if Google TPUs were actually the ultimate, if Google TPUs were actually the best training things, I mean, actually, you know, I'm kind of grateful for NVIDIA, right? Because if Google TPUs were the ultimate, now you have this huge closed source compiler in between XLA and the hardware, and yeah, that's just a really bad thing. So, I mean, something that is somewhat upsetting about the tiny corp is that it is trying to prevent downside, but it's not all about preventing downside. Like, we're also building computers and we're gonna build some awesome, powerful, cheap computers along the way. So, no, I'm not really working directly on any of the six tricks. I also think the six tricks are kind of gonna be like luck. [01:05:25]

Swyx: I think it's just gonna be like, you know, [01:05:26]

George: please tell me more about what covariate shift is and how that inspired you to come up with batch normalization. Please tell me more about why it's a transformer and it has a query, a key, and a value, right? Like Schmidhuber described it better in fast weights. I mean, my theory about why transformers work has nothing to do with this attention mechanism and just the fact that it's semi-weight sharing, right? Because the weight matrix is being generated on the fly, you can compress the weight matrix, right? Like, this is what that, there's an operation in the transformer, which, and by the way, this is like, Qualcomm's SNPE can't run transformers for this reason. So, most matrix multiplies in neural networks are weight times values, right? Whereas when you get to the outer product in transformers, well, it's weight times weight. It's values times values, right? So, SNPE doesn't even support that operation, right? So, it's like that operation that gives the transformer its power. It has nothing to do with the fact that it's attention, [01:06:20]

Swyx: right? [01:06:21]

George: And this is a funny, like, but that is one of the six tricks, right? Batch, like these norms are a trick. Transformers are a trick. Okay, six more. [01:06:29]

Swyx: So, you talk about attention as weight compression. [01:06:33]

George: Compression is not exactly the right word. What I mean is that the weight can change dynamically based on the context. So, there was this thing in PAQ8 in the Hutter Prize that I absolutely loved, and I've never seen it again in neural networks, and it's a really good trick. Okay, imagine you have 256 weight sets for a layer, right? And then you choose which of the weight sets you're loading in based on some context. And that context can come from another neural net, right? So, I have another neural net, which projects 256 wide, one hot, do a softmax, predict it, and then I actually load the weights in. I can do this operation at both training and inference, and I load in the weights given the context. Like, that is what transformers do. But transformers, instead of having 256 discrete ones, it's actually just that, but continuous. Which is funny that that was in language models, and I just like, when I understood that about transformers, I'm like, oh, this is a real trick, and why are they using the word attention? [01:07:23]
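
Editor's note: a minimal sketch of the discrete trick George describes (256 weight sets for a layer, a small context network choosing which one to load), written here in PyTorch. The names are illustrative, and the softmax-mixture path is the "continuous" analogue he attributes to transformers:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextSelectedLinear(nn.Module):
    """A linear layer whose weight matrix is chosen per example by a context vector."""
    def __init__(self, d_in: int, d_out: int, d_ctx: int, n_sets: int = 256):
        super().__init__()
        self.bank = nn.Parameter(torch.randn(n_sets, d_out, d_in) * 0.02)  # 256 candidate weight sets
        self.selector = nn.Linear(d_ctx, n_sets)                           # tiny context net

    def forward(self, x: torch.Tensor, ctx: torch.Tensor, hard: bool = False) -> torch.Tensor:
        probs = F.softmax(self.selector(ctx), dim=-1)                      # (batch, n_sets)
        if hard:
            # Hard selection: load exactly one weight set, as in the PAQ-style trick.
            probs = F.one_hot(probs.argmax(dim=-1), probs.shape[-1]).to(x.dtype)
        # Soft selection is the continuous analogue: a context-dependent mixture of weight sets.
        w = torch.einsum("bn,noi->boi", probs, self.bank)                  # per-example weight matrix
        return torch.einsum("boi,bi->bo", w, x)
```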

Alessio: And today is actually the anniversary of attention is all you need. What? [01:07:27]

Swyx: Oh, that's so cool. [01:07:28]

Alessio: Today, six years ago. [01:07:29]

Swyx: Six years. [01:07:30]

George: Six years. [01:07:31]

Swyx: Changed the world. Wow. [01:07:32]

George: Well, there's one of your envelope tricks, right? And you could easily write it on an envelope, think about how you write out that. How many times have you written that? Because it's not in any libraries, because it's all used a little differently each time. Like, you just write out that exact same, you know. [01:07:45]

Swyx: You've name checked Elon a few times. I think about both of you as systems thinkers. Input, output, thinking something in between. What's different about your style versus his? [01:07:53]

George: Elon's fundamental science for the world is physics, mine is information theory. But you do a lot of physics as well. [01:07:58]

Swyx: I mean, like, you base it on- [01:07:59]

George: And Elon does a lot of information theory as well, too. But the difference maybe is expressed in what your ambitions are, right? Elon's ambitions may be like- [01:08:08]

Swyx: Go to Mars. Go to Mars, right? [01:08:10]

George: Go to Mars is the ultimate modernist physics ambition, right? It's a physics problem getting to Mars, right? [01:08:16]

Swyx: Well, what are electric cars? [01:08:17]

George: It's a physics problem, right? Okay, now he's like pushing on the autonomy stuff, and you push a little on information theory. But fundamentally, his dreams are physics-based dreams. My dreams are information-based dreams. I want to live forever in virtual reality with my AI girlfriend. Those are the aspirations of someone who accepts information theory as a core science. So I think that's the main difference between me and him. He has physics-based aspirations, and I have information-based aspirations. [01:08:39]

Swyx: Mark Andreessen, he is a- Hi, Mark. He's a listener. He's a big proponent of effective accelerationism. You've been a bit more critical. Why do you say that e/acc is not taken seriously by its adherents? [01:08:50]

George: Oh, well, only the left takes ideology seriously. It's just like a fact, right? [01:08:55]

Swyx: Is the right more cynical? Is that what it is? [01:08:57]

George: I don't know. [01:08:58]

Swyx: It's like the left actually manages [01:08:59]

George: to get energy around the ideologies, right? [01:09:02]

Swyx: Look, here you have- [01:09:03]

George: You have two effective altruists named Sam going in front of Congress. Only one of them is in jail. [01:09:08]

Swyx: You know, it's interesting. [01:09:09]

George: They're both calling for regulation in their respective spaces, right? [01:09:11]

Swyx: So SBF is definitely like kind of wolf in sheep's clothing, kind of, right? Like he only adopted e/acc or EA to market. [01:09:19]

George: Oh, and Sam Altman is a genuinely good guy who is not interested in power-seeking for himself. [01:09:24]

Swyx: All right. Okay, okay. We don't have to go there. Fair enough, fair enough. [01:09:27]

George: But no, e/acc is not like, like you are not serious, right? Mark Andreessen, I like Mark Andreessen, but it's like someone who's like 2019, whose, like, eyes were opened about the political world not being exactly what they thought. You mean all the people on the news were lying to me? [01:09:42]

Swyx: Bro, they were lying to you. [01:09:43]

George: Like, okay, we all figured this out five years ago. Now, what are you going to do about it? I'm going to complain about it on Twitter. Great, and that's what e/acc is. [01:09:50]

Alessio: Last and maybe most important, why was Avatar 2 bad? [01:09:55]

Swyx: Oh, I have a whole, you can go on my blog. [01:09:56]

George: I rewrote the script of Avatar 2. I wrote a script that actually might make you feel something for the characters. I killed Jake Sully in the first scene. Like you had to. Do you really think his second story arc topped his first one? No, of course not. You had to kill the guy and make the movie about the brothers, right? And just that alone and realizing that, like you could have kept the Titanic scene. [01:10:16]

Swyx: It would have been fine. [01:10:16]

George: I didn't even take it out. I left your Titanic scene, James Cameron, but I wrote you a story. So, you know, you're just, just, just. [01:10:23]

Swyx: He needs ships to sink in water. [01:10:24]

George: Look, it's a great scene, but like the movie was just like, like the Roman, I've never. [01:10:30]

Swyx: Great CGI, you know, let down by the writing maybe. It's a beautiful world. [01:10:34]

George: And that's why like I care so much, right? Like you don't hear me ranting about Pirates of the Caribbean 2 being a terrible story. Cause come on, what do you expect, man? Like Johnny Depp's like, wow, I had a movie that made me rich. I love this. [01:10:44]

Alessio: But this goes back to like the midpoint. You know, I think you wrote like, feels like ChatGPT wrote the movie and that's my worry a little bit. It's like kind of converging towards that. [01:10:53]

Swyx: Oh, I. Malik, Malik wrote the movie. Sorry, I didn't want to interrupt you. [01:10:59]

George: I closed a pull request two days ago. I was like, was this written by ChatGPT? And I just closed it. [01:11:04]

Swyx: Like, you know what? [01:11:05]

George: I honestly feel bad if you were a human who wrote this. [01:11:07]

Swyx: Incapable of being more perplexed. [01:11:09]

George: But if you, if I have a classifier running in my head that asks, you know, is this a AI or is this a human? Like, you know, the only way to deal with all this, like, like, like, oh God, it's like the worst possible. Like, you know, people are like, how are you mad about like these chatbots? You're not mad about like Tesla. I don't want to buy a Tesla. I don't have to buy a Tesla. And it won't really impact my life negatively. But if I don't want to use a chatbot, it's still going to impact my life negatively. All the amount of like personalized spam that now makes me spend more cycles on my classifier to tell if it's spam or not, because you can now use AIs and generate this so cheaply. Like, no, I mean, we have to move to a model where everything's just a dollar, right? Like you want to send me an email, it's a dollar. Like you guys wouldn't care. None of my friends would care. No one would care, except the spammers, right? Like we just got to move to those sort of models. [01:11:54]

Swyx: Awesome. [01:11:55]

Alessio: One last message you want everyone to remember. [01:11:58]

George: Go try TinyGrad. I hope that we're a serious competitor to what's out there. And then I want to take it all the way. We'll start with just building something for GPUs and then we'll start building chips and then we'll start building fabs and then we'll start building silicon mines and then we'll have the first self-reproducing robot using. [01:12:15]

Swyx: Yeah, okay. All right, George. [01:12:18]

Alessio: Thank you so much for coming on. [01:12:19]

Swyx: You're a big inspiration. Thank you. Thanks. [01:12:21]

Swyx: Thank you. [01:12:29]



Get full access to Latent Space at www.latent.space/subscribe
2023-06-20
Link to episode

Emergency Pod: OpenAI's new Functions API, 75% Price Drop, 4x Context Length (w/ Alex Volkov, Simon Willison, Riley Goodside, Joshua Lochner, Stefania Druga, Eric Elliott, Mayo Oshin et al)

Full Transcript and show notes: https://www.latent.space/p/function-agents?sd=pf

Timestamps:

[00:00:00] Intro

[00:01:47] Recapping June 2023 Updates

[00:06:24] Known Issues with Long Context

[00:08:00] New Functions API

[00:10:45] Riley Goodside

[00:12:28] Simon Willison

[00:14:30] Eric Elliott

[00:16:05] Functions API and Agents

[00:18:25] Functions API vs Google Vertex JSON

[00:21:32] From English back to Code

[00:26:14] Embedding Price Drop and Pinecone Perspective

[00:30:39] Xenova and Huggingface Perspective

[00:34:23] Function Selection

[00:39:58] Designing Code Agents with Function API

[00:42:16] Models as Routers

[00:46:48] Prompt Engineering replaced by Finetuning

[00:52:15] The 2 Code x LLM Paradigms

[00:56:30] Smol Models for the future

[00:58:54] The Evolution of the GPT API

[01:03:27] Functions API Security vs Prompt Injection

[01:16:18] GPT Model Upgrades

[01:17:36] JSONformer

[01:21:03] Closing Comments - What We Want Next



Get full access to Latent Space at www.latent.space/subscribe
2023-06-14
Link to episode

From RLHF to RLHB: The Case for Learning from Human Behavior - with Jeffrey Wang and Joe Reeve of Amplitude

Welcome to the almost 3k latent space explorers that joined us last month! We're holding our first SF listener meetup with Practical AI next Monday; join us if you want to meet past guests and put faces to voices! All events are in /community.

Who among you regularly clicks the ubiquitous thumbs-up/thumbs-down buttons in ChatGPT/Bard/etc?

Anyone? I don't see any hands up.

OpenAI has told us how important reinforcement learning from human feedback (RLHF) is to creating the magic that is ChatGPT, but we know from our conversation with Databricks' Mike Conover just how hard it is to get even 15,000 pieces of explicit, high-quality human responses.

We are shockingly reliant on good human feedback. Andrej Karpathy's recent keynote at Microsoft Build on the State of GPT demonstrated just how much of the training process relies on contractors to supply the millions of items of human feedback needed to make a ChatGPT-quality LLM (highlighted by us in red).

But the collection of good feedback is an incredibly messy problem. First of all, if you have contractors paid by the datapoint, they are incentivized to blast through as many as possible without much thought. So you hire more contractors and double, maybe triple, your costs. OK, you say, let's recruit missionaries, not mercenaries. People should volunteer their data! Then you run into the same problem we and any consumer review platform run into - the vast majority of people send nothing at all, and those who do disproportionately represent negative reactions. More subtle problems emerge when you try to capture subjective human responses - the reason that ChatGPT responses tend to be inhumanly verbose is that humans have a well-documented "longer = better" bias when classifying responses in a "laboratory setting".

The fix for this, of course, is to get out of the lab and learn from real human behavior, not artificially constructed human feedback. You don't see a thumbs up/down button in GitHub Copilot, nor Codeium, nor Codium. Instead, they work an implicit accept/reject event into the product workflow, such that you cannot help but give feedback while you use the product. This way you hear from all your users, in their natural environments, doing valuable tasks they are familiar with. The prototypical example of this is Midjourney, which unobtrusively collects 1 of 9 types of feedback from every user as part of their workflow, in exchange for much faster first-draft image generations.

The best known public example of AI product telemetry is in the Copilot-Explorer writeup, which checks for the presence of generated code after 15-600 second intervals, which enables GitHub to claim that 40% of code is generated by Copilot.
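
A toy version of that implicit-feedback check (the event name and editor hook are made up for illustration; this is not Copilot's or Amplitude's actual telemetry):

```python
import time

def completion_survival(get_buffer_text, suggestion, checkpoints=(15, 30, 120, 600)):
    """Implicit feedback: did the generated text stay in the document?
    `get_buffer_text` is an assumed editor hook returning the current buffer;
    a real plugin would schedule callbacks rather than blocking like this."""
    events, elapsed = [], 0
    for seconds in checkpoints:
        time.sleep(seconds - elapsed)
        elapsed = seconds
        events.append({
            "event": "completion_still_present",
            "after_seconds": seconds,
            "present": suggestion in get_buffer_text(),
        })
    return events
```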

This is fantastic and "obviously" the future of productized AI. Every AI application should figure out how to learn from all their real users, not some contractors in a foreign country. Most prompt engineers and prompt engineering tooling also tend to focus on pre-production prototyping, but could also benefit from A/B testing their prompts in the real world.

In short, AI may need Analytics more than Analytics needs AI.

Amplitude's Month of AI

This is why Amplitude is going hard on AI - and why we recently spent a weekend talking to Jeffrey Wang, cofounder and chief architect at Amplitude, and Joe Reeve, head of AI, recording a live episode at the AI + Product Hackathon where 150+ hackers gathered to compete for over $22.5k in prizes from Amplitude, New Relic, LanceDB, AWS, and more.

To put things in perspective, Amplitude is a legendary YC alum with $238M of revenue in 2022 - our first guests representing the AI efforts of a public company!

We chatted about how they have been approaching AI in their product ("question to chart" BI, text field autofill, instrumenting Amplitude with Amplitude), some of the issues they've had with different models, and the importance of first-party data in the world of LLMs. Another topic that came out of the Q&A was this idea of almost an "AmplitudeGPT"; rather than using language to simply generate a query, you could have these models investigate why certain behavior is happening in your user base. It was a really good discussion, and we hope you all enjoy listening to it!

Sections

* [00:00:47] Amplitude's founding story and pivot

* [00:03:28] Amplitude as an AI company and opportunities

* [00:07:14] Limitations and challenges with using AI models

* [00:10:56] Using Amplitude's product to build Amplitude - instrumenting AI

* [00:12:32] Existing ML models in Amplitude's product and customer use cases

* [00:15:50] "A/Z testing" and adaptable products

* [00:19:33] The future of analytics and dashboards

* [00:21:03] Optimizing for metrics in chatbots and AI products

* [00:26:22] Using general models vs. fine-tuned models

* [00:30:24] The importance of models vs. data - Amplitude's data set

* [00:39:00] Lightning Round + Q&A

Show Notes

* Amplitude

* Sonalight to Amplitude pivot announcement

* The Slack origin story

* Reverse Engineering Copilot

* Simon Willison's blog

Transcript

Editor's note: all timestamps are 1 minute behind because we hadn't yet added the intro before making these. Sorry about that!

Alessio: Thank you everyone for coming. Hopefully, some of you have listened to the podcast before, if you haven't, we focus on AI research and application. So we don't focus on "AI is going to kill us all". We don't think about virtual girlfriends. We don't think about all of these more societal things. We're focused on models: how do you build them? How do you train them? How do you use them in production? What are some of the limitations on getting these things from demos to things that millions of users use? And obviously, a lot of you are building things. Otherwise, you wouldn't be here. And some of you have been building things for a long time, and now have a new paradigm that you want to build on top of. So I'm excited to dive in here. And maybe, I mean, I'm sure most people know you, but maybe you want to do intros and give a little background. [00:00:47]

Jeffrey: Sure. Yeah, hey, everyone, met you all this morning, but I'm Jeffrey. I'm one of the co-founders and Chief Architect here at Amplitude. Been working on this product analytics thing, helping people understand user behavior data and make great product decisions and build better products for the last decade or so. And obviously, AI is a technology that we've been leveraging for a long time, but the recent trends are particularly exciting. And yeah, we have a lot of thoughts on how to apply that to our space, what we're doing in our product, and what we think the future of AI and product development and product data is. So excited to talk through some of those. [00:01:20]

Joe: Yeah, I'm Joe, Joe Reeve. I've got a background in sort of startups and tech, been professional software engineer since I was 16, quit college. And at the moment, I'm running sort of AI R&D efforts here at Amplitude. Super excited about all the new stuff, but also all the stuff that Amplitude's been doing for a long time and how we're sort of getting renewed interest and excitement and abilities to push that even further forwards. [00:01:44]

Swyx: So I think it's useful for people listening on the podcast and also some people here. Can you contextualize Amplitude as an AI company? Like what does that mean to you? What unique opportunities do you guys have? [00:02:02]

Jeffrey: Sure, yeah, happy to speak to that. So, you know, if we think about the fundamental thing that our customers of Amplitude try to do, it's they want to look at their product data and they want to figure out how do I make my product better? And the really cool thing about product data is that one, it's often like very high fidelity, right? Digital products compared to, you know, let's say physical products before them have way more information about what's going on. And so that's why product data is, you know, even a thing at all, right? You finally have that feedback loop of, hey, I built this thing. This is how people are using it. Now let me learn from that and make my product better. Now, one of the downsides of that is that the data is massive. If you look at any of the internet scale products out there, they generate enormous amounts of data. And the ability of humans to kind of sift through that data is obviously limited. At Amplitude, we try to give people as many tools, whether AI or not, in order to process that. But at the end of the day, if you could get from the data and what user behavior is happening in your product to the insights of how to make your product better without as much manual work, that's kind of the holy grail of product analytics. And so in some sense, Amplitude has always been a company on the path to AI because figuring out how to make your product better from data is ultimately an AI problem. And so we're kind of just solving all the barriers in the way, like getting data in first, building good models for short-term things. And long-term, it's always been about, hey, how can you take product data and automatically make your product better as fast as possible? [00:03:28]

Alessio: So that's the future of Amplitude. And a lot of people here probably want to start companies and whatnot. So maybe you want to give a 60 seconds of why you started Amplitude and what the story was like and maybe the first three to six months, what the challenges were. [00:03:42]

Jeffrey: Yeah, of course. It's funny that we talk about this because the start of Amplitude is actually almost more AI than the current state. And so actually my two co-founders, Spencer and Curtis, they went through YC originally with not Amplitude, but SonaLite, which was a text-by-voice company. So it was kind of before the era of Siri and those types of technologies where they wanted to build something that would read text messages to them, that's easy, but also do voice recognition so that you could send text messages, say when you're driving, without having to pull out your phone. And so they worked on it and it was really popular back when they were doing it. After they finished YC, they realized the big innovation that they needed to figure out in order to make that successful was being really good at voice recognition, which was a different problem. They're awesome software engineers, but they don't come from an ML background. And so it's like, okay, are we going to spend the next five years solving voice recognition? Not really the thing that they had in mind when they were building product. But one thing that they happened to stumble upon as they were working on that was they spent a lot of time thinking about, hey, what was hard about that product? What made users churn? What made users really love it and engage? And they built a bunch of analytics tools to help them understand that. And they were really kind of shocked that those tools didn't exist out there in the market or they were like much more primitive than they wanted. And it turns out a bunch of other people in their YC batch felt the same. And they were like, hey, that analytics thing you're building, we want that. Forget your text by voice, we want your analytics product. And so they're like, okay, fine. We will pivot, natural language and voice recognition isn't really our thing. And so we'll do distributed systems and analytics instead. That's where I came in. I'm a distributed systems and analytics guy. And so I happened to get in touch with them just through some mutual friends at the time. And then, yeah, we kind of went on it. The funny thing about a lot of things in technology is that the most forward thinking companies with respect to a lot of technologies are gaming companies. And so a lot of Amplitude's early start was either gaming companies or companies with founders that came from gaming backgrounds, where in gaming people have always been very, very rigorous about product data and optimizing engagement loops and all of that. And so they look for the best tools. We went to Zynga 15 years ago. It's like, that's where product analytics originated. And so a lot of those founders of new startups who had left Zynga were like, hey, that thing that you're building, that's trying to figure out patterns in user data and use that to make better products. That is exactly what we want after leaving Zynga. And then from there, that was Amplitude.

Swyx: Yeah, I think famously other gaming companies would be like Slack, right? Mr. Butterfield tried to make a gaming company and failed and made Flickr. Then he tried to make another gaming company and failed and made Slack. And now look out to see what he does next. Discord as well. That's right. [00:06:34]

Jeffrey: Yeah, people who come from gaming backgrounds are very rigorous in their product thinking. [00:06:39]

Swyx: That's interesting. Alessio, you have a background in games? [00:06:43]

Alessio: Yeah, in playing them, not in building them. So I will not fall into an enterprise company by doing that. Let's talk about R&D today and some of the ideas that you're working through, like some of the limitations that you run through. I think the most interesting thing about hackathons is you come with an idea and then you kind of hit a wall trying to build it. And then that takes you into another path. Like what are maybe funny things that you learn in terms of like the limitations of these models or like the missing infrastructure for using them? [00:07:14]

Joe: So we've got a couple of different frames for thinking about this. There's AI that we're putting into our products and then us knowing that our customers want to put AI into their products. So there's the, how do we support our customers in their product development using AI? But how do we do that ourselves? And this is a great opportunity for us to learn the challenges our customers are gonna see. And so the first thing there is let's just start from the beginning, assume we want to add AI to our product, which maybe isn't the best place to start, but let's just assume we want to. How do we start ideating opportunities to put stuff into our product? So we sort of came up with this framework where we look at our product and we think about what are the collaboration touch points? So where are the points that a human might hand off to another human? And then think where can we replace one of those humans with the machine? So instead of thinking of some AI, amorphous AI, LLM, whatever, we're thinking actually, what if we had a robot that we were collaborating, not just a human, not just some sort of thing that spits out numbers. So collaborating. Then there's thinking of these as tools. So this is like your auto-suggest, on your mobile keyboard or spell check or something. How do you integrate this stuff as deeply into your product? So what are the friction points that users go through? Maybe they check lots of boxes. Is there a way we can pre-check those boxes we can get? So that's the feature embedding really deeply into the tool you've already got, the product you've already got. And then you step back and think, okay, what's a tool? So a tool is like ChatGPT, where you go there, it's an AI powered tool. It's not necessarily connected to your product, but it's a supplementary tool that you add. So there's a sort of ideation process there that we went through. And we sort of landed on a couple. And one of the key things that Amplitude does is help our customers, one, collect data in like a standard and sort of queryable way. And then we help them query it and get insights out of that data. So we were thinking, what's the feature there? How do we embed that? But also what's the collaboration point? And you might be a product manager asking an analyst, hey, please help me. Let's have a conversation about this. I don't know what questions to ask, but you also might just be about to go click the big create button and fill in a bunch of fields. And can we fill in a bunch of the fields for you? So we went to what to us seemed like one of the most obvious places. And we built a text box. Surprise, surprise with LLMs. We've got a text box. You can type in a question, type in anything about your data that you want to know, and then it'll spit back a chart, which is kind of neat. And we hit a bunch of problems there with LLMs hallucinating, losing context, even within the context windows, not really sort of recalling everything within the context window. So we sort of did a bunch of experimentation and realized if we split this down to seven different questions, so instead of saying, generate me a chart and a query for this one question, let's split that into lots of sub queries, like what kinds of events should I use? How should I display this? What should I call it? Rather than asking you all of that in one go. But then we had another problem where we have one query that a user makes that actually spins out seven different queries. So how do we monitor this? 
We can't just say one performance metric. You know, RLHF, you can't just say yes or no. Was the query response good? Because it might've failed for one of seven reasons. And maybe multiple of them failed or maybe some of them failed and then maybe they've hallucinated. And so we're getting code errors where an enum is not being matched. So we've had lots of sort of issues going all the way down there that we've had to figure out from first principles and sort of a really exciting way for us to understand what our customers are going through. [00:10:56]
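
Editor's note: a rough sketch of the decomposition Joe describes; the sub-query names, the llm stub, and the track call are all illustrative, not Amplitude's internal code. The point is that each piece of the chart spec gets its own prompt and its own success/failure event, so failures can be attributed:

```python
def llm(prompt: str) -> str:
    raise NotImplementedError  # stand-in for whatever completion API is used

def track(event: str, properties: dict) -> None:
    print(event, properties)   # stand-in for an analytics SDK call

SUB_QUERIES = {
    "events": "Which tracked events answer this question: {q}",
    "metric": "Which aggregation (count, uniques, conversion) fits: {q}",
    "grouping": "Which user or event property should results be grouped by: {q}",
    "chart_type": "Best chart type (line, bar, funnel) for: {q}",
    "title": "A short chart title for: {q}",
}

def question_to_chart(question: str) -> dict:
    spec = {}
    for name, template in SUB_QUERIES.items():
        try:
            spec[name] = llm(template.format(q=question))
            track("subquery_succeeded", {"subquery": name})
        except Exception as err:  # e.g. a hallucinated enum that validation rejected
            track("subquery_failed", {"subquery": name, "error": str(err)})
            spec[name] = None
    return spec
```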

Swyx: So I wanna be clear. So you've described your exploration and how you think about products. What have you released so far? I just wanna get an idea of what has been shipped. [00:11:08]

Joe: Sure. So in terms of LLM stuff, this, we call it question to chart internally. This ask a question, get a chart out. This, we've started rolling out to customers already. So last week, actually, started rolling out to our AI design partners a sign that we had signed up, which is a really exciting process. Actually, a lot of customers are just so excited to work with us and try it out and see how they can break it. So that's something we rolled out recently, which is built in LLM. It's the first piece built on LLM that we're working on. But we've also had a bunch of long-term ML, sort of traditional ML models that we've been running and products that we've been running with customers that help them predict what their users are gonna do. Because we've got this massive behavioral data set, best behavioral data set in the world. So we can train these awesome models and help our customers predict what their users are gonna do. So they can share the more relevant content or now is the right time to ask people if they want to upgrade or they want to rate your app or that sort of thing. [00:12:05]

Swyx: Yeah, there is a little bit of a contrast, conflicts, because you already had all these ML models in-house and you're spinning up a new AI team and you're like, no, let's do all of this with GPT-3. Are the existing ML researchers saying like, no, this is a complete misuse of text generation? Or are they excited about it? Is it unlocking new things? [00:12:32]

Joe: Yeah, actually, it's the combining these things. So we're able to use the traditional ML to shorten the fields, to narrow the number of things we need to pass into the LLMs. Because the LLMs can do a lot more of the reasoning, but we can make sure that the context we're providing is much more specific and generally much better by using the traditional ML models. [00:12:53]

Swyx: Yeah, okay. And then the pain points that you're experiencing are hallucination. And then also like the multi-query thing. What do you think you wish for? Or what do you think you're thinking about to solve those pain points? [00:13:06]

Joe: So right now we're instrumenting with our own product. So we're instrumenting groups of inferences and individual inferences, which means we can then create charts that show how often they fail, why they fail, how often we need to retry to get good answers.

Swyx: So amplitude using amplitude. [00:13:23]

Joe: Exactly. To build amplitude. [00:13:24]

Swyx: Yeah, exactly. [00:13:25]

Joe: Well, I mean, we're a product company. What else would we do? [00:13:29]

Swyx: That is the second part of what you're saying, right? Which is, first of all, you want AI in the amplitude products. Second, people are shipping AI products with amplitude. You wanna talk a little bit more about what you're seeing there? [00:13:39]

Joe: Yeah. I guess the key thing here is, for a lot of people is, okay, I can build the thing that calls OpenAI's API and then gives a response back. I'm nervous that I'm gonna be giving incorrect answers. I'm nervous that I don't really know how to measure whether the answers are incorrect. And I'm nervous that I'm not gonna be able to improve over time. So a lot of people we actually hear are nervous of giving thumbs up, thumbs down buttons because they're implying to their users that they're gonna be using this to improve the results. But they actually have no idea how to use that to improve the results in a meaningful way. And particularly when you've got multiple queries going off for one request, you've gotta then fine tune lots of different things in parallel. So it gets to be quite a technically complex sort of problem if you're not using great tooling that already exists for it. So that's, and then you have the extra layer of, I'm getting a bad result. I've tweaked my prompt template that I'm sending off to OpenAI. And now, has the result got better or worse? [00:14:35]

Swyx: I don't know. [00:14:36]

Joe: I don't know how to measure that. Except by thumbs up, thumbs down, which is a difficult measure in the first place. So that's where we can start saying, measuring the behavior of users once we've generated something for them. So have they gone and shared this content? Have they used this content? They actually gotten any value out of it? Not just have they pressed thumbs up. We can actually measure, are they getting value? Are they throwing it away from their behavior? But then using that through the Amplitude product, we can then tie that through to A-B tests, which is another product that Amplitude has. So then suddenly we start, and we're not doing this yet. This is sort of next on our list, is to start putting these prompts into our A-B test variants. So then we make a tweak in the UI, and it goes off, fires on the original, the control and our variant, our new variant. See, does it get fewer or more errors? Does it get fewer or more thumbs up, thumbs down? [00:15:30]
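To make the idea Joe describes more concrete, here is a minimal sketch of how prompt-template variants could be assigned per user and instrumented per inference so an A/B test can compare error and feedback rates. The helper names (`trackEvent`, `callModel`), the event names, and the templates are hypothetical stand-ins, not Amplitude's actual APIs or prompts.

```typescript
// Sketch: assign a prompt-template variant per request and instrument the outcome.
type Variant = "control" | "new_prompt";

const PROMPT_TEMPLATES: Record<Variant, string> = {
  control: "Answer the user's analytics question:\n{{question}}",
  new_prompt:
    "You are an analytics assistant. Think step by step, then answer:\n{{question}}",
};

// Deterministic assignment so the same user always sees the same variant.
function assignVariant(userId: string): Variant {
  let hash = 0;
  for (const ch of userId) hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  return hash % 2 === 0 ? "control" : "new_prompt";
}

async function callModel(prompt: string): Promise<string> {
  // Stand-in for a real LLM call.
  return `chart spec for: ${prompt.slice(0, 40)}`;
}

function trackEvent(name: string, props: Record<string, unknown>): void {
  // Stand-in for an analytics SDK call.
  console.log("track", name, props);
}

export async function answerQuestion(userId: string, question: string) {
  const variant = assignVariant(userId);
  const prompt = PROMPT_TEMPLATES[variant].replace("{{question}}", question);

  trackEvent("llm_inference_started", { variant });
  try {
    const answer = await callModel(prompt);
    trackEvent("llm_inference_succeeded", { variant });
    return answer;
  } catch (err) {
    trackEvent("llm_inference_failed", { variant, error: String(err) });
    throw err;
  }
}
```

With both the variant and the downstream behavioral events (shares, edits, thumbs) tracked per inference, the comparison between the control prompt and the new prompt falls out of ordinary experiment analysis.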

Alessio: Have you thought about, I don't know, A-Z testing, I guess? One of the limitations has been, well, people can only write so much copy to test, but now with these generative models, you can actually generate a lot of copy, and you can on-demand test more and more and more copy. Have you seen any fun customer stories there, anything you can share? [00:15:50]

Jeffrey: Yeah, so actually there's a very good example of this. I don't know if I can share the actual customer, but actually from before the LLM days, where they literally generated the versions of the copy themselves, and they made their product basically adapt, you know, multi-arm bandit style of like, hey, here's all these different variations, like just go figure out the best one. At an internal hackathon, maybe two months ago, I built a prototype of what you're talking about, which is, okay, now replace the copy generation with an LLM. So just constantly generating new variations, and then multi-arm banditing to figure out which one's the best. I think that is probably the future of copywriting, where it's like, you don't actually need a whole lot of manual work anymore. It can, almost everything can happen automatically. And it's kind of the micro example in my head of this concept that we really like, which is self-improving products, where, you know, at some point, you know, someone has to say, hey, I'm gonna build a product that does this, you know, like a newsreader or something. But then, you know, after you have that, like the title of the newsreader, like the description of the sections, your navigation, all of that, in theory, you know, if you can give it some structure that the AI can play with, the LLM can manipulate all of that for you, and then use, you know, A-B testing, multi-arm bandits and all of that to kind of figure out what's best. And that generative AI kind of makes that last piece of like, what are my options possible? And that's super exciting for us. And we wanna be there, you know, to help you measure that, help you deploy that, and make that like the way people build products in the future. [00:17:14]
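A rough sketch of that "self-improving copy" idea, with a made-up `generateVariant` standing in for a real LLM call, might pair an epsilon-greedy bandit with continuously generated variants:

```typescript
// Sketch: an LLM keeps proposing new copy variants and an epsilon-greedy
// bandit routes traffic toward whichever variant converts best.
interface Arm {
  copy: string;
  shows: number;
  conversions: number;
}

const EPSILON = 0.1; // fraction of traffic spent exploring
const arms: Arm[] = [];

async function generateVariant(brief: string): Promise<string> {
  // Stand-in: a real implementation would prompt an LLM with the brief.
  return `${brief} (variant #${arms.length + 1})`;
}

export async function pickCopy(brief: string): Promise<Arm> {
  // Occasionally add a brand-new generated variant to the pool.
  if (arms.length === 0 || Math.random() < EPSILON) {
    arms.push({ copy: await generateVariant(brief), shows: 0, conversions: 0 });
  }
  // Explore with probability epsilon, otherwise exploit the best observed rate.
  const explore = Math.random() < EPSILON;
  const chosen = explore
    ? arms[Math.floor(Math.random() * arms.length)]
    : arms.reduce((best, a) =>
        a.conversions / (a.shows || 1) > best.conversions / (best.shows || 1)
          ? a
          : best
      );
  chosen.shows += 1;
  return chosen;
}

export function recordConversion(arm: Arm): void {
  arm.conversions += 1;
}
```

A production system would likely use something like Thompson sampling and retire stale arms, but the shape of the loop, generate, serve, measure, reallocate, is the same.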

Alessio: I think I've talked about this on the podcast, but this idea of just-in-time UIs, you know, where each type of user wants to interact in a different way. And what you're building is a version of that, right? Amplitude has been really dashboard-driven, kind of diagram-driven, showing the user flow. Now each user can say, hey, I don't really want the table, I just want the charts. Or, I don't want the charts, I just want the data. What do you think about the future of dashboards and BI in general? Before, the analysts used to come up with what you should be seeing. Now each user can ask their own questions. [00:17:47]

Jeffrey: Yeah, like the future of analytics, I think, is, you know, can go a few different paths. One thing that I want to, you know, counter against the whole LLM trend a little bit is I think when you get into really important and specific questions, you know, let's say you're writing like some complicated SQL or even code, you know, code and SQL are good because they're very specific, right? You can define your semantics very precisely. And that's something that I think, you know, when people start thinking about like natural language questions, they kind of take for granted. They're like, oh yeah, why doesn't it just, you know, figure out the precise semantics from my very ambiguous words? It's like, well, it's actually, in some senses it's possible, right? Because the precise semantics are not captured by your ambiguous natural language words. And so the way we think about it, at least today, you know, who knows what's going to change in the future is like natural language is a great interface to like get started. If you don't know what the underlying data looks like, if you don't know like what questions you should be asking, it is a very, very expressive way to start, get started. It's much easier than manipulating a bunch of things, much, much easier than writing SQL and all of that. But like once you kind of know what you want, it's very hard to like make it precise. It's actually easier to make SQL or code precise than it is natural language. And so that's a little bit of what we're thinking right now. So we think, you know, for sure the way that maybe many people will interface with analytics and data will turn into natural language because maybe the precision doesn't matter to them. But like at the end of the day, when you're trying to get, you're trying to sum up your revenue or something, it's like, you want to know that it's right. And you want to know the semantics that go into that. And like, that's why, you know, that's part of why data is hard. The semantics really do matter. They can make a huge difference in the output. And so there's a boundary there that I'm curious where it will push over time, but I don't think it's quite there yet. [00:19:33]

Joe: I think this is where models can become more embedded as features, rather than go off and do this thing, create this analysis for me and then come back, the collaborator model. Then we're saying, this field, I'm not sure what should go in there. Can you make a suggestion? And then I'm going to go and refine it over time. So it's sort of autofill, guessing autofill, but then you can still tweak everything. This is one of the core design principles that we've come up with: yes, you've got to be able to explain what the model's doing. As a human, as a user, I need to understand what is the model doing and why is it doing it? But I also need to be able to tweak it once it's done it. I don't want to feel like I've just said go and then I can't stop it and it's going to go off and do stuff. And that's sometimes how things like AutoGPT can feel. It's going, and it's costing me OpenAI tokens, and I have no idea what's going on. So yeah, I think a key thing is surfacing all the individual things the model's doing and allowing users to tweak it, stop it, or retry while it's going. [00:20:33]

Swyx: For me, one of the most challenging questions is something I think you guys have maybe thought about a lot which is chat. Ideally you want, like you could say naively, for example, you want to optimize time in app, but actually that's a sign of failure if the chat session is longer than it should be. Do you have any advice on, I'm sure you've dealt with this before pre AI era, but like what do you advise AI hackers to optimize for? Like what analytics should people be looking at? [00:21:03]

Jeffrey: Yeah, our general kind of philosophy as a company is to work with customers to identify north star metrics. Right, and time in app is not good primarily because it doesn't actually correlate with your business outcomes most of the time. And to be fair, sometimes it does. Like if you're a social media app, maybe it does correlate really well and maybe it's not a bad metric then. But for a lot of other products, right, if you're doing search, for example, time on search, nobody wants that. It's like, yeah, what is your success rate? Do you get them to come back and search in the future? That's much more interesting than the length of your session. Because, you know, each time you can serve ads, right, that's your business. And so if you choose a metric that's well correlated with your business outcomes, then that's at least the first step to getting that right and not getting caught up in other vanity metrics that sound like they could be good to increase, but then, you know, they can sometimes lead to negative business outcomes, and then you get the worst outcome: you've optimized the wrong metric the whole time. And that's where tying in AI and product analytics makes a lot of sense. It's really important because these companies that are our customers are trying out building features with LLMs and they're not sure what to optimize for. Optimize for the same thing you're already optimizing for. You're already measuring conversions. You're measuring how much value, hopefully, your customers are getting out of your product. So continue doing that, and maybe find a way to tie the LLM feature to that, through A-B tests and that sort of thing. And then on chat specifically: for a business maybe rolling out a chat box based on LLMs, it can be really scary. And that's another sort of mental model or framing we've been thinking around: we find LLMs right now are most useful either when you have a narrow input space and a broad output space, because you know exactly what format of data, what kind of data is gonna be passed in. That's probably not coming directly from a user. It's probably coming from a button click or a toggle switch or something. And then you can have a general output and you can provide templates and that sort of thing. And then the other way is broad input space, narrow output space. So that's a free-form text box, and you can provide a bunch of clamping, framing, and validation on the output to make sure that you're not spewing out, you know, poems about Hitler or whatever it is. You can be really, really deliberate when you've got a small output space. Chat is large input space, large output space, which is really, really scary if, as a company, you're not selling a chat product, you're selling, you know, an analytics product with maybe a chat support bot or something. [00:23:37]
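As an illustration of the "broad input space, narrow output space" case, a minimal sketch might clamp whatever the model returns to a small validated set before it reaches the UI. The chart-type names and the `askModelForChartType` helper are assumptions for the example, not Amplitude's real schema.

```typescript
// Sketch: free-form user text goes in, but the model's answer is clamped
// to a small, validated set of values before anything reaches the UI.
const ALLOWED_CHART_TYPES = ["segmentation", "funnel", "retention"] as const;
type ChartType = (typeof ALLOWED_CHART_TYPES)[number];

async function askModelForChartType(question: string): Promise<string> {
  // Stand-in for an LLM call that is prompted to answer with a single word.
  return "funnel";
}

export async function chartTypeFor(question: string): Promise<ChartType> {
  const raw = (await askModelForChartType(question)).trim().toLowerCase();
  const match = ALLOWED_CHART_TYPES.find((t) => raw.includes(t));
  if (!match) {
    // Clamp: refuse or retry rather than passing arbitrary text downstream.
    throw new Error(`Model returned an unsupported chart type: "${raw}"`);
  }
  return match;
}
```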

Swyx: Yeah, I think this is one of those opportunities. I always try to raise awareness of this: Copilot, I think, did a really interesting metric or North Star, which was how much code is kept or retained by the user. And for people who are Googling along, you can actually look for this blog post about reverse engineering Copilot internals. They set up custom metrics around, you know, 30 seconds after a code snippet is accepted, one minute, two minutes, three minutes, all the way to five minutes. And you can sort of see it construct a curve of how long Copilot suggestions stick around. And from there, they can actually make statements like that to evaluate the success of the product. It's pretty cool. [00:24:18]
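A hedged sketch of that kind of retention metric, with a stand-in `trackEvent` and a naive `includes` check (the real Copilot telemetry is more sophisticated), could look like this:

```typescript
// Sketch: after a suggestion is accepted, check at fixed intervals whether
// the inserted text still survives in the document, and emit an event each time.
const CHECKPOINTS_MS = [30_000, 60_000, 120_000, 180_000, 300_000];

function trackEvent(name: string, props: Record<string, unknown>): void {
  console.log("track", name, props); // stand-in for an analytics SDK
}

export function measureRetention(
  suggestionId: string,
  snippet: string,
  getDocumentText: () => string
): void {
  for (const delay of CHECKPOINTS_MS) {
    setTimeout(() => {
      const kept = getDocumentText().includes(snippet);
      trackEvent("suggestion_retention_check", {
        suggestionId,
        secondsAfterAccept: delay / 1000,
        kept,
      });
    }, delay);
  }
}
```

In practice a fuzzy similarity check would be more robust than exact `includes`, since users usually edit accepted snippets rather than keep or delete them wholesale.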

Joe: One of the really nice things we found, actually, we accidentally did this. Our chart building interface is heavily instrumented. We're Amplitude, so we instrument our product, and it's also one of the main tools that our customers use, so it's really, really well instrumented. And so when we tied chart creation to asking a question through an LLM, and then tied that to an output chart, we automatically were able to tie every time someone edits any of the parameters back to that generation. So then we have really detailed RLHF-style data: you got everything right apart from the metric, or you got everything right apart from this event that shouldn't have been there, because that's the one that got removed. So similar to the Copilot thing there. [00:25:00]

Alessio: And I want to make sure we open it up for questions, but like one last thing is about, everybody knows that small is beautiful. And when you think about what models to use and some of the parameters, like there's costs, there's latency, there's like accuracy. How do you think about using, you know, GPT-4 and some of those models versus using smaller ones that are fine-tuned? What are the trade-offs? [00:25:23]

Joe: Yeah, I guess right now we're very much in the, let's explore, let's try everything and just iterate as fast as possible mode, which is what general models are great for. We do have some smaller, not even fine-tuned, some smaller models that we've sort of borrowed from Hugging Face that we run internally for more specific tasks. And that's often selecting specific values before we pass it to a general model right now, just because the general models are much easier to communicate with and they understand most of the words we use. It's not like we use a word and suddenly we get random outputs for no reason, the sort of SolidGoldMagikarp glitch-token type thing. So they're generally less susceptible to that. So that's why we're iterating heavily on the general models. I think we absolutely have to move to some more specific models, particularly given inference on fine-tuned OpenAI models gets more expensive and slower the more you do it. So yeah, that's definitely a thing we're looking at and we're doing some internal stuff, but it's the next step or one of the next steps. [00:26:22]

Jeffrey: Yeah, to give a pseudo example of that, one of the hard things to help users with in Amplitude is picking the right event to analyze. It's kind of your fundamental unit of analysis. And when a user comes in, and let's say it's the first time they're using Amplitude, someone else in their company has set up the product, so they don't know what the events are. Right now in Amplitude you get this massive dropdown and it's like, all right, there's a thousand things, which one is the one I'm looking for? And sometimes the names are good and sometimes they're not. But one thing we did was, okay, feed that into OpenAI: hey, tell me which event type best matches this user's intent. It's pretty good at that, right? It's all language stuff. But it's a little bit slow and it's a little bit expensive to do that every time. And so, once we validated that that works, we kind of fell back to a more traditional embedding-based approach. It's like, all right, compute all those embeddings. That's more work upfront, because you have to go through your database of all of these things and you've got to commit that engineering work, but you validate with the general model because it's just easy. It takes like an hour to figure out that it works. And then it's like, all right, can we do the same thing with embeddings? That's way faster, way cheaper, and still has reasonable quality. Embeddings also have a nice quality that you can get the magnitude of things, whereas LLMs aren't great at giving you, hey, it matches this much. You can ask it for an ordering and that's decent, but anything beyond that is pretty challenging. [00:27:42]
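A minimal sketch of that embedding fallback, with a toy `embed` function standing in for a real embedding API, is just cosine similarity between the user's intent and each event name:

```typescript
// Sketch: precompute embeddings for every event name in the taxonomy, then
// match a user's intent by cosine similarity instead of an LLM call per request.
async function embed(text: string): Promise<number[]> {
  // Stand-in for a real embedding API; returns a fixed-length vector.
  return Array.from(
    { length: 8 },
    (_, i) => (text.charCodeAt(i % text.length) % 17) / 17
  );
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

export async function bestMatchingEvent(
  intent: string,
  eventNames: string[]
): Promise<{ event: string; score: number }> {
  // In practice the event embeddings would be computed once and stored.
  const intentVec = await embed(intent);
  let best = { event: eventNames[0], score: -Infinity };
  for (const name of eventNames) {
    const score = cosine(intentVec, await embed(name));
    if (score > best.score) best = { event: name, score };
  }
  return best;
}
```

The similarity score also gives the "magnitude" Jeffrey mentions: you can threshold it or rank candidates, which an LLM's free-text answer does not give you directly.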

Alessio: How do you think about the importance of the model versus the data, right? There's like a lot of companies that have a lot of data, but not a lot of AI expertise or companies that are just using off the shelf model. How should companies think about how much data to collect? What data is meaningful? What isn't, any thoughts there? [00:27:59]

Jeffrey: Yeah, I think it's safe to say that both are really important, right? Like the evolution of LLMs really was a lot of model innovation, and so I don't want to downplay that. At the same time, I think the future of AI applications and doing really cool things with it will be in the data, partially because, you know, ChatGPT has been such a huge advance, right? The LLM model space has advanced like crazy in the last year. And so I think a lot of the untapped potential will be in data in the future. One thing that's particularly interesting to us is we have a pretty unique data set, actually. It's a lot of first party behavior data, right? So if you're Square, for example, you instrumented the way that people interact with Square Cash and the wallet and, you know, the checkout system. And those are very specific things. Square can't look elsewhere in the world for that stuff. And that's really interesting because, you know, to build models of user behavior, you need user behavior data. And it turns out there's not actually a lot of user behavior data out there in the world. And so, to Joe's point earlier, we have one of the best user behavior data sets in the world. And so if we want to build a model around that, I think it would be a super interesting one. So if you take an analogy to what ChatGPT does, it basically takes a bunch of language examples and it, you know, learns a bunch of abstract concepts, like how to prove math things or how to render in JavaScript. It's like, wow, that's very astonishing. It's almost like a proof of concept to the world that if you train a sufficiently good, you know, transformer self-attention type model with a sufficiently large data set of hundreds of gigabytes of internet text, you'll learn really interesting abstract concepts. And so we want to apply that to our data set, right? ChatGPT is great because it's a proof of concept. If it didn't exist, you know, I would have told you, yeah, you can spend $10 million training this model on a data set, you'd probably not get anything interesting, because we just have no idea. But because it exists, it kind of proves to the world that if you do this correctly, there is a ton of interesting value. And so that's what I think. And, you know, Amplitude is just one example of a very interesting data set where you could train something that's fundamentally very different from GPT or any LLM out there. And there's lots of other data sets out there. And I think that's where a lot of the interesting things will come once this phase of rapid model evolution tapers out a little bit. And you'll see a lot of the more interesting applications there. [00:30:24]

Swyx: So I've never thought about this much, but you guys must do it a lot. Like what is the ethics or best practices around training on user data when they don't know they're being watched? Like, I mean, presumably they're fine with tracking and events, but like, do we tell them that we're going to train on their data? Is it okay? [00:30:50]

Joe: I guess there are a couple of things. One is PII: it doesn't go anywhere near this stuff, right? PII gets stripped, and that's just a really important thing. [00:30:58]

Swyx: You still need an identifier for streams. [00:31:02]

Joe: Yeah, yeah. But in terms of training models, we don't want any of that to go in there because then you might accidentally, you know, like, hello, ChatGPT, please hallucinate me a social security number. That's dangerous. [00:31:11]

Swyx: Also PII makes it into prompts a lot. [00:31:14]

Joe: Sure, that's true. So then you have to strip that from your prompts too. So we have some experiments where we're stripping PII that is in places it shouldn't be, you know, descriptions of things. Sometimes people copy-paste big long lists of email addresses into charts and things. But some of these things are actually surprisingly easy to detect and strip out. So we can do that, and we have some layers that strip that out and replace it with tokens, so the LLMs can still operate on them. But in terms of training on this data, all that training is happening internally and we're not putting any sort of private data, personally identifiable information, in. I don't know if there's anything you wanted to add there. Yeah, yeah. [00:31:54]
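One way such a redaction layer might look, shown here only for email addresses and with made-up token names, is to swap matches for placeholders and keep a map to restore them after the model call:

```typescript
// Sketch: detect obvious PII (emails, as one example), replace it with stable
// placeholder tokens before the text goes to an LLM, and keep a map so the
// tokens can be restored in the model's output later.
const EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

export function redactPII(text: string): {
  redacted: string;
  restoreMap: Map<string, string>;
} {
  const restoreMap = new Map<string, string>();
  let counter = 0;
  const redacted = text.replace(EMAIL_RE, (match) => {
    const token = `<EMAIL_${++counter}>`;
    restoreMap.set(token, match);
    return token;
  });
  return { redacted, restoreMap };
}

export function restorePII(text: string, restoreMap: Map<string, string>): string {
  let out = text;
  for (const [token, original] of restoreMap) out = out.split(token).join(original);
  return out;
}

// Example: redactPII("contact a@b.com and c@d.org").redacted
// -> "contact <EMAIL_1> and <EMAIL_2>", restorable after the LLM call.
```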

Jeffrey: We certainly think about this a lot, and our customers think about it a lot. When I think about user privacy with respect to tracking, there's kind of this big spectrum. On the one end, it's like literally track nothing, end of story. And for people like that, I mean, that's cool. They're not gonna use Amplitude. They may not like us very much. That is what it is. And then on the other end of the spectrum is, we're gonna track you across the entire internet and sell your data to everyone. And that's obviously bad, and there's lots of good reasons to think that's bad. First party behavioral data, I think, is actually probably almost as far toward that first end as you can get. Fully anonymized first party behavior data would be kind of the minimum: it's like web server logs with no IP, no identifier, nothing. The problem is that you can't do a lot of interesting behavioral analysis without that. You can't tell if this person that came on this day was the same one that purchased later. And so it's much harder to make your product better if you don't have that. And so we're kind of set at this place where we have, you know, pseudo-anonymized first party data. And we don't sell the data. You don't mix data from different places on the internet through Facebook cookies or things like that. And our philosophy is, that is actually the most important data to build a better product. It's not the most important data to advertise, which is why Facebook and Google do what they do, but it's the most important data to build a better product. And it kind of strikes the right balance between totally tracking everything that you're doing and not having any information to make your product better. [00:33:19]

Swyx: Yeah, cool. And I think we're going to go to audience questions. So let's start warming them up soon. But I think we have some lightning round questions [00:33:29]

Joe: The audience is thinking of questions while we go. [00:33:31]

Alessio: The first one is, what's something that already happened in AI that you thought would take much longer to be here? [00:33:39]

Jeffrey: I don't know what the constraints are on our lightning round, but I think maybe creativity is the best word. With the image generation stuff, text generation, one thing that still blows my mind: I used to be a competitive math guy, and there's this International Math Olympiad problem in one of the papers, and it solves it. And I'm just like, wow, I could solve this back when I was spending all my life doing this thing. That level of creativity really blew my mind. And what's the takeaway? Maybe the takeaway is that creativity is not as high-entropy or high-dimensional as we think it is, which is kind of an interesting takeaway. But yeah, that one definitely surprised me. [00:34:21]

Joe: I guess there's something, actually, maybe answering the inverse question, that a lot of my friends were surprised happened quickly, and I was like, this is just braindead obvious. I've got a lot of friends in the AI safety space. They're worried about X-risk in particular, right, extinction risk, that AI is going to kill the human race. And they were like, oh no, what if an AI escapes containment and gets access to the internet? And then we get an LLM and the first thing we do is like, hey, AutoGPT, here's the internet. [00:34:48]

Swyx: So you thought, it's happening faster than you thought. [00:34:53]

Joe: Well, it's happening faster than, to me it makes sense, because I'm like one of the guys connecting it to the internet. And I'm like, I'm surprised that other people were surprised it was going to be so fast. [00:35:01]

Swyx: Yeah, so a bit of context, Joe and I, we've been adjacent to the EA community and they have like smoothly migrated to the X-risk community very quickly after SBF. [00:35:13]

Joe: Yeah, after SBF, yeah, that was fun. [00:35:16]

Swyx: Okay, so next question, exploration. What do you think is the most interesting unsolved question in AI? What's next? [00:35:30]

Joe: I guess like, is it going to keep getting better at the same rate? Is it going to, and that's just a super important question that's going to change. Like, depending on that answer, 50 startups are going to pivot or not pivot, right? [00:35:43]

Swyx: Which is what's next, literally. [00:35:45]

Joe: Literally, what's next? Like in a year's time, are the models similarly better than they have been so far? Or are we about to taper off or are we about to continue going linearly? [00:35:58]

Jeffrey: Yeah, I'll throw one out that is not necessarily about AI, but, what's intelligence, right? If you ask people 20, 30 years ago, maybe even longer now, it's like, yeah, chess. Chess is intelligence. And then chess got solved and, ah, that's just brute force. And it's like, well, you know, creating creative images and writing, that's intelligence. Well, that's solved too. Maybe it's just, you know, if you have enough parameters, you can capture that. So what is intelligence? What does it mean to have an AGI? What does that actually mean? And then what implications does that have for our understanding of humans and our brains? I've always thought that, you know, everyone is just a stochastic machine. And so, you know, it's all consistent in my mind.

Swyx: Free will is an illusion. Exactly. [00:36:43]

Joe: I guess maybe the scaling piece, like, intelligence as you scale gets more and more expensive on the traditional stuff. But then there's something I think I saw yesterday on Hacker News. It was people actually getting a brain to play tic-tac-toe. By a brain, I mean stem cells grown into brain tissue. And they were able to train it. And that to me is very significant, because suddenly the limitations of, like, metal computers don't apply. And now we've got all this intelligence, all this "what is intelligence" stuff, on a squishy wet computer. That makes it even harder to ask and even harder to draw lines. [00:37:18]

Swyx: Yeah. Yeah. So famously, you know, language models are so much more inefficient than wet computers, as you say. And so if you can merge that... The human brain runs on 30 watts of power, which is my favorite fact. We're not anywhere close to that yet. [00:37:36]

Alessio: Before we get into Q&A, one last takeaway that you want everybody to think about. [00:37:41]

Jeffrey: Yeah, I'll do the one that we actually repeat in Inside Amplitude very often, not about AI, but I think it applies, which is it's early. It's sometimes hard to realize that when things are happening so fast, especially in the Bay Area, but like the ramifications of AI or in our case, product data and all that are gonna play out over the next many decades. And that's just, you know, we're very fortunate to be at the beginning of it. And so yeah, take advantage of it and keep reminding yourself that it's early. [00:38:15]

Joe: I guess mine would be: let humans be good at doing human things, let machines be good at doing machine things, and have machines help humans be good at doing human things. If you don't do that, then you're gonna be building something that's either not useful or very scary. So yeah, get machines helping humans, not the other way around. [00:38:39]

Swyx: Get machines helping humans. All right. With that, I think we're all gonna open up to questions. We're gonna toss you the mic. [00:38:45]

Audience #1: Yeah, hey, thanks for the insight into how you guys implemented your AI question-asking chatbot, and how you've converted it into seven sub-queries and then generate the data out. It piqued my interest about how exactly you guys do it. Like Alessio asked, what exactly is the model that you guys are using? And what are these queries that you generate from a single English-language question? Is it possible to go a little deeper, just from a curiosity perspective? [00:46:34]

Joe: So we have a custom query engine. So it's not SQL or anything that we're generating; we're generating a custom query output. And the types of questions range. Things like chart type: are we doing a segmentation chart, a line chart, where, you know, the number goes down or up over time, or are we doing a funnel chart, a conversion between two events? And there are various other types of metrics. And then there's also the name: what should we name this chart that answers this question? In terms of the way that's implemented in practice, you could use something like LangChain to sort of chain these things together. But in our experience, I think LangChain's a great tool for certain things and definitely really great for prototyping, but we found it quite restrictive. So we've ended up building an internal, very, very small wrapper, we use TypeScript as well, a framework that allows us to basically just write in code and infer within what we call a transaction, an inference transaction, which gets monitored as one, but then also all the individual inferences within it get monitored. So it's a bit like when you're writing a database transaction, at least in the Node ecosystem, the JavaScript ecosystem, where you get a transaction object that you can operate on, and then you commit your transaction. So we've got an interface like that, so we can just write pure TypeScript, await this response or await these responses. And then we've got a switch case: if it's a segmentation chart, go and run these queries. And then each of those inferences can be a different model. So we think in the future, maybe we have one query where we have some GPT-4 responses, we want some text responses, maybe we also want to generate an image from that same query together, and then that gets bundled. So I don't know if that answers your question.
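This is not Amplitude's actual framework, but a toy sketch of the "inference transaction" pattern Joe describes, grouping individually instrumented inferences under one monitored transaction, might look like this in TypeScript:

```typescript
// Sketch: a tiny wrapper that monitors a group of inferences as one transaction
// while also instrumenting each individual inference inside it.
function trackEvent(name: string, props: Record<string, unknown>): void {
  console.log("track", name, props); // stand-in for an analytics SDK
}

interface InferenceTxn {
  infer<T>(label: string, fn: () => Promise<T>): Promise<T>;
}

export async function withInferenceTransaction<T>(
  name: string,
  body: (txn: InferenceTxn) => Promise<T>
): Promise<T> {
  const txnId = Math.random().toString(36).slice(2);
  trackEvent("inference_txn_started", { name, txnId });
  const txn: InferenceTxn = {
    async infer(label, fn) {
      const start = Date.now();
      try {
        const result = await fn();
        trackEvent("inference_succeeded", { txnId, label, ms: Date.now() - start });
        return result;
      } catch (err) {
        trackEvent("inference_failed", { txnId, label, error: String(err) });
        throw err;
      }
    },
  };
  try {
    const result = await body(txn);
    trackEvent("inference_txn_committed", { name, txnId });
    return result;
  } catch (err) {
    trackEvent("inference_txn_failed", { name, txnId });
    throw err;
  }
}

// Usage: each sub-inference (chart type, chart name, ...) can be a different model.
// await withInferenceTransaction("question_to_chart", async (txn) => {
//   const chartType = await txn.infer("chart_type", () => Promise.resolve("funnel"));
//   const title = await txn.infer("chart_name", () => Promise.resolve("Signup funnel"));
//   return { chartType, title };
// });
```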

Audience #1: Yeah, I think so. Yeah, thank you. You said in the future you're going to use GPT-4. What are you using right now? [00:48:33]

Joe: Right now, everything's GPT-3.5. We're moving around, and I think probably for some of the prompts, we'll use something like DaVinci. Some we might use GPT-4. Some we'll be using internal ones. And we also want to be able to degrade gracefully if a customer has told us they don't want us to send anything to OpenAI, then we can degrade to some internal models that maybe are some of the open source models that have been trained on smaller datasets. [00:48:57]

Audience #1: Gotcha, makes sense. Thank you. [00:48:58]

Jeffrey: Yeah, I think to add to that a little bit, the key is breaking down the problem sufficiently, because if you break down the problem enough, you can also provide it with some examples, which is super helpful, right? You know, GPT is quite good at zero shot, but within the context of our specific domain, it doesn't know what's going on. And so being able to break down the problem to, hey, select the type of chart. Don't generate me an entire chart definition. Select me the type of chart, and then select me the specific metric based on their query, and then giving it some examples. Select me the events and properties that I want to look at. By breaking it down and having very, very contextual prompts with respect to those examples, you get a lot higher quality output than trying to generate, like, you know, if you imagine generate, like, hey, generate me a whole SQL query with all, you know, here's like the schema of all my tables, now generate it entirely. It's like, it actually struggles with stuff like that, because it's just like kind of too much information and computation to come out of language. Now, maybe GPT-5 will be different, but like, that's the state of the art today. [00:49:57]
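A small sketch of that decomposition, showing just the "select the event" step with made-up events and few-shot examples, illustrates how each sub-problem gets its own narrow, example-laden prompt:

```typescript
// Sketch: instead of asking for a whole chart definition at once, each step
// gets a small prompt with a few domain-specific examples.
interface FewShotExample {
  question: string;
  answer: string;
}

function buildEventSelectionPrompt(
  question: string,
  eventNames: string[],
  examples: FewShotExample[]
): string {
  const shots = examples
    .map((e) => `Question: ${e.question}\nEvent: ${e.answer}`)
    .join("\n\n");
  return [
    "Pick the single event name from the list that best matches the question.",
    `Events: ${eventNames.join(", ")}`,
    shots,
    `Question: ${question}\nEvent:`,
  ].join("\n\n");
}

// Example usage with hypothetical data:
const prompt = buildEventSelectionPrompt(
  "How many people finished checkout last week?",
  ["Add To Cart", "Checkout Completed", "Page Viewed"],
  [{ question: "Who viewed the pricing page?", answer: "Page Viewed" }]
);
console.log(prompt);
```

Because each prompt only has to choose from a short, contextual list, its output can also be validated against that list, which is much harder when the model is asked to emit an entire query in one shot.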

Swyx: I'll ask a follow-up to Joe. So you mentioned trying LangChain, but not needing it for production. Any other comments on tooling out there that's interesting to you? Do you use an embedding database, for example, or do you just use a regular database? [00:50:18]

Joe: Yeah, so we've actually been running embedding sort of similarity or vector search in production for multiple months, maybe even almost a year, and just like straight up Postgres, but now we're using PG Vector, which actually Jeffrey could probably speak more to about that decision and what that was like. [00:50:40]

Swyx: So this is a pretty hot take. At Amplitude scale, all you need is Postgres? [00:50:46]

Joe: We do use many things other than Postgres. But I mean, this isn't rolled out for all customers and it's not necessarily getting hit with a lot of traffic. So the scale here is very different. Our usage scale is very different to our ingestion scale. [00:51:04]

Swyx: Yeah, yeah, yeah. [00:51:06]

Jeffrey: Just to clarify that a little bit more, we're not putting in individual end-user vectors or event vectors. We're putting in taxonomies. So if I'm DoorDash, my taxonomy is add to cart, checkout, purchase, browse. That's the cardinality, and so that's actually small. It's on the order of tens of millions. And so yeah, you can do that in Postgres, no problem. Now, when we talk about large behavioral models, or actually embedding events, there are many, many trillions of those. And yeah, Postgres probably doesn't work there. [00:51:41]
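A sketch of that setup, assuming a hypothetical `event_taxonomy` table rather than Amplitude's real schema, could use pgvector's cosine-distance operator directly from Postgres:

```typescript
// Sketch: taxonomy similarity search with Postgres + pgvector. Event names
// and their embeddings live in one table; a query embedding is matched with
// the cosine-distance operator `<=>`.
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from PG* env vars

// One-time setup, assuming the pgvector extension is installed:
//   CREATE EXTENSION IF NOT EXISTS vector;
//   CREATE TABLE event_taxonomy (
//     event_name text PRIMARY KEY,
//     embedding  vector(1536)
//   );

export async function nearestEvents(queryEmbedding: number[], limit = 5) {
  const vectorLiteral = `[${queryEmbedding.join(",")}]`;
  const { rows } = await pool.query(
    `SELECT event_name, embedding <=> $1::vector AS distance
     FROM event_taxonomy
     ORDER BY embedding <=> $1::vector
     LIMIT $2`,
    [vectorLiteral, limit]
  );
  return rows as { event_name: string; distance: number }[];
}
```

At tens of millions of rows this stays comfortably inside a single Postgres instance, which is Jeffrey's point: the taxonomy is small even when the raw event stream is not.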

Swyx: Yeah, actually I wanted to comment on this slightly before, which is separating taxonomies from the actual data is one way you protect your customers against prompt injection. It's something that Simon Willison has been talking about where you want to have like query for one thing, but essentially no knowledge of the actual underlying data, just the taxonomy. So it's good practice. [00:52:00]

Audience #2: Yeah, so you talked about a model which would be trained on user behavior data like amplitude GPT. It really piqued my interest and what capabilities would emerge? What do you think that you would find and what would be the first thing you would ask the model? That's a good question. [00:52:23]

Jeffrey: So we've thought about this a little bit, and, right, these are sequence models, token prediction models. So at the very least, I would hope for a much better... we have a predictions feature right now, which says, hey, given what a user has done over the last 90 days, do we think they're gonna belong to this cohort in the future or not? So that cohort might be people who churn, people who purchase, people who upsell, whatever the customer wants. We think it would be much better at tasks like that, right, because if it just has a very good understanding of behavioral patterns and what's gonna come next, it would be able to do that. That's exciting, but not that exciting. If I'm trying to think about the analogies to what we see in LLMs, it's like, okay, what is the behavioral equivalent of learning physics concepts, right? I don't actually know, but it might be this understanding of patterns of sessions. For example, categorizing users in an unsupervised way seems like a very simple output for a model that understands user behavior, right? Here's all the users, and if you wanna discriminate them by their ability to achieve some outcome in the future, here's the best way to separate that group and here's why. Being able to explain at that level would be super powerful for customers. A lot of times what our customers do is, hey, these people came back the next day and these people didn't, why? What was different about them? And so we have a bunch of heuristics to do that, but in the end, causal impact is one of the holy grails of product analytics. What was the causation behind some observed difference in behavior? And I think, yeah, a large behavioral model will be much better at assessing that and be able to give you potentially interpretable ways of answering that question that are really hard to do today, really computationally intensive, really noisy; distilling causation from correlation is obviously super hard. Those are some of the examples. The other one that, I don't know if I'm optimistic about it, but I think is really interesting, is, one of the things that Amplitude requires today is manual instrumentation, right? You have to decide, hey, this clicking of a button, this viewing of a page, these are important things. I'm naming them in this way. There's a lot of popular tools out there that kind of just record user sessions or track DOM events automatically. There's a lot of problems with those tools because the data is incredibly noisy. It's just so noisy, right? A lot of times you just can't actually interpret it. And so it's like, oh, it's great because I don't need to do any work. But, well, you also don't get anything out of it. It's possible that a behavioral model would be able to actually understand what's going on there by understanding your user behavior in a correctly modeled and correctly labeled sense, and then figuring it out. I don't know if that's possible. I think that would make everyone's lives a lot easier if you could somehow ask behavioral questions of data without having to instrument. All of our customers would love that, but also all of them are instrumenting because they know that's definitely not possible today. [00:55:26]

Audience #2: This is really interesting. You're looking forward to the future. If you're gonna build it, it's gonna be amazing, yeah. [00:55:31]

Jeffrey: That's the goal, that's the goal. [00:55:33]

Audience #2: Awesome. [00:55:34]

Swyx: Thanks for listening. [00:56:09]



Get full access to Latent Space at www.latent.space/subscribe
2023-06-08
Link to episode

Building the AI × UX Scenius - with Linus Lee of Notion AI

Read: https://www.latent.space/p/ai-interfaces-and-notion

Show Notes

* Linus on Twitter

* Linus' personal blog

* Notion

* Notion AI

* Notion Projects

* AI UX Meetup Recap

Timestamps

* [00:03:30] Starting the AI / UX community

* [00:10:01] Most knowledge work is not text generation

* [00:16:21] Finding the right constraints and interface for AI

* [00:19:06] Linus' journey to working at Notion

* [00:23:29] The importance of notations and interfaces

* [00:26:07] Setting interface defaults and standards

* [00:32:36] The challenges of designing AI agents

* [00:39:43] Notion deep dive: "Blocks", AI, and more

* [00:51:00] Prompt engineering at Notion

* [01:02:00] Lightning Round

Transcript

Alessio: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO in residence at Decibel Partners. I'm joined by my co-host Swyx, writer and editor of Latent Space. [00:00:20]

Swyx: And today we're not in our regular studio. We're actually at the Notion New York headquarters. Thanks to Linus. Welcome. [00:00:28]

Linus: Thank you. Thanks for having me. [00:00:29]

Swyx: Thanks for having us in your beautiful office. It is actually very startling how gorgeous the Notion offices are. And it's basically the same aesthetic. [00:00:38]

Linus: It's a very consistent aesthetic. It's the same aesthetic in San Francisco and the other offices. It's been for many, many years. [00:00:46]

Swyx: You take a lot of craft in everything that you guys do. Yeah. [00:00:50]

Linus: I think we can, I'm sure, talk about this more later, but there is a consistent kind of focus on taste that I think flows down from Ivan and the founders into the product. [00:00:59]

Swyx: So I'll introduce you a little bit, but you're a very hard person to introduce because you do a lot of things. You got your BA in computer science at Berkeley. Even while you were at Berkeley, you were involved in a bunch of interesting things at Replit, CatalystX, Hack Club and Dorm Room Fund. I always love seeing people come out of Dorm Room Fund because they tend to be very entrepreneurial. You were a product engineer at IdeaFlow, in residence at Betaworks. You took a year off to do independent research and then you finally found your home at Notion. What's one thing that people should know about you that's not on your typical LinkedIn profile? [00:01:39]

Linus: Putting me on the spot. I think, I mean, just because I have so much work kind of out there, I feel like professionally, at least, anything that you would want to know about me, you can probably dig up, but I'm a big city person, but I don't come from the city. I went to school, I grew up in Indiana, in the middle of nowhere, near Purdue University, a little suburb. I only came out to the Bay for school and then I moved to New York afterwards, which is where I'm currently. I'm in Notion, New York. But I still carry within me a kind of love and affection for small town, Indiana, small town, flyover country. [00:02:10]

Swyx: We do have a bit of indulgence in this. I'm from a small country and I think Alessio, you also kind of identified with this a little bit. Is there anything that people should know about Purdue, apart from the chickens? [00:02:24]

Linus: Purdue has one of the largest international student populations in the country, which I don't know. I don't know exactly why, but because it's a state school, the focus is a lot on STEM topics. Purdue is well known for engineering and so we tend to have a lot of folks from abroad, which is particularly rare for a university in, I don't know, that's kind of like predominantly white American and kind of Midwestern state. That makes Purdue and the surrounding sort of area kind of like a younger, more diverse international island within the, I guess, broader world that is Indiana. [00:02:58]

Swyx: Fair enough. We can always dive into flyover country or, you know, small-town insights later, but you and I, all three of us actually, recently connected at AIUX SF, which is the first AI UX meetup, and which essentially just came out of a Twitter conversation. You and I have been involved in what I think of as HCI Twitter for a little bit, and when I saw that you were in town, Geoffrey Litt was in town, Maggie Appleton in town, all on the same date, I was like, we have to have a meetup, and that's how this thing was born. Well, what did it look like from your end? [00:03:30]

Linus: From my end, it looked like you did all of the work and I... [00:03:33]

Swyx: Well, you got us the Notion. Yeah, yeah. [00:03:36]

Linus: It was also in the Notion office, it was in the San Francisco one and then thereafter there was a New York one that I decided I couldn't make. But yeah, from my end it was, and I'm sure you were too, but I was really surprised by both the mixture of people that we ended up getting and the number of people that we ended up getting. There was just a lot of attention on, obviously there was a lot of attention on the technology itself of GPT and language models and so on, but I was surprised by the interest specifically on trying to come up with interfaces that were outside of the box and the people that were interested in that topic. And so we ended up having a packed house and lots of interesting demos. I've heard multiple people comment on the event afterwards that they were positively surprised by the mixture of both the ML, AI-focused people at the event as well as the interface HCI-focused people. [00:04:24]

Swyx: Yeah. I kind of see you as one of the leading, I guess, AI UX people, so I hope that we are maybe starting a new discipline, maybe. [00:04:33]

Linus: Yeah, I mean, there is this kind of growing contingency of people interested in exploring the intersection of those things, so I'm excited for where that's going to go. [00:04:41]

Swyx: I don't know if it's worth going through favorite demos. It was a little while ago, so I don't know if... [00:04:48]

Alessio: There was, I forget who made it, but there was this new document writing tool where you could apply brushes to different paragraphs. [00:04:56]

Linus: Oh, this was Amelia's. Yeah, yeah, yeah. [00:04:58]

Alessio: You could set a tone, both in terms of writer inspiration and then a tone that you wanted, and then you could drag and drop different tones into paragraphs and have the model rewrite them. It was the first time that it's not just auto-complete, there's more to it. And it's not asked in a prompt, it's this funny drag-an-emoji over it. [00:05:20]

Linus: Right. [00:05:21]

Swyx: I actually thought that you had done some kind of demo where you could select text and then augment it in different moods, but maybe it wasn't you, maybe it was just someone else [00:05:28]

Linus: I had done something similar, with slightly different building blocks. I think Amelia's demo was, there was sort of a preset palette of brushes and you apply them to text. I had built something related last year, I prototyped a way to give people sliders for different semantic attributes of text. And so you could start with a sentence, and you had a slider for length and a slider for how philosophical the text is, and a slider for how positive or negative the sentiment in the text is, and you could adjust any of them and have the language model reproduce the text. Yeah, similar, but continuous control versus distinct brushes, I think, is an interesting distinction there. [00:06:03]
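A rough sketch of that slider idea (not Linus's actual prototype) just folds the slider values into a rewrite instruction; the `rewrite` function below is a stand-in for a real model call:

```typescript
// Sketch: continuous sliders for semantic attributes get folded into a
// rewrite instruction for a language model.
interface SemanticSliders {
  length: number;        // 0 = much shorter, 1 = much longer
  philosophical: number; // 0 = plain, 1 = very philosophical
  sentiment: number;     // 0 = negative, 1 = positive
}

function sliderPrompt(text: string, s: SemanticSliders): string {
  const pct = (x: number) => Math.round(x * 100);
  return [
    "Rewrite the following sentence, preserving its core meaning.",
    `Target length: ${pct(s.length)}/100 (0 = very short, 100 = very long).`,
    `Philosophical tone: ${pct(s.philosophical)}/100.`,
    `Sentiment: ${pct(s.sentiment)}/100 (0 = negative, 100 = positive).`,
    "",
    text,
  ].join("\n");
}

async function rewrite(text: string, sliders: SemanticSliders): Promise<string> {
  const prompt = sliderPrompt(text, sliders);
  // Stand-in: a real implementation would send `prompt` to a language model
  // and return its completion; here we just return the prompt for inspection.
  return prompt;
}
```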

Swyx: I should add it for listeners, if you missed the meetup, which most people will have not seen it, we actually did a separate post with timestamps of each video, so you can look at that. [00:06:13]

Alessio: Sorry, Linus, this is unrelated, but I think you've built over a hundred side projects or something like that. A hundred? [00:06:20]

Swyx: I think there's a lot of people... I know it's a hundred. [00:06:22]

Alessio: I think it's a lot of them. [00:06:23]

Swyx: A lot of them are kind of small. [00:06:25]

Alessio: Yeah, well, I mean, it still counts. I think there's a lot of people that are excited about the technology and want to hack on things. Do you have any tips on how to box, what you want to build, how do you decide what goes into it? Because all of these things, you could build so many more things on top of it. Where do you decide when you're done? [00:06:44]

Linus: So my projects actually tend to be... I think especially when people approach project building with a goal of learning, I think a common mistake is to be over-ambitious and sort of not scope things very tightly. And so a classic kind of failure mode is, you say, I'm really interested in learning how to use the GPT-4 API, and I'm also interested in vector databases, and I'm also interested in Next.js. And then you devise a project that's going to take many weeks, and you glue all these things together. And it could be a really cool idea, but then especially if you have a day job and other things that life throws your way, it's hard to actually get to a point where you can ship something. And so one of the things that I got really good at was, one, knowing exactly how quickly I could work, at least on the technologies that I knew well, and then only adding one new unknown thing to learn per project. So it may be that for this project, I'm going to learn how the embedding API works. Or for this project, I'm going to learn how to do vector stuff with PyTorch or something. And then I would scope things so that it fit in one chunk of time, like Friday night to Sunday night or something like that. And then I would scope the project so that I could ship something, as much work as I could fit into a two-day period, so that at the end of that weekend, I could ship something. And then afterwards, if I want to add something, I have time to do it and a chance to do that. But it's already shipped, so there's already momentum, and people are using it, or I'm using it, and so there's a reason to continue building. So only adding one new unknown per project, I think, is a good trick. [00:08:14]

Swyx: I first came across you, I think, because of Monocle, which is your personal search engine. And I got very excited about it, because I always wanted a personal search engine, until I found that it was in a language that I've never seen before. [00:08:25]

Linus: Yeah, there's a whole tower of little tools and technologies that I built for myself. One of the other tricks to being really productive when you're building side projects is just to use a consistent set of tools that you know really, really well. For me, that's Go, and my language, and a couple other libraries that I've written that I know all the way down to the bottom of the stack. And then I barely have to look anything up, because I've just debugged every possible issue that could come up. And so I could get from start to finish without getting stuck in a weird bug that I've never seen before. But yeah, it's a weird stack. [00:08:58]

Swyx: It also means that you probably are not aiming for, let's say, open source glory, or whatever. Because you're not publishing in the JavaScript ecosystem. Right, right. [00:09:06]

Linus: I mean, I've written some libraries before, but a lot of my projects tend to be like, the way that I approach it is less about building something that other people are going to use en masse. And make yourself happy. Yeah, more about like, here's the thing that I built, if you want to, and often I learn something in the process of building that thing. So like with Monocle, I wrote a custom sort of full text search index. And I thought a lot of the parts of what I built was interesting. And so I just wanted other people to be able to look at it and see how it works and understand it. But the goal isn't necessarily for you to be able to replicate it and run it on your own. [00:09:36]

Swyx: Well, we can kind of dive into your other AIUX thoughts. As you've been diving in, you tend to share a lot on Twitter. And I just kind of took out some of your greatest hits. This is relevant to the demo that you picked out, Alessio. And what we're talking about, which is, most knowledge work is not a text generation task. That's funny, because a lot of what Notion AI is, is text generation right now. Maybe you want to elaborate a little bit. Yeah. [00:10:01]

Linus: I think the first time you look at something like GPT, the shape of the thing you see is like, oh, it's a thing that takes some input text and generates some output text. And so the easiest thing to build on top of that is a content generation tool. But I think there's a couple of other categories of things that you could build that are sort of progressively more useful and more interesting. And so besides content generation, which requires the minimum amount of wrapping around ChatGPT, the second tier up from that is things around knowledge, I think. So if you have, I mean, this is the hot thing with all these vector database things going around. But if you have a lot of existing context around some knowledge about your company or about a field or all of the internet, you can use a language model as a way to search and understand things in it and combine and synthesize them. And that synthesis, I think, is useful. And at that point, I think the value that that unlocks is much greater than the value of content generation. Because most knowledge work, the artifact that you produce isn't actually about writing more words. Most knowledge work, the goal is to understand something, synthesize new things, or propose actions or other kinds of knowledge-to-knowledge tasks. And then the third category, I think, is automation. Which I think is sort of the thing that people are looking at most actively today, at least from my vantage point in the ecosystem. Things like the ReAct prompting technique, and just in general, letting models propose actions or write code to accomplish tasks. That's also moving far beyond generating text to doing something more interesting. So much of the value of what humans sit down and do at work isn't actually in the words that they write. It's all the thinking that goes on before you write those words. So how can you get language models to contribute to those parts of work? [00:11:43]

Alessio: I think when you first tweeted about this, I don't know if you already accepted the job, but you tweeted about this, and then the next one was like, this is a NotionAI subtweet. [00:11:53]

Swyx: So I didn't realize that. [00:11:56]

Alessio: The best thing that I see is when people complain, and then they're like, okay, I'm going to go and help make the thing better. So what are some of the things that you've been thinking about? I know you talked a lot about some of the flexibility versus intuitiveness of the product. The language is really flexible, because you can say anything. And it's funny, the models never ignore you. They always respond with something. So no matter what you write, something is going to come back. Sometimes you don't know how big the space of action is, how many things you can do. So as a product builder, how do you think about the trade-offs that you're willing to take for your users? Where, okay, I'm not going to let you be as flexible, but I'm going to create these guardrails for you. What's the process to think about the guardrails, and how you want to funnel users to the right action? [00:12:46]

Linus: Yeah, I think what this trade-off you mentioned around flexibility versus intuitiveness, I think, gets at one of the core design challenges for building products on top of language models. A lot of good interface design comes from tastefully adding the right constraints in place to guide the user towards actions that you want to take. As you add more guardrails, the obvious actions become more obvious. And one common way to make an interface more intuitive is to narrow the space of choices that the users have to make, and the number of choices that they have to make. And that intuitiveness, that source of intuitiveness from adding constraints, is kind of directly at odds with the reason that language models are so powerful and interesting, which is that they're so flexible and so general, and you can ask them to do literally anything, and they will always give you something. But most of the time, the answer isn't that high quality. And so there's kind of a distribution of, like, there are clumps of things in the action space of what a language model can do that the model's good at, and there's parts of the space where it's bad at. And so one sort of high-level framework that I have for thinking about designing with language models is, there are actions that the language model's good at, and actions that it's bad at. How do you add the right constraints carefully to guide the user and the system towards the things that the language model's good at? And then at the same time, how do you use those constraints to set the user expectations for what it's going to be good at and bad at? One way to do this is just literally to add those constraints and to set expectations. So a common example I use all the time is, if you have some AI system to answer questions from a knowledge base, there are a couple of different ways to surface that in a kind of a hypothetical product. One is, you could have a thing that looks like a chat window in a messaging app, and then you could tell the user, hey, this is for looking things up from a database. You can ask a question, then it'll look things up and give you an answer. But if something looks like a chat, and this is a lesson that's been learned over and over for anyone building chat interfaces since, like, 2014, 15, if you have anything that looks like a chat interface or a messaging app, people are going to put some, like, weird stuff in there that just don't look like the thing that you want the model to take in, because the expectation is, hey, I can use this like a messaging app, and people will send in, like, hi, hello, you know, weird questions, weird comments. Whereas if you take that same, literally the same input box, and put it in, like, a thing that looks like a search bar with, like, a search button, people are going to treat it more like a search window. And at that point, inputs look a lot more like keywords or a list of keywords or maybe questions. So the simple act of, like, contextualizing that input in different parts of an interface reset the user's expectations, which constrain the space of things that the model has to handle. And that you're kind of adding constraints, because you're really restricting your input to mostly things that look like keyword search. But because of that constraint, you can have the model fit the expectations better. You can tune the model to perform better in those settings. 
And it's also less confusing and perhaps more intuitive, because the user isn't stuck with this blank page syndrome problem of, okay, here's an input. What do I actually do with it? When we initially launched Notion AI, one of my common takeaways, personally, came from talking to a lot of my friends who had tried it. Obviously, there were a lot of people who were getting lots of value out of using it to automate writing emails or writing marketing copy. There were a ton of people who were using it to write Instagram ads and then paste them into the Instagram tool. But among my friends who had tried it and did not use it as much, a frequently cited reason was: I tried it. It was cool. It was cool for the things that Notion AI was marketed for. But for my particular use case, I had a hard time figuring out exactly the way it was useful for my workflow. And I think that gets back to the problem of, it's such a general tool that, just presented with a blank prompt box, it's hard to know exactly the way it could be useful to your particular use case. [00:16:21]

Alessio: What do you think is the relationship between novelty and flexibility? I feel like we're in kind of a prompting honeymoon phase where the tools are new and everybody just wants to do whatever they want to do. And so it's good to give these interfaces, because people can explore. But if I go forward three years, ideally, I'm not prompting anything. The UX has been built for most products to already have the intuitive, happy path built into it. Do you think there's merit in that? If you think about ChatGPT, if it had been limited, would it have spread the same way? The reason why it got so viral is that people were doing things they didn't think a computer could do, like write poems and solve riddles and all these different things. How do you think about that, especially in Notion, where Notion AI is kind of a new product in an existing thing? How much of it for you is letting that happen and seeing how people use it? And then at some point being like, okay, we know what people want to do. The flexibility was cool before, but now we just want you to do the right things with the right UX. [00:17:27]

Linus: I think there's value in always having the most general input as an escape hatch for people who want to take advantage of that power. At this point, Notion AI has a couple of different manifestations in the product. There's the writer. There's a thing we call an AI block, which is a thing that you can always re-update as part of a document. It's like a live little portal inside the document that an AI can write into. We also have a relatively new thing called AI autofill, which lets an AI fill an entire column in a Notion database. In all of these things, speaking of adding constraints, we have a lot of suggested prompts that we've worked on and curated and that we think work pretty well for things like summarization and writing drafts of blog posts. But we always leave a fully custom prompt for a few reasons. One is that if you are actually a power user and you know how language models work, you can go in and write your custom prompt, and if you're a power user, you want access to the power. The other is for us to be able to discover new use cases. And so one of the lovely things about working on a product like Notion is that there's such an enthusiastic and lively community of ambassadors and people that are excited about trying different things and coming up with all these templates and new use cases. And having a fully custom action or prompt whenever we launch something new in AI lets those people really experiment and help us discover new ways to take advantage of AI. I think it's good in that way. There's also a sort of complement to that, which is that if we wanted to use feedback data or learn from those things and help improve the way that we are prompting the model, or the models that we're building, having access to that fully diverse, fully general range of use cases helps us make sure that our models can handle the full generality of what people want to do. [00:19:06]

Swyx: I feel like we've segued a lot into our Notion conversation and maybe I just wanted to bridge that a little bit with your personal journey into Notion before we go into Notion proper. You spent a year kind of on a sabbatical, kind of on your own self-guided research journey and then deciding to join Notion. I think a lot of engineers out there thinking about doing this maybe don't have the internal compass that you have or don't have the guts to basically make no money for a year. Maybe just share with people how you decided to basically go on your own independent journey and what got you to join Notion in the end. [00:19:42]

Linus: Yeah, what happened? Um, yeah, so for a little bit of context for people who don't know me, I was working mostly at sort of seed stage startups as a web engineer. I actually didn't really do much AI at all prior to my year off. And then I took all of 2022 off, with less of a focus on AI at first; in retrospect it ended up becoming like a Linus Pivots to AI year, which was beautifully well timed. But in the beginning of the year, there was one key motivation and then one key question that I had. The motivation was that I was at a privileged and fortunate enough place where I felt like I had some money saved up, which I had saved explicitly to be able to take some time off and investigate my own questions, because I was already working on lots of side projects and I wanted to spend more time on them. I think I also at that point felt like I had enough security in the companies and folks that I knew that if I really needed a job on short notice, I could go and find some work to do. So I wouldn't be completely on the streets. And so that security, I think, gave me the confidence to say, OK, let's try this kind of experiment. [00:20:52]

Maybe it'll only be for six months. Maybe it'll be for a year. I had enough money saved up to last like a year and change. And so I had planned for a year off and I had one sort of big question that I wanted to explore. Having that single question, I think, actually was really helpful for focusing the effort instead of just being like, I'm going to side project for a year, which I think would have been less productive. And that big question was, how do we evolve text interfaces forward? So much of knowledge work is consuming walls of text and then producing more walls of text. And text is so ubiquitous, not just in software, but just in general in the world. There's signage and menus and books. And it's ubiquitous, but it's not very ergonomic. There are a lot of things about text interfaces that could be better. And so I wanted to explore how we could make that better. A key part of that ended up being, as I discovered, taking advantage of these new technologies that let computers make sense of text information. And so that's how I ended up sort of sliding into AI. But the motivation in the beginning was less focused on learning a new technology and more just on exploring this general question space. [00:21:53]

Swyx: Yeah. You have the quote, text is the lowest denominator, not the end game. Right, right. [00:21:58]

Linus: I mean, I think if you look at any specific domain or discipline, whether it's medicine or mathematics or software engineering, in any specific discipline where there's a narrower set of abstractions for people to work with, there are custom notations. One of the first things that I wrote in this exploration year was this piece called Notational Intelligence, where I talk about this idea that so much of... well, as a total sidebar, there's a whole other fascinating conversation that I would love to have at some point, maybe today, maybe later, about how to evolve a budding scene of research into a fully-fledged field. So I think AI UX is kind of in this weird stage where there's a group of interesting people interested in exploring this space of how you design for this newfangled technology, and how you take that and go build best practices and powerful methods and tools. [00:22:48]

Swyx: We should talk about that at some point. [00:22:49]

Linus: OK. But in a lot of established fields, there are notations that people use that really help them work at a slightly higher level than just raw words. So notations for describing chemicals and notations for different areas of mathematics that let people work with higher-level concepts more easily. Logic, linguistics. [00:23:07]

Swyx: Yeah. [00:23:07]

Linus: And I think it's fair to say that some large part of human intelligence, especially in these more technical domains, comes from our ability to work with notations instead of work with just the raw ideas in our heads. And text is a kind of notation. It's the most general kind of notation, but it's also, because of its generality, not super high leverage if you want to go into these specific domains. And so I wanted to try to improve on that frontier. [00:23:29]

Swyx: Yeah. You said in our show notes, one of my goals over the next few years is to ensure that we end up with interface metaphors and technical conventions that set us up for the best possible timeline for creativity and inventions ahead. So part of that is constraints. But I feel like that is one part of the equation, right? What's the other part that more engenders creativity? [00:23:47]

Linus: Tell me a little bit about that and what you're thinking there. [00:23:51]

Swyx: It's just, I feel like, you know, we talked a little bit about how you do want to constrain, for example, the user interface to guide people towards things that language models are good at. And creative solutions do arise out of constraints. But I feel like that alone is not sufficient for people to invent things. [00:24:10]

Linus: I mean, there's a lot of directions, I think, that could go from that. The origin of that thing that you're quoting is when I decided to come help work on AI at Notion, a bunch of my friends were actually quite surprised, I think, because they had expected that I would have gone and worked? [00:24:29]

Swyx: You did switch. I was eyeing that for you. [00:24:31]

Linus: I mean, gone to work at a lab or at my own company or something like that. But one of the core motivations for me joining an existing company, and one that has lots of users already, is this exact thing where, in the aftermath of a new foundational technology emerging, there's kind of a period of a few years where the winners in the market get to decide what the default interface paradigm for the technology is. So, with minicomputers and personal computers, the winners of that market got to decide what windows are and how scrolling works and what a mouse cursor is and how text is edited. Similar with mobile: the concept of a home screen and apps and things like that, the winners of the market got to decide. And that has profound consequences. I think it's difficult to overstate the importance of the winning companies in the market choosing the right abstractions and the right metaphors in those few critical years. And AI, to me, seemed like it's at that pivotal moment where it's a technology that lots of companies are adopting. There is this well-recognized need for interface best practices. And Notion seemed like a company that had this interesting balance: it could still move quickly enough and ship and prototype quickly enough to try interesting interface ideas. But it also had enough presence in the ecosystem that if we came up with the right solution, or one that we felt was right, we could push it out and learn from real users and iterate, and hopefully be a part of that story of setting the defaults and setting what the dominant patterns are. [00:26:07]

Swyx: Yeah, it's a special opportunity. One of my favorite stories or facts is that it was a team of 10 people that designed the original iPhone. And so all the UX that was created there is essentially what we use on smartphones today, including predictive text, because they were finding that people were kind of missing the right letters. So they just enhanced the hit area for certain letters based on what you're typing. [00:26:28]

Linus: I mean, even just the idea of like, we should use QWERTY keyboards on tiny smartphone screens. Like that's a weird idea, right? [00:26:36]

Swyx: Yeah, QWERTY is another one. So I have RSI. So this actually affects me. QWERTY was specifically chosen to maximize travel distance, right? It's actually not ergonomic by design, because you wanted the typewriter keys to not stick. But we don't have that anymore. We're still sticking to QWERTY. I'm still sticking to QWERTY. I could switch to the other ones, I forget, Dvorak or Colemak, anytime, but I don't, just because of inertia. I have another thing like this. [00:27:02]

Linus: So going even farther back, people don't really think enough about where this concept of buttons comes from, right? The concept of a push button as a thing where you press it and it activates some binary switch. I mean, mechanical buttons have existed for a long time. But really, this modern concept of a button that activates a binary switch gets popularized by the advent of electricity. Before electricity, if you had a button that did something, you would have to construct a mechanical system where, if you press down on a thing, it affects some other lever system that produces the final action. And this modern idea of a button that is just a binary switch gets popularized with electricity. And at that point, a button has to work the way that it does in, like, an alarm clock, because when you press down on it, there's a spring that makes sure that the button comes back up and that it completes the circuit. And so that's the way the button works. And then when we started building graphical interfaces, we just took that idea of a thing that could be depressed to activate a switch. All the modern buttons that we have today in software interfaces are simulating electronic push buttons where you press down to complete a circuit, except there's actually no circuit being completed. It's just a square on a screen. [00:28:11]

Swyx: It's all virtualized. Right. [00:28:12]

Linus: And then you control the simulation of a button by clicking a physical button on a mouse. Except if you're on a trackpad, it's not even a physical button anymore. It's like a simulated button hardware that controls a simulated button in software. And it's also just this cascade of like conceptual backwards compatibility that gets us here. I think buttons are interesting. [00:28:32]

Alessio: Where are you on the skeuomorphic design love-hate spectrum? There are people that have high nostalgia for the original stuff, you know, the YouTube icon on the iPhone with the knobs like on an old TV. [00:28:42]

Linus: I think a big part of that, at least the aesthetic part of it, is fashion. Fashion taken very literally, in the same way that the early Y2K, 90s aesthetic comes and goes. I think skeuomorphism as expressed in the early iPhone or Windows XP comes and goes. There's another aspect of this, which is the part of skeuomorphism that helps people understand and intuit software. That has less to do with skeuomorphism making things easier to understand per se, and more to do with a slightly more general principle: there should be a consistent mental model behind an interface that is easy to grok. And even if it's not the full model of exactly how that system works, there should be a simplified model that the user can easily understand and then adopt and use. One of my favorite examples of this is how volume controls that are designed well often work. On an iPhone, when you make your iPhone volume twice as loud, the sound that comes out isn't actually, at a physical level, twice as loud. It's on a log scale. When you push the volume slider up on an iPhone, the speaker uses something like four times more energy, but humans perceive it as twice as loud. And so the mental model that we're working with is: if I give this volume control slider two times more value, it's going to sound two times louder, even though the underlying physics is on a log scale. But what actually happens physically is not what matters. What matters is how humans perceive it in the model that I have in my head. And I think there are a lot of other instances where skeuomorphism isn't actually the thing. The thing is just that there should be a consistent mental model. And often the easy, consistent mental model to reach for is the models that already exist in reality, but not always. [00:30:23]
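
To make the volume example concrete, here is a minimal sketch of a perceptually linear volume slider. It assumes the rough rule of thumb quoted above (about four times the power reads as roughly twice as loud), which is why a square-law mapping is used; real psychoacoustic curves and real devices differ.

```python
# Minimal sketch of the "perceptually linear" volume slider idea discussed above.
# Assumption: the 4x-power ~ 2x-perceived-loudness rule of thumb from the
# conversation; real loudness perception (and real hardware) is more complicated.

def slider_to_power(slider: float, max_power: float = 1.0) -> float:
    """Map a slider position in [0, 1] to speaker power.

    With a square-law mapping, doubling the slider position quadruples the power,
    which (by the rule of thumb above) is perceived as roughly twice as loud,
    so the slider *feels* linear to the listener.
    """
    slider = max(0.0, min(1.0, slider))  # clamp to the valid range
    return max_power * slider ** 2

if __name__ == "__main__":
    for s in (0.25, 0.5, 1.0):
        print(f"slider={s:.2f} -> power={slider_to_power(s):.4f}")
    # slider=0.25 -> power=0.0625
    # slider=0.50 -> power=0.2500  (4x the power, roughly 2x the perceived loudness)
    # slider=1.00 -> power=1.0000
```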

Alessio: I think the other big topic, maybe before we dive into Notion, is agents. I think that's one of the toughest interfaces to crack, mostly because, you know, unlike the text box, an agent has this kind of human-like feeling, where it's like, okay, I'm kind of delegating something to a human, right? I think, Sean, you made the example of a Calendly or a SavvyCal being like an agent, because it's scheduling on your behalf for something. [00:30:51]

Linus: That's actually a really interesting example, because it's pretty deterministic, there's no real AI to it, but it is an agent in the sense that you're delegating to it and automating something. [00:31:01]

Swyx: Yeah, it does work without me. It's great. [00:31:03]

Alessio: So that one, we figured out. Like, we know what the scheduling interface is like. [00:31:07]

Swyx: Well, that's the state of the art now. But, you know, for example, the person I'm corresponding with still has to pick a time from my calendar, which some people dislike. Sam Lesson famously says it's a sign of disrespect. I disagree with him, but, you know, it's a point of view. There could be some intermediate AI agents that would send emails back and forth like a human person to give the other person who feels slighted that sense of respect or a personalized touch that they want. So there's always ways to push it. [00:31:39]

Alessio: Yeah, I think for me, you know, there's other stuff that I think about. So we were doing prep for another episode and had an agent, and asked it to do, you know, background prep on the background of the person. And it just couldn't quite get the format that I wanted it to be in, and the only way I could prompt it toward that was, give it a text example, give a text example. What do you think the interface between humans and agents will be like in the future? Do you still think agents are this open-ended thing that are objective-driven, where you say, hey, this is what I want to achieve? Versus, I only trust this agent to do X, and this is how X is done. I'm curious because that kind of seems like a lot of mental overhead, you know, to remember each agent for each task, versus, if you have an executive assistant, they'll do a random set of tasks and you can trust them because they're a human. But I feel like with agents, we're not quite there. [00:32:36]

Swyx: Agents are hard. [00:32:36]

Linus: The design space is just so vast. Since all of the early agent stuff came out around AutoGPT, I've tried to develop some kind of a thesis around it. And I think it's just difficult because there are so many variables. One framework that I usually apply to existing chat-based prompting kinds of things, which I think also applies just as well to agents, is this duality between what you might call trust and control. So you just now brought up this example of having an agent try to write up some prep document for an episode, and it couldn't quite get the format right. And one way you could describe that is you could say, oh, the agent didn't exactly do what I meant and what I had in my head, so I can't trust it to do the right job. But a different way to describe it is, I have a hard time controlling exactly the output of the model, and I have a hard time communicating exactly what's in my head to the model. And they're kind of two sides of the same coin. I think if you can somehow provide a way to, with less effort, communicate and control and constrain the model's output a little bit more, and constrain the behavior a little bit more, that would alleviate the pressure for the model to be this fully trusted thing, because there's no need for trust anymore. There are just guardrails that ensure that the model does the right thing. So developing ways and interfaces for these agents to be a little more constrained in their output, or for the human to control their output or behavior a little bit more, I think is a productive path. Another, more recent revelation that I had while working on this AI autofill thing inside Notion is the importance of zones of influence for AI agents, especially in collaborative settings. Having worked on lots of interfaces for independent work during my year off, one of the surprising lessons that I learned early on when I joined Notion was that collaboration permeates everything, which is great for Notion, because when collaborating with an AI, you reuse a lot of the same metaphors as collaborating with humans. So one nice thing about this autofill feature, which also applies to AI blocks, another thing that we have, is that you don't run into this problem of having to ask questions like, oh, is this document written by an AI or is this written by a human? This need for auditability, because the part that's written by the AI is just in the autofilled cell or in the AI block. You can tell that it was written by the AI, and for things outside of it, you can reasonably assume that it was written by you. I think anytime you have an unbounded action space for models like agents, it's especially important to be able to answer those questions easily and to have some sense of security. In the same way that you want to know whether your coworker or collaborator has access to a document or has modified a document, you want to know whether an AI has permissions to access something. And if it's modified something or made some edit, you want to know that it did it. And so as a complement to constraining the model's action space proactively, I think it's also important to communicate, to have the user have an easy understanding of what exactly the model did here. And I think that helps build trust as well. [00:35:39]

Swyx: Yeah. I think for auto GPT and those kinds of agents in particular, anything that is destructive, you need to prompt for, I guess, or like check with, check in with the user. I know it's overloaded now. I can't say that. You have to confirm with the user. You confirm to the user. Yeah, exactly. Yeah. Yeah. [00:35:56]

Linus: That's tough too though, because you, you don't want to stop. [00:35:59]

Swyx: Yeah. [00:35:59]

Linus: One of the benefits of automating these things is that, in theory, you can scale them out arbitrarily. I can have like a hundred different agents working for me, but if that means I'm just spending my entire day in a deluge of notifications, that's not ideal either. [00:36:12]

Swyx: Yeah. So then it could be like a reversible, destructive thing with some kind of timeouts, a time limit. So you could reverse it within some window. I don't know. Yeah. I've been thinking about this a little bit because I've been working on a small developer agent. Right. Right. [00:36:27]

Linus: Or maybe you could batch a group of changes and sort of summarize them with another AI and approve them in bulk or something. [00:36:33]

Swyx: Which is surprisingly similar to the collaboration problem. Yeah. Yeah. Yeah. Exactly. Yeah. [00:36:39]

Linus: I'm telling you, the collaboration thing: a lot of the problems with collaborating with humans also apply to collaborating with AI. There's a potential pitfall to that as well, which is that there are some core advantages of AI that you end up missing out on if you just fully anthropomorphize these systems into human-like collaborators. [00:36:56]

Swyx: But yeah. Do you have a strong opinion on that? Like, do you refer to it as it? Oh yeah. [00:37:00]

Linus: I'm an it person, at least for now, in 2023. Yeah. [00:37:05]

Swyx: So that leads us nicely into introducing what Notion and Notion AI is today. Do you have a pet answer as to what is Notion? I've heard it introduced as a database, a WordPress killer, a knowledge base, a collaboration tool. What is it? Yeah. [00:37:19]

Linus: I mean, the official answer is that Notion is a connected workspace. It has a space for your company docs, meeting notes, a wiki for all of your company notes. You can also use it to orchestrate your workflows if you're managing a project, if you have an engineering team, if you have a sales team. You can put all of those in a single Notion database. And the benefit of Notion is that all of them live in a single space, where you can link to your wiki pages from your, I don't know, onboarding docs. Or you can link to a GitHub issue through a task from your documentation on your engineering system. And all of this existing in a single place, in this kind of unified, single workspace, I think has lots of benefits. [00:37:58]

Swyx: That's the official line. [00:37:59]

Linus: There's an asterisk that I usually enjoy diving deeper into, which is that the whole reason that this connected workspace is possible is because underlying all of this is this really cool abstraction of blocks. In Notion, everything is a block. A paragraph is a block. A bullet point is a block. But also a page is a block. And the way that Notion databases work is that a database is just a collection of pages, which are really blocks. And you can like take a paragraph and drag it into a database and it'll become a page. You can take a page inside a database and pull it out and it'll just become a link to that page. And so this core abstraction of a block that can also be a page, that can also be a row in a database, like an Excel sheet, that fluidity and this like shared abstraction across all these different areas inside Notion, I think is what really makes Notion powerful. This Lego theme, this like Lego building block theme permeates a lot of different parts of Notion. Some fans of Notion might know that when you, or when you join Notion, you get a little Lego minifigure, which has Lego building blocks for workflows. And then every year you're at Notion, you get a new block that says like you've been here for a year, you've been here for two years. And then Simon, our co-founder and CTO, has a whole crate of Lego blocks on his desk that he just likes to mess with because, you know, he's been around for a long time. But this Lego building block thing, this like shared sort of all-encompassing single abstraction that you can combine to build various different kinds of workflows, I think is really what makes Notion powerful. And one of the sort of background questions that I have for Notion AI is like, what is that kind of building block for AI? [00:39:30]

Swyx: Well, we can dive into that. So what is Notion AI? Like, so I kind of view it as like a startup within the startup. Could you describe the Notion AI team? Is this like, how seriously is Notion taking the AI wave? [00:39:43]

Linus: The most seriously. The way that Notion AI came about, as I understand it, because I joined a bit later: I think it was around October last year, the whole Notion team had a little offsite. And as a part of that, Ivan and Simon kind of went into a little hack weekend. And the thing that they ended up hacking on inside Notion was the very, very early prototype of Notion AI. They saw this GPT-3 thing. The early, early motivation for starting Notion, for building Notion in the first place, was grounded in this utopian end-user programming vision where software is so powerful, but there are only so many people in the world who can write programs. Yet everyone can benefit from having a little workspace or a little program or a little workflow tool that's programmed to just fit their use case. And so how can we build a tool that lets people customize the software tools that they use every day for their use case? And I think to them, AI seemed like such a critical part of facilitating that, of bridging the gap between people who can code and people who need software. And so they saw that, and they tried to build an initial prototype that ended up becoming the first version of Notion AI. They had a prototype in, I think, late October, early November, before ChatGPT came out, and sort of evolved it over the following months. But what ended up launching was in line with that initial vision, I think. And then once they had it, I think they wanted to keep pushing it. And so at this point, AI is a really key part of Notion's strategy and of what we see Notion becoming going forward. In the same way that blocks and databases are a core part of Notion that helps enable workflow automation and all these important parts of running a team or collaborating with people or running your life, we think that AI is going to become an equally critical part of what Notion is. And it won't be that Notion is a cool connected workspace app that also has AI. It'll be that what Notion is, is databases, pages, space for your docs, and also this comprehensive suite of AI tools that permeate everything. And one of the challenges of the AI team, which is, as you said, kind of a startup within a startup right now, is to figure out exactly what that all-permeating kind of abstraction means, which is a fascinating and difficult open problem. [00:41:57]

Alessio: How do you think about what people expect of Notion versus what you want to build in Notion? A lot of this AI technology kind of changes, you know, we talked about the relationship between text and humans and how humans collaborate. Do you put any constraints on yourself when it's like, okay, people expect Notion to work this way with these blocks, so maybe I have this crazy idea and I cannot really pursue it because of that? I think it's a classic innovator's dilemma kind of thing. And I think there are a lot of founders out there in a similar position, where it's like, you know, a series C, series D company: you're not quite yet the super established one, you're still moving forward, but you have an existing kind of following and something that Notion stands for. How do you wrestle with that? [00:42:43]

Linus: Yeah, that is in some ways a challenge, in that Notion already is kind of a thing. And so we can't just scrap everything and start over. But I think there's also a blessing side of it, in that because there are so many people using Notion in so many different ways, we understand all of the things that people want to use Notion for very well. And so we already have a really well-defined space of problems that we want to help people solve. And that helps us. We have that with the existing Notion product, and we also get it by rolling out these AI things early and then watching and learning from the community what people want to do [00:43:17]

Swyx: with them. [00:43:17]

Linus: And so based on those learnings, I think it actually sort of helps us constrain the space of things we think we need to build because otherwise the design space is just so large with whatever we can do with AI and knowledge work. And so watching what people have been using Notion for and what they want to use Notion for, I think helps us constrain that space a little bit and make the problem of building AI things inside Notion a little more tractable. [00:43:36]

Swyx: I think also just observing what they naturally use things for. And it sounds like you do a bunch of user interviews, where you hear people running into issues, or describing them the way that I describe myself, actually: I feel like the problem is with me, that I'm not creative enough to come up with use cases for Notion AI or any other AI. [00:43:57]

Linus: Which isn't necessarily on you, right? [00:43:59]

Swyx: Exactly. [00:43:59]

Linus: Again, it goes back to the thing we touched on early in the conversation: if you have too much generality, there are not enough guardrails to obviously point to use cases. Blank piece of paper. [00:44:10]

Swyx: I don't know what to do with this. So I think a lot of people judge Notion AI based on what they originally saw, which is write me a blog post or do a summary or do action items. Which, fun fact, for Latent Space, my very, very first Hacker News hit was reverse engineering Notion AI. I actually don't know if I got it exactly right. I think I got the easy ones right. And then apparently I got the action items one really wrong. So there's some art to doing that. But also you've since launched a bunch of other products, and maybe you've already hinted at AI Autofill. Maybe we can just talk a little bit about what the scope or suite of Notion AI products has been so far and what you're launching this week? Yeah. [00:44:53]

Linus: So we have, I think, three main facets of Notion AI in Notion at the moment. There's the first thing that ever launched with Notion AI, which is the thing that helps you write. Going back to earlier in the conversation, it's kind of a writing, content-generation tool. If you have a document and you want to generate a summary, it helps you generate a summary, pull out action items, draft a blog post; it can help improve your writing, fix grammar and spelling mistakes. But under the hood, it's fairly lightweight: a thick layer of prompts, but otherwise a pretty straightforward use case of language models, right? And so there's that, a tool that helps you write documents. There's a thing called an AI block, which is a slightly more constrained version of that. One common way that we use it inside Notion is that we take all of our meeting notes inside Notion, and frequently, when you have a meeting and you want other people to be able to go back to it and reference it, it's nice to have a summary of that meeting. So all of our meeting notes templates, at least on the AI team, have an AI block at the top that automatically summarizes the contents of that page. And so whenever we're done with a meeting, we just press a button and it'll re-summarize that, including things like what the core action items are for every person in the meeting. And so that block, as I said before, is nice because it's a constrained space for the AI to work in, and we don't have to prompt it every single time. And then the newest member of this AI collection of features is AI autofill, which brings Notion AI to databases. So if you have a whole database of user interviews and you want to pull out what the company's core pain points are, what their core features are, maybe what competitor products they use, you can just make columns. And in the same way that you write Excel formulas, you can write a little AI formula, basically, where the AI will look at the contents of the page and pull out each of these key pieces of information. The slightly new thing that autofill introduces is this idea of a more automated background [00:46:43]

Swyx: AI thing. [00:46:44]

Linus: So with Writer, the AI-in-your-document product, and the AI block, you always have to ask it to update; you always have to ask it to rewrite. But if you have a column or a property in a Notion database, it would be nice if, whenever someone went back and changed the contents of the meeting note, or something updated about the page, or maybe it's a list of tasks you have to do and the status of a task changes, the summary or the detail of that task would update too. And so you can set up an autofilled Notion property so that anytime something on that database row or page changes, the AI will go back and auto-update the autofilled value. And that, I think, is a really interesting direction that we might continue leaning into: even though there's AI now tied to this particular page, it's sort of doing its own thing in the background to help automate and alleviate some of that pain. But yeah, Writer, Blocks, and Autofill are the three cornerstones we have today. [00:47:42]
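
To make the autofill pattern concrete, here is a rough sketch of the update loop being described. It is not Notion's implementation: the `llm_complete` helper, the prompt, and the in-memory `rows` structure are hypothetical stand-ins. The point is only the shape of the logic: when a row's source content changes, re-run the column's AI formula and store the result.

```python
# Hypothetical sketch of an "AI autofill" column: re-run a prompt whenever the
# underlying page content changes. Not Notion's implementation; llm_complete()
# is a stand-in for whatever chat/completion API you use.
import hashlib

def llm_complete(prompt: str) -> str:
    return f"(placeholder answer for: {prompt[:40]}...)"  # swap in a real model call

AI_FORMULA = "List the core pain points mentioned in this user interview:\n\n{page}"

rows = [
    {"page": "Interview with ACME: exports are slow, onboarding is confusing.",
     "autofill": None, "content_hash": None},
]

def refresh_autofill(rows: list[dict]) -> None:
    for row in rows:
        digest = hashlib.sha256(row["page"].encode()).hexdigest()
        if digest != row["content_hash"]:      # page changed since the last run
            row["autofill"] = llm_complete(AI_FORMULA.format(page=row["page"]))
            row["content_hash"] = digest       # remember what we already processed

# Calling refresh_autofill(rows) from a change event (or on a schedule) keeps the
# AI column up to date without the user ever pressing a "regenerate" button.
refresh_autofill(rows)
```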

Alessio: You know, there used to be this glorious time when Roam Research was the hottest knowledge company out there, and then Notion built Backlinks. I don't know if we are to blame for that. No, no, but how do Backlinks play into some of this? You know, I think most AI use cases today are kind of single page, right? Kind of like, this document, I'm helping with this. Do you see some of these tools expanding to do changes across things? So we just had Itamar from Codium on the podcast, and he talked about how agents can tie in specs for features, tests for features, and the code for the feature, so the three entities are tied together. Do you see Backlinks helping AI navigate through the knowledge bases of companies, where you might have the document the product team uses, but you also have the document that marketing uses to then announce it? And as you make changes, the AI can work through different pieces of it? [00:48:41]

Swyx: Definitely. [00:48:41]

Linus: If I may get a little theoretical from that. One of my favorite ideas from my last year of hacking around building text augmentations with AI for documents is this realization that, you know, when you look at code in a code editor, what it is at the very lowest level is just text files. A code file is a text file, and there are maybe functions inside of it, and it's a list of functions, but it's a text file. But the way that you understand it is not as a file, like a Word document, it's a kind of a graph. [00:49:10]

Linus: You have a function, you have call sites to that function, places where you call that function, a place where that function is tested, maybe different definitions for that function. Maybe there's a type definition that's tied to that function. So it's a kind of a graph. And if you want to understand that function, there are advantages to being able to traverse that whole graph and fully contextualize where that function is used. Same with types and same with variables. And so even though code is represented as text files, it's actually kind of a graph. And a lot of the key interface innovations behind IDEs are about helping surface that graph structure in the context of a text file. So things like go-to-definition, or VS Code's little window view when you look at references. An interesting idea that I explored last year was: what if you bring that to text documents? Text documents are a little more unstructured, so it's a fuzzier kind of graph idea. But if you're reading a textbook and there's a new term, there are actually other places where the term is mentioned. There are probably a few places where it's defined. Maybe there are some figures that reference that term. If you have an idea, there are other parts of the document where the document might disagree with that idea or cite that idea. So there's still kind of a graph structure. It's a little more fuzzy, but there's a graph structure that ties together a body of knowledge. And it would be cool if you had some kind of text editor or knowledge tool that let you explore that whole graph. Or maybe if an AI could explore that whole graph. And so, back to your point, I think it's about taking advantage of not just the backlinks. Backlinks are a part of it. But the fact that all of these pages inside Notion exist in a single workspace means it's a shared context. It's a connected workspace. And you can take any idea and look it up anywhere to fully contextualize what a part of your engineering system design means, or what we know about pitching a particular customer at a company, or, if I wrote down a book, what other places that book has been mentioned. All these graph-following things, I think, are really important for contextualizing knowledge. [00:51:02]
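
To ground the "documents as a fuzzy graph" idea, here is a small sketch that builds a backlink graph over a toy workspace of pages. The pages and the `[[wiki-style]]` link syntax are made up for illustration; the takeaway is just that links give you a graph that a person or an AI can traverse to contextualize a term or idea.

```python
# Sketch: treat a workspace of text pages as a graph by extracting [[wiki-style]]
# links, then look up everything that references a given page. The pages and the
# link syntax are illustrative, not any particular product's format.
import re
from collections import defaultdict

pages = {
    "Q3 Roadmap": "We will ship [[AI Autofill]] and improve [[Search]].",
    "AI Autofill": "Fills database columns with AI. Depends on [[Search]] indexing.",
    "Search": "Full-text search across the workspace.",
}

LINK_RE = re.compile(r"\[\[([^\]]+)\]\]")

def build_backlinks(pages: dict[str, str]) -> dict[str, list[str]]:
    backlinks = defaultdict(list)
    for title, body in pages.items():
        for target in LINK_RE.findall(body):
            backlinks[target].append(title)  # `target` is linked from `title`
    return backlinks

backlinks = build_backlinks(pages)
print(backlinks["Search"])  # ['Q3 Roadmap', 'AI Autofill'] -- every page citing "Search"
```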

Swyx: Part of your job at Notion is prompt engineering. You are maybe one of the more advanced prompt engineers that I know out there. And you've always commented on the state of prompt ops tooling. What is your process today? What do you wish for? There's a lot here. [00:51:19]

Linus: I mean, the prompts that are inside Notion right now, they're not complex in the sense that agent prompts are complex. But they're complex in the sense that even a problem as simple as summarize a [00:51:31]

Swyx: page. [00:51:31]

Linus: A page could contain anything from no information, if it's a fresh document, to a fully fledged news article. Maybe it's a meeting note. Maybe it's a bug filed by somebody at a company. The range of possible documents is huge. And then you have to distill all of it down to always generate a summary. And so describing that task to the AI comprehensively is pretty hard. There are a few things that I, that we as a team, ended up leaning on for the prompt engineering part of it. I think one of the early transitions that we made was that the initial prototype for Notion AI was built on instruction-following models, the sort of classic instruction-following models like text-davinci-003 and so on. And then at some point, we all switched to chat-based models, like Claude and the new ChatGPT Turbo models. And so that was an interesting transition. It actually made few-shot prompting a little bit easier, I think, in that you could give the few-shot examples as previous turns in a conversation, and then you could ask the real question as the next follow-up turn. I've come to appreciate few-shot prompting a lot more, because it's difficult to fully, comprehensively explain a particular task in words, but it's pretty easy to demonstrate four or five different edge cases that you want the model to handle. And a lot of times, if there's an edge case that you want a model to handle, I think few-shot prompting is just the easiest, most reliable tool to reach for. One challenge in prompt engineering that Notion has to contend with often is that we want to support all the different languages that Notion supports. And so all of our prompts have to be multilingual-compatible, which is kind of tricky, because our prompts, our instructions, are written in English. And so if you just take a naive approach, the model tends to output in English, even when the document that you want to translate or summarize is in French. One way you could try to attack that problem is to tell the model to answer in the language of the user's query. But it's actually a lot more effective to just give it examples of not just English documents: maybe summarizing an English document, summarizing a ticket filed in French, summarizing an empty document where the document's supposed to be in Korean. And so a lot of the few-shot examples included in Notion AI's prompts tend to be very multilingual, and that helps support our non-English-speaking users. The other big part of prompt engineering is evaluation. The prompts that you exfiltrated out of Notion AI many weeks ago were surprisingly pretty spot-on, at least for the prompts that we had then, especially things like summary. But they're also outdated, because we've evolved them a lot more, and we have a lot more examples. And some of our prompts are just really, really long. They're thousands of tokens long. And so every time we go back and add an example or modify the instruction, we want to make sure that we don't regress any of the previous use cases that we've supported. And so we put a lot of effort into, and we're increasingly building out, internal tooling and infrastructure for things like what you might call unit tests and regression tests for prompts, with handwritten test cases, as well as tests that are driven more by feedback from Notion users that have chosen to share their feedback with us. [00:54:31]
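
As one concrete illustration of few-shot examples supplied as prior conversation turns, including a non-English example so the model answers in the document's language, here is a minimal sketch. It assumes the OpenAI Python SDK (v1+) and the gpt-3.5-turbo model purely for illustration; Notion's actual prompts, models, and examples are not public.

```python
# Minimal sketch of few-shot prompting via chat turns, with a multilingual
# example so the model answers in the document's language. Model choice and
# examples are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

messages = [
    {"role": "system", "content": "Summarize the document in the document's own language."},
    # Few-shot example 1: English document -> English summary (a prior "turn").
    {"role": "user", "content": "Document:\nThe team shipped search filters and fixed two crashes."},
    {"role": "assistant", "content": "Shipped search filters; fixed two crashes."},
    # Few-shot example 2: French document -> French summary.
    {"role": "user", "content": "Document:\nL'équipe a corrigé un bug d'export et amélioré la vitesse."},
    {"role": "assistant", "content": "Bug d'export corrigé ; vitesse améliorée."},
    # The real request comes last, as the next follow-up turn.
    {"role": "user", "content": "Document:\nNous avons lancé la nouvelle page d'accueil cette semaine."},
]

response = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
print(response.choices[0].message.content)  # expected: a short summary in French
```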

Swyx: So you just have a hand-rolled testing framework, or use Jest or whatever, and nothing custom out there? You basically said you've looked at so many prompt ops tools and you're sold on none of them. [00:54:42]

Linus: So that tweet was from a while ago. I think there are a couple of interesting tools these days. But I think at the moment, Notion uses pretty hand-rolled tools. Nothing too heavy, but it's basically a for loop over a list of test cases. We do do quite a bit of using language models to evaluate language models. So our unit test descriptions are kind of funny, because the test is literally just an input document and a query, and then we expect the model to say something. And then our qualification for whether that test passes or not is just to ask the language model again whether it looks like a reasonable summary, or whether it's in the right language. [00:55:19]
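
The "for loop over test cases, judged by another model call" harness can be sketched in a few lines. This is a guess at the general shape, not Notion's internal tooling; `summarize` and `judge` are placeholders for the prompt under test and an LLM-as-judge call.

```python
# Sketch of a hand-rolled prompt regression harness: loop over cases, generate,
# then ask another model call to judge the output. summarize() and judge() are
# placeholders; in a real harness both would hit a model endpoint.

TEST_CASES = [
    {"doc": "Meeting notes: decided to launch autofill beta next week.", "lang": "English"},
    {"doc": "Ticket : l'export CSV échoue pour les grandes bases.", "lang": "French"},
    {"doc": "", "lang": "English"},  # edge case: empty document
]

def summarize(doc: str) -> str:
    # Placeholder for the prompt under test; a real harness would call the model.
    return doc[:60]

def judge(doc: str, summary: str, lang: str) -> bool:
    # Placeholder for an LLM-as-judge call that answers yes/no to something like
    # "Is this a reasonable summary of the document, written in {lang}?"
    return bool(summary) or not doc

def run_regression() -> None:
    failures = 0
    for case in TEST_CASES:
        summary = summarize(case["doc"])
        if not judge(case["doc"], summary, case["lang"]):
            failures += 1
            print(f"FAIL ({case['lang']}): {case['doc'][:40]!r}")
    print(f"{len(TEST_CASES) - failures}/{len(TEST_CASES)} cases passed")

run_regression()
```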

Swyx: Do you use the same model? Do you have Anthropic critiquing OpenAI, or OpenAI critiquing Anthropic? That's a good question. Do you worry about models being biased towards their own output? [00:55:29]

Linus: Oh, no, that's not a worry that we have. I actually don't know exactly if we use different models. If you have a fixed budget for running these tests, I think it would make sense to use more expensive models for evaluation rather than generation. But yeah, I don't remember exactly what we do there. [00:55:44]

Swyx: And then one more follow-up on, you mentioned some of your prompts are thousands of tokens. That takes away from my budget as a user. Isn't that a trade-off that's a concern? So there's a limited context window, right? Some of that is taken by you as the app designer, product designer, deciding what system prompt to provide. And then the remainder is what I as a user can give you to actually summarize as my content. In theory. [00:56:10]

Linus: I think in practice there are a couple of trends that make that less of an issue. So for things like generating summaries, a summary is only going to be so many tokens long. If our prompts are generating 3,000-token summaries for you, the prompt is not doing its job anyway. [00:56:25]

Swyx: Yeah, but the source doc is. [00:56:27]

Linus: The source doc could be longer. So if you wanted to translate a 5,000-token document, you do have to truncate it. And there is a limitation. It's not something that we are super focused on at the moment, for a couple of reasons. I think there are techniques that, if we need them, help us compress those prompts. Things like parameter-efficient fine-tuning. And also context lengths: it seems like the dominant trend is that context is getting cheaper and longer constantly. Anthropic recently announced their 100,000-token context model. And so I think in the longer term that's going to be taken care of anyway by the models becoming more accommodating of longer contexts. It's more of a temporary limitation. Cool. [00:57:04]
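
Where truncation is needed today, it can be as simple as counting tokens and cutting the document to whatever fits after the system prompt and the reply budget. A rough sketch, assuming the tiktoken tokenizer; a real system would likely truncate by section or relevance rather than just chopping the tail.

```python
# Rough sketch: fit a long document into whatever context budget remains after
# the (potentially thousands-of-tokens) system prompt. Assumes tiktoken; the
# specific limits are illustrative.
import tiktoken

def truncate_to_budget(document: str, system_prompt: str,
                       context_limit: int = 4096, reply_budget: int = 500) -> str:
    enc = tiktoken.get_encoding("cl100k_base")
    used = len(enc.encode(system_prompt)) + reply_budget  # prompt + room for the answer
    remaining = max(0, context_limit - used)
    doc_tokens = enc.encode(document)
    return enc.decode(doc_tokens[:remaining])  # keep only what fits
```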

Swyx: Shall we talk about the professionalizing of a scene? [00:57:07]

Linus: Yeah, I think one of the things that is a helpful bit of context when thinking about HCI and AI in particular is that, historically, HCI and AI have been sort of competing disciplines. Competing very specifically in the sense that they often fought for the same sources of funding and the same kinds of people and attention throughout the history of computer science. HCI and AI both came from the same, or very aligned, parallel motivations: we have computers, how do we make computers work better with humans? And one way to do it was to make the machine smarter. Another way to do it was to design better interfaces. And through the AI booms and busts, when an AI boom was happening, HCI would get less funding. And when AI had its winters, HCI would get a lot more attention, because it was sort of the alternative solution. And now that we have this renewed attention on how to build better interfaces for AI, I think it's interesting that it's kind of a scene now. There are podcasts like this where I get to talk about interfaces and AI. But it's definitely not a fully-fledged field. My favorite definition of what distinguishes the two comes from Andy Matuschak, where he, and I'm going to butcher the quote, said something to the effect of: a field has at its disposal a powerful set of established tools and methods and standards, and a shared set of core questions it wants to answer. And so if you look at machine learning, which is obviously a really dominant, established field, if you want to evaluate a model, or you want to solve a particular task or build a model that solves a particular task, there are powerful methods that we have, like gradient descent and specific benchmarks, for building solutions and then evaluating those solutions. Or if you have an even more expansive problem, there are surely attempts that have been made before, and attempts that people are making now, for how to attack that problem, and frameworks to think about these things. In AI and UX, I think, we're very early in the evolution of that space and that community, and there are a lot of people excited, a lot of people building, but we have yet to come up with a set of best practices and tools and methods and frameworks for thinking about these things. And those will surely arise, and as they do, I think we'll see the evolution of the field. In prompt engineering and using language models in products at large, I think that community is a little farther along. It's still very fast moving because it's really young, but there are established prompting techniques like ReAct and distillation of larger instruction-following models. And these techniques, I think, are the beginnings of best practices and powerful tools at the disposal of this language-model-using field. [00:59:43]

Swyx: Yeah, and mostly it's just following Riley Goodside. That's how I learn about prompting techniques. Right, right. Yeah, pioneers. But yeah, I am actually interested in this. We've recently kind of rebranded the podcast, or the newsletter, somewhat towards being for this term, AI Engineer, which I kind of view as somewhere between machine learning researcher and software engineer, some kind of in-between mix. And I think creating the media, creating meetups, creating a de facto conference for it, creating job titles, and then that core set of questions that everyone wants to get better at, I think that is essentially how this starts. Yeah, yeah. Pretty excited about it. [01:00:25]

Linus: Creating a space for the people that are interested to come together, I think, is a really, really key important part of it. I'm always, whenever I come back to it, I'm always amazed by how if you look at the sort of golden era of theoretical physics in the early 20th century, or the golden era of early personal computing, there are maybe like two dozen people that have contributed all of the significant ideas to that field. They all kind of know each other. I always found that really fascinating. And I think the causal relationship actually goes the other way. It's not that all those people happen to know each other. It's that because there was that core set of people that always, that were very close to each other and shared ideas often, and they were co-located, that that field is able to blossom. And so I think creating that space is really critical. [01:01:08]

Swyx: Yeah, there's a very famous photo of the Solvay conference in 1927, where Albert Einstein, Niels Bohr, Marie Curie, all these top physics names. And how many Nobel laureates are in the photo, right? Yeah, and when I tweeted it out once, people were like, I didn't know these all lived together, and they all knew each other, and they must have exchanged so many ideas. [01:01:28]

Linus: I mean, similar with artists and writers that help a new kind of period blossom. [01:01:34]

Swyx: Now, is it going to be San Francisco, New York, though? [01:01:36]

Alessio: That's a spicy question. [01:01:39]

Swyx: I don't know, we'll see. Well, we're glad to at least be a part of your world, whether it is on either coast. But it's also virtual, right? Like, we have a Discord, it's happening online as well, even if you're in a small town in Indiana. [01:01:54]

Swyx: Cool, lightning round? Awesome, yeah, let's do it. [01:01:59]

Alessio: We only got three questions for you. One is acceleration, one exploration, then a final takeaway. So the first one we always like to ask is like, what is something that happened in AI that you thought would take much longer than it has? [01:02:13]

Swyx: Price is coming down. [01:02:14]

Linus: Prices coming down, and/or being able to get a lot more bang for your buck. So things like GPT-3.5 Turbo being, I don't know the exact figure, like 10 times, 20 times cheaper. [01:02:25]

Swyx: And then having GPT-3.5 be cheaper than davinci-003. [01:02:27]

Linus: Than davinci-003 per token. Or the super-long-context Claude, or MPT StoryWriter, these long-context models that would theoretically take a lot of compute to run, but they're sort of accessible to us now. I think they're surprising because I would have thought, before these things came out, that cost per token and scaling context length were core constraints that you would have to design your AI systems around. And it ends up being like, if you just wait a few months, OpenAI will figure out how to make these models 10 times cheaper, or Anthropic will figure out how to make the models able to take a million tokens. And the speed at which that's happened has been surprising and a little bit frightening, because it invalidates a lot of the assumptions that I was operating with, and I have to recalibrate. [01:03:11]

Swyx: Yeah, there's this very famous law called Wirth's Law, also known as Gates's Law, that basically says software engineers will take up whatever hardware engineers give them. And I feel like there's a parallel law right now with language model improvements: AI UX people are going to take up all the improvements that language model people give them. So, you know, while the language model people are improving costs by an order of magnitude, you, with your Notion AI autofill, are increasing consumption by orders of magnitude. [01:03:39]

Linus: Yeah, exactly. Before the show started, we were just talking about how, when I was prototyping autofill, just to make sure that things scaled up okay, I ended up running autofill on a database with like 6,000 pages, just generating summaries. And usually these are fairly long pages. I ended up running through something like two or three million tokens in a matter of like 20 minutes. [01:03:58]

Swyx: Yeah. [01:03:58]

Linus: Which is not too expensive, luckily, because the models are getting cheaper. It's going to be fine. But it is like $5 or $6, and the concept of running a test on my computer and having it cost the price of a nice coffee is still kind of a weird thing that I'm getting used to. [01:04:13]
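
The back-of-the-envelope math is consistent with GPT-3.5-Turbo-era pricing (roughly $0.002 per 1,000 tokens at the time; the exact figure is an assumption and prices have changed since):

```python
# Rough cost check for the autofill test described above. The per-token price is
# the ~mid-2023 GPT-3.5 Turbo figure and is an assumption, not a quoted number.
tokens = 3_000_000
price_per_1k_tokens = 0.002  # USD per 1,000 tokens (assumed)
print(f"~${tokens / 1000 * price_per_1k_tokens:.2f}")  # ~$6.00
```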

Swyx: And Notion AI currently is $10 a month, something like that. So there's ways to make Notion lose money. [01:04:20]

Alessio: You just get negative gross margins on that test. [01:04:24]

Linus: Not sanctioned by Notion. I mean, obviously, you should use it to, you know, improve your life and support your workflows in whatever ways that's useful. [01:04:33]

Swyx: Okay, second question is about exploration. What do you think is the most interesting unsolved question in AI? [01:04:39]

Linus: Predictability, reliability. Well, in AI broadly, I think it's much harder. But with language models specifically, I think how to build dependable systems is really important. If you ask Notion AI, or if you ask ChatGPT or Claude, for, say, a bullet list of X, Y, Z, sometimes it'll make those bullets with the Unicode center dot. Sometimes it'll make them with a dash. Sometimes it'll add a title. Sometimes it'll bold random things. And all of those things are fine. But it's a little jarring if the answer is a little stochastic every time. I think this is much more of a concern when you're automating tasks or having the model make decisions by itself. Predictability, dependability: so much of the software that runs the world is behind-the-scenes decision-making programs that run inside enterprises and automate systems and make decisions for people. And auditability and dependability are just so critical to all of them. One avenue of work that I'm really intrigued by is, in these decision-making systems, not having the model make decisions internally as a black box, but having the model synthesize code that makes decisions. So for things like summarization, natural language tasks, you have to ask the model. But if you wanted to, I don't know, let's say you have a document and you want to filter out all the dates. Instead of asking the model, hey, can you grab all the dates, you can ask the model to write a regular expression that captures the particular set of date formats that you really care about. And at that point, the output of the model is a program. And the nice thing about a program is that you can kind of check it. There are lots of nice things. One is that it's much cheaper to run afterwards. Another is that you can verify it. And the program becomes a kind of, what in design we call a boundary object: a shared thing that exists both in the sphere of the human and the sphere of the computer. And you can iterate on it to fix bugs. And you can co-evolve this object that is now a representation of the decision that you want the computer to make. But it's auditable and dependable and reliable. And so I'm pretty bullish on code generation and other sorts of program synthesis and program verification techniques, using the model to write the initial program and help people maintain the software. [01:06:36]
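
A tiny sketch of the "synthesize a program instead of asking the model every time" idea: ask the model once for a regex, review it, then run it cheaply and deterministically over documents. The SDK, model choice, and prompt wording are illustrative assumptions; the key point is that the artifact you keep and audit is the regex, not a pile of one-off model answers.

```python
# Sketch: have the model write a date-matching regex once, then apply it with
# plain `re` afterwards -- cheap, repeatable, and auditable. SDK/model choice
# and the prompt wording are illustrative assumptions.
import re
from openai import OpenAI

client = OpenAI()

prompt = (
    "Write a single Python regular expression (pattern only, no explanation) "
    "that matches dates like 2023-05-31, 05/31/2023, and May 31, 2023."
)
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
pattern = response.choices[0].message.content.strip()

# A human (or a test suite) can review `pattern` before it ships. After that,
# extraction is just a regex call -- no model in the loop.
date_re = re.compile(pattern)
doc = "Kickoff on 2023-05-31, review on 06/14/2023, launch on July 4, 2023."
print(date_re.findall(doc))
```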

Swyx: Yeah, I'm so excited by that. Just in terms of reliability, I'll call out our previous guest, Shreya Rajpal. Yeah, yeah. And she's working on Guardrails AI. There's also LMQL. And then Microsoft recently put out Guidance, which is their custom language thing. Have you explored any of those? [01:06:51]

Linus: I've taken a look at all of them. I've spoken to Shreya. Speaking of adding constraints to these systems: this general space of adding constraints, adding program verification, all of these things I think are super fascinating. I also personally like it a lot, because before I was spending a lot of my time in AI, I spent a bunch of time looking at programming languages and compilers and interpreters. And there is just so much amazing work that has gone into how you build automated ways to reason about a program, like compilers and type checkers and so on. And it would be a real shame if the whole field of program synthesis and verification just became, like, ask GPT-4. [01:07:30]

Swyx: But actually, it's not. [01:07:30]

Linus: Like, they work together. You synthesize the program with GPT-4 from human descriptions, and then we have this whole set of powerful techniques that we can use to more formally understand and prove things about programs. And I'm excited to see the synergy of them. [01:07:44]
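
Continuing the hypothetical sketch above, one lightweight form of that checking is to run the synthesized regex against hand-written positive and negative examples before letting it make decisions unattended. The test cases are made up for illustration.

```python
import re

def check_date_regex(date_re: re.Pattern) -> bool:
    # Lightweight acceptance test for the synthesized pattern; the specific
    # cases are illustrative, not from the episode.
    should_match = ["2023-06-01", "06/01/2023"]
    should_not_match = ["not a date", "meeting notes"]
    matches_all = all(date_re.fullmatch(s) for s in should_match)
    rejects_all = not any(date_re.fullmatch(s) for s in should_not_match)
    return matches_all and rejects_all
```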

Swyx: Awesome. This was great, Linus. [01:07:47]

Alessio: Our last question is always, what's one message you want everyone to remember today about the space, exciting challenges? [01:07:54]

Swyx: We were at the beginning. [01:07:57]

Linus: Maybe this is really cliche. But one thing that I always used to say when I was working on text interfaces last year was that I would be really disappointed if in a thousand years humans are still using the same kind of writing tools and writing systems that we are today. Like, it would be pretty surprising if we're still writing documents in the same way that we are today in a thousand years, and the language and the writing system hasn't evolved at all. If humans plan to be around for many thousands of years into the future, consider that writing has really only been around for two or three thousand years in its sort of modern form. And we should, I think, care a lot more about building flexible, powerful tools than about backwards compatibility if we plan to be around for many more times the number of years that we've been around. And so whether we look at something as simple as language models or as expansive as humans interacting with text documents, I think it's worth reminding yourself often that the things that we have today are sometimes that way for a reason, but often just an artifact of the way that we've gotten here. And text can look very different. Language models can look very different. I personally think in a couple of years we're going to do something better than transformers. So all of these things are going to change. And I think it's important to have your eyes looking over the horizon at what's coming far into the future. [01:09:24]

Swyx: Nice way to end it. [01:09:25]

Alessio: Well, thank you, Linus, for coming on. This was great. Thank you. This was lovely. [01:09:29]

Linus: Thanks for having me. [01:09:31]



Get full access to Latent Space at www.latent.space/subscribe
2023-06-01