The big news this week from Nvidia, splashed in headlines across all forms of media, was the company’s announcement about its Vera Rubin GPU.
This week, Nvidia CEO Jensen Huang used his CES keynote to highlight performance metrics for the new chip. According to Huang, the Rubin GPU is capable of 50 PFLOPs of NVFP4 inference and 35 PFLOPs of NVFP4 training performance, representing 5x and 3.5x the performance of Blackwell.
But it won’t be available until the second half of 2026. So what should enterprises be doing now?
The current, shipping Nvidia GPU architecture is Blackwell, which was announced in 2024 as the successor to Hopper. Alongside that release, Nvidia emphasized that its product engineering path also included squeezing as much performance as possible out of the prior Grace Hopper architecture.
It’s a direction that will hold true for Blackwell as well, with Vera Rubin coming later this year.
“We continue to optimize our inference and training stacks for the Blackwell architecture,” Dave Salvator, director of accelerated computing products at Nvidia, told VentureBeat.
In the same week that Vera Rubin was being touted by Nvidia’s CEO as its most powerful GPU ever, the company published new research showing improved Blackwell performance.
Nvidia has been able to increase Blackwell GPU performance by up to 2.8x per GPU in a period of just three short months.
The performance gains come from a series of innovations that have been added to the Nvidia TensorRT-LLM inference engine. These optimizations apply to existing hardware, allowing current Blackwell deployments to achieve higher throughput without hardware changes.
The performance gains are measured on DeepSeek-R1, a 671-billion parameter mixture-of-experts (MoE) model that activates 37 billion parameters per token.
Among the technical innovations that provide the performance boost:
Programmatic dependent launch (PDL): Expanded implementation reduces kernel launch latencies, increasing throughput.
All-to-all communication: New implementation of communication primitives eliminates an intermediate buffer, reducing memory overhead.
Multi-token prediction (MTP): Generates multiple tokens per forward pass rather than one at a time, increasing throughput across various sequence lengths.
NVFP4 format: A 4-bit floating point format with hardware acceleration in Blackwell that reduces memory bandwidth requirements while preserving model accuracy.
The optimizations reduce cost per million tokens and allow existing infrastructure to serve higher request volumes at lower latency. Cloud providers and enterprises can scale their AI services without immediate hardware upgrades.
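As a rough back-of-the-envelope illustration of that cost claim, the sketch below converts a throughput gain into cost per million tokens. The GPU hourly price and the baseline throughput are made-up placeholders, not Nvidia figures; only the 2.8x multiplier comes from the reporting above.

```python
def cost_per_million_tokens(gpu_hour_usd: float, tokens_per_second: float) -> float:
    """Serving cost in USD per one million generated tokens on a single GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_usd / tokens_per_hour * 1_000_000

# Hypothetical inputs, chosen only for illustration.
GPU_HOUR_USD = 4.00             # assumed price per GPU-hour
BASELINE_TOKENS_PER_S = 1000.0  # assumed per-GPU throughput before the updates
SPEEDUP = 2.8                   # "up to 2.8x per GPU", as reported

before = cost_per_million_tokens(GPU_HOUR_USD, BASELINE_TOKENS_PER_S)
after = cost_per_million_tokens(GPU_HOUR_USD, BASELINE_TOKENS_PER_S * SPEEDUP)
print(f"before: ${before:.3f} per million tokens, after: ${after:.3f}")
```

Because the hardware bill stays fixed, the cost per token falls by the same factor that throughput rises, whatever the actual price turns out to be.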
Blackwell is also widely used as a foundational hardware component for training the largest of large language models.
In that respect, Nvidia has also reported significant gains for Blackwell when used for AI training.
Since its initial launch, the GB200 NVL72 system has delivered up to 1.4x higher training performance on the same hardware, a 40% boost achieved in just five months without any hardware upgrades.
The training boost came from a series of updates including:
Optimized training recipes. Nvidia engineers developed sophisticated training recipes that effectively leverage NVFP4 precision. Initial Blackwell submissions used FP8 precision, but the transition to NVFP4-optimized recipes unlocked substantial additional performance from the existing silicon.
Algorithmic refinements. Continuous software stack enhancements and algorithmic improvements enabled the platform to extract more performance from the same hardware, demonstrating ongoing innovation beyond initial deployment.
Salvator noted that the high-end Blackwell Ultra is a market-leading platform purpose-built to run state-of-the-art AI models and applications.
He added that the Nvidia Rubin platform will extend the company’s market leadership and enable the next generation of MoEs to power a new class of applications to take AI innovation even further.
Salvator explained that Vera Rubin is built to address the growing demand for compute created by the continuing growth in model size and in reasoning token generation from leading model architectures such as MoE.
“Blackwell and Rubin can serve the same models, but the difference is the performance, efficiency and token cost,” he said.
According to Nvidia’s early testing results, compared to Blackwell, Rubin can train large MoE models with a quarter the number of GPUs, generate inference tokens with 10x more throughput per watt, and run inference at 1/10th the cost per token.
“Better token throughput performance and efficiency, means newer models can be built with more reasoning capability and faster agent-to-agent interaction, creating better intelligence at lower cost,” Salvator said.
For enterprises deploying AI infrastructure today, current investments in Blackwell remain sound despite Vera Rubin’s arrival later this year.
Organizations with existing Blackwell deployments can capture the inference improvement of up to 2.8x by updating to the latest TensorRT-LLM versions, and the training boost of up to 1.4x by adopting Nvidia’s updated training recipes and software stack, delivering real cost savings without capital expenditure. For those planning new deployments in the first half of 2026, proceeding with Blackwell makes sense. Waiting six months means delaying AI initiatives and potentially falling behind competitors already deploying today.
However, enterprises planning large-scale infrastructure buildouts for late 2026 and beyond should factor Vera Rubin into their roadmaps. The 10x improvement in throughput per watt and 1/10th cost per token represent transformational economics for AI operations at scale.
The smart approach is phased deployment: Leverage Blackwell for immediate needs while architecting systems that can incorporate Vera Rubin when available. Nvidia’s continuous optimization model means this isn’t a binary choice; enterprises can maximize value from current deployments without sacrificing long-term competitiveness.
Right now in the AI world, there are a lot of percolating ideas and experimentation. But as far as Replit CEO Amjad Masad is concerned, they’re just “toys”: unreliable, marginally effective, and generic.
“There’s a lot of sameness out there,” Masad explains in a new VB Beyond the Pilot podcast. “Everything kind of looks the same, all the images, all the code, everything.”
This “slop,” as it’s come to be known, is the result not only of lazy one-shot prompting but also of a lack of individual flavor.
“The way to overcome slop is for the platform to expend more effort and for the developers of the platform to imbue the agent with taste,” Masad says.
Replit tackles the slop problem through a mix of specialized prompting, classification features built into its design systems, and proprietary RAG techniques. The team also isn’t hesitant to use more tokens; this results in higher-quality inputs, Masad notes.
Ongoing testing is also critical. After the first generation of an app, Masad’s team kicks the result off to a testing agent, which analyzes all its features, then reports back to a coding agent about what worked (and didn’t). “If you introduce testing in the loop, you can give the model feedback and have the model reflect on its work,” Masad says.
Pitting models against one another is another of Replit’s strategies: Testing agents may be built on one LLM, coding agents on another. This capitalizes on their different knowledge distributions. “That way the product you’re giving to the customer is high effort and less sloppy,” Masad says. “You generate more variety.”
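The generate-test-reflect loop Masad describes can be sketched roughly as follows. This is an illustrative toy, not Replit's implementation: the two "agents" are hypothetical stub functions standing in for calls to two different LLMs, and the names `build_with_feedback` and `QAReport` are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class QAReport:
    passed: bool
    failures: List[str]

def build_with_feedback(
    spec: str,
    coder: Callable[[str, List[str]], str],
    tester: Callable[[str, str], QAReport],
    max_rounds: int = 3,
) -> str:
    """Generate an app, test it, and feed failures back to the coder until it passes."""
    feedback: List[str] = []
    app = coder(spec, feedback)
    for _ in range(max_rounds):
        report = tester(spec, app)
        if report.passed:
            break
        feedback = report.failures   # what didn't work goes back to the coding agent
        app = coder(spec, feedback)
    return app

# Stub agents (hypothetical behavior) standing in for two different models.
def stub_coder(spec: str, feedback: List[str]) -> str:
    features = ["login"] + [f.removeprefix("missing: ") for f in feedback]
    return ",".join(features)

def stub_tester(spec: str, app: str) -> QAReport:
    built = app.split(",")
    missing = [f"missing: {f}" for f in spec.split(",") if f not in built]
    return QAReport(passed=not missing, failures=missing)

result = build_with_feedback("login,search", stub_coder, stub_tester)
print(result)
```

Keeping the coder and tester behind separate callables is what makes it easy to back them with different models, which is the point of pitting one against the other.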
Ultimately, he describes a “push and pull” between what the model can actually do and what teams need to build on top of it to add value. Also, “if you wanna move fast and you wanna ship things, you need to throw away a lot of code,” he says.
There’s still a lot of frustration around AI because, Masad acknowledges, it isn’t living up to the intense hype. Chatbots are well-established but they offer a “marginal improvement” in workflows.
Vibe coding is beginning to take off partly because it’s the best way for companies to adopt AI in an impactful way, he notes. It can “make everyone in the enterprise the software engineer,” he says, allowing employees to solve problems and improve efficiency through automation, thus requiring less reliance on traditional SaaS tools.
“I would say that the population of professional developers who studied computer science and trained as developers will shrink over time,” Masad says. On the flip side, the population of vibe coders who can solve problems with software and agents will grow “tremendously” over time.
In the end, enterprises must fundamentally change how they think about software; traditional roadmaps are no longer relevant, Masad says. Because AI capabilities are evolving so dramatically, builders can only “roughly” estimate what things might look like months or even weeks into the future.
Reflecting this reality, Replit’s team remains agile and isn’t hesitant to “drop everything” when a new model comes out to perform evals. “It’ll ebb and flow,” Masad contends. “You need to be very zen about it and not have an ego about it.”
Listen to the full podcast to hear about:
The “squishy” divide in AI intelligence that impedes specialization;
The cathedral versus bazaar debate in open source — and why a “cathedral made of bazaars” may be the best path to collective innovation;
How Replit “forks” the development environment to create isolated sandboxes for experimentation;
The importance of context compression;
What really defines AI agents: They don’t just retrieve information; they work autonomously, repeatedly, without human intervention.
Subscribe to Beyond the Pilot on Apple Podcasts, Spotify and YouTube.
A new study from researchers at Stanford University and Nvidia proposes a way for AI models to keep learning after deployment — without increasing inference costs. For enterprise agents that have to digest long docs, tickets, and logs, this is a bid to get “long memory” without paying attention costs that grow with context length.
The approach, called “End-to-End Test-Time Training” (TTT-E2E), reframes language modeling as a continual learning problem: Instead of memorizing facts during pre-training, models learn how to adapt in real time as they process new information.
The result is a Transformer that can match long-context accuracy of full attention models while running at near-RNN efficiency — a potential breakthrough for enterprise workloads where context length is colliding with cost.
For developers building AI systems for long-document tasks, the choice of model architecture often involves a painful trade-off between accuracy and efficiency.
On one side are Transformers with full self-attention, currently the gold standard for accuracy. They are designed to scan through the keys and values of all previous tokens for every new token generated, providing them with lossless recall. However, this precision comes at a steep cost: The computational cost per token grows linearly with context length.
On the other side are linear-time sequence models, which keep inference costs constant but struggle to retain information over very long contexts.
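To make the trade-off concrete, the sketch below counts how many cached key/value entries a single new token has to touch under full attention versus a fixed-size lookback. The fixed lookback is used here only as a stand-in for any model with constant per-token cost, and the window size is an arbitrary assumption.

```python
def keys_scanned_full(position: int) -> int:
    """Full attention: the new token attends to every earlier token plus itself."""
    return position + 1

def keys_scanned_fixed(position: int, window: int) -> int:
    """Fixed lookback: at most `window` recent tokens are visible."""
    return min(position + 1, window)

WINDOW = 8_000  # hypothetical lookback size
for n in (8_000, 32_000, 128_000):
    last = n - 1  # index of the final token in an n-token context
    print(n, keys_scanned_full(last), keys_scanned_fixed(last, WINDOW))
```

The first count keeps climbing with the document, while the second flattens out once the lookback is full, which is where both the efficiency and the forgetting come from.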
Other approaches try to split the difference — sliding-window attention, hybrids that mix attention with recurrence, and other efficiency tricks — but they still tend to fall short of full attention on hard language modeling.
The researchers’ bet is that the missing ingredient is compression: Instead of trying to recall every token exactly, models should distill what matters into a compact state.
The core innovation of the paper is the application of Test-Time Training (TTT) to language modeling. This transforms the model from a static database into a flexible learner.
In standard AI deployment, models are trained to minimize loss and then deployed as frozen artifacts. If you try to make a static model learn during deployment, it typically performs poorly because it was never trained to update itself efficiently.
The researchers solve this by shifting from standard pre-training (teaching the model facts) to meta-learning (teaching the model how to learn). The goal is to optimize the model’s “initialization” so that it can absorb new information rapidly when it goes live.
The process involves simulating inference-time learning during the training phase:
Inner loop (learn): During training, the model treats text as a stream and performs small, temporary updates as it predicts the next token — simulating how it would adapt at inference.
Outer loop (teach it to learn): The system then updates the model’s initialization so the next round of streaming adaptation becomes faster and more accurate.
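The two loops can be illustrated with a deliberately tiny toy. This is not the paper's method: it uses a one-parameter regression model instead of a Transformer, and it swaps in a first-order, Reptile-style outer update rather than training the initialization end to end. All function names and hyperparameters are assumptions made for the example.

```python
import random

def stream_adapt(w0: float, stream, inner_lr: float = 0.05):
    """Inner loop: predict each target, then take one small SGD step on a copy of w0."""
    w, total_loss = w0, 0.0
    for x, y in stream:
        err = w * x - y
        total_loss += err * err
        w -= inner_lr * 2 * err * x   # gradient of (w*x - y)^2 with respect to w
    return w, total_loss / len(stream)

def make_stream(slope: float, rng: random.Random, length: int = 20):
    """A toy 'document': pairs (x, slope * x) that all follow one hidden rule."""
    return [(x, slope * x) for x in (rng.uniform(-1, 1) for _ in range(length))]

rng = random.Random(0)
w0 = 0.0  # the initialization the outer loop is trying to improve
for _ in range(200):
    slope = rng.choice([1.5, 2.0, 2.5])   # each document has its own rule
    w_adapted, _ = stream_adapt(w0, make_stream(slope, rng))
    w0 += 0.1 * (w_adapted - w0)          # outer loop: nudge the init toward adapted weights

eval_stream = make_stream(2.0, random.Random(1))
_, loss_learned = stream_adapt(w0, eval_stream)
_, loss_zero = stream_adapt(0.0, eval_stream)
print(f"learned init {w0:.2f}: loss {loss_learned:.4f}; zero init: loss {loss_zero:.4f}")
```

The adapted weight is thrown away after every document; only the starting point is kept, so what improves over time is how quickly the model adapts, not what it has memorized.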
While the idea of a model changing its weights during deployment might sound risky to reliability-focused enterprise leaders, co-author Yu Sun argues it is mathematically safer than it appears.
“You should think of the model as an RNN with a huge hidden state,” Sun says. He notes that if an enterprise feels safe deploying standard Transformers or RNNs, the stability profile of TTT is comparable.
To implement TTT-E2E, the researchers modified the standard Transformer architecture to support this new learning paradigm, creating a hierarchy that separates cheap short-term context handling from selective long-term memory updates.
The model uses Sliding Window Attention rather than full attention. This acts as the model’s “working memory,” looking back only at a fixed window of recent tokens to handle immediate syntax and local references. This ensures the cost of processing a new token remains constant rather than growing as the context expands.
The model employs “targeted weight updates.” While standard models have completely frozen weights during use, TTT-E2E designates specific sections (Multi-Layer Perceptron layers in the final 25% of the model’s blocks) to be mutable.
The architecture uses a “dual-track storage” to prevent the model from forgetting its general training while learning a new document. Each updateable block contains two MLP components: one static layer that holds general pre-trained knowledge, and one dynamic layer that updates in real-time to store the current document’s context.
The innovation lies in how the model handles information that falls out of the sliding window. In a standard sliding window model, once a token slides out of view, it is forgotten. TTT-E2E prevents this via compression. As the window moves, the model uses next-token prediction to “compress” the passing information directly into the weights of the dynamic MLP layers. This consolidates the gist and facts of the earlier parts of the document into the model’s structure, serving as a long-term memory.
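A toy version of this arrangement is sketched below, assuming a single scalar weight in place of each MLP and a simple squared-error objective in place of next-token prediction. The class name, learning rate, and the choice to update at the moment of eviction are all assumptions for illustration, not details from the paper.

```python
from collections import deque

class DualTrackMemory:
    """Toy stand-in: a frozen 'static' weight plus a 'dynamic' weight updated at test time."""

    def __init__(self, static_w: float, window: int, lr: float = 0.1):
        self.static_w = static_w            # general knowledge, never updated here
        self.dynamic_w = 0.0                # per-document memory
        self.window = deque(maxlen=window)  # short-term working memory
        self.lr = lr

    def predict(self, x: float) -> float:
        return (self.static_w + self.dynamic_w) * x

    def observe(self, x: float, y: float) -> None:
        """Add a pair; if the window is full, compress the oldest pair into dynamic_w first."""
        if len(self.window) == self.window.maxlen:
            old_x, old_y = self.window[0]                  # about to slide out of view
            err = self.predict(old_x) - old_y
            self.dynamic_w -= self.lr * 2 * err * old_x    # one gradient step
        self.window.append((x, y))

mem = DualTrackMemory(static_w=1.0, window=4)
xs = [1.0, -0.5, 0.8]
for t in range(50):
    x = xs[t % 3]
    mem.observe(x, 3.0 * x)   # this toy "document" follows y = 3x
print(round(mem.predict(1.0), 2), len(mem.window), mem.static_w)
```

Only four pairs remain in the window at the end, yet the prediction reflects the rule seen across the whole stream, because the evicted pairs left their trace in the dynamic weight while the static weight was never touched.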
The headline result: TTT-E2E continues improving as context length grows — matching or outperforming full attention — while efficient baselines plateau after ~32,000 tokens.
To validate their approach, the researchers trained models ranging from 125 million to 3 billion parameters. They employed a two-stage training process: pre-training on 8,000-token contexts and fine-tuning on 128,000-token contexts. These models were tested against robust baselines, including Transformers with full attention, Transformers with Sliding Window Attention (SWA), hybrid models (Mamba 2 and Gated DeltaNet), and TTT-KVB (an earlier form of test-time training).
The results highlight a significant breakthrough in scaling. The most critical experiment tested performance as the input document grew from 8,000 to 128,000 tokens. The Full Attention Transformer, the gold standard, continued to improve its performance (lower loss) as the context grew. In contrast, efficient baselines like Mamba 2, Gated DeltaNet, and SWA hit a ceiling, with their performance degrading or flattening out after 32,000 tokens.
The new TTT-E2E method successfully scaled with context length, mimicking the behavior of Full Attention. In the experiments using 3B parameter models, TTT-E2E actually maintained a lower perplexity (better performance) than Full Attention throughout the context window.
Critically, this performance did not come at the cost of speed. On inference latency, TTT-E2E matched the efficiency of RNNs. At a context length of 128k tokens, TTT-E2E was 2.7x faster than the Full-Attention Transformer on Nvidia H100 hardware.
Crucially for adoption, Sun notes that TTT models can be deployed for inference today on standard Transformer infrastructure to achieve these speedups. However, he cautions that the training side of the equation (specifically the outer loop) is currently more complex and slower than standard methods, representing a hurdle that still needs engineering optimization.
The benefits become even more drastic as data scales. Sun argues the advantage should widen further at million-token contexts, though those figures are projections rather than today’s benchmarked deployments.
However, the approach does have specific limitations rooted in its design philosophy. The researchers performed a “Needle in a Haystack” test, which requires the model to retrieve a specific, isolated piece of information (like a passcode) hidden in a large block of text. In this evaluation, Full Attention dramatically outperformed all other methods, including TTT-E2E.
This is because Full Attention relies on a cache that allows for nearly lossless recall of specific details, whereas TTT-E2E relies on compression. Compression captures the intuition and core information perfectly but may lose specific, random details that do not fit the learned patterns.
This distinction has major implications for enterprise data pipelines, specifically RAG. Sun suggests that TTT won’t make RAG obsolete but will redefine it. He likens TTT to “updating the human brain” with general knowledge, while RAG will remain a necessary tool for precision, “similar to how humans still need to write things down in a notepad.” For enterprise teams, the takeaway is that TTT reduces how often you need retrieval — but doesn’t eliminate the need for exact external memory.
While the technique was demonstrated on the Transformer architecture, the researchers note that “in principle, TTT can be applied to any baseline architecture” that allows for a separation of long-term and short-term memory components.
“We believe that these two classes of memory will continue to complement each other,” the researchers concluded.
Looking ahead, Sun predicts a paradigm shift where the primary form of AI memory will be highly compressed rather than exact. While models will retain a “reasonable” perfect-recall window of around 128,000 tokens, he believes TTT architectures will eventually unlock a “compressed memory of billions of tokens,” fundamentally changing how enterprise agents balance recall, cost, and context length.
Nvidia’s $20 billion strategic licensing deal with Groq represents one of the first clear moves in a four-front fight over the future AI stack. 2026 is when that fight becomes obvious to enterprise builders.
For the technical decision-makers we talk to every day — the people building the AI applications and the data pipelines that drive them — this deal is a signal that the era of the one-size-fits-all GPU as the default AI inference answer is ending.
We are entering the age of the disaggregated inference architecture, where the silicon itself is being split into two different types to accommodate a world that demands both massive context and instantaneous reasoning.
To understand why Nvidia CEO Jensen Huang dropped one-third of his reported $60 billion cash pile on a licensing deal, you have to look at the existential threats converging on his company’s reported 92% market share.
The industry reached a tipping point in late 2025: For the first time, inference — the phase where trained models actually run — surpassed training in terms of total data center revenue, according to Deloitte. In this new “Inference Flip,” the metrics have changed. While accuracy remains the baseline, the battle is now being fought over latency and the ability to maintain “state” in autonomous agents.
There are four fronts of that battle, and each front points to the same conclusion: Inference workloads are fragmenting faster than GPUs can generalize.
Gavin Baker, an investor in Groq (and therefore biased, but also unusually fluent on the architecture), summarized the core driver of the Groq deal cleanly: “Inference is disaggregating into prefill and decode.”
Prefill and decode are two distinct phases:
The prefill phase: Think of this as the user’s “prompt” stage. The model must ingest massive amounts of data — whether it’s a 100,000-line codebase or an hour of video — and compute a contextual understanding. This is “compute-bound,” requiring massive matrix multiplication that Nvidia’s GPUs are historically excellent at.
The generation (decode) phase: This is the actual token-by-token “generation.” Once the prompt is ingested, the model generates one word (or token) at a time, feeding each one back into the system to predict the next. This is “memory-bandwidth bound.” If the data can’t move from the memory to the processor fast enough, the model stutters, no matter how powerful the GPU is. (This is where Nvidia was weak, and where Groq’s specialized language processing unit (LPU), with its on-chip SRAM, shines. More on that in a bit.)
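A simplified roofline-style estimate shows why the two phases land on opposite sides of the compute/bandwidth divide. The hardware numbers below are hypothetical round figures, not the specs of any real chip, and the model ignores attention and KV-cache traffic as well as batching across users.

```python
def arithmetic_intensity(tokens_in_flight: int, bytes_per_param: float) -> float:
    """FLOPs per byte of weights read for a dense layer processing a batch of tokens.

    Each parameter costs about 2 FLOPs (multiply and add) per token, but is read
    from memory only once per batch, however many tokens share it.
    """
    return 2 * tokens_in_flight / bytes_per_param

def bound_by(intensity: float, peak_flops: float, mem_bandwidth: float) -> str:
    ridge = peak_flops / mem_bandwidth  # FLOPs per byte where the two limits meet
    return "compute-bound" if intensity >= ridge else "memory-bandwidth-bound"

PEAK_FLOPS = 1e15      # hypothetical accelerator: 1 PFLOP/s
MEM_BANDWIDTH = 4e12   # hypothetical: 4 TB/s
BYTES_PER_PARAM = 1.0  # assumed 8-bit weights

prefill = arithmetic_intensity(100_000, BYTES_PER_PARAM)  # whole prompt at once
decode = arithmetic_intensity(1, BYTES_PER_PARAM)         # one token at a time
print("prefill:", bound_by(prefill, PEAK_FLOPS, MEM_BANDWIDTH))
print("decode: ", bound_by(decode, PEAK_FLOPS, MEM_BANDWIDTH))
```

With these assumptions, prefill reuses each weight across the entire prompt and saturates the math units, while decode re-reads every weight to produce a single token and waits on memory.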
Nvidia has announced an upcoming Vera Rubin family of chips that it’s architecting specifically to handle this split. The Rubin CPX component of this family is the designated “prefill” workhorse, optimized for massive context windows of 1 million tokens or more. To handle this scale affordably, it moves away from the eye-watering expense of high bandwidth memory (HBM) — Nvidia’s current gold-standard memory that sits right next to the GPU die — and instead utilizes 128GB of a new kind of memory, GDDR7. While HBM provides extreme speed (though not as quick as Groq’s static random-access memory (SRAM)), its supply on GPUs is limited and its cost is a barrier to scale; GDDR7 provides a more cost-effective way to ingest massive datasets.
Meanwhile, the “Groq-flavored” silicon, which Nvidia is integrating into its inference roadmap, will serve as the high-speed “decode” engine. This is about neutralizing a threat from alternative architectures like Google’s TPUs and maintaining the dominance of CUDA, Nvidia’s software ecosystem that has served as its primary moat for over a decade.
All of this was enough for Baker, the Groq investor, to predict that Nvidia’s move to license Groq will cause all other specialized AI chips to be canceled — that is, outside of Google’s TPU, Tesla’s AI5, and AWS’s Trainium.
At the heart of Groq’s technology is SRAM. Unlike the DRAM found in your PC or the HBM on an Nvidia H100 GPU, SRAM is etched directly into the logic of the processor.
Michael Stewart, managing partner of Microsoft’s venture fund, M12, describes SRAM as the best for moving data over short distances with minimal energy. “The energy to move a bit in SRAM is like 0.1 picojoules or less,” Stewart said. “To move it between DRAM and the processor is more like 20 to 100 times worse.”
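Stewart's figures can be turned into a quick estimate of the energy needed to move a fixed amount of data. The 1 GB payload is an arbitrary example, and the SRAM figure uses the upper end of his "0.1 picojoules or less."

```python
def energy_joules(num_bytes: float, picojoules_per_bit: float) -> float:
    """Energy to move `num_bytes` at a given per-bit cost, in joules."""
    return num_bytes * 8 * picojoules_per_bit * 1e-12

SRAM_PJ_PER_BIT = 0.1         # upper end of the quoted SRAM figure
DRAM_MULTIPLIERS = (20, 100)  # "20 to 100 times worse"
PAYLOAD_BYTES = 1e9           # 1 GB, chosen only for illustration

sram = energy_joules(PAYLOAD_BYTES, SRAM_PJ_PER_BIT)
dram_low, dram_high = (sram * m for m in DRAM_MULTIPLIERS)
print(f"SRAM: {sram * 1e3:.1f} mJ, DRAM: {dram_low * 1e3:.0f} to {dram_high * 1e3:.0f} mJ")
```

During decode the weights are re-read for every generated token, so a gap of this size per gigabyte compounds quickly.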
In the world of 2026, where agents must reason in real-time, SRAM acts as the ultimate “scratchpad”: a high-speed workspace where the model can manipulate symbolic operations and complex reasoning processes without the “wasted cycles” of external memory shuttling.
However, SRAM has a major drawback: it is physically bulky and expensive to manufacture, meaning its capacity is limited compared to DRAM. This is where Val Bercovici, chief AI officer at Weka, another company offering memory for GPUs, sees the market segmenting.
Groq-friendly AI workloads — where SRAM has the advantage — are those that use small models of 8 billion parameters and below, Bercovici said. This isn’t a small market, though. “It’s just a giant market segment that was not served by Nvidia, which was edge inference, low latency, robotics, voice, IoT devices — things we want running on our phones without the cloud for convenience, performance, or privacy,” he said.
This 8B “sweet spot” is significant because 2025 saw an explosion in model distillation, where many enterprise companies are shrinking massive models into highly efficient smaller versions. While SRAM isn’t practical for the trillion-parameter “frontier” models, it is perfect for these smaller, high-velocity models.
Perhaps the most under-appreciated driver of this deal is Anthropic’s success in making its stack portable across accelerators.
The company has pioneered a portable engineering approach for training and inference — basically a software layer that allows its Claude models to run across multiple AI accelerator families — including Nvidia’s GPUs and Google’s Ironwood TPUs. Until recently, Nvidia’s dominance was protected because running high-performance models outside of the Nvidia stack was a technical nightmare. “It’s Anthropic,” Weka’s Bercovici told me. “The fact that Anthropic was able to … build up a software stack that could work on TPUs as well as on GPUs, I don’t think that’s being appreciated enough in the marketplace.”
(Disclosure: Weka has been a sponsor of VentureBeat events.)
Anthropic recently committed to accessing up to 1 million TPUs from Google, representing over a gigawatt of compute capacity. This multi-platform approach ensures the company isn’t held hostage by Nvidia’s pricing or supply constraints. So for Nvidia, the Groq deal is equally a defensive move. By integrating Groq’s ultra-fast inference IP, Nvidia is making sure that the most performance-sensitive workloads — like those running small models or as part of real-time agents — can be accommodated within Nvidia’s CUDA ecosystem, even as competitors try to jump ship to Google’s Ironwood TPUs. CUDA is the special software Nvidia provides to developers to integrate GPUs.
The timing of this Groq deal coincides with Meta’s acquisition of the agent pioneer Manus just two days ago. The significance of Manus was partly its obsession with statefulness.
If an agent can’t remember what it did 10 steps ago, it is useless for real-world tasks like market research or software development. KV Cache (Key-Value Cache) is the “short-term memory” that an LLM builds during the prefill phase.
Manus reported that for production-grade agents, the ratio of input tokens to output tokens can reach 100:1. This means for every word an agent says, it is “thinking” and “remembering” 100 others. In this environment, the KV Cache hit rate is the single most important metric for a production agent, Manus said. If that cache is “evicted” from memory, the agent loses its train of thought, and the model must burn massive energy to recompute the prompt.
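A small calculation shows why the hit rate dominates at that ratio. The sketch assumes, as a simplification, that any input token whose KV entries are missing from the cache must be fully recomputed; the output length is an arbitrary example.

```python
def prefill_tokens_needed(input_tokens: int, cache_hit_rate: float) -> int:
    """Input tokens that must be recomputed because their KV entries were not cached."""
    return round(input_tokens * (1.0 - cache_hit_rate))

OUTPUT_TOKENS = 1_000
INPUT_TOKENS = 100 * OUTPUT_TOKENS  # the 100:1 ratio Manus reported

for hit_rate in (0.0, 0.5, 0.9, 0.99):
    recompute = prefill_tokens_needed(INPUT_TOKENS, hit_rate)
    print(f"hit rate {hit_rate:.0%}: recompute {recompute:,} input tokens "
          f"for {OUTPUT_TOKENS:,} output tokens")
```

Even at a 90% hit rate, the agent in this example still recomputes ten input tokens for every token it emits.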
Groq’s SRAM can be a “scratchpad” for these agents (although, again, mostly for smaller models) because it allows for the near-instant retrieval of that state. Together with its Dynamo framework and KV Block Manager (KVBM), this lets Nvidia build an “inference operating system” that enables inference servers to tier this state across SRAM, DRAM, HBM, and flash-based offerings like that from Bercovici’s Weka.
Thomas Jorgensen, senior director of Technology Enablement at Supermicro, which specializes in building clusters of GPUs for large enterprise companies, told me in September that compute is no longer the primary bottleneck for advanced clusters. Feeding data to GPUs was the bottleneck, and breaking that bottleneck requires memory.
“The whole cluster is now the computer,” Jorgensen said. “Networking becomes an internal part of the beast … feeding the beast with data is becoming harder because the bandwidth between GPUs is growing faster than anything else.”
This is why Nvidia is pushing into disaggregated inference. By separating the workloads, enterprise applications can use specialized storage tiers to feed data at memory-class performance, while the specialized “Groq-inside” silicon handles the high-speed token generation.
We are entering an era of extreme specialization. For decades, incumbents could win by shipping one dominant general-purpose architecture — and their blind spot was often what they ignored on the edges. Intel’s long neglect of low-power is the classic example, Michael Stewart, managing partner of Microsoft’s venture fund M12, told me. Nvidia is signaling it won’t repeat that mistake. “If even the leader, even the lion of the jungle will acquire talent, will acquire technology — it’s a sign that the whole market is just wanting more options,” Stewart said.
For technical leaders, the message is to stop architecting your stack like it’s one rack, one accelerator, one answer. In 2026, advantage will go to the teams that label workloads explicitly — and route them to the right tier:
prefill-heavy vs. decode-heavy
long-context vs. short-context
interactive vs. batch
small-model vs. large-model
edge constraints vs. data-center assumptions
Your architecture will follow those labels. In 2026, “GPU strategy” stops being a purchasing decision and becomes a routing decision. The winners won’t ask which chip they bought — they’ll ask where every token ran, and why.
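In code, treating hardware choice as a routing decision might look something like the sketch below. The tier names and the routing rules are purely hypothetical; a real policy would depend on the hardware actually available and on measured latency and cost.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Workload:
    prefill_heavy: bool  # vs. decode-heavy
    long_context: bool   # vs. short-context
    interactive: bool    # vs. batch
    small_model: bool    # vs. large-model
    edge: bool           # vs. data center

def route(w: Workload) -> str:
    """Map explicit workload labels to a (hypothetical) hardware tier name."""
    if w.edge and w.small_model:
        return "edge-low-latency"
    if w.prefill_heavy and w.long_context:
        return "prefill-optimized"
    if w.interactive and not w.prefill_heavy:
        return "decode-optimized"
    return "general-gpu"

voice_bot = Workload(prefill_heavy=False, long_context=False,
                     interactive=True, small_model=True, edge=True)
code_review = Workload(prefill_heavy=True, long_context=True,
                       interactive=False, small_model=False, edge=False)
print(route(voice_bot), route(code_review))
```

Logging the label and the chosen tier for each request is one simple way to answer, after the fact, where every token ran and why.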
When initially experimenting with LLMs and agentic AI, software engineers at Notion AI applied advanced code generation, complex schemas, and heavy instructioning.
Quickly, though, trial and error taught the team that it could get rid of all of that complicated data modeling. Notion’s AI engineering lead Ryan Nystrom and his team pivoted to simple prompts, human-readable representations, minimal abstraction, and familiar markdown formats. The result was dramatically improved model performance.
Applying this re-wired approach, the AI-native company released V3 of its productivity software in September. Its notable feature: Customizable AI agents, which have quickly become Notion’s most successful AI tool to date. Based on usage patterns compared to previous versions, Nystrom calls it a “step function improvement.”
“It’s that feeling of when the product is being pulled out of you rather than you trying to push,” Nystrom explains in a VB Beyond the Pilot podcast. “We knew from that moment, really early on, that we had something. Now it’s, ‘How could I ever use Notion without this feature?’”
As a traditional software engineer, Nystrom was used to “extremely deterministic” experiences. But a light bulb moment came when a colleague advised him to simply describe his AI prompt as he would to a human, rather than codify rules of how agents should behave in various scenarios. The rationale: LLMs are designed to understand, “see” and reason about content the same way humans can.
“Now, whenever I’m working with AI, I will reread the prompts and tool descriptions and [ask myself] is this something I could give to a person with no context and they could understand what’s going on?” Nystrom said on the podcast. “If not, it’s going to do a bad job.”
Stepping back from “pretty complicated rendering” of data within Notion (such as JSON or XML), Nystrom and his team represented Notion pages as markdown, the popular device-agnostic markup language that defines structure and meaning using plain text without the need for HTML tags or formal editors. This allows the model to interact with, read, search and make changes to text files.
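As an illustration of the idea, the sketch below flattens a structured page into markdown before it is shown to a model. The block schema here is invented for the example and is not Notion's actual data format.

```python
from typing import Dict, List

def blocks_to_markdown(blocks: List[Dict]) -> str:
    """Render a list of simple page blocks (hypothetical schema) as markdown text."""
    lines = []
    for block in blocks:
        kind, text = block["type"], block["text"]
        if kind == "heading":
            lines.append("#" * block.get("level", 1) + " " + text)
        elif kind == "bullet":
            lines.append("- " + text)
        elif kind == "todo":
            lines.append(("- [x] " if block.get("checked") else "- [ ] ") + text)
        else:  # treat anything else as a plain paragraph
            lines.append(text)
    return "\n".join(lines)

page = [
    {"type": "heading", "level": 2, "text": "Launch plan"},
    {"type": "paragraph", "text": "Ship the agent beta."},
    {"type": "todo", "checked": True, "text": "Write docs"},
    {"type": "bullet", "text": "Collect feedback"},
]
markdown = blocks_to_markdown(page)
print(markdown)
```

The markdown version carries the same structure as the nested records, but in a form a person with no context could read at a glance, which is the test Nystrom applies to anything handed to the model.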
Ultimately, this required Notion to rewire its systems, with Nystrom’s team focusing largely on the middleware transition layer.
They also identified early on the importance of exercising restraint when it comes to context. It’s tempting to load as much information into a model as possible, but that can slow things down and confuse the model. For Notion, Nystrom described a 100,000 to 150,000 token limit as the “sweet spot.”
“There are cases where you can load tons and tons of content into your context window and the model will struggle,” he said. “The more you put into the context window, you do see a degradation in performance, latency, and also accuracy.”
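One simple way to enforce that kind of restraint is a token budget applied before the prompt is assembled. The sketch below assumes chunks arrive already sorted by priority and uses a crude four-characters-per-token estimate in place of a real tokenizer; the default budget reflects the low end of the range Nystrom cites.

```python
from typing import List

def estimate_tokens(text: str) -> int:
    """Very rough heuristic: about 4 characters per token (an assumption, not a tokenizer)."""
    return max(1, len(text) // 4)

def fit_to_budget(chunks: List[str], budget_tokens: int = 100_000) -> List[str]:
    """Keep chunks in priority order until the token budget would be exceeded."""
    kept, used = [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept

docs = ["a" * 400, "b" * 400, "c" * 400]  # roughly 100 tokens each
selected = fit_to_budget(docs, budget_tokens=250)
print(len(selected))
```

Stopping at the first chunk that doesn't fit keeps the highest-priority material intact rather than truncating something mid-thought.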
A spartan approach is also important in the case of tooling; this can help teams avoid the “slippery slope” of endless features, Nystrom advised. Notion focuses on a “curated menu” of tools rather than a voluminous Cheesecake Factory-like menu that creates a paradox of choice for users.
“When people ask for new features, we could just add a tool to the model or the agent,” he said. But, “the more tools we add, the more decisions the model has to make.”
The bottom line: Channel the model. Use APIs the way they were meant to be used. Don’t try to be fancy, don’t try to overcomplicate it. Use plain English.
Listen to the full podcast to hear about:
Why AI is still in the pre-BlackBerry, pre-iPhone era;
The importance of “dogfooding” in product development;
Why you shouldn’t worry about how cost effective your AI feature is in the early stages — that can be optimized later;
How engineering teams can keep tools minimal in the age of MCP;
Notion’s evolution from wikis to full-blown AI assistants.
Subscribe to Beyond the Pilot on Apple Podcasts, Spotify, and YouTube.