Adobe today launched its most ambitious AI offensive to date, unveiling the Firefly AI Assistant — a new agentic creative tool that can orchestrate complex, multi-step workflows across the company’s entire Creative Cloud suite from a single conversational interface — alongside a raft of new video, image, and collaboration features designed to position the company at the center of the rapidly evolving AI-powered content creation landscape.
The announcements, which also include a new Color Mode for Premiere Pro, the addition of Kling 3.0 video models to Firefly’s growing roster of third-party AI engines, and Frame.io Drive — a virtual filesystem that lets distributed teams work with cloud-stored media as though it lived on their local machines — represent Adobe’s clearest signal yet that it views agentic AI not as a feature upgrade but as a fundamental reshaping of how creative work gets done.
“We want creators to tell us the destination and let the Firefly assistant — with its deep understanding of all the Adobe professional tools and generative tools — bring the tools to you right in the conversation,” Alexandru Costin, Vice President of AI & Innovation at Adobe, told VentureBeat in an exclusive interview ahead of the launch.
The stakes could hardly be higher. Adobe is fighting to convince Wall Street, creative professionals, and a wave of well-funded AI-native competitors that its decades-old software empire can not only survive the generative AI revolution but lead it.
The centerpiece of today’s announcement is the Firefly AI Assistant, which Adobe describes as a fundamentally new way to interact with its creative tools. Rather than requiring users to manually navigate between Photoshop, Premiere Pro, Illustrator, Lightroom, Express, and other apps — selecting the right tool for each step of a complex project — the assistant lets creators describe an outcome in natural language. The agent then figures out which tools to invoke and in what order, then executes the workflow.
The assistant is the productized version of Project Moonlight, a research prototype Adobe first previewed at its annual MAX conference in the fall of 2025 and subsequently refined through a private beta. “This is basically [Project] Moonlight,” Costin confirmed to VentureBeat. “We started with all the learnings from Moonlight, and we engaged with customers. We looked internally. We evolved that architecture to make it more ambitious.”
Under the hood, Adobe says it has assembled roughly 100 tools and skills that the assistant can call upon, spanning generative image and video creation, precision photo editing, layout adaptation, and even stakeholder review through Frame.io. The system is built around a single conversational interface inside the Firefly web app where users describe what they want and the assistant maintains context across sessions. Pre-built Creative Skills — purpose-built, multi-step workflow templates such as portrait retouching or social media asset generation — can be run from a single prompt and customized to match a creator’s own style. The assistant also learns a creator’s preferred tools, workflows, and aesthetic choices over time, and understands the content type being worked on — image, video, vector, brand assets — to make context-aware decisions.
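Adobe has not published the assistant's internals, but the orchestration pattern it describes — a natural-language goal in, an ordered sequence of tool calls out, with Creative Skills acting as reusable multi-step templates — can be sketched abstractly. Everything in the snippet below is a hypothetical illustration; the skill and tool names are invented and are not Adobe APIs:

```python
# Hypothetical illustration of goal -> ordered tool plan.
# Skill names and tool steps are invented, not Adobe APIs.
SKILL_REGISTRY = {
    "retouch portrait": ["open_image", "smooth_skin", "adjust_lighting"],
    "social asset": ["generate_image", "resize_for_platform", "add_caption"],
}

def plan(goal: str) -> list:
    """Match the request against a Creative-Skill-style template
    and return the ordered tool steps it would expand into."""
    for skill, steps in SKILL_REGISTRY.items():
        if skill in goal.lower():
            return steps
    # No template matched: a real agent would plan from scratch
    # or ask a follow-up question.
    return ["ask_user_to_clarify"]

print(plan("Please retouch portrait shots from yesterday's shoot"))
# → ['open_image', 'smooth_skin', 'adjust_lighting']
```

The point of the sketch is the shape of the system, not the specifics: a prompt selects a multi-step workflow, and each step maps onto an existing tool rather than a monolithic model call.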
Crucially, outputs use native Adobe file formats — PSD, AI, PRPROJ — meaning users can take any result into the corresponding flagship app for manual, pixel-level refinement at any point. “We always imagine this continuum where you can have complete conversational edits and pixel-perfect edits, and you can decide, as a creative, where you want to land,” Costin said. The Firefly AI Assistant will enter public beta in the coming weeks, though Adobe did not specify an exact date.
For a company whose AI monetization story has faced persistent skepticism from investors, the pricing structure of the Firefly AI Assistant will be closely watched. Costin told VentureBeat that, at launch, using the assistant will require an active Adobe subscription that includes the relevant apps — meaning users who want the agent to invoke Photoshop cloud capabilities, for instance, will need an entitlement that includes the Photoshop SKU. Generative actions will consume the user’s existing pool of generative credits, consistent with how Firefly credits work across the rest of Adobe’s platform.
“To use some of these cloud capabilities from Photoshop and other apps, you need to have a subscription that includes access to the Photoshop SKU,” Costin explained. “You’ll be consuming your credits when you use generative features.” He acknowledged, however, that the model could evolve: “As we better understand the value of this — and the costs of operating the brain, the conversation engine — things might change.”
The question of whether Adobe can convert AI enthusiasm into meaningful revenue growth is anything but theoretical. When Adobe reported its most recent quarterly results in March, it touted 10% year-over-year revenue growth to $6.4 billion and disclosed that annual recurring revenue from AI standalone and add-on products had reached $125 million — a figure CEO Shantanu Narayen projected would double within nine months.
Alongside the assistant, Adobe is expanding Firefly’s roster of third-party AI models to include Kling 3.0 and Kling 3.0 Omni, two video generation models developed by Kuaishou, the Chinese technology company. Kling 3.0 focuses on fast, high-quality production with smart storyboarding and audio-visual sync, while the Omni variant adds professional controls for shot duration, camera angle, and character movement across multi-shot sequences. The additions bring Firefly’s model count to more than 30, joining Google’s Nano Banana 2 and Veo 3.1, Runway’s Gen-4.5, Luma AI’s Ray3.14, Black Forest Labs’ FLUX.2[pro], ElevenLabs’ Multilingual v2, and others.
When asked whether Adobe had concerns about integrating a model from a Chinese tech company given the current geopolitical climate, Costin was direct: “We think choice is what we want to offer our customers.” He explained that Adobe’s strategy distinguishes between its own commercially safe, first-party Firefly models — trained on licensed Adobe Stock imagery and public domain content — and third-party partner models, which carry different commercial safety profiles. “For some use cases, like ideation, non-production use cases, we got requests from customers to support some external models,” Costin said. “If I’m in ideation, I might be more flexible with commercial safety. When I go into production, I’d want to have a model that gives you more confidence.”
This raises an important nuance for the agentic era. When the Firefly AI Assistant autonomously selects which model to use for a given task, the commercial safety guarantees may vary depending on which engine it invokes. Costin pointed to Adobe’s Content Credentials system — the metadata-and-fingerprinting framework developed through the Content Authenticity Initiative — as the mechanism for maintaining transparency. “The agentic power — and the fact that the assistant has access to all of those models — means it could decide to use a model that carries different content credentials,” he acknowledged. “But with the transparency of content credentials, the user will know how a particular piece of content was created and can decide whether that’s commercially safe or not.” Adobe offers commercial indemnity for its first-party Firefly models but applies different indemnity levels for third-party models — a distinction that enterprise buyers, in particular, will need to carefully evaluate.
Adobe’s agentic ambitions also intersect with its strategic partnership with Nvidia, announced earlier this year at Nvidia’s GTC conference. When asked whether the Firefly AI Assistant’s agentic capabilities are built on Nvidia’s agent toolkit and NeMo infrastructure, Costin revealed that the collaboration is active but has not yet made it into a shipping product.
“We’re in active discussions — investigating not only Nemotron,” Costin said. “They have this technology called Open Shell and Nemo Claw, which give us the ability to efficiently run long-running agentic workflows in a sandboxed environment.” He said the technology would become increasingly important as Adobe pushes the assistant to handle longer, more autonomous creative tasks — but cautioned that “it’s not shipping yet. It’s being actively explored.”
For Nvidia, which is building an ecosystem of enterprise AI agent platforms with partners like Adobe, Salesforce, and SAP, the partnership could eventually serve as a high-profile proof point for its agent infrastructure stack in the creative vertical. For Adobe, the ability to run complex, long-duration agentic workflows efficiently and securely in sandboxed environments could be the technical foundation that separates the Firefly AI Assistant from lighter-weight chatbot integrations offered by competitors. The partnership also signals Adobe’s recognition that the computational demands of agentic AI — where a single user request may trigger dozens of model calls and tool invocations — require infrastructure partnerships that go well beyond what a software company can build alone.
Beyond the headline AI assistant announcement, Adobe’s broader set of updates reflects a company trying to strengthen its position across every phase of the content creation pipeline. Color Mode in Premiere Pro may be the most significant near-term upgrade for working editors. Entering public beta today, Color Mode is described as a first-of-its-kind color grading experience built specifically for the way editors — rather than dedicated colorists — think and work. Adobe notes that it was developed through an extensive private beta with hundreds of working editors, and that participants reported they “actually enjoy color grading” — a sentiment suggesting Adobe may have found a way to democratize one of post-production’s most intimidating disciplines. General availability is expected later in 2026.
The Firefly Video Editor gains audio upgrades including the Enhance Speech feature migrated from Premiere Pro and Adobe Podcast, direct Adobe Stock integration with access to more than 800 million licensed assets, and simple color adjustment controls with intuitive sliders and one-click looks. On the image editing front, Adobe introduced Precision Flow, which generates a range of semantic variations from a single prompt and lets users browse them via an interactive slider — a novel approach that Costin described as “the best slider-based control mixed with the best semantic understanding of not only the existing scene, but what the scene could be.” AI Markup complements this by letting users draw directly on images to specify where and how edits should be applied. After Effects 26.2 adds an AI-powered Object Matte tool that dramatically accelerates rotoscoping and masking: editors can create accurate mattes of moving subjects with a hover and a click, refine selections with a Quick Selection brush, and perfect edges with a Refine Edge tool.
Rounding out the announcements, Frame.io Drive addresses one of the most persistent pain points in distributed video production: getting media from point A to point B without losing hours — or days — to downloads, syncing, and shipped hard drives. Frame.io Drive is a desktop application that mounts Frame.io projects to a user’s computer so media appears in Finder or Explorer and behaves like local files. The underlying technology, called Frame.io Mounted Storage, streams media on demand as applications request it, while local caching ensures smooth playback. The product builds on streaming technology provided by Suite Studios, and the real-time file access capability is included with every Frame.io account. Adobe emphasized that all content lives solely within Frame.io and is never shared with third parties.
The move positions Frame.io not just as a review-and-approval tool at the end of the production pipeline but as the central media layer from the very beginning of a project — from first capture through final delivery. If successful, the strategy could significantly deepen Adobe’s lock-in with professional video teams by making Frame.io the single source of truth for distributed productions. Frame.io Drive and Mounted Storage will roll out in phases, with Enterprise customers gaining access starting today and accounts on other plans following shortly. Others can join a waitlist.
Taken together, today’s announcements paint a picture of a company executing aggressively across multiple fronts — but also one that is navigating a complex moment. Adobe first introduced Firefly in March 2023 as a family of generative AI models focused on image and text effects, with a strong emphasis on commercial safety through training on licensed Adobe Stock content. In the two years since, the company has rapidly expanded into video generation, multi-model access, and now agentic workflows — a trajectory that mirrors the broader industry’s shift from standalone AI features to AI-native systems.
But the competitive field has grown dramatically. Runway, Pika, and a host of AI-native video generation startups have captured mindshare among creators. Canva has aggressively integrated AI into its design platform. And the emergence of powerful foundation models from OpenAI, Google, and Anthropic — the latter of which Adobe says it will integrate into the Firefly AI Assistant — means the barrier to building creative AI tools has never been lower. Adobe is also navigating these product ambitions against a complex corporate backdrop: the impending departure of CEO Shantanu Narayen, an actively exploited zero-day vulnerability in Acrobat Reader (CVE-2026-34621) that hackers had used for months before it was patched this week, a U.K. antitrust investigation over cancellation fees, and a recent $75 million lawsuit settlement.
Adobe’s response, articulated clearly through today’s launches, is to lean into what it believes is its deepest moat: the integration of AI into a set of professional-grade, category-leading applications that no startup can replicate overnight. Costin framed the agentic transition as empowering rather than threatening to creative professionals, comparing Creative Skills to a next-generation version of Photoshop Actions — the macro-recording feature that has long allowed power users to automate repetitive tasks. “We want to help our customers become — from the ones doing all the work — to be creative directors, doing some of the work, but most importantly, guiding the assistant in executing some of those creative visions,” he said.
It is a compelling pitch — and, in its own way, a revealing one. For three decades, Adobe made its fortune by selling the tools that turned creative vision into finished pixels. Now it is asking its customers to let an AI agent handle more of that translation, trusting that the human role will shift from operating the tools to directing the outcome. Whether creators embrace that bargain — and whether Wall Street rewards it — will determine not just Adobe’s trajectory but the shape of an entire industry learning to create alongside machines.
Microsoft today launched MAI-Image-2-Efficient, a lower-cost, higher-speed variant of its flagship text-to-image model that the company says delivers production-ready quality at nearly half the price. The release, available immediately in Microsoft Foundry and MAI Playground with no waitlist, marks the fastest turnaround yet from Microsoft’s in-house AI superintelligence team — and the clearest signal that Redmond is serious about building a self-sufficient AI stack that doesn’t depend on OpenAI.
The new model is priced at $5 per million text input tokens and $19.50 per million image output tokens. The input price matches MAI-Image-2; the output price is a roughly 41% cut from the flagship’s $33 rate. Microsoft says the model runs 22% faster than its flagship sibling and achieves 4x greater throughput efficiency per GPU, as measured on NVIDIA H100 hardware at 1024×1024 resolution. The company also claims it outpaces competing hyperscaler models — specifically naming Google’s Gemini 3.1 Flash, Gemini 3.1 Flash Image, and Gemini 3 Pro Image — by an average of 40% on p50 latency benchmarks.
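The headline percentage checks out against the quoted list prices; a quick sketch of the arithmetic, using only the dollar figures above:

```python
# Quoted per-million-token prices (USD) from the announcement.
flagship_output = 33.00    # MAI-Image-2, image output
efficient_output = 19.50   # MAI-Image-2-Efficient, image output

# Output-tier price reduction; text input pricing is unchanged at $5.
reduction = (flagship_output - efficient_output) / flagship_output * 100
print(f"{reduction:.1f}% cheaper per million image output tokens")
# → 40.9% cheaper per million image output tokens
```

That 40.9% figure is what Microsoft rounds to "nearly half the price" for output-heavy workloads.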
The model is also rolling out across Copilot and Bing, Microsoft said, with additional product surfaces to follow.
Microsoft is positioning MAI-Image-2-Efficient and its flagship MAI-Image-2 as complementary tools rather than replacements for each other — a tiered pairing designed to cover the full spectrum of enterprise image generation needs.
MAI-Image-2-Efficient targets high-volume, cost-sensitive production workloads: product photography, marketing creative, UI mockups, branded asset pipelines, and real-time interactive applications. It handles short-form in-image text like headlines and labels cleanly, according to Microsoft, and is built to operate within the tight latency and budget constraints of batch processing environments. MAI-Image-2, meanwhile, remains the company’s precision instrument — the model you reach for when the brief demands the highest photorealistic fidelity, complex stylization like anime or illustration, or longer, more intricate in-image typography. Microsoft is effectively telling enterprise customers: use the efficient model for your assembly line, and the flagship for your showcase.
This approach mirrors pricing strategies that have worked across the AI industry — OpenAI’s GPT model tiers, Anthropic’s Haiku-Sonnet-Opus lineup, Google’s Flash-Pro distinction — but applies it specifically to image generation, a domain where cost-per-image economics can make or break production deployment at scale.
The speed of this release deserves attention. MAI-Image-2 itself only debuted on MAI Playground on March 19, as VentureBeat previously reported, with broader availability through Microsoft Foundry arriving on April 2 alongside two other new foundation models: MAI-Transcribe-1 (a speech-to-text model supporting 25 languages) and MAI-Voice-1 (an audio generation model). Less than a month later, Microsoft has shipped an optimized production variant.
That cadence suggests the MAI Superintelligence team — the research group led by Mustafa Suleyman, CEO of Microsoft AI, that was formed in November 2025 — is operating more like a startup shipping iterative products than a traditional corporate research lab publishing papers. When Suleyman wrote in his April 2 blog post that the team was “building Humanist AI” with a focus on “optimizing for how people actually communicate, training for practical use,” he appears to have meant it literally: the models aren’t just shipping, they’re shipping fast enough to have product roadmaps.
The early reception for MAI-Image-2 has been notably positive. Decrypt reported in its hands-on review that the model had already reached the No. 3 position on the Arena.ai leaderboard for image generation, trailing only Google and OpenAI. Decrypt’s reviewer noted that the model’s photorealism was “a real strength” and that its text rendering was “a legitimate highlight” that “handled complex typography with far more consistency than we expected.” The review also found that in some direct comparisons, MAI-Image-2 outperformed OpenAI’s GPT-Image on image quality and text rendering despite sitting below it on the leaderboard — an observation that underscores how benchmark rankings don’t always capture real-world utility.
That said, the original model shipped with significant constraints that Decrypt flagged: a 30-second cooldown between generations, a 15-image daily cap in the native UI, only 1:1 aspect ratio output, no image-to-image capabilities, and aggressive content filtering that blocked even innocuous creative prompts. Whether MAI-Image-2-Efficient inherits or relaxes any of these limitations isn’t addressed in today’s announcement, and enterprise customers accessing the model through the Foundry API will likely face different constraints than playground users.
Today’s launch cannot be understood in isolation. It arrives at a moment when the relationship between Microsoft and OpenAI — once the defining partnership of the generative AI era — is visibly fraying at the seams.
Just yesterday, CNBC reported that OpenAI’s newly appointed chief revenue officer, Denise Dresser, sent an internal memo to staff explicitly stating that the Microsoft partnership “has also limited our ability to meet enterprises where they are.” The memo reportedly touted OpenAI’s new alliance with Amazon Web Services and the Bedrock platform as a key growth driver, describing inbound customer demand as “frankly staggering” since the partnership was announced in late February. Microsoft added OpenAI to its list of competitors in its annual report in mid-2024. OpenAI, meanwhile, has diversified its cloud infrastructure across CoreWeave, Google, and Oracle, reducing its dependence on Microsoft Azure.
The MAI model family is the most tangible expression of Microsoft’s side of that strategic uncoupling. When Microsoft can generate production-quality images with its own model at $19.50 per million output tokens, the calculus for continuing to license OpenAI’s image models — and paying OpenAI a share of the resulting revenue — shifts dramatically. Every MAI model that reaches production quality is a line item that Microsoft can potentially move off OpenAI’s balance sheet and onto its own.
The organizational infrastructure to support this shift is already in place. On March 17, as disclosed in communications posted on Microsoft’s official blog, CEO Satya Nadella announced a sweeping reorganization that unified the company’s consumer and commercial Copilot efforts under a single leadership team, with Jacob Andreou elevated to EVP of Copilot reporting directly to Nadella. Critically, the reorganization also refocused Suleyman’s role. As Nadella wrote in his message to employees, the company is “doubling down on our superintelligence mission with the talent and compute to build models that have real product impact, in terms of evals, COGS reduction, as well as advancing the frontier.” That phrase — “COGS reduction” — is corporate-speak for reducing the cost of goods sold, and it points directly to the economic motivation behind models like MAI-Image-2-Efficient. Every dollar Microsoft saves by using its own models instead of licensing from partners flows straight to gross margin.
There’s one more dimension that makes today’s release strategically significant, and it may be the most important one: the rise of AI agents.
TechCrunch reported yesterday that Microsoft is testing ways to integrate OpenClaw-like features into Microsoft 365 Copilot, building toward an always-on agent that can execute multi-step tasks over extended periods. The company has also launched Copilot Cowork (an agent that takes actions within Microsoft 365 apps), Copilot Tasks (an agent for completing multi-step personal productivity tasks), and Agent 365 (referenced in Nadella’s March reorganization memo). Microsoft is expected to showcase these agentic capabilities at its Build conference in June.
In an agentic world — where AI systems don’t just answer questions but execute complex workflows autonomously — image generation becomes a primitive that agents call programmatically, not a standalone product that users interact with manually. An enterprise agent building a marketing campaign might need to generate dozens of product images, create social media assets, produce presentation graphics, and iterate on design concepts, all without human intervention at each step. The economics of that workflow are governed entirely by per-token pricing and latency, which is precisely what MAI-Image-2-Efficient optimizes for. If Microsoft’s vision for Copilot involves agents that generate images as a routine subtask within larger workflows, those agents need image generation that’s fast enough to not create bottlenecks and cheap enough to not blow up cost projections when called thousands of times per day. The 4x efficiency improvement and 41% price cut aren’t just nice marketing numbers — they’re architectural requirements for the agentic future Microsoft is betting the company on.
Several important questions remain unaddressed by today’s announcement. Microsoft didn’t disclose whether MAI-Image-2-Efficient resolves the aspect ratio limitations and aggressive content filtering that reviewers flagged in the original model. The company also didn’t specify whether the quality-to-speed tradeoffs involve visible degradation on complex prompts — the announcement describes “production-ready quality” and “flagship quality” interchangeably, but distillation models of any kind typically involve some quality concession.
The footnotes in the press release also reveal the narrow conditions under which the benchmark claims were tested: efficiency figures were measured on NVIDIA H100 at 1024×1024 with “optimized batch sizes and matched latency targets,” and the latency comparisons against Google models were conducted at p50 (median) rather than p95 or p99, which would capture worst-case performance. Enterprise customers running diverse workloads at varying concurrency levels may see different results. MAI Playground is currently available only in select markets, including the U.S., with EU availability listed as “coming soon.” Copilot integration is underway but not complete. And the enterprise API through Foundry, while live, is still in early deployment.
But the trajectory is unmistakable. In less than five months since the MAI Superintelligence team was announced, Microsoft has shipped a flagship image model, three additional foundation models, and now a cost-optimized production variant — all while reorganizing its entire Copilot organization, navigating a fracturing relationship with its most important AI partner, and laying the groundwork for agentic AI features that could redefine enterprise productivity. Whether all of that is fast enough to catch Anthropic’s momentum, contain OpenAI’s drift toward Amazon, and justify a $600 price target is the multi-hundred-billion-dollar question. But for a company that spent the first two years of the generative AI era mostly reselling someone else’s technology, Microsoft is now doing something it hasn’t done in a long time in AI: shipping its own work, on its own schedule, at its own price — and daring the market to keep up.
Data teams building AI agents keep running into the same failure mode: questions that require joining structured data with unstructured content (sales figures alongside customer reviews, or citation counts alongside academic papers) break single-turn RAG systems.
New research from Databricks puts a number on that failure gap. The company’s AI research team tested a multi-step agentic approach against state-of-the-art single-turn RAG baselines across nine enterprise knowledge tasks, reporting gains of 20% or more on Stanford’s STaRK benchmark suite and consistent improvement across Databricks’ own KARLBench evaluation framework. The results make the case that the performance gap between single-turn RAG and multi-step agents on hybrid data tasks is an architectural problem, not a model quality problem.
The work builds on Databricks’ earlier instructed retriever research, which showed retrieval improvements on unstructured data using metadata-aware queries. This latest research adds structured data sources, relational tables and SQL warehouses, into the same reasoning loop, addressing the class of questions enterprises most commonly fail to answer with current agent architectures.
“RAG works, but it doesn’t scale,” Michael Bendersky, research director at Databricks, told VentureBeat. “If you want to make your agent even better, and you want to understand why you have declining sales, now you have to help the agent see the tables and look at the sales data. Your RAG pipeline will become incompetent at that task.”
The core finding is that standard RAG systems fail when a query mixes a precise structured filter with an open-ended semantic search.
Consider a question like “Which of our products have had declining sales over the past three months, and what potentially related issues are brought up in customer reviews on various seller sites?” The sales data lives in a warehouse. The review sentiment lives in unstructured documents across seller sites. A single-turn RAG system cannot split that query, route each half to the right data source and combine the results.
To confirm this is an architecture problem rather than a model quality problem, Databricks reran published STaRK baselines using a current state-of-the-art foundation model. The stronger model still lost to the multi-step agent by 21% on the academic domain and 38% on the biomedical domain.
STaRK is a benchmark published by Stanford researchers covering three semi-structured retrieval domains: Amazon product data, the Microsoft Academic Graph and a biomedical knowledge base.
Databricks built the Supervisor Agent as the production implementation of this research approach, and its architecture illustrates why the gains are consistent across task types. The approach includes three core steps:
Parallel tool decomposition. Rather than issuing one broad query and hoping the results cover both structured and unstructured needs, the agent fires SQL and vector search calls simultaneously, then analyzes the combined results before deciding what to do next. That parallel step is what allows it to handle queries that cross data type boundaries without requiring the data to be normalized first.
Self-correction. When an initial retrieval attempt hits a dead end, the agent detects the failure, reformulates the query and tries a different path. On a STaRK benchmark task that requires finding a paper by an author with exactly 115 prior publications on a specific topic, the agent first queries both SQL and vector search in parallel. When the two result sets show no overlap, it adapts and issues a SQL JOIN across both constraints, then calls the vector search system to verify the result before returning the answer.
Declarative configuration. The agent is not tuned to any specific dataset or task. Connecting it to a new data source means writing a plain-language description of what that source contains and what kinds of questions it should answer. No custom code is required.
“The agent can do things like decomposing the question into a SQL query and a search query out of the box,” Bendersky said. “It can combine the results of SQL and RAG, reason about those results, make follow-up queries and then reason about whether the final answer was actually found.”
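Databricks has not published the Supervisor Agent's code, but the first two steps above — parallel decomposition and self-correction — can be sketched in miniature. Every name below is a hypothetical stand-in (the stub tools return canned results), intended only to show the control flow, not Databricks' implementation:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in tools: a real deployment would wrap a SQL
# warehouse and a vector index behind interfaces like these.
def sql_tool(query: str) -> set:
    # Pretend warehouse hits, e.g. papers whose author has
    # exactly 115 prior publications.
    return {"paper_42", "paper_77"}

def vector_tool(query: str) -> set:
    # Pretend semantic-index hits for the topic constraint.
    return {"paper_77", "paper_99"}

def supervisor(question: str) -> set:
    # Step 1: parallel tool decomposition. Fire the structured and
    # the semantic query at the same time, then compare results.
    with ThreadPoolExecutor() as pool:
        sql_future = pool.submit(sql_tool, question)
        vec_future = pool.submit(vector_tool, question)
        sql_hits, vec_hits = sql_future.result(), vec_future.result()

    # Step 2: self-correction. If the two result sets agree, return
    # the overlap; otherwise reformulate (sketched here as a joined
    # constraint) and verify candidates against the semantic index.
    overlap = sql_hits & vec_hits
    if overlap:
        return overlap
    joined = sql_tool(question + " (joined constraints)")
    return {hit for hit in joined if hit in vector_tool(question)}

print(supervisor("paper on topic X by the author with 115 publications"))
# → {'paper_77'}
```

The STaRK example Databricks describes follows the second branch: the parallel results don't overlap, so the agent issues a SQL JOIN across both constraints and verifies against vector search before answering.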
Being able to source information from both structured and unstructured data isn’t an entirely new concept.
LlamaIndex, LangChain and Microsoft Fabric agents all offer some form of hybrid retrieval. Bendersky draws a distinction in how the Databricks approach frames the problem architecturally.
“We almost don’t see it as a hybrid retrieval where you combine embeddings and search results, or embeddings and tables,” he said. “We see this more as an agent that has access to multiple tools.”
The practical consequence of that framing is that adding a new data source means connecting it to the agent and writing a description of what it contains. The agent handles routing and orchestration without additional code.
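What such a declarative registration might look like can be sketched as follows. The schema and field names here are illustrative assumptions, not a Databricks format; the essential idea is that each source carries a plain-language description the agent reads when routing a query:

```python
# Illustrative source registrations: each entry is a plain-language
# description the agent uses to decide where to route a query.
# Field names are invented for this sketch.
DATA_SOURCES = [
    {
        "name": "sales_warehouse",
        "kind": "sql",
        "description": "Monthly sales figures per product and region. "
                       "Use for precise filters, aggregates, and trends.",
    },
    {
        "name": "review_index",
        "kind": "vector_search",
        "description": "Customer reviews scraped from seller sites. "
                       "Use for open-ended sentiment and issue discovery.",
    },
]

def describe_tools(sources: list) -> str:
    """Render the registrations into the tool prompt the agent sees."""
    return "\n".join(f"- {s['name']} ({s['kind']}): {s['description']}"
                     for s in sources)

print(describe_tools(DATA_SOURCES))
```

Adding a third source under this pattern means appending one more entry, not writing retrieval code, which is the "configuration problem rather than an engineering one" framing the research lands on.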
Custom RAG pipelines require data to be converted into a format the retrieval system can read, typically text chunks with embeddings: SQL tables have to be flattened, and JSON has to be normalized. Every new data source added to the pipeline means more conversion work. Databricks’ research argues that as enterprise data grows to include more source types, that burden makes custom pipelines increasingly impractical compared to an agent that queries each source in its native format.
“Just bring the agent to the data,” Bendersky said. “You basically give the agent more sources, and it will learn to use them pretty well.”
For data engineers evaluating whether to build custom RAG pipelines or adopt a declarative agent framework, the research offers a clear direction: if the task involves questions that span structured and unstructured data, building custom retrieval is the harder path. The research found that across all tested tasks, the only things that differed between deployments were instructions and tool descriptions. The agent handled the rest.
The practical limits are real but manageable. The approach works well with five to ten data sources. Adding too many at once, without curating which sources are complementary rather than contradictory, makes the agent slower and less reliable. Bendersky recommends scaling incrementally and verifying results at each step rather than connecting all available data upfront.
Data accuracy is a prerequisite. The agent can query across mismatched formats, JSON review feeds alongside SQL sales tables, without requiring normalization. It cannot fix source data that is factually wrong. Adding a plain-language description of each data source at ingestion time helps the agent route queries correctly from the start.
The research positions this as an early step in a longer trajectory. As enterprise AI workloads mature, agents will be expected to reason across dozens of source types, including dashboards, code repositories and external data feeds. The research argues the declarative approach is what makes that scaling tractable, because adding a new source stays a configuration problem rather than an engineering one.
“This is kind of like a ladder,” Bendersky said. “The agent will slowly get more and more information and then slowly improve overall.”
When the One Big Beautiful Bill arrived as a 900-page unstructured document — with no standardized schema, no published IRS forms, and a hard shipping deadline — Intuit’s TurboTax team had a question: could AI compress a months-long implementation into days without sacrificing accuracy?
What they built to do it is less a tax story than a template: a workflow combining commercial AI tools, a proprietary domain-specific language and a custom unit test framework that any domain-constrained development team can learn from.
Joy Shaw, director of tax at Intuit, has spent more than 30 years at the company and lived through both the Tax Cuts and Jobs Act and the OBBB. “There was a lot of noise in the law itself and we were able to pull out the tax implications, narrow it down to the individual tax provisions, narrow it down to our customers,” Shaw told VentureBeat. “That kind of distillation was really fast using the tools, and then enabled us to start coding even before we got forms and instructions in.”
When the Tax Cuts and Jobs Act passed in 2017, the TurboTax team worked through the legislation without AI assistance. It took months, and the accuracy requirements left no room for shortcuts.
“We used to have to go through the law and we’d code sections that reference other law code sections and try and figure it out on our own,” Shaw said.
The OBBB arrived with the same accuracy requirements but a different profile. At 900-plus pages, it was structurally more complex than the TCJA. It came as an unstructured document with no standardized schema. The House and Senate versions used different language to describe the same provisions. And the team had to begin implementation before the IRS had published official forms or instructions.
The question was whether AI tools could compress the timeline without compromising the output. The answer required a specific sequence and tooling that did not exist yet.
The OBBB was still moving through Congress when the TurboTax team began working on it. Using large language models, the team summarized the House version, then the Senate version and then reconciled the differences. Both chambers referenced the same underlying tax code sections, a consistent anchor point that let the models draw comparisons across structurally inconsistent documents.
By signing day, the team had already filtered the provisions down to those affecting TurboTax customers, narrowed by specific tax situations and customer profiles. Parsing, reconciliation and provision filtering moved from weeks to hours.
Those tasks were handled by ChatGPT and general-purpose LLMs. But those tools hit a hard limit when the work shifted from analysis to implementation. TurboTax does not run on a standard programming language. Its tax calculation engine is built on a proprietary domain-specific language maintained internally at Intuit. Any model generating code for that codebase has to translate legal text into syntax it was never trained on, and identify how new provisions interact with decades of existing code without breaking what already works.
Claude became the primary tool for that translation and dependency-mapping work. Shaw said it could identify what changed and what did not, letting developers focus only on the new provisions.
“It’s able to integrate with the things that don’t change and identify the dependencies on what did change,” she said. “That sped up the process of development and enabled us to focus only on those things that did change.”
General-purpose LLMs got the team to working code. Getting that code to shippable quality required two proprietary tools built during the OBBB cycle.
The first auto-generated TurboTax product screens directly from the law changes. Previously, developers curated those screens individually for each provision. The new tool handled the majority automatically, with manual customization only where needed.
The second was a purpose-built unit test framework. Intuit had always run automated tests, but the previous system produced only pass/fail results. When a test failed, developers had to manually open the underlying tax return data file to trace the cause.
“The automation would tell you pass, fail, you would have to dig into the actual tax data file to see what might have been wrong,” Shaw said. The new framework identifies the specific code segment responsible, generates an explanation and allows the correction to be made inside the framework itself.
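The diagnosis pattern Shaw describes can be sketched in Python with an invented calculation; Intuit's framework operates on its proprietary DSL, so every name and rule here is hypothetical. The point is the shape of the output: not a bare pass/fail, but the responsible segment plus an explanation.

```python
# Hypothetical sketch of pass/fail-plus-diagnosis testing. The deduction
# rule is invented for illustration, not an actual tax provision.

def deduction_segment(income):
    # Invented rule: 20% of income, capped at 10,000.
    return min(income * 0.2, 10_000)

def run_case(income, expected):
    got = deduction_segment(income)
    if got == expected:
        return {"status": "pass"}
    # On failure, name the code segment and explain the mismatch, so a
    # developer need not dig through the underlying tax data file.
    return {
        "status": "fail",
        "segment": "deduction_segment",
        "explanation": f"expected {expected}, got {got} for income={income}",
    }
```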
Shaw said accuracy for a consumer tax product has to be close to 100 percent. Sarah Aerni, Intuit’s VP of technology for the Consumer Group, said the architecture has to produce deterministic results.
“Having the types of capabilities around determinism and verifiably correct through tests — that’s what leads to that sort of confidence,” Aerni said.
The tooling handles the speed. But Intuit also uses LLM-based evaluation tools to validate AI-generated output, and even those require a human tax expert to assess whether the result is correct. “It comes down to having human expertise to be able to validate and verify just about anything,” Aerni said.
The OBBB was a tax problem, but the underlying conditions are not unique to tax. Healthcare, financial services, legal tech and government contracting teams regularly face the same combination: complex regulatory documents, hard deadlines, proprietary codebases, and near-zero error tolerance.
Based on Intuit’s implementation, four elements of the workflow are transferable to other domain-constrained development environments:
Use commercial LLMs for document analysis. General-purpose models handle parsing, reconciliation and provision filtering well. That is where they add speed without creating accuracy risk.
Shift to domain-aware tooling when analysis becomes implementation. General-purpose models generating code into a proprietary environment without understanding it will produce output that cannot be trusted at scale.
Build evaluation infrastructure before the deadline, not during the sprint. Generic automated testing produces pass/fail outputs. Domain-specific test tooling that identifies failures and enables in-context fixes is what makes AI-generated code shippable.
Deploy AI tools across the whole organization, not just engineering. Shaw said Intuit trained and monitored usage across all functions. AI fluency was distributed across the organization rather than concentrated in early adopters.
“We continue to lean into the AI and human intelligence opportunity here, so that our customers get what they need out of the experiences that we build,” Aerni said.
AI agents are built to work with file systems, using standard tools to navigate directories and read files by path.
The challenge is that a great deal of enterprise data lives in object storage systems, notably Amazon S3. Object stores serve data through API calls, not file paths. Bridging that gap has traditionally meant running a separate file system layer alongside S3, duplicating data and maintaining sync pipelines to keep the two aligned.
The rise of agentic AI makes that challenge even harder, and it was affecting Amazon’s own ability to get things done. Engineering teams at AWS using tools like Kiro and Claude Code kept running into the same problem: Agents defaulted to local file tools, but the data was in S3. Downloading it locally worked until the agent’s context window compacted and the session state was lost.
Amazon’s answer is S3 Files, which mounts any S3 bucket directly into an agent’s local environment with a single command. The data stays in S3, with no migration required. Under the hood, AWS connects its Elastic File System (EFS) technology to S3 to deliver full file system semantics, not a workaround. S3 Files is available now in most AWS Regions.
“By making data in S3 immediately available, as if it’s part of the local file system, we found that we had a really big acceleration with the ability of things like Kiro and Claude Code to be able to work with that data,” Andy Warfield, VP and distinguished engineer at AWS, told VentureBeat.
S3 was built for durability, scale and API-based access at the object level. Those properties made it the default storage layer for enterprise data. But they also created a fundamental incompatibility with the file-based tools that developers and agents depend on.
“S3 is not a file system, and it doesn’t have file semantics on a whole bunch of fronts,” Warfield said. “You can’t do a move, an atomic move of an object, and there aren’t actually directories in S3.”
Previous attempts to bridge that gap relied on FUSE (Filesystems in USErspace), a software layer that lets developers mount a custom file system in user space without changing the underlying storage. Tools like AWS’s own Mount Point, Google’s gcsfuse and Microsoft’s blobfuse2 all used FUSE-based drivers to make their respective object stores look like a file system.
The problem, Warfield noted, is that those object stores still weren't file systems. The drivers either faked file behavior by stuffing extra metadata into buckets, which broke the object API view, or they refused file operations the object store couldn't support.
S3 Files takes a different architecture entirely. AWS connects its EFS technology directly to S3, presenting a full native file system layer while keeping S3 as the system of record. Both the file system API and the S3 object API remain accessible simultaneously against the same data.
Before S3 Files, an agent working with object data had to be explicitly instructed to download files before using tools. That created a session state problem. As agents compacted their context windows, the record of what had been downloaded locally was often lost.
“I would find myself having to remind the agent that the data was available locally,” Warfield said.
Warfield walked through the before-and-after for a common agent task involving log analysis. In the object-only case, a developer using Kiro or Claude Code to work with log data has to tell the agent where the log files live and instruct it to download them. If the logs are instead mounted on the local file system, the developer simply points the agent at a path, and it immediately has access to work through them.
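That before-and-after can be sketched as two functions. The boto3 call is the real AWS SDK; the mounted-path case simply reads a local file, which is the point. The bucket, key and path names are invented, and the import is deferred so the sketch does not require AWS credentials to load.

```python
# Illustrative before/after for the log-analysis example above.

def read_logs_object_api(bucket, key, local_path):
    # Before: the agent must be told to download the object first.
    import boto3  # real AWS SDK; needs credentials at call time
    boto3.client("s3").download_file(bucket, key, local_path)
    with open(local_path) as f:
        return f.read()

def read_logs_mounted(mounted_path):
    # After: with the bucket mounted via S3 Files, the agent reads the
    # path like any local file, with no download step to forget.
    with open(mounted_path) as f:
        return f.read()
```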
For multi-agent pipelines, multiple agents can access the same mounted bucket simultaneously. AWS says thousands of compute resources can connect to a single S3 file system at the same time, with aggregate read throughput reaching multiple terabytes per second — figures VentureBeat was not able to independently verify.
Shared state across agents works through standard file system conventions: subdirectories, notes files and shared project directories that any agent in the pipeline can read and write. Warfield described AWS engineering teams using this pattern internally, with agents logging investigation notes and task summaries into shared project directories.
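A minimal sketch of that convention, assuming one log file per agent inside a shared notes directory; the layout is illustrative, not an AWS prescription.

```python
# Hypothetical shared-notes convention: agents coordinate through
# ordinary files in a shared project directory. Layout is invented.
from pathlib import Path

def log_note(project_dir, agent, message):
    notes = Path(project_dir) / "notes"
    notes.mkdir(parents=True, exist_ok=True)
    # One append-only file per agent avoids writers clobbering each other.
    with open(notes / f"{agent}.log", "a") as f:
        f.write(message + "\n")

def read_all_notes(project_dir):
    # Any agent in the pipeline can read the full shared state.
    notes = Path(project_dir) / "notes"
    return {p.stem: p.read_text() for p in sorted(notes.glob("*.log"))}
```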
For teams building RAG pipelines on top of shared agent content, S3 Vectors — launched at AWS re:Invent in December 2024 — layers on top for similarity search and retrieval-augmented generation against that same data.
AWS is positioning S3 Files against FUSE-based file access from Azure Blob NFS and Google Cloud Storage FUSE. For AI workloads, the meaningful distinction is not primarily performance.
“S3 Files eliminates the data shuffle between object and file storage, turning S3 into a shared, low-latency working space without copying data,” Jeff Vogel, analyst at Gartner, told VentureBeat. “The file system becomes a view, not another dataset.”
With FUSE-based approaches, each agent maintains its own local view of the data. When multiple agents work simultaneously, those views can potentially fall out of sync.
“It eliminates an entire class of failure modes including unexplained training/inference failures caused by stale metadata, which are notoriously difficult to debug,” Vogel said. “FUSE-based solutions externalize complexity and issues to the user.”
The agent-level implications go further still: the architectural argument matters less than what it unlocks in practice.
“For agentic AI, which thinks in terms of files, paths, and local scripts, this is the missing link,” Dave McCarthy, analyst at IDC, told VentureBeat. “It allows an AI agent to treat an exabyte-scale bucket as its own local hard drive, enabling a level of autonomous operational speed that was previously bottled up by API overhead associated with approaches like FUSE.”
Beyond the agent workflow, McCarthy sees S3 Files as a broader inflection point for how enterprises use their data.
“The launch of S3 Files isn’t just S3 with a new interface; it’s the removal of the final friction point between massive data lakes and autonomous AI,” he said. “By converging file and object access with S3, they are opening the door to more use cases with less reworking.”
For enterprise teams that have been maintaining a separate file system alongside S3 to support file-based applications or agent workloads, that architecture is now unnecessary.
For teams consolidating AI infrastructure on S3, the practical shift is concrete: S3 stops being the destination for agent output and becomes the environment where agent work happens.
“All of these API changes that you’re seeing out of the storage teams come from firsthand work and customer experience using agents to work with data,” Warfield said. “We’re really singularly focused on removing any friction and making those interactions go as well as they can.”
Anthropic on Tuesday announced Project Glasswing, a sweeping cybersecurity initiative that pairs an unreleased frontier AI model — Claude Mythos Preview — with a coalition of twelve major technology and finance companies in an effort to find and patch software vulnerabilities across the world’s most critical infrastructure before adversaries can exploit them.
The launch partners include Amazon Web Services, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, the Linux Foundation, Microsoft, Nvidia, and Palo Alto Networks. Anthropic says it has also extended access to more than 40 additional organizations that build or maintain critical software, and is committing up to $100 million in usage credits for Claude Mythos Preview across the effort, along with $4 million in direct donations to open-source security organizations.
The announcement arrives at a moment of extraordinary momentum — and extraordinary scrutiny — for the San Francisco-based AI startup. Anthropic disclosed on Sunday that its annualized revenue run rate has surpassed $30 billion, up from approximately $9 billion at the end of 2025, and the number of business customers each spending over $1 million annually now exceeds 1,000, doubling in less than two months. The company simultaneously announced a multi-gigawatt compute deal with Google and Broadcom. On the same day, Bloomberg reported that Anthropic had poached a senior Microsoft executive, Eric Boyd, to lead its infrastructure expansion.
But Glasswing is something categorically different from a revenue milestone or a compute deal. It’s Anthropic’s most ambitious attempt to translate frontier AI capabilities — capabilities the company itself describes as dangerous — into a defensive advantage before those same capabilities proliferate to hostile actors.
At the center of Project Glasswing sits Claude Mythos Preview, a general-purpose frontier model that Anthropic says has already identified thousands of high-severity zero-day vulnerabilities — meaning flaws previously unknown to software developers — in every major operating system and every major web browser, along with a range of other critical software.
The company is not making the model generally available.
“We do not plan to make Claude Mythos Preview generally available due to its cybersecurity capabilities,” Newton Cheng, Frontier Red Team Cyber Lead at Anthropic, told VentureBeat in an exclusive interview. “However, given the rate of AI progress, it will not be long before such capabilities proliferate, potentially beyond actors who are committed to deploying them safely. The fallout — for economies, public safety, and national security — could be severe.”
That language — “the fallout could be severe” — is striking coming from the company that built the model. Anthropic is effectively arguing that the tool it created is powerful enough to reshape the cybersecurity landscape, and that the only responsible thing to do is to keep it restricted while giving defenders a head start.
The technical results reinforce that claim. According to Anthropic’s press release, Mythos Preview found nearly all of the vulnerabilities it surfaced — and developed many of the related exploits — entirely autonomously, without any human steering. Three examples stand out.
The model found a 27-year-old vulnerability in OpenBSD — widely regarded as one of the most security-hardened operating systems in the world and commonly used to run firewalls and critical infrastructure. The flaw allowed an attacker to remotely crash any machine running the OS simply by connecting to it.
It also discovered a 16-year-old vulnerability in FFmpeg — the near-ubiquitous video encoding and decoding library — in a line of code that automated testing tools had exercised five million times without ever catching the problem.
And perhaps most alarmingly, Mythos Preview autonomously found and chained together several vulnerabilities in the Linux kernel to escalate from ordinary user access to complete control of the machine.
All three vulnerabilities have been reported to the relevant maintainers and have since been patched. For many other vulnerabilities still in the remediation pipeline, Anthropic says it is publishing cryptographic hashes of the details today, with plans to reveal specifics after fixes are in place.
On the CyberGym evaluation benchmark, Mythos Preview scored 83.1%, compared to 66.6% for Claude Opus 4.6, Anthropic’s next-best model. The gap is even wider on coding benchmarks: Mythos Preview achieves 93.9% on SWE-bench Verified versus 80.8% for Opus 4.6, and 77.8% on SWE-bench Pro versus 53.4%.
Finding thousands of zero-days at once sounds impressive. Actually handling the output responsibly is a logistical nightmare — and one of the sharpest criticisms that security researchers have raised about AI-driven vulnerability discovery. Flooding open-source maintainers, many of whom are unpaid volunteers, with an avalanche of critical bug reports could easily do more harm than good.
Cheng told VentureBeat that Anthropic has built a triage pipeline specifically to manage this problem. “We triage every bug that we find and then send the highest severity bugs to professional human triagers we have contracted to assist in our disclosure process by manually validating every bug report before we send it out to ensure that we send only high-quality reports to maintainers,” he said.
That pipeline is designed to prevent exactly the scenario that maintainers fear most: an automated firehose of unverified reports. “We do not submit large volumes of findings to a single project without first reaching out in an effort to agree on a pace the maintainer can sustain,” Cheng added.
When Anthropic has access to the source code, the company aims to include a candidate patch with every report, labeled by provenance — meaning the maintainer knows the patch was written or reviewed by a model — and offers to collaborate on a production-quality fix. “Models can write patches,” Cheng noted, “but there are many factors that impact patch quality, and we strongly recommend that autonomously-written patches are put under the same scrutiny and testing that human-written patches are.”
On disclosure timelines, Anthropic says it follows a coordinated vulnerability disclosure framework. Once a patch is available, the company will generally wait 45 days before publishing full technical details, giving downstream users time to deploy the fix before exploitation information becomes public. Cheng said the company may shorten that buffer “if the details are already publicly known through other channels, or if earlier publication would materially help defenders identify and mitigate ongoing attacks,” or extend it “when patch deployment is unusually complex or the affected footprint is unusually broad.”
Those are reasonable principles, but they will be tested at a scale that no vulnerability disclosure program has ever attempted. The sheer volume of findings — thousands of zero-days across every major platform — means that even a well-designed triage process will face bottlenecks. And the 45-day disclosure window assumes that maintainers can actually produce, test, and ship a patch in that time, which is far from guaranteed for complex kernel-level bugs or deeply embedded cryptographic flaws.
The irony of a company claiming to build the most capable cyber model ever constructed while simultaneously suffering a string of embarrassing security lapses has not been lost on observers.
In late March, a draft blog post about Mythos was left in an unsecured and publicly searchable data store — a CMS misconfiguration that exposed roughly 3,000 internal assets, including what appeared to be strategic plans for the model’s rollout. Days later, on March 31, anyone who ran npm install on Claude Code pulled down Anthropic’s complete original source code — 512,000 lines — for approximately three hours due to a packaging error, an incident that drew widespread attention in the developer community and was first reported by VentureBeat.
When asked why partners and governments should trust Anthropic as the custodian of a model it describes as having unprecedented cyber capabilities, Cheng was direct. “Security is central to how we build and ship,” he told VentureBeat. “These two incidents, a blog CMS misconfiguration and an npm packaging error, were human errors in publishing tooling, not breaches of our security architecture. We’ve made changes to prevent these from happening again, and we’ll continue to improve our processes.”
It is a technically accurate distinction — neither incident involved a breach of Anthropic’s core model weights, training infrastructure, or API systems — but it is also a distinction that may prove difficult to sustain as a public argument. For an organization asking governments and Fortune 500 companies to trust it with a tool that can autonomously find and exploit vulnerabilities in the Linux kernel, even minor operational lapses carry outsized reputational risk. The fact that the Mythos leak itself was what first alerted the security community to the model’s existence, weeks before the planned announcement, underscores the point.
The coalition’s breadth is notable. It includes direct competitors — Google and Microsoft — alongside cybersecurity incumbents, financial institutions, and the steward of the world’s largest open-source ecosystem. And several partners have already been running Mythos Preview against their own infrastructure for weeks.
CrowdStrike’s CTO Elia Zaitsev framed the initiative in terms of collapsing timelines: “The window between a vulnerability being discovered and being exploited by an adversary has collapsed — what once took months now happens in minutes with AI.” AWS Vice President and CISO Amy Herzog said her teams have already been testing Mythos Preview against critical codebases, where the model is “already helping us strengthen our code.” And Microsoft’s Global CISO Igor Tsyganskiy noted that when tested against CTI-REALM, Microsoft’s open-source security benchmark, “Claude Mythos Preview showed substantial improvements compared to previous models.”
Perhaps the most revealing comment came from Jim Zemlin, CEO of the Linux Foundation, who pointed to the fundamental asymmetry that has plagued open-source security for decades: “In the past, security expertise has been a luxury reserved for organizations with large security teams. Open-source maintainers — whose software underpins much of the world’s critical infrastructure — have historically been left to figure out security on their own.” Project Glasswing, he said, “offers a credible path to changing that equation.”
To back that claim with dollars, Anthropic says it has donated $2.5 million to Alpha-Omega and OpenSSF through the Linux Foundation, and $1.5 million to the Apache Software Foundation. Maintainers interested in access can apply through Anthropic’s Claude for Open Source program.
After the research preview period — during which Anthropic’s $100 million credit commitment will cover most usage — Claude Mythos Preview will be available to participants at $25 per million input tokens and $125 per million output tokens. Participants can access the model through the Claude API, Amazon Bedrock, Google Cloud’s Vertex AI, and Microsoft Foundry.
Those prices reflect the model’s computational intensity. The draft blog post that leaked in March described Mythos as a large, compute-intensive model that would be expensive for both Anthropic and its customers to serve. Anthropic’s solution is to develop and launch new safeguards with an upcoming Claude Opus model, allowing the company to “improve and refine them with a model that does not pose the same level of risk as Mythos Preview,” as Cheng told VentureBeat. Security professionals whose legitimate work is affected by those safeguards will be able to apply to an upcoming Cyber Verification Program.
The financial context matters. The same day Project Glasswing launched, Anthropic disclosed its revenue milestone and the Google-Broadcom compute deal. Broadcom signed an expanded deal with Anthropic that will give the AI startup access to about 3.5 gigawatts worth of computing capacity drawing on Google’s AI processors, according to CNBC. The scale of compute being marshaled is staggering — and it helps explain why Anthropic needs both the revenue from enterprise cybersecurity partnerships and the infrastructure to serve a model of Mythos Preview’s size.
The timing also intersects with growing speculation about Anthropic’s path to a public offering. The company is reportedly evaluating an IPO as early as October 2026. A high-profile, government-adjacent cybersecurity initiative with blue-chip partners is exactly the kind of program that burnishes an IPO narrative — particularly when the company can simultaneously point to $30 billion in annualized revenue and a compute footprint measured in gigawatts.
The most consequential question raised by Project Glasswing is not whether Mythos Preview’s capabilities are real — the partner endorsements and patched vulnerabilities suggest they are — but how much time defenders actually have before similar capabilities are available to adversaries.
Cheng was candid about the timeline. “Frontier AI capabilities are likely to advance substantially over just the next few months,” he told VentureBeat. “Given the rate of AI progress, it will not be long before such capabilities proliferate, potentially beyond actors who are committed to deploying them safely.” He described Project Glasswing as “an important step toward giving defenders a durable advantage in the coming AI-driven era of cybersecurity” but added a crucial caveat: “It’s important to note, this is a starting point. No one organization can solve these cybersecurity problems alone.”
That framing — months, not years — is worth taking seriously. DARPA launched its original Cyber Grand Challenge in 2016, a competition to create automatic defensive systems capable of reasoning about flaws, formulating patches, and deploying them on a network in real time. At the time, the winning AI-powered bot, Mayhem, finished last when placed against human teams at DEF CON. A decade later, Anthropic is claiming that a frontier AI model can find vulnerabilities that survived 27 years of expert human review and millions of automated security tests — and can chain exploits together autonomously to achieve full system compromise.
The delta between those two data points illustrates why the industry is treating this as a genuine inflection point, not a marketing exercise. Anthropic itself has firsthand experience with the offensive side of this equation: the company disclosed in November 2025 that a Chinese state-sponsored group achieved 80 to 90 percent autonomous tactical execution using Claude across approximately 30 targets, according to Anthropic’s misuse report.
Project Glasswing arrives during one of the most turbulent weeks in Anthropic’s history. In the span of days, the company has announced a model it considers too dangerous for public release, disclosed that its revenue has tripled, sealed a multi-gigawatt compute deal, hired a senior Microsoft executive, made it more expensive for Claude Code subscribers to use third-party tools like OpenClaw, and weathered a major outage of its Claude chatbot on Tuesday morning. Anthropic says it will report publicly on what it has learned within 90 days. In the medium term, the company has proposed that an independent, third-party body might be the ideal home for continued work on large-scale cybersecurity projects.
Whether any of that is fast enough depends on a race that is already underway. Anthropic built a model that can autonomously crack open the most hardened operating systems on the planet — and is now betting that sharing it with defenders, under careful restrictions, will do more good than the inevitable moment when similar capabilities land in less careful hands. It is, in essence, a wager that transparency can outrun proliferation. The next few months will determine whether that bet pays off, or whether the glasswing’s wings were never quite opaque enough to hide what was coming.
Presented by Box
As frontier models converge, the advantage in enterprise AI is moving away from the model and toward the data it can safely access. For most enterprises, that advantage lives in unstructured data: the contracts, case files, product specifications, and internal knowledge an organization accumulates.
For enterprise leaders, the question is no longer which model to use, but which platform governs the content those models are allowed to reason over.
“It’s not what the model does anymore, it’s the enterprise’s own unstructured data – their content, how it’s organized, how it’s governed, and how it’s made accessible to the AI,” says Yash Bhavnani, head of AI at Box.
“The organizations that will lead in AI are the ones that built the governance infrastructure to make any model trustworthy, with the right permissions in place, the right content accessible, and a clear audit trail for every action taken,” says Ben Kus, CTO of Box.
As the advantage in AI shifts from models to governed content, systems of record are becoming the foundation that makes enterprise AI trustworthy.
Employees use frontier models to summarize documents, draft reports, and answer questions. But when those tools are disconnected from authoritative internal repositories, the results are difficult to trust, impossible to audit, and potentially dangerous. AI that cannot trace its outputs back to a governed source of record becomes a liability.
“It’s not a theoretical concern,” Bhavnani says. “For an insurance enterprise using AI to analyze client claims, low accuracy is simply not acceptable, and untraceable output can’t be acted upon.”
Systems of record provide authoritative, version-controlled content with embedded permissions and compliance controls already built in, and RAG pipelines retrieve data from live repositories at inference time, connecting responses directly to current, traceable sources.
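That pairing of retrieval and governance can be sketched in a few lines. The sketch is illustrative only: the `Document` fields, the role model, and the `retrieve` helper are assumptions for this example, not a real Box or vendor API.

```python
# Hypothetical sketch: retrieval from a governed system of record.
# The Document fields and role model are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Document:
    doc_id: str
    text: str
    version: int
    allowed_roles: frozenset  # permissions embedded with the content

def retrieve(query: str, docs: list, user_role: str) -> list:
    """Return only documents the user may read, with traceable source IDs."""
    visible = [d for d in docs if user_role in d.allowed_roles]
    # Toy relevance: substring match; a real pipeline would use search
    # or embeddings against the live repository at inference time.
    hits = [d for d in visible if query.lower() in d.text.lower()]
    # Each hit carries its doc_id and version, so any AI answer built on
    # it can be traced back to a version-controlled, authoritative source.
    return [(d.doc_id, d.version, d.text) for d in hits]

docs = [
    Document("claims/001", "Client claim for water damage.", 3, frozenset({"adjuster"})),
    Document("hr/007", "Salary review notes.", 1, frozenset({"hr"})),
]
print(retrieve("claim", docs, "adjuster"))  # only the claims doc is visible
```

The point of the sketch is the shape, not the search: permissions travel with the content, and every retrieved passage stays traceable to a specific document version.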
Without integration into systems of record, employees build their own workarounds, content gets duplicated across tools that don’t talk to each other, and shadow knowledge stores accumulate outside the visibility of IT and compliance teams.
“Customers tell us employees are uploading sensitive documents to personal accounts and running their own AI workflows, with no visibility from the enterprise into what is being shared or what is being generated,” he says. “It’s not just a security risk, it’s an organizational one.”
As AI moves into agentic territory, executing multi-step tasks autonomously across documents, workflows, and enterprise systems, the risk profile changes entirely. Agents act faster than humans, often without the contextual judgment needed to decide what data they should access, making permissions-aware access essential.
“An AI platform without permissions-aware access is too dangerous to use,” Kus says. “It’s a precondition for safe enterprise AI deployment, and the more it appears to have been added after the fact rather than built into the foundation, the more it should concern the enterprise considering it.”
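At the tool-call layer, permissions-aware access with a built-in audit trail might look something like the following. This is a minimal sketch under stated assumptions: `AuditedToolRunner` and its permission map are hypothetical, not a real Box interface.

```python
# Illustrative sketch of a permissions-aware gate around agent tool calls.
# All names here (AuditedToolRunner, agent ids, tool names) are hypothetical.
import datetime

class PermissionDenied(Exception):
    pass

class AuditedToolRunner:
    def __init__(self, permissions):
        self.permissions = permissions  # agent_id -> set of allowed tools
        self.audit_log = []             # every attempt, allowed or denied

    def run(self, agent_id, tool_name, fn, *args):
        allowed = tool_name in self.permissions.get(agent_id, set())
        # The audit trail records the attempt *before* anything executes,
        # so denied calls are visible to compliance too.
        self.audit_log.append({
            "when": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "agent": agent_id, "tool": tool_name, "allowed": allowed,
        })
        if not allowed:
            raise PermissionDenied(f"{agent_id} may not call {tool_name}")
        return fn(*args)

runner = AuditedToolRunner({"claims-agent": {"read_claim"}})
print(runner.run("claims-agent", "read_claim", lambda cid: f"claim {cid}", "001"))
```

The design choice worth noting is that the gate sits in front of every tool call rather than being bolted on afterward, which is exactly the distinction Kus draws between foundational and after-the-fact permissions.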
In regulated industries, frameworks like HIPAA, FedRAMP High, and SOC 2 demand audit trails, policy enforcement, and demonstrable controls over who and what has accessed sensitive data.
“The audit trail should cover not only the source files but the AI session that used them, and accessed only with the same controls and the same encryption mechanism,” Kus says. “We don’t want customers to end up with a compliance breach because the agent was looking at sensitive data and the agent records got stored somewhere unexpected.”
Enterprise content platforms are evolving from repositories into orchestration layers — an AI control plane that sits between models, agents, and enterprise data. Rather than just storing documents, the platform governs how content is accessed, routes it to the right reasoning engine, enforces permissions, and maintains a complete audit trail of every action.
“An AI-ready content platform needs to support human navigation and use in the way platforms always have, and it needs its own AI agents that understand the platform’s data structures deeply enough to get the best out of them,” Kus says. “It also needs to be open enough that any external agent can reach into it. An open agent ecosystem is the future of how these platforms will work.”
When content, permissions, audit trails, and application access are all handled by the same platform, governance stays attached to the content itself. More than any capability of the models on top of it, a unified governance layer is what allows enterprise AI to scale safely.
Unstructured data has long been a sticking point for organizations, which historically had to build a specialized model for every subtype of it.
“What’s changed is that general-purpose large language models now bring enough intelligence to extract structured data from unstructured content without that level of bespoke investment,” Kus says. “Box Extract applies this capability at scale, automatically pulling key information from contracts, forms, claims, and reports and applying it as structured metadata within Box. The content that previously had to be read by a person to yield its value can now be processed, structured, and made queryable across an entire repository.”
And once that data is extracted and operational logic lives in the system, users can visualize, search, and act on that extracted information through custom dashboards and no-code tools.
Box Agents take this further by enabling multi-step reasoning and task execution grounded directly in enterprise content, steered with simple, natural-language direction. And because agent sessions in Box are persistent, iterative knowledge work is not lost between interactions.
The practical result is that end-to-end workflows that previously required human coordination across multiple systems can be orchestrated directly on systems of record.
“When those workflows are built on Box agents and automation operating directly on governed content, the handoffs become automated, the audit trail is built in, and the system of record remains the authoritative source throughout,” Bhavnani says. “Nothing falls through the cracks between systems, because there is only one system.”
The enterprises seeing real returns are not the ones that simply plugged in a frontier model and waited for results. They are the ones that connected AI to their systems of record, governed what it can access, and built the operational layer that makes its outputs trustworthy enough to use at scale.
Platforms that bring together content management, security, automation, and AI integration in a single layer are emerging as the foundation for enterprise AI, because model capability alone is not enough. Without governance built into the platform, the gaps between systems become the point of failure.
Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.
Block today unveiled Managerbot, a new AI agent embedded in the Square platform that proactively monitors a seller’s business, identifies emerging problems, and proposes actionable solutions — without the seller ever having to ask a question. The product marks the most tangible manifestation of CEO Jack Dorsey’s controversial bet that artificial intelligence can fundamentally reshape how his company operates, builds products, and serves the millions of small businesses that depend on Square to run day-to-day commerce.
In an exclusive interview with VentureBeat, Willem Avé, Block’s head of product at Square, described Managerbot as a decisive break from the company’s earlier Square AI assistant, which functioned as a reactive chatbot that answered seller questions about sales, employees, and business performance.
“The big shift from Square AI to Managerbot is really from reactive to proactive,” Avé said. “What that means is the primary interface is not a question box. You assign tasks to Managerbot, and that could be based on data, an insight, or a signal from your business.”
The product is beginning to roll out now, with full availability to Square sellers expected over the coming months. Block declined to say whether Managerbot would carry an additional fee or be bundled into existing Square subscriptions.
Avé outlined three core domains where Managerbot operates today: inventory forecasting, employee shift scheduling, and automated marketing campaign creation. In every case, the agent acts before the seller does — watching over the business, detecting patterns, and surfacing recommendations with proposed actions attached.
In the inventory domain, Managerbot continuously monitors a seller’s stock levels, sales velocity, and external signals such as weather patterns and local events, then alerts the seller when an item is about to run out — or when it should stock up ahead of anticipated demand. “In warmer weather, we can see that you sell more of a certain good,” Avé explained. “That’s the forecasting capability, combined with local data — weather, events — so we can help sellers manage both their inventory and cash flows.”
For shift scheduling — a task that Avé described as “one of those interesting, very hard computer science problems” that consumes hours of a small business owner’s week — Managerbot analyzes forecasted sales data and then generates optimized employee schedules that balance worker preferences with coverage needs. “It turns out that frontier models are actually pretty good at it,” Avé said.
The third capability tackles what Avé called “the whole bucket of things that sellers could do if they had more time” — principally marketing. Managerbot identifies sales trends across a seller’s catalog and automatically drafts win-back campaigns and promotional outreach targeted at a store’s best customer segments. Avé said Block is seeing “very meaningful lift” from Managerbot-generated campaigns compared to what some sellers create manually, though he declined to share specific performance figures publicly.
Managerbot runs on third-party frontier models — Avé specifically referenced Anthropic’s Sonnet and OpenAI’s GPT family — but Block’s competitive advantage, he argued, lies in the “agent harness” the company has built around those models. That harness draws heavily on Goose, Block’s open-source agent framework, and incorporates learnings from its consumer-facing Money Bot on Cash App.
The challenge specific to Square is scale and complexity. A seller running a small business might interact with hundreds of different tools across invoicing, inventory, customer management, marketing, payroll, and scheduling. Managerbot must navigate all of them coherently within a single agentic loop. “This isn’t like, you know, you load a skill and call it a day — think about hundreds of skills,” Avé said. “Actually, managing the context and managing the way that we progressively disclose tools, and some of the other innovation that we have at the harness layer, is I think some of the secret sauce.”
A critical design decision shapes every interaction: Managerbot does not autonomously execute changes to a seller’s business. Every write action — whether adjusting a shift schedule, publishing a marketing campaign, or modifying inventory — requires explicit seller approval. To facilitate that approval, Managerbot generates visual UI previews showing exactly what will change before the seller clicks “yes.” “We want to earn trust with sellers, so any write action is prompted to the user to approve,” Avé said. “The seller needs a visual representation of what the change is. You can’t just describe in words all the time what you’re going to go do.”
That human-in-the-loop caution reflects a sensitivity that gains additional weight given Block’s recent history. In January 2025, 48 state financial regulators imposed an $80 million fine on Block for violations of Bank Secrecy Act and anti-money laundering laws related to Cash App. The Connecticut Department of Banking stated in announcing the settlement that regulators “found Block was not in compliance with certain requirements, creating the potential that its services could be used to support money laundering, terrorism financing, or other illegal activities.” The Illinois Department of Financial and Professional Regulation simultaneously joined the coordinated enforcement action.
Separately, reporting from The Guardian has documented instances of Block’s customer-facing chatbots making serious errors, including telling customers to cancel or close their accounts. When VentureBeat raised this concern during the interview, Avé acknowledged the stakes but redirected to Managerbot’s specific safeguards.
“Financial accuracy and financial data — the value of these products really come from recommendations,” Avé said. “We need to be better than whatever you can feed to ChatGPT. If you take a CSV of your sales and put it in ChatGPT or Claude, we need our product to be better and answer that question either more accurately or better than what’s available in the market.” He pointed to the harness layer’s role in reducing hallucinations through tuning, prompt engineering, and optimized tool-call loops, while acknowledging the inherent limitations of probabilistic systems: “It’s never going to be zero. Obviously, these are probabilistic systems, and we have guidance and call-outs in the tool to provide that.” On regulated domains like lending and payments, Avé was more definitive: “In any sort of regulated domains — banking, lending, payments — there are strict guardrails on what we can and can’t say to sellers. Those are just part of the product and business.”
It is impossible to evaluate Managerbot outside the context of the radical organizational surgery Block performed just weeks ago. In late February, Dorsey announced that Block would cut more than 4,000 of its roughly 10,000 employees — nearly half the workforce — explicitly citing AI as the driving rationale. As the BBC reported, Dorsey wrote that “AI fundamentally changes what it means to build and run a company.” Block’s stock surged more than 20 percent on the news, according to ABC7.
The company’s Q4 2025 earnings report, released alongside the layoff announcement, showed gross profit of $2.87 billion — up 24 percent year over year — and raised 2026 guidance to $12.2 billion in gross profit, according to AlphaSense’s earnings analysis. Block also reported a greater than 40 percent increase in production code shipped per engineer since September 2025 through the use of agentic coding tools. As CNBC commentator Steve Sedgwick wrote in an opinion piece following the announcement, “I keep getting told on CNBC that AI will create new jobs to replace those being lost. I’ve been asking the same question for years now.” The Observer’s Mark Minevich was more pointed, calling Block’s layoffs “probably the first legitimate mass layoff driven by A.I. as the actual operating thesis.”
Managerbot, then, is the product answer to the obvious follow-up question: if Block shed 4,000 workers in the name of intelligence tools, what exactly are those intelligence tools building? Avé framed the product as proof of concept for Block’s entire strategic thesis. “Block has been in the press recently about rebuilding as an intelligence company, and it’s like, a lot of people are asking, ‘What does that mean for us?'” Avé said. “What I like to do is show, not tell. We’re building Managerbot, which I think is one of the more advanced, maybe the most advanced, small business agent out there today.”
Perhaps the most consequential signal Avé shared was an early behavioral pattern: sellers who begin using Managerbot are voluntarily migrating more of their business operations onto the Square platform, consolidating payroll, time cards, and shift scheduling into Block’s ecosystem to feed the agent more data. “When they start interacting with Managerbot, they want to move more of their business onto Square because they see the value,” Avé said. “They’re like, ‘I should put my payroll here. I should get time cards here. I should get my shift schedules here,’ because once all that data is in one place, they can make better decisions and manage their business better.”
This dynamic could prove to be Managerbot’s most significant long-term effect — not as a standalone feature, but as a gravitational force pulling sellers deeper into Block’s integrated commerce stack. Block’s Q4 earnings already showed Square’s new volume added grew 29 percent year over year, with sales-led NVA surging 62 percent. Avé also argued that Square’s first-party architecture — built organically rather than through acquisitions — gives it a structural advantage over competitors in the AI era. “We’ve kind of harmonized and canonicalized this data at a sensible layer,” he said. “It’s not super hard to create more skills for these data domains.”
When VentureBeat pressed Avé on the tension between helping sellers and upselling them on Block’s own financial products — lending, payments processing, and other services that generate revenue for the company — he acknowledged the concern but framed Managerbot’s mission in terms of decision-making quality. “The goal for Managerbot is to help sellers increase their decision-making correctness,” Avé said. “If we can make sellers better at running their business by making better decisions and giving time back, I think that’s a good thing.”
Avé was insistent that Managerbot represents something categorically different from the chatbot-as-advisor model that has proliferated across enterprise software. “A lot of people are building chatbots as advisors — it can answer a question for you,” he said. “What we really want Managerbot to be is a protector of your business. This is identifying trends. This is spotting things that you might have missed. This is helping you run your business and take actions.”
He also argued that the agent model compounds Block’s development velocity in ways that traditional software cannot match. “It’s much more straightforward to add a capability to Managerbot than it is to build a big Web 2.0 UI,” Avé said. “If we can deliver more capabilities, more features, more value to our sellers, the whole system compounds.”
Whether that compounding materializes — and whether sellers ultimately experience Managerbot as a trusted protector or a sophisticated upsell engine — will determine much about Block’s future. The company has staked its corporate identity, its headcount, and its Wall Street narrative on the conviction that AI agents can deliver more value with fewer humans in the loop. Managerbot is the first product to carry the full weight of that promise. And the small business owners who keep their shops open with Square terminals, who juggle shift schedules on napkins and skip marketing because there aren’t enough hours in the day — they didn’t ask to be the test case for Silicon Valley’s boldest AI thesis. But as of today, they are.
AI vibe coders have yet another reason to thank Andrej Karpathy, the coiner of the term.
The former Director of AI at Tesla and co-founder of OpenAI, now running his own independent AI project, recently posted on X describing a “LLM Knowledge Bases” approach he’s using to manage various topics of research interest.
By building a persistent, LLM-maintained record of his projects, Karpathy is solving the core frustration of “stateless” AI development: the dreaded context-limit reset.
As anyone who has vibe coded can attest, hitting a usage limit or ending a session often feels like a lobotomy for your project. You’re forced to spend valuable tokens (and time) reconstructing context for the AI, hoping it “remembers” the architectural nuances you just established.
Karpathy proposes something simpler, and in its loose, messy way more elegant, than the typical enterprise solution of a vector database and RAG pipeline.
Instead, he outlines a system where the LLM itself acts as a full-time “research librarian”—actively compiling, linting, and interlinking Markdown (.md) files, the most LLM-friendly and compact data format.
By diverting a significant portion of his “token throughput” into the manipulation of structured knowledge rather than boilerplate code, Karpathy has surfaced a blueprint for the next phase of the “Second Brain”—one that is self-healing, auditable, and entirely human-readable.
For the past three years, the dominant paradigm for giving LLMs access to proprietary data has been Retrieval-Augmented Generation (RAG).
In a standard RAG setup, documents are chopped into arbitrary “chunks,” converted into mathematical vectors (embeddings), and stored in a specialized database.
When a user asks a question, the system performs a “similarity search” to find the most relevant chunks and feeds them into the LLM.
Karpathy’s approach, which he calls LLM Knowledge Bases, rejects the complexity of vector databases for mid-sized datasets.
Instead, it relies on the LLM’s increasing ability to reason over structured text.
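For contrast, the standard RAG retrieval step described above fits in a few lines of Python. The bag-of-words "embedding" below is a toy stand-in for a real embedding model; the chunk size and scoring are assumptions made for the example.

```python
# Toy sketch of the standard RAG pipeline: chunk documents, "embed" them,
# and return the chunks nearest to a query. A real system would use a
# learned embedding model instead of word counts.
import math
from collections import Counter

def embed(text):
    # Stand-in for an embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def chunk(doc, size=8):
    # Arbitrary fixed-size chunking, as described above.
    words = doc.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(query, docs, k=2):
    chunks = [c for d in docs for c in chunk(d)]
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]  # these chunks would be stuffed into the LLM prompt
```

Everything Karpathy's approach rejects is visible here: the chunk boundaries are arbitrary, and the relevance logic is opaque similarity math rather than explicit, human-readable structure.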
The system architecture, as visualized by X user @himanshu in part of the wider reactions to Karpathy’s post, functions in three distinct stages:
Data Ingest: Raw materials—research papers, GitHub repositories, datasets, and web articles—are dumped into a raw/ directory. Karpathy utilizes the Obsidian Web Clipper to convert web content into Markdown (.md) files, ensuring even images are stored locally so the LLM can reference them via vision capabilities.
The Compilation Step: This is the core innovation. Instead of just indexing the files, the LLM “compiles” them. It reads the raw data and writes a structured wiki. This includes generating summaries, identifying key concepts, authoring encyclopedia-style articles, and—crucially—creating backlinks between related ideas.
Active Maintenance (Linting): The system isn’t static. Karpathy describes running “health checks” or “linting” passes where the LLM scans the wiki for inconsistencies, missing data, or new connections. As community member Charly Wargnier observed, “It acts as a living AI knowledge base that actually heals itself.”
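The three stages above can be sketched roughly as follows. This is a hedged reconstruction from the description, not Karpathy's actual scripts: `summarize` stubs out the LLM call, and the `raw/` and `wiki/` layout is an assumption.

```python
# Sketch of the ingest -> compile -> lint pipeline described above.
# `summarize` is a stub for a real LLM call; the directory layout and
# wikilink convention are illustrative assumptions.
import pathlib
import re

def summarize(text: str) -> str:  # stub standing in for an LLM call
    return text.split(".")[0] + "."

def compile_wiki(raw_dir: pathlib.Path, wiki_dir: pathlib.Path):
    """'Compile' raw notes into a wiki: add summaries and [[backlinks]]."""
    wiki_dir.mkdir(exist_ok=True)
    titles = [p.stem for p in raw_dir.glob("*.md")]
    for p in raw_dir.glob("*.md"):
        body = p.read_text()
        # Backlinks: wrap mentions of other articles in [[wikilinks]].
        for t in titles:
            if t != p.stem:
                body = re.sub(rf"\b{re.escape(t)}\b", f"[[{t}]]", body)
        (wiki_dir / p.name).write_text(f"Summary: {summarize(body)}\n\n{body}")

def lint(wiki_dir: pathlib.Path):
    """Health check: report wikilinks that point at missing articles."""
    names = {p.stem for p in wiki_dir.glob("*.md")}
    problems = []
    for p in wiki_dir.glob("*.md"):
        for link in re.findall(r"\[\[(.+?)\]\]", p.read_text()):
            if link not in names:
                problems.append((p.name, link))
    return problems
```

In the real system the compile and lint passes are performed by the LLM itself; the sketch only shows the shape of the loop, with the wiki living in plain `.md` files a human can open and edit at any point.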
By treating Markdown files as the “source of truth,” Karpathy avoids the “black box” problem of vector embeddings. Every claim made by the AI can be traced back to a specific .md file that a human can read, edit, or delete.
While Karpathy’s setup is currently described as a “hacky collection of scripts,” the implications for the enterprise are immediate.
As entrepreneur Vamshi Reddy (@tammireddy) noted in response to the announcement: “Every business has a raw/ directory. Nobody’s ever compiled it. That’s the product.”
Karpathy agreed, suggesting that this methodology represents an “incredible new product” category.
Most companies currently “drown” in unstructured data—Slack logs, internal wikis, and PDF reports that no one has the time to synthesize.
A “Karpathy-style” enterprise layer wouldn’t just search these documents; it would actively author a “Company Bible” that updates in real-time.
As AI educator and newsletter author Ole Lehmann put it on X: “i think whoever packages this for normal people is sitting on something massive. one app that syncs with the tools you already use, your bookmarks, your read-later app, your podcast app, your saved threads.”
Eugen Alpeza, co-founder and CEO of AI enterprise agent builder and orchestration startup Edra, noted in an X post that: “The jump from personal research wiki to enterprise operations is where it gets brutal. Thousands of employees, millions of records, tribal knowledge that contradicts itself across teams. Indeed, there is room for a new product and we’re building it in the enterprise.”
As the community explores the “Karpathy Pattern,” the focus is already shifting from personal research to multi-agent orchestration.
A recent architectural breakdown by @jumperz, founder of AI agent creation platform Secondmate, illustrates this evolution through a “Swarm Knowledge Base” that scales the wiki workflow to a 10-agent system managed via OpenClaw.
The core challenge of a multi-agent swarm—where one hallucination can compound and “infect” the collective memory—is addressed here by a dedicated “Quality Gate.”
Using the Hermes model (trained by Nous Research for structured evaluation) as an independent supervisor, every draft article is scored and validated before being promoted to the “live” wiki.
This system creates a “Compound Loop”: agents dump raw outputs, the compiler organizes them, Hermes validates the truth, and verified briefings are fed back to agents at the start of each session. This ensures that the swarm never “wakes up blank,” but instead begins every task with a filtered, high-integrity briefing of everything the collective has learned.
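The quality-gate step of that loop can be sketched as below. `score_draft` is a toy stand-in for the Hermes evaluation call, and the scoring heuristic and threshold are assumptions made for the example.

```python
# Illustrative sketch of the "Quality Gate" pattern: a supervisor model
# scores each draft article and only promotes drafts above a threshold
# to the live wiki. score_draft stands in for the Hermes evaluation call.

def score_draft(draft: str) -> float:
    # Toy heuristic: penalize empty drafts and drafts with no source line.
    has_source = "source:" in draft.lower()
    return (0.6 if draft.strip() else 0.0) + (0.4 if has_source else 0.0)

def promote(drafts: dict, threshold: float = 0.8) -> tuple:
    """Split drafts into (live, quarantined) based on the gate's score."""
    live, quarantined = {}, {}
    for name, text in drafts.items():
        (live if score_draft(text) >= threshold else quarantined)[name] = text
    return live, quarantined

live, held = promote({
    "ok.md": "Agents compound knowledge across runs. Source: agent-3 run log.",
    "bad.md": "unverified claim with no citation",
})
```

The gate matters because it is the only thing standing between one agent's hallucination and the shared memory every other agent starts its session with.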
A common critique of non-vector approaches is scalability. However, Karpathy notes that at a scale of ~100 articles and ~400,000 words, the LLM’s ability to navigate via summaries and index files is more than sufficient.
For a departmental wiki or a personal research project, the “fancy RAG” infrastructure often introduces more latency and “retrieval noise” than it solves.
Tech podcaster Lex Fridman (@lexfridman) confirmed he uses a similar setup, adding a layer of dynamic visualization:
“I often have it generate dynamic html (with js) that allows me to sort/filter data and to tinker with visualizations interactively. Another useful thing is I have the system generate a temporary focused mini-knowledge-base… that I then load into an LLM for voice-mode interaction on a long 7-10 mile run.”
This “ephemeral wiki” concept suggests a future where users don’t just “chat” with an AI; they spawn a team of agents to build a custom research environment for a specific task, which then dissolves once the report is written.
Technically, Karpathy’s methodology is built on an open standard (Markdown) but viewed through a proprietary-but-extensible lens (the note-taking and file-organization app Obsidian).
Markdown (.md): By choosing Markdown, Karpathy ensures his knowledge base is not locked into a specific vendor. It is future-proof; if Obsidian disappears, the files remain readable by any text editor.
Obsidian: While Obsidian is a proprietary application, its “local-first” philosophy and EULA (which allows for free personal use and requires a license for commercial use) align with the developer’s desire for data sovereignty.
The “Vibe-Coded” Tools: The search engines and CLI tools Karpathy mentions are custom scripts—likely Python-based—that bridge the gap between the LLM and the local file system.
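As a hypothetical example of such a bridge script (not Karpathy's actual tooling), a few lines of Python are enough to give an LLM grep-style access to a local Markdown vault:

```python
# Hypothetical "vibe-coded" bridge script: a minimal CLI that lets an LLM
# (or a human) search a local Markdown vault. Names and layout are assumed.
import pathlib
import sys

def search_vault(vault: pathlib.Path, term: str):
    """Yield (file, line_number, line) for every match in the vault."""
    for path in sorted(vault.rglob("*.md")):
        for i, line in enumerate(path.read_text().splitlines(), start=1):
            if term.lower() in line.lower():
                yield path.name, i, line.strip()

if __name__ == "__main__":
    # Usage: python search_vault.py <vault_dir> <query>
    vault_dir, query = pathlib.Path(sys.argv[1]), sys.argv[2]
    for name, lineno, line in search_vault(vault_dir, query):
        print(f"{name}:{lineno}: {line}")
```

Because the output is plain `file:line: text`, the LLM can quote it back verbatim, which is exactly the traceability the file-over-app approach is after.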
This “file-over-app” philosophy is a direct challenge to SaaS-heavy models like Notion or Google Docs. In the Karpathy model, the user owns the data, and the AI is merely a highly sophisticated editor that “visits” the files to perform work.
The AI community has reacted with a mix of technical validation and “vibe-coding” enthusiasm. The debate centers on whether the industry has over-indexed on Vector DBs for problems that are fundamentally about structure, not just similarity.
Jason Paul Michaels (@SpaceWelder314), a welder using Claude, echoed the sentiment that simpler tools are often more robust:
“No vector database. No embeddings… Just markdown, FTS5, and grep… Every bug fix… gets indexed. The knowledge compounds.”
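That markdown-plus-FTS5 recipe is easy to reproduce with SQLite's built-in full-text engine. The note paths and contents below are invented for illustration:

```python
# Sketch of the "markdown + FTS5" approach: index note text in SQLite's
# built-in full-text search engine and query it with no vector database.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE notes USING fts5(path, body)")
con.executemany("INSERT INTO notes VALUES (?, ?)", [
    ("fixes/timeout.md", "Bug fix: raise the socket timeout for slow hosts"),
    ("fixes/retry.md", "Bug fix: add exponential backoff to retries"),
])
# MATCH performs full-text search; bm25() orders results by relevance.
rows = con.execute(
    "SELECT path FROM notes WHERE notes MATCH ? ORDER BY bm25(notes)",
    ("timeout",),
).fetchall()
print(rows)  # → [('fixes/timeout.md',)]
```

Every indexed "bug fix" stays a readable Markdown file on disk; FTS5 is just a fast, local way to find it again, which is the compounding effect the quote describes.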
However, the most significant praise came from Steph Ango (@Kepano), co-creator of Obsidian, who highlighted a concept called “Contamination Mitigation.”
He suggested that users should keep their personal “vault” clean and let the agents play in a “messy vault,” only bringing over the useful artifacts once the agent-facing workflow has distilled them.
| Feature | Vector DB / RAG | Karpathy’s Markdown Wiki |
| --- | --- | --- |
| Data Format | Opaque Vectors (Math) | Human-Readable Markdown |
| Logic | Semantic Similarity (Nearest Neighbor) | Explicit Connections (Backlinks/Indices) |
| Auditability | Low (Black Box) | High (Direct Traceability) |
| Compounding | Static (Requires re-indexing) | Active (Self-healing through linting) |
| Ideal Scale | Millions of Documents | 100 – 10,000 High-Signal Documents |
The “Vector DB” approach is like a massive, unorganized warehouse with a very fast forklift driver. You can find anything, but you don’t know why it’s there or how it relates to the pallet next to it. Karpathy’s “Markdown Wiki” is like a curated library with a head librarian who is constantly writing new books to explain the old ones.
Karpathy’s final exploration points toward the ultimate destination of this data: Synthetic Data Generation and Fine-Tuning.
As the wiki grows and the data becomes more “pure” through continuous LLM linting, it becomes the perfect training set.
Instead of the LLM just reading the wiki in its “context window,” the user can eventually fine-tune a smaller, more efficient model on the wiki itself. This would allow the LLM to “know” the researcher’s personal knowledge base in its own weights, essentially turning a personal research project into a custom, private intelligence.
Bottom-line: Karpathy hasn’t just shared a script; he’s shared a philosophy. By treating the LLM as an active agent that maintains its own memory, he has bypassed the limitations of “one-shot” AI interactions.
For the individual researcher, it means the end of the “forgotten bookmark.”
For the enterprise, it means the transition from a “raw/ data lake” to a “compiled knowledge asset.” As Karpathy himself summarized: “You rarely ever write or edit the wiki manually; it’s the domain of the LLM.” We are entering the era of the autonomous archive.
Microsoft on Thursday launched three new foundational AI models it built entirely in-house — a state-of-the-art speech transcription system, a voice generation engine, and an upgraded image creator — marking the most concrete evidence yet that the $3 trillion software giant intends to compete directly with OpenAI, Google, and other frontier labs on model development, not just distribution.
The trio of models — MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 — are available immediately through Microsoft Foundry and a new MAI Playground. They span three of the most commercially valuable modalities in enterprise AI: converting speech to text, generating realistic human voice, and creating images. Together, they represent the opening salvo from Microsoft’s superintelligence team, which Microsoft AI CEO Mustafa Suleyman formed just six months ago to pursue what he calls “AI self-sufficiency.”
“I’m very excited that we’ve now got the first models out, which are the very best in the world for transcription,” Suleyman told VentureBeat in an interview ahead of the public announcement. “Not only that, we’re able to deliver the model with half the GPUs of the state-of-the-art competition.”
The announcement lands at a precarious moment for Microsoft. The company’s stock just closed its worst quarter since the 2008 financial crisis, as investors increasingly demand proof that hundreds of billions of dollars in AI infrastructure spending will translate into revenue. These models — priced aggressively and positioned to reduce Microsoft’s own cost of goods sold — are Suleyman’s first answer to that pressure.
MAI-Transcribe-1 is the headline release. The speech-to-text model achieves the lowest average Word Error Rate on the FLEURS benchmark — the industry-standard multilingual test — across the top 25 languages by Microsoft product usage, averaging 3.8% WER. According to Microsoft’s benchmarks, it beats OpenAI’s Whisper-large-v3 on all 25 languages, Google’s Gemini 3.1 Flash on 22 of 25, and ElevenLabs’ Scribe v2 and OpenAI’s GPT-Transcribe on 15 of 25 each.
The model uses a transformer-based text decoder with a bi-directional audio encoder. It accepts MP3, WAV, and FLAC files up to 200MB, and Microsoft says its batch transcription speed is 2.5 times faster than the existing Microsoft Azure Fast offering. Diarization, contextual biasing, and streaming are listed as “coming soon.” Microsoft is already testing MAI-Transcribe-1 inside Copilot’s Voice mode and Microsoft Teams for conversation transcription — a detail that underscores how quickly the company intends to replace third-party or older internal models with its own.
Alongside it, MAI-Voice-1 is Microsoft’s text-to-speech model, capable of generating 60 seconds of natural-sounding audio in a single second. The model preserves speaker identity across long-form content and now supports custom voice creation from just a few seconds of audio through Microsoft Foundry. Microsoft is pricing it at $22 per 1 million characters. MAI-Image-2, meanwhile, debuted as a top-three model family on the Arena.ai leaderboard and now delivers at least 2x faster generation times on Foundry and Copilot compared to its predecessor. Microsoft is rolling it out across Bing and PowerPoint, pricing it at $5 per 1 million tokens for text input and $33 per 1 million tokens for image output. WPP, one of the world’s largest advertising holding companies, is among the first enterprise partners building with MAI-Image-2 at scale.
To understand why these models matter, you have to understand the tectonic contractual shift that made them possible. Until October 2025, Microsoft was contractually prohibited from independently pursuing artificial general intelligence. The original deal with OpenAI, signed in 2019, gave Microsoft a license to OpenAI’s models in exchange for building the cloud infrastructure OpenAI needed. But when OpenAI sought to expand its compute footprint beyond Microsoft — striking deals with SoftBank and others — Microsoft renegotiated. As Suleyman explained in a December 2025 interview with Bloomberg, the revised agreement meant that “up until a few weeks ago, Microsoft was not allowed — by contract — to pursue artificial general intelligence or superintelligence independently.” The new terms freed Microsoft to build its own frontier models while retaining license rights to everything OpenAI builds through 2032.
Suleyman described the dynamic to VentureBeat in characteristically blunt terms. “Back in September of last year, we renegotiated the contract with OpenAI, and that enabled us to independently pursue our own superintelligence,” he said. “Since then, we’ve been convening the compute and the team and buying up the data that we need.”
He was quick to emphasize that the OpenAI partnership remains intact. “Nothing’s changing with the OpenAI partnership. We will be in partnership with them at least until 2032 and hopefully a lot longer,” Suleyman said. “They have been a phenomenal partner to us.” He also highlighted that Microsoft provides access to Anthropic’s Claude through its Foundry API, framing the company as “a platform of platforms.” But the subtext is unmistakable: Microsoft is building the capability to stand on its own. In March, as Business Insider first reported, Suleyman wrote in an internal memo that his goal is to “focus all my energy on our Superintelligence efforts and be able to deliver world class models for Microsoft over the next 5 years.” CNBC reported that the structural shift freed Suleyman from day-to-day Copilot product responsibilities, with former Snap executive Jacob Andreou taking over as EVP of the combined consumer and commercial Copilot experience.
Perhaps the most striking detail Suleyman shared with VentureBeat is how small the teams behind these models actually are. “The audio model was built by 10 people, and the vast majority of the speed, efficiency and accuracy gains come from the model architecture and the data that we have used,” Suleyman said. “My philosophy has always been that we need fewer people who are more empowered. So we operate an extremely flat structure.” He added: “Our image team, equally, is less than 10 people. So this is all about model and data innovation, which has delivered state of the art performance.”
This matters for two reasons. First, it challenges the prevailing industry narrative that frontier AI development requires thousands of researchers and billions in headcount costs. Meta, by contrast, has pursued what Suleyman described in his Bloomberg interview as a strategy of “hiring a lot of individuals, rather than maybe creating a team” — including reported compensation packages of $100 million to $200 million for top researchers. Second, small teams producing state-of-the-art results dramatically improve the economics. If Microsoft can build best-in-class transcription with 10 engineers and half the GPUs of competitors, the margin structure of its AI business looks fundamentally different from companies burning through cash to achieve similar benchmarks.
The lean-team philosophy also echoes Suleyman’s broader views on how AI is already reshaping the work of building AI itself. When asked by VentureBeat how his own team works, Suleyman described an environment that resembles a startup trading floor more than a traditional Microsoft engineering org. “There are groups of people around round tables, circular tables, not traditional desks, on laptops instead of big screens,” he said. “They’re basically vibe coding, side by side all day, morning till night, in rooms of 50 or 60 people.”
Suleyman has been steadily building a philosophical brand around Microsoft’s AI efforts that he calls “humanist AI” — a term that appeared prominently in the blog post he authored for the launch and that he elaborated on in our interview. “I think that the motivation of a humanist super intelligence is to create something that is truly in service of humanity,” he told VentureBeat. “Humans will remain in control at the top of the food chain, and they will be always aligned to human interests.”
The framing serves multiple purposes. It differentiates Microsoft from the more acceleration-oriented rhetoric coming from OpenAI and Meta. It resonates with enterprise buyers who need governance, compliance, and safety assurances before deploying AI in regulated industries. And it provides a narrative hedge: if something goes wrong in the broader AI ecosystem, Microsoft can point to its stated commitment to human control. In his December Bloomberg interview, Suleyman went further, describing containment and alignment as “red lines” and arguing that no one should release a superintelligence tool until they are “confident it can be controlled.”
Suleyman also stressed data provenance as a competitive advantage, describing a conversation with CEO Satya Nadella about developing “a clean lineage of models where the data is extremely clean.” He drew an implicit contrast with open-source alternatives, noting that “many of the open-source models have been trained on data in, let’s say, inappropriate ways. And there are potentially security issues with that.” For enterprise customers evaluating AI vendors amid a thicket of copyright lawsuits across the industry, that is a meaningful commercial argument — if Microsoft can credibly claim that its training data was acquired through properly licensed channels, it reduces the legal and reputational risk of deploying these models in production.
Today’s launch positions Microsoft on three competitive fronts simultaneously. MAI-Transcribe-1 directly targets the transcription workloads that OpenAI’s Whisper models have dominated in the open-source community, with Microsoft claiming superior accuracy on all 25 benchmarked languages. The FLEURS results also show it winning against Google’s Gemini 3.1 Flash Lite on 22 of 25 languages — a direct challenge as Google aggressively pushes Gemini across its own product suite. And MAI-Voice-1’s ability to clone voices from seconds of audio and generate speech at 60x real-time puts it in competition with ElevenLabs, Resemble AI, and the growing ecosystem of voice AI startups, with Microsoft’s distribution advantage — any Foundry developer can now access these capabilities through the same API they use for GPT-4 and Claude — acting as a powerful moat.
Suleyman framed the competitive position confidently: “We’re now a top three lab just under OpenAI and Gemini,” he told VentureBeat. The pricing strategy — MAI-Voice-1 at $22 per million characters, MAI-Image-2 at $5 per million input tokens — reflects a deliberate decision to compete on cost. “We’re pricing them to be the very best of any hyperscaler. So they will be the cheapest of any of the hyperscalers out there, Amazon. And obviously Google,” Suleyman said. “And that’s a very conscious decision.”
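The quoted prices make for straightforward back-of-envelope math. The sketch below uses only the figures Microsoft announced ($22 per million characters for MAI-Voice-1, $5 and $33 per million input and output tokens for MAI-Image-2); the helper functions and the example workload are illustrative assumptions, not Foundry's actual billing API:

```python
# Launch prices quoted by Microsoft (USD); actual Foundry billing may differ.
VOICE_PER_M_CHARS = 22.00    # MAI-Voice-1, per 1M characters synthesized
IMAGE_IN_PER_M_TOK = 5.00    # MAI-Image-2, per 1M text input tokens
IMAGE_OUT_PER_M_TOK = 33.00  # MAI-Image-2, per 1M image output tokens

def voice_cost(characters: int) -> float:
    """Estimated MAI-Voice-1 cost for a given script length."""
    return characters / 1_000_000 * VOICE_PER_M_CHARS

def image_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated MAI-Image-2 cost for one generation request."""
    return (input_tokens / 1_000_000 * IMAGE_IN_PER_M_TOK
            + output_tokens / 1_000_000 * IMAGE_OUT_PER_M_TOK)

# Narrating a ~60,000-character script (roughly an hour of finished audio)
print(f"${voice_cost(60_000):.2f}")  # → $1.32
```

At these rates, an hour of synthesized narration costs on the order of a dollar, which is the kind of unit economics the undercut-the-hyperscalers strategy depends on.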
This makes strategic sense for Microsoft, which can amortize model development costs across its enormous installed base of enterprise customers. But it also speaks to the question investors have been asking with increasing urgency: when does AI spending start generating returns? Microsoft’s stock has fallen roughly 17% year-to-date, according to CNBC, part of a broader selloff in software stocks. By building models that run on half the GPUs of competitors, Microsoft reduces its own infrastructure costs for internal products — Teams, Copilot, Bing, PowerPoint — while offering developers pricing designed to undercut the rest of the market. In his March memo, Suleyman wrote that his models would “enable us to deliver the COGS efficiencies necessary to be able to serve AI workloads at the immense scale required in the coming years.” These three models are the first tangible delivery on that promise.
Suleyman made clear that transcription, voice, and image generation are just the beginning. When asked whether Microsoft would build a large language model to compete directly with GPT at the frontier level, he was unequivocal. “We absolutely are going to be delivering state of the art models across all modalities,” he said. “Our mission is to make sure that if Microsoft ever needs it, we will be able to provide state of the art at the best efficiency, the cheapest price, and be completely independent.”
He described a multi-year roadmap to “set up the GPU clusters at the appropriate scale,” noting that the superintelligence team was formally stood up only in October 2025. Suleyman spoke to VentureBeat from Miami, where the full team was convening for one of its regular week-long in-person sessions. He described Nadella flying in for the gathering to lay out “the roadmap of everything that we need to achieve for our AI self-sufficiency mission over the next 2, 3, 4 years, and all the compute roadmap that that would involve.”
Building a competitive frontier LLM, of course, is a different order of magnitude in complexity, data requirements, and compute cost from what Microsoft demonstrated Thursday. The models launched today are specialized — they handle audio and images, not the general reasoning and text generation that underpin products like ChatGPT or Copilot’s core intelligence. Suleyman has the organizational mandate, Nadella’s public backing, and the contractual freedom. What he doesn’t yet have is a track record at Microsoft of delivering on the hardest problem in AI.
But consider what he does have: three models that are best-in-class or near it in their respective domains, built by teams smaller than most seed-stage startups, running on half the industry-standard GPU footprint, and priced below every major cloud competitor. Two years ago, Suleyman proposed in MIT Technology Review what he called the “Modern Turing Test” — not whether AI could fool a human in conversation, but whether it could go out into the world and accomplish real economic tasks with minimal oversight. On Thursday, his own models took a step toward that vision. The question now is whether Microsoft’s superintelligence team can repeat the trick at the scale that actually matters — and whether they can do it before the market’s patience runs out.