Google’s newest AI model, Gemini 3.1 Flash-Lite, is here, and the biggest improvements this time around come in cost and speed, especially for enterprises and developers looking to tap powerful reasoning and multimodal capabilities from the U.S. search and cloud giant.
Positioning it as the most cost-efficient and responsive model in the Gemini 3 series, Google is offering a solution built specifically for intelligence at scale.
This launch arrives just weeks after the February debut of its heavy-lifting sibling, Gemini 3.1 Pro, completing a tiered strategy that allows enterprises to scale intelligence across every layer of their infrastructure.
In the world of high-throughput AI, the metric that often dictates user experience isn’t just accuracy—it’s latency. For real-time customer support, live content moderation, or instant user interface generation, the “time to first answer token” is the primary indicator of whether an application feels like a tool or a teammate. If a model takes even two seconds to begin its response, the illusion of fluid interaction is broken.
Gemini 3.1 Flash-Lite is engineered specifically for this instant feel. According to internal benchmarks and third-party evaluations, Flash-Lite outperforms its predecessor, Gemini 2.5 Flash, with a 2.5X faster time to first token. Furthermore, it boasts a 45 percent increase in overall output speed — 363 tokens per second compared to 249.
This speed is achieved through what Koray Kavukcuoglu, VP of Research at Google DeepMind, described in an X post as “an unbelievable amount of complex engineering” to make AI feel instantaneous.
Perhaps the most innovative technical addition is the introduction of thinking levels.
Standardized across both the Flash-Lite and Pro variants, this feature allows developers to modulate the model’s reasoning intensity dynamically. For a simple classification task or a high-volume sentiment analysis, the model can be dialed down for maximum speed and minimum cost.
Conversely, for complex code exploration, generating dashboards, or creating simulations, the thinking can be dialed up, allowing the model to perform deeper reasoning and logic before emitting its first response.
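As a rough illustration of how a developer might expose that dial, the sketch below builds a request payload with a reasoning-intensity setting. The payload shape and the `thinking_level` field name are assumptions for illustration, not Google’s documented API.

```python
def build_request(prompt: str, task_complexity: str) -> dict:
    """Sketch of a request payload that dials reasoning up or down.

    The `thinking_level` field and its values ("low"/"high") are
    illustrative assumptions, not a documented API contract.
    """
    # Simple tasks (classification, sentiment) run with minimal reasoning;
    # complex ones (code exploration, simulations) get deeper thinking.
    level = "low" if task_complexity == "simple" else "high"
    return {
        "model": "gemini-3.1-flash-lite",
        "contents": [{"role": "user", "parts": [{"text": prompt}]}],
        "generation_config": {"thinking_level": level},
    }
```

The point of the pattern is that the same model endpoint serves both ends of the cost/latency spectrum, with one parameter deciding how long it deliberates before the first token.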
While the “Lite” suffix often implies a significant sacrifice in capability, the performance data suggests a model that punches well into the territory of much larger systems. Gemini 3.1 Flash-Lite achieved an Elo score of 1432 on the Arena.ai Leaderboard, placing it in a competitive tier with models much larger in parameter count.
Key benchmark results highlight its specialized strengths across diverse cognitive domains:
Scientific knowledge: 86.9 percent on GPQA Diamond.
Multimodal understanding: 76.8 percent on MMMU-Pro.
Multilingual Q&A: 88.9 percent on MMMLU.
Parametric knowledge: 43.3 percent on SimpleQA Verified.
Abstract reasoning: 16.0 percent on Humanity’s Last Exam (full set).
The model is particularly adept at structured output compliance—a critical requirement for enterprise developers who need AI to generate valid JSON, SQL, or UI code that won’t break downstream systems.
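In practice, teams that depend on structured output still validate it before it touches downstream systems. A minimal defensive parse might look like the following sketch; the `intent`/`confidence` schema is hypothetical.

```python
import json

# Hypothetical schema: the fields a downstream router expects.
REQUIRED_FIELDS = {"intent": str, "confidence": float}

def parse_structured_output(raw: str) -> dict:
    """Parse a model response and enforce a minimal schema so malformed
    output never reaches downstream systems."""
    data = json.loads(raw)  # raises ValueError on invalid JSON
    for field, typ in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), typ):
            raise ValueError(f"missing or mistyped field: {field}")
    return data
```

A model with high structured-output compliance means this guard rarely fires, but keeping it in the pipeline is what prevents one bad generation from breaking a production system.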
In benchmarks like LiveCodeBench, Flash-Lite scored 72.0 percent, holding its own in its weight class; GPT-5 mini posted 80.4 percent on a different subset of the benchmark, but lags significantly behind Flash-Lite in speed and cost efficiency.
Furthermore, its performance on CharXiv Reasoning (73.2 percent) and Video-MMMU (84.8 percent) demonstrates that its multimodal capabilities are robust enough for complex chart synthesis and knowledge acquisition from video.
To understand Flash-Lite’s place in the market, one must look at it alongside Gemini 3.1 Pro, which Google released in mid-February 2026 to retake the AI crown. While Flash-Lite is the reflexes of the Gemini system, 3.1 Pro is undoubtedly the brain.
The primary differentiator is the depth of cognitive processing. Gemini 3.1 Pro was engineered to double the reasoning performance of the previous generation, achieving a verified score of 77.1 percent on ARC-AGI-2—a benchmark designed to test a model’s ability to solve entirely new logic patterns it has not encountered during training.
While Flash-Lite holds its own in scientific knowledge at 86.9 percent, the Pro model pushes that boundary to a staggering 94.3 percent, making it the superior choice for deep research and high-stakes synthesis. The application focus also differs significantly based on these reasoning gaps.
Gemini 3.1 Pro is capable of vibe-coding—generating animated SVGs and complex 3D simulations directly from text prompts. For example, in one demonstration, Pro coded a complex 3D starling murmuration that users could manipulate via hand-tracking. It can even reason through abstract literary themes, such as translating the atmospheric tone of Emily Brontë’s Wuthering Heights into a functional web design.
Gemini 3.1 Flash-Lite, conversely, is the workhorse for high-volume execution. It handles the millions of daily tasks—translation, tagging, and moderation—that require consistent, repeatable results without the massive compute overhead of a reasoning-heavy model.
It fills a wireframe with hundreds of products instantly or orchestrates intent routing with 94 percent accuracy, as reported by early testers.
For enterprise technical decision-makers, the most compelling part of the Gemini 3.1 series is the reasoning-to-dollar ratio.
Google has priced Gemini 3.1 Flash-Lite at $0.25 per 1 million input tokens and $1.50 per 1 million output tokens.
This pricing makes it significantly more affordable than competitors like Claude Haiku 4.5, which is priced at $1.00 per 1 million input tokens and $5.00 per 1 million output tokens.
Even compared to Gemini 2.5 Flash, which cost $0.30 per 1 million input tokens, Flash-Lite offers a cost reduction alongside its performance gains.
When contrasted with Gemini 3.1 Pro—which maintains a price of $2.00 per million input tokens for prompts up to 200k—the strategic advantage of the dual-model approach becomes clear. In high-context usage (above 200,000 tokens per interaction), Flash-Lite is actually between 12x and 16x cheaper.
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Total Cost |
| --- | --- | --- | --- |
| Qwen 3 Turbo | $0.05 | $0.20 | $0.25 |
| Qwen3.5-Flash | $0.10 | $0.40 | $0.50 |
| deepseek-chat (V3.2-Exp) | $0.28 | $0.42 | $0.70 |
| deepseek-reasoner (V3.2-Exp) | $0.28 | $0.42 | $0.70 |
| Grok 4.1 Fast (reasoning) | $0.20 | $0.50 | $0.70 |
| Grok 4.1 Fast (non-reasoning) | $0.20 | $0.50 | $0.70 |
| MiniMax M2.5 | $0.15 | $1.20 | $1.35 |
| Gemini 3.1 Flash-Lite | $0.25 | $1.50 | $1.75 |
| MiniMax M2.5-Lightning | $0.30 | $2.40 | $2.70 |
| Gemini 3 Flash Preview | $0.50 | $3.00 | $3.50 |
| Kimi-k2.5 | $0.60 | $3.00 | $3.60 |
| GLM-5 | $1.00 | $3.20 | $4.20 |
| ERNIE 5.0 | $0.85 | $3.40 | $4.25 |
| Claude Haiku 4.5 | $1.00 | $5.00 | $6.00 |
| Qwen3-Max (2026-01-23) | $1.20 | $6.00 | $7.20 |
| Gemini 3 Pro (≤200K) | $2.00 | $12.00 | $14.00 |
| GPT-5.2 | $1.75 | $14.00 | $15.75 |
| Claude Sonnet 4.5 | $3.00 | $15.00 | $18.00 |
| Gemini 3 Pro (>200K) | $4.00 | $18.00 | $22.00 |
| Claude Opus 4.6 | $5.00 | $25.00 | $30.00 |
| GPT-5.2 Pro | $21.00 | $168.00 | $189.00 |
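The 12x-to-16x figure can be checked directly against the long-context rates in the table above: Pro’s >200K input rate is 16x Flash-Lite’s, and its output rate is 12x.

```python
# Per-million-token rates taken from the pricing table above (USD).
pro_long_context = {"input": 4.00, "output": 18.00}   # Gemini 3 Pro (>200K)
flash_lite = {"input": 0.25, "output": 1.50}          # Gemini 3.1 Flash-Lite

input_ratio = pro_long_context["input"] / flash_lite["input"]    # 16.0
output_ratio = pro_long_context["output"] / flash_lite["output"]  # 12.0
```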
By using a cascading architecture, an enterprise can use 3.1 Pro for the initial complex planning, architectural design, and deep logic, then hand off high-frequency, repetitive execution to Flash-Lite at one-eighth of the cost.
This shift effectively moves AI from an expensive experimental cost center to a utility-grade resource that can be run over every log file, email, and customer chat without exhausting the cloud budget.
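The cascade described above reduces, at its simplest, to a router that sends heavyweight planning to Pro and everything else to Flash-Lite. The task taxonomy and model identifiers below are illustrative assumptions, not a prescribed API.

```python
def route(task: dict) -> str:
    """Cascade sketch: planning/design work goes to the Pro model;
    high-frequency, repetitive execution goes to Flash-Lite.

    The `kind` values and model names are illustrative.
    """
    heavy = {"planning", "architecture", "deep_logic"}
    return "gemini-3.1-pro" if task["kind"] in heavy else "gemini-3.1-flash-lite"
```

In a real deployment the routing signal would come from a classifier or from the calling application, but the economics are the same: the expensive model runs once per plan, the cheap one runs millions of times per day.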
Early feedback from Google’s partner network suggests that the 3.1 series is successfully filling a critical gap in the market for reliable autonomy.
Andrew Carr, Chief Scientist at Cartwheel, has tested both models and noted their unique strengths. Regarding 3.1 Pro, he highlighted its substantially improved understanding of 3D transformations, which resolved long-standing rotation order bugs in animation pipelines.
However, he found Flash-Lite to be a different kind of unlock for the business: “3.1 Flash-Lite is a remarkably competent model. It is lightning fast, but still somehow finds a way to follow all instructions… The intelligence to speed ratio is unparalleled in any other model.”
For consumer-facing applications, the low latency of Flash-Lite has been the key to market expansion.
Kolby Nottingham, Head of AI at Latitude, shared that the model achieved a 20 percent higher success rate and 60 percent faster inference times compared to their previous model, enabling sophisticated storytelling to a much wider audience than would have otherwise been possible.
Reliability in data tagging has also emerged as a standout feature. Bianca Rangecroft, CEO of Whering, reported that by integrating 3.1 Flash-Lite into their classification pipeline, they achieved 100 percent consistency in item tagging, providing a highly reliable foundation for their label assignment and increasing confidence in structured outputs.
Kaan Ortabas, Co-Founder of HubX, noted that as a root orchestration engine, Flash-Lite delivered sub-10 second completions with near-instant streaming and 97 percent structured output compliance.
On the flagship side, Vladislav Tankov, Director of AI at JetBrains, noted a 15 percent quality improvement in the Pro model, emphasizing that it is stronger, faster, and more efficient, requiring fewer output tokens to achieve its goals.
Both Gemini 3.1 Flash-Lite and Pro are offered through Google AI Studio and Vertex AI. As proprietary models, they follow a standard commercial software-as-a-service model rather than an open-source license.
Operating through Vertex AI provides grounded reasoning within a secure perimeter, ensuring that high-volume workloads—like those being run by Databricks to achieve best-in-class results on the OfficeQA benchmark—remain protected by enterprise-grade security and data residency guarantees.
However, they are also limited in terms of customizability and require persistent internet connectivity, unlike fully open-source rivals such as the powerful new Qwen3.5 series released by Alibaba over the last few weeks.
The current preview status for Flash-Lite allows Google to refine safety and performance based on real-world developer feedback before general availability.
For developers already building via the Gemini API, the transition to 3.1 Pro and Flash-Lite represents a direct performance upgrade at the same or lower price points, effectively lowering the barrier to entry for complex agentic workflows.
The release of Gemini 3.1 Flash-Lite represents the final piece of a strategic pivot for Google. While the industry has been obsessed with state-of-the-art reasoning for the most complex problems, the vast majority of enterprise work consists of high-volume, repetitive, but high-precision tasks.
By providing both the brain in Gemini 3.1 Pro and the reflexes in Gemini 3.1 Flash-Lite, Google is signaling that the next phase of the AI race will be won by models that can think through a problem, but also execute that solution at scale.
For the CTO or technical lead deciding which model to bake into their 2026 product roadmap, the Gemini 3.1 series offers a compelling argument: you no longer have to pay a reasoning tax to get reliable, instantaneous results. As Flash-Lite rolls out in preview today, the message to the developer community is clear: the barrier to intelligence at scale hasn’t just been lowered—it’s been dismantled.
When an OpenAI finance analyst needed to compare revenue across geographies and customer cohorts last year, it took hours of work — hunting through 70,000 datasets, writing SQL queries, verifying table schemas. Today, the same analyst types a plain-English question into Slack and gets a finished chart in minutes.
The tool behind that transformation was built by two engineers in three months. Seventy percent of its code was written by AI. And it is now used by more than 4,000 of OpenAI’s roughly 5,000 employees every day — making it one of the most aggressive deployments of an AI data agent inside any company, anywhere.
In an exclusive interview with VentureBeat, Emma Tang, the head of data infrastructure at OpenAI whose team built the agent, offered a rare look inside the system — how it works, how it fails, and what it signals about the future of enterprise data. The conversation, paired with the company’s blog post announcing the tool, paints a picture of a company that turned its own AI on itself and discovered something that every enterprise will soon confront: the bottleneck to smarter organizations isn’t better models. It’s better data.
“The agent is used for any kind of analysis,” Tang said. “Almost every team in the company uses it.”
To understand why OpenAI built this system, consider the scale of the problem. The company’s data platform spans more than 600 petabytes across 70,000 datasets. Even locating the correct table can consume hours of a data scientist’s time. Tang’s Data Platform team — which sits under infrastructure and oversees big data systems, streaming, and the data tooling layer — serves a staggering internal user base. “There are 5,000 employees at OpenAI right now,” Tang said. “Over 4,000 use data tools that our team provides.”
The agent, built on GPT-5.2 and accessible wherever employees already work — Slack, a web interface, IDEs, the Codex CLI, and OpenAI’s internal ChatGPT app — accepts plain-English questions and returns charts, dashboards, and long-form analytical reports. In follow-up responses shared with VentureBeat on background, the team estimated it saves two to four hours of work per query. But Tang emphasized that the larger win is harder to measure: the agent gives people access to analysis they simply couldn’t have done before, regardless of how much time they had.
“Engineers, growth, product, as well as non-technical teams, who may not know all the ins and outs of the company data systems and table schemas” can now pull sophisticated insights on their own, her team noted.
Tang walked through several concrete use cases that illustrate the agent’s range. OpenAI’s finance team queries it for revenue comparisons across geographies and customer cohorts. “It can, just literally in plain text, send the agent a query, and it will be able to respond and give you charts and give you dashboards, all of these things,” she said.
But the real power lies in strategic, multi-step analysis. Tang described a recent case where a user spotted discrepancies between two dashboards tracking Plus subscriber growth. “The data agent can give you a chart and show you, stack rank by stack rank, exactly what the differences are,” she said. “There turned out to be five different factors. For a human, that would take hours, if not days, but the agent can do it in a few minutes.”
Product managers use it to understand feature adoption. Engineers use it to diagnose performance regressions — asking, for instance, whether a specific ChatGPT component really is slower than yesterday, and if so, which latency components explain the change. The agent can break it all down and compare prior periods from a single prompt.
What makes this especially unusual is that the agent operates across organizational boundaries. Most enterprise AI agents today are siloed within departments — a finance bot here, an HR bot there. OpenAI’s cuts horizontally across the company. Tang said they launched department by department, curating specific memory and context for each group, but “at some point it’s all in the same database.” A senior leader can combine sales data with engineering metrics and product analytics in a single query. “That’s a really unique feature of ours,” Tang said.
Finding the right table among 70,000 datasets is, by Tang’s own admission, the single hardest technical challenge her team faces. “That’s the biggest problem with this agent,” she said. And it’s where Codex — OpenAI’s AI coding agent — plays its most inventive role.
Codex serves triple duty in the system. Users access the data agent through Codex via MCP. The team used Codex to generate more than 70% of the agent’s own code, enabling two engineers to ship in three months. But the third role is the most technically fascinating: a daily asynchronous process where Codex examines important data tables, analyzes the underlying pipeline code, and determines each table’s upstream and downstream dependencies, ownership, granularity, join keys, and similar tables.
“We give it a prompt, have Codex look at the code and respond with what we need, and then persist that to the database,” Tang explained. When a user later asks about revenue, the agent searches a vector database to find which tables Codex has already mapped to that concept.
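The retrieval step Tang describes is, at its core, nearest-neighbor search over table descriptions. A minimal sketch, with toy three-dimensional vectors standing in for real embeddings and invented table names:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy embeddings standing in for Codex-generated table descriptions.
table_index = {
    "fct_revenue_daily": [0.9, 0.1, 0.0],
    "dim_users": [0.1, 0.9, 0.1],
}

def find_table(query_embedding):
    """Return the table whose enrichment embedding best matches the query."""
    return max(table_index, key=lambda t: cosine(query_embedding, table_index[t]))
```

A production system would use a real vector database and high-dimensional embeddings, but the lookup logic is the same: a question about “revenue” lands on the table Codex already mapped to that concept.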
This “Codex Enrichment” is one of six context layers the agent uses. The layers range from basic schema metadata and curated expert descriptions to institutional knowledge pulled from Slack, Google Docs, and Notion, plus a learning memory that stores corrections from previous conversations. When no prior information exists, the agent falls back to live queries against the data warehouse.
The team also tiers historical query patterns. “All query history is everybody’s ‘select star, limit 10.’ It’s not really helpful,” Tang said. Canonical dashboards and executive reports — where analysts invested significant effort determining the correct representation — get flagged as “source of truth.” Everything else gets deprioritized.
Even with six context layers, Tang was remarkably candid about the agent’s biggest behavioral flaw: overconfidence. It’s a problem anyone who has worked with large language models will recognize.
“It’s a really big problem, because what the model often does is feel overconfident,” Tang said. “It’ll say, ‘This is the right table,’ and just go forth and start doing analysis. That’s actually the wrong approach.”
The fix came through prompt engineering that forces the agent to linger in a discovery phase. “We found that the more time it spends gathering possible scenarios and comparing which table to use — just spending more time in the discovery phase — the better the results,” she said. The prompt reads almost like coaching a junior analyst: “Before you run ahead with this, I really want you to do more validation on whether this is the right table. So please check more sources before you go and create actual data.”
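One way to enforce that lingering is a simple gate that refuses to let the agent leave discovery until it has compared enough candidates. The instruction string below is the coaching language quoted above; the threshold and gate function are illustrative assumptions, not OpenAI’s actual implementation.

```python
# Coaching instruction quoted by Tang, injected into the agent's prompt.
DISCOVERY_INSTRUCTION = (
    "Before you run ahead with this, I really want you to do more validation "
    "on whether this is the right table. So please check more sources before "
    "you go and create actual data."
)

def ready_to_analyze(candidates_compared: int, minimum: int = 3) -> bool:
    """Gate sketch: keep the agent in the discovery phase until it has
    compared at least `minimum` candidate tables (threshold assumed)."""
    return candidates_compared >= minimum
```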
The team also learned, through rigorous evaluation, that less context can produce better results. “It’s very easy to dump everything in and just expect it to do better,” Tang said. “From our evals, we actually found the opposite. The fewer things you give it, and the more curated and accurate the context is, the better the results.”
To build trust, the agent streams its intermediate reasoning to users in real time, exposes which tables it selected and why, and links directly to underlying query results. Users can interrupt the agent mid-analysis to redirect it. The system also checkpoints its progress, enabling it to resume after failures. And at the end of every task, the model evaluates its own performance. “We ask the model, ‘how did you think that went? Was that good or bad?’” Tang said. “And it’s actually fairly good at evaluating how well it’s doing.”
When it comes to safety, Tang took a pragmatic approach that may surprise enterprises expecting sophisticated AI alignment techniques.
“I think you just have to have even more dumb guardrails,” she said. “We have really strong access control. It’s always using your personal token, so whatever you have access to is only what you have access to.”
The agent operates purely as an interface layer, inheriting the same permissions that govern OpenAI’s data. It never appears in public channels — only in private channels or a user’s own interface. Write access is restricted to a temporary test schema that gets wiped periodically and can’t be shared. “We don’t let it randomly write to systems either,” Tang said.
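The “dumb guardrails” Tang describes amount to an interface layer that inherits the caller’s permissions and confines writes to a disposable schema. A hypothetical sketch (class, schema, and method names invented):

```python
class DataAgentSession:
    """Guardrail sketch: the agent is a pure interface layer that runs
    under the calling user's own token and permissions.

    All names here are illustrative, not OpenAI's actual system.
    """

    WRITABLE_SCHEMA = "tmp_agent_scratch"  # temp test schema, wiped periodically

    def __init__(self, user_token: str, allowed_datasets):
        self.user_token = user_token
        self.allowed = set(allowed_datasets)

    def read(self, dataset: str) -> str:
        # The agent can only see what this specific user can see.
        if dataset not in self.allowed:
            raise PermissionError(f"{dataset}: not visible to this user")
        return f"SELECT ... FROM {dataset}"  # stand-in for a real query

    def write(self, schema: str, table: str) -> str:
        # Writes never touch production systems.
        if schema != self.WRITABLE_SCHEMA:
            raise PermissionError("writes restricted to the temp test schema")
        return f"CREATE TABLE {schema}.{table} AS ..."
```

The appeal of this design is that no new alignment machinery is needed: the agent simply cannot exceed the access of the human driving it.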
User feedback closes the loop. Employees flag incorrect results directly, and the team investigates. The model’s self-evaluation adds another check. Longer term, Tang said, the plan is to move toward a multi-agent architecture where specialized agents monitor and assist each other. “We’re moving towards that eventually,” she said, “but right now, even as it is, we’ve gotten pretty far.”
Despite the obvious commercial potential, OpenAI told VentureBeat that the company has no plans to productize its internal data agent. The strategy is to provide building blocks and let enterprises construct their own. And Tang made clear that everything her team used to build the system is already available externally.
“We use all the same APIs that are available externally,” she said. “The Responses API, the Evals API. We don’t have a fine-tuned model. We just use 5.2. So you can definitely build this.”
That message aligns with OpenAI’s broader enterprise push. The company launched OpenAI Frontier in early February, an end-to-end platform for enterprises to build and manage AI agents. It has since enlisted McKinsey, Boston Consulting Group, Accenture, and Capgemini to help sell and implement the platform. AWS and OpenAI are jointly developing a Stateful Runtime Environment for Amazon Bedrock that mirrors some of the persistent context capabilities OpenAI built into its data agent. And Apple recently integrated Codex directly into Xcode.
According to information shared with VentureBeat by OpenAI, Codex is now used by 95% of engineers at OpenAI and reviews all pull requests before they’re merged. Its global weekly active user base has tripled since the start of the year, surpassing one million. Overall usage has grown more than fivefold.
Tang described a shift in how employees use Codex that transcends coding entirely. “Codex isn’t even a coding tool anymore. It’s much more than that,” she said. “I see non-technical teams use it to organize thoughts and create slides and to create daily summaries.” One of her engineering managers has Codex review her notes each morning, identify the most important tasks, pull in Slack messages and DMs, and draft responses. “It’s really operating on her behalf in a lot of ways,” Tang said.
When asked what other enterprises should take away from OpenAI’s experience, Tang didn’t point to model capabilities or clever prompt engineering. She pointed to something far more mundane.
“This is not sexy, but data governance is really important for data agents to work well,” she said. “Your data needs to be clean enough and annotated enough, and there needs to be a source of truth somewhere for the agent to crawl.”
The underlying infrastructure — storage, compute, orchestration, and business intelligence layers — hasn’t been replaced by the agent. It still needs all of those tools to do its job. But it serves as a fundamentally new entry point for data intelligence, one that is more autonomous and accessible than anything that came before it.
Tang closed the interview with a warning for companies that hesitate. “Companies that adopt this are going to see the benefits very rapidly,” she said. “And companies that don’t are going to fall behind. It’s going to pull apart. The companies who use it are going to advance very, very quickly.”
Asked whether that acceleration worried her own colleagues — especially after a wave of recent layoffs at companies like Block — Tang paused. “How much we’re able to do as a company has accelerated,” she said, “but it still doesn’t match our ambitions, not even one bit.”
Endor Labs, the application security startup backed by more than $208 million in venture funding, today launched AURI, a platform that embeds real-time security intelligence directly into the AI coding tools that are reshaping how software gets built. The product is available free to individual developers and integrates natively with popular AI coding assistants including Cursor, Claude, and Augment through the Model Context Protocol (MCP).
The announcement arrives against a sobering backdrop. While 90% of development teams now use AI coding assistants, research published in December by Carnegie Mellon University, Columbia University, and Johns Hopkins University found that leading models produce functionally correct code only about 61% of the time — and just 10% of that output is both functional and secure.
“Even though AI can now produce functionally correct code 61% of the time, only 10% of that output is both functional and secure,” Endor Labs CEO Varun Badhwar told VentureBeat in an exclusive interview. “These coding agents were trained on open source code from across the internet, so they’ve learned best practices — but they’ve also learned to replicate a lot of the same security problems of the past.”
That gap between code that works and code that is safe defines the market AURI is designed to capture — and the urgency behind its launch.
To understand why Endor Labs built AURI, it helps to understand the structural problem at the heart of AI-assisted software development. AI coding models are trained on vast repositories of open-source code scraped from across the internet — code that includes not only best practices but also well-documented vulnerabilities, insecure patterns, and flaws that may not be discovered for years after the code was originally written.
Badhwar, a repeat cybersecurity entrepreneur who previously built RedLock (acquired by Palo Alto Networks), founded Endor Labs four years ago with Dimitri Stiliadis. The original thesis was straightforward: developers were becoming “software assemblers,” writing less original code and importing most components from open source repositories. Then came the explosion of AI-powered coding tools, which Badhwar described as “the once in a generation opportunity of how to rewrite software development life cycle powered by AI.”
The productivity gains are real — more efficiency, faster time to market, and the democratization of software creation beyond trained engineers. But the security consequences are potentially devastating. New vulnerabilities are discovered every day in code that may have been written a decade ago, and that constantly evolving threat intelligence is not easily available to the AI models generating new code.
“Every day, every hour, new vulnerabilities are found in software that might have been written 5, 10, 12 years ago — and that information isn’t easily available to the models,” Badhwar explained. “If you started filtering out anything that ever had a vulnerability, you’d have no code left to train on.”
The result is a feedback loop: AI tools generate code at unprecedented speed, much of it modeled on insecure patterns, and security teams scramble to keep up. Traditional scanning tools, designed for a world where humans wrote and reviewed code at human speed, are increasingly overmatched.
AURI’s core technical differentiator is what Endor Labs calls its “code context graph” — a deep, function-level map of how an application’s first-party code, open source dependencies, container layers, and AI models interconnect. Where competitors like Snyk and GitHub’s Dependabot examine what libraries an application imports and cross-reference them against known vulnerability databases, Endor Labs traces exactly how and where those components are actually used, down to the individual line of code.
“We have this code intelligence graph that understands not just what libraries and dependencies you use, but pinpoints exactly how, where, and in what context they’re used — down to the specific line of code where you’re calling a piece of functionality that has a vulnerability,” Badhwar said.
He illustrated the difference with a concrete example. A developer might import a large library like an AWS SDK but only call two services comprising 10 lines of code. The remaining 99,000 lines in that open source library are unreachable by the application. Traditional tools flag every known vulnerability across the entire library. AURI’s full-stack reachability analysis trims those irrelevant findings away.
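At its core, reachability analysis of this kind is a traversal over a function-level call graph: anything not reachable from the application’s entry points cannot be exercised, so findings there can be suppressed. A toy sketch of the idea, with invented function names:

```python
from collections import deque

def reachable(call_graph, entry_points):
    """BFS over a function-level call graph: return every function the
    application can actually reach from its entry points."""
    seen, queue = set(entry_points), deque(entry_points)
    while queue:
        fn = queue.popleft()
        for callee in call_graph.get(fn, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen

# Toy graph: the app calls two SDK services; the vulnerable legacy
# helper exists in the library but is never invoked.
graph = {
    "app.main": ["sdk.s3.get", "sdk.sqs.send"],
    "sdk.s3.get": [],
    "sdk.sqs.send": [],
    "sdk.legacy.parse": ["sdk.legacy.vulnerable_eval"],  # hypothetical CVE site
}
live = reachable(graph, ["app.main"])
```

Because `sdk.legacy.vulnerable_eval` never appears in the reachable set, a reachability-aware scanner can drop that finding instead of flagging every CVE in the library. Real systems add dataflow analysis and dynamic-dispatch resolution on top, but the pruning principle is the same.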
Building that capability required significant investment. Endor Labs hired 13 PhDs specializing in program analysis, many of whom previously built similar technology internally at companies like Meta, GitHub, and Microsoft. The company has indexed billions of functions across millions of open source packages and created over half a billion embeddings to identify the provenance of copied code, even when function names or structures have been changed.
The platform combines this deterministic analysis with agentic AI reasoning. Specialized agents work together to detect, triage, and remediate vulnerabilities automatically, while multi-file call graphs and dataflow analysis detect complex business logic flaws that span multiple components. The result, according to Endor Labs, is an average 80% to 95% reduction in security findings for enterprise customers — trimming away what Badhwar called “tens of millions of dollars a year in developer productivity” lost to investigating false positives.
In a strategic move aimed at rapid adoption, Endor Labs is offering AURI’s core functionality free to individual developers through an MCP server that integrates directly with popular IDEs including VS Code, Cursor, and Windsurf. The free tier requires no credit card, no sign-up process, and no complex registration.
“The idea is that there’s no policy, no administration, no customization. It just helps your code generation tools stop creating more vulnerabilities,” Badhwar said.
Privacy-conscious developers will note a key architectural choice: the free product runs entirely on the developer’s machine. Only non-proprietary vulnerability intelligence is pulled from Endor Labs’ servers. “All of your code stays local and is scanned locally. It never gets copied into AURI or Endor Labs or anything else,” Badhwar explained.
The enterprise version adds the features large organizations need: full customization, policy configuration, role-based access control for teams of thousands of developers, and integration across CI/CD pipelines. Enterprise pricing is based on the number of developers and the volume of scans. Deployment options include local scanning, ephemeral cloud containers, and on-premises Kubernetes clusters with full tenant isolation — flexibility Badhwar said is “the most any vendor offers in this space.”
The freemium approach mirrors the playbook that worked for developer tools companies like GitHub and Atlassian: win individual developers first, then expand into their organizations. But it also reflects a practical reality. In a world where AI coding agents are proliferating across every team, Endor Labs needs to be wherever code is being written — not waiting behind a procurement process.
“Over 97% of vulnerabilities flagged by our previous tool weren’t reachable in our application,” said Travis McPeak, Security at Cursor, in a statement sent to VentureBeat. “AURI by Endor Labs shows the few vulnerabilities that are impactful, so we patch quickly, focusing on what matters.”
The application security market is increasingly crowded. Snyk, GitHub Advanced Security, and a growing number of startups all compete for developer attention. Even the AI model providers themselves are entering the fray: Anthropic recently announced a code security product built into Claude, a move that sent ripples through the market.
Badhwar, however, framed Anthropic’s announcement as validation rather than threat. “That’s one of the biggest validations of what we do, because it says code security is one of the hottest problems in the market,” he told VentureBeat. The deeper question, he argued, is whether enterprises want to trust the same tool generating code to also review it.
“Claude is not going to be the only tool you use for agentic coding. Are you going to use a separate security product for Cursor, a separate one for Claude, a separate one for Augment, and another for Gemini Code Assist?” Badhwar said. “Do you want to trust the same tool that’s creating the software to also review it? There’s a reason we’ve always had reviewers who are different from the developers.”
He outlined three principles he believes will define effective security in the agentic era: independence (security review must be separate from the tool that generated the code), reproducibility (findings must be consistent, not probabilistic), and verifiability (every finding must be backed by evidence). It is a direct challenge to purely LLM-based approaches, which Badhwar characterized as “completely non-deterministic tools that you have no control over in terms of having verifiability of findings, consistency.”
AURI’s approach combines LLMs for what they do best — reasoning, explanation, and contextualization — with deterministic tools that provide the consistency enterprises require. Beyond detection, the platform simulates upgrade paths and tells developers which remediation route will work without introducing breaking changes, a step beyond what most competitors offer. Developers can then execute those fixes themselves or route them to AI coding agents with confidence that the changes have been deterministically validated.
Endor Labs has already demonstrated AURI’s capabilities in high-profile scenarios. In February 2026, the company announced that AURI had identified and validated seven security vulnerabilities in OpenClaw, the popular agentic AI assistant, which were later acknowledged by the OpenClaw development team. As reported by Infosecurity Magazine, OpenClaw subsequently patched six of the vulnerabilities, which ranged from high-severity server-side request forgery bugs to path traversal and authentication bypass flaws.
“These are zero days. They’ve never been found, but AURI did an incredible job of finding those,” Badhwar said. The company has also been detecting active malware campaigns in ecosystems like NPM, including tracking campaigns like Shai-Hulud for several months.
The company is well-capitalized to sustain its push. Endor Labs closed an oversubscribed $93 million Series B round in April 2025 led by DFJ Growth, with participation from Salesforce Ventures, Lightspeed Venture Partners, Coatue, Dell Technologies Capital, Section 32, and Citi Ventures. The company reported 30x annual recurring revenue growth and 166% net revenue retention since its Series A just 18 months earlier. Its platform now protects more than 5 million applications and runs over 1 million scans each week for customers including OpenAI, Cursor, Dropbox, Atlassian, Snowflake, and Robinhood.
Several dozen enterprise customers already use Endor Labs to accelerate compliance with frameworks including FedRAMP, NIST standards, and the European Cyber Resilience Act — a growing priority as regulators increasingly treat software supply chain security as a matter of national security.
The broader question hanging over AURI’s launch — and over the application security industry as a whole — is whether security tooling can evolve fast enough to match the pace of AI-driven development. Critics of agentic security warn that the industry is moving too quickly, granting AI agents permissions across critical systems without fully understanding the risks. Badhwar acknowledged the concern but argued that resistance is futile.
“I’ve seen this play out when I was building cloud security products, and people were fearful of moving to AWS,” he said. “There was a perception of control when it was in your data center. Yet, guess what? That was the biggest movement of its time, and we as an industry built the right technology and security tooling and visibility around it to make ourselves comfortable.”
For Badhwar, the most exciting implication of agentic development is not the new risks it creates but the old problems it can finally solve. Security teams have spent decades struggling to get developers to prioritize fixing vulnerabilities over building features. AI agents, he argued, do not have that problem — if you give them the right instructions and the right intelligence, they simply execute.
“Security has always struggled for lack of a developer’s attention,” Badhwar said. “But we think you can get an AI agent that’s writing software’s attention by giving them the right context, integrating into the right workflows, and just having them do the right thing for you, so you don’t take an automation opportunity and make it a human’s problem.”
It is a characteristically optimistic framing from a founder who has built his career at the intersection of tectonic technology shifts and the security gaps they leave behind. Whether AURI can deliver on that vision at the scale the AI coding revolution demands remains to be seen. But in a world where machines are writing code faster than humans can review it, the alternative — hoping the models get security right on their own — is a bet few enterprises can afford to make.
Despite political turmoil in the U.S. AI sector, AI advances in China are continuing apace without a hitch.
Earlier today, the Qwen Team at e-commerce giant Alibaba — the AI research group behind a growing family of powerful and capable open source Qwen language and multimodal models — unveiled its newest batch, the Qwen3.5 Small Model Series, which consists of:
Qwen3.5-0.8B & 2B: Two models, both optimized for "tiny" and "fast" performance, intended for prototyping and deployment on edge devices where battery life is paramount.
Qwen3.5-4B: A strong multimodal base for lightweight agents, natively supporting a 262,144 token context window.
Qwen3.5-9B: A compact reasoning model that outperforms OpenAI's open source gpt-oss-120B — a U.S. rival roughly 13.5 times its size — on key third-party benchmarks, including multilingual knowledge and graduate-level reasoning.
To put this into perspective, these are among the smallest general-purpose models recently shipped by any lab worldwide. They are closer in scale to MIT offshoot LiquidAI's LFM2 series — which also ranges from several hundred million to a few billion parameters — than to the estimated trillion parameters (model settings) reportedly behind the flagship models from OpenAI, Anthropic, and Google's Gemini series.
The weights for the models are available right now globally under Apache 2.0 licenses — perfect for enterprise and commercial use, including customization as needed — on Hugging Face and ModelScope.
The technical foundation of the Qwen3.5 small series is a departure from standard Transformer architectures. Alibaba has moved toward an Efficient Hybrid Architecture that combines Gated Delta Networks (a form of linear attention) with sparse Mixture-of-Experts (MoE).
This hybrid approach addresses the “memory wall” that typically limits small models; by using Gated Delta Networks, the models achieve higher throughput and significantly lower latency during inference.
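To illustrate why linear attention sidesteps the "memory wall," here is a toy NumPy sketch of a gated delta-rule update. It is a heavily simplified illustration of the general technique, not Alibaba's actual parameterization: the state `S`, the gate `alpha`, and the write strength `beta` are all illustrative names. The key property is that the recurrent state stays a fixed size, so per-token cost does not grow with sequence length the way a softmax-attention KV cache does.

```python
import numpy as np

def gated_delta_attention(q, k, v, alpha, beta):
    """Toy gated delta rule: a constant-size state S replaces the
    growing KV cache of softmax attention, so per-token memory and
    compute stay O(d^2) regardless of sequence length."""
    T, d = q.shape
    S = np.zeros((d, d))        # fixed-size recurrent state
    out = np.zeros_like(v)
    for t in range(T):
        kt, vt = k[t], v[t]
        # decay the old state, erase the stale value bound to this key
        # (the delta-rule correction), then write the new association
        S = alpha[t] * (S - beta[t] * np.outer(kt, kt @ S)) \
            + beta[t] * np.outer(kt, vt)
        out[t] = q[t] @ S       # read: query against the state
    return out

rng = np.random.default_rng(0)
T, d = 6, 4
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
alpha = np.full(T, 0.9)   # gate: how much past state survives
beta = np.full(T, 0.5)    # write strength per token
out = gated_delta_attention(q, k, v, alpha, beta)
print(out.shape)  # (6, 4)
```

Note how the loop carries only `S` forward; a softmax-attention layer would instead keep every past key and value in memory.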
Furthermore, these models are natively multimodal. Unlike previous generations that “bolted on” a vision encoder to a text model, Qwen3.5 was trained using early fusion on multimodal tokens. This allows the 4B and 9B models to exhibit a level of visual understanding—such as reading UI elements or counting objects in a video—that previously required models ten times their size.
Newly released benchmark data illustrates just how aggressively these compact models are competing with—and often exceeding—much larger industry standards. The Qwen3.5-9B and Qwen3.5-4B variants demonstrate a cross-generational leap in efficiency, particularly in multimodal and reasoning tasks.
Multimodal dominance: In the MMMU-Pro visual reasoning benchmark, Qwen3.5-9B achieved a score of 70.1, outperforming Gemini 2.5 Flash-Lite (59.7) and even the specialized Qwen3-VL-30B-A3B (63.0).
Graduate-level reasoning: On the GPQA Diamond benchmark, the 9B model reached a score of 81.7, surpassing gpt-oss-120b (80.1), a model with over ten times its parameter count.
Video understanding: The series shows elite performance in video reasoning. On the Video-MME (with subtitles) benchmark, Qwen3.5-9B scored 84.5 and the 4B scored 83.5, significantly leading over Gemini 2.5 Flash-Lite (74.6).
Mathematical prowess: In the HMMT Feb 2025 (Harvard-MIT mathematics tournament) evaluation, the 9B model scored 83.2, while the 4B variant scored 74.0, proving that high-level STEM reasoning no longer requires massive compute clusters.
Document and multilingual knowledge: The 9B variant leads the pack in document recognition on OmniDocBench v1.5 with a score of 87.7. Meanwhile, it maintains a top-tier multilingual presence on MMMLU with a score of 81.2, outperforming gpt-oss-120b (78.2).
Coming on the heels of last week's release of the already compact, powerful open source Qwen3.5-Medium — capable of running on a single GPU — the announcement of the Qwen3.5 Small Model Series, with its even smaller footprint and processing requirements, sparked immediate interest among developers focused on "local-first" AI.
“More intelligence, less compute” resonated with users seeking alternatives to cloud-based models.
AI and tech educator Paul Couvert of Blueshell AI captured the industry’s shock regarding this efficiency leap.
“How is this even possible?!” Couvert wrote on X. “Qwen has released 4 new models and the 4B version is almost as capable as the previous 80B A3B one. And the 9B is as good as GPT OSS 120b while being 13x smaller!”
Couvert’s analysis highlights the practical implications of these architectural gains:
“They can run on any laptop”
“0.8B and 2B for your phone”
“Offline and open source”
As developer Karan Kendre of Kargul Studio put it: “these models [can run] locally on my M1 MacBook Air for free.”
This sentiment of “amazing” accessibility is echoed across the developer ecosystem. One user noted that a 4B model serving as a “strong multimodal base” is a “game changer for mobile devs” who need screen-reading capabilities without high CPU overhead.
Indeed, Hugging Face developer Xenova noted that the new Qwen3.5 Small Model series can even run directly in a user's web browser and perform sophisticated, previously compute-heavy operations like video analysis.
Researchers also praised the release of Base models alongside the Instruct versions, noting that it provides essential support for “real-world industrial innovation.”
The release of Base models is particularly valued by enterprise and research teams because it provides a “blank slate” that hasn’t been biased by a specific set of RLHF (Reinforcement Learning from Human Feedback) or SFT (Supervised Fine-Tuning) data, which can often lead to “refusals” or specific conversational styles that are difficult to undo.
Now, with the Base models, those interested in customizing the model for specific tasks and purposes have an easier starting point: they can apply their own instruction tuning and post-training without first having to strip away Alibaba's.
Alibaba has released the weights and configuration files for the Qwen3.5 series under the Apache 2.0 license. This permissive license allows for commercial use, modification, and distribution without royalty payments, removing the “vendor lock-in” associated with proprietary APIs.
Commercial use: Developers can integrate models into commercial products royalty-free.
Modification: Teams can fine-tune (SFT) or apply RLHF to create specialized versions.
Distribution: Models can be redistributed in local-first AI applications like Ollama.
The release of the Qwen3.5 Small Series arrives at a moment of “Agentic Realignment.” We have moved past simple chatbots; the goal now is autonomy. An autonomous agent must “think” (reason), “see” (multimodality), and “act” (tool use). While doing this with trillion-parameter models is prohibitively expensive, a local Qwen3.5-9B can perform these loops for a fraction of the cost.
By scaling Reinforcement Learning (RL) across million-agent environments, Alibaba has endowed these small models with “human-aligned judgment,” allowing them to handle multi-step objectives like organizing a desktop or reverse-engineering gameplay footage into code. Whether it is a 0.8B model running on a smartphone or a 9B model powering a coding terminal, the Qwen3.5 series is effectively democratizing the “agentic era.”
The Qwen3.5 series' shift from "chatbots" to "native multimodal agents" transforms how enterprises can distribute intelligence. By moving sophisticated reasoning to the "edge" — individual devices and local servers — organizations can automate tasks that previously required expensive cloud APIs or high-latency processing.
The 0.8B to 9B models are re-engineered for efficiency, utilizing a hybrid architecture that activates only the necessary parts of the network for each task.
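The "activate only what's needed" behavior comes from the sparse Mixture-of-Experts design mentioned earlier. As a rough sketch — with illustrative shapes and names, not Qwen's real router — a top-k router sends each token through only a couple of expert sub-networks, so most parameters sit idle on any given token:

```python
import numpy as np

def topk_moe(x, W_gate, experts, k=2):
    """Toy sparse MoE layer: for each token, route to the top-k experts
    by gate score, so only k of n_experts weight matrices are used."""
    logits = x @ W_gate                       # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -k:] # indices of top-k experts
    out = np.zeros_like(x)
    for i, token in enumerate(x):
        chosen = logits[i, top[i]]
        weights = np.exp(chosen - chosen.max())
        weights /= weights.sum()              # softmax over selected experts only
        for w, e in zip(weights, top[i]):
            out[i] += w * (token @ experts[e])
    return out

rng = np.random.default_rng(1)
d, n_experts, tokens = 8, 4, 3
x = rng.standard_normal((tokens, d))
W_gate = rng.standard_normal((d, n_experts))       # router weights
experts = rng.standard_normal((n_experts, d, d))   # one d x d matrix per expert
y = topk_moe(x, W_gate, experts)
print(y.shape)  # (3, 8)
```

With `k=2` of 4 experts here, each token touches half the expert parameters; production MoE models typically activate a far smaller fraction.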
Visual Workflow Automation: Using “pixel-level grounding,” these models can navigate desktop or mobile UIs, fill out forms, and organize files based on natural language instructions.
Complex Document Parsing: With scores exceeding 90% on document understanding benchmarks, they can replace separate OCR and layout parsing pipelines to extract structured data from diverse forms and charts.
Autonomous Coding & Refactoring: Enterprises can feed entire repositories (up to 400,000 lines of code) into the 1M context window for production-ready refactors or automated debugging.
Real-Time Edge Analysis: The 0.8B and 2B models are designed for mobile devices, enabling offline video summarization (up to 60 seconds at 8 FPS) and spatial reasoning without taxing battery life.
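The capacity claims above can be sanity-checked with back-of-the-envelope arithmetic. The tokens-per-line figure below is simply what the article's own numbers imply; real tokenizer counts vary widely by programming language:

```python
# Repository claim: 400,000 lines of code into a 1M-token context window.
CONTEXT_WINDOW = 1_000_000
LINES_OF_CODE = 400_000
tokens_per_line = CONTEXT_WINDOW / LINES_OF_CODE
print(f"Implied budget: {tokens_per_line:.1f} tokens per line")  # 2.5

# Edge-video claim: 60 seconds of footage sampled at 8 frames per second.
frames = 60 * 8
print(f"Frames to process offline: {frames}")  # 480
```

An implied budget of 2.5 tokens per line is tight for most codebases, so real-world repositories may need to be well under the 400,000-line figure to fit comfortably.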
The table below outlines which enterprise functions stand to gain the most from local, small-model deployment.
| Function | Primary Benefit | Key Use Case |
| --- | --- | --- |
| Software Engineering | Local Code Intelligence | Repository-wide refactoring and terminal-based agentic coding. |
| Operations & IT | Secure Automation | Automating multi-step system settings and file management tasks locally. |
| Product & UX | Edge Interaction | Integrating native multimodal reasoning directly into mobile/desktop apps. |
| Data & Analytics | Efficient Extraction | High-fidelity OCR and structured data extraction from complex visual reports. |
While these models are highly capable, their small scale and “agentic” nature introduce specific operational “flags” that teams must monitor.
The Hallucination Cascade: In multi-step “agentic” workflows, a small error in an early step can lead to a “cascade” of failures where the agent pursues an incorrect or nonsensical plan.
Debugging vs. Greenfield Coding: While these models excel at writing new “greenfield” code, they can struggle with debugging or modifying existing, complex legacy systems.
Memory and VRAM Demands: Even “small” models (like the 9B) require significant VRAM for high-throughput inference; the “memory footprint” remains high because the total parameter count still occupies GPU space.
Regulatory & Data Residency: Using models from a China-based provider may raise data residency questions in certain jurisdictions, though the Apache 2.0 open-weight version allows for hosting on “sovereign” local clouds.
Enterprises should prioritize “verifiable” tasks—such as coding, math, or instruction following—where the output can be automatically checked against predefined rules to prevent “reward hacking” or silent failures.
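In practice, a "verifiable task" gate can be as simple as running agent-written code against predefined deterministic checks before accepting it. The sketch below is an illustration of that pattern only — the `solution` entry-point convention and function names are hypothetical, not any vendor's API:

```python
def verify_agent_output(candidate_src, tests):
    """Toy verifiable-task gate: accept agent-written code only if it
    passes every predefined deterministic check. 'solution' is an
    illustrative agreed-upon entry-point name."""
    scope = {}
    try:
        exec(candidate_src, scope)   # load the agent's candidate code
        fn = scope["solution"]       # look up the expected entry point
        return all(fn(*args) == expected for args, expected in tests)
    except Exception:
        return False                 # any error counts as a failed check

agent_code = "def solution(a, b):\n    return a + b\n"
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
ok = verify_agent_output(agent_code, tests)
print(ok)  # True

bad = verify_agent_output("def solution(a, b):\n    return a - b\n", tests)
print(bad)  # False
```

Because the checks are rule-based rather than model-judged, a silent failure or reward-hacked output is rejected the same way an honest bug is.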