Theorem wants to stop AI-written bugs before they ship — and just raised $6M to do it

As artificial intelligence reshapes software development, a small startup is betting that the industry’s next big bottleneck won’t be writing code — it will be trusting it.

Theorem, a San Francisco-based company that emerged from Y Combinator’s Spring 2025 batch, announced Tuesday it has raised $6 million in seed funding to build automated tools that verify the correctness of AI-generated software. Khosla Ventures led the round, with participation from Y Combinator, e14, SAIF, Halcyon, and angel investors including Blake Borgesson, co-founder of Recursion Pharmaceuticals, and Arthur Breitman, co-founder of blockchain platform Tezos.

The investment arrives at a pivotal moment. AI coding assistants from companies like GitHub, Amazon, and Google now generate billions of lines of code annually. Enterprise adoption is accelerating. But the ability to verify that AI-written software actually works as intended has not kept pace — creating what Theorem’s founders describe as a widening “oversight gap” that threatens critical infrastructure from financial systems to power grids.

“We’re already there,” said Jason Gross, Theorem’s co-founder, when we asked whether AI-generated code is outpacing human review capacity. “If you asked me to review 60,000 lines of code, I wouldn’t know how to do it.”

Why AI is writing code faster than humans can verify it

Theorem’s core technology combines formal verification — a mathematical technique that proves software behaves exactly as specified — with AI models trained to generate and check proofs automatically. The approach transforms a process that historically required years of PhD-level engineering into something the company claims can be completed in weeks or even days.

Formal verification has existed for decades but remained confined to the most mission-critical applications: avionics systems, nuclear reactor controls, and cryptographic protocols. The technique’s prohibitive cost — often requiring eight lines of mathematical proof for every single line of code — made it impractical for mainstream software development.

Gross knows this firsthand. Before founding Theorem, he earned his PhD at MIT working on verified cryptography code that now powers the HTTPS security protocol protecting trillions of internet connections daily. That project, by his estimate, consumed fifteen person-years of labor.

“Nobody prefers to have incorrect code,” Gross said. “Software verification has just not been economical before. Proofs used to be written by PhD-level engineers. Now, AI writes all of it.”

How formal verification catches the bugs that traditional testing misses

Theorem’s system operates on a principle Gross calls “fractional proof decomposition.” Rather than exhaustively testing every possible behavior — computationally infeasible for complex software — the technology allocates verification resources proportionally to the importance of each code component.
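The article does not detail how "fractional proof decomposition" works internally (the technique is proprietary), but the core idea it describes, spending verification effort in proportion to a component's importance rather than uniformly, can be sketched in a few lines. The component names and weights below are purely illustrative assumptions.

```python
# Hypothetical sketch of proportional verification-budget allocation,
# in the spirit of "fractional proof decomposition" as described above.
# This shows only the budgeting idea, not Theorem's actual system.

def allocate_budget(components: dict[str, float], total_hours: float) -> dict[str, float]:
    """Split a fixed verification budget across components in
    proportion to their assigned importance weights."""
    total_weight = sum(components.values())
    return {name: total_hours * weight / total_weight
            for name, weight in components.items()}

# e.g. a parser core matters far more than logging glue:
plan = allocate_budget(
    {"parser_core": 8.0, "io_layer": 3.0, "logging": 1.0},
    total_hours=120.0,
)
# parser_core receives 80 of the 120 hours; logging only 10.
```

The point of the proportional split is that exhaustive verification of everything is infeasible, so the scarce resource (prover compute or engineer time) flows to the code whose failure would matter most.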

The approach recently identified a bug that slipped past testing at Anthropic, the AI safety company behind the Claude chatbot. Gross said the technique helps developers “catch their bugs now without expending a lot of compute.”

In a recent technical demonstration called SFBench, Theorem used AI to translate 1,276 problems from Rocq (a formal proof assistant) to Lean (another verification language), then automatically proved each translation equivalent to the original. The company estimates a human team would have required approximately 2.7 person-years to complete the same work.

“Everyone can run agents in parallel, but we are also able to run them sequentially,” Gross explained, noting that Theorem’s architecture handles interdependent code — where solutions build on each other across dozens of files — that trips up conventional AI coding agents limited by context windows.

How one company turned a 1,500-page specification into 16,000 lines of trusted code

The startup is already working with customers in AI research labs, electronic design automation, and GPU-accelerated computing. One case study illustrates the technology’s practical value.

A customer came to Theorem with a 1,500-page PDF specification and a legacy software implementation plagued by memory leaks, crashes, and other elusive bugs. Their most urgent problem: improving performance from 10 megabits per second to 1 gigabit per second — a 100-fold increase — without introducing additional errors.

Theorem’s system generated 16,000 lines of production code, which the customer deployed without ever manually reviewing it. The confidence came from a compact executable specification — a few hundred lines that generalized the massive PDF document — paired with an equivalence-checking harness that verified the new implementation matched the intended behavior.
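The article describes the pattern but not the code, so the following is a minimal toy illustration of an executable specification paired with an equivalence-checking harness. `spec_parse`, `fast_parse`, and the 16-bit-integer format are hypothetical stand-ins; a production system like Theorem's would prove equivalence for all inputs, where this sketch merely samples them.

```python
import random

# Toy analogue of the pattern above: a small executable spec
# (`spec_parse`) plus an equivalence harness checking a candidate
# optimized implementation (`fast_parse`) against it.

def spec_parse(data: bytes) -> list[int]:
    """Reference semantics: decode a stream of unsigned 16-bit
    big-endian integers, ignoring any trailing odd byte."""
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data) - len(data) % 2, 2)]

def fast_parse(data: bytes) -> list[int]:
    """Candidate 'optimized' implementation to be checked."""
    out = []
    for i in range(0, len(data) // 2 * 2, 2):
        out.append((data[i] << 8) | data[i + 1])
    return out

def check_equivalence(trials: int = 10_000, max_len: int = 64) -> bool:
    """Randomized equivalence harness: on every sampled input the
    implementation must agree exactly with the specification."""
    rng = random.Random(0)
    for _ in range(trials):
        data = bytes(rng.randrange(256) for _ in range(rng.randrange(max_len)))
        if fast_parse(data) != spec_parse(data):
            return False
    return True
```

The design choice mirrors the case study: trust flows from the short, auditable spec, so the large generated implementation never needs line-by-line human review as long as the harness (or a formal proof) ties it back to that spec.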

“Now they have a production-grade parser operating at 1 Gbps that they can deploy with the confidence that no information is lost during parsing,” Gross said.

The security risks lurking in AI-generated software for critical infrastructure

The funding announcement arrives as policymakers and technologists increasingly scrutinize the reliability of AI systems embedded in critical infrastructure. Software already controls financial markets, medical devices, transportation networks, and electrical grids. AI is accelerating how quickly that software evolves — and how easily subtle bugs can propagate.

Gross frames the challenge in security terms. As AI makes it cheaper to find and exploit vulnerabilities, defenders need what he calls “asymmetric defense” — protection that scales without proportional increases in resources.

“Software security is a delicate offense-defense balance,” he said. “With AI hacking, the cost of hacking a system is falling sharply. The only viable solution is asymmetric defense. If we want a software security solution that can last for more than a few generations of model improvements, it will be via verification.”

Asked whether regulators should mandate formal verification for AI-generated code in critical systems, Gross offered a pointed response: “Now that formal verification is cheap enough, it might be considered gross negligence to not use it for guarantees about critical systems.”

What separates Theorem from other AI code verification startups

Theorem enters a market where numerous startups and research labs are exploring the intersection of AI and formal verification. The company’s differentiation, Gross argues, lies in its singular focus on scaling software oversight rather than applying verification to mathematics or other domains.

“Our tools are useful for systems engineering teams, working close to the metal, who need correctness guarantees before merging changes,” he said.

The founding team reflects that technical orientation. Gross brings deep expertise in programming language theory and a track record of deploying verified code into production at scale. Co-founder Rajashree Agrawal, a machine learning research engineer, focuses on training the AI models that power the verification pipeline.

“We’re working on formal program reasoning so that everyone can oversee not just the work of an average software-engineer-level AI, but really harness the capabilities of a Linus Torvalds-level AI,” Agrawal said, referencing the legendary creator of Linux.

The race to verify AI code before it controls everything

Theorem plans to use the funding to expand its team, increase compute resources for training verification models, and push into new industries including robotics, renewable energy, cryptocurrency, and drug synthesis. The company currently employs four people.

The startup’s emergence signals a shift in how enterprise technology leaders may need to evaluate AI coding tools. The first wave of AI-assisted development promised productivity gains — more code, faster. Theorem is wagering that the next wave will demand something different: mathematical proof that speed doesn’t come at the cost of safety.

Gross frames the stakes in stark terms. AI systems are improving exponentially. If that trajectory holds, he believes superhuman software engineering is inevitable — capable of designing systems more complex than anything humans have ever built.

“And without a radically different economics of oversight,” he said, “we will end up deploying systems we don’t control.”

The machines are writing the code. Now someone has to check their work.


Qwen3-Max Thinking beats Gemini 3 Pro and GPT-5.2 on Humanity’s Last Exam (with search)

Chinese AI and tech firms continue to impress with their development of cutting-edge, state-of-the-art AI language models.

Today, the one drawing eyeballs is Alibaba Cloud’s Qwen Team of AI researchers and its unveiling of a new proprietary language reasoning model, Qwen3-Max-Thinking.

You may recall, as VentureBeat covered last year, that Qwen has made a name for itself in the fast-moving global AI marketplace by shipping a variety of powerful, open source models in various modalities, from text to image to spoken audio. The company even earned an endorsement from U.S. lodging giant Airbnb, whose CEO and co-founder Brian Chesky said the company was relying on Qwen’s free, open source models as a more affordable alternative to U.S. offerings like those of OpenAI.

Now, with the proprietary Qwen3-Max-Thinking, the Qwen Team is aiming to match and, in some cases, outpace the reasoning capabilities of GPT-5.2 and Gemini 3 Pro through architectural efficiency and agentic autonomy.

The release comes at a critical juncture. Western labs have largely defined the “reasoning” category (often dubbed “System 2” logic), but Qwen’s latest benchmarks suggest the gap has closed.

In addition, the company’s relatively affordable API pricing strategy aggressively targets enterprise adoption. However, as it is a Chinese model, some U.S. firms with strict national security requirements and considerations may be wary of adopting it.

The Architecture: “Test-Time Scaling” Redefined

The core innovation driving Qwen3-Max-Thinking is a departure from standard inference methods. While most models generate tokens linearly, Qwen3 utilizes a “heavy mode” driven by a technique known as “Test-time scaling.”

In simple terms, this technique allows the model to trade compute for intelligence. But unlike naive “best-of-N” sampling—where a model might generate 100 answers and pick the best one — Qwen3-Max-Thinking employs an experience-cumulative, multi-round strategy.

This approach mimics human problem-solving. When the model encounters a complex query, it doesn’t just guess; it engages in iterative self-reflection. It uses a proprietary “take-experience” mechanism to distill insights from previous reasoning steps. This allows the model to:

  1. Identify Dead Ends: Recognize when a line of reasoning is failing without needing to fully traverse it.

  2. Focus Compute: Redirect processing power toward “unresolved uncertainties” rather than re-deriving known conclusions.
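The article describes this mechanism only at a high level; the control-flow difference between best-of-N sampling and an experience-cumulative loop can be sketched with hypothetical stand-in functions (`propose` and the accumulated "insights" are illustrative assumptions, not Qwen's actual implementation).

```python
import random

# Toy contrast between naive best-of-N sampling and a multi-round,
# experience-cumulative loop as described above. `propose` is a
# hypothetical stand-in whose quality improves with accumulated
# insights; the real "take-experience" mechanism is proprietary.

def propose(query: str, insights: list[str], rng: random.Random) -> float:
    # Simulate answer quality rising as accumulated insights grow.
    return rng.random() + 0.1 * len(insights)

def best_of_n(query: str, n: int, rng: random.Random) -> float:
    """Generate n independent answers from scratch; keep the best."""
    return max(propose(query, [], rng) for _ in range(n))

def multi_round(query: str, rounds: int, rng: random.Random) -> float:
    """Each round distills a lesson from the previous attempt and
    feeds it into the next, instead of restarting cold."""
    insights: list[str] = []
    best = 0.0
    for _ in range(rounds):
        best = max(best, propose(query, insights, rng))
        insights.append(f"lesson-{len(insights)}")  # accumulated experience
    return best
```

Under the same sampling budget, the cumulative loop wins in this toy model precisely because later rounds start from earlier rounds' conclusions rather than re-deriving them, which is the efficiency claim the Qwen team makes.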

The efficiency gains are tangible. By avoiding redundant reasoning, the model integrates richer historical context into the same window. The Qwen team reports that this method drove massive performance jumps without exploding token costs:

  • GPQA (PhD-level science): Scores improved from 90.3 to 92.8.

  • LiveCodeBench v6: Performance jumped from 88.0 to 91.4.

Beyond Pure Thought: Adaptive Tooling

While “thinking” models are powerful, they have historically been siloed — great at math, but poor at browsing the web or running code. Qwen3-Max-Thinking bridges this gap by effectively integrating “thinking and non-thinking modes”.

The model features adaptive tool-use capabilities, meaning it autonomously selects the right tool for the job without manual user prompting. It can seamlessly toggle between:

  • Web Search & Extraction: For real-time factual queries.

  • Memory: To store and recall user-specific context.

  • Code Interpreter: To write and execute Python snippets for computational tasks.

In “Thinking Mode,” the model supports these tools simultaneously. This capability is critical for enterprise applications where a model might need to verify a fact (Search), calculate a projection (Code Interpreter), and then reason about the strategic implication (Thinking) all in one turn.

Empirically, the team notes that this combination “effectively mitigates hallucinations,” as the model can ground its reasoning in verifiable external data rather than relying solely on its training weights.

Benchmark Analysis: The Data Story

Qwen is not shy about direct comparisons.

On HMMT Feb 25, a rigorous reasoning benchmark, Qwen3-Max-Thinking scored 98.0, edging out Gemini 3 Pro (97.5) and significantly leading DeepSeek V3.2 (92.5).

However, the most significant signal for developers is arguably Agentic Search. On “Humanity’s Last Exam” (HLE) — the benchmark that measures performance on 3,000 “Google-proof” graduate-level questions across math, science, computer science, humanities and engineering — Qwen3-Max-Thinking, equipped with web search tools, scored 49.8, beating both Gemini 3 Pro (45.8) and GPT-5.2-Thinking (45.5).

This suggests that Qwen3-Max-Thinking’s architecture is uniquely suited for complex, multi-step agentic workflows where external data retrieval is necessary.

In coding tasks, the model also shines. On Arena-Hard v2, it posted a score of 90.2, leaving competitors like Claude-Opus-4.5 (76.7) far behind.

The Economics of Reasoning: Pricing Breakdown

For the first time, we have a clear look at the economics of Qwen’s top-tier reasoning model. Alibaba Cloud has positioned qwen3-max-2026-01-23 as a premium but accessible offering on its API.

  • Input: $1.20 per 1 million tokens (for standard contexts ≤ 32K).

  • Output: $6.00 per 1 million tokens.

On a base level, here’s how Qwen3-Max-Thinking stacks up:

| Model | Input (/1M) | Output (/1M) | Total Cost | Source |
| --- | --- | --- | --- | --- |
| Qwen 3 Turbo | $0.05 | $0.20 | $0.25 | Alibaba Cloud |
| Grok 4.1 Fast (reasoning) | $0.20 | $0.50 | $0.70 | xAI |
| Grok 4.1 Fast (non-reasoning) | $0.20 | $0.50 | $0.70 | xAI |
| deepseek-chat (V3.2-Exp) | $0.28 | $0.42 | $0.70 | DeepSeek |
| deepseek-reasoner (V3.2-Exp) | $0.28 | $0.42 | $0.70 | DeepSeek |
| Qwen 3 Plus | $0.40 | $1.20 | $1.60 | Alibaba Cloud |
| ERNIE 5.0 | $0.85 | $3.40 | $4.25 | Qianfan |
| Gemini 3 Flash Preview | $0.50 | $3.00 | $3.50 | Google |
| Claude Haiku 4.5 | $1.00 | $5.00 | $6.00 | Anthropic |
| Qwen3-Max Thinking (2026-01-23) | $1.20 | $6.00 | $7.20 | Alibaba Cloud |
| Gemini 3 Pro (≤200K) | $2.00 | $12.00 | $14.00 | Google |
| GPT-5.2 | $1.75 | $14.00 | $15.75 | OpenAI |
| Claude Sonnet 4.5 | $3.00 | $15.00 | $18.00 | Anthropic |
| Gemini 3 Pro (>200K) | $4.00 | $18.00 | $22.00 | Google |
| Claude Opus 4.5 | $5.00 | $25.00 | $30.00 | Anthropic |
| GPT-5.2 Pro | $21.00 | $168.00 | $189.00 | OpenAI |

This pricing structure is aggressive, undercutting many legacy flagship models while offering state-of-the-art performance.

However, developers should note the granular pricing for the new agentic capabilities, as Qwen separates the cost of “thinking” (tokens) from the cost of “doing” (tool use).

  • Agent Search Strategy: Both standard search_strategy:agent and the more advanced search_strategy:agent_max are priced at $10 per 1,000 calls.

    • Note: The agent_max strategy is currently marked as a “Limited Time Offer,” suggesting its price may rise later.

  • Web Search: Priced at $10 per 1,000 calls via the Responses API.

Promotional Free Tier: To encourage adoption of its most advanced features, Alibaba Cloud is currently offering two key tools for free for a limited time:

  • Web Extractor: Free (Limited Time).

  • Code Interpreter: Free (Limited Time).

This pricing model (low token cost + à la carte tool pricing) allows developers to build complex agents that are cost-effective for text processing, while paying a premium only when external actions—like a live web search—are explicitly triggered.
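Using the rates listed above ($1.20/M input tokens, $6.00/M output tokens, $10 per 1,000 agent-search calls), a back-of-the-envelope cost per request is easy to compute. The token counts in the example are illustrative assumptions, not measurements.

```python
# Per-request cost at the published Qwen3-Max-Thinking rates:
# input $1.20/M tokens, output $6.00/M tokens, search $10 per 1,000 calls.

INPUT_PER_M = 1.20
OUTPUT_PER_M = 6.00
SEARCH_PER_CALL = 10.0 / 1000  # $0.01 per call

def request_cost(input_tokens: int, output_tokens: int, search_calls: int = 0) -> float:
    """Dollar cost of one request: token charges plus à la carte tool calls."""
    return (input_tokens / 1e6 * INPUT_PER_M
            + output_tokens / 1e6 * OUTPUT_PER_M
            + search_calls * SEARCH_PER_CALL)

# e.g. 8K tokens in, 2K "thinking" tokens out, 3 live searches:
cost = request_cost(8_000, 2_000, 3)  # 0.0096 + 0.012 + 0.03 = $0.0516
```

Note how the arithmetic bears out the à la carte framing: in this hypothetical request, the three search calls cost more than all of the tokens combined, so agents that rarely reach for external tools stay cheap.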

Developer Ecosystem

Recognizing that performance is useless without integration, Alibaba Cloud has ensured Qwen3-Max-Thinking is drop-in ready.

  • OpenAI Compatibility: The API supports the standard OpenAI format, allowing teams to switch models by simply changing the base_url and model name.

  • Anthropic Compatibility: In a savvy move to capture the coding market, the API also supports the Anthropic protocol. This makes Qwen3-Max-Thinking compatible with Claude Code, a popular agentic coding environment.
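Per the compatibility claim above, the only integration change is the endpoint URL and model name. A minimal sketch of that switch follows; the `base_url` shown is an assumption based on Alibaba Cloud's commonly documented OpenAI-compatible endpoint, so verify it (and the model identifier) against current Alibaba Cloud docs before use.

```python
# The drop-in switch described above: same OpenAI client, different
# base_url and model name. The endpoint URL is an assumption; the
# model identifier comes from the pricing section of this article.

def qwen_client_config(api_key: str) -> dict:
    """Keyword arguments you would pass to openai.OpenAI(...) to target
    Qwen's OpenAI-compatible endpoint instead of api.openai.com."""
    return {
        "api_key": api_key,
        "base_url": "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    }

MODEL = "qwen3-max-2026-01-23"  # identifier from the pricing section

# Usage (requires the `openai` package and a valid key):
#   from openai import OpenAI
#   client = OpenAI(**qwen_client_config("YOUR_DASHSCOPE_API_KEY"))
#   resp = client.chat.completions.create(
#       model=MODEL,
#       messages=[{"role": "user", "content": "Hello"}],
#   )
```

Everything else in an existing OpenAI-based codebase, message format, streaming, tool definitions, stays untouched, which is exactly what makes the compatibility play attractive for migration testing.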

The Verdict

Qwen3-Max-Thinking represents a maturation of the AI market in 2026. It moves the conversation beyond “who has the smartest chatbot” to “who has the most capable agent.”

By combining high-efficiency reasoning with adaptive, autonomous tool use—and pricing it to move—Qwen has firmly established itself as a top-tier contender for the enterprise AI throne.

For developers and enterprises, the “Limited Time Free” windows on Code Interpreter and Web Extractor suggest now is the time to experiment. The reasoning wars are far from over, but Qwen has just deployed a very heavy hitter.
