The AI updates aren’t slowing down. Just two days after OpenAI launched a new underlying AI model for ChatGPT called GPT-5.3 Instant, the company has unveiled another, even larger upgrade: GPT-5.4.
Actually, GPT-5.4 comes in two varieties: GPT-5.4 Thinking and GPT-5.4 Pro, the latter designed for the most complex tasks.
Both will be available in OpenAI’s paid application programming interface (API) and its Codex software development application. GPT-5.4 Thinking will be available to all paid ChatGPT subscribers (Plus, the $20-per-month plan, and up), while GPT-5.4 Pro will be reserved for ChatGPT Pro ($200 monthly) and Enterprise plan users.
ChatGPT Free users will also get a taste of GPT-5.4, but only when their queries are auto-routed to the model, according to an OpenAI spokesperson.
The big headlines of this release are efficiency and autonomy. OpenAI reports that GPT-5.4 uses far fewer tokens than its predecessors (47% fewer on some tasks). Arguably even more impressive is a new “native” Computer Use mode, available through the API and Codex, that lets GPT-5.4 navigate a user’s computer like a human and work across applications.
The company is also releasing a new suite of ChatGPT integrations that plug GPT-5.4 directly into users’ Microsoft Excel and Google Sheets spreadsheets and cells, enabling granular analysis and automated task completion that should speed up work across the enterprise. It may also sharpen fears of white-collar layoffs, coming on the heels of similar offerings from Anthropic’s Claude and its new Cowork application.
OpenAI says GPT-5.4 supports up to 1 million tokens of context in the API and Codex, enabling agents to plan, execute, and verify tasks across long horizons. However, requests whose input exceeds 272,000 tokens are billed at double the per-million-token rate.
The most consequential capability OpenAI highlights is that GPT-5.4 is its first general-purpose model released with native, state-of-the-art computer-use capabilities in Codex and the API, enabling agents to operate computers and carry out multi-step workflows across applications.
OpenAI says the model can both write code to operate computers via libraries like Playwright and issue mouse and keyboard commands in response to screenshots. OpenAI also claims a jump in agentic web browsing.
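The screenshot-driven mode can be pictured as a simple observe-act loop: the model receives a view of the screen and replies with one mouse or keyboard command at a time until the task is done. The sketch below is purely illustrative, with a scripted stand-in for both the model and the environment rather than any real OpenAI API:

```python
def fake_model(screen: str) -> dict:
    """Stand-in for the model: maps the current screen to one action.

    A real agent would receive an actual screenshot; here the 'screenshot'
    is just a string describing the screen state.
    """
    if screen == "login form":
        return {"type": "type", "target": "username", "text": "demo"}
    if screen == "username filled":
        return {"type": "click", "x": 320, "y": 410}  # e.g. a Submit button
    return {"type": "done"}

def run_agent(screen: str, max_steps: int = 10) -> list:
    """Loop: observe the screen, ask the model for an action, apply it, repeat."""
    # Scripted environment: applying an action advances to the next state.
    transitions = {"login form": "username filled",
                   "username filled": "submitted"}
    actions = []
    for _ in range(max_steps):
        action = fake_model(screen)
        actions.append(action)
        if action["type"] == "done":
            break
        # A real harness would execute the action and re-capture the screen.
        screen = transitions.get(screen, screen)
    return actions

print(run_agent("login form"))  # type, then click, then done
```

The same loop structure applies whether the model emits raw coordinates from screenshots or writes Playwright code, as the article describes; only the action vocabulary changes.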
Benchmark results are presented as evidence that this is not merely a UI wrapper.
On BrowseComp, which measures how well AI agents can persistently browse the web to find hard-to-locate information, OpenAI reports GPT-5.4 improving by 17 percentage points over GPT-5.2, and GPT-5.4 Pro reaching 89.3%, described as a new state of the art.
On OSWorld-Verified, which measures desktop navigation using screenshots plus keyboard and mouse actions, OpenAI reports GPT-5.4 at 75.0% success, compared to 47.3% for GPT-5.2, and notes reported human performance at 72.4%.
On WebArena-Verified, GPT-5.4 reaches 67.3% success using both DOM- and screenshot-driven interaction, compared to 65.4% for GPT-5.2. On Online-Mind2Web, OpenAI reports 92.8% success using screenshot-based observations alone.
OpenAI also links computer use to improvements in vision and document handling. On MMMU-Pro, GPT-5.4 reaches 81.2% success without tool use, compared with 79.5% for GPT-5.2, and OpenAI says it achieves that result using a fraction of the “thinking tokens.”
On OmniDocBench, GPT-5.4’s average error is reported at 0.109, improved from 0.140 for GPT-5.2. The post also describes expanded support for high-fidelity image inputs, including an “original” detail level up to 10.24M pixels.
OpenAI positions GPT-5.4 as built for longer, multi-step workflows—work that increasingly looks like an agent keeping state across many actions rather than a chatbot responding once.
As tool ecosystems get larger, OpenAI argues that the naive approach—dumping every tool definition into the prompt—creates a tax paid on every request: cost, latency, and context pollution.
GPT-5.4 introduces tool search in the API as a structural fix. Instead of receiving all tool definitions upfront, the model receives a lightweight list of tools plus a search capability, and it retrieves full tool definitions only when they’re actually needed.
OpenAI describes the efficiency win with a concrete comparison: on 250 tasks from Scale’s MCP Atlas benchmark, running with 36 MCP servers enabled, the tool-search configuration reduced total token usage by 47% while achieving the same accuracy as a configuration that exposed all MCP functions directly in context.
That 47% figure is specifically about the tool-search setup in that evaluation—not a blanket claim that GPT-5.4 uses 47% fewer tokens for every kind of task.
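As a rough illustration of the pattern (this is not OpenAI’s actual API; all names and structures below are hypothetical), tool search replaces a full list of tool definitions in context with cheap stubs plus on-demand retrieval:

```python
# Hypothetical registry of tool definitions. In the naive approach, every
# entry here would be serialized into the prompt on every request.
TOOL_REGISTRY = {
    "get_weather": {"description": "Fetch current weather for a city.",
                    "parameters": {"city": "string"}},
    "query_database": {"description": "Run a read-only SQL query.",
                       "parameters": {"sql": "string"}},
    # ...potentially hundreds more definitions that never hit the context
}

def list_tool_stubs() -> list:
    """Lightweight list sent with every request: names only, no schemas."""
    return sorted(TOOL_REGISTRY)

def search_tools(query: str) -> dict:
    """Retrieve full definitions only for tools matching the query."""
    q = query.lower()
    return {name: spec for name, spec in TOOL_REGISTRY.items()
            if q in name or q in spec["description"].lower()}

print(list_tool_stubs())        # the cheap stub list
print(search_tools("weather"))  # full schema fetched only when needed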
OpenAI’s coding pitch is that GPT-5.4 combines the coding strengths of GPT-5.3-Codex with stronger tool and computer-use capabilities that matter when tasks aren’t single-shot.
GPT-5.4 matches or outperforms GPT-5.3-Codex on SWE-Bench Pro while being lower latency across reasoning efforts.
Codex also gets workflow-level knobs. OpenAI says /fast mode delivers up to 1.5× faster performance across supported models, including GPT-5.4, describing it as the same model and intelligence “just faster.”
And it describes releasing an experimental Codex skill, “Playwright (Interactive)”, meant to demonstrate how coding and computer use can work in tandem—visually debugging web and Electron apps and testing an app as it’s being built.
Alongside GPT-5.4, OpenAI is announcing a suite of secure AI products in ChatGPT built for enterprises and financial institutions, powered by GPT-5.4 for advanced financial reasoning and Excel-based modeling.
The centerpiece is ChatGPT for Excel and Google Sheets (beta), which OpenAI describes as ChatGPT embedded directly in spreadsheets to build, analyze, and update complex financial models using the formulas and structures teams already rely on.
The suite also includes new ChatGPT app integrations intended to unify market, company, and internal data into a single workflow, naming FactSet, MSCI, Third Bridge, and Moody’s.
And it introduces reusable “Skills” for recurring finance work such as earnings previews, comparables analysis, DCF analysis, and investment memo drafting.
OpenAI anchors the finance push with an internal benchmark claim: model performance increased from 43.7% with GPT-5 to 88.0% with GPT-5.4 Thinking on an OpenAI internal investment banking benchmark.
OpenAI leans on benchmarks intended to resemble real office deliverables, not just puzzle-solving. On GDPval, an evaluation spanning “well-specified knowledge work” across 44 occupations, OpenAI reports that GPT-5.4 matches or exceeds industry professionals in 83.0% of comparisons, compared to 71.0% for GPT-5.2.
The company also highlights specific improvements in the kinds of artifacts that tend to expose model weaknesses: structured tables, formulas, narrative coherence, and design quality.
In an internal benchmark of spreadsheet modeling tasks modeled after what a junior investment banking analyst might do, GPT-5.4 reaches a mean score of 87.5%, compared to 68.4% for GPT-5.2.
And on a set of presentation evaluation prompts, OpenAI says human raters preferred GPT-5.4’s presentations 68.0% of the time over GPT-5.2’s, citing stronger aesthetics, greater visual variety, and more effective use of image generation.
OpenAI describes GPT-5.4 as its most factual model yet and connects that claim to a practical dataset: de-identified prompts where users previously flagged factual errors. On that set, OpenAI reports GPT-5.4’s individual claims are 33% less likely to be false and its full responses are 18% less likely to contain any errors compared to GPT-5.2.
In statements provided to VentureBeat by OpenAI and attributed to early GPT-5.4 testers, Daniel Swiecki of Walleye Capital says that on internal finance and Excel evaluations, GPT-5.4 improved accuracy by 30 percentage points, which he links to expanded automation for model updates and scenario analysis.
Brendan Foody, CEO of Mercor, calls GPT-5.4 the best model the company has tried and says it’s now top of Mercor’s APEX-Agents benchmark for professional services work, emphasizing long-horizon deliverables like slide decks, financial models, and legal analysis.
In the API, OpenAI says GPT-5.4 Thinking is available as gpt-5.4 and GPT-5.4 Pro as gpt-5.4-pro. Pricing is as follows:
GPT-5.4: $2.50 / 1M input tokens; $15 / 1M output tokens
GPT-5.4 Pro: $30 / 1M input tokens; $180 / 1M output tokens
Batch + Flex: half-rate; Priority processing: 2× rate
This makes GPT-5.4 among the more expensive models in the field to run over the API, as seen in the table below.
| Model | Input ($/1M tokens) | Output ($/1M tokens) | Total Cost |
|---|---|---|---|
| Qwen 3 Turbo | $0.05 | $0.20 | $0.25 |
| Qwen3.5-Flash | $0.10 | $0.40 | $0.50 |
| deepseek-chat (V3.2-Exp) | $0.28 | $0.42 | $0.70 |
| deepseek-reasoner (V3.2-Exp) | $0.28 | $0.42 | $0.70 |
| Grok 4.1 Fast (reasoning) | $0.20 | $0.50 | $0.70 |
| Grok 4.1 Fast (non-reasoning) | $0.20 | $0.50 | $0.70 |
| MiniMax M2.5 | $0.15 | $1.20 | $1.35 |
| Gemini 3.1 Flash-Lite | $0.25 | $1.50 | $1.75 |
| MiniMax M2.5-Lightning | $0.30 | $2.40 | $2.70 |
| Gemini 3 Flash Preview | $0.50 | $3.00 | $3.50 |
| Kimi-k2.5 | $0.60 | $3.00 | $3.60 |
| GLM-5 | $1.00 | $3.20 | $4.20 |
| ERNIE 5.0 | $0.85 | $3.40 | $4.25 |
| Claude Haiku 4.5 | $1.00 | $5.00 | $6.00 |
| Qwen3-Max (2026-01-23) | $1.20 | $6.00 | $7.20 |
| Gemini 3 Pro (≤200K) | $2.00 | $12.00 | $14.00 |
| GPT-5.2 | $1.75 | $14.00 | $15.75 |
| GPT-5.4 | $2.50 | $15.00 | $17.50 |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $18.00 |
| Gemini 3 Pro (>200K) | $4.00 | $18.00 | $22.00 |
| Claude Opus 4.6 | $5.00 | $25.00 | $30.00 |
| GPT-5.2 Pro | $21.00 | $168.00 | $189.00 |
| GPT-5.4 Pro | $30.00 | $180.00 | $210.00 |
Another important note: with GPT-5.4, requests that exceed 272,000 input tokens are billed at 2× the normal rate, reflecting the ability to send prompts larger than earlier models supported.
In Codex, compaction defaults to 272K tokens, and the higher long-context pricing applies only when the input exceeds 272K. Developers can keep sending prompts at or under that size without triggering the higher rate, or can opt into larger prompts by raising the compaction limit, with only those larger requests billed differently.
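Under the rules described above (base rates of $2.50 per 1M input tokens and $15 per 1M output tokens, with a 2× multiplier for requests whose input exceeds 272K), a back-of-the-envelope cost estimator might look like the sketch below. Whether the multiplier also applies to output tokens is an assumption here, not something the article specifies:

```python
def gpt54_request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one GPT-5.4 API request.

    Base rates from the article: $2.50 per 1M input tokens, $15 per 1M
    output tokens. Requests exceeding 272,000 input tokens are billed at
    2x the normal rate; applying that multiplier to the whole request
    (including output) is an assumption for illustration.
    """
    INPUT_RATE = 2.50 / 1_000_000
    OUTPUT_RATE = 15.00 / 1_000_000
    LONG_CONTEXT_THRESHOLD = 272_000

    multiplier = 2 if input_tokens > LONG_CONTEXT_THRESHOLD else 1
    return multiplier * (input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE)

# A 100K-token prompt with a 4K-token answer stays at the base rate:
print(f"${gpt54_request_cost(100_000, 4_000):.4f}")  # $0.3100
# A 500K-token prompt crosses the 272K threshold and is billed at 2x:
print(f"${gpt54_request_cost(500_000, 4_000):.4f}")
```

The step change at the threshold is the practical takeaway: a prompt at exactly 272K tokens costs half as much per token as one just above it, which is why the Codex compaction default sits at that boundary.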
An OpenAI spokesperson said that in the API the maximum output is 128,000 tokens, the same as previous models.
Finally, on why GPT-5.4 is priced higher at baseline, the spokesperson attributed it to three factors: higher capability on complex tasks (including coding, computer use, deep research, advanced document generation, and tool use), major research improvements from OpenAI’s roadmap, and more efficient reasoning that uses fewer reasoning tokens for comparable tasks—adding that OpenAI believes GPT-5.4 remains below comparable frontier models on pricing even with the increase.
Across the release and the follow-up clarifications, GPT-5.4 is positioned as a model meant to move beyond “answer generation” and into sustained professional workflows—ones that require tool orchestration, computer interaction, long context, and outputs that look like the artifacts people actually use at work.
OpenAI’s emphasis on token efficiency, tool search, native computer use, and reduced user-flagged factual errors all point in the same direction: making agentic systems more viable in production by lowering the cost of retries—whether that retry is a human re-prompting, an agent calling another tool, or a workflow re-running because the first pass didn’t stick.
To create coherent images or videos, generative AI diffusion models like Stable Diffusion or FLUX have typically relied on external “teachers”—frozen encoders like CLIP or DINOv2—to provide the semantic understanding they couldn’t learn on their own.
But this reliance has come at a cost: a “bottleneck” where scaling up the model no longer yields better results because the external teacher has hit its limit.
Today, German AI startup Black Forest Labs (maker of the FLUX series of AI image models) has announced a potential end to this era of academic borrowing with the release of Self-Flow, a self-supervised flow matching framework that allows models to learn representation and generation simultaneously.
By integrating a novel Dual-Timestep Scheduling mechanism, Black Forest Labs has demonstrated that a single model can achieve state-of-the-art results across images, video, and audio without any external supervision.
The fundamental problem with traditional generative training is that it’s a “denoising” task. The model is shown noise and asked to find an image; it has very little incentive to understand what the image is, only what it looks like.
To fix this, researchers have previously “aligned” generative features with external discriminative models. However, Black Forest Labs argues this is fundamentally flawed: these external models often operate on misaligned objectives and fail to generalize across different modalities like audio or robotics.
The company’s new technique, Self-Flow, introduces an “information asymmetry” to solve this. Using a mechanism called Dual-Timestep Scheduling, the system applies different levels of noise to different parts of the input. The student receives a heavily corrupted version of the data, while the teacher—an Exponential Moving Average (EMA) version of the model itself—sees a “cleaner” version of the same data.
The student is then tasked not just with generating the final output, but with predicting what its “cleaner” self is seeing—a process of self-distillation where the teacher is at layer 20 and the student is at layer 8. This “Dual-Pass” approach forces the model to develop a deep, internal semantic understanding, effectively teaching itself how to see while it learns how to create.
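A minimal sketch of the scheduling idea, assuming a standard flow-matching interpolation between data and noise (the paper’s exact schedules and noise ranges may differ):

```python
import random

def noisy(x, t, eps):
    """Flow-matching style interpolation: t=0 is clean data, t=1 is pure noise."""
    return [(1 - t) * xi + t * ei for xi, ei in zip(x, eps)]

def dual_timestep_pair(x, rng=random):
    """Illustrative Dual-Timestep Scheduling: sample two timesteps so the
    EMA teacher always sees a cleaner (lower-noise) view than the student.
    The specific sampling ranges here are assumptions for illustration.
    """
    t_student = rng.uniform(0.5, 1.0)        # student: heavy corruption
    t_teacher = rng.uniform(0.0, t_student)  # teacher: strictly cleaner view
    eps = [rng.gauss(0, 1) for _ in x]       # shared noise sample
    return (noisy(x, t_student, eps), noisy(x, t_teacher, eps),
            t_student, t_teacher)

x = [0.2, -1.0, 0.7]
x_student, x_teacher, ts, tt = dual_timestep_pair(x)
assert tt <= ts  # the teacher's view is never noisier than the student's
```

The training signal then comes from asking the student, given its noisier view, to predict the teacher’s features on the cleaner view, which is the self-distillation step the article describes.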
The practical results of this shift are stark. According to the research paper, Self-Flow converges approximately 2.8x faster than the REpresentation Alignment (REPA) method, the current industry standard for feature alignment. Perhaps more importantly, it doesn’t plateau; as compute and parameters increase, Self-Flow continues to improve while older methods show diminishing returns.
The leap in training efficiency is best understood through the lens of raw computational steps: while standard “vanilla” training traditionally requires 7 million steps to reach a baseline performance level, REPA shortened that journey to just 400,000 steps, representing a 17.5x speedup.
Black Forest Labs’ Self-Flow framework pushes this frontier even further, operating 2.8x faster than REPA to hit the same performance milestone in roughly 143,000 steps.
Taken together, this evolution represents a nearly 50x reduction in the total number of training steps required to achieve high-quality results, effectively collapsing what was once a massive resource requirement into a significantly more accessible and streamlined process.
Black Forest Labs showcased these gains through a 4B parameter multi-modal model. Trained on a massive dataset of 200M images, 6M videos, and 2M audio-video pairs, the model demonstrated significant leaps in three key areas:
Typography and text rendering: One of the most persistent “tells” of AI images has been garbled text. Self-Flow significantly outperforms vanilla flow matching in rendering complex, legible signs and labels, such as a neon sign correctly spelling “FLUX is multimodal”.
Temporal consistency: In video generation, Self-Flow eliminates many of the “hallucinated” artifacts common in current models, such as limbs that spontaneously disappear during motion.
Joint video-audio synthesis: Because the model learns representations natively, it can generate synchronized video and audio from a single prompt, a task where external “borrowed” representations often fail because an image-encoder doesn’t understand sound.
In terms of quantitative metrics, Self-Flow achieved superior results over competitive baselines. On Image FID, the model scored 3.61 compared to REPA’s 3.92. For video (FVD), it reached 47.81 compared to REPA’s 49.59, and in audio (FAD), it scored 145.65 against the vanilla baseline’s 148.87.
The announcement concludes with a look toward world models—AI that doesn’t just generate pretty pictures but understands the underlying physics and logic of a scene for planning and robotics.
By fine-tuning a 675M parameter version of Self-Flow on the RT-1 robotics dataset, researchers achieved significantly higher success rates in complex, multi-step tasks in the SIMPLER simulator. While standard flow matching struggled with complex “Open and Place” tasks, often failing entirely, the Self-Flow model maintained a steady success rate, suggesting that its internal representations are robust enough for real-world visual reasoning.
For researchers looking to verify these claims, Black Forest Labs has released an inference suite on GitHub specifically for ImageNet 256×256 generation. The project, primarily written in Python, provides the SelfFlowPerTokenDiT model architecture based on SiT-XL/2.
Engineers can utilize the provided sample.py script to generate 50,000 images for standard FID evaluation. The repository highlights that a key architectural modification in this implementation is per-token timestep conditioning, which allows each token in a sequence to be conditioned on its specific noising timestep. During training, the model utilized BFloat16 mixed precision and the AdamW optimizer with gradient clipping to maintain stability.
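The per-token conditioning idea can be sketched with standard sinusoidal timestep embeddings: instead of one global embedding broadcast across the whole sequence, each token receives an embedding of its own noising timestep. This is an illustrative reconstruction, not code from the repository:

```python
import math

def timestep_embedding(t: float, dim: int = 8) -> list:
    """Standard sinusoidal embedding of a scalar timestep t in [0, 1]."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [f(t * w) for w in freqs for f in (math.sin, math.cos)]

def per_token_conditioning(token_timesteps: list) -> list:
    """Per-token timestep conditioning, as the repo describes: each token
    in the sequence is conditioned on its own noising timestep rather
    than sharing one global embedding."""
    return [timestep_embedding(t) for t in token_timesteps]

# Three tokens noised at different levels each get distinct conditioning:
embs = per_token_conditioning([0.1, 0.5, 0.9])
assert len(embs) == 3 and embs[0] != embs[1]
```

In a full DiT-style model these per-token embeddings would modulate each token’s normalization and attention layers, which is what lets a single sequence mix tokens at different corruption levels.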
Black Forest Labs has made the research paper and official inference code available via GitHub and their research portal. While this is currently a research preview, the company’s track record with the FLUX model family suggests these innovations will likely find their way into their commercial API and open-weights offerings in the near future.
For developers, the move away from external encoders is a massive win for efficiency. It eliminates the need to manage separate, heavy models like DINOv2 during training, simplifying the stack and allowing for more specialized, domain-specific training that isn’t beholden to someone else’s “frozen” understanding of the world.
For enterprises, the arrival of Self-Flow represents a significant shift in the cost-benefit analysis of developing proprietary AI.
While the most immediate beneficiaries are organizations training large-scale models from scratch, the research demonstrates that the technology is equally potent for high-resolution fine-tuning. Because the method converges nearly three times faster than current standards, companies can achieve state-of-the-art results with a fraction of the traditional compute budget.
This efficiency makes it viable for enterprises to move beyond generic off-the-shelf solutions and develop specialized models that are deeply aligned with their specific data domains, whether that involves niche medical imaging or proprietary industrial sensor data.
The practical applications for this technology extend into high-stakes industrial sectors, most notably robotics and autonomous systems. By leveraging the framework’s ability to learn “world models,” enterprises in manufacturing and logistics can develop vision-language-action (VLA) models that possess a superior understanding of physical space and sequential reasoning.
In simulation tests, Self-Flow allowed robotic controllers to successfully execute complex, multi-object tasks—such as opening a drawer to place an item inside—where traditional generative models failed. This suggests that the technology is a foundational tool for any enterprise seeking to bridge the gap between digital content generation and real-world physical automation.
Beyond performance gains, Self-Flow offers enterprises a strategic advantage by simplifying the underlying AI infrastructure. Most current generative systems are “Frankenstein” models that require complex, external semantic encoders often owned and licensed by third parties.
By unifying representation and generation into a single architecture, Self-Flow allows enterprises to eliminate these external dependencies, reducing technical debt and removing the “bottlenecks” associated with scaling third-party teachers. This self-contained nature ensures that as an enterprise scales its compute and data, the model’s performance scales predictably in lockstep, providing a clearer ROI for long-term AI investments.
Microsoft on Tuesday released Phi-4-reasoning-vision-15B, a compact open-weight multimodal AI model that the company says matches or exceeds the performance of systems many times its size — while consuming a fraction of the compute and training data. The release marks the latest and most technically ambitious chapter in the software giant’s year-long campaign to prove that carefully engineered small models can compete with, and in key areas outperform, the industry’s largest AI systems.
The 15-billion-parameter model, available immediately through Microsoft Foundry, HuggingFace, and GitHub under a permissive license, processes both images and text and can reason through complex math and science problems, interpret charts and documents, navigate graphical user interfaces, and handle everyday visual tasks like captioning photos and reading receipts. It arrives at a moment when the AI industry is grappling with a fundamental tension: the biggest models deliver the best raw performance, but their enormous cost, latency, and energy consumption make them impractical for many real-world deployments.
“Our goal is to contribute practical insight to the community on building smaller, efficient multimodal reasoning models,” the Microsoft Research team wrote in the model’s official announcement, “and to share an open-weight model that is competitive with models of similar size at general vision-language tasks, excels at computer use, and excels on scientific and mathematical multimodal reasoning.”
Perhaps the most striking claim in the release is how little training data the model required relative to its competitors. Phi-4-reasoning-vision-15B was trained on approximately 200 billion tokens of multimodal data, built atop the Phi-4-Reasoning language backbone (itself trained on 16 billion tokens) and the foundational Phi-4 model (400 billion unique tokens). By contrast, rival multimodal models from Alibaba’s Qwen family (2.5 VL and 3 VL), Moonshot AI’s Kimi-VL, SenseTime’s InternVL series, and Google’s Gemma3 each consumed more than one trillion tokens during training — roughly five times the total data pipeline Microsoft used.
That disparity matters enormously for economics. Training large AI models costs millions of dollars in cloud compute, and the environmental footprint of trillion-token training runs has drawn increasing scrutiny from regulators and investors alike. If Microsoft’s claims hold up under independent evaluation, the model represents a significant advance in training efficiency — one that could reshape how organizations think about the build-versus-buy calculus for AI deployment.
The secret, according to the research team, lies not in scale but in meticulous data curation. The team’s final dataset drew primarily from three sources: open-source datasets that were “meticulously filtered and improved”; high-quality domain-specific internal data; and targeted data acquisitions. The researchers described a hands-on quality assurance process in which team members manually reviewed samples from each dataset, typically spending five to ten minutes classifying data quality before deciding how to treat each source. For data with incorrect answers, they re-generated responses using GPT-4o and o4-mini. When questions were unsalvageable but images were high quality, they repurposed the images as seeds for new caption or visual question-answering data. They also reported fixing “a surprisingly large number of formatting and logical errors across widely used open-source datasets” — a finding that raises uncomfortable questions about the quality of training data underpinning many of the industry’s most prominent models.
The model’s most technically novel contribution may be its approach to reasoning. In the world of language-only AI, “reasoning models” — systems that spend extra compute time working through problems step by step — have become the hottest category in the field, with OpenAI’s o-series and DeepSeek’s R1 leading the charge. But extending reasoning to multimodal tasks involving images introduces a wrinkle: for many visual tasks like image captioning or optical character recognition, chain-of-thought reasoning is not only unnecessary but can actually degrade performance by introducing unnecessary verbosity and latency.
Microsoft’s solution was to build what it calls a “mixed reasoning and non-reasoning model.” The team started with Phi-4-Reasoning, already a capable reasoning language model, and then trained it on a hybrid data mixture where approximately 20 percent of samples included explicit chain-of-thought reasoning traces (wrapped in <think>…</think> tags) and 80 percent were tagged for direct response (with a <nothink> token). The model learned to invoke structured reasoning for domains like math and science where it helps, while defaulting to fast, direct responses for perception-focused tasks where it does not.
This design choice reflects a pragmatic view of reasoning that contrasts with the industry’s current enthusiasm for always-on thinking. As the research team explained: “For tasks such as image captioning and optical character recognition (OCR), reasoning is often unnecessary and can even be harmful, while mathematical and scientific problem-solving benefit from multi-step reasoning.” Users who want to override the model’s default behavior can do so by explicitly prompting with <think> or <nothink> tokens.
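In practice, that override can be as simple as prepending the control token to the prompt. The helper below is a sketch; the exact chat template is an assumption, and the model card should be consulted for the real format:

```python
from typing import Optional

def build_prompt(user_text: str, force_reasoning: Optional[bool] = None) -> str:
    """Sketch of the mixed-reasoning control the article describes:
    prepending <think> or <nothink> overrides the model's learned default
    of reasoning for math/science and answering directly for perception
    tasks. The prompt format here is illustrative, not the official one.
    """
    if force_reasoning is None:
        return user_text  # let the model decide per its training mix
    mode = "<think>" if force_reasoning else "<nothink>"
    return f"{mode} {user_text}"

# Force a fast, direct answer for a perception task like captioning:
print(build_prompt("What is in this image?", force_reasoning=False))
```

This mirrors the 20/80 training mix: the tags the model saw during training become a user-facing switch at inference time.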
The team explored four possible training pipelines for multimodal reasoning and chose the one they judged to best balance capability, efficiency, and data requirements. The alternative approaches — training reasoning and multimodal capabilities simultaneously from a non-reasoning base, learning multimodal skills first and then adding reasoning, or requiring reasoning traces for all training data — each carried significant drawbacks. Training reasoning from scratch demands enormous multimodal reasoning data. Adding reasoning after multimodal training risks catastrophic forgetting. And forcing reasoning on every query wastes compute on tasks that don’t benefit from it.
Under the hood, Phi-4-reasoning-vision-15B uses a mid-fusion architecture that pairs a SigLIP-2 vision encoder with the Phi-4-Reasoning language backbone. The choice of mid-fusion — where a pretrained vision encoder converts images into tokens that are then projected into the language model’s embedding space — over early-fusion, where images and text are processed together in a single transformer, reflects the team’s resource constraints. Early-fusion yields richer joint representations but demands significantly more compute, memory, and data.
The team conducted careful ablation studies on how to handle image resolution, an issue that matters critically for tasks like reading dense screenshots or small UI elements. They tested four approaches — Dynamic S, Multi-crop, Multi-crop with S, and dynamic resolution using SigLIP-2’s Naflex variant — and found that dynamic resolution encoders performed best, especially on high-resolution data. They selected the SigLIP-2 Naflex variant with up to 3,600 maximum tokens, which corresponds roughly to native 720p resolution and delivered particularly strong results on benchmarks requiring fine-grained visual understanding like ScreenSpot-Pro.
This matters for one of the model’s headline use cases: powering computer-using agents that navigate desktop, web, and mobile interfaces. With strong high-resolution perception and fine-grained grounding capabilities, the model can identify and localize interactive elements like buttons, menus, and text fields — a prerequisite for the autonomous software agents that many in the industry view as the next major frontier for AI deployment. The team noted that the model’s low inference-time requirements make it particularly well suited “for interactive environments where low latency and compact model size are essential.”
The model’s benchmark results paint a picture of a system that punches well above its weight class on efficiency while remaining competitive — though not dominant — on raw accuracy. On the team’s own evaluations across ten benchmarks, Phi-4-reasoning-vision-15B scored 84.8 on AI2D (science diagrams), 83.3 on ChartQA, 75.2 on MathVista, 88.2 on ScreenSpot v2 (UI element grounding), and 54.3 on MMMU (a broad multimodal understanding test).
Those numbers generally trail the much larger Qwen3-VL-32B models (which scored 85.0, 84.0, 81.8, 93.9, and 70.6 on the same benchmarks, respectively) but remain competitive with or ahead of similarly-sized systems like Qwen3-VL-8B and Kimi-VL-A3B. The real value proposition, as Figure 1 in the announcement illustrates, emerges when accuracy is plotted against compute time and output token count: Phi-4-reasoning-vision-15B sits at the Pareto frontier of models that are both fast and accurate, delivering competitive results in a fraction of the time required by larger systems.
The Microsoft team acknowledged that their benchmark numbers “may be lower than other previously shared numbers” because they ran all evaluations themselves rather than quoting leaderboard claims. They used temperature=0.0, greedy decoding, and a 4,096 maximum output token limit, with no custom prompting or parameter tuning. The team committed to releasing all evaluation logs publicly — a transparency practice that remains uncommon in the field and should allow independent researchers to verify the results. Still, independent reproduction will be critical: the AI research community has grown increasingly skeptical of self-reported numbers, particularly when evaluation methodologies differ across organizations.
Phi-4-reasoning-vision-15B does not exist in isolation. It is the latest entry in a Phi model family that has expanded rapidly over the past year, evolving from a niche research project into a central pillar of Microsoft’s AI strategy — one that now spans language, vision, on-device inference, education, and robotics.
The lineage traces back through several milestones. In late 2024, Microsoft released the original Phi-4, a 14-billion-parameter language model that demonstrated the power of synthetic data and careful curation. In April 2025, the company launched Phi-4 mini reasoning (3.8 billion parameters), Phi-4 reasoning (14 billion parameters), and Phi-4 reasoning plus — with the latter reportedly approaching the performance of DeepSeek’s R1, a model with 671 billion parameters, according to TechCrunch’s reporting at the time.
The family has also extended into specialized domains. Phi Silica, an on-device small language model for Copilot+ PCs, has been used with LoRA fine-tuning to customize generation for specific tasks. In one case study detailed on the Windows Developer Blog, Microsoft’s education team used LoRA adapters with Phi Silica to generate Kahoot! quizzes, achieving a 75 percent reduction in rejection rates and a 4.6-times uplift in subjective quality scores. On the hardware side, the Phi-4-mini model has been optimized for MediaTek’s NPU platforms, running at over 800 tokens per second for prefill on the Dimensity 9400 — fast enough for real-time AI on smartphones and tablets.
And in what may be the most ambitious extension yet, Microsoft announced Rho-alpha (ρα), described as the company’s “first robotics model derived from Microsoft’s Phi series.” According to Microsoft Research, Rho-alpha translates natural language commands into control signals for robotic systems performing bimanual manipulation tasks, adding tactile sensing to the perception stack and targeting dual-arm setups and humanoid robots.
The release crystallizes a broader shift in the AI industry’s center of gravity. For the past two years, the dominant narrative has held that bigger is better — that raw scale in parameters, data, and compute is the primary driver of capability. Microsoft’s Phi family represents the most visible corporate champion of the counterargument: that careful engineering of data quality, training methodology, and architecture design can substitute for brute-force scale. This thesis has significant implications for enterprise adoption. Organizations deploying AI in latency-sensitive or resource-constrained settings — edge devices, interactive applications, on-premise servers — cannot practically run trillion-parameter models. A 15-billion-parameter model that delivers 80 to 90 percent of a frontier model’s accuracy at a tenth of the inference cost could unlock deployment scenarios that were previously uneconomical.
The model’s open-weight release, accompanied by fine-tuning code and benchmark logs, also represents a competitive strategy. By making the model freely available and deeply documented, Microsoft positions Phi as a foundation layer for an ecosystem of downstream applications — many of which will run on Azure, use Microsoft’s development tools, or integrate with its enterprise software stack.
Yet the model still trails the largest open-weight competitors on the hardest benchmarks, particularly in mathematical reasoning (where Qwen3-VL-32B-Thinking-40K scores 78.2 on MathVerse compared to 53.1 for Phi-4-reasoning-vision with forced thinking) and general multimodal understanding (MMMU scores of 72.2 versus 55.0). The 20/80 reasoning-to-non-reasoning data split is, by the team’s own admission, a heuristic that “may not be optimal for all domains or deployment contexts.” And the model’s ability to correctly decide when to reason and when to respond directly remains what the researchers called “an open problem.”
Microsoft is wagering that in the real world, where latency budgets are tight, hardware is finite, and deployment costs compound with every API call, the smartest model is not the biggest one — it’s the one that knows when to think and when to just answer. Whether that bet pays off will depend less on benchmark tables and more on what happens when millions of developers start putting Phi-4-reasoning-vision to work. The model is available now on Microsoft Foundry, HuggingFace, and GitHub. The leaderboard, as always, is open.