To create coherent images or videos, generative AI diffusion models like Stable Diffusion or FLUX have typically relied on external “teachers”—frozen encoders like CLIP or DINOv2—to provide the semantic understanding they couldn’t learn on their own.
But this reliance has come at a cost: a “bottleneck” where scaling up the model no longer yields better results because the external teacher has hit its limit.
Today, German AI startup Black Forest Labs (maker of the FLUX series of AI image models) has announced a potential end to this era of academic borrowing with the release of Self-Flow, a self-supervised flow matching framework that allows models to learn representation and generation simultaneously.
By integrating a novel Dual-Timestep Scheduling mechanism, Black Forest Labs has demonstrated that a single model can achieve state-of-the-art results across images, video, and audio without any external supervision.
The fundamental problem with traditional generative training is that it’s a “denoising” task. The model is shown noise and asked to find an image; it has very little incentive to understand what the image is, only what it looks like.
To fix this, researchers have previously “aligned” generative features with external discriminative models. However, Black Forest Labs argues this is fundamentally flawed: these external models often operate on misaligned objectives and fail to generalize across different modalities like audio or robotics.
The Labs’ new technique, Self-Flow, introduces an “information asymmetry” to solve this. Using a technique called Dual-Timestep Scheduling, the system applies different levels of noise to different parts of the input. The student receives a heavily corrupted version of the data, while the teacher—an Exponential Moving Average (EMA) version of the model itself—sees a “cleaner” version of the same data.
The student is then tasked not just with generating the final output, but with predicting what its “cleaner” self is seeing—a process of self-distillation where the teacher is at layer 20 and the student is at layer 8. This “Dual-Pass” approach forces the model to develop a deep, internal semantic understanding, effectively teaching itself how to see while it learns how to create.
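The mechanism described above can be sketched in a few lines. This is a toy illustration based only on the article's description (asymmetric noise levels, an EMA teacher, a feature-matching loss added to flow matching); the function names, noise levels, and the single linear map standing in for the network are all our own assumptions, not Black Forest Labs' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def noise(x, t, eps):
    # Linear flow-matching interpolation between data (t=0) and pure noise (t=1).
    return (1 - t) * x + t * eps

def ema_update(teacher_w, student_w, decay=0.999):
    # The teacher is an exponential moving average of the student's weights.
    return decay * teacher_w + (1 - decay) * student_w

x = rng.normal(size=(4, 8))            # a batch of "clean" latents
eps = rng.normal(size=x.shape)

# Dual-Timestep Scheduling: the student sees a heavily corrupted view,
# the EMA teacher a cleaner one (0.8 / 0.3 are illustrative values).
t_student, t_teacher = 0.8, 0.3
x_student = noise(x, t_student, eps)
x_teacher = noise(x, t_teacher, eps)

student_w = rng.normal(size=(8, 8))
teacher_w = ema_update(student_w.copy(), student_w)

# Self-distillation: student features are regressed onto the teacher's
# features from the cleaner view (the paper pairs student layer 8 with
# teacher layer 20; a single linear map stands in for both here).
f_student = x_student @ student_w
f_teacher = x_teacher @ teacher_w
distill_loss = np.mean((f_student - f_teacher) ** 2)

# Standard flow-matching objective: predict the velocity (eps - x).
v_pred = x_student @ student_w          # placeholder "network" output
flow_loss = np.mean((v_pred - (eps - x)) ** 2)

total_loss = flow_loss + 0.5 * distill_loss
print(total_loss)
```

The key property is the information asymmetry: because the teacher's view is less corrupted, matching its features forces the student to recover semantics it cannot read off its own noisy input.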
The practical results of this shift are stark. According to the research paper, Self-Flow converges approximately 2.8x faster than the REpresentation Alignment (REPA) method, the current industry standard for feature alignment. Perhaps more importantly, it doesn’t plateau; as compute and parameters increase, Self-Flow continues to improve while older methods show diminishing returns.
The leap in training efficiency is best understood through the lens of raw computational steps: while standard “vanilla” training traditionally requires 7 million steps to reach a baseline performance level, REPA shortened that journey to just 400,000 steps, representing a 17.5x speedup.
Black Forest Labs’ Self-Flow framework pushes this frontier even further, operating 2.8x faster than REPA to hit the same performance milestone in roughly 143,000 steps.
Taken together, this evolution represents a nearly 50x reduction in the total number of training steps required to achieve high-quality results, effectively collapsing what was once a massive resource requirement into a significantly more accessible and streamlined process.
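The step counts quoted above are internally consistent, as a quick sanity check shows (figures taken directly from the article):

```python
# Training-step figures reported in the article.
vanilla_steps = 7_000_000
repa_steps = 400_000
selfflow_steps = repa_steps / 2.8    # reported as "roughly 143,000"

repa_speedup = vanilla_steps / repa_steps        # 17.5x over vanilla
total_speedup = vanilla_steps / selfflow_steps   # "nearly 50x" overall
print(repa_speedup, round(selfflow_steps), total_speedup)
```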
Black Forest Labs showcased these gains through a 4B parameter multi-modal model. Trained on a massive dataset of 200M images, 6M videos, and 2M audio-video pairs, the model demonstrated significant leaps in three key areas:
Typography and text rendering: One of the most persistent “tells” of AI images has been garbled text. Self-Flow significantly outperforms vanilla flow matching in rendering complex, legible signs and labels, such as a neon sign correctly spelling “FLUX is multimodal”.
Temporal consistency: In video generation, Self-Flow eliminates many of the “hallucinated” artifacts common in current models, such as limbs that spontaneously disappear during motion.
Joint video-audio synthesis: Because the model learns representations natively, it can generate synchronized video and audio from a single prompt, a task where external “borrowed” representations often fail because an image-encoder doesn’t understand sound.
In terms of quantitative metrics, Self-Flow achieved superior results over competitive baselines. On Image FID, the model scored 3.61 compared to REPA’s 3.92. For video (FVD), it reached 47.81 compared to REPA’s 49.59, and in audio (FAD), it scored 145.65 against the vanilla baseline’s 148.87.
The announcement concludes with a look toward world models—AI that doesn’t just generate pretty pictures but understands the underlying physics and logic of a scene for planning and robotics.
By fine-tuning a 675M parameter version of Self-Flow on the RT-1 robotics dataset, researchers achieved significantly higher success rates in complex, multi-step tasks in the SIMPLER simulator. While standard flow matching struggled with complex “Open and Place” tasks, often failing entirely, the Self-Flow model maintained a steady success rate, suggesting that its internal representations are robust enough for real-world visual reasoning.
For researchers looking to verify these claims, Black Forest Labs has released an inference suite on GitHub specifically for ImageNet 256×256 generation. The project, primarily written in Python, provides the SelfFlowPerTokenDiT model architecture based on SiT-XL/2.
Engineers can utilize the provided sample.py script to generate 50,000 images for standard FID evaluation. The repository highlights that a key architectural modification in this implementation is per-token timestep conditioning, which allows each token in a sequence to be conditioned on its specific noising timestep. During training, the model utilized BFloat16 mixed precision and the AdamW optimizer with gradient clipping to maintain stability.
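Per-token timestep conditioning can be sketched roughly as follows. Everything here is illustrative (the sinusoidal embedding, the tensor sizes, the additive conditioning); it is not code from the actual SelfFlowPerTokenDiT release, only a picture of the idea that each token carries its own timestep rather than sharing one scalar per sample.

```python
import numpy as np

rng = np.random.default_rng(0)

def timestep_embedding(t, dim):
    # Standard sinusoidal embedding, computed independently per token.
    half = dim // 2
    freqs = np.exp(-np.log(10_000) * np.arange(half) / half)
    ang = t[..., None] * freqs                       # (..., half)
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)

batch, tokens, dim = 2, 16, 32
x = rng.normal(size=(batch, tokens, dim))

# One timestep per token -- shape (batch, tokens) instead of (batch,).
t = rng.uniform(size=(batch, tokens))
t_emb = timestep_embedding(t, dim)                   # (batch, tokens, dim)

# Conditioning: add each token's own embedding before the DiT blocks.
h = x + t_emb
print(h.shape)
```

This is what makes the dual-timestep scheme expressible in a single forward pass: student-noised and teacher-noised tokens can coexist in one sequence, each tagged with its own noise level.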
Black Forest Labs has made the research paper and official inference code available via GitHub and their research portal. While this is currently a research preview, the company’s track record with the FLUX model family suggests these innovations will likely find their way into their commercial API and open-weights offerings in the near future.
For developers, the move away from external encoders is a massive win for efficiency. It eliminates the need to manage separate, heavy models like DINOv2 during training, simplifying the stack and allowing for more specialized, domain-specific training that isn’t beholden to someone else’s “frozen” understanding of the world.
For enterprises, the arrival of Self-Flow represents a significant shift in the cost-benefit analysis of developing proprietary AI.
While the most immediate beneficiaries are organizations training large-scale models from scratch, the research demonstrates that the technology is equally potent for high-resolution fine-tuning. Because the method converges nearly three times faster than current standards, companies can achieve state-of-the-art results with a fraction of the traditional compute budget.
This efficiency makes it viable for enterprises to move beyond generic off-the-shelf solutions and develop specialized models that are deeply aligned with their specific data domains, whether that involves niche medical imaging or proprietary industrial sensor data.
The practical applications for this technology extend into high-stakes industrial sectors, most notably robotics and autonomous systems. By leveraging the framework’s ability to learn “world models,” enterprises in manufacturing and logistics can develop vision-language-action (VLA) models that possess a superior understanding of physical space and sequential reasoning.
In simulation tests, Self-Flow allowed robotic controllers to successfully execute complex, multi-object tasks—such as opening a drawer to place an item inside—where traditional generative models failed. This suggests that the technology is a foundational tool for any enterprise seeking to bridge the gap between digital content generation and real-world physical automation.
Beyond performance gains, Self-Flow offers enterprises a strategic advantage by simplifying the underlying AI infrastructure. Most current generative systems are “Frankenstein” models that require complex, external semantic encoders often owned and licensed by third parties.
By unifying representation and generation into a single architecture, Self-Flow allows enterprises to eliminate these external dependencies, reducing technical debt and removing the “bottlenecks” associated with scaling third-party teachers. This self-contained nature ensures that as an enterprise scales its compute and data, the model’s performance scales predictably in lockstep, providing a clearer ROI for long-term AI investments.
Microsoft on Tuesday released Phi-4-reasoning-vision-15B, a compact open-weight multimodal AI model that the company says matches or exceeds the performance of systems many times its size — while consuming a fraction of the compute and training data. The release marks the latest and most technically ambitious chapter in the software giant’s year-long campaign to prove that carefully engineered small models can compete with, and in key areas outperform, the industry’s largest AI systems.
The 15-billion-parameter model, available immediately through Microsoft Foundry, HuggingFace, and GitHub under a permissive license, processes both images and text and can reason through complex math and science problems, interpret charts and documents, navigate graphical user interfaces, and handle everyday visual tasks like captioning photos and reading receipts. It arrives at a moment when the AI industry is grappling with a fundamental tension: the biggest models deliver the best raw performance, but their enormous cost, latency, and energy consumption make them impractical for many real-world deployments.
“Our goal is to contribute practical insight to the community on building smaller, efficient multimodal reasoning models,” the Microsoft Research team wrote in the model’s official announcement, “and to share an open-weight model that is competitive with models of similar size at general vision-language tasks, excels at computer use, and excels on scientific and mathematical multimodal reasoning.”
Perhaps the most striking claim in the release is how little training data the model required relative to its competitors. Phi-4-reasoning-vision-15B was trained on approximately 200 billion tokens of multimodal data, built atop the Phi-4-Reasoning language backbone (itself trained on 16 billion tokens) and the foundational Phi-4 model (400 billion unique tokens). By contrast, rival multimodal models from Alibaba’s Qwen family (2.5 VL and 3 VL), Moonshot AI’s Kimi-VL, SenseTime’s InternVL series, and Google’s Gemma3 each consumed more than one trillion tokens during training — roughly five times the total data pipeline Microsoft used.
That disparity matters enormously for economics. Training large AI models costs millions of dollars in cloud compute, and the environmental footprint of trillion-token training runs has drawn increasing scrutiny from regulators and investors alike. If Microsoft’s claims hold up under independent evaluation, the model represents a significant advance in training efficiency — one that could reshape how organizations think about the build-versus-buy calculus for AI deployment.
The secret, according to the research team, lies not in scale but in meticulous data curation. The team’s final dataset drew primarily from three sources: open-source datasets that were “meticulously filtered and improved”; high-quality domain-specific internal data; and targeted data acquisitions. The researchers described a hands-on quality assurance process in which team members manually reviewed samples from each dataset, typically spending five to ten minutes classifying data quality before deciding how to treat each source. For data with incorrect answers, they re-generated responses using GPT-4o and o4-mini. When questions were unsalvageable but images were high quality, they repurposed the images as seeds for new caption or visual question-answering data. They also reported fixing “a surprisingly large number of formatting and logical errors across widely used open-source datasets” — a finding that raises uncomfortable questions about the quality of training data underpinning many of the industry’s most prominent models.
The model’s most technically novel contribution may be its approach to reasoning. In the world of language-only AI, “reasoning models” — systems that spend extra compute time working through problems step by step — have become the hottest category in the field, with OpenAI’s o-series and DeepSeek’s R1 leading the charge. But extending reasoning to multimodal tasks involving images introduces a wrinkle: for many visual tasks like image captioning or optical character recognition, chain-of-thought reasoning is not only unnecessary but can actually degrade performance by introducing unnecessary verbosity and latency.
Microsoft’s solution was to build what it calls a “mixed reasoning and non-reasoning model.” The team started with Phi-4-Reasoning, already a capable reasoning language model, and then trained it on a hybrid data mixture where approximately 20 percent of samples included explicit chain-of-thought reasoning traces (wrapped in <think>…</think> tags) and 80 percent were tagged for direct response (with a <nothink> token). The model learned to invoke structured reasoning for domains like math and science where it helps, while defaulting to fast, direct responses for perception-focused tasks where it does not.
This design choice reflects a pragmatic view of reasoning that contrasts with the industry’s current enthusiasm for always-on thinking. As the research team explained: “For tasks such as image captioning and optical character recognition (OCR), reasoning is often unnecessary and can even be harmful, while mathematical and scientific problem-solving benefit from multi-step reasoning.” Users who want to override the model’s default behavior can do so by explicitly prompting with <think> or <nothink> tokens.
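The 20/80 data mixture is simple to picture. The sketch below is our own illustration of how such a mixture might be assembled; the helper name and sample format are assumptions, with only the `<think>`/`<nothink>` tags and the 20 percent ratio taken from Microsoft's description.

```python
import random

def format_sample(question, answer, trace=None):
    # Reasoning samples wrap a chain-of-thought trace in <think> tags;
    # direct-response samples are marked with a <nothink> token.
    if trace is not None:
        return f"{question}\n<think>{trace}</think>\n{answer}"
    return f"{question}\n<nothink>\n{answer}"

random.seed(0)
samples = []
for i in range(1000):
    if random.random() < 0.20:   # ~20% of samples carry reasoning traces
        samples.append(format_sample(f"q{i}", f"a{i}", trace=f"steps for q{i}"))
    else:
        samples.append(format_sample(f"q{i}", f"a{i}"))

frac = sum("<think>" in s for s in samples) / len(samples)
print(round(frac, 2))
```

Trained on such a mixture, the model learns both response styles and, at inference time, picks one per query unless the user forces the choice with an explicit tag.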
The team explored four possible training pipelines for multimodal reasoning and chose the one they judged to best balance capability, efficiency, and data requirements. The alternative approaches — training reasoning and multimodal capabilities simultaneously from a non-reasoning base, learning multimodal skills first and then adding reasoning, or requiring reasoning traces for all training data — each carried significant drawbacks. Training reasoning from scratch demands enormous multimodal reasoning data. Adding reasoning after multimodal training risks catastrophic forgetting. And forcing reasoning on every query wastes compute on tasks that don’t benefit from it.
Under the hood, Phi-4-reasoning-vision-15B uses a mid-fusion architecture that pairs a SigLIP-2 vision encoder with the Phi-4-Reasoning language backbone. The choice of mid-fusion — where a pretrained vision encoder converts images into tokens that are then projected into the language model’s embedding space — over early-fusion, where images and text are processed together in a single transformer, reflects the team’s resource constraints. Early-fusion yields richer joint representations but demands significantly more compute, memory, and data.
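The mid-fusion pattern reduces to a projection and a concatenation, sketched below. The dimensions are illustrative guesses, not Microsoft's actual sizes; the point is only that the frozen-or-pretrained vision encoder's output tokens are linearly mapped into the language model's embedding space and then treated as ordinary sequence positions.

```python
import numpy as np

rng = np.random.default_rng(0)

vision_dim, lm_dim = 1152, 5120                       # assumed sizes
img_tokens = rng.normal(size=(256, vision_dim))       # vision-encoder output
proj = rng.normal(size=(vision_dim, lm_dim)) * 0.01   # learned projection

img_embeds = img_tokens @ proj                        # into LM embedding space
text_embeds = rng.normal(size=(32, lm_dim))           # embedded text prompt

# Mid-fusion: image and text tokens form one sequence for the LM backbone,
# which attends over both jointly from its first layer onward.
sequence = np.concatenate([img_embeds, text_embeds], axis=0)
print(sequence.shape)
```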
The team conducted careful ablation studies on how to handle image resolution, an issue that matters critically for tasks like reading dense screenshots or small UI elements. They tested four approaches — Dynamic S, Multi-crop, Multi-crop with S, and dynamic resolution using SigLIP-2’s Naflex variant — and found that dynamic resolution encoders performed best, especially on high-resolution data. They selected the SigLIP-2 Naflex variant with up to 3,600 maximum tokens, which corresponds roughly to native 720p resolution and delivered particularly strong results on benchmarks requiring fine-grained visual understanding like ScreenSpot-Pro.
This matters for one of the model’s headline use cases: powering computer-using agents that navigate desktop, web, and mobile interfaces. With strong high-resolution perception and fine-grained grounding capabilities, the model can identify and localize interactive elements like buttons, menus, and text fields — a prerequisite for the autonomous software agents that many in the industry view as the next major frontier for AI deployment. The team noted that the model’s low inference-time requirements make it particularly well suited “for interactive environments where low latency and compact model size are essential.”
The model’s benchmark results paint a picture of a system that punches well above its weight class on efficiency while remaining competitive — though not dominant — on raw accuracy. On the team’s own evaluations across ten benchmarks, Phi-4-reasoning-vision-15B scored 84.8 on AI2D (science diagrams), 83.3 on ChartQA, 75.2 on MathVista, 88.2 on ScreenSpot v2 (UI element grounding), and 54.3 on MMMU (a broad multimodal understanding test).
Those numbers generally trail the much larger Qwen3-VL-32B models (which scored 85.0, 84.0, 81.8, 93.9, and 70.6 on the same benchmarks, respectively) but remain competitive with or ahead of similarly-sized systems like Qwen3-VL-8B and Kimi-VL-A3B. The real value proposition, as Figure 1 in the announcement illustrates, emerges when accuracy is plotted against compute time and output token count: Phi-4-reasoning-vision-15B sits at the Pareto frontier of models that are both fast and accurate, delivering competitive results in a fraction of the time required by larger systems.
The Microsoft team acknowledged that their benchmark numbers “may be lower than other previously shared numbers” because they ran all evaluations themselves rather than quoting leaderboard claims. They used temperature=0.0, greedy decoding, and a 4,096 maximum output token limit, with no custom prompting or parameter tuning. The team committed to releasing all evaluation logs publicly — a transparency practice that remains uncommon in the field and should allow independent researchers to verify the results. Still, independent reproduction will be critical: the AI research community has grown increasingly skeptical of self-reported numbers, particularly when evaluation methodologies differ across organizations.
Phi-4-reasoning-vision-15B does not exist in isolation. It is the latest entry in a Phi model family that has expanded rapidly over the past year, evolving from a niche research project into a central pillar of Microsoft’s AI strategy — one that now spans language, vision, on-device inference, education, and robotics.
The lineage traces back through several milestones. In late 2024, Microsoft released the original Phi-4, a 14-billion-parameter language model that demonstrated the power of synthetic data and careful curation. In April 2025, the company launched Phi-4 mini reasoning (3.8 billion parameters), Phi-4 reasoning (14 billion parameters), and Phi-4 reasoning plus — with the latter reportedly approaching the performance of DeepSeek’s R1, a model with 671 billion parameters, according to TechCrunch’s reporting at the time.
The family has also extended into specialized domains. Phi Silica, an on-device small language model for Copilot+ PCs, has been used with LoRA fine-tuning to customize generation for specific tasks. In one case study detailed on the Windows Developer Blog, Microsoft’s education team used LoRA adapters with Phi Silica to generate Kahoot! quizzes, achieving a 75 percent reduction in rejection rates and a 4.6-times uplift in subjective quality scores. On the hardware side, the Phi-4-mini model has been optimized for MediaTek’s NPU platforms, running at over 800 tokens per second for prefill on the Dimensity 9400 — fast enough for real-time AI on smartphones and tablets.
And in what may be the most ambitious extension yet, Microsoft announced Rho-alpha (ρα), described as the company’s “first robotics model derived from Microsoft’s Phi series.” According to Microsoft Research, Rho-alpha translates natural language commands into control signals for robotic systems performing bimanual manipulation tasks, adding tactile sensing to the perception stack and targeting dual-arm setups and humanoid robots.
The release crystallizes a broader shift in the AI industry’s center of gravity. For the past two years, the dominant narrative has held that bigger is better — that raw scale in parameters, data, and compute is the primary driver of capability. Microsoft’s Phi family represents the most visible corporate champion of the counterargument: that careful engineering of data quality, training methodology, and architecture design can substitute for brute-force scale. This thesis has significant implications for enterprise adoption. Organizations deploying AI in latency-sensitive or resource-constrained settings — edge devices, interactive applications, on-premise servers — cannot practically run trillion-parameter models. A 15-billion-parameter model that delivers 80 to 90 percent of a frontier model’s accuracy at a tenth of the inference cost could unlock deployment scenarios that were previously uneconomical.
The model’s open-weight release, accompanied by fine-tuning code and benchmark logs, also represents a competitive strategy. By making the model freely available and deeply documented, Microsoft positions Phi as a foundation layer for an ecosystem of downstream applications — many of which will run on Azure, use Microsoft’s development tools, or integrate with its enterprise software stack.
Yet the model still trails the largest open-weight competitors on the hardest benchmarks, particularly in mathematical reasoning (where Qwen3-VL-32B-Thinking-40K scores 78.2 on MathVerse compared to 53.1 for Phi-4-reasoning-vision with forced thinking) and general multimodal understanding (MMMU scores of 72.2 versus 55.0). The 20/80 reasoning-to-non-reasoning data split is, by the team’s own admission, a heuristic that “may not be optimal for all domains or deployment contexts.” And the model’s ability to correctly decide when to reason and when to respond directly remains what the researchers called “an open problem.”
Microsoft is wagering that in the real world, where latency budgets are tight, hardware is finite, and deployment costs compound with every API call, the smartest model is not the biggest one — it’s the one that knows when to think and when to just answer. Whether that bet pays off will depend less on benchmark tables and more on what happens when millions of developers start putting Phi-4-reasoning-vision to work. The model is available now on Microsoft Foundry, HuggingFace, and GitHub. The leaderboard, as always, is open.
Alibaba’s Qwen team has been among the most prolific and well-regarded AI research groups in the international machine learning community — shipping dozens of powerful generalized and specialized generative models since last summer, most of them entirely open source and free.
But now, just 24 hours after shipping the open source Qwen3.5 small model series—a release that drew public praise from Elon Musk for its “impressive intelligence density”—the project’s technical architect and several other Qwen team members have exited the company under unclear circumstances, raising questions and concerns worldwide about the future direction of the Qwen team and its commitment to open source.
The departure of Junyang “Justin” Lin, the technical lead who steered Qwen from a nascent lab project to a global powerhouse with over 600 million downloads, alongside two colleagues — staff research scientist Binyuan Hui and intern Kaixin Li — marks a volatile inflection point for Alibaba Cloud and its role as an international open source AI leader.
The three Qwen team members announced their departures on X today, though they did not share their reasons or whether the exits were voluntary. VentureBeat reached out to sources at Alibaba for more information and will update when we obtain it. Lin himself signed off with a simple post: “me stepping down. bye my beloved qwen.”
While the company celebrates a technical triumph, the sudden exit of its core leadership suggests a deepening rift between the researchers who built the models and a corporate hierarchy now pivoting toward aggressive monetization.
The Qwen3.5 small model series (ranging from 0.8B to 9B parameters) represents a final masterstroke in “intelligence density” from the founding team.
The models employ a Gated DeltaNet hybrid architecture that allows a 9B-parameter model to rival the reasoning capabilities of much larger systems.
By utilizing a 3:1 ratio of linear attention to full attention, the models maintain a massive 262,000-token context window while remaining efficient enough to run natively on standard laptops and smartphones — even in web browsers.
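The 3:1 interleaving described above amounts to a simple repeating layer schedule. The sketch below is our own illustration of such a pattern (the labels and layer count are assumptions, not Qwen3.5's actual configuration): three linear-attention layers for every full-attention layer, which keeps the key-value cache small enough for a 262K-token context on consumer hardware.

```python
def layer_schedule(n_layers, ratio=(3, 1)):
    # Repeat a period of `lin` linear-attention layers followed by
    # `full` full-attention layers across the whole stack.
    lin, full = ratio
    period = lin + full
    return ["linear" if i % period < lin else "full" for i in range(n_layers)]

sched = layer_schedule(32)
print(sched[:8], sched.count("linear"), sched.count("full"))
```

With a 32-layer stack this yields 24 linear-attention layers and 8 full-attention layers, matching the 3:1 ratio.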
Lin, a PKU humanities graduate and polyglot, has long advocated for this “algorithm-hardware co-design” to bypass compute constraints—a philosophy he detailed at the January 2026 Tsinghua AI Summit.
For the developer community, Qwen3.5 wasn’t just another update; it was a blueprint for the “Agentic Inflection,” where models shift from being chatbots to autonomous “all-in-one AI workers” capable of navigating UIs and executing complex code.
For the 90,000+ enterprises currently deploying Qwen via DingTalk or Alibaba Cloud, the leadership vacuum creates a crisis of confidence.
Many companies migrated to Qwen because it offered a “third way”: the performance of a proprietary US model with the transparency of open weights.
Alibaba has recently consolidated its AI efforts into the “Qwen C-end Business Group,” merging its model labs with consumer hardware teams. The goal is clear: transition Qwen from a research project into the operating system for a new era of AI-integrated glasses and rings.
However, the reported appointment of Hao Zhou, a veteran of Google DeepMind’s Gemini team, to lead the Qwen team indicates a shift from “research-first” to “metric-driven” leadership.
Industry analysts, including those cited by InfoWorld, warn that as Alibaba pushes to meet investor demands for revenue growth, the “open” in Qwen’s open-weight models may become a secondary priority. A similar shift played out at Meta after the disappointing release of its Llama 4 model last spring, which led to a reorganization of its AI division, the hiring of Scale AI co-founder and CEO Alexandr Wang, and the subsequent departure of preeminent researcher Yann LeCun.
Enterprises relying on the Apache 2.0-licensed Qwen models now face the possibility that future flagships —such as the rumored Qwen3.5-Max—will be locked behind paid, proprietary APIs to drive Cloud DAU (Daily Active User) metrics.
The takeaway? If you value Qwen’s open source efforts, download and preserve the models now, while you still can.
The internal friction at Alibaba mirrors the tensions seen at OpenAI and Google: the “soul” of the machine is often at odds with the “scale” of the business. Xinyu Yang, a researcher at rival Chinese AI lab DeepSeek, captured this sentiment in a stark post on X: “Replace the excellent leader with a non-core people from Google Gemini, driven by DAU metrics. If you judge foundation model teams like consumer apps, don’t be surprised when the innovation curve flattens.”
This “Gemini-fication”—the shift toward a highly regulated, product-centric culture—threatens the very agility that allowed Qwen to surpass Meta’s Llama in derivative model creation. For the global AI community, the loss of Junyang Lin is symbolic.
He was the primary bridge between China’s deep engineering talent and the Western open-source ecosystem. Without his advocacy, there are fears that the project will retreat into a “walled garden” strategy similar to its Western rivals.
The technical brilliance of the Qwen3.5 release has been overshadowed by the heartbreak of its creators. On social media, the sentiment among the team members who built the model is one of mourning rather than celebration:
Chen Cheng, a Qwen contributor, explicitly alluded to a forced departure, writing in a post on X: “I’m truly heartbroken. I know leaving wasn’t your choice… I honestly can’t imagine Qwen without you.”
Li suggested the exit signaled the end of broader ambitions, such as a planned Singapore-based research hub: “Qwen could have had a Singapore base, all thanks to Junyang. But now that he’s gone, there’s no reason left to stay here.”
While the public face of the Qwen3.5 launch was one of technical triumph, internal reports from a “Tongyi Conference” held by Alibaba on March 4 suggest an atmosphere of significant organizational tension.
According to unverified but widely discussed accounts from the meeting posted on X, executives defended the departures as the culmination of a fundamental disagreement over how AI should be built.
The primary catalyst appears to be a dismantling of the “vertically integrated” R&D model that Junyang Lin had championed. Under Lin, the Qwen team operated as an end-to-end, autonomous unit covering everything from pre-training and infrastructure to multimodal research. The new corporate directive splits this “closed loop” into horizontal modules managed directly by Alibaba Cloud’s Tongyi Lab.
Leadership, including Alibaba CEO Wu Yongming (referred to as “Wu Ma”), Cloud CTO Zhou Jingren, and the Chief HR Officer, argued that while Lin’s centralized “efficiency” was undeniable, the project’s scale—now involving hundreds of people—could no longer be governed by “one person’s brain.”
The most striking details from the conference involve the company’s response to the team’s loyalty to Lin. When asked if there was a path for Lin’s return, the Chief HR Officer reportedly struck a definitive tone, stating:
“We cannot put him on a pedestal… the company cannot accept irrational demands that spare no cost to retain him.”
The executive then turned the question back on the staff, asking the audience to consider: “What do you think your own cost is?” This rhetoric signals a pivot from a talent-first, researcher-led culture to a more traditional, replaceable corporate structure.
CEO Wu Yongming addressed complaints regarding “choked” resources, claiming he was unaware of any intentional bottlenecks and asserting that Qwen remains his “highest priority.” However, in a surprising moment of candor, CTO Zhou Jingren reportedly admitted that even he had been “sidelined” at times, illustrating a fractured chain of command where technical needs frequently collided with “national situation” constraints and group-level political factors.
The known facts are simple: Qwen has never been technically stronger, yet its founding core has been dismantled. As Alibaba prepares to face investors for its fiscal Q3 earnings report on March 5, the narrative will likely focus on “efficiency” and “commercial scale.”
For the enterprises currently excited about the 60% cost reductions promised by Qwen3.5, the immediate future is bright.
But for the larger AI community, the cost of that efficiency may be the loss of the most vibrant open-source lab in the East.
As Hao Zhou takes the reins, the world is watching to see if Qwen remains a “model for the world” or becomes merely a component in Alibaba’s corporate bottom line.