A surprising iPhone update has landed. Should you choose it or is it time to move to iOS 26?
The DJI Avata 360 is DJI’s first 360-degree drone that can be flown like a camera or FPV (first-person view) drone using a motion controller and DJI FPV goggles.
…
While the number of QD OLED series in Samsung’s 2026 TV range may have shrunk, the quality of the ones that remain has gone to a whole other level.
The baton of open source AI models has been passed on between several companies over the years since ChatGPT debuted in late 2022, from Meta with its Llama family to Chinese labs like Qwen and z.ai. But lately, Chinese companies have started pivoting back towards proprietary models even as some U.S. labs like Cursor and Nvidia release their own variants of the Chinese models, leaving a question mark about who will originate this branch of technology going forward.
One answer: Arcee, a San Francisco based lab, which this week released AI Trinity-Large-Thinking—a 399-billion parameter text-only reasoning model released under the uncompromisingly open Apache 2.0 license, allowing for full customizability and commercial usage by anyone from indie developers to large enterprises.
The release represents more than just a new set of weights on AI code sharing community Hugging Face; it is a strategic bet that “American Open Weights” can provide a sovereign alternative to the increasingly closed or restricted frontier models of 2025.
This move arrives precisely as enterprises express growing discomfort with relying on Chinese-based architectures for critical infrastructure, creating a demand for a domestic champion that Arcee intends to fill.
As Clément Delangue, co-founder and CEO of Hugging Face, told VentureBeat in a direct message on X: “The strength of the US has always been its startups so maybe they’re the ones we should count on to lead in open-source AI. Arcee shows that it’s possible!”
To understand the weight of the Trinity release, one must understand the lab that built it. Based in San Francisco, Arcee AI is a lean team of only 30 people.
While competitors like OpenAI and Google operate with thousands of engineers and multibillion-dollar compute budgets, Arcee has defined itself through what CTO Lucas Atkins calls “engineering through constraint”.
The company first made waves in 2024 after securing a $24 million Series A led by Emergence Capital, bringing its total capital to just under $50 million. In early 2026, the team took a massive risk: they committed $20 million—nearly half their total funding—to a single 33-day training run for Trinity Large.
Utilizing a cluster of 2048 NVIDIA B300 Blackwell GPUs, which provided twice the speed of the previous Hopper generation, Arcee bet the company’s future on the belief that developers needed a frontier model they could truly own.
This “back the company” bet was a masterclass in capital efficiency, proving that a small, focused team could stand up a full pipeline and stabilize training without endless reserves.
Trinity-Large-Thinking is noteworthy for the extreme sparsity of its attention mechanism. While the model houses 400 billion total parameters, its Mixture-of-Experts architecture means that only 1.56%, or 13 billion parameters, are active for any given token.
This allows the model to possess the deep knowledge of a massive system while maintaining the inference speed and operational efficiency of a much smaller one—performing roughly 2 to 3 times faster than its peers on the same hardware. Training such a sparse model presented significant stability challenges.
To prevent a few experts from becoming “winners” while others remained untrained “dead weight,” Arcee developed SMEBU, or Soft-clamped Momentum Expert Bias Updates.
This mechanism ensures that experts are specialized and routed evenly across a general web corpus. The architecture also incorporates a hybrid approach, alternating local and global sliding window attention layers in a 3:1 ratio to maintain performance in long-context scenarios.
Arcee’s partnership with fellow startup DatologyAI provided a curriculum of over 10 trillion curated tokens. However, the training corpus for the full-scale model was expanded to 20 trillion tokens, split evenly between curated web data and high-quality synthetic data.
Unlike typical imitation-based synthetic data where a smaller model simply learns to mimic a larger one, DatologyAI utilized techniques to synthetically rewrite raw web text—such as Wikipedia articles or blogs—to condense the information.
This process helped the model learn to reason over concepts and information rather than merely memorizing exact token strings.
To ensure regulatory compliance, tremendous effort was invested in excluding copyrighted books and materials with unclear licensing, attracting enterprise customers who are wary of intellectual property risks associated with mainstream LLMs.
This data-first approach allowed the model to scale cleanly while significantly improving performance on complex tasks like mathematics and multi-step agent tool use.
The defining feature of this official release is the transition from a standard “instruct” model to a “reasoning” model.
By implementing a “thinking” phase prior to generating a response—similar to the internal loops found in the earlier Trinity-Mini—Arcee has addressed the primary criticism of its January “Preview” release.
Early users of the Preview model had noted that it sometimes struggled with multi-step instructions in complex environments and could be “underwhelming” for agentic tasks.
The “Thinking” update effectively bridges this gap, enabling what Arcee calls “long-horizon agents” that can maintain coherence across multi-turn tool calls without getting “sloppy”.
This reasoning process enables better context coherence and cleaner instruction following under constraint. This has direct implications for Maestro Reasoning, a 32B-parameter derivative of Trinity already being used in audit-focused industries to provide transparent “thought-to-answer” traces.
The goal was to move beyond “yappy” or inefficient chatbots toward reliable, cheap, high-quality agents that stay stable across long-running loops.
The significance of Arcee’s Apache 2.0 commitment is amplified by the retreat of its primary competitors from the open-weight frontier.
Throughout 2025, Chinese research labs like Alibaba’s Qwen and z.ai (aka Zhupai) set the pace for high-efficiency MoE architectures.
However, as we enter 2026, those labs have begun to shift toward proprietary enterprise platforms and specialized subscriptions, signaling a move away from pure community growth.
The fragmentation of these once-prolific teams, such as the departure of key technical leads from Alibaba’s Qwen lab, has left a void at the high end of the open-weight market. In the United States, the movement has faced its own crisis.
Meta’s Llama division notably retreated from the frontier landscape following the mixed reception of Llama 4 in April 2025, which faced reports of quality issues and benchmark manipulation.
For developers who relied on the Llama 3 era of dominance, the lack of a current 400B+ open model created an urgent need for an alternative that Arcee has risen to fill.
Trinity-Large-Thinking’s performance on agent-specific evaluations establishes it as a legitimate frontier contender. On PinchBench, a critical metric for evaluating model capability on autonomous agentic tasks, Trinity achieved a score of 91.9, placing it just behind the proprietary market leader, Claude Opus 4.6 (93.3).
This competitiveness is mirrored in IFBench, where Trinity’s score of 52.3 sits in a near-dead heat with Opus 4.6’s 53.1, indicating that the reasoning-first “Thinking” update has successfully addressed the instruction-following hurdles that challenged the model’s earlier preview phase.
The model’s broader technical reasoning capabilities also place it at the high end of the current open-source market. It recorded a 96.3 on AIME25, matching the high-tier Kimi-K2.5 and outstripping other major competitors like GLM-5 (93.3) and MiniMax-M2.7 (80.0).
While high-end coding benchmarks like SWE-bench Verified still show a lead for top-tier closed-source models—with Trinity scoring 63.2 against Opus 4.6’s 75.6—the massive delta in cost-per-token positions Trinity as the more viable sovereign infrastructure layer for enterprises looking to deploy these capabilities at production scale.
When it comes to other U.S. open source frontier model offerings, OpenAI’s gpt-oss tops out at 120 billion parameters, but there’s also Google with Gemma (Gemma 4 was just released this week) and IBM’s Granite family is also worth a mention, despite having lower benchmarks. Nvidia’s Nemotron family is also notable, but is fine-tuned and post-trained Qwen variants.
|
Benchmark |
Arcee Trinity-Large |
gpt-oss-120B (High) |
IBM Granite 4.0 |
Google Gemma 4 |
|
GPQA-D |
76.3% |
80.1% |
74.8% |
84.3% |
|
Tau2-Airline |
88.0% |
65.8%* |
68.3% |
76.9% |
|
PinchBench |
91.9% |
69.0% (IFBench) |
89.1% |
93.3% |
|
AIME25 |
96.3% |
97.9% |
88.5% |
89.2% |
|
MMLU-Pro |
83.4% |
90.0% (MMLU) |
81.2% |
85.2% |
So how is an enterprise supposed to choose between all these?
Arcee Trinity-Large-Thinking is the premier choice for organizations building autonomous agents; its sparse 400B architecture excels at “thinking” through multi-step logic, complex math, and long-horizon tool use. By activating only a fraction of its parameters, it provides a high-speed reasoning engine for developers who need GPT-4o-level planning capabilities within a cost-effective, open-source framework.
Conversely, gpt-oss-120B serves as the optimal middle ground for enterprises that require high-reasoning performance but prioritize lower operational costs and deployment flexibility.
Because it activates only 5.1B parameters per forward pass, it is uniquely suited for technical workloads like competitive code generation and advanced mathematical modeling that must run on limited hardware, such as a single H100 GPU.
Its configurable reasoning effort—offering “Low,” “Medium,” and “High” modes—makes it the best fit for production environments where latency and accuracy must be balanced dynamically across different tasks.
For broader, high-throughput applications, Google Gemma 4 and IBM Granite 4.0 serve as the primary backbones. Gemma 4 offers the highest “intelligence density” for general knowledge and scientific accuracy, making it the most versatile option for R&D and high-speed chat interfaces.
Meanwhile, IBM Granite 4.0 is engineered for the “all-day” enterprise workload, utilizing a hybrid architecture that eliminates context bottlenecks for massive document processing. For businesses concerned with legal compliance and hardware efficiency, Granite remains the most reliable foundation for large-scale RAG and document analysis.
In this climate, Arcee’s choice of the Apache 2.0 license is a deliberate act of differentiation. Unlike the restrictive community licenses used by some competitors, Apache 2.0 allows enterprises to truly own their intelligence stack without the “black box” biases of a general-purpose chat model.
“Developers and Enterprises need models they can inspect, post-train, host, distill, and own,” Lucas Atkins noted in the launch announcement.
This ownership is critical for the “bitter lesson” of training small models: you usually need to train a massive frontier model first to generate the high-quality synthetic data and logits required to build efficient student models.
Furthermore, Arcee has released Trinity-Large-TrueBase, a raw 10-trillion-token checkpoint. TrueBase offers a rare, “unspoiled” look at foundational intelligence before instruction tuning and reinforcement learning are applied. For researchers in highly regulated industries like finance and defense, TrueBase allows for authentic audits and custom alignments starting from a clean slate.
The response from the developer community has been largely positive, reflecting the desire for more open weights, U.S.-made mdoels.
On X, researchers highlighted the disruption, noting that the “insanely cheap” prices for a model of this size would be a boon for the agentic community.
On open AI model inference website OpenRouter, Trinity-Large-Preview established itself as the #1 most used open model in the U.S., serving over 80.6 billion tokens on peak days like March 1, 2026.
The proximity of Trinity-Large-Thinking to Claude Opus 4.6 on PinchBench—at 91.9 versus 93.3—is particularly striking when compared to the cost. At $0.90 per million output tokens, Trinity is approximately 96% cheaper than Opus 4.6, which costs $25 per million output tokens.
Arcee’s strategy is now focused on bringing these pretraining and post-training lessons back down the stack. Much of the work that went into Trinity Large will now flow into the Mini and Nano models, refreshing the company’s compact line with the distillation of frontier-level reasoning.
As global labs pivot toward proprietary lock-in, Arcee has positioned Trinity as a sovereign infrastructure layer that developers can finally control and adapt for long-horizon agentic workflows.
For the past two years, enterprises evaluating open-weight models have faced an awkward trade-off. Google’s Gemma line consistently delivered strong performance, but its custom license — with usage restrictions and terms Google could update at will — pushed many teams toward Mistral or Alibaba’s Qwen instead. Legal review added friction. Compliance teams flagged edge cases. And capable as Gemma 3 was, “open” with asterisks isn’t the same as open.
Gemma 4 eliminates that friction entirely. Google DeepMind’s newest open model family ships under a standard Apache 2.0 license — the same permissive terms used by Qwen, Mistral, Arcee, and most of the open-weight ecosystem.
No custom clauses, no “Harmful Use” carve-outs that required legal interpretation, no restrictions on redistribution or commercial deployment. For enterprise teams that had been waiting for Google to play on the same licensing terms as the rest of the field, the wait is over.
The timing is notable. As some Chinese AI labs (most notably Alibaba’s latest Qwen models, Qwen3.5 Omni and Qwen 3.6 Plus) have begun pulling back from fully open releases for their latest models, Google is moving in the opposite direction — opening up its most capable Gemma release yet while explicitly stating the architecture draws from its commercial Gemini 3 research.
Gemma 4 arrives as four distinct models organized into two deployment tiers. The “workstation” tier includes a 31B-parameter dense model and a 26B A4B Mixture-of-Experts model — both supporting text and image input with 256K-token context windows. The “edge” tier consists of the E2B and E4B, compact models designed for phones, embedded devices, and laptops, supporting text, image, and audio with 128K-token context windows.
The naming convention takes some unpacking. The “E” prefix denotes “effective parameters” — the E2B has 2.3 billion effective parameters but 5.1 billion total, because each decoder layer carries its own small embedding table through a technique Google calls Per-Layer Embeddings (PLE). These tables are large on disk but cheap to compute, which is why the model runs like a 2B while technically weighing more.
The “A” in 26B A4B stands for “active parameters” — only 3.8 billion of the MoE model’s 25.2 billion total parameters activate during inference, meaning it delivers roughly 26B-class intelligence with compute costs comparable to a 4B model.
For IT leaders sizing GPU requirements, this translates directly to deployment flexibility. The MoE model can run on consumer-grade GPUs and should appear quickly in tools like Ollama and LM Studio. The 31B dense model requires more headroom — think an NVIDIA H100 or RTX 6000 Pro for unquantized inference — but Google is also shipping Quantization-Aware Training (QAT) checkpoints to maintain quality at lower precision. On Google Cloud, both workstation models can now run in a fully serverless configuration via Cloud Run with NVIDIA RTX Pro 6000 GPUs, spinning down to zero when idle.
The architectural choices inside the 26B A4B model deserve particular attention from teams evaluating inference economics. Rather than following the pattern of recent large MoE models that use a handful of big experts, Google went with 128 small experts, activating eight per token plus one shared always-on expert. The result is a model that benchmarks competitively with dense models in the 27B–31B range while running at roughly the speed of a 4B model during inference.
This is not just a benchmark curiosity — it directly affects serving costs. A model that delivers 27B-class reasoning at 4B-class throughput means fewer GPUs, lower latency, and cheaper per-token inference in production. For organizations running coding assistants, document processing pipelines, or multi-turn agentic workflows, the MoE variant may be the most practical choice in the family.
Both workstation models use a hybrid attention mechanism that interleaves local sliding window attention with full global attention, with the final layer always global. This design enables the 256K context window while keeping memory consumption manageable — an important consideration for teams processing long documents, codebases, or multi-turn agent conversations.
Previous generations of open models typically treated multimodality as an add-on. Vision encoders were bolted onto text backbones. Audio required an external ASR pipeline like Whisper. Function calling relied on prompt engineering and hoping the model cooperated. Gemma 4 integrates all of these capabilities at the architecture level.
All four models handle variable aspect-ratio image input with configurable visual token budgets — a meaningful improvement over Gemma 3n’s older vision encoder, which struggled with OCR and document understanding. The new encoder supports budgets from 70 to 1,120 tokens per image, letting developers trade off detail against compute depending on the task.
Lower budgets work for classification and captioning; higher budgets handle OCR, document parsing, and fine-grained visual analysis. Multi-image and video input (processed as frame sequences) are supported natively, enabling visual reasoning across multiple documents or screenshots.
The two edge models add native audio processing — automatic speech recognition and speech-to-translated-text, all on-device. The audio encoder has been compressed to 305 million parameters, down from 681 million in Gemma 3n, while the frame duration dropped from 160ms to 40ms for more responsive transcription. For teams building voice-first applications that need to keep data local — think healthcare, field service, or multilingual customer interaction — running ASR, translation, reasoning, and function calling in a single model on a phone or edge device is a genuine architectural simplification.
Function calling is also native across all four models, drawing on research from Google’s FunctionGemma release late last year. Unlike previous approaches that relied on instruction-following to coax models into structured tool use, Gemma 4’s function calling was trained into the model from the ground up — optimized for multi-turn agentic flows with multiple tools. This shows up in agentic benchmarks, but more importantly, it reduces the prompt engineering overhead that enterprise teams typically invest when building tool-using agents.
The benchmark numbers tell a clear story of generational improvement. The 31B dense model scores 89.2% on AIME 2026 (a rigorous mathematical reasoning test), 80.0% on LiveCodeBench v6, and hits a Codeforces ELO of 2,150 — numbers that would have been frontier-class from proprietary models not long ago. On vision, MMMU Pro reaches 76.9% and MATH-Vision hits 85.6%.
For comparison, Gemma 3 27B scored 20.8% on AIME and 29.1% on LiveCodeBench without thinking mode.
The MoE model tracks closely: 88.3% on AIME 2026, 77.1% on LiveCodeBench, and 82.3% on GPQA Diamond — a graduate-level science reasoning benchmark. The performance gap between the MoE and dense variants is modest given the significant inference cost advantage of the MoE architecture.
The edge models punch above their weight class. The E4B hits 42.5% on AIME 2026 and 52.0% on LiveCodeBench — strong for a model that runs on a T4 GPU. The E2B, smaller still, manages 37.5% and 44.0% respectively. Both significantly outperform Gemma 3 27B (without thinking) on most benchmarks despite being a fraction of the size, thanks to the built-in reasoning capability.
These numbers need to be read against an increasingly competitive open-weight landscape. Qwen 3.5, GLM-5, and Kimi K2.5 all compete aggressively in this parameter range, and the field moves fast. What distinguishes Gemma 4 is less any single benchmark and more the combination: strong reasoning, native multimodality across text, vision, and audio, function calling, 256K context, and a genuinely permissive license — all in a single model family with deployment options from edge devices to cloud serverless.
Google is releasing both pre-trained base models and instruction-tuned variants, which matters for organizations planning to fine-tune for specific domains. The Gemma base models have historically been strong foundations for custom training, and the Apache 2.0 license now removes any ambiguity about whether fine-tuned derivatives can be deployed commercially.
The serverless deployment option via Cloud Run with GPU support is worth watching for teams that need inference capacity that scales to zero. Paying only for actual compute during inference — rather than maintaining always-on GPU instances — could meaningfully change the economics of deploying open models in production, particularly for internal tools and lower-traffic applications.
Google has hinted that this may not be the complete Gemma 4 family, with additional model sizes likely to follow. But the combination available today — workstation-class reasoning models and edge-class multimodal models, all under Apache 2.0, all drawing from Gemini 3 research — represents the most complete open model release Google has shipped. For enterprise teams that had been waiting for Google’s open models to compete on licensing terms as well as performance, the evaluation can finally begin without a call to legal first.
PlantWave converts electrical activity in plants into music, offering a new way to experience biological systems through technology.
Microsoft launches three in-house MAI models for transcription, voice and image generation through Foundry, hedging its reliance on OpenAI.
Artificial intelligence, generative AI, and agentic systems are rapidly reshaping enterprise technology, and nowhere is this more evident than in the software development lifecycle (SDLC).
Latest version of the Firewalla app supports AmneziaWG VPN Server, a breakthrough VPN protocol that bypasses blocked networks to keep devices safe and connected.
The latest smart glasses from Meta and Ray-Ban offer options for people who need prescription lenses, more options and different prices.