Ever have a scary HR meeting on your calendar? That’s how the Artemis 3 crew found out their assignments

NASA took an unconventional approach to informing the astronauts of Artemis 3 about their crew assignments.

Kimi K2.7-Code cuts thinking tokens 30% — but practitioners say the benchmarks don’t check out

Moonshot AI released Kimi K2.7-Code this week, an open-source update to its K2 coding model family, claiming leaner reasoning and double-digit performance gains.

K2.7-Code is built on the same trillion-parameter mixture-of-experts architecture as its predecessor K2.6, and drops in via an OpenAI-compatible API — which matters for teams already running K2.6 in production gateways.

When K2.6 launched in April, it topped OpenRouter’s weekly LLM leaderboard — a ranking based on actual API routing decisions by developers, not self-reported benchmark scores.

Moonshot AI says K2.7-Code addresses what it calls “overthinking,” reducing thinking-token usage by 30% compared to K2.6 — a number that would directly affect inference costs for teams running agentic workflows. Whether that efficiency gain holds on independent benchmarks is a question practitioners have already started raising publicly.

What Kimi K2.7-Code is

K2.7-Code is released under a Modified MIT license, with weights available on HuggingFace. The model is deployable via vLLM or SGLang. It runs exclusively in thinking mode and does not support temperature adjustment — Moonshot AI has fixed it at 1.0, meaning teams cannot tune output determinism the way they might with other models.

The core change from K2.6 is how the model generates low-level code. Where K2.6 produced implementations by wrapping existing libraries and routing through established frameworks, K2.7-Code authors implementations directly. Moonshot AI says this produces more reliable generalization across Rust, Go and Python, and across task types including frontend development, DevOps and performance optimization.

On benchmark performance, Moonshot AI claims gains of 21.8% on Kimi Code Bench v2, 11% on Program Bench and 31.5% on MLS Bench Lite. All three are proprietary benchmarks run by Moonshot AI. The model has not been submitted to DeepSWE, an independent coding benchmark that produces a 70-point spread across models — compared to SWE-Bench Pro’s 30-point spread — making it a more discriminating signal for teams configuring model routing systems.

More honest, weaker for it

The picture from outside Moonshot’s own benchmarks is more complicated.

Researcher Elliot Arledge ran K2.7-Code against K2.6 and Claude Fable 5 on KernelBench-Hard, a public benchmark focused on GPU kernel optimization, and published his full run logs at kernelbench.com. 

“K2.7 is more honest but not more capable,” Arledge wrote on X

On five of six problems, K2.7-Code produced real authored Triton kernels where K2.6 had used library wrappers. Two of those kernels failed on the model’s own bugs. The MoE kernel result regressed from K2.6’s score of 0.222 to 0.157. 

“Fable, for reference, tops every cell it doesn’t honestly fail,” Arledge wrote.

Sugumaran Balasubramaniyan, a developer who built a model-task-router for the Hermes Agent platform using DeepSWE as his reference signal, responded publicly to the K2.7-Code release and challenged Moonshot AI directly on the benchmark choices.

 “Respectfully, every model ‘improves’ double digits on its own test suite,” Balasubramaniyan wrote on X

He noted that K2.6 scored 24% on DeepSWE, tied with GPT-5.4-mini, and asked whether Moonshot AI would submit K2.7-Code to the same benchmark.

Balasubramaniyan said it took 13 review rounds to get the benchmark data right for his router and that he would route coding tasks to K2.7-Code if the independent numbers hold up.

What this means for enterprises

The token efficiency gain is immediately usable. Teams running K2.6 in production can swap in K2.7-Code via the OpenAI-compatible API and expect lower inference costs on agentic workflows without an architecture change. The 30% thinking-token reduction is Moonshot’s own number, but the integration path is low-risk enough to test against your own workloads before committing.

The practical question is whether those efficiency gains hold on a team’s own task distribution. Running K2.7-Code against your own workloads before adjusting gateway weights is the low-risk path to finding out.

Google researchers introduce ‘faithful uncertainty,’ allowing LLMs to offer best guesses instead of hallucinations

Large language models continue to struggle with hallucinations, presenting a major roadblock for real-world enterprise applications. Reducing these errors is a messy business, forcing model developers to navigate a strict tradeoff where eliminating factual errors often suppresses valid answers.

In a new paper, Google researchers introduce the concept of “faithful uncertainty,” a metacognitive technique that aligns a model’s response with its internal confidence. This alignment allows the model to offer appropriately hedged hypotheses, such as “My best guess is,” instead of defaulting to an unhelpful “answer-or-abstain” binary.

In real-world agentic AI applications, this metacognitive awareness acts as an essential control layer. It empowers autonomous systems to accurately determine when their internal knowledge is sufficient and when they must dynamically trigger external tools or search APIs to resolve deficits.

The utility tax of current mitigation strategies

Understanding why LLMs hallucinate hinges on separating two capabilities: a model knowing facts versus knowing what is known. Historically, most factuality gains in AI have come from expanding the knowledge boundary, meaning developers simply pack more facts into the model’s parameters through larger scale and more training data.

However, expanding a model’s knowledge does not automatically improve its boundary awareness, which is its ability to distinguish the known from the unknown and recognize its own limitations.

“There are broadly two ways to improve LLM factuality,” Gal Yona, Research Scientist at Google and co-author of the paper, told VentureBeat. The first is continuing to teach the model more facts. But, Yona notes, “model capacity is finite, and the long tail of knowledge is effectively infinite.”

Once models hit this limit, the hope is they know what they don’t know and simply abstain from answering. However, this is inherently difficult for LLMs.

“This is why most practical attempts to reduce hallucinations through various interventions don’t actually make it to deployment,” Yona explains. “They do reduce hallucinations, but they also hurt utility, because the model ends up refusing to answer questions it actually does know.”

This inability to distinguish between knowns and unknowns creates what the paper’s authors call the “utility tax.” Enforcing a zero-hallucination standard requires the model to abstain whenever it is even slightly uncertain, discarding massive volumes of completely valid information. For example, the authors demonstrate that reducing an underlying 25% error rate down to a strict 5% target forces developers to discard 52% of the model’s correct answers.

Treating all errors as hallucinations forces enterprise systems to choose between trustworthiness and helpfulness. Application developers are generally unwilling to pay this massive utility tax and render their models unhelpful.

Consequently, they optimize systems to prioritize coverage, forcing models to operate in a state where they continue to generate confident hallucinations.

Reframing hallucinations as confident errors

To move past the utility tax, the researchers propose to stop treating any factual error as a hallucination. Instead, they reframe hallucinations as “confident errors”: incorrect information delivered authoritatively without appropriate qualification.

This subtle reframing dissolves the strict “answer-or-abstain” dichotomy and allows the model to express its uncertainty.

In this new framework, if a model makes a factual mistake but appropriately hedges its response (e.g., by stating, “I am not completely sure, but I think…”), it isn’t a hallucination. It is simply a hypothesis offered to the user for consideration. By expressing uncertainty, the AI preserves its utility—sharing whatever partial or likely knowledge it has—without violating the user’s trust.

However, if an AI assistant hedges all its responses with a disclaimer, the user is forced to double-check everything, defeating the purpose of the tool entirely.

The solution the researchers propose is “faithful uncertainty.” This approach requires aligning a model’s linguistic uncertainty, or the words it uses to express doubt, with its intrinsic uncertainty, which is its actual, internal statistical confidence in that specific answer. This ensures the model only hedges when its internal state genuinely reflects conflicting or low-probability information.

Faithful uncertainty forms a core component of “metacognition,” the AI’s ability to be aware of its own uncertainty and act on it. To understand this practically, consider the intuitive example of consulting a doctor. We do not trust doctors because they are all-knowing. We trust them because they reliably distinguish between a confident diagnosis (“You have a fracture”) and an educated hypothesis (“It might be a sprain, but let’s run some tests”).

Practical implications for enterprise AI

Under the new framing, errors where a model is genuinely confident but factually incorrect are categorized as “honest mistakes.” This casts knowledge expansion (training the model on more data) and faithful uncertainty as completely complementary efforts. Knowledge expansion pushes the absolute knowledge boundary outward to minimize honest mistakes, while faithful uncertainty honestly communicates wherever that boundary currently lies.

This new framing has important implications for agentic applications. The shift to agentic AI might make it seem like knowing what the model doesn’t know is redundant, since models can just search external databases. However, access to external tools actually amplifies the need for faithful uncertainty. In agentic systems, metacognition becomes the central control layer that governs the entire system.

External tools solve the storage problem because the model no longer needs to encode every fact into its parameters. However, this introduces a new control problem: managing when to retrieve information, verify facts, and orchestrate these external tools. Without faithful uncertainty, an agent is essentially flying blind and must rely on external, static heuristics or over-engineered scaffolds.

“The model might search for something it already knows confidently—wasting latency and cost for no gain. Or the opposite: it confidently answers from memory when it should have searched, producing a plausible but wrong output,” Yona said. Today’s agent harnesses try to solve this externally with query classifiers or always-search rules, but Yona notes that these are “static and brittle.” By using its intrinsic uncertainty to regulate its own behavior, the agent dynamically optimizes its tool use, choosing to invoke a search tool only when its internal confidence is genuinely low.

Beyond deciding when to search, faithful uncertainty is critical for evaluating the results of a search. If a tool returns low-quality or unexpected information, a metacognitive agent does not blindly accept whatever appears in its context window. Instead, it uses its uncertainty awareness to weigh the retrieved external signals against its own internal priors. This prevents sycophantic behavior where the system might otherwise trust external sources that conflict with its actual known knowledge.

The bootstrapping paradox: The catch to teaching uncertainty

For enterprise builders, achieving this faithful uncertainty is trickier than it sounds. It requires teaching models the syntax of uncertainty through supervised fine-tuning (SFT). Because pre-trained models are mostly fed authoritative text, they must be explicitly taught to say things like, “I’m not entirely sure, but I think VentureBeat was founded in…”

But SFT introduces a “bootstrapping paradox.” Unlike standard training datasets where the “right answer” is the same regardless of the model, the ground truth for uncertainty is the model’s own dynamic knowledge base.

“Here’s the catch: the ‘correct’ expression of uncertainty is inherently dynamic, because it depends on what this particular model knows or doesn’t know at this particular point in training,” Yona said. “If you train on a label that says ‘I don’t know X’ but the model actually does know X, you’ve taught it to hallucinate uncertainty… The training data is static, but the target is a moving one, and that’s the fundamental tension teams need to grapple with.”

The road to self-aware AI

For enterprises looking to implement these capabilities without expensive retraining, prompting serves as the most accessible entry point. “Prompt engineering is already something most engineers do today, this provides the lowest-friction path to improving metacognitive behavior today,” Yona said. Enterprise developers can explore frameworks like MetaFaith, an open-source project previously co-authored by Yona, to begin applying metacognitive prompting to off-the-shelf models.

However, Yona cautions that “there is still substantial headroom that prompting alone doesn’t solve,” meaning the industry will eventually need to rely on advanced reinforcement learning (RL) to bake metacognition deeply into model training.

Ultimately, as enterprises transition from isolated chat applications to complex, multi-agent workflows, self-awareness will become a defining prerequisite for reliable autonomy. But evaluating whether a model truly possesses this awareness remains a profound technical challenge.

“How do you actually evaluate whether a model can sense its internal states?” Yona asks. “Even in humans, it’s hard to define or separate ‘true’ self-monitoring abilities from a capable reliance on proxies. We face exactly the same challenges with LLMs: a model might learn to mimic the style of uncertainty without truly sensing its internal state. Developing evaluation frameworks that can tell the difference is one of the most important open problems in this space.”

‘Tell Him He’s a Piece of Shit’: Meta’s New AI Unit Is a Total Mess

Executives and employees alike are struggling with Meta’s chaotic AI strategy, according to sources and internal discussions reviewed by WIRED.

I tried Acer’s new 5K MiniLED Gaming monitor, and OLED kept popping into my head

Hands-on with the Acer Nitro XV345CKR P reveals a bold take on MiniLED gaming monitors, balancing 5K clarity, HDR brilliance, and versatility against OLED’s undeniable appeal.

How to Protect Your Email ROI in a Tight Economy (and the One Mistake You Need to Stop Making)

If you want email to keep driving revenue, you need to protect it like any other business asset. Here’s how.

‘I think that combination of interior and exterior will create these moments that really live long after the player puts down the controller’ — Alien: Isolation 2 is designed to ‘really stay’ with players and be ‘unforgettable’, says Creative Assembly

Creative Assembly has said it was “absolutely the intention” for Alien: Isolation to feel “too scary” for players, and the goal is to make the sequel just as memorable.

How to watch USA vs Paraguay: Free Streams & TV Channels online from anywhere as the co-hosts begin their World Cup adventure

Here’s how to watch USA vs Paraguay for free online and from anywhere as the FIFA World Cup 2026 co-hosts begin their Group D campaign.