Z.AI Introduces GLM-5.1: An Open-Weight 754B Agentic Model That Achieves SOTA on SWE-Bench Pro and Sustains 8-Hour Autonomous Execution

Z.AI, the AI platform developed by the team behind the GLM model family, has released GLM-5.1 — its next-generation flagship model developed specifically for agentic engineering. Unlike models optimized for clean, single-turn benchmarks, GLM-5.1 is built for agentic tasks, with significantly stronger coding capabilities than its predecessor, and achieves state-of-the-art performance on SWE-Bench Pro while leading GLM-5 by a wide margin on NL2Repo (repo generation) and Terminal-Bench 2.0 (real-world terminal tasks).

Architecture: DSA, MoE, and Asynchronous RL

Before diving into what GLM-5.1 can do, it’s worth understanding what it’s built on — because the architecture is meaningfully different from a standard dense transformer.

GLM-5 adopts DSA to significantly reduce training and inference costs while maintaining long-context fidelity. The model uses a glm_moe_dsa architecture (Mixture of Experts (MoE) model combined with DSA). For AI devs evaluating whether to self-host, this matters: MoE models activate only a subset of their parameters per forward pass, which can make inference significantly more efficient than a comparably-sized dense model, though they require specific serving infrastructure.

On the training side, GLM-5 implements a new asynchronous reinforcement learning infrastructure that drastically improves post-training efficiency by decoupling generation from training. Novel asynchronous agent RL algorithms further improve RL quality, enabling the model to learn from complex, long-horizon interactions more effectively. This is what allows the model to handle agentic tasks with the kind of sustained judgment that single-turn RL training struggles to produce.

The Plateau Problem GLM-5.1 is Solving

To understand what makes GLM-5.1 different at inference time, it helps to understand a specific failure mode in LLMs used as agents. Previous models — including GLM-5 — tend to exhaust their repertoire early: they apply familiar techniques for quick initial gains, then plateau. Giving them more time doesn’t help.

This is a structural limitation for any developer trying to use an LLM as a coding agent. The model applies the same playbook it knows, hits a wall, and stops making progress regardless of how long it runs. GLM-5.1, by contrast, is built to stay effective on agentic tasks over much longer horizons. The model handles ambiguous problems with better judgment and stays productive over longer sessions. It breaks complex problems down, runs experiments, reads results, and identifies blockers with real precision. By revisiting its reasoning and revising its strategy through repeated iteration, GLM-5.1 sustains optimization over hundreds of rounds and thousands of tool calls.

The sustained performance requires more than a larger context window. This capability requires the model to maintain goal alignment over extended execution, reducing strategy drift, error accumulation, and ineffective trial and error, enabling truly autonomous execution for complex engineering tasks.

Benchmarks: Where GLM-5.1 Stands

On SWE-Bench Pro, GLM-5.1 achieves a score of 58.4, outperforming GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro, setting a new state-of-the-art result.

The broader benchmark profile shows a well-rounded model. GLM-5.1 scores 95.3 on AIME 2026, 94.0 on HMMT Nov. 2025, 82.6 on HMMT Feb. 2026, and 86.2 on GPQA-Diamond — a graduate-level science reasoning benchmark. On agentic and tool-use benchmarks, GLM-5.1 scores 68.7 on CyberGym (a substantial jump from GLM-5’s 48.3), 68.0 on BrowseComp, 70.6 on τ³-Bench, and 71.8 on MCP-Atlas (Public Set) — the last one particularly relevant given MCP’s growing role in production agent systems. On Terminal-Bench 2.0, the model scores 63.5, rising to 66.5 when evaluated with Claude Code as the scaffolding.

Across 12 representative benchmarks covering reasoning, coding, agents, tool use, and browsing, GLM-5.1 demonstrates a broad and well-balanced capability profile. This shows that GLM-5.1 is not a single-metric improvement — it advances simultaneously across general intelligence, real-world coding, and complex task execution.

In terms of overall positioning, GLM-5.1’s general capability and coding performance are overall aligned with Claude Opus 4.6.

8-Hour Sustained Execution: What That Actually Means

The most important difference in GLM-5.1 is its capacity for long-horizon task execution. GLM-5.1 can work autonomously on a single task for up to 8 hours, completing the full process from planning and execution to testing, fixing, and delivery.

For developers building autonomous agents, this changes the scope of what’s possible. Rather than orchestrating a model over dozens of short-lived tool calls, you can hand GLM-5.1 a complex objective and let it run a complete ‘experiment–analyze–optimize’ loop autonomously.

The concrete engineering demonstrations make this tangible: GLM-5.1 can build a complete Linux desktop environment from scratch in 8 hours; perform 178 rounds of autonomous iteration on a vector database task and improve performance to 1.5× the initial version; and optimize a CUDA kernel, increasing speedup from 2.6× to 35.7× through sustained tuning.

That CUDA kernel result is notable for ML engineers: improving a kernel from 2.6× to 35.7× speedup through autonomous iterative optimization is a level of depth that would take a skilled human engineer significant time to replicate manually.

Model Specifications and Deployment

GLM-5.1 is a 754-billion-parameter MoE model released under the MIT license on HuggingFace. It operates with a 200K context window and supports up to 128K maximum output tokens — both important for long-horizon tasks that need to hold large codebases or extended reasoning chains in memory.

GLM-5.1 supports thinking mode (offering multiple thinking modes for different scenarios), streaming output, function calling, context caching, structured output, and MCP for integrating external tools and data sources.

For local deployment, the following open-source frameworks support GLM-5.1: SGLang (v0.5.10+), vLLM (v0.19.0+), xLLM (v0.8.0+), Transformers (v0.5.3+), and KTransformers (v0.5.3+).

For API access, the model is available through Z.AI’s API platform. Getting started requires installing zai-sdk via pip and initializing a ZaiClient with your API key. .

Key Takeaways

GLM-5.1 sets a new state-of-the-art on SWE-Bench Pro with a score of 58.4, outperforming GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro — making it one of the the strongest publicly benchmarked model for real-world software engineering tasks at the time of release.
The model is built for long-horizon autonomous execution, capable of working on a single complex task for up to 8 hours — running experiments, revising strategies, and iterating across hundreds of rounds and thousands of tool calls without human intervention.
GLM-5.1 uses a MoE + DSA architecture trained with asynchronous reinforcement learning, which reduces training and inference costs compared to dense transformers while maintaining long-context fidelity — a meaningful consideration for teams evaluating self-hosting.
It is open-weight under the MIT license (754B parameters, 200K context window, 128K max output tokens) and supports local deployment via SGLang, vLLM, xLLM, Transformers, and KTransformers, as well as API access through the Z.AI platform with OpenAI SDK compatibility.
GLM-5.1 goes beyond coding — it also shows strong improvements in front-end prototyping, artifacts generation, and office productivity tasks (Word, Excel, PowerPoint, PDF), positioning it as a general-purpose foundation for both agentic systems and high-quality content workflows.

Check out the Weights, API and Technical details. Also, feel free to follow us on Twitter and don’t forget to join our 120k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us

Source link

Z.AI Introduces GLM-5.1: An Open-Weight 754B Agentic Model That Achieves SOTA on SWE-Bench Pro and Sustains 8-Hour Autonomous Execution

Toward a future that preserves benefits of neurotechnology for all | MIT News

How America's 250th birthday became a test of AI-powered collective intelligence

Takeda signs US$600M AI drug discovery deal with Insilico

Mistral AI Releases Leanstral 1.5: An Apache-2.0 Lean 4 Code Agent Model Solving 587 of 672 PutnamBench Problems

MIT in the media: Innovating and educating for the next 250 years of America | MIT News

HP accelerates enterprise workflows with OpenAI Frontier

Moonbeam Pivots From Polkadot to Base to Build AI Agents

Vitalik Buterin Unveils New ‘Lean Ethereum” Strawmap

Bitcoin Bounces Above $63K Following Strategy-fueled Selloff

Trader Turns $2 Million of ETH Into $14,208 as Lighter Token Rallies 53%

What Does the Average Canadian’s TFSA Look Like at 55?

Top Insights

Bitcoin Shrugs Off Strategy FUD, Hits New 2-Week Peak in Early Signs of Structural Stabilization

Stock Indexes Settle Higher as Big Tech and Chip Stocks Rally

Z.AI Introduces GLM-5.1: An Open-Weight 754B Agentic Model That Achieves SOTA on SWE-Bench Pro and Sustains 8-Hour Autonomous Execution

Architecture: DSA, MoE, and Asynchronous RL

The Plateau Problem GLM-5.1 is Solving

Benchmarks: Where GLM-5.1 Stands

8-Hour Sustained Execution: What That Actually Means

Model Specifications and Deployment

Key Takeaways

Related Posts