Overview
GLM-5.1 is Z.AI’s latest flagship model, designed for long-horizon tasks. It can work continuously and autonomously on a single task for up to 8 hours, completing the full loop from planning and execution to iterative optimization and delivering production-grade results.

In both general capability and coding performance, GLM-5.1 is overall aligned with Claude Opus 4.6. It demonstrates stronger sustained execution in long-horizon autonomous tasks, complex engineering optimization, and real-world development workflows, making it an ideal foundation for building autonomous agents and long-horizon coding agents.
- Positioning: Flagship Foundation Model
- Input Modalities: Text
- Output Modalities: Text
- Context Length: 200K
- Maximum Output Tokens: 128K
Capability
- Thinking Mode: Offers multiple thinking modes for different scenarios
- Streaming Output: Supports real-time streaming responses to enhance the user interaction experience
- Function Call: Powerful tool invocation capabilities, enabling integration with various external toolsets
- Context Caching: Intelligent caching mechanism that optimizes performance in long conversations
- Structured Output: Supports structured output formats such as JSON, facilitating system integration
- MCP: Flexibly integrates external MCP tools and data sources to expand application scenarios
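As a minimal sketch of how the Function Call and Structured Output capabilities combine in a single request, assuming an OpenAI-compatible request schema (the model identifier `glm-5.1`, the `get_weather` tool, and the exact field names are assumptions to verify against the API documentation):

```python
import json

# Sketch of a chat request that declares one tool and asks for JSON output.
# "glm-5.1" and the get_weather tool are hypothetical placeholders.
request = {
    "model": "glm-5.1",
    "messages": [{"role": "user", "content": "What's the weather in Beijing?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool exposed to the model
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    # Structured Output: constrain the reply to valid JSON.
    "response_format": {"type": "json_object"},
}
print(json.dumps(request, indent=2))
```

Declaring the tool schema up front lets the model emit a structured tool call instead of free text, which your application can then execute and feed back as a tool message.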
Usage
Agentic Coding
Further optimized for agentic coding workflows such as Claude Code and OpenClaw, GLM-5.1 offers stronger long-horizon planning, stepwise execution, process adjustment, and result delivery. It performs significantly better on long-running development tasks and complex coding problems, making it well suited for real-world engineering work with multiple stages and strong interdependencies.
General Conversation
More robust in open-ended Q&A, complex instruction following, and multi-turn interactions, with richer responses, more complete content, stronger instruction adherence, and better long-context understanding. It is well suited for high-quality everyday assistance and complex information workflows.
Creative Writing
Further improved in literary expression, plot development, character portrayal, and style control, making it suitable for fiction excerpts, story concepts, and copywriting tasks that require strong expressiveness and consistency.
Artifacts / Front-End Development
Well suited for website generation, interactive pages, and front-end prototyping. Outputs show less templated structure, more diverse visual expression, and higher overall task completion quality, enabling a faster path from requirements to usable deliverables.
Office Productivity
Broadly improved across PowerPoint, Word, PDF, and Excel tasks, with stronger capabilities in complex content organization, layout design, and structured output. Default visual quality and overall polish are significantly improved, making it suitable for high-intensity production scenarios such as long-form documents, reports, teaching materials, and research papers.
Introducing GLM-5.1
General and Coding Capability: Aligned with the Global Frontier
GLM-5.1 ranks among the world’s top-tier models in both overall capability and coding performance, with overall performance aligned with Claude Opus 4.6 and leading results across multiple key benchmarks.
On SWE-Bench Pro, GLM-5.1 achieves a score of 58.4, outperforming GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro, setting a new state-of-the-art result. At the same time, across 12 representative benchmarks covering reasoning, coding, agents, tool use, and browsing, GLM-5.1 demonstrates a broad and well-balanced capability profile.
This shows that GLM-5.1 is not a single-metric improvement. Instead, it advances simultaneously across general intelligence, real-world coding, and complex task execution, making it a stronger foundation for general-purpose agent systems and engineering production scenarios.
Long-Horizon Task Capability: Toward 8-Hour Sustained Execution
GLM-5.1 shows especially strong gains on long-horizon tasks, with major improvements in sustained execution, closed-loop optimization, and engineering delivery under complex objectives. Compared with models primarily designed for minute-level interactions, GLM-5.1 can work autonomously on a single task for up to 8 hours, completing the full process from planning and execution to testing, fixing, and delivery.

Under the same evaluation standard, GLM-5.1 is one of the few models capable of 8-hour sustained execution, and the first Chinese model to reach this level. The way we evaluate model capability is shifting from “how smart it is in a single turn” to “how long it can work reliably on a long-horizon task, and what it can actually deliver.”

This capability is not simply about having a longer context window. It requires the model to maintain goal alignment over extended execution, reducing strategy drift, error accumulation, and ineffective trial and error, and enabling truly autonomous execution for complex engineering tasks.
Engineering Delivery: From Code Generation Toward Autonomous Agent
One of GLM-5.1’s key breakthroughs is its ability to form an autonomous “experiment–analyze–optimize” loop in long-horizon tasks, rather than stopping at one-shot code generation. The model can proactively run benchmarks, identify bottlenecks, adjust strategies, and continuously improve results through iterative refinement.

In representative cases, GLM-5.1 can build a complete Linux desktop system from scratch within 8 hours. It can autonomously carry out 655 iterations, completing the entire optimization pipeline and boosting vector database query throughput to 6.9× that of the initial production version. On the KernelBench Level 3 optimization benchmark, it performs thousands of tool-invocation-driven optimizations on real machine learning workloads, achieving a 3.6× geometric mean speedup—significantly surpassing the 1.49× achieved by torch.compile in max-autotune mode.

These results show that GLM-5.1 is already capable of autonomous exploration, continuous improvement, and stable delivery in complex engineering environments, enabling it to take on higher-value tasks such as system building, performance optimization, and long-horizon coding agents.
Resources
- API Documentation: Learn how to call the API.
Quick Start
The following sample code helps you get started with GLM-5.1 quickly.

- cURL
- Official Python SDK
- Official Java SDK
- OpenAI Python SDK
Basic Call

Streaming Call
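As a minimal sketch of the request shapes behind a basic call and a streaming call, here is a helper that builds the JSON body for both. The endpoint URL and the model identifier `glm-5.1` are assumptions; confirm them against the API documentation before use.

```python
import json

# Assumed endpoint; verify against the official API documentation.
API_URL = "https://api.z.ai/api/paas/v4/chat/completions"

def chat_request(prompt: str, stream: bool = False) -> dict:
    """Build the JSON body for a GLM-5.1 chat completion call.

    With stream=True the server returns incremental chunks
    (server-sent events) instead of a single response object.
    """
    return {
        "model": "glm-5.1",  # assumed model identifier
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

basic = chat_request("Hello, GLM-5.1!")
streaming = chat_request("Hello, GLM-5.1!", stream=True)
print(json.dumps(basic, indent=2))
```

The same body works with cURL, the official SDKs, or any OpenAI-compatible client; only the `stream` flag differs between the basic and streaming calls.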