Overview
The GLM-4.6V series is Z.ai's newest iteration of multimodal large language models. GLM-4.6V scales its context window to 128K tokens during training and achieves SOTA performance in visual understanding among models of similar parameter scale. Crucially, GLM-4.6V integrates native Function Calling capabilities for the first time. This effectively bridges the gap between “visual perception” and “executable action,” providing a unified technical foundation for multimodal agents in real-world business scenarios.

- GLM-4.6V
- GLM-4.6V-Flash
- Positioning: For cloud and high-performance cluster scenarios
- Input Modality: Video / Image / Text / File
- Output Modality: Text
- Context Length: 128K
Resources
- API Documentation: Learn how to call the API.
Introducing GLM-4.6V
1. Native Multimodal Tool Use
Traditional tool use in LLMs often relies on pure text, requiring multiple intermediate conversions when dealing with images, videos, or complex documents, a process that introduces information loss and engineering complexity. GLM-4.6V is equipped with native multimodal tool-calling capability:
- Multimodal Input: Images, screenshots, and document pages can be passed directly as tool parameters without being converted to text descriptions first, minimizing signal loss.
- Multimodal Output: The model can visually comprehend results returned by tools, such as search results, statistical charts, rendered web screenshots, or retrieved product images, and incorporate them into subsequent reasoning chains, as sketched below.
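The following is a minimal sketch of this round trip, assuming an OpenAI-compatible chat completions endpoint at `https://api.z.ai/api/paas/v4/`, the model id `glm-4.6v`, and a hypothetical `search_product_images` tool; none of these names are confirmed by this page, so check the API Documentation for the authoritative values.

```python
# Hedged sketch: endpoint URL, model id, tool name, and image URL are
# assumptions for illustration, not the documented API surface.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.z.ai/api/paas/v4/",  # assumed OpenAI-compatible endpoint
)

tools = [{
    "type": "function",
    "function": {
        "name": "search_product_images",  # hypothetical tool
        "description": "Search the catalog and return candidate product image URLs.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user",
             "content": "Find an image of the red desk lamp in our catalog and describe it."}]

# First turn: the sketch assumes the model decides to call the tool.
first = client.chat.completions.create(model="glm-4.6v", messages=messages, tools=tools)
tool_call = first.choices[0].message.tool_calls[0]

# Execute the tool yourself, then hand the retrieved image back to the model.
# Because the model is natively multimodal, the image is passed directly
# rather than as a text description of its contents.
messages += [
    first.choices[0].message,
    {"role": "tool", "tool_call_id": tool_call.id,
     "content": "Found 1 candidate image; it is attached in the next message."},
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/candidate.jpg"}},
        {"type": "text", "text": "This is the image returned by the tool. Continue."},
    ]},
]

second = client.chat.completions.create(model="glm-4.6v", messages=messages, tools=tools)
print(second.choices[0].message.content)
```

The design point is that the tool result round-trips as pixels, so the model's reasoning happens on the actual image rather than on a lossy caption.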
2. Capabilities & Scenarios
- Intelligent Image-Text Content Creation & Layout
- Visual Web Search & Rich Media Report Generation
- Frontend Replication & Visual Interaction
- Long-Context Understanding
GLM-4.6V can accept multimodal inputs (mixed text/image papers, reports, or slides) and automatically generate high-quality, structured image-text interleaved content; a minimal input sketch follows the list below.
- Complex Document Understanding: Accurately understands structured information from documents containing text, charts, figures, and formulas.
- Visual Tool Retrieval: During generation, the model can autonomously call search tools to find candidate images or crop key visuals from the source multimodal context.
- Visual Audit & Layout: The model performs a “visual audit” on candidate images to assess relevance and quality, filtering out noise to produce structured articles ready for social media or knowledge bases.
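As a rough illustration of that workflow, the sketch below sends two report pages as images together with an instruction to produce an interleaved article. It reuses the assumed endpoint and model id from the previous sketch, and the page URLs and prompt wording are placeholders.

```python
# Hedged sketch: page URLs, prompt text, endpoint, and model id are
# illustrative placeholders rather than values documented on this page.
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.z.ai/api/paas/v4/")

page_urls = [
    "https://example.com/report/page-1.png",
    "https://example.com/report/page-2.png",
]

# Build one user turn that mixes the page images with the writing instruction.
content = [{"type": "image_url", "image_url": {"url": url}} for url in page_urls]
content.append({
    "type": "text",
    "text": ("Read these report pages, including their charts and formulas, and "
             "write a structured, image-text interleaved article. Mark where each "
             "source chart should be embedded and filter out irrelevant visuals."),
})

resp = client.chat.completions.create(
    model="glm-4.6v",
    messages=[{"role": "user", "content": content}],
)
print(resp.choices[0].message.content)
```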
3. Overall Performance
We evaluated GLM-4.6V on over 20 mainstream multimodal benchmarks, including MMBench, MathVista, and OCRBench. The model achieves SOTA performance among open-source models of comparable scale in key capabilities such as multimodal interaction, logical reasoning, and long-context understanding.

Examples
- Intelligent Image-Text Content Creation & Layout
- Frontend Replication & Visual Interaction
- Multi-Image Processing Agent
- Full Object Detection and Recognition
Highlights Analysis
- Supports native multimodality, enabling direct processing of documents that contain visual elements such as images, tables, and curves. This eliminates cumbersome and error-prone preprocessing steps such as OCR and document parsing.
- In addition to text output, the model can independently decide which pages and regions contain the relevant content. It can also invoke tools via MCP to capture and embed screenshots, generating well-illustrated reports (a hypothetical tool declaration is sketched after this list).
- Building on in-depth paper reading and consolidation of the extracted information, the model can reason about the material and express its own “insights” on specific topics.
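The screenshot-and-embed step could be exposed to the model as a callable tool, for example through an MCP server. The declaration below is purely hypothetical: the tool name `capture_page_region`, its parameters, and its coordinate convention are assumptions used to make the idea concrete, not the actual tool shipped with GLM-4.6V.

```python
# Hypothetical tool schema for a page-region screenshot tool; names and
# parameters are assumptions, not part of the documented GLM-4.6V toolset.
capture_page_region = {
    "type": "function",
    "function": {
        "name": "capture_page_region",  # assumed name
        "description": ("Render one page of the source document and crop the given "
                        "bounding box, returning an image that can be embedded in "
                        "the generated report."),
        "parameters": {
            "type": "object",
            "properties": {
                "page": {"type": "integer", "description": "1-based page index"},
                "bbox": {
                    "type": "array",
                    "items": {"type": "number"},
                    "minItems": 4,
                    "maxItems": 4,
                    "description": "Crop box as [x0, y0, x1, y1] in page pixel coordinates",
                },
            },
            "required": ["page", "bbox"],
        },
    },
}
```

Passed in the `tools` list of a request, a tool like this lets the model ground a claim to a specific figure, request the crop, and place the returned image into the report it is writing.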
Prompt
Based on the key visualizations from the two papers, produce a comprehensive, illustration-integrated research report analyzing the benchmark described in the literature.
Rendered Result:
Quick Start
- cURL
- Python
- Java
- Basic Call
- Streaming Call
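Below is a minimal Python sketch covering both a basic and a streaming call. It assumes the OpenAI-compatible endpoint `https://api.z.ai/api/paas/v4/` and the model id `glm-4.6v` used in the earlier sketches; these values are assumptions, so refer to the API Documentation above for the authoritative base URL, model name, and SDK.

```python
# Hedged quick-start sketch: base URL, model id, and image URL are assumptions.
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.z.ai/api/paas/v4/")

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
        {"type": "text", "text": "Summarize the trend shown in this chart."},
    ],
}]

# Basic call: one request, one complete response.
resp = client.chat.completions.create(model="glm-4.6v", messages=messages)
print(resp.choices[0].message.content)

# Streaming call: tokens are printed as they arrive.
stream = client.chat.completions.create(model="glm-4.6v", messages=messages, stream=True)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```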