Overview

The GLM-4.6V series is Z.ai’s latest iteration of multimodal large language models. GLM-4.6V scales its context window to 128K tokens during training and achieves SOTA performance in visual understanding among models of similar parameter scale. Crucially, GLM-4.6V integrates native Function Calling capabilities for the first time, effectively bridging the gap between “visual perception” and “executable action” and providing a unified technical foundation for multimodal agents in real-world business scenarios.
  • GLM-4.6V
  • GLM-4.6V-Flash

Positioning: For cloud and high-performance cluster scenarios
Input Modality: Video / Image / Text / File
Output Modality: Text
Context Length: 128K

Resources

Introducing GLM-4.6V

1. Native Multimodal Tool Use

Traditional tool use in LLMs often relies on pure text, requiring multiple intermediate conversions when dealing with images, videos, or complex documents, a process that introduces information loss and engineering complexity. GLM-4.6V is equipped with native multimodal tool calling capability:
  • Multimodal Input:  Images, screenshots, and document pages can be passed directly as tool parameters without being converted to text descriptions first, minimizing signal loss.
  • Multimodal Output:  The model can visually comprehend results returned by tools (such as search results, statistical charts, rendered web screenshots, or retrieved product images) and incorporate them into subsequent reasoning chains.
This native support allows GLM-4.6V to close the loop from perception to understanding to execution, enabling complex tasks like mixed text-image content creation and visual web searching.
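The request below is a minimal Python sketch of how an image input can be combined with a declared tool on the chat completions endpoint shown in the Quick Start. It assumes an OpenAI-compatible "tools" schema; the tool name, parameters, and image URL are hypothetical placeholders, not part of the official API reference.
# Minimal sketch (Python + requests): an image input plus a declared tool in one request.
# The tool definition below (search_product_images) is hypothetical and only illustrates the shape.
import requests

API_KEY = "your-api-key"
URL = "https://api.z.ai/api/paas/v4/chat/completions"

payload = {
    "model": "glm-4.6v",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/product-photo.jpg"}},
                {"type": "text", "text": "Find three similar products and compare them with the one in this photo."},
            ],
        }
    ],
    # Assumed OpenAI-compatible tool declaration; adapt to your application's own tools.
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "search_product_images",
                "description": "Search a catalog and return candidate product images.",
                "parameters": {
                    "type": "object",
                    "properties": {"query": {"type": "string", "description": "Search keywords"}},
                    "required": ["query"],
                },
            },
        }
    ],
}

resp = requests.post(
    URL,
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    json=payload,
)
resp.raise_for_status()
# If the model chooses to call the tool, the response message is expected to carry tool_calls;
# your code executes them and returns the (possibly image) results in a follow-up turn.
print(resp.json()["choices"][0]["message"])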
2. Capabilities & Scenarios

  • Intelligent Image-Text Content Creation & Layout
  • Visual Web Search & Rich Media Report Generation
  • Frontend Replication & Visual Interaction
  • Long-Context Understanding
GLM-4.6V can accept multimodal inputs—mixed text/image papers, reports, or slides—and automatically generate high-quality, structured image-text interleaved content.
  • Complex Document Understanding:  Accurately understands structured information from documents containing text, charts, figures, and formulas.
  • Visual Tool Retrieval:  During generation, the model can autonomously call search tools to find candidate images or crop key visuals from the source multimodal context.
  • Visual Audit & Layout:  The model performs a “visual audit” on candidate images to assess relevance and quality, filtering out noise to produce structured articles ready for social media or knowledge bases.
3. Overall Performance

We evaluated GLM-4.6V on over 20 mainstream multimodal benchmarks, including MMBench, MathVista, and OCRBench. The model achieves SOTA performance among open-source models of comparable scale in key capabilities such as multimodal interaction, logical reasoning, and long-context understanding.

Examples

  • Intelligent Image-Text Content Creation & Layout
  • Frontend Replication & Visual Interaction
  • Multi-Image Processing Agent
  • Full Object Detection and Recognition
Highlights Analysis
  • Supports native multimodality, enabling direct processing of documents containing visual elements such as images, tables, and curves. This eliminates the need for cumbersome and error-prone preprocessing steps such as OCR and document parsing.
  • In addition to text output, the model is capable of independent decision-making to locate the pages and regions where relevant content resides. It can also invoke tools via MCP for screenshot capture and embedding, generating well-illustrated reports.
  • Building on in-depth paper reading and information consolidation, the model applies its reasoning capabilities to offer its own “insights” on specific topics.

Prompt

Based on the key visualizations from the two papers, this research report delivers a comprehensive, illustration-integrated analysis of the benchmarks described in the literature.

Display

Think
Alright, here's the deal. The user wants me to act as a research report generator, analyzing two papers on AI-generated video benchmarks. I need to synthesize the information, focusing on the key visualizations and presenting it in a clear, integrated way. This is a common task, and I'm ready to dive in.
First, I need to understand the core of each paper. Paper 1, "AIGCBench," seems to be about evaluating Image-to-Video (I2V) generation. It's got a framework, datasets, and metrics. I'll need to dig into its structure, especially Figure 1, which shows the overall setup. Then, I'll look at the evaluation metrics, probably in Table 2. The visual comparisons of the generated videos in Figure 3 will be crucial for understanding the results. And, of course, the radar chart in Figure 4 will summarize the performance.
Next, Paper 2, "FETV," focuses on Text-to-Video (T2V) generation. It emphasizes fine-grained evaluation, categorizing prompts by content, attributes, and complexity. I'll need to understand the categorization framework, probably in Figure 1. The data distribution across these categories, shown in Figure 3 and 4, will be important. The radar charts in Figure 5 and 6 will show the model performance across different categories. And the leaderboards in Figure 7 will provide a quick overview of the rankings.
Now, the real work begins: synthesizing these two papers. I'll need to structure the report logically. I'll start with an introduction to set the stage. Then, I'll dive into the benchmarks themselves, highlighting their different focuses. I'll need to explain how AIGCBench is structured (Figure 1) and how FETV categorizes its prompts (Figure 1). Then, I'll compare the evaluation metrics, probably using Table 1 and Table 2. I'll need to discuss the visual results, using Figure 3 and Figures 5-7. Finally, I'll wrap it up with a conclusion, emphasizing the strengths and weaknesses of each benchmark.
I need to make sure I'm using the images effectively, referencing them correctly, and providing a clear, concise analysis. The user wants a "comprehensive, illustration-integrated analysis," so I need to deliver on that promise. Let's get started!
Answer
Of course. Here is a research report that provides a comprehensive, illustration-integrated analysis of the two benchmarks described in the literature.
Research Report: A Comparative Analysis of AIGCBench and FETV for Evaluating AI-Generated Video
This report analyzes two key research papers that introduce novel benchmarks for evaluating AI-generated video content. The first, AIGCBench, focuses on a comprehensive evaluation of Image-to-Video (I2V) generation, while the second, FETV, proposes a benchmark for the fine-grained evaluation of open-domain Text-to-Video (T2V) generation. By examining their methodologies, datasets, evaluation metrics, and results, we can understand the current state and challenges of automated video generation assessment.
1. AIGCBench: A Comprehensive Benchmark for Image-to-Video Generation
AIGCBench is designed to address the limitations of existing benchmarks by providing a unified and scalable framework for evaluating I2V models. Its primary goal is to offer a fair comparison of different algorithms under equivalent conditions.
Framework and Datasets:
The AIGCBench framework is divided into three core modules: datasets, video generation models, and evaluation metrics. As illustrated in Figure 1, the benchmark utilizes two types of datasets: real-world video-text pairs and a novel set of image-text pairs generated through a proprietary pipeline.
!<image_ref>[1, [[102, 75, 887, 296]], "Figure 1: Illustration of the AIGCBench framework."]</image_ref>
This pipeline, detailed in Figure 2, involves combining subjects, behaviors, backgrounds, and styles to create diverse prompts, which are then processed by GPT-4 for enhancement before being fed to a text-to-image model (Stable Diffusion) to generate the final image.
!<image_ref>[1, [[102, 71, 908, 466]], "Figure 2: Image-text dataset generation pipeline and results."]</image_ref>
Evaluation Metrics and Results:
AIGCBench evaluates I2V models across four critical dimensions:
Control-Video Alignment: How well the generated video matches the input image.
Motion Effects: The quality and plausibility of the motion.
Temporal Consistency: The coherence of the video across time.
Video Quality: The overall aesthetic and technical quality.
The benchmark evaluates five state-of-the-art models: VideoCrafter, I2VGen-XL, SVD, Pika, and Gen-2. The quantitative results, summarized in Table 2, show that Gen-2 performs best in video quality, while SVD and Pika show strong performance in control-video alignment and temporal consistency.
!<table_ref>[1, [[90, 75, 907, 235]], "Table 2: Quantitative analysis for different Image-to-Video algorithms."]</table_ref>
Qualitative comparisons in Figure 3 visually demonstrate the performance differences. For example, in the "knight running" prompt, Gen-2 produces a highly detailed and aesthetically pleasing video, whereas VideoCrafter and I2VGen-XL struggle to preserve the spatial structure of the original image.
!<image_ref>[1, [[92, 71, 910, 722]], "Figure 3: Qualitative comparison of five I2V algorithms on three different prompts."]</image_ref>
A user study with 42 participants was conducted to validate the metrics. The results, visualized in a radar chart in Figure 4, show that Pika excels in motion effects and temporal consistency, while Gen-2 is favored for video quality.
!<image_ref>[1, [[385, 75, 614, 266]], "Figure 4: Radar chart showing user preferences for different I2V algorithms."]</image_ref>
2. FETV: A Benchmark for Fine-Grained Text-to-Video Evaluation
FETV addresses the need for more nuanced evaluation of T2V models by introducing a multi-aspect categorization of prompts. Unlike benchmarks that provide only a single overall score, FETV breaks down evaluation into specific categories to reveal a model's strengths and weaknesses.
Multi-Aspect Categorization:
FETV's core innovation is its categorization of text prompts based on three orthogonal aspects, as shown in Figure 1:
Major Content: The primary subject (e.g., people, animals, scenery).
Attribute Control: Specific attributes the prompt aims to control (e.g., color, speed, motion direction).
Prompt Complexity: The complexity of the text prompt (simple, medium, complex).
This framework allows for a "fine-grained evaluation" that goes beyond a single score.
!<image_ref>[2, [[181, 71, 815, 314]], "Figure 1: Illustration of FETV's multi-aspect categorization."]</image_ref>
Evaluation and Results:
FETV evaluates four T2V models: CogVideo, Text2Video-zero, ModelScopeT2V, and ZeroScope. The evaluation is performed manually across four perspectives: static quality, temporal quality, overall alignment, and fine-grained alignment.
The results, presented in radar charts in Figure 5 and Figure 6, reveal significant variations in model performance across different categories. For instance, Figure 5 shows that ModelScopeT2V performs well on "static quality" for "people" and "scenery" but poorly on "artifacts" and "vehicles."
!<image_ref>[2, [[267, 71, 731, 472]], "Figure 5: Manual evaluation of static and temporal video quality across different major contents."]</image_ref>
Furthermore, Figure 6 demonstrates that models struggle with controlling specific attributes. For example, in the "fine-grained alignment" chart, all models show weakness in controlling "event order."
!<image_ref>[2, [[181, 525, 807, 722]], "Figure 6: Manual evaluation of video-text alignment across different attributes and complexities."]</image_ref>
The benchmark also includes leaderboards (Figure 7) that rank the models based on their overall performance in static quality, temporal quality, and alignment.
!<image_ref>[2, [[175, 71, 817, 196]], "Figure 7: Leaderboard on the FETV benchmark based on manual evaluation."]</image_ref>
3. Comparative Analysis and Conclusion
Both AIGCBench and FETV represent significant steps forward in the standardized evaluation of AI-generated video, but they target different tasks and evaluation philosophies.
Scope: AIGCBench focuses specifically on the I2V task, evaluating how well a model can animate a static image. FETV, in contrast, is designed for the more open-ended T2V task, where the model must generate a video from a text description.
Evaluation Granularity: FETV's primary strength is its "fine-grained" approach, using a detailed categorization system to pinpoint model weaknesses. AIGCBench provides a more holistic, four-dimensional score, offering a broader performance overview.
Methodology: AIGCBench combines both reference-dependent (video-dependent) and reference-free (video-free) metrics, along with human validation. FETV relies heavily on comprehensive manual evaluation to establish a ground truth, which it then uses to diagnose the reliability of automatic metrics.
In conclusion, these two benchmarks provide a more robust and multi-faceted way to assess AI video generation models. AIGCBench offers a comprehensive framework for fair I2V model comparison, while FETV pushes for a deeper, more nuanced understanding of T2V model capabilities. Together, they highlight the importance of diverse and detailed evaluation in driving progress in the field of AIGC.
Part 2. Image reference records
!<image_ref>[1, [[102, 75, 887, 296]], "Figure 1: Illustration of the AIGCBench framework."]</image_ref>
!<image_ref>[1, [[102, 71, 908, 466]], "Figure 2: Image-text dataset generation pipeline and results."]</image_ref>
!<image_ref>[1, [[92, 71, 910, 722]], "Figure 3: Qualitative comparison of five I2V algorithms on three different prompts."]</image_ref>
!<image_ref>[1, [[385, 75, 614, 266]], "Figure 4: Radar chart showing user preferences for different I2V algorithms."]</image_ref>
!<image_ref>[2, [[181, 71, 815, 314]], "Figure 1: Illustration of FETV's multi-aspect categorization."]</image_ref>
!<image_ref>[2, [[267, 71, 731, 472]], "Figure 5: Manual evaluation of static and temporal video quality across different major contents."]</image_ref>
!<image_ref>[2, [[181, 525, 807, 722]], "Figure 6: Manual evaluation of video-text alignment across different attributes and complexities."]</image_ref>
!<image_ref>[2, [[175, 71, 817, 196]], "Figure 7: Leaderboard on the FETV benchmark based on manual evaluation."]</image_ref>
Part 3. Table reference records
!<table_ref>[1, [[90, 75, 907, 235]], "Table 2: Quantitative analysis for different Image-to-Video algorithms."]</table_ref>
Next, I need to call the image_reference tool to get the URL of the relevant image.
Rendered Result:

Quick Start

  • cURL
  • Python
  • Java
Basic Call
curl -X POST \
    https://api.z.ai/api/paas/v4/chat/completions \
    -H "Authorization: Bearer your-api-key" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "glm-4.6v",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cloudcovert-1305175928.cos.ap-guangzhou.myqcloud.com/%E5%9B%BE%E7%89%87grounding.PNG"
                        }
                    },
                    {
                        "type": "text",
                        "text": "Where is the second bottle of beer from the right on the table?  Provide coordinates in [[xmin,ymin,xmax,ymax]] format"
                    }
                ]
            }
        ],
        "thinking": {
            "type":"enabled"
        }
    }'
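As a rough equivalent for the Python tab, the following is a minimal sketch of the same Basic Call using the requests library; the official Python SDK may expose a different interface, and reading the reply assumes an OpenAI-style response schema.
Basic Call (Python sketch)
# Minimal sketch using requests; the official SDK may differ.
import requests

API_KEY = "your-api-key"
URL = "https://api.z.ai/api/paas/v4/chat/completions"

payload = {
    "model": "glm-4.6v",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cloudcovert-1305175928.cos.ap-guangzhou.myqcloud.com/%E5%9B%BE%E7%89%87grounding.PNG"
                    },
                },
                {
                    "type": "text",
                    "text": "Where is the second bottle of beer from the right on the table? Provide coordinates in [[xmin,ymin,xmax,ymax]] format",
                },
            ],
        }
    ],
    "thinking": {"type": "enabled"},
}

response = requests.post(
    URL,
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    json=payload,
)
response.raise_for_status()
# Assumes an OpenAI-style response schema: choices[0].message.content holds the answer text.
print(response.json()["choices"][0]["message"]["content"])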

Streaming Call
curl -X POST \
    https://api.z.ai/api/paas/v4/chat/completions \
    -H "Authorization: Bearer your-api-key" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "glm-4.6v",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cloudcovert-1305175928.cos.ap-guangzhou.myqcloud.com/%E5%9B%BE%E7%89%87grounding.PNG"
                        }
                    },
                    {
                        "type": "text",
                        "text": "Where is the second bottle of beer from the right on the table?  Provide coordinates in [[xmin,ymin,xmax,ymax]] format"
                    }
                ]
            }
        ],
        "thinking": {
            "type":"enabled"
        },
        "stream": true
    }'
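A corresponding Python sketch of the Streaming Call, again using requests; it assumes the endpoint emits OpenAI-style server-sent events ("data: {...}" lines terminated by "data: [DONE]") with incremental text in choices[0].delta.
Streaming Call (Python sketch)
# Minimal streaming sketch; assumes OpenAI-compatible SSE chunks. The official SDK may differ.
import json
import requests

API_KEY = "your-api-key"
URL = "https://api.z.ai/api/paas/v4/chat/completions"

payload = {
    "model": "glm-4.6v",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cloudcovert-1305175928.cos.ap-guangzhou.myqcloud.com/%E5%9B%BE%E7%89%87grounding.PNG"
                    },
                },
                {
                    "type": "text",
                    "text": "Where is the second bottle of beer from the right on the table? Provide coordinates in [[xmin,ymin,xmax,ymax]] format",
                },
            ],
        }
    ],
    "thinking": {"type": "enabled"},
    "stream": True,
}

with requests.post(
    URL,
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    json=payload,
    stream=True,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        # Each SSE line is assumed to look like: data: {JSON chunk}
        if not line or not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        delta = json.loads(data)["choices"][0].get("delta", {})
        print(delta.get("content", ""), end="", flush=True)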