Overview

GLM-4.5V is Z.AI’s new-generation visual reasoning model built on a Mixture-of-Experts (MoE) architecture. With 106B total parameters and 12B active parameters, it achieves SOTA performance among open-source VLMs of comparable scale across a wide range of benchmarks, covering common tasks such as image, video, and document understanding as well as GUI tasks.
Price: Input $0.6 per million tokens; Output $1.8 per million tokens
Input Modality: Video, Image, Text, File
Output Modality: Text
Maximum Output Tokens: 64K
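
As a worked example of the pricing, a request that consumes 100,000 input tokens and returns 10,000 output tokens costs 0.1 × $0.6 + 0.01 × $1.8 = $0.078.

Key capabilities: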
  1. Web Page Coding: Analyzes webpage screenshots or screen-recording videos, understands layout and interaction logic, and generates complete, usable webpage code in one step.
  2. Grounding: Precisely identifies and localizes target objects, suitable for practical scenarios such as security checks, quality inspection, content review, and remote-sensing monitoring.
  3. GUI Agent: Recognizes and processes screen images and supports commands such as clicking and swiping, providing reliable support for agents that carry out on-screen tasks.
  4. Complex Long Document Interpretation: Analyzes complex documents spanning dozens of pages in depth; supports summarization, translation, and chart extraction, and can offer insights based on the content.
  5. Image Recognition and Reasoning: Strong reasoning ability and rich world knowledge make it capable of inferring the background of an image without resorting to search (see the request sketch after this list).
  6. Video Understanding: Parses long video content and accurately infers the time, characters, events, and logical relationships within the video.
  7. Subject Problem Solving: Solves complex problems that combine text and images, suitable for problem solving and explanation in K-12 educational scenarios.
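
As an illustration of item 5, the sketch below sends a single image with a free-form question. It is a minimal example that reuses the request schema from the Sample Code at the end of this page; the image URL and prompt are placeholders of my own, not part of the official documentation.

# Minimal image-reasoning request (item 5). Reuses the schema from the
# Sample Code section below; image URL and prompt are placeholders.
curl --location 'https://api.z.ai/api/paas/v4/chat/completions' \
--header 'Authorization: Bearer YOUR_API_KEY' \
--header 'Content-Type: application/json' \
--data '{
  "model": "glm-4.5v",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        {"type": "text", "text": "Where was this photo most likely taken, and what clues support that?"}
      ]
    }
  ]
}'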

Detailed Description

1. Open-Source Multimodal SOTA

GLM-4.5V is built on Z.AI’s flagship GLM-4.5-Air and continues the iterative upgrade of the GLM-4.1V-Thinking technology route. It achieves performance on par with open-source SOTA models of the same scale across 42 public visual multimodal benchmarks, covering common tasks such as image, video, and document understanding as well as GUI tasks.

2. Supports Both Thinking and Non-Thinking Modes

GLM-4.5V introduces a new “Thinking Mode” switch that lets users choose between quick responses and deep reasoning, balancing processing speed and output quality according to the task at hand.
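
The switch is controlled through the `thinking` field of the request body. The Sample Code at the end of this page uses `"type": "enabled"`; the sketch below assumes `"disabled"` is the counterpart value for quick responses, which is worth confirming against the API reference.

# Quick-response request with the thinking switch turned off. The
# "disabled" value is an assumption mirroring the "enabled" value used
# in the Sample Code below; confirm against the API reference.
curl --location 'https://api.z.ai/api/paas/v4/chat/completions' \
--header 'Authorization: Bearer YOUR_API_KEY' \
--header 'Content-Type: application/json' \
--data '{
  "model": "glm-4.5v",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "In one sentence, what is a Mixture-of-Experts model?"}
      ]
    }
  ],
  "thinking": {"type": "disabled"}
}'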

Examples

Prompt

Please generate a high-quality UI interface using CSS and HTML based on the webpage I provided.

Display

Screenshot of the rendered web page:

Sample Code

curl --location 'https://api.z.ai/api/paas/v4/chat/completions' \
--header 'Authorization: Bearer YOUR_API_KEY' \
--header 'Accept-Language: en-US,en' \
--header 'Content-Type: application/json' \
--data '{
  "model": "glm-4.5v",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://cloudcovert-1305175928.cos.ap-guangzhou.myqcloud.com/%E5%9B%BE%E7%89%87grounding.PNG"
          }
        },
        {
          "type": "text",
          "text": "Where is the second bottle of beer from the right on the table? Provide coordinates in [[xmin,ymin,xmax,ymax]] format"
        }
      ]
    }
  ],
  "thinking": {
    "type": "enabled"
  }
}'
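
To work with the reply programmatically, the response can be piped through jq. This is a sketch that assumes the API returns an OpenAI-style payload with the reply under choices[0].message.content; verify the schema against the API reference.

# Extract only the reply text. Assumes an OpenAI-style response schema
# (choices[0].message.content); requires jq.
REQUEST_BODY='{"model": "glm-4.5v", "messages": [{"role": "user", "content": [{"type": "text", "text": "Hello"}]}]}'
curl --silent --location 'https://api.z.ai/api/paas/v4/chat/completions' \
  --header 'Authorization: Bearer YOUR_API_KEY' \
  --header 'Content-Type: application/json' \
  --data "$REQUEST_BODY" | jq -r '.choices[0].message.content'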