This guide explains how to migrate your calls from GLM-4.5 or other earlier models to Z.AI GLM-4.6, our most powerful coding model to date, covering sampling parameter differences, streaming tool calls, and other key points.

GLM-4.6 Features

  • Support for larger context and output: maximum context of 200K tokens, maximum output of 128K tokens.
  • New support for streaming output during tool calls (tool_stream=true), so tool call arguments can be retrieved in real time.
  • Like the GLM-4.5 series, supports deep thinking (thinking={ type: "enabled" }).
  • Superior code performance and advanced reasoning capabilities.

Migration Checklist

  • Update model identifier to glm-4.6
  • Sampling parameters: temperature defaults to 1.0 and top_p defaults to 0.95; tune only one of the two
  • Deep thinking: Enable or disable thinking={ type: "enabled" } as needed for complex reasoning/coding
  • Streaming response: Enable stream=true and properly handle delta.reasoning_content and delta.content
  • Streaming tool calls: Enable both stream=true and tool_stream=true, and concatenate delta.tool_calls[*].function.arguments across chunks
  • Maximum output and context: Set max_tokens appropriately (GLM-4.6 maximum output 128K, context 200K); see the sketch after this list
  • Prompt optimization: When using deep thinking, give clearer instructions and constraints
  • Development environment verification: Run use-case and regression tests, focusing on output randomness, latency, and parameter completeness in streamed tool calls
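A minimal sketch for the max_tokens item above, reusing the client initialized in the steps below; the 4096 cap is only an illustrative value, not a recommendation from this guide:

# Cap output length explicitly; GLM-4.6 supports up to 128K output tokens,
# but a smaller cap keeps latency and cost predictable for routine calls.
resp = client.chat.completions.create(
    model="glm-4.6",
    messages=[{"role": "user", "content": "Summarize the migration checklist above"}],
    max_tokens=4096
)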

Start Migration

1. Update Model Identifier

  • Update model to glm-4.6.
resp = client.chat.completions.create(
    model="glm-4.6",
    messages=[{"role": "user", "content": "Briefly describe the advantages of GLM-4.6"}]
)
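The snippets in this guide assume an already-initialized client. A minimal setup sketch, assuming the zhipuai Python SDK; if you call Z.AI through another OpenAI-compatible SDK, substitute its client and base URL:

from zhipuai import ZhipuAI  # assumption: zhipuai SDK; any OpenAI-compatible client exposes the same chat.completions.create interface

client = ZhipuAI(api_key="your-api-key")  # replace with your actual API key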

2. Update Sampling Parameters

  • temperature: Controls randomness; higher values are more divergent, lower values are more stable.
  • top_p: Controls nucleus sampling; higher values expand candidate set, lower values converge candidate set.
  • temperature defaults to 1.0 and top_p defaults to 0.95; adjusting both at the same time is not recommended.
# Plan A: Use temperature (recommended)
resp = client.chat.completions.create(
    model="glm-4.6",
    messages=[{"role": "user", "content": "Write a more creative brand introduction"}],
    temperature=1.0
)

# Plan B: Use top_p
resp = client.chat.completions.create(
    model="glm-4.6",
    messages=[{"role": "user", "content": "Generate more stable technical documentation"}],
    top_p=0.8
)

3. Deep Thinking (Optional)

  • GLM-4.6 continues to support deep thinking, which is enabled by default.
  • Keep it enabled for complex reasoning and coding tasks:
resp = client.chat.completions.create(
    model="glm-4.6",
    messages=[{"role": "user", "content": "Design a three-tier microservice architecture for me"}],
    thinking={"type": "enabled"}
)
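For simple or latency-sensitive requests you can turn thinking off. A minimal sketch, assuming the thinking parameter accepts "disabled" as it does on the GLM-4.5 series:

resp = client.chat.completions.create(
    model="glm-4.6",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    thinking={"type": "disabled"}  # skip the reasoning phase for a faster, cheaper response
)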

4. Streaming Output and Tool Calls (Optional)

  • Only GLM-4.6 supports real-time streaming construction and output of tool calls; this is disabled by default (False) and requires enabling both:
    • stream=True: Enable streaming output for responses
    • tool_stream=True: Enable streaming output for tool call parameters
response = client.chat.completions.create(
    model="glm-4.6",
    messages=[{"role": "user", "content": "How's the weather in Beijing"}],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get current weather conditions for a specified location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string", "description": "City, eg: Beijing, Shanghai"},
                        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                    },
                    "required": ["location"]
                }
            }
        }
    ],
    stream=True,
    tool_stream=True,
)

# Initialize streaming collection variables
reasoning_content = ""
content = ""
final_tool_calls = {}
reasoning_started = False
content_started = False

# Process streaming response
for chunk in response:
    if not chunk.choices:
        continue

    delta = chunk.choices[0].delta

    # Streaming reasoning process output
    if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
        if not reasoning_started and delta.reasoning_content.strip():
            print("\n🧠 Thinking Process:")
            reasoning_started = True
        reasoning_content += delta.reasoning_content
        print(delta.reasoning_content, end="", flush=True)

    # Streaming answer content output
    if hasattr(delta, 'content') and delta.content:
        if not content_started and delta.content.strip():
            print("\n\nπŸ’¬ Answer Content:")
            content_started = True
        content += delta.content
        print(delta.content, end="", flush=True)

    # Streaming tool call information (parameter concatenation)
    if delta.tool_calls:
        for tool_call in delta.tool_calls:
            idx = tool_call.index
            if idx not in final_tool_calls:
                final_tool_calls[idx] = tool_call
                final_tool_calls[idx].function.arguments = tool_call.function.arguments
            else:
                final_tool_calls[idx].function.arguments += tool_call.function.arguments

# Output final tool call information
if final_tool_calls:
    print("\nπŸ“‹ Function Calls Triggered:")
    for idx, tool_call in final_tool_calls.items():
        print(f"  {idx}: Function Name: {tool_call.function.name}, Parameters: {tool_call.function.arguments}")
See: Tool Streaming Output Documentation
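After the stream finishes, the collected calls can be executed and their results sent back for a second round. A minimal sketch, assuming an OpenAI-style tool message format and a hypothetical local get_weather implementation matching the schema declared above:

import json

def get_weather(location, unit="celsius"):
    # Hypothetical local implementation of the get_weather tool declared above
    return {"location": location, "temperature": 25, "unit": unit}

if final_tool_calls:
    messages = [{"role": "user", "content": "How's the weather in Beijing"}]
    # Echo the model's tool calls back as an assistant message
    messages.append({
        "role": "assistant",
        "content": content,
        "tool_calls": [
            {
                "id": tc.id,
                "type": "function",
                "function": {"name": tc.function.name, "arguments": tc.function.arguments},
            }
            for tc in final_tool_calls.values()
        ],
    })
    # Execute each tool locally and attach its result as a tool message
    for tc in final_tool_calls.values():
        args = json.loads(tc.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": tc.id,
            "content": json.dumps(get_weather(**args)),
        })
    # Second round: the model answers using the tool results
    second = client.chat.completions.create(model="glm-4.6", messages=messages)
    print(second.choices[0].message.content)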

5. Testing and Regression

First verify in your development environment that post-migration calls are stable (a small smoke-test sketch follows this list), focusing on:
  • Whether responses meet expectations, and whether output is overly random or overly conservative
  • Whether streaming tool call construction and output work as expected
  • Latency and cost in long context and deep thinking scenarios
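A minimal smoke-test sketch, reusing the client above; the prompts, token cap, and latency reporting are illustrative assumptions only:

import time

test_prompts = [
    "Briefly describe the advantages of GLM-4.6",
    "Write a Python function that reverses a string",
]

for prompt in test_prompts:
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="glm-4.6",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
        max_tokens=1024,
    )
    latency = time.perf_counter() - start
    answer = resp.choices[0].message.content
    # Basic checks: non-empty output, plus latency and length for manual review
    assert answer and answer.strip(), f"Empty response for: {prompt}"
    print(f"{latency:.1f}s | {prompt[:40]} -> {len(answer)} chars")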

More Resources
