Context caching significantly reduces token consumption and response latency by caching repeated context content. When your requests reuse the same system prompts or conversation history, the caching mechanism automatically identifies and reuses that content, improving performance and reducing costs.
Automatic Cache Recognition: Implicit caching that intelligently identifies repeated context content without manual configuration
Significant Cost Reduction: Cached tokens are billed at a lower rate, substantially reducing costs
Improved Response Speed: Reduces processing time for repeated content, accelerating model responses
Transparent Billing: The number of cached tokens is reported in the response field usage.prompt_tokens_details.cached_tokens (a minimal example of reading it follows this list)
Wide Compatibility: Available for all mainstream models, including GLM-4.6 and the GLM-4.5 series
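As a quick illustration, here is a minimal sketch of reading that field with the zai Python SDK used in the examples below (the prompt text is only illustrative):

```python
from zai import ZaiClient

client = ZaiClient(api_key='Your API key')

response = client.chat.completions.create(
    model="glm-4.6",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello"}
    ]
)

# usage.prompt_tokens_details.cached_tokens reports how many prompt tokens were
# served from cache; expect 0 on a cold request and a positive count once
# repeated context is reused.
details = getattr(response.usage, 'prompt_tokens_details', None)
print(f"Cached tokens: {details.cached_tokens if details else 0}")
```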
Context caching works by analyzing the input message content and identifying content that is identical or highly similar to previous requests. When repeated content is detected, the system reuses the previous computation results, avoiding redundant token processing.
This mechanism is particularly suitable for the following scenarios:
System prompt reuse: In multi-turn conversations, system prompts usually remain unchanged, and caching can significantly reduce token consumption for that portion of the input.
Repetitive tasks: For tasks that process similar content multiple times with consistent instructions, caching can improve efficiency.
Multi-turn conversation history: In complex conversations, historical messages often contain a large amount of repeated information, and caching can effectively reduce token usage for that repeated content.
```bash
# First request - establish cache
curl --location 'https://api.z.ai/api/paas/v4/chat/completions' \
--header 'Authorization: Bearer YOUR_API_KEY' \
--header 'Content-Type: application/json' \
--data '{
    "model": "glm-4.6",
    "messages": [
        {
            "role": "system",
            "content": "You are a professional data analyst, skilled at explaining data trends and providing business insights."
        },
        {
            "role": "user",
            "content": "How to analyze user retention rate?"
        }
    ]
}'
```
Cache Reuse Example
```bash
# Second request - reuse system prompt cache
curl --location 'https://api.z.ai/api/paas/v4/chat/completions' \
--header 'Authorization: Bearer YOUR_API_KEY' \
--header 'Content-Type: application/json' \
--data '{
    "model": "glm-4.6",
    "messages": [
        {
            "role": "system",
            "content": "You are a professional data analyst, skilled at explaining data trends and providing business insights."
        },
        {
            "role": "user",
            "content": "What is funnel analysis?"
        }
    ]
}'
```
```python
from zai import ZaiClient

# Initialize client
client = ZaiClient(api_key='Your API key')

# First request - establish cache
response1 = client.chat.completions.create(
    model="glm-4.6",
    messages=[
        {
            "role": "system",
            "content": "You are a professional technical documentation assistant, skilled at explaining complex technical concepts. Please answer user questions with clear and concise language, and provide practical code examples."
        },
        {
            "role": "user",
            "content": "What is RESTful API?"
        }
    ]
)

print("First request result:")
print(f"Reply: {response1.choices[0].message.content}")
print(f"Total tokens: {response1.usage.total_tokens}")
print(f"Cached tokens: {response1.usage.prompt_tokens_details.cached_tokens if hasattr(response1.usage, 'prompt_tokens_details') else 0}")

# Second request - reuse system prompt cache
response2 = client.chat.completions.create(
    model="glm-4.6",
    messages=[
        {
            "role": "system",
            "content": "You are a professional technical documentation assistant, skilled at explaining complex technical concepts. Please answer user questions with clear and concise language, and provide practical code examples."  # Same system prompt
        },
        {
            "role": "user",
            "content": "What are the differences between GraphQL and RESTful API?"
        }
    ]
)

print("\nSecond request result:")
print(f"Reply: {response2.choices[0].message.content}")
print(f"Total tokens: {response2.usage.total_tokens}")
print(f"Cached tokens: {response2.usage.prompt_tokens_details.cached_tokens if hasattr(response2.usage, 'prompt_tokens_details') else 0}")
```
Long Document Analysis Example
```python
from zai import ZaiClient

# Initialize client
client = ZaiClient(api_key='Your API key')

# Long document content (simulated)
long_document = """This is a detailed technical specification document that includes system architecture, API design, database structure, and many other aspects.
The document is very long and contains a lot of technical details and implementation instructions...
[Large amount of document content omitted here]
"""

# First analysis - establish document cache
response1 = client.chat.completions.create(
    model="glm-4.6",
    messages=[
        {
            "role": "system",
            "content": f"Please answer user questions based on the following technical document:\n\n{long_document}"
        },
        {
            "role": "user",
            "content": "What is the main architecture of this system?"
        }
    ]
)

print("First analysis:")
print(f"Total tokens: {response1.usage.total_tokens}")
print(f"Cached tokens: {response1.usage.prompt_tokens_details.cached_tokens if hasattr(response1.usage, 'prompt_tokens_details') else 0}")

# Second analysis - reuse document cache
response2 = client.chat.completions.create(
    model="glm-4.6",
    messages=[
        {
            "role": "system",
            "content": f"Please answer user questions based on the following technical document:\n\n{long_document}"  # Same document content
        },
        {
            "role": "user",
            "content": "What are the characteristics of the API design?"
        }
    ]
)

print("\nSecond analysis:")
print(f"Total tokens: {response2.usage.total_tokens}")
print(f"Cached tokens: {response2.usage.prompt_tokens_details.cached_tokens if hasattr(response2.usage, 'prompt_tokens_details') else 0}")
print(f"Cache savings: {response2.usage.prompt_tokens_details.cached_tokens / response2.usage.total_tokens * 100:.1f}%")
```
Multi-turn Conversation Caching Example
```python
from zai import ZaiClient

# Initialize client
client = ZaiClient(api_key='Your API key')

# Build conversation history
conversation_history = [
    {"role": "system", "content": "You are a Python programming assistant, helping users solve programming problems."},
    {"role": "user", "content": "How to create a simple Flask application?"},
    {"role": "assistant", "content": "Creating a Flask application is simple, first install Flask..."},
    {"role": "user", "content": "How to add routes?"},
    {"role": "assistant", "content": "In Flask, add routes using the @app.route decorator..."},
]

# Continue conversation - reuse conversation history cache
response = client.chat.completions.create(
    model="glm-4.6",
    messages=conversation_history + [
        {"role": "user", "content": "How to handle POST requests?"}
    ]
)

print("Conversation reply:")
print(f"Content: {response.choices[0].message.content}")
print(f"Total tokens: {response.usage.total_tokens}")
print(f"Cached tokens: {response.usage.prompt_tokens_details.cached_tokens if hasattr(response.usage, 'prompt_tokens_details') else 0}")

# Calculate cache efficiency
if hasattr(response.usage, 'prompt_tokens_details') and response.usage.prompt_tokens_details.cached_tokens:
    cache_ratio = response.usage.prompt_tokens_details.cached_tokens / response.usage.prompt_tokens * 100
    print(f"Cache hit rate: {cache_ratio:.1f}%")
```
Batch Processing Optimization Example
```python
from zai import ZaiClient
import time

# Initialize client
client = ZaiClient(api_key='Your API key')

# Common system prompt
system_prompt = """You are a professional code review assistant. Please analyze the provided code from the following aspects:
1. Code quality and readability
2. Performance optimization suggestions
3. Security considerations
4. Best practice recommendations
Please provide specific improvement suggestions."""

# List of code snippets to review
code_snippets = [
    "def calculate_sum(numbers): return sum(numbers)",
    "class User: def __init__(self, name): self.name = name",
    "for i in range(len(items)): print(items[i])",
    "if user_input == 'yes' or user_input == 'y': return True"
]

results = []
total_cached_tokens = 0

for i, code in enumerate(code_snippets):
    start_time = time.time()

    response = client.chat.completions.create(
        model="glm-4.6",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Please review the following code:\n```python\n{code}\n```"}
        ]
    )

    end_time = time.time()

    # Count cache effects
    cached_tokens = 0
    if hasattr(response.usage, 'prompt_tokens_details') and response.usage.prompt_tokens_details.cached_tokens:
        cached_tokens = response.usage.prompt_tokens_details.cached_tokens
        total_cached_tokens += cached_tokens

    results.append({
        'code': code,
        'review': response.choices[0].message.content,
        'total_tokens': response.usage.total_tokens,
        'cached_tokens': cached_tokens,
        'response_time': end_time - start_time
    })

    print(f"Code snippet {i+1} review completed:")
    print(f"  Response time: {end_time - start_time:.2f}s")
    print(f"  Cached tokens: {cached_tokens}")
    print(f"  Total tokens: {response.usage.total_tokens}")
    print()

print(f"Batch processing completed, total cached tokens: {total_cached_tokens}")
```
Use stable system prompts
```python
# Recommended: Use stable system prompts
system_prompt = """You are a professional technical consultant with the following characteristics:
- Deep technical background and rich project experience
- Able to provide accurate and practical technical advice
- Good at explaining complex concepts in clear and concise language
Please provide professional technical guidance based on user questions."""
```
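To confirm that a stable prompt is actually being reused, you can send several requests with the same system_prompt and compare the reported cached token counts. The sketch below assumes the system_prompt defined above; the user questions are only illustrative:

```python
from zai import ZaiClient

client = ZaiClient(api_key='Your API key')

# Illustrative follow-up questions; the user content can vary freely
questions = [
    "How should we approach a database migration?",
    "What should we consider when introducing a message queue?"
]

for question in questions:
    response = client.chat.completions.create(
        model="glm-4.6",
        messages=[
            {"role": "system", "content": system_prompt},  # Same stable prompt each time
            {"role": "user", "content": question}
        ]
    )
    details = getattr(response.usage, 'prompt_tokens_details', None)
    # After the first request, the stable system prompt should be served from cache
    print(f"Cached tokens: {details.cached_tokens if details else 0}")
```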
Use long documents as system messages
```python
from zai import ZaiClient

# Initialize client
client = ZaiClient(api_key='Your API key')

# Document to be analyzed (placeholder content)
document_content = "Full text of the document to be analyzed..."

# Recommended: Use long documents as system messages
def create_document_based_chat(document_content, user_question):
    return client.chat.completions.create(
        model="glm-4.6",
        messages=[
            {
                "role": "system",
                "content": f"Please answer user questions based on the following document content:\n\n{document_content}"
            },
            {
                "role": "user",
                "content": user_question
            }
        ]
    )

# Multiple calls with the same document, system prompts will be cached
questions = [
    "What is the main content of the document?",
    "What are the key points?",
    "How to implement these suggestions?"
]

for question in questions:
    response = create_document_based_chat(document_content, question)
    # Second and subsequent calls will hit the cache
```
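Keeping the long document in the system message means the bulk of each prompt is identical across questions, so on repeat calls most of the document's tokens can be served from cache while only the short, varying user question changes.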
Manage conversation history to improve cache efficiency
```python
from zai import ZaiClient

# Initialize client
client = ZaiClient(api_key='Your API key')


class ConversationManager:
    def __init__(self, client, system_prompt):
        self.client = client
        self.system_prompt = system_prompt
        self.history = [{"role": "system", "content": system_prompt}]

    def add_message(self, role, content):
        self.history.append({"role": role, "content": content})

    def get_response(self, user_message):
        # Add user message
        self.add_message("user", user_message)

        # Get reply (conversation history will be cached)
        response = self.client.chat.completions.create(
            model="glm-4.6",
            messages=self.history
        )

        # Add assistant reply to history
        assistant_message = response.choices[0].message.content
        self.add_message("assistant", assistant_message)

        return response

    def get_cache_stats(self, response):
        """Get cache statistics"""
        if hasattr(response.usage, 'prompt_tokens_details'):
            cached = response.usage.prompt_tokens_details.cached_tokens or 0
            total = response.usage.prompt_tokens
            return f"Cache hit: {cached}/{total} ({cached/total*100:.1f}%)"
        return "No cache information"


# Usage example
manager = ConversationManager(client, "You are a programming assistant...")
response1 = manager.get_response("How to learn Python?")
response2 = manager.get_response("Recommend some learning resources")  # Will reuse previous conversation cache
print(manager.get_cache_stats(response2))
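```

Because each turn only appends to the end of self.history, the messages from earlier turns are sent unchanged on every request, which is exactly the repeated content the cache can reuse; rewriting or reordering older turns will generally reduce how much of the history is served from cache.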