Context caching significantly reduces token consumption and response latency by caching repeated context content. When your requests reuse the same system prompts or conversation history, the caching mechanism automatically identifies and reuses that content, improving performance and reducing costs.
Automatic Cache Recognition: Implicit caching that intelligently identifies repeated context content without manual configuration
Significant Cost Reduction: Cached tokens are billed at a lower rate, substantially reducing costs
Improved Response Speed: Reduces processing time for repeated content, accelerating model responses
Transparent Billing: The number of cached tokens is reported in the response field usage.prompt_tokens_details.cached_tokens (a minimal example of reading it follows this list)
Wide Compatibility: Available for all mainstream models, including GLM-4.6 and the GLM-4.5 series
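As a quick illustration, here is a minimal sketch of reading that field with the zai Python SDK used in the examples below (the prompt text is only illustrative):

```python
from zai import ZaiClient

client = ZaiClient(api_key='Your API key')

response = client.chat.completions.create(
    model="glm-4.6",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello"}
    ]
)

# usage.prompt_tokens_details.cached_tokens reports how many prompt tokens were
# served from cache; expect 0 on a cold request and a positive count once
# repeated context is reused.
details = getattr(response.usage, 'prompt_tokens_details', None)
print(f"Cached tokens: {details.cached_tokens if details else 0}")
```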
Context caching works by analyzing the input message content and identifying content that is identical or highly similar to previous requests. When repeated content is detected, the system reuses the previous computation results, avoiding redundant token processing.
This mechanism is particularly suitable for the following scenarios:
System prompt reuse: In multi-turn conversations, system prompts usually remain unchanged, and caching can significantly reduce token consumption for that portion of the input.
Repetitive tasks: For tasks that process similar content multiple times with consistent instructions, caching can improve efficiency.
Multi-turn conversation history: In complex conversations, historical messages often contain a large amount of repeated information, and caching can effectively reduce token usage for that repeated content.
```bash
# First request - establish cache
curl --location 'https://api.z.ai/api/paas/v4/chat/completions' \
--header 'Authorization: Bearer YOUR_API_KEY' \
--header 'Content-Type: application/json' \
--data '{
    "model": "glm-4.6",
    "messages": [
        {
            "role": "system",
            "content": "You are a professional data analyst, skilled at explaining data trends and providing business insights."
        },
        {
            "role": "user",
            "content": "How to analyze user retention rate?"
        }
    ]
}'
```
Cache Reuse Example
```bash
# Second request - reuse system prompt cache
curl --location 'https://api.z.ai/api/paas/v4/chat/completions' \
--header 'Authorization: Bearer YOUR_API_KEY' \
--header 'Content-Type: application/json' \
--data '{
    "model": "glm-4.6",
    "messages": [
        {
            "role": "system",
            "content": "You are a professional data analyst, skilled at explaining data trends and providing business insights."
        },
        {
            "role": "user",
            "content": "What is funnel analysis?"
        }
    ]
}'
```
```python
from zai import ZaiClient

# Initialize client
client = ZaiClient(api_key='Your API key')

# First request - establish cache
response1 = client.chat.completions.create(
    model="glm-4.6",
    messages=[
        {
            "role": "system",
            "content": "You are a professional technical documentation assistant, skilled at explaining complex technical concepts. Please answer user questions with clear and concise language, and provide practical code examples."
        },
        {
            "role": "user",
            "content": "What is RESTful API?"
        }
    ]
)

print("First request result:")
print(f"Reply: {response1.choices[0].message.content}")
print(f"Total tokens: {response1.usage.total_tokens}")
print(f"Cached tokens: {response1.usage.prompt_tokens_details.cached_tokens if hasattr(response1.usage, 'prompt_tokens_details') else 0}")

# Second request - reuse system prompt cache
response2 = client.chat.completions.create(
    model="glm-4.6",
    messages=[
        {
            "role": "system",
            "content": "You are a professional technical documentation assistant, skilled at explaining complex technical concepts. Please answer user questions with clear and concise language, and provide practical code examples."  # Same system prompt
        },
        {
            "role": "user",
            "content": "What are the differences between GraphQL and RESTful API?"
        }
    ]
)

print("\nSecond request result:")
print(f"Reply: {response2.choices[0].message.content}")
print(f"Total tokens: {response2.usage.total_tokens}")
print(f"Cached tokens: {response2.usage.prompt_tokens_details.cached_tokens if hasattr(response2.usage, 'prompt_tokens_details') else 0}")
```
Long Document Analysis Example
```python
from zai import ZaiClient

# Initialize client
client = ZaiClient(api_key='Your API key')

# Long document content (simulated)
long_document = """This is a detailed technical specification document that includes system architecture, API design, database structure, and many other aspects.
The document is very long and contains a lot of technical details and implementation instructions...
[Large amount of document content omitted here]
"""

# First analysis - establish document cache
response1 = client.chat.completions.create(
    model="glm-4.6",
    messages=[
        {
            "role": "system",
            "content": f"Please answer user questions based on the following technical document:\n\n{long_document}"
        },
        {
            "role": "user",
            "content": "What is the main architecture of this system?"
        }
    ]
)

print("First analysis:")
print(f"Total tokens: {response1.usage.total_tokens}")
print(f"Cached tokens: {response1.usage.prompt_tokens_details.cached_tokens if hasattr(response1.usage, 'prompt_tokens_details') else 0}")

# Second analysis - reuse document cache
response2 = client.chat.completions.create(
    model="glm-4.6",
    messages=[
        {
            "role": "system",
            "content": f"Please answer user questions based on the following technical document:\n\n{long_document}"  # Same document content
        },
        {
            "role": "user",
            "content": "What are the characteristics of the API design?"
        }
    ]
)

print("\nSecond analysis:")
print(f"Total tokens: {response2.usage.total_tokens}")
print(f"Cached tokens: {response2.usage.prompt_tokens_details.cached_tokens if hasattr(response2.usage, 'prompt_tokens_details') else 0}")
print(f"Cache savings: {response2.usage.prompt_tokens_details.cached_tokens / response2.usage.total_tokens * 100:.1f}%")
```
Multi-turn Conversation Caching Example
```python
from zai import ZaiClient

# Initialize client
client = ZaiClient(api_key='Your API key')

# Build conversation history
conversation_history = [
    {"role": "system", "content": "You are a Python programming assistant, helping users solve programming problems."},
    {"role": "user", "content": "How to create a simple Flask application?"},
    {"role": "assistant", "content": "Creating a Flask application is simple, first install Flask..."},
    {"role": "user", "content": "How to add routes?"},
    {"role": "assistant", "content": "In Flask, add routes using the @app.route decorator..."},
]

# Continue conversation - reuse conversation history cache
response = client.chat.completions.create(
    model="glm-4.6",
    messages=conversation_history + [
        {"role": "user", "content": "How to handle POST requests?"}
    ]
)

print("Conversation reply:")
print(f"Content: {response.choices[0].message.content}")
print(f"Total tokens: {response.usage.total_tokens}")
print(f"Cached tokens: {response.usage.prompt_tokens_details.cached_tokens if hasattr(response.usage, 'prompt_tokens_details') else 0}")

# Calculate cache efficiency
if hasattr(response.usage, 'prompt_tokens_details') and response.usage.prompt_tokens_details.cached_tokens:
    cache_ratio = response.usage.prompt_tokens_details.cached_tokens / response.usage.prompt_tokens * 100
    print(f"Cache hit rate: {cache_ratio:.1f}%")
```
Batch Processing Optimization Example
```python
from zai import ZaiClient
import time

# Initialize client
client = ZaiClient(api_key='Your API key')

# Common system prompt
system_prompt = """You are a professional code review assistant. Please analyze the provided code from the following aspects:
1. Code quality and readability
2. Performance optimization suggestions
3. Security considerations
4. Best practice recommendations
Please provide specific improvement suggestions."""

# List of code snippets to review
code_snippets = [
    "def calculate_sum(numbers): return sum(numbers)",
    "class User: def __init__(self, name): self.name = name",
    "for i in range(len(items)): print(items[i])",
    "if user_input == 'yes' or user_input == 'y': return True"
]

results = []
total_cached_tokens = 0

for i, code in enumerate(code_snippets):
    start_time = time.time()

    response = client.chat.completions.create(
        model="glm-4.6",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Please review the following code:\n```python\n{code}\n```"}
        ]
    )

    end_time = time.time()

    # Count cache effects
    cached_tokens = 0
    if hasattr(response.usage, 'prompt_tokens_details') and response.usage.prompt_tokens_details.cached_tokens:
        cached_tokens = response.usage.prompt_tokens_details.cached_tokens
        total_cached_tokens += cached_tokens

    results.append({
        'code': code,
        'review': response.choices[0].message.content,
        'total_tokens': response.usage.total_tokens,
        'cached_tokens': cached_tokens,
        'response_time': end_time - start_time
    })

    print(f"Code snippet {i+1} review completed:")
    print(f"  Response time: {end_time - start_time:.2f}s")
    print(f"  Cached tokens: {cached_tokens}")
    print(f"  Total tokens: {response.usage.total_tokens}")
    print()

print(f"Batch processing completed, total cached tokens: {total_cached_tokens}")
```
Use stable system prompts
```python
# Recommended: Use stable system prompts
system_prompt = """You are a professional technical consultant with the following characteristics:
- Deep technical background and rich project experience
- Able to provide accurate and practical technical advice
- Good at explaining complex concepts in clear and concise language
Please provide professional technical guidance based on user questions."""
```
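To confirm that a stable prompt is actually being reused, you can send several requests with the same system_prompt and compare the reported cached token counts. The sketch below assumes the system_prompt defined above; the user questions are only illustrative:

```python
from zai import ZaiClient

client = ZaiClient(api_key='Your API key')

# Illustrative follow-up questions; the user content can vary freely
questions = [
    "How should we approach a database migration?",
    "What should we consider when introducing a message queue?"
]

for question in questions:
    response = client.chat.completions.create(
        model="glm-4.6",
        messages=[
            {"role": "system", "content": system_prompt},  # Same stable prompt each time
            {"role": "user", "content": question}
        ]
    )
    details = getattr(response.usage, 'prompt_tokens_details', None)
    # After the first request, the stable system prompt should be served from cache
    print(f"Cached tokens: {details.cached_tokens if details else 0}")
```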
Use long documents as system messages
```python
from zai import ZaiClient

# Initialize client
client = ZaiClient(api_key='Your API key')

# Document to be analyzed (placeholder content)
document_content = "Full text of the document to be analyzed..."

# Recommended: Use long documents as system messages
def create_document_based_chat(document_content, user_question):
    return client.chat.completions.create(
        model="glm-4.6",
        messages=[
            {
                "role": "system",
                "content": f"Please answer user questions based on the following document content:\n\n{document_content}"
            },
            {
                "role": "user",
                "content": user_question
            }
        ]
    )

# Multiple calls with the same document, system prompts will be cached
questions = [
    "What is the main content of the document?",
    "What are the key points?",
    "How to implement these suggestions?"
]

for question in questions:
    response = create_document_based_chat(document_content, question)
    # Second and subsequent calls will hit the cache
```
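Keeping the long document in the system message means the bulk of each prompt is identical across questions, so on repeat calls most of the document's tokens can be served from cache while only the short, varying user question changes.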
Manage conversation history to improve cache efficiency
```python
from zai import ZaiClient

# Initialize client
client = ZaiClient(api_key='Your API key')


class ConversationManager:
    def __init__(self, client, system_prompt):
        self.client = client
        self.system_prompt = system_prompt
        self.history = [{"role": "system", "content": system_prompt}]

    def add_message(self, role, content):
        self.history.append({"role": role, "content": content})

    def get_response(self, user_message):
        # Add user message
        self.add_message("user", user_message)

        # Get reply (conversation history will be cached)
        response = self.client.chat.completions.create(
            model="glm-4.6",
            messages=self.history
        )

        # Add assistant reply to history
        assistant_message = response.choices[0].message.content
        self.add_message("assistant", assistant_message)

        return response

    def get_cache_stats(self, response):
        """Get cache statistics"""
        if hasattr(response.usage, 'prompt_tokens_details'):
            cached = response.usage.prompt_tokens_details.cached_tokens or 0
            total = response.usage.prompt_tokens
            return f"Cache hit: {cached}/{total} ({cached/total*100:.1f}%)"
        return "No cache information"


# Usage example
manager = ConversationManager(client, "You are a programming assistant...")
response1 = manager.get_response("How to learn Python?")
response2 = manager.get_response("Recommend some learning resources")  # Will reuse previous conversation cache
print(manager.get_cache_stats(response2))
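```

Because each turn only appends to the end of self.history, the messages from earlier turns are sent unchanged on every request, which is exactly the repeated content the cache can reuse; rewriting or reordering older turns will generally reduce how much of the history is served from cache.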