Context caching reduces token consumption and response latency by caching repeated context content. When the same system prompts or conversation history recur across requests, the caching mechanism automatically identifies and reuses that content, improving performance and reducing costs.
Features
- Automatic Cache Recognition: Implicit caching that intelligently identifies repeated context content without manual configuration
- Significant Cost Reduction: Cached tokens are billed at lower prices, dramatically saving costs
- Improved Response Speed: Reduces processing time for repeated content, accelerating model responses
- Transparent Billing: Cached token counts are reported in the response field `usage.prompt_tokens_details.cached_tokens`
- Wide Compatibility: Supports all mainstream models, including GLM-4.6 and the GLM-4.5 series
Context caching works by analyzing input message content and identifying content that is identical or highly similar to previous requests. When repeated content is detected, the system reuses the earlier computation results, avoiding redundant token processing. This mechanism is particularly well suited to the following scenarios:
- System prompt reuse: In multi-turn conversations, system prompts usually remain unchanged, and caching can significantly reduce token consumption for this part.
- Repetitive tasks: For tasks that process similar content with consistent instructions multiple times, caching can improve efficiency.
- Multi-turn conversation history: In complex conversations, historical messages often contain a lot of repeated information, and caching can effectively reduce token usage for this part.
Code Examples
Examples can be run via cURL or the Python SDK; the sketches below use Python.

Basic Caching Example
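A minimal sketch of the basic case, assuming an OpenAI-compatible Python client; the base URL, API key placeholder, and client library here are assumptions rather than the provider's confirmed SDK, so consult the official reference for exact setup. The `usage.prompt_tokens_details.cached_tokens` field is the one described under Features above.

```python
# Minimal sketch: first request with a long, stable system prompt.
# ASSUMPTIONS: an OpenAI-compatible endpoint; the base_url and model
# name are placeholders -- substitute your provider's actual values.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",                 # placeholder credential
    base_url="https://api.example.com/v1",  # placeholder endpoint
)

# The stable system prompt is the content most likely to be cached.
SYSTEM_PROMPT = (
    "You are a customer-service assistant for an online bookstore. "
    "Answer politely, cite the relevant policy section, and escalate "
    "refund requests above $100 to a human agent."
)

response = client.chat.completions.create(
    model="glm-4.6",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "What is your refund policy?"},
    ],
)

# On a cold cache the cached count is typically 0.
details = response.usage.prompt_tokens_details
print("cached tokens:", details.cached_tokens if details else 0)
```

Cache Reuse Example

A second request that repeats the same system prompt verbatim; if the first call populated the cache, the reuse should show up in `cached_tokens` (exact thresholds and timing are provider-specific).

```python
# Same stable prefix, new question: the repeated content can now be
# served from cache instead of being reprocessed.
followup = client.chat.completions.create(
    model="glm-4.6",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},  # unchanged prefix
        {"role": "user", "content": "Do you ship internationally?"},
    ],
)

details = followup.usage.prompt_tokens_details
print("cached tokens on reuse:", details.cached_tokens if details else 0)
```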
Best Practices
- System Prompt Optimization: Use stable system prompts that stay identical across requests (a sketch follows this list)
- Document Content Reuse: Send long reference documents in the same form each time so repeated content can be recognized
- Conversation History Management: Append new turns to the end of the conversation rather than rewriting earlier messages
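Implicit caches of this kind generally match repeated content from the start of the request, so it helps to keep stable material first and per-turn material last; that prefix-matching behavior is an assumption here, since the text above only says repeated content is recognized. A minimal sketch with a hypothetical helper:

```python
# Hypothetical helper: assemble messages so the stable, cache-friendly
# content forms an identical prefix on every request, and only the tail
# (history + new question) varies.
def build_messages(system_prompt, reference_document, history, user_question):
    messages = [
        # Stable across all requests -> best cache candidates.
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Reference document:\n{reference_document}"},
    ]
    # Growing parts go at the end so the prefix stays untouched.
    messages.extend(history)
    messages.append({"role": "user", "content": user_question})
    return messages
```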
Use Cases
Multi-turn Conversations
- Intelligent customer service systems
- Personal assistant services
Batch Processing
- Code review batch processing
- Content batch analysis
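Batch processing is the most mechanical of these to sketch: the instruction block stays identical across the whole batch, so every request after the first can reuse it from cache. A hypothetical code-review loop, reusing the `client` configured in the Basic Caching Example (model name is a placeholder):

```python
# Hypothetical batch loop: identical review instructions on every request
# make the instruction block a natural cache target.
REVIEW_INSTRUCTIONS = (
    "You are a senior engineer performing code review. For each file, "
    "list bugs, style issues, and security concerns, in that order."
)

files = {"auth.py": "...", "db.py": "...", "api.py": "..."}  # placeholder sources

for name, source in files.items():
    result = client.chat.completions.create(  # `client` from the example above
        model="glm-4.6",
        messages=[
            {"role": "system", "content": REVIEW_INSTRUCTIONS},
            {"role": "user", "content": f"Review {name}:\n{source}"},
        ],
    )
    print(name, "->", result.choices[0].message.content[:80])
```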
Template Applications
- Report generation templates
- Standardized process handling
Education and Training
- Homework grading assistance
- Learning material analysis
Important Notes
Understanding the cache mechanism:
- Caching is triggered automatically based on content similarity; no manual configuration is required
- Identical content has the highest cache hit rate
- Minor formatting differences may prevent cache hits (see the normalization sketch below)
- The cache has a limited lifetime; expired content is recomputed on the next request
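Because small formatting drift can defeat recognition of otherwise identical content, one hypothetical mitigation is to normalize prompt text before sending it, so the same logical content always serializes identically:

```python
import re

def normalize_prompt(text: str) -> str:
    """Hypothetical mitigation: collapse whitespace so logically identical
    prompts are byte-identical and more likely to be recognized as repeats."""
    text = text.strip()
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)   # cap consecutive blank lines
    return text
```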
Billing Information
Context caching uses a differentiated billing strategy:
- New content tokens: Billed at standard prices
- Cache hit tokens: Billed at discounted prices (usually 50% of standard price)
- Output tokens: Billed at standard prices
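As a hypothetical illustration of how this splits out, assuming the 50% cached-token discount mentioned above; the per-token prices are placeholders, not the actual rate card:

```python
# Illustrative cost calculation -- all prices are placeholders.
PRICE_PER_1K_INPUT = 0.01   # standard input price per 1K tokens (placeholder)
PRICE_PER_1K_OUTPUT = 0.03  # standard output price per 1K tokens (placeholder)
CACHE_DISCOUNT = 0.5        # cached tokens billed at 50% of standard price

def request_cost(prompt_tokens, cached_tokens, completion_tokens):
    """Cost of one request, splitting the prompt into new and cached tokens."""
    new_tokens = prompt_tokens - cached_tokens
    return (
        new_tokens / 1000 * PRICE_PER_1K_INPUT
        + cached_tokens / 1000 * PRICE_PER_1K_INPUT * CACHE_DISCOUNT
        + completion_tokens / 1000 * PRICE_PER_1K_OUTPUT
    )

# 8,000-token prompt, 6,000 of it cached, 500 output tokens:
print(f"${request_cost(8000, 6000, 500):.4f}")  # $0.0650 vs $0.0950 uncached
```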