When interacting with models, you can control the model’s output by adjusting different parameters to meet the needs of various scenarios. Understanding these core parameters will help you better utilize the model’s capabilities.
Quick Reference
Parameter | Type | Default Value | Description |
---|---|---|---|
`do_sample` | Boolean | `true` | Whether to sample the output to increase diversity. |
`temperature` | Float | (Model dependent) | Controls the randomness of output; higher values are more random. |
`top_p` | Float | (Model dependent) | Controls diversity through nucleus sampling; recommended to use either this or `temperature`. |
`max_tokens` | Integer | (Model dependent) | Limits the maximum number of tokens generated in a single call. |
`stream` | Boolean | `false` | Whether to return responses in streaming mode. |
`thinking` | Object | `{"type": "enabled"}` | Whether to enable chain-of-thought deep thinking; only supported by GLM-4.5 and above. |
Parameter Details
do_sample
`do_sample` is a boolean value (`true` or `false`) that determines whether to sample the model's output.
- `true` (default): Performs random sampling based on the probability distribution of each token, increasing text diversity and creativity. Suitable for content creation, dialogue, and other scenarios.
- `false`: Uses a greedy strategy, always selecting the token with the highest probability. Produces highly deterministic output, suitable for scenarios requiring precise, factual answers.
- Set to `false` when you need reproducible, deterministic output.
- Set to `true` when you want the model to generate more diverse and interesting content, and combine it with `temperature` or `top_p` (see the sketch below).
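For illustration, here is a minimal sketch of how `do_sample` might be combined with the other sampling parameters in a request body. It assumes an OpenAI-style HTTP chat-completions request; the endpoint URL, auth header, and response shape are placeholders rather than values taken from this page.

```python
import requests

# Placeholder endpoint and auth header (assumptions, not from this page);
# substitute the real chat-completions URL and key for your platform.
API_URL = "https://example.invalid/api/chat/completions"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Deterministic call: greedy decoding, no sampling.
deterministic_payload = {
    "model": "glm-4.6",
    "messages": [{"role": "user", "content": "In which year did Apollo 11 land on the Moon?"}],
    "do_sample": False,   # always pick the highest-probability token
}

# Creative call: sampling enabled and paired with a temperature value.
creative_payload = {
    "model": "glm-4.6",
    "messages": [{"role": "user", "content": "Write a four-line poem about the Moon."}],
    "do_sample": True,    # sample from the token probability distribution
    "temperature": 0.8,   # only meaningful when sampling is enabled
}

response = requests.post(API_URL, headers=HEADERS, json=deterministic_payload, timeout=60)
print(response.json())
```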
temperature
The `temperature` parameter controls the randomness of the model's output.
- Lower values (e.g., 0.2): Make the probability distribution sharper, resulting in more deterministic and conservative output.
- Higher values (e.g., 0.8): Make the probability distribution flatter, resulting in more random and diverse output.
- For scenarios requiring rigor and factual accuracy (such as knowledge Q&A), use a lower `temperature`.
- For scenarios requiring creativity (such as content creation), you can try a higher `temperature`.
- It's recommended to use only one of `temperature` and `top_p` (see the sketch below).
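As a rough guide, the two request bodies below contrast a low-temperature setting for factual Q&A with a higher-temperature setting for creative writing; they reuse the illustrative payload format from the `do_sample` sketch above.

```python
# Knowledge Q&A: a "sharper" distribution gives conservative, repeatable answers.
factual_payload = {
    "model": "glm-4.6",
    "messages": [{"role": "user", "content": "Summarize the main causes of inflation."}],
    "temperature": 0.2,
}

# Content creation: a "flatter" distribution gives more varied wording.
creative_payload = {
    "model": "glm-4.6",
    "messages": [{"role": "user", "content": "Brainstorm five taglines for a coffee shop."}],
    "temperature": 0.8,
}
```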
top_p
`top_p` (nucleus sampling) controls diversity by sampling from the smallest set of tokens whose cumulative probability exceeds the threshold.
- Lower values (e.g., 0.2): Limit the sampling range, resulting in more deterministic output.
- Higher values (e.g., 0.9): Expand the sampling range, resulting in more diverse output.
- If you want some diversity while keeping content quality, `top_p` is a good choice (recommended values: 0.8-0.95).
- It's generally not recommended to modify both `temperature` and `top_p` at the same time (see the sketch below).
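The sketch below sets only `top_p`, within the recommended 0.8-0.95 range, and deliberately leaves `temperature` at its default so that only one of the two is tuned; the payload format is the same illustrative one used above.

```python
payload = {
    "model": "glm-4.6",
    "messages": [{"role": "user", "content": "Suggest a title for an article about home composting."}],
    # Nucleus sampling: draw only from the smallest token set whose cumulative
    # probability exceeds 0.9; temperature is intentionally left at its default.
    "top_p": 0.9,
}
```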
max_tokens
`max_tokens` limits the maximum number of tokens the model can generate in a single call. GLM-4.6 supports a maximum output length of 128K tokens and GLM-4.5 supports 96K; it's recommended to set this parameter to no less than 1024. Tokens are the basic units of text: one token typically corresponds to approximately 0.75 English words or 1.5 Chinese characters. Setting an appropriate `max_tokens` controls response length and cost and avoids overly long outputs. If the model finishes its answer before reaching the limit, generation ends naturally; if it reaches the limit, the output may be truncated.
- Purpose: Prevents generating overly long text and controls API call costs.
- Note: `max_tokens` limits the length of generated content only; it does not include the input.
- Set `max_tokens` according to your application scenario. If you need short answers, you can use a smaller value (e.g., 50).
Default `max_tokens` and maximum supported `max_tokens` for each model:
Model Code | Default max_tokens | Maximum max_tokens |
---|---|---|
glm-4.6 | 65536 | 131072 |
glm-4.5 | 65536 | 98304 |
glm-4.5-air | 65536 | 98304 |
glm-4.5-x | 65536 | 98304 |
glm-4.5-airx | 65536 | 98304 |
glm-4.5-flash | 65536 | 98304 |
glm-4.5v | 16384 | 16384 |
glm-4-32b-0414-128k | 16384 | 16384 |
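Below is a hedged sketch of capping output length and detecting truncation. The `finish_reason == "length"` check follows the common OpenAI-style convention for a reply cut off at the token limit; that field name is an assumption to verify against the API reference, as are the placeholder endpoint and headers.

```python
import requests

API_URL = "https://example.invalid/api/chat/completions"  # placeholder (assumption)
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

payload = {
    "model": "glm-4.5",
    "messages": [{"role": "user", "content": "Explain what a context window is in two sentences."}],
    "max_tokens": 1024,   # keep at or above the recommended minimum of 1024
}

resp = requests.post(API_URL, headers=HEADERS, json=payload, timeout=60).json()
choice = resp["choices"][0]

# A finish_reason of "length" (assumed convention) means the reply hit the
# max_tokens cap and may be truncated rather than ending naturally.
if choice.get("finish_reason") == "length":
    print("Warning: output was truncated; consider raising max_tokens.")
print(choice["message"]["content"])
```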
stream
`stream` is a boolean value that controls how the API returns its response.
- `false` (default): Returns the complete response at once; simple to implement but with longer waiting times.
- `true`: Returns content in streaming (SSE) mode, significantly improving the experience of real-time interactive applications.
- For chatbots, real-time code generation, and similar applications, it's strongly recommended to set this to `true` (see the sketch below).
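A minimal sketch of consuming a streaming response. It assumes the server emits standard SSE lines of the form `data: {...}` ending with `data: [DONE]`, with incremental text in an OpenAI-style `delta` field; these are common conventions and should be confirmed against the API reference, and the endpoint is again a placeholder.

```python
import json
import requests

API_URL = "https://example.invalid/api/chat/completions"  # placeholder (assumption)
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

payload = {
    "model": "glm-4.6",
    "messages": [{"role": "user", "content": "Tell me a short story about a lighthouse."}],
    "stream": True,
}

with requests.post(API_URL, headers=HEADERS, json=payload, stream=True, timeout=300) as resp:
    for raw_line in resp.iter_lines():
        if not raw_line:
            continue                      # skip SSE keep-alive blank lines
        line = raw_line.decode("utf-8")
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":              # assumed end-of-stream sentinel
            break
        chunk = json.loads(data)
        # Assumed chunk shape: incremental text in choices[0].delta.content
        piece = chunk["choices"][0].get("delta", {}).get("content", "")
        print(piece, end="", flush=True)
```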
thinking
The `thinking` parameter controls whether the model enables chain-of-thought for deeper thinking and reasoning.
- Type: Object
- Supported Models: GLM-4.5 and above
- `type` (string):
  - `enabled` (default): Enable chain of thought. GLM-4.6 and GLM-4.5 automatically determine whether thinking is needed, while GLM-4.5V will always think.
  - `disabled`: Disable chain of thought.
- It's recommended to enable this when you need the model to perform complex reasoning and planning.
- For simple tasks, you can disable it to get faster responses (see the sketch below).
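For illustration, the two request bodies below toggle chain-of-thought using the object form shown in the quick-reference table; the rest of the payload follows the same illustrative format as the earlier sketches.

```python
# Complex reasoning or planning task: keep thinking enabled (the default).
reasoning_payload = {
    "model": "glm-4.6",
    "messages": [{"role": "user", "content": "Plan a three-day Kyoto itinerary on a tight budget."}],
    "thinking": {"type": "enabled"},
}

# Simple lookup: disable thinking for a faster response.
quick_payload = {
    "model": "glm-4.6",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "thinking": {"type": "disabled"},
}
```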
Related Concepts
Token Usage Calculation
Tokens are the basic units for model text processing. Usage calculation includes both input and output parts.
- Input Token Count: The number of tokens contained in the text you send to the model.
- Output Token Count: The number of tokens contained in the text generated by the model.
- Total Token Count: The sum of input and output, usually used as the billing basis.
You can use the `tokenizer` API to estimate the token count of a text.
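As an illustration of where these numbers appear, the snippet below reads them from a completed (non-streaming) response. A `usage` object with `prompt_tokens`, `completion_tokens`, and `total_tokens` is the common OpenAI-style convention and is assumed here rather than taken from this page.

```python
import requests

API_URL = "https://example.invalid/api/chat/completions"  # placeholder (assumption)
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}
payload = {
    "model": "glm-4.6",
    "messages": [{"role": "user", "content": "Give me three facts about honeybees."}],
}

resp = requests.post(API_URL, headers=HEADERS, json=payload, timeout=60).json()
usage = resp.get("usage", {})                        # assumed field names below
print("input tokens: ", usage.get("prompt_tokens"))
print("output tokens:", usage.get("completion_tokens"))
print("total tokens: ", usage.get("total_tokens"))   # typically the billing basis
```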
Maximum Output Tokens
Maximum Output Tokens refers to the maximum number of tokens a model can generate in a single request. It's different from the `max_tokens` parameter: `max_tokens` is the upper limit you set in your request, while Maximum Output Tokens is the architectural limitation of the model itself. For example, a model's context window might be 8K tokens, but its maximum output capability might be limited to 4K tokens.
Context Window
The Context Window refers to the total number of tokens a model can process in a single interaction, including all tokens from both input text and generated text.
- Importance: The context window determines how much historical information the model can “remember”. If the total length of input and expected output exceeds the model’s context window, the model will be unable to process it.
- Note: Different models have different context window sizes. When conducting long conversations or processing long documents, special attention should be paid to context window limitations.
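A tiny sketch of the budgeting rule implied here: the input tokens plus the requested output budget must fit within the window. The window size and token counts are illustrative numbers, and the input count would in practice come from a tokenizer or an estimate.

```python
def fits_in_context(input_tokens: int, max_tokens: int, context_window: int) -> bool:
    """True if the prompt plus the requested output budget fits in the window."""
    return input_tokens + max_tokens <= context_window

# Illustrative numbers: with an 8K-token window, a 6,500-token prompt plus a
# 2,048-token output budget does not fit, so the request must be trimmed.
print(fits_in_context(6_500, 2_048, 8_192))   # False
```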
Concurrency Limits
Concurrency refers to the number of API requests you can initiate simultaneously. This is set by the platform to ensure service stability and fair resource allocation.
- Limits: Different users or subscription plans may have different concurrency quotas.
- Overages: If you exceed the concurrency limit, new requests may fail or need to wait in queue.
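A minimal client-side pattern for staying within a concurrency quota: cap the number of in-flight requests with a thread pool and back off briefly when the service rejects a request. The quota of 5, the HTTP 429 status, and the retry policy are assumptions for illustration, as are the endpoint and headers.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

API_URL = "https://example.invalid/api/chat/completions"  # placeholder (assumption)
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}
MAX_CONCURRENCY = 5   # assumed quota; check the limit for your plan

payloads = [
    {"model": "glm-4.6", "messages": [{"role": "user", "content": f"Question {i}"}]}
    for i in range(20)
]

def call_with_retry(payload, retries=3, backoff=2.0):
    """Send one request, backing off briefly if it is rejected (assumed HTTP 429)."""
    for attempt in range(retries):
        resp = requests.post(API_URL, headers=HEADERS, json=payload, timeout=60)
        if resp.status_code == 429:              # assumed over-limit signal
            time.sleep(backoff * (attempt + 1))
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError("request kept hitting the concurrency limit")

# The thread pool caps how many requests are in flight at the same time.
with ThreadPoolExecutor(max_workers=MAX_CONCURRENCY) as pool:
    results = list(pool.map(call_with_retry, payloads))
```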