When interacting with models, you can control the model’s output by adjusting different parameters to meet the needs of various scenarios. Understanding these core parameters will help you better utilize the model’s capabilities.

Quick Reference

| Parameter | Type | Default Value | Description |
| --- | --- | --- | --- |
| do_sample | Boolean | true | Whether to sample the output to increase diversity. |
| temperature | Float | (Model dependent) | Controls the randomness of output; higher values are more random. |
| top_p | Float | (Model dependent) | Controls diversity through nucleus sampling; use either this or temperature. |
| max_tokens | Integer | (Model dependent) | Limits the maximum number of tokens generated in a single call. |
| stream | Boolean | false | Whether to return responses in streaming mode. |
| thinking | Object | {"type": "enabled"} | Whether to enable chain-of-thought deep thinking; only supported by GLM-4.5 and above. |
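
As a concrete reference, here is a minimal sketch of a chat-completions request that sets these parameters together. It assumes an OpenAI-compatible HTTP endpoint and response shape; the URL, the API_KEY environment variable, and the prompt are placeholders to replace with your own values.

```python
import os
import requests

# Assumed endpoint and credentials; substitute your platform's actual values.
API_URL = "https://open.bigmodel.cn/api/paas/v4/chat/completions"
API_KEY = os.environ["API_KEY"]

payload = {
    "model": "glm-4.6",
    "messages": [{"role": "user", "content": "Explain nucleus sampling in one paragraph."}],
    "do_sample": True,                 # sample from the token distribution
    "temperature": 0.7,                # moderate randomness (leave top_p alone)
    "max_tokens": 1024,                # cap the generated length
    "stream": False,                   # return the full response at once
    "thinking": {"type": "enabled"},   # chain-of-thought, GLM-4.5 and above
}

resp = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```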

Parameter Details

do_sample

do_sample is a boolean value (true or false) that determines whether to sample the model’s output.
  • true (default): Performs random sampling based on the probability distribution of each token, increasing text diversity and creativity. Suitable for content creation, dialogue, and other scenarios.
  • false: Uses a greedy strategy, always selecting the token with the highest probability. Provides high deterministic output, suitable for scenarios requiring precise, factual answers.
Best Practices:
  • Set to false when you need reproducible, deterministic output.
  • Set to true when you want the model to generate more diverse and interesting content, and use it in combination with temperature or top_p.
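
For example, the two settings might look like this as payload fragments (illustrative values, to be merged into a full request as sketched above):

```python
# Payload fragments contrasting the two do_sample modes (illustrative values).
deterministic = {"do_sample": False}                # greedy decoding: reproducible output
creative = {"do_sample": True, "temperature": 0.8}  # sampling plus temperature for diversity
```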

temperature

The temperature parameter controls the randomness of the model’s output.
  • Lower values (e.g., 0.2): Sharpen the probability distribution, producing more deterministic, conservative output.
  • Higher values (e.g., 0.8): Flatten the probability distribution, producing more random, diverse output.
Best Practices:
  • For scenarios requiring rigor and factual accuracy (such as knowledge Q&A), it’s recommended to use lower temperature.
  • For scenarios requiring creativity (such as content creation), you can try higher temperature.
  • It’s recommended to use only one of temperature and top_p.
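
As an illustration, the two ends of the range discussed above could be expressed as payload fragments like these (0.2 and 0.8 are the example values from the text, not model defaults):

```python
# Temperature fragments for the two scenarios above (illustrative values).
knowledge_qa = {"temperature": 0.2}      # sharper distribution: conservative, repeatable answers
creative_writing = {"temperature": 0.8}  # flatter distribution: more varied output
```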

top_p

top_p (nucleus sampling) controls diversity by sampling from the smallest set of tokens whose cumulative probability exceeds the threshold.
  • Lower values (e.g., 0.2): Limit the sampling range, resulting in more deterministic output.
  • Higher values (e.g., 0.9): Expand the sampling range, resulting in more diverse output.
Best Practices:
  • If you want to achieve some diversity while ensuring content quality, top_p is a good choice (recommended values 0.8-0.95).
  • It’s generally not recommended to modify both temperature and top_p simultaneously.
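
A corresponding fragment using nucleus sampling instead of temperature might look like this (0.9 sits inside the recommended 0.8-0.95 range):

```python
# top_p fragment; leave temperature at its default when using nucleus sampling.
balanced = {"top_p": 0.9}  # sample only from the smallest token set covering 90% of probability
```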

max_tokens

max_tokens limits the maximum number of tokens the model can generate in a single call. GLM-4.6 supports a maximum output length of 128K and GLM-4.5 supports 96K; it's recommended to set the value to no less than 1024. Tokens are the basic units of text: typically 1 token equals approximately 0.75 English words or 1.5 Chinese characters. Setting an appropriate max_tokens controls response length and cost and avoids overly long outputs. If the model completes its answer before reaching the max_tokens limit, it ends naturally; if it reaches the limit, the output may be truncated.
  • Purpose: Prevents generating overly long text and controls API call costs.
  • Note: max_tokens limits the length of generated content, not including input.
Best Practices:
  • Set max_tokens reasonably according to your application scenario. If you need short answers, you can set it to a smaller value (e.g., 50).
Default max_tokens and maximum supported max_tokens for each model:
| Model Code | Default max_tokens | Maximum max_tokens |
| --- | --- | --- |
| glm-4.6 | 65536 | 131072 |
| glm-4.5 | 65536 | 98304 |
| glm-4.5-air | 65536 | 98304 |
| glm-4.5-x | 65536 | 98304 |
| glm-4.5-airx | 65536 | 98304 |
| glm-4.5-flash | 65536 | 98304 |
| glm-4.5v | 16384 | 16384 |
| glm-4-32b-0414-128k | 16384 | 16384 |
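
To notice truncation in practice, you can inspect the finish reason of the first choice. The sketch below assumes an OpenAI-compatible response body in which finish_reason is "length" when the max_tokens cap was hit; check your platform's actual response schema.

```python
# Detect whether a completion was cut off by the max_tokens limit
# (assumed OpenAI-compatible response shape with choices[0].finish_reason).
def extract_answer(response_json: dict) -> str:
    choice = response_json["choices"][0]
    if choice.get("finish_reason") == "length":
        # The model stopped because it hit max_tokens; the answer may be incomplete.
        print("Warning: output truncated at max_tokens; consider raising the limit.")
    return choice["message"]["content"]
```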

stream

stream is a boolean value used to control the API’s response method.
  • false (default): Returns the complete response at once; simpler to implement, but the user waits for the entire generation to finish.
  • true: Returns content in streaming (SSE) mode, significantly improving the experience of real-time interactive applications.
Best Practices:
  • For chatbots, real-time code generation, and other applications, it’s strongly recommended to set this to true.
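
A minimal streaming sketch using the same assumed endpoint as above is shown below. It assumes an OpenAI-compatible SSE format (data: {...} events terminated by data: [DONE]); adapt the parsing to your platform's documented stream format.

```python
import json
import os
import requests

API_URL = "https://open.bigmodel.cn/api/paas/v4/chat/completions"  # assumed endpoint
API_KEY = os.environ["API_KEY"]

payload = {
    "model": "glm-4.6",
    "messages": [{"role": "user", "content": "Write a haiku about rivers."}],
    "stream": True,  # ask the server to push incremental chunks
}

with requests.post(API_URL, json=payload, stream=True,
                   headers={"Authorization": f"Bearer {API_KEY}"}, timeout=300) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"].get("content", "")
        print(delta, end="", flush=True)  # render tokens as they arrive
```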

thinking

The thinking parameter controls whether the model enables “Chain of Thought” for deeper thinking and reasoning.
  • Type: Object
  • Supported Models: GLM-4.5 and above
Properties:
  • type (string):
    • enabled (default): Enable chain of thought. GLM-4.6 and GLM-4.5 decide automatically whether thinking is needed, while GLM-4.5V always performs thinking.
    • disabled: Disable chain of thought.
Best Practices:
  • It’s recommended to enable this when you need the model to perform complex reasoning and planning.
  • For simple tasks, you can disable it to get faster responses.
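
In a request payload, the two settings would look like this (fragments only):

```python
# thinking fragments (GLM-4.5 and above).
deep_reasoning = {"thinking": {"type": "enabled"}}   # let the model plan with chain of thought
fast_answers = {"thinking": {"type": "disabled"}}    # skip thinking for simple, latency-sensitive tasks
```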

Token Usage

Tokens are the basic units for model text processing. Usage calculation includes both input and output parts.
  • Input Token Count: The number of tokens contained in the text you send to the model.
  • Output Token Count: The number of tokens contained in the text generated by the model.
  • Total Token Count: The sum of input and output, usually used as the billing basis.
You can call the tokenizer API to estimate the token count of text.
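
If you only need a rough planning estimate before calling the tokenizer API, you can apply the rule of thumb quoted above (about 0.75 English words or 1.5 Chinese characters per token). The heuristic below is an approximation, not the tokenizer's exact count:

```python
# Rough token estimate from the ~0.75 words/token and ~1.5 Chinese chars/token rule of thumb.
def rough_token_estimate(text: str) -> int:
    def is_cjk(ch: str) -> bool:
        return "\u4e00" <= ch <= "\u9fff"

    cjk_chars = sum(1 for ch in text if is_cjk(ch))
    # Count remaining whitespace-separated words after masking out CJK characters.
    non_cjk_words = len("".join(" " if is_cjk(ch) else ch for ch in text).split())
    return round(cjk_chars / 1.5 + non_cjk_words / 0.75)

print(rough_token_estimate("Hello world, this is a short sentence."))
```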

Maximum Output Tokens

Maximum Output Tokens refers to the maximum number of tokens a model can generate in a single request. It's different from the max_tokens parameter: max_tokens is the upper limit you set in your request, while Maximum Output Tokens is the architectural limitation of the model itself. For example, a model's context window might be 8k tokens, but its maximum output capability might be limited to 4k tokens.

Context Window

The Context Window refers to the total number of tokens a model can process in a single interaction, including all tokens from both input text and generated text.
  • Importance: The context window determines how much historical information the model can “remember”. If the total length of input and expected output exceeds the model’s context window, the model will be unable to process it.
  • Note: Different models have different context window sizes. When conducting long conversations or processing long documents, special attention should be paid to context window limitations.
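
A simple pre-flight check captures this constraint; the function below assumes you already know the prompt's token count (e.g. from the tokenizer API) and the model's context window size:

```python
# Verify that prompt plus requested output fits inside the model's context window.
def fits_context_window(prompt_tokens: int, max_tokens: int, context_window: int) -> bool:
    return prompt_tokens + max_tokens <= context_window

# Example: a 6,000-token prompt requesting up to 4,096 output tokens in an 8k window.
print(fits_context_window(6000, 4096, 8192))  # False: the request would exceed the window
```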

Concurrency

Concurrency refers to the number of API requests you can initiate simultaneously. This is set by the platform to ensure service stability and fair resource allocation.
  • Limits: Different users or subscription plans may have different concurrency quotas.
  • Overages: If you exceed the concurrency limit, new requests may fail or need to wait in queue.
If your application requires high concurrency processing, please check your account limits or contact platform support.
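
One common way to stay within a concurrency quota on the client side is to gate requests with a semaphore; the limit of 5 below is a placeholder for your actual account quota:

```python
# Bound client-side concurrency so outstanding requests never exceed the quota.
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_REQUESTS = 5                      # placeholder: use your account's actual limit
slots = threading.Semaphore(MAX_CONCURRENT_REQUESTS)

def call_model(prompt: str) -> str:
    with slots:                                  # blocks until a request slot frees up
        # Issue the HTTP request here, as in the earlier sketches.
        return f"response for: {prompt}"

with ThreadPoolExecutor(max_workers=16) as pool:
    answers = list(pool.map(call_model, [f"question {i}" for i in range(20)]))
```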