Available Parameters

We currently support the sampling parameters available in vLLM.

Main Sampling Parameters:

  1. temperature (default: 1.0)

    • Controls randomness in token selection.

    • Higher values (>1.0) increase randomness.

    • Lower values (<1.0) make outputs more deterministic.

    • A value of 0 forces greedy decoding.

  2. top_p (Nucleus Sampling) (default: 1.0)

    • Controls the probability mass of token selection.

    • Only considers tokens that sum up to top_p probability.

    • Lower values (e.g., 0.9) limit token choices to more likely options.

    • Higher values (close to 1.0) allow a broader range of tokens.

  3. top_k (default: -1, which means disabled)

    • Limits token selection to the top k most probable tokens.

    • Lower values (e.g., top_k=50) make output more deterministic.

    • If -1, this setting is ignored.

  4. max_tokens (default: None)

    • Sets the maximum number of tokens to generate.

    • Helps prevent excessively long responses.

  5. repetition_penalty (default: 1.0)

    • Penalizes repeated tokens to avoid looping responses.

    • Values >1.0 discourage repetition.

    • Common values: 1.1–1.2.

  6. presence_penalty (default: 0.0)

    • Positive values penalize tokens that have already appeared, increasing the likelihood of introducing new tokens.

    • Useful for making outputs more diverse.

  7. frequency_penalty (default: 0.0)

    • Penalizes tokens in proportion to how often they have already appeared.

    • Helps prevent excessive repetition of common words.
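The main sampling parameters above are typically passed in the body of a request to an OpenAI-compatible completions endpoint. A minimal sketch of such a payload follows; the model name and prompt are illustrative placeholders, not values from this document.

```python
# Sketch of a request body for an OpenAI-compatible /v1/completions endpoint.
# "my-model" and the prompt are placeholder assumptions for illustration.
payload = {
    "model": "my-model",
    "prompt": "Explain nucleus sampling in one sentence.",
    "temperature": 0.7,          # mildly less random than the default 1.0
    "top_p": 0.9,                # nucleus sampling: keep top 90% probability mass
    "top_k": 50,                 # consider only the 50 most probable tokens
    "max_tokens": 128,           # cap response length
    "repetition_penalty": 1.1,   # gently discourage repeated tokens
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
}
```

Any parameter left out of the payload falls back to the defaults listed above.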

Advanced Sampling Techniques:

  1. do_sample (default: True)

    • Enables stochastic sampling (when set to True).

    • If False, uses greedy decoding (choosing the highest probability token at each step).

  2. random_seed (default: None)

    • Sets a fixed seed for reproducible results.

    • Useful for debugging or deterministic sampling.

  3. guided_choice

    • Allows constraining outputs to predefined options.

    • Ensures responses align with a given list of acceptable completions.
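The advanced parameters can be combined in the same request body. The sketch below constrains the output to two predefined choices and fixes a seed for reproducibility, using the parameter names listed above; the model name and prompt are placeholder assumptions.

```python
# Sketch of a request constrained to predefined options, assuming the
# parameter names listed in this document. Model name and prompt are
# illustrative placeholders.
payload = {
    "model": "my-model",
    "prompt": "Is this review positive or negative? Review: 'Great product!'",
    "guided_choice": ["positive", "negative"],  # output must be one of these
    "random_seed": 42,                          # reproducible sampling
    "temperature": 0.0,                         # greedy: pick the likeliest option
}
```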

How These Work Together:

  • Setting temperature=0 forces deterministic output (greedy decoding).

  • Using top_p and top_k together balances diversity and coherence.

  • repetition_penalty and presence_penalty help avoid repetitive loops.
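The interaction of temperature, top_k, and top_p can be sketched in plain Python. This is a toy illustration of the filtering order, not vLLM's actual implementation: temperature rescales the logits, top_k truncates the candidate set, and top_p keeps the smallest high-probability prefix.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=-1, top_p=1.0, rng=None):
    """Toy sketch of temperature / top_k / top_p sampling over a
    {token: logit} dict. Not vLLM's actual implementation."""
    rng = rng or random.Random(0)
    if temperature == 0:
        # temperature=0 forces greedy decoding: always take the argmax.
        return max(logits, key=logits.get)
    # 1. Temperature rescales logits before the softmax.
    scaled = {t: l / temperature for t, l in logits.items()}
    m = max(scaled.values())
    probs = {t: math.exp(l - m) for t, l in scaled.items()}
    z = sum(probs.values())
    probs = {t: p / z for t, p in probs.items()}
    # 2. top_k keeps only the k most probable tokens (-1 disables).
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if top_k > 0:
        items = items[:top_k]
    # 3. top_p keeps the smallest prefix whose mass reaches top_p.
    kept, total = [], 0.0
    for t, p in items:
        kept.append((t, p))
        total += p
        if total >= top_p:
            break
    # Renormalize over the surviving tokens and sample one.
    z = sum(p for _, p in kept)
    r = rng.random() * z
    acc = 0.0
    for t, p in kept:
        acc += p
        if acc >= r:
            return t
    return kept[-1][0]
```

With `temperature=0`, `top_k=1`, or a very small `top_p`, the function collapses to picking the single most probable token, which is why those settings make output deterministic.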
