Available Parameters

We currently support the sampling parameters that vLLM supports. The main ones are listed below, with a combined request example after the list.
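The examples on this page use the OpenAI Python SDK against an OpenAI-compatible endpoint. Here is a minimal setup sketch; the base URL, API key environment variable, and model names used below are assumptions, so substitute the values from your Parasail account.

```python
import os

from openai import OpenAI

# Placeholder base URL and key variable; replace with your
# Parasail endpoint and credentials.
client = OpenAI(
    base_url="https://api.parasail.io/v1",
    api_key=os.environ["PARASAIL_API_KEY"],
)
```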

Main Sampling Parameters:

  1. temperature (default: 1.0)

    • Controls randomness in token selection.

    • Higher values (>1.0) increase randomness.

    • Lower values (<1.0) make outputs more deterministic.

    • A value of 0 forces greedy decoding.

  2. top_p (Nucleus Sampling) (default: 1.0)

    • Controls the probability mass of token selection.

    • Only the most probable tokens whose cumulative probability reaches top_p are considered.

    • Lower values (e.g., 0.9) limit token choices to more likely options.

    • Higher values (close to 1.0) allow a broader range of tokens.

  3. top_k (default: -1, which means disabled)

    • Limits token selection to the top k most probable tokens.

    • Lower values (e.g., top_k=50) make output more deterministic.

    • If -1, this setting is ignored.

    • Since top_k is not defined in the OpenAI spec, it should be passed via the extra_body field (see the combined example after this list).

  4. max_tokens (default: None)

    • Sets the maximum number of tokens to generate.

    • Helps prevent excessively long responses.

  5. repetition_penalty (default: 1.0)

    • Penalizes repeated tokens to avoid looping responses.

    • Values >1.0 discourage repetition.

    • Common values: 1.1 - 1.2.

  6. presence_penalty (default: 0.0)

    • Penalizes tokens that have already appeared at least once, increasing the likelihood of introducing new tokens.

    • Useful for making outputs more diverse.

  7. frequency_penalty (default: 0.0)

    • Penalizes tokens in proportion to how often they have already appeared.

    • Helps prevent excessive repetition of common words.

  8. seed (default: None)

    • Sets a fixed seed for reproducible results.

    • Useful for debugging or deterministic sampling.
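Putting these together, here is a sketch of a single request that sets the standard OpenAI parameters directly and passes top_k through extra_body. It reuses the client from the setup above; the model name is a placeholder, and passing repetition_penalty through extra_body is an assumption based on vLLM's parameter set rather than something this page documents.

```python
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Write a haiku about the sea."}],
    temperature=0.8,        # mild randomness
    top_p=0.9,              # sample from the top 90% probability mass
    max_tokens=128,         # cap response length
    presence_penalty=0.3,   # nudge toward new tokens
    frequency_penalty=0.3,  # discourage frequent repeats
    seed=42,                # reproducible sampling
    extra_body={
        "top_k": 50,                # not in the OpenAI spec, so it goes here
        "repetition_penalty": 1.1,  # assumption: also accepted via extra_body
    },
)
print(response.choices[0].message.content)
```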

How These Work Together:

  • Setting temperature=0 forces deterministic output (greedy decoding).

  • Using top_p and top_k together balances diversity and coherence.

  • repetition_penalty and presence_penalty help avoid repetitive loops.
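As a concrete illustration of the first point, the sketch below pins temperature to 0 for greedy, deterministic decoding; the model name is again a placeholder.

```python
deterministic = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "List three prime numbers."}],
    temperature=0,  # greedy decoding: always pick the most probable token
    seed=7,         # redundant under greedy decoding, but harmless
    max_tokens=64,
)
print(deterministic.choices[0].message.content)
```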
