# Welcome

Parasail provides affordable, high-performance cloud GPUs for running demanding AI workloads—serverless inference, dedicated instances, and batch processing.

Parasail is a cloud GPU platform for AI inference. We aggregate top AI hardware providers to give you scalable, on-demand access to powerful computing resources—without costly hardware investments or lengthy contracts.

We offer three ways to run inference, each designed for a different workload pattern:

<table data-view="cards"><thead><tr><th></th><th></th><th data-hidden data-card-cover data-type="files"></th><th data-hidden></th><th data-hidden data-card-target data-type="content-ref"></th></tr></thead><tbody><tr><td><strong>Serverless</strong></td><td>On-demand, pay-per-token inference for popular open-source models. No setup required—just call the API.</td><td></td><td></td><td><a href="/pages/AbuAPWxU9zDfEIgukNsK">/pages/AbuAPWxU9zDfEIgukNsK</a></td></tr><tr><td><strong>Dedicated Instances</strong></td><td>Private GPU endpoints with full control over model, hardware, and scaling. Ideal for production workloads.</td><td></td><td></td><td><a href="/pages/dvVsRgLfPYaM2uxDi8Gw">/pages/dvVsRgLfPYaM2uxDi8Gw</a></td></tr><tr><td><strong>Batch Processing</strong></td><td>Process millions of inferences at 50% off serverless pricing. OpenAI-compatible batch API.</td><td></td><td></td><td><a href="/pages/51IF2LvdW8BHeIjbfIMj">/pages/51IF2LvdW8BHeIjbfIMj</a></td></tr></tbody></table>

## Not sure where to start?

Pick a product based on your workload:

| If you...                                                                 | Use                                                           | Why                                                                      |
| ------------------------------------------------------------------------- | ------------------------------------------------------------- | ------------------------------------------------------------------------ |
| Want to call a popular open-source model with no setup                    | [**Serverless**](/parasail-docs/products/overview)            | Pay-per-token, instant, OpenAI-compatible                                |
| Need a specific or private model, predictable latency, or production SLAs | [**Dedicated Instances**](/parasail-docs/products/overview-1) | Private GPU endpoint with full control over model, hardware, and scaling |
| Process large volumes offline where real-time isn't required              | [**Batch**](/parasail-docs/products/quickstart)               | 50% off serverless pricing, up to millions of requests                   |

## Find your path

Choose the option that best matches what you're trying to do:

<table data-view="cards"><thead><tr><th></th><th></th><th data-hidden data-card-cover data-type="files"></th><th data-hidden></th><th data-hidden data-card-target data-type="content-ref"></th></tr></thead><tbody><tr><td><strong>Chat and text generation</strong></td><td>Build chatbots, assistants, and text generation pipelines</td><td></td><td></td><td><a href="/pages/9NzGG4P8mD9sCd9NCJMK">/pages/9NzGG4P8mD9sCd9NCJMK</a></td></tr><tr><td><strong>RAG and embeddings</strong></td><td>Build retrieval-augmented generation and vector search systems</td><td></td><td></td><td><a href="/pages/VgTsKmx1uUyZtUKwy1WK">/pages/VgTsKmx1uUyZtUKwy1WK</a></td></tr><tr><td><strong>Batch at scale</strong></td><td>Process large volumes of prompts, images, or text offline</td><td></td><td></td><td><a href="/pages/AXpJHRJH9bgHtkexfQ2q">/pages/AXpJHRJH9bgHtkexfQ2q</a></td></tr><tr><td><strong>Agents and tool calling</strong></td><td>Build agentic workflows with function calling and multi-step reasoning</td><td></td><td></td><td><a href="/pages/AwBBAik6tiXQRBQzxem9">/pages/AwBBAik6tiXQRBQzxem9</a></td></tr><tr><td><strong>Browser and UI automation</strong></td><td>Automate browser tasks and GUI interactions with vision models</td><td></td><td></td><td><a href="/pages/jA08SyPNrWTQdemUTvP1">/pages/jA08SyPNrWTQdemUTvP1</a></td></tr></tbody></table>

## Quickstart

Make your first API call in under two minutes:

<table data-view="cards"><thead><tr><th></th><th></th><th data-hidden data-card-cover data-type="files"></th><th data-hidden></th><th data-hidden data-card-target data-type="content-ref"></th></tr></thead><tbody><tr><td><strong>Serverless quickstart</strong></td><td>Call a serverless model with the OpenAI SDK</td><td></td><td></td><td><a href="/pages/cqVFBa1P7r5p7nfFbCnL">/pages/cqVFBa1P7r5p7nfFbCnL</a></td></tr><tr><td><strong>Dedicated quickstart</strong></td><td>Deploy your own model on a dedicated GPU</td><td></td><td></td><td><a href="/pages/UcIhEOaxPYRZMC3qQ7xw">/pages/UcIhEOaxPYRZMC3qQ7xw</a></td></tr><tr><td><strong>Batch quickstart</strong></td><td>Submit a batch job in five lines of Python</td><td></td><td></td><td><a href="/pages/OrPZ7ZRehYXxzse9LB3u">/pages/OrPZ7ZRehYXxzse9LB3u</a></td></tr></tbody></table>

## API reference

Parasail's API is fully OpenAI-compatible. Point your existing OpenAI SDK at our base URL and start making requests.

* **Base URL**: `https://api.parasail.io/v1`
* **Auth**: `Authorization: Bearer <PARASAIL_API_KEY>`
* **Get an API key**: <https://www.saas.parasail.io/keys>

See the [API Reference](/parasail-docs/api-reference/authentication) for full details on authentication, endpoints, and parameters.
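
Because authentication is a plain bearer token, any HTTP client can call the API, not only the OpenAI SDK. The sketch below builds (but does not send) a request to the `/models` endpoint using only Python's standard library. The helper name `build_models_request` is illustrative and not part of any Parasail SDK.

```python
import urllib.request

BASE_URL = "https://api.parasail.io/v1"


def build_models_request(api_key: str) -> urllib.request.Request:
    """Build an authenticated GET request for the /models endpoint."""
    return urllib.request.Request(
        f"{BASE_URL}/models",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


request = build_models_request("your-api-key-here")
print(request.full_url)
print(request.get_header("Authorization"))
# To send it, pass `request` to urllib.request.urlopen()
# (needs network access and a valid key).
```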

## For AI coding agents

Building on Parasail on behalf of a user? Read [`llms.txt`](https://github.com/parasail-ai/gitbook-doc/blob/main/llms.txt) for a flat index of every page, then see [For AI Coding Agents](/parasail-docs/resources/for-ai-coding-agents) for per-product setup notes. In short: use the OpenAI SDK against `https://api.parasail.io/v1` with the `PARASAIL_API_KEY` environment variable—every endpoint is OpenAI-compatible.

## Explore

<table data-view="cards"><thead><tr><th></th><th></th><th data-hidden data-card-cover data-type="files"></th><th data-hidden></th><th data-hidden data-card-target data-type="content-ref"></th></tr></thead><tbody><tr><td><strong>Guides</strong></td><td>Step-by-step tutorials for common tasks</td><td></td><td></td><td><a href="/pages/fCGvKxrehXKm6u7ta71R">/pages/fCGvKxrehXKm6u7ta71R</a></td></tr><tr><td><strong>Billing</strong></td><td>Pricing, rate limits, and quota management</td><td></td><td></td><td><a href="/pages/8hnW6Po2Bj1dutmH0v2j">/pages/8hnW6Po2Bj1dutmH0v2j</a></td></tr><tr><td><strong>Security</strong></td><td>Data privacy, compliance, and API key management</td><td></td><td></td><td><a href="/pages/reCnV1oJm5Rzj9QnXa4c">/pages/reCnV1oJm5Rzj9QnXa4c</a></td></tr><tr><td><strong>Resources</strong></td><td>Community guides and third-party integrations</td><td></td><td></td><td><a href="/pages/eWwm8YtUZYYgUDvXBdyp">/pages/eWwm8YtUZYYgUDvXBdyp</a></td></tr></tbody></table>


# Serverless

Make your first serverless inference call in under two minutes using the OpenAI SDK.

Serverless gives you on-demand access to popular open-source models with no setup: call the API and pay per token.

## 1. Get an API key

1. Go to <https://www.saas.parasail.io/keys>.
2. Click **Create API Key**.
3. Copy the key—it's shown only once.

Store it in your environment:

```bash
export PARASAIL_API_KEY="your-api-key-here"
```
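
If the variable is missing, the SDK call typically fails later with a less obvious error. As a minimal sketch, a small helper can read the key and fail early with a clear message. The function name `load_api_key` is hypothetical.

```python
import os


def load_api_key(env_var: str = "PARASAIL_API_KEY") -> str:
    """Return the API key from the environment or raise a clear error."""
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(
            f"{env_var} is not set. Export it before running your script."
        )
    return api_key
```

You could then pass `api_key=load_api_key()` when constructing the client.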

## 2. Install the OpenAI SDK

```bash
pip install openai
```

## 3. Make your first request

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.parasail.io/v1",
    api_key=os.environ["PARASAIL_API_KEY"],
)

chat_completion = client.chat.completions.create(
    model="parasail-deepseek-r1",
    messages=[{"role": "user", "content": "What is the capital of New York?"}]
)

print(chat_completion.choices[0].message.content)
```

## 4. List available models

```bash
curl https://api.parasail.io/v1/models \
  -H "Authorization: Bearer $PARASAIL_API_KEY"
```
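
The sketch below shows one way to pull the model IDs out of that response in Python. It assumes the response follows the OpenAI list shape (a top-level `data` array of objects with an `id` field). The sample body and its IDs are made up for illustration.

```python
import json


def extract_model_ids(response_text: str) -> list[str]:
    """Return the sorted model IDs from a /models JSON response."""
    payload = json.loads(response_text)
    return sorted(item["id"] for item in payload.get("data", []))


# Illustrative response body; real IDs come from the API.
sample = '{"object": "list", "data": [{"id": "model-b"}, {"id": "model-a"}]}'
print(extract_model_ids(sample))
```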

## Next steps

* [Serverless overview](/parasail-docs/products/overview)—full serverless API guide
* [Available parameters](/parasail-docs/products/overview/parameters)—temperature, top\_p, top\_k, and more
* [Responses API](/parasail-docs/products/overview/responses-api)—for multi-turn agentic workflows
* [Chat completions guide](/parasail-docs/guides/chat-completions)—understanding messages, roles, and chat templates
* [Use third-party tools](/parasail-docs/products/overview#using-third-party-tools-like-cline-anythingllm)—Cline, AnythingLLM, and other OpenAI-compatible tools


# Dedicated Instances

Deploy your own model on a dedicated GPU instance in a few minutes.

Deploy a private GPU endpoint with full control over the model, hardware, and scaling. Dedicated instances are ideal for production workloads, custom models, and models not available on serverless.

## 1. Get an API key

1. Go to <https://www.saas.parasail.io/keys>.
2. Click **Create API Key**.
3. Copy the key—it's shown only once.

## 2. Navigate to the Dedicated tab

In the Parasail dashboard, click the **Dedicated** tab to open the deployment configuration page.

## 3. Configure your deployment

| Field                     | Description                                                                                                                          |
| ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------ |
| **Deployment Name**       | A name for your endpoint.                                                                                                            |
| **HuggingFace ID / URL**  | The model to load. For example, `neuralmagic/Qwen2-72B-Instruct-FP8` or the full HuggingFace URL.                                    |
| **HuggingFace Token**     | Required only for private HuggingFace repos. See [Private HuggingFace Models](/parasail-docs/products/overview-1/private-hf-models). |
| **Hardware Service Tier** | Choose from low-cost, low-latency, or ultra-low-latency GPU configurations.                                                          |

If the model is supported, you'll see a green confirmation message.

## 4. (Optional) Set advanced options

* **Speculative Decoding Draft Model**—speed up output with a draft model. See [Speculative Decoding](/parasail-docs/products/overview-1/speculative-decoding).
* **Replicas**—spin up multiple load-balanced replicas for high throughput.
* **Scale Down Policy**—automatically shut down after idle time to avoid unexpected costs.
* **Max Concurrent Connections**—set a per-replica concurrency limit to maintain QoS.

## 5. Deploy

Click **Deploy**. If the model weights have been cached previously, deployment takes 3–5 minutes. If weights need to be downloaded from HuggingFace, it can take 10 minutes to several hours depending on model size.

## 6. Call your endpoint

Once the endpoint is live, call it using the OpenAI SDK:

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.parasail.io/v1",
    api_key=os.environ["PARASAIL_API_KEY"],
)

response = client.chat.completions.create(
    model="your-deployment-name",
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response.choices[0].message.content)
```

## Next steps

* [Dedicated instances overview](/parasail-docs/products/overview-1)—full guide to dedicated endpoints
* [Auto-scaling](/parasail-docs/products/overview-1/auto-scaling)—scale replicas with traffic
* [FP8 quantization](/parasail-docs/products/overview-1/fp8-quantization)—reduce cost with quantization
* [Management API](/parasail-docs/products/overview-1/management-api)—manage endpoints programmatically


# Batch Processing

Submit your first batch job in five lines of Python. Batch processing is 50% off serverless pricing.

Process thousands or millions of LLM inferences at 50% off serverless pricing. Batch is ideal for evaluations, data processing, and any workload that doesn't require real-time responses.

## 1. Get an API key

1. Go to <https://www.saas.parasail.io/keys>.
2. Click **Create API Key**.
3. Copy the key—it's shown only once.

Store it in your environment:

```bash
export PARASAIL_API_KEY="your-api-key-here"
```

## 2. Install the batch helper library

```bash
pip install openai-batch
```

The library is 100% compatible with both Parasail and OpenAI. Source code: [github.com/parasail-ai/openai-batch](https://github.com/parasail-ai/openai-batch).

## 3. Submit your first batch

```python
import random
from openai_batch import Batch

with Batch() as batch:
    objects = ["cat", "robot", "coffee mug", "spaceship", "banana"]
    for i in range(100):
        batch.add_to_batch(
            model="NousResearch/DeepHermes-3-Mistral-24B-Preview",
            messages=[{"role": "user", "content": f"Tell me a joke about a {random.choice(objects)}"}]
        )
    result, output_path, error_path = batch.submit_wait_download()
    print(f"Batch completed with status {result.status} and stored in {output_path}")
```

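Save the script as `test_batch.py` and run it. If you already exported `PARASAIL_API_KEY` in step 1, the inline assignment is optional:

```bash
PARASAIL_API_KEY="your-api-key-here" python3 test_batch.py
```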

## 4. Monitor via the Batch UI

The Parasail Batch UI lets you upload batches, track status, and download results. Open the **Batch** tab in the dashboard.

## Key features

* **50% off serverless pricing**—cached tokens get an additional 50% off.
* **OpenAI-compatible**—works with the OpenAI batch file format and API.
* **Any HuggingFace model**—no need for a dedicated deployment; just specify the HuggingFace ID.
* **Embeddings**—supports embedding models, not just chat.
* **Resumable**—submit a batch, then monitor and download in a separate process using the batch ID.
* **Up to 50,000 requests** and **1000 MB** per input file.
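
If you prefer to write the input file yourself rather than use the helper library, the sketch below builds lines in the OpenAI batch file format: one JSON object per line with `custom_id`, `method`, `url`, and `body`. The helper name `build_batch_line` and the `joke-N` IDs are illustrative. See the batch file format page for the fields Parasail accepts.

```python
import json


def build_batch_line(custom_id: str, model: str, prompt: str) -> str:
    """Serialize one chat completion request as a batch-file JSONL line."""
    request = {
        "custom_id": custom_id,  # echoed back so results can be matched
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    }
    return json.dumps(request)


lines = [
    build_batch_line(
        f"joke-{i}",
        "NousResearch/DeepHermes-3-Mistral-24B-Preview",
        f"Tell me a joke about a {subject}",
    )
    for i, subject in enumerate(["cat", "robot"])
]
print("\n".join(lines))
```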

## Next steps

* [Batch full guide](/parasail-docs/products/quickstart)—comprehensive batch processing guide
* [Batch helper GitHub repo](https://github.com/parasail-ai/openai-batch)—library source and README
* [Batch file format](/parasail-docs/products/quickstart/file-format)—JSONL input format (OpenAI-compatible)
* [Batch API reference](/parasail-docs/products/quickstart/api-reference)—OpenAI-compatible batch API spec
* [Batch with private models](/parasail-docs/products/quickstart/private-models)—batch processing with private HuggingFace repos


# Chat and Text Generation

Build chatbots, assistants, and text generation pipelines with Parasail's OpenAI-compatible API.

Use Parasail to build chatbots, virtual assistants, content generators, and any application that involves conversational AI. Our API is fully OpenAI-compatible—point your existing OpenAI SDK at our base URL and start building.

## Choose your deployment model

| Need                                           | Recommended product                                                                  |
| ---------------------------------------------- | ------------------------------------------------------------------------------------ |
| Quick experimentation or low-volume production | [Serverless](/parasail-docs/products/overview)—pay per token, no setup               |
| Production with latency/cost control           | [Dedicated instances](/parasail-docs/products/overview-1)—private GPUs, auto-scaling |
| Large-scale offline processing (evals, data)   | [Batch](/parasail-docs/products/quickstart)—50% off, up to millions of prompts       |

## Quickstart

Make your first chat completion call in under two minutes:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.parasail.io/v1",
    api_key="<PARASAIL_API_KEY>"
)

response = client.chat.completions.create(
    model="parasail-deepseek-r1",
    messages=[{"role": "user", "content": "Explain quantum computing in one paragraph."}]
)

print(response.choices[0].message.content)
```

See the [serverless quickstart](/parasail-docs/quickstart/serverless) for the full guide.

## Relevant guides

* [Chat completions guide](/parasail-docs/guides/chat-completions)—messages, roles, system prompts, and chat templates
* [Run and evaluate any model](/parasail-docs/guides/run-and-evaluate-any-model)—find the best model for your use case
* [Structured output](/parasail-docs/guides/structured-output)—force JSON-formatted responses
* [Tool/function calling](/parasail-docs/guides/tool-function-calling)—give the model tools to call

## API reference

* [Chat completions API](/parasail-docs/api-reference/chat-completions)
* [Parameters](/parasail-docs/api-reference/parameters)—temperature, top\_p, top\_k, and more
* [Responses API](/parasail-docs/products/overview/responses-api)—multi-turn agentic workflows with tool calling

## Available models

Browse all available serverless models:

```bash
curl https://api.parasail.io/v1/models \
  -H "Authorization: Bearer $PARASAIL_API_KEY"
```

See also: [Model recommendations](/parasail-docs/guides/model-recommendations), [Qwen3.5 notes](/parasail-docs/products/overview/qwen3.5-notes), [DeepSeek V3.X](/parasail-docs/products/overview/deepseek-v3x), [GPT-OSS](/parasail-docs/products/overview/gpt-oss).


# RAG and Embeddings

Build retrieval-augmented generation (RAG) pipelines and vector search systems with Parasail embeddings.

Build retrieval-augmented generation (RAG) pipelines, semantic search, and document similarity systems. Parasail supports leading open-source embedding models that rival proprietary alternatives.

## Embedding models

| Model                 | HuggingFace ID                      | Notes                                              |
| --------------------- | ----------------------------------- | -------------------------------------------------- |
| GritLM                | `parasail-ai/GritLM-7B-vllm`        | Developed by Contextual, rivals proprietary models |
| GTE-Qwen2-7B-Instruct | `Alibaba-NLP/gte-Qwen2-7B-instruct` | From Alibaba, strong general-purpose embeddings    |

Any embedding model on HuggingFace can be used with Parasail's batch and dedicated APIs.

## Generate embeddings

### Serverless (low volume)

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.parasail.io/v1",
    api_key="<PARASAIL_API_KEY>"
)

response = client.embeddings.create(
    model="parasail-ai/GritLM-7B-vllm",
    input="Parasail provides affordable cloud GPUs for AI inference.",
    encoding_format="base64"
)

print(response.data[0].embedding[:5])
```

### Batch (large scale—50% off)

```python
from openai_batch import Batch

with Batch() as batch:
    for i in range(100):
        batch.add_to_batch(
            model="parasail-ai/GritLM-7B-vllm",
            encoding_format="base64",
            input=f"This is input #{i}"
        )
    result, output_path, error_path = batch.submit_wait_download()
```

Parasail strongly recommends using `encoding_format="base64"` to reduce output file sizes.
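
When `encoding_format="base64"` is passed explicitly, the `embedding` field may come back as a base64 string rather than a list of floats. The sketch below decodes such a string, assuming the OpenAI convention of packed little-endian float32 values. That layout is an assumption here, so check it against a known embedding before relying on it.

```python
import base64
import struct


def decode_embedding(b64_embedding: str) -> list[float]:
    """Decode a base64 string of packed little-endian float32 values."""
    raw = base64.b64decode(b64_embedding)
    if len(raw) % 4 != 0:
        raise ValueError("embedding byte length is not a multiple of 4")
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


# Round-trip demo with made-up values that are exact in float32.
encoded = base64.b64encode(struct.pack("<3f", 0.5, -1.25, 2.0)).decode("ascii")
print(decode_embedding(encoded))
```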

## Relevant guides

* [RAG guide](/parasail-docs/guides/rag)—step-by-step RAG implementation
* [Run and evaluate any model](/parasail-docs/guides/run-and-evaluate-any-model)—includes embedding model evaluation

## API reference

* [Embeddings API](/parasail-docs/api-reference/embeddings)
* [Batch API](/parasail-docs/api-reference/batch-api)—for large-scale embedding generation
* [Batch embedding models](/parasail-docs/products/quickstart#batch-embedding-models)

## Next steps

* [Batch quickstart](/parasail-docs/quickstart/batch)—submit your first batch embedding job
* [Batch full guide](/parasail-docs/products/quickstart)—comprehensive batch processing documentation


# Batch Processing at Scale

Process large volumes of LLM inferences, embeddings, and multimodal inputs at scale with Parasail Batch.

Process thousands or millions of inferences offline at 50% off serverless pricing. Batch is ideal for evaluations, data processing, content generation, and any workload that doesn't need real-time responses.

## Why batch?

* **50% off** serverless token pricing
* **Additional 50% off** for cached tokens
* **Up to 50,000 requests** and **1000 MB** per input file
* **OpenAI-compatible**—use the same batch file format and API
* **Any HuggingFace model**—no dedicated deployment needed
* **Resumable**—submit in one process, monitor in another
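
Workloads larger than the per-file limits need to be split across several input files. The following sketch groups JSONL lines so that each group stays within both limits. The function name `split_into_files` is hypothetical, and treating 1 MB as 10**6 bytes is a conservative assumption.

```python
from typing import Iterator

MAX_REQUESTS = 50_000
MAX_BYTES = 1000 * 1000 * 1000  # assumption: 1 MB = 10**6 bytes


def split_into_files(
    lines: list[str],
    max_requests: int = MAX_REQUESTS,
    max_bytes: int = MAX_BYTES,
) -> Iterator[list[str]]:
    """Yield groups of JSONL lines that each fit within both limits."""
    current: list[str] = []
    current_bytes = 0
    for line in lines:
        line_bytes = len(line.encode("utf-8")) + 1  # +1 for the newline
        if line_bytes > max_bytes:
            raise ValueError("a single request exceeds the file size limit")
        if current and (
            len(current) >= max_requests or current_bytes + line_bytes > max_bytes
        ):
            yield current
            current, current_bytes = [], 0
        current.append(line)
        current_bytes += line_bytes
    if current:
        yield current


# Small limits make the behaviour visible: 5 requests, at most 2 per file.
demo = [f'{{"custom_id": "req-{i}"}}' for i in range(5)]
print([len(group) for group in split_into_files(demo, max_requests=2)])
```

Each yielded group can then be written to its own file and submitted as a separate batch.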

## Quickstart

```python
import random
from openai_batch import Batch

with Batch() as batch:
    objects = ["cat", "robot", "coffee mug", "spaceship", "banana"]
    for i in range(100):
        batch.add_to_batch(
            model="NousResearch/DeepHermes-3-Mistral-24B-Preview",
            messages=[{"role": "user", "content": f"Tell me a joke about a {random.choice(objects)}"}]
        )
    result, output_path, error_path = batch.submit_wait_download()
    print(f"Batch completed with status {result.status}")
```

See the [batch quickstart](/parasail-docs/quickstart/batch) for the full guide.

## What you can batch

| Task                | Example models               | Guide                                                                               |
| ------------------- | ---------------------------- | ----------------------------------------------------------------------------------- |
| Chat completions    | Any HuggingFace transformer  | [Batch full guide](/parasail-docs/products/quickstart)                              |
| Embeddings          | GritLM, GTE-Qwen2            | [Batch embedding models](/parasail-docs/products/quickstart#batch-embedding-models) |
| Image understanding | Qwen3-VL-8B-Instruct         | [Batch with images](/parasail-docs/products/quickstart/images)                      |
| Private models      | Any private HuggingFace repo | [Batch with private models](/parasail-docs/products/quickstart/private-models)      |

## Pricing

Batch is billed at **50% of serverless pricing**, based on parameter count. FP8 models incur no additional cost; FP16 models are 30% more. See [full pricing details](/parasail-docs/billing/pricing#batch-pricing).

## Relevant pages

* [Batch quickstart](/parasail-docs/products/quickstart)—comprehensive guide
* [Batch helper GitHub repo](https://github.com/parasail-ai/openai-batch)—library source and README
* [File format](/parasail-docs/products/quickstart/file-format)—JSONL input format
* [API reference](/parasail-docs/products/quickstart/api-reference)—OpenAI-compatible batch API
* [Priority](/parasail-docs/products/quickstart/priority)—batch priority levels
* [Troubleshooting](/parasail-docs/products/quickstart/troubleshooting)


# Agents and Tool Calling

Build agentic workflows with function calling, multi-step reasoning, and tool use on Parasail.

Build AI agents that can call functions, reason across multiple steps, and interact with external tools. Parasail supports both the Chat Completions API with tool calling and the Responses API for multi-turn agentic workflows.

## Two approaches

| Approach                 | Best for                                      | API                    |
| ------------------------ | --------------------------------------------- | ---------------------- |
| Chat Completions + tools | Single-model function calling                 | `/v1/chat/completions` |
| Responses API            | Multi-turn agentic loops, parallel tool calls | `/v1/responses`        |

## Function calling with Chat Completions

```python
from openai import OpenAI
import json

client = OpenAI(
    base_url="https://api.parasail.io/v1",
    api_key="<PARASAIL_API_KEY>"
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"}
                },
                "required": ["city"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="parasail-deepseek-r1",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools
)

print(response.choices[0].message.tool_calls)
```
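
The model only requests a tool call; your code has to run the function and send the result back. The sketch below shows that step with a hand-written tool call in the OpenAI Chat Completions shape. The `get_weather` body, `TOOL_REGISTRY`, and `run_tool_call` are stand-ins for illustration.

```python
import json


def get_weather(city: str) -> str:
    """Stand-in implementation; a real tool would query a weather service."""
    return f"Sunny in {city}"


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(tool_call: dict) -> dict:
    """Execute one tool call and return the matching `tool` role message."""
    function = tool_call["function"]
    arguments = json.loads(function["arguments"])  # arrives as a JSON string
    result = TOOL_REGISTRY[function["name"]](**arguments)
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": result,
    }


# Hand-written example in the shape of one entry of `message.tool_calls`.
example_call = {
    "id": "call_123",
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"city": "Tokyo"}'},
}
print(run_tool_call(example_call))
```

With the OpenAI SDK, `tool_call.model_dump()` converts each entry to this dict shape. Append the assistant message and the returned `tool` message to `messages`, then call the API again to get the final answer.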

## Responses API (multi-turn agentic)

The Responses API supports multi-turn conversations with parallel tool calls. It uses the standard Parasail gateway:

* **Base URL**: `https://api.parasail.io/v1`
* **Required**: Set `store: false` on every request (server-side state is not supported).

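```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.parasail.io/v1",
    api_key="<PARASAIL_API_KEY>"
)

# The OpenAI Responses API expects a flat tool schema (no nested "function" key).
tools = [{
    "type": "function",
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string", "description": "City name"}},
        "required": ["city"],
    },
}]

response = client.responses.create(
    model="parasail-kimi-k25-elicit",
    store=False,  # required: server-side state is not supported
    tools=tools,
    input=[{"role": "user", "content": "What's the weather in Tokyo?"}]
)
print(response.output)
```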
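
Because server-side state is not supported, each request has to carry the whole conversation so far. As a rough sketch, the next turn's `input` can be built by appending the model's `function_call` item and a matching `function_call_output` item to the history. The item shapes below follow the OpenAI Responses API and are assumptions here, so check the Responses API compatibility table before relying on them.

```python
import json


def next_turn_input(
    history: list[dict], function_call: dict, result: str
) -> list[dict]:
    """Return the full input for the next request after running a tool."""
    return history + [
        function_call,  # the model's own function_call item, echoed back
        {
            "type": "function_call_output",
            "call_id": function_call["call_id"],
            "output": result,
        },
    ]


history = [{"role": "user", "content": "What's the weather in Tokyo?"}]
# Hand-written item in the shape of a `function_call` entry in `response.output`.
function_call = {
    "type": "function_call",
    "call_id": "call_123",
    "name": "get_weather",
    "arguments": json.dumps({"city": "Tokyo"}),
}
print(json.dumps(next_turn_input(history, function_call, "Sunny"), indent=2))
```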

## Relevant guides

* [Tool/Function Calling guide](/parasail-docs/guides/tool-function-calling)—detailed function calling tutorial
* [Multimodal GUI Task Agent](/parasail-docs/guides/gui-agent)—GUI automation with vision + tool calling
* [Browser and UI Automation](/parasail-docs/guides/browser-and-ui-automation)—browser task automation

## API reference

* [Chat Completions API](/parasail-docs/api-reference/chat-completions)
* [Responses API](/parasail-docs/products/overview/responses-api)—full spec, compatibility table, and agentic loop example


# Browser and UI Automation

Automate browser tasks and GUI interactions with vision-capable models on Parasail.

Use vision-capable models to automate browser interactions, GUI tasks, and screen-based workflows. Parasail supports models that can see screenshots, reason about UI elements, and take actions.
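
As a sketch of how a screenshot might be sent to a vision-capable model, the helper below builds a user message in the OpenAI Chat Completions image format, with the image embedded as a Base64 data URL. The function name `screenshot_message` and the placeholder bytes are illustrative. See the Multi-Modal guide for the input formats Parasail supports.

```python
import base64


def screenshot_message(png_bytes: bytes, instruction: str) -> dict:
    """Build a user message pairing an instruction with a PNG screenshot."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": instruction},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{encoded}"},
            },
        ],
    }


# Placeholder bytes stand in for a real screenshot file.
message = screenshot_message(b"\x89PNG placeholder", "Which button submits the form?")
print(message["content"][1]["image_url"]["url"][:40])
```

The resulting dict can be passed in the `messages` list of a chat completion request.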

## What you can build

* Web scraping with visual understanding
* Automated form filling and UI testing
* GUI task agents that navigate desktop applications
* Multi-step browser workflows with vision + tool calling

## Relevant guides

* [Browser and UI Automation guide](/parasail-docs/guides/browser-and-ui-automation)—step-by-step browser automation tutorial
* [Multimodal GUI Task Agent](/parasail-docs/guides/gui-agent)—build a GUI agent that sees and acts
* [Multi-Modal guide](/parasail-docs/guides/multi-modal)—handling images, audio, and video as model inputs

## Deployment options

| Need                               | Recommended product                                                                 |
| ---------------------------------- | ----------------------------------------------------------------------------------- |
| Real-time automation (interactive) | [Dedicated instances](/parasail-docs/products/overview-1)—low latency, auto-scaling |
| Batch processing of screenshots    | [Batch](/parasail-docs/products/quickstart)—process many images offline             |

## API reference

* [Chat Completions API](/parasail-docs/api-reference/chat-completions)—send images as input
* [Responses API](/parasail-docs/products/overview/responses-api)—agentic loops with tool calling
* [Multi-modal guide](/parasail-docs/guides/multi-modal)—image, audio, and video input formats


# Image Generation and Editing

Run diffusion models for batch image generation and editing with Parasail.

Parasail supports batch image generation and editing for diffusion models such as [OmniGen](https://huggingface.co/Shitao/OmniGen-v1) and [Qwen-Image-Edit](https://huggingface.co/Qwen/Qwen-Image-Edit). Submit prompts via JSONL files and receive Base64-encoded output images.

## How it works

1. Create a JSONL file where each line is a prompt (with optional input images encoded as Base64).
2. Submit via the [Batch helper library](/parasail-docs/products/quickstart) or the [Batch API](/parasail-docs/products/quickstart/api-reference).
3. Download the results—output images are Base64-encoded in the response JSONL.
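
The last step can be sketched as follows. The code assumes each output line nests the images under `response.body.data[].b64_json`, mirroring the OpenAI batch output and images response shapes. That layout is an assumption, so inspect one line of your own output file to confirm it.

```python
import base64
import json


def extract_images(output_line: str) -> list[bytes]:
    """Decode every Base64 image found in one line of the output JSONL."""
    record = json.loads(output_line)
    # Assumed layout: response -> body -> data -> [{"b64_json": "..."}]
    body = (record.get("response") or {}).get("body") or {}
    data = body.get("data") or []
    return [base64.b64decode(item["b64_json"]) for item in data if "b64_json" in item]


# Hand-made line in the assumed layout; the payload is not a real image.
sample_line = json.dumps(
    {
        "custom_id": "img-0",
        "response": {
            "body": {"data": [{"b64_json": base64.b64encode(b"fake").decode("ascii")}]}
        },
    }
)
print(extract_images(sample_line))
```

Each returned `bytes` object can be written to disk with `Path(...).write_bytes(...)`.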

## Supported parameters

| Parameter         | Description                                                    |
| ----------------- | -------------------------------------------------------------- |
| `model`           | For example `Shitao/OmniGen-v1` or `Qwen/Qwen-Image-Edit`      |
| `prompt`          | The prompt to guide generation. Reference input images as `<img><\|image_1\|></img>` |
| `size`            | Formatted as `WxH` in pixels, for example, `1024x1024`         |
| `image`           | A list of Base64-encoded JPGs or PNGs                          |
| `response_format` | Currently only `b64_json` is supported                         |

## Quickstart example

```python
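import base64
from pathlib import Path

from openai_batch import Batch

# Hypothetical input folder; point this at your own images.
image_files = sorted(Path("input_images").glob("*.png"))

with Batch() as batch:
    for image_path in image_files:
        # Encode each input image as Base64 for the `image` parameter.
        with open(image_path, "rb") as img_file:
            base64_image = base64.b64encode(img_file.read()).decode("utf-8")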

        batch.add_to_batch(
            model="Shitao/OmniGen-v1",
            prompt="A man in a black shirt is reading a book. The man is the right man in <img><|image_1|></img>.",
            size="1024x1024",
            image=[base64_image],
            response_format="b64_json",
        )

    result, output_path, error_path = batch.submit_wait_download()
```

## Full guide

See the [Image Generation and Editing product page](/parasail-docs/products/overview-2) for the complete guide, including a full Python script that processes a directory of images and saves the results.

## Relevant pages

* [Batch quickstart](/parasail-docs/products/quickstart)—batch helper library fundamentals
* [Batch with images](/parasail-docs/products/quickstart/images)—batch image processing reference
* [Batch API](/parasail-docs/products/quickstart/api-reference)—OpenAI-compatible API spec


# Serverless

Parasail Serverless provides pay-per-token API access to popular open-source models.

## API Usage:

#### UI API:

All serverless models are accessible through our OpenAI-compatible API. The API gateway is <https://api.parasail.io/v1>.

Click any model on the Serverless page to see example code for calling its endpoint, like this:

```python
# pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="https://api.parasail.io/v1",
    api_key="<PARASAIL_API_KEY>"
)

chat_completion = client.chat.completions.create(
    model="parasail-deepseek-r1",
    messages=[{"role": "user", "content": "What is the capital of New York?"}]
)

print(chat_completion.choices[0].message.content)
```

You need an API key to authenticate requests. Create one from your profile.

#### Getting an API Key:

Go to <https://www.saas.parasail.io/keys> and click **Create API Key**.

**The API key is displayed only once, at creation, and cannot be viewed again.**

#### Programmatically Find Models:

List the available models programmatically with the `/models` endpoint:

```bash
curl https://api.parasail.io/v1/models \
  -H "Authorization: Bearer YOUR_API_KEY"
```

Replace `YOUR_API_KEY` with the key you created in the previous section.

## Using Third-Party Tools Like Cline, AnythingLLM:

Because the serverless API follows the OpenAI spec, third-party tools that support OpenAI-compatible providers work with Parasail just as they do with other popular providers.

Take Cline, a popular coding assistant for Visual Studio Code, as an example:

1. On the Serverless page, find the model you want to use with Cline, such as QwenCoder 32b or DeepSeekR1, and note its model name.
2. Have your [API key ready](#getting-an-api-key).
3. In the tool's OpenAI-compatible provider settings, enter the Parasail base URL (`https://api.parasail.io/v1`), your API key, and the model name.


# Available Parameters

Parasail currently supports the sampling parameters that vLLM supports.

#### **Main Sampling Parameters:**

1. **`temperature`** (default: 1.0)
   * Controls randomness in token selection.
   * Higher values (>1.0) increase randomness.
   * Lower values (<1.0) make outputs more deterministic.
   * A value of 0 forces greedy decoding.
2. **`top_p` (Nucleus Sampling)** (default: 1.0)
   * Controls the probability mass of token selection.
   * Only considers the smallest set of most probable tokens whose cumulative probability reaches `top_p`.
   * Lower values (for example, 0.9) limit token choices to more likely options.
   * Higher values (close to 1.0) allow a broader range of tokens.
3. **`top_k`** (default: `-1`, which means disabled)
   * Limits token selection to the top `k` most probable tokens.
   * Lower values (for example, `top_k=50`) make output more deterministic.
   * If `-1`, this setting is ignored.
   * Since `top_k` is not defined in OpenAI spec, it should be passed in `extra_body` field.
4. **`max_tokens`** (default: `None`)
   * Sets the maximum number of tokens to generate.
   * Helps prevent excessively long responses.
5. **`repetition_penalty`** (default: 1.0)
   * Penalizes repeated tokens to avoid looping responses.
   * Values >1.0 discourage repetition.
   * Common values: 1.1 - 1.2.
6. **`presence_penalty`** (default: 0.0)
   * Increases the likelihood of introducing new tokens.
   * Useful for making outputs more diverse.
7. **`frequency_penalty`** (default: 0.0)
   * Penalizes tokens that have appeared frequently.
   * Helps prevent excessive repetition of common words.
8. **`seed`** (default: `None`)
   * Sets a fixed seed for reproducible results.
   * Useful for debugging or deterministic sampling.

#### **How These Work Together:**

* Setting **`temperature=0`** forces deterministic output (greedy decoding).
* Using **`top_p`** and **`top_k`** together balances diversity and coherence.
* **`repetition_penalty`** and **`presence_penalty`** help avoid repetitive loops.
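As a rough illustration of how these parameters are passed with the OpenAI Python client, the sketch below splits them into top-level arguments and `extra_body`. The helper `build_sampling_kwargs` is hypothetical, and it assumes that any parameter outside the OpenAI spec (such as `top_k` and `repetition_penalty`) belongs in `extra_body`.

```python
# Parameters from the list above that the OpenAI chat-completions spec defines.
OPENAI_SPEC_PARAMS = {
    "temperature",
    "top_p",
    "max_tokens",
    "presence_penalty",
    "frequency_penalty",
    "seed",
}


def build_sampling_kwargs(**params) -> dict:
    """Split sampling parameters into top-level kwargs and extra_body."""
    kwargs, extra = {}, {}
    for name, value in params.items():
        if name in OPENAI_SPEC_PARAMS:
            kwargs[name] = value
        else:
            extra[name] = value  # e.g. top_k, repetition_penalty
    if extra:
        kwargs["extra_body"] = extra
    return kwargs


sampling = build_sampling_kwargs(temperature=0.7, top_p=0.9, top_k=50, repetition_penalty=1.1)
print(sampling)
```

The result can be unpacked into a request, for example `client.chat.completions.create(model=..., messages=..., **sampling)`.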


# Qwen3.5 Notes

Turn off Qwen3.5 thinking mode and set model-specific parameters on Parasail Serverless.

#### How to Turn Off Thinking

```python
from openai import OpenAI
import os
# Configured by environment variables
client = OpenAI(base_url="https://api.parasail.io/v1",
    api_key=os.environ["PARASAIL_API_KEY"])

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is the capital of New York?"
            }
        ]
    }
]

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3.5-35B-A3B",
    messages=messages,
    max_tokens=32768,
    temperature=0.7,
    top_p=0.8,
    presence_penalty=1.5,
    extra_body={
        "top_k": 20,
        "chat_template_kwargs": {"enable_thinking": False},
    }, 
)
print("Chat response:", chat_response)
```

## Next steps

* [Serverless overview](/parasail-docs/products/overview)
* [Serverless parameters](/parasail-docs/products/overview/parameters)
* [DeepSeek V3.x parameters](/parasail-docs/products/overview/deepseek-v3x)
* [GPT-OSS notes](/parasail-docs/products/overview/gpt-oss)
* [Chat Completions guide](/parasail-docs/guides/chat-completions)


# DeepSeek V3.x

Toggle DeepSeek V3.1 thinking mode via the chat\_template\_kwargs parameter on Parasail Serverless.

### DeepSeek V3.1

Thinking can be turned on or off with the `thinking` key inside the `chat_template_kwargs` parameter of the request body. The value must be a JSON boolean (`true` or `false`), not a string such as `"True"` or `"true"`.

Python + OpenAI Library:

```python
# pip install openai
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://api.parasail.io/v1",
    api_key=os.environ["PARASAIL_API_KEY"]
)

chat_completion = client.chat.completions.create(
    model="parasail-deepseek-31",
    messages=[{"role": "user", "content": "What is the capital of New York?"}],
    extra_body={
        "chat_template_kwargs": {"thinking": True},  # or False, for no-think mode
    }
)

print(chat_completion.choices[0].message.content)
```

CURL:

{% code overflow="wrap" %}

```sh
curl -X POST https://api.parasail.io/v1/chat/completions \
    -H "Authorization: Bearer $PARASAIL_API_KEY" \
    -H "content-type: application/json" \
    -d '{"model": "parasail-deepseek-31", "chat_template_kwargs": {"thinking": false}, "messages": [{"role": "user", "content": "What is the capital of New York?"}]}'
```

{% endcode %}
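To make the boolean requirement concrete, the following sketch builds the request body by hand and rejects string values before they reach the API. The helper name `thinking_body` is illustrative only; the point is that a Python `bool` serializes to a bare JSON `true` or `false`.

```python
import json


def thinking_body(model: str, prompt: str, thinking: bool) -> str:
    """Serialize a chat request with a real boolean for `thinking`."""
    if not isinstance(thinking, bool):
        raise TypeError("thinking must be a bool, not a string such as 'true'")
    return json.dumps({
        "model": model,
        "chat_template_kwargs": {"thinking": thinking},
        "messages": [{"role": "user", "content": prompt}],
    })


body = thinking_body("parasail-deepseek-31", "What is the capital of New York?", False)
print(body)  # the thinking value appears as a bare false, without quotes
```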

## Next steps

* [Serverless overview](/parasail-docs/products/overview)
* [Serverless parameters](/parasail-docs/products/overview/parameters)
* [GPT-OSS notes](/parasail-docs/products/overview/gpt-oss)
* [Qwen3.5 notes](/parasail-docs/products/overview/qwen3.5-notes)
* [Chat Completions guide](/parasail-docs/guides/chat-completions)


# GPT-OSS 20b and 120b

Control GPT-OSS 20b and 120b reasoning with the thinking\_budget and reasoning\_effort parameters on Parasail Serverless.

## GPT-OSS Reasoning Control

GPT-OSS models support **explicit reasoning control** via two parameters:

* **`thinking_budget`**—limits how many internal *thinking tokens* the model can use
* **`reasoning_effort`**—qualitative control over how hard the model reasons

These are especially useful for trading off **depth vs latency/cost**.

***

#### Parameters

**`thinking_budget`**

* Integer
* Upper bound on internal reasoning tokens
* Higher = deeper reasoning, slower & more expensive

**`reasoning_effort`**

* `"low" | "medium" | "high"`
* Controls how deliberate the reasoning style is
* Typically paired with a thinking budget

***

### Python (OpenAI client)

```python
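import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.parasail.io/v1",
    api_key=os.environ["PARASAIL_API_KEY"],
)

# Reasoning controls are not part of the OpenAI spec, so they go in extra_body
extra_params = {
    "extra_body": {
        "custom_params": {"thinking_budget": 40},
        "chat_template_kwargs": {"reasoning_effort": "high"},
    }
}

resp = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[
        {"role": "user", "content": "Explain why the sky is blue."}
    ],
    **extra_params,
)

print(resp.choices[0].message.content)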
```

***

### Direct HTTP (curl)

```bash
curl https://api.parasail.io/v1/chat/completions \
  -H "Authorization: Bearer $PARASAIL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "messages": [
      { "role": "user", "content": "Explain why the sky is blue." }
    ],
    "custom_params": {
      "thinking_budget": 40
    },
    "chat_template_kwargs": {
      "reasoning_effort": "high"
    }
  }'
```

***

### Practical guidance

* **Fast / cheap**: `thinking_budget: 10`, `reasoning_effort: "low"`
* **Balanced default**: `thinking_budget: 25`, `reasoning_effort: "medium"`
* **Deep reasoning**: `thinking_budget: 40+`, `reasoning_effort: "high"`

> These controls affect *internal reasoning only*—the final response stays concise unless you ask otherwise.
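The three presets above can be captured in a small lookup table. This is only a sketch: the preset names and the `reasoning_extra_body` helper are made up for illustration, and the "deep" preset uses 40 as the lower end of the `40+` range.

```python
# Preset values taken from the practical guidance above.
REASONING_PRESETS = {
    "fast": {"thinking_budget": 10, "reasoning_effort": "low"},
    "balanced": {"thinking_budget": 25, "reasoning_effort": "medium"},
    "deep": {"thinking_budget": 40, "reasoning_effort": "high"},
}


def reasoning_extra_body(preset: str) -> dict:
    """Return the extra_body payload for a named reasoning preset."""
    cfg = REASONING_PRESETS[preset]
    return {
        "custom_params": {"thinking_budget": cfg["thinking_budget"]},
        "chat_template_kwargs": {"reasoning_effort": cfg["reasoning_effort"]},
    }


print(reasoning_extra_body("balanced"))
```

Pass the result as `extra_body=reasoning_extra_body("balanced")` when calling `client.chat.completions.create(...)`.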

## Structured outputs (JSON Schema) with GPT-OSS

**Goal:** force the model to return *only valid JSON* that matches your schema. On Parasail, you do this by sending `response_format.type = "json_schema"` (same shape as OpenAI), and—optionally—adding GPT-OSS reasoning knobs via `extra_body`.

***

**Minimal example (Parasail + GPT-OSS)**

```python
import os, json
from openai import OpenAI

MODEL = "parasail-gpt-oss-20b-fast"
client = OpenAI(api_key=os.environ["PARASAIL_API_KEY"], base_url="https://api.parasail.io/v1")

PRODUCT_REVIEW_SCHEMA = {
  "type": "object",
  "properties": {
    "rating": {"type": "integer", "minimum": 1, "maximum": 5},
    "pros":   {"type": "array", "items": {"type": "string"}},
    "cons":   {"type": "array", "items": {"type": "string"}},
    "summary":{"type": "string"},
  },
  "required": ["rating", "pros", "cons", "summary"],
  "additionalProperties": False,
}

resp = client.chat.completions.create(
  model=MODEL,
  messages=[{"role":"user","content":"Write a short review of a flagship smartphone."}],
  response_format={
    "type": "json_schema",
    "json_schema": {"name": "product_review", "strict": True, "schema": PRODUCT_REVIEW_SCHEMA},
  },
  # GPT-OSS / Parasail knobs (optional)
  extra_body={
    "custom_params": {"thinking_budget": 50},
    "chat_template_kwargs": {"reasoning_effort": "medium"},
  },
)

data = json.loads(resp.choices[0].message.content)
print(data)
```

**Notes**

* `strict: True` + `additionalProperties: False` is the combo that keeps output tight.
* Always `json.loads(...)` the returned `message.content` and treat that as the source of truth.
* If you stream, you'll still receive JSON text in `delta.content`; concatenate the chunks and parse the result once the stream ends.
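The streaming note can be sketched as follows. The helper `parse_streamed_json` is hypothetical; with the OpenAI client, the fragments would typically come from `chunk.choices[0].delta.content` for each streamed chunk, and the sample fragments here are invented to keep the example offline.

```python
import json
from typing import Iterable, Optional


def parse_streamed_json(deltas: Iterable[Optional[str]]) -> dict:
    """Join streamed text fragments and parse them once the stream has ended."""
    text = "".join(d for d in deltas if d)  # skip None or empty deltas
    return json.loads(text)


# Invented fragments standing in for delta.content values from a stream.
fragments = ['{"rating": 4, ', None, '"pros": ["fast"], "cons": [], ', '"summary": "Solid."}']
review = parse_streamed_json(fragments)
print(review["rating"], review["summary"])
```

Parsing only after the last chunk matters because any individual fragment is usually not valid JSON on its own.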

## Next steps

* [Serverless overview](/parasail-docs/products/overview)
* [Serverless parameters](/parasail-docs/products/overview/parameters)
* [DeepSeek V3.x parameters](/parasail-docs/products/overview/deepseek-v3x)
* [Qwen3.5 notes](/parasail-docs/products/overview/qwen3.5-notes)
* [Chat Completions guide](/parasail-docs/guides/chat-completions)


# Responses API

Use the OpenAI Responses API for multi-turn, agentic workflows with built-in tool calling.

{% hint style="info" %}
The Responses API uses the standard Parasail gateway. Use the base URL below with your existing Parasail API key.
{% endhint %}

## Base URL

```
https://api.parasail.io/v1
```

Use your existing Parasail API key for authentication. The API key is the same as for all other Parasail endpoints.

## Requirements

| Requirement           | Details                                                                                                                              |
| --------------------- | ------------------------------------------------------------------------------------------------------------------------------------ |
| **Base URL**          | `https://api.parasail.io/v1`                                                                                                         |
| **`store` parameter** | Must be set to `false` on every request. The gateway does not support server-side state storage. Omitting this will return an error. |
| **Authentication**    | `Authorization: Bearer <PARASAIL_API_KEY>` (same key as the standard API)                                                            |

## Compatibility

The table below shows which Responses API features are supported.

| Feature                                    | Supported | Notes                                                         |
| ------------------------------------------ | --------- | ------------------------------------------------------------- |
| Single-turn completions                    | ✅         |                                                               |
| Multi-turn conversations                   | ✅         | Pass full conversation history in `input`                     |
| Reasoning / thinking                       | ✅         | Reasoning content returned automatically for reasoning models |
| Function / tool calling                    | ✅         | Define tools with `tools` parameter                           |
| Parallel tool calls                        | ✅         | Model may issue multiple tool calls in one turn               |
| `tool_choice` (`auto`, `required`, `none`) | ✅         |                                                               |
| `store: true` (server-side state)          | ❌         | Must set `store: false`                                       |
| `previous_response_id` (stateful chaining) | ❌         | Not supported—pass full history in `input` instead            |
| Streaming                                  | ✅         | Use `stream: true`                                            |
| Web search tool                            | ❌         | Not currently available                                       |
| File search tool                           | ❌         | Not currently available                                       |
| Computer use tool                          | ❌         | Not currently available                                       |
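One way to guard against the two stateful features marked unsupported above is to check request arguments before sending them. The sketch below is an assumption-laden convenience, not part of the Parasail API: `prepare_responses_kwargs` is a hypothetical helper that forces `store=False` and rejects `previous_response_id`.

```python
def prepare_responses_kwargs(**kwargs) -> dict:
    """Validate Responses API arguments against the gateway's constraints."""
    if kwargs.get("previous_response_id") is not None:
        raise ValueError(
            "previous_response_id is not supported; pass the full history in `input`"
        )
    if kwargs.get("store", False) is not False:
        raise ValueError("store must be False; server-side state is not supported")
    # Always send store explicitly, because omitting it returns an error.
    return {**kwargs, "store": False}


request_kwargs = prepare_responses_kwargs(model="parasail-kimi-k25-elicit", input="Hello")
print(request_kwargs)
```

The validated dictionary can then be unpacked into `client.responses.create(**request_kwargs)`.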

## Quick Start

### Python (OpenAI SDK)

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.parasail.io/v1",
    api_key="<PARASAIL_API_KEY>"
)

response = client.responses.create(
    model="parasail-kimi-k25-elicit",
    store=False,
    input="Explain the difference between TCP and UDP in two sentences."
)

print(response.output_text)
```

### cURL

```bash
curl -s https://api.parasail.io/v1/responses \
  -H "Authorization: Bearer $PARASAIL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "parasail-kimi-k25-elicit",
    "store": false,
    "input": "Explain the difference between TCP and UDP in two sentences."
  }'
```

## Multi-Turn Conversations

Since `previous_response_id` is not supported, you maintain conversation history by passing the full message array in `input`. Each turn appends the assistant's reply and the next user message.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.parasail.io/v1",
    api_key="<PARASAIL_API_KEY>"
)

# Turn 1
response = client.responses.create(
    model="parasail-kimi-k25-elicit",
    store=False,
    input=[
        {"role": "user", "content": "I'm planning a trip to Tokyo. What should I see on day 1?"}
    ]
)
turn1_text = response.output_text
print("Turn 1:", turn1_text)

# Turn 2 — pass the full history
response = client.responses.create(
    model="parasail-kimi-k25-elicit",
    store=False,
    input=[
        {"role": "user", "content": "I'm planning a trip to Tokyo. What should I see on day 1?"},
        {"role": "assistant", "content": turn1_text},
        {"role": "user", "content": "What about day 2? Suggest different areas."}
    ]
)
print("Turn 2:", response.output_text)
```
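For longer conversations, repeating earlier messages by hand becomes error-prone. A minimal sketch of a helper that grows the history between turns is shown below; the function name `extend_history` is illustrative, and the assistant text is a placeholder for `response.output_text`.

```python
def extend_history(history: list, assistant_text: str, user_text: str) -> list:
    """Return a new history with the assistant reply and the next user message."""
    return history + [
        {"role": "assistant", "content": assistant_text},
        {"role": "user", "content": user_text},
    ]


history = [{"role": "user", "content": "I'm planning a trip to Tokyo. What should I see on day 1?"}]
# In a real exchange, the second argument would be response.output_text.
history = extend_history(history, "(assistant reply)", "What about day 2? Suggest different areas.")
print([message["role"] for message in history])
```

The resulting list is what you pass as `input` on the next `client.responses.create(...)` call.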

## Function Calling (Agentic)

The Responses API supports tool definitions and multi-step agentic loops where the model calls functions and you return results.

### Define Tools

```python
tools = [
    {
        "type": "function",
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "units": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "default": "celsius"
                }
            },
            "required": ["city"]
        }
    }
]
```

### Agentic Loop

```python
from openai import OpenAI
import json

client = OpenAI(
    base_url="https://api.parasail.io/v1",
    api_key="<PARASAIL_API_KEY>"
)

tools = [...]  # as defined above

def handle_tool_call(name, arguments):
    """Your application logic — call real APIs, databases, etc."""
    args = json.loads(arguments)
    if name == "get_weather":
        return json.dumps({"city": args["city"], "temperature": 22, "condition": "sunny"})
    return json.dumps({"error": "unknown function"})

# Keep the full history client-side, since server-side state is not supported
input_items = [{"role": "user", "content": "What's the weather in Tokyo?"}]

# Initial request
response = client.responses.create(
    model="parasail-kimi-k25-elicit",
    store=False,
    tools=tools,
    input=input_items
)

# Agentic loop: keep going until the model stops calling tools
while response.output:
    tool_calls = [item for item in response.output if item.type == "function_call"]
    if not tool_calls:
        break  # Model produced a final text response

    # Extend the history with the model's output, then the tool results
    input_items += response.output
    for tool_call in tool_calls:
        result = handle_tool_call(tool_call.name, tool_call.arguments)
        input_items.append({
            "type": "function_call_output",
            "call_id": tool_call.call_id,
            "output": result
        })

    response = client.responses.create(
        model="parasail-kimi-k25-elicit",
        store=False,
        tools=tools,
        input=input_items
    )

# Print final response
print(response.output_text)
```

The model may call multiple tools in parallel within a single turn. Always provide results for every `function_call` before sending the next request.
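A quick pre-flight check can confirm that every call has a matching result before the next request goes out. This is a sketch under the assumption that items carry `type` and `call_id` fields, as in the loop above; `unanswered_call_ids` is a hypothetical helper that accepts both plain dictionaries and SDK objects.

```python
def _field(item, name):
    """Read a field from either a dict or an SDK object."""
    if isinstance(item, dict):
        return item.get(name)
    return getattr(item, name, None)


def unanswered_call_ids(items) -> list:
    """Return call_ids of function_call items that lack a function_call_output."""
    called = [_field(i, "call_id") for i in items if _field(i, "type") == "function_call"]
    answered = {_field(i, "call_id") for i in items if _field(i, "type") == "function_call_output"}
    return [call_id for call_id in called if call_id not in answered]


pending = unanswered_call_ids([
    {"role": "user", "content": "Weather in Tokyo and Osaka?"},
    {"type": "function_call", "call_id": "call_1", "name": "get_weather", "arguments": "{}"},
    {"type": "function_call", "call_id": "call_2", "name": "get_weather", "arguments": "{}"},
    {"type": "function_call_output", "call_id": "call_1", "output": "{}"},
])
print(pending)
```

An empty list means the input is safe to send; anything else names the calls that still need a result.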

## Important Notes

* **Always set `store: false`.** Every request must include `"store": false`. Server-side response storage is not supported, and omitting this parameter will result in an error.
* **No stateful chaining.** The `previous_response_id` field is not supported. Manage conversation history client-side by passing the full `input` array each turn.
* **Same API key.** Your existing Parasail API key works across all endpoints. No separate key is needed.
* **Model availability.** Not all models are available for the Responses API. Use the `/v1/models` endpoint to check:

```bash
curl -s https://api.parasail.io/v1/models \
  -H "Authorization: Bearer $PARASAIL_API_KEY"
```


# Dedicated Instances

Deploy any Hugging Face model on a private GPU endpoint with full control over hardware, scaling, and latency.

Dedicated Instances are private endpoints that let you run most Hugging Face transformers on our full range of GPUs in a few clicks, at competitive prices. They are ideal for production endpoints that need full control over performance and QoS, and for experimenting with models that aren't available on public serverless endpoints.

## When to use Dedicated Instances

Use Dedicated Instances when you need to:

* Run a specific or private model that isn't offered on [Serverless](/parasail-docs/products/overview).
* Guarantee predictable latency and throughput for a production workload.
* Control the hardware, quantization, scaling, and concurrency of your deployment.

If you just want to call a popular open-source model with no setup, use [Serverless](/parasail-docs/products/overview) instead. For large offline workloads, use [Batch](/parasail-docs/products/quickstart).

## Advantages

* **Experiment and deploy from one place**—run most Hugging Face transformers by entering the URL.
* **Private models**—run custom models from your private Hugging Face repo with an access token.
* **Hardware choice**—select from a variety of GPU types, from RTX 4090 up to H200, to match your cost and performance needs.
* **Scale with traffic**—add load-balanced replicas as demand grows.
* **Aggressive latency targets**—through GPU selection, model quantization, and speculative decoding.
* **Pay only when active**—no contracts or commitments; pause and resume at any time, and billing only runs while the model is active.

## Prerequisites

* A Parasail account and an API key ([create one here](https://www.saas.parasail.io/keys)).
* A model hosted publicly or privately on Hugging Face. Local models can be loaded by pushing them to a private Hugging Face repo.

## Model compatibility

Behind the scenes, we work with a variety of inference stacks to maintain the best model compatibility and performance. For models launched in coordination with the community (for example, Llama models), we typically have 0-day support. Models dropped unannounced can take a few days to a week. If a model you need isn't supported, let us know on [Discord](https://discord.gg/parasail-ai) or through your support channel and we'll investigate.

We currently support most transformers. Support for diffusion models and more conventional AI models like CNNs or LSTMs is further down our roadmap. Some models aren't supported because they contain local code execution in the model repo; we can whitelist trusted model providers, so contact us if you'd like us to investigate a specific model.

## Create a Dedicated Instance

Open the [Create New Dedicated Instance](https://www.saas.parasail.io/dedicated/new) page, then configure the deployment:

* **Deployment Name**—a name you choose; you'll use it as the model name when calling the endpoint.
* **Hugging Face ID / URL**—the model to load. You can paste either the full URL or the model ID. For example, for the FP8 version of Qwen2 72B Instruct, use `https://huggingface.co/neuralmagic/Qwen2-72B-Instruct-FP8` or `neuralmagic/Qwen2-72B-Instruct-FP8`. (Hugging Face has a copy button next to each model ID.)
* **Hugging Face Token**—load a model from a private Hugging Face repo, ideal for models fine-tuned on private data. See [Private Hugging Face Models](/parasail-docs/products/overview-1/private-hf-models) for how to generate a token.

If the model is supported you'll get a green confirmation message. If it isn't, contact us on Discord or through your support channel.

After selecting your model, choose a hardware service tier, then hit **Deploy**. To set advanced options, expand the **Advanced** dropdown.

Once deployed, you're taken to the [status page](#dedicated-status-page). If the model weights were cached previously, startup takes 3–5 minutes. If the weights must be downloaded from Hugging Face, it can take from 10 minutes to several hours depending on model size.

### Advanced options

* **Speculative Decoding Draft Model Name**—set a draft model for faster output with speculative decoding. This is an advanced feature; read [Speculative Decoding](/parasail-docs/products/overview-1/speculative-decoding) first.
* **Replicas**—spin up multiple load-balanced replicas for high-throughput production. The hourly cost multiplies by the number of replicas.
* **Model Mode**—most models autodetect, but some support multiple modes from the same weights (for example GritLM, which is both generative and embedding). Use this to force a specific mode if autodetect picks the wrong type.
* **Scale Down Policy**—for experimentation or R\&D, set a scale-down policy to shut the model down after a period of inactivity, so a forgotten endpoint doesn't run up a bill. See [Auto-Scaling](/parasail-docs/products/overview-1/auto-scaling).
* **Max Concurrent Connections per Replica**—when concurrent requests reach this limit, the endpoint returns **429 Too Many Requests** until space frees up. This prevents overload and protects quality of service.
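Clients calling a dedicated endpoint may want to handle the **429 Too Many Requests** case by retrying with backoff. The sketch below is one possible approach, not a Parasail requirement: `call_with_backoff` is a hypothetical helper, and `send` stands for any function you supply that performs the request and returns a `(status_code, body)` pair.

```python
import random
import time
from typing import Callable, Tuple


def call_with_backoff(
    send: Callable[[], Tuple[int, str]],
    max_retries: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Retry `send` with exponential backoff while it returns HTTP 429."""
    for attempt in range(max_retries + 1):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_retries:
            # Exponential backoff plus jitter to avoid synchronized retries.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    return status, body  # still 429 after all retries


# Offline demonstration with a fake endpoint that is busy twice, then succeeds.
fake_responses = iter([(429, ""), (429, ""), (200, "ok")])
result = call_with_backoff(lambda: next(fake_responses), sleep=lambda seconds: None)
print(result)
```

Raising the connection limit or adding replicas is the longer-term fix if retries become frequent.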

## Dedicated status page <a href="#dedicated-status-page" id="dedicated-status-page"></a>

Once a model is deployed, you can control it and view its metrics on the status page.

* **Pause**—temporarily deactivate the endpoint and pause billing.
* **Edit**—change model settings and redeploy.
* **Destroy**—permanently delete the endpoint.

## Next steps

* [Dedicated Instances Quickstart](/parasail-docs/quickstart/dedicated)—deploy and call your first model
* [Private Hugging Face Models](/parasail-docs/products/overview-1/private-hf-models)—deploy models from a private repo
* [Auto-Scaling](/parasail-docs/products/overview-1/auto-scaling)—scale replicas for production traffic
* [Management API](/parasail-docs/products/overview-1/management-api)—create and manage deployments programmatically
* [Speculative Decoding](/parasail-docs/products/overview-1/speculative-decoding)—reduce latency with a draft model


# Speculative Decoding

Speed up dedicated models 1.5-2x by adding a draft model for speculative decoding on the same GPU.

Speculative decoding is an inference technique that can yield a 1.5-2x improvement in output tokens per second with little overhead. A lightweight model called the ***draft model*** runs ahead of the main ***target model*** and proposes the next few tokens, which the target model then verifies. On the Parasail platform, you enable speculative decoding when creating a dedicated deployment. The draft model runs on the same GPU as the target model, so no additional hardware resources are needed.
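To illustrate the draft-and-verify idea, here is a toy greedy sketch. It is not how Parasail or any inference engine implements the feature: the "models" are simple arithmetic functions invented for the example, and a real engine verifies all drafted positions in one batched forward pass rather than in a loop.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token function


def speculative_round(prefix: List[int], draft_next: NextToken,
                      target_next: NextToken, k: int = 4) -> List[int]:
    """Run one draft-then-verify round and return the tokens to append."""
    # 1. The draft model cheaply proposes k tokens, one after another.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    # 2. The target model checks each proposed position in order.
    accepted: List[int] = []
    for token in proposal:
        expected = target_next(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # first mismatch: keep the target's token
            return accepted
        accepted.append(token)

    # 3. Every guess matched, so the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prefix: List[int], draft_next: NextToken,
             target_next: NextToken, length: int, k: int = 4) -> List[int]:
    """Generate `length` tokens using repeated speculative rounds."""
    out = list(prefix)
    while len(out) - len(prefix) < length:
        out += speculative_round(out, draft_next, target_next, k)
    return out[: len(prefix) + length]


def greedy(prefix: List[int], target_next: NextToken, length: int) -> List[int]:
    """Plain token-by-token decoding with the target model only."""
    out = list(prefix)
    for _ in range(length):
        out.append(target_next(out))
    return out


# Toy stand-ins for real models (purely illustrative).
def toy_target(seq: List[int]) -> int:
    return (sum(seq) + len(seq)) % 10


def toy_draft(seq: List[int]) -> int:
    guess = toy_target(seq)
    return guess if len(seq) % 3 else (guess + 1) % 10  # wrong at every third position


print(generate([1], toy_draft, toy_target, 12) == greedy([1], toy_target, 12))
```

The output matches plain decoding with the target model because every accepted token is one the target itself would have chosen; the speedup comes from accepting several tokens per target pass when the draft guesses well.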

In this example, we set up speculative decoding for the FP8 variant of Qwen2.5-VL-72B ([nm-testing/Qwen2-VL-72B-Instruct-FP8-dynamic](https://huggingface.co/parasail-ai/Qwen2.5-VL-72B-Instruct-FP8-Dynamic)). The draft model must be significantly smaller and lighter than the target model, and it must share the target's vocabulary, which in practice means it must come from the same model family. The draft model used in this example is [Qwen/Qwen2.5-VL-7B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct-AWQ/tree/main).

The out-of-the-box output speed for Qwen2-VL-72B on a single NVIDIA H200 is around 30 tokens per second (TPS). Let's see how much we can speed that up. First, create a dedicated model:

![](https://lh7-rt.googleusercontent.com/docsz/AD_4nXfB69AmqqWB76qF40rqqs97YBDq0kDBvbNPODsfKbxCCufLvA9w8lNQcaPHrNk5Xxqs019Hcflqyj7zLDsRajKtL2oeyn7XhcXiAULiWqm8KbL0Z2qIqaggaLmdjxC4r9flsSuqaA?key=zoghKDf1UQcjciX7_oPD7bFJ)

Choose a unique name for this deployment:

![](https://lh7-rt.googleusercontent.com/docsz/AD_4nXcCqZSo7WujDVke2qDz91ECJR-Hl2fEjvN063A1N-__c9VeWCbWI90E-Hk-q0NX3s8cIaWiFnqLA6x0k9OJnX_F6JudfC9jSp4a37hb2OcdxneNc9zhZkIkj5GJRAgbZJiHd0tyfQ?key=zoghKDf1UQcjciX7_oPD7bFJ)

Enter the Hugging Face ID or URL for this model, [nm-testing/Qwen2-VL-72B-Instruct-FP8-dynamic](https://huggingface.co/nm-testing/Qwen2-VL-72B-Instruct-FP8-dynamic), and choose the low-cost tier:

![](https://lh7-rt.googleusercontent.com/docsz/AD_4nXeH5IHsVvGW01w2CBAWHyD7trKSk5TWGDBuVqffDcMuy3ncmmzfsMux_vH0bVqEn_iCB9dp_pAyDQNzsSLyO2eV-agrFYP6Advr3hqG9Te9VpbzhzQfgXt71rwzVHf1oMYv8gGfSw?key=zoghKDf1UQcjciX7_oPD7bFJ)

Under **advanced options**, enter the Hugging Face ID or URL of the draft model, in this case [Qwen/Qwen2-7B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2-7B-Instruct-GPTQ-Int4/tree/main). Also select a 1 hr scale-down policy.

![](https://lh7-rt.googleusercontent.com/docsz/AD_4nXd6NQJuiia4ssPmXa5C_LVltsw3rc6o4Lb_tYRQkK7ieA75WHkIiGCYuxYJLjJDtuQwdp5g4i5R95EeSTZWAzEt_lr00KM7cjZtOdWQcRBDPL_XZhB9Qebwo6Us0fcW1beX3d8qhg?key=zoghKDf1UQcjciX7_oPD7bFJ)

Hit **Deploy**, and you will be taken to the model dashboard. The model status is shown as starting (🟡) while the model is being set up.

![](https://lh7-rt.googleusercontent.com/docsz/AD_4nXdfhFC5ZTvvxRQSfWUWucpuR7uCby2j5wUpQyab7WoXcxrXWe1ugEDsectuHvopxmOkOVmT_A9IL-l1YA2W_uh-qUglnHU9VFaNYMCW38EFvJfV2tlKLj5hrf8mFY-dIpT25Jq3Fw?key=zoghKDf1UQcjciX7_oPD7bFJ)

When the model status turns green 🟢, we are ready to test the performance.

![](https://lh7-rt.googleusercontent.com/docsz/AD_4nXdPFCrzC0C-RLkc5TcbiMFVTXTH-lchP3kYbGCNqbhKPWKUhDZa4hk-sNPKZVCR2eHSzHrqFBKLX89Z9jn6zD6AX2TblbokEehsnOwRfXelvlTuNcuJMdHfid5_3nPPeEK1BCCgdw?key=zoghKDf1UQcjciX7_oPD7bFJ)

The output tokens per second jumped from 30 to 76 on the same NVIDIA H200, roughly a 2.5x improvement.

![](https://lh7-rt.googleusercontent.com/docsz/AD_4nXffXikAfxTN8cXYPAnQ6Mj4oUUjjx5dSEHYSS5Of-eeFr9qTCMK1RxE4jLHn3Kl3HCPPs5mCBxl15uV-whonZSnYM1mZwksbHXl3Xqi_EuZKECbM6VatYPUv6TQAXbrTDLvE6oxrg?key=zoghKDf1UQcjciX7_oPD7bFJ)

| Target Model                                                                                       | Draft Model                                                                                          |
| -------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------- |
| Qwen2.5-72B-Instruct                                                                               | Qwen/Qwen2.5-7B-Instruct-AWQ                                                                         |
| [Qwen2.5-VL-72B-Instruct](https://huggingface.co/parasail-ai/Qwen2.5-VL-72B-Instruct-FP8-Dynamic)  | [Qwen/Qwen2.5-VL-7B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct-AWQ/tree/main)  |
| [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)        | [shuyuej/Llama-3.2-1B-Instruct-GPTQ](https://huggingface.co/shuyuej/Llama-3.2-1B-Instruct-GPTQ)      |

**Future content: show additional combinations of target models and draft models that work well.**

## Next steps

* [Dedicated Instances overview](/parasail-docs/products/overview-1)
* [FP8 Quantization](/parasail-docs/products/overview-1/fp8-quantization)
* [Auto-Scaling](/parasail-docs/products/overview-1/auto-scaling)
* [Management API](/parasail-docs/products/overview-1/management-api)


# Private HuggingFace Models

How to create a dedicated deployment for a private HuggingFace model

To deploy a Parasail dedicated endpoint for a private HuggingFace model, follow these steps:

* Make your model private in HuggingFace.
  * Go to the settings page of your model.
  * At the top of the page, click on the button `Make private`. <img src="/files/M8iMOtSWvcFqnTnASqbw" alt="alt text" data-size="original">
  * After the model becomes private, you should see the following: ![alt text](/files/z39YGf7S7fRbG9YcDARF)
* Create a HuggingFace access token with fine-grained permissions that only allows access to your private model.

  * On the `Access Tokens` page, click on the button `Create new token`.
  * Leave token type as `Fine-grained` and write a name for your access token. ![alt text](/files/e4j2VlFe78uT6IXVEL48)
  * Scroll down to the section `Repositories permissions`. ![alt text](/files/EHL1CojuqMB6yHgkz57S)
  * In the textbox `Search for repos`, type the name of your model (for example `meta-llama/Llama-3.2-1B`).
  * Leave the permission set to `Read access to contents of selected repos`. ![alt text](/files/3teiqdPiUQVQ3fkqag1G)
  * Click on the button `Create token` at the bottom of the page.
  * Copy the access token that is generated.

  ![alt text](/files/qED1i9giQR4kXx8gvAww)
* Create a dedicated deployment in Parasail.
  * Click on the button `Create Dedicated Model` on the Parasail dashboard.
  * Copy your model name (for example `meta-llama/Llama-3.2-1B`) into the textbox `HuggingFace ID / URL`.
  * Copy your access token to the textbox `HuggingFace Token`. ![alt text](/files/dQwvmnVea0RAtwPxUn4Q)
  * Click on the button `Deploy`.


# Management API

Deploy, pause, resume, scale, and monitor dedicated endpoints programmatically with the Parasail REST API.

## Parasail Dedicated Service API Documentation

### Overview

The Parasail Dedicated Service API allows you to deploy and manage dedicated endpoints via a REST-compatible API. You can perform all functions available in the UI for dedicated deployments, including create, pause, resume, scale up or down, and retrieve status information.

The API does not surface GPU performance recommendations or details about errors and HuggingFace model compatibility issues. We therefore recommend that you set up and verify the model in the platform UI first, before controlling a deployment through the API.

If you did not create the dedicated endpoint with the API, you can find the **deployment\_id** from the Parasail SaaS platform URL. For example, if your deployment status page is `https://www.saas.parasail.io/dedicated/33665`, then the `deployment_id` is **33665**. You can then use this for all API commands below.
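If you handle many deployments, extracting the ID from the status page URL can be scripted. The following is a small sketch based only on the URL pattern shown above; the helper name `deployment_id_from_url` is illustrative.

```python
from urllib.parse import urlparse


def deployment_id_from_url(url: str) -> str:
    """Extract the deployment_id from a dedicated status page URL."""
    parts = [part for part in urlparse(url).path.split("/") if part]
    # Expect a path of the form /dedicated/<numeric id>
    if len(parts) < 2 or parts[-2] != "dedicated" or not parts[-1].isdigit():
        raise ValueError(f"not a dedicated status page URL: {url}")
    return parts[-1]


deployment_id = deployment_id_from_url("https://www.saas.parasail.io/dedicated/33665")
print(deployment_id)
```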

### Base URLs

* **Control API** (for managing deployments):\
  `https://api.parasail.io/api/v1`
* **Chat API** (for inference requests via OpenAI-compatible endpoints):\
  `https://api.parasail.io/v1`

### Authentication

All requests to the Control API must include a Bearer token in the header. For example:

```python
import httpx

api_key = "<YOUR_PARASAIL_API_KEY>"
control_base_url = "https://api.parasail.io/api/v1"

headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json"
}

client = httpx.Client(base_url=control_base_url, headers=headers)
```

***

## Recommended Dedicated Model API Flow

Given the large set of combinations of model types and hardware instances, the recommended process for using the Dedicated API follows the UI flow and leverages our performance/memory models and hardware compatibility checkers to ensure smooth operation. The recommended flow is:

1. Check model compatibility with our `/dedicated/support` endpoint.
2. Retrieve a list of supported hardware for that model with our `/dedicated/devices` endpoint. This will include multi-GPU configurations that leverage parallelism.
3. Within that list, set `selected` to `true` for the desired configurations.
4. Submit (`POST`) a deployment payload (including your selected `deviceConfigs`) to `/dedicated/deployments`.
5. Use `/dedicated/deployments` endpoints to monitor, update, pause, resume, and delete deployments.

### 1. Checking Model Compatibility

The `/dedicated/support` endpoint can be used to query whether a model is supported. The `modelName` parameter can either be a HuggingFace ID like `Qwen/Qwen2.5-72B-Instruct` or a HuggingFace URL like <https://huggingface.co/Qwen/Qwen2.5-72B-Instruct>. Use the `engine` query parameter (for example `VLLM`) when needed.

```python
params = {
    "modelName": "Qwen/Qwen2.5-72B-Instruct",
    "engine": "VLLM",
    "modelAccessKey": ""  # HF token for privately hosted models
}

# Make the request
response = client.get("/dedicated/support", params=params)

if response.status_code == 200:
    data = response.json()
    if data.get("supported") is True:
        print("✅ Model is supported.")
    else:
        reason = data.get("errorMessage", "No reason provided.")
        print(f"❌ Model not supported: {reason}")
else:
    print(f"❌ Request failed with status code {response.status_code}: {response.text}")
```

This step is optional, but it is a useful sanity check for API-based flows to confirm that model IDs are entered correctly.

### 2. Retrieving Supported Hardware

The first step to creating a new dedicated deployment is to select a supported hardware configuration. The `/dedicated/devices` endpoint returns valid hardware configurations and their availability. These configurations can be either single GPUs or multiple GPUs with parallelism enabled (tensor, expert, or sequence). The `selected` flag is set to `false` when returned from `/dedicated/devices`. The recommended flow is to flip `selected` to `true` for your desired configurations, then pass that array to the create deployment endpoint (see next section).

The entire structure does not need to be passed to the create deployment endpoint. The **Required?** column in the table below shows which parameters are needed, though we recommend ensuring compatibility by first calling `/dedicated/devices` and flipping `selected` to `true`.

`/dedicated/devices` returns an array of structures with the following parameters:

**DeviceConfigs structure**

<table><thead><tr><th width="230.8671875">Parameter</th><th width="105.34375">Type</th><th width="99.8984375">Required?</th><th>Description</th></tr></thead><tbody><tr><td>device</td><td><strong>string</strong></td><td>Yes</td><td>Device ID name.</td></tr><tr><td>count</td><td><strong>integer</strong></td><td>Yes</td><td>Number of GPUs. The model will be parallelized across the GPUs using the optimal approach (tensor parallelism, expert parallelism, and/or sequence parallelism). For data parallelism, use multiple replicas instead.</td></tr><tr><td>displayName</td><td><strong>string</strong></td><td>No</td><td>How the device is displayed in the UI.</td></tr><tr><td>cost</td><td><strong>float</strong></td><td>No</td><td>Cost per hour of the hardware configuration. This accounts for multiple GPUs.</td></tr><tr><td>estimatedSingleUserTps</td><td><strong>float</strong></td><td>No</td><td>Estimated output TPS per user. The underlying performance model is currently out of date, so relying on this number is not recommended.</td></tr><tr><td>estimatedSystemTps</td><td><strong>float</strong></td><td>No</td><td>Estimated output TPS for the whole system. The underlying performance model is currently out of date, so relying on this number is not recommended.</td></tr><tr><td>recommended</td><td><strong>boolean</strong></td><td>No</td><td>If true, this configuration hits a sweet spot of performance and cost and has enough memory to support full contexts.</td></tr><tr><td>limitedContext</td><td><strong>boolean</strong></td><td>No</td><td>If true, the full context of the model may cause memory overflow issues. It is recommended to manually reduce the context length using the <em>contextLength</em> parameter.</td></tr><tr><td>available</td><td><strong>boolean</strong></td><td>No</td><td>If true, the hardware is available, meaning there is capacity and the account quota allows for the device count. This field currently reflects only a single replica.</td></tr><tr><td>selected</td><td><strong>boolean</strong></td><td>Yes</td><td>Set to false when retrieved from the compatible device query. Setting this to true and passing the entire hardware structure to the <em>create deployment</em> endpoint makes it a selectable option for the scheduler.</td></tr></tbody></table>

Here is example code to retrieve supported hardware, look for a desired configuration, and flip the selected flag to true when it is found:

```python
def get_and_select_device_config(client: httpx.Client, model_name: str, desired_device: str, desired_count: int) -> list[dict]:
    params = {
        "engine": "VLLM",
        "modelName": model_name,
        "modelAccessKey": ""
    }

    response = client.get("/dedicated/devices", params=params)
    if response.status_code != 200:
        raise RuntimeError(f"Request failed: {response.status_code} - {response.text}")

    devices = response.json()
    matched = False

    for d in devices:
        d["selected"] = d.get("device") == desired_device and d.get("count") == desired_count
        matched = matched or d["selected"]

    if not matched:
        raise ValueError(f"No matching device config for {desired_device} x{desired_count}")

    return devices

devices = get_and_select_device_config(
    client,
    model_name="Qwen/Qwen2.5-72B-Instruct",
    desired_device="H100SXM",
    desired_count=2
)

```
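Instead of matching a fixed device name, the `recommended`, `available`, and `cost` fields described in the table can drive the selection. The sketch below is one possible policy, not a Parasail-provided helper: it marks the cheapest configuration that is both available and recommended. The function name `pick_device_config` and the sample costs are hypothetical, and the sample list only mimics the shape of a `/dedicated/devices` response.

```python
def pick_device_config(devices: list[dict]) -> list[dict]:
    """Mark the cheapest available, recommended configuration as selected.

    Returns a new list; the input list is left unmodified.
    """
    candidates = [
        d for d in devices
        if d.get("available") and d.get("recommended") and d.get("cost") is not None
    ]
    if not candidates:
        raise ValueError("No available, recommended device configuration found.")

    best = min(candidates, key=lambda d: d["cost"])
    return [
        {**d, "selected": d["device"] == best["device"] and d["count"] == best["count"]}
        for d in devices
    ]


# Hypothetical sample data shaped like a /dedicated/devices response.
sample_devices = [
    {"device": "H100SXM", "count": 2, "cost": 6.0, "recommended": True, "available": True, "selected": False},
    {"device": "H100SXM", "count": 4, "cost": 12.0, "recommended": True, "available": True, "selected": False},
    {"device": "H100SXM", "count": 1, "cost": 3.0, "recommended": False, "available": True, "selected": False},
]

chosen = pick_device_config(sample_devices)
print([(d["device"], d["count"]) for d in chosen if d["selected"]])  # [('H100SXM', 2)]
```

The single-GPU entry is cheaper but is skipped because it is not flagged as recommended. The resulting list can be passed as `deviceConfigs` when creating a deployment.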

### 3. Creating a Deployment

<table><thead><tr><th>Parameter</th><th width="100">Type</th><th width="100">Required</th><th>Description</th></tr></thead><tbody><tr><td><strong>deploymentName</strong></td><td>string</td><td>Yes</td><td>Unique name for your deployment. This can only contain lowercase letters, numbers, and dashes.</td></tr><tr><td><strong>modelName</strong></td><td>string</td><td>Yes</td><td>The HuggingFace identifier (for example, <code>org/model-name</code>) of the model you want to deploy. For example: <code>"shuyuej/Llama-3.2-1B-Instruct-GPTQ"</code></td></tr><tr><td><strong>deviceConfigs</strong></td><td>list of structs</td><td>Yes</td><td>A list of <strong>deviceConfig</strong> structs (see previous section). Any struct with <strong>selected</strong> = true will be considered for scheduling.</td></tr><tr><td><strong>replicas</strong></td><td>integer</td><td>Yes</td><td>Number of replicas to run in parallel.</td></tr><tr><td><strong>scaleDownPolicy</strong></td><td>string</td><td>No</td><td>Auto-scaling policy: <code>NONE</code> or <code>TIMER</code>.</td></tr><tr><td><strong>scaleDownThreshold</strong></td><td>integer</td><td>No</td><td>Time threshold in ms before scaling down.</td></tr><tr><td><strong>draftModelName</strong></td><td>string</td><td>No</td><td>(Advanced) Points to a draft model version.</td></tr><tr><td><strong>engine</strong></td><td>string</td><td>No</td><td>Inference engine. Defaults to <code>VLLM</code>.</td></tr><tr><td><strong>engineTask</strong></td><td>string</td><td>No</td><td>Engine task mode. Defaults to <code>AUTO</code>.</td></tr><tr><td><strong>contextLengthOverride</strong></td><td>integer</td><td>No</td><td>(Advanced) User override for context window length.</td></tr><tr><td><strong>modelAccessKey</strong></td><td>string</td><td>No</td><td>(For private models) HuggingFace token.</td></tr></tbody></table>

Use the `POST /dedicated/deployments` endpoint to create a new model deployment. The response includes an `id` that you will need for future operations (status checks, updates, and so on). After the deployment is created, poll its status and wait for the transition from `STARTING` to `ONLINE` before sending requests to the endpoint. The deployment also appears on the dedicated status page of the platform UI.

This uses the `devices` struct from the previous example:

```python
# Example: Creating a deployment
deployment_config = {
    "deploymentName": "myqwen",
    "modelName": "Qwen/Qwen2.5-72B-Instruct",
    "engine": "VLLM",  # Optional (defaults to VLLM)
    "deviceConfigs": devices,
    "replicas": 1,
    # Optional parameters:
    # "scaleDownPolicy": "TIMER",
    # "scaleDownThreshold": 3600000,
}

response = client.post("/dedicated/deployments", json=deployment_config)

if response.status_code == 200:
    deployment_data = response.json()
    deployment_id = deployment_data["id"]
    print(f"Created deployment with ID: {deployment_id}")
else:
    print(f"Failed to create deployment: {response.status_code}, {response.text}")
```

**Response Body Example:**

```json
{
  "id": 12345,
  "deploymentName": "myqwen",
  "modelName": "Qwen/Qwen2.5-72B-Instruct",
  "status": {
    "status": "STARTING",
    "statusMessage": "Deployment is being created"
  },
  "externalAlias": "myorganization-myqwen",
  ...
}
```

* **id**: Unique numeric identifier for the deployment.
* **status.status**: Current state of the deployment (for example, `STARTING`, `ONLINE`).
* **externalAlias**: The string used to reference your model via OpenAI-compatible calls.
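Two details from the parameter table are easy to get wrong in scripts: the character rule for `deploymentName` and the millisecond unit of `scaleDownThreshold`. The following is a small client-side sketch covering both. The helper names are hypothetical, and the server may enforce additional rules (such as a length limit) that this check does not cover.

```python
import re

# Rule from the parameter table: lowercase letters, numbers, and dashes only.
_DEPLOYMENT_NAME_RE = re.compile(r"[a-z0-9-]+")


def is_valid_deployment_name(name: str) -> bool:
    """Return True if the name uses only lowercase letters, digits, and dashes."""
    return _DEPLOYMENT_NAME_RE.fullmatch(name) is not None


def minutes_to_ms(minutes: float) -> int:
    """Convert minutes to the milliseconds expected by scaleDownThreshold."""
    return int(minutes * 60 * 1000)


print(is_valid_deployment_name("myqwen"))    # True
print(is_valid_deployment_name("My_Qwen"))   # False
print(minutes_to_ms(60))                     # 3600000
```

The last value matches the `scaleDownThreshold` shown in the commented-out option of the creation example, which corresponds to one hour.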

***

### 4. Getting Deployment Status

Use the `GET /dedicated/deployments/{deployment_id}` endpoint to retrieve the details of your deployment, including its status.

```python
deployment_id = 12345  # from creation step or UI
response = client.get(f"/dedicated/deployments/{deployment_id}")

if response.status_code == 200:
    status_data = response.json()
    model_status = status_data["status"]["status"]       # for example "ONLINE", "OFFLINE"
    model_identifier = status_data["externalAlias"]      # alias for OpenAI API calls
    print(f"Deployment status: {model_status}")
    print(f"OpenAI model identifier: {model_identifier}")
else:
    print(f"Failed to get deployment: {response.status_code}, {response.text}")
```

**Key fields in the response**:

* `status["status"]`: The overall state of the deployment.
* `externalAlias`: The string you use in OpenAI Chat/Completion requests (`model=`).

Below are the possible values for `status["status"]`:

| State        | Description                                                       |
| ------------ | ----------------------------------------------------------------- |
| **ONLINE**   | Deployment is active and ready to serve requests.                 |
| **STARTING** | Deployment is in the process of starting up.                      |
| **STOPPING** | Deployment is in the process of shutting down.                    |
| **OFFLINE**  | Deployment is not running and must be resumed to accept requests. |
| **ERROR**    | Deployment entered an error state and needs investigation.        |
| **UNKNOWN**  | Deployment state is unknown (for example, stale or missing data). |

***

### 5. Updating a Deployment

To change the configuration of an existing deployment (for example, number of replicas):

1. **GET** the current deployment (as above).
2. Modify the returned JSON structure to include your updated parameters.
3. **PUT** the entire updated JSON back to the same endpoint (`/dedicated/deployments/{deployment_id}`).

#### Example: Scaling up the replicas

```python
# 1. Fetch the current deployment
deployment_id = 12345
response = client.get(f"/dedicated/deployments/{deployment_id}")
current_deployment = response.json()

# 2. Modify the "replicas" field (for example, from 1 to 2)
current_deployment["replicas"] = 2

# 3. Send the updated deployment object
update_response = client.put(
    f"/dedicated/deployments/{deployment_id}",
    json=current_deployment
)

if update_response.status_code == 200:
    updated_deployment_data = update_response.json()
    print("Deployment updated successfully.")
    print(f"New replica count: {updated_deployment_data['replicas']}")
else:
    print(f"Failed to update deployment: {update_response.status_code}, {update_response.text}")
```

***

### 6. Pausing and Resuming a Deployment

You may want to pause a deployment to reduce costs. Pausing puts the deployment into an `OFFLINE` state and stops billing. When ready, you can resume it to bring it back `ONLINE`. After sending the resume command, poll the status and wait for the transition from `STARTING` to `ONLINE` before sending requests to the endpoint.

```python
# Pause the deployment
deployment_id = 12345
pause_response = client.post(f"/dedicated/deployments/{deployment_id}/pause")
if pause_response.status_code == 200:
    print("Deployment paused successfully.")
else:
    print(f"Failed to pause: {pause_response.status_code}, {pause_response.text}")

# Resume the deployment
resume_response = client.post(f"/dedicated/deployments/{deployment_id}/resume")
if resume_response.status_code == 200:
    print("Deployment resumed successfully.")
else:
    print(f"Failed to resume: {resume_response.status_code}, {resume_response.text}")
```

***

### 7. Using the Deployed Model via OpenAI-Compatible API

Once your deployment is `ONLINE`, you can send inference requests using the **Chat API** at `https://api.parasail.io/v1`. Use the `externalAlias` as the `model` parameter in your OpenAI calls.

Example using the OpenAI Python client:

```python
from openai import OpenAI

openai_client = OpenAI(api_key=api_key, base_url="https://api.parasail.io/v1")

# Assume you fetched this alias from /dedicated/deployments/{deployment_id}
model_identifier = "parasail-custom-model-id"

response = openai_client.chat.completions.create(
    model=model_identifier,
    messages=[{"role": "user", "content": "What's the capital of France?"}],
)

print(response.choices[0].message.content)
```

**Note**: The `model_identifier` is the `externalAlias` from the deployment status.

***

### 8. Deleting a Deployment

To permanently delete a deployment, send a `DELETE` request to `/dedicated/deployments/{deployment_id}`.

```python
deployment_id = 12345
delete_response = client.delete(f"/dedicated/deployments/{deployment_id}")

if delete_response.status_code == 200:
    print("Deployment deleted successfully.")
else:
    print(f"Failed to delete deployment: {delete_response.status_code}, {delete_response.text}")
```

### Example Python Code

```python
import httpx
import openai
from pydantic import BaseModel
from typing import Optional
import os
import time
from dotenv import load_dotenv

# Load environment variables
env_path = os.path.expanduser("~/dev/parasail/.env")
if os.path.exists(env_path):
    load_dotenv(env_path)
api_key = os.environ["PARASAIL_API_KEY"]
control_base_url = "https://api.parasail.io/api/v1"
chat_base_url = "https://api.parasail.io/v1"


class deviceConfig(BaseModel):
    device: str
    count: int
    displayName: Optional[str] = None
    cost: Optional[float] = None
    estimatedSingleUserTps: Optional[float] = None
    estimatedSystemTps: Optional[float] = None
    recommended: Optional[bool] = None
    limitedContext: Optional[bool] = None
    available: Optional[bool] = None
    selected: bool


class CreateModelStruct(BaseModel):
    deploymentName: str
    modelName: str
    replicas: int
    deviceConfigs: list[deviceConfig]
    engine: Optional[str] = "VLLM"
    engineTask: Optional[str] = "AUTO"
    scaleDownPolicy: Optional[str] = None  # NONE, TIMER, INACTIVE
    scaleDownThreshold: Optional[int] = None  # time in ms
    draftModelName: Optional[str] = None
    contextLengthOverride: Optional[int] = None
    modelAccessKey: Optional[str] = None


def check_compatibility(parasail_client: httpx.Client, model_name: str):
    params = {
        "modelName": model_name,
        "engine": "VLLM",
        "modelAccessKey": "",  # HF token for privately hosted models
    }
    response = parasail_client.get("/dedicated/support", params=params)

    if response.status_code == 200:
        data = response.json()
        if data.get("supported") is True:
            print("✅ Model is supported.")
        else:
            reason = data.get("errorMessage", "No reason provided.")
            raise ValueError(f"❌ Model not supported: {reason}")
    else:
        raise RuntimeError(
            f"❌ Request failed with status code {response.status_code}: {response.text}"
        )


def get_and_select_device_config(
    parasail_client: httpx.Client,
    model_name: str,
    desired_device: str,
    desired_count: int,
) -> list[dict]:
    params = {"engine": "VLLM", "modelName": model_name, "modelAccessKey": ""}

    response = parasail_client.get("/dedicated/devices", params=params)
    if response.status_code != 200:
        raise RuntimeError(f"Request failed: {response.status_code} - {response.text}")

    devices = response.json()
    matched = False

    for d in devices:
        d["selected"] = (
            d.get("device") == desired_device and d.get("count") == desired_count
        )
        matched = matched or d["selected"]

    if not matched:
        raise ValueError(
            f"No matching device config for {desired_device} x{desired_count}"
        )

    return devices


def create_deployment(
    parasail_client: httpx.Client, deployment: CreateModelStruct
) -> dict:
    response = parasail_client.post(
        "/dedicated/deployments", content=deployment.model_dump_json()
    )
    assert (
        200 <= response.status_code < 300
    ), f"Failed to create deployment: {response.status_code} {response.text}"
    return response.json()


def update_deployment(parasail_client: httpx.Client, deployment: dict) -> dict:
    assert deployment["id"] is not None, "Deployment ID must be set for update"
    response = parasail_client.put(
        f"/dedicated/deployments/{deployment['id']}", json=deployment
    )
    assert (
        200 <= response.status_code < 300
    ), f"Failed to update deployment: {response.status_code} {response.text}"
    return response.json()


def pause_deployment(parasail_client: httpx.Client, id: int) -> None:
    response = parasail_client.post(f"/dedicated/deployments/{id}/pause")
    assert (
        200 <= response.status_code < 300
    ), f"Failed to pause deployment: {response.status_code} {response.text}"


def resume_deployment(parasail_client: httpx.Client, id: int) -> None:
    response = parasail_client.post(f"/dedicated/deployments/{id}/resume")
    assert (
        200 <= response.status_code < 300
    ), f"Failed to resume deployment: {response.status_code} {response.text}"


def get_deployment(parasail_client: httpx.Client, id: int) -> dict:
    response = parasail_client.get(f"/dedicated/deployments/{id}")
    assert (
        200 <= response.status_code < 300
    ), f"Failed to get deployment: {response.status_code} {response.text}"
    return response.json()


def is_deployment_offline(parasail_client: httpx.Client, id: int) -> bool:
    deployment = get_deployment(parasail_client, id)
    return deployment["status"]["status"] == "OFFLINE"


def delete_deployment(parasail_client: httpx.Client, id: int) -> None:
    response = parasail_client.delete(f"/dedicated/deployments/{id}")
    assert (
        200 <= response.status_code < 300
    ), f"Failed to delete deployment: {response.status_code} {response.text}"


def wait_for_status(
    parasail_client: httpx.Client,
    id: int,
    status: str = "ONLINE",
    timeout_minutes: int = 20,
) -> None:
    start_time = time.time()
    timeout_seconds = timeout_minutes * 60
    while True:
        deployment = get_deployment(parasail_client, id)
        current_status = (
            deployment["status"]["status"] if "status" in deployment else "UNKNOWN"
        )
        if current_status == status:
            print(f"Deployment reached target status: {status}")
            return
        elif current_status == "STARTING" or current_status == "STOPPING":
            elapsed_minutes = (time.time() - start_time) / 60
            print(
                f"Deployment status: {current_status} (elapsed time: {elapsed_minutes:.1f} minutes)"
            )
        else:
            raise ValueError(f"Unexpected deployment status: {current_status}")
        if time.time() - start_time > timeout_seconds:
            raise TimeoutError(
                f"Deployment status did not change to {status} within {timeout_minutes} minutes"
            )
        time.sleep(30)


def test_deployment(
    parasail_client: httpx.Client, openai_client: openai.Client, id: int
) -> None:
    deployment = get_deployment(parasail_client, id)
    assert "status" in deployment, "Deployment status is None"
    assert (
        deployment["status"]["status"] == "ONLINE"
    ), f"Deployment is not online: {deployment['status']['status']}"

    chat_completion = openai_client.chat.completions.create(
        model=deployment["externalAlias"],
        messages=[{"role": "user", "content": "What's the capital of New York?"}],
    )
    assert len(chat_completion.choices) == 1, "No chat completion choices returned"
    assert chat_completion.choices[0].message.content, "Empty chat completion response"


def main():
    # Setup clients
    model_name = "RedHatAI/Qwen2-72B-Instruct-FP8"

    parasail_client = httpx.Client(
        base_url=control_base_url,
        headers={
            "Authorization": "Bearer " + api_key,
            "Content-Type": "application/json",
        },
    )
    openai_client = openai.Client(api_key=api_key, base_url=chat_base_url)
    check_compatibility(parasail_client, model_name)

    devices = get_and_select_device_config(
        parasail_client,
        model_name=model_name,
        desired_device="H100SXM",
        desired_count=2,
    )

    # Create deployment configuration
    deployment_config = CreateModelStruct(
        deploymentName="scripting-example-deployment-3",
        modelName=model_name,
        deviceConfigs=devices,
        replicas=1,
    )

    deployment = None
    try:
        print("Creating deployment...")
        deployment = create_deployment(parasail_client, deployment_config)
        print(f"Created deployment with ID: {deployment['id']}")

        print("\nWaiting for deployment to come online.")
        wait_for_status(parasail_client, deployment["id"], status="ONLINE")

        print("\nTesting deployment with 1 replica...")
        test_deployment(parasail_client, openai_client, deployment["id"])
        print("Initial deployment tests passed successfully")

        pause_deployment(parasail_client, deployment["id"])
        wait_for_status(parasail_client, deployment["id"], status="OFFLINE")
        resume_deployment(parasail_client, deployment["id"])
        wait_for_status(parasail_client, deployment["id"], status="ONLINE")

        print("\nTesting deployment after resume...")
        test_deployment(parasail_client, openai_client, deployment["id"])
        print("Post-resume deployment tests passed successfully")

        deployment = get_deployment(parasail_client, deployment["id"])
        print("\nUpdating deployment to 2 replicas...")
        deployment["replicas"] = 2
        deployment = update_deployment(parasail_client, deployment)
        print("Waiting for deployment to scale up...")
        wait_for_status(parasail_client, deployment["id"], status="ONLINE")

        print("\nTesting deployment with 2 replicas...")
        test_deployment(parasail_client, openai_client, deployment["id"])
        print("Scaled deployment tests passed successfully")

    except Exception as e:
        print(f"Error: {e}")
    finally:
        if deployment:
            print("\nCleaning up - deleting deployment...")
            delete_deployment(parasail_client, deployment["id"])
            print("Deployment deleted")


if __name__ == "__main__":
    main()

```

## Next steps

* [Dedicated Instances overview](/parasail-docs/products/overview-1)
* [Auto-Scaling](/parasail-docs/products/overview-1/auto-scaling)
* [Private Hugging Face Models](/parasail-docs/products/overview-1/private-hf-models)
* [Dedicated Quickstart](/parasail-docs/quickstart/dedicated)
* [Authentication API reference](/parasail-docs/api-reference/authentication)


# FP8 Quantization

Quantize dedicated models to FP8 with llm-compressor to halve memory use and speed up inference with minimal accuracy loss.

We recommend FP8 quantization for dedicated models. It reduces model memory requirements by 2x and improves inference speed by up to 1.6x, with minimal impact on accuracy.
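The 2x memory figure follows from the storage size per weight: a 16-bit format uses 2 bytes per parameter, while FP8 uses 1 byte. As a rough worked example, assuming a 72-billion-parameter model originally stored in a 16-bit format and counting weights only (KV cache, activations, and any layers left unquantized are ignored):

```python
def weight_memory_gb(num_params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return num_params_billion * bytes_per_param


bf16_gb = weight_memory_gb(72, 2)  # 16-bit weights: 2 bytes per parameter
fp8_gb = weight_memory_gb(72, 1)   # FP8 weights: 1 byte per parameter

print(f"16-bit: {bf16_gb:.0f} GB, FP8: {fp8_gb:.0f} GB, ratio: {bf16_gb / fp8_gb:.1f}x")
# 16-bit: 144 GB, FP8: 72 GB, ratio: 2.0x
```

In practice the saving is slightly below 2x when some layers, such as `lm_head`, are kept in their original precision.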

The easiest and recommended way to quantize a model is to use [llm-compressor](https://github.com/vllm-project/llm-compressor) with the `FP8_DYNAMIC` scheme. This scheme needs no calibration data because the activations are quantized dynamically.

Here is an example of how to use `llm-compressor` to quantize a model:

```python
# pip install llmcompressor

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "YOUR_HUGGINGFACE_MODEL_HERE"

# Load model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Apply quantization.
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)
oneshot(model=model, recipe=recipe)

# Confirm generations of the quantized model look sane.
print("========== SAMPLE GENERATION ==============")
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output[0]))
print("==========================================")

# Save to disk in compressed-tensors format
save_path = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
```

## Next steps

* [Dedicated Instances overview](/parasail-docs/products/overview-1)
* [Speculative Decoding](/parasail-docs/products/overview-1/speculative-decoding)
* [Auto-Scaling](/parasail-docs/products/overview-1/auto-scaling)
* [Private Hugging Face Models](/parasail-docs/products/overview-1/private-hf-models)
* [Management API](/parasail-docs/products/overview-1/management-api)


# Auto-Scaling

Configure auto-scaling for dedicated endpoints using max concurrent requests, target concurrency, and smoothing factor.

To enable auto-scaling, select the autoscaling checkbox.

Auto-scaling works in conjunction with the Max Concurrent Requests setting:

**Max Concurrent Requests:** The maximum number of concurrent requests per replica. Once this number is reached, the replica begins rejecting requests. The trade-off is that a higher number can increase latency, while a lower number causes more requests to be rejected.

**Target Concurrent Requests:** The number of concurrent requests that triggers autoscaling of another replica. Set this below Max Concurrent Requests.

**Smoothing Factor:** Controls the moving average of concurrent requests, that is, how long traffic must stay around the target before a replica is brought up or scaled down.

* Conservative: Responds to changes slowly
* Moderate: Balanced responsiveness
* Aggressive: Reacts to changes quickly

**Auto Scaling Range for Replicas:** The minimum and maximum number of replicas. The minimum is the baseline the system keeps running, and the maximum is the most it will scale up to.

**Note:** Whatever options you have set for hardware selection will be used for autoscaling.
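To build intuition for how these settings interact, the sketch below models the smoothing factor as an exponential moving average over concurrency samples. This is an illustration only, not Parasail's autoscaler: the function names, the `alpha` values attached to each preset, and the traffic samples are all assumptions chosen for the example.

```python
import math


def smoothed_concurrency(samples: list[float], alpha: float) -> float:
    """Exponential moving average; a larger alpha reacts faster to new samples."""
    average = samples[0]
    for sample in samples[1:]:
        average = alpha * sample + (1 - alpha) * average
    return average


def desired_replicas(avg_concurrency: float, target_per_replica: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Replicas needed to keep each replica at or below the target, clamped to the range."""
    needed = math.ceil(avg_concurrency / target_per_replica)
    return max(min_replicas, min(max_replicas, needed))


# A burst from 10 to 100 concurrent requests, sampled at regular intervals.
traffic = [10, 10, 100, 100, 100]

# Hypothetical alpha values; the real presets are not documented here.
for label, alpha in [("conservative", 0.1), ("moderate", 0.3), ("aggressive", 0.7)]:
    avg = smoothed_concurrency(traffic, alpha)
    replicas = desired_replicas(avg, target_per_replica=32, min_replicas=1, max_replicas=4)
    print(f"{label}: smoothed={avg:.1f}, replicas={replicas}")
```

With the same burst, the slower setting still sees a low average and asks for fewer replicas, while the faster setting tracks the burst closely and reaches the top of the replica range sooner.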

## Next steps

* [Dedicated Instances overview](/parasail-docs/products/overview-1)
* [Management API](/parasail-docs/products/overview-1/management-api)
* [Speculative Decoding](/parasail-docs/products/overview-1/speculative-decoding)
* [FP8 Quantization](/parasail-docs/products/overview-1/fp8-quantization)
* [Pricing](/parasail-docs/billing/pricing)


# Dedicated Serverless

Understand the concurrency limits and max requests per minute that govern autoscaling capacity on Dedicated Serverless endpoints.

Autoscaling with Dedicated Serverless is a nuanced topic because:

1. You are not sharing an endpoint with other customers, so there is not a common pool serving a model to spread load across.
2. Yet you are still paying per token and do not have to worry about any of the underlying GPU infrastructure or orchestration software.

Despite being a serverless product, the underlying infrastructure that runs AI models requires load limits to maintain quality of service and uptime. The two load limits we employ are **concurrency limits** (how many requests are actively being processed at a given time), which most often relates to bursty traffic and changes depending on autoscaling, and **maximum requests per minute**, which is a static limit and most often relates to more steady traffic.

The following three factors are shown on the Dedicated Serverless status page.

| **Current Concurrency Limit**          | The **current** capacity of the endpoint, measured in **maximum concurrent requests being processed**.                                                                                           |
| -------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **Fully Autoscaled Concurrency Limit** | Once the endpoint is autoscaled to its maximum, this represents the maximum concurrent requests that an endpoint can process.                                                                    |
| **Max Requests per Minute**            | A static rate limit for the endpoint that controls the maximum amount of steady-state traffic. This is determined by the product tier or negotiated agreement and is not related to autoscaling. |

If your traffic is **bursty**, you may hit **concurrency limits** before you hit the maximum requests per minute.

If your traffic grows **smoothly**, you may hit **maximum requests per minute** before you hit any concurrency limits.

To illustrate this point, if the maximum requests per minute of an endpoint is 10,000 but all 10,000 requests arrive at exactly the same time, the endpoint will issue HTTP 429 errors due to concurrency limits. If the requests are more spread out, however, no HTTP 429 errors will be issued because the traffic does not exceed the rate limit.
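The relationship between the two limits can be estimated with Little's law: the average number of in-flight requests equals the arrival rate multiplied by the average time each request takes. A minimal sketch, assuming an average request latency of 3 seconds (a made-up figure; measure your own workload):

```python
def steady_state_concurrency(requests_per_minute: float, avg_latency_s: float) -> float:
    """Little's law: average in-flight requests = arrival rate x average time in system."""
    return requests_per_minute * avg_latency_s / 60.0


# 10,000 requests per minute, spread evenly, at 3 seconds per request.
print(steady_state_concurrency(10_000, 3.0))  # 500.0
```

Spread evenly, the same 10,000 requests need roughly 500 concurrent slots, whereas arriving all at once they would need 10,000. This is why bursty traffic hits the concurrency limit first.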

### Concurrency Limits in Detail

Current capacity is displayed on the Dedicated Serverless status page. It is a dynamic number, determined behind the scenes by the number of active autoscaled replicas multiplied by the capacity of each replica. The capacity of each replica is determined by the model, the input characteristics, and the type of hardware used to deploy it. Because some of these factors may change, the capacity of each replica is also subject to change.

Behind the scenes, Dedicated Serverless endpoints autoscale GPUs to maximize efficiency, and the autoscaling includes a 5-minute dampening and delay to prevent thrashing of the infrastructure. This means the endpoint can only handle a certain amount of burst traffic. Any traffic in excess of that receives HTTP 429 errors until the autoscaling kicks in.
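Clients can absorb short bursts by retrying HTTP 429 responses with exponential backoff. The sketch below is a generic pattern rather than a Parasail-specific API: `send` is a hypothetical callable that performs one request and returns a `(status_code, body)` tuple, and the delay values are illustrative.

```python
import random
import time
from typing import Callable


def call_with_backoff(
    send: Callable[[], tuple[int, str]],
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
    max_delay_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, str]:
    """Retry `send` while it returns HTTP 429, using exponential backoff with jitter."""
    for attempt in range(max_attempts):
        status_code, body = send()
        if status_code != 429:
            return status_code, body
        if attempt < max_attempts - 1:
            # Double the delay on each attempt, capped, plus up to 50% random jitter.
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    # All attempts were rate limited; return the last 429 response to the caller.
    return status_code, body
```

Because autoscaling takes minutes to add capacity, retries like this smooth over brief spikes but do not replace capacity planning for sustained load. The official OpenAI Python client also retries rate-limited requests on its own, configurable through its `max_retries` option.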

For example, if each replica can handle **64 concurrent requests** and there are 4 active replicas, the Dedicated Serverless status page will display a **current capacity** of **256 concurrent requests**. Anything in excess of that will result in a 429 error. If traffic is sustained for 5 minutes (the autoscaling trigger is always significantly less than the maximum replica capacity) and an additional 4 replicas are activated, then approximately 5 minutes later the **current capacity** will be **512 concurrent requests**.
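The arithmetic in this example can be written out directly. The function name is hypothetical; the numbers are the ones used above.

```python
def concurrency_limit(active_replicas: int, per_replica_capacity: int) -> int:
    """Current concurrency limit = active replicas x capacity of each replica."""
    return active_replicas * per_replica_capacity


print(concurrency_limit(4, 64))  # 256, before autoscaling
print(concurrency_limit(8, 64))  # 512, after 4 more replicas come online
```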

A graph of the number of concurrent requests can be viewed in the Dedicated Serverless status page under the "Number Running Requests" tab.

### Max Requests per Minute in Detail

Max requests per minute is a constant factor not tied to autoscaling. It is set either through the Dedicated Serverless creation process or through a special arrangement with sales/customer success. It represents the maximum steady-state capacity of the endpoint and is shown as **Maximum Requests per Minute** on the Dedicated Serverless status page. This will not change without intervention from Parasail or selection of a higher-tier service plan.

For example, a high-capacity endpoint of Qwen 32b may be shown as **Maximum Requests Per Minute: 15,000**.

## Next steps

* [Dedicated Instances overview](/parasail-docs/products/overview-1)
* [Auto-Scaling](/parasail-docs/products/overview-1/auto-scaling)
* [Rate Limits](/parasail-docs/billing/rate-limits)
* [Pricing](/parasail-docs/billing/pricing)


# Batch

Get started with Parasail's Batch Processing through the UI or the OpenAI-compatible Python batch helper library.

***This page gives an introduction to Parasail's Batch Processing Library. For more detailed information, jump to:***

* [Using the Batch UI](#parasail-batch-ui)
* [openai-batch GitHub repository](https://github.com/parasail-ai/openai-batch)
* [Batch file format (100% compatible with OpenAI)](/parasail-docs/products/quickstart/file-format)
* [OpenAI Batch API reference (Parasail is 100% compatible)](/parasail-docs/products/quickstart/api-reference)

Parasail's batch processing engine is a straightforward and inexpensive way to process thousands or millions of LLM inferences. Batch inferencing is easy: create an input file, start a batch job, wait for it to finish, then download the output.

### Getting Started with the Parasail Batch Helper Library

{% hint style="info" %}
You can find the Parasail Batch Helper Library on GitHub. This library is 100% compatible with both Parasail and OpenAI.\
<https://github.com/parasail-ai/openai-batch>
{% endhint %}

The first step is to create a Parasail API key: <https://www.saas.parasail.io/keys>. This key should be stored in your environment through something like a .bashrc file or a .env file, or passed through the command line invocation. *Pasting directly in code isn't recommended as keys can get leaked.*

Next, install ***openai-batch***, the batch helper library that's compatible with both Parasail and OpenAI:

```bash
pip install openai-batch
```

Now you're ready to run a batch job in as little as five lines of code. The batch endpoint supports most transformers on Hugging Face; all you have to do is put the Hugging Face ID in the request. The model doesn't need to be deployed as a dedicated endpoint first. This example uses **NousResearch/DeepHermes-3-Mistral-24B-Preview** (<https://huggingface.co/NousResearch/DeepHermes-3-Mistral-24B-Preview>).

{% code overflow="wrap" %}

```python
#test_batch.py
import random
from openai_batch import Batch

# Create a batch with random prompts
with Batch() as batch:
    objects = ["cat", "robot", "coffee mug", "spaceship", "banana"]
    for i in range(100):
        batch.add_to_batch(
            model="NousResearch/DeepHermes-3-Mistral-24B-Preview",
            messages=[{"role": "user", "content": f"Tell me a joke about a {random.choice(objects)}"}]
        )
    # Submit, wait for completion, and download results
    result, output_path, error_path = batch.submit_wait_download()
    print(f"Batch completed with status {result.status} and stored in {output_path}")
```

{% endcode %}

This code looks for `PARASAIL_API_KEY` in the environment; you can also set it inline on the command line. Running the script produces the following output:

```bash
PARASAIL_API_KEY=<INSERT API KEY> python3 test_batch.py
validating
in_progress
...
in_progress
in_progress
completed
Batch completed with status completed and stored in batch-itvt3wmjs7-output.jsonl

```

The first two lines of `batch-itvt3wmjs7-output.jsonl` are the prompt responses:

```
{"id":"vllm-51653b28a97e4e67a0d3587f959ffe3a","custom_id":"line-1","response":{"status_code":200,"request_id":"vllm-batch-1c7186c472cd49b1aa12a81760094540","body":{"id":"chatcmpl-01be171caf3b4dc69cfa9ef360d26a4e","object":"chat.completion","created":1743050291,"model":"NousResearch/DeepHermes-3-Mistral-24B-Preview","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"Why did the coffee mug get arrested? It had too much to behave!","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":41,"total_tokens":60,"completion_tokens":19,"prompt_tokens_details":null},"prompt_logprobs":null}},"error":null}
{"id":"vllm-58a8ac3f589d4380ad94e00b191da24e","custom_id":"line-3","response":{"status_code":200,"request_id":"vllm-batch-2dd59bd942fa4a25a8792f9b8ee28da1","body":{"id":"chatcmpl-a59bfccd7d134981bf5e1004e77c6854","object":"chat.completion","created":1743050291,"model":"NousResearch/DeepHermes-3-Mistral-24B-Preview","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"Why can't you trust an astronaut? Because they're always out of this world!","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":41,"total_tokens":59,"completion_tokens":18,"prompt_tokens_details":null},"prompt_logprobs":null}},"error":null}
```

### Batch Submission Limitations

Both Parasail and OpenAI limit the maximum size of the batch input files:

* Up to 50,000 requests (lines)
* Up to **1000 MB** total input file size for Parasail
* Up to **250 MB** total input file size for OpenAI
* Max completion tokens for Parasail defaults to 8,192, but can be overridden to 16,384 through the `max_completion_tokens` parameter.

Workloads that exceed these limits must be split into multiple batches. The `add_to_batch` function in the [Parasail Batch Helper Library](https://github.com/parasail-ai/openai-batch) will raise a `ValueError` exception when the file size or request count is exceeded for the provider.

### Resuming Batch Jobs

A major convenience of batch processing is the ability to submit a job and resume the monitoring in a different process or flow. A developer can upload hundreds of batch jobs and millions of prompts without worrying about program crashes, errors, or resets—those prompts are processed on Parasail's servers until successful completion.

Changing the last two lines of the previous example submits the job and prints out the batch ID, then exits.

```python
# Submit the job and print the batch ID without waiting for completion
batch_id = batch.submit()
print(f"Batch ID: {batch_id}")
```

With a batch ID of `batch-tclfzwczcd`, you can now wait for the batch to finish and download it in a separate script:

{% code overflow="wrap" %}

```python
from openai_batch import Batch
import time

with Batch(batch_id="batch-tclfzwczcd",
           output_file="mybatch.jsonl") as batch:
    # Check status periodically
    while True:
        status = batch.status()
        print(f"Batch status: {status.status}")

        if status.status in ["completed", "failed", "expired", "cancelled"]:
            break

        time.sleep(60)  # Check every minute

    # Download results once completed
    output_path, error_path = batch.download()
    print(f"Output saved to: {output_path}")
```

{% endcode %}

Which outputs:

```
Batch status: in_progress
...
Batch status: in_progress
Batch status: completed
Output saved to: mybatch.jsonl
```

### Broad Model and Parameter Support

{% hint style="info" %}
Chat completions, embeddings, multimodal inputs, the full range of prompt parameters, and **even OpenAI models** are supported
{% endhint %}

**OpenAI Models**

{% code overflow="wrap" %}

```python
with Batch() as batch:
    for i in range(100):
        batch.add_to_batch(
            model="gpt-4o",
            messages=[{"role": "user", "content": "Give me 10 dad jokes"}]
        )
```

{% endcode %}

#### **Embedding Models** <a href="#batch-embedding-models" id="batch-embedding-models"></a>

GritLM (developed by Contextual and hosted on Hugging Face by Parasail) and GTE-Qwen2-7B-Instruct from Alibaba are two excellent open-source embedding models that rival proprietary models. To run embedding models like these, pass `input` instead of `messages`.

[parasail-ai/GritLM-7B-vllm](https://huggingface.co/parasail-ai/GritLM-7B-vllm)

[Alibaba-NLP/gte-Qwen2-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct)

As seen below, Parasail strongly recommends using [base64 encoding](https://platform.openai.com/docs/api-reference/embeddings/create#embeddings-create-encoding_format) `encoding_format="base64"` to reduce the size of the output files for both Parasail and OpenAI.

```python
with Batch() as batch:
    for i in range(100):
        batch.add_to_batch(
            model="parasail-ai/GritLM-7B-vllm",
            encoding_format="base64",
            input=f"This is input #{i}",
        )
```

**Parameters**

`add_to_batch` supports all of the parameters that `client.chat.completions.create` and `client.embeddings.create` support. Note that open-source models on Parasail may not support every parameter.

```python
with Batch() as batch:
    for i in range(100):
        batch.add_to_batch(
            model="NousResearch/DeepHermes-3-Mistral-24B-Preview",
            max_completion_tokens=1000,
            temperature=0.7,
            top_p=0.1,
            messages=[{"role": "user", "content": "Give me 10 dad jokes"}]
        )
```

**Metadata**

You can add metadata to the batch. This is useful for passing information between separate submit and download processes, as well as tracking the results on the status UI page. Any metadata you add to the submission is visible in the Batch UI progress section. For detailed information about the metadata field, see the [OpenAI Batch API Submit Specification](https://platform.openai.com/docs/api-reference/batch/create#batch-create-metadata).

```python
with Batch() as batch:
    for i in range(100):
        batch.add_to_batch(
            model="NousResearch/DeepHermes-3-Mistral-24B-Preview",
            messages=[{"role": "user", "content": "Give me 10 dad jokes"}]
        )
    batch.submit(metadata={"Job Name": "Dad jokes"})
```

### **Parasail Batch UI**

The Parasail Batch UI can be used to upload new batches, track the status of batches, and download results when finished.

While a batch is queued or running, you can view its status, download the input, or cancel it.

When a batch job is finished, you can download the input and output files and view the total token usage.

**Batch Submission**

Batches can be submitted directly from the Batch section of the platform by clicking the **Create Batch** button. This brings up a dialog to upload a JSONL file.

The JSONL file follows the OpenAI format for batch submissions; see [Batch file format](/parasail-docs/products/quickstart/file-format) for examples. Unlike the Parasail Batch Helper Library, the Batch UI doesn't support OpenAI models. It supports only open-source Hugging Face transformers and embedding models.

**Further Reading**

* [openai-batch GitHub repository](https://github.com/parasail-ai/openai-batch)
* [Batch file format (100% compatible with OpenAI)](/parasail-docs/products/quickstart/file-format)
* [OpenAI Batch API reference (Parasail is 100% compatible)](/parasail-docs/products/quickstart/api-reference)

## Next steps

* [Batch file format](/parasail-docs/products/quickstart/file-format)
* [Batch API Reference](/parasail-docs/products/quickstart/api-reference)
* [Batch Processing with Images](/parasail-docs/products/quickstart/images)
* [Troubleshooting](/parasail-docs/products/quickstart/troubleshooting)


# Batch Processing with Private Models

Run private Hugging Face models through the Parasail Batch API by pairing a batch job with a dedicated deployment.

In addition to supporting batch API for models that are publicly available on Hugging Face, Parasail also supports the batch API on models that are stored in private Hugging Face repos. This page explains how to run such private models via the batch API.

Follow these steps:

1. Create a dedicated deployment with the private model. See [Deploying private models through Hugging Face repositories](#getting-started-with-our-parasail-batch-helper-library) to learn how. The deployment doesn't need to be running for the remaining steps to work, so you can pause it immediately. The backend uses this deployment to determine which model, Hugging Face token, and configuration to use when processing a batch.
2. Find the endpoint name of the dedicated deployment. It appears at the top of the deployment's UI page in the description section, and in the examples at the bottom of the page. Endpoint names are typically your account name followed by the name you chose when creating the deployment.
3. Follow the standard workflow described in the [Batch API documentation](/parasail-docs/products/quickstart), but use the endpoint name of the dedicated deployment as the model name. Everything else remains the same.
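
As a rough illustration of the last step, the sketch below builds batch request lines that put a dedicated endpoint name in the `model` field. The endpoint name `my-account-my-private-model` and the helper `build_request` are hypothetical; replace the name with the one shown on your deployment page.

```python
import json

# Hypothetical endpoint name; copy the real one from your dedicated deployment page.
ENDPOINT_NAME = "my-account-my-private-model"


def build_request(custom_id, prompt, model=ENDPOINT_NAME):
    """Return one batch request line that targets the dedicated deployment."""
    return json.dumps({
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            # The endpoint name goes where a public Hugging Face ID would normally go.
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    })


lines = [build_request(f"request-{i}", prompt)
         for i, prompt in enumerate(["Say pong.", "Say ping."])]
print("\n".join(lines))
```

Writing `lines` to a `.jsonl` file, one per line, gives an input file that can be submitted like any other batch.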

## Next steps

* [Quick start](/parasail-docs/products/quickstart)
* [Private Hugging Face Models (Dedicated)](/parasail-docs/products/overview-1/private-hf-models)
* [Batch API Reference](/parasail-docs/products/quickstart/api-reference)
* [Batch file format](/parasail-docs/products/quickstart/file-format)
* [Troubleshooting](/parasail-docs/products/quickstart/troubleshooting)


# Batch Processing with Images

Submit image understanding jobs to Parasail batch endpoints with the openai-batch Python helper library.

The Python batch helper library can be used to submit images to Parasail batch processing endpoints. For detailed information about this library, read the [Batch](/parasail-docs/products/quickstart) page.

The library can be installed with the following command:

```bash
pip install openai-batch
```

The following example file in the GitHub repo shows how to use the library to read a directory of images and submit them to the `Qwen/Qwen3-VL-8B-Instruct` model. The library automatically reads the image files, converts them to Base64, inserts them into the batch submission, and stores the results in the output directory.

{% @github-files/github-code-block url="<https://github.com/parasail-ai/openai-batch/blob/main/examples/image_understanding.py>" %}
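
To make the Base64 step concrete, here is a minimal standard-library sketch of how an image could be embedded in a batch request by hand. It is not the library's actual implementation: the helper names `image_request` and `write_image_batch` are hypothetical, and it assumes the model accepts the OpenAI-style `image_url` content part carrying a data URL.

```python
import base64
import json
import mimetypes
from pathlib import Path


def image_request(image_path, prompt, model="Qwen/Qwen3-VL-8B-Instruct"):
    """Build one chat-completion batch request that embeds an image as a data URL."""
    path = Path(image_path)
    mime_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return {
        "custom_id": path.name,  # file names are unique within one directory
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url",
                     "image_url": {"url": f"data:{mime_type};base64,{encoded}"}},
                ],
            }],
        },
    }


def write_image_batch(image_dir, output_path, prompt="Describe this image."):
    """Write one request per image in image_dir and return the request count."""
    suffixes = {".png", ".jpg", ".jpeg"}
    paths = sorted(p for p in Path(image_dir).iterdir() if p.suffix.lower() in suffixes)
    with open(output_path, "w") as f:
        for p in paths:
            f.write(json.dumps(image_request(p, prompt)) + "\n")
    return len(paths)
```

Base64 inflates each image by about a third, so large image sets can reach the input file-size limit well before the request-count limit.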

## Next steps

* [Quick start](/parasail-docs/products/quickstart)
* [Batch file format](/parasail-docs/products/quickstart/file-format)
* [Multi-Modal guide](/parasail-docs/guides/multi-modal)
* [Batch API Reference](/parasail-docs/products/quickstart/api-reference)


# Batch File Format

Format batch input and output .jsonl files for Parasail's OpenAI-compatible Batch API.

Parasail supports the OpenAI batch API, including the format of the input and output files.

A batch input file is a `.jsonl` file where each line is a batch request. Parasail processes these requests and returns the results as an output `.jsonl` file.

## Batch input file

The batch input wraps the same data structures used by interactive requests into an offline format. Each line in the input file is one request.

The request is a JSON dictionary with the following keys:

* `custom_id`: A unique value that the user creates so they can later match outputs to inputs.
* `method`: HTTP method, currently `POST`.
* `url`: One of `/v1/chat/completions` or `/v1/embeddings`.
* `body`: The same as the body of an interactive request.
  * [Chat Completion](https://platform.openai.com/docs/api-reference/chat/create). Note: `stream` must be omitted or `false`.
  * [Embeddings](https://platform.openai.com/docs/api-reference/embeddings/create). Parasail strongly recommends using `"encoding_format": "base64"`. See below.

### Example batch input files

{% tabs %}
{% tab title="Chat Completion" %}

```json
{"custom_id": "b5b938a55cc349d13f08a2586f96807d", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "max_completion_tokens": 10, "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Say pong."}]}}
{"custom_id": "ad95b85d915346e29b7afa52314e94b8", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "max_completion_tokens": 1024, "messages": [{"role": "user", "content": "What is the capital of New York?"}]}}
{"custom_id": "c587d5391f4524068bb1d6a4a00b5177", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "max_completion_tokens": 1024, "messages": [{"role": "user", "content": "Write two very funny dad jokes."}], "temperature": 0.5}}
```

{% endtab %}

{% tab title="Embedding" %}

```json
{"custom_id": "request-1", "method": "POST", "url": "/v1/embeddings", "body": {"encoding_format": "base64", "model": "intfloat/e5-mistral-7b-instruct", "input": "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way—in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only."}}
{"custom_id": "request-2", "method": "POST", "url": "/v1/embeddings", "body": {"encoding_format": "base64", "model": "intfloat/e5-mistral-7b-instruct", "input": "With unit cost falling as the number of components per circuit rises, by 1975 economics may dictate squeezing as many as 65 000 components on a single silicon chip."}}
{"custom_id": "request-3", "method": "POST", "url": "/v1/embeddings", "body": {"encoding_format": "base64", "model": "intfloat/e5-mistral-7b-instruct", "input": "We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence."}}
```

{% endtab %}
{% endtabs %}

See [OpenAI's request input object documentation](https://platform.openai.com/docs/api-reference/batch/request-input).

### Best Practices

For embeddings, Parasail strongly recommends using `"encoding_format": "base64"` ([link](https://platform.openai.com/docs/api-reference/embeddings/create#embeddings-create-encoding_format)). This setting asks the server to return the result as a Base64-encoded binary array instead of a plain-text float array. This typically reduces file size by nearly 4x with no loss of precision, and the interactive OpenAI client uses this by default.
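
The size difference can be illustrated with a short standard-library sketch. It assumes the Base64 payload is a packed array of little-endian float32 values; the random vector is made-up data, and the exact ratio depends on how many digits the plain-text format prints per value.

```python
import base64
import json
import random
import struct

random.seed(0)
vector = [random.uniform(-1.0, 1.0) for _ in range(1024)]  # made-up embedding

# Plain-text float array, as returned with the default "float" encoding.
as_text = json.dumps(vector)

# Base64 of packed little-endian float32 values (the layout assumed here).
packed = struct.pack(f"<{len(vector)}f", *vector)
as_base64 = base64.b64encode(packed).decode("ascii")

# Decoding reverses the two steps.
decoded = struct.unpack(f"<{len(vector)}f", base64.b64decode(as_base64))

print(f"text: {len(as_text)} chars, base64: {len(as_base64)} chars")
```

Each float32 value takes 4 bytes, or roughly 5.3 Base64 characters, compared with a dozen or more characters when printed as text.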

### Limits

Parasail limits the maximum size of batch input files:

* Up to 50,000 requests (lines)
* Up to 1000 MB total input file size

Workloads that exceed these limits must be split into multiple batches.

## Batch output file

The batch output file wraps the interactive responses into an offline format. Each line in the output file is the response to one request.

**Important**: The order of responses may differ from the order of requests in the input file. Use `custom_id` to match each response to its request.

The response is a JSON dictionary. The most important keys:

* `custom_id`: The same as the value in the request. Use this to match responses to requests.
* `response`: The HTTP response, as a dictionary:
  * `status_code`: HTTP status code. `200` if the request succeeded; otherwise the same HTTP error code the request would have returned interactively.
  * `body`: The same as the body of an interactive response.
    * [Chat Completion](https://platform.openai.com/docs/api-reference/chat/object).
    * [Embeddings](https://platform.openai.com/docs/api-reference/embeddings/object).

See [OpenAI's request output object documentation](https://platform.openai.com/docs/api-reference/batch/request-output).

## Create your own

Creating a batch input file is straightforward. The exact approach depends on your workflow. The following code snippets get you started.

{% tabs %}
{% tab title="Python" %}
{% code lineNumbers="true" %}

```python
import json

prompts = ["What is the capital of New York?", "Write two very funny dad jokes.", "Say pong."]

with open("example_batch_input.jsonl", "w") as file:
    for i, prompt in enumerate(prompts):
        file.write(
            json.dumps(
                {
                    "custom_id": f"request-{i}",
                    "method": "POST",
                    "url": "/v1/chat/completions",
                    "body": {
                        # same request body as for an interactive /v1/chat/completions request
                        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
                        "max_completion_tokens": 1024,
                        "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": prompt}],
                    },
                }
            )
            + "\n"
        )
```

{% endcode %}
{% endtab %}
{% endtabs %}

## Next steps

* [Quick start](/parasail-docs/products/quickstart)
* [Batch API Reference](/parasail-docs/products/quickstart/api-reference)
* [Batch Processing with Images](/parasail-docs/products/quickstart/images)
* [Troubleshooting](/parasail-docs/products/quickstart/troubleshooting)


# Batch API Reference

Use Parasail's drop-in replacement for the OpenAI Batch API with your existing client libraries and API key.

Parasail offers a direct drop-in replacement for OpenAI's batch API. The same documentation, guides, and cookbooks apply.

You can use OpenAI's or community-built client libraries for your preferred language.

The only workflow changes are:

* Use the Parasail URL: `https://api.parasail.io/v1`
* Use your Parasail API key. Create one [here](https://www.saas.parasail.io/keys).
* Use an open model that Parasail supports instead of an OpenAI model.

## OpenAI API reference

[OpenAI Batch API reference](https://platform.openai.com/docs/api-reference/batch)

[OpenAI Files API reference](https://platform.openai.com/docs/api-reference/files)

## Client Setup

{% tabs %}
{% tab title="Python" %}

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.parasail.io/v1",
    api_key=os.environ["PARASAIL_API_KEY"],
)
```

{% endtab %}
{% endtabs %}
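
With a client configured as above, the usual OpenAI batch workflow of upload, create, poll, and download applies unchanged. The function below is a minimal sketch of that sequence using the OpenAI Python SDK's `files` and `batches` methods; the function name `run_batch` and the polling interval are illustrative choices, not part of any Parasail API.

```python
import time

TERMINAL_STATES = {"completed", "failed", "expired", "cancelled"}


def run_batch(client, input_path, endpoint="/v1/chat/completions", poll_seconds=60):
    """Upload a .jsonl file, start a batch, poll until it ends, and return the output text.

    client is an openai.OpenAI instance pointed at https://api.parasail.io/v1.
    """
    # 1. Upload the input file.
    with open(input_path, "rb") as f:
        input_file = client.files.create(file=f, purpose="batch")

    # 2. Create the batch job.
    batch = client.batches.create(
        input_file_id=input_file.id,
        endpoint=endpoint,
        completion_window="24h",
    )

    # 3. Poll until the batch reaches a terminal state.
    while batch.status not in TERMINAL_STATES:
        time.sleep(poll_seconds)
        batch = client.batches.retrieve(batch.id)

    # 4. Download the output file, if one was produced.
    if batch.output_file_id is None:
        return batch, None
    return batch, client.files.content(batch.output_file_id).text
```

Calling `run_batch(client, "example_batch_input.jsonl")` returns the final batch object and the raw output `.jsonl` text, or `None` when the batch ended without an output file.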

### Chat Completions vs Text Completions

Parasail supports both endpoints for as long as vLLM supports both.

The `/completions` endpoint returns the completion for a **single prompt** and takes a single string as input, whereas the `/chat/completions` endpoint returns the response for a given **dialog** and requires input in a specific format that represents the message history.

Use `/v1/completions` for single-prompt tasks, for example when your app provides:

* Translation
* Text or code generation
* Message revision
* Email writing

Use `/v1/chat/completions` for conversational tasks, for example when your app provides:

* ChatGPT-style chatbots
* Virtual assistants for customer service
* Interactive surveys and forms
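
The difference between the two endpoints shows up mainly in the request body. The sketch below contrasts the two shapes using the standard OpenAI field names; the model ID and prompt are placeholders, and parameter support may vary by model.

```python
# Request body for /v1/completions: a single prompt string.
completion_body = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "prompt": "Translate to French: Good morning.",
    "max_tokens": 64,
}

# Request body for /v1/chat/completions: a list of role-tagged messages.
chat_body = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Translate to French: Good morning."},
    ],
    "max_completion_tokens": 64,
}
```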

## Next steps

* [Quick start](/parasail-docs/products/quickstart)
* [Batch file format](/parasail-docs/products/quickstart/file-format)
* [Troubleshooting](/parasail-docs/products/quickstart/troubleshooting)


# Batch Priority

Prioritize certain batch jobs over others using the metadata field.

You can mark some batch jobs as higher priority than others. Priority is specified as an integer passed via the `metadata` field when creating a batch.

## How It Works

* Batches have a **default priority of 0**.
* Set a higher number for higher priority. For example, a batch with priority `1` will be processed before a batch with priority `0`.
* Priority is passed as a metadata key-value pair: `"PARASAIL_PRIORITY": "1"`.

## Usage

Add the `PARASAIL_PRIORITY` key to the `metadata` field when creating a batch:

{% tabs %}
{% tab title="openai-batch" %}

```python
from openai_batch import Batch

with Batch() as batch:
    for i in range(100):
        batch.add_to_batch(
            model="NousResearch/DeepHermes-3-Mistral-24B-Preview",
            messages=[{"role": "user", "content": f"Prompt #{i}"}]
        )
    batch.submit(metadata={"PARASAIL_PRIORITY": "1"})
```

{% endtab %}

{% tab title="Python (OpenAI SDK)" %}

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.parasail.io/v1",
    api_key=os.environ["PARASAIL_API_KEY"],
)

batch = client.batches.create(
    input_file_id="file-abc123",
    endpoint="/v1/chat/completions",
    completion_window="24h",
    metadata={
        "PARASAIL_PRIORITY": "1"
    }
)
```

{% endtab %}

{% tab title="curl" %}

```bash
curl https://api.parasail.io/v1/batches \
  -H "Authorization: Bearer $PARASAIL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input_file_id": "file-abc123",
    "endpoint": "/v1/chat/completions",
    "completion_window": "24h",
    "metadata": {
      "PARASAIL_PRIORITY": "1"
    }
  }'
```

{% endtab %}
{% endtabs %}

## Notes

* The `metadata` field follows the same format as the [OpenAI Batch API](https://platform.openai.com/docs/api-reference/batch/create#batch_create-metadata). Up to 16 key-value pairs can be attached to a batch.
* The `completion_window` field is present for OpenAI API compatibility. It isn't a Parasail completion-time guarantee.
* Priority values are integers. Higher numbers mean higher priority.
* Batches without `PARASAIL_PRIORITY` set default to `0`.
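
If you set priority from several places in your code, a small helper can keep the metadata well-formed. The sketch below is one hypothetical way to do that: `priority_metadata` is not part of any library, and it simply applies the rules listed above (integer priority passed as a string, at most 16 key-value pairs).

```python
def priority_metadata(priority=0, extra=None):
    """Build a metadata dict carrying PARASAIL_PRIORITY plus optional extra pairs."""
    if isinstance(priority, bool) or not isinstance(priority, int):
        raise TypeError("priority must be an integer")
    metadata = {str(key): str(value) for key, value in (extra or {}).items()}
    metadata["PARASAIL_PRIORITY"] = str(priority)  # metadata values are strings
    if len(metadata) > 16:
        raise ValueError("a batch accepts at most 16 metadata key-value pairs")
    return metadata


print(priority_metadata(1, {"Job Name": "Dad jokes"}))
```

The returned dictionary can be passed as the `metadata` argument in any of the three examples in the Usage section.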


# Troubleshooting

Diagnose and fix common batch job failures, from JSONL validation errors to quota and authentication issues.

This page covers the most common problems you'll hit when submitting and processing batch jobs, and how to resolve them. Parasail's batch service is 100% compatible with the OpenAI batch API, so the file formats, status values, and error codes described here match the OpenAI batch model.

## JSONL validation failures

A batch input file is a `.jsonl` file where **each line is a single, complete JSON object** representing one request. A batch will fail validation if any line is malformed or missing required fields.

Each request line must include:

* `custom_id`: a unique value you create so you can later match outputs to inputs.
* `method`: the HTTP method, currently `POST`.
* `url`: one of `/v1/chat/completions` or `/v1/embeddings`.
* `body`: the same JSON body you would send to an interactive request.

Common causes of validation failures:

* **Malformed lines.** A line isn't valid JSON (trailing comma, unescaped quote, or a request split across multiple lines). Each request must be exactly one line.
* **Missing required keys.** A line is missing `custom_id`, `method`, `url`, or `body`.
* **Duplicate or missing `custom_id`.** Each `custom_id` should be unique so responses can be matched back unambiguously.
* **Wrong `url`.** Only `/v1/chat/completions` and `/v1/embeddings` are supported.
* **`stream` set in the body.** For chat completions, `stream` must be omitted or set to `false`.

To debug, validate the file locally before submitting:

```python
import json

with open("example_batch_input.jsonl") as f:
    for i, line in enumerate(f, start=1):
        try:
            req = json.loads(line)
        except json.JSONDecodeError as e:
            print(f"Line {i} is not valid JSON: {e}")
            continue
        for key in ("custom_id", "method", "url", "body"):
            if key not in req:
                print(f"Line {i} is missing required key: {key}")
```

See [Batch file format](/parasail-docs/products/quickstart/file-format) for the full request and response schema.
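
The remaining causes in the list above (duplicate `custom_id`, wrong `url`, and `stream` set in the body) can be checked locally as well. The following is a sketch of a broader validator; the function name and messages are illustrative, and it mirrors only the rules stated on this page rather than the server's full validation logic.

```python
import json

ALLOWED_URLS = {"/v1/chat/completions", "/v1/embeddings"}
REQUIRED_KEYS = ("custom_id", "method", "url", "body")


def validate_batch_lines(lines):
    """Return a list of (line_number, message) problems found in the request lines."""
    problems = []
    seen_ids = set()
    for number, line in enumerate(lines, start=1):
        try:
            req = json.loads(line)
        except json.JSONDecodeError as e:
            problems.append((number, f"not valid JSON: {e.msg}"))
            continue
        if not isinstance(req, dict):
            problems.append((number, "request is not a JSON object"))
            continue
        missing = [key for key in REQUIRED_KEYS if key not in req]
        if missing:
            problems.append((number, f"missing required keys: {', '.join(missing)}"))
            continue
        custom_id = req["custom_id"]
        if not isinstance(custom_id, str):
            problems.append((number, "custom_id is not a string"))
        elif custom_id in seen_ids:
            problems.append((number, f"duplicate custom_id: {custom_id}"))
        else:
            seen_ids.add(custom_id)
        if req["method"] != "POST":
            problems.append((number, f"unsupported method: {req['method']}"))
        if req["url"] not in ALLOWED_URLS:
            problems.append((number, f"unsupported url: {req['url']}"))
        body = req["body"]
        if isinstance(body, dict) and body.get("stream"):
            problems.append((number, "stream must be omitted or false"))
    return problems
```

Pass it an open file or a list of strings, for example `validate_batch_lines(open("example_batch_input.jsonl"))`, and fix every reported line before submitting.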

## File-size and request-count limits exceeded

Parasail limits the size of each batch input file:

* Up to **50,000 requests** (lines) per input file.
* Up to **1000 MB** total input file size.

Workloads that exceed either limit must be split into multiple batches. If you use the [Parasail Batch Helper Library](https://github.com/parasail-ai/openai-batch), `add_to_batch` raises a `ValueError` when the file-size or request-count limit for the provider is exceeded. Catch it and roll over into a new batch:

```python
from openai_batch import Batch

def submit_in_chunks(requests):
    batch_ids = []
    batch = Batch()
    for req in requests:
        try:
            batch.add_to_batch(**req)
        except ValueError:
            # Current batch is full; submit it and start a new one.
            batch_ids.append(batch.submit())
            batch = Batch()
            batch.add_to_batch(**req)
    batch_ids.append(batch.submit())
    return batch_ids
```

If you submit through the API or Batch UI directly, split the `.jsonl` file yourself so that each file stays under both limits.
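
One way to split a file by hand is sketched below. The helper `split_jsonl` is hypothetical, and it assumes 1 MB means 1,000,000 bytes; adjust `MAX_BYTES` if the service counts megabytes differently.

```python
MAX_REQUESTS = 50_000
MAX_BYTES = 1000 * 1000 * 1000  # assumption: 1 MB = 1,000,000 bytes


def split_jsonl(lines, max_requests=MAX_REQUESTS, max_bytes=MAX_BYTES):
    """Yield lists of request lines that each stay within both limits."""
    chunk, chunk_bytes = [], 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        size = len(line.encode("utf-8")) + 1  # +1 for the trailing newline
        if size > max_bytes:
            raise ValueError("a single request exceeds the file-size limit")
        # Start a new chunk when adding this line would break either limit.
        if chunk and (len(chunk) >= max_requests or chunk_bytes + size > max_bytes):
            yield chunk
            chunk, chunk_bytes = [], 0
        chunk.append(line)
        chunk_bytes += size
    if chunk:
        yield chunk
```

Each yielded chunk can then be written to its own file, for example `part-0.jsonl`, `part-1.jsonl`, and so on, and submitted as a separate batch.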

## Jobs stuck in `validating` or `in_progress`

A batch moves through `validating`, then `in_progress`, then a terminal state (`completed`, `failed`, `expired`, or `cancelled`). Jobs run on Parasail's servers, so a long-running batch is processed to completion even if your local script crashes, resets, or exits.

You do **not** need to keep a process running to wait for results. Record the batch ID at submission time and resume monitoring later from a separate process:

```python
from openai_batch import Batch
import time

with Batch(batch_id="batch-tclfzwczcd",
           output_file="mybatch.jsonl") as batch:
    while True:
        status = batch.status()
        print(f"Batch status: {status.status}")
        if status.status in ["completed", "failed", "expired", "cancelled"]:
            break
        time.sleep(60)  # Check every minute

    output_path, error_path = batch.download()
    print(f"Output saved to: {output_path}")
```

You can also track progress, view metadata, or cancel a queued or running batch from the [Batch UI](/parasail-docs/products/quickstart#parasail-batch-ui).

## Partial failures

A batch can complete with some requests succeeding and others failing. Failures are reported **per request** rather than failing the whole batch, so always inspect each output line.

Two things to keep in mind when reading output:

* **The order of responses may differ from the order of requests** in the input file.
* **Match responses to requests using `custom_id`**, not by line position.

Each output line contains a `response` object with a `status_code`. A `200` means the request succeeded; any other code is the same HTTP error you would have received for that request interactively.

```python
import json

results = {}
with open("mybatch.jsonl") as f:
    for line in f:
        record = json.loads(line)
        custom_id = record["custom_id"]
        response = record.get("response") or {}
        status = response.get("status_code")
        if status == 200:
            results[custom_id] = response["body"]
        else:
            print(f"{custom_id} failed with status {status}: {record.get('error')}")
```

Failed requests can be collected and resubmitted in a new batch.
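
A retry file can be assembled from the original input and the downloaded results. The sketch below keeps every input request whose `custom_id` has no `200` response, which also covers requests that are reported in a separate error file or are missing from the output. The function names are hypothetical.

```python
import json


def succeeded_custom_ids(output_lines):
    """Return the custom_ids whose response has status code 200."""
    succeeded = set()
    for line in output_lines:
        record = json.loads(line)
        response = record.get("response") or {}
        if response.get("status_code") == 200:
            succeeded.add(record["custom_id"])
    return succeeded


def build_retry_lines(input_lines, output_lines):
    """Return the original request lines that did not succeed."""
    succeeded = succeeded_custom_ids(output_lines)
    return [
        line for line in input_lines
        if json.loads(line)["custom_id"] not in succeeded
    ]
```

Write the returned lines to a new `.jsonl` file and submit it as a fresh batch. If a request keeps failing with the same status code, inspect its error message before retrying again.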

## Quota errors

If a batch can't be scheduled because your organization is out of GPU capacity, you'll see an **"Insufficient quota"** message. By default, each organization has a quota of **4 GPUs**, shared across both Batch and Dedicated services regardless of GPU type. A single batch that requires more GPUs than your remaining quota will be rejected.

To resolve this:

* Wait for other jobs that are consuming GPUs to finish, or
* Request more capacity through the Quota Increase form. See [Quota Management](/parasail-docs/billing/quota-management) for details. Increases up to 8 GPUs are available without additional justification.

## Authentication errors (401)

The batch tooling reads your key from the `PARASAIL_API_KEY` environment variable. A `401` response, or an error about a missing API key, usually means the key isn't set or isn't valid.

* Confirm the variable is set in the environment that runs your script:

  ```bash
  PARASAIL_API_KEY=<your-key> python3 test_batch.py
  ```
* Store the key in a `.env` file or your shell profile rather than pasting it into code, where it can be leaked.
* Create or rotate keys at <https://www.saas.parasail.io/keys>.
* When using the OpenAI SDK directly, point the client at Parasail and pass the key explicitly:

  ```python
  import os
  from openai import OpenAI

  client = OpenAI(
      base_url="https://api.parasail.io/v1",
      api_key=os.environ["PARASAIL_API_KEY"],
  )
  ```

## Output parsing

Results are returned as an output `.jsonl` file, with one response per line. Parse it line by line as JSON and pull `response.body` out of each record (see the partial-failures snippet above for matching on `custom_id`).

For embeddings, Parasail strongly recommends submitting requests with `"encoding_format": "base64"`, which returns each embedding as a Base64-encoded binary array instead of a plain-text float array. This typically reduces output file size by nearly 4x with no loss of precision, but it means you must **decode** the value before use:

```python
import base64
import json
import numpy as np

with open("embeddings-output.jsonl") as f:
    for line in f:
        record = json.loads(line)
        for item in record["response"]["body"]["data"]:
            vector = np.frombuffer(base64.b64decode(item["embedding"]), dtype=np.float32)
            # use vector ...
```
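
To see where the size saving comes from, the standard-library sketch below encodes the same vector both ways and compares the lengths. It assumes 32-bit little-endian floats (matching `np.float32` on most machines) and uses a random 1024-dimensional vector purely as sample data; the exact ratio depends on how many digits the plain-text serialization prints.

```python
import base64
import json
import random
import struct

# Sample data only: a random 1024-dimensional vector, rounded to float32.
random.seed(0)
values = [random.uniform(-1.0, 1.0) for _ in range(1024)]
packed = struct.pack(f"<{len(values)}f", *values)  # little-endian float32 bytes
float32_values = list(struct.unpack(f"<{len(values)}f", packed))

as_text = json.dumps(float32_values)                  # plain-text float array
as_base64 = base64.b64encode(packed).decode("ascii")  # Base64 binary array

# Decoding the Base64 string recovers exactly the same float32 values.
decoded = list(struct.unpack(f"<{len(values)}f", base64.b64decode(as_base64)))
assert decoded == float32_values

ratio = len(as_text) / len(as_base64)
print(f"text: {len(as_text)} chars, base64: {len(as_base64)} chars, ratio: {ratio:.1f}x")
```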

## Next steps

* [Quick start](/parasail-docs/products/quickstart)
* [Batch file format](/parasail-docs/products/quickstart/file-format)
* [Batch API Reference](/parasail-docs/products/quickstart/api-reference)
* [Quota Management](/parasail-docs/billing/quota-management)


# Image Generation

Run diffusion models for batch image generation and editing with Parasail.

Parasail supports batch image generation and editing for diffusion models such as [OmniGen](https://huggingface.co/Shitao/OmniGen-v1) and [Qwen-Image-Edit](https://huggingface.co/Qwen/Qwen-Image-Edit). Prompts are submitted as JSONL files—one prompt per line—through either the OpenAI-compatible [Batch API](/parasail-docs/products/quickstart/api-reference) or the [Parasail Batch helper library](/parasail-docs/products/quickstart), which wraps the OpenAI client.

Input and output images are encoded as raw Base64: send only the Base64 image data, not a Data URL such as `data:image/png;base64,...`. Batch results are returned as JSONL with Base64-encoded output images. Batch submission files have a **1000 MB** limit; see [Batch file format](/parasail-docs/products/quickstart/file-format) for the full limits.
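
If your images arrive as Data URLs (for example from a browser), the prefix has to be removed before submission. The helper below is a minimal sketch; `to_raw_base64` is a hypothetical name, not part of the batch helper library.

```python
import base64


def to_raw_base64(value: str) -> str:
    """Strip a `data:<type>;base64,` prefix, if present, and validate the rest."""
    if value.startswith("data:"):
        header, sep, payload = value.partition(",")
        if not sep or not header.endswith(";base64"):
            raise ValueError("not a Base64 data URL")
        value = payload
    # validate=True rejects characters outside the Base64 alphabet.
    base64.b64decode(value, validate=True)
    return value
```

Strings that are already raw Base64 pass through unchanged, so the helper can be applied to every input image.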

## Minimal example

The quickest way to generate an image is the batch helper library, which builds the JSONL request file and submits it for you:

```python
from openai_batch import Batch

with Batch() as batch:
    batch.add_to_batch(
        model="Shitao/OmniGen-v1",
        prompt="A serene mountain lake at sunrise, photorealistic",
        size="1024x1024",
        response_format="b64_json",
    )
    result, output_path, error_path = batch.submit_wait_download()
```

For editing, pass one or more Base64-encoded input images and reference them in the prompt:

```python
import base64

with open("person.jpg", "rb") as f:
    img = base64.b64encode(f.read()).decode("utf-8")

batch.add_to_batch(
    model="Shitao/OmniGen-v1",
    prompt="A man in a black shirt is reading a book. The man is the right man in <img><|image_1|></img>.",
    size="1024x1024",
    image=[img],
    response_format="b64_json",
)
```

## Supported parameters

| Parameter         | Description                                                                                 |
| ----------------- | ------------------------------------------------------------------------------------------- |
| `model`           | For example `Shitao/OmniGen-v1` or `Qwen/Qwen-Image-Edit`                                   |
| `prompt`          | The prompt to guide generation. Reference input images inline as `<img><\|image_1\|></img>` |
| `size`            | Formatted as `WxH` in pixels, for example `1024x1024`                                       |
| `image`           | A **list** of Base64-encoded JPGs or PNGs (used for editing)                                |
| `response_format` | Currently only `b64_json` is supported                                                      |

## Advanced: process a directory of images

The script below uses the batch helper library to process every image in an input directory and save the generated results to an output directory. The batch library has many additional features for building processing pipelines—see the [Batch quickstart](/parasail-docs/products/quickstart), which also shows how to monitor batch jobs in Parasail's UI.

To run it:

```sh
pip3 install openai-batch
PARASAIL_API_KEY=<YOUR-API-KEY> python3 test_omnigen_batch.py
```

<pre class="language-python"><code class="lang-python"><strong>#test_omnigen_batch.py
</strong>
<strong>#pip install openai-batch
</strong><strong>from openai_batch import Batch
</strong>from PIL import Image
import os
import base64
import json
import io
from pathlib import Path


def extract_and_save_images(input_file: str, output_dir: str):
    """Extract base64 encoded images from batch output and save as JPG files."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    counter = 1
    with open(input_file, "r", encoding="utf-8") as f:
        for line in f:
            try:
                data = json.loads(line.strip())
                b64_data = data["response"]["body"]["data"][0]["b64_json"]
                image_bytes = base64.b64decode(b64_data)
                image = Image.open(io.BytesIO(image_bytes))
                output_path = os.path.join(output_dir, f"image-{counter}.jpg")
                image.convert("RGB").save(output_path, format="JPEG")
                print(f"Saved {output_path}")
                counter += 1
            except Exception as e:
                print(f"Error processing line {counter}: {e}")
                counter += 1
                continue


# Input and output directories
input_images_dir = Path("omnigen_input_images")
output_images_dir = Path("omnigen_output_images")

# Batch files
output_file = Path("output.jsonl")
error_file = Path("error.jsonl")
submission_file = Path("batch_submission.jsonl")

# Get all image files from input directory
image_extensions = {".jpg", ".jpeg", ".png"}
image_files = [
    f
    for f in input_images_dir.iterdir()
    if f.is_file() and f.suffix.lower() in image_extensions
]

if not image_files:
    print(f"No image files found in {input_images_dir}")
    print(f"Supported extensions: {', '.join(image_extensions)}")
    raise SystemExit(1)

print(f"Found {len(image_files)} images to process")

# Create a batch with transfusion request
with Batch(
    submission_input_file=submission_file,
    output_file=output_file,
    error_file=error_file,
) as batch_obj:
    # Process each image and add to batch
    for image_path in image_files:
        print(f"Processing: {image_path.name}")

        # Read and encode the image as base64
        with open(image_path, "rb") as img_file:
            image_data = img_file.read()
            base64_image = base64.b64encode(image_data).decode("utf-8")

        # Add transfusion request to the batch
        batch_obj.add_to_batch(
            model="Shitao/OmniGen-v1",
            prompt="A man in a black shirt is reading a book. The man is the right man in &#x3C;img>&#x3C;|image_1|>&#x3C;/img>.",
            size="1024x1024",
            image=[base64_image],
            response_format="b64_json",
        )

    print(f"\nSubmitting batch with {len(image_files)} requests...")

    # Submit, wait for completion, and download results
    result, output_path, error_path = batch_obj.submit_wait_download()

    # Verify the batch completed successfully
    assert result.status == "completed", f"Batch failed with status: {result.status}"

# Verify output file exists
assert output_file.exists(), "Output file not created for transfusion"

# Extract and save generated images
extract_and_save_images(str(output_file), str(output_images_dir))
</code></pre>

## Next steps

* [Batch quickstart](/parasail-docs/products/quickstart)—batch helper library fundamentals
* [Batch with images](/parasail-docs/products/quickstart/images)—batch image processing reference
* [Batch API reference](/parasail-docs/products/quickstart/api-reference)—OpenAI-compatible API spec
* [Batch file format](/parasail-docs/products/quickstart/file-format)—JSONL input format and limits


# Authentication

API key creation, base URLs, and authentication for the Parasail API.

All Parasail API requests use bearer token authentication. The API is fully OpenAI-compatible.

## Base URLs

| Product/API      | Base URL                     | Notes                          |
| ---------------- | ---------------------------- | ------------------------------ |
| Chat Completions | `https://api.parasail.io/v1` | OpenAI-compatible              |
| Responses API    | `https://api.parasail.io/v1` | OpenAI-compatible              |
| Embeddings       | `https://api.parasail.io/v1` | OpenAI-compatible              |
| Batch API        | `https://api.parasail.io/v1` | `/v1/batches`                  |
| Files API        | `https://api.parasail.io/v1` | `/v1/files`, for batch uploads |

`https://api.parasail.io/v1` is the single gateway for all Parasail inference and batch APIs. Don't use other hosts, including `https://api.saas.parasail.io/v1` and `https://api-webflux.saas.parasail.io/v1`, in new integrations.

## Getting an API key

1. Go to <https://www.saas.parasail.io/keys>.
2. Click **Create API Key**.
3. Copy the key immediately—it's displayed only once.

## Using your API key

Pass your API key in the `Authorization` header:

```
Authorization: Bearer <PARASAIL_API_KEY>
```
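
If you are not using an SDK, the header can be set on any HTTP client. The sketch below builds, but does not send, a request with Python's standard library; `build_models_request` is a hypothetical helper name, and the `/v1/models` path is the one used in the cURL example on this page.

```python
import urllib.request


def build_models_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated GET request for /v1/models."""
    return urllib.request.Request(
        "https://api.parasail.io/v1/models",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


# To send it:
#   import os
#   request = build_models_request(os.environ["PARASAIL_API_KEY"])
#   with urllib.request.urlopen(request) as resp:
#       print(resp.status)
```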

### With the OpenAI SDK (Python)

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.parasail.io/v1",
    api_key=os.environ["PARASAIL_API_KEY"],
)
```

### With cURL

```bash
curl https://api.parasail.io/v1/models \
  -H "Authorization: Bearer $PARASAIL_API_KEY"
```

### Environment variable

Store your key in an environment variable rather than hardcoding it:

```bash
export PARASAIL_API_KEY="your-api-key-here"
```

## API key types

* **Standard API keys**—full read/write access to all endpoints.
* **Read-only API keys**—restricted access for monitoring and querying. See [Read-Only API Keys](/parasail-docs/security-and-account-management/read-only-api-keys).

## Organization-based billing

Your API key belongs to an Organization. All usage is billed to that Organization. See [Account Management](/parasail-docs/security-and-account-management/account-management) for details on organizations and switching between them.


# Chat Completions

OpenAI-compatible chat completions API for serverless and dedicated model inference.

Parasail provides a fully OpenAI-compatible chat completions endpoint. Point your existing OpenAI SDK at Parasail's base URL—no code changes needed beyond the URL and API key.

## Endpoint compatibility matrix

The table below shows which Parasail products serve each OpenAI-compatible endpoint. All endpoints share the same base URL (`https://api.parasail.io/v1`) and the same bearer-token authentication.

| Endpoint                    | Serverless | Dedicated                       | Batch                  |
| --------------------------- | ---------- | ------------------------------- | ---------------------- |
| `POST /v1/chat/completions` | Yes        | Yes                             | Yes                    |
| `POST /v1/completions`      | Yes        | Yes                             | Yes                    |
| `POST /v1/embeddings`       | Yes        | Yes                             | Yes                    |
| `POST /v1/responses`        | Yes        | TODO: confirm Dedicated support | No                     |
| `GET /v1/models`            | Yes        | Yes                             | N/A (listing endpoint) |

Batch serves the `/v1/chat/completions` and `/v1/embeddings` endpoints through JSONL input files. See the [Batch API reference](/parasail-docs/api-reference/batch-api).

## Endpoint

```
POST https://api.parasail.io/v1/chat/completions
```

## Authentication

All requests use bearer-token authentication. Pass your API key in the `Authorization` header:

```
Authorization: Bearer <PARASAIL_API_KEY>
```

The OpenAI SDK reads this from the `api_key` argument. See [Authentication](/parasail-docs/api-reference/authentication) for how to create and store a key.

## Request

### Python (OpenAI SDK)

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.parasail.io/v1",
    api_key=os.environ["PARASAIL_API_KEY"],
)

response = client.chat.completions.create(
    model="parasail-deepseek-r1",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of New York?"}
    ],
    max_tokens=1024,
    temperature=0.7
)

print(response.choices[0].message.content)
```

### cURL

```bash
curl https://api.parasail.io/v1/chat/completions \
  -H "Authorization: Bearer $PARASAIL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "parasail-deepseek-r1",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of New York?"}
    ],
    "max_tokens": 1024,
    "temperature": 0.7
  }'
```

## Request parameters

| Parameter            | Type    | Default  | Description                                                      |
| -------------------- | ------- | -------- | ---------------------------------------------------------------- |
| `model`              | string  | required | Model ID (for example, `parasail-deepseek-r1`) or HuggingFace ID |
| `messages`           | array   | required | Array of message objects with `role` and `content`               |
| `temperature`        | float   | 1.0      | Controls randomness (0 = greedy decoding)                        |
| `top_p`              | float   | 1.0      | Nucleus sampling probability mass                                |
| `top_k`              | int     | -1       | Limits token selection to top-k tokens (pass in `extra_body`)    |
| `max_tokens`         | int     | None     | Maximum tokens to generate                                       |
| `repetition_penalty` | float   | 1.0      | Penalizes repeated tokens                                        |
| `presence_penalty`   | float   | 0.0      | Increases likelihood of new tokens                               |
| `frequency_penalty`  | float   | 0.0      | Penalizes frequently occurring tokens                            |
| `seed`               | int     | None     | Fixed seed for reproducible results                              |
| `stream`             | boolean | false    | Stream tokens as they're generated                               |
| `tools`              | array   | None     | Tool/function definitions for function calling                   |
| `tool_choice`        | string  | "auto"   | `"auto"`, `"required"`, `"none"`, or specific tool               |

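Most of these fields follow the OpenAI schema. `top_k` and `repetition_penalty` are extensions that the OpenAI Python SDK does not accept as named arguments, so pass them through `extra_body`. For exhaustive semantics of the standard fields, see the [OpenAI chat completions reference](https://platform.openai.com/docs/api-reference/chat/create). For sampling-parameter guidance, see [full parameter details](/parasail-docs/api-reference/parameters).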
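
When calling through the OpenAI Python SDK, parameters the SDK does not define have to travel in `extra_body`. The sketch below separates them out; `split_params` and `NON_OPENAI_PARAMS` are hypothetical names, and the set reflects an assumption about which rows of the table above fall outside the SDK's named arguments.

```python
# Assumption: the parameters from the table that the OpenAI SDK does not
# accept as named arguments.
NON_OPENAI_PARAMS = {"top_k", "repetition_penalty"}


def split_params(params: dict) -> dict:
    """Return keyword arguments suitable for client.chat.completions.create()."""
    kwargs = {k: v for k, v in params.items() if k not in NON_OPENAI_PARAMS}
    extra = {k: v for k, v in params.items() if k in NON_OPENAI_PARAMS}
    if extra:
        kwargs["extra_body"] = extra
    return kwargs


kwargs = split_params({"temperature": 0.7, "top_k": 40, "repetition_penalty": 1.1})
print(kwargs)

# Usage, given a configured `client`, a model ID, and a messages list:
#   client.chat.completions.create(model=model, messages=messages, **kwargs)
```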

## Message roles

| Role        | Description                                                  |
| ----------- | ------------------------------------------------------------ |
| `system`    | Sets the model's behavior and persona                        |
| `user`      | The user's input                                             |
| `assistant` | The model's previous response (for multi-turn conversations) |

## Response

A successful request returns a `chat.completion` object. The response shape matches OpenAI:

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1727900000,
  "model": "parasail-deepseek-r1",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Albany is the capital of New York."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 9,
    "total_tokens": 33
  }
}
```

| Field                       | Description                                             |
| --------------------------- | ------------------------------------------------------- |
| `id`                        | Unique identifier for the completion                    |
| `choices[].message.content` | The generated text                                      |
| `choices[].finish_reason`   | Why generation stopped (`stop`, `length`, `tool_calls`) |
| `usage`                     | Prompt, completion, and total token counts              |

See the [OpenAI response object reference](https://platform.openai.com/docs/api-reference/chat/object) for every field.
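
As a small illustration of reading these fields from the raw JSON, the sketch below pulls out the text, checks whether generation stopped at the token limit (`finish_reason` of `length`), and reads the token total. The `summarize` helper is a hypothetical name, and the sample is a trimmed copy of the response shown above.

```python
import json

raw = """{
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Albany is the capital of New York."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 24, "completion_tokens": 9, "total_tokens": 33}
}"""


def summarize(completion: dict) -> dict:
    """Extract the fields most callers need from a chat.completion object."""
    choice = completion["choices"][0]
    return {
        "text": choice["message"]["content"],
        # "length" means generation stopped at the token limit.
        "truncated": choice["finish_reason"] == "length",
        "total_tokens": completion["usage"]["total_tokens"],
    }


summary = summarize(json.loads(raw))
print(summary["text"], summary["truncated"], summary["total_tokens"])
```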

## Streaming

Set `stream: true` to receive tokens as they are generated. The response is a stream of server-sent events. Each event's data is a `chat.completion.chunk` object whose `choices[].delta` carries the incremental content; the stream ends with a `data: [DONE]` line.

```python
stream = client.chat.completions.create(
    model="parasail-deepseek-r1",
    messages=[{"role": "user", "content": "Write a haiku about the sea."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```
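
The SDK hides the wire format, but it can help to see it when using another HTTP client. The sketch below parses raw server-sent-event lines by hand; the sample lines are made up for illustration and follow the chunk shape described above.

```python
import json


def iter_stream_text(lines):
    """Yield incremental text from an iterable of raw SSE lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Illustrative sample, not captured output.
sample = [
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    "",
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "Waves"}}]}',
    "",
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": " fold"}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```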

## Error responses

Errors use OpenAI-compatible status codes and a JSON body with an `error` object (`message`, `type`, `code`).

| Status Code | Description                                                                            |
| ----------- | -------------------------------------------------------------------------------------- |
| `400`       | Malformed request—invalid JSON, missing required field, or unsupported parameter value |
| `401`       | Missing or invalid API key                                                             |
| `403`       | API key lacks access to the requested model or resource                                |
| `404`       | Model or endpoint not found                                                            |
| `429`       | Rate limit or quota exceeded—back off and retry                                        |
| `500`       | Internal server error—retry with exponential backoff                                   |
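
The retry guidance for `429` and `500` can be sketched as a small wrapper. This is one possible approach under stated assumptions, not a prescribed client: `call_with_retries` and `backoff_delay` are hypothetical names, the delays are arbitrary defaults, and `send` stands for any function of yours that performs the request and returns a `(status_code, body)` pair.

```python
import random
import time

RETRYABLE_STATUS = {429, 500}


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter; attempt numbering starts at 0."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


def call_with_retries(send, max_attempts: int = 5, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out."""
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE_STATUS or attempt == max_attempts - 1:
            return status, body
        sleep(backoff_delay(attempt))
```

Client errors such as `400`, `401`, `403`, and `404` are returned immediately, since repeating the same request will not change the outcome.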

## Chat completions vs text completions

Parasail supports both endpoints:

* `/v1/chat/completions`—for dialog-based interactions (chatbots, assistants)
* `/v1/completions`—for single-prompt completion (translation, code generation, email drafting)

## Model availability

List available serverless models:

```bash
curl https://api.parasail.io/v1/models \
  -H "Authorization: Bearer $PARASAIL_API_KEY"
```

## Next steps

* [Chat completions guide](/parasail-docs/guides/chat-completions)—messages, roles, base vs instruct models, chat templates
* [Tool/function calling guide](/parasail-docs/guides/tool-function-calling)—function calling tutorial
* [Structured output guide](/parasail-docs/guides/structured-output)—JSON-formatted responses
* [Parameters reference](/parasail-docs/api-reference/parameters)—sampling parameters in depth
* [Serverless overview](/parasail-docs/products/overview)—using third-party tools like Cline and AnythingLLM


# Responses API

Responses API reference for multi-turn agentic workflows with tool calling.

The Responses API supports multi-turn agentic workflows with built-in tool calling and parallel tool execution. It uses the standard Parasail gateway.

## Endpoint

```
POST https://api.parasail.io/v1/responses
```

## Authentication

Use your existing Parasail API key—the same key as for all other Parasail endpoints. Pass it in the `Authorization` header:

```
Authorization: Bearer <PARASAIL_API_KEY>
```

See [Authentication](/parasail-docs/api-reference/authentication) for how to create and store a key.

## Key requirements

* **`store` must be `false`** on every request—server-side state storage is not supported.
* **No `previous_response_id`**—pass the full conversation history in `input` each turn.
* **Same API key** as all other Parasail endpoints.
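
Because there is no server-side state, the client owns the conversation history. The sketch below shows one way to rebuild the request body each turn for a text-only conversation; `build_payload` is a hypothetical helper, and the assistant reply is a placeholder for whatever text the previous call returned. Conversations that involve tool calls would also need the tool-call items carried forward, which this sketch does not cover.

```python
def build_payload(model: str, history: list, user_message: str) -> dict:
    """Return a /v1/responses request body carrying the whole conversation."""
    turns = history + [{"role": "user", "content": user_message}]
    return {"model": model, "store": False, "input": turns}


history = []
first = build_payload("parasail-kimi-k25-elicit", history, "What is TCP?")

# After the call returns, record both sides of the turn before the next one.
history = first["input"] + [{"role": "assistant", "content": "<model reply>"}]
second = build_payload("parasail-kimi-k25-elicit", history, "And UDP?")
print([turn["role"] for turn in second["input"]])
```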

## Request

### Python (OpenAI SDK)

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.parasail.io/v1",
    api_key=os.environ["PARASAIL_API_KEY"],
)

response = client.responses.create(
    model="parasail-kimi-k25-elicit",
    store=False,
    input="Explain the difference between TCP and UDP in two sentences."
)

print(response.output_text)
```

### cURL

```bash
curl https://api.parasail.io/v1/responses \
  -H "Authorization: Bearer $PARASAIL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "parasail-kimi-k25-elicit",
    "store": false,
    "input": "Explain the difference between TCP and UDP in two sentences."
  }'
```

## Request parameters

| Parameter     | Type            | Default  | Description                                                           |
| ------------- | --------------- | -------- | --------------------------------------------------------------------- |
| `model`       | string          | required | Model ID (for example, `parasail-kimi-k25-elicit`)                    |
| `input`       | string or array | required | Prompt string, or the full message/history array for multi-turn calls |
| `store`       | boolean         | required | Must be `false`—server-side state is not supported                    |
| `tools`       | array           | None     | Tool/function definitions for function calling                        |
| `tool_choice` | string          | "auto"   | `"auto"`, `"required"`, or `"none"`                                   |
| `stream`      | boolean         | false    | Stream events as they're generated                                    |

These fields follow the OpenAI schema. For exhaustive field semantics, see the [OpenAI Responses API reference](https://platform.openai.com/docs/api-reference/responses).

## Response

A successful request returns a `response` object. The convenience field `output_text` aggregates the generated text; the full `output` array contains the structured items (messages, tool calls):

```json
{
  "id": "resp_abc123",
  "object": "response",
  "model": "parasail-kimi-k25-elicit",
  "status": "completed",
  "output": [
    {
      "type": "message",
      "role": "assistant",
      "content": [
        { "type": "output_text", "text": "TCP is connection-oriented and reliable; UDP is connectionless and faster but lossy." }
      ]
    }
  ],
  "usage": {
    "input_tokens": 18,
    "output_tokens": 22,
    "total_tokens": 40
  }
}
```

See the [OpenAI Responses object reference](https://platform.openai.com/docs/api-reference/responses/object) for every field.
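
When working with the raw JSON instead of the SDK object, the generated text can be gathered from the `output` array. The sketch below is a minimal illustration against a shortened copy of the sample above; `collect_output_text` is a hypothetical name.

```python
def collect_output_text(response: dict) -> str:
    """Concatenate every output_text block found in message items."""
    parts = []
    for item in response.get("output", []):
        if item.get("type") != "message":
            continue  # skip non-message items such as tool calls
        for block in item.get("content", []):
            if block.get("type") == "output_text":
                parts.append(block["text"])
    return "".join(parts)


sample = {
    "status": "completed",
    "output": [
        {
            "type": "message",
            "role": "assistant",
            "content": [{"type": "output_text", "text": "TCP is reliable; UDP is faster."}],
        }
    ],
}
print(collect_output_text(sample))
```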

## Streaming

Set `stream: true` to receive the response as a stream of server-sent events (for example `response.output_text.delta` events carrying incremental text, ending with a completion event). See the [Responses API product page](/parasail-docs/products/overview/responses-api) for streaming and agentic-loop examples.

## Error responses

Errors use OpenAI-compatible status codes and a JSON body with an `error` object (`message`, `type`, `code`).

| Status Code | Description                                                                      |
| ----------- | -------------------------------------------------------------------------------- |
| `400`       | Malformed request—invalid JSON, missing required field, or `store` set to `true` |
| `401`       | Missing or invalid API key                                                       |
| `403`       | API key lacks access to the requested model or resource                          |
| `404`       | Model or endpoint not found                                                      |
| `429`       | Rate limit or quota exceeded—back off and retry                                  |
| `500`       | Internal server error—retry with exponential backoff                             |

## Compatibility summary

| Feature                                    | Supported                          |
| ------------------------------------------ | ---------------------------------- |
| Single-turn completions                    | Yes                                |
| Multi-turn conversations                   | Yes (pass full history in `input`) |
| Reasoning / thinking                       | Yes                                |
| Function / tool calling                    | Yes                                |
| Parallel tool calls                        | Yes                                |
| `store: true` (server-side state)          | No                                 |
| `previous_response_id` (stateful chaining) | No                                 |
| Streaming                                  | Yes                                |
| Web search tool                            | No                                 |
| File search tool                           | No                                 |
| Computer use tool                          | No                                 |

## Next steps

* [Responses API product page](/parasail-docs/products/overview/responses-api)—multi-turn, function calling, agentic loops, and the full compatibility table
* [Agents and tool calling use case](/parasail-docs/use-cases/agents-tool-calling)—building agentic workflows
* [Chat completions API](/parasail-docs/api-reference/chat-completions)—endpoint compatibility matrix and base URL
* [Authentication](/parasail-docs/api-reference/authentication)—creating and using API keys


# Embeddings

Embeddings API for generating vector representations of text using open-source models.

Generate vector embeddings for text using open-source embedding models. The API is OpenAI-compatible—use the same `client.embeddings.create` interface.

## Endpoint

```
POST https://api.parasail.io/v1/embeddings
```

## Authentication

Pass your API key in the `Authorization` header:

```
Authorization: Bearer <PARASAIL_API_KEY>
```

The OpenAI SDK reads this from the `api_key` argument. See [Authentication](/parasail-docs/api-reference/authentication) for how to create and store a key.

## Request

### Python (OpenAI SDK)

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.parasail.io/v1",
    api_key=os.environ["PARASAIL_API_KEY"],
)

response = client.embeddings.create(
    model="parasail-ai/GritLM-7B-vllm",
    input="Parasail provides affordable cloud GPUs for AI inference.",
    encoding_format="float",
)

print(response.data[0].embedding[:5])
```

### cURL

```bash
curl https://api.parasail.io/v1/embeddings \
  -H "Authorization: Bearer $PARASAIL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "parasail-ai/GritLM-7B-vllm",
    "input": "Parasail provides affordable cloud GPUs for AI inference.",
    "encoding_format": "float"
  }'
```

## Parameters

| Parameter         | Type            | Default  | Description                                                          |
| ----------------- | --------------- | -------- | -------------------------------------------------------------------- |
| `model`           | string          | required | Embedding model ID (for example, `parasail-ai/GritLM-7B-vllm`)       |
| `input`           | string or array | required | Text to embed. Pass an array to embed multiple inputs in one request |
| `encoding_format` | string          | "float"  | `"float"` or `"base64"` (recommended for smaller payloads)           |

These fields follow the OpenAI schema. For exhaustive field semantics, see the [OpenAI embeddings reference](https://platform.openai.com/docs/api-reference/embeddings/create).

## Response

A successful request returns a list of embedding objects. With one input string, `data` contains a single embedding:

```json
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.0023, -0.0091, 0.0145, "..."]
    }
  ],
  "model": "parasail-ai/GritLM-7B-vllm",
  "usage": {
    "prompt_tokens": 12,
    "total_tokens": 12
  }
}
```

| Field              | Description                                                                      |
| ------------------ | -------------------------------------------------------------------------------- |
| `data[].embedding` | The vector (array of floats, or a base64 string when `encoding_format="base64"`) |
| `data[].index`     | Position of this embedding in the input array                                    |
| `usage`            | Prompt and total token counts                                                    |

When `input` is an array, `data` contains one object per input, ordered by `index`.
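
A minimal sketch of pairing inputs with their vectors by `index` follows. It works on the raw response dictionary, handles both encodings, and assumes Base64 payloads hold little-endian 32-bit floats; `embeddings_by_input` is a hypothetical helper name.

```python
import base64
import struct


def embeddings_by_input(inputs, response):
    """Pair each input string with its vector using data[].index."""
    vectors = {}
    for item in response["data"]:
        embedding = item["embedding"]
        if isinstance(embedding, str):
            # Assumption: Base64 payloads are little-endian float32 values.
            raw = base64.b64decode(embedding)
            embedding = list(struct.unpack(f"<{len(raw) // 4}f", raw))
        vectors[item["index"]] = embedding
    return [(text, vectors[i]) for i, text in enumerate(inputs)]
```

Looking vectors up by `index` rather than by list position keeps the pairing correct even if the `data` array is processed out of order.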

## Available embedding models

| Model                 | HuggingFace ID                      |
| --------------------- | ----------------------------------- |
| GritLM                | `parasail-ai/GritLM-7B-vllm`        |
| GTE-Qwen2-7B-Instruct | `Alibaba-NLP/gte-Qwen2-7B-instruct` |

Any embedding model on HuggingFace can be used with dedicated or batch endpoints.

## Error responses

Errors use OpenAI-compatible status codes and a JSON body with an `error` object (`message`, `type`, `code`).

| Status Code | Description                                                                     |
| ----------- | ------------------------------------------------------------------------------- |
| `400`       | Malformed request—invalid JSON, missing `input`, or unsupported parameter value |
| `401`       | Missing or invalid API key                                                      |
| `403`       | API key lacks access to the requested model                                     |
| `404`       | Model or endpoint not found                                                     |
| `429`       | Rate limit or quota exceeded—back off and retry                                 |
| `500`       | Internal server error—retry with exponential backoff                            |

## Tips

* Use `encoding_format="base64"` to reduce output file sizes, especially for batch processing.
* For large-scale embedding generation, use [batch processing](/parasail-docs/products/quickstart) at 50% off serverless pricing.

## Next steps

* [RAG and embeddings use case](/parasail-docs/use-cases/rag-embeddings)—RAG pipeline guide
* [Batch embedding models](/parasail-docs/products/quickstart#batch-embedding-models)—batch embedding generation
* [Run and evaluate any model](/parasail-docs/guides/run-and-evaluate-any-model)—embedding model evaluation
* [Chat completions API](/parasail-docs/api-reference/chat-completions)—endpoint compatibility matrix and base URL


# Batch API

Batch API and billing API reference for Parasail. OpenAI-compatible batch processing and programmatic billing access.

## Batch API

Parasail's batch API is a direct drop-in replacement for OpenAI's batch API. The same documentation, guides, and client libraries apply.

### Base URL

```
https://api.parasail.io/v1
```

### Client setup

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.parasail.io/v1",
    api_key=os.environ["PARASAIL_API_KEY"],
)
```

### Key differences from OpenAI

* Use the Parasail URL instead of OpenAI's.
* Use your Parasail API key ([create one here](https://www.saas.parasail.io/keys)).
* Use an open-source model from HuggingFace instead of an OpenAI model.
* Up to **50,000 requests** and **1000 MB** per input file (vs OpenAI's 200 MB).
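
It can be useful to check an input file against these limits before uploading. The sketch below is a local pre-flight check, not a Parasail API; `check_batch_file` is a hypothetical name, and it assumes 1 MB means 1,000,000 bytes, which may differ from how the service measures size.

```python
import os

MAX_REQUESTS = 50_000
MAX_BYTES = 1000 * 1000 * 1000  # assumption: 1 MB = 1,000,000 bytes


def check_batch_file(path: str) -> tuple:
    """Return (request_count, size_in_bytes), raising if a limit is exceeded."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        # Each non-blank line is one request.
        count = sum(1 for line in f if line.strip())
    if count > MAX_REQUESTS:
        raise ValueError(f"{count} requests exceeds the {MAX_REQUESTS} limit")
    if size > MAX_BYTES:
        raise ValueError(f"{size} bytes exceeds the {MAX_BYTES} byte limit")
    return count, size
```

Files that exceed either limit can be split into several batches and submitted separately.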

### OpenAI API reference

* [OpenAI Batch API reference](https://platform.openai.com/docs/api-reference/batch)
* [OpenAI Files API reference](https://platform.openai.com/docs/api-reference/files)

### Batch helper library

Parasail provides a Python batch helper library that simplifies batch submission and monitoring:

```bash
pip install openai-batch
```

See the [openai-batch GitHub repository](https://github.com/parasail-ai/openai-batch) and [batch full guide](/parasail-docs/products/quickstart) for details.

### Batch file format

The batch input file is a JSONL file where each line is a request. The format is 100% compatible with OpenAI. See [Batch file format](/parasail-docs/products/quickstart/file-format).
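
As a rough illustration of the format, the sketch below builds one JSONL line per prompt for the chat completions endpoint. The helper name `build_batch_lines` and the `request-{i}` ID scheme are assumptions made for this example; the `custom_id`, `method`, `url`, and `body` fields follow the OpenAI batch input format.

```python
import json
from typing import List


def build_batch_lines(prompts: List[str], model: str) -> List[str]:
    """Return one JSON-encoded batch request per prompt."""
    lines = []
    for i, prompt in enumerate(prompts):
        request = {
            # custom_id lets you match each output back to its input.
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        lines.append(json.dumps(request))
    return lines


lines = build_batch_lines(
    ["Hello", "What is 2 + 2?"],
    "NousResearch/DeepHermes-3-Mistral-24B-Preview",
)
# Each element becomes one line of the .jsonl input file.
print("\n".join(lines))
```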

***

## Billing API

The Parasail Billing API allows you to programmatically access your invoices, including real-time month-to-date spend and hourly or daily usage breakdowns.

### Authentication

```
Authorization: Bearer psk-<accessKey>-<secretKey>
```

Generate an API key from **Settings > API Keys** in the Parasail dashboard.

### Endpoints

#### Get Current Invoice

```
GET /api/v1/billing/invoices/current
```

Returns the active draft invoice for the current billing period. Updated in real time as usage is reported.

#### List Invoice Breakdowns

```
GET /api/v1/billing/invoices/breakdowns
```

Returns granular breakdowns for a time range.

| Parameter       | Required | Description                      |
| --------------- | -------- | -------------------------------- |
| `starting_on`   | Yes      | Start timestamp (RFC 3339)       |
| `ending_before` | Yes      | End timestamp (RFC 3339)         |
| `window_size`   | No       | `hour` or `day` (default: `day`) |
| `status`        | No       | `DRAFT`, `FINALIZED`, or `VOID`  |
| `limit`         | No       | Max breakdown windows returned   |
| `next_page`     | No       | Pagination cursor                |
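
The parameters above can be assembled into a request path as in the sketch below. The helper name `breakdowns_path` is hypothetical, and the host is deliberately left out because this page does not state it. Timestamps are produced with `datetime.isoformat()`, which yields an RFC 3339 compatible string for timezone-aware values.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse


def breakdowns_path(
    starting_on: datetime,
    ending_before: datetime,
    window_size: str = "day",
    limit: Optional[int] = None,
) -> str:
    """Build the path and query string for the breakdowns endpoint."""
    if starting_on.tzinfo is None or ending_before.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    if window_size not in ("hour", "day"):
        raise ValueError("window_size must be 'hour' or 'day'")
    params = {
        "starting_on": starting_on.isoformat(),
        "ending_before": ending_before.isoformat(),
        "window_size": window_size,
    }
    if limit is not None:
        params["limit"] = str(limit)
    # urlencode percent-escapes ':' and '+' in the timestamps.
    return "/api/v1/billing/invoices/breakdowns?" + urlencode(params)


path = breakdowns_path(
    datetime(2025, 1, 1, tzinfo=timezone.utc),
    datetime(2025, 2, 1, tzinfo=timezone.utc),
    window_size="hour",
    limit=24,
)
query = parse_qs(urlparse(path).query)
print(path)
```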

#### List Invoices

```
GET /api/v1/billing/invoices
```

Returns all invoices matching the specified filters.

#### Get Invoice by ID

```
GET /api/v1/billing/invoices/{invoice_id}
```

### Full billing API reference

For complete request/response examples, the invoice object schema, line item fields, and error responses, see the [Billing API page](/parasail-docs/billing/billing-api).

### Error responses

| Status Code | Description                                 |
| ----------- | ------------------------------------------- |
| `401`       | Missing or invalid API key                  |
| `403`       | Account not found or billing not configured |
| `404`       | Invoice not found                           |
| `502`       | Upstream billing service error              |


# Models Endpoint

List available models on the Parasail platform using the /v1/models endpoint.

List all available models on the Parasail platform. This endpoint works on the standard gateway and serves both the chat completions and Responses APIs.

## Endpoint

```
GET https://api.parasail.io/v1/models
```

## Authentication

Pass your API key in the `Authorization` header:

```
Authorization: Bearer <PARASAIL_API_KEY>
```

See [Authentication](/parasail-docs/api-reference/authentication) for how to create and store a key.

## Request

### cURL

```bash
curl https://api.parasail.io/v1/models \
  -H "Authorization: Bearer $PARASAIL_API_KEY"
```

### Python (OpenAI SDK)

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.parasail.io/v1",
    api_key="<PARASAIL_API_KEY>"
)

models = client.models.list()
for model in models.data:
    print(model.id)
```

## Response

A successful request returns a list of model objects:

```json
{
  "object": "list",
  "data": [
    {
      "id": "parasail-deepseek-r1",
      "object": "model",
      "created": 1727900000,
      "owned_by": "parasail"
    }
  ]
}
```

| Field             | Description                                   |
| ----------------- | --------------------------------------------- |
| `data[].id`       | Model ID to pass as `model` in other requests |
| `data[].object`   | Always `model`                                |
| `data[].owned_by` | Owner of the model                            |

The list is returned in full in a single response—this endpoint is not paginated. See the [OpenAI models reference](https://platform.openai.com/docs/api-reference/models) for field semantics.

## Model types

Parasail supports several types of models:

* **Serverless models**—prefixed with `parasail-` (for example, `parasail-deepseek-r1`). Available on the serverless tier with token-based pricing.
* **Dedicated deployments**—use the deployment name as the model ID. Available on dedicated instances.
* **HuggingFace models**—any HuggingFace transformer can be used with batch or dedicated endpoints by specifying the full HuggingFace ID (for example, `NousResearch/DeepHermes-3-Mistral-24B-Preview`).
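
Given the `parasail-` prefix convention above, a response from `/v1/models` can be filtered down to serverless models as sketched below. The function name and the second model ID in the sample are hypothetical, used only to show the filter.

```python
from typing import Any, Dict, List


def serverless_model_ids(models_response: Dict[str, Any]) -> List[str]:
    """Return the IDs that follow the `parasail-` serverless prefix."""
    return [
        model["id"]
        for model in models_response.get("data", [])
        if model["id"].startswith("parasail-")
    ]


# Sample payload shaped like the response shown earlier on this page.
sample_response = {
    "object": "list",
    "data": [
        {"id": "parasail-deepseek-r1", "object": "model", "owned_by": "parasail"},
        # Hypothetical dedicated deployment name, for illustration only.
        {"id": "my-dedicated-deployment", "object": "model", "owned_by": "me"},
    ],
}
serverless_ids = serverless_model_ids(sample_response)
print(serverless_ids)
```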

## Error responses

Errors use OpenAI-compatible status codes and a JSON body with an `error` object (`message`, `type`, `code`).

| Status Code | Description                                          |
| ----------- | ---------------------------------------------------- |
| `401`       | Missing or invalid API key                           |
| `403`       | API key lacks access to list models                  |
| `429`       | Rate limit or quota exceeded—back off and retry      |
| `500`       | Internal server error—retry with exponential backoff |
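
The retry guidance for `429` and `500` can be sketched as a small exponential backoff wrapper. `ApiError` and `call_with_backoff` are hypothetical names for this example; in practice you would catch whatever exception your HTTP client raises. The OpenAI Python SDK also has its own `max_retries` client option, so a wrapper like this is mainly useful with other clients.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Status codes the table above marks as retryable.
RETRYABLE_STATUS_CODES = {429, 500}


class ApiError(Exception):
    """Hypothetical error type that carries an HTTP status code."""

    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying retryable errors with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except ApiError as err:
            is_last = attempt == max_attempts - 1
            if err.status_code not in RETRYABLE_STATUS_CODES or is_last:
                raise
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
            # A little random jitter avoids synchronized retries.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("unreachable")


# Demo: a call that is rate limited twice, then succeeds.
attempts = {"count": 0}


def flaky_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ApiError(429)
    return "ok"


result = call_with_backoff(flaky_call, sleep=lambda seconds: None)
print(result, attempts["count"])
```

Errors such as `401` and `403` are raised immediately, since retrying will not fix a bad or unauthorized key.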

## Next steps

* [Chat completions API](/parasail-docs/api-reference/chat-completions)—use a model ID to run inference
* [Serverless overview](/parasail-docs/products/overview)—list of available serverless models
* [Model recommendations](/parasail-docs/guides/model-recommendations)—recommended models by use case
* [Pricing](/parasail-docs/billing/pricing)—pricing by model parameter count


# Parameters

Sampling parameters for the Parasail chat completions and text completions APIs.

Parasail supports all parameters that vLLM supports. These parameters control the randomness, diversity, and length of model outputs.

## Sampling parameters

### `temperature` (default: 1.0)

Controls randomness in token selection.

* Higher values (>1.0) increase randomness.
* Lower values (<1.0) make outputs more deterministic.
* A value of 0 forces greedy decoding.

### `top_p` (Nucleus Sampling) (default: 1.0)

Controls the probability mass of token selection.

* Only considers the smallest set of most likely tokens whose cumulative probability reaches `top_p`.
* Lower values (for example, 0.9) limit token choices to more likely options.
* Higher values (close to 1.0) allow a broader range of tokens.

### `top_k` (default: -1, disabled)

Limits token selection to the top `k` most probable tokens.

* Lower values (for example, `top_k=50`) make output more deterministic.
* If `-1`, this setting is ignored.
* Since `top_k` is not in the OpenAI spec, pass it in the `extra_body` field.

### `max_tokens` (default: None)

Sets the maximum number of tokens to generate. Helps prevent excessively long responses.

### `repetition_penalty` (default: 1.0)

Penalizes repeated tokens to avoid looping responses. Common values: 1.1 to 1.2. Like `top_k`, this parameter is not in the OpenAI spec, so pass it in the `extra_body` field.

### `presence_penalty` (default: 0.0)

Increases the likelihood of introducing new tokens. Useful for making outputs more diverse.

### `frequency_penalty` (default: 0.0)

Penalizes tokens that have appeared frequently. Helps prevent excessive repetition of common words.

### `seed` (default: None)

Sets a fixed seed for reproducible results. Useful for debugging or deterministic sampling.

## How parameters work together

* Setting `temperature=0` forces deterministic output (greedy decoding).
* Using `top_p` and `top_k` together balances diversity and coherence.
* `repetition_penalty` and `presence_penalty` help avoid repetitive loops.
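
To make the interaction concrete, the sketch below applies temperature, `top_k`, and `top_p` to a toy set of logits. It is a simplified illustration of the general technique, not the inference engine's actual implementation; the function name and the example logits are made up.

```python
import math
from typing import Dict


def filter_distribution(
    logits: Dict[str, float],
    temperature: float = 1.0,
    top_k: int = -1,
    top_p: float = 1.0,
) -> Dict[str, float]:
    """Return the token probabilities that remain after filtering."""
    if temperature <= 0:
        # Greedy decoding: all probability goes to the most likely token.
        best = max(logits, key=logits.get)
        return {best: 1.0}

    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

    # top_k: keep only the k most probable tokens, then renormalize.
    if top_k > 0:
        ranked = ranked[:top_k]
        norm = sum(p for _, p in ranked)
        ranked = [(tok, p / norm) for tok, p in ranked]

    # top_p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept = []
    cumulative = 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}


toy_logits = {"the": 2.0, "a": 1.0, "robot": 0.5, "zebra": -1.0}
print(filter_distribution(toy_logits, top_k=2))
print(filter_distribution(toy_logits, top_p=0.5))
print(filter_distribution(toy_logits, temperature=0))
```

In this toy example a low `top_p` leaves only the single most likely token, which is why very small `top_p` values behave almost like greedy decoding.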

## Example

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.parasail.io/v1",
    api_key="<PARASAIL_API_KEY>"
)

response = client.chat.completions.create(
    model="parasail-deepseek-r1",
    messages=[{"role": "user", "content": "Write a creative story about a robot."}],
    max_completion_tokens=1000,
    temperature=0.7,
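    top_p=0.1,
    # top_k and repetition_penalty are not in the OpenAI spec, so the SDK
    # only accepts them through extra_body.
    extra_body={"top_k": 50, "repetition_penalty": 1.1},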
)

print(response.choices[0].message.content)
```

## Next steps

* [Chat completions API](/parasail-docs/api-reference/chat-completions)—full chat completions API spec and endpoint compatibility matrix
* [Serverless parameters page](/parasail-docs/products/overview/parameters)—original parameter reference
* [Structured output guide](/parasail-docs/guides/structured-output)—constrain outputs to JSON
* [Responses API](/parasail-docs/api-reference/responses-api)—parameters for agentic, multi-turn calls


# Run and Evaluate Any Model

Choose between the Serverless, Dedicated, and Batch tiers to try out and evaluate any model on Parasail.

At Parasail, we aim to make it as easy as possible to find the right model for your job. Our [Serverless](https://www.saas.parasail.io/serverless) tier is the quickest way to try out a wide range of popular chat, instruct, and multimodal models, with sizes ranging from 7B to the highly capable DeepSeek V3 at 671B parameters. Serverless is not always sufficient, though, so we built the Dedicated and Batch tiers for those cases:

**Trying out new models:** If a model doesn't exist in serverless (maybe it's an uncommon model on HuggingFace, or maybe you trained or fine-tuned it yourself), you can easily spin it up in a [Dedicated Endpoint](/parasail-docs/products/overview-1). We support most transformers on HuggingFace, both public [and private](/parasail-docs/products/overview-1/private-hf-models), and we have the lowest cost on-demand GPUs on the market, from 4090s up to H200s.

**Large-scale Evaluations:** Effective and automatic LLM evaluation is critical to building a quality product. Our [batch processing service](/parasail-docs/products/quickstart#batch-embedding-models-1) is ideal for evals that require a large number of prompts, images, or text samples. The price is 50% off our serverless endpoints, and prompt caching provides another 50% discount. We also support most transformers on HuggingFace with a single-line change in the code, which makes it easy to evaluate many models.

**Embeddings:** We support a range of embedding models that rival proprietary models such as OpenAI and Voyage, including:

* [parasail-ai/GritLM-7B-vllm](https://huggingface.co/parasail-ai/GritLM-7B-vllm)
* [Alibaba-NLP/gte-Qwen2-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct)

These embedding models, and any others on HuggingFace, can be easily run using our [batch processing service](/parasail-docs/products/quickstart#batch-embedding-models-1).

### **Checking Model Compatibility**

The Dedicated UI can be used to verify whether a model is supported by our inference engines. Paste the model's HuggingFace URL into the model entry page, and a message will appear indicating whether it is supported.

<table data-card-size="large" data-view="cards"><thead><tr><th></th><th></th><th data-hidden data-card-cover data-type="files"></th></tr></thead><tbody><tr><td><img src="/files/uZkGtiBJjEUsGAcmIPoZ" alt="" data-size="original"></td><td>Model is supported</td><td></td></tr><tr><td><img src="/files/mzHpxqOjrxD8JFIP5GHB" alt=""></td><td>"Model cannot be found" indicates the model is not supported</td><td></td></tr></tbody></table>

### **Private or Custom Models and LoRAs**

Fine-tuned models and LoRA adapters can be easily hosted on Parasail's dedicated tier by storing them in a public or private repo on HuggingFace. For public models, simply paste the URL into the dedicated page. For guidelines on hosting private models, including generating the access token, please see this section:

{% content-ref url="/pages/0tnt4xWOMBgre4UPkN0j" %}
[Private HuggingFace Models](/parasail-docs/products/overview-1/private-hf-models)
{% endcontent-ref %}

### **Batch Processing of Private Models**

Private models can also be processed in batch mode at a 50% discount to the equivalent serverless pricing of the model. For information on how to set this up, please see this section:

{% content-ref url="/pages/g08bVP5XvUsIWEMEu62Z" %}
[Batch Processing with Private Models](/parasail-docs/products/quickstart/private-models)
{% endcontent-ref %}

## Next steps

* [Serverless overview](/parasail-docs/products/overview)
* [Dedicated Instances overview](/parasail-docs/products/overview-1)
* [Batch Quick start](/parasail-docs/products/quickstart)
* [Private Hugging Face Models](/parasail-docs/products/overview-1/private-hf-models)
* [Model recommendations](/parasail-docs/guides/model-recommendations)


# Chat Completions

Send chat messages to instruct-tuned models with Parasail's OpenAI-compatible Chat Completions API.

A Chat Completion is the typical method to interact with an *instruct-tuned* large language model. The user sends chat messages to the model, and the model responds with a message of its own. The chat may continue: the user may append another message to the list, and the model responds with the full context of the entire chat.

Let's start with an example.

## Example Chat Completion

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.parasail.io/v1",
    api_key=os.environ["PARASAIL_API_KEY"],
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of New York?"},
]

chat_completion = client.chat.completions.create(
    model="parasail-llama-33-70b-fp8",
    messages=messages,
    max_completion_tokens=1024,
)

response_message = chat_completion.choices[0].message
print(
    {
        "role": response_message.role,
        "content": response_message.content,
    }
)
```

Response message:

```python
{'role': 'assistant', 'content': 'The capital of New York is Albany.'}
```

## Messages

At its core, chat completions are based on messages. Each message has a role and content. The role defines who wrote the message: the *user*, the *assistant* (that is, the model), or the *system* prompt. The *content* is the user text or model output.

Roles are defined by the model. The ones shown here are typical, such as those used by the Llama 3 family. Some models have the same concepts under a different name. Check the model card or the model's chat template if you're having issues.
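
Continuing a chat means resending the whole message list with the new turns appended. A minimal sketch of that bookkeeping is shown below; `continue_chat` is a hypothetical helper, and the follow-up question is made up for illustration.

```python
from typing import Dict, List

Message = Dict[str, str]


def continue_chat(
    history: List[Message], assistant_reply: str, next_user_message: str
) -> List[Message]:
    """Return a new message list with the model's reply and the next user turn."""
    return history + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": next_user_message},
    ]


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of New York?"},
]
next_request = continue_chat(
    history,
    "The capital of New York is Albany.",
    "How far is it from New York City?",
)
# Pass next_request as `messages` in the next chat completion call.
for message in next_request:
    print(message["role"], "->", message["content"])
```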

## Models: Base vs Instruct

Not all models are intended to work with chat completions. The notable distinction is *Base* vs *Instruct-tuned* models. Base models contain the raw knowledge of the model but are not tuned to respond in a chat fashion. To support chat, the base model is additionally tuned to respond to chats; we call the result the *Instruct* variant. Usually this is clear in the model name, such as `meta-llama/Llama-3.3-70B` vs `meta-llama/Llama-3.3-70B-Instruct`. Note the `Instruct` in the name, though some models use `chat`.

On the technical level, an instruct model will include a *chat template*, while a base model does not.

## What is a Chat Template?

LLMs fundamentally work on a token stream. The user provides input tokens (here in the form of text), and the model generates output tokens (here also presented as text). This is a basic *completion*. Notice this does not mention messages.

The way a model converts the input messages into the raw input tokens needed by the model is by applying a *chat template*. Models typically use the popular Jinja2 template system, and the chat template is simply a Jinja2 template that converts the messages to a string of text.

The chat template is model-specific. It must match the format of the training data used to train the model. The user is able to make some tweaks, but one cannot assume that the template used for one model will work with another.

The simplest example is one that simply concatenates the messages together:

```django
{% for message in messages %}{{ message.content }}{% endfor %}
```

However, real chat templates tend to include special handling for `"role": "system"` messages, and they include special tags or tokens to identify message starts and stops. These tags or tokens will match the format of the training data used by the model author.
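
The effect of such tags can be sketched in plain Python, standing in for a Jinja2 template. The `<|role|>` and `<|end|>` markers below are invented for illustration and do not match any particular model; always use the template shipped with the model you are running.

```python
from typing import Dict, List


def render_chat(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Flatten chat messages into a single prompt string.

    The tag format here is hypothetical; real models define their own.
    """
    parts = []
    for message in messages:
        # Mark where each message starts (with its role) and where it stops.
        parts.append(f"<|{message['role']}|>\n{message['content']}<|end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues as the assistant.
        parts.append("<|assistant|>\n")
    return "".join(parts)


prompt = render_chat(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of New York?"},
    ]
)
print(prompt)
```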

To explore further, the chat template is usually part of the model's tokenizer definition, often in the file `tokenizer_config.json`.

## Next steps

* [Structured output guide](/parasail-docs/guides/structured-output)
* [Tool and function calling guide](/parasail-docs/guides/tool-function-calling)
* [Chat Completions API reference](/parasail-docs/api-reference/chat-completions)
* [Chat and text generation use case](/parasail-docs/use-cases/chat-text-generation)
* [Model recommendations](/parasail-docs/guides/model-recommendations)


# RAG

Build Retrieval-Augmented Generation systems that combine embeddings, vector retrieval, and LLM generation on Parasail.

## The Evolution of AI Text Generation

Retrieval-Augmented Generation (RAG) has emerged as a transformative approach in AI, addressing key limitations of traditional large language models (LLMs). By combining the generative capabilities of LLMs with dynamic access to external knowledge sources, RAG systems deliver more accurate, contextual, and up-to-date responses. This advancement has particular significance in domains requiring specialized knowledge or real-time information, from technical documentation to financial analysis.

## Understanding Basic RAG Architecture

At its core, RAG operates through a three-stage process:

1. Document vectorization: Converting chunked documents into embeddings stored in a vector database
2. Similarity-based retrieval: Matching query vectors with the most relevant document vectors using metrics like cosine similarity
3. Context-enhanced generation: Combining the original query with retrieved content to generate more informed responses

The effectiveness of a RAG system hinges on both retrieval accuracy and generation quality. While traditional metrics like precision, recall, and F1 scores evaluate retrieval performance, techniques such as LLM-based evaluation and similarity scoring assess the quality of generated responses. In this document, we focus on how to create and evaluate a vector DB on given datasets.
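
The retrieval stage can be illustrated with a few lines of plain Python. This is a toy sketch with made-up two-dimensional vectors; real embeddings have thousands of dimensions and are usually handled with NumPy or a vector database.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(
    query_vec: Sequence[float], doc_vecs: List[Sequence[float]], top_k: int
) -> List[Tuple[int, float]]:
    """Return (document index, similarity) pairs for the top_k best matches."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy document vectors and a query that points mostly along the first axis.
doc_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
matches = retrieve([1.0, 0.1], doc_vectors, top_k=2)
print(matches)
```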

## Document vectorization

You use an embedding model to vectorize a document. If the document is large, you may want to chunk the document using various chunking techniques, especially if the document size is larger than the maximum context length supported by the embedding model.
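
One simple chunking technique is a fixed-size window with overlap. The sketch below splits on whitespace and counts words, which only approximates the model's token limit; a production pipeline would count tokens with the embedding model's tokenizer. The function name and sizes are illustrative.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, overlap: int = 0) -> List[str]:
    """Split text into chunks of up to chunk_size words, sharing `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


chunks = chunk_text("a b c d e f g", chunk_size=3, overlap=1)
print(chunks)
```

Overlap repeats a few words across neighboring chunks so that a sentence cut at a boundary still appears intact in at least one chunk.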

If you have a lot of documents or document chunks, it's more efficient to run them through the embedding model in a batch. You can use the Parasail Batch APIs to achieve this. The following notebook code shows the process.

We use the [Anthropic documentation dataset](https://github.com/anthropics/anthropic-cookbook/blob/main/skills/retrieval_augmented_generation/guide.ipynb) for this experiment:

```python
import json

with open("datasets/anthropic_docs.json", "r") as f:
    dataset = json.load(f)

print(f"Number of documents: {len(dataset)}")
print(f"Number of chunks: {len(dataset[0]['chunks'])}")
# The chunk printed below is the one at index 15, used here as a sample.
print(f"First chunk:\n{dataset[0]['chunks'][15]['chunk_content'][:100]}")
```

Output:

```
Number of documents: 1
Number of chunks: 232
First chunk:
Model options


Enterprise use cases often mean complex needs and edge cases. Anthropic offers a ran
```

We define a function that converts a dataset into a vector DB:

```python
from typing import Any, Dict, List


def create_vector_db(dataset: List[Dict[str, Any]], embedding_model_name: str) -> Dict[str, Any]:
    # A list of all the chunk_content items
    texts_to_embed = []
    # A list of chunk metadata
    metadata_list = []
    for doc in dataset:
        for chunk in doc["chunks"]:
            texts_to_embed.append(f"{chunk['chunk_content']}")
            metadata_list.append({
                "chunk_id": chunk["chunk_id"],
                "chunk_content": chunk["chunk_content"],
            })

    # Run embeddings on all the chunks
    embeddings, total_num_tokens = run_parasail_batch(
        texts_to_embed, embedding_model_name
    )

    vector_db = {
        "embeddings": embeddings,
        "metadata": metadata_list,
        "embedding_model_name": embedding_model_name,
    }
    print("Vector DB is created")
    print("Number of embedding vectors in DB: ", len(vector_db["embeddings"]))
    print(f"Number of embeddings tokens: {total_num_tokens}")
    return vector_db
```

We now define a function that computes embeddings for a list of texts using the Parasail Batch API:

```python
import json
import os
import time
from typing import List

from openai import OpenAI


def run_parasail_batch(texts_to_embed: List[str], embedding_model_name: str):
    client = OpenAI(
        base_url="https://api.parasail.io/v1",
        api_key=os.getenv("PARASAIL_BATCH_API_KEY"),
    )

    # Make an embedding request from each string of texts_to_embed, and store all
    # the requests in a file. The format of this file follows OpenAI batch inputs format.
    batch_input_file_name = "batch_input_file.jsonl"
    with open(batch_input_file_name, "w") as batch_input_file:
        for i, text in enumerate(texts_to_embed):
            batch_input_file.write(
                json.dumps(
                    {
                        "custom_id": f"request-{i}",
                        "method": "POST",
                        "url": "/v1/embeddings",
                        "body": {
                            "model": embedding_model_name,
                            "input": text,
                        },
                    }
                )
                + "\n"
            )

    # Upload the batch input file to Parasail.
    with open(batch_input_file_name, "rb") as f:
        batch_input_file = client.files.create(file=f, purpose="batch")

    # Create a batch job for the uploaded file.
    batch_obj = client.batches.create(
        input_file_id=batch_input_file.id,
        # Required for OpenAI API compatibility; not a Parasail completion-time guarantee.
        completion_window="24h",
        endpoint="/v1/embeddings",
    )

    # Wait for the batch job to complete.
    while batch_obj.status != "completed":
        if batch_obj.status in ["failed", "expired", "cancelling", "cancelled"]:
            raise RuntimeError(
                f"Parasail batch job failed with status: {batch_obj.status}.\nErrors: {batch_obj.errors}"
            )

        time.sleep(30)
        batch_obj = client.batches.retrieve(batch_obj.id)
    print(f"Parasail batch job {batch_obj.id} completed successfully")

    # Get the output of the batch job.
    batch_output = client.files.content(batch_obj.output_file_id).content

    # The order of results in batch_output is not guaranteed to match the order of requests
    # in batch_input_file, thus we store outputs in a dictionary using custom_id as key.
    batch_output_dict = {}
    for line in batch_output.splitlines():
        line = json.loads(line)
        batch_output_dict[line["custom_id"]] = line

    # Parse the batch output to get the embeddings and the total number of tokens.
    embeddings = []
    total_num_tokens = 0
    for i in range(len(texts_to_embed)):
        response_body = batch_output_dict[f"request-{i}"]["response"]["body"]
        embed = response_body["data"][0]["embedding"]
        embeddings.append(embed)

        num_tokens = response_body["usage"]["prompt_tokens"]
        total_num_tokens += num_tokens

    return embeddings, total_num_tokens
```

We can now create a vector DB from our dataset using an open-source embedding model.

```python
vector_db = create_vector_db(dataset, "Alibaba-NLP/gte-Qwen2-7B-instruct")
```

Output:

```
Parasail batch job batch-aqjf7etmdt completed successfully

Vector DB is created
Number of embedding vectors in DB:  232
Number of embeddings tokens: 217826
```

## Evaluating the retrieval performance of vector DBs

Now we want to measure the quality of our vector DB using some retrieval metrics.

We use an evaluation dataset that consists of 100 query/answer pairs. The chunks that are relevant to each query are listed along with the answer.

Let's look at one of the queries and its relevant chunks.

```python
with open("datasets/anthropic_docs_eval.json", "r") as f:
    dataset = json.load(f)

print(f"Number of queries: {len(dataset)}")
print("First eval item:")
print(f"    query: {dataset[0]['query']}")
print(f"    answer: {dataset[0]['answer']}")
print(f"    chunk_ids: {dataset[0]['chunk_ids']}")
```

```
Number of queries: 100
First eval item:
    query: How can you create multiple test cases for an evaluation in the Anthropic Evaluation tool?
    answer: To create multiple test cases in the Anthropic Evaluation tool, click the 'Add Test Case' button, fill in values for each variable in your prompt, and repeat the process to create additional test case scenarios.
    chunk_ids: ['https://docs.anthropic.com/en/docs/test-and-evaluate/eval-tool#creating-test-cases', 'https://docs.anthropic.com/en/docs/build-with-claude/develop-tests#building-evals-and-test-cases']
```

We now define a function that takes the evaluation dataset and the vector DB as arguments, and returns a number of retrieval metrics. We first retrieve the most similar chunks for each query using the vector DB, and then compare the retrieved chunks with the golden chunks listed in the evaluation dataset.

```python
import numpy as np


def evaluate(dataset: List[Dict[str, Any]], vector_db: Dict[str, Any], top_k: int):
    # Collect the queries and prepend an instruction to each query.
    # The format of the instruction is from https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct
    queries = [
        f"Instruct: Represent this query for retrieving supporting documents\nQuery: {q['query']}"
        for q in dataset
    ]

    # Run embeddings on the queries.
    query_embeddings, total_num_tokens = run_parasail_batch(
        queries, vector_db["embedding_model_name"]
    )
    print(f"Number of tokens used for embedding queries: {total_num_tokens}")

    # Compute similarities between queries and vector DB embeddings.
    # Note: the dot product equals cosine similarity only if the embeddings
    # are unit-normalized.
    query_embeddings = np.array(query_embeddings)
    db_embeddings = np.array(vector_db["embeddings"])
    similarities = np.dot(query_embeddings, db_embeddings.T)
    # Sort vector DB embeddings according to similarity scores.
    top_indices = np.argsort(similarities, axis=1)[:, ::-1]

    # Find the retrieved chunks for all the queries.
    all_retrieved_chunks = [
        [
            {
                "metadata": vector_db["metadata"][idx],
                "similarity": float(similarities[i][idx]),
            }
            for idx in top_indices[i][:top_k]
        ]
        for i in range(len(queries))
    ]

    # Collect golden chunk IDs for every query.
    all_golden_chunks = [query["chunk_ids"] for query in dataset]

    # Compute evaluation metrics.
    precisions = []
    recalls = []
    mrrs = []
    for i in range(len(queries)):
        retrieved_chunks = [
            retrieved_chunk["metadata"].get("chunk_id", None)
            for retrieved_chunk in all_retrieved_chunks[i][:top_k]
        ]
        golden_chunks = set(all_golden_chunks[i])

        true_positives = len(set(retrieved_chunks) & golden_chunks)

        precision = true_positives / len(retrieved_chunks) if retrieved_chunks else 0
        precisions.append(precision)

        recall = true_positives / len(golden_chunks) if golden_chunks else 0
        recalls.append(recall)

        mrr = 0
        # Use a separate name for the rank so the outer loop index `i` is not shadowed.
        for rank, link in enumerate(retrieved_chunks, 1):
            if link in golden_chunks:
                mrr = 1 / rank
                break
        mrrs.append(mrr)

    avg_precision = sum(precisions) / len(precisions)
    avg_recall = sum(recalls) / len(recalls)
    avg_mrr = sum(mrrs) / len(mrrs)
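    # Guard against division by zero when nothing relevant was retrieved.
    denominator = avg_precision + avg_recall
    f1 = 2 * (avg_precision * avg_recall) / denominator if denominator else 0.0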
    results = {
        "Average Precision (MAP)": avg_precision,
        "Average Recall": avg_recall,
        "Average MRR": avg_mrr,
        "F1 Score": f1,
    }
    return results
```
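
To see how the per-query metrics behave, here is a small worked example for a single query. The chunk IDs are made up, and `retrieval_metrics` is a standalone helper written for this illustration using the same formulas as above.

```python
from typing import Dict, List, Set


def retrieval_metrics(retrieved: List[str], golden: Set[str]) -> Dict[str, float]:
    """Precision, recall, and reciprocal rank for one query."""
    hits = len(set(retrieved) & golden)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(golden) if golden else 0.0
    reciprocal_rank = 0.0
    for rank, chunk_id in enumerate(retrieved, 1):
        if chunk_id in golden:
            # Only the first relevant result counts toward MRR.
            reciprocal_rank = 1 / rank
            break
    return {
        "precision": precision,
        "recall": recall,
        "reciprocal_rank": reciprocal_rank,
    }


# Five retrieved chunks (top_k=5), two of which are golden.
metrics = retrieval_metrics(["c3", "c1", "c7", "c9", "c2"], {"c1", "c2"})
print(metrics)
```

With two golden chunks and `top_k=5`, precision cannot exceed 2/5 even when both golden chunks are retrieved, so a low average precision alongside a high recall is expected when each query has only a few relevant chunks.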

Let's now evaluate the vector DB.

```python
with open("datasets/anthropic_docs_eval.json", "r") as f:
    dataset = json.load(f)

results = evaluate(dataset, vector_db, top_k=5)
for k, v in results.items():
    print(f"{k}: {v:.4f}")
```

```
Parasail batch job batch-ol5qoa5vm2 completed successfully
Number of tokens used for embedding queries: 3735
Average Precision (MAP): 0.2840
Average Recall: 0.7625
Average MRR: 0.8553
F1 Score: 0.4139
```

## Next steps

* [RAG and embeddings use case](/parasail-docs/use-cases/rag-embeddings)
* [Embeddings API reference](/parasail-docs/api-reference/embeddings)
* [Batch embedding models](/parasail-docs/products/quickstart#batch-embedding-models)
* [Chat Completions guide](/parasail-docs/guides/chat-completions)
* [Model recommendations](/parasail-docs/guides/model-recommendations)


# Multi-Modal

Send images to vision-language models like Qwen2.5-VL using base64 data URLs through Parasail's OpenAI-compatible API.

**Multimodal** (or Multi-Modal) refers to systems or methods that can process or integrate **multiple types of data**, typically from different modalities such as **text, image, audio, video, and other sensory inputs**. Multi-modal systems aim to understand and generate outputs based on this diverse, merged information, mimicking human perception, where we simultaneously use different senses (for example, seeing, hearing, and reading) to interpret the world.

In machine learning and AI, multi-modal models combine data from different modalities to improve the performance of tasks like classification, generation, captioning, and more. For example, an AI model that processes both images and text could interpret and respond to visual and written information in a cohesive and context-aware manner.

## Quickstart

The simplest way to use a vision-language model is through the OpenAI-compatible API, embedding image inputs as base64-encoded data URLs.

Below is an example using the Qwen2.5-VL 72B model with Parasail's serverless API. GLM 4.5V also works with this example.

```python
!pip install openai > /dev/null

import base64
import openai
from IPython.display import Image, display

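try:
  # In Google Colab, read the key from the notebook's secrets.
  from google.colab import userdata
  API_KEY = userdata.get('PARASAIL_API_KEY')
except Exception:
  # Outside Colab (or if the secret is unavailable), fall back to the environment.
  import os
  API_KEY = os.getenv('PARASAIL_API_KEY')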

!wget https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Flower_poster_2.jpg/1540px-Flower_poster_2.jpg -O flower.jpg &> /dev/null
display(Image("flower.jpg", width=500))

def data_url(image):
    with open(image, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode("ascii")

oai_client = openai.OpenAI(base_url="https://api.parasail.io/v1", api_key=API_KEY)

messages = [{
    "role": "user",
    "content": [
        { "type": "image_url", "image_url": { "url": data_url("flower.jpg")}},
        {"type": "text", "text": "What's in the image?"},
    ],
}]

chat_completion = oai_client.chat.completions.create(
    model="parasail-qwen25-vl-72b-instruct", messages=messages)

print(chat_completion.choices[0].message.content)
```

![jpeg](/files/RmkSYAYlf7BwB8SVSBJB)

```
The image is a collage of twelve different types of flowers, each labeled with its common name and scientific name. Here is a list of the flowers:

1. St Bernard's Lily (Anthericum liliago)
2. Bermuda Buttercup (Oxalis pes-caprae)
3. Oleander (Nerium oleander)
4. Lantana (Lantana camara)
5. Scarlet Pimpernel (Anagallis arvensis)
6. Verbascum (Verbascum sinuatum)
7. Common Mallow (Malva sylvestris)
8. Spanish Oyster (Scolymus hispanicum)
9. Stork's bill (Erodium malacoides)
10. Bindweed (Convolvulus arvensis)
11. Blue Gem (Hebes × franciscana)
12. Calla Lily (Zantedeschia aethiopica)

Each flower is depicted with vibrant colors and detailed features, showcasing the diversity and beauty of the flora.
```

As the output shows, the model demonstrates good visual understanding and OCR ability.

## Vision Capabilities

### Limitations

Our public serverless multimodal models support up to 8 input images. Private dedicated deployments support 2 input images on RTX 4090s and 8 input images on A100s, H100s, and H200s. If you need a different configuration, [please contact us](https://parasail.clickup.com/forms/9011827181/p/f/8cjb4fd-3371/JDGNCNLKS6ULEJYCHA/help-form).

### Multiple Images: Document Visual Question Answering Example

To process multiple images, add multiple `image_url` items to the content list. The model refers to the images in the order they appear in the content list and in the prompt.

```python
!apt-get install -qq -y poppler-utils &> /dev/null
!pip install pdf2image &> /dev/null
import pdf2image

!wget https://arxiv.org/pdf/1706.03762 -O transformer.pdf &> /dev/null

# Convert the first four pages to JPEGs and add each one as an image_url item
image_items = []
for i, page in enumerate(pdf2image.convert_from_path("transformer.pdf", last_page=4)):
    page.save(f"page_{i}.jpg", 'JPEG')
    image_items.append({"type": "image_url", "image_url": { "url": data_url(f"page_{i}.jpg")}})

# The images come first, followed by the question about them
messages = [{
    "role": "user",
    "content": image_items + [{
        "type": "text",
        "text": "How many of encoder layers were used and where in the paper it's mentioned?"
    }]
}]

chat_completion = oai_client.chat.completions.create(
    model="parasail-qwen25-vl-72b-instruct", messages=messages)

display(Image("page_0.jpg", width=500))
print(chat_completion.choices[0].message.content)
```

![jpeg](/files/Ndw6Y3cCO07iluqVR3nb)

```
The paper mentions that the encoder is composed of a stack of \( N = 6 \) identical layers. This is specified in the section "3.1 Encoder and Decoder Stacks" where it states:

"Encoder: The encoder is composed of a stack of \( N = 6 \) identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network."
```
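
The same pattern can be wrapped in a small helper. The sketch below is illustrative rather than part of the Parasail API: `build_multi_image_message` and `MAX_SERVERLESS_IMAGES` are hypothetical names, and the check simply mirrors the 8-image serverless limit stated under Limitations. It takes raw JPEG bytes instead of file paths so that it runs without any files on disk.

```python
import base64

# Assumption: mirrors the 8-image serverless limit described under "Limitations".
MAX_SERVERLESS_IMAGES = 8


def build_multi_image_message(jpeg_blobs, question, max_images=MAX_SERVERLESS_IMAGES):
    """Build a one-message chat payload: all images first, then the question."""
    if not jpeg_blobs:
        raise ValueError("at least one image is required")
    if len(jpeg_blobs) > max_images:
        raise ValueError(f"got {len(jpeg_blobs)} images, limit is {max_images}")

    content = [
        {
            "type": "image_url",
            "image_url": {
                "url": "data:image/jpeg;base64," + base64.b64encode(blob).decode("ascii")
            },
        }
        for blob in jpeg_blobs
    ]
    # The text item goes last so the prompt can refer to the images in order.
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


# Placeholder bytes stand in for real JPEG data in this demo.
messages = build_multi_image_message([b"page-0", b"page-1"], "Compare the two pages.")
print(len(messages[0]["content"]))  # 3 items: two images and one text prompt
```

The returned list can be passed as `messages` to `oai_client.chat.completions.create`, exactly as in the example above. For a dedicated RTX 4090 deployment you would pass `max_images=2`.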

### Video Understanding

Qwen2-VL and Qwen2.5-VL support the `video_url` content type, where the data URL is a series of base64-encoded JPEG frames separated by commas (the example below uses the `data:video/jpeg;base64` prefix). The Parasail gateway accepts up to 20 MB in a single API request, so choose the frame count and resolution to fit within this limit.

The following notebook shows how to load a video file, convert each frame from OpenCV's BGR, height-width-channel layout to the RGB, channel-first layout that Torch expects, resize the frames, and assemble the data URL.

```python
!pip install torch torchvision opencv-python &> /dev/null
from IPython.display import Video

import base64  # used by process_video to build the data URL
import os
import cv2
import gc
import math
import openai
import textwrap
import torch
import torchvision
```

```python
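FPS = 2  # Video frame sample rate (frames per second)
MIN_FRAMES = 4  # Minimum number of frames to sample
MAX_FRAMES = 110  # Maximum number of frames to sample
FRAME_FACTOR = 2  # The sampled frame count is rounded to a multiple of this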

IMAGE_FACTOR = 28  # Image patch size
VIDEO_MIN_PIXELS = 128 * 28 * 28  # Min number of pixels of a single video frame
VIDEO_MAX_PIXELS = 768 * 28 * 28  # Max number of pixels of a single video frame
VIDEO_TOTAL_PIXELS = 12000 * 28 * 28  # Max number of video pixels in a request
MAX_RATIO = 200  # Aspect ratio limit
MAX_GATEWAY_STRING_LEN = 20_000_000

def round_by_factor(number: int, factor: int) -> int:
    """Returns the closest integer to 'number' that is divisible by 'factor'."""
    return round(number / factor) * factor


def ceil_by_factor(number: int, factor: int) -> int:
    """Returns the smallest integer greater than or equal to 'number' that is divisible by 'factor'."""
    return math.ceil(number / factor) * factor


def floor_by_factor(number: int, factor: int) -> int:
    """Returns the largest integer less than or equal to 'number' that is divisible by 'factor'."""
    return math.floor(number / factor) * factor


def smart_resize(
    height: int, width: int, factor, min_pixels: int, max_pixels: int
) -> tuple[int, int]:
    """
    Rescales the image so that the following conditions are met:

    1. Both dimensions (height and width) are divisible by 'factor'.

    2. The total number of pixels is within the range ['min_pixels', 'max_pixels'].

    3. The aspect ratio of the image is maintained as closely as possible.
    """
    if max(height, width) / min(height, width) > MAX_RATIO:
        raise ValueError(
            f"absolute aspect ratio must be smaller than {MAX_RATIO}, got {max(height, width) / min(height, width)}"
        )
    h_bar = max(factor, round_by_factor(height, factor))
    w_bar = max(factor, round_by_factor(width, factor))
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = floor_by_factor(height / beta, factor)
        w_bar = floor_by_factor(width / beta, factor)
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = ceil_by_factor(height * beta, factor)
        w_bar = ceil_by_factor(width * beta, factor)
    return h_bar, w_bar


def extract_frames_smart_sampling(
    video_path,
    min_frames=MIN_FRAMES,
    max_frames=MAX_FRAMES,
    fps_target=FPS,
    frame_factor=FRAME_FACTOR,
):
    """Extract frames from a video with smart sampling based on target FPS and frame limits."""
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise ValueError(f"Could not open video file: {video_path}")

    # Get video properties
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    video_fps = cap.get(cv2.CAP_PROP_FPS)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

    print(f"Height: {height}, Width: {width}, Frames: {total_frames}, FPS: {video_fps}")

    # Calculate number of frames to extract
    num_frames = min(max(total_frames / video_fps * fps_target, min_frames), max_frames)
    num_frames = round_by_factor(num_frames, frame_factor)

    # Sample frames evenly
    indices = torch.linspace(0, total_frames - 1, num_frames).round().long().tolist()

    # Extract frames
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if ret:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

    cap.release()
    return frames, height, width, num_frames


def process_video(
    video_path,
    image_factor=IMAGE_FACTOR,
    video_min_pixels=VIDEO_MIN_PIXELS,
    video_max_pixels=VIDEO_MAX_PIXELS,
    video_total_pixels=VIDEO_TOTAL_PIXELS,
):

    # Extract frames
    print("Extracting frames")
    frames, height, width, num_frames = extract_frames_smart_sampling(video_path)

    # Calculate max pixels per frame based on total pixel budget
    max_pixels = min(
        video_max_pixels,
        max(int(video_min_pixels * 1.05), video_total_pixels / num_frames),
    )

    # Calculate target dimensions
    resized_height, resized_width = smart_resize(
        height,
        width,
        factor=image_factor,
        min_pixels=video_min_pixels,
        max_pixels=max_pixels,
    )

    print(f"Resizing {len(frames)} frames to {resized_height}x{resized_width}")

    # Convert frames to torch tensors (batch processing)
    torch_frames = torch.stack(
        [torch.from_numpy(frame).permute(2, 0, 1).float() / 255.0 for frame in frames]
    )

    # Resize all frames at once
    resized_frames = torchvision.transforms.functional.resize(
        torch_frames,
        [resized_height, resized_width],
        interpolation=torchvision.transforms.InterpolationMode.BICUBIC,
        antialias=True,
    )

    # Convert to uint8 and encode each frame as JPEG (the "PNG" in the message below is a misnomer)
    print("Encoding as PNG")
    resized_frames = (resized_frames * 255).to(torch.uint8)
    encoded_frames = [
        bytes(torchvision.io.encode_jpeg(frame)) for frame in resized_frames
    ]

    # Build data URL, stopping if it gets too large
    print("Building data URL")
    video_url = "data:video/jpeg;base64"
    for frame in encoded_frames:
        encoded = "," + base64.b64encode(frame).decode("utf-8")
        if len(video_url) + len(encoded) > MAX_GATEWAY_STRING_LEN:
            print(f"Data URL reached {len(video_url)} chars, truncating video")
            break
        video_url += encoded

    print(f"Done processing video. Data URL size: {len(video_url)}")
    return video_url, encoded_frames
```

```python
!wget https://upload.wikimedia.org/wikipedia/commons/2/2f/Making_snowman_in_K%C3%B5rvemaa%2C_Estonia_%28January_2022%29.webm -O snowman.webm &> /dev/null

print("Processing snowman.webm...")
Video("snowman.webm", embed=True)
video_url, frames = process_video("snowman.webm")

# Ask the model to describe the video.
chat_response = oai_client.chat.completions.create(
    model="parasail-qwen25-vl-72b-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": video_url}},
                {"type": "text", "text": "Describe the video in detail"},
            ],
        }
    ],
    temperature=1,
    top_p=0.001,
    max_tokens=4096,
    extra_body={"repetition_penalty": 1.05},
)
print("---------------------------")
print(textwrap.fill(f"output={chat_response.choices[0].message.content}"))
print("=============================")


```

```
Processing snowman.webm...
Extracting frames
Height: 1080, Width: 1920, Frames: 3962, FPS: 25.0
Resizing 110 frames to 224x420
Encoding as PNG
Building data URL
Done processing video. Data URL size: 3467916
---------------------------
output=The video begins with a serene winter scene, showcasing a snow-
covered landscape with a small house nestled among tall, snow-laden
trees. The sky is overcast, adding to the tranquil and chilly
atmosphere of the setting. A person dressed warmly in winter clothing
walks into the frame from the left side, carrying a shovel. They
approach a patch of undisturbed snow near the house and begin to scoop
up large handfuls of snow.  As they gather more snow, they start
shaping it into a ball, rolling it on the ground to make it larger and
more compact. Once the base of the snowman is sufficiently large, they
repeat the process to create a second, smaller ball for the middle
section. With both sections ready, they carefully stack the smaller
ball on top of the larger one, ensuring it is stable.  Next, the
person shapes another, even smaller ball of snow for the head. They
lift it and place it atop the middle section, completing the basic
structure of the snowman. To add details, they find some small twigs
and sticks nearby and use them as arms, sticking them out from the
sides of the middle section. For the face, they search for suitable
objects and eventually find two small stones or pebbles, which they
place as eyes on the head. A small piece of carrot is used for the
nose, and a stick is broken into a short segment to serve as the
mouth.  Finally, the person steps back to admire their creation. The
snowman stands proudly in front of the house, surrounded by the
peaceful winter scenery. The person then walks away, leaving the
snowman as a cheerful addition to the snowy landscape. The video ends
with a wide shot of the completed snowman, emphasizing its charm and
the beauty of the winter environment.
=============================
```
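
The log above reports 110 frames resized to 224x420. As a worked check, the sketch below reproduces that arithmetic with the same constants and formulas as `process_video` and `smart_resize`, and estimates how many frames fit under the gateway limit. The helper names are hypothetical, and the 30,000-byte average frame size is an assumed figure for illustration only.

```python
import math

IMAGE_FACTOR = 28
VIDEO_MIN_PIXELS = 128 * 28 * 28
VIDEO_MAX_PIXELS = 768 * 28 * 28
VIDEO_TOTAL_PIXELS = 12000 * 28 * 28
MAX_GATEWAY_STRING_LEN = 20_000_000
URL_PREFIX = "data:video/jpeg;base64"


def per_frame_pixel_budget(num_frames):
    """Pixel budget for one frame, using the same formula as process_video."""
    return min(
        VIDEO_MAX_PIXELS,
        max(int(VIDEO_MIN_PIXELS * 1.05), VIDEO_TOTAL_PIXELS / num_frames),
    )


def downscale_to_budget(height, width, max_pixels, factor=IMAGE_FACTOR):
    """The shrink branch of smart_resize: scale down, then floor to a multiple of factor."""
    beta = math.sqrt((height * width) / max_pixels)
    return (
        math.floor(height / beta / factor) * factor,
        math.floor(width / beta / factor) * factor,
    )


def base64_chars(num_bytes):
    """Length of the padded base64 text for num_bytes of binary data."""
    return 4 * math.ceil(num_bytes / 3)


def max_frames_that_fit(avg_frame_bytes, limit=MAX_GATEWAY_STRING_LEN):
    """Frames of a given average size that fit, counting one comma per frame."""
    return (limit - len(URL_PREFIX)) // (base64_chars(avg_frame_bytes) + 1)


budget = per_frame_pixel_budget(110)
print(budget)                                   # 105369
print(downscale_to_budget(1080, 1920, budget))  # (224, 420)
print(max_frames_that_fit(30_000))              # 499 (30,000 bytes is an assumed average)
```

With 110 frames, the per-frame budget is set by the `VIDEO_MIN_PIXELS * 1.05` floor rather than by the total pixel budget, which is why a 1080x1920 source shrinks to 224x420. Base64 inflates each frame by about one third, so the character count, not the raw byte count, is what should be compared against the limit.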

### Object Detection and Localization

Models such as the Qwen2.5-VL series and [PaliGemma2](https://developers.googleblog.com/en/introducing-paligemma-2-mix/) are trained for object detection and localization. Prompt the model with a description of the objects of interest, and it can return their bounding boxes in the format you specify.

Here is an example using the first page of the Transformer paper from the multi-image example:

````python
import json
import PIL.Image
import PIL.ImageDraw

# Prompt the model to output bounding boxes in JSON format.
messages = [{
    "role": "user",
    "content": [
        { "type": "image_url", "image_url": { "url": data_url("page_0.jpg")}},
        {"type": "text", "text": "Locate the abstract section, output its bbox coordinates using JSON format."},
    ],
}]

chat_completion = oai_client.chat.completions.create(
    model="parasail-qwen25-vl-72b-instruct", messages=messages)
print(chat_completion.choices[0].message.content)

# Strip surrounding whitespace and the Markdown code fence the model wraps around its JSON
json_string = chat_completion.choices[0].message.content.strip().removeprefix("```json").removesuffix("```")
data = json.loads(json_string)

img = PIL.Image.open("page_0.jpg")

for bbox in data:
  PIL.ImageDraw.Draw(img).rectangle(bbox["bbox_2d"], outline='red')

img.save("page_0_bbox.jpg")
display(Image("page_0_bbox.jpg", width=500))
````

````
```json
[
	{"bbox_2d": [396, 1152, 1314, 1607], "text_content": "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English- to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data."}
]
```
````

![jpeg](/files/TsYkiKfy8wdtjFYJl60t)
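
Model replies do not always arrive in exactly the same wrapper, so it can help to parse them defensively before drawing. The sketch below is one possible approach, not a Parasail or Qwen utility: `parse_bboxes` is a hypothetical helper, and the `bbox_2d` key is taken from the sample output above. It accepts a reply with or without a Markdown code fence and rejects boxes whose corners are out of order.

```python
import json

FENCE = "`" * 3  # three backticks, built this way so the snippet contains no literal fence


def parse_bboxes(reply):
    """Parse a model reply into a list of dicts that each carry a "bbox_2d" entry."""
    text = reply.strip()
    if text.startswith(FENCE):
        # Drop the opening fence (with or without the "json" tag) and the closing fence.
        text = text.removeprefix(FENCE + "json").removeprefix(FENCE).removesuffix(FENCE)
    boxes = json.loads(text)
    if isinstance(boxes, dict):  # tolerate a single object instead of a list
        boxes = [boxes]
    for box in boxes:
        x1, y1, x2, y2 = box["bbox_2d"]
        if not (x1 < x2 and y1 < y2):
            raise ValueError(f"degenerate box: {box['bbox_2d']}")
    return boxes


# Offline stand-in for chat_completion.choices[0].message.content
reply = FENCE + 'json\n[{"bbox_2d": [396, 1152, 1314, 1607]}]\n' + FENCE + "\n"
print(parse_bboxes(reply))
```

Each returned `bbox_2d` can then be passed to `PIL.ImageDraw.Draw(img).rectangle(...)` as in the example above.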

### More resources

* <https://github.com/QwenLM/Qwen2.5-VL/tree/main/cookbooks>
* <https://ai.google.dev/gemma/docs/capabilities/vision/prompt-with-visual-data>
* <https://github.com/bytedance/UI-TARS>
* <https://docs.mistral.ai/capabilities/vision/>

## Next steps

* [Browser and UI Automation guide](/parasail-docs/guides/browser-and-ui-automation)
* [Chat Completions API reference](/parasail-docs/api-reference/chat-completions)
* [Serverless overview](/parasail-docs/products/overview)


# Structured Output

Get reliable, schema-conformant JSON from Parasail models using guided\_json and response\_format.

## Overview

Parasail's OpenAI-compatible endpoint can return structured, machine-parseable output so you don't have to scrape JSON out of free-form text. There are two distinct ways to do this:

* **Prompt-only JSON**—you *ask* the model for JSON in the prompt. Simple, but the model can still drift (extra prose, missing keys, invalid JSON).
* **Server-enforced schemas**—you pass a schema with the request and the server constrains decoding so the output is guaranteed to match. On Parasail this is done with `guided_json` or `response_format` (JSON Schema).

Use server-enforced schemas whenever you need to parse the result programmatically.

## Prompt-only JSON vs. server-enforced schemas

| Approach                        | How                                                 | Guarantee                           |
| ------------------------------- | --------------------------------------------------- | ----------------------------------- |
| Prompt-only JSON                | Describe the desired JSON in the prompt             | None—best-effort                    |
| `guided_json`                   | Pass a JSON Schema in `extra_body`                  | Output is constrained to the schema |
| `response_format` (JSON Schema) | Pass `response_format={"type": "json_schema", ...}` | Output is constrained to the schema |

Both server-enforced approaches use the same Parasail base URL:

```
https://api.parasail.io/v1
```

## Supported models

The following models support guided/structured decoding as of June 2026. Capabilities change over time—always confirm against the [models endpoint](/parasail-docs/api-reference/models-endpoint).

| Model                                  | Guided JSON / JSON Schema | Regex | Choice |
| -------------------------------------- | ------------------------- | ----- | ------ |
| parasail-llama-33-70b-fp8              | Y                         | Y     | Y      |
| parasail-llama-4-scout-instruct        | Y                         | Y     | Y      |
| parasail-llama-4-maverick-instruct-fp8 | Y                         | Y     | Y      |
| parasail-qwen3-30b-a3b                 | Y                         |       | Y      |
| parasail-qwen3-235b-a22b               | Y                         |       | Y      |
| parasail-qwen3-32b                     | Y                         |       | Y      |
| parasail-gemma3-27b-it                 | Y                         | Y     | Y      |
| parasail-mistral-devstral-small        | Y                         | Y     | Y      |

## Server-enforced JSON with `guided_json`

Pass a JSON Schema via the OpenAI SDK's `extra_body` parameter. The server uses guided decoding to ensure the response matches your schema.

```python
import json
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.parasail.io/v1",
    api_key=os.environ["PARASAIL_API_KEY"],  # read the key from the environment
)

schema = {
    "type": "object",
    "properties": {
        "location": {"type": "string"},
        "date": {"type": "string", "format": "date"},
    },
    "required": ["location", "date"],
    "additionalProperties": False,
}

completion = client.chat.completions.create(
    model="parasail-qwen3-32b",
    messages=[
        {"role": "user", "content": "What's the weather like in Manhattan Beach on 2025-06-03?"}
    ],
    extra_body={"guided_json": schema},
)

# The content is already constrained to match the schema
data = json.loads(completion.choices[0].message.content)
print(json.dumps(data, indent=2))
```

Expected output:

```json
{
  "location": "Manhattan Beach",
  "date": "2025-06-03"
}
```

## Server-enforced JSON with `response_format`

Parasail also accepts the OpenAI `response_format` field with `type: "json_schema"`—the same shape OpenAI uses. Combine `strict: true` with `additionalProperties: false` for the tightest output.

```python
import json
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.parasail.io/v1",
    api_key=os.environ["PARASAIL_API_KEY"],  # read the key from the environment
)

schema = {
    "type": "object",
    "properties": {
        "capital": {"type": "string"},
        "country": {"type": "string"},
    },
    "required": ["capital", "country"],
    "additionalProperties": False,
}

resp = client.chat.completions.create(
    model="parasail-qwen3-32b",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "capital_lookup", "strict": True, "schema": schema},
    },
)

data = json.loads(resp.choices[0].message.content)
print(json.dumps(data, indent=2))
```

Expected output:

```json
{
  "capital": "Paris",
  "country": "France"
}
```

## Prompt-only JSON

If you don't need a hard guarantee, you can simply ask the model for JSON. This works with any model but is best-effort, so always parse the result with `json.loads()` and validate it before relying on it.

```python
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://api.parasail.io/v1",
    api_key="<PARASAIL_API_KEY>",
)

resp = client.chat.completions.create(
    model="parasail-qwen3-32b",
    messages=[
        {
            "role": "user",
            "content": (
                "Return ONLY a JSON object with keys 'location' and 'date' "
                "for the weather in Manhattan Beach on 2025-06-03. No prose."
            ),
        }
    ],
)

print(resp.choices[0].message.content)
```
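
Because prompt-only output can include extra prose or omit keys, the reply should be checked before use. The sketch below shows one minimal way to do that with the standard library. `parse_prompt_only_json` is a hypothetical helper, and the sample reply is invented for illustration; it is not recorded model output.

```python
import json


def parse_prompt_only_json(reply, required_keys=("location", "date")):
    """Pull the outermost JSON object out of a free-form reply and check its keys."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    # json.loads raises json.JSONDecodeError (a ValueError subclass) on invalid JSON.
    data = json.loads(reply[start : end + 1])
    missing = [key for key in required_keys if key not in data]
    if missing:
        raise ValueError(f"reply is missing keys: {missing}")
    return data


# Invented reply showing the kind of drift prompt-only JSON allows
reply = 'Sure! {"location": "Manhattan Beach", "date": "2025-06-03"} Let me know if you need more.'
print(parse_prompt_only_json(reply))
```

Every failure mode surfaces as a `ValueError`, so a caller can catch one exception type and retry or fall back to a server-enforced schema.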

## Other guided decoding constraints (regex and choice)

Beyond JSON, Parasail's guided decoding can constrain output to a regular expression or a fixed set of choices via `extra_body`:

```python
# Constrain to an ISO date
extra_body = {"guided_regex": r"\d{4}-\d{2}-\d{2}"}

# Constrain to one of a fixed set of values
extra_body = {"guided_choice": ["red", "green", "blue"]}
```
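
To show where these fragments fit in a full request, the sketch below builds the keyword arguments for `client.chat.completions.create`. `guided_request` is a hypothetical helper, not part of the OpenAI SDK or Parasail; the model name is taken from the supported-models table above, and the prompt is a placeholder.

```python
def guided_request(model, prompt, *, regex=None, choices=None):
    """Build keyword arguments for client.chat.completions.create with one guided constraint."""
    if (regex is None) == (choices is None):
        raise ValueError("pass exactly one of regex or choices")
    if regex is not None:
        extra_body = {"guided_regex": regex}
    else:
        extra_body = {"guided_choice": list(choices)}
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "extra_body": extra_body,
    }


kwargs = guided_request(
    "parasail-llama-33-70b-fp8",
    "Pick one color.",
    choices=["red", "green", "blue"],
)
print(kwargs["extra_body"])
# With a configured client: completion = client.chat.completions.create(**kwargs)
```

Keeping the two constraints mutually exclusive in the helper avoids sending a request whose behavior with both set is not described here.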

## cURL example

```bash
curl https://api.parasail.io/v1/chat/completions \
  -H "Authorization: Bearer $PARASAIL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "parasail-qwen3-32b",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "guided_json": {
          "type": "object",
          "properties": {
            "capital": {"type": "string"},
            "country": {"type": "string"}
          },
          "required": ["capital", "country"],
          "additionalProperties": false
        }
      }'
```

## Troubleshooting

| Symptom                              | Likely cause                   | Fix                                                                                           |
| ------------------------------------ | ------------------------------ | --------------------------------------------------------------------------------------------- |
| Model ignores the constraint         | Using prompt-only JSON         | Switch to `guided_json` or `response_format` for server-enforced schemas.                     |
| Valid JSON but unexpected extra keys | `additionalProperties` not set | Add `"additionalProperties": false` to your schema.                                           |
| Need to stream large JSON            | Output too large for one chunk | Set `stream=True`; constraints still apply. Concatenate `delta.content` and parse at the end. |

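The streaming row above can be sketched as follows. `collect_streamed_json` is a hypothetical helper that expects chunks shaped like those the OpenAI SDK yields when `stream=True`; `fake_chunk` is a stand-in used only so the example runs without a network call.

```python
import json
from types import SimpleNamespace


def collect_streamed_json(stream):
    """Join every delta.content fragment from a chat-completions stream, then parse once."""
    parts = []
    for chunk in stream:
        if not chunk.choices:  # some chunks (for example, usage-only ones) carry no choices
            continue
        piece = chunk.choices[0].delta.content
        if piece:  # content can be None on role-only or final chunks
            parts.append(piece)
    return json.loads("".join(parts))


def fake_chunk(text):
    """Stand-in for one streamed chunk, used only so this demo runs offline."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


stream = [
    fake_chunk('{"capital": "Pa'),
    fake_chunk(None),
    fake_chunk('ris", "country": "France"}'),
]
print(collect_streamed_json(stream))
```

Individual fragments are rarely valid JSON on their own, which is why the parse happens once after the stream ends rather than per chunk.
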
## Next steps

* [Chat completions guide](/parasail-docs/guides/chat-completions)
* [Tool/function calling guide](/parasail-docs/guides/tool-function-calling)
* [Chat Completions API reference](/parasail-docs/api-reference/chat-completions)
* [Models endpoint](/parasail-docs/api-reference/models-endpoint)


# Tool/Function Calling

Let Parasail models call your functions and tools via the Chat Completions and Responses APIs.

## Overview

Tool (function) calling lets a model decide when to invoke an external function—a weather API, a database query, a calculator—and emit a structured request for it. Your application runs the function and feeds the result back to the model.

Parasail exposes tool calling through two OpenAI-compatible surfaces:

* **Chat Completions API** (`/v1/chat/completions`)—the classic single-call interface. You pass `tools`, inspect `message.tool_calls`, run the tool, and append a `tool` message for the next call.
* **Responses API** (`/v1/responses`)—the newer agentic interface built for multi-step loops. You pass `tools`, inspect `response.output` for `function_call` items, and append `function_call_output` items.

Both use the same base URL and API key:

```
https://api.parasail.io/v1
```

The **tool schema** (name, description, JSON Schema parameters) is the same in both APIs—only the wrapper shape and the response handling differ. Pick Chat Completions for simple one-shot tool calls, and the Responses API for multi-turn agentic workflows.

## Supported models

The following models support tool calling and `tool_choice` as of June 2026. Confirm current availability via the [models endpoint](/parasail-docs/api-reference/models-endpoint).

| Model                                  | Tools | Tool Choice |
| -------------------------------------- | ----- | ----------- |
| parasail-llama-33-70b-fp8              | ✅     | ✅           |
| parasail-llama-4-scout-instruct        | ✅     | ✅           |
| parasail-llama-4-maverick-instruct-fp8 | ✅     | ✅           |
| parasail-qwen3-30b-a3b                 | ✅     | ✅           |
| parasail-qwen3-235b-a22b               | ✅     | ✅           |
| parasail-qwen3-32b                     | ✅     | ✅           |
| parasail-mistral-devstral-small        | ✅     | ✅           |

{% tabs %}
{% tab title="Chat Completions API" %}
In the Chat Completions API, each tool must be wrapped as `{"type": "function", "function": {...}}`. The `function` object holds the `name`, `description`, and a JSON Schema `parameters`.

```python
import os
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://api.parasail.io/v1",
    api_key=os.environ["PARASAIL_API_KEY"],  # read the key from the environment
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Retrieve weather information for a given location and date.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"},
                    "date": {"type": "string", "pattern": "^\\d{4}-\\d{2}-\\d{2}$"},
                },
                "required": ["location", "date"],
            },
        },
    }
]

messages = [
    {"role": "user", "content": "What's the weather like in Manhattan Beach on 2025-06-03?"}
]

response = client.chat.completions.create(
    model="parasail-llama-4-scout-instruct",
    messages=messages,
    tools=tools,
    tool_choice="auto",
)

tool_call = response.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)
print(json.dumps(args, indent=2))
```

Expected arguments:

```json
{
  "location": "Manhattan Beach",
  "date": "2025-06-03"
}
```

### Returning the tool result

Run your function, then append the assistant's tool call and a `tool` message with the result before calling the model again:

```python
def get_weather(location, date):
    # Replace with a real API call
    return {"location": location, "date": date, "weather": "Sunny", "temperature": "75F"}

result = get_weather(**args)

messages.append(response.choices[0].message)  # the assistant's tool call
messages.append(
    {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": json.dumps(result),
    }
)

final = client.chat.completions.create(
    model="parasail-llama-4-scout-instruct",
    messages=messages,
    tools=tools,
)
print(final.choices[0].message.content)
```
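
A model may request more than one tool in a turn, or none at all. The sketch below is one way to generalize the step above into a loop over `message.tool_calls`. `TOOL_REGISTRY` and `run_tool_calls` are hypothetical names, and `fake_call` is an offline stand-in shaped like an SDK tool call so the example runs without a request.

```python
import json
from types import SimpleNamespace


def get_weather(location, date):
    # Stub, as in the example above; replace with a real API call
    return {"location": location, "date": date, "weather": "Sunny", "temperature": "75F"}


TOOL_REGISTRY = {"get_weather": get_weather}  # hypothetical name-to-function map


def run_tool_calls(tool_calls, registry=TOOL_REGISTRY):
    """Run each requested tool and return one `tool` message per call."""
    tool_messages = []
    for call in tool_calls or []:  # message.tool_calls is None when no tool was requested
        func = registry.get(call.function.name)
        if func is None:
            result = {"error": f"unknown tool: {call.function.name}"}
        else:
            try:
                result = func(**json.loads(call.function.arguments))
            except (TypeError, json.JSONDecodeError) as exc:
                # Report bad arguments back to the model instead of crashing the loop
                result = {"error": str(exc)}
        tool_messages.append(
            {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)}
        )
    return tool_messages


# Offline stand-in shaped like one entry of response.choices[0].message.tool_calls
fake_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(
        name="get_weather",
        arguments='{"location": "Manhattan Beach", "date": "2025-06-03"}',
    ),
)
print(run_tool_calls([fake_call]))
```

In the flow above, you would append the assistant message first and then extend `messages` with the returned list before making the follow-up request.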

{% endtab %}

{% tab title="Responses API" %}
In the Responses API, tools are defined with a flat shape—`type`, `name`, `description`, and `parameters` at the top level (no nested `function` object). Tool calls come back as `function_call` items in `response.output`, and you return results as `function_call_output` items.

```python
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://api.parasail.io/v1",
    api_key="<PARASAIL_API_KEY>",
)

tools = [
    {
        "type": "function",
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "units": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "default": "celsius",
                },
            },
            "required": ["city"],
        },
    }
]

def handle_tool_call(name, arguments):
    args = json.loads(arguments)
    if name == "get_weather":
        return json.dumps({"city": args["city"], "temperature": 22, "condition": "sunny"})
    return json.dumps({"error": "unknown function"})

# Initial request (store must be false on the Parasail gateway)
response = client.responses.create(
    model="parasail-kimi-k25-elicit",
    store=False,
    tools=tools,
    input=[{"role": "user", "content": "What's the weather in Tokyo?"}],
)

# Agentic loop—continue until the model stops calling tools
while response.output:
    tool_calls = [item for item in response.output if item.type == "function_call"]
    if not tool_calls:
        break

    next_input = list(response.output)
    for tool_call in tool_calls:
        result = handle_tool_call(tool_call.name, tool_call.arguments)
        next_input.append(
            {
                "type": "function_call_output",
                "call_id": tool_call.call_id,
                "output": result,
            }
        )

    response = client.responses.create(
        model="parasail-kimi-k25-elicit",
        store=False,
        tools=tools,
        input=next_input,
    )

print(response.output_text)
```

The Responses API requires `store=False` on every request. See the [Responses API reference](/parasail-docs/products/overview/responses-api) for the full compatibility table.
{% endtab %}
{% endtabs %}

## Chat Completions tools vs. Responses tools

|                    | Chat Completions                                                    | Responses API                                                     |
| ------------------ | ------------------------------------------------------------------- | ----------------------------------------------------------------- |
| Tool definition    | `{"type": "function", "function": {name, description, parameters}}` | `{"type": "function", name, description, parameters}` (flat)      |
| Where calls appear | `message.tool_calls[]`                                              | `response.output[]` items with `type == "function_call"`          |
| Returning results  | `{"role": "tool", "tool_call_id": ..., "content": ...}`             | `{"type": "function_call_output", "call_id": ..., "output": ...}` |
| Best for           | One-shot tool calls                                                 | Multi-step agentic loops                                          |

The underlying schema—the function `name`, `description`, and JSON Schema `parameters`—is identical between the two; only the wrapper and the result-handling differ.
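
Because only the wrapper differs, a tool definition written for one API can be converted mechanically for the other. The two functions below are a minimal sketch of that conversion; their names are hypothetical and they are not part of the OpenAI SDK or Parasail.

```python
def chat_tool_to_responses_tool(chat_tool):
    """Flatten {"type": "function", "function": {...}} into the Responses API shape."""
    if chat_tool.get("type") != "function" or "function" not in chat_tool:
        raise ValueError("expected a Chat Completions function tool")
    return {"type": "function", **chat_tool["function"]}


def responses_tool_to_chat_tool(responses_tool):
    """Nest a flat Responses API tool under a "function" key for Chat Completions."""
    if responses_tool.get("type") != "function":
        raise ValueError("expected a Responses API function tool")
    fields = {key: value for key, value in responses_tool.items() if key != "type"}
    return {"type": "function", "function": fields}


chat_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Retrieve weather information for a given location and date.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}, "date": {"type": "string"}},
            "required": ["location", "date"],
        },
    },
}

flat_tool = chat_tool_to_responses_tool(chat_tool)
print(sorted(flat_tool))  # ['description', 'name', 'parameters', 'type']
```

Defining each tool once and converting at the call site keeps the two code paths from drifting apart.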

## Next steps

* [Structured output guide](/parasail-docs/guides/structured-output)
* [Chat Completions API reference](/parasail-docs/api-reference/chat-completions)
* [Responses API reference](/parasail-docs/api-reference/responses-api)
* [Models endpoint](/parasail-docs/api-reference/models-endpoint)


# Multimodal GUI Task Agent

Multimodal models including [UI-TARS](https://github.com/bytedance/UI-TARS) and [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL) have shown a remarkable ability to understand and interact with GUIs, performing tasks from natural language instructions. Common use cases include universal voice control, automated UI testing, and workflow automation.

This document outlines how to use multimodal models for general computer use and for browser-specific operations.

## Computer-Using Agent

[UI-TARS-Desktop](https://github.com/bytedance/UI-TARS-desktop) is an application designed to enable a multimodal agent to control your desktop environment to perform tasks based on natural language inputs.

To use `UI-TARS-Desktop` with the UI-TARS-1.5 model and Parasail as the provider, import the following preset config file:

```yaml
name: UI TARS Desktop Parasail Preset
language: en
vlmProvider: Parasail for UI-TARS-1.5
vlmBaseUrl: https://api.parasail.io/v1
vlmApiKey: YOUR_API_KEY_HERE # <-- IMPORTANT: Replace with your actual Parasail API key
vlmModelName: parasail-ui-tars-1p5-7b
```

For detailed installation, usage instructions, and examples, please refer to the official UI-TARS-Desktop repository: <https://github.com/bytedance/UI-TARS-desktop/>.

## Browser Operator

[Midscene.js](https://midscenejs.com/) is a JavaScript framework that lets AI agents interact with web browsers. It allows an agent to perceive the web page and execute actions (clicks, typing, navigation), typically by connecting to a VLM backend.

The easiest way to get started with Midscene.js is using its [Chrome extension](https://midscenejs.com/quick-experience.html).

Here are examples of how you might configure Midscene.js to use different VLMs hosted by Parasail:

**Option 1: Using a Qwen-VL based model (for example, `parasail-qwen25-vl-72b-instruct`)**

```bash
export OPENAI_BASE_URL="https://api.parasail.io/v1"
export OPENAI_API_KEY="YOUR_API_KEY_HERE" # <-- IMPORTANT: Replace with your actual API key
export MIDSCENE_MODEL_NAME="parasail-qwen25-vl-72b-instruct"
export MIDSCENE_USE_QWEN_VL=1 # Flag to indicate a Qwen-VL family model
```

**Option 2: Using a UI-TARS based model (for example, `parasail-ui-tars-1p5-7b`)**

```bash
export OPENAI_BASE_URL="https://api.parasail.io/v1"
export OPENAI_API_KEY="YOUR_API_KEY_HERE" # <-- IMPORTANT: Replace with your actual API key
export MIDSCENE_MODEL_NAME="parasail-ui-tars-1p5-7b"
export MIDSCENE_USE_VLM_UI_TARS=1.5 # Flag indicating UI-TARS 1.5 model compatibility
```

For detailed setup, API documentation, and examples on how to integrate and use Midscene.js for browser automation, please consult its official repository: <https://github.com/web-infra-dev/midscene>
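
Both tools above reach Parasail through its OpenAI-compatible endpoint, so the same models can also be called directly. The following is a minimal sketch of that request, not the prompt format either tool actually uses: the helper names (`screenshot_to_data_url`, `build_gui_request`, `send`) and the instruction text are illustrative, and each model family expects its own prompt and action syntax, which this sketch does not reproduce.

```python
import base64


def screenshot_to_data_url(image_bytes: bytes, mime: str = "image/png") -> str:
    """Encode raw screenshot bytes as a data URL for an image_url content part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_gui_request(model: str, instruction: str, image_bytes: bytes) -> dict:
    """Build keyword arguments for client.chat.completions.create()."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": instruction},
                    {
                        "type": "image_url",
                        "image_url": {"url": screenshot_to_data_url(image_bytes)},
                    },
                ],
            }
        ],
    }


def send(request: dict, api_key: str) -> str:
    """Send the request to Parasail and return the model's text reply."""
    import openai  # imported lazily so the helpers above work without it

    client = openai.OpenAI(base_url="https://api.parasail.io/v1", api_key=api_key)
    response = client.chat.completions.create(**request)
    return response.choices[0].message.content
```

To try it, read a screenshot from disk, pass its bytes to `build_gui_request` together with a model name such as `parasail-ui-tars-1p5-7b`, and hand the result to `send` with your API key.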


# Browser and UI Automation

Run the Holo1 action vision-language model on a dedicated RTX 4090 endpoint to drive web UI automation with guided JSON output.

### Hcompany/Holo1-7B

<https://huggingface.co/Hcompany/Holo1-7B>

Holo1 is an Action Vision-Language Model (VLM) developed by [HCompany](https://www.hcompany.ai/) for use in the Surfer-H web agent system. It is designed to interact with web interfaces like a human user.

This example runs a single step: it sends an input image and a prompt, and the model reports the action to take. The model can also work from the memory of earlier steps. To continue a task, send the previous actions together with the current screenshot of the UI, and the model responds with the next step.

#### Starting the Model

This model is not available on serverless, but it is small enough to run on a low-cost RTX 4090 in our dedicated service. To start the model, log in to the platform and navigate to the [Create New Dedicated Instance](https://www.saas.parasail.io/dedicated/new) page. Enter **Hcompany/Holo1-7B** as the model ID and select the GPU type (we recommend the RTX 4090 for basic testing), then click deploy. Once the model is running, change the model name in the code below to match the name of your dedicated deployment.

#### Running the Model

This example also shows guided JSON decoding using pydantic schemas.

Given the following prompt and input image:

```
Book a hotel in Paris on August 3rd for 3 nights
```

<figure><img src="/files/iHSRxtixC8u1Is2nhygX" alt=""><figcaption></figcaption></figure>

The output is a `NavigationStep` whose action is a `ClickElementAction`, containing the model's reasoning (`thought`) and the click coordinates:

```
Output navigation step: 
note='' 
thought='I need to select the check-out date as August 6th, 2025, and 
then proceed to search.' 
action=ClickElementAction(action='click_element', 
         element='August 6th, 2025 on the calendar', x=657, y=318)
```

````python
import base64
import json
import os
from copy import deepcopy
from typing import Literal, Union

import openai
import requests
from dotenv import load_dotenv
from pydantic import BaseModel, Field, TypeAdapter

# Load PARASAIL_API_KEY from a local .env file if one exists
load_dotenv()

assert "PARASAIL_API_KEY" in os.environ, "Set PARASAIL_API_KEY before running"
api_key = os.environ["PARASAIL_API_KEY"]
chat_base_url = "https://api.parasail.io/v1"

client = openai.Client(
    api_key=api_key,
    base_url=chat_base_url,
)
# Replace with the deployment name of your dedicated instance
model = "miketest1-holo1"

SYSTEM_PROMPT: str = """Imagine you are a robot browsing the web, just like humans. Now you need to complete a task.
In each iteration, you will receive an Observation that includes the last  screenshots of a web browser and the current memory of the agent.
You have also information about the step that the agent is trying to achieve to solve the task.
Carefully analyze the visual information to identify what to do, then follow the guidelines to choose the following action.
You should detail your thought (i.e. reasoning steps) before taking the action.
Also detail in the notes field of the action the extracted information relevant to solve the task.
Once you have enough information in the notes to answer the task, return an answer action with the detailed answer in the notes field.
This will be evaluated by an evaluator and should match all the criteria or requirements of the task.
Guidelines:
- store in the notes all the relevant information to solve the task that fulfill the task criteria. Be precise
- Use both the task and the step information to decide what to do
- if you want to write in a text field and the text field already has text, designate the text field by the text it contains and its type
- If there is a cookies notice, always accept all the cookies first
- The observation is the screenshot of the current page and the memory of the agent.
- If you see relevant information on the screenshot to answer the task, add it to the notes field of the action.
- If there is no relevant information on the screenshot to answer the task, add an empty string to the notes field of the action.
- If you see buttons that allow to navigate directly to relevant information, like jump to ... or go to ... , use them to navigate faster.
- In the answer action, give as many details a possible relevant to answering the task.
- if you want to write, don't click before. Directly use the write action
- to write, identify the web element which is type and the text it already contains
- If you want to use a search bar, directly write text in the search bar
- Don't scroll too much. Don't scroll if the number of scrolls is greater than 3
- Don't scroll if you are at the end of the webpage
- Only refresh if you identify a rate limit problem
- If you are looking for a single flights, click on round-trip to select 'one way'
- Never try to login, enter email or password. If there is a need to login, then go back.
- If you are facing a captcha on a website, try to solve it.
- if you have enough information in the screenshot and in the notes to answer the task, return an answer action with the detailed answer in the notes field
- The current date is {timestamp}.
# <output_json_format>
# ```json
# {output_format}
# ```
# </output_json_format>
"""


class ClickElementAction(BaseModel):
    action: Literal["click_element"]
    element: str
    x: int
    y: int


class WriteElementAction(BaseModel):
    action: Literal["write_element_abs"]
    content: str
    element: str
    x: int
    y: int


class ScrollAction(BaseModel):
    action: Literal["scroll"]
    direction: Literal["down", "up", "left", "right"]


class GoBackAction(BaseModel):
    action: Literal["go_back"]


class RefreshAction(BaseModel):
    action: Literal["refresh"]


class GotoAction(BaseModel):
    action: Literal["goto"]
    url: str


class WaitAction(BaseModel):
    action: Literal["wait"]
    seconds: int


class RestartAction(BaseModel):
    action: Literal["restart"]


class AnswerAction(BaseModel):
    action: Literal["answer"]
    content: str


ActionSpace = Union[
    ClickElementAction,
    WriteElementAction,
    ScrollAction,
    GoBackAction,
    RefreshAction,
    WaitAction,
    RestartAction,
    AnswerAction,
    GotoAction,
]


class NavigationStep(BaseModel):
    note: str = Field(
        default="",
        description="Task-relevant information extracted from the previous observation. Keep empty if no new info.",
    )
    thought: str = Field(description="Reasoning about next steps (<4 lines)")
    action: ActionSpace = Field(description="Next action to take")


# Inline the $ref references generated by json_schema() and rename anyOf to oneOf
def _walk(node, defs):
    if isinstance(node, dict):
        ref = node.get("$ref")
        if ref and ref.startswith("#/$defs/"):
            return _walk(deepcopy(defs[ref.split("/")[-1]]), defs)

        if "anyOf" in node:
            node["oneOf"] = node.pop("anyOf")
        return {k: _walk(v, defs) for k, v in node.items()}
    if isinstance(node, list):
        return [_walk(v, defs) for v in node]
    return node

def openai_schema(schema_class):
    schema = TypeAdapter(schema_class).json_schema()
    result = _walk(schema, schema.get("$defs", {}))
    result.pop("$defs", None)
    return result


def get_navigation_prompt(task, image, step=1):
    """
    Get the prompt for the navigation task.
    - task: The task to complete
    - image: The current screenshot of the web page
    - step: The current step of the task
    """
    system_prompt = SYSTEM_PROMPT.format(
        output_format=NavigationStep.model_json_schema(),
        timestamp="2025-06-04 14:16:03",
    )
    return [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": system_prompt},
            ],
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": f"<task>\n{task}\n</task>\n"},
                {"type": "text", "text": f"<observation step={step}>\n"},
                {"type": "text", "text": "<screenshot>\n"},
                {"type": "image_url", "image_url": {"url": image}},
                {"type": "text", "text": "\n</screenshot>\n"},
                {"type": "text", "text": "\n</observation>\n"},
            ],
        },
    ]


def navigate(image_path: str, task: str) -> NavigationStep:
    # 1. Read the screenshot and encode it as a base64 data URL
    with open(image_path, "rb") as img_file:
        base64_image = base64.b64encode(img_file.read()).decode("utf-8")
    image_url = "data:image/jpeg;base64," + base64_image

    # 2. Build the chat messages from the task and the screenshot
    prompt = get_navigation_prompt(task, image_url, step=1)

    # 3. Run inference, constraining the output to the NavigationStep schema
    try:
        response = client.chat.completions.create(
            model=model,
            messages=prompt,
            max_tokens=512,  # Adjust as needed
            extra_body={"guided_json": openai_schema(NavigationStep)},
        )
        navigation_str = response.choices[0].message.content
    except Exception as e:
        print(f"Error during model inference: {e}")
        raise

    # 4. Parse the JSON string into a validated NavigationStep
    return NavigationStep(**json.loads(navigation_str))


# --- Load Example Data ---
example_task = "Book a hotel in Paris on August 3rd for 3 nights"
example_image_url = (
    "https://huggingface.co/Hcompany/Holo1-7B/resolve/main/calendar_example.jpg"
)
example_image_path = "image.jpg"
response = requests.get(example_image_url, timeout=30)
response.raise_for_status()  # Stop here if the download failed

with open(example_image_path, "wb") as f:
    f.write(response.content)


output_coords_component = navigate(example_image_path, example_task)
print("Input task:", example_task)
print("Output navigation step:", output_coords_component)

````

## Next steps

* [GUI Agent guide](/parasail-docs/guides/gui-agent)
* [Multi-Modal guide](/parasail-docs/guides/multi-modal)
* [Browser and UI Automation use case](/parasail-docs/use-cases/browser-ui-automation)
* [Dedicated Instances overview](/parasail-docs/products/overview-1)


# Model Recommendations

Recommended models on Parasail by task category, with context length, tool and structured-output support, and serverless availability.

Choosing the right model is the difference between a fast, low-cost deployment and an over-provisioned one. This page collects Parasail's recommended models grouped by task category, so you can start from a strong default and tune from there. For a hands-on workflow to compare candidates yourself, see [Run and evaluate any model](/parasail-docs/guides/run-and-evaluate-any-model).

Recommendations are current as of `TODO: <date>`. Model availability, context lengths, and capabilities change frequently—always confirm against the live [`/v1/models`](/parasail-docs/products/overview#programmatically-find-models) endpoint before committing.
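
One way to run that check is to fetch the model list and compare it against the names you plan to use. The sketch below assumes the endpoint returns the OpenAI-style list shape (`{"data": [{"id": ...}]}`) and reads the key from a `PARASAIL_API_KEY` environment variable; the function names are illustrative.

```python
import json
import os
import urllib.request


def extract_model_ids(payload: dict) -> set[str]:
    """Collect model IDs from an OpenAI-style list: {"data": [{"id": ...}]}."""
    return {entry["id"] for entry in payload.get("data", [])}


def missing_models(wanted: list[str], payload: dict) -> list[str]:
    """Return the wanted model names that the payload does not list."""
    available = extract_model_ids(payload)
    return [name for name in wanted if name not in available]


def fetch_model_list(base_url: str = "https://api.parasail.io/v1") -> dict:
    """GET {base_url}/models using the key in PARASAIL_API_KEY."""
    request = urllib.request.Request(
        f"{base_url}/models",
        headers={"Authorization": f"Bearer {os.environ['PARASAIL_API_KEY']}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Calling `missing_models(["parasail-deepseek-r1"], fetch_model_list())` returns an empty list when every requested name is currently served.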

> **How to read the tables**
>
> * **Recommended model**—the model name as passed to the `model` parameter in the API.
> * **Context length**—maximum combined input + output tokens.
> * **Tool / structured-output support**—whether the model supports function/tool calling and/or JSON-schema structured outputs.
> * **Serverless availability**—whether the model is available on the [Serverless](/parasail-docs/products/overview) tier (vs. Dedicated or Batch only).
> * **Notes**—caveats, strengths, or links to model-specific notes.

## Chat & general reasoning

General-purpose conversation, instruction following, and multi-step reasoning.

| Recommended model           | Context length | Tool / structured-output support            | Serverless availability | Notes                                                                                     |
| --------------------------- | -------------- | ------------------------------------------- | ----------------------- | ----------------------------------------------------------------------------------------- |
| `parasail-deepseek-r1`      | TODO:          | TODO:                                       | Yes                     | Reasoning model; see [Deepseek V3.X notes](/parasail-docs/products/overview/deepseek-v3x) |
| `parasail-qwen3-235b-a22b`  | TODO:          | Structured output: Yes; tool calling: TODO: | Yes                     | See [Qwen3.5 notes](/parasail-docs/products/overview/qwen3.5-notes)                       |
| `parasail-qwen3-32b`        | TODO:          | Structured output: Yes; tool calling: TODO: | Yes                     | TODO:                                                                                     |
| `parasail-llama-33-70b-fp8` | TODO:          | Structured output: Yes; tool calling: TODO: | TODO:                   | FP8 quantized                                                                             |
| TODO: GPT-OSS model         | TODO:          | TODO:                                       | TODO:                   | See [GPT-OSS](/parasail-docs/products/overview/gpt-oss) (20b / 120b)                      |

## RAG & embeddings

Embedding models for retrieval-augmented generation and vector search. See [RAG & embeddings](/parasail-docs/use-cases/rag-embeddings).

| Recommended model                   | Context length | Tool / structured-output support | Serverless availability | Notes                                              |
| ----------------------------------- | -------------- | -------------------------------- | ----------------------- | -------------------------------------------------- |
| `parasail-ai/GritLM-7B-vllm`        | TODO:          | N/A (embeddings)                 | TODO:                   | Developed by Contextual; rivals proprietary models |
| `Alibaba-NLP/gte-Qwen2-7B-instruct` | TODO:          | N/A (embeddings)                 | TODO:                   | From Alibaba; strong general-purpose embeddings    |
| TODO: additional embedding model    | TODO:          | N/A (embeddings)                 | TODO:                   | TODO:                                              |

Embeddings can be run via the [Batch processing service](/parasail-docs/products/quickstart). Parasail recommends `encoding_format="base64"` to reduce output file sizes.
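
With `encoding_format="base64"`, each embedding arrives as a string rather than a JSON array of numbers, so it has to be decoded before use. The sketch below assumes the payload follows the OpenAI convention of packed little-endian 32-bit floats; confirm this against a real response before relying on it. The function name is illustrative.

```python
import base64
import struct


def decode_embedding(b64_embedding: str) -> list[float]:
    """Decode a base64 string of packed little-endian float32 values."""
    raw = base64.b64decode(b64_embedding)
    if len(raw) % 4 != 0:
        raise ValueError("payload length is not a multiple of 4 bytes")
    count = len(raw) // 4
    return list(struct.unpack(f"<{count}f", raw))
```

The size saving comes from each value taking 4 bytes (about 5.3 base64 characters) instead of a decimal string of a dozen or more characters.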

## Vision / multimodal

Image understanding, document parsing, and GUI/browser automation. See [Browser & UI automation](/parasail-docs/use-cases/browser-ui-automation).

| Recommended model                                            | Context length | Tool / structured-output support | Serverless availability | Notes             |
| ------------------------------------------------------------ | -------------- | -------------------------------- | ----------------------- | ----------------- |
| `parasail-qwen25-vl-72b-instruct`                            | TODO:          | TODO:                            | TODO:                   | Qwen2.5-VL series |
| `Qwen/Qwen3-VL-8B-Instruct` (TODO: confirm serverless alias) | TODO:          | TODO:                            | TODO:                   | Smaller VL option |
| TODO: additional vision model                                | TODO:          | TODO:                            | TODO:                   | TODO:             |

## Agents & tool calling

Function calling and multi-turn agentic loops. See [Agents & tool calling](/parasail-docs/use-cases/agents-tool-calling) and the [Responses API](/parasail-docs/products/overview/responses-api).

| Recommended model                        | Context length | Tool / structured-output support            | Serverless availability | Notes                                               |
| ---------------------------------------- | -------------- | ------------------------------------------- | ----------------------- | --------------------------------------------------- |
| `parasail-kimi-k25-elicit`               | TODO:          | Tool calling: TODO:                         | TODO:                   | Used in Responses API examples; parallel tool calls |
| `parasail-deepseek-r1`                   | TODO:          | TODO:                                       | Yes                     | Strong reasoning for agent planning                 |
| `parasail-llama-4-scout-instruct`        | TODO:          | Structured output: Yes; tool calling: TODO: | TODO:                   | TODO:                                               |
| `parasail-llama-4-maverick-instruct-fp8` | TODO:          | Structured output: Yes; tool calling: TODO: | TODO:                   | FP8 quantized                                       |

## Coding

Code generation, completion, and editing (for example, via Cline). See [Serverless overview](/parasail-docs/products/overview#using-third-party-tools-like-cline-anythingllm).

| Recommended model                 | Context length | Tool / structured-output support            | Serverless availability | Notes                                 |
| --------------------------------- | -------------- | ------------------------------------------- | ----------------------- | ------------------------------------- |
| `parasail-mistral-devstral-small` | TODO:          | Structured output: Yes; tool calling: TODO: | TODO:                   | Mistral Devstral; coding-focused      |
| TODO: QwenCoder 32b model name    | TODO:          | TODO:                                       | TODO:                   | Referenced for Cline coding workflows |
| TODO: additional coding model     | TODO:          | TODO:                                       | TODO:                   | TODO:                                 |

## Known issues

* BitsAndBytes quantization is not optimal for vLLM, and current support is minimal.
* TODO: other known model-specific issues.

## Next steps

* [Run and evaluate any model](/parasail-docs/guides/run-and-evaluate-any-model)—compare candidate models for your use case
* [Structured output](/parasail-docs/guides/structured-output)—force JSON-formatted responses
* [Serverless overview](/parasail-docs/products/overview)—find available models via the `/v1/models` endpoint
* [Responses API](/parasail-docs/products/overview/responses-api)—multi-turn agentic workflows with tool calling


# OpenAI Batch Library

Initial Release:

{% embed url="<https://pypi.org/project/openai-batch/>" %}

[![Tests](https://github.com/parasail-ai/openai-batch/actions/workflows/tests.yml/badge.svg)](https://github.com/parasail-ai/openai-batch/actions/workflows/tests.yml) [![PyPI - Version](https://img.shields.io/pypi/v/openai-batch)](https://pypi.org/project/openai-batch/)

## openai-batch

Batch inferencing is an easy and inexpensive way to process thousands or millions of LLM inferences.

The process is:

1. Write inferencing requests to an input file
2. Start a batch job
3. Wait for it to finish
4. Download the output

This library aims to make these steps easier. The OpenAI protocol is relatively easy to use, but it has a lot of boilerplate steps. This library automates those.
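
For reference, the input file in step 1 is a JSONL file in the OpenAI batch format: one request object per line, each with a `custom_id`, `method`, `url`, and `body`. The sketch below writes such a file by hand for chat completions; the helper names and the `request-N` ID scheme are illustrative, and the library's own `create_input` utility covers the same step from the command line.

```python
import json


def build_batch_line(custom_id: str, model: str, prompt: str) -> str:
    """Serialize one chat-completion request as a single JSONL line."""
    request = {
        "custom_id": custom_id,  # used to match outputs back to inputs
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    }
    return json.dumps(request)


def write_batch_input(path: str, model: str, prompts: list[str]) -> int:
    """Write one request per prompt to `path` and return the request count."""
    with open(path, "w", encoding="utf-8") as f:
        for index, prompt in enumerate(prompts):
            f.write(build_batch_line(f"request-{index}", model, prompt) + "\n")
    return len(prompts)
```
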

**Supported Providers**

* [OpenAI](https://openai.com/) - ChatGPT, GPT-4o, etc.
* [Parasail](https://parasail.io/) - Most transformer models on HuggingFace, such as Llama, Qwen, LLaVA, etc.

### Command-Line Utilities

Use `openai_batch.run` to run a batch from an input file on disk:

```bash
python -m openai_batch.run input.jsonl
```

This will start the batch, wait for it to complete, then download the results.

Useful switches:

* `-c` Only create the batch, do not wait for it.
* `--resume` Attach to an existing batch job. Wait for it to finish then download results.
* `--dry-run` Confirm your configuration without making an actual request.
* Full list: `python -m openai_batch.run --help`

#### OpenAI Example

```bash
export OPENAI_API_KEY="<Your OpenAI API Key>"

# Create an example batch input file
python -m openai_batch.example_prompts | \
  python -m openai_batch.create_input --model 'gpt-4o-mini' > input.jsonl

# Run this batch (resumable with `--resume <BATCH_ID>`)
python -m openai_batch.run input.jsonl
```

#### Parasail Example

```bash
export PARASAIL_API_KEY="<Your Parasail API Key>"

# Create an example batch input file
python -m openai_batch.example_prompts | \
  python -m openai_batch.create_input --model 'meta-llama/Meta-Llama-3-8B-Instruct' > input.jsonl

# Run this batch (resumable with `--resume <BATCH_ID>`)
python -m openai_batch.run -p parasail input.jsonl
```

### Resources

* [OpenAI Batch Cookbook](https://cookbook.openai.com/examples/batch_processing)
* [OpenAI Batch API reference](https://platform.openai.com/docs/api-reference/batch)
* [OpenAI Files API reference](https://platform.openai.com/docs/api-reference/files)
* [Anthropic's Message Batches](https://www.anthropic.com/news/message-batches-api) - Uses a different API


# Pricing

Understand Parasail pricing across the Serverless (per-token), Dedicated (per-GPU-hour), and Batch (discounted per-token) tiers.

## Serverless Pricing: <a href="#serverless-pricing" id="serverless-pricing"></a>

Serverless pricing is token-based, with separate rates for input and output tokens, and the rates depend on the model you use. Input and output prices are listed on the "Serverless" page. Prices are quoted per million tokens. For example, if a model costs $1 per million tokens for both input and output and you use 250,000 input tokens and 250,000 output tokens, you are charged $0.50: $0.25 for the input and $0.25 for the output.
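
The same arithmetic as a small sketch, using the example figures from the paragraph above. The function name is illustrative, and real prices vary by model.

```python
TOKENS_PER_UNIT = 1_000_000  # prices are quoted per million tokens


def serverless_cost(
    input_tokens: int,
    output_tokens: int,
    input_price: float,
    output_price: float,
) -> float:
    """Return the charge in dollars for one model's token usage."""
    input_cost = input_tokens / TOKENS_PER_UNIT * input_price
    output_cost = output_tokens / TOKENS_PER_UNIT * output_price
    return input_cost + output_cost


# 250,000 input + 250,000 output tokens at $1 per million each
print(serverless_cost(250_000, 250_000, 1.0, 1.0))  # 0.5
```
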

<figure><img src="/files/pFgxtR7TTBGrfGjYQFfg" alt=""><figcaption></figcaption></figure>

## Dedicated Pricing: <a href="#dedicated-pricing" id="dedicated-pricing"></a>

Dedicated instances are billed per GPU-hour. Parasail offers various hardware configurations to meet your cost, performance, and latency targets. Dedicated instances can automatically scale the number of GPUs as your workload fluctuates, and you can configure a scale-down policy, which controls when the server turns off automatically. When you deploy, you choose the number of replicas for your model, and the price is displayed next to each option:

<figure><img src="/files/1NmQKA8XqcocyS7RCdWP" alt=""><figcaption></figcaption></figure>

## Batch Pricing: <a href="#batch-pricing" id="batch-pricing"></a>

Batch pricing is token-based on the total number of tokens and depends on the model you use. It is discounted because your queries are not processed in real time.

Default pricing is based on the model's parameter count, unless the model is a named model.

Batch is billed at a 50% discount from the Serverless price. Cached tokens get an additional 50% off. FP16 models cost 30% more; FP8 models incur no additional cost.
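
These rules can be sketched as a small calculation. The assumption here is that the discounts and the FP16 surcharge multiply together, which matches the rounded figures in the price table on this page; the function and constant names are illustrative.

```python
BATCH_DISCOUNT = 0.50  # batch is 50% off the serverless price
CACHE_DISCOUNT = 0.50  # cached tokens get a further 50% off
FP16_SURCHARGE = 0.30  # FP16 models cost 30% more than FP8


def batch_price(serverless_price: float, fp16: bool = False, cached: bool = False) -> float:
    """Return the batch price derived from a serverless price."""
    price = serverless_price * (1 - BATCH_DISCOUNT)
    if fp16:
        price *= 1 + FP16_SURCHARGE
    if cached:
        price *= 1 - CACHE_DISCOUNT
    return price
```

For a serverless price of $0.70, this gives $0.35 for FP8 batch, $0.455 for FP16 batch, and $0.175 for cached FP8 tokens.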

Current prices:

<table><thead><tr><th>Parameter Count</th><th>Size</th><th width="100">Serverless Price</th><th>Batch Price FP8</th><th>Batch Price FP16</th><th>Cache Price FP8</th><th>Cache Price FP16</th></tr></thead><tbody><tr><td>0-4 B</td><td>0-4 B</td><td>$0.05</td><td>$0.025</td><td>$0.033</td><td>$0.013</td><td>$0.016</td></tr><tr><td>4.1-8 B</td><td>4.1-8 B</td><td>$0.08</td><td>$0.040</td><td>$0.052</td><td>$0.020</td><td>$0.026</td></tr><tr><td>LLM_Model_8.1-16 B</td><td>8.1-16 B</td><td>$0.11</td><td>$0.055</td><td>$0.072</td><td>$0.028</td><td>$0.036</td></tr><tr><td>LLM_Model_16.1 B-21 B</td><td>16.1 B-21 B</td><td>$0.45</td><td>$0.225</td><td>$0.293</td><td>$0.113</td><td>$0.146</td></tr><tr><td>LLM_Model_21.1 B-41 B</td><td>21.1 B-41 B</td><td>$0.50</td><td>$0.250</td><td>$0.325</td><td>$0.125</td><td>$0.163</td></tr><tr><td>LLM_Model_41.1 B-80 B</td><td>41.1 B-80 B</td><td>$0.70</td><td>$0.350</td><td>$0.455</td><td>$0.175</td><td>$0.228</td></tr><tr><td>LLM_Model_80.1 B-404 B</td><td>80.1 B-404 B</td><td>$0.80</td><td>$0.400</td><td>$0.520</td><td>$0.200</td><td>$0.260</td></tr><tr><td>LLM_Model_405 B</td><td>405 B</td><td>$1.75</td><td>$0.875</td><td>$1.138</td><td>$0.438</td><td>$0.569</td></tr></tbody></table>

## Next steps

* [Billing and Payments](/parasail-docs/billing/billing-and-payments)
* [Billing API](/parasail-docs/billing/billing-api)
* [Quota Management](/parasail-docs/billing/quota-management)
* [Rate Limits](/parasail-docs/billing/rate-limits)


# Rate Limits and Limitations

Our current Rate Limits and Serverless usage limitations

## Rate Limits:

Every customer is rate limited according to product tier. The current limits for existing and future tiers are:

| Product Class            | RPM       | Token Limit |
| ------------------------ | --------- | ----------- |
| Serverless - Free        | 5         | -           |
| Serverless - User        | 500       | -           |
| Dedicated Serverless     | 1000      | -           |
| Dedicated Serverless Pro | 4000      | -           |
| Enterprise               | Unlimited | -           |

New users without a credit card will **soon** be able to use the system, but with heavy rate limits.

Once you sign up, your account has a rate limit of 500 RPM, per the table above. The Token Limit has not yet been decided or implemented.

You can contact us to increase your rate limits or move into one of the newer services.
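
If you send requests in bulk, pacing them on the client keeps you under your tier's RPM. The sketch below is a generic sliding-window throttle, not part of any Parasail SDK: the class name and its `clock` and `sleep` parameters are illustrative. It only tracks calls made through one instance, so the server-side limit remains authoritative, and how the API reports an exceeded limit is not covered here.

```python
import time
from collections import deque


class RpmThrottle:
    """Client-side limiter: at most `rpm` calls in any 60-second window."""

    WINDOW_SECONDS = 60.0

    def __init__(self, rpm: int, clock=time.monotonic, sleep=time.sleep):
        self.rpm = rpm
        self.clock = clock
        self.sleep = sleep
        self.sent = deque()  # timestamps of calls inside the current window

    def wait_time(self) -> float:
        """Seconds until the next call is allowed (0.0 if allowed now)."""
        now = self.clock()
        # Drop timestamps that have aged out of the window
        while self.sent and now - self.sent[0] >= self.WINDOW_SECONDS:
            self.sent.popleft()
        if len(self.sent) < self.rpm:
            return 0.0
        return self.WINDOW_SECONDS - (now - self.sent[0])

    def acquire(self) -> None:
        """Block until a call is allowed, then record it."""
        while (delay := self.wait_time()) > 0:
            self.sleep(delay)
        self.sent.append(self.clock())
```

Create one throttle with your tier's limit, for example `RpmThrottle(500)`, and call `acquire()` before each request.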

## GPU Quota:

New customers have a quota of 4 GPUs per organization. There are several reasons for this limit. The main one is that new users sometimes choose a larger GPU configuration before testing, not realizing that their model could run on 1 GPU, and accumulate a large bill.

If you see the message "Insufficient quota" on your deployment page, you have reached your quota.

To increase your quota, please fill out this form: [Quota Increase](https://parasail.clickup.com/forms/9011827181/f/8cjb4fd-5671/T54OWYUTZAAJIVOGIC)


# Billing and Payments

How Parasail Handles Payments

## Org Based Billing:

Your user account belongs to an Organization. That Organization gets billed for the usage of all users under that Organization. Please see the page [Account Management](/parasail-docs/security-and-account-management/account-management) to understand more about how organizations work.

<figure><img src="/files/Ho34ws7eOb7qtuQVYsDl" alt="" width="375"><figcaption></figcaption></figure>

## Arrears Based Billing:

Unless you have pre-negotiated invoicing, the system automatically charges your card after you spend $25. At the end of the month, the system automatically adjusts your remaining balance for that billing cycle.

**FAQ**: How do I increase the arrears billing threshold?

[Contact Parasail](https://help.parasail.io/) directly to increase the billing threshold or to move to an enterprise contract.

## Credit Cards:

Every customer must provide a credit card to access the platform. If you choose not to provide a credit card, you can still look around the platform, but you cannot use the models or API keys.

Stripe handles all of your credit card information. Parasail doesn't store any of your payment card industry (PCI) data.

## Billing Page:

The billing page breaks down all of the services that you have consumed. It shows all of your past invoices and the current invoice for the month.

## Enterprise Invoicing/Billing Cycle:

For enterprise-based billing, Parasail bills customers on a monthly basis or according to what the parties have signed in the billing contract.

1. Invoice Period
   * Monthly cycle: first through last day of each month
2. Invoice Reconciliation
   * Date: first of the following month
   * Process: Manual review and adjustments such as removing line items and making corrections
3. Automatic Invoice Submission
   * Date: second of the following month
   * Action: Invoices automatically submitted to Stripe
4. Invoice Approval
   * Date: third of the following month
   * Action: Approve invoices for distribution from Stripe
5. Payment Terms
   * Net 30 billing terms apply



# Billing API

Programmatically retrieve invoices, real-time month-to-date spend, and hourly or daily usage breakdowns with the Parasail Billing API.

The Parasail Billing API allows you to programmatically access your invoices, including real-time month-to-date spend and hourly or daily usage breakdowns for your current billing period.

## Authentication

All billing endpoints require authentication using your Parasail API key in the `Authorization` header:

```
Authorization: Bearer psk-<accessKey>-<secretKey>
```

You can generate an API key from the Parasail dashboard under **Settings > API Keys**.

## Endpoints

### Get Current Invoice

Returns the active draft invoice for your current billing period. This invoice is continuously updated in real time as usage is reported, giving you up-to-date visibility into your spend.

```
GET /api/v1/billing/invoices/current
```

**Example**

```bash
curl -H "Authorization: Bearer psk-<accessKey>-<secretKey>" \
  https://api.parasail.io/api/v1/billing/invoices/current
```

**Response**

```json
{
    "id": "3fa1c306-5258-5ac0-8905-8b90c6a6396f",
    "status": "DRAFT",
    "type": "USAGE",
    "total": 52.89,
    "start_timestamp": "2026-04-01T00:00:00+00:00",
    "end_timestamp": "2026-05-01T00:00:00+00:00",
    "line_items": [
        {
            "name": "Serverless Input (Per Million Tokens)",
            "quantity": 1.49,
            "total": 0.75,
            "unit_price": 50.0,
            "type": "usage",
            "starting_at": "2026-04-01T00:00:00+00:00",
            "ending_before": "2026-05-01T00:00:00+00:00",
            "pricing_group_values": {
                "DeploymentName": "my-deployment"
            },
            "presentation_group_values": {
                "ModelName": "meta-llama/Llama-3.1-8B-Instruct"
            }
        },
        {
            "name": "Dedicated GPU Hours",
            "quantity": 0.33,
            "total": 52.80,
            "unit_price": 160.0,
            "type": "usage",
            "starting_at": "2026-04-01T00:00:00+00:00",
            "ending_before": "2026-05-01T00:00:00+00:00",
            "pricing_group_values": {
                "gpuType": "A10080GB"
            }
        }
    ],
    "issued_at": "2026-05-02T00:00:00+00:00",
    "billable_status": "billable"
}
```

Returns `404` if no active draft invoice exists for the account.
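The same call can be sketched in Python using only the standard library. This is a minimal illustration, not an official client: the function name `get_current_invoice` and the injectable `opener` argument are hypothetical, while the host, path, and header come from the curl example above. It treats the documented `404` as "no draft invoice" and lets every other error propagate.

```python
import json
import urllib.error
import urllib.request

BASE_URL = "https://api.parasail.io"  # host taken from the curl example above


def get_current_invoice(api_key, opener=urllib.request.urlopen):
    """Return the current draft invoice as a dict, or None if the API answers 404."""
    request = urllib.request.Request(
        f"{BASE_URL}/api/v1/billing/invoices/current",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    try:
        with opener(request, timeout=30) as response:
            return json.load(response)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            # No active draft invoice exists for the account.
            return None
        raise  # 401, 403, 502, ... are left for the caller to handle
```

Calling `get_current_invoice("psk-<accessKey>-<secretKey>")` returns the invoice dictionary, so month-to-date spend is available as `invoice["total"]` whenever the result is not `None`.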

***

### List Invoice Breakdowns

Returns granular invoice breakdowns for a time range. Use this endpoint to inspect hourly or daily usage and spend within the current or historical billing period.

```
GET /api/v1/billing/invoices/breakdowns
```

**Query Parameters**

| Parameter                  | Required | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
| -------------------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `starting_on`              | Yes      | Only return breakdown windows starting on or after this timestamp (RFC 3339 format)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
| `ending_before`            | Yes      | Only return breakdown windows ending on or before this timestamp (RFC 3339 format)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| `window_size`              | No       | Breakdown granularity: `hour` or `day`. Defaults to `day`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| `status`                   | No       | Filter by invoice status: `DRAFT`, `FINALIZED`, or `VOID`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| `skip_zero_qty_line_items` | No       | Set to `true` to omit zero-quantity line items                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
| `limit`                    | No       | Maximum number of breakdown windows to return, counted in units of `window_size` (for example, days). Daily breakdowns can return up to 35 days per response; hourly breakdowns can return up to 24 hours per response. If omitted, defaults to the maximum for the selected `window_size`. We strongly recommend setting `limit` to avoid excessively large response payloads. Note that the number of items returned in a single page can differ from the `limit` value |
| `next_page`                | No       | Cursor that indicates where the next page of results should start. Use the `next_page` value returned by the previous response                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |

**Examples**

```bash
# Get daily breakdowns for May 1, 2026 at 12:00 AM UTC to May 2, 2026 at 12:00 AM UTC
curl -H "Authorization: Bearer psk-<accessKey>-<secretKey>" \
  "https://api.parasail.io/api/v1/billing/invoices/breakdowns?limit=35&starting_on=2026-05-01T00:00:00Z&ending_before=2026-05-02T00:00:00Z&window_size=day"
```

```bash
# Get hourly breakdowns for May 1, 2026 at 12:00 AM UTC to May 2, 2026 at 12:00 AM UTC
curl -H "Authorization: Bearer psk-<accessKey>-<secretKey>" \
  "https://api.parasail.io/api/v1/billing/invoices/breakdowns?limit=24&starting_on=2026-05-01T00:00:00Z&ending_before=2026-05-02T00:00:00Z&window_size=hour"
```
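If you build these URLs in code, the RFC 3339 timestamps can be generated rather than typed by hand. The sketch below is illustrative only: the helper names `rfc3339` and `breakdowns_url` are hypothetical, and it assumes the per-page maximums of 35 (daily) and 24 (hourly) described in the parameter table. `urlencode` percent-encodes the colons in each timestamp, which is equivalent to the literal form used in the curl examples.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

BREAKDOWNS_URL = "https://api.parasail.io/api/v1/billing/invoices/breakdowns"


def rfc3339(moment):
    """Format an aware datetime as an RFC 3339 UTC timestamp, e.g. 2026-05-01T00:00:00Z."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def breakdowns_url(start, days=1, window_size="day"):
    """Build the request URL for `days` whole days beginning at `start`."""
    params = {
        "starting_on": rfc3339(start),
        "ending_before": rfc3339(start + timedelta(days=days)),
        "window_size": window_size,
        "limit": 35 if window_size == "day" else 24,
    }
    return f"{BREAKDOWNS_URL}?{urlencode(params)}"


url = breakdowns_url(datetime(2026, 5, 1, tzinfo=timezone.utc))
print(url)
```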

**Response**

```json
{
    "breakdowns": [
        {
            "id": "3fa1c306-5258-5ac0-8905-8b90c6a6396f",
            "status": "DRAFT",
            "type": "USAGE",
            "total": 18.42,
            "subtotal": 18.42,
            "start_timestamp": "2026-05-01T00:00:00+00:00",
            "end_timestamp": "2026-06-01T00:00:00+00:00",
            "breakdown_start_timestamp": "2026-05-01T00:00:00+00:00",
            "breakdown_end_timestamp": "2026-05-02T00:00:00+00:00",
            "line_items": [
                {
                    "name": "Serverless Input (Per Million Tokens)",
                    "quantity": 1.49,
                    "total": 0.75,
                    "unit_price": 50.0,
                    "type": "usage",
                    "product_id": "5c1f40cd-9ff8-4e90-ae53-5f81b0e9d1e8",
                    "pricing_group_values": {
                        "DeploymentName": "my-deployment"
                    },
                    "presentation_group_values": {
                        "ModelName": "meta-llama/Llama-3.1-8B-Instruct"
                    }
                },
                {
                    "name": "Dedicated GPU Hours",
                    "quantity": 0.11,
                    "total": 17.60,
                    "unit_price": 160.0,
                    "type": "usage",
                    "pricing_group_values": {
                        "gpuType": "A10080GB"
                    }
                }
            ]
        }
    ],
    "next_page": "7b226e6578745f72616e67655f637572736f72223a7b2273746172745f74696d657374616d70223a22323032362d30352d32315430303a30303a30302e3030305a227d7d"
}
```

For large ranges, the API returns one page at a time. If `next_page` is not `null`, call the endpoint again with the same filters and append `next_page=<cursor>` to fetch the next page.

```bash
curl -H "Authorization: Bearer psk-<accessKey>-<secretKey>" \
  "https://api.parasail.io/api/v1/billing/invoices/breakdowns?starting_on=2026-05-01T00:00:00Z&ending_before=2026-06-01T00:00:00Z&window_size=day&limit=35&next_page=7b226e6578745f72616e67655f637572736f72223a7b2273746172745f74696d657374616d70223a22323032362d30352d32315430303a30303a30302e3030305a227d7d"
```

The upstream billing service (Metronome) limits daily breakdown pages to at most 35 days and hourly breakdown pages to at most 24 hours. Parasail clamps larger `limit` values to those maximums before forwarding requests to Metronome.
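The pagination loop described above can be sketched as a Python generator. This is a minimal sketch under stated assumptions: `fetch_page` and `iter_breakdowns` are hypothetical helper names, and the client-side clamp simply mirrors the documented maximums so the request never asks for more than a page can hold. The `fetch` argument is injectable so the loop can be exercised without network access.

```python
import json
import urllib.parse
import urllib.request

BREAKDOWNS_URL = "https://api.parasail.io/api/v1/billing/invoices/breakdowns"
MAX_LIMIT = {"day": 35, "hour": 24}  # per-page maximums stated above


def fetch_page(api_key, params):
    """Perform one GET request and return the decoded JSON body."""
    url = f"{BREAKDOWNS_URL}?{urllib.parse.urlencode(params)}"
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def iter_breakdowns(api_key, starting_on, ending_before, window_size="day",
                    limit=None, fetch=fetch_page):
    """Yield every breakdown window in the range, following next_page cursors."""
    max_limit = MAX_LIMIT[window_size]
    params = {
        "starting_on": starting_on,
        "ending_before": ending_before,
        "window_size": window_size,
        "limit": min(limit or max_limit, max_limit),
    }
    while True:
        page = fetch(api_key, params)
        yield from page.get("breakdowns", [])
        cursor = page.get("next_page")
        if not cursor:
            return
        # Same filters, plus the cursor from the previous response.
        params = {**params, "next_page": cursor}
```

Because it is a generator, a caller can stop early (for example, after the first window that exceeds a spend threshold) without fetching the remaining pages.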

***

### List Invoices

Returns all invoices matching the specified filters. Use `status=DRAFT` to get current billing period invoices, or `status=FINALIZED` to retrieve historical invoices.

```
GET /api/v1/billing/invoices
```

**Query Parameters**

| Parameter       | Required | Description                                                                                 |
| --------------- | -------- | ------------------------------------------------------------------------------------------- |
| `status`        | No       | Filter by invoice status: `DRAFT`, `FINALIZED`, or `VOID`                                   |
| `starting_on`   | No       | Only return invoices with a billing period starting on or after this date (RFC 3339 format) |
| `ending_before` | No       | Only return invoices with a billing period ending before this date (RFC 3339 format)        |

**Example**

```bash
# Get all finalized invoices for April 2026
curl -H "Authorization: Bearer psk-<accessKey>-<secretKey>" \
  "https://api.parasail.io/api/v1/billing/invoices?status=FINALIZED&starting_on=2026-04-01T00:00:00Z&ending_before=2026-05-01T00:00:00Z"
```

**Response**

```json
{
    "invoices": [
        {
            "id": "3fa1c306-5258-5ac0-8905-8b90c6a6396f",
            "status": "FINALIZED",
            "type": "USAGE",
            "total": 1240.50,
            "start_timestamp": "2026-04-01T00:00:00+00:00",
            "end_timestamp": "2026-05-01T00:00:00+00:00",
            "line_items": [ ... ],
            "issued_at": "2026-05-02T00:00:00+00:00",
            "billable_status": "billable"
        }
    ]
}
```

***

### Get Invoice by ID

Returns a specific invoice by its unique identifier.

```
GET /api/v1/billing/invoices/{invoice_id}
```

**Path Parameters**

| Parameter    | Required | Description                          |
| ------------ | -------- | ------------------------------------ |
| `invoice_id` | Yes      | The unique identifier of the invoice |

**Example**

```bash
curl -H "Authorization: Bearer psk-<accessKey>-<secretKey>" \
  https://api.parasail.io/api/v1/billing/invoices/3fa1c306-5258-5ac0-8905-8b90c6a6396f
```

***

## Invoice Object

| Field                       | Type   | Description                                                                                  |
| --------------------------- | ------ | -------------------------------------------------------------------------------------------- |
| `id`                        | string | Unique invoice identifier                                                                    |
| `status`                    | string | `DRAFT` (current period, updating in real time), `FINALIZED` (closed), or `VOID` (cancelled) |
| `type`                      | string | Invoice type: `USAGE`, `SCHEDULED`, or `USAGE_CONSOLIDATED`                                  |
| `total`                     | number | Total invoice amount in USD                                                                  |
| `subtotal`                  | number | Subtotal invoice amount in USD                                                               |
| `start_timestamp`           | string | Start of the billing period (RFC 3339)                                                       |
| `end_timestamp`             | string | End of the billing period (RFC 3339)                                                         |
| `breakdown_start_timestamp` | string | Start of the breakdown window (only returned by `/invoices/breakdowns`)                      |
| `breakdown_end_timestamp`   | string | End of the breakdown window (only returned by `/invoices/breakdowns`)                        |
| `line_items`                | array  | Itemized charges (see below)                                                                 |
| `issued_at`                 | string | When the invoice was or will be issued (RFC 3339)                                            |
| `billable_status`           | string | `billable` or `unbillable`                                                                   |

## Line Item Object

| Field                       | Type   | Description                                                                            |
| --------------------------- | ------ | -------------------------------------------------------------------------------------- |
| `name`                      | string | Description of the charge (for example, "Serverless Input (Per Million Tokens)")       |
| `quantity`                  | number | Amount consumed                                                                        |
| `total`                     | number | Line item cost in USD                                                                  |
| `unit_price`                | number | Price per unit in USD                                                                  |
| `type`                      | string | `usage`, `subscription`, `scheduled`, `commit_purchase`, or `applied_commit_or_credit` |
| `starting_at`               | string | Start of the line item period (RFC 3339)                                               |
| `ending_before`             | string | End of the line item period (RFC 3339)                                                 |
| `pricing_group_values`      | object | Pricing dimensions (for example, deployment name, GPU type)                            |
| `presentation_group_values` | object | Display dimensions (for example, model name)                                           |
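As an illustration of how these fields fit together, the following sketch totals spend per line item name and per grouping dimension. The function names are hypothetical, and the sample data is trimmed from the example responses on this page. It assumes `pricing_group_values` and `presentation_group_values` may be absent on some line items, as in the "Dedicated GPU Hours" example.

```python
from collections import defaultdict


def spend_by_line_item(invoice):
    """Sum `total` per line item `name` for one invoice or breakdown object."""
    totals = defaultdict(float)
    for item in invoice.get("line_items", []):
        totals[item["name"]] += item["total"]
    return {name: round(amount, 2) for name, amount in totals.items()}


def spend_by_group(invoice, key):
    """Sum `total` per value of one grouping dimension, for example "gpuType"."""
    totals = defaultdict(float)
    for item in invoice.get("line_items", []):
        groups = {
            **(item.get("pricing_group_values") or {}),
            **(item.get("presentation_group_values") or {}),
        }
        # Line items without the dimension are collected under "(none)".
        totals[groups.get(key, "(none)")] += item["total"]
    return {value: round(amount, 2) for value, amount in totals.items()}


sample = {
    "line_items": [
        {
            "name": "Serverless Input (Per Million Tokens)",
            "total": 0.75,
            "pricing_group_values": {"DeploymentName": "my-deployment"},
            "presentation_group_values": {"ModelName": "meta-llama/Llama-3.1-8B-Instruct"},
        },
        {
            "name": "Dedicated GPU Hours",
            "total": 52.80,
            "pricing_group_values": {"gpuType": "A10080GB"},
        },
    ]
}

print(spend_by_line_item(sample))
print(spend_by_group(sample, "gpuType"))
```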

## Error Responses

| Status Code | Description                                                      |
| ----------- | ---------------------------------------------------------------- |
| `401`       | Missing or invalid API key                                       |
| `403`       | Account not found or billing not configured                      |
| `404`       | Invoice not found (for `/invoices/current` and `/invoices/{id}`) |
| `502`       | Upstream billing service error                                   |
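A client might map these status codes to a retry decision along the following lines. The codes and descriptions come from the table above; the retry flags are an assumption (only the upstream `502` is treated as transient), and `classify_billing_error` is a hypothetical helper name.

```python
# Status codes and descriptions from the table above; the retry flags are assumptions.
BILLING_ERRORS = {
    401: ("Missing or invalid API key", False),
    403: ("Account not found or billing not configured", False),
    404: ("Invoice not found", False),
    502: ("Upstream billing service error", True),
}


def classify_billing_error(status_code):
    """Return (description, retryable) for a billing API status code."""
    if status_code in BILLING_ERRORS:
        return BILLING_ERRORS[status_code]
    # Unlisted codes: treat other 5xx responses as transient, everything else as final.
    return ("Unexpected status", status_code >= 500)
```

Retrying a `401` or `403` will not help until the key or account configuration changes, which is why those are marked as non-retryable here.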

## Next steps

* [Pricing](/parasail-docs/billing/pricing)
* [Quota Management](/parasail-docs/billing/quota-management)
* [Rate Limits](/parasail-docs/billing/rate-limits)
* [Billing and Payments](/parasail-docs/billing/billing-and-payments)
* [Authentication API reference](/parasail-docs/api-reference/authentication)


# Quota Management

Manage your GPU quota limits across Batch and Dedicated services and request quota increases.

#### GPU Quota Limits

By default, all users have a quota of 4 GPUs across both Batch and Dedicated services. This quota applies collectively, regardless of GPU type. To request more, use the [**Quota Increase Form**](https://parasail.clickup.com/forms/9011827181/f/8cjb4fd-5671/T54OWYUTZAAJIVOGIC).

**Examples:**

* If you are currently using 2 H100 GPUs and you submit a Batch job that requires 8 H100 GPUs (for example, DeepSeek V3), the submission is rejected because it would exceed your 4-GPU quota.
* If you configure auto-scaling for a range of 1–6 GPUs and your model is currently using 4 GPUs, the quota prevents the model from scaling up to a fifth GPU.
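The arithmetic behind both examples can be sketched as follows. This only illustrates the rule as described here and is not Parasail's enforcement logic: the function names are hypothetical, and the default of 4 is the quota stated above.

```python
DEFAULT_GPU_QUOTA = 4  # default limit stated above; your account's quota may differ


def fits_quota(gpus_in_use, gpus_requested, quota=DEFAULT_GPU_QUOTA):
    """Return True if the new request keeps total GPU usage within the quota."""
    return gpus_in_use + gpus_requested <= quota


def autoscale_ceiling(configured_max, gpus_in_use_elsewhere=0, quota=DEFAULT_GPU_QUOTA):
    """Return the highest GPU count auto-scaling can actually reach under the quota."""
    return max(0, min(configured_max, quota - gpus_in_use_elsewhere))


# First example: 2 GPUs in use plus an 8-GPU Batch job is over the 4-GPU quota.
print(fits_quota(gpus_in_use=2, gpus_requested=8))

# Second example: a 1-6 GPU scaling range is effectively capped at the quota.
print(autoscale_ceiling(configured_max=6))
```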

#### Requesting Increased Quotas

If you need additional GPU capacity, you can request a higher quota using the Quota Increase Form. Increases up to 8 GPUs are available without additional justification. Requests for more than 8 GPUs require a detailed explanation of your usage needs.

## Next steps

* [Pricing](/parasail-docs/billing/pricing)
* [Rate Limits](/parasail-docs/billing/rate-limits)
* [Billing and Payments](/parasail-docs/billing/billing-and-payments)
* [Dedicated Instances overview](/parasail-docs/products/overview-1)
* [Batch Quick start](/parasail-docs/products/quickstart)


# Overview

Run Parasail in production—rate limits, quotas, scaling, deployment lifecycle, batch operations, and safe retries—all from one place.

This hub ties together the operational topics you need when running Parasail in production. Rather than repeat the details, each section points to the page that owns it, so you always read the current source.

## Operational topics

| Topic                         | What it covers                                                                      | Page                                                                                    |
| ----------------------------- | ----------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------- |
| Rate limits and 429 responses | RPM limits per product tier, when requests are throttled, and how to raise limits   | [Rate Limits and Limitations](/parasail-docs/billing/rate-limits)                       |
| Quota management              | GPU quota across Batch and Dedicated, and how to request increases                  | [Quota Management](/parasail-docs/billing/quota-management)                             |
| Autoscaling & replicas        | Max/target concurrent requests, smoothing, and replica scaling ranges               | [Auto-Scaling](/parasail-docs/products/overview-1/auto-scaling)                         |
| Deployment lifecycle          | Create, pause, edit, resume, and destroy Dedicated Instances; pause-to-stop billing | [Dedicated Instances](/parasail-docs/products/overview-1)                               |
| Batch job lifecycle           | Submission, priority, status polling, cancellation, and terminal states             | [Batch Quickstart](/parasail-docs/products/quickstart)                                  |
| Batch troubleshooting         | Validation, file-size limits, stuck jobs, partial failures, and quota errors        | [Batch Troubleshooting](/parasail-docs/products/quickstart/troubleshooting)             |
| Retries & idempotency         | Handling 429 responses, exponential backoff, timeouts, and safe retries             | [Retries and Idempotency](/parasail-docs/operate-in-production/retries-and-idempotency) |

## Monitoring & cost controls

A few controls are available today to keep workloads observable and costs predictable:

* **Dedicated status page.** Each Dedicated Instance has a status page where you can view metrics and control the deployment (pause, edit, destroy). See [Dedicated Instances](/parasail-docs/products/overview-1#dedicated-status-page).
* **Pause to stop billing.** Dedicated Instances bill only while active. Pause an endpoint to stop billing when it isn't needed, and set a **Scale Down Policy** so an idle endpoint shuts down automatically. See [Auto-Scaling](/parasail-docs/products/overview-1/auto-scaling).
* **Quota as a guardrail.** The default 4-GPU quota caps how much capacity a single workload can consume, which also caps spend. See [Quota Management](/parasail-docs/billing/quota-management).
* **Batch status.** Track queued and running batches and download outputs from the Batch UI, or poll a batch's status from any process. See [Batch Troubleshooting](/parasail-docs/products/quickstart/troubleshooting).

TODO: confirm whether Parasail exposes account-wide usage or metrics dashboards (beyond the per-deployment status page) and document where to find them.

TODO: confirm whether a public status page for service-wide incidents exists and link it here.

TODO: confirm whether usage-based alerts or billing thresholds can be configured and document how.

## Next steps

* [Rate Limits and Limitations](/parasail-docs/billing/rate-limits)—understand throttling before you scale
* [Retries and Idempotency](/parasail-docs/operate-in-production/retries-and-idempotency)—handle 429 responses and transient errors safely
* [Auto-Scaling](/parasail-docs/products/overview-1/auto-scaling)—size replicas for production traffic
* [Quota Management](/parasail-docs/billing/quota-management)—request more GPU capacity


# Retries and Idempotency

Handle 429 responses and transient errors with exponential backoff, sensible timeouts, and safe retries against the Parasail API.

Production clients should expect occasional transient errors and retry them safely. This page covers when Parasail returns `429 Too Many Requests`, how to back off, how to set timeouts, and what you can and can't assume about retry idempotency.

## When you'll see `429 Too Many Requests`

A `429` means a request was throttled and should be retried later. There are two common causes on Parasail:

* **Account rate limits.** Every account has a requests-per-minute (RPM) limit based on its product tier. Exceeding it returns `429`. See [Rate Limits and Limitations](/parasail-docs/billing/rate-limits) for the per-tier limits and how to raise them.
* **Dedicated max concurrent connections.** A Dedicated Instance returns **429 Too Many Requests** when concurrent requests reach the **Max Concurrent Connections per Replica** limit, until space frees up. This protects quality of service and works in tandem with autoscaling. See [Dedicated Instances](/parasail-docs/products/overview-1) and [Auto-Scaling](/parasail-docs/products/overview-1/auto-scaling).

In both cases the right response is the same: wait, then retry with backoff.

## Exponential backoff

Retry transient errors (`429`, and `5xx` server errors) with exponential backoff and jitter, rather than retrying immediately in a tight loop. Cap the number of attempts so a request can't retry forever.

The OpenAI Python SDK already retries certain errors automatically; you can configure `max_retries` on the client and still wrap calls for full control:

```python
import random
import time

from openai import OpenAI, APIStatusError, APITimeoutError, APIConnectionError

client = OpenAI(
    base_url="https://api.parasail.io/v1",
    api_key="<PARASAIL_API_KEY>",
    timeout=30,        # seconds; fail fast instead of hanging
    max_retries=0,     # disable built-in retries so we control backoff here
)

def chat_with_backoff(messages, model, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except (APITimeoutError, APIConnectionError):
            # Network-level failures are retryable; re-raise once attempts run out.
            if attempt == max_attempts - 1:
                raise
        except APIStatusError as e:
            # Retry on throttling and server errors; fail fast on 4xx like 400/401/404.
            retryable = e.status_code == 429 or e.status_code >= 500
            if not retryable or attempt == max_attempts - 1:
                raise
        # Exponential backoff with jitter: a base delay of 1s, 2s, 4s, ... (capped at 30s),
        # scaled by a random factor between 0.5 and 1.0.
        sleep = min(2 ** attempt, 30) * (0.5 + random.random() / 2)
        time.sleep(sleep)

response = chat_with_backoff(
    messages=[{"role": "user", "content": "Hello"}],
    model="<your-model>",
)
print(response.choices[0].message.content)
```

## Setting timeouts

Always set a client-side timeout so a stalled connection fails fast and can be retried, instead of hanging indefinitely. The example above sets `timeout=30` on the client; you can also override it per request:

```python
client.with_options(timeout=60).chat.completions.create(...)
```

Choose a timeout that comfortably exceeds your model's expected latency. Long generations and large batch-style requests need more headroom than short completions.

## Safe retries and idempotency

Retrying is only safe if a duplicated request can't cause unintended side effects. For Parasail's inference endpoints (chat completions, embeddings), a retried request simply produces another completion—there is no state to corrupt—so retrying a failed or timed-out call is generally safe. The main caveat is that you may be **billed for both** a request that actually succeeded server-side and its retry, if the original response was lost in transit.

Parasail exposes an **OpenAI-compatible API**, and this documentation does not describe an idempotency-key mechanism. Treat retries as **not guaranteed idempotent**: a retry can result in a second execution rather than returning the original response. To limit duplicate work:

* Keep `max_attempts` small and only retry genuinely retryable errors (`429`, `5xx`, timeouts).
* For **batch** workloads, prefer the per-request `custom_id` to match outputs to inputs and resubmit only the requests that failed, rather than resubmitting whole batches. See [Batch Troubleshooting](/parasail-docs/products/quickstart/troubleshooting).
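For the batch case, selecting only the failed requests can be sketched like this. The function name is hypothetical, and the output record shape (`custom_id`, `response.status_code`, `error`) is assumed from the OpenAI batch output format rather than confirmed on this page, so check it against your actual output file before relying on it.

```python
import json


def requests_to_resubmit(input_lines, output_lines):
    """Return the input requests that have no successful result in the output."""
    succeeded = set()
    for line in output_lines:
        record = json.loads(line)
        response = record.get("response") or {}
        # Assumed shape: a success has no error and an HTTP 200 in response.status_code.
        if not record.get("error") and response.get("status_code") == 200:
            succeeded.add(record["custom_id"])
    requests = [json.loads(line) for line in input_lines]
    return [req for req in requests if req["custom_id"] not in succeeded]
```

Requests that are missing from the output entirely are also returned, since they have no successful result to match against.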

TODO: confirm idempotency-key support (for example, an `Idempotency-Key` header) on the Parasail API and document it here if available.

## Next steps

* [Rate Limits and Limitations](/parasail-docs/billing/rate-limits)—per-tier RPM limits and increases
* [Operate in Production](/parasail-docs/operate-in-production/overview)—the full operations hub
* [Dedicated Instances](/parasail-docs/products/overview-1)—max concurrent connections and 429 behavior
* [Batch Troubleshooting](/parasail-docs/products/quickstart/troubleshooting)—handling per-request failures


# Security Overview

How Parasail handles your data across products—retention, training use, encryption, and access controls.

This page summarizes how Parasail handles your data across its products so you can evaluate Parasail for production and enterprise use. The matrix below is a quick reference; follow the links in [Detail pages](#detail-pages) for the authoritative policies.

For the full legal text, see the [Privacy Policy](https://www.parasail.io/legal/privacy-policy) and the [Trust Center](https://trust.parasail.io/).

## Per-product data matrix

| Product             | Input/output retention                                                                 | Metadata retention                      | Training use                                     | Encryption                                      | Access controls               |
| ------------------- | -------------------------------------------------------------------------------------- | --------------------------------------- | ------------------------------------------------ | ----------------------------------------------- | ----------------------------- |
| Serverless          | Not stored or logged; retained only for the time needed to generate and deliver Output | TODO: confirm metadata retention period | Not used to train models or improve the Platform | TODO: confirm encryption (in transit / at rest) | TODO: confirm access controls |
| Dedicated Instances | Not stored or logged; retained only for the time needed to generate and deliver Output | TODO: confirm metadata retention period | Not used to train models or improve the Platform | TODO: confirm encryption (in transit / at rest) | TODO: confirm access controls |
| Batch               | Generally stored and logged for 30 days, though the specific time frame may vary       | TODO: confirm metadata retention period | Not used to train models or improve the Platform | TODO: confirm encryption (in transit / at rest) | TODO: confirm access controls |
| Image Generation    | TODO: confirm input/output retention                                                   | TODO: confirm metadata retention period | Not used to train models or improve the Platform | TODO: confirm encryption (in transit / at rest) | TODO: confirm access controls |

Notes:

* Parasail may collect certain metadata (for example, usage statistics, API token usage, and LLM token/time usage) to improve the Platform. See [Data Privacy and Retention](/parasail-docs/security-and-account-management/data-privacy-retention) for details.
* The "Training use" column reflects Parasail's platform-wide commitment not to use your Input or Output—or any personal data contained in them—to train its models or improve the Platform or Service.

## Detail pages

* [Data Privacy and Retention](/parasail-docs/security-and-account-management/data-privacy-retention)—what data is retained, for how long, and how it is used
* [Compliance](/parasail-docs/security-and-account-management/compliance)—certifications, trust center, and terms of service
* [Account Management](/parasail-docs/security-and-account-management/account-management)—organizations, members, and account structure
* [Read-Only API Keys](/parasail-docs/security-and-account-management/read-only-api-keys)—scoped, GET-only keys for monitoring and integrations

## Next steps

* [Data Privacy and Retention](/parasail-docs/security-and-account-management/data-privacy-retention)—the authoritative retention policy
* [Compliance](/parasail-docs/security-and-account-management/compliance)—check certification status in the trust center
* [Read-Only API Keys](/parasail-docs/security-and-account-management/read-only-api-keys)—grant least-privilege access
* [Account Management](/parasail-docs/security-and-account-management/account-management)—manage your organization and members


# Data Privacy and Retention

You can read our full Privacy Policy here: <https://www.parasail.io/legal/privacy-policy>

## The part you most likely care about

We do not store or log any personal data you send as Input to the “serverless” and “dedicated” versions of the Service, and we will not inspect such data without your permission, and will only retain such data for as long as is necessary to provide the Service to you (that is, for the time it takes to generate Output and deliver that Output to you).

We generally store and log any personal data you send as Input to the “batch” version of the Service for 30 days, though the specific time frame may vary.

**Metadata.** We may collect certain metadata (essentially, data about your data) that could be used to identify you. This may include things like statistics related to your usage of the Platform/Service (for example, length of engagement, location, and browser type), your usage of the API token that you use to access the Platform/Service, and your large language model (LLM) token usage and time usage (for example, what models you choose to leverage, and how many tokens you consume). We use this information to improve the Platform and the Service.

**We Do Not Use Personal Data Contained in Your Input or Output to Train Our Models, or to Improve the Platform or Service.**

We treat personal data that is contained within Input and Output differently than in the rest of this Policy.

For the purposes of this Policy, “Input” means any data, images, code, or other content (including, without limitation, text, graphics, audio files, video files, or computer software) that you either: (i) publish, upload to, or use in conjunction with the Platform; (ii) make available in conjunction with the Platform, or (iii) allow the Platform to access; and “Output” means content and/or data generated by the Platform, and/or one or more models, in response to a query.

We will not use your Input, or your Output, or any personal data contained in that Input or Output, to train our own models or to improve the Platform or the Service.

## Next steps

* [Security Overview](/parasail-docs/security-and-account-management/overview)—per-product data handling at a glance
* [Compliance](/parasail-docs/security-and-account-management/compliance)—certifications and trust center
* [Read-Only API Keys](/parasail-docs/security-and-account-management/read-only-api-keys)—scoped, GET-only access keys
* [Account Management](/parasail-docs/security-and-account-management/account-management)—organizations and members


# Compliance

Check out our compliance checks: <https://trust.parasail.io/>

We are currently working toward SOC 2 compliance. You can find the latest status in our trust center: <https://trust.parasail.io/>

You can also read our Terms of Service: <https://www.parasail.io/legal/terms>

## Next steps

* [Security Overview](/parasail-docs/security-and-account-management/overview)—per-product data handling at a glance
* [Data Privacy and Retention](/parasail-docs/security-and-account-management/data-privacy-retention)—retention policy and training use
* [Account Management](/parasail-docs/security-and-account-management/account-management)—organizations and members
* [Read-Only API Keys](/parasail-docs/security-and-account-management/read-only-api-keys)—scoped, GET-only access keys


# Account Management

How accounts and organizations work in Parasail—signing up, inviting members, and switching organizations.

When you sign up, you create an account using either your email or SSO (Google Auth), and you are prompted for an organization name.

Your account is part of an Organization. The Organization is what is used for:

* Issuing API keys
* Billing
* Inviting members to your org

<figure><img src="/files/WFN29bvY3B1mskQieZe9" alt="" width="278"><figcaption></figcaption></figure>

This is how Organizations and Accounts are related:

<figure><img src="/files/912zP6igmbYvL5T5JrHm" alt="" width="375"><figcaption></figcaption></figure>

When you first join, you will be the only member of your own organization until you invite someone to it or join another organization.

### Invite Someone to Your Organization

To invite someone, generate a link to your organization by selecting "Invite Member" in the drop-down menu. When you give someone that link, they are added to your organization if they already have an account. If they do not have an account, they are asked to create one.

<figure><img src="/files/1vAv7PnPQjfPJ98VF9zy" alt="" width="375"><figcaption></figcaption></figure>

### Switching Organizations

To change which organization you are working in, and therefore which organization is billed (see [Billing and Payments](/parasail-docs/billing/billing-and-payments)), switch your organization:

<figure><img src="/files/M5I3w8hyaCzzRt1nZYqc" alt="" width="313"><figcaption></figcaption></figure>

## Next steps

* [Security Overview](/parasail-docs/security-and-account-management/overview)—per-product data handling at a glance
* [Read-Only API Keys](/parasail-docs/security-and-account-management/read-only-api-keys)—scoped, GET-only access keys
* [Billing and Payments](/parasail-docs/billing/billing-and-payments)—how billing maps to your organization
* [Data Privacy and Retention](/parasail-docs/security-and-account-management/data-privacy-retention)—retention policy and training use


# Read-Only API Keys

Scoped, GET-only API keys for monitoring, reporting, and integrations—what they can and cannot do, and how to create one.

Read-only keys provide a safe way to grant access to your Parasail account for monitoring, reporting, and integration use cases.

## What read-only keys can do

A read-only key can call the following Parasail management API endpoints. Full-access keys can call all of these too—read-only keys are restricted to *just* this list, with `GET` only.

### Model catalog

View available models, supported hardware, and capability/performance information.

| Method | Endpoint                        | Description                                                |
| ------ | ------------------------------- | ---------------------------------------------------------- |
| `GET`  | `/api/v1/openrouter/models`     | List models available via OpenRouter                       |
| `GET`  | `/api/v1/dedicated/devices`     | List supported hardware (GPU types, etc.)                  |
| `GET`  | `/api/v1/dedicated/support`     | Check whether a model is supported on dedicated hardware   |
| `GET`  | `/api/v1/dedicated/performance` | Get performance estimates for a model/hardware combination |

### Deployment status

Read information about your dedicated deployments.

| Method | Endpoint                                    | Description                          |
| ------ | ------------------------------------------- | ------------------------------------ |
| `GET`  | `/api/v1/dedicated/deployments`             | List your deployments                |
| `GET`  | `/api/v1/dedicated/deployments/{id}`        | Fetch a specific deployment          |
| `GET`  | `/api/v1/dedicated/deployments/{id}/models` | List models on a specific deployment |

### Billing

Read invoices for your account.

| Method | Endpoint                               | Description                             |
| ------ | -------------------------------------- | --------------------------------------- |
| `GET`  | `/api/v1/billing/invoices`             | List invoices                           |
| `GET`  | `/api/v1/billing/invoices/current`     | Fetch the current (in-progress) invoice |
| `GET`  | `/api/v1/billing/invoices/{invoiceId}` | Fetch a specific invoice                |

## What read-only keys *cannot* do

Read-only keys are blocked from:

* **Inference**—any call to `/openai/*` or other inference endpoints. Use a full-access key for inference traffic.
* **Mutating any resource**—creating, updating, or deleting deployments; rotating keys; changing account settings; etc.
* **Any management endpoint not in the list above.**
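
As a rough illustration, the allow-list above can be mirrored in a small client-side pre-check, so a monitoring script fails fast instead of waiting for a `403`. This is a sketch, not part of any Parasail SDK: the function name `read_only_key_allows` is hypothetical, and the endpoint list is copied from the tables on this page, so it would need updating if those tables change.

```python
import re

# GET-only management endpoints documented on this page for read-only keys.
READ_ONLY_ENDPOINTS = [
    "/api/v1/openrouter/models",
    "/api/v1/dedicated/devices",
    "/api/v1/dedicated/support",
    "/api/v1/dedicated/performance",
    "/api/v1/dedicated/deployments",
    "/api/v1/dedicated/deployments/{id}",
    "/api/v1/dedicated/deployments/{id}/models",
    "/api/v1/billing/invoices",
    "/api/v1/billing/invoices/current",
    "/api/v1/billing/invoices/{invoiceId}",
]


def _template_to_regex(template):
    # Treat each "{name}" placeholder as exactly one non-empty path segment.
    literal_parts = re.split(r"\{[^}]+\}", template)
    return re.compile("[^/]+".join(re.escape(part) for part in literal_parts))


_PATTERNS = [_template_to_regex(t) for t in READ_ONLY_ENDPOINTS]


def read_only_key_allows(method, path):
    """Return True if a read-only key should be able to make this call."""
    if method.upper() != "GET":
        return False  # read-only keys are GET-only
    path = path.split("?", 1)[0]  # ignore any query string
    return any(pattern.fullmatch(path) for pattern in _PATTERNS)


if __name__ == "__main__":
    print(read_only_key_allows("GET", "/api/v1/dedicated/deployments"))
    print(read_only_key_allows("POST", "/api/v1/dedicated/deployments"))
```

The server remains the source of truth. A check like this only saves a round trip for calls that are clearly outside the documented list.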

## Creating a read-only key

1. Open the **API Keys** page in the Parasail web portal.
2. Click **Create API Key**.
3. Under **Access Level**, select **Read-only**.
4. Give the key a name and click **Create**.
5. Copy the key value—it's shown only once.

Read-only keys appear in the API Keys list with a **Read-only** badge, so you can tell them apart from full-access keys at a glance.

## Example

A read-only key **can** list your dedicated deployments:

```bash
curl -H "Authorization: Bearer psk-<accessKey>-<secretKey>" \
  https://api.parasail.io/api/v1/dedicated/deployments
```

The same key **cannot** create one. A `POST` to the same endpoint returns `403 Forbidden`:

```bash
curl -X POST \
  -H "Authorization: Bearer psk-<accessKey>-<secretKey>" \
  -H "Content-Type: application/json" \
  https://api.parasail.io/api/v1/dedicated/deployments
```

Response:

```json
{
  "type": "about:blank",
  "title": "Forbidden",
  "status": 403,
  "detail": "This API key is read-only",
  "instance": "/api/v1/dedicated/deployments"
}
```
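
If you call the API from a script, it can help to recognize this specific rejection and report it clearly. The sketch below checks a status code and response body against the example response shown above. The helper name `is_read_only_rejection` is hypothetical, and matching on the `detail` text is an assumption: the wording of that field could change, so treat the status code as the primary signal.

```python
import json


def is_read_only_rejection(status_code, body):
    """Return True if the response looks like the read-only 403 shown above."""
    if status_code != 403:
        return False
    try:
        problem = json.loads(body)
    except json.JSONDecodeError:
        return False  # not a JSON error body
    if not isinstance(problem, dict):
        return False
    # Assumption: the "detail" field mentions "read-only", as in the example.
    return "read-only" in str(problem.get("detail", "")).lower()


if __name__ == "__main__":
    sample = json.dumps({
        "type": "about:blank",
        "title": "Forbidden",
        "status": 403,
        "detail": "This API key is read-only",
        "instance": "/api/v1/dedicated/deployments",
    })
    if is_read_only_rejection(403, sample):
        print("This key is read-only; use a full-access key for this call.")
```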

## Next steps

* [Security Overview](/parasail-docs/security-and-account-management/overview)—per-product data handling at a glance
* [Account Management](/parasail-docs/security-and-account-management/account-management)—organizations and members
* [Data Privacy and Retention](/parasail-docs/security-and-account-management/data-privacy-retention)—retention policy and training use
* [Compliance](/parasail-docs/security-and-account-management/compliance)—certifications and trust center


# SillyTavern Guide

This is a quick guide for SillyTavern.

## Install SillyTavern

[How to install SillyTavern](https://sillytavernai.com/how-to-install-sillytavern/).

Once it is installed, start SillyTavern from a terminal (this guide assumes macOS or Linux):

![](/files/lOKRKmpT6xsF5kbkl21a)

Choose option 1. This launches SillyTavern on localhost (<http://127.0.0.1:8000/>).

You should see a page like this:

![](/files/1V3nFE9vp3bZtVAH4RpK)

Next, set up the connection between Parasail and SillyTavern.

## Connecting to the Parasail Chat Completions API

Set up the connection in the Connection Profile at the top:

<figure><img src="/files/kA8FM1NK6LDXo6JAezlV" alt=""><figcaption></figcaption></figure>

Take a look at the available Serverless models at <https://saas.parasail.io>, or use the API endpoint `https://api.parasail.io/v1/models`.

<figure><img src="/files/zeEY19aYIqKTUogoalfY" alt=""><figcaption></figcaption></figure>

When you click a model, the right-hand side shows the API key, endpoint, and model name:

<figure><img src="/files/Z0zNtUdZU0H2YbOWLYGJ" alt=""><figcaption></figcaption></figure>

Back to SillyTavern. The parameters you need to select are:

* API: Chat Completion
* Chat Completion Source: Custom (OpenAI-compatible)
* Custom Endpoint (Base URL): <https://api.parasail.io/v1>

Then, choose your preferred model.

Lastly, click the connect button.

<figure><img src="/files/v0BM9SRgrQjn64kNGe20" alt=""><figcaption></figcaption></figure>

Once you set the connection you will be able to choose the available models:

<figure><img src="/files/vreoAiZW6HckP3DCK8be" alt=""><figcaption></figcaption></figure>

Then you will be able to officially connect:

<figure><img src="/files/EHzNrDshZAkBOFFzRKpp" alt=""><figcaption></figcaption></figure>

Then you can start to run the model in the text interface:

<figure><img src="/files/sNwP4D078RJ5Q8XYFLtQ" alt=""><figcaption></figcaption></figure>

You can load personalities as well:

<figure><img src="/files/cZaDgIMHXvsxLvaZxWJ7" alt=""><figcaption></figcaption></figure>

Once you have loaded a personality, have fun interacting with it:

<figure><img src="/files/ou1kekQxMfTa6LpHRVUT" alt=""><figcaption></figcaption></figure>

If you want Text Completions instead of Chat Completions, set the API accordingly and enable Bypass status check:

<figure><img src="/files/WFNIWCRyfq5X5MVekcCy" alt=""><figcaption></figcaption></figure>

<figure><img src="/files/NRwQdmKPJN5YBkwRdWth" alt=""><figcaption></figcaption></figure>

This should help if you want Text Completions rather than Chat Completions. There are some minor differences in how the API is set up.

Please let us know if you have any questions. You can also share feedback at: <https://feedback.parasail.io/feature-requests>


# Third-Party Providers

Use Parasail through popular third-party aggregators and gateways such as OpenRouter, Requesty, BiFrost, and LiteLLM.

Parasail is available through many major third-party LLM aggregators and gateways. If you already use one of these tools, you can route requests to Parasail without changing your application code. This page lists the integrations we are part of, along with setup notes and known limitations for each.

For reference, Parasail's own OpenAI-compatible base URL is `https://api.parasail.io/v1`. Most gateways either proxy to this URL or have a built-in Parasail provider.

## Aggregators

### OpenRouter

* **What it is:** A model router that exposes many providers, including Parasail, behind a single OpenAI-compatible API. Provider page: <https://openrouter.ai/provider/parasail>
* **Base URL:** TODO: (OpenRouter base URL)
* **Model-name format:** TODO: (for example, `parasail/<model>`)
* **Setup steps:** TODO: steps to select Parasail as the upstream provider in OpenRouter.
* **Known limitations:** TODO:

### NanoGPT

* **What it is:** An aggregator offering pay-per-prompt access to many models, including Parasail. Site: <https://nano-gpt.com/>
* **Base URL:** TODO: (NanoGPT base URL)
* **Model-name format:** TODO:
* **Setup steps:** TODO:
* **Known limitations:** TODO:

### Requesty

* **What it is:** An LLM routing platform that lists Parasail models. Models page: <https://www.requesty.ai/solution/llm-routing/models>
* **Base URL:** TODO: (Requesty base URL)
* **Model-name format:** TODO:
* **Setup steps:** TODO:
* **Known limitations:** TODO:

## Gateways

### BiFrost

* **What it is:** TODO: one-line description of BiFrost.
* **Base URL:** TODO: (Parasail's base URL is `https://api.parasail.io/v1`)
* **Model-name format:** TODO:
* **Setup steps:**
  1. TODO: add Parasail as a provider in BiFrost configuration.
  2. TODO: set the API key (`PARASAIL_API_KEY`).
  3. TODO: reference a Parasail model name.
* **Known limitations:** TODO:

### LiteLLM

* **What it is:** TODO: one-line description of LiteLLM (open-source LLM gateway / proxy).
* **Base URL:** TODO: (Parasail's base URL is `https://api.parasail.io/v1`)
* **Model-name format:** TODO: (for example, `parasail/<model>` or via custom `api_base`)
* **Setup steps:**
  1. TODO: configure Parasail as a provider / model entry in LiteLLM.
  2. TODO: set `api_base` to `https://api.parasail.io/v1` and provide the API key.
  3. TODO: call the model through the LiteLLM SDK or proxy.
* **Known limitations:** TODO:

## Next steps

* [Serverless overview](/parasail-docs/products/overview)—get an API key and find available models
* [Run and evaluate any model](/parasail-docs/guides/run-and-evaluate-any-model)—pick the right model before wiring up a gateway
* TODO: link to any Parasail-authored integration guides as they are published.


# For AI Coding Agents

Quick orientation for AI coding agents building on Parasail with the OpenAI SDK.

If you are an AI coding agent building on behalf of a user, start here.

1. Read [`llms.txt`](https://github.com/parasail-ai/gitbook-doc/blob/main/llms.txt) for a flat index of every documentation page.
2. Use the OpenAI SDK with base URL `https://api.parasail.io/v1`.
3. Read the API key from the `PARASAIL_API_KEY` environment variable.
4. All Parasail endpoints are OpenAI-compatible—no special SDK is needed.
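
The steps above can be sketched as a small helper that gathers the base URL and API key into the keyword arguments the OpenAI SDK client accepts. The helper name `parasail_client_kwargs` is hypothetical, and the commented usage assumes the `openai` Python package is installed. The model name there is a placeholder, not a real identifier.

```python
import os

PARASAIL_BASE_URL = "https://api.parasail.io/v1"


def parasail_client_kwargs(env=None):
    """Build keyword arguments for an OpenAI SDK client pointed at Parasail."""
    env = os.environ if env is None else env
    api_key = env.get("PARASAIL_API_KEY")
    if not api_key:
        raise RuntimeError("Set the PARASAIL_API_KEY environment variable first.")
    return {"base_url": PARASAIL_BASE_URL, "api_key": api_key}


# Usage, assuming the `openai` package is installed:
#
#   from openai import OpenAI
#   client = OpenAI(**parasail_client_kwargs())
#   reply = client.chat.completions.create(
#       model="<model-name>",  # placeholder: pick a name from /v1/models
#       messages=[{"role": "user", "content": "Hello"}],
#   )
#   print(reply.choices[0].message.content)
```

Failing early when the variable is missing gives a clearer error than an authentication failure on the first request.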

## Per-product notes

* **Serverless**—install the OpenAI SDK, set `PARASAIL_API_KEY`, and point the client at `https://api.parasail.io/v1`. All serverless endpoints are OpenAI-compatible. See [Serverless Quickstart](/parasail-docs/quickstart/serverless).
* **Dedicated Instances**—deployments are created through the Parasail web UI (or the [Management API](/parasail-docs/products/overview-1/management-api)) by specifying a Hugging Face ID. Once deployed, the endpoint serves an OpenAI-compatible API at `https://api.parasail.io/v1`, using the deployment name as the model. See [Dedicated Quickstart](/parasail-docs/quickstart/dedicated).
* **Batch**—install `openai-batch`, set `PARASAIL_API_KEY`, and use the `Batch` class to create, submit, and download results. The batch API is OpenAI-compatible. See [Batch Quickstart](/parasail-docs/quickstart/batch).

## Next steps

* [Authentication](/parasail-docs/api-reference/authentication)—base URL and API key details
* [Chat Completions API](/parasail-docs/api-reference/chat-completions)—the core OpenAI-compatible endpoint
* [`llms.txt`](https://github.com/parasail-ai/gitbook-doc/blob/main/llms.txt)—flat index of all pages


# Glossary

Standard terminology used across Parasail documentation.

Preferred terms and what they mean across Parasail's products and docs. Use these spellings and capitalizations consistently.

| Term                     | Definition                                                                                                                                                                                    |
| ------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Serverless**           | On-demand, pay-per-token inference for popular open-source models, with no setup. See [Serverless](/parasail-docs/products/overview).                                                         |
| **Dedicated Instances**  | Private GPU endpoints with full control over model, hardware, and scaling. (Not "Dedicated Endpoints" or "dedicated service.") See [Dedicated Instances](/parasail-docs/products/overview-1). |
| **Dedicated Serverless** | Reserved serverless capacity tier. See [Dedicated Serverless](/parasail-docs/products/capacity).                                                                                              |
| **Batch**                | Asynchronous, OpenAI-compatible batch processing at 50% off serverless pricing. See [Batch](/parasail-docs/products/quickstart).                                                              |
| **Hugging Face**         | The model hub Parasail loads models from. Spell as two words in prose (the domain `huggingface.co` and code identifiers keep their own form).                                                 |
| **GPU**                  | Graphics processing unit. Use the abbreviation "GPU" in prose rather than spelling it out.                                                                                                    |
| **QoS**                  | Quality of service. Always capitalize as "QoS."                                                                                                                                               |
| **API key**              | The credential used to authenticate requests, passed as `Authorization: Bearer <PARASAIL_API_KEY>`.                                                                                           |
| **Organization**         | The account-level entity that owns API keys, quota, and billing.                                                                                                                              |
| **Replica**              | One running copy of a dedicated deployment. Multiple replicas are load-balanced.                                                                                                              |

## Next steps

* [Authentication](/parasail-docs/api-reference/authentication)—base URL and API keys
* [Pricing](/parasail-docs/billing/pricing)—serverless, dedicated, and batch pricing
* [Welcome](/parasail-docs)—product overview and where to start

| Parameter              | Type    | Required | Description |
| ---------------------- | ------- | -------- | ----------- |
| device                 | string  | Yes      | Device ID name. |
| count                  | integer | Yes      | Number of GPUs. The model is parallelized across the GPUs using the most suitable approach (tensor parallelism, expert parallelism, and/or sequence parallelism). For data parallelism, use multiple replicas instead. |
| displayName            | string  | No       | How the device is displayed in the UI. |
| cost                   | float   | No       | Cost per hour of the hardware configuration. This accounts for multiple GPUs. |
| estimatedSingleUserTps | float   | No       | Estimated output TPS per user. The estimation model is currently out of date, so relying on this number is not recommended. |
| estimatedSystemTps     | float   | No       | Estimated output TPS for the whole system. The estimation model is currently out of date, so relying on this number is not recommended. |
| recommended            | boolean | No       | If true, this configuration hits a sweet spot of performance and cost and has enough memory to support full contexts. |
| limitedContext         | boolean | No       | If true, the full context of the model may cause memory overflow issues. It is recommended to manually reduce the context length using the contextLength parameter. |
| available              | boolean | No       | If true, the hardware is available, meaning there is capacity and the account quota allows for the device count. This field currently only reflects a single replica. |
| selected               | boolean | Yes      | Set to false when retrieved from the compatible device query. Setting this to true and passing the entire hardware structure to the create deployment endpoint makes it a selectable option for the scheduler. |

| Parameter             | Type            | Required | Description |
| --------------------- | --------------- | -------- | ----------- |
| deploymentName        | string          | Yes      | Unique name for your deployment. This can only contain lowercase letters, numbers, and dashes. |
| modelName             | string          | Yes      | The Hugging Face identifier of the model you want to deploy, in `org/model-name` form. For example: `shuyuej/Llama-3.2-1B-Instruct-GPTQ` |
| deviceConfigs         | list of structs | Yes      | A list of deviceConfig structs (see previous section). Any struct with selected = true will be considered for scheduling. |
| replicas              | integer         | Yes      | Number of replicas to run in parallel. |
| scaleDownPolicy       | string          | No       | Auto-scaling policy: `NONE` or `TIMER`. |
| scaleDownThreshold    | integer         | No       | Time threshold in ms before scaling down. |
| draftModelName        | string          | No       | (Advanced) Points to a draft model version. |
| engine                | string          | No       | Inference engine. Defaults to `VLLM`. |
| engineTask            | string          | No       | Engine task mode. Defaults to `AUTO`. |
| contextLengthOverride | integer         | No       | (Advanced) User override for context window length. |
| modelAccessKey        | string          | No       | (For private models) Hugging Face token. |
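
To make the relationship between the two parameter lists above concrete, here is a minimal sketch that assembles a create-deployment request body. It only builds and validates the payload and does not send it. Several things here are assumptions: the helper name `build_deployment_payload` is hypothetical, the device IDs are placeholders (real ones come from the compatible device query), and the flat top-level JSON layout is inferred from the parameter names rather than confirmed.

```python
import json
import re

# deploymentName may only contain lowercase letters, numbers, and dashes.
_NAME_RE = re.compile(r"[a-z0-9-]+")


def build_deployment_payload(deployment_name, model_name, device_configs,
                             chosen_device, replicas=1):
    """Assemble a request body, marking one device config as selected."""
    if not _NAME_RE.fullmatch(deployment_name):
        raise ValueError("deploymentName must use lowercase letters, numbers, and dashes")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # Copy each config and set selected=True only on the chosen device,
    # leaving the caller's original structures untouched.
    configs = [dict(cfg, selected=(cfg["device"] == chosen_device))
               for cfg in device_configs]
    if not any(cfg["selected"] for cfg in configs):
        raise ValueError(f"no device config named {chosen_device!r}")
    return {
        "deploymentName": deployment_name,
        "modelName": model_name,
        "deviceConfigs": configs,
        "replicas": replicas,
    }


if __name__ == "__main__":
    # Placeholder device configs standing in for a compatible device query result.
    devices = [
        {"device": "example-gpu-a", "count": 1, "selected": False},
        {"device": "example-gpu-b", "count": 2, "selected": False},
    ]
    payload = build_deployment_payload(
        "my-first-deployment", "org/model-name", devices, "example-gpu-a")
    print(json.dumps(payload, indent=2))
```

Passing the whole device structure back with only `selected` changed follows the description of the `selected` field. Optional fields such as `scaleDownPolicy` are left out so that server defaults apply.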

* Model is supported
* Model cannot be found indicates the model is not supported

| Parameter Count        | Size         | Serverless Price | Batch Price FP8 | Batch Price FP16 | Cache Price FP8 | Cache Price FP16 |
| ---------------------- | ------------ | ---------------- | --------------- | ---------------- | --------------- | ---------------- |
| 0-4 B                  | 0-4 B        | $0.05            | $0.025          | $0.033           | $0.013          | $0.016           |
| 4.1-8 B                | 4.1-8 B      | $0.08            | $0.040          | $0.052           | $0.020          | $0.026           |
| LLM_Model_8.1-16 B     | 8.1-16 B     | $0.11            | $0.055          | $0.072           | $0.028          | $0.036           |
| LLM_Model_16.1 B-21 B  | 16.1 B-21 B  | $0.45            | $0.225          | $0.293           | $0.113          | $0.146           |
| LLM_Model_21.1 B-41 B  | 21.1 B-41 B  | $0.50            | $0.250          | $0.325           | $0.125          | $0.163           |
| LLM_Model_41.1 B-80 B  | 41.1 B-80 B  | $0.70            | $0.350          | $0.455           | $0.175          | $0.228           |
| LLM_Model_80.1 B-404 B | 80.1 B-404 B | $0.80            | $0.400          | $0.520           | $0.200          | $0.260           |
| LLM_Model_405 B        | 405 B        | $1.75            | $0.875          | $1.138           | $0.438          | $0.569           |
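
As an illustration of how the size tiers above could be used, the sketch below looks up the price row for a model by its parameter count in billions. The function name `price_tier` is hypothetical. The prices are copied from the rows above as printed, with no unit assumed. Treating sizes that fall between two listed ranges (for example, 4.05 B) as belonging to the larger tier is an assumption, not documented behavior.

```python
from bisect import bisect_left

# Upper bound of each size tier, in billions of parameters.
_UPPER_BOUNDS_B = [4, 8, 16, 21, 41, 80, 404, 405]

# Prices in dollars, in the same order as the tiers above.
_COLUMNS = ("serverless", "batch_fp8", "batch_fp16", "cache_fp8", "cache_fp16")
_ROWS = [
    (0.05, 0.025, 0.033, 0.013, 0.016),
    (0.08, 0.040, 0.052, 0.020, 0.026),
    (0.11, 0.055, 0.072, 0.028, 0.036),
    (0.45, 0.225, 0.293, 0.113, 0.146),
    (0.50, 0.250, 0.325, 0.125, 0.163),
    (0.70, 0.350, 0.455, 0.175, 0.228),
    (0.80, 0.400, 0.520, 0.200, 0.260),
    (1.75, 0.875, 1.138, 0.438, 0.569),
]


def price_tier(params_billion):
    """Return the price row for a model of the given size (in billions)."""
    if params_billion <= 0 or params_billion > _UPPER_BOUNDS_B[-1]:
        raise ValueError("size is outside the listed tiers")
    # bisect_left finds the first tier whose upper bound is >= the size.
    index = bisect_left(_UPPER_BOUNDS_B, params_billion)
    return dict(zip(_COLUMNS, _ROWS[index]))


if __name__ == "__main__":
    print(price_tier(7))
    print(price_tier(70)["batch_fp8"])
```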