# Dedicated Endpoint Management API

## Parasail Dedicated Service API Documentation

### Overview

The Parasail Dedicated Service API allows you to deploy and manage dedicated endpoints via a REST-compatible API. You can perform all functions available in the UI for dedicated deployments, including create, pause, resume, scale up or down, and retrieve status information.

Because the API gives no visibility into GPU performance recommendations, and no insight into errors or HuggingFace model compatibility issues, we recommend that you set up and verify the model in the platform UI before controlling a deployment with the API.

If you did not create the dedicated endpoint with the API, you can find the **deployment\_id** from the Parasail SaaS platform URL. For example, if your deployment status page is `https://www.saas.parasail.io/dedicated/33665`, then the `deployment_id` is **33665**. You can then use this for all API commands below.
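
If you script against deployments that were created in the UI, the ID can be pulled out of the status page URL. The following is a minimal sketch, assuming the URL always ends in `/dedicated/<number>` as in the example above; `deployment_id_from_url` is a hypothetical helper name, not part of the Parasail API.

```python
from urllib.parse import urlparse


def deployment_id_from_url(url: str) -> int:
    """Extract the numeric deployment ID from a dedicated status page URL."""
    # Split the path into non-empty segments, for example ["dedicated", "33665"].
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) < 2 or segments[-2] != "dedicated" or not segments[-1].isdecimal():
        raise ValueError(f"Not a dedicated deployment URL: {url}")
    return int(segments[-1])


print(deployment_id_from_url("https://www.saas.parasail.io/dedicated/33665"))  # 33665
```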

### Base URLs

* **Control API** (for managing deployments):\
  `https://api.parasail.io/api/v1`
* **Chat API** (for inference requests via OpenAI-compatible endpoints):\
  `https://api.parasail.io/v1`

### Authentication

All requests to the Control API must include a Bearer token in the header. For example:

```python
import httpx

api_key = "<YOUR_PARASAIL_API_KEY>"
control_base_url = "https://api.parasail.io/api/v1"

headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json"
}

client = httpx.Client(base_url=control_base_url, headers=headers)
```

***

## Recommended Dedicated Model API Flow

Given the large set of combinations of model types and hardware instances, the recommended process for using the Dedicated API follows the UI flow and leverages our performance/memory models and hardware compatibility checkers to ensure smooth operation. The recommended flow is:

1. Check model compatibility with our `/dedicated/support` endpoint.
2. Retrieve a list of supported hardware for that model with our `/dedicated/devices` endpoint. This will include multi-GPU configurations that leverage parallelism.
3. Within that list, set `selected` to `true` for the desired configurations.
4. Submit (`POST`) a deployment payload (including your selected `deviceConfigs`) to `/dedicated/deployments`.
5. Use `/dedicated/deployments` endpoints to monitor, update, pause, resume, and delete deployments.

### 1. Checking Model Compatibility

The `/dedicated/support` endpoint can be used to query whether a model is supported. The `modelName` parameter can either be a HuggingFace ID like `Qwen/Qwen2.5-72B-Instruct` or a HuggingFace URL like <https://huggingface.co/Qwen/Qwen2.5-72B-Instruct>. Use the `engine` query parameter (for example `VLLM`) when needed.

```python
params = {
    "modelName": "Qwen/Qwen2.5-72B-Instruct",
    "engine": "VLLM",
    "modelAccessKey": ""  # HF token for privately hosted models
}

# Make the request
response = client.get("/dedicated/support", params=params)

if response.status_code == 200:
    data = response.json()
    if data.get("supported") is True:
        print("✅ Model is supported.")
    else:
        reason = data.get("errorMessage", "No reason provided.")
        print(f"❌ Model not supported: {reason}")
else:
    print(f"❌ Request failed with status code {response.status_code}: {response.text}")
```

This step is optional, but is a useful sanity check for API-based flows to ensure the IDs are entered correctly.
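
Since `modelName` may arrive as either an ID or a URL, it can help to normalize it once and reuse the same ID in later calls. The sketch below is one possible client-side approach, not something the API requires. It assumes the model lives under an `org/model` path on `huggingface.co`; `to_huggingface_id` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def to_huggingface_id(model_ref: str) -> str:
    """Return an `org/model` ID from either an ID or a huggingface.co URL."""
    ref = model_ref.strip()
    if ref.startswith(("http://", "https://")):
        parsed = urlparse(ref)
        if parsed.netloc not in ("huggingface.co", "www.huggingface.co"):
            raise ValueError(f"Not a HuggingFace URL: {model_ref}")
        # Keep only the first two path segments: organization and model name.
        parts = [p for p in parsed.path.split("/") if p][:2]
    else:
        parts = [p for p in ref.split("/") if p]
    if len(parts) != 2:
        raise ValueError(f"Expected 'org/model', got: {model_ref}")
    return "/".join(parts)


print(to_huggingface_id("https://huggingface.co/Qwen/Qwen2.5-72B-Instruct"))
```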

### 2. Retrieving Supported Hardware

The first step to creating a new dedicated deployment is to select a supported hardware configuration. The `/dedicated/devices` endpoint returns valid hardware configurations and their availability. These configurations can be either single GPUs or multiple GPUs with parallelism enabled (tensor, expert, or sequence). The `selected` flag is set to `false` when returned from `/dedicated/devices`. The recommended flow is to flip `selected` to `true` for your desired configurations, then pass that array to the create deployment endpoint (see next section).

The entire structure does not need to be passed to the create deployment endpoint. The **Required?** column in the table below marks the parameters that are needed, though we recommend ensuring compatibility by first calling `/dedicated/devices` and flipping `selected` to `true`.

`/dedicated/devices` returns an array of structures with the following parameters:

**DeviceConfigs structure**

<table><thead><tr><th width="230.8671875">Parameter</th><th width="105.34375">Type</th><th width="99.8984375">Required?</th><th>Description</th></tr></thead><tbody><tr><td>device</td><td><strong>string</strong></td><td>Yes</td><td>Device ID name.</td></tr><tr><td>count</td><td><strong>integer</strong></td><td>Yes</td><td>Number of GPUs. The model will be parallelized across the GPUs using the optimal approach (tensor parallelism, expert parallelism, and/or sequence parallelism). For data parallelism, multiple replicas should be used instead.</td></tr><tr><td>displayName</td><td><strong>string</strong></td><td>No</td><td>How the device is displayed in the UI.</td></tr><tr><td>cost</td><td><strong>float</strong></td><td>No</td><td>Cost per hour of the hardware configuration. This accounts for multiple GPUs.</td></tr><tr><td>estimatedSingleUserTps</td><td><strong>float</strong></td><td>No</td><td>Estimated output tokens per second (TPS) per user. The underlying performance model is currently out of date, so relying on this number is not recommended.</td></tr><tr><td>estimatedSystemTps</td><td><strong>float</strong></td><td>No</td><td>Estimated output TPS for the whole system. The underlying performance model is currently out of date, so relying on this number is not recommended.</td></tr><tr><td>recommended</td><td><strong>boolean</strong></td><td>No</td><td>If true, this configuration hits a sweet spot of performance and cost and has enough memory to support full contexts.</td></tr><tr><td>limitedContext</td><td><strong>boolean</strong></td><td>No</td><td>If true, the full context of the model may cause memory overflow issues. It is recommended to manually reduce the context length using the <em>contextLength</em> parameter.</td></tr><tr><td>available</td><td><strong>boolean</strong></td><td>No</td><td>If true, the hardware is available, meaning there is capacity and the account quota allows for the device count. This field currently only reflects a single replica.</td></tr><tr><td>selected</td><td><strong>boolean</strong></td><td>Yes</td><td>This will be set to false when retrieved from the compatible device query. Setting this to true and passing the entire hardware structure to the <em>create deployment</em> endpoint will make it a selectable option by the scheduler.</td></tr></tbody></table>

Here is example code to retrieve supported hardware, look for a desired configuration, and flip the selected flag to true when it is found:

```python
def get_and_select_device_config(client: httpx.Client, model_name: str, desired_device: str, desired_count: int) -> list[dict]:
    params = {
        "engine": "VLLM",
        "modelName": model_name,
        "modelAccessKey": ""
    }

    response = client.get("/dedicated/devices", params=params)
    if response.status_code != 200:
        raise RuntimeError(f"Request failed: {response.status_code} - {response.text}")

    devices = response.json()
    matched = False

    for d in devices:
        d["selected"] = d.get("device") == desired_device and d.get("count") == desired_count
        matched = matched or d["selected"]

    if not matched:
        raise ValueError(f"No matching device config for {desired_device} x{desired_count}")

    return devices

devices = get_and_select_device_config(
    client,
    model_name="Qwen/Qwen2.5-72B-Instruct",
    desired_device="H100SXM",
    desired_count=2
)

```
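
Instead of hard-coding a device and count, you may prefer to choose from the returned list using the `available`, `limitedContext`, and `cost` fields described above. The following is a minimal sketch of one such policy (cheapest available configuration), not a recommendation from Parasail. The helper name `select_cheapest_available`, the device name `HYPOTHETICAL_GPU`, and all cost values in the sample data are made up for illustration.

```python
def select_cheapest_available(devices: list[dict], require_full_context: bool = True) -> list[dict]:
    """Return a copy of `devices` with only the lowest-cost available entry selected."""
    candidates = [
        d for d in devices
        if d.get("available")
        and d.get("cost") is not None
        and not (require_full_context and d.get("limitedContext"))
    ]
    if not candidates:
        raise ValueError("No available device configuration matches the filters")
    best = min(candidates, key=lambda d: d["cost"])
    # Copy each entry so the caller's list is left unmodified.
    return [{**d, "selected": d is best} for d in devices]


# Hypothetical sample shaped like a /dedicated/devices response (values are invented).
sample = [
    {"device": "H100SXM", "count": 2, "cost": 6.0, "available": True, "limitedContext": False, "selected": False},
    {"device": "H100SXM", "count": 4, "cost": 12.0, "available": True, "limitedContext": False, "selected": False},
    {"device": "HYPOTHETICAL_GPU", "count": 2, "cost": 4.0, "available": False, "limitedContext": False, "selected": False},
]

chosen = [d for d in select_cheapest_available(sample) if d["selected"]]
print(chosen[0]["device"], chosen[0]["count"])
```

The cheaper `HYPOTHETICAL_GPU` entry is skipped because it is not available, so the two-GPU `H100SXM` entry is the one marked as selected.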

### 3. Creating a Deployment

<table><thead><tr><th>Parameter</th><th width="100">Type</th><th width="100">Required</th><th>Description</th></tr></thead><tbody><tr><td><strong>deploymentName</strong></td><td>string</td><td>Yes</td><td>Unique name for your deployment. This can only contain lowercase letters, numbers, and dashes.</td></tr><tr><td><strong>modelName</strong></td><td>string</td><td>Yes</td><td>The HuggingFace identifier (in the form <code>org/model-name</code>) of the model you want to deploy. For example: <code>"shuyuej/Llama-3.2-1B-Instruct-GPTQ"</code></td></tr><tr><td><strong>deviceConfigs</strong></td><td>list of structs</td><td>Yes</td><td>A list of <strong>deviceConfig</strong> structs (see previous section). Any struct with <strong>selected</strong> = true will be considered for scheduling.</td></tr><tr><td><strong>replicas</strong></td><td>integer</td><td>Yes</td><td>Number of replicas to run in parallel.</td></tr><tr><td><strong>scaleDownPolicy</strong></td><td>string</td><td>No</td><td>Auto-scaling policy: <code>NONE</code> or <code>TIMER</code>.</td></tr><tr><td><strong>scaleDownThreshold</strong></td><td>integer</td><td>No</td><td>Time threshold in milliseconds before scaling down.</td></tr><tr><td><strong>draftModelName</strong></td><td>string</td><td>No</td><td>(Advanced) Points to a draft model version.</td></tr><tr><td><strong>engine</strong></td><td>string</td><td>No</td><td>Inference engine. Defaults to <code>VLLM</code>.</td></tr><tr><td><strong>engineTask</strong></td><td>string</td><td>No</td><td>Engine task mode. Defaults to <code>AUTO</code>.</td></tr><tr><td><strong>contextLengthOverride</strong></td><td>integer</td><td>No</td><td>(Advanced) User override for context window length.</td></tr><tr><td><strong>modelAccessKey</strong></td><td>string</td><td>No</td><td>(For private models) HuggingFace token.</td></tr></tbody></table>

Use the `POST /dedicated/deployments` endpoint to create a new model deployment. The response includes an `id` that you will need for later operations (status checks, updates, and so on). After the deployment is created, poll its status and wait for the transition from `STARTING` to `ONLINE` before sending requests to the endpoint. The deployment also appears on the dedicated status page of the platform UI.

The following example uses the `devices` list from the previous example:

```python
# Example: Creating a deployment
deployment_config = {
    "deploymentName": "myqwen",
    "modelName": "Qwen/Qwen2.5-72B-Instruct",
    "engine": "VLLM",  # Optional (defaults to VLLM)
    "deviceConfigs": devices,
    "replicas": 1,
    # Optional parameters:
    # "scaleDownPolicy": "TIMER",
    # "scaleDownThreshold": 3600000,
}

response = client.post("/dedicated/deployments", json=deployment_config)

if response.status_code == 200:
    deployment_data = response.json()
    deployment_id = deployment_data["id"]
    print(f"Created deployment with ID: {deployment_id}")
else:
    print(f"Failed to create deployment: {response.status_code}, {response.text}")
```

**Response Body Example:**

```json
{
  "id": 12345,
  "deploymentName": "myqwen",
  "modelName": "Qwen/Qwen2.5-72B-Instruct",
  "status": {
    "status": "STARTING",
    "statusMessage": "Deployment is being created"
  },
  "externalAlias": "myorganization-myqwen",
  ...
}
```

* **id**: Unique numeric identifier for the deployment.
* **status.status**: Current state of the deployment (for example, `STARTING`, `ONLINE`).
* **externalAlias**: The string used to reference your model via OpenAI-compatible calls.
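
Some mistakes in a create request can be caught before it is sent, using the rules from the parameter table above: the `deploymentName` character set, the millisecond unit of `scaleDownThreshold`, and the need for a selected device configuration. The following is a sketch of such a client-side check. `build_deployment_payload` and its `scale_down_after_minutes` argument are hypothetical names, and the "at least one selected config" check is an assumption about what a useful request looks like, not documented server behavior.

```python
import re
from typing import Optional

# Mirrors the documented rule: lowercase letters, numbers, and dashes only.
_NAME_PATTERN = re.compile(r"[a-z0-9-]+")


def build_deployment_payload(
    deployment_name: str,
    model_name: str,
    device_configs: list[dict],
    replicas: int = 1,
    scale_down_after_minutes: Optional[float] = None,
) -> dict:
    """Assemble a create-deployment payload after basic client-side checks."""
    if not _NAME_PATTERN.fullmatch(deployment_name):
        raise ValueError(f"Invalid deploymentName: {deployment_name!r}")
    # Assumption: a request without any selected config gives the scheduler nothing to pick.
    if not any(d.get("selected") for d in device_configs):
        raise ValueError("At least one device config must have selected=true")
    payload = {
        "deploymentName": deployment_name,
        "modelName": model_name,
        "deviceConfigs": device_configs,
        "replicas": replicas,
    }
    if scale_down_after_minutes is not None:
        payload["scaleDownPolicy"] = "TIMER"
        # scaleDownThreshold is expressed in milliseconds.
        payload["scaleDownThreshold"] = int(scale_down_after_minutes * 60 * 1000)
    return payload


payload = build_deployment_payload(
    "myqwen",
    "Qwen/Qwen2.5-72B-Instruct",
    [{"device": "H100SXM", "count": 2, "selected": True}],
    scale_down_after_minutes=60,
)
print(payload["scaleDownThreshold"])  # 3600000
```

The resulting dictionary can be passed as the `json=` argument of the `POST /dedicated/deployments` call.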

***

### 4. Getting Deployment Status

Use the `GET /dedicated/deployments/{deployment_id}` endpoint to retrieve the details of your deployment, including its status.

```python
deployment_id = 12345  # from creation step or UI
response = client.get(f"/dedicated/deployments/{deployment_id}")

if response.status_code == 200:
    status_data = response.json()
    model_status = status_data["status"]["status"]       # for example "ONLINE", "OFFLINE"
    model_identifier = status_data["externalAlias"]      # alias for OpenAI API calls
    print(f"Deployment status: {model_status}")
    print(f"OpenAI model identifier: {model_identifier}")
else:
    print(f"Failed to get deployment: {response.status_code}, {response.text}")
```

**Key fields in the response**:

* `status["status"]`: The overall state of the deployment.
* `externalAlias`: The string you use in OpenAI Chat/Completion requests (`model=`).

Below are the possible values for `status["status"]`:

| State        | Description                                                       |
| ------------ | ----------------------------------------------------------------- |
| **ONLINE**   | Deployment is active and ready to serve requests.                 |
| **STARTING** | Deployment is in the process of starting up.                      |
| **STOPPING** | Deployment is in the process of shutting down.                    |
| **OFFLINE**  | Deployment is not running and must be resumed to accept requests. |
| **ERROR**    | Deployment entered an error state and needs investigation.        |
| **UNKNOWN**  | Deployment state is unknown (for example, stale or missing data). |
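
Because create and resume both require waiting for a state transition, a small polling loop is usually needed. The sketch below treats `STARTING` and `STOPPING` as "keep waiting" and any other non-target state as a failure. That is one reasonable reading of the table above, not documented behavior. The helper name `poll_until`, the 30-second interval, and the 20-minute timeout are assumptions. The status lookup is passed in as a function so the loop can be exercised without a live deployment.

```python
import time
from typing import Callable

# States in which the deployment is still moving toward another state.
TRANSITIONAL_STATES = {"STARTING", "STOPPING"}


def poll_until(
    fetch_status: Callable[[], str],
    target: str = "ONLINE",
    timeout_s: float = 1200.0,
    interval_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Call `fetch_status` repeatedly until it returns `target`."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status == target:
            return
        if status not in TRANSITIONAL_STATES:
            raise RuntimeError(f"Unexpected deployment status: {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Deployment did not reach {target} within {timeout_s} seconds")
        sleep(interval_s)


# Dry run with canned statuses and no real sleeping.
canned = iter(["STARTING", "STARTING", "ONLINE"])
poll_until(lambda: next(canned), sleep=lambda _seconds: None)
print("reached ONLINE")
```

With the `client` from the Authentication section, the status function could be `lambda: client.get(f"/dedicated/deployments/{deployment_id}").json()["status"]["status"]`.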

***

### 5. Updating a Deployment

To change the configuration of an existing deployment (for example, number of replicas):

1. **GET** the current deployment (as above).
2. Modify the returned JSON structure to include your updated parameters.
3. **PUT** the entire updated JSON back to the same endpoint (`/dedicated/deployments/{deployment_id}`).

#### Example: Scaling up the replicas

```python
# 1. Fetch the current deployment
deployment_id = 12345
response = client.get(f"/dedicated/deployments/{deployment_id}")
current_deployment = response.json()

# 2. Modify the "replicas" field (for example, from 1 to 2)
current_deployment["replicas"] = 2

# 3. Send the updated deployment object
update_response = client.put(
    f"/dedicated/deployments/{deployment_id}",
    json=current_deployment
)

if update_response.status_code == 200:
    updated_deployment_data = update_response.json()
    print("Deployment updated successfully.")
    print(f"New replica count: {updated_deployment_data['replicas']}")
else:
    print(f"Failed to update deployment: {update_response.status_code}, {update_response.text}")
```

***

### 6. Pausing and Resuming a Deployment

You may want to pause a deployment to reduce costs. Pausing puts the deployment into an `OFFLINE` state and stops billing. When ready, you can resume it to bring it back `ONLINE`. After sending the resume command, poll the status and wait for the transition from `STARTING` to `ONLINE` before sending requests to the endpoint.

```python
# Pause the deployment
deployment_id = 12345
pause_response = client.post(f"/dedicated/deployments/{deployment_id}/pause")
if pause_response.status_code == 200:
    print("Deployment paused successfully.")
else:
    print(f"Failed to pause: {pause_response.status_code}, {pause_response.text}")

# Resume the deployment
resume_response = client.post(f"/dedicated/deployments/{deployment_id}/resume")
if resume_response.status_code == 200:
    print("Deployment resumed successfully.")
else:
    print(f"Failed to resume: {resume_response.status_code}, {resume_response.text}")
```

***

### 7. Using the Deployed Model via OpenAI-Compatible API

Once your deployment is `ONLINE`, you can send inference requests using the **Chat API** at `https://api.parasail.io/v1`. Use the `externalAlias` as the `model` parameter in your OpenAI calls.

Example using the OpenAI Python client:

```python
from openai import OpenAI

openai_client = OpenAI(api_key=api_key, base_url="https://api.parasail.io/v1")

# Assume you fetched this alias from /dedicated/deployments/{deployment_id}
model_identifier = "parasail-custom-model-id"

response = openai_client.chat.completions.create(
    model=model_identifier,
    messages=[{"role": "user", "content": "What's the capital of France?"}],
)

print(response.choices[0].message.content)
```

**Note**: The `model_identifier` is the `externalAlias` from the deployment status.
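
To avoid hard-coding the alias, the request arguments can be derived from the deployment JSON returned in step 4. The following is a small sketch under that assumption. `chat_request_for` is a hypothetical helper, and refusing to build a request unless the status is `ONLINE` is a client-side convenience rather than an API requirement.

```python
def chat_request_for(deployment: dict, prompt: str) -> dict:
    """Build chat-completion keyword arguments from a deployment status response."""
    status = deployment.get("status", {}).get("status", "UNKNOWN")
    if status != "ONLINE":
        raise RuntimeError(f"Deployment is {status}; wait for ONLINE before sending requests")
    return {
        "model": deployment["externalAlias"],
        "messages": [{"role": "user", "content": prompt}],
    }


# Shaped like the response body example from the creation step, with status ONLINE.
deployment = {"status": {"status": "ONLINE"}, "externalAlias": "myorganization-myqwen"}
request_kwargs = chat_request_for(deployment, "What's the capital of France?")
print(request_kwargs["model"])  # myorganization-myqwen
```

The result can then be unpacked into the OpenAI client call, for example `openai_client.chat.completions.create(**request_kwargs)`.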

***

### 8. Deleting a Deployment

To permanently delete a deployment, send a `DELETE` request to `/dedicated/deployments/{deployment_id}`.

```python
deployment_id = 12345
delete_response = client.delete(f"/dedicated/deployments/{deployment_id}")

if delete_response.status_code == 200:
    print("Deployment deleted successfully.")
else:
    print(f"Failed to delete deployment: {delete_response.status_code}, {delete_response.text}")
```

### Example Python Code

```python
import httpx
import openai
from pydantic import BaseModel
from typing import Optional
import os
import time
from dotenv import load_dotenv

# Load environment variables
env_path = os.path.expanduser("~/dev/parasail/.env")
if os.path.exists(env_path):
    load_dotenv(env_path)
api_key = os.environ["PARASAIL_API_KEY"]
control_base_url = "https://api.parasail.io/api/v1"
chat_base_url = "https://api.parasail.io/v1"


class deviceConfig(BaseModel):
    device: str
    count: int
    displayName: Optional[str] = None
    cost: Optional[float] = None
    estimatedSingleUserTps: Optional[float] = None
    estimatedSystemTps: Optional[float] = None
    recommended: Optional[bool] = None
    limitedContext: Optional[bool] = None
    available: Optional[bool] = None
    selected: bool


class CreateModelStruct(BaseModel):
    deploymentName: str
    modelName: str
    replicas: int
    deviceConfigs: list[deviceConfig]
    engine: Optional[str] = "VLLM"
    engineTask: Optional[str] = "AUTO"
    scaleDownPolicy: Optional[str] = None  # NONE, TIMER, INACTIVE
    scaleDownThreshold: Optional[int] = None  # time in ms
    draftModelName: Optional[str] = None
    contextLengthOverride: Optional[int] = None
    modelAccessKey: Optional[str] = None


def check_compatibility(parasail_client: httpx.Client, model_name: str):
    params = {
        "modelName": model_name,
        "engine": "VLLM",
        "modelAccessKey": "",  # HF token for privately hosted models
    }
    response = parasail_client.get("/dedicated/support", params=params)

    if response.status_code == 200:
        data = response.json()
        if data.get("supported") is True:
            print("✅ Model is supported.")
        else:
            reason = data.get("errorMessage", "No reason provided.")
            raise ValueError(f"❌ Model not supported: {reason}")
    else:
        raise RuntimeError(
            f"❌ Request failed with status code {response.status_code}: {response.text}"
        )


def get_and_select_device_config(
    parasail_client: httpx.Client,
    model_name: str,
    desired_device: str,
    desired_count: int,
) -> list[dict]:
    params = {"engine": "VLLM", "modelName": model_name, "modelAccessKey": ""}

    response = parasail_client.get("/dedicated/devices", params=params)
    if response.status_code != 200:
        raise RuntimeError(f"Request failed: {response.status_code} - {response.text}")

    devices = response.json()
    matched = False

    for d in devices:
        d["selected"] = (
            d.get("device") == desired_device and d.get("count") == desired_count
        )
        matched = matched or d["selected"]

    if not matched:
        raise ValueError(
            f"No matching device config for {desired_device} x{desired_count}"
        )

    return devices


def create_deployment(
    parasail_client: httpx.Client, deployment: CreateModelStruct
) -> dict:
    response = parasail_client.post(
        "/dedicated/deployments", content=deployment.model_dump_json()
    )
    assert (
        200 <= response.status_code < 300
    ), f"Failed to create deployment: {response.status_code} {response.text}"
    return response.json()


def update_deployment(parasail_client: httpx.Client, deployment: dict) -> dict:
    assert deployment["id"] is not None, "Deployment ID must be set for update"
    response = parasail_client.put(
        f"/dedicated/deployments/{deployment['id']}", json=deployment
    )
    assert (
        200 <= response.status_code < 300
    ), f"Failed to update deployment: {response.status_code} {response.text}"
    return response.json()


def pause_deployment(parasail_client: httpx.Client, id: int) -> None:
    response = parasail_client.post(f"/dedicated/deployments/{id}/pause")
    assert (
        200 <= response.status_code < 300
    ), f"Failed to pause deployment: {response.status_code} {response.text}"


def resume_deployment(parasail_client: httpx.Client, id: int) -> None:
    response = parasail_client.post(f"/dedicated/deployments/{id}/resume")
    assert (
        200 <= response.status_code < 300
    ), f"Failed to resume deployment: {response.status_code} {response.text}"


def get_deployment(parasail_client: httpx.Client, id: int) -> dict:
    response = parasail_client.get(f"/dedicated/deployments/{id}")
    assert (
        200 <= response.status_code < 300
    ), f"Failed to get deployment: {response.status_code} {response.text}"
    return response.json()


def is_deployment_offline(parasail_client: httpx.Client, id: int) -> bool:
    deployment = get_deployment(parasail_client, id)
    return deployment["status"]["status"] == "OFFLINE"


def delete_deployment(parasail_client: httpx.Client, id: int) -> None:
    response = parasail_client.delete(f"/dedicated/deployments/{id}")
    assert (
        200 <= response.status_code < 300
    ), f"Failed to delete deployment: {response.status_code} {response.text}"


def wait_for_status(
    parasail_client: httpx.Client,
    id: int,
    status: str = "ONLINE",
    timeout_minutes: int = 20,
) -> None:
    start_time = time.time()
    timeout_seconds = timeout_minutes * 60
    while True:
        deployment = get_deployment(parasail_client, id)
        current_status = (
            deployment["status"]["status"] if "status" in deployment else "UNKNOWN"
        )
        if current_status == status:
            print(f"Deployment reached target status: {status}")
            return
        elif current_status == "STARTING" or current_status == "STOPPING":
            elapsed_minutes = (time.time() - start_time) / 60
            print(
                f"Deployment status: {current_status} (elapsed time: {elapsed_minutes:.1f} minutes)"
            )
        else:
            raise ValueError(f"Unexpected deployment status: {current_status}")
        if time.time() - start_time > timeout_seconds:
            raise TimeoutError(
                f"Deployment status did not change to {status} within {timeout_minutes} minutes"
            )
        time.sleep(30)


def test_deployment(
    parasail_client: httpx.Client, openai_client: openai.Client, id: int
) -> None:
    deployment = get_deployment(parasail_client, id)
    assert "status" in deployment, "Deployment status is None"
    assert (
        deployment["status"]["status"] == "ONLINE"
    ), f"Deployment is not online: {deployment['status']['status']}"

    chat_completion = openai_client.chat.completions.create(
        model=deployment["externalAlias"],
        messages=[{"role": "user", "content": "What's the capital of New York?"}],
    )
    assert len(chat_completion.choices) == 1, "No chat completion choices returned"
    assert chat_completion.choices[0].message.content, "Empty chat completion response"


def main():
    # Setup clients
    model_name = "RedHatAI/Qwen2-72B-Instruct-FP8"

    parasail_client = httpx.Client(
        base_url=control_base_url,
        headers={
            "Authorization": "Bearer " + api_key,
            "Content-Type": "application/json",
        },
    )
    openai_client = openai.Client(api_key=api_key, base_url=chat_base_url)
    check_compatibility(parasail_client, model_name)

    devices = get_and_select_device_config(
        parasail_client,
        model_name=model_name,
        desired_device="H100SXM",
        desired_count=2,
    )

    # Create deployment configuration
    deployment_config = CreateModelStruct(
        deploymentName="scripting-example-deployment-3",
        modelName=model_name,
        deviceConfigs=devices,
        replicas=1,
    )

    deployment = None
    try:
        print("Creating deployment...")
        deployment = create_deployment(parasail_client, deployment_config)
        print(f"Created deployment with ID: {deployment['id']}")

        print("\nWaiting for deployment to come online.")
        wait_for_status(parasail_client, deployment["id"], status="ONLINE")

        print("\nTesting deployment with 1 replica...")
        test_deployment(parasail_client, openai_client, deployment["id"])
        print("Initial deployment tests passed successfully")

        pause_deployment(parasail_client, deployment["id"])
        wait_for_status(parasail_client, deployment["id"], status="OFFLINE")
        resume_deployment(parasail_client, deployment["id"])
        wait_for_status(parasail_client, deployment["id"], status="ONLINE")

        print("\nTesting deployment after resume...")
        test_deployment(parasail_client, openai_client, deployment["id"])
        print("Post-resume deployment tests passed successfully")

        deployment = get_deployment(parasail_client, deployment["id"])
        print("\nUpdating deployment to 2 replicas...")
        deployment["replicas"] = 2
        deployment = update_deployment(parasail_client, deployment)
        print("Waiting for deployment to scale up...")
        wait_for_status(parasail_client, deployment["id"], status="ONLINE")

        print("\nTesting deployment with 2 replicas...")
        test_deployment(parasail_client, openai_client, deployment["id"])
        print("Scaled deployment tests passed successfully")

    except Exception as e:
        print(f"Error: {e}")
    finally:
        if deployment:
            print("\nCleaning up - deleting deployment...")
            delete_deployment(parasail_client, deployment["id"])
            print("Deployment deleted")


if __name__ == "__main__":
    main()

```


