The Parasail Dedicated Service API lets you deploy and manage dedicated endpoints through a REST-compatible API. Everything available in the UI for dedicated deployments is supported, including creating, pausing, resuming, scaling up or down, and retrieving status information.
Because the API does not expose GPU performance recommendations or surface errors and HuggingFace model-compatibility issues, we recommend setting up and verifying the model in the platform UI first before controlling a deployment with the API.
If you did not create the dedicated endpoint with the API, you can find the deployment_id from the Parasail SaaS platform URL. For example, if your deployment status page is https://www.saas.parasail.io/dedicated/33665, then the deployment_id is 33665. You can then use this for all API commands below.
Base URLs
Control API (for managing deployments):
https://api.parasail.io/api/v1
Chat API (for inference requests via OpenAI-compatible endpoints):
https://api.parasail.io/v1
Authentication
All requests to the Control API must include your API key as a Bearer token in the Authorization header.
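For example, here is a minimal httpx client setup; it follows the same pattern as the full example at the end of this page, and the later snippets reuse this client object:
import os
import httpx
# The Control API expects the API key as a Bearer token on every request.
api_key = os.environ["PARASAIL_API_KEY"]
client = httpx.Client(
    base_url="https://api.parasail.io/api/v1",
    headers={
        "Authorization": "Bearer " + api_key,
        "Content-Type": "application/json",
    },
)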
Given the large set of combinations of model types and hardware instances, the recommended process for using the Dedicated API follows the UI flow and leverages our performance/memory models and hardware compatibility checkers to ensure smooth operation. The recommended flow is:
Check model compatibility with our /dedicated/support endpoint.
Retrieve a list of supported hardware for that model with our /dedicated/devices endpoint. This will include multi-GPU configurations that leverage parallelism.
Within that list, set selected to True for the desired configurations.
Submit (POST) that list as the deviceConfigs field of a new deployment to the /dedicated/deployments endpoint.
Use the /dedicated/deployments endpoints to monitor, update, pause, resume, and destroy deployments.
1. Checking Model Compatibility
params = {
"modelName": "Qwen/Qwen2.5-72B-Instruct",
"modelAccessKey": "" #HF token for privately hosted models
}
# Make the request
response = client.get("/dedicated/support", params=params)
if response.status_code == 200:
data = response.json()
if data.get("supported") is True:
print("✅ Model is supported.")
else:
reason = data.get("errorMessage", "No reason provided.")
print(f"❌ Model not supported: {reason}")
else:
print(f"❌ Request failed with status code {response.status_code}: {response.text}")
This step is optional, but it is a useful sanity check in API-based flows to confirm the model ID is entered correctly and is supported.
2. Retrieving Supported Hardware
The first step in creating a new dedicated deployment is to select a supported hardware configuration. The /dedicated/devices endpoint returns the valid hardware configurations for a model along with their availability. These configurations can be single GPUs or multiple GPUs with parallelism enabled (tensor, expert, or sequence). The selected flag is set to false on every configuration returned from /dedicated/devices. The recommended flow is to choose the desired configurations by flipping selected to true, then pass the array of structures to the create deployment endpoint (see the next section). The dedicated flow supports picking from multiple configurations, so selected can be set to true on more than one of them.
The entire structure does not need to be passed to the create deployment endpoint; the required fields in the table below are sufficient (a minimal hand-built example follows the table). However, it is recommended to ensure compatibility by first calling /dedicated/devices and flipping selected to true on the returned structures.
/dedicated/devices returns an array of structures with the following parameters:
DeviceConfigs structure:
device (string, required): Device ID name.
count (integer, required): Number of GPUs. The model will be parallelized across the GPUs using the most optimal approach (tensor parallelism, expert parallelism, and/or sequence parallelism). For data parallelism, use multiple replicas instead.
displayName (string, optional): How the device is displayed in the UI.
cost (float, optional): Cost per hour of the hardware configuration. This accounts for multiple GPUs.
estimatedSingleUserTps (float, optional): Estimated output TPS per user. The estimation model is currently out of date, so it is not recommended to rely on this number.
estimatedSystemTps (float, optional): Estimated output TPS for the whole system. The estimation model is currently out of date, so it is not recommended to rely on this number.
recommended (boolean, optional): If true, this configuration hits a sweet spot of performance and cost and has enough memory to support full contexts.
limitedContext (boolean, optional): If true, the full context of the model may cause memory overflow issues. It is recommended to manually reduce the context length using the contextLength parameter.
available (boolean, optional): If true, the hardware is available, meaning there is capacity and the account quota allows for the device count. This field currently reflects only a single replica.
selected (boolean, required): Set to false when retrieved from the compatible-device query. Setting it to true and passing the entire hardware structure to the create deployment endpoint makes that configuration a selectable option for the scheduler.
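As noted above, only the required fields (device, count, selected) strictly need to be supplied. A minimal hand-built entry might look like the sketch below; the device name and count are only illustrative, and in practice it is safer to start from the structures returned by /dedicated/devices, as in the code that follows:
devices = [
    {
        "device": "H100SXM",  # illustrative device ID
        "count": 2,           # number of GPUs in this configuration
        "selected": True,     # make this configuration eligible for scheduling
    }
]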
Here is example code to retrieve supported hardware, look for a desired configuration, and flip the selected flag to true when it is found:
def get_and_select_device_config(client: httpx.Client, model_name: str, desired_device: str, desired_count: int) -> list[dict]:
params = {
"engineName": "VLLM",
"modelName": model_name,
"modelAccessKey": ""
}
response = client.get("/dedicated/devices", params=params)
if response.status_code != 200:
raise RuntimeError(f"Request failed: {response.status_code} - {response.text}")
devices = response.json()
matched = False
for d in devices:
d["selected"] = d.get("device") == desired_device and d.get("count") == desired_count
matched = matched or d["selected"]
if not matched:
raise ValueError(f"No matching device config for {desired_device} x{desired_count}")
return devices
devices = get_and_select_device_config(
client,
model_name="Qwen/Qwen2.5-72B-Instruct",
desired_device="H100SXM",
desired_count=2
)
3. Creating a Deployment
deploymentName (string, required): Unique name for your deployment. This can only contain lowercase letters, numbers, and dashes.
modelName (string, required): The HuggingFace identifier (e.g., org/model-name) of the model you want to deploy. For example: "shuyuej/Llama-3.2-1B-Instruct-GPTQ".
deviceConfigs (list of structs, required): A list of deviceConfig structs (see previous section). Any struct with selected = true will be considered for scheduling.
replicas (integer, required): Number of replicas to run in parallel.
scaleDownPolicy (string, optional): Auto-scaling policy: NONE or TIMER.
scaleDownThreshold (integer, optional): Time threshold in ms before scaling down.
draftModelName (string, optional): (Advanced) Points to a draft model version.
generative (bool, optional): (Advanced) Indicates a generative model.
embedding (bool, optional): (Advanced) Indicates an embedding model.
contextLength (integer, optional): (Advanced) Context window length for LLMs.
modelAccessKey (string, optional): (For private models) HuggingFace token.
Use the POST /dedicated/deployments endpoint to create a new model deployment. The response includes an id that you will need for future operations (status checks, updates, etc.). Once the deployment is created, poll its status and wait for the transition from STARTING to ONLINE before the endpoint will accept requests. The deployment will also appear on the dedicated status page of the platform UI.
This uses the devices struct from the previous example:
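Below is a minimal sketch of the create request; the deployment name is illustrative, and client is the httpx client from the Authentication section.
payload = {
    "deploymentName": "qwen-72b-dedicated",    # example name: lowercase letters, numbers, dashes only
    "modelName": "Qwen/Qwen2.5-72B-Instruct",
    "deviceConfigs": devices,                  # configurations with selected=True will be considered
    "replicas": 1,
}
response = client.post("/dedicated/deployments", json=payload)
if 200 <= response.status_code < 300:
    deployment = response.json()
    deployment_id = deployment["id"]
    print(f"Created deployment {deployment_id}")
else:
    print(f"Failed to create deployment: {response.status_code}, {response.text}")
Two fields to note in the response: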
status.status: Current state of the deployment (e.g., STARTING, ONLINE).
externalAlias: The string used to reference your model via OpenAI-compatible calls.
4. Getting Deployment Status
Use the GET /dedicated/deployments/{deployment_id} endpoint to retrieve the details of your deployment, including its status.
deployment_id = 12345 # from creation step or UI
response = client.get(f"/dedicated/deployments/{deployment_id}")
if response.status_code == 200:
status_data = response.json()
model_status = status_data["status"]["status"] # e.g. "ONLINE", "PAUSED"
model_identifier = status_data["externalAlias"] # alias for OpenAI API calls
print(f"Deployment status: {model_status}")
print(f"OpenAI model identifier: {model_identifier}")
else:
print(f"Failed to get deployment: {response.status_code}, {response.text}")
Key fields in the response:
status["status"]: The overall state of the deployment.
externalAlias: The string you use in OpenAI Chat/Completion requests (model=).
Below are the possible values for status["status"]:
ONLINE: Deployment is active and ready to serve requests.
STARTING: Deployment is in the process of starting up.
PAUSED: Deployment is paused but can be resumed.
STOPPING: Deployment is in the process of shutting down.
OFFLINE: Deployment is not running and must be resumed to accept requests.
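A newly created or resumed deployment moves from STARTING to ONLINE before it accepts requests. A minimal polling sketch, assuming the same httpx client and deployment_id as above (the full example at the end of this page wraps the same idea in a wait_for_status helper):
import time
while True:
    data = client.get(f"/dedicated/deployments/{deployment_id}").json()
    current = data["status"]["status"]
    print(f"Current status: {current}")
    if current == "ONLINE":
        break
    time.sleep(30)  # poll roughly every 30 seconds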
5. Updating a Deployment
To change the configuration of an existing deployment (e.g., number of replicas):
GET the current deployment (as above).
Modify the returned JSON structure to include your updated parameters.
PUT the entire updated JSON back to the same endpoint (/dedicated/deployments/{deployment_id}).
Example: Scaling up the replicas
# 1. Fetch the current deployment
deployment_id = 12345
response = client.get(f"/dedicated/deployments/{deployment_id}")
current_deployment = response.json()
# 2. Modify the "replicas" field (e.g., from 1 to 2)
current_deployment["replicas"] = 2
# 3. Send the updated deployment object
update_response = client.put(
f"/dedicated/deployments/{deployment_id}",
json=current_deployment
)
if update_response.status_code == 200:
updated_deployment_data = update_response.json()
print("Deployment updated successfully.")
print(f"New replica count: {updated_deployment_data['replicas']}")
else:
print(f"Failed to update deployment: {update_response.status_code}, {update_response.text}")
6. Pausing and Resuming a Deployment
You may want to pause a deployment to reduce costs. Pausing puts the deployment into an OFFLINE state and stops billing. When ready, you can resume it to bring it back ONLINE. When the resume command is sent, you will need to poll the status to monitor for the transition from STARTING to ONLINE before the endpoint will accept requests.
# Pause the deployment
deployment_id = 12345
pause_response = client.post(f"/dedicated/deployments/{deployment_id}/pause")
if pause_response.status_code == 200:
print("Deployment paused successfully.")
else:
print(f"Failed to pause: {pause_response.status_code}, {pause_response.text}")
# Resume the deployment
resume_response = client.post(f"/dedicated/deployments/{deployment_id}/resume")
if resume_response.status_code == 200:
print("Deployment resumed successfully.")
else:
print(f"Failed to resume: {resume_response.status_code}, {resume_response.text}")
7. Using the Deployed Model via OpenAI-Compatible API
Once your deployment is ONLINE, you can send inference requests using the Chat API at https://api.parasail.io/v1. Use the externalAlias as the model parameter in your OpenAI calls.
Example using the OpenAI Python client:
import openai
# Create a client pointed at the Parasail Chat API (OpenAI Python SDK >= 1.0,
# matching the style used in the full example at the end of this page)
openai_client = openai.OpenAI(
    api_key=api_key,
    base_url="https://api.parasail.io/v1",
)
# Assume you fetched this alias from /dedicated/deployments/{deployment_id}
model_identifier = "parasail-custom-model-id"
response = openai_client.chat.completions.create(
    model=model_identifier,
    messages=[{"role": "user", "content": "What's the capital of France?"}],
)
print(response.choices[0].message.content)
Note: The model_identifier is the externalAlias from the deployment status.
8. Deleting a Deployment
To permanently delete an endpoint, send a DELETE request to /dedicated/deployments/{deployment_id}.
deployment_id = 12345
delete_response = client.delete(f"/dedicated/deployments/{deployment_id}")
if delete_response.status_code == 200:
print("Deployment deleted successfully.")
else:
print(f"Failed to delete deployment: {delete_response.status_code}, {delete_response.text}")
Example Python Code
import httpx
import openai
from pydantic import BaseModel
from typing import Optional
import os
import time
from dotenv import load_dotenv
# Load environment variables
env_path = os.path.expanduser("~/dev/parasail/.env")
if os.path.exists(env_path):
load_dotenv(env_path)
api_key = os.environ["PARASAIL_API_KEY"]
control_base_url = "https://api.parasail.io/api/v1"
chat_base_url = "https://api.parasail.io/v1"
class deviceConfig(BaseModel):
device: str
count: int
displayName: Optional[str] = None
cost: Optional[float] = None
estimatedSingleUserTps: Optional[float] = None
estimatedSystemTps: Optional[float] = None
recommended: Optional[bool] = None
limitedContext: Optional[bool] = None
available: Optional[bool] = None
selected: bool
class CreateModelStruct(BaseModel):
deploymentName: str
modelName: str
replicas: int
deviceConfigs: list[deviceConfig]
scaleDownPolicy: Optional[str] = None # NONE, TIMER, INACTIVE
scaleDownThreshold: Optional[int] = None # time in ms
draftModelName: Optional[str] = None
generative: Optional[bool] = None
embedding: Optional[bool] = None
contextLength: Optional[int] = None
def check_compatibility(parasail_client: httpx.Client, model_name: str):
params = {
"modelName": model_name,
"modelAccessKey": "", # HF token for privately hosted models
}
response = parasail_client.get("/dedicated/support", params=params)
if response.status_code == 200:
data = response.json()
if data.get("supported") is True:
print("✅ Model is supported.")
else:
reason = data.get("errorMessage", "No reason provided.")
raise ValueError(f"❌ Model not supported: {reason}")
else:
raise RuntimeError(
f"❌ Request failed with status code {response.status_code}: {response.text}"
)
def get_and_select_device_config(
parasail_client: httpx.Client,
model_name: str,
desired_device: str,
desired_count: int,
) -> list[dict]:
params = {"engineName": "VLLM", "modelName": model_name, "modelAccessKey": ""}
response = parasail_client.get("/dedicated/devices", params=params)
if response.status_code != 200:
raise RuntimeError(f"Request failed: {response.status_code} - {response.text}")
devices = response.json()
matched = False
for d in devices:
d["selected"] = (
d.get("device") == desired_device and d.get("count") == desired_count
)
matched = matched or d["selected"]
if not matched:
raise ValueError(
f"No matching device config for {desired_device} x{desired_count}"
)
return devices
def create_deployment(
parasail_client: httpx.Client, deployment: CreateModelStruct
) -> dict:
response = parasail_client.post(
"/dedicated/deployments", content=deployment.model_dump_json()
)
assert (
200 <= response.status_code < 300
), f"Failed to create deployment: {response.status_code} {response.text}"
return response.json()
def update_deployment(parasail_client: httpx.Client, deployment: dict) -> dict:
assert deployment["id"] is not None, "Deployment ID must be set for update"
response = parasail_client.put(
f"/dedicated/deployments/{deployment['id']}", json=deployment
)
assert (
200 <= response.status_code < 300
), f"Failed to update deployment: {response.status_code} {response.text}"
return response.json()
def pause_deployment(parasail_client: httpx.Client, id: int) -> None:
response = parasail_client.post(f"/dedicated/deployments/{id}/pause")
assert (
200 <= response.status_code < 300
), f"Failed to pause deployment: {response.status_code} {response.text}"
def resume_deployment(parasail_client: httpx.Client, id: int) -> None:
response = parasail_client.post(f"/dedicated/deployments/{id}/resume")
assert (
200 <= response.status_code < 300
), f"Failed to resume deployment: {response.status_code} {response.text}"
def get_deployment(parasail_client: httpx.Client, id: int) -> dict:
response = parasail_client.get(f"/dedicated/deployments/{id}")
assert (
200 <= response.status_code < 300
), f"Failed to get deployment: {response.status_code} {response.text}"
return response.json()
def is_deployment_paused(parasail_client: httpx.Client, id: int) -> bool:
deployment = get_deployment(parasail_client, id)
return deployment["status"]["status"] == "PAUSED"
def delete_deployment(parasail_client: httpx.Client, id: int) -> None:
response = parasail_client.delete(f"/dedicated/deployments/{id}")
assert (
200 <= response.status_code < 300
), f"Failed to delete deployment: {response.status_code} {response.text}"
def wait_for_status(
parasail_client: httpx.Client,
id: int,
status: str = "ONLINE",
timeout_minutes: int = 20,
) -> None:
start_time = time.time()
timeout_seconds = timeout_minutes * 60
while True:
deployment = get_deployment(parasail_client, id)
current_status = (
deployment["status"]["status"] if "status" in deployment else "Unknown"
)
if current_status == status:
print("Deployment is now online!")
return
elif current_status == "STARTING" or current_status == "STOPPING":
elapsed_minutes = (time.time() - start_time) / 60
print(
f"Deployment status: {current_status} (elapsed time: {elapsed_minutes:.1f} minutes)"
)
else:
raise ValueError(f"Unexpected deployment status: {current_status}")
if time.time() - start_time > timeout_seconds:
raise TimeoutError(
f"Deployment status did not change to {status} within {timeout_minutes} minutes"
)
time.sleep(30)
def test_deployment(
parasail_client: httpx.Client, openai_client: openai.Client, id: int
) -> None:
deployment = get_deployment(parasail_client, id)
assert "status" in deployment, "Deployment status is None"
assert (
deployment["status"]["status"] == "ONLINE"
), f"Deployment is not online: {deployment['status']['status']}"
chat_completion = openai_client.chat.completions.create(
model=deployment["externalAlias"],
messages=[{"role": "user", "content": "What's the capital of New York?"}],
)
assert len(chat_completion.choices) == 1, "No chat completion choices returned"
assert chat_completion.choices[0].message.content, "Empty chat completion response"
def main():
# Setup clients
model_name = "RedHatAI/Qwen2-72B-Instruct-FP8"
parasail_client = httpx.Client(
base_url=control_base_url,
headers={
"Authorization": "Bearer " + api_key,
"Content-Type": "application/json",
},
)
openai_client = openai.Client(api_key=api_key, base_url=chat_base_url)
check_compatibility(parasail_client, model_name)
devices = get_and_select_device_config(
parasail_client,
model_name=model_name,
desired_device="H100SXM",
desired_count=2,
)
# Create deployment configuration
deployment_config = CreateModelStruct(
deploymentName="scripting-example-deployment-3",
modelName=model_name,
deviceConfigs=devices,
replicas=1,
)
deployment = None
try:
print("Creating deployment...")
deployment = create_deployment(parasail_client, deployment_config)
print(f"Created deployment with ID: {deployment['id']}")
print("\nWaiting for deployment to come online.")
wait_for_status(parasail_client, deployment["id"], status="ONLINE")
print("\nTesting deployment with 1 replica...")
test_deployment(parasail_client, openai_client, deployment["id"])
print("Initial deployment tests passed successfully")
pause_deployment(parasail_client, deployment["id"])
wait_for_status(parasail_client, deployment["id"], status="OFFLINE")
resume_deployment(parasail_client, deployment["id"])
wait_for_status(parasail_client, deployment["id"], status="ONLINE")
print("\nTesting deployment with 1 replica...")
test_deployment(parasail_client, openai_client, deployment["id"])
print("Initial deployment tests passed successfully")
deployment = get_deployment(parasail_client, deployment["id"])
print("\nUpdating deployment to 2 replicas...")
deployment["replicas"] = 2
deployment = update_deployment(parasail_client, deployment)
print("Waiting for deployment to scale up...")
wait_for_status(parasail_client, deployment["id"], status="ONLINE")
print("\nTesting deployment with 2 replicas...")
test_deployment(parasail_client, openai_client, deployment["id"])
print("Scaled deployment tests passed successfully")
except Exception as e:
print(f"Error: {e}")
finally:
if deployment:
print("\nCleaning up - deleting deployment...")
delete_deployment(parasail_client, deployment["id"])
print("Deployment deleted")
if __name__ == "__main__":
main()
The /dedicated/support endpoint can be used to query whether a model is supported. The modelName parameter can be either a HuggingFace ID such as Qwen/Qwen2.5-72B-Instruct or a full HuggingFace model URL.