Dedicated Endpoint Management API

Parasail Dedicated Service API Documentation

Overview

The Parasail Dedicated Service API allows you to deploy and manage dedicated endpoints via a REST API. Everything available in the UI for dedicated deployments can be done through the API, including creating, pausing, resuming, scaling up or down, and retrieving status information.

Because the API does not surface GPU performance recommendations or insight into errors and HuggingFace model compatibility issues, it is recommended that you set up and verify the model using the platform UI before attempting to control a deployment with the API.

If you did not create the dedicated endpoint with the API, you can find the deployment_id from the Parasail SaaS platform URL. For example, if your deployment status page is https://www.saas.parasail.io/dedicated/33665, then the deployment_id is 33665. You can then use this for all API commands below.

Base URLs

  • Control API (for managing deployments): https://api.parasail.io/api/v1

  • Chat API (for inference requests via OpenAI-compatible endpoints): https://api.parasail.io/v1

Authentication

All requests to the Control API must include a Bearer token in the Authorization header. For example:

import httpx

api_key = "<YOUR_PARASAIL_API_KEY>"
control_base_url = "https://api.parasail.io/api/v1"

headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json"
}

client = httpx.Client(base_url=control_base_url, headers=headers)

Given the large number of combinations of model types and hardware instances, the recommended process for using the Dedicated API mirrors the UI flow and leverages our performance/memory models and hardware compatibility checkers to ensure smooth operation. The recommended flow is:

  1. Check model compatibility with our /dedicated/support endpoint.

  2. Retrieve a list of supported hardware for that model with our /dedicated/devices endpoint. This will include multi-GPU configurations that leverage parallelism.

  3. Within that list, set selected to true for the desired configurations.

  4. Submit (POST) that list as part of the create request to the /dedicated/deployments endpoint.

  5. Use the /dedicated/deployments endpoints to monitor, update, pause, resume, and destroy deployments.

1. Checking Model Compatibility

The /dedicated/support endpoint can be used to query whether a model is supported. The modelName parameter can be either a HuggingFace ID such as Qwen/Qwen2.5-72B-Instruct or a HuggingFace URL such as https://huggingface.co/Qwen/Qwen2.5-72B-Instruct.

This step is optional, but is a useful sanity check for API-based flows to ensure the IDs are entered correctly.
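
For example, a minimal check using the httpx client from the Authentication section. This sketch assumes modelName is passed as a query parameter; confirm the exact request shape against the API reference.

# Reuses the `client` created in the Authentication section.
# Assumption: /dedicated/support accepts the model as a `modelName` query parameter.
model_name = "Qwen/Qwen2.5-72B-Instruct"

response = client.get("/dedicated/support", params={"modelName": model_name})
response.raise_for_status()

# Inspect the raw payload to confirm the model is supported before continuing.
print(response.json())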

2. Retrieving Supported Hardware

The first step in creating a new dedicated deployment is to select a supported hardware configuration. The /dedicated/devices endpoint retrieves valid hardware configurations as well as their availability. These configurations can be either single GPUs or multiple GPUs with parallelism enabled (tensor, expert, or sequence). The selected flag is set to false when returned from /dedicated/devices. The recommended flow is to select the desired configurations by flipping the selected flag to true, then pass the array of structures to the create deployment endpoint (see next section). The dedicated flow supports picking from multiple configurations, so selected can be set to true on more than one configuration.

The entire structure does not need to be passed to the create deployment endpoint; only the fields marked required below are needed. However, it is recommended to ensure compatibility by first calling /dedicated/devices and flipping selected to true.

/dedicated/devices returns an array of structures with the following parameters:

DeviceConfigs structure

  • device (string, required): Device ID name.

  • count (integer, required): Number of GPUs. The model will be parallelized across the GPUs using the most optimal approach (tensor parallelism, expert parallelism, and/or sequence parallelism). For data parallelism, use multiple replicas instead.

  • displayName (string, optional): How the device is displayed in the UI.

  • cost (float, optional): Cost per hour of the hardware configuration. This accounts for multiple GPUs.

  • estimatedSingleUserTps (float, optional): Estimated output TPS per user. The estimation model is currently out of date, so it is not recommended to rely on this number.

  • estimatedSystemTps (float, optional): Estimated output TPS for the whole system. The estimation model is currently out of date, so it is not recommended to rely on this number.

  • recommended (boolean, optional): If true, this configuration hits a sweet spot of performance and cost and has enough memory to support full contexts.

  • limitedContext (boolean, optional): If true, the full context of the model may cause memory overflow issues. It is recommended to manually reduce the context length using the contextLength parameter.

  • available (boolean, optional): If true, the hardware is available, meaning there is capacity and the account quota allows for the device count. This field currently only reflects a single replica.

  • selected (boolean, required): Set to false when retrieved from the compatible device query. Setting this to true and passing the entire hardware structure to the create deployment endpoint makes it a selectable option for the scheduler.

Here is example code to retrieve supported hardware, look for a desired configuration, and flip the selected flag to true when it is found:
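
The sketch below assumes /dedicated/devices also accepts the model as a modelName query parameter; it selects every configuration that is both recommended and currently available.

# Reuses the `client` created in the Authentication section.
# Assumption: /dedicated/devices takes the model as a `modelName` query parameter.
model_name = "Qwen/Qwen2.5-72B-Instruct"

response = client.get("/dedicated/devices", params={"modelName": model_name})
response.raise_for_status()
devices = response.json()

# Flip `selected` to true on every configuration that is both recommended and
# currently available. Multiple configurations may be selected.
for config in devices:
    if config.get("recommended") and config.get("available"):
        config["selected"] = True

chosen = [(c["device"], c["count"]) for c in devices if c["selected"]]
print("Selected configurations:", chosen)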

3. Creating a Deployment

Use the POST /dedicated/deployments endpoint to create a new model deployment. The response will include an id you will need for future operations (status checks, updates, etc.). Once the deployment is created, you will need to poll the status to monitor for the transition from STARTING to ONLINE before the endpoint will accept requests. The deployment will also appear on your dedicated status page of the platform UI.

The request body accepts the following parameters:

  • deploymentName (string, required): Unique name for your deployment. It can only contain lowercase letters, numbers, and dashes.

  • modelName (string, required): The HuggingFace identifier (for example, org/model-name) of the model you want to deploy. For example: "shuyuej/Llama-3.2-1B-Instruct-GPTQ".

  • deviceConfigs (list of structs, required): A list of DeviceConfigs structs (see previous section). Any struct with selected = true will be considered for scheduling.

  • replicas (integer, required): Number of replicas to run in parallel.

  • scaleDownPolicy (string, optional): Auto-scaling policy: NONE or TIMER.

  • scaleDownThreshold (integer, optional): Time threshold in ms before scaling down.

  • draftModelName (string, optional): (Advanced) Points to a draft model version.

  • generative (bool, optional): (Advanced) Indicates a generative model.

  • embedding (bool, optional): (Advanced) Indicates an embedding model.

  • contextLength (integer, optional): (Advanced) Context window length for LLMs.

  • modelAccessKey (string, optional): (For private models) HuggingFace token.

This uses the devices struct from the previous example:
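
A minimal sketch, assuming the request body is the flat JSON object described by the parameter list above (the deployment name and model name here are placeholders):

# Reuses the `client` and the `devices` list from the previous examples.
payload = {
    "deploymentName": "my-dedicated-endpoint",         # lowercase letters, numbers, dashes
    "modelName": "shuyuej/Llama-3.2-1B-Instruct-GPTQ",
    "deviceConfigs": devices,                          # structs with selected = true are considered
    "replicas": 1,
}

response = client.post("/dedicated/deployments", json=payload)
response.raise_for_status()
deployment = response.json()

deployment_id = deployment["id"]
print("Deployment id:", deployment_id)
print("State:", deployment["status"]["status"])        # STARTING at first, then ONLINE
print("Model alias:", deployment["externalAlias"])     # model= value for OpenAI-compatible calls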

Response Body Example:

  • id: Unique numeric identifier for the deployment.

  • status.status: Current state of the deployment (for example, STARTING, ONLINE).

  • externalAlias: The string used to reference your model via OpenAI-compatible calls.


4. Getting Deployment Status

Use the GET /dedicated/deployments/{deployment_id} endpoint to retrieve the details of your deployment, including its status.

Key fields in the response:

  • status["status"]: The overall state of the deployment.

  • externalAlias: The string you use in OpenAI Chat/Completion requests (model=).

Below are the possible values for status["status"]:

  • ONLINE: Deployment is active and ready to serve requests.

  • STARTING: Deployment is in the process of starting up.

  • PAUSED: Deployment is paused but can be resumed.

  • STOPPING: Deployment is in the process of shutting down.

  • OFFLINE: Deployment is not running and must be resumed to accept requests.
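
A minimal status check, continuing from the client and deployment_id used in the previous examples:

# Reuses the `client` and `deployment_id` from the previous examples.
response = client.get(f"/dedicated/deployments/{deployment_id}")
response.raise_for_status()
deployment = response.json()

print("State:", deployment["status"]["status"])    # e.g. STARTING, ONLINE, PAUSED
print("Alias:", deployment["externalAlias"])       # use as model= in OpenAI-compatible calls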


5. Updating a Deployment

To change the configuration of an existing deployment (for example, number of replicas):

  1. GET the current deployment (as above).

  2. Modify the returned JSON structure to include your updated parameters.

  3. PUT the entire updated JSON back to the same endpoint (/dedicated/deployments/{deployment_id}).

Example: Scaling up the replicas
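
A minimal sketch of the GET, modify, PUT cycle described above, scaling the deployment to two replicas:

# 1. GET the current deployment definition.
response = client.get(f"/dedicated/deployments/{deployment_id}")
response.raise_for_status()
deployment = response.json()

# 2. Modify the returned JSON structure.
deployment["replicas"] = 2

# 3. PUT the entire updated structure back to the same endpoint.
response = client.put(f"/dedicated/deployments/{deployment_id}", json=deployment)
response.raise_for_status()
print("Replicas now:", response.json().get("replicas"))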


6. Pausing and Resuming a Deployment

You may want to pause a deployment to reduce costs. Pausing puts the deployment into an OFFLINE state and stops billing. When ready, you can resume it to bring it back ONLINE. When the resume command is sent, you will need to poll the status to monitor for the transition from STARTING to ONLINE before the endpoint will accept requests.
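
After sending a resume, a polling loop such as the sketch below can watch for the STARTING to ONLINE transition before traffic is sent (the pause and resume requests themselves are not shown here):

import time

# Reuses the `client` and `deployment_id` from the previous examples.
def wait_until_online(deployment_id, poll_seconds=10, timeout_seconds=1800):
    """Poll the deployment until it reports ONLINE, e.g. after a resume."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        response = client.get(f"/dedicated/deployments/{deployment_id}")
        response.raise_for_status()
        state = response.json()["status"]["status"]
        print("Current state:", state)
        if state == "ONLINE":
            return
        time.sleep(poll_seconds)
    raise TimeoutError("Deployment did not reach ONLINE before the timeout")

wait_until_online(deployment_id)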


7. Using the Deployed Model via OpenAI-Compatible API

Once your deployment is ONLINE, you can send inference requests using the Chat API at https://api.parasail.io/v1. Use the externalAlias as the model parameter in your OpenAI calls.

Example using the OpenAI Python client:
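
A minimal sketch, assuming the same Parasail API key is used for the Chat API:

from openai import OpenAI

# The Chat API is OpenAI-compatible; point the client at the Chat base URL.
chat_client = OpenAI(
    base_url="https://api.parasail.io/v1",
    api_key=api_key,
)

# model_identifier is the externalAlias from the deployment status.
model_identifier = "<EXTERNAL_ALIAS>"

completion = chat_client.chat.completions.create(
    model=model_identifier,
    messages=[{"role": "user", "content": "Hello! Briefly introduce yourself."}],
)
print(completion.choices[0].message.content)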

Note: The model_identifier is the externalAlias from the deployment status.


8. Deleting a Deployment

To permanently delete an endpoint, send the DELETE request.

Example Python Code
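
A minimal sketch, assuming the DELETE verb targets the same /dedicated/deployments/{deployment_id} path used for status checks and updates:

# Reuses the `client` and `deployment_id` from the previous examples.
# Deleting an endpoint is permanent and cannot be undone.
response = client.delete(f"/dedicated/deployments/{deployment_id}")
response.raise_for_status()
print("Deleted deployment", deployment_id)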
