The Parasail Dedicated Service API lets you deploy and manage dedicated endpoints through a REST-compatible API. Everything available in the UI for dedicated deployments is supported, including creating, pausing, resuming, scaling up or down, and retrieving status information.
Because the API does not expose GPU performance recommendations or surface errors and HuggingFace model-compatibility issues, we recommend setting up and verifying the model in the platform UI first before controlling a deployment with the API.
If you did not create the dedicated endpoint with the API, you can find the deployment_id from the Parasail SaaS platform URL. For example, if your deployment status page is https://www.saas.parasail.io/dedicated/33665, then the deployment_id is 33665. You can then use this for all API commands below.
Base URLs
Control API (for managing deployments):
https://api.parasail.io/api/v1
Chat API (for inference requests via OpenAI-compatible endpoints):
https://api.parasail.io/v1
Authentication
All requests to the Control API must include your API key as a Bearer token in the Authorization header.
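For example, here is a minimal httpx client setup; it follows the same pattern as the full example at the end of this page, and the later snippets reuse this client object:
import os
import httpx
# The Control API expects the API key as a Bearer token on every request.
api_key = os.environ["PARASAIL_API_KEY"]
client = httpx.Client(
    base_url="https://api.parasail.io/api/v1",
    headers={
        "Authorization": "Bearer " + api_key,
        "Content-Type": "application/json",
    },
)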
Given the large set of combinations of model types and hardware instances, the recommended process for using the Dedicated API follows the UI flow and leverages our performance/memory models and hardware compatibility checkers to ensure smooth operation. The recommended flow is:
Check model compatibility with our /dedicated/support endpoint.
Retrieve a list of supported hardware for that model with our /dedicated/devices endpoint. This will include multi-GPU configurations that leverage parallelism.
Within that list, set selected to True for the desired configurations.
Submit (POST) that list as the deviceConfigs field of a new deployment to the /dedicated/deployments endpoint.
Use the /dedicated/deployments endpoints to monitor, update, pause, resume, and destroy deployments.
1. Checking Model Compatibility
params = {
"modelName": "Qwen/Qwen2.5-72B-Instruct",
"modelAccessKey": "" #HF token for privately hosted models
}
# Make the request
response = client.get("/dedicated/support", params=params)
if response.status_code == 200:
data = response.json()
if data.get("supported") is True:
print("✅ Model is supported.")
else:
reason = data.get("errorMessage", "No reason provided.")
print(f"❌ Model not supported: {reason}")
else:
print(f"❌ Request failed with status code {response.status_code}: {response.text}")
This step is optional, but it is a useful sanity check in API-based flows to confirm the model ID is entered correctly and is supported.
2. Retrieving Supported Hardware
The first step in creating a new dedicated deployment is to select a supported hardware configuration. The /dedicated/devices endpoint returns the valid hardware configurations for a model along with their availability. These configurations can be single GPUs or multiple GPUs with parallelism enabled (tensor, expert, or sequence). The selected flag is set to false on every configuration returned from /dedicated/devices. The recommended flow is to choose the desired configurations by flipping selected to true, then pass the array of structures to the create deployment endpoint (see the next section). The dedicated flow supports picking from multiple configurations, so selected can be set to true on more than one of them.
The entire structure does not need to be passed to the create deployment endpoint; the required fields in the table below are sufficient (a minimal hand-built example follows the table). However, it is recommended to ensure compatibility by first calling /dedicated/devices and flipping selected to true on the returned structures.
/dedicated/devices returns an array of structures with the following parameters:
DeviceConfigs structure:
device (string, required): Device ID name.
count (integer, required): Number of GPUs. The model will be parallelized across the GPUs using the most optimal approach (tensor parallelism, expert parallelism, and/or sequence parallelism). For data parallelism, use multiple replicas instead.
displayName (string, optional): How the device is displayed in the UI.
cost (float, optional): Cost per hour of the hardware configuration. This accounts for multiple GPUs.
estimatedSingleUserTps (float, optional): Estimated output TPS per user. The estimation model is currently out of date, so it is not recommended to rely on this number.
estimatedSystemTps (float, optional): Estimated output TPS for the whole system. The estimation model is currently out of date, so it is not recommended to rely on this number.
recommended (boolean, optional): If true, this configuration hits a sweet spot of performance and cost and has enough memory to support full contexts.
limitedContext (boolean, optional): If true, the full context of the model may cause memory overflow issues. It is recommended to manually reduce the context length using the contextLength parameter.
available (boolean, optional): If true, the hardware is available, meaning there is capacity and the account quota allows for the device count. This field currently reflects only a single replica.
selected (boolean, required): Set to false when retrieved from the compatible-device query. Setting it to true and passing the entire hardware structure to the create deployment endpoint makes that configuration a selectable option for the scheduler.
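As noted above, only the required fields (device, count, selected) strictly need to be supplied. A minimal hand-built entry might look like the sketch below; the device name and count are only illustrative, and in practice it is safer to start from the structures returned by /dedicated/devices, as in the code that follows:
devices = [
    {
        "device": "H100SXM",  # illustrative device ID
        "count": 2,           # number of GPUs in this configuration
        "selected": True,     # make this configuration eligible for scheduling
    }
]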
Here is example code to retrieve supported hardware, look for a desired configuration, and flip the selected flag to true when it is found:
def get_and_select_device_config(client: httpx.Client, model_name: str, desired_device: str, desired_count: int) -> list[dict]:
params = {
"engineName": "VLLM",
"modelName": model_name,
"modelAccessKey": ""
}
response = client.get("/dedicated/devices", params=params)
if response.status_code != 200:
raise RuntimeError(f"Request failed: {response.status_code} - {response.text}")
devices = response.json()
matched = False
for d in devices:
d["selected"] = d.get("device") == desired_device and d.get("count") == desired_count
matched = matched or d["selected"]
if not matched:
raise ValueError(f"No matching device config for {desired_device} x{desired_count}")
return devices
devices = get_and_select_device_config(
client,
model_name="Qwen/Qwen2.5-72B-Instruct",
desired_device="H100SXM",
desired_count=2
)
3. Creating a Deployment
deploymentName (string, required): Unique name for your deployment. This can only contain lowercase letters, numbers, and dashes.
modelName (string, required): The HuggingFace identifier (e.g., org/model-name) of the model you want to deploy. For example: "shuyuej/Llama-3.2-1B-Instruct-GPTQ".
deviceConfigs (list of structs, required): A list of deviceConfig structs (see previous section). Any struct with selected = true will be considered for scheduling.
replicas (integer, required): Number of replicas to run in parallel.
scaleDownPolicy (string, optional): Auto-scaling policy: NONE or TIMER.
scaleDownThreshold (integer, optional): Time threshold in ms before scaling down.
draftModelName (string, optional): (Advanced) Points to a draft model version.
generative (bool, optional): (Advanced) Indicates a generative model.
embedding (bool, optional): (Advanced) Indicates an embedding model.
contextLength (integer, optional): (Advanced) Context window length for LLMs.
modelAccessKey (string, optional): (For private models) HuggingFace token.
Use the POST /dedicated/deployments endpoint to create a new model deployment. The response includes an id that you will need for future operations (status checks, updates, etc.). Once the deployment is created, poll its status and wait for the transition from STARTING to ONLINE before the endpoint will accept requests. The deployment will also appear on the dedicated status page of the platform UI.
This uses the devices struct from the previous example:
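Below is a minimal sketch of the create request; the deployment name is illustrative, and client is the httpx client from the Authentication section.
payload = {
    "deploymentName": "qwen-72b-dedicated",    # example name: lowercase letters, numbers, dashes only
    "modelName": "Qwen/Qwen2.5-72B-Instruct",
    "deviceConfigs": devices,                  # configurations with selected=True will be considered
    "replicas": 1,
}
response = client.post("/dedicated/deployments", json=payload)
if 200 <= response.status_code < 300:
    deployment = response.json()
    deployment_id = deployment["id"]
    print(f"Created deployment {deployment_id}")
else:
    print(f"Failed to create deployment: {response.status_code}, {response.text}")
Two fields to note in the response: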
status.status: Current state of the deployment (e.g., STARTING, ONLINE).
externalAlias: The string used to reference your model via OpenAI-compatible calls.
4. Getting Deployment Status
Use the GET /dedicated/deployments/{deployment_id} endpoint to retrieve the details of your deployment, including its status.
deployment_id = 12345 # from creation step or UI
response = client.get(f"/dedicated/deployments/{deployment_id}")
if response.status_code == 200:
status_data = response.json()
model_status = status_data["status"]["status"] # e.g. "ONLINE", "PAUSED"
model_identifier = status_data["externalAlias"] # alias for OpenAI API calls
print(f"Deployment status: {model_status}")
print(f"OpenAI model identifier: {model_identifier}")
else:
print(f"Failed to get deployment: {response.status_code}, {response.text}")
Key fields in the response:
status["status"]: The overall state of the deployment.
externalAlias: The string you use in OpenAI Chat/Completion requests (model=).
Below are the possible values for status["status"]:
ONLINE: Deployment is active and ready to serve requests.
STARTING: Deployment is in the process of starting up.
PAUSED: Deployment is paused but can be resumed.
STOPPING: Deployment is in the process of shutting down.
OFFLINE: Deployment is not running and must be resumed to accept requests.
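A newly created or resumed deployment moves from STARTING to ONLINE before it accepts requests. A minimal polling sketch, assuming the same httpx client and deployment_id as above (the full example at the end of this page wraps the same idea in a wait_for_status helper):
import time
while True:
    data = client.get(f"/dedicated/deployments/{deployment_id}").json()
    current = data["status"]["status"]
    print(f"Current status: {current}")
    if current == "ONLINE":
        break
    time.sleep(30)  # poll roughly every 30 seconds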
5. Updating a Deployment
To change the configuration of an existing deployment (e.g., number of replicas):
GET the current deployment (as above).
Modify the returned JSON structure to include your updated parameters.
PUT the entire updated JSON back to the same endpoint (/dedicated/deployments/{deployment_id}).
Example: Scaling up the replicas
# 1. Fetch the current deployment
deployment_id = 12345
response = client.get(f"/dedicated/deployments/{deployment_id}")
current_deployment = response.json()
# 2. Modify the "replicas" field (e.g., from 1 to 2)
current_deployment["replicas"] = 2
# 3. Send the updated deployment object
update_response = client.put(
f"/dedicated/deployments/{deployment_id}",
json=current_deployment
)
if update_response.status_code == 200:
updated_deployment_data = update_response.json()
print("Deployment updated successfully.")
print(f"New replica count: {updated_deployment_data['replicas']}")
else:
print(f"Failed to update deployment: {update_response.status_code}, {update_response.text}")
6. Pausing and Resuming a Deployment
You may want to pause a deployment to reduce costs. Pausing puts the deployment into an OFFLINE state and stops billing. When ready, you can resume it to bring it back ONLINE. When the resume command is sent, you will need to poll the status to monitor for the transition from STARTING to ONLINE before the endpoint will accept requests.
# Pause the deployment
deployment_id = 12345
pause_response = client.post(f"/dedicated/deployments/{deployment_id}/pause")
if pause_response.status_code == 200:
print("Deployment paused successfully.")
else:
print(f"Failed to pause: {pause_response.status_code}, {pause_response.text}")
# Resume the deployment
resume_response = client.post(f"/dedicated/deployments/{deployment_id}/resume")
if resume_response.status_code == 200:
print("Deployment resumed successfully.")
else:
print(f"Failed to resume: {resume_response.status_code}, {resume_response.text}")
7. Using the Deployed Model via OpenAI-Compatible API
Once your deployment is ONLINE, you can send inference requests using the Chat API at https://api.parasail.io/v1. Use the externalAlias as the model parameter in your OpenAI calls.
Example using the OpenAI Python client:
import openai
# Create a client pointed at the Parasail Chat API (OpenAI Python SDK >= 1.0,
# matching the style used in the full example at the end of this page)
openai_client = openai.OpenAI(
    api_key=api_key,
    base_url="https://api.parasail.io/v1",
)
# Assume you fetched this alias from /dedicated/deployments/{deployment_id}
model_identifier = "parasail-custom-model-id"
response = openai_client.chat.completions.create(
    model=model_identifier,
    messages=[{"role": "user", "content": "What's the capital of France?"}],
)
print(response.choices[0].message.content)
Note: The model_identifier is the externalAlias from the deployment status.
8. Deleting a Deployment
To permanently delete an endpoint, send a DELETE request to /dedicated/deployments/{deployment_id}.
deployment_id = 12345
delete_response = client.delete(f"/dedicated/deployments/{deployment_id}")
if delete_response.status_code == 200:
print("Deployment deleted successfully.")
else:
print(f"Failed to delete deployment: {delete_response.status_code}, {delete_response.text}")
Example Python Code
import httpx
import openai
from pydantic import BaseModel
from typing import Optional
import os
import time
from dotenv import load_dotenv
# Load environment variables
env_path = os.path.expanduser("~/dev/parasail/.env")
if os.path.exists(env_path):
load_dotenv(env_path)
api_key = os.environ["PARASAIL_API_KEY"]
control_base_url = "https://api.parasail.io/api/v1"
chat_base_url = "https://api.parasail.io/v1"
class deviceConfig(BaseModel):
device: str
count: int
displayName: Optional[str] = None
cost: Optional[float] = None
estimatedSingleUserTps: Optional[float] = None
estimatedSystemTps: Optional[float] = None
recommended: Optional[bool] = None
limitedContext: Optional[bool] = None
available: Optional[bool] = None
selected: bool
class CreateModelStruct(BaseModel):
deploymentName: str
modelName: str
replicas: int
deviceConfigs: list[deviceConfig]
scaleDownPolicy: Optional[str] = None # NONE, TIMER, INACTIVE
scaleDownThreshold: Optional[int] = None # time in ms
draftModelName: Optional[str] = None
generative: Optional[bool] = None
embedding: Optional[bool] = None
contextLength: Optional[int] = None
def check_compatibility(parasail_client: httpx.Client, model_name: str):
params = {
"modelName": model_name,
"modelAccessKey": "", # HF token for privately hosted models
}
response = parasail_client.get("/dedicated/support", params=params)
if response.status_code == 200:
data = response.json()
if data.get("supported") is True:
print("✅ Model is supported.")
else:
reason = data.get("errorMessage", "No reason provided.")
raise ValueError(f"❌ Model not supported: {reason}")
else:
raise RuntimeError(
f"❌ Request failed with status code {response.status_code}: {response.text}"
)
def get_and_select_device_config(
parasail_client: httpx.Client,
model_name: str,
desired_device: str,
desired_count: int,
) -> list[dict]:
params = {"engineName": "VLLM", "modelName": model_name, "modelAccessKey": ""}
response = parasail_client.get("/dedicated/devices", params=params)
if response.status_code != 200:
raise RuntimeError(f"Request failed: {response.status_code} - {response.text}")
devices = response.json()
matched = False
for d in devices:
d["selected"] = (
d.get("device") == desired_device and d.get("count") == desired_count
)
matched = matched or d["selected"]
if not matched:
raise ValueError(
f"No matching device config for {desired_device} x{desired_count}"
)
return devices
def create_deployment(
parasail_client: httpx.Client, deployment: CreateModelStruct
) -> dict:
response = parasail_client.post(
"/dedicated/deployments", content=deployment.model_dump_json()
)
assert (
200 <= response.status_code < 300
), f"Failed to create deployment: {response.status_code} {response.text}"
return response.json()
def update_deployment(parasail_client: httpx.Client, deployment: dict) -> dict:
assert deployment["id"] is not None, "Deployment ID must be set for update"
response = parasail_client.put(
f"/dedicated/deployments/{deployment['id']}", json=deployment
)
assert (
200 <= response.status_code < 300
), f"Failed to update deployment: {response.status_code} {response.text}"
return response.json()
def pause_deployment(parasail_client: httpx.Client, id: int) -> None:
response = parasail_client.post(f"/dedicated/deployments/{id}/pause")
assert (
200 <= response.status_code < 300
), f"Failed to pause deployment: {response.status_code} {response.text}"
def resume_deployment(parasail_client: httpx.Client, id: int) -> None:
response = parasail_client.post(f"/dedicated/deployments/{id}/resume")
assert (
200 <= response.status_code < 300
), f"Failed to resume deployment: {response.status_code} {response.text}"
def get_deployment(parasail_client: httpx.Client, id: int) -> dict:
response = parasail_client.get(f"/dedicated/deployments/{id}")
assert (
200 <= response.status_code < 300
), f"Failed to get deployment: {response.status_code} {response.text}"
return response.json()
def is_deployment_paused(parasail_client: httpx.Client, id: int) -> bool:
deployment = get_deployment(parasail_client, id)
return deployment["status"]["status"] == "PAUSED"
def delete_deployment(parasail_client: httpx.Client, id: int) -> None:
response = parasail_client.delete(f"/dedicated/deployments/{id}")
assert (
200 <= response.status_code < 300
), f"Failed to delete deployment: {response.status_code} {response.text}"
def wait_for_status(
parasail_client: httpx.Client,
id: int,
status: str = "ONLINE",
timeout_minutes: int = 20,
) -> None:
start_time = time.time()
timeout_seconds = timeout_minutes * 60
while True:
deployment = get_deployment(parasail_client, id)
current_status = (
deployment["status"]["status"] if "status" in deployment else "Unknown"
)
if current_status == status:
print("Deployment is now online!")
return
elif current_status == "STARTING" or current_status == "STOPPING":
elapsed_minutes = (time.time() - start_time) / 60
print(
f"Deployment status: {current_status} (elapsed time: {elapsed_minutes:.1f} minutes)"
)
else:
raise ValueError(f"Unexpected deployment status: {current_status}")
if time.time() - start_time > timeout_seconds:
raise TimeoutError(
f"Deployment status did not change to {status} within {timeout_minutes} minutes"
)
time.sleep(30)
def test_deployment(
parasail_client: httpx.Client, openai_client: openai.Client, id: int
) -> None:
deployment = get_deployment(parasail_client, id)
assert "status" in deployment, "Deployment status is None"
assert (
deployment["status"]["status"] == "ONLINE"
), f"Deployment is not online: {deployment['status']['status']}"
chat_completion = openai_client.chat.completions.create(
model=deployment["externalAlias"],
messages=[{"role": "user", "content": "What's the capital of New York?"}],
)
assert len(chat_completion.choices) == 1, "No chat completion choices returned"
assert chat_completion.choices[0].message.content, "Empty chat completion response"
def main():
# Setup clients
model_name = "RedHatAI/Qwen2-72B-Instruct-FP8"
parasail_client = httpx.Client(
base_url=control_base_url,
headers={
"Authorization": "Bearer " + api_key,
"Content-Type": "application/json",
},
)
openai_client = openai.Client(api_key=api_key, base_url=chat_base_url)
check_compatibility(parasail_client, model_name)
devices = get_and_select_device_config(
parasail_client,
model_name=model_name,
desired_device="H100SXM",
desired_count=2,
)
# Create deployment configuration
deployment_config = CreateModelStruct(
deploymentName="scripting-example-deployment-3",
modelName=model_name,
deviceConfigs=devices,
replicas=1,
)
deployment = None
try:
print("Creating deployment...")
deployment = create_deployment(parasail_client, deployment_config)
print(f"Created deployment with ID: {deployment['id']}")
print("\nWaiting for deployment to come online.")
wait_for_status(parasail_client, deployment["id"], status="ONLINE")
print("\nTesting deployment with 1 replica...")
test_deployment(parasail_client, openai_client, deployment["id"])
print("Initial deployment tests passed successfully")
pause_deployment(parasail_client, deployment["id"])
wait_for_status(parasail_client, deployment["id"], status="OFFLINE")
resume_deployment(parasail_client, deployment["id"])
wait_for_status(parasail_client, deployment["id"], status="ONLINE")
print("\nTesting deployment with 1 replica...")
test_deployment(parasail_client, openai_client, deployment["id"])
print("Initial deployment tests passed successfully")
deployment = get_deployment(parasail_client, deployment["id"])
print("\nUpdating deployment to 2 replicas...")
deployment["replicas"] = 2
deployment = update_deployment(parasail_client, deployment)
print("Waiting for deployment to scale up...")
wait_for_status(parasail_client, deployment["id"], status="ONLINE")
print("\nTesting deployment with 2 replicas...")
test_deployment(parasail_client, openai_client, deployment["id"])
print("Scaled deployment tests passed successfully")
except Exception as e:
print(f"Error: {e}")
finally:
if deployment:
print("\nCleaning up - deleting deployment...")
delete_deployment(parasail_client, deployment["id"])
print("Deployment deleted")
if __name__ == "__main__":
main()
The /dedicated/support endpoint can be used to query whether a model is supported. The modelName parameter can be either a HuggingFace ID such as Qwen/Qwen2.5-72B-Instruct or a full HuggingFace model URL.