The Parasail Dedicated Service API allows you to deploy and manage dedicated endpoints via a REST-compatible API. Everything available in the UI for dedicated deployments is supported, including creating, pausing, resuming, scaling up or down, and retrieving status information.
Because the API provides no visibility into GPU performance recommendations, nor insight into errors or HuggingFace model compatibility issues, we recommend setting up and verifying the model through the platform UI before attempting to control a deployment with the API.
If you did not create the dedicated endpoint with the API, you can find the deployment_id in the Parasail SaaS platform URL. For example, if your deployment status page is https://www.saas.parasail.io/dedicated/33665, then the deployment_id is 33665. Use this ID in all API commands below.
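For instance, the numeric ID can be pulled off the end of the status-page URL (using the example URL above):

```python
# Extract the deployment_id from a dedicated status page URL
url = "https://www.saas.parasail.io/dedicated/33665"
deployment_id = int(url.rstrip("/").split("/")[-1])
print(deployment_id)  # 33665
```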
Base URLs
Control API (for managing deployments):
https://api.parasail.io/api/v1
Chat API (for inference requests via OpenAI-compatible endpoints):
https://api.parasail.io/v1
Authentication
All requests to the Control API must include a Bearer token in the Authorization header.
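A minimal sketch of the required headers (the key value here is a placeholder; substitute your own API key):

```python
api_key = "YOUR_PARASAIL_API_KEY"  # placeholder; use your real key

# Every Control API request carries these headers:
headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json",
}
# e.g. with httpx:
# client = httpx.Client(base_url="https://api.parasail.io/api/v1", headers=headers)
```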
1. Deployment Parameters
When creating or updating a deployment, you will provide a JSON body with the following fields:
deploymentName (string, required): Unique name for your deployment. May contain only lowercase letters, numbers, and dashes.
modelName (string, required): The HuggingFace identifier (e.g., org/model-name) of the model you want to deploy. For example: "shuyuej/Llama-3.2-1B-Instruct-GPTQ"
serviceTierName (string, required): The service tier to use for provisioning this deployment.
replicas (integer, required): Number of replicas to run in parallel.
scaleDownPolicy (string, optional): Auto-scaling policy: NONE, TIMER, or INACTIVE.
scaleDownThreshold (integer, optional): Time threshold in milliseconds before scaling down.
draftModelName (string, optional): (Advanced) Points to a draft model version.
generative (bool, optional): (Advanced) Indicates a generative model.
embedding (bool, optional): (Advanced) Indicates an embedding model.
contextLength (integer, optional): (Advanced) Context window length for LLMs.
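As an illustration of the scale-down fields, a deployment that scales down after 10 minutes of inactivity would include something like the following fragment (the specific values are hypothetical):

```python
# Hypothetical scale-down settings; note scaleDownThreshold is in milliseconds
autoscale_fields = {
    "scaleDownPolicy": "INACTIVE",         # one of NONE, TIMER, INACTIVE
    "scaleDownThreshold": 10 * 60 * 1000,  # 10 minutes in ms
}
```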
2. Creating a Deployment
Use the POST /dedicated/deployments endpoint to create a new model deployment. The response includes an id that you will need for future operations (status checks, updates, etc.). Once the deployment is created, poll its status and wait for the transition from STARTING to ONLINE before the endpoint will accept requests. The deployment will also appear on the dedicated status page of the platform UI.
status.status: Current state of the deployment (e.g., STARTING, ONLINE).
externalAlias: The string used to reference your model via OpenAI-compatible calls.
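A sketch of the request body, using the model from the parameter table (the tier name here is an assumption; use a tier available on your account):

```python
new_deployment = {
    "deploymentName": "my-example-deployment",  # lowercase letters, numbers, dashes
    "modelName": "shuyuej/Llama-3.2-1B-Instruct-GPTQ",
    "serviceTierName": "low-cost",              # assumed tier name
    "replicas": 1,
}
# With an httpx.Client configured as in the Authentication section:
# response = client.post("/dedicated/deployments", json=new_deployment)
# deployment_id = response.json()["id"]
```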
3. Getting Deployment Status
Use the GET /dedicated/deployments/{deployment_id} endpoint to retrieve the details of your deployment, including its status.
deployment_id = 12345  # from creation step or UI
response = client.get(f"/dedicated/deployments/{deployment_id}")
if response.status_code == 200:
    status_data = response.json()
    model_status = status_data["status"]["status"]  # e.g. "ONLINE", "PAUSED"
    model_identifier = status_data["externalAlias"]  # alias for OpenAI API calls
    print(f"Deployment status: {model_status}")
    print(f"OpenAI model identifier: {model_identifier}")
else:
    print(f"Failed to get deployment: {response.status_code}, {response.text}")
Key fields in the response:
status["status"]: The overall state of the deployment.
externalAlias: The string you use in OpenAI Chat/Completion requests (model=).
Below are the possible values for status["status"]:
ONLINE: Deployment is active and ready to serve requests.
STARTING: Deployment is in the process of starting up.
PAUSED: Deployment is paused but can be resumed.
STOPPING: Deployment is in the process of shutting down.
OFFLINE: Deployment is not running and must be resumed to accept requests.
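When polling, only STARTING and STOPPING are transitional; the other states are stable. A small helper capturing this distinction (the function name is ours, not part of the API):

```python
TRANSITIONAL_STATES = {"STARTING", "STOPPING"}

def is_settled(state: str) -> bool:
    """True once the deployment has reached a stable state (ONLINE, PAUSED, OFFLINE)."""
    return state not in TRANSITIONAL_STATES
```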
4. Updating a Deployment
To change the configuration of an existing deployment (e.g., number of replicas):
GET the current deployment (as above).
Modify the returned JSON structure to include your updated parameters.
PUT the entire updated JSON back to the same endpoint (/dedicated/deployments/{deployment_id}).
Example: Scaling up the replicas
# 1. Fetch the current deployment
deployment_id = 12345
response = client.get(f"/dedicated/deployments/{deployment_id}")
current_deployment = response.json()

# 2. Modify the "replicas" field (e.g., from 1 to 2)
current_deployment["replicas"] = 2

# 3. Send the updated deployment object
update_response = client.put(
    f"/dedicated/deployments/{deployment_id}",
    json=current_deployment,
)
if update_response.status_code == 200:
    updated_deployment_data = update_response.json()
    print("Deployment updated successfully.")
    print(f"New replica count: {updated_deployment_data['replicas']}")
else:
    print(f"Failed to update deployment: {update_response.status_code}, {update_response.text}")
5. Pausing and Resuming a Deployment
You may want to pause a deployment to reduce costs. Pausing puts the deployment into an OFFLINE state and stops billing. When ready, you can resume it to bring it back ONLINE. When the resume command is sent, you will need to poll the status to monitor for the transition from STARTING to ONLINE before the endpoint will accept requests.
# Pause the deployment
deployment_id = 12345
pause_response = client.post(f"/dedicated/deployments/{deployment_id}/pause")
if pause_response.status_code == 200:
    print("Deployment paused successfully.")
else:
    print(f"Failed to pause: {pause_response.status_code}, {pause_response.text}")

# Resume the deployment
resume_response = client.post(f"/dedicated/deployments/{deployment_id}/resume")
if resume_response.status_code == 200:
    print("Deployment resumed successfully.")
else:
    print(f"Failed to resume: {resume_response.status_code}, {resume_response.text}")
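Resuming is asynchronous, so a polling loop is needed before sending traffic again. A minimal sketch (fetch_status would wrap the GET call from section 3; the function and parameter names here are ours):

```python
import time

def wait_until(fetch_status, target="ONLINE", interval_s=30, timeout_s=20 * 60):
    """Poll fetch_status() until it returns target; raise on timeout."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if fetch_status() == target:
            return
        time.sleep(interval_s)
    raise TimeoutError(f"Deployment did not reach {target} in time")

# Usage with the client from earlier sections:
# wait_until(lambda: client.get(f"/dedicated/deployments/{deployment_id}").json()["status"]["status"])
```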
6. Using the Deployed Model via OpenAI-Compatible API
Once your deployment is ONLINE, you can send inference requests using the Chat API at https://api.parasail.io/v1. Use the externalAlias as the model parameter in your OpenAI calls.
Example using the OpenAI Python client:
from openai import OpenAI

openai_client = OpenAI(api_key=api_key, base_url="https://api.parasail.io/v1")

# Assume you fetched this alias from /dedicated/deployments/{deployment_id}
model_identifier = "parasail-custom-model-id"
response = openai_client.chat.completions.create(
    model=model_identifier,
    messages=[{"role": "user", "content": "What's the capital of France?"}],
)
print(response.choices[0].message.content)
Note: The model_identifier is the externalAlias from the deployment status.
7. Deleting a Deployment
To permanently delete an endpoint, send a DELETE request to /dedicated/deployments/{deployment_id}.
deployment_id = 12345
delete_response = client.delete(f"/dedicated/deployments/{deployment_id}")
if delete_response.status_code == 200:
    print("Deployment deleted successfully.")
else:
    print(f"Failed to delete deployment: {delete_response.status_code}, {delete_response.text}")
Example Python Code
import httpx
import openai
from pydantic import BaseModel
from typing import Optional
import os
import time
from dotenv import load_dotenv
# Load environment variables
env_path = os.path.expanduser("~/dev/parasail/.env")
if os.path.exists(env_path):
    load_dotenv(env_path)
api_key = os.environ["PARASAIL_API_KEY"]
control_base_url = "https://api.parasail.io/api/v1"
chat_base_url = "https://api.parasail.io/v1"
class CreateModelStruct(BaseModel):
    deploymentName: str
    modelName: str
    serviceTierName: str
    replicas: int
    scaleDownPolicy: Optional[str] = None  # NONE, TIMER, INACTIVE
    scaleDownThreshold: Optional[int] = None  # time in ms
    draftModelName: Optional[str] = None
    generative: Optional[bool] = None
    embedding: Optional[bool] = None
    contextLength: Optional[int] = None
def create_deployment(
    parasail_client: httpx.Client, deployment: CreateModelStruct
) -> dict:
    response = parasail_client.post(
        "/dedicated/deployments", content=deployment.model_dump_json()
    )
    assert (
        200 <= response.status_code < 300
    ), f"Failed to create deployment: {response.status_code}"
    return response.json()
def update_deployment(parasail_client: httpx.Client, deployment: dict) -> dict:
    assert deployment["id"] is not None, "Deployment ID must be set for update"
    response = parasail_client.put(
        f"/dedicated/deployments/{deployment['id']}", json=deployment
    )
    assert (
        200 <= response.status_code < 300
    ), f"Failed to update deployment: {response.status_code}"
    return response.json()
def pause_deployment(parasail_client: httpx.Client, id: int) -> None:
    response = parasail_client.post(f"/dedicated/deployments/{id}/pause")
    assert (
        200 <= response.status_code < 300
    ), f"Failed to pause deployment: {response.status_code}"

def resume_deployment(parasail_client: httpx.Client, id: int) -> None:
    response = parasail_client.post(f"/dedicated/deployments/{id}/resume")
    assert (
        200 <= response.status_code < 300
    ), f"Failed to resume deployment: {response.status_code}"
def get_deployment(parasail_client: httpx.Client, id: int) -> dict:
    response = parasail_client.get(f"/dedicated/deployments/{id}")
    assert (
        200 <= response.status_code < 300
    ), f"Failed to get deployment: {response.status_code}"
    return response.json()
def is_deployment_paused(parasail_client: httpx.Client, id: int) -> bool:
    deployment = get_deployment(parasail_client, id)
    return deployment["status"]["status"] == "PAUSED"
def delete_deployment(parasail_client: httpx.Client, id: int) -> None:
    response = parasail_client.delete(f"/dedicated/deployments/{id}")
    assert (
        200 <= response.status_code < 300
    ), f"Failed to delete deployment: {response.status_code}"
def wait_for_status(
    parasail_client: httpx.Client,
    id: int,
    status: str = "ONLINE",
    timeout_minutes: int = 20,
) -> None:
    start_time = time.time()
    timeout_seconds = timeout_minutes * 60
    while True:
        deployment = get_deployment(parasail_client, id)
        current_status = (
            deployment["status"]["status"] if "status" in deployment else "Unknown"
        )
        if current_status == status:
            print(f"Deployment is now {status}!")
            return
        elif current_status in ("STARTING", "STOPPING"):
            elapsed_minutes = (time.time() - start_time) / 60
            print(
                f"Deployment status: {current_status} (elapsed time: {elapsed_minutes:.1f} minutes)"
            )
        else:
            raise ValueError(f"Unexpected deployment status: {current_status}")
        if time.time() - start_time > timeout_seconds:
            raise TimeoutError(
                f"Deployment status did not change to {status} within {timeout_minutes} minutes"
            )
        time.sleep(30)
def test_deployment(
    parasail_client: httpx.Client, openai_client: openai.Client, id: int
) -> None:
    deployment = get_deployment(parasail_client, id)
    assert "status" in deployment, "Deployment status is None"
    assert (
        deployment["status"]["status"] == "ONLINE"
    ), f"Deployment is not online: {deployment['status']['status']}"
    chat_completion = openai_client.chat.completions.create(
        model=deployment["externalAlias"],
        messages=[{"role": "user", "content": "What's the capital of New York?"}],
    )
    assert len(chat_completion.choices) == 1, "No chat completion choices returned"
    assert chat_completion.choices[0].message.content, "Empty chat completion response"
def main():
    # Set up clients
    parasail_client = httpx.Client(
        base_url=control_base_url,
        headers={
            "Authorization": "Bearer " + api_key,
            "Content-Type": "application/json",
        },
    )
    openai_client = openai.Client(api_key=api_key, base_url=chat_base_url)

    # Create deployment configuration
    deployment_config = CreateModelStruct(
        deploymentName="scripting-example-deployment-2",
        modelName="shuyuej/Llama-3.2-1B-Instruct-GPTQ",
        serviceTierName="low-cost",
        replicas=1,
    )
    deployment = None
    try:
        print("Creating deployment...")
        deployment = create_deployment(parasail_client, deployment_config)
        print(f"Created deployment with ID: {deployment['id']}")

        print("\nWaiting for deployment to come online.")
        wait_for_status(parasail_client, deployment["id"], status="ONLINE")

        print("\nTesting deployment with 1 replica...")
        test_deployment(parasail_client, openai_client, deployment["id"])
        print("Initial deployment tests passed successfully")

        pause_deployment(parasail_client, deployment["id"])
        wait_for_status(parasail_client, deployment["id"], status="OFFLINE")
        resume_deployment(parasail_client, deployment["id"])
        wait_for_status(parasail_client, deployment["id"], status="ONLINE")

        print("\nTesting deployment after pause/resume cycle...")
        test_deployment(parasail_client, openai_client, deployment["id"])
        print("Pause/resume tests passed successfully")

        deployment = get_deployment(parasail_client, deployment["id"])
        print("\nUpdating deployment to 2 replicas...")
        deployment["replicas"] = 2
        deployment = update_deployment(parasail_client, deployment)
        print("Waiting for deployment to scale up...")
        wait_for_status(parasail_client, deployment["id"], status="ONLINE")

        print("\nTesting deployment with 2 replicas...")
        test_deployment(parasail_client, openai_client, deployment["id"])
        print("Scaled deployment tests passed successfully")
    except Exception as e:
        print(f"Error: {e}")
    finally:
        if deployment:
            print("\nCleaning up - deleting deployment...")
            delete_deployment(parasail_client, deployment["id"])
            print("Deployment deleted")

if __name__ == "__main__":
    main()