Dedicated Endpoint Management API

Parasail Dedicated Service API Documentation

Overview

The Parasail Dedicated Service API allows you to deploy and manage dedicated endpoints via a REST-compatible API. Everything available in the UI for dedicated deployments can be done through the API, including creating, pausing, resuming, scaling up or down, and retrieving status information.

Because the API provides no visibility into GPU performance recommendations, and no insight into errors or HuggingFace model compatibility issues, it is recommended that you set up and verify the model in the platform UI first before controlling a deployment with the API.

If you did not create the dedicated endpoint with the API, you can find the deployment_id from the Parasail SaaS platform URL. For example, if your deployment status page is https://www.saas.parasail.io/dedicated/33665, then the deployment_id is 33665. You can then use this for all API commands below.
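
If you need the ID in a script, it can be parsed from that URL. A minimal sketch (the helper below is our own, not part of the API):

from urllib.parse import urlparse

def deployment_id_from_url(url: str) -> int:
    # The dedicated status page path ends in /dedicated/<deployment_id>
    return int(urlparse(url).path.rstrip("/").split("/")[-1])

print(deployment_id_from_url("https://www.saas.parasail.io/dedicated/33665"))  # 33665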

Base URLs

  • Control API (for managing deployments): https://api.parasail.io/api/v1

  • Chat API (for inference requests via OpenAI-compatible endpoints): https://api.parasail.io/v1

Authentication

All requests to the Control API must include a Bearer token in the Authorization header. For example:

import httpx

api_key = "<YOUR_PARASAIL_API_KEY>"
control_base_url = "https://api.parasail.io/api/v1"

headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json"
}

client = httpx.Client(base_url=control_base_url, headers=headers)

1. Deployment Parameters

When creating or updating a deployment, you will provide a JSON body with the following fields:

  • deploymentName (string, required): Unique name for your deployment. This can only contain lowercase letters, numbers, and dashes.

  • modelName (string, required): The HuggingFace identifier (e.g., org/model-name) of the model you want to deploy. For example: "shuyuej/Llama-3.2-1B-Instruct-GPTQ"

  • serviceTierName (string, required): The tier to use for provisioning this deployment.

  • replicas (integer, required): Number of replicas to run in parallel.

  • scaleDownPolicy (string, optional): Auto-scaling policy: NONE, TIMER, or INACTIVE.

  • scaleDownThreshold (integer, optional): Time threshold in milliseconds before scaling down (e.g., 3600000 = 1 hour).

  • draftModelName (string, optional): (Advanced) Points to a draft model version.

  • generative (bool, optional): (Advanced) Indicates a generative model.

  • embedding (bool, optional): (Advanced) Indicates an embedding model.

  • contextLength (integer, optional): (Advanced) Context window length for LLMs.
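
For example, a request body that opts into inactivity-based scale-down might look like this (a sketch; the tier name and threshold are illustrative):

deployment_config = {
    "deploymentName": "myllama",
    "modelName": "shuyuej/Llama-3.2-1B-Instruct-GPTQ",
    "serviceTierName": "low-cost",
    "replicas": 1,
    "scaleDownPolicy": "INACTIVE",   # scale down when the endpoint sits idle
    "scaleDownThreshold": 1800000,   # 30 minutes, in milliseconds
}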


2. Creating a Deployment

Use the POST /dedicated/deployments endpoint to create a new model deployment. The response will include an id that you will need for future operations (status checks, updates, etc.). Once the deployment is created, poll its status and wait for the transition from STARTING to ONLINE before sending requests to the endpoint. The new deployment will also appear on the dedicated status page of the platform UI.

# Example: Creating a deployment
deployment_config = {
    "deploymentName": "myllama",
    "modelName": "shuyuej/Llama-3.2-1B-Instruct-GPTQ",
    "serviceTierName": "low-cost",
    "replicas": 1,
    # Optional parameters:
    # "scaleDownPolicy": "TIMER",
    # "scaleDownThreshold": 3600000,
}

response = client.post("/dedicated/deployments", json=deployment_config)

if response.is_success:  # any 2xx status code counts as success
    deployment_data = response.json()
    deployment_id = deployment_data["id"]
    print(f"Created deployment with ID: {deployment_id}")
else:
    print(f"Failed to create deployment: {response.status_code}, {response.text}")

Response Body Example:

{
  "id": 12345,
  "deploymentName": "myllama",
  "modelName": "shuyuej/Llama-3.2-1B-Instruct-GPTQ",
  "status": {
    "status": "STARTING",
    "message": "Deployment is being created"
  },
  "externalAlias": "myorganization-myllama",
  ...
}
  • id: Unique numeric identifier for the deployment.

  • status.status: Current state of the deployment (e.g., STARTING, ONLINE).

  • externalAlias: The string used to reference your model via OpenAI-compatible calls.
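
Because creation returns while the deployment is still STARTING, scripts typically poll until it reaches ONLINE. A minimal sketch (the helper name and 30-second interval are our own; the complete example at the end of this page includes a fuller version with a timeout):

import time

def wait_for_deployment_status(client: httpx.Client, deployment_id: int,
                               target: str = "ONLINE", poll_seconds: int = 30) -> None:
    # Poll GET /dedicated/deployments/{id} until the target state is reached.
    while True:
        deployment = client.get(f"/dedicated/deployments/{deployment_id}").json()
        current = deployment["status"]["status"]
        if current == target:
            return
        print(f"Deployment status: {current}, waiting...")
        time.sleep(poll_seconds)

wait_for_deployment_status(client, deployment_id)  # blocks until ONLINE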


3. Getting Deployment Status

Use the GET /dedicated/deployments/{deployment_id} endpoint to retrieve the details of your deployment, including its status.

deployment_id = 12345  # from creation step or UI
response = client.get(f"/dedicated/deployments/{deployment_id}")

if response.is_success:
    status_data = response.json()
    model_status = status_data["status"]["status"]       # e.g. "ONLINE", "PAUSED"
    model_identifier = status_data["externalAlias"]      # alias for OpenAI API calls
    print(f"Deployment status: {model_status}")
    print(f"OpenAI model identifier: {model_identifier}")
else:
    print(f"Failed to get deployment: {response.status_code}, {response.text}")

Key fields in the response:

  • status["status"]: The overall state of the deployment.

  • externalAlias: The string you use in OpenAI Chat/Completion requests (model=).

Below are the possible values for status["status"]:

  • ONLINE: Deployment is active and ready to serve requests.

  • STARTING: Deployment is in the process of starting up.

  • PAUSED: Deployment is paused but can be resumed.

  • STOPPING: Deployment is in the process of shutting down.

  • OFFLINE: Deployment is not running and must be resumed to accept requests.
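
In scripts, these states split naturally into ready (ONLINE), transitional (STARTING, STOPPING), and resumable (PAUSED, OFFLINE). Reusing the polling helper sketched above, one way to guarantee a deployment is serving (our own convenience wrapper, not an API endpoint):

def ensure_online(client: httpx.Client, deployment_id: int) -> None:
    # Resume a paused/offline deployment, then wait until it serves requests.
    state = client.get(f"/dedicated/deployments/{deployment_id}").json()["status"]["status"]
    if state in ("PAUSED", "OFFLINE"):
        client.post(f"/dedicated/deployments/{deployment_id}/resume")
    if state != "ONLINE":
        wait_for_deployment_status(client, deployment_id, target="ONLINE")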


4. Updating a Deployment

To change the configuration of an existing deployment (e.g., number of replicas):

  1. GET the current deployment (as above).

  2. Modify the returned JSON structure to include your updated parameters.

  3. PUT the entire updated JSON back to the same endpoint (/dedicated/deployments/{deployment_id}).

Example: Scaling up the replicas

# 1. Fetch the current deployment
deployment_id = 12345
response = client.get(f"/dedicated/deployments/{deployment_id}")
current_deployment = response.json()

# 2. Modify the "replicas" field (e.g., from 1 to 2)
current_deployment["replicas"] = 2

# 3. Send the updated deployment object
update_response = client.put(
    f"/dedicated/deployments/{deployment_id}",
    json=current_deployment
)

if update_response.is_success:
    updated_deployment_data = update_response.json()
    print("Deployment updated successfully.")
    print(f"New replica count: {updated_deployment_data['replicas']}")
else:
    print(f"Failed to update deployment: {update_response.status_code}, {update_response.text}")

5. Pausing and Resuming a Deployment

You may want to pause a deployment to reduce costs. Pausing puts the deployment into an OFFLINE state and stops billing. When ready, you can resume it to bring it back ONLINE. After sending the resume command, poll the status and wait for the transition from STARTING to ONLINE; the endpoint will not accept requests until it is ONLINE.

# Pause the deployment
deployment_id = 12345
pause_response = client.post(f"/dedicated/deployments/{deployment_id}/pause")
if pause_response.is_success:
    print("Deployment paused successfully.")
else:
    print(f"Failed to pause: {pause_response.status_code}, {pause_response.text}")

# Resume the deployment
resume_response = client.post(f"/dedicated/deployments/{deployment_id}/resume")
if resume_response.is_success:
    print("Deployment resumed successfully.")
else:
    print(f"Failed to resume: {resume_response.status_code}, {resume_response.text}")

6. Using the Deployed Model via OpenAI-Compatible API

Once your deployment is ONLINE, you can send inference requests using the Chat API at https://api.parasail.io/v1. Use the externalAlias as the model parameter in your OpenAI calls.

Example using the OpenAI Python client:

from openai import OpenAI

# The Chat API is OpenAI-compatible, so the standard client works unchanged
openai_client = OpenAI(api_key=api_key, base_url="https://api.parasail.io/v1")

# Assume you fetched this alias from /dedicated/deployments/{deployment_id}
model_identifier = "parasail-custom-model-id"

response = openai_client.chat.completions.create(
    model=model_identifier,
    messages=[{"role": "user", "content": "What's the capital of France?"}],
)

print(response.choices[0].message.content)

Note: The model_identifier is the externalAlias from the deployment status.
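
Standard OpenAI client features should work against the alias as well; for example, streaming (assuming the endpoint honors the stream flag, as most OpenAI-compatible servers do):

stream = openai_client.chat.completions.create(
    model=model_identifier,
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,  # assumption: the dedicated endpoint supports streaming
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)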


7. Deleting a Deployment

To permanently delete an endpoint, send a DELETE request to /dedicated/deployments/{deployment_id}.

deployment_id = 12345
delete_response = client.delete(f"/dedicated/deployments/{deployment_id}")

if delete_response.is_success:
    print("Deployment deleted successfully.")
else:
    print(f"Failed to delete deployment: {delete_response.status_code}, {delete_response.text}")

Example Python Code

import httpx
import openai
from pydantic import BaseModel
from typing import Optional
import os
import time
from dotenv import load_dotenv

# Load environment variables
env_path = os.path.expanduser("~/dev/parasail/.env")
if os.path.exists(env_path):
    load_dotenv(env_path)
api_key = os.environ["PARASAIL_API_KEY"]
control_base_url = "https://api.parasail.io/api/v1"
chat_base_url = "https://api.parasail.io/v1"


class CreateModelStruct(BaseModel):
    deploymentName: str
    modelName: str
    serviceTierName: str
    replicas: int
    scaleDownPolicy: Optional[str] = None  # NONE, TIMER, INACTIVE
    scaleDownThreshold: Optional[int] = None  # time in ms
    draftModelName: Optional[str] = None
    generative: Optional[bool] = None
    embedding: Optional[bool] = None
    contextLength: Optional[int] = None


def create_deployment(
    parasail_client: httpx.Client, deployment: CreateModelStruct
) -> dict:
    response = parasail_client.post(
        "/dedicated/deployments",
        content=deployment.model_dump_json(exclude_none=True),  # omit unset optional fields
    )
    assert (
        200 <= response.status_code < 300
    ), f"Failed to create deployment: {response.status_code}"
    return response.json()

def update_deployment(parasail_client: httpx.Client, deployment: dict) -> dict:
    assert deployment["id"] is not None, "Deployment ID must be set for update"
    response = parasail_client.put(
        f"/dedicated/deployments/{deployment['id']}", json=deployment
    )
    assert (
        200 <= response.status_code < 300
    ), f"Failed to update deployment: {response.status_code}"
    return response.json()

def pause_deployment(parasail_client: httpx.Client, id: int) -> None:
    response = parasail_client.post(f"/dedicated/deployments/{id}/pause")
    assert (
        200 <= response.status_code < 300
    ), f"Failed to pause deployment: {response.status_code}"

def resume_deployment(parasail_client: httpx.Client, id: int) -> None:
    response = parasail_client.post(f"/dedicated/deployments/{id}/resume")
    assert (
        200 <= response.status_code < 300
    ), f"Failed to resume deployment: {response.status_code}"

def get_deployment(parasail_client: httpx.Client, id: int) -> dict:
    response = parasail_client.get(f"/dedicated/deployments/{id}")
    assert (
        200 <= response.status_code < 300
    ), f"Failed to get deployment: {response.status_code}"
    return response.json()

def is_deployment_paused(parasail_client: httpx.Client, id: int) -> bool:
    deployment = get_deployment(parasail_client, id)
    return deployment["status"]["status"] == "PAUSED"

def delete_deployment(parasail_client: httpx.Client, id: int) -> None:
    response = parasail_client.delete(f"/dedicated/deployments/{id}")
    assert (
        200 <= response.status_code < 300
    ), f"Failed to delete deployment: {response.status_code}"

def wait_for_status(
    parasail_client: httpx.Client,
    id: int,
    status: str = "ONLINE",
    timeout_minutes: int = 20,
) -> None:
    start_time = time.time()
    timeout_seconds = timeout_minutes * 60
    while True:
        deployment = get_deployment(parasail_client, id)
        current_status = (
            deployment["status"]["status"] if "status" in deployment else "Unknown"
        )
        if current_status == status:
            print("Deployment is now online!")
            return
        elif current_status in ("STARTING", "STOPPING"):
            elapsed_minutes = (time.time() - start_time) / 60
            print(
                f"Deployment status: {current_status} (elapsed time: {elapsed_minutes:.1f} minutes)"
            )
        else:
            raise ValueError(f"Unexpected deployment status: {current_status}")
        if time.time() - start_time > timeout_seconds:
            raise TimeoutError(
                f"Deployment status did not change to {status} within {timeout_minutes} minutes"
            )
        time.sleep(30)

def test_deployment(
    parasail_client: httpx.Client, openai_client: openai.Client, id: int
) -> None:
    deployment = get_deployment(parasail_client, id)
    assert "status" in deployment, "Deployment status is None"
    assert (
        deployment["status"]["status"] == "ONLINE"
    ), f"Deployment is not online: {deployment['status']['status']}"

    chat_completion = openai_client.chat.completions.create(
        model=deployment["externalAlias"],
        messages=[{"role": "user", "content": "What's the capital of New York?"}],
    )
    assert len(chat_completion.choices) == 1, "No chat completion choices returned"
    assert chat_completion.choices[0].message.content, "Empty chat completion response"

def main():
    # Setup clients
    parasail_client = httpx.Client(
        base_url=control_base_url,
        headers={
            "Authorization": "Bearer " + api_key,
            "Content-Type": "application/json",
        },
    )
    openai_client = openai.Client(api_key=api_key, base_url=chat_base_url)

    # Create deployment configuration
    deployment_config = CreateModelStruct(
        deploymentName="scripting-example-deployment-2",
        modelName="shuyuej/Llama-3.2-1B-Instruct-GPTQ",
        serviceTierName="low-cost",
        replicas=1,
    )

    deployment = None
    try:
        print("Creating deployment...")
        deployment = create_deployment(parasail_client, deployment_config)
        print(f"Created deployment with ID: {deployment['id']}")

        print("\nWaiting for deployment to come online.")
        wait_for_status(parasail_client, deployment["id"], status="ONLINE")

        print("\nTesting deployment with 1 replica...")
        test_deployment(parasail_client, openai_client, deployment["id"])
        print("Initial deployment tests passed successfully")

        pause_deployment(parasail_client, deployment["id"])
        wait_for_status(parasail_client, deployment["id"], status="OFFLINE")
        resume_deployment(parasail_client, deployment["id"])
        wait_for_status(parasail_client, deployment["id"], status="ONLINE")

        print("\nTesting deployment with 1 replica...")
        test_deployment(parasail_client, openai_client, deployment["id"])
        print("Initial deployment tests passed successfully")

        deployment = get_deployment(parasail_client, deployment["id"])
        print("\nUpdating deployment to 2 replicas...")
        deployment["replicas"] = 2
        deployment = update_deployment(parasail_client, deployment)
        print("Waiting for deployment to scale up...")
        wait_for_status(parasail_client, deployment["id"], status="ONLINE")

        print("\nTesting deployment with 2 replicas...")
        test_deployment(parasail_client, openai_client, deployment["id"])
        print("Scaled deployment tests passed successfully")

    except Exception as e:
        print(f"Error: {e}")
    finally:
        if deployment:
            print("\nCleaning up - deleting deployment...")
            delete_deployment(parasail_client, deployment["id"])
            print("Deployment deleted")

if __name__ == "__main__":
    main()
