Capacity of Dedicated Serverless Endpoints

Autoscaling with Dedicated Serverless is a nuanced topic because:

  1. You are not sharing an endpoint with other customers, so there is not a common pool serving a model to spread load across.

  2. Yet you are still paying per token and do not have to worry about any of the underlying GPU infrastructure or orchestration software.

When managing traffic on a Dedicated Serverless endpoint, there are two factors to consider.

  1. The current capacity of the endpoint, measured in maximum concurrent requests being processed. Behind the scenes, Dedicated Serverless endpoints autoscale GPUs to maximize efficiency, and the autoscaling includes a 5-minute dampening delay to prevent thrashing of the infrastructure. This means the endpoint can only absorb a limited amount of burst traffic. Requests in excess of the current capacity receive 429 errors until the autoscaling kicks in.

  2. The total capacity of the endpoint, measured in maximum requests per minute. Every Dedicated Serverless endpoint has a limit to the autoscaling, which is set either through the Dedicated Serverless creation process or through a special arrangement with Parasail for higher limits.
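Because burst traffic above the current capacity is answered with 429s until autoscaling catches up, a client usually needs some retry logic. The following is a minimal sketch of exponential backoff with jitter, not a Parasail-provided client. The function name `call_with_backoff`, the `send` callable (assumed to return a status code and a body), and the default delays are all hypothetical and chosen for illustration.

```python
import random
import time
from typing import Callable, Tuple


def call_with_backoff(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 6,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns a non-429 status or attempts run out.

    `send` is a hypothetical callable that performs one request and
    returns (status_code, body). Delays are illustrative assumptions.
    """
    status, body = send()
    for attempt in range(1, max_attempts):
        if status != 429:
            break
        # Exponential backoff with full jitter, capped at max_delay.
        delay = min(max_delay, base_delay * 2 ** (attempt - 1))
        sleep(random.uniform(0, delay))
        status, body = send()
    return status, body
```

Since scale-up can take several minutes, a caller under sustained overload may still see a 429 after the last attempt and should be prepared to queue or shed the request.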

Current Capacity

Current capacity is displayed on the Dedicated Serverless status page. It is a dynamic number, determined behind the scenes by the number of active autoscaled replicas multiplied by the capacity of each replica. The capacity of each replica is determined by the model, the input characteristics, and the type of hardware used to deploy it. Because some of these factors may change, the capacity of each replica may also change.

For example, if each replica can handle 64 concurrent requests and there are 4 active replicas, the Dedicated Serverless status page will display a current capacity of 256 concurrent requests. Anything in excess of that will result in a 429 error. If, after 5 minutes of sustained traffic (the autoscaling trigger is always significantly less than the maximum replica capacity), an additional 4 replicas are activated, then approximately 5 minutes later the current capacity will be 512 concurrent requests.
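The arithmetic in this example can be written out as a short sketch. The function names are hypothetical, and the per-replica figure of 64 is taken from the example above rather than being a fixed property of any endpoint.

```python
def current_capacity(active_replicas: int, per_replica_concurrency: int) -> int:
    """Current capacity = active replicas x concurrent requests per replica."""
    return active_replicas * per_replica_concurrency


def rejected_requests(in_flight: int, capacity: int) -> int:
    """Number of concurrent requests above capacity, which would receive 429s."""
    return max(0, in_flight - capacity)


before = current_capacity(4, 64)   # 4 replicas, 64 requests each
after = current_capacity(8, 64)    # after 4 more replicas are activated
print(before, after)               # prints: 256 512
print(rejected_requests(300, before))  # prints: 44
```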

Total Capacity

Total capacity is a constant that is set either through the Dedicated Serverless creation process or through a special arrangement with sales/customer success. It represents the maximum autoscaling capacity of the endpoint and is shown as Maximum Requests per Minute on the Dedicated Serverless status page. This value changes only through intervention from Parasail or by selecting a higher-tier service plan.

For example, a high capacity endpoint of Qwen 32b may be shown as Maximum Requests Per Minute: 15,000
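A client that wants to stay under the Maximum Requests per Minute figure can throttle itself before sending. The sketch below is one way to do that with a sliding 60-second window. The class name `MinuteRateLimiter` is hypothetical, and the sliding-window behaviour is an assumption: the source does not say how the endpoint measures its per-minute limit.

```python
import time
from collections import deque
from typing import Callable, Deque


class MinuteRateLimiter:
    """Client-side sliding-window limiter (illustrative, not a Parasail API)."""

    def __init__(
        self,
        max_per_minute: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_per_minute = max_per_minute
        self.clock = clock
        self._sent: Deque[float] = deque()  # timestamps of recent requests

    def try_acquire(self) -> bool:
        """Return True and record the request if it fits in the last 60 s."""
        now = self.clock()
        # Drop timestamps that have left the 60-second window.
        while self._sent and now - self._sent[0] >= 60.0:
            self._sent.popleft()
        if len(self._sent) < self.max_per_minute:
            self._sent.append(now)
            return True
        return False


# Example: the 15,000 value is the figure shown in the example above.
limiter = MinuteRateLimiter(max_per_minute=15_000)
if limiter.try_acquire():
    pass  # send the request here
```

The injectable `clock` parameter exists only to make the limiter easy to test without waiting for real time to pass.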
