Capacity of Dedicated Serverless Endpoints
Autoscaling with Dedicated Serverless is a nuanced topic because:
You are not sharing an endpoint with other customers, so there is no common pool serving the model across which load can be spread.
Yet you are still paying per token and do not have to worry about any of the underlying GPU infrastructure or orchestration software.
Despite being a serverless product, the underlying infrastructure that runs AI models requires load limits to maintain quality of service and uptime. We employ two load limits: concurrency limits (how many requests are actively being processed at a given time), which most often relate to bursty traffic and change as the endpoint autoscales, and maximum requests per minute, a static limit that most often relates to steadier traffic.
The following three factors are shown on the Dedicated Serverless status page.
Current Concurrency Limit
The current capacity of the endpoint, measured in maximum concurrent requests being processed.
Fully Autoscaled Concurrency Limit
Once the endpoint is autoscaled to its maximum, this represents the maximum concurrent requests that an endpoint can process.
Max Requests per Minute
A static rate limit for the endpoint that controls the maximum amount of steady-state traffic. This is determined by the product tier or negotiated agreement and is not related to autoscaling.
If your traffic is bursty, you may hit concurrency limits before you hit the maximum requests per minute.
If your traffic grows smoothly, you may hit maximum requests per minute before you hit any concurrency limits.
To illustrate this point, if the maximum requests per minute of an endpoint is 10,000 but all 10,000 arrive at the exact same time, the endpoint will issue 429s due to concurrency limits. If the requests are more spread out, however, then no 429s will be issued, since the traffic did not exceed the rate limit.
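The burst-versus-steady distinction above can also be managed client-side. A minimal sketch (the `send_request` function is a placeholder for a real HTTP call, and the limit value is made up for illustration; use the Current Concurrency Limit shown on your Dedicated Serverless status page) caps in-flight requests with a semaphore so a burst queues locally instead of drawing 429s:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical value for illustration; use the "Current Concurrency
# Limit" shown on your Dedicated Serverless status page.
CONCURRENCY_LIMIT = 8

gate = threading.Semaphore(CONCURRENCY_LIMIT)

def send_request(payload):
    # Placeholder for the real HTTP call to the endpoint.
    return {"ok": True, "payload": payload}

def send_with_cap(payload):
    # Blocks while CONCURRENCY_LIMIT requests are already in flight,
    # holding the burst locally instead of letting it hit the endpoint.
    with gate:
        return send_request(payload)

# 100 tasks submitted at once still reach the endpoint at most 8 at a time.
with ThreadPoolExecutor(max_workers=32) as pool:
    results = list(pool.map(send_with_cap, range(100)))
```

The same idea applies in any async client: a bounded semaphore or worker pool turns a burst into a steady stream that stays under the concurrency limit.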
Concurrency Limits in Detail
Current capacity is displayed on the Dedicated Serverless status page and is a dynamic number: behind the scenes it is determined by the number of active autoscaled replicas multiplied by the capacity of each replica. The capacity of each replica is determined by the model, the input characteristics, and the type of hardware used to deploy it. Because some of these factors may change, the capacity of each replica may also change.
Behind the scenes, dedicated serverless endpoints autoscale GPUs to maximize efficiency, and the autoscaling has a 5-minute dampening delay to prevent thrashing of the infrastructure. This means the endpoint can only handle a certain amount of burst traffic; anything in excess will receive 429s before the autoscaling kicks in.
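Because of the dampening delay, bursts can draw 429s for several minutes before new replicas come online, so clients should retry 429s with backoff rather than failing. A minimal sketch, assuming a placeholder `call_endpoint` that stands in for the real HTTP call (here it simulates two 429s before the endpoint scales up):

```python
import random
import time

def call_endpoint(payload):
    # Placeholder for the real HTTP call; returns a status code.
    # Simulates an endpoint that 429s twice while autoscaling catches up.
    call_endpoint.attempts += 1
    return 429 if call_endpoint.attempts <= 2 else 200

call_endpoint.attempts = 0

def call_with_backoff(payload, max_retries=6, base_delay=0.01):
    # Retry on 429 with jittered exponential backoff. In production a
    # base delay of a second or more is more appropriate, since
    # autoscaling takes minutes to activate additional replicas.
    for attempt in range(max_retries):
        status = call_endpoint(payload)
        if status != 429:
            return status
        time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
    raise RuntimeError("gave up after repeated 429s")

status = call_with_backoff({"prompt": "hello"})
```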
For example, if each replica can handle 64 concurrent requests and there are 4 active replicas, the Dedicated Serverless status page will display a current capacity of 256 concurrent requests; anything in excess of that will result in a 429 error. If sustained traffic triggers autoscaling (the autoscaling trigger is always significantly below the maximum replica capacity) and an additional 4 replicas are activated, then approximately 5 minutes later the current capacity will be 512 concurrent requests.
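The arithmetic in this example can be written down directly. A small sketch (the numbers are the illustrative ones above, not fixed properties of any model or hardware):

```python
def current_capacity(active_replicas, per_replica_concurrency):
    # The current concurrency limit shown on the status page is the
    # number of active replicas times each replica's capacity.
    return active_replicas * per_replica_concurrency

before_scaling = current_capacity(4, 64)  # 4 replicas x 64 = 256
after_scaling = current_capacity(8, 64)   # 4 more replicas activate -> 512
```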
A graph of the number of concurrent requests can be viewed in the Dedicated Serverless status page under the "Num Running Requests" tab.

Max Requests per Minute in Detail
Max requests per minute is a constant factor not tied to autoscaling that is either set through the Dedicated Serverless creation process or through a special arrangement with sales/customer success. It represents the maximum steady-state capacity of the endpoint and is shown as Maximum Requests per Minute on the Dedicated Serverless status page. It will not change without intervention from Parasail or by selecting a higher-tier service plan.
For example, a high-capacity endpoint serving Qwen 32b may show Maximum Requests Per Minute: 15,000.