Auto-Scaling

To enable auto-scaling, select the autoscaling checkbox.

Auto-scaling works in conjunction with the Max Concurrent Requests value that you set, along with the following settings:

Max Concurrent Requests: This is the maximum number of requests that each replica handles at once. Once a replica reaches this number, it begins rejecting additional requests. The trade-off is that a higher value can increase latency, while a lower value causes more requests to be rejected.
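As a rough illustration of the rejection behavior described above, the sketch below models one replica with a concurrency cap. The class `ReplicaLimiter` and its methods are hypothetical names used only for illustration; they are not part of the product's API, and the real rejection mechanism may differ.

```python
import threading


class ReplicaLimiter:
    """Hypothetical model of one replica's Max Concurrent Requests cap."""

    def __init__(self, max_concurrent: int) -> None:
        if max_concurrent < 1:
            raise ValueError("max_concurrent must be at least 1")
        self.max_concurrent = max_concurrent
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Admit a request if the replica is below its cap, else reject it."""
        with self._lock:
            if self._in_flight >= self.max_concurrent:
                return False  # the request is rejected
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Mark one admitted request as finished."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1


# Three requests arrive at once on a replica capped at two.
limiter = ReplicaLimiter(max_concurrent=2)
results = [limiter.try_acquire() for _ in range(3)]
print(results)  # [True, True, False]
```

Raising the cap admits the third request, at the cost of more work queued on the same replica, which is where the extra latency comes from.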

Target Concurrent Requests: This is the number of concurrent requests at which the system begins spinning up another replica. Set it below Max Concurrent Requests. The lower the value, and the more aggressive the Smoothing Factor, the sooner auto-scaling is triggered.
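To see why a lower target triggers scaling sooner, the sketch below uses a common autoscaling rule: divide the total concurrent requests by the per-replica target and round up. This rule is an assumption made for illustration; the product's actual formula is not documented here, and `desired_replicas` is a hypothetical helper.

```python
import math


def desired_replicas(total_concurrent: float, target_per_replica: int) -> int:
    """Replicas needed so each one sits at or below the target (assumed rule)."""
    if target_per_replica < 1:
        raise ValueError("target_per_replica must be at least 1")
    # Always keep at least one replica in this simplified model.
    return max(1, math.ceil(total_concurrent / target_per_replica))


# The same load of 30 concurrent requests under two target settings.
print(desired_replicas(30, target_per_replica=10))  # 3
print(desired_replicas(30, target_per_replica=5))   # 6
```

Under this assumed rule, halving the target doubles the number of replicas requested for the same load.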

Smoothing Factor: This controls the moving average of concurrent requests, and therefore how long the request load must stay above or below the target before a new replica is brought up or an existing one is scaled down.

Conservative: Responds to changes slowly
Moderate: Balanced responsiveness
Aggressive: Reacts to changes quickly
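The effect of the three presets can be sketched with an exponential moving average, used here as a stand-in for whatever moving average the product applies. The weights in `SMOOTHING_ALPHA` are made-up values chosen for illustration, and both function names are hypothetical.

```python
# Hypothetical smoothing weights; the product's real values are not documented here.
SMOOTHING_ALPHA = {"conservative": 0.1, "moderate": 0.3, "aggressive": 0.6}


def smooth(samples, preset):
    """Return the exponential moving average after each sample."""
    alpha = SMOOTHING_ALPHA[preset]
    average = samples[0]
    history = [average]
    for sample in samples[1:]:
        average = alpha * sample + (1 - alpha) * average
        history.append(average)
    return history


def samples_until_target(samples, preset, target):
    """Index of the first sample whose smoothed value exceeds the target, or None."""
    for index, value in enumerate(smooth(samples, preset)):
        if value > target:
            return index
    return None


# Load jumps from 2 to 10 concurrent requests and stays there; the target is 6.
load = [2] + [10] * 20
for preset in SMOOTHING_ALPHA:
    print(preset, samples_until_target(load, preset, target=6))
```

With these assumed weights, the aggressive preset crosses the target after one sample of the higher load, the moderate preset after two, and the conservative preset after seven.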

Auto Scaling Range for Replicas: This is the minimum and maximum number of replicas. The minimum is the baseline that the system keeps running, and the maximum is the most it will scale up to.
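The range acts as a clamp on whatever replica count the autoscaler would otherwise choose. A minimal sketch, with `clamp_replicas` as a hypothetical helper rather than a product function:

```python
def clamp_replicas(desired: int, min_replicas: int, max_replicas: int) -> int:
    """Keep the desired replica count inside the configured range."""
    if min_replicas > max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")
    return max(min_replicas, min(desired, max_replicas))


# With a range of 2 to 8 replicas:
print(clamp_replicas(0, 2, 8))   # 2, never drops below the baseline
print(clamp_replicas(5, 2, 8))   # 5, inside the range
print(clamp_replicas(12, 2, 8))  # 8, capped at the maximum
```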

Note: Whatever options you have set for hardware selection will be used for autoscaling.

