Auto-Scaling

To enable auto-scaling, select the Autoscaling checkbox.

Auto-scaling works in conjunction with the Max Concurrent Requests value you set, along with the settings below:

Max Concurrent Requests: The maximum number of requests each replica handles at once. Once a replica reaches this number, it begins rejecting additional requests. The trade-off is that a higher value can increase latency, while a lower value causes more requests to be rejected.
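To illustrate the rejection behavior described above, here is a minimal sketch of a per-replica concurrency cap. The `Replica` class and its method names are hypothetical and only model the idea; they are not part of the product's API.

```python
class Replica:
    """Toy model of one replica with a concurrency cap (hypothetical)."""

    def __init__(self, max_concurrent_requests: int):
        if max_concurrent_requests < 1:
            raise ValueError("max_concurrent_requests must be at least 1")
        self.max_concurrent_requests = max_concurrent_requests
        self.in_flight = 0  # requests currently being processed

    def try_accept(self) -> bool:
        """Accept a request if below the cap; otherwise reject it."""
        if self.in_flight >= self.max_concurrent_requests:
            return False  # at the cap: this request is rejected
        self.in_flight += 1
        return True

    def finish(self) -> None:
        """Mark one in-flight request as completed."""
        if self.in_flight > 0:
            self.in_flight -= 1
```

With a cap of 2, the first two calls to `try_accept()` succeed and a third is rejected until `finish()` frees a slot. A higher cap lets more requests queue on the same hardware, which is where the added latency comes from.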

Target Concurrent Requests: The number of concurrent requests that triggers auto-scaling to add another replica. Set this below Max Concurrent Requests.
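The relationship between the two thresholds can be sketched as follows. The function names are hypothetical, and whether the trigger fires at the target or only strictly above it is not stated here; this sketch assumes "at or above".

```python
def validate_thresholds(target_concurrent_requests: int,
                        max_concurrent_requests: int) -> None:
    """Check that the scale-up trigger sits below the rejection cap."""
    if target_concurrent_requests < 1:
        raise ValueError("target must be at least 1")
    if target_concurrent_requests >= max_concurrent_requests:
        raise ValueError("target must be below max concurrent requests")


def should_add_replica(concurrent_per_replica: float,
                       target_concurrent_requests: int) -> bool:
    """Assumed rule: scale up once per-replica concurrency reaches the target."""
    return concurrent_per_replica >= target_concurrent_requests
```

Keeping the target below the maximum leaves headroom: a new replica can start coming up before existing replicas reach the point where they reject requests.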

Smoothing Factor: Controls the moving average of concurrent requests, which determines how long the request level must stay above or below the target before a replica is brought up or scaled down.

Conservative: Responds to changes slowly
Moderate: Balanced responsiveness
Aggressive: Reacts to changes quickly
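One way to picture the three settings is as weights in a moving average. The sketch below uses an exponential moving average, which is an assumption made for illustration; the exact averaging method and the weight values are not documented here, and the numbers in `SMOOTHING_WEIGHTS` are hypothetical.

```python
# Hypothetical weights: a larger weight trusts the newest sample more.
SMOOTHING_WEIGHTS = {
    "conservative": 0.1,
    "moderate": 0.3,
    "aggressive": 0.6,
}


def smooth(previous_average: float, observed: float, mode: str) -> float:
    """Blend the newest concurrency sample into the running average."""
    alpha = SMOOTHING_WEIGHTS[mode]
    return alpha * observed + (1 - alpha) * previous_average
```

Under these assumed weights, if the average is 10 and a sample of 20 arrives, the Aggressive setting moves the average much further toward 20 than Conservative does, so it crosses the target sooner and scales sooner. Conservative needs the higher load to persist over more samples before it reacts.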

Auto Scaling Range for Replicas: The minimum and maximum number of replicas. The minimum is the baseline the system keeps running, and the maximum is the most it will scale up to.
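The range acts as a clamp on whatever replica count the autoscaler would otherwise choose. The sketch below assumes the wanted count is the smoothed total concurrency divided by the target, rounded up; that formula and the function name are illustrative assumptions, not the documented algorithm.

```python
import math


def desired_replicas(smoothed_total_concurrency: float,
                     target_concurrent_requests: int,
                     min_replicas: int,
                     max_replicas: int) -> int:
    """Compute a replica count and clamp it to the configured range."""
    if target_concurrent_requests < 1:
        raise ValueError("target must be at least 1")
    if min_replicas > max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")
    # Assumed rule: enough replicas to keep each one at or under the target.
    wanted = math.ceil(smoothed_total_concurrency / target_concurrent_requests)
    # The range is a hard floor and ceiling on the result.
    return max(min_replicas, min(max_replicas, wanted))
```

For example, with a target of 10 and a range of 1 to 4, a smoothed load of 45 concurrent requests would call for 5 replicas but is capped at 4, while a load of 0 still keeps the minimum of 1 running.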

Note: The options you have set for hardware selection also apply to replicas created by auto-scaling.

