Dedicated Endpoints

Dedicated endpoints are private endpoints that, in a few clicks, can run most transformer models on HuggingFace across our full range of GPU offerings at the best prices on the market. Dedicated endpoints are ideal for creating production endpoints with full control over performance and quality of service (QoS), and for experimenting with models that are not available on public serverless endpoints.

Advantages of Dedicated Endpoints:

  • Great for both experimenting and deploying to production: run most transformers on HuggingFace by simply entering the URL.

  • Run custom models hosted in your private HuggingFace repo by simply entering an access token.

  • Select from a variety of hardware types — 4090s up to H200s — to match your cost and performance needs.

  • Scale your instance up as traffic grows with load-balanced replicas.

  • Hit aggressive latency targets through methods like GPU selection, model quantization, and speculative decoding.

Great Customer Value: Our dedicated model service delivers strong value through easy on-demand usage, even for large-scale deployments, with no contracts or commitments and the lowest hourly GPU rates anywhere. You can pause and resume models at any time, and billing only accrues while the model is active.

Model Compatibility: Behind the scenes, we work with a variety of inference stacks to maintain the best model compatibility and performance. If a model launch is coordinated with the community (e.g. Llama models), we will have 0-day support. If a model is dropped unannounced, it can take a few days to a week to add support. If a model is released and we do not offer support, please let us know on our Discord [LINK] or through our support channels and we will investigate.

We currently support most transformers, while support for diffusion models and more conventional AI models like CNNs or LSTMs is further down our roadmap. Some models are also not supported because they contain local code execution in the model repo; however, we are able to whitelist trusted model providers, so please contact us if you would like us to investigate a model support issue.

Creating a Dedicated Endpoint

After you click the Dedicated tab, the configuration page is shown:

Deployment Name:

Guide me (deprecated): this feature, which is being deprecated, allowed click-through selection of popular AI models.

HuggingFace ID / URL: this is where you specify the model to load. Currently we only support models hosted either publicly or privately on HuggingFace. If your model is local, it's very easy to upload it to a private HuggingFace repo. For public models, you can specify either the URL or the model ID.

For example, for the FP8 version of Qwen 2 72B Instruct, you can specify:

https://huggingface.co/neuralmagic/Qwen2-72B-Instruct-FP8 or neuralmagic/Qwen2-72B-Instruct-FP8

FYI, HuggingFace has a convenient copy button next to the model ID.

HuggingFace Token: this allows you to load a model hosted in a private HuggingFace repo. This is ideal for custom models fine-tuned on private data. For more information on generating a token, please read Deploying Private HF models.
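
If your model only exists locally, the sketch below shows one way to push it to a private HuggingFace repo and confirm that your token can read it before you enter the token here. It uses the huggingface_hub library; the repo ID, folder path, and token are placeholders.

```python
# Sketch: push a local model to a private HuggingFace repo and check token access.
# The repo ID, folder path, and token below are placeholders -- substitute your own.
from huggingface_hub import HfApi

api = HfApi(token="hf_xxx")  # token with write access to your account or org

# Create the private repo (no-op if it already exists) and upload the local weights.
api.create_repo(repo_id="your-org/my-finetuned-model", private=True, exist_ok=True)
api.upload_folder(folder_path="./my-finetuned-model", repo_id="your-org/my-finetuned-model")

# Confirm the token can see the private repo before pasting it into the endpoint form.
info = api.model_info("your-org/my-finetuned-model")
print(info.id, "private:", info.private)
```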

If the model is supported, you'll get a green message. If the model is not supported, please contact us on Discord or through your support channel and we will investigate.

After you select your model, you can select a hardware service tier. In closed beta, we pool the hardware into low-cost, low-latency, and ultra-low-latency tiers. As we approach launch, we will support the selection of specific GPU instances such as H100s and H200s. Once you are ready, hit the Deploy button and the endpoint will go live. If you want to set advanced options, click the Advanced dropdown.

Once the model is deployed, you will be taken to the Dedicated Status Page. If the model weights have been cached previously, the model will take 3-5 minutes to start. If the model has to be downloaded from HuggingFace, it could take anywhere from 10 minutes to multiple hours to download the weights.
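
Once the status page shows the endpoint is running, it is worth sending a quick smoke test. The sketch below assumes the endpoint exposes an OpenAI-compatible chat completions route; the URL, API key, and payload shape are placeholders, so use whatever your endpoint's status page or API docs actually show.

```python
# Sketch: smoke-test a freshly deployed endpoint.
# ASSUMPTION: an OpenAI-compatible /v1/chat/completions route; URL and key are placeholders.
import requests

ENDPOINT_URL = "https://<your-endpoint-host>/v1/chat/completions"  # placeholder
API_KEY = "YOUR_API_KEY"                                            # placeholder

resp = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "neuralmagic/Qwen2-72B-Instruct-FP8",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```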

Advanced Options

Speculative Decoding Draft Model Name: This allows you to set a draft model for faster output token speeds with speculative decoding. This is an advanced feature that requires some care, so please read this guide before using it: Speeding up Dedicated Models with Speculative Decoding.

Replicas: This spins up multiple replicas that are automatically load balanced. This is designed for high-throughput production deployments; the hourly cost is multiplied by the number of replicas.

Model Mode: Most models will be autodetected; however, some models support multiple modes (such as GritLM, which can act as both a generative and an embedding model). If autodetection creates the wrong type of endpoint, this setting can be used to force a specific mode.

Scale Down Policy: If you are using this model for experimenting or R&D, we highly recommend setting a scale-down policy to shut the model down after a certain amount of time. We've heard plenty of horror stories from users of other providers who racked up big bills from a forgotten endpoint.

Max concurrent connections per replica: If the number of concurrent requests reaches this limit, the endpoint will return a 429 Too Many Requests error until there is free capacity. This is necessary to prevent too many users from overloading an endpoint and drastically reducing quality of service. A client-side pattern for handling this is sketched below.
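
On the client side, the usual way to cope with 429s is to back off and retry instead of failing the request outright. This is a minimal sketch; the URL, API key, and payload are placeholders, and only the 429 handling is the point.

```python
# Sketch: retry with exponential backoff when a replica returns 429 Too Many Requests.
# URL, API key, and payload are placeholders -- only the 429 handling matters here.
import time
import requests

def post_with_backoff(url, payload, api_key, max_retries=5):
    delay = 1.0
    for _ in range(max_retries):
        resp = requests.post(
            url,
            headers={"Authorization": f"Bearer {api_key}"},
            json=payload,
            timeout=60,
        )
        if resp.status_code != 429:   # anything other than "too many requests"
            resp.raise_for_status()   # surface other HTTP errors
            return resp.json()
        time.sleep(delay)             # endpoint is saturated: wait, then try again
        delay *= 2
    raise RuntimeError("Endpoint still busy after retries; consider adding replicas.")
```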

Dedicated Status Page

Once a dedicated model is deployed, you can control and view the metrics of a deployment on the status page.

The Pause button will temporarily deactivate the endpoint and pause billing. You can also Edit the model settings and redeploy. The Destroy button will permanently delete the endpoint.
