Speeding up Dedicated Models with Speculative Decoding
Speculative decoding is a powerful inference technique that can yield a 1.5-2X improvement in output tokens per second with little overhead. A lightweight model, called the draft model, runs ahead of the main target model and proposes candidate next tokens, which the target model then verifies in a single forward pass. On the Parasail platform, setting up speculative decoding with the dedicated service is straightforward: the draft model runs on the same GPU as the target model, so no additional hardware resources are needed.
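To make the mechanism concrete, here is a minimal, illustrative sketch of the draft-and-verify loop. The two "models" are toy stand-in functions, not real LLMs and not Parasail's implementation; the point is the control flow: the cheap draft proposes several tokens, and the expensive target accepts the matching prefix and corrects the first divergence, so the output is identical to decoding with the target alone.

```python
# Illustrative sketch only: the two "models" below are toy next-token
# functions, not real LLMs, and this is not Parasail's implementation.

def draft_next(token: int) -> int:
    """Cheap draft model: fast, but only approximates the target."""
    return (token * 2 + 1) % 100

def target_next(token: int) -> int:
    """Expensive target model: defines the correct output."""
    # Agrees with the draft except when the token is divisible by 7.
    return (token * 2 + 1) % 100 if token % 7 else (token + 3) % 100

def speculative_step(token: int, k: int = 4) -> list[int]:
    """Draft k tokens ahead, then let the target verify them."""
    # 1) The draft model proposes k tokens autoregressively (cheap).
    proposals, t = [], token
    for _ in range(k):
        t = draft_next(t)
        proposals.append(t)

    # 2) The target verifies the proposals (in one batched forward pass in
    #    a real system, which is where the speedup comes from). We keep the
    #    longest correct prefix and emit one corrected token at the first
    #    divergence, so the output matches plain target decoding exactly.
    accepted, t = [], token
    for p in proposals:
        expected = target_next(t)
        if p != expected:
            accepted.append(expected)
            break
        accepted.append(p)
        t = p
    return accepted

tokens = [1]
while len(tokens) < 20:
    tokens.extend(speculative_step(tokens[-1]))
print(tokens)
```

When the draft agrees with the target, each step accepts several tokens for a single verification pass; when it diverges, you still make one token of progress, which is why the worst case is no slower than ordinary decoding.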
In this example, we will set up speculative decoding for the FP8 variant of Qwen2-VL-72B (nm-testing/Qwen2-VL-72B-Instruct-FP8-dynamic). The draft model must be significantly smaller and lighter than the target model AND it must share the same vocabulary, which in practice means it must come from the same family of models. The draft model we will use in this example is Qwen/Qwen2-7B-Instruct-GPTQ-Int4. Even though it is not the VL vision variant, it is in the Qwen2 family, and both models have a vocab size of 152064 (the smaller Qwen2-1.5B would otherwise be a better choice, but it does not have the same vocab size).
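You can verify vocabulary compatibility yourself with the transformers library. A small convenience check, assuming you have a recent transformers release installed (one that knows the Qwen2-VL architecture):

```python
# Compatibility check: the draft and target must report the same vocab
# size, or the drafted token IDs won't line up with the target's.
from transformers import AutoConfig

target = AutoConfig.from_pretrained("nm-testing/Qwen2-VL-72B-Instruct-FP8-dynamic")
draft = AutoConfig.from_pretrained("Qwen/Qwen2-7B-Instruct-GPTQ-Int4")

print("target vocab:", target.vocab_size)  # 152064
print("draft vocab: ", draft.vocab_size)   # 152064
assert target.vocab_size == draft.vocab_size, "incompatible draft model"
```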
The out-of-the-box throughput for Qwen2-VL-72B on a single NVIDIA H200 is around 30 output tokens per second (TPS). Let's see how much we can speed that up. First, create a dedicated model:
Choose a unique name for this deployment:
Enter the Hugging Face ID or URL for the target model, nm-testing/Qwen2-VL-72B-Instruct-FP8-dynamic, and choose the low-cost tier:
Under advanced options, you can enter the Hugging Face ID or URL of the draft model, in this case Qwen/Qwen2-7B-Instruct-GPTQ-Int4. Also select a 1-hour scale-down policy:
Hit deploy, and you will be taken to the model dashboard. While the model is being set up, its status will show as 🟡 Starting.
When the model status turns green 🟢, we are ready to test the performance.
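One simple way to measure throughput is to time a completion against the deployment's OpenAI-compatible endpoint and read the token count from the response's usage field. A rough sketch using the openai Python client; the base URL, environment variable, and model name below are placeholders, so substitute the values shown on your dashboard:

```python
# Rough throughput check against an OpenAI-compatible endpoint.
# The base_url, env var, and model name are placeholders; use the
# values from your Parasail model dashboard.
import os
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://api.parasail.io/v1",      # placeholder endpoint
    api_key=os.environ["PARASAIL_API_KEY"],     # placeholder env var
)

start = time.monotonic()
response = client.chat.completions.create(
    model="my-qwen2-vl-72b-spec-decode",        # your deployment name
    messages=[{"role": "user", "content": "Write a 500-word story about a robot."}],
    max_tokens=512,
)
elapsed = time.monotonic() - start

out_tokens = response.usage.completion_tokens
print(f"{out_tokens} tokens in {elapsed:.1f}s -> {out_tokens / elapsed:.1f} TPS")
```

For a cleaner number you would stream the response and exclude time to first token, but wall-clock time over a long completion is close enough for a sanity check.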
The output tokens per second jumped from 30 to 76 on the same NVIDIA H200, roughly a 2.5X improvement.
Other target/draft combinations from the same model family that share a vocabulary include:

| Target model | Draft model |
| --- | --- |
| Qwen2.5-72B-Instruct | Qwen/Qwen2.5-7B-Instruct-AWQ |
| Qwen2-72B-Instruct | Qwen/Qwen2-7B-Instruct-GPTQ-Int4 (the draft used above) |
Future content: show additional combinations of target models and draft models that work well.