Speeding up Dedicated Models with Speculative Decoding
Speculative decoding is a powerful inference technique that can yield a 1.5-2X improvement in output tokens per second with little overhead. A lightweight model, called the draft model, runs ahead of the main target model and proposes candidate next tokens, which the target model then verifies in a single forward pass. On the Parasail platform, setting up speculative decoding with the dedicated service is straightforward: the draft model runs on the same GPU as the target model, so no additional hardware resources are needed.
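To make the mechanism concrete, here is a minimal, illustrative sketch of the draft-and-verify loop. The two "models" are toy stand-in functions, not real LLMs and not Parasail's implementation; the point is the control flow: the cheap draft proposes several tokens, and the expensive target accepts the matching prefix and corrects the first divergence, so the output is identical to decoding with the target alone.

```python
# Illustrative sketch only: the two "models" below are toy next-token
# functions, not real LLMs, and this is not Parasail's implementation.

def draft_next(token: int) -> int:
    """Cheap draft model: fast, but only approximates the target."""
    return (token * 2 + 1) % 100

def target_next(token: int) -> int:
    """Expensive target model: defines the correct output."""
    # Agrees with the draft except when the token is divisible by 7.
    return (token * 2 + 1) % 100 if token % 7 else (token + 3) % 100

def speculative_step(token: int, k: int = 4) -> list[int]:
    """Draft k tokens ahead, then let the target verify them."""
    # 1) The draft model proposes k tokens autoregressively (cheap).
    proposals, t = [], token
    for _ in range(k):
        t = draft_next(t)
        proposals.append(t)

    # 2) The target verifies the proposals (in one batched forward pass in
    #    a real system, which is where the speedup comes from). We keep the
    #    longest correct prefix and emit one corrected token at the first
    #    divergence, so the output matches plain target decoding exactly.
    accepted, t = [], token
    for p in proposals:
        expected = target_next(t)
        if p != expected:
            accepted.append(expected)
            break
        accepted.append(p)
        t = p
    return accepted

tokens = [1]
while len(tokens) < 20:
    tokens.extend(speculative_step(tokens[-1]))
print(tokens)
```

When the draft agrees with the target, each step accepts several tokens for a single verification pass; when it diverges, you still make one token of progress, which is why the worst case is no slower than ordinary decoding.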
In this example, we will set up speculative decoding for the FP8 variant of Qwen2-VL-72B (nm-testing/Qwen2-VL-72B-Instruct-FP8-dynamic). The draft model must be significantly smaller and lighter than the target model AND it must share the same vocabulary, which in practice means it must come from the same family of models. The draft model we will use in this example is Qwen/Qwen2-7B-Instruct-GPTQ-Int4. Even though it is not the VL vision variant, it is in the Qwen2 family, and both models have a vocab size of 152064 (the smaller Qwen2-1.5B would otherwise be a better choice, but it does not have the same vocab size).
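You can verify vocabulary compatibility yourself with the transformers library. A small convenience check, assuming you have a recent transformers release installed (one that knows the Qwen2-VL architecture):

```python
# Compatibility check: the draft and target must report the same vocab
# size, or the drafted token IDs won't line up with the target's.
from transformers import AutoConfig

target = AutoConfig.from_pretrained("nm-testing/Qwen2-VL-72B-Instruct-FP8-dynamic")
draft = AutoConfig.from_pretrained("Qwen/Qwen2-7B-Instruct-GPTQ-Int4")

print("target vocab:", target.vocab_size)  # 152064
print("draft vocab: ", draft.vocab_size)   # 152064
assert target.vocab_size == draft.vocab_size, "incompatible draft model"
```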
The out-of-the-box throughput for Qwen2-VL-72B on a single NVIDIA H200 is around 30 output tokens per second (TPS). Let's see how much we can speed that up. First, create a dedicated model:
Choose a unique name for this deployment:
Enter the Hugging Face ID or URL for the target model, nm-testing/Qwen2-VL-72B-Instruct-FP8-dynamic, and choose the low-cost tier:
Under advanced options, you can enter the Hugging Face ID or URL of the draft model, in this case Qwen/Qwen2-7B-Instruct-GPTQ-Int4. Also select a 1-hour scale-down policy:
Hit deploy, and you will be taken to the model dashboard. While the model is being set up, its status will show as 🟡 Starting.
When the model status turns green 🟢, we are ready to test the performance.
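One simple way to measure throughput is to time a completion against the deployment's OpenAI-compatible endpoint and read the token count from the response's usage field. A rough sketch using the openai Python client; the base URL, environment variable, and model name below are placeholders, so substitute the values shown on your dashboard:

```python
# Rough throughput check against an OpenAI-compatible endpoint.
# The base_url, env var, and model name are placeholders; use the
# values from your Parasail model dashboard.
import os
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://api.parasail.io/v1",      # placeholder endpoint
    api_key=os.environ["PARASAIL_API_KEY"],     # placeholder env var
)

start = time.monotonic()
response = client.chat.completions.create(
    model="my-qwen2-vl-72b-spec-decode",        # your deployment name
    messages=[{"role": "user", "content": "Write a 500-word story about a robot."}],
    max_tokens=512,
)
elapsed = time.monotonic() - start

out_tokens = response.usage.completion_tokens
print(f"{out_tokens} tokens in {elapsed:.1f}s -> {out_tokens / elapsed:.1f} TPS")
```

For a cleaner number you would stream the response and exclude time to first token, but wall-clock time over a long completion is close enough for a sanity check.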
The output tokens per second jumped from 30 to 76 on the same NVIDIA H200, roughly a 2.5X improvement.
Other target/draft combinations from the same model family that share a vocabulary include:

| Target model | Draft model |
| --- | --- |
| Qwen2.5-72B-Instruct | Qwen/Qwen2.5-7B-Instruct-AWQ |
| Qwen2-72B-Instruct | Qwen/Qwen2-7B-Instruct-GPTQ-Int4 (the draft used above) |
Future content: show additional combinations of target models and draft models that work well.