We recommend FP8 quantization for dedicated models: it reduces model memory requirements by roughly 2x and improves inference speed by up to 1.6x, with minimal impact on accuracy.
The easiest and recommended approach is the FP8_DYNAMIC scheme. With this scheme no calibration data is needed, because activations are quantized dynamically at runtime.
Here is an example of how to use llm-compressor to quantize a model:
# pip install llmcompressor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForCausalLM, AutoTokenizer
MODEL_ID = "YOUR_HUGGINGFACE_MODEL_HERE"
# Load model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# Configure the quantization recipe: quantize all Linear layers to FP8 with
# dynamic activation scales, keeping lm_head in higher precision.
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)
oneshot(model=model, recipe=recipe)
# Confirm generations of the quantized model look sane.
print("========== SAMPLE GENERATION ==============")
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output[0]))
print("==========================================")
# Save to disk in compressed-tensors format
save_path = MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")