# FP8 Quantization

We recommend using FP8 quantization for dedicated models which allows for a 2x reduction in model memory requirements and up to 1.6x improvement in inference speed with minimal impact on accuracy.

The easiest and recommended way to quantize the model is using [llm-compressor](https://github.com/vllm-project/llm-compressor) with `FP8_DYNAMIC` scheme. In this setting, there is no need for any calibration data since the activations are quantized dynamically.

Here is an example of how to use `llm-compressor` to quantize a model:

```python
# pip install llmcompressor

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "YOUR_HUGGINGFACE_MODEL_HERE"

# Load model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Apply quantization.
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)
oneshot(model=model, recipe=recipe)

# Confirm generations of the quantized model look sane.
print("========== SAMPLE GENERATION ==============")
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output[0]))
print("==========================================")

# Save to disk in compressed-tensors format
save_path = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
```


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.parasail.io/parasail-docs/dedicated-instance/fp8-quantization.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
