Multi-modal (or multimodal) refers to systems or methods that can process or integrate multiple types of data, typically from different modalities such as text, image, audio, video, and other sensory inputs. Multi-modal systems aim to understand and generate outputs based on this diverse, merged information, mimicking human perception, where we simultaneously use different senses (e.g., seeing, hearing, and reading) to interpret the world.
In machine learning and AI, multi-modal models combine data from different modalities to improve the performance of tasks like classification, generation, captioning, and more. For example, an AI model that processes both images and text could interpret and respond to visual and written information in a cohesive and context-aware manner.
Loading Images
The OpenAI-compatible multi-modal API expects images as data URLs, which embed the base64-encoded image data directly in the HTTP request. Here is Python notebook code that loads JPEG and PNG images in the correct format and displays them in the notebook.
import base64
from IPython.display import Image, display
# Set image type - JPEG or PNG
jpeg = False
# Determine file path and MIME type based on selection
if jpeg:
    image_url = "https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/blue_flowers.jpg"
    file_path = "/content/blue_flowers.jpg"
    mime_type = "image/jpeg"
else:
    image_url = "https://www.filesampleshub.com/download/image/png/sample3.png"
    file_path = "/content/sample3.png"
    mime_type = "image/png"
# Download the image
!wget {image_url} -O {file_path}
def load_image_data_url(img_file_path, img_mime_type):
    # Read the file and convert to base64
    with open(img_file_path, "rb") as f:
        return f"data:{img_mime_type};base64," + base64.b64encode(f.read()).decode("ascii")
img_data_url = load_image_data_url(file_path, mime_type)  # OpenAI-compatible APIs expect data URLs
display(Image(file_path))  # Notebook display expects file paths
Sending a Multimodal LLM Request
To send a multimodal request to an OpenAI-compatible LLM endpoint, add an element with type image_url to the content list of a message in the messages field.
Here is notebook code that will run Qwen2.5 VL 72B on the sample image.
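A minimal sketch of such a request using the openai Python client is shown below; the base_url, API key handling, and exact model identifier are assumptions and should be replaced with the values for your provider (for example, the Parasail gateway discussed later).

from openai import OpenAI

# Assumed OpenAI-compatible endpoint and credentials - replace with your provider's values
client = OpenAI(
    base_url="https://api.example.com/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",                 # placeholder key
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",   # assumed model identifier
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is the most defining feature of this image?"},
                {"type": "image_url", "image_url": {"url": img_data_url}},
            ],
        }
    ],
)
print(response.choices[0].message.content)

Run against the sample image, the model returns a description along these lines: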
The most defining feature of this image is the naturally formed archway in a large, reddish-brown rock formation. This distinctive rock arch is a focal point, standing out against the sandy desert landscape and clear blue sky. The arch's unique shape and the way it frames the sky create a striking and memorable visual element.
Sending Multiple Images
To process multiple images, you can simply add multiple image_url items to the content list. The images are referenced in the order they appear in the content list, and can be referred to by that order in the prompt.
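For example, here is a sketch of a request that compares a flowers image with a desert image. The second file path is a placeholder (the original desert image is not shown here), and the client and model identifier are the same assumptions as in the earlier sketch.

# Hypothetical: both images have already been downloaded locally
flowers_data_url = load_image_data_url("/content/blue_flowers.jpg", "image/jpeg")
arch_data_url = load_image_data_url("/content/rock_arch.jpg", "image/jpeg")  # placeholder path

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",  # assumed model identifier
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe each image, then list their similarities and differences."},
                {"type": "image_url", "image_url": {"url": flowers_data_url}},  # first image
                {"type": "image_url", "image_url": {"url": arch_data_url}},     # second image
            ],
        }
    ],
)
print(response.choices[0].message.content)

A comparison prompt like this produces a response along these lines: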
The first image depicts vibrant, translucent flowers with water droplets, surrounded by reflective water and a bokeh effect in the background. The second image showcases a desert landscape with a large natural rock arch and vast sandy dunes under a clear sky.
### Similarities:
- Both images are visually striking and emphasize natural beauty.
- They feature prominent subjects (flowers and rock formation) with detailed textures.
- The backgrounds in both images are defocused to draw attention to the main subjects.
### Differences:
- The first image is rich in color with cooler tones (blues, purples), while the second is dominated by warm hues (browns, oranges).
- The settings are entirely different: one is a watery, floral environment, and the other is a dry, arid desert.
- The first image exudes a sense of freshness and moisture, whereas the second conveys heat and dryness.
Sending Video
Qwen2-VL and Qwen2.5-VL support the video_url content type, where the data URL is a series of base64-encoded PNG frames separated by commas. The Parasail gateway supports up to 20MB in a single API request, so the frame count and resolution must be chosen to fit within this limit.
The following notebook code shows how to load an MP4 video, reorder its channels into the layout Torch expects, resize the frames, and assemble the data URL.
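A sketch of that workflow is shown below. It reads frames with torchvision's read_video (which returns a (T, H, W, C) uint8 tensor), subsamples and resizes them with Pillow rather than Torch transforms, joins the base64 PNG frames with commas, and reuses the client from the earlier sketch. The sample path, frame cap, target height, and data-URL prefix are assumptions chosen to stay within the 20MB request limit, not values from the original notebook.

import base64
import io

from PIL import Image as PILImage
from torchvision.io import read_video

def load_video_data_url(video_path, max_frames=16, target_height=360):
    # read_video returns video frames as a (T, H, W, C) uint8 tensor
    frames, _, _ = read_video(video_path, pts_unit="sec")

    # Subsample frames evenly so the encoded request stays under the 20MB limit
    step = max(1, frames.shape[0] // max_frames)
    frames = frames[::step][:max_frames]

    encoded_frames = []
    for frame in frames:
        # Convert each frame to a PIL image, shrink it, and encode it as base64 PNG
        img = PILImage.fromarray(frame.numpy())
        new_width = int(img.width * target_height / img.height)
        img = img.resize((new_width, target_height))
        buf = io.BytesIO()
        img.save(buf, format="PNG")
        encoded_frames.append(base64.b64encode(buf.getvalue()).decode("ascii"))

    # Assumed data-URL prefix; the base64 PNG frames are joined with commas as described above
    return "data:image/png;base64," + ",".join(encoded_frames)

video_data_url = load_video_data_url("/content/sample_video.mp4")  # hypothetical local file

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",  # assumed model identifier
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what happens in this video."},
                {"type": "video_url", "video_url": {"url": video_data_url}},
            ],
        }
    ],
)
print(response.choices[0].message.content)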