Multi-Modal
Multimodal (or Multi-Modal) refers to systems or methods that can process or integrate multiple types of data, typically from different modalities such as text, image, audio, video, and other sensory inputs. Multi-modal systems aim to understand and generate outputs based on this diverse, merged information, mimicking human perception, where we simultaneously use different senses (for example, seeing, hearing, and reading) to interpret the world.
In machine learning and AI, multi-modal models combine data from different modalities to improve the performance of tasks like classification, generation, captioning, and more. For example, an AI model that processes both images and text could interpret and respond to visual and written information in a cohesive and context-aware manner.
Quickstart
The simplest way to use a Vision Language Model is through the OpenAI-compatible API, embedding image inputs as base64-encoded data URLs.
Below is an example using the Qwen2.5-VL 72B model with Parasail's serverless API. GLM 4.5V also works with this example.
!pip install openai > /dev/null

import base64
import openai
from IPython.display import Image, display

# Read the API key from Colab secrets if available, otherwise from the environment.
try:
    from google.colab import userdata
    API_KEY = userdata.get('PARASAIL_API_KEY')
except Exception:
    import os
    API_KEY = os.getenv('PARASAIL_API_KEY')

# Download a sample image and preview it.
!wget https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Flower_poster_2.jpg/1540px-Flower_poster_2.jpg -O flower.jpg &> /dev/null
display(Image("flower.jpg", width=500))

# Encode a local JPEG file as a base64 data URL.
def data_url(image):
    with open(image, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode("ascii")

oai_client = openai.OpenAI(base_url="https://api.parasail.io/v1", api_key=API_KEY)

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": data_url("flower.jpg")}},
        {"type": "text", "text": "What's in the image?"},
    ],
}]

chat_completion = oai_client.chat.completions.create(
    model="parasail-qwen25-vl-72b-instruct", messages=messages)
print(chat_completion.choices[0].message.content)
As shown in the output, the model demonstrates strong visual understanding and OCR ability.
Vision Capabilities
Limitations
Our public serverless multimodal LLMs support up to 8 input images per request. For private dedicated deployments, the model supports 2 input images on RTX 4090s and 8 input images on A100s, H100s, and H200s. If you need a different configuration, please contact us.
Multiple Images: Document Visual Question Answering Example
To process multiple images, simply add multiple image_url items to the content list, as shown in the sketch below. The images are referenced in the order they appear in the content list and in the prompt.
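As a minimal sketch (reusing data_url and oai_client from the quickstart above; page1.jpg and page2.jpg are hypothetical placeholders for your own document pages):

# Two document pages sent in a single request.
messages = [{
    "role": "user",
    "content": [
        # Images are referenced in the order they appear here.
        {"type": "image_url", "image_url": {"url": data_url("page1.jpg")}},
        {"type": "image_url", "image_url": {"url": data_url("page2.jpg")}},
        {"type": "text", "text": "Summarize the first image, then explain how the second image relates to it."},
    ],
}]
chat_completion = oai_client.chat.completions.create(
    model="parasail-qwen25-vl-72b-instruct", messages=messages)
print(chat_completion.choices[0].message.content)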

Video Understanding
Qwen2-VL and Qwen2.5-VL support the video_url content type, where the data URL is a series of base64-encoded PNG frames separated by commas. The Parasail gateway supports up to 20 MB in a single API request, so take care to keep the frame count and resolution within this limit.
The following notebook shows how to load an MP4 video, reorder the channels into the correct Torch format, resize the frames, and compile the data URL.
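As a rough sketch of that flow (assuming a recent torchvision and Pillow; clip.mp4 is a placeholder file name, and the frame count, target height, and data-URL prefix are assumptions chosen to match the format described above and to stay under the 20 MB limit):

import base64
import io

import torch
from PIL import Image
from torchvision.io import read_video

def video_data_url(path, num_frames=16, height=360):
    # output_format="THWC" yields uint8 frames with channels last, so no
    # manual permute from Torch's default TCHW layout is needed before PIL.
    frames, _, _ = read_video(path, pts_unit="sec", output_format="THWC")
    # Sample num_frames evenly across the clip.
    idx = torch.linspace(0, frames.shape[0] - 1, num_frames).long()
    encoded = []
    for i in idx:
        img = Image.fromarray(frames[i].numpy())
        # Downscale to keep the total request under the 20 MB gateway limit.
        width = int(img.width * height / img.height)
        buf = io.BytesIO()
        img.resize((width, height)).save(buf, format="PNG")
        encoded.append(base64.b64encode(buf.getvalue()).decode("ascii"))
    # Assumed prefix; the frames themselves are joined by commas as described
    # above. Check the original notebook for the exact format.
    return "data:image/png;base64," + ",".join(encoded)

# Assumed message shape, by analogy with image_url.
messages = [{
    "role": "user",
    "content": [
        {"type": "video_url", "video_url": {"url": video_data_url("clip.mp4")}},
        {"type": "text", "text": "Describe what happens in this video."},
    ],
}]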
Object Detection and Localization
Models like the Qwen2.5-VL series and PaliGemma 2 are trained with object detection and localization capabilities. Simply prompt the model with a description of the objects of interest, and it can output bounding boxes in the requested format.
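As a rough sketch of this prompting pattern (reusing oai_client, data_url, and flower.jpg from the quickstart; the prompt wording and the label/bbox_2d JSON keys follow the Qwen2.5-VL cookbooks and may differ for other models):

# Sketch only: ask the model to localize objects and return JSON boxes.
detection_messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": data_url("flower.jpg")}},
        {"type": "text", "text": (
            "Detect every flower in the image and output the result as JSON, "
            "one object per entry with keys 'label' and 'bbox_2d' "
            "([x1, y1, x2, y2] in pixel coordinates)."
        )},
    ],
}]
response = oai_client.chat.completions.create(
    model="parasail-qwen25-vl-72b-instruct", messages=detection_messages)
print(response.choices[0].message.content)

If the model follows the requested format, the response can be parsed with json.loads and drawn over the image.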
Here is an example using the same Transformer paper image:

More resources
https://github.com/QwenLM/Qwen2.5-VL/tree/main/cookbooks
https://ai.google.dev/gemma/docs/capabilities/vision/prompt-with-visual-data
https://github.com/bytedance/UI-TARS
https://docs.mistral.ai/capabilities/vision/