Multimodal GUI Task Agent
Multimodal models such as UI-TARS and Qwen-VL have shown a remarkable ability to understand and interact with GUIs to perform tasks based on natural language instructions. Common use cases include universal voice control, automated UI testing, workflow automation, and more.
This document outlines how to use multimodal models for general computer use and for browser-specific operations.
UI-TARS-Desktop is an application designed to enable a multimodal agent to control your desktop environment to perform tasks based on natural language inputs.
To use UI-TARS-Desktop with the UI-TARS-1.5 model and Parasail as the provider, simply import a preset config file:
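A minimal sketch of such a preset is shown below. The field names follow the preset format described in the UI-TARS-Desktop repository; the Parasail base URL and the API key value are assumptions you should verify against Parasail's own documentation.

```yaml
# Sketch of a UI-TARS-Desktop preset file.
# Verify field names against the current UI-TARS-Desktop docs before use.
name: Parasail UI-TARS-1.5 Preset
language: en
vlmProvider: Parasail                     # assumed provider label
vlmBaseUrl: https://api.parasail.io/v1    # assumed Parasail OpenAI-compatible endpoint
vlmApiKey: your-parasail-api-key          # placeholder; substitute your real key
vlmModelName: parasail-ui-tars-1p5-7b     # model name as used later in this document
```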
For detailed installation, usage instructions, and examples, please refer to the official UI-TARS-Desktop repository: https://github.com/bytedance/UI-TARS-desktop.
Midscene.js is a JavaScript framework designed to facilitate the interaction of AI agents with web browsers. It allows an agent to perceive a web page and execute actions (clicks, typing, navigation), typically by connecting to a VLM backend.
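To give a feel for the API, here is a minimal sketch using Midscene's Puppeteer integration. The target URL and the instruction are illustrative, and the model configuration is assumed to come from environment variables (see the options below).

```typescript
import puppeteer from 'puppeteer';
import { PuppeteerAgent } from '@midscene/web/puppeteer';

// Launch a browser and open a page for the agent to control.
const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
await page.goto('https://www.example.com'); // illustrative target page

// The agent perceives the page through the configured VLM and turns the
// natural-language instruction into concrete browser actions.
const agent = new PuppeteerAgent(page);
await agent.aiAction('click the "More information" link'); // illustrative instruction

await browser.close();
```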
The easiest way to get started with Midscene.js is using its Chrome extension.
Here are examples of how you might configure Midscene.js to use different VLMs hosted by Parasail:
Option 1: Using a Qwen-VL based model (e.g., parasail-qwen25-vl-72b-instruct)
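A sketch of the environment configuration for this option. The base URL assumes Parasail exposes an OpenAI-compatible endpoint at the address shown (verify against Parasail's docs); `MIDSCENE_USE_QWEN_VL` enables Midscene's Qwen-VL mode.

```bash
# .env — sketch; base URL and API key are placeholders to verify
OPENAI_BASE_URL="https://api.parasail.io/v1"          # assumed Parasail endpoint
OPENAI_API_KEY="your-parasail-api-key"
MIDSCENE_MODEL_NAME="parasail-qwen25-vl-72b-instruct"
MIDSCENE_USE_QWEN_VL=1                                # use Midscene's Qwen-VL mode
```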
Option 2: Using a UI-TARS based model (e.g., parasail-ui-tars-1p5-7b)
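A sketch of the environment configuration for this option, under the same endpoint assumption as above. The `MIDSCENE_USE_VLM_UI_TARS` value of `1.5` matches UI-TARS-1.5 in recent Midscene versions; check the Midscene docs for the value your version expects.

```bash
# .env — sketch; base URL and API key are placeholders to verify
OPENAI_BASE_URL="https://api.parasail.io/v1"          # assumed Parasail endpoint
OPENAI_API_KEY="your-parasail-api-key"
MIDSCENE_MODEL_NAME="parasail-ui-tars-1p5-7b"
MIDSCENE_USE_VLM_UI_TARS=1.5                          # mark the model as UI-TARS-1.5
```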
For detailed setup, API documentation, and examples of how to integrate and use Midscene.js for browser automation, please consult its official repository: https://github.com/web-infra-dev/midscene.