Multimodal GUI Task Agent
Multimodal models such as UI-TARS and Qwen-VL have shown a remarkable ability to understand and interact with GUIs to perform tasks based on natural language instructions. Common use cases include universal voice control, automated UI testing, workflow automation, and more.
This document outlines how to use multimodal models for general computer use and for browser-specific operations.
UI-TARS-Desktop is an application designed to enable a multimodal agent to control your desktop environment to perform tasks based on natural language inputs.
To use UI-TARS-Desktop with the UI-TARS-1.5 model and Parasail as the provider, simply import a preset config file:
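A minimal sketch of such a preset is shown below. The field names follow the preset format described in the UI-TARS-Desktop repository; the Parasail base URL and the API key value are assumptions you should verify against Parasail's own documentation.

```yaml
# Sketch of a UI-TARS-Desktop preset file.
# Verify field names against the current UI-TARS-Desktop docs before use.
name: Parasail UI-TARS-1.5 Preset
language: en
vlmProvider: Parasail                     # assumed provider label
vlmBaseUrl: https://api.parasail.io/v1    # assumed Parasail OpenAI-compatible endpoint
vlmApiKey: your-parasail-api-key          # placeholder; substitute your real key
vlmModelName: parasail-ui-tars-1p5-7b     # model name as used later in this document
```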
For detailed installation, usage instructions, and examples, please refer to the official UI-TARS-Desktop repository: https://github.com/bytedance/UI-TARS-desktop.
Midscene.js is a JavaScript framework designed to facilitate the interaction of AI agents with web browsers. It allows an agent to perceive a web page and execute actions (clicks, typing, navigation), typically by connecting to a VLM backend.
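To give a feel for the API, here is a minimal sketch using Midscene's Puppeteer integration. The target URL and the instruction are illustrative, and the model configuration is assumed to come from environment variables (see the options below).

```typescript
import puppeteer from 'puppeteer';
import { PuppeteerAgent } from '@midscene/web/puppeteer';

// Launch a browser and open a page for the agent to control.
const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
await page.goto('https://www.example.com'); // illustrative target page

// The agent perceives the page through the configured VLM and turns the
// natural-language instruction into concrete browser actions.
const agent = new PuppeteerAgent(page);
await agent.aiAction('click the "More information" link'); // illustrative instruction

await browser.close();
```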
The easiest way to get started with Midscene.js is using its Chrome extension.
Here are examples of how you might configure Midscene.js to use different VLMs hosted by Parasail:
Option 1: Using a Qwen-VL based model (e.g., parasail-qwen25-vl-72b-instruct)
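A sketch of the environment configuration for this option. The base URL assumes Parasail exposes an OpenAI-compatible endpoint at the address shown (verify against Parasail's docs); `MIDSCENE_USE_QWEN_VL` enables Midscene's Qwen-VL mode.

```bash
# .env — sketch; base URL and API key are placeholders to verify
OPENAI_BASE_URL="https://api.parasail.io/v1"          # assumed Parasail endpoint
OPENAI_API_KEY="your-parasail-api-key"
MIDSCENE_MODEL_NAME="parasail-qwen25-vl-72b-instruct"
MIDSCENE_USE_QWEN_VL=1                                # use Midscene's Qwen-VL mode
```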
Option 2: Using a UI-TARS based model (e.g., parasail-ui-tars-1p5-7b)
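A sketch of the environment configuration for this option, under the same endpoint assumption as above. The `MIDSCENE_USE_VLM_UI_TARS` value of `1.5` matches UI-TARS-1.5 in recent Midscene versions; check the Midscene docs for the value your version expects.

```bash
# .env — sketch; base URL and API key are placeholders to verify
OPENAI_BASE_URL="https://api.parasail.io/v1"          # assumed Parasail endpoint
OPENAI_API_KEY="your-parasail-api-key"
MIDSCENE_MODEL_NAME="parasail-ui-tars-1p5-7b"
MIDSCENE_USE_VLM_UI_TARS=1.5                          # mark the model as UI-TARS-1.5
```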
For detailed setup, API documentation, and examples of how to integrate and use Midscene.js for browser automation, please consult its official repository: https://github.com/web-infra-dev/midscene.