# Multimodal GUI Task Agent

Multimodal models such as [UI-TARS](https://github.com/bytedance/UI-TARS) and [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL) can understand and interact with GUIs to carry out tasks described in natural language. Common use cases include universal voice control, automated UI testing, and workflow automation.

This document outlines how to use multimodal models for general computer use and for browser-specific operations.

## Computer-Using Agent

[UI-TARS-Desktop](https://github.com/bytedance/UI-TARS-desktop) is an application that lets a multimodal agent control your desktop environment to perform tasks based on natural language input.

To use `UI-TARS-Desktop` with the UI-TARS-1.5 model and Parasail as the provider, import the following preset config file:

```yaml
name: UI TARS Desktop Parasail Preset
language: en
vlmProvider: Parasail for UI-TARS-1.5
vlmBaseUrl: https://api.parasail.io/v1
vlmApiKey: YOUR_API_KEY_HERE # <-- IMPORTANT: Replace with your actual Parasail API key
vlmModelName: parasail-ui-tars-1p5-7b
```
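If you would rather not paste the API key into the preset by hand, the file can be generated from an environment variable. The following is a minimal sketch, not part of UI-TARS-Desktop itself: the function name `write_preset`, the variable `PARASAIL_API_KEY`, and the default output file name are all hypothetical choices made for illustration.

```bash
#!/usr/bin/env bash
# Sketch: write the preset shown above, taking the key from the environment.
# write_preset, PARASAIL_API_KEY and the default file name are hypothetical.

write_preset() {
  local out="${1:-ui-tars-parasail-preset.yaml}"

  if [ -z "${PARASAIL_API_KEY:-}" ]; then
    echo "PARASAIL_API_KEY is not set" >&2
    return 1
  fi

  # The file holds a secret, so a newly created file is made readable
  # only by the current user. The subshell keeps the umask change local.
  (
    umask 077
    cat > "$out" <<EOF
name: UI TARS Desktop Parasail Preset
language: en
vlmProvider: Parasail for UI-TARS-1.5
vlmBaseUrl: https://api.parasail.io/v1
vlmApiKey: ${PARASAIL_API_KEY}
vlmModelName: parasail-ui-tars-1p5-7b
EOF
  )
}
```

After sourcing the script and exporting `PARASAIL_API_KEY`, calling `write_preset my-preset.yaml` produces a file that can be imported in the same way as the hand-edited preset.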

For detailed installation, usage instructions, and examples, please refer to the official UI-TARS-Desktop repository: <https://github.com/bytedance/UI-TARS-desktop/>.

## Browser Operator

[Midscene.js](https://midscenejs.com/) is a JavaScript framework that lets AI agents interact with web browsers. An agent can perceive the web page and execute actions (clicks, typing, navigation), typically by connecting to a vision-language model (VLM) backend.

The easiest way to get started with Midscene.js is using its [Chrome extension](https://midscenejs.com/quick-experience.html).

The following examples show how you might configure Midscene.js to use different VLMs hosted by Parasail:

**Option 1: Using a Qwen-VL based model (for example, `parasail-qwen25-vl-72b-instruct`)**

```bash
export OPENAI_BASE_URL="https://api.parasail.io/v1"
export OPENAI_API_KEY="YOUR_API_KEY_HERE" # <-- IMPORTANT: Replace with your actual API key
export MIDSCENE_MODEL_NAME="parasail-qwen25-vl-72b-instruct"
export MIDSCENE_USE_QWEN_VL=1 # Flag to indicate a Qwen-VL family model
```

**Option 2: Using a UI-TARS based model (for example, `parasail-ui-tars-1p5-7b`)**

```bash
export OPENAI_BASE_URL="https://api.parasail.io/v1"
export OPENAI_API_KEY="YOUR_API_KEY_HERE" # <-- IMPORTANT: Replace with your actual API key
export MIDSCENE_MODEL_NAME="parasail-ui-tars-1p5-7b"
export MIDSCENE_USE_VLM_UI_TARS=1.5 # Flag indicating UI-TARS 1.5 model compatibility
```
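Because both options rely on several environment variables, a quick check before starting a run can catch a missing value or a leftover placeholder key. The sketch below is a hypothetical helper, not part of Midscene.js. It also assumes that exactly one model-family flag should be set, which is inferred from the two options being presented as alternatives rather than confirmed by the Midscene.js documentation.

```bash
#!/usr/bin/env bash
# Sketch: validate the Midscene.js environment variables used above.
# check_midscene_env is a hypothetical helper, not a Midscene.js command.

check_midscene_env() {
  local failed=0 flags=0 name

  for name in OPENAI_BASE_URL OPENAI_API_KEY MIDSCENE_MODEL_NAME; do
    # ${!name} is bash indirect expansion: the value of the variable
    # whose name is stored in $name.
    if [ -z "${!name:-}" ]; then
      echo "missing: $name" >&2
      failed=1
    fi
  done

  if [ "${OPENAI_API_KEY:-}" = "YOUR_API_KEY_HERE" ]; then
    echo "OPENAI_API_KEY still holds the placeholder value" >&2
    failed=1
  fi

  # Assumption: the two model-family flags are mutually exclusive.
  if [ -n "${MIDSCENE_USE_QWEN_VL:-}" ]; then flags=$((flags + 1)); fi
  if [ -n "${MIDSCENE_USE_VLM_UI_TARS:-}" ]; then flags=$((flags + 1)); fi
  if [ "$flags" -ne 1 ]; then
    echo "set exactly one of MIDSCENE_USE_QWEN_VL or MIDSCENE_USE_VLM_UI_TARS" >&2
    failed=1
  fi

  return "$failed"
}
```

Sourcing the script and running `check_midscene_env && echo "environment looks complete"` after the `export` lines gives a simple pass or fail signal. The check only inspects local variables and does not contact the Parasail API.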

For detailed setup, API documentation, and examples of how to integrate and use Midscene.js for browser automation, please consult its official repository: <https://github.com/web-infra-dev/midscene>.
