# Multimodal GUI Task Agent

Multimodal models such as [UI-TARS](https://github.com/bytedance/UI-TARS) and [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL) can understand GUIs and interact with them to carry out tasks from natural-language instructions. Common use cases include universal voice control, automated UI testing, and workflow automation.

This document outlines how to use multimodal models for general computer use and for browser-specific operations.

## Computer-Using Agent

[UI-TARS-Desktop](https://github.com/bytedance/UI-TARS-desktop) is an application that lets a multimodal agent control your desktop environment to carry out tasks from natural-language input.

To use `UI-TARS-Desktop` with the UI-TARS-1.5 model and Parasail as the provider, import the following preset config file:

```yaml
name: UI TARS Desktop Parasail Preset
language: en
vlmProvider: Parasail for UI-TARS-1.5
vlmBaseUrl: https://api.parasail.io/v1
vlmApiKey: YOUR_API_KEY_HERE # <-- IMPORTANT: Replace with your actual Parasail API key
vlmModelName: parasail-ui-tars-1p5-7b
```
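
If you would rather not paste the API key into the file by hand, the preset can be generated from an environment variable. The following is a minimal sketch, not part of UI-TARS-Desktop: the helper name `write_preset`, the variable `PARASAIL_API_KEY`, and the output filename are all assumptions chosen for illustration. The field names and values are copied from the preset above.

```bash
#!/usr/bin/env bash
# Sketch: write the Parasail preset to a file, taking the key as an argument.
# write_preset is a hypothetical helper, not a UI-TARS-Desktop command.
write_preset() {
  local api_key="$1" out_file="$2"
  # Unquoted heredoc delimiter so that ${api_key} is expanded.
  cat > "$out_file" <<EOF
name: UI TARS Desktop Parasail Preset
language: en
vlmProvider: Parasail for UI-TARS-1.5
vlmBaseUrl: https://api.parasail.io/v1
vlmApiKey: ${api_key}
vlmModelName: parasail-ui-tars-1p5-7b
EOF
}

# Example (assumes PARASAIL_API_KEY is exported in your shell):
#   write_preset "$PARASAIL_API_KEY" parasail-preset.yaml
```

The generated file contains your key in plain text, so keep it out of version control.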

For detailed installation, usage instructions, and examples, please refer to the official UI-TARS-Desktop repository: <https://github.com/bytedance/UI-TARS-desktop/>.

## Browser Operator

[Midscene.js](https://midscenejs.com/) is a JavaScript framework that lets AI agents interact with web browsers. The agent perceives the web page and executes actions (clicks, typing, navigation), typically by connecting to a VLM backend.

The easiest way to get started with Midscene.js is using its [Chrome extension](https://midscenejs.com/quick-experience.html).

Here are examples of how you might configure Midscene.js to use different VLMs hosted by Parasail:

**Option 1: Using a Qwen-VL based model (for example, `parasail-qwen25-vl-72b-instruct`)**

```bash
export OPENAI_BASE_URL="https://api.parasail.io/v1"
export OPENAI_API_KEY="YOUR_API_KEY_HERE" # <-- IMPORTANT: Replace with your actual API key
export MIDSCENE_MODEL_NAME="parasail-qwen25-vl-72b-instruct"
export MIDSCENE_USE_QWEN_VL=1 # Flag to indicate a Qwen-VL family model
```

**Option 2: Using a UI-TARS based model (for example, `parasail-ui-tars-1p5-7b`)**

```bash
export OPENAI_BASE_URL="https://api.parasail.io/v1"
export OPENAI_API_KEY="YOUR_API_KEY_HERE" # <-- IMPORTANT: Replace with your actual API key
export MIDSCENE_MODEL_NAME="parasail-ui-tars-1p5-7b"
export MIDSCENE_USE_VLM_UI_TARS=1.5 # Flag indicating UI-TARS 1.5 model compatibility
```
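
Both options set the same three variables plus one model-family flag, and a missing or misspelled variable is an easy mistake to make. The sketch below checks them before you start Midscene.js. `check_midscene_env` is a hypothetical helper, not part of Midscene.js; it uses `printenv`, so it only sees variables that have been exported, and it assumes at least one of the two flags shown above is required.

```bash
#!/usr/bin/env bash
# Sketch: confirm the variables from Option 1 / Option 2 are exported.
# check_midscene_env is a hypothetical helper, not a Midscene.js command.
check_midscene_env() {
  local missing=0 name
  for name in OPENAI_BASE_URL OPENAI_API_KEY MIDSCENE_MODEL_NAME; do
    # printenv prints nothing for unset, unexported, or empty variables.
    if [ -z "$(printenv "$name")" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  # At least one model-family flag must be set to match the chosen model.
  if [ -z "$(printenv MIDSCENE_USE_QWEN_VL)" ] && \
     [ -z "$(printenv MIDSCENE_USE_VLM_UI_TARS)" ]; then
    echo "missing: MIDSCENE_USE_QWEN_VL or MIDSCENE_USE_VLM_UI_TARS" >&2
    missing=1
  fi
  # Exit status 0 when everything is set, 1 otherwise.
  return "$missing"
}
```

Run `check_midscene_env && echo "environment looks complete"` in the same shell where you exported the variables. The check only confirms that values are present; it does not validate the key or the model name against the Parasail API.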

For detailed setup, API documentation, and examples on how to integrate and use Midscene.js for browser automation, please consult its official repository: <https://github.com/web-infra-dev/midscene>

