Parasail
  • Welcome
  • Serverless
    • Serverless
    • Available Parameters
  • Dedicated
    • Dedicated Endpoints
    • Speeding up Dedicated Models with Speculative Decoding
    • Deploying private models through HuggingFace Repos
    • Dedicated Endpoint Management API
    • Rate Limits and Limitations
    • FP8 Quantization
  • Batch
    • Quick start
    • Batch Processing with Private Models
    • Batch file format
    • API Reference
  • Cookbooks
    • Run and Evaluate Any Model
    • Chat Completions
    • RAG
    • Multi-Modal
    • Text-to-Speech with Orpheus TTS models
    • Multimodal GUI Task Agent
    • Tool/Function Calling
    • Structured Output
  • Billing
    • Pricing
    • Billing And Payments
    • Promotions
    • Batch SLA
  • Security and Account Management
    • Data Privacy and Retention
    • Account Management
    • Compliance
  • Resources
    • Silly Tavern Guide
    • Community Engagement
Powered by GitBook
On this page
  • Computer-Using Agent
  • Browser Operator
  1. Cookbooks

Multimodal GUI Task Agent

PreviousText-to-Speech with Orpheus TTS modelsNextTool/Function Calling

Last updated 11 days ago

Multimodal models including and have shown remarkable capability of understanding and interacting with GUIs to perform tasks based on natural language instructions. Common use cases include universal voice control, automated UI testing, workflow automation and more.

This document outlines how to approach using multi-modal models for general computer use and browser-specific operations.

Computer-Using Agent

is an application designed to enable a multimodal agent to control your desktop environment to perform tasks based on natural language inputs.

To use UI-TARS-Desktop with UI-TARS-1.5 model and Parasail as provider, simply import the follow preset config file:

name: UI TARS Desktop Parasail Preset
language: en
vlmProvider: Parasail for UI-TARS-1.5
vlmBaseUrl: https://api.parasail.io/v1
vlmApiKey: YOUR_API_KEY_HERE # <-- IMPORTANT: Replace with your actual Parasail API key
vlmModelName: parasail-ui-tars-1p5-7b

For detailed installation, usage instructions, and examples, please refer to the official UI-TARS-Desktop repository: .

Browser Operator

is a JavaScript framework designed to facilitate the interaction of AI agents with web browsers. It allows an agent to perceive the web page and execute actions (clicks, typing, navigation) typically by connecting to a VLM backend.

The easiset way to get started with Midscene.js is using its .

Here are examples of how you might configure Midscene.js to use different VLMs hosted by Parasail:

Option 1: Using a Qwen-VL based model (e.g., parasail-qwen25-vl-72b-instruct)

export OPENAI_BASE_URL="https://api.parasail.io/v1"
export OPENAI_API_KEY="YOUR_API_KEY_HERE" # <-- IMPORTANT: Replace with your actual API key
export MIDSCENE_MODEL_NAME="parasail-qwen25-vl-72b-instruct"
export MIDSCENE_USE_QWEN_VL=1 # Flag to indicate a Qwen-VL family model

Option 2: Using a UI-TARS based model (e.g., parasail-ui-tars-1p5-7b)

export OPENAI_BASE_URL="https://api.parasail.io/v1"
export OPENAI_API_KEY="YOUR_API_KEY_HERE" # <-- IMPORTANT: Replace with your actual API key
export MIDSCENE_MODEL_NAME="parasail-ui-tars-1p5-7b"
export MIDSCENE_USE_VLM_UI_TARS=1.5 # Flag indicating UI-TARS 1.5 model compatibility

For detailed setup, API documentation, and examples on how to integrate and use Midscene.js for browser automation, please consult its official repository:

UI-TARS
Qwen2.5-VL
UI-TARS-Desktop
https://github.com/bytedance/UI-TARS-desktop/
Midscene.js
Chrome extension
https://github.com/web-infra-dev/midscene