Browser and UI Automation

Hcompany/Holo1-7B

https://huggingface.co/Hcompany/Holo1-7B

Holo1 is an Action Vision-Language Model (VLM) developed by HCompany for use in the Surfer-H web agent system. It is designed to interact with web interfaces like a human user.

This example shows how to run a single step: an input image and a prompt are sent to the model, and it returns the next action to take. The model is designed to be driven iteratively: on subsequent steps you include the previous actions (the agent's memory) together with a fresh screenshot of the UI, and the model responds with the following action, as shown in the sketch at the end of this page.

Starting the Model

This model is not available on serverless, but it is small enough to run on a low-cost RTX 4090 through our dedicated service. To start the model, log in to the platform and navigate to the Create New Dedicated Instance page. Enter Hcompany/Holo1-7B as the model ID and select the GPU type (we recommend the RTX 4090 for basic testing), then hit Deploy to start the model. Once it is running, change the model name in the code below to match the deployment name of your dedicated deployment.
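
Once the deployment shows as running, you can verify that it is reachable before running the full example. The snippet below is a minimal sketch that uses the same OpenAI-compatible endpoint as the code further down; it assumes PARASAIL_API_KEY is set in your environment and that your dedicated deployment name is listed for your key.

import os
import openai

client = openai.Client(
    api_key=os.environ["PARASAIL_API_KEY"],
    base_url="https://api.parasail.io/v1",
)

# The deployment name you will use as `model` below should appear in this listing.
for m in client.models.list():
    print(m.id)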

Running the Model

This example also demonstrates guided JSON decoding with Pydantic schemas: the guided_json parameter in the request below constrains the model's output so that it always parses as a valid NavigationStep.

Given the following prompt and input image:

Book a hotel in Paris on August 3rd for 3 nights

The output is a NavigationStep object containing the model's thought (its reasoning) and a click action with the target coordinates:

Output navigation step: 
note='' 
thought='I need to select the check-out date as August 6th, 2025, and 
then proceed to search.' 
action=ClickElementAction(action='click_element', 
         element='August 6th, 2025 on the calendar', x=657, y=318)
from copy import deepcopy
from typing import Literal, Union

import base64
import json
import os

import openai
import requests
from dotenv import load_dotenv
from pydantic import BaseModel, Field, TypeAdapter

# Load PARASAIL_API_KEY from a local .env file, if present.
load_dotenv()

assert "PARASAIL_API_KEY" in os.environ
api_key = os.environ["PARASAIL_API_KEY"]
chat_base_url = "https://api.parasail.io/v1"

client = openai.Client(
    api_key=api_key,
    base_url=chat_base_url,
)
# Replace with the deployment name of your dedicated Holo1-7B instance.
model = "miketest1-holo1"

SYSTEM_PROMPT: str = """Imagine you are a robot browsing the web, just like humans. Now you need to complete a task.
In each iteration, you will receive an Observation that includes the last screenshots of a web browser and the current memory of the agent.
You also have information about the step that the agent is trying to achieve to solve the task.
Carefully analyze the visual information to identify what to do, then follow the guidelines to choose the following action.
You should detail your thought (i.e. reasoning steps) before taking the action.
Also detail in the notes field of the action the extracted information relevant to solve the task.
Once you have enough information in the notes to answer the task, return an answer action with the detailed answer in the notes field.
This will be evaluated by an evaluator and should match all the criteria or requirements of the task.
Guidelines:
- store in the notes all the relevant information to solve the task that fulfills the task criteria. Be precise
- Use both the task and the step information to decide what to do
- if you want to write in a text field and the text field already has text, designate the text field by the text it contains and its type
- If there is a cookies notice, always accept all the cookies first
- The observation is the screenshot of the current page and the memory of the agent.
- If you see relevant information on the screenshot to answer the task, add it to the notes field of the action.
- If there is no relevant information on the screenshot to answer the task, add an empty string to the notes field of the action.
- If you see buttons that allow to navigate directly to relevant information, like jump to ... or go to ... , use them to navigate faster.
- In the answer action, give as many details as possible relevant to answering the task.
- if you want to write, don't click before. Directly use the write action
- to write, identify the web element by its type and the text it already contains
- If you want to use a search bar, directly write text in the search bar
- Don't scroll too much. Don't scroll if the number of scrolls is greater than 3
- Don't scroll if you are at the end of the webpage
- Only refresh if you identify a rate limit problem
- If you are looking for a single flight, click on round-trip to select 'one way'
- Never try to login, enter email or password. If there is a need to login, then go back.
- If you are facing a captcha on a website, try to solve it.
- if you have enough information in the screenshot and in the notes to answer the task, return an answer action with the detailed answer in the notes field
- The current date is {timestamp}.
# <output_json_format>
# ```json
# {output_format}
# ```
# </output_json_format>
"""


class ClickElementAction(BaseModel):
    action: Literal["click_element"]
    element: str
    x: int
    y: int


class WriteElementAction(BaseModel):
    action: Literal["write_element_abs"]
    content: str
    element: str
    x: int
    y: int


class ScrollAction(BaseModel):
    action: Literal["scroll"]
    direction: Literal["down", "up", "left", "right"]


class GoBackAction(BaseModel):
    action: Literal["go_back"]


class RefreshAction(BaseModel):
    action: Literal["refresh"]


class GotoAction(BaseModel):
    action: Literal["goto"]
    url: str


class WaitAction(BaseModel):
    action: Literal["wait"]
    seconds: int


class RestartAction(BaseModel):
    action: Literal["restart"]


class AnswerAction(BaseModel):
    action: Literal["answer"]
    content: str


# Discriminated union of every action the agent can return; the "action" literal selects the variant.
ActionSpace = Union[
    ClickElementAction,
    WriteElementAction,
    ScrollAction,
    GoBackAction,
    RefreshAction,
    WaitAction,
    RestartAction,
    AnswerAction,
    GotoAction,
]


class NavigationStep(BaseModel):
    note: str = Field(
        default="",
        description="Task-relevant information extracted from the previous observation. Keep empty if no new info.",
    )
    thought: str = Field(description="Reasoning about next steps (<4 lines)")
    action: ActionSpace = Field(description="Next action to take")


# Fix references generated by json_schema(): inline "$defs" entries and rewrite "anyOf" unions as "oneOf".
def _walk(node, defs):
    if isinstance(node, dict):
        ref = node.get("$ref")
        if ref and ref.startswith("#/$defs/"):
            return _walk(deepcopy(defs[ref.split("/")[-1]]), defs)

        if "anyOf" in node:
            node["oneOf"] = node.pop("anyOf")
        return {k: _walk(v, defs) for k, v in node.items()}
    if isinstance(node, list):
        return [_walk(v, defs) for v in node]
    return node

def openai_schema(schema_class):
    schema = TypeAdapter(schema_class).json_schema()
    result = _walk(schema, schema.get("$defs", {}))
    result.pop("$defs", None)
    return result
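

# Example: openai_schema(NavigationStep) yields a self-contained schema in which
# each action model from "$defs" is inlined and the Pydantic "anyOf" union is
# expressed as "oneOf". Uncomment to inspect it:
# print(json.dumps(openai_schema(NavigationStep), indent=2))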


def get_navigation_prompt(task, image, step=1):
    """
    Get the prompt for the navigation task.
    - task: The task to complete
    - image: The current screenshot of the web page
    - step: The current step of the task
    """
    system_prompt = SYSTEM_PROMPT.format(
        output_format=NavigationStep.model_json_schema(),
        timestamp="2025-06-04 14:16:03",
    )
    return [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": system_prompt},
            ],
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": f"<task>\n{task}\n</task>\n"},
                {"type": "text", "text": f"<observation step={step}>\n"},
                {"type": "text", "text": "<screenshot>\n"},
                {"type": "image_url", "image_url": {"url": image}},
                {"type": "text", "text": "\n</screenshot>\n"},
                {"type": "text", "text": "\n</observation>\n"},
            ],
        },
    ]


def navigate(image_path: str, task: str) -> NavigationStep:
    # 1. Encode the screenshot as a base64 data URL.
    with open(image_path, "rb") as img_file:
        image_data = img_file.read()
        base64_image = base64.b64encode(image_data).decode("utf-8")
        image_url = "data:image/jpeg;base64," + base64_image

    # 2. Build the chat messages from the task and the encoded screenshot.
    prompt = get_navigation_prompt(task, image_url, step=1)

    # 3. Run inference with guided JSON decoding so the reply is guaranteed
    #    to parse against the NavigationStep schema.
    try:
        response = client.chat.completions.create(
            model=model,
            messages=prompt,
            max_tokens=512,  # Adjust as needed
            extra_body={"guided_json": openai_schema(NavigationStep)},
        )
        navigation_str = response.choices[0].message.content
    except Exception as e:
        print(f"Error during model inference: {e}")
        raise

    # 4. Validate the raw JSON into a NavigationStep object.
    navigation_json = json.loads(navigation_str)
    return NavigationStep(**navigation_json)


# --- Load Example Data ---
example_task = "Book a hotel in Paris on August 3rd for 3 nights"
example_image_url = (
    "https://huggingface.co/Hcompany/Holo1-7B/resolve/main/calendar_example.jpg"
)
example_image_path = "image.jpg"
response = requests.get(example_image_url)

if response.status_code == 200:
    with open(example_image_path, "wb") as f:
        f.write(response.content)
else:
    print(f"Failed to download image: {response.status_code}")


output_coords_component = navigate(example_image_path, example_task)
print("Input task:", example_task)
print("Output navigation step:", output_coords_component)
