Browser and UI Automation

Hcompany/Holo1-7B

https://huggingface.co/Hcompany/Holo1-7B

Holo1 is an Action Vision-Language Model (VLM) developed by HCompany for use in the Surfer-H web agent system. It is designed to interact with web interfaces like a human user.

This example shows how to run a single step: an input image and a prompt are provided, and the model returns the action for that step. The model can also work from memory of previous steps; to continue, the next request includes the previous actions along with the current image view of the UI, and the model responds with the next action.
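A minimal sketch of that multi-step flow is shown below. The helpers `take_screenshot()` (returns the current UI image) and `run_step()` (one Holo1 inference call) are hypothetical and only illustrate how prior actions are carried forward between steps.

```python
# Hypothetical sketch of the multi-step flow; `take_screenshot` and `run_step`
# are placeholders, not part of the model or this platform.
def run_agent(task, take_screenshot, run_step, max_steps=10):
    history = []  # actions taken so far, fed back to the model each step
    for _ in range(max_steps):
        screenshot = take_screenshot()                # current view of the UI
        action = run_step(task, screenshot, history)  # model returns the next action
        history.append(action)
        if action.get("type") == "done":              # stop when the task is complete
            break
    return history
```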

Starting the Model

This model is not available in serverless; however, it is small enough to run on a low-cost RTX 4090 in our dedicated service. To start the model, log in to the platform and navigate to the Create New Dedicated Instance page. Enter Hcompany/Holo1-7B as the model ID and select the GPU type (we recommend the RTX 4090 for basic testing). Then click Deploy to start the model. Once it is running, change the model name in the code below to match the name of your dedicated deployment.
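If your dedicated deployment exposes an OpenAI-compatible endpoint, the client setup might look like the sketch below; the base URL, API key, and model name are placeholders for your own deployment values.

```python
# Minimal client setup, assuming an OpenAI-compatible endpoint on the
# dedicated deployment. All values below are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-dedicated-endpoint>/v1",  # your deployment's endpoint
    api_key="<your-api-key>",                         # your platform API key
)

MODEL = "<your-deployment-name>"  # the model name shown on your dedicated deployment
```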

Running the Model

This example also demonstrates guided JSON decoding using Pydantic schemas.
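A minimal sketch of such a schema is shown below. The Click fields (thinking, x, y) follow the shape described in this example; adapt the field names to the action format you expect from the model.

```python
# Pydantic schema used for guided JSON decoding: the model's reasoning plus
# the click coordinates. Field names here are illustrative.
from pydantic import BaseModel

class Click(BaseModel):
    thinking: str  # the model's reasoning for this step
    x: int         # x pixel coordinate of the click target
    y: int         # y pixel coordinate of the click target

click_schema = Click.model_json_schema()  # JSON schema passed to the server
```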

Given the following prompt and input image:

Book a hotel in Paris on August 3rd for 3 nights
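A single step with this prompt could look like the sketch below, reusing `client`, `MODEL`, `Click`, and `click_schema` from the snippets above. The `guided_json` field is an assumed vLLM-style extension; use whichever structured-output parameter your deployment supports.

```python
# One inference step: send the screenshot and the task prompt, constraining the
# output to the Click schema. `guided_json` is an assumed vLLM-style parameter.
import base64

with open("screenshot.png", "rb") as f:  # the current view of the web page
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text",
                 "text": "Book a hotel in Paris on August 3rd for 3 nights"},
            ],
        }
    ],
    extra_body={"guided_json": click_schema},  # constrain decoding to the schema
)

action = Click.model_validate_json(response.choices[0].message.content)
print(action.thinking, action.x, action.y)
```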

The output is a Click object containing the model's "thinking" and the click coordinates:
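The values below are purely illustrative of the response shape; the actual reasoning text and coordinates depend on the screenshot and prompt.

```python
# Illustrative shape only -- actual values depend on the input:
# {"thinking": "I should open the destination field and type 'Paris'.", "x": 412, "y": 237}
```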
