Creatus Vision-Language Model (VLM)

Revolutionizing Human-Computer Interaction: An Insight into Creatus Vision-Language Model

Introduction to the Vision-Language Model (VLM)

Marko Vidrih
5 min read · Nov 7, 2023


As the digital world evolves, the way we interact with computers is undergoing a significant transformation. The Creatus Vision-Language Model (VLM) is at the forefront of this change, bringing a new dimension to human-computer interaction. This model is designed to mimic how a person would engage with a computer screen — reading, searching, typing, and clicking — thus making our conversations with computers smoother and smarter.

The Concept Behind VLM

The core idea of VLM is to blend multimodal AI technology with an interactive system that performs browser-based tasks. This combination has the potential to enhance user interface (UI) accessibility, streamline workflows, and advance automated UI searching and testing. Although still in its early stages, VLM shows considerable promise for changing how companies interact with their data and digital systems.


Core Components of Creatus VLM

Understanding the Multimodal Large Language Model

The Multimodal LLM is a crucial part of Creatus VLM. It allows the system to process visual inputs and understand user interfaces, making it adept at interpreting screen feedback.

Enhancing Interaction with SoM Prompting

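SoM (Set-of-Mark) Prompting overlays visible marks, such as numbers, on regions of a screenshot so that the model can refer to a specific element by its mark. In Creatus VLM, this improves how reliably the system maps user commands to on-screen elements and enhances response accuracy. This component is crucial for seamless interaction between the user and the computer.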

The Role of the Auto-labeler

The Auto-labeler is a unique tool within Creatus VLM. It assigns numerical IDs to each interactable UI element, thus facilitating precise and efficient interactions.
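To make the idea concrete, here is a minimal sketch of how numerical IDs could be assigned to interactable elements. It is not the Creatus implementation: the `UIElement` structure, the `INTERACTABLE_ROLES` set, and the reading-order numbering are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UIElement:
    """A detected on-screen element (hypothetical structure for this sketch)."""
    role: str    # e.g. "button", "link", "input", "image"
    x: int       # left edge in pixels
    y: int       # top edge in pixels
    width: int
    height: int


# Roles treated as interactable. This set is an assumption, not a Creatus list.
INTERACTABLE_ROLES = {"button", "link", "input", "checkbox", "select"}


def auto_label(elements):
    """Return {numeric_id: element} for interactable elements in reading order."""
    interactable = [e for e in elements if e.role in INTERACTABLE_ROLES]
    # Sort top-to-bottom, then left-to-right, so IDs follow reading order.
    interactable.sort(key=lambda e: (e.y, e.x))
    return {label: element for label, element in enumerate(interactable, start=1)}


elements = [
    UIElement("button", 300, 40, 80, 30),
    UIElement("image", 0, 0, 200, 100),   # not interactable, so it gets no ID
    UIElement("input", 20, 40, 250, 30),
    UIElement("link", 20, 120, 60, 20),
]
labels = auto_label(elements)
for label, element in labels.items():
    print(label, element.role)
```

Numbering in reading order is only one possible convention; what matters is that each ID points to exactly one element on the current screen.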

The Interactive Avatar

Creatus VLM features an interactive avatar that provides a visual and vocal representation of the system. This avatar interacts in real time, responding to text and voice inputs, making the user experience more human-like and engaging.

Features & Capabilities of Creatus VLM

Visual Recognition and Processing

Creatus VLM is capable of recognizing and processing visual elements on the screen, a fundamental aspect of its functionality.

Advanced Auto-labeling with COCO Export

The system’s auto-labeler not only labels UI elements but also supports COCO export, enhancing its utility in various applications.
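As an illustration of what a COCO export involves, the sketch below turns a list of labeled boxes into a COCO-style dictionary with the standard `images`, `annotations`, and `categories` sections. The function name `to_coco`, the input tuple layout, and the sample values are assumptions; the actual export produced by Creatus VLM may differ.

```python
import json


def to_coco(image_name, image_width, image_height, boxes):
    """Build a COCO-style dict from (category_name, x, y, width, height) tuples."""
    # Give each distinct category name a stable 1-based ID.
    category_names = sorted({name for name, *_ in boxes})
    category_ids = {name: i for i, name in enumerate(category_names, start=1)}

    annotations = []
    for ann_id, (name, x, y, w, h) in enumerate(boxes, start=1):
        annotations.append({
            "id": ann_id,
            "image_id": 1,
            "category_id": category_ids[name],
            "bbox": [x, y, w, h],  # COCO boxes are [x, y, width, height]
            "area": w * h,
            "iscrowd": 0,
        })

    return {
        "images": [{
            "id": 1,
            "file_name": image_name,
            "width": image_width,
            "height": image_height,
        }],
        "annotations": annotations,
        "categories": [{"id": cid, "name": name} for name, cid in category_ids.items()],
    }


coco = to_coco(
    "screen.png", 1280, 720,
    [("button", 300, 40, 80, 30), ("input", 20, 40, 250, 30)],
)
print(json.dumps(coco, indent=2))
```

Because COCO is a widely used annotation format, an export like this lets labeled screenshots be reused as training or evaluation data for other detection tools.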

Simulating Human Interactions

Creatus VLM can simulate mouse clicks and type characters autonomously, mimicking human actions in the digital space.
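The sketch below shows one way such simulated input could be structured. Instead of driving a real mouse and keyboard, it uses a hypothetical `RecordingBackend` that only records events, so it is safe to run anywhere. In a real system that backend would be replaced by an OS-level automation layer, which the article does not name.

```python
from dataclasses import dataclass, field


@dataclass
class RecordingBackend:
    """Stand-in for a real input driver: it records events instead of sending them."""
    events: list = field(default_factory=list)

    def click(self, x, y):
        self.events.append(("click", x, y))

    def press(self, char):
        self.events.append(("key", char))


def type_text(backend, text):
    """Type a string one character at a time, as a person would."""
    for char in text:
        backend.press(char)


def click_and_type(backend, x, y, text):
    """Focus a field by clicking it, then type into it."""
    backend.click(x, y)
    type_text(backend, text)


backend = RecordingBackend()
click_and_type(backend, 145, 55, "hi")
print(backend.events)
```

Keeping the backend behind a small interface like this also makes the action logic easy to test without touching a live screen.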

The Interactive Experience

The avatar in Creatus VLM enhances user interaction with real-time voice feedback and dynamic facial animations. These features offer visual and auditory cues, enriching the interaction experience.

Interaction Flow in Creatus VLM

User Input Methods

Users can interact with Creatus VLM either by typing their query or speaking directly to the avatar, offering flexibility in how they choose to communicate.

Processing Tasks and Generating Outputs

Upon receiving a user’s input, the VLM identifies the necessary actions. It then uses screenshots that it captures of the screen, together with the numerical labels, to determine exact pixel coordinates for mouse and keyboard operations. The system’s responses are relayed through actions, text, and voice via the avatar.
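The step from a numerical label to pixel coordinates can be sketched as follows. The reply format (`CLICK <id>` or `TYPE <id> <text>`), the `resolve` function, and the choice of the box center as the click target are assumptions made for this example, not the documented Creatus protocol.

```python
import re

# Hypothetical reply grammar: "CLICK 2" or "TYPE 1 some text".
ACTION_PATTERN = re.compile(r"^(CLICK|TYPE)\s+(\d+)(?:\s+(.*))?$")


def center(box):
    """Return the center pixel of an (x, y, width, height) box."""
    x, y, w, h = box
    return (x + w // 2, y + h // 2)


def resolve(reply, labeled_boxes):
    """Turn a model reply that names a label into a concrete pixel-level action."""
    match = ACTION_PATTERN.match(reply.strip())
    if match is None:
        raise ValueError(f"unrecognized reply: {reply!r}")
    verb, label_text, text = match.groups()
    label = int(label_text)
    if label not in labeled_boxes:
        raise KeyError(f"no element with label {label}")
    x, y = center(labeled_boxes[label])
    return {"action": verb.lower(), "x": x, "y": y, "text": text or ""}


# Label -> bounding box, as an auto-labeler might produce for one screenshot.
labeled_boxes = {1: (20, 40, 250, 30), 2: (300, 40, 80, 30)}

print(resolve("CLICK 2", labeled_boxes))
print(resolve("TYPE 1 hello world", labeled_boxes))
```

Having the model answer with a label rather than raw coordinates means it only needs to pick the right element; the arithmetic that yields exact pixels stays in ordinary code.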

Potential Use Cases of Creatus VLM

Transforming Business Intelligence and Data Analysis

Creatus VLM could change how companies manage and interact with their data. Given access to a company’s data, it can autonomously read, interpret, and analyze various types of documents. This enables businesses to perform comprehensive data analysis, draw insights, and make informed decisions with far less manual work. It’s like having a bespoke, intelligent analyst that understands your company’s unique data landscape.

Revolutionizing Customer Support and Engagement

Imagine an AI-powered customer support system that can visually navigate through help documents, tutorials, or product guides to provide instant, accurate assistance. Creatus VLM can interact with customers using both text and voice, offering a more personalized and effective support experience. This could significantly enhance customer satisfaction and streamline support operations.

Enhancing E-learning and Educational Platforms

In educational settings, Creatus VLM could transform e-learning platforms by interacting with educational content in a more engaging and intuitive way. It can read and explain content, navigate through educational resources, and even assist in completing learning tasks. This could greatly benefit students, especially those with different learning preferences or needs, by providing an interactive, multimodal learning experience.

Streamlining E-commerce and Online Shopping

For e-commerce platforms, Creatus VLM can offer a unique shopping assistant experience. It could guide customers through the website, help them find products based on visual cues or descriptions, and even assist in the checkout process. This enhanced interaction could lead to a more satisfying shopping experience and potentially increase sales and customer loyalty.

Optimizing Content Management and Web Development

For content managers and web developers, Creatus VLM could serve as an invaluable tool for website optimization and testing. It can autonomously navigate through web pages, identify UI issues, suggest improvements, and even test website functionality. This could significantly reduce the time and resources spent on website development and maintenance.

Facilitating Accessibility for People with Disabilities

Creatus VLM can be a game-changer in making digital content more accessible to people with disabilities. Its ability to interpret and interact with digital interfaces can help visually impaired users navigate websites or applications more easily. The system can read out text, describe images, and even guide users through complex UIs, making digital spaces more inclusive.

Enhancing Security and Surveillance Systems

In security and surveillance, Creatus VLM can be used to monitor and analyze video feeds in real time. It can identify and report unusual activities, manage access control systems by recognizing visual cues, and even interact with other security systems to provide a comprehensive security solution.

Automating Administrative and HR Tasks

In the administrative and HR domain, Creatus VLM could automate routine tasks like scheduling, email sorting, document processing, and even preliminary candidate screening. This would free up valuable time for HR professionals to focus on more strategic tasks.

Improving Healthcare and Patient Interaction

In healthcare, Creatus VLM could assist in patient management by helping patients navigate through medical portals, understand their medical records, and even provide preliminary consultation based on symptoms described by the patient.

Personalized Media and Entertainment Experiences

In media and entertainment, Creatus VLM could offer personalized content recommendations, navigate streaming platforms based on visual and verbal cues, and enhance the overall user experience by understanding individual preferences and behaviors.

A New Era of Digital Interaction

The Creatus Vision-Language Model represents a significant step towards fluid human-computer interaction. By blending visuals, sounds, and words, it not only enhances the user experience but also opens up new possibilities in how we interact with and use digital environments. VLM has the potential to be a versatile and powerful tool across a wide range of industries and applications. As this technology continues to evolve, it holds the promise of transforming the landscape of digital interaction and accessibility.
