What's New: Computer Use Agents
Welcome to the Women Who AI Newsletter, your Monday morning update on what matters in AI, so you can stay focused on building and scaling startups.
Was this newsletter forwarded to you? Click here to subscribe and join our community of women founders shaping the future of AI.
This Week’s Deep Dive: Computer Use Agents
Earlier this week, Mistral released Mistral OCR, a new multimodal model that can read and interpret complex documents, such as detailed PDFs or sketches. With models like these, the terms Vision Language Model (VLM) and Multimodal Large Language Model (MLLM) often get thrown around interchangeably, but don't let the jargon confuse you: they describe essentially the same technology.
Automating computer tasks is one of the most exciting uses for these new models. Imagine an AI that can interact with apps on your desktop like you would, clicking buttons, filling out forms, and navigating interfaces visually. These computer-use agents combine visual understanding with step-by-step reasoning to independently perform digital tasks you've assigned, freeing you up from repetitive or tedious workflows.
How Do They Work?
Computer-use agents combine multimodal language models with tools to visually interpret and interact with your computer interface. Here’s the simplified breakdown:
Perception:
The agent starts by taking a screenshot or recording of your screen, much as a human would visually scan it. It uses multimodal models (such as Mistral OCR or Claude Sonnet) to understand the visual content: reading text and identifying buttons, images, fields, icons, and their context.
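If you're curious what this looks like in practice, here's a rough sketch of the perception step in Python, using the pyautogui library to capture the screen. The model call at the end is only described in a comment, because that part depends on whichever multimodal model you choose:

```python
# A rough sketch of the perception step (the model hand-off is described
# in the final comment, not tied to any specific product's API).
import base64
import io

import pyautogui  # pip install pyautogui

screenshot = pyautogui.screenshot()        # capture the current screen as an image
buffer = io.BytesIO()
screenshot.save(buffer, format="PNG")
image_b64 = base64.b64encode(buffer.getvalue()).decode()

# The agent would now send image_b64 to a multimodal model and ask it to
# list the visible text, buttons, and input fields it can see.
```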
Reasoning & Planning:
After understanding what’s on screen, the agent determines the most efficient way to complete your assigned task. It plans a step-by-step sequence of actions, such as "open app," "click this button," "enter text," or "navigate this menu."
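In many implementations, that plan comes back from the model as structured data the agent can execute one step at a time. A hypothetical plan for adding a new contact to a CRM might look like this (the field names are illustrative, not a standard):

```python
# A hypothetical plan for "add a new contact to the CRM".
plan = [
    {"action": "open_app", "target": "CRM"},
    {"action": "click", "target": "New Contact button"},
    {"action": "type", "target": "Name field", "text": "Jane Doe"},
    {"action": "click", "target": "Save button"},
]
```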
Interaction:
Once a plan is set, the agent sends these actions to the operating system through APIs or automated control libraries. It mimics human inputs, such as moving the cursor, clicking, typing text, scrolling, and selecting menu items.
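As a concrete example, here's how one of those planned steps could be turned into real mouse and keyboard events with pyautogui, one such automated control library (the coordinates are made up for illustration):

```python
import pyautogui

pyautogui.moveTo(420, 310, duration=0.3)    # move the cursor to the button
pyautogui.click()                           # click it
pyautogui.write("Jane Doe", interval=0.05)  # type into the focused field
pyautogui.scroll(-300)                      # scroll down toward the Save button
```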
Feedback and Iteration:
Agents continuously monitor the screen, assess whether their actions succeeded, and adapt to unexpected events or changes, much like humans would adjust their behavior if something unexpected happened. This last step is where most of these models get stuck today.
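Put together, the whole thing is a loop: look at the screen, decide on an action, act, then check whether it worked. Here's a simplified sketch of that loop; every helper below is a placeholder standing in for the steps described above, not a real library's API:

```python
def capture_screen(): ...                # e.g. pyautogui.screenshot()
def plan_next_action(task, screen): ...  # ask the multimodal model what to do next
def execute(action): ...                 # mouse clicks, keystrokes, menu selections
def looks_done(task, screen): ...        # ask the model whether the task succeeded

def run_agent(task, max_steps=20):
    for _ in range(max_steps):
        screen = capture_screen()
        action = plan_next_action(task, screen)
        execute(action)
        if looks_done(task, capture_screen()):
            return True   # task completed
    return False          # gave up; this is where agents most often get stuck today
```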
Run Computer-Use Agents

Browser Use Agent Completing a Task
Source: https://www.reddit.com/r/LocalLLaMA/comments/1igdnx2/ok_i_admit_it_browser_use_is_insane_using_gemini/
Browser Use is an open-source solution gaining significant traction in the developer community. It allows you to create agents that can navigate websites, fill forms, extract data, and complete tasks just as you would. You can start using it with no setup required by visiting cloud.browser-use.com or following the GitHub instructions to run it on your own machine. I set it up locally and had it looking for airline tickets in less than 15 minutes.
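To give you a sense of how little code a local setup involves, here's roughly what a flight-search script looks like, based on the Browser Use quickstart (the exact imports, parameters, and model name may differ depending on the version you install):

```python
import asyncio

from browser_use import Agent
from langchain_openai import ChatOpenAI

async def main():
    agent = Agent(
        # The task is plain English; the agent figures out the clicks and typing.
        task="Find the cheapest round-trip flight from Boston to Lisbon next month.",
        llm=ChatOpenAI(model="gpt-4o"),  # any supported multimodal model should work
    )
    await agent.run()

asyncio.run(main())
```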
Browser Use can automate data entry into your CRM, extract information from platforms behind login screens, and complete repetitive form submissions. The applications are virtually endless for any repetitive web-based task your team currently handles manually.
We’re committed to highlighting open-source alternatives that empower founders to build more sustainable, customizable tech stacks without breaking the bank. Open-source versions of new technologies give you security through transparency (anyone can inspect the code), control over how you implement them, and freedom from vendor lock-in. Some of them take a little setup, especially if you've never used tools like the terminal or Python before, but the effort pays off quickly. Check out this tutorial from freeCodeCamp to learn the basics of using your Terminal to install open-source software.
Evals
Choosing the right computer-use agent can be tricky because evaluation methods are still evolving. There are several factors to consider:
Visual accuracy: How reliably can the agent identify buttons, links, and form fields?
Reasoning and planning: Can it efficiently plan the right sequence of actions?
Task completion: Ultimately, can the agent consistently get the job done?
Right now, there’s no standardized way to test these models in realistic scenarios, but REAL Evals (currently in beta) is working on something interesting. They’ve built a benchmarking environment that creates sandbox versions of real-world websites, providing consistent yet realistic conditions to test how well agents actually perform practical tasks.
It’s still early, but REAL Evals is worth keeping an eye on if you want a clearer picture of which agents genuinely perform best.
Jobs We Love
Sheconomy
A coalition of 100 women leaders is creating an AI simulation showing what our economy would look like with gender equality. Sheconomy uses economic research to model how history might have unfolded if women had equal economic representation. They're seeking female AI leaders to advise and build this project over the next two months.
Interested? Email [email protected] | Website
Upcoming Hackathons
Ready to activate your team and build that product you've been dreaming about? Check out these upcoming hackathons! If you'd like to find a Women Who AI team for any event, reply to this email and we'll connect everyone interested.
Boston:
Sundai x Womenx Innovators Famtech Hack - Sunday, March 16 - Apply ASAP
Open Source AI Agent Hack Night - Tuesday, March 18 - Register
San Francisco:
We want this newsletter to address the real challenges you're facing. Is there a specific AI development you'd like explained? Jargon we included but didn't properly explain? A business problem you're trying to solve with AI? Reply directly to this email with your questions, and we'll tackle them in next week's edition.
If you found value in today's newsletter, please consider forwarding it to other women in your network who are building, or thinking about building, in the AI space. The more we grow this community, the stronger our collective impact becomes.
Here's to building the future of AI, together.
Lea & Daniela