Is it decided then that screenshots are better input for LLMs than HTML, or is that still an active area of investigation? I see that y'all elected for a mostly screenshot-based approach here, wondering if that was based on evidence or just a working theory.
Not sure, I think there is a lot of research being done here.
Actually, browser use works quite well with vision turned off; it just sometimes gets stuck on trivial vision tasks. The interesting thing is that the screenshot approach is often cheaper than cleaned-up HTML, because some websites have HUGE action spaces.
We looked at some papers (like ferret ui) but i think we can do much better on html tasks. Also, there is a lot of space to improve the current pipeline.
Do you think they do any super fancy magic beyond, for example, how ferret ui does its classification of ui elements? It could be very interesting to test head to head how much better you can make computer use by adding html (it's much better in our quick testing, we just don't know the numbers).
Per research across companies, both help; screenshots are worse, but only marginally.
The computer use stuff gets me fired up enough that I end up always sharing this, even though when delivered concisely without breaking NDAs, it can sound like a hot take:
The whole thing is a dead end.
I saw internal work at a FAANG on this for years, and even in the case where the demo is cooked up to "get everything right", intentionally, to figure out the value of investing in chasing this further...it's undesirable, for design reasons.
It's easy to imagine being wowed by the computer doing something itself, but when it's us, it's a boring and slow way to get things done that's scary to watch.
Even with the staged 100% success rate, our meat-brains cheerfully simulated knowing it's < 100%; the fear is akin to watching a toddler a month into walking, except the toddler has your credit card, a web browser, and instructions to buy a ticket.
I humbly and strongly suggest to anyone interested in this space to work towards CLI versions of this concept. Now you're non-blocking, you're in a more "native" environment for the LLM, and you're much cheaper.
If that sounds regressive and hardheaded, Microsoft, in particular, has plenty of research on this subject, and there's a good amount from diverse sources.
Note the 20%-40% success rates they report, then note that completing a full task successfully is the product of those per-step 20%-40% rates. To get an intuition for how this affects the design experience, think how annoying it is to have to repeat a question because Siri/Assistant/whatever voice assistant didn't understand it, and those assistants have only roughly ~5 errors per 100 words.
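To make that compounding concrete, here's an illustrative sketch (the 5-step/30% figures are my own example numbers, chosen from the reported 20%-40% band):

```python
# Illustrative only: if each step of a task succeeds independently with the
# reported 20%-40% probability, whole-task success collapses fast.
def task_success_rate(step_rates):
    """Probability that every step in a multi-step task succeeds."""
    p = 1.0
    for r in step_rates:
        p *= r
    return p

# A 5-step task at 30% per step succeeds about 0.24% of the time:
print(task_success_rate([0.3] * 5))  # ~0.0024
```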
(handwaving) I'd rather be in a loop of "here's our goal. here's latest output from the CLI. what do we type into the CLI" than the GUI version of that loop.
Hmm, but isn't this how we handle it? We just have a CLI that outputs exactly that, goal and state, and asks the user for more clarity if needed, no GUI.
The original idea was to make it completely headless.
I'm sorry I'm definitely off today, and am missing it, I appreciate your patience.
I'm thinking maybe the goal/state stuff might have clouded my point. Setting aside prompt engineering, just thinking of the stock AI UIs today, i.e. chat based.
Then, we want to accomplish some goal using GUI and/or CLI. Given the premise that I'd avoid GUI automation, why am I saying CLI is the way to go?
A toy example: let's say the user says "get my current IP".
If our agent is GUI-based, maybe it does: open Chrome > type in whatismyip.com > recognize the IP from a screenshot.
If our agent is CLI-based, maybe it does: run the curl command to fetch the user's IP from a public API (e.g. curl whatismyip.com) > parse the output to extract the IP address > return the IP address to the user as text.
In the CLI example, the agent interacts with the system using native commands (in this case, curl) and text outputs, rather than trying to simulate GUI actions and parse screenshot contents.
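As a rough sketch of that CLI path (the ipify endpoint and the regex are my own stand-ins, not part of the original example; any service returning the caller's IP as plain text would do):

```python
import re
import subprocess

def extract_ip(text):
    """Parse an IPv4 address out of plain text output."""
    match = re.search(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", text)
    return match.group(0) if match else None

def get_current_ip():
    # Native command, structured text output: no browser, no screenshots.
    result = subprocess.run(
        ["curl", "-s", "https://api.ipify.org"],
        capture_output=True, text=True, timeout=10,
    )
    return extract_ip(result.stdout)
```

The whole "recognize IP from screenshot" step collapses into a one-line regex over text the tool already emits.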
Why do I believe that's preferable to GUI-based automation?
1. More direct/efficient - no need for browser launching, screenshot processing, etc.
2. More reliable - dealing with only structured text output, rather than trying to parse visual elements
3. Parallelizable: I can have N CLI shells, but only 1 GUI shell, which is shared with the user.
4. In practice, I'm basing this on observations of the GUI-automation project I mentioned (accepting that computer automation is desirable), and...work I did building an end-to-end testing framework for devices paired to phones, both iOS and Android.
What the? Where did that come from?
TL;DR: I've loved E2E tests for years, and it was stultifying to see how little they were used beyond the testing team due to flakiness. Even small things like "launch the browser" are extremely fraught. How long do we wait? How often do we poll? How do we deal with some dialog appearing in front of the app? How do we deal with not having the textual view hierarchy for the entire OS?
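Those "how long to wait, how often to poll" questions end up as hand-tuned loops like this sketch (every constant here is a guess that will be wrong on some device, someday):

```python
import time

# Illustrative poll-until-ready helper of the kind E2E frameworks accumulate.
def wait_until(predicate, timeout=30.0, interval=0.5):
    """Poll predicate() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)  # poll too fast: CPU churn; too slow: slow tests
    return False  # timed out: is the app slow, hung, or behind a dialog?
```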
I doubt screenshots would be better input, considering that e.g. <select> box options and other markup are hidden visually until a user interacts with something.
Screenshots aren't as accurate or context-rich as HTML, but they let you bypass the hassle of building logic for permissions and authentication across different apps to pull in text content for the LLM.
Context length + API cost is right now the main bottleneck for huge HTML + CSS files. The extraction here is already quite efficient, but still:
with past messages + system prompt + sometimes extracted text + extracted interactive elements you are quickly already around 2,500 tokens (for gpt-4o, ~$0.01).
If you extract the entire HTML and CSS, your cost + inference time quickly go up 10x.
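Using the figures quoted above (2,500 tokens ≈ $0.01), the 10x claim is just arithmetic; a quick sanity check:

```python
# Back-of-envelope check using the figures from the comment above.
cost_per_prompt = 0.01       # ~2,500 tokens with extracted elements
tokens_per_prompt = 2_500
price_per_token = cost_per_prompt / tokens_per_prompt

full_page_tokens = 25_000    # raw HTML + CSS at ~10x the extracted size
full_page_cost = price_per_token * full_page_tokens
print(round(full_page_cost, 4))  # ~0.1 per step, ten times the extracted cost
```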
Nope:
A 1280x1024 image at low resolution with gpt-4o is 85 tokens, so approx $0.0002 (so ~100x cheaper). For high resolution it's approx $0.002.
https://openai.com/api/pricing/
I do this for my extension [0], but the HTML is often too large for context window sizes. I end up scraping the relevant pieces before sending them to the LLM.