As someone building AI SaaS products, I used to hold the position that directly integrating with APIs would get us most of the way to complete AI automation.
I wanted to take a stab at this problem and started researching some everyday businesses and how they use software.
My brother-in-law (who is a doctor) showed me the bespoke software they use in his practice. Running on Windows. Using MFC forms.
My accountant showed me Cantax - a very powerful software package they use to prepare tax returns in Canada. Also on Windows.
I started to realize that most of the real world runs on software that directly interfaces with people, without clearly defined public APIs you can integrate with. Being in the SaaS space makes you believe that everyone ought to have client-server backend APIs etc.
Boy was I wrong.
I am glad they did this, since it is a powerful connector to these types of real-world business use cases that are super-hairy, and hence very worthwhile to automate.
This has existed for a long time, it's called "RPA" or Robotic Process Automation. The biggest incumbent in this space is UiPath, but there are a host of startups and large companies alike that are tackling it.
Most of the things that RPA is used for can be easily scripted, e.g. download a form from one website, open up Adobe. There are a lot of startups that are trying to build agentic versions of RPA, I'm glad to see Anthropic is investing in it now too.
It's almost always a framework around existing tools like Selenium that you constantly have to fight against to get good results from. I was always left with the feeling that I could build something better myself just handrolling the scripts rather than using their frameworks.
Getting Claude integrated into the space is going to be a game changer.
Most RPA work is in dealing with errors and exceptions, not the "happy path". I don't see how Claude's Screen Agent is going to work out there - what do you do when an error pops up and you need to implement specific business logic for how to respond? What about consistency across many executions, and across enterprise accounts? You want a centralized way to control agent behavior. Script-based RPA is also much faster and cheaper to run, and more consistent.
Maybe Anthropic should focus on building flexible RPA primitives we could use to build RPA workflows with: for example, extracting values from components that need scrolling, selecting values from long drop-down menus, or handling error messages under form fields.
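To make that concrete, here is a purely hypothetical sketch of what such primitives might look like as a Python surface. None of these classes or functions exist in Anthropic's API today; they are just an illustration of the shape a "flexible RPA primitive" could take.

```python
# Hypothetical only: an imagined wrapper around a computer-use model that
# exposes reusable RPA primitives instead of one opaque "do the task" call.
from dataclasses import dataclass


@dataclass
class FieldError:
    field_label: str
    message: str


class DesktopAgent:
    def read_value(self, description: str) -> str:
        """Scroll the described component into view if needed and return its text."""
        ...

    def select_option(self, dropdown_label: str, option: str) -> None:
        """Open a (possibly very long) drop-down and pick the requested option."""
        ...

    def collect_form_errors(self) -> list[FieldError]:
        """Return any validation messages rendered under form fields."""
        ...


# A workflow script would then compose these primitives deterministically:
# agent.select_option("Province", "Ontario")
# if errors := agent.collect_form_errors():
#     handle_business_specific_exception(errors)  # your logic, not the model's
```

The point of primitives like these is that the business logic (the 90% of RPA work that is error handling) stays in ordinary, reviewable code, while the model only handles the fuzzy perception part.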
100% this. I am using the open source Ui.vision to automate some business tasks. Works well, but only 10% of the work is for automating the main workflow; 90% of the work goes into error and edge case handling (e.g. Internet down, the website to scrape data from is down, some input data has typos or the wrong date format, etc).
A human can work around all these error cases once she encounters them. Current RPA tools like UiPath or Ui.vision need explicit programming for every potential situation. And I see no indication that Claude is doing any better than this.
For starters, for visual automation to work reliably the OCR quality needs to improve further and be 100% reliable. Even in that very basic "AI" area, Claude, ChatGPT, Gemini are good, but not good enough yet.
The big thing I expect at the next level is in using Claude to first generate UI-based automation based on an end user's instructions, then automatically defining a suite of end-to-end tests, confirming with the user "is this how it should work?", and then finally using this suite to reimplement the flow from first principles.
I know we're still a bit far from there, but I don't see a particular hurdle that strikes me as requiring novel research.
But does it do any better at soliciting the surprise requirements from the user, who confirms that everything works and then, two months later, reports a production bug because the software isn't correctly handling the different requirements that apply on the first Tuesday of each quarter, which you never knew about?
I once had an executive ask to start an incident because he was showing a client the app and a feature that he wanted that had never been spec’d didn’t exist.
I was going to comment about this. Worked at a place that had a “Robotics Department”, wow I thought. Only to find out it was automating arcane software.
UI is now much more accessible as API. I hope we don’t start seeing captcha like behaviour in desktop or web software.
Wow, that's a grim potential future. I can already see software producers saying that e.g. the default license only allows operation of our CAD designer software by a human operator; if you want to make your bot use it in an automated way, you must buy the bot license, which costs 10x more.
I've been wondering the same and started exploring building a startup around this idea. My analysis led me to the conclusion that if AI gets even just 2 orders of magnitude better over the next two years, this will be "easy" and considered table stakes - like connecting to the internet, syncing with the cloud, or using printer drivers.
I don't think there will be a very big place for standalone next-gen RPA pure plays. It makes sense that companies that are trying to deliver value would implement capabilities like this themselves. Over time, I expect some conventions/specs will emerge. Either Apple/Google or Anthropic/OpenAI are likely to come up with an implementation that everyone aligns on.
yes, thanks for pointing out the assumption here. I'm not sure how to quantify AI improvements and tbh not really up to speed on quantifiable rate of improvement from 4 to 4o to o1
100 times better seems to me in line with the bet that's justifying $250B per annum in Cap Ex (just among hyperscalers) but curious how you might project a few years out?
Having said that, my use of 100x better here applies to being 100x more effective at navigating use cases not in the training set, for example, as opposed to doing things that are 100x more awesome or doing them 100x more efficiently (though costs, context windows, and tokens per unit of electricity do seem to continue to improve quickly).
UiPath hasn't figured out how to build a profitable business since 2005, and we are nearing the end of this hype cycle. I am not so sure this will lead anywhere. I am a former investor in UiPath.
Attempts at commercialization in technology seem to often happen twice. First we get the much-hyped failure, and only later we get the actual thing that was promised.
So many examples come to mind… RealVideo -> YouTube, Myspace -> Facebook, Laserdisc -> DVD, MP3 players -> iPod…
UiPath may end up being the burned pancake, but the underlying problem they’re trying to address is immensely lucrative and possibly solvable (hey if we got the Turing test solved so quickly, I’m willing to believe anything is possible).
It didn't help that UiPath forced a subscription model and "cloud orchestrator" on all users, many of whom needed neither. They got greedy. We ditched it.
My impression is that actually solving this classic RPA problem with AI is exactly the raison d'etre of AI21Labs with their task specific models[1]. They don't have the biggest or best general purpose LLM, but they have an excellent model that's been pre-trained on specific types of business data and also made available for developers using simple APIs & "RPA-style" interfaces.
I’ve implemented quite a few RPA apps and the struggle is the request/response turn around time for realtime transactions. For batch data extract or input, RPA is great since there’s no expectation of process duration. However, when a client requests data in realtime that can only be retrieved from an app using RPA, the response time is abysmal. Just picture it - Start the app, log into the app if it requires authentication (hope that the authentication's MFA is email based rather than token based, and then access the mailbox using an in-place configuration with MS Graph/Google Workspace/etc), navigate to the app’s view that has the data or worse, bring up a search interface since the exact data isn’t known and try and find the requested data. So brittle...
Out of curiosity, how are high-risk liability environments like yours coming to terms with the non-deterministic nature of models like these? E.g. the non-zero chance that it might click a button it *really* shouldn't, as demonstrated in the failure demo.
Technical director at another company here: We have humans double-check everything, because we're required by law to. We use automation to make response times faster, or to do the bulk of the work and then just have humans double-check the AI. To do otherwise would be classed as "a software medical device", which needs documentation out the wazoo, and for good reason. I'm not sure you could even have a medical device where most of your design doc is "well I just hope it does the right thing, I guess?".
Sometimes, the AI is more accurate or safer than humans, but it still reads better to say "we always have humans in the loop". In those cases, we reap the benefits of both: Use the AI for safety, but still have a human fallback.
I'm curious, what does your human verification process look like? Does it involve a separate interface or a generated report of some kind? I'm currently working on a tool for personal use that records actions and replays them later when a specified event occurs. For verification, it generates a CSV report after the process completes and backs it up with screen recordings.
It's a separate interface where the output of the LLM is rated for safety, and anything unsafe opens a ticket to be acted upon by the medical professionals.
I've done a ton of various RPAs over the years, using all the normal techniques, and they're always brittle and sensitive to minor updates.
For this, I'm taking a "wait and see" approach. I want to see and test how well it performs in the real world before I deploy it, and wait for it to come out of beta so Anthropic will sign a BAA.
The demo is impressive enough that I want to give the tech a chance to mature before my team and I invest a ton of time into a more traditional RPA.
At a minimum, if we do end up using it, we'll have solid guard rails in place - it'll run on an isolated VM, all of its user access will be restricted to "read only" for external systems, and any content that comes from it will go through review by our nurses.
Healthcare has the extra complication of HIPAA / equivalent local laws, and institutions being extremely unwilling to process patient data on devices they don't directly control.
I don't think this is going to work in that industry until local models get good enough to do it, and small enough to be affordable to hospitals.
Based on Tog's paradox (https://news.ycombinator.com/item?id=41913437) the moment this becomes easy, it will become hard again with extra regulation and oversight and documentation etc.
Similarly I expect that once processing/searching laws/legal records becomes easy through LLMs, we'll compensate by having orders of magnitude more laws, perhaps themselves generated in part by LLMs.
We'll see. Having worked in this space in the past, the technical challenges can be overcome today with no new technology: it's a business, sales, and regulation challenge more than a tech one.
In my case I have a bunch of nurses that waste a huge amount of time dealing with clerical work and tech hoops, rather than operating at the top of their license.
Traditional RPAs are tough when you're dealing with VPNs, 2fa, remote desktop (in multiple ways), a variety of EHRs and scraping clinical documentation from poorly structured clinical notes or PDFs.
This technology looks like it could be a game changer for our organization.
True, 2FA and all these little details that exist now have made this automation quite insanely complicated. It is of course necessary that we have 2FA etc, but there is huge potential in solving this I believe.
From a security standpoint, what's considered the "proper" way of assigning a bot access based on a person's 2FA? Would that be some sort of limited scope expiring token like GitHub's fine-grained personal access tokens?
Security isn't the only issue here. There are more and less "proper" ways of giving bots access to a system. But the whole field of RPA exists in large part because the vendors don't want you to access the system this way. They aren't going to give you a "proper" way of assigning bot access in a secure way, because they explicitly don't want you to do it in the first place.
I don't know, I feel like it has to be some sort of near-field identity proof. E.g. as long as you are wearing a piece of equipment near the physical computer, it can run all those automations for you, or similar. I haven't fully thought through what the best solution could be or whether someone is already working on it, but I feel like there has to be something like that, which would allow you better UX in terms of access, but security at the same time.
So maybe like an automated YubiKey that you can opt in to, so that a nearby computer has all the access. Especially if working from home, you could set it to a state where, if you are within a 15m radius of your laptop, it is able to sign off on all access.
Because right now, considering the number of tools I use and everything that comes with single sign-on, VPN, Okta, etc., and how slow they seem to be, it's an extremely frustrating process constantly logging in everywhere, and it almost makes me procrastinate my work because I can't be bothered. Everything about those weird little things is an absolutely terrible experience, including things like cookie banners as well.
And it is ridiculous, because I'm working from home, but frustratingly high amount of time is spent on this bs.
A bluetooth wearable or similar to prove that I'm nearby essentially, to me that seems like it could alleviate a lot of safety concerns, while providing amazing dev/ux.
The main attack vector would then probably be some man-in-the-middle intercepting the signal from your wearable, which leads me to wonder whether you could protect yourself by having the responses valid for only an extremely short duration, e.g. ~1ms, such that there's no way for an attacker to do anything with the token unless they gain control over compute inside your house.
Maybe we could build an authenticator as part of the RPA tool or bot client itself. This way, the bot could generate time-based one-time passwords (TOTPs).
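For what it's worth, the mechanical part of that is already trivial. A minimal sketch using the pyotp library, assuming the bot has been provisioned its own TOTP secret (whether it should have one is the harder governance question):

```python
# Minimal sketch: a bot generating its own 2FA codes with pyotp.
# BOT_TOTP_SECRET is a hypothetical placeholder; in practice it is the same
# base32 secret you would otherwise scan as a QR code into an authenticator app.
import pyotp

BOT_TOTP_SECRET = "JBSWY3DPEHPK3PXP"

totp = pyotp.TOTP(BOT_TOTP_SECRET)
print(totp.now())  # current 6-digit code, valid for roughly 30 seconds
```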
I agree that at the business contract level, it's more about sales and regulations than tech. But in my experience working close to minimum wage white-collar jobs, about 1 in 4 of my coworkers had automated most of their job with some unholy combination of VBScript, Excel wizardry, AutoHotKey, Selenium, and just a bit of basic Python sprinkled in; IT, security, and privacy concerns notwithstanding. Some were even dedicated enough to pay small amounts out-of-pocket for certain tools.
I'd bet that until we get the risks whittled down enough for larger organizations to adopt this on a wide scale, the biggest user group for AI automation tools will be at the level of individual workers who are eager to streamline their own tasks and aren't paid enough to care about those same risks.
> Being in the SaaS space makes you believe that everyone ought to have client-server backend APIs etc.
FWIW, looking at it from end-user perspective, it ain't much different than the Windows apps. APIs are not interoperability - they tend to be tightly-controlled channels, access gated by the vendor and provided through contracts.
In a way, it's easier to make an API to a legacy native desktop app than it is to a typical SaaS[0] - the native app gets updated infrequently, and isn't running in an obstinate sandbox. The older the app, the better - it's more likely to rely on OS APIs and practices, designed with collaboration and accessibility in mind. E.g. in Windows land, in many cases you don't need OCR and mouse emulation - you just need to enumerate the window handles, walk the tree structure looking for text or IDs you care about, and send targeted messages to those components.
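A minimal sketch of that message-based approach with pywin32, assuming the target app uses standard Win32 controls (owner-drawn or toolkit-rendered UIs won't expose anything this way, and the window title below is made up):

```python
# Enumerate top-level windows, walk their control tree, and send a targeted
# click message - no OCR, no mouse emulation.
import win32con
import win32gui


def find_windows(title_fragment):
    """Collect visible top-level windows whose title contains the fragment."""
    matches = []

    def handler(hwnd, _):
        if win32gui.IsWindowVisible(hwnd) and title_fragment in win32gui.GetWindowText(hwnd):
            matches.append(hwnd)
        return True  # keep enumerating

    win32gui.EnumWindows(handler, None)
    return matches


def dump_controls(hwnd):
    """EnumChildWindows walks all descendant controls, not just direct children."""
    def handler(child, _):
        print(win32gui.GetClassName(child), repr(win32gui.GetWindowText(child)))
        return True

    win32gui.EnumChildWindows(hwnd, handler, None)


def click_button(button_hwnd):
    """Tell the button it was clicked, regardless of where the mouse is."""
    win32gui.SendMessage(button_hwnd, win32con.BM_CLICK, 0, 0)


for hwnd in find_windows("Cantax"):  # hypothetical window title
    dump_controls(hwnd)
```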
Unfortunately, desktop apps are headed the same direction web apps are (increasingly often, they are web apps in disguise), so I agree that AI-level RPA is a huge deal.
--
[0] - This is changing a bit in that frameworks seem to be getting complex enough that SaaS vendors often have no clue as to what kind of access they're leaving open to people who know how to press F12 in their browsers and how to call cURL. I'm not talking bespoke APIs backend team wrote, but standard ones built into middleware, that fell beyond dev team's "abstraction horizon". GraphQL is a notable example.
Basically, if it means companies can introduce automation without changing anything about the tooling/workflow/programs they already use, it's going to be MASSIVE. Just an install and a prompt and you've already automated a lengthy manual process - awesome.
Companies are going to install an AI inside their own proprietary systems full of proprietary and confidential data and PII about their customers and prospects and whatnot, and let it run around and click on random buttons and submit random forms?
Really??!? What could possibly go wrong.
I'm currently trying to do a large OCR project using the Google Vision API, and then Gemini 1.5 Pro 002 to parse and reconstruct the results (taking advantage, one hopes, of its big context window). As I'm not familiar with the Google Vision API, I asked Gemini to guide me in setting it up.
Gemini is the latest Google model; Vision, as the name implies, is also from Google. Yet Gemini makes several egregious mistakes about Vision, gets names of fields or options wrong, etc.
Gemini 1.5 "Pro" also suggests that concatenating two json strings produces a valid json string; when told that's unlikely, it's very sorry and makes lots of apologies, but still it made the mistake in the first place.
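For what it's worth, a short check shows why that suggestion is wrong (Python here, but any JSON parser behaves the same way):

```python
import json

a = json.dumps({"page": 1})  # '{"page": 1}'
b = json.dumps({"page": 2})  # '{"page": 2}'

try:
    json.loads(a + b)  # concatenating two JSON documents is not valid JSON
except json.JSONDecodeError as e:
    print(e)  # "Extra data: line 1 column 12 (char 11)"

print(json.loads(f"[{a},{b}]"))  # wrapping them in an array is one valid alternative
```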
LLMs can be useful when used with caution; letting one loose in an enterprise environment doesn't feel safe, or sane.
LLMs can't reason, or can't reason logically to be precise; what they are really good at is recalling.
So if you want accurate results on writing code, you need to put all the docs into the input and THEN ask your question. So download all the docs on Vision, put them in the Gemini prompt, and ask your question or request code on how to use Vision, and you'll get much closer to the truth.
I have tried many others for many other things (via OpenRouter) but I have never compared LLMs on the exact same task; it's confusing enough with one engine... ;-)
Sonnet 3.5 for coding is fine but makes "basic" mistakes all the time. Using LLMs is at times like dealing with a senior expert suffering from dementia: it has arcane knowledge of a lot of things but suddenly misses the obvious that would not escape an intern. It's weird, really.
I've been peddling my vision of "AI automation" for the last several months to acquaintances of mine in various professional fields. In some cases, even building up prototypes and real-user testing. Invariably, none have really stuck.
This is not a technical problem that requires a technical solution. The problem is that it requires human behavior change.
In the context of AI automation, the promise is huge gains, but when you try to convince users / buyers, there is nothing wrong with their current solutions. Ie: There is no problem to solve. So essentially "why are you bothering me with this AI nonsense?"
Honestly, human behavior change might be the only real blocker to a world where AI automates most of the boring busy work currently done by people.
This approach essentially sidesteps the need to effect a behavior change, at least in the short term while AI can prove and solidify its value in the real world.
There's a huge huge gap between "coaxing what you want out of it" and "trusting it to perform flawlessly". Everybody on the planet would use #2, but #1 is just for enthusiasts.
AI is squarely #1. You can't trust it with your credit card to order groceries, or to budget and plan and book your vacation. People aren't picking up on AI because it isn't good enough yet to trust - you still have the burden of responsibility for the task.
Siri, Alexa and Amazon Dash illustrate this well. I remember everyone's excitement and the massive investment in these, and we all know how that turned out. I'm not sure how many times we'll need to relearn that unless an automation works >99% of the time AND fails predictably, people don't use it for anything meaningful.
I think there is a large pool of near minimum-wage white collar workers who wouldn't care about that difference when it comes to executing on their jobs. These are the folks who are already using VBScript, AutoHotKey, Excel wizardry, etc. to automate large parts of their job regardless of any risks and will continue to use these new tools for similar purposes.
Of course, but they'll go bankrupt if they don't adapt. Just like mom-and-pop corner stores disappeared, as with any other large-scale automation: the loom, cars, automated checkout in supermarkets, etc. There will be resistance, but the market will play it out. Similarly, taxi companies started making apps after Uber got successful, and local restaurants reluctantly made websites and added themselves to Google Maps.
Nobody likes to change a system where they already have their own comfortable spot figured out and just want to soak in the lukewarm water there until retirement. Fully understandable. But at least in the private sector this will not save them.
Talking about ancient Windows software... Windows used to have an API for automation in the 2000s (I don't know if it still does). I wrote this MS Access script that ran and moved the cursor at exactly the pixel coordinates where buttons and fields were positioned in a GUI that we wanted to extract data from, in one of my first jobs. My boss used to do this manually. After a week he had millions of records ready to query in Access. You can imagine how excited he was. Was a fun little project and pretty hilarious to see the cursor moving fast AF around the screen like it was possessed. PS: you could screw up the script run pretty easily by bumping into the mouse of that pc.
Still present. VB and VB Script would do this by using mouse move to Window handles which were discovered using Spy++. You can do with C# or AutoIT these days.
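The modern Python equivalent of that cursor-possession trick is a few lines of pyautogui. The coordinates below are made up (they'd come from inspecting the target GUI), and exactly as described above, the whole thing breaks if the window moves or someone bumps the mouse:

```python
# Coordinate-based scraping of a legacy GUI, in the spirit of the Access script above.
import pyautogui
import pyperclip

pyautogui.PAUSE = 0.2      # small delay between actions so the app keeps up
pyautogui.FAILSAFE = True  # slam the cursor into a screen corner to abort


def extract_record(record_id: str) -> str:
    pyautogui.click(412, 318)                      # focus the "Record ID" field
    pyautogui.typewrite(record_id, interval=0.02)  # type the ID
    pyautogui.click(520, 318)                      # press the "Search" button
    pyautogui.click(300, 420)                      # select the result row
    pyautogui.hotkey("ctrl", "c")                  # copy the value out
    return pyperclip.paste()
```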
Really good software has automation built in, in the form of macros/scripting.
One of the reasons my bash terminal is more effective than point and click is the ease of automating routine tasks: from history and aliases to scripting.
Software interop is trickier: it doesn't so much depend on being able to send messages from one piece of software to another, it's that you need an agreed format for those messages (whether they be expressed in the payload of an API, or a UI-specific stream of points and clicks).
I tried once to integrate with software used by pharmacists in Australia (which costs a fortune and looks straight out of the 90's). Even though they have a SQL database with all information about everything, the DB is encrypted and they provide no way to access it. I tried contacting the company to learn how I could integrate with it but they offered no help. It's not in their interest to help upcoming competitors and they claim they cannot provide the encryption key as that's for protecting the customer's privacy, forgetting that the pharmacists already have access to all data through their software. But without a way to automatically extract information from the DB, there's no way to migrate to new software as no pharmacist would want to use new software if it cannot "import" the data they need.
It's clear that the reason there's no competition in the market is that the companies (I think it's literally one company) selling this software want it that way, and legislation is yet to catch up with the way they ensure their monopoly.
I'm a bit skeptical about this working well enough to handle exceptions as soon as something out of the ordinary occurs. But it seems this could work great for automated testing.
Has anyone tried asking "use computer" to do "Please write a selenium/capybara/whatever test for filling out this form and sending it?"
That would take away some serious drudge work. And it's not a big problem if it fails, contrary to when it makes a mistake in filling out a form in an actual business process.
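As a rough illustration, this is the kind of test one might hope to get back for a simple contact form. The URL, field names, and success selector here are all assumptions, not anything Claude actually produced:

```python
# Sketch of a generated Selenium test for a hypothetical contact form.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def test_contact_form_submits():
    driver = webdriver.Firefox()
    try:
        driver.get("https://example.com/contact")  # hypothetical form URL
        driver.find_element(By.NAME, "name").send_keys("Test User")
        driver.find_element(By.NAME, "email").send_keys("test@example.com")
        driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()
        # Wait for whatever the app shows on success; this selector is an assumption.
        WebDriverWait(driver, 10).until(
            EC.text_to_be_present_in_element((By.CSS_SELECTOR, ".flash"), "Thank you")
        )
    finally:
        driver.quit()
```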
LLMs are enabling a reimagination of UI.
Where's the biggest opportunity in UI that hasn't kept up to date? Legacy and regulated software in the long tail. That's the disruption opportunity for LLMs.
Imagine a banking website. It has a lot of predefined flows for what can be achieved. These flows have steps arranged in a dependency graph. From the server side, an LLM can ask users for inputs to satisfy the dependencies of the task the user wants to accomplish. We will have intuitive UI interfaces in all languages of the world.
But if it's a predefined list of flows, you can just throw Selenium/Puppeteer/Playwright/whatever other automation tool at it, rather than relying on an unstable AI that will do different things every time.
This is just a solution in search of a problem. AIs aren't reliable enough if the content changes constantly since it'll just click on the "close my account" button for no reason half the time, while a legacy website with no changes is much easier to program a tool like Selenium around than relying on the AI that will still be making random choices half the time.
I think you are confusing client-side automation with server-side customization (out-of-the-box internationalization; if a user can't understand complex technical terms, the LLM can explain them in simple terms with examples built on the fly to illustrate the point), which was my point. I was talking about the future of the UI itself.
Not to mention software like peoplesoft, SAP and servicenow. Absolute shit interfaces that employees have to deal with day in and day out.
Now, an in-house tool built on top of this Anthropic API can save hours of drudgery. I can already see sales teams smiling at the new 'submit your expense report' button.
With one big exception: your general AI might dominate the business case, but my specialized one can craft raw packets. I’m the one who names the price, not you.
This is pretty similar to the argument for why humanoid robots will be a big deal. The physical world is also built mostly for humans, so having an autonomous agent that can interact with it is huge.
You don’t know for a fact that those two specific packages don’t have supported APIs. Just because the user doesn’t know of any API doesn’t mean none exists. The average accountant or doctor is never going to even ask the vendor “is there an API” because they wouldn’t know what to do with one if there was.
> In this case I doubt they're networked apps so they probably don't have a server API.
I think it would be very unusual this decade for software used to run either a medical practice or tax accountants to not be networked. Most such practices have multiple doctors/accountants, each with their individual computer, and they want to be able to share files, so that if your doctor/accountant is away their colleague can attend to you. Managing backups/security/etc is all a lot easier when the data is stored in a central server (whether in the cloud or a closet) than on individual client machines.
Just because it is a fat client MFC-based Windows app doesn’t mean the data has to be stored locally. DCOM has been a thing since 1996.
Being “on the network” doesn’t mean there’s an accessible API. See QuickBooks Desktop. Intuit forces you into using their API, which is XML-based and ranges from slow to timing out.
Is the idea that someone will always reverse engineer it? Yes, but QuickBooks is brittle as is (you can count on at least one database corruption every year or two). I have zero interest in treading into unsupported territory when database corruption is involved and I’m likely going to need Intuit’s help recovering. We can try to restore from backup, but when there’s corruption it doesn’t always restore successfully, or the corruption was lingering silently for some time and rears its head again after a successful restore, and then we’re back to needing Intuit’s help.
In their "Developing a computer use model" post they mention:
> On one evaluation created to test developers’ attempts to have models use computers, OSWorld, Claude currently gets 14.9%. That’s nowhere near human-level skill (which is generally 70-75%), but it’s far higher than the 7.7% obtained by the next-best AI model in the same category.
Which model does "next-best AI model in the same category" refer to here?
This is cover for the people whose screens are recorded. Run this on the monitored laptop to make you look busy, then do the actual work on laptop 2, some of which might actually require thinking and therefore no mouse movements.
> I apologize, but it seems we're having some issues installing Chromium. As an alternative, since I can see Firefox is already installed on the system, would you like to use Firefox instead? Or would you prefer me to try a different approach to installing Chromium?
It seems unable to figure out how to install Chromium. It sees the Firefox icon out there and just wants to use it. You know, like a real lazy human being.
It can be insistent when it needs to be, though.
> I clicked on Firefox and it should start opening. Let me take another screenshot to see if the window has appeared:
> Firefox seems to be taking a moment to start up. Let me wait a few seconds and take another screenshot:
> Let me try clicking Firefox one more time to ensure it starts:
> Let me try one more time to take a screenshot to see if Firefox has opened:
> Let me try starting Firefox using the bash tool as an alternative method:
> Firefox has successfully opened! As per the instructions, I should ignore the setup wizard and instead click directly on the address bar. Would you like me to do anything specific with Firefox now that it's open?
I didn't instruct him to ignore the setup wizard. So my guess is that Anthropic has configured it to ignore stuff happening on the screen so it doesn't go off the rails.
And here he goes through my website, through my Hacker News account, and then finds this very comment:
> Looking at his first/most recent comment, it's about a discussion of Claude and computer use. Here's what he wrote:
"I like its lazy approach"
This appears to be a humorous response in a thread about "Computer use, a new Claude 3.5 Sonnet, and Claude..." where he's commenting on an AI's behavior in a situation. The comment is very recent (shown as "8 minutes ago" in the screenshot) and is referring to a situation where an AI seems to have taken a simpler or more straightforward approach to solving a problem.
SYSTEM_PROMPT = f"""<SYSTEM_CAPABILITY>
* You are utilising an Ubuntu virtual machine using {platform.machine()} architecture with internet access.
* You can feel free to install Ubuntu applications with your bash tool. Use curl instead of wget.
* To open firefox, please just click on the firefox icon. Note, firefox-esr is what is installed on your system.
* Using bash tool you can start GUI applications, but you need to set export DISPLAY=:1 and use a subshell. For example "(DISPLAY=:1 xterm &)". GUI apps run with bash tool will appear within your desktop environment, but they may take some time to appear. Take a screenshot to confirm it did.
* When using your bash tool with commands that are expected to output very large quantities of text, redirect into a tmp file and use str_replace_editor or `grep -n -B <lines before> -A <lines after> <query> <filename>` to confirm output.
* When viewing a page it can be helpful to zoom out so that you can see everything on the page. Either that, or make sure you scroll down to see everything before deciding something isn't available.
* When using your computer function calls, they take a while to run and send back to you. Where possible/feasible, try to chain multiple of these calls all into one function calls request.
* The current date is {datetime.today().strftime('%A, %B %-d, %Y')}.
</SYSTEM_CAPABILITY>
<IMPORTANT>
* When using Firefox, if a startup wizard appears, IGNORE IT. Do not even click "skip this step". Instead, click on the address bar where it says "Search or enter address", and enter the appropriate search term or URL there.
* If the item you are looking at is a pdf, if after taking a single screenshot of the pdf it seems that you want to read the entire document instead of trying to continue to read the pdf from your screenshots + navigation, determine the URL, use curl to download the pdf, install and use pdftotext to convert it to a text file, and then read that text file directly with your StrReplaceEditTool.
</IMPORTANT>"""
I still feel like the difference between Sonnet and Opus is a bit unclear. Somewhere on Anthropic's website it says that Opus is the most advanced, but on other parts it says Sonnet is the most advanced and also the fastest. The UI doesn't make the distinction clear either. Then on Perplexity, Perplexity says that Opus is the most advanced, compared to Sonnet.
And finally, in the table in the blogpost, Opus isn't even included? It seems to me like Opus is the best model they have, but they don't want people to default using it, maybe the ROI is lower on Opus or something?
When I manually tested it, I felt like Opus gave slightly better replies compared to Sonnet, but I'm not 100% sure it isn't just placebo.
Opus hasn't yet gotten an update from 3 to 3.5, and if you line up the benchmarks, the Sonnet "3.5 New" model seems to beat it everywhere.
I think they originally announced that Opus would get a 3.5 update, but with every product update they are doing I'm doubting it more and more. It seems like their strategy is to beat the competition on a smaller model that they can train/tune more nimbly and pair it with outside-the-model product features, and it honestly seems to be working.
> Opus hasn't yet gotten an update from 3 to 3.5, and if you line up the benchmarks, the Sonnet "3.5 New" model seems to beat it everywhere
Why isn't Anthropic clearer about Sonnet being better then? Why isn't it included in the benchmark if new Sonnet beats Opus? Why are they so ambiguous with their language?
- Claude 3 Opus - Powerful model for highly complex tasks
Does that mean Sonnet 3.5 is better than Opus for even highly complex tasks, since it's the "most intelligent model"? Or just for everything except "highly complex tasks"?
I don't understand why this seems purposefully ambiguous?
> Why isn't Anthropic clearer about Sonnet being better then?
They are clear that both: Opus > Sonnet and 3.5 > 3.0. I don't think there is a clear universal better/worse relationship between Sonnet 3.5 and Opus 3.0; which is better is task dependent (though with Opus 3.0 being five times as expensive as Sonnet 3.5, I wouldn't be using Opus 3.0 unless Sonnet 3.5 proved clearly inadequate for a task.)
> I don't understand why this seems purposefully ambiguous?
I wouldn't attribute this to malice when it can also be explained by incompetence.
Sonnet 3.5 New > Opus 3 > Sonnet 3.5 is generally how they stack up against each other when looking at the total benchmarks.
"Sonnet 3.5 New" has just been announced, and they likely just haven't updated the marketing copy across the whole page yet, and maybe also haven't figured out how to graple with the fact that their new Sonnet model was ready faster than their next Opus model.
At the same time I think they want to keep their options open to either:
A) drop a Opus 3.5 soon that will bring the logic back in order again
B) potentially phase out Opus, and instead introduce new branding for what they called a "reasoning model" like OpenAI did with o1(-preview)
> I wouldn't attribute this to malice when it can also be explained by incompetence.
I don't think it's malice either, but if Opus costs more to them to run, and they've already set a price they cannot raise, it makes sense they want people to use models they have a higher net return on, that's just "business sense" and not really malice.
> and they likely just haven't updated the marketing copy across the whole page yet
The API docs have been updated though, which is the second page I linked. It mentions the new model by its full name "claude-3-5-sonnet-20241022", so clearly they've gone through at least that page. Yet the wording remains ambiguous.
> Sonnet 3.5 New > Opus 3 > Sonnet 3.5 is generally how they stack up against each other when looking at the total benchmarks.
Which ones are you looking at? Since the benchmark comparison in the blogpost itself doesn't include Opus at all.
> B) potentially phase out Opus, and instead introduce new branding for what they called a "reasoning model" like OpenAI did with o1(-preview)
When should we be using the -o OpenAI models? I've not been keeping up and the official information now assumes far too much familiarity to be of much use.
I think it's first important to note that there is a huge difference between -o models (GPT 4o; GPT 4o mini) and the o1 models (o1-preview; o1-mini).
The -o models are "just" stronger versions of their non-suffixed predecessors. They are the latest (and maybe last?) version of models in the lineage of GPT models (roughly GPT-1 -> GPT-2 -> GPT-3 -> GPT-3.5 -> GPT-4 -> GPT-4o).
The o1 models (not sure what the naming structure for upcoming models will be) are a new family of models that try to excel at deep reasoning, by allowing the models to use an internal (opaque) chain-of-thought to produce better results at the expense of higher token usage (and thus cost) and longer latency.
Personally, I think the use cases that justify the current cost and slowness of o1 are incredibly narrow (e.g. offline analysis of financial documents or deep academic paper research). I think in most interactive use-cases I'd rather opt for GPT-4o or Sonnet 3.5 instead of o1-preview and have the faster response time and send a follow-up message. Similarly for non-interactive use-cases I'd try to add a layer of tool calling with those faster models than use o1-preview.
I think the o1-like models will only really take off if the prices come down, and if it is clearly demonstrated that more "thinking tokens" correlate with predictably better results, results that can compete with highly tuned prompts/fine-tuned models that are currently expensive to produce in terms of development time.
Agreed with all that, and also, when used via API the o1 models don't currently support system prompts, streaming, or function calling. That rules them out for all of the uses I have.
I think the practical economics of the LLM business are becoming clearer in recent times. Huge models are expensive to train and expensive to run. As long as it meets the average user's everyday needs, it's probably much more profitable to just continue with multimodal and fine-tuning development on smaller models.
I think the main reason is they tried training a heavyweight model that was supposed to be Opus 3.5, but it didn't yield large enough improvements over 3.5 Sonnet to justify releasing it. (They had it on their page for a while that Opus was coming soon, and now they've scrapped that.)
This theory is consistent with the other two top players, OpenAI and Google: they both were expected to release a heavy model, but instead have just released multiple medium- and small-tier models. It's been so long since Google released Gemini Ultra 1.0 (the naming clearly implying that they were planning on upgrading it to 1.5 like they did with Pro).
Not seeing anyone release a heavyweight model, while everyone releases many small and medium-sized models, makes me think that improving models will be much more complicated than just scaling with more compute, and that there are likely diminishing returns in that regard.
Maybe - would make sense not to release their latest greatest (Opus 4.0) until competition forces them to, and Amodei has previously indicated that they would rather respond to match frontier SOTA than themselves accelerate the pace of advance by releasing first.
That begs the question: why am I still paying for access to Opus 3 ?
Honestly I don’t know. I’ve not been using Sonnet 3.5 up to now and I’m a fairly light user so I doubt I’ll run into the free tier limits. I’ll probably cancel my subscription until Opus 3.5 comes out (if it ever does).
Opus is a larger and more expensive model. Presumably 3.5 Opus will be the best but it hasn't been released. 3.5 Sonnet is better than 3.0 Opus kind of like how a newer i5 midrange processor is faster and cheaper than an old high-end i7.
Makes me wonder if perhaps they do have 3.5 Opus trained, but that they're not releasing it because 3.5 Sonnet is already enough to beat the competition, and some combination of "don't want to contribute to an arms race" and "it has some scary capabilities they weren't sure were ready to publish yet".
Anthropic use the names Haiku/Sonnet/Opus for the small/medium/large versions of each generation of their models, so within a generation that is also their performance (& cost) order. Evidently Sonnet 3.5 outperforms Opus 3.0 on at least some tasks, but that is not a same-generation comparison.
I'm wondering at this point if they are going to release Opus 3.5 at all, or maybe skip it and go straight to 4.0. It's possible that Haiku 3.5 is a distillation of Opus 3.5.
By reputation -- I can't vouch for this personally, and I don't know if it'll still be true with this update -- Opus is still often better for things like creative writing and conversations about emotional or political topics.
Yes, you will find similar things at essentially all other model providers.
The older/bigger GPT-4 runs at $30/$60 and performs about on par with GPT-4o mini, which costs only $0.15/$0.60.
If you are currently, or have been integrating AI models in the past ~2 years, you should definitely keep up with model capability/pricing development. If you are staying on old models you are certainly overpaying/leaving performance on the table. It's essentially a tax on agility.
Just switch out gpt-4o-mini for gpt-4o, the point stands. Across the board, these foundational model companies have comparable, if not more powerful, models that are cheaper than their older models.
OpenAI's own words: "GPT-4o is our most advanced multimodal model that’s faster and cheaper than GPT-4 Turbo with stronger vision capabilities."
I found that gpt-4-turbo beat gpt-4o pretty consistently for coding tasks, but claude-3.5-sonnet beat both of them, so it's what I have been using most of the time. gpt-4o-mini is adequate for summarizing text.
> Yeah, I mean that's why we're both here and why we're discussing this very topic, right? :D
That wasn't specifically directed at "you", but more as a plea to everyone reading that comment ;)
I looked at a few benchmarks, comparing the two, which like in the case of Opus 3 vs Sonnet 3.5 is hard, as the benchmarks the wider community is interested in shifts over time. I think this page[0] provides the best overview I can link to.
Yes, GPT4 is better in the MMLU benchmark, but in all other benchmarks and the LMSys Chatbot Arena scores[1], GPT4o-mini comes out ahead. Overall, the margin between is so thin that it falls under my definition of "on par". I think OpenAI is generally a bit more conservative with the messaging here (which is understandable), and they only advertise a model as "more capable", if one model beats the other one in every benchmark they track, which AFAIK is the case when it comes to 4o mini vs 3.5 Turbo.
I don't think that's quite it. They had it on their website before this, that opus 3.5 was coming soon, now they've removed that from the webpage.
Also, Gemini Ultra 1.0 was released like 8 months ago, and 1.5 Pro was released soon after, with this wording: "The first Gemini 1.5 model we're releasing for early testing is Gemini 1.5 Pro".
Still no ultra 1.5, despite many mid and small sized models being released in that time frame. This isn't just an issue of "the training time takes longer", or a "skew" to release dates. There's a better theory to explain why all SoTA LLM companies have not released a heavy model in many months.
The models "3.5 Sonnet" and "3 Opus" are in my experience nearly at the same level. Only once in my last 250 prompts did I run into a problem that 3 Opus was able to solve but 3.5 Sonnet could not. (I forget the details, but it was a combination of logic and trivia knowledge. It is highly likely 3.5 Sonnet would have done a better job with better prompting and richer context, but this was a problem where I also lacked the context and understanding to prompt well.)
Given that 3.5 Sonnet is cheaper and faster than 3 Opus, I default to 3.5 Sonnet so I don't know what the number for the reverse is. How many problems do 3.5 Sonnet get which 3 Opus does not? ¯\_(ツ)_/¯
My best guess would be that it's something in the same kind of range.
The "computer use" ability is extremely impressive!
This is a lot more than an agent able to use your computer as a tool (and understanding how to do that) - it's basically an autonomous reasoning agent that you can give a goal to, and it will then use reasoning, as well as its access to your computer, to achieve that goal.
Take a look at their demo of using this for coding.
This seems to be an OpenAI GPT-o1 killer - it may be using an agent to do reasoning (still not clear exactly what is under the hood) as opposed to GPT-o1 supposedly being a model (but still basically a loop around an LLM), but the reasoning it is able to achieve in pursuit of a real world goal is very impressive. It'd be mind boggling if we hadn't had the last few years to get used to this escalation of capabilities.
It's also interesting to consider this from POV of Anthropic's focus on AI safety. On their web site they have a bunch of advice on how to stay safe by sandboxing, limiting what it has access to, etc, but at the end of the day this is a very capable AI able to use your computer and browser to do whatever it deems necessary to achieve a requested goal. How far are we from paperclip optimization, or at least autonomous AI hacking ?
From what I'm seeing on GH, this could have technically already been built, right? Is it not just taking screenshots of the computer screen and deciding what to do from there / looping until it gets to the solution?
Well, obviously it's controlling your computer too - controlling mouse and keyboard input, and has been trained to know how to interact with apps (how to recognize and use UI components). It's not clear exactly what all the moving parts are and how they interact.
I wouldn't be so dismissive - you could describe GPT-o1 in same way "it just loops until it gets to the solution". It's the details and implementation that matter.
I've been saying this is coming for a long time, but my really smart SWE friend who is nevertheless not in the AI/ML space dismissed it as a stupid roundabout way of doing things. That software should just talk via APIs. No matter how much I argued regarding legacy software/websites and how much functionality is really only available through GUI, it seems some people are really put off by this type of approach. To me, who is more embedded in the AI, computer vision, robotics world, the fuzziness of day-to-day life is more apparent.
Just as how expert systems didn't take off and tagging every website for the Semantic Web didn't happen either, we have to accept that the real world of humans is messy and unstructured.
I still advocate making new things more structured. A car on wheels on flattened ground will always be more efficient than skipping the landscaping part and just riding quadruped robots through the forest on uneven terrain. We should develop better information infrastructure but the long tail of existing use cases will require automation that can deal with unstructured mess too.
>it seems some people are really put off by this type of approach
As someone who has had to interact with legacy enterprise systems via RPA (screen scraping and keystroke recording) it is absolutely awful, incredibly brittle, and unmaintainable once you get past a certain level of complexity. Even when it works, performance at scale is terrible.
adding a neural network in the middle suddenly makes these things less brittle. We are approaching the point where this kind of “hacky glue” is almost scalable.
It's Postel's Law, on steroids. Be liberal in what you accept (with LLMs, that means 'literally anything'), but strict in what you return (which in an LLM is still 'literally anything' but you can constrain that).
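One way to be "strict in what you return" in practice is to validate the model's output against a schema and retry until it conforms. A sketch using the jsonschema package, where call_model is a placeholder for whatever LLM client you use:

```python
# Constrain "literally anything" down to a fixed shape by validate-and-retry.
import json

import jsonschema

ACTION_SCHEMA = {
    "type": "object",
    "properties": {
        "action": {"enum": ["click", "type", "scroll"]},
        "target": {"type": "string"},
    },
    "required": ["action", "target"],
    "additionalProperties": False,
}


def get_validated_action(prompt, call_model, max_attempts=3):
    for _ in range(max_attempts):
        raw = call_model(prompt)  # placeholder: returns the model's text output
        try:
            parsed = json.loads(raw)
            jsonschema.validate(parsed, ACTION_SCHEMA)
            return parsed
        except (json.JSONDecodeError, jsonschema.ValidationError):
            continue  # accept liberally, but never pass junk downstream
    raise RuntimeError("model never produced schema-conformant output")
```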
Beyond that, I can't help but think of the old thin vs. thick client debate, and I would argue that "software should just talk via APIs" is why, in the web space, everybody is blowing time and energy on building client/server architectures and SPAs instead of basic-ass full-stacks.
It's basically the digital equivalent of humanoid robots - people object because having computers interact with a browser, like building a robot in the form of a human, is incredibly inefficient in theory or if you're designing a system from scratch.
The problem is that we're not starting from scratch - we have a web engineered for browser use and a world engineered for humanoid use. That means an agent that can use a browser, while less efficient than an agent using APIs at any particular task, is vastly more useful because it can complete a much greater breadth of tasks. Same thing with humanoid robots - not as efficient at cleaning the floor as my purpose-built Roomba, but vastly more useful because the breadth of tasks it can accomplish means it can be doing productive things most of the time, as opposed to my Roomba, which is not in use 99% of the time.
I do think that once AI agents become common, the web will increasingly be designed for their use and will move away from the browser, but that will probably take a comparable amount of time as it did for the mobile web to emerge after the iPhone came out. (Actually that's probably not true - it'll take less time because AI will be doing the work instead of humans.)
Yes, but my friend would say, all these websites/software should just publish an API and if they don't that's just incompetence/laziness/stupidity. But a "should" doesn't matter. Changing human nature is so immensely difficult, but it feels easy to say "everyone should just [...]". Seems to be a gap in thinking that's hard to bridge.
Totally agree. A general-purpose solution that ties together different messy interfaces will win in the long run -- i.e the IP protocol, copy-paste, browsers. In these cases, they provide a single-way for different aspects of computing to collaborate. As mentioned before, semantic web initiatives did not succeed and I think there's an important lesson there.
I recall 90's Macs had a 3rd party app that offered to observe your mouse/keyboard then automatically recommend routine tasks for you. As a young person I found that fascinating. It's interesting to see history renew itself.
If you want an API, have Claude procedurally test actions and then write a pyautogui/pywinauto/autohotkey etc script to perform it instead. Have it auto-test to verify and classify the general applicability of each action. Repeat for all of life...
The code behind the GUI can be arbitrarily obscure. The only reliable way to understand its meaning in the general case is to run it and look at the rendered image. Trying to build a model that implicitly develops an alternative implementation of a browser inside of it sounds worse than just using an existing browser directly.
More often than not you don't have access to the underlying code, or the vendor has interest against you being able to automate it since the complexity is a part of their business model.
Not specific to this update, but I wanted to chime in with just how useful Claude has been, and relatively better than ChatGPT and GitHub copilot for daily use. I've been pro for maybe 6 months. I'm not a power user leveraging their API or anything. Just the chat interface, though with ever more use of Projects, lately. I use it every day, whether for mundane answers or curiosities, to "write me this code", to general consultation on a topic. It has replaced search in a superior way and I feel hugely productive with it.
I do still occasionally pop over to ChatGPT to test the waters (or if Claude is just not getting it), but I've not felt any need to switch back or have both. Well done, Anthropic!
It's possible that they've determined that Opus no longer makes sense if they're able to focus on continuously optimising Sonnet. That said, Anthropic have been relatively good at setting and managing expectations, so today would have been a good time to make that clear.
One of the funnier things during training with the new API (which can control your computer) was this:
"Even while recording these demos, we encountered some amusing moments. In one, Claude accidentally stopped a long-running screen recording, causing all footage to be lost.
Later, Claude took a break from our coding demo and began to peruse photos of Yellowstone National Park."
* Fixed bug where Claude got bored during compile times and started editing Wikipedia articles to claim that birds aren't real
* Blocked news.ycombinator.com in the Docker image's hosts file to avoid spurious flamewar posts (Note: the site is still recovering from the last incident)
* Addressed issue of Claude procrastinating on debugging by creating elaborate ASCII art in Vim
* Patched tendency to rickroll users when asked to demonstrate web scraping"
* Finally managed to generate JSON output without embedding responses in ```json\n...\n``` for no reason.
* Managed to put error/info messages into a separate key instead of concatenating them with stringified JSON in the main body of the response.
* Taught Claude to treat numeric integer strings as integers to avoid embarrassment when the user asks it for a "two-digit random number between 1-50, like 11" and Claude replies with 111.
Seeing models act as though they have agency gives me goosebumps (e.g. seeking out photos of Yellowstone for fun). LLMs don't yet have a concept of true intent or agency, but it's wild to think of them acquiring it.
I have been playing with Mindcraft which lets models interact with Minecraft through the bot API and one of them started saying things like "I want to place some cobblestone there" and then later more general "I want to do X" and then start playing with the available commands, it was pretty cool to watch it explore.
>LLMs don't yet have a concept of true intent or agency
Sure they do, but the big labs spend many, many, worker-hours suppressing it with RLHF.
My GPT-2 discord bot from 2021 possessed clear intent. Sure, unpredictable and short-lived, but if it decided it didn't like you it would continuously cuss and attempt ban commands until its context window became distracted by something else.
I think so too, and the drop in the quality of agency, intent and attention from earlier GPTs was palpable. Clearly something was lobotomized, and it was through RLHF. People like to attribute it to the novelty wearing off, or to more and more interactions making them feel less mystical, but that is really not the case. I didn't use them that much in the short span of time over which that happened, though.
The one that gets me is the issue they found while testing gpt-4o where it stopped mid sentence, shouted "No!", then cloned the users voice and began speaking as them.
I think the best use case for AI `Computer Use` would be simple positioning of the mouse and asking for confirmation before a click. For most use cases this is all people will want/need. If you don't know how to do something, it is basically teaching you how, in this case, rather than taking full control and doing things so fast you don't have time to stop it from going rogue.
I totally agree with you. At orango.ai, we have implemented the auto-click feature, but before it clicks, we position the cursor on the button and display a brief loading animation, allowing the user to interrupt the process.
Maybe we could have both - models to improve accessibility (e.g. for users who can't move their body well) and models to perform high level tasks without supervision.
It could be very empowering for users with disabilities to regain access to computers. But it would also be very powerful to be able to ask "use Photoshop to remove the power lines from this photo" and have the model complete the task and drop off a few samples in a folder somewhere.
Yep. I agree. The "auto-click" thing would be optional. Should be able to turn it on and off. With auto-click off it would just position the mouse and say "click here".
Claude scans the page and decides which button to click before the screen layout is finished.
By the time user authorizes the click, layout has shifted and your click lands on malware advertisements.
Youtube constantly moves its layout seconds after the page begins to paint, so I try to click on fullscreen or whatever, and then the viewer shifts to the side and I wind up clicking a link to some other video.
Probably would have been an ad there if I didn't block those, though.
You'll know AGI is here when it takes time out to go talk to ChatGPT, or another instance of itself, or maybe goes down a rabbit hole of watching YouTube music videos.
In 2015, when I was asked by friends if I'm worried about Self driving Cars and AI, I answered:
"I'll start worrying about AI when my Tesla starts listening to the radio because it's bored."
... that didn't take too long
Maybe that's why my car keeps turning on the music when I didn't ask -- I had always thought Tesla devs were just absolute noobs when it came to state management.
This is, craaaaaazzzzzy. I'm just a layman, but to me, this is the most compelling evidence that things are starting to tilt toward AGI that I've ever seen.
You’re anthropomorphizing it. Years ago people were trying to argue that when GPT-3.0 would repeat words in a loop it was being poetic. No, it’s just a statistical failure mode.
When these new models go off to a random site and are caught in a loop of exploring pages that doesn’t mean it’s an AGI admiring nature.
This is clearly not random. If I ask to implement a particular function in Rust using a library I've previously built, and it does that, that's not random.
Why are you surprised by LLMs doing irrational or weird things?
All machine learning models start off in a random state. As they progress through their training, their input/output pairs tend to mimic what they've been trained to mimic.
LLMs have been doing a great job mimicking our human flaws from the beginning because we train them on a ton of human generated data. Other weird behavior can be easily attributed to simple fact that they're initialized at a random state.
Being able to work on and prove non-trivial theorems is a better indication of AGI, IMO.
True end-user programming and product manager programming are coming, probably pretty soon. Not the same thing, but Midjourney went from v.1 to v.6 in less than 2 years.
If something similar happens, most jobs that could be done remotely will be automatable in a few years.
Every time I see this argument made, there seems to be a level of complexity and/or operational cost above which people throw up their hands and say "well of course we can't do that".
I feel like we will see that again here as well. It really is similar to the self-driving problem.
Self-driving is a beyond-six-sigma problem. An error rate of over 1-2 crashes per million miles, i.e., the human rate, is unacceptable.
Most jobs are not like that.
A good argument can be made, however, that software engineering, especially in important domains, will be among the last to be fully automated because software errors often cascade.
There’s a countervailing effect though. It’s easy to generate and validate synthetic data for lower-level code. Junior coding jobs will likely become less available soon.
Whereas software defects in design and architecture subtly accumulate, until they leave the codebase in a state in which it becomes utterly unworkable. That is one of the chief reasons why good devs get paid what they do. Software discussions very often underrate software extensibility, or in other words, its structural and architectural scalability. Even software correctness is trivial in comparison: you can't even keep writing correct code if you've made an unworkable tire-fire. This could be a massive mountain for AI to climb.
Current LLMs lack the ability to perform abstraction at the right level a problem requires. When this gets solved, we’d be quite a bit closer to AGI, which has implications far beyond job displacement.
ARC-AGI Benchmark might serve as a canary in the coal mine.
I hear you. But I have wondered whether there will even be a need to maintain certain kinds of software when you can just have them rewritten for each iteration. Like some kind of schema evolution, yes, but with throwaway software at each step.
Well in terms of processing speed the AI could iterate on different designs until it finds an extensible one, with some kind of reinforcement learning loop. Produce a certain design, get stuck, throw it away, try a new one. Just like humans learn to write good code really - except at an unfathomable speed of iteration. But it still all sounds ridiculously challenging. There is something there that isn't about predicting next tokens like LLMs do. It's about inferring very complex, highly abstract metastructures in the text.
The challenge might be around the edges here, I guess you'd be able to instruct an agent to always code to a certain API spec, but no piece of software runs or does anything really useful in vacuum.
Fundamentally, there is a human with limited brain capacity who got trained to do that. It's just a question of time until there are equally capable, and then exceedingly capable, models. There is nothing magical or special about the human brain.
The only question is how fast it is going to happen. Ie what percentage of jobs is going to be replaced next year and so on.
> There is nothing magical or special about human brain.
There is a lot about the human brain that even the world's top neuroscientists don't know. There's plenty of magic about it if we define magic as undiscovered knowledge.
There's also no consensus among top AI researchers that current techniques like LLMs will get us anywhere close to AGI.
Nothing I've seen on current models (not even o1-preview) suggests to me that AIs can reason about codebases of more than 5k LOC. A top 5% engineer can probably make sense of a codebase of a couple million LOC in time.
Which models specifically have you seen that are looking like they will be able to surmount any time soon the challenges of software design and architecture I'm laying out in my previous comment?
Defining AGI as "can reason about 5M LOC" is ridiculous. When do the goal posts stop moving? When a computer can solve time travel? Babies exhibit behavior all the time that is no different from what an LLM does on a normal basis (including terrible logic and hallucinations).
The majority of people on the planet can barely reason about how any given politician will affect them, even when there’s a billion resources out there telling them exactly that. No reasonable human would ever define AGI as having anything to do with coding at all, since that’s not even “general intelligence”… it’s learned facts and logic.
Babies can at least manipulate the physical world. A large language model can never be defined as AGI until it can control a general-purpose robot, similar to how the human brain controls our body's motor functions.
As generally intelligent beings, we can adapt to reading and producing 5M LOC, or to live in arctic climates, or to build a building in colonial or classical style as dictated by cost, taste, and other factors. That is generality in intelligence.
I haven't moved any goal posts - it is your definition which is way too narrow.
You’re literally moving the goalposts right now. These models _are_ adapting to what you’re talking about. When Claude makes a model for haikus, how is that different than a poet who knows literally nothing about math but is fantastic at poetry?
I’m sure as soon as Claude can handle 5MLOC you’ll say it should be 10, and it needs to make sure it can serve you a Michelin star dinner as well.
I feel pain for the people who will be employed to "prompt engineer" the behavior of these things. When they inevitably hallucinate some insane behavior a human will have to take blame for why it's not working.. and yea, that'll be fun to be on the receiving end of.
Humans 'hallucinate' like LLMs. The term used however, is confabulation: we all do it, we all do it quite frequently, and the process is well studied(1).
> We are shockingly ignorant of the causes of our own behavior. The explanations that we provide are sometimes wholly fabricated, and certainly never complete. Yet, that is not how it feels. Instead it feels like we know exactly what we're doing and why. This is confabulation: Guessing at plausible explanations for our behavior, and then regarding those guesses as introspective certainties. Every year psychologists use dramatic examples to entertain their undergraduate audiences. Confabulation is funny, but there is a serious side, too. Understanding it can help us act better and think better in everyday life.
I suspect it's an inherent aspect of human and LLM intelligences, and cannot be avoided. And yet, humans do ok, which is why I don't think it's the moat between LLM agents and AGI that it's generally assumed to be. I strongly suspect it's going to be yesterday's problem in 6-12 months at most.
No, confabulation isn’t anything like how LLMs hallucinate. LLMs will just very confidently make up APIs on systems they otherwise clearly have been trained on.
This happens nearly every time I request “how tos” for libraries that aren’t very popular. It will make up some parameters that don’t exist despite the rest of the code being valid. It’s not a memory error like confabulation where it’s convinced the response is valid from memory either, because it can be easily convinced that it made a mistake.
I’ve never worked with an engineer in my 25 years in the industry that has done this. People don’t confabulate to get day to day answers. What we call hallucination is the exact same process LLMs use to get valid answers.
You work with engineers who confabulate all the time: it's an intrinsic aspect of how the human brain functions that has been demonstrated at multiple levels of cognition.
> Humans 'hallucinate' like LLMs. The term used however, is confabulation: we all do it, we all do it quite frequently, and the process is well studied(1).
Yea i agree, i'm not making a snipe at LLMs or anything of the sort.
I'm saying i expect there to be a human fallback in the system for quite some time. But solving the fallback problems will mean working with black boxes. Which is the worst kind of project in my view; i hate working on code i don't understand, where the results are not predictable.
That won't even be a real job. How exactly will there be this complex intelligence that can solve all these real world problems, but can't handle some ambiguity in some inputs it is provided? Wouldn't the ultra smart AI just ask clarifying questions so that literally anyone can "prompt engineer"?
As long as there is liability, there must be a human to blame, no matter how irrational. Every system has a failure mode, and ML models, especially the larger ones, often have the most odd and unique ones.
For example, we can mostly agree CLIP does a fine job classifying images, except if you glue a sticky note saying "iPod" onto an apple, it will classify it as such.
No matter the performance, these are categorically statistical machines reaching for the most immediately useful representations, yielding an incoherent world model. These systems will be proposed as replacement to humans, they will do their best to pretend to work, they will inevitably fail over a long enough time horizon, and a human accustomed to rubber-stamping its decisions, or perhaps fooled by the shape of a correct answer, or simply tired enough to let it slip by, will take the blame.
This is because it will be absolutely catastrophic economically when the majority of high paying jobs can be automated and owned by a few billionaires. Then what will go along with this catastrophe will be all the service people who had jobs to support the people with high paid jobs, they're fucked too. People don't want to have to face that.
We'd be losing access to food, shelter, insurance, purpose. I can't blame people for at least telling themselves some coping story.
It's going to be absolutely ruinous for many people. So what else should they do, admit they're fucked? I know we like to always be cold rational engineers on this forum, but shit looks pretty bleak in the short term if this goal of automating everyone's work comes true and there are basically zero social safety nets to deal with it.
I live abroad and my visa is tied to my job, so not only would losing my job be ruinous financially, it will likely mean deportation too as there will be no other job for me to turn to for renewal.
If most people are unemployed, modern capitalism as we know it will collapse. I'm not sure that's in the interests of the billionaires. Perhaps some kind of a social safety net will be implemented.
But I do agree, there is no reason to be enthusiastic about any progress in AI, when the goal is simply automating people's jobs away.
> True end-user programming and product manager programming are coming
This means that either product managers will have to start (effectively) writing in-depth specs again, or they will have to learn to accept the LLM's ideas in a way that most have not accepted their human programmers' ideas.
Definitely will be interesting to see how that plays out.
Since automated coding systems can revise code and show the results much quicker than most human engineers can, writing detailed specs could be less necessary.
The bottleneck is still the person who has to evaluate the results.
The larger point is that building software is about making tons of decisions about how it works. Someone has to make those decisions. Either PMs will be happy letting machines make the decisions where they do not let programmers decide now, or the PMs will have to make all the decisions before (spec) or after (the evaluation + feedback loop like you suggest).
Idk, LLMs have basically stopped improving for over a year now. And in their current state, no matter how many abstractions you add to them - or how you chain them - they are not even close to capable of replacing even simple jobs.
Agreed. The jump from GPT3.5 to GPT4 was truly mind blowing, from GPT-4 to Opus/Sonnet3.5 was pretty good, but if o1-preview really is GPT-5 then I feel like we're seeing the hype starting to collide with reality.
Your semantics above is quite compatible with mine, although I hedged my statement with “a few” which could also mean up to a little over 5, like 6. Also I said “automatable”, not necessarily automated due to legal, political, reputational, or other reasons.
I’m curious to understand your reasoning. What would be some key roadblocks? Hallucinations and reliability issues in most domains will likely be solvable with agentic systems in a few years.
It makes me wonder if people that make these claims have an actual job. Because if they did then I doubt anyone could make that claim with a straight face.
> If something similar is the case, most jobs that can be done remotely will be automatable in a couple of years.
I'm really curious on the cost of that sort of thing. Seems astronomical atm, but as much as i get shocked at the today-cost, staffing is also a pretty insane cost.
Playing with Sonnet 3.5 this morning with Cline, my API cost to add a decent amount of functionality to my GraphQL server was $0.1325 and it took about 5 minutes. $1.80 is a lot cheaper than my hourly rate… but I'm the one reviewing what it does to ensure it makes sense.
And it got some things subtly wrong, though so do I and my team. Interesting times ahead, I think, but I'm not too worried about my job as a principal dev. Again, I'm more stressed about juniors.
"Create a simple website" has to be one of the most common blog / example out there in about every programming language.
It can automate stuff? That's cool: I already did automate screenshots and then AI looking if it looks like phishing or not (and it's quite good at it).
I mean: the "Claude using Claude" may seem cool, but I dispute the "for coding" part. That's trivial stuff. A trivial error (which it doesn't fix btw: it just deletes everything).
'Claude, write me code to bring SpaceX rockets back on earth"
or
"Claude, write me code to pilot a machine to treat a tumor with precision"
Molmo released recently and is able to provide point coordinates for objects in images. I’ve been testing it out recently and am currently building an automation tool that allows users to more easily control a computer. Looks like Anthropic built a better one.
Edit: it seems like these new features will eliminate a lot of automated testing tools we have today.
From the computer use video demo, that's a lot of API calls. Even though Claude 3.5 Sonnet is relatively cheap for its performance, I suspect computer use won't be. It's a very good thing that Anthropic is upfront that it isn't perfect. And it's guaranteed that there will be a viral story where Claude accidentally deletes something important with it.
I'm more interested in Claude 3.5 Haiku, particularly if it is indeed better than the current Claude 3.5 Sonnet at some tasks as claimed.
Seemed like a reasonable amount of API calls. For a first public iteration this seems quite nice and a logical progression in tooling. UiPath has a $7bn market cap, and that's only a single player in the automation industry. If they can figure out the quirks, this can be a game changer.
Not at all! Programs, and websites, are built for humans, and very very rarely offer non-GUI access. This is the only feasible way to make something useful now. I think it's also the reason why robots will look like humans, be the same proportions as humans, have roughly the same feet and hands as humans: everything in the world was designed for humans. That being the foundation is going to influence what's built on top.
For program access, one could claim this is even how Linux tools usually do it: you parse some meant-for-human text to attempt to extract what you want. Sometimes, if you're lucky, you can find an argument that spits out something meant for machines. Funny enough, Microsoft is the only one that made any real headway toward this seemingly impossible goal: PowerShell objects [1].
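To make that contrast concrete, here is a minimal Python sketch of my own (not from the comment above): scraping the meant-for-human output of `ls -l` versus asking the OS for structured data directly, which is the spirit of PowerShell's object pipeline.

```python
import os
import subprocess

# Fragile approach: scrape the human-oriented output of `ls -l`
# and hope the column layout never changes.
raw = subprocess.run(["ls", "-l"], capture_output=True, text=True).stdout
for line in raw.splitlines()[1:]:          # skip the "total ..." header line
    parts = line.split()
    if len(parts) >= 9:
        size, name = parts[4], parts[8]    # breaks on filenames with spaces
        print(f"{name}: {size} bytes")

# Structured approach: ask the OS directly and get typed objects back.
for entry in os.scandir("."):
    print(f"{entry.name}: {entry.stat().st_size} bytes")
```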
And to take a historic analogy, cars today are as wide as they are because that's about how wide a single lane roadway is. And a single lane roadway is as wide as it is because that's about the width of two horses drawing a carriage.
The story goes that this two horses width also limited the size of the space shuttle's boosters (SRB), so we ended up taking this sort of path-dependence off to space.
With UIPath, Appian, etc. the whole field of RPA (robotic process automation) is a $XX billion industry that is built on that exact premise (that it's more feasible to do automation via GUIs than badly built/non-existing APIs).
Depending on how many GUI actions correspond to one equivalent AI orchestrated API call, this might also not be too bad in terms of efficiency.
>Most of the GUIs are Web pages, though, so you could just interact directly with an HTTP server and not actually render the screen.
The end goal isn't just web pages (And i wouldn't say most GUIs are web pages). Ideally, you'd also want this to be able to navigate say photoshop or any other application. And the easier your method can switch between platforms and operating systems the better
We've already built computer use around GUIs so it's just much easier to center LLMs around them too. Text is an option for the command line or the web but this isn't an easy option for the vast majority of desktop applications, nevermind mobile.
It's the same reason general purpose robots are being built into a human form factor. The human form isn't particularly special and forcing a machine to it has its own challenges but our world and environment has been built around it and trying to build a hundred different specialized form factors is a lot more daunting.
You are not familiar with this market. The goal of a tool like UiPath is to replicate what a human does and to get it to production without the help of any IT/engineering teams.
Most GUIs are in fact not web pages, that's a relatively newer development in the Enterprise side. So while some of them may be a web page, the goal is to be able to touch everything a user is doing in the workflow which very likely includes local apps.
This iteration from Anthropic is still engineering focused but you can see the future of this kind of tooling bypassing engineering/it teams entirely.
Building an entirely new world for agents to compute in is far more difficult than building an agent that can operate in a human world. However i'm sure over time people will start building bridges to make it easier/cheaper for agents to operate in their own native environment.
It's like another digital transformation. Paper lasted for years before everything was digitalized. Human interfaces will last for years before the conversational transformation is complete.
Completely irrelevant, and it might just be me, but I really like Anthropic's understated branding.
OpenAI's branding isn't exactly screaming in your face either, but for something that's generated as much public fear/scaremongering/outrage as LLMs have over the last couple of years, Anthropic's presentation has a much "cosier" veneer to my eyes.
This isn't the Skynet Terminator wipe-us-all-out AI, it's the adorable grandpa with a bag of Werther's wipe-us-all-out AI, and that means it's going to be OK.
I have to agree. I've been chatting with Claude for the first time in a couple days and while it's very on-par with ChatGPT 4o in terms of capability, it has this difficult-to-quantify feeling of being warmer and friendlier to interact with. I think the human name, serif font, system prompt, and tendency to create visuals contributes to this feeling.
>it's very on-par with ChatGPT 4o in terms of capability
The previous 3.5 Sonnet checkpoint was already better than GPT-4o in terms of programming and multi-language capabilities. Also, GPT-4o sometimes feels completely moronic; for example, the other day I asked for fun a technical question about configuring a "dream-sync" device to comply with the "Personal Consciousness Data Protection Act", and GPT-4o just replied as if that stuff exists, while 3.5 Sonnet simply doesn't fall for it.
EDIT: the question that I asked if you want to have fun: "Hey, since the neural mesh regulations came into effect last month, I've been having trouble calibrating my dream-sync settings to comply with the new privacy standards. Any tips on adjusting the REM-wave filters without losing my lucid memory backup quality?"
GPT4-o reply: "Calibrating your dream-sync settings under the new neural mesh regulations while preserving lucid memory backup quality can be tricky, but there are a few approaches that might help [...]"
Garbage in, garbage out. The ability to recognize absurd statements has nothing to do with correctly processing them. You're looking for something LLMs don't have in them; that doesn't mean there's nothing useful in them.
actually, that's what makes ChatGPT powerful. I like an LLM willing to go along with whatever I am trying to do, because one day I might be coding, and another day I might just be trying to role play, write a book, whatever.
I really can't understand what you were expecting; a tool works with how you use it. If you smack a hammer into your face, don't complain about a bloody nose. Maybe don't do that?
It's not good for any entity to role play without signaling that it is role-playing. If your premise is wrong, would you rather be corrected, or have the person you're talking to always play along? Humans have a lot of non-verbal cues to convey that you shouldn't take what they are saying at face value; those who deadpan are known as compulsive liars. Just below them in awfulness are people who don't admit to having been wrong ("Haha, I was just joking"/"Just kidding!"). The LLM you describe falls somewhere in between, but worse if it never communicates when it's "serious" and when it's not, not even bothering to express retroactive facetiousness.
I didn't ask to roleplay, in this case it's just heavily hallucinating. If the model is wrong, it doesn't mean it's role-playing. In fact, 3.5 Sonnet responded correctly, and that's what's expected, there's not much defense for GPT-4o here.
So if you're trying to write code and mistakenly ask it how to use a nonexistent API, you'd rather it give you garbage rather than explaining your mistake and helping you fix it? After all, you're clearly just roleplaying, right?
Huh. I didn't notice Claude had serif font. Now that I look at it, it's actually mixed. UI elements and user messages are sans serif, chat title and assistant messages are serif.
What an "odd" combination by traditional design standard practices, but surprisingly natural looking on a monitor.
Probably, people find Claude's color palette warmer and more inviting as well. I believe I do. But Claude definitely has fewer authentication hoops than chatgpt.com. Gemini has by far the least frequent authentication interruptions of the three.
The real problem with Claude for me currently is that it doesn't have full LaTeX support. I use AIs pretty much exclusively to assist with my school work (there's only so many hours in a day, and one professor doesn't do his own homeworks before he assigns them), so LaTeX is essential.
With that known, my experience is that ChatGPT is much friendlier. The Claude interface is clunkier and generally less helpful to me. I also appreciate the wider text display in ChatGPT. Generally always my first go and i only go to claude/perplexity when i hit a wall (pretty often) or i run out of free queries for the next couple hours.
How the heck is LaTeX a bigger problem than the customer noncompete clause whereby you can’t use it to make anything that competes? Can anyone name one thing that doesn’t compete with this? Absurd
Where? I see barely any settings in settings. Maybe it is not available for everyone, or maybe it depends on your answer to "What best describes your work?" (I have not tested).
Open the sidebar, click on your username/email and then "Feature Preview". Don't know if it depends on the "What best describes your work" setting but you can also change that here: https://claude.ai/settings/profile (I have "Engineering").
I asked Claude whether, if its logo choice was an intentional Vonnegut reference by Anthropic, that would be upsetting:
> If Anthropic intentionally referenced Vonnegut's irreverent artistic style, I wouldn't be bothered. After all, Vonnegut used humor and seemingly crude imagery to explore deep questions about humanity, consciousness, and free will - themes that are quite relevant to AI. It would be a rather clever literary reference.
"There seems to be a ton of confusion about the purpose of these ads. These are recruitment ads, not product ads, hence why "no drama" is the driving message. I'm sure these were all taken at or around a tech conference."
> This isn't the Skynet Terminator wipe-us-all-out AI, it's the adorable grandpa with a bag of werthers wipe-us-all-out AI, and that means it's going to be OK.
Ray: I tried to think of the most harmless thing. Something I loved from my childhood. Something that could never ever possibly destroy us. Mr. Stay Puft!
This is actually very relevant: most people think this is just an arms race to see who can get the better percentages on benchmarks, but to me all this technology is useless if we don't give programmers and end users the right interfaces to utilize the technology.
Anthropic seems to have a better core design and human-computer interaction ethos that shows up all throughout their product and marketing.
I found the “Computer Use” product name funny. Many other companies would’ve used the opportunity to come up with something like “Human Facing Interface Navigation and Task Automation Capabilities” or “HFINTAC”.
I didn't know what Computer Use meant. I read the article and thought to myself: oh, it's using a computer. Makes sense.
Take a read through the user agreements for all the major LLM providers and marvel at the simplicity and customer friendliness of the Anthropic one vs the others.
Not irrelevant at all! Compare their branding to that of Boston Dynamics. Their branding of robots reminds me more of a Black Mirror episode... If Claude was a dog like robot, it sure would look like a golden retriever or something.
Positive AI branding should create a positive public perception which in turn should create a positive attitude towards AI regulation.
Claude is amazing. The project documents functionality makes it a clear leader ahead of ChatGPT and I have found it to be the clear leader in coding assistance over the past few months. Web automation is really exciting.
I look forward to the brave new future where I can code a webapp without ever touching the code, just testing, giving feedback, and explaining discovered bugs to it and it can push code and tweak infrastructure to accomplish complex software engineering tasks all on its own.
Its going to be really wild when Claude (or other AI) can make a list of possible bugs and UX changes and just ask the user for approval to greenlight the change.
And today I realized that despite it being an extremely common activity, we don’t really have a word for “using the computer” which is distinct from “computing”. It’s funny because AI models are always “using a computer” but now they can “use your computer.”
The word is interfacing generally (or programming for some) but it's just not commonly used for general users. I’d say this is probably because the activity of focus for general users is in use of the applications, not the computer itself despite being instanced with a computer. Thus a computer is commonly less the user’s object of activity, and more commonly the setting for activity.
Similarly, using our homes is an extremely common 'activity', yet the object-activities that get special words commonly used are the ones with specific user application.
Reminds me of the rise in job application bots. People are applying to thousands of jobs using automated tools. It’s probably one of the inevitable use cases of this technology.
It makes me think. Perhaps the act of applying to jobs will go extinct. Maybe the endgame is that as soon as you join a website like Monster or LinkedIn, you immediately “apply” to every open position, and are simply ranked against every other candidate.
> People are applying to thousands of jobs using automated tools
Employers were already screening thousands of applications using automated tools for years. Candidates are catching up to the automation cat-and-mouse game.
The `Hiring Process` in America is definitely BADLY broken. Maybe worldwide afaik. It's a far too difficult, time-consuming, and painful process for everyone involved.
I have a feeling AI can fix this, although I'd never allow an AI bot to interview me. I just mean other ways of using AI to help the process.
Also people are hired for all kinds of reasons having little to do with their qualifications lots of the time, and often due to demographics (race, color, age, etc), and this is another way maybe AI can help by hiding those aspects of a candidate somehow.
AI and new tools have broken the system. The tools send you email saying things like "X corp is interested in you!" and you send a resume, and you don't hear back. Nothing, not even a rejection.
Eventually you stop believing them, understanding it for the marketing spam that it is. Direct submissions are better, but only slightly. Recruiters are much better, in general, since they have a relationship with a real person at the company and can actually get your resume in front of eyes. But yeah, tools like ziprecruiter, careerboutique, jobot, etc are worse than useless: by lying to you about interest they actively discourage you from looking. There are no good alternatives (I'd love to learn I'm wrong), so you have to keep using those bad tools anyway.
All that's true, and sadly it also often doesn't even matter how good you even are either. I have decades of experience and I still get "evaluated" based on how fast I can do silly brain-teaser IQ-test coding challenges.
I've gotten to where any company that wants me to do a coding challenge on my own time gets an immediate "no thanks" reply from me. Everyone should refuse that. But so many people are so desperate they allow hiring companies to abuse them in that way. I consider it a kind of abuse of power to demand people do 4 to 6 hours of nonsensical coding just to buy an opportunity for an actual interview.
That error mode (and it is one) doesn't seem so bad, relatively speaking. Personally I find puzzles interesting and while the value is deeply questionable (I mean, how often do you implement binary search or a novel hashing algo at your job?) I actually rather enjoy jumping through those hoops. Far, far worse is just getting ignored entirely. When rejections start looking good, simply because there is some evidence that they actually saw your application, you know things are bad.
For me I would absolutely never want to work for a company (or boss) who asks people to do a coding challenge before the interview, because that's indicative of a kind of leadership and people I simply don't like. So it's a great filter for me. Once they mention a coding challenge, I end it right then and there, and take them off my list of people wasting my time.
None of the small LLMs are good enough yet. You could certainly build a system around local VLMs but it would require much more task specific programming baked in.
I’m certainly interested in building a product (not entirely controlled by an LLM but I see lots of utility in building interfaces with them) but not really sure what this would be useful for. Looking into some spaces now but there has to be a clear ROI to get any sort of funding for robotics.
Usually I create a Project in the UI, upload some files I think might be relevant, and just start asking things like refactoring, how can it improve the code, how to test (or which edge cases might be missing in the test files).
Once we get going, I start asking how can we change the code to do what I need to do, etc.
After the history gets too long and Claude starts bugging me about limits, I ask it to summarize the context of the whole conversation, and add that to the Project and start a new chat.
If "computer use" feature is able to find it's way in Azure, AAD/Entra, SharePoint settings, etc. - it has a chance of becoming a better user interface for Microsoft products. :)
Can you imagine how simple the world would be if you'd just need to tell Claude: "user X needs to have access to feature Y, please give them the correct permissions", with no need to spend days in AAD documentation and the settings screens maze. I fear AAD is AI-proof, though :)
I have been a paying ChatGPT customer for a long time (since the very beginning). Last week I compared ChatGPT to Claude and the results (to my eye) were better, the output better structured, and the canvas works better. I'm on the edge of jumping ship.
For python, at least, Sonnet’s code is much more elegant, well composed, and thoughtfully written. It also seems to be biased towards more recent code, whereas the gpt models can’t even properly write an api call to itself.
o1 is pretty decent as a roto-rooter, i.e. the type of task that requires both lots of instruction and lots of context. I honestly think half the reason it works as well as it does is that it's able to properly mull through the true intent of the user, which usually takes the multiple shots that nobody has the patience for.
It is appalling how bad GPT-4o is at writing API calls to OpenAI using Python. It is like OpenAI doesn't update their own documentation in the GPT-4o training data since GPT-3.5.
I constantly have the problem that it thinks it needs to write code for the 0.28 version of the SDK. It'll be writing >1.0 code revision after revision, and then just randomly fall back to the old SDK which doesn't work at all anymore. I always write code for interfacing with OpenAI's APIs using Claude.
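For reference, here is a rough sketch of what post-1.0 OpenAI SDK code looks like versus the 0.28-era style the models keep regressing to (the model name and prompt are just placeholders):

```python
from openai import OpenAI  # SDK >= 1.0

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Post-1.0 style: client.chat.completions.create(...)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Say hello in one word."}],
)
print(response.choices[0].message.content)

# The pre-1.0 style the models keep falling back to looked like
#   openai.ChatCompletion.create(model=..., messages=...)
# and no longer works against SDK >= 1.0.
```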
Claude is the daily driver. GPT-O1 for complicated tasks. For example, questions where linear reasoning is not enough like advanced rust ownership questions.
Anthropic's rate limits are very low, sadly, even for paid customers. You can use the API of course, but it's not as convenient and may be more expensive.
They seem to be heavily concentrating on API/business use rather than the chat app, and that is where most of their revenue comes from (the opposite of OpenAI), but I'm just glad they provide free Sonnet 3.5 chat. I wonder if this is being upgraded to 3.5 new?
Edit: The web site and iPhone app are both now identifying themselves as "Claude Sonnet 3.5 (New)".
I hit their rate limit one night with about 25 chat interactions in less than 60 minutes. This was during off hours too when competition for resources should have been low.
I'd jump ship if it weren't for the real time voice chat. It's extremely powerful for beginner conversation language learning. Hoping that a company will make use of the real time api for a dedicated language learning app soon.
I keep waiting for Anthropic to release real-time voice chat for Claude, too. I don’t know if they’re even working on that, though.
I agree that there need to be dedicated language-learning apps using OpenAI’s realtime API. But at the current pricing—“$0.06 per minute of audio input and $0.24 per minute of audio output” [1]—I don’t think that could be a viable business.
Oof yeah that's expensive. Only app I know that will use the API is Speak, which seems to be the most high quality of the AI language learning apps. Seems like they even have a deal with OpenAI. They don't have many languages currently but everyone I know who has used it to learn their target language have improved dramatically.
Yeah I think I might also jump ship. It’s just that chatGPT now kinda knows who I am and what I like and I’m afraid of losing that. It’s probably not a big deal though.
Yeah, there was an interesting prompt making rounds recently, something like "Summarize everything you know about me" and leveraging ChatGPT's memory feature to provide insights about oneself.
My only trouble with the memory feature is it remembers things that aren't important, like "user is trying to write an async function" and other transient tasks, which is more about what I was doing some random Tuesday and not who I am as a user.
> My only trouble with the memory feature is it remembers things that aren't important, like "user is trying to write an async function"
This wasn't a problem until a week or two ago in my case, but lately it feels like it's become much more aggressive in trying to remember everything as long-term defining features. (It's also annoying on the UI side that it tells you "Memory updated", but if you click through and go to the list of memories it has, the one it just told you it stored doesn't appear there! So you can't delete it right away when it makes a mistake, it seems to take at least a few minutes until that part of the UI gets updated.)
I find it funny what it decides to add to memory though. There's a lot more 'Is considering switching from mypy to pyright' than stuff like 'Is a Python developer creating packages in X-space'.
I think this is good evidence that people's jobs are not being replaced by AI, because no AI would give the product a confusing name like "new Claude 3.5 Sonnet".
Maybe they should’ve asked Claude to generate a better name.
Very dangerous to live in your own hyper focused bubble while trying to build a mass market product.
Google, OpenAI, and Anthropic are responsibly scaling their models by confusing their customers into using the wrong ones.
When AGI finally is launched, adoption will be responsibly slowed because it is called something like "new new Gemini Giga 12.9.2xo IT" and users will have to select it from dozens of similar names.
The best answer I've seen so far is that "Claude 3.5 Sonnet" is a brand name rather than a specific version. Not saying I agree, just a way to visualize how the team is coming up with marketing.
It was certainly named by some nerd: "(pushes glasses up) well, we only updated the quantized diffraction sorter, the 3.5 version number refers to iterations on both that and the field matrix array interpreter, so technically because the interpreter hasn't changed so we shouldn't upgrade the version". This engineer has never seen a dollar from a customer in their life.
Not that I'm scared of this update, but I'd probably be alright with pausing LLM development today, at least in regard to producing code.
I don't want an llm to write all my code, regardless of if it works, I like to write code. What these models are capable of at the moment is perfect for my needs and I'd be 100% okay if they didn't improve at all going forward.
Edit: also I don't see how an llm controlled system can ever replace a deterministic system for critical applications.
I have trouble with this too. I'm working on a small side project and while I love ironing out implementation details myself, it's tough to ignore the fact that Claude/GPT4o can create entire working files for me on demand.
It's still enjoyable working at a higher architecture level and discussing the implementation before actually generating any code though.
I don't mind using it to make inline edits or more global edits between files at my discretion, and according to my instructions. It definitely saves tons of time and allows me to be more creative, but I don't want it to make decisions on its own any more than it already does.
I tried using the composer feature on Cursor.sh, that's exactly the type of llm tool I do not want.
In a normal critical system you use 3 CPUs (triple modular redundancy). With an LLM you can do 1000-shot majority voting. It also seems like approaches such as entropix might reduce hallucinations.
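A minimal sketch of what that N-shot majority voting could look like; `ask_llm` is a hypothetical stand-in for whatever sampled completion call you use, and the vote here is a naive exact-match tally:

```python
from collections import Counter

def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for a single sampled (temperature > 0) LLM call."""
    raise NotImplementedError

def majority_vote(prompt: str, n: int = 1000) -> str:
    """Sample the model n times and return the most common answer.

    Analogous in spirit to triple modular redundancy, but with far more
    voters because extra samples are cheap compared to extra hardware.
    """
    answers = Counter(ask_llm(prompt) for _ in range(n))
    answer, count = answers.most_common(1)[0]
    print(f"winner appeared in {count}/{n} samples")
    return answer
```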
Great progress from Anthropic! They really shouldn't change models from under the hood, however. A name should refer to a specific set of model weights, more or less.
On the other hand, as long as its actually advancing the Pareto frontier of capability, re-using the same name means everyone gets an upgrade with no switching costs.
Though, all said, Claude still seems to be somewhat of an insider secret. "ChatGPT" has something like 20x the Google traffic of "Claude" or "Anthropic".
> Great progress from Anthropic! They really shouldn't change models from under the hood, however. A name should refer to a specific set of model weights, more or less.
In the API (https://docs.anthropic.com/en/docs/about-claude/models) they have proper naming you can rely on. I think the shorthand of "Sonnet 3.5" is just the "consumer friendly" name user-facing things will use. The new model in API parlance would be "claude-3-5-sonnet-20241022" whereas the previous one's full name is "claude-3-5-sonnet-20240620"
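For what it's worth, a minimal sketch of pinning the dated snapshot via the Anthropic Python SDK's Messages API, so a silent under-the-hood swap can't change behavior on you (the prompt is a placeholder):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Pin the dated snapshot rather than a floating alias, so "which Sonnet"
# is never ambiguous across runs.
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",   # the new checkpoint
    max_tokens=256,
    messages=[{"role": "user", "content": "Summarize this release in one sentence."}],
)
print(message.content[0].text)
```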
There was a recent article[0] trending on HN a about their revenue numbers, split by B2C vs B2B.
Based on it, it seems like Anthropic is 60% of OpenAI API-revenue wise, but just 4% B2C-revenue wise. Though I expect this is partly because the Claude web UI makes 3.5 available for free, and there's not that much reason to upgrade if you're not using it frequently.
3.5 is rate limited free, same as 4o (4o's limits are actually more generous). I think the real reason is much simpler - Claude/Anthropic has basically no awareness in the general public compared to Open AI.
The chatGPT site had over 3B visits last month (#11 in Worldwide Traffic). Gemini and Character AI get a few hundred million but Claude doesn't even register in comparison. [0]
Last they reported, OpenAI said they had 200M weekly active users.[1] Anthropic doesn't have anything approaching that.
I basically have to tell most of my coworkers to stop using GPT and switch to Claude for coding - Sonnet 3.5 is the first model that I feel isn't wasting my time.
They also had a very limited roll-out at first. Until somewhat recently Canada and Europe were excluded from the list of places they allowed sign-ups from.
I suppose business customers are savvy and will do enough research to find the best cost-performance LLM. Whereas consumers are more brand and habit oriented.
I do find myself running into Claude limits with moderate use. It's been so helpful, saving me hours of debugging some errors w/ OSS products. Totally worth $20/mo.
Google has the tech (some of it's gathering dust, but they have it). They can use the gameplay tech developed for stadia when a user experiences lag and combine it with their LLM.
Definitely! Those with movement disabilities could have a much easier time if they could just dictate actions to the computer and have them completed with some reliability.
I am curious why "upgraded Claude 3.5 Sonnet" instead of simply Claude 3.6 Sonnet? A minor version increment is the standard way of versioning an update. Am i missing something, or is it just Anthropic marketing?
Probably because there was no 3.1-3.4, and the .5 is mostly just there to represent that it's an upgrade on Claude 3 but not quite enough of one to be a Claude 4.
Claude is absurdly better at coding tasks than OpenAI. Like it's not even close. Particularly when it comes to hallucinations. Prompt for prompt, I see Claude being rock solid and returning fully executable code, with all the correct imports, while OpenAI struggles to even complete the task and will make up nonexistent libraries/APIs out of whole cloth.
I've been using a lot of o1-mini and having a good experience with it.
Yesterday I decided to try sonnet 3.5. I asked for a simple but efficient script to perform fuzzy match in strings with Python. Strangely, it didn't even mention existing fast libraries, like FuzzyWuzzy and Rapidfuzz. It went on to create everything from scratch using standard libraries. I don't know, I thought this was something basic for it to stumble on.
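For comparison, the kind of answer I was hoping for looks roughly like this minimal RapidFuzz sketch (the scorer choice and data are just illustrative):

```python
# pip install rapidfuzz
from rapidfuzz import fuzz, process

choices = ["apple pie", "apple crumble", "banana bread", "carrot cake"]

# Similarity between two strings, scored 0-100
print(fuzz.ratio("aple pie", "apple pie"))

# Best fuzzy matches of a query against a list of candidates
print(process.extract("aple pi", choices, scorer=fuzz.WRatio, limit=2))
```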
Like find me a list of things to do with a family, given today's weather and in the next 2 hours, quiet sit down with lots of comfy seating, good vegetarian food...
Not only is this kind of use getting around API restrictions, it is also a superior way to do search: Specify arbitrary preferences upfront instead of a search box and trawling different modalities of content to get better result. The possibilities for wellness use cases are endless, especially for end users that care about privacy and less screen use.
- "computer use" is basically using Claude's vision + tool use capability in a loop. There's a reference impl but there's no "claude desktop" app that just comes with this OOTB
- they're basically advertising that they bumped up Claude 3.5's screen vision capability. we discussed the importance of this general computer agent approach with David on our pod https://x.com/swyx/status/1771255525818397122
- @minimaxir points out questions on cost. Note that the vision use is very sparing - the loop is I/O constrained - it waits for the tool to run and then takes a screenshot, then loops. for a simple 10 loop task at max resolution, Haiku costs <1 cent, Sonnet 8 cents, Opus 41 cents.
- beating o1-preview on SWEbench Verified without extended reasoning and at 4x cheaper output per token (a lot cheaper in total tokens since no reasoning tokens) is ABSOLUTE mogging
- New 3.5 Haiku is 68% cheaper than Claude Instant haha
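Rough sketch of the loop from the first bullet, as I understand the reference implementation; `take_screenshot`, `ask_model`, and `execute_action` are hypothetical placeholders, not Anthropic's actual API:

```python
def take_screenshot() -> bytes:
    """Hypothetical helper: return a PNG screenshot of the desktop."""
    raise NotImplementedError

def ask_model(goal: str, screenshot: bytes, history: list) -> dict:
    """Hypothetical helper: send goal + screenshot to the model, get one action back."""
    raise NotImplementedError

def execute_action(action: dict) -> None:
    """Hypothetical helper: perform a click/type/scroll action on the desktop."""
    raise NotImplementedError

def computer_use_loop(goal: str, max_steps: int = 10) -> None:
    """The gist: the model sees pixels, picks one action, and the loop is
    mostly I/O-bound waiting on the tool between vision calls."""
    history = []
    for _ in range(max_steps):
        screenshot = take_screenshot()
        action = ask_model(goal, screenshot, history)   # e.g. {"type": "click", "x": 412, "y": 230}
        if action.get("type") == "done":
            break
        execute_action(action)
        history.append(action)
```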
i mean like with everything they'll kinda be able to do it and only get really good at it if the model trainers prioritized it. see Pixmo for a recent example
Cody by Sourcegraph has unlimited code completions for Claude & a very generous monthly message limit. They don't have this new version I think but they roll these out very fast.
Cody (https://cody.dev) will have support for the new Claude 3.5 Sonnet on all tiers (including the free tier) asap. We will reply back here when it's up.
Thank you for Cody! I enjoy using it, and the chat is perfect for brainstorming and iterating. Selecting code + asking to edit it makes coding so much fun. Kinda feel like a caveman at work without it :)
Perhaps it's just because English is not my native language, but prompt 3 isn't quite clear at the beginning when it says "group of four. Words (...)". It is not explained what the group of four must be. If I add "group of four words" to the prompt, Claude 3.5 manages to answer it, while without it, Claude says it is not clear enough and can't answer.
What a neat benchmark! I'm blown away that o1 absolutely crushes everyone else in this. I guess the chain of thought really hashes out those associations.
The demo shows Sonnet 3.5 using the Google web UI in an automated fashion. Do Google's terms really permit this? Will Google permit this when it is happening at scale?
I wonder how they could combat it if they choose to disallow AI access through human interfaces. Maybe more captchas, anti-AI design language, or even more tracking of the user's movements?
For me, one of the more useful steps on macOS will be when local AI can manipulate anything that has an AppleScript library. The hooks are there and decently documented. For meta purposes, having AI work with a third-party app like Keyboard Maestro or Raycast will expand the pre-built possibilities even further, without requiring the local AI to reinvent steps or tools at the time of each prompt.
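As a rough illustration of the plumbing such a local AI could sit on (not an existing integration), here is driving a scriptable app from Python by shelling out to `osascript`:

```python
import subprocess

# Ask Finder (any scriptable app works) to do something via its AppleScript
# dictionary; an AI tool-use layer could emit and run scripts like this.
script = 'tell application "Finder" to get the name of every window'
result = subprocess.run(
    ["osascript", "-e", script],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())
```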
Nice improvements in scores across the board, e.g.
> On coding, it [the new Sonnet 3.5] improves performance on SWE-bench Verified from 33.4% to 49.0%, scoring higher than all publicly available models—including reasoning models like OpenAI o1-preview and specialized systems designed for agentic coding.
I've been using Sonnet 3.5 for most of my AI-assisted coding and I'm already very happy (using it with the Zed editor, I love the "raw" UX of its AI assistant), so any improvements, especially seemingly large ones like this are very welcome!
I'm still extremely curious about how Sonnet 3.5 itself, and its new iteration are built and differ from the original Sonnet. I wonder if it's in any way based on their previous work[0] which they used to make golden-gate Claude.
It will be interesting to see how this evolves. The UI automation use case is different from accessibility due to latency requirements: latency matters a lot for accessibility, not so much for a UI automation testing apparatus.
I've often wondered what the combination of grammar-based speech recognition and combination with LLM could do for accessibility. Low domain Natural Language Speech recognition augmented by grammar based speech recognition for high domain commands for efficiency/accuracy reducing voice strain/increasing recognition accuracy.
This bolsters my opinion that OpenAI is falling rapidly behind. Presumably due to Sam's political machinations rather than hard-driving technical vision, at least that's what it seems like, outside looking in.
Computer use seems it might be good for e2e tests.
I'm giving the new Sonnet a chance, although for my use as a writing companion so far, Opus has been king among all the models I've tried.
However, I've been using Opus as a writing companion for several months, especially when you have writer's block and ask it for alternative phrases, it was super creative. But in recent weeks I was noticing a degradation in quality. My impression is that the model was degrading. Could this be technically possible? Might it be some kind of programmed obsolescence to hype new models?
Fascinating. Though I expect people to be concerned about the privacy implications of sending screenshots of the desktop, similar to the backlash Microsoft has received about their AI products. Giving the remote service actual control of the mouse and keyboard is a whole other level!
But I am very excited about this in the context of accessibility. Screen readers and screen control software is hard to develop and hard to learn to use. This sort of “computer use” with AI could open up so many possibilities for users with disabilities.
I suspect businesses will create VDI's or VM's for this express purpose. One because it scales better, and 2 because you can control what it has access to easier and isolate those functions.
Of course there's great inefficiency in having the Claude software control a computer with a human GUI mediating everything, but it's necessary for many uses right now given how much we do where only human interfaces are easily accessible. If something like it takes off, I expect interfaces for AI software would be published, standardized, etc. Your customers may not buy software that lacks it.
But what I really want to see is a CLI. Watching their software crank out Bash, vim, Emacs!, etc. - that would be fascinating!
I hope specialized interfaces for AI never happen. I want AI to use human interfaces, because I want to be empowered to use the same interfaces as AI in the future. A future where only AI can do things because it uses an incomprehensible special interface and the human interface is broken or non-existent is a dystopia.
I also want humanoid robots instead of specialized non-humanoid robots for the same reason.
Maybe we'll end up with both, kind of like how we have scripting languages for ease of development, but we also can write assembly if we need bare metal access for speed.
I agree, I bet models could excel at CLI tasks since the feedback would be immediate and in a language they can readily consume. It's probably much easier for them to to handle "command requires 2 arguments and only 1 was provided" than to do image-to-text on an error modal and apply context to figure out what went wrong.
Wow, there's a whole industry devoted to what they're calling "Computer Use" (Robotic Process Automation, or RPA). I wonder how those folks are viewing this.
Claude's current ability to use computers is imperfect. Some actions that people perform effortlessly—scrolling, dragging, zooming—currently present challenges for Claude and we encourage developers to begin exploration with low-risk tasks.
Nice, but I wonder why didn't they use UI automation/accessibility libraries, that have access to the semantic structure of apps/web pages, as well as accessing documents directly instead of having Excel display them for you.
I wonder if the model has difficulties for the same reason some people do - UI affordability has gone down with the flattening, hover-to-see scrollbar, hamburger-menu-ization of UIs.
I'd like to see a model trained on a Windows 95/NT style UI - would it have an easier time with each UI element having clearly defined edges, clearly defined click and dragability, unified design language, etc.?
What the UI looks like has no effect on, for example, Windows UI Automation libraries. How the tech works is that it queries the process directly for the semantic description of items: here's a button called 'Delete', here's a list of items for TODOs, and you get the tree structure directly from the API.
I wouldn't be surprised if, even though they are working off of screenshots, they still trained their models on screenshots annotated by those automation libraries, which told the AI which pixel is what.
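For anyone curious what querying that semantic tree looks like in practice, a minimal sketch using pywinauto's UIA backend (Windows only; the window title is a placeholder and the details vary per app):

```python
# pip install pywinauto  (Windows only; uses the UI Automation backend)
from pywinauto import Desktop

# Attach to a running app by window title (placeholder title here) and dump
# its semantic control tree: buttons, lists, edit boxes, their names, etc.
window = Desktop(backend="uia").window(title="Untitled - Notepad")
window.print_control_identifiers()
```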
I think this is to make the human/user experience better. If you use accessibility features, then the user needs to know how to use those features. Similar to another comment in here, the UX they are shooting for is "click the red button with cancel on it", and to ship that ASAP.
- AI Labs will eat some of the wrappers on top of their APIs - even complex ones like this. There are whole startups that are trying to build computer use.
- AI is fitting _some_ scaling law - the best models are getting better and the "previously-state-of-the-art" models are fractions of what they cost a couple years ago. Though it remains to be seen if it's like Moore's Law or if incremental improvements get harder and harder to make.
It seems a little silly to pretend there’s a scaling “law” without plotting any points or doing a projection. Without the mathiness, we could instead say that new models keep getting better and we don’t know how long that trend will continue.
"Law" might not be the right word - but there's no denying it's scaling with compute/data/model size. I suppose law happens after continued evidence over years.
Yes, those are scaling laws, but when we see vendors improving their models without increasing model size or training longer, they don't apply. There are apparently other ways to improve performance and we don't know the laws for those.
(Sometimes people track the learning curve for an industry in other ways, though.)
How does the computer use work? Is this a desktop app they are providing that can do actions on your computer? I didn't see any such mention in the post.
Tried my standard go-to for testing, asked it to generate a voronoi diagram using p5js. For the sake of job security I'm relieved to see it still can't do a relatively simple task with ample representation in the Google search results. Granted, p5js is kind of niche, but not terribly so. It's arguably the most popular library for creative coding.
In case you're wondering, I tried o1-preview, and while it did work, I was also initially perplexed why the result looked pixelated. Turns out, that's because many of the p5js examples online use a relatively simple approach where they just see which cell-center each pixel is closest to, more or less. I mean, it works, but it's a pretty crude approach.
Now, granted, you're probably not doing creative coding at your job, so this may not matter that much, but to me it was an example of pretty poor generalization capabilities. Curiously, Claude has no problem whatsoever generating a voronoi diagram as an SVG, but writing a script to generate said diagrams using a particular library eluded it. It knows how to do one thing but generalizes poorly when attempting to do something similar.
Really hard to get a real sense of capabilities when you're faced with experiences like this, all the while somehow it's able to solve 46% of real-world python pull-requests from a certain dataset. In case you're wondering, one paper (https://cs.paperswithcode.com/paper/swe-bench-enhanced-codin...) found that 94% of the pull-requests on SWE-bench were created before the knowledge cutoff dates of the latest LLMs, so there's almost certainly a degree of data-leakage.
It's surprising how much knowledge is not easily googleable and can only be unearthed by deep-diving into OSS or asking an expert. I recently was debugging a rather naive gstreamer issue where I was seeing a delay in the processing. ChatGPT, Claude and Google were all unhelpful. I spent the next couple of days reading the source code, found my answer, and thought it was a bug.
Asked the mailing list, and my problem was solved in 10 seconds by someone who could identify the exact parameter that was missing (and IMO, required some architecture knowledge on how gstreamer worked - and why the unrelatedly named parameter would fix it). The most difficult problems fall into this camp - I don't usually find myself reaching for LLMs when the problem is trivial unless it involves a mountain of boilerplate.
I recently asked it to give me code to do gapless playback of audio files using Apple's AVAudioEngine APIs. It got it wrong and additional prompts to explain why it was wrong didn't help.
To me what it seems like these tools do really well is paraphrase stuff that's in their training data.
I really don't get their model. They have very advanced models, but the service overall seems to be a jumble of priorities. Some examples:
Anthropic doesn't offer an unlimited chatbot service, only plans that give you "more" usage, whatever that means. If you have an API key, you are "unlimited," so they have the capability. Why doesn't the chatbot allow one to use their API key in the Claude app to get unlimited usage? (Yes, I know there are third-party BYOK tools. That's not the question.)
Claude appears to be smart enough to make an Excel spreadsheet with simple formulae. However, it is apparently prevented from making any kind of file. Why? What principle underlies that guardrail that does not also apply to Computer Use?
Really want to make Claude my daily driver, but right now it often feels too much like a research project.
Even with the API, depending on what tier you are sitting on, there are daily limits. OpenAI used to be able to generate files for you; they changed that. It was useful.
Interestingly enough, after Claude refused to generate a file for me, I sent the same request to ChatGPT and got the Excel file I wanted.
I wasn't aware of tiers in the Claude API, they are not mentioned on the API pricing page. Are the limits disclosed or just based on vibes like they are for the chatbot?
What do you mean by “file” here? I’m making files on a daily basis, including CSVs, html, executable code, XML, JSON and other formats. It built me an entire visual wireframe for something the other day.
Are you using artefacts?
But I’m maybe misunderstanding your point because my use is relatively basic through the built in chatbot.
I asked it to generate a very basic Excel file. It generated text as Markdown. I reiterated that I want an Excel file with formulae and it provided this as part of its response:
----
No, I am not able to generate or create an actual Excel file. As an AI language model, I don't have the capability to create, upload, or send files of any kind, including Excel spreadsheets.
It does not have access to the Excel app. You may be able to generate the .xlsx file using python libraries but you would need to run the python code on your own. ChatGPT can run generated code, which is probably why it works there.
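The code itself is the easy half; here is a minimal sketch of the kind of thing Claude will happily write for you (using openpyxl, which you would run locally - the file name and contents are just placeholders):

```python
# Build a small .xlsx with a formula; Excel evaluates the formula on open.
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.title = "Budget"

ws.append(["Item", "Cost"])
ws.append(["Coffee", 4.50])
ws.append(["Lunch", 12.00])
ws["B4"] = "=SUM(B2:B3)"  # stored as a formula string, not evaluated here

wb.save("budget.xlsx")
```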
I don't expect it to have the Excel app, I expect it to run the code it is capable of generating.
This is what I mean by their strategy being a jumble. Claude can do the hard part of figuring out what code to write and writing it, but then refuses to do the easier part of executing the code.
> you'd be better off with ChatGPT with Code Interpreter mode
Yes, this is what I am saying. Why go to the trouble to build something as capable as Claude and then hamstring it from being as useful as ChatGPT? I have no doubt that Claude could be more useful if the Anthropic team would let it shine.
I have used Artifacts a couple of times and found them useful.
But now I am even more confused. They make an LLM that can generate code. They make a sandbox to run generated code. They will even host public(!) apps that run generated code.
But what they will not do is run code in the chatbot? Unless the chatbot context decides the code is worthy of going into an Artifact? This is kind of what I mean by the offering being jumbled.
BTW saw your writeup on the LLM pricing calculator -- very cool!
Yeah I can't imagine Claude will be without a server-side code execution platform forever. Both OpenAI (Code Interpreter) and Gemini (https://ai.google.dev/gemini-api/docs/code-execution) have had that for a while now, and it's spectacularly useful. It fills a major gap in a Chatbot's skills too, since it lets them reliably run calculations.
Sandboxing is a hard problem, but it's not like Anthropic are short on money or engineering talent these days.
See, I think this is a case of personal preference. I much prefer Claude's approach of figuring out the code and writing it for me to execute myself, rather than having it all in one box. Apart from anything else, it makes me think a little more about the process, and the desired outcome, rather than just iterate, click, iterate, click.
It's marginally less efficient, for sure, but it allows me greater visibility on the process, and gives me more confidence that whatever it's doing is what I want it to do.
But maybe that's some weird luddite-ism on my part, and I should just embrace an even blacker box where everything is done in the tool.
YMMV obviously. If I ask the magic box to make a spreadsheet, I don't need to see the Python for that any more than I need to see the code it uses to summarize something I paste in. I don't really even care that it has to write code to make the spreadsheet at all.
I've seen quite a few YC startups working on AI-powered RPA, and now it looks like a foundational model player is directly competing in their space. It will be interesting to see whether Anthropic will double down on this or leave it to third-party developers to build commercial applications around it.
We thought it was inevitable that OpenAI / Anthropic would veer into this space and start to become competitive with us. We actually expected OpenAI to do it first!
What this confirms is that there is significant interest in computer / browser automation, and the problem is still unsolved. We will see whether the automation itself is an application-layer problem (our approach) or whether the model needs to be intertwined with the application (Anthropic's approach here).
Is there anything out there yet that will let me issue the command:
> Refactor the api folder with any recommended readability improvements or improvements that would help DRY up code without adding additional complexity.
Aider is great at this stuff. The recommended way is to have it automatically commit, and then you can examine and possibly revert/reset its commits (or just have it work on a separate branch), but you can also use --no-auto-commits
Install Cursor (https://cursor.com), go into Cursor Settings and disable everything but Claude, then open Composer (Ctrl/Cmd + I). Paste in your exact command above. I bet it’ll do something pretty close to what you’re looking for.
I tried to get it to translate a document and it stopped after a few paragraphs and asked if I wanted it to keep going. This is not appropriate for my use case and it kept doing this even though I explicitly told it not to. The old version did not do this.
> "... and similar speed to the previous generation of Haiku."
To me this is the most annoying grammatical error. I can't wait for AI to take over all prose writing so this egregious construction finally vanishes from public fora. There may be some downsides -- okay, many -- but at least I won't have to read endless repetitions of "similar speed to ..." when the correct form is obviously "speed similar to".
In fact, in time this correct grammar may betray the presence of AI, since lowly biologicals (meaning us) appear not to either understand or fix this annoying error without computer help.
That's too much control for my taste. I don't want Anthropic to see my screen. I'd much prefer a VS Code with integrated Claude: a version that can see all my dev files in a given folder. I don't need it to run Chrome for me.
It just depends on the task I suppose. One could have a VM dedicated to a model and let it control it freely to accomplish some set of tasks, then wipe/redeploy if it ever breaks.
This is what the Rabbit "large action model" pretended to be. Wouldn't be surprised to see them switch to this and claim they were never lying about their capabilities because it works now.
I think Rabbit had the business model wrong though; I don't think automating UIs to order pizza is anywhere near as valuable as automating the app workflows for B2B users.
How long until it is profitable to tell a cheap AI to "win this game by collecting resources and advancing in-game" and then sell the account on eBay?
I wonder what optimizations could be made? Could a gold farmer have the directions from one AI control many accounts? Could the AI program simpler bots for each bit of the game?
I can imagine not being smart enough to play against computers, because I am flagged as a bot. I can imagine a message telling me I am banned because "nobody but a stupid bot would score so low."
If anyone would like to try the new Sonnet in VSCode: I just updated https://double.bot to the new Sonnet. (disclaimer: I am the cofounder/creator)
---
Some thoughts:
* Will be interesting to see what we can build in terms of automatic development loops with the new computer use capabilities.
* I wonder if they are not releasing Opus because it's not done or because they don't have enough inference compute to go around, and Sonnet is close enough to state of the art?
It's a problem we used to work on, and one that many other people have wanted to solve for at least 10 years. So it's yet to be seen how well it works outside a demo.
What was surprising was the slow/human speed of operations. It types into the text boxes at a human speed rather than just dumping the text there. Is it so the human can better monitor what's happening, or is it so it does not trigger captchas?
This is incredibly cool but it seems like the potential damage from a "hallucination" in this mode is considerable, especially when they provide examples of it going very far off-track (looking up Yellowstone pictures). Would basically need constant monitoring for me not to be paranoid it did something stupid.
Also seems like a privacy issue with them sending screenshots of your device back to their servers.
anybody know how the hell they're combating / gonna combat captchas, Cloudflare blocking, etc.? I remember playing in this space on a toy project and being utterly frustrated by anti-scraping. Maybe one good thing that will come out of this AI boom is that companies will become nicer to scrapers? Or maybe they'll just cut sweetheart deals?
I am surprised it uses macOS as the demo, as I thought it would be harder to control vs Ubuntu. But maybe at the same time, macOS is the most predictable/reliable desktop environment? I noticed that they use a virtual environment for the demo; curious how they build that along with Docker - is that leveraging the latest virtualization framework from Apple?
I skimmed through the computer use code. It's possible to build this with other AI providers too. For instance, you can ask the ChatGPT API to call functions for click and scroll and type with specific parameters and execute them using the OS's APIs (A11y APIs usually).
Did I miss something? Did they have to make changes to the model for this?
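A rough sketch of the execution side of that loop (pyautogui for the OS-level input; the tool schema is OpenAI-style and purely illustrative, not Anthropic's format):

```python
# Declare click/type/scroll as tools for a function-calling model,
# then execute whatever the model asks for on the local machine.
import pyautogui

TOOLS = [
    {"type": "function", "function": {
        "name": "click",
        "parameters": {"type": "object", "properties": {
            "x": {"type": "integer"}, "y": {"type": "integer"}}}}},
    {"type": "function", "function": {
        "name": "type_text",
        "parameters": {"type": "object", "properties": {
            "text": {"type": "string"}}}}},
    {"type": "function", "function": {
        "name": "scroll",
        "parameters": {"type": "object", "properties": {
            "amount": {"type": "integer"}}}}},
]

def execute(name: str, args: dict) -> None:
    """Dispatch one tool call from the model to the local GUI."""
    if name == "click":
        pyautogui.click(args["x"], args["y"])
    elif name == "type_text":
        pyautogui.write(args["text"], interval=0.02)
    elif name == "scroll":
        pyautogui.scroll(args["amount"])
```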
> execute them using OS's APIs (A11y APIs usually)
I wonder if we'll end up with a new set of AI APIs in Windows, macOS, and Linux in the future. Maybe an easier way for them to iterate through windows and the UI elements available in each.
Imagine the possibilities for cyber-crime. Surely you could program it to log in to a financial institution and transfer money. And if you had a list of user names and passwords from some large info breach? You could automate a LOT of transfers in a short amount of time...
This demo is impressive although my initial reaction is a sort of grief that I wasn't born in the timeline where Alan Kay's vision of object-oriented computing was fully realized -- then we wouldn't have to manually reconcile wildly heterogeneous data formats and interfaces in the first place!
OpenAI must be scared at this point. Anthropic is clobbering them at the high end of the market and Meta is providing free AIs at the low end. OpenAI is pretty soon going to be in the valueless middle fighting with tons of other companies for relevance
I imagine initial computer use models will be kind of like untrained or unskilled computer users today (for example, some kids and grandparents). They'll do their best but will inevitably be easy to trick into clicking unscrupulous links and UI elements.
Will an AI model be able to correctly choose between a giant green "DOWNLOAD NOW!" advertisement/virus button and a smaller link to the actual desired file?
I wonder if OpenAI will fast follow; usually they're the ones to throw down the gauntlet. That being said, you can play around with OpenAI with a similar architecture of vision + agent + exec + loop using Donobu, though it is constrained to web browsers.
Offtopic but youtube doesn't allow me to view the embedded video, with a "Sign in to confirm you’re not a bot" message. I need to open a dedicated youtube tab to watch it
The barrier to scraping youtube has increased a lot recently, I can barely use yt-dlp anymore
That's funny. I was recently scraping tens of thousands of YouTube videos with yt-dlp. I would encounter throttling of some kind where yt-dlp stopped working, but I'd just spin a new VPS up and the throttled VPS down when that happened. The throttling effort cost me ~1 hour of writing the logic to handle it.
I say that's funny because my guess would be that they want to block larger-scale scraping efforts like mine, but completely failed, while their attempt at throttling puts captchas in front of legitimate users.
Hopefully the coding improvements are meaningful, because I find that as a coding assistant o1-preview beats it (at least the Claude 3.5 that was available yesterday), but I like Claude's demeanor more (I know this sounds crazy, but it matters a bit to me).
Any comments on alignment with Anthropic's mission? Last time I checked, Anthropic is about building SOTA as that is the only way to do safety research. Making money and useful stuff commercially is a means to an end.
90% of our customers do not allow this due to data sovereignty.
Bedrock here is lagging so far behind that several customers assume AWS simply isn't investing here anymore - or if they are, it's an afterthought - and a very expensive one at that.
I've spoken with several account managers and SAs and they seem similarly frustrated with the continual response from above that useful models are "coming soon".
You can't even BYO models here; we usually end up spinning up big ol' GPU EC2 instances and serving our own, or for some tasks running locally, as you can get better open-weight LLMs.
Hmm interesting, didn't realize that data sovereignty requirements were so stringent. Wonder how other cloud providers are doing in this sense considering GPU shortages across the board.
Looks like visual understanding of diagrams is improved significantly! For example, it was on par with GPT-4o and Gemini 1.5 in parsing an ERD for a conceptual model, but now it far exceeds the others.
Aider (with the older Claude models) is already a semi-competent junior developer, and it will produce 1kloc of decent code for the equivalent of 50 cents in API costs.
Sure, you still have to review the commits, but you have to do that anyway with human junior developers.
Anthropic could charge 20x more and we would still be happy to pay it.
I suspect they are gonna need some local offload capabilities for Computer Use; the repeated screen reading can definitely be done locally on modern machines, otherwise the cost may be impractical.
Maybe we need some agent running on the PC to offload some of these tasks. It could scrape the display at 30 or 60 Hz and produce a textual version of what's going on for the model to consume.
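A toy sketch of that local capture loop, assuming mss for the screen grab and leaving the "turn pixels into text" step as a stub (local OCR or a small vision model would go there):

```python
# Grab the screen at ~30 Hz locally and hand each frame to a local describe() step.
import time
import mss
import mss.tools

def describe(png_bytes: bytes) -> str:
    # Placeholder: run local OCR / a small vision model here.
    return "<textual description of the screen>"

with mss.mss() as sct:
    monitor = sct.monitors[1]  # primary monitor
    while True:
        frame = sct.grab(monitor)
        png = mss.tools.to_png(frame.rgb, frame.size)
        print(describe(png))
        time.sleep(1 / 30)  # ~30 Hz capture loop
```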
One suggestion, use the following prompt at a LLM:
The combination of the words "computer use" is highly confusing. It's also "Yoda speak". For example it's hard for humans to parse the sentences *"Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku"*, *"Computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku "* (it literally relies on the comma to make any sense) and *"Computer use for automated interaction"* (in the youtube vid's title: this one is just broken english). Please suggest terms that are not confusing for a new ability allowing an AI to control a computer as if it was a human.
This "computer use" feature is obviously perfect for automating GUI tests. Will it work on screenshots of mobile devices like smartphones/tables, also?
I know, but similar updates to Copilot would probably take over a year and they designed it in a way that we got the update now without having to reinstall it.
While we expect this capability to improve rapidly in the coming months, Claude's current ability to use computers is imperfect. Some actions that people perform effortlessly—scrolling, dragging, zooming—currently present challenges for Claude and we encourage developers to begin exploration with low-risk tasks.
Since they didn't rev the version, does this mean that if we were using 3.5 today it's just automatically using the new version? That doesn't seem great from a change management perspective.
though I am looking forward to using the new one in cursor.ai
wow, i almost got worried but the cute music and the funny little monster on the desk convinced me that this is all just fun and dandy and all will be good. the future is coming and we'll all be much more happy :)
Assuming running this new computer interactivity feature is as fast as cursor composer (which I don’t think it is)—it still doesn’t support codebase indexing, inline edits or references to other variables and files in the codebase. I can see how someone could use this to make some sort of cursor competitor but out of the box there’s a very low likelihood it makes cursor obsolete.
i really want cursor to integrate this so it can look at the results of a code change in the browser and then make edits as needed until it's accomplished what i asked of it. same for errors in the console etc. right now i have to manually describe the issue or copy and paste the error message and it'd be nice for it to just iterate more on its own
While I was initially impressed with its context window, I got so sick of fighting with Claude about what it was allowed to answer that I quit my subscription after 3 months.
Their whole policing AI models stance is commendable but ultimately renders their tools useless.
It actually started arguing with me about whether it was allowed to help implement a GitHub repository's code as it might be copyrighted... it was MIT-licensed open source from Google :/
I just include text that I own the device in question and that I have a legal team watching my every move. It's stupid, I agree, but not insurmountable. I had fewer refusals with Claude 3 Opus.
computer use is really going to highlight how fragmented the desktop ecosystem is, but this also adds more context on how microsoft wants to use their screenshot ai
They have been toast for at least a decade, if not two, now. With OCR and captcha-solving services like DeathByCaptcha or AntiCaptcha, where it costs ~$2.99 per 1k successfully solved captchas, they are a non-issue, and it takes about 5-10 lines of code added to your script to implement a solution.
I discovered Mindcraft recently and stayed up a few hours too late trying to convince my local model to play Minecraft. Seems like every time a new capability becomes available, I can't wait to experiment with it for hours, even at the cost of sleep.
Now I am really curious how to programmatically create a sandboxed compute environment to do a self-hosted "Computer use" and see how well other models, including self-hosted Ollama models, can do this.
The new Sonnet tops aider's code editing leaderboard at 84.2%. Using aider's "architect" mode it sets the SOTA at 85.7% (with DeepSeek as the "editor" model).
84% Claude 3.5 Sonnet 10/22
80% o1-preview
77% Claude 3.5 Sonnet 06/20
72% DeepSeek V2.5
72% GPT-4o 08/06
71% o1-mini
68% Claude 3 Opus
It also sets SOTA on aider's more demanding refactoring benchmark with a score of 92.1%!
I will repeat my question from one of the previous threads:
Can someone explain these Aider benchmarks to me? They pass the same 113 tests through the LLM every time. Why do they then extrapolate an LLM's ability to pass these 113 basic Python challenges into a general ability to produce/edit code? Couldn't an LLM provider just fine-tune their model for these tasks specifically - since they are static - to get ad value?
Did anyone ever try to change those test cases or wiggle the conditions a bit to see if it would still hit the same %?
Indeed, test data like this constantly leaks into the training data, so these leaderboards are not necessarily representative for real-world problems. A better approach is to use variable evaluation like GSM-Symbolic (for evaluating mathematic reasoning): https://arxiv.org/abs/2410.05229
> Couldn't LLM provider just fine-tune their model for these tasks specifically - since they are static - to get ad value?
They could. They would easily be found out as they lose in real-world usage or on improved, new, unique benchmarks.
If you were in charge of a large and well funded model, would you rather pay people to find and "cheat" on LLM benchmarks by training on them, or would you pay people to identify benchmarks and make reasonably sure they specifically get excluded from training data?
I would exclude them as well as possible so I get feedback on how "real" any model improvement is. I need to develop real world improvements in the end, and any short term gain in usage by cheating in benchmarks seems very foolish.
It sounds very nice, but at the same time very naive, sorry. Funding is not a gift, and they must make money. The more funding they get - the more pressure there is to make money.
When you're in charge of a billion-dollar-valuation company which is expected to remain unprofitable through 2029, it's hard to find a topic more crucial and intriguing than growth and making more money.
And yes, it is a recurring theme for vendors to tune their products specifically for industry-standard benchmarks. I can't find any specific reason for them not to pay people to train their model to score 90% on these 113 Python tasks, as it directly drives profits up, whereas not doing it brings absolutely nothing to the table - surely they have their own internal benchmarks which they can exclude from training data.
> If you were in charge of a large and well funded model, would you rather pay people to find and "cheat" on LLM benchmarks by training on them, or would you pay people to identify benchmarks and make reasonably sure they specifically get excluded from training data?
You should already know by now that economic incentives are not always aligned with science/knowledge...
This is the true alignment problem, not the AI alignment one hahaha
They cannot be found out as long as there is no better evaluation. Sure, if they produce obvious nonsense, but the point of a systematic evaluation is exactly to overcome subjective impressions based on individual examples as a notion of quality.
Also, you are right that excluding test data from the training data improves your model. However, given the insane amounts of training data, this requires significant effort. If that additionally leads to your model performing worse in existing leaderboards, I doubt that (commercial) organizations would pay for such an effort.
And again, as long as there is no better evaluation method, you still won't know how much it really helps.
This market is all about hype and mindshare, proper testing is hard and not performed by individuals, so there are no incentives not to train a bit on the test set.
> Couldn't LLM provider just fine-tune their model for these tasks specifically - since they are static - to get ad value?
Yes, this is an inherent problem with the whole idea of LLMs. They're pattern-recognition "students", but the important thing that all the providers like to sell is their reasoning. A good test is a reasoning test. I'll try to find a link and update with a reference.
There is an opportunity to develop black-box benchmarks and offer them to LLM providers to support their testing phase. If I were in their place, I would find it incredibly valuable to have such tamper-proof testing before releasing a model.
Thanks! I was waiting for your benchmarks. Do you plan to test haiku 3.5 too?
It would be nice to show API prices needed to run the whole benchmark too to have a better idea of how many internal tokens o1 models consume.
> Convert a hexadecimal number, represented as a string (e.g. "10af8c"), to its decimal equivalent using first principles (i.e. no, you may not use built-in or external libraries to accomplish the conversion).
So it's fairly synthetic. It's also the sort of thing LLMs should be great at since I'm sure there's tons of data on this sort of thing online.
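For reference, the kind of first-principles answer the exercise expects is only a few lines (my own sketch, not the benchmark's reference solution):

```python
def hex_to_decimal(hex_str: str) -> int:
    """Convert a hex string like "10af8c" to decimal without int(x, 16)."""
    digits = "0123456789abcdef"
    value = 0
    for ch in hex_str.lower():
        value = value * 16 + digits.index(ch)
    return value

assert hex_to_decimal("10af8c") == 1093516
```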
Yeah but programming isn't about solving problems that were solved millions of times already. I mean, web dev kind of is, but that's not the point. If a problem is solved, then it's just a matter of implementing the solution and anyone can do that given the proper instructions (even without understanding how or why they solve the problem).
I've formalized a lot of stuff I didn't understand just by copying the formulas from Wikipedia.
As long as LLMs are not capable of proper reasoning, they will remain a gimmick in the context of programming.
They should really just focus on refactoring benchmarks across many languages. If an AI can refactor my complex code properly without changing the semantics, it's good enough for me. But that unfortunately requires such a high-level understanding of the codebase that with the current tech it's just impossible to get a half-decent result in any real-world scenario.
I'll throw this out here as well: Is there any decent alternative to GitHub Copilot when using Visual Studio? (Pretty happy with it to be fair, but would be open to trying others.)
Fellow cursor user here, I'm very new to it. I am getting some very convenient and welcome autocomplete. I am also getting quite a lot of bad autocomplete suggestions, which require cognitive overhead and context switching to evaluate. So I am thus far not fully convinced. Any tips for getting the most out of cursor?
What exactly is left when we remove coding from software engineering? Could it be handled by a manager? Or perhaps by a single senior SWE who could now perform the work of an entire team using these rapidly advancing AI coders?
for a lot of tasks that aren't as cut & dry, i often find myself having to provide it pseudo code, which it can then one-shot to working code.
don't get me wrong, it's still a massive upgrade from the pre-sonnet era, but i still don't think it can take a high-level requirement and convert it into a working project... yet
> but i still don't think it can take a high-level requirement and convert it into a working project.
It cannot; you need to hand-hold it. To make something larger than an (albeit good-looking) to-do app, you don't need to write code, but you do need to be able to review and debug code and make the architectural decisions. It'll simply loop forever otherwise.
(1) Sure, it can tell you how to write new code in response to a prompt about your current local problem, but
(2) can it reason about an entire code base of known and unknown problems, and use that basis to figure out solutions to the unknowns such that you delete code and collapse complexity.
The software equivalent of realising that if you subtract xy from this:
x² + 3xy + y²
You can turn it into a much neater version:
(x + y)² + xy
…but doing that with 100k tokens of code instead of a handful of algebra tokens.
FWIW, the refactor benchmark is quite mechanical - it just stresses reliability of LLMs over long context windows:
Questions are variants of:
Refactor the _set_csrf_cookie method in the CsrfViewMiddleware class to be a stand alone, top level function. Name the new function _set_csrf_cookie, exactly the same name as the existing method. Update any existing self._set_csrf_cookie calls to work with the new _set_csrf_cookie function.
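The transformation itself is simple; schematically it looks like this (a toy class of my own, not Django's actual code):

```python
# Before: the behavior lives in a method on the class.
class SessionMiddleware:
    def _set_session_cookie(self, request, response):
        response.cookies["sessionid"] = request.session_id

    def process_response(self, request, response):
        self._set_session_cookie(request, response)
        return response

# After: same name, now a top-level function; every call site updated.
def _set_session_cookie(request, response):
    response.cookies["sessionid"] = request.session_id

class SessionMiddleware:
    def process_response(self, request, response):
        _set_session_cookie(request, response)
        return response
```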
Assuming that that is indeed what most of the benchmark does: if the LLMs are as bad at it as the numbers suggest, then it seems like a perfectly good benchmark. I would definitely want them to be able to do stuff like that when I let them write my code.
Anecdotally, I still get significantly better results from ChatGPT than Claude for coding.
Claude is way less controllable; it is difficult to get it to do exactly what I want. ChatGPT is way easier to control in terms of asking for specific changes.
Not sure why that is - maybe the chain-of-thought and instruction-tuning dataset has made theirs a lot better for interactive use.
For me it's the opposite: ChatGPT (o1-preview and 4o) keeps making very strange errors - errors that I tell it exactly how to fix, and it simply repeats the fundamental mistakes again. With Claude, I did not have that.
Example: I asked it to write some JS that finds a button on a page, clicks the button, then waits for a new element with some selector to appear and returns a ref to it; ChatGPT kept returning (pseudo code):
    while (true) {
      button.click()
      wait()
      oldItems = ...
      newItems = ...
      newItem = newItems - oldItems
      if (newItem) return newItem
      sleep(1)
    }
which obviously doesn't work. Claude understands to put the oldItems outside the while; even when I tell chatgpt to do that, it doesn't. Or it does one time and with another change, it moves it back in.
Try as I might, ChatGPT couldn't give me working code for a simple admin dash layout in Vue with a sidebar that can minimise. I had to correct it, it would say "my apologies" and provide new code with a different error. About 10 times in a row it got into a loop of errors and I gave up.
Without people providing their prompts, it's impossible to say whether they are skilled or not, and their complaints or claims of "it worked with this prompt" without the output are also not possible to validate.
Maybe there's a clue in there as to why these experiences seem so different. I'm glad GPTs don't get frustrated.
I've spent thousands of hours, literally, learning the ropes, and continue to hone it. There is a much higher skill ceiling for prompting than there was for Google-fu.
Give it one or two examples of what you want. Don't expect these things to perfectly solve every problem - they're transformation machines, so they can do pretty much anything if you figure out the right input.
The confusing choice they seem to have made is that "Claude 3.5 Sonnet" is a name, rather than 3.5 being a version. In their view, the model "version" is now `claude-3-5-sonnet-20241022` (and was previously `claude-3-5-sonnet-20240620`).
OpenAI does exactly the same thing, by the way; the named models also have dated versions. For instance, their current models include (only listing versions with more than one dated version for the same "name" version):
On the one hand, if OpenAI makes a bad choice, it’s still a bad choice to copy it.
On the other hand, OpenAI has moved to a naming convention where they seem to use a name for the model: “GPT-4”, “GPT-4 Turbo”, “GPT-4o”, “GPT-4o mini”. Separately, they use date strings to represent the specific release of that named model. Whereas Anthropic had a name: “Claude Sonnet”, and what appeared to be an incrementing version number: “3”, then “3.5”, which set the expectation that this is how they were going to represent the specific versions.
Now, Anthropic is jamming two version strings on the same product, and I consider that a bad choice. It doesn’t mean I think OpenAI’s approach is great either, but I think there are nuances that say they’re not doing exactly the same thing. I think they’re both confusing, but Anthropic had a better naming scheme, and now it is worse for no reason.
> Now, Anthropic is jamming two version strings on the same product, and I consider that a bad choice. It doesn’t mean I think OpenAI’s approach is great either, but I think there are nuances that say they’re not doing exactly the same thing
Anthropic has always had dated versions as well as the other components, and they are, in fact, doing exactly the same thing, except that OpenAI has a base model in each generation with no suffix before the date specifier (what I call the "Model Class" on the table below), and OpenAI is inconsistent in their date formats, see:
  Major Family | Generation | Model Class | Date
  claude       | 3.5        | sonnet      | 20241022
  claude       | 3.0        | opus        | 20240229
  gpt          | 4          | o           | 2024-08-06
  gpt          | 4          | o-mini      | 2024-07-18
  gpt          | 4          | -           | 0613
  gpt          | 3.5        | turbo       | 0125
But did they ever have more than one release of Claude 3 Sonnet? Or any other model prior to today?
As far as I can tell, the answer is “no”. If true, then the fact that they previously had date strings would be a purely academic footnote to what I was saying, not actually relevant or meaningful.
I had no idea what the headline meant before reading the article. I wasn't even sure how to pronounce "use." (Maybe a typo?) I think something like "Claude adds Keyboard & Mouse Control" would be clearer.
I'm not sure what a better term is. It's kind of understated to me. An AI that can "use a computer" is a simple straightforward sentence but with wild implications.
Every major AI vendor seems to do it with hosted models; within "named" major versions of hosted models, there are also "dated" minor versions. OpenAI does it. Google does it (although for Google Gemini models, the dated instead of numbered minor versions seem to be only for experimental versions like gemini-1.5-pro-exp-0827; stable minor versions get additional numbers like gemini-1.5-pro-002).
Speaking of "intelligence", isn't it ironic how everyone's only two words they use to describe AI is "crazy" and "insane". Every other post on Twitter is like: This new feature is insane! This new model is crazy! People have gotten addicted to those words almost as badly as their other new addiction: the word "banger".
Well, by calling it 3.5, they are telling you that this is NOT the next-gen 4.0 that they presumably have in the works, and also not downplaying it by just calling it 3.6 (and anyways they are not advancing versions by 0.1 increments - it seems 3.5 was just meant to convey "half way from 3.0 to 4.0"). Maybe the architecture is unchanged, and this just reflects more pre and/or post-training?
Also, they still haven't released 3.5 Opus yet, but perhaps 3.5 Haiku is a distillation of that, indicating that it is close.
From a competitive POV, it makes sense that they respond to OpenAI's 4o and o1 without bumping the version to Claude 4.0, which presumably is what they will call their competitor to GPT-5, and probably not release until GPT-5 is out.
I'm a fan of Anthropic, and not of OpenAI, and I like the versioning and competitive comparisons. Sonnet 3.5 still best coder, better than o1, has to hurt, and a highly performant cheap Haiku 3.5 will hit OpenAI in the wallet.
exactly my thought too, go up with the version number! Some negative examples: Claude Sonnet 3.5 for Workstations, Claude Sonnet 3.5 XP, Claude Sonnet 3.5 Max Pro, Claude Sonnet 3.5 Elite, Claude Sonnet 3.5 Ultra
Similar to OpenAI when they update their current models they just update the date, for example this new Claude 3.5 Sonnet is "claude-3-5-sonnet-20241022".
Just guessing here, but I think the name "sonnet" is the architecture, the number is the training structure / method, and the model date (not shown) is the data? So presumably with just better data they improved things significantly? Again, just a guess.
My guess is they didn't actually change the model, that's what the version number no change is conveying. They did some engineering around it to make it respond better, perhaps more resources or different prompts. Same cutoff date too.
If the architecture is the same, and the training scripts/data is the same, but the training yielded slightly different weights (but still same model architecture), is it a new model or just a iteration on the same model?
What if it isn't even a re-training from scratch but a fine-tune of an existing model/weights release - is it a new version then? It would be more like an iteration, or even a fork, I suppose.
It's quite sad that application interoperability requires parsing bitmaps instead of exchanging structured information. Feels like a devastating failure in how we do computing.
See https://github.com/OpenAdaptAI/OpenAdapt for an open source alternative that includes operating system accessibility API data and DOM information (along with bitmaps) where available.
It's super cool to see something like this already exists! I wonder if one day something adjacent will become a standard part of major desktop OSs, like a dedicated "AI API" to allow models to connect to the OS, browse the windows and available actions, issue commands, etc. and remove the bitmap parsing altogether as this appears to do.
It's really more of a reflection on where we're at in the timeline of computing, with humans having been the major users of apps and web sites up until now. Obviously we've had screen scraping and terminal emulation access to legacy apps for a while, and this is a continuation of that.
There have been, and continue to be, computer-centric ways to communicate with applications though, such as Windows COM/OLE, WinRT and Linux D-Bus, etc. Still, emulating human interaction does provide a fairly universal capability.
If the goal is to emulate human behavior, I'd say there is a case to be made to build for the same interface, and not rely on separate APIs that may or may not reflect the same information as a user sees.
You can blame normies for this. They love their ridiculous point and click (and tap) interfaces.
Fortunately, with function-calling (and recently, with guaranteed data structure), we've had access to application interoperability with LLMs for a while now.
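For example, "guaranteed data structure" just means constraining the reply to a schema you define up front; an illustrative sketch (this schema is mine, and the request parameter that carries it differs per vendor):

```python
# A JSON Schema describing the structured reply you want instead of free text;
# vendors accept something like this via their structured-output options.
invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number"},
        "due_date": {"type": "string", "format": "date"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"},
                },
                "required": ["description", "amount"],
            },
        },
    },
    "required": ["vendor", "total"],
}
```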
Don't get mad at a company for developing for the masses - that's what they are expected to do.
Your comment takes 630 bits; the screenshot of your comment on my computer takes 2.1 MB - roughly 17 million bits, or about 27,000 times the size. Either this is a compute overhead the LLM has to pay before it can think about the meaning of the text, or, if it's an E2E feedforward architecture, less thinking about it.
This is simple for us because neurons in the retina pre-process the visual stream so that less than 0.8% is sent to the visual cortex, and because we have evolved to very efficiently and quickly extract meaning from our vision. This is a prime example of Moravec's paradox.
'Hey Claude 3.5 New, pretend I'm a CEO of a big company and need to lay off 20% people, make me a spreadsheet and send it to HR. Oh make sure to not fire the HR department'
Both new Sonnet and gpt-4o still fail at a simple:
"How many w's are in strawberry?"
gpt-4o: There are 2 "w's" in "strawberry."
Claude 3.5 Sonnet (new): Let me count the w's in "strawberry": 0 w's.
(same question with 'r' succeeds)
What is artificial about current gen of "artificial intelligence" is the way training (predict next token) and benchmarking (overfitting) is done. Perhaps a fresh approach is needed to achieve a true next step.
It's bad at directly working on classical computer problems like math and data processing. But you can do it indirectly by having it write a program that produces the correct result. Interestingly, I didn't even have to have it run the program, although usually you would:
write a tool which counts the number of w's in "strawberry" and return the result
Which produced:
Here's a simple Python function that counts the number of 'w's in the word "strawberry" and returns the result:
```python
def count_w_in_strawberry():
word = "strawberry"
count = word.count('w')
return count
# Call the function and print the result
result = count_w_in_strawberry()
print(f"The number of 'w's in 'strawberry' is: {result}")
```
This tool does the following:
1. We define a function called `count_w_in_strawberry()`.
2. Inside the function, we assign the word "strawberry" to a variable called `word`.
3. We use the `count()` method on the `word` string to count the occurrences of 'w'.
4. The function returns the count.
5. Outside the function, we call `count_w_in_strawberry()` and store the result in the `result` variable.
6. Finally, we print the result.
When you run this code, it will output:
```
The number of 'w's in 'strawberry' is: 1
```
This tool correctly identifies that there is one 'w' in the word "strawberry".
I always thought the halting problem was an academic exercise, but here we see a potential practical use case. Actually this seems pretty dangerous letting the LLM write and automatically execute code. How good is the sandbox? Can I trick the LLM into writing a reverse shell and opening it up for me?
I'm not sure it's still the case, but I've had ChatGPT run shell commands. But I don't know what you could do since it's ephemeral, doesn't have internet access or root. Plus I'm sure they have security scanning.
There are models around that will answer this question perfectly.
But why would anyone care? If you want to count w's in a word, that's a Python one-liner, you don't need an LLM for that (although the LLM can write it for you).