
Out of curiosity, how are high-risk liability environments like yours coming to terms with the non-deterministic nature of models like these? E.g., the non-zero chance that it might click a button it *really* shouldn't, as demonstrated in the failure demo.



Technical director at another company here: We have humans double-check everything, because we're required to by law. We use automation to make response times faster, or to do the bulk of the work and then just have humans double-check the AI. Anything more autonomous would be classed as "a software medical device", which needs documentation out the wazoo, and for good reason. I'm not sure you could even have a medical device where most of your design doc is "well, I just hope it does the right thing, I guess?".

Sometimes, the AI is more accurate or safer than humans, but it still reads better to say "we always have humans in the loop". In those cases, we reap the benefits of both: Use the AI for safety, but still have a human fallback.


I'm curious, what does your human verification process look like? Does it involve a separate interface or a generated report of some kind? I'm currently working on a tool for personal use that records actions and replays them later when a specified event occurs. For verification, I generate a CSV report after the process completes and back it up with screen recordings.
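Roughly, the report side looks like this (a minimal sketch only; the ActionRecord fields and write_report name are illustrative, not from any particular library):

    # Sketch of the post-run CSV audit report; field names are hypothetical.
    import csv
    from dataclasses import dataclass, asdict
    from datetime import datetime, timezone

    @dataclass
    class ActionRecord:
        timestamp: str   # when the action fired (UTC ISO-8601)
        trigger: str     # the event that caused it
        action: str      # e.g. "click", "type"
        target: str      # selector or window title
        screenshot: str  # path to the paired screen capture

    def write_report(records, path="audit_report.csv"):
        # Dump every recorded action to a CSV for human review after the run.
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=[
                "timestamp", "trigger", "action", "target", "screenshot"])
            writer.writeheader()
            for rec in records:
                writer.writerow(asdict(rec))

    records = [ActionRecord(
        timestamp=datetime.now(timezone.utc).isoformat(),
        trigger="invoice_received",
        action="click",
        target="#submit-button",
        screenshot="captures/run42_step1.png")]
    write_report(records)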


It's a separate interface where the output of the LLM is rated for safety, and anything unsafe opens a ticket to be acted upon by the medical professionals.
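In rough pseudocode terms, the flow is something like this (a sketch under stated assumptions; rate_safety, open_ticket, and the threshold are stand-ins for our actual rater and ticketing system):

    UNSAFE_THRESHOLD = 0.8

    def rate_safety(text: str) -> float:
        # Placeholder rater: in practice this is its own model or rules engine.
        return 1.0 if "increase the dose" in text.lower() else 0.0

    def open_ticket(summary: str, body: str, queue: str) -> None:
        # Placeholder: stands in for the real ticketing-system call.
        print(f"[{queue}] {summary}: {body[:60]}")

    def triage(llm_output: str) -> None:
        # Unsafe output never goes out directly; it opens a ticket
        # to be acted upon by a medical professional instead.
        if rate_safety(llm_output) >= UNSAFE_THRESHOLD:
            open_ticket("LLM output flagged for review", llm_output,
                        queue="medical_professionals")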


I don't know yet. We may not do it.

We haven't deployed a model like this, it's new.

I've built a ton of RPAs over the years, using all the usual techniques, and they're always brittle and sensitive to minor updates.

For this, I'm taking a "wait and see" approach. I want to see and test how well it performs in the real world before I deploy it, and wait for it to come out of beta so Anthropic will sign a BAA.

The demo is impressive enough that I want to give the tech a chance to mature before my team and I invest a ton of time into a more traditional RPA.

At a minimum, if we do end up using it, we'll have solid guard rails in place - it'll run on an isolated VM, all of its user access will be restricted to "read only" for external systems, and any content that comes from it will go through review by our nurses.
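For the "read only" piece, I'm imagining something as blunt as a wrapper that refuses mutating calls outright (a sketch only; guarded_request is hypothetical, layered on top of credentials that are already restricted server-side):

    import requests

    READ_ONLY_METHODS = {"GET", "HEAD", "OPTIONS"}

    def guarded_request(method: str, url: str, **kwargs):
        # Belt-and-suspenders on top of restricted credentials: refuse any
        # HTTP verb that could mutate an external system.
        if method.upper() not in READ_ONLY_METHODS:
            raise PermissionError(
                f"blocked {method} {url}: agent is read-only externally")
        return requests.request(method, url, **kwargs)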



