82% on OSWorld: What State-of-the-Art Computer Use Actually Means

Emily Watson | March 1, 2026 | 12 min read

Benchmarks in AI are often criticized as disconnected from real-world performance. OSWorld is different. It measures whether an AI agent can actually complete tasks on a real operating system, using real applications, with real complexity.

What is OSWorld?

OSWorld is a benchmark that evaluates AI agents on their ability to complete real computer tasks. Tasks range from file management and web browsing to multi-application workflows that require planning, error recovery, and contextual understanding.

Why 82% Matters

  • The previous state of the art was significantly lower
  • Human performance on the same benchmark averages around 90%
  • The gap between AI and human computer operators is closing rapidly
  • Performance holds up as task complexity increases, rather than being limited to simple operations
  • Error recovery and adaptation contribute significantly to the score

An 82% score means Coasty completes roughly four out of five benchmark tasks, which suggests it can reliably handle the majority of computer tasks that a human office worker performs daily.
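
To make the headline number concrete, here is a minimal sketch of how a success rate of this kind can be derived from per-task pass/fail outcomes, and how the gap to a human baseline is measured. The task names, categories, and outcomes below are hypothetical and are not OSWorld data. The function names are illustrative assumptions, and OSWorld's actual scoring harness may work differently.

```python
from collections import defaultdict

# Hypothetical per-task outcomes: (task_id, category, passed).
# Illustrative values only, not real OSWorld results.
RESULTS = [
    ("rename-files", "file management", True),
    ("archive-folder", "file management", True),
    ("book-flight", "web browsing", True),
    ("sheet-to-slides", "multi-application", True),
    ("email-report", "multi-application", False),
]


def success_rate(results):
    """Return the percentage of tasks that were completed."""
    if not results:
        raise ValueError("no results to score")
    passed = sum(1 for _, _, ok in results if ok)
    return 100.0 * passed / len(results)


def rate_by_category(results):
    """Return the success rate for each task category."""
    grouped = defaultdict(list)
    for result in results:
        grouped[result[1]].append(result)
    return {cat: success_rate(rs) for cat, rs in grouped.items()}


def gap_to_baseline(score, baseline):
    """Return how many percentage points the score trails the baseline."""
    return baseline - score


if __name__ == "__main__":
    print(f"overall: {success_rate(RESULTS):.1f}%")
    for category, rate in rate_by_category(RESULTS).items():
        print(f"{category}: {rate:.1f}%")
    # Figures quoted in this article: 82% for the agent, about 90% for humans.
    print(f"gap to human baseline: {gap_to_baseline(82.0, 90.0):.1f} points")
```

Breaking the score down by category is what lets a reader check whether an overall figure comes from simple operations alone or also from multi-application workflows.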

Beyond the Number

What matters more than the score is the pattern of success. Coasty excels at multi-step workflows, recovers from unexpected errors, and adapts when applications behave differently than expected. These are the qualities that make an AI agent useful in production, not just on benchmarks.

Benchmarks are imperfect, but OSWorld measures something that matters: can an AI actually use a computer to get work done? At 82%, the answer is increasingly yes.
