Discussion about this post

Dr Peter McCann Strain

The capability split is the useful part here. Aggregate scores hide the real question: what can the agent do, with which tools, under which permissions, and where does failure become action rather than just a bad answer?
