Discussion about this post

野牛熊

Appreciate the detailed research! Super interesting to see, with data, how behavior changes with instruction changes. Going to try asking my agent to be more concise and see what changes.

You mentioned at the end that your next experiment is applying these findings to real Claude Code workflows. I actually just ran something adjacent on 29 real PRs from an open-source repo (varying reasoning effort on Opus 4.7).

To do this, I built a tool for running these kinds of evals on real tasks from your own repo. If you're interested in trying it out for your next experiment, I'd love to collaborate. It's clear you've thought about the problem a lot and have your own methodology and thoughts on evals, and I'm always looking for informed feedback. Either way, keep up the good work!
