Link your sessions
Point the CLI at your local agent history. It parses Claude Code, Codex, and Gemini CLI logs in seconds - fully on your machine.
EvalMyAgent reads your Claude Code, Codex, and Gemini CLI session history and turns it into a personal benchmark. When a new model drops, EvalMyAgent tells you whether switching is worth it - for your tasks, not SWE-bench's.
EvalMyAgent extracts your real task taxonomy - the refactors, debugs, and greenfield builds you actually do - into a graded personal benchmark.
A new model ships? Replay your bench against it and get a clear answer: switch, stay, or route by task type.
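To make the three possible answers concrete, here is a minimal sketch of how per-task-type scores could be turned into a switch, stay, or route verdict. It is not EvalMyAgent's actual logic: the `recommend` function, the 0-to-1 score scale, and the `margin` threshold are all assumptions made for illustration.

```python
from typing import Dict, Tuple


def recommend(
    current: Dict[str, float],
    candidate: Dict[str, float],
    margin: float = 0.05,  # hypothetical: minimum gain that counts as a real win
) -> Tuple[str, Dict[str, str]]:
    """Compare two models on shared task types and return (verdict, routing)."""
    shared = sorted(set(current) & set(candidate))
    if not shared:
        raise ValueError("the two score sets share no task types")

    # Per task type, prefer the candidate only if it beats the current
    # model by more than the margin; otherwise keep the current model.
    routing = {
        task: "candidate" if candidate[task] - current[task] > margin else "current"
        for task in shared
    }

    picks = set(routing.values())
    if picks == {"candidate"}:
        verdict = "switch"  # candidate wins on every task type
    elif picks == {"current"}:
        verdict = "stay"    # candidate wins nowhere
    else:
        verdict = "route"   # mixed result: choose a model per task type
    return verdict, routing


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    verdict, routing = recommend(
        {"refactor": 0.70, "debug": 0.80},
        {"refactor": 0.90, "debug": 0.78},
    )
    print(verdict, routing)
```

With these made-up scores the candidate clearly wins on refactors but not on debugging, so the sketch returns "route" along with a per-task mapping. A real tool would likely also weigh cost, latency, and how many sessions back each task type.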
Public leaderboards do not know your codebase or task distribution. EvalMyAgent derives a benchmark from your real sessions so its dashboard reflects the work in front of you.
Install the isolated CLI, run evalmyagent init to choose between local rules and an installed Codex or Claude CLI, then run evalmyagent dashboard.
pipx install evalmyagent
No Python tooling? curl -sSL https://evalmyagent.ai/install.sh | sh