reads sessions you already have - nothing leaves your machine

Your AI tools, benchmarked on
the work you actually do.

EvalMyAgent reads your Claude Code, Codex, and Gemini CLI session history and turns it into a personal benchmark. When a new model drops, it tells you whether it's worth switching - for your tasks, not SWE-bench's.


~/dev/evalmyagent

$ pipx install evalmyagent

-> installed package evalmyagent in an isolated environment

apps are now globally available on your machine

choose analysis: evalmyagent init

then run: evalmyagent dashboard

local dashboard: http://127.0.0.1:3847

ready - local rules unless you opt into a CLI analyzer

// how_it_works

Link your sessions

Point the CLI at your local agent history. It parses Claude Code, Codex, and Gemini CLI logs in seconds - fully on your machine.
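As a rough illustration of what local parsing can look like, here is a minimal sketch. It assumes session history is stored as JSON Lines files (one JSON object per line) somewhere under a directory you pass in. The function names `parse_jsonl` and `load_sessions` are hypothetical, and the on-disk layout is an assumption, not EvalMyAgent's documented internals.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per valid JSON object line; skip blank or malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate partially written or corrupted lines
        if isinstance(record, dict):
            yield record


def load_sessions(root: Path) -> dict[str, list[dict]]:
    """Map each *.jsonl file under root to its parsed records (read-only, local)."""
    sessions: dict[str, list[dict]] = {}
    for path in sorted(root.rglob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            sessions[str(path)] = list(parse_jsonl(handle))
    return sessions
```

Everything here reads from disk and returns plain Python objects; nothing is sent over the network.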

Build your bench

EvalMyAgent extracts your real task taxonomy - the refactors, debugs, and greenfield builds you actually do - into a graded personal benchmark.
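To make the idea of a task taxonomy concrete, the sketch below labels prompts with simple keyword rules and reports the share of each category. The keyword lists, the `RULES` table, and the `classify` and `task_mix` helpers are all illustrative assumptions; the actual rules EvalMyAgent applies are not described on this page.

```python
from collections import Counter

# Hypothetical keyword rules; category names mirror the example task mix on this page.
RULES = {
    "Refactoring": ("refactor", "rename", "extract", "clean up"),
    "Debugging": ("bug", "error", "traceback", "fix", "failing"),
    "Test writing": ("unit test", "pytest", "coverage"),
}
DEFAULT_CATEGORY = "Feature build"  # assumed fallback when no rule matches


def classify(prompt: str) -> str:
    """Return the first category whose keyword appears in the prompt."""
    text = prompt.lower()
    for category, keywords in RULES.items():
        if any(keyword in text for keyword in keywords):
            return category
    return DEFAULT_CATEGORY


def task_mix(prompts: list[str]) -> dict[str, float]:
    """Share of prompts per category, as percentages rounded to one decimal."""
    counts = Counter(classify(p) for p in prompts)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: round(100 * n / total, 1) for cat, n in counts.most_common()}
```

Substring matching is deliberately crude (for example, "fix" also matches "prefix"); a real classifier would need something more careful.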

Run any model

A new model ships? Replay your bench against it and get a clear answer: switch, stay, or route by task type.
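One way to picture the switch, stay, or route decision is a per-task-type comparison. This sketch assumes each model has a score per task type (higher is better) and uses a hypothetical `margin` threshold to ignore small differences; the `recommend` function and its decision rule are assumptions for illustration, not the tool's actual logic.

```python
def recommend(
    current: dict[str, float],
    candidate: dict[str, float],
    margin: float = 2.0,
) -> dict:
    """Compare per-task-type scores and suggest switch, stay, or route."""
    shared = sorted(set(current) & set(candidate))
    # Task types where the candidate wins or loses by at least `margin` points.
    better = [t for t in shared if candidate[t] - current[t] >= margin]
    worse = [t for t in shared if current[t] - candidate[t] >= margin]
    if better and not worse:
        verdict = "switch"  # candidate wins somewhere and loses nowhere
    elif better and worse:
        verdict = "route"   # mixed result: send only the winning task types over
    else:
        verdict = "stay"    # no meaningful win for the candidate
    return {
        "verdict": verdict,
        "use_candidate_for": better if verdict != "stay" else [],
    }
```

With made-up scores, `recommend({"Refactoring": 70, "Debugging": 60}, {"Refactoring": 75, "Debugging": 50})` yields a `"route"` verdict that sends only refactoring to the candidate.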

// the_bench

Your bench, not
somebody else's.

Public leaderboards do not know your codebase or task distribution. EvalMyAgent derives a benchmark from your real sessions so its dashboard reflects the work in front of you.

example task mix - your dashboard uses live data

Refactoring: 32%

Debugging: 24%

Feature build: 19%

Test writing: 12%

API integration: 9%

Infra and config: 4%

// your_leaderboard

See what wins on your work.

score = your bench · Δ = vs public bench · illustrative preview

model / harness | score | $/task | best at
Claude Sonnet 4.5 / Claude Code (preview) | | $0.31 | Refactoring · debugging
Claude Opus 4.1 / Claude Code | | $0.74 | Deep debugging
GPT-5 / Codex | -2 | $0.22 | API & integration
Gemini 2.5 Pro / Gemini CLI | | $0.09 | Boilerplate
GPT-5 mini / Codex | — | $0.05 | Quick fixes

Build your private benchmark locally.

Install the CLI in an isolated environment, run evalmyagent init to choose between local rules and an installed Codex or Claude CLI as the analyzer, then run evalmyagent dashboard.

$ pipx install evalmyagent

no Python tooling? curl -sSL https://evalmyagent.ai/install.sh | sh
