j-rig-skill-binary-eval

jeremylongshore/j-rig-skill-binary-eval
★ 0 stars TypeScript 🤖 AI/LLM Updated 8d ago
Binary-criteria evaluation harness for Claude skills with planned extension to plugins, agents, and MCP servers. Score every change yes/no across 7 layers — package integrity, trigger quality, functional quality, regression protection, baseline value, model variance, rollout safety. Never gradients.
View on GitHub →

Quick Install

Copy the config for your editor. Some servers may need additional setup — check the README.

Add to claude_desktop_config.json:

{
  "mcpServers": {
    "j-rig-skill-binary-e": {
      "command": "npx",
      "args": [
        "-y",
        "jeremylongshore/j-rig-skill-binary-eval"
      ]
    }
  }
}

README Excerpt

> Software-grade release discipline for Claude Skills Binary evaluation harness that treats `SKILL.md` artifacts as production software. Package integrity, trigger precision, functional quality, regression gating, baseline comparison, model-aware testing, and evidence-backed rollout decisions — all through binary yes/no criteria with external evaluators.

Tools (2)

criterion_resultsexperiments

Topics

agent-evalai-evaluationbinary-criteriaclaude-codeeval-harnessllm-evalmcpplugin-testingregression-testingskill-evaluation