v0.1.0 now on PyPI

Ship agents you can actually trust.

Evaluation, observability, and regression testing for LangGraph, CrewAI, and AutoGen agents — in production.

pip install cortexops
View on GitHub
quickstart.py
from cortexops import CortexTracer, EvalSuite
 
# instrument in one line — no refactor required
tracer = CortexTracer(project="payments-agent")
graph = tracer.wrap(your_langgraph_app)
 
# run golden dataset evals
results = EvalSuite.run(
    dataset="golden_v1.yaml",
    agent=graph,
    fail_on="task_completion < 0.90"
)
print(results.summary())
Designed for engineers at fintech payments, enterprise SaaS, AI labs, and legaltech platforms.

Everything your agent needs in production

One SDK. Zero refactoring. Catches the failures you didn't know you had.

Golden dataset evals
Define test cases in YAML with expected tool calls, output keywords, and latency budgets. Rule-based and LLM-as-judge scoring.
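For illustration, a golden case might look like the sketch below. Only `id`, `expected_tool_calls`, and `max_latency_ms` appear in the snippets on this page; `input`, `expected_keywords`, and the `judge` block are assumed field names, not confirmed schema.

```yaml
cases:
  - id: refund_lookup_01
    input: "Where is my refund for order 4512?"   # assumed field
    expected_tool_calls:
      - lookup_refund
    expected_keywords: ["refund", "4512"]          # assumed field
    max_latency_ms: 3000
  - id: dispute_summary_01
    input: "Summarize this dispute for the agent."
    judge:                                         # assumed LLM-as-judge config
      criteria: "Summary mentions the disputed amount and the merchant."
```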
Full trace observability
Every node, tool call, and LLM turn captured. Replay any failure, diff two runs, and get root cause in seconds.
CI eval gate
GitHub Actions integration. Block PRs when task_completion drops below your threshold. No silent regressions.
LLM-as-judge scoring
GPT-4o scores open-ended outputs against natural language criteria. Falls back to heuristics if the API is unavailable.
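The judge-with-fallback pattern can be sketched as follows. This is not the CortexOps API; `judge`, `heuristic_score`, and the keyword heuristic are illustrative assumptions about how such a fallback might work.

```python
# Sketch of LLM-as-judge scoring with a heuristic fallback.
# Names and signatures here are assumptions, not CortexOps APIs.

def heuristic_score(output: str, criteria_keywords: list[str]) -> float:
    """Fallback: fraction of expected keywords present in the output."""
    if not criteria_keywords:
        return 0.0
    hits = sum(1 for kw in criteria_keywords if kw.lower() in output.lower())
    return hits / len(criteria_keywords)

def judge(output: str, criteria: str, criteria_keywords: list[str],
          llm_judge=None) -> float:
    """Score with an LLM judge callable if one is given and reachable;
    otherwise fall back to the keyword heuristic."""
    if llm_judge is not None:
        try:
            return float(llm_judge(output, criteria))  # expected to return 0..1
        except Exception:
            pass  # judge API unavailable: fall through to heuristic
    return heuristic_score(output, criteria_keywords)

score = judge("Refund of $42 was issued.", "Mentions the refund amount",
              ["refund", "$42"])
print(score)  # 1.0 (both keywords found)
```

The key design point is that a judge outage degrades scoring quality rather than failing the eval run.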
Slack + webhook alerts
Get notified the moment task completion drops or regressions appear. Integrates with PagerDuty and any webhook endpoint.
Prompt versioning
Git-style prompt version history with unified diff. Know exactly what changed between an eval passing and failing.
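The unified diff between two prompt versions is the same kind of output Python's stdlib `difflib` produces; the prompt strings below are made up for illustration.

```python
# Sketch: unified diff between two prompt versions, the kind of output
# a git-style prompt history surfaces. Stdlib only; prompts are examples.
import difflib

v1 = "You are a payments support agent.\nAlways look up the refund first.\n"
v2 = "You are a payments support agent.\nEscalate disputes over $500.\n"

diff = difflib.unified_diff(
    v1.splitlines(keepends=True),
    v2.splitlines(keepends=True),
    fromfile="prompt@v1",
    tofile="prompt@v2",
)
print("".join(diff))
```

The `-`/`+` lines pinpoint exactly which instruction changed between a passing and a failing eval.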
Live dashboard

See exactly why your agent failed

Trace-level scoring with actionable root cause — not just a pass/fail.

payments-agent — eval run #47
run_id: f028a2ba • 9 cases • 8 passed • 1 regression

Task completion: 91.4% (+2.1% vs last run)
Tool accuracy: 97 (+5 pts)
Latency p95: 1.8s (+0.3s)
Regressions: 1 vs baseline

Case scores:
refund_lookup_v2 — 96 (pass)
dispute_classify — 78 (warn)
escalation_router — 41 (fail)
balance_check — 99 (pass)

Root cause detected on escalation_router:
tool_call_mismatch → route_escalation returned null
How it works

From zero to eval gate in minutes

No refactoring. No new infrastructure. Just wrap and run.

01
Instrument your agent
Wrap your existing LangGraph or CrewAI app with one line. CortexTracer auto-detects the framework and captures every node, tool call, and LLM turn.
from cortexops import CortexTracer
tracer = CortexTracer("my-agent")
graph = tracer.wrap(your_graph)
02
Define golden cases
Write test cases in YAML. Specify expected tool calls, output keywords, and latency budgets per case. Use LLM judge for open-ended quality checks.
cases:
  - id: refund_lookup_01
    expected_tool_calls:
      - lookup_refund
    max_latency_ms: 3000
03
Gate your deploys
Add the eval gate to GitHub Actions. PRs are blocked if task completion drops below your threshold. No more silent regressions reaching production.
- name: Eval gate
  run: cortexops eval run --fail-on "task_completion < 0.90"
Pricing

Start free. Scale with your agents.

No credit card required. Free tier is generous on purpose.

Free
$0
For solo engineers and side projects. Full SDK, unlimited local evals.
+ Full SDK — pip install cortexops
+ Unlimited local eval runs
+ Golden dataset YAML format
+ GitHub Actions CI gate
+ CLI tool
Get started
Enterprise
Custom
For compliance-driven teams. VPC deployment, SSO, custom SLA.
+ Everything in Pro
+ VPC / on-prem deployment
+ SSO / SAML
+ Custom data retention
+ Dedicated Slack support
+ SLA guarantee
Talk to us