Benchmark cards
Structured benchmark cases with scope, constraints, expected evidence, and reproducibility notes.
COG Eval
COG Eval may later be deployed on its own domain as a separate product. This static page only previews the product direction.
Compact evaluation records for models, agents, workflows, and task-specific behavior.
Execution traces, logs, artifacts, and review notes connected to each evaluation case.
A future pathway for comparing repeated runs and supporting independent verification.
Side-by-side analysis for model, agent, and human-AI workflow performance.