#ai-evaluation

2 results found

Conkurrence measures whether multiple AI models produce consistent outputs on your evaluation tasks. It tells you which items the models agree on and which need human review, using Fleiss' κ, Kendall's W, and bootstrap confidence intervals, agreement statistics that are also standard in psychometrics and clinical research.

2 months ago
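
To make the Fleiss' κ mentioned in the Conkurrence description concrete, here is a minimal Python sketch of the textbook formula. It is not Conkurrence's own code or API. The function names `fleiss_kappa` and `item_agreement` and the input layout (one row per item, one count per category) are assumptions made for illustration.

```python
from typing import List, Sequence


def item_agreement(row: Sequence[int]) -> float:
    """Fraction of rater pairs that agree on a single item (P_i)."""
    n = sum(row)  # number of raters (models) for this item
    if n < 2:
        raise ValueError("each item needs at least two ratings")
    return (sum(c * c for c in row) - n) / (n * (n - 1))


def fleiss_kappa(counts: List[Sequence[int]]) -> float:
    """Fleiss' kappa for a table of category counts.

    counts[i][j] is how many raters assigned item i to category j.
    Every item must be rated by the same number of raters.
    """
    if not counts:
        raise ValueError("counts must contain at least one item")
    n = sum(counts[0])
    if any(sum(row) != n for row in counts):
        raise ValueError("every item must have the same number of ratings")

    n_items = len(counts)
    n_categories = len(counts[0])

    # Observed agreement: mean of the per-item agreement values.
    p_bar = sum(item_agreement(row) for row in counts) / n_items

    # Chance agreement: sum of squared overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in proportions)

    if p_e == 1.0:
        # All ratings fall in one category, so kappa is undefined.
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical example: 4 items, 3 models, 2 labels, full agreement.
ratings = [[3, 0], [0, 3], [3, 0], [0, 3]]
kappa = fleiss_kappa(ratings)
```

Under these assumptions, items with a low `item_agreement` value would be the natural candidates for human review. Any cutoff is a project-specific choice, not something the formula prescribes.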

Judgmentlabs MCP Server

@Sezer Ufuk Yavuz

A Model Context Protocol (MCP) server that integrates with the Judgment API for AI evaluation workflows. It lets you manage datasets, run evaluations, and track traces directly from an MCP-compatible environment.

a year ago