Run calibration comparing judge to human-labeled examples.
Use this to validate that the LLM judge produces verdicts that align with human expectations. Target an agreement rate of 90% or higher.
Provide labeled examples with expected verdicts to calibrate the judge.
Platform token (starts with pat_)
Request schema for running calibration.
Human-labeled examples to calibrate against
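A minimal sketch of what a calibration request body might look like. The field names (`examples`, `input`, `output`, `expected_verdict`) and the verdict values are assumptions for illustration; consult the request schema for the actual shape.

```python
import json

# Hypothetical calibration payload: each human-labeled example pairs the
# judged content with the verdict a human reviewer assigned to it.
payload = {
    "examples": [
        {"input": "Summarize the ticket.",
         "output": "User cannot log in after password reset.",
         "expected_verdict": "pass"},
        {"input": "Summarize the ticket.",
         "output": "Lorem ipsum dolor sit amet.",
         "expected_verdict": "fail"},
    ]
}

print(json.dumps(payload, indent=2))
```

The request would be sent with the platform token (the `pat_`-prefixed value) in the authorization header.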
Successful Response
Aggregate calibration metrics.
True Positive Rate (sensitivity)
True Negative Rate (specificity)
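The aggregate metrics above follow the standard confusion-matrix definitions. A sketch of how TPR, TNR, and the overall agreement rate can be computed from human labels versus judge verdicts (illustrative only, not the platform's implementation):

```python
# Human-labeled expected verdicts vs. verdicts returned by the LLM judge
# (True = pass, False = fail). Sample data for illustration.
human = [True, True, True, False, False]
judge = [True, True, False, False, True]

tp = sum(h and j for h, j in zip(human, judge))          # both say pass
tn = sum(not h and not j for h, j in zip(human, judge))  # both say fail
fn = sum(h and not j for h, j in zip(human, judge))      # judge misses a pass
fp = sum(not h and j for h, j in zip(human, judge))      # judge passes a fail

tpr = tp / (tp + fn)                 # True Positive Rate (sensitivity)
tnr = tn / (tn + fp)                 # True Negative Rate (specificity)
agreement = (tp + tn) / len(human)   # overall agreement with human labels

print(f"TPR={tpr:.2f} TNR={tnr:.2f} agreement={agreement:.2f}")
# → TPR=0.67 TNR=0.50 agreement=0.60
```

An agreement rate this low would fall well short of the 90%+ target, signaling that the judge's prompt or rubric needs revision before it is trusted.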