SiluPanda/label-score


label-score

Inter-annotator agreement metrics for JavaScript and TypeScript. Zero dependencies.



Description

label-score computes chance-corrected agreement metrics for annotation data -- labels assigned to items by multiple annotators (human or machine). It implements five standard inter-annotator agreement (IAA) metrics, each returning a structured result object with the computed value, an interpretation label, and all intermediate quantities (observed agreement, expected agreement, category lists, annotator/item counts).

Use cases include validating human annotation quality before model training, measuring LLM-as-judge consistency across multiple model evaluators, building gold-standard evaluation datasets, and reporting IAA in research papers. All computations are implemented in pure TypeScript with zero runtime dependencies. Numerical outputs are verified against established Python implementations (scikit-learn, NLTK, krippendorff).


Installation

npm install label-score

Quick Start

import {
  cohenKappa,
  fleissKappa,
  scottPi,
  krippendorffAlpha,
  gwetAC1,
} from 'label-score';

// Cohen's Kappa -- two raters, categorical labels
const kappa = cohenKappa(['A', 'B', 'C', 'A'], ['A', 'B', 'A', 'A']);
console.log(kappa.value);          // 0.5555...
console.log(kappa.interpretation); // 'moderate'

// Fleiss' Kappa -- multiple raters via category-count matrix
const fleiss = fleissKappa([
  [4, 0, 0],  // all 4 raters chose category 0
  [0, 3, 1],  // 3 raters chose category 1, 1 chose category 2
  [0, 0, 4],  // all 4 raters chose category 2
]);
console.log(fleiss.value);          // 0.7446...
console.log(fleiss.annotatorCount); // 4

// Scott's Pi -- two raters, joint marginals
const pi = scottPi(['A', 'B', 'C'], ['A', 'B', 'A']);
console.log(pi.value);

// Krippendorff's Alpha -- multiple raters, handles missing data
const alpha = krippendorffAlpha([
  ['A', null, 'C', 'A'],
  ['A', 'B',  'C', null],
]);
console.log(alpha.value);
console.log(alpha.missingCount); // 2

// Gwet's AC1 -- robust to prevalence effects
const ac1 = gwetAC1(['Y', 'Y', 'Y', 'N'], ['Y', 'Y', 'N', 'N']);
console.log(ac1.value);

Features

  • Five agreement metrics -- Cohen's Kappa, Fleiss' Kappa, Scott's Pi, Krippendorff's Alpha, and Gwet's AC1.
  • Weighted kappa -- Linear and quadratic weighting schemes for ordinal data in Cohen's Kappa.
  • Missing data support -- Krippendorff's Alpha excludes or rejects missing annotations via configuration.
  • Multiple measurement levels -- Nominal, ordinal, interval, and ratio distance functions for Krippendorff's Alpha.
  • Automatic interpretation -- Every result includes a human-readable interpretation label based on published scales (Landis & Koch for kappa-family; Krippendorff's thresholds for alpha).
  • Input validation -- Descriptive errors for mismatched array lengths, empty inputs, inconsistent matrices, and insufficient annotators.
  • Duplicate detection -- Utility to find repeated (item, annotator) pairs in annotation triple data.
  • Full TypeScript support -- All functions, options, and result types are exported and fully typed.
  • Zero dependencies -- Pure TypeScript, no runtime dependencies.

API Reference

Metric Functions

cohenKappa(rater1, rater2, options?)

Computes Cohen's Kappa for two raters. Supports unweighted (nominal) and weighted (ordinal) variants.

Parameters:

Parameter Type Description
rater1 Label[] Labels assigned by the first rater.
rater2 Label[] Labels assigned by the second rater. Must have the same length as rater1.
options CohensKappaOptions Optional. Configuration for weighting and confidence intervals.

CohensKappaOptions:

Field Type Default Description
weighted boolean false Enable weighted kappa for ordinal data.
weights 'linear' | 'quadratic' 'linear' Weight scheme when weighted is true.
ci boolean -- Reserved for future confidence interval support.
ciLevel number -- Confidence level (e.g., 0.95).
ciBootstrapSamples number -- Number of bootstrap resamples.
seed number -- Seed for reproducible bootstrap sampling.

Returns: KappaResult

Field Type Description
metric MetricName Always 'cohens-kappa'.
value number The computed kappa coefficient.
observed number Observed agreement proportion (Po).
expected number Expected agreement by chance (Pe).
interpretation Interpretation Landis & Koch interpretation label.
categories Label[] Sorted list of unique categories found in the data.
ci ConfidenceInterval Optional. Confidence interval if requested.

Example:

// Unweighted kappa
const result = cohenKappa(['A', 'B', 'C'], ['A', 'B', 'A']);
// { metric: 'cohens-kappa', value: 0.5, observed: 0.6666..., ... }

// Weighted kappa for ordinal ratings
const weighted = cohenKappa([1, 2, 3, 4], [1, 2, 4, 4], {
  weighted: true,
  weights: 'quadratic',
});

fleissKappa(matrix)

Computes Fleiss' Kappa for multiple raters (N >= 2).

Parameters:

Parameter Type Description
matrix number[][] Category-count matrix. matrix[i][j] is the number of raters who assigned category j to subject i. All rows must sum to the same value (the number of raters per subject).

Returns: FleissKappaResult

Field Type Description
metric 'fleiss-kappa' Always 'fleiss-kappa'.
value number The computed Fleiss' Kappa coefficient.
observed number Mean observed agreement across subjects.
expected number Expected agreement by chance.
interpretation Interpretation Landis & Koch interpretation label.
annotatorCount number Number of raters per subject (row sum).
itemCount number Number of subjects (rows).
ci ConfidenceInterval Optional. Confidence interval if requested.

Example:

const result = fleissKappa([
  [4, 0, 0],
  [0, 3, 1],
  [0, 0, 4],
  [1, 3, 0],
]);
// result.value: 0.6235...
// result.annotatorCount: 4
// result.itemCount: 4
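The observed agreement above is the mean of a per-subject agreement score. A minimal sketch of that score, as an illustrative reimplementation rather than the package's internals:

```typescript
// Per-subject agreement behind Fleiss' Kappa:
// P_i = (Σ_j n_ij² − n) / (n(n − 1)), where n is the number of raters per subject.
function subjectAgreement(row: number[]): number {
  const n = row.reduce((a, b) => a + b, 0);         // raters per subject
  const sumSq = row.reduce((a, b) => a + b * b, 0); // Σ_j n_ij²
  return (sumSq - n) / (n * (n - 1));
}

console.log(subjectAgreement([4, 0, 0])); // 1   -- unanimous
console.log(subjectAgreement([0, 3, 1])); // 0.5 -- split 3/1
```

The `observed` field is the mean of this quantity over all rows.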

Throws:

  • If the matrix is empty.
  • If rows have inconsistent lengths.
  • If row sums are not equal.
  • If fewer than 2 raters per subject.
  • If fewer than 2 categories.

scottPi(rater1, rater2, options?)

Computes Scott's Pi for two raters. Uses joint (pooled) marginal proportions to compute expected agreement, making it more robust than Cohen's Kappa when rater biases differ.

Parameters:

Parameter Type Description
rater1 Label[] Labels assigned by the first rater.
rater2 Label[] Labels assigned by the second rater. Must have the same length as rater1.
options ScottsPiOptions Optional. Reserved for future confidence interval support.

ScottsPiOptions:

Field Type Default Description
ci boolean -- Reserved for future confidence interval support.
ciLevel number -- Confidence level.
ciBootstrapSamples number -- Number of bootstrap resamples.
seed number -- Seed for reproducible bootstrap sampling.

Returns: PiResult

Field Type Description
metric 'scotts-pi' Always 'scotts-pi'.
value number The computed Pi coefficient.
observed number Observed agreement proportion.
expected number Expected agreement from joint marginals.
interpretation Interpretation Landis & Koch interpretation label.
categories Label[] Sorted list of unique categories.
ci ConfidenceInterval Optional. Confidence interval if requested.

Example:

const result = scottPi(['A', 'B', 'A', 'C'], ['A', 'B', 'B', 'C']);
// result.value: 0.6190...
// result.observed: 0.75

krippendorffAlpha(matrix, options?)

Computes Krippendorff's Alpha for multiple raters with support for missing data and multiple measurement levels.

Parameters:

Parameter Type Description
matrix (string | number | null | undefined)[][] Rater-by-item matrix. matrix[r][c] is the label assigned by rater r to item c. Use null or undefined for missing annotations.
options KrippendorffOptions Optional. Configuration for measurement level and missing data handling.

KrippendorffOptions:

Field Type Default Description
level MeasurementLevel 'nominal' Measurement level: 'nominal', 'ordinal', 'interval', or 'ratio'. Determines the disagreement function.
missingData 'exclude' | 'error' 'exclude' How to handle missing values. 'exclude' skips them; 'error' throws.
ci boolean -- Reserved for future confidence interval support.
ciLevel number -- Confidence level.
ciBootstrapSamples number -- Number of bootstrap resamples.
seed number -- Seed for reproducible bootstrap sampling.

Returns: AlphaResult

Field Type Description
metric 'krippendorff-alpha' Always 'krippendorff-alpha'.
value number The computed alpha coefficient.
interpretation AlphaInterpretation Krippendorff interpretation: 'unreliable', 'tentative', or 'reliable'.
level MeasurementLevel The measurement level used.
itemCount number Number of items (columns).
annotatorCount number Number of raters (rows).
missingCount number Total number of missing annotations.
ci ConfidenceInterval Optional. Confidence interval if requested.

Example:

// Nominal data with missing values
const result = krippendorffAlpha(
  [
    ['A', null, 'C', 'A'],
    ['A', 'B',  'C', null],
    ['B', 'B',  'C', 'A'],
  ],
);
// result.value: 0.7272...
// result.missingCount: 2
// result.annotatorCount: 3

// Interval-level numeric data
const interval = krippendorffAlpha(
  [
    [1, 2, 3, 4],
    [1, 2, 4, 4],
  ],
  { level: 'interval' },
);
// interval.level: 'interval'

Disagreement functions by measurement level:

Level Function Description
nominal d = v === v' ? 0 : 1 Binary: same or different.
ordinal d = v === v' ? 0 : 1 Treated as nominal (rank-based extension planned).
interval d = (v - v')^2 Squared numeric difference.
ratio d = (v - v')^2 Squared numeric difference.
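The table above can be sketched as a single function. `Level` and `disagreement` are hypothetical names for illustration; the package applies the equivalent logic internally based on the `level` option.

```typescript
type Level = 'nominal' | 'ordinal' | 'interval' | 'ratio';

// Disagreement between two values v and w at a given measurement level.
function disagreement(v: string | number, w: string | number, level: Level): number {
  if (level === 'nominal' || level === 'ordinal') {
    return v === w ? 0 : 1;             // binary: same or different
  }
  return (Number(v) - Number(w)) ** 2;  // squared numeric difference
}

console.log(disagreement('A', 'B', 'nominal')); // 1
console.log(disagreement(2, 4, 'interval'));    // 4
```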

gwetAC1(rater1, rater2, options?)

Computes Gwet's AC1 for two raters. Designed to be robust to the prevalence and bias paradox that causes Cohen's Kappa to produce misleadingly low values when one category dominates.

Parameters:

Parameter Type Description
rater1 Label[] Labels assigned by the first rater.
rater2 Label[] Labels assigned by the second rater. Must have the same length as rater1.
options AC1Options Optional. Reserved for future confidence interval support.

AC1Options:

Field Type Default Description
ci boolean -- Reserved for future confidence interval support.
ciLevel number -- Confidence level.
ciBootstrapSamples number -- Number of bootstrap resamples.
seed number -- Seed for reproducible bootstrap sampling.

Returns: AC1Result

Field Type Description
metric 'gwets-ac1' Always 'gwets-ac1'.
value number The computed AC1 coefficient.
observed number Observed agreement proportion.
expected number Gwet's expected agreement by chance.
interpretation Interpretation Landis & Koch interpretation label.
ci ConfidenceInterval Optional. Confidence interval if requested.

Example:

const result = gwetAC1(
  ['Y', 'Y', 'Y', 'Y', 'Y', 'N'],
  ['Y', 'Y', 'Y', 'Y', 'N', 'N'],
);
// result.value: 0.7333...
// result.observed: 0.8333...
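For intuition, Gwet's chance agreement in the standard AC1 formulation is Pe = (1 / (q − 1)) Σ_k π_k(1 − π_k), where π_k is the mean of the two raters' marginal proportions for category k and q is the number of categories. A self-contained sketch (an illustrative reimplementation, not the package source):

```typescript
// Gwet's expected agreement for two raters (assumes at least two categories).
function gwetExpected(r1: string[], r2: string[]): number {
  const cats = Array.from(new Set([...r1, ...r2]));
  const n = r1.length;
  let pe = 0;
  for (const c of cats) {
    const p1 = r1.filter((x) => x === c).length / n;
    const p2 = r2.filter((x) => x === c).length / n;
    const pi = (p1 + p2) / 2; // mean marginal proportion for category c
    pe += pi * (1 - pi);
  }
  return pe / (cats.length - 1);
}

console.log(gwetExpected(
  ['Y', 'Y', 'Y', 'Y', 'Y', 'N'],
  ['Y', 'Y', 'Y', 'Y', 'N', 'N'],
)); // ≈ 0.375
```

Because π(1 − π) shrinks as one category dominates, Pe stays low on skewed data, which is what makes AC1 prevalence-robust.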

Interpretation Functions

interpretKappa(value)

Classifies a kappa-family metric value using the Landis & Koch (1977) scale.

Parameters:

Parameter Type Description
value number The kappa coefficient to interpret.

Returns: Interpretation -- one of 'poor', 'slight', 'fair', 'moderate', 'substantial', 'almost-perfect'.

Scale:

Range Interpretation
< 0.00 'poor'
0.00 -- 0.20 'slight'
0.20 -- 0.40 'fair'
0.40 -- 0.60 'moderate'
0.60 -- 0.80 'substantial'
>= 0.80 'almost-perfect'
interpretKappa(0.75); // 'substantial'
interpretKappa(-0.1); // 'poor'
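The scale above as code. This is an illustrative reimplementation (boundary values assumed lower-inclusive, matching the table's `>= 0.80` top bin); `interpretKappa` itself ships with the package.

```typescript
function landisKoch(value: number): string {
  if (value < 0) return 'poor';
  if (value < 0.2) return 'slight';
  if (value < 0.4) return 'fair';
  if (value < 0.6) return 'moderate';
  if (value < 0.8) return 'substantial';
  return 'almost-perfect';
}

console.log(landisKoch(0.75)); // 'substantial'
```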

interpretAlpha(value)

Classifies a Krippendorff's Alpha value using Krippendorff's recommended thresholds.

Parameters:

Parameter Type Description
value number The alpha coefficient to interpret.

Returns: AlphaInterpretation -- one of 'unreliable', 'tentative', 'reliable'.

Scale:

Range Interpretation
< 0.667 'unreliable'
0.667 -- 0.800 'tentative'
>= 0.800 'reliable'
interpretAlpha(0.85); // 'reliable'
interpretAlpha(0.70); // 'tentative'

Validation Functions

assertEqualLength(a, b, label?)

Throws if arrays a and b do not have the same length.

assertEqualLength([1, 2], [3, 4]);       // passes
assertEqualLength([1, 2], [3], 'raters'); // throws: "Arrays must have equal length: got 2 and 1 raters"

assertNonEmpty(arr, label?)

Throws if arr is empty.

assertNonEmpty([1, 2, 3]);     // passes
assertNonEmpty([], 'subjects'); // throws: "Array must be non-empty subjects"

assertConsistentRowLengths(matrix)

Throws if the rows of a 2D matrix do not all have the same length.

assertConsistentRowLengths([[1, 2], [3, 4]]);    // passes
assertConsistentRowLengths([[1, 2], [3, 4, 5]]); // throws: "Matrix has inconsistent row lengths"

assertConstantRowSums(matrix)

Throws if rows of a numeric matrix do not all sum to the same value (within floating-point tolerance of 1e-9).

assertConstantRowSums([[1, 2, 3], [2, 2, 2]]); // passes (both sum to 6)
assertConstantRowSums([[1, 2], [3, 4]]);        // throws: "Category-count matrix rows must have equal sums"

assertMinAnnotators(count)

Throws if count is less than 2.

assertMinAnnotators(3); // passes
assertMinAnnotators(1); // throws: "At least 2 annotators required, got 1"

detectDuplicates(triples)

Returns (item, annotator) pairs that appear more than once in an array of annotation triples. Each duplicate pair is returned only once regardless of how many times it appears.

Parameters:

Parameter Type Description
triples Array<{ item: unknown; annotator: unknown }> Array of annotation triples.

Returns: Array<{ item: unknown; annotator: unknown }> -- the duplicate pairs.

const dupes = detectDuplicates([
  { item: 1, annotator: 'A' },
  { item: 1, annotator: 'A' },
  { item: 2, annotator: 'B' },
]);
// [{ item: 1, annotator: 'A' }]

Types

All TypeScript types are exported from the package entry point.

Core Types

Type Description
Label string | number -- a single annotation label.
MeasurementLevel 'nominal' | 'ordinal' | 'interval' | 'ratio'
MetricName 'cohens-kappa' | 'fleiss-kappa' | 'krippendorff-alpha' | 'scotts-pi' | 'gwets-ac1' | 'percent-agreement'
Interpretation 'poor' | 'slight' | 'fair' | 'moderate' | 'substantial' | 'almost-perfect'
AlphaInterpretation 'unreliable' | 'tentative' | 'reliable'

Data Types

Type Description
AnnotationTriple { item: string | number; annotator: string | number; label: Label }
ConfusionMatrix { labels: Label[]; matrix: number[][] } -- matrix[i][j] = count where rater 1 said labels[i] and rater 2 said labels[j].
ConfidenceInterval { lower: number; upper: number; level: number }

Result Types

Type Description
KappaResult Result from cohenKappa. Fields: metric, value, observed, expected, interpretation, categories?, ci?.
FleissKappaResult Result from fleissKappa. Fields: metric, value, observed, expected, interpretation, annotatorCount, itemCount, ci?.
PiResult Result from scottPi. Fields: metric, value, observed, expected, interpretation, categories?, ci?.
AlphaResult Result from krippendorffAlpha. Fields: metric, value, interpretation, level, itemCount, annotatorCount, missingCount, ci?.
AC1Result Result from gwetAC1. Fields: metric, value, observed, expected, interpretation, ci?.
AgreementReport { metric: MetricName; value: number; interpretation: Interpretation | AlphaInterpretation; ci?: ConfidenceInterval }

Options Types

Type Description
CohensKappaOptions { weighted?, weights?, ci?, ciLevel?, ciBootstrapSamples?, seed? }
FleissKappaOptions { ci?, ciLevel?, ciBootstrapSamples?, seed? }
KrippendorffOptions { level?, ci?, ciLevel?, ciBootstrapSamples?, seed?, missingData? }
ScottsPiOptions { ci?, ciLevel?, ciBootstrapSamples?, seed? }
AC1Options { ci?, ciLevel?, ciBootstrapSamples?, seed? }
AgreementOptions { metric?, level?, missingData?, ci?, ciLevel?, seed? }
CIOptions { level?, bootstrapSamples?, seed? }

Configuration

Each metric function accepts an optional options object as its last parameter. All options fields are optional and have sensible defaults.

Cohen's Kappa weighting:

// Unweighted (default) -- for nominal/categorical data
cohenKappa(rater1, rater2);

// Linear weights -- penalizes disagreements proportionally to distance
cohenKappa(rater1, rater2, { weighted: true, weights: 'linear' });

// Quadratic weights -- penalizes far disagreements more heavily
cohenKappa(rater1, rater2, { weighted: true, weights: 'quadratic' });

Krippendorff's Alpha measurement level:

// Nominal (default) -- categories with no order
krippendorffAlpha(matrix);

// Interval -- numeric data with meaningful distances
krippendorffAlpha(matrix, { level: 'interval' });

// Ratio -- numeric data with a true zero
krippendorffAlpha(matrix, { level: 'ratio' });

Missing data handling:

// Exclude missing values (default)
krippendorffAlpha(matrix, { missingData: 'exclude' });

// Throw an error if any value is missing
krippendorffAlpha(matrix, { missingData: 'error' });

Error Handling

All metric functions validate their inputs and throw descriptive Error instances on invalid data.

Condition Error Message
Empty input array "Array must be non-empty"
Rater arrays of different lengths "Arrays must have equal length: got X and Y"
Matrix rows with inconsistent lengths "Matrix has inconsistent row lengths"
Matrix rows with different sums "Category-count matrix rows must have equal sums"
Fewer than 2 annotators "At least 2 annotators required, got N"
Fewer than 2 categories (Fleiss) "Fleiss Kappa requires at least 2 categories"
Missing data with missingData: 'error' "Missing data found at rater R, item C"

All errors are synchronous and thrown immediately during input validation, before any computation begins.
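Since every error is a plain synchronous Error, a try/catch around the call site is sufficient. The sketch below uses an inline stand-in that mirrors the documented length check so it runs on its own; with the package installed, wrap e.g. `cohenKappa` the same way.

```typescript
// Hypothetical stand-in for a metric call, throwing the documented message.
function metricCall(a: string[], b: string[]): number {
  if (a.length !== b.length) {
    throw new Error(`Arrays must have equal length: got ${a.length} and ${b.length}`);
  }
  return 0; // placeholder value
}

try {
  metricCall(['A', 'B'], ['A']);
} catch (err) {
  console.log((err as Error).message); // "Arrays must have equal length: got 2 and 1"
}
```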


Advanced Usage

Choosing Between Metrics

Scenario Recommended Metric
Two raters, nominal categories cohenKappa (unweighted)
Two raters, ordinal scale cohenKappa with weighted: true
Two raters, possible rater bias scottPi (pooled marginals)
Two raters, skewed category distribution gwetAC1 (prevalence-robust)
Three or more raters, nominal fleissKappa
Any number of raters, missing data krippendorffAlpha
Any number of raters, interval/ratio data krippendorffAlpha with level option

Comparing Cohen's Kappa, Scott's Pi, and Gwet's AC1

When both raters have identical marginal distributions, Scott's Pi and Cohen's Kappa produce the same value. They diverge when raters have different biases (e.g., one rater assigns "positive" more frequently).

// Symmetric marginals -- Pi and Kappa agree
const r1 = ['A', 'A', 'B', 'B', 'C', 'C'];
const r2 = ['A', 'B', 'A', 'B', 'C', 'C'];
scottPi(r1, r2).value;   // same as cohenKappa(r1, r2).value

// Skewed data -- AC1 is more stable than Kappa
const r1Skew = ['Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'N'];
const r2Skew = ['Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'N'];
gwetAC1(r1Skew, r2Skew).value;   // higher than cohenKappa
cohenKappa(r1Skew, r2Skew).value; // suppressed by prevalence effect

Handling Edge Cases

When all items fall into a single category (Pe = 1), the kappa formula (Po - Pe) / (1 - Pe) would produce a division by zero. In this case, all metrics return 1.0 (perfect agreement) since every item receives the same label from every rater.

const allSame = cohenKappa(['A', 'A', 'A'], ['A', 'A', 'A']);
// allSame.value === 1.0
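A minimal sketch of that guard (assumed behavior, stated by the paragraph above; not the package source):

```typescript
// Chance-corrected agreement with the degenerate case handled explicitly:
// Pe === 1 would divide by zero, so single-category data returns perfect agreement.
function chanceCorrect(po: number, pe: number): number {
  if (pe === 1) return 1.0; // everyone assigned the same label by construction
  return (po - pe) / (1 - pe);
}

console.log(chanceCorrect(1, 1));      // 1
console.log(chanceCorrect(0.75, 0.5)); // 0.5
```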

Pre-validating Annotation Data

Use the validation utilities to check data integrity before computing metrics.

import {
  assertEqualLength,
  assertNonEmpty,
  detectDuplicates,
} from 'label-score';

const triples = [
  { item: 1, annotator: 'A', label: 'pos' },
  { item: 1, annotator: 'A', label: 'neg' }, // duplicate!
  { item: 2, annotator: 'B', label: 'pos' },
];

const dupes = detectDuplicates(triples);
if (dupes.length > 0) {
  console.error('Duplicate annotations found:', dupes);
}

TypeScript

label-score is written in TypeScript and ships type declarations (dist/index.d.ts) alongside the compiled JavaScript. All public types are available as named imports.

import type {
  Label,
  KappaResult,
  FleissKappaResult,
  AlphaResult,
  PiResult,
  AC1Result,
  Interpretation,
  AlphaInterpretation,
  MeasurementLevel,
  MetricName,
  AnnotationTriple,
  ConfusionMatrix,
  ConfidenceInterval,
  CohensKappaOptions,
  FleissKappaOptions,
  KrippendorffOptions,
  ScottsPiOptions,
  AC1Options,
  AgreementOptions,
  CIOptions,
  AgreementReport,
} from 'label-score';

The package targets ES2022 and uses CommonJS module format. Compiler options include strict: true, declaration: true, and declarationMap: true for full IDE support.


License

MIT
