liodon-ai/semanticwer

🔥 SemanticWER

Evaluation framework for speech-to-LLM systems.

PyPI version Python 3.9+ License: MIT


Classic Word Error Rate (WER) measures token accuracy. But modern pipelines look like this:

Speech → ASR → LLM → Task (QA, summarization, agents, RAG)

A 20% WER transcript can preserve meaning — or completely break downstream reasoning. WER cannot tell the difference.

SemanticWER fixes this with a four-component composite score:

SemanticWER = w₁·L + w₂·E + w₃·S + w₄·T
| Component | What it measures |
|-----------|------------------|
| L — Lexical | Standard WER + CER (NIST-compatible) |
| E — Entity | Named entity preservation (PERSON, ORG, DATE, …) |
| S — Semantic | Embedding cosine similarity (SBERT) |
| T — Task | Downstream task success delta |

Lower score = better transcript quality.
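The composite can be read as a weighted sum of error terms. A minimal sketch of the idea, assuming similarity-style components (entity F1, semantic similarity, task score) are folded in as 1 − value and that a missing task score renormalizes the remaining weights — the library's exact normalization may differ:

```python
def composite_score(weights, wer, entity_f1, semantic_sim, task_score=None):
    """Weighted sum of error terms in [0, 1]; lower is better.

    Illustrative only: assumes similarity-style components are converted
    to errors as 1 - value, and that when task_score is None its weight
    is redistributed over the remaining components.
    """
    w1, w2, w3, w4 = weights
    terms = [
        (w1, min(wer, 1.0)),        # L: lexical error, capped at 1.0
        (w2, 1.0 - entity_f1),      # E: entity error
        (w3, 1.0 - semantic_sim),   # S: semantic error
    ]
    if task_score is not None:
        terms.append((w4, 1.0 - task_score))  # T: task error
    total_w = sum(w for w, _ in terms)
    return sum(w * e for w, e in terms) / total_w

# With the Quick Start numbers (task component absent):
score = composite_score((0.3, 0.2, 0.3, 0.2),
                        wer=0.1429, entity_f1=0.8, semantic_sim=0.8923)
```

A perfect transcript (WER 0, F1 1, similarity 1) scores exactly 0 under this scheme, and a maximally bad one scores 1.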


Installation

# Minimal (WER/CER + regex NER + Jaccard semantic fallback)
pip install semanticwer

# Recommended (full features)
pip install "semanticwer[full]"
python -m spacy download en_core_web_sm

Quick Start

from semanticwer import SemanticWER

metric = SemanticWER()  # defaults: weights=(0.3, 0.2, 0.3, 0.2)

result = metric(
    reference="The patient was prescribed 50mg of metformin twice daily",
    hypothesis="The patient was prescribed 15mg of metformin twice daily",
)

print(result.summary())
# ====================================================
#   SemanticWER Result
# ====================================================
#   Composite Score  : 0.3241  (lower = better)
# ----------------------------------------------------
#   [L] Lexical      : WER=0.1429  CER=0.0541  (w=0.30)
#   [E] Entity       : F1=0.8000  Recall=0.6667  (w=0.20)
#   [S] Semantic     : Sim=0.8923  (w=0.30)
#   [T] Task         : N/A  (w=0.20)
# ====================================================

print(result.wer)           # 0.1429
print(result.semantic_sim)  # 0.8923
print(result.entity_f1)     # 0.8000
print(result.score)         # 0.3241

torchmetrics-Style API

metric = SemanticWER(weights=(0.3, 0.2, 0.3, 0.2))

# Accumulate samples
for ref, hyp in dataset:
    metric.update(ref, hyp)

# Compute over full corpus
result = metric.aggregate()
print(f"Corpus SemanticWER: {result.score:.4f}")
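For the lexical component, corpus-level WER is typically micro-averaged — total edit operations divided by total reference words — so short utterances are not over-weighted. Whether `aggregate()` micro- or macro-averages is an assumption here; a self-contained sketch of micro-averaged corpus WER:

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance via a single-row DP table."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # min of deletion, insertion, substitution/match
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def corpus_wer(pairs):
    """Micro-averaged WER: total edits / total reference words."""
    edits = words = 0
    for ref, hyp in pairs:
        r, h = ref.split(), hyp.split()
        edits += edit_distance(r, h)
        words += len(r)
    return edits / max(words, 1)
```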

HuggingFace evaluate-Style API

result = metric.compute(
    predictions=hypotheses,
    references=references,
)

Task Utility: The Game-Changer

Connect SemanticWER to your actual downstream task:

Built-in: ROUGE

from semanticwer import SemanticWER
from semanticwer.modules.task import TaskModule

metric = SemanticWER(
    weights=(0.25, 0.25, 0.25, 0.25),
    task_fn=TaskModule.rouge_adapter("rougeL"),
)
result = metric(ref, hyp)
print(result.task_score)  # 0.0–1.0

Built-in: Token F1 (SQuAD-style QA)

metric = SemanticWER(
    task_fn=TaskModule.f1_token_adapter(),
    weights=(0.25, 0.25, 0.25, 0.25),
)
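The SQuAD-style token F1 behind this adapter is simple enough to sketch — the library's own tokenization and normalization rules may differ:

```python
from collections import Counter

def token_f1(reference: str, hypothesis: str) -> float:
    """Token-overlap F1 between two strings, in [0.0, 1.0]."""
    ref_tokens = reference.lower().split()
    hyp_tokens = hypothesis.lower().split()
    if not ref_tokens or not hyp_tokens:
        return float(ref_tokens == hyp_tokens)
    # Multiset intersection counts each shared token at most min(ref, hyp) times
    overlap = sum((Counter(ref_tokens) & Counter(hyp_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```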

Custom: Any callable

def my_qa_eval(reference: str, hypothesis: str) -> float:
    """Return 1.0 if hypothesis preserves the answer to our question."""
    ref_answer = qa_model(question="Who was mentioned?", context=reference)
    hyp_answer = qa_model(question="Who was mentioned?", context=hypothesis)
    return 1.0 if ref_answer == hyp_answer else 0.0

metric = SemanticWER(
    task_fn=my_qa_eval,
    weights=(0.2, 0.2, 0.3, 0.3),
)

Custom: LLM-as-judge

import anthropic

client = anthropic.Anthropic()

def llm_judge(reference: str, hypothesis: str) -> float:
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": (
                f"Score semantic equivalence 0.0–1.0 (1.0 = identical meaning).\n"
                f"REF: {reference}\nHYP: {hypothesis}\n"
                f"Respond with only a float."
            ),
        }],
    )
    text = response.content[0].text.strip()
    try:
        return max(0.0, min(1.0, float(text)))
    except ValueError:
        return 0.0  # treat an unparseable judge reply as task failure

metric = SemanticWER(
    task_fn=TaskModule.llm_judge_adapter(llm_judge),
    weights=(0.2, 0.2, 0.3, 0.3),
)

NER Backend Selection

# spaCy (default, best accuracy for English)
metric = SemanticWER(ner_backend="spacy")

# HuggingFace transformers pipeline
metric = SemanticWER(ner_backend="hf")

# Lightweight regex (no extra deps)
metric = SemanticWER(ner_backend="regex")

# Disable entity scoring
metric = SemanticWER(ner_backend="none")
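The regex backend trades accuracy for zero dependencies. Its actual patterns are internal to the library; a hypothetical sketch of the idea — match entity-like spans with patterns, then score their overlap as F1:

```python
import re

# Illustrative patterns only; the library's regex backend and its
# entity categories may differ.
PATTERNS = {
    "NUMBER": r"\b\d+(?:\.\d+)?\s?(?:mg|am|pm|%)?\b",  # dosages, times, percents
    "PROPER": r"\b[A-Z][a-z]+\b",                      # capitalized words
}

def extract_entities(text: str) -> set:
    found = set()
    for label, pattern in PATTERNS.items():
        for match in re.findall(pattern, text):
            found.add((label, match))
    return found

def entity_f1(reference: str, hypothesis: str) -> float:
    ref, hyp = extract_entities(reference), extract_entities(hypothesis)
    if not ref and not hyp:
        return 1.0  # nothing to preserve
    tp = len(ref & hyp)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(hyp), tp / len(ref)
    return 2 * precision * recall / (precision + recall)
```

On the CLI example below ("John Smith called at 3pm" vs "Tom Jones called at 9am"), every matched entity differs, so this sketch scores 0.0.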

CLI

# Single pair
semanticwer --ref "John Smith called at 3pm" --hyp "Tom Jones called at 9am"

# Files (one sentence per line)
semanticwer --ref ref.txt --hyp hyp.txt

# With ROUGE task scoring
semanticwer --ref ref.txt --hyp hyp.txt --task rouge

# JSON output (for pipelines)
semanticwer --ref ref.txt --hyp hyp.txt --output json

# Custom weights
semanticwer --ref ref.txt --hyp hyp.txt --weights 0.4 0.2 0.3 0.1

# CSV output
semanticwer --ref ref.txt --hyp hyp.txt --output csv

Result Object

result = metric(ref, hyp)

result.score            # Composite SemanticWER [0, 1]
result.wer              # Classic WER
result.cer              # Character Error Rate
result.entity_f1        # Entity F1 score
result.entity_recall    # Entity recall
result.semantic_sim     # Cosine similarity [0, 1]
result.task_score       # Task utility score (or None)

result.to_dict()        # Full breakdown as dict
result.to_json()        # Full breakdown as JSON string
result.summary()        # Human-readable table

Reproducibility / Custom Weights

Weights must sum to 1.0. Recommended presets:

| Use case | Weights (L, E, S, T) |
|----------|----------------------|
| General ASR evaluation | (0.3, 0.2, 0.3, 0.2) |
| Medical / legal (entity-critical) | (0.2, 0.4, 0.2, 0.2) |
| LLM pipeline (task-first) | (0.15, 0.15, 0.3, 0.4) |
| Backward-compatible WER | (1.0, 0.0, 0.0, 0.0) |
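The sum-to-1.0 constraint is easy to check before constructing the metric. The preset names here are illustrative, not library constants:

```python
import math

# Illustrative names for the preset table above (not library constants)
PRESETS = {
    "general_asr":     (0.3, 0.2, 0.3, 0.2),
    "entity_critical": (0.2, 0.4, 0.2, 0.2),
    "task_first":      (0.15, 0.15, 0.3, 0.4),
    "wer_compatible":  (1.0, 0.0, 0.0, 0.0),
}

for name, weights in PRESETS.items():
    # Use isclose rather than == to tolerate float rounding
    assert math.isclose(sum(weights), 1.0), f"{name} weights must sum to 1.0"
```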

Citation

If you use SemanticWER in research, please cite:

@software{semanticwer2024,
  title     = {SemanticWER: Meaning-Aware ASR Evaluation Toolkit},
  year      = {2024},
  url       = {https://github.com/semanticwer/semanticwer},
}

License

MIT
