Evaluation framework for speech-to-LLM systems.
Classic Word Error Rate (WER) measures token accuracy. But modern pipelines look like this:
Speech → ASR → LLM → Task (QA, summarization, agents, RAG)
A 20% WER transcript can preserve meaning — or completely break downstream reasoning. WER cannot tell the difference.
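To make that concrete, here is a toy word-level WER (a minimal edit-distance sketch, not the library's NIST-compatible scorer) applied to two hypotheses that score identically even though only one of them damages the meaning:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Single-row dynamic-programming edit distance over words
    row = list(range(len(hyp) + 1))
    for i, r_word in enumerate(ref, start=1):
        prev, row[0] = row[0], i
        for j, h_word in enumerate(hyp, start=1):
            prev, row[j] = row[j], min(
                row[j] + 1,                 # deletion
                row[j - 1] + 1,             # insertion
                prev + (r_word != h_word),  # substitution or match
            )
    return row[-1] / len(ref)

reference = "send the report to alice by friday"
harmless = "send the report to alice before friday"  # meaning preserved
breaking = "send the report to bob by friday"        # wrong recipient

# Both hypotheses differ from the reference by exactly one substitution,
# so both get WER = 1/7, yet only one breaks downstream reasoning.
print(word_error_rate(reference, harmless))
print(word_error_rate(reference, breaking))
```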
SemanticWER fixes this with a four-component composite score:
SemanticWER = w₁·L + w₂·E + w₃·S + w₄·T
| Component | What it measures |
|---|---|
| L — Lexical | Standard WER + CER (NIST-compatible) |
| E — Entity | Named entity preservation (PERSON, ORG, DATE, …) |
| S — Semantic | Embedding cosine similarity (SBERT) |
| T — Task | Downstream task success delta |
Lower score = better transcript quality.
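For illustration, the composite can be sketched as a renormalized weighted sum of per-component error terms. The mapping below (1 − F1 for entities, 1 − similarity for semantics, skipping unconfigured components and renormalizing the remaining weights) is an assumption for the sketch, not necessarily the library's exact normalization:

```python
def composite_score(weights, components):
    """Weighted sum of per-component error terms in [0, 1].

    `components` maps component name -> error (lower = better); a
    component whose value is None (e.g. no task function configured)
    is skipped and the remaining weights are renormalized.
    """
    total, weight_sum = 0.0, 0.0
    for w, (name, err) in zip(weights, components.items()):
        if err is None:
            continue
        total += w * err
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0

# Hypothetical component errors: lexical WER, entity error (1 - F1),
# semantic error (1 - cosine similarity), task not configured.
score = composite_score(
    weights=(0.3, 0.2, 0.3, 0.2),
    components={"L": 0.1429, "E": 1 - 0.80, "S": 1 - 0.8923, "T": None},
)
print(round(score, 4))
```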
```shell
# Minimal (WER/CER + regex NER + Jaccard semantic fallback)
pip install semanticwer

# Recommended (full features)
pip install "semanticwer[full]"
python -m spacy download en_core_web_sm
```

```python
from semanticwer import SemanticWER

metric = SemanticWER()  # defaults: weights=(0.3, 0.2, 0.3, 0.2)

result = metric(
    reference="The patient was prescribed 50mg of metformin twice daily",
    hypothesis="The patient was prescribed 15mg of metformin twice daily",
)
print(result.summary())
```
```
====================================================
SemanticWER Result
====================================================
Composite Score : 0.3241 (lower = better)
----------------------------------------------------
[L] Lexical  : WER=0.1429  CER=0.0541  (w=0.30)
[E] Entity   : F1=0.8000   Recall=0.6667  (w=0.20)
[S] Semantic : Sim=0.8923  (w=0.30)
[T] Task     : N/A  (w=0.20)
====================================================
```
Individual components are available as attributes:

```python
print(result.wer)           # 0.1429
print(result.semantic_sim)  # 0.8923
print(result.entity_f1)     # 0.8000
print(result.score)         # 0.3241
```

For corpus-level evaluation, accumulate samples and aggregate:

```python
metric = SemanticWER(weights=(0.3, 0.2, 0.3, 0.2))

# Accumulate samples
for ref, hyp in dataset:
    metric.update(ref, hyp)

# Compute over full corpus
result = metric.aggregate()
print(f"Corpus SemanticWER: {result.score:.4f}")
```

Batch lists can also be scored in one call:

```python
result = metric.compute(
    predictions=hypotheses,
    references=references,
)
```

Connect SemanticWER to your actual downstream task:
```python
from semanticwer import SemanticWER
from semanticwer.modules.task import TaskModule

metric = SemanticWER(
    weights=(0.25, 0.25, 0.25, 0.25),
    task_fn=TaskModule.rouge_adapter("rougeL"),
)
result = metric(ref, hyp)
print(result.task_score)  # 0.0–1.0
```

Token-level F1 works the same way:

```python
metric = SemanticWER(
    task_fn=TaskModule.f1_token_adapter(),
    weights=(0.25, 0.25, 0.25, 0.25),
)
```

Any callable `(reference, hypothesis) -> float` can serve as a task function:

```python
def my_qa_eval(reference: str, hypothesis: str) -> float:
    """Return 1.0 if hypothesis preserves the answer to our question."""
    ref_answer = qa_model(question="Who was mentioned?", context=reference)
    hyp_answer = qa_model(question="Who was mentioned?", context=hypothesis)
    return 1.0 if ref_answer == hyp_answer else 0.0

metric = SemanticWER(
    task_fn=my_qa_eval,
    weights=(0.2, 0.2, 0.3, 0.3),
)
```

An LLM can also act as the judge:

```python
import anthropic

client = anthropic.Anthropic()

def llm_judge(reference: str, hypothesis: str) -> float:
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": (
                f"Score semantic equivalence 0.0–1.0 (1.0 = identical meaning).\n"
                f"REF: {reference}\nHYP: {hypothesis}\n"
                f"Respond with only a float."
            ),
        }],
    )
    return float(response.content[0].text.strip())

metric = SemanticWER(
    task_fn=TaskModule.llm_judge_adapter(llm_judge),
    weights=(0.2, 0.2, 0.3, 0.3),
)
```

Choose an NER backend to trade accuracy against dependencies:

```python
# spaCy (default, best accuracy for English)
metric = SemanticWER(ner_backend="spacy")

# HuggingFace transformers pipeline
metric = SemanticWER(ner_backend="hf")

# Lightweight regex (no extra deps)
metric = SemanticWER(ner_backend="regex")

# Disable entity scoring
metric = SemanticWER(ner_backend="none")
```

The CLI mirrors the Python API:

```shell
# Single pair
semanticwer --ref "John Smith called at 3pm" --hyp "Tom Jones called at 9am"

# Files (one sentence per line)
semanticwer --ref ref.txt --hyp hyp.txt

# With ROUGE task scoring
semanticwer --ref ref.txt --hyp hyp.txt --task rouge

# JSON output (for pipelines)
semanticwer --ref ref.txt --hyp hyp.txt --output json

# Custom weights
semanticwer --ref ref.txt --hyp hyp.txt --weights 0.4 0.2 0.3 0.1

# CSV output
semanticwer --ref ref.txt --hyp hyp.txt --output csv
```

Every result object exposes the full breakdown:

```python
result = metric(ref, hyp)

result.score          # Composite SemanticWER [0, 1]
result.wer            # Classic WER
result.cer            # Character Error Rate
result.entity_f1      # Entity F1 score
result.entity_recall  # Entity recall
result.semantic_sim   # Cosine similarity [0, 1]
result.task_score     # Task utility score (or None)
result.to_dict()      # Full breakdown as dict
result.to_json()      # Full breakdown as JSON string
result.summary()      # Human-readable table
```

Weights must sum to 1.0. Recommended presets:
| Use case | Weights (L, E, S, T) |
|---|---|
| General ASR evaluation | (0.3, 0.2, 0.3, 0.2) |
| Medical / legal (entity-critical) | (0.2, 0.4, 0.2, 0.2) |
| LLM pipeline (task-first) | (0.15, 0.15, 0.3, 0.4) |
| Backward-compatible WER | (1.0, 0.0, 0.0, 0.0) |
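One convenient pattern is to keep these presets in a dict and validate them before constructing the metric; the `PRESETS` names and `validate_weights` helper below are illustrative, not part of the library:

```python
import math

# Presets from the table above, in (L, E, S, T) order
PRESETS = {
    "general":         (0.3, 0.2, 0.3, 0.2),
    "entity_critical": (0.2, 0.4, 0.2, 0.2),
    "task_first":      (0.15, 0.15, 0.3, 0.4),
    "classic_wer":     (1.0, 0.0, 0.0, 0.0),
}

def validate_weights(weights):
    """Weights must be non-negative and sum to 1.0."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    if not math.isclose(sum(weights), 1.0):
        raise ValueError(f"weights sum to {sum(weights)}, expected 1.0")
    return weights

for name, w in PRESETS.items():
    validate_weights(w)
```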
If you use SemanticWER in research, please cite:

```bibtex
@software{semanticwer2024,
  title = {SemanticWER: Meaning-Aware ASR Evaluation Toolkit},
  year  = {2024},
  url   = {https://github.com/semanticwer/semanticwer},
}
```

MIT