Evaluation: Show cost by AkhileshNegi · Pull Request #746 · ProjectTech4DevAI/kaapi-backend

AkhileshNegi · 2026-04-10T04:10:37Z

Summary

Target issue is #742

Checklist

Before submitting a pull request, please ensure that you mark these task.

Ran fastapi run --reload app/main.py or docker compose up in the repository root and test.
If you've fixed a bug or added code that is tested and has test cases.

Notes

Added cost tracking to evaluations, displaying token usage and USD costs for response and embedding processing stages.
Added support for text-embedding-3-large, text-embedding-3-small, and text-embedding-ada-002 embedding models.
Enhanced model configuration management with improved reasoning model detection

Summary by CodeRabbit

New Features
- Added cost tracking to evaluation runs with per-stage breakdown (response generation and embedding).
- Cost data is now visible in evaluation run API responses, including token counts and USD costs.
- Made score_trace_url visible in evaluation run responses.
Chores
- Database migration to support cost tracking.
- Internal refactoring of evaluation run update mechanisms.

coderabbitai · 2026-04-10T04:10:44Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Refactors evaluation-run updates to use a new EvaluationRunUpdate model; adds persistent per-stage cost tracking with calculation and DB persistence; moves reasoning-model detection from a constant to a DB-backed helper and updates LLM mappers to accept a session.

Changes

Cohort / File(s)	Summary
Evaluation run model & CRUD `backend/app/models/evaluation.py`, `backend/app/crud/evaluations/core.py`	Added `cost` JSONB field to `EvaluationRun`; introduced `EvaluationRunUpdate` model; changed `update_evaluation_run(...)` signature to accept `update: EvaluationRunUpdate` and apply only provided fields.
Evaluation processing & cost integration `backend/app/crud/evaluations/processing.py`, `backend/app/crud/evaluations/cost.py`	Added `attach_cost(...)` module; integrated cost attachment into response and embedding completion flows; replaced direct update kwargs with `EvaluationRunUpdate(...)` and persist cost updates to DB.
Migrations & seed data `backend/app/alembic/versions/053_add_cost_to_evaluation_run.py`, `backend/app/alembic/versions/052_create_model_config_table.py`	New migration adds `cost` JSONB column; seeding extended with embedding models (`text-embedding-3-large`, `text-embedding-3-small`, `text-embedding-ada-002`).
LLM mapping & reasoning detection `backend/app/services/llm/mappers.py`, `backend/app/crud/model_config.py`, `backend/app/models/llm/constants.py`, `backend/app/services/llm/jobs.py`	Removed `OPENAI_REASONING_MODELS`; added `is_reasoning_model(session, ...)`; updated mappers and transformers to accept `session: Session` and use DB-backed reasoning checks; adjusted a call site to pass `session` kwarg.
Tests (evaluations & LLM) `backend/app/tests/crud/evaluations/test_processing.py`, `backend/app/tests/crud/evaluations/test_score_storage.py`, `backend/app/tests/services/llm/test_mappers.py`, `backend/app/tests/services/llm/test_jobs.py`	Updated tests to assert `cost` structure and to use `EvaluationRunUpdate` assertions; injected DB session into mapper tests and adjusted model strings used in expectations.
Public model exports `backend/app/models/__init__.py`	Re-exported `EvaluationRunUpdate` and added doc-transformation job model exports; reordered import/export blocks.
Minor cleanup `backend/app/services/llm/jobs.py`	Removed unused typing import and passed `session` into `transform_kaapi_config_to_native` via keyword args.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Processor as Processing task
  participant Coster as attach_cost module
  participant Pricing as estimate_model_cost
  participant DB as Database

  Processor->>Coster: provide response_model + results / embedding data
  Coster->>Pricing: aggregate tokens & request cost (provider="openai", usage_type="batch")
  Pricing-->>Coster: return cost_usd per stage
  Coster->>DB: merge cost into eval_run.cost and compute total_cost_usd
  DB-->>Processor: persisted eval_run with updated cost/status
  Processor->>DB: update EvaluationRun via update=EvaluationRunUpdate(...)
  DB-->>Processor: confirmation

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Evaluation #405 — Directly overlaps evaluation subsystem changes and EvaluationRunUpdate usage.
Add score trace URL to evaluation run #558 — Related to evaluation score trace handling and update_evaluation_run call-site changes.
STT/TTS: Smart Defaults for Config Params in Native and Kaapi Abstracted Flow #678 — Related to LLM mapper and reasoning-detection changes that touch the same modules.

Suggested labels

enhancement, ready-for-review

Suggested reviewers

vprashrex
Prajna1999
kartpop

Poem

🐰 I hopped through DBs, costs, and scores,

Tokens tallied behind tiny doors.
Constants swapped for queries that hum,
Updates bundled, now neat and plumb.
A carrot for testers—deploy, here I come! 🥕

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Evaluation: Show cost' clearly and directly summarizes the main objective of the PR, which implements cost tracking and display for evaluation runs across response and embedding stages.
Docstring Coverage	✅ Passed	Docstring coverage is 89.71% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch enhancement/evaluation-cost

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

backend/app/crud/evaluations/processing.py (1)

516-528: Persist cost explicitly in the final update call.

Line 516 updates eval_run.cost directly, but Line 526 uses update_evaluation_run(...) without cost. Passing it explicitly keeps persistence behavior consistent and future-proof.

♻️ Suggested tweak

-        eval_run = update_evaluation_run(
-            session=session, eval_run=eval_run, status="completed", score=eval_run.score
-        )
+        eval_run = update_evaluation_run(
+            session=session,
+            eval_run=eval_run,
+            status="completed",
+            score=eval_run.score,
+            cost=eval_run.cost,
+        )

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@backend/app/crud/evaluations/processing.py` around lines 516 - 528, The
eval_run.cost is set via build_cost_dict (response_entry, embedding_cost_entry)
but the subsequent persistence uses update_evaluation_run(session=session,
eval_run=eval_run, status="completed", score=eval_run.score) which omits cost;
modify the final call to include cost (e.g., cost=eval_run.cost) so the computed
cost is persisted, keeping the try/except around build_cost_dict as-is and
ensuring update_evaluation_run receives the cost even when previously assigned
to eval_run.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/alembic/versions/050_add_cost_to_evaluation_run.py`:
- Around line 20-33: The migration functions upgrade and downgrade are missing
return type annotations; update the function signatures for upgrade() and
downgrade() to include explicit return type hints (-> None) to satisfy the
repository's typing guidelines, keeping the existing bodies unchanged and only
modifying the function declarations.

In `@backend/app/crud/evaluations/pricing.py`:
- Around line 92-98: calculate_response_cost is returning 0.0 for several
production-supported models because MODEL_PRICING lacks entries for variants
(e.g., gpt-4, gpt-4-turbo, o1-preview, o1-mini, gpt-4.1-mini, gpt-4.1-nano,
gemini-2.5-pro, gemini-2.5-pro-preview-tts) referenced in SUPPORTED_MODELS;
update the implementation by either (A) adding the missing pricing entries into
MODEL_PRICING for those exact model keys, or (B) implement normalization/mapping
inside calculate_response_cost (or a helper) to map variant names to existing
pricing keys (e.g., normalize "gpt-4-turbo" -> "gpt-4-turbo" pricing key or map
preview/mini/nano variants to their base model key) so the lookup
MODEL_PRICING.get(model) returns a valid price instead of falling back to 0.0;
ensure the same fix is applied to the other lookup block around lines 115-121
that uses MODEL_PRICING.

---

Nitpick comments:
In `@backend/app/crud/evaluations/processing.py`:
- Around line 516-528: The eval_run.cost is set via build_cost_dict
(response_entry, embedding_cost_entry) but the subsequent persistence uses
update_evaluation_run(session=session, eval_run=eval_run, status="completed",
score=eval_run.score) which omits cost; modify the final call to include cost
(e.g., cost=eval_run.cost) so the computed cost is persisted, keeping the
try/except around build_cost_dict as-is and ensuring update_evaluation_run
receives the cost even when previously assigned to eval_run.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 623d9238-23b0-443e-906a-bd51d8e56578

📥 Commits

Reviewing files that changed from the base of the PR and between dbcdb20 and b6750f0.

📒 Files selected for processing (7)

backend/app/alembic/versions/050_add_cost_to_evaluation_run.py
backend/app/crud/evaluations/__init__.py
backend/app/crud/evaluations/core.py
backend/app/crud/evaluations/embeddings.py
backend/app/crud/evaluations/pricing.py
backend/app/crud/evaluations/processing.py
backend/app/models/evaluation.py

coderabbitai

♻️ Duplicate comments (1)

backend/app/crud/evaluations/pricing.py (1)

23-72: ⚠️ Potential issue | 🟠 Major

MODEL_PRICING still misses supported evaluation models.

calculate_token_cost() does an exact lookup on Lines 92-97. Any run whose configured model is supported elsewhere in the app but absent here will silently get 0.0, so the new EvaluationRun.cost field underreports valid usage. Please either add the missing keys or normalize model IDs before the lookup.

Run this read-only check to compare model literals used by the app with the pricing table; after the fix, the missing-pricing list should only contain models you intentionally do not cost:

#!/bin/bash
python - <<'PY'
import ast
import pathlib
import re

pricing_path = pathlib.Path("backend/app/crud/evaluations/pricing.py")
embeddings_path = pathlib.Path("backend/app/crud/evaluations/embeddings.py")
constants_path = pathlib.Path("backend/app/models/llm/constants.py")

embedding_text = embeddings_path.read_text()
match = re.search(r'EMBEDDING_MODEL\s*=\s*"([^"]+)"', embedding_text)
embedding_model = match.group(1) if match else None

tree = ast.parse(pricing_path.read_text())
pricing_keys = set()

for node in tree.body:
    if isinstance(node, ast.AnnAssign) and getattr(node.target, "id", None) == "MODEL_PRICING":
        for key_node in node.value.keys:
            if isinstance(key_node, ast.Constant) and isinstance(key_node.value, str):
                pricing_keys.add(key_node.value)
            elif (
                isinstance(key_node, ast.Name)
                and key_node.id == "EMBEDDING_MODEL"
                and embedding_model
            ):
                pricing_keys.add(embedding_model)
        break

constants_text = constants_path.read_text()
supported_models = sorted(
    set(re.findall(r'"((?:gpt|o1|gemini|text-embedding)[^"]*)"', constants_text))
)

missing = [model for model in supported_models if model not in pricing_keys]

print("MODEL_PRICING keys:")
for model in sorted(pricing_keys):
    print(f"  - {model}")

print("\nSupported-looking model literals missing pricing:")
for model in missing:
    print(f"  - {model}")
PY

Also applies to: 92-97

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@backend/app/crud/evaluations/pricing.py` around lines 23 - 72, MODEL_PRICING
is missing entries used elsewhere so calculate_token_cost (the exact lookup in
calculate_token_cost) can return 0.0; either add the missing model keys to
MODEL_PRICING or normalize model IDs before lookup. Fix options: (1) add the
supported model literals (from models/constants.py) as keys in MODEL_PRICING
with appropriate input/output costs (update the dict entries like "gpt-4o",
"gpt-5", etc.), or (2) change calculate_token_cost to map aliases/variants to
the canonical pricing key (e.g., normalize model names or strip suffixes) before
doing the MODEL_PRICING lookup so EvaluationRun.cost is computed correctly.
Ensure the chosen approach covers EMBEDDING_MODEL and any "mini"/"nano"/"-pro"
variants referenced by the app.

🧹 Nitpick comments (1)

backend/app/tests/crud/evaluations/test_pricing.py (1)
18-28: Avoid deriving expected costs from MODEL_PRICING.

These assertions use the same pricing table as the implementation, so an incorrect rate change can update both sides and still leave the tests green. Prefer hard-coded expected USD values, or a small local test matrix that pins the intended rates independently of the production constant.

Also applies to: 32-37, 55-62, 93-99, 162-164
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/tests/crud/evaluations/test_pricing.py` around lines 18 - 28, The
tests currently compute expected values using MODEL_PRICING, which mirrors
production constants and can mask pricing regressions; update each assertion
that derives expected cost from MODEL_PRICING to use hard-coded expected USD
values instead (compute the expected by hand for the given
input_tokens/output_tokens and paste the numeric expected value into the test),
specifically replace uses of MODEL_PRICING in the calculate_token_cost test
cases with fixed numeric literals so the tests verify against pinned rates
independent of MODEL_PRICING or production changes; keep calls to
calculate_token_cost(model="gpt-4o", ...) and only change the expected
variable/assertion to a hard-coded float (use pytest.approx when appropriate).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@backend/app/crud/evaluations/pricing.py`:
- Around line 23-72: MODEL_PRICING is missing entries used elsewhere so
calculate_token_cost (the exact lookup in calculate_token_cost) can return 0.0;
either add the missing model keys to MODEL_PRICING or normalize model IDs before
lookup. Fix options: (1) add the supported model literals (from
models/constants.py) as keys in MODEL_PRICING with appropriate input/output
costs (update the dict entries like "gpt-4o", "gpt-5", etc.), or (2) change
calculate_token_cost to map aliases/variants to the canonical pricing key (e.g.,
normalize model names or strip suffixes) before doing the MODEL_PRICING lookup
so EvaluationRun.cost is computed correctly. Ensure the chosen approach covers
EMBEDDING_MODEL and any "mini"/"nano"/"-pro" variants referenced by the app.

---

Nitpick comments:
In `@backend/app/tests/crud/evaluations/test_pricing.py`:
- Around line 18-28: The tests currently compute expected values using
MODEL_PRICING, which mirrors production constants and can mask pricing
regressions; update each assertion that derives expected cost from MODEL_PRICING
to use hard-coded expected USD values instead (compute the expected by hand for
the given input_tokens/output_tokens and paste the numeric expected value into
the test), specifically replace uses of MODEL_PRICING in the
calculate_token_cost test cases with fixed numeric literals so the tests verify
against pinned rates independent of MODEL_PRICING or production changes; keep
calls to calculate_token_cost(model="gpt-4o", ...) and only change the expected
variable/assertion to a hard-coded float (use pytest.approx when appropriate).

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3fba4ac7-711a-4d3b-916d-56d3c08ae24e

📥 Commits

Reviewing files that changed from the base of the PR and between b6750f0 and 63eb942.

📒 Files selected for processing (6)

backend/app/crud/evaluations/__init__.py
backend/app/crud/evaluations/core.py
backend/app/crud/evaluations/pricing.py
backend/app/crud/evaluations/processing.py
backend/app/tests/crud/evaluations/test_pricing.py
backend/app/tests/crud/evaluations/test_processing.py

✅ Files skipped from review due to trivial changes (1)

backend/app/crud/evaluations/init.py

🚧 Files skipped from review as they are similar to previous changes (1)

backend/app/crud/evaluations/core.py

codecov · 2026-04-16T07:09:53Z

Codecov Report

❌ Patch coverage is 96.15385% with 6 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
backend/app/crud/evaluations/cost.py	92.72%	4 Missing ⚠️
backend/app/crud/evaluations/core.py	85.71%	1 Missing ⚠️
backend/app/crud/model_config.py	80.00%	1 Missing ⚠️

📢 Thoughts on this report? Let us know!

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

backend/app/crud/evaluations/cost.py (1)

171-174: Add traceback to swallowed-cost warnings for debugging.

Since exceptions are intentionally suppressed, include stack traces in logs to keep this observable in production.

Proposed fix

     except Exception as cost_err:
         logger.warning(
-            f"[attach_cost] {log_prefix} Failed to compute cost | {cost_err}"
+            f"[attach_cost] {log_prefix} Failed to compute cost | {cost_err}",
+            exc_info=True,
         )

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@backend/app/crud/evaluations/cost.py` around lines 171 - 174, The except
block in attach_cost swallows exceptions without stack traces; update the logger
call in the except Exception as cost_err handler to include the traceback (e.g.,
use logger.exception or logger.warning(..., exc_info=True)) so the log for
"[attach_cost] {log_prefix} Failed to compute cost" includes the full stack
trace and the cost_err context.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/crud/evaluations/cost.py`:
- Around line 159-170: When recomputing costs without embedding inputs, the
current block clears any previously stored embedding cost; modify the logic in
the section that builds embedding_entry (around embedding_model,
embedding_raw_results, build_embedding_cost_entry) to preserve an existing
embedding cost from eval_run.cost if embedding_model and embedding_raw_results
are both None. Concretely, if embedding_model and embedding_raw_results are
present, set embedding_entry via build_embedding_cost_entry as now; otherwise,
if eval_run.cost exists and contains an "embedding" key, set embedding_entry to
that existing embedding dict so that build_cost_dict receives the prior
embedding cost and it is not overwritten when assigning eval_run.cost.
- Around line 45-65: The _sum_tokens function currently adds values from
usage.get(...) directly which can raise when providers return non-numeric types;
update the loop in _sum_tokens to coerce each retrieved value for input_key,
"output_tokens", and "total_tokens" to an int (or float->int) safely and treat
non-coercible/missing values as 0 before adding to totals, e.g., call
usage_extractor(item) as before but for each key use a safe_numeric =
_safe_int(usage.get(...)) pattern (implement a small helper or inline
try/except) so totals["input_tokens"], totals["output_tokens"], and
totals["total_tokens"] always receive numeric increments and malformed values
are ignored.

---

Nitpick comments:
In `@backend/app/crud/evaluations/cost.py`:
- Around line 171-174: The except block in attach_cost swallows exceptions
without stack traces; update the logger call in the except Exception as cost_err
handler to include the traceback (e.g., use logger.exception or
logger.warning(..., exc_info=True)) so the log for "[attach_cost] {log_prefix}
Failed to compute cost" includes the full stack trace and the cost_err context.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fee3ebd2-a94a-4ab2-80a5-e73d4f33ec88

📥 Commits

Reviewing files that changed from the base of the PR and between b75a043 and 85e249a.

📒 Files selected for processing (9)

backend/app/alembic/versions/053_add_cost_to_evaluation_run.py
backend/app/crud/evaluations/core.py
backend/app/crud/evaluations/cost.py
backend/app/crud/evaluations/processing.py
backend/app/crud/model_config.py
backend/app/models/__init__.py
backend/app/models/evaluation.py
backend/app/tests/crud/evaluations/test_processing.py
backend/app/tests/crud/evaluations/test_score_storage.py

✅ Files skipped from review due to trivial changes (1)

backend/app/alembic/versions/053_add_cost_to_evaluation_run.py

🚧 Files skipped from review as they are similar to previous changes (5)

backend/app/crud/model_config.py
backend/app/crud/evaluations/core.py
backend/app/tests/crud/evaluations/test_processing.py
backend/app/models/evaluation.py
backend/app/crud/evaluations/processing.py

coderabbitai · 2026-04-16T08:10:21Z

+def _sum_tokens(
+    items: Iterable[dict[str, Any]],
+    usage_extractor: Callable[[dict[str, Any]], dict[str, Any] | None],
+    input_key: str,
+) -> dict[str, int]:
+    """Sum (input, output, total) tokens across items using a per-item usage extractor.
+
+    The OpenAI Embeddings API reports input tokens as ``prompt_tokens`` and has
+    no output tokens; chat/responses APIs use ``input_tokens`` and ``output_tokens``.
+    Missing keys default to 0, so the embedding case naturally produces
+    output_tokens=0.
+    """
+    totals = {"input_tokens": 0, "output_tokens": 0, "total_tokens": 0}
+    for item in items:
+        usage = usage_extractor(item)
+        if not usage:
+            continue
+        totals["input_tokens"] += usage.get(input_key, 0)
+        totals["output_tokens"] += usage.get("output_tokens", 0)
+        totals["total_tokens"] += usage.get("total_tokens", 0)
+    return totals


⚠️ Potential issue | 🟠 Major

Harden token aggregation against malformed usage fields.

At Line 62–64, direct += usage.get(...) can throw if a provider returns non-numeric values, and then cost calculation is fully skipped by the outer catch.

Proposed fix

+def _to_int_token(value: Any) -> int: + try: + return int(value) + except (TypeError, ValueError): + return 0 + def _sum_tokens( items: Iterable[dict[str, Any]], usage_extractor: Callable[[dict[str, Any]], dict[str, Any] | None], input_key: str, ) -> dict[str, int]: @@ usage = usage_extractor(item) if not usage: continue - totals["input_tokens"] += usage.get(input_key, 0) - totals["output_tokens"] += usage.get("output_tokens", 0) - totals["total_tokens"] += usage.get("total_tokens", 0) + totals["input_tokens"] += _to_int_token(usage.get(input_key, 0)) + totals["output_tokens"] += _to_int_token(usage.get("output_tokens", 0)) + totals["total_tokens"] += _to_int_token(usage.get("total_tokens", 0))

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@backend/app/crud/evaluations/cost.py` around lines 45 - 65, The _sum_tokens function currently adds values from usage.get(...) directly which can raise when providers return non-numeric types; update the loop in _sum_tokens to coerce each retrieved value for input_key, "output_tokens", and "total_tokens" to an int (or float->int) safely and treat non-coercible/missing values as 0 before adding to totals, e.g., call usage_extractor(item) as before but for each key use a safe_numeric = _safe_int(usage.get(...)) pattern (implement a small helper or inline try/except) so totals["input_tokens"], totals["output_tokens"], and totals["total_tokens"] always receive numeric increments and malformed values are ignored.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

backend/app/crud/evaluations/core.py (1)
213-216: ⚠️ Potential issue | 🟡 Minor

Prefix the error log with the function name.

Line 215 log message should follow the project logging convention ([update_evaluation_run] ...).
Suggested fix
-        logger.error(f"Failed to update EvaluationRun: {e}", exc_info=True)
+        logger.error(f"[update_evaluation_run] Failed to update EvaluationRun: {e}", exc_info=True)
As per coding guidelines, "Prefix all log messages with the function name in square brackets: `logger.info(f\"[function_name] Message {mask_string(sensitive_value)}\")`".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/crud/evaluations/core.py` around lines 213 - 216, The except
block's logger.error call must follow the project's logging convention by
prefixing the message with the function name in square brackets; update the log
inside the except in update_evaluation_run to use
logger.error(f"[update_evaluation_run] Failed to update EvaluationRun: {e}",
exc_info=True) so the message is consistently prefixed while retaining exc_info
and the rollback/raise behavior.

♻️ Duplicate comments (1)

backend/app/crud/evaluations/cost.py (1)

62-64: ⚠️ Potential issue | 🟠 Major

Harden token aggregation against malformed usage values.

Line 62-64 still adds raw usage fields directly; non-numeric payloads can throw and skip cost attachment for the run.

Suggested fix

+def _to_int_token(value: Any) -> int:
+    try:
+        return int(value)
+    except (TypeError, ValueError):
+        return 0
+
 def _sum_tokens(
     items: Iterable[dict[str, Any]],
     usage_extractor: Callable[[dict[str, Any]], dict[str, Any] | None],
     input_key: str,
 ) -> dict[str, int]:
@@
         usage = usage_extractor(item)
         if not usage:
             continue
-        totals["input_tokens"] += usage.get(input_key, 0)
-        totals["output_tokens"] += usage.get("output_tokens", 0)
-        totals["total_tokens"] += usage.get("total_tokens", 0)
+        totals["input_tokens"] += _to_int_token(usage.get(input_key, 0))
+        totals["output_tokens"] += _to_int_token(usage.get("output_tokens", 0))
+        totals["total_tokens"] += _to_int_token(usage.get("total_tokens", 0))

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@backend/app/crud/evaluations/cost.py` around lines 62 - 64, The aggregation
currently adds raw usage values (usage.get(input_key),
usage.get("output_tokens"), usage.get("total_tokens")) directly into totals
which can raise on non-numeric payloads; change the additions to coerce/validate
values first (e.g., attempt to convert to int or float and fall back to 0 on
TypeError/ValueError/None or if value is not finite) and then add the safe
numeric value to totals["input_tokens"], totals["output_tokens"], and
totals["total_tokens"]; implement a small helper (e.g., safe_num(value)) used
where input_key, "output_tokens", and "total_tokens" are read so malformed usage
entries are treated as 0 instead of throwing.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@backend/app/crud/evaluations/core.py`:
- Around line 213-216: The except block's logger.error call must follow the
project's logging convention by prefixing the message with the function name in
square brackets; update the log inside the except in update_evaluation_run to
use logger.error(f"[update_evaluation_run] Failed to update EvaluationRun: {e}",
exc_info=True) so the message is consistently prefixed while retaining exc_info
and the rollback/raise behavior.

---

Duplicate comments:
In `@backend/app/crud/evaluations/cost.py`:
- Around line 62-64: The aggregation currently adds raw usage values
(usage.get(input_key), usage.get("output_tokens"), usage.get("total_tokens"))
directly into totals which can raise on non-numeric payloads; change the
additions to coerce/validate values first (e.g., attempt to convert to int or
float and fall back to 0 on TypeError/ValueError/None or if value is not finite)
and then add the safe numeric value to totals["input_tokens"],
totals["output_tokens"], and totals["total_tokens"]; implement a small helper
(e.g., safe_num(value)) used where input_key, "output_tokens", and
"total_tokens" are read so malformed usage entries are treated as 0 instead of
throwing.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e413f09c-e8af-4a74-8793-c64b704697fb

📥 Commits

Reviewing files that changed from the base of the PR and between 85e249a and b95e11e.

📒 Files selected for processing (4)

backend/app/crud/evaluations/core.py
backend/app/crud/evaluations/cost.py
backend/app/models/evaluation.py
backend/app/tests/crud/evaluations/test_processing.py

🚧 Files skipped from review as they are similar to previous changes (2)

backend/app/models/evaluation.py
backend/app/tests/crud/evaluations/test_processing.py

AkhileshNegi added 2 commits April 9, 2026 22:36

first stab at costing

631f3f4

minor fixes

b6750f0

AkhileshNegi marked this pull request as ready for review April 10, 2026 04:11

AkhileshNegi self-assigned this Apr 10, 2026

coderabbitai bot reviewed Apr 10, 2026

View reviewed changes

Comment thread backend/app/alembic/versions/053_add_cost_to_evaluation_run.py

Comment thread backend/app/crud/evaluations/pricing.py Outdated

cleanup

63eb942

coderabbitai bot reviewed Apr 11, 2026

View reviewed changes

AkhileshNegi added 3 commits April 16, 2026 09:48

Merge branch 'main' into enhancement/evaluation-cost

35d7c9f

first stab of using config table

b75a043

cleanup

be8f60e

few fixes from suggestions

85e249a

coderabbitai bot reviewed Apr 16, 2026

View reviewed changes

AkhileshNegi added 2 commits April 16, 2026 13:43

cleanups

1af4148

coderabbit suggestions

77451b2

AkhileshNegi requested a review from vprashrex April 16, 2026 08:19

Merge branch 'main' into enhancement/evaluation-cost

b95e11e

coderabbitai bot reviewed Apr 16, 2026

View reviewed changes

Conversation

AkhileshNegi commented Apr 10, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Checklist

Notes

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Apr 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

codecov bot commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

AkhileshNegi commented Apr 10, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Apr 10, 2026 •

edited

Loading

codecov bot commented Apr 16, 2026 •

edited

Loading