
Initialize project #1

Open
d33bs wants to merge 11 commits into WayScience:main from d33bs:init

Conversation

@d33bs (Member) commented Apr 3, 2026

Summary by CodeRabbit

  • New Features
    • Initial MeerQat release: CLI and Python API for deterministic dataset validation, optional local LLM review, readiness probe, and JSON/Markdown/HTML reports.
  • Documentation
    • Complete docs and examples: full README, contributing guide, architecture/spec/roadmap, CLI/API guides, and example notebooks.
  • Tests
    • Extensive unit, integration, CLI, and sample-dataset test suites.
  • Chores
    • Project tooling and CI: release drafting, docs publishing, PyPI publish workflow, Dependabot, pre-commit pipeline, packaging/citation, and code of conduct.

@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@coderabbitai

coderabbitai bot commented Apr 3, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
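For reference, the `reviews.auto_review.auto_pause_after_reviewed_commits` setting named above lives in the repository's CodeRabbit configuration file; a minimal sketch (the threshold value here is illustrative, not the project's actual setting):

```
# .coderabbit.yaml (sketch)
reviews:
  auto_review:
    auto_pause_after_reviewed_commits: 20  # illustrative value
```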
📝 Walkthrough

Adds a new MeerQat Python package and CLI with deterministic dataset ingestion/validation, optional LLM review (Instructor + LangChain fallback), reporting (JSON/Markdown/HTML), Sphinx docs, tests, pre-commit tooling, and multiple GitHub automation and CI workflow/config files.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **GitHub repo automation & templates**<br>`.github/ISSUE_TEMPLATE/issue.yml`, `.github/PULL_REQUEST_TEMPLATE.md`, `.github/dependabot.yml`, `.github/release-drafter.yml` | New issue and PR templates, Dependabot weekly checks for GitHub Actions and pip, and Release Drafter configuration with semantic version mapping. |
| **GitHub workflows**<br>`.github/workflows/draft-release.yml`, `.github/workflows/publish-docs.yml`, `.github/workflows/publish-pypi.yml`, `.github/workflows/run-tests.yml` | New Actions: release drafter, docs publish (GitHub Pages), PyPI publish on releases, and a test matrix including pre-commit, pytest across Python versions, and an integration job for a local LLM runtime. |
| **Repo tooling & metadata**<br>`pyproject.toml`, `.python-version`, `.pre-commit-config.yaml`, `.yamllint`, `.gitignore`, `CITATION.cff` | Adds build/runtime metadata, a console script entrypoint, a Python version pin, pre-commit hooks, yamllint rules, minor `.gitignore` whitespace normalization, and citation metadata. |
| **Community & contributing docs**<br>`CODE_OF_CONDUCT.md`, `CONTRIBUTING.md` | Adds a contributor covenant and detailed contribution/release/development guidance. |
| **Top-level docs & spec/roadmap**<br>`README.md`, `architecture.md`, `spec.md`, `roadmap.md` | Expanded README and new architecture/spec/roadmap documents describing validation layers, the domain model, CLI/API behavior, and implementation phases. |
| **Sphinx docs & examples**<br>`docs/src/conf.py`, `docs/src/index.md`, `docs/src/cli.md`, `docs/src/python-api.md`, `docs/src/on_the_lookout_with_meerqat.*`, `docs/src/_static/custom.css` | Sphinx configuration (MyST, pydata theme), CLI/API pages, an example notebook/script demonstrating validation, and minimal CSS. |
| **Package exports & version**<br>`src/meerqat/__init__.py`, `src/meerqat/_version.py` | Top-level package exports and a generated version module exposing package/version metadata. |
| **Configuration models**<br>`src/meerqat/config.py` | Immutable dataclasses for ValidationConfig/LLMConfig/AssayTemplate/ModelSpec, a YAML loader, validation, CLI override merging, and model/context utilities. |
| **Domain models**<br>`src/meerqat/models.py` | Frozen dataclass domain/report types (Dataset, Plate, FiletreeSummary, ValidationReport, LLMReview, ReadyReport) with `to_dict` serialization. |
| **Ingestion & normalization**<br>`src/meerqat/ingestion.py`, `src/meerqat/normalization.py` | Filesystem plate discovery and XML parsing, image inventory and modality inference, metadata CSV/XLSX loading, filetree heuristics, and identifier normalization. |
| **Deterministic validation engine**<br>`src/meerqat/validation.py` | Validation rules for structure/naming/XML consistency, metadata coverage, image count/zero-byte/mixed-modality checks, filetree risk detection, and aggregation into a ValidationReport. |
| **LLM review integration**<br>`src/meerqat/llm.py` | Pydantic schemas and orchestration for Instructor (local server auto-start/probe) and a LangChain/GGUF fallback, model resolution/caching, readiness checks, error normalization, and public APIs `generate_llm_review`/`run_ready_checks`. |
| **Reporting & CLI API**<br>`src/meerqat/reporting.py`, `src/meerqat/cli.py`, `src/meerqat/main.py` | Markdown/HTML renderers, JSON writers, CLI subcommands (`validate`, `batch-validate`, `models`, `ready`), config merging, exit-code mapping, and top-level API functions (`validate_dataset`, `batch_validate`, `ready`, `write_reports`). |
| **Tests, fixtures & sample data**<br>`tests/conftest.py`, `tests/*`, `tests/data/*` | Comprehensive test suite (unit, API, CLI, integration), fixtures, regression tests, and several small sample datasets under `tests/data/`. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI as CLI
    participant Config as Config/Loader
    participant Ingest as Ingestion
    participant Validator as Validation Engine
    participant LLM as LLM Provider
    participant Reporter as Reporter

    CLI->>Config: load config / apply CLI overrides
    CLI->>Ingest: ingest dataset (path, metadata)
    Ingest-->>Validator: Dataset
    Validator->>Validator: run deterministic rules -> ValidationReport
    Validator-->>LLM: submit ValidationReport for review (if enabled)
    alt LLM enabled
        LLM->>LLM: resolve model & provider
        alt instructor provider
            LLM->>LLM: ensure local instructor server (auto-start & probe)
            LLM->>Server: HTTP inference call
        else langchain fallback
            LLM->>HF: resolve/fetch GGUF model (cache/offline)
            LLM->>LocalModel: run local inference
        end
        LLM-->>Validator: LLMReview (hints/findings)
    else LLM disabled
        LLM-->>Validator: LLMReview(status="disabled")
    end
    Validator->>Reporter: attach LLMReview, build final report
    Reporter->>CLI: render/write JSON/MD/HTML
    CLI->>User: exit with mapped code
```

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

🐇 I nibble logs and chase the cues,
I count the plates and parse the clues,
Rules hop steady, Llamas ponder,
Reports hop out — then off I wander,
A tiny hop for code, a carrot for news!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ⚠️ Warning | The pull request title 'Initialize project' is vague and generic, failing to convey any meaningful information about the specific changes being introduced. | Replace with a more descriptive title that highlights the primary changes, such as 'Add project scaffolding: configuration, workflows, documentation, and initial source code' or 'Set up MeerQat package with core validation, CLI, and GitHub workflows'. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 93.37%, which is sufficient. The required threshold is 80.00%. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 14

🧹 Nitpick comments (17)
CITATION.cff (1)

14-14: Use canonical repository URL casing for consistency.

repository-code currently uses wayscience; consider matching the repository’s canonical org casing (WayScience) for consistency with project metadata and docs.

✏️ Suggested metadata tweak
-repository-code: "https://github.com/wayscience/meerqat"
+repository-code: "https://github.com/WayScience/meerqat"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@CITATION.cff` at line 14, The CITATION.cff metadata uses a lowercased org
name in the repository-code field; update the repository-code value to use the
repository's canonical casing ("WayScience/meerqat") so it matches project
metadata and docs—locate the repository-code entry in CITATION.cff and change
"https://github.com/wayscience/meerqat" to the correctly cased canonical URL.
CONTRIBUTING.md (1)

63-65: Minor: Redundant section headers.

Lines 63-65 have consecutive headers "Creating a pull request" and "Pull requests" that could be consolidated.

Suggested consolidation
-### Creating a pull request
-
 ### Pull requests
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@CONTRIBUTING.md` around lines 63 - 65, Consolidate the redundant consecutive
headers by removing or merging one of them: keep a single clear header (either
"Creating a pull request" or "Pull requests") and move any distinct content
under that header; update the section title where the duplicate appears and
ensure the remaining heading matches surrounding style and anchors (the two
header strings "Creating a pull request" and "Pull requests" identify where to
make the change).
src/meerqat/cli.py (1)

172-188: Reduce duplication: reuse add_common_flags for ready subparser.

The ready_parser manually defines the same LLM-related flags that add_common_flags already provides. This creates maintenance burden if flags need updates.

Proposed refactor
     ready_parser = subparsers.add_parser("ready")
-    ready_parser.add_argument("--config")
-    ready_parser.add_argument(
-        "--llm-provider",
-        choices=("langchain", "instructor"),
-    )
-    ready_parser.add_argument("--llm-model")
-    ready_parser.add_argument("--llm-repo-id")
-    ready_parser.add_argument("--llm-filename")
-    ready_parser.add_argument("--llm-cache-dir")
-    ready_parser.add_argument("--llm-base-url")
-    ready_parser.add_argument("--llm-n-ctx", type=int)
-    ready_parser.add_argument("--llm-threads", type=int)
-    ready_parser.add_argument("--llm-max-tokens", type=int)
-    ready_parser.add_argument("--llm-temperature", type=float)
-    ready_parser.add_argument("--offline", action="store_true")
+    add_common_flags(ready_parser)
     ready_parser.set_defaults(handler=_ready_command)

Note: add_common_flags also adds --metadata, --dataset-id, --report-json, etc. If these shouldn't apply to ready, consider extracting a separate add_llm_flags helper.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/meerqat/cli.py` around lines 172 - 188, The ready subparser duplicates
LLM flags already provided by add_common_flags; update the ready parser to reuse
those shared flags by calling add_common_flags(ready_parser) instead of manually
adding --llm-* options, or if ready should only include LLM flags (not
metadata/dataset/report flags) extract a new helper add_llm_flags(parser) and
call that from both add_common_flags and where ready_parser is configured;
adjust references to ready_parser and the handler _ready_command accordingly so
flag names remain unchanged.
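As an illustration of the `add_llm_flags` split suggested above, a minimal sketch; the flag names mirror those quoted in the diff, while the surrounding parser scaffolding is hypothetical demonstration code, not the project's actual CLI module:

```python
import argparse


def add_llm_flags(parser: argparse.ArgumentParser) -> None:
    """Attach only the LLM-related options shared by subcommands."""
    parser.add_argument("--llm-provider", choices=("langchain", "instructor"))
    parser.add_argument("--llm-model")
    parser.add_argument("--llm-repo-id")
    parser.add_argument("--llm-filename")
    parser.add_argument("--llm-cache-dir")
    parser.add_argument("--llm-base-url")
    parser.add_argument("--llm-n-ctx", type=int)
    parser.add_argument("--llm-threads", type=int)
    parser.add_argument("--llm-max-tokens", type=int)
    parser.add_argument("--llm-temperature", type=float)
    parser.add_argument("--offline", action="store_true")


# Usage sketch: the ready subparser reuses only the LLM flags.
parser = argparse.ArgumentParser(prog="meerqat")
subparsers = parser.add_subparsers()
ready_parser = subparsers.add_parser("ready")
ready_parser.add_argument("--config")
add_llm_flags(ready_parser)

args = parser.parse_args(
    ["ready", "--llm-provider", "instructor", "--llm-n-ctx", "4096"]
)
```

With this split, `add_common_flags` could call `add_llm_flags` internally so `validate` keeps its full flag set while `ready` stays narrow.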
docs/src/on_the_lookout_with_meerqat.py (1)

49-49: Relative path assumes specific execution context.

The pathlib.Path("../..") / "tests" / "data" pattern works when the notebook is executed from docs/src/. Consider documenting this assumption or using a more robust path resolution if the notebook may be executed from other locations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/src/on_the_lookout_with_meerqat.py` at line 49, The hard-coded relative
resolution for data_root (pathlib.Path("../..") / "tests" / "data" /
"valid_minimal") assumes the notebook runs from docs/src; change it to compute
the path relative to the source file (e.g., use the file's resolved path via
__file__ and parents to locate the repo root and then append
"tests/data/valid_minimal"), or explicitly document the execution-dir
requirement near the data_root declaration so callers know the expected CWD;
update the data_root assignment and its comment accordingly.
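A hedged sketch of the `__file__`-relative resolution described above; the `parents[2]` depth assumes the `docs/src/` layout named in the comment:

```python
import pathlib

# Resolve the sample-data root relative to this file rather than the
# current working directory, so the script works from any CWD.
# Assumption: this file lives two levels below the repository root
# (docs/src/), per the layout described in the review comment.
REPO_ROOT = pathlib.Path(__file__).resolve().parents[2]
data_root = REPO_ROOT / "tests" / "data" / "valid_minimal"
```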
.github/workflows/run-tests.yml (1)

60-65: Consider including model spec file in cache key.

The cache key uses hashFiles('pyproject.toml'), but model specifications (repo_id, filename) are defined in src/meerqat/config.py. If model specs change without a pyproject.toml change, stale cached models would persist.

Optional: Include config.py in cache key
       - name: Cache Hugging Face models
         uses: actions/cache@v4
         with:
           path: ~/.cache/huggingface
-          key: ${{ runner.os }}-meerqat-hf-${{ hashFiles('pyproject.toml') }}
+          key: ${{ runner.os }}-meerqat-hf-${{ hashFiles('pyproject.toml', 'src/meerqat/config.py') }}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/run-tests.yml around lines 60 - 65, The cache key for the
"Cache Hugging Face models" step should include the model spec file so changes
to repo_id/filename invalidate the cache; update the key expression that
currently uses hashFiles('pyproject.toml') to also hash the model spec file
(e.g., 'src/meerqat/config.py') so model spec changes trigger a new cache,
ensuring the workflow step named "Cache Hugging Face models" doesn't return
stale models.
tests/test_api.py (2)

41-48: Test may have non-deterministic LLM behavior.

This test calls validate_dataset(valid_dataset) without disabling LLM review. If an LLM runtime is available in the test environment, this could cause non-deterministic behavior or slow test execution.

Consider using config=_without_llm() for consistency with other tests, unless this is intentionally testing the default config path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_api.py` around lines 41 - 48, The test
test_validate_dataset_discovers_nearby_metadata calls
validate_dataset(valid_dataset) leaving LLM review enabled; to avoid
nondeterminism and slowness, update the call to pass the deterministic test
config (use config=_without_llm()) so call validate_dataset(valid_dataset,
config=_without_llm()), ensuring consistency with other tests that disable LLMs.

20-22: Duplicate helper across test modules.

The _without_llm() helper is duplicated in tests/test_sample_datasets.py. Consider extracting it to conftest.py as a shared fixture.

♻️ Suggested refactor to conftest.py
# tests/conftest.py
import pytest
from meerqat import LLMConfig, ValidationConfig

@pytest.fixture
def config_without_llm() -> ValidationConfig:
    """Return a validation config that skips the runtime review."""
    return ValidationConfig(llm=LLMConfig(enabled=False))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_api.py` around lines 20 - 22, Extract the duplicated helper
function `_without_llm()` into a shared pytest fixture in `tests/conftest.py`:
create a fixture (e.g., `config_without_llm`) that returns
`ValidationConfig(llm=LLMConfig(enabled=False))`, then replace usages of the
`_without_llm()` function in `tests/test_api.py` and
`tests/test_sample_datasets.py` with the new fixture by accepting
`config_without_llm` as a test parameter; ensure you import `LLMConfig` and
`ValidationConfig` in `conftest.py` and remove the local `_without_llm()`
definitions from the test modules.
src/meerqat/reporting.py (1)

84-99: Handle None dataset_id gracefully in report title.

If report.summary.dataset_id is None, the Markdown title will render as # Meerqat Report: None. Consider providing a fallback like "Unknown" or "(no dataset ID)".

📝 Suggested fix
 def report_to_markdown(report: ValidationReport) -> str:
     """Render a report as Markdown."""
+    dataset_id = report.summary.dataset_id or "(unnamed dataset)"
     lines = [
-        f"# Meerqat Report: {report.summary.dataset_id}",
+        f"# Meerqat Report: {dataset_id}",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/meerqat/reporting.py` around lines 84 - 99, The report title currently
interpolates report.summary.dataset_id directly, causing "None" to appear when
it's missing; update report_to_markdown to compute a safe title value (e.g.,
dataset_title = report.summary.dataset_id or "Unknown" / "(no dataset ID)") and
use that variable in the first line instead of report.summary.dataset_id so the
Markdown header shows a friendly fallback when dataset_id is None.
spec.md (2)

270-291: Configuration example has formatting artifacts.

Line 280 shows \- "Images/Index.xml" with an escaped dash. This should be a plain YAML list item.

📝 Suggested fix
 required_files:
 
-\- "Images/Index.xml"
+  - "Images/Index.xml"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@spec.md` around lines 270 - 291, The YAML example under the dataset block
contains a formatting artifact: the required_files entry shows an escaped dash
(`\- "Images/Index.xml"`) instead of a normal list item; update the
required_files list so the item is a proper YAML list entry (replace `\-
"Images/Index.xml"` with `- "Images/Index.xml"`) within the same
dataset/required_files section to ensure valid YAML parsing.

226-263: Consider reformatting the Domain Model section.

The class definitions mix Markdown prose with code-block-style formatting in an unusual way. The backslash escapes (folder\_name, plate\_id) appear to be rendering artifacts.

📝 Suggested formatting
 ## Domain Model
 
 ### Dataset
 
-class Dataset:
-
-```
-dataset\_id: str | None
-
-plates: list\[Plate\]
-
-metadata\_records: list\[MetadataRecord\]
-```
+```python
+class Dataset:
+    dataset_id: str | None
+    plates: list[Plate]
+    metadata_records: list[MetadataRecord]
+```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@spec.md` around lines 226 - 263, The Domain Model section mixes prose and
malformed code blocks with backslash-escaped identifiers; update the three class
definitions (Dataset, Plate, ValidationIssue) to use proper fenced code blocks
(```python) and normal attribute names (dataset_id, plates, metadata_records,
folder_name, normalized_plate_id, xml_plate_id, metadata_plate_id, rule_id,
severity, message, entity_id) with consistent indentation and type annotations,
removing all backslash escapes and stray Markdown artifacts so the classes
render as clean Python-style examples.
docs/src/on_the_lookout_with_meerqat.ipynb (1)

203-231: Consider clearing or sanitizing notebook outputs before committing.

The cell outputs contain absolute filesystem paths (e.g., /Users/buntend/Documents/work/meerqat/...) that expose the local development environment. While not a security risk, this is unnecessary noise in documentation and can cause merge conflicts when different developers run the notebook.

Consider either:

  1. Clearing outputs before committing (jupyter nbconvert --clear-output)
  2. Using relative paths in the displayed output if possible
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/src/on_the_lookout_with_meerqat.ipynb` around lines 203 - 231, The
notebook cell outputs include absolute local paths (see keys like
'empty_directories', 'filetree_issues', and example entries
'segment_A'/'segment_B') which leak developer-specific filesystem info; fix by
clearing or sanitizing the outputs before committing: either run a notebook
clear-output command (e.g., nbconvert --clear-output or use your editor's "Clear
All Outputs") to remove the printed results, or replace those absolute paths in
the output cells with relative paths or sanitized placeholders (e.g.,
"./tests/data/...") in the cells that produce 'filetree_issues' so the committed
notebook contains no local absolute paths.
src/meerqat/llm.py (3)

302-311: Hardcoded 30-second server startup timeout.

The timeout value is hardcoded. Consider making it configurable via LLMConfig for environments where model loading takes longer (e.g., larger models or slower hardware).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/meerqat/llm.py` around lines 302 - 311, The _wait_for_server function
currently uses a hardcoded 30s timeout; update it to accept a configurable
timeout (e.g., timeout_seconds: float) or read from the existing LLMConfig
(e.g., LLMConfig.server_startup_timeout) and use that value instead of 30,
update all call sites that invoke _wait_for_server(base_url, process) to pass
the configured timeout or to ensure LLMConfig is accessible there, and adjust
the error messages to include the configured timeout for clarity; ensure
function signature and any imports/constructors that create or pass LLMConfig
are updated accordingly.
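A minimal sketch of how the timeout could be threaded through the config; the field name `server_startup_timeout` is the one proposed above, and the trimmed-down `LLMConfig` stand-in is illustrative rather than the real MeerQat dataclass:

```python
import time
from collections.abc import Callable
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMConfig:
    """Stand-in for the real config; only the proposed field is shown."""

    server_startup_timeout: float = 30.0  # previously a hardcoded 30s


def wait_for_server(
    is_ready: Callable[[], bool],
    config: LLMConfig,
    poll_interval: float = 0.05,
) -> bool:
    """Poll `is_ready` until it succeeds or the configured timeout elapses."""
    deadline = time.monotonic() + config.server_startup_timeout
    while time.monotonic() < deadline:
        if is_ready():
            return True
        time.sleep(poll_interval)
    return False
```

Callers that construct the config for larger models could then raise the timeout without touching the probe logic.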

240-252: URL scheme validation for urlopen calls.

The static analysis S310 warnings are reasonable for auditing, but these calls are for internal OpenAI-compatible API endpoints derived from config.base_url. The code already validates localhost URLs via _is_local_base_url. Consider adding explicit scheme validation if the base_url could come from untrusted sources.

🛡️ Optional: Add explicit scheme validation
 def _server_is_ready(base_url: str) -> bool:
     """Check whether an OpenAI-compatible endpoint is reachable."""
+    parsed = urllib_parse.urlsplit(base_url)
+    if parsed.scheme not in {"http", "https"}:
+        return False
     try:
         with urllib_request.urlopen(_models_url(base_url), timeout=1) as response:
             return response.status < HTTP_SERVER_ERROR_STATUS
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/meerqat/llm.py` around lines 240 - 252, The _server_is_ready and
_fetch_models_payload functions call urllib_request.urlopen with a URL derived
from _models_url(base_url) without explicit scheme checks; validate the scheme
extracted from base_url (or from the constructed _models_url) using
urllib.parse.urlparse and ensure it is "http" or "https" (and/or short-circuit
if _is_local_base_url(base_url) allows localhost) before calling urlopen,
returning False or raising a clear ValueError on invalid schemes, and update
callers to handle that behavior; reference the functions _server_is_ready,
_fetch_models_payload, _models_url and helper _is_local_base_url when making
this change.

141-173: Subprocess call compiles C code with hardcoded compiler path.

The subprocess.run call invokes cc directly. While Ruff flags this as S603/S607, the command arguments are constructed from internal paths, not user input. However, the function could fail silently on systems without a C compiler.

💡 Consider adding a more descriptive error message
     except (OSError, subprocess.CalledProcessError) as error:
         raise RuntimeError(
-            "Failed to build the macOS CPU-only llama.cpp runtime shim."
+            "Failed to build the macOS CPU-only llama.cpp runtime shim. "
+            "Ensure Xcode Command Line Tools are installed (`xcode-select --install`)."
         ) from error
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/meerqat/llm.py` around lines 141 - 173, The subprocess.run call in
_write_macos_metal_stub uses "cc" and currently raises a generic RuntimeError on
failure; make failures actionable by first checking for a C compiler (use
shutil.which to prefer "cc" then "clang") and fall back accordingly, and when
subprocess.run raises subprocess.CalledProcessError or FileNotFoundError include
the captured stderr/stdout (or the exception text) in the raised RuntimeError
with a clear message suggesting installing the macOS command line tools; update
the exception handling around the subprocess.run call to reference
subprocess.run (and the CalledProcessError/FileNotFoundError) and include those
error details in the RuntimeError.
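A sketch of the compiler-discovery fallback described above; the function name `find_c_compiler` is hypothetical, not part of the module:

```python
import shutil


def find_c_compiler() -> str:
    """Prefer `cc`, fall back to `clang`; fail loudly if neither exists."""
    for candidate in ("cc", "clang"):
        path = shutil.which(candidate)
        if path is not None:
            return path
    raise RuntimeError(
        "No C compiler found; cannot build the macOS CPU-only llama.cpp "
        "runtime shim. On macOS, install the command line tools with "
        "`xcode-select --install`."
    )
```

The resolved path would then be passed to `subprocess.run(..., capture_output=True)` so stderr can be included in the error if compilation fails.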
tests/test_units.py (1)

360-390: Consider using pytest.raises instead of generator throw pattern.

The generator throw pattern (_ for _ in ()).throw(...) is functional but unusual. While it works for lazy exception raising in lambdas, it reduces readability.

♻️ Alternative approach using a helper function
+def _raise_runtime_error(msg: str):
+    def raiser(*args, **kwargs):
+        raise RuntimeError(msg)
+    return raiser
+
 def test_llm_returns_failed_review_when_all_paths_fail(
     monkeypatch: pytest.MonkeyPatch,
     valid_dataset: Path,
     metadata_csv: Path,
 ) -> None:
     """The core review should surface a structured failure instead of crashing."""
     report = _report_without_llm(valid_dataset, metadata_csv)
 
     monkeypatch.setattr(
         "meerqat.llm._ensure_local_instructor_server",
-        lambda config: (_ for _ in ()).throw(RuntimeError("startup failed")),
+        _raise_runtime_error("startup failed"),
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_units.py` around lines 360 - 390, Replace the unusual
generator.throw pattern used in the monkeypatch lambdas with a clear helper that
raises the intended exception; define a small function (e.g., raise_exc(exc))
that simply raises the provided exception and use it in the monkeypatch.setattr
calls for meerqat.llm._ensure_local_instructor_server,
meerqat.llm._invoke_instructor, and meerqat.llm._invoke_langchain so the test
remains readable while still simulating startup/import failures.
src/meerqat/main.py (1)

67-88: Redundant LLMReview construction - same object created twice.

The LLMReview object is constructed identically at lines 74-79 and 81-86. This duplicates code and creates an unnecessary intermediate object.

♻️ Extract to a single variable
     llm_review = generate_llm_review(report, active_config.llm)
+    llm_review_model = LLMReview(
+        status=llm_review.status,
+        provider=llm_review.provider,
+        model=llm_review.model,
+        error=llm_review.error,
+    )
     return ValidationReport(
         summary=report.summary,
         issues=report.issues,
         dataset=report.dataset,
         rule_results=report.rule_results,
         llm_hints=llm_review.hints,
         llm_findings=llm_review.findings,
-        llm_review=LLMReview(
-            status=llm_review.status,
-            provider=llm_review.provider,
-            model=llm_review.model,
-            error=llm_review.error,
-        ),
-        provenance=_build_provenance(
-            LLMReview(
-                status=llm_review.status,
-                provider=llm_review.provider,
-                model=llm_review.model,
-                error=llm_review.error,
-            )
-        ),
+        llm_review=llm_review_model,
+        provenance=_build_provenance(llm_review_model),
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/meerqat/main.py` around lines 67 - 88, Create a single LLMReview instance
and reuse it instead of constructing it twice: build one local variable (e.g.,
llm_review_obj = LLMReview(status=llm_review.status,
provider=llm_review.provider, model=llm_review.model, error=llm_review.error))
and then pass llm_review_obj to both the ValidationReport(llm_review=...) field
and to _build_provenance(llm_review_obj), leaving the uses of llm_review.hints
and llm_review.findings unchanged.
src/meerqat/ingestion.py (1)

107-138: Potential performance concern with large directory trees.

The _build_filetree_summary function iterates over all paths multiple times and performs O(n²) similarity comparisons on directories. For datasets with many directories, this could be slow.

💡 Consider early termination for similarity checks
     similar_pairs: list[tuple[str, str]] = []
+    MAX_SIMILAR_PAIRS = 100  # Limit output to prevent explosion
     for index, left in enumerate(directories):
+        if len(similar_pairs) >= MAX_SIMILAR_PAIRS:
+            break
         left_text = str(left.relative_to(dataset_root))
         for right in directories[index + 1 :]:
             right_text = str(right.relative_to(dataset_root))
             similarity = SequenceMatcher(None, left_text, right_text).ratio()
             if similarity >= SIMILAR_DIRECTORY_THRESHOLD:
                 similar_pairs.append((str(left), str(right)))
+                if len(similar_pairs) >= MAX_SIMILAR_PAIRS:
+                    break
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/meerqat/ingestion.py` around lines 107 - 138, _build_filetree_summary
currently walks paths multiple times and runs O(n²) SequenceMatcher comparisons
over directories (similar_pairs using SequenceMatcher and directories), which
will be slow on large trees; to fix, do a single pass collecting paths,
directory_children and directories (use one loop over dataset_root.rglob("*") to
build paths_list, directory_children and directories), then reduce pairwise
comparisons by bucketing directories (e.g., group by directory name stem, first
N chars, depth, or token prefixes) and only run SequenceMatcher within each
bucket (retain SIMILAR_DIRECTORY_THRESHOLD and SequenceMatcher but apply them to
bucketed pairs), and finally build empty_directories and file_extensions from
the single-pass data; update references in the function: paths,
directory_children, directories, similar_pairs, SequenceMatcher,
SIMILAR_DIRECTORY_THRESHOLD.
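A small sketch of the bucketing idea: group directories by parent and a short name prefix, then run `SequenceMatcher` only within each bucket instead of over every pair. The bucket key and the threshold value here are illustrative, not the module's actual heuristics:

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Assumed value for illustration; the real constant lives in ingestion.py.
SIMILAR_DIRECTORY_THRESHOLD = 0.9


def similar_directory_pairs(paths: list[str]) -> list[tuple[str, str]]:
    """Find near-duplicate directory paths without all-pairs comparison."""
    # Bucket by (parent directory, first 3 chars of the name): paths in
    # different buckets are unlikely to be near-duplicates, so we skip them.
    buckets: dict[tuple[str, str], list[str]] = defaultdict(list)
    for path in paths:
        parent, _, name = path.rpartition("/")
        buckets[(parent, name[:3].lower())].append(path)

    pairs: list[tuple[str, str]] = []
    for bucket in buckets.values():
        for index, left in enumerate(bucket):
            for right in bucket[index + 1 :]:
                ratio = SequenceMatcher(None, left, right).ratio()
                if ratio >= SIMILAR_DIRECTORY_THRESHOLD:
                    pairs.append((left, right))
    return pairs
```

This keeps the quadratic cost confined to each (typically small) bucket, at the price of missing similar names whose prefixes differ.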
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/ISSUE_TEMPLATE/issue.yml:
- Line 18: Update the duplicate-confirmation checkbox text that currently reads
"I found no existing covering this topic." to include the missing noun; locate
the string "I found no existing covering this topic." in
.github/ISSUE_TEMPLATE/issue.yml and change it to "I found no existing issue
covering this topic." so the wording is grammatically correct and clear.

In @.github/PULL_REQUEST_TEMPLATE.md:
- Line 14: Update the sentence in the pull request template that currently reads
"you may used a `#<digit>` to reference GitHub issues" to the correct phrasing
"you may use a `#<digit>` to reference GitHub issues" so the contributor
guidance is grammatically correct; locate the exact line containing that
sentence in the PULL_REQUEST_TEMPLATE and replace "used" with "use" in the
string.

In @.github/workflows/draft-release.yml:
- Line 21: Replace the mutable action tag used in the workflow: change the
"uses: release-drafter/release-drafter@v6" entry to reference a specific commit
SHA (e.g., "uses: release-drafter/release-drafter@<full-commit-sha>") so the
release-drafter action is pinned to an immutable commit; update the SHA to a
vetted commit from the release-drafter repo and commit the workflow change.

In @.github/workflows/publish-docs.yml:
- Around line 13-41: The workflow currently runs the docs deploy step with
JamesIves/github-pages-deploy-action@v4 but doesn't grant the GITHUB_TOKEN write
permissions; add an explicit permissions block (e.g., permissions: contents:
write, pages: write) at the workflow or jobs.build level so the deploy action
can push to the pages branch; update the workflow YAML around the jobs.build
definition (referencing jobs.build and the step using
JamesIves/github-pages-deploy-action@v4) to include this permissions block.

In @.github/workflows/publish-pypi.yml:
- Around line 19-33: Replace the mutable action tags with immutable commit SHAs:
for uses: actions/checkout@v4, actions/setup-python@v5, astral-sh/setup-uv@v6,
and pypa/gh-action-pypi-publish@release/v1 swap the version tags to the
corresponding full 40-character commit SHA for each action (look up each action
repo’s commit and update the uses: entries). Ensure you keep the existing keys
(with:, run:) intact and only replace the version suffix after each action name
so the workflow pins to exact commits.

In @.pre-commit-config.yaml:
- Around line 39-40: The yamllint hook's exclude path is incorrect: change the
exclude value under the yamllint entry (id: yamllint) from
"pre-commit-config.yaml" to ".pre-commit-config.yaml" so the hook correctly
ignores this file name; update the exclude string associated with id: yamllint
accordingly.

In `@CODE_OF_CONDUCT.md`:
- Around line 61-64: Replace the vague "reach out to the maintainers by using
their GitHub profile email" line with a concrete, stable reporting channel by
adding a dedicated reporting email alias (for example security@<org-domain>.org)
or a secure incident form URL, update the paragraph that begins "Instances of
abusive, harassing, or otherwise unacceptable behavior may be reported…" to
include that channel, and add a brief sentence that reports will be handled
confidentially and promptly (so readers know what to expect); ensure the new
contact is the single authoritative reporting point referenced in the same
paragraph.

In `@CONTRIBUTING.md`:
- Line 91: Fix the typo in CONTRIBUTING.md by replacing the misspelled word
"occaissions" with "occasions" in the sentence that mentions `_version.py` (the
sentence that reads "We also use the `_version.py` file as a place to persist
the version data for occaissions where the `git` history is unavailable or
unwanted (this file is only present in package builds).").

In `@docs/src/conf.py`:
- Around line 32-38: The extensions list currently includes the theme name
"pydata_sphinx_theme" which is not an extension; remove "pydata_sphinx_theme"
from the extensions = [...] array (leave other extensions like "myst_nb",
"sphinx.ext.autodoc", "sphinx.ext.napoleon", "sphinx.ext.viewcode" intact) and
rely on the existing html_theme setting (html_theme = "pydata_sphinx_theme") to
enable the theme instead.

In `@pyproject.toml`:
- Line 7: Remove the static version entry from pyproject.toml (the line setting
version = "0.0.1") so setuptools_scm can drive package versioning; leave the
existing setuptools_scm configuration (the [tool.setuptools_scm] block) intact
and ensure build-system still lists setuptools_scm as a build requirement so
versions are generated dynamically.

In `@spec.md`:
- Around line 215-216: The spec currently lists OME-Zarr as implemented in the
"Implemented Scope" section and also as pending in the "Future Work" section;
update the document to be consistent by deciding the true status and removing or
moving the OME-Zarr bullet accordingly: if OME-Zarr is implemented, remove it
from the "Future Work" list (or mark it as completed) and ensure the
"Implemented Scope" bullet explicitly names OME-Zarr; if it is not yet
implemented, remove it from "Implemented Scope" and keep it under "Future Work"
with any planned notes. Ensure both headings ("Implemented Scope" and "Future
Work") reference the same final status for OME-Zarr to eliminate the
contradiction.

In `@src/meerqat/ingestion.py`:
- Around line 56-76: Replace use of the vulnerable xml.etree.ElementTree with
defusedxml's safe parser: import defusedxml.ElementTree as ET and update calls
like ElementTree.parse(xml_path) to ET.parse(xml_path) in _extract_xml_plate_id
(ingestion.py) and the XML validation function in validation.py; also add
defusedxml to project dependencies (pyproject.toml) so the safe parser is
installed.

In `@src/meerqat/validation.py`:
- Around line 168-180: Replace unsafe ElementTree parsing with defusedxml's safe
parser in both locations: in the XML parseability check that calls
ElementTree.parse(plate.xml_path) (refer to the validation code that reads
plate.xml_path) and in the ingestion code that extracts plate identifiers from
XML (the routine that parses XML data in ingestion.py). Import
defusedxml.ElementTree (or alias it as ET) and use its parse/fromstring APIs
instead of xml.etree.ElementTree to prevent XXE and entity-expansion attacks,
and add defusedxml to pyproject.toml dependencies so the safe parser is
installed.

In `@tests/test_cli.py`:
- Around line 17-23: The _cli_env function builds a PYTHONPATH string using a
hardcoded ":" which is Unix-only; update _cli_env to join src_path and existing
PYTHONPATH using os.pathsep so the separator is platform-specific (modify the
logic around env["PYTHONPATH"] in _cli_env to use os.pathsep when concatenating
src_path and existing).

---

Nitpick comments:
In @.github/workflows/run-tests.yml:
- Around line 60-65: The cache key for the "Cache Hugging Face models" step
should include the model spec file so changes to repo_id/filename invalidate the
cache; update the key expression that currently uses hashFiles('pyproject.toml')
to also hash the model spec file (e.g., 'src/meerqat/config.py') so model spec
changes trigger a new cache, ensuring the workflow step named "Cache Hugging
Face models" doesn't return stale models.

In `@CITATION.cff`:
- Line 14: The CITATION.cff metadata uses a lowercased org name in the
repository-code field; update the repository-code value to use the repository's
canonical casing ("WayScience/meerqat") so it matches project metadata and
docs—locate the repository-code entry in CITATION.cff and change
"https://github.com/wayscience/meerqat" to the correctly cased canonical URL.

In `@CONTRIBUTING.md`:
- Around line 63-65: Consolidate the redundant consecutive headers by removing
or merging one of them: keep a single clear header (either "Creating a pull
request" or "Pull requests") and move any distinct content under that header;
update the section title where the duplicate appears and ensure the remaining
heading matches surrounding style and anchors (the two header strings "Creating
a pull request" and "Pull requests" identify where to make the change).

In `@docs/src/on_the_lookout_with_meerqat.ipynb`:
- Around line 203-231: The notebook cell outputs include absolute local paths
(see keys like 'empty_directories', 'filetree_issues', and example entries
'segment_A'/'segment_B') which leak developer-specific filesystem info; fix by
clearing or sanitizing the outputs before committing: either run a notebook
clear-output command (e.g., nbconvert --clear-output or use your editor's "Clear
All Outputs") to remove the printed results, or replace those absolute paths in
the output cells with relative paths or sanitized placeholders (e.g.,
"./tests/data/...") in the cells that produce 'filetree_issues' so the committed
notebook contains no local absolute paths.

In `@docs/src/on_the_lookout_with_meerqat.py`:
- Line 49: The hard-coded relative resolution for data_root
(pathlib.Path("../..") / "tests" / "data" / "valid_minimal") assumes the
notebook runs from docs/src; change it to compute the path relative to the
source file (e.g., use the file's resolved path via __file__ and parents to
locate the repo root and then append "tests/data/valid_minimal"), or explicitly
document the execution-dir requirement near the data_root declaration so callers
know the expected CWD; update the data_root assignment and its comment
accordingly.

In `@spec.md`:
- Around line 270-291: The YAML example under the dataset block contains a
formatting artifact: the required_files entry shows an escaped dash (`\-
"Images/Index.xml"`) instead of a normal list item; update the required_files
list so the item is a proper YAML list entry (replace `\- "Images/Index.xml"`
with `- "Images/Index.xml"`) within the same dataset/required_files section to
ensure valid YAML parsing.
- Around line 226-263: The Domain Model section mixes prose and malformed code
blocks with backslash-escaped identifiers; update the three class definitions
(Dataset, Plate, ValidationIssue) to use proper fenced code blocks (```python)
and normal attribute names (dataset_id, plates, metadata_records, folder_name,
normalized_plate_id, xml_plate_id, metadata_plate_id, rule_id, severity,
message, entity_id) with consistent indentation and type annotations, removing
all backslash escapes and stray Markdown artifacts so the classes render as
clean Python-style examples.

In `@src/meerqat/cli.py`:
- Around line 172-188: The ready subparser duplicates LLM flags already provided
by add_common_flags; update the ready parser to reuse those shared flags by
calling add_common_flags(ready_parser) instead of manually adding --llm-*
options, or if ready should only include LLM flags (not metadata/dataset/report
flags) extract a new helper add_llm_flags(parser) and call that from both
add_common_flags and where ready_parser is configured; adjust references to
ready_parser and the handler _ready_command accordingly so flag names remain
unchanged.

In `@src/meerqat/ingestion.py`:
- Around line 107-138: _build_filetree_summary currently walks paths multiple
times and runs O(n²) SequenceMatcher comparisons over directories (similar_pairs
using SequenceMatcher and directories), which will be slow on large trees; to
fix, do a single pass collecting paths, directory_children and directories (use
one loop over dataset_root.rglob("*") to build paths_list, directory_children
and directories), then reduce pairwise comparisons by bucketing directories
(e.g., group by directory name stem, first N chars, depth, or token prefixes)
and only run SequenceMatcher within each bucket (retain
SIMILAR_DIRECTORY_THRESHOLD and SequenceMatcher but apply them to bucketed
pairs), and finally build empty_directories and file_extensions from the
single-pass data; update references in the function: paths, directory_children,
directories, similar_pairs, SequenceMatcher, SIMILAR_DIRECTORY_THRESHOLD.
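The bucketing idea above can be sketched like this. The threshold value and bucket key are assumptions for illustration; note that bucketing trades recall for speed, since similar names that land in different buckets are never compared:

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Assumed value for illustration; the real constant lives in ingestion.py.
SIMILAR_DIRECTORY_THRESHOLD = 0.8


def similar_directory_pairs(directories):
    """Compare directory names only within coarse buckets (3-char prefix
    plus a length band) instead of across all O(n^2) pairs."""
    buckets = defaultdict(list)
    for name in directories:
        buckets[(name[:3].lower(), len(name) // 4)].append(name)
    pairs = []
    for bucket in buckets.values():
        for i, left in enumerate(bucket):
            for right in bucket[i + 1:]:
                ratio = SequenceMatcher(None, left, right).ratio()
                if ratio >= SIMILAR_DIRECTORY_THRESHOLD:
                    pairs.append((left, right))
    return pairs
```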

In `@src/meerqat/llm.py`:
- Around line 302-311: The _wait_for_server function currently uses a hardcoded
30s timeout; update it to accept a configurable timeout (e.g., timeout_seconds:
float) or read from the existing LLMConfig (e.g.,
LLMConfig.server_startup_timeout) and use that value instead of 30, update all
call sites that invoke _wait_for_server(base_url, process) to pass the
configured timeout or to ensure LLMConfig is accessible there, and adjust the
error messages to include the configured timeout for clarity; ensure function
signature and any imports/constructors that create or pass LLMConfig are updated
accordingly.
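One way to sketch the configurable wait suggested above. The real `_wait_for_server` takes a base URL and process; here readiness is injected as a callable so the deadline logic stands alone:

```python
import time


def wait_for_server(is_ready, timeout_seconds: float = 30.0,
                    poll_interval: float = 0.25) -> None:
    """Poll is_ready() until it returns True or the deadline passes.

    The timeout is a parameter (it could equally come from a config
    object such as LLMConfig) and is echoed in the error for clarity.
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if is_ready():
            return
        time.sleep(poll_interval)
    raise TimeoutError(
        f"server did not become ready within {timeout_seconds} seconds"
    )
```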
- Around line 240-252: The _server_is_ready and _fetch_models_payload functions
call urllib_request.urlopen with a URL derived from _models_url(base_url)
without explicit scheme checks; validate the scheme extracted from base_url (or
from the constructed _models_url) using urllib.parse.urlparse and ensure it is
"http" or "https" (and/or short-circuit if _is_local_base_url(base_url) allows
localhost) before calling urlopen, returning False or raising a clear ValueError
on invalid schemes, and update callers to handle that behavior; reference the
functions _server_is_ready, _fetch_models_payload, _models_url and helper
_is_local_base_url when making this change.
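A minimal sketch of the scheme guard described above, run before any `urlopen` call. `urlopen` accepts schemes such as `file://` and `ftp://`, so rejecting anything but http(s) closes that gap:

```python
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}


def validate_base_url(base_url: str) -> str:
    """Raise ValueError unless base_url uses http or https; return it unchanged."""
    scheme = urlparse(base_url).scheme.lower()
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError(
            f"unsupported URL scheme {scheme!r} in {base_url!r}"
        )
    return base_url
```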
- Around line 141-173: The subprocess.run call in _write_macos_metal_stub uses
"cc" and currently raises a generic RuntimeError on failure; make failures
actionable by first checking for a C compiler (use shutil.which to prefer "cc"
then "clang") and fall back accordingly, and when subprocess.run raises
subprocess.CalledProcessError or FileNotFoundError include the captured
stderr/stdout (or the exception text) in the raised RuntimeError with a clear
message suggesting installing the macOS command line tools; update the exception
handling around the subprocess.run call to reference subprocess.run (and the
CalledProcessError/FileNotFoundError) and include those error details in the
RuntimeError.

In `@src/meerqat/main.py`:
- Around line 67-88: Create a single LLMReview instance and reuse it instead of
constructing it twice: build one local variable (e.g., llm_review_obj =
LLMReview(status=llm_review.status, provider=llm_review.provider,
model=llm_review.model, error=llm_review.error)) and then pass llm_review_obj to
both the ValidationReport(llm_review=...) field and to
_build_provenance(llm_review_obj), leaving the uses of llm_review.hints and
llm_review.findings unchanged.

In `@src/meerqat/reporting.py`:
- Around line 84-99: The report title currently interpolates
report.summary.dataset_id directly, causing "None" to appear when it's missing;
update report_to_markdown to compute a safe title value (e.g., dataset_title =
report.summary.dataset_id or "Unknown" / "(no dataset ID)") and use that
variable in the first line instead of report.summary.dataset_id so the Markdown
header shows a friendly fallback when dataset_id is None.

In `@tests/test_api.py`:
- Around line 41-48: The test test_validate_dataset_discovers_nearby_metadata
calls validate_dataset(valid_dataset) leaving LLM review enabled; to avoid
nondeterminism and slowness, update the call to pass the deterministic test
config (use config=_without_llm()) so call validate_dataset(valid_dataset,
config=_without_llm()), ensuring consistency with other tests that disable LLMs.
- Around line 20-22: Extract the duplicated helper function `_without_llm()`
into a shared pytest fixture in `tests/conftest.py`: create a fixture (e.g.,
`config_without_llm`) that returns
`ValidationConfig(llm=LLMConfig(enabled=False))`, then replace usages of the
`_without_llm()` function in `tests/test_api.py` and
`tests/test_sample_datasets.py` with the new fixture by accepting
`config_without_llm` as a test parameter; ensure you import `LLMConfig` and
`ValidationConfig` in `conftest.py` and remove the local `_without_llm()`
definitions from the test modules.

In `@tests/test_units.py`:
- Around line 360-390: Replace the unusual generator.throw pattern used in the
monkeypatch lambdas with a clear helper that raises the intended exception;
define a small function (e.g., raise_exc(exc)) that simply raises the provided
exception and use it in the monkeypatch.setattr calls for
meerqat.llm._ensure_local_instructor_server, meerqat.llm._invoke_instructor, and
meerqat.llm._invoke_langchain so the test remains readable while still
simulating startup/import failures.
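One shape for the helper described above. Since a lambda body cannot contain a `raise` statement, a small factory that returns a raising callable drops in wherever the `generator.throw` lambdas were:

```python
def raise_exc(exc):
    """Return a callable that raises exc when invoked, for use with
    monkeypatch.setattr in place of a lambda (lambdas cannot raise)."""
    def _raiser(*args, **kwargs):
        raise exc
    return _raiser
```

Usage in a test would look like `monkeypatch.setattr(meerqat.llm, "_invoke_instructor", raise_exc(RuntimeError("startup failed")))` (the patched attribute names follow the review comment; the factory form is an assumption, as the review's simpler variant raises directly when called).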
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7a7db00b-61d9-419e-9ab1-6e287a47410f

📥 Commits

Reviewing files that changed from the base of the PR and between 7f52976 and 921c781.

⛔ Files ignored due to path filters (15)
  • docs/src/_static/meerqat.png is excluded by !**/*.png
  • tests/data/filetree_risks/dataset/Plate_Z09/image_001.tiff is excluded by !**/*.tiff
  • tests/data/filetree_risks/metadata.csv is excluded by !**/*.csv
  • tests/data/missing_index/dataset/Plate_C03/image_001.tiff is excluded by !**/*.tiff
  • tests/data/missing_index/metadata.csv is excluded by !**/*.csv
  • tests/data/missing_metadata/dataset/Plate_D04/image_001.tiff is excluded by !**/*.tiff
  • tests/data/missing_metadata/metadata.csv is excluded by !**/*.csv
  • tests/data/mixed_modalities/dataset/Plate_E05/image_001.tiff is excluded by !**/*.tiff
  • tests/data/mixed_modalities/dataset/Plate_E05/preview.png is excluded by !**/*.png
  • tests/data/mixed_modalities/metadata.csv is excluded by !**/*.csv
  • tests/data/valid_minimal/dataset/Plate_A01/image_001.tiff is excluded by !**/*.tiff
  • tests/data/valid_minimal/metadata.csv is excluded by !**/*.csv
  • tests/data/xml_mismatch/dataset/Plate_B02/image_001.tiff is excluded by !**/*.tiff
  • tests/data/xml_mismatch/metadata.csv is excluded by !**/*.csv
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (49)
  • .github/ISSUE_TEMPLATE/issue.yml
  • .github/PULL_REQUEST_TEMPLATE.md
  • .github/dependabot.yml
  • .github/release-drafter.yml
  • .github/workflows/draft-release.yml
  • .github/workflows/publish-docs.yml
  • .github/workflows/publish-pypi.yml
  • .github/workflows/run-tests.yml
  • .gitignore
  • .pre-commit-config.yaml
  • .python-version
  • CITATION.cff
  • CODE_OF_CONDUCT.md
  • CONTRIBUTING.md
  • README.md
  • architecture.md
  • docs/src/_static/custom.css
  • docs/src/cli.md
  • docs/src/conf.py
  • docs/src/index.md
  • docs/src/on_the_lookout_with_meerqat.ipynb
  • docs/src/on_the_lookout_with_meerqat.py
  • docs/src/python-api.md
  • pyproject.toml
  • roadmap.md
  • spec.md
  • src/meerqat/__init__.py
  • src/meerqat/cli.py
  • src/meerqat/config.py
  • src/meerqat/ingestion.py
  • src/meerqat/llm.py
  • src/meerqat/main.py
  • src/meerqat/models.py
  • src/meerqat/normalization.py
  • src/meerqat/reporting.py
  • src/meerqat/validation.py
  • tests/conftest.py
  • tests/data/README.md
  • tests/data/filetree_risks/dataset/Plate_Z09/Index.xml
  • tests/data/missing_metadata/dataset/Plate_D04/Index.xml
  • tests/data/mixed_modalities/dataset/Plate_E05/Index.xml
  • tests/data/valid_minimal/dataset/Plate_A01/Index.xml
  • tests/data/xml_mismatch/dataset/Plate_B02/Index.xml
  • tests/test_api.py
  • tests/test_cli.py
  • tests/test_incident_regressions.py
  • tests/test_llm_runtime.py
  • tests/test_sample_datasets.py
  • tests/test_units.py

Comment on lines +215 to +216
- Image inventory support for TIFF, OME-TIFF, OME-Zarr, PNG, and JPEG assets
- Single-dataset and batch validation through both API and CLI

⚠️ Potential issue | 🟡 Minor

Inconsistency: OME-Zarr listed in both "Implemented Scope" and "Future Work".

Line 215 states OME-Zarr is implemented, but line 366 lists it under "Future Work". Please reconcile these sections to reflect the actual implementation status.

Also applies to: 366-366

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@spec.md` around lines 215 - 216, The spec currently lists OME-Zarr as
implemented in the "Implemented Scope" section and also as pending in the
"Future Work" section; update the document to be consistent by deciding the true
status and removing or moving the OME-Zarr bullet accordingly: if OME-Zarr is
implemented, remove it from the "Future Work" list (or mark it as completed) and
ensure the "Implemented Scope" bullet explicitly names OME-Zarr; if it is not
yet implemented, remove it from "Implemented Scope" and keep it under "Future
Work" with any planned notes. Ensure both headings ("Implemented Scope" and
"Future Work") reference the same final status for OME-Zarr to eliminate the
contradiction.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 11

🧹 Nitpick comments (3)
src/meerqat/cli.py (1)

192-196: Consider logging the full traceback for debugging.

Catching bare Exception is reasonable for CLI error handling to ensure clean exits, but the current implementation discards the traceback. Consider logging or printing the full traceback when a debug/verbose flag is set, or at minimum to stderr.

Proposed improvement
+import traceback
+
 def main(argv: list[str] | None = None) -> int:
     """Run the CLI."""
     parser = build_parser()
     args = parser.parse_args(argv)
     try:
         return int(args.handler(args))
     except Exception as error:
         print(f"Meerqat system error: {error}", file=sys.stderr)
+        traceback.print_exc(file=sys.stderr)
         return 3
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/meerqat/cli.py` around lines 192 - 196, The except block around calling
args.handler(args) currently only prints the exception message and discards the
traceback; update the handler of that exception in the CLI entry (the try/except
surrounding return int(args.handler(args))) to emit the full traceback to stderr
or the logger when a debug/verbose flag is set (e.g., args.debug or
args.verbose) by calling traceback.print_exc(file=sys.stderr) or
logging.exception(...) as appropriate, otherwise continue returning exit code
3—ensure you import traceback or use the logging API and reference args.handler
and the surrounding CLI entrypoint when making the change.
CONTRIBUTING.md (1)

53-55: Document the --frozen flag to match CI.

The CI workflow uses uv run --frozen pytest, but this documentation shows only uv run pytest. Consider aligning the example with CI behavior or noting when --frozen is appropriate.

Proposed fix
-% uv run pytest
+% uv run --frozen pytest

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @CONTRIBUTING.md around lines 53 - 55, Update the example command in
CONTRIBUTING.md to match CI usage by adding the --frozen flag: change the
example invocation "uv run pytest" to "uv run --frozen pytest" (or add a brief
note explaining when to use --frozen) so the documented command aligns with the
CI workflow; locate the example command string "uv run pytest" to make the edit.

docs/src/on_the_lookout_with_meerqat.ipynb (1)

230-233: CLI cell uses hardcoded relative path.

The shell command uses a hardcoded `pathlib.Path("../..")` instead of the computed `repo_root` variable used elsewhere in the notebook. This works in Jupyter shell commands via brace expansion, but the markdown documentation cell above shows `{repo_root / ...}` as the pattern. Consider using `{repo_root}` consistently:

!meerqat validate \
  {repo_root / "tests" / "data" / "xml_mismatch" / "dataset"}

This ensures the command works regardless of the notebook's working directory.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/src/on_the_lookout_with_meerqat.ipynb` around lines 230 - 233, The CLI
cell is using a hardcoded pathlib.Path("../..") instead of the notebook's
repo_root variable; update the shell command in the CLI cell that runs "!meerqat
validate ..." to use the existing repo_root (repo_root / "tests" / "data" /
"xml_mismatch" / "dataset") so it matches the notebook's other examples and
works regardless of working directory—replace the hardcoded
pathlib.Path("../..") expression with the repo_root variable reference wherever
that validation command appears.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@CONTRIBUTING.md`:
- Line 99: Replace the non-descriptive "here" link text in the sentence "We
publish source code by using GitHub Releases available here" with descriptive
link text such as "GitHub Releases page" (or "Meerqat Releases on GitHub") so
the anchor text clearly conveys destination and improves accessibility; update
the sentence to read something like "We publish source code using GitHub
Releases (GitHub Releases page)" and ensure the existing URL remains the link
target.

In `@spec.md`:
- Around line 224-251: Update the spec's domain model to match the actual
exported dataclasses in src/meerqat/models.py: change Dataset to include the
fields root and filetree_summary in addition to
dataset_id/plates/metadata_records; replace Plate's documented fields with
plate_id, path, xml_path, image_files, zero_byte_images, and image_modalities;
and update ValidationIssue to use code, plate_id, path, and remediation instead
of rule_id/severity/message/entity_id so the documentation matches the real
dataclass shapes (Dataset, Plate, ValidationIssue).
- Around line 257-275: The YAML example uses nested sections (dataset,
plate_naming, expectations, metadata) that don't match the flat schema expected
by load_config() and ValidationConfig; update the example to use flat keys
matching the names in src/meerqat/config.py (dataset_id, plate_name_pattern,
expected_plate_count, expected_images_per_plate, metadata_plate_column) so that
calling load_config(payload) / ValidationConfig(**payload) succeeds—locate
references to load_config and ValidationConfig in the code to verify the exact
key names and ensure the example YAML uses those flat keys and literal values
rather than nested objects.

In `@src/meerqat/config.py`:
- Around line 206-213: resolved_template currently merges template defaults
using truthy checks (e.g., self.required_files or template.required_files),
which treats an explicit empty override like required_files: [] as "unset";
change the merge to check for None instead so explicit empty values are honored:
in resolved_template use conditional expressions that test "if
self.required_files is not None then use self.required_files else
template.required_files" (and similarly for plate_name_pattern and
expected_images_per_plate) so explicit empty/falsey overrides are preserved.

In `@src/meerqat/ingestion.py`:
- Around line 174-192: The parent-directory discovery in
_discover_metadata_paths is too permissive (it currently collects all files with
METADATA_EXTENSIONS from dataset_root.parent), causing sibling datasets to share
CSV/XLS/XLSX metadata; restrict parent search by only considering parent files
whose stem matches the preferred names (e.g., "metadata", "meta",
"plate_metadata") or only accept parent candidates when there is exactly one
matching metadata file — otherwise do not auto-include parent results and
require explicit metadata_paths; update the logic in _discover_metadata_paths to
apply this narrower filter to search_root == dataset_root.parent and return only
the filtered parent candidate(s) or nothing if ambiguous.
- Around line 160-168: The loop that builds MetadataRecord objects currently
treats pandas NaN as a valid plate_value, producing plate_id="nan"; update the
loop in ingestion.py (the for row in frame.to_dict(orient="records") block) to
detect and skip NaN/blank values for config.metadata_plate_column (use
pandas.isna or math.isnan) before appending to records so that
MetadataRecord(plate_id=...) is only created for real IDs (keep references to
MetadataRecord, metadata_plate_column, records, and path).
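The missing-value filter above can be sketched without pandas, since NaN is the only float unequal to itself (the function names here are hypothetical, and records are simplified to plain strings rather than MetadataRecord objects):

```python
def is_missing(value) -> bool:
    """True for None, NaN, or blank/whitespace-only strings."""
    if value is None:
        return True
    if isinstance(value, float) and value != value:  # NaN check without pandas
        return True
    return isinstance(value, str) and not value.strip()


def real_plate_ids(rows: list[dict], plate_column: str) -> list[str]:
    """Keep only rows whose plate column holds a real identifier, so a
    missing value never becomes a record with plate_id="nan"."""
    return [
        str(row[plate_column])
        for row in rows
        if plate_column in row and not is_missing(row[plate_column])
    ]
```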
- Around line 56-60: _extract_xml_plate_id currently calls
ET.parse(xml_path).getroot() without handling xml.etree.ElementTree.ParseError,
which lets malformed XML abort ingest_dataset before validate_dataset_model can
report it; wrap the ET.parse call in a try/except that catches
xml.etree.ElementTree.ParseError (or Exception if you prefer broader safety) and
return None (or otherwise swallow and optionally debug-log the error) so
ingest_dataset() does not raise on malformed XML and validate_dataset_model can
perform structured validation; update _extract_xml_plate_id to import the
ParseError symbol and handle it accordingly.

In `@src/meerqat/llm.py`:
- Around line 206-210: The current PermissionError fallback uses a predictable
temp path (tempfile.gettempdir() / "meerqat" / MACOS_CPU_RUNTIME_DIRNAME) which
is vulnerable to tampering; change the fallback in the code path that calls
build_runtime/_default_runtime_dir to create a unique, owner-only directory via
tempfile.mkdtemp (e.g., mkdtemp(prefix="meerqat-", dir=tempfile.gettempdir())),
immediately set permissions to 0o700 (os.chmod) and use that Path for
build_runtime, ensuring the directory is created with restrictive permissions
and is only writable by the current user (optionally register it for secure
cleanup), instead of using the fixed predictable temp location.
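The safe fallback above can be sketched with the standard library; `mkdtemp` already creates a unique directory readable only by the owner, and the explicit `chmod` makes that guarantee visible:

```python
import os
import tempfile
from pathlib import Path


def make_private_runtime_dir() -> Path:
    """Create a unique, owner-only runtime directory instead of a fixed,
    predictable path under the shared temp root."""
    runtime_dir = Path(tempfile.mkdtemp(prefix="meerqat-"))
    os.chmod(runtime_dir, 0o700)  # owner read/write/execute only
    return runtime_dir
```

The caller remains responsible for cleanup, e.g. via `shutil.rmtree` or an `atexit` hook.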
- Around line 662-676: The langchain fallback in _attempt_langchain_review
currently only catches a narrow set of errors; broaden the except clause to
catch Exception (i.e., replace the current multi-exception tuple with except
Exception as error) so any runtime/import/invocation error from
_invoke_langchain() or chain.invoke() is normalized the same way as the
instructor path and returns (None, error) instead of letting exceptions
propagate into generate_llm_review()/validate_dataset().

In `@src/meerqat/main.py`:
- Around line 123-137: The function write_reports currently ignores
markdown_path and html_path when given a BatchValidationReport; change it to
fail fast by detecting if report is a BatchValidationReport and either
markdown_path or html_path is not None and raising a ValueError with a clear
message. Update the branch in write_reports (the isinstance(report,
ValidationReport) check) to first check for BatchValidationReport with
markdown/html paths and raise; otherwise proceed to call write_report_file with
report_to_markdown and report_to_html for ValidationReport as before
(references: write_reports, ValidationReport, BatchValidationReport,
write_report_file, report_to_markdown, report_to_html).

In `@src/meerqat/reporting.py`:
- Around line 186-187: The HTML generator uses report.summary.dataset_id
directly and can render "None"; update the h1 construction in report_to_html to
match report_to_markdown's fallback by using
html.escape(report.summary.dataset_id or "Unknown") (ensure the fallback is
applied before escaping) so the header shows "Unknown" when dataset_id is None;
locate the f-string that builds "<h1>Meerqat Report: ...</h1>" and replace the
expression accordingly.

---

Nitpick comments:
In `@CONTRIBUTING.md`:
- Around line 53-55: Update the example command in CONTRIBUTING.md to match CI
usage by adding the --frozen flag: change the example invocation "uv run pytest"
to "uv run --frozen pytest" (or add a brief note explaining when to use
--frozen) so the documented command aligns with the CI workflow; locate the
example command string "uv run pytest" to make the edit.

In `@docs/src/on_the_lookout_with_meerqat.ipynb`:
- Around line 230-233: The CLI cell is using a hardcoded pathlib.Path("../..")
instead of the notebook's repo_root variable; update the shell command in the
CLI cell that runs "!meerqat validate ..." to use the existing repo_root
(repo_root / "tests" / "data" / "xml_mismatch" / "dataset") so it matches the
notebook's other examples and works regardless of working directory—replace the
hardcoded pathlib.Path("../..") expression with the repo_root variable reference
wherever that validation command appears.

In `@src/meerqat/cli.py`:
- Around line 192-196: The except block around calling args.handler(args)
currently only prints the exception message and discards the traceback; update
the handler of that exception in the CLI entry (the try/except surrounding
return int(args.handler(args))) to emit the full traceback to stderr or the
logger when a debug/verbose flag is set (e.g., args.debug or args.verbose) by
calling traceback.print_exc(file=sys.stderr) or logging.exception(...) as
appropriate, otherwise continue returning exit code 3—ensure you import
traceback or use the logging API and reference args.handler and the surrounding
CLI entrypoint when making the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fb7c4abb-c33b-4690-b836-74edd1e7649a

📥 Commits

Reviewing files that changed from the base of the PR and between 921c781 and 18b75de.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (30)
  • .github/ISSUE_TEMPLATE/issue.yml
  • .github/PULL_REQUEST_TEMPLATE.md
  • .github/dependabot.yml
  • .github/workflows/draft-release.yml
  • .github/workflows/publish-docs.yml
  • .github/workflows/publish-pypi.yml
  • .github/workflows/run-tests.yml
  • .pre-commit-config.yaml
  • .yamllint
  • CITATION.cff
  • CODE_OF_CONDUCT.md
  • CONTRIBUTING.md
  • docs/src/conf.py
  • docs/src/on_the_lookout_with_meerqat.ipynb
  • docs/src/on_the_lookout_with_meerqat.py
  • pyproject.toml
  • spec.md
  • src/meerqat/_version.py
  • src/meerqat/cli.py
  • src/meerqat/config.py
  • src/meerqat/ingestion.py
  • src/meerqat/llm.py
  • src/meerqat/main.py
  • src/meerqat/reporting.py
  • src/meerqat/validation.py
  • tests/conftest.py
  • tests/test_api.py
  • tests/test_cli.py
  • tests/test_sample_datasets.py
  • tests/test_units.py
✅ Files skipped from review due to trivial changes (14)
  • .yamllint
  • CITATION.cff
  • .github/PULL_REQUEST_TEMPLATE.md
  • .github/ISSUE_TEMPLATE/issue.yml
  • .github/dependabot.yml
  • .github/workflows/draft-release.yml
  • .github/workflows/publish-pypi.yml
  • .github/workflows/publish-docs.yml
  • docs/src/conf.py
  • tests/conftest.py
  • .github/workflows/run-tests.yml
  • src/meerqat/_version.py
  • .pre-commit-config.yaml
  • pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/test_sample_datasets.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

♻️ Duplicate comments (1)
src/meerqat/ingestion.py (1)

164-173: ⚠️ Potential issue | 🟠 Major

Store the normalized plate id, not the raw cell value.

The blank check strips whitespace, but MetadataRecord.plate_id still keeps the untrimmed value. A cell like " Plate_A01 " will pass the guard here and then miss the folder/XML identifiers during matching.

Possible fix
         for row in frame.to_dict(orient="records"):
             plate_value = row.get(config.metadata_plate_column)
             if plate_value is None or pd.isna(plate_value):
                 continue
-            if not str(plate_value).strip():
+            plate_id = str(plate_value).strip()
+            if not plate_id:
                 continue
             records.append(
                 MetadataRecord(
-                    plate_id=str(plate_value),
+                    plate_id=plate_id,
                     source=str(path),
                     values=dict(row),
                 )
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/meerqat/ingestion.py` around lines 164 - 173, The loop currently checks
plate_value for emptiness but passes the untrimmed value into MetadataRecord.
Normalize the plate id before constructing MetadataRecord: derive a
normalized_plate_id = str(plate_value).strip() (optionally apply consistent
casing if needed), use that for MetadataRecord(plate_id=normalized_plate_id,
...) and for any matching logic, and ensure you still skip when
normalized_plate_id is empty; update the code around the frame.to_dict loop and
the use of config.metadata_plate_column and records accordingly.
🧹 Nitpick comments (2)
src/meerqat/llm.py (2)

779-785: Redundant model path resolution.

The model_path was already successfully resolved at lines 688-692. This second call to get_cached_model_path is redundant since the variable is still in scope.

Suggested fix
     else:
         try:
-            model_path = get_cached_model_path(
-                spec,
-                cache_dir=config.cache_dir,
-                offline=config.offline,
-            )
             llm = _build_langchain_llm(model_path, config)
             response = llm.invoke("Reply with READY.")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/meerqat/llm.py` around lines 779 - 785, The try block is redundantly
calling get_cached_model_path again even though model_path was already resolved
earlier; remove the second get_cached_model_path call and pass the existing
model_path into _build_langchain_llm instead (ensure the earlier model_path
variable from the prior resolution is in scope before the try block and used
here); update any error handling in this try/except to only surround
_build_langchain_llm so you don't redo model resolution.

267-270: Consider handling IPv6 loopback address.

The localhost check doesn't include the IPv6 loopback address ::1, which could be relevant in dual-stack environments.

Optional enhancement
 def _is_local_base_url(base_url: str) -> bool:
     """Return whether the configured endpoint points at localhost."""
     hostname = urllib_parse.urlsplit(base_url).hostname
-    return hostname in {"127.0.0.1", "localhost"}
+    return hostname in {"127.0.0.1", "localhost", "::1"}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/meerqat/llm.py` around lines 267 - 270, The _is_local_base_url function
currently treats only "127.0.0.1" and "localhost" as local; update it to also
recognize the IPv6 loopback "::1" (and its possible bracketed/normalized forms)
by including "::1" in the hostname check or normalizing brackets before
comparison so that hostname from urllib_parse.urlsplit(base_url) matches "::1";
modify the set used in _is_local_base_url or add a small normalization step
around the hostname variable to ensure IPv6 loopback is detected.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/src/on_the_lookout_with_meerqat.ipynb`:
- Around line 50-53: The fallback that sets repo_root to
pathlib.Path("../..").resolve() when "__file__" is missing is unreliable in
notebooks; replace it with logic that locates the repo root by walking up from
the current working directory (pathlib.Path.cwd()) until a repo marker is found
(e.g., ".git", "pyproject.toml", "setup.py") — implement a small helper (e.g.,
find_repo_root) and use it as the fallback for repo_root so notebooks launched
from other CWDs still resolve the actual repository root instead of ../..; keep
the original branch that uses pathlib.Path(__file__).resolve().parents[2] when
"__file__" exists.

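The `find_repo_root` helper proposed for the notebook could look like the following; the marker list is an assumption and may need adjusting for the actual repository layout:

```python
import pathlib


def find_repo_root(start=None, markers=(".git", "pyproject.toml", "setup.py")):
    """Walk upward from start (default: the current working directory)
    until a directory containing one of the repo markers is found."""
    current = pathlib.Path(start or pathlib.Path.cwd()).resolve()
    for candidate in (current, *current.parents):
        if any((candidate / marker).exists() for marker in markers):
            return candidate
    raise FileNotFoundError(f"no repository marker found above {current}")
```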
In `@spec.md`:
- Around line 103-104: Update the spec text to reflect the actual shipped
runtime: replace the sentence that says MeerQat "falls back to direct local GGUF
execution" with wording that the fallback provider is "langchain" (matching the
implemented providers `instructor` and `langchain`) and that tests exercise
LangChain as the fallback path; ensure the terms `instructor`, `langchain`, and
`MeerQat` are used so the spec matches the current implementation and test
behavior.
- Around line 284-290: The spec examples list CLI commands that don't exist;
update spec.md to only document commands actually registered in
src/meerqat/cli.py (validate, batch-validate, models, ready) by replacing the
"Suggest config" and "Explain anomalies" examples with equivalent examples using
those real commands (or explicitly note those features are not implemented yet),
and ensure the example command names in spec.md match the unique command
identifiers in src/meerqat/cli.py (validate, batch-validate, models, ready).

In `@src/meerqat/cli.py`:
- Around line 41-58: The current reconstruction of LLMConfig (creating a new
LLMConfig with selected fields from args and existing) drops unexposed settings
like auto_start_server and server_startup_timeout; instead preserve all existing
fields and only override those explicitly supplied by CLI by using
dataclasses.replace(existing, ...) or by copying existing and updating the
provided attributes (referencing LLMConfig, existing, and args), ensuring you
set
provider/model_alias/repo_id/filename/cache_dir/offline/n_ctx/n_threads/max_tokens/temperature/base_url
only when args provide values so auto_start_server and server_startup_timeout
(and any other fields) are retained.
- Around line 161-181: The batch subcommand is exposing flags that
_batch_validate_command ignores; update the CLI to only register flags that
batch supports. Either add a new add_batch_validation_flags(target) that only
adds "--metadata", "--report-json", and "--fail-on-warning" and call that for
batch_parser, or change add_validation_flags(target, *, for_batch=False) to
conditionally skip adding "--dataset-id", "--report-markdown", and
"--report-html" when for_batch is true; then keep validate_parser using the full
add_validation_flags and have batch_parser use the batch-specific variant so the
parser flags match _batch_validate_command's behavior.
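The config-merging fix for the first comment above can be sketched with `dataclasses.replace`; the trimmed `LLMConfig` below is a stand-in, and any field names beyond those listed in the comment are assumptions:

```python
import dataclasses
from typing import Optional


@dataclasses.dataclass
class LLMConfig:
    # Trimmed stand-in for meerqat's LLMConfig.
    provider: str = "instructor"
    model_alias: Optional[str] = None
    auto_start_server: bool = True
    server_startup_timeout: float = 30.0


def merge_cli_overrides(existing, **overrides):
    """Preserve every existing field, overriding only CLI-supplied values."""
    supplied = {key: value for key, value in overrides.items() if value is not None}
    return dataclasses.replace(existing, **supplied)
```

With this shape, fields the CLI never exposes (such as `auto_start_server`) survive the merge untouched.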

In `@src/meerqat/ingestion.py`:
- Around line 82-90: The code treats multi-part extensions inconsistently:
update _collect_image_files and _build_filetree_summary to recognize compound
extensions like ".ome.zarr", ".ome.tif", and ".ome.tiff" (and detect ".ome.zarr"
recursively under plate directories) instead of relying on Path.suffix which
drops multi-part suffixes; use Path.suffixes or a name.endswith(...) check
against a canonical list of supported compound extensions when gathering files
and when bucketing by type so plate assets (e.g., Plate_A01/Images/run.ome.zarr)
are included in image totals and correctly labeled in the filetree inventory;
also ensure _looks_like_plate()’s expectations align with the same
extension-checking logic so detection, counting, and reporting are consistent.
- Around line 233-237: The current assignment to resolved_metadata_paths treats
an empty list metadata_paths as false and triggers discovery; change the
conditional to detect None specifically so an explicit empty list disables
discovery. Update the expression that builds resolved_metadata_paths (currently
using metadata_paths and _discover_metadata_paths with root) to use "if
metadata_paths is not None" (or equivalent None check) so metadata_paths=[]
yields an empty list (still mapping values through Path) while
metadata_paths=None continues to call list(_discover_metadata_paths(root)).
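The compound-extension handling suggested for `_collect_image_files` and `_build_filetree_summary` can be sketched as below; the extension list and helper name are assumptions, not the module's actual definitions:

```python
import pathlib

# Canonical compound extensions; the real list in ingestion.py may differ.
COMPOUND_EXTENSIONS = (".ome.zarr", ".ome.tif", ".ome.tiff")


def inventory_extension(path):
    """Return a multi-part extension when one matches, else Path.suffix.

    Path.suffix alone reports ".zarr" for "run.ome.zarr"; matching on
    name.endswith keeps compound extensions intact for bucketing.
    """
    name = pathlib.Path(path).name.lower()
    for extension in COMPOUND_EXTENSIONS:
        if name.endswith(extension):
            return extension
    return pathlib.Path(path).suffix.lower()
```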

---

Duplicate comments:
In `@src/meerqat/ingestion.py`:
- Around line 164-173: The loop currently checks plate_value for emptiness but
passes the untrimmed value into MetadataRecord. Normalize the plate id before
constructing MetadataRecord: derive a normalized_plate_id =
str(plate_value).strip() (optionally apply consistent casing if needed), use
that for MetadataRecord(plate_id=normalized_plate_id, ...) and for any matching
logic, and ensure you still skip when normalized_plate_id is empty; update the
code around the frame.to_dict loop and the use of config.metadata_plate_column
and records accordingly.

---

Nitpick comments:
In `@src/meerqat/llm.py`:
- Around line 779-785: The try block is redundantly calling
get_cached_model_path again even though model_path was already resolved earlier;
remove the second get_cached_model_path call and pass the existing model_path
into _build_langchain_llm instead (ensure the earlier model_path variable from
the prior resolution is in scope before the try block and used here); update any
error handling in this try/except to only surround _build_langchain_llm so you
don't redo model resolution.
- Around line 267-270: The _is_local_base_url function currently treats only
"127.0.0.1" and "localhost" as local; update it to also recognize the IPv6
loopback "::1" (and its possible bracketed/normalized forms) by including "::1"
in the hostname check or normalizing brackets before comparison so that hostname
from urllib_parse.urlsplit(base_url) matches "::1"; modify the set used in
_is_local_base_url or add a small normalization step around the hostname
variable to ensure IPv6 loopback is detected.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bb1bc5f7-1804-4c11-85cc-237ae1d2f119

📥 Commits

Reviewing files that changed from the base of the PR and between 18b75de and dae8967.

📒 Files selected for processing (12)
  • CONTRIBUTING.md
  • docs/src/on_the_lookout_with_meerqat.ipynb
  • spec.md
  • src/meerqat/_version.py
  • src/meerqat/cli.py
  • src/meerqat/config.py
  • src/meerqat/ingestion.py
  • src/meerqat/llm.py
  • src/meerqat/main.py
  • src/meerqat/reporting.py
  • tests/test_api.py
  • tests/test_units.py
✅ Files skipped from review due to trivial changes (1)
  • src/meerqat/_version.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/meerqat/main.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (7)
src/meerqat/cli.py (1)

100-105: Unify batch exit-code fallback with _exit_code semantics.

Current logic returns 0 for any non-fail/warn statuses. That diverges from _exit_code, which maps unknown statuses to 3.

Suggested patch
-    statuses = [entry.summary.status for entry in report.reports]
-    if "fail" in statuses:
-        return 2
-    if "warn" in statuses:
-        return 1
-    return 0
+    statuses = [entry.summary.status for entry in report.reports]
+    if "fail" in statuses:
+        return _exit_code("fail")
+    if "warn" in statuses:
+        return _exit_code("warn")
+    if all(status == "pass" for status in statuses):
+        return _exit_code("pass")
+    return _exit_code("unknown")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/meerqat/cli.py` around lines 100 - 105, The batch exit-code logic in the
block that computes statuses = [entry.summary.status for entry in
report.reports] should follow the same semantics as _exit_code (unknown statuses
-> 3) rather than defaulting to 0; update the conditional branch so that it
returns 2 for "fail", 1 for "warn", and 3 for any other/unknown status, ensuring
the behavior is consistent with the _exit_code mapping used elsewhere.
src/meerqat/ingestion.py (3)

84-96: Case-sensitivity mismatch in nested element attribute lookup.

Lines 84-87 search root attributes with mixed-case keys ("PlateID", "Name", "ID"), but line 90 uses only lowercase keys ("id", "name", "plateid"). Since element.attrib.get() is case-sensitive, attributes like Name="Plate1" on nested elements will be missed.

♻️ Suggested fix to use consistent case handling
     for element in root.iter():
         if element.tag.lower().endswith("plate") or "plate" in element.tag.lower():
-            for key in ("id", "name", "plateid", "platename"):
-                value = element.attrib.get(key)
-                if value:
-                    candidates.append(value)
+            for key in ("PlateID", "PlateName", "Name", "ID", "id", "name", "plateid", "platename"):
+                if key in element.attrib:
+                    candidates.append(element.attrib[key])
+                    break
             text = (element.text or "").strip()
             if text:
                 candidates.append(text)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/meerqat/ingestion.py` around lines 84 - 96, The nested element attribute
lookup is case-sensitive (using element.attrib.get with only lowercase keys)
while root attribute checks use mixed-case keys, so attributes like
Name="Plate1" on nested elements are missed; fix by normalizing attribute names
before lookup (e.g., build a lowercase-keyed mapping from element.attrib and
root.attrib or lowercase keys on the fly) and then check normalized keys
('id','name','plateid','platename') against that mapping when populating
candidates (affecting the loops that reference root, element, candidates, and
attrib.get).

173-178: Missing error handling for metadata file reads.

If a metadata file is malformed or unreadable, pd.read_csv() or pd.read_excel() will raise an exception, aborting ingestion. This is inconsistent with the graceful handling in _extract_xml_plate_id() which catches parse errors. Consider wrapping reads in try/except to skip problematic files and allow ingestion to continue.

♻️ Suggested fix
     for raw_path in metadata_paths:
         path = Path(raw_path)
+        try:
-        if path.suffix.lower() == ".csv":
-            frame = pd.read_csv(path)
-        else:
-            frame = pd.read_excel(path)
+            if path.suffix.lower() == ".csv":
+                frame = pd.read_csv(path)
+            else:
+                frame = pd.read_excel(path)
+        except (OSError, pd.errors.EmptyDataError, pd.errors.ParserError, ValueError):
+            continue
         frame.columns = [str(column).strip() for column in frame.columns]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/meerqat/ingestion.py` around lines 173 - 178, Wrap the per-file read
inside the for raw_path in metadata_paths loop (the block calling
pd.read_csv(path) / pd.read_excel(path)) in a try/except that catches exceptions
from malformed/unreadable files, logs the filename and exception (using the
project's logger or logging.exception), and skips that file (continue) so
ingestion proceeds; mirror the error-handling pattern used in
_extract_xml_plate_id to avoid leaving frame undefined and ensure downstream
code only processes successfully-read frames.

137-145: Unreachable condition: path.is_dir() on line 142.

Line 137 already handles path.is_dir(), so the or path.is_dir() in line 142 is dead code and never true. This doesn't affect correctness but is confusing.

♻️ Suggested fix
         if path.is_dir():
             directories.append(resolved)
             relative = resolved.relative_to(dataset_root)
             bucket_key = (len(relative.parts), resolved.name.lower()[:4])
             similarity_buckets[bucket_key].append(resolved)
-        elif path.is_file() or path.is_dir():
+        elif path.is_file():
             extension = _inventory_extension(path)
             if extension is not None:
                 file_extensions[extension] += 1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/meerqat/ingestion.py` around lines 137 - 145, The elif condition contains
a redundant check "or path.is_dir()" which is unreachable because the preceding
if already handles path.is_dir(); update the conditional in the block that calls
_inventory_extension(path) to only test path.is_file() so the code becomes elif
path.is_file(): and keep the rest (calling _inventory_extension, incrementing
file_extensions) unchanged; this affects the branch handling around variables
path, directories, similarity_buckets, file_extensions and the call to
_inventory_extension.
docs/src/on_the_lookout_with_meerqat.py (1)

71-71: Convert notebook-style bare expressions to explicit output or add Ruff per-file ignore for B018.

This .py companion file contains bare expressions at lines 71, 77, 91, 136, and 179 that trigger Ruff's B018 rule (useless expression). These expressions are notebook-style outputs intended to display results, but since the file is linted as regular Python, they need either explicit print() calls or a per-file Ruff ignore:

# Option 1: Explicit output
print(report.summary.status)
print(report.summary.to_dict())
print(broken_report.summary.status)
print(validation_config)
print(generated_files)

Or add to ruff.toml:

[lint.per-file-ignores]
"docs/src/on_the_lookout_with_meerqat.py" = ["B018"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/src/on_the_lookout_with_meerqat.py` at line 71, The file contains
notebook-style bare expressions (report.summary.status,
report.summary.to_dict(), broken_report.summary.status, validation_config,
generated_files) that trigger Ruff B018; fix by converting each bare expression
into explicit output (e.g., wrap with print(...) calls for the five identified
symbols) or alternatively add a per-file Ruff ignore for B018 in ruff.toml under
lint.per-file-ignores for this file; update either the expressions to use print
or add the per-file-ignore entry for "B018".
spec.md (2)

279-279: Redundant heading: "CLI Interface".

"CLI" already stands for "Command Line Interface," so "CLI Interface" is redundant. As per static analysis hint, consider simplifying the heading.

♻️ Suggested alternatives
-## CLI Interface
+## CLI Commands

or

-## CLI Interface
+## Command Line Interface
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@spec.md` at line 279, The heading "CLI Interface" is redundant because "CLI"
already means "Command Line Interface"; update the heading in the document (the
line containing "CLI Interface") to a non-redundant form such as "CLI" or
"Command Line Interface" so the section title is concise and clear.

75-75: Prescriptive LLM assumption in concept definition.

The sentence "LLM should convert if possible" places an implementation directive inside a core concept definition. This belongs in the LLM Integration or Functional Requirements section if automatic format conversion is a planned feature. The concept definition should focus on what a Metadata Record is, not on conversion tooling.

♻️ Suggested rewording
-A row from an external metadata file (CSV/XLSX). The analysts would prefer to already be in CSV format over XLSX. LLM should convert if possible.
+A row from an external metadata file (CSV/XLSX). CSV format is preferred over XLSX for downstream processing.

If automatic conversion is a planned feature, document it separately in the LLM Integration section (lines 299-317).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@spec.md` at line 75, The concept definition "A row from an external metadata
file (CSV/XLSX). LLM should convert if possible" mixes an implementation
directive with the core definition; remove the clause "LLM should convert if
possible" from the Metadata Record definition and rewrite the sentence to simply
define what a Metadata Record is (e.g., "A row from an external metadata file
(CSV or XLSX) representing ..."). If automatic format conversion is a planned
feature, move that requirement into the LLM Integration or Functional
Requirements section (the LLM Integration section referenced in the comment) and
describe the conversion behavior and expectations there.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/src/on_the_lookout_with_meerqat.py`:
- Around line 129-136: The snippet creates a ValidationConfig but never uses it;
call validate_dataset with that config and capture the result. Replace the last
two lines so you instantiate ValidationConfig(...) as needed and then call
validate_dataset(dataset_path, config=validation_config) assigning the return to
a variable like report (e.g., report = validate_dataset(dataset_path,
config=validation_config)); reference ValidationConfig, validate_dataset, config
and report so readers see the full usage.

In `@spec.md`:
- Line 358: The spec currently lists "Single-dataset and batch validation
through both API and CLI" and documents the `batch-validate` command as
implemented, but also lists "Batch validation workflows" under the Future Work
section; remove the duplicate future-work entry or replace it with a clarified
bullet that specifies any additional planned batch-validation features (e.g.,
scaling, orchestration, or UX enhancements) so the document consistently
reflects that basic batch validation is implemented; search for the phrases
"Single-dataset and batch validation through both API and CLI",
"`batch-validate`", and "Batch validation workflows" to locate and update the
entries.
- Around line 225-259: Update the Domain Model entries to match the
implementation: make Dataset.dataset_id non-nullable (str), change collection
types from list[...] to tuple[...] for Dataset.plates, Dataset.metadata_records,
Plate.image_files, Plate.zero_byte_images, and Plate.image_modalities and
reflect that they have immutable defaults, and add default = None for
ValidationIssue.plate_id, ValidationIssue.path, and ValidationIssue.remediation;
ensure the class names Dataset, Plate and ValidationIssue and the field names
dataset_id, plates, metadata_records, image_files, zero_byte_images,
image_modalities, plate_id, path, and remediation are used to locate and update
the definitions.

---

Nitpick comments:
In `@docs/src/on_the_lookout_with_meerqat.py`:
- Line 71: The file contains notebook-style bare expressions
(report.summary.status, report.summary.to_dict(), broken_report.summary.status,
validation_config, generated_files) that trigger Ruff B018; fix by converting
each bare expression into explicit output (e.g., wrap with print(...) calls for
the five identified symbols) or alternatively add a per-file Ruff ignore for
B018 in ruff.toml under lint.per-file-ignores for this file; update either the
expressions to use print or add the per-file-ignore entry for "B018".

In `@spec.md`:
- Line 279: The heading "CLI Interface" is redundant because "CLI" already means
"Command Line Interface"; update the heading in the document (the line
containing "CLI Interface") to a non-redundant form such as "CLI" or "Command
Line Interface" so the section title is concise and clear.
- Line 75: The concept definition "A row from an external metadata file
(CSV/XLSX). LLM should convert if possible" mixes an implementation directive
with the core definition; remove the clause "LLM should convert if possible"
from the Metadata Record definition and rewrite the sentence to simply define
what a Metadata Record is (e.g., "A row from an external metadata file (CSV or
XLSX) representing ..."). If automatic format conversion is a planned feature,
move that requirement into the LLM Integration or Functional Requirements
section (the LLM Integration section referenced in the comment) and describe the
conversion behavior and expectations there.

In `@src/meerqat/cli.py`:
- Around line 100-105: The batch exit-code logic in the block that computes
statuses = [entry.summary.status for entry in report.reports] should follow the
same semantics as _exit_code (unknown statuses -> 3) rather than defaulting to
0; update the conditional branch so that it returns 2 for "fail", 1 for "warn",
and 3 for any other/unknown status, ensuring the behavior is consistent with the
_exit_code mapping used elsewhere.

In `@src/meerqat/ingestion.py`:
- Around line 84-96: The nested element attribute lookup is case-sensitive
(using element.attrib.get with only lowercase keys) while root attribute checks
use mixed-case keys, so attributes like Name="Plate1" on nested elements are
missed; fix by normalizing attribute names before lookup (e.g., build a
lowercase-keyed mapping from element.attrib and root.attrib or lowercase keys on
the fly) and then check normalized keys ('id','name','plateid','platename')
against that mapping when populating candidates (affecting the loops that
reference root, element, candidates, and attrib.get).
- Around line 173-178: Wrap the per-file read inside the for raw_path in
metadata_paths loop (the block calling pd.read_csv(path) / pd.read_excel(path))
in a try/except that catches exceptions from malformed/unreadable files, logs
the filename and exception (using the project's logger or logging.exception),
and skips that file (continue) so ingestion proceeds; mirror the error-handling
pattern used in _extract_xml_plate_id to avoid leaving frame undefined and
ensure downstream code only processes successfully-read frames.
- Around line 137-145: The elif condition contains a redundant check "or
path.is_dir()" which is unreachable because the preceding if already handles
path.is_dir(); update the conditional in the block that calls
_inventory_extension(path) to only test path.is_file() so the code becomes elif
path.is_file(): and keep the rest (calling _inventory_extension, incrementing
file_extensions) unchanged; this affects the branch handling around variables
path, directories, similarity_buckets, file_extensions and the call to
_inventory_extension.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e7052201-cf20-492b-8297-955615831e5c

📥 Commits

Reviewing files that changed from the base of the PR and between dae8967 and 4cc44a8.

📒 Files selected for processing (6)
  • docs/src/on_the_lookout_with_meerqat.ipynb
  • docs/src/on_the_lookout_with_meerqat.py
  • spec.md
  • src/meerqat/cli.py
  • src/meerqat/ingestion.py
  • src/meerqat/llm.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (4)
src/meerqat/main.py (1)

100-106: Consider simplifying LLMReview construction.

If generate_llm_review already returns an object with the same interface as LLMReview, explicitly copying fields into a new LLMReview instance may be redundant. However, this explicit construction is also defensive against interface changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/meerqat/main.py` around lines 100 - 106, The code redundantly copies
fields from the object returned by generate_llm_review into a new LLMReview
instance; replace that manual field copying by using the returned object
directly (e.g., assign llm_review_obj = llm_review) if generate_llm_review
already returns an LLMReview-compatible object, or else wrap/convert defensively
by using a single canonical conversion method (constructor/factory like
LLMReview.parse_obj or a from_... classmethod) so you don't repeat attributes;
ensure you check the type/interface of generate_llm_review's return value before
removing the copy.
src/meerqat/ingestion.py (1)

156-165: Consider performance for large directory trees.

The O(n²) similarity comparison within each bucket could become slow for datasets with many similarly-prefixed directories at the same depth. The bucketing strategy (4-char prefix + depth) provides reasonable mitigation, but consider adding a bucket size cap or early exit for very large buckets if this becomes a bottleneck in practice.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/meerqat/ingestion.py` around lines 156 - 165, The nested O(n²) pairwise
comparison in the loop over similarity_buckets (using sorted_bucket,
SequenceMatcher, and appending to similar_pairs) can be slow for very large
buckets; add a guard such as a configurable bucket size cap (e.g.,
MAX_BUCKET_COMPARE) and skip or sample comparisons when len(sorted_bucket)
exceeds it, or implement an early-exit strategy (limit comparisons per entry or
stop after finding a set number of similar matches per bucket) while still
honoring SIMILAR_DIRECTORY_THRESHOLD; update the loop that iterates
sorted_bucket to check the cap and either truncate or sample the right-hand
candidates, or break early, to avoid quadratic blowup.
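The cap-and-skip guard can be sketched as follows. `MAX_BUCKET_COMPARE`, `SIMILAR_DIRECTORY_THRESHOLD`, and the bucket/pair shapes are assumptions for illustration, since the actual internals of `src/meerqat/ingestion.py` are not shown here:

```python
from difflib import SequenceMatcher

# Hypothetical cap; the real ingestion module may pick a different limit.
MAX_BUCKET_COMPARE = 200
SIMILAR_DIRECTORY_THRESHOLD = 0.9


def find_similar_pairs(similarity_buckets: dict) -> list[tuple[str, str]]:
    """Pairwise-compare directory names within each bucket.

    Buckets larger than MAX_BUCKET_COMPARE are skipped outright, so the
    O(n^2) inner loops only ever run over bounded inputs.
    """
    similar_pairs: list[tuple[str, str]] = []
    for bucket in similarity_buckets.values():
        sorted_bucket = sorted(bucket)
        # Guard: refuse to pay quadratic cost on oversized buckets.
        if len(sorted_bucket) > MAX_BUCKET_COMPARE:
            continue
        for i, left in enumerate(sorted_bucket):
            for right in sorted_bucket[i + 1:]:
                ratio = SequenceMatcher(None, left, right).ratio()
                if ratio >= SIMILAR_DIRECTORY_THRESHOLD:
                    similar_pairs.append((left, right))
    return similar_pairs
```

A sampling or per-entry comparison limit could replace the hard skip if dropping large buckets entirely is too lossy.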
architecture.md (1)

8-8: Minor: Use hyphen for compound adjective "built-in".

When used as an adjective before a noun, "built-in" should be hyphenated.

-1. LLM-assisted pattern inference (built in)
+1. LLM-assisted pattern inference (built-in)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@architecture.md` at line 8, Update the phrase "LLM-assisted pattern inference
(built in)" to hyphenate the compound adjective by changing "(built in)" to
"(built-in)"; locate the text around "LLM-assisted pattern inference" and
replace the unhyphenated "built in" with "built-in".
docs/src/on_the_lookout_with_meerqat.py (1)

114-122: Consider assigning the dictionary for consistency.

This dictionary expression works in Jupyter (displays as cell output) but does nothing when run as a script. Consider assigning it to a variable and printing for consistency with the script execution path.

-{
+filetree_details = {
     "status": filetree_report.summary.status,
     "filetree_summary": filetree_report.dataset.filetree_summary.to_dict(),
     "filetree_issues": [
         issue.to_dict()
         for issue in filetree_report.issues
         if issue.code.startswith("dataset.")
     ],
 }
+print(filetree_details)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/src/on_the_lookout_with_meerqat.py` around lines 114 - 122, Assign the
constructed dict to a variable (e.g., summary_dict) instead of leaving it as a
bare expression, then print or return it so the output appears when run as a
script; update the block that builds the dict using filetree_report,
filetree_report.dataset.filetree_summary.to_dict(), and the filetree_issues list
comprehension (filtering issue.code.startswith("dataset.")) to store the result
and then print(summary_dict) or otherwise emit it.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/src/on_the_lookout_with_meerqat.py`:
- Around line 180-181: Remove the redundant standalone expression
"generated_files" that merely evaluates the variable without using it; keep the
subsequent print(generated_files) call. Locate the occurrence of the bare
"generated_files" expression (adjacent to print(generated_files)) and delete
that line so only the print(generated_files) remains.

In `@src/meerqat/main.py`:
- Around line 149-158: The ready() function builds active_config but fails to
apply CLI overrides before selecting the LLM; update ready() to use
with_cli_overrides(active_config) (or apply with_cli_overrides to active_config)
and then derive active_llm from that overridden config instead of directly using
active_llm = llm_config or active_config.llm, ensuring consistency with
validate_dataset and batch_validate and then call run_ready_checks on the final
active_llm.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b96c6b51-cc45-4d7c-a913-ebb2babeb328

📥 Commits

Reviewing files that changed from the base of the PR and between 4cc44a8 and 80bee85.

📒 Files selected for processing (10)
  • README.md
  • architecture.md
  • docs/src/cli.md
  • docs/src/on_the_lookout_with_meerqat.py
  • spec.md
  • src/meerqat/cli.py
  • src/meerqat/ingestion.py
  • src/meerqat/main.py
  • tests/test_cli.py
  • tests/test_units.py
✅ Files skipped from review due to trivial changes (1)
  • docs/src/cli.md

