Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings.
📝 Walkthrough

Adds a new MeerQat Python package and CLI with deterministic dataset ingestion/validation, optional LLM review (Instructor + LangChain fallback), reporting (JSON/Markdown/HTML), Sphinx docs, tests, pre-commit tooling, and multiple GitHub automation and CI workflow/config files.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI as CLI
    participant Config as Config/Loader
    participant Ingest as Ingestion
    participant Validator as Validation Engine
    participant LLM as LLM Provider
    participant Reporter as Reporter
    CLI->>Config: load config / apply CLI overrides
    CLI->>Ingest: ingest dataset (path, metadata)
    Ingest-->>Validator: Dataset
    Validator->>Validator: run deterministic rules -> ValidationReport
    Validator-->>LLM: submit ValidationReport for review (if enabled)
    alt LLM enabled
        LLM->>LLM: resolve model & provider
        alt instructor provider
            LLM->>LLM: ensure local instructor server (auto-start & probe)
            LLM->>Server: HTTP inference call
        else langchain fallback
            LLM->>HF: resolve/fetch GGUF model (cache/offline)
            LLM->>LocalModel: run local inference
        end
        LLM-->>Validator: LLMReview (hints/findings)
    else LLM disabled
        LLM-->>Validator: LLMReview(status="disabled")
    end
    Validator->>Reporter: attach LLMReview, build final report
    Reporter->>CLI: render/write JSON/MD/HTML
    CLI->>User: exit with mapped code
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 14
🧹 Nitpick comments (17)
CITATION.cff (1)
14-14: Use canonical repository URL casing for consistency.
`repository-code` currently uses `wayscience`; consider matching the repository's canonical org casing (`WayScience`) for consistency with project metadata and docs.

✏️ Suggested metadata tweak

```diff
-repository-code: "https://github.com/wayscience/meerqat"
+repository-code: "https://github.com/WayScience/meerqat"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@CITATION.cff` at line 14, The CITATION.cff metadata uses a lowercased org name in the repository-code field; update the repository-code value to use the repository's canonical casing ("WayScience/meerqat") so it matches project metadata and docs—locate the repository-code entry in CITATION.cff and change "https://github.com/wayscience/meerqat" to the correctly cased canonical URL.

CONTRIBUTING.md (1)
63-65: Minor: Redundant section headers. Lines 63-65 have consecutive headers "Creating a pull request" and "Pull requests" that could be consolidated.
Suggested consolidation
```diff
-### Creating a pull request
-
 ### Pull requests
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@CONTRIBUTING.md` around lines 63 - 65, Consolidate the redundant consecutive headers by removing or merging one of them: keep a single clear header (either "Creating a pull request" or "Pull requests") and move any distinct content under that header; update the section title where the duplicate appears and ensure the remaining heading matches surrounding style and anchors (the two header strings "Creating a pull request" and "Pull requests" identify where to make the change).

src/meerqat/cli.py (1)
172-188: Reduce duplication: reuse `add_common_flags` for the `ready` subparser.

The `ready_parser` manually defines the same LLM-related flags that `add_common_flags` already provides. This creates maintenance burden if flags need updates.

Proposed refactor

```diff
 ready_parser = subparsers.add_parser("ready")
-ready_parser.add_argument("--config")
-ready_parser.add_argument(
-    "--llm-provider",
-    choices=("langchain", "instructor"),
-)
-ready_parser.add_argument("--llm-model")
-ready_parser.add_argument("--llm-repo-id")
-ready_parser.add_argument("--llm-filename")
-ready_parser.add_argument("--llm-cache-dir")
-ready_parser.add_argument("--llm-base-url")
-ready_parser.add_argument("--llm-n-ctx", type=int)
-ready_parser.add_argument("--llm-threads", type=int)
-ready_parser.add_argument("--llm-max-tokens", type=int)
-ready_parser.add_argument("--llm-temperature", type=float)
-ready_parser.add_argument("--offline", action="store_true")
+add_common_flags(ready_parser)
 ready_parser.set_defaults(handler=_ready_command)
```

Note: `add_common_flags` also adds `--metadata`, `--dataset-id`, `--report-json`, etc. If these shouldn't apply to `ready`, consider extracting a separate `add_llm_flags` helper.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/meerqat/cli.py` around lines 172 - 188, The ready subparser duplicates LLM flags already provided by add_common_flags; update the ready parser to reuse those shared flags by calling add_common_flags(ready_parser) instead of manually adding --llm-* options, or if ready should only include LLM flags (not metadata/dataset/report flags) extract a new helper add_llm_flags(parser) and call that from both add_common_flags and where ready_parser is configured; adjust references to ready_parser and the handler _ready_command accordingly so flag names remain unchanged.

docs/src/on_the_lookout_with_meerqat.py (1)
49-49: Relative path assumes specific execution context.

The `pathlib.Path("../..") / "tests" / "data"` pattern works when the notebook is executed from `docs/src/`. Consider documenting this assumption or using a more robust path resolution if the notebook may be executed from other locations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/src/on_the_lookout_with_meerqat.py` at line 49, The hard-coded relative resolution for data_root (pathlib.Path("../..") / "tests" / "data" / "valid_minimal") assumes the notebook runs from docs/src; change it to compute the path relative to the source file (e.g., use the file's resolved path via __file__ and parents to locate the repo root and then append "tests/data/valid_minimal"), or explicitly document the execution-dir requirement near the data_root declaration so callers know the expected CWD; update the data_root assignment and its comment accordingly.

.github/workflows/run-tests.yml (1)
60-65: Consider including model spec file in cache key.

The cache key uses `hashFiles('pyproject.toml')`, but model specifications (repo_id, filename) are defined in `src/meerqat/config.py`. If model specs change without a pyproject.toml change, stale cached models would persist.

Optional: Include config.py in cache key

```diff
 - name: Cache Hugging Face models
   uses: actions/cache@v4
   with:
     path: ~/.cache/huggingface
-    key: ${{ runner.os }}-meerqat-hf-${{ hashFiles('pyproject.toml') }}
+    key: ${{ runner.os }}-meerqat-hf-${{ hashFiles('pyproject.toml', 'src/meerqat/config.py') }}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/run-tests.yml around lines 60 - 65, The cache key for the "Cache Hugging Face models" step should include the model spec file so changes to repo_id/filename invalidate the cache; update the key expression that currently uses hashFiles('pyproject.toml') to also hash the model spec file (e.g., 'src/meerqat/config.py') so model spec changes trigger a new cache, ensuring the workflow step named "Cache Hugging Face models" doesn't return stale models.

tests/test_api.py (2)
41-48: Test may have non-deterministic LLM behavior.

This test calls `validate_dataset(valid_dataset)` without disabling LLM review. If an LLM runtime is available in the test environment, this could cause non-deterministic behavior or slow test execution.

Consider using `config=_without_llm()` for consistency with other tests, unless this is intentionally testing the default config path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_api.py` around lines 41 - 48, The test test_validate_dataset_discovers_nearby_metadata calls validate_dataset(valid_dataset) leaving LLM review enabled; to avoid nondeterminism and slowness, update the call to pass the deterministic test config (use config=_without_llm()) so call validate_dataset(valid_dataset, config=_without_llm()), ensuring consistency with other tests that disable LLMs.
20-22: Duplicate helper across test modules.

The `_without_llm()` helper is duplicated in `tests/test_sample_datasets.py`. Consider extracting it to `conftest.py` as a shared fixture.

♻️ Suggested refactor to conftest.py

```python
# tests/conftest.py
import pytest

from meerqat import LLMConfig, ValidationConfig


@pytest.fixture
def config_without_llm() -> ValidationConfig:
    """Return a validation config that skips the runtime review."""
    return ValidationConfig(llm=LLMConfig(enabled=False))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_api.py` around lines 20 - 22, Extract the duplicated helper function `_without_llm()` into a shared pytest fixture in `tests/conftest.py`: create a fixture (e.g., `config_without_llm`) that returns `ValidationConfig(llm=LLMConfig(enabled=False))`, then replace usages of the `_without_llm()` function in `tests/test_api.py` and `tests/test_sample_datasets.py` with the new fixture by accepting `config_without_llm` as a test parameter; ensure you import `LLMConfig` and `ValidationConfig` in `conftest.py` and remove the local `_without_llm()` definitions from the test modules.

src/meerqat/reporting.py (1)
84-99: Handle None dataset_id gracefully in report title.

If `report.summary.dataset_id` is `None`, the Markdown title will render as `# Meerqat Report: None`. Consider providing a fallback like "Unknown" or "(no dataset ID)".

📝 Suggested fix

```diff
 def report_to_markdown(report: ValidationReport) -> str:
     """Render a report as Markdown."""
+    dataset_id = report.summary.dataset_id or "(unnamed dataset)"
     lines = [
-        f"# Meerqat Report: {report.summary.dataset_id}",
+        f"# Meerqat Report: {dataset_id}",
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/meerqat/reporting.py` around lines 84 - 99, The report title currently interpolates report.summary.dataset_id directly, causing "None" to appear when it's missing; update report_to_markdown to compute a safe title value (e.g., dataset_title = report.summary.dataset_id or "Unknown" / "(no dataset ID)") and use that variable in the first line instead of report.summary.dataset_id so the Markdown header shows a friendly fallback when dataset_id is None.

spec.md (2)
270-291: Configuration example has formatting artifacts.

Line 280 shows `\- "Images/Index.xml"` with an escaped dash. This should be a plain YAML list item.

📝 Suggested fix

```diff
 required_files:
-\- "Images/Index.xml"
+ - "Images/Index.xml"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spec.md` around lines 270 - 291, The YAML example under the dataset block contains a formatting artifact: the required_files entry shows an escaped dash (`\- "Images/Index.xml"`) instead of a normal list item; update the required_files list so the item is a proper YAML list entry (replace `\- "Images/Index.xml"` with `- "Images/Index.xml"`) within the same dataset/required_files section to ensure valid YAML parsing.
226-263: Consider reformatting the Domain Model section.

The class definitions mix Markdown prose with code-block-style formatting in an unusual way. The backslash escapes (`folder\_name`, `plate\_id`) appear to be rendering artifacts.

📝 Suggested formatting

````diff
 ## Domain Model

 ### Dataset

-class Dataset:
-
-```
-dataset\_id: str | None
-
-plates: list\[Plate\]
-
-metadata\_records: list\[MetadataRecord\]
-```
+```python
+class Dataset:
+    dataset_id: str | None
+    plates: list[Plate]
+    metadata_records: list[MetadataRecord]
+```
````

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spec.md` around lines 226 - 263, The Domain Model section mixes prose and malformed code blocks with backslash-escaped identifiers; update the three class definitions (Dataset, Plate, ValidationIssue) to use proper fenced code blocks (```python) and normal attribute names (dataset_id, plates, metadata_records, folder_name, normalized_plate_id, xml_plate_id, metadata_plate_id, rule_id, severity, message, entity_id) with consistent indentation and type annotations, removing all backslash escapes and stray Markdown artifacts so the classes render as clean Python-style examples.

docs/src/on_the_lookout_with_meerqat.ipynb (1)
203-231: Consider clearing or sanitizing notebook outputs before committing.

The cell outputs contain absolute filesystem paths (e.g., /Users/buntend/Documents/work/meerqat/...) that expose the local development environment. While not a security risk, this is unnecessary noise in documentation and can cause merge conflicts when different developers run the notebook.

Consider either:

- Clearing outputs before committing (`jupyter nbconvert --clear-output`)
- Using relative paths in the displayed output if possible
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/src/on_the_lookout_with_meerqat.ipynb` around lines 203 - 231, The notebook cell outputs include absolute local paths (see keys like 'empty_directories', 'filetree_issues', and example entries 'segment_A'/'segment_B') which leak developer-specific filesystem info; fix by clearing or sanitizing the outputs before committing: either run a notebook clear-output command (e.g., nbconvert --clear-output or use your editor's "Clear All Outputs") to remove the printed results, or replace those absolute paths in the output cells with relative paths or sanitized placeholders (e.g., "./tests/data/...") in the cells that produce 'filetree_issues' so the committed notebook contains no local absolute paths.

src/meerqat/llm.py (3)
302-311: Hardcoded 30-second server startup timeout.

The timeout value is hardcoded. Consider making it configurable via `LLMConfig` for environments where model loading takes longer (e.g., larger models or slower hardware).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/meerqat/llm.py` around lines 302 - 311, The _wait_for_server function currently uses a hardcoded 30s timeout; update it to accept a configurable timeout (e.g., timeout_seconds: float) or read from the existing LLMConfig (e.g., LLMConfig.server_startup_timeout) and use that value instead of 30, update all call sites that invoke _wait_for_server(base_url, process) to pass the configured timeout or to ensure LLMConfig is accessible there, and adjust the error messages to include the configured timeout for clarity; ensure function signature and any imports/constructors that create or pass LLMConfig are updated accordingly.
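A minimal sketch of the configurable-timeout idea. The `server_startup_timeout` field and the simplified `wait_for_server` signature below are hypothetical illustrations, not the current `LLMConfig` or `_wait_for_server` API:

```python
import time
from dataclasses import dataclass


# Hypothetical: the real LLMConfig has more fields; only the timeout is shown.
@dataclass
class LLMConfig:
    server_startup_timeout: float = 30.0  # seconds, overridable per environment


def wait_for_server(is_ready, config: LLMConfig, poll_interval: float = 0.01) -> bool:
    """Poll `is_ready` until it returns True or the configured timeout elapses."""
    deadline = time.monotonic() + config.server_startup_timeout
    while time.monotonic() < deadline:
        if is_ready():
            return True
        time.sleep(poll_interval)
    return False
```

Callers can then pass `LLMConfig(server_startup_timeout=120)` for large models instead of editing a constant.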
240-252: URL scheme validation for `urlopen` calls.

The static analysis S310 warnings are reasonable for auditing, but these calls are for internal OpenAI-compatible API endpoints derived from `config.base_url`. The code already validates localhost URLs via `_is_local_base_url`. Consider adding explicit scheme validation if the base_url could come from untrusted sources.

🛡️ Optional: Add explicit scheme validation

```diff
 def _server_is_ready(base_url: str) -> bool:
     """Check whether an OpenAI-compatible endpoint is reachable."""
+    parsed = urllib_parse.urlsplit(base_url)
+    if parsed.scheme not in {"http", "https"}:
+        return False
     try:
         with urllib_request.urlopen(_models_url(base_url), timeout=1) as response:
             return response.status < HTTP_SERVER_ERROR_STATUS
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/meerqat/llm.py` around lines 240 - 252, The _server_is_ready and _fetch_models_payload functions call urllib_request.urlopen with a URL derived from _models_url(base_url) without explicit scheme checks; validate the scheme extracted from base_url (or from the constructed _models_url) using urllib.parse.urlparse and ensure it is "http" or "https" (and/or short-circuit if _is_local_base_url(base_url) allows localhost) before calling urlopen, returning False or raising a clear ValueError on invalid schemes, and update callers to handle that behavior; reference the functions _server_is_ready, _fetch_models_payload, _models_url and helper _is_local_base_url when making this change.
141-173: Subprocess call compiles C code with hardcoded compiler path.

The `subprocess.run` call invokes `cc` directly. While Ruff flags this as S603/S607, the command arguments are constructed from internal paths, not user input. However, the function could fail silently on systems without a C compiler.

💡 Consider adding a more descriptive error message

```diff
 except (OSError, subprocess.CalledProcessError) as error:
     raise RuntimeError(
-        "Failed to build the macOS CPU-only llama.cpp runtime shim."
+        "Failed to build the macOS CPU-only llama.cpp runtime shim. "
+        "Ensure Xcode Command Line Tools are installed (`xcode-select --install`)."
     ) from error
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/meerqat/llm.py` around lines 141 - 173, The subprocess.run call in _write_macos_metal_stub uses "cc" and currently raises a generic RuntimeError on failure; make failures actionable by first checking for a C compiler (use shutil.which to prefer "cc" then "clang") and fall back accordingly, and when subprocess.run raises subprocess.CalledProcessError or FileNotFoundError include the captured stderr/stdout (or the exception text) in the raised RuntimeError with a clear message suggesting installing the macOS command line tools; update the exception handling around the subprocess.run call to reference subprocess.run (and the CalledProcessError/FileNotFoundError) and include those error details in the RuntimeError.

tests/test_units.py (1)
360-390: Consider using `pytest.raises` instead of generator throw pattern.

The generator throw pattern `(_ for _ in ()).throw(...)` is functional but unusual. While it works for lazy exception raising in lambdas, it reduces readability.

♻️ Alternative approach using a helper function

```diff
+def _raise_runtime_error(msg: str):
+    def raiser(*args, **kwargs):
+        raise RuntimeError(msg)
+    return raiser
+
 def test_llm_returns_failed_review_when_all_paths_fail(
     monkeypatch: pytest.MonkeyPatch,
     valid_dataset: Path,
     metadata_csv: Path,
 ) -> None:
     """The core review should surface a structured failure instead of crashing."""
     report = _report_without_llm(valid_dataset, metadata_csv)
     monkeypatch.setattr(
         "meerqat.llm._ensure_local_instructor_server",
-        lambda config: (_ for _ in ()).throw(RuntimeError("startup failed")),
+        _raise_runtime_error("startup failed"),
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_units.py` around lines 360 - 390, Replace the unusual generator.throw pattern used in the monkeypatch lambdas with a clear helper that raises the intended exception; define a small function (e.g., raise_exc(exc)) that simply raises the provided exception and use it in the monkeypatch.setattr calls for meerqat.llm._ensure_local_instructor_server, meerqat.llm._invoke_instructor, and meerqat.llm._invoke_langchain so the test remains readable while still simulating startup/import failures.

src/meerqat/main.py (1)
67-88: Redundant `LLMReview` construction - same object created twice.

The `LLMReview` object is constructed identically at lines 74-79 and 81-86. This duplicates code and creates an unnecessary intermediate object.

♻️ Extract to a single variable

```diff
 llm_review = generate_llm_review(report, active_config.llm)
+llm_review_model = LLMReview(
+    status=llm_review.status,
+    provider=llm_review.provider,
+    model=llm_review.model,
+    error=llm_review.error,
+)
 return ValidationReport(
     summary=report.summary,
     issues=report.issues,
     dataset=report.dataset,
     rule_results=report.rule_results,
     llm_hints=llm_review.hints,
     llm_findings=llm_review.findings,
-    llm_review=LLMReview(
-        status=llm_review.status,
-        provider=llm_review.provider,
-        model=llm_review.model,
-        error=llm_review.error,
-    ),
-    provenance=_build_provenance(
-        LLMReview(
-            status=llm_review.status,
-            provider=llm_review.provider,
-            model=llm_review.model,
-            error=llm_review.error,
-        )
-    ),
+    llm_review=llm_review_model,
+    provenance=_build_provenance(llm_review_model),
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/meerqat/main.py` around lines 67 - 88, Create a single LLMReview instance and reuse it instead of constructing it twice: build one local variable (e.g., llm_review_obj = LLMReview(status=llm_review.status, provider=llm_review.provider, model=llm_review.model, error=llm_review.error)) and then pass llm_review_obj to both the ValidationReport(llm_review=...) field and to _build_provenance(llm_review_obj), leaving the uses of llm_review.hints and llm_review.findings unchanged.

src/meerqat/ingestion.py (1)
107-138: Potential performance concern with large directory trees.

The `_build_filetree_summary` function iterates over all paths multiple times and performs O(n²) similarity comparisons on directories. For datasets with many directories, this could be slow.

💡 Consider early termination for similarity checks

```diff
 similar_pairs: list[tuple[str, str]] = []
+MAX_SIMILAR_PAIRS = 100  # Limit output to prevent explosion
 for index, left in enumerate(directories):
+    if len(similar_pairs) >= MAX_SIMILAR_PAIRS:
+        break
     left_text = str(left.relative_to(dataset_root))
     for right in directories[index + 1 :]:
         right_text = str(right.relative_to(dataset_root))
         similarity = SequenceMatcher(None, left_text, right_text).ratio()
         if similarity >= SIMILAR_DIRECTORY_THRESHOLD:
             similar_pairs.append((str(left), str(right)))
+            if len(similar_pairs) >= MAX_SIMILAR_PAIRS:
+                break
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/meerqat/ingestion.py` around lines 107 - 138, _build_filetree_summary currently walks paths multiple times and runs O(n²) SequenceMatcher comparisons over directories (similar_pairs using SequenceMatcher and directories), which will be slow on large trees; to fix, do a single pass collecting paths, directory_children and directories (use one loop over dataset_root.rglob("*") to build paths_list, directory_children and directories), then reduce pairwise comparisons by bucketing directories (e.g., group by directory name stem, first N chars, depth, or token prefixes) and only run SequenceMatcher within each bucket (retain SIMILAR_DIRECTORY_THRESHOLD and SequenceMatcher but apply them to bucketed pairs), and finally build empty_directories and file_extensions from the single-pass data; update references in the function: paths, directory_children, directories, similar_pairs, SequenceMatcher, SIMILAR_DIRECTORY_THRESHOLD.
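The bucketing idea from the prompt above can be sketched as follows. This is an illustration, not the project's code: the `similar_directory_pairs` helper is hypothetical, bucketing by path depth is one possible key, and the `SIMILAR_DIRECTORY_THRESHOLD` value is assumed:

```python
from collections import defaultdict
from difflib import SequenceMatcher

SIMILAR_DIRECTORY_THRESHOLD = 0.9  # assumed value for illustration


def similar_directory_pairs(paths: list[str]) -> list[tuple[str, str]]:
    """Compare only directories with the same path depth instead of all pairs.

    Bucketing cuts the O(n^2) comparison space down to pairs that could
    plausibly be near-duplicates (same nesting level).
    """
    buckets: defaultdict[int, list[str]] = defaultdict(list)
    for path in paths:
        buckets[len(path.split("/"))].append(path)

    pairs: list[tuple[str, str]] = []
    for bucket in buckets.values():
        for index, left in enumerate(bucket):
            for right in bucket[index + 1 :]:
                if SequenceMatcher(None, left, right).ratio() >= SIMILAR_DIRECTORY_THRESHOLD:
                    pairs.append((left, right))
    return pairs
```

Combined with the pair cap in the diff above, this bounds both the work done and the output size.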
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/ISSUE_TEMPLATE/issue.yml:
- Line 18: Update the duplicate-confirmation checkbox text that currently reads
"I found no existing covering this topic." to include the missing noun; locate
the string "I found no existing covering this topic." in
.github/ISSUE_TEMPLATE/issue.yml and change it to "I found no existing issue
covering this topic." so the wording is grammatically correct and clear.
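A sketch of how the corrected checkbox might look in the issue-form YAML. The surrounding structure is assumed for illustration; only the label text comes from the finding:

```yaml
- type: checkboxes
  attributes:
    options:
      - label: "I found no existing issue covering this topic."
        required: true
```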
In @.github/PULL_REQUEST_TEMPLATE.md:
- Line 14: Update the sentence in the pull request template that currently reads
"you may used a `#<digit>` to reference GitHub issues" to the correct phrasing
"you may use a `#<digit>` to reference GitHub issues" so the contributor
guidance is grammatically correct; locate the exact line containing that
sentence in the PULL_REQUEST_TEMPLATE and replace "used" with "use" in the
string.
In @.github/workflows/draft-release.yml:
- Line 21: Replace the mutable action tag used in the workflow: change the
"uses: release-drafter/release-drafter@v6" entry to reference a specific commit
SHA (e.g., "uses: release-drafter/release-drafter@<full-commit-sha>") so the
release-drafter action is pinned to an immutable commit; update the SHA to a
vetted commit from the release-drafter repo and commit the workflow change.
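For illustration, a pinned `uses:` entry has this shape; the SHA below is a placeholder, not a real commit, and should be replaced with a vetted one:

```yaml
# Pin to an immutable commit; keep the tag as a trailing comment for readers.
- uses: release-drafter/release-drafter@<full-commit-sha> # v6
```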
In @.github/workflows/publish-docs.yml:
- Around line 13-41: The workflow currently runs the docs deploy step with
JamesIves/github-pages-deploy-action@v4 but doesn't grant the GITHUB_TOKEN write
permissions; add an explicit permissions block (e.g., permissions: contents:
write, pages: write) at the workflow or jobs.build level so the deploy action
can push to the pages branch; update the workflow YAML around the jobs.build
definition (referencing jobs.build and the step using
JamesIves/github-pages-deploy-action@v4) to include this permissions block.
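A minimal sketch of the permissions block using standard GitHub Actions syntax (the job name `build` is taken from the finding; the exact scopes needed may differ):

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: write  # allows the deploy action to push to the pages branch
```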
In @.github/workflows/publish-pypi.yml:
- Around line 19-33: Replace the mutable action tags with immutable commit SHAs:
for uses: actions/checkout@v4, actions/setup-python@v5, astral-sh/setup-uv@v6,
and pypa/gh-action-pypi-publish@release/v1 swap the version tags to the
corresponding full 40-character commit SHA for each action (look up each action
repo’s commit and update the uses: entries). Ensure you keep the existing keys
(with:, run:) intact and only replace the version suffix after each action name
so the workflow pins to exact commits.
In @.pre-commit-config.yaml:
- Around line 39-40: The yamllint hook's exclude path is incorrect: change the
exclude value under the yamllint entry (id: yamllint) from
"pre-commit-config.yaml" to ".pre-commit-config.yaml" so the hook correctly
ignores this file name; update the exclude string associated with id: yamllint
accordingly.
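A sketch of the corrected hook entry (surrounding `repo` block omitted). Since pre-commit `exclude` values are Python regexes, an anchored pattern with the leading dot escaped is one idiomatic form:

```yaml
hooks:
  - id: yamllint
    exclude: ^\.pre-commit-config\.yaml$  # the leading dot matters
```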
In `@CODE_OF_CONDUCT.md`:
- Around line 61-64: Replace the vague "reach out to the maintainers by using
their GitHub profile email" line with a concrete, stable reporting channel by
adding a dedicated reporting email alias (for example security@<org-domain>.org)
or a secure incident form URL, update the paragraph that begins "Instances of
abusive, harassing, or otherwise unacceptable behavior may be reported…" to
include that channel, and add a brief sentence that reports will be handled
confidentially and promptly (so readers know what to expect); ensure the new
contact is the single authoritative reporting point referenced in the same
paragraph.
In `@CONTRIBUTING.md`:
- Line 91: Fix the typo in CONTRIBUTING.md by replacing the misspelled word
"occaissions" with "occasions" in the sentence that mentions `_version.py` (the
sentence that reads "We also use the `_version.py` file as a place to persist
the version data for occaissions where the `git` history is unavailable or
unwanted (this file is only present in package builds).").
In `@docs/src/conf.py`:
- Around line 32-38: The extensions list currently includes the theme name
"pydata_sphinx_theme" which is not an extension; remove "pydata_sphinx_theme"
from the extensions = [...] array (leave other extensions like "myst_nb",
"sphinx.ext.autodoc", "sphinx.ext.napoleon", "sphinx.ext.viewcode" intact) and
rely on the existing html_theme setting (html_theme = "pydata_sphinx_theme") to
enable the theme instead.
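The resulting conf.py fragment implied by the finding would look like this (extension names taken from the finding; the theme is enabled via `html_theme`, not the extensions list):

```python
# docs/src/conf.py (fragment)
extensions = [
    "myst_nb",
    "sphinx.ext.autodoc",
    "sphinx.ext.napoleon",
    "sphinx.ext.viewcode",
]
html_theme = "pydata_sphinx_theme"  # themes are configured here, not as extensions
```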
In `@pyproject.toml`:
- Line 7: Remove the static version entry from pyproject.toml (the line setting
version = "0.0.1") so setuptools_scm can drive package versioning; leave the
existing setuptools_scm configuration (the [tool.setuptools_scm] block) intact
and ensure build-system still lists setuptools_scm as a build requirement so
versions are generated dynamically.
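An illustrative pyproject.toml shape for dynamic versioning with setuptools_scm (project metadata abbreviated; the version pins are examples, not requirements from the source):

```toml
[build-system]
requires = ["setuptools>=64", "setuptools_scm>=8"]
build-backend = "setuptools.build_meta"

[project]
name = "meerqat"
dynamic = ["version"]  # no static `version = "0.0.1"` entry

[tool.setuptools_scm]
# version derived from git tags; _version.py persists it for builds
```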
In `@spec.md`:
- Around line 215-216: The spec currently lists OME-Zarr as implemented in the
"Implemented Scope" section and also as pending in the "Future Work" section;
update the document to be consistent by deciding the true status and removing or
moving the OME-Zarr bullet accordingly: if OME-Zarr is implemented, remove it
from the "Future Work" list (or mark it as completed) and ensure the
"Implemented Scope" bullet explicitly names OME-Zarr; if it is not yet
implemented, remove it from "Implemented Scope" and keep it under "Future Work"
with any planned notes. Ensure both headings ("Implemented Scope" and "Future
Work") reference the same final status for OME-Zarr to eliminate the
contradiction.
In `@src/meerqat/ingestion.py`:
- Around line 56-76: Replace use of the vulnerable xml.etree.ElementTree with
defusedxml's safe parser: import defusedxml.ElementTree as ET and update calls
like ElementTree.parse(xml_path) to ET.parse(xml_path) in _extract_xml_plate_id
(ingestion.py) and the XML validation function in validation.py; also add
defusedxml to project dependencies (pyproject.toml) so the safe parser is
installed.
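A minimal sketch of the safe-parsing swap. The `extract_plate_id` helper and the `PlateID` attribute are hypothetical stand-ins for the real `_extract_xml_plate_id`; the stdlib fallback exists only so the snippet runs where defusedxml is not installed (in the actual fix, defusedxml would be a hard dependency):

```python
try:
    import defusedxml.ElementTree as ET  # preferred: blocks XXE / entity expansion
except ImportError:
    import xml.etree.ElementTree as ET  # illustration-only fallback


def extract_plate_id(xml_text: str):
    """Parse an XML fragment and read a hypothetical PlateID attribute."""
    root = ET.fromstring(xml_text)
    return root.get("PlateID")
```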
In `@src/meerqat/validation.py`:
- Around line 168-180: Replace unsafe ElementTree parsing with defusedxml's safe
parser in both locations: in the XML parseability check that calls
ElementTree.parse(plate.xml_path) (refer to the validation code that reads
plate.xml_path) and in the ingestion code that extracts plate identifiers from
XML (the routine that parses XML data in ingestion.py). Import
defusedxml.ElementTree (or alias it as ET) and use its parse/fromstring APIs
instead of xml.etree.ElementTree to prevent XXE and entity-expansion attacks,
and add defusedxml to pyproject.toml dependencies so the safe parser is
installed.
In `@tests/test_cli.py`:
- Around line 17-23: The _cli_env function builds a PYTHONPATH string using a
hardcoded ":" which is Unix-only; update _cli_env to join src_path and existing
PYTHONPATH using os.pathsep so the separator is platform-specific (modify the
logic around env["PYTHONPATH"] in _cli_env to use os.pathsep when concatenating
src_path and existing).
---
Nitpick comments:
In @.github/workflows/run-tests.yml:
- Around line 60-65: The cache key for the "Cache Hugging Face models" step
should include the model spec file so changes to repo_id/filename invalidate the
cache; update the key expression that currently uses hashFiles('pyproject.toml')
to also hash the model spec file (e.g., 'src/meerqat/config.py') so model spec
changes trigger a new cache, ensuring the workflow step named "Cache Hugging
Face models" doesn't return stale models.
In `@CITATION.cff`:
- Line 14: The CITATION.cff metadata uses a lowercased org name in the
repository-code field; update the repository-code value to use the repository's
canonical casing ("WayScience/meerqat") so it matches project metadata and
docs—locate the repository-code entry in CITATION.cff and change
"https://github.com/wayscience/meerqat" to the correctly cased canonical URL.
In `@CONTRIBUTING.md`:
- Around line 63-65: Consolidate the redundant consecutive headers by removing
or merging one of them: keep a single clear header (either "Creating a pull
request" or "Pull requests") and move any distinct content under that header;
update the section title where the duplicate appears and ensure the remaining
heading matches surrounding style and anchors (the two header strings "Creating
a pull request" and "Pull requests" identify where to make the change).
In `@docs/src/on_the_lookout_with_meerqat.ipynb`:
- Around line 203-231: The notebook cell outputs include absolute local paths
(see keys like 'empty_directories', 'filetree_issues', and example entries
'segment_A'/'segment_B') which leak developer-specific filesystem info; fix by
clearing or sanitizing the outputs before committing: either run a notebook
clear-output command (e.g., nbconvert --clear-output or use your editor's "Clear
All Outputs") to remove the printed results, or replace those absolute paths in
the output cells with relative paths or sanitized placeholders (e.g.,
"./tests/data/...") in the cells that produce 'filetree_issues' so the committed
notebook contains no local absolute paths.
In `@docs/src/on_the_lookout_with_meerqat.py`:
- Line 49: The hard-coded relative resolution for data_root
(pathlib.Path("../..") / "tests" / "data" / "valid_minimal") assumes the
notebook runs from docs/src; change it to compute the path relative to the
source file (e.g., use the file's resolved path via __file__ and parents to
locate the repo root and then append "tests/data/valid_minimal"), or explicitly
document the execution-dir requirement near the data_root declaration so callers
know the expected CWD; update the data_root assignment and its comment
accordingly.
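A sketch of the file-relative resolution, factored into a helper for clarity (in the notebook script the argument would be `__file__`; `parents[2]` assumes the script lives at `<repo>/docs/src/<name>.py`):

```python
import pathlib


def resolve_data_root(source_file: str) -> pathlib.Path:
    """Locate tests/data/valid_minimal relative to the notebook source file."""
    # parents[2]: <name>.py -> docs/src -> docs -> repo root
    repo_root = pathlib.Path(source_file).resolve().parents[2]
    return repo_root / "tests" / "data" / "valid_minimal"
```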
In `@spec.md`:
- Around line 270-291: The YAML example under the dataset block contains a
formatting artifact: the required_files entry shows an escaped dash (`\-
"Images/Index.xml"`) instead of a normal list item; update the required_files
list so the item is a proper YAML list entry (replace `\- "Images/Index.xml"`
with `- "Images/Index.xml"`) within the same dataset/required_files section to
ensure valid YAML parsing.
- Around line 226-263: The Domain Model section mixes prose and malformed code
blocks with backslash-escaped identifiers; update the three class definitions
(Dataset, Plate, ValidationIssue) to use proper fenced code blocks (```python)
and normal attribute names (dataset_id, plates, metadata_records, folder_name,
normalized_plate_id, xml_plate_id, metadata_plate_id, rule_id, severity,
message, entity_id) with consistent indentation and type annotations, removing
all backslash escapes and stray Markdown artifacts so the classes render as
clean Python-style examples.
In `@src/meerqat/cli.py`:
- Around line 172-188: The ready subparser duplicates LLM flags already provided
by add_common_flags; update the ready parser to reuse those shared flags by
calling add_common_flags(ready_parser) instead of manually adding --llm-*
options, or if ready should only include LLM flags (not metadata/dataset/report
flags) extract a new helper add_llm_flags(parser) and call that from both
add_common_flags and where ready_parser is configured; adjust references to
ready_parser and the handler _ready_command accordingly so flag names remain
unchanged.
In `@src/meerqat/ingestion.py`:
- Around line 107-138: _build_filetree_summary currently walks paths multiple
times and runs O(n²) SequenceMatcher comparisons over directories (similar_pairs
using SequenceMatcher and directories), which will be slow on large trees; to
fix, do a single pass collecting paths, directory_children and directories (use
one loop over dataset_root.rglob("*") to build paths_list, directory_children
and directories), then reduce pairwise comparisons by bucketing directories
(e.g., group by directory name stem, first N chars, depth, or token prefixes)
and only run SequenceMatcher within each bucket (retain
SIMILAR_DIRECTORY_THRESHOLD and SequenceMatcher but apply them to bucketed
pairs), and finally build empty_directories and file_extensions from the
single-pass data; update references in the function: paths, directory_children,
directories, similar_pairs, SequenceMatcher, SIMILAR_DIRECTORY_THRESHOLD.
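The bucketing idea can be sketched as below; the bucket key (parent directory plus first letter) and the 0.8 threshold are illustrative assumptions, not the module's actual values:

```python
from collections import defaultdict
from difflib import SequenceMatcher
from pathlib import Path

SIMILAR_DIRECTORY_THRESHOLD = 0.8  # illustrative; the module defines its own


def similar_directory_pairs(directories: list[Path]) -> list[tuple[Path, Path]]:
    """Compare directory names only within small buckets, not across all pairs.

    Any key that keeps likely-similar names together works; it cuts the
    O(n^2) SequenceMatcher cost down to the sum of squared bucket sizes.
    """
    buckets: dict[tuple[Path, str], list[Path]] = defaultdict(list)
    for directory in directories:
        buckets[(directory.parent, directory.name[:1].lower())].append(directory)
    pairs: list[tuple[Path, Path]] = []
    for group in buckets.values():
        for index, left in enumerate(group):
            for right in group[index + 1:]:
                ratio = SequenceMatcher(None, left.name, right.name).ratio()
                if ratio >= SIMILAR_DIRECTORY_THRESHOLD:
                    pairs.append((left, right))
    return pairs
```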
In `@src/meerqat/llm.py`:
- Around line 302-311: The _wait_for_server function currently uses a hardcoded
30s timeout; update it to accept a configurable timeout (e.g., timeout_seconds:
float) or read from the existing LLMConfig (e.g.,
LLMConfig.server_startup_timeout) and use that value instead of 30, update all
call sites that invoke _wait_for_server(base_url, process) to pass the
configured timeout or to ensure LLMConfig is accessible there, and adjust the
error messages to include the configured timeout for clarity; ensure function
signature and any imports/constructors that create or pass LLMConfig are updated
accordingly.
- Around line 240-252: The _server_is_ready and _fetch_models_payload functions
call urllib_request.urlopen with a URL derived from _models_url(base_url)
without explicit scheme checks; validate the scheme extracted from base_url (or
from the constructed _models_url) using urllib.parse.urlparse and ensure it is
"http" or "https" (and/or short-circuit if _is_local_base_url(base_url) allows
localhost) before calling urlopen, returning False or raising a clear ValueError
on invalid schemes, and update callers to handle that behavior; reference the
functions _server_is_ready, _fetch_models_payload, _models_url and helper
_is_local_base_url when making this change.
- Around line 141-173: The subprocess.run call in _write_macos_metal_stub uses
"cc" and currently raises a generic RuntimeError on failure; make failures
actionable by first checking for a C compiler (use shutil.which to prefer "cc"
then "clang") and fall back accordingly, and when subprocess.run raises
subprocess.CalledProcessError or FileNotFoundError include the captured
stderr/stdout (or the exception text) in the raised RuntimeError with a clear
message suggesting installing the macOS command line tools; update the exception
handling around the subprocess.run call to reference subprocess.run (and the
CalledProcessError/FileNotFoundError) and include those error details in the
RuntimeError.
In `@src/meerqat/main.py`:
- Around line 67-88: Create a single LLMReview instance and reuse it instead of
constructing it twice: build one local variable (e.g., llm_review_obj =
LLMReview(status=llm_review.status, provider=llm_review.provider,
model=llm_review.model, error=llm_review.error)) and then pass llm_review_obj to
both the ValidationReport(llm_review=...) field and to
_build_provenance(llm_review_obj), leaving the uses of llm_review.hints and
llm_review.findings unchanged.
In `@src/meerqat/reporting.py`:
- Around line 84-99: The report title currently interpolates
report.summary.dataset_id directly, causing "None" to appear when it's missing;
update report_to_markdown to compute a safe title value (e.g., dataset_title =
report.summary.dataset_id or "Unknown" / "(no dataset ID)") and use that
variable in the first line instead of report.summary.dataset_id so the Markdown
header shows a friendly fallback when dataset_id is None.
In `@tests/test_api.py`:
- Around line 41-48: The test test_validate_dataset_discovers_nearby_metadata
calls validate_dataset(valid_dataset) leaving LLM review enabled; to avoid
nondeterminism and slowness, update the call to pass the deterministic test
config (use config=_without_llm()) so call validate_dataset(valid_dataset,
config=_without_llm()), ensuring consistency with other tests that disable LLMs.
- Around line 20-22: Extract the duplicated helper function `_without_llm()`
into a shared pytest fixture in `tests/conftest.py`: create a fixture (e.g.,
`config_without_llm`) that returns
`ValidationConfig(llm=LLMConfig(enabled=False))`, then replace usages of the
`_without_llm()` function in `tests/test_api.py` and
`tests/test_sample_datasets.py` with the new fixture by accepting
`config_without_llm` as a test parameter; ensure you import `LLMConfig` and
`ValidationConfig` in `conftest.py` and remove the local `_without_llm()`
definitions from the test modules.
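A sketch of how the shared fixture could look in `tests/conftest.py` (the `meerqat.config` import path is an assumption based on the listed source layout):

```python
import pytest

from meerqat.config import LLMConfig, ValidationConfig


@pytest.fixture
def config_without_llm() -> ValidationConfig:
    """Deterministic validation config with the LLM review disabled."""
    return ValidationConfig(llm=LLMConfig(enabled=False))
```

Tests then accept `config_without_llm` as a parameter instead of calling a local `_without_llm()` helper.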
In `@tests/test_units.py`:
- Around line 360-390: Replace the unusual generator.throw pattern used in the
monkeypatch lambdas with a clear helper that raises the intended exception;
define a small function (e.g., raise_exc(exc)) that simply raises the provided
exception and use it in the monkeypatch.setattr calls for
meerqat.llm._ensure_local_instructor_server, meerqat.llm._invoke_instructor, and
meerqat.llm._invoke_langchain so the test remains readable while still
simulating startup/import failures.
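One readable shape for the helper is a small factory, since `monkeypatch.setattr` needs a callable that accepts whatever arguments the patched function receives (the factory form is a variation on the suggestion above):

```python
def raise_exc(exc: BaseException):
    """Return a stand-in callable that raises exc when invoked.

    Drop-in replacement for the generator.throw trick: the returned function
    accepts any arguments, so it can be patched over
    _ensure_local_instructor_server, _invoke_instructor, or _invoke_langchain.
    """
    def _raiser(*args, **kwargs):
        raise exc
    return _raiser
```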
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7a7db00b-61d9-419e-9ab1-6e287a47410f
⛔ Files ignored due to path filters (15)
- docs/src/_static/meerqat.png is excluded by `!**/*.png`
- tests/data/filetree_risks/dataset/Plate_Z09/image_001.tiff is excluded by `!**/*.tiff`
- tests/data/filetree_risks/metadata.csv is excluded by `!**/*.csv`
- tests/data/missing_index/dataset/Plate_C03/image_001.tiff is excluded by `!**/*.tiff`
- tests/data/missing_index/metadata.csv is excluded by `!**/*.csv`
- tests/data/missing_metadata/dataset/Plate_D04/image_001.tiff is excluded by `!**/*.tiff`
- tests/data/missing_metadata/metadata.csv is excluded by `!**/*.csv`
- tests/data/mixed_modalities/dataset/Plate_E05/image_001.tiff is excluded by `!**/*.tiff`
- tests/data/mixed_modalities/dataset/Plate_E05/preview.png is excluded by `!**/*.png`
- tests/data/mixed_modalities/metadata.csv is excluded by `!**/*.csv`
- tests/data/valid_minimal/dataset/Plate_A01/image_001.tiff is excluded by `!**/*.tiff`
- tests/data/valid_minimal/metadata.csv is excluded by `!**/*.csv`
- tests/data/xml_mismatch/dataset/Plate_B02/image_001.tiff is excluded by `!**/*.tiff`
- tests/data/xml_mismatch/metadata.csv is excluded by `!**/*.csv`
- uv.lock is excluded by `!**/*.lock`
📒 Files selected for processing (49)
- .github/ISSUE_TEMPLATE/issue.yml
- .github/PULL_REQUEST_TEMPLATE.md
- .github/dependabot.yml
- .github/release-drafter.yml
- .github/workflows/draft-release.yml
- .github/workflows/publish-docs.yml
- .github/workflows/publish-pypi.yml
- .github/workflows/run-tests.yml
- .gitignore
- .pre-commit-config.yaml
- .python-version
- CITATION.cff
- CODE_OF_CONDUCT.md
- CONTRIBUTING.md
- README.md
- architecture.md
- docs/src/_static/custom.css
- docs/src/cli.md
- docs/src/conf.py
- docs/src/index.md
- docs/src/on_the_lookout_with_meerqat.ipynb
- docs/src/on_the_lookout_with_meerqat.py
- docs/src/python-api.md
- pyproject.toml
- roadmap.md
- spec.md
- src/meerqat/__init__.py
- src/meerqat/cli.py
- src/meerqat/config.py
- src/meerqat/ingestion.py
- src/meerqat/llm.py
- src/meerqat/main.py
- src/meerqat/models.py
- src/meerqat/normalization.py
- src/meerqat/reporting.py
- src/meerqat/validation.py
- tests/conftest.py
- tests/data/README.md
- tests/data/filetree_risks/dataset/Plate_Z09/Index.xml
- tests/data/missing_metadata/dataset/Plate_D04/Index.xml
- tests/data/mixed_modalities/dataset/Plate_E05/Index.xml
- tests/data/valid_minimal/dataset/Plate_A01/Index.xml
- tests/data/xml_mismatch/dataset/Plate_B02/Index.xml
- tests/test_api.py
- tests/test_cli.py
- tests/test_incident_regressions.py
- tests/test_llm_runtime.py
- tests/test_sample_datasets.py
- tests/test_units.py
> - Image inventory support for TIFF, OME-TIFF, OME-Zarr, PNG, and JPEG assets
> - Single-dataset and batch validation through both API and CLI
Inconsistency: OME-Zarr listed in both "Implemented Scope" and "Future Work".
Line 215 states OME-Zarr is implemented, but line 366 lists it under "Future Work". Please reconcile these sections to reflect the actual implementation status.
Also applies to: 366-366
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@spec.md` around lines 215 - 216, The spec currently lists OME-Zarr as
implemented in the "Implemented Scope" section and also as pending in the
"Future Work" section; update the document to be consistent by deciding the true
status and removing or moving the OME-Zarr bullet accordingly: if OME-Zarr is
implemented, remove it from the "Future Work" list (or mark it as completed) and
ensure the "Implemented Scope" bullet explicitly names OME-Zarr; if it is not
yet implemented, remove it from "Implemented Scope" and keep it under "Future
Work" with any planned notes. Ensure both headings ("Implemented Scope" and
"Future Work") reference the same final status for OME-Zarr to eliminate the
contradiction.
Actionable comments posted: 11
🧹 Nitpick comments (3)
src/meerqat/cli.py (1)
192-196: Consider logging the full traceback for debugging. Catching bare `Exception` is reasonable for CLI error handling to ensure clean exits, but the current implementation discards the traceback. Consider logging or printing the full traceback when a debug/verbose flag is set, or at minimum to stderr.

Proposed improvement:

```diff
+import traceback
+
 def main(argv: list[str] | None = None) -> int:
     """Run the CLI."""
     parser = build_parser()
     args = parser.parse_args(argv)
     try:
         return int(args.handler(args))
     except Exception as error:
         print(f"Meerqat system error: {error}", file=sys.stderr)
+        traceback.print_exc(file=sys.stderr)
         return 3
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/meerqat/cli.py` around lines 192 - 196, The except block around calling args.handler(args) currently only prints the exception message and discards the traceback; update the handler of that exception in the CLI entry (the try/except surrounding return int(args.handler(args))) to emit the full traceback to stderr or the logger when a debug/verbose flag is set (e.g., args.debug or args.verbose) by calling traceback.print_exc(file=sys.stderr) or logging.exception(...) as appropriate, otherwise continue returning exit code 3—ensure you import traceback or use the logging API and reference args.handler and the surrounding CLI entrypoint when making the change.

CONTRIBUTING.md (1)
53-55: Document the `--frozen` flag to match CI. The CI workflow uses `uv run --frozen pytest`, but this documentation shows only `uv run pytest`. Consider aligning the example with CI behavior or noting when `--frozen` is appropriate.

Proposed fix:

```diff
-% uv run pytest
+% uv run --frozen pytest
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@CONTRIBUTING.md` around lines 53 - 55, Update the example command in CONTRIBUTING.md to match CI usage by adding the --frozen flag: change the example invocation "uv run pytest" to "uv run --frozen pytest" (or add a brief note explaining when to use --frozen) so the documented command aligns with the CI workflow; locate the example command string "uv run pytest" to make the edit.

docs/src/on_the_lookout_with_meerqat.ipynb (1)

230-233: CLI cell uses hardcoded relative path. The shell command uses a hardcoded `pathlib.Path("../..")` instead of the computed `repo_root` variable used elsewhere in the notebook. This works in Jupyter shell commands via brace expansion, but the markdown documentation cell above shows `{repo_root / ...}` as the pattern. Consider using `{repo_root}` consistently:

```python
!meerqat validate \
    {repo_root / "tests" / "data" / "xml_mismatch" / "dataset"}
```

This ensures the command works regardless of the notebook's working directory.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/src/on_the_lookout_with_meerqat.ipynb` around lines 230 - 233, The CLI cell is using a hardcoded pathlib.Path("../..") instead of the notebook's repo_root variable; update the shell command in the CLI cell that runs "!meerqat validate ..." to use the existing repo_root (repo_root / "tests" / "data" / "xml_mismatch" / "dataset") so it matches the notebook's other examples and works regardless of working directory—replace the hardcoded pathlib.Path("../..") expression with the repo_root variable reference wherever that validation command appears.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@CONTRIBUTING.md`:
- Line 99: Replace the non-descriptive "here" link text in the sentence "We
publish source code by using GitHub Releases available here" with descriptive
link text such as "GitHub Releases page" (or "Meerqat Releases on GitHub") so
the anchor text clearly conveys destination and improves accessibility; update
the sentence to read something like "We publish source code using GitHub
Releases (GitHub Releases page)" and ensure the existing URL remains the link
target.
In `@spec.md`:
- Around line 224-251: Update the spec's domain model to match the actual
exported dataclasses in src/meerqat/models.py: change Dataset to include the
fields root and filetree_summary in addition to
dataset_id/plates/metadata_records; replace Plate's documented fields with
plate_id, path, xml_path, image_files, zero_byte_images, and image_modalities;
and update ValidationIssue to use code, plate_id, path, and remediation instead
of rule_id/severity/message/entity_id so the documentation matches the real
dataclass shapes (Dataset, Plate, ValidationIssue).
- Around line 257-275: The YAML example uses nested sections (dataset,
plate_naming, expectations, metadata) that don't match the flat schema expected
by load_config() and ValidationConfig; update the example to use flat keys
matching the names in src/meerqat/config.py (dataset_id, plate_name_pattern,
expected_plate_count, expected_images_per_plate, metadata_plate_column) so that
calling load_config(payload) / ValidationConfig(**payload) succeeds—locate
references to load_config and ValidationConfig in the code to verify the exact
key names and ensure the example YAML uses those flat keys and literal values
rather than nested objects.
In `@src/meerqat/config.py`:
- Around line 206-213: resolved_template currently merges template defaults
using truthy checks (e.g., self.required_files or template.required_files),
which treats an explicit empty override like required_files: [] as "unset";
change the merge to check for None instead so explicit empty values are honored:
in resolved_template use conditional expressions that test "if
self.required_files is not None then use self.required_files else
template.required_files" (and similarly for plate_name_pattern and
expected_images_per_plate) so explicit empty/falsey overrides are preserved.
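The None-versus-falsy distinction can be isolated in a tiny helper (the name is illustrative; in `resolved_template` the same conditional would be applied inline per field):

```python
def merge_field(override, template_value):
    """Keep an explicit override, even if falsy, and only fall back on None."""
    return override if override is not None else template_value
```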
In `@src/meerqat/ingestion.py`:
- Around line 174-192: The parent-directory discovery in
_discover_metadata_paths is too permissive (it currently collects all files with
METADATA_EXTENSIONS from dataset_root.parent), causing sibling datasets to share
CSV/XLS/XLSX metadata; restrict parent search by only considering parent files
whose stem matches the preferred names (e.g., "metadata", "meta",
"plate_metadata") or only accept parent candidates when there is exactly one
matching metadata file — otherwise do not auto-include parent results and
require explicit metadata_paths; update the logic in _discover_metadata_paths to
apply this narrower filter to search_root == dataset_root.parent and return only
the filtered parent candidate(s) or nothing if ambiguous.
- Around line 160-168: The loop that builds MetadataRecord objects currently
treats pandas NaN as a valid plate_value, producing plate_id="nan"; update the
loop in ingestion.py (the for row in frame.to_dict(orient="records") block) to
detect and skip NaN/blank values for config.metadata_plate_column (use
pandas.isna or math.isnan) before appending to records so that
MetadataRecord(plate_id=...) is only created for real IDs (keep references to
MetadataRecord, metadata_plate_column, records, and path).
- Around line 56-60: _extract_xml_plate_id currently calls
ET.parse(xml_path).getroot() without handling xml.etree.ElementTree.ParseError,
which lets malformed XML abort ingest_dataset before validate_dataset_model can
report it; wrap the ET.parse call in a try/except that catches
xml.etree.ElementTree.ParseError (or Exception if you prefer broader safety) and
return None (or otherwise swallow and optionally debug-log the error) so
ingest_dataset() does not raise on malformed XML and validate_dataset_model can
perform structured validation; update _extract_xml_plate_id to import the
ParseError symbol and handle it accordingly.
In `@src/meerqat/llm.py`:
- Around line 206-210: The current PermissionError fallback uses a predictable
temp path (tempfile.gettempdir() / "meerqat" / MACOS_CPU_RUNTIME_DIRNAME) which
is vulnerable to tampering; change the fallback in the code path that calls
build_runtime/_default_runtime_dir to create a unique, owner-only directory via
tempfile.mkdtemp (e.g., mkdtemp(prefix="meerqat-", dir=tempfile.gettempdir())),
immediately set permissions to 0o700 (os.chmod) and use that Path for
build_runtime, ensuring the directory is created with restrictive permissions
and is only writable by the current user (optionally register it for secure
cleanup), instead of using the fixed predictable temp location.
- Around line 662-676: The langchain fallback in _attempt_langchain_review
currently only catches a narrow set of errors; broaden the except clause to
catch Exception (i.e., replace the current multi-exception tuple with except
Exception as error) so any runtime/import/invocation error from
_invoke_langchain() or chain.invoke() is normalized the same way as the
instructor path and returns (None, error) instead of letting exceptions
propagate into generate_llm_review()/validate_dataset().
In `@src/meerqat/main.py`:
- Around line 123-137: The function write_reports currently ignores
markdown_path and html_path when given a BatchValidationReport; change it to
fail fast by detecting if report is a BatchValidationReport and either
markdown_path or html_path is not None and raising a ValueError with a clear
message. Update the branch in write_reports (the isinstance(report,
ValidationReport) check) to first check for BatchValidationReport with
markdown/html paths and raise; otherwise proceed to call write_report_file with
report_to_markdown and report_to_html for ValidationReport as before
(references: write_reports, ValidationReport, BatchValidationReport,
write_report_file, report_to_markdown, report_to_html).
In `@src/meerqat/reporting.py`:
- Around line 186-187: The HTML generator uses report.summary.dataset_id
directly and can render "None"; update the h1 construction in report_to_html to
match report_to_markdown's fallback by using
html.escape(report.summary.dataset_id or "Unknown") (ensure the fallback is
applied before escaping) so the header shows "Unknown" when dataset_id is None;
locate the f-string that builds "<h1>Meerqat Report: ...</h1>" and replace the
expression accordingly.
---
Nitpick comments:
In `@CONTRIBUTING.md`:
- Around line 53-55: Update the example command in CONTRIBUTING.md to match CI
usage by adding the --frozen flag: change the example invocation "uv run pytest"
to "uv run --frozen pytest" (or add a brief note explaining when to use
--frozen) so the documented command aligns with the CI workflow; locate the
example command string "uv run pytest" to make the edit.
In `@docs/src/on_the_lookout_with_meerqat.ipynb`:
- Around line 230-233: The CLI cell is using a hardcoded pathlib.Path("../..")
instead of the notebook's repo_root variable; update the shell command in the
CLI cell that runs "!meerqat validate ..." to use the existing repo_root
(repo_root / "tests" / "data" / "xml_mismatch" / "dataset") so it matches the
notebook's other examples and works regardless of working directory—replace the
hardcoded pathlib.Path("../..") expression with the repo_root variable reference
wherever that validation command appears.
In `@src/meerqat/cli.py`:
- Around line 192-196: The except block around calling args.handler(args)
currently only prints the exception message and discards the traceback; update
the handler of that exception in the CLI entry (the try/except surrounding
return int(args.handler(args))) to emit the full traceback to stderr or the
logger when a debug/verbose flag is set (e.g., args.debug or args.verbose) by
calling traceback.print_exc(file=sys.stderr) or logging.exception(...) as
appropriate, otherwise continue returning exit code 3—ensure you import
traceback or use the logging API and reference args.handler and the surrounding
CLI entrypoint when making the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: fb7c4abb-c33b-4690-b836-74edd1e7649a
⛔ Files ignored due to path filters (1)
- uv.lock is excluded by `!**/*.lock`
📒 Files selected for processing (30)
- .github/ISSUE_TEMPLATE/issue.yml
- .github/PULL_REQUEST_TEMPLATE.md
- .github/dependabot.yml
- .github/workflows/draft-release.yml
- .github/workflows/publish-docs.yml
- .github/workflows/publish-pypi.yml
- .github/workflows/run-tests.yml
- .pre-commit-config.yaml
- .yamllint
- CITATION.cff
- CODE_OF_CONDUCT.md
- CONTRIBUTING.md
- docs/src/conf.py
- docs/src/on_the_lookout_with_meerqat.ipynb
- docs/src/on_the_lookout_with_meerqat.py
- pyproject.toml
- spec.md
- src/meerqat/_version.py
- src/meerqat/cli.py
- src/meerqat/config.py
- src/meerqat/ingestion.py
- src/meerqat/llm.py
- src/meerqat/main.py
- src/meerqat/reporting.py
- src/meerqat/validation.py
- tests/conftest.py
- tests/test_api.py
- tests/test_cli.py
- tests/test_sample_datasets.py
- tests/test_units.py
✅ Files skipped from review due to trivial changes (14)
- .yamllint
- CITATION.cff
- .github/PULL_REQUEST_TEMPLATE.md
- .github/ISSUE_TEMPLATE/issue.yml
- .github/dependabot.yml
- .github/workflows/draft-release.yml
- .github/workflows/publish-pypi.yml
- .github/workflows/publish-docs.yml
- docs/src/conf.py
- tests/conftest.py
- .github/workflows/run-tests.yml
- src/meerqat/_version.py
- .pre-commit-config.yaml
- pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test_sample_datasets.py
Actionable comments posted: 7
♻️ Duplicate comments (1)
src/meerqat/ingestion.py (1)
164-173: ⚠️ Potential issue | 🟠 Major

Store the normalized plate id, not the raw cell value. The blank check strips whitespace, but `MetadataRecord.plate_id` still keeps the untrimmed value. A cell like `" Plate_A01 "` will pass the guard here and then miss the folder/XML identifiers during matching.

Possible fix:

```diff
 for row in frame.to_dict(orient="records"):
     plate_value = row.get(config.metadata_plate_column)
     if plate_value is None or pd.isna(plate_value):
         continue
-    if not str(plate_value).strip():
+    plate_id = str(plate_value).strip()
+    if not plate_id:
         continue
     records.append(
         MetadataRecord(
-            plate_id=str(plate_value),
+            plate_id=plate_id,
             source=str(path),
             values=dict(row),
         )
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/meerqat/ingestion.py` around lines 164 - 173, The loop currently checks plate_value for emptiness but passes the untrimmed value into MetadataRecord. Normalize the plate id before constructing MetadataRecord: derive a normalized_plate_id = str(plate_value).strip() (optionally apply consistent casing if needed), use that for MetadataRecord(plate_id=normalized_plate_id, ...) and for any matching logic, and ensure you still skip when normalized_plate_id is empty; update the code around the frame.to_dict loop and the use of config.metadata_plate_column and records accordingly.
🧹 Nitpick comments (2)
src/meerqat/llm.py (2)
779-785: Redundant model path resolution. The `model_path` was already successfully resolved at lines 688-692. This second call to `get_cached_model_path` is redundant since the variable is still in scope.

Suggested fix:

```diff
     else:
         try:
-            model_path = get_cached_model_path(
-                spec,
-                cache_dir=config.cache_dir,
-                offline=config.offline,
-            )
             llm = _build_langchain_llm(model_path, config)
             response = llm.invoke("Reply with READY.")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/meerqat/llm.py` around lines 779 - 785, The try block is redundantly calling get_cached_model_path again even though model_path was already resolved earlier; remove the second get_cached_model_path call and pass the existing model_path into _build_langchain_llm instead (ensure the earlier model_path variable from the prior resolution is in scope before the try block and used here); update any error handling in this try/except to only surround _build_langchain_llm so you don't redo model resolution.
267-270: Consider handling IPv6 loopback address. The localhost check doesn't include the IPv6 loopback address `::1`, which could be relevant in dual-stack environments.

Optional enhancement:

```diff
 def _is_local_base_url(base_url: str) -> bool:
     """Return whether the configured endpoint points at localhost."""
     hostname = urllib_parse.urlsplit(base_url).hostname
-    return hostname in {"127.0.0.1", "localhost"}
+    return hostname in {"127.0.0.1", "localhost", "::1"}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/meerqat/llm.py` around lines 267 - 270, The _is_local_base_url function currently treats only "127.0.0.1" and "localhost" as local; update it to also recognize the IPv6 loopback "::1" (and its possible bracketed/normalized forms) by including "::1" in the hostname check or normalizing brackets before comparison so that hostname from urllib_parse.urlsplit(base_url) matches "::1"; modify the set used in _is_local_base_url or add a small normalization step around the hostname variable to ensure IPv6 loopback is detected.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/src/on_the_lookout_with_meerqat.ipynb`:
- Around line 50-53: The fallback that sets repo_root to
pathlib.Path("../..").resolve() when "__file__" is missing is unreliable in
notebooks; replace it with logic that locates the repo root by walking up from
the current working directory (pathlib.Path.cwd()) until a repo marker is found
(e.g., ".git", "pyproject.toml", "setup.py") — implement a small helper (e.g.,
find_repo_root) and use it as the fallback for repo_root so notebooks launched
from other CWDs still resolve the actual repository root instead of ../..; keep
the original branch that uses pathlib.Path(__file__).resolve().parents[2] when
"__file__" exists.
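The marker-walking fallback can be sketched as:

```python
import pathlib


def find_repo_root(start: pathlib.Path,
                   markers: tuple[str, ...] = (".git", "pyproject.toml", "setup.py")
                   ) -> pathlib.Path:
    """Walk upward from start until a directory containing a repo marker is found."""
    for candidate in (start, *start.parents):
        if any((candidate / marker).exists() for marker in markers):
            return candidate
    raise FileNotFoundError(f"no repository marker found above {start}")
```

In the notebook, `find_repo_root(pathlib.Path.cwd())` would replace the `pathlib.Path("../..").resolve()` fallback.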
In `@spec.md`:
- Around line 103-104: Update the spec text to reflect the actual shipped
runtime: replace the sentence that says MeerQat "falls back to direct local GGUF
execution" with wording that the fallback provider is "langchain" (matching the
implemented providers `instructor` and `langchain`) and that tests exercise
LangChain as the fallback path; ensure the terms `instructor`, `langchain`, and
`MeerQat` are used so the spec matches the current implementation and test
behavior.
- Around line 284-290: The spec examples list CLI commands that don't exist;
update spec.md to only document commands actually registered in
src/meerqat/cli.py (validate, batch-validate, models, ready) by replacing the
"Suggest config" and "Explain anomalies" examples with equivalent examples using
those real commands (or explicitly note those features are not implemented yet),
and ensure the example command names in spec.md match the unique command
identifiers in src/meerqat/cli.py (validate, batch-validate, models, ready).
In `@src/meerqat/cli.py`:
- Around line 41-58: The current reconstruction of LLMConfig (creating a new
LLMConfig with selected fields from args and existing) drops unexposed settings
like auto_start_server and server_startup_timeout; instead preserve all existing
fields and only override those explicitly supplied by CLI by using
dataclasses.replace(existing, ...) or by copying existing and updating the
provided attributes (referencing LLMConfig, existing, and args), ensuring you
set
provider/model_alias/repo_id/filename/cache_dir/offline/n_ctx/n_threads/max_tokens/temperature/base_url
only when args provide values so auto_start_server and server_startup_timeout
(and any other fields) are retained.
- Around line 161-181: The batch subcommand is exposing flags that
_batch_validate_command ignores; update the CLI to only register flags that
batch supports. Either add a new add_batch_validation_flags(target) that only
adds "--metadata", "--report-json", and "--fail-on-warning" and call that for
batch_parser, or change add_validation_flags(target, *, for_batch=False) to
conditionally skip adding "--dataset-id", "--report-markdown", and
"--report-html" when for_batch is true; then keep validate_parser using the full
add_validation_flags and have batch_parser use the batch-specific variant so the
parser flags match _batch_validate_command's behavior.
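A sketch of the `for_batch` variant described above, using plain `argparse`; the flag set follows the comment, while the parser wiring is illustrative:

```python
import argparse


def add_validation_flags(target: argparse.ArgumentParser, *, for_batch: bool = False) -> None:
    # Flags shared by `validate` and `batch-validate`.
    target.add_argument("--metadata", action="append", default=None)
    target.add_argument("--report-json", default=None)
    target.add_argument("--fail-on-warning", action="store_true")
    if for_batch:
        return  # batch ignores the single-dataset-only flags below
    target.add_argument("--dataset-id", default=None)
    target.add_argument("--report-markdown", default=None)
    target.add_argument("--report-html", default=None)


parser = argparse.ArgumentParser(prog="meerqat")
sub = parser.add_subparsers(dest="command")
add_validation_flags(sub.add_parser("validate"))
add_validation_flags(sub.add_parser("batch-validate"), for_batch=True)
```

With this wiring, `batch-validate --dataset-id x` is rejected at parse time instead of being silently ignored.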
In `@src/meerqat/ingestion.py`:
- Around line 82-90: The code treats multi-part extensions inconsistently:
update _collect_image_files and _build_filetree_summary to recognize compound
extensions like ".ome.zarr", ".ome.tif", and ".ome.tiff" (and detect ".ome.zarr"
recursively under plate directories) instead of relying on Path.suffix which
drops multi-part suffixes; use Path.suffixes or a name.endswith(...) check
against a canonical list of supported compound extensions when gathering files
and when bucketing by type so plate assets (e.g., Plate_A01/Images/run.ome.zarr)
are included in image totals and correctly labeled in the filetree inventory;
also ensure _looks_like_plate()’s expectations align with the same
extension-checking logic so detection, counting, and reporting are consistent.
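The compound-suffix check can be sketched like this — a minimal stand-in (`inventory_extension` and the extension list are illustrative) showing why `name.endswith(...)` recovers what `Path.suffix` drops:

```python
from pathlib import Path

# Compound extensions that Path.suffix would truncate to ".zarr"/".tif"/".tiff".
COMPOUND_EXTENSIONS = (".ome.zarr", ".ome.tif", ".ome.tiff")


def inventory_extension(path: Path) -> str:
    # Check the known compound suffixes first; fall back to Path.suffix.
    name = path.name.lower()
    for ext in COMPOUND_EXTENSIONS:
        if name.endswith(ext):
            return ext
    return path.suffix.lower()
```

Using the same helper for both file collection and filetree bucketing keeps detection, counting, and reporting consistent.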
- Around line 233-237: The current assignment to resolved_metadata_paths treats
an empty list metadata_paths as false and triggers discovery; change the
conditional to detect None specifically so an explicit empty list disables
discovery. Update the expression that builds resolved_metadata_paths (currently
using metadata_paths and _discover_metadata_paths with root) to use "if
metadata_paths is not None" (or equivalent None check) so metadata_paths=[]
yields an empty list (still mapping values through Path) while
metadata_paths=None continues to call list(_discover_metadata_paths(root)).
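A minimal sketch of the `None`-vs-empty-list distinction; `discover_metadata_paths` here is a stand-in for the real discovery helper:

```python
from pathlib import Path
from typing import Iterable


def discover_metadata_paths(root: Path) -> Iterable[Path]:
    # Stand-in for the real discovery helper.
    yield from sorted(root.glob("*.csv"))


def resolve_metadata_paths(root: Path, metadata_paths=None) -> list[Path]:
    # `None` means "discover"; an explicit empty list means "no metadata".
    if metadata_paths is not None:
        return [Path(p) for p in metadata_paths]
    return list(discover_metadata_paths(root))
```

The truthiness test (`metadata_paths or discover(...)`) would conflate `[]` with `None` and trigger discovery in both cases; the explicit `is not None` check keeps the two intents distinct.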
---
Duplicate comments:
In `@src/meerqat/ingestion.py`:
- Around line 164-173: The loop currently checks plate_value for emptiness but
passes the untrimmed value into MetadataRecord. Normalize the plate id before
constructing MetadataRecord: derive a normalized_plate_id =
str(plate_value).strip() (optionally apply consistent casing if needed), use
that for MetadataRecord(plate_id=normalized_plate_id, ...) and for any matching
logic, and ensure you still skip when normalized_plate_id is empty; update the
code around the frame.to_dict loop and the use of config.metadata_plate_column
and records accordingly.
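The normalization described above can be sketched as follows (the lowercasing step is an illustrative casing choice, not necessarily what MeerQat does):

```python
def normalize_plate_id(value: object) -> str:
    # Trim whitespace and canonicalize case so " Plate_A01 " and "PLATE_A01" match.
    return str(value).strip().lower() if value is not None else ""


def rows_to_plate_ids(rows: list[dict], plate_column: str) -> list[str]:
    plate_ids = []
    for row in rows:
        normalized = normalize_plate_id(row.get(plate_column))
        if not normalized:  # skip blank or whitespace-only ids
            continue
        plate_ids.append(normalized)
    return plate_ids
```

The key point is that the same normalized value feeds both the emptiness check and the record construction, so a whitespace-only cell is skipped rather than stored untrimmed.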
---
Nitpick comments:
In `@src/meerqat/llm.py`:
- Around line 779-785: The try block is redundantly calling
get_cached_model_path again even though model_path was already resolved earlier;
remove the second get_cached_model_path call and pass the existing model_path
into _build_langchain_llm instead (ensure the earlier model_path variable from
the prior resolution is in scope before the try block and used here); update any
error handling in this try/except to only surround _build_langchain_llm so you
don't redo model resolution.
- Around line 267-270: The _is_local_base_url function currently treats only
"127.0.0.1" and "localhost" as local; update it to also recognize the IPv6
loopback "::1" (and its possible bracketed/normalized forms) by including "::1"
in the hostname check or normalizing brackets before comparison so that hostname
from urllib_parse.urlsplit(base_url) matches "::1"; modify the set used in
_is_local_base_url or add a small normalization step around the hostname
variable to ensure IPv6 loopback is detected.
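A sketch of the loopback check with IPv6 included; note that `urlsplit(...).hostname` already strips the brackets from `[::1]`, so no extra normalization is needed in this form:

```python
from urllib.parse import urlsplit

LOOPBACK_HOSTS = {"127.0.0.1", "localhost", "::1"}


def is_local_base_url(base_url: str) -> bool:
    # urlsplit().hostname lowercases the host and removes IPv6 brackets,
    # so "http://[::1]:8080" yields "::1".
    hostname = urlsplit(base_url).hostname
    return hostname in LOOPBACK_HOSTS
```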
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bb1bc5f7-1804-4c11-85cc-237ae1d2f119
📒 Files selected for processing (12)
- CONTRIBUTING.md
- docs/src/on_the_lookout_with_meerqat.ipynb
- spec.md
- src/meerqat/_version.py
- src/meerqat/cli.py
- src/meerqat/config.py
- src/meerqat/ingestion.py
- src/meerqat/llm.py
- src/meerqat/main.py
- src/meerqat/reporting.py
- tests/test_api.py
- tests/test_units.py
✅ Files skipped from review due to trivial changes (1)
- src/meerqat/_version.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/meerqat/main.py
Actionable comments posted: 3
🧹 Nitpick comments (7)
src/meerqat/cli.py (1)
100-105: Unify batch exit-code fallback with `_exit_code` semantics.

Current logic returns `0` for any non-`fail`/`warn` statuses. That diverges from `_exit_code`, which maps unknown statuses to `3`.

Suggested patch

```diff
-    statuses = [entry.summary.status for entry in report.reports]
-    if "fail" in statuses:
-        return 2
-    if "warn" in statuses:
-        return 1
-    return 0
+    statuses = [entry.summary.status for entry in report.reports]
+    if "fail" in statuses:
+        return _exit_code("fail")
+    if "warn" in statuses:
+        return _exit_code("warn")
+    if all(status == "pass" for status in statuses):
+        return _exit_code("pass")
+    return _exit_code("unknown")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/meerqat/cli.py` around lines 100 - 105, The batch exit-code logic in the block that computes statuses = [entry.summary.status for entry in report.reports] should follow the same semantics as _exit_code (unknown statuses -> 3) rather than defaulting to 0; update the conditional branch so that it returns 2 for "fail", 1 for "warn", and 3 for any other/unknown status, ensuring the behavior is consistent with the _exit_code mapping used elsewhere.
src/meerqat/ingestion.py (3)
84-96: Case-sensitivity mismatch in nested element attribute lookup.

Lines 84-87 search root attributes with mixed-case keys (`"PlateID"`, `"Name"`, `"ID"`), but line 90 uses only lowercase keys (`"id"`, `"name"`, `"plateid"`). Since `element.attrib.get()` is case-sensitive, attributes like `Name="Plate1"` on nested elements will be missed.

♻️ Suggested fix to use consistent case handling

```diff
 for element in root.iter():
     if element.tag.lower().endswith("plate") or "plate" in element.tag.lower():
-        for key in ("id", "name", "plateid", "platename"):
-            value = element.attrib.get(key)
-            if value:
-                candidates.append(value)
+        for key in ("PlateID", "PlateName", "Name", "ID", "id", "name", "plateid", "platename"):
+            if key in element.attrib:
+                candidates.append(element.attrib[key])
+                break
         text = (element.text or "").strip()
         if text:
             candidates.append(text)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/meerqat/ingestion.py` around lines 84 - 96, The nested element attribute lookup is case-sensitive (using element.attrib.get with only lowercase keys) while root attribute checks use mixed-case keys, so attributes like Name="Plate1" on nested elements are missed; fix by normalizing attribute names before lookup (e.g., build a lowercase-keyed mapping from element.attrib and root.attrib or lowercase keys on the fly) and then check normalized keys ('id','name','plateid','platename') against that mapping when populating candidates (affecting the loops that reference root, element, candidates, and attrib.get).
173-178: Missing error handling for metadata file reads.

If a metadata file is malformed or unreadable, `pd.read_csv()` or `pd.read_excel()` will raise an exception, aborting ingestion. This is inconsistent with the graceful handling in `_extract_xml_plate_id()`, which catches parse errors. Consider wrapping reads in try/except to skip problematic files and allow ingestion to continue.

♻️ Suggested fix

```diff
 for raw_path in metadata_paths:
     path = Path(raw_path)
+    try:
-    if path.suffix.lower() == ".csv":
-        frame = pd.read_csv(path)
-    else:
-        frame = pd.read_excel(path)
+        if path.suffix.lower() == ".csv":
+            frame = pd.read_csv(path)
+        else:
+            frame = pd.read_excel(path)
+    except (OSError, pd.errors.EmptyDataError, pd.errors.ParserError, ValueError):
+        continue
     frame.columns = [str(column).strip() for column in frame.columns]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/meerqat/ingestion.py` around lines 173 - 178, Wrap the per-file read inside the for raw_path in metadata_paths loop (the block calling pd.read_csv(path) / pd.read_excel(path)) in a try/except that catches exceptions from malformed/unreadable files, logs the filename and exception (using the project's logger or logging.exception), and skips that file (continue) so ingestion proceeds; mirror the error-handling pattern used in _extract_xml_plate_id to avoid leaving frame undefined and ensure downstream code only processes successfully-read frames.
137-145: Unreachable condition: `path.is_dir()` on line 142.

Line 137 already handles `path.is_dir()`, so the `or path.is_dir()` in line 142 is dead code and never true. This doesn't affect correctness but is confusing.

♻️ Suggested fix

```diff
 if path.is_dir():
     directories.append(resolved)
     relative = resolved.relative_to(dataset_root)
     bucket_key = (len(relative.parts), resolved.name.lower()[:4])
     similarity_buckets[bucket_key].append(resolved)
-elif path.is_file() or path.is_dir():
+elif path.is_file():
     extension = _inventory_extension(path)
     if extension is not None:
         file_extensions[extension] += 1
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/meerqat/ingestion.py` around lines 137 - 145, The elif condition contains a redundant check "or path.is_dir()" which is unreachable because the preceding if already handles path.is_dir(); update the conditional in the block that calls _inventory_extension(path) to only test path.is_file() so the code becomes elif path.is_file(): and keep the rest (calling _inventory_extension, incrementing file_extensions) unchanged; this affects the branch handling around variables path, directories, similarity_buckets, file_extensions and the call to _inventory_extension.
docs/src/on_the_lookout_with_meerqat.py (1)
71-71: Convert notebook-style bare expressions to explicit output or add Ruff per-file ignore for B018.

This `.py` companion file contains bare expressions at lines 71, 77, 91, 136, and 179 that trigger Ruff's B018 rule (useless expression). These expressions are notebook-style outputs intended to display results, but since the file is linted as regular Python, they need either explicit `print()` calls or a per-file Ruff ignore:

```python
# Option 1: Explicit output
print(report.summary.status)
print(report.summary.to_dict())
print(broken_report.summary.status)
print(validation_config)
print(generated_files)
```

Or add to `ruff.toml`:

```toml
[lint.per-file-ignores]
"docs/src/on_the_lookout_with_meerqat.py" = ["B018"]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/src/on_the_lookout_with_meerqat.py` at line 71, The file contains notebook-style bare expressions (report.summary.status, report.summary.to_dict(), broken_report.summary.status, validation_config, generated_files) that trigger Ruff B018; fix by converting each bare expression into explicit output (e.g., wrap with print(...) calls for the five identified symbols) or alternatively add a per-file Ruff ignore for B018 in ruff.toml under lint.per-file-ignores for this file; update either the expressions to use print or add the per-file-ignore entry for "B018".
spec.md (2)
279-279: Redundant heading: "CLI Interface".

"CLI" already stands for "Command Line Interface," so "CLI Interface" is redundant. As per static analysis hint, consider simplifying the heading.
♻️ Suggested alternatives

```diff
-## CLI Interface
+## CLI Commands
```

or

```diff
-## CLI Interface
+## Command Line Interface
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spec.md` at line 279, The heading "CLI Interface" is redundant because "CLI" already means "Command Line Interface"; update the heading in the document (the line containing "CLI Interface") to a non-redundant form such as "CLI" or "Command Line Interface" so the section title is concise and clear.
75-75: Prescriptive LLM assumption in concept definition.

The sentence "LLM should convert if possible" places an implementation directive inside a core concept definition. This belongs in the LLM Integration or Functional Requirements section if automatic format conversion is a planned feature. The concept definition should focus on what a Metadata Record is, not on conversion tooling.
♻️ Suggested rewording

```diff
-A row from an external metadata file (CSV/XLSX). The analysts would prefer to already be in CSV format over XLSX. LLM should convert if possible.
+A row from an external metadata file (CSV/XLSX). CSV format is preferred over XLSX for downstream processing.
```

If automatic conversion is a planned feature, document it separately in the LLM Integration section (lines 299-317).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spec.md` at line 75, The concept definition "A row from an external metadata file (CSV/XLSX). LLM should convert if possible" mixes an implementation directive with the core definition; remove the clause "LLM should convert if possible" from the Metadata Record definition and rewrite the sentence to simply define what a Metadata Record is (e.g., "A row from an external metadata file (CSV or XLSX) representing ..."). If automatic format conversion is a planned feature, move that requirement into the LLM Integration or Functional Requirements section (the LLM Integration section referenced in the comment) and describe the conversion behavior and expectations there.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/src/on_the_lookout_with_meerqat.py`:
- Around line 129-136: The snippet creates a ValidationConfig but never uses it;
call validate_dataset with that config and capture the result. Replace the last
two lines so you instantiate ValidationConfig(...) as needed and then call
validate_dataset(dataset_path, config=validation_config) assigning the return to
a variable like report (e.g., report = validate_dataset(dataset_path,
config=validation_config)); reference ValidationConfig, validate_dataset, config
and report so readers see the full usage.
In `@spec.md`:
- Line 358: The spec currently lists "Single-dataset and batch validation
through both API and CLI" and documents the `batch-validate` command as
implemented, but also lists "Batch validation workflows" under the Future Work
section; remove the duplicate future-work entry or replace it with a clarified
bullet that specifies any additional planned batch-validation features (e.g.,
scaling, orchestration, or UX enhancements) so the document consistently
reflects that basic batch validation is implemented; search for the phrases
"Single-dataset and batch validation through both API and CLI",
"`batch-validate`", and "Batch validation workflows" to locate and update the
entries.
- Around line 225-259: Update the Domain Model entries to match the
implementation: make Dataset.dataset_id non-nullable (str), change collection
types from list[...] to tuple[...] for Dataset.plates, Dataset.metadata_records,
Plate.image_files, Plate.zero_byte_images, and Plate.image_modalities and
reflect that they have immutable defaults, and add default = None for
ValidationIssue.plate_id, ValidationIssue.path, and ValidationIssue.remediation;
ensure the class names Dataset, Plate and ValidationIssue and the field names
dataset_id, plates, metadata_records, image_files, zero_byte_images,
image_modalities, plate_id, path, and remediation are used to locate and update
the definitions.
---
Nitpick comments:
In `@docs/src/on_the_lookout_with_meerqat.py`:
- Line 71: The file contains notebook-style bare expressions
(report.summary.status, report.summary.to_dict(), broken_report.summary.status,
validation_config, generated_files) that trigger Ruff B018; fix by converting
each bare expression into explicit output (e.g., wrap with print(...) calls for
the five identified symbols) or alternatively add a per-file Ruff ignore for
B018 in ruff.toml under lint.per-file-ignores for this file; update either the
expressions to use print or add the per-file-ignore entry for "B018".
In `@spec.md`:
- Line 279: The heading "CLI Interface" is redundant because "CLI" already means
"Command Line Interface"; update the heading in the document (the line
containing "CLI Interface") to a non-redundant form such as "CLI" or "Command
Line Interface" so the section title is concise and clear.
- Line 75: The concept definition "A row from an external metadata file
(CSV/XLSX). LLM should convert if possible" mixes an implementation directive
with the core definition; remove the clause "LLM should convert if possible"
from the Metadata Record definition and rewrite the sentence to simply define
what a Metadata Record is (e.g., "A row from an external metadata file (CSV or
XLSX) representing ..."). If automatic format conversion is a planned feature,
move that requirement into the LLM Integration or Functional Requirements
section (the LLM Integration section referenced in the comment) and describe the
conversion behavior and expectations there.
In `@src/meerqat/cli.py`:
- Around line 100-105: The batch exit-code logic in the block that computes
statuses = [entry.summary.status for entry in report.reports] should follow the
same semantics as _exit_code (unknown statuses -> 3) rather than defaulting to
0; update the conditional branch so that it returns 2 for "fail", 1 for "warn",
and 3 for any other/unknown status, ensuring the behavior is consistent with the
_exit_code mapping used elsewhere.
In `@src/meerqat/ingestion.py`:
- Around line 84-96: The nested element attribute lookup is case-sensitive
(using element.attrib.get with only lowercase keys) while root attribute checks
use mixed-case keys, so attributes like Name="Plate1" on nested elements are
missed; fix by normalizing attribute names before lookup (e.g., build a
lowercase-keyed mapping from element.attrib and root.attrib or lowercase keys on
the fly) and then check normalized keys ('id','name','plateid','platename')
against that mapping when populating candidates (affecting the loops that
reference root, element, candidates, and attrib.get).
- Around line 173-178: Wrap the per-file read inside the for raw_path in
metadata_paths loop (the block calling pd.read_csv(path) / pd.read_excel(path))
in a try/except that catches exceptions from malformed/unreadable files, logs
the filename and exception (using the project's logger or logging.exception),
and skips that file (continue) so ingestion proceeds; mirror the error-handling
pattern used in _extract_xml_plate_id to avoid leaving frame undefined and
ensure downstream code only processes successfully-read frames.
- Around line 137-145: The elif condition contains a redundant check "or
path.is_dir()" which is unreachable because the preceding if already handles
path.is_dir(); update the conditional in the block that calls
_inventory_extension(path) to only test path.is_file() so the code becomes elif
path.is_file(): and keep the rest (calling _inventory_extension, incrementing
file_extensions) unchanged; this affects the branch handling around variables
path, directories, similarity_buckets, file_extensions and the call to
_inventory_extension.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e7052201-cf20-492b-8297-955615831e5c
📒 Files selected for processing (6)
- docs/src/on_the_lookout_with_meerqat.ipynb
- docs/src/on_the_lookout_with_meerqat.py
- spec.md
- src/meerqat/cli.py
- src/meerqat/ingestion.py
- src/meerqat/llm.py
Actionable comments posted: 2
🧹 Nitpick comments (4)
src/meerqat/main.py (1)
100-106: Consider simplifying LLMReview construction.

If `generate_llm_review` already returns an object with the same interface as `LLMReview`, explicitly copying fields into a new `LLMReview` instance may be redundant. However, this explicit construction is also defensive against interface changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/meerqat/main.py` around lines 100 - 106, The code redundantly copies fields from the object returned by generate_llm_review into a new LLMReview instance; replace that manual field copying by using the returned object directly (e.g., assign llm_review_obj = llm_review) if generate_llm_review already returns an LLMReview-compatible object, or else wrap/convert defensively by using a single canonical conversion method (constructor/factory like LLMReview.parse_obj or a from_... classmethod) so you don't repeat attributes; ensure you check the type/interface of generate_llm_review's return value before removing the copy.
src/meerqat/ingestion.py (1)
156-165: Consider performance for large directory trees.

The O(n²) similarity comparison within each bucket could become slow for datasets with many similarly-prefixed directories at the same depth. The bucketing strategy (4-char prefix + depth) provides reasonable mitigation, but consider adding a bucket size cap or early exit for very large buckets if this becomes a bottleneck in practice.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/meerqat/ingestion.py` around lines 156 - 165, The nested O(n²) pairwise comparison in the loop over similarity_buckets (using sorted_bucket, SequenceMatcher, and appending to similar_pairs) can be slow for very large buckets; add a guard such as a configurable bucket size cap (e.g., MAX_BUCKET_COMPARE) and skip or sample comparisons when len(sorted_bucket) exceeds it, or implement an early-exit strategy (limit comparisons per entry or stop after finding a set number of similar matches per bucket) while still honoring SIMILAR_DIRECTORY_THRESHOLD; update the loop that iterates sorted_bucket to check the cap and either truncate/sampled_rights or break early to avoid quadratic blowup.
architecture.md (1)
8-8: Minor: Use hyphen for compound adjective "built-in".

When used as an adjective before a noun, "built-in" should be hyphenated.

```diff
-1. LLM-assisted pattern inference (built in)
+1. LLM-assisted pattern inference (built-in)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@architecture.md` at line 8, Update the phrase "LLM-assisted pattern inference (built in)" to hyphenate the compound adjective by changing "(built in)" to "(built-in)"; locate the text around "LLM-assisted pattern inference" and replace the unhyphenated "built in" with "built-in".
docs/src/on_the_lookout_with_meerqat.py (1)
114-122: Consider assigning the dictionary for consistency.

This dictionary expression works in Jupyter (displays as cell output) but does nothing when run as a script. Consider assigning it to a variable and printing for consistency with the script execution path.

```diff
-{
+filetree_details = {
     "status": filetree_report.summary.status,
     "filetree_summary": filetree_report.dataset.filetree_summary.to_dict(),
     "filetree_issues": [
         issue.to_dict()
         for issue in filetree_report.issues
         if issue.code.startswith("dataset.")
     ],
 }
+print(filetree_details)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/src/on_the_lookout_with_meerqat.py` around lines 114 - 122, Assign the constructed dict to a variable (e.g., summary_dict) instead of leaving it as a bare expression, then print or return it so the output appears when run as a script; update the block that builds the dict using filetree_report, filetree_report.dataset.filetree_summary.to_dict(), and the filetree_issues list comprehension (filtering issue.code.startswith("dataset.")) to store the result and then print(summary_dict) or otherwise emit it.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/src/on_the_lookout_with_meerqat.py`:
- Around line 180-181: Remove the redundant standalone expression
"generated_files" that merely evaluates the variable without using it; keep the
subsequent print(generated_files) call. Locate the occurrence of the bare
"generated_files" expression (adjacent to print(generated_files)) and delete
that line so only the print(generated_files) remains.
In `@src/meerqat/main.py`:
- Around line 149-158: The ready() function builds active_config but fails to
apply CLI overrides before selecting the LLM; update ready() to use
with_cli_overrides(active_config) (or apply with_cli_overrides to active_config)
and then derive active_llm from that overridden config instead of directly using
active_llm = llm_config or active_config.llm, ensuring consistency with
validate_dataset and batch_validate and then call run_ready_checks on the final
active_llm.
---
Nitpick comments:
In `@architecture.md`:
- Line 8: Update the phrase "LLM-assisted pattern inference (built in)" to
hyphenate the compound adjective by changing "(built in)" to "(built-in)";
locate the text around "LLM-assisted pattern inference" and replace the
unhyphenated "built in" with "built-in".
In `@docs/src/on_the_lookout_with_meerqat.py`:
- Around line 114-122: Assign the constructed dict to a variable (e.g.,
summary_dict) instead of leaving it as a bare expression, then print or return
it so the output appears when run as a script; update the block that builds the
dict using filetree_report, filetree_report.dataset.filetree_summary.to_dict(),
and the filetree_issues list comprehension (filtering
issue.code.startswith("dataset.")) to store the result and then
print(summary_dict) or otherwise emit it.
In `@src/meerqat/ingestion.py`:
- Around line 156-165: The nested O(n²) pairwise comparison in the loop over
similarity_buckets (using sorted_bucket, SequenceMatcher, and appending to
similar_pairs) can be slow for very large buckets; add a guard such as a
configurable bucket size cap (e.g., MAX_BUCKET_COMPARE) and skip or sample
comparisons when len(sorted_bucket) exceeds it, or implement an early-exit
strategy (limit comparisons per entry or stop after finding a set number of
similar matches per bucket) while still honoring SIMILAR_DIRECTORY_THRESHOLD;
update the loop that iterates sorted_bucket to check the cap and either
truncate/sampled_rights or break early to avoid quadratic blowup.
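The capped pairwise scan can be sketched as below; `MAX_BUCKET_COMPARE`, the helper name, and the threshold value are illustrative placeholders:

```python
from difflib import SequenceMatcher
from itertools import combinations

SIMILAR_DIRECTORY_THRESHOLD = 0.85
MAX_BUCKET_COMPARE = 50  # illustrative cap; tune for real datasets


def find_similar_pairs(bucket: list[str]) -> list[tuple[str, str]]:
    # Guard the O(n^2) pairwise scan: oversized buckets are truncated
    # (sorting first keeps the truncation deterministic).
    sorted_bucket = sorted(bucket)[:MAX_BUCKET_COMPARE]
    similar_pairs = []
    for left, right in combinations(sorted_bucket, 2):
        ratio = SequenceMatcher(None, left, right).ratio()
        if ratio >= SIMILAR_DIRECTORY_THRESHOLD:
            similar_pairs.append((left, right))
    return similar_pairs
```

Truncation trades completeness for bounded runtime; sampling or an early-exit once enough matches are found would be the alternatives mentioned in the comment.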
In `@src/meerqat/main.py`:
- Around line 100-106: The code redundantly copies fields from the object
returned by generate_llm_review into a new LLMReview instance; replace that
manual field copying by using the returned object directly (e.g., assign
llm_review_obj = llm_review) if generate_llm_review already returns an
LLMReview-compatible object, or else wrap/convert defensively by using a single
canonical conversion method (constructor/factory like LLMReview.parse_obj or a
from_... classmethod) so you don't repeat attributes; ensure you check the
type/interface of generate_llm_review's return value before removing the copy.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b96c6b51-cc45-4d7c-a913-ebb2babeb328
📒 Files selected for processing (10)
- README.md
- architecture.md
- docs/src/cli.md
- docs/src/on_the_lookout_with_meerqat.py
- spec.md
- src/meerqat/cli.py
- src/meerqat/ingestion.py
- src/meerqat/main.py
- tests/test_cli.py
- tests/test_units.py
✅ Files skipped from review due to trivial changes (1)
- docs/src/cli.md
Summary by CodeRabbit