Add commit retries to insert table writes #250
Conversation
Parallel writes to the same Iceberg table can fail with CommitFailedException when the branch changes during active writes. This adds the same retry-with-refresh logic that upsert already has to both insert() and delete(), resolving concurrent write conflicts automatically.
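The retry-with-refresh pattern described above can be sketched as follows. Names here are illustrative rather than the merged implementation (which lives in `Table.insert()`/`Table.delete()` in `src/tower/_tables.py`), and `CommitFailedException` is defined locally as a stand-in for pyiceberg's exception of the same name so the snippet is self-contained:

```python
import time


class CommitFailedException(Exception):
    """Stand-in for pyiceberg.exceptions.CommitFailedException."""


def insert_with_retry(table, data, max_retries=5, retry_delay_seconds=0.5):
    """Append `data`, refreshing metadata and retrying on commit conflicts."""
    last_exception = None
    for _attempt in range(max_retries + 1):
        try:
            table.append(data)
            return table
        except CommitFailedException as exc:
            # Another writer advanced the branch: reload the latest table
            # metadata, wait briefly, then retry the append.
            last_exception = exc
            table.refresh()
            time.sleep(retry_delay_seconds)
    raise last_exception
```

The key detail is the `refresh()` before each retry: without it, the writer keeps committing against stale branch metadata and every attempt fails the same way.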
Important: Review skipped. Auto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configuration.
📝 Walkthrough

The PR introduces retry mechanisms with configurable parameters to table insert and delete operations, normalizes session URLs based on host locality for scheme switching, updates Python dependency constraints for Iceberg-related packages, and adds concurrent write tests to verify retry behavior.
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
✅ Passed checks (2 passed)
Pull request overview
Adds commit-conflict retry behavior to Iceberg table writes to reduce CommitFailedException failures under concurrent writers, aligning insert() and delete() with the existing upsert() approach.
Changes:
- Add `CommitFailedException` retry loops with metadata refresh to `Table.insert()` and `Table.delete()`.
- Expose `max_retries`/`retry_delay_seconds` parameters on both methods (defaults: 5 / 0.5s).
- Add concurrent-write tests for `insert()` and `delete()` retry behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/tower/_tables.py | Implements refresh+retry loops for insert() and delete() on commit conflicts. |
| tests/tower/test_tables.py | Adds concurrent insert/delete tests intended to validate the new retry behavior. |
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/tower/_tables.py (1)
Lines 156-213: ⚠️ Potential issue | 🟠 Major

Validate retry arguments before entering these loops.

With `max_retries < 0`, Line 199 / Line 353 never executes the body and Line 213 / Line 374 ends up doing `raise None`, which surfaces as a `TypeError` instead of a clear API error. A negative `retry_delay_seconds` also leaks a `ValueError` from `sleep()` only after the first conflict. Reject both values up front; ideally do it once in a shared retry helper so `insert()`, `delete()`, and `upsert()` stay consistent.

Proposed fix
```diff
+    @staticmethod
+    def _validate_retry_args(max_retries: int, retry_delay_seconds: float) -> None:
+        if max_retries < 0:
+            raise ValueError("max_retries must be >= 0")
+        if retry_delay_seconds < 0:
+            raise ValueError("retry_delay_seconds must be >= 0")
+
     def insert(
         self,
         data: pa.Table,
         max_retries: int = 5,
         retry_delay_seconds: float = 0.5,
     ) -> TTable:
+        self._validate_retry_args(max_retries, retry_delay_seconds)
         last_exception = None
         ...

     def delete(
         self,
         filters: Union[str, List[pc.Expression]],
         max_retries: int = 5,
         retry_delay_seconds: float = 0.5,
     ) -> TTable:
+        self._validate_retry_args(max_retries, retry_delay_seconds)
         ...
```

Also applies to: 297-374
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/tower/_tables.py` around lines 156 - 213, Validate retry arguments at the start of the retrying methods (insert, delete, upsert) before entering their retry loops: check that max_retries is >= 0 and retry_delay_seconds is >= 0 and raise a clear ValueError (or a custom API error) if not. Update the insert method (and the corresponding delete/upsert implementations) to perform this validation at the top (before the for attempt in range(...) loop) so you never end up raising None or letting time.sleep receive a negative delay; optionally factor this validation into a shared retry helper used by insert/delete/upsert to keep behavior consistent.
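The proposed validation can be exercised on its own. This standalone sketch mirrors the reviewer's suggested helper (the name and its placement as a `Table` staticmethod are the review's suggestion, not merged code):

```python
def validate_retry_args(max_retries: int, retry_delay_seconds: float) -> None:
    """Reject retry parameters that would otherwise fail in confusing ways."""
    if max_retries < 0:
        # Otherwise an empty retry loop falls through to `raise None`,
        # which surfaces as a TypeError rather than a clear API error.
        raise ValueError("max_retries must be >= 0")
    if retry_delay_seconds < 0:
        # Otherwise time.sleep() raises ValueError, but only after the
        # first commit conflict, deep inside the retry loop.
        raise ValueError("retry_delay_seconds must be >= 0")
```

Called at the top of `insert()`, `delete()`, and `upsert()`, this turns both misuse cases into an immediate, explicit `ValueError` instead of a failure that only appears after the first conflict.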
🧹 Nitpick comments (1)
crates/tower-cmd/src/session.rs (1)
Lines 135-140: Centralize tower URL normalization to keep session persistence consistent.

This logic currently lives only in `finalize_session`, while `crates/config/src/session.rs` (snippet Lines 303-318) sets `session.tower_url` directly. Consider extracting a shared normalizer and reusing it in both paths so persisted `session.tower_url` behavior doesn't diverge.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/tower-cmd/src/session.rs` around lines 135 - 140, The URL normalization logic that adjusts http->https for non-local hosts should be centralized into a reusable helper (e.g., a function named normalize_tower_url or normalize_and_set_tower_url that accepts/returns a url or &mut Url); move the current code from finalize_session into that helper and replace the inline block in finalize_session (which mutates session.tower_url) and the direct assignment in crates/config/src/session.rs (where session.tower_url is currently set) to call this helper so both code paths use the same normalization: detect local hosts ("localhost", "127.0.0.1", "::1") and only change scheme from "http" to "https" when not local.
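For illustration, the host-locality rule this nitpick describes looks roughly like the Python sketch below. The actual code under review is Rust in `crates/tower-cmd/src/session.rs`; the helper name `normalize_tower_url` and the exact local-host set are taken from the review text, not from merged code:

```python
from urllib.parse import urlsplit, urlunsplit

# Hosts that should keep plain http (per the review: localhost, 127.0.0.1, ::1).
LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


def normalize_tower_url(url: str) -> str:
    """Upgrade http to https unless the host is local."""
    parts = urlsplit(url)
    if parts.scheme == "http" and (parts.hostname or "") not in LOCAL_HOSTS:
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)
```

Having one helper like this, called from both `finalize_session` and the config path, is what keeps the persisted `session.tower_url` from diverging between the two code paths.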
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/tower/test_tables.py`:
- Around line 324-381: Make the retry-path deterministic by forcing the first
commit to fail and verifying refresh was invoked before retry: in
test_insert_concurrent_writes_with_retry wrap or monkeypatch the underlying
commit/append method used by tower.tables(...)._table (e.g., the internal
commit/append method your implementation calls) so it raises
CommitFailedException on the first call and succeeds thereafter, incrementing
retry_count inside your tracked_refresh; ensure insert_ticker installs that
failing stub before calling t.insert and then restores the real method after the
simulated failure so the retry can succeed, and assert that t._table.refresh
(the tracked_refresh) was called before the retry completes; apply the same
pattern to the other test referenced (lines 384-441) to eliminate flakiness.
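The determinism fix this comment asks for can be shown in miniature: a stub table that fails its first commit makes the retry path run on every test execution instead of depending on a real race. Class and method names here are hypothetical stand-ins for the fixtures in `tests/tower/test_tables.py`:

```python
class CommitFailedException(Exception):
    """Stand-in for pyiceberg's commit-conflict exception."""


class FlakyTable:
    """Fails the first append with a simulated branch conflict, then succeeds."""

    def __init__(self):
        self.append_calls = 0
        self.refresh_calls = 0

    def append(self, data):
        self.append_calls += 1
        if self.append_calls == 1:
            raise CommitFailedException("branch main has changed")

    def refresh(self):
        self.refresh_calls += 1


def test_insert_retries_after_forced_conflict():
    table = FlakyTable()
    # Minimal retry loop standing in for Table.insert()'s internals.
    for _ in range(6):
        try:
            table.append("rows")
            break
        except CommitFailedException:
            table.refresh()
    assert table.append_calls == 2
    # The refresh must have happened before the successful retry.
    assert table.refresh_calls == 1
```

Because the first failure is forced, the assertions on retry and refresh counts hold on every run, which is exactly the flakiness fix the comment requests.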
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b8095b3b-c087-4f5e-a86e-4dc9b8b082bb
⛔ Files ignored due to path filters (1)
- `uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (4)
- `crates/tower-cmd/src/session.rs`
- `pyproject.toml`
- `src/tower/_tables.py`
- `tests/tower/test_tables.py`
socksy left a comment:
Please make the test deterministic before merging though
…istic: Add input validation for `max_retries` and `retry_delay_seconds` across `insert()`, `upsert()`, and `delete()` to prevent confusing errors from negative values. Make concurrent insert/delete tests deterministic by forcing `CommitFailedException` on the first attempt rather than relying on real concurrency conflicts.
- Add `CommitFailedException` retry logic with metadata refresh to `insert()` and `delete()` in `Table`, matching the existing pattern in `upsert()`. This resolves failures when parallel apps write to the same Iceberg table concurrently.
- Add `max_retries` and `retry_delay_seconds` parameters to both methods (defaulting to 5 retries / 0.5s delay, same as `upsert()`).
- Add concurrent-write tests for `insert()` and `delete()` to verify retry behavior under contention.

Context
A user reported `CommitFailedException: branch main has changed` errors when multiple parallel apps insert into the same Iceberg table. The retry-with-refresh pattern already existed on `upsert()` but was missing from `insert()` and `delete()`.

Summary by CodeRabbit
Release Notes

New Features
- Table insert and delete operations now accept `max_retries` and `retry_delay_seconds` parameters for improved resilience during concurrent writes.

Chores