[NONEVM-2809] [TXM] Implement exponential backoff for retries#728
[NONEVM-2809] [TXM] Implement exponential backoff for retries#728patricios-space wants to merge 8 commits intomainfrom
Conversation
d34c176 to
3256750
Compare
There was a problem hiding this comment.
Pull request overview
Implements exponential backoff (with jitter) for Transaction Manager broadcast retries to reduce retry storms and make retry behavior configurable.
Changes:
- Switch retry logic from a fixed delay to exponential backoff and add jittered delays during retry waits.
- Rename retry delay config to
SendRetryBaseDelayand introduceSendRetryJitterPctwith updated defaults. - Update config-related tests to use the new base-delay field name.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| pkg/txm/txm.go | Applies exponential backoff + jitter to broadcast retry wait logic. |
| pkg/txm/config.go | Renames retry delay config, adds jitter config, updates defaults/validation scaffolding. |
| pkg/txm/config_test.go | Adjusts defaulting tests to the renamed retry delay field. |
| pkg/config/toml_test.go | Updates TOML decoding assertions for the renamed retry delay field. |
Comments suppressed due to low confidence (2)
pkg/txm/config.go:63
SendRetryJitterPcthas a default value inDefaultConfigSet, butApplyDefaults()never applies it. As a result, configs that omit the field will run with0jitter (contradicting the default/comment and reducing herd-avoidance). Add defaulting logic forSendRetryJitterPct(or make it a pointer if0is a meaningful explicit value).
if c.SendRetryBaseDelay == nil {
c.SendRetryBaseDelay = DefaultConfigSet.SendRetryBaseDelay
}
if c.MaxSendRetryAttempts == 0 {
c.MaxSendRetryAttempts = DefaultConfigSet.MaxSendRetryAttempts
}
pkg/txm/config.go:93
SendRetryJitterPctis used to compute a sleep duration, but it's never validated. Values < 0 or very large values can yield negative/overflowed durations (causing immediate retries or extremely long sleeps). Add validation that it is within a sane range (e.g., 0 <= pct && pct <= 1) and consider validatingSendRetryBaseDelayis > 0 as well.
if c.SendRetryBaseDelay == nil {
err = errors.Join(err, config.ErrMissing{Name: "SendRetryDelay", Msg: "must be set"})
}
if c.MaxSendRetryAttempts == 0 {
err = errors.Join(err, config.ErrInvalid{Name: "MaxSendRetryAttempts", Msg: "must be greater than 0"})
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
7db7fcb to
a7373ad
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
Comments suppressed due to low confidence (1)
pkg/txm/config_test.go:23
- Config.ApplyDefaults now has defaulting logic for SendRetryJitterPct, but this test suite doesn't assert the default value is applied (or that a custom value is preserved). Adding assertions for SendRetryJitterPct here would help prevent regressions in the retry timing behavior.
assert.Equal(t, DefaultConfigSet.BroadcastChanSize, cfg.BroadcastChanSize)
assert.Equal(t, DefaultConfigSet.ConfirmPollInterval, cfg.ConfirmPollInterval)
assert.Equal(t, DefaultConfigSet.SendRetryBaseDelay, cfg.SendRetryBaseDelay)
assert.Equal(t, DefaultConfigSet.MaxSendRetryAttempts, cfg.MaxSendRetryAttempts)
assert.Equal(t, DefaultConfigSet.TxExpiration, cfg.TxExpiration)
assert.Equal(t, DefaultConfigSet.CleanupInterval, cfg.CleanupInterval)
})
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| [TransactionManager] | ||
| BroadcastChanSize = 101 | ||
| ConfirmPollInterval = '9s' | ||
| SendRetryDelay = '8s' | ||
| SendRetryBaseDelay = '8s' | ||
| MaxSendRetryAttempts = 7 | ||
| SendRetryJitterPct = 0.2 |
There was a problem hiding this comment.
Switching the TOML key from SendRetryDelay to SendRetryBaseDelay is a breaking config change because the TOML decoder uses DisallowUnknownFields (old configs will now fail to start with an unknown field error). If backward compatibility is required, consider supporting both keys temporarily (e.g., keep the old field for decoding and map it to the new behavior, with deprecation messaging).
There was a problem hiding this comment.
@jadepark-dev do we need backwards compatibility here?
There was a problem hiding this comment.
@patricios-space Depends, in some cases we should update staging, prod-testnet, and the NOPs’ configurations before shipping.
- Adding a new field with a correct default is backward compatible. If needed, we can fine-tune the value later by updating the configs.
- Renaming an existing config field can also have a default. However, it's a startup-breaking config change for any deployed TOML that still contains
SendRetryDelay.NewDecodedTOMLConfigusesDisallowUnknownFields, so those configs fail before defaults can help
There was a problem hiding this comment.
Then I'll keep the old name. Is not that big of a change
a7373ad to
1cd4b20
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
NONEVM-2809