[CELEBORN-2307] Support accurate disk usage accounting to HARD_SPLIT accurately. #3644

Open
saurabhd336 wants to merge 15 commits into apache:main from saurabhd336:realtimeUsageUpdateOss

Contributor

saurabhd336 commented Apr 2, 2026

What changes were proposed in this pull request?

Celeborn often detects disk-full conditions too late, simply because DiskInfo's usableSpace is updated asynchronously in the worker heartbeat flow.
In such cases, if heartbeats are missed and/or multiple very large writers push too much data to memory buffers (bypassing the disk-full-based HARD_SPLIT checks), the result can be severe degradation.

In some cases we've seen the configured disk usage limit breached easily, causing job degradation and cleanup failures (because RocksDB shares the disk with shuffle data), which makes the situation even worse.

This change proposes a more real-time, coordinated acquisition of disk space during flush, making disk-full detection foolproof and preventing any spillage beyond the configured limits.
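
To illustrate the idea of coordinated acquisition, here is a minimal sketch, not the code in this PR. It assumes a per-disk counter that each flush must reserve space from before writing; `DiskReservation`, `tryReserve`, and `release` are hypothetical names chosen for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper (not Celeborn's actual API): tracks bytes reserved on one disk.
final class DiskReservation {
    private final long limitBytes; // configured usage limit for this disk
    private final AtomicLong reservedBytes = new AtomicLong();

    DiskReservation(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    /** Atomically reserves the given bytes; returns false if the limit would be exceeded. */
    boolean tryReserve(long bytes) {
        while (true) {
            long current = reservedBytes.get();
            long next = current + bytes;
            if (next > limitBytes) {
                // In this sketch the caller would answer HARD_SPLIT instead of flushing.
                return false;
            }
            if (reservedBytes.compareAndSet(current, next)) {
                return true;
            }
            // Another writer changed the counter concurrently; retry with the fresh value.
        }
    }

    /** Returns space to the pool, for example after a shuffle file is deleted. */
    void release(long bytes) {
        reservedBytes.addAndGet(-bytes);
    }

    long reserved() {
        return reservedBytes.get();
    }
}

final class StrictReserveDemo {
    public static void main(String[] args) {
        DiskReservation disk = new DiskReservation(100);
        System.out.println(disk.tryReserve(60)); // true
        System.out.println(disk.tryReserve(60)); // false, 120 would exceed 100
        disk.release(60);
        System.out.println(disk.tryReserve(60)); // true
    }
}
```

Because the check and the update happen in one compare-and-set, two concurrent writers cannot both pass the limit check against the same stale value, which is the failure mode of a heartbeat-refreshed snapshot.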

Additionally, the extra disk space used while sorting partition files is currently not accounted for at all. This PR also changes the logic to account for disk space used and reclaimed during file sorting.
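
A rough sketch of what sort-time accounting could look like, under the assumption that the sorted copy is charged before it is written and the original file's bytes are reclaimed once it is deleted. `SortSpaceAccounting`, `beginSort`, and `finishSort` are hypothetical names, not the PR's implementation.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch (not Celeborn's sorter): accounts for the temporary
// extra space a sort needs while both the original and sorted files exist.
final class SortSpaceAccounting {
    private final long limitBytes;
    private final AtomicLong usedBytes;

    SortSpaceAccounting(long limitBytes, long initiallyUsedBytes) {
        this.limitBytes = limitBytes;
        this.usedBytes = new AtomicLong(initiallyUsedBytes);
    }

    /** Charges the sorted copy up front; returns false if it would not fit. */
    boolean beginSort(long sortedFileBytes) {
        long after = usedBytes.addAndGet(sortedFileBytes);
        if (after > limitBytes) {
            usedBytes.addAndGet(-sortedFileBytes); // roll back the charge
            return false;
        }
        return true;
    }

    /** Reclaims the original file's bytes, assuming it is deleted after the sort. */
    void finishSort(long originalFileBytes) {
        usedBytes.addAndGet(-originalFileBytes);
    }

    long used() {
        return usedBytes.get();
    }
}
```

The add-then-roll-back pattern keeps the code short; its trade-off is that a rejected sort briefly inflates the counter, so a concurrent request may be refused conservatively.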

Everything is behind a new config, celeborn.worker.disk.storage.strictReserve.enabled, which currently defaults to false.
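
As an illustration only, opting in might look like the fragment below. The key name comes from this PR; placing it in the worker's `celeborn-defaults.conf` is an assumption based on how other Celeborn worker settings are usually configured, and the key only exists if this PR is merged.

```
# conf/celeborn-defaults.conf on each worker (assumed location)
celeborn.worker.disk.storage.strictReserve.enabled true
```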

Why are the changes needed?

Disk-full detection is not foolproof.

Does this PR resolve a correctness bug?

No

Does this PR introduce any user-facing change?

No

How was this patch tested?

UTs added

zaynt4606 changed the title from "Realtime usage update oss" to "[WIP] Realtime usage update oss" on Apr 7, 2026
@zaynt4606
Contributor

Please create a JIRA issue and tag this PR with it, @saurabhd336.
https://issues.apache.org/jira/projects/CELEBORN/issues

saurabhd336 changed the title from "[WIP] Realtime usage update oss" to "[CELEBORN-2307] Support accurate disk usage accounting to HARD_SPLIT accurately." on Apr 13, 2026
@saurabhd336
Contributor Author

@zaynt4606 I've created and attached the JIRA.

Could you please help review or assign reviewers?

Just for context, we have seen issues where the async nature of the disk usage update can delay HARD_SPLITs to the point where we completely run out of disk space. For our setup, this can lead to rather serious degradation. This config-based feature will help us be more accurate with our disk usage accounting.

cc: @s0nskar

saurabhd336 force-pushed the realtimeUsageUpdateOss branch from f957fdc to 2f0bc53 on April 13, 2026 09:57