[CELEBORN-1577][BUG] Quota cancel shuffle should use app shuffle id#3662

Open
s0nskar wants to merge 5 commits into apache:main from s0nskar:fix_quota_shuffle_id

@s0nskar
Contributor

@s0nskar s0nskar commented Apr 13, 2026

What changes were proposed in this pull request?

  • Added a new mapping for celebornShuffleId -> appShuffleId
  • cancelAllActiveStages should pass appShuffleId, not celebornShuffleId

Why are the changes needed?

shuffleAllocatedWorkers contains celebornShuffleId values, but cancellation needs appShuffleId because DAGScheduler only understands app shuffle IDs.

Does this PR resolve a correctness bug?

No

Does this PR introduce any user-facing change?

No

How was this patch tested?

NA

@s0nskar
Contributor Author

s0nskar commented Apr 13, 2026

cc: @leixm PTAL

Copilot AI left a comment

Pull request overview

This PR updates quota-triggered shuffle/stage cancellation to use Spark’s app shuffle ID (the one understood by DAGScheduler) by tracking a mapping from Celeborn-generated shuffle IDs to app shuffle IDs.

Changes:

  • Introduced a celebornShuffleId -> appShuffleId mapping in LifecycleManager.
  • Populated the mapping when generating new Celeborn shuffle IDs.
  • Updated cancelAllActiveStages to translate active Celeborn shuffle IDs to app shuffle IDs before invoking the Spark cancel callback.
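
The three changes above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `ShuffleIdTracker`, `newCelebornShuffleId`, the `cancelStages` callback, and the `Integer`-typed map are hypothetical stand-ins for the corresponding pieces of LifecycleManager, whose real signatures are not shown in this conversation.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for the relevant part of LifecycleManager.
// cancelStages models the Spark cancel callback, which expects app shuffle ids.
class ShuffleIdTracker(cancelStages: Set[Int] => Unit) {
  private val shuffleIdGenerator = new AtomicInteger(0)
  // celebornShuffleId -> appShuffleId (boxed types are an assumption of this sketch)
  private val celebornToApp = new ConcurrentHashMap[Integer, Integer]()

  // Allocates a new Celeborn shuffle id and records the app shuffle it belongs to.
  def newCelebornShuffleId(appShuffleId: Int): Int = {
    val celebornShuffleId = shuffleIdGenerator.getAndIncrement()
    celebornToApp.put(celebornShuffleId, appShuffleId)
    celebornShuffleId
  }

  // Translates active Celeborn shuffle ids to app shuffle ids before cancelling.
  // Ids without a recorded mapping are skipped; duplicates collapse in the Set.
  def cancelAllActiveStages(activeCelebornShuffleIds: Iterable[Int]): Unit = {
    val appShuffleIds = activeCelebornShuffleIds
      .flatMap(id => Option(celebornToApp.get(id)))
      .map(_.intValue)
      .toSet
    cancelStages(appShuffleIds)
  }
}
```

Several Celeborn shuffle ids can map to the same app shuffle id, so collecting into a `Set` keeps the callback from seeing the same app shuffle twice.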

Comment on lines +2053 to +2054
.filter(celebornShuffleIdToAppShuffleIdMap.contains(_))
.map(celebornShuffleIdToAppShuffleIdMap.get(_))

Copilot AI Apr 13, 2026

ConcurrentHashMap.contains(...) checks for a value (it is a legacy alias of containsValue), not a key. Here we need to filter by whether the Celeborn shuffleId exists as a key in celebornShuffleIdToAppShuffleIdMap; otherwise quota cancellation will frequently skip active shuffles. Use containsKey (or a single get + null/Option check) before mapping to the appShuffleId.

Suggested change
- .filter(celebornShuffleIdToAppShuffleIdMap.contains(_))
- .map(celebornShuffleIdToAppShuffleIdMap.get(_))
+ .flatMap(shuffleId => Option(celebornShuffleIdToAppShuffleIdMap.get(shuffleId)))
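
To make the difference concrete, here is a small self-contained sketch of the pitfall. The ids 7 and 3, the object name `ContainsPitfall`, and the `Integer`-typed map are illustrative assumptions, not taken from the PR.

```scala
import java.util.concurrent.ConcurrentHashMap

object ContainsPitfall {
  // Hypothetical data: Celeborn shuffle id 7 maps to app shuffle id 3.
  val celebornToApp = new ConcurrentHashMap[Integer, Integer]()
  celebornToApp.put(7, 3)

  // contains(...) compares its argument against the map's VALUES.
  val matchesValue: Boolean = celebornToApp.contains(7)
  // containsKey(...) compares its argument against the map's KEYS.
  val matchesKey: Boolean = celebornToApp.containsKey(7)

  // A single lookup wrapped in Option avoids the separate existence check:
  // get returns null for an absent key, and Option(null) is None.
  def toAppShuffleIds(activeCelebornShuffleIds: Seq[Int]): Seq[Int] =
    activeCelebornShuffleIds
      .flatMap(id => Option(celebornToApp.get(id)))
      .map(_.intValue)
}
```

With this data, `matchesValue` is false even though key 7 is present, which is how the original filter would drop an active shuffle. The single-lookup form also avoids a window between the existence check and the `get` in which another thread could remove the entry.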

Comment on lines 109 to 113
private val shuffleIdMapping = JavaUtils.newConcurrentHashMap[
Int,
scala.collection.mutable.LinkedHashMap[String, (Int, Boolean)]]()
private val celebornShuffleIdToAppShuffleIdMap = JavaUtils.newConcurrentHashMap[Int, Int]()
private val shuffleIdGenerator = new AtomicInteger(0)

Copilot AI Apr 13, 2026

celebornShuffleIdToAppShuffleIdMap is only ever added to, and it is not cleared when shuffles expire/unregister (e.g., removeExpiredShuffle removes shuffleAllocatedWorkers, latestPartitionLocation, etc. but not this map). In long-running drivers this can grow without bound; consider removing the mapping when a shuffle is expired/removed (and/or when unregistering app shuffles).
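
One possible shape for that cleanup is sketched below. `ShuffleIdRegistry` and its methods are hypothetical; where the real removal belongs (for example inside the existing removeExpiredShuffle path) depends on LifecycleManager code not shown in this conversation.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical holder for the celebornShuffleId -> appShuffleId mapping.
class ShuffleIdRegistry {
  private val celebornToApp = new ConcurrentHashMap[Integer, Integer]()

  def register(celebornShuffleId: Int, appShuffleId: Int): Unit =
    celebornToApp.put(celebornShuffleId, appShuffleId)

  // Drop a single entry when its Celeborn shuffle expires.
  def removeExpiredShuffle(celebornShuffleId: Int): Unit =
    celebornToApp.remove(celebornShuffleId)

  // Drop every entry that points at an app shuffle being unregistered.
  // Removing through the values() view removes the matching map entries.
  def unregisterAppShuffle(appShuffleId: Int): Unit =
    celebornToApp.values().removeIf((v: Integer) => v.intValue == appShuffleId)

  def size: Int = celebornToApp.size()
}
```

Either hook alone bounds the map's growth; which one fits depends on whether cleanup in the driver is triggered per Celeborn shuffle or per app shuffle.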

@SteNicholas
Member

SteNicholas commented Apr 13, 2026

@s0nskar, please first resolve the following compilation failure and address Copilot's comments.

Error:  /home/runner/work/celeborn/celeborn/client/src/main/scala/org/apache/celeborn/client/LifecycleManager.scala:2056: inferred type arguments [Integer,Unit] do not conform to method toSet's type parameter bounds [B >: Int,U]
Error:  /home/runner/work/celeborn/celeborn/client/src/main/scala/org/apache/celeborn/client/LifecycleManager.scala:2056: type mismatch;
 found   : Integer => Unit
 required: B => U
[INFO] : Integer => Unit <: B => U?
[INFO] : false
Error: [ERROR] two errors found

@s0nskar
Contributor Author

s0nskar commented Apr 13, 2026

@SteNicholas working on it!

3 participants