global sort: reduce the memory usage of merge sort concurrent reader by fzzf678 · Pull Request #62921 · pingcap/tidb

fzzf678 · 2025-08-11T08:42:42Z

What problem does this PR solve?

Issue Number: close #62853

Problem Summary:

What changed and how does it work?

Reduce the concurrency of concurrent readers, pass the outerConcurrency correctly.

In the previous PR, we increased the concurrency to get better performance and if there are multiple readers concurrently read, the outerConcurrency will be used to control the whole concurrency. https://github.com/pingcap/tidb/pull/48871/files#diff-1ea5d87a202f7a3a089796fa8ac98c520879a2bdfdc739ed98a2261a9a7e8ad1R377-R381

But the outerConcurrency was forgotten to be passed(see https://github.com/pingcap/tidb/pull/48871/files#diff-b573df9af3de0ee789f4ec5114d897cc2f9b5ac57f7cc19757e2353127680edfR91), the reader concurrency is always 256 no matter how many reader we use, this leads to the OOM.

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)

For the OOM problem, after we pass the outerConcurrency correctly in this PR, the add index can be success without OOM. The profile as shown below, we can see the memory occupied by switchToConcurrentReader is 2GiB, this is expected because there are 8 readers and each has 32 concurrency with 256MiB memory, totally 8*256MiB = 2GiB.

About the merge sort performance, we test the merge sort time using tidb_ddl_reorg_worker_cnt = 4. Before this PR, each subtask tasks about 44-46 minutes. After this PR, each subtask tasks about 33-34 minutes to finish merge sort.
We detect the reason is, previously the concurrency was too high, which slowed down the pull of some goroutines, causing tail requests to drag down the overall situation. After reducing the concurrency, resource contention and congestion were alleviated, slow requests were reduced, and the overall tail delay converged. Resource contention is reduced or congestion handling capabilities are improved, making data transmission more stable and efficient. This PR is mainly to fix the OOM problem, maybe we can continue investigating performance improvements later.

img_v3_02pb_b3dc4a5b-f78d-4a81-9b8e-a4b9d02cf14g

img_v3_02pb_e2668838-dc01-4d55-b315-7ca579947efg

Before(repeat 3 times):

+---------------------+---------------------+----------+------+-----------+---------+
| start_time          | end_time            | duration | step | exec_id   | state   |
+---------------------+---------------------+----------+------+-----------+---------+
| 2025-08-12 05:28:30 | 2025-08-12 06:14:54 | 00:46:24 |    2 | tc-tidb-0 | succeed |
| 2025-08-12 05:28:30 | 2025-08-12 06:14:25 | 00:45:55 |    2 | tc-tidb-2 | succeed |
| 2025-08-12 05:28:30 | 2025-08-12 06:13:26 | 00:44:56 |    2 | tc-tidb-1 | succeed |
+---------------------+---------------------+----------+------+-----------+---------+

+---------------------+---------------------+----------+------+-----------+---------+
| start_time          | end_time            | duration | step | exec_id   | state   |
+---------------------+---------------------+----------+------+-----------+---------+
| 2025-08-12 08:41:37 | 2025-08-12 09:27:12 | 00:45:35 |    2 | tc-tidb-0 | succeed |
| 2025-08-12 08:41:37 | 2025-08-12 09:27:03 | 00:45:26 |    2 | tc-tidb-1 | succeed |
| 2025-08-12 08:41:37 | 2025-08-12 09:26:24 | 00:44:47 |    2 | tc-tidb-2 | succeed |
+---------------------+---------------------+----------+------+-----------+---------+

+---------------------+---------------------+----------+------+-----------+---------+
| start_time          | end_time            | duration | step | exec_id   | state   |
+---------------------+---------------------+----------+------+-----------+---------+
| 2025-08-12 10:48:09 | 2025-08-12 11:33:27 | 00:45:18 |    2 | tc-tidb-1 | succeed |
| 2025-08-12 10:48:09 | 2025-08-12 11:33:24 | 00:45:15 |    2 | tc-tidb-2 | succeed |
| 2025-08-12 10:48:10 | 2025-08-12 11:33:21 | 00:45:11 |    2 | tc-tidb-0 | succeed |
+---------------------+---------------------+----------+------+-----------+---------+

After(repeat 3 times):

+---------------------+---------------------+----------+------+-------------------------------------------------------------------------+---------+
| start_time          | end_time            | duration | step | exec_id                                                                 | state   |
+---------------------+---------------------+----------+------+-------------------------------------------------------------------------+---------+
| 2025-08-19 01:40:47 | 2025-08-19 02:15:01 | 00:34:14 |    2 | tc-tidb-1 | succeed |
| 2025-08-19 01:40:47 | 2025-08-19 02:14:25 | 00:33:38 |    2 | tc-tidb-0 | succeed |
| 2025-08-19 01:40:47 | 2025-08-19 02:14:10 | 00:33:23 |    2 | tc-tidb-2 | succeed |
+---------------------+---------------------+----------+------+-------------------------------------------------------------------------+---------+

+---------------------+---------------------+----------+------+-------------------------------------------------------------------------+---------+
| start_time          | end_time            | duration | step | exec_id                                                                 | state   |
+---------------------+---------------------+----------+------+-------------------------------------------------------------------------+---------+
| 2025-08-19 03:47:08 | 2025-08-19 04:21:05 | 00:33:57 |    2 | tc-tidb-0 | succeed |
| 2025-08-19 03:47:08 | 2025-08-19 04:20:58 | 00:33:50 |    2 | tc-tidb-1 | succeed |
| 2025-08-19 03:47:08 | 2025-08-19 04:20:42 | 00:33:34 |    2 | tc-tidb-2 | succeed |
+---------------------+---------------------+----------+------+-------------------------------------------------------------------------+---------+

+---------------------+---------------------+----------+------+-------------------------------------------------------------------------+---------+
| start_time          | end_time            | duration | step | exec_id                                                                 | state   |
+---------------------+---------------------+----------+------+-------------------------------------------------------------------------+---------+
| 2025-08-19 12:42:42 | 2025-08-19 13:17:04 | 00:34:22 |    2 | tc-tidb-2 | succeed |
| 2025-08-19 12:42:42 | 2025-08-19 13:16:44 | 00:34:02 |    2 | tc-tidb-0 | succeed |
| 2025-08-19 12:42:42 | 2025-08-19 13:16:41 | 00:33:59 |    2 | tc-tidb-1 | succeed |
+---------------------+---------------------+----------+------+-------------------------------------------------------------------------+---------+

No need to test
- I checked and no code files have been changed.

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

ti-chi-bot · 2025-08-11T08:42:44Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

tiprow · 2025-08-11T08:42:58Z

Hi @fzzf678. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

codecov · 2025-08-14T02:46:08Z

Codecov Report

❌ Patch coverage is 66.66667% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 73.4498%. Comparing base (6361402) to head (aee5212).
⚠️ Report is 472 commits behind head on master.

Additional details and impacted files

@@               Coverage Diff                @@
##             master     #62921        +/-   ##
================================================
+ Coverage   72.8138%   73.4498%   +0.6359%     
================================================
  Files          1823       1853        +30     
  Lines        494575     502708      +8133     
================================================
+ Hits         360119     369238      +9119     
+ Misses       112573     110903      -1670     
- Partials      21883      22567       +684

Flag	Coverage Δ
integration	`44.7680% <0.0000%> (?)`
unit	`72.2793% <66.6666%> (+0.0052%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components	Coverage Δ
dumpling	`52.8700% <ø> (ø)`
parser	`∅ <ø> (∅)`
br	`46.3505% <ø> (+0.0178%)`	⬆️

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

fzzf678 · 2025-08-14T06:47:26Z

tidb/pkg/ddl/backfilling_merge_sort.go

Line 92 in 16ec21b

memSizePerCon := res.Mem.Capacity() / res.CPU.Capacity()

~~Should we consider using this memSizePerConto control the concurrency of concurrent reader?~~
The memSizePerCon can't control, need another value to control the total memory usage of concurrent reader

…ort_oom

D3Hunter

before this PR, each subtask tasks about 44-46 minutes. After this PR, each subtask tasks about 33-34 minutes to finish merge sort.

should before this PR OOM? where the 44-46 minutes came from?

i think this PR only avoid OOM, why the performance is getting better?

fzzf678 · 2025-08-20T10:46:49Z

before this PR, each subtask tasks about 44-46 minutes. After this PR, each subtask tasks about 33-34 minutes to finish merge sort.

should before this PR OOM? where the 44-46 minutes came from?

i think this PR only avoid OOM, why the performance is getting better?

The OOM occurs when tidb_ddl_reorg_worker_cnt = 8, the 44-46 minutes is tested by using the tidb_ddl_reorg_worker_cnt = 4. Supply this to the description.

ti-chi-bot · 2025-08-21T02:50:51Z

[LGTM Timeline notifier]

Timeline:

2025-08-20 03:44:24.556505674 +0000 UTC m=+411472.499681190: ☑️ agreed by wjhuang2016.
2025-08-21 02:50:50.857473549 +0000 UTC m=+494658.800649066: ☑️ agreed by D3Hunter.

hawkingrei · 2025-08-21T02:56:37Z

/retest

ti-chi-bot · 2025-08-21T03:59:12Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: D3Hunter, tangenta, wjhuang2016

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~pkg/lightning/OWNERS~~ [D3Hunter]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…ort_oom

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>

ti-chi-bot · 2025-11-28T10:12:42Z

In response to a cherrypick label: new pull request created to branch release-8.5: #64744.
But this PR has conflicts, please resolve them!

…62921) (#64744) close #62853

limit the memory usage of merge sort MergeKVIter

74493cb

fzzf678 added 2 commits August 11, 2025 21:04

use 1GiB memory for concurrent reader

4593ea4

remove useless code

bd53cdf

ti-chi-bot bot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. do-not-merge/needs-triage-completed and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. do-not-merge/needs-linked-issue do-not-merge/needs-tests-checked labels Aug 13, 2025

fzzf678 changed the title ~~global sort: limit the memory usage of merge sort MergeKVIter~~ global sort: reduce the memory usage of merge sort MergeKVIter Aug 13, 2025

fzzf678 changed the title ~~global sort: reduce the memory usage of merge sort MergeKVIter~~ global sort: reduce the memory usage of merge sort concurrent reader Aug 14, 2025

fzzf678 marked this pull request as ready for review August 14, 2025 02:27

ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 14, 2025

fzzf678 requested review from D3Hunter and wjhuang2016 August 14, 2025 02:27

fzzf678 assigned tangenta and unassigned tangenta Aug 14, 2025

fzzf678 requested a review from tangenta August 14, 2025 02:28

fzzf678 added 4 commits August 18, 2025 19:04

Update merge.go

56f87f7

Merge remote-tracking branch 'upstream/master' into fix_62853_merge_s…

e88cbc8

…ort_oom

Update onefile_writer_test.go

d423445

add ut

4adc987

ti-chi-bot bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. do-not-merge/needs-triage-completed labels Aug 19, 2025

wjhuang2016 approved these changes Aug 20, 2025

View reviewed changes

ti-chi-bot bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Aug 20, 2025

D3Hunter reviewed Aug 20, 2025

View reviewed changes

D3Hunter approved these changes Aug 21, 2025

View reviewed changes

ti-chi-bot bot added approved lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Aug 21, 2025

tangenta approved these changes Aug 21, 2025

View reviewed changes

Merge remote-tracking branch 'upstream/master' into fix_62853_merge_s…

aee5212

…ort_oom

ti-chi-bot bot merged commit fdca815 into pingcap:master Aug 21, 2025
25 checks passed

ti-chi-bot bot added the needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. label Nov 28, 2025

ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this pull request Nov 28, 2025

This is an automated cherry-pick of pingcap#62921

f6b2786

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>

ti-chi-bot mentioned this pull request Nov 28, 2025

global sort: reduce the memory usage of merge sort concurrent reader (#62921) #64744

Merged

13 tasks

ti-chi-bot bot pushed a commit that referenced this pull request Dec 8, 2025

global sort: reduce the memory usage of merge sort concurrent reader (#…

d785f04

…62921) (#64744) close #62853

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

global sort: reduce the memory usage of merge sort concurrent reader#62921

global sort: reduce the memory usage of merge sort concurrent reader#62921
ti-chi-bot[bot] merged 8 commits intopingcap:masterfrom
fzzf678:fix_62853_merge_sort_oom

fzzf678 commented Aug 11, 2025 •

edited

Loading

Uh oh!

ti-chi-bot bot commented Aug 11, 2025

Uh oh!

tiprow bot commented Aug 11, 2025

Uh oh!

codecov bot commented Aug 14, 2025 •

edited

Loading

Uh oh!

fzzf678 commented Aug 14, 2025 •

edited

Loading

Uh oh!

D3Hunter left a comment •

edited

Loading

Uh oh!

fzzf678 commented Aug 20, 2025 •

edited

Loading

Uh oh!

ti-chi-bot bot commented Aug 21, 2025

Uh oh!

hawkingrei commented Aug 21, 2025

Uh oh!

ti-chi-bot bot commented Aug 21, 2025

Uh oh!

Uh oh!

ti-chi-bot commented Nov 28, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Conversation

fzzf678 commented Aug 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this PR solve?

What changed and how does it work?

Check List

Release note

Uh oh!

ti-chi-bot bot commented Aug 11, 2025

Uh oh!

tiprow bot commented Aug 11, 2025

Uh oh!

codecov bot commented Aug 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

fzzf678 commented Aug 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

D3Hunter left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

fzzf678 commented Aug 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ti-chi-bot bot commented Aug 21, 2025

[LGTM Timeline notifier]

Uh oh!

hawkingrei commented Aug 21, 2025

Uh oh!

ti-chi-bot bot commented Aug 21, 2025

Uh oh!

Uh oh!

ti-chi-bot commented Nov 28, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

fzzf678 commented Aug 11, 2025 •

edited

Loading

codecov bot commented Aug 14, 2025 •

edited

Loading

fzzf678 commented Aug 14, 2025 •

edited

Loading

D3Hunter left a comment •

edited

Loading

fzzf678 commented Aug 20, 2025 •

edited

Loading