Nixl_ep ci: adding ep tests to dlcluster #1557
Draft
lishapira wants to merge 20 commits into ai-dynamo:main from
👋 Hi lishapira! Thank you for contributing to ai-dynamo/nixl. Your PR reviewers will review your contribution then trigger the CI to test your changes. 🚀
Contributor
/ok to test 42356e7
Contributor
/build
42356e7 to 1220c57
Contributor
/ok to test 1220c57
Contributor
/build
2f5fd3c to bfa3677
Contributor
/ok to test bfa3677
Contributor
/build
3edde0e to 206b36e
Contributor
/ok to test 206b36e
Contributor
/build
206b36e to 3760dbc
Contributor
/ok to test 015c3b9
Contributor
/build
Contributor
/ok to test 67529f6
Contributor
/build
67529f6 to e78e5ba
Contributor
/ok to test e78e5ba
Contributor
/build
- Use 4 processes instead of 8 for the elastic EP CI test.
- Assert that rank failures only occur for ranks marked for kill in the plan, catching unexpected crashes during any phase.
- Verify the SIGTERM exit count matches the plan's expected kills, catching cases where the fault-tolerance kill mechanism fails.
- Add get_total_killed_ranks() helper to Plan class.
Made-with: Cursor
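The two checks in this commit can be sketched roughly as follows. This is an illustration only, not the code in elastic.py: the `Phase` dataclass, its field names, `ranks_marked_for_kill()`, and `validate_exit_codes()` are hypothetical, and the body of `get_total_killed_ranks()` is a guess at what such a helper might compute. It also assumes each rank exits at most once and that a SIGTERM-terminated child is reported with exit code `-signal.SIGTERM`, as Python's `multiprocessing` does.

```python
from __future__ import annotations

import signal
from dataclasses import dataclass, field


@dataclass
class Phase:
    """One phase of a hypothetical elastic plan."""
    active_ranks: list[int]
    kill_ranks: list[int] = field(default_factory=list)


@dataclass
class Plan:
    """Hypothetical plan shape; the real Plan class may differ."""
    phases: list[Phase]

    def get_total_killed_ranks(self) -> int:
        # Count every rank the plan marks for kill, across all phases.
        return sum(len(phase.kill_ranks) for phase in self.phases)

    def ranks_marked_for_kill(self) -> set[int]:
        return {rank for phase in self.phases for rank in phase.kill_ranks}


def validate_exit_codes(plan: Plan, exit_codes: dict[int, int]) -> None:
    """Raise AssertionError if the observed exits disagree with the plan."""
    # Check 1: only ranks marked for kill may exit with a non-zero code.
    failed = {rank for rank, code in exit_codes.items() if code != 0}
    unexpected = failed - plan.ranks_marked_for_kill()
    assert not unexpected, f"unexpected rank failures: {sorted(unexpected)}"

    # Check 2: the number of SIGTERM exits must match the planned kills.
    sigterm_exits = sum(
        1 for code in exit_codes.values() if code == -signal.SIGTERM
    )
    expected = plan.get_total_killed_ranks()
    assert sigterm_exits == expected, (
        f"expected {expected} SIGTERM exits, saw {sigterm_exits}"
    )
```

The first assertion catches crashes in ranks the plan never intended to kill; the second catches the opposite case, where a planned kill silently did not happen.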
Last phase now uses [0, 3] instead of [0, 1], introducing a rank-index gap (ranks 1 and 2 absent) to exercise sparse rank handling.
Made-with: Cursor
Made-with: Cursor
…flag
Address review comment: elastic.py now exposes an optional --validate-plan flag that enables plan-specific assertions (unexpected failure rejection and SIGTERM count verification). test_python.sh passes the flag for its specific plan.
Made-with: Cursor
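An opt-in flag of this kind is typically a `store_true` argument, so existing callers keep the old behavior. The sketch below shows that pattern with `argparse`. Only `--validate-plan` comes from the commit message; `--plan`, `--num-processes`, and `build_parser()` are placeholders, and the actual argument set of elastic.py may be different.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Elastic EP test (sketch)")
    # Hypothetical arguments, present only to make the example runnable.
    parser.add_argument("--plan", required=True, help="path to a plan JSON file")
    parser.add_argument("--num-processes", type=int, default=4)
    # Off by default: plan-specific assertions run only when requested.
    parser.add_argument(
        "--validate-plan",
        action="store_true",
        help="reject unexpected rank failures and verify the SIGTERM count",
    )
    return parser
```

A caller would then guard the plan-specific assertions with `if args.validate_plan:` so that ad hoc plans are not held to the expectations of the CI plan.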
The global meson.build sets -rdc=true for all CUDA targets, which causes nvlink to enforce register limits across call boundaries at link time. nixl_ep kernels use --register-usage-level=10 and call nixlPut() which uses 215 registers, exceeding the nvlink limit and failing at link time.
Adding -rdc=false overrides the global setting for nixl_ep only, so nixlPut gets inlined at compile time instead of being linked separately by nvlink. This matches the standalone setup.py build behavior.
Made-with: Cursor
Install DOCA GPUNetIO dev packages when PRE_INSTALLED_ENV skips the full apt bootstrap but BUILD_NIXL_EP=true requires them. UCX device headers include doca_gpunetio_dev_verbs_qp.cuh which fails compilation without the dev package installed.
Unset UCX_NET_DEVICES in elastic test subshell so UCX auto-selects a GPU-capable transport. When set by the CI environment, UCX is restricted to a device without GPU peer memory support, causing the RDMA path to fail with "no lane found" errors.
Bump CI_IMAGE_TAG to 20260421-1 (build.sh changed).
Made-with: Cursor
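The commit does this in a shell subshell. The same idea in Python, for a launcher that spawns the test as a child process, is to copy the environment, drop the variable, and pass the copy to the child so the parent's environment is left untouched. `env_without()` and `run_without_ucx_net_devices()` are hypothetical helper names.

```python
import os
import subprocess
import sys


def env_without(*names: str) -> dict[str, str]:
    """Return a copy of the current environment with the given names removed."""
    env = dict(os.environ)
    for name in names:
        env.pop(name, None)
    return env


def run_without_ucx_net_devices(cmd: list[str]) -> subprocess.CompletedProcess:
    # Only the child sees the modified environment, like a shell subshell.
    return subprocess.run(
        cmd,
        env=env_without("UCX_NET_DEVICES"),
        capture_output=True,
        text=True,
        check=False,
    )


if __name__ == "__main__":
    probe = "import os; print(os.environ.get('UCX_NET_DEVICES'))"
    result = run_without_ucx_net_devices([sys.executable, "-c", probe])
    print(result.stdout.strip())
```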
DOCA device headers (.cuh) may be installed to a path that nvcc does
not search by default, and meson may not find a pkg-config file for
doca-gpunetio to add the include path automatically. Copy all .cuh
files from the DOCA installation directory to ${CUDA_HOME}/include/
so nvcc can find doca_gpunetio_dev_verbs_qp.cuh, which is included
transitively via UCX device headers.
Made-with: Cursor
The UCX-master build activates the UCX GPU Device API, which triggers the full gdaki.cuh include chain requiring both .cuh and .h DOCA GPUNetIO headers. Add a wildcard search for all doca_gpunetio* files (any extension) from /usr/include and /opt.
Bump CI_IMAGE_TAG to 20260421-3 to trigger Docker image rebuild.
…hout GPU API
- Export UCX_VERSION in Dockerfile.gpu-test for test_ep.sh.
- Fail EP elastic step when UCX_VERSION=master and BUILD_NIXL_EP=true but nixl_ep_cpp is missing; keep skipping on other UCX versions.
- Skip examples/device/ep in Meson when UCX GPU Device API is unavailable.
- Bump CI_IMAGE_TAG to 20260427-1 in build and test matrices.
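The fail-versus-skip rule in the second bullet can be written as a small decision function. This is a Python restatement for clarity, not the logic in test_ep.sh itself: `StepAction` and `ep_elastic_step_action()` are hypothetical names, and running the tests whenever the module is importable is an assumption the commit message does not state.

```python
from enum import Enum


class StepAction(Enum):
    RUN = "run"
    SKIP = "skip"
    FAIL = "fail"


def ep_elastic_step_action(
    ucx_version: str, build_nixl_ep: bool, module_importable: bool
) -> StepAction:
    """Decide what the EP elastic step should do for one matrix entry."""
    if module_importable:
        # Assumption: if the module is present, the tests simply run.
        return StepAction.RUN
    if ucx_version == "master" and build_nixl_ep:
        # The module was expected to be built, so its absence is an error.
        return StepAction.FAIL
    # Other UCX versions lack the GPU Device API; a missing module is expected.
    return StepAction.SKIP
```

Turning the master case into a hard failure keeps a broken nixl_ep build from passing CI as a silent skip.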
Convert "Run DL EP elastic tests" from a raw sudo+ssh shell command to the slurmCI module format used by all other test steps. This fixes two bugs introduced after the rebase onto c83d742:
1. Wrong job ID file path: the old step read from /mnt/pvc/dl_job_id_<ver>_<build>.txt but the Allocate step now writes to ${JOB_ID_FILE_ROOT}/job_id_<ver>_<build>.txt, causing --slurm_job_id to be empty.
2. SSH Permission denied: the raw sudo -u svc-nixl approach never loaded the Jenkins SSH credential (svc-nixl-ssh_key), so SSH to dlcluster.nvidia.com failed with permission denied. Using slurmCI with credentialsId injects the key automatically.
Bump CI_IMAGE_TAG to 20260428-1 in all three matrix YAML files to trigger a fresh base image build that incorporates the current build.sh (DOCA GPUNetIO headers block) and the PyTorch CUDA-version alignment from main (commit 1200fe5).
Made-with: Cursor
basic.json and no_expansion.json are identical; remove the duplicate and use no_expansion.json consistently in all elastic test calls.
Made-with: Cursor
elastic.py imports the nixl_ep package, not nixl_ep_cpp directly. Use "import nixl_ep" so the check matches the actual runtime import path.
Made-with: Cursor
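An availability check that follows the runtime import path can be sketched as below. It performs a real import instead of only locating the module, so a package that is installed but fails to load also counts as unavailable. `can_import()` is a hypothetical helper, and the CI script may perform the check differently (for example with `python -c "import nixl_ep"`).

```python
import importlib


def can_import(module_name: str) -> bool:
    """Return True if importing the module succeeds in this interpreter."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        # Covers ModuleNotFoundError and failures loading a compiled extension.
        return False
    return True


if __name__ == "__main__":
    # Check the package the test imports, not its underlying extension module.
    print("nixl_ep importable:", can_import("nixl_ep"))
```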
Co-authored-by: Cursor <cursoragent@cursor.com>
a29db38 to b8a4fd4
Contributor
/build
02fb86f to c8d289c
Revert DL matrix UCX pin to master + v1.21.x. CI_IMAGE_TAG 20260428-3.
c8d289c to a0dbb3c
d1e842b to 79c3e36
79c3e36 to dfc1312
Contributor
/build
Integrates nixl_ep elastic tests into the existing nixl-ci-dl-gpu Jenkins pipeline, running on GB300 NVL72 nodes. The tests exercise the elastic scale-up/scale-down functionality of nixl_ep for both NVLink and RDMA transports.
Changes: