Nixl_ep ci: adding ep tests to dlcluster #1557

Draft
lishapira wants to merge 20 commits into ai-dynamo:main from lishapira:nixl_ep_ci_adding_ep_to_dlcluster

@lishapira
Contributor

lishapira commented Apr 20, 2026

Integrates nixl_ep elastic tests into the existing nixl-ci-dl-gpu Jenkins pipeline, running on GB300 NVL72 nodes. The tests exercise the elastic scale-up/scale-down functionality of nixl_ep for both NVLink and RDMA transports.

Changes:

  1. CI: Added a "Run DL EP elastic tests" step to test-dl-matrix.yaml; the PR image is built with BUILD_NIXL_EP=true.
  2. test_ep.sh: Runs elastic.py with the no_expansion.json and expansion_fault_contraction.json plans on 4 GPUs (NVLink and RDMA variants); skips gracefully on UCX v1.21.x.
  3. Build: Added a ucx_gpu_device_api_available guard that skips building nixl_ep when the UCX GPU Device API is unavailable; disabled -rdc to fix nvlink register count errors; added DOCA GPUNetIO header installation in build.sh.
  4. elastic.py: Added silent-failure assertions (unexpected rank crashes, SIGTERM count mismatch); fixed log message wording; added the get_total_killed_ranks() helper (from plan.py); put validation behind the --validate-plan flag.

@copy-pr-bot

copy-pr-bot Bot commented Apr 20, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

👋 Hi lishapira! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

@itayalroy
Contributor

/ok to test 42356e7

@itayalroy
Contributor

/build

@lishapira changed the title from "Nixl_ep ci: adding ep test to dlcluster" to "Nixl_ep ci: adding ep tests to dlcluster" on Apr 20, 2026
@lishapira force-pushed the nixl_ep_ci_adding_ep_to_dlcluster branch from 42356e7 to 1220c57 on April 21, 2026 11:58
@itayalroy
Contributor

/ok to test 1220c57

@itayalroy
Contributor

/build

@lishapira force-pushed the nixl_ep_ci_adding_ep_to_dlcluster branch 2 times, most recently from 2f5fd3c to bfa3677, on April 23, 2026 07:28
@itayalroy
Contributor

/ok to test bfa3677

@itayalroy
Contributor

/build

@lishapira force-pushed the nixl_ep_ci_adding_ep_to_dlcluster branch 2 times, most recently from 3edde0e to 206b36e, on April 27, 2026 15:56
@itayalroy
Contributor

/ok to test 206b36e

@itayalroy
Contributor

/build

@lishapira force-pushed the nixl_ep_ci_adding_ep_to_dlcluster branch from 206b36e to 3760dbc on April 28, 2026 09:24
@itayalroy
Contributor

/ok to test 015c3b9

@itayalroy
Contributor

/build

@itayalroy
Contributor

/ok to test 67529f6

@itayalroy
Contributor

/build

@lishapira force-pushed the nixl_ep_ci_adding_ep_to_dlcluster branch from 67529f6 to e78e5ba on April 28, 2026 13:42
@itayalroy
Contributor

/ok to test e78e5ba

@itayalroy
Contributor

/build

lishapira and others added 17 commits May 3, 2026 03:47
- Use 4 processes instead of 8 for the elastic EP CI test.
- Assert that rank failures only occur for ranks marked for kill
  in the plan, catching unexpected crashes during any phase.
- Verify the SIGTERM exit count matches the plan's expected kills,
  catching cases where the fault-tolerance kill mechanism fails.
- Add get_total_killed_ranks() helper to Plan class.

Made-with: Cursor
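
The two checks in this commit message could look roughly like the sketch below. It is not the code from elastic.py or plan.py: the plan layout (one set of killed ranks per phase), the `Plan` fields, and `validate_exit_codes()` are assumptions made for illustration. Only the name `get_total_killed_ranks()` comes from the commit message. The sketch also assumes one process per rank and that a killed process reports `-SIGTERM` as its exit code, as Python's `multiprocessing` and `subprocess` do on POSIX.

```python
import signal
from dataclasses import dataclass, field


@dataclass
class Plan:
    # Hypothetical layout: for each phase, the set of ranks the plan kills.
    kills_per_phase: list[set[int]] = field(default_factory=list)

    def get_killed_ranks(self) -> set[int]:
        """Every rank that the plan kills in at least one phase."""
        return set().union(*self.kills_per_phase)

    def get_total_killed_ranks(self) -> int:
        """Number of kills the plan expects, summed over all phases."""
        return sum(len(kills) for kills in self.kills_per_phase)


def validate_exit_codes(plan: Plan, exit_codes: dict[int, int]) -> None:
    """Check per-rank exit codes (rank -> exit code) against the plan."""
    # Check 1: only ranks marked for kill may exit with a non-zero code.
    failed = {rank for rank, code in exit_codes.items() if code != 0}
    unexpected = failed - plan.get_killed_ranks()
    if unexpected:
        raise AssertionError(f"unexpected rank failures: {sorted(unexpected)}")

    # Check 2: the number of SIGTERM exits must match the planned kills.
    sigterm_exits = sum(
        1 for code in exit_codes.values() if code == -signal.SIGTERM
    )
    expected = plan.get_total_killed_ranks()
    if sigterm_exits != expected:
        raise AssertionError(
            f"expected {expected} SIGTERM exits, got {sigterm_exits}"
        )
```

The second check is what catches a silent failure of the kill mechanism: if a rank marked for kill exits cleanly, no rank has failed unexpectedly, yet the SIGTERM count comes up short.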
Last phase now uses [0, 3] instead of [0, 1], introducing a
rank-index gap (ranks 1 and 2 absent) to exercise sparse rank handling.

Made-with: Cursor
…flag

Address review comment: elastic.py now exposes an optional
--validate-plan flag that enables plan-specific assertions
(unexpected failure rejection and SIGTERM count verification).
test_python.sh passes the flag for its specific plan.

Made-with: Cursor
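
An opt-in flag of this kind is typically wired up with `argparse`, as in the minimal sketch below. Only the `--validate-plan` name is taken from the commit message. The `--plan` option, `build_parser()`, and the help text are assumptions for illustration and may not match elastic.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="elastic test driver (sketch)")
    # Hypothetical option: path to the plan JSON file.
    parser.add_argument("--plan", required=True, help="path to the plan JSON file")
    # store_true makes the flag default to False, so existing callers that
    # do not pass it keep their current behaviour.
    parser.add_argument(
        "--validate-plan",
        action="store_true",
        help="enable plan-specific assertions after the run",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # ... run the phases of the plan here ...
    if args.validate_plan:
        print("plan-specific assertions enabled")
```

Keeping the assertions behind a flag lets a test script enable them only for plans whose expected kills are known.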
The global meson.build sets -rdc=true for all CUDA targets, which causes
nvlink to enforce register limits across call boundaries at link time.
nixl_ep kernels use --register-usage-level=10 and call nixlPut() which
uses 215 registers, exceeding the nvlink limit and failing at link time.

Adding -rdc=false overrides the global setting for nixl_ep only, so
nixlPut gets inlined at compile time instead of being linked separately
by nvlink. This matches the standalone setup.py build behavior.

Made-with: Cursor
Install DOCA GPUNetIO dev packages when PRE_INSTALLED_ENV skips the full
apt bootstrap but BUILD_NIXL_EP=true requires them. UCX device headers
include doca_gpunetio_dev_verbs_qp.cuh which fails compilation without
the dev package installed.

Unset UCX_NET_DEVICES in elastic test subshell so UCX auto-selects a
GPU-capable transport. When set by the CI environment, UCX is restricted
to a device without GPU peer memory support, causing the RDMA path to
fail with "no lane found" errors.

Bump CI_IMAGE_TAG to 20260421-1 (build.sh changed).

Made-with: Cursor
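
The commit does the unsetting in a shell subshell. The same isolation can be sketched in Python, where a launcher passes the child a copy of the environment with the variable removed. `env_without()` and `run_elastic()` are hypothetical helper names; only `UCX_NET_DEVICES` comes from the commit message.

```python
import os
import subprocess
import sys
from typing import Optional


def env_without(name: str, base: Optional[dict] = None) -> dict:
    """Return a copy of the environment with `name` removed, if present."""
    env = dict(os.environ if base is None else base)
    env.pop(name, None)
    return env


def run_elastic(cmd: list) -> int:
    """Run `cmd` with UCX_NET_DEVICES unset in the child process only."""
    return subprocess.run(cmd, env=env_without("UCX_NET_DEVICES")).returncode


if __name__ == "__main__":
    # The child exits with 1 if it can still see the variable, 0 otherwise.
    probe = "import os, sys; sys.exit('UCX_NET_DEVICES' in os.environ)"
    print(run_elastic([sys.executable, "-c", probe]))
```

As with a subshell, the parent environment is left untouched, so other CI steps still see the value set by the CI environment.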
DOCA device headers (.cuh) may be installed to a path that nvcc does
not search by default, and meson may not find a pkg-config file for
doca-gpunetio to add the include path automatically. Copy all .cuh
files from the DOCA installation directory to ${CUDA_HOME}/include/
so nvcc can find doca_gpunetio_dev_verbs_qp.cuh, which is included
transitively via UCX device headers.

Made-with: Cursor
The UCX-master build activates the UCX GPU Device API, which triggers the full
gdaki.cuh include chain requiring both .cuh and .h DOCA GPUNetIO headers.
Add a wildcard search for all doca_gpunetio* files (any extension) under
/usr/include and /opt.
Bump CI_IMAGE_TAG to 20260421-3 to trigger a Docker image rebuild.
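
The wildcard search lives in build.sh, which is a shell script. The sketch below restates the idea in Python purely as an illustration: find every `doca_gpunetio*` file under a set of search roots and copy it into an include directory that nvcc searches. `copy_doca_headers()` and its parameters are hypothetical, and the sketch assumes the destination is not itself inside one of the search roots.

```python
import shutil
from pathlib import Path


def copy_doca_headers(search_roots, dest_include):
    """Copy all doca_gpunetio* files found under search_roots to dest_include.

    Returns the list of destination paths that were written.
    """
    dest = Path(dest_include)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for root in map(Path, search_roots):
        if not root.is_dir():
            continue  # a missing root (e.g. no /opt install) is not an error
        # rglob matches at any depth and with any extension (.cuh, .h, ...).
        for src in sorted(root.rglob("doca_gpunetio*")):
            target = dest / src.name
            if src.is_file() and not target.exists():
                shutil.copy2(src, target)
                copied.append(target)
    return copied
```

In the setting the commit describes, this would correspond to a call such as `copy_doca_headers(["/usr/include", "/opt"], os.path.join(os.environ["CUDA_HOME"], "include"))`.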
…hout GPU API

- Export UCX_VERSION in Dockerfile.gpu-test for test_ep.sh.
- Fail EP elastic step when UCX_VERSION=master and BUILD_NIXL_EP=true but
  nixl_ep_cpp is missing; keep skipping on other UCX versions.
- Skip examples/device/ep in Meson when UCX GPU Device API is unavailable.
- Bump CI_IMAGE_TAG to 20260427-1 in build and test matrices.
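
The fail-versus-skip rule in the second bullet can be written as a small decision function. This is a sketch of the stated policy, not the logic in test_ep.sh: `EpStep` and `decide_ep_step()` are hypothetical names, and the inputs stand for the UCX_VERSION and BUILD_NIXL_EP values plus the result of checking whether the module is present.

```python
from enum import Enum


class EpStep(Enum):
    RUN = "run"
    SKIP = "skip"
    FAIL = "fail"


def decide_ep_step(ucx_version: str, build_nixl_ep: bool, module_present: bool) -> EpStep:
    """Decide what the EP elastic step should do."""
    if module_present:
        return EpStep.RUN
    # A missing module is an error only where the build was expected to
    # produce it: UCX master with BUILD_NIXL_EP=true.
    if ucx_version == "master" and build_nixl_ep:
        return EpStep.FAIL
    # On other UCX versions the module may legitimately be absent.
    return EpStep.SKIP
```

Failing in the master case turns a missing module into a visible CI error there instead of a quiet skip.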
Convert "Run DL EP elastic tests" from a raw sudo+ssh shell command to
the slurmCI module format used by all other test steps. This fixes two
bugs introduced after the rebase onto c83d742:

1. Wrong job ID file path: the old step read from
   /mnt/pvc/dl_job_id_<ver>_<build>.txt but the Allocate step now
   writes to ${JOB_ID_FILE_ROOT}/job_id_<ver>_<build>.txt, causing
   --slurm_job_id to be empty.

2. SSH Permission denied: the raw sudo -u svc-nixl approach never
   loaded the Jenkins SSH credential (svc-nixl-ssh_key), so SSH to
   dlcluster.nvidia.com failed with permission denied. Using slurmCI
   with credentialsId injects the key automatically.

Bump CI_IMAGE_TAG to 20260428-1 in all three matrix YAML files to
trigger a fresh base image build that incorporates the current build.sh
(DOCA GPUNetIO headers block) and the PyTorch CUDA-version alignment
from main (commit 1200fe5).

Made-with: Cursor
basic.json and no_expansion.json are identical; remove the duplicate
and use no_expansion.json consistently in all elastic test calls.

Made-with: Cursor
elastic.py imports the nixl_ep package, not nixl_ep_cpp directly.
Use "import nixl_ep" so the check matches the actual runtime import path.

Made-with: Cursor
Co-authored-by: Cursor <cursoragent@cursor.com>
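
A presence check that follows the real import path can be as small as the sketch below. `can_import()` is a hypothetical helper, not a function from the PR. It actually imports the module, so a package whose compiled extension fails to load with an `ImportError` is reported as unavailable too.

```python
import importlib


def can_import(name: str) -> bool:
    """Return True if `import <name>` succeeds in this interpreter."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # Check the package that elastic.py imports, not the extension module.
    print(can_import("nixl_ep"))
```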
@lishapira force-pushed the nixl_ep_ci_adding_ep_to_dlcluster branch from a29db38 to b8a4fd4 on May 3, 2026 10:50
@ofirfarjun7
Contributor

/build

@lishapira force-pushed the nixl_ep_ci_adding_ep_to_dlcluster branch from 02fb86f to c8d289c on May 4, 2026 08:41
Revert DL matrix UCX pin to master + v1.21.x. CI_IMAGE_TAG 20260428-3.
@lishapira force-pushed the nixl_ep_ci_adding_ep_to_dlcluster branch from c8d289c to a0dbb3c on May 4, 2026 11:11
@lishapira force-pushed the nixl_ep_ci_adding_ep_to_dlcluster branch from d1e842b to 79c3e36 on May 4, 2026 16:08
@lishapira force-pushed the nixl_ep_ci_adding_ep_to_dlcluster branch from 79c3e36 to dfc1312 on May 4, 2026 16:12
@rakhmets
Contributor

rakhmets commented May 4, 2026

/build
