nvbug6084457: Fix device architecture handling and NVLink link count query by mdboom · Pull Request #1937 · NVIDIA/cuda-python

mdboom · 2026-04-17T12:23:44Z

Filing as a draft because this is still only a partial fix for the reported bug. The final fix requires coordination with upstream NVML.

copy-pr-bot · 2026-04-17T12:23:47Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Co-authored-by: Phillip Cloud <417981+cpcloud@users.noreply.github.com>

mdboom · 2026-04-20T12:34:56Z

Marking this as "ready to review". As a follow-on, when we do the 13.3 bring-up, we will need to add logic to make NVML_NVLINK_MAX_LINKS dynamic based on the CTK version, but there is no point in doing that now.

github-actions · 2026-04-20T12:52:54Z

Doc Preview CI
🚀 View preview at https://nvidia.github.io/cuda-python/pr-preview/pr-1937/
https://nvidia.github.io/cuda-python/pr-preview/pr-1937/cuda-core/
https://nvidia.github.io/cuda-python/pr-preview/pr-1937/cuda-bindings/
https://nvidia.github.io/cuda-python/pr-preview/pr-1937/cuda-pathfinder/
Preview will be ready when the GitHub Pages deployment is complete.

rwgk

LGTM, but since I'm not immersed in the context, it wasn't easy to be sure about the details. It'd be helpful to get an agent summary of the changes in the PR description.

rwgk · 2026-04-23T20:53:52Z

        # can't be more specific about how many links we should find.
        if value.nvml_return == nvml.Return.SUCCESS:
-            assert value.value.ui_val <= nvml.NVLINK_MAX_LINKS, f"Unexpected link count {value.value.ui_val}"
+            assert value.value.ui_val[0] <= nvml.NVLINK_MAX_LINKS, f"Unexpected link count {value.value.ui_val[0]}"


LGTM, based on my understanding that this fixes a subtle latent bug in the test: before, value.value.ui_val was a 1-element NumPy array, so this was doing an array-to-scalar comparison and relying on NumPy's size-1 truthiness. With [0], the test now compares the actual scalar field value to nvml.NVLINK_MAX_LINKS, which I assume is the intended behavior.
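
To illustrate the difference, here is a minimal sketch. It assumes, as described above, that `ui_val` behaves like a 1-element NumPy array; the array contents and the `NVLINK_MAX_LINKS` value are stand-ins chosen for illustration, not values read from NVML.

```python
import numpy as np

# Stand-in for nvml.NVLINK_MAX_LINKS; the value here is illustrative only.
NVLINK_MAX_LINKS = 18

# Hypothetical stand-in for value.value.ui_val: a 1-element array.
ui_val = np.array([4], dtype=np.uint32)

# Old form: element-wise comparison, producing a 1-element boolean array.
array_result = ui_val <= NVLINK_MAX_LINKS
# New form: compares the scalar field value itself.
scalar_result = ui_val[0] <= NVLINK_MAX_LINKS

# The old assert passes only because a size-1 array can be converted to bool.
assert array_result
assert scalar_result

# The failure message also changes: the old one embeds the array, the new one the scalar.
old_message = f"Unexpected link count {ui_val}"
new_message = f"Unexpected link count {ui_val[0]}"
```

With more than one element, `assert array_result` would raise a `ValueError` about ambiguous truth values instead of checking anything, so indexing with `[0]` makes the intent explicit.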

rwgk · 2026-04-23T21:04:15Z

+        try:
+            return DeviceArch(arch)
+        except ValueError:
+            return nvml.DeviceArch.UNKNOWN


LGTM, based on my understanding that nvml.device_get_architecture() returns a raw integer architecture code, and DeviceArch(arch) is the enum conversion/validation step. This change seems to make Device.arch handle newer/unknown architecture IDs gracefully by returning UNKNOWN instead of raising ValueError. Please correct me if I'm missing any nuance.

How easy or difficult would it be to add a test that covers the except path?
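
One possible shape for such a test, as a sketch only: the `DeviceArch` enum, `arch_from_query`, and the injected query callable below are hypothetical stand-ins so the example runs without NVML. In the real code the conversion lives inside `Device.arch` in a `.pyx` file, so whether the NVML call can be patched the same way is an assumption that would need checking.

```python
from enum import IntEnum


class DeviceArch(IntEnum):
    # Hypothetical subset of architecture codes, for illustration only.
    KEPLER = 2
    HOPPER = 9
    UNKNOWN = 0xFFFFFFFF


def arch_from_query(query_raw_arch):
    """Mirror the try/except from the diff, with the raw query injected."""
    raw = query_raw_arch()
    try:
        return DeviceArch(raw)
    except ValueError:
        # Architecture codes newer than this enum fall back to UNKNOWN.
        return DeviceArch.UNKNOWN


def test_known_arch_is_converted():
    assert arch_from_query(lambda: 9) is DeviceArch.HOPPER


def test_unrecognized_arch_falls_back_to_unknown():
    # 12345 is not a member of the stand-in enum, so the except path runs.
    assert arch_from_query(lambda: 12345) is DeviceArch.UNKNOWN


test_known_arch_is_converted()
test_unrecognized_arch_falls_back_to_unknown()
```

Injecting the query keeps the test independent of the installed driver; a test against the real property would instead need to replace the NVML call with one returning an unassigned integer.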

rwgk · 2026-04-23T21:19:07Z

+            arch = nvml.DeviceArch(arch)
+            return arch.name
+        except ValueError:
+            return f"UNKNOWN({arch})"


Readability nit, to make the intent more obvious:

        try:
            arch = nvml.DeviceArch(arch)
        except ValueError:
            return f"UNKNOWN({arch})"
        return arch.name

Nit 2: use UNKNOWN_ARCH_ID instead of UNKNOWN, so the resulting warning explains what the magic integer is (it can be guessed; this is just a little more helpful).
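
Taken together, the two nits could look like the sketch below. The `DeviceArch` enum and the `arch_name` helper are hypothetical stand-ins so the snippet runs on its own; the member value is illustrative and the real function in the PR may differ.

```python
from enum import IntEnum


class DeviceArch(IntEnum):
    # Hypothetical single-member stand-in for nvml.DeviceArch.
    HOPPER = 9


def arch_name(arch_id: int) -> str:
    """Return the enum member name, or a labeled placeholder for unknown IDs."""
    try:
        arch = DeviceArch(arch_id)
    except ValueError:
        # Only the conversion is guarded, and the label says what the integer is.
        return f"UNKNOWN_ARCH_ID({arch_id})"
    return arch.name
```

Keeping only the enum conversion inside the `try` makes it clear which statement is expected to raise, and `arch_name(1234)` then yields a string that identifies the number as an architecture ID.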

nvbug6084457: Fix device architecture handling and NVLink link count …

39be54d

…query

github-actions Bot added the Needs-Restricted-Paths-Review label (PR touches cuda_bindings or cuda_python; only NVIDIA employees may modify these paths; see LICENSEs) Apr 17, 2026

mdboom self-assigned this Apr 17, 2026

mdboom added the bug (Something isn't working), test (Improvements or additions to tests), and cuda.bindings (Everything related to the cuda.bindings module) labels Apr 17, 2026

mdboom added this to the cuda.bindings next milestone Apr 17, 2026

cpcloud reviewed Apr 17, 2026

Comment thread cuda_core/cuda/core/system/_device.pyx Outdated

Apply suggestion from @cpcloud

9cd877c

Co-authored-by: Phillip Cloud <417981+cpcloud@users.noreply.github.com>

mdboom marked this pull request as ready for review April 20, 2026 12:34

Merge branch 'main' into nvbug6084457

fb740c9

mdboom requested a review from cpcloud April 20, 2026 12:34

rwgk removed the Needs-Restricted-Paths-Review label (PR touches cuda_bindings or cuda_python; only NVIDIA employees may modify these paths; see LICENSEs) Apr 21, 2026

Merge branch 'main' into nvbug6084457

40632ad

github-actions Bot added the cuda.core (Everything related to the cuda.core module) label Apr 21, 2026

rwgk approved these changes Apr 23, 2026

mdboom wants to merge 4 commits into NVIDIA:main from mdboom:nvbug6084457

3 participants