nvbug6084457: Fix device architecture handling and NVLink link count query #1937

Open
mdboom wants to merge 4 commits into NVIDIA:main from mdboom:nvbug6084457

mdboom commented Apr 17, 2026

Filing as a draft because this is still only a partial fix for the reported bug. The final fix requires coordination with upstream NVML.

copy-pr-bot Bot commented Apr 17, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

github-actions bot added the Needs-Restricted-Paths-Review label (PR touches cuda_bindings or cuda_python; only NVIDIA employees may modify these paths; see LICENSEs) on Apr 17, 2026
mdboom self-assigned this on Apr 17, 2026
mdboom added the bug (Something isn't working), test (Improvements or additions to tests), and cuda.bindings (Everything related to the cuda.bindings module) labels on Apr 17, 2026
mdboom added this to the cuda.bindings next milestone on Apr 17, 2026
Comment thread on cuda_core/cuda/core/system/_device.pyx (outdated)
Co-authored-by: Phillip Cloud <417981+cpcloud@users.noreply.github.com>
mdboom marked this pull request as ready for review on April 20, 2026 12:34
mdboom requested a review from cpcloud on April 20, 2026 12:34
mdboom commented Apr 20, 2026

Marking this as "ready for review". As a follow-up during the 13.3 bring-up, we will need to add logic to make NVML_NVLINK_MAX_LINKS dynamic based on the CTK version, but there is no point in doing that now.

rwgk removed the Needs-Restricted-Paths-Review label (PR touches cuda_bindings or cuda_python; only NVIDIA employees may modify these paths; see LICENSEs) on Apr 21, 2026
github-actions bot added the cuda.core label (Everything related to the cuda.core module) on Apr 21, 2026
rwgk left a comment

LGTM, but since I'm not immersed in the context, it wasn't easy to be sure about the details. It'd be helpful to get an agent summary of the changes in the PR description.

      # can't be more specific about how many links we should find.
      if value.nvml_return == nvml.Return.SUCCESS:
    -     assert value.value.ui_val <= nvml.NVLINK_MAX_LINKS, f"Unexpected link count {value.value.ui_val}"
    +     assert value.value.ui_val[0] <= nvml.NVLINK_MAX_LINKS, f"Unexpected link count {value.value.ui_val[0]}"

LGTM, based on my understanding that this fixes a subtle latent bug in the test: before, value.value.ui_val was a 1-element NumPy array, so this was doing an array-to-scalar comparison and relying on NumPy's size-1 truthiness. With [0], the test now compares the actual scalar field value to nvml.NVLINK_MAX_LINKS, which I assume is the intended behavior.
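
For readers less familiar with the NumPy behavior described above, here is a minimal sketch of the difference between the two assertions. The array contents and the `MAX_LINKS` value are hypothetical stand-ins, not the real `value.value.ui_val` or `nvml.NVLINK_MAX_LINKS`.

```python
import numpy as np

# Hypothetical stand-ins: a 1-element array like the one ui_val is described
# as being, and an illustrative limit (not the real nvml.NVLINK_MAX_LINKS).
ui_val = np.array([4], dtype=np.uint32)
MAX_LINKS = 18

# Old form: array <= scalar yields a 1-element boolean array, and `assert`
# then depends on bool() of that array.
array_result = ui_val <= MAX_LINKS
assert array_result  # passes only because the array has exactly one element

# New form: index first, then compare two scalars.
scalar_result = ui_val[0] <= MAX_LINKS
assert scalar_result

# With more than one element the old form stops working: bool() of the
# comparison result raises ValueError instead of returning True or False.
try:
    bool(np.array([4, 5], dtype=np.uint32) <= MAX_LINKS)
    multi_element_raises = False
except ValueError:
    multi_element_raises = True
```

Both assertions pass for a 1-element array, which is why the old form went unnoticed; the indexed form simply states the intent directly.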

        try:
            return DeviceArch(arch)
        except ValueError:
            return nvml.DeviceArch.UNKNOWN

LGTM, based on my understanding that nvml.device_get_architecture() returns a raw integer architecture code, and DeviceArch(arch) is the enum conversion/validation step. This change seems to make Device.arch handle newer/unknown architecture IDs gracefully by returning UNKNOWN instead of raising ValueError. Please correct me if I'm missing any nuance.

How easy or difficult would it be to add a test that covers the except path?
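
One possible shape for such a test, sketched under stated assumptions: the enum members, the `fake_nvml` namespace, and the `device_arch` helper below are hypothetical stand-ins for the real `DeviceArch`, `nvml` module, and `Device.arch` property. The idea is to patch the architecture query so it returns an integer that is not an enum member.

```python
from enum import IntEnum
from types import SimpleNamespace
from unittest import mock


class DeviceArch(IntEnum):
    # Hypothetical members and values, for illustration only.
    KEPLER = 2
    MAXWELL = 3
    UNKNOWN = 0xFFFFFFFF


# Hypothetical stand-in for the nvml module; the real call signature may differ.
fake_nvml = SimpleNamespace(device_get_architecture=lambda handle: 2)


def device_arch(handle):
    """Stand-in for Device.arch: convert the raw code, falling back to UNKNOWN."""
    raw = fake_nvml.device_get_architecture(handle)
    try:
        return DeviceArch(raw)
    except ValueError:
        return DeviceArch.UNKNOWN


def test_known_arch_is_preserved():
    assert device_arch(handle=None) is DeviceArch.KEPLER


def test_unrecognized_arch_falls_back_to_unknown():
    # 12345 is deliberately not a member, so the enum lookup raises ValueError.
    with mock.patch.object(fake_nvml, "device_get_architecture", return_value=12345):
        assert device_arch(handle=None) is DeviceArch.UNKNOWN


test_known_arch_is_preserved()
test_unrecognized_arch_falls_back_to_unknown()
```

Whether the real `Device.arch` can be patched this way depends on how the Cython module resolves `nvml` at call time, so this is only a starting point.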

            arch = nvml.DeviceArch(arch)
            return arch.name
        except ValueError:
            return f"UNKNOWN({arch})"

Readability nit, to make the intent more obvious:

        try:
            arch = nvml.DeviceArch(arch)
        except ValueError:
            return f"UNKNOWN({arch})"
        return arch.name

Nit 2: use UNKNOWN_ARCH_ID instead of UNKNOWN, so the resulting warning explains what the magic integer is (it can be guessed; this is just a little more helpful).
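
Putting both nits together, a self-contained sketch might look like the following. The `DeviceArch` enum here is a hypothetical stand-in for `nvml.DeviceArch`, and `arch_name` is an illustrative helper name, not the function in the PR.

```python
from enum import IntEnum


class DeviceArch(IntEnum):
    # Hypothetical members and values, for illustration only.
    KEPLER = 2
    MAXWELL = 3


def arch_name(arch):
    """Return the enum member name, or a labeled raw ID for unrecognized codes."""
    try:
        arch = DeviceArch(arch)
    except ValueError:
        # arch still holds the raw integer here, because the assignment
        # above never completed.
        return f"UNKNOWN_ARCH_ID({arch})"
    return arch.name


known = arch_name(3)
unknown = arch_name(99)
```

Keeping only the conversion inside the `try` makes it clear which statement is expected to raise.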
