nvbug6084457: Fix device architecture handling and NVLink link count query#1937
nvbug6084457: Fix device architecture handling and NVLink link count query#1937mdboom wants to merge 4 commits intoNVIDIA:mainfrom
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
Co-authored-by: Phillip Cloud <417981+cpcloud@users.noreply.github.com>
|
Marking this as "ready to review". As a follow-on when we do the 13.3 bring up, we will need to add logic to make |
|
rwgk
left a comment
There was a problem hiding this comment.
LGTM, but since I'm not immersed in the context, it wasn't easy to be sure about the details. It'd be helpful to get an agent summary of the changes in the PR description.
| # can't be more specific about how many links we should find. | ||
| if value.nvml_return == nvml.Return.SUCCESS: | ||
| assert value.value.ui_val <= nvml.NVLINK_MAX_LINKS, f"Unexpected link count {value.value.ui_val}" | ||
| assert value.value.ui_val[0] <= nvml.NVLINK_MAX_LINKS, f"Unexpected link count {value.value.ui_val[0]}" |
There was a problem hiding this comment.
LGTM, based on my understanding that this fixes a subtle latent bug in the test: before, value.value.ui_val was a 1-element NumPy array, so this was doing an array-to-scalar comparison and relying on NumPy's size-1 truthiness. With [0], the test now compares the actual scalar field value to nvml.NVLINK_MAX_LINKS, which I assume is the intended behavior.
| try: | ||
| return DeviceArch(arch) | ||
| except ValueError: | ||
| return nvml.DeviceArch.UNKNOWN |
There was a problem hiding this comment.
LGTM, based on my understanding that nvml.device_get_architecture() returns a raw integer architecture code, and DeviceArch(arch) is the enum conversion/validation step. This change seems to make Device.arch handle newer/unknown architecture IDs gracefully by returning UNKNOWN instead of raising ValueError. Please correct me if I'm missing any nuance.
How easy or difficult would it be to add a test that covers the except path?
| arch = nvml.DeviceArch(arch) | ||
| return arch.name | ||
| except ValueError: | ||
| return f"UNKNOWN({arch})" |
There was a problem hiding this comment.
Readability nit, to make the intent more obvious:
try:
arch = nvml.DeviceArch(arch)
except ValueError:
return f"UNKNOWN({arch})"
return arch.name
Nit 2: UNKNOWN_ARCH_ID, so the resulting warning explains what the magic integer is (it can be guessed, this is just a little more helpful).
Filing as a draft because this is still only a partial fix for the reported bug. The final fix requires coordination with upstream NVML.