Specify runtime_path partition size #1191

Merged
ajdecon merged 1 commit into NVIDIA:master from hkmc-airlab:resize-run-partition
Jul 5, 2022

@seyong-um
Contributor

Problem:

  • If a Docker image is larger than the /run partition, the Slurm job fails with "no space left on device".
  • For example, nvcr.io/nvidia/pytorch is about 14 GB, while /run on an AWS g4dn.2xlarge instance is about 3 GB.
  • Related issue: srun fails (pyxis#53)
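
The failure described above can be anticipated by comparing free space against the expected image size. The following is a minimal sketch and not part of this PR: the helper name `image_fits` is hypothetical, and it only checks the free space of the filesystem that contains a given path.

```python
import shutil


def image_fits(path: str, image_bytes: int) -> bool:
    """Return True if the filesystem containing `path` has at least
    `image_bytes` of free space."""
    # shutil.disk_usage reports total, used, and free bytes for the
    # filesystem that holds `path`.
    return shutil.disk_usage(path).free >= image_bytes
```

On a compute node, a call such as `image_fits("/run", 14 * 10**9)` would indicate whether an image of roughly 14 GB could be staged under /run.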

Cause:

  • Pyxis uses runtime_path to store the squashfs file temporarily.
  • The default location of runtime_path is /run/pyxis.
  • The size of /run depends on the distribution; on Ubuntu it is 10% of physical memory.
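
The sizes quoted in the problem statement are consistent with this rule. Below is a rough sketch of the arithmetic, assuming the 10% figure given above and assuming 32 GiB of physical memory for a g4dn.2xlarge (the memory size is not stated in this PR). The function name `default_run_size` is hypothetical.

```python
GIB = 2**30


def default_run_size(mem_bytes: int, percent: int = 10) -> int:
    """Estimate the tmpfs size of /run as a percentage of physical memory."""
    # Integer arithmetic avoids floating-point rounding.
    return mem_bytes * percent // 100


# Assumption: 32 GiB of physical memory on the instance.
run_bytes = default_run_size(32 * GIB)

# Roughly 14 GB, as quoted for nvcr.io/nvidia/pytorch.
image_bytes = 14 * 10**9

# True when the image cannot fit in the estimated /run size.
print(run_bytes < image_bytes)
```

Under these assumptions the estimate comes to a little over 3 GB, which matches the "about 3 GB" observed on the instance and is far below the image size.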

Suggestion:

  • Add a resize_run_partition option to specify the /run partition size.
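
For context, an existing tmpfs mount such as /run can generally be resized in place by remounting it with a new `size` option. The sketch below only builds that command. It illustrates the general mechanism, not the implementation in this PR, and the function name `remount_command` and the `50G` value are assumptions.

```python
from typing import List


def remount_command(mount_point: str, size: str) -> List[str]:
    """Build a `mount` command that remounts a tmpfs with a new size limit."""
    # `-o remount,size=...` changes the size limit of an existing tmpfs
    # mount without unmounting it. Running the command requires root.
    return ["mount", "-o", f"remount,size={size}", mount_point]


# Print the command instead of executing it.
print(" ".join(remount_command("/run", "50G")))
```

Because tmpfs is backed by memory, any size chosen this way should leave enough memory for the jobs themselves.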

I know we can change runtime_path to another location, but I believe it is worthwhile to keep using tmpfs, which resides in memory and is faster than other filesystems, when compute nodes have sufficient memory.

Add resize_run_partition option, which enables specifying the size of the /run partition,
the default location of the pyxis `runtime_path`, used to store the squashfs file temporarily.

If a Docker image is bigger than the partition, the Slurm job fails.

Signed-off-by: Seyong Um <seyong.um@hyundai.com>
Contributor

@ajdecon left a comment

LGTM!

  • All CI tests are passing, so the default case is working well
  • Manual test with resize_run_partition: false behaved as expected
  • Manual test with resize_run_partition: true increased the size of /run correctly

@ajdecon merged commit cc3585a into NVIDIA:master on Jul 5, 2022
@dholt mentioned this pull request on Aug 23, 2022