This repository was archived by the owner on Nov 17, 2023. It is now read-only.

GPU memory leak when using gluon.data.DataLoader with num_workers>0 #20959

@ann-qin-lu

Description


GPU memory leaks when using gluon.data.DataLoader after upgrading to CUDA 11.1 / cuDNN 8.2.x (also tested with the latest CUDA 11.5 + cuDNN 8.3.x, which still leaks). Minimal code to reproduce is attached below.

There is no memory leak with the older CUDA version (CUDA 10.1 + cuDNN 7.6.5).

Error Message

No error message is raised; GPU memory usage keeps increasing during training.

To Reproduce


import gc

import mxnet as mx
import mxnet.gluon as gl

if __name__ == "__main__":
    gpu_ctx = mx.gpu()
    model = gl.nn.Embedding(10, 5)
    model.initialize(ctx=gpu_ctx)
    X = mx.random.uniform(shape=(1000, 3))
    num_workers_list = [0, 4, 8]
    for num_workers in num_workers_list:

        for epoch in range(5):
            # recreate the dataset and loader every epoch
            dataset = mx.gluon.data.dataset.ArrayDataset(X)
            data_loader = gl.data.DataLoader(
                dataset,
                batch_size=1,
                num_workers=num_workers,
            )
            for batch in data_loader:
                # move data to gpu
                data_gpu = batch.copyto(gpu_ctx)
                # forward
                l = model(data_gpu)
                # force immediate compute
                l.asnumpy()
            # drop references, run gc, and release cached gpu memory
            mx.nd.waitall()
            del dataset
            del data_loader
            gc.collect()
            gpu_ctx.empty_cache()
            mx.nd.waitall()

            free, total = mx.context.gpu_memory_info(0)
            print(f"num_workers: {num_workers} epoch {epoch}: "
                  f"current gpu memory {(total - free) / (1024 ** 3)} GB, "
                  f"Total gpu memory {total / (1024 ** 3)} GB.")
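The "current gpu memory" figure printed above is derived from the `(free, total)` byte counts returned by `mx.context.gpu_memory_info`. As a standalone sketch of that arithmetic (the byte counts below are hypothetical values on the same scale as the run):

```python
GIB = 1024 ** 3  # bytes per GiB

def used_gb(free_bytes, total_bytes):
    """Convert the (free, total) byte counts returned by
    mx.context.gpu_memory_info into used memory in GiB."""
    return (total_bytes - free_bytes) / GIB

# Hypothetical example values, roughly matching the run below
free = 15_465_000_000
total = 16_945_000_000
print(f"current gpu memory {used_gb(free, total):.3f} GB, "
      f"Total gpu memory {total / GIB:.3f} GB")
```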


### Output with MXNet 1.9 built with CUDA 11.1 + cuDNN 8.2.0 (memory leak when `num_workers` > 0; also reproduced with the latest CUDA 11.5 + cuDNN 8.3.x)

num_workers: 0 epoch 0: current memory 1.381591796875 GB, Total memory 15.78173828125 GB.
num_workers: 0 epoch 1: current memory 1.381591796875 GB, Total memory 15.78173828125 GB.
num_workers: 0 epoch 2: current memory 1.381591796875 GB, Total memory 15.78173828125 GB.
num_workers: 0 epoch 3: current memory 1.381591796875 GB, Total memory 15.78173828125 GB.
num_workers: 0 epoch 4: current memory 1.381591796875 GB, Total memory 15.78173828125 GB.
num_workers: 4 epoch 0: current memory 1.483154296875 GB, Total memory 15.78173828125 GB.
num_workers: 4 epoch 1: current memory 1.582763671875 GB, Total memory 15.78173828125 GB.
num_workers: 4 epoch 2: current memory 1.683349609375 GB, Total memory 15.78173828125 GB.
num_workers: 4 epoch 3: current memory 1.782958984375 GB, Total memory 15.78173828125 GB.
num_workers: 4 epoch 4: current memory 1.880615234375 GB, Total memory 15.78173828125 GB.
num_workers: 8 epoch 0: current memory 1.980224609375 GB, Total memory 15.78173828125 GB.
num_workers: 8 epoch 1: current memory 2.080810546875 GB, Total memory 15.78173828125 GB.
num_workers: 8 epoch 2: current memory 2.180419921875 GB, Total memory 15.78173828125 GB.
num_workers: 8 epoch 3: current memory 2.281982421875 GB, Total memory 15.78173828125 GB.
num_workers: 8 epoch 4: current memory 2.380615234375 GB, Total memory 15.78173828125 GB.
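The leak rate can be read off the numbers above: with `num_workers=4` the used memory grows by roughly 0.1 GB per epoch. A small sketch computing the per-epoch growth from the reported values:

```python
# Used-memory readings (GB) for num_workers=4, epochs 0-4,
# copied from the run above.
readings = [1.483154296875, 1.582763671875, 1.683349609375,
            1.782958984375, 1.880615234375]

# Epoch-to-epoch growth in used GPU memory
deltas = [b - a for a, b in zip(readings, readings[1:])]
avg_leak = sum(deltas) / len(deltas)
print(f"average growth per epoch: {avg_leak:.3f} GB")
```

Every delta is positive and close to 0.1 GB, which is consistent with per-epoch worker setup (not batch processing itself) leaking a fixed amount of device memory.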
  

### Output with MXNet 1.9 built with CUDA 10.1 + cuDNN 7.6.5 (no memory leak)

num_workers: 0 epoch 0: current memory 1.301513671875 GB, Total memory 15.78173828125 GB.
num_workers: 0 epoch 1: current memory 1.301513671875 GB, Total memory 15.78173828125 GB.
num_workers: 0 epoch 2: current memory 1.301513671875 GB, Total memory 15.78173828125 GB.
num_workers: 0 epoch 3: current memory 1.301513671875 GB, Total memory 15.78173828125 GB.
num_workers: 0 epoch 4: current memory 1.301513671875 GB, Total memory 15.78173828125 GB.
num_workers: 4 epoch 0: current memory 1.301513671875 GB, Total memory 15.78173828125 GB.
num_workers: 4 epoch 1: current memory 1.301513671875 GB, Total memory 15.78173828125 GB.
num_workers: 4 epoch 2: current memory 1.301513671875 GB, Total memory 15.78173828125 GB.
num_workers: 4 epoch 3: current memory 1.301513671875 GB, Total memory 15.78173828125 GB.
num_workers: 4 epoch 4: current memory 1.301513671875 GB, Total memory 15.78173828125 GB.
num_workers: 8 epoch 0: current memory 1.301513671875 GB, Total memory 15.78173828125 GB.
num_workers: 8 epoch 1: current memory 1.301513671875 GB, Total memory 15.78173828125 GB.
num_workers: 8 epoch 2: current memory 1.301513671875 GB, Total memory 15.78173828125 GB.
num_workers: 8 epoch 3: current memory 1.301513671875 GB, Total memory 15.78173828125 GB.
num_workers: 8 epoch 4: current memory 1.301513671875 GB, Total memory 15.78173828125 GB.

What have you tried to solve it?

  1. Forcing Python garbage collection (`gc.collect()`) does not help.
  2. Upgrading CUDA/cuDNN to the latest versions (CUDA 11.5 + cuDNN 8.3.x) does not help.

Environment


Environment Information
----------Python Info----------
Version      : 3.6.14
Compiler     : GCC 7.5.0
Build        : ('default', 'Feb 19 2022 10:06:15')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
No corresponding pip install for current python.
----------MXNet Info-----------
Version      : 1.9.0
Directory    : /efs-storage/debug_log/test-runtime/lib/python3.6/site-packages/mxnet
Commit hash file "/efs-storage/debug_log/test-runtime/lib/python3.6/site-packages/mxnet/COMMIT_HASH" not found. Not installed from pre-built package or built from source.
Library      : ['/efs-storage/debug_log/test-runtime//lib/libmxnet.so']
Build features:
✔ CUDA
✔ CUDNN
✖ NCCL
✖ CUDA_RTC
✖ TENSORRT
✔ CPU_SSE
✔ CPU_SSE2
✔ CPU_SSE3
✖ CPU_SSE4_1
✖ CPU_SSE4_2
✖ CPU_SSE4A
✖ CPU_AVX
✖ CPU_AVX2
✔ OPENMP
✖ SSE
✖ F16C
✖ JEMALLOC
✔ BLAS_OPEN
✖ BLAS_ATLAS
✖ BLAS_MKL
✖ BLAS_APPLE
✔ LAPACK
✔ MKLDNN
✔ OPENCV
✖ CAFFE
✖ PROFILER
✖ DIST_KVSTORE
✖ CXX14
✖ INT64_TENSOR_SIZE
✔ SIGNAL_HANDLER
✖ DEBUG
✖ TVM_OP
----------System Info----------
Platform     : Linux-4.14.232-177.418.amzn2.x86_64-x86_64-with
system       : Linux
node         : ip-10-0-10-233.ec2.internal
release      : 4.14.232-177.418.amzn2.x86_64
version      : #1 SMP Tue Jun 15 20:57:50 UTC 2021
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              64
On-line CPU(s) list: 0-63
Thread(s) per core:  2
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:            1
CPU MHz:             2630.103
CPU max MHz:         3000.0000
CPU min MHz:         1200.0000
BogoMIPS:            4600.04
Hypervisor vendor:   Xen
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            46080K
NUMA node0 CPU(s):   0-15,32-47
NUMA node1 CPU(s):   16-31,48-63
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq monitor est ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt ida
