[v1.x] provide a faster PrefetchedDataLoader #19748
szha merged 16 commits into apache:v1.x from Neutron3529:patch-1
Conversation
Since my programming skill is limited, this `PrefetchedDataLoader` only allows generating a single iterator at a time.
The benefit of `PrefetchedDataLoader` is that it provides better performance as a simple drop-in replacement in most existing code.
test:
```python
$ cat iternew.py && python iternew.py
import mxnet as mx
from mxnet.gluon.data import PrefetchedDataLoader as DataLoader, ArrayDataset
from time import sleep, perf_counter_ns

train_data = ArrayDataset(mx.nd.array([[i] for i in range(50000)]),
                          mx.nd.array([[99 - i] for i in range(50000)]))
test_data = ArrayDataset(mx.nd.array([[i] for i in range(10000)]),
                         mx.nd.array([[99 - i] for i in range(10000)]))

def transform_train(sample):
    sleep(0.0016)
    return sample

def transform_test(sample):
    sleep(0.0008)
    return sample

train_iter = DataLoader(train_data.transform_first(transform_train), batch_size=500, num_workers=10)
test_iter = DataLoader(test_data.transform_first(transform_test), batch_size=500, num_workers=10)

tic = perf_counter_ns()
for epoch in range(10):
    print("epoch" + str(epoch) + " start at " + str(round((perf_counter_ns() - tic) * 1e-9, 2)) + "s")
    for i in train_iter:
        sleep(0.1)
    print("  finished train phase at " + str(round((perf_counter_ns() - tic) * 1e-9, 2)) + "s")
    for i in test_iter:
        sleep(0.05)
    print("  finished test phase at " + str(round((perf_counter_ns() - tic) * 1e-9, 2)) + "s")
print("cost=" + str((perf_counter_ns() - tic) * 1e-9) + "s")
epoch0 start at 0.0s
finished train phase at 11.25s
finished test phase at 12.31s
epoch1 start at 12.31s
finished train phase at 22.62s
finished test phase at 23.68s
epoch2 start at 23.68s
finished train phase at 34.03s
finished test phase at 35.09s
epoch3 start at 35.09s
finished train phase at 45.41s
finished test phase at 46.48s
epoch4 start at 46.48s
finished train phase at 56.82s
finished test phase at 57.88s
epoch5 start at 57.88s
finished train phase at 68.24s
finished test phase at 69.3s
epoch6 start at 69.3s
finished train phase at 79.65s
finished test phase at 80.71s
epoch7 start at 80.71s
finished train phase at 91.04s
finished test phase at 92.11s
epoch8 start at 92.11s
finished train phase at 102.46s
finished test phase at 103.53s
epoch9 start at 103.53s
finished train phase at 113.89s
finished test phase at 114.95s
cost=114.94954171600001s
```
(cost is ~`129.67192333600002s` if we use the default `DataLoader` rather than `PrefetchedDataLoader`)
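For context, the core idea behind this speedup can be sketched in plain Python. This is an illustrative sketch, not the PR's actual implementation (`PrefetchIter` and its internals are hypothetical names): a background thread fills a bounded queue with batches, so the next batch is already prepared while the caller is still processing the current one.

```python
import threading
import queue

class PrefetchIter:
    """Wraps any iterable and keeps up to `prefetch` items ready in a queue."""
    def __init__(self, iterable, prefetch=2):
        self._queue = queue.Queue(maxsize=prefetch)
        self._sentinel = object()  # marks the end of the underlying iterable
        self._thread = threading.Thread(
            target=self._worker, args=(iterable,), daemon=True)
        self._thread.start()

    def _worker(self, iterable):
        # Runs in the background: fetches items ahead of the consumer.
        for item in iterable:
            self._queue.put(item)
        self._queue.put(self._sentinel)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._queue.get()
        if item is self._sentinel:
            raise StopIteration
        return item

batches = PrefetchIter(range(5))
print(list(batches))
```

When each batch takes noticeable time to produce (as with the `sleep` in the transforms above), the consumer overlaps its work with the producer's, which is where the saved wall-clock time comes from.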
Hey @Neutron3529 , Thanks for submitting the PR
CI supported jobs: [clang, centos-gpu, edge, windows-gpu, website, miscellaneous, sanity, centos-cpu, windows-cpu, unix-cpu, unix-gpu]
There are already faster dataloaders in MXNet 2.0, but in v1.x the existing dataloader is slower; it can be improved by changing its prefetch behavior to match what 2.0 does.
```python
$ cat iternew.py && python iternew.py
import mxnet as mx
from mxnet.gluon.data import PrefetchedDataLoader as DataLoader, ArrayDataset
from time import sleep, perf_counter_ns

train_data = ArrayDataset(mx.nd.array([[i] for i in range(50000)]),
                          mx.nd.array([[99 - i] for i in range(50000)]))
test_data = ArrayDataset(mx.nd.array([[i] for i in range(10000)]),
                         mx.nd.array([[99 - i] for i in range(10000)]))

def transform_train(sample):
    sleep(0.0016)
    return sample

def transform_test(sample):
    sleep(0.0008)
    return sample

train_iter = DataLoader(train_data.transform_first(transform_train), batch_size=500, num_workers=10)
test_iter = DataLoader(test_data.transform_first(transform_test), batch_size=500, num_workers=10)

tic = perf_counter_ns()
for epoch in range(10):
    print("epoch" + str(epoch) + " start at " + str(round((perf_counter_ns() - tic) * 1e-9, 2)) + "s")
    for i in train_iter:
        sleep(0.1)
    print("  finished train phase at " + str(round((perf_counter_ns() - tic) * 1e-9, 2)) + "s")
    for i in test_iter:
        sleep(0.05)
    print("  finished test phase at " + str(round((perf_counter_ns() - tic) * 1e-9, 2)) + "s")
print("cost=" + str((perf_counter_ns() - tic) * 1e-9) + "s")
epoch0 start at 0.0s
finished train phase at 11.28s
finished test phase at 12.35s
epoch1 start at 12.35s
finished train phase at 22.73s
finished test phase at 23.79s
epoch2 start at 23.79s
finished train phase at 34.15s
finished test phase at 35.21s
epoch3 start at 35.22s
finished train phase at 45.59s
finished test phase at 46.66s
epoch4 start at 46.66s
finished train phase at 57.01s
finished test phase at 58.07s
epoch5 start at 58.07s
finished train phase at 68.43s
finished test phase at 69.5s
epoch6 start at 69.5s
finished train phase at 79.87s
finished test phase at 80.93s
epoch7 start at 80.93s
finished train phase at 91.3s
finished test phase at 92.37s
epoch8 start at 92.37s
finished train phase at 102.74s
finished test phase at 103.8s
epoch9 start at 103.8s
finished train phase at 114.17s
finished test phase at 115.23s
cost=115.23376344s
```
add unittest for PrefetchedDataLoader
update document
The previous test shows that there may be something wrong with `_MultiWorkerIter` because `__iter__()` is called inappropriately; I tried to fix it by moving the call here.
fix the outdated PrefetchedDataLoader
Most of the behavior is unchanged, since it only prefetches data rather than modifying it.
Due to the lazy evaluation, the iterator will not be created eagerly. What's more, for a regular training procedure:

```python
>>> train_data = ArrayDataset([i for i in range(10)], [9 - i for i in range(10)])
>>> def transform_train(sample):
...     if sample == 0: print('(pre)fetching data here')
...     return sample
...
>>> train_iter = DataLoader(train_data.transform_first(transform_train),
...                         auto_reload=False, batch_size=1, num_workers=1)
>>> test_data = ArrayDataset([i for i in range(10)], [9 - i for i in range(10)])
>>> test_iter = DataLoader(test_data, batch_size=1, num_workers=1)
>>> for epoch in range(200):
...     # there is almost no difference between this and the default DataLoader
...     for data, label in train_iter:
...         pass  # training...
...     for data, label in test_iter:
...         pass  # testing...
```

there is only one iterator per DataLoader at a time. Most of the time, users will not consider what happens inside the dataloader.
@mxnet-bot run ci [centos-cpu]
Jenkins CI successfully triggered: [centos-cpu]
leezu
left a comment
Please change the default of auto_reload to ensure legacy code is not affected by this PR.
This reverts commit 2c8d858.
This reverts commit 7d934a7.
Description
There are already faster dataloaders in MXNet 2.0, but in v1.x the existing dataloader is slower; it can be improved by changing its prefetch behavior to match what 2.0 does.
test:
(cost is ~`129.67192333600002s` if we use the default `DataLoader` rather than `PrefetchedDataLoader`.)
(The test was done using v1.7.0; the newest v1.x is on my GPU server and running, and I do not want to bother my GPU server.)
Checklist
Essentials
Changes
Added an `auto_reload` flag for `DataLoader`, which loads data faster in the first several batches compared to the default `DataLoader`.
Comments
Now, the default behavior of `DataLoader` is to prefetch immediately after it is created, rather than waiting until its `__iter__()` is called.
This behavior is compatible with MXNet 2.0's dataloader in `nopython` mode. Since it really speeds up the program (~5% faster with my handwritten autoaugment transform function when training CIFAR-100 with an RTX 3090, batch_size=250, wide resnet 16-4, and the deep mutual learning technique) and makes almost no other difference (all the parameters that create the prefetched dataloader are private and should not be modified by any other code), I switched the default behavior to prefetch.
nopythonmode, since it really speeds up the program (~5% faster with my handwriting autoaugment transform function when training CIFAR-100 with RTX 3090, batch_size=250, wide resnet 16-4 and deep mutual learning technique.) and provide almost no difference(all the parameter create the prefetched dataloader is private and should not be modified by any other code), I switch the default behavior to prefetch.