@mxnet-label-bot add [pr-awaiting-review]
@ptrendx can you share a benchmark on SGD performance when
This PR is part of upstreaming improvements to MXNet that are available in NVIDIA's NGC 18.11 MXNet container. I will use results from that container to show the impact once all the other improvements are in place. The benchmark shown is ResNet v1.5 training on single V100 32GB in DGX1-V, batch size 32.
Thanks for the contribution @ptrendx!

@@ -98,6 +99,9 @@ def dict_equ(a, b):

@with_seed()
Is it not tested with test_trainer?
Are you asking why I am not changing the test_trainer as well since it should fail with the MXNET_UPDATE_ON_KVSTORE=0 option set? Since you made a PR to fix that test, I did not change it. The MXNET_UPDATE_ON_KVSTORE=0 option is not set in CI (although logic for the aggregated SGD itself is tested by the SGD test).
@ptrendx Can you please rebase this PR?

anirudhacharya left a comment
In the PR description you said that test_sgd covers the new code paths, but in _update_impl there is an if statement on aggregate:
if aggregate:
...
else:
...
Can you explain how the else block is covered by the current test_sgd code?
for weight, grad in zip(weights, grads):
    assert(isinstance(weight, NDArray))
    assert(isinstance(grad, NDArray))
aggregate = (aggregate and
@anirudhacharya As you can see, aggregate is set to True at the beginning and changes to False when a non-default storage type is encountered, so testing with both dense and sparse data exercises both branches of the code.
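To make the branch-selection logic above concrete, here is a minimal sketch of how the aggregate flag is derived. This is a simplified stand-in for the _update_impl logic discussed in this thread, not the actual MXNet source; the function name and the string storage types ('default', 'row_sparse') are illustrative assumptions.

```python
def choose_path(weight_stypes, grad_stypes, aggregate_num=4):
    """Return which update path would be taken for the given storage types.

    Hypothetical helper: mirrors the described logic where `aggregate`
    starts True (when the optimizer opted in) and flips to False as soon
    as any weight or gradient has a non-default storage type.
    """
    aggregate = aggregate_num > 0
    for w_stype, g_stype in zip(weight_stypes, grad_stypes):
        aggregate = (aggregate and
                     w_stype == 'default' and
                     g_stype == 'default')
    return 'aggregate' if aggregate else 'per-weight'

# Dense weights and gradients take the fused (aggregate) branch.
print(choose_path(['default', 'default'], ['default', 'default']))
# A single sparse gradient forces the per-weight (else) branch, which is
# why running test_sgd with both dense and sparse data covers both paths.
print(choose_path(['default', 'default'], ['default', 'row_sparse']))
```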
template<typename DType, typename MPDType>
struct MultiSGDKernelParam {
  static const int N = 60;

@anirudhacharya This is the reason for 60: I pass this struct as a kernel parameter, which has a size limit of 4 kB.
Is there anything else needed for this PR?

anirudhacharya left a comment

I think the code LGTM; some minor doc fixes are needed.
self._index_update_count = {}
self.clip_gradient = clip_gradient
self.multi_precision = multi_precision
self.aggregate_num = 0
Please add this to the parameter list in the class doc.
It is not really a parameter though - it is up to the optimizer (not the user) to override this value if they support aggregated execution.
super(SGD, self).__init__(**kwargs)
self.momentum = momentum
self.lazy_update = lazy_update
self.aggregate_num = int(os.getenv('MXNET_OPTIMIZER_AGGREGATION_SIZE', "4"))
In line 510 can you add a section on aggregate updates, and in line 524 can you also point to these two methods - multi_sgd_mom_update and multi_mp_sgd_update - as optimizer update rules?
I wrote the section on aggregate updates, but I'm not sure about pointing to the new methods in line 524 - they use the same algorithm as the sgd_update and sgd_mom_update functions, so pointing to those functions for the details of the algorithm seems sufficient.
I still think the 'multi' update methods should show up in the SGD doc description. But I am okay with the code owner/merger making a call on this.
I don't think it's necessary to point to these two methods, since the algorithm is the same one.
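For reference, the environment-variable mechanism discussed above can be sketched as follows. The variable name MXNET_OPTIMIZER_AGGREGATION_SIZE and the default of 4 come from the PR; the surrounding class is a stand-in, not the actual MXNet SGD implementation.

```python
import os

class SGDSketch:
    """Stand-in for an optimizer that supports aggregated updates."""
    def __init__(self):
        # How many weight updates to bundle into one fused kernel call.
        # A value of 0 would disable aggregation entirely.
        self.aggregate_num = int(
            os.getenv('MXNET_OPTIMIZER_AGGREGATION_SIZE', "4"))

# With the variable unset, the default bundle size of 4 is used.
os.environ.pop('MXNET_OPTIMIZER_AGGREGATION_SIZE', None)
print(SGDSketch().aggregate_num)  # 4

# Users can raise the bundle size without touching optimizer code.
os.environ['MXNET_OPTIMIZER_AGGREGATION_SIZE'] = '8'
print(SGDSketch().aggregate_num)  # 8
```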
LGTM. @eric-haibin-lin Open question brought up by @anirudh2290: in your opinion, should the 'multi' update methods show up in the SGD doc description?
eric-haibin-lin left a comment

Looks good pending some suggestions for documentation. Awesome work!
This reverts commit 0a45e1a.
This reverts commit fabc318.
* Aggregate SGD
* Make OpWrapperGenerator understand Tuple<float>
* Trigger
* Add NNVM Tuple to cpp-package op.h
* Trigger
* Fix pylint aggregate SGD
* Update info about new ENV vars and modifying 2 tests that require update_on_kvstore to be true
* Fix
* Aggregate SGD support for Gluon trainer
* Added text to doc about aggregate update in SGD optimizer
* Docs changes from review
Missing type information for some parameters, e.g. from here: https://mxnet.incubator.apache.org/api/python/symbol/symbol.html#mxnet.symbol.multi_mp_sgd_mom_update
Description
Currently, MXNet optimizers are invoked one weight at a time. This leads to a lot of synchronization overhead, as updates (especially for convolutions and batchnorm) tend to be small, but each one needs to be synchronized upon.
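The overhead argument above can be illustrated with plain Python stand-ins (not MXNet kernels): counting one "launch" per update call, the per-weight path pays a launch per weight, while the aggregated path pays one launch per bundle. (The real implementation bundles at most MXNET_OPTIMIZER_AGGREGATION_SIZE weights per launch; that cap is omitted here for brevity.)

```python
calls = 0  # stand-in for the number of kernel launches / sync points

def sgd_update(weight, grad, lr):
    """One update for one weight: one 'launch' per call."""
    global calls
    calls += 1
    return [w - lr * g for w, g in zip(weight, grad)]

def multi_sgd_update(weights, grads, lr):
    """Fused update: one 'launch' covers the whole bundle of weights."""
    global calls
    calls += 1
    return [[w - lr * g for w, g in zip(ws, gs)]
            for ws, gs in zip(weights, grads)]

weights = [[1.0, 2.0]] * 50
grads = [[0.1, 0.1]] * 50

for w, g in zip(weights, grads):
    sgd_update(w, g, 0.01)       # 50 small, separately synchronized calls
per_weight_calls = calls

calls = 0
multi_sgd_update(weights, grads, 0.01)  # a single call for all 50 weights
print(per_weight_calls, calls)  # 50 1
```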
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
* Ability to set the update_on_kvstore value via the environment variable MXNET_UPDATE_ON_KVSTORE (default is 1, which is consistent with the current behavior)
* When update_on_kvstore is False, in the case of the SGD optimizer it attempts to bundle updates of multiple weights together and launches a single kernel to perform them all, reducing the number of kernel calls and synchronizations.

Comments