This repository was archived by the owner on Nov 17, 2023. It is now read-only.
Speed fused_op compilation by caching ptx and jit-compiled device functions#16783
Merged
ptrendx merged 1 commit into apache:master on Nov 12, 2019
Conversation
Contributor (Author)
As reported originally with pointwise fusion enabled: As timed on this latest passing CI run for centos-gpu:
Contributor (Author)
Using the centos-gpu unittest runtime now as a metric: Before op fusion: 40 minutes
ptrendx pushed a commit to ptrendx/mxnet that referenced this pull request on Nov 15, 2019
ptrendx added a commit that referenced this pull request on Nov 16, 2019
…, #16792) (#16832)
* Fix nightly build (#16773)
* Remove dependency on tvmop.conf
* Fix binaries dependencies for ni nightly
* Add comments
* Update tvmop.py
* Fix rebase
* Fix (#16781)
* Speed fused_op compilation by caching ptx and jit-compiled functions (#16783)
* [Numpy] Fix collect_params().zero_grad() in gluon numpy interface (#16716)
* fix zero_grad
* Update parameter.py
* add test
* fix
* Mixed data type binary ops (#16699)
* support mixed-precision binary operations
* improvement for documentations and error messages
* Support boolean elemwise/broadcast binary add, multiply and true_divide (#16728)
* support pure boolean elemwise/broadcast binary op
* switch to unique_tpr
* fix the test error
* Fix rtrue_divide grad (#16769)
* Fix rtrue_divide_scalar
* More tests
* Fix numpy-compatible mean output type for integer inputs (#16792)
* fix mean output type for integer inputs
* enable for windows
apeforest pushed a commit that referenced this pull request on Nov 19, 2019
Description
This PR speeds up the dynamic NVRTC compilation of fused ops in response to @rondogency's comment #15167 (comment). As reported there, the runtime of the 3 mentioned unittests had grown drastically with fusion enabled, to 17.5 minutes in total. With this PR, the runtime drops to 1 minute; the original fusion-disabled runtime was 30 seconds.
The process of runtime compilation of NVIDIA gpu kernels involves 2 steps:
- compiling the CUDA code to PTX assembly (performed once per GPU architecture)
- translating the PTX assembly to binary and loading it into a GPU's set of runnable kernels (performed once per GPU device). This latter step produces the CUfunction needed to execute the kernel on the device.
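The two steps above can be sketched with the NVRTC and CUDA driver APIs (a minimal illustration, not this PR's actual implementation; error checking is omitted, and the program and kernel names are placeholders):

```cuda
#include <cuda.h>
#include <nvrtc.h>
#include <string>

// Step 1: CUDA source -> PTX (performed once per GPU architecture).
std::string CompileToPtx(const char* src, const char* arch_flag) {
  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, src, "fused_op.cu", 0, nullptr, nullptr);
  const char* opts[] = {arch_flag};  // e.g. "--gpu-architecture=compute_70"
  nvrtcCompileProgram(prog, 1, opts);
  size_t ptx_size;
  nvrtcGetPTXSize(prog, &ptx_size);
  std::string ptx(ptx_size, '\0');
  nvrtcGetPTX(prog, &ptx[0]);
  nvrtcDestroyProgram(&prog);
  return ptx;
}

// Step 2: PTX -> loaded CUfunction (performed once per GPU device).
CUfunction LoadKernel(const std::string& ptx, const char* kernel_name) {
  CUmodule module;
  cuModuleLoadData(&module, ptx.c_str());
  CUfunction func;
  cuModuleGetFunction(&func, module, kernel_name);
  return func;
}
```

Without caching, both steps run for every fused op instance, even when the generated source is identical.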
After realizing that the slowed-down unittests were creating many identical fused ops, I added a cache of the PTX and CUfunctions. The cache comprises a mapping (for each GPU architecture) from the CUDA source code to the PTX and to any CUfunctions created from it.
It's worth remembering that the fusion framework targets the typical scenario of creating a model's graph once and executing it many times. The CI was adversely impacted because it often executes a model's graph just once after creation.
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
Comments