Float FMA vs Integer DP4A & DPX Instructions #35
CUDA natively supports Fused Multiply-Accumulate (FMA) operations for every float type, including `f16` and `bf16`. It also provides DP4A instructions for 8-bit integer dot products with 32-bit accumulators, and `umul24` instructions for 24-bit integer multiplication. Starting with Hopper, Dynamic Programming eXtensions (DPX) were added for combinatorial problems; they can be used to implement Algebraic Graph Theory algorithms as matrix multiplications over alternative semirings. How do those instructions stack up, and how much performance can we expect from recent State-of-the-Art GPUs like the Nvidia H200?
- `f64` FMA: 4.5 T
- `i64` FMA: 3.1 T
- `f32` FMA: 22 T
- `i32` FMA: 15.5 T ...so we should always prefer 32-bit ops
- `u8u32` DP4A: 39.3 T
- `u24u32` UMUL: 13.4 T ...not really better than `i32` FMA
- `f16` FMA on Volta: 12.2 T
- `bf16` FMA on Ampere: 12.2 T
- `u16` and `u32` on Hopper: 11 T
- `i16` and `i32` on Hopper: 11 T
- `i32` on Hopper: 27 T

Check the code and inline comments for more details!
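
To make the semiring remark concrete, here is a rough sketch (again, not code from this diff) of a min-plus, i.e. "tropical", dot product: the building block of all-pairs shortest paths expressed as a matrix product. The helper name `min_plus_dot` is made up for the example.

```cuda
// Illustrative device helper: one element of a min-plus matrix product,
// C[i][j] = min_k (A[i][k] + B[k][j]), written against a row of A and a
// column of B that are already contiguous in memory.
__device__ int min_plus_dot(int const *a_row, int const *b_col, int n) {
    // Additive identity of the min-plus semiring: a large sentinel that
    // still fits into 32 bits after one addition, so it cannot overflow.
    int best = 1 << 30;
    for (int k = 0; k < n; ++k)
        // __viaddmin_s32 computes min(a + b, c): a single DPX instruction
        // on Hopper (sm_90, CUDA 12+), emulated in software on older GPUs.
        best = __viaddmin_s32(a_row[k], b_col[k], best);
    return best;
}
```

Swapping `__viaddmin_s32` for `__viaddmax_s32` gives the max-plus variant used for longest-path style problems.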