Float FMA vs Integer DP4A & DPX Instructions #35

Open · wants to merge 7 commits into main
Conversation

@ashvardanian (Owner) commented on Feb 12, 2025

CUDA natively supports fused multiply-add (FMA) operations for every floating-point type, including f16 and bf16. It also provides DP4A instructions for 8-bit integer dot products with 32-bit accumulation, and the umul24 instruction for 24-bit integer multiplication. Starting with Hopper, Dynamic Programming eXtensions (DPX) were added for combinatorial problems; they can also be used to implement Algebraic Graph Theory algorithms as matrix multiplications over alternative semirings.
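
For context, here is a minimal sketch of what exercising those instruction families from device code can look like. This is not the benchmark kernel itself: the kernel name, constants, and loop are illustrative, the packed f16 math assumes cuda_fp16.h, and the DPX line assumes CUDA 12+ compiled for sm_90 or newer.

```cuda
#include <cuda_fp16.h>    // __half2, __hfma2
#include <cuda_runtime.h>

// Illustrative kernel touching the instruction families discussed above.
__global__ void throughput_probe(int iterations, float *f_out, int *i_out) {
    float f = threadIdx.x * 0.5f;
    __half2 h = __float2half2_rn(0.5f);
    int a = (int)threadIdx.x | 0x01010101, dot = 0, path = 0;
    unsigned u = threadIdx.x + 1u;
    for (int i = 0; i < iterations; ++i) {
        f = fmaf(f, 1.000001f, 0.5f);              // f32 fused multiply-add
        h = __hfma2(h, __float2half2_rn(1.0f), h); // packed f16 FMA, 2 lanes
        dot = __dp4a(a, a, dot);                   // 4x i8 dot product, i32 accumulator
        u = __umul24(u, 0x00FFFFFFu) + 1u;         // 24-bit integer multiply
#if __CUDA_ARCH__ >= 900
        path = __viaddmax_s32(path, a, dot);       // DPX: max(path + a, dot), Hopper+
#endif
    }
    // Write results back so the compiler can't optimize the loop away.
    f_out[threadIdx.x] = f + __low2float(h);
    i_out[threadIdx.x] = dot + path + (int)u;
}
```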

How do these instructions stack up, and how much throughput can we expect from recent state-of-the-art GPUs like the Nvidia H200? All numbers below are in tera-operations per second:

  • f64 FMA: 4.5 T
  • i64 FMA: 3.1 T
  • f32 FMA: 22 T
  • i32 FMA: 15.5 T ...so 32-bit ops should be preferred over their 64-bit counterparts
  • u8u32 DP4A: 39.3 T
  • u24u32 UMUL: 13.4 T ...not really better than i32 FMA
  • f16 FMA on Volta: 12.2 T
  • bf16 FMA on Ampere: 12.2 T
  • DPX for the Floyd-Warshall algorithm with u16 and u32 on Hopper: 11 T
  • DPX for the Needleman-Wunsch algorithm with i16 and i32 on Hopper: 11 T
  • DPX for the Smith-Waterman algorithm with i32 on Hopper: 27 T (see the sketch after this list)
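
The Smith-Waterman recurrence maps almost one-to-one onto a single DPX instruction, which is where the higher number comes from. Below is a hedged sketch of one cell update under that assumption; the function and parameter names are illustrative rather than taken from the benchmark code, and the intrinsic assumes CUDA 12+ targeting Hopper.

```cuda
#include <cuda_runtime.h>

// One Smith-Waterman cell: H[i][j] = max(0, diag + score, up - gap, left - gap).
__device__ int smith_waterman_cell(int diag, int up, int left, int score, int gap) {
#if __CUDA_ARCH__ >= 900
    // DPX: __vimax3_s32_relu(a, b, c) == max(a, b, c, 0) in a single instruction,
    // fusing the three-way max with the clamp at zero.
    return __vimax3_s32_relu(diag + score, up - gap, left - gap);
#else
    // Portable fallback for pre-Hopper architectures.
    return max(max(diag + score, max(up - gap, left - gap)), 0);
#endif
}
```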

Check the code and inline comments for more details!
