
Commit f44be73

y-sq authored and facebook-github-bot committed
iRoPE (-p > 8192)
Summary: X-link: facebookresearch/FBGEMM#1149

When max_seq_len is larger than 8192, one input sample is divided into multiple sequences. For example, with bs = 2 and seqlen = 7, prefill attention sees seq_lens = [0, 7, 7, 7, 7, 14, 14, 14, 14]. Decoding is unaffected, since that case is handled by the gappy bias.

Differential Revision: D73833204
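For context, here is a minimal sketch (a hypothetical helper, not part of FBGEMM) of how such cumulative sequence starts can arise when each sample reserves max_seq_len / 8192 chunks of at most 8192 tokens each, with unused chunks contributing zero-length sequences. With bs = 2, seqlen = 7, and an assumed max_seq_len of 32768, it reproduces the seq_lens above:

def split_seqstarts(seq_lens, max_seq_len, chunk_size=8192):
    # Each sample reserves max_seq_len // chunk_size sequence slots of at
    # most chunk_size tokens; slots past the sample's actual length stay empty.
    chunks_per_sample = max_seq_len // chunk_size
    starts = [0]
    for n in seq_lens:
        for c in range(chunks_per_sample):
            starts.append(starts[-1] + min(max(n - c * chunk_size, 0), chunk_size))
    return starts

print(split_seqstarts([7, 7], max_seq_len=32768))
# [0, 7, 7, 7, 7, 14, 14, 14, 14]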
1 parent 24bc151 commit f44be73

File tree

1 file changed: +2 −2 lines changed


fbgemm_gpu/experimental/gen_ai/src/kv_cache/kv_cache.cu (+2, −2)
@@ -2856,8 +2856,8 @@ at::Tensor quantize_qkv_per_head(
   dim3 block_size(kThreadsPerWarp, kWarpsPerBlock);
   dim3 grid_size(cuda_calc_xblock_count(num_warps, kWarpsPerBlock));
 
-  auto scale_q = at::zeros(
-      {q_seqstarts.size(0) - 1, N_KVH_L}, XQ_O.options().dtype(at::kFloat));
+  auto scale_q =
+      at::zeros({cache_K.size(0), N_KVH_L}, XQ_O.options().dtype(at::kFloat));
   float* const scale_q_ptr = scale_q.data_ptr<float>();
   // Launch the kernel
   // TODO: Launch the kernel with B_T * N_H_L blocks only in case of decode.
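The two changed lines swap the leading dimension of scale_q from q_seqstarts.size(0) - 1 (the number of sequences after splitting, 8 in the example above) to cache_K.size(0), presumably the batch size (2), so that scales are indexed per batch sample rather than per split sequence. A minimal shape sketch with assumed values (B, N_KVH_L, and the tensors are illustrative, not the FBGEMM API):

import torch

# Assumed illustrative values; names mirror the diff but this is not FBGEMM code.
B, N_KVH_L = 2, 8
q_seqstarts = torch.tensor([0, 7, 7, 7, 7, 14, 14, 14, 14])

old_scale_q = torch.zeros(q_seqstarts.numel() - 1, N_KVH_L)  # (8, N_KVH_L): one row per split sequence
new_scale_q = torch.zeros(B, N_KVH_L)                        # (2, N_KVH_L): one row per batch sample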
