
Commit f44be73

y-sq authored and facebook-github-bot committed
iRoPE (-p > 8192)
Summary: X-link: facebookresearch/FBGEMM#1149

When max_seq_len is larger than 8192, one input sample is divided into multiple sequences. For example, with bs = 2 and seqlen = 7, prefill attention sees seq_lens = [0, 7, 7, 7, 7, 14, 14, 14, 14]. Decoding is unaffected, since that case is handled by the gappy bias.

Differential Revision: D73833204
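For context, here is a minimal sketch (a hypothetical helper, not part of FBGEMM) of how such cumulative sequence starts can arise when each sample reserves max_seq_len / 8192 chunks of at most 8192 tokens each, with unused chunks contributing zero-length sequences. With bs = 2, seqlen = 7, and an assumed max_seq_len of 32768, it reproduces the seq_lens above:

def split_seqstarts(seq_lens, max_seq_len, chunk_size=8192):
    # Each sample reserves max_seq_len // chunk_size sequence slots of at
    # most chunk_size tokens; slots past the sample's actual length stay empty.
    chunks_per_sample = max_seq_len // chunk_size
    starts = [0]
    for n in seq_lens:
        for c in range(chunks_per_sample):
            starts.append(starts[-1] + min(max(n - c * chunk_size, 0), chunk_size))
    return starts

print(split_seqstarts([7, 7], max_seq_len=32768))
# [0, 7, 7, 7, 7, 14, 14, 14, 14]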
1 parent 24bc151 commit f44be73

File tree

1 file changed: +2 −2 lines changed


fbgemm_gpu/experimental/gen_ai/src/kv_cache/kv_cache.cu (+2, −2)
@@ -2856,8 +2856,8 @@ at::Tensor quantize_qkv_per_head(
   dim3 block_size(kThreadsPerWarp, kWarpsPerBlock);
   dim3 grid_size(cuda_calc_xblock_count(num_warps, kWarpsPerBlock));
 
-  auto scale_q = at::zeros(
-      {q_seqstarts.size(0) - 1, N_KVH_L}, XQ_O.options().dtype(at::kFloat));
+  auto scale_q =
+      at::zeros({cache_K.size(0), N_KVH_L}, XQ_O.options().dtype(at::kFloat));
   float* const scale_q_ptr = scale_q.data_ptr<float>();
   // Launch the kernel
   // TODO: Launch the kernel with B_T * N_H_L blocks only in case of decode.
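The two changed lines swap the leading dimension of scale_q from q_seqstarts.size(0) - 1 (the number of sequences after splitting, 8 in the example above) to cache_K.size(0), presumably the batch size (2), so that scales are indexed per batch sample rather than per split sequence. A minimal shape sketch with assumed values (B, N_KVH_L, and the tensors are illustrative, not the FBGEMM API):

import torch

# Assumed illustrative values; names mirror the diff but this is not FBGEMM code.
B, N_KVH_L = 2, 8
q_seqstarts = torch.tensor([0, 7, 7, 7, 7, 14, 14, 14, 14])

old_scale_q = torch.zeros(q_seqstarts.numel() - 1, N_KVH_L)  # (8, N_KVH_L): one row per split sequence
new_scale_q = torch.zeros(B, N_KVH_L)                        # (2, N_KVH_L): one row per batch sample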
