[wip] context parallelism #2668
base: main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2668. Note: links to docs will display an error until the docs builds have completed.
❗ 1 active SEV. If your PR is affected, please view it below.
As of commit da41f80 with merge base d39fd9b: ❌ 1 new failure, 2 cancelled jobs (please retry the cancelled jobs).
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@@ -718,3 +743,108 @@ def prepare_mha_for_tp(
    if is_fusion_model:
        model.decoder = decoder
    return model


def _get_sdpa_context() -> (
Does this mean CP doesn't work with FlexAttention?
Yes, at least until pytorch/pytorch#151497 lands
But I also think this is somewhat orthogonal: flex does not have its own SDPA backend (see here). My assumption is that it should be using the flash attention backend (but I need to confirm).
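For context, a minimal sketch of what a helper like `_get_sdpa_context` could look like if it simply restricts SDPA to the backends that ring-attention-based context parallelism can dispatch to today. The body and the exact backend list are assumptions for illustration, not the PR's actual implementation:

```python
import contextlib

from torch.nn.attention import SDPBackend, sdpa_kernel


def _get_sdpa_context() -> contextlib.AbstractContextManager:
    # Restrict scaled_dot_product_attention to backends known to work with
    # context parallelism; FlexAttention is excluded until
    # pytorch/pytorch#151497 lands. (Sketch only; the backend list in the PR
    # may differ.)
    return sdpa_kernel(
        [
            SDPBackend.FLASH_ATTENTION,
            SDPBackend.EFFICIENT_ATTENTION,
            SDPBackend.CUDNN_ATTENTION,
        ]
    )
```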
# Define optional context manager for context parallelism
model_inputs = list(batch.values())
buffers = list(self._model.buffers())
optional_context_parallel_context_manager = (
Is this the naming we're using for other optional ctx managers? We have "activations_handling_ctx", though I'd prefer to consolidate on something like "context_parallel" or "maybe_context_parallel"; the "with" statement already says it's a context manager.
Yeah I'm good taking out the "optional" here and matching what we do for activation offloading
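For reference, a rough sketch of how this context manager could be assembled in the recipe using PyTorch's experimental `context_parallel` API. The helper name, the `cp_mesh` argument, and the sequence-dimension choices are assumptions for illustration, not the PR's exact code:

```python
import contextlib
from typing import Optional

import torch
from torch.distributed.device_mesh import DeviceMesh
from torch.distributed.tensor.experimental import context_parallel


def get_context_parallel_ctx(
    cp_mesh: Optional[DeviceMesh],
    model: torch.nn.Module,
    batch: dict[str, torch.Tensor],
) -> contextlib.AbstractContextManager:
    # Placeholder helper (not in the PR): shard the batch tensors and model
    # buffers along their sequence dimension for the duration of the
    # forward/backward pass, or no-op when context parallelism is disabled.
    if cp_mesh is None:
        return contextlib.nullcontext()
    model_inputs = list(batch.values())
    buffers = list(model.buffers())
    return context_parallel(
        cp_mesh,
        buffers=model_inputs + buffers,
        # Assumed layouts: inputs are [batch, seq] (seq dim 1); buffers such
        # as RoPE caches keep their sequence dim first (seq dim 0).
        buffer_seq_dims=[1] * len(model_inputs) + [0] * len(buffers),
    )
```

A recipe would then wrap the forward and backward pass in `with get_context_parallel_ctx(...):` so that inputs and buffers are sharded only for that step.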
Initial implementation of context parallelism in torchtune.
Initial test
Also confirmed that we can run 1M sequence length on a single node (will paste results here shortly).
Still to test
Should test (a) equivalent loss curves and (b) requisite memory improvements on a long-context dataset for each of the below: