SLURM training: training freezes when using ddp and torchdata #17066
Comments
Torchdata requires extra setup and shutdown calls that Lightning doesn't do for you at the moment: #16603. This might be what's causing the issue.
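For reference, a minimal sketch of the lifecycle calls being referred to, assuming torchdata's DataLoader2 with a DistributedReadingService (the epoch loop, the per-epoch seed() call, and the toy datapipe are my assumptions, not what Lightning does internally):

```python
from torchdata.dataloader2 import DataLoader2, DistributedReadingService
from torchdata.datapipes.iter import IterableWrapper

def run(datapipe, num_epochs: int):
    # DataLoader2 expects an explicit shutdown (and, as I understand the API,
    # per-epoch seeding) that Lightning does not issue for you yet (#16603).
    dl = DataLoader2(datapipe, reading_service=DistributedReadingService())
    try:
        for epoch in range(num_epochs):
            dl.seed(epoch)      # keep shuffles aligned across ranks
            for batch in dl:
                ...             # training step goes here
    finally:
        dl.shutdown()           # tear down the reading service cleanly

# run(IterableWrapper(range(8)).shuffle().sharding_filter(), num_epochs=2)
# would be called on every rank after the process group is initialized.
```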
Thank you for the comment. I'll have a look into it. If I solve it or find anything meaningful, I'll open a pull request.
Hi @knoriy, did you solve this?
I've seen issues that stem from using datapipes with the old DataLoader class. Maybe using DataLoader2 from torchdata helps.
@carmocca For me it crashed randomly after saving a checkpoint; sometimes it crashed, sometimes it didn't.
A workaround that's worked for me; try this:

```python
def _create_pipeline(self, data_dir):
    datapipe = torchdata.datapipes.iter.IterableWrapper(data_dir)\
        .shuffle()\
        .open_files_by_fsspec(mode='rb')\
        .load_from_tar()\
        .sharding_filter()\
        .batch(2)\
        .map(self.to_sampels)
    return datapipe
```
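In case it helps, this is how I would feed a pipeline like that to the classic DataLoader (a sketch; batch_size=None because the pipe already batches, and the num_workers value is made up):

```python
from torch.utils.data import DataLoader

def train_dataloader(self):
    # batch_size=None: the datapipe already calls .batch(2)
    return DataLoader(self._create_pipeline(self.data_dir),
                      batch_size=None, num_workers=4)
```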
For me, DataLoader2 causes issues when using reading services; it leads to freezing and worse performance. The classic DataLoader worked best for me when using PL and TorchData.
cc @ejguan
I think the main problem is unbalanced data sharding across distributed ranks, which causes hanging.
Can you please shed more light on this? In theory and based on our benchmarking, DataLoader2 should perform better than DataLoader.
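To illustrate what unbalanced sharding looks like, a toy sketch (the sample count and the rank loop are made up; in a real job each rank runs one copy of this):

```python
from torch.utils.data.graph_settings import apply_sharding
from torchdata.datapipes.iter import IterableWrapper

world_size = 2
samples = range(101)  # 101 items cannot be split evenly across 2 ranks

for rank in range(world_size):
    dp = IterableWrapper(samples).sharding_filter()
    apply_sharding(dp, world_size, rank)  # what the loader does per rank
    print(rank, len(list(dp)))            # rank 0 gets 51 items, rank 1 gets 50

# Under DDP the rank with the extra item ends up waiting in a collective
# (all-reduce, checkpoint barrier, ...) that the other rank never enters,
# which shows up as a silent freeze rather than an error.
```

If the shards really are uneven, trimming every rank to a common length, or torchdata's fullsync datapipe (worth checking against the installed torchdata version), are possible ways to even them out.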
Thank you, I'll try that. Feel free to ask for anything I've missed here. The cluster manager is SLURM.
@ejguan Does the order of reading services matter?
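My understanding, to be verified against the torchdata version in use, is that multiple reading services are composed with SequentialReadingService, usually distributed first and multiprocessing second; a sketch:

```python
from torchdata.dataloader2 import (
    DataLoader2,
    DistributedReadingService,
    MultiProcessingReadingService,
    SequentialReadingService,
)
from torchdata.datapipes.iter import IterableWrapper

datapipe = IterableWrapper(range(100)).shuffle().sharding_filter()

rs = SequentialReadingService(
    DistributedReadingService(),                   # shard across ranks first
    MultiProcessingReadingService(num_workers=4),  # then fan out to workers
)
dl = DataLoader2(datapipe, reading_service=rs)
```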
I am having an issue of very slow training after something on the cluster I am using got updated, which I am trying to figure out with the admins. I can see there are some differences in the logs I am getting; in particular, I am receiving logs very similar to those in this post. My NCCL logs are:

However, previously I was getting these logs:

Can the change in the aws-ofi-nccl version from 1.4.0aws to 1.5.0aws have caused the issue?
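A quick, generic way to compare the NCCL stack between the two environments (illustrative; the aws-ofi-nccl plugin version itself only shows up in the NCCL INFO logs, not in torch):

```python
import os
import torch

# Must be set before the process group / first collective is created.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,NET")

print("torch:", torch.__version__)
print("cuda :", torch.version.cuda)
print("nccl :", torch.cuda.nccl.version())  # NCCL bundled with this torch build
```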
Update: I've been stepping through the PL code; the freeze looks to happen in the Closure class. I'll add further notes and anything else that may help isolate this issue.
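One generic way to see where a hung rank is actually stuck (a debugging sketch, not part of the original report) is to dump all thread stacks on a timer:

```python
import faulthandler

# Print every thread's stack trace to stderr every 5 minutes; on a frozen rank
# this shows whether it is waiting in a NCCL collective, in the dataloader, or
# inside the Closure/optimizer step.
faulthandler.dump_traceback_later(timeout=300, repeat=True)
```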
Bug description
Training freezes when using ddp on a SLURM cluster (dp runs as expected). The dataset is loaded via torchdata from an S3 bucket. Similar behaviour also arises when using webdataset. Possibly a linked issue: #16893 (comment)
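For context, the rough shape of the launch side of such a setup (a sketch with placeholder values, not the actual reproduction script; Lightning detects the SLURM environment automatically):

```python
import pytorch_lightning as pl

# devices/num_nodes are placeholders for whatever the SLURM job requests;
# the datamodule would wrap the torchdata pipeline described in this issue.
trainer = pl.Trainer(accelerator="gpu", devices=4, num_nodes=2, strategy="ddp")
# trainer.fit(model, datamodule=dm)
```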
Error:
No error is thrown.
UPDATE: Removing val_step and test_step from pl.LightningModule gives us the following:

How to reproduce the bug
Error messages and logs
Environment
Current environment
More info
The model is able to finish an epoch when .sharding_filter() (line 51) is removed, but this results in undesirable behavior: if it is turned off, workers will return the same batch multiple times.

cc @justusschock @awaelchli