what dataloader to use for torchdata.nodes nodes? #1442


Open
keunwoochoi opened this issue Feb 13, 2025 · 16 comments

@keunwoochoi
Contributor

Hi, thanks for reviving torchdata. I was able to move to 0.10.1 for lots of my existing datapipes, and it seems to work pretty nicely.

Question: am I supposed to use torchdata.nodes.Loader or torchdata.stateful_dataloader.StatefulDataLoader for my data nodes? Or just torch.utils.data.DataLoader? I'm getting a bit confused after reading the docs and code. Currently Loader works for my iterable data nodes, but with some caveats (no multiprocessing).

@ramanishsingh
Contributor

Hi @keunwoochoi,
Thanks for checking out nodes.
Loader works pretty well, and you can check out some examples here: https://github.com/pytorch/data/tree/main/examples/nodes
You can also check the migration guide here: https://pytorch.org/data/beta/migrate_to_nodes_from_utils.html
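
For a quick start, here's a minimal sketch along the lines of those examples, wrapping a plain Python iterable (the range source and batch size are just illustrative):

    import torchdata.nodes as tn

    # Any Python iterable can serve as the source; a range stands in for real data.
    node = tn.IterableWrapper(range(10))
    node = tn.Batcher(node, batch_size=4, drop_last=False)
    loader = tn.Loader(node)

    for batch in loader:
        print(batch)  # [0, 1, 2, 3], then [4, 5, 6, 7], then [8, 9]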

For the cases that aren't working, could you share a minimal working example so we can look into it?

@keunwoochoi
Contributor Author

I see. Yes, Loader works pretty well.

But to clarify, @ramanishsingh: did you mean that StatefulDataLoader is supposed to work with nodes? And torch.utils.data.DataLoader as well?

@keunwoochoi
Contributor Author

From #1389, it seems like Loader is the only choice.

@andrewkho
Contributor

@keunwoochoi thanks for trying this out! We should clarify this in the documentation, but right now the idea is that torchdata.nodes is a super-set of StatefulDataLoader, i.e. nodes should be able to do everything torch.utils.data.DataLoader and StatefulDataLoader can do, but nodes are not designed to be plugged into StatefulDataLoader. cc @scotts on the confusion around torchdata vs. DataLoader v1.

@keunwoochoi
Contributor Author

@andrewkho I see, thanks. I guess I'm still not sure: after instantiating a Loader, how can I make it multi-threaded (or multi-process) so that multiple workers dispatch the data?

@prompteus

@keunwoochoi Maybe torchdata.nodes.ParallelMapper is what you are looking for?

@keunwoochoi
Contributor Author

Based on this official example, my guess is that we're supposed to compose nodes like this, and then it'll work like a DataLoader with pin_memory and multiple workers.

    # tn is torchdata.nodes (import torchdata.nodes as tn); `node` is the
    # upstream source node and `args` holds the benchmark's CLI options.
    node = tn.Batcher(node, batch_size=args.batch_size)
    node = tn.ParallelMapper(
        node,
        map_fn=ImagenetTransform(),
        num_workers=args.num_workers,
        method=args.loader,  # "thread" or "process"
    )
    if args.pin_memory:
        node = tn.PinMemory(node)
    node = tn.Prefetcher(node, prefetch_factor=2)

    return tn.Loader(node)
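
The returned Loader can then be iterated like a regular DataLoader and, as far as I can tell, also supports checkpointing. A rough sketch (the training step is elided):

    loader = tn.Loader(node)

    for batch in loader:
        ...  # training step

    # Save and restore the loader's position, similar to StatefulDataLoader.
    state = loader.state_dict()
    loader.load_state_dict(state)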

But I still wonder how we can implement early-stage sharding. Well, technically it's possible by instantiating the same node (one that goes from a sharded file listing to loading and processing) many times (e.g., 4 of them) and then multiplexing them; combined with prefetch, that would work.

@prompteus

Actually, I'm also trying to figure out how to make multiple parallel initial nodes - each with its own worker. I haven't found a straightforward solution though.

I asked about it here #1334 (comment), so hopefully we will get a reply soon.

@keunwoochoi
Contributor Author

@prompteus I think one way is to just:

i) instantiate multiple nodes that have the same, common processing (perhaps with some sharding in the early stage if needed)
ii) multiplex them
iii) batch them
iv) add Prefetcher
v) then Loader

Maybe that's it? Something like the sketch below.
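
A rough sketch of that composition, assuming MultiNodeWeightedSampler is used for the multiplexing step (it samples from a dict of source nodes according to weights); `make_shard_node` and `load_shard` are hypothetical helpers standing in for the per-shard listing/loading/processing:

    import torchdata.nodes as tn

    def make_shard_node(shard_id):
        # Hypothetical: build one source node that lists, loads, and processes
        # the files belonging to a single shard.
        return tn.IterableWrapper(load_shard(shard_id))

    # i) several source nodes over disjoint shards
    sources = {f"shard_{i}": make_shard_node(i) for i in range(4)}

    # ii) multiplex them with equal weights
    node = tn.MultiNodeWeightedSampler(sources, weights={k: 1.0 for k in sources})

    # iii) batch, iv) prefetch, v) wrap in a Loader
    node = tn.Batcher(node, batch_size=32)
    node = tn.Prefetcher(node, prefetch_factor=2)
    loader = tn.Loader(node)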

@prompteus

@keunwoochoi
Sure, that's doable, but it's not ideal. When data generation is heavy, it makes sense to offload it to parallel workers, and creating multiple instances and multiplexing them doesn't do that on its own.

I guess one way to implement it is to generally follow what you suggested, but make a custom subclass of BaseNode that creates and manages a separate worker process/thread.

However, I'm still not sure whether I'm just overlooking a feature that solves this problem more systematically.
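
One alternative that sidesteps a custom BaseNode, assuming the heavy generation can be written as a function of a cheap input (a file path, shard id, or seed): keep the initial node trivial and push the expensive work into ParallelMapper's map_fn, which already runs it in worker threads or processes. `list_files` and `load_and_process` are hypothetical helpers here:

    import torchdata.nodes as tn

    # Cheap initial node: only yields file paths (or shard ids / seeds).
    node = tn.IterableWrapper(list_files("data/"))

    # The heavy per-item work runs inside the parallel workers.
    node = tn.ParallelMapper(
        node,
        map_fn=load_and_process,  # hypothetical: path -> processed sample
        num_workers=4,
        method="process",  # or "thread"
    )
    node = tn.Prefetcher(node, prefetch_factor=2)
    loader = tn.Loader(node)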

@isarandi

isarandi commented Mar 4, 2025

Is torchdata.nodes usable in multi-GPU training? It's really hard to piece together the current recommended way to load data in PyTorch...

@divyanshk
Contributor

@isarandi Have you tried creating a Dataset class and using a DistributedSampler for multi-GPU workloads?

@isarandi

isarandi commented Mar 5, 2025

I figured it out. It's possible to use torchdata.nodes without any Dataset or DataLoader. For multi-GPU training, if the nodes start with IterableWrapper, it's enough to slice the iterable so it only yields every nth item, as in torchdata.nodes.IterableWrapper(itertools.islice(iterable, rank, None, world_size)).
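
A minimal sketch of that approach, assuming rank and world_size come from torch.distributed and `samples` stands in for the underlying iterable:

    import itertools
    import torch.distributed as dist
    import torchdata.nodes as tn

    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Each rank keeps every world_size-th item, starting at its own rank offset.
    node = tn.IterableWrapper(itertools.islice(samples, rank, None, world_size))
    node = tn.Batcher(node, batch_size=32)
    node = tn.Prefetcher(node, prefetch_factor=2)
    loader = tn.Loader(node)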

@divyanshk
Contributor

@isarandi That's exactly right!

@busbyjrj

Hello everyone. I haven't found any tutorials on using loaders with multiple GPUs, so I would like to ask about it here.

I based my setup on the imagenet_benchmark example and replaced RandomSampler with DistributedSampler.

    from torch.utils.data.distributed import DistributedSampler
    import torchdata.nodes as tn

    # misc.get_world_size() / misc.get_rank() are this project's own
    # distributed helpers.
    sampler = DistributedSampler(dataset,
                                 num_replicas=misc.get_world_size(),
                                 rank=misc.get_rank(),
                                 shuffle=True,
                                 )

    node = tn.MapStyleWrapper(map_dataset=dataset, sampler=sampler)

I'm wondering: can this replace the DataLoader for multi-GPU training?

Thanks.

@ramanishsingh
Contributor

@busbyjrj
You can check out the discussion here #1472

We are currently working on developing some examples for multi-GPU training, and we plan to publish them in the coming weeks.
