torchdata or torchdata-contrib? #1471

keunwoochoi · 2025-04-06T01:11:33Z

my team has been implementing quite a few utilities. some are close to core features, others are more advanced or more specialized. for example, their class names and features look like this:

class RoundRobinNode(BaseNode[T]):
    """A node that cycles through multiple datasets in a round-robin way.

class FileListNode(BaseNode[Dict]):
    """Node that lists files from any supported filesystem (local, S3) matching specified patterns.

    Uses fsspec to provide universal file access capabilities for both local and remote files.

    Features:
    - Lists files from supported filesystems (local, S3)
    - Supports glob patterns for file matching
    - Maintains state for checkpointing and resumption

class FileReaderNode(BaseNode[Dict]):
    """Universal node that reads file contents from any supported filesystem.

    Uses smart_open to support local files, S3, HTTP, and more file systems.

class TextStreamDecodeNode(BaseNode[Dict]):
    """Node that streams text files line by line from any source.

    This node combines functionality of file reading and line-by-line processing,
    supporting both local and remote (S3, HTTP, etc.) files via smart_open.

    Features:
    - Streams files line-by-line (memory efficient)
    - Supports local files, S3, HTTP, and more
    - Handles compressed files (.gz, .bz2) transparently
    - Maintains state for checkpointing and resumption
    - Preserves metadata from source nodes

class HuggingFaceDatasetStreamNode(BaseNode[dict]):
    """
    Node that streams examples from a HuggingFace dataset.

    Output format:
        {
            "data": {...},           # Original dataset item
            "metadata": {
                "dataset_name": "squad",
                "split": "train",
                "index": 42
            }
        }

    Input: None (configured with dataset name and split at initialization)
    Output: Dict containing example data and metadata

class JsonlStreamNode(TextStreamDecodeNode):
    """Node that streams JSONL files and parses each line as JSON.

    This node extends TextStreamDecodeNode to add JSON parsing for each line.
    It maintains the same state management and streaming capabilities while adding
    JSONL-specific processing.

and some more.
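
to make the streaming-plus-resumption idea concrete, here is a rough plain-Python sketch of what the JSONL case boils down to. it is an illustration only, not the node code described above: `stream_jsonl` and `start_line` are hypothetical names, it uses the built-in `open` instead of smart_open, and it does not subclass BaseNode. the `data` / `metadata` layout just mirrors the output format shown for the HuggingFace node.

```python
import json
import os
import tempfile
from typing import Any, Dict, Iterator


def stream_jsonl(path: str, start_line: int = 0) -> Iterator[Dict[str, Any]]:
    """Yield one parsed record per non-empty line.

    `start_line` (hypothetical parameter) is the resumption point: lines
    before it are skipped, so a saved line number acts as a checkpoint.
    """
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f):
            if line_no < start_line or not line.strip():
                continue
            yield {
                "data": json.loads(line),
                "metadata": {"path": path, "line": line_no},
            }


# Small usage example on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write('{"id": 1}\n{"id": 2}\n{"id": 3}\n')

    records = list(stream_jsonl(path))                # read from the start
    resumed = list(stream_jsonl(path, start_line=2))  # resume at the third line
```

a real node would keep the line counter in its checkpoint state rather than taking it as an argument, but the skip-ahead logic is the same idea.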

conservatively, i'd say these can be part of, say, torchdata-contrib. but i'd like to hear from the maintainers. where would you suggest drawing the line? any other suggestions would be great, too.


divyanshk · 2025-04-09T17:41:45Z

Thanks @keunwoochoi for bringing this up! This is a very valid point of discussion. We talked it over, and here is one way we can go about it: let's keep all core node functionality in torchdata/nodes (similar to Cycler, Header, etc. that you are adding). For nodes that pertain to data access, let's include those that are general enough for wide use, for example CSV, JSONL, etc. We can discuss further whether to put them all in torchdata/nodes or in torchdata/nodes/data. That said, we would be cautious about bringing in nodes that depend on company-specific stacks or services like S3, GCP, etc., although we encourage users to keep building on existing nodes to unblock their specific needs.

Regarding round robin, we already have a PR open which @ramanishsingh is taking care of. Edit: CSV Reader WIP here (PR1474)
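
For readers unfamiliar with the behavior being discussed, here is a minimal sketch of round-robin cycling with resumable state. It is not the code in the open PR and not the torchdata API: `RoundRobinSketch` is a hypothetical plain-Python class that only mirrors the `reset` / `next` / `get_state` shape, and it assumes in-memory sources so that positions can be restored by index.

```python
from typing import Any, Dict, Optional, Sequence


class RoundRobinSketch:
    """Cycle through several in-memory sources, one item at a time.

    Hypothetical stand-in for illustration; exhausted sources are skipped
    until every source has run out.
    """

    def __init__(self, sources: Sequence[Sequence[Any]]) -> None:
        self.sources = [list(s) for s in sources]
        self.reset()

    def reset(self, initial_state: Optional[Dict[str, Any]] = None) -> None:
        # Restore positions from a checkpoint, or start from the beginning.
        if initial_state is None:
            self._positions = [0] * len(self.sources)
            self._turn = 0
        else:
            self._positions = list(initial_state["positions"])
            self._turn = initial_state["turn"]

    def next(self) -> Any:
        # Try each source at most once, starting with the one whose turn it is.
        for _ in range(len(self.sources)):
            idx = self._turn
            self._turn = (self._turn + 1) % len(self.sources)
            pos = self._positions[idx]
            if pos < len(self.sources[idx]):
                self._positions[idx] = pos + 1
                return self.sources[idx][pos]
        raise StopIteration  # every source is exhausted

    def get_state(self) -> Dict[str, Any]:
        return {"positions": list(self._positions), "turn": self._turn}


# Usage: take two items, checkpoint, then resume in a fresh instance.
data = [["a1", "a2", "a3"], ["b1"]]
node = RoundRobinSketch(data)
first = [node.next(), node.next()]
state = node.get_state()

resumed = RoundRobinSketch(data)
resumed.reset(state)
rest = [resumed.next(), resumed.next()]
```

Skipping exhausted sources is one possible design choice; stopping as soon as any source runs out would be an equally reasonable variant.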

cc. @ramanishsingh @scotts

divyanshk · 2025-04-14T17:51:37Z

Happy to discuss here which nodes we can add!
