torchdata or torchdata-contrib? #1471

Open
keunwoochoi opened this issue Apr 6, 2025 · 2 comments

keunwoochoi commented Apr 6, 2025

my team has been implementing quite a few utilities. some are close to core features, while others are more advanced or special-purpose. for example, here are some of the class names and docstrings:

class RoundRobinNode(BaseNode[T]):
    """A node that cycles through multiple datasets in a round-robin way."""

class FileListNode(BaseNode[Dict]):
    """Node that lists files from any supported filesystem (local, S3) matching specified patterns.

    Uses fsspec to provide universal file access capabilities for both local and remote files.

    Features:
    - Lists files from supported filesystems (local, S3)
    - Supports glob patterns for file matching
    - Maintains state for checkpointing and resumption
    """

class FileReaderNode(BaseNode[Dict]):
    """Universal node that reads file contents from any supported filesystem.

    Uses smart_open to support local files, S3, HTTP, and more file systems.
    """

class TextStreamDecodeNode(BaseNode[Dict]):
    """Node that streams text files line by line from any source.

    This node combines functionality of file reading and line-by-line processing,
    supporting both local and remote (S3, HTTP, etc.) files via smart_open.

    Features:
    - Streams files line-by-line (memory efficient)
    - Supports local files, S3, HTTP, and more
    - Handles compressed files (.gz, .bz2) transparently
    - Maintains state for checkpointing and resumption
    - Preserves metadata from source nodes
    """

class HuggingFaceDatasetStreamNode(BaseNode[dict]):
    """
    Node that streams examples from a HuggingFace dataset.

    Output format:
        {
            "data": {...},           # Original dataset item
            "metadata": {
                "dataset_name": "squad",
                "split": "train",
                "index": 42
            }
        }

    Input: None (configured with dataset name and split at initialization)
    Output: Dict containing example data and metadata
    """

class JsonlStreamNode(TextStreamDecodeNode):
    """Node that streams JSONL files and parses each line as JSON.

    This node extends TextStreamDecodeNode to add JSON parsing for each line.
    It maintains the same state management and streaming capabilities while adding
    JSONL-specific processing.
    """

and some more.
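
For illustration only, here is a minimal sketch of the round-robin idea with checkpointable state. It is not the implementation from the excerpts above and does not import torchdata; it only loosely mirrors the `reset` / `next` / `get_state` shape used by `torchdata.nodes.BaseNode`. The names `ListSource`, `RoundRobinSketch`, and `drain` are hypothetical, and skipping exhausted sources until all are drained is an assumed stop policy.

```python
from typing import Any, Dict, List, Optional, Sequence


class ListSource:
    """Hypothetical stand-in for a source node: walks a list and can resume."""

    def __init__(self, items: Sequence[Any]) -> None:
        self._items = list(items)
        self._pos = 0

    def reset(self, initial_state: Optional[Dict[str, Any]] = None) -> None:
        self._pos = initial_state["pos"] if initial_state is not None else 0

    def next(self) -> Any:
        if self._pos >= len(self._items):
            raise StopIteration
        item = self._items[self._pos]
        self._pos += 1
        return item

    def get_state(self) -> Dict[str, Any]:
        return {"pos": self._pos}


class RoundRobinSketch:
    """Takes one item from each source in turn, skipping exhausted sources."""

    def __init__(self, sources: List[ListSource]) -> None:
        self._sources = sources
        self.reset()

    def reset(self, initial_state: Optional[Dict[str, Any]] = None) -> None:
        if initial_state is None:
            for source in self._sources:
                source.reset()
            self._turn = 0
            self._done = [False] * len(self._sources)
        else:
            # Restore each child first, then our own cursor.
            for source, state in zip(self._sources, initial_state["sources"]):
                source.reset(state)
            self._turn = initial_state["turn"]
            self._done = list(initial_state["done"])

    def next(self) -> Any:
        n = len(self._sources)
        for _ in range(n):  # try each source at most once per call
            idx = self._turn
            self._turn = (self._turn + 1) % n
            if self._done[idx]:
                continue
            try:
                return self._sources[idx].next()
            except StopIteration:
                self._done[idx] = True
        raise StopIteration  # every source is exhausted

    def get_state(self) -> Dict[str, Any]:
        return {
            "turn": self._turn,
            "done": list(self._done),
            "sources": [source.get_state() for source in self._sources],
        }


def drain(node: RoundRobinSketch) -> List[Any]:
    """Collect items until the node signals exhaustion."""
    out = []
    while True:
        try:
            out.append(node.next())
        except StopIteration:
            return out
```

With sources `[1, 2, 3]` and `["a", "b"]`, draining yields `[1, "a", 2, "b", 3]`. Passing a saved `get_state()` dict to `reset` on a freshly built node continues from the same point, which is the checkpointing property the docstrings above describe.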

conservatively, i'd say these can be part of, say, torchdata-contrib. but i'd like to hear from the maintainers. where would you suggest drawing the line? any other suggestions would be great, too.

divyanshk commented Apr 9, 2025

Thanks @keunwoochoi for bringing this up! It is a very valid point of discussion. We had some discussions, and this is one way we can go about it: let's keep all core node functionality in torchdata/nodes (similar to Cycler, Header, etc. that you are adding). For nodes that pertain to data access, let's include those that are general enough for wide use, e.g., CSV, JsonL, etc. We can discuss further whether to have them all in torchdata/nodes or torchdata/nodes/data. That said, we would be cautious about bringing in nodes that depend on exclusive company stacks or services like S3, GCP, etc., although we encourage users to continue building on existing nodes to unblock their specific needs.
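
To make the "general enough for wide use" case concrete, here is a rough, dependency-free sketch of a resumable JSONL reader. It is only an illustration under stated assumptions, not a proposed torchdata API: the class name `JsonlFileSketch` is hypothetical, it uses only the Python standard library and local files (no S3 or other services), and it only loosely follows the `reset` / `next` / `get_state` shape of a node.

```python
import json
from typing import Any, Dict, Optional


class JsonlFileSketch:
    """Hypothetical reader: streams one parsed JSON object per non-empty line."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._fh = None
        self._line_no = 0  # number of physical lines consumed so far

    def reset(self, initial_state: Optional[Dict[str, Any]] = None) -> None:
        self.close()
        self._fh = open(self._path, "r", encoding="utf-8")
        self._line_no = 0
        target = initial_state["line_no"] if initial_state is not None else 0
        # Skip the lines that were already consumed before the checkpoint.
        while self._line_no < target:
            if not self._fh.readline():
                break  # file is shorter than the checkpoint expected
            self._line_no += 1

    def next(self) -> Any:
        if self._fh is None:
            self.reset()
        while True:
            line = self._fh.readline()
            if not line:
                raise StopIteration  # end of file
            self._line_no += 1
            if line.strip():  # ignore blank lines
                return json.loads(line)

    def get_state(self) -> Dict[str, Any]:
        return {"line_no": self._line_no}

    def close(self) -> None:
        if self._fh is not None:
            self._fh.close()
            self._fh = None
```

Resuming by line count keeps the state tiny and easy to serialize, at the cost of re-reading skipped lines on restart. A byte offset would avoid the re-read for plain files but is harder to carry over to compressed inputs.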

Regarding round robin, we already have a PR open which @ramanishsingh is taking care of. Edit: CSV Reader WIP here (PR1474)

cc. @ramanishsingh @scotts

Happy to discuss here which nodes we can add!
