Image Feature in Datasets Library Fails to Handle bytearray Objects from Spark DataFrames #7517

Closed
giraffacarp opened this issue Apr 15, 2025 · 4 comments · Fixed by #7521

@giraffacarp
Contributor

Describe the bug

When using IterableDataset.from_spark() with a Spark DataFrame whose binary column holds image data, the rows carry that data as bytearray objects. The Image feature does not handle bytearray, so iterating the dataset fails with AttributeError: 'bytearray' object has no attribute 'get'.

Steps to reproduce the bug

  1. Create a Spark DataFrame with a column containing image data as bytearray objects
  2. Define a Feature schema with an Image feature
  3. Create an IterableDataset using IterableDataset.from_spark()
  4. Attempt to iterate through the dataset

from pyspark.sql import SparkSession
from datasets import Dataset, IterableDataset, Features, Image, Value

# initialize spark
spark = SparkSession.builder.appName("MinimalRepro").getOrCreate()

# create spark dataframe
data = [(0, open("image.png", "rb").read())]
df = spark.createDataFrame(data, "idx: int, image: binary")

# convert to dataset
features = Features({"idx": Value("int64"), "image": Image()})
ds = Dataset.from_spark(df, features=features)
ds_iter = IterableDataset.from_spark(df, features=features)

# iterate
print(next(iter(ds)))  # works
print(next(iter(ds_iter)))  # AttributeError: 'bytearray' object has no attribute 'get'
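
The AttributeError in this report is what Python raises when code calls .get() on a value it assumes is a dict. Below is a minimal sketch of that failure mode, assuming an encoder that recognizes str and bytes and treats anything else as dict-like. The function encode_strict is a hypothetical stand-in written for illustration, not the datasets implementation.

```python
def encode_strict(value):
    """Hypothetical encoder: knows str and bytes, assumes a dict otherwise."""
    if isinstance(value, str):
        return {"path": value, "bytes": None}
    if isinstance(value, bytes):
        return {"path": None, "bytes": value}
    # Anything that reaches this point is assumed to expose dict.get().
    return {"path": value.get("path"), "bytes": value.get("bytes")}


payload = bytearray(b"\x89PNG")

# bytearray is a separate type, not a subclass of bytes,
# so the bytes branch above is skipped.
is_bytes = isinstance(payload, bytes)
print(is_bytes)  # False

try:
    encode_strict(payload)
except AttributeError as err:
    error_message = str(err)
    print(error_message)
```

The point of the sketch is only that an isinstance(value, bytes) check does not match a bytearray, so the value falls through to a branch that expects a mapping.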

Expected behavior

The features should work on IterableDataset the same way they work on Dataset.

Environment info

  • datasets version: 3.5.0
  • Platform: macOS-15.3.2-arm64-arm-64bit
  • Python version: 3.12.7
  • huggingface_hub version: 0.30.2
  • PyArrow version: 18.1.0
  • Pandas version: 2.2.3
  • fsspec version: 2024.12.0

@lhoestq
Member

lhoestq commented Apr 15, 2025

Hi! The Image() type accepts either

  • a bytes object containing the image bytes
  • a str object containing the image path
  • a PIL.Image object

but it doesn't support bytearray. Maybe you can convert to bytes beforehand?
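
A rough sketch of the "convert to bytes beforehand" workaround, in plain Python. The helper names normalize_binary and normalize_row are hypothetical, and where to apply them depends on how the rows reach the Image feature in your pipeline; this only shows the conversion itself.

```python
def normalize_binary(value):
    """Return an immutable bytes copy of a bytearray; leave other values alone."""
    if isinstance(value, bytearray):
        return bytes(value)
    return value


def normalize_row(row: dict) -> dict:
    """Apply normalize_binary to every field of a row given as a dict."""
    return {key: normalize_binary(val) for key, val in row.items()}


# Example row shaped like the one in the report: an index plus raw image data.
row = {"idx": 0, "image": bytearray(b"\x89PNG")}
clean_row = normalize_row(row)
print(type(clean_row["image"]).__name__)  # bytes
```

bytes(value) copies the buffer, so the result is safe to pass to code that checks for bytes specifically.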

@giraffacarp
Contributor Author

Hi @lhoestq,
converting to bytes is certainly possible and would work around the error. However, the core issue is that Dataset and IterableDataset behave differently with the features.

I’d be happy to work on a fix for this issue.

@lhoestq
Member

lhoestq commented Apr 15, 2025

I see, that's an issue indeed. Feel free to ping me if I can help with reviews or any guidance.

If it can help, the code that takes a Spark DataFrame and iterates on the rows for IterableDataset is here:

def _generate_iterable_examples(
    df: "pyspark.sql.DataFrame",
    partition_order: list[int],
    state_dict: Optional[dict] = None,
):

@giraffacarp
Contributor Author

#self-assign

lhoestq added a commit that referenced this issue May 7, 2025
…cts from Spark DataFrames (#7517) (#7521)

* add bytearray to features encode_example methods

* add spark decode test

---------

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
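
The commit message describes adding bytearray to the features' encode_example methods. The sketch below shows one way such a change could look, assuming the encoder dispatches on isinstance checks. The function encode_image_value is a hypothetical illustration, not the code merged in #7521, and converting to bytes inside the branch is a choice made here, not something the thread confirms.

```python
def encode_image_value(value):
    """Hypothetical encoder that accepts str paths, bytes, and bytearray."""
    if isinstance(value, str):
        return {"path": value, "bytes": None}
    # Accept both immutable and mutable byte buffers in the same branch.
    if isinstance(value, (bytes, bytearray)):
        return {"path": None, "bytes": bytes(value)}
    raise TypeError(f"Unsupported image value type: {type(value).__name__}")


from_bytes = encode_image_value(b"\x89PNG")
from_bytearray = encode_image_value(bytearray(b"\x89PNG"))
print(from_bytes == from_bytearray)  # True
```

With both types handled in one branch, a row coming from Spark as bytearray encodes to the same result as one that already holds bytes.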