load_dataset from Hub with name to specify config using incorrect builder type when multiple data formats are present #7101

Open
hlky opened this issue Aug 14, 2024 · 1 comment

hlky commented Aug 14, 2024

Following the documentation, I defined different configs for Dataception, a dataset of datasets:

configs:
- config_name: dataception
  data_files:
  - path: dataception.parquet
    split: train
  default: true
- config_name: dataset_5423
  data_files:
  - path: datasets/5423.tar
    split: train
...
- config_name: dataset_721736
  data_files:
  - path: datasets/721736.tar
    split: train

The intent was for metadata to be browsable via Dataset Viewer, in addition to each individual dataset, and to allow datasets to be loaded by specifying the config/name to load_dataset.

While testing load_dataset I encountered the following error:

>>> dataset = load_dataset("bigdata-pw/Dataception", "dataset_7691")
Downloading readme: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 467k/467k [00:00<00:00, 1.99MB/s]
Downloading data: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 71.0M/71.0M [00:02<00:00, 26.8MB/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "datasets\load.py", line 2145, in load_dataset
    builder_instance.download_and_prepare(
  File "datasets\builder.py", line 1027, in download_and_prepare
    self._download_and_prepare(
  File "datasets\builder.py", line 1100, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "datasets\packaged_modules\parquet\parquet.py", line 58, in _split_generators
    self.info.features = datasets.Features.from_arrow_schema(pq.read_schema(f))
                                                             ^^^^^^^^^^^^^^^^^
  File "pyarrow\parquet\core.py", line 2325, in read_schema
    file = ParquetFile(
           ^^^^^^^^^^^^
  File "pyarrow\parquet\core.py", line 318, in __init__
    self.reader.open(
  File "pyarrow\_parquet.pyx", line 1470, in pyarrow._parquet.ParquetReader.open
  File "pyarrow\error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

The correct file is downloaded, but the wrong builder type is selected: parquet is chosen because of the other content in the repository. It appears that builder detection needs to take the requested config into account.

Note that I have since removed the additional configs from the repository because of this issue. There is also a limit of 3000 configs, so the Dataset Viewer doesn't work as I intended anyway. I'll add them back if that helps with testing.

hlky commented Aug 18, 2024

Having looked into this further, it seems the core of the issue is having two different formats in the same repo.

When the parquet config is listed first, the WebDatasets are loaded as parquet; when the WebDataset configs are listed first, the parquet file is loaded as WebDataset.
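
To make the order dependence concrete, here is a toy model of the behavior described above, contrasted with per-config detection. This is only a sketch: `EXTENSION_TO_BUILDER`, `builder_for_files`, `repo_wide_builder`, and `per_config_builders` are hypothetical names invented for illustration, and this is not the actual `datasets` implementation.

```python
from pathlib import PurePosixPath

# Assumed extension-to-builder mapping, for illustration only.
EXTENSION_TO_BUILDER = {".parquet": "parquet", ".tar": "webdataset"}

# Two of the configs from the README metadata above.
configs = [
    {"config_name": "dataception", "data_files": ["dataception.parquet"]},
    {"config_name": "dataset_5423", "data_files": ["datasets/5423.tar"]},
]


def builder_for_files(paths):
    """Return the single builder implied by the file extensions, or raise."""
    builders = {EXTENSION_TO_BUILDER[PurePosixPath(p).suffix] for p in paths}
    if len(builders) != 1:
        raise ValueError(f"mixed formats: {sorted(builders)}")
    return builders.pop()


def repo_wide_builder(configs):
    """Toy model of the observed behavior: the first config decides for all."""
    return builder_for_files(configs[0]["data_files"])


def per_config_builders(configs):
    """What I would expect instead: each config gets its own builder."""
    return {c["config_name"]: builder_for_files(c["data_files"]) for c in configs}


print(repo_wide_builder(configs))        # parquet first: everything is parquet
print(repo_wide_builder(configs[::-1]))  # tar first: everything is webdataset
print(per_config_builders(configs))
```

With per-config detection, reordering the configs would no longer change which builder a given config gets.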

A workaround in my case would be to convert the parquet file into a WebDataset, although I'd still need the Dataset Viewer config limit increased. In other cases, using a single format may not be possible.
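
Another possible workaround, which I have not verified against this repo, is to bypass config detection and name the packaged builder explicitly, pointing `data_files` at an `hf://` path. The helpers `explicit_load_args` and `load_one` below are hypothetical names, and the extension mapping is an assumption:

```python
def explicit_load_args(repo_id, path_in_repo, split="train"):
    """Build load_dataset kwargs that name the builder explicitly."""
    # Assumed mapping from file extension to packaged builder name.
    builders = {".parquet": "parquet", ".tar": "webdataset"}
    suffix = "." + path_in_repo.rsplit(".", 1)[-1]
    return {
        "path": builders[suffix],
        "data_files": {split: f"hf://datasets/{repo_id}/{path_in_repo}"},
        "split": split,
    }


def load_one(repo_id, path_in_repo):
    """Load a single file with the builder chosen from its extension."""
    from datasets import load_dataset  # imported lazily; needs network access

    return load_dataset(**explicit_load_args(repo_id, path_in_repo))


# Example: the kwargs that would be passed for one of the tar archives.
print(explicit_load_args("bigdata-pw/Dataception", "datasets/5423.tar"))
```

This loses the convenience of selecting a dataset by config name, so it is a stopgap rather than a fix.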

Relevant code:

@hlky hlky changed the title load_dataset from Hub with name to specify config using incorrect builder type Aug 18, 2024