`load_dataset` from Hub with `name` to specify `config` using incorrect builder type when multiple data formats are present #7101

hlky · 2024-08-14T18:12:25Z

Following the documentation, I defined different configs for Dataception, a dataset of datasets:

configs:
- config_name: dataception
  data_files:
  - path: dataception.parquet
    split: train
  default: true
- config_name: dataset_5423
  data_files:
  - path: datasets/5423.tar
    split: train
...
- config_name: dataset_721736
  data_files:
  - path: datasets/721736.tar
    split: train

The intent was for the metadata and each individual dataset to be browsable via the Dataset Viewer, and for each dataset to be loadable by passing its config name to `load_dataset`.

While testing load_dataset I encountered the following error:

>>> dataset = load_dataset("bigdata-pw/Dataception", "dataset_7691")
Downloading readme: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 467k/467k [00:00<00:00, 1.99MB/s]
Downloading data: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 71.0M/71.0M [00:02<00:00, 26.8MB/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "datasets\load.py", line 2145, in load_dataset
    builder_instance.download_and_prepare(
  File "datasets\builder.py", line 1027, in download_and_prepare
    self._download_and_prepare(
  File "datasets\builder.py", line 1100, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "datasets\packaged_modules\parquet\parquet.py", line 58, in _split_generators
    self.info.features = datasets.Features.from_arrow_schema(pq.read_schema(f))
                                                             ^^^^^^^^^^^^^^^^^
  File "pyarrow\parquet\core.py", line 2325, in read_schema
    file = ParquetFile(
           ^^^^^^^^^^^^
  File "pyarrow\parquet\core.py", line 318, in __init__
    self.reader.open(
  File "pyarrow\_parquet.pyx", line 1470, in pyarrow._parquet.ParquetReader.open
  File "pyarrow\error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

The correct file is downloaded, but the wrong builder type is detected: parquet, because of the other content in the repository. It appears that the requested config needs to be taken into account when the builder type is determined.

Note that I have removed the additional configs from the repository because of this issue. There is also a limit of 3000 configs, so the Dataset Viewer doesn't work as I intended anyway. I'll add them back if that helps with testing.

hlky · 2024-08-18T10:32:53Z

Having looked into this further, it seems the core of the issue is having two different data formats in the same repo.

When the parquet config is listed first, the WebDataset configs are loaded as parquet; when the WebDataset configs are listed first, the parquet config is loaded as WebDataset.
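The behavior described above can be modeled with a small sketch. This is not the `datasets` implementation: `EXTENSION_TO_BUILDER`, `builder_for_files`, `infer_builder_repo_wide`, and `infer_builder_per_config` are hypothetical names, and the extension table is a simplified assumption. It only illustrates why choosing one builder for the whole repository from the first config breaks a mixed-format repo, and how choosing it from the requested config's own files would avoid that.

```python
from pathlib import PurePosixPath

# Simplified, assumed mapping from file extension to builder name.
EXTENSION_TO_BUILDER = {".parquet": "parquet", ".tar": "webdataset"}

# Configs mirroring the README YAML in this issue (only two shown).
CONFIGS = {
    "dataception": ["dataception.parquet"],
    "dataset_5423": ["datasets/5423.tar"],
}


def builder_for_files(paths):
    """Return the single builder implied by the extensions of `paths`."""
    builders = set()
    for path in paths:
        suffix = PurePosixPath(path).suffix
        if suffix not in EXTENSION_TO_BUILDER:
            raise ValueError(f"Unknown data format: {path!r}")
        builders.add(EXTENSION_TO_BUILDER[suffix])
    if len(builders) != 1:
        raise ValueError(f"Ambiguous formats: {sorted(builders)}")
    return builders.pop()


def infer_builder_repo_wide(configs, name):
    """Toy model of the observed behavior: the first config decides for all."""
    first_config = next(iter(configs))
    return builder_for_files(configs[first_config])


def infer_builder_per_config(configs, name):
    """Toy model of the suggested behavior: only the requested config counts."""
    return builder_for_files(configs[name])


print(infer_builder_repo_wide(CONFIGS, "dataset_5423"))   # parquet
print(infer_builder_per_config(CONFIGS, "dataset_5423"))  # webdataset
```

Reversing the order of `CONFIGS` flips the repo-wide result to `webdataset` for every config, which matches the symptom reported here.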

A workaround in my case would be to convert the parquet file into a WebDataset, although I'd still need the Dataset Viewer config limit to be increased. In other cases, using a single format may not be possible.
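Another possible workaround, untested against this repository, is to bypass builder inference by naming the packaged builder explicitly and pointing `data_files` at the archive. The sketch below assumes that an `hf://datasets/...` path is accepted in `data_files` and that a config named `dataset_<id>` maps to `datasets/<id>.tar`, following the pattern of the configs shown earlier. `archive_url` and `load_webdataset_config` are hypothetical helpers, not part of `datasets`.

```python
REPO_ID = "bigdata-pw/Dataception"


def archive_url(config_name: str, repo_id: str = REPO_ID) -> str:
    """Build an hf:// URL for a config named 'dataset_<id>'.

    Assumes the archive lives at datasets/<id>.tar, as in the README
    YAML quoted in this issue.
    """
    prefix = "dataset_"
    if not config_name.startswith(prefix):
        raise ValueError(f"Expected a name like 'dataset_<id>', got {config_name!r}")
    dataset_id = config_name[len(prefix):]
    return f"hf://datasets/{repo_id}/datasets/{dataset_id}.tar"


def load_webdataset_config(config_name: str):
    """Load one archive with the WebDataset builder, skipping inference."""
    from datasets import load_dataset  # imported lazily; requires `datasets`

    return load_dataset(
        "webdataset",
        data_files={"train": archive_url(config_name)},
        split="train",
    )


print(archive_url("dataset_7691"))
```

Calling `load_webdataset_config("dataset_7691")` would then request the WebDataset builder directly, at the cost of no longer using the config names defined in the README.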

Relevant code:

hlky changed the title ~~load_dataset from Hub with name to specify config using incorrect builder type~~ Aug 18, 2024
