Cache only changed columns? #7138

Open
Modexus opened this issue Sep 5, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@Modexus
Contributor

Modexus commented Sep 5, 2024

Feature request

Cache only the actual changes to the dataset, i.e. the changed columns.

Motivation

I realized that caching saves the complete dataset again, not just what changed.
This is especially problematic for image datasets: if one only wants to change another column, e.g. some metadata, one has to save 5 TB again.

Your contribution

Is this even viable in the current architecture of the package?
I quickly looked into it and it seems it would require significant changes.

I would spend some time looking into this, but maybe somebody could help assess the feasibility and outline an implementation plan before I spend too much time on it?

Modexus added the enhancement (New feature or request) label on Sep 5, 2024
@Modexus
Contributor Author

Modexus commented Sep 5, 2024

So I guess a workaround is to simply remove all columns except the ones to cache, process those, and then add the removed columns back with concatenate_datasets(..., axis=1).

@lhoestq
Member

lhoestq commented Sep 20, 2024

Yes, this is the right workaround. We're keeping the cache like this to make it easier for people to delete intermediate cache files.

2 participants