I am not sure whether this is a new feature, but I wanted to post the problem here and hear whether others have ways to optimize and speed up this process.
Suppose I have a very large dataset that does not fit in memory. As far as I know, the only way to load it is with streaming=True. The dataset consists of many tables, and ideally I would apply a simple filtering criterion so that I only see the "good" tables. Here is an example of what the code might look like:
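from itertools import islice

from datasets import load_dataset

# Stream the dataset instead of loading it into memory.
# split="train" is an assumption; without a split, load_dataset returns a dict of splits.
dataset = load_dataset(
    "really-large-dataset",
    split="train",
    streaming=True,
)
# And let's say we process the dataset bit by bit because we want intermediate results
dataset = islice(dataset, 10000)
# Define a function to filter the data
def filter_function(table):
    # some_condition is a placeholder for the actual criterion on `table`
    return bool(some_condition)
# Use the filter function on your dataset
filtered_dataset = (ex for ex in dataset if filter_function(ex))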
I would then work on the filtered dataset, which should be orders of magnitude faster than working on the original. I would love to hear whether the problem setup and proposed solution make sense to people, and whether anyone has suggestions!
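To make the "bit by bit" idea concrete, here is a minimal, standard-library-only sketch of filtering a stream in fixed-size chunks so that intermediate results are available after each chunk. The names `fake_stream`, `is_good`, and `filtered_chunks`, as well as the `num_rows` field, are hypothetical stand-ins and not part of the `datasets` library. As far as I know, streamed datasets in `datasets` also expose their own lazy `filter` and `take` methods, which may be the more direct route.

```python
from itertools import islice


def fake_stream(n):
    """Hypothetical stand-in for a streamed dataset: lazily yields one dict per table."""
    for i in range(n):
        yield {"id": i, "num_rows": i % 5}


def is_good(table):
    # Hypothetical criterion: keep tables with at least 3 rows.
    return table["num_rows"] >= 3


def filtered_chunks(stream, chunk_size):
    """Read chunk_size examples at a time and yield the good tables from each chunk."""
    iterator = iter(stream)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield [table for table in chunk if is_good(table)]


# Only chunk_size examples are held in memory at any time while reading.
chunks = list(filtered_chunks(fake_stream(20), chunk_size=10))
```

Each element of `chunks` is the filtered result for one slice of the stream, so work on the "good" tables can start before the whole dataset has been read.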
Motivation
See description above
Your contribution
Happy to open a PR if this turns out to be a new feature.