Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: CoW does not seem to work on an index with duplicated labels #59272

Open
2 of 3 tasks
arnaudlegout opened this issue Jul 18, 2024 · 10 comments
Open
2 of 3 tasks

BUG: CoW does not seem to work on an index with duplicated labels #59272

arnaudlegout opened this issue Jul 18, 2024 · 10 comments
Assignees
Labels
Closing Candidate May be closeable, needs more eyeballs Copy / view semantics Performance Memory or execution speed performance

Comments

@arnaudlegout
Copy link
Contributor

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd

pd.set_option("mode.copy_on_write", True)
size = 100_000_000
df = pd.DataFrame(
    np.random.randint(1, 100, (size, 1)), columns=["a"], dtype=np.uint32
)
df["b"] = range(size)
df.iloc[-2, -1] = size-1
df = df.set_index("b").sort_index()

%time df.loc[0:200_000,:]
>>>
CPU times: total: 4.8 s
Wall time: 9.4 s

Issue Description

Here I create a dataframe with a column that contains duplicated values. When I set this column in the index and slice on it, I expect to obtain a view with CoW, but this operation in my example takes seconds to run (instead of a few hundreds of microseconds when CoW returns a view)

Expected Behavior

I expect df.loc[0:200_000, :] to return a view with CoW.

Installed Versions

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.12.2.final.0
python-bits : 64
OS : Windows
OS-release : 11
Version : 10.0.22631
machine : AMD64
processor : Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : fr_FR.cp1252

pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 69.5.1
pip : 24.0
Cython : 3.0.10
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.25.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : 1.3.7
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.4
numba : 0.60.0
numexpr : 2.8.7
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 14.0.2
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.13.1
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : 2.0.1
zstandard : None
tzdata : 2023.3
qtpy : 2.4.1
pyqt5 : None

@arnaudlegout arnaudlegout added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 18, 2024
@sgysherry
Copy link

Hi, I use df = pd.DataFrame(np.arange(size), columns=["a"], dtype=np.uint32(hopefully no duplication) and df = pd.DataFrame(np.random.randint(1, 10, (size, 1)), columns=["a"], dtype=np.uint32 (hopefully more duplication), they all give the same execution time.
Could you clarify a little bit more why is the duplication relevant?

@arnaudlegout
Copy link
Contributor Author

arnaudlegout commented Jul 25, 2024

Hi, duplication in the index is the issue, not in a column. My example might be misleading, the problem does not come from column a, but from the index (made from column b).

@sgysherry
Copy link

take

@rhshadrach
Copy link
Member

I am seeing the same timings whether CoW is enabled or disabled. I believe this is just due to index being far more performant when it is unique. @arnaudlegout - can you confirm you're seeing the same?

@rhshadrach rhshadrach added Performance Memory or execution speed performance Copy / view semantics Closing Candidate May be closeable, needs more eyeballs and removed Needs Triage Issue that has not been reviewed by a pandas team member Bug labels Aug 12, 2024
@sgysherry
Copy link

sgysherry commented Aug 12, 2024

Hi,

Sorry for the late reply I forgot to update.
I am seeing the same, the one using range index is more performant, that can be the reason. To see the difference, try comment out df.iloc[-2, -1] = size-1 and you can trace the different index it is using.

However, when I was doing some investigation into this issue, I don't think the pd.set_option("mode.copy_on_write", False) is actually doing anything. But this can be a separate topic.

import numpy as np
import pandas as pd
import trace
import sys

def main():
    pd.set_option("mode.copy_on_write", True)
    size = 100_000_000
    df = pd.DataFrame(
        np.random.randint(1, 100, (size, 1)), columns=["a"], dtype=np.uint32
    )
    df["b"] = range(size)
    df.iloc[-2, -1] = size-1
    df = df.set_index("b").sort_index()

    print("Start slicing:")
    import time
    start = time.time()
    df.loc[0:200_000,:]
    end = time.time()
    print(end-start)

with open('trace_output_true.txt', 'w') as f:
    original_stdout = sys.stdout
    sys.stdout = f
    tracer = trace.Trace(trace=True, count=False)
    tracer.run('main()')

If you try to generate trace_output_false.txt with pd.set_option("mode.copy_on_write", False) and will notice trace_output_true.txt and trace_output_false.txt are running through the same code flow.

image

I am not sure why is this behaving like this, seems to me that pandas forcing COW now.

@rhshadrach
Copy link
Member

Ah - indeed, I believe we do have a fast path for range index here.

@arnaudlegout
Copy link
Contributor Author

@rhshadrach Indeed with CoW set to False I observe the same performance issue. But, my understanding is that accessing df.loc[0:200_000,:] should just return a view, so it should be really fast. Why an index performance issue should enter into the discussion here (as I am not making any computation and do not make any index operation).

Do you mean that CoW is asking the index to make some operation before returning the view? If this is the case, this is counter intuitive to me.

@rhshadrach
Copy link
Member

Is it possible you are thinking of iloc instead?

%time _ = df.iloc[0:200_000,:]
CPU times: user 975 μs, sys: 3 ms, total: 3.97 ms
Wall time: 3.98 ms

loc must look up where in the index each number 0, ..., 200_000 is in your index and return all the corresponding rows. This is not cheap, unless you have a RangeIndex, in which case we know a priori where they are.

@arnaudlegout
Copy link
Contributor Author

I misunderstood how .loc works with CoW. My mental model was the one of groupby with a lazy evaluation. I believed that .loc was just returning a view a deferred the actual identification when accessing the view. Anyway, the current implementation is fine and it does not change my performance issue.

So in summary, there is no CoW issue and my bugreport is not relevant.

However, I am surprised how slow it is to slice with .loc a sorted numeric index. There is certainly room for improvement, but this is another issue. Are you aware of a bugreport for this performance issue? Is it even a performance issue according to you?
In theory, the index being sorted and contiguous in this case, to `.loc[:200_000, :] should only go through 200000 entries, which should definitively not take seconds.

@rhshadrach
Copy link
Member

Is it even a performance issue according to you?

I've seen nothing in this issue yet indicating a performance issue. pandas indexing must consider many different cases. That being said, PRs to improve performance are always welcome.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Closing Candidate May be closeable, needs more eyeballs Copy / view semantics Performance Memory or execution speed performance
3 participants