BUG: CoW does not seem to work on an index with duplicated labels #59272

arnaudlegout · 2024-07-18T11:56:00Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd

pd.set_option("mode.copy_on_write", True)
size = 100_000_000
df = pd.DataFrame(
    np.random.randint(1, 100, (size, 1)), columns=["a"], dtype=np.uint32
)
df["b"] = range(size)
df.iloc[-2, -1] = size-1
df = df.set_index("b").sort_index()

%time df.loc[0:200_000,:]
>>>
CPU times: total: 4.8 s
Wall time: 9.4 s

Issue Description

Here I create a dataframe with a column that contains duplicated values. When I set this column in the index and slice on it, I expect to obtain a view with CoW, but this operation in my example takes seconds to run (instead of a few hundreds of microseconds when CoW returns a view)

Expected Behavior

I expect df.loc[0:200_000, :] to return a view with CoW.

Installed Versions

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.12.2.final.0
python-bits : 64
OS : Windows
OS-release : 11
Version : 10.0.22631
machine : AMD64
processor : Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : fr_FR.cp1252

pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 69.5.1
pip : 24.0
Cython : 3.0.10
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.25.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : 1.3.7
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.4
numba : 0.60.0
numexpr : 2.8.7
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 14.0.2
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.13.1
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : 2.0.1
zstandard : None
tzdata : 2023.3
qtpy : 2.4.1
pyqt5 : None

The text was updated successfully, but these errors were encountered:

sgysherry · 2024-07-25T12:18:18Z

Hi, I use df = pd.DataFrame(np.arange(size), columns=["a"], dtype=np.uint32(hopefully no duplication) and df = pd.DataFrame(np.random.randint(1, 10, (size, 1)), columns=["a"], dtype=np.uint32 (hopefully more duplication), they all give the same execution time.
Could you clarify a little bit more why is the duplication relevant?

arnaudlegout · 2024-07-25T12:36:32Z

Hi, duplication in the index is the issue, not in a column. My example might be misleading, the problem does not come from column a, but from the index (made from column b).

sgysherry · 2024-07-25T15:47:18Z

take

rhshadrach · 2024-08-12T21:25:02Z

I am seeing the same timings whether CoW is enabled or disabled. I believe this is just due to index being far more performant when it is unique. @arnaudlegout - can you confirm you're seeing the same?

sgysherry · 2024-08-12T21:34:17Z

Hi,

Sorry for the late reply I forgot to update.
I am seeing the same, the one using range index is more performant, that can be the reason. To see the difference, try comment out df.iloc[-2, -1] = size-1 and you can trace the different index it is using.

However, when I was doing some investigation into this issue, I don't think the pd.set_option("mode.copy_on_write", False) is actually doing anything. But this can be a separate topic.

import numpy as np
import pandas as pd
import trace
import sys

def main():
    pd.set_option("mode.copy_on_write", True)
    size = 100_000_000
    df = pd.DataFrame(
        np.random.randint(1, 100, (size, 1)), columns=["a"], dtype=np.uint32
    )
    df["b"] = range(size)
    df.iloc[-2, -1] = size-1
    df = df.set_index("b").sort_index()

    print("Start slicing:")
    import time
    start = time.time()
    df.loc[0:200_000,:]
    end = time.time()
    print(end-start)

with open('trace_output_true.txt', 'w') as f:
    original_stdout = sys.stdout
    sys.stdout = f
    tracer = trace.Trace(trace=True, count=False)
    tracer.run('main()')

If you try to generate trace_output_false.txt with pd.set_option("mode.copy_on_write", False) and will notice trace_output_true.txt and trace_output_false.txt are running through the same code flow.

I am not sure why is this behaving like this, seems to me that pandas forcing COW now.

rhshadrach · 2024-08-12T21:35:52Z

Ah - indeed, I believe we do have a fast path for range index here.

arnaudlegout · 2024-08-13T08:08:10Z

@rhshadrach Indeed with CoW set to False I observe the same performance issue. But, my understanding is that accessing df.loc[0:200_000,:] should just return a view, so it should be really fast. Why an index performance issue should enter into the discussion here (as I am not making any computation and do not make any index operation).

Do you mean that CoW is asking the index to make some operation before returning the view? If this is the case, this is counter intuitive to me.

rhshadrach · 2024-08-13T09:58:49Z

Is it possible you are thinking of iloc instead?

%time _ = df.iloc[0:200_000,:]
CPU times: user 975 μs, sys: 3 ms, total: 3.97 ms
Wall time: 3.98 ms

loc must look up where in the index each number 0, ..., 200_000 is in your index and return all the corresponding rows. This is not cheap, unless you have a RangeIndex, in which case we know a priori where they are.

arnaudlegout · 2024-08-13T10:31:14Z

I misunderstood how .loc works with CoW. My mental model was the one of groupby with a lazy evaluation. I believed that .loc was just returning a view a deferred the actual identification when accessing the view. Anyway, the current implementation is fine and it does not change my performance issue.

So in summary, there is no CoW issue and my bugreport is not relevant.

However, I am surprised how slow it is to slice with .loc a sorted numeric index. There is certainly room for improvement, but this is another issue. Are you aware of a bugreport for this performance issue? Is it even a performance issue according to you?
In theory, the index being sorted and contiguous in this case, to `.loc[:200_000, :] should only go through 200000 entries, which should definitively not take seconds.

rhshadrach · 2024-08-13T10:38:49Z

Is it even a performance issue according to you?

I've seen nothing in this issue yet indicating a performance issue. pandas indexing must consider many different cases. That being said, PRs to improve performance are always welcome.

arnaudlegout added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 18, 2024

github-actions bot assigned sgysherry Jul 25, 2024

rhshadrach added Performance Memory or execution speed performance Copy / view semantics Closing Candidate May be closeable, needs more eyeballs and removed Needs Triage Issue that has not been reviewed by a pandas team member Bug labels Aug 12, 2024

arnaudlegout closed this as completed Aug 13, 2024

arnaudlegout reopened this Aug 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: CoW does not seem to work on an index with duplicated labels #59272

BUG: CoW does not seem to work on an index with duplicated labels #59272

arnaudlegout commented Jul 18, 2024

INSTALLED VERSIONS

sgysherry commented Jul 25, 2024

arnaudlegout commented Jul 25, 2024 •

edited

Loading

sgysherry commented Jul 25, 2024

rhshadrach commented Aug 12, 2024

sgysherry commented Aug 12, 2024 •

edited

Loading

rhshadrach commented Aug 12, 2024

arnaudlegout commented Aug 13, 2024

rhshadrach commented Aug 13, 2024

arnaudlegout commented Aug 13, 2024

rhshadrach commented Aug 13, 2024

BUG: CoW does not seem to work on an index with duplicated labels #59272

BUG: CoW does not seem to work on an index with duplicated labels #59272

Comments

arnaudlegout commented Jul 18, 2024

Pandas version checks

Reproducible Example

Issue Description

Expected Behavior

Installed Versions

INSTALLED VERSIONS

sgysherry commented Jul 25, 2024

arnaudlegout commented Jul 25, 2024 • edited Loading

sgysherry commented Jul 25, 2024

rhshadrach commented Aug 12, 2024

sgysherry commented Aug 12, 2024 • edited Loading

rhshadrach commented Aug 12, 2024

arnaudlegout commented Aug 13, 2024

rhshadrach commented Aug 13, 2024

arnaudlegout commented Aug 13, 2024

rhshadrach commented Aug 13, 2024

arnaudlegout commented Jul 25, 2024 •

edited

Loading

sgysherry commented Aug 12, 2024 •

edited

Loading