BUG: CoW does not seem to work on an index with duplicated labels #59272
Comments
Hi, I use |
Hi, duplication in the index is the issue, not in a column. My example might be misleading, the problem does not come from column |
take |
I am seeing the same timings whether CoW is enabled or disabled. I believe this is just due to index being far more performant when it is unique. @arnaudlegout - can you confirm you're seeing the same? |
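One way to check the unique versus non-unique claim is to time the same label slice on two sorted indexes that differ only by one duplicated label. This is a minimal sketch, not taken from the thread: the helper name `time_label_slice`, the reduced size and the single-column frame are assumptions chosen so that it runs quickly.

```python
import time

import numpy as np
import pandas as pd


def time_label_slice(index_values, stop):
    """Time df.loc[0:stop, :] on a frame indexed by index_values.

    Hypothetical helper for illustration. The timing covers the first call,
    so it includes any index properties that pandas computes lazily.
    """
    df = pd.DataFrame(
        {"a": np.ones(len(index_values), dtype=np.uint32)},
        index=pd.Index(index_values),
    )
    start = time.perf_counter()
    result = df.loc[0:stop, :]
    elapsed = time.perf_counter() - start
    return elapsed, len(result), df.index.is_unique


size = 1_000_000  # much smaller than the 100_000_000 rows used in the report
unique_labels = np.arange(size)
duplicated_labels = unique_labels.copy()
duplicated_labels[-2] = size - 1  # last two labels are now equal, still sorted

unique_run = time_label_slice(unique_labels, 2_000)
duplicated_run = time_label_slice(duplicated_labels, 2_000)

print(f"unique index:     {unique_run[0]:.6f} s")
print(f"duplicated index: {duplicated_run[0]:.6f} s")
```

Running it once with `pd.set_option("mode.copy_on_write", True)` and once without should show whether the CoW setting changes either timing on a given pandas version.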
Hi, sorry for the late reply; I forgot to update. However, when I was doing some investigation into this issue, I don't think the
import sys
import time
import trace

import numpy as np
import pandas as pd


def main():
    pd.set_option("mode.copy_on_write", True)
    size = 100_000_000
    df = pd.DataFrame(
        np.random.randint(1, 100, (size, 1)), columns=["a"], dtype=np.uint32
    )
    df["b"] = range(size)
    # Duplicate one label so the index is no longer unique.
    df.iloc[-2, -1] = size - 1
    df = df.set_index("b").sort_index()
    print("Start slicing:")
    start = time.time()
    df.loc[0:200_000, :]
    end = time.time()
    print(end - start)


# Write the line-by-line execution trace to a file instead of the console.
with open("trace_output_true.txt", "w") as f:
    original_stdout = sys.stdout
    sys.stdout = f
    try:
        tracer = trace.Trace(trace=True, count=False)
        tracer.run("main()")
    finally:
        # Restore stdout even if the traced run fails.
        sys.stdout = original_stdout
If you try to generate trace_output_false.txt with pd.set_option("mode.copy_on_write", False), you will notice
I am not sure why it behaves like this; it seems to me that pandas is forcing CoW now. |
Ah - indeed, I believe we do have a fast path for range index here. |
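For context, the index type of a frame can be inspected directly. The sketch below rebuilds the frame from the script above at a much smaller size (the size of 1_000 is an assumption for speed) and shows that the default index is a RangeIndex, while the index produced by `set_index("b").sort_index()` is a regular, non-unique but sorted index. It does not show which code path pandas takes internally.

```python
import numpy as np
import pandas as pd

size = 1_000  # reduced from the 100_000_000 rows in the script above
df = pd.DataFrame(
    np.random.randint(1, 100, (size, 1)), columns=["a"], dtype=np.uint32
)
default_index = df.index  # the index pandas creates when none is given

df["b"] = range(size)
df.iloc[-2, -1] = size - 1  # duplicate the last label
indexed = df.set_index("b").sort_index()

print(type(default_index).__name__)
print(type(indexed.index).__name__)
print("unique:", indexed.index.is_unique)
print("sorted:", indexed.index.is_monotonic_increasing)
```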
@rhshadrach Indeed, with CoW set to False I observe the same performance issue. But my understanding is that accessing Do you mean that CoW is asking the index to perform some operation before returning the view? If this is the case, this is counterintuitive to me. |
Is it possible you are thinking of
|
I misunderstood how So in summary, there is no CoW issue and my bug report is not relevant. However, I am surprised how slow it is to slice with |
I've seen nothing in this issue yet indicating a performance issue. pandas indexing must consider many different cases. That being said, PRs to improve performance are always welcome. |
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
Here I create a dataframe with a column that contains duplicated values. When I set this column as the index and slice on it, I expect to obtain a view with CoW, but this operation in my example takes seconds to run (instead of a few hundred microseconds when CoW returns a view).
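Whether a slice is backed by the parent's memory can be checked separately from how long the slice takes. The sketch below is an illustration under assumptions (a small single-column frame and the variable names `subset` and `shares` are not from the report): it builds a sorted index with one duplicated label and uses `np.shares_memory` to compare the column buffers of the frame and of its slice.

```python
import numpy as np
import pandas as pd

size = 10_000  # small on purpose; the report uses 100_000_000 rows
labels = np.arange(size)
labels[-2] = size - 1  # one duplicated label, the index stays sorted

df = pd.DataFrame(
    {"a": np.ones(size, dtype=np.uint32)},
    index=pd.Index(labels, name="b"),
)

# Label-based slice, as in the report.
subset = df.loc[0:2_000, :]

# True means the slice reuses the parent's buffer instead of copying it.
shares = np.shares_memory(df["a"].to_numpy(), subset["a"].to_numpy())
print("slice shares memory with parent:", shares)
```

If the slice shares memory but is still slow, the time is being spent on locating the labels in the index, not on copying data.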
Expected Behavior
I expect df.loc[0:200_000, :] to return a view with CoW.
Installed Versions
INSTALLED VERSIONS
commit : d9cdd2e
python : 3.12.2.final.0
python-bits : 64
OS : Windows
OS-release : 11
Version : 10.0.22631
machine : AMD64
processor : Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : fr_FR.cp1252
pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 69.5.1
pip : 24.0
Cython : 3.0.10
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.25.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : 1.3.7
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.4
numba : 0.60.0
numexpr : 2.8.7
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 14.0.2
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.13.1
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : 2.0.1
zstandard : None
tzdata : 2023.3
qtpy : 2.4.1
pyqt5 : None