
I got the following warning:

PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider using pd.concat instead. To get a de-fragmented frame, use newframe = frame.copy()

when I tried to append multiple dataframes like this:

import pandas as pd

df1 = pd.DataFrame()
for file in files:                          # `files`: list of paths to read
    df = pd.read_pickle(file)               # the real files are pickles
    df['id'] = file                         # <---- this line causes the warning
    df1 = df1.append(df, ignore_index=True)

I wonder if anyone can explain how copy() can avoid or reduce the fragmentation problem, or suggest other solutions to avoid the issue.


I tried to create test code to reproduce the problem, but I don't see the PerformanceWarning with a test dataset (random integers). The same code keeps producing the warning when reading the real dataset, so it looks like something in the real dataset triggers the issue.

import pandas as pd
import numpy as np
import os
import glob
rows = 35000
cols = 1900
def gen_data(rows, cols, num_files):
    if not os.path.isdir('./data'):
        os.mkdir('./data')
    files = []
    for i in range(num_files):
        file = f'./data/{i}.pkl'
        if not os.path.isfile(file):      # only generate a file if it doesn't exist yet
            pd.DataFrame(
                np.random.randint(1, 1_000, (rows, cols))
            ).to_pickle(file)
        files.append(file)                # always return the full list of paths
    return files

# Comment out the second line to run the testing dataset; comment out the first line to run the real dataset
files = gen_data(rows, cols, 10)  # testing dataset, runs okay
files = glob.glob('../pickles3/my_data_*.pickle')  # real dataset, gets the performance warning

dfs = []
for file in files:
    df = pd.read_pickle(file)
    df['id'] = file

    dfs.append(df)

dfs = pd.concat(dfs, ignore_index=True)
Chung-Kan Huang

  • When reassigning, invoke copy on your frame. – ifly6 Jul 07 '21 at 20:54
  • This should probably be something like `df1 = pd.concat([pd.read(file).assign(id=file) for file in files])` – Henry Ecker Jul 07 '21 at 20:56
  • A simple python list of dataframes is lighter weight than appended dataframes. As long as you can afford to hold both the dataframes in the list and the final concatenated dataframe in memory at the same time, @HenryEcker has a good solution. – tdelaney Jul 07 '21 at 20:59
  • @ifly6, I tried using a list and concat, but they do not seem to be the cause of the fragmentation. I am curious what you meant by invoking copy when reassigning. Do you mind providing an example? Thanks. – Chung-Kan Huang Jul 08 '21 at 17:45
  • `df1 = df1.append(df, ignore_index=True).copy()` – ifly6 Jul 08 '21 at 17:46
  • Thanks @ifly6, I tried but I still got the same warning. PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider using pd.concat instead. To get a de-fragmented frame, use `newframe = frame.copy()` df['id'] = file – Chung-Kan Huang Jul 08 '21 at 17:55
  • @ifly6, I suspect that when I create a new column for each df I make the data "scatter" more and therefore become more fragmented. After append or concat with the fragmented dataframe, I will start to suffer performance problems if I continue to use the resulting df. However, I might be able to resolve this by making a copy to defragment. – Chung-Kan Huang Jul 08 '21 at 18:01

5 Answers


append is not an efficient method for this operation. concat is more appropriate in this situation.

Replace

df1 = df1.append(df, ignore_index=True)

with

df1 = pd.concat((df1, df), axis=0, ignore_index=True)

Details about the differences are in this question: Pandas DataFrame concat vs append
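
Note that calling concat once per loop iteration still copies the growing frame each time. A common alternative, sketched here assuming the files are pickles as in the question's test code and that `files` is the list of paths from the question, is to collect the pieces in a list and concatenate once at the end:

import pandas as pd

pieces = []
for file in files:                           # `files`: list of paths, as in the question
    df = pd.read_pickle(file)
    df['id'] = file                          # a single new column on a fresh frame is cheap
    pieces.append(df)

df1 = pd.concat(pieces, ignore_index=True)   # one concat at the end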

Polkaguy6000

  • The warning is what's inefficient, IMO. It was well intended that pandas should educate users about high-performance computing, and I am personally a huge fan of vectorization. But warning messages should not be abused for 'vectorization 101 tutorials'. I currently get >50 of these warnings logged in <0.3 seconds, so my application is clearly *not* suffering from bad performance, but from an incredibly dirty stdout log, which is being polluted by obsolete 'educational' warnings for no reason. – KingOtto May 22 '23 at 12:27

Aware that this might be a reply that some will find highly controversial, I'm still posting my opinion here...

Proposed answer: Ignore the warning. If the user thinks/observes that the code suffers from poor performance, it's the user's responsibility to fix it, not the module's responsibility to propose code refactoring steps.

Rationale for this harsh reply: I am seeing this warning at many different places now that I have migrated to pandas v2.0.0. The reason is that, at multiple places in the script, I remove and add records from dataframes, using many calls to .loc[] and .concat().

Now, given that we are pretty savvy in vectorization, we perform these operations with performance in mind (e.g., never inside a for loop, but rather ripping out an entire block of records, such as overwriting some "inner 20%" of the dataframe after multiple pd.merge() operations - think of it as ETL operations on a database implemented in pandas instead of SQL). We see that the application runs incredibly fast, even though some dataframes contain ~4.5 million records. More specifically: for one script, I get >50 of these warnings logged in <0.3 seconds, which I, subjectively, don't perceive as particularly "poor performance" (running a serial application with PyCharm in 'debugging' mode - so not exactly a setup in which you would expect the best performance in the first place).

So, I conclude:

  • The code ran with pandas <2.0.0, and never raised a warning
  • The performance is excellent
  • We have multiple colleagues with a PhD in high-performance computing working on the code, and they believe it's fine
  • Module warning messages should not be abused for 'tutorials' or 'educational purposes' (even if well intended) - this is different from, for example, the SettingWithCopyWarning, where chances are very high that the functional behavior of the module leads to incorrect output. Here, it's just a 100% educational warning - that deserves, if anything, the logger level "info" (if not "debug"), certainly not "warning"
  • We get an incredibly dirty stdout log, for no reason
  • The warning itself is highly misleading - we don't have a single call to .insert() across the entire ecosystem - the fragmentation that we do have in our dataframes comes from many iterative, but fast, updates - so thanks for sending us down the wrong path

We will certainly not refactor code that shows excellent performance, and that has been tested and validated over and over again, just because someone from the pandas team wants to educate us about stuff we already know :/ If the performance were at least poor, I would welcome this message as a suggestion for improvement (even then: not a warning, but an 'info') - but given its current indiscriminate way of popping up: for once, it's actually the module that's the problem, not the user.

Edit: This is 100% the same issue as the warning PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance. - which, despite warning me about "performance", pops up 28 times (!) in less than 3 seconds - again, in debugging mode of PyCharm. I'm pretty sure removing the warning alone would improve performance by 20% (or, 20 ms per operation ;)). It also starts appearing as of pandas v2.0.0 and should be removed from the module altogether.
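
If, after weighing this, you do decide to silence the message, a minimal sketch using the standard warnings module could look like this (pandas exposes the class as pd.errors.PerformanceWarning):

import warnings
import pandas as pd

# Globally ignore only pandas' PerformanceWarning; all other warnings stay visible.
warnings.filterwarnings("ignore", category=pd.errors.PerformanceWarning)

# Or scope the filter to a single block so it does not leak into the rest of the application:
with warnings.catch_warnings():
    warnings.simplefilter("ignore", pd.errors.PerformanceWarning)
    ...  # fragment-heavy .loc[] / .concat() operations here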

KingOtto

I had the same problem. This raised the PerformanceWarning:

df['col1'] = False
df['col2'] = 0
df['col3'] = 'foo'

This didn't:

df[['col1', 'col2', 'col3']] = (False, 0, 'foo')

This doesn't raise the warning either, but doesn't do anything about the underlying issue.

df.loc[:, 'col1'] = False
df.loc[:, 'col2'] = 0
df.loc[:, 'col3'] = 'foo'

Maybe you're adding single columns elsewhere?

copy() is supposed to consolidate the dataframe, and thus defragment it. There was a bug fix for this in pandas 1.3.1 (GH 42579: https://github.com/pandas-dev/pandas/pull/42579). Copies of a larger dataframe can get expensive.

Tested on pandas 1.5.2, python 3.8.15.
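
To see the consolidation that copy() performs, you can peek at the number of internal blocks. A small sketch (note: _mgr and nblocks are internal, non-public pandas attributes, used here purely for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 1_000, (1_000, 5)))

# Each scalar column assignment appends a new internal block;
# past ~100 blocks this loop starts emitting the PerformanceWarning.
for i in range(150):
    df[f'col{i}'] = i

print(df._mgr.nblocks)   # many blocks, i.e. a fragmented frame

df = df.copy()           # copy() consolidates the blocks
print(df._mgr.nblocks)   # back to a small number of blocks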

Frank_Coumans

This is a problem with a recent update. Check this issue from pandas-dev. It seems to have been resolved in pandas version 1.3.1 (reference PR).

bruno-uy

Assigning more than 100 non-extension dtype new columns causes this warning (source code).1 For example, the following reproduces it:

df = pd.DataFrame(index=range(5))
df[[f"col{x}" for x in range(101)]] = range(101)    # <---- PerformanceWarning

Using an extension dtype silences the warning.

df = pd.DataFrame(index=range(5))
df[[f"col{x}" for x in range(101)]] = pd.DataFrame([range(101)], index=df.index, dtype='Int64')  # <---- no warning

However, in most cases, pd.concat() as suggested by the warning is a better solution. For the case above, that would be as follows.

df = pd.DataFrame(index=range(5))
df = pd.concat([
    df, 
    pd.DataFrame([range(101)], columns=[f"col{x}" for x in range(101)], index=df.index)
], axis=1)

For the example in the OP, the following would silence the warning (because assign creates a copy).

dfs = pd.concat([pd.read_pickle(file).assign(id=file) for file in files], ignore_index=True)

1: New column assignment is done via the `__setitem__()` method, which calls the `insert()` method of the `BlockManager` object (the internal data structure that holds pandas dataframes). That's why the warning says `insert` is being called repeatedly.

cottontail