I got the following warning:

PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider using pd.concat instead. To get a de-fragmented frame, use newframe = frame.copy()
when I tried to append multiple DataFrames like this:
df1 = pd.DataFrame()
for file in files:
    df = pd.read_pickle(file)
    df['id'] = file  # <---- this line causes the warning
    df1 = df1.append(df, ignore_index=True)
I wonder if anyone can explain how copy() can avoid or reduce the fragmentation problem, or suggest other solutions to avoid the issue.
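From reading about the warning, my understanding so far is that each df[col] = ... assignment of a new column inserts a new internal block, and that copy() lays the data out in consolidated blocks again. Below is a minimal sketch of that understanding (my own test, not part of my real code; the _mgr.nblocks attribute is pandas-internal and may change between versions):

import pandas as pd
import numpy as np

# Inserting many columns one at a time is, as far as I can tell,
# what fragments the frame and triggers the warning
df = pd.DataFrame(np.random.randint(1, 1_000, (100, 5)))
for i in range(150):
    df[f'col_{i}'] = np.random.randint(1, 1_000, 100)

print(df._mgr.nblocks)  # many internal blocks -> fragmented

df = df.copy()          # copy() consolidates the blocks into contiguous arrays
print(df._mgr.nblocks)  # far fewer blocks after the copy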
I tried to write a test script to reproduce the problem, but I don't see the PerformanceWarning with a testing dataset (random integers). The same code keeps producing the warning when reading the real dataset. It looks like something in the real dataset triggers the issue.
import pandas as pd
import numpy as np
import os
import glob

rows = 35000
cols = 1900

def gen_data(rows, cols, num_files):
    if not os.path.isdir('./data'):
        os.mkdir('./data')
    files = []
    for i in range(num_files):
        file = f'./data/{i}.pkl'
        pd.DataFrame(
            np.random.randint(1, 1_000, (rows, cols))
        ).to_pickle(file)
        files.append(file)
    return files
# The second assignment overrides the first: comment out the second line
# to run the testing dataset, or the first line to run only the real dataset
files = gen_data(rows, cols, 10)                    # testing dataset, runs okay
files = glob.glob('../pickles3/my_data_*.pickle')   # real dataset, gets the performance warning
dfs = []
for file in files:
    df = pd.read_pickle(file)
    df['id'] = file
    dfs.append(df)
dfs = pd.concat(dfs, ignore_index=True)
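One diagnostic I considered (my own addition, again using the pandas-internal _mgr.nblocks attribute, so it may change between versions): counting the internal blocks of a loaded frame. My assumption is that the random-integer test frames are a single uniform-dtype block, so adding the id column is cheap, while the real pickles may already be fragmented when loaded (pickling seems to preserve the block layout), and the extra insert then pushes them over the warning threshold:

probe = pd.read_pickle(files[0])
print(probe.dtypes.value_counts())  # mixed dtypes mean at least one block per dtype
print(probe._mgr.nblocks)           # block count before the df['id'] = file insert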