I have thousands of CSV files on disk, each around 10 MB in size with roughly 10K columns. Most of these columns hold real (float) values.
I would like to create a dataframe by concatenating these files. Once I have this dataframe, I would like to sort its entries by the first two columns.
I currently have the following:
    import numpy as np
    import pandas as pd

    # read every file into its own frame (the dtype overrides keep 'c1'/'c2' as plain strings)
    my_dfs = list()
    for file in p_files:
        my_dfs.append(
            pd.read_csv(file, sep=':', dtype={'c1': np.object_, 'c2': np.object_}))

    print("Concatenating files ...")
    df_merged = pd.concat(my_dfs)

    print("Sorting the result by the first two columns ...")
    df_merged = df_merged.sort_values(['videoID', 'frameID'], ascending=[True, True])

    print("Saving it to disk ...")
    df_merged.to_csv(p_output, sep=':', index=False)
But this requires so much memory that the process is killed before it finishes: the logs show it being killed while it is using around 10 GB of memory.
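A rough back-of-envelope calculation (only a sketch, reusing the approximate figures above; the real file count is whatever `len(p_files)` is) already puts the raw data near that limit before pandas does any extra work:

    n_files = len(p_files)   # "thousands" of files
    mb_per_file = 10         # ~10 MB of CSV text per file
    print("raw CSV text alone: ~%.1f GB" % (n_files * mb_per_file / 1024.0))
    # On top of that, pd.concat materialises a full copy of the data while my_dfs
    # still holds every per-file frame, so the peak is roughly twice the in-memory size.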
I am trying to figure out where exactly it fails, but I have not managed to yet (although I hope to capture the stdout in a log soon).
Is there a better way to do this in Pandas?
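For reference, this is the kind of lower-memory variation I could try (only a sketch: it assumes float32 precision is acceptable for the value columns and keeps the same overall flow):

    import numpy as np
    import pandas as pd

    frames = []
    for path in p_files:
        df = pd.read_csv(path, sep=':', dtype={'c1': np.object_, 'c2': np.object_})
        # downcast the ~10K float columns from float64 to float32, roughly halving their size
        float_cols = df.select_dtypes(include=['float64']).columns
        df[float_cols] = df[float_cols].astype(np.float32)
        frames.append(df)

    # note: while concat runs, both `frames` and the result are in memory
    df_merged = pd.concat(frames, ignore_index=True)
    del frames  # drop the per-file copies before sorting
    df_merged = df_merged.sort_values(['videoID', 'frameID'])
    df_merged.to_csv(p_output, sep=':', index=False)

Even with the downcast, I suspect the peak during the concatenation is still the problem.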