I have a very big csv file to manage, with this process :
- group the file by 3 columns
- for each group, sort the dataframe on 5 columns
- write this dataframe in csv file
Here was my first try:
file = pd.read_csv('file.csv')
grouped = file.groupby([col1, col2, col3])
for key, df in grouped:
name = 'key.csv'
df = df.sort_values(by=[col4, col5, col6, col7, col8])
df.to_csv(name , index=False)
yield name
Good point of this method : I can yield at each iteration the filename, and so continue my ETL process of the file without waiting the other to be ready, and I sort directly the dataframe before to write a csv.
Bad point : the file is too big to process like that, I have a memory error.
So my second (and current) try:
list_files = []
for chunk in pd.read_csv('file.csv', chunksize=CHUNKSIZE):
grouped = chunk.groupby([col1, col2, col3])
for key, df in grouped:
name = 'key.csv'
if Path(name).exists():
df.to_csv(name, index=False, header=False, mode='a')
else:
list_files.append(name)
df.to_csv(name, index=False)
yield list_files
Here : no problem of memory because I read the file with chunk.
But, as you can see, because I append the data to the file if it exits, the data is not sorted. So I need to yield list of all files, and create a second function that will do that :
def sort(list_files):
for filename in list_files:
df = pd.read_csv(filename)
df = df.sort_value(..)
df.to_csv(filename)
yield filename
So I need to read again each file, and here the process need to create all the list_files before to pass to the next step in the ETL process
Regarding this, do you know if there is a way (I don't see it currently), to solve the problem of memory error and doing this process of group/sort with a faster way? Maybe (and certainly) it is not possible, but any improvement will help (append the data into the file with a smarter way then the data is already sorted maybe?)
Thanks
Edit : Maybe a way could be to sort the big file before to read it, but again I will have problem of memory, don't know if there is other way than pandas to do it that will be better ?