
I have a very big CSV file to process, with the following steps:

  • group the file by 3 columns
  • for each group, sort the dataframe on 5 columns
  • write each sorted dataframe to its own CSV file

Here was my first try:

import pandas as pd

file = pd.read_csv('file.csv')
grouped = file.groupby([col1, col2, col3])
for key, df in grouped:
    name = f'{key}.csv'  # file name built from the group key
    df = df.sort_values(by=[col4, col5, col6, col7, col8])
    df.to_csv(name, index=False)
    yield name

The good point of this method: I can yield the filename at each iteration, so the rest of my ETL process can start working on that file without waiting for the other groups to be ready, and the dataframe is sorted before the CSV is written.
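To show what I mean, here is a minimal sketch of how the filenames would be consumed downstream, assuming the snippet above is wrapped in a generator function (split_and_sort and process_file are placeholder names):

# split_and_sort() is a hypothetical generator wrapping the snippet above:
# it yields each filename as soon as that group's CSV has been written.
for name in split_and_sort():
    process_file(name)  # placeholder for the next ETL step, which can start
                        # while the remaining groups are still being processed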

The bad point: the file is too big to be loaded in one go, so I get a memory error.

So my second (and current) try:

from pathlib import Path

list_files = []
for chunk in pd.read_csv('file.csv', chunksize=CHUNKSIZE):
    grouped = chunk.groupby([col1, col2, col3])
    for key, df in grouped:
        name = f'{key}.csv'  # file name built from the group key
        if Path(name).exists():
            df.to_csv(name, index=False, header=False, mode='a')
        else:
            list_files.append(name)
            df.to_csv(name, index=False)
yield list_files

Here there is no memory problem, because I read the file in chunks.

But, as you can see, because I append the data to a file that already exists, the data inside it is no longer sorted. So I have to yield the list of all the files and write a second function that sorts them:

def sort(list_files):
    for filename in list_files:
        df = pd.read_csv(filename)
        df = df.sort_values(by=[col4, col5, col6, col7, col8])
        df.to_csv(filename, index=False)
        yield filename

So I have to read every file a second time, and the process has to build the complete list_files before anything can be passed to the next step of the ETL process.

Given this, do you know a way (I don't see one at the moment) to solve the memory error and make this group/sort process faster? Maybe (and probably) it is not possible, but any improvement would help (for example, a smarter way to append the data to the files so that they end up already sorted?).
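To illustrate that last idea (just a sketch): if each chunk's group were sorted before being written, and written to its own run file instead of being appended, every run file would already be sorted and the runs belonging to one group could be merged lazily with heapq.merge, without reloading everything into memory. This assumes that comparing the sort columns as plain strings is acceptable:

import csv
import heapq

SORT_COLS = ['col4', 'col5', 'col6', 'col7', 'col8']  # the 5 sort columns (placeholder names)

def merge_runs(run_files, out_name):
    # K-way merge of already-sorted run files for one group into a single
    # sorted CSV, reading the runs lazily instead of loading them in memory.
    handles = [open(f, newline='') for f in run_files]
    try:
        readers = [csv.DictReader(h) for h in handles]
        sort_key = lambda row: tuple(row[c] for c in SORT_COLS)
        with open(out_name, 'w', newline='') as out:
            writer = csv.DictWriter(out, fieldnames=readers[0].fieldnames)
            writer.writeheader()
            for row in heapq.merge(*readers, key=sort_key):
                writer.writerow(row)
    finally:
        for h in handles:
            h.close()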

Thanks

Edit: Maybe another option would be to sort the big file before reading it, but then I run into the same memory problem. I don't know whether there is a way other than pandas to do that which would work better?

kilag

2 Answers


Dask implements most of pandas' functionality and will not give a MemoryError (although, apparently, the performance will not be as brilliant). A similar case: Killed/MemoryError when creating a large dask.dataframe from delayed collection
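As an untested sketch of what that could look like here (column names are placeholders; the sorting itself is done with a plain pandas sort inside groupby().apply(), so the whole file never has to fit in memory at once):

import dask.dataframe as dd

ddf = dd.read_csv('file.csv')

# Sort each group with a plain pandas sort inside the apply; Dask may warn
# that it is inferring the output schema (pass meta= to silence it).
result = ddf.groupby(['col1', 'col2', 'col3']).apply(
    lambda g: g.sort_values(by=['col4', 'col5', 'col6', 'col7', 'col8'])
)

# Writes one CSV per partition, with '*' replaced by the partition number.
result.to_csv('sorted-group-*.csv', index=False)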

Oleg O

I've been there, and I advise you to use Dask, which provides advanced parallelism for analytics (https://dask.org/), somewhat similar to what Spark does. You can then use the same code as in your first try:

  import dask.dataframe as dd

  file = dd.read_csv('file.csv')
  grouped = file.groupby([col1, col2, col3])
  for key, df in grouped:
      name = f'{key}.csv'  # file name built from the group key
      df = df.sort_values(by=[col4, col5, col6, col7, col8])
      df.to_csv(name, index=False)
      yield name

P.S.: if you get a memory error when saving the file to CSV, use the chunksize option of the to_csv function.
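For example, with pandas' DataFrame.to_csv (the chunk size below is just an illustrative value):

  # Write the rows in batches instead of all at once.
  df.to_csv(name, index=False, chunksize=100000)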

Souha Gaaloul