
I'm using the following script to

  • Apply a function to a column in each row of a DataFrame
  • Write the return values from that function into two new columns of the DataFrame
  • Continuously write the DataFrame to a *.csv file

I'd like to know whether there's a better way to run the following computation:

# df is a DataFrame with 500 rows and 20 columns

for index, row in df.iterrows():
    # transcribe() returns a (words, count) pair for the row's text
    df.loc[index, 'words'], df.loc[index, 'count'] = transcribe(df.loc[index, 'text'])
    # rewrites the entire DataFrame to disk on every iteration
    df.to_csv('out.csv', encoding='utf-8', index=False)

Currently, the script writes the full df DataFrame to the *.csv on every iteration (once per row), including the 'words' and 'count' values computed so far. I'd like to know whether it would also be possible to write the file line by line, i.e. to output only those rows that are already complete.

Thanks!

Christopher
  • With my current function, the described append mode has the effect that the full dataframe is appended multiple times. – Christopher Aug 10 '19 at 20:09

1 Answer


I can't understand why you want to do it row by row instead of writing the whole dataframe once at the end, but here is a solution to your question: write slices of the dataframe (i.e. the current row) in append mode, adding the header for the first row only:

is_first_row = True
for index, row in df.iterrows():
    df.loc[index, 'words'], df.loc[index, 'count'] = transcribe(df.loc[index, 'text'])
    # df.loc[index:index] is a one-row DataFrame slice; append it to the file,
    # writing the column header only for the very first row
    df.loc[index:index].to_csv('out.csv', encoding='utf-8', index=False, mode='a', header=is_first_row)
    is_first_row = False


Update based on the comment that the script could be interrupted:
In this case you may want to determine whether or not to write the header by checking whether the file is still empty:

with open('out.csv', encoding='utf-8', mode='a', newline='') as f:
    for index, row in df.iterrows():
        df.loc[index, 'words'], df.loc[index, 'count'] = transcribe(df.loc[index, 'text'])
        # f.tell() == 0 means the file is still empty, so write the header once
        df.loc[index:index].to_csv(f, index=False, header=f.tell() == 0)
Stef
  • There is a very long list and the function calls an API. Whenever the script is interrupted, I have sunk cost and no output. That's why I'd like to write row by row. – Christopher Aug 11 '19 at 08:28
  • @Christopher: I understand. See my updated answer for this case. If your script gets interrupted, you'll also lose the current value of `is_first_row`, so you'll have to determine the header condition dynamically. – Stef Aug 11 '19 at 09:12
  • As an alternative, you may want to wrap the whole logic in a `try` block and write the dataframe in the `finally` part of it; see the first sketch after these comments. – Stef Aug 11 '19 at 09:14
  • One last question: assume I have a pooling/multithreading setup in which four rows are computed at a time. How can I change your script above so that it always takes four rows, applies the function and writes them to the csv? – Christopher Aug 11 '19 at 10:02
  • I'm not sure how you'd implement it. If each thread has its own iterrows() loop you can use the solution as is, but the problem would be that 4 threads write to one file: see [here](https://stackoverflow.com/questions/11983938/python-appending-to-same-file-from-multiple-threads) for details. So maybe you need a queue that all 4 threads push their results into, plus one thread that takes the values from the queue, appends them to the dataframe and writes them out (one by one as shown above); a sketch of this follows after these comments. – Stef Aug 11 '19 at 17:09
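
A minimal sketch of the `try`/`finally` idea from the comments, assuming the `df` and `transcribe` from the question; note that `finally` runs on exceptions and Ctrl-C, but not if the process is killed outright:

try:
    for index, row in df.iterrows():
        df.loc[index, 'words'], df.loc[index, 'count'] = transcribe(df.loc[index, 'text'])
finally:
    # runs even if transcribe() raises or the loop is interrupted,
    # so everything computed so far is written out exactly once
    df.to_csv('out.csv', encoding='utf-8', index=False)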
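
And a minimal sketch of the queue idea, assuming four worker threads and the `df` and `transcribe` from the question; the thread count and index chunking are placeholders, not tested against a real API:

import threading
import queue

q = queue.Queue()

def worker(indices):
    # compute results only; leave all writing to the single writer thread
    for index in indices:
        words, count = transcribe(df.loc[index, 'text'])
        q.put((index, words, count))

def writer(total):
    with open('out.csv', encoding='utf-8', mode='a', newline='') as f:
        for _ in range(total):
            index, words, count = q.get()
            df.loc[index, 'words'], df.loc[index, 'count'] = words, count
            # write the completed row, with a header only while the file is empty
            df.loc[index:index].to_csv(f, index=False, header=f.tell() == 0)

chunks = [df.index[i::4] for i in range(4)]  # split the rows over 4 workers
threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
threads.append(threading.Thread(target=writer, args=(len(df),)))
for t in threads:
    t.start()
for t in threads:
    t.join()

This keeps all CSV writes in a single thread, which sidesteps the concurrent-append problem discussed in the linked question.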