
I'm using the following script to

  • Apply a function to a column in each row of a DataFrame
  • Write the return values from that function into two new columns of the DataFrame
  • Continuously write the DataFrame to a *.csv file

I'd like to know whether there's a better way to run the following computation:

# df is a DataFrame with 500 rows and 20 columns

for index, row in df.iterrows():
    # transcribe() returns a (words, count) pair for the row's text
    df.loc[index, 'words'], df.loc[index, 'count'] = transcribe(df.loc[index, 'text'])
    # rewrites the entire DataFrame to disk on every iteration
    df.to_csv('out.csv', encoding='utf-8', index=False)

Currently, the script writes the full df DataFrame to the *.csv on every iteration (once per row), including the 'words' and 'count' values computed so far. I'd like to know whether it would also be possible to write the file line by line, i.e. to output only those rows that are already complete.

Thanks!

Christopher
  • With my current function, the described append mode has the effect that the full dataframe is appended multiple times. – Christopher Aug 10 '19 at 20:09

1 Answer


I can't understand why you want to do it row by row instead of writing the whole dataframe once at the end, but here is a solution to your question: write slices of the dataframe (i.e. the current row) in append mode, adding the header for the first row only:

is_first_row = True
for index, row in df.iterrows():
    df.loc[index, 'words'], df.loc[index, 'count'] = transcribe(df.loc[index, 'text'])
    # df.loc[index:index] is a one-row DataFrame slice; append it to the file,
    # writing the column header only for the very first row
    df.loc[index:index].to_csv('out.csv', encoding='utf-8', index=False, mode='a', header=is_first_row)
    is_first_row = False


Update based on the comment that the script could be interrupted:
In this case you may want to determine whether or not to write the header by checking whether the file is still empty:

with open('out.csv', encoding='utf-8', mode='a', newline='') as f:
    for index, row in df.iterrows():
        df.loc[index, 'words'], df.loc[index, 'count'] = transcribe(df.loc[index, 'text'])
        # f.tell() == 0 means the file is still empty, so write the header once
        df.loc[index:index].to_csv(f, index=False, header=f.tell() == 0)
Stef
  • There is a very long list and the function calls an API. Whenever the script is interrupted, I have sunk cost and no output. That's why I'd like to write row by row. – Christopher Aug 11 '19 at 08:28
  • @Christopher: I understand. See my updated answer for this case. If your script gets interrupted, you'll also lose the current value of `is_first_row`, so you'll have to determine the header condition dynamically. – Stef Aug 11 '19 at 09:12
  • As an alternative, you may want to wrap the whole logic in a `try` block and write the dataframe in the `finally` part of it; see the first sketch after these comments. – Stef Aug 11 '19 at 09:14
  • One last question: assume I have a pooling/multithreading setup in which four rows are computed at a time. How can I change your script above so that it always takes four rows, applies the function and writes them to the csv? – Christopher Aug 11 '19 at 10:02
  • I'm not sure how you'd implement it. If each thread has its own iterrows() loop you can use the solution as is, but the problem would be that 4 threads write to one file: see [here](https://stackoverflow.com/questions/11983938/python-appending-to-same-file-from-multiple-threads) for details. So maybe you need a queue that all 4 threads push their results into, plus one thread that takes the values from the queue, appends them to the dataframe and writes them out (one by one as shown above); a sketch of this follows after these comments. – Stef Aug 11 '19 at 17:09
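
A minimal sketch of the `try`/`finally` idea from the comments, assuming the `df` and `transcribe` from the question; note that `finally` runs on exceptions and Ctrl-C, but not if the process is killed outright:

try:
    for index, row in df.iterrows():
        df.loc[index, 'words'], df.loc[index, 'count'] = transcribe(df.loc[index, 'text'])
finally:
    # runs even if transcribe() raises or the loop is interrupted,
    # so everything computed so far is written out exactly once
    df.to_csv('out.csv', encoding='utf-8', index=False)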
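
And a minimal sketch of the queue idea, assuming four worker threads and the `df` and `transcribe` from the question; the thread count and index chunking are placeholders, not tested against a real API:

import threading
import queue

q = queue.Queue()

def worker(indices):
    # compute results only; leave all writing to the single writer thread
    for index in indices:
        words, count = transcribe(df.loc[index, 'text'])
        q.put((index, words, count))

def writer(total):
    with open('out.csv', encoding='utf-8', mode='a', newline='') as f:
        for _ in range(total):
            index, words, count = q.get()
            df.loc[index, 'words'], df.loc[index, 'count'] = words, count
            # write the completed row, with a header only while the file is empty
            df.loc[index:index].to_csv(f, index=False, header=f.tell() == 0)

chunks = [df.index[i::4] for i in range(4)]  # split the rows over 4 workers
threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
threads.append(threading.Thread(target=writer, args=(len(df),)))
for t in threads:
    t.start()
for t in threads:
    t.join()

This keeps all CSV writes in a single thread, which sidesteps the concurrent-append problem discussed in the linked question.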