Python export to .csv without overwriting columns in for loop

Question

I am trying to write data from several documents (processed in a for loop) to a CSV file in Python 3. However, the column gets overwritten every time. How can I write the data from each document to the CSV in the rows below the previous ones, without overwriting?

import glob
import pandas as pd
from pdfminer.high_level import extract_text
for selectedfile in glob.glob(r'C:\Users\...\*.pdf'):
    text = extract_text(selectedfile)

Y = set(text)
Z = []
Znew = []
for val in Y:
    occurrences = wordlist2.count(val)
    if occurrences > 50:  # define min. no. of occurrences
        # print(val, ':', occurrences)
        Z.append(val)
        Znew.append(occurrences)

dict = {'Stem': Z, 'Count': Znew}
df = pd.DataFrame(dict)
df.to_csv('Exported list.csv', header=True, index=True, encoding='utf-8')

What do you mean by "the column gets overwritten"? What is the current outcome, what is the expected outcome? — derpirscher, Nov 20 '22 at 16:35
I think he wants to append new data to old data without overwriting it — risky last, Nov 20 '22 at 16:39

score 1 · Answer 1 · answered Nov 20 '22 at 16:55

The problem is in that first for loop. Each iteration replaces text with the newly extracted text, so only the final extraction is processed. You could move the processing into the for loop so that it works on each extraction. In this example, I've opened the file beforehand and written the header once. Then it's a question of making sure the index is correct for each write.

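import glob

import numpy as np
import pandas as pd
from pdfminer.high_level import extract_text

# newline='' is what the pandas docs recommend for a text-mode handle passed to to_csv
with open('Exported list.csv', 'w', encoding='utf-8', newline='') as outfile:
    outfile.write(",Stem,Count\n")  # header, written once
    base = 0  # index of the first row for the next document
    for selectedfile in glob.glob(r'C:\Users\...\*.pdf'):
        text = extract_text(selectedfile)

        Y = set(text)
        Z = []
        Znew = []
        for val in Y:
            # wordlist2 comes from the asker's code (not shown in the question)
            occurrences = wordlist2.count(val)
            if occurrences > 50:  # define min. no. of occurrences
                # print(val, ':', occurrences)
                Z.append(val)
                Znew.append(occurrences)

        data = {'Stem': Z, 'Count': Znew}  # renamed from dict to avoid shadowing the builtin
        df = pd.DataFrame(data, index=np.arange(base, base+len(Z)))
        # header=False: the header was already written above, so don't repeat it per file
        df.to_csv(outfile, index=True, header=False)
        base += len(Z)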
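
As an alternative to tracking the index by hand, you could collect the rows for every document first and write the CSV once at the end. The following is only a minimal sketch of that idea, not a drop-in replacement: the token lists stand in for whatever you build from the extracted text, and the names count_frequent and export_counts, the 'Source' column, and the temporary output path are all made up for the illustration.

```python
import tempfile
from collections import Counter
from pathlib import Path

import pandas as pd


def count_frequent(tokens, min_count, source):
    """Return one row per token that occurs more than min_count times."""
    counts = Counter(tokens)
    return [
        {'Source': source, 'Stem': stem, 'Count': n}
        for stem, n in counts.items()
        if n > min_count
    ]


def export_counts(documents, out_path, min_count=50):
    """Gather rows from all documents, then write the CSV a single time."""
    rows = []
    for source, tokens in documents.items():
        rows.extend(count_frequent(tokens, min_count, source))
    # One DataFrame for everything gives a continuous 0..n-1 index for free.
    df = pd.DataFrame(rows, columns=['Source', 'Stem', 'Count'])
    df.to_csv(out_path, index=True, encoding='utf-8')
    return df


# Hypothetical stand-in data: in the real script each list would come from
# processing extract_text(selectedfile) for one PDF.
documents = {
    'a.pdf': ['cat', 'cat', 'cat', 'dog'],
    'b.pdf': ['dog', 'dog', 'dog', 'cat'],
}
out_path = Path(tempfile.mkdtemp()) / 'Exported list.csv'
result = export_counts(documents, out_path, min_count=2)
print(result)
```

Because the file is written only once, there is a single header and no index bookkeeping. The trade-off is that all rows are held in memory until the end, which should be fine for word counts but may matter for very large inputs.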
