
I'm attempting to parse data from around 53k PDFs stored on disk. My script iterates through a dataframe of PDF filenames; for each PDF, a function returns bounding boxes, and for each bbox the text within it is parsed, appended to a list, and that list is added as a row to another dataframe. Each PDF yields between 1 and 4 rows, so the resulting dataframe will be somewhere between 53k × 10 and 212k × 10. I am using Spyder 5.1.5 (Python 3.9.7 64-bit | Qt 5.9.7 | PyQt5 5.9.2 | Windows 10).

I get the above error and the script exits. I've tried running the script outside Spyder via the CMD prompt and the same thing happens (see screenshot below):

(screenshot of the error)

The for loop I am using to iterate through the dataframe of PDF filenames is below:

for i, row in cp12_docs[cp12_docs['Filename'].isin(files)].iterrows():
    doc = fitz.open(row['Filename'])
    try:
        page = doc[1]
        words = page.get_text('dict')
        doc_words = page.get_text('words')
        bboxes = get_structure(words, doc_words)
        doc.close()
        to_append = []
        for j in bboxes[0]:
            df_row = row[['ClientUPRN', 'Filepath', 'Filename']].tolist()
            for k in j:
                rect = fitz.Rect((k[0], k[1], k[2], k[3]))
                my_words = [w for w in doc_words if fitz.Rect(w[:4]) in rect]
                df_row.append(make_text(my_words))
            CP12_cert_data.loc[len(CP12_cert_data)] = df_row
    except Exception as e:  # a bare except would hide the real cause
        print('Error when opening file:- ' + row['Filename'] + ' (' + str(e) + ')')
        continue
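As an aside, `CP12_cert_data.loc[len(CP12_cert_data)] = df_row` reallocates the frame on every insert, which churns memory over 53k+ rows. A minimal sketch of the same shape of loop that collects rows in a plain list and builds the DataFrame once at the end — `parse_pdf` here is a hypothetical stand-in for the fitz/`get_structure`/`make_text` work:

```python
import pandas as pd

def parse_pdf(filename):
    # Hypothetical stand-in for the PDF parsing: returns
    # between 1 and 4 parsed rows per document.
    return [[filename, "field1", "field2"]]

all_rows = []
for filename in ["a.pdf", "b.pdf"]:
    try:
        all_rows.extend(parse_pdf(filename))
    except Exception as e:  # report the actual error rather than hiding it
        print("Error when opening file: " + filename + " (" + str(e) + ")")
        continue

# One allocation instead of one per row.
CP12_cert_data = pd.DataFrame(all_rows, columns=["Filename", "F1", "F2"])
```

This doesn't change what is parsed, only how the result frame is assembled.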

I am using PyMuPDF 1.19.5 (Python bindings for the MuPDF 1.19.0 library; version date 2022-02-01; built for Python 3.9 on win32, 64-bit).

I have considered writing each row to file as it is parsed and reading the result back in later, but memory shouldn't be an issue, as I've previously worked with multiple dataframes of larger dimensions at a time.
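If it helps to see it concretely, the write-as-you-go idea can be sketched with the standard-library `csv` module — the filename, columns, and sample rows below are illustrative, not from the actual script:

```python
import csv

# Illustrative header; the real frame has 10 columns.
columns = ["ClientUPRN", "Filepath", "Filename"]

with open("cp12_cert_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(columns)
    # Each parsed bbox row goes straight to disk, so nothing
    # accumulates in memory while the 53k PDFs are processed.
    for df_row in [["123", "C:/docs", "a.pdf"], ["456", "C:/docs", "b.pdf"]]:
        writer.writerow(df_row)
```

The finished CSV can then be loaded once with `pandas.read_csv` at the end of the run.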

Any help would be greatly appreciated.

furbaw
  • **UPDATE**: I have also run the script after moving `doc = fitz.open()` inside the try block, but the same thing happens. Task Manager says around 5 MB of RAM is being used (the machine has 8 GB). – furbaw May 25 '22 at 12:10
  • What happens if you remove all the Pandas stuff and just open the PDFs and process their boxes (discarding the results)? Can you at least get through the list of all 53K PDFs? – Zach Young May 25 '22 at 16:01
  • I've not tried that, Zach; I will give it a go. I did manage to complete the task by splitting the dataframe of filenames into 3 and processing each part separately, but I still want to find out why this is happening, so I'll try your suggestion. – furbaw May 26 '22 at 15:35
  • The same thing happened. I also noticed that my system was running slower than normal after splitting the dataframe into 3 parts of equal length, even after the script had completed, so I checked the Task Manager processes: Python was sitting at ~5 GB of memory used. This clears after IPython is restarted, which is obviously not ideal. – furbaw May 28 '22 at 23:38
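To track down where the memory retained between batches is coming from, the standard-library `tracemalloc` module can compare allocation snapshots taken before and after a batch — a minimal sketch, with a throwaway list standing in for one batch of PDF processing:

```python
import tracemalloc

tracemalloc.start()
snapshot1 = tracemalloc.take_snapshot()

# Stand-in for processing one batch of PDFs; in the real script this
# would be one pass over a slice of the filename dataframe.
leaky = [bytes(1000) for _ in range(100)]

snapshot2 = tracemalloc.take_snapshot()
top = snapshot2.compare_to(snapshot1, "lineno")
for stat in top[:3]:
    print(stat)  # biggest allocation-growth sites, attributed by line
tracemalloc.stop()
```

Running this around each batch in the real script would show whether the growth is in the fitz objects or the growing DataFrame.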
