I'm attempting to parse data from around 53k PDFs stored on disk. My script iterates through a dataframe of PDF filenames. For each PDF, a function returns a set of bounding boxes; for each bounding box, the script parses the text inside it and appends that text to a list, and the list is then added as a row to another dataframe. Each PDF contributes between 1 and 4 rows, so the resulting dataframe will be somewhere between 53k x 10 and 212k x 10. I am using Spyder 5.1.5 (Python 3.9.7 64-bit | Qt 5.9.7 | PyQt5 5.9.2 | Windows 10).
I get the above error and the script exits. I've tried running the script outside Spyder from the command prompt and the same thing happens (see screenshot below).
The for loop I use to iterate through the dataframe of PDF filenames is below:
for i, row in cp12_docs[cp12_docs['Filename'].isin(files)].iterrows():
    doc = fitz.open(row['Filename'])
    try:
        page = doc[1]
        words = page.get_text('dict')
        doc_words = page.get_text('Words')
        bboxes = get_structure(words, doc_words)
        doc.close()
        to_append = []
        for j in bboxes[0]:
            df_row = row[['ClientUPRN', 'Filepath', 'Filename']].tolist()
            for k in j:
                rect = fitz.Rect((k[0],k[1],k[2],k[3]))
                my_words = [w for w in doc_words if fitz.Rect(w[:4]) in rect]
                df_row.append(make_text(my_words))
            CP12_cert_data.loc[len(CP12_cert_data)] = df_row
    except:
        print('Error when opening file:-' + row['Filename'])
        continue
I am using PyMuPDF 1.19.5 (Python bindings for the MuPDF 1.19.0 library; version date 2022-02-01 00:00:01; built for Python 3.9 on win32, 64-bit).
I have considered writing each row to a file as it is produced and reading the file back in later, but I don't think the size of the dataframe is the issue, as I've worked with several larger dataframes at the same time before.
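A minimal sketch of that write-to-file fallback, assuming a CSV output file. The `append_rows` helper and the `field_1` to `field_7` column names are hypothetical placeholders and are not part of my actual script; only `ClientUPRN`, `Filepath` and `Filename` come from the loop above.

```python
import csv
from pathlib import Path

# Hypothetical column names: the three identifying columns from the loop,
# plus placeholder names for the parsed text fields (10 columns in total).
COLUMNS = ["ClientUPRN", "Filepath", "Filename"] + [f"field_{n}" for n in range(1, 8)]


def append_rows(csv_path, rows, columns=COLUMNS):
    """Append rows to csv_path, writing the header only if the file is new or empty."""
    path = Path(csv_path)
    write_header = not path.exists() or path.stat().st_size == 0
    # newline="" lets the csv module control line endings (matters on Windows).
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if write_header:
            writer.writerow(columns)
        writer.writerows(rows)
```

The idea would be to call `append_rows` once per PDF with that PDF's 1 to 4 rows, so that whatever was parsed before a crash is already on disk, and then load the whole file afterwards with `pandas.read_csv`.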
Any help would be greatly appreciated.