I have code that pulls over 400 PDFs off a website via Beautiful Soup. PyPDF2 converts the PDFs to text, which is then saved as a jsonlines file called 'output.jsonl'.
When I save new PDFs in future updates, I want PyPDF2 to convert only the new PDFs to text and append that new text to the jsonlines file, which is where I am struggling.
The jsonlines file looks like this:
{"id": "1234", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}
{"id": "1235", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}...
The PDFs are named "1234", "1235", etc. and are saved in file_path_PDFs. I am trying to check whether a file's "id" already exists in the jsonlines file: if it does, there is no need for PyPDF2 to convert it to text again; if it does not, process it as usual.
file_path_PDFs = 'C:/Users/.../PDFs/'
json_list = []

for filename in os.listdir(file_path_PDFs):
    if os.path.exists('C:/Users/.../PDFs/output.jsonl'):
        with jsonlines.open('C:/Users/.../PDFs/output.jsonl') as reader:
            mytext = jsonlines.Reader.iter(reader)
            for obj in mytext:
                if filename[:-4] in mytext:  # filename[:-4] removes .pdf from string
                    continue
                else:
                    ~convert to text~

with jsonlines.open('C:/Users/.../PDFs/output.jsonl', 'a') as writer:
    writer.write_all(json_list)
As is, I believe this code is not finding any of the values and is converting ALL the text each time I run it. Obviously this is quite a lengthy process with each document spanning 200 or 300 pages.
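For what it's worth, here is a self-contained sketch of the behaviour I am after: read all existing ids into a set once, before the file loop, and only convert PDFs whose id is not in that set. The function names (load_existing_ids, convert_new_pdfs) and the stub converter are just made up for the sketch, and I have used the plain json module here so the snippet runs on its own; jsonlines iterates records the same way.

```python
import json
import os
import tempfile

def load_existing_ids(jsonl_path):
    """Collect the ids already present in output.jsonl, once, up front."""
    ids = set()
    if os.path.exists(jsonl_path):
        with open(jsonl_path, encoding='utf-8') as f:
            for line in f:
                ids.add(json.loads(line)['id'])
    return ids

def convert_new_pdfs(pdf_dir, jsonl_path, convert):
    """Convert only PDFs whose id is not yet in the jsonlines file,
    then append the new records to it."""
    existing_ids = load_existing_ids(jsonl_path)
    new_records = []
    for filename in os.listdir(pdf_dir):
        if not filename.endswith('.pdf'):
            continue                      # skip output.jsonl and anything else
        doc_id = filename[:-4]            # strip the .pdf extension
        if doc_id in existing_ids:
            continue                      # already converted, skip it
        text = convert(os.path.join(pdf_dir, filename))  # PyPDF2 would go here
        new_records.append({'id': doc_id, 'text': text})
    with open(jsonl_path, 'a', encoding='utf-8') as f:
        for rec in new_records:
            f.write(json.dumps(rec) + '\n')
    return new_records

# Tiny demo with a stub converter standing in for PyPDF2.
with tempfile.TemporaryDirectory() as d:
    out = os.path.join(d, 'output.jsonl')
    with open(out, 'w', encoding='utf-8') as f:
        f.write(json.dumps({'id': '1234', 'text': 'old text'}) + '\n')
    for name in ('1234.pdf', '1235.pdf'):
        open(os.path.join(d, name), 'w').close()
    added = convert_new_pdfs(d, out, convert=lambda path: 'stub text')
    print([r['id'] for r in added])       # only the new id, '1235'
```

The key difference from my code above is that the membership test runs against a set of id strings built once, rather than against the reader object inside the loop.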