
I have code that pulls over 400 PDFs off a website via Beautiful Soup. PyPDF2 converts the PDFs to text, which is then saved as a jsonlines file called 'output.jsonl'.

When I save new PDFs in future updates, I want PyPDF2 to convert only the new PDFs to text and append that new text to the jsonlines file, which is where I am struggling.

The jsonlines file looks like this:

{"id": "1234", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}
{"id": "1235", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}...

The PDFs are named "1234", "1235", etc., and are saved in file_path_PDFs. What I'm after: if a PDF's "id" already appears as a value in the jsonlines file, there is no need for PyPDF2 to convert it to text; if it does not, process it as usual.

file_path_PDFs = 'C:/Users/.../PDFs/'
json_list = []

for filename in os.listdir(file_path_PDFs):   
    if os.path.exists('C:/Users/.../PDFs/output.jsonl'):
        with jsonlines.open('C:/Users/.../PDFs/output.jsonl') as reader:
            mytext = jsonlines.Reader.iter(reader)
            for obj in mytext:
                if filename[:-4] in mytext: #filename[:-4] removes .pdf from string
                    continue
                else:
                    ~convert to text~

with jsonlines.open('C:/Users/.../PDFs/output.jsonl', 'a') as writer:
    writer.write_all(json_list)

As is, I believe this code is not finding any of the values and is converting ALL the text each time I run it. Obviously this is quite a lengthy process with each document spanning 200 or 300 pages.
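
For context on why the membership test never matches: `mytext` is an iterator of dicts, so `filename[:-4] in mytext` compares the string against whole dictionaries rather than their "id" values. A minimal sketch of the lookup the loop is aiming for, collecting the existing ids into a set once before the directory walk (paths copied from the question; the conversion step stays a placeholder):

import os
import jsonlines

file_path_PDFs = 'C:/Users/.../PDFs/'   # truncated path kept from the question
jsonl_path = file_path_PDFs + 'output.jsonl'

# Read the existing ids once, before walking the directory.
existing_ids = set()
if os.path.exists(jsonl_path):
    with jsonlines.open(jsonl_path) as reader:
        for obj in reader.iter():
            existing_ids.add(obj.get('id'))

json_list = []
for filename in os.listdir(file_path_PDFs):
    if not filename.endswith('.pdf'):
        continue  # skip output.jsonl itself, which lives in the same folder
    if os.path.splitext(filename)[0] in existing_ids:
        continue  # id already in output.jsonl; no need to re-convert
    # ~convert to text and append the record to json_list, as before~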

  • Am I understanding correctly that you simply want to determine if an `id` contains a string like "1234"? If so, this might work for you. Where `jsl` is your jsonline of `{'id': '1235', 'title': 'Transcript', 'url': 'www.stackoverflow.com', 'text': '200 pages worth of text'}`, the test `"1234" == jsl.get('id')` will tell you whether the PDF exists in the line, returning either `True` or `False`. – S3DEV Sep 26 '19 at 21:40
  • Additionally, I suggest using `os.path.splitext('myfile.pdf')[0]` to remove a file extension. It's a more robust method for those unpredictable files ... (A short snippet combining both suggestions follows this thread.) – S3DEV Sep 26 '19 at 21:44
  • Excellent suggestion on splitext. I appreciate it. As for the rest, I believe my confusion is due to using a generator and jsonlines (I am not fully understanding its documentation). My understanding is that jsonlines.Reader.iter(reader) iterates through each line (or dict) of the jsonlines file. Each obj in mytext is then a dict. Is this correct? – GMB Sep 26 '19 at 23:58
  • It almost appears as though obj is returning all 400+ lines of the jsonlines file, so I believe my confusion is not in using jsl.get('id'), but how would I check to see if 'filename' is equal to jsl.get('id') in the context of the jsonlines code above? Hopefully this makes sense, as I am fairly lost. I truly appreciate the help though. – GMB Sep 27 '19 at 00:18
  • Thanks for the update and clarification. I've worked through the issue myself and better understand where you're coming from. The answer below is **completely** re-written. Have a look and see if this helps. (Sorry for the long delay.) – S3DEV Sep 27 '19 at 19:05
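
The two suggestions from the thread above, combined into one minimal snippet (`jsl` is the sample line from the question):

import os

jsl = {'id': '1235', 'title': 'Transcript', 'url': 'www.stackoverflow.com',
       'text': '200 pages worth of text'}

# Strip the extension robustly, then compare against the record's id.
stem = os.path.splitext('1235.pdf')[0]   # -> '1235'
print(stem == jsl.get('id'))             # -> True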

1 Answer


Updates:

  • Optimised to store only the id field in the DataFrame.
    • A DataFrame was kept (rather than a list) to aid in future expansion and flexibility.

Answer:

After working through (what I believe to be) your scenario, we have the following setup/requirements:

  • You have one jsonlines file called output.jsonl.
  • This output.jsonl file contains (n) dictionaries; one for each PDF parsed by PyPDF2.
  • We must loop through a directory of 400+ PDF files and determine whether each PDF's filename is already in output.jsonl.

If this is correct, let's change tack and take the following approach:

  • Create a list of PDF filenames (called pdfs).
  • Read the id field from the jsonlines file (output.jsonl) into a pandas.DataFrame (called df).
  • Loop through the pdfs list and test whether the filename (id) is in the DataFrame (df).
  • If not, add the filename to a list (called notin).
  • Do as you wish with the notin list to parse these new files into ... whatever you like. (A sketch of this final step follows the output below.)

My (extended) output.jsonl file looks like this:

{"id": "1234", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}
{"id": "1235", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}
{"id": "1236", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}
{"id": "1237", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}
{"id": "1238", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}

Here's the commented code to accomplish the steps above:

import os
import jsonlines
import pandas as pd

# Set the path to output.jsonl
path = os.path.expanduser('~/Desktop/output.jsonl')
# Build a list of PDFs (You'll use `os.listdir()`)
pdfs = ['1234.pdf', '1235.pdf', '1236.pdf', '1237.pdf', 
        '1238.pdf', '5000.pdf', '5001.pdf']
# Collect the 'id' values into a list first.
ids = []

# Read output.jsonl
with jsonlines.open(path) as reader:
    for line in reader.iter():
        # Keep only the 'id' value; the large 'text' field is discarded.
        ids.append(line.get('id'))
# Build the DataFrame in one pass; appending row-by-row is slow, and
# DataFrame.append() was removed in pandas 2.0.
df = pd.DataFrame({'id': ids})
# Display the DataFrame's contents.
print('Contents of the jsonlines file:\n')
print(df)

# Loop over the PDF filenames and test if each filename is in the DataFrame.
notin = [i for i in pdfs if os.path.splitext(i)[0] not in df['id'].values]
# Display the results.
print('\nThese PDFs are not in your jsonlines file:')
print(notin)    

The output; note that files 5000.pdf and 5001.pdf were not found:

Contents of the jsonlines file:

     id
0  1234
1  1235
2  1236
3  1237
4  1238

These PDFs are not in your jsonlines file:
['5000.pdf', '5001.pdf']
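
As a follow-on to the last bullet above, a hedged sketch of the final step: parsing the files in notin and appending them to output.jsonl. It assumes a recent PyPDF2 (PdfReader / extract_text; older releases used PdfFileReader / extractText), and the title/url values are placeholders for whatever the scraping step provides:

import os
import jsonlines
from PyPDF2 import PdfReader

file_path_PDFs = 'C:/Users/.../PDFs/'   # truncated path kept from the question
new_entries = []

for pdf in notin:                        # `notin` from the code above
    reader = PdfReader(file_path_PDFs + pdf)
    text = ''.join(page.extract_text() or '' for page in reader.pages)
    new_entries.append({'id': os.path.splitext(pdf)[0],
                        'title': '...',  # placeholder: supplied by the scraper
                        'url': '...',    # placeholder: supplied by the scraper
                        'text': text})

# Append only the new records, exactly as in the question's final block.
with jsonlines.open(file_path_PDFs + 'output.jsonl', mode='a') as writer:
    writer.write_all(new_entries)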
  • I truly appreciate your help, as I was about to give up on this project. However, creating the dataframe above takes just as long as converting all the PDFs to text. Is there a way to create a dataframe containing only the ids? I imagine writing all the text is holding this process up, and it is completely unnecessary for the purposes of looking up the id. As we iterate through each line of output.jsonl, how would I isolate the id by itself rather than the whole line? Again, I cannot stress enough how helpful you have been. I appreciate it. – GMB Sep 28 '19 at 01:43
  • My pleasure. Completely understood. I'll have a look tonight and post an update for you. – S3DEV Sep 28 '19 at 07:16
  • Thanks again. I marked this as the correct answer. The df still takes far too long to create, so I am not sure what to make of it. I will have to come up with an alternate solution. – GMB Sep 29 '19 at 19:48
  • **Suggestion:** I bet reading the jsonlines file is slow, due to content. A suggestion is to set up a MySQL database to store the PDF data, rather than a jsonlines file. (MySQL documentation is really good, if you've not done this before.) Then you can run a simple, quick query against the database to determine if the id exists before parsing a new file. – S3DEV Sep 30 '19 at 11:00
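
A minimal sketch of that idea, using the stdlib sqlite3 module as a stand-in for MySQL so the snippet runs anywhere (the pdfs.db file and table layout are hypothetical; the query pattern is the same in MySQL). The point is that an indexed id lookup never touches the large text column:

import sqlite3

conn = sqlite3.connect('pdfs.db')   # hypothetical database file
conn.execute('CREATE TABLE IF NOT EXISTS pdfs (id TEXT PRIMARY KEY, text TEXT)')

def already_parsed(file_id):
    # Primary-key lookup: indexed, so it stays fast regardless of text size.
    row = conn.execute('SELECT 1 FROM pdfs WHERE id = ?', (file_id,)).fetchone()
    return row is not None

print(already_parsed('1234'))   # False until a record with id '1234' is inserted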