Extracting data from multiple PDFs and putting that data into an Excel table

Question

I am extracting data from multiple PDFs that were merged into one PDF. The data are clinical measurements taken from a sample at different time points. Some time points have certain measurement values, while at others those values are missing.

So far, I've been able to merge the PDFs, extract the text, and pull specific data out of the text, but I want to put it all into a corresponding Excel table. Below is my current code:

import PyPDF2
from PyPDF2 import PdfFileMerger
from glob import glob

# Merge all PDF files in the current directory
def pdf_merge():
    merger = PdfFileMerger()
    # Sort for a stable order and skip the output of a previous run
    allpdfs = sorted(a for a in glob("*.pdf") if a != "Merged_pdfs1.pdf")
    for pdf in allpdfs:
        merger.append(pdf)
    with open("Merged_pdfs1.pdf", "wb") as new_file:
        merger.write(new_file)

if __name__ == "__main__":
    pdf_merge()

# Extract the text of every page of the merged PDF
text = ""
with open("Merged_pdfs1.pdf", "rb") as pdf_file, open("sample.txt", "w", encoding="utf-8") as text_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    for page_number in range(number_of_pages):
        page = read_pdf.getPage(page_number)
        page_text = page.extractText()
        text += page_text
        # Write only this page; writing the running total would repeat earlier pages
        text_file.write(page_text)


# Turn the extracted text into a list of lines
filelines = text.split("\n")
print(filelines)

#extract data from text and put into dictionary
full_data = []
test_data = {"Sample":[], "Timepoint":[],"Phosphat (mmol/l)":[], "Bilirubin, total (µmol/l)":[], 
             "Bilirubin, direkt (µmol/l)":[], "Protein (g/l)":[], "Albumin (g/l)":[],
             "AST (U/l)":[], "ALT (U/l)":[], "ALP (U/l)":[], "GGT (U/l)":[], "IL-6 (ng/l)":[]}
# Every valid timepoint label, "0h" to "359h", built once instead of once per line
timepoints = {str(hours) + "h" for hours in range(360)}
for line2 in filelines:
    # For each data item, extract it from the line and strip whitespace
    if line2.startswith("Phosphat"):
        test_data["Phosphat (mmol/l)"].append(line2.split(" ")[-2].strip())
    if line2.startswith("Bilirubin,total"):
        test_data["Bilirubin, total (µmol/l)"].append(line2.split(" ")[-2].strip())
    if line2.startswith("Bilirubin,direkt"):
        test_data["Bilirubin, direkt (µmol/l)"].append(line2.split(" ")[-4].strip())
    if line2.startswith("Protein "):
        test_data["Protein (g/l)"].append(line2.split(" ")[-2].strip())
    if line2.startswith("Albumin"):
        test_data["Albumin (g/l)"].append(line2.split(" ")[-2].strip())
    if line2.startswith("AST"):
        test_data["AST (U/l)"].append(line2.split(" ")[-2].strip())
    if line2.startswith("ALT"):
        test_data["ALT (U/l)"].append(line2.split(" ")[-4].strip())
    if line2.startswith("Alk."):
        test_data["ALP (U/l)"].append(line2.split(" ")[-2].strip())
    if line2.startswith("GGT"):
        test_data["GGT (U/l)"].append(line2.split(" ")[-4].strip())
    if line2.startswith("Interleukin-6"):
        test_data["IL-6 (ng/l)"].append(line2.split(" ")[-4].strip())
    # Sample IDs are "T" or "H" plus a number below 100.
    # Note that the expression `"T" and "H"` evaluates to just "H".
    for sampletype in ("T", "H"):
        # Try the larger numbers first so that "H12" is not recorded as "H1"
        for sampnum in range(99, -1, -1):
            sample = sampletype + str(sampnum)
            if line2.startswith(sample):
                # Append to the list rather than overwriting it with a string
                test_data["Sample"].append(sample)
                break
    # Record every word on the line that is a timepoint label
    for word in line2.split(" "):
        if word in timepoints:
            test_data["Timepoint"].append(word)
full_data.append(test_data)
import pandas as pd
df = pd.DataFrame(full_data)
df.to_excel("IKC4.xlsx", sheet_name="IKC", index=False)
print(df)

The issue is that I don't know how to move the individual items in each list into their own Excel cells in the row for the proper timepoint, since a value's position in its list doesn't necessarily correspond to its timepoint. For example, timepoints 1 and 3 can have protein measurements while timepoint 2 is missing this value; the timepoint 3 measurement then sits at position 2 in the list and will likely end up in the wrong row of the Excel table.

I figured I might need a separate dictionary keyed by timepoint, with the corresponding measurements attached to the proper timepoint. I'm getting confused about how to do all this, though, so I'm asking for help!
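
A minimal sketch of that per-timepoint idea, for illustration only. It assumes that each timepoint line comes before its measurement lines in the extracted text and that the sample ID is the first word on the timepoint line; neither is confirmed by the real PDFs. `SAMPLE_LINES`, `FIELDS`, and `group_by_timepoint` are hypothetical names, and only two of the measurements are included.

```python
# Hypothetical lines; the real extractText() output may be laid out differently.
SAMPLE_LINES = [
    "H12 0h",
    "Phosphat 1.20 mmol/l",
    "Protein 65 g/l",
    "H12 24h",
    "Phosphat 1.35 mmol/l",
    "H12 48h",
    "Protein 61 g/l",
]

# Line prefix -> (column name, index of the value within the split line).
# The indices are assumptions that fit the made-up lines above.
FIELDS = {
    "Phosphat": ("Phosphat (mmol/l)", -2),
    "Protein ": ("Protein (g/l)", -2),
}

# Every timepoint label from "0h" to "359h".
TIMEPOINTS = {str(hours) + "h" for hours in range(360)}


def group_by_timepoint(lines):
    """Return one dict per timepoint, holding only the values found for it."""
    records = []
    current = None
    for line in lines:
        words = line.split(" ")
        found = [word for word in words if word in TIMEPOINTS]
        if found:
            # A timepoint line starts a new record (assumed to precede its values).
            current = {"Sample": words[0], "Timepoint": found[0]}
            records.append(current)
            continue
        if current is None:
            continue  # measurement seen before any timepoint: skip it
        for prefix, (column, index) in FIELDS.items():
            if line.startswith(prefix):
                current[column] = words[index]
    return records


records = group_by_timepoint(SAMPLE_LINES)
for record in records:
    print(record)
```

Because each measurement is stored in the record of the timepoint that was most recently seen, a value missing at one timepoint cannot shift later values into the wrong row. Passing such a list of dicts to `pd.DataFrame(records)` gives one row per timepoint, with NaN wherever a key is absent.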

Thanks in advance :)

I tried adding an "else" branch after every "if" to append a "-" when a measurement wasn't present for that timepoint, but I got far too many dashes, since the loop iterates over every line of the entire PDF.
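
One way to avoid the extra dashes might be to fill them in once per timepoint at the end, rather than once per line while parsing. The sketch below assumes the measurements have already been collected into one dict per timepoint; the `records` values, `COLUMNS`, and `to_rows` are made up for illustration.

```python
COLUMNS = ["Sample", "Timepoint", "Phosphat (mmol/l)", "Protein (g/l)"]

# Hypothetical per-timepoint records; a missing measurement is simply an absent key.
records = [
    {"Sample": "H12", "Timepoint": "0h", "Phosphat (mmol/l)": "1.20", "Protein (g/l)": "65"},
    {"Sample": "H12", "Timepoint": "24h", "Phosphat (mmol/l)": "1.35"},
    {"Sample": "H12", "Timepoint": "48h", "Protein (g/l)": "61"},
]


def to_rows(records, columns, missing="-"):
    """Build one table row per record, with `missing` where a column has no value."""
    return [[record.get(column, missing) for column in columns] for record in records]


rows = to_rows(records, COLUMNS)
for row in rows:
    print(row)
```

Each record yields exactly one row, so there is at most one dash per missing measurement. The rows could then be written with `pd.DataFrame(rows, columns=COLUMNS).to_excel("IKC4.xlsx", sheet_name="IKC", index=False)`.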
