Trying to extract data from PDFs, make sense of it, and upload it to a database

Question

I've got many PDFs that contain data such as name, address, contact info, email IDs and many more details. I am trying to write a program that converts each PDF to a text file and then extracts the information from that text. So far I have used line.startswith, regex and string slicing to extract the values and store them in variables, which I will then upload to a database.

Question 1: Is there a more efficient way of doing this? I have more than 1000 PDFs, and in some of them a value such as the email ID is null.

Question 2: Each PDF contains information on many people, and each person has multiple contact addresses and multiple other details. What is the best way of storing this, and how can I save multiple addresses when each address has its own unique info? My code so far:

handle.seek(0)
NAME = []
MEMBER_ID = []
R_DATE = []
R_TIME = []
MRN = []
DOB = []
GENDER = []
for line in handle:
    line = line.strip()
    if line.startswith('CONSUMER:'):
        NAME.append(line)
    elif line.startswith('MEMBER ID:'):
        MEMBER_ID.append(line)
    elif line.startswith('DATE:'):
        R_DATE.append(line)
    elif line.startswith('TIME:'):
        R_TIME.append(line)
    elif line.startswith('MEMBER REFERENCE NUMBER:'):
        MRN.append(line)
    elif line.startswith('DATE OF BIRTH:'):
        DOB.append(line)
    elif line.startswith('GENDER:'):
        GENDER.append(line)

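# Slice off each label by its own length instead of a hard-coded offset,
# and use None when the field is missing from this PDF.
def first_value(lines, label):
    return lines[0][len(label):].strip() if lines else None

Name = first_value(NAME, 'CONSUMER:')
Member_Id = first_value(MEMBER_ID, 'MEMBER ID:')
R_date = first_value(R_DATE, 'DATE:')
R_time = first_value(R_TIME, 'TIME:')
Mrn = first_value(MRN, 'MEMBER REFERENCE NUMBER:')
Date_Of_Birth = first_value(DOB, 'DATE OF BIRTH:')
Gender = first_value(GENDER, 'GENDER:')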
handle.seek(0)
basic_table_dict = {
    "PAN_NO" : PAN_NO,
    "Consumer Name" : Name,
    "Member_Id" : Member_Id,
    "R_date" : R_date,
    "R_time" : R_time,
    "Mrn" : Mrn,
    "Date_Of_Birth" : Date_Of_Birth,
    "Gender" : Gender
}

print(basic_table_dict)
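
One way to avoid a separate list and slice offset per field might be to drive the parsing from a single label-to-key mapping. This also makes a missing or empty field come out as None instead of raising an IndexError. The following is only a minimal sketch under assumptions: the helper name parse_basic_fields and the sample text are hypothetical, and the labels are the ones used in the code above.

```python
# Map each label found in the text to the key used in the output dict.
FIELD_LABELS = {
    "CONSUMER:": "Consumer Name",
    "MEMBER ID:": "Member_Id",
    "DATE:": "R_date",
    "TIME:": "R_time",
    "MEMBER REFERENCE NUMBER:": "Mrn",
    "DATE OF BIRTH:": "Date_Of_Birth",
    "GENDER:": "Gender",
}


def parse_basic_fields(lines):
    """Return a dict with one entry per label; missing or empty fields are None."""
    record = {key: None for key in FIELD_LABELS.values()}
    for raw in lines:
        line = raw.strip()
        for label, key in FIELD_LABELS.items():
            # Keep the first non-empty value seen for each label.
            if line.startswith(label) and record[key] is None:
                value = line[len(label):].strip()
                record[key] = value or None
                break
    return record


# Hypothetical sample text standing in for one converted PDF.
sample = """CONSUMER: JOHN DOE
MEMBER ID: AB1234
DATE: 21-11-2017
GENDER:
"""
basic = parse_basic_fields(sample.splitlines())
print(basic)
```

Adding a new field then only needs one more entry in FIELD_LABELS, and the resulting dict can be passed to the database layer as it is.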


Address = ""
Category = ""
Residence_code = ""
Date_Reported = ""

printing = False
for line in handle4:
    line = line.strip()

    if line.startswith('ADDRESS(ES):'):
        printing = True
        continue
    elif line.startswith('EMPLOYMENT INFORMATION :'):
        printing = False
        break
    elif printing:
        # Note: each new ADDRESS block overwrites the values of the previous one.
        if line.startswith("ADDRESS:"):
            Address = line[8:].strip()
        elif line.startswith("CATEGORY:"):
            Category = line[9:].strip()
        elif line.startswith("RESIDENCE CODE:"):
            Residence_code = line[15:].strip()
        elif line.startswith("DATE REPORTED:"):
            Date_Reported = line[14:].strip()
        print(line)

The address data looks something like this:

ADDRESS: *****************

CATEGORY: OFFICE

RESIDENCE CODE: OWNED

DATE REPORTED: 21-11-2017

ADDRESS: *************************

CATEGORY: PERMANENT

RESIDENCE CODE: OWNED

DATE REPORTED: 29-04-2017

ADDRESS: ****************************

CATEGORY: PERMANENT

RESIDENCE CODE:

DATE REPORTED: 18-04-2017
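
For data shaped like the sample above, one possible approach to Question 2 is to start a new dict every time an ADDRESS: line appears, collect the dicts in a list, and store them in a child table that points back to the person (a one-to-many relationship). This is a rough sketch rather than a definitive answer: sqlite3 is used only because it ships with Python, the table and column names and the parse_addresses helper are hypothetical, and the sample addresses are made-up placeholders for the masked ones.

```python
import sqlite3

ADDRESS_LABELS = {
    "ADDRESS:": "address",
    "CATEGORY:": "category",
    "RESIDENCE CODE:": "residence_code",
    "DATE REPORTED:": "date_reported",
}


def parse_addresses(lines):
    """Group labelled lines into one dict per address block."""
    addresses = []
    for raw in lines:
        line = raw.strip()
        for label, key in ADDRESS_LABELS.items():
            if not line.startswith(label):
                continue
            if label == "ADDRESS:":
                # Each ADDRESS: line opens a new block.
                addresses.append({k: None for k in ADDRESS_LABELS.values()})
            if addresses:
                # Empty values (e.g. a blank RESIDENCE CODE) are stored as None.
                addresses[-1][key] = line[len(label):].strip() or None
            break
    return addresses


# Made-up sample in the same layout as the address section of the PDF text.
sample = """ADDRESS: 12 EXAMPLE STREET
CATEGORY: OFFICE
RESIDENCE CODE: OWNED
DATE REPORTED: 21-11-2017

ADDRESS: 34 SAMPLE ROAD
CATEGORY: PERMANENT
RESIDENCE CODE:
DATE REPORTED: 18-04-2017
"""
addresses = parse_addresses(sample.splitlines())

# One-to-many storage: one person row, many address rows pointing at it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE address ("
    "id INTEGER PRIMARY KEY, person_id INTEGER REFERENCES person(id), "
    "address TEXT, category TEXT, residence_code TEXT, date_reported TEXT)"
)
person_id = conn.execute(
    "INSERT INTO person (name) VALUES (?)", ("JOHN DOE",)
).lastrowid
conn.executemany(
    "INSERT INTO address "
    "(person_id, address, category, residence_code, date_reported) "
    "VALUES (:person_id, :address, :category, :residence_code, :date_reported)",
    [{"person_id": person_id, **a} for a in addresses],
)
conn.commit()

rows = conn.execute(
    "SELECT category, residence_code FROM address WHERE person_id = ? ORDER BY id",
    (person_id,),
).fetchall()
print(rows)
```

The same pattern would extend to any other repeated section: one parent row per person and one child table per kind of repeated block, each carrying the person's id.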
