I have many PDFs that contain data such as name, address, contact info, email IDs and many more details. I am writing a program that converts each PDF to a text file and then extracts the information. So far I have used line.startswith, regex and string slicing to extract the values and store them in variables, which I will then upload to a database.

Question 1: Is there a more efficient way of doing this? I have more than 1000 PDFs, and in some PDFs some values (for example, the email ID) are null.

Question 2: Each PDF holds information for many people, and each person can have multiple contact addresses and multiple values for many other fields. What is the best way of storing this, and how can I save multiple addresses, each with its own unique info?

My current code:
handle.seek(0)
NAME = []
MEMBER_ID = []
R_DATE = []
R_TIME = []
MRN = []
DOB = []
GENDER = []
# Collect every line that starts with a known label
for line in handle:
    line = line.strip()
    if line.startswith('CONSUMER:'):
        NAME.append(line)
    elif line.startswith('MEMBER ID:'):
        MEMBER_ID.append(line)
    elif line.startswith('DATE:'):
        R_DATE.append(line)
    elif line.startswith('TIME:'):
        R_TIME.append(line)
    elif line.startswith('MEMBER REFERENCE NUMBER:'):
        MRN.append(line)
    elif line.startswith('DATE OF BIRTH:'):
        DOB.append(line)
    elif line.startswith('GENDER:'):
        GENDER.append(line)
# Take the first match for each field and cut off the label.
# Note: [0] raises IndexError when a label is missing from the file.
Name = NAME[0][len('CONSUMER:'):].strip()
Member_Id = MEMBER_ID[0][len('MEMBER ID:'):].strip()
R_date = R_DATE[0][len('DATE:'):].strip()
R_time = R_TIME[0][len('TIME:'):].strip()
Mrn = MRN[0][len('MEMBER REFERENCE NUMBER:'):].strip()
Date_Of_Birth = DOB[0][len('DATE OF BIRTH:'):].strip()
Gender = GENDER[0][len('GENDER:'):].strip()
handle.seek(0)
# PAN_NO is set elsewhere in the program (not shown here)
basic_table_dict = {
    "PAN_NO": PAN_NO,
    "Consumer Name": Name,
    "Member_Id": Member_Id,
    "R_date": R_date,
    "R_time": R_time,
    "Mrn": Mrn,
    "Date_Of_Birth": Date_Of_Birth,
    "Gender": Gender,
}
print(basic_table_dict)
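For question 1, one direction I am considering is a single table of labels instead of one list and one slice per field, with every field defaulting to None so a missing value does not raise IndexError. This is only a rough sketch: the names `FIELD_LABELS` and `extract_fields` are placeholders I made up, and the sample text is invented, not real data.

```python
import io

# Label prefix -> key in the result dict (labels taken from my code above).
FIELD_LABELS = {
    "CONSUMER:": "Consumer Name",
    "MEMBER ID:": "Member_Id",
    "DATE:": "R_date",
    "TIME:": "R_time",
    "MEMBER REFERENCE NUMBER:": "Mrn",
    "DATE OF BIRTH:": "Date_Of_Birth",
    "GENDER:": "Gender",
}


def extract_fields(handle):
    """Return one dict per file; fields that never appear stay None."""
    record = dict.fromkeys(FIELD_LABELS.values())
    for line in handle:
        line = line.strip()
        for label, key in FIELD_LABELS.items():
            # Keep the first non-empty occurrence of each field.
            if line.startswith(label) and record[key] is None:
                value = line[len(label):].strip()
                record[key] = value or None  # empty value is stored as None
                break
    return record


# Invented sample text; GENDER and TIME are deliberately missing.
sample = io.StringIO(
    "CONSUMER: JOHN DOE\n"
    "MEMBER ID: 12345\n"
    "DATE: 21-11-2017\n"
    "DATE OF BIRTH: 01-01-1990\n"
)
record = extract_fields(sample)
print(record)
```

With this shape, adding a new field would mean adding one entry to the table rather than a new list, a new elif and a new slice.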
Address = ""
Category = ""
Residence_code = ""
Date_Reported = ""
printing = False
for line in handle4:
    line = line.strip()
    if line.startswith('ADDRESS(ES):'):
        # Start of the address section
        printing = True
        continue
    elif line.startswith('EMPLOYMENT INFORMATION :'):
        # End of the address section
        printing = False
        break
    elif printing:
        # Each new address overwrites the values of the previous one
        if line.startswith("ADDRESS:"):
            Address = line[8:]
            Address1 = True
        elif line.startswith("CATEGORY:"):
            Category = line[9:]
            Address1 = True
        elif line.startswith("RESIDENCE CODE:"):
            Residence_code = line[15:]
            Address1 = True
        elif line.startswith("DATE REPORTED:"):
            Date_Reported = line[14:]
            Address1 = True
        print(line)
The address data looks something like this:
ADDRESS: *****************
CATEGORY: OFFICE
RESIDENCE CODE: OWNED
DATE REPORTED: 21-11-2017
ADDRESS: *************************
CATEGORY: PERMANENT
RESIDENCE CODE: OWNED
DATE REPORTED: 29-04-2017
ADDRESS: ****************************
CATEGORY: PERMANENT
RESIDENCE CODE:
DATE REPORTED: 18-04-2017
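
For question 2, one structure I am considering is a list of dicts, one dict per address, where every ADDRESS: line starts a new dict, and then one database row per address linked to the person by member ID. The sketch below uses SQLite only as a stand-in for my real database. The function names, the table layout and the sample addresses are my own assumptions, not something taken from the PDFs.

```python
import sqlite3

# Label prefix -> key in each address dict.
ADDRESS_LABELS = {
    "ADDRESS:": "address",
    "CATEGORY:": "category",
    "RESIDENCE CODE:": "residence_code",
    "DATE REPORTED:": "date_reported",
}


def parse_addresses(lines):
    """Group the address section into a list of dicts, one per address."""
    addresses = []
    in_section = False
    for line in lines:
        line = line.strip()
        if line.startswith("ADDRESS(ES):"):
            in_section = True
            continue
        if line.startswith("EMPLOYMENT INFORMATION"):
            break
        if not in_section:
            continue
        for label, key in ADDRESS_LABELS.items():
            if line.startswith(label):
                if key == "address":
                    # An ADDRESS: line starts a new record.
                    addresses.append(dict.fromkeys(ADDRESS_LABELS.values()))
                if addresses:
                    addresses[-1][key] = line[len(label):].strip() or None
                break
    return addresses


def save_addresses(conn, member_id, addresses):
    """Insert one row per address, linked to the person by member_id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS address ("
        "member_id TEXT, address TEXT, category TEXT, "
        "residence_code TEXT, date_reported TEXT)"
    )
    conn.executemany(
        "INSERT INTO address VALUES (:member_id, :address, :category, "
        ":residence_code, :date_reported)",
        [dict(a, member_id=member_id) for a in addresses],
    )
    conn.commit()


# Invented sample; the second address has an empty RESIDENCE CODE.
sample = """ADDRESS(ES):
ADDRESS: 12 EXAMPLE STREET
CATEGORY: OFFICE
RESIDENCE CODE: OWNED
DATE REPORTED: 21-11-2017
ADDRESS: 34 SAMPLE ROAD
CATEGORY: PERMANENT
RESIDENCE CODE:
DATE REPORTED: 18-04-2017
EMPLOYMENT INFORMATION :
"""
addresses = parse_addresses(sample.splitlines())
conn = sqlite3.connect(":memory:")
save_addresses(conn, "12345", addresses)
rows = conn.execute(
    "SELECT category, residence_code FROM address ORDER BY rowid"
).fetchall()
print(rows)
```

An empty value such as the blank RESIDENCE CODE ends up as None in the dict and NULL in the table, so a person can have any number of addresses without needing separate variables for each one.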