My code ends up producing two objects, 'members' and 'pcps', each of which is actually a bunch of separate string objects rather than one collection. I need to gather them into a single list so that I can add them to a DataFrame and ultimately export them as an Excel table.
The problem arose somewhere along the way as I scraped text data out of a PDF: the extracted text doesn't have a list-within-a-list structure. I was wondering whether, around the line where I try to create the 'members' series, I can somehow merge these separate objects into a list.
import PyPDF2
import pandas as pd

def PDFsearch(origFileName):
    # create a PDF file object from the original PDF
    pdfFileObj = open(origFileName, 'rb')
    # create a PDF reader object
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    numPages = pdfReader.numPages
    print(numPages)
    for p in range(pdfReader.numPages):
        # create a page object
        pageObj = pdfReader.getPage(p)
        # extract the text from pageObj into a unicode string object
        pages = pageObj.extractText()
        # loop through the string object line by line
        pges = []
        for page in pages.split("\n"):
            # split each line into words
            pges.append(page)
            lns = []
            for lines in page.split(" "):
                for line in lines.split(","):  # separate the ,"This" from the last name
                    lns.append(line)
            names = list()
            if lns[0] == "Dear":  # if the first word in a line is "Dear"
                names.append(lns[1:4])  # take the 2nd through 4th items (first and last names)
                for name in names:
                    members = " ".join(name)  # these are the names we want
                    PCPs = lns[78:85]
                    pcps = " ".join(PCPs)
                    providers = pd.Series(pcps)
                    members = pd.Series(members)
'''This is what I get when I print the series 'members':
0 LAILIA TAYLOR
dtype: object
0 LATASIA WILLIS
dtype: object
0 LAURYN ALLEN
dtype: object
0 LAYLA ALVARADO
dtype: object
0 LAYLA BORELAND
dtype: object
0 LEANIAH MULLIGAN
dtype: object
All separate objects! The same happens with 'providers', and when I create a DataFrame and export it to Excel I only get one row.'''
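
For reference, one possible way to get a single list is to create the lists once, before the loop, and append to them on each "Dear" line, instead of rebuilding a one-element Series every time. The following is only a minimal sketch under assumptions: `collect_names` and the `sample` lines are hypothetical stand-ins for the extracted PDF text, the tokenising mirrors the code above, and the `78:85` provider slice is carried over unchanged from it.

```python
def collect_names(text_lines):
    """Gather one member name and one provider string per "Dear ..." line."""
    members = []    # created once, before the loop, so entries accumulate
    providers = []
    for text in text_lines:
        # Split on spaces, then on commas, mirroring the tokenising above
        tokens = [part for word in text.split(" ") for part in word.split(",")]
        if tokens[0] == "Dear":
            # strip() drops the trailing space left by an empty token after the comma
            members.append(" ".join(tokens[1:4]).strip())
            # Slice positions copied from the question; empty if the line is shorter
            providers.append(" ".join(tokens[78:85]))
    return members, providers


# Hypothetical sample lines standing in for the text extracted from the PDF
sample = [
    "Dear LAILIA TAYLOR, This is a notice",
    "some unrelated line",
    "Dear LAYLA ALVARADO, This is a notice",
]
members, providers = collect_names(sample)
print(members)
```

Because both lists grow together, they stay the same length, so they could be passed once, after the loop, to something like `pd.DataFrame({"member": members, "provider": providers})`, which would give one row per name before exporting with `to_excel`.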