My end products are two objects, 'members' and 'pcps', which are actually a bunch of separate string objects. I need to collect them into a single list so that I can add them to a DataFrame and ultimately export them as an Excel table.

The problem arose somewhere along the way as I scraped text data out of a PDF: the result doesn't have a list-within-a-list structure. I was wondering whether, around the line where I try to create the 'members' series, I can somehow merge these separate objects into a list.


import pandas as pd
import PyPDF2

def PDFsearch(origFileName): 

    # creating a pdf File object of original pdf 
    pdfFileObj = open(origFileName, 'rb')  
    # creating a pdf Reader object 
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj) 

    numPages = pdfReader.numPages
    print(numPages)
    for p in range(pdfReader.numPages): 

        # creating page object 
        pageObj = pdfReader.getPage(p)
        #extract txt from pageObj into unicode string object
        pages = pageObj.extractText()
        # loop through string object by page
        pges = []


        for page in pages.split("\n"):
            # split the pages into words
            pges.append(page)

            lns = []            
            for lines in page.split(" "):
                for line in lines.split(","):   # separate the ",This" from the last name
                    lns.append(line)

            names = list()
            if lns[0] == "Dear":   # If first word in a line is "Dear"
                names.append(lns[1:4]) # Get the 2nd through 4th tokens (first and last names)
                for name in names:
                    members = " ".join(name) # These are the names we want

                PCPs = lns[78:85]        
                pcps = " ".join(PCPs)

                providers =  pd.Series(pcps)
                members = pd.Series(members)

'''This is what I get when I print the series 'members':

0    LAILIA TAYLOR 
dtype: object
0    LATASIA WILLIS 
dtype: object
0    LAURYN ALLEN 
dtype: object
0    LAYLA ALVARADO 
dtype: object
0    LAYLA BORELAND 
dtype: object
0    LEANIAH MULLIGAN 
dtype: object

All separate objects! The same happens with 'providers', and when I create a DataFrame and export it to Excel I only get one row.'''

Ben Smith

1 Answer

Just a quick look, but I believe your issue is that you are overwriting your series on every pass through the loop, so only the last value survives. Try something like this:

# add at the beginning of your function 
temp = pd.DataFrame()
data = pd.DataFrame()

# this would replace where you assign to providers and members
temp['providers'] = pd.Series(pcps)
temp['members'] = pd.Series(members)
data = pd.concat([data, temp]).reset_index(drop=True)

This way you will overwrite temp every time, but your data DataFrame will contain all members and providers. I hope this helps, good luck!
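
As an alternative to concatenating on every pass, here is a minimal sketch that accumulates the values in plain Python lists and builds the DataFrame once at the end. The `extracted` pairs are hypothetical stand-ins for whatever the PDF loop produces: the member names are taken from the question's output, and the provider names are placeholders, not real data.

```python
import pandas as pd

# Hypothetical (member, provider) pairs standing in for the values
# pulled from each "Dear ..." line; provider names are placeholders.
extracted = [
    ("LAILIA TAYLOR", "PROVIDER A"),
    ("LATASIA WILLIS", "PROVIDER B"),
    ("LAURYN ALLEN", "PROVIDER C"),
]

# Create the accumulators once, before the loop, so they are not
# reset on each iteration.
members = []
providers = []

for member, provider in extracted:
    members.append(member)
    providers.append(provider)

# Build the DataFrame a single time from the completed lists.
data = pd.DataFrame({"members": members, "providers": providers})
print(data)
```

From there, something like `data.to_excel("members.xlsx", index=False)` should write one row per member. The file name is only an example, and pandas needs an Excel writer engine such as openpyxl installed for that call.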

Denver
  • This appears to be working, but it's repeating again and again, each time repeating each row by one. At which indentation level should I put these series declarations and the concatenation so that each member appears just once? – Ben Smith Sep 06 '19 at 19:33
  • I'm not positive without working through your code and seeing what your data looks like, but you could try putting it outside of the second for loop so that it runs directly after the second for loop ends. If your code is repeating the same information and you don't need it to, you may want to do something about that. Again, it's difficult to say without seeing the data inputs and outputs. – Denver Sep 06 '19 at 19:49
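
A rough sketch of the placement discussed in the comments above, assuming the goal is one row per line that starts with "Dear". The function name `collect_members` and the sample lines are hypothetical; the real text would come from the PDF pages. The row is appended only after the inner tokenizing loops have finished, so each matching line contributes exactly once.

```python
import pandas as pd

def collect_members(lines):
    """Return a DataFrame with one row per line that starts with 'Dear'."""
    rows = []
    for line in lines:
        tokens = []
        for chunk in line.split(" "):
            for token in chunk.split(","):
                tokens.append(token)
        # This check runs after the inner loops are done, at the same
        # indentation as the inner `for`, so it fires once per line.
        if tokens and tokens[0] == "Dear":
            rows.append({"members": " ".join(tokens[1:3])})
    return pd.DataFrame(rows)

# Hypothetical sample lines shaped like the letter greeting in the question.
sample_lines = [
    "Dear LAILIA TAYLOR ,This letter",
    "Some other line",
    "Dear LATASIA WILLIS ,This letter",
]
result = collect_members(sample_lines)
print(result)
```

Slicing `tokens[1:3]` rather than `[1:4]` leaves out the empty token produced by splitting ",This" on the comma, which would explain the trailing space in the printed names.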