0

I need to concatenate tables created from a loop. The have repeats of the names in the columns but they are telling a different story, but for some reason when running this code I get an error:

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

Here is the code:

url = 'https://www.impactfees.com/publications%20pdf/2019survey.pdf'

tables = camelot.read_pdf(url, flavor = 'stream', edge_tol = 500, pages = '4-end')

i = 0


while i in range(0,tables.n):
    table_value = tables[i].df.loc[0,4]
    header = 1
    header = tables[i].df.iloc[header]
    tables[i].df = tables[i].df.rename(columns = header)
    
    nan_v = float("NaN")
    tables[i].df.replace('',nan_v,inplace = True) 
    tables[i].df.dropna(subset = ['Jurisdiction'], inplace = True)
    tables[i].df.replace(['Jurisdiction'], nan_v, inplace = True)
    tables[i].df.dropna(subset = ['Jurisdiction'], inplace = True)

#    Tot_col = tables[i].df.columns.get_loc('Total')
#    tables[i].df = tables[i].df.iloc[:,0:Tot_col+1]
    tables[i].df['report_name'] = table_value
    tables[i].df.loc[~tables[i].df.index.duplicated(keep = 'first')]
    i = i + 1

dfs = pd.concat([table.df for table in tables])

dfs

and here is the error I am getting:

InvalidIndexError                         Traceback (most recent call last)
<ipython-input-133-2617eb5ae448> in <module>
     23     i = i + 1
     24 
---> 25 dfs = pd.concat([table.df for table in tables])
     26 
     27 

~\anaconda3\lib\site-packages\pandas\core\reshape\concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    296     )
    297 
--> 298     return op.get_result()
    299 
    300 

~\anaconda3\lib\site-packages\pandas\core\reshape\concat.py in get_result(self)
    514                     obj_labels = obj.axes[1 - ax]
    515                     if not new_labels.equals(obj_labels):
--> 516                         indexers[ax] = obj_labels.get_indexer(new_labels)
    517 
    518                 mgrs_indexers.append((obj._mgr, indexers))

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_indexer(self, target, method, limit, tolerance)
   3169 
   3170         if not self.is_unique:
-> 3171             raise InvalidIndexError(
   3172                 "Reindexing only valid with uniquely valued Index objects"
   3173             )

InvalidIndexError: Reindexing only valid with uniquely valued Index objects
Ben Sorensen
  • 51
  • 1
  • 7
  • 1
    Columns doesn't match. – Ynjxsjmh Apr 24 '21 at 05:17
  • @Ynjxsjmh is right, and you can see also where and why they don't match, if you look at the pdf, for example at the 3rd line at pag 2 of 20 of the table you find a `suspension extended indefinitely as of 3/21/19` that takes 5 columns merged in 1 only. And there are many others. – Max Pierini Apr 24 '21 at 08:12

1 Answers1

1
  • camelot has an issue. I had to patch utils.py to use a different user-agent
  • the pages are not fully consistent, hence passing a list to rename(columns=) does not work. You need to pass a dict
  • have kept two data frames - one with target rows, other with excluded rows
  • there remain inconsistent columns e.g. Drain Parks
import pandas as pd
import camelot

url = 'https://www.impactfees.com/publications%20pdf/2019survey.pdf'
tables = camelot.read_pdf(url, flavor = 'stream', edge_tol = 500, pages = '4-end')

df = pd.DataFrame()
dfexc = pd.DataFrame()
for i in range(tables.n):
    dft = tables[i].df.rename(columns={i:v.replace("\n"," ") for i,v in tables[i].df.iloc[1].items() if v!=""})
    if " " in dft.columns[0]: 
        dft = dft.rename(columns={dft.columns[0]:dft.columns[0].split(" ")[0], 1:dft.columns[0].split(" ")[1]})
    m = (dft.State.str.len()!=2) | (dft.index < 2)

    dfexc = pd.concat([dfexc, tables[i].df.loc[m].assign(page=i)])
    df = pd.concat([df, dft.loc[~m].assign(page=i)])#.reset_index(drop=True)

Rob Raymond
  • 29,118
  • 3
  • 14
  • 30
  • Hi Rob But I just don’t understand why out wouldn’t add the unmatched column to the end and have it be “NaN” on the sheets that don’t have anything there at all. Don’t you think that should be the case? – Ben Sorensen Apr 25 '21 at 05:59
  • @BenSorensen it's a preference. The 25 years I have of working with ETL I prefer to separate source rows between those that have matched the data model and those that haven't. Those that haven't may need additional rules to put into data model or can be valid exclusions (e.g. header definitions) – Rob Raymond Apr 25 '21 at 07:00
  • How do I assign a new user agent with Camelot, I can't figure out how to do that. I know how to do it with tabula but I have't been able to do it with camelot. Can you give me some insight into that? – Ben Sorensen Apr 26 '21 at 16:18
  • I have it installed in a venv. I patched the code for Camelot to use a different user-agent in **utils.py** that you will find under the Camelot package. There is no way to pass it as a parameter... – Rob Raymond Apr 26 '21 at 18:09
  • Yeah, I figured that out...I ended up downloading the pdf using urllib and then opening it up with camelot...camelot is just so convenient as a pdf reader and converter. Thanks for your help on this, have an awesome week! – Ben Sorensen Apr 26 '21 at 21:34