
I want to create a DataFrame. I parse several PDFs with PyPDF2 and camelot. With PyPDF2 I search for the title of each table and put the titles in a list. With camelot I extract the table that follows each title. I then want to add a column to each table containing the title of its part. My problem is that when a table is too big, it spans two pages. As a result, I get more tables than titles and, of course, an IndexError: list index out of range.

    import camelot
    import pandas as pd

    indice1 = 0           # index of the current title in l
    dff = pd.DataFrame()  # collects the tables of all files (assumed to start empty)
    for file, li in zip(files, pageslist):
        table = camelot.read_pdf(file, pages=li, line_scale=50)

        df = pd.DataFrame()
        for k in range(len(table)):
            df_tables = table[k].df

            # Keep only the tables whose first row holds the expected headers
            if all(elem in df_tables.iloc[0].values for elem in ["1", "2", "3"]):
                df_tables.columns = df_tables.iloc[0]  # promote the first row to header
                df_tables = df_tables[1:]

                df_tables2 = df_tables.copy()
                df_tables2["Titles"] = ""
                # (1) df_tables2
                Title_List = []
                for o in range(len(df_tables2["Titles"])):  # The part which is problematic
                    Title_List.append(str(l[indice1]))

                df_tables2["Titles"] = Title_List
                # (2) df_tables2
                df = pd.concat([df, df_tables2])

                indice1 += 1  # one title consumed per table, even for a continuation table

        dff = pd.concat([dff, df])

files is my list of PDF files. pageslist is a list of strings giving the pages from which to extract the interesting tables, e.g. ['4, 5, 14, 15, 45, 46, 80, 81', '10, 11, 23, 24, 33, 34', …]. Each string lists the pages where I found a title, plus the following page, so that I do not miss big tables that span two pages. l is my list of titles: l = ['title 1', 'title 2 ', 'title 3', ..., 'title n'].

(1) df_tables2:

1   2   3    Titles
ab  aa  aze
aa  aa  aze

(2) df_tables2:

1   2   3    Titles
ab  aa  aze  title 1
aa  aa  aze  title 1

Once the loop for file, li in zip(files, pageslist): is done, the expected output is:

1    2    3    Titles
ab   aa   aze  title 1
aa   aa   aze  title 1
ac   ze   aze  title 2
ab   aa   aze  title 3
...  ...  ...  ...
aa   aa   aze  title 9
ac   ze   aze  title 10

I tried adding a counter in the loop, but that does not work either. Is there a way to say: if the table is split across two pages, keep the same title in Title_List.append(str(l[indice1])), or something like that?

TomYabo

1 Answer


Please provide the output of df_tables2 = df_tables.copy() for the case where the for o loop fails.

I suspect that the following is causing a problem - I don't see where the variable l is defined:

    Title_List.append(str(l[indice1]))

What happens if you replace this line with Title_List.append(str(o))?

memyself
  • `l` is defined earlier, in another function. It is generated when I parse the files with PyPDF2: I append the titles that match a pattern to the list. I use `df_tables2 = df_tables.copy()` to avoid the warning `SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame`. – TomYabo Oct 14 '22 at 06:52
  • If I use `Title_List.append(str(o))`, the Titles column looks like `titles : 0, 1, 2, 3, 0, 1, 2, 3, 4, ...`. – TomYabo Oct 14 '22 at 07:00
  • OK, I see. If `l` is defined in another function, then it would be important to know what `l` looks like. The same is true for `df_tables2`. – memyself Oct 14 '22 at 10:35
  • OK, I have edited the question; it will be easier to understand there than here. – TomYabo Oct 14 '22 at 11:46
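
One possible way around the IndexError, offered only as a sketch and not as a confirmed fix: look each title up by page number instead of advancing a counter, so that a table continuing on the next page reuses the title of the page before it. The names `titles_by_page`, `tables_with_pages`, `title_for_page` and `label_tables` are hypothetical and do not come from the question. The sketch assumes at most one title per page, and that a table found on the page right after a title page is a continuation of that title's table.

```python
import pandas as pd


def title_for_page(page, titles_by_page):
    """Return the title found on `page`, else the one found on the previous page."""
    if page in titles_by_page:
        return titles_by_page[page]
    # Assumption: a table on the page right after a title page is a continuation.
    return titles_by_page.get(page - 1)


def label_tables(tables_with_pages, titles_by_page):
    """Add a "Titles" column to each (page, DataFrame) pair and stack the results."""
    frames = []
    for page, frame in tables_with_pages:
        title = title_for_page(page, titles_by_page)
        if title is None:
            continue  # no title on this page or the previous one: skip the table
        frames.append(frame.assign(Titles=title))
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


# Hypothetical data: "title 1" was found on page 4 and its table continues on page 5.
titles_by_page = {4: "title 1", 14: "title 2"}
tables_with_pages = [
    (4, pd.DataFrame({"1": ["ab"], "2": ["aa"], "3": ["aze"]})),
    (5, pd.DataFrame({"1": ["aa"], "2": ["aa"], "3": ["aze"]})),
    (14, pd.DataFrame({"1": ["ac"], "2": ["ze"], "3": ["aze"]})),
]
result = label_tables(tables_with_pages, titles_by_page)
print(result)
```

To use this with the loop from the question, the page of each extracted table would have to be known. camelot tables appear to carry it (for instance as a `page` attribute or in the parsing report), but that should be checked against the installed version, and the page would need converting to an integer before the lookup. The title pages would also have to be recorded while searching with PyPDF2, which the question does not show.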