I want to build a DataFrame. I parse several PDFs with PyPDF2 and camelot. With PyPDF2 I search for the title of each table and store the titles in a list. With camelot I extract the table that follows each title, and I want to add a column to that table containing the title of its part. My problem is that when a table is too big, it spans two pages. As a result I get more tables than titles, and I end up with IndexError: list index out of range.
    import camelot
    import pandas as pd

    indice1 = 0
    dff = pd.DataFrame()  # accumulates the tables of every file
    for file, li in zip(files, pageslist):
        table = camelot.read_pdf(file, pages=li, line_scale=50)
        df = pd.DataFrame()
        for k in range(len(table)):
            df_tables = table[k].df
            # keep only the tables whose first row is the expected header
            if all(elem in df_tables.iloc[0].values for elem in ["1", "2", "3"]):
                df_tables.columns = df_tables.iloc[0]
                df_tables = df_tables[1:]
                df_tables2 = df_tables.copy()
                df_tables2["Titles"] = ""
                # (1) df_tables2
                Title_List = []
                for o in range(len(df_tables2["Titles"])):  # The part which is problematic
                    Title_List.append(str(l[indice1]))
                df_tables2["Titles"] = Title_List
                # (2) df_tables2
                df = pd.concat([df, df_tables2])
                indice1 += 1
        dff = pd.concat([dff, df])
Here, files is my list of PDF files. pageslist is a list of strings giving the pages from which to extract the interesting tables, e.g. ['4, 5, 14, 15, 45, 46, 80, 81', '10, 11, 23, 24, 33, 34', …]: each page where I found a title, plus the following page, so that I do not miss big tables that span two pages. l is my list of titles: l = ['title 1', 'title 2 ', 'title 3' ..., 'title n'].
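To show the mismatch without the PDFs, here is a minimal sketch with made-up data (the lists and the helper label_by_position are hypothetical stand-ins, not my real code). It pairs the k-th table with the k-th title, as my loop effectively does, and fails as soon as there are more tables than titles:

```python
# Made-up stand-ins: two titles, but three extracted tables, because
# the first table continues on the following page.
titles = ["title 1", "title 2"]
tables = [["ab", "aa"], ["ac"], ["ze"]]  # rows of each extracted table


def label_by_position(tables, titles):
    """Pair the k-th table with the k-th title (positional matching)."""
    labelled = []
    for k, rows in enumerate(tables):
        title = titles[k]  # raises IndexError once k >= len(titles)
        labelled.extend((row, title) for row in rows)
    return labelled


error_message = None
try:
    label_by_position(tables, titles)
except IndexError as exc:
    error_message = str(exc)
    print("IndexError:", error_message)
```

Even before the index runs out, the second half of the split table would receive 'title 2' instead of 'title 1', so every later title is shifted by one.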
(1) df_tables2 :
1 | 2 | 3 | Titles |
---|---|---|---|
ab | aa | aze | |
aa | aa | aze | |
(2) df_tables2 :
1 | 2 | 3 | Titles |
---|---|---|---|
ab | aa | aze | title 1 |
aa | aa | aze | title 1 |
Once the loop for file, li in zip(files, pageslist): has finished, the expected output is:
1 | 2 | 3 | Titles |
---|---|---|---|
ab | aa | aze | title 1 |
aa | aa | aze | title 1 |
ac | ze | aze | title 2 |
ab | aa | aze | title 3 |
... | ... | ... | ... |
aa | aa | aze | title 9 |
ac | ze | aze | title 10 |
I tried adding a counter in the loop, but that did not work either.
Is there a way to say: if the table is split across two pages, keep the same title, i.e. reuse the same Title_List.append(str(l[indice1])), or something like that?
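Something like the sketch below is what I have in mind, though it is only an idea under assumptions: instead of counting tables, look the title up from the page the table was found on. title_pages (the page where each title was found in one file, sorted ascending), table_pages and the helper title_for_page are hypothetical names, and the data is made up.

```python
from bisect import bisect_right

# Hypothetical data for ONE file: the page on which each title was found
# (sorted ascending), and the page of each extracted table.
title_pages = [4, 14, 45]
titles = ["title 1", "title 2", "title 3"]
table_pages = [4, 5, 14, 45, 46]


def title_for_page(page, title_pages, titles):
    """Return the title found on the closest page at or before `page`."""
    pos = bisect_right(title_pages, page) - 1
    if pos < 0:
        raise ValueError(f"no title found at or before page {page}")
    return titles[pos]


# A table on page 5 falls back to the title found on page 4, and so on.
assigned = [title_for_page(p, title_pages, titles) for p in table_pages]
print(assigned)
```

In the real loop the looked-up title could be assigned to the whole column at once (df_tables2["Titles"] = title), with no inner loop. I believe each camelot table exposes the page it came from (table[k].page), but I have not verified this, so treat it as an assumption. The lookup would also mislabel an unrelated table that happens to sit on the page right after a title.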