I have searched Stack Overflow for how to extract table information from a PDF that has no horizontal lines, and I am almost successful. However, this brings me to my next problem: how do I correctly output the data for use in a DataFrame?
The PDF table in question is the following:
I would like to extract all of the data in this table, excluding the totals (from Samtals ISK... down).
So far I have used the following:
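import pdfplumber

# Extract the text lines and the table from the first page of the PDF.
with pdfplumber.open(file_path) as invoice:
    page = invoice.pages[0]
    text = page.extract_text().split('\n')
    # Columns are inferred from text alignment; rows are split only at drawn lines.
    table = page.extract_table(table_settings={"vertical_strategy": "text",
                                               "horizontal_strategy": "lines"})
table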
However, displaying this table gives the following output:
[['',
'7159\n7156\n7154\n7155\n7158\n7157\n7160\n5013\n5014\n5015\n5025\n5017',
'Hummus\nGuacamole\nChili Mayo\nSalsa\nTzatzikisósa\nPestó\nGarlic oil\nSætkartöflusalat\nRauðrófur\nBrokkolísalat\nBrokkoli\nSalat',
'Samtal\n11% VS\nSamtal',
'1\n1\n2\n1\n1\n1\n1\n5\n6\n1\n2\n1\ns \nK\ns',
'0\n0\n0\n0\n0\n0\n0\n0\n0\n0\n0\n8\nISK\nISK',
'án VS\n með',
'809,00\n2.170,00\n444,00\n812,00\n713,00\n909,00\n1.886,00\n1.205,00\n1.683,00\n1.391,00\n1.362,00\n1.980,00\nK\nVSK',
'11\n11\n11\n11\n11\n11\n11\n11\n11\n11\n11\n11',
'8.090\n21.700\n8.880\n8.120\n7.130\n9.090\n18.860\n60.250\n100.980\n13.910\n27.240\n35.640\n319.890\n35.188'],
[None, None, None, None, None, None, None, None, None, '355.078']]
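This is a step in the right direction, but not 100% what I want. Every column comes back as a single cell with its entries joined by newlines, and I do not know how to get each horizontal entry treated as its own row that lines up with the entries in the other columns.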
What is the solution to this problem? Do I need to extract the data in a different way, or should I format the extracted data better?
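To make the second option concrete, here is a minimal sketch of formatting the already extracted data. It is only an illustration under several assumptions: the item-number column (index 1) contains exactly one entry per line item, the quantity appears to be split across columns 4 and 5 (for example '1' and '0' for 10), and numbers use '.' as thousands separator and ',' as decimal separator. The helper names and the field names (item_no, description and so on) are hypothetical, since the real column headers are not part of the extracted output.

```python
# First row of the output shown above, copied verbatim.
raw_row = [
    '',
    '7159\n7156\n7154\n7155\n7158\n7157\n7160\n5013\n5014\n5015\n5025\n5017',
    'Hummus\nGuacamole\nChili Mayo\nSalsa\nTzatzikisósa\nPestó\nGarlic oil\nSætkartöflusalat\nRauðrófur\nBrokkolísalat\nBrokkoli\nSalat',
    'Samtal\n11% VS\nSamtal',
    '1\n1\n2\n1\n1\n1\n1\n5\n6\n1\n2\n1\ns \nK\ns',
    '0\n0\n0\n0\n0\n0\n0\n0\n0\n0\n0\n8\nISK\nISK',
    'án VS\n með',
    '809,00\n2.170,00\n444,00\n812,00\n713,00\n909,00\n1.886,00\n1.205,00\n1.683,00\n1.391,00\n1.362,00\n1.980,00\nK\nVSK',
    '11\n11\n11\n11\n11\n11\n11\n11\n11\n11\n11\n11',
    '8.090\n21.700\n8.880\n8.120\n7.130\n9.090\n18.860\n60.250\n100.980\n13.910\n27.240\n35.640\n319.890\n35.188',
]


def split_cell(cell, n):
    """Split a newline-joined cell and keep only the first n entries."""
    return [part.strip() for part in cell.split('\n')][:n]


def parse_number(text):
    """Convert a number such as '2.170,00' or '8.090' to a float."""
    return float(text.replace('.', '').replace(',', '.'))


# Assumption: the item-number column has one entry per line item,
# so its length tells us where the totals section begins.
n_items = len(raw_row[1].split('\n'))

item_numbers = split_cell(raw_row[1], n_items)
descriptions = split_cell(raw_row[2], n_items)
qty_left = split_cell(raw_row[4], n_items)    # first digit(s) of the quantity
qty_right = split_cell(raw_row[5], n_items)   # remaining digit of the quantity
unit_prices = split_cell(raw_row[7], n_items)
vat_rates = split_cell(raw_row[8], n_items)
totals = split_cell(raw_row[9], n_items)

# Build one dict per line item; the keys are hypothetical column names.
rows = []
for no, desc, q1, q2, price, vat, total in zip(
        item_numbers, descriptions, qty_left, qty_right,
        unit_prices, vat_rates, totals):
    rows.append({
        "item_no": no,
        "description": desc,
        "quantity": int(q1 + q2),
        "unit_price": parse_number(price),
        "vat_pct": int(vat),
        "total": parse_number(total),
    })
```

The resulting list of dicts can be passed straight to pandas.DataFrame(rows). Truncating every column to the length of the item-number column is what drops the totals section, so this only works while that column stays free of stray text.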