How to correctly format this pdfplumber extract_table() output to DataFrame?

Question

I searched Stack Overflow for how to extract table data from a PDF that has no horizontal lines, and I am almost there. That brings me to my next problem: how do I correctly shape the extracted data for use in a DataFrame?

The PDF table in question is the following:

I would like to have all of the data in this table, excluding the totals (from Samtals ISK... down).

So far I have used the following:

import pdfplumber

# Extract the text lines and the table from the first page of the PDF.
# file_path is the path to the invoice PDF.
with pdfplumber.open(file_path) as invoice:
    page = invoice.pages[0]
    text = page.extract_text().split('\n')
    table = page.extract_table(table_settings={"vertical_strategy": "text",
                                               "horizontal_strategy": "lines"})
table

But displaying this table gives the following output:

[['',
  '7159\n7156\n7154\n7155\n7158\n7157\n7160\n5013\n5014\n5015\n5025\n5017',
  'Hummus\nGuacamole\nChili Mayo\nSalsa\nTzatzikisósa\nPestó\nGarlic oil\nSætkartöflusalat\nRauðrófur\nBrokkolísalat\nBrokkoli\nSalat',
  'Samtal\n11% VS\nSamtal',
  '1\n1\n2\n1\n1\n1\n1\n5\n6\n1\n2\n1\ns \nK\ns',
  '0\n0\n0\n0\n0\n0\n0\n0\n0\n0\n0\n8\nISK\nISK',
  'án VS\n með',
  '809,00\n2.170,00\n444,00\n812,00\n713,00\n909,00\n1.886,00\n1.205,00\n1.683,00\n1.391,00\n1.362,00\n1.980,00\nK\nVSK',
  '11\n11\n11\n11\n11\n11\n11\n11\n11\n11\n11\n11',
  '8.090\n21.700\n8.880\n8.120\n7.130\n9.090\n18.860\n60.250\n100.980\n13.910\n27.240\n35.640\n319.890\n35.188'],
 [None, None, None, None, None, None, None, None, None, '355.078']]

This is a step in the right direction, but not quite what I want. I do not know how to get each horizontal entry treated as its own row, lined up with the corresponding values in the other columns.

What is the solution to this problem? Do I need to extract the data in a different way, or should I format the extracted data better?

score 0 · Answer 1 · answered Mar 17 '23 at 16:10

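Based on the screenshot you provided, the table has no horizontal lines between the rows, so the lines strategy has nothing to split the rows on and every column collapses into a single cell. Use text instead of lines as your horizontal_strategy: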

page.extract_table(
    table_settings={"vertical_strategy": "text", "horizontal_strategy": "text"}
)
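
Once the rows come back one per line, they will probably still need some tidying before they go into a DataFrame. Here is a minimal sketch, assuming the totals block starts at a row containing Samtals. The clean_rows helper and the sample data are hypothetical, made up for illustration, and are not part of pdfplumber:

```python
def clean_rows(table, stop_marker="Samtals"):
    """Keep the item rows of an extract_table() result, stopping at the totals."""
    rows = []
    for row in table:
        # extract_table() can return None for empty cells; normalise to "".
        cells = [(cell or "").strip() for cell in row]
        if not any(cells):
            continue  # skip blank spacer rows
        # Join the cells so the marker is found even if it was split across columns.
        if stop_marker in "".join(cells):
            break  # everything from the totals line down is excluded
        rows.append(cells)
    return rows


# Hypothetical sample shaped like a text/text extraction of the invoice.
sample = [
    ["7159", "Hummus", "10", "809,00", "11", "8.090"],
    ["", "", "", "", "", ""],
    ["7156", "Guacamole", "10", "2.170,00", "11", "21.700"],
    ["Samtals ISK án VSK", None, None, None, None, "319.890"],
    [None, None, None, None, None, "355.078"],
]
rows = clean_rows(sample)
```

The result can then be passed to pandas.DataFrame(rows, columns=[...]) with column names of your choosing.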

Furthermore, you can specify explicit vertical line positions to get even cleaner output. For example:

page.extract_table(
    table_settings={
        "vertical_strategy": "explicit",
        "explicit_vertical_lines": [100, 200, 300, 400],  # These would be your custom x-coordinates.
        "horizontal_strategy": "text"
    }
)
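
One way to find those coordinates, rather than guessing them, is to look for horizontal gaps between the words on the page. This is a rough sketch, assuming the word dicts come from page.extract_words() (which gives each word x0 and x1 keys) and that you pass in only the words from the item rows, not the totals. The column_boundaries helper, its min_gap parameter, and the sample words are hypothetical, for illustration only:

```python
def column_boundaries(words, min_gap=5.0):
    """Return x-coordinates for vertical lines placed in the gaps between words.

    `words` is a list of dicts with "x0" and "x1" keys, the shape produced by
    pdfplumber's page.extract_words().
    """
    spans = sorted((w["x0"], w["x1"]) for w in words)
    if not spans:
        return []
    # Merge horizontal spans that overlap or sit closer together than min_gap.
    merged = [list(spans[0])]
    for x0, x1 in spans[1:]:
        if x0 - merged[-1][1] < min_gap:
            merged[-1][1] = max(merged[-1][1], x1)
        else:
            merged.append([x0, x1])
    # Outer edges, plus one line in the middle of every remaining gap.
    lines = [merged[0][0]]
    for left, right in zip(merged, merged[1:]):
        lines.append((left[1] + right[0]) / 2)
    lines.append(merged[-1][1])
    return lines


# Hypothetical word boxes forming three columns.
sample_words = [
    {"x0": 10, "x1": 40}, {"x0": 12, "x1": 35},
    {"x0": 100, "x1": 150}, {"x0": 105, "x1": 140},
    {"x0": 200, "x1": 230},
]
boundaries = column_boundaries(sample_words)
```

The returned list could then be passed as explicit_vertical_lines. The min_gap value needs tuning per document: too small, and the space inside a multi-word name such as Garlic oil becomes a column break.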
