
I am trying to parse PDF tables using the pdftables Python library, but it combines columns and ignores the spaces between them.

Here is my code:

pdf_page = get_pdf_page(fileobj, page)
tables = page_to_tables(pdf_page)

Structure of the tables in the PDF files: [image: Structure of tables in pdf files]

Output: the elements of the first 6 columns are combined, ignoring the spaces between them; the following columns are correct.

  • This is more of a workaround than a solution, but if you know the expected length of each cell, you can parse them individually to fix the output table. – Niayesh Isky Apr 03 '18 at 06:37
  • Yeah, thank you. I thought about that, but the cell lengths vary from 1 to 3. – Khushhal Apr 03 '18 at 06:43
  • Yeah, that would be difficult. Interestingly, this problem only seems to arise in columns that have 100s in them - i.e. columns that sometimes have wider text. Ideally it should be possible to decrease the threshold that PDFMiner (which PDFTables is based on) uses to decide whether different words are in the same box, but I'm not sure if that's doable. – Niayesh Isky Apr 03 '18 at 07:03
  • Also, you might want to specify which version of PDFTables you're using. There are 2 main ones that I know of on GitHub, for example: [this](https://github.com/chrisdev/pdftables) and [this](https://github.com/drj11/pdftables). (And you can try switching between them to see if one of them works better for you.) – Niayesh Isky Apr 03 '18 at 07:05
  • Finally, here's a weird idea - what if you manually add column-separating lines to the tables in your PDF, to make it easier for PDFTables to sense the individual cells? This isn't feasible if your PDF has a lot of columns that aren't working, of course, but for just one or two pages, it should be ok. – Niayesh Isky Apr 03 '18 at 07:06
  • @NiayeshIsky The problem is not only with 100; it affects all rows of the specified columns. BTW, I am using pdftables.six because I am using Python: when I try to install with `pip install pdftables`, it gives me the error "SyntaxError: Missing parentheses in call to 'print'". I think that is because of the Python version. – Khushhal Apr 03 '18 at 07:30
  • Yes, the whole `total`, `urban`, `rural` columns. But those are special because they contain a 100 somewhere in them, which seems to affect every row in those columns. So if you add lines just between those columns, it might work better. – Niayesh Isky Apr 03 '18 at 07:41
  • While checking on GitHub I found `tables = page_to_tables(pdf_page, atomise=True)`. Now the output looks like this: ['Australia', '100', '100', '100', '100–', '', '', '98', '94', '94', '95', '94', '94', '94', '87', '94', '', '', '', '', '', ''] Now it is combining the next two columns, as you can see at index 4: '100–'. – Khushhal Apr 03 '18 at 07:46

1 Answer

You can dodge some of the PDF frustration if you notice that the values are percentages, so every number lies between 0 and 100. That makes it easy to split a run of digits into numbers from 10 to 100: read digits until you have a two-digit number (10 to 99). If that number is 10, a following 0 belongs to it and makes 100; any other following digit starts the next number.

I express myself better in Python than in English xD I hope this is helpful for you:

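def split(digits):
    # Split a run of concatenated percentages (10 to 100) into integers.
    number = '0'
    numbers = []
    for char in digits:
        if int(char) == 0 and int(number) == 10:
            # "10" followed by "0" can only be 100.
            numbers.append(int(number + char))
            number = '0'
        elif 9 < int(number) < 100 and int(char) != 0:
            # A two-digit number is complete; this digit starts the next one.
            numbers.append(int(number))
            number = char
        elif 0 <= int(number) < 10:
            # Fewer than two digits so far: keep accumulating.
            number = number + char
        # Note: a "0" after any other two-digit number (e.g. "200")
        # matches no branch and is skipped.
    if int(number) > 0:
        numbers.append(int(number))
    return numbers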

For example, if I call this code with:

split('25106387100')

it returns

[25, 10, 63, 87, 100]

With this code you can split any string of concatenated numbers from 10 to 100. The remaining problem is one-digit numbers: for those, you could add a condition inside the 0-9 branch (for example an `isdigit()` check) that uses the position of the digit in the PDF, which keeps the extra processing of the PDF to a minimum.
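One possible way to handle the one-digit case, assuming you know how many columns were merged into the cell (that count is my assumption, pdftables does not give it to you): enumerate every way to cut the digit string into exactly that many integers between 0 and 100. The name `split_percentages` and its `count` parameter are hypothetical, and the sketch expects only digit characters, so strip things like '–' first.

```python
def split_percentages(digits, count):
    """Return every way to cut `digits` into `count` integers in 0..100.

    Hypothetical helper: `count` is the number of columns assumed to have
    been merged into this cell.
    """
    results = []

    def walk(pos, parts):
        if len(parts) == count:
            # Only accept the split if all digits were consumed.
            if pos == len(digits):
                results.append(parts)
            return
        for width in (1, 2, 3):
            chunk = digits[pos:pos + width]
            if len(chunk) < width:
                break  # not enough digits left for this width
            if width > 1 and chunk[0] == "0":
                continue  # reject leading zeros such as "07"
            if int(chunk) > 100:
                continue  # percentages cannot exceed 100
            walk(pos + width, parts + [int(chunk)])

    walk(0, [])
    return results


print(split_percentages("9894", 2))  # [[98, 94]]
print(split_percentages("123", 2))   # [[1, 23], [12, 3]]
```

If the result has exactly one entry, the cell is unambiguous. If it has several, the digits alone cannot decide, and you need extra information such as the character positions from the PDF.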