20

I have spent the whole day trying to convert several .pdf files, which contain traffic flow data for São Paulo, into spreadsheets (MS Office Excel or LibreOffice Calc) on Ubuntu. When I open a .pdf file with LibreOffice Calc, it opens in LibreOffice Draw instead, and I can't get a spreadsheet.

The most promising method I found uses pdftotext. It works, and I can get the tables into LibreOffice Calc, but only by adjusting the columns manually.

My problem is that I have so many .pdf files that doing this by hand would take far too long.

Does anyone know a better method?

Sergio

4 Answers

36

Another option is to use Okular (http://okular.kde.org). It has a table selection tool (Ctrl+5). You can select a table, add lines to mark additional rows and columns, and copy the resulting table to the clipboard. It works fine for me.

Dmitry Somov
19

Tabula can work quite well. PDF is not an easy format to extract structured information from, so it's not always possible.

scruss
11

Maybe the -layout option of pdftotext would be useful for you. With this option set, pdftotext will try to keep the column layout in the resulting text file.
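Since the question involves many files, the conversion step could be batched. The following is only a minimal sketch, assuming pdftotext is installed and on the PATH; the helper names `pdftotext_command` and `convert_all` are made up for illustration.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path):
    """Build the pdftotext call for one PDF, writing a .txt file next to it."""
    pdf_path = Path(pdf_path)
    txt_path = pdf_path.with_suffix(".txt")
    # -layout asks pdftotext to preserve the physical column layout.
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def convert_all(directory):
    """Run pdftotext -layout on every .pdf file found directly in `directory`."""
    for pdf_path in sorted(Path(directory).glob("*.pdf")):
        # check=True raises CalledProcessError if pdftotext fails on a file.
        subprocess.run(pdftotext_command(pdf_path), check=True)
```

Calling `convert_all(".")` would process every .pdf file in the current directory and leave one text file per PDF.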

Now you can import the text file into LibreOffice Calc with the appropriate import settings. When opening a txt file in Calc, you will be asked how to parse the file content (see screenshot below). Under Separator Options, select "Separated by", then tick both Space and Merge Delimiters. This way, Calc will be able to restore the column structure (assuming the cell data doesn't contain spaces).

[Screenshot: text import dialog in Calc]
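The import step could also be scripted when there are too many files to open one by one. This is a rough sketch, not a tested solution for the actual PDFs: it assumes columns in the pdftotext -layout output are separated by at least two spaces, which is not guaranteed for every table. The function names and the sample text are invented for illustration.

```python
import csv
import io
import re


def layout_text_to_rows(text):
    """Split layout-preserved text into rows of cells.

    Assumption: columns are separated by two or more whitespace
    characters, so cells containing single spaces stay intact.
    """
    rows = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines between table rows
        rows.append(re.split(r"\s{2,}", stripped))
    return rows


def rows_to_csv(rows):
    """Serialize rows as CSV text that Calc or Excel can open directly."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Made-up sample in the style of pdftotext -layout output.
sample = "Av. Paulista    1200   980\n\nRua Augusta     450   300\n"
print(rows_to_csv(layout_text_to_rows(sample)))
```

Unlike splitting on every space, this keeps a cell such as "Av. Paulista" together, but it will merge two columns whenever the PDF places them only one space apart, so the result still needs a spot check.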

tohuwawohu
  • Thank you @TeTeT and @tohuwawohu, but it wasn't very helpful, because I would have to incorporate the fields manually for each file. [Here](http://docs.google.com/file/d/0B8dmwpzdfD55YmlsT0hLcEZhQ0E/edit?usp=sharing) is a copy of one txt file. By the way, when I used pdftotext I ran the following command: `pdftotext -layout pg_0014.pdf pg_0014.txt` – Sergio Aug 20 '13 at 17:47
  • OK, I see. The source PDF is available online, too. It has almost 200 pages with a lot of tables. If there's no way to use a professional pdf-to-calc (pdf-to-excel) solution, you could only try asking the CETSP people whether they can send you the original Word files. In any case, you will still have to import each table into Calc manually. Maybe the raw data is available, too. – tohuwawohu Aug 21 '13 at 13:11
  • I tried asking CETSP several times, even through the Transparency System. Well, I'm going to try on Windows :( Thanks! – Sergio Aug 21 '13 at 16:44
4

A tool called Able2Extract can do exactly what you want with minimal errors.

Ruyonga Dan