Problems with extracting table from PDF

Question

I know there are a few threads on this topic, but none of their solutions seems to work for me. I have a table in a PDF document from which I would like to extract information. I can copy and paste the text into TextEdit, where it is legible but not really usable: all the text is readable, but the values are separated only by spaces, so there is no way to tell a column boundary from a space inside a cell.

Whenever I try tools like Tabula or ScraperWiki, however, the extracted text is garbage.

Is anyone able to give me any pointers as to how I might go about this?

Your PDF uses custom ad hoc font encodings, which it provides in the **Encoding** entries of the respective **Font** dictionaries. It does not provide **ToUnicode** maps, though. Some text extractors cannot work from that encoding entry alone; newer releases probably can. — mkl, Mar 04 '15 at 10:35

Jonathan Epstein · Answer 1 · 2015-03-03T16:09:23.947

Here's a solution using Python and Unix

In Python:

import urllib.request

# Download the PDF and save it as test.pdf (Python 3; urllib.URLopener exists only in Python 2)
url = 'http://www.european-athletics.org/mm/Document/EventsMeetings/General/01/27/52/10/EICH-FinalEntriesforwebsite_Neutral.pdf'
urllib.request.urlretrieve(url, 'test.pdf')

In Unix:

$ pdftotext -layout test.pdf

Snippet of output to test.txt:

Lastname               Firstname           Country  DOB         PB       SB
1500m Men
Rowe                   Brenton             AUT      17/08/1987
Vojta                  Andreas             AUT      09/06/1989  3:38.99  3:41.09
Khadiri                Amine               CYP      20/11/1988  3:45.16  3:45.16
Friš                   Jan                 CZE      19/12/1995  3:43.76  3:43.76
Holuša                 Jakub               CZE      20/02/1988  3:38.79  3:41.54
Kocourek               Milan               CZE      06/12/1987  3:43.97  3:43.97
Bueno                  Andreas             DEN      07/07/1988  3:42.78  3:42.78
Alcalá                 Marc                ESP      07/11/1994  3:41.79  3:41.79
Mechaal                Adel                ESP      05/12/1990  3:38.30  3:38.30
Olmedo                 Manuel              ESP      17/05/1983  3:39.82  3:40.66
Ruíz                   Diego               ESP      05/02/1982  3:36.42  3:40.60
Kowal                  Yoann               FRA      28/05/1987  3:38.07  3:39.22
Grice                  Charlie             GBR      07/11/1993  3:39.44  3:39.44
O'Hare                 Chris               GBR      23/11/1990  3:37.25  3:40.42
Orth                   Florian             GER      24/07/1989  3:39.97  3:40.20
Tesfaye                Homiyu              GER      23/06/1993  3:34.13  3:34.13
Kazi                   Tamás               HUN      16/05/1985  3:44.28  3:44.28
Mooney                 Danny               IRL      20/06/1988  3:42.69  3:42.69
Travers                John                IRL      16/03/1991  3:42.52  3:43.74
Bussotti Neves Junior  Joao Capistrano M.  ITA      10/05/1993  3:47.58  3:47.58
Jurkēvičs              Dmitrijs            LAT      07/01/1987  3:45.95  3:45.95
Ingebrigtsen           Henrik              NOR      24/02/1991  3:44.00
Ingebrigtsen           Filip               NOR      20/04/1993
Krawczyk               Szymon              POL      29/12/1988  3:41.64  3:41.64
Ostrowski              Artur               POL      10/07/1988  3:41.36  3:41.36
ebrowski               Krzysztof           POL      09/07/1990  3:41.49  3:41.49
Smirnov                Valentin            RUS      13/02/1986  3:37.55  3:38.74
Nava                   Goran               SRB      15/04/1981  3:40.65  3:44.49
Pelikán                Jozef               SVK      29/07/1984  3:43.85  3:45.51
Ek                     Staffan             SWE      13/11/1991  3:43.54  3:43.54
Rogestedt              Johan               SWE      27/01/1993  3:40.03  3:40.03
Özbilen                lham Tanui          TUR      05/03/1990  3:34.76  3:38.05
Özdemir                Ramazan             TUR      06/07/1991  3:44.35  3:44.35
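The download and the pdftotext call can also be driven from a single Python script. What follows is only a sketch under a few assumptions: pdftotext (from Xpdf or Poppler) is installed and on the PATH, the helper names `pdftotext_command` and `download_and_extract` are made up for illustration, and `-enc UTF-8` is added on the assumption that UTF-8 output is wanted.

```python
import subprocess
import urllib.request


def pdftotext_command(pdf_path, txt_path, layout=True, encoding="UTF-8"):
    """Build the argument list for a pdftotext call (illustrative helper)."""
    cmd = ["pdftotext", "-enc", encoding]
    if layout:
        # -layout keeps the physical layout, so table rows stay on one line
        cmd.append("-layout")
    cmd += [pdf_path, txt_path]
    return cmd


def download_and_extract(url, pdf_path="test.pdf", txt_path="test.txt"):
    """Download a PDF, run pdftotext on it, and return the text file path."""
    urllib.request.urlretrieve(url, pdf_path)
    # check=True raises CalledProcessError if pdftotext exits with an error
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
    return txt_path
```

Calling `download_and_extract(url)` with the URL from the Python snippet above should leave the extracted text in test.txt, provided the pdftotext build in use accepts these flags.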

Thanks for your response. The output looks great and exactly what I need. However I am working on Mac. I have tried downloading a version of pdftotext for the mac from softpedia but this seems to produce an empty .txt file. — lac, Mar 03 '15 at 16:39
Not an expert on switching between operating systems, but you could try moving entirely to a Unix OS. I believe pdftotext ships with many Unix-like systems (it is part of Xpdf and Poppler). — Jonathan Epstein, Mar 03 '15 at 20:48
I've now managed to get hold of a Unix system to run this on, but I'm still getting a bunch of symbols like this ÔÄöÔÄõÔÄúÔÄêÔÄíÔÄö when I open the file in TextEdit. Did you have to do anything with the encoding to get it to display nicely as above? — lac, Mar 04 '15 at 08:46
Try the -enc option, described here http://linux.about.com/od/commands/l/blcmdl1_pdftote.htm — Jonathan Epstein, Mar 04 '15 at 14:51
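One possible reading of the symbols quoted in the comments above, offered as a guess rather than a confirmed diagnosis: they look like UTF-8 bytes that were displayed as Mac OS Roman. The sketch below reverses that assumed mis-decoding and checks which code points come out. The variable names are illustrative, and the Mac OS Roman assumption is mine, not something stated in the thread.

```python
import unicodedata

garbled = "ÔÄöÔÄõÔÄúÔÄêÔÄíÔÄö"  # the symbols quoted in the comment above

# Undo the assumed mis-decoding: back to bytes via Mac OS Roman, then read as UTF-8.
# NFC normalisation guards against the accented letters arriving in decomposed form.
raw = unicodedata.normalize("NFC", garbled).encode("mac_roman")
recovered = raw.decode("utf-8")

code_points = [f"U+{ord(ch):04X}" for ch in recovered]
# U+E000..U+F8FF is the Unicode private use area in the Basic Multilingual Plane
in_private_use = all(0xE000 <= ord(ch) <= 0xF8FF for ch in recovered)

print(code_points, in_private_use)
```

Under that assumption the six symbols decode to code points between U+F010 and U+F01C, all in the private use area. That would fit mkl's remark about custom font encodings without ToUnicode maps, and it would mean that changing the display encoding alone cannot turn them into readable text.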

score 0 · Answer 2 · answered May 01 '16 at 17:20

You can also download a simple command-line tool (pdftotext) to deal with the PDF file you linked to. Then run this command to extract the table(s) on the first page:

pdftotext     \
   -enc UTF-8 \
   -l 1       \
   -table     \
    EICH-FinalEntriesforwebsite_Neutral.pdf \
    EICH-FinalEntriesforwebsite_Neutral.txt

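-enc UTF-8: sets the output text encoding so that characters such as Ö, Ä, Ü and İ (as well as ö, ä, ü, ß, á, š, ē, í and č) are extracted correctly.
-l 1: sets the last page to extract to page 1, so only the first page is processed.
-table: this is the decisive parameter. It selects a table mode that is similar to -layout but optimized for tabular data, with the aim of keeping rows and columns aligned.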

The command produces this output:

EUROPEAN ATHLETICS INDOOR CHAMPIONSHIPS
PRAGUE / CZE, 6-8 MARCH 2015
FINAL ENTRIES - MEN
Lastname               Firstname           Country  DOB         PB       SB
1500m Men
Rowe                   Brenton             AUT      17/08/1987
Vojta                  Andreas             AUT      09/06/1989  3:38.99  3:41.09
Khadiri                Amine               CYP      20/11/1988  3:45.16  3:45.16
Friš                   Jan                 CZE      19/12/1995  3:43.76  3:43.76
Holuša                 Jakub               CZE      20/02/1988  3:38.79  3:41.54
Kocourek               Milan               CZE      06/12/1987  3:43.97  3:43.97
Bueno                  Andreas             DEN      07/07/1988  3:42.78  3:42.78
Alcalá                 Marc                ESP      07/11/1994  3:41.79  3:41.79
Mechaal                Adel                ESP      05/12/1990  3:38.30  3:38.30
Olmedo                 Manuel              ESP      17/05/1983  3:39.82  3:40.66
Ruíz                   Diego               ESP      05/02/1982  3:36.42  3:40.60
Kowal                  Yoann               FRA      28/05/1987  3:38.07  3:39.22
Grice                  Charlie             GBR      07/11/1993  3:39.44  3:39.44
O'Hare                 Chris               GBR      23/11/1990  3:37.25  3:40.42
Orth                   Florian             GER      24/07/1989  3:39.97  3:40.20
Tesfaye                Homiyu              GER      23/06/1993  3:34.13  3:34.13
Kazi                   Tamás               HUN      16/05/1985  3:44.28  3:44.28
Mooney                 Danny               IRL      20/06/1988  3:42.69  3:42.69
Travers                John                IRL      16/03/1991  3:42.52  3:43.74
Bussotti Neves Junior  Joao Capistrano M.  ITA      10/05/1993  3:47.58  3:47.58
Jurkēvičs              Dmitrijs            LAT      07/01/1987  3:45.95  3:45.95
Ingebrigtsen           Henrik              NOR      24/02/1991  3:44.00
Ingebrigtsen           Filip               NOR      20/04/1993
Krawczyk               Szymon              POL      29/12/1988  3:41.64  3:41.64
Ostrowski              Artur               POL      10/07/1988  3:41.36  3:41.36
Żebrowski              Krzysztof           POL      09/07/1990  3:41.49  3:41.49
Smirnov                Valentin            RUS      13/02/1986  3:37.55  3:38.74
Nava                   Goran               SRB      15/04/1981  3:40.65  3:44.49
Pelikán                Jozef               SVK      29/07/1984  3:43.85  3:45.51
Ek                     Staffan             SWE      13/11/1991  3:43.54  3:43.54
Rogestedt              Johan               SWE      27/01/1993  3:40.03  3:40.03
Özbilen                İlham Tanui         TUR      05/03/1990  3:34.76  3:38.05
Özdemir                Ramazan             TUR      06/07/1991  3:44.35  3:44.35
3000m Men
Rowe                   Brenton             AUT      17/08/1987
Vojta                  Andreas             AUT      09/06/1989  7:59.95  7:59.95
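Once the text file holds one table row per line, the columns still have to be told apart from spaces inside a cell, which is the problem raised in the question. A minimal sketch, assuming that columns are separated by at least two spaces while spaces inside a cell are single. The function name `split_row` is illustrative, and the spacing in the sample lines is made up (the values are taken from the output above).

```python
import re

COLUMNS = ["Lastname", "Firstname", "Country", "DOB", "PB", "SB"]


def split_row(line, n_columns=len(COLUMNS)):
    """Split one line of layout-preserving output into cells."""
    # Two or more whitespace characters mark a column boundary;
    # a single space is treated as part of the cell.
    cells = re.split(r"\s{2,}", line.strip())
    # Pad rows whose trailing cells (for example PB and SB) are empty.
    cells += [""] * (n_columns - len(cells))
    return cells


sample = [
    "Vojta                  Andreas             AUT      09/06/1989  3:38.99  3:41.09",
    "Bussotti Neves Junior  Joao Capistrano M.  ITA      10/05/1993  3:47.58  3:47.58",
    "Rowe                   Brenton             AUT      17/08/1987",
]
rows = [dict(zip(COLUMNS, split_row(line))) for line in sample]
```

This heuristic breaks when an empty cell sits between two filled ones, because the remaining values shift left. Slicing each line at fixed character positions would probably be more robust in that case.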

Note, however:

The -table parameter of the pdftotext command-line tool is only available in XPDF version 3.04, which you can download here: www.foolabs.com/xpdf/download.html. It is NOT (yet) available in Poppler's fork of pdftotext (whose latest version at the time of writing is 0.43.0).

If you only have Poppler's pdftotext, you'd have to use the -layout parameter (instead of -table), which gives you a similarly good result for the PDF file in question:

pdftotext     \
   -enc UTF-8 \
   -l 1       \
   -layout    \
    EICH-FinalEntriesforwebsite_Neutral.pdf \
    EICH-FinalEntriesforwebsite_Neutral.txt

However, I have seen PDFs where the result is much better with -table (and XPDF) than it is with -layout (and Poppler).

(XPDF has the -layout parameter too -- so you can see the difference if you try both.)
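A script that has to run against either build could choose between the two parameters at run time. The sketch below is one way to do that, under the assumption that `pdftotext -h` prints a usage message listing every supported option; the function names `pick_mode` and `pdftotext_help` are made up for illustration.

```python
import subprocess


def pick_mode(help_text):
    """Return '-table' if the usage text lists that option, else '-layout'."""
    # Compare whole whitespace-separated tokens so that a longer option
    # name containing the same letters does not match by accident.
    return "-table" if "-table" in help_text.split() else "-layout"


def pdftotext_help():
    """Return whatever pdftotext prints when asked for its usage message."""
    result = subprocess.run(["pdftotext", "-h"], capture_output=True, text=True)
    # Builds may write the usage text to stdout or to stderr; take both.
    return result.stdout + result.stderr
```

`pick_mode(pdftotext_help())` would then return the parameter to put in place of -table or -layout in the commands above.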
