
I know there are a few threads on this topic, but none of their solutions seem to work for me. I have a table in a PDF document from which I would like to extract information. I can copy and paste the text into TextEdit, where it is legible but not really usable: all the text is readable, but the data is separated only by spaces, so there is no way to tell a column boundary from a space inside a cell.

Whenever I try tools like Tabula or ScraperWiki, the extracted text is garbage.

Is anyone able to give me any pointers as to how I might go about this?

Kurt Pfeifle
lac
  • Your PDF uses custom ad hoc font encodings, which it provides in the respective **Font** dictionary **Encoding** entries. It does not provide **ToUnicode** maps, though. Some text extractors may not be able to work from that encoding entry alone. Probably newer releases will do. – mkl Mar 04 '15 at 10:35

2 Answers

0

Here's a solution using Python and Unix

In Python:

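import urllib.request

# URL of the PDF from the question
url = 'http://www.european-athletics.org/mm/Document/EventsMeetings/General/01/27/52/10/EICH-FinalEntriesforwebsite_Neutral.pdf'

# download the PDF and save it as test.pdf in the current directory
# (Python 3; the old urllib.URLopener class is no longer available)
urllib.request.urlretrieve(url, 'test.pdf')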

In Unix:

$ pdftotext -layout test.pdf

Snippet of output to test.txt:

Lastname Firstname Country DOB PB SB
1500m Men
Rowe Brenton AUT 17/08/1987
Vojta Andreas AUT 09/06/1989 3:38.99 3:41.09
Khadiri Amine CYP 20/11/1988 3:45.16 3:45.16
Friš Jan CZE 19/12/1995 3:43.76 3:43.76
Holuša Jakub CZE 20/02/1988 3:38.79 3:41.54
Kocourek Milan CZE 06/12/1987 3:43.97 3:43.97
Bueno Andreas DEN 07/07/1988 3:42.78 3:42.78
Alcalá Marc ESP 07/11/1994 3:41.79 3:41.79
Mechaal Adel ESP 05/12/1990 3:38.30 3:38.30
Olmedo Manuel ESP 17/05/1983 3:39.82 3:40.66
Ruíz Diego ESP 05/02/1982 3:36.42 3:40.60
Kowal Yoann FRA 28/05/1987 3:38.07 3:39.22
Grice Charlie GBR 07/11/1993 3:39.44 3:39.44
O'Hare Chris GBR 23/11/1990 3:37.25 3:40.42
Orth Florian GER 24/07/1989 3:39.97 3:40.20
Tesfaye Homiyu GER 23/06/1993 3:34.13 3:34.13
Kazi Tamás HUN 16/05/1985 3:44.28 3:44.28
Mooney Danny IRL 20/06/1988 3:42.69 3:42.69
Travers John IRL 16/03/1991 3:42.52 3:43.74
Bussotti Neves Junior Joao Capistrano M. ITA 10/05/1993 3:47.58 3:47.58
Jurkēvičs Dmitrijs LAT 07/01/1987 3:45.95 3:45.95
Ingebrigtsen Henrik NOR 24/02/1991 3:44.00
Ingebrigtsen Filip NOR 20/04/1993
Krawczyk Szymon POL 29/12/1988 3:41.64 3:41.64
Ostrowski Artur POL 10/07/1988 3:41.36 3:41.36
ebrowski Krzysztof POL 09/07/1990 3:41.49 3:41.49
Smirnov Valentin RUS 13/02/1986 3:37.55 3:38.74
Nava Goran SRB 15/04/1981 3:40.65 3:44.49
Pelikán Jozef SVK 29/07/1984 3:43.85 3:45.51
Ek Staffan SWE 13/11/1991 3:43.54 3:43.54
Rogestedt Johan SWE 27/01/1993 3:40.03 3:40.03
Özbilen lham Tanui TUR 05/03/1990 3:34.76 3:38.05
Özdemir Ramazan TUR 06/07/1991 3:44.35 3:44.35
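
To tell column gaps apart from spaces inside a cell, one option is to rely on the padding that -layout is meant to preserve in test.txt. The following is only a minimal sketch: the function name split_columns and the sample line are hypothetical, and it assumes that columns are separated by at least two spaces while spaces within a cell are single. Check that assumption against your own output before relying on it.

```python
import re


def split_columns(line):
    """Split one line of padded text output into cells.

    Assumption: a run of two or more whitespace characters marks a
    column boundary; a single space belongs to the cell's own text.
    """
    return re.split(r"\s{2,}", line.strip())


# Hypothetical padded line in the style that -layout aims to produce.
sample = ("Bussotti Neves Junior   Joao Capistrano M.   ITA   "
          "10/05/1993   3:47.58   3:47.58")
print(split_columns(sample))
```

Rows with empty cells simply yield fewer items, so this sketch cannot tell you which column a missing value belonged to.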

Jonathan Epstein
  • Thanks for your response. The output looks great and exactly what I need. However I am working on Mac. I have tried downloading a version of pdftotext for the mac from softpedia but this seems to produce an empty .txt file. – lac Mar 03 '15 at 16:39
  • Not an expert on switching between operating systems, but you could try moving entirely to Unix OS. I believe pdftotext is a built in Unix command. – Jonathan Epstein Mar 03 '15 at 20:48
  • I've now managed to get hold of a Unix system to run this on, but I'm still getting a bunch of symbols like this ÔÄöÔÄõÔÄúÔÄêÔÄíÔÄö when I open the file in TextEdit. Did you have to do anything with the encoding to get it to display nicely as above? – lac Mar 04 '15 at 08:46
  • Try the enc option, described here http://linux.about.com/od/commands/l/blcmdl1_pdftote.htm – Jonathan Epstein Mar 04 '15 at 14:51
0

You can also download a simple command line tool, pdftotext, to deal with the PDF file you linked to. Then run this command to extract the table(s) on the first page:

pdftotext     \
   -enc UTF-8 \
   -l 1       \
   -table     \
    EICH-FinalEntriesforwebsite_Neutral.pdf \
    EICH-FinalEntriesforwebsite_Neutral.txt
  • -enc UTF-8: sets the text encoding so that the Ö, Ä, Ü and İ (as well as ö, ä, ü, ß, á, š, ē, í and č) characters in the text get correctly extracted.
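  • -l 1: sets the last page to extract to page 1, so only the first page is processed.
  • -table: this is the decisive parameter. It selects a table mode that tries to keep rows and columns aligned in the text output.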

The command produces this output:

EUROPEAN ATHLETICS INDOOR CHAMPIONSHIPS
PRAGUE / CZE, 6-8 MARCH 2015
FINAL ENTRIES - MEN
Lastname Firstname Country DOB PB SB
1500m Men
Rowe Brenton AUT 17/08/1987
Vojta Andreas AUT 09/06/1989 3:38.99 3:41.09
Khadiri Amine CYP 20/11/1988 3:45.16 3:45.16
Friš Jan CZE 19/12/1995 3:43.76 3:43.76
Holuša Jakub CZE 20/02/1988 3:38.79 3:41.54
Kocourek Milan CZE 06/12/1987 3:43.97 3:43.97
Bueno Andreas DEN 07/07/1988 3:42.78 3:42.78
Alcalá Marc ESP 07/11/1994 3:41.79 3:41.79
Mechaal Adel ESP 05/12/1990 3:38.30 3:38.30
Olmedo Manuel ESP 17/05/1983 3:39.82 3:40.66
Ruíz Diego ESP 05/02/1982 3:36.42 3:40.60
Kowal Yoann FRA 28/05/1987 3:38.07 3:39.22
Grice Charlie GBR 07/11/1993 3:39.44 3:39.44
O'Hare Chris GBR 23/11/1990 3:37.25 3:40.42
Orth Florian GER 24/07/1989 3:39.97 3:40.20
Tesfaye Homiyu GER 23/06/1993 3:34.13 3:34.13
Kazi Tamás HUN 16/05/1985 3:44.28 3:44.28
Mooney Danny IRL 20/06/1988 3:42.69 3:42.69
Travers John IRL 16/03/1991 3:42.52 3:43.74
Bussotti Neves Junior Joao Capistrano M. ITA 10/05/1993 3:47.58 3:47.58
Jurkēvičs Dmitrijs LAT 07/01/1987 3:45.95 3:45.95
Ingebrigtsen Henrik NOR 24/02/1991 3:44.00
Ingebrigtsen Filip NOR 20/04/1993
Krawczyk Szymon POL 29/12/1988 3:41.64 3:41.64
Ostrowski Artur POL 10/07/1988 3:41.36 3:41.36
Żebrowski Krzysztof POL 09/07/1990 3:41.49 3:41.49
Smirnov Valentin RUS 13/02/1986 3:37.55 3:38.74
Nava Goran SRB 15/04/1981 3:40.65 3:44.49
Pelikán Jozef SVK 29/07/1984 3:43.85 3:45.51
Ek Staffan SWE 13/11/1991 3:43.54 3:43.54
Rogestedt Johan SWE 27/01/1993 3:40.03 3:40.03
Özbilen İlham Tanui TUR 05/03/1990 3:34.76 3:38.05
Özdemir Ramazan TUR 06/07/1991 3:44.35 3:44.35
3000m Men
Rowe Brenton AUT 17/08/1987
Vojta Andreas AUT 09/06/1989 7:59.95 7:59.95
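
If you want structured records rather than plain text, the rows of this particular table can be picked apart without relying on spacing, because every athlete row contains a three-letter country code followed by a DD/MM/YYYY date. The sketch below is an assumption about the row shape based on the output shown here, not a general PDF table parser; the names ROW and parse_row are hypothetical. The two optional times are returned as t1 and t2 rather than PB and SB, because the text alone does not say which column a lone time came from.

```python
import re

# Assumed row shape: name(s), 3-letter country code, DD/MM/YYYY date,
# then zero, one or two times such as 3:38.99.
ROW = re.compile(
    r"^(?P<name>.+?)\s+(?P<country>[A-Z]{3})\s+"
    r"(?P<dob>\d{2}/\d{2}/\d{4})"
    r"(?:\s+(?P<t1>\d+:\d{2}\.\d{2}))?"
    r"(?:\s+(?P<t2>\d+:\d{2}\.\d{2}))?\s*$"
)


def parse_row(line):
    """Return a dict for an athlete row, or None for any other line."""
    match = ROW.match(line.strip())
    return match.groupdict() if match else None


# Lines taken from the output above.
for line in ["1500m Men",
             "Rowe Brenton AUT 17/08/1987",
             "Özbilen İlham Tanui TUR 05/03/1990 3:34.76 3:38.05"]:
    print(parse_row(line))
```

Headings such as "1500m Men" return None, so the caller can use them to track which event the following rows belong to.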

Note, however:

The -table parameter of the pdftotext command line tool is only available in XPDF version 3.04, which you can download here: www.foolabs.com/xpdf/download.html. It is NOT (yet) available in Poppler's fork of pdftotext (the latest version of which, at the time of writing, is 0.43.0).

If you only have Poppler's pdftotext, you'd have to use the -layout parameter (instead of -table), which gives you a similarly good result for the PDF file in question:

pdftotext     \
   -enc UTF-8 \
   -l 1       \
   -layout    \
    EICH-FinalEntriesforwebsite_Neutral.pdf \
    EICH-FinalEntriesforwebsite_Neutral.txt

However, I have seen PDFs where the result is much better with -table (and XPDF) than it is with -layout (and Poppler).

(XPDF has the -layout parameter too, so you can see the difference if you try both.)
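
If you would rather drive the tool from a script, for example to compare the two modes on several files, the command lines above can be wrapped in Python. This is a minimal sketch: the helper names build_pdftotext_cmd and run_pdftotext are hypothetical, it uses only the options already shown in this answer, and it assumes a pdftotext binary is on your PATH. Passing mode="table" will only work with a build that supports -table, as noted above.

```python
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path, txt_path, mode="layout", last_page=1):
    """Build the pdftotext argument list used in this answer.

    mode is "layout" or "table"; "table" needs a pdftotext build that
    actually supports the -table option.
    """
    if mode not in ("layout", "table"):
        raise ValueError("mode must be 'layout' or 'table'")
    return ["pdftotext", "-enc", "UTF-8", "-l", str(last_page),
            "-" + mode, pdf_path, txt_path]


def run_pdftotext(pdf_path, txt_path, **kwargs):
    """Run pdftotext and raise an error if it is missing or fails."""
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError("pdftotext not found on PATH")
    subprocess.run(build_pdftotext_cmd(pdf_path, txt_path, **kwargs),
                   check=True)


# Example call (not executed here):
# run_pdftotext("EICH-FinalEntriesforwebsite_Neutral.pdf",
#               "EICH-FinalEntriesforwebsite_Neutral.txt", mode="layout")
```

Keeping the argument list in its own function makes it easy to print the exact command before running it.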

Kurt Pfeifle