I have a few PDFs written in Brazilian Portuguese that I'd like to parse and process. I tried the PDFBox text extraction command-line tool (with no arguments at all), but I get the following results:

Cão 

ends up as

C~
ao

Copying and pasting the text, or exporting it as text from Adobe Reader, produces the same result. Doing the same (PDFBox, copy and paste, Adobe Reader export) with other files, I managed to extract the text as expected ("Cão"). Not being a PDF expert, I figure it has to do with the way the files were created. I'd like to know whether anyone has seen this behavior and how to work around it when extracting the text.

Grasshopper
  • What are you using to extract the text? This question is very incomplete. – Jean-Bernard Pellerin Dec 06 '13 at 00:13
  • How are you using PDFBox? – Dour High Arch Dec 06 '13 at 00:14
  • *why the text extraction is messed up for those specific documents.* - As @DourHighArch implied you probably use PDFBox incorrectly. If you expect us to check this, provide some code. Furthermore you mention it is a problem with certain documents only. Maybe these documents simply provide incorrect information about their content (cf. e.g. [this answer](http://stackoverflow.com/questions/20402741/how-to-get-text-extraction-from-pdf-to-work/20410126#20410126)). If you expect us to check this, provide the PDF in question. – mkl Dec 06 '13 at 07:59
  • I'm using PDFBox text extraction command line tools [http://pdfbox.apache.org/commandline/#extractText] with no options. – Grasshopper Dec 08 '13 at 11:45
  • Read http://joelonsoftware.com/articles/Unicode.html – fuesika Dec 08 '13 at 12:12

1 Answer

So thanks to Stack Overflow I managed to find the post below:

How to get text extraction from PDF to work?

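which gave me the information I was looking for. Apparently these PDFs are generated without the information an extractor needs to interpret the accented Latin characters correctly, so any tool that relies on that information (PDFBox, copy and paste, Adobe Reader export) returns the same broken text.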
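Since the files themselves cannot be fixed on my side, one possible workaround is to repair the extracted text afterwards. The following is only a minimal sketch, not something taken from the linked post: it assumes the damage always looks like the example in the question (a standalone accent, optional whitespace or a line break, then the letter it belongs to), and the function name `repair_accents` and the accent table are my own illustrative choices. It can produce false positives wherever `~`, `^` or a backtick legitimately precedes one of the listed letters, so check the output on real data.

```python
import re
import unicodedata

# Standalone accents as they may show up in the broken output, mapped to the
# matching Unicode combining marks. This set is an assumption; adjust it to
# what your files actually contain.
COMBINING = {
    "~": "\u0303",  # tilde
    "´": "\u0301",  # acute
    "`": "\u0300",  # grave
    "^": "\u0302",  # circumflex
    "¸": "\u0327",  # cedilla
}

# An accent, optional whitespace (including a line break), then a letter that
# can carry an accent in Portuguese (restricted to limit false positives).
_PATTERN = re.compile(r"([~´`^¸])\s*([AEIOUaeiouNnCc])")


def repair_accents(text: str) -> str:
    """Merge a detached accent with the letter that follows it."""

    def fix(match: re.Match) -> str:
        accent, letter = match.groups()
        # letter + combining mark, composed into a single code point by NFC
        return unicodedata.normalize("NFC", letter + COMBINING[accent])

    return _PATTERN.sub(fix, text)


if __name__ == "__main__":
    print(repair_accents("C~\nao"))  # Cão
```

The text produced by the PDFBox ExtractText tool can be read from its output file and passed through `repair_accents` before any further processing.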

Grasshopper