PDF to text messes up Latin accents

Question

I have a few PDFs written in Brazilian Portuguese which I'd like to parse and process. I tried the PDFBox text extraction command-line tool (with no arguments at all), but I get the following results:

Cão

ends up as

C~
ao

Copying and pasting the text, or exporting it as text from Adobe Reader, gives the same result. With other files, the same three methods (PDFBox, copy and paste, Adobe Reader export) extract the text as expected ("Cão"), so, not being a PDF expert, I figure it has to do with the way the files were created. Has anyone seen this behavior, and how can I work around it when extracting the text?

What are you using to extract the text? This question is very incomplete. — Jean-Bernard Pellerin, Dec 06 '13 at 00:13
*why the text extraction is messed up for those specific documents.* - As @DourHighArch implied you probably use PDFBox incorrectly. If you expect us to check this, provide some code. Furthermore you mention it is a problem with certain documents only. Maybe these documents simply provide incorrect information about their content (cf. e.g. [this answer](http://stackoverflow.com/questions/20402741/how-to-get-text-extraction-from-pdf-to-work/20410126#20410126)). If you expect us to check this, provide the PDF in question. — mkl, Dec 06 '13 at 07:59
I'm using PDFBox text extraction command line tools [http://pdfbox.apache.org/commandline/#extractText] with no options. — Grasshopper, Dec 08 '13 at 11:45

score 0 · Accepted Answer · edited May 23 '17 at 10:32

So thanks to Stack Overflow I managed to find the post below:

How to get text extraction from PDF to work?

which gave me the information I was looking for. Apparently the PDFs are being generated without the information a text extractor needs to recognize the accented Latin characters.
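Since the problem lies in the files themselves, one possible workaround is to post-process the extracted text. The sketch below is only a heuristic and rests on an assumption taken from the single sample in the question: the accent comes out as a standalone spacing mark immediately before its base letter (optionally separated by one whitespace character), as in "C~ao". The function name `repair_accents` and the mark table are hypothetical, not part of PDFBox, and marks such as `^` or `~` can also appear legitimately in text, so the result should be checked against the original documents.

```python
import re
import unicodedata

# Spacing accent marks mapped to their Unicode combining equivalents.
# Assumption: the broken PDFs emit these marks before the base letter.
SPACING_TO_COMBINING = {
    "~": "\u0303",       # tilde
    "\u00b4": "\u0301",  # acute accent
    "`": "\u0300",       # grave accent
    "^": "\u0302",       # circumflex
    "\u00a8": "\u0308",  # diaeresis
}

# A mark, at most one whitespace character, then an ASCII letter.
_MARKS = "".join(re.escape(mark) for mark in SPACING_TO_COMBINING)
_PATTERN = re.compile("([" + _MARKS + r"])\s?([A-Za-z])")


def repair_accents(text):
    """Rejoin a detached accent mark with the letter that follows it."""

    def join(match):
        mark, letter = match.groups()
        composed = unicodedata.normalize(
            "NFC", letter + SPACING_TO_COMBINING[mark]
        )
        # Only rewrite when Unicode has a single precomposed character;
        # otherwise leave the original text untouched.
        return composed if len(composed) == 1 else match.group(0)

    return _PATTERN.sub(join, text)
```

With this sketch, `repair_accents("C~ao")` returns "Cão". Sequences with no precomposed form, such as "~q", are left as they are.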

answered Dec 08 '13 at 12:39

Grasshopper
