I have a set of PDF files that contain Central European characters such as č, Ď, Š, and so on. I want to convert them to text, and I have tried pdftotext and PDFBox through Apache Tika, but some of the characters are always converted incorrectly.

The strange thing is that the same character in the same text is converted correctly in some places and incorrectly in others! An example is this pdf.

In the case of pdftotext I am using these options:

pdftotext -nopgbrk -eol dos -enc UTF-8 070612.pdf

My Tika code looks like this:

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.io.OutputStreamWriter;
    import java.nio.charset.Charset;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.pdf.PDFParser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;

    // ...

    String newname = f.getCanonicalPath().replace(".pdf", ".txt");
    OutputStreamWriter print = new OutputStreamWriter(new FileOutputStream(newname), Charset.forName("UTF-16"));
    String fileString = "path\\to\\myfiles\\"; // backslashes in Java string literals must be escaped
    InputStream is = null;
    try {
        is = new FileInputStream(f);

        // Buffer the extracted text in memory, capped at 10M characters
        ContentHandler contenthandler = new BodyContentHandler(10 * 1024 * 1024);
        Metadata metadata = new Metadata();
        PDFParser pdfparser = new PDFParser();

        pdfparser.parse(is, contenthandler, metadata, new ParseContext());
        String outputString = contenthandler.toString();

        // Convert Unix line endings to DOS line endings
        outputString = outputString.replace("\n", "\r\n");
        System.err.println("Writing now file " + newname);
        print.write(outputString);

    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (is != null) is.close();
        print.close();
    }

Edit: I forgot to mention that I am facing the same issue when converting to text with Acrobat Reader XI as well.

– Yannis P.

1 Answer

Well, aside from anything else, this code will use the platform default encoding:

PrintWriter print = new PrintWriter(newname);
print.print(outputString);
print.close();

I suggest you use an OutputStreamWriter wrapping a FileOutputStream instead, and specify UTF-8 as the encoding (it can encode all of Unicode and is generally well supported).
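
As a minimal sketch (reusing `newname` and `outputString` from the question; `StandardCharsets` comes from `java.nio.charset`):

    // Explicit UTF-8 instead of the platform default encoding
    Writer print = new OutputStreamWriter(new FileOutputStream(newname),
                                          StandardCharsets.UTF_8);
    print.write(outputString);
    print.close();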

You should also close the writer in a finally block, and I'd probably separate the "reading" part from the "writing" part. (I'd avoid catching Exception too, but going into the details of exception handling is a bit beyond the point of this answer.)
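
As a sketch of that shape (the `extractText` helper here is hypothetical; it would hold the Tika parsing from the question):

    // "Reading" part, kept separate from the writing
    String text = extractText(f);

    // "Writing" part: try-with-resources (Java 7+) closes the writer for you,
    // which is equivalent to closing it in a finally block
    try (Writer out = new OutputStreamWriter(new FileOutputStream(newname),
                                             StandardCharsets.UTF_8)) {
        out.write(text);
    }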

– Jon Skeet
  • Hey Jon, thanks for the answer. The truth is that I hid some code that did the UTF-8 conversion at the string level, i.e. before writing to the PrintWriter, but you are right that OutputStreamWriter is the better solution. Nevertheless, I'm still having problems. Perhaps this is due to some issue with the PDF's encoding, but I'm not a PDF expert. – Yannis P. Jun 24 '13 at 10:09
  • @YannisP.: You shouldn't be doing anything before using a writer - if you do, you're almost certainly doing something wrong, because plain `String` doesn't have any encoding (or it's always UTF-16, depending on your POV). If you're currently munging the strings using `getBytes` and a `String` constructor, stop doing that straight away (see the sketch after these comments). – Jon Skeet Jun 24 '13 at 10:12
  • Good, so I changed to `OutputStreamWriter print = new OutputStreamWriter (new FileOutputStream(newname), Charset.forName("UTF-16"));` but the problems still remain. I suspect this is a general issue with the PDF format because, among others, Adobe Reader's converter behaves the same. – Yannis P. Jun 24 '13 at 10:33
  • @YannisP.: Or perhaps it's an issue with the PDF you're reading? – Jon Skeet Jun 24 '13 at 10:35
  • Possibly; I have seen this with several PDFs from the same source. What drives me crazy is that some characters convert correctly while others convert to e.g. '}'. – Yannis P. Jun 24 '13 at 10:40
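
To illustrate the anti-pattern mentioned in the comments (a hypothetical example, not code from the question; `original` stands for any extracted string):

    // DON'T: a String has no encoding to "fix"; round-tripping it through
    // bytes is a no-op at best and corrupts characters when the charsets differ
    String munged = new String(original.getBytes(StandardCharsets.ISO_8859_1),
                               StandardCharsets.UTF_8);

    // DO: choose the encoding exactly once, at the writer
    Writer out = new OutputStreamWriter(new FileOutputStream(newname),
                                        StandardCharsets.UTF_8);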