
I am using Java PDFBox to read text from a PDF. It works fine for PDFs in English, but I want to read data from a PDF in a language other than English. The language in the PDF is Hindi (from India). The data I get in this case looks like encoded strings. How can I get this data in the original language (Hindi)?

import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDF2DataExample
{
    public static void main(final String[] args) throws IOException {
        // Expect the source PDF path and the destination text file path.
        if (args.length < 2) {
            System.err.println("Usage: PDF2DataExample <source.pdf> <dest.txt>");
            return;
        }
        final String src = args[0];
        final String dest = args[1];

        // Create the output directory if the destination path has one.
        final File parent = new File(dest).getAbsoluteFile().getParentFile();
        if (parent != null) {
            parent.mkdirs();
        }

        // try-with-resources closes both the document and the writer.
        try (PDDocument document = PDDocument.load(new File(src));
             PrintWriter writer = new PrintWriter(dest, "UTF-8")) {
            final PDFTextStripper stripper = new PDFTextStripper();
            final String text = stripper.getText(document);
            writer.println("Text:" + text);
        }
    }
}

I get output like this:

PkvTkv bUk#kmrTkv ¢Tkn^kkR QkkZk Pkkv H Uk|Ak#kTk bkgUkoOkrUkOkv bkYkkHTkv \kkXkRkZkA Tkm^kMv ¢vYk
hrishi
  • https://pdfbox.apache.org/2.0/faq.html#how-come-i-am-not-getting-any-text-from-the-pdf-document%3F – Tilman Hausherr Nov 10 '20 at 18:42
  • Especially pdfs with Hindi text are known to contain incomplete or explicitly wrong information for text extraction. – mkl Nov 10 '20 at 22:33
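
Following the comments above, one way to confirm that the extracted text never reached Unicode Hindi is to measure how much of it falls in the Devanagari script. The sketch below is only a diagnostic under that assumption, not a fix: the class name DevanagariCheck and the method devanagariRatio are hypothetical helpers, not part of PDFBox, and they rely solely on the standard java.lang.Character API.

```java
/**
 * Hypothetical diagnostic helper (not part of PDFBox): estimates how much
 * of an extracted string is actually Devanagari text.
 */
final class DevanagariCheck {
    private DevanagariCheck() {}

    /**
     * Returns the fraction of counted code points that belong to the
     * Devanagari script. Counted code points are all Devanagari-script
     * code points (including vowel signs and the virama) plus all letters
     * of any other script. Returns 0.0 when nothing is counted.
     */
    static double devanagariRatio(String text) {
        int counted = 0;
        int devanagari = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.DEVANAGARI) {
                devanagari++;
                counted++;
            } else if (Character.isLetter(cp)) {
                counted++;
            }
        }
        return counted == 0 ? 0.0 : (double) devanagari / counted;
    }

    public static void main(String[] args) {
        // Sample taken from the output shown in the question.
        String extracted = "PkvTkv bUk#kmrTkv";
        System.out.println("Devanagari ratio: " + devanagariRatio(extracted));
    }
}
```

For the sample from the question the ratio is 0.0, since every letter is Latin. That is consistent with a PDF whose font maps Latin character codes to Hindi glyphs, in which case the stripper can only return those codes and the choice of output encoding ("UTF-8") does not help.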

0 Answers