I'm trying to extract English text from a PDF and print it to the console. Extraction is done with the iText (itextpdf) API, using the PdfTextExtractor class. The text I get back is unintelligible; I may be running into a language or encoding issue. My goal is to find a particular string within a PDF and replace it with another string, and I started by parsing the file to find that string. The following snippet is my extractor:

// Requires iText 5 (com.itextpdf) on the classpath
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

// 'input' is the path of the source PDF; no Document or PdfWriter is needed just to read text
PdfReader reader = new PdfReader(input);
int n = reader.getNumberOfPages();
// Go through all pages (page numbers are 1-based)
for (int i = 1; i <= n; i++) {
    // No strategy is passed, so the library's default extraction strategy is used
    String str = PdfTextExtractor.getTextFromPage(reader, i);
    System.out.println(str);
}
reader.close();

However, the output on the console is unintelligible, even though the text in the PDF is in English.

Output:

t cotenn dna o mntoafinir yales r ni et h layhcsip Amgteu end y Retila m eysts w tih eth erss p wlli e erefcern emsyst o f et h se. ru I n tioi, dnda etseh orpvedi eddda e ulav o t taw h s i oelbssip hwti se vdcie ollaw na s tiouquibu cacess o t latoutenxc e rpap dna t ilagid ottennc olae n ewnh ey th krwo tofoi. nmirna ni soitaoli n mor f chea e. roth s iTh s i a cel ra csea ewerh " eth lweoh is ermo nath eth ms u fo sti

rtasp ".

Can anybody suggest how to get the extracted text to come out in readable English, exactly as it appears in the source PDF? Any help will be highly appreciated.

famousgarkin

1 Answer

If you want the text to be ordered based on its position on the page, you need to introduce a specific strategy, such as the LocationTextExtractionStrategy:

for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    // Order the extracted text by its position on the page
    String str = PdfTextExtractor.getTextFromPage(reader, i, new LocationTextExtractionStrategy());
    System.out.println(str);
}

The LocationTextExtractionStrategy sometimes results in odd sentences, more specifically if the letters 'dance' on the page (the baseline of the glyphs differs for text on the same line). In that case, you can try the SimpleTextExtractionStrategy which will return the text in the order in which it appears in the PDF syntax content stream.
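To make the difference between the two orderings concrete, here is a minimal plain-Java sketch. It is a simplified model, not iText's actual implementation: the `ChunkOrderSketch` class, its `Chunk` type, and the sample coordinates are all hypothetical, and baselines are compared exactly, whereas the real strategy may tolerate small differences. It shows position ordering repairing text that was drawn out of order, and the same ordering breaking a line whose baseline "dances".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Simplified model of text ordering; NOT iText's actual implementation.
class ChunkOrderSketch {

    // One piece of text plus the start of its baseline (hypothetical type).
    static final class Chunk {
        final String text;
        final double x;
        final double y;

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Content-stream order: concatenate chunks exactly as they were drawn.
    static String streamOrder(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            sb.append(c.text);
        }
        return sb.toString();
    }

    // Position order: larger y first (higher on the page in PDF coordinates),
    // then left to right within the same baseline.
    static String positionOrder(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.y)
                .thenComparingDouble(c -> c.x));
        return streamOrder(sorted);
    }

    public static void main(String[] args) {
        // Drawn out of order, but all on one level baseline.
        List<Chunk> shuffled = Arrays.asList(
                new Chunk("lo", 20, 700), new Chunk("Hel", 0, 700), new Chunk(" PDF", 40, 700));
        System.out.println(streamOrder(shuffled));   // prints "loHel PDF"
        System.out.println(positionOrder(shuffled)); // prints "Hello PDF"

        // Drawn in reading order, but the middle chunk sits one unit higher.
        List<Chunk> dancing = Arrays.asList(
                new Chunk("Hel", 0, 700), new Chunk("lo", 20, 701), new Chunk(" PDF", 40, 700));
        System.out.println(streamOrder(dancing));    // prints "Hello PDF"
        System.out.println(positionOrder(dancing));  // prints "loHel PDF"
    }
}
```

In this model neither ordering is right for every PDF, which is why trying both strategies on a problem file is a reasonable first step.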

Bruno Lowagie
  • I tried LocationTextExtractionStrategy and it didn't work, but then I tried SimpleTextExtractionStrategy and it worked perfectly. Thanks a lot for the quick response. You got me thinking in the right direction. – codechefvaibhavkashyap May 16 '14 at 06:34
  • That is odd... SimpleTextExtractionStrategy doesn't reorder the text. I'll update my answer. – Bruno Lowagie May 16 '14 at 06:36
  • I stumbled over this issue today, with a PDF for which the tokens of a text array (TJ operator) were reversed with LocationTextExtractionStrategy but not with SimpleTextExtractionStrategy. The underlying issue is that the font widths are all zero, so [a,1,b,1,c] TJ would result in cba. Each PdfString advances the text matrix by 0 and each PdfNumber moves the text matrix slightly to the left, which logically reverses the order of the text. Sadly, I don't know how to really fix this other than to roll my own hacked text extraction strategy. – tom_imk Jun 28 '16 at 14:08
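The zero-width effect described in the last comment can be reproduced with a small plain-Java model. This is only a sketch under stated assumptions, not iText code: the `ZeroWidthTjSketch` class and its `extractByPosition` method are hypothetical, every glyph is given the same width, and character spacing, horizontal scaling, and the text matrix are ignored. Strings advance the pen by their glyph widths, numbers move it left by thousandths of the font size, and the placed strings are then sorted by x, as a position-based strategy would do.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified model of a TJ array read back by position; NOT iText's implementation.
class ZeroWidthTjSketch {

    // A shown string and the x position where it was placed (hypothetical type).
    static final class Placed {
        final String text;
        final double x;

        Placed(String text, double x) {
            this.text = text;
            this.x = x;
        }
    }

    // tjArray mixes Strings (shown text) and Numbers (adjustments in thousandths
    // of text space). glyphWidth is one assumed width, in thousandths, for every glyph.
    static String extractByPosition(Object[] tjArray, double glyphWidth, double fontSize) {
        List<Placed> placed = new ArrayList<>();
        double x = 0.0;
        for (Object element : tjArray) {
            if (element instanceof String) {
                String s = (String) element;
                placed.add(new Placed(s, x));
                // Showing text advances the pen by the widths of its glyphs.
                x += s.length() * glyphWidth / 1000.0 * fontSize;
            } else if (element instanceof Number) {
                // A positive number moves the pen to the left.
                x -= ((Number) element).doubleValue() / 1000.0 * fontSize;
            }
        }
        // Position-based extraction: order the strings left to right.
        placed.sort(Comparator.comparingDouble((Placed p) -> p.x));
        StringBuilder sb = new StringBuilder();
        for (Placed p : placed) {
            sb.append(p.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Object[] tj = {"a", 1, "b", 1, "c"};
        System.out.println(extractByPosition(tj, 0.0, 10.0));   // zero widths: prints "cba"
        System.out.println(extractByPosition(tj, 500.0, 10.0)); // normal widths: prints "abc"
    }
}
```

With zero widths only the leftward adjustments move the pen, so each later string ends up to the left of the one before it; content-stream order is unaffected, which matches the observation that SimpleTextExtractionStrategy still returned readable text.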