I have a PDF in a two-column format. I can parse it to plain text, but the PDF also has images interspersed with the text. As a result, the text output is jumbled for the pages that contain images.
For example, consider a page in two-column format:
Image Text2
Image Image
Image Text3
Text1 Image
Text4
The output is Text4 Text3 Text2 Text1 instead of Text1 Text2 Text3 Text4.
Is there a way to read the text in the proper order?
I am using the following code:
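    // Package names below assume iText 5.x.
    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
    import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
    import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.PrintWriter;

    public void parsePdf(String pdf, String txt) throws IOException {
        PdfReader reader = new PdfReader(pdf);
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        // try-with-resources flushes and closes the writer even if parsing fails
        try (PrintWriter out = new PrintWriter(new FileOutputStream(txt))) {
            for (int i = 76; i <= reader.getNumberOfPages(); i++) {
                TextExtractionStrategy strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
                out.println(strategy.getResultantText());
            }
        } finally {
            reader.close();
        }
    }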
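To make the order I expect concrete, here is a minimal sketch in plain Java with no iText calls. It assumes each text chunk could be obtained together with its page coordinates, which the code above does not do. The `Chunk` class, the `readingOrder` method, the column boundary at x = 300 and all coordinates are hypothetical and only for illustration. It also assumes that Text4 sits in the right column under the last image, which is one placement consistent with the order I expect.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class ColumnOrderSketch {

    /** Hypothetical holder for one extracted text chunk and its position. */
    static final class Chunk {
        final String text;
        final float x; // left edge of the chunk
        final float y; // baseline; PDF y grows upward, so larger means higher on the page

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Orders chunks left column first, then right column, each from top to bottom. */
    static String readingOrder(List<Chunk> chunks, float columnBoundaryX) {
        // Left column (0) sorts before right column (1).
        Comparator<Chunk> byColumn =
                Comparator.comparingInt(c -> c.x < columnBoundaryX ? 0 : 1);
        // Larger y first, i.e. top of the page before the bottom.
        Comparator<Chunk> topToBottom = (a, b) -> Float.compare(b.y, a.y);

        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(byColumn.thenComparing(topToBottom));
        return sorted.stream().map(c -> c.text).collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Chunks listed in the jumbled order I currently get; coordinates are made up.
        List<Chunk> chunks = Arrays.asList(
                new Chunk("Text4", 320f, 300f),  // right column, bottom (assumed placement)
                new Chunk("Text3", 320f, 500f),  // right column, middle
                new Chunk("Text2", 320f, 700f),  // right column, top
                new Chunk("Text1", 40f, 400f));  // left column, below the images

        System.out.println(readingOrder(chunks, 300f)); // prints: Text1 Text2 Text3 Text4
    }
}
```

This is the ordering I am after: the whole left column is read before the right column, and images are skipped because they contribute no text chunks.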