I am using Java 8 and POI ooxml v5.0.0 to extract text from PDF files generated by my state's government. I don't know for sure, but from what I know of the state's IT systems I believe the PDFs are produced by COBOL programs running on an IBM mainframe or mainframe clone, in case that matters. EDIT: a helpful commenter points out that the classes I'm using are actually PDFBox classes; I am not sure what the difference is, or whether other classes might produce a different result.
The PDFs contain pages of text in a fixed-width font; the output is arranged in columns to make it (somewhat) easier to read. I extract it with the following code:
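// Imports assume PDFBox 2.x package names.
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public String extractText(File pdfFile) throws IOException
{
    // try-with-resources closes the document even if extraction throws.
    try (PDDocument document = PDDocument.load(pdfFile))
    {
        PDFTextStripper stripper = new PDFTextStripper();
        // Without this sorting, text is not organized into lines as they
        // appear on the PDF page.
        stripper.setSortByPosition(true);
        return stripper.getText(document);
    }
}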
My code then processes the text line by line. The extracted text no longer preserves the columns; it mostly has single spaces where one would expect multiple spaces, but that's OK.
What isn't OK is that I occasionally get 'extra' characters sprinkled through the text. So far I've only seen spaces and asterisks, and so far only between words. I am baffled as to what could be causing this; I've inspected the PDF in Acrobat Reader and cannot see anything that these characters could represent. It is hard to imagine that the COBOL (or whatever generates the file) would emit these extra characters sporadically: there are lines and lines in the same format, and only a very few have the extra characters. It doesn't appear to be random garbage either, since I haven't seen any symbols indicating unprintable characters.
It does seem to happen in the same place in the text on repeated reads. I guess that means there is something in the PDF that causes the text extraction to do this, but it's hard to figure out what.
Here's a snippet from the PDF:
And here's the Eclipse debug variables window, showing a space between CONV and ":", including the hex output that verifies that the character there (array index 22) is a space (\u0020):
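For anyone who wants to reproduce the check without the debugger, the inspection amounts to something like the following minimal sketch. The class name `CharDump` and the sample string are made up for illustration; the sample is not copied from the actual PDF.

```java
final class CharDump {
    /** Returns one line per character: index, UTF-16 code unit in hex, and the character. */
    static String dump(String line) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            out.append(String.format("%2d  U+%04X  '%c'%n", i, (int) c, c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample line with a stray space before the colon.
        System.out.print(dump("CONV :"));
    }
}
```

Running this over a suspect line makes it easy to tell an ordinary space (U+0020) from look-alikes such as a non-breaking space (U+00A0).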
There are scores of CONV lines in this PDF file, and this is the only one that is interpreted as having a space between this CONV and this colon.
I suppose I could mention that I create a StringReader so I can process the extracted text line by line, but I don't think that can be the problem as it doesn't seem to generate extra spaces for other lines.
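In case it matters, the line-by-line step looks roughly like this sketch. The `LineSplitter` class and `readLines` method are illustrative names, and wrapping the StringReader in a BufferedReader is an assumption about how such a loop is usually written, not a copy of my actual code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class LineSplitter {
    /** Splits extracted text into lines; readLine() strips the line terminators. */
    static List<String> readLines(String text) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            // Not expected for an in-memory reader; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```

Nothing in this step inserts characters: each line comes back exactly as it appears in the extracted string, minus its line terminator.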
So: does anyone have any ideas about where these extra characters might come from, so that I might have some idea how to identify them? Is it possible that they have some characteristic in their PDF form such that the text extraction can be set up to skip them, or at least flag them?