Unexpected extra characters in text extracted from PDF using POI's PDFBox

Question

I am using Java 8 and POI ooxml v5.0.0 to extract text from PDF files; the files in this case are generated by my state's government. I do not know for sure, but what familiarity I have with the state's IT system leads me to believe the PDF files are generated by COBOL programs running on an IBM mainframe or mainframe clone, if that makes any difference to anyone. EDIT: a helpful commenter points out that I'm actually using PDFBox classes; I am not sure what the difference is, or whether other classes might produce a different result.

The PDFs contain pages of text in a fixed-width font; the output is arranged in columns to make it (somewhat) easier to read. I extract it with the following code:

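  // Requires Apache PDFBox 2.x on the classpath, plus java.io.File and java.io.IOException:
  //   import org.apache.pdfbox.pdmodel.PDDocument;
  //   import org.apache.pdfbox.text.PDFTextStripper;
  public String extractText(File pdfFile) throws IOException
  {
    // try-with-resources closes the document even if extraction throws
    try (PDDocument document = PDDocument.load(pdfFile))
    {
      PDFTextStripper stripper = new PDFTextStripper();

      // without this sorting, text is not organized into lines as they appear on the
      // PDF page.
      stripper.setSortByPosition(true);

      return stripper.getText(document);
    }
  }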

My code then processes the text line by line. The extracted text no longer preserves the columns; it mostly has single spaces where one might have expected multiple spaces, but that's OK.

What isn't OK is that I occasionally get 'extra' characters sprinkled among the characters in the text. So far I've only seen spaces and asterisks, and so far only between words. I am baffled trying to figure out what could be causing this; I've inspected the PDF in Acrobat Reader and cannot see anything that these could be representing. It is hard to imagine that the COBOL or whatever would be generating these extra characters sporadically; there are lines and lines in the same format, and only very few have the extra characters. It doesn't appear to be just random stuff, since I haven't seen any symbols indicating unprintable characters.

It does seem to happen in the same place in the text on repeated reads. I guess that means there is something in the PDF that causes the text extraction to do this, but it's hard to figure out what.

Here's a snippet from the PDF:

And here's the Eclipse debug variables window, showing a space between CONV and ":", including the hex output that verifies that the character there is a space (\u0020) (array index 22):

There are scores of CONV lines in this PDF file, and this is the only one that is interpreted as having a space between this CONV and this colon.

I suppose I could mention that I create a StringReader so I can process the extracted text line by line, but I don't think that can be the problem as it doesn't seem to generate extra spaces for other lines.
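
For anyone wanting to reproduce the check from the debugger programmatically, here is a minimal sketch, not my production code. It reads the extracted text line by line through a StringReader, picks out lines containing a suspicious sequence, and prints each character's index and code. The class name `ExtractedTextInspector` and both method names are made up for illustration, and the sample strings in the usage note are invented, not taken from the confidential file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: inspects text that has already been extracted.
final class ExtractedTextInspector {

    /** Renders every char of a line as "index:U+XXXX", separated by single spaces. */
    static String dumpCodeUnits(String line) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(i).append(':').append(String.format("U+%04X", (int) line.charAt(i)));
        }
        return sb.toString();
    }

    /** Returns the lines of the extracted text that contain the given sequence, e.g. "CONV :". */
    static List<String> linesContaining(String extractedText, String needle) {
        List<String> hits = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(extractedText))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains(needle)) {
                    hits.add(line);
                }
            }
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new UncheckedIOException(e);
        }
        return hits;
    }
}
```

Calling `ExtractedTextInspector.linesContaining(text, "CONV :")` on the extracted text and passing each hit to `dumpCodeUnits` lists the position and code of every character, so a stray space or asterisk shows up as U+0020 or U+002A at a specific index.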

So: does anyone have any ideas about where these extra characters might come from, so that I might have some idea how to identify them? Is it possible they have some characteristic in their PDF form such that the text extraction can be set up to skip them, or at least identify them?

Have you tried to download the PDF and open with a regular PDF file viewer, like Microsoft Word or by a browser plugin? I am curious to see if you would experience some sort of issue with those as well. — hfontanez, Jan 12 '22 at 14:57
Are you sure you're using Apache POI? Looks a lot more like Apache PDFBox to me... — Gagravarr, Jan 12 '22 at 15:09
@hfontanez As I said, I viewed the snippet using Acrobat Reader, which is, to my mind, a much more 'regular' PDF file viewer than Word. PDF is Acrobat's format, I would expect their reader to be the gold standard on viewing the file. And it's a file, I'm not looking at it on a web page at all, the only internet involvement IS obtaining the file in the first place. The state has a web-based portal from which an authorized person can obtain the PDF file, and it is downloaded from there. Everything else is happening on my machine locally. — arcy, Jan 12 '22 at 15:10
@arcy You got my point even if you didn't realize it. Open it with a "non-golden standard" reader and see if you encounter any issues. — hfontanez, Jan 12 '22 at 15:13
Can you post the PDF file or at least a URL? Without seeing the actual PDF contents (not the screen display, what's in the actual file) it's hard to make any suggestions. The only suggestion I can make from this is that the space is misplaced. It looks like it should read 0x56 0x3a 0x20 0x20 0x28, maybe the text extraction code thinks that two spaces is bad and inserts one in front of the colon instead. — KenS, Jan 12 '22 at 15:53
Indeed, without the PDF in question this is pure guesswork. Please share it. That being said, though, there are multiple possible causes for that space. In particular there may be an actual space between 'V' and ':' overlapping both those characters. Or if your PDF is a scan with OCR applied, the text is extracted from invisible text created from the OCR results, not from the visible image; consequently there may be the weirdest differences... — mkl, Jan 12 '22 at 16:30
My guess is that the document was produced by an OCR tool, and that tool decided that there are two pieces of text which visually overlap: `"CONV "` (note the trailing space) and `": (634)FAIL TO APPEAR"`. — VGR, Jan 12 '22 at 16:49
@KenS If I could have posted the PDF, I would have done so -- it contains confidential information, and I have no way to edit it. — arcy, Jan 12 '22 at 17:14
@mkl I think the posted PDF snippet makes it clear there is no space between 'V' and ':', just as there aren't for the scores of 'CONV:' strings elsewhere on the page. I was hoping someone else had run into something similar, like extraneous characters from elsewhere on the page as printed in the PDF appearing in the extracted text stream, and some API call to set an option to avoid them, or something. — arcy, Jan 12 '22 at 17:14
@VGR I have a little experience with OCR, and this doesn't seem to me to be caused by that. There are scores of times that "CONV:" appears on the page, but only one that has the extra space determined. Also, there are none of the occasional wrong characters that are to be expected from any OCR scan. I suppose I might look at a hex dump of the PDF document, see if I can figure out anything from there, but I had hoped to avoid that. — arcy, Jan 12 '22 at 17:17
Almost certainly it will be compressed, so you'll have to locate the relevant stream and decompress it, or decompress the entire document. You may find it surprisingly difficult to locate the text even in a decompressed document. VGR's guess about the positioning sounds entirely plausible to me (no idea about the OCR). The screen display doesn't tell you anything about whether there are two pieces of individually positioned text, and if the ':' was on top of a space character you wouldn't be able to tell. Text in PDF files need not be contiguous. — KenS, Jan 12 '22 at 17:55
Well, I'm probably not going to be able to find and decompress that stuff in my lifetime, so much for that. I do think that the software that generated the text is SO unlikely to have put a space there in ONE case and not the 2 score OTHER instances of the 'CONV:' string on this page. — arcy, Jan 12 '22 at 18:39
@KJ I suppose I could look for one of those (a text extractor with layout), but I really don't think this is an OCR situation. There are just too few anomalies, such as the one I've already mentioned about the number of errors of this kind. But it could be. — arcy, Jan 12 '22 at 22:15
*"I think the posted PDF snippet makes it clear there is no space between 'V' and ':'"* - no it doesn't. There might e.g. be a space whose left half is overlapping the V and whose right half is overlapping the colon. *"just as there aren't for the scores of 'CONV:' strings elsewhere on the page"* - thus, there *is* something special about that single occurrence. Without the file we can only guess what that is. Some commenter here might be guessing correctly, but sharing the file would allow analyzing it and *knowing* the problem. Have you tried redacting the PDF to allow sharing it? — mkl, Jan 13 '22 at 05:58
One question, if you don't set `SortByPosition`, is the problem text extracted contiguously? — mkl, Jan 13 '22 at 06:10
You might want to look at [this answer](https://stackoverflow.com/a/31033508/1729265) and the work-around explained there - in that case there also were extra spaces and the work-around simply removed *all* spaces and let PDFBox re-construct spaces where there were gaps. — mkl, Jan 13 '22 at 09:03
@mkl If you have any suggestions on how I might change the bytes in the file to redact anything, I'd be interested to hear them. I cannot just obscure a visual rendition and show a picture; the things we're looking at don't appear on an Acrobat Reader visual. On further examination of the extracted text, I've now found asterisks in the middle of MM-DD-YY dates with similar characteristics -- they appear in the extracted text, but are not displayed by Acrobat Reader. So those aren't as easily hidden by overlap. And in the cases I'm looking at, Reader does not show an asterisk anywhere nearby. — arcy, Jan 13 '22 at 15:13
The current PDFBox version is 2.0.25. (This may not improve anything, but it's best to use the latest version) — Tilman Hausherr, Jan 13 '22 at 15:33
@mkl I tried the method from your other answer; it uses the method `String TextPosition.getCharacter()`, which doesn't exist now. I found `getUnicode()` as the only reasonable replacement and used that. Unfortunately, it doesn't help. I still get the spurious characters, including spaces and asterisks. — arcy, Jan 13 '22 at 16:42
Ok, that answer still was based on PDFBox 1.8.x, the method names changed a bit in the 2.0.x releases. Nonetheless, replacing `getCharacter()` by `getUnicode()` is correct, and if that does not remove your spurious spaces, I have no idea without the document. Concerning how to redact - Adobe Acrobat has a decent redaction tool. iText has a decent redaction module. Certainly other viewers and libraries also have acceptable redaction features, I merely have not tested them. — mkl, Jan 13 '22 at 19:06
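
The decompression step KenS describes can be approximated with the JDK alone. The following is a rough sketch under several assumptions: the streams use the common `/FlateDecode` filter (zlib), the file is not encrypted, and the byte sequence `endstream` does not happen to occur inside compressed data. It scans the raw bytes for `stream` ... `endstream` pairs rather than parsing the PDF structure, so it is a diagnostic hack, not a PDF parser. The class name `PdfStreamPeek` and its method names are invented for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

// Hypothetical diagnostic helper: brute-force scan, assumes unencrypted FlateDecode streams.
final class PdfStreamPeek {

    /** Finds the first occurrence of needle in data at or after from, or returns -1. */
    static int indexOf(byte[] data, byte[] needle, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (data[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    /** Inflates zlib data; returns null if the bytes are not valid zlib data. */
    static byte[] tryInflate(byte[] data, int offset, int length) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(data, offset, length);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated input: keep whatever was decoded so far
                }
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            return null; // not zlib data (e.g. an image or a different filter)
        } finally {
            inflater.end();
        }
    }

    /** Returns every non-empty stream body that inflates successfully, decoded as Latin-1. */
    static List<String> inflatedStreams(byte[] pdf) {
        byte[] start = "stream".getBytes(StandardCharsets.US_ASCII);
        byte[] end = "endstream".getBytes(StandardCharsets.US_ASCII);
        List<String> result = new ArrayList<>();
        int pos = 0;
        while ((pos = indexOf(pdf, start, pos)) >= 0) {
            int body = pos + start.length;
            // The "stream" keyword is followed by CRLF or LF before the data begins.
            if (body < pdf.length && pdf[body] == '\r') body++;
            if (body < pdf.length && pdf[body] == '\n') body++;
            int stop = indexOf(pdf, end, body);
            if (stop < 0) break;
            byte[] inflated = tryInflate(pdf, body, stop - body);
            if (inflated != null && inflated.length > 0) {
                result.add(new String(inflated, StandardCharsets.ISO_8859_1));
            }
            pos = stop + end.length; // skip past "endstream" so it is not matched as "stream"
        }
        return result;
    }
}
```

Usage would be along the lines of `PdfStreamPeek.inflatedStreams(Files.readAllBytes(Paths.get("report.pdf")))`, where `report.pdf` is a placeholder name. In the resulting text, page content shows up as operator sequences; text is drawn by operators such as `Tj` and `TJ`, so a separately positioned `( )` or `(*)` near the affected string would support the overlapping-text guesses above. If the font uses a custom encoding, the strings may not be readable as plain ASCII.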
