I'm using the PdfBox-Android library (https://github.com/TomRoush/PdfBox-Android) in Android Studio to extract text from a PDF document. Here's what I'm doing:

File pdf_file = new File(file_path);

to create the file, then

PDDocument document = null;
document = PDDocument.load(pdf_file);

to load the file into a PDDocument object, and then

PDFTextStripper pdfStripper = new PDFTextStripper();
pdfStripper.setStartPage(...);
pdfStripper.setEndPage(...);
String page_text = pdfStripper.getText(document);

to get the text content of the page. The issue is that a word such as "firm" comes out as "fi rm": a space is inserted after fi (and, I guess, after fl and other ligatures). I tried reading "Problems with extracting OpenTypeFont text using pdfBox", but I don't understand how to fix it; it has no solution details.

Important: As it turns out, my PDF file doesn't contain any ligatures such as ﬁ, only the regular letters fi, and yet there's a space after them. A solution is unclear.

PDF file: https://wetransfer.com/downloads/09e9036dda4a7962ccad32b1cbcd8edc20200506050349/ab4752

  • Please share the file. I wonder if it happens with PDFBox for desktop. – Tilman Hausherr May 06 '20 at 03:27
  • @TilmanHausherr Hello, I've updated my question with link to download the PDF – Sleb Lagnej May 06 '20 at 05:04
  • I had this problem once too and I solved it by searching for fi AND ﬁ (the ligature) – Lonzak May 06 '20 at 10:05
  • @Lonzak Hmm, how did you fix it exactly? You find the ligature fi and you remove the space after it? – JingleBells May 06 '20 at 10:42
  • @JingleBells See my posted answer... – Lonzak May 06 '20 at 11:23
  • One way would be to search for "fi " or "fl " and remove the space afterwards if there is one. – Lonzak May 08 '20 at 14:49
  • @Lonzak Good idea but I'm a bit worried about words ending with fi and fl or places where fi and fl should have space after them. What's interesting though is that I created my own PDF with fi and fl and they didn't have space after them, so I guess there's some issue with the PDF (the Harry Potter one, with the fi and fl issues) that causes the bugged spaces. – Sleb Lagnej May 08 '20 at 17:48
  • @Lonzak Is this the only currently available fix? – JingleBells May 09 '20 at 14:43
  • @Anovalium: What do you mean by _"in my PDF file, I don't have any ligatures such as fi"_? In the file you link to ("The boy who lived") there seem to be ligatures in "firm". When I try to select the word "firm" in my PDF reader, "fi" is treated as a single unit. – Lii May 11 '20 at 11:42
  • @Lii Hmm, when I run the PDF through the PDFBox Android Studio reader it displays it as "fi rm", not "ﬁ rm". Weird. On my PDF reader on the PC it shows it as a single unit fi. I suppose the PDFBox library is doing something. On the Android Studio I print it using Log.d("Debug", pdf_text); – JingleBells May 11 '20 at 13:07
  • @Lii I'm Anovalium btw – JingleBells May 11 '20 at 13:23
  • I guess the problem must be that the ligatures in the input PDF file are translated to corresponding-letters-PLUS-space. – Lii May 11 '20 at 13:26
  • @Lii Do you know a way to fix this? – JingleBells May 11 '20 at 13:33

2 Answers

5

The issue is that when there's for example the word "firm" it displays it like "fi rm".

The reason is simple: There is a space after the "fi"!

This is the text drawing instruction drawing the line with the first occurrence of "firm" in your sample file:

 [( )360.3(Mr Dursley was the director of a “)250( )110.3(rm called Grunnings, )]TJ

The byte 147 (displayed as “ in the instruction above) is mapped by the font encoding to the glyph name fi and by the font's ToUnicode map to the Unicode character U+FB01, the Latin small ligature fi.

Thus, PDF viewers display the ligature glyph and text extractors extract either the Unicode ligature character or after expansion the characters f and i.

After that ligature the start point for drawing the next glyph is moved left by 250 units, then a space is drawn, then the next start point is moved left by 110.3 units, and then "rm" is drawn.

Thus, you don't see a gap between "fi" and "rm" in viewers (because the moves left counteract the drawing of the space glyph) but text extractors extract a space character (because it's there).
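The arithmetic behind this can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the width of the space glyph (360.3 units) is a guess based on the `( )360.3` pair at the start of the same instruction, not a value read from the font, and font size, character spacing and word spacing are ignored. The class and method names are made up for this example.

```java
final class TjAdvanceSketch {

    /**
     * Net horizontal shift (in thousandths of a text space unit) caused by the
     * sequence "adjustment, glyph, adjustment" inside a TJ array.
     * Positive TJ numbers move the next start point to the left,
     * while a drawn glyph moves it to the right by its width.
     */
    static double netShift(double firstAdjustment, double glyphWidth, double secondAdjustment) {
        return glyphWidth - firstAdjustment - secondAdjustment;
    }

    public static void main(String[] args) {
        // ASSUMPTION: the space glyph is 360.3 units wide (guessed, see lead-in).
        double assumedSpaceWidth = 360.3;

        // ...(ligature)250( )110.3(rm called Grunnings, )
        double shift = netShift(250, assumedSpaceWidth, 110.3);

        System.out.printf("net shift between the ligature and \"rm\": %.1f units%n", shift);
    }
}
```

If the assumed width is right, the two moves to the left cancel the space exactly, which is why viewers show no gap while the space character is still part of the text.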

You can check that this is not a PDFBox quirk, e.g. Adobe Reader with copy&paste extracts that text line as

Mr Dursley was the director of a fi rm called Grunnings,

Just like PDFBox it expands the ligature and extracts the space character.

mkl
  • You may want to try dropping all spaces and letting PDFBox insert spaces where appropriate due to distance as demonstrated in [this answer](https://stackoverflow.com/a/31033508/1729265). – mkl May 11 '20 at 15:14
  • Thank you for the answer! So it turns out there actually is space between fi and rm and PDF viewers work with "coordinates" and ignore the space but when the PDFBox extracts the text, space is actually there. Is that issue from the PDF file itself or will it happen with other PDFs? The suggestion in the above comment is good but I'm not sure I want to remove all spaces and trust PDFBox to add where it thinks it's necessary. A solution I found is to simply look for fi and fl and remove the space afterwards. As far as I know, there aren't many English words that end with fi or fl. – JingleBells May 11 '20 at 16:36
  • Anyways, I really hope it's an issue with the PDF file itself and its coordinate system. – JingleBells May 11 '20 at 16:37
  • Strictly speaking "enabling easy and correct text extraction" **is not a requirement for PDF files**, so this is not an issue per se. But this complication indeed is entirely unnecessary (as far as the PDF standard is concerned); so if you have a contract with the producer of the PDF in which said producer promises to make text extraction possible with reasonable efforts, you can indeed ask him to change this. This actually looks like the producer handles the ligatures like diacritical marks applied to a space glyph. – mkl May 11 '20 at 16:47
  • By the way, you summarized *PDF viewers work with "coordinates" and ignore the space*... They don't *ignore* the space, they actually do *draw* it! But due to the numeric parameters in the instruction discussed in my answer they draw the empty (completely transparent) space glyph over the ligature, so you don't perceive it in the final rendering. – mkl May 11 '20 at 17:00
  • I don't know the producer of the PDF; my application revolves around the user being able to import any PDF book, which PDFBox then converts to text and so on. Thank you for the answer and explanation! One last thing, so this fi and fl ligature issue is from the PDF file itself, meaning if I go and download some other book's PDF, the issue might not be present? I will still remove the space character after fi and fl just in case because, as I mentioned, I don't think many words end in fi or fl – JingleBells May 11 '20 at 18:27
  • This construct of needlessly adding a space overlapping a previous character is totally unnecessary, so your next PDF probably won't have it. If it's from the same source, though, it probably does have it. – mkl May 12 '20 at 04:28
2

As mentioned in the comment, I once had a similar problem with ligatures. I had to check PDF files for certain strings and was wondering why it didn't work for some. After analysis I found that those files contained ligatures, and thus I could not find "Textfield" even though the page visually contained it. My solution was to search not only for "textfield" but also for "textﬁeld", i.e. for two strings, one without and one with the ligature.

You said you want to extract text from pdf files. So I would add a post processing step.

  1. Extract the text like you do now
  2. Search for all ligatures, both with a trailing space (e.g. "ﬁ ") and without (e.g. "ﬁ"), and replace each with its plain letters (e.g. "fi").

I had documents with no space following a ligature - so I would consider both cases. And cases of word endings (e.g. buffi) should also be considered (might be two spaces then?).
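A minimal sketch of such a post-processing step could look like the following. It assumes the extracted text still contains the ligature characters (U+FB00 to U+FB04); if your PDFBox version has already expanded them, this will find nothing. The class name, the method name and the `dropFollowingSpace` flag are hypothetical, introduced only for this example.

```java
import java.util.HashMap;
import java.util.Map;

final class LigaturePostProcessor {

    // Latin f-ligatures and the plain letters they stand for.
    private static final Map<Character, String> EXPANSIONS = new HashMap<>();
    static {
        EXPANSIONS.put('\uFB00', "ff");
        EXPANSIONS.put('\uFB01', "fi");
        EXPANSIONS.put('\uFB02', "fl");
        EXPANSIONS.put('\uFB03', "ffi");
        EXPANSIONS.put('\uFB04', "ffl");
    }

    /**
     * Replaces each ligature with its letters. If dropFollowingSpace is true,
     * exactly one space directly after a ligature is removed as well.
     */
    static String expand(String text, boolean dropFollowingSpace) {
        StringBuilder out = new StringBuilder(text.length() + 8);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String expansion = EXPANSIONS.get(c);
            if (expansion == null) {
                out.append(c);
                continue;
            }
            out.append(expansion);
            boolean spaceFollows = i + 1 < text.length() && text.charAt(i + 1) == ' ';
            if (dropFollowingSpace && spaceFollows) {
                i++; // skip the single space after the ligature
            }
        }
        return out.toString();
    }
}
```

Calling `LigaturePostProcessor.expand(page_text, true)` only makes sense for PDFs where every ligature is followed by a superfluous space. For other documents, pass `false`, otherwise a word ending in a ligature would be glued to the next word.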

A general note: as you have already found out, the topic is not easy. Expanding ligatures into their letters is part of NFKC normalization. PDFBox 2.x does this internally (cf. PDFBOX-2384), whereas in PDFBox 1.x TextNormalize.java was doing it.
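To illustrate what NFKC normalization does to a ligature, here is a small sketch using only `java.text.Normalizer` from the JDK. It is not PDFBox's code; restricting the normalization to U+FB00..U+FB06 is a choice made for this example, so that other characters (such as the micro sign, which NFKC would turn into a Greek mu) stay untouched. The class and method names are hypothetical.

```java
import java.text.Normalizer;

final class NfkcLigatureSketch {

    /** Applies NFKC only to the Latin ligature block U+FB00..U+FB06. */
    static String normalizeLigatures(String word) {
        StringBuilder out = new StringBuilder(word.length() + 4);
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (c >= '\uFB00' && c <= '\uFB06') {
                // NFKC replaces the ligature with its compatibility decomposition
                out.append(Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFKC));
            } else {
                out.append(c); // everything else is copied unchanged
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalizeLigatures("Text\uFB01eld"));
    }
}
```

Note that this only expands the ligature itself; a space that the PDF draws after it is a separate character and is not affected by normalization.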

Update:

One other possibility you could try is to change PDFTextStripper.java. There is a method called normalizeWord(...). It converts the single "ﬁ" ligature to "f" and "i". There you could add

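//line 1971...
//for PDFs where ligatures are followed by a space (e.g. "fi ve")
//the bounds check avoids an exception when the ligature is the last character of the word
if (q + 1 < word.length() && word.charAt(q + 1) == ' ') {
  p = q + 2; //skip the ligature and the space after it
}
else {
  p = q + 1; //skip only the ligature
}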

But I tried it only with pdfbox 2.0.19 (and it seems you are using 1.8.X). The good thing is it is only applied when a ligature was found. However it seems not to be a general solution due to problems with words which end with a ligature. But in your case you should be fine since there consistently seems to be a space after each ligature.

Lonzak