Getting spaces between letters while extracting pdf

Question

I am getting spaces between letters (s p a c e s b e t w e e n l e t t e r s) when extracting text from a PDF with PDFBox. This seems to be caused by the Courier New font, as extraction works fine with other fonts. The application runs on AWS Lambda. I also see the error "Could not write to font cache java.io.FileNotFoundException: /home/user/.pdfbox.cache" in the logs, but only for this particular PDF.

I have tried replacing the fonts in the PDDocument with Arial:

PDFont font = PDTrueTypeFont.loadTTF(_PDdoc, new File("C:\\Windows\\FONTS\\arial.ttf"));
for (int i = 0; i < _PDdoc.getNumberOfPages(); ++i) {
    PDPage page1 = _PDdoc.getPage(i);
    PDResources res = page1.getResources();
    for (COSName fontName : res.getFontNames()) {
        res.put(fontName, font);
    }
}

But this does not work as expected. On my local machine there is no cache issue. Any leads would be appreciated.
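
The cache error can be looked at separately from the spacing problem. As far as I know, PDFBox 2.x reads a system property named `pdfbox.fontcache` to decide where `.pdfbox.cache` is written, and on AWS Lambda `/tmp` is normally the only writable location. The following is a minimal sketch under those assumptions; the class and method names are made up for illustration, and it addresses only the log warning. Whether it has any effect on the extra spaces is not established.

```java
import java.io.File;

class FontCacheSetup {
    // Points the font cache location at a writable directory.
    // Assumption: PDFBox 2.x honours the "pdfbox.fontcache" system property.
    // Call this before the first PDFBox call that loads fonts.
    static String useWritableFontCache(String dir) {
        File target = new File(dir);
        if (!target.isDirectory() || !target.canWrite()) {
            // Fall back to the JVM temp directory if the requested one is unusable.
            target = new File(System.getProperty("java.io.tmpdir"));
        }
        String path = target.getAbsolutePath();
        System.setProperty("pdfbox.fontcache", path);
        return path;
    }

    public static void main(String[] args) {
        // "/tmp" is an assumption about the Lambda environment.
        System.out.println("Font cache directory: " + useWritableFontCache("/tmp"));
    }
}
```

Setting the property in code rather than through a JVM flag keeps the change inside the handler, which may be easier to deploy on Lambda.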

I also tried implementing the solution from "Apache PDFBox Remove Spaces between characters":

String extractNoSpaces(PDDocument document, String regionName, PDPage page) throws IOException
{
    PDFTextStripperByArea pts = new PDFTextStripperByArea() {
        @Override
        protected void processTextPosition(TextPosition text)
        {
            int[] character = text.getCharacterCodes();
            //check for space
        }
    };
    pts = _PDFTextStripperByAreaMap.get(regionName);
    pts.setSortByPosition(true);
    pts.extractRegions(page);
    return pts.getTextForRegion(regionName);
}

The docs say little about getCharacterCodes(), and the overridden method above is not executed either.

Your code does not extract any text. Did you create the PDF yourself? Can you install the standard 14 fonts (Times, Courier, Helvetica/Arial, Symbol, Zapf Dingbats)? What PDFBox version are you using? — Tilman Hausherr, May 24 '19 at 17:37
Sometimes PDFs contain a space and a letter character in the same location; PDFBox usually then will extract both. Thus, please not only share your pivotal code (as implied by @Tilman's comment) but also the PDF. — mkl, May 24 '19 at 18:22
@mkl I followed your answer at https://stackoverflow.com/questions/29554400/apache-pdfbox-remove-spaces-between-characters and this is exactly the issue I am facing. But I am having trouble implementing your solution, as the method getCharacter() no longer exists. — user108, May 31 '19 at 11:21
@user108 that answer implements a solution for pdfbox 1.8.x. I assume you now use a 2.x.x version and there have been some changes between versions. — mkl, May 31 '19 at 13:33
@mkl Yes. It is just that my overridden method is not executed; the processTextPosition() method of PDFTextStripperByArea is called instead. — user108, May 31 '19 at 13:38
@user108 meanwhile that method appears to be called `getUnicode()`. — mkl, May 31 '19 at 13:38
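
Two things from the thread can be combined into a sketch. First, the code in the question creates an anonymous PDFTextStripperByArea subclass and then immediately overwrites `pts` with the instance from `_PDFTextStripperByAreaMap`, so the override is discarded; that would explain why it never runs. Second, in PDFBox 2.x the glyph text is available through `getUnicode()`. The idea of the linked answer, as I understand it, is to forward only non-whitespace glyphs to `super.processTextPosition(text)` and let PDFBox infer word gaps from the glyph positions. The class below is a self-contained stand-in for that filter, without a PDFBox dependency; `WhitespaceGlyphFilter`, `keepGlyph` and `filter` are hypothetical names, and the strings play the role of `text.getUnicode()` values.

```java
import java.util.Arrays;
import java.util.List;

class WhitespaceGlyphFilter {
    // Decides whether a glyph should be passed on to the text stripper.
    // In an overridden processTextPosition(TextPosition text) this would be
    // called with text.getUnicode(), forwarding to super only when it is true.
    static boolean keepGlyph(String unicode) {
        return unicode != null && !unicode.trim().isEmpty();
    }

    // Simulation only: concatenates the glyphs that survive the filter.
    // Real PDFBox would re-insert word spaces based on glyph positions.
    static String filter(List<String> glyphTexts) {
        StringBuilder out = new StringBuilder();
        for (String g : glyphTexts) {
            if (keepGlyph(g)) {
                out.append(g);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> glyphs = Arrays.asList("s", " ", "p", " ", "a", " ", "c", " ", "e");
        System.out.println(filter(glyphs));
    }
}
```

For the override to take effect, the subclass instance itself has to be the one that receives `addRegion(...)`, `setSortByPosition(true)` and `extractRegions(page)`, rather than a stripper fetched from the map afterwards.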
