I am getting spaces between letters (s p a c e s b e t w e e n l e t t e r s) when extracting text from a PDF using PDFBox. This seems to be caused by the Courier New font, as extraction works fine with other fonts. The application is running on AWS Lambda. I can also see the error "Could not write to font cache java.io.FileNotFoundException: /home/user/.pdfbox.cache" in the logs, and only for this particular PDF.
I have tried to make Arial the default font by replacing every font in each page's resources of the PDDocument:
    PDFont font = PDTrueTypeFont.loadTTF(_PDdoc, new File("C:\\Windows\\FONTS\\arial.ttf"));
    for (int i = 0; i < _PDdoc.getNumberOfPages(); ++i) {
        PDPage page1 = _PDdoc.getPage(i);
        PDResources res = page1.getResources();
        for (COSName fontName : res.getFontNames()) {
            res.put(fontName, font);
        }
    }
But this is not working as expected. On my local machine there is no cache issue. Any leads would be appreciated.
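Regarding the cache error only: as far as I understand, PDFBox 2.x looks at the `pdfbox.fontcache` system property before falling back to the user's home directory, and on AWS Lambda `/tmp` is the writable location. Below is a minimal sketch of pointing the cache there before any PDFBox class is used. The class name `FontCacheSetup` and the method `configureFontCache` are made up for illustration, and I have not confirmed that this has any effect on the letter spacing.

```java
import java.io.File;

public class FontCacheSetup {

    /** Points the PDFBox font cache at a writable directory and returns that directory. */
    public static String configureFontCache() {
        // Assumption: java.io.tmpdir resolves to a writable location (/tmp on AWS Lambda).
        String dir = System.getProperty("java.io.tmpdir");
        File cacheDir = new File(dir);
        if (!cacheDir.isDirectory() || !cacheDir.canWrite()) {
            throw new IllegalStateException("Not a writable directory: " + dir);
        }
        // Assumption: PDFBox 2.x reads this property when it first builds its font cache,
        // so it has to be set before the first PDDocument is loaded.
        System.setProperty("pdfbox.fontcache", dir);
        return dir;
    }

    public static void main(String[] args) {
        System.out.println("pdfbox.fontcache = " + configureFontCache());
    }
}
```

In a Lambda handler this would presumably go in a static initializer so that it runs once per container, before the first text extraction.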
I also tried implementing the solution provided in "Apache PDFBox Remove Spaces between characters":
    String extractNoSpaces(PDDocument document, String regionName, PDPage page) throws IOException
    {
        // Stripper with an overridden hook that should see every TextPosition
        PDFTextStripperByArea pts = new PDFTextStripperByArea() {
            @Override
            protected void processTextPosition(TextPosition text)
            {
                int[] character = text.getCharacterCodes();
                // check for space
            }
        };
        // Note: this reassignment replaces the overriding instance created above
        pts = _PDFTextStripperByAreaMap.get(regionName);
        pts.setSortByPosition(true);
        pts.extractRegions(page);
        return pts.getTextForRegion(regionName);
    }
The docs do not say much about getCharacterCodes(), and the overridden method above is never executed either.