
I'm building a tool to compress PDF files using PDFBox. Some of my images use the DCTDecode + FlateDecode filters, and I'd like to experiment with the JPXDecode filter to see whether they take up less space.

I've seen some code using iText, but how can I do it with PDFBox? I've found no documentation on how to do so.

Tilman Hausherr
david.perez
  • I tried but failed: 1) Adobe doesn't accept my file 2) PDFBox displays it but the colors are wrong 3) I looked at the generated JPEG2000 image file and it looks terrible, the colors are wrong, I suspect that there is a bug in the encoder. – Tilman Hausherr Sep 18 '19 at 12:05
  • Thanks, then I won't have any other choice than to use `iText` or some other solution. – david.perez Sep 19 '19 at 12:51
  • For what it is worth, from my experience, JPEG takes up fewer bytes for "small" images, while JPEG2000 provides better compression for "larger" images. – Ryan Sep 19 '19 at 18:05
  • My PDF files are normally scanned pages, and are large. – david.perez Sep 20 '19 at 07:52
  • Is there any sample on how to add the `FlateDecode` filter to an image? It seems to improve the compression. – david.perez Sep 25 '19 at 08:42

2 Answers


This code replaces the image stream without having to alter COSWriter (which sounds scary). However, in the PDF I tried the encoded image came out incorrect, which suggests a bug in the JPEG 2000 encoder, so check your result PDFs.

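import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Arrays;

import javax.imageio.ImageIO;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

public class SO57972743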
{
    public static void main(String[] args) throws IOException
    {
        // encoding needs a JPEG 2000 *writer* plugin, so list the writer formats
        System.out.println("supported writer formats: " + Arrays.toString(ImageIO.getWriterFormatNames()));

        try (PDDocument doc = PDDocument.load(new File("test.pdf")))
        {
            // get 1st level images only here (there may be more in form XObjects!)
            PDResources res = doc.getPage(0).getResources();
            for (COSName name : res.getXObjectNames())
            {
                PDXObject xObject = res.getXObject(name);
                if (xObject instanceof PDImageXObject)
                {
                    replaceImageWithJPX(xObject);
                }
            }
            doc.save("test-result.pdf");
        }
    }

    private static void replaceImageWithJPX(PDXObject xObject) throws IOException
    {
        PDImageXObject img = (PDImageXObject) xObject;
        BufferedImage bim = img.getOpaqueImage(); // the mask (if there) won't be touched
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        // write() returns false when no ImageIO writer is registered for "JPEG2000"
        boolean written = ImageIO.write(bim, "JPEG2000", baos);
        if (!written)
        {
            System.err.println("write failed");
            return;
        }
        // replace image stream
        try (OutputStream os = img.getCOSObject().createRawOutputStream())
        {
            os.write(baos.toByteArray());
        }
        img.getCOSObject().setItem(COSName.FILTER, COSName.JPX_DECODE); // replace filter
        img.getCOSObject().removeItem(COSName.COLORSPACE); // use the colorspace in the image itself
    }
}
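
Since the advice above is to check the result, here is a minimal sketch of one way to test an encoder before touching any PDF: encode an image, decode it again, and measure how far the colors drift. It uses only the standard `javax.imageio` API. The class name `EncoderRoundTrip` and both method names are hypothetical, and any threshold for "colors are wrong" is an assumption you would have to pick yourself.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

// Hypothetical helper, not part of PDFBox.
final class EncoderRoundTrip
{
    /** True if ImageIO has a writer registered for the given format name. */
    static boolean hasWriter(String formatName)
    {
        return ImageIO.getImageWritersByFormatName(formatName).hasNext();
    }

    /**
     * Encodes the image, decodes it again and returns the mean absolute
     * difference per color channel (0 = identical, 255 = worst case).
     */
    static double meanChannelError(BufferedImage src, String formatName)
    {
        try
        {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            if (!ImageIO.write(src, formatName, baos))
            {
                throw new IllegalStateException("no ImageIO writer for " + formatName);
            }
            BufferedImage back = ImageIO.read(new ByteArrayInputStream(baos.toByteArray()));
            if (back == null)
            {
                throw new IllegalStateException("no ImageIO reader for " + formatName);
            }
            if (back.getWidth() != src.getWidth() || back.getHeight() != src.getHeight())
            {
                throw new IllegalStateException("decoded image has a different size");
            }
            long sum = 0;
            for (int y = 0; y < src.getHeight(); y++)
            {
                for (int x = 0; x < src.getWidth(); x++)
                {
                    int a = src.getRGB(x, y);
                    int b = back.getRGB(x, y);
                    sum += Math.abs(((a >> 16) & 0xFF) - ((b >> 16) & 0xFF)); // red
                    sum += Math.abs(((a >> 8) & 0xFF) - ((b >> 8) & 0xFF));   // green
                    sum += Math.abs((a & 0xFF) - (b & 0xFF));                 // blue
                }
            }
            return sum / (3.0 * src.getWidth() * src.getHeight());
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `EncoderRoundTrip.meanChannelError(bim, "JPEG2000")` on the `BufferedImage` from `getOpaqueImage()` would give a rough number to compare across encoders. A lossy format never returns exactly 0, but a large value hints at the kind of color problem described in the comments.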
Tilman Hausherr
  • I don't reimplement the full `COSWriter`, only derive from it and override the method that handles streams. Your solution seems simpler than mine. – david.perez Oct 08 '19 at 10:50

With PDFBox it is possible to compress all images by using a custom COSWriter that handles all image streams and re-encodes them with the JPXDecode filter. PDFBox cannot encode JPEG 2000 itself, but the JAI library with a plugin can generate a JPEG 2000 image. The compression factor is configurable, and high compression ratios can be achieved without losing too much quality.
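
To illustrate what a configurable compression factor can look like, here is a minimal sketch using the standard `ImageWriteParam` API. The class `LossyEncoder` and its `encode` method are hypothetical names. It is an assumption that a given JPEG 2000 plugin honors `setCompressionQuality`; some plugins expose their own `ImageWriteParam` subclass with different knobs, so check the plugin's documentation. The built-in `"jpeg"` writer does honor it, which makes the sketch easy to try without extra libraries.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;

// Hypothetical helper, not part of PDFBox or JAI.
final class LossyEncoder
{
    /** Encodes the image with the given quality (0.0 = smallest, 1.0 = best). */
    static byte[] encode(BufferedImage image, String formatName, float quality)
    {
        Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName(formatName);
        if (!writers.hasNext())
        {
            throw new IllegalStateException("no ImageIO writer for " + formatName);
        }
        ImageWriter writer = writers.next();
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (ImageOutputStream ios = ImageIO.createImageOutputStream(baos))
        {
            writer.setOutput(ios);
            ImageWriteParam param = writer.getDefaultWriteParam();
            if (param.canWriteCompressed())
            {
                param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
                // some writers offer several compression types and need one selected first
                String[] types = param.getCompressionTypes();
                if (types != null && param.getCompressionType() == null)
                {
                    param.setCompressionType(types[0]);
                }
                param.setCompressionQuality(quality);
            }
            writer.write(null, new IIOImage(image, null, null), param);
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
        finally
        {
            writer.dispose();
        }
        return baos.toByteArray();
    }
}
```

The returned bytes are what would go into the image stream via `createRawOutputStream()`, as in the other answer. Trying a few quality values on a representative scanned page is probably the quickest way to find an acceptable size/quality trade-off.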

Applying the FlateDecode filter on top of that yields a little more compression with no quality loss.
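
FlateDecode is the zlib/deflate format, so the extra layer can be sketched with `java.util.zip` alone. The class `FlateLayer` and its methods are hypothetical names, and the PDFBox wiring is deliberately left out. In the PDF itself the stream's `/Filter` entry would then be an array such as `[/FlateDecode /JPXDecode]`, because a reader applies the filters in the listed order when decoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper showing only the lossless Flate layer.
final class FlateLayer
{
    /** Compresses already encoded image bytes with zlib/deflate. */
    static byte[] deflate(byte[] data)
    {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        try (DeflaterOutputStream out = new DeflaterOutputStream(baos, deflater))
        {
            out.write(data);
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
        finally
        {
            deflater.end(); // a caller-supplied Deflater is not released by close()
        }
        return baos.toByteArray();
    }

    /** Reverses deflate(); the result is byte-for-byte the original input. */
    static byte[] inflate(byte[] data)
    {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (InflaterInputStream in = new InflaterInputStream(new ByteArrayInputStream(data)))
        {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1)
            {
                baos.write(buffer, 0, n);
            }
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
        return baos.toByteArray();
    }
}
```

Because the image bytes are already compressed, the gain from this layer tends to be small, and it is worth comparing `deflate(bytes).length` with `bytes.length` per image and keeping whichever is shorter.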

david.perez