Extracting an embedded object from a pdf

Question

I embedded a byte array into a PDF file (using Java). Now I am trying to extract that same array. The array was embedded as a "MOVIE" file.

I couldn't find any clue on how to do that...

Any ideas?

Thanks!

EDIT

I used this code to embed the byte array:

public static void pack(byte[] file) throws IOException, DocumentException{

    Document document = new Document();
    PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(RESULT));
    writer.setPdfVersion(PdfWriter.PDF_VERSION_1_7);
    writer.addDeveloperExtension(PdfDeveloperExtension.ADOBE_1_7_EXTENSIONLEVEL3);

    document.open();
    RichMediaAnnotation richMedia = new RichMediaAnnotation(writer, new Rectangle(0,0,0,0));

    PdfFileSpecification fs
        = PdfFileSpecification.fileEmbedded(writer, null, "test.avi", file);
    PdfIndirectReference asset = richMedia.addAsset("test.avi", fs);
    RichMediaConfiguration configuration = new RichMediaConfiguration(PdfName.MOVIE);
    RichMediaInstance instance = new RichMediaInstance(PdfName.MOVIE);
    RichMediaParams flashVars = new RichMediaParams();
    instance.setAsset(asset);
    configuration.addInstance(instance);
    RichMediaActivation activation = new RichMediaActivation();
    richMedia.setActivation(activation);
    PdfAnnotation richMediaAnnotation = richMedia.createAnnotation();
    richMediaAnnotation.setFlags(PdfAnnotation.FLAGS_PRINT);
    writer.addAnnotation(richMediaAnnotation);
    document.close();
}

How was it embedded? As an annotation or as an attachment? If as an annotation: as a movie annotation or as a RichMedia annotation? If as an attachment: as an attachment annotation or as a document-level attachment? If the movie is embedded, it is inside the PDF as a stream object. It is fairly easy to get the bytes of a stream. The difficult part is *which stream to extract*. If you don't give us a clue, we can't give you one. Your question is not specific enough. — Bruno Lowagie, May 17 '15 at 11:47
As Bruno says, this question is lacking relevant details. See [how to ask good questions](http://stackoverflow.com/help/how-to-ask). — Anthony Geoghegan, May 17 '15 at 12:01
Sorry for the very vague question. I'll try to be more specific, which is kind of hard since I'm new at this :) I embedded a byte array as RichMedia. I have found out that I need to get one specific byte stream (there is only one, since I embedded only a byte array into an empty PDF). How can I get the bytes of the stream? I know that if I do that, I will be able to search manually for the stream I need and then "translate" it or something like that (I think it's called "Deflate"). Is it clearer now? :) I edited the post and added the code I used to embed the byte array. — Itai Soudry, May 17 '15 at 12:46

score 2 · Accepted Answer · edited May 23 '17 at 12:22

I have written a brute force method to extract all streams in a PDF and store them as a file without an extension:

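import com.itextpdf.text.exceptions.UnsupportedPdfException;
import com.itextpdf.text.pdf.PRStream;
import com.itextpdf.text.pdf.PdfObject;
import com.itextpdf.text.pdf.PdfReader;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class ExtractStreams {

    public static final String SRC = "resources/pdfs/image.pdf";
    // %s is replaced by the object number of each extracted stream
    public static final String DEST = "results/parse/stream%s";

    public static void main(String[] args) throws IOException {
        File file = new File(DEST);
        file.getParentFile().mkdirs();
        new ExtractStreams().parse(SRC, DEST);
    }

    public void parse(String src, String dest) throws IOException {
        PdfReader reader = new PdfReader(src);
        try {
            PdfObject obj;
            // Walk every entry of the cross-reference table and keep only the streams
            for (int i = 1; i <= reader.getXrefSize(); i++) {
                obj = reader.getPdfObject(i);
                if (obj != null && obj.isStream()) {
                    PRStream stream = (PRStream) obj;
                    byte[] b;
                    try {
                        // Decoded bytes, if iText supports the stream's filter
                        b = PdfReader.getStreamBytes(stream);
                    } catch (UnsupportedPdfException e) {
                        // Unsupported filter: fall back to the bytes as stored
                        b = PdfReader.getStreamBytesRaw(stream);
                    }
                    // try-with-resources closes the file even if the write fails
                    try (FileOutputStream fos = new FileOutputStream(String.format(dest, i))) {
                        fos.write(b);
                    }
                }
            }
        } finally {
            reader.close();
        }
    }
}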

Note that I get all PDF objects that are streams as a PRStream object. I also use two different methods:

When I use PdfReader.getStreamBytes(stream), iText looks at the filter. For instance, page content streams consist of PDF syntax that is compressed using /FlateDecode. By using PdfReader.getStreamBytes(stream), you get the uncompressed PDF syntax.
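Not all filters are supported in iText. Take for instance /DCTDecode, which is the filter used to store JPEGs inside a PDF. Why and how would you "decode" such a stream? You wouldn't, and that's when we use PdfReader.getStreamBytesRaw(stream), which returns the bytes exactly as they are stored. For your AVI bytes, check the /Filter entry of the embedded file stream: if it is /FlateDecode, PdfReader.getStreamBytes(stream) gives you back the original bytes; if there is no filter, both methods return the same data.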
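
To make the difference between decoded and raw bytes concrete, here is a minimal sketch of what decoding a /FlateDecode filter amounts to, using only java.util.zip and no iText. The class name FlateRoundTrip and its two helper methods are made up for illustration. This is not iText's internal implementation, just the same zlib round trip applied to a small piece of PDF syntax.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper class, not part of iText.
public class FlateRoundTrip {

    // Compress bytes in zlib format, the format a /FlateDecode stream carries.
    public static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress zlib data back into the original bytes.
    public static byte[] inflate(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    // No progress is possible: the data is truncated or needs a preset dictionary
                    throw new DataFormatException("Truncated or unsupported Flate data");
                }
                out.write(buffer, 0, n);
            }
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws DataFormatException {
        byte[] syntax = "BT /F1 12 Tf (Hello) Tj ET".getBytes(StandardCharsets.US_ASCII);
        byte[] stored = deflate(syntax);   // comparable to what getStreamBytesRaw would hand back
        byte[] decoded = inflate(stored);  // comparable to what getStreamBytes would hand back
        System.out.println(new String(decoded, StandardCharsets.US_ASCII));
    }
}
```

If you save the raw bytes of a Flate-compressed stream and the resulting file looks like garbage, running it through something like inflate() is the "translation" step you were asking about.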

This example already gives you the methods you'll certainly need to extract PDF streams. Now it's up to you to find the path to the stream you need. That calls for iText RUPS. With iText RUPS you can look at the internal structure of a PDF file. In your case, you need to find the annotations as is done in this question: All links of existing pdf change the action property to inherit zoom - iText library

You loop over the page dictionaries, then loop over the /Annots array of this dictionary (if it's present), but instead of checking for /Link annotations (which is what was asked in the question I refer to), you have to check for /RichMedia annotations and from there examine the assets until you find the stream that contains the AVI file. RUPS will show you how to dive into the annotation dictionary.
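
The loop described above could look roughly like this. It is a sketch, assuming iText 5 and the RichMedia layout that RichMediaAnnotation writes: a /RichMediaContent dictionary holding an /Assets name tree whose values are file specifications with an /EF dictionary. The class name ExtractRichMediaAssets and the method extractAssets are hypothetical, and the PdfName keys for the RichMedia entries are built by hand instead of relying on predefined constants. Check the structure in RUPS first, because a PDF produced by another tool may organize its assets differently.

```java
import com.itextpdf.text.pdf.PRStream;
import com.itextpdf.text.pdf.PdfArray;
import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfNameTree;
import com.itextpdf.text.pdf.PdfObject;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfStream;

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class; the names are illustrative only.
public class ExtractRichMediaAssets {

    // Keys assumed from the RichMedia annotation layout; verify them in RUPS.
    private static final PdfName RICH_MEDIA = new PdfName("RichMedia");
    private static final PdfName RICH_MEDIA_CONTENT = new PdfName("RichMediaContent");
    private static final PdfName ASSETS = new PdfName("Assets");

    // Returns asset name -> decoded bytes for every RichMedia asset found.
    public static Map<String, byte[]> extractAssets(PdfReader reader) throws IOException {
        Map<String, byte[]> result = new LinkedHashMap<String, byte[]>();
        for (int page = 1; page <= reader.getNumberOfPages(); page++) {
            PdfDictionary pageDict = reader.getPageN(page);
            PdfArray annots = pageDict.getAsArray(PdfName.ANNOTS);
            if (annots == null) {
                continue; // this page has no annotations
            }
            for (int i = 0; i < annots.size(); i++) {
                PdfDictionary annot = annots.getAsDict(i);
                if (annot == null || !RICH_MEDIA.equals(annot.getAsName(PdfName.SUBTYPE))) {
                    continue; // not a /RichMedia annotation
                }
                PdfDictionary content = annot.getAsDict(RICH_MEDIA_CONTENT);
                PdfDictionary assets = content == null ? null : content.getAsDict(ASSETS);
                if (assets == null) {
                    continue;
                }
                // The assets are stored as a name tree: name -> file specification
                for (Map.Entry<String, PdfObject> entry : PdfNameTree.readTree(assets).entrySet()) {
                    PdfObject spec = PdfReader.getPdfObject(entry.getValue());
                    if (spec == null || !spec.isDictionary()) {
                        continue;
                    }
                    PdfDictionary ef = ((PdfDictionary) spec).getAsDict(PdfName.EF);
                    PdfStream stream = ef == null ? null : ef.getAsStream(PdfName.F);
                    if (stream instanceof PRStream) {
                        // getStreamBytes applies the stream's filters (e.g. /FlateDecode)
                        result.put(entry.getKey(), PdfReader.getStreamBytes((PRStream) stream));
                    }
                }
            }
        }
        return result;
    }
}
```

With the PDF from the question, calling ExtractRichMediaAssets.extractAssets(new PdfReader(RESULT)) should give a map with a single entry keyed by the asset name used in addAsset, which you can then write to disk with a FileOutputStream.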

Thank you so much Bruno! I had to delete the "new ExtractStreams()" part and just call parse() to make it work. It was exactly what I needed! — Itai Soudry, May 20 '15 at 07:33
