Removing PDF invisible objects with iTextSharp

Question

Is it possible to use iTextSharp to remove objects from a PDF document that are not visible (or at least are not being displayed)?

More details:

1) My source is a PDF page containing images and text (maybe some vector drawings) and embedded fonts.

2) There's an interface to design multiple 'crop boxes'.

3) I must generate a new PDF that contains only what is inside the crop boxes. Anything else must be removed from the resulting document. (I may accept content that is half inside and half outside a crop box, but this is not ideal, and the outside part should not be displayed anyway.)

My solution so far:

I have successfully developed a solution that creates new temporary documents, each one containing the content of one crop box (using writer.GetImportedPage and contentByte.AddTemplate on a page that is exactly the size of the crop box). Then I create the final document and repeat the process, using the AddTemplate method to position each "cropped page" on the final page.

This solution has two big disadvantages:

1) The size of the document is [original size] * [number of crop boxes], since the entire page is there, stamped many times (invisible, but still there).
2) The invisible text can still be selected with select all (CTRL+A) in Reader and pasted elsewhere.

So I think I need to iterate through the PDF objects, detect whether each one is visible, and delete those that are not. At the time of writing, I am trying to use pdfReader.GetPdfObject.

Thanks for the help.

As iText provides a low level API which allows you to manipulate nearly everything in a document, **it is possible**. That is **not** to say that it is **easy**, though, as you will have to write the code yourself to identify for each element in the page content whether or not it is visible, and you will have to glue together the remaining parts of the content yourself, too. You can reduce the resulting document size in your current solution, though, if you reuse an imported page template if multiple sections of it are to be made visible. Interesting work for many weeks... — mkl, Mar 06 '13 at 11:24
Try using the `PdfStamper` class for cropping: http://itextpdf.com/examples/iia.php?id=231 — Markus Palme, Mar 31 '13 at 21:33
I'm not 100 percent sure about this as far as iTextSharp is concerned, but iPdfSharp has the ability to render from forms. The idea is that you open the page you are cropping inside a form and then render only the parts you need into a new document. You will not be making multiple copies, and the rendered (cropped) parts will be images. Check whether this is an option in the iText API. — Alex, May 28 '13 at 08:31
Due to time restrictions, I decided to use another PDF framework to accomplish what I need. I used Amyuni PDF Creator .NET, a simple yet nice library. It has its own bugs, though, and I'm working with the vendor to resolve them. — Hetote, May 29 '13 at 15:21
Have you looked at [ABCPdf](http://www.websupergoo.com/abcpdf-1.htm)? If I'm correct, it can do exactly what you want, and the pricing is about the same as the Amyuni licenses. — Peter R, Jul 30 '13 at 09:16
I recently switched from iTextSharp to wkhtmltopdf, which renders HTML in WebKit and then converts it to PDF. I found it a lot easier to work with, as you can build your page in HTML instead of coding it manually in iText syntax. IIRC iTextSharp used to have an HTML to PDF routine, but they took it out for some reason. — roryok, Jul 30 '13 at 10:55
Hidden objects? I think there is something wrong with your dynamically created objects. If I were you, I would fix the algorithm used to display objects so that no hidden objects are generated. — Mark, Sep 12 '13 at 02:08
Christian, I'm not creating these objects. I'm cropping complete PDF documents. Think of newspaper or magazine pages, and of taking scissors to cut the different news items into pieces. The partial solution I described creates invisible content, which is no good. So I ended up implementing it with the Amyuni library. But 6 months later, I still find bugs in it... — Hetote, Sep 13 '13 at 14:17
Any news about this issue? I'm looking for the same thing and found an interesting piece of code [here](http://stackoverflow.com/questions/29260154/itextsharp-crop-pdf-file-c). But it doesn't remove the invisible elements; it cleans an area of the page, which can affect other parts of the page. Can Amyuni or ABCPdf do the job? — Max, Oct 01 '15 at 08:24
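
mkl's suggestion above (reuse one imported page template for all crop boxes) can be sketched roughly as follows. This is a minimal sketch assuming iTextSharp 5.x and an unrotated source page; `CropComposer`, `Placement`, and `Compose` are hypothetical names introduced here for illustration. The page is imported once and drawn once per crop box behind a clipping rectangle, so its content is stored a single time in the output instead of once per temporary document. It does not remove anything: the clipped-away content is still in the file.

```csharp
using System.Collections.Generic;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public static class CropComposer
{
    // Hypothetical helper type: a crop box on the source page plus the
    // lower-left corner where that box should land on the output page.
    public sealed class Placement
    {
        public Rectangle CropBox;
        public float TargetX;
        public float TargetY;
    }

    public static void Compose(string sourcePath, string outputPath, int pageNumber,
                               Rectangle outputPageSize, IList<Placement> placements)
    {
        PdfReader reader = new PdfReader(sourcePath);
        Document document = new Document(outputPageSize);
        PdfWriter writer = PdfWriter.GetInstance(document,
            new FileStream(outputPath, FileMode.Create, FileAccess.Write));
        document.Open();
        PdfContentByte cb = writer.DirectContent;

        // Import the source page once; every crop box reuses this template.
        PdfImportedPage page = writer.GetImportedPage(reader, pageNumber);

        foreach (Placement p in placements)
        {
            cb.SaveState();
            // Clip to the area the crop box occupies on the output page.
            cb.Rectangle(p.TargetX, p.TargetY, p.CropBox.Width, p.CropBox.Height);
            cb.Clip();
            cb.NewPath();
            // Shift the page so the crop box's lower-left corner lands on the target point.
            cb.AddTemplate(page, p.TargetX - p.CropBox.Left, p.TargetY - p.CropBox.Bottom);
            cb.RestoreState();
        }

        document.Close(); // also closes the writer and the output stream
        reader.Close();
    }
}
```

This addresses only the file size problem. The full page content still sits in the template, so the hidden text can still be reached by copy and paste.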

score 1 · Answer 1 · answered Sep 18 '13 at 05:47

If the PDF you are working with is a fixed, predefined template, you can remove the object by calling RemoveField.

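PdfReader pdfReader = new PdfReader("../Template_Path.pdf");
PdfStamper pdfStamperToPopulate = new PdfStamper(pdfReader, new FileStream(outputPath, FileMode.Create));
AcroFields pdfFormFields = pdfStamperToPopulate.AcroFields;
// Remove the named form field from the document
pdfFormFields.RemoveField("fieldNameToBeRemoved");
// Close the stamper so the changes are written to outputPath
pdfStamperToPopulate.Close();
pdfReader.Close();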

Praveena M

The OP is not talking about form fields. Any form fields that existed were discarded anyway during `writer.GetImportedPage` and `contentByte.AddTemplate`. – mkl Sep 18 '13 at 07:05

score 1 · Answer 2 · answered Sep 21 '13 at 19:12

PdfReader pdfReader = new PdfReader("../Template_Path.pdf");
PdfStamper pdfStamperToPopulate = new PdfStamper(pdfReader, new FileStream(outputPath, FileMode.Create));
AcroFields pdfFormFields = pdfStamperToPopulate.AcroFields;
pdfFormFields.RemoveField("fieldNameToBeRemoved");

Answer 3 · HABJAN · answered Sep 27 '13 at 15:29

Yes, it's possible. You need to parse the PDF page content bytes into PdfObjects, keep them in memory, delete the unwanted PdfObjects, rebuild the page content bytes from the remaining PdfObjects, and replace the page content in the PdfReader just before you import the page via PdfWriter.

I would recommend checking out this: http://habjan.blogspot.com/2013/09/proof-of-concept-converting-pdf-files.html

The sample at that link implements parsing of the PDF content bytes, rebuilding them from PdfObjects, replacing the PdfReader page content bytes...
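
The first step of this approach, parsing the page content into operands and operators, might look like the sketch below. It assumes iTextSharp 5.x and is not taken from the linked blog post; `ContentStreamDump` and `DumpOperators` are hypothetical names. It only reads and prints. Deciding which operators to drop, serializing the rest, and writing the result back (for example with `reader.SetPageContent`) are left out.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;

public static class ContentStreamDump
{
    // Prints every operator of a page's content stream with its operand count.
    public static void DumpOperators(string path, int pageNumber)
    {
        PdfReader reader = new PdfReader(path);
        try
        {
            // Decoded content stream bytes of the page
            byte[] content = reader.GetPageContent(pageNumber);

            // Note: this RandomAccessFileOrArray constructor is marked obsolete
            // in later 5.5.x releases but is still available there.
            PRTokeniser tokeniser = new PRTokeniser(new RandomAccessFileOrArray(content));
            PdfContentParser parser = new PdfContentParser(tokeniser);

            List<PdfObject> operands = new List<PdfObject>();
            // Parse fills the list with the operands followed by the operator,
            // and returns an empty list at the end of the stream.
            while (parser.Parse(operands).Count > 0)
            {
                PdfObject op = operands[operands.Count - 1];
                Console.WriteLine("{0} ({1} operands)", op, operands.Count - 1);
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

This simple loop does not handle inline images (`BI ... ID ... EI`), and it does not descend into form XObjects, which would need the same treatment recursively.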

score 1 · Answer 4 · edited May 23 '17 at 11:44

Here are three solutions I found, in case they help someone (using iTextSharp, Amyuni or Tracker-Software, since @Hetote said in the comments that he was looking for another library):

Using iTextSharp

As answered by @martinbuberl in another question:

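// Requires: using System.IO; using iTextSharp.text; using iTextSharp.text.pdf;
public static void CropDocument(string file, string oldchar, string repChar)
{
    int pageNumber = 1;
    PdfReader reader = new PdfReader(file);

    // Rectangle takes the lower-left and upper-right corners (llx, lly, urx, ury),
    // so Globals.fWidth and Globals.fHeight act as the upper-right x and y here.
    // Globals is defined elsewhere in the original answer's project.
    iTextSharp.text.Rectangle size = new iTextSharp.text.Rectangle(
        Globals.fX,
        Globals.fY,
        Globals.fWidth,
        Globals.fHeight);

    // The output page has exactly the size of the crop rectangle
    Document document = new Document(size);
    PdfWriter writer = PdfWriter.GetInstance(document,
        new FileStream(file.Replace(oldchar, repChar),
            FileMode.Create, FileAccess.Write));
    document.Open();
    PdfContentByte cb = writer.DirectContent;
    document.NewPage();

    // The whole source page is imported; only the part inside "size" is displayed
    PdfImportedPage page = writer.GetImportedPage(reader, pageNumber);
    cb.AddTemplate(page, 0, 0);

    document.Close();
    reader.Close();
}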

Another answer, by @rafixwpt in his question. Note that it doesn't remove the invisible elements; it cleans an area of the page, which can affect other parts of the page:

static void textsharpie()
{
    string file = "C:\\testpdf.pdf";
    string oldchar = "testpdf.pdf";
    string repChar = "test.pdf";
    PdfReader reader = new PdfReader(file);
    PdfStamper stamper = new PdfStamper(reader, new FileStream(file.Replace(oldchar, repChar), FileMode.Create, FileAccess.Write));
    List<PdfCleanUpLocation> cleanUpLocations = new List<PdfCleanUpLocation>();
    cleanUpLocations.Add(new PdfCleanUpLocation(1, new iTextSharp.text.Rectangle(0f, 0f, 600f, 115f), iTextSharp.text.BaseColor.WHITE));
    PdfCleanUpProcessor cleaner = new PdfCleanUpProcessor(cleanUpLocations, stamper);
    cleaner.CleanUp();
    stamper.Close();
    reader.Close();
}
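
One way the clean-up approach could be pointed at the original problem is to clean everything around a crop box instead of one fixed area. The sketch below is an assumption-laden illustration, not part of either quoted answer: it assumes iTextSharp 5.5.x with the itextsharp.xtra assembly, a single rectangular crop box, and an unrotated page. `OutsideCleaner`, `CleanOutside`, and the `keep` parameter are hypothetical names.

```csharp
using System.Collections.Generic;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.xtra.iTextSharp.text.pdf.pdfcleanup;

public static class OutsideCleaner
{
    // Cleans the four strips of the page that surround the rectangle "keep".
    public static void CleanOutside(string src, string dest, int pageNumber, Rectangle keep)
    {
        PdfReader reader = new PdfReader(src);
        PdfStamper stamper = new PdfStamper(reader,
            new FileStream(dest, FileMode.Create, FileAccess.Write));
        Rectangle page = reader.GetPageSize(pageNumber);

        Rectangle[] strips =
        {
            new Rectangle(page.Left, page.Bottom, keep.Left, page.Top),     // left of the box
            new Rectangle(keep.Right, page.Bottom, page.Right, page.Top),   // right of the box
            new Rectangle(keep.Left, page.Bottom, keep.Right, keep.Bottom), // below the box
            new Rectangle(keep.Left, keep.Top, keep.Right, page.Top)        // above the box
        };

        List<PdfCleanUpLocation> locations = new List<PdfCleanUpLocation>();
        foreach (Rectangle strip in strips)
        {
            // Skip empty strips, e.g. when the box touches a page edge
            if (strip.Width > 0 && strip.Height > 0)
            {
                locations.Add(new PdfCleanUpLocation(pageNumber, strip, BaseColor.WHITE));
            }
        }

        PdfCleanUpProcessor cleaner = new PdfCleanUpProcessor(locations, stamper);
        cleaner.CleanUp();
        stamper.Close();
        reader.Close();
    }
}
```

The cleaned page still has its original size, so it would still need to be cropped afterwards, for example with the CropDocument approach above. Several crop boxes on one page would need a more careful computation of the area to clean.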

Using Amyuni

As answered by @yms in another question:

IacDocument.GetObjectsInRectangle Method

The GetObjectsInRectangle method gets all the objects that are in the specified rectangle.

Then you can iterate all the objects in the page and delete those that you are not interested in:

// open a PDF document
document.Open(testfile, "");
IacPage page1 = document.GetPage(1);
Amyuni.PDFCreator.IacAttribute attribute = page1.AttributeByName("Objects");

// listObj is the list of graphic objects on the page
var listObj = ((System.Collections.ArrayList)attribute.Value).Cast<IacObject>();

// listObjToKeep is the list of graphic objects that intersect the rectangle
var listObjToKeep = document.GetObjectsInRectangle(0f, 0f, 600f, 115f, IacGetRectObjectsConstants.acGetRectObjectsIntersecting).Cast<IacObject>();

// ToList() takes a snapshot so that deleting objects does not disturb the enumeration
foreach (IacObject pdfObj in listObj.Except(listObjToKeep).ToList())
{
    // pdfObj is not inside the rectangle, so delete it
    pdfObj.Delete(false);
}

As @yms said in the comments, the new method IacDocument.Redact in version 5.0 can also be used to delete all the objects in the specified rectangle and draw a solid color rectangle in their place.

Using Tracker-Software Editor SDK

I didn't try it but it seems possible, see this post.

In the case of Amyuni PDF Creator, a new method [IacDocument.Redact](https://www.amyuni.com/WebHelp/Amyuni_PDF_Creator_for_NET/Amyuni_PDFCreator_IacDocument/Methods/IacDocument.Redact_Method.htm) was added in version 5.0 which might be helpful in this kind of scenario. — yms, Oct 06 '15 at 16:21

score 0 · Answer 5 · answered Aug 06 '13 at 20:46

Have you tried using an IRenderListener? You can selectively add only those elements to the new pdf which fall within the crop regions by examining the StartPoint and EndPoint or Area of the TextRenderInfo or ImageRenderInfo objects.
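
A minimal sketch of such a listener, assuming iTextSharp 5.x, is shown below. `CropRegionListener`, `TextInside`, and `ImagesInside` are hypothetical names. The listener only collects what falls inside one crop region; writing the collected elements into a new PDF is not shown.

```csharp
using System.Collections.Generic;
using iTextSharp.text;
using iTextSharp.text.pdf.parser;

public class CropRegionListener : IRenderListener
{
    private readonly Rectangle region;

    // Text chunks whose baseline lies completely inside the region
    public readonly List<string> TextInside = new List<string>();

    // Number of images whose origin corner lies inside the region
    public int ImagesInside;

    public CropRegionListener(Rectangle region)
    {
        this.region = region;
    }

    public void BeginTextBlock() { }

    public void EndTextBlock() { }

    public void RenderText(TextRenderInfo info)
    {
        // Start and end of the baseline, in page coordinates
        Vector start = info.GetBaseline().GetStartPoint();
        Vector end = info.GetBaseline().GetEndPoint();
        if (Contains(start[Vector.I1], start[Vector.I2]) &&
            Contains(end[Vector.I1], end[Vector.I2]))
        {
            TextInside.Add(info.GetText());
        }
    }

    public void RenderImage(ImageRenderInfo info)
    {
        // The translation part of the image matrix is the corner where the
        // image's unit square origin lands (lower-left for unrotated images).
        Matrix ctm = info.GetImageCTM();
        if (Contains(ctm[Matrix.I31], ctm[Matrix.I32]))
        {
            ImagesInside++;
        }
    }

    private bool Contains(float x, float y)
    {
        return x >= region.Left && x <= region.Right &&
               y >= region.Bottom && y <= region.Top;
    }
}
```

It could be run with `new PdfReaderContentParser(reader).ProcessContent(1, new CropRegionListener(cropBox));`, where `reader` is an open `PdfReader` and `cropBox` is a `Rectangle`. Checking a single corner of an image is a simplification, and vector drawings are not reported through `IRenderListener` at all, so they would need separate handling.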

B2K
