Remove text from PDF document using Aspose.PDF library?

Question

I need to delete some text from a PDF document. I am using Aspose.PDF for this and am currently using TextFragmentAbsorber.

FYI, I cannot use any other 3rd party library.

Below is the code I am using:

    private string DeleteMachineReadableCode(string inputFilePath)
    {
        var outputFilePath = Path.Combine(Path.GetTempPath(), string.Format(@"{0}.pdf", Guid.NewGuid()));

        try
        {
            // Open document
            Document pdfDocument = new Document(inputFilePath);

            // Create TextAbsorber object to find all the phrases matching the regular expression
            TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("#START#((.|\r\n)*?)#END#"); 

            // Set text search option to specify regular expression usage
            TextSearchOptions textSearchOptions = new TextSearchOptions(true);


            textFragmentAbsorber.TextSearchOptions = textSearchOptions;

            // Accept the absorber for all pages
            pdfDocument.Pages.Accept(textFragmentAbsorber);

            // Get the extracted text fragments
            TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

            // Loop through the fragments
            foreach (TextFragment textFragment in textFragmentCollection)
            {
                // Update text and other properties
                textFragment.Text = string.Empty;

                // Set to an instance of an object.
                textFragment.TextState.Font = FontRepository.FindFont("Verdana");
                textFragment.TextState.FontSize = 1;
                textFragment.TextState.ForegroundColor = Aspose.Pdf.Color.FromRgb(System.Drawing.Color.White);
                textFragment.TextState.BackgroundColor = Aspose.Pdf.Color.FromRgb(System.Drawing.Color.White);
            }

            pdfDocument.Save(outputFilePath);

        }
        finally
        {
            if (File.Exists(inputFilePath))
                File.Delete(inputFilePath);
        }

        return outputFilePath;
    }

I am able to remove the content when it is on a single page. My problem is that when the text spans multiple pages, TextFragmentAbsorber does not find it with the regex pattern shown above ("#START#((.|\r\n)*?)#END#").

Please suggest whether a change to the regex or some setting in Aspose could fix my issue.

I have gone through your comments and would like to ask you to share the source file with us, because we need that particular document to test this scenario. You may share the file using any free file hosting service such as Google Drive or Dropbox. — Farhan Raza, Nov 08 '17 at 20:18
@FarhanRaza uploaded : https://drive.google.com/open?id=1PALgqgXIltrAKcZuZ2ron_I2pD-8Wgqg — Jose Francis, Nov 09 '17 at 09:11
Thank you for sharing the requested file. I have worked with the data you shared, but TextFragmentAbsorber does not recognize the text even when it is on a single page. Please tell us which string you want to extract from this PDF so that we can check the regex accordingly. Note: I work with Aspose as Developer Evangelist. — Farhan Raza, Nov 09 '17 at 11:47
I need to remove the whole string that starts with #START# and ends with #END# — Jose Francis, Nov 09 '17 at 12:45
I have worked with the data shared by you and have been able to observe the problem with TextFragmentAbsorber, when text spans over multiple pages. So, an investigation ticket with ID **PDFNET-43671** has been logged in our issue management system. We will share our findings with you as soon as the issue is investigated. We are sorry for the inconvenience. — Farhan Raza, Nov 09 '17 at 21:36
I would like to update you that we have investigated the issue: it is an architectural limitation of TextFragmentAbsorber, which processes the document page by page. Considering the previously logged and higher-priority issues, we cannot promise a fix in the near future. Furthermore, the bug tracking system is an internal issue management system, and I am afraid you may not be able to access it. — Farhan Raza, Nov 12 '17 at 18:25

score 1 · Answer 1 · answered Nov 22 '17 at 09:05

As shared earlier, we cannot promise an early resolution of the issue you reported, because of the architectural limitation. However, we have modified the code snippet to meet your requirements.

The idea is to find the text starting with '#START#' on one of the document pages, then to find the text ending with '#END#' on one of the subsequent pages, and also to process all text fragments placed on the pages between those two (if any exist).

    // Requires: using System; using System.Collections; using System.IO; using Aspose.Pdf; using Aspose.Pdf.Text;
    private string DeleteMachineReadableCodeUpdated(string inputFilePath)
    {
        string outputFilePath = Path.Combine(Path.GetTempPath(), string.Format(@"{0}.pdf", Guid.NewGuid()));

        try
        {
            // Open document
            Document pdfDocument = new Document(inputFilePath);

            // Create a TextFragmentAbsorber to find all phrases matching the regular expression
            TextFragmentAbsorber absorber = new TextFragmentAbsorber("#START#((.|\r\n)*?)#END#");

            // Set text search option to specify regular expression usage
            TextSearchOptions textSearchOptions = new TextSearchOptions(true);
            absorber.TextSearchOptions = textSearchOptions;

            // Accept the absorber for all pages
            pdfDocument.Pages.Accept(absorber);

            // Get the extracted text fragments
            TextFragmentCollection textFragmentCollection = absorber.TextFragments;

            if (textFragmentCollection.Count > 0)
            {
                // The whole pattern was found on a single page
                RemoveTextFromFragmentCollection(textFragmentCollection);
            }
            else
            {
                // Nothing was found: try to find the block in parts, page by page
                string startingPattern = "#START#((.|\r\n)*?)\\z";
                string endingPattern = "\\A((.|\r\n)*?)#END#";
                bool isStartingPatternFound = false;
                bool isEndingPatternFound = false;
                ArrayList fragmentsToRemove = new ArrayList();

                foreach (Page page in pdfDocument.Pages)
                {
                    // If the ending pattern was already found, do nothing
                    if (isEndingPatternFound)
                        continue;

                    // Search for the starting pattern first; once it is found, switch to the ending pattern
                    absorber.Phrase = !isStartingPatternFound ? startingPattern : endingPattern;

                    page.Accept(absorber);
                    if (absorber.TextFragments.Count > 0)
                    {
                        // Something was found: add it to the list
                        fragmentsToRemove.AddRange(absorber.TextFragments);

                        if (isStartingPatternFound)
                        {
                            // Both starting and ending patterns found: remove everything collected
                            isEndingPatternFound = true;
                            RemoveTextFromFragmentCollection(fragmentsToRemove);
                        }
                        else
                        {
                            // Only the starting pattern found so far: continue with the next page
                            isStartingPatternFound = true;
                        }
                    }
                    else if (isStartingPatternFound)
                    {
                        // Neither pattern is on this page, but the block has already started:
                        // collect all fragments from this intermediate page
                        absorber.Phrase = string.Empty;
                        page.Accept(absorber);
                        fragmentsToRemove.AddRange(absorber.TextFragments);
                    }
                    // Otherwise the block has not started yet: do nothing (continue)
                }
            }

            pdfDocument.Save(outputFilePath);
        }
        finally
        {
            if (File.Exists(inputFilePath))
                File.Delete(inputFilePath);
        }

        return outputFilePath;
    }

    private void RemoveTextFromFragmentCollection(ICollection fragmentCollection)
    {
        // Blank out the text of every collected fragment
        foreach (TextFragment textFragment in fragmentCollection)
        {
            textFragment.Text = string.Empty;
        }
    }
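
To see what the two partial patterns do, here is a minimal sketch that applies them to plain strings with System.Text.RegularExpressions. The strings are made-up stand-ins for page text, and the class and method names are hypothetical; how Aspose.PDF passes page text to the regex internally is not covered here.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of Aspose.PDF.
public static class AnchoredPatternDemo
{
    // Same patterns as in the answer: \z anchors at the end of the input,
    // \A anchors at the beginning of the input.
    public const string StartingPattern = "#START#((.|\r\n)*?)\\z";
    public const string EndingPattern = "\\A((.|\r\n)*?)#END#";

    // Returns the matched text, or an empty string when the pattern does not match.
    public static string MatchOrEmpty(string pageText, string pattern)
    {
        Match match = Regex.Match(pageText, pattern);
        return match.Success ? match.Value : string.Empty;
    }

    public static void Main()
    {
        // Assumed sample texts standing in for two consecutive pages.
        string firstPage = "Intro #START# code part one";
        string lastPage = "code part two #END# Outro";

        // Matches from the marker to the end of the "page".
        Console.WriteLine(MatchOrEmpty(firstPage, StartingPattern));
        // Matches from the beginning of the "page" up to and including the marker.
        Console.WriteLine(MatchOrEmpty(lastPage, EndingPattern));
    }
}
```

The starting pattern captures everything from '#START#' to the end of its input, and the ending pattern captures everything from the beginning of its input through '#END#', which is why the two halves of a block split across pages can be picked up separately.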

Note:

This code assumes that the document contains only one text block starting with '#START#' and ending with '#END#'. However, it can easily be modified to process several such blocks.
Instead of processing the text on intermediate page(s), you may store their page number(s) and then delete those pages using pdfDocument.Pages.Delete(pageNumber) before saving the document. This avoids 'blank' pages if they are undesirable.
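
Both notes can be illustrated with a small sketch of the bookkeeping involved. It works on plain strings standing in for page text, so it does not call Aspose.PDF at all; the class name, method name, and sample data are hypothetical. One simplification to be aware of: unlike the code above, this sketch removes an unterminated block through to the end of the input.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sketch on plain strings; not part of Aspose.PDF.
public static class MultiBlockSketch
{
    private static readonly Regex WholeBlock = new Regex("#START#((.|\r\n)*?)#END#");
    private static readonly Regex OpenBlock = new Regex("#START#((.|\r\n)*?)\\z");
    private static readonly Regex CloseBlock = new Regex("\\A((.|\r\n)*?)#END#");

    // Returns the cleaned page texts and records the 1-based numbers of pages
    // that lie entirely inside a block (candidates for deletion).
    public static List<string> RemoveBlocks(IList<string> pages, List<int> pagesInsideBlocks)
    {
        var cleaned = new List<string>();
        bool insideBlock = false;

        for (int i = 0; i < pages.Count; i++)
        {
            string text = pages[i];

            if (insideBlock)
            {
                Match close = CloseBlock.Match(text);
                if (!close.Success)
                {
                    // The whole page belongs to the current block.
                    pagesInsideBlocks.Add(i + 1);
                    cleaned.Add(string.Empty);
                    continue;
                }
                // Drop the tail of the block; the rest of the page is processed normally.
                text = text.Substring(close.Length);
                insideBlock = false;
            }

            // Remove every block that starts and ends on this page.
            text = WholeBlock.Replace(text, string.Empty);

            // A block that starts here and continues on a later page.
            Match open = OpenBlock.Match(text);
            if (open.Success)
            {
                text = text.Substring(0, open.Index);
                insideBlock = true;
            }

            cleaned.Add(text);
        }

        return cleaned;
    }

    public static void Main()
    {
        // Assumed sample "pages": one complete block, then a block spanning three pages.
        var pages = new List<string>
        {
            "a #START# x #END# b #START# y",
            "middle",
            "z #END# c #START# q #END# d"
        };
        var pagesInsideBlocks = new List<int>();

        List<string> cleaned = RemoveBlocks(pages, pagesInsideBlocks);

        Console.WriteLine(string.Join("|", cleaned));
        Console.WriteLine(string.Join(",", pagesInsideBlocks));
    }
}
```

In the real document, the recorded page numbers would be the ones passed to pdfDocument.Pages.Delete(pageNumber). Deleting from the highest number down is probably the safer order, since removing a page shifts the numbers of the pages after it.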
