PDF Clown: How to override the existing highlighted keyword

Question

I have the following requirement in PDF Clown: if some keywords are substrings of (or otherwise match part of) other keywords, the highlight of the shorter keyword has to be overridden so that the full keyword is highlighted. For example, in the map below the keyword ETS is a substring of the keywords Just. ETS and Test. ETS. The expected result is that the full keywords Just. ETS and Test. ETS are highlighted, with their own popup measure values, instead of the ETS keyword and its popup value. Attached: the actual PDF, the actual result PDF, and the jar path.

Map<String, String> map = new HashMap<String, String>();
map.put("ETS", "Loss");
map.put("Just. ETS", "Net ");
map.put("Test. ETS", "Profit");

(Note: 1. If the longer keyword is already highlighted in the file, a shorter keyword that matches part of it must not be highlighted. 2. If the shorter keyword is already highlighted and it matches part of a longer keyword, the longer keyword should be highlighted and the shorter keyword's highlight ignored/removed.)

    import java.awt.Color;
    import java.awt.Desktop;
    import java.awt.geom.Rectangle2D;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.UnsupportedEncodingException;
    import java.net.URL;
    import java.nio.charset.Charset;
    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.Date;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.TimeUnit;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import java.io.File;
    import org.pdfclown.documents.Page;
    import org.pdfclown.documents.contents.ITextString;
    import org.pdfclown.documents.contents.TextChar;
    import org.pdfclown.documents.contents.colorSpaces.DeviceRGBColor;
    import org.pdfclown.documents.interaction.annotations.TextMarkup;
    import org.pdfclown.documents.interaction.annotations.TextMarkup.MarkupTypeEnum;

    import org.pdfclown.files.SerializationModeEnum;
    import org.pdfclown.util.math.Interval;
    import org.pdfclown.util.math.geom.Quad;
    import org.pdfclown.tools.TextExtractor;

    public class pdfclown2 {
        private static int count;

        public static void main(String[] args) throws IOException {

            highlight("C:\\Users\\uc23\\Desktop\\pdf\\80743064.pdf","C:\\Users\\\\Downloads\\6.pdf");
            System.out.println("OK");
        }
        private static void highlight(String inputPath, String outputPath) throws IOException {

    org.pdfclown.files.File file = null;

    try {
        // Open the PDF passed in by the caller.
        file = new org.pdfclown.files.File(inputPath);

        // 1. Keywords to highlight; each value becomes the popup text of its highlight.
        // Keyword is a simple key/value bean (class not shown).
        List<Keyword> l = new ArrayList<Keyword>();
        Keyword k = new Keyword();
        Keyword k1 = new Keyword();
        k1.setKey("Just. ETS");
        k1.setValue("NET");
        l.add(k1);
        Keyword k2 = new Keyword();
        k2.setKey("Test. ETS");
        k2.setValue("PROFIT");
        l.add(k2);
        k.setKey("ETS");
        k.setValue("LOSS");
        l.add(k);

        long startTime = System.currentTimeMillis();

    // 2. Iterating through the document pages...
    TextExtractor textExtractor = new TextExtractor(true, true);
    for (final Page page : file.getDocument().getPages()) {
        Map<Rectangle2D, List<ITextString>> textStrings = textExtractor.extract(page);
        for (Keyword e : l) {
            Pattern pattern;
            String searchKey = e.getKey();
            final String translationKeyword = e.getValue();

            // Keys containing (, ), ?, * or + are matched literally;
            // all other keys are used as a regex between word boundaries.
            if (searchKey.contains("(") || searchKey.contains(")") || searchKey.contains("?")
                    || searchKey.contains("*") || searchKey.contains("+")) {
                pattern = Pattern.compile(Pattern.quote(searchKey), Pattern.CASE_INSENSITIVE);
            } else {
                pattern = Pattern.compile("\\b" + searchKey + "\\b", Pattern.CASE_INSENSITIVE);
            }

            // 2.1. Find the text pattern matches in the extracted page text.
            final Matcher matcher = pattern.matcher(TextExtractor.toString(textStrings).toLowerCase());

            // 2.2. Highlight the text pattern matches.
        textExtractor.filter(textStrings, new TextExtractor.IIntervalFilter() {

            public boolean hasNext() {
                // if (key.getMatchCriteria() == 1) {
                return matcher.find();
                /*
                 * } else if (key.getMatchCriteria() == 2) { if (matcher.hitEnd()) { count++; return true; } }
                 */
            }

            public Interval<Integer> next() {
                return new Interval<Integer>(matcher.start(), matcher.end());
            }

            public void process(Interval<Integer> interval, ITextString match) {
                System.out.println(match);
                // Define the highlight boxes of the text pattern match:
                // character boxes are merged line by line, one quad per line.
                List<Quad> highlightQuads = new ArrayList<Quad>();
                {
                    Rectangle2D textBox = null;
                    for (TextChar textChar : match.getTextChars()) {
                        Rectangle2D textCharBox = textChar.getBox();
                        if (textBox == null) {
                            textBox = (Rectangle2D) textCharBox.clone();
                        } else {
                            if (textCharBox.getY() > textBox.getMaxY()) {
                                // The text moved to the next line: finish the current quad.
                                highlightQuads.add(Quad.get(textBox));
                                textBox = (Rectangle2D) textCharBox.clone();
                            } else {
                                textBox.add(textCharBox);
                            }
                        }
                    }
                    // Add the last (or only) line once, after all characters are merged.
                    if (textBox != null) {
                        highlightQuads.add(Quad.get(textBox));
                    }
                }

                // subtype can be Highlight, Underline, StrikeOut, Squiggly
                new TextMarkup(page, highlightQuads, translationKeyword, MarkupTypeEnum.Highlight);
            }

            public void remove() {
                throw new UnsupportedOperationException();
            }

        });

        }
    }

        SerializationModeEnum serializationMode = SerializationModeEnum.Standard;
        file.save(new java.io.File(outputPath), serializationMode);
        System.out.println("file created");
        long endTime = System.currentTimeMillis();
        System.out.println("seconds taken for execution: " + (endTime - startTime) / 1000);

    } catch (Exception e) {
        e.printStackTrace();
    }
        }
    }

score 0 · Accepted Answer · edited Jun 20 '20 at 09:12

As already mentioned in comments (which meanwhile have been moved to chat):

Your issue only becomes a PDF Clown issue because you try to put the cart before the horse:

You have determined that you are creating too many highlights.

The obvious solution would be to stop making those surplus highlights from the start, and sorting that out is an issue unrelated to PDF Clown.

Your attempted solution, on the other hand, is to remove the surplus highlights after the fact, and only this makes it a PDF Clown issue for you, because now you have to search the already existing highlights for overlaps. That solution is possible, too, but it wastes resources unnecessarily.

Here is an approach that sorts out unwanted matches before highlights are created for them. The contents of your loop over the pages are replaced like this:

[...]
TextExtractor textExtractor = new TextExtractor(true, true);
for (final Page page : file.getDocument().getPages()) {
    Map<Rectangle2D, List<ITextString>> textStrings = textExtractor.extract(page);

    List<Match> matches = new ArrayList<>();

    for (Keyword e : l) {
        final String searchKey = e.getKey();
        final String translationKeyword = e.getValue();

        final Pattern pattern;
        // Keys containing (, ), ?, * or + are matched literally;
        // all other keys are used as a regex between word boundaries.
        if (searchKey.contains("(") || searchKey.contains(")") || searchKey.contains("?")
                || searchKey.contains("*") || searchKey.contains("+")) {
            pattern = Pattern.compile(Pattern.quote(searchKey), Pattern.CASE_INSENSITIVE);
        } else {
            pattern = Pattern.compile("\\b" + searchKey + "\\b", Pattern.CASE_INSENSITIVE);
        }

        final Matcher matcher = pattern.matcher(TextExtractor.toString(textStrings).toLowerCase());

        textExtractor.filter(textStrings, new TextExtractor.IIntervalFilter() {
            public boolean hasNext() {
                return matcher.find();
            }

            public Interval<Integer> next() {
                return new Interval<Integer>(matcher.start(), matcher.end(), true, false);
            }

            public void process(Interval<Integer> interval, ITextString match) {
                matches.add(new Match(interval, match, translationKeyword));
            }

            public void remove() {
                throw new UnsupportedOperationException();
            }
        });
    }

    removeOverlaps(matches);

    for (Match match : matches) {
        List<Quad> highlightQuads = new ArrayList<Quad>();
        {
            // Merge character boxes line by line, one quad per line.
            Rectangle2D textBox = null;
            for (TextChar textChar : match.match.getTextChars()) {
                Rectangle2D textCharBox = textChar.getBox();
                if (textBox == null) {
                    textBox = (Rectangle2D) textCharBox.clone();
                } else {
                    if (textCharBox.getY() > textBox.getMaxY()) {
                        // The text moved to the next line: finish the current quad.
                        highlightQuads.add(Quad.get(textBox));
                        textBox = (Rectangle2D) textCharBox.clone();
                    } else {
                        textBox.add(textCharBox);
                    }
                }
            }
            // Add the last (or only) line once, after all characters are merged.
            if (textBox != null) {
                highlightQuads.add(Quad.get(textBox));
            }

            new TextMarkup(page, highlightQuads, match.tag, MarkupTypeEnum.Highlight);
        }
    }
}
[...]

(ComplexHighlight test testMarkLikeSeshadriImproved)

making use of these helper methods / classes:

static void removeOverlaps(List<Match> matches) {
    // Sort by priority: after sorting, an earlier list position means a higher priority.
    Collections.sort(matches, ComplexHighlight::compareLowLengthTag);

    for (int i = 0; i < matches.size() - 1; i++) {
        Interval<Integer> intervalI = matches.get(i).interval;
        for (int j = i + 1; j < matches.size(); j++) {
            Interval<Integer> intervalJ = matches.get(j).interval;
            // Half-open intervals [low, high) overlap if each starts before the other ends.
            if (intervalI.getLow() < intervalJ.getHigh() && intervalJ.getLow() < intervalI.getHigh()) {
                System.out.printf("Match %d removed as it overlaps match %d.%n", j, i);
                // Drop the lower-priority match and re-check the element that moved into slot j.
                matches.remove(j--);
            }
        }
    }
}

(ComplexHighlight method removeOverlaps)

static int compareLowLengthTag(Match a, Match b) {
    int compare = a.interval.getLow().compareTo(b.interval.getLow());
    if (compare == 0)
        compare = - a.interval.getHigh().compareTo(b.interval.getHigh());
    if (compare == 0)
        compare = a.tag.compareTo(b.tag);
    return compare;
}

(ComplexHighlight method compareLowLengthTag)

class Match {
    final Interval<Integer> interval;
    final ITextString match;
    final String tag;

    public Match(final Interval<Integer> interval, final ITextString match, final String tag) {
        this.interval = interval;
        this.match = match;
        this.tag = tag;
    }
}

(Match class)

As you can see, the matches are not added as highlights immediately but are first collected in a list, matches. This list is then processed so that it no longer contains overlaps, and only the elements that remain are added as highlights.
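
The collect-then-filter idea can also be tried without PDF Clown. The following is only a minimal sketch under assumptions: plain int offsets stand in for Interval<Integer>, the class OverlapSketch, its Hit type and the sample sentence are made up for illustration, and every keyword is quoted with Pattern.quote instead of going through the quoting logic above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone demo; not part of PDF Clown or of the answer's test class.
class OverlapSketch {
    // A match as a half-open range [low, high) plus the popup text.
    static final class Hit {
        final int low, high;
        final String tag;
        Hit(int low, int high, String tag) { this.low = low; this.high = high; this.tag = tag; }
        @Override public String toString() { return tag + "[" + low + "," + high + ")"; }
    }

    // Step 1: collect every match of every keyword, overlapping or not.
    static List<Hit> collect(String text, Map<String, String> keywords) {
        List<Hit> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : keywords.entrySet()) {
            Matcher m = Pattern.compile(Pattern.quote(e.getKey()), Pattern.CASE_INSENSITIVE).matcher(text);
            while (m.find()) {
                hits.add(new Hit(m.start(), m.end(), e.getValue()));
            }
        }
        return hits;
    }

    // Step 2: sort by priority (earlier start, then longer, then tag) and
    // drop every later hit that overlaps a hit kept before it.
    static List<Hit> removeOverlaps(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingInt((Hit h) -> h.low)
                .thenComparingInt(h -> -h.high)
                .thenComparing(h -> h.tag));
        for (int i = 0; i < sorted.size() - 1; i++) {
            Hit a = sorted.get(i);
            for (int j = i + 1; j < sorted.size(); j++) {
                Hit b = sorted.get(j);
                if (a.low < b.high && b.low < a.high) {
                    sorted.remove(j--);
                }
            }
        }
        return sorted;
    }

    public static void main(String[] args) {
        Map<String, String> keywords = new LinkedHashMap<>();
        keywords.put("ETS", "Loss");
        keywords.put("Just. ETS", "Net");
        keywords.put("Test. ETS", "Profit");
        String text = "Just. ETS and Test. ETS and ETS";
        // prints [Net[0,9), Profit[14,23), Loss[28,31)]
        System.out.println(removeOverlaps(collect(text, keywords)));
    }
}
```

In this sample the two ETS matches inside Just. ETS and Test. ETS are dropped, while the free-standing ETS at the end keeps its own highlight.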

As also mentioned in the comments, one has to decide on priorities among the matches.

E.g. for the search terms "AB" and "BCD" and the document text "ABCD", the comparison method compareLowLengthTag used above always prefers the AB match because it starts earlier. The following comparison method compareLengthLowTag instead prefers the longer match BCD and only falls back to preferring the match that starts earlier when the lengths are equal:

static int compareLengthLowTag(Match a, Match b) {
    int aLength = a.interval.getHigh() - a.interval.getLow();
    int bLength = b.interval.getHigh() - b.interval.getLow();
    int compare = - Integer.compare(aLength, bLength);
    if (compare == 0)
        compare = a.interval.getLow().compareTo(b.interval.getLow());
    if (compare == 0)
        compare = a.tag.compareTo(b.tag);
    return compare;
}

(ComplexHighlight method compareLengthLowTag)
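
The effect of the two orderings on the "ABCD" example can be checked with fixed offsets. This is a hedged, stand-alone sketch: PriorityDemo, Hit and survivors are hypothetical names, the two comparators are written to mirror compareLowLengthTag and compareLengthLowTag, and the offsets [0,2) and [1,4) are simply where "AB" and "BCD" sit in "ABCD".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-alone demo of the two priority orders; Hit replaces the Match class.
class PriorityDemo {
    static final class Hit {
        final int low, high; // half-open range [low, high)
        final String tag;
        Hit(int low, int high, String tag) { this.low = low; this.high = high; this.tag = tag; }
    }

    // Earlier start first, then longer match, then tag (mirrors compareLowLengthTag).
    static final Comparator<Hit> LOW_LENGTH_TAG = Comparator
            .comparingInt((Hit h) -> h.low)
            .thenComparingInt(h -> -h.high)
            .thenComparing(h -> h.tag);

    // Longer match first, then earlier start, then tag (mirrors compareLengthLowTag).
    static final Comparator<Hit> LENGTH_LOW_TAG = Comparator
            .comparingInt((Hit h) -> -(h.high - h.low))
            .thenComparingInt(h -> h.low)
            .thenComparing(h -> h.tag);

    // Sorts by the given priority, removes overlapped hits, returns the surviving tags.
    static List<String> survivors(List<Hit> hits, Comparator<Hit> priority) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(priority);
        for (int i = 0; i < sorted.size() - 1; i++) {
            Hit a = sorted.get(i);
            for (int j = i + 1; j < sorted.size(); j++) {
                Hit b = sorted.get(j);
                if (a.low < b.high && b.low < a.high) {
                    sorted.remove(j--);
                }
            }
        }
        List<String> tags = new ArrayList<>();
        for (Hit h : sorted) {
            tags.add(h.tag);
        }
        return tags;
    }

    public static void main(String[] args) {
        // "AB" and "BCD" as found in the text "ABCD".
        List<Hit> hits = Arrays.asList(new Hit(0, 2, "AB"), new Hit(1, 4, "BCD"));
        System.out.println(survivors(hits, LOW_LENGTH_TAG)); // prints [AB]
        System.out.println(survivors(hits, LENGTH_LOW_TAG)); // prints [BCD]
    }
}
```

Only the comparator changes between the two calls; the overlap removal itself stays the same, so switching priorities is a one-line change.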

@mkl A few test cases fail when I try the text and keywords below; the same keyword is still highlighted multiple times. PDF text: alpa beta gamma alpa beta alpa beta gamma. Keywords: m.put("beta", "beta"); m.put("alpa beta gamma", "alpa beta gamma"); m.put("alpa beta", "alpa beta"); m.put("gamma", "gamma"); m.put("alpa", "alpa"); m.put("beta gamma", "beta gamma"); — Seshadri, Apr 20 '18 at 13:02
I just found an issue in the `removeOverlaps` code. I fixed it in my answer, please try again. — mkl, Apr 20 '18 at 14:48
@Seshadri great. Keep in mind, though, that in case of very special desired resolutions of conflicts of regular expression matches you might have to tweak the compare method or even the overlap removal method. — mkl, Apr 23 '18 at 09:57
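
As one example of tweaking the overlap removal: for the low-first ordering, the removal can be done in a single pass over the sorted list, because a new match can only overlap the most recently kept one. This is a sketch of a variant, not the code from the answer. It assumes non-empty matches and literal keywords (hence Pattern.quote), and SinglePassDemo, Hit and highlight are made-up names. It is applied here to the text and keywords from the first comment above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone variant; valid only for the "earlier start, then longer" order.
class SinglePassDemo {
    static final class Hit {
        final int low, high; // half-open range [low, high)
        final String text;
        Hit(int low, int high, String text) { this.low = low; this.high = high; this.text = text; }
    }

    // Returns the kept matches as "matched text@start offset".
    static List<String> highlight(String text, List<String> keywords) {
        // Collect all whole-word matches of all keywords.
        List<Hit> hits = new ArrayList<>();
        for (String keyword : keywords) {
            Pattern p = Pattern.compile("\\b" + Pattern.quote(keyword) + "\\b", Pattern.CASE_INSENSITIVE);
            Matcher m = p.matcher(text);
            while (m.find()) {
                hits.add(new Hit(m.start(), m.end(), m.group()));
            }
        }

        // Earlier start first; among equal starts, the longer match first.
        hits.sort(Comparator.comparingInt((Hit h) -> h.low).thenComparingInt(h -> -h.high));

        // Keep a match only if it starts at or after the end of the last kept match.
        List<String> kept = new ArrayList<>();
        int lastHigh = 0;
        for (Hit h : hits) {
            if (h.low >= lastHigh) {
                kept.add(h.text + "@" + h.low);
                lastHigh = h.high;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> keywords = Arrays.asList(
                "beta", "alpa beta gamma", "alpa beta", "gamma", "alpa", "beta gamma");
        // prints [alpa beta gamma@0, alpa beta@16, alpa beta gamma@26]
        System.out.println(highlight("alpa beta gamma alpa beta alpa beta gamma", keywords));
    }
}
```

With a length-first ordering this shortcut no longer holds, since a kept match may then lie to the right of a later candidate; in that case the pairwise check from removeOverlaps is still needed.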
