3

I am able to calculate keyword density of a body of text using the following code:

HashMap<String, Integer> frequencies = new HashMap<String, Integer>();

String[] splitTextArr = StringX.splitStrIntoWordsRtrnArr(text);

int articleWordCount = splitTextArr.length;

for (String splitText : splitTextArr) {


 if (frequencies.containsKey(splitText)) {

    Integer previousInt = frequencies.get(splitText);
    Integer newInt = ++previousInt;
    frequencies.put(splitText, newInt);
    } else {

    frequencies.put(splitText, 1);
    }
}

I then sort keyword density list in order of most occurring keywords to least occurring keywords using the following call:

    Map<String, Integer> sortedMap = new LinkedHashMapX<String, Integer>().sortByValueInDescendingOrder(frequencies);

The above code works as expected however I now have to implement a unique requirement.

SUMMARY: Given a list of title entries I need to extract enough keywords so that each title in the list is represented by exactly one keyword and then calculate the total number of titles represented by that keyword (see below for example).

EXAMPLE: Suppose I have the following 5 titles in the form of a LinkedHashMap<String, String>:

title1: canon photo printer 
title2: canon camera canon
title3: wireless mouse
title4: wireless charger
title5: mouse trap

Where the KEY represents the titleID (i.e, title1, title2 etc) and the VALUE represents the actual title.

Raw occurrences of each keyword (in descending order, case insensitive) is as follows:

canon: 2 | mouse: 2 | wireless: 2 | camera: 1 | charger: 1 | photo: 1 | printer: 1 | trap: 1

Note: Each keyword is only counted once per title. So although the keyword canon appears 3 times, since it appears twice in the same title (i.e, title2) it is only counted once.

In the previous map, the keyword canon is appears in both title1 & title2. Since each title needs to be represented by exactly one keyword, both titles can be represented by the keyword canon. It is not necessary to include the other keywords from title1 and title2 (such as: photo, printer and camera) as each title should be represented by exactly one keyword (not more, not less). Although we can tech choose to represent title1 and title2 using the keywords photo and camera (or printer and camera) - since this will have the effect of increasing the total number of keywords necessary to represent all titles, it is not desired. In other words, we want to represent all titles by the least number of keywords possible.

The important part is to extract the least number of keywords that are able to "represent" all titles in list one time and keep track of the number of titles each keyword is linked to and the titleID. If instead of 5 titles we had a list of 100 titles where the keyword photo appeared 95 times (i.e, more times than the keyword canon) the keyword photo would be used to replace the keyword canon and title2 would be represented by the keyword camera.

If two or more keywords can represent the same number of titles we would select the first one in alphabetical order. Thus, the keyword mouse would be used to represent titles title3 and title5, instead of the keyword wireless. Similarly, to represent title4 the keyword charger would be used as the letter C comes before the letter W in the alphabet (this is true even though the keyword wireless appears twice and the keyword charger only appears once since title3 which contains the keyword wireless is already being represented by the keyword mouse and not by the keyword wireless so we revert to using first keyword in alphabet when two keywords represent same number of titles)

In the previous example of 5 titles, the expected output would have the following format:

LinkedHashMap<String, LinkedHashMap<Integer, ArrayList<String>> map = new LinkedHashMap<String, LinkedHashMap<Integer, ArrayList<String>>();

Where String represents the keyword (such as canon or mouse), Integer represents the number of unique titles represented by that keyword (canon = 2, mouse = 2, charger = 1) and ArrayList<String> is a list of titleIDs which is linked to that keyword. For example, the keyword canon would be linked to titles: title1 and title2. The keyword mouse would be linked to titles: title3 and title5 and the keyword charger would be linked to title: title4.

Note: wireless = 0, trap = 0 so it is omitted from the final result.

What is the most efficient way to achieve this?

Thanks

hfontanez
  • 5,774
  • 2
  • 25
  • 37
alpha1
  • 426
  • 3
  • 14
  • Eliminate confusion by referring to things by their correct terms. Instead of "first String" and "second String", refer to them as KEY and VALUE respectively. – hfontanez Apr 11 '21 at 23:06
  • Also, this explanation is very convoluted. Try to summarize it as much as you can. – hfontanez Apr 11 '21 at 23:11
  • @hfontanez thank you for the constructive feedback! I updated the terminology and tried to enhancing the summary. Hopefully, it makes things a bit more clear. – alpha1 Apr 11 '21 at 23:50
  • No problem. I am working on a solution. I should have it ready soon if no interruptions. :) – hfontanez Apr 11 '21 at 23:55
  • I posted my solution. I hope I understood your requirements correctly. – hfontanez Apr 12 '21 at 01:20
  • This is basically a lesson on data structures, preferring composition over inheritance (i.e. I wrapped the tree map in a class rather than extending the map to add more functionality), and the use of Java's `Comparator`. – hfontanez Apr 12 '21 at 03:41

1 Answers1

3

Basically, you need to create a data structure of some kind where you need to keep track of a keyword, the number of titles it is linked to, and the title IDs it is associated to; something like this:

[canon, {2, [ title1, title2 ] } ]

If this is the case, This is bad for at least two reasons:

  1. Using the count as a key is bad because the count will repeat for other elements
  2. You can simply calculate the count by returning the size or length of the list of title IDs for a given keyword.

My proposed solution is two-fold: Using a class to capture your keyword and title ID mappings, and then a set of those mappings objects.

The Class

public class KeywordMapping implements Comparable<KeywordMapping> {
    
    private final String keyword;
    private TreeSet<String> titleIds = new TreeSet<String>();
    private int frequency;
    
    public KeywordMapping(String keyword, String titleId) {
        this.keyword = keyword;
        titleIds.add(titleId);
    }
    
    public String getKeyword () {
        return keyword;
    }
    
    public int getTitleCount () {
        return titleIds.size();
    }
    
    public void addTitleId (String titleId) {
        titleIds.add(titleId);
    }
    
    public String getFirstTitleId () {
        return titleIds.first();
    }
    
    public void incrementFrequency(String keyword) {
        if (this.keyword.equals(keyword)) {
            frequency++;            
        }
    }

    public int getFrequency() {
        return frequency;
    }

    public boolean containsTitle(String titleId) {
        return titleIds.contains(titleId);
    }

    @Override
    public int hashCode () {
        final int prime = 31;
        int result = 1;
        result = prime * result + ( (keyword == null) ? 0 : keyword.hashCode());
        result = prime * result + ( (titleIds == null) ? 0 : titleIds.hashCode());
        return result;
    }

    @Override
    public boolean equals (Object obj) {
        if (this == obj) return true;
        if (obj == null) return false;
        if (getClass() != obj.getClass()) return false;
        KeywordMapping other = (KeywordMapping) obj;
        if (keyword == null) {
            if (other.keyword != null) return false;
        } else if (!keyword.equals(other.keyword)) return false;
        if (titleIds == null) {
            if (other.titleIds != null) return false;
        } else if (!titleIds.equals(other.titleIds)) return false;
        return true;
    }

    @Override
    public String toString () {
        return "KeywordMapping [keyword=" + keyword + ", titleIds=" + titleIds + ", frequency=" + frequency
            + "]\n";
    }

    @Override
    public int compareTo (KeywordMapping o) {
        return COMPARATOR.compare(this, o);
    }

    private static final Comparator<KeywordMapping> COMPARATOR =
        Comparator.comparing( (KeywordMapping keyMapping) -> keyMapping.keyword);
}

This solution is OK. I would rather use Builder Pattern to build this and make this class immutable so that every instance of this class is built only when all the title IDs have been added to the class. However, let me explain a couple of details of this implementation. I used a TreeSet to store the title IDs so that they are arranged in natural order (alphabetically). This is not important for storing title IDs, but it will be important later on. Lastly, the class must implement Comparable<T> if you need to insert these objects into a sortable collection.

The TreeMap of keyword mappings

TreeMap<String, KeywordMapping> mappings = new TreeMap<>();

Using a tree map is necessary to arrange entries in natural order by their key.

I created this simple main method to demonstrate this implementation

public static void main (String[] args) {
    List<String> titles = List.of("title2: canon camera canon",
        "title3: wireless mouse",
        "title1: canon photo printer",
        "title4: wireless charger", "title5: mouse trap");
    
    MasterMap mappings = new MasterMap(); // A class wrapping TreeMap<String, KeywordMapping> mappings (the code is shown later on)
    
    for (String title : titles) {
        String[] tokens = title.split(": ");
        Set<String> keywords = new HashSet<>(Arrays.asList(tokens[1].split(" ")));
        for (String keyword : keywords) {
            KeywordMapping km = mappings.get(keyword); // don't create duplicate keyword mapping objects
            if (km == null) {
                km = new KeywordMapping(keyword, tokens[0]);
            }
            km.addTitleId(tokens[0]);
            km.incrementFrequency(keyword);
            mappings.put(keyword, km); // update existing or new keyword mapping entry
        }
    }
    System.out.println(mappings);
    System.out.println("First entry: " + mappings.firstMapping());
    System.out.println("First title in first entry: " + mappings.firstMapping().getValue().getFirstTitleId());
}

The output of this program is:

camera=KeywordMapping [keyword=camera, titleIds=[title2], frequency=1]
canon=KeywordMapping [keyword=canon, titleIds=[title1, title2], frequency=2]
charger=KeywordMapping [keyword=charger, titleIds=[title4], frequency=1]
mouse=KeywordMapping [keyword=mouse, titleIds=[title3, title5], frequency=2]
photo=KeywordMapping [keyword=photo, titleIds=[title1], frequency=1]
printer=KeywordMapping [keyword=printer, titleIds=[title1], frequency=1]
trap=KeywordMapping [keyword=trap, titleIds=[title5], frequency=1]
wireless=KeywordMapping [keyword=wireless, titleIds=[title3, title4], frequency=2]

First entry: camera=KeywordMapping [keyword=camera, titleIds=[title2], frequency=1]
First title in first entry: title2

Notice how, even though the original title list is not in natural order, after insertion into the KeywordMapping object, the set backing the list of title IDs arranges them in order. This is the same with the order in which the entries in the tree map are printed out.

Lastly, the frequency count is only incremented once per title because the list of title IDs is backed by a Set rather than a List and sets in Java do not allow duplicates. Therefore, for a case like title2: canon camera canon, the keyword canon is counted only once because for that title, the set of keywords is only {cannon, camera}.

Wrapping TreeMap for added functionality

public class MasterMap {
    private TreeMap<String, KeywordMapping> mappings = new TreeMap<>();
    
    public KeywordMapping put(String key, KeywordMapping value) {
        return mappings.put(key, value);
    }
    
    public KeywordMapping get(String key) {
        return mappings.get(key);
    }
    
    public KeywordMapping firstMapping() {
        return mappings.firstEntry().getValue();
    }
    
    public String getKeywordForTitle (String titleId) {
        
        List<KeywordMapping> keywords = new ArrayList<>();
        
        Collection<KeywordMapping> values = mappings.values();
        Iterator<KeywordMapping> iter = values.iterator();
        
        while (iter.hasNext()) {
            KeywordMapping value = iter.next();
            if (value.containsTitle(titleId)) {
                keywords.add(value);
            }
        }
        
        KeywordMapping temp = keywords.stream()
            .max(Comparator.comparingInt(KeywordMapping::getFrequency)
                .thenComparing(KeywordMapping::getKeyword).reversed())
            .get();
        
        return temp.getKeyword();
    }
}

In order to satisfy the last requirement, I decided to wrap the TreeMap into a class so that I could add methods to perform the last action: return the keyword associated with a given title based on the following criteria: return the keyword with the highest frequency or the first keyword in alphabetical order if the frequency is the same. The implemented function DOES NOT manipulate the KeywordMapping objects or the wrapped tree map in any way. What it does is use a Comparator in order to find the matched keyword based on that criteria.

With the added changes (previous code has been update), passing "title3" to the function produces the following output:

Keyword for title3: mouse

This is because the two associated keywords with "title3" {mouse, wireless} have the same frequency (2) but "mouse" is the first keyword in the set in alphabetical order.

hfontanez
  • 5,774
  • 2
  • 25
  • 37
  • @thanks for your incredibly detailed and informative response! I think the only remaining requirement is to remove keywords that represent a title that is already being represented by another keyword. For example, the keyword 'photo' should be removed as title1 is represented by the keyword 'canon'. Same idea applies to keyword 'printer'. A keyword should only be included in result if it represents a title that is not represented by another keyword.. (cont) – alpha1 Apr 12 '21 at 02:16
  • @alpha1 I was about to post an update that, if you want more "metadata" functionality, you should wrap the mappings `TreeMap` in a class and add the methods you would like to have. For example `getKeywordsByTitle(String titleId)` should return `{camera, canon}` if the title ID is `title2`. Likewise, you should be able to create another method to reduce or refine your map with that requirement. – hfontanez Apr 12 '21 at 02:20
  • (cont) If two or more keywords can represent the same title (canon vs photo) we select the one that occurs more frequently (canon). If the frequency is equal we select the one that comes first in alphabet. For this reason, the keyword 'wireless" does not represent ANY title as title3 is already represented by the keyword 'mouse' as it occurs with the same frequency as the keyword 'wireless' but it comes first in alphabet. So 'wireless' can only uniquely represent title4 but since 'charger' comes first in alphabet title4 is represented by the keyword 'charger' rather than 'wireless'. Thanks! – alpha1 Apr 12 '21 at 02:21
  • @alpha1 all of these features you are mentioning belong to another class (call it `MasterMapping` that wraps `TreeMap mappings`. – hfontanez Apr 12 '21 at 02:23
  • @alpha1 I think I completed the implementation. – hfontanez Apr 12 '21 at 03:32
  • 1
    @ hfontanez up-voted your amazing response. I will test the code on my local machine tomorrow morning and accept and award the bounty. Thanks for your enormous effort! UPDATE: I accepted it is the answer as you put much effort into this. Cannot award bounty for another 19 hours. I'll test the code first thing in the morning on my machine. – alpha1 Apr 12 '21 at 03:44
  • @alpha1 you should select this as the answer by clicking on the checkmark next to this post, or other post for that matter if you rather wait for someone else to post. Regardless, let me know if this worked for you. – hfontanez Apr 12 '21 at 03:44
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/231018/discussion-between-hfontanez-and-alpha1). – hfontanez Apr 12 '21 at 03:46