I am able to calculate keyword density of a body of text using the following code:
HashMap<String, Integer> frequencies = new HashMap<String, Integer>();
String[] splitTextArr = StringX.splitStrIntoWordsRtrnArr(text);
int articleWordCount = splitTextArr.length;
for (String splitText : splitTextArr) {
    if (frequencies.containsKey(splitText)) {
        // Word seen before: increment its count.
        Integer previousInt = frequencies.get(splitText);
        Integer newInt = ++previousInt;
        frequencies.put(splitText, newInt);
    } else {
        // First occurrence of this word.
        frequencies.put(splitText, 1);
    }
}
I then sort the keyword density list from the most frequent keyword to the least frequent using the following call:
Map<String, Integer> sortedMap = new LinkedHashMapX<String, Integer>().sortByValueInDescendingOrder(frequencies);
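LinkedHashMapX is a custom helper that is not shown here. For readers without it, a minimal sketch of an equivalent sort using only the standard library might look like the following. The class name FrequencySort and the alphabetical tie-break are assumptions for illustration, not necessarily what the helper does.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class FrequencySort {
    // Returns a new map ordered by count (highest first).
    // Assumption: keywords with equal counts are ordered alphabetically.
    static Map<String, Integer> sortByValueDescending(Map<String, Integer> frequencies) {
        return frequencies.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder())
                        .thenComparing(Map.Entry.comparingByKey()))
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        Map.Entry::getValue,
                        (a, b) -> a,            // keys are unique, so this merge is never used
                        LinkedHashMap::new));   // LinkedHashMap preserves the sorted order
    }
}
```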
The above code works as expected; however, I now have to implement an unusual requirement.
SUMMARY: Given a list of title entries, I need to extract enough keywords so that each title in the list is represented by exactly one keyword, and then calculate the total number of titles represented by each keyword (see the example below).
EXAMPLE: Suppose I have the following 5 titles in the form of a LinkedHashMap<String, String>:
title1: canon photo printer
title2: canon camera canon
title3: wireless mouse
title4: wireless charger
title5: mouse trap
Where the KEY represents the titleID (i.e., title1, title2, etc.) and the VALUE represents the actual title.
The raw occurrences of each keyword (in descending order, case insensitive) are as follows:
canon: 2 | mouse: 2 | wireless: 2 | camera: 1 | charger: 1 | photo: 1 | printer: 1 | trap: 1
Note: Each keyword is counted only once per title. Although the keyword canon appears 3 times in total, two of those occurrences are in the same title (title2), so that title contributes only one to the count.
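To make the "once per title" counting concrete, here is a minimal sketch. It assumes titles are lowercased and split on whitespace, which stands in for StringX.splitStrIntoWordsRtrnArr (whose exact behavior is not shown), and the class name TitleKeywordCounts is hypothetical.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class TitleKeywordCounts {
    // For each keyword, counts the number of titles that contain it at least once.
    static Map<String, Integer> countPerTitle(Map<String, String> titles) {
        Map<String, Integer> counts = new TreeMap<>(); // keys kept in alphabetical order
        for (String title : titles.values()) {
            // A set drops repeats within one title, e.g. "canon camera canon".
            Set<String> unique = new TreeSet<>();
            // Assumption: lowercase + whitespace split is an adequate tokenizer here.
            for (String word : title.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    unique.add(word);
                }
            }
            for (String word : unique) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

With the five example titles, canon gets a count of 2 rather than 3, matching the listing above.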
In the previous map, the keyword canon appears in both title1 and title2. Since each title needs to be represented by exactly one keyword, both titles can be represented by the keyword canon. It is not necessary to include the other keywords from title1 and title2 (such as photo, printer and camera), as each title should be represented by exactly one keyword (no more, no less). Although we could technically choose to represent title1 and title2 using the keywords photo and camera (or printer and camera), this would increase the total number of keywords necessary to represent all titles, so it is not desired. In other words, we want to represent all titles using the fewest keywords possible.
The important part is to extract the smallest set of keywords that together "represent" every title in the list exactly once, while keeping track of the number of titles each keyword is linked to and their titleIDs. If, instead of 5 titles, we had a list of 100 titles in which the keyword photo appeared in 95 (i.e., in more titles than the keyword canon), photo would be used in place of canon, and title2 would be represented by the keyword camera.
If two or more keywords can represent the same number of titles, we select the first one in alphabetical order. Thus, the keyword mouse would be used to represent title3 and title5 instead of the keyword wireless. Similarly, title4 would be represented by the keyword charger, since C comes before W in the alphabet. This holds even though wireless appears in two titles and charger in only one: title3, which contains wireless, is already represented by mouse, so wireless and charger each represent one remaining title, and the tie is broken alphabetically.
In the previous example of 5 titles, the expected output would have the following format:
LinkedHashMap<String, LinkedHashMap<Integer, ArrayList<String>>> map = new LinkedHashMap<String, LinkedHashMap<Integer, ArrayList<String>>>();
Where String represents the keyword (such as canon or mouse), Integer represents the number of unique titles represented by that keyword (canon = 2, mouse = 2, charger = 1) and ArrayList<String> is the list of titleIDs linked to that keyword. For example, the keyword canon would be linked to title1 and title2, the keyword mouse would be linked to title3 and title5, and the keyword charger would be linked to title4.
Note: Keywords that end up representing no titles (e.g., wireless = 0, trap = 0) are omitted from the final result.
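To pin down the rules described above, here is a minimal greedy sketch: repeatedly pick the keyword that covers the most still-unrepresented titles, breaking ties alphabetically. The class name KeywordCover and the whitespace tokenizer are assumptions for illustration. Note also that a greedy pass reproduces the example but is not guaranteed to find the true minimum number of keywords for every input, since this is an instance of the set cover problem.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class KeywordCover {
    static LinkedHashMap<String, LinkedHashMap<Integer, ArrayList<String>>> cover(
            Map<String, String> titles) {
        // keyword -> titleIDs containing it; TreeMap iterates keywords alphabetically.
        TreeMap<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> e : titles.entrySet()) {
            // Assumption: lowercase + whitespace split stands in for the real tokenizer.
            for (String word : e.getValue().toLowerCase(Locale.ROOT).trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    index.computeIfAbsent(word, k -> new LinkedHashSet<>()).add(e.getKey());
                }
            }
        }

        Set<String> uncovered = new LinkedHashSet<>(titles.keySet());
        LinkedHashMap<String, LinkedHashMap<Integer, ArrayList<String>>> result =
                new LinkedHashMap<>();

        while (!uncovered.isEmpty()) {
            String best = null;
            ArrayList<String> bestTitles = new ArrayList<>();
            for (Map.Entry<String, Set<String>> e : index.entrySet()) {
                ArrayList<String> hits = new ArrayList<>();
                for (String id : e.getValue()) {
                    if (uncovered.contains(id)) {
                        hits.add(id);
                    }
                }
                // Strict '>' keeps the alphabetically first keyword on ties.
                if (hits.size() > bestTitles.size()) {
                    best = e.getKey();
                    bestTitles = hits;
                }
            }
            if (best == null) {
                break; // remaining titles contain no keywords at all
            }
            LinkedHashMap<Integer, ArrayList<String>> entry = new LinkedHashMap<>();
            entry.put(bestTitles.size(), bestTitles);
            result.put(best, entry);
            uncovered.removeAll(bestTitles);
            index.remove(best);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> titles = new LinkedHashMap<>();
        titles.put("title1", "canon photo printer");
        titles.put("title2", "canon camera canon");
        titles.put("title3", "wireless mouse");
        titles.put("title4", "wireless charger");
        titles.put("title5", "mouse trap");
        // Prints {canon={2=[title1, title2]}, mouse={2=[title3, title5]}, charger={1=[title4]}}
        System.out.println(cover(titles));
    }
}
```

This sketch rescans every keyword on each round, which is simple but not the fastest option for large lists.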
What is the most efficient way to achieve this?
Thanks