I need a way to rank and display the most relevant data to users. Our data consists of n-grams extracted from social media, which we call 'topics'.
The problem I am facing is that the data contains a lot of near-duplication. No string is an exact duplicate of another, but many are sub-phrases contained in longer ones. To a user, this information appears duplicated. Here's some sample data:
{
"count": 1.0,
"topic": "lazy people"
},
{
"count": 1.0,
"topic": "lazy people taking"
},
{
"count": 1.0,
"topic": "lazy people taking away food stamps"
}
An edge case is that the phrase "lazy people" can also be contained in other phrases, for example "lazy people are happy". Showing only the shortest common phrase ("lazy people" in this case) does not seem like a good idea, because the end user would not see the different contexts ("taking away food stamps" and "are happy").
On the other hand, always showing the longest n-gram may be too much information. For the sample data above, picking the longest n-gram seems logical, but that may not always hold true.
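To make the two extremes concrete, here is a minimal Python sketch of what I mean by grouping topics by containment. It is only an illustration of the problem, not an existing solution: the function names `contains` and `group_by_containment` are made up for this example, and splitting topics on whitespace is an assumption about how the n-grams are tokenized.

```python
def tokens(topic):
    """Split a topic into a tuple of words (assumes whitespace tokenization)."""
    return tuple(topic.split())


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous word sequence inside `longer`."""
    n, m = len(longer), len(shorter)
    return any(longer[i:i + m] == shorter for i in range(n - m + 1))


def group_by_containment(topics):
    """Map each maximal topic to the shorter topics it contains.

    A topic is 'maximal' if no other topic in the list contains it.
    """
    toks = {t: tokens(t) for t in topics}
    maximal = [
        t for t in topics
        if not any(t != u and contains(toks[u], toks[t]) for u in topics)
    ]
    return {
        m: [t for t in topics if t != m and contains(toks[m], toks[t])]
        for m in maximal
    }


topics = [
    "lazy people",
    "lazy people taking",
    "lazy people taking away food stamps",
    "lazy people are happy",
]
groups = group_by_containment(topics)
for longest, shorter in groups.items():
    print(longest, "<-", shorter)
```

With this grouping, "lazy people" ends up under both "lazy people taking away food stamps" and "lazy people are happy", which is exactly the edge case described above: collapsing to the shortest phrase loses the two contexts, while showing only the longest phrases may be more detail than the user wants.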
My overall goal is to present this data in a way that is informative and ranked.
Are there any existing solutions and corresponding algorithms to solve this class of problems?
Note: Initially my question was extremely vague and unclear. That led me to change the question altogether, because what I really need is guidance on what my end result should be.
Note 2: Let me know if I've misused any terms, or if I should modify the title of this question to help others searching for answers to this type of question.