De-Duplicating sets of n-grams

Question

I need to come up with a way to sort and display the most relevant data to users. Our data consists of multiple n-grams that are extracted from Social Media. We call these 'topics'.

The problem I am facing is that the data contains a lot of duplication. While no string is an exact duplicate of another, many are substrings of longer ones. To a user, this information appears duplicated. Here's some sample data:

{
    "count": 1.0, 
    "topic": "lazy people"
}, 
{
    "count": 1.0, 
    "topic": "lazy people taking"
}, 
{
    "count": 1.0, 
    "topic": "lazy people taking away food stamps"
}

An edge case is that the phrase "lazy people" can be extracted from other phrases, for example "lazy people are happy". Using the shortest common phrase ("lazy people" in this case) does not seem like a good idea because the end user would not be shown the different contexts ("taking away food stamps" and "are happy").

On the other hand, taking the longest N-Gram may be too much information. In the example I gave above, that seems logical. However, that may not always hold true.

My overall goal is to present this data in a way that is informative and ranked.

Are there any existing solutions and corresponding algorithms to solve this class of problems?

Note: Initially my question was extremely vague and unclear. In fact, that led me to change the question altogether, because what I really need is guidance on what my end result should be.

Note 2: Let me know if I've misused any terms or should modify the title of this question to help others searching for answers to this type of question.

What exactly are you trying to accomplish? There are several ways to reduce the space for your n-grams, depending on what your needs are. — Jim Mischel, Sep 30 '13 at 14:11
I am trying to display a sorted list of all N-Grams without displaying the collisions. An easy example is that if this were all the data I had, both "The World" and "The World is Good" would be displayed as being equivalent in quantity even though it would only be useful to display "The World is Good". Another edge case is that other objects in my database may contain "The World" as a 2-gram but "The World is Alive" as a 4-gram. Does that help? — Kurtis, Sep 30 '13 at 14:21
@JimMischel, I've modified my question entirely. I'm not sure on what the end-result should be -- only that "here is the type of data I have" and "here is my, somewhat generic, goals to achieve with the data". Overall, I think I need someone to help me understand how this sort of information should best be transformed to be presented to a User. — Kurtis, Oct 04 '13 at 18:15

Jim Mischel · Accepted Answer · 2013-10-04T19:28:16.780

This is a hard problem and solutions tend to be very application specific. Typically you'd collect more than just the n-grams and counts. For example, it usually matters whether a particular n-gram is used a lot by a single person, or by a lot of people. That is, if I'm a frequent poster and I'm passionate about wood carving, then the n-gram "wood carving" might show up as a common term, even though I'm the only person who cares about it. On the other hand, there might be many people who are into oil painting, but they post relatively infrequently and so the count for the n-gram "oil painting" is close to the count for "wood carving." It should be obvious that "oil painting" would be relevant to more of your users than "wood carving". Without information about what pages the n-grams come from, it's impossible to say which would be relevant to more users.
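To make the "one frequent poster versus many occasional posters" distinction concrete, here is a minimal sketch that tracks both the total mentions and the number of distinct authors per n-gram. The `(author, n-gram)` input shape, the author names, and the function name are assumptions made for illustration; they are not taken from the question's data.

```python
from collections import defaultdict

# Hypothetical input: one (author, n-gram) pair per mention.
mentions = [
    ("alice", "wood carving"),
    ("alice", "wood carving"),
    ("alice", "wood carving"),
    ("bob", "oil painting"),
    ("carol", "oil painting"),
    ("dave", "oil painting"),
]

def count_mentions_and_authors(mentions):
    """Return {ngram: (total_mentions, distinct_authors)}."""
    total = defaultdict(int)
    authors = defaultdict(set)
    for author, ngram in mentions:
        total[ngram] += 1
        authors[ngram].add(author)
    return {ngram: (total[ngram], len(authors[ngram])) for ngram in total}

stats = count_mentions_and_authors(mentions)
```

Here both n-grams have the same raw count, but "oil painting" is spread across three authors while "wood carving" comes from one, which is the kind of signal a plain count hides.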

A common way to identify the most relevant phrases across a corpus of documents is called TF-IDF: Term frequency-inverse document frequency. Most descriptions you see concern themselves with individual words, but it's simple enough to extend that to n-grams.

This assumes, of course, that you can identify individual documents of some sort. You could consider each individual post as a document, or you could group all of the posts from a user as a larger document. Or maybe all of the posts from a single day are considered a document. How you identify documents is up to you.

A simple TF-IDF model is not difficult to build and it gives okay results for a first cut. You can run it against a sample corpus to get a baseline performance number. Then you can add refinements (see the Wikipedia article and related pages), always testing their performance against your pure TF-IDF baseline.
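As a rough illustration of such a first cut, the sketch below scores word n-grams with the basic TF-IDF formula (term frequency within a document times the log of total documents over documents containing the term). It assumes each post is one document and uses naive whitespace tokenization; the function names and sample posts are made up for the example, and a real system would likely want smoothing and better tokenization.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the list of contiguous word n-grams in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(documents, n=2):
    """Return one {ngram: score} dict per document, using unsmoothed TF-IDF."""
    doc_counts = [Counter(ngrams(doc.lower().split(), n)) for doc in documents]

    # Document frequency: in how many documents does each n-gram appear?
    df = Counter()
    for counts in doc_counts:
        df.update(counts.keys())

    num_docs = len(documents)
    scores = []
    for counts in doc_counts:
        total = sum(counts.values())
        scores.append({
            gram: (count / total) * math.log(num_docs / df[gram])
            for gram, count in counts.items()
        })
    return scores

# Assumption: each post is treated as its own document.
docs = [
    "lazy people taking away food stamps",
    "lazy people are happy",
    "the world is good",
]
scores = tf_idf(docs, n=2)
```

With these sample posts, "lazy people" appears in two of the three documents, so within the first post it scores lower than "taking away", which appears in only one.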

Given the information I have, that's where I would start.

I should mention one thing that may not have been entirely clear. I'd like to keep a sum/count of each of the phrases. In this case, there may be 100 mentions of 'The World', 70 of 'The World is Good' and 30 of 'The World is Round'. When displayed, it would be more interesting to show 'The World is Good' and 'The World is Round' but 'The World' would dwarf them in count. Looking back -- I think I need to put more thought into the end result. Thanks! — Kurtis, Sep 30 '13 at 16:21
@Kurtis: Yes, a complete understanding of what you want to get out of the process is required. For example, things get much more complicated if you also want to merge suffixes as in "feed the world" and "love the world". — Jim Mischel, Sep 30 '13 at 16:35

score 0 · Answer 2 · answered Sep 30 '13 at 13:47

Consider using a graph database with one set of nodes for the words that make up the N-Grams and another set of nodes for the N-Grams themselves, with arcs from each N-Gram to the words it contains.

As an implementation, you can use neo4j, which also has a Python library: http://www.coolgarif.com/brain-food/getting-started-with-neo4j-in-python
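The word-to-N-Gram structure can be tried out in memory before committing to a graph database. The following is only a sketch of the idea using plain Python dictionaries; it does not use neo4j or any of its APIs, and the function names are hypothetical. Note that it matches on shared words regardless of order, so it finds candidates for overlap rather than strict substrings.

```python
from collections import defaultdict

def build_word_index(topics):
    """Map each word to the set of topics (N-Grams) that contain it."""
    word_to_topics = defaultdict(set)
    for topic in topics:
        for word in topic.split():
            word_to_topics[word].add(topic)
    return word_to_topics

def topics_containing_all_words(word_to_topics, topic):
    """Return the other topics that contain every word of `topic`."""
    words = topic.split()
    if not words:
        return set()
    candidates = set.intersection(
        *(word_to_topics.get(word, set()) for word in words)
    )
    candidates.discard(topic)
    return candidates

topics = [
    "lazy people",
    "lazy people taking",
    "lazy people are happy",
    "the world",
]
index = build_word_index(topics)
related = topics_containing_all_words(index, "lazy people")
```

Following the arcs from the words "lazy" and "people" leads to the longer topics that extend "lazy people", which is the grouping a de-duplication step would need.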
