I have a list of strings that I want to classify into groups. I then want to show one string from each group.

Say my list looks like this:

  • The quick brown fox jumps over the lazy dog
  • The quick brown fox jumps over the lazy dog!!!!
  • The brown fox jumps over the lazy dog
  • Zing, dwarf jocks vex lymph
  • dwarf jocks vex lymph123
  • I love cookies

Then I want to show something like this (one string from each class):

  • The quick brown fox jumps over the lazy dog
  • dwarf jocks vex lymph123
  • I love cookies

I know trigrams are a very easy and useful solution for classifying strings into "strings which are similar" and "strings which are different". I'm also pretty sure they can be used for dividing a list of strings into classes, but I'm not sure how.

Can anyone here help me, or should I use something completely different?

I would much rather have a method that is simple and maintainable than one that is highly accurate.

lejlot
Markus

2 Answers

You can use nearly any clustering technique and simply select one representative from each cluster. One of the simplest approaches would be to run k-medoids over the n-grams of your texts and print out each cluster's medoid (k-medoids requires the cluster centers to be members of the data set, so every medoid is an actual string from your list).
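As a rough illustration, here is a minimal sketch in plain Python. Several things in it are assumptions rather than part of the approach itself: the helper names (`trigrams`, `trigram_distance`, `k_medoids`), the normalization (lowercasing and dropping everything except letters and spaces), the use of Jaccard distance between trigram sets, and the farthest-first seeding. It uses the simple alternating variant of k-medoids (assign, then re-pick each medoid), not full PAM, and it assumes you already know the number of groups `k` (with 1 <= k <= len(strings)).

```python
import re


def trigrams(text):
    """Return the set of character trigrams of a normalized string."""
    # Assumed normalization: lowercase, keep only letters and spaces.
    cleaned = re.sub(r"[^a-z ]+", "", text.lower())
    padded = "  " + " ".join(cleaned.split()) + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_distance(a, b):
    """Jaccard distance between trigram sets (0 = same, 1 = disjoint)."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    return 1.0 - len(ta & tb) / len(union) if union else 0.0


def k_medoids(strings, k, max_iter=100):
    """Return k representative strings (medoids), one per cluster."""
    n = len(strings)
    dist = [[trigram_distance(a, b) for b in strings] for a in strings]

    # Deterministic farthest-first seeding, starting from the first string.
    medoids = [0]
    while len(medoids) < k:
        candidates = [i for i in range(n) if i not in medoids]
        medoids.append(
            max(candidates, key=lambda i: min(dist[i][m] for m in medoids))
        )

    for _ in range(max_iter):
        # Assignment step: attach every string to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            nearest = min(medoids, key=lambda m: dist[i][m])
            clusters[nearest].append(i)

        # Update step: the new medoid is the member with the smallest
        # total distance to the rest of its cluster.
        new_medoids = [
            min(members, key=lambda c: sum(dist[c][j] for j in members))
            if members else m  # keep the old medoid if a cluster is empty
            for m, members in clusters.items()
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids

    return [strings[m] for m in medoids]


strings = [
    "The quick brown fox jumps over the lazy dog",
    "The quick brown fox jumps over the lazy dog!!!!",
    "The brown fox jumps over the lazy dog",
    "Zing, dwarf jocks vex lymph",
    "dwarf jocks vex lymph123",
    "I love cookies",
]
for representative in k_medoids(strings, 3):
    print(representative)
```

On the list from the question with `k=3`, this should print one string from each of the three groups. If you do not know `k` in advance, a distance threshold (start a new group whenever a string is too far from every existing representative) may be a simpler alternative.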

lejlot

You haven't said what criterion should be used to group the strings, and it is not clear from your question. It could be almost anything:

  • the string length falls within some range
  • certain letters are present (or absent) in the string
  • certain words are present (or absent) in the string
  • the strings are close by some metric (e.g. Levenshtein distance)
  • the strings are close in meaning
  • and hundreds more...

Please state exactly what the classification criterion is in your case.

iryndin