First of all, this is not a language-specific question. The example below uses PHP, but it's more about the method (regex?) for finding the answer.

Let's say I have an array:

$array = ['The Bert and Ernie game', 'The Bert & Ernie game', 'Bert and Ernie game', 'Bert and Ernie game - english version', 'Bert & Ernie (game)', 'Bert and Ernie - game']; // etc.

I want to fetch the combination of words that best represents the most important occurrences. So I want to do something like:

$magicPattern = [something that renders most important occurrences];
preg_match($magicPattern, $array, $matches);
print_r($matches);

As an output I would like to receive something like: "Bert and Ernie game"

PS: I'm not necessarily looking for an actual array; a concept for doing this would be great too.

UPDATE:
Current code below. Any thoughts on whether this would be a good way of finding the best version of an occurrence? I'm having a hard time figuring it out from the source of the function.

$array['The Bert and Ernie game']               =0; //lev distance
$array['The Bert & Ernie game']                 =0; //lev distance
$array['Bert and Ernie game']                   =0; //lev distance
$array['Bert and Ernie game - english version'] =0; //lev distance
$array['Bert & Ernie (game)']                   =0; //lev distance
$array['Bert and Ernie - game']                 =0; //lev distance

foreach($array as $currentKey => $currentVal){
    foreach($array as $matchKey => $matchVal){
        $array[$currentKey] += levenshtein($currentKey, $matchKey);
    }
}

$array = array_flip($array);
ksort($array);

echo array_values($array)[0]; //Bert and Ernie game
Bob van Luijt
  • How can the program possibly tell what is important or unimportant? – psmears Jun 17 '15 at 09:37
  • Fair enough. Maybe the word 'important' isn't well chosen, but the goal of the question makes sense, right? – Bob van Luijt Jun 17 '15 at 09:38
  • Not really, unless you can say in a bit more detail what you mean. Do you mean the words individually happen most frequently? Occur most frequently in the same string? Occur most frequently adjacent to each other? Something else? It's a lot easier to help if you can tell us what you actually want :) – psmears Jun 17 '15 at 10:01

2 Answers

There are many different solutions to a problem like this; personally, I wouldn't recommend a regex for it. This is typically something you would solve with a full-text search index (search for "fulltext search" to find many methods for doing this).

For this particular case, assuming you don't have too much data, you could just compute the Levenshtein distance: http://php.net/manual/en/function.levenshtein.php

Or use the similar_text() function: http://php.net/manual/en/function.similar-text.php
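
As a rough illustration of what these two built-in functions return, here is a minimal sketch. The helper name `compareStrings` is made up for this example and is not part of PHP or of the question's code.

```php
<?php
// Hypothetical helper (an assumption for illustration): compares two strings
// with both built-in functions and returns the results side by side.
function compareStrings(string $a, string $b): array
{
    // levenshtein() returns the minimum number of single-character
    // insertions, deletions and replacements needed to turn $a into $b.
    $distance = levenshtein($a, $b);

    // similar_text() returns the number of matching characters and writes
    // a similarity percentage into its third argument (passed by reference).
    $percent = 0.0;
    $matching = similar_text($a, $b, $percent);

    return ['distance' => $distance, 'matching' => $matching, 'percent' => $percent];
}

$result = compareStrings('Bert and Ernie game', 'Bert & Ernie game');

// The distance here is 3: replace "a" with "&", then delete "n" and "d".
printf(
    "distance: %d, matching chars: %d, similarity: %.1f%%\n",
    $result['distance'],
    $result['matching'],
    $result['percent']
);
```

A lower Levenshtein distance means the strings are closer, whereas a higher `similar_text()` percentage means they are closer, so the two values have to be ranked in opposite directions.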

Wolph
  • Thanks Wolph, I think the lev. function is the best starting point. However, I also need the solution. By this I mean that, in my cheesy example, I already need to know that "Bert and Ernie game" is the best solution so that I can match the other ones against it. I can loop through my array to set `x` and match it against `y`. The problem is that I don't know `y`... Any thoughts? – Bob van Luijt Jun 17 '15 at 15:54
  • @bvl: the best way to get the "closest" match is to build a distance matrix from every sentence to every other sentence. Note that this is very heavy for many items. For `n` items it's `n*n` so with `100` items you'll be calculating `10000` distances (actually half of that since `a -> b = b -> a`). You simply need a 2 dimensional matrix and 2 levels in your loop – Wolph Jun 27 '15 at 13:24
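
The distance matrix described in the comment above could be sketched roughly as follows. The function name `closestToAll` is an assumption made for this example. The sketch computes each pair only once, using the symmetry `a -> b = b -> a`, and returns the string with the smallest total distance to all the others.

```php
<?php
// Hypothetical helper (an assumption for illustration): builds a symmetric
// Levenshtein distance matrix and returns the string whose row sum is lowest.
function closestToAll(array $strings): ?string
{
    $strings = array_values($strings); // ensure 0-based numeric indexes
    $n = count($strings);
    if ($n === 0) {
        return null;
    }

    // n x n matrix, initialised to 0 (the diagonal stays 0).
    $matrix = array_fill(0, $n, array_fill(0, $n, 0));

    // Two loop levels; the inner one starts at $i + 1, so each pair is
    // computed once and mirrored into the other half of the matrix.
    for ($i = 0; $i < $n; $i++) {
        for ($j = $i + 1; $j < $n; $j++) {
            $distance = levenshtein($strings[$i], $strings[$j]);
            $matrix[$i][$j] = $distance;
            $matrix[$j][$i] = $distance;
        }
    }

    // Total distance from each string to every other string.
    $totals = array_map('array_sum', $matrix);

    // Index of the first row with the smallest total.
    $bestIndex = array_search(min($totals), $totals, true);

    return $strings[$bestIndex];
}

$titles = [
    'The Bert and Ernie game',
    'The Bert & Ernie game',
    'Bert and Ernie game',
    'Bert and Ernie game - english version',
    'Bert & Ernie (game)',
    'Bert and Ernie - game',
];

echo closestToAll($titles), "\n";
```

This still performs on the order of `n*n/2` distance calculations, so it is only practical for fairly small lists. When several strings share the lowest total, this sketch simply returns the first of them.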

You need something that looks at each value and computes a numerical weight; then sort the array by weight and take the topmost item.

The weight is your "importance", so you can, for example, choose to assign higher weights to terms you consider more important.
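
One possible way to sketch this weighting idea is shown below. The term weights, the penalty for unlisted words, and the function name `rankByWeight` are all assumptions chosen for illustration; they would need to be tuned to whatever "important" means for the data at hand.

```php
<?php
// Hypothetical helper (an assumption for illustration): scores each value by
// summing the weights of the terms it contains, subtracting a penalty for
// every word that has no weight, and returns value => score, best first.
function rankByWeight(array $values, array $termWeights, float $penalty = 1.0): array
{
    $scores = [];
    foreach ($values as $value) {
        // Split into lowercase words, dropping punctuation such as "&" or "-".
        $words = preg_split('/[^a-z0-9]+/', strtolower($value), -1, PREG_SPLIT_NO_EMPTY);

        $score = 0.0;
        foreach ($words as $word) {
            $score += isset($termWeights[$word]) ? $termWeights[$word] : -$penalty;
        }
        $scores[$value] = $score;
    }

    arsort($scores); // highest weight first, keys (the values) preserved

    return $scores;
}

// Assumed weights: the names matter most, the other listed words a little less.
$weights = ['bert' => 2, 'and' => 1, 'ernie' => 2, 'game' => 1];

$ranked = rankByWeight([
    'The Bert and Ernie game',
    'The Bert & Ernie game',
    'Bert and Ernie game',
    'Bert and Ernie game - english version',
    'Bert & Ernie (game)',
    'Bert and Ernie - game',
], $weights);

print_r($ranked);
```

With these particular weights, "Bert and Ernie game" and "Bert and Ernie - game" both score 6 and share the top spot, because punctuation is ignored when splitting into words; a tie-breaker such as string length would be one more assumption to add.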

Ali