8

I need to convert a list of search terms into the most efficient set of combined search terms. Any word or quoted phrase can be separated by an OR. Many terms can be combined within parentheses. ANDs can also be used.

For example, foo bar and boo bar share bar, so instead of two different search terms, the can be combined as (foo OR boo) AND bar.

Here's what the algorithm needs to do. Given this data set:

foo bar
boo bar
goo bar
hoo doo
foo manchu
moo bar
too bar
foo fighters
"blue kazoo" bar
baz
qux
quux

I want to get the following back:

(foo OR boo OR goo OR moo OR too OR "blue kazoo") AND bar
foo AND (manchu OR fighters)
hoo doo
baz OR qux OR quux

This does not work:

(foo bar) OR (boo bar) OR (goo bar) OR (foo manchu)

I'll be working in PHP, but I'll take the answer in pseudo-code, PHP or I'll convert from major languages.

Ryan
  • 14,682
  • 32
  • 106
  • 179
OneCleverMonkey
  • 422
  • 4
  • 15
  • 1
    Any word or quoted phrase can be separated by an OR. In my example since `foo bar` and `boo bar` share `bar` in common, they can become `(foo OR boo) AND bar` – OneCleverMonkey May 11 '16 at 19:08
  • Just added a bounty to improve (or seek new) answers that solve recursively for _n_ number of words per query (not just 2) as well as preserve quotes. – Ryan May 23 '16 at 16:06
  • http://blog.burntsushi.net/transducers/#finite-state-machines-as-data-structures – Jacob May 27 '16 at 13:56

3 Answers3

4

I got the following code:

function keyMultiSort(&$array,
                      $key,
                      $reverse = false,
                      $priority_last = false,
                      $save_key = true,
                      Callable $func = null)
{
    if ($func === null)
    {
        $func = function ($first, $second) use ($key, $reverse, $priority_last)
        {
            if (!isset($first[$key]))
            {
                return ($reverse === false) ? -1 : 1;
            }
            if (!isset($second[$key]))
            {
                return ($reverse === false) ? 1 : -1;
            }

            if ($first[$key] > $second[$key])
            {
                return ($reverse === false) ? 1 : -1;
            }
            if ($first[$key] < $second[$key])
            {
                return ($reverse === false) ? -1 : 1;
            }
            if ($first[$key] === $second[$key])
            {
                return ($priority_last === false) ? 1 : -1;
            }

            return 0;
        };
    }

    if ($save_key)
    {
        uasort($array, $func);
    }
    else
    {
        usort($array, $func);
    }
}

$array = [
    ['foo', 'bar'],
    ['boo', 'bar'],
    ['goo', 'bar'],
    ['hoo', 'doo'],
    ['foo', 'manchu'],
    ['moo', 'bar'],
    ['too', 'bar'],
    ['foo', 'fighters'],
    ['blue kazoo', 'bar'],
];

$pairs = [];
$str = '';
foreach($array as $item)
{
    if(!isset($pairs[$item[0]]['count']))
    {
        $pairs[$item[0]]['count'] = 1;
    }
    else
    {
        $pairs[$item[0]]['count']++;
    }
    $pairs[$item[0]]['elements'][] = $item[1];

    if(!isset($pairs[$item[1]]['count']))
    {
        $pairs[$item[1]]['count'] = 1;
    }
    else
    {
        $pairs[$item[1]]['count']++;
    }
    $pairs[$item[1]]['elements'][] = $item[0];
    keyMultiSort($pairs, 'count', true);
}

$remove = [];
foreach($pairs as $elm=>$item)
{
    $remove[] = $elm;
    $elements = array_diff($item['elements'], $remove);
    if(empty($elements))
    {
        if (in_array($elm, $remove))
        {
            continue;
        }
        $str .= $elm.PHP_EOL;
    }
    else
    {
        $str .= $elm.' AND ('.implode(' OR ', $elements).')'.PHP_EOL;
    }
    $remove = array_merge($remove, $elements);
}
var_dump($str);

Result:

string(99) "bar AND (foo OR boo OR goo OR moo OR too OR blue kazoo)
foo AND (manchu OR fighters)
hoo AND (doo)
"

It can be optimized, depending on the objectives...

Maxim Tkach
  • 1,607
  • 12
  • 23
  • I just added a bounty to improve (or seek new) answers that solve recursively for _n_ number of words per query (not just 2) as well as preserve quotes. Will you consider modifying your answer to account for this and claim the bounty? – Ryan May 23 '16 at 16:08
3

I understand the logic but you really need to make the question clearer.

Anyway, I see this as a graph problem where we want to find the set of nodes that are have highest degree and can span the whole graph.

enter image description here

I believe if you picture it this way, you can use any data structure you like to serve the purpose. You could create an adjacency list and then find nodes with higher degree and then check to see if all elements are covered through those nodes. The matter of adding AND, OR is just simple afterwards.

Saeed
  • 319
  • 5
  • 21
1

Code for processing more than 2 values

<?php
function keyMultiSort(&$array,
                      $key,
                      $reverse = false,
                      $priority_last = false,
                      $save_key = true,
                      Callable $func = null)
{
    if ($func === null)
    {
        $func = function ($first, $second) use ($key, $reverse, $priority_last)
        {
            if (!isset($first[$key]))
            {
                return ($reverse === false) ? -1 : 1;
            }
            if (!isset($second[$key]))
            {
                return ($reverse === false) ? 1 : -1;
            }

            if ($first[$key] > $second[$key])
            {
                return ($reverse === false) ? 1 : -1;
            }
            if ($first[$key] < $second[$key])
            {
                return ($reverse === false) ? -1 : 1;
            }
            if ($first[$key] === $second[$key])
            {
                return ($priority_last === false) ? 1 : -1;
            }

            return 0;
        };
    }

    if ($save_key)
    {
        uasort($array, $func);
    }
    else
    {
        usort($array, $func);
    }
}

$array = [
    ['foo', 'bar', 'test'],
    ['boo', 'bar'],
    ['goo', 'bar'],
    ['hoo', 'doo', 'test', 'test2'],
    ['foo', 'manchu'],
    ['moo', 'bar'],
    ['too', 'bar'],
    ['foo', 'fighters'],
    ['blue kazoo', 'bar', 'test'],
];

$pairs = [];
$str = '';
foreach($array as $item)
{
    foreach($item as $key=>$elm)
    {
        foreach($item as $key2=>$elm2)
        {
            if($key !== $key2)
            {
                if(!isset($pairs[$elm]['count']))
                {
                    $pairs[$elm]['count'] = 1;
                }
                else
                {
                    $pairs[$elm]['count']++;
                }
                $pairs[$elm]['elements'][] = $elm2;
            }
        }
    }

    keyMultiSort($pairs, 'count', true);
}
//var_dump($pairs);
$remove = [];
foreach($pairs as $elm=>$item)
{
    $remove[] = $elm;
    $elements = array_diff($item['elements'], $remove);
    if(empty($elements))
    {
        if (in_array($elm, $remove))
        {
            continue;
        }
        $str .= $elm.PHP_EOL;
    }
    else
    {
        $str .= $elm.' AND ('.implode(' OR ', array_unique($elements)).')'.PHP_EOL;
    }
}
var_dump($str);

Response:

string(184) "bar AND (foo OR test OR boo OR goo OR moo OR too OR blue kazoo)
test AND (foo OR hoo OR doo OR test2 OR blue kazoo)
foo AND (manchu OR fighters)
hoo AND (doo OR test2)
doo AND (test2)
"

P.S. I hope I have correctly understood the task ...

UPDATE Added code which not ignores "single values". I changed logic:

...

['"yellow balloon"', 'foo', 'bar', 'baz', 'qut'],

...

return:

...

qut AND ("yellow balloon" OR baz)
baz AND ("yellow balloon")

...

It seems to me, for this task, that's correct (to conditions to combine more than 2 values).

function keyMultiSort(&$array,
                      $key,
                      $reverse = false,
                      $priority_last = false,
                      $save_key = true,
                      Callable $func = null)
{
    if ($func === null)
    {
        $func = function ($first, $second) use ($key, $reverse, $priority_last)
        {
            if (!isset($first[$key]))
            {
                return ($reverse === false) ? -1 : 1;
            }
            if (!isset($second[$key]))
            {
                return ($reverse === false) ? 1 : -1;
            }

            if ($first[$key] > $second[$key])
            {
                return ($reverse === false) ? 1 : -1;
            }
            if ($first[$key] < $second[$key])
            {
                return ($reverse === false) ? -1 : 1;
            }
            if ($first[$key] === $second[$key])
            {
                return ($priority_last === false) ? 1 : -1;
            }

            return 0;
        };
    }

    if ($save_key)
    {
        uasort($array, $func);
    }
    else
    {
        usort($array, $func);
    }
}

$array = [
    ['foo', 'bar', 'test'],
    ['boo', 'bar'],
    ['goo', 'bar'],
    ['hoo', 'doo', 'test', 'test2'],
    ['foo', 'manchu'],
    ['moo', 'bar'],
    ['too', 'bar'],
    ['foo', 'fighters'],
    ['"blue kazoo"', 'bar', 'test'],
    ['"red panda"', 'bar', 'test'],
    ['"yellow balloon"', 'foo', 'bar', 'baz', 'qut'],
    ['"red panda"', 'fighters', 'moo'],
    ['"foo fighters"'],
    ['foo'],
    ['bar'],
];

$pairs = [];
$singles = [];
$str = '';
foreach ($array as $item)
{
    foreach ($item as $key => $elm)
    {
        if(count($item) === 1)
        {
            $singles[$elm] = 1;
        }
        else
        {
            if (!isset($pairs[$elm]))
            {
                $pairs[$elm]['count'] = 0;
                $pairs[$elm]['elements'] = [];
            }
            foreach ($item as $key2 => $elm2)
            {
                if ($key !== $key2)
                {
                    $pairs[$elm]['count']++;
                    $pairs[$elm]['elements'][] = $elm2;
                }
            }
        }
    }

    keyMultiSort($pairs, 'count', true);
}
//var_dump($pairs);exit;
$remove = [];
foreach ($pairs as $elm => $item)
{
    $remove[] = $elm;
    $elements = array_diff($item['elements'], $remove);
    $elements = array_unique($elements);
    if (!empty($elements)){
        $str .= $elm.' AND ('.implode(' OR ', $elements).')'.PHP_EOL;
    }
}
foreach ($singles as $elm => $item)
{
    $str .= $elm.PHP_EOL;
}
var_dump($str);

Response:

string(421) "bar AND (foo OR test OR boo OR goo OR moo OR too OR "blue kazoo" OR "red panda" OR "yellow balloon" OR baz OR qut)
test AND (foo OR hoo OR doo OR test2 OR "blue kazoo" OR "red panda")
foo AND (manchu OR fighters OR "yellow balloon" OR baz OR qut)
"red panda" AND (fighters OR moo)
qut AND ("yellow balloon" OR baz)
baz AND ("yellow balloon")
test2 AND (hoo OR doo)
fighters AND (moo)
doo AND (hoo)
"foo fighters"
foo
bar
"

P.S. In my opinion, this problem does not apply to reality

Maxim Tkach
  • 1,607
  • 12
  • 23
  • Not quite. I used the following array in your code and it doesn't show the single queries. Ex. `foo` alone should show up. Also, `"foo fighters"`. See here: $array = [ ['foo', 'bar', 'test'], ['boo', 'bar'], ['goo', 'bar'], ['hoo', 'doo', 'test', 'test2'], ['foo', 'manchu'], ['moo', 'bar'], ['too', 'bar'], ['foo', 'fighters'], ['"blue kazoo"', 'bar', 'test'], ['"red panda"', 'bar', 'test'], ['"yellow balloon"', 'foo', 'bar', 'baz', 'qut'], ['"red panda"', 'fighters', 'moo'], ['"foo fighters"'], ['foo'], ['bar'], ]; – Ryan May 24 '16 at 21:39
  • Shoot bounty expired. I haven't been able to test yet. I will redo bounty and award points if it works. Thanks. – Ryan May 31 '16 at 19:27