1

I need help.
I have a table where only two columns are: ID and NAME and these data:

ID | NAME
1    HOME
2    GAME
3    LINK

And I want show e.g. row with name: HOME if user search: HOME or OMEH or EMOH or HMEO, etc... - all permutations from word HOME.

I can't save to mysql all these permutations and search in this columns, because some words will be a too big (9-10 chars) and more than 40 MB for each 9 chars words.

  • 1
    MySQL supports [`regexp`](https://dev.mysql.com/doc/refman/5.5/en/regexp.html#operator_regexp) Try this one: https://stackoverflow.com/a/3101385/575376 – juergen d Nov 01 '18 at 11:29

2 Answers2

3

One way to solve this problem is to store the sorted set of characters in each name in your database as an additional column and then sort the string the user inputs before searching e.g. database has

ID   NAME   CHARS
1    HOME   EHMO
2    GAME   AEGM
3    LINK   IKLN

Then when searching in PHP you would do this:

$search = 'MEHO';                // user input = MEHO
$chars = str_split($search);
sort($chars);
$search = implode('', $chars);   // now contains EHMO
$sql = "SELECT ID, NAME FROM table1 WHERE CHARS = '$search'";
// perform query etc.

Output

ID   NAME
1    HOME
Nick
  • 138,499
  • 22
  • 57
  • 95
  • @juergend this does not require storing permutations, only the sorted set of characters for each word – Nick Nov 01 '18 at 11:33
  • @juergend, although it does require a little upfront work to sort the values already stored in the database, this is actually a decent solution. – Justin T. Nov 01 '18 at 11:34
  • Pretty good solution; however we may need to use the original word also, so a separated modified copy would have to be stored, instead of modifying the original column. +1 – Madhur Bhaiya Nov 01 '18 at 11:44
  • @MadhurBhaiya You're absolutely right. I was editing as you were typing... – Nick Nov 01 '18 at 11:49
0

This sounds like a "please do my homework for me" question. It is hard to conceive what real world problem this is applicable to and there is no standard solution. It is OK to ask for help with your homework here, but you should state that this is the case.

more than 40 MB for each 9 chars words

Your maths is a bit wonky, but indeed the storage does not scale well. OTOH leaving aside the amount of storage, in terms of the processing workload it does scale well as a solution.

You could simply brute-force a dynamic query:

 function mkqry($word)
 {
     $qry="SELECT * FROM yourtable WHERE 1 ";
     $last=strlen($word);
     for ($x=0; $x<$last; $x==) {
          $qry.=" AND word LIKE '%" . substr($word, $x, 1) . "%'";
     } 
     return $qry;
 }

However this will always result in a full table scan (slow) and won't correctly handle cases where a letter occurs twice in a word.

The solution is to use an indexing function which is independent of the order in which the characters appear - a non-cryptographic hash. An obvious candidate would be to XOR the characters together, although this only results in a one character identifier which is not very selective. So I would suggest simply adding the character codes:

 function pos_ind_hash($word)
 {
     $sum=0;
     for ($x=0; $x<$last; $x==) {
         $sum+=ord(substr($word, $x));
     }
     return $sum;
 }

 function mkqry($word)
 {
     $qry="SELECT * FROM yourtable WHERE 1 ";
     $last=strlen($word);
     for ($x=0; $x<$last; $x==) {
          $qry.=" AND word LIKE '%" . substr($word, $x, 1) . "%'";
     }
     $qry.=" AND yourtable.hash=" .  pos_ind_hash($word);
     return $qry;
 }

Note that the hash mechanism here does not uniquely identify a single word, but is specific enough to reduce the volume to the point where an index (on the hash) would be effective.

Multiplying rather than adding would create fewer collisions but at a greater risk of overflowing (which would create ambiguity between implementations).

But both the hash and the single character LIKE only reduce the number of potential matches. To get the query to behave definitively, you need to go further. You could add an attribute to the table (and to the index with the hash)containing the string length - this would be more selective (i.e. improve effectiveness of the index) but still not definitive.

For a definitive method you would need to specify in your query that the data does NOT contain characters which are NOT in the word you are looking for.

The wrong way to do that would be to add a loop specifying "AND NOT LIKE....".

A valid way of doing that would be to add a test in the query which replaces all the letters in the table attribute which appear in the word you are searching for which results in a zero length string.

symcbean
  • 47,736
  • 6
  • 59
  • 94