5

I have two list of words, suppose LIST1 and LIST2. I want to compare LIST1 against LIST2 to find the duplicates, but it should find the plural of the word as well as ing form also. For example.

Suppose LIST1 has word "account", and LIST2 has words "accounts,accounting" When i do compare the result should show two match for word "account".

I am doing it in PHP and have the LIST in mysql tables.

quickshiftin
  • 66,362
  • 10
  • 68
  • 89
daron
  • 49
  • 1

5 Answers5

5

You can use a technique called porter stemming to map each list entry to its stem, then compare the stems. An implementation of the Porter Stemming algorithm in PHP can be found here or here.

Mark Baker
  • 209,507
  • 32
  • 346
  • 385
  • Nice. Had never heard of that technique before. – Veign Dec 06 '10 at 17:43
  • That's why its just gone on my research list. its pretty simple from what I can gather – RobertPitt Dec 06 '10 at 17:47
  • thanks for the algorithm. it was nice to know. but this didn't really worked in all cases for example "office supplies" was changed to "office suppli" , "full house" changed to "full hous" and many other issues. – daron Dec 06 '10 at 19:03
  • Look at the more general topic of stemming then. Porter stemming is considered among the best algorithms for English language stemming, but it is just one algorithm among many. Wikipedia has a full page on the subject http://en.wikipedia.org/wiki/Stemming – Mark Baker Dec 06 '10 at 22:46
  • I'd like to add [another option](https://pecl.php.net/package/stem), there's a native extension which handles other languages besides English and is probably pretty fast. – quickshiftin Apr 19 '16 at 18:36
  • BTW @MarkBaker I added a solution which uses an inflector in conjunction with a stemmer, if you have a min to scope it out that would be awesome! – quickshiftin Apr 19 '16 at 19:49
0

You might consider using the Doctrine Inflector class in conjunction with a stemmer for this.

Here's the algorithm at a high level

  1. Split search string on spaces, process words individually
  2. Lowercase the search word
  3. Strip special characters
  4. Singularize, replace differing portion with wildcard ('%')
  5. Stem, replace differing portion with wildcard ('%')

Here's the function I put together

/**
 * Use inflection and stemming to produce a good search string to match subtle
 * differences in a MySQL table.
 *
 * @string $sInputString The string you want to base the search on
 * @string $sSearchTable The table you want to search in
 * @string $sSearchField The field you want to search
 */
function getMySqlSearchQuery($sInputString, $sSearchTable, $sSearchField)
{
    $aInput  = explode(' ', strtolower($sInputString));
    $aSearch = [];
    foreach($aInput as $sInput) {
        $sInput = str_replace("'", '', $sInput);

        //--------------------
        // Inflect
        //--------------------
        $sInflected = Inflector::singularize($sInput);

        // Otherwise replace the part of the inflected string where it differs from the input string
        // with a % (wildcard) for the MySQL query
        $iPosition = strspn($sInput ^ $sInflected, "\0");

        if($iPosition !== null && $iPosition < strlen($sInput)) {
            $sInput = substr($sInflected, 0, $iPosition) . '%';
        } else {
            $sInput = $sInput;
        }

        //--------------------
        // Stem
        //--------------------
        $sStemmed = stem_english($sInput);

        // Otherwise replace the part of the inflected string where it differs from the input string
        // with a % (wildcard) for the MySQL query
        $iPosition = strspn($sInput ^ $sStemmed, "\0");

        if($iPosition !== null && $iPosition < strlen($sInput)) {
            $aSearch[] = substr($sStemmed, 0, $iPosition) . '%';
        } else {
            $aSearch[] = $sInput;
        }
    }

    $sSearch = implode(' ', $aSearch);
    return "SELECT * FROM $sSearchTable WHERE LOWER($sSearchField) LIKE '$sSearch';";
}

Which I ran with several test strings

Input String: Mary's Hamburgers
SearchString: SELECT * FROM LIST2 WHERE LOWER(some_field) LIKE 'mary% hamburger%';

Input String: Office Supplies
SearchString: SELECT * FROM LIST2 WHERE LOWER(some_field) LIKE 'offic% suppl%';

Input String: Accounting department
SearchString: SELECT * FROM LIST2 WHERE LOWER(some_field) LIKE 'account% depart%';

Probably not perfect, but it's a good start anyway! Where it will fall down is when multiple matches are returned. There's no logic to determine the best match. That's where things like MySQL fulltext and Lucene come in. Thinking about it a little more, you might be able to use levenshtein to rank multiple results with this approach!

Community
  • 1
  • 1
quickshiftin
  • 66,362
  • 10
  • 68
  • 89
0

What I would do is take your word and compare it directly to LIST2 and at the same time remove your word from every word your're comparing looking for a left over ing, s, es to denote a plural or ing word (this should be accurate enough). If not you'll have to generate an algorithm for making plurals out of words as it not as simple as adding an S.

Duplicate Ending List
s
es
ing

LIST1
Gas
Test

LIST2
Gases
Tests
Testing

Now compare List1 to List2. During the same loop of comparison do a direct comparision to items and one where the word, from list 1, is removed from the current word you're looking at in list 2. Now just check is this result is in the Duplicate Ending List.

Hope that makes sense.

Veign
  • 608
  • 4
  • 7
0

The problem with that is, in English at least, plurals are not all standard extensions, nor are present participles. You can make an approximation by using all words +'ing' and +'s', but that will give false positives and negatives.

You can handle it directly in MySQL if you wish.

SELECT DISTINCT l2.word
  FROM LIST1 l1, LIST l2
  WHERE l1.word = l2.word OR l1.word + 's' = l2.word OR l1.word + 'ing' = l2.word;
Orbling
  • 20,413
  • 3
  • 53
  • 64
0

This function will output the plural of a word.

http://www.exorithm.com/algorithm/view/pluralize

Something similar could be written for gerunds and present participles (ing forms)

Mike C
  • 1,808
  • 15
  • 17
  • Thanks Mike C this script was simple and easy to understand and helped me get some idea – daron Dec 15 '10 at 21:29