I have a regex/soundex type method:

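using System.Linq;
using System.Text.RegularExpressions;

public static string SoundEx(string word)
{
    // Purely numeric input is returned unchanged.
    if (word.All(char.IsDigit))
    {
        //sentenceParts = words;
        return word;
    }

    word = word.ToUpper();

    // Keep the first character as-is and encode everything after it.
    string rest = word.Substring(1);
    rest = Regex.Replace(rest, "[AEIOUYHW]", "");   // drop vowels, H, W and Y
    rest = Regex.Replace(rest, "[BFPV]+", "1");     // each run of a group becomes one digit
    rest = Regex.Replace(rest, "[CGJKQSXZ]+", "2");
    rest = Regex.Replace(rest, "[DT]+", "3");
    rest = Regex.Replace(rest, "[L]+", "4");
    rest = Regex.Replace(rest, "[MN]+", "5");
    rest = Regex.Replace(rest, "[R]+", "6");
    word = word[0] + rest;

    return word;//word.PadRight(4, '0').Substring(0, 4);
}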

This works fine on single-word strings, but as soon as I pass in a sentence it starts behaving oddly.

"The big brown cat." and "The big brown dog."

come up as a match. I understand that it keeps the first character of the first word and then uses the regexes to drop the vowels and map the remaining consonants to digits. But how can I apply this to an entire sentence and make it more accurate?

VinnyGuitara
  • What is it supposed to do? – Kenny Lau Apr 28 '16 at 15:41
  • Take answers to a "quiz" and Soundex them to tolerate minor spelling errors. It works for single words: Retroactive vs. Ritroactive returns a match, but Retroactive vs. Metroactive returns no match, etc. – VinnyGuitara Apr 28 '16 at 15:44
  • I still don't get it. What does `Retroactive` return? – Kenny Lau Apr 28 '16 at 15:49
  • Let me rephrase: the function simply takes in strings and returns a value based on the regexes, and my program then compares those values to see if they are similar. "Foot" and "Fut" should return a match; "Foot" and "Food" should not. – VinnyGuitara Apr 28 '16 at 15:52
  • What do you mean by match? The results for your example are different: "T 12 165 23." vs "T 12 165 32." – Sign Apr 28 '16 at 15:53
  • Then remove all the punctuation from the sentences and then split by whitespace? – Kenny Lau Apr 28 '16 at 15:53
  • @KennyLau - That is what I thought it would take. I just figured maybe someone had more experience in this and would know a better technique. – VinnyGuitara Apr 28 '16 at 15:54
  • Don't `Foot`, `Fut`, `Food` all return `F3`? – Kenny Lau Apr 28 '16 at 15:55
  • @Sign - when I run the two sentences from my OP through the function I get "T12" for both. You are seeing "T 12 165 23"... that is very interesting. Did you just use the code I posted? – VinnyGuitara Apr 28 '16 at 15:56
  • @KennyLau - You can try it with the two sentences in the OP ("The Big Brown..."). I get "T12" for both, but user Sign says he gets more detailed results. – VinnyGuitara Apr 28 '16 at 15:56
  • Yeah, I used your code. It seems like you are just doing the first word, not the whole sentence. – Sign Apr 28 '16 at 16:15
  • Soundex was designed to find English names that sound similar when spoken and is not suited to anything else. If you explain what you want these codes for, we may be able to recommend a better alternative. – Dour High Arch Apr 28 '16 at 16:16
  • Thanks @Sign, you pointed me in the right direction. It was this: `word.PadRight(4, '0').Substring(0, 4);` I know it's commented out in the code above, but that was not the code in my project; I made a mistake when I posted. I guess I had the solution all along in my test project. With that call removed I now get "T 12 165 23" vs "T 12 165 32", which is exactly what I was after. I guess I shouldn't rush to post. Thanks for the nudge in the right direction. – VinnyGuitara Apr 28 '16 at 16:16
  • @DourHighArch - I am simply looking for a way to allow slight spelling mistakes in a console-based quiz game. This is not homework, lol. It is for a quick quiz on company acronyms. It's actually just for me personally, to learn new acronyms and company information. If you have a better solution than Soundex, I am all ears. – VinnyGuitara Apr 28 '16 at 16:18

1 Answer

You have to Soundex each word separately. That turns the sentence into a set of 4-character codes instead of a string of characters. You then compare the sets against each other.

So your example becomes "T000 B200 B650 C300" vs. "T000 B200 B650 D200".
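As a rough sketch of the per-word approach (not the asker's regex version): the class and method names below (`SentenceSoundex`, `Encode`, `EncodeSentence`, `SoundsAlike`) are made up for illustration, and it assumes classic four-character Soundex, whitespace-separated words, punctuation ignored, and a position-by-position comparison of the codes.

```csharp
using System;
using System.Linq;
using System.Text;

public static class SentenceSoundex
{
    // Soundex digit for a letter, or '0' for letters that are not coded.
    private static char Code(char c)
    {
        switch (c)
        {
            case 'B': case 'F': case 'P': case 'V': return '1';
            case 'C': case 'G': case 'J': case 'K':
            case 'Q': case 'S': case 'X': case 'Z': return '2';
            case 'D': case 'T': return '3';
            case 'L': return '4';
            case 'M': case 'N': return '5';
            case 'R': return '6';
            default: return '0'; // vowels, H, W, Y
        }
    }

    // Classic four-character Soundex code for a single word.
    public static string Encode(string word)
    {
        // Keep A-Z only, so trailing punctuation such as "cat." is ignored.
        string letters = new string(
            word.ToUpperInvariant().Where(c => c >= 'A' && c <= 'Z').ToArray());
        if (letters.Length == 0) return word; // nothing to encode (e.g. a number)

        var sb = new StringBuilder();
        sb.Append(letters[0]);
        char previous = Code(letters[0]);
        foreach (char c in letters.Skip(1))
        {
            char code = Code(c);
            if (code != '0' && code != previous) sb.Append(code);
            // H and W do not separate letters with the same code; vowels do.
            if (c != 'H' && c != 'W') previous = code;
        }
        return sb.ToString().PadRight(4, '0').Substring(0, 4);
    }

    // One code per whitespace-separated word.
    public static string[] EncodeSentence(string sentence)
    {
        return sentence
            .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(Encode)
            .ToArray();
    }

    // Assumption: two sentences match when their codes are equal word by word.
    public static bool SoundsAlike(string a, string b)
    {
        return EncodeSentence(a).SequenceEqual(EncodeSentence(b));
    }
}
```

With this sketch, `EncodeSentence("The big brown cat.")` gives T000, B200, B650, C300 and the "dog" sentence ends in D200, so `SoundsAlike` returns false for the pair. Comparing word by word is only one option; a looser rule (say, allowing one mismatched word) may suit a quiz better.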

I would recommend using the Double Metaphone algorithm instead of Soundex, as it's much, much better. It also does not rely on the first letter remaining the same, a Soundex limitation that prevents it from matching words like Fishing and Phishing.

gbjbaanb
  • Thanks for the information. I will look into the Double Metaphone algorithm. Is this something that can be implemented in .NET? – VinnyGuitara Apr 28 '16 at 16:22
  • @VinnyGuitara Easily. It's a more complicated algorithm, but nothing impossible to understand, and chances are there are already libraries for it. – gbjbaanb Apr 29 '16 at 08:52