I made a program that reads an input string, checks whether it is one of certain emojis, and returns a number depending on which emoji it is.

The problem comes with emojis that have different genders. For example, the policeman emoji doesn't get detected. I tried comparing the string with "‍", but it wasn't detected. I tried adding the male symbol and comparing the string with "‍♂️♂️", but it didn't work either.

Example of a piece of my code:

                case "":
                case "":
                    Send(args[1] + " 70%", update.Message.Chat.Id);
                    break;
                case "":
                case "":
                case "":
                case "":
                    Send(args[1] + " 40%", update.Message.Chat.Id);
                    break;

All of them work except for two of the emojis, which happen to be the ones with different genders.

Not sure if it matters, but language is C# and I'm programming in Visual Studio, which lets me copy and paste the emojis in there.

What am I doing wrong?

  • In C# (and .NET in general), strings are sequences of UTF-16 code units. Since a lot of emoji have code points beyond U+FFFF, they will be encoded as two code units (a "surrogate pair") instead of one code unit, which may be the cause of some of your trouble. I can't speak to the exact problem, however. – Joe Sewell Feb 22 '19 at 17:41
  • It does matter that this is C#, since that means these are almost certainly being compared byte-by-byte. You'll want to look at what precise bytes you're passing in, and what precise bytes you've pasted into the source code. In a unixy world, I would use "xxd" to dump each of them and examine the bytes. I'm not as certain about good raw-data dumping tools on Windows. I suggest looking at `String.Normalize()` as a possible solution. – Rob Napier Feb 22 '19 at 18:54

1 Answer

I tried comparing the string with "‍", but it wasn't detected.

The police emoji quoted above is made of two Unicode "characters", better called code points: the Police Officer, U+1F46E, followed by U+200D, the Zero Width Joiner (ZWJ), which joins emoji into a single sequence. If the case statement contains only the Police Officer U+1F46E, the input will not match.

You must be sure that the emojis that you pasted in the code are identical to the emoji that you received in the input string. Just displaying the string is confusing because they seem equal but aren't.

In the source code I would place the ‍ as a comment and in the string of the case statement the Police Officer using the Codepoint escaping "\U0001F46E".

case "\U0001F46E":        // ‍
case "\U0001F46E\u200D":  // ‍ + ....

Or

const string PoliceOfficer = "\U0001F46E"; // ‍
...
case PoliceOfficer: 

Notice the different escapes: uppercase \U takes 8 hex digits and lowercase \u takes 4 hex digits. When a string is not recognized, print it out (possibly in the debugger), work out the escapes that build the string, and add it to the case statements.
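One way to get those escapes is to dump every UTF-16 code unit of the received string. The following is only a minimal sketch: the class name `EmojiDebug` and the method `ToEscapedLiteral` are hypothetical helpers, not part of .NET, and the sample input is the Man Police Officer sequence.

```csharp
using System;
using System.Text;

static class EmojiDebug
{
    // Hypothetical helper: renders each UTF-16 code unit of the input
    // as a \uXXXX escape, ready to paste into a C# string literal.
    public static string ToEscapedLiteral(string s)
    {
        var sb = new StringBuilder(s.Length * 6);
        foreach (char c in s)
        {
            sb.Append("\\u").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }

    static void Main()
    {
        // Man Police Officer: U+1F46E U+200D U+2642 U+FE0F
        string input = "\U0001F46E\u200D\u2642\uFE0F";

        // prints \uD83D\uDC6E\u200D\u2642\uFE0F
        Console.WriteLine(ToEscapedLiteral(input));
    }
}
```

U+1F46E shows up as the surrogate pair \uD83D\uDC6E because it lies beyond U+FFFF. The printed text contains only ASCII, so it is readable even in a console that shows emojis as "?".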

As an alternative, you could first remove all the joiner and modifier characters, like "\u200D", from the input string and then pass the result to the case statement. You could then, if needed, give an additional meaning to the removed characters.
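A rough sketch of that stripping step could look like the code below. `EmojiNormalizer` and `StripModifiers` are hypothetical names, and the set of removed code points (U+200D, U+FE0F, the gender signs U+2640 and U+2642, and the skin-tone modifiers U+1F3FB to U+1F3FF) is an assumption that covers police-officer style sequences, not a complete list of emoji modifiers.

```csharp
using System;
using System.Text;

static class EmojiNormalizer
{
    // Hypothetical helper: drops the joiner, variation selector-16,
    // gender signs, and skin-tone modifiers, keeping the base emoji.
    public static string StripModifiers(string s)
    {
        var sb = new StringBuilder(s.Length);
        for (int i = 0; i < s.Length; i++)
        {
            // Code points above U+FFFF occupy two chars (a surrogate pair).
            bool pair = char.IsSurrogatePair(s, i);
            int cp = pair ? char.ConvertToUtf32(s[i], s[i + 1]) : s[i];

            bool skip = cp == 0x200D                        // zero width joiner
                     || cp == 0xFE0F                        // variation selector-16
                     || cp == 0x2640 || cp == 0x2642        // female / male sign
                     || (cp >= 0x1F3FB && cp <= 0x1F3FF);   // skin tones

            if (!skip) sb.Append(s, i, pair ? 2 : 1);
            if (pair) i++;                                  // skip the low surrogate
        }
        return sb.ToString();
    }

    static void Main()
    {
        // Man Police Officer: Light Skin Tone
        string input = "\U0001F46E\U0001F3FB\u200D\u2642\uFE0F";

        // prints True
        Console.WriteLine(StripModifiers(input) == "\U0001F46E");
    }
}
```

With this in place the switch would run on `StripModifiers(args[1])` and need only `case "\U0001F46E":` for every police officer variant. The trade-off is that gender and skin tone are discarded, and a standalone male or female sign would be stripped to an empty string.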

andreaplanet
  • I tried using Console.WriteLine(emoji); but all it writes is "????". Asking so I can find the codepoints of other emojis by myself. Thanks for the reply, though. – Jaime Fernández Feb 23 '19 at 14:43
  • Printing the emojis as a string doesn't show the code points, only the string. The ? you see is a separate issue with console output. You can print the hex value of each character in your string in the format "\unnnn", where nnnn are the 4 hex digits of that character, and then use the generated string in your case statement. – andreaplanet Feb 23 '19 at 15:11
  • @andreaplanet The emoji shown in the OP's question ("‍") is actually using codepoint `U+200D` instead of `U+20DC` like you claim. For emojis, certain modifying codepoint sequences are joined by `U+200D`. These are known as [Emoji ZWJ Sequences](https://emojipedia.org/emoji-zwj-sequences/). For instance, the emoji "‍♂️" ([Man Police Officer](https://emojipedia.org/male-police-officer/)) is `U+1F46E U+200D U+2642 U+FE0F`, while the emoji "‍" ([Man Police Officer: Light Skin Tone](https://emojipedia.org/male-police-officer-type-1-2/)) is `U+1F46E U+1F3FB U+200D U+2642 U+FE0F`. – Remy Lebeau Feb 26 '19 at 23:35
  • @"Remy Lebeau" thank you for the correction. I took the wrong position when pasting and opening it in an editor – andreaplanet Feb 28 '19 at 00:21