What Unicode character (emoji) was it?

Question

I have this string in my text file: ├░┬č┬Ź┬ć

What is known is that it was an emoji, or at least some surrogate-pair character, i.e. a character that JavaScript represents as a string of length 2 or 4.

For some reason it ended up in this form. (It was read from a MySQL database that uses utf8_general_ci, through a node.js mysql2 connection with charset latin1_swedish_ci.)

How can I find out which emoji it was? Is it possible?

Other examples:

├░┬č┬ĺ┬Ž ├░┬č┬ś┬ł ├░┬č┬ą┬Á

An algorithm written in JS would be the best option.

@DanyalSandeelo kind of, it was corrupted by using the wrong character encoding when getting data from the database and saving files. — ElSajko, Jul 02 '21 at 07:27
@mplungjan I know only that mysql database is `utf8_general_ci` and charset on connection (by node.js 'mysql2' lib) was `latin1_swedish_ci` — ElSajko, Jul 02 '21 at 07:31
This gives an error: `console.log(btoa(\`├░┬č┬Ź┬ć\`))`, so I am out of ideas — mplungjan, Jul 02 '21 at 07:33
@ElSajko you cannot get the actual data back from the corrupted text. If you are able to view it somewhere (sometimes the corrupted data shows the emoji correctly in HTML), just note that emoji and look up the corresponding emoji code on the internet. — Danyal Sandeelo, Jul 02 '21 at 07:40
@DanyalSandeelo it's not like random transformation from one state to another, so if it's not random, it should be possible to unwind it backward — ElSajko, Jul 02 '21 at 07:51
@ElSajko it's not encryption. The data is corrupted, so it's pretty hard to convert it back. Sometimes the data is correct and it's just the encoding you view it in that is wrong. It's pretty hard to say anything without experimenting on it. — Danyal Sandeelo, Jul 02 '21 at 07:57
@DanyalSandeelo the character count matches the emoji length in JS (4 or 8 string length), so all the information is still there; I guess not even a bit of information was lost. — ElSajko, Jul 02 '21 at 08:00
I suppose the emoji were read as a byte sequence in https://en.wikipedia.org/wiki/Code_page_852 and the resulting characters were then saved as Unicode. Try to manually decode the original byte sequence using the table in the link, and then check whether it is plausible Unicode. If it is, you just invert the conversion: read the database as UTF-8, convert to code page 852, then trick the system into treating the result as UTF-8, and display it as UTF-8 — Giacomo Catenazzi, Jul 02 '21 at 08:54
Did you get this from some kind of compressed database backup? — Channa, Jul 25 '21 at 07:25

score 2 · Accepted Answer · answered Jul 03 '21 at 21:05

It's a double mojibake, as shown in the following Python snippet (sorry, I cannot give a JavaScript equivalent):

print('🍆 💦 😈 🥵'.
      encode('utf-8').decode('latin1').  # 1st mojibake stage
      encode('utf-8').decode('cp852')    # 2nd mojibake stage
    )                                    # ├░┬č┬Ź┬ć ├░┬č┬ĺ┬Ž ├░┬č┬ś┬ł ├░┬č┬ą┬Á

Possible repair (although prevention is better than cure):

print('├░┬č┬Ź┬ć ├░┬č┬ĺ┬Ž ├░┬č┬ś┬ł ├░┬č┬ą┬Á'.
      encode('cp852').decode('utf-8').       # fix 2nd mojibake stage
      encode('latin1').decode('utf-8')       # fix 1st mojibake stage
    )                                        # 🍆 💦 😈 🥵
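
The repair above can be wrapped in a small helper so that it is safe to run over many strings. This is only a sketch: the function name `repair_double_mojibake` is hypothetical, and it assumes the same cp852-over-latin1 chain as in this question. A string that does not fit that chain is returned unchanged instead of raising.

```python
def repair_double_mojibake(text: str) -> str:
    """Undo the cp852 / latin1 double mojibake; return text unchanged if it does not fit."""
    try:
        # Undo the 2nd stage: map the cp852 characters back to bytes, read them as UTF-8.
        stage1 = text.encode('cp852').decode('utf-8')
        # Undo the 1st stage: map the latin1 characters back to bytes, read them as UTF-8.
        return stage1.encode('latin1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in this chain, so it was probably not damaged this way.
        return text


# Print code points rather than the emoji, so the output is plain ASCII.
fixed = repair_double_mojibake('├░┬č┬Ź┬ć')
print([f'U+{ord(ch):04X}' for ch in fixed])
```

Plain ASCII text survives both steps untouched, so the helper can be applied to a whole column even when only some rows are damaged. Text that contains other non-ASCII characters fails one of the steps and is passed through as it is.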

FYI, those emojis are (column CodePoint contains Unicode (U+hhhh) and UTF-8 bytes; column Description contains surrogate pairs in parentheses):

Char CodePoint                      Description
---- ---------                      -----------
🍆   {U+1F346, 0xF0,0x9F,0x8D,0x86} AUBERGINE               (0xd83c,0xdf46)
💦   {U+1F4A6, 0xF0,0x9F,0x92,0xA6} SPLASHING SWEAT SYMBOL  (0xd83d,0xdca6)
😈   {U+1F608, 0xF0,0x9F,0x98,0x88} SMILING FACE WITH HORNS (0xd83d,0xde08)
🥵   {U+1F975, 0xF0,0x9F,0xA5,0xB5} OVERHEATED FACE         (0xd83e,0xdd75)
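
The columns of the table above can be recomputed for any character with the Python standard library. This is a minimal sketch, not the script that produced the table: the helper name `describe` is hypothetical, and the character name depends on the Unicode version bundled with the Python build.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Return the code point, UTF-8 bytes, UTF-16 code units and name of one character."""
    utf8 = [f'0x{b:02X}' for b in ch.encode('utf-8')]
    # UTF-16 big-endian yields 16-bit units; characters above U+FFFF give a surrogate pair.
    raw = ch.encode('utf-16-be')
    units = []
    for i in range(0, len(raw), 2):
        unit = int.from_bytes(raw[i:i + 2], 'big')
        units.append(f'0x{unit:04x}')
    return {
        'codepoint': f'U+{ord(ch):04X}',
        'utf8': utf8,
        'utf16': units,
        'name': unicodedata.name(ch, '<unnamed>'),
    }


# Characters are written as escapes so the example does not depend on the file encoding.
for ch in '\U0001F346\U0001F4A6\U0001F608':
    print(describe(ch))
```

The two UTF-16 units are what JavaScript counts in `length`, which is why a single emoji like these reports a string length of 2.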

Whoa, you did great! Thank you very much, I've accepted the answer. — ElSajko, Jul 05 '21 at 11:02
