I have this string in my text file: ├░┬č┬Ź┬ć

All I know is that it was originally an emoji, or at least some surrogate-pair character, i.e. one that a JavaScript string stores with length 2 or 4.

For some reason it ended up in this form. (It was read from a MySQL database that uses utf8_general_ci, through a node.js mysql2 connection whose charset was latin1_swedish_ci.)

How can I find out which emoji it was? Is that possible?

Other examples:

├░┬č┬ĺ┬Ž ├░┬č┬ś┬ł ├░┬č┬ą┬Á

An algorithm written in JS would be the best option.

ElSajko
  • This seems to be corrupted data. – Danyal Sandeelo Jul 02 '21 at 07:24
  • @DanyalSandeelo kind of; it was corrupted by using the wrong character encoding when reading the data from the database and saving the files. – ElSajko Jul 02 '21 at 07:27
  • Do you know the source code page? – mplungjan Jul 02 '21 at 07:28
  • @mplungjan I know only that mysql database is `utf8_general_ci` and charset on connection (by node.js 'mysql2' lib) was `latin1_swedish_ci` – ElSajko Jul 02 '21 at 07:31
  • This gives an error: `console.log(btoa(\`├░┬č┬Ź┬ć\`))` so I am out of ideas – mplungjan Jul 02 '21 at 07:33
  • @ElSajko you cannot get the actual data back from the corrupted data. If you are able to view it somewhere (sometimes corrupted data still shows the emoji correctly in HTML), just note that emoji and look up the corresponding emoji code on the internet. – Danyal Sandeelo Jul 02 '21 at 07:40
  • @DanyalSandeelo it's not a random transformation from one state to another, and if it's not random, it should be possible to unwind it. – ElSajko Jul 02 '21 at 07:51
  • @ElSajko it's not encryption. The data is corrupted, so it's pretty hard to convert it back. Sometimes the data is correct and it's just the encoding you view it in that is wrong. It's pretty hard to say anything without experimenting on it. – Danyal Sandeelo Jul 02 '21 at 07:57
  • @DanyalSandeelo the character count matches the emoji length in JS (string length 4 or 8), so all the information is still there; I guess not even a bit of it was lost. – ElSajko Jul 02 '21 at 08:00
  • I suppose the emoji bytes were read as https://en.wikipedia.org/wiki/Code_page_852 and the resulting characters were then saved as Unicode. Try manually recovering the original byte sequence using the table at that link, then check whether it is plausible Unicode. If it is, just invert the conversion: read the database as UTF-8, convert to code page 852, then treat those bytes as UTF-8 and display the result. – Giacomo Catenazzi Jul 02 '21 at 08:54
  • Did you get this from some kind of compressed database backup? – Channa Jul 25 '21 at 07:25

1 Answer

It's double mojibake, as the following Python snippet shows (sorry, I cannot give a JavaScript equivalent):

print('🍆 💦 😈 🥵'.
      encode('utf-8').decode('latin1').  # 1st mojibake stage
      encode('utf-8').decode('cp852')    # 2nd mojibake stage
    )                                    # ├░┬č┬Ź┬ć ├░┬č┬ĺ┬Ž ├░┬č┬ś┬ł ├░┬č┬ą┬Á
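
To see why each emoji ends up as 8 characters, here is a minimal sketch that traces one emoji through both stages. The helper name `trace_mojibake` is made up for illustration, and the codec chain (UTF-8 read as Latin-1, then UTF-8 read as CP852) is the same assumption as in the snippet above.

```python
def trace_mojibake(text):
    """Return the text after each of the two assumed mis-decoding stages."""
    # Stage 1: the UTF-8 bytes of the emoji are misread as Latin-1,
    # so each of the 4 bytes becomes one character.
    stage1 = text.encode("utf-8").decode("latin1")
    # Stage 2: those 4 characters are re-encoded as UTF-8 (2 bytes each)
    # and the resulting 8 bytes are misread as CP852.
    stage2 = stage1.encode("utf-8").decode("cp852")
    return stage1, stage2


stage1, stage2 = trace_mojibake("\U0001F346")  # AUBERGINE
print([f"U+{ord(c):04X}" for c in stage1])  # ['U+00F0', 'U+009F', 'U+008D', 'U+0086']
print(stage2, len(stage2))                  # ├░┬č┬Ź┬ć 8
```

The 4 intermediate characters and the 8 final ones match the string lengths mentioned in the comments on the question.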

Possible repair (although prevention is better than cure):

print('├░┬č┬Ź┬ć ├░┬č┬ĺ┬Ž ├░┬č┬ś┬ł ├░┬č┬ą┬Á'.
      encode('cp852').decode('utf-8').       # fix 2nd mojibake stage
      encode('latin1').decode('utf-8')       # fix 1st mojibake stage
    )                                        # 🍆 💦 😈 🥵
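
If the file mixes damaged and undamaged text, the same repair can be wrapped in a function that leaves a string alone when the round trip fails. This is only a sketch: `fix_double_mojibake` is a hypothetical helper name, and it assumes every damaged string went through exactly the two stages shown above.

```python
def fix_double_mojibake(garbled):
    """Undo both assumed stages in reverse order; return the input unchanged on failure."""
    try:
        stage1 = garbled.encode("cp852").decode("utf-8")   # undo 2nd stage
        return stage1.encode("latin1").decode("utf-8")     # undo 1st stage
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The string does not fit the assumed damage pattern, so keep it as-is.
        return garbled


for sample in ["├░┬č┬Ź┬ć", "├░┬č┬ĺ┬Ž", "├░┬č┬ś┬ł", "├░┬č┬ą┬Á", "plain ASCII"]:
    fixed = fix_double_mojibake(sample)
    print(repr(sample), "->", [f"U+{ord(c):04X}" for c in fixed])
```

Plain ASCII survives both round trips unchanged, so already-clean ASCII text is not harmed by running it through the function.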

FYI, those emojis are listed below. The CodePoint column gives the Unicode code point (U+hhhh) followed by the UTF-8 bytes; the Description column gives the character name followed by the UTF-16 surrogate pair in parentheses:

Char CodePoint                      Description
---- ---------                      -----------
🍆   {U+1F346, 0xF0,0x9F,0x8D,0x86} AUBERGINE               (0xd83c,0xdf46)
💦   {U+1F4A6, 0xF0,0x9F,0x92,0xA6} SPLASHING SWEAT SYMBOL  (0xd83d,0xdca6)
😈   {U+1F608, 0xF0,0x9F,0x98,0x88} SMILING FACE WITH HORNS (0xd83d,0xde08)
🥵   {U+1F975, 0xF0,0x9F,0xA5,0xB5} OVERHEATED FACE         (0xd83e,0xdd75)
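
The rows of that table can be recomputed from the characters themselves, which is handy for checking other recovered emoji. A rough sketch follows; `describe` is a name invented here, and the character names come from whatever Unicode version your Python's `unicodedata` module ships with.

```python
import unicodedata


def describe(ch):
    """Format one character roughly like a row of the table above."""
    utf8 = ",".join(f"0x{b:02X}" for b in ch.encode("utf-8"))
    utf16 = ch.encode("utf-16-be")
    # Split the UTF-16 bytes into 16-bit code units (a surrogate pair for these emoji).
    units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
    pair = ",".join(f"0x{u:04x}" for u in units)
    # Fall back to a placeholder if this Python's Unicode data predates the character.
    name = unicodedata.name(ch, "<unnamed>")
    return f"{{U+{ord(ch):04X}, {utf8}}} {name} ({pair})"


for ch in "\U0001F346\U0001F4A6\U0001F608\U0001F975":
    print(describe(ch))
```

The two 16-bit units in parentheses are what JavaScript reports as a string of length 2 for each of these emoji.
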
JosefZ