While reviewing the contents of my HDD after a long time, I came across a file with a weird name.

I'm not sure which tool or program changed it this way, but from the content of the file I could work out its original name.

The name has evidently gone through some kind of encoding conversion, and I want to identify it. It shouldn't be complicated, especially for those who are familiar with Unicode and UTF-8. I have mapped the characters below so you can guess what happened.

The table below maps the characters. The first column holds the converted characters as they now appear in the file name, and the second column holds the UTF-8 form of the original character.

I need to know what happened and how the conversion was done, so that I can convert the name back to UTF-8. That is, what I have is in the first column, and what I need to get is in the second column:


Corrupted (hex, as found)    Expected (hex)
638 2020                     646
639 AF                       6AF
637 A7                       627
637 B1                       631
637 B3                       633
637 6BE                      62A
20                           20
638 67E                      641
63A 152                      6CC

To elaborate, consider the first row: the UTF-8 form is 46 06 (as bytes), or 0x0646. In the file name, this character has been converted into two wide characters, 0x0638 0x2020.
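One way to test a guess about the conversion is to reproduce the table. The Python sketch below assumes (this is a hypothesis, not something stated in the question) that the UTF-8 bytes of each original character were decoded as Windows-1256, the Arabic ANSI code page. The names `TABLE` and `misread` are made up for this illustration.

```python
# Hypothesis: the original name was stored as UTF-8 bytes, and some tool
# decoded those bytes as Windows-1256 (cp1256) instead of UTF-8.

# (corrupted code units, expected code point), copied from the table above.
TABLE = [
    ((0x638, 0x2020), 0x646),
    ((0x639, 0xAF), 0x6AF),
    ((0x637, 0xA7), 0x627),
    ((0x637, 0xB1), 0x631),
    ((0x637, 0xB3), 0x633),
    ((0x637, 0x6BE), 0x62A),
    ((0x20,), 0x20),
    ((0x638, 0x67E), 0x641),
    ((0x63A, 0x152), 0x6CC),
]


def misread(code_point: int) -> tuple:
    """Encode one character as UTF-8, then decode those bytes as cp1256."""
    garbled = chr(code_point).encode("utf-8").decode("cp1256")
    return tuple(ord(c) for c in garbled)


for corrupted, expected in TABLE:
    produced = misread(expected)
    units = " ".join(f"{u:X}" for u in produced)
    print(f"{expected:X} -> {units}  matches table: {produced == corrupted}")
```

If every row reports a match, the hypothesis is consistent with the table, and the first row is explained as U+0646 becoming the UTF-8 bytes D9 86, which cp1256 reads as U+0638 U+2020.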

hamidi
  • Weird filenames are normally an artifact of file systems errors. – leppie Jun 26 '15 at 15:42
  • Which OS are you using? – dan04 Jun 26 '15 at 18:53
  • Oh no! I'm sure it's not because of a file system error. It's NTFS on Windows 7. Maybe a file encryption utility caused it, because it doesn't support non-Western file names (just a guess). I mean this utility: http://www.file-encryption.net – hamidi Jun 26 '15 at 19:13
  • Your table makes no sense. `46 06` is not a valid UTF-8 byte sequence. And besides, NTFS does not use UTF-8 anyway, it uses UTF-16. So please clarify what you REALLY have, and what you are REALLY expecting. Maybe use actual screenshots instead of hand-written tables. Otherwise, it is very hard to figure out what you are saying. – Remy Lebeau Jun 26 '15 at 19:54
  • Exactly: "نگارستان نفیس" is converted to "ظ†ع¯ط§ط±ط³طھط§ظ† ظ†ظپغŒط³".
    Source byte sequence: 638 2020 639 AF 637 A7 637 B1 637 B3 637 6BE 637 A7 638 2020 20 638 2020 638 67E 63A 152 637 B3 2E 68 74 6D
    Destination (what I expect) byte sequence: 646 6AF 627 631 633 62A 627 646 20 646 641 6CC 633
    Maybe it's not UTF-8; maybe it's Unicode... Yes, you're right. I put this byte sequence in a file: FF FE 46 06 AF 06 27 06 31 06 33 06 2A 06 27 06 46 06 20 00 46 06 41 06 CC 06 33 06. When I opened it in Notepad++, I saw what I expect. The encoding is UCS-2 little endian. Is that another name for Unicode or UTF-16? – hamidi Jun 26 '15 at 20:47

1 Answer

I found the solution myself. In Notepad++:

  1. Select "Encode in ANSI" from the Encoding menu.
  2. Paste the corrupted text.
  3. Select "Encode in UTF-8" from the Encoding menu.

That's it: the correct text is displayed. Given that, how can I do the same with Perl?
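The Notepad++ steps amount to turning the corrupted text back into bytes using the system "ANSI" code page and then reading those bytes as UTF-8. The Python sketch below illustrates that idea; it is not the Perl code asked for, and it assumes the ANSI code page involved was Windows-1256 (cp1256), which the thread does not state. `repair`, `SOURCE` and `EXPECTED` are names made up for this example, and the hex values are copied from the comment above.

```python
def repair(garbled: str, wrong_codec: str = "cp1256") -> str:
    """Undo a UTF-8-read-as-ANSI mix-up.

    Re-encode the text with the codec that was wrongly used to decode it,
    which recovers the original bytes, then decode those bytes as UTF-8.
    """
    return garbled.encode(wrong_codec).decode("utf-8")


# Corrupted file name as hex code units, from the comment thread.
SOURCE = (
    "638 2020 639 AF 637 A7 637 B1 637 B3 637 6BE 637 A7 638 2020 20 "
    "638 2020 638 67E 63A 152 637 B3 2E 68 74 6D"
)
# Expected name as hex code points, from the same comment (extension omitted there).
EXPECTED = "646 6AF 627 631 633 62A 627 646 20 646 641 6CC 633"

garbled = "".join(chr(int(h, 16)) for h in SOURCE.split())
fixed = repair(garbled)

print(fixed)
print(" ".join(f"{ord(c):X}" for c in fixed))
```

The repaired string should start with the expected code points, followed by the `.htm` extension that the source sequence carries. The same two steps (encode with the wrong code page, decode as UTF-8) should map onto Perl's `Encode` module, though that variant is untested here. If the system code page were a different one, `wrong_codec` would need to change and the encode step might raise an error on unmappable characters.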

hamidi
  • See [Perl - read file with encoding method](http://stackoverflow.com/questions/2220717/perl-read-file-with-encoding-method) – Mike Samuel Jun 27 '15 at 14:15