
So, right now I'm making a small package reader in Java. All the Unicode strings have periods between the characters (or at least that's how they are presented in a hex editor), so when I read them I need to go to the offset and read the bytes allocated for that field. For example, if it's a game name from an Xbox 360 file, I need to read 80 bytes and remove the '.'s to get a readable string.

So why is Unicode stored like this in files? Is it to signify that it's Unicode, or is it allocation padding, or what?

I'm not sure if my question is valid; it's just always been on my mind. Thanks.

Cody Gray - on strike
user3530525
  • It's most likely just how your hex editor attempts to show null characters as text. – Alex K. Jan 13 '15 at 18:40
  • @AlexK. - I understand null terminators, but why would there be null characters in between the characters of a word? – user3530525 Jan 13 '15 at 18:41
  • Read up on UTF-16. The most significant byte is 0 in the representation of ASCII characters. – bolov Jan 13 '15 at 18:44
  • This looks off-topic for SO. Try SuperUser, but specify the hex editor you are using and show a sample of the data in a format that lets the individual bytes be seen. – Jukka K. Korpela Jan 13 '15 at 18:52

1 Answer


Create a file containing "A" in Notepad and save it as Unicode; Windows will use UTF-16 (LE) encoding to do so. This uses 2 bytes to store the character: 0x41 0x00.
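You can see the same two bytes from Java without going through Notepad. This is only a small sketch: the class name `Utf16Bytes` and the helper `toHex` are made up for illustration, and it encodes the string directly rather than reading a saved file.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are for illustration only.
class Utf16Bytes {
    // Formats each byte as two uppercase hex digits, separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Little-endian: the low byte (0x41) comes first, then the high byte (0x00).
        System.out.println(toHex("A".getBytes(StandardCharsets.UTF_16LE))); // 41 00
        // Big-endian: same code unit, opposite byte order.
        System.out.println(toHex("A".getBytes(StandardCharsets.UTF_16BE))); // 00 41
    }
}
```

Every character in the ASCII range gets a zero high byte this way, which is where the regular pattern of zero bytes comes from.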

When you view this file in a hex editor (which neither knows nor cares about text encoding), 0x41 can be displayed as A, but 0x00 maps to no printable character, so a . (or equivalent) is displayed to let you know there is a byte there.
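So in the reader, instead of stripping the '.'s by hand, it is probably simpler to decode the field as UTF-16 and let Java pair the bytes up. A rough sketch follows, with some assumptions: the class `Utf16Field` and the method `decode` are hypothetical names, the 80-byte length comes from the question, and the byte order has to be checked against the actual file (if the zero byte comes before the letter, the data is big-endian).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not taken from any real package format library.
class Utf16Field {
    // Decodes a fixed-length UTF-16 field and cuts it at the first NUL
    // character, on the assumption that the rest of the field is padding.
    static String decode(byte[] data, int offset, int length, boolean littleEndian) {
        String s = new String(data, offset, length,
                littleEndian ? StandardCharsets.UTF_16LE : StandardCharsets.UTF_16BE);
        int end = s.indexOf('\0');
        return end >= 0 ? s.substring(0, end) : s;
    }

    public static void main(String[] args) {
        // Build a fake 80-byte field: "Game" in UTF-16LE followed by zero padding.
        byte[] field = new byte[80];
        byte[] name = "Game".getBytes(StandardCharsets.UTF_16LE);
        System.arraycopy(name, 0, field, 0, name.length);

        System.out.println(decode(field, 0, field.length, true)); // Game
    }
}
```

Decoding like this also keeps non-ASCII characters intact, whereas removing every zero byte only works while the high byte of each character happens to be 0.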

Alex K.