2

Update: Apparently these are control characters, not Unicode characters.

I'm trying to parse an XML file which has an odd character in it that makes it invalid and is causing my tools (Firefox, Nokogiri) to complain.

Here's what the character looks like in Firefox, and what it looks like when I copy and paste it into Textmate (I'm on OS X obviously).

crazy characters http://img.skitch.com/20090811-ghu43k5u9nhpcjmh443dpq76jp.preview.jpg

Rather than just seeing cryptic icons and little grey diamonds, I'd really like to know what these characters are (e.g. their hex/decimal codes), but I'm not sure how to figure that out.

Teflon Ted

10 Answers

5

I would save the page in Firefox to a file, and pass it to hexdump -C. Look for the fragment of HTML around it in the ASCII part, then look for the hex bytes. Most likely, these are UTF-8, so expect a multi-byte code.
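If hexdump isn't handy, the same kind of view can be approximated in Ruby. This is only a rough sketch of the idea, not a replacement for hexdump -C: the method name hex_rows and the sample string are made up for illustration, and the layout is only loosely modelled on hexdump's.

```ruby
# Rough hex dump: one row per 16 bytes, showing the offset,
# the bytes in hex, and printable ASCII (everything else as '.').
# hex_rows is a hypothetical helper name, not part of any library.
def hex_rows(data, width = 16)
  data.bytes.each_slice(width).with_index.map do |chunk, row|
    hex   = chunk.map { |b| format('%02x', b) }.join(' ')
    ascii = chunk.map { |b| (0x20..0x7e).cover?(b) ? b.chr : '.' }.join
    # Pad the hex column so the ASCII column lines up on a short last row.
    format('%08x  %s  |%s|', row * width, hex.ljust(width * 3 - 1), ascii)
  end
end

# Hypothetical input: a control character (U+0002) and a snowman (U+2603).
sample = "<a>\u0002\u2603</a>"
puts hex_rows(sample)
```

In the output, the control character shows up as the single byte 02, while the snowman shows up as the three UTF-8 bytes e2 98 83.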

Martin v. Löwis
4

Your screenshot is tiny, but does the Firefox sample contain a glyph with four hexadecimal digits in it? If so, that's the Unicode character's code point. You could also hunt for that diamond glyph on the Unicode code charts, or simply copy the diamond into a Google search and the character name should turn up near the top.

But the real question is how to handle Unicode input in your program. You need to do that correctly if you're processing XML. Nokogiri is a Ruby library? I'm surprised to hear it doesn't handle Unicode automatically.

Nelson
  • I tried pasting them into Google (sorry I should have noted that in the original question) and it came up blank. I've found a few of these now and they all show up as grey diamonds in Textmate; I don't think they are actually the code for the diamond symbol. – Teflon Ted Aug 11 '09 at 18:36
2

The search term you are looking for is U+2603 or U2603, obviously substituting the numbers from your lamentably blurry "unknown glyph" box. The first several results will be about that Unicode character.
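If the glyph box is too blurry to read, the U+XXXX values can also be pulled out of the text itself. A minimal Ruby sketch, assuming the text has been read in as a valid UTF-8 string (String#ord raises on invalid byte sequences); odd_code_points is a made-up helper name and the sample string is invented:

```ruby
# List every character that is not printable ASCII (and not tab/LF/CR)
# as [character index, "U+XXXX"]. odd_code_points is a hypothetical name.
def odd_code_points(text)
  found = []
  text.each_char.with_index do |ch, index|
    cp = ch.ord
    # Skip ordinary printable ASCII and the usual whitespace controls.
    next if (0x20..0x7e).cover?(cp) || [0x09, 0x0a, 0x0d].include?(cp)
    found << [index, format('U+%04X', cp)]
  end
  found
end

# Hypothetical sample containing a snowman and a control character.
odd_code_points("snow\u2603man\u0002").each do |index, code|
  puts "char #{index}: #{code}"
end
# prints:
# char 4: U+2603
# char 8: U+0002
```

Each reported value can then be used directly as the search term.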

joeforker
  • +1 funny. [Rails trivia](http://stackoverflow.com/questions/3222013/what-is-the-snowman-param-in-rails-3-forms-for) – Andrew Grimm Oct 19 '11 at 06:40
1

Copy it into emacs and start hexl-mode.

Michael Speer
0

Simply open the file in a hex editor like XVI32.

h0b0
0

Open the file in a hex editor and extract the hexadecimal representation of the character. Then look up the code on http://unicode.org to find out the name of the character.

sebasgo
0

hexdump -c from the Terminal command line will show you the character code.

Mark Bessey
0

Save the file and then, from the terminal, use:

od (octal dump)

Laurel
OscarRyz
0

If you're using Vim, move the cursor over the character and type ga to show its decimal, hex, and octal codes in the status area.

Rhubarb
0

You can download the Ruby hexdump extension for class String and print out a hexdump from Ruby directly:

```ruby
require 'hexdump'

# ... whatever you do in your program

puts your_string.hexdump
```

The output looks like what you get from hexdump -C in a shell.

See:

Ruby Hexdump method for Class String

Laurel
Tilo