
This is a shot-in-the-dark, and I apologize in advance if this question sounds like the ramblings of a madman.

As part of an integration with a third party, I need to UTF8-encode some string info using C# so I can send it to the target server via multipart form. The problem is that they are rejecting some of my submissions, probably because I'm not encoding their contents correctly.

Right now, I'm trying to figure out how a dash or hyphen -- I can't tell which it is just by looking at it -- is received or interpreted by the target server as ?~@~S (yes, that's a 5-character string and is not your browser glitching out). And unfortunately I don't have a thorough enough understanding of Encoding.UTF8.GetBytes() to know how to use the byte array to begin identifying where the problem might lie.

If anybody can provide any tips or advice, I would greatly appreciate it. So far my only friend has been MSDN, and not much of one at that.

UPDATE 1: After some more digging around, I discovered that using System.Web.HttpUtility.UrlEncode() to encode an EM DASH character ("—") will hex-encode it into "%e2%80%94".

I'm currently sending this info in an HttpWebRequest POST, with a content type of "application/x-www-form-urlencoded" -- could this be what's causing the problem? And if so, what is the proper way to encode a series of name-value pairs whose values may contain Unicode characters, such that it will be understood by a server expecting a UTF-8 request?

Mass Dot Net
  • From that result I'd guess you might be ascii encoding the result of utf8 encoding an em dash. – Joshua Jan 29 '11 at 00:11
  • Even with the wrong encoding, it's very unlikely that a 1-character dash would be translated into a 5-character sequence. It's probably not only an encoding problem. – Simon Mourier Jan 29 '11 at 07:37
  • @Joshua: I think you're close to the heart of the problem. I just added an update to my original post with some more info. – Mass Dot Net Mar 16 '11 at 15:57
  • Unfortunately I'd have chosen application/octet-stream and just assumed it's in the right format on both ends, so I'm not going to be able to help you any further. – Joshua Mar 16 '11 at 22:53

2 Answers

byte[] test = System.Text.Encoding.UTF8.GetBytes("-");

Should give you

test[0] = 0x2D (45 as integer).  

Verify that you're sending 0x2D to the target server.
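To answer the follow-up about printing the bytes: here is a minimal sketch, not a definitive implementation, that dumps the UTF-8 bytes of a plain hyphen next to an en dash and an em dash so you can compare them with what the server receives. The class name `DashBytes` and the helper `ToUtf8Hex` are made up for this example; `BitConverter.ToString` and `Encoding.UTF8.GetBytes` are the standard .NET calls.

```csharp
using System;
using System.Text;

public static class DashBytes
{
    // Hypothetical helper: returns the UTF-8 bytes of s as hex pairs
    // separated by hyphens, e.g. "E2-80-94".
    public static string ToUtf8Hex(string s)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(s));
    }

    public static void Main()
    {
        Console.WriteLine(ToUtf8Hex("-"));      // HYPHEN-MINUS U+002D -> 2D
        Console.WriteLine(ToUtf8Hex("\u2013")); // EN DASH U+2013      -> E2-80-93
        Console.WriteLine(ToUtf8Hex("\u2014")); // EM DASH U+2014      -> E2-80-94
    }
}
```

If the character in your data produces three bytes starting with E2-80, it is not an ASCII hyphen, and the server has to decode those bytes as UTF-8 to get the dash back.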

Jacob
Chauncat
  • Wireshark is helpful for this kind of stuff – Marlon Jan 28 '11 at 23:52
  • I've never used Wireshark before, but our lead developer is very experienced with Fiddler. I'll give this a shot as soon as I'm back in the office -- thank you for the tip. – Mass Dot Net Jan 29 '11 at 00:12
  • Wireshark is very simple to set up. It allows you to see what packets are coming in to your server. You can filter the data in many ways so you don't get lost in the data. – Chauncat Jan 29 '11 at 00:23
  • Btw, what is the C# code that properly outputs the information above (`test[0] = 0x2D (45 as integer)`)? – Mass Dot Net Jan 29 '11 at 00:39
  • I use this function to convert byte arrays to hex strings: `public static string BtyeToHexString(byte[] in_record, int startIndex, int count) { string hexString = ""; if (startIndex + count <= in_record.Length) { for (int i = startIndex; i < startIndex + count; i++) { hexString += in_record[i].ToString("X2") + " "; } } return hexString; }` – Chauncat Jan 29 '11 at 00:50
  • The problem ended up being an issue on the target server's end: they were double-URL-decoding POST'ed requests. However, using GetBytes() to prepare test cases proved invaluable for showing them how work didn't get transmitted correctly. – Mass Dot Net May 05 '11 at 20:03
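For anyone chasing a similar symptom, the double-decoding mentioned in the last comment can be reproduced locally. This is only a sketch under assumptions: the thread doesn't say which values the server mangled, so the sample value `1+1=2` is purely illustrative, and `System.Net.WebUtility` is used here instead of `HttpUtility` just to avoid a System.Web reference. The class and method names are hypothetical.

```csharp
using System;
using System.Net;

public static class DoubleDecodeDemo
{
    // Hypothetical helper: simulates a server that URL-decodes a value twice.
    public static string DecodeTwice(string encoded)
    {
        return WebUtility.UrlDecode(WebUtility.UrlDecode(encoded));
    }

    public static void Main()
    {
        string original = "1+1=2";                        // illustrative value only
        string encoded = WebUtility.UrlEncode(original);  // '+' and '=' get percent-encoded

        Console.WriteLine(WebUtility.UrlDecode(encoded)); // 1+1=2  (correct, decoded once)
        Console.WriteLine(DecodeTwice(encoded));          // 1 1=2  (second pass turns '+' into a space)
    }
}
```

A value that comes back changed after the second pass is the kind of test case that makes the problem easy to demonstrate to the other side.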

You may need to add a "charset=utf-8" parameter to your Content-Type header. You may also want to have a Content-Encoding header to set your encoding. The headers should contain the following:

Content-Type: multipart/form-data; charset=utf-8

Otherwise, the web server won't know your bytes are UTF-8 bytes, so it will misinterpret them.
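If you end up posting as "application/x-www-form-urlencoded" (as in the question's update), a rough sketch of building the body from name-value pairs might look like the following. It assumes the server decodes percent-escapes as UTF-8 and honors the charset parameter, which you'd have to confirm with the third party. `FormBody.Build` and the field name `title` are invented for illustration; `Uri.EscapeDataString` is the standard .NET call and percent-encodes the UTF-8 bytes of each string.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class FormBody
{
    // Hypothetical helper: joins name=value pairs with '&', percent-encoding
    // the UTF-8 bytes of every name and value.
    public static string Build(IEnumerable<KeyValuePair<string, string>> fields)
    {
        var parts = new List<string>();
        foreach (var field in fields)
        {
            parts.Add(Uri.EscapeDataString(field.Key) + "=" + Uri.EscapeDataString(field.Value));
        }
        return string.Join("&", parts);
    }

    public static void Main()
    {
        var fields = new Dictionary<string, string> { { "title", "a \u2014 b" } };
        string body = Build(fields);
        Console.WriteLine(body); // title=a%20%E2%80%94%20b

        // The body is pure ASCII at this point, so these are the bytes to write
        // to the request stream. On the request itself (assumption: the server
        // reads the charset parameter) you would set something like:
        //   request.ContentType = "application/x-www-form-urlencoded; charset=utf-8";
        byte[] payload = Encoding.UTF8.GetBytes(body);
        Console.WriteLine(payload.Length);
    }
}
```

Note that `Uri.EscapeDataString` writes a space as %20 rather than +; both are normally accepted by form decoders, but that is worth checking against the target server too.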

Jacob
  • You are correct in that I'm not currently explicitly defining a content encoding type when I send the multipart form. I've just sent an email to the third party, asking if they knew what the default expected content type was -- is that something they'd be able to easily identify? I think they're running Microsoft servers (IIS). – Mass Dot Net Jan 29 '11 at 00:11
  • `UTF-8` is [not a valid `Content-Encoding` value](http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.11). That header is used to indicate how the payload is compressed. It's not used to indicate the charset. – dkarp Jan 29 '11 at 02:07