Convert a string from ISO to UTF-8 with null byte ending

Question

I get an ISO-8859-1-encoded string from an external DLL, and I must use it to create a link.

Apparently, in production (I have no access to the production server whatsoever), "%00" gets added at the end of the UTF-8-decoded strings. I understand this to mean that the string is null-terminated. How can I handle this situation?

My conversion function is this one:

Private Function DecodeString(ByRef SourceData As String, ByVal SourceEncoding As Encoding, ByVal OutputEncoding As Encoding) As String
    Dim bSourceData As Byte() = SourceEncoding.GetBytes(SourceData)
    Dim bOutData As Byte() = System.Text.Encoding.Convert(SourceEncoding, OutputEncoding, bSourceData)
    Return OutputEncoding.GetString(bOutData)
End Function

What do I have to change to make it work? I thought of something like .Replace("%00", ""), but even if it worked, I don't think it would solve the underlying problem...

I found this: "Fastest way to convert a possibly-null-terminated ascii byte[] to a string?", but I don't want to do it "in the fastest way possible", and I don't like the "unsafe" part, as this is for a banking site (even if I don't understand exactly what it could lead to, I don't want to take any unnecessary risk).

Thanks

score 1 · Accepted Answer · answered Jan 26 '12 at 09:54

"I get an ISO-8859-1-encoded string from an external DLL"

No, if you've been given a string, it's already a sequence of Unicode characters (hand-waving around surrogate pairs aside). The fact that it may have originally been loaded from ISO-8859-1 binary data is irrelevant.
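
To make that concrete, here is a minimal sketch (the module name and the sample bytes are illustrative assumptions, not anything taken from the DLL). It decodes the same character from ISO-8859-1 bytes and from UTF-8 bytes, and shows that the resulting .NET strings are identical; the encoding only matters at the moment bytes are turned into text.

```vbnet
Imports System
Imports System.Text

Module StringIsUnicodeDemo
    Sub Main()
        Dim latin1 As Encoding = Encoding.GetEncoding("ISO-8859-1")

        ' The same character, "é" (U+00E9), stored as bytes in two encodings.
        Dim latin1Bytes As Byte() = New Byte() {&HE9}
        Dim utf8Bytes As Byte() = New Byte() {&HC3, &HA9}

        ' Decoding each with its matching encoding gives identical strings.
        Dim a As String = latin1.GetString(latin1Bytes)
        Dim b As String = Encoding.UTF8.GetString(utf8Bytes)
        Console.WriteLine(String.Equals(a, b, StringComparison.Ordinal)) ' True

        ' Decoding with the wrong encoding gives different text:
        ' two characters instead of one.
        Dim wrong As String = latin1.GetString(utf8Bytes)
        Console.WriteLine(wrong.Length) ' 2
    End Sub
End Module
```

If the text coming out of the DLL looks wrong, the mismatch most likely happened at a decode step like the last one, before the string ever reached the calling code.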

Your method is fundamentally flawed. Assuming every character in the string can be represented in both encodings, it should be a no-op. It's conceptually converting text to bytes in the source encoding, then converting those bytes to text, then converting that text to bytes in the target encoding, then converting those bytes to text. At best that's a no-op; at worst it could lose data if either encoding can't handle all the characters in the original string.
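
A small sketch of that round trip, assuming a made-up input value with a trailing NUL (the module name and sample strings are hypothetical). It reuses the logic of DecodeString from the question and checks that the output equals the input, NUL included, whenever Latin-1 can represent every character.

```vbnet
Imports System
Imports System.Text

Module RoundTripDemo
    ' Same logic as the DecodeString function in the question.
    Function DecodeString(ByVal source As String, ByVal sourceEncoding As Encoding, ByVal outputEncoding As Encoding) As String
        Dim sourceBytes As Byte() = sourceEncoding.GetBytes(source)
        Dim outBytes As Byte() = Encoding.Convert(sourceEncoding, outputEncoding, sourceBytes)
        Return outputEncoding.GetString(outBytes)
    End Function

    Sub Main()
        Dim latin1 As Encoding = Encoding.GetEncoding("ISO-8859-1")

        ' Hypothetical value: Latin-1-representable text followed by a NUL.
        Dim input As String = "café" & ChrW(0)
        Dim output As String = DecodeString(input, latin1, Encoding.UTF8)

        ' The round trip changes nothing...
        Console.WriteLine(String.Equals(input, output, StringComparison.Ordinal)) ' True
        ' ...so the trailing NUL is still there.
        Console.WriteLine(output(output.Length - 1) = ChrW(0)) ' True

        ' A character Latin-1 cannot represent (the euro sign, U+20AC) is
        ' replaced by a fallback character, so the result differs from the input.
        Dim euro As String = "10 " & ChrW(&H20AC)
        Dim lossy As String = DecodeString(euro, latin1, Encoding.UTF8)
        Console.WriteLine(String.Equals(euro, lossy, StringComparison.Ordinal)) ' False
    End Sub
End Module
```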

You almost certainly need to step up a bit, to where the string is originally being constructed - that's where the problem will be. At that point, you should probably be able to remove your method entirely.

answered Jan 26 '12 at 09:54

Jon Skeet

Why couldn't the DLL send me an ISO string? And I don't have access to the DLL internals; I can't even get a copy to test on my own machine, so I have to go back and forth with the testers on the test server... – thomasb Jan 26 '12 at 10:04
@cosmo0: Because there's no such thing as "an ISO string". It's like talking about "a decimal `Integer`" or "a hex `Integer`" - it's just a number; likewise a string is just a sequence of Unicode characters. A string is always represented as UTF-16 code units. If the DLL is giving you a string, it's up to the DLL to perform whatever decoding is required from any source binary data. It sounds like it *may* be returning you bad data (a string with a NUL character at the end). First things first, you need to understand what a .NET string is. Then ideally fix your dev process... – Jon Skeet Jan 26 '12 at 10:11
Yes, I understand that. But, as I was saying, I can't interact with the DLL creators at all. They completely refuse to send me a copy, so I can't even debug it to see what exactly it sends me (as far as I know, it may be a byte[] that VB.NET magically converts). And I have no documentation whatsoever. So I have to work with that, and it has to be done in a few days. Now, do you have any advice that can actually help me with my current problem? Thanks. – thomasb Jan 26 '12 at 10:23

@cosmo0: If you understood that, why did you keep referring to "an ISO string" when it doesn't exist? I would *really* suggest you talk to your manager about the impracticalities of working under such conditions, but if *all* you need is to remove a NUL character from the end, you can just use `text = text.TrimEnd(Chr(0))` (I believe that would be right for VB; I'm rather more comfortable in C#.) – Jon Skeet Jan 26 '12 at 10:26
Well, it's shorthand for saying that I think the DLL sends me a string that the DLL considers ISO-encoded (I don't even know which language the DLL is written in), but of course .NET considers it UTF-8, even if it had character conversion problems. So it might be a badly-decoded-from-ISO string? Anyway, my immediate problem is to remove the NUL character, so I will try your solution, thanks (it indeed looks right). And rest assured, my manager already knows about my problems, and he is not happy about it, but he can't do very much either right now :/ – thomasb Jan 26 '12 at 10:52

@cosmo0: No, .NET doesn't consider it UTF-8 - it considers it as a sequence of UTF-16 code units. Yes, it's possible that the DLL has converted it badly. – Jon Skeet Jan 26 '12 at 10:56
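
For reference, a minimal sketch of the `TrimEnd` suggestion from the comments above, applied before the link is built. The helper name `BuildLink`, the base URL, and the sample value are hypothetical; the real link format is not shown in the question, so `Uri.EscapeDataString` is only one plausible way the value might be encoded.

```vbnet
Imports System

Module LinkDemo
    ' Hypothetical helper: strips trailing NUL characters, then
    ' percent-encodes the value for use in a query string.
    Function BuildLink(ByVal baseUrl As String, ByVal value As String) As String
        Dim cleaned As String = value.TrimEnd(ChrW(0))
        Return baseUrl & Uri.EscapeDataString(cleaned)
    End Function

    Sub Main()
        ' Hypothetical value as it might come back from the DLL.
        Dim fromDll As String = "ABC123" & ChrW(0)
        Console.WriteLine(fromDll.Length) ' 7 (the NUL counts as a character)

        Console.WriteLine(BuildLink("https://example.com/view?id=", fromDll))
        ' https://example.com/view?id=ABC123
    End Sub
End Module
```

Trimming the string itself, rather than replacing "%00" in the finished URL, removes the stray character before any encoding step can turn it into an escape sequence.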
