
I want to use the buffer from a UNICODE_STRING, but it seems I cannot simply copy the pointer and use it directly: sometimes there are null characters in the middle of the string, and Length is greater than what I see in the debugger. So if I do this:

UNICODE_STRING testStr;
//after being used by some function it has data like this 'bad丣\0more_stuff\0'

wchar_t * wStr = testStr.Buffer;

I will end up with wStr = "bad丣". Is there a way to convert this to a valid, null-terminated wchar_t*?

theB
Vlad
  • In which encoding is the unicode string? – 2501 Jul 13 '16 at 05:19
  • So you have a [double-null terminated string](https://blogs.msdn.microsoft.com/oldnewthing/20091008-00/?p=16443). Not that unusual. – Jonathan Potter Jul 13 '16 at 05:30
  • It works for any string system that uses `0` as a termination character. – Jonathan Potter Jul 13 '16 at 05:38
  • In other words, you can't convert it to a `wchar_t *` because it contains more than one string. You could convert it to an array of `wchar_t *` if you wanted. – Harry Johnston Jul 13 '16 at 06:05
  • @2501: The character encoding is not interesting. If you insist that it is, it's UTF-16LE. The same as `wchar_t` for any compiler targeting the Win32 platform. – IInspectable Jul 13 '16 at 15:11
  • @JonathanPotter: [UNICODE_STRING](https://msdn.microsoft.com/en-us/library/windows/desktop/aa380518.aspx) is a **counted string**, not a double-null terminated string (although it can store one). – IInspectable Jul 13 '16 at 15:16

2 Answers


A wchar_t* is just a pointer. Unless you tell the debugger (or any function you pass the wchar_t* to) exactly how many wchar_t characters are actually being pointed at, it has to stop somewhere, so it stops on the first null character it encounters.
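
As a rough illustration of that difference, the sketch below compares the number of elements actually stored with the number a null-terminated scan reports. The sample data and the helper names `countedLength` and `scannedLength` are made up for this example and are not part of any Windows API.

```cpp
#include <cstddef>
#include <cwchar>

// Hypothetical sample data: "bad", an embedded null, then "more_stuff".
// The literal also gets one implicit terminator, so the array holds
// 15 wchar_t elements in total.
static const wchar_t kSample[] = L"bad\0more_stuff";

// Number of elements actually stored, excluding the final terminator.
std::size_t countedLength()
{
    return sizeof(kSample) / sizeof(kSample[0]) - 1;
}

// Number of elements seen by code that only has the pointer and
// therefore stops at the first null it finds.
std::size_t scannedLength()
{
    return std::wcslen(kSample);
}
```

Here `countedLength()` covers the whole sequence, while `scannedLength()` only covers "bad", which is the same truncation the debugger shows.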

UNICODE_STRING::Buffer is not guaranteed to be null-terminated, and it can contain embedded nulls. You have to use the UNICODE_STRING::Length field to know how much data is in the Buffer. Length is expressed in bytes, so divide it by sizeof(WCHAR) to get the number of WCHAR elements; that count includes embedded nulls but not a trailing null terminator, if one is present. If you need a null terminator, copy the Buffer data to your own buffer and append a terminator.

The easiest way to do that is to use std::wstring, e.g.:

#include <string>

UNICODE_STRING testStr;
// fill testStr as needed...

std::wstring wStrBuf(testStr.Buffer, testStr.Length / sizeof(WCHAR));
const wchar_t *wStr = wStrBuf.c_str();

The embedded nulls will still be present, but c_str() will append the trailing null terminator for you. The debugger will still display the data only up to the first null, unless you tell it how many WCHAR elements the data actually contains.
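
To make that behavior concrete, here is a minimal sketch that can be built without the Windows headers. `CountedString` is a stand-in that mirrors the layout of UNICODE_STRING, and `toWString` is a hypothetical helper name; both are assumptions made for illustration only.

```cpp
#include <cstddef>
#include <string>

// Stand-in mirroring the UNICODE_STRING layout (assumption: used only so
// this sketch compiles without Windows headers).
struct CountedString
{
    unsigned short Length;        // size of the data in bytes, terminator excluded
    unsigned short MaximumLength; // capacity of Buffer in bytes
    wchar_t*       Buffer;        // not necessarily null-terminated
};

// Copies the counted data into a std::wstring. Embedded nulls are kept,
// and the std::wstring supplies its own trailing terminator via c_str().
std::wstring toWString(const CountedString& s)
{
    if (s.Buffer == nullptr)
        return std::wstring();
    return std::wstring(s.Buffer, s.Length / sizeof(wchar_t));
}
```

With a six-element buffer such as `b a d \0 x y` and no terminator, `toWString(s).size()` is the full element count, while a null-terminated scan of `c_str()` still stops after "bad".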

Alternatively, if you know the Buffer data contains multiple substrings separated by nulls, you could split the Buffer data into an array of strings instead, e.g.:

#include <string>
#include <vector>

UNICODE_STRING testStr;
// fill testStr as needed...

std::vector<std::wstring> wStrArr;

std::wstring wStr(testStr.Buffer, testStr.Length / sizeof(WCHAR));
std::wstring::size_type startidx = 0;
do
{
    std::wstring::size_type idx = wStr.find(L'\0', startidx);
    if (idx == std::wstring::npos)
    {
        // No more separators: keep whatever follows the last one, if anything.
        if (startidx < wStr.size())
            wStrArr.push_back(wStr.substr(startidx));
        break;
    }
    wStrArr.push_back(wStr.substr(startidx, idx-startidx));
    startidx = idx + 1;
}
while (true);

// use wStrArr as needed...

Or:

#include <string>
#include <vector>
#include <algorithm>

UNICODE_STRING testStr;
// fill testStr as needed...

std::vector<std::wstring> wStrArr;

WCHAR *pStart = testStr.Buffer;
WCHAR *pEnd = pStart + (testStr.Length / sizeof(WCHAR));

do
{
    WCHAR *pFound = std::find(pStart, pEnd, L'\0');
    if (pFound == pEnd)
    {
        if (pStart < pEnd)
            wStrArr.push_back(std::wstring(pStart, pEnd-pStart));
        break;
    }
    wStrArr.push_back(std::wstring(pStart, pFound-pStart));
    pStart = pFound + 1;
}
while (true);

// use wStrArr as needed...
Remy Lebeau

A UNICODE_STRING is a structure that stores both a pointer to the character data and its length. As such, it allows for embedded NUL characters, just like a std::wstring, for example.

A C-style string (e.g. wchar_t*), on the other hand, does not store an explicit string length. By convention, it is terminated by a NUL character. Its length is implied. A corollary of this is that it cannot contain embedded NUL characters.

That means that you cannot convert from UNICODE_STRING to wchar_t* without losing the length information. You have to either store the length explicitly, alongside the wchar_t* pointer, or establish rules for interpretation that allow the length to be recalculated, for example by interpreting the character sequence as a double-null-terminated string 1).
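
The second option can be sketched as follows, assuming the data really follows the double-null-terminated convention (each substring ends with a NUL and one extra NUL closes the list). `doubleNullLength` is a hypothetical helper name, not an existing API, and the convention cannot represent empty substrings because an empty substring would be read as the end of the list.

```cpp
#include <cstddef>
#include <cwchar>

// Recomputes the length of a double-null-terminated block from the data
// alone. The result counts every substring together with its own
// terminator, but not the final extra terminator that ends the list.
// Assumption: 'block' is a valid double-null-terminated sequence.
std::size_t doubleNullLength(const wchar_t* block)
{
    const wchar_t* p = block;
    while (*p != L'\0')
        p += std::wcslen(p) + 1; // skip one substring and its terminator
    return static_cast<std::size_t>(p - block);
}
```

If the sequence does not follow that convention, this walk either stops early or reads past the end of the buffer, which is why storing the length explicitly is the safer of the two options.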


Additional information:


1) The debugger will interpret a wchar_t* as a zero-terminated string. If you want to see the entire sequence, you need to explicitly provide the array size using a format specifier.
IInspectable