This question is a modified redux of this previous question:

how to decode ubyte[] to a specified encoding?

I'm looking for an idiomatic way to convert the ubyte[] array returned from a std.zip.ArchiveMember.expandedData attribute into a string or other range-able collection of strings... either the whole contents akin to calling File.open("file"), or something iterable in similar fashion to File.open("file").byLine().

So far everything I've found from the standard documentation that deals with character arrays or strings does not appreciate a ubyte[] argument, and the examples around D's zip file handling are very rudimentary, dealing only with getting raw data out of zip archives and their member files... with no obvious file/stream/io interface capable of being easily layered between the raw bytestream and text-oriented file/string manipulation.

I think I can find something in std.utf or std.uni to decode individual code points, and while/for-loop my way through the bytestream, but surely there might be a better way?

Code sample:

import std.file : read;
import std.zip;

// just humor me, this is what I've been given.
// ZipArchive takes the archive's bytes, not a file name.
auto zipFile = new ZipArchive(read("dataSet.csv.zip"));
foreach(memberFile; zipFile.directory)
{
    zipFile.expand(memberFile);
    ubyte[] uByteArray = memberFile.expandedData;

    // ok, now what?
    // is there a relatively simplistic way to get this
    // decoded/translated byteStream into a string
    // or collection of strings(for example, one string per line
    // of the compressed file) ?

    string completeCsvContents = uByteArray.PQR();
    string[] csvRows = uByteArray.XYZ();
}

Is there anything that I could easily fill in for PQR or XYZ?

Or, if it's a matter of making an API call in the style of

string csvData = std.ABC.PQR(uByteArray)

What would ABC/PQR be?

joduncan
  • well, perhaps a bit more searching in a different direction was in order. seems Adam Ruppe has a module that may get the job done: [characterencodings.d](https://github.com/adamdruppe/arsd/blob/master/characterencodings.d) I'll try it out, and if it works decently, as it appears it should, then I'll answer my own question. – joduncan Dec 19 '15 at 05:37
  • It depends what encoding the file is in. If it is already ascii or utf-8, you can simply cast it to `char[]`. If it is something else like Windows-1252 which Excel often saves as in English, then something like my characterencodings.d module will help you convert. – Adam D. Ruppe Dec 19 '15 at 18:02
  • I came to that realization after skimming your source, I'd never worked with utf-8 in detail before so I wasn't sure what kinds of casting assumptions were safe to make. I appreciate your response, thank you. – joduncan Dec 20 '15 at 14:57
  • Yeah, it is safe to cast either if it is already valid (my function tryToDetermineEncoding does this check. UTF-8 validity is rare in a random byte stream, but other encodings are hard to figure out which is why that's all it returns), or only has all values `< 128`. It still isn't necessarily correct, but almost certainly is then. But any values >= 128 that aren't already utf8 encoding should not be cast - it would throw if you tried to use it as a string. – Adam D. Ruppe Dec 21 '15 at 02:25

2 Answers

Maybe just do

auto stuff = cast(char[]) memberFile.expandedData; 

When you use the resulting char[] stuff, it is decoded automatically wherever that is needed: for example, functions that take it as an input range call the range primitives, which decode it to dchar on the fly.

The cast is enough because neither char[] nor string holds decoded characters; both store UTF-8 code units. Only dchar[] and dstring hold decoded code points.
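
To fill in the PQR and XYZ placeholders from the question, something along these lines might work. It is only a sketch and assumes the member file really is ASCII or UTF-8; toText and toRows are made-up helper names, and the byte literal in main stands in for memberFile.expandedData.

```d
import std.array : array;
import std.stdio : writeln;
import std.string : lineSplitter;

// Hypothetical helper: reinterpret the bytes as UTF-8 text.
// The cast does no validation; idup copies so the result can be a `string`.
string toText(const(ubyte)[] raw)
{
    return (cast(const(char)[]) raw).idup;
}

// Hypothetical helper: one string per line.
// lineSplitter treats "\n", "\r" and "\r\n" as line endings.
string[] toRows(const(ubyte)[] raw)
{
    return toText(raw).lineSplitter.array;
}

void main()
{
    // Stand-in for memberFile.expandedData (invented CSV content).
    ubyte[] raw = cast(ubyte[]) "id,name\r\n1,foo\r\n2,bar".dup;

    string completeCsvContents = toText(raw);
    string[] csvRows = toRows(raw);

    writeln(completeCsvContents.length); // number of UTF-8 code units
    foreach (row; csvRows)
        writeln(row);
}
```

If the data does not need to outlive the archive member, the idup copy can be dropped and the slice used as const(char)[] directly.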

Abstract type

If you know that the string is UTF-8 encoded, you can use std.string.assumeUTF to convert it to a string/char array. All this does is a cast, as Nested type mentions, but it's more self-documenting.

If you need to make sure that the resulting string is actually valid UTF-8 (as there are several operations with undefined behavior on invalid strings), then you can use std.utf.validate. assumeUTF also does this under debug builds.
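
A minimal sketch of combining the two, assuming the data is expected to be UTF-8 but might not be. checkedUtf8 is a made-up helper name, and the byte arrays stand in for expandedData. The validation runs before assumeUTF because assumeUTF's own debug-mode check reports malformed input as a failed assertion rather than as a catchable UTFException.

```d
import std.exception : assertThrown;
import std.stdio : writeln;
import std.string : assumeUTF;
import std.utf : UTFException, validate;

// Hypothetical helper: returns the text if `raw` is well-formed UTF-8,
// otherwise lets std.utf.validate throw a UTFException.
string checkedUtf8(immutable(ubyte)[] raw)
{
    // validate() throws on the first malformed sequence it finds.
    validate(cast(string) raw);
    // From here on the data is known to be valid; assumeUTF only re-types it.
    return raw.assumeUTF;
}

void main()
{
    // "café" encoded as UTF-8 (é is the two bytes C3 A9).
    ubyte[] good = [0x63, 0x61, 0x66, 0xC3, 0xA9];
    writeln(checkedUtf8(good.idup));

    // "café" encoded as Windows-1252 (é is the single byte E9): not valid UTF-8.
    ubyte[] bad = [0x63, 0x61, 0x66, 0xE9];
    assertThrown!UTFException(checkedUtf8(bad.idup));
}
```

The idup calls are there because expandedData is a mutable ubyte[], while a string needs immutable data.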

Colonel Thirty Two
  • I don't have one of the files on this machine, so I can't peek at the format, but that will be a nice safety check. Thank you. (and Nested Type ) – joduncan Dec 20 '15 at 15:06