How to decode a ubyte[] using a specified encoding?

Question

The problem is: how do you parse a file when the encoding is only known at runtime?

The encoding could be UTF-8, UTF-16, Latin-1, or something else.

The goal is to convert a ubyte[] to a string using the selected encoding, because std.stdio.File.byChunk and std.mmfile.MmFile give you the data as ubyte[].

Rather than posting code, you should instead describe the problem you're trying to solve. — Vladimir Panteleev, Mar 11 '12 at 13:27

score 1 · Answer 1 · answered Mar 11 '12 at 03:17

Are you trying to convert a text file to UTF-8? If the answer is yes, Phobos has a function specifically for this: @trusted string toUTF8(in char[] s). See http://dlang.org/phobos/std_utf.html for details.

Sorry if this is not what you need.

Raxillan


The only thing `toUTF8` does is validate the input string and return a copy of it. D's `string` type already uses UTF-8. [Source](https://github.com/D-Programming-Language/phobos/blob/master/std/utf.d#L1231) – Vladimir Panteleev Mar 11 '12 at 13:25
@CyberShadow OK, but what about converting this "standard" UTF-8 to some other encoding? I found only this in std.encoding: `void transcode(Src, Dst)(immutable(Src)[] s, out immutable(Dst)[] r)` – Raxillan Mar 11 '12 at 14:45
That looks like the solution to the OP's problem. But how do you properly add a new encoding for use with `transcode`? For example, Windows-1251 (Cyrillic), which is widely used under Windows. – Raxillan Mar 11 '12 at 16:16
toUTF8 converts a char[] to a string. Here I would like to convert a ubyte[] using a given encoding – bioinfornatics Mar 12 '12 at 06:48
Hm, char and ubyte have the same size (8 bits, unsigned), so I don't understand why `transcode` isn't the solution to your problem. – Raxillan Mar 12 '12 at 15:45
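
For readers following the `transcode` suggestion, here is a minimal sketch of how it might be applied to raw bytes. It assumes the bytes are Latin-1 and that this is known at compile time; `latin1ToUtf8` is a hypothetical helper name, not something from the thread or from Phobos.

```d
import std.encoding : Latin1String, transcode;
import std.stdio : writeln;

// Hypothetical helper (not from the thread): treat raw bytes as Latin-1
// and transcode them to a UTF-8 string.
string latin1ToUtf8(immutable(ubyte)[] raw)
{
    // Latin1Char is an enum based on ubyte, so this cast only changes
    // the static type; no bytes are copied or modified.
    auto latin1 = cast(Latin1String) raw;
    string utf8;
    transcode(latin1, utf8);
    return utf8;
}

void main()
{
    // "café" encoded as Latin-1: the é is the single byte 0xE9.
    immutable(ubyte)[] raw = [0x63, 0x61, 0x66, 0xE9];
    assert(latin1ToUtf8(raw) == "caf\u00E9");
    writeln(latin1ToUtf8(raw));
}
```

Because the source type is a template argument, this only covers the case where the encoding is fixed when the program is compiled, which is not quite what the question asks for.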

score 0 · Answer 2 · answered Sep 25 '12 at 12:21

File.byChunk returns a range which returns a ubyte[] via front.

A quick Google search indicates that UTF-8 uses 1 to 4 bytes per code point (older definitions allowed up to 6), so having 6 bytes buffered is always enough. With that much data available, you can use std.encoding's decode to convert it to a dchar. You can then use std.utf's toUTF8 to get a regular string instead of a dstring.

The convert function below converts any range of unsigned-integer arrays to a string.

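// Selective imports: std.encoding and std.utf both define a function named decode.
import std.encoding : decode;
import std.stdio : File, writeln;
import std.traits : isArray, isUnsigned;
import std.utf : toUTF8;

void main()
{
    File input = File("test.txt");

    string data = convert(input.byChunk(512));

    writeln("Data: ", data);
}

// Accepts any range whose elements are arrays of unsigned integers (e.g. ubyte[]).
string convert(R)(R chunkRange)
if (isArray!(typeof(chunkRange.front)) && isUnsigned!(typeof(chunkRange.front[0])))
{
    const(char)[] inbuffer; // bytes not yet decoded, viewed as UTF-8 code units
    dchar[] outbuffer;

    while (inbuffer.length > 0 || !chunkRange.empty)
    {
        // Refill so that a complete sequence is buffered; 6 bytes covers the
        // longest UTF-8 sequence (4 bytes today, 6 in the original definition).
        while (inbuffer.length < 6 && !chunkRange.empty)
        {
            inbuffer ~= cast(const(char)[]) chunkRange.front;
            chunkRange.popFront();
        }

        // decode removes one code point from the front of inbuffer.
        outbuffer ~= decode(inbuffer);
    }

    return toUTF8(outbuffer); // Convert to string instead of dstring
}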

bioinfornatics · Answer 3 · 2012-03-10T21:46:15.703

I have found a way; maybe using std.algorithm.reduce would be better.

import std.string;
import std.stdio;
import std.encoding;
import std.algorithm;

void main( string[] args ){
    File f = File( "pathToAfFile.txt", "r" );
    size_t i;
    auto e = EncodingScheme.create("utf-8");
    foreach( const(ubyte)[] buffer; f.byChunk( 4096 ) ){
        size_t step = 0;
        if( step == 0 ) step = e.firstSequence( buffer );
        for( size_t start; start + step < buffer.length; start = start + step )
            write( e.decode( buffer[start..start + step] ) );
    }
}

edited Mar 10 '12 at 21:46

answered Mar 10 '12 at 21:09

This is a bad solution. The chunk size may cut the file in the middle of a UTF-8 sequence. It looks like your code will not cause any exceptions, but it will skip characters. – Vladimir Panteleev Mar 11 '12 at 13:23
As long as the chunk size is a multiple of the UTF-8 (or other) sequence length, it is safe; e.firstSequence guarantees this. If the value of e.firstSequence is a multiple of the chunk value, it is OK. – bioinfornatics Mar 11 '12 at 16:17
UTF-8 is a variable-length encoding. – Vladimir Panteleev Mar 11 '12 at 16:52
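
To make the variable-length point concrete, here is a small sketch using `std.utf.codeLength`, `std.utf.validate` and `std.exception.assertThrown`. The sample characters are arbitrary choices for illustration.

```d
import std.exception : assertThrown;
import std.utf : codeLength, UTFException, validate;

void main()
{
    // The number of UTF-8 code units (bytes) depends on the code point.
    assert(codeLength!char('a') == 1);          // U+0061
    assert(codeLength!char('\u00E9') == 2);     // U+00E9
    assert(codeLength!char('\u20AC') == 3);     // U+20AC
    assert(codeLength!char('\U0001F600') == 4); // U+1F600

    // A chunk boundary that falls inside a sequence leaves an incomplete
    // sequence behind, which is not valid UTF-8 on its own.
    string s = "\u00E9";
    assert(s.length == 2);
    assertThrown!UTFException(validate(s[0 .. 1]));
}
```

So no fixed chunk size can guarantee that every chunk ends on a sequence boundary, unless the text happens to contain only single-byte characters.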

score 0 · Answer 4 · answered Mar 11 '12 at 13:21

D strings are already UTF-8. No transcoding is necessary. You can use validate from std.utf to check if the file contains valid UTF-8. If you use readText from std.file, it will do the validation for you.
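
As a rough illustration of the `validate` route, here is a sketch that checks raw bytes before treating them as a D string. `isValidUtf8` is a hypothetical helper name, and the byte arrays are made-up test data.

```d
import std.utf : UTFException, validate;

// Hypothetical helper (not from the thread): true if the bytes form valid UTF-8.
bool isValidUtf8(const(ubyte)[] bytes)
{
    try
    {
        // validate throws a UTFException on the first malformed sequence.
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    const(ubyte)[] utf8Bytes = [0x63, 0x61, 0x66, 0xC3, 0xA9]; // "café" in UTF-8
    const(ubyte)[] latin1Bytes = [0x63, 0x61, 0x66, 0xE9];     // "café" in Latin-1
    assert(isValidUtf8(utf8Bytes));
    assert(!isValidUtf8(latin1Bytes));
}
```

A failed check only says the data is not UTF-8; it does not say which encoding it actually is.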

Vladimir Panteleev


I know, it was an example. I want to read text in various encodings. It could be in Latin-1 or something else. – bioinfornatics Mar 11 '12 at 13:50
This is why it's important to say the actual problem you're trying to solve! :) With the code you posted, I could only guess what you really wanted to do. – Vladimir Panteleev Mar 11 '12 at 14:40
I have edited the question. In fact I want to 1) use MmFile and 2) convert a ubyte[] using an encoding chosen at runtime. The first part is fine. – bioinfornatics Mar 12 '12 at 06:50
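
For the runtime case described in this comment, one possible approach is to look the encoding up by name with `std.encoding.EncodingScheme` and let it consume one sequence at a time. This is a minimal sketch, assuming the whole input is already in memory as a `const(ubyte)[]` (for example from a memory-mapped file) and contains only valid sequences; `decodeBytes` is a hypothetical helper name.

```d
import std.encoding : EncodingScheme;

// Hypothetical helper (not from the thread): decode `bytes` with an encoding
// chosen by name at runtime and return the text as a UTF-8 string.
string decodeBytes(string encodingName, const(ubyte)[] bytes)
{
    // create throws an EncodingException if the name is not registered.
    EncodingScheme scheme = EncodingScheme.create(encodingName);
    string result;
    while (bytes.length > 0)
    {
        // decode removes one complete sequence from the front of the slice,
        // however many bytes that takes. Appending a dchar to a string
        // re-encodes it as UTF-8.
        result ~= scheme.decode(bytes);
    }
    return result;
}

void main()
{
    const(ubyte)[] latin1 = [0x63, 0x61, 0x66, 0xE9];     // "café" in Latin-1
    const(ubyte)[] utf8 = [0x63, 0x61, 0x66, 0xC3, 0xA9]; // "café" in UTF-8
    assert(decodeBytes("ISO-8859-1", latin1) == "caf\u00E9");
    assert(decodeBytes("utf-8", utf8) == "caf\u00E9");
}
```

For input that may be malformed, `EncodingScheme.safeDecode` is probably the better call, since it reports an invalid sequence through its return value instead of relying on the input being valid.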