Set DataInputStream to String Value

Question

I am trying to write a JUnit test for a method that depads a word (strips its padding characters). The problem is that the method returns symbols instead of the depadded word.

My test method is

    @Test
    public void testReadString() throws IOException
    {
        String testString = "******test";

        InputStream stream = new ByteArrayInputStream(testString.getBytes(StandardCharsets.UTF_8));
        DataInputStream dis = new DataInputStream(stream);

        String word = readString(dis, 10);

        assertEquals("test", word);
    }

The methods it is testing are

    public static String readString(DataInputStream dis, int size) throws IOException
    {
        byte[] makeBytes = new byte[size * 2]; // 2 bytes per char
        dis.read(makeBytes);  // read size characters (including padding)
        return depad(makeBytes);
    }

    public static String depad(byte[] read)
    {
        //word = word.replace("*", "");
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < read.length; i += 2)
        {
            char c = (char) (((read[i] & 0x00FF) << 8) + (read[i + 1] & 0x00FF));

            if (c != '*')
            {
                word.append(c);
            }
        }
        return word.toString();
    }

The error I get when I run the test is: test failed, expected [test] but was [⨪⨪⨪瑥獴]

Am I concluding correctly from comments, etc, that you are reading a file into a byte array, treating a portion of it as UTF-8 encoded text and you want to de-pad it and get the remaining text as a string? If so, can you explain that better. Are you sure it is UTF-8? — Tom Blodget, Apr 23 '17 at 19:46

John Kugelman · Accepted Answer · 2017-04-23T18:21:52.447

    InputStream stream = new ByteArrayInputStream(testString.getBytes(StandardCharsets.UTF_8));

    ...

    char c = (char) (((read[i] & 0x00FF) << 8) + (read[i + 1] & 0x00FF));

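Your code expects a UCS-2 encoded string (two bytes per character, high byte first), but you're feeding it a UTF-8 encoded string. In UCS-2 each character is exactly two bytes. UTF-8 is a variable-length encoding where ASCII characters are one byte and other characters are two to four bytes. Because every character in your test string is ASCII, each pair of one-byte characters gets fused into a single char: "**" (0x2A 0x2A) becomes U+2A2A (⨪), "te" becomes U+7465 (瑥), and "st" becomes U+7374 (獴).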
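A minimal sketch of the mismatch, assuming the file format really is meant to be two bytes per character with the high byte first (which is what depad decodes). The class name Ucs2DepadSketch is made up for illustration, and readFully is swapped in for read so that a short stream fails loudly instead of leaving zero bytes in the buffer:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical class name, used only for this illustration.
class Ucs2DepadSketch {

    // Same decoding as in the question: two bytes per char, high byte first.
    static String depad(byte[] read) {
        StringBuilder word = new StringBuilder();
        for (int i = 0; i + 1 < read.length; i += 2) {
            char c = (char) (((read[i] & 0xFF) << 8) | (read[i + 1] & 0xFF));
            if (c != '*') {
                word.append(c);
            }
        }
        return word.toString();
    }

    static String readString(DataInputStream dis, int size) throws IOException {
        byte[] bytes = new byte[size * 2]; // 2 bytes per char
        dis.readFully(bytes);              // throws EOFException if the stream is too short
        return depad(bytes);
    }

    public static void main(String[] args) throws IOException {
        String padded = "******test";

        // UTF-16BE encodes each of these chars as two bytes, high byte first (no BOM).
        byte[] utf16 = padded.getBytes(StandardCharsets.UTF_16BE);
        DataInputStream dis = new DataInputStream(new ByteArrayInputStream(utf16));
        System.out.println(readString(dis, 10)); // prints: test

        // UTF-8 encodes the same chars as one byte each, so depad fuses byte pairs
        // into U+2A2A U+2A2A U+2A2A U+7465 U+7374.
        byte[] utf8 = padded.getBytes(StandardCharsets.UTF_8);
        System.out.println(depad(utf8));
    }
}
```

In the test from the question, this would correspond to building the input with StandardCharsets.UTF_16BE instead of StandardCharsets.UTF_8. Plain UTF_16 would likely not work as-is, since Java's encoder for it prepends a byte-order mark.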

See: Comparison of Unicode encodings on Wikipedia

Note that UCS-2 is a very simplistic and antiquated encoding. It can only encode the first 64K Unicode characters (code points U+0000 through U+FFFF, the Basic Multilingual Plane). It's been superseded by UTF-16 in modern Unicode applications. According to the Unicode Consortium:

UCS-2 should now be considered obsolete. It no longer refers to an encoding form in either 10646 or the Unicode Standard.

What's the reason for working with byte arrays, anyways? If you want to manipulate character data you should work with strings, not bytes. Strings keep you from having to worry about encodings.
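If the bytes have to stay (for example because a file format requires them), one option is to convert at the boundary and do the padding logic on strings. The sketch below assumes a fixed-width field padded on the left with '*' and stored as UTF-16BE. The class name and the pad helper are hypothetical, not part of the question's code:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, used only for this illustration.
class StringDepadSketch {

    // Left-pad the word with '*' up to size chars, then encode as 2 bytes per char.
    static byte[] pad(String word, int size) {
        StringBuilder sb = new StringBuilder();
        for (int i = word.length(); i < size; i++) {
            sb.append('*');
        }
        sb.append(word);
        return sb.toString().getBytes(StandardCharsets.UTF_16BE);
    }

    // Decode once with an explicit charset, then strip the padding on the String.
    static String depad(byte[] raw) {
        String padded = new String(raw, StandardCharsets.UTF_16BE);
        return padded.replace("*", "");
    }

    public static void main(String[] args) {
        byte[] field = pad("test", 10);
        System.out.println(field.length); // prints: 20
        System.out.println(depad(field)); // prints: test
    }
}
```

As with the original depad, this drops every '*', so a word that itself contains an asterisk would not survive the round trip.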

Thanks, this is part of an assignment. We have to be able to pad and depad words to a file using byte arrays. Otherwise I would be using ObjectOutputStream to save to a file — Michael Grinnell, Apr 23 '17 at 18:26

score 0 · Answer 2 · answered Apr 23 '17 at 18:26

There are two kinds of I/O classes:

Byte Streams: they are used to read bytes.

You can find a lot of classes like: ByteArrayInputStream and DataInputStream.

Character Streams: they are used to read human-readable text.

You can find a lot of classes like: StringReader and InputStreamReader. These classes are easy to spot because their names end with the suffix Reader or Writer.

I suggest using StringReader like this:

    new StringReader("******test");
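As a rough sketch of how that could look, assuming the method under test can be changed to accept a Reader instead of a DataInputStream (the class name and this readString signature are illustrative, and this does not cover an assignment that requires byte arrays):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical class name, used only for this illustration.
class ReaderDepadSketch {

    // Read up to size chars from the reader and drop the '*' padding.
    static String readString(Reader reader, int size) {
        StringBuilder word = new StringBuilder();
        try {
            for (int i = 0; i < size; i++) {
                int c = reader.read(); // one char, or -1 at end of input
                if (c == -1) {
                    break;
                }
                if (c != '*') {
                    word.append((char) c);
                }
            }
        } catch (IOException e) {
            // Wrapped so callers do not need a throws clause in this sketch.
            throw new UncheckedIOException(e);
        }
        return word.toString();
    }

    public static void main(String[] args) {
        Reader reader = new StringReader("******test");
        System.out.println(readString(reader, 10)); // prints: test
    }
}
```

Because a Reader hands back chars rather than bytes, the test no longer has to pick an encoding at all.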
