6

I am looking for some util class/method to take a large String and return an InputStream.

If the String is small, I can just do:

InputStream is = new ByteArrayInputStream(str.getBytes(<charset>));

But when the String is large (1MB, 10MB or more), a byte array 1 to 2 times (or more?) as large as my String is allocated on the spot. (And since you can't know exactly how many bytes to allocate before all the encoding is done, I think other arrays/buffers must be allocated before the final byte array is.)

I have performance requirements, and want to optimize this operation.

Ideally, I think, the class/method I am looking for would encode the characters on the fly, one small block at a time, as the InputStream is being consumed, so there is no big surge of memory allocation.

Looking at the source code of apache commons IOUtils.toInputStream(..), I see that it also converts the String to a big byte array in one go.

And StringBufferInputStream is deprecated, and does not do the job properly.

Is there such a util class/method anywhere? Or can I just write a couple of lines of code to do this?

The functional need for this is that, elsewhere, I am using a util method that takes an InputStream and streams out the bytes from it.

I haven't seen other people looking for something like this. Am I mistaken about something?

I have started writing a custom class for this, but would stop if there is a better/proper/right solution/correction to my need.

Fai Ng
  • 2
    Wait... Are you aware that a `String` is an array of `char`s internally, and that a `char` is two `byte`s long? What is more, you don't even account for the encoding... – fge Jan 12 '15 at 18:53
  • 2
    How about using [ReaderInputStream](http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/ReaderInputStream.html) on top of a [StringReader](http://docs.oracle.com/javase/7/docs/api/java/io/StringReader.html)? Also, see http://stackoverflow.com/questions/837703/how-can-i-get-a-java-io-inputstream-from-a-java-lang-string. – shmosel Jan 12 '15 at 19:01
  • Right, I am hoping to find something that lets me specify the encoding/charset that I want. – Fai Ng Jan 12 '15 at 19:01
  • @shmosel, I'd put that in as an answer. – Louis Wasserman Jan 12 '15 at 19:13
  • 1
    @LouisWasserman Don't have time to look into it now. Feel free to use it. – shmosel Jan 12 '15 at 19:16
  • @shmosel & Louis Wasserman, it really looks like shmosel is showing exactly what I am looking for – Fai Ng Jan 12 '15 at 19:16
  • Is there a particular reason why you want an `InputStream`, not a `StringReader`? Wanting to treat text as bytes is less common than the other way around. – Dawood ibn Kareem Jan 12 '15 at 19:28
  • I don't want to read the text anymore. I just want to stream the text out using a lib method that takes an InputStream only. – Fai Ng Jan 12 '15 at 19:44
  • The answer to this question should go to http://stackoverflow.com/questions/837703/how-can-i-get-a-java-io-inputstream-from-a-java-lang-string, which I don't think has the right answer yet. – Fai Ng Jan 12 '15 at 19:51

4 Answers

4

The Java built-in libraries assume you'd only need to go from chars to bytes in output, not input. The Apache Commons IO libraries have ReaderInputStream, however, which can wrap a StringReader to get what you want.
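With Commons IO on the classpath, the wrapper described above is roughly `new ReaderInputStream(new StringReader(str), charset)`. To illustrate the idea of encoding lazily, here is a minimal JDK-only sketch. `ChunkedStringInputStream` is a hypothetical name, and the 8 KB buffer size and the REPLACE error handling (chosen to behave like `String.getBytes`) are assumptions, not what Commons IO necessarily does internally.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

/** Hypothetical sketch: encodes a String lazily, one small buffer at a time. */
class ChunkedStringInputStream extends InputStream {
    private static final int BUFFER_SIZE = 8192; // assumed size, tune as needed

    private final CharBuffer chars;   // read-only view over the String, no copy
    private final CharsetEncoder encoder;
    private final ByteBuffer bytes = ByteBuffer.allocate(BUFFER_SIZE);
    private boolean encoded = false;  // all chars have gone through encode()
    private boolean flushed = false;  // encoder.flush() has completed

    ChunkedStringInputStream(String str, Charset charset) {
        this.chars = CharBuffer.wrap(str);
        this.encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        this.bytes.flip(); // start with an empty (fully consumed) buffer
    }

    /** Refills the byte buffer; returns false once no bytes are left. */
    private boolean fill() throws IOException {
        while (!flushed) {
            bytes.clear();
            if (!encoded) {
                CoderResult result = encoder.encode(chars, bytes, true);
                if (result.isError()) {
                    result.throwException();
                }
                encoded = result.isUnderflow(); // underflow: input exhausted
            }
            if (encoded) {
                flushed = encoder.flush(bytes).isUnderflow();
            }
            bytes.flip();
            if (bytes.hasRemaining()) {
                return true;
            }
        }
        return false;
    }

    @Override
    public int read() throws IOException {
        if (!bytes.hasRemaining() && !fill()) {
            return -1;
        }
        return bytes.get() & 0xFF; // mask so the result is in 0..255
    }

    @Override
    public int read(byte[] dst, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (!bytes.hasRemaining() && !fill()) {
            return -1;
        }
        int n = Math.min(len, bytes.remaining());
        bytes.get(dst, off, n);
        return n;
    }
}
```

Apart from the String itself, only the fixed-size byte buffer is allocated, regardless of how long the String is.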

beat
Louis Wasserman
1

For me there is a fundamental problem. Why do you have such huge Strings in memory in the first place...

Anyway, you can try this:

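import java.io.*;
import java.nio.*;
import java.nio.charset.*;

public static InputStream largeStringToBytes(final String tooLarge,
    final Charset charset) throws CharacterCodingException
{
    // REPORT makes encode() throw instead of silently substituting '?'
    final CharsetEncoder encoder = charset.newEncoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
    // Use the configured encoder; charset.encode() would ignore it
    final ByteBuffer buf = encoder.encode(CharBuffer.wrap(tooLarge));
    // The backing array may be longer than the data, so honor the limit
    return new ByteArrayInputStream(buf.array(), buf.arrayOffset(), buf.remaining());
}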
fge
  • Why would this be any better than `string.getBytes(charset)`? – Louis Wasserman Jan 12 '15 at 19:12
  • @LouisWasserman Because it can detect malformed inputs... `String`'s `.getBytes()` won't. – fge Jan 12 '15 at 19:24
  • This makes a copy of the input string, just as getBytes. – Petter Nordlander Jan 12 '15 at 20:49
  • @Petter yes, but it detects errors! You _could_ use your decoder to read only partly (I have done that in [this project](https://github.com/fge/largetext), but for the reverse operation), but the fundamental problem here anyway is why such large strings are there in memory... – fge Jan 12 '15 at 21:00
0

If you are passing the large string as a parameter, then its memory is already allocated. In Java a String always lives on the heap, and only a reference to it is passed, so handing it to a method costs nothing extra. The only way I can see to avoid holding it in memory would be to create a tree on disk and stream back a character at a time as you walk the tree. If you have multiple large strings, perhaps you can index them in a Trie or a DAWG and walk that structure. This will eliminate many of the duplicate characters between strings. But I would need to know more about what the strings represent to assist further.

  • The String is allocated already. Just don't want to allocate another big byte array as an intermediate step. – Fai Ng Jan 12 '15 at 19:11
0

Implement your own String-backed input stream:

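import java.io.IOException;
import java.io.InputStream;

class StringifiedInputStream extends InputStream {

    private int idx=0;
    private final String str;
    private final int len;

    StringifiedInputStream(String str) {
        this.str = str;
        this.len = str.length();
    }

    @Override
    public int read() throws IOException {
        if(idx>=len)
            return -1;

        // Keep only the low 8 bits and return them as 0..255. A plain (byte)
        // cast would yield negative values, and 0xFF would look like EOF (-1).
        // Only correct for text whose chars all fit in ISO-8859-1.
        return str.charAt(idx++) & 0xFF;
    }
}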

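This is slow, but it streams the bytes without duplicating the string into a byte array. Note that it emits one byte per char, so it is only correct for text that fits in ISO-8859-1. If speed is an issue, also override the three-argument read(byte[], int, int) method.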
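As a rough sketch of what adding the bulk read might look like, assuming the text is meant to be sent as ISO-8859-1: `Latin1StringInputStream` and `toLatin1` are hypothetical names, and mapping chars above U+00FF to `'?'` is an assumption about the desired fallback.

```java
import java.io.InputStream;

/** Hypothetical variant: streams a String as ISO-8859-1 bytes with a bulk read. */
class Latin1StringInputStream extends InputStream {
    private final String str;
    private int idx = 0;

    Latin1StringInputStream(String str) {
        this.str = str;
    }

    @Override
    public int read() {
        if (idx >= str.length()) {
            return -1;
        }
        return toLatin1(str.charAt(idx++)) & 0xFF; // 0..255, never -1
    }

    @Override
    public int read(byte[] dst, int off, int len) {
        if (len == 0) {
            return 0;
        }
        if (idx >= str.length()) {
            return -1;
        }
        // Copy as many chars as fit, converting each directly into dst
        int n = Math.min(len, str.length() - idx);
        for (int i = 0; i < n; i++) {
            dst[off + i] = toLatin1(str.charAt(idx++));
        }
        return n;
    }

    // Assumed fallback: chars that do not fit in one byte become '?'
    private static byte toLatin1(char c) {
        return (byte) (c <= 0xFF ? c : '?');
    }
}
```

Callers that read through a buffer then pay one method call per block instead of one per byte, and still no intermediate byte array the size of the String is created.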

Danny Daglas