This is a follow-up question to this one: Spark FlatMap function for huge lists
Summarized: I want to write a Spark flatMap function in Java 8 which generates all possible regular expressions matching a set of DNA sequences. For huge strings this is problematic, since the collection of regexes will not fit in memory (one mapper easily generates gigabytes of data). I understand that I have to resort to something like a lazy sequence, and I assume I have to use a Stream<String>
for this. My question now is how to build this stream.
I had a look here: Java Streams - Stream.Builder.
As my algorithm generates patterns, they can be 'pushed' into the Stream with the accept(String)
method. But when I tried out the code from the link (with its random-string generator replaced by my own generator function) and added some log statements in between, I noticed that the generator function gets executed before build()
is called. I don't understand how all the generated strings could be stored if they don't fit into memory.
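To illustrate, here is a minimal version of what I observed (the random-string generator from the linked example replaced by a plain loop with a log statement): all the 'generating' lines print before build() is called, so every element is buffered up front.

import java.util.stream.Stream;

public class BuilderEagerness {
    public static void main(String[] args) {
        Stream.Builder<String> builder = Stream.builder();
        for (int i = 0; i < 3; i++) {
            System.out.println("generating element " + i); //runs immediately
            builder.accept("pattern-" + i);                //element is buffered in memory
        }
        System.out.println("calling build()");
        Stream<String> stream = builder.build(); //just wraps the buffer, nothing is deferred
        stream.forEach(System.out::println);
    }
}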
Do I have to build the stream in a different way? Basically I want the equivalent of the context.write(substring)
call I had in my MapReduce Mapper.map
function.
UPDATE1: I cannot use the range function; in fact, I am using a structure which iterates over a suffix tree.
UPDATE2: Upon request, a more complete implementation. I didn't replace the interfaces with the actual implementations, because the implementations are very big and not essential to grasp the idea.
More complete problem sketch:
My algorithm tries to discover patterns in DNA sequences. The algorithm takes in sequences from different organisms corresponding to the same gene. Say I have a gene A in maize, and the same gene A in rice and some other species; then I compare their upstream sequences. The patterns I am looking for are similar to regular expressions, for example TGA..GA..GA. To explore all possible patterns I build a generalized suffix tree from the sequences. This tree provides information about the different sequences a pattern occurs in. To decouple the tree from the search algorithm I implemented some sort of iterator structure, the TreeNavigator. It has the following interface:
interface TreeNavigator {
    public void jumpTo(char c); //go from pattern p to p+c (c can be a dot from a regex, or [AC] for example)
    public void backtrack();    //pop the last character
    public List<Position> getMatches();
    public Pattern trail();     //current pattern p
}
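For illustration, this is how I use the navigator contract (a hypothetical walk, assuming some concrete implementation is passed in):

//hypothetical walk through the pattern space using the interface above
void walkExample(TreeNavigator nav) {
    nav.jumpTo('T');  //pattern is now "T"
    nav.jumpTo('G');  //pattern is now "TG"
    nav.jumpTo('.');  //pattern is now "TG." (regex wildcard)
    Pattern p = nav.trail();                //the current pattern, "TG."
    List<Position> hits = nav.getMatches(); //where (and in which sequences) it occurs
    nav.backtrack();  //pop the '.', back to "TG"
}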
interface SearchSpace {
    //degrees of freedom in the regex, min and max length, ...
    public boolean inSearchSpace(Pattern p);
    public Alphabet getPatternAlphabet();
}
interface ScoreCalculator {
    //calculates a score, approximately equal to the number of occurrences of the pattern
    public Score calcConservationScore(TreeNavigator t);
}
//Motif discovery code which is run in the MapReduce Mapper function:
public class DiscoveryAlgorithm {

    private Context context; //MapReduce context object to write to disk
    private Score minScore;
    private SearchSpace searchSpace;
    private ScoreCalculator scoreCalculator;

    public void runDiscovery(){
        //depth-first traversal of the pattern space: A, AA, AAA, ... AAC, ACA, ACC and so forth
        exploreSubTree(new TreeNavigatorImpl()); //TreeNavigatorImpl: placeholder name for the concrete implementation
    }

    //branch and bound for the pattern space: if a pattern occurs too rarely, stop searching
    public boolean survivesBnB(Score s){
        return s.compareTo(minScore) >= 0;
    }

    public void exploreSubTree(TreeNavigator nav){
        Pattern current = nav.trail();
        Score currentScore = scoreCalculator.calcConservationScore(nav);
        if (!survivesBnB(currentScore)){
            return;
        }
        if (searchSpace.inSearchSpace(current)){
            context.write(current);
        }
        //iterate over all possible extensions: A, C, G, T, [AC], [AG], ... [ACGT]
        for (Character c : searchSpace.getPatternAlphabet()){ //assuming Alphabet implements Iterable<Character>
            nav.jumpTo(c);
            exploreSubTree(nav);
            nav.backtrack();
        }
    }
}
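To make the question more concrete, here is a sketch of the direction I am considering: flattening the recursive traversal into a lazy iterator with an explicit stack (PatternIterator is a hypothetical name, and I again assume Alphabet implements Iterable<Character>). Each call to next() would play the role of one context.write(pattern):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.NoSuchElementException;

public class PatternIterator implements Iterator<Pattern> {

    private final TreeNavigator nav;
    private final SearchSpace searchSpace;
    private final ScoreCalculator scoreCalculator;
    private final Score minScore;

    //one frame per depth: the extensions not yet tried at that node
    private final Deque<Iterator<Character>> stack = new ArrayDeque<>();
    private Pattern nextPattern; //the next pattern to emit, or null when exhausted

    public PatternIterator(TreeNavigator nav, SearchSpace searchSpace,
                           ScoreCalculator scoreCalculator, Score minScore) {
        this.nav = nav;
        this.searchSpace = searchSpace;
        this.scoreCalculator = scoreCalculator;
        this.minScore = minScore;
        //root frame: the navigator starts at the empty pattern, which is assumed
        //to survive branch and bound and to fall outside the search space (min length)
        stack.push(searchSpace.getPatternAlphabet().iterator());
        advance();
    }

    @Override
    public boolean hasNext() {
        return nextPattern != null;
    }

    @Override
    public Pattern next() {
        if (nextPattern == null) throw new NoSuchElementException();
        Pattern result = nextPattern;
        advance();
        return result;
    }

    //resume the depth-first traversal until the next emittable pattern is found
    private void advance() {
        while (!stack.isEmpty()) {
            Iterator<Character> extensions = stack.peek();
            if (!extensions.hasNext()) {
                stack.pop(); //all extensions of this node are done
                if (!stack.isEmpty()) {
                    nav.backtrack(); //return to the parent node
                }
                continue;
            }
            nav.jumpTo(extensions.next()); //descend into the next child
            if (scoreCalculator.calcConservationScore(nav).compareTo(minScore) < 0) {
                nav.backtrack(); //pruned by branch and bound: undo the jump
                continue;
            }
            stack.push(searchSpace.getPatternAlphabet().iterator()); //schedule its children
            if (searchSpace.inSearchSpace(nav.trail())) {
                nextPattern = nav.trail(); //this is the next element of the stream
                return;
            }
        }
        nextPattern = null; //traversal finished
    }
}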
FULL MapReduce SOURCE @ https://github.com/drdwitte/CloudSpeller/ Related research paper: http://www.ncbi.nlm.nih.gov/pubmed/26254488
UPDATE3: I have continued reading about ways to create a Stream. From what I have read so far, I think I have to rewrite my runDiscovery() into something that supplies a Spliterator (or an Iterator), which can then be turned into a Stream via the StreamSupport class.
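If that is indeed the way to go, I imagine the last step looking roughly like this (again assuming the hypothetical PatternIterator from above): Spliterators.spliteratorUnknownSize wraps the iterator, and StreamSupport.stream turns it into a Stream that only pulls patterns as they are consumed.

import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class LazyPatternStream {
    public static Stream<Pattern> patternStream(TreeNavigator nav, SearchSpace space,
                                                ScoreCalculator calc, Score minScore) {
        PatternIterator it = new PatternIterator(nav, space, calc, minScore);
        //wrap the iterator; nothing is generated until a terminal operation pulls elements
        return StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(it, Spliterator.ORDERED | Spliterator.NONNULL),
                false); //sequential stream
    }
}

Depending on the Spark version, the flatMap function could then return either this Stream's iterator or an Iterable wrapping it.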