This is a follow-up question to this one: Spark FlatMap function for huge lists
Summarized: I want to write a Spark flatMap function in Java 8 which generates all possible regular expressions matching a set of DNA sequences. For huge strings this is problematic, since the collection of regexes will not fit in memory (one mapper easily generates gigabytes of data). I understand that I have to resort to something like a lazy sequence, and I assume I have to use a Stream<String>
for this. My question now is how to build this stream.
I had a look here: Java Streams - Stream.Builder.
As my algorithm generates patterns, they can be 'pushed' into the Stream with the accept(String)
method. But when I tried out the code from the link (with its random-string generator replaced by my own generator function) and added some log statements in between, I noticed that the generator function gets executed before build()
is called. I don't understand how all the generated strings could be stored if they don't fit into memory.
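To illustrate, here is a minimal version of what I observed (the random-string generator from the linked example replaced by a plain loop with a log statement): all the 'generating' lines print before build() is called, so every element is buffered up front.

import java.util.stream.Stream;

public class BuilderEagerness {
    public static void main(String[] args) {
        Stream.Builder<String> builder = Stream.builder();
        for (int i = 0; i < 3; i++) {
            System.out.println("generating element " + i); //runs immediately
            builder.accept("pattern-" + i);                //element is buffered in memory
        }
        System.out.println("calling build()");
        Stream<String> stream = builder.build(); //just wraps the buffer, nothing is deferred
        stream.forEach(System.out::println);
    }
}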
Do I have to build the stream in a different way? Basically I want the equivalent of the context.write(substring)
call I had in my MapReduce Mapper.map
function.
UPDATE1: I cannot use the range function; in fact, I am using a structure which iterates over a suffix tree.
UPDATE2: Upon request, a more complete implementation. I didn't replace the interfaces with the actual implementations, because the implementations are very big and not essential to grasp the idea.
More complete problem sketch:
My algorithm tries to discover patterns in DNA sequences. The algorithm takes in sequences from different organisms corresponding to the same gene. Say I have a gene A in maize, and the same gene A in rice and some other species; then I compare their upstream sequences. The patterns I am looking for are similar to regular expressions, for example TGA..GA..GA. To explore all possible patterns I build a generalized suffix tree from the sequences. This tree provides information about the different sequences a pattern occurs in. To decouple the tree from the search algorithm I implemented some sort of iterator structure, the TreeNavigator. It has the following interface:
interface TreeNavigator {
    public void jumpTo(char c); //go from pattern p to p+c (c can be a dot from a regex, or [AC] for example)
    public void backtrack();    //pop the last character
    public List<Position> getMatches();
    public Pattern trail();     //current pattern p
}
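For illustration, this is how I use the navigator contract (a hypothetical walk, assuming some concrete implementation is passed in):

//hypothetical walk through the pattern space using the interface above
void walkExample(TreeNavigator nav) {
    nav.jumpTo('T');  //pattern is now "T"
    nav.jumpTo('G');  //pattern is now "TG"
    nav.jumpTo('.');  //pattern is now "TG." (regex wildcard)
    Pattern p = nav.trail();                //the current pattern, "TG."
    List<Position> hits = nav.getMatches(); //where (and in which sequences) it occurs
    nav.backtrack();  //pop the '.', back to "TG"
}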
interface SearchSpace {
    //degrees of freedom in the regex, min and max length, ...
    public boolean inSearchSpace(Pattern p);
    public Alphabet getPatternAlphabet();
}
interface ScoreCalculator {
    //calculates a score, approximately equal to the number of occurrences of the pattern
    public Score calcConservationScore(TreeNavigator t);
}
//Motif discovery code which is run in the MapReduce Mapper function:
public class DiscoveryAlgorithm {

    private Context context; //MapReduce context object to write to disk
    private Score minScore;
    private SearchSpace searchSpace;
    private ScoreCalculator scoreCalculator;

    public void runDiscovery(){
        //depth-first traversal of the pattern space: A, AA, AAA, ... AAC, ACA, ACC and so forth
        exploreSubTree(new TreeNavigatorImpl()); //TreeNavigatorImpl: placeholder name for the concrete implementation
    }

    //branch and bound for the pattern space: if a pattern occurs too rarely, stop searching
    public boolean survivesBnB(Score s){
        return s.compareTo(minScore) >= 0;
    }

    public void exploreSubTree(TreeNavigator nav){
        Pattern current = nav.trail();
        Score currentScore = scoreCalculator.calcConservationScore(nav);
        if (!survivesBnB(currentScore)){
            return;
        }
        if (searchSpace.inSearchSpace(current)){
            context.write(current);
        }
        //iterate over all possible extensions: A, C, G, T, [AC], [AG], ... [ACGT]
        for (Character c : searchSpace.getPatternAlphabet()){ //assuming Alphabet implements Iterable<Character>
            nav.jumpTo(c);
            exploreSubTree(nav);
            nav.backtrack();
        }
    }
}
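To make the question more concrete, here is a sketch of the direction I am considering: flattening the recursive traversal into a lazy iterator with an explicit stack (PatternIterator is a hypothetical name, and I again assume Alphabet implements Iterable<Character>). Each call to next() would play the role of one context.write(pattern):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.NoSuchElementException;

public class PatternIterator implements Iterator<Pattern> {

    private final TreeNavigator nav;
    private final SearchSpace searchSpace;
    private final ScoreCalculator scoreCalculator;
    private final Score minScore;

    //one frame per depth: the extensions not yet tried at that node
    private final Deque<Iterator<Character>> stack = new ArrayDeque<>();
    private Pattern nextPattern; //the next pattern to emit, or null when exhausted

    public PatternIterator(TreeNavigator nav, SearchSpace searchSpace,
                           ScoreCalculator scoreCalculator, Score minScore) {
        this.nav = nav;
        this.searchSpace = searchSpace;
        this.scoreCalculator = scoreCalculator;
        this.minScore = minScore;
        //root frame: the navigator starts at the empty pattern, which is assumed
        //to survive branch and bound and to fall outside the search space (min length)
        stack.push(searchSpace.getPatternAlphabet().iterator());
        advance();
    }

    @Override
    public boolean hasNext() {
        return nextPattern != null;
    }

    @Override
    public Pattern next() {
        if (nextPattern == null) throw new NoSuchElementException();
        Pattern result = nextPattern;
        advance();
        return result;
    }

    //resume the depth-first traversal until the next emittable pattern is found
    private void advance() {
        while (!stack.isEmpty()) {
            Iterator<Character> extensions = stack.peek();
            if (!extensions.hasNext()) {
                stack.pop(); //all extensions of this node are done
                if (!stack.isEmpty()) {
                    nav.backtrack(); //return to the parent node
                }
                continue;
            }
            nav.jumpTo(extensions.next()); //descend into the next child
            if (scoreCalculator.calcConservationScore(nav).compareTo(minScore) < 0) {
                nav.backtrack(); //pruned by branch and bound: undo the jump
                continue;
            }
            stack.push(searchSpace.getPatternAlphabet().iterator()); //schedule its children
            if (searchSpace.inSearchSpace(nav.trail())) {
                nextPattern = nav.trail(); //this is the next element of the stream
                return;
            }
        }
        nextPattern = null; //traversal finished
    }
}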
FULL MapReduce SOURCE @ https://github.com/drdwitte/CloudSpeller/ Related research paper: http://www.ncbi.nlm.nih.gov/pubmed/26254488
UPDATE3: I have continued reading about ways to create a Stream. From what I have read so far, I think I have to rewrite my runDiscovery() into something that supplies a Spliterator (or an Iterator), which can then be turned into a Stream via the StreamSupport class.
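If that is indeed the way to go, I imagine the last step looking roughly like this (again assuming the hypothetical PatternIterator from above): Spliterators.spliteratorUnknownSize wraps the iterator, and StreamSupport.stream turns it into a Stream that only pulls patterns as they are consumed.

import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class LazyPatternStream {
    public static Stream<Pattern> patternStream(TreeNavigator nav, SearchSpace space,
                                                ScoreCalculator calc, Score minScore) {
        PatternIterator it = new PatternIterator(nav, space, calc, minScore);
        //wrap the iterator; nothing is generated until a terminal operation pulls elements
        return StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(it, Spliterator.ORDERED | Spliterator.NONNULL),
                false); //sequential stream
    }
}

Depending on the Spark version, the flatMap function could then return either this Stream's iterator or an Iterable wrapping it.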