I have a very basic question. Spark's flatMap
function allows you to emit 0, 1, or more outputs per input, so the (lambda) function you feed to flatMap should return a list.
My question is: what happens if this list is too large for memory to handle?
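For concreteness, the pattern I have in mind is the usual one (PySpark; the path is just a placeholder):

```python
from pyspark import SparkContext

sc = SparkContext(appName="example")  # or an already existing context

# Classic flatMap usage: the lambda returns a whole list per input line.
lines = sc.textFile("hdfs:///some/input.txt")  # placeholder path
words = lines.flatMap(lambda line: line.split(" "))
```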
I haven't implemented this yet; the question needs to be resolved before I rewrite my MapReduce software, which could easily deal with this by calling context.write()
anywhere in my algorithm. (The output of a single mapper could easily be many gigabytes.)
In case you're interested: each mapper does a kind of word count, but it actually generates all possible substrings, together with a wide range of regular expressions matching the text (a bioinformatics use case).
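A stripped-down sketch of what I mean, reusing `sc` from above (names and paths are made up, and I've left out the regex part):

```python
def all_substrings(seq):
    # Builds the complete list of substrings in memory before flatMap can
    # emit anything -- this is the list I'm worried about.
    return [seq[i:j]
            for i in range(len(seq))
            for j in range(i + 1, len(seq) + 1)]

sequences = sc.textFile("hdfs:///data/sequences.txt")  # placeholder path
counts = (sequences
          .flatMap(all_substrings)
          .map(lambda s: (s, 1))
          .reduceByKey(lambda a, b: a + b))
```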