
I have a problem where I want to implement a recursive algorithm in Spark, and I'm looking for recommendations on how to build this in Spark, or for other data-analytics frameworks that might be better suited.

E.g., the job needs to list a directory structure/tree recursively and process its nodes, combined with map/reduce patterns: map paths or groups of files into derived data, then group/merge that derived data recursively.

I'm trying to do this in a way that parallelizes the overall algorithm. It would be straightforward to build a solution that runs on a single node (e.g. the Spark driver), but assume the directory structure is very large, with on the order of a billion leaf nodes.
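For concreteness, here is a minimal pure-Python sketch of the level-by-level (BFS) driver loop I have in mind. The tree and the `list_children`/`process_leaf` helpers are hypothetical stand-ins; in Spark, `frontier` would be an RDD and each expansion step a `flatMap`, so every level is processed in parallel across the cluster instead of in this loop:

```python
# Hypothetical in-memory tree standing in for the real filesystem.
TREE = {
    "/": ["/a", "/b"],
    "/a": ["/a/x", "/a/y"],
    "/b": ["/b/z"],
}

def list_children(path):
    """Return the children of `path`; a leaf has no children."""
    return TREE.get(path, [])

def process_leaf(path):
    """Derive data from a leaf node; here, just its path length."""
    return (path, len(path))

def traverse(root):
    """Iterative level-by-level expansion of the tree.

    In Spark this loop would live on the driver, but `frontier`
    would be an RDD and the expansion a `flatMap`, so listing and
    leaf processing happen in parallel on the executors.
    """
    frontier = [root]
    results = []
    while frontier:
        next_frontier = []
        for path in frontier:
            children = list_children(path)
            if children:
                next_frontier.extend(children)
            else:
                results.append(process_leaf(path))
        frontier = next_frontier
    return results
```

The downside of this pattern is the unbounded number of iterations and the growing frontier, which is why I'm asking whether there is a better-suited framework.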

Any suggestions for building recursive/iterative data pipelines in Spark or in other data-processing frameworks/technologies?

Nikhil Kothari

1 Answer


With Flink, I would look at using the Stateful Functions API for this sort of use case: each directory node can be modeled as an addressable function instance that holds its own state and exchanges messages with its children and parent.
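This is not StateFun SDK code, just a minimal pure-Python simulation of the pattern it enables: one addressable function instance per path, per-instance state, and message passing. Each directory function fans out "visit" messages to its children and aggregates "result" messages back up toward the root; the tree and the per-leaf value (path length) are hypothetical placeholders:

```python
# Hypothetical in-memory tree standing in for the real filesystem.
TREE = {
    "/": ["/a", "/b"],
    "/a": ["/a/x", "/a/y"],
    "/b": ["/b/z"],
}

class DirFunction:
    """One 'stateful function' instance per path, holding per-instance
    state: the parent to report to, how many child results are still
    pending, and the running aggregate."""
    def __init__(self, path):
        self.path = path
        self.parent = None
        self.pending = 0
        self.total = 0

def run(root):
    state = {}                          # address (path) -> instance
    mailbox = [("visit", root, None)]   # (kind, target, payload)
    results = {}
    while mailbox:
        kind, path, payload = mailbox.pop(0)
        fn = state.setdefault(path, DirFunction(path))
        if kind == "visit":
            fn.parent = payload
            children = TREE.get(path, [])
            if children:
                fn.pending = len(children)
                for child in children:
                    mailbox.append(("visit", child, path))
            else:
                # Leaf: derive data and report it to the parent.
                mailbox.append(("result", fn.parent, len(path)))
        else:  # "result": fold a child's value into this node's state
            fn.total += payload
            fn.pending -= 1
            if fn.pending == 0:
                if fn.parent is None:
                    results[path] = fn.total    # root is done
                else:
                    mailbox.append(("result", fn.parent, fn.total))
    return results
```

In StateFun, the runtime (backed by a Flink cluster) delivers the messages and shards the function instances across workers, so the per-node work runs in parallel and the recursion depth is no longer tied to a single driver loop.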

David Anderson