2

When designing a distributed storage and analytics architecture, is it a common usage pattern to run an analytics engine on the same machine as the data nodes? Specifically, would it make sense to run Spark/Storm directly on Cassandra/HDFS nodes?

I know that MapReduce on HDFS has this sort of usage pattern since according to Hortonworks, YARN minimizes data motion. I have no idea whether this is the case with these other systems though. I would imagine it is since they seem to be so pluggable with each other, but I can't seem to find any information about this online.

I'm sort of a newbie on this topic, so any resources or answers would be greatly appreciated.

Thanks

1 Answers1

6

Yes it makes sense to run Spark on Cassandra nodes to minimize data movement between machines.

When you create an RDD from a Cassandra table, the RDD partitions will be created from the token ranges that are local to each machine.

Here's a link to a talk on this subject for the Spark Cassandra connector:

Cassandra and Spark: Optimizing for Data Locality

As it says in the summary: "There are only three things that are important in doing analytics on a distributed database: Locality, locality and locality."

Jim Meyer
  • 9,275
  • 1
  • 24
  • 49