When designing a distributed storage and analytics architecture, is it common to run the analytics engine on the same machines as the data nodes? Specifically, would it make sense to run Spark/Storm directly on Cassandra/HDFS nodes?
I know that MapReduce on HDFS follows this pattern, since according to Hortonworks, YARN minimizes data motion. I have no idea whether the same holds for these other systems, though. I would imagine so, since they seem to plug into each other so easily, but I can't find any information about this online.
I'm sort of a newbie on this topic, so any resources or answers would be greatly appreciated.
Thanks