17

Imagine I do some Spark operations on a file hosted in HDFS. Something like this:

val file = sc.textFile("hdfs://...")
val items = file.map(_.split('\t'))
...

Because in the Hadoop world the code should go where the data is, right?

So my question is: How do Spark workers know of HDFS data nodes? How does Spark know on which Data Nodes to execute the code?

Frizz
  • Look up the documentation: https://spark.apache.org/docs/latest/cluster-overview.html It depends on the cluster manager. – stefana Feb 12 '15 at 16:13
  • I don't think Spark cares about where the data is, and I don't think you should either. Throughput is limited by disk, not network. I don't agree with "the code should go where the data is". – Daniel Darabos Feb 12 '15 at 16:48
  • When you use `hdfs` as the protocol, the filesystem API gives away the physical locations. Whether Spark uses it or not doesn't matter much, as Daniel already said. – Thomas Jungblut Feb 12 '15 at 16:54
  • To take advantage of data locality, Hadoop Map/Reduce transfers code to nodes that have the required data, which the nodes then process in parallel. Spark must do the same, IMHO. I can imagine that, with the help of a ResourceManager (like YARN), Spark is able to do so. Which would mean that I always have to set up an RM in order to "properly" run Spark (beyond simple word-count demos). No? – Frizz Feb 13 '15 at 08:19
  • 7
    Spark does use locality. Look at `HadoopRDD`. You most certainly want to avoid moving data across the network most of all. – Sean Owen Feb 13 '15 at 09:41

1 Answer

14

Spark reuses Hadoop classes: when you call `textFile`, it creates a `TextInputFormat`, which has a `getSplits` method (a split is roughly a partition or block), and then each `InputSplit` has `getLocations` and `getLocationInfo` methods.
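
As a rough illustration of those Hadoop calls (a minimal sketch outside Spark; the HDFS path and the split-count hint are placeholders), you can ask the old-API `TextInputFormat` for its splits and print which hosts hold each one:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

// Ask Hadoop's (old-API) TextInputFormat for its splits and print
// which data nodes host each one -- the same information Spark's
// HadoopRDD uses for locality-aware scheduling.
val conf = new JobConf()
FileInputFormat.setInputPaths(conf, new Path("hdfs:///some/file.txt"))  // placeholder path

val fmt = new TextInputFormat()
fmt.configure(conf)

// Roughly one split per HDFS block; the second argument is only a hint.
val splits = fmt.getSplits(conf, 1)
splits.foreach { split =>
  println(s"$split -> hosts: ${split.getLocations.mkString(", ")}")
}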

G Quintana
  • 2
    Let me clarify this: When my file is somewhere in HDFS, Spark can figure out on which node it is, right? Is it enough to set up a Spark worker on all of my HDFS data nodes - and Spark will automatically route the data to the right node? Or do I always need a Resource Manager (like Mesos or YARN)? – Frizz Feb 13 '15 at 08:26
  • 1
    Yes. Using `InputFormat` means it is reusing logic that can determine where input splits are located. This is used for scheduling. – Sean Owen Feb 13 '15 at 09:40
  • 1
    No, using YARN is not required: each Spark worker knows on which node it is running. The Spark master can then select worker nodes based on data location (and available resources). Still, if you already have a Hadoop YARN cluster, reusing it may be a good idea. – G Quintana Feb 13 '15 at 09:46
  • 3
    Interesting. So I can install HDFS and Spark independently of each other (first install my HDFS data nodes, then install my Spark workers)? And because the "location information" is compatible between the two frameworks, Spark automatically selects the right worker/data node - can I put it this way? – Frizz Feb 13 '15 at 10:29
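
One way to see the locality information Spark derives from HDFS (a minimal spark-shell sketch; the path is a placeholder) is to print each partition's preferred locations, which are the data nodes holding the corresponding blocks:

val rdd = sc.textFile("hdfs:///some/file.txt")  // placeholder path

// Each partition's preferred locations are the hosts of the underlying
// HDFS block(s); the scheduler tries to run the task on, or close to,
// one of these hosts.
rdd.partitions.foreach { p =>
  println(s"partition ${p.index}: ${rdd.preferredLocations(p).mkString(", ")}")
}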