
I am trying to understand the textFile method deeply, but I think my lack of Hadoop knowledge is holding me back here. Let me lay out my understanding, and maybe you can correct anything that is incorrect.

When sc.textFile(path) is called, defaultMinPartitions is used, which is really just math.min(taskScheduler.defaultParallelism, 2). Let's assume we are using the SparkDeploySchedulerBackend, where defaultParallelism is:

conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2))

So, now let's say the default is 2. Going back to textFile, this is passed in to HadoopRDD, and the true number of partitions is determined in getPartitions() using inputFormat.getSplits(jobConf, minPartitions). But, from what I can find, minPartitions is merely a hint and is in fact mostly ignored, so you will probably just get the total number of blocks.

OK, this fits with expectations. However, what if the default is not used and you provide a minPartitions value that implies a partition size larger than the block size? If my research is right and the getSplits call simply ignores this parameter, wouldn't the provided minimum end up being ignored, and you would still just get block-sized partitions?
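
For concreteness, this is roughly the experiment I have in mind (the path is just a placeholder, and the file is assumed to span many HDFS blocks):

// Explicitly ask for far fewer partitions than there are blocks. If getSplits
// really ignores the hint, partitions.length should still come back as roughly
// the number of blocks rather than 3.
val rdd = sc.textFile("hdfs:///some/large/file", 3)  // hypothetical path
println(rdd.partitions.length)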

Cross-posted to the Spark mailing list.

Justin Pihony

2 Answers


Short Version:

Split size is determined by mapred.min.split.size or mapreduce.input.fileinputformat.split.minsize. If that value is bigger than HDFS's blockSize, multiple blocks inside the same file are combined into a single split.

Detailed Version:

I think your understanding of the procedure before inputFormat.getSplits is right.

Inside inputFormat.getSplits, or more specifically inside FileInputFormat's getSplits, it is mapred.min.split.size or mapreduce.input.fileinputformat.split.minsize that ultimately determines the split size. (I'm not sure which one takes effect in Spark; I lean toward the former.)

Let's look at the code (FileInputFormat from Hadoop 2.4.0):

long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
long minSize = Math.max(job.getLong(org.apache.hadoop.mapreduce.lib.input.
  FileInputFormat.SPLIT_MINSIZE, 1), minSplitSize);

// generate splits
ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
NetworkTopology clusterMap = new NetworkTopology();

for (FileStatus file: files) {
  Path path = file.getPath();
  long length = file.getLen();
  if (length != 0) {
    FileSystem fs = path.getFileSystem(job);
    BlockLocation[] blkLocations;
    if (file instanceof LocatedFileStatus) {
      blkLocations = ((LocatedFileStatus) file).getBlockLocations();
    } else {
      blkLocations = fs.getFileBlockLocations(file, 0, length);
    }
    if (isSplitable(fs, path)) {
      long blockSize = file.getBlockSize();
      long splitSize = computeSplitSize(goalSize, minSize, blockSize);

      long bytesRemaining = length;
      while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
        String[] splitHosts = getSplitHosts(blkLocations,
            length-bytesRemaining, splitSize, clusterMap);
        splits.add(makeSplit(path, length-bytesRemaining, splitSize,
            splitHosts));
        bytesRemaining -= splitSize;
      }

      if (bytesRemaining != 0) {
        String[] splitHosts = getSplitHosts(blkLocations, length
            - bytesRemaining, bytesRemaining, clusterMap);
        splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
            splitHosts));
      }
    } else {
      String[] splitHosts = getSplitHosts(blkLocations,0,length,clusterMap);
      splits.add(makeSplit(path, 0, length, splitHosts));
    }
  } else { 
    //Create empty hosts array for zero length files
    splits.add(makeSplit(path, 0, length, new String[0]));
  }
}

Inside the for loop, makeSplit() is used to generate each split, and splitSize is the effective split size. The computeSplitSize function that produces splitSize:

protected long computeSplitSize(long goalSize, long minSize,
                                   long blockSize) {
  return Math.max(minSize, Math.min(goalSize, blockSize));
}

Therefore, if minSplitSize > blockSize, each output split is actually a combination of several blocks inside the same HDFS file; on the other hand, if minSplitSize < blockSize, each split corresponds to (at most) one HDFS block.
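
As a rough illustration of that behaviour from the Spark side (a sketch only; the path is a placeholder and a 128 MB HDFS block size is assumed):

// Request a minimum split size of 256 MB, i.e. larger than one block, before
// the HadoopRDD is created. Both property names are set, since it is not clear
// which one Spark's old-API path picks up (the first is the deprecated name).
sc.hadoopConfiguration.setLong("mapred.min.split.size", 256L * 1024 * 1024)
sc.hadoopConfiguration.setLong("mapreduce.input.fileinputformat.split.minsize", 256L * 1024 * 1024)

// With minSize > blockSize, computeSplitSize returns minSize, so each split
// should now cover roughly two 128 MB blocks and the partition count should
// drop to about half the number of blocks.
val rdd = sc.textFile("hdfs:///some/large/file")  // hypothetical path
println(rdd.partitions.length)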

yjshen
I will add a few more points, with examples, to Yijie Shen's answer.

Before we go into the details, let's establish the following.

Assume that we are working with a Spark standalone/local setup on a system with 4 cores.

In the application, if the master is configured as below:

new SparkConf().setMaster("local[*]")

then:

defaultParallelism : 4 (taskScheduler.defaultParallelism, i.e. the number of cores)

/* Default level of parallelism to use when not given by user (e.g. parallelize and makeRDD). */

defaultMinPartitions : 2 // Default min number of partitions for Hadoop RDDs when not given by user

Notice that math.min is used, so defaultMinPartitions cannot be higher than 2.

The logic to compute defaultMinPartitions is as below:

def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
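
In a spark-shell started with that master, both values can be checked directly (the numbers below assume the 4-core machine described above):

sc.defaultParallelism    // 4, i.e. the number of cores for local[*]
sc.defaultMinPartitions  // 2, i.e. math.min(4, 2)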

The actual split (i.e. partition) size is determined by the following formula in the method FileInputFormat.computeSplitSize:

package org.apache.hadoop.mapred;
public abstract class FileInputFormat<K, V> implements InputFormat<K, V> {
    protected long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }
}

where,
    minSize is the Hadoop parameter mapreduce.input.fileinputformat.split.minsize (default: 1 byte)
    blockSize is the value of dfs.block.size in cluster mode (the default in Hadoop 2.0 is 128 MB) and fs.local.block.size in local mode (default: 32 MB, i.e. blockSize = 33554432 bytes)
    goalSize = totalInputSize / numPartitions
        where,
            totalInputSize is the total size in bytes of all the files in the input path
            numPartitions is the custom parameter provided to the method sc.textFile(inputPath, numPartitions); if not provided, it falls back to defaultMinPartitions, i.e. 2 if the master is set as local[*]

blockSize in local mode = 33554432 bytes
33554432 / 1024 = 32768 KB
32768 / 1024 = 32 MB


Ex1: If our file size is 91 bytes

minSize = 1 (mapreduce.input.fileinputformat.split.minsize = 1 byte)
goalSize = totalInputSize / numPartitions
goalSize = 91 (file size) / 12 (partitions provided as the 2nd parameter in sc.textFile) = 7

splitSize = Math.max(minSize, Math.min(goalSize, blockSize)) => Math.max(1, Math.min(7, 33554432)) = 7 // 33554432 is the block size in local mode

Splits = 91 (file size in bytes) / 7 (splitSize) => 13

FileInputFormat: Total # of splits generated by getSplits: 13

=> While calculating splitSize, if goalSize (file size / numPartitions) is greater than 32 MB, the split size is capped at the default fs.local.block.size = 32 MB, i.e. blockSize = 33554432 bytes.
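
The Ex1 arithmetic can be reproduced with a few lines of Scala (values are taken from the example above, not from a real cluster):

val minSize       = 1L         // mapreduce.input.fileinputformat.split.minsize
val blockSize     = 33554432L  // fs.local.block.size (32 MB, local mode)
val totalSize     = 91L        // input file size in bytes
val numPartitions = 12         // second argument to sc.textFile

val goalSize  = totalSize / numPartitions                         // 7
val splitSize = math.max(minSize, math.min(goalSize, blockSize))  // 7
// FileInputFormat's SPLIT_SLOP (1.1) can shave one split off in edge cases,
// but here the simple ceiling matches the reported count.
val numSplits = math.ceil(totalSize.toDouble / splitSize).toInt   // 13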

Yuva