
How can a file from HDFS be read inside a Spark function, without using sparkContext within the function?

Example:

val filedata_rdd = rdd.map { x => ReadFromHDFS(x.getFilePath) }

The question is how ReadFromHDFS can be implemented. Usually we would read from HDFS with sc.textFile, but in this case sc cannot be used inside the function.


1 Answer


You don't necessarily need the SparkContext to interact with HDFS. You can simply broadcast the Hadoop configuration from the driver (wrapped in a SerializableWritable, since Configuration itself is not Java-serializable) and use the broadcast value on the executors to construct an org.apache.hadoop.fs.FileSystem. Then the world is yours. :)

Following is the code:

import java.io.StringWriter

import org.apache.commons.io.IOUtils
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SerializableWritable, SparkConf, SparkContext}

object Test {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[15]")
      .setAppName("TestJob")
    val sc = new SparkContext(conf)

    // Configuration is not Java-serializable, so wrap it in
    // SerializableWritable before broadcasting it to the executors.
    val confBroadcast = sc.broadcast(new SerializableWritable(sc.hadoopConfiguration))

    val rdd: RDD[String] = ??? // your existing RDD of HDFS paths
    val filedata_rdd = rdd.map { x => readFromHDFS(confBroadcast.value.value, x) }
  }

  // Runs on the executors: builds a FileSystem from the broadcast
  // configuration and reads the whole file at `path` into a String.
  def readFromHDFS(configuration: Configuration, path: String): String = {
    val fs: FileSystem = FileSystem.get(configuration)
    val inputStream = fs.open(new Path(path))
    try {
      val writer = new StringWriter()
      IOUtils.copy(inputStream, writer, "UTF-8")
      writer.toString
    } finally {
      inputStream.close()
    }
  }

}
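
For illustration, a minimal way to drive this with hypothetical paths (this snippet would sit inside main above, in place of the ??? placeholder):

// Hypothetical example paths on HDFS.
val rdd: RDD[String] = sc.parallelize(Seq(
  "hdfs:///data/part-00000",
  "hdfs:///data/part-00001"
))
val filedata_rdd = rdd.map { x => readFromHDFS(confBroadcast.value.value, x) }
filedata_rdd.take(1).foreach(content => println(s"read ${content.length} chars"))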
Sachin Tyagi
  • I agree with the solution, but would the resulting RDD be partitioned? It would require a manual repartitioning, right? – darkknight444 Oct 22 '16 at 06:33
  • Yes, your `filedata_rdd` will have exactly as many partitions as your `rdd`, so there is no need to repartition unless you want a different number of partitions. And yes, for each partition the `readFromHDFS` calls for that partition (once per record) will run on some executor (and not on the driver). – Sachin Tyagi Oct 22 '16 at 06:55
  • Okay. So my source rdd has only 1 partition, since it just has the suffix that I need to use to form the URI. So this would result in filedata_rdd having only 1 partition, no matter how big the file is? – darkknight444 Oct 22 '16 at 06:59
  • Yes, if your parent RDD has n partitions, then calling rdd.map() on it will result in an RDD with exactly the same number of partitions as the parent. So you can repartition the parent rdd to increase the number of partitions if that's what you want. But, given that we are mapping a path string to its content, the content will form a single record of the new rdd no matter how big that content is. – Sachin Tyagi Oct 22 '16 at 07:11
  • Better to return a list of Strings, with each entry containing one line of the file. – darkknight444 Oct 22 '16 at 07:20
  • You can simply do a flatMap after you have all of the content: `filedata_rdd.flatMap(s => s.split("\n"))` will give you an RDD of strings with one record per line of the file. – Sachin Tyagi Oct 22 '16 at 07:39
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/126385/discussion-between-vishamdi-and-sachin-tyagi). – darkknight444 Oct 22 '16 at 08:41
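
Putting the comment thread together, a minimal sketch of both suggestions (the partition count of 8 is a hypothetical example; pick one that fits your cluster):

// Spread the path RDD across more partitions so the reads run in parallel.
val spread = rdd.repartition(8)
val filedata_rdd = spread.map(p => readFromHDFS(confBroadcast.value.value, p))

// One record per line instead of one record per whole file.
val lines_rdd = filedata_rdd.flatMap(s => s.split("\n"))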