When you create an RDD from a text file, you probably want to map the data into a case class, so you can add the input source in that stage:
case class Person(inputPath: String, name: String, age: Int)

val inputPath = "hdfs://localhost:9000/tmp/demo-input-data/persons.txt"
val rdd = sc.textFile(inputPath).map { l =>
  val tokens = l.split(",")
  Person(inputPath, tokens(0), tokens(1).trim().toInt)
}
rdd.collect().foreach(println)
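For illustration, assuming persons.txt contains simple comma-separated lines (the sample records below are made up):

// Hypothetical contents of persons.txt, matching the split(",") above:
//   Alice, 29
//   Bob, 31
// collect() would then print records like:
//   Person(hdfs://localhost:9000/tmp/demo-input-data/persons.txt,Alice,29)
//   Person(hdfs://localhost:9000/tmp/demo-input-data/persons.txt,Bob,31)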
If you do not want to mix "business data" with metadata:
case class InputSourceMetaData(path: String, size: Long)
case class PersonWithMd(name: String, age: Int, metaData: InputSourceMetaData)

// Fake the size, for demo purposes only
val md = InputSourceMetaData(inputPath, size = -1L)

val rdd = sc.textFile(inputPath).map { l =>
  val tokens = l.split(",")
  PersonWithMd(tokens(0), tokens(1).trim().toInt, md)
}
rdd.collect().foreach(println)
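If you want the real file size instead of the faked -1, here is a minimal sketch using the Hadoop FileSystem API (getFileStatus/getLen), reusing the inputPath from above:

import org.apache.hadoop.fs.{FileSystem, Path}

// Look up the actual length of the input file instead of faking it
val fs = FileSystem.get(sc.hadoopConfiguration)
val md = InputSourceMetaData(inputPath, fs.getFileStatus(new Path(inputPath)).getLen)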
If you promote the RDD to a DataFrame:
import sqlContext.implicits._
val df = rdd.toDF()
df.registerTempTable("x")
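To see why the nested fields are addressable in SQL, it helps to look at the inferred schema; the metaData column shows up as a struct with path and size fields:

// The metaData column is inferred as a nested struct(path, size)
df.printSchema()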
you can query it like this:
sqlContext.sql("select name, metadata from x").show()
sqlContext.sql("select name, metadata.path from x").show()
sqlContext.sql("select name, metadata.path, metadata.size from x").show()
Update
You can list the files in HDFS recursively using org.apache.hadoop.fs.FileSystem.listFiles().
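A minimal sketch of collecting the results into a plain Scala collection, assuming the same base directory as above (listFiles returns a RemoteIterator, which has to be drained manually):

import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path}
import scala.collection.mutable.ListBuffer

val fs = FileSystem.get(sc.hadoopConfiguration)
// true = recurse into subdirectories
val it = fs.listFiles(new Path("hdfs://localhost:9000/tmp/demo-input-data"), true)

val files = ListBuffer.empty[LocatedFileStatus]
while (it.hasNext) files += it.next()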
Given a list of file statuses in a value files (a standard Scala collection containing org.apache.hadoop.fs.LocatedFileStatus), you can create one RDD for each file:
val rdds = files.map { f =>
  val md = InputSourceMetaData(f.getPath.toString, f.getLen)

  sc.textFile(md.path).map { l =>
    val tokens = l.split(",")
    PersonWithMd(tokens(0), tokens(1).trim().toInt, md)
  }
}
Now you can reduce the list of RDDs into a single one; the function passed to reduce concatenates all RDDs:
val rdd = rdds.reduce(_ ++ _)
rdd.collect().foreach(println)
This works, but I cannot test whether it distributes/performs well with large files.
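One side note: ++ on RDDs is just union, so with many input files the chain of pairwise unions can get long. SparkContext.union over the whole sequence is a drop-in alternative (the toSeq conversion is only there to match the expected signature):

// Same result, but a single union over all RDDs instead of a pairwise chain
val rdd = sc.union(rdds.toSeq)
rdd.collect().foreach(println)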