
I'm new to Spark and am trying to insert a column to each input row with the file name that it comes from.

I've seen others ask a similar question, but all of the answers used wholeTextFiles. I'm trying to do this for larger CSV files (read using the spark-csv library), JSON files, and Parquet files, not just small text files.

I can use the spark-shell to get a list of the filenames:

import org.apache.spark.sql.functions._

val df = sqlContext.read.parquet("/blah/dir")
val names = df.select(inputFileName())
names.show()

but that's a DataFrame of just the file names. I am not sure how to add it as a column to each row (and whether that result is ordered the same as the initial data, though I assume it always is), or how to do this as a general solution for all input types.

Beryllium
mcmcmc

2 Answers


Another solution I just found to add the file name as a column in the DataFrame:

import org.apache.spark.sql.functions.input_file_name

val df = sqlContext.read.parquet("/blah/dir")
val dfWithCol = df.withColumn("filename", input_file_name())
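The same call should also work with the other readers mentioned in the question; here is a minimal sketch, assuming the databricks spark-csv reader and placeholder directory paths:

// Sketch: input_file_name() applied on top of the CSV and JSON readers;
// the paths below are placeholders
val csvWithCol = sqlContext.read
  .format("com.databricks.spark.csv")
  .load("/path/dir/")
  .withColumn("filename", input_file_name())

val jsonWithCol = sqlContext.read
  .json("/path/jsondir/")
  .withColumn("filename", input_file_name())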

Ref: spark load data and add filename as dataframe column

Ramesh Maharjan
Dipankar

When you create an RDD from a text file, you probably want to map the data into a case class, so you can add the input source at that stage:

case class Person(inputPath: String, name: String, age: Int)

val inputPath = "hdfs://localhost:9000/tmp/demo-input-data/persons.txt"
val rdd = sc.textFile(inputPath).map { l =>
  val tokens = l.split(",")
  Person(inputPath, tokens(0), tokens(1).trim().toInt)
}
rdd.collect().foreach(println)

If you do not want to mix "business data" with metadata:

case class InputSourceMetaData(path: String, size: Long)
case class PersonWithMd(name: String, age: Int, metaData: InputSourceMetaData)

// Fake the size, for demo purposes only
val md = InputSourceMetaData(inputPath, size = -1L)
val rdd = sc.textFile(inputPath).map {
  l =>
    val tokens = l.split(",")
    PersonWithMd(tokens(0), tokens(1).trim().toInt, md)
}
rdd.collect().foreach(println)

and if you promote the RDD to a DataFrame:

import sqlContext.implicits._
val df = rdd.toDF()
df.registerTempTable("x")

you can query it like this:

sqlContext.sql("select name, metadata from x").show()
sqlContext.sql("select name, metadata.path from x").show()
sqlContext.sql("select name, metadata.path, metadata.size from x").show()

Update

You can list the files in HDFS recursively using org.apache.hadoop.fs.FileSystem.listFiles().
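A minimal sketch of that listing step, assuming the input directory from the question (Hadoop's RemoteIterator has no Scala conversion, so it is drained by hand):

import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path}
import scala.collection.mutable.ArrayBuffer

// List all files under the input directory, recursively (second argument = true)
val fs = FileSystem.get(sc.hadoopConfiguration)
val it = fs.listFiles(new Path("/blah/dir"), true)

// Drain the RemoteIterator into a standard Scala collection
val files = ArrayBuffer[LocatedFileStatus]()
while (it.hasNext) files += it.next()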

Given a list of files in a value named files (a standard Scala collection containing org.apache.hadoop.fs.LocatedFileStatus), you can create one RDD per file:

val rdds = files.map { f =>
  val md = InputSourceMetaData(f.getPath.toString, f.getLen)

  sc.textFile(md.path).map {
    l =>
      val tokens = l.split(",")
      PersonWithMd(tokens(0), tokens(1).trim().toInt, md)
  }
}

Now you can reduce the list of RDDs into a single one; the reduce function concatenates all RDDs pairwise (++ is RDD union):

val rdd = rdds.reduce(_ ++ _)
rdd.collect().foreach(println)

This works, but I cannot tell whether it distributes/performs well with large files.
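As an alternative sketch for the same step, assuming the rdds value from above, SparkContext.union concatenates the whole sequence in one call:

// Same result as rdds.reduce(_ ++ _), expressed as a single union over the sequence
val rddAll = sc.union(rdds)
rddAll.collect().foreach(println)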

Beryllium
  • I definitely appreciate this, but the only problem is that you have to specify the full path and filename of your input file. I'm just specifying the input directory, pulling in all the input files located in it. – mcmcmc Oct 23 '15 at 14:48
  • Which function are you currently using? Is it `wholeTextFiles()`? – Beryllium Oct 23 '15 at 15:03
  • For CSV files I'm using the databricks/spark-csv library `sqlContext.read.format("com.databricks.spark.csv").load("/path/dir/")`. For parquet files, using `sqlContext.read.parquet("/path/parquetdir/")`. – mcmcmc Oct 23 '15 at 15:09
  • I have updated the answer showing a general approach. – Beryllium Oct 23 '15 at 16:16