
How to read multiple files (> 1000 files) in Spark and print only the first line of each file?

I was reading the question How to read multiple text files into a single RDD?, which mentions that I can read multiple files (say 3 files) in Spark using the following syntax:

val fs = sc.textFile("a.txt,b.txt,c.txt")

But fs seems to glue all the files together.

– Carson Pun

1 Answer


One approach is to use `sc.hadoopFile` with `TextInputFormat`:

import org.apache.hadoop.mapred.TextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}

val input: String = ???  // input path(s), left as a placeholder

val firstLines = sc.hadoopFile(
    input, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
  .flatMap {
    // the key is the byte offset of the line within its file,
    // so offset 0 marks the first line of each file
    case (k, v) => if (k.get == 0) Seq(v.toString) else Seq.empty[String]
  }

Since the keys produced by `TextInputFormat` are the byte offsets of each line from the beginning of its file, you should get exactly what you want.
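
For example, a minimal way to inspect the result on the driver (one first line comes back per file, so even with > 1000 files the collected output stays small):

// bring the first lines back to the driver and print them
firstLines.collect().foreach(println)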

– zero323
  • Zero, master of Spark, why not sc.wholeTextFiles()? – Mariano Kamp Dec 22 '15 at 23:13
  • Mariano, zero, I have the same question, comparing to `sc.textFile` or `sc.wholeTextFiles`. Sorry I am not familiar with the `sc.hadoopFile` API.... – Carson Pun Dec 23 '15 at 04:07
  • 1
    @MarianoKamp My good man, just to avoid follow-up questions starting with "what if data doesn't fit in the memory" :) Seriously though, from what I've seen so far `WholeTextFileInputFormat` usually shows significantly worse performance than the `TextInputFormat`. If you add possible memory issues on top of that it is really hard to justify `wholeTextFiles` here. Nice thing about this solution is that it doesn't require any configuration changes depending on a input. – zero323 Dec 23 '15 at 04:38
  • 1
    @CarsonPun `textFile` is not applicable because it drops required information. You can use something like `sc.wholeTextFiles(input).map(_._2.takeWhile(_ != '\n'))` but I am not particularly fond of this solution. – zero323 Dec 23 '15 at 04:41
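
For reference, the `wholeTextFiles` variant from the last comment could be spelled out as the short sketch below (the `firstLinesWhole` name is just for illustration). Each record is a (path, whole file content) pair, so every file has to fit in memory on an executor, which is why the answer sticks with `TextInputFormat`.

// alternative from the comments: read each file as one record and keep
// everything up to the first newline, i.e. the first line of that file
val firstLinesWhole = sc.wholeTextFiles(input)
  .map { case (_, content) => content.takeWhile(_ != '\n') }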