
How to read multiple files (> 1000 files) in Spark and print only the first line of each file?

I was reading the question How to read multiple text files into a single RDD?, which mentions that I can read multiple files (say 3 files) in Spark using the following syntax:

val fs = sc.textFile("a.txt,b.txt,c.txt")

But fs seems to glue all the files together.

– Carson Pun

1 Answer


One approach is to use `sc.hadoopFile` with `TextInputFormat`:

import org.apache.hadoop.mapred.TextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}

val input: String = ???  // input path(s), left as a placeholder

val firstLines = sc.hadoopFile(
    input, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
  .flatMap {
    // the key is the byte offset of the line within its file,
    // so offset 0 marks the first line of each file
    case (k, v) => if (k.get == 0) Seq(v.toString) else Seq.empty[String]
  }

Since the keys produced by `TextInputFormat` are the byte offsets of each line from the beginning of its file, you should get exactly what you want.
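
For example, a minimal way to inspect the result on the driver (one first line comes back per file, so even with > 1000 files the collected output stays small):

// bring the first lines back to the driver and print them
firstLines.collect().foreach(println)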

– zero323
  • Zero, master of Spark, why not sc.wholeTextFiles()? – Mariano Kamp Dec 22 '15 at 23:13
  • Mariano, zero, I have the same question, comparing to `sc.textFile` or `sc.wholeTextFiles`. Sorry I am not familiar with the `sc.hadoopFile` API.... – Carson Pun Dec 23 '15 at 04:07
  • 1
    @MarianoKamp My good man, just to avoid follow-up questions starting with "what if data doesn't fit in the memory" :) Seriously though, from what I've seen so far `WholeTextFileInputFormat` usually shows significantly worse performance than the `TextInputFormat`. If you add possible memory issues on top of that it is really hard to justify `wholeTextFiles` here. Nice thing about this solution is that it doesn't require any configuration changes depending on a input. – zero323 Dec 23 '15 at 04:38
  • 1
    @CarsonPun `textFile` is not applicable because it drops required information. You can use something like `sc.wholeTextFiles(input).map(_._2.takeWhile(_ != '\n'))` but I am not particularly fond of this solution. – zero323 Dec 23 '15 at 04:41
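
For reference, the `wholeTextFiles` variant from the last comment could be spelled out as the short sketch below (the `firstLinesWhole` name is just for illustration). Each record is a (path, whole file content) pair, so every file has to fit in memory on an executor, which is why the answer sticks with `TextInputFormat`.

// alternative from the comments: read each file as one record and keep
// everything up to the first newline, i.e. the first line of that file
val firstLinesWhole = sc.wholeTextFiles(input)
  .map { case (_, content) => content.takeWhile(_ != '\n') }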