
In Spark, we can use textFile to load a file into lines and then perform operations on those lines, as follows.

val lines = sc.textFile("xxx")
val counts = lines.filter(line => line.contains("a")).count()

However, in my situation, I would like to load the file into blocks, because the data in my files looks like the following. Blocks are separated by an empty line.

user: 111
book: 222
comments: like it!

Therefore, I hope the textFile function, or some other solution, can help me load the file as blocks, perhaps with something like the following.

val blocks = sc.textFile("xxx", 3 line)

Has anyone faced this situation before? Thanks.

kylejan

1 Answer


I suggest you implement your own file reader function for HDFS. Look at the textFile function: it is built on top of the hadoopFile function and uses TextInputFormat:

def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] = {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}

But this TextInputFormat can be customized via Hadoop properties, as described in this answer. In your case, the delimiter could be:

conf.set("textinputformat.record.delimiter", "\n\n")
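
For completeness, here is a minimal sketch of how this might be wired together end to end, assuming a Spark/Hadoop version where the new-API TextInputFormat honours this property; the path "xxx" and the variable names are placeholders of mine, not part of the original question:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// copy the context's Hadoop configuration and override the record delimiter,
// so that records split on blank lines rather than on single newlines
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "\n\n")

// each RDD element is now one whole block (user/book/comments)
val blocks = sc.newAPIHadoopFile("xxx", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text], conf)
  .map(pair => pair._2.toString)

// e.g. count the blocks mentioning a given user
blocks.filter(block => block.contains("user: 111")).count()

Note that mapping pair._2.toString right away matters: Hadoop reuses the Text objects, so you should copy their contents out before caching or collecting.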
Nikita