
In Spark, we can use textFile to load a file into lines and then perform operations on those lines, as follows.

val lines = sc.textFile("xxx")
val counts = lines.filter(line => line.contains("a")).count()

However, in my situation, I would like to load the file into blocks, because the data in my files looks like the following. Blocks are separated by an empty line.

user: 111
book: 222
comments: like it!

Therefore, I hope the textFile function, or some other solution, can help me load the file as blocks, perhaps with something like the following.

val blocks = sc.textFile("xxx", 3 line)

Has anyone faced this situation before? Thanks.

kylejan

1 Answer


I suggest you implement your own file reader function for HDFS. Look at the textFile function: it is built on top of the hadoopFile function and uses TextInputFormat:

def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] = {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}

But this TextInputFormat can be customized via Hadoop properties, as described in this answer. In your case, the delimiter could be:

conf.set("textinputformat.record.delimiter", "\n\n")
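
For completeness, here is a minimal sketch of how this might be wired together end to end, assuming a Spark/Hadoop version where the new-API TextInputFormat honours this property; the path "xxx" and the variable names are placeholders of mine, not part of the original question:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// copy the context's Hadoop configuration and override the record delimiter,
// so that records split on blank lines rather than on single newlines
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "\n\n")

// each RDD element is now one whole block (user/book/comments)
val blocks = sc.newAPIHadoopFile("xxx", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text], conf)
  .map(pair => pair._2.toString)

// e.g. count the blocks mentioning a given user
blocks.filter(block => block.contains("user: 111")).count()

Note that mapping pair._2.toString right away matters: Hadoop reuses the Text objects, so you should copy their contents out before caching or collecting.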
Nikita