
In Spark, it is possible to set some Hadoop configuration settings, e.g.

System.setProperty("spark.hadoop.dfs.replication", "1")

This works: the replication factor is set to 1. I therefore assumed that this pattern (prepending "spark.hadoop." to a regular Hadoop configuration property) would also work for textinputformat.record.delimiter:

System.setProperty("spark.hadoop.textinputformat.record.delimiter", "\n\n")

However, it seems that Spark just ignores this setting. Am I setting textinputformat.record.delimiter correctly? Is there a simpler way of setting it? I would like to avoid writing my own InputFormat, since I really only need records delimited by two newlines.
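
(A side note with a hedged sketch: in later Spark releases, 0.9 and newer, the same "spark.hadoop." prefix can be set on a SparkConf instead of a JVM system property; properties with that prefix are copied into the Hadoop Configuration Spark uses when reading input. The app name below is made up, and whether the delimiter then takes effect still depends on the Hadoop version, as the comments note.)

import org.apache.spark.{SparkConf, SparkContext}

// Sketch assuming Spark 0.9+: "spark.hadoop."-prefixed properties are copied
// into the Hadoop Configuration used when reading input files.
val sparkConf = new SparkConf()
  .setAppName("delimiter-example") // hypothetical app name
  .set("spark.hadoop.textinputformat.record.delimiter", "\n\n")
val sc = new SparkContext(sparkConf)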

ptikobj
  • What version of hadoop are you using? – Noah Jul 17 '13 at 16:48
  • I'm using the prebuilt version of spark-0.7.2 with Hadoop 1 / CDH3 (see [here](http://spark-project.org/downloads/)). I'm pretty sure that it was in fact built with hadoop 1.0.4 – ptikobj Jul 18 '13 at 06:40
  • 1
    I'm not sure that it's in that version of hadoop, you might have to recompile yourself to a version that supports what you want: https://issues.apache.org/jira/browse/HADOOP-7096 – Noah Jul 18 '13 at 14:05

1 Answer


I got this working with plain uncompressed files using the function below.

import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

def nlFile(path: String) = {
    // Pass the delimiter through a Hadoop Configuration rather than a
    // "spark.hadoop."-prefixed system property.
    val conf = new Configuration
    conf.set("textinputformat.record.delimiter", "\n")
    // newAPIHadoopFile uses the new-API TextInputFormat, which honors the
    // custom delimiter; keep only the record text, dropping the byte offsets.
    sc.newAPIHadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
      .map(_._2.toString)
}
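
For the two-newline records from the question, a hedged variation is to take the delimiter as a parameter (reusing the imports above; delimitedFile and the input path are made-up names):

def delimitedFile(path: String, delimiter: String) = {
    val conf = new Configuration
    // Same mechanism as nlFile, but the record delimiter is caller-supplied.
    conf.set("textinputformat.record.delimiter", delimiter)
    sc.newAPIHadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
      .map(_._2.toString)
}

// Records separated by a blank line, as asked in the question.
val paragraphs = delimitedFile("hdfs:///path/to/input.txt", "\n\n")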
Andrew
  • Can you please share the hadoop core version you were using? – user 923227 Sep 02 '15 at 06:54
  • @SumitKumarGhosh that was with CDH 4.4 I believe. – Andrew Sep 04 '15 at 04:56
  • 1
    Got it looks like it needs specific releases Hadoop 0.23.x, and 2.x versions - [link](http://stackoverflow.com/questions/12330447/paragraph-processing-for-hadoop/12351209#12351209) I used the following maven dependency - ` org.apache.hadoop hadoop-client 2.2.0 ` this is good too - ` org.apache.hadoop hadoop-mapreduce-client-core 2.2.0 ` – user 923227 Sep 11 '15 at 22:13