
In Spark, it is possible to set some Hadoop configuration settings, e.g.

System.setProperty("spark.hadoop.dfs.replication", "1")

This works: the replication factor is set to 1. I therefore assumed that this pattern (prepending "spark.hadoop." to a regular Hadoop configuration property) would also work for textinputformat.record.delimiter:

System.setProperty("spark.hadoop.textinputformat.record.delimiter", "\n\n")

However, it seems that Spark just ignores this setting. Am I setting textinputformat.record.delimiter correctly? Is there a simpler way of setting it? I would like to avoid writing my own InputFormat, since I really only need records delimited by two newlines.
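
(A side note with a hedged sketch: in later Spark releases, 0.9 and newer, the same "spark.hadoop." prefix can be set on a SparkConf instead of a JVM system property; properties with that prefix are copied into the Hadoop Configuration Spark uses when reading input. The app name below is made up, and whether the delimiter then takes effect still depends on the Hadoop version, as the comments note.)

import org.apache.spark.{SparkConf, SparkContext}

// Sketch assuming Spark 0.9+: "spark.hadoop."-prefixed properties are copied
// into the Hadoop Configuration used when reading input files.
val sparkConf = new SparkConf()
  .setAppName("delimiter-example") // hypothetical app name
  .set("spark.hadoop.textinputformat.record.delimiter", "\n\n")
val sc = new SparkContext(sparkConf)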

ptikobj
  • What version of hadoop are you using? – Noah Jul 17 '13 at 16:48
  • I'm using the prebuilt version of spark-0.7.2 with Hadoop 1 / CDH3 (see [here](http://spark-project.org/downloads/)). I'm pretty sure that it was in fact built with hadoop 1.0.4 – ptikobj Jul 18 '13 at 06:40
  • 1
    I'm not sure that it's in that version of hadoop, you might have to recompile yourself to a version that supports what you want: https://issues.apache.org/jira/browse/HADOOP-7096 – Noah Jul 18 '13 at 14:05

1 Answer


I got this working with plain uncompressed files using the function below.

import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

def nlFile(path: String) = {
    // Pass the delimiter through a Hadoop Configuration rather than a
    // "spark.hadoop."-prefixed system property.
    val conf = new Configuration
    conf.set("textinputformat.record.delimiter", "\n")
    // newAPIHadoopFile uses the new-API TextInputFormat, which honors the
    // custom delimiter; keep only the record text, dropping the byte offsets.
    sc.newAPIHadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
      .map(_._2.toString)
}
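
For the two-newline records from the question, a hedged variation is to take the delimiter as a parameter (reusing the imports above; delimitedFile and the input path are made-up names):

def delimitedFile(path: String, delimiter: String) = {
    val conf = new Configuration
    // Same mechanism as nlFile, but the record delimiter is caller-supplied.
    conf.set("textinputformat.record.delimiter", delimiter)
    sc.newAPIHadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
      .map(_._2.toString)
}

// Records separated by a blank line, as asked in the question.
val paragraphs = delimitedFile("hdfs:///path/to/input.txt", "\n\n")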
Andrew
  • Can you please share the hadoop core version you were using? – user 923227 Sep 02 '15 at 06:54
  • @SumitKumarGhosh that was with CDH 4.4 I believe. – Andrew Sep 04 '15 at 04:56
  • 1
    Got it looks like it needs specific releases Hadoop 0.23.x, and 2.x versions - [link](http://stackoverflow.com/questions/12330447/paragraph-processing-for-hadoop/12351209#12351209) I used the following maven dependency - ` org.apache.hadoop hadoop-client 2.2.0 ` this is good too - ` org.apache.hadoop hadoop-mapreduce-client-core 2.2.0 ` – user 923227 Sep 11 '15 at 22:13