The default record delimiter when reading a file with Spark is the newline character (\n). A custom delimiter can be defined using the "textinputformat.record.delimiter" property.
But is it possible to specify multiple delimiters for the same file?
Suppose a file has following content :
COMMENT,A,B,C
COMMENT,D,E,
F
LIKE,I,H,G
COMMENT,J,K,
L
COMMENT,M,N,O
I want to read this file with COMMENT and LIKE as the delimiters instead of the newline character.
However, I came up with an alternative in case multiple delimiters are not allowed in Spark.
val ss = SparkSession.builder().appName("SentimentAnalysis").master("local[*]").getOrCreate()
val sc = ss.sparkContext
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "COMMENT")
val rdd = sc.textFile("<filepath>")
val finalRdd = rdd.flatMap(f => f.split("LIKE"))
Still, I think it would be better to have multiple custom delimiters. Is that possible in Spark, or do I have to use the alternative above?
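For what it's worth, since "textinputformat.record.delimiter" takes a single literal string (not a regex), one way to sketch the second splitting step is a regex alternation, so that a single split handles any number of delimiter tokens. A minimal sketch on the sample content above (plain Scala, no Spark, so the split behavior itself can be checked; the same split("COMMENT|LIKE") call could be used inside flatMap):

```scala
// Sketch only: demonstrates that a regex alternation acts as
// "multiple delimiters" in a single split. The sample text mirrors
// the file shown in the question.
val text = "COMMENT,A,B,C\nCOMMENT,D,E,\nF\nLIKE,I,H,G\nCOMMENT,J,K,\nL\nCOMMENT,M,N,O"

// split takes a regex, so "COMMENT|LIKE" matches either token.
// trim removes the newlines left at record boundaries; filter drops
// the empty leading element produced before the first delimiter.
val records = text.split("COMMENT|LIKE").map(_.trim).filter(_.nonEmpty)
// records: Array(",A,B,C", ",D,E,\nF", ",I,H,G", ",J,K,\nL", ",M,N,O")
```

So with this approach you would only need the Hadoop delimiter for one token (or none at all, reading whole records some other way) and let one regex split cover the rest.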