
The .dat file has `||||` at the end of each line, but when the record is split, these four trailing pipes are not considered:

import scala.collection.mutable.ListBuffer

val splitLine = record.split("\\|").to[ListBuffer]

// Input:  A|B||||||||||C|D||||
// Output: A,B,,,,,,,,,,C,D

Is there a way to read .dat files in Spark?

What is the meaning of the four pipes at the end of each line in a .dat file?
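This is the documented behavior of `String.split`: with the default limit of 0, trailing empty strings are removed from the result, while a negative limit keeps them. A minimal Java illustration (Scala's `split` on a `String` delegates to `java.lang.String.split`), using the sample line above:

```java
public class SplitDemo {
    public static void main(String[] args) {
        String line = "A|B||||||||||C|D||||";

        // Default limit (0): trailing empty strings are removed from the array.
        String[] withoutLimit = line.split("\\|");
        System.out.println(withoutLimit.length); // 13 -- the 4 trailing empties are gone

        // Negative limit (-1): the pattern is applied as many times as possible
        // and trailing empty strings are kept.
        String[] withLimit = line.split("\\|", -1);
        System.out.println(withLimit.length);    // 17 -- all fields preserved
    }
}
```

The same limit argument works unchanged in Scala: `record.split("\\|", -1)`.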

philantrovert

  • Possible duplicate of [Java String split removed empty values](https://stackoverflow.com/questions/14602062/java-string-split-removed-empty-values) – philantrovert Sep 04 '18 at 07:52
  • Yes, use `.split("\\|", -1)` if you want to consider trailing delimiters. – relet Sep 04 '18 at 08:36

1 Answer


Using the split function with -1 is what you need. Compare the scenarios below, with and without it.

import ss.implicits._
val rd = sc.textFile("path to your file")
       .map(x => x.split("[|]",-1)).map(x => (x(0), x(1), x(2), x(3), x(4), x(5), x(6), x(7), x(8), x(9), x(10), x(11), x(12), x(13), x(14), x(15), x(16))) // `split` function with `-1`

rd.foreach(println)

Output :

(A,B,,,,,,,,,,C,D,,,,)

Without -1, split drops the trailing empty strings, so the last 4 empty columns are missing from the array and the tuple construction throws an error:

import ss.implicits._
val rd = sc.textFile("path to your file")
       .map(x => x.split("[|]")).map(x => (x(0), x(1), x(2), x(3), x(4), x(5), x(6), x(7), x(8), x(9), x(10), x(11), x(12), x(13), x(14), x(15), x(16))) // `split` function without `-1`

rd.foreach(println)

java.lang.ArrayIndexOutOfBoundsException: 13
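The exception index lines up with the dropped fields: without -1 the sample line splits into only 13 elements (valid indices 0 to 12), so x(13) is the first out-of-bounds access. A minimal Java check of this (Scala's `split` on a `String` delegates to `java.lang.String.split`), using the sample line from the question:

```java
public class SplitBounds {
    public static void main(String[] args) {
        String line = "A|B||||||||||C|D||||";

        // Default split: the 4 trailing empty strings are discarded.
        String[] parts = line.split("\\|");
        System.out.println(parts.length); // 13 -- valid indices are 0..12

        try {
            String unused = parts[13];    // the same access the Spark job performs
        } catch (ArrayIndexOutOfBoundsException e) {
            // Index 13 is out of bounds, matching the exception in the Spark job.
            System.out.println("out of bounds at index 13");
        }
    }
}
```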
Praveen L