
The .dat file has `||||` at the end of each line, but when the record is split, these four trailing pipes are not considered:

import scala.collection.mutable.ListBuffer

val splitLine = record.split("\\|").to[ListBuffer]

// Input:  A|B||||||||||C|D||||
// Output: A,B,,,,,,,,,,C,D

Is there a way to read .dat files in Spark?

What is the meaning of the four pipes at the end of each line in a .dat file?
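This is the documented behavior of `String.split`: with the default limit of 0, trailing empty strings are removed from the result, while a negative limit keeps them. A minimal Java illustration (Scala's `split` on a `String` delegates to `java.lang.String.split`), using the sample line above:

```java
public class SplitDemo {
    public static void main(String[] args) {
        String line = "A|B||||||||||C|D||||";

        // Default limit (0): trailing empty strings are removed from the array.
        String[] withoutLimit = line.split("\\|");
        System.out.println(withoutLimit.length); // 13 -- the 4 trailing empties are gone

        // Negative limit (-1): the pattern is applied as many times as possible
        // and trailing empty strings are kept.
        String[] withLimit = line.split("\\|", -1);
        System.out.println(withLimit.length);    // 17 -- all fields preserved
    }
}
```

The same limit argument works unchanged in Scala: `record.split("\\|", -1)`.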

philantrovert

  • Possible duplicate of [Java String split removed empty values](https://stackoverflow.com/questions/14602062/java-string-split-removed-empty-values) – philantrovert Sep 04 '18 at 07:52
  • Yes, use `.split("\\|", -1)` if you want to consider trailing delimiters. – relet Sep 04 '18 at 08:36

1 Answer


Using the split function with -1 is what you need. Compare the scenarios below, with and without it.

import ss.implicits._
val rd = sc.textFile("path to your file")
       .map(x => x.split("[|]",-1)).map(x => (x(0), x(1), x(2), x(3), x(4), x(5), x(6), x(7), x(8), x(9), x(10), x(11), x(12), x(13), x(14), x(15), x(16))) // `split` function with `-1`

rd.foreach(println)

Output :

(A,B,,,,,,,,,,C,D,,,,)

Without -1, split drops the trailing empty strings, so the last 4 empty columns are missing from the array and the tuple construction throws an error:

import ss.implicits._
val rd = sc.textFile("path to your file")
       .map(x => x.split("[|]")).map(x => (x(0), x(1), x(2), x(3), x(4), x(5), x(6), x(7), x(8), x(9), x(10), x(11), x(12), x(13), x(14), x(15), x(16))) // `split` function without `-1`

rd.foreach(println)

java.lang.ArrayIndexOutOfBoundsException: 13
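The exception index lines up with the dropped fields: without -1 the sample line splits into only 13 elements (valid indices 0 to 12), so x(13) is the first out-of-bounds access. A minimal Java check of this (Scala's `split` on a `String` delegates to `java.lang.String.split`), using the sample line from the question:

```java
public class SplitBounds {
    public static void main(String[] args) {
        String line = "A|B||||||||||C|D||||";

        // Default split: the 4 trailing empty strings are discarded.
        String[] parts = line.split("\\|");
        System.out.println(parts.length); // 13 -- valid indices are 0..12

        try {
            String unused = parts[13];    // the same access the Spark job performs
        } catch (ArrayIndexOutOfBoundsException e) {
            // Index 13 is out of bounds, matching the exception in the Spark job.
            System.out.println("out of bounds at index 13");
        }
    }
}
```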
Praveen L