How to split data on different delimiters in a single RDD in Spark Scala?

Question

Given the line `WARN:router1 warning in Japan`, how can I split it on both the ":" and " " delimiters within a single RDD? And after creating the RDD, how can I build a DataFrame from it containing the values WARN, router1, Japan?

Does this answer your question? [Scala : How to split words using multiple delimeters](https://stackoverflow.com/questions/45758378/scala-how-to-split-words-using-multiple-delimeters) — SternK, May 19 '20 at 08:09

score 1 · Answer 1 · answered May 19 '20 at 07:52

First split the string with a regex and create the RDD as an RDD[String]. Creating a DataFrame from an RDD normally requires a schema, but because this is an RDD[String] you can create a Dataset directly and then convert it to a DataFrame:

import spark.implicits._

val str = "WARN:router1 warning in Japan"
val arr = str.split("(:|\\s)")

val rdd = spark.sparkContext.parallelize(arr)
val ds = spark.createDataset(rdd)

ds.toDF().show()

gives

+-------+
|  value|
+-------+
|   WARN|
|router1|
|warning|
|     in|
|  Japan|
+-------+
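As a small aside to the regex split above, here is a sketch in plain Scala (no Spark needed) of a character-class variant. The object name `SplitDemo` and the helper `tokens` are made up for illustration. The pattern `[:\\s]+` treats a run of colons and whitespace as one delimiter, and the filter drops the empty leading token that `split` produces when a line starts with a delimiter.

```scala
object SplitDemo {
  // Hypothetical helper: split on one or more ':' or whitespace characters.
  // filter(_.nonEmpty) removes the empty first element that String.split
  // returns when the input begins with a delimiter.
  def tokens(line: String): Array[String] =
    line.split("[:\\s]+").filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    val line = "WARN:router1 warning in Japan"
    // prints: WARN, router1, warning, in, Japan
    println(tokens(line).mkString(", "))
  }
}
```

The same pattern string can be passed to `split` inside an RDD transformation; whether collapsing repeated delimiters is desirable depends on the shape of the real log lines.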

score 0 · Answer 2 · answered May 19 '20 at 09:47

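val data = Seq("WARN:router1 warning in Japan")
val rdd = spark.sparkContext.parallelize(data) // RDD[String], one element per input line
import spark.implicits._ // needed for .toDF on an RDD
val dataDF = rdd
  .flatMap(line => line.replace(":", " ").split(" ")) // turn ":" into " ", then split on spaces
  .toDF("value") // DataFrame with a single column named "value"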

dataDF.show()

output

+-------+
|  value|
+-------+
|   WARN|
|router1|
|warning|
|     in|
|  Japan|
+-------+
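Both answers produce one row per token. If the goal is instead a single row holding WARN, router1 and Japan, one possible reading of the question, a sketch could look like the following. It assumes the first token is the level, the second the device, and the last the location. The column names, the `parse` helper, and the local `SparkSession` setup are all illustrative assumptions, not part of the original answers.

```scala
import org.apache.spark.sql.SparkSession

object LogLineToColumns {
  // Hypothetical helper: assumes first token = level, second = device,
  // last = location. Returns None for lines with fewer than three tokens.
  def parse(line: String): Option[(String, String, String)] = {
    val t = line.split("[:\\s]+").filter(_.nonEmpty)
    if (t.length >= 3) Some((t(0), t(1), t.last)) else None
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (an assumption of this sketch).
    val spark = SparkSession.builder()
      .appName("split-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val rdd = spark.sparkContext.parallelize(Seq("WARN:router1 warning in Japan"))

    // Split each line once inside the RDD, keep the three fields of interest,
    // and name the resulting columns.
    val df = rdd
      .flatMap(line => parse(line).toSeq)
      .toDF("level", "device", "location")

    df.show()
    spark.stop()
  }
}
```

Keeping the parsing in a plain function makes it easy to check outside Spark; lines that do not fit the assumed layout are skipped rather than failing the job.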
