
I am new to Spark/Scala. I have created the RDD below by loading data from multiple paths. Now I want to create a DataFrame from it for further operations. The schema of the DataFrame should be:

schema[UserId, EntityId, WebSessionId, ProductId]

rdd.foreach(println)

545456,5615615,DIKFH6545614561456,PR5454564656445454
875643,5485254,JHDSFJD543514KJKJ4
545456,5615615,DIKFH6545614561456,PR5454564656445454
545456,5615615,DIKFH6545614561456,PR5454564656445454
545456,5615615,DIKFH6545614561456,PR54545DSKJD541054
264264,3254564,MNXZCBMNABC5645SAD,PR5142545564542515
732543,8765984,UJHSG4240323545144
564574,6276832,KJDXSGFJFS2545DSAS

Can anyone please help me?

I have tried the same by defining a schema case class and mapping it against the RDD, but I am getting the error:

"ArrayIndexOutOfBoundsException: 3"


1 Answer


If you treat all of your columns as String, you can create the DataFrame with the following:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val rdd : RDD[Row] = ???

val df = spark.createDataFrame(rdd, StructType(Seq(
  StructField("userId", StringType, false),
  StructField("EntityId", StringType, false),
  StructField("WebSessionId", StringType, false),
  StructField("ProductId", StringType, true))))

Note that you must map your RDD to an RDD[Row] for the compiler to let you use the createDataFrame method. For the missing fields you can declare the columns as nullable in the DataFrame schema.

In your example you are using spark.sparkContext.textFile(). This method returns an RDD[String], which means that each element of your RDD is a line. But you need an RDD[Row], so you have to split each string by commas, like this:

val list = List(
  "545456,5615615,DIKFH6545614561456,PR5454564656445454",
  "875643,5485254,JHDSFJD543514KJKJ4",
  "545456,5615615,DIKFH6545614561456,PR5454564656445454",
  "545456,5615615,DIKFH6545614561456,PR5454564656445454",
  "545456,5615615,DIKFH6545614561456,PR54545DSKJD541054",
  "264264,3254564,MNXZCBMNABC5645SAD,PR5142545564542515",
  "732543,8765984,UJHSG4240323545144",
  "564574,6276832,KJDXSGFJFS2545DSAS")


val FilterReadClicks = spark.sparkContext.parallelize(list)

val rows: RDD[Row] = FilterReadClicks.map(_.split(",")).map { arr =>
  // Pad the lines that have only three fields with an empty ProductId
  if (arr.length == 4) Row.fromSeq(arr.toSeq)
  else Row.fromSeq(arr.toSeq :+ "")
}

rows.foreach(el => println(el.toSeq))

val df = spark.createDataFrame(rows, StructType(Seq(
  StructField("userId", StringType, false),
  StructField("EntityId", StringType, false),
  StructField("WebSessionId", StringType, false),
  StructField("ProductId", StringType, true))))

df.show()

+------+--------+------------------+------------------+
|userId|EntityId|      WebSessionId|         ProductId|
+------+--------+------------------+------------------+
|545456| 5615615|DIKFH6545614561456|PR5454564656445454|
|875643| 5485254|JHDSFJD543514KJKJ4|                  |
|545456| 5615615|DIKFH6545614561456|PR5454564656445454|
|545456| 5615615|DIKFH6545614561456|PR5454564656445454|
|545456| 5615615|DIKFH6545614561456|PR54545DSKJD541054|
|264264| 3254564|MNXZCBMNABC5645SAD|PR5142545564542515|
|732543| 8765984|UJHSG4240323545144|                  |
|564574| 6276832|KJDXSGFJFS2545DSAS|                  |
+------+--------+------------------+------------------+

With the rows RDD you will be able to create the DataFrame.
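
Alternatively, here is a minimal sketch (assuming a SparkSession named spark and that all fields are strings) that maps each line into a case class with an Option for the possibly missing ProductId and calls toDF, so you do not have to build the schema by hand:

import spark.implicits._

// Sketch only: the case class name "Click" is illustrative, not from the question.
// ProductId is an Option because some lines carry only three fields.
case class Click(UserId: String, EntityId: String, WebSessionId: String, ProductId: Option[String])

val df2 = FilterReadClicks
  .map(_.split(","))
  .map(a => Click(a(0), a(1), a(2), if (a.length > 3) Some(a(3)) else None))
  .toDF()

df2.show()

In this version the missing ProductId values come out as null instead of an empty string.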

Emiliano Martinez
  • Hi, it is giving an error: overloaded method value createDataFrame with alternatives: – Vishvajit Sep 24 '19 at 10:00
  • edit your question and add your RDD code to see what's happening. – Emiliano Martinez Sep 24 '19 at 10:00
  • val ReadClicks = sc.textFile(FlumePath) // here the flume path contains multiple data sources; val FilterReadClicks = ReadClicks.filter(x => ((!x.isEmpty) && (x != null) && (x.length > 3))) // now here I am trying to convert the RDD into a DataFrame; val df = spark.createDataFrame(FilterReadClicks, StructType(Seq(StructField("userId", StringType, false), StructField("EntityId", StringType, false), StructField("WebSessionId", StringType, false), StructField("ProductId", StringType, true)))) – Vishvajit Sep 24 '19 at 10:09
  • How can I create an RDD[Row], or is there any way to convert an RDD[String] into an RDD[Row]? – Vishvajit Sep 24 '19 at 10:58
  • Thank you for the updates, but no luck so far. I have added the suggested code. – Vishvajit Sep 24 '19 at 11:53