I would like to dynamically generate a DataFrame containing a header record for a report, i.e. create a DataFrame from the value of the string below:

val headerDescs : String = "Name,Age,Location"

val headerSchema = StructType(headerDescs.split(",").map(fieldName => StructField(fieldName, StringType, true)))

Now I want to do the same for the data (which is, in effect, the same values, i.e. the metadata).

I create an RDD:

val headerRDD = sc.parallelize(headerDescs.split(","))

I then intended to use createDataFrame to create it:

val headerDf = sqlContext.createDataFrame(headerRDD, headerSchema)

However, that fails because createDataFrame expects an RDD[Row], whereas my RDD contains plain strings. I can't find a way of converting my RDD to an RDD of Rows and then mapping the fields dynamically. The examples I've seen assume you know the number of columns beforehand, but I eventually want to be able to change the columns without changing the code, for example by keeping the column names in a file.

Code excerpt based on the first answer:

val headerDescs : String = "Name,Age,Location"

// create the schema from a string, splitting by delimiter
val headerSchema = StructType(headerDescs.split(",").map(fieldName => StructField(fieldName, StringType, true)))

// create a row from a string, splitting by delimiter
val headerRDDRows = sc.parallelize(headerDescs.split(",")).map( a => Row(a))

val headerDf = sqlContext.createDataFrame(headerRDDRows, headerSchema)
headerDf.show()

Executing this results in:

+--------+---+--------+
|    Name|Age|Location|
+--------+---+--------+
|    Name|
|     Age|
|Location|
+--------+---+--------+

1 Answer


For converting an RDD[Array[String]] to an RDD[Row], you need to do the following:

import org.apache.spark.sql.Row

// wrap the split values in a Row, indexing the three fields explicitly
val headerRDD = sc.parallelize(Seq(headerDescs.split(","))).map(x => Row(x(0), x(1), x(2)))

scala> val headerSchema = StructType(headerDescs.split(",").map(fieldName => StructField(fieldName, StringType, true)))
headerSchema: org.apache.spark.sql.types.StructType = StructType(StructField(Name,StringType,true), StructField(Age,StringType,true), StructField(Location,StringType,true))

scala> val headerRDD = sc.parallelize(Seq(headerDescs.split(","))).map(x=>Row(x(0),x(1),x(2)))
headerRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[6] at map at <console>:34

scala> val headerDf = sqlContext.createDataFrame(headerRDD, headerSchema)
headerDf: org.apache.spark.sql.DataFrame = [Name: string, Age: string, Location: string]


scala> headerDf.printSchema
root
 |-- Name: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Location: string (nullable = true)



scala> headerDf.show
+----+---+--------+
|Name|Age|Location|
+----+---+--------+
|Name|Age|Location|
+----+---+--------+

This gives you an RDD[Row].
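Note that Row(x(0), x(1), x(2)) still hardcodes three columns (and Row(a) in your excerpt produced one-field rows, which is why everything landed under the Name column). Row.fromSeq builds a Row from any sequence, so the conversion can stay fully dynamic. A minimal sketch, reusing headerDescs and headerSchema from above (dynamicRDD and dynamicDf are just illustrative names):

import org.apache.spark.sql.Row

// build each Row from the whole split array, however many columns it has
val dynamicRDD = sc.parallelize(Seq(headerDescs.split(","))).map(a => Row.fromSeq(a))
val dynamicDf = sqlContext.createDataFrame(dynamicRDD, headerSchema)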

For reading from a file:

// split each line on the delimiter and build a Row from the whole resulting sequence
val vRDD = sc.textFile("<filepath>").map(_.split(",")).map(a => Row.fromSeq(a))

val headerDf = sqlContext.createDataFrame(vRDD, headerSchema)
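If the column names themselves live in a file, as you mention wanting, the schema can be built the same way before loading the data. A sketch, assuming a hypothetical one-line header file at the placeholder path headers.txt:

import org.apache.spark.sql.types.{StructType, StructField, StringType}

// read the single header line and build the schema from it dynamically
val headerLine = sc.textFile("headers.txt").first()
val fileSchema = StructType(headerLine.split(",").map(fieldName => StructField(fieldName, StringType, true)))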

Using the spark-csv package:

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .schema(headerSchema) // defining based on the custom schema
    .load("cars.csv")

OR

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    .load("cars.csv")

There are various other options as well, which you can explore in its documentation.
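Note that spark-csv is an external Databricks package, so it has to be on the classpath (for example via spark-shell's --packages option) before the snippets above will run.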

  • Thanks for your quick response, but I get the error: 45: error: value split is not a member of Char – Jon Robinson Jan 19 '17 at 15:34
  • @JonRobinson I have updated the answer. This should work. – Rajat Mishra Jan 19 '17 at 15:42
  • Thanks, this is closer but has mapped all the values to the first dataframe column only i.e. all values are under Name column - I want 'Name' under the Name column, 'Age' under the Age column etc. – Jon Robinson Jan 19 '17 at 16:01
  • @JonRobinson can you show the output you are getting and the code? – Rajat Mishra Jan 19 '17 at 16:05
  • @JonRobinson because, if you look, even the schema is showing 3 columns – Rajat Mishra Jan 19 '17 at 16:06
  • The schema is fine, problem is the data appears under the first column only i.e. 'Name', 'Age', 'Location' all appear under column Name. – Jon Robinson Jan 23 '17 at 15:52
  • From your output it seems you are creating an empty dataframe. Please post the code where you are inserting data into the dataframe. – Rajat Mishra Jan 23 '17 at 16:34
  • Updated the original question with the code and output. – Jon Robinson Jan 23 '17 at 17:02
  • Thanks for the suggestion but I had already discounted this approach - again the issue is you need to know how many columns are in the output beforehand i.e. it isn't programmatic. I wanted it so that if you were to update headerDescs e.g. headerDescs = "Name, Age, Location, Email" you wouldn't have to change the other code. – Jon Robinson Jan 24 '17 at 11:57
  • @JonRobinson where are you reading the data from? I mean, is it from a file or a table? – Rajat Mishra Jan 24 '17 at 11:59
  • @JonRobinson if you are reading from a file and you need to convert into a dataframe, i would suggest you use spark-csv package. https://github.com/databricks/spark-csv – Rajat Mishra Jan 24 '17 at 13:09
  • @JonRobinson I have updated the answer to create a dataframe from a file. – Rajat Mishra Jan 24 '17 at 13:27
  • I like your answer for reading from a file, that's what I'm looking for. Thanks for your help Rajat. – Jon Robinson Jan 25 '17 at 09:46