
I am trying to decompose the structure of a complex DataFrame in Spark. I am only interested in the nested arrays under the root. The issue is that I can't retrieve the elementType from a StructField's dataType.

Here is an example: the schema of the DataFrame (a StructType object):

df.printSchema
result>>
root
 |-- ID: string (nullable = true)
 |-- creationDate: string (nullable = true)
 |-- personsList: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- firstName: string (nullable = true)
 |    |    |-- lastName: string (nullable = true)

Every StructType is an array of

StructField(name, dataType, nullable, metadata).
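
For example, iterating over the fields only exposes that top level (a minimal sketch, assuming the schema above):

// df.schema is a StructType; each field carries (name, dataType, nullable, metadata)
df.schema.fields.foreach { f =>
  println(s"${f.name}: ${f.dataType.typeName} (nullable = ${f.nullable})")
}
// ID: string (nullable = true)
// creationDate: string (nullable = true)
// personsList: array (nullable = true)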

I tried the code below:

val personsList = df.schema("personsList") // personsList is a StructField
println(personsList.dataType)

I would like to retrieve the elementType to get at the StructType of the nested array, but unfortunately we only seem to have the typeName or json methods.

Best regards,

Ismail Addou

1 Answer


You can select the elements of the array struct and get their dataType:

val arraydf = df.select("personsList.firstName", "personsList.lastName")
arraydf.schema.foreach(x => println(x.dataType))

This will give the following dataTypes:

ArrayType(StringType,true)
ArrayType(StringType,true)

The above approach gives an ArrayType, which I guess is not what you require. You can go one step further and use the explode function:

import org.apache.spark.sql.functions.explode
import spark.implicits._ // assumes `spark` is your SparkSession; enables the $"col" syntax
val arraydf = df.select(explode($"personsList.firstName"))
arraydf.schema.foreach(x => println(x.dataType))

This will print

StringType
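
If you only need the schema and want to avoid selecting any data at all, another option is to pattern match on the field's dataType. A sketch along these lines (assuming the ArrayType and StructType classes from org.apache.spark.sql.types) stays entirely at the schema level, so no records are read:

import org.apache.spark.sql.types.{ArrayType, StructType}

// Drill into the array field purely through the schema
df.schema("personsList").dataType match {
  case ArrayType(elementType: StructType, _) =>
    // elementType is the StructType of the nested array's elements
    elementType.fields.foreach(f => println(s"${f.name}: ${f.dataType.typeName}"))
  case other =>
    println(s"Not an array of struct: $other")
}
// firstName: string
// lastName: string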

I hope this is what you wanted. If not, it will give you ideas :)

Ramesh Maharjan
  • Thank you, I have a question: I am only interested in the schema, and I wonder if the `select` statement also retrieves the records even if we do not show or use them. (I am supposed to have millions of records and I won't use them.) What do you think? – Ismail Addou Jun 21 '17 at 11:34
  • Yes, the `select` statement will generate a new dataframe of the selected columns. But if you don't use the selected dataframe, it should be garbage collected once it goes out of scope. So you don't need to worry about having millions of records. – Ramesh Maharjan Jun 21 '17 at 15:33
  • my pleasure @IsmailAddou :) – Ramesh Maharjan Jun 22 '17 at 00:30