3

This is my code for the union:

val dfToSave=dfMainOutput.union(insertdf.select(dfMainOutput).withColumn("FFAction", when($"FFAction" === "O" || $"FFAction" === "I", lit("I|!|")))

When I do union I get below error:

org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. string <> boolean at the 11th column of the second table;;
'Union

Here is the schema of two dataframes:

insertdf.printSchema()
root
 |-- OrganizationID: long (nullable = true)
 |-- SourceID: integer (nullable = true)
 |-- AuditorID: integer (nullable = true)
 |-- AuditorOpinionCode: string (nullable = true)
 |-- AuditorOpinionOnInternalControlCode: string (nullable = true)
 |-- AuditorOpinionOnGoingConcernCode: string (nullable = true)
 |-- IsPlayingAuditorRole: boolean (nullable = true)
 |-- IsPlayingTaxAdvisorRole: boolean (nullable = true)
 |-- AuditorEnumerationId: integer (nullable = true)
 |-- AuditorOpinionId: integer (nullable = true)
 |-- AuditorOpinionOnInternalControlsId: string (nullable = true)
 |-- AuditorOpinionOnGoingConcernId: string (nullable = true)
 |-- IsPlayingCSRAuditorRole: boolean (nullable = true)
 |-- FFAction: string (nullable = true)
 |-- DataPartition: string (nullable = true)

Here is the schema of second dataframe:

dfMainOutput.printSchema()
root
 |-- OrganizationID: long (nullable = true)
 |-- SourceID: integer (nullable = true)
 |-- AuditorID: integer (nullable = true)
 |-- AuditorOpinionCode: string (nullable = true)
 |-- AuditorOpinionOnInternalControlCode: string (nullable = true)
 |-- AuditorOpinionOnGoingConcernCode: string (nullable = true)
 |-- IsPlayingAuditorRole: boolean (nullable = true)
 |-- IsPlayingTaxAdvisorRole: boolean (nullable = true)
 |-- AuditorEnumerationId: integer (nullable = true)
 |-- AuditorOpinionId: integer (nullable = true)
 |-- AuditorOpinionOnInternalControlsId: integer (nullable = true)
 |-- AuditorOpinionOnGoingConcernId: boolean (nullable = true)
 |-- IsPlayingCSRAuditorRole: string (nullable = true)
 |-- FFAction: string (nullable = true)
 |-- DataPartition: string (nullable = true)

To avoid this problem I might have to write a select for each columns. So is there any Scala syntax that manage to type caste or make both dataframes to same type?

This is what I have tried so far but still getting the same error:

val columns = dfMainOutput.columns.toSet.intersect(insertdf.columns.toSet).map(col).toSeq

//Perform Union
val dfToSave=dfMainOutput.select(columns: _*).union(insertdf.select(columns: _*)).withColumn("FFAction", when($"FFAction" === "O" || $"FFAction" === "I", lit("I|!|")))
Shaido
  • 27,497
  • 23
  • 70
  • 73
Sudarshan kumar
  • 1,503
  • 4
  • 36
  • 83

1 Answers1

10

The datatypes of each column must match to perform an union of dataframes.

Looking at your schemas, there are three columns that does not conform to this:

AuditorOpinionOnInternalControlsId
AuditorOpinionOnGoingConcernId
IsPlayingCSRAuditorRole

An easy way to change the data types would be to use withColumn and cast. I'm assuming the correct type is in the dfMainOutput dataframe for the code below:

val insertDfNew = insertdf
  .withColumn("AuditorOpinionOnInternalControlsId", $"AuditorOpinionOnInternalControlsId".cast(IntegerType))
  .withColumn("AuditorOpinionOnGoingConcernId", $"AuditorOpinionOnGoingConcernId".cast(BooleanType))
  .withColumn("IsPlayingCSRAuditorRole", $"IsPlayingCSRAuditorRole".cast(StringType))
  .withColumn("FFAction", when($"FFAction" === "O" || $"FFAction" === "I", lit("I|!|")).otherwise($"FFAction"))

val dfToSave = dfMainOutput.union(insertDfNew)
Shaido
  • 27,497
  • 23
  • 70
  • 73
  • Also, you can cast in the following way df.select(col("value").cast("string")) – Eric Bellet Feb 14 '20 at 13:53
  • 1
    @BdEngineer: Sure, you can do that. If you have a list of dataframes that you want to union into a single one, then you can do: `dataframeList.reduce(_.union(_))`. It can also be done iterativly inside a for-loop if you calculate them one-by-one, but then you need to have a start dataframe or create an empty dataframe with the correct schema. – Shaido Apr 20 '20 at 09:30
  • 1
    @BdEngineer: I'm not sure if there are any performance differences but I think it shold be pretty similar. I found this question where the first two answers shows what I wrote in the comment above (a bit clearer with actual examples): https://stackoverflow.com/questions/43489807/scala-spark-how-to-union-all-dataframe-in-loop Note that the dataframe you union into needs to be a `var` (it can change) and not a `val`. – Shaido Apr 20 '20 at 09:45
  • @BdEngineer: I'm not sure what the problem can be. Maybe you can create a new question with the code, hopefully I or someone else can spot what is going wrong. – Shaido Apr 21 '20 at 01:33