I have the following table:

DEST_COUNTRY_NAME   ORIGIN_COUNTRY_NAME   count
United States       Romania               15
United States       Croatia               1
United States       Ireland               344
Egypt               United States         15
The table is represented as a Dataset.
scala> dataDS
res187: org.apache.spark.sql.Dataset[FlightData] = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
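For reference, FlightData is a simple case class whose fields mirror the columns; I'm sketching it here from the schema below, so take the exact field types as approximate:

case class FlightData(DEST_COUNTRY_NAME: String, ORIGIN_COUNTRY_NAME: String, count: Int)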
The schema of dataDS is:
scala> dataDS.printSchema;
root
|-- DEST_COUNTRY_NAME: string (nullable = true)
|-- ORIGIN_COUNTRY_NAME: string (nullable = true)
|-- count: integer (nullable = true)
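For context, dataDS was created roughly like this; the file path and read options are illustrative, not the exact ones:

import spark.implicits._

val dataDS = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/flight-summary.csv")  // illustrative path
  .as[FlightData]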
I want to sum all the values of the count column. I suppose I can do it using the reduce method of Dataset.

I thought I could do the following, but I got an error:
scala> (dataDS.select(col("count"))).reduce((acc,n)=>acc+n);
<console>:38: error: type mismatch;
found : org.apache.spark.sql.Row
required: String
(dataDS.select(col("count"))).reduce((acc,n)=>acc+n);
^
Judging from the error, select hands me back an untyped Dataset[Row]. To make the code work, I had to explicitly specify that count is an Int, even though the schema already lists it as an integer:
scala> (dataDS.select(col("count").as[Int])).reduce((acc,n)=>acc+n);
Why did I have to explicitly specify the type of count? Why didn't Scala's type inference work here? After all, the schema of the intermediate Dataset also reports count as an integer:
dataDS.select(col("count")).printSchema;
root
|-- count: integer (nullable = true)
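For what it's worth, I know the sum can be computed in other ways that compile without the cast, for example via the typed API or an untyped aggregate; what I want to understand is why the select-based reduce needs the explicit .as[Int]:

import org.apache.spark.sql.functions.sum

// Typed route: map to the field itself, so the result is a Dataset[Int] at compile time.
dataDS.map(_.count).reduce(_ + _)

// Untyped route: aggregate over the column, yielding a one-row DataFrame.
dataDS.agg(sum("count")).show()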