I have a Dataset&lt;Row&gt; which contains six columns, like below:
+---------------+---------------+----------------+-------+--------------+--------+
| time | thingId | controller | module| variableName | value |
+---------------+---------------+----------------+-------+--------------+--------+
|1554188264901 | 0002019000000| 0 | 0 |Voltage | 5 |
|1554188264901 | 0002019000000| 0 | 0 |SetPoint | 7 |
|1554188276412 | 0002019000000| 0 | 0 |Voltage | 9 |
|1554188276412 | 0002019000000| 0 | 0 |SetPoint | 10 |
|1554188639406 | 0002019000000| 0 | 0 |SetPoint | 6 |
+---------------+---------------+----------------+-------+--------------+--------+
End Goal:
Get the last updated row, based on MAX(time), for each combination of thingId, controller, module and variableName, so that every output row carries the MAX(time) and the last updated value for its variableName. Taking MAX(time) over the data alone returns only the single most recent row:
+---------------+---------------+----------------+-------+--------------+--------+
| time | thingId | controller | module| variableName | value |
+---------------+---------------+----------------+-------+--------------+--------+
|1554188639406 | 0002019000000| 0 | 0 |SetPoint | 6 |
+---------------+---------------+----------------+-------+--------------+--------+
But the column variableName has two values ('Voltage' and 'SetPoint') for this particular thingId, controller and module, so for the value Voltage it should also return the last updated row, with its own MAX(time). Expected output:
+---------------+---------------+----------------+-------+--------------+--------+
| time | thingId | controller | module| variableName | value |
+---------------+---------------+----------------+-------+--------------+--------+
|1554188276412 | 0002019000000| 0 | 0 |Voltage | 9 |
|1554188639406 | 0002019000000| 0 | 0 |SetPoint | 6 |
+---------------+---------------+----------------+-------+--------------+--------+
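To make the required grouping concrete, here is a minimal plain-Java sketch of the "latest row per (thingId, controller, module, variableName)" selection; it is independent of Spark, and the Reading record and sample values are illustrative, mirroring the table above:

```java
import java.util.*;
import java.util.stream.*;

public class LatestPerGroup {
    // Illustrative record mirroring the six columns of the Dataset<Row>.
    record Reading(long time, String thingId, int controller, int module,
                   String variableName, int value) {}

    // Keep the row with the greatest time for each
    // (thingId, controller, module, variableName) combination.
    static Map<List<Object>, Reading> latestPerGroup(List<Reading> rows) {
        return rows.stream().collect(Collectors.toMap(
                r -> List.<Object>of(r.thingId(), r.controller(), r.module(), r.variableName()),
                r -> r,
                (a, b) -> a.time() >= b.time() ? a : b));
    }

    public static void main(String[] args) {
        List<Reading> rows = List.of(
            new Reading(1554188264901L, "0002019000000", 0, 0, "Voltage", 5),
            new Reading(1554188264901L, "0002019000000", 0, 0, "SetPoint", 7),
            new Reading(1554188276412L, "0002019000000", 0, 0, "Voltage", 9),
            new Reading(1554188276412L, "0002019000000", 0, 0, "SetPoint", 10),
            new Reading(1554188639406L, "0002019000000", 0, 0, "SetPoint", 6));

        // Prints one row per group: Voltage at 1554188276412 (value 9)
        // and SetPoint at 1554188639406 (value 6).
        latestPerGroup(rows).values().forEach(System.out::println);
    }
}
```

This is exactly the result I want Spark to produce, just expressed over an in-memory list instead of the Dataset.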
What I tried:
I tried a scalar sub-query to get this, but the column inside the sub-query must be aggregated, and I tried multiple variations with no luck. For example, this code:
Dataset&lt;Row&gt; inputds = spark.read().format("avro").load("hdfs://path");
inputds.createOrReplaceTempView("abc");
Dataset&lt;Row&gt; update = spark.sql(
    "select MAX(p.time) max_time, p.thingId, p.controller, p.module, p.variableName, "
  + "(SELECT d.value from abc d "
  + " where d.thingId = p.thingId and d.controller = p.controller "
  + "   and d.module = p.module and d.variableName = p.variableName "
  + " group by thingId, controller, module, variableName, value) as values "
  + "from abc p");
update.show();
which throws the error:
Correlated scalar variable must be aggregated for a scalar sub-query
How can I solve this? Please suggest a workaround if there is one.
Thank you!