
I have a Dataset<Row> that contains six columns, like below:

 +---------------+---------------+----------------+-------+--------------+--------+
 |  time         | thingId       |     controller | module| variableName |  value |
 +---------------+---------------+----------------+-------+--------------+--------+
 |1554188264901  |  0002019000000|        0       | 0     |Voltage       |    5   |
 |1554188264901  |  0002019000000|        0       | 0     |SetPoint      |    7   |
 |1554188276412  |  0002019000000|        0       | 0     |Voltage       |    9   |
 |1554188276412  |  0002019000000|        0       | 0     |SetPoint      |    10  |  
 |1554188639406  |  0002019000000|        0       | 0     |SetPoint      |    6   |
 +---------------+---------------+----------------+-------+--------------+--------+

End Goal:

Getting the last updated row, based on MAX(time), for each combination of thingId, controller, module and variableName.

So the required output should have MAX(time) in all the rows, and the last updated value for each of the variableName values. Taking MAX(time) alone would give only this row:

 +---------------+---------------+----------------+-------+--------------+--------+
 |  time         | thingId       |     controller | module| variableName |  value |
 +---------------+---------------+----------------+-------+--------------+--------+
 |1554188639406  |  0002019000000|        0       | 0     |SetPoint      |    6   |
 +---------------+---------------+----------------+-------+--------------+--------+

The column variableName has two values ('Voltage' and 'SetPoint') for this particular thingId, controller and module, so the query should also return the last updated row for the value Voltage, with its own MAX(time).

Expected output:

 +---------------+---------------+----------------+-------+--------------+--------+
 |  time         | thingId       |     controller | module| variableName |  value |
 +---------------+---------------+----------------+-------+--------------+--------+
 |1554188276412  |  0002019000000|        0       | 0     |Voltage       |    9   |
 |1554188639406  |  0002019000000|        0       | 0     |SetPoint      |    6   |
 +---------------+---------------+----------------+-------+--------------+--------+

What I tried:

I tried a scalar sub-query to get this, but the column inside the sub-query has to be aggregated; I tried multiple variations with no luck.

For example, this code:

 Dataset<Row> inputds = spark.read().format("avro").load("hdfs://path");
 inputds.createOrReplaceTempView("abc");
 Dataset<Row> update = spark.sql(
     "select MAX(p.time) max_time, p.thingId, p.controller, p.module, p.variableName, "
   + "(SELECT d.value from abc d "
   + " where d.thingId=p.thingId and d.controller=p.controller "
   + " and d.module=p.module and d.variableName=p.variableName "
   + " group by thingId, controller, module, variableName, value) as values "
   + "from abc p");
 update.show();

which throws the error:

Correlated scalar variable must be aggregated for a scalar sub-query
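
To illustrate what the error is asking for: an aggregated sub-query like the sketch below does pass the analyzer, but MAX(d.value) is the largest value in the group, not the value sitting on the MAX(time) row, so it is not what I need:

 //compiles, but semantically wrong: MAX(value) is not the value at MAX(time)
 Dataset<Row> wrong = spark.sql(
     "select p.thingId, p.controller, p.module, p.variableName, MAX(p.time) max_time, "
   + "(SELECT MAX(d.value) from abc d "
   + " where d.thingId=p.thingId and d.controller=p.controller "
   + " and d.module=p.module and d.variableName=p.variableName) as value "
   + "from abc p group by p.thingId, p.controller, p.module, p.variableName");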

How can I solve this? Please suggest a workaround if there is one.

Thank you!


2 Answers


The problem seems to be that you actually need both aggregation and sorting.

You need the value directly related to MAX(time), for that specific grouped value of the column variableName; basically, the value on the same row. Since there is no standard SQL aggregation function that does this, you could sort your sub-query results instead.

So, to get the desired "last updated" row, sort the sub-query by time descending and limit the result to just 1 row.

It may be something like this:

Dataset<Row> update = spark.sql("SELECT
    MAX(p.time) max_time,
    p.thingId, p.controller, p.module, p.variableName,
    (SELECT d.value FROM abc d WHERE d.thingId=p.thingId AND d.controller=p.controller AND d.module=p.module AND d.variableName=p.variableName
        ORDER BY time DESC LIMIT 1) AS [lastUpdatedValue]
FROM abc p
GROUP BY thingId,controller,module,variableName")

P.S. I come from a SQL Server background, where I would usually do TOP 1 for this; in Spark SQL the equivalent keyword is LIMIT.
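
As a quick standalone check of LIMIT playing the role of TOP 1 (a sketch against the abc view from the question):

 //newest row overall: sort descending by time, keep one row
 spark.sql("SELECT * FROM abc ORDER BY time DESC LIMIT 1").show();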

EDIT: I found this, thanks to this answer here.

Basically, it describes an aggregation function in Spark called first.

Maybe using it in the sub-query will solve the issue?

    (SELECT first(d.value) FROM abc d WHERE d.thingId=p.thingId AND d.controller=p.controller AND d.module=p.module AND d.variableName=p.variableName
        ORDER BY time DESC LIMIT 1) AS lastUpdatedValue
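
If the correlated sub-query keeps getting rejected, a window function is a rewrite of the same sort-and-take-first idea that avoids the scalar sub-query entirely. A sketch, assuming the abc temp view from the question:

 Dataset<Row> lastRows = spark.sql(
     "SELECT time, thingId, controller, module, variableName, value "
   + "FROM (SELECT *, row_number() OVER (PARTITION BY thingId, controller, module, variableName "
   + "                                   ORDER BY time DESC) AS rn FROM abc) t "
   + "WHERE rn = 1");
 lastRows.show();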
  • I tried it; it throws the same error `correlated scalar sub-query must be aggregated` – Pyd Apr 02 '19 at 13:27
  • [Here](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2728434780191932/1483312212640900/6987336228780374/latest.html) I found some more info; maybe something there will help you. Check my last edit too! – Johwhite Apr 03 '19 at 10:19
  • Same error `correlated scalar sub-query must be aggregated`; thanks for the link, will try it now! – Pyd Apr 03 '19 at 10:37

I finally solved it using struct in the Spark Dataset API.

Input Dataset

 +---------------+---------------+----------------+-------+--------------+--------+
 |  time         | thingId       |     controller | module| variableName |  value |
 +---------------+---------------+----------------+-------+--------------+--------+
 |1554188264901  |  0002019000000|        0       | 0     |Voltage       |    5   |
 |1554188264901  |  0002019000000|        0       | 0     |SetPoint      |    7   |
 |1554188276412  |  0002019000000|        0       | 0     |Voltage       |    9   |
 |1554188276412  |  0002019000000|        0       | 0     |SetPoint      |    10  |  
 |1554188639406  |  0002019000000|        0       | 0     |SetPoint      |    6   |
 +---------------+---------------+----------------+-------+--------------+--------+

 //requires: import static org.apache.spark.sql.functions.max; and org.apache.spark.sql.functions.struct;
 Dataset<Row> intermediate = inputDS
     .groupBy("thingId", "controller", "module", "variableName")
     .agg(max(struct("time", "value")).as("time_value_struct"))
     .select("thingId", "controller", "module", "variableName", "time_value_struct.*");

 //max over a struct compares fields left to right, so the row with the greatest
 //time wins and its value comes along with it. The code above gives this intermediate output:
 +---------------+---------------+----------------+-------+--------------+--------+
 |  time         | thingId       |     controller | module| variableName |  value |
 +---------------+---------------+----------------+-------+--------------+--------+
 |1554188276412  |  0002019000000|        0       | 0     |Voltage       |    9   |
 |1554188639406  |  0002019000000|        0       | 0     |SetPoint      |    6   |
 +---------------+---------------+----------------+-------+--------------+--------+

Now the task is to take the maximum value of the time column and populate it across every row for that thingId, controller and module. For that I used the SQL below:

 intermediate.createOrReplaceTempView("intermediate");

 Dataset<Row> outputDS = spark.sql(
     "select B.time, A.thingId, A.controller, A.module, A.variableName, A.value "
   + "from intermediate A "
   + "inner join (select thingId, controller, module, MAX(time) time "
   + "            from intermediate group by thingId, controller, module) B "
   + "on A.thingId=B.thingId and A.controller=B.controller and A.module=B.module");

which gives us the expected output:

 +---------------+---------------+----------------+-------+--------------+--------+
 |  time         | thingId       |     controller | module| variableName |  value |
 +---------------+---------------+----------------+-------+--------------+--------+
 |1554188639406  |  0002019000000|        0       | 0     |Voltage       |    9   |
 |1554188639406  |  0002019000000|        0       | 0     |SetPoint      |    6   |
 +---------------+---------------+----------------+-------+--------------+--------+

Now I can pivot to get the last updated value for every thingId, controller and module, as in the sketch below.
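
A minimal sketch of that pivot step (hypothetical follow-up code, using first only because each group holds exactly one value per variableName):

 //requires: import static org.apache.spark.sql.functions.first;
 Dataset<Row> pivoted = outputDS
     .groupBy("time", "thingId", "controller", "module")
     .pivot("variableName")     //one column per distinct variableName value
     .agg(first("value"));      //each cell holds exactly one value
 pivoted.show();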

I know the SQL for the final step has an inner join in it; if I could figure out a more efficient SQL query instead of the inner join, that would be great.
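
One possible alternative (a sketch, not benchmarked): a window MAX over the same partition produces the same result without the self-join:

 Dataset<Row> outputDS = spark.sql(
     "select MAX(time) over (partition by thingId, controller, module) as time, "
   + "thingId, controller, module, variableName, value "
   + "from intermediate");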

Thank you @Johwhite for the help.
