4

I am trying to convert formulas from a Google Sheet file to SQL. I get stuck on a recursive formula (calling on the result of the previous line of the same column).

Here is the table I am working with

timestamp   var1
1602460890  1
1602460900  3
1602460910  4
1602460920  4
1602460930  1
1602460940  8
1602460950  2
1602460960  3
1602460970  4
1602460980  2

I wish to have the following result, with the addition of the column "var2"

timestamp   var1  var2
1602460890  1   
1602460900  3     1   # 1 = 1 + 3 * 0
1602460910  4     5   # 5 = 1 + 4 * 1 
1602460920  4     21  # 21 = 1 + 4 * 5 
1602460930  1     0   # Here the result is "0" because in the previous line, the result is greater than a threshold (the threshold is set at 10)
1602460940  8     1   # 1 = 1 + 8 * 0 
1602460950  2     3   # 3 = 1 + 2 * 1 
1602460960  3     10  # 10 = 1 + 3 * 3 
1602460970  4     41  # 41 = 1 + 4 * 10 
1602460980  2     0   # Here the result is "0" because in the previous line, the result is greater than a threshold (the threshold is set at 10)

"var2" is equal to 1 to which we add "var1" multiplied by the previous result of "var2". But if its previous result is greater than 10 (the threshold is set at 10), then the result will be zero.

The SQL query might look like this, but this one logically returns the following error:

"SYNTAX_ERROR: line 3:18: Column 'var2' cannot be resolved"

SELECT *,
   (CASE
       WHEN (lag(var2, 1) OVER (ORDER BY timestamp)) > 10  -- (the threshold is set at 10)
       THEN 0
       ELSE 1 + (var1 * (lag(var2, 1) OVER (ORDER BY timestamp))) END)
    AS var2
FROM my_table

Would anyone have an idea to solve this problem or maybe I have to change SQL language? Thanks for your help

David
  • 450
  • 3
  • 14
  • 1
    This type of logic requires a recursive CTE, which Athena/Presto do not support. You can do this by iterating externally, such as in Python. – Gordon Linoff Oct 19 '20 at 11:40
  • @GordonLinoff The dataset I'm working with is too big to support with Pandas. The latter increases in size daily, and the operations done with Pandas are getting longer and longer ... I find it hard to understand why such a simple operation with Pandas, numpy (or even Excel) is not possible with a language such as Presto SQL... Is it possible to solve this problem with another SQL language? – David Oct 19 '20 at 12:07
  • . . This is a very complicated operation, actually. I am also surprised Pandas cannot solve it. You have data every 10 seconds. How much data do you have? And, yes, you can express this in SQL using recursive CTEs, but unless you have another key, it will be pretty slow. – Gordon Linoff Oct 19 '20 at 12:15
  • OK, I didn't know that behind this kind of recursion, the impact is quite complicated. Exactly, I only have one recording every 10 seconds. But I do have 400 columns (approximately), and one file per hour (so 360 lines per file) stored on an AWS s3 bucket. It takes a long time to read the files one by one, then concatenate them all. Querying the data stored in S3 from Athena is very efficient, but I actually feel like it doesn't meet my need. – David Oct 19 '20 at 12:24
  • 1
    . . Any logic that has thresholds where the value is set to some constant value when the threshold is met requires row-by-row processing. There is no way to "partition" the calculation so it can run in parallel. – Gordon Linoff Oct 19 '20 at 12:40
  • Thanks for your feedback @GordonLinoff. I add here the corresponding feature request https://github.com/prestosql/presto/issues/1122. I will have to find a solution other than Athena to solve my problem. – David Oct 19 '20 at 13:29

0 Answers0