-1

I have basically one table but I have created another(subset) table from the original one. I'll original table as Table 1 and subset table as Table 2.

Table_1 looks like this:

enter image description here

Table_2 looks like this:

enter image description here

Final_table needs to look like this:

enter image description here

Now as you can see, Table_2 has been created by picking certain texts out of Table_1. Now what I really need is a way to first look up these exact same rows of data in Table_1, and then for each row looking backwards using date (timestamp) find a row when value2 from Table_2 matches with text from Table_1. Whenever it does, get the value and add it to value_original column.

Final table shows the result I need to get. How do I do that using Vertica sql?

I have tried this code:

SELECT *
FROM
  (SELECT table_2.*, table_1.value as value_original
  FROM
      (Select * from table
      where date < '1/10/2020' 
      and text = 'settle') as table_2
  LEFT JOIN
      (Select * from table where date < '1/10/2020') as table_1
  ON table_2.id1 = table_1.id1 and table_2.id2 = table_1.id2 and table_2.value2 = table_1.text) as final_table
where value_original is not null
ORDER BY date

I didn't realize how terrible my choice of aliases were until now, sorry about that. After trying this code, I ended up in situations where it would have exact same values multiple times in Table_1 but I need the one that is closest to the selected row in Table_2 going backwards using date (timestamp). I'm not sure how to do that. Any suggestions?

Thanks!

Shubham
  • 1
  • 2
  • 1
    Hi @Shubham, please show us what you have tried so far. – Gary Mar 03 '20 at 16:05
  • 1
    Hi @Gary, I just updated the post. I get that the query might not seem great, but I cut down a bunch from the original query and still wanted to show the original structure. Does the update help a bit? Thank you! – Shubham Mar 03 '20 at 16:28
  • Please [use text, not images/links, for text--including tables & ERDs](https://meta.stackoverflow.com/q/285551/3404097). Use images only for what cannot be expressed as text or to augment text. Please in code questions give a [mre]--cut & paste & runnable code; example input (as initialization code) with desired & actual output (including verbatim error messages); tags & versions; clear specification & explanation. [ask] For SQL that includes DBMS & DDL, which includes constraints & indexes & tabular initialization. – philipxy Mar 05 '20 at 08:44

3 Answers3

0

I think you just want a left join:

select t2.*, t1.value as value_orig
from table_2 t2 left join
     table_1 t1
     on t2.id1 = t1.id1 and
        t2.id2 = t1.id2 and
        t2.value2 = t1.text;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Thank you for responding @Gordon Linoff. I understand that this query does not look great and can be simplified, I posted it like this here because I had to cut a whole bunch of other stuff and didn't want to change the structure of the query when posting here. But my main issue is that this type of JOIN doesn't work if there are multiple entries which are exactly the same because in those situations I need it to pick up what's closest to the row from Table_2 which exists in Table_1 going backwards in date for the same value2 as text. – Shubham Mar 03 '20 at 16:48
0

You're working with Vertica, so it would be legitimate to use things that are specific to Vertica.

We are working on time series, here, columns date and time setting the pace.

The two specifics you'll see in the script below are:

  • "Sessionization" - that's the CONDITIONAL_TRUE_EVENT() OLAP function. This function starts by returning 0, and returns a number incremented by 1 every time the Boolean expression parameter is true, and resets to 0 every time the PARTITION BY expression changes. I use it to group rows into everything that precedes a row whose txt is 'settle'.
  • The Analytic Limit Clause - that's LIMIT <n> OVER (PARTITION BY ... ORDER BY ...) which can be used to get the first or the last row in time within a group.

So I run a full select to get the session identifier, and name the full-select w_sess, and I run a full select from this w_sess to get the last row within a session whose txt is one of the 3 names you supplied, and name it orig_rows, and finally join the two over the session identifier I got when building w_sess. I show in comments the contents of the two Common Table Expressions for clarity in the script.

Happy playing ...

WITH
input(id1,id2,txt,val,dt,tm) AS (
          SELECT 1,1,'jane'   ,97,DATE'2020-01-01','05:30:22'::TIME
UNION ALL SELECT 1,1,'henry'  ,54,DATE'2020-01-01','06:30:22'::TIME
UNION ALL SELECT 1,1,'jane'   ,10,DATE'2020-01-01','07:30:22'::TIME
UNION ALL SELECT 1,1,'jack'   , 2,DATE'2020-01-01','08:30:22'::TIME
UNION ALL SELECT 1,1,'settle' ,30,DATE'2020-01-01','09:30:22'::TIME
UNION ALL SELECT 1,1,'kara'   ,16,DATE'2020-01-01','10:30:22'::TIME
UNION ALL SELECT 1,1,'sam'    ,46,DATE'2020-07-01','11:30:22'::TIME
UNION ALL SELECT 1,1,'pam'    ,14,DATE'2020-07-01','12:30:22'::TIME
UNION ALL SELECT 1,1,'settle' ,27,DATE'2020-07-01','13:30:22'::TIME
UNION ALL SELECT 1,1,'michael',90,DATE'2020-07-01','14:30:22'::TIME
UNION ALL SELECT 1,1,'tom'    ,10,DATE'2020-07-01','15:30:22'::TIME
UNION ALL SELECT 1,1,'jackson',20,DATE'2020-07-01','16:30:22'::TIME
UNION ALL SELECT 1,1,'settle' ,40,DATE'2020-07-01','17:30:22'::TIME
)
,
w_sess AS (
SELECT
  CONDITIONAL_TRUE_EVENT(LAG(txt)='settle') OVER(
   PARTITION BY id1,id2 ORDER BY dt,tm
  ) AS session_id
, *
FROM input
)
-- select * from w_sess;
-- out  session_id | id1 | id2 |   txt   | val |     dt     |    tm    
-- out ------------+-----+-----+---------+-----+------------+----------
-- out           0 |   1 |   1 | jane    |  97 | 2020-01-01 | 05:30:22
-- out           0 |   1 |   1 | henry   |  54 | 2020-01-01 | 06:30:22
-- out           0 |   1 |   1 | jane    |  10 | 2020-01-01 | 07:30:22
-- out           0 |   1 |   1 | jack    |   2 | 2020-01-01 | 08:30:22
-- out           0 |   1 |   1 | settle  |  30 | 2020-01-01 | 09:30:22
-- out           1 |   1 |   1 | kara    |  16 | 2020-01-01 | 10:30:22
-- out           1 |   1 |   1 | sam     |  46 | 2020-07-01 | 11:30:22
-- out           1 |   1 |   1 | pam     |  14 | 2020-07-01 | 12:30:22
-- out           1 |   1 |   1 | settle  |  27 | 2020-07-01 | 13:30:22
-- out           2 |   1 |   1 | michael |  90 | 2020-07-01 | 14:30:22
-- out           2 |   1 |   1 | tom     |  10 | 2020-07-01 | 15:30:22
-- out           2 |   1 |   1 | jackson |  20 | 2020-07-01 | 16:30:22
-- out           2 |   1 |   1 | settle  |  40 | 2020-07-01 | 17:30:22
-- out (13 rows)
-- out 
-- out Time: First fetch (13 rows): 119.260 ms. All rows formatted: 119.315 ms
,
orig_rows AS (
SELECT
  *
FROM w_sess
WHERE txt IN('jane','pam','tom')
LIMIT 1 OVER(PARTITION BY session_id,txt ORDER BY dt DESC,tm DESC)
)
-- SELECT * FROM orig_rows;
-- out  session_id | id1 | id2 | txt  | val |     dt     |    tm    
-- out ------------+-----+-----+------+-----+------------+----------
-- out           0 |   1 |   1 | jane |  10 | 2020-01-01 | 07:30:22
-- out           1 |   1 |   1 | pam  |  14 | 2020-07-01 | 12:30:22
-- out           2 |   1 |   1 | tom  |  10 | 2020-07-01 | 15:30:22
-- out (3 rows)
-- out 
-- out Time: First fetch (3 rows): 211.165 ms. All rows formatted: 211.222 ms
 SELECT 
  w_sess.id1
, w_sess.id2
, w_sess.txt
, w_sess.val
, w_sess.dt
, orig_rows.txt AS value2
, orig_rows.val AS value_orig
FROM w_sess join orig_rows USING(session_id)
WHERE w_sess.txt='settle'
ORDER BY w_sess.session_id
;
-- out  id1 | id2 |  txt   | val |     dt     | value2 | value_orig 
-- out -----+-----+--------+-----+------------+--------+------------
-- out    1 |   1 | settle |  30 | 2020-01-01 | jane   |         10
-- out    1 |   1 | settle |  27 | 2020-07-01 | pam    |         14
-- out    1 |   1 | settle |  40 | 2020-07-01 | tom    |         10
-- out (3 rows)
-- out 
-- out Time: First fetch (3 rows): 540.519 ms. All rows formatted: 540.566 ms
marcothesane
  • 6,192
  • 1
  • 11
  • 21
0

Just thought about it some more - and there's another way, also Vertica specific, to solve your problem.

It's yet another time series related goodie in the DBMS: The Event Series Join . It works on all outer joins, joining the current row with the row of the other table with either a matching join column (of the INTERPOLATE PREVIOUS VALUE predicate) or the row with the immediately preceding value in the join column.

You can join with several equi predicates, but only one predicate can be INTERPOLATE PREVIOUS VALUE.

In your case, if it is possible that a preceding row you're looking for is also for a different date, not only a different time, you must add a column: ts TIMESTAMP DEFAULT dt+tm::INTERVAL to the base table, and INTERPOLATE PREVIOUS VALUE on that one.

My example is lazy and assumes that you can find a match with just the time.

WITH
input(id1,id2,txt,val,dt,tm) AS (
          SELECT 1,1,'jane'   ,97,DATE'2020-01-01','05:30:22'::TIME
UNION ALL SELECT 1,1,'henry'  ,54,DATE'2020-01-01','06:30:22'::TIME
UNION ALL SELECT 1,1,'jane'   ,10,DATE'2020-01-01','07:30:22'::TIME
UNION ALL SELECT 1,1,'jack'   , 2,DATE'2020-01-01','08:30:22'::TIME
UNION ALL SELECT 1,1,'settle' ,30,DATE'2020-01-01','09:30:22'::TIME
UNION ALL SELECT 1,1,'kara'   ,16,DATE'2020-01-01','10:30:22'::TIME
UNION ALL SELECT 1,1,'sam'    ,46,DATE'2020-07-01','11:30:22'::TIME
UNION ALL SELECT 1,1,'pam'    ,14,DATE'2020-07-01','12:30:22'::TIME
UNION ALL SELECT 1,1,'settle' ,27,DATE'2020-07-01','13:30:22'::TIME
UNION ALL SELECT 1,1,'michael',90,DATE'2020-07-01','14:30:22'::TIME
UNION ALL SELECT 1,1,'tom'    ,10,DATE'2020-07-01','15:30:22'::TIME
UNION ALL SELECT 1,1,'jackson',20,DATE'2020-07-01','16:30:22'::TIME
UNION ALL SELECT 1,1,'settle' ,40,DATE'2020-07-01','17:30:22'::TIME
)
,
orig_rows AS (
SELECT 
  *
FROM input
WHERE txt IN('jane','pam','tom')
)
SELECT
  input.id1
, input.id2
, input.txt
, input.val
, input.dt
, input.dt + input.tm::interval AS ts
, orig_rows.txt AS value2
, orig_rows.val AS value_orig
FROM input
LEFT JOIN orig_rows
 ON input.id1=orig_rows.id1
AND input.id2=orig_rows.id2
AND input.dt = orig_rows.dt
AND input.tm INTERPOLATE PREVIOUS VALUE orig_rows.tm
WHERE input.txt='settle'
;
-- out  id1 | id2 |  txt   | val |     dt     |         ts          | value2 | value_orig 
-- out -----+-----+--------+-----+------------+---------------------+--------+------------
-- out    1 |   1 | settle |  30 | 2020-01-01 | 2020-01-01 09:30:22 | jane   |         10
-- out    1 |   1 | settle |  27 | 2020-07-01 | 2020-07-01 13:30:22 | pam    |         14
-- out    1 |   1 | settle |  40 | 2020-07-01 | 2020-07-01 17:30:22 | tom    |         10
-- out (3 rows)
-- out 
-- out Time: First fetch (3 rows): 90.260 ms. All rows formatted: 90.301 ms

happy playing, again ...

marcothesane
  • 6,192
  • 1
  • 11
  • 21