1

My table in SQL is like:-

RN   Name   value1  value2  Timestamp
1    Mark   110     210     20160119
1    Mark   106     205     20160115
1    Mark   103     201     20160112
2    Steve  120     220     20151218
2    Steve  111     210     20151210
2    Steve  104     206     20151203

Desired Output:-

RN  Name    value1Lag1 value1lag2   value2lag1  value2lag2
1   Mark       4             3            5        4
2   Steve      9             7            10       4

The difference is calculated from the most recent to the second recent and then from second recent to the third recent for RN 1

value1lag1 = 110-106 =4

value1lag2 = 106-103 = 3

value2lag1 = 210-205 = 5

value2lag2 = 205-201 = 4

similarly for other RN's also.

Note: For each RN there are 3 and only 3 rows.

I have tried in several ways by taking help from similar posts but no luck.

sumeet agrawal
  • 57
  • 1
  • 12
  • 2
    Which database you are using? – Rahul Tripathi Feb 02 '16 at 05:57
  • Oracle database. We have to apply the logic on big data so you can answer on any database you are convenient with.It will serve the purpose. – sumeet agrawal Feb 02 '16 at 06:03
  • 1
    Please use words to clearly explain how the output is related to the input in the general case. – philipxy Feb 02 '16 at 06:16
  • @Philipxy The difference is calculated from the most recent to the second recent and then from second recent to the third recent for RN 1 value1lag1 = 110-106 = 4 value1lag2 = 106-103 = 3 value2lag1 = 210-205 = 5 value2lag2 = 205-201 = 4 similarly for other RN's also – sumeet agrawal Feb 02 '16 at 06:21
  • Please edit clarifications into your question. PS What if there are fewer than enough values in the table? What if an RN has multiple Names or vice versa? PPS What is a query for the most recent time and its value1 & value2 for each RN? For the 2nd & 3rd most recent times? – philipxy Feb 02 '16 at 06:40
  • @sumeet agrawal: Hey Simon, it looks like you might be new to SO. If you feel an answer solved the problem, please mark it as 'accepted' by clicking the green check mark. This helps keep the focus on older SO which still don't have answers. I'd encourage you to look at the LAG() option that gpeche suggested. – Robert Rodkey Feb 04 '16 at 02:51
  • Thanks Robert. I will surely look on LAG() as well. But wondering if the functions will be available on hive or not? – sumeet agrawal Feb 04 '16 at 13:50

4 Answers4

1

I've assumed that RN and Name are linked here. It's a bit messy, but if each RN always has 3 values and you always want to check them in this order, then something like this should work.

SELECT
    t1.Name
    , AVG(CASE WHEN table_ranked.Rank = 1 THEN table_ranked.value1 ELSE NULL END) - AVG(CASE WHEN table_ranked.Rank = 2 THEN table_ranked.value1 ELSE NULL END)   value1Lag1
    , AVG(CASE WHEN table_ranked.Rank = 2 THEN table_ranked.value1 ELSE NULL END) - AVG(CASE WHEN table_ranked.Rank = 3 THEN table_ranked.value1 ELSE NULL END) value1Lag2
    , AVG(CASE WHEN table_ranked.Rank = 1 THEN table_ranked.value2 ELSE NULL END) - AVG(CASE WHEN table_ranked.Rank = 2 THEN table_ranked.value2 ELSE NULL END) value2Lag1
    , AVG(CASE WHEN table_ranked.Rank = 2 THEN table_ranked.value2 ELSE NULL END) - AVG(CASE WHEN table_ranked.Rank = 3 THEN table_ranked.value2 ELSE NULL END) value2Lag2
FROM table t1
INNER JOIN
(
    SELECT
        t1.Name
        , t1.value1
        , t1.value2
        , COUNT(t2.TimeStamp) Rank
    FROM table t1
    INNER JOIN table t2
        ON t2.name = t1.name
        AND t1.TimeStamp <= t2.TimeStamp
    GROUP BY t1.Name, t1.value1, t1.value2
) table_ranked
    ON table_ranked.Name = t1.Name
GROUP BY t1.Name
Robert Rodkey
  • 423
  • 3
  • 9
  • Thanks for the solution.I tried on SQL fiddle which is throwing en error. Appreciate you help here..http://sqlfiddle.com/#!9/97d78/1 You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ') - SUM(CASE WHEN table_ranked.Rank = 2 THEN table_ranked.value1 ELSE NULL) valu' at line 3 – sumeet agrawal Feb 02 '16 at 09:30
  • Thanks for the link - I've edited my answer to get rid of some of the syntax errors - returns the correct result from the jfiddle link you sent. That being said, while this works, using the LAG function that @gpeche outlined is likely the best approach (assuming it works anyway, I've never used the LAG function). – Robert Rodkey Feb 02 '16 at 15:49
  • Robert Rodkey Fantastic!! Thanks a lot.The query is working absolutely fine. – sumeet agrawal Feb 03 '16 at 06:36
1

There are other answers here, but I think your problem is calling for analytic functions, specifically LAG():

select
    rn,
    name,
    -- calculate the differences
    value1 - v1l1 value1lag1,
    v1l1 - v1l2 value1lag2,
    value2 - v2l1 value2lag1,
    v2l1 - v2l2 value2lag2
 from (
     select 
       rn, 
       name, 
       value1, 
       value2, 
       timestamp, 
       -- these two are the values from the row before this one ordered by timestamp (ascending)
       lag(value1) over(partition by rn, name order by timestamp asc) v1l1,
       lag(value2) over(partition by rn, name order by timestamp asc) v2l1
       -- these two are the values from two rows before this one ordered by timestamp (ascending)
       lag(value1, 2) over(partition by rn, name order by timestamp asc) v1l2,
       lag(value2, 2) over(partition by rn, name order by timestamp asc) v2l2

    from (
      select
      1 rn, 'Mark' name, 110 value1, 210 value2, '20160119' timestamp
      from dual
      union all
      select
      1 rn, 'Mark' name, 106 value1, 205 value2, '20160115' timestamp
      from dual
      union all
      select
      1 rn, 'Mark' name, 103 value1, 201 value2, '20160112' timestamp
      from dual
      union all
      select
      2 rn, 'Steve' name, 120 value1, 220 value2, '20151218' timestamp
      from dual
      union all
      select
      2 rn, 'Steve' name, 111 value1, 210 value2, '20151210' timestamp
      from dual
      union all
      select
      2 rn, 'Steve' name, 104 value1, 206 value2, '20151203' timestamp
      from dual
    ) data
)
where 
-- return only the rows that have defined values
v1l1 is not null and 
v1l2 is not null and
v2l1 is not null and 
v2l1 is not null

This approach has the benefit that Oracle does all the necessary buffering internally, avoiding self-joins and the like. For big data sets this can be important from a performance viewpoint.

As an example, the explain plan for that query would be something like

-------------------------------------------------------------------------
| Id  | Operation        | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT |      |     6 |   150 |    13   (8)| 00:00:01 |
|*  1 |  VIEW            |      |     6 |   150 |    13   (8)| 00:00:01 |
|   2 |   WINDOW SORT    |      |     6 |   138 |    13   (8)| 00:00:01 |
|   3 |    VIEW          |      |     6 |   138 |    12   (0)| 00:00:01 |
|   4 |     UNION-ALL    |      |       |       |            |          |
|   5 |      FAST DUAL   |      |     1 |       |     2   (0)| 00:00:01 |
|   6 |      FAST DUAL   |      |     1 |       |     2   (0)| 00:00:01 |
|   7 |      FAST DUAL   |      |     1 |       |     2   (0)| 00:00:01 |
|   8 |      FAST DUAL   |      |     1 |       |     2   (0)| 00:00:01 |
|   9 |      FAST DUAL   |      |     1 |       |     2   (0)| 00:00:01 |
|  10 |      FAST DUAL   |      |     1 |       |     2   (0)| 00:00:01 |
-------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("V1L1" IS NOT NULL AND "V1L2" IS NOT NULL AND "V2L1" IS 

Note that there are no joins, just a WINDOW SORT that buffers the necessary data from the "data source" (in our case, the VIEW 3 that is the UNION ALL of our SELECT ... FROM DUAL) to partition and calculate the different lags.

gpeche
  • 21,974
  • 5
  • 38
  • 51
0

if just in this case, it's not that difficult.you need 2 steps

  1. self join and get the result of minus

    select t1.RN,
           t1.Name,
           t1.rm,
           t2.value1-t1.value1 as value1, 
           t2.value2-t1.value2 as value2
    from 
    (select RN,Name,value1,value2,
            row_number(partition by Name order by Timestamp desc) as rm from table)t1 
    left join
    (select RN,Name,value1,value2,
            row_number(partition by Name order by Timestamp desc) as rm from table) t2 
    on t1.rm = t2.rm-1
    where t2.RN is not null.
    

you set this as a table let's say table3.

2.you pivot it

select * from (
  select t3.RN, t3.Name,t3.rm,t3.value1,t3.value2 from table3 t3
               )
pivot 
 (
   max(value1)
    for rm in ('1','2')
  )v1

3.you get 2 pivot table for value1 and value2 join them together to get the result.

but i think there may be a better way and i m not sure if we can just join pivot when we pivot it so i ll use join after i get the pivot result that will make 2 more tables. its not good but the best i can do

0
-- test data
with data(rn,
name,
value1,
value2,
timestamp) as
 (select 1, 'Mark', 110, 210, to_date('20160119', 'YYYYMMDD')
    from dual
  union all
  select 1, 'Mark', 106, 205, to_date('20160115', 'YYYYMMDD')
    from dual
  union all
  select 1, 'Mark', 103, 201, to_date('20160112', 'YYYYMMDD')
    from dual
  union all
  select 2, 'Steve', 120, 220, to_date('20151218', 'YYYYMMDD')
    from dual
  union all
  select 2, 'Steve', 111, 210, to_date('20151210', 'YYYYMMDD')
    from dual
  union all
  select 2, 'Steve', 104, 206, to_date('20151203', 'YYYYMMDD') from dual),

-- first transform value1, value2 to value_id (1,2), value
data2 as
 (select d.rn, d.name, 1 as val_id, d.value1 as value, d.timestamp
    from data d
  union all
  select d.rn, d.name, 2 as val_id, d.value2 as value, d.timestamp
    from data d)

select *  -- find previous row P of row D, evaluate difference and build column name as desired
  from (select d.rn,
               d.name,
               d.value - p.value as value,
               'value' || d.val_id || 'Lag' || row_number() over(partition by d.rn, d.val_id order by d.timestamp desc) as col
          from data2 p, data2 d
         where p.rn = d.rn
           and p.val_id = d.val_id
           and p.timestamp =
               (select max(pp.timestamp)
                  from data2 pp
                 where pp.rn = p.rn
                   and pp.val_id = p.val_id
                   and pp.timestamp < d.timestamp))
       -- pivot
       pivot(sum(value) for col in('value1Lag1',
                                   'value1Lag2',
                                   'value2Lag1',
                                   'value2Lag2'));
Frank Ockenfuss
  • 2,023
  • 11
  • 26