6

How can I tell the LAG function to get the last "not null" value?

For example, see my table bellow where I have a few NULL values on column B and C. I'd like to fill the nulls with the last non-null value. I tried to do that by using the LAG function, like so:

case when B is null then lag (B) over (order by idx) else B end as B,

but that doesn't quite work when I have two or more nulls in a row (see the NULL value on column C row 3 - I'd like it to be 0.50 as the original).

Any idea how can I achieve that? (it doesn't have to be using the LAG function, any other ideas are welcome)

A few assumptions:

  • The number of rows is dynamic;
  • The first value will always be non-null;
  • Once I have a NULL, is NULL all up to the end - so I want to fill it with the latest value.

Thanks

enter image description here

Diego
  • 34,802
  • 21
  • 91
  • 134
  • Itzik Ben-Gan wrote a blog on a that problem: http://sqlmag.com/sql-server/how-previous-and-next-condition. Unfortunatley SQL Server doesn't support the `IGNORE NULLS` option in `LAST_VALUE`, then it's simple: `LAST_VALUE(B IGNORE NULLS) OVER (ORDER BY idx)`. – dnoeth Apr 28 '16 at 07:09

5 Answers5

8

You can do it with outer apply operator:

select t.id,
       t1.colA,
       t2.colB,
       t3.colC 
from table t
outer apply(select top 1 colA from table where id <= t.id and colA is not null order by id desc) t1
outer apply(select top 1 colB from table where id <= t.id and colB is not null order by id desc) t2
outer apply(select top 1 colC from table where id <= t.id and colC is not null order by id desc) t3;

This will work, regardless of the number of nulls or null "islands". You may have values, then nulls, then again values, again nulls. It will still work.


If, however the assumption (in your question) holds:

Once I have a NULL, is NULL all up to the end - so I want to fill it with the latest value.

there is a more efficient solution. We only need to find the latest (when ordered by idx) values. Modifying the above query, removing the where id <= t.id from the subqueries:

select t.id,
       colA = coalesce(t.colA, t1.colA),
       colB = coalesce(t.colB, t2.colB),
       colC = coalesce(t.colC, t3.colC) 
from table t
outer apply (select top 1 colA from table 
             where colA is not null order by id desc) t1
outer apply (select top 1 colB from table 
             where colB is not null order by id desc) t2
outer apply (select top 1 colC from table 
             where colC is not null order by id desc) t3;
ypercubeᵀᴹ
  • 113,259
  • 19
  • 174
  • 235
Giorgi Nakeuri
  • 35,155
  • 8
  • 47
  • 75
  • hey, thanks, but as I said, "The number of rows is dynamic" so how would that work with 5 rows? – Diego Apr 25 '16 at 10:44
  • @diego - This will work regardless of the number of rows... But it would have a potentially exponential cost as the data set grows *(the cost of each sub query is higher for the 1000th row than it is for the 999th row)*, but it's certainly tidy for small data-sets. – MatBailie Apr 25 '16 at 10:56
  • @GiorgiNakeuri Yes, you are right. I hadn't read the question carefully. – ypercubeᵀᴹ Apr 25 '16 at 10:56
  • of course, I misread your query. It does seem to work, +1 thanks for your help – Diego Apr 25 '16 at 14:24
4

You could make a change to your ORDER BY, to force the NULLs to be first in your ordering, but that may be expensive...

lag(B) over (order by CASE WHEN B IS NULL THEN -1 ELSE idx END)

Or, use a sub-query to calculate the replacement value once. Possibly less expensive on larger sets, but very clunky.
- Relies on all the NULLs coming at the end
- The LAG doesn't rely on that

COALESCE(
    B,
    (
        SELECT
            sorted_not_null.B
        FROM
        (
            SELECT
                table.B,
                ROW_NUMBER() OVER (ORDER BY table.idx DESC)   AS row_id
            FROM
                table
            WHERE
                table.B IS NOT NULL
        )
           sorted_not_null
        WHERE
           sorted_not_null.row_id = 1
    )
)

(This should be faster on larger data-sets, than LAG or using OUTER APPLY with correlated sub-queries, simply because the value is calculated once. For tidiness, you could calculate and store the [last_known_value] for each column in variables, then just use COALESCE(A, @last_known_A), COALESCE(B, @last_known_B), etc)

MatBailie
  • 83,401
  • 18
  • 103
  • 137
  • +1 that does seem to work but my "table" is actually a big query that I don't really want to run more than once, which your solution would require. Thanks very much for the help – Diego Apr 25 '16 at 14:26
  • @Diego - other than using LAG, every other answer here (and every approach I can think of) is going to have that problem. – MatBailie Apr 25 '16 at 14:28
1

if it is null all the way up to the end then can take a short cut

declare @b varchar(20) = (select top 1 b from table where b is not null order by id desc);
declare @c varchar(20) = (select top 1 c from table where c is not null order by id desc); 
select is, isnull(b,@b) as b, insull(c,@c) as c 
from table;
paparazzo
  • 44,497
  • 23
  • 105
  • 176
  • good aproach, I didn't want to declare variables so I ended up doing something like: case when B is null then (select top 1 B from where B is not null order by idx desc) else B end as B great Idea, thanks very much – Diego Apr 25 '16 at 14:32
  • I think variable are cleaner to read and it assures you that the query optimizer only does it once. – paparazzo Apr 25 '16 at 14:48
1
Select max(diff) from(
Select 
    Case when lag(a) over (order by b) is not null
    Then (a -lag(a) over (order by b)) end as diff 
     From <tbl_name> where
    <relevant conditions>
    Order by b) k

Works fine in db visualizer.

Syscall
  • 19,327
  • 10
  • 37
  • 52
ritusmita
  • 11
  • 1
-3
UPDATE table 
SET B = (@n := COALESCE(B , @n))
WHERE B is null;
Adesh
  • 34
  • 4