0

I am having a complicated SQL problem in PostgreSQL.

Suppose I have a large table called 'selling_prices'. It contains around 19 million rows. I want to remove some duplicate rows and also update some data. Here is the table structure:

seq customer_co_cd item_sku seliing_tanka_rate updatedate
1 1414343 sku001 0.4 2021-01-18 14:34:48
2 1414343 sku001 0.4 2021-01-18 14:34:48
3 1414343 sku001 0.4 2021-01-16 01:34:48
4 1512333 sku002 0.2 2021-01-16 01:34:48
5 1512333 sku002 0.5 2021-01-16 01:34:48

and so on....

Condition 1: If the customer_co_cd and item_sku and selling_tanka_rate is same update the latest updatedate to '2021/11/12' and delete the other data.

After the SQL table should be like: delete seq(2,3) and update seq 1

seq customer_co_cd item_sku seliing_tanka_rate updatedate
1 1414343 sku001 0.4 2021-11-12 00:00:00

Condition 2: If the (customer_co_cd and item_sku) is same and selling_tanka_rate is different then get the data as group

customer_co_cd item_sku count
1512333 sku002 2

I tried some query using group by but it is slow...

SELECT customer_co_cd, item_sku, COUNT(*) 
FROM selling_prices 
GROUP BY customer_co_cd,item_sku 
HAVING COUNT(*) > 1

I don't know how to query the condition 1. Also what is the efficient way to get condition 2. Keep in mind that there are around 19 million data.

Should I create a script or is there a efficient query I can use.

  • How many columns in the table? could you use a select with group by, max etc to create and new table and then either drop the old one and rename the new one or empty to old one and copy the data and the drop the new one? –  Mar 28 '22 at 05:46
  • You could work by batches, maybe based on 1000's or 10 000's of `customer_co_cd` numbers? –  Mar 28 '22 at 05:47
  • There are 22 columns.. Can you please give an example sql? – hakuna_matata Mar 28 '22 at 05:51
  • Maybe this should be two separate requests. At first it sounds like you want to change data in your database table (remove rows, update rows). Then you are suddenly talking about selecting data (all customer_co_cd / item_sku pairs with more than one selling_tanka_rate, if I understand correctly). So are these actually two different things you want to do: 1. update/delete, 2. select? Or is this only about the select where you don't want to count customer_co_cd / item_sku pairs with the same selling_tanka_rate? But if it's just about selecting data, why then the date update? – Thorsten Kettner Mar 28 '22 at 07:52
  • My main concern is to update the data... and then delete the data.. Condition 1 is my main concern. Condition 2(select) is after the condition 1 occurred. – hakuna_matata Mar 28 '22 at 08:33
  • Note: you **condition1** is a delete+update query, your **condition2** is just an aggregating select query. – wildplasser Mar 28 '22 at 12:29

2 Answers2

1

This should answers your needs (result here)

select * from t1 order by seq;

-- Update first
with t2 as (
  select 
    row_number() over (partition by customer_co_cd,item_sku,seliing_tanka_rate order by customer_co_cd,item_sku,seliing_tanka_rate,seq) as rn,
    lead(seliing_tanka_rate) over (partition by customer_co_cd,item_sku order by customer_co_cd,item_sku,seq) as lead,
    * 
  from t1)
update t1
set updatedate = '20211112'
from t2
where t2.seq = t1.seq and t2.rn = 1
and t2.seliing_tanka_rate = t2.lead;

-- delete to keep the wanted records
with t2 as (select row_number() over (partition by customer_co_cd,item_sku,seliing_tanka_rate order by customer_co_cd,item_sku,seliing_tanka_rate,seq) as rn,* from t1)
delete 
from t1
where seq in (select seq from t2 where rn > 1);

select * from t1 order by seq;

-- Condition 2
with t2 as (
  select *,
  lead(customer_co_cd) over (partition by customer_co_cd,item_sku) as co_cd,
  lead(item_sku) over (partition by customer_co_cd,item_sku) as sku,
  lead(seliing_tanka_rate) over (partition by customer_co_cd,item_sku) as rate
  from t1
  )
select customer_co_cd,item_sku,
count(*) filter (where customer_co_cd = t2.co_cd and item_sku = t2.sku and seliing_tanka_rate <> t2.rate) + 1 as count
from t2
group by customer_co_cd,item_sku
having count(*) filter (where customer_co_cd = t2.co_cd and item_sku = t2.sku and seliing_tanka_rate <> t2.rate) + 1 > 1
Philippe
  • 1,714
  • 4
  • 17
  • Thank you for helping me... I think there is a misunderstanding. on the update query 4 and 5 data should not get affected. If customer_co_cd && item_sku && selling_tanka_rate is same (1,2,3 data) then update seq 1 only not seq 4 and 5(because there selling_tanka_rate is not same) – hakuna_matata Mar 28 '22 at 08:09
  • So don't run the delete query that's all (answer and fiddle updated) – Philippe Mar 28 '22 at 08:17
  • Sorry I made a mistake... Your delete query is OK. It is required. I meant that updatedate should not get updated for 4 and 5 data. Since 4 and 5 data has no similar data(for selling_tanka_rate). – hakuna_matata Mar 28 '22 at 08:30
  • You can also query like only 1,2,3 data will be updated (updatedate). Then keep one data and delete others.. – hakuna_matata Mar 28 '22 at 08:59
  • 1
    to do it you havec to add a lead column to know if the next value of `seliing_tanka_rate` has changed. (answer updated) – Philippe Mar 28 '22 at 09:05
  • I think it worked.. Let me test it with more test cases and I will accept the answer... Thanks so much... :) – hakuna_matata Mar 28 '22 at 09:49
  • Thank you so much... You are a genius... – hakuna_matata Mar 29 '22 at 10:37
0

I think Philippe has answered your question but I'll add some slightly different approaches.

create temporary table orders (
    seq serial primary key,
  customer_co_cd int,
  item_sku varchar,
  selling_tanka_rate float,
  updated_at date
);

insert into orders (seq, customer_co_cd, item_sku, selling_tanka_rate, updated_at) values
(1 , 1414343, 'sku001', 0.4, '2021-01-18'),
(2 , 1414343, 'sku001', 0.4, '2021-01-18'),
(3 , 1414343, 'sku001', 0.4, '2021-01-16'),
(4 , 1512333, 'sku002', 0.2, '2021-01-16'),
(5 , 1512333, 'sku002', 0.5, '2021-01-16')
;

with ranked_orders as (
    select 
        orders.*,
      row_number() over(partition by customer_co_cd, item_sku, selling_tanka_rate order by updated_at DESC, seq) as recent_updated_at
   from orders
)
update orders
set updated_at = '2021-11-12' 
from ranked_orders 
where 
    orders.seq = ranked_orders.seq AND 
  ranked_orders.recent_updated_at = 1
;

select * from orders order by seq ASC; 

/*
 seq | customer_co_cd | item_sku | selling_tanka_rate | updated_at
-----+----------------+----------+--------------------+------------
   1 |        1414343 | sku001   |                0.4 | 2021-11-12
   2 |        1414343 | sku001   |                0.4 | 2021-01-18
   3 |        1414343 | sku001   |                0.4 | 2021-01-16
   4 |        1512333 | sku002   |                0.2 | 2021-11-12
   5 |        1512333 | sku002   |                0.5 | 2021-11-12
*/

delete from orders 
where orders.updated_at <> '2021-11-12' 

select * from orders order by seq ASC; 

/*
 seq | customer_co_cd | item_sku | selling_tanka_rate | updated_at
-----+----------------+----------+--------------------+------------
   1 |        1414343 | sku001   |                0.4 | 2021-11-12
   4 |        1512333 | sku002   |                0.2 | 2021-11-12
   5 |        1512333 | sku002   |                0.5 | 2021-11-12
*/


select t.* from 
( select customer_co_cd, 
         item_sku, 
         count(distinct selling_tanka_rate) as count
  from orders
  group by (customer_co_cd, item_sku)
) as t
where t.count > 1 //  you may want to remove this. Not sure of your exact requirements.

/*
 customer_co_cd | item_sku | count
----------------+----------+-------
        1512333 | sku002   |     2
*/
robodisco
  • 4,162
  • 7
  • 34
  • 48
  • 1
    Yes, Philippe's answer is right... Yours also right except the update the updatedate... Philippe got it right... My requirement is confusing.. Sorry for that... – hakuna_matata Mar 29 '22 at 10:39