1

I have geographic products that may have duplicates.

I return a list of possible duplicates and display them on a map for a user to check and delete.

To help the user cross reference between rows, I want to colour code the duplicates. Two or more rows that appear to be duplicates should have the same ColourGroup, so that it is easy to compare the two green rows or the three red rows.

I want to return a unique number for each ColourGroup.

SQL

This returns a list of Products that may have duplicates based on their Latitude, Longitude, ProductType and Price +/- 5%.

WITH Prods AS
(
SELECT p.ProductID,
p. ProductType,
p.Price,
p.Price+((p.Price/100)*5) As PriceUpper,
p.Price-((p.Price/100)*5) As PriceLower,
Round(p.Latitude,3) As Latitude,
Round(p.Longitude,3) As Longitude
FROM Products p
AND p.Latitude is not null AND p.Longitude is not null
)

SELECT 
DISTINCT a.ProductID,
b.ProductID As Duplicate,
a.Latitude,
a.Longitude
FROM Prods a

INNER JOIN Prods b ON a.ProductID <> b.ProductID
AND a.Latitude = b.Latitude 
AND a.Longitude = b.Longitude
AND a.ProductType = b.ProductType

AND (b.Price < a.PriceUpper AND b.Price > a.PriceLower)

DESIRED RESULT

Products

ProductID   Price    Latitude   Longitude   ProductType
ID1         500      12.34      56.78       Widget
ID2         505      12.34      56.78       Widget
ID3         200      12.34      56.78       Widget
ID4         800      12.34      56.78       Widget
ID5         500      12.34      56.78       Doodad

ID6         300      98.76      54.32       Doodad
ID7         295      98.76      54.32       Doodad
ID8         302      98.76      54.32       Doodad
ID9         100      98.76      54.32       Doodad

ID10        250      12.34      56.78       Thingamy
ID11        600      12.34      56.78       Thingamy

I want to return the following:

ProductID   Duplicate  Latitude   Longitude   ColourGroup
ID1         ID2        12.34      56.78       1
ID2         ID1        12.34      56.78       1

ID6         ID7        98.76      54.32       2
ID6         ID8        98.76      54.32       2
ID7         ID6        98.76      54.32       2
ID7         ID8        98.76      54.32       2
ID8         ID6        98.76      54.32       2
ID8         ID7        98.76      54.32       2

ID3 and ID4 do not match ID1 or ID2 because they are outside the +/- 5%, and do not match each other. ID5 does not match ID1 or ID2 because it's a different ProductType even though it's in the same location.

ID9 does not match IDs 6, 7 or 8.

ID10 and ID11 do not match each other.

How do I identify and number the sets of duplicates so I can colour code them later?

Ideally, rather than have a thousand colours, the ColourGroup number would reset for each Lat/Lng so I can use a set of about ten colours.

user2470281
  • 107
  • 9

1 Answers1

2

the dense_rank window function is handy here. It will numerate groups based on criteria you've passed in the "over()" part. Try this one

with pct_diff as (
select sd.ProductID, 
           sd_1.ProductID duplicate, 
           sd.latitude, 
           sd.longitude,
           sd.producttype,
           round((1.0 * min(sd.price) over(partition by sd.latitude, 
                                    sd.longitude, 
                                    sd.ProductType)) / (1.0 * sd.price) * 100 / 5, 0) pct_diff
      from some_data sd
      join some_data sd_1
        on sd.latitude = sd_1.latitude
       and sd.longitude = sd_1.longitude
       and sd.ProductType = sd_1.ProductType
       and sd.productid <> sd_1.productid
       and sd.price between sd_1.price - sd_1.Price * 0.05 and sd_1.price + sd_1.Price * 0.05)

  select ProductID,
         duplicate,
         Latitude,
         Longitude,
         ProductType,
         DENSE_RANK() over(order by Latitude,
         Longitude,
         ProductType,pct_diff) color_group
    from pct_diff   
ekochergin
  • 4,109
  • 2
  • 12
  • 19
  • Thank you for your help, but that ignores the Price +/- 5%. It is necessary because there are many similar products by ProductType in each location. The Price must be included to avoid most false positives. – user2470281 Oct 22 '21 at 10:15
  • Ah, sorry. I didn't notice that – ekochergin Oct 22 '21 at 10:16
  • @user2470281 check the updated answer please – ekochergin Oct 22 '21 at 10:18
  • Thanks but... I already have that code working to select +/- 5%. The problem is DENSE_RANK is still only ranking by location and product type. It will group products that may be the same type but vastly differing prices. It should only group those that have specifically matched each other by location, product type and Price +/- 5%. – user2470281 Oct 22 '21 at 10:25
  • @user2470281 I've updated parameters on dense_rank. Let's see if this worked out. If not, can you please provide me with more test data. Especially some lines that are not in +-5% tolerance – ekochergin Oct 22 '21 at 10:31
  • I have edited to give some additional, non-matching data. I have tried the CASE WHEN in my code and it ranks Products by LatLng and ProductType but is ignoring the Price component. i.e. all combinations of duplicate Widgets are being ranked the same. – user2470281 Oct 22 '21 at 11:10
  • Hmm, I still don't get it, sorry. The query from the last update seems to be working. At least, It produces the exact same rows as in desired result. Here's dbfiddle https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=211256e1943e0d122b1b9f023fb2ae7c – ekochergin Oct 22 '21 at 11:22
  • Ok, so on that fiddle I added 2 extra Widget rows for LatLng 12.34,56.78 with prices of 900 and 905. So now ID1 and ID2 match and the two new rows match. Both ID1 and ID2 and the two new rows are given the same colour group despite the fact they are two separate duplicate sets. – user2470281 Oct 22 '21 at 11:45
  • Wooow, it becomes more and more interesting. Check my updated answer or see this https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=0444d449116ff26d6e1b7c651b1079a9 – ekochergin Oct 22 '21 at 13:43
  • Yes, that works - thanks very much for your help! – user2470281 Oct 22 '21 at 18:00
  • @user2470281 Thanks for the interesting question you too – ekochergin Oct 22 '21 at 18:05