1

I'm having some troubles to deal with a very big table in my database. Before to talk about the problem, let's talk about what i want to achieve.

I have two source tables :

  • Source 1: SALES_MAN (ID_SMAN, SM_LATITUDE, SM_LONGITUDE)
  • Source 2: CLIENT (ID_CLIENT, CLATITUDE, CLONGITUDE)

  • Target: DISTANCE (ID_SMAN, ID_CLIENT, SM_LATITUDE, SM_LONGITUDE, CLATITUDE, CLONGITUDE, DISTANCE)

The idea is to find the top N nearest SALES_MAN for every client using a ROW_NUMBER in the target table.

What I'm doing currently is calculating the distance between every client and every sales man :

INSERT INTO DISTANCE ([ID_SMAN], [ID_CLIENT], [DISTANCE],
                      [SM_LATITUDE], [SM_LONGITUDE], [CLATITUDE], [CLONGITUDE])
    SELECT 
        [ID_SMAN], [ID_CLIENT],
        geography::STGeomFromText('POINT('+IND_LATITUDE+' '+IND_LONGITUDE+')',4326).STDistance(geography::STGeomFromText('POINT('+DLR.[DLR_N_GPS_LATTITUDE]+' '+DLR.[DLR_N_GPS_LONGITUDE]+')',4326))/1000 as distance,
        [SM_LATITUDE], [SM_LONGITUDE], [CLATITUDE], [CLONGITUDE]
    FROM 
        [dbo].[SALES_MAN], [dbo].[CLIENT]

The DISTANCE table contains approximately 1 milliards rows.

The second step to get my 5 nearest sales man per client is to run this query :

SELECT * 
FROM  
    (SELECT 
         *, 
         ROW_NUMBER() OVER(PARTITION BY ID_CLIENT ORDER BY DISTANCE) rang  
     FROM DISTANCE) TAB
WHERE rang < 6 

The last query is really a consuming one. So to avoid the SORT operator I tried to create an sorted non clustered index in DISTANCE and ID_CLIENT but it did not work. I also tried to include all the needed columns in the both indexes.

But when I created a clustered index on DISTANCE and keep the nonclustered sorted index in the ID_CLIENT the things went better.

So what a nonclustered sorting index is not working in this case?

But when I use the clustered index, I have other problem in loading data and I'm kind of forced to delete it before starting the loading process.

So what do you think? And how we can deal with this kind of tables to be able to select, insert or update data without having performance issues ?

Many thanks

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
  • 1
    [Bad habits to kick : using old-style JOINs](https://sqlblog.org/2009/10/08/bad-habits-to-kick-using-old-style-joins) - that old-style *comma-separated list of tables* style was replaced with the *proper* ANSI `JOIN` syntax in the ANSI-**92** SQL Standard (**25 years** ago) and its use is discouraged – marc_s Aug 24 '17 at 20:27

2 Answers2

1

Too long for a comment, but consider the following points.

Item 1) Consider adding a Geography field to each of your source tables. This will eliminate the redundant GEOGRAPHY::Point() function calls

Update YourTable Set GeoPoint = GEOGRAPHY::Point([Lat], [Lng], 4326)

So then the calculation for distance would simply be

  ,InMeters  = C.GeoPoint.STDistance(S.GeoPoint) 
  ,InMiles   = C.GeoPoint.STDistance(S.GeoPoint) / 1609.344


Item 2) Rather than generating EVERY possible combination, consider a adding a condtion to the JOIN. Keep in mind that every "1" of Lat or Lng is approx 69 miles, so you can reduce the search area. For example

From CLIENT C
Join SALES_MAN S
  on S.Lat between C.Lat-1 and C.Lat+1
 and S.Lng between C.Lng-1 and C.Lng+1

This +/- 1 could be any reasonable value ... (i.e. 0.5 or even 2.0)

John Cappelletti
  • 79,615
  • 7
  • 44
  • 66
  • Thanks, i will try the first trick in order to avoid redundant function calls. Also the second trick is really brilliant and i will see if the business rules will not be violated doing it. And what about index ? do you think there is no need for them? – Hadria Mehdi Aug 24 '17 at 21:00
  • @HadriaMehdi I suspect the bulk was generating 1MM records while you'll need a fraction of that. An index is like Chicken Soup for a Cold.... Couldn;t hurt, – John Cappelletti Aug 24 '17 at 21:29
0

ROW_NUMBER is a window function that requires the whole rows related with the ORDER BY 's column so its better to filter your result before ROW_NUMBER,

and you've to change the following code :

SELECT * FROM (
SELECT *, ROW_NUMBER() OVER(PARTITION BY ID_CLIENT ORDER BY DISTANCE) 
rang FROM DISTANCE
) TAB

WHERE rang < 6 

into this:

WITH DISTANCE_CLIENT_IDS (CLIENT_ID) AS 
(
  SELECT DISTINCT CLIENT_ID 
  FROM DISTANCE
) 

SELECT Dx.* 
FROM DISTANCE_CLIENT_IDS D1,
(
   SElECT * , ROW_NUMBER(ORDER BY DISTANCE) RANGE 
   FROM (
     SELECT TOP(5) * 
     FROM DISTANCE D2
     WHERE D1.CLIENT_ID = D2.CLIENT_ID
   ) Dt
) Dx

and make sure you'd added indexes on both CLIENT_ID and DISTANCE columns

Siamand
  • 1,080
  • 10
  • 19
  • Hello thanks for your reply. I can filter my data because i have to sort it by distance and select only rows with rang < 6. The idea is to select for each ID_CLIENT the 5 nearest sales men. – Hadria Mehdi Aug 24 '17 at 19:51
  • for having a good performance , the ROW_NUMBER should be used for making identical numbers at the result **after filter** – Siamand Aug 24 '17 at 19:53
  • In your query no need to use ROW_NUMBER() because you are selecting only the top 5 by distance – Hadria Mehdi Aug 24 '17 at 21:01
  • I use Row_Number for Range field in the final result , I though you need it @HadriaMehdi – Siamand Aug 24 '17 at 21:08