
I don't know if I've chosen an appropriate title for this question (if not, please change it accordingly), but consider the following simplified table structure I'm working with:

----------------------------------------------
|  date  |  i  |  j  |  k  |  x  |  y  |  z  |
----------------------------------------------
| 100209 |  1  |  2  |  3  |  4  |  5  |  6  |
----------------------------------------------
| 100210 |  2  |  3  |  4  |  5  |  6  |  7  |
----------------------------------------------
| 100211 |  0  |  1  |  2  |  3  |  4  |  5  |
----------------------------------------------
| 100212 |  1  |  2  |  3  |  4  |  5  |  6  |
----------------------------------------------
| 100213 |  6  |  5  |  4  |  3  |  2  |  1  |
----------------------------------------------

i, j, k, x, y, z are all unrelated integers/floats; they represent different factors and can have very different orders of magnitude (i might range from 1 to 10 while j ranges from 100 to 1000).

I'm trying to select dates that share similar conditions. Given a set of i, j, k, x, y, z values, I need to return all results ordered by the closeness of all values as a whole. For instance, if i = 1, j = 2, k = 3, x = 4, y = 5 and z = 6, the query should return the following dates in this order:

  1. 100209
  2. 100212
  3. 100210
  4. 100211
  5. 100213

I'm not sure if this is relevant to the question or not, but for some values (i, j, k) more is better, while for the others (x, y, z) the opposite holds: less is better.

How should I build such a query? Is this possible with SQL alone?


@Pentium10:

I'll try to answer your comment the best way I can. Here is a sample of my data:

---------------------------------------------------------------------------------
  date  |  temperature  |  humidity  |  pressure  |  windSpeed  |  moonDistance  
---------------------------------------------------------------------------------
 090206 |  7            |  87        |  998.8     |  3          |  363953        
---------------------------------------------------------------------------------
 ...... |  ...          |  ...       |  ....      |  ...        |  ......        
---------------------------------------------------------------------------------
 100206 |  10           |  86        |  1024      |  2          |  386342        
---------------------------------------------------------------------------------
 100207 |  9            |  90        |  1015      |  1          |  391750        
---------------------------------------------------------------------------------
 100208 |  13           |  90        |  1005      |  2          |  396392        
---------------------------------------------------------------------------------
 100209 |  12           |  89        |  1008      |  2          |  400157        
---------------------------------------------------------------------------------
 100210 |  11           |  92        |  1007      |  3          |  403012        
---------------------------------------------------------------------------------
 100211 |  6            |  86        |  1012      |  2          |  404984        
---------------------------------------------------------------------------------
 100212 |  6            |  61        |  1010      |  3          |  406135        
---------------------------------------------------------------------------------
 100213 |  7            |  57        |  1010      |  2          |  406542        
---------------------------------------------------------------------------------

My actual table has more columns and thousands of rows, but hopefully this is enough to make my point clear. I'm not going to attempt to order these values as I did in my previous example, because I would probably get it wrong, but I basically need to run two types of queries on this data:

  1. show me all dates, ordered by the resemblance of the conditions provided by me
  2. show me all dates, ordered by the resemblance of the conditions observed in date X

I understand that the second query can easily be achieved by using the first one, but my problem lies in sorting by resemblance across several columns; that's what I meant by "ordered by closeness of all values as a whole". If I were dealing with only one column it would be a lot easier to order by likeness, but my head really starts spinning when dealing with several columns.

The goal is to be able to produce results like this:

Today is really similar to d/m/yy, very similar to d/m/yy, somewhat similar to d/m/yy, ...

In my case I'm working with weather and atmospheric data, but if it helps you can think of this problem in terms of employees (having attendance, hoursPerWeek and monthlySalary columns) and order by the employees that most closely resemble attendance = 100%, hoursPerWeek = 40 and monthlySalary = $5000, for instance.

PS: Now that I've given this employees example, I'm not really sure anymore whether it can be compared to the weather example I'm working with: with the employees table you can compute a rating (rating = monthlySalary / hoursPerWeek * attendance, for instance) and thereby weight the columns, but I don't think the same can be done with the weather table. Input is appreciated anyhow.

PS2: I'm not sure if I expressed myself well enough; if you still have doubts, please let me know.


Bounty

Some good suggestions so far, however none of them truly solve my problem. I'm setting up a bounty to hopefully gather even more possible solutions to this problem. Thanks.

Alix Axel
  • There may be a language barrier issue; can you explain what you mean by `all values as a whole`, and give more examples with other ranges like 100-1000? – Pentium10 Feb 13 '10 at 11:50
  • @Pentium10: I've updated my question to address your comment, please check it again, thanks. =) – Alix Axel Feb 13 '10 at 13:06
  • Thank you, you did a great job. As you mentioned, rating weather data is kinda hard. See my updated answer below. – Pentium10 Feb 13 '10 at 13:55

4 Answers


The problem you seem to have is that each column has a different scale, so you can't easily combine them. This can be solved using a technique called whitening (essentially per-column standardization). It involves calculating the mean and standard deviation of each column (you could do this in one SQL statement) and then rescaling each column when selecting:

colSortPos = (colValue-colMean) / colStdev

Doing this gives you columns centred around 0, where values within one standard deviation of the mean fall in the range ±1. The trick then is to combine these so that similar dates come out together. The catch is that this is not a two-dimensional problem, so you need to think multidimensionally. My suggestion is to take the Euclidean distance as your sort order:

SELECT
    date,
    i,
    j,
    k,
    SQRT( POW((i-@iMean)/@iStdDev, 2) + POW((j-@jMean)/@jStdDev, 2) + POW((k-@kMean)/@kStdDev, 2) )
AS
    sort_order
FROM
    `table`
ORDER BY
    sort_order

The only problem with this is that it projects your problem onto a one-dimensional space, which may make you miss some correlations. To work around this, I suggest using a clustering technique like k-means, which is pretty simple to implement and really fast. It will let you group your dates into k clusters that display the most similarity [ http://en.wikipedia.org/wiki/K-means_clustering ]. If you have the raw data and want to play around with these (and other) techniques, I suggest trying the Weka toolkit [ http://www.cs.waikato.ac.nz/ml/weka/ ].
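For illustration, the whitening-plus-distance idea can be sketched in plain Python, using the i/j/k sample values from the question (note: to rank rows against a specific target, you measure each row's distance to the whitened target, not to the mean):

```python
import math

# Sample rows from the question: (date, i, j, k)
rows = [
    ("100209", 1, 2, 3),
    ("100210", 2, 3, 4),
    ("100211", 0, 1, 2),
    ("100212", 1, 2, 3),
    ("100213", 6, 5, 4),
]
target = (1, 2, 3)  # the conditions we want to match

# Per-column mean and standard deviation (the "whitening" step)
cols = list(zip(*[r[1:] for r in rows]))
means = [sum(c) / len(c) for c in cols]
stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
        for c, m in zip(cols, means)]

def whiten(values):
    """Rescale raw column values to (value - mean) / stdev."""
    return [(v - m) / s for v, m, s in zip(values, means, stds)]

# Euclidean distance between a whitened row and the whitened target
t = whiten(target)
def distance(row):
    w = whiten(row[1:])
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(w, t)))

ranked = sorted(rows, key=distance)
print([r[0] for r in ranked])  # dates ordered by similarity to the target
```

On this sample the ranking reproduces the order the question asks for, because whitening stops any one column's scale from dominating the distance.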

Neel
  • I think I made a mistake in my SQL, the Euclidean distance I stated would only give you closest to the mean, if you want to compare with a particular row then you'd need the distance to that: SQRT( POW(@iTarget - ((i-@iMean)/@iStdDEv), 2) + POW(@jTarget - ((j-@jMean)/@jStdDEv), 2) + POW(@kTarget - (k-@kMean)/@kStdDEv, 2) ) Of course the whitened values can be pre-calculated in the table and then you could make this calc easier. – Neel Feb 15 '10 at 17:58
  • Hi Neel, really interesting answer - I am wondering why each result is squared and then the total square-rooted? Also, is it necessary to use the mean at all? I tried solving it by using ((@iTarget - i)/@iStdDev) instead, giving you the measure of distance between them in the number of standard deviations. – MrCraze Feb 18 '10 at 10:46
  • You want to do this to get the distance from the point you're interested in to the point you're at. Think about a right angled triangle, to get the distance from (x1,y1) to (x2,y2) you find the hypotenuse which is sqrt((x1-x2)^2 + (y1-y2)^2). In 3D (x,y,z) this becomes sqrt((x1-x2)^2 + (y1-y2)^2 + (z1-z2)^2). Think of your columns as axes on a graph, the points that are most similar would be spacially closer together. The more columns you have the more dimensions you need to consider. Your intuition to do ((@iTarget - i)/@iStdDev) is correct as this will measure the distance to some target. – Neel Feb 18 '10 at 13:02
  • Like MrCraze said this is a really interesting answer, I'll try it in a couple of hours and post my feedback here - regarding Weka: it seems like a really nice project (I actually bought the "Data Mining - Practical Machine Learning Tools and Techniques" book by the same authors a while back) but I think I lack the math skills to fully understand how to use it. – Alix Axel Feb 18 '10 at 23:38
  • @Alix Will be interested in the results, I have some ideas on how you could do this in a precomputed way to allow you to do arbitrary sorts but will hold back to see if this works out for you. – Neel Feb 19 '10 at 13:01

I think it's possible. Try using a function that adds up the difference. For instance:

SELECT *, ABS(i - @i) + ABS(j - @j) + ABS(k - @k) + ABS(x - @x) + ABS(y - @y) + ABS(z - @z) AS diff FROM `table`

This gives you a fuzzy difference value: the lower, the closer the match. You should also be able to sort by this expression, i.e. ORDER BY ABS(i - @i) + ... ASC.

A couple of suggestions: perhaps select the results into a temp table and sort on that.

Alternatively, if you are using SQL Server, look into using the .NET CLR and doing it in a .NET assembly - this will give you more power and options. It will also allow you to attach extra weight to certain fields more easily (if you meant that some fields should have additional weighting).

Hope this helps, Steve
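A runnable sketch of this sum-of-absolute-differences ordering, using SQLite from Python with the i/j/k sample values from the question (the table name is made up, and SQLite's `?` placeholders stand in for the MySQL `@` variables):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE conditions (date TEXT, i REAL, j REAL, k REAL)")
conn.executemany(
    "INSERT INTO conditions VALUES (?, ?, ?, ?)",
    [("100209", 1, 2, 3), ("100210", 2, 3, 4),
     ("100211", 0, 1, 2), ("100212", 1, 2, 3), ("100213", 6, 5, 4)],
)

# Sum of absolute differences: the lower the total, the closer the match.
query = """
    SELECT date, ABS(i - ?) + ABS(j - ?) + ABS(k - ?) AS diff
    FROM conditions
    ORDER BY diff ASC
"""
for date, diff in conn.execute(query, (1, 2, 3)):
    print(date, diff)
```

Note that without some per-column normalization (e.g. dividing each term by that column's standard deviation, as discussed in the comments), a large-magnitude column like `moonDistance` would dominate the sum.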

MrCraze
  • This query just returns NULL for me. He tagged MySQL. – Pentium10 Feb 13 '10 at 11:59
  • Hi, sorry - yes, he did. So CLR is out. However, MySQL also has the ABS function. Remember, if any of the values are NULL, the whole expression will be NULL. So, if you don't want the NULL values to count: COALESCE(ABS(i-@i),0) + ... for each expression - then NULLs will be treated as 0s. Alternatively, you can put a higher number in there to ensure NULLs decrease the relevance - e.g. COALESCE(ABS(i-@i),10)... depending on what you want to achieve. I assumed there would be no NULLs in this data - sorry. – MrCraze Feb 13 '10 at 12:11
  • What is ABS(i-@i) supposed to do? I don't get that @ operator there. – Pentium10 Feb 13 '10 at 12:14
  • @Pentium10: @i is a variable you can define. – Alix Axel Feb 13 '10 at 13:07
  • @MrCraze: +1, However using your solution wouldn't the `moonDistance` values pretty much make all the other values irrelevant? Also, you mentioned weighting the columns, could you give me a suggestion about which factors to use in my weather data example? Thanks. – Alix Axel Feb 13 '10 at 13:27
  • Hi Alix, thanks for the positive feedback. As for weighting, maybe you consider temperature the most important factor, so it could be COALESCE(ABS(i-@i)) * 10, which would mean differences in temperature give a higher number and therefore less relevance. The opposite could be done for, say, pressure - perhaps divide the difference by 10, or whatever is most fitting. Similar logic could be used to reduce the relevance of moonDistance. A weather analyst can best decide that :) hope this helps! – MrCraze Feb 13 '10 at 14:51
  • @MrCraze: I was hoping for a solution that could analyze values as a whole since I've no idea what constitutes greater importance. – Alix Axel Feb 13 '10 at 22:40
  • Hi Alix, I think Pentium10 is on the right track. I would first select the standard deviation of all the values into variables to speed up the query, then divide the difference between the current date's value and the query value by the standard deviation, adding these together for each element. E.g. SELECT @IDev = STDDEV(i), @JDev = STDDEV(j)... then SELECT *, (ABS(i-@i)/@IDev) + (ABS(j-@j)/@JDev)... FROM table ORDER BY (ABS(i-@i)/@IDev) + (ABS(j-@j)/@JDev)... This adds up the number of standard deviations each value is from the specified parameter (i.e. today) - giving better weighting. – MrCraze Feb 14 '10 at 12:36

I'm not sure if this is relevant or not to the question, but some values (i, j, k) mean more is better while other values (x, y, z) mean the opposite: less is better.

Since you are dealing with dates, you may want to derive new dates based on these criteria.
For example, for a value with a higher effect you add a couple of hours/days/weeks/months to your date, and for a value with a lower effect you add just seconds/minutes. Then you sort based on this computed new date.
You can subtract from dates (negative compensation) if a value goes past a peak level.

Check out Fuzzy SQL (FSQL)

EDIT 1

Try this: std(i)*abs(i-@i) ...

SELECT 
  (select std(i) from fuzzysearch)*abs(i-7) as v1,
  (select std(j) from fuzzysearch)*abs(j-90) as v2,
  f.* 
FROM fuzzysearch as f 
order by 
  v1 asc, 
  v2 desc
Pentium10
  • Didn't know MySQL had a STD() built-in function, thanks. Two questions: **1)** wouldn't your query be more favorable (in terms of sorting) to the `i` (`v1`) column? **2)** Why do you sort `v1` ASC and `v2` DESC? – Alix Axel Feb 13 '10 at 14:05
  • I used the computed value because it differs if, for example, you pass 10 instead of `i`. And with `asc` and `desc` I wanted to emphasize that some columns sort in a different order: if temp is equal, a higher humidity is better. MySQL also has `VARIANCE()`. – Pentium10 Feb 13 '10 at 14:15
  • @Pentium10: Thank you, regarding the first question I asked I was not referring to the use of the computed values, but to the order of the `ORDER BY` clause itself. Shouldn't MySQL order by `v1` primarily and only order by `v2` if there are two or more `v1` values with the same "ranking" / order? – Alix Axel Feb 13 '10 at 14:25
  • MySQL orders by v1 primarily and only orders by v2 when two or more v1 values have the same ranking. That's true. `order by v1 asc, v2 desc` – Pentium10 Feb 13 '10 at 14:30
  • @Pentium10: In that case this will fail to order by all values as a whole. =\ – Alix Axel Feb 13 '10 at 14:39
  • Yeah, it will. However, I think it helps you get some answers until you find a working formula to compute a value for all columns. – Pentium10 Feb 13 '10 at 15:25

This is hard (if not impossible) to do in SQL alone, but there's a multivariate analysis technique called factor analysis. It's a way of creating a "factor": a linear combination of your variables that puts a weight on each input to produce a "factor score". Basically, it performs a series of linear-algebraic manipulations on your data to create a single set of scores that minimizes some objective (like total error).
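As a rough, hypothetical sketch of the idea (plain NumPy, with made-up observations, and an unrotated first principal component rather than a full rotated factor analysis): standardize the columns, take the leading eigenvector of the correlation matrix, and use each row's projection onto it as a single similarity score.

```python
import numpy as np

# Hypothetical hourly observations: wind, air temp, dewpoint, pressure
data = np.array([
    [3.0, 12.0, 9.0, 1012.2],
    [5.0, 17.0, 11.0, 1011.9],
    [7.0, 20.0, 12.0, 1009.7],
    [8.0, 25.0, 15.0, 1009.5],
    [14.0, 28.0, 23.0, 1009.6],
])

# Standardize each column (zero mean, unit variance)
z = (data - data.mean(axis=0)) / data.std(axis=0)

# Leading eigenvector of the correlation matrix = first principal component
corr = np.corrcoef(z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
pc1 = eigvecs[:, np.argmax(eigvals)]

# One score per row; rows with close scores have similar overall conditions
scores = z @ pc1
order = np.argsort(scores)
print(scores.round(2), order)
```

Sorting rows by `scores` collapses all the columns into a single "similarity axis", which is exactly the one-number-per-date ranking the question asks for (at the cost of the projection losing some information, as noted in the first answer).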

I did a factor analysis on 3 days of hourly weather data and it looks pretty good. You can see that entries with similar factor scores are generally very close in values across all four measures. I used a principal component extraction with an Equimax rotation:

Wind    Air    Dewpoint    BP         Score
-------------------------------------------
3       12     9           1012.2     -2.72
5       17     11          1011.9     -1.77
5       16     10          1010.8     -1.75
6       15     10          1010.4     -1.68
3       19     13          1012.1     -1.57
5       17     11          1010.8     -1.54
7       17     12          1012.0     -1.53
8       19     12          1012.0     -1.24
5       20     14          1012.0     -1.18
7       17     12          1009.8     -1.06
9       20     13          1012.5     -1.05
8       21     12          1012.1     -1.05
7       19     11          1009.9     -0.98
6       18     13          1009.6     -0.90
0       24     17          1012.1     -0.90
8       20     12          1010.8     -0.88
5       22     15          1012.0     -0.87
10      21     12          1012.0     -0.86
8       21     12          1010.8     -0.78
9       19     12          1010.2     -0.78
8       23     10          1010.7     -0.76
6       23     15          1012.3     -0.75
7       20     12          1009.7     -0.73
7       20     12          1009.6     -0.71
10      22     14          1011.5     -0.45
7       19     15          1009.3     -0.45
6       20     16          1009.7     -0.41
7       20     15          1009.4     -0.37
10      24     12          1010.6     -0.26
5       26     18          1012.5     -0.26
9       23     15          1011.0     -0.22
12      24     16          1012.3     -0.04
6       25     16          1009.5      0.15
5       28     20          1012.5      0.16
0       28     17          1009.0      0.16
8       24     17          1010.2      0.17
3       30     20          1012.5      0.19
12      23     16          1010.6      0.21
8       25     15          1009.5      0.21
13      25     13          1010.5      0.22
13      25     14          1010.6      0.30
5       25     20          1010.1      0.35
6       25     19          1009.9      0.37
9       23     18          1009.5      0.40
13      25     15          1010.3      0.46
5       30     21          1012.4      0.48
7       26     19          1010.0      0.54
12      27     16          1010.9      0.56
8       24     20          1009.7      0.58
7       27     17          1009.2      0.60
7       27     18          1009.6      0.62
7       27     17          1009.0      0.64
8       26     21          1010.6      0.70
9       28     17          1009.8      0.75
8       25     22          1010.0      0.82
8       26     23          1010.8      0.86
8       25     22          1009.8      0.87
8       25     23          1010.1      0.91
9       26     22          1010.5      0.91
8       26     22          1009.8      0.97
14      29     17          1010.8      1.06
12      26     22          1010.8      1.09
10      31     18          1010.3      1.14
14      30     18          1010.7      1.28
17      29     17          1010.9      1.29
10      27     22          1009.4      1.32
12      26     22          1009.7      1.32
8       27     24          1009.3      1.38
14      27     22          1010.2      1.49
12      28     22          1009.8      1.51
16      31     19          1010.7      1.66
14      28     23          1009.6      1.82
Grembo
  • Seems interesting and may well be worth trying. Can you point me to some more info on the factor analysis method you mentioned? – Alix Axel Feb 18 '10 at 23:40