selecting "similar" groups - where to start with probabilities?

Question

Let's say I have a table with 10.000 lines (representing 10.000 persons) and the following columns:

id    qualification    gender    age    income

When I select all persons having a certain qualification (say "plumber") I get 100 lines, having a certain gender, age and income distribution.

What I now want to do is select some kind of test group to check if the income is influenced by qualification or by the distribution of the other attributes.

That means (and now I come to my question) I want to get another set of 100 lines, having the same gender and age distribution (but a different qualification value). These 100 lines should of course been chosen by random.

My primary problem is that I don't know how to write an SQL command that would take care of the distributions (which of course could and maybe should be seen as probabilities in this context) when I select random lines.

Thank you in advance!

Denis de Bernardy · Accepted Answer · 2011-07-14T01:25:47.960

2

You seem to be trying to solve something that is tightly related to this extremely thorny problem.

The wiki page depicts a number of approaches for detecting correlations in a database, complete with references to prior pg-hacker discussions (here's another), a variety of (rejected) patch proposals, and scientific papers that discusses the topic.

If it sounds too thorny, I'd second Catcall's pl/r suggestion. Or another applicable pl, for that matter.

As an aside, you might find pg-kmeans useful too:

http://pgxn.org/dist/kmeans/doc/kmeans.html

As well as PostStat (never tried it myself):

http://poststat.projects.postgresql.org/

edited Jul 14 '11 at 01:25

answered Jul 14 '11 at 01:19

Denis de Bernardy

75,850
13
131
154

thank you! I think this really get's too sophisticated for a simple test-group check. I'll see if I can go with weights instead. – speendo Jul 14 '11 at 10:49

score 1 · Answer 2 · edited Apr 13 '17 at 12:44

1

Might be better on stats.stackexchange.com.

Selecting random rows is easy; matching the distribution is hard.

You could write a stored procedure that

repeatedly selects 100 rows at random,
calculates the statistics,
and returns when it finds 100 rows that fit.

But that seems a lot like kicking dead whales down the beach. And, depending on your data, it might never return.

Before you spend much time trying to do this in SQL, consider spending a little time to see how hard (or how easy) this is to do with statistical software, like R.

Later

Just discovered that there's a package called pl/R.

PL/R is a loadable procedural language that enables you to write PostgreSQL functions and triggers in the R programming language. PL/R offers most (if not all) of the capabilities a function writer has in the R language.

Google postgresql +statistics +r +pl for additional links to papers and tutorials.

edited Apr 13 '17 at 12:44

Community

1
1

answered Jul 13 '11 at 11:03

Mike Sherrill 'Cat Recall'

91,602
17
122
185

R would be alright (already had to give up on this question and switched to R: http://stackoverflow.com/questions/6425713/smoothing-time-data-can-it-be-done-more-efficient) but as we are talking about several millions of lines, I fear R also isn't adequate. – speendo Jul 13 '11 at 11:54
1

@Marcel: Can you reduce the number of rows for exporting to R or SAS or SPSS by taking a sample from the millions of rows? You'll end up with a 100-row sample of a sample; I don't know whether that invalidates your statistical stuff. – Mike Sherrill 'Cat Recall' Jul 13 '11 at 22:01
should also work but then I could probably also work with weights (http://en.wikipedia.org/wiki/Statistical_weight) instead (have to check that). A solution using SQL, random() and probability distributions would be more elegant (if there is one). – speendo Jul 13 '11 at 23:09
thank you! PL/R looks interesting. Another possible way would be to use PostgreSQL from within R using RPostgreSQL, but I never managed to do this yet. However, I think I'll better have a look at weights instead. – speendo Jul 14 '11 at 11:01

score 0 · Answer 3 · edited Jul 13 '11 at 12:38

0

SELECT * from Table1 order by random() limit 100;

random() is valid for PostgreSql. For MySql you can use RAND() instead of Random()

edited Jul 13 '11 at 12:38

Aziz Shaikh

16,245
11
62
79

answered Jul 13 '11 at 10:50

Muhammad Usama

2,797
1
17
14

1

You, too, seem to have missed this part of the requirements: "having the same gender and age distribution (but a different qualification value)". – Mike Sherrill 'Cat Recall' Jul 13 '11 at 11:04

selecting "similar" groups - where to start with probabilities?

3 Answers3