
I have to aggregate a quite big data table containing some continuous variables (with a categorical break variable, of course), computing the mean, median, standard deviation and interquartile range (IQR) of the required variables.

The first three are easy with the SPSS AGGREGATE command, but I have no idea how to compute the IQR while aggregating the data table.
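For the record, the easy part looks something like this (group and X1 stand in for my real variable names; AGGREGATE has MEAN, MEDIAN and SD functions, but nothing for percentiles):

dataset declare agg.
aggregate outfile = agg
 /break = group
 /X1_mean = mean(X1)
 /X1_median = median(X1)
 /X1_sd = sd(X1).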

I know I could compute the IQR using Descriptives (via quartiles), but as I need the calculations as part of the aggregation, this is not an option. Unfortunately, using R fails as well, thanks to some odd circumstances: I am not able to load the huge comma-separated file into R with base::read.table, sqldf, bigmemory or the ff package.

Any idea is welcome! And of course: thank you in advance.


P.S.: I thought about estimating the IQR from the standard deviation (for a normal distribution the IQR is about 1.35 times the SD), but that method would not work, as the distributions are skewed, so the normality assumption does not hold.

P.P.S.: do you think using R within SPSS would run into the same memory problems as opening the dataset in pure R?

daroczig
  • do you have access to a traditional database? If so, you could split your data at the median into two groups, then find the median of the high and low groups for the 75th and 25th percentiles, which would give you the IQR. – Chase Apr 07 '11 at 23:27
  • Thank you @Chase! I will investigate your suggestion. Though I do not get what you mean by 'traditional database', I fear that splitting the file is not trivial. I would like to aggregate the file by a variable with about 2000 categories, so the resulting matrix will have about 2000 cases with all the aggregated variables. Cutting the file in two and computing medians in both groups would require computing the median of the continuous variables within every category of the break variable, splitting the file into 2 × 2000 files and computing the medians again. Seems tricky, but I might be missing the point :) – daroczig Apr 07 '11 at 23:45
  • sorry, I should have said relational database, i.e. Microsoft SQL or MySQL, etc. Most relational databases allow you to read in flat files directly. However, after reading your last comment, I don't think that's the best approach after all. – Chase Apr 08 '11 at 00:39
  • Check out the OMS system. You can run the commands to produce the statistics while exporting the statistics table to a dataset. – Andy W Apr 08 '11 at 00:41
  • Did you specify colClasses to read.table? Specifying nrows might also help. Both help reduce the memory overhead needed to load the data. It also loads much faster! – Tommy Apr 08 '11 at 05:32
  • @Tommy: of course. The problem is that I have to load a 4,000,000 × 150 matrix from CSV, which does not seem to fit in 4 GB of RAM. The packages (specified above) failed to load the CSV with a segfault, which is why I am trying to do this in SPSS. – daroczig Apr 08 '11 at 08:12
  • @Chase: I get the point now, thank you! I have been thinking about this one, but I am not sure aggregating in MySQL would be fast. But I will try, thank you again. – daroczig Apr 08 '11 at 08:13
  • @Andy W: thank you too, I will check out the OMS system (looks promising!), which is new to me. – daroczig Apr 08 '11 at 08:16

2 Answers


This syntax should do the trick. There is no need to migrate back and forth between SPSS and R solely for this task.

*making fake data, 4 million records and 150 variables.
input program.
loop i = 1 to 4000000.
end case.
end loop.
end file.
end input program.
dataset name Temp.
execute.

vector X(150).
do repeat X = X1 to X150.
compute X = RV.NORMAL(0,1).
end repeat.

*This is the command you are interested in; it puts the stats table into a new dataset.
Dataset declare IQR.
OMS
/SELECT TABLES
/IF SUBTYPES = 'Statistics'
/DESTINATION FORMAT = SAV outfile = 'IQR' VIEWER=NO.
freq var = X1
/format = notable
/ntiles = 4.
OMSEND.

This still takes a long time with such a large dataset, but that's to be expected. Just search the SPSS help files for "OMS" to find example syntax showing how OMS works.


Given the further constraint that you want to calculate the IQR for many groups, there are a few different ways I could see to proceed. One would be to just use the SPLIT FILE command (which requires the file to be sorted by the grouping variable) and run the above frequency command again.

sort cases by group.
split file by group.
freq var = X1 X2
/format = notable
/ntiles = 4.
split file off.

You could also get specific percentiles within ctables (and can do whatever grouping/nesting you want there). Potentially a more useful solution at this point, though, is to make a program that actually saves separate files (or reduces the full dataset to the specific group while still loaded), does the calculation on each separate file, and dumps the results into a dataset. Working with the dataset of 4 million records is a pain, and it does not appear to be necessary if you are just splitting the file up anyway. This could be accomplished via macro commands.
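For reference, a minimal ctables sketch (this requires the Custom Tables add-on; group again stands in for your break variable and needs to be declared categorical):

variable level group (nominal).
ctables
 /table group by X1 [ptile 25, ptile 75].

Since ctables output is a pivot table, the same OMS trick above will capture it as a dataset.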

Andy W
  • As a note, my memory was exceeded when trying to run this on all 150 variables at once, but not when specifying only one variable. – Andy W Apr 08 '11 at 13:04
  • thank you very much, I think I get the point about saving the descriptive statistics in a new dataset - it seems really useful! My only problem is that I cannot figure out how to do this with a break variable (as I would like to aggregate by that one), to get a dataset with these values for every category of the break variable (with around 2000 categories). I see I could do this in a loop (repeating from 1 to ~2000) by filtering the cases for the given category, but it would result in a bunch of files. Do you have an idea how to improve your answer? Thanks again, you definitely deserved an upvote. – daroczig Apr 08 '11 at 13:33
  • sorry for the late reply, I was "playing" quite a lot (besides many other things) with OMS to test this method. While OMS seems really nice, splitting such a big file on a regular basis **is** really resource hungry - the syntax was running for two days for only one variable :) So I will go with @JKP's suggestion, which seems really fast compared to OMS/splitting, so the check-mark goes there. Anyway: thank you *very much* for your really kind help and also for pointing my attention to OMS! – daroczig Apr 13 '11 at 23:12

OMS can capture any pivot table as a dataset, so any statistical results displayed that way can be used as a dataset. Another approach in this case, however, would be to use the RANK command. RANK allows grouping variables, so you can get the rank within group, and it can compute the quartiles and percentiles within group. For example:

RANK VARIABLES=salary (A) BY jobcat minority
 /RANK /NTILES(4) /PERCENT.

Then aggregating with FIRST and the group variables as breaks would give you a dataset of the quartiles by group, from which to compute the IQR.
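A sketch of that pipeline (salary and jobcat are hypothetical names; it assumes every group is large enough that all four quartile bins are populated):

* Assign each case to a within-group quartile (salq takes values 1 to 4).
RANK VARIABLES=salary (A) BY jobcat
 /NTILES(4) INTO salq.

* FIRST depends on case order, so sort ascending within group first
* (MIN would work without the sort).
SORT CASES BY jobcat salary.

* One row per group x quartile; cutpt is the lowest value in that quartile.
DATASET DECLARE quartiles.
AGGREGATE OUTFILE=quartiles
 /PRESORTED
 /BREAK=jobcat salq
 /cutpt=FIRST(salary).

* Within each group the rows run salq = 1 to 4, so on the salq = 4 row
* cutpt approximates the 75th percentile and LAG(cutpt, 2) the 25th.
DATASET ACTIVATE quartiles.
IF (salq = 4) iqr = cutpt - LAG(cutpt, 2).
SELECT IF (salq = 4).

The cut points recovered this way are the smallest observations in each quartile bin, so they approximate, rather than interpolate, the sample percentiles.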

Many ways to skin a cat.

-Jon Peck

JKP
  • thank you very much for this really smart answer, which seems to solve my problem really fast - based on some short experiments. – daroczig Apr 13 '11 at 23:13