How to equalize the number of rows per unit in an SPSS file

Question

I have a file with a different number of rows for every "unit", and I'd like all the units to have the same number of rows, by adding the right number of empty rows per unit in the data.

For example:

data list list/ unit serial someData.
begin data.
1 1 54
2 1 57
2 2 87
2 3 91
3 1 17
3 2 43
end data.

what i'd like to get to is this:

I've worked with simple workarounds, for example casestovars => varstocases (keeping nulls), or preparing a base file with all the lines with unit names and serials, and then matching it with the data file so I end up with all the lines and all the data.
Could anyone suggest a more direct (\elegant\efficient\simple) approach? Thanks!

Yep those are the ways I would do it as well. `LOOP` and `XSAVE` is another option, although will not be any more efficient. You can likely use `AGGREGATE` and just `VARSTOCASES` to expand the data, but any of these are going to take 5 to 6 lines of code at least I imagine. — Andy W, Feb 20 '17 at 13:54

Jignesh Sutar · Accepted Answer · 2017-02-20T14:19:02.997

Cartesian product is what you require here.

Using your example data and downloading the Custom Extension Command, you can solve as below:

data list list/ unit serial someData.
begin data.
1 1 54
2 1 57
2 2 87
2 3 91
3 1 17
3 2 43
end data.
DATASET NAME ds0.
DATASET ACTIVATE ds0.
STATS CARTPROD VAR1=unit VAR2=serial /SAVE OUTFILE="C:\Temp\dsCart".
SORT CASES BY unit serial.
MATCH FILES FILE=* /BY unit serial /FIRST=Primary.
SELECT IF Primary.
MATCH FILES FILE=* /FILE=ds0  /BY unit serial /DROP=Primary.
EXE.

I'm not sure how efficient this Custom Extension Command is so you may want to experiment with different flavours of using STATS CARTPROD. An alternative approach would be to create two datasets (left and right) with your unique unit and serial values and then process these through the STATS CARTPROD command.

thanks for the reference, indeed I used this successfully - I aggregated to a dataset to it's key values (in my real data it's a combination of four variables), and to another one with just the serial numbers. Then I got the CARTPROD of the two, which created a dataset where all the keys ("units") have the maximum number of lines with relating serial numbers. Matching this with the original file got me exactly where I wanted. 5 lines of syntax. I LIKE!! — eli-k, Feb 21 '17 at 13:00

mirirai · Answer 2 · 2017-02-21T05:03:40.657

You already mentioned it: creating a base file with all the lines with unit names and serials, and then matching it with the data file would be a simple approach. I'd like to outline this one here for other readers.

So for the questions example you would create the base data set like this:

INPUT PROGRAM.
LOOP #i = 1 to 3. /* 3 = maximum value of unit.
   LOOP # = 1 to 3. /* 3 = maximum value of serial.
      COMPUTE unit = #i.
      COMPUTE serial = #j.
      END CASE.
   END LOOP.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME base.
EXECUTE.

The data set will look like this.

unit serial
1    1 
1    2 
1    3 
2    1 
2    2 
2    3
3    1 
3    2
3    3

The following match files command will bring the wanted result.

MATCH FILES 
    /FILE base
    /FILE data1
    /BY unit serial.

If you want the code be more flexible regarding the maximum value of "unit" and "serial" you can make use of the python extension:

BEGIN PROGRAM.
import spss, spssdata

# list of variable names
variables = ["unit", "serial"]

#fetch variable data
data = spssdata.Spssdata(variables).fetchall()

# get maximum of 'unit' and 'serial'
maxunit = max([int(i[0]) for i in data])
maxserial = max([int(i[1]) for i in data])

# create base data set
spss.Submit('''
INPUT PROGRAM.
LOOP #i = 1 to {maxu}.
   LOOP #j = 1 to {maxs}.
      COMPUTE unit = #i.
      COMPUTE serial = #j.
      END CASE.
   END LOOP.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME base.
EXECUTE.
'''.format(maxu=maxunit, maxs=maxserial))
END PROGRAM.

Thanks for drawing this out - It's a great solution for automating the creation of the base file. This version would only work though when the key variables are numeric and consecutive - for nominal base variables you'd need to loop through the actual existing values instead. But if it comes to that, the CARTPROD as @JigneshSutar suggests gets me there very easily. — eli-k, Feb 21 '17 at 13:08

How to equalize the number of rows per unit in an SPSS file

2 Answers2