4

I'm currently attempting to clean a rather large dataset, however, I have noticed that a significant amount of participants failed to complete the survey, or did not even respond to the first question. Nevertheless, their data was included in my dataset.

Question: Is there a way to filter out participants based on response completion? For example, I'd like to filter out all cases which have failed to provide a response on at least 30% of the total questions.

eli-k
  • 10,898
  • 11
  • 40
  • 44
Olivier
  • 321
  • 2
  • 11

3 Answers3

4

Yes.

First, you'll want to create a new variable that counts the number of missing observations in the data:

COUNT
countmiss = v1 v2 v3 v4 v5 v6  (MISSING).

Then, you'll want to filter out participants who miss a specified number of responses (here, I'm going to filter out people who missed two responses):

USE ALL.
COMPUTE filter_$=(countmiss >= 2).
VARIABLE LABEL filter_$ countmiss >= 2 (FILTER)'.
VALUE LABELS filter_$  0 'Not Selected' 1 'Selected'.
FORMAT filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.

Note that the second step can be achieved via the Data -> Select Cases menu.

maxwelldeux
  • 160
  • 5
  • 1
    Within the `COUNT` command: instead of typing a complete list of variables (`v1 v2 ... vn`) you can also write `v1 TO vn`. – mirirai Feb 29 '16 at 00:18
  • Good point, @mirirai. But that only works if the variables are listed sequentially in the datafile. – maxwelldeux Feb 29 '16 at 16:09
3

@eli-k made an important point if you want to take string variables into account.

In case you have a dataset with a lot of variables and you want to save yourself typing a long list of variable names, you can dynamicly generate a list of numeric variables and one of string variables with help of the Python Plugin (which must be installed).

The following code creates the two macros "!numericvars" and "!stringvars". When called they will be expanded to the according variable list.

BEGIN PROGRAM.
import spss

#create separate strings of numeric and string variables 
numericvars=''
stringvars=''
varcount=spss.GetVariableCount()
for i in xrange(varcount):
  if spss.GetVariableType(i) > 0: 
    stringvars=stringvars + " " + spss.GetVariableName(i)
  else: 
    numericvars=numericvars + " " + spss.GetVariableName(i)

# define macro variables for the numeric and the string variable lists
spss.SetMacroValue("!numericvars", numericvars)
spss.SetMacroValue("!stringvars", stringvars)
END PROGRAM.

(I've taken most of the code from the example of spss.SetMacroValue syntax reference page)

Within the COUNT command you then just have to type the macro names instead of the whole variables lists.

COUNT countmiss = !numericvars (MISSING)
                  !stringvars ("").

Then you can make use of the SELECT IF command as proposed by @eli-k.

SELECT IF countmiss < 20./* pick your best suited limit instead of "20".

This will delete all cases with 20 or more missing answers permanently from your data set.

Or you can use the FILTER BY command as proposed by @maxwelldeux:

COMPUTE filter_$ = (countmiss<20).
FILTER BY filter_$.

In this case data from the respondents with 20 or more missing answers won't be used in program procedures as long as the filter is active, while the data still remains in the data set.

mirirai
  • 1,365
  • 9
  • 25
  • That makes a lot more sense, however, the script isn't working when I attempt to run it on the syntanx. Is there anything I need to download? – Olivier Mar 01 '16 at 17:11
  • Do you have the Python Essentials Plugin installed? Its part of the standard installation routine since SPSS 22. With older versions of SPSS you have to install it separatly I guess. Do you get any error messages? If so, what does it say? – mirirai Mar 01 '16 at 17:41
2

@maxwelldeux is right about using COUNT, but you need to count both missing numeric variables and empty text variables (a text variable isn't necessarily defined as missing when it is empty). So:

COUNT countmiss = numvar1 numvar2 numvar3 numvar4 numvar5 (MISSING)
                  textVar1 textVar2 textVar3 textVar4 textVar5 ("").

At this point you can filter the file as @maxwelldeux suggested, if you want to keep the empty (or nearly empty) rows in the file, just not to include them in the analysis. If you've sure which participants you're not going to include in the file, you could use:

SELECT IF countmiss < 20./* pick your best suited limit instead of "20".

Make sure, though, after SELECTing, to save the file with a new name and use it as a workfile. This way you can always go back to the raw data and run the syntax again with alterations if you need to.

eli-k
  • 10,898
  • 11
  • 40
  • 44