1

I am currently trying to perform data cleaning on my dataset containing 2K records from online transactions from an online supermarket.

In my dataset, there are a few data quality issues -

1) "?" in my Income columns as shown

Income with "?" in my dataset

May I know how do I clean this data in IBM SPSS Modeler? I tried to use the "Filler" node to replace the "?" but I'm not quite sure on what to write in the expression builder. As you can see, Income is stored as a string in the Filler node due to records with "?".

IBM Spss Modeler Filler node

Is there anyone who knows how to go about replacing/cleaning the Income data as I would like to replace the missing values with the mean of the Income column using the Data Audit Prep node. However, for me to do that, I would need to remove the "?" in order to change the Income type to a continuous data in the Type node.

Type node

2) Missing values for my Cigg column (T/F)

enter image description here

I'm not too sure on how can I go about replacing missing values for the Cigg column as its a boolean. May I know how should I replace data for this as well?

Thank you.

plzhelp
  • 185
  • 1
  • 3
  • 9

3 Answers3

0

The images are blocked to me but you can try:

1) "?" in my Income columns as shown: Use a derive node with a syntax similar to:

replace('?','',Income)

The syntax will replace all symbols '?' for ''.

2) Missing values for my Cigg column (T/F)

What do you want to do with the missing values? You can delete the rows with missing values or use some technique to replace them, but if you have 2k records, maybe it woul be better to just delete those missing rows.

A) In order to delete them, use the filter node with the syntax:

'variable' = '$null$' and select the option discard in the dialog box.

B) To replace the missing values: Use the Derive the node and write a syntax similar to:

if variable = '$null$' then mean(variable) else variable endif (I used here the mean, but try to see the best option for replacing missings for you).

These nodes above are going to create new variables and you have to choose a new name for them and the type of the variable in the dialog box.

Ref from IBM for Derive Node

Hope to have been helpful!

Érica Wong
  • 119
  • 6
0

1.) You could try to use @FIELD = "?" in the condition of your filler node.

2.) It depends what you want to do with the data, depending on usage, it might be best to ignore the column, to replace it with a default value or to ignore the data row.

pandayo
  • 310
  • 2
  • 13
0

For the question marks, you can try Variable = "?", or read the symbol in text editor, then place it in the formula. A workaround is a logical exclusion, like: conditional replacement -> not(Variable >= 0) -> 0

For the empty values, you can use Blanks and Nulls button on the Replace section, or you can use Variable = undef (undef is the Null value in Modeler)

This is all logic within the Filler node.

Julian
  • 154
  • 1
  • 11