
I am trying to apply the Apriori and FP-Growth algorithms to some characterisation data that I have. The data are already binarised and consist of 1's (passes), 0's (fails) and Null values.

I would like to check whether my preprocessing pipeline is good enough in practice. I have already removed every row and column that consists ENTIRELY of Null values, but I am still left with some Null values.
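For illustration, here is a minimal sketch of that removal step. It is not my actual code: the data are made up, plain Python lists stand in for the dataset with `None` playing the role of Null, and `drop_all_null` is a hypothetical helper name.

```python
# Toy data (made up): 1 = pass, 0 = fail, None = Null (not measured).
rows = [
    [1, None, 0, None],
    [None, None, None, None],  # entirely Null row
    [1, None, 1, 0],
    [0, None, None, 1],
]

def drop_all_null(table):
    """Remove rows, then columns, that contain only None."""
    kept_rows = [r for r in table if any(v is not None for v in r)]
    if not kept_rows:
        return []
    keep_cols = [
        j for j in range(len(kept_rows[0]))
        if any(r[j] is not None for r in kept_rows)
    ]
    return [[r[j] for j in keep_cols] for r in kept_rows]

cleaned = drop_all_null(rows)
for r in cleaned:
    print(r)
```

The second row and the second column disappear, but the remaining rows still contain scattered `None` entries, which is the situation I am asking about.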

I was thinking of applying categorical PCA to reduce the size of the dataset even further, but I believe that wouldn't be good practice: it requires imputing the missing values with something else, and I don't want that because it would affect the final results.

So what I am actually doing to handle the Null values is filling them with 0. I do this because the algorithms above measure how frequently items occur in a database, and I assume the 1's are the data points that contribute to that frequency count. Hence, everything else should be 0.

But I am still not sure this is good practice, because filling the Null values with 0 (a failure) treats them as if they had been measured.
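To make the concern concrete, here is a small sketch with made-up data that compares the support of an itemset under two conventions: treating every Null as 0, and counting only the rows where all items in the itemset were actually measured. Both function names are hypothetical, and the second convention is just one possible alternative that I am assuming for comparison, not something Apriori or FP-Growth does by default.

```python
def support_null_as_zero(table, cols):
    """Support of the itemset `cols` when every None is treated as 0."""
    hits = sum(all(row[c] == 1 for c in cols) for row in table)
    return hits / len(table)

def support_measured_only(table, cols):
    """Support over rows where every column in `cols` was measured."""
    measured = [row for row in table if all(row[c] is not None for c in cols)]
    if not measured:
        return None  # the itemset was never fully measured
    hits = sum(all(row[c] == 1 for c in cols) for row in measured)
    return hits / len(measured)

# Toy data (made up): two items, None = Null (not measured).
table = [
    [1, 1],
    [1, None],
    [1, 1],
    [0, None],
]

s_zero = support_null_as_zero(table, [0, 1])
s_measured = support_measured_only(table, [0, 1])
print(s_zero, s_measured)
```

In this toy case the two conventions give quite different supports for the same itemset, so the choice could change which itemsets pass a minimum-support threshold.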

Any help on whether I am tackling the problem correctly, or whether I should try something else, would be very much appreciated. :)
