
I have a dataframe of shape 2701x128 with a lot of missing values. The thing is that some rows have 95% of their data filled and some only 5%. Let me try to visualize it:

X-axis is the row index (after sorting); y-axis is the number of non-zero values in that row (sorted, histogram-like).

[figure: sorted per-row counts of non-missing values]

X-axis is the column index (after sorting); y-axis is how many non-zero values that column has over all rows (sorted, histogram-like).

[figure: sorted per-column counts of non-missing values]

What I need: to impute the data as accurately as I can, because that is the problem I have to solve. The problem: I can't interpolate everything with means, medians, and other statistical moments, because that is very rough. I also can't build a usual learning model, because there is NO structure in the missing data.

Can you please suggest something as accurate as a learning model, something that can model the distribution but can also deal with completely random misses? So, apparently, the main problem is building a dataset out of these unstructured misses. I can't find a solution at the moment.

Ladenkov Vladislav

1 Answer


I think the first problem is treating your data as row-structured. Try to think about it as column-based instead.

There is a Japanese game called Sudoku, and I suggest you follow its strategy.

First of all, find the most-filled (but not 100% filled) column; let's call it the B-column. What is its percentage of missing data? If it is a small part, build a histogram and look at its PDF: maybe a simple mean or median will work it out.
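A minimal sketch of this first step, assuming the data sit in a pandas DataFrame df with NaN marking the misses:

```python
import pandas as pd

# df: the 2701x128 DataFrame, with NaN for the missing cells (assumption).
fill_ratio = df.notna().mean()                   # fraction of observed values per column
b_col = fill_ratio[fill_ratio < 1.0].idxmax()    # most-filled, but not fully filled
print(f"B-column: {b_col}, filled: {fill_ratio[b_col]:.1%}")

# Inspect the distribution before picking an estimator.
df[b_col].hist(bins=50)

# If the histogram is roughly symmetric a mean may do; if it is skewed,
# the median is usually safer:
# df[b_col] = df[b_col].fillna(df[b_col].median())
```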

Is there any 100% filled column? Let's call it a G-column. Try to find out whether any non-fully-filled column is strongly correlated with a filled one. If so, impute the missing values based on this correlation; you can use more than two filled columns with a basic regression (see the sketch below).
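Continuing the sketch above, a regression-based fill that uses the fully observed columns as predictors (LinearRegression is just one reasonable choice here):

```python
from sklearn.linear_model import LinearRegression

g_cols = fill_ratio[fill_ratio == 1.0].index.tolist()   # fully observed "G-columns"

# Check which G-columns actually correlate with the B-column first.
print(df[g_cols + [b_col]].corr()[b_col].sort_values())

# Fit on rows where the B-column is known, predict where it is missing.
known = df[b_col].notna()
model = LinearRegression().fit(df.loc[known, g_cols], df.loc[known, b_col])
df.loc[~known, b_col] = model.predict(df.loc[~known, g_cols])
```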

You can even try to restore one part of the B-column from one set of other non-fully-filled columns and another part from a different set, and you can do that many times.
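If you want this round-robin idea automated, scikit-learn's IterativeImputer does essentially that: it regresses each incomplete column on the others and cycles until convergence. It is marked experimental (hence the extra enable import), so treat this as a sketch rather than the answer's own method:

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, side-effect import
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(max_iter=10, random_state=0)
# Note: a column with no observed values at all would be dropped from the output.
df_imputed = pd.DataFrame(imputer.fit_transform(df),
                          columns=df.columns, index=df.index)
```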

Of course you will end up with a kind of Frankenstein's monster, but it is worth a try, and you can always assess how good the effect was with CV.
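One way to do that assessment, reusing the imputer from the sketch above: hide a random subset of the cells that are actually observed, impute, and score the reconstruction against the hidden truth.

```python
import numpy as np

rng = np.random.default_rng(0)
observed = df.notna().to_numpy()
hidden = observed & (rng.random(df.shape) < 0.1)   # hide 10% of the known cells

df_holdout = df.mask(hidden)                       # NaN wherever we hid a value
restored = pd.DataFrame(imputer.fit_transform(df_holdout),
                        columns=df.columns, index=df.index)

rmse = np.sqrt(((restored.to_numpy()[hidden] - df.to_numpy()[hidden]) ** 2).mean())
print(f"holdout RMSE: {rmse:.4f}")
```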

However, this is just a short sketch.

  • 1. I did not consider it row-structured; I even plotted two graphs: one to show the misses in rows, the other in columns. 2. This iterative algorithm is good; actually, it's my baseline, but from an "error" perspective it is not so good. I mean, for Sudoku it IS going to work well, because the conditions are strictly determined, but in a dataset there is not even one such condition. So, if you interpolate columns/rows, you ignore that the row/column may be a very close neighbour of some other row/column, and you force this observation toward the general distribution. Can you expand on this "Bayesian" approach? – Ladenkov Vladislav Jun 21 '17 at 07:32
  • Well, why not use a Bayesian approach: based on the filled data in the columns you can get a probability for a category (or a probability that a continuous variable falls in a range); then, given the data in another column, you can calculate a posterior probability, and do that over the whole range. Or you can just use Naive Bayes from sklearn in Python for that – Евгений Петросян Jun 21 '17 at 10:52
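A minimal sketch of that last comment, assuming a categorical column with misses and fully observed numeric predictors (the column name cat_col is hypothetical):

```python
from sklearn.naive_bayes import GaussianNB

cat_col = "category"   # hypothetical name of a categorical column with misses
predictors = [c for c in df.columns if c != cat_col and df[c].notna().all()]
known = df[cat_col].notna()

clf = GaussianNB()
clf.fit(df.loc[known, predictors], df.loc[known, cat_col])
df.loc[~known, cat_col] = clf.predict(df.loc[~known, predictors])

# clf.predict_proba(...) gives the full posterior over categories,
# if you prefer a soft assignment to a hard fill.
```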